
Supervised Dimension Reduction: A Tale of Two Manifolds. S. Mukherjee, K. Mao, F. Liang, Q. Wu, M. Maggioni, D-X. Zhou. Department of Statistical Science, Institute for Genome Sciences & Policy, Department of Computer Science, Department of Mathematics, Duke University. November 4, 2009.

Supervised dimension reduction Information and sufficiency A fundamental idea in statistical thought is to reduce data to relevant information. This was the paradigm of R.A. Fisher (beloved Bayesian) and goes back at least to Adcock 1878 and Edgeworth 1884. For example, $X_1,\dots,X_n$ drawn iid from a Gaussian can be reduced to $(\mu,\sigma^2)$.

Supervised dimension reduction Regression Assume the model $Y = f(X) + \varepsilon$, $\mathbb{E}\varepsilon = 0$, with $X \in \mathcal{X} \subset \mathbb{R}^p$ and $Y \in \mathbb{R}$. Data $D = \{(x_i, y_i)\}_{i=1}^n \overset{iid}{\sim} \rho(X, Y)$.

Supervised dimension reduction Dimension reduction If the data lives in a p-dimensional space $\mathcal{X} \subset \mathbb{R}^p$, replace $X$ with $\Theta(X) \in \mathbb{R}^d$, $d \ll p$. My belief: physical, biological and social systems are inherently low dimensional, and the variation of interest in these systems can be captured by a low-dimensional submanifold.

Supervised dimension reduction Supervised dimension reduction (SDR) Given response variables $Y_1,\dots,Y_n \in \mathbb{R}$ and explanatory variables or covariates $X_1,\dots,X_n \in \mathcal{X} \subset \mathbb{R}^p$ with $Y_i = f(X_i) + \varepsilon_i$, $\varepsilon_i \overset{iid}{\sim} \mathrm{No}(0,\sigma^2)$. Is there a submanifold $\mathcal{S} \equiv \mathcal{S}_{Y|X}$ such that $Y \perp\!\!\!\perp X \mid P_{\mathcal{S}}(X)$?

Supervised dimension reduction Visualization of SDR [Figure: four panels — (a) Data, (b) Diffusion map, (c) GOP, (d) GDM — showing the data and three two-dimensional embeddings.]

Supervised dimension reduction Linear projections capture nonlinear manifolds In this talk $P_{\mathcal{S}}(X) = B^T X$ where $B = (b_1,\dots,b_d)$. Semiparametric model $Y_i = f(X_i) + \varepsilon_i = g(b_1^T X_i,\dots,b_d^T X_i) + \varepsilon_i$, where $\mathrm{span}\,B$ is the dimension reduction (d.r.) subspace.

Learning gradients SDR model Semiparametric model $Y_i = f(X_i) + \varepsilon_i = g(b_1^T X_i,\dots,b_d^T X_i) + \varepsilon_i$, where $\mathrm{span}\,B$ is the dimension reduction (d.r.) subspace. Assume the marginal distribution $\rho_X$ is concentrated on a manifold $\mathcal{M} \subset \mathbb{R}^p$ of dimension $d \ll p$.

Learning gradients Gradients and outer products Given a smooth function $f$, the gradient is $\nabla f(x) = \left(\frac{\partial f(x)}{\partial x_1},\dots,\frac{\partial f(x)}{\partial x_p}\right)^T$. Define the gradient outer product (GOP) matrix $\Gamma$ by $\Gamma_{ij} = \int_{\mathcal{X}} \frac{\partial f}{\partial x_i}(x)\,\frac{\partial f}{\partial x_j}(x)\,d\rho_X(x)$, that is, $\Gamma = \mathbb{E}\left[(\nabla f)(\nabla f)^T\right]$.
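
To make the GOP concrete, here is a small numerical sketch (not from the talk): for a toy single-index function $f(x) = \sin(b^T x)$ with known gradient, a Monte Carlo estimate of $\Gamma = \mathbb{E}[(\nabla f)(\nabla f)^T]$ is rank one and its top eigenvector recovers the direction $b$. The function, direction, and sample size are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 10
b = np.zeros(p); b[0] = 1.0                       # true d.r. direction (illustrative)

# f(x) = sin(b^T x)  =>  grad f(x) = cos(b^T x) * b
X = rng.normal(size=(5000, p))                    # samples from rho_X
grads = np.cos(X @ b)[:, None] * b[None, :]       # analytic gradient at each sample
Gamma = grads.T @ grads / len(X)                  # Monte Carlo estimate of E[grad f grad f^T]

top_eigvec = np.linalg.eigh(Gamma)[1][:, -1]      # eigenvector of the largest eigenvalue
print("alignment with b:", abs(top_eigvec @ b))   # close to 1
```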

Learning gradients GOP captures the d.r. space Suppose $y = f(X) + \varepsilon = g(b_1^T X,\dots,b_d^T X) + \varepsilon$. Then for $B = (b_1,\dots,b_d)$ the $b_i$ are eigenvectors of $\Gamma$: $\Gamma b_i = \lambda_i b_i$. For $i = 1,\dots,d$ the directional derivative $\frac{\partial f(x)}{\partial b_i} = b_i^T \nabla f(x)$ is not identically zero, so $b_i^T \Gamma b_i \neq 0$. If $w \perp b_i$ for all $i$ then $w^T \Gamma w = 0$.

Learning gradients Statistical interpretation Linear case: $y = \beta^T x + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0,\sigma^2)$. Define $\Omega = \mathrm{cov}(\mathbb{E}[X \mid Y])$, $\Sigma_X = \mathrm{cov}(X)$, $\sigma_Y^2 = \mathrm{var}(Y)$. Then $\Gamma = \sigma_Y^2 \left(1 - \frac{\sigma^2}{\sigma_Y^2}\right)^2 \Sigma_X^{-1}\,\Omega\,\Sigma_X^{-1}$.
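
As a sanity check on this interpretation (my own sketch, not the talk's code), one can estimate $\Omega = \mathrm{cov}(\mathbb{E}[X\mid Y])$ by slicing $Y$, as in sliced inverse regression, and verify that $\Sigma_X^{-1}\hat\Omega\Sigma_X^{-1}$ points in the same direction as $\Gamma = \beta\beta^T$; the check below looks only at the span, not the constant in front.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 20000, 6, 0.5
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))                       # here Sigma_X = I for simplicity
y = X @ beta + sigma * rng.normal(size=n)

# estimate Omega = cov(E[X | Y]) by slicing Y (sliced-inverse-regression style)
slices = np.array_split(np.argsort(y), 20)
slice_means = np.array([X[idx].mean(axis=0) for idx in slices])
Omega_hat = np.cov(slice_means.T, bias=True)

Sigma_inv = np.linalg.inv(np.cov(X.T))
M = Sigma_inv @ Omega_hat @ Sigma_inv             # should span the same direction as beta
direction = np.linalg.eigh(M)[1][:, -1]
print("alignment with beta:", abs(direction @ beta) / np.linalg.norm(beta))
```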

Learning gradients Statistical interpretation For smooth $f(x)$: $y = f(x) + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0,\sigma^2)$. Here the meaning of $\Omega = \mathrm{cov}(\mathbb{E}[X \mid Y])$ is not so clear.

Learning gradients Nonlinear case Partition $\mathcal{X} = \bigcup_{i=1}^I \chi_i$ into sections and compute local quantities: $\Omega_i = \mathrm{cov}(\mathbb{E}[X_{\chi_i} \mid Y_{\chi_i}])$, $\Sigma_i = \mathrm{cov}(X_{\chi_i})$, $\sigma_i^2 = \mathrm{var}(Y_{\chi_i})$, $m_i = \rho_X(\chi_i)$. Then $\Gamma \approx \sum_{i=1}^I m_i\,\sigma_i^2\,\Sigma_i^{-1}\,\Omega_i\,\Sigma_i^{-1}$.

Learning gradients Estimating the gradient Taylor expansion: $y_i \approx f(x_i) \approx f(x_j) + \langle \nabla f(x_j), x_i - x_j\rangle \approx y_j + \langle \nabla f(x_j), x_i - x_j\rangle$ if $x_i \approx x_j$. Let $\vec{f} \approx \nabla f$; then the following should be small: $\sum_{i,j} w_{ij}\,\bigl(y_i - y_j - \langle \vec{f}(x_j), x_i - x_j\rangle\bigr)^2$, where $w_{ij} = \frac{1}{s^{p+2}} \exp\bigl(-\|x_i - x_j\|^2 / 2s^2\bigr)$ enforces $x_i \approx x_j$.

Learning gradients Estimating the gradient The gradient estimate is $\vec{f}_D = \arg\min_{\vec{f} \in \mathcal{H}_K^p} \left[\frac{1}{n^2} \sum_{i,j=1}^n w_{ij}\,\bigl(y_i - y_j - \vec{f}(x_j)^T (x_i - x_j)\bigr)^2 + \lambda \|\vec{f}\|_K^2\right]$, where $\|\vec{f}\|_K$ is a smoothness penalty, the reproducing kernel Hilbert space norm.
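
The sketch below writes the estimator in the representer form $\nabla f(x) \approx \sum_i c_i K(x_i, x)$ and minimizes the stated objective with a generic numerical optimizer. It is a brute-force illustration for small $n$, not the closed-form linear-system solution used in practice; the Gaussian kernel and the helper name learn_gradient are my choices.

```python
import numpy as np
from scipy.optimize import minimize

def learn_gradient(X, y, s=1.0, lam=0.1):
    """Brute-force RKHS gradient estimate; returns (p x n coefficients, Gram matrix)."""
    n, p = X.shape
    D = X[:, None, :] - X[None, :, :]              # D[i, j] = x_i - x_j
    sqd = (D ** 2).sum(-1)
    W = np.exp(-sqd / (2 * s ** 2)) / s ** (p + 2) # locality weights w_ij
    K = np.exp(-sqd / (2 * s ** 2))                # Gaussian kernel Gram matrix

    def objective(c_flat):
        C = c_flat.reshape(p, n)
        F = C @ K                                  # column j: gradient estimate at x_j
        inner = np.einsum('kj,ijk->ij', F, D)      # <f(x_j), x_i - x_j>
        resid = y[:, None] - y[None, :] - inner
        fit = (W * resid ** 2).sum() / n ** 2
        penalty = lam * np.trace(C @ K @ C.T)      # RKHS norm of the vector field
        return fit + penalty

    res = minimize(objective, np.zeros(p * n), method='L-BFGS-B')
    return res.x.reshape(p, n), K
```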

Learning gradients Computational efficiency The computation requires fewer than $n^2$ parameters and is $O(n^6)$ time and $O(pn)$ memory: $\vec{f}_D(x) = \sum_{i=1}^n c_{i,D}\,K(x_i, x)$, with coefficient matrix $c_D = (c_{1,D},\dots,c_{n,D}) \in \mathbb{R}^{p \times n}$. Define the Gram matrix $K$ with $K_{ij} = K(x_i, x_j)$; then $\hat{\Gamma} = c_D K c_D^T$.
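
Continuing that sketch, the estimated GOP matrix follows from $\hat\Gamma = c_D K c_D^T$ and its top eigenvectors give the estimated d.r. directions; the toy single-index data below is illustrative and learn_gradient is the hypothetical helper defined above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, d = 60, 5, 1
b_true = np.zeros(p); b_true[0] = 1.0                  # illustrative single-index data
X = rng.normal(size=(n, p))
y = np.sin(X @ b_true) + 0.1 * rng.normal(size=n)

C, K = learn_gradient(X, y, s=2.0, lam=0.1)            # sketch from the previous block
Gamma_hat = C @ K @ C.T                                # p x p estimate of the GOP
B_hat = np.linalg.eigh(Gamma_hat)[1][:, ::-1][:, :d]   # top-d eigenvectors: d.r. space
print("alignment with b:", abs(B_hat[:, 0] @ b_true))
```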

Learning gradients Estimates on manifolds The marginal distribution $\rho_X$ is concentrated on a compact d-dimensional Riemannian manifold $\mathcal{M}$ with isometric embedding $\varphi: \mathcal{M} \to \mathbb{R}^p$ and metric $d_{\mathcal{M}}$; $d\mu$ is the uniform measure on $\mathcal{M}$. Assume a regular distribution: (i) the density $\nu(x) = \frac{d\rho_X(x)}{d\mu}$ exists and is Hölder continuous, i.e. there are $c_1 > 0$ and $0 < \theta \le 1$ with $|\nu(x) - \nu(u)| \le c_1\,d_{\mathcal{M}}^{\theta}(x, u)$ for all $x, u \in \mathcal{M}$; (ii) the measure along the boundary is small: there is $c_2 > 0$ with $\rho_{\mathcal{M}}\bigl(\{x \in \mathcal{M} : d_{\mathcal{M}}(x, \partial\mathcal{M}) \le t\}\bigr) \le c_2 t$ for all $t > 0$.

Learning gradients Convergence to gradient on manifold Theorem. Under the above regularity conditions on $\rho_X$ and for $f \in C^2(\mathcal{M})$, with probability $1 - \delta$, $\bigl\|(d\varphi)^*\bigl(\vec{f}_D\bigr) - \nabla_{\mathcal{M}} f\bigr\|^2_{L^2_{\rho_{\mathcal{M}}}} \le C \left(\frac{\log\frac{1}{\delta}}{n}\right)^{\frac{1}{2d}}$, where $(d\varphi)^*$ (projection onto the tangent space) is the dual of the map $d\varphi$.

Bayesian Mixture of Inverses Principal components analysis (PCA) Algorithmic view of PCA: 1. Given $X = (X_1,\dots,X_n)$, a $p \times n$ matrix, construct $\hat{\Sigma} = (X - \bar{X})(X - \bar{X})^T$. 2. Eigen-decomposition of $\hat{\Sigma}$: $\hat{\Sigma} v_i = \lambda_i v_i$.

Bayesian Mixture of Inverses Probabilistic PCA $X \in \mathbb{R}^p$ is characterized by a multivariate normal: $X \sim \mathrm{No}(\mu + A\nu, \Delta)$, $\nu \sim \mathrm{No}(0, I_d)$, with $\mu \in \mathbb{R}^p$, $A \in \mathbb{R}^{p \times d}$, $\Delta \in \mathbb{R}^{p \times p}$, and $\nu \in \mathbb{R}^d$ a latent variable.

Bayesian Mixture of Inverses SDR model Semiparametric model $Y_i = f(X_i) + \varepsilon_i = g(b_1^T X_i,\dots,b_d^T X_i) + \varepsilon_i$, where $\mathrm{span}\,B$ is the dimension reduction (d.r.) subspace.

Bayesian Mixture of Inverses Principal fitted components (PFC) Define $X_y \equiv (X \mid Y = y)$ and specify a multivariate normal distribution $X_y \sim \mathrm{No}(\mu_y, \Delta)$, $\mu_y = \mu + A\nu_y$, with $\mu \in \mathbb{R}^p$, $A \in \mathbb{R}^{p \times d}$, $\nu_y \in \mathbb{R}^d$. The d.r. subspace is spanned by $B = \Delta^{-1} A$. PFC captures global linear predictive structure; it does not generalize to manifolds.

Bayesian Mixture of Inverses Mixture models and localization A driving idea in manifold learning is that manifolds are locally Euclidean. A driving idea in probabilistic modeling is that mixture models are flexible and can capture nonparametric distributions. Mixture models can therefore capture local nonlinear predictive manifold structure.

Bayesian Mixture of Inverses Model specification $X_y \sim \mathrm{No}(\mu_{yx}, \Delta)$, $\mu_{yx} = \mu + A\nu_{yx}$, $\nu_{yx} \sim G_y$, where $G_y$ is a density indexed by $y$ having multiple clusters; $\mu \in \mathbb{R}^p$, $\varepsilon \sim \mathrm{N}(0, \Delta)$ with $\Delta \in \mathbb{R}^{p \times p}$, $A \in \mathbb{R}^{p \times d}$, $\nu_{yx} \in \mathbb{R}^d$.

Bayesian Mixture of Inverses Dimension reduction space Proposition. For this model the d.r. space is the span of $B = \Delta^{-1} A$: $Y \mid X \overset{d}{=} Y \mid (\Delta^{-1} A)^T X$.

Bayesian Mixture of Inverses Sampling distribution Define $\nu_i \equiv \nu_{y_i x_i}$. The sampling distribution for the data is $x_i \mid (y_i, \mu, \nu_i, A, \Delta) \sim \mathrm{N}(\mu + A\nu_i, \Delta)$, $\nu_i \sim G_{y_i}$.

Bayesian Mixture of Inverses Categorical response: modeling $G_y$ Let $\mathcal{Y} = \{1,\dots,C\}$, so each category has a distribution: $\nu_i \mid (y_i = c) \sim G_c$, $c = 1,\dots,C$. Thus $\nu_i$ is modeled as a mixture of $C$ distributions $G_1,\dots,G_C$, with a Dirichlet process prior for each distribution: $G_c \sim \mathrm{DP}(\alpha_0, G_0)$. Go to board.
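
For readers unfamiliar with the DP prior, a minimal truncated stick-breaking sketch of a draw $G_c \sim \mathrm{DP}(\alpha_0, G_0)$ follows; the truncation level and the standard normal base measure $G_0$ are my choices, not details from the talk.

```python
import numpy as np

def draw_dp(alpha0, base_sampler, truncation=100, rng=None):
    """Truncated stick-breaking draw from DP(alpha0, G0): returns (atoms, weights)."""
    rng = rng or np.random.default_rng()
    v = rng.beta(1.0, alpha0, size=truncation)          # stick-breaking proportions
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    atoms = base_sampler(truncation, rng)               # iid draws from the base measure G0
    return atoms, w / w.sum()                           # renormalize after truncation

# G0 taken as a standard normal on R^d with d = 2 (illustrative choice)
atoms, weights = draw_dp(1.0, lambda m, r: r.normal(size=(m, 2)))
nu_draws = atoms[np.random.default_rng(0).choice(len(weights), size=5, p=weights)]
```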

Bayesian Mixture of Inverses Likelihood $\mathrm{Lik}(\text{data} \mid \theta) \equiv \mathrm{Lik}(\text{data} \mid A, \Delta, \nu_1,\dots,\nu_n, \mu) \propto \det(\Delta^{-1})^{n/2} \exp\Bigl[-\tfrac{1}{2} \sum_{i=1}^n (x_i - \mu - A\nu_i)^T \Delta^{-1} (x_i - \mu - A\nu_i)\Bigr]$.
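
A direct numpy transcription of this log-likelihood, up to an additive constant; Delta_inv stands for the noise precision $\Delta^{-1}$ in the reconstruction above.

```python
import numpy as np

def log_likelihood(X, mu, A, Delta_inv, nu):
    """log Lik(data | A, Delta, nu_1..nu_n, mu), up to an additive constant.
    X: n x p data, mu: p, A: p x d, Delta_inv: p x p precision, nu: n x d latent scores."""
    n = X.shape[0]
    resid = X - mu - nu @ A.T                           # rows: x_i - mu - A nu_i
    _, logdet = np.linalg.slogdet(Delta_inv)
    quad = np.einsum('ij,jk,ik->', resid, Delta_inv, resid)
    return 0.5 * n * logdet - 0.5 * quad
```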

Bayesian Mixture of Inverses Posterior inference Given data, $P_\theta \equiv \mathrm{Post}(\theta \mid \text{data}) \propto \mathrm{Lik}(\text{data} \mid \theta)\,\pi(\theta)$. 1. $P_\theta$ provides an estimate of (un)certainty about $\theta$. 2. It requires a prior on $\theta$. 3. How do we sample from $P_\theta$?

Bayesian Mixture of Inverses Markov chain Monte Carlo There is no closed form for $P_\theta$. 1. Specify a Markov transition kernel $K(\theta_t, \theta_{t+1})$ with stationary distribution $P_\theta$. 2. Run the Markov chain to obtain $\theta_1,\dots,\theta_T$.

Bayesian Mixture of Inverses Sampling from the posterior Inference consists of drawing samples $\theta^{(t)} = (\mu^{(t)}, A^{(t)}, \Delta^{-1\,(t)}, \nu^{(t)})$ from the posterior. Define $\theta^{(t)}_{/\mu} \equiv (A^{(t)}, \Delta^{-1\,(t)}, \nu^{(t)})$, $\theta^{(t)}_{/A} \equiv (\mu^{(t)}, \Delta^{-1\,(t)}, \nu^{(t)})$, $\theta^{(t)}_{/\Delta^{-1}} \equiv (\mu^{(t)}, A^{(t)}, \nu^{(t)})$, $\theta^{(t)}_{/\nu} \equiv (\mu^{(t)}, A^{(t)}, \Delta^{-1\,(t)})$.

Bayesian Mixture of Inverses Gibbs sampling The full conditional distributions can be used to sample $\mu$, $\Delta^{-1}$, and $A$: $\mu^{(t+1)} \mid (\text{data}, \theta^{(t)}_{/\mu}) \sim \mathrm{No}\bigl(\cdot \mid \text{data}, \theta^{(t)}_{/\mu}\bigr)$, $\Delta^{-1\,(t+1)} \mid (\text{data}, \theta^{(t)}_{/\Delta^{-1}}) \sim \mathrm{InvWishart}\bigl(\cdot \mid \text{data}, \theta^{(t)}_{/\Delta^{-1}}\bigr)$, $A^{(t+1)} \mid (\text{data}, \theta^{(t)}_{/A}) \sim \mathrm{No}\bigl(\cdot \mid \text{data}, \theta^{(t)}_{/A}\bigr)$. Sampling $\nu^{(t)}$ is more involved.

Bayesian Mixture of Inverses Posterior draws from the Grassmann manifold Given samples $(\Delta^{-1\,(t)}, A^{(t)})_{t=1}^m$, compute $B^{(t)} = \Delta^{-1\,(t)} A^{(t)}$. Each $B^{(t)}$ spans a subspace, which is a point on the Grassmann manifold $\mathcal{G}(d, p)$. There is a Riemannian metric on this manifold, which has two implications.

Bayesian Mixture of Inverses Posterior mean and variance Given draws $(B^{(t)})_{t=1}^m$, the posterior mean and variance should be computed with respect to the Riemannian metric. Given two subspaces $\mathcal{W}$ and $\mathcal{V}$ spanned by orthonormal bases $X$ and $Y$, the distance underlying the Karcher mean is computed from the principal angles: $\bigl(I - X(X^T X)^{-1} X^T\bigr) Y (X^T Y)^{-1} = U\Sigma V^T$, $\Theta = \mathrm{atan}(\Sigma)$, $\mathrm{dist}(\mathcal{W}, \mathcal{V}) = \mathrm{Tr}(\Theta^2)$.
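
A short numpy sketch of this computation via principal angles, following the formula on the slide (the distance is returned as $\mathrm{Tr}(\Theta^2)$, exactly as written there):

```python
import numpy as np

def grassmann_dist(B1, B2):
    """Distance Tr(Theta^2) between span(B1) and span(B2); B1, B2 are p x d bases."""
    Q1, _ = np.linalg.qr(B1)                            # orthonormalize the bases
    Q2, _ = np.linalg.qr(B2)
    # (I - X (X^T X)^{-1} X^T) Y (X^T Y)^{-1} = U Sigma V^T
    M = (Q2 - Q1 @ (Q1.T @ Q2)) @ np.linalg.inv(Q1.T @ Q2)
    theta = np.arctan(np.linalg.svd(M, compute_uv=False))   # principal angles
    return float(np.sum(theta ** 2))
```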

Bayesian Mixture of Inverses Posterior mean and variance The posterior mean subspace is $B_{\mathrm{Bayes}} = \arg\min_{B \in \mathcal{G}(d,p)} \sum_{i=1}^m \mathrm{dist}(B_i, B)$. Uncertainty is summarized by $\mathrm{var}(\{B_1,\dots,B_m\}) = \frac{1}{m} \sum_{i=1}^m \mathrm{dist}(B_i, B_{\mathrm{Bayes}})$.
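
Computing the exact Karcher mean over $\mathcal{G}(d, p)$ needs an iterative optimization; the sketch below uses a cruder surrogate of my own, picking the posterior draw that minimizes the summed distance to the other draws (a medoid), and then evaluates the variance formula above. grassmann_dist is the function from the previous sketch.

```python
import numpy as np

def posterior_summary(B_draws):
    """B_draws: list of p x d bases.  Returns (approximate B_Bayes, posterior spread)."""
    m = len(B_draws)
    D = np.array([[grassmann_dist(B_draws[i], B_draws[j]) for j in range(m)]
                  for i in range(m)])
    medoid = int(np.argmin(D.sum(axis=1)))              # draw closest to all other draws
    return B_draws[medoid], D[medoid].mean()            # (1/m) sum_i dist(B_i, B_Bayes)
```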

Bayesian Mixture of Inverses Distribution theory on Grassmann manifolds If $\mathcal{B}$ is the linear span of $d$ central normal vectors in $\mathbb{R}^p$ with covariance matrix $\Sigma$, the density of the Grassmannian distribution $G_\Sigma$ with respect to the reference measure $G_I$ is $\frac{dG_\Sigma}{dG_I}\bigl(\langle X\rangle\bigr) = \frac{\det(X^T X)^{d/2}}{\det(X^T \Sigma^{-1} X)^{d/2}}$, where $\langle X\rangle = \mathrm{span}(X)$ and $X = (x_1,\dots,x_d)$.

Results on data Swiss roll Swiss roll: $X_1 = t\cos(t)$, $X_2 = h$, $X_3 = t\sin(t)$, $X_4,\dots,X_{10} \overset{iid}{\sim} \mathrm{No}(0,1)$, where $t = \frac{3\pi}{2}(1 + 2\theta)$, $\theta \sim \mathrm{Unif}(0,1)$, $h \sim \mathrm{Unif}(0,1)$, and $Y = \sin(5\pi\theta) + h^2 + \varepsilon$, $\varepsilon \sim \mathrm{No}(0, 0.01)$.
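
A sketch that generates this Swiss roll data set as specified on the slide ($\mathrm{No}(0, 0.01)$ is read as variance 0.01):

```python
import numpy as np

def swiss_roll(n, rng=None):
    rng = rng or np.random.default_rng()
    theta = rng.uniform(0, 1, n)
    h = rng.uniform(0, 1, n)
    t = 1.5 * np.pi * (1 + 2 * theta)
    X = np.column_stack([t * np.cos(t), h, t * np.sin(t),
                         rng.normal(size=(n, 7))])      # X_4, ..., X_10 iid No(0, 1)
    y = np.sin(5 * np.pi * theta) + h ** 2 + rng.normal(0, 0.1, n)   # noise variance 0.01
    return X, y

X, y = swiss_roll(400)
```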

Results on data Swiss roll Pictures

Results on data Swiss roll Metric Projection of the estimated d.r. space $\hat{B} = (\hat{b}_1,\dots,\hat{b}_d)$ onto $B$: $\frac{1}{d}\sum_{i=1}^d \|P_B \hat{b}_i\|^2 = \frac{1}{d}\sum_{i=1}^d \|(BB^T)\hat{b}_i\|^2$.
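
The accuracy metric, as a short function; $B$ is assumed to have orthonormal columns so that $BB^T$ is the projection $P_B$, and the estimated directions are normalized:

```python
import numpy as np

def subspace_accuracy(B_hat, B):
    """(1/d) sum_i ||B B^T b_hat_i||^2 for orthonormal B (p x d)."""
    proj = B @ B.T
    cols = B_hat / np.linalg.norm(B_hat, axis=0)        # unit-norm estimated directions
    return float(np.mean(np.sum((proj @ cols) ** 2, axis=0)))
```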

Results on data Swiss roll Comparison of algorithms [Figure: accuracy (0.4-1.0) versus sample size (200-600) for BMI, BAGL, SIR, LSIR, PHD, and SAVE.]

Results on data Swiss roll Posterior variance [Figure: boxplot of the distances.]

Results on data Swiss roll Error as a function of d [Figure: two panels of error versus the number of e.d.r. directions (1-10).]

Results on data Digits Digits [Figure: sample handwritten digit images.]

Results on data Digits Two classification problems: 3 vs. 8 and 5 vs. 8, with 100 training samples from each class.

Results on data Digits BMI [Figure: two-dimensional scatter plots of the BMI projections.]

Results on data Digits All ten digits

digit    Nonlinear         Linear
0        0.04 (± 0.01)     0.05 (± 0.01)
1        0.01 (± 0.003)    0.03 (± 0.01)
2        0.14 (± 0.02)     0.19 (± 0.02)
3        0.11 (± 0.01)     0.17 (± 0.03)
4        0.13 (± 0.02)     0.13 (± 0.03)
5        0.12 (± 0.02)     0.21 (± 0.03)
6        0.04 (± 0.01)     0.0816 (± 0.02)
7        0.11 (± 0.01)     0.14 (± 0.02)
8        0.14 (± 0.02)     0.20 (± 0.03)
9        0.11 (± 0.02)     0.15 (± 0.02)
average  0.09              0.14

Table: Average classification error rate and standard deviation on the digits data.

Results on data Cancer Cancer classification: $n = 38$ samples with expression levels for $p = 7129$ genes or ESTs. 19 samples are Acute Myeloid Leukemia (AML) and 19 are Acute Lymphoblastic Leukemia (ALL); the ALL samples fall into two subclusters, B-cell and T-cell.

Results on data Cancer Substructure captured [Figure: two-dimensional projection of the expression data.]

The end Funding: IGSP Center for Systems Biology at Duke; NSF DMS-0732260.