Learning gradients: prescriptive models


Department of Statistical Science, Institute for Genome Sciences & Policy, and Department of Computer Science, Duke University. May 11, 2007.

Relevant papers
Learning Coordinate Covariances via Gradients. Sayan Mukherjee, Ding-Xuan Zhou. Journal of Machine Learning Research, 7(Mar):519-549, 2006.
Estimation of Gradients and Coordinate Covariation in Classification. Sayan Mukherjee, Qiang Wu. Journal of Machine Learning Research, 7(Nov):2481-2514, 2006.
Learning Gradients and Feature Selection on Manifolds. Sayan Mukherjee, Qiang Wu, Ding-Xuan Zhou. Annals of Statistics, submitted.
Learning Gradients: simultaneous regression and inverse regression. Mauro Maggioni, Qiang Wu. Journal of Machine Learning Research, in preparation.

Table of contents
1. Regression and inverse regression; Learning gradients: a justification for inverse regression
2. Nonparametric kernel model; Convergence of estimate

Generative vs. predictive modelling
Given data $D = \{Z_i = (x_i, y_i)\}_{i=1}^n$ with $Z_i \overset{iid}{\sim} \rho(X, Y)$, where $X \in \mathcal{X} \subset \mathbb{R}^p$, $Y \in \mathbb{R}$, and $p \gg n$.
Two options:
1. discriminative or regression: $Y \mid X$
2. generative: $X \mid Y$ (sometimes called inverse regression)

Motivation and related work
Data generated by measuring thousands of variables lies on or near a low-dimensional manifold.
Manifold learning: LLE, ISOMAP, Laplacian Eigenmaps, Hessian Eigenmaps.
Simultaneous dimensionality reduction and regression: SIR, MAVE, SAVE.

Regression
Given $X \in \mathcal{X} \subset \mathbb{R}^p$, $Y \in \mathbb{R}$, $p \gg n$, and $\rho(X, Y)$, we want $Y \mid X$. A natural idea:
$$f_r = \arg\min_f \operatorname{Var}(f) = \arg\min_f \mathbb{E}\,(Y - f(X))^2,$$
and $f_r(x) = \mathbb{E}[Y \mid X = x]$ provides a summary of $Y \mid X$.

Inverse regression
Given $X \in \mathcal{X} \subset \mathbb{R}^p$, $Y \in \mathbb{R}$, $p \gg n$, and $\rho(X, Y)$, we want $X \mid Y$. Here $\Omega = \operatorname{cov}(X \mid Y)$ provides a summary of $X \mid Y$:
1. $\Omega_{ii}$: relevance of variable $i$ with respect to the label
2. $\Omega_{ij}$: covariation of variables $i$ and $j$ with respect to the label

Learning gradients
Given data $D = \{Z_i = (x_i, y_i)\}_{i=1}^n$ with $Z_i \overset{iid}{\sim} \rho(X, Y)$, we will simultaneously estimate $f_r(x)$ and
$$\nabla f_r = \Big(\frac{\partial f_r}{\partial x_1}, \ldots, \frac{\partial f_r}{\partial x_p}\Big)^T.$$
1. regression: $f_r(x)$
2. inverse regression: the outer product of gradients
$$\Gamma = \mathbb{E}\big[\nabla f_r \, (\nabla f_r)^T\big], \qquad \Gamma_{ij} = \Big\langle \frac{\partial f_r}{\partial x_i}, \frac{\partial f_r}{\partial x_j} \Big\rangle.$$

Linear case
We start with the linear case: $y = w \cdot x + \varepsilon$, $\varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2)$. Let $\Sigma_X = \operatorname{cov}(X)$ and $\sigma_Y^2 = \operatorname{var}(Y)$. Then
$$\Omega = \operatorname{cov}(X \mid Y) = \frac{1}{\sigma_Y^2}\, \Sigma_X \Gamma \Sigma_X.$$
$\Gamma$ and $\Omega$ are equivalent modulo rotation and scale.
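As a quick numerical check of this identity, the sketch below simulates the linear model with jointly Gaussian $(X, Y)$ and compares $\Sigma_X \Gamma \Sigma_X / \sigma_Y^2$ (with $\Gamma = w w^T$) against an empirical estimate of the covariance of the inverse regression, $\operatorname{cov}(\mathbb{E}[X \mid Y])$, which is the reading of $\Omega$ under which the displayed equality holds exactly in the Gaussian case. The dimension, sample size, and seed are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 5, 200_000

# Linear model y = w.x + eps with correlated Gaussian X.
A = rng.normal(size=(p, p))
Sigma_X = A @ A.T / p                      # covariance of X
w = rng.normal(size=p)
sigma_eps = 0.5

X = rng.multivariate_normal(np.zeros(p), Sigma_X, size=n)
y = X @ w + sigma_eps * rng.normal(size=n)

Gamma = np.outer(w, w)                     # gradient outer product of a linear model
sigma_Y2 = w @ Sigma_X @ w + sigma_eps**2  # var(Y)

# Right-hand side of the slide's identity.
rhs = Sigma_X @ Gamma @ Sigma_X / sigma_Y2

# Left-hand side: for jointly Gaussian (X, Y), E[X|Y] = cov(X, Y) Y / var(Y);
# estimate cov(E[X|Y]) empirically from the simulated sample.
cov_XY = Sigma_X @ w                       # cov(X, Y)
E_X_given_Y = np.outer(y, cov_XY) / sigma_Y2
lhs = np.cov(E_X_given_Y, rowvar=False)

print(np.max(np.abs(lhs - rhs)))           # small, up to Monte Carlo error
```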

Nonlinear case
For a smooth $f(x)$, with $y = f(x) + \varepsilon$ and $\varepsilon \overset{iid}{\sim} \mathrm{No}(0, \sigma^2)$, the meaning of $\Omega = \operatorname{cov}(X \mid Y)$ is not so clear.
Partition $\mathcal{X}$ into sections and compute local quantities:
$$\mathcal{X} = \bigcup_{i=1}^{I} \chi_i, \qquad \Omega_i = \operatorname{cov}(X_{\chi_i} \mid Y_{\chi_i}), \qquad \Sigma_i = \operatorname{cov}(X_{\chi_i}), \qquad \sigma_i^2 = \operatorname{var}(Y_{\chi_i}).$$
Then
$$\Gamma = \sum_{i=1}^{I} \sigma_i^2 \, \Sigma_i^{-1} \Omega_i \Sigma_i^{-1}.$$

Taylor expansion and gradients
Given $D = (Z_1, \ldots, Z_n)$, the variance of $f$ may be approximated as
$$\operatorname{Var}_n(f) = \sum_{i,j=1}^{n} w_{ij}\,\big[\,y_i - f(x_j) - \nabla f(x_j) \cdot (x_i - x_j)\,\big]^2,$$
where the weight $w_{ij}$ ensures the locality of $x_i - x_j$.

Penalized loss estimator
$$\hat f(x) = \arg\min_{f \in \mathcal{H}} \big[\,\text{error on data} + \text{smoothness of function}\,\big]$$
error on data: $L(f, \text{data}) = \operatorname{Var}_n(f)$
smoothness of function: $\|f\|_K^2$ (a derivative-based penalty such as $\int \|\nabla f(x)\|^2 \, dx$)
big function space: a reproducing kernel Hilbert space $\mathcal{H}_K$
$$\hat f(x) = \arg\min_{f \in \mathcal{H}_K} \big[\, L(f, \text{data}) + \lambda \|f\|_K^2 \,\big]$$
The kernel: $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, e.g. $K(u, v) = e^{-\|u - v\|^2}$. The RKHS:
$$\mathcal{H}_K = \Big\{ f \;\Big|\; f(x) = \sum_{i=1}^{l} \alpha_i K(x, x_i), \; x_i \in \mathcal{X}, \; \alpha_i \in \mathbb{R}, \; l \in \mathbb{N} \Big\}.$$
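For scalar regression with a plain squared-error loss in place of $\operatorname{Var}_n$, this penalized template reduces to kernel ridge regression, which has a simple closed form. The sketch below is a minimal illustration of that special case with the Gaussian kernel; the bandwidth, the value of $\lambda$, and the toy data are arbitrary assumptions, and this is not the gradient estimator itself.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """K(u, v) = exp(-||u - v||^2 / (2 * bandwidth^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth**2))

def kernel_ridge_fit(X, y, lam=1e-2, bandwidth=1.0):
    """Solve min_f sum_i (y_i - f(x_i))^2 + lam ||f||_K^2.
    By the representer theorem f(x) = sum_i alpha_i K(x, x_i),
    and the coefficients satisfy (K + lam I) alpha = y."""
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, alpha, X_new, bandwidth=1.0):
    return gaussian_kernel(X_new, X_train, bandwidth) @ alpha

# Toy usage: a noisy sine curve in one dimension.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
alpha = kernel_ridge_fit(X, y)
print(kernel_ridge_predict(X, alpha, np.array([[0.0], [1.5]])))
```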

Gradient estimate
Nonparametric model:
$$(f_D, \vec f_D) := \arg\min_{(f, \vec f\,) \in \mathcal{H}_K^{p+1}} \Bigg[ \sum_{i,j=1}^{n} w_{ij}\,\big( y_j - f(x_i) - \vec f(x_i) \cdot (x_j - x_i) \big)^2 + \lambda_1 \|f\|_K^2 + \lambda_2 \|\vec f\|_K^2 \Bigg],$$
where $\mathcal{H}_K^p$ is the space of $p$-tuples of functions $\vec f = (f^1, \ldots, f^p)$ with $f^i \in \mathcal{H}_K$, $\|\vec f\|_K^2 = \sum_{i=1}^{p} \|f^i\|_K^2$, and $\lambda_1, \lambda_2 > 0$.

Computational efficiency
The computation requires fewer than $n^2$ parameters and is $O(n^6)$ time and $O(pn)$ memory:
$$f_D(x) = \sum_{i=1}^{n} a_{i,D}\, K(x_i, x), \qquad \vec f_D(x) = \sum_{i=1}^{n} c_{i,D}\, K(x_i, x),$$
with $a_D = (a_{1,D}, \ldots, a_{n,D}) \in \mathbb{R}^n$ and $c_D = (c_{1,D}, \ldots, c_{n,D})^T \in \mathbb{R}^{np}$.
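The counts above come from solving a linear system in the representer coefficients $a_D$ and $c_D$. As a rough illustration of the estimator rather than that closed-form solver, the sketch below minimizes the objective from the previous slide directly with a generic optimizer; the Gaussian kernel, the shared median-distance bandwidth for the kernel and the locality weights $w_{ij}$, the regularization values, and the toy data are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def learn_gradients(X, y, lam1=1e-3, lam2=1e-3):
    """Minimize sum_ij w_ij (y_j - f(x_i) - fvec(x_i).(x_j - x_i))^2
                 + lam1 ||f||_K^2 + lam2 ||fvec||_K^2
    over f(x) = sum_l a_l K(x, x_l) and fvec(x) = sum_l c_l K(x, x_l)."""
    n, p = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    bw2 = np.median(d2[d2 > 0])          # squared bandwidth from the median pairwise distance
    K = np.exp(-d2 / (2 * bw2))          # kernel Gram matrix
    W = np.exp(-d2 / (2 * bw2))          # locality weights w_ij (same bandwidth for simplicity)
    D = X[None, :, :] - X[:, None, :]    # D[i, j] = x_j - x_i

    def objective(theta):
        a, C = theta[:n], theta[n:].reshape(n, p)
        fx = K @ a                        # f(x_i)
        F = K @ C                         # row i = fvec(x_i)
        R = y[None, :] - fx[:, None] - np.einsum('ijk,ik->ij', D, F)
        fit = np.sum(W * R**2)
        reg = lam1 * (a @ K @ a) + lam2 * np.sum(C * (K @ C))
        return fit + reg

    res = minimize(objective, np.zeros(n * (p + 1)), method='L-BFGS-B')
    a, C = res.x[:n], res.x[n:].reshape(n, p)
    Gamma_hat = C.T @ K @ C              # empirical gradient outer product
    return a, C, Gamma_hat

# Toy usage: y depends only on the first two of ten coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=40)
_, _, Gamma_hat = learn_gradients(X, y)
print(np.round(np.diag(Gamma_hat), 2))   # largest diagonal entries should be coordinates 0 and 1
```

The last line forms $\hat\Gamma = c_D^T K c_D$, the empirical gradient outer product that the later slides use for feature selection and projection.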

Consistency
Theorem. Under mild regularity conditions on the distribution and corresponding density, with probability $1 - \delta$,
$$\|f_D - f_r\|_{\rho_X} \le C \log\big(\tfrac{1}{\delta}\big)\, n^{-1/p}, \qquad \|\vec f_D - \nabla f_r\|_{\rho_X} \le C \log\big(\tfrac{1}{\delta}\big)\, n^{-1/p}.$$

Linear example
Samples from class $-1$ were drawn from $x_j \sim \mathrm{No}(1.5, 1)$ for $j = 1, \ldots, 10$; $x_j \sim \mathrm{No}(-3, 1)$ for $j = 11, \ldots, 20$; and $x_j \sim \mathrm{No}(0, \sigma_{\text{noise}})$ for $j = 21, \ldots, 80$.
Samples from class $+1$ were drawn from $x_j \sim \mathrm{No}(1.5, 1)$ for $j = 41, \ldots, 50$; $x_j \sim \mathrm{No}(-3, 1)$ for $j = 51, \ldots, 60$; and $x_j \sim \mathrm{No}(0, \sigma_{\text{noise}})$ for $j = 1, \ldots, 40, 61, \ldots, 80$.
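A sketch of one way to generate this toy data set follows; the number of samples per class, the random seed, and treating $\sigma_{\text{noise}}$ as a standard deviation are illustrative assumptions, since the slide does not specify them.

```python
import numpy as np

def linear_example(n_per_class=20, sigma_noise=0.5, seed=0):
    """80-dimensional two-class toy data set described above."""
    rng = np.random.default_rng(seed)
    p = 80

    # Class -1: signal in coordinates 1-10 and 11-20 (0-indexed 0:10 and 10:20).
    X_neg = rng.normal(0.0, sigma_noise, size=(n_per_class, p))
    X_neg[:, 0:10] = rng.normal(1.5, 1.0, size=(n_per_class, 10))
    X_neg[:, 10:20] = rng.normal(-3.0, 1.0, size=(n_per_class, 10))

    # Class +1: signal in coordinates 41-50 and 51-60 (0-indexed 40:50 and 50:60).
    X_pos = rng.normal(0.0, sigma_noise, size=(n_per_class, p))
    X_pos[:, 40:50] = rng.normal(1.5, 1.0, size=(n_per_class, 10))
    X_pos[:, 50:60] = rng.normal(-3.0, 1.0, size=(n_per_class, 10))

    X = np.vstack([X_neg, X_pos])
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    return X, y

X, y = linear_example()
```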

Linear example
[figures omitted: the data matrix (Dimensions by Samples); RKHS norm per dimension; a Dimensions by Dimensions matrix; probability $y = +1$ per sample]

Nonlinear example
Samples from class $+1$ were drawn from $(x_1, x_2) = (r \sin\theta, \, r \cos\theta)$ with $r \sim U[0, 1]$ and $\theta \sim U[0, 2\pi]$, and $x_j \sim \mathrm{No}(0.0, 0.2)$ for $j = 3, \ldots, 200$.
Samples from class $-1$ were drawn from $(x_1, x_2) = (r \sin\theta, \, r \cos\theta)$ with $r \sim U[2, 3]$ and $\theta \sim U[0, 2\pi]$, and $x_j \sim \mathrm{No}(0.0, 0.2)$ for $j = 3, \ldots, 200$.

Nonlinear example
[figures omitted: scatter of classes $+1$ and $-1$ in the first two coordinates; a Dimensions by Dimensions matrix; RKHS norm per dimension; probability $y = +1$ per sample]

Restriction to a manifold
Assume the data is concentrated on a manifold $\mathcal{M}$ of dimension $d \ll p$ and that there exists an isometric embedding $\varphi : \mathcal{M} \to \mathbb{R}^p$.
Given a smooth orthonormal vector field $\{e_1, \ldots, e_d\}$ we can define the gradient on the manifold,
$$\nabla_{\mathcal{M}} f = (e_1 f, \ldots, e_d f).$$
For $q \in U \subseteq \mathcal{M}$ there exists a chart $u : U \to \mathbb{R}^d$ adapted to the frame $\{e_i\}$, so the Taylor expansion on the manifold around $q$ is
$$f(q') \approx f(q) + \nabla_{\mathcal{M}} f(q) \cdot \big(u(q') - u(q)\big) \quad \text{for } q' \approx q.$$

A problem with all manifold methods
The Taylor expansion on the manifold around $q$:
$$f(q') \approx f(q) + \nabla_{\mathcal{M}} f(q) \cdot \big(u(q') - u(q)\big) \quad \text{for } q' \approx q.$$
The points $\{(q_i, y_i)\}_{i=1}^n \subset \mathcal{M} \times Y$ are drawn from the manifold, but neither $\mathcal{M}$ nor a local expression for $\nabla_{\mathcal{M}}$ is given. Moreover, we only observe the image $x_i = \varphi(q_i) \in \mathbb{R}^p$.

The standard solution
The Taylor expansion on the manifold around $q$:
$$f(q') \approx f(q) + \nabla_{\mathcal{M}} f(q) \cdot \big(u(q') - u(q)\big) \quad \text{for } q' \approx q.$$
The Taylor expansion in $\mathbb{R}^p$ around $x = \varphi(q)$, in terms of $f \circ \varphi^{-1}$:
$$(f \circ \varphi^{-1})(u) \approx (f \circ \varphi^{-1})(x) + \nabla(f \circ \varphi^{-1})(x) \cdot (u - x) \quad \text{for } u \approx x.$$
Due to this equivalence, $\vec f_D \approx d\varphi(\nabla_{\mathcal{M}} f_r)$.

Improved rate of convergence
Theorem. Under mild regularity conditions on the distribution and corresponding density, with probability $1 - \delta$,
$$\big\| (d\varphi)^* \vec f_D - \nabla_{\mathcal{M}} f_r \big\|_{\rho_X} \le C \log\big(\tfrac{1}{\delta}\big)\, n^{-1/d}, \qquad \| f_D - f_r \|_{\rho_X} \le C \log\big(\tfrac{1}{\delta}\big)\, n^{-1/d},$$
where $(d\varphi)^*$ is the dual of the map $d\varphi$.

Sensitive features
Proposition. Let $f$ be a smooth function on $\mathbb{R}^p$ with gradient $\nabla f$. The sensitivity of the function $f$ along a (unit normalized) direction $u$ is $\|u \cdot \nabla f\|_K$. The $d$ most sensitive features are those $\{u_1, \ldots, u_d\}$ that are orthogonal and maximize $\|u_i \cdot \nabla f\|_K$.
A spectral decomposition of $\Gamma$ is used to compute these features: the eigenvectors corresponding to the $d$ largest eigenvalues.

Projection of data
The matrix
$$\hat\Gamma = \vec f_D\, \vec f_D^{\,T} = c_D^T K c_D$$
is an empirical estimate of $\Gamma$. Geometry is preserved by projection onto the top $k$ eigenvectors. There is no need to compute the $p \times p$ matrix; the method is $O(n^2 p + n^3)$ time and $O(pn)$ memory.
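Continuing the hedged `learn_gradients` sketch from earlier, the sensitive features are the top eigenvectors of $\hat\Gamma$ and the supervised embedding is a single matrix product. For clarity this eigendecomposes the full $p \times p$ matrix, so it ignores the $O(n^2 p + n^3)$ shortcut mentioned above.

```python
import numpy as np

def sensitive_features(Gamma_hat, d):
    """Return the d most sensitive directions: the top-d eigenvectors of Gamma_hat."""
    eigvals, eigvecs = np.linalg.eigh(Gamma_hat)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order], eigvals[order]

def project(X, directions):
    """Project the data onto the selected directions (columns of `directions`)."""
    return X @ directions

# Usage with the earlier toy example:
# U, lam = sensitive_features(Gamma_hat, d=2)
# Z = project(X, U)    # n x 2 supervised embedding
```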

Linear example
[figures omitted: Dimensions by samples matrices; eigenvalue spectra (value vs. index of eigenvalues); values per dimension; projection onto Feature 1]

Nonlinear example
[figures omitted: scatter of the data in Dimensions 1 and 2 and of its projection onto Features 1 and 2; eigenvalue spectrum (value vs. index of eigenvalues); values per dimension; projection onto Feature 1]

Digits: 6 vs. 9
[figures omitted: example images of the digits 6 and 9; eigenvalue spectrum and norm per dimension; projection of the digits onto features 1 and 2]

Leukemia
48 samples of AML, 25 samples of ALL, $p = 7{,}129$. The dataset was split into a training set of 38 samples and a test set of 35 samples.

Leukemia
[figures omitted: projections of the samples; eigenvalue spectrum (value vs. index of eigenvalues); distance to hyperplane per sample]

Gauss-Markov graphical models
Given a multivariate normal distribution with covariance matrix $C$, the matrix $P = C^{-1}$ is the conditional independence matrix: $P_{ij}$ measures the dependence of variables $i$ and $j$ given all the other variables.
Set $C = \hat\Gamma$. However, many values are near zero, and it is not clear how to truncate them.

Multiscale graphical models
Assume for now that $\hat\Gamma$ has all positive entries and is normalized to be a Markov matrix:
$$T = \Lambda^{-1/2}\, \hat\Gamma\, \Lambda^{-1/2},$$
where $\Lambda$ is the eigenvalue matrix for $\hat\Gamma$.

Multiscale graphical models
Note the harmonic expansion
$$(I - T)^{-1} = \sum_{k \ge 0} T^k = \prod_{k \ge 0} \big(I + T^{2^k}\big),$$
where $k$ is path length: in the first equality, $T^k_{ij}$ is the probability of going from $i$ to $j$ along a path of length $k$. The product over the $(I + T^{2^k})$ terms factorizes $(I - T)^{-1}$ into low-rank matrices with fewer entries. This is the idea of diffusion wavelets.

Summary and toy example
Key quantities: $C \leftrightarrow T$ and $C^{-1} \leftrightarrow (I - T)^{-1}$, so the factors $(I + T^{2^k})$ factorize $C^{-1}$.

Summary and toy example
$$X = (a, b, c, d), \qquad \hat\Gamma = \begin{pmatrix} 5 & 3 & 1.5 & 0 \\ 3 & 5 & 0.5 & 2 \\ 1.5 & 0.5 & 5 & 2.5 \\ 0 & 2 & 2.5 & 5 \end{pmatrix}.$$
[figure omitted: multiscale grouping of the variables $a$, $b$, $c$, $d$]
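As a small check of the multiscale factorization on a matrix like this toy $\hat\Gamma$, the sketch below verifies $(I - T)^{-1} = \prod_{k \ge 0} (I + T^{2^k})$ numerically. Because a Markov-normalized matrix has an eigenvalue equal to one, $T$ is damped slightly so that $I - T$ is invertible; the damping factor and the symmetric degree normalization are illustrative assumptions rather than the construction on the slides.

```python
import numpy as np

Gamma_hat = np.array([
    [5.0, 3.0, 1.5, 0.0],
    [3.0, 5.0, 0.5, 2.0],
    [1.5, 0.5, 5.0, 2.5],
    [0.0, 2.0, 2.5, 5.0],
])
deg = Gamma_hat.sum(axis=1)
T = 0.9 * Gamma_hat / np.sqrt(np.outer(deg, deg))   # damped symmetric normalization

target = np.linalg.inv(np.eye(4) - T)
prod, power = np.eye(4), T.copy()
for _ in range(12):                   # multiply (I + T)(I + T^2)(I + T^4)...
    prod = prod @ (np.eye(4) + power)
    power = power @ power             # T^(2^k) -> T^(2^(k+1))
print(np.max(np.abs(prod - target)))  # near machine precision after a few factors
```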

General covariance matrices
The math goes through as above; the interpretation is slightly different. Path coefficients (Wright 1921) and path regression (Tukey 1954):
$$X_1 = c_{12} X_2 + c_{13} X_3 + \cdots + c_{1m} X_m,$$
where $c_{12}$ is the contribution of $X_2$ to $X_1$.
The matrix $\Gamma$ can be thought of as a matrix of path coefficients. Powers of the matrix compute path coefficients of paths of length $k$, as in the Markov case.

An early graphical model
[figure omitted: path diagram relating weight at 33 days, gain 0-33 days, weight at birth, rate of growth, external conditions, heredity, condition of dam, gestation period, size of litter, and heredity of dam]

Discussion
Lots of work left:
Semi-supervised setting.
Multi-task setting.
Bayesian formulation.
Nonlinear projections (diffusion maps).
Noise on/off manifolds.