MANIFOLD LEARNING: A MACHINE LEARNING PERSPECTIVE. Sam Roweis. University of Toronto Department of Computer Science. [Google: Sam Toronto ]


1 MANIFOLD LEARNING: A MACHINE LEARNING PERSPECTIVE. Sam Roweis, University of Toronto, Department of Computer Science. [Google: Sam Toronto] MSRI High-Dimensional Data Workshop, December 10, 2004.

2 Manifold Learning. Means many things to many people. In machine learning, it generally refers to a class of unsupervised statistical problems: Dimensionality reduction of a finite data set to preserve or highlight certain features of the original measurements. Latent factor modeling of high-dimensional observations using only a small number of underlying causes. Density estimation, based on a finite sample of points from a distribution over a high-dimensional space. Mathematically, we assume x = f(y) + noise. We see samples of x generated by some unknown function f(·), an underlying distribution p(y), and some uncharacterized noise process, and we want to learn f(·) (or its inverse). The problem is ill-posed, so we typically make several strong assumptions, e.g. f is smooth, p(y) is uniform, and the noise is small.
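As a rough illustration of the generative assumption x = f(y) + noise, the sketch below draws samples under the three strong assumptions named on the slide. The particular f (a "Swiss roll" style surface), the latent ranges, and the names `sample_manifold` and `noise_std` are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def sample_manifold(n, noise_std=0.05, seed=0):
    """Draw n samples of x = f(y) + noise for a hypothetical rolled-sheet f."""
    rng = np.random.default_rng(seed)
    # Latent coordinates y: uniform p(y) on a 2-D rectangle (assumption).
    t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, n)
    h = rng.uniform(0.0, 10.0, n)
    y = np.column_stack([t, h])
    # Smooth embedding f: R^2 -> R^3 (illustrative choice of f).
    fx = np.column_stack([t * np.cos(t), h, t * np.sin(t)])
    # Small isotropic Gaussian noise (assumption).
    x = fx + noise_std * rng.standard_normal(fx.shape)
    return x, y

# Observations x live in 3-D; the underlying causes y are 2-D.
x, y = sample_manifold(500)
```

A manifold learner sees only `x`; recovering something like `y` (or f itself) is the learning problem.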

3 Motivations for Manifold Learning. Most inputs are redundant: data are points in a high-dimensional space; coherent structure in the world generates strong correlations between components; geometrically, observations lie on or near thin, connected, low-dimensional manifolds. Many processes are nonlinear: we want to model the curved geometry of high-dimensional manifolds; linearity can be a useful approximation in local domains, but is too strong globally; most interesting data have nonlinear structure. Computational savings: we need to vastly decrease the size of inputs while preserving important similarities and differences, to improve the efficiency of statistical algorithms and avoid the curse of dimensionality.

4 Dimensionality Reduction. Goal: find a set of low-dimensional coordinates y_n for each high-dimensional observation x_n in order to preserve some measure of the original structure. Appeal: no assumptions about distributions; the data are just what we have in front of us. Disadvantages: does not generalize to new data, and does not explicitly reveal anything about the nature of the underlying process, its latent causes, or the structure of the manifold it induces in the observation space.

5 Dimensionality Reduction. Approach: optimization of low-dimensional coordinates directly, given some carefully designed objective function. Common theme: how to convert local information into global information (e.g. overlapping local geometric constraints, geodesic distances on local graphs, preserving neighbour identities). Typical setup: build a locally connected graph on the data sample; use local measurements to induce a global objective function; optimize this objective using an eigenvector method. Examples of linear methods: SVD, PCA, classical MDS. Examples of nonlinear methods: Kruskal MDS, Isomap, LLE, Laplacian Eigenmaps, and variants (Conformal Isomap, Hessian LLE, Semidefinite Embedding), Local MDS, Projection Pursuit, self-organizing maps, Stochastic Neighbour Embedding (SNE).
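The three-step "typical setup" can be sketched with one of the listed methods, Laplacian Eigenmaps. This is a minimal dense-matrix sketch, not a definitive implementation: the unweighted k-nearest-neighbour graph, the symmetric normalised Laplacian, and the names `laplacian_eigenmaps`, `n_neighbors`, and `n_components` are assumptions made for illustration, and the graph is assumed to be connected.

```python
import numpy as np

def laplacian_eigenmaps(x, n_neighbors=10, n_components=2):
    """Embed the rows of x using bottom eigenvectors of a k-NN graph Laplacian."""
    n = x.shape[0]
    # Step 1: locally connected graph from pairwise squared distances.
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d2, np.inf)                 # a point is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    w = np.zeros((n, n))
    w[np.repeat(np.arange(n), n_neighbors), idx.ravel()] = 1.0
    w = np.maximum(w, w.T)                       # symmetrise the adjacency matrix
    # Step 2: local edges induce the global objective sum_ij w_ij ||y_i - y_j||^2,
    # which is governed by the graph Laplacian L = D - W.
    deg = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    l_sym = np.eye(n) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    # Step 3: eigenvector method (eigh returns eigenvalues in ascending order).
    vals, vecs = np.linalg.eigh(l_sym)
    # Skip the trivial eigenvector (eigenvalue 0) and map back to solutions
    # of the generalised problem L v = lambda D v.
    y = d_inv_sqrt[:, None] * vecs[:, 1:n_components + 1]
    return y, vals
```

A dense eigendecomposition costs O(n^3), so this form is only practical for a few thousand points; sparse solvers are the usual choice beyond that.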

6 Latent Factor Models. Goal: build an explicit model (often probabilistic) of the embedding function f(·) that explains the data we saw. Appeal: explicitly represents underlying causes, allows us to generalize off the data, handles uncertainty and noise naturally. Disadvantages: too many unknowns to build a full probabilistic model; in particular, there is a fundamental degeneracy between sampling in latent space and curvature of the manifold.


7 Latent Factor Models. Approach: make very strong assumptions and proceed from there using maximum likelihood learning (or approximations). (Typical assumptions: uniform density in latent space, isometric embedding, bounded curvature of manifold.) Examples of linear methods: probabilistic PCA, factor analysis, etc. Examples of nonlinear methods: autoencoder neural networks, principal curves/surfaces, generative topographic mapping (GTM), independent components analysis (ICA), Kernel PCA.
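For the linear case, maximum likelihood learning of probabilistic PCA has a closed form (Tipping & Bishop, cited in the references). The sketch below assumes the standard model x = W y + mu + noise with y ~ N(0, I) and isotropic noise variance sigma2; the function names `fit_ppca` and `posterior_mean` are hypothetical, and q must be smaller than the data dimension.

```python
import numpy as np

def fit_ppca(x, q):
    """Closed-form maximum-likelihood fit of probabilistic PCA with q latent dims."""
    mu = x.mean(axis=0)
    xc = x - mu
    s = xc.T @ xc / x.shape[0]                 # ML sample covariance (divide by N)
    vals, vecs = np.linalg.eigh(s)
    vals, vecs = vals[::-1], vecs[:, ::-1]     # sort eigenpairs in descending order
    sigma2 = vals[q:].mean()                   # noise = mean of discarded eigenvalues
    # Loadings: top-q eigenvectors scaled by sqrt(eigenvalue - sigma2).
    w = vecs[:, :q] * np.sqrt(np.maximum(vals[:q] - sigma2, 0.0))
    return mu, w, sigma2

def posterior_mean(x, mu, w, sigma2):
    """Latent estimate E[y | x] = M^{-1} W^T (x - mu), with M = W^T W + sigma2 I."""
    m = w.T @ w + sigma2 * np.eye(w.shape[1])
    return np.linalg.solve(m, w.T @ (x - mu).T).T
```

Unlike a pure dimensionality reduction method, the fitted model applies to new data: `posterior_mean` can be evaluated on points that were not in the training sample.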

8 Global Coordination of Local Models. Locally simple (e.g. linear) models can be stitched together or aligned to form a global factor model of the entire data space. Appeal: manifolds often look simple locally (e.g. almost linear, almost uniform data sampling). We can often train a simple model well if it is restricted to a small part of space. Combining models has a long history in statistics as mixture modeling. Disadvantages: we need to specify what our goal is in coordination and then design new algorithms to achieve it. Approaches: decoupled, i.e. train local models and align their internal coordinates later (Teh/Roweis, Brand, Verbeek); or simultaneous, i.e. fit local models in a way that encourages their agreement (Roweis, Saul, Hinton).

9 Other Issues in Manifold Learning. Out-of-sample extensions for many dimensionality reduction methods can be achieved with interpolation techniques such as the Nyström approximation. Semi-supervised versions of many of these problems arise naturally if we are given class labels, partial observations of hidden causes, correspondence information, etc. Recent clustering algorithms have used similar techniques and addressed related problems (e.g. spectral clustering, min cut). There is ongoing exploration of the fundamental link between spectral nonlinear dimensionality reduction algorithms and kernel methods. The isolated sub-problem of estimating the underlying (co-)dimensionality of a manifold has received lots of attention. There is an increased focus on computational speedups, e.g. landmark methods, efficient iterated eigensolvers, convex programming.
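A minimal sketch of a Nyström-style out-of-sample extension, for a spectral embedding built from a kernel matrix. It makes simplifying assumptions not stated in the talk: an RBF kernel, no kernel centring, and hypothetical function names. The embedding of a new point is a kernel-weighted combination of the training eigenvectors, rescaled by the eigenvalues.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    d2 = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * (a @ b.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def fit_spectral_embedding(x_train, n_components=2, gamma=1.0):
    """Top eigenpairs of the (uncentred) training kernel matrix."""
    k = rbf_kernel(x_train, x_train, gamma)
    vals, vecs = np.linalg.eigh(k)             # ascending order
    vals = vals[::-1][:n_components]
    vecs = vecs[:, ::-1][:, :n_components]
    return vals, vecs

def embed_train(vals, vecs):
    """Training coordinates: eigenvectors scaled by sqrt(eigenvalue)."""
    return vecs * np.sqrt(vals)

def nystrom_extend(x_new, x_train, vals, vecs, gamma=1.0):
    """Nystrom formula: y_k(x) = (1 / sqrt(lambda_k)) * sum_i U_ik k(x, x_i)."""
    k_new = rbf_kernel(x_new, x_train, gamma)
    return k_new @ vecs / np.sqrt(vals)
```

A useful sanity check on the formula is that extending a training point reproduces its training coordinates, because K U = U diag(lambda).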

10 Linear Projection Methods References. Zoubin Ghahramani & Geoff Hinton, The EM algorithm for Mixtures of Factor Analyzers, U. Toronto Tech Report CRG-TR-96-1, A.J. Bell & T.J. Sejnowski, An information maximisation approach to blind separation and blind deconvolution, Neural Computation 7(6), David MacKay, Maximum Likelihood and Covariant Algorithms for ICA, unpublished, A. Hyvärinen, J. Karhunen & E. Oja, Independent Component Analysis, Wiley, Sam Roweis, EM Algorithms for PCA and SPCA, NIPS 10, M.E. Tipping & C.M. Bishop, Probabilistic principal component analysis, Journal of the Royal Statistical Society, 61(3), pp. 611, A. Basilevsky, Statistical Factor Analysis and Related Methods, Wiley, New York, I. Borg & P. Groenen, Modern Multidimensional Scaling: Theory and Applications, Springer, T.F. Cox & M.A.A. Cox, Multidimensional Scaling, Chapman and Hall, 2001.

11 Alignment of Local Models References. Michael E. Tipping & Christopher M. Bishop, Mixtures of Probabilistic Principal Component Analysers, Neural Computation 11(2), pp , Sam Roweis, Lawrence Saul & Geoff Hinton, Global Coordination of Local Linear Models, NIPS 14, pp , Yee Whye Teh & Sam T. Roweis, Automatic Alignment of Hidden Representations, NIPS 15, pp , M. Brand, Charting a manifold, NIPS 15, J. H. Ham, D. D. Lee & L. K. Saul, Learning high dimensional correspondences from low dimensional manifolds, ICML Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining, 2003. J.J. Verbeek, S.T. Roweis & N. Vlassis, Non-linear CCA and PCA by Alignment of Local Models, NIPS 16, 2004.

12 References: Neural networks and other nonparametric mappings. Geoffrey Hinton & Sam T. Roweis, Stochastic Neighbor Embedding, NIPS 15, pp , G. E. Hinton, P. Dayan & M. Revow, Modeling the manifolds of handwritten digits, IEEE Transactions on Neural Networks, N. Kambhatla & T. Leen, Dimension reduction by local principal component analysis, Neural Computation, v.9, pp , C.M. Bishop, M. Svensén & C.K.I. Williams, GTM: The Generative Topographic Mapping, Neural Computation, 10(1), pp , H. Bourlard & Y. Kamp, Auto-association by multilayer perceptrons and singular value decomposition, Biological Cybernetics, Vol. 59, pp , 1988. K.I. Diamantaras & S.Y. Kung, Principal Component Neural Networks, John Wiley, R. Durbin & D. Willshaw, An Analogue Approach to the Travelling Salesman Problem Using an Elastic Net Method, Nature, Vol. 326, pp , 1987. E. Erwin, K. Obermayer & K. Schulten, Self-organizing maps: ordering, convergence properties and energy functions, Biological Cybernetics, 67(1), pp , 1992.

13 References: Principal Curves and Projection Pursuit. T.J. Hastie & W. Stuetzle, Principal curves, Journal of the American Statistical Association v.84, pp , P. Diaconis & D. Freedman, Asymptotics of graphical projection pursuit, Annals of Statistics v. 12, pp , J.H. Friedman, W. Stuetzle & A. Schroeder, Projection pursuit density estimation, Journal of the American Statistical Association v.79, pp , J.H. Friedman & J.W. Tukey, A projection pursuit algorithm for exploratory data analysis, IEEE Transactions on Computers, C-23(9), pp , P.J. Huber, Projection pursuit, Annals of Statistics, 13(2), pp , 1985.

14 References: Eigenvector Manifold Learning Algorithms. S.T. Roweis & L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science, 290(22), pp , L. K. Saul & S. T. Roweis, Think globally, fit locally: unsupervised learning of low dimensional manifolds, Journal of Machine Learning Research, v. 4, pp , J. B. Tenenbaum, V. de Silva & J. C. Langford, A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science 290(22), pp , J.B. Tenenbaum, Mapping a Manifold of Perceptual Observations, NIPS 10, M. Belkin & P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15(6), pp , M. Belkin & P. Niyogi, Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering, NIPS 14, pp , Y. Bengio, J. Paiement & P. Vincent, Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps and spectral clustering, NIPS 16, V. de Silva & J.B. Tenenbaum, Global versus local methods in nonlinear dimensionality reduction, NIPS 15, pp , D. L. Donoho & C. E. Grimes, Hessian Eigenmaps: new locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences, v. 100 pp , H. Zha and Z. Zhang, Isometric embedding and continuum Isomap, ICML pp , K. Q. Weinberger, F. Sha & L. K. Saul, Learning a kernel matrix for nonlinear dimensionality reduction, ICML, K. Q. Weinberger & L. K. Saul, Unsupervised learning of image manifolds by semidefinite programming, CVPR, 2004.

15 Spectral Clustering References. Andrew Ng, Michael Jordan & Yair Weiss, On spectral clustering: analysis and an algorithm, NIPS 14, Marina Meila & Jianbo Shi, Learning segmentation by random walks, NIPS 12, pp , C. Fowlkes, S. Belongie, F. Chung & J. Malik, Spectral grouping using the Nyström method, IEEE Trans. Pattern Analysis and Machine Intelligence, 26(2), Jianbo Shi & Jitendra Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), pp , R. Pless & I. Simon, Embedding images in non-flat spaces, Washington U., Tech. Rep. WU-CS-01-43, F.R.K. Chung, Spectral Graph Theory, American Mathematical Society, 1997.

16 Kernel Methods References. M.A. Aizerman, E.M. Braverman & L.I. Rozonoer, Theoretical foundations of the potential function method in pattern recognition learning, Automation and Remote Control v.25, pp , 1964. J. Ham, D.D. Lee, S. Mika & B. Schölkopf, A kernel view of dimensionality reduction of manifolds, ICML, C.K.I. Williams, On a Connection between Kernel PCA and Metric Multidimensional Scaling, NIPS 13, pp , C.K.I. Williams & M. Seeger, Using the Nyström method to speed up kernel machines, NIPS 13, pp , B. Schölkopf, A. Smola & K.-R. Müller, Nonlinear Component Analysis as a Kernel Eigenvalue Problem, Neural Computation, 10(5), pp , B. Schölkopf, The kernel trick for distances, NIPS 13, pp , B. Schölkopf & A. Smola, Learning with Kernels, MIT Press, S. Mika, B. Schölkopf, A. Smola, K.-R. Müller, M. Scholz & G. Rätsch, Kernel PCA and de-noising in feature spaces, NIPS 11, 1999.

17 Useful Overviews, etc. References. Martin Law's Manifold Learning Resource Page: lawhiu/manifold/ Chris Burges' review of dimensionality reduction: cburges/tech reports/tr dimred.pdf General Mathematical/Statistical Background: B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996. R.O. Duda & P.E. Hart, Pattern Classification and Scene Analysis, John Wiley, G.H. Golub & C.F. Van Loan, Matrix Computations (3rd ed.), Johns Hopkins, 1996. R.A. Horn & C.R. Johnson, Matrix Analysis, Cambridge University Press, 1985. J.R. Magnus & H. Neudecker, Matrix Differential Calculus with Applications, Wiley, T. Hastie, R. Tibshirani & J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001.
