Unsupervised Learning Techniques 9.520 Class 07, 1 March 2006 Andrea Caponnetto

About this class. Goal: to introduce some methods for unsupervised learning, namely Gaussian Mixtures, K-Means, ISOMAP, HLLE, and Laplacian Eigenmaps.

Unsupervised learning. Only u i.i.d. samples {x_1, x_2, ..., x_u}, drawn on X from the unknown marginal distribution p(x), are available. The goal is to infer properties of this probability density. In low dimensions, many nonparametric methods allow direct estimation of p(x) itself. Owing to the curse of dimensionality, these methods fail in high dimension; one must settle for the estimation of crude global models.

Unsupervised learning (cont.) Different types of simple descriptive statistics characterize aspects of p(x):
- mixture modelling: representation of p(x) by a mixture of simple densities representing different types or classes of observations [e.g. Gaussian mixtures]
- combinatorial clustering: attempt to find multiple regions of X that contain modes of p(x) [e.g. K-Means]
- dimensionality reduction: attempt to identify low-dimensional manifolds in X that represent high data density [e.g. ISOMAP, HLLE, Laplacian Eigenmaps]
- manifold learning: attempt to determine very specific geometrical or topological invariants of p(x) [e.g. homology learning]

Limited formalization. With supervised and semi-supervised learning there is a clear measure of the effectiveness of different methods: the expected loss I[f_S] of the various estimators can be estimated on a validation set. In the context of unsupervised learning, it is difficult to find such a direct measure of success. This situation has led to a proliferation of proposed methods.

Mixture Modelling. The data are assumed to be sampled i.i.d. from some probability distribution p(x). p(x) is modelled as a mixture of component density functions, each component corresponding to a cluster or mode. The free parameters of the model are fit to the data by maximum likelihood.

Gaussian Mixtures. We first choose a parametric model P_θ for the unknown density p(x), then maximize the likelihood of our data with respect to the parameters θ. Example: a two-component Gaussian mixture model with parameters θ = (π, µ_1, Σ_1, µ_2, Σ_2). The model:
P_θ(x) = (1 − π) g_{Σ_1}(x − µ_1) + π g_{Σ_2}(x − µ_2).
Maximize the log-likelihood
ℓ(θ | {x_1, ..., x_u}) = Σ_{i=1}^u log P_θ(x_i).

The EM algorithm. Maximization of ℓ(θ | {x_1, ..., x_u}) is a difficult problem. Iterative maximization strategies, such as the EM algorithm, can be used in practice to find local maxima.
1. Expectation: compute the responsibilities
γ_i = π g_{Σ_2}(x_i − µ_2) / [(1 − π) g_{Σ_1}(x_i − µ_1) + π g_{Σ_2}(x_i − µ_2)]
2. Maximization: compute means and variances
µ_2 = Σ_i γ_i x_i / Σ_i γ_i,   Σ_2 = Σ_i γ_i (x_i − µ_2)(x_i − µ_2)^T / Σ_i γ_i,   etc.,
and the mixing probability π = (1/u) Σ_i γ_i.
3. Iterate until convergence.
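
As an illustration of the updates above, here is a minimal sketch of the two-component EM loop (Python/NumPy/SciPy assumed; the data layout as a u x D array X, the initialization scheme, the small covariance regularizer, and the fixed iteration count are illustrative choices, not part of the lecture).

import numpy as np
from scipy.stats import multivariate_normal

def em_two_gaussians(X, n_iter=100, seed=0):
    # X: u x D data matrix; returns (pi, mu1, S1, mu2, S2)
    u, D = X.shape
    rng = np.random.default_rng(seed)
    mu1, mu2 = X[rng.choice(u, 2, replace=False)]      # crude random initialization
    S1, S2 = np.eye(D), np.eye(D)
    pi = 0.5
    for _ in range(n_iter):
        # E-step: responsibilities gamma_i of component 2 for each sample
        p1 = (1 - pi) * multivariate_normal.pdf(X, mu1, S1)
        p2 = pi * multivariate_normal.pdf(X, mu2, S2)
        gamma = p2 / (p1 + p2)
        # M-step: weighted means, covariances and mixing probability
        mu1 = ((1 - gamma)[:, None] * X).sum(0) / (1 - gamma).sum()
        mu2 = (gamma[:, None] * X).sum(0) / gamma.sum()
        S1 = ((1 - gamma)[:, None] * (X - mu1)).T @ (X - mu1) / (1 - gamma).sum() + 1e-6 * np.eye(D)
        S2 = (gamma[:, None] * (X - mu2)).T @ (X - mu2) / gamma.sum() + 1e-6 * np.eye(D)
        pi = gamma.mean()
    return pi, mu1, S1, mu2, S2

In practice one would also monitor the log-likelihood Σ_i log P_θ(x_i) and stop once it no longer increases.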

Combinatorial Clustering. Algorithms in this class work on the data without any reference to an underlying probability model. The goal is to assign each data point x_i to a cluster k belonging to a predefined set {1, 2, ..., K}: C(i) = k, i = 1, 2, ..., u. The optimal encoder C*(i) minimizes the overall dissimilarity d(x_i, x_j) between points x_i, x_j assigned to the same cluster:
W(C) = (1/2) Σ_{k=1}^K Σ_{C(i)=k} Σ_{C(j)=k} d(x_i, x_j).
The simplest choice for the dissimilarity d(·, ·) is the squared Euclidean distance in X.

Combinatorial Clustering (cont.) The minimization of the within-cluster point scatter W(C) is straightforward in principle, but... the number of distinct assignments grows exponentially with the number of data points u:
S(u, K) = (1/K!) Σ_{k=1}^K (−1)^{K−k} (K choose k) k^u;
already S(19, 4) ≈ 10^{10}! In practice, clustering algorithms look for good suboptimal solutions. Most popular algorithms are based on iterative descent strategies. Convergence to local optima.

K-Means. If d(x_i, x_j) = ||x_i − x_j||^2, then, introducing the mean vector x̄_k of the k-th cluster and the number N_k of points assigned to it, the within-cluster point scatter W(C) can be rewritten as
W(C) = (1/2) Σ_{k=1}^K Σ_{C(i)=k} Σ_{C(j)=k} ||x_i − x_j||^2 = Σ_{k=1}^K N_k Σ_{C(i)=k} ||x_i − x̄_k||^2.
Exploiting this representation, one can easily verify that the optimal encoder C* is the solution of the enlarged minimization problem
min_{C, (m_1, ..., m_K)} Σ_{k=1}^K Σ_{C(i)=k} ||x_i − m_k||^2.

K-Means (cont.) K-Means attempts the minimization of the enlarged problem by an iterative alternating procedure; each of steps 1 and 2 reduces the objective function, so convergence is assured.
1. minimization with respect to (m_1, ..., m_K), giving m_k = x̄_k
2. minimization with respect to C, giving C(i) = argmin_{1 ≤ k ≤ K} ||x_i − m_k||^2
3. repeat until C does not change.
One should compare the solutions obtained from different initial random means, and choose the best local minimum.
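
A minimal sketch of this alternating procedure with random restarts (Python/NumPy assumed; the number of restarts and the iteration cap are illustrative choices).

import numpy as np

def k_means(X, K, n_init=10, max_iter=100, seed=0):
    # X: u x D data matrix; returns the best assignment C and its objective value
    rng = np.random.default_rng(seed)
    best_C, best_obj = None, np.inf
    for _ in range(n_init):                                   # several random initializations
        m = X[rng.choice(len(X), K, replace=False)].astype(float)   # initial random means
        C = None
        for _ in range(max_iter):
            # step 2: assign each point to its nearest mean
            d2 = ((X[:, None, :] - m[None, :, :]) ** 2).sum(-1)
            C_new = d2.argmin(axis=1)
            if C is not None and np.array_equal(C, C_new):
                break                                         # step 3: stop when C does not change
            C = C_new
            # step 1: recompute each mean as its cluster average
            for k in range(K):
                if np.any(C == k):
                    m[k] = X[C == k].mean(axis=0)
        obj = ((X - m[C]) ** 2).sum()                         # enlarged objective
        if obj < best_obj:
            best_obj, best_C = obj, C.copy()
    return best_C, best_obj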

Voronoi tessellation

Dimensionality reduction. Often reducing the dimensionality of a problem is an effective preliminary step toward the actual solution of a regression or classification problem. We look for a mapping Φ from the high-dimensional space IR^D to the low-dimensional space IR^d which preserves some relevant geometrical structure of our problem.

Dimensionality reduction

Principal Component Analysis (PCA). We try to approximate the data {x_1, ..., x_u} in IR^D by a d-dimensional hyperplane
H = {c + Vθ : θ ∈ IR^d},
with c a vector in IR^D, θ a coordinate vector in IR^d, and V = (v_1, ..., v_d) a D × d matrix whose columns {v_i} form an orthonormal system of vectors in IR^D. Problem: find the H* which minimizes the sum of squared distances of the data points x_i from H,
H* = argmin_H Σ_{i=1}^u ||x_i − P_H(x_i)||^2.

Linear approximation

PCA: the algorithm.
1. center the data points: Σ_{i=1}^u x_i = 0
2. define the u × D matrix X = (x_1, ..., x_u)^T
3. construct the singular value decomposition X = U Σ W^T, with:
   W = (w_1, ..., w_D) the D × D matrix of right singular vectors of X,
   U = (u_1, ..., u_D) the u × D matrix of left singular vectors of X,
   Σ = diag(σ_1, ..., σ_D) the D × D diagonal matrix of singular values of X, with σ_1 ≥ σ_2 ≥ ... ≥ σ_D ≥ 0
4. answer: V = (w_1, ..., w_d)
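
A minimal sketch of the four steps above (Python/NumPy assumed; returning the low-dimensional coordinates alongside V is an illustrative convenience).

import numpy as np

def pca(X, d):
    # X: u x D data matrix; returns the D x d matrix V and the coordinates theta_i = V^T x_i
    Xc = X - X.mean(axis=0)                              # 1.-2. center the rows of X
    U, s, Wt = np.linalg.svd(Xc, full_matrices=False)    # 3. SVD: Xc = U diag(s) W^T
    V = Wt[:d].T                                         # 4. first d right singular vectors
    return V, Xc @ V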

Sketch of proof. Rewrite the minimization problem as
min_{c, V, {θ_i}} Σ_{i=1}^u ||x_i − c − V θ_i||^2.
Centering and minimizing with respect to c and θ_i gives c = 0, θ_i = V^T x_i. Plugging into the minimization problem,
argmin_V Σ_{i=1}^u ||x_i − V V^T x_i||^2 = argmax_V Σ_{i=1}^u x_i^T V V^T x_i = argmax_V Σ_{j=1}^d v_j^T X^T X v_j,
hence (v_1, ..., v_d) are the first d eigenvectors of X^T X: (w_1, ..., w_d).

Mercer's Theorem. Consider a pd kernel K(x, x′) on X × X, and a probability distribution p(x) on X. Define the integral operator L_K by
(L_K f)(x) = ∫_X K(x, x′) f(x′) dp(x′).
Mercer's Theorem states that
K(x, x′) = Σ_i λ_i φ_i(x) φ_i(x′),
where (λ_i, φ_i)_i is the eigensystem of L_K.

Feature Map. From Mercer's Theorem, the mapping Φ defined over X by
Φ(x) = (√λ_1 φ_1(x), √λ_2 φ_2(x), ...)
is such that K(x, x′) = Φ(x)^T Φ(x′). K(x, x′) can be interpreted as the dot product in the feature space. Conversely, given a mapping of X into a Euclidean space, we can construct a pd kernel on X × X.

Kernelization. Algorithms that depend on the data only through the dot products x_i^T x_j can be easily kernelized:
1. choose a pd kernel K(·, ·)
2. replace x_i^T x_j with K(x_i, x_j).
Example: PCA can be kernelized by computing the eigenvectors of the matrix M_ij = K(x_i, x_j) instead of those of the matrix X^T X.
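
A minimal sketch of kernelized PCA following this recipe (Python/NumPy assumed). The centering of the kernel matrix is a standard extra step of kernel PCA that the slide does not spell out, and the Gaussian kernel with bandwidth sigma is only an example.

import numpy as np

def kernel_pca(X, d, kernel):
    # X: u x D data matrix; returns the d-dimensional embedding of the training points
    u = len(X)
    M = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # M_ij = K(x_i, x_j)
    H = np.eye(u) - np.ones((u, u)) / u
    Mc = H @ M @ H                                             # centering in feature space
    lam, vecs = np.linalg.eigh(Mc)
    lam, vecs = lam[::-1], vecs[:, ::-1]                       # sort by decreasing eigenvalue
    return vecs[:, :d] * np.sqrt(np.clip(lam[:d], 0, None))

# example kernel (hypothetical bandwidth sigma)
gaussian = lambda x, y, sigma=1.0: np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))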

ISOMAP. Assumption: the support of the marginal distribution p(x) is a convex region of IR^d (our manifold M) isometrically embedded in IR^D. Goal: construct a map Φ : IR^D → IR^d which transforms geodesic distances in M into Euclidean distances in IR^d. Construction 1: approximate the matrix d_M of pairwise geodesic distances between data points by estimating the shortest distances d_ij over the neighborhood graph. [Tenenbaum et al., '00]

Construction 2: compute the u × u kernel matrix
K = −(1/2) H D H,   H = I − (1/u) 1 1^T,
with 1 the u-dimensional column vector (1, 1, ..., 1), and D the matrix of squared distances, that is, D_ij = d_ij^2. Result: let (λ_a, u_a)_{a=1}^u be the eigensystem of K. The embedding Φ of {x_i}_{i=1}^u,
Φ(x_i) = (√λ_1 (u_1)_i, √λ_2 (u_2)_i, ..., √λ_d (u_d)_i),
is the isometry we were looking for.
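
A minimal sketch of both constructions (Python assumed, with scikit-learn/SciPy helpers for the neighborhood graph and shortest paths; the value of k is an illustrative choice and the neighborhood graph is assumed to be connected).

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, d, k=10):
    # X: u x D data matrix; returns the u x d embedding Phi(x_i) as rows
    u = len(X)
    # Construction 1: shortest-path distances d_ij over the k-nearest-neighbor graph
    G = kneighbors_graph(X, n_neighbors=k, mode='distance')
    dists = shortest_path(G, method='D', directed=False)
    # Construction 2: double-centered kernel matrix K = -1/2 H D H, with D_ij = d_ij^2
    H = np.eye(u) - np.ones((u, u)) / u
    K = -0.5 * H @ (dists ** 2) @ H
    lam, U = np.linalg.eigh(K)
    lam, U = lam[::-1][:d], U[:, ::-1][:, :d]                 # top d eigenpairs of K
    return U * np.sqrt(np.clip(lam, 0, None))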

ISOMAP global isometry

Explaining ISOMAP. Firstly, we have to verify that the matrix K is a genuine pd kernel on the data points.
1. Symmetry: since both H and D are symmetric, K = −(1/2) H^T D H, hence K^T = K.
2. Positivity: note that, by assumption, there exist vectors {φ_i}_{i=1}^u such that d_ij = ||φ_i − φ_j||, so D_ij = ||φ_i − φ_j||^2. For all c = (c_1, ..., c_u), defining c̃ = c − (1/u)(Σ_{i=1}^u c_i) 1 (so that Hc = c̃ and Σ_i c̃_i = 0), we get
c^T K c = −(1/2) (Hc)^T D (Hc) = −(1/2) c̃^T D c̃ = −(1/2) Σ_{ij} c̃_i (φ_i^T φ_i + φ_j^T φ_j − 2 φ_i^T φ_j) c̃_j = (Σ_i c̃_i φ_i)^T (Σ_i c̃_i φ_i) ≥ 0.

Explaining ISOMAP (cont.) We must prove that the pd kernel K_ij induces the correct pairwise distances d_ij between data points:
d_ij^2 = K_ii + K_jj − 2 K_ij.
This can be verified by direct computation. By Mercer's Theorem, the feature map
Φ_0(x_i) = (√λ_1 (u_1)_i, √λ_2 (u_2)_i, ..., √λ_u (u_u)_i)
is an isometry. If the manifold M is d-dimensional, λ_a = 0 for a > d, and we can use the truncated mapping Φ.

Hessian Locally Linear Embedding (HLLE). ISOMAP outputs an embedding of the data points {x_i}_{i=1}^u into IR^d, attempting to preserve pairwise distances on the underlying manifold M. The method gives guarantees of convergence if M is isometric to a convex region in IR^d. Convexity is a very strong hypothesis: typically, linear combinations of images are not reasonable images! HLLE gives guarantees of convergence while relaxing the convexity hypothesis. [Hessian Eigenmaps; Donoho, Grimes '03]

HLLE local isometry

Core idea of HLLE. For every point x ∈ M and system of coordinates (ξ_1, ..., ξ_d) on its tangent space, the Hessian at x of a function f : M → IR is the matrix of second derivatives
(H_f(x))_ij = ∂^2 f(x) / ∂ξ_i ∂ξ_j,   i, j = 1, ..., d.
The core idea of HLLE is that the null space of the quadratic form
H(f) = ∫_M Σ_{ij} (H_f(x))_ij^2
is independent of the choice of local coordinates ξ_i. Up to constant functions, the null space of H is the d-dimensional linear space spanned by the global Cartesian coordinates.

Computing the Hessian. In order to implement this idea, HLLE has to evaluate the quadratic form H using the data points x_i:
1. construct proxies for the tangent spaces using the k-nearest-neighbor graph
2. implement a finite-difference scheme to evaluate the second derivatives
3. compute the eigensystem of the resulting approximation of H, and use the d eigenvectors with smallest eigenvalues as embedding coordinates.
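
Rather than re-implementing the tangent-space and finite-difference estimation, one can rely on an off-the-shelf routine: scikit-learn's LocallyLinearEmbedding with method='hessian' follows the Donoho-Grimes recipe. A minimal usage sketch (the data, the neighborhood size, and the target dimension below are placeholders).

import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.rand(500, 3)                    # placeholder: points near some 2-D surface in IR^3
hlle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, method='hessian')
Y = hlle.fit_transform(X)                     # rows of Y are the HLLE embedding coordinates

Note that the Hessian variant requires n_neighbors > n_components * (n_components + 3) / 2, so that the local Hessian estimator is well posed.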

Local Linear Neighborhood

Laplacian-based methods. Unsupervised methods based on the eigensystem of the Laplacian L = D − W of the neighborhood graph with weights W_ij, where D = diag(D_11, ..., D_uu) is the degree matrix, D_ii = Σ_j W_ij.
Dimensionality Reduction: consider the solutions of the generalized eigenvector problem
L f_a = λ_a D f_a   (0 = λ_0 ≤ λ_1 ≤ ... ≤ λ_{u−1}).
The considered embedding into the d-dimensional Euclidean space is Φ(x_i) = ((f_1)_i, ..., (f_d)_i).
Spectral Clustering: use the sign of the components (f_1)_j to define two clusters; connection to the min-cut problem. [Belkin, Niyogi, '02]
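
A minimal sketch of Laplacian eigenmaps and the two-cluster spectral split described above (Python assumed; the heat-kernel weights on a symmetrized k-nearest-neighbor graph and the values of k and sigma are assumptions, since the slide leaves the choice of weights W_ij open).

import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

def laplacian_eigenmaps(X, d, k=10, sigma=1.0):
    # X: u x D data matrix; returns the u x d embedding and a two-way spectral clustering
    A = kneighbors_graph(X, n_neighbors=k, mode='distance').toarray()
    A = np.maximum(A, A.T)                                         # symmetrize the k-NN graph
    W = np.where(A > 0, np.exp(-A ** 2 / (2 * sigma ** 2)), 0.0)   # heat-kernel weights W_ij
    D = np.diag(W.sum(axis=1))                                     # degree matrix
    L = D - W                                                      # graph Laplacian
    lam, F = eigh(L, D)                      # generalized problem L f = lambda D f, ascending eigenvalues
    embedding = F[:, 1:d + 1]                # drop the trivial constant eigenvector f_0
    two_clusters = (F[:, 1] > 0).astype(int) # clusters from the sign of f_1
    return embedding, two_clusters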