
Advanced Machine Learning & Perception Instructor: Tony Jebara

Topic 2: Nonlinear Manifold Learning
- Multidimensional Scaling (MDS)
- Locally Linear Embedding (LLE)
- Beyond Principal Components Analysis (PCA)
- Kernel PCA (KPCA)
- Semidefinite Embedding (SDE)
- Minimum Volume Embedding (MVE)

Principal Components Analysis
Encode data on a linear (flat) manifold as steps along its axes:
$$y_j = \mu + \sum_{i=1}^{C} c_{ij} v_i$$
The best choice of $\mu$, $c$ and $v$ is least squares, or equivalently maximum Gaussian likelihood:
$$\mathrm{error} = \sum_{j=1}^{N} \left\| x_j - y_j \right\|^2 = \sum_{j=1}^{N} \Big\| x_j - \mu - \sum_{i=1}^{C} c_{ij} v_i \Big\|^2$$
Take derivatives of the error over $\mu$, $c$ and $v$ and set them to zero:
$$\mu = \frac{1}{N}\sum_{j=1}^{N} x_j, \qquad v = \mathrm{eig}\Big( \frac{1}{N}\sum_{j=1}^{N} (x_j - \mu)(x_j - \mu)^T \Big), \qquad c_{ij} = (x_j - \mu)^T v_i
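The least-squares solution above can be sketched in a few lines of numpy (not part of the original slides; the function name `pca` and the argument `C` are illustrative):

```python
import numpy as np

def pca(X, C):
    """Fit PCA to data X of shape (N, D); keep C components."""
    mu = X.mean(axis=0)                    # mu = (1/N) sum_j x_j
    Xc = X - mu                            # center the data
    cov = Xc.T @ Xc / X.shape[0]           # sample covariance
    evals, evecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    V = evecs[:, ::-1][:, :C]              # top-C eigenvectors v_i
    coeffs = Xc @ V                        # c_ij = (x_j - mu)^T v_i
    return mu, V, coeffs

# Reconstruction: y_j = mu + sum_i c_ij v_i, i.e. X_hat = mu + coeffs @ V.T
```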

Manifold Learning & Embedding
Data is often not Gaussian and not in a linear subspace. Consider an image of a face being translated from left to right, $x_t = T_t\, x_0$: this is nonlinear! How do we capture the true coordinates of the data on the manifold or embedding space and represent them compactly? Unlike PCA, embedding does not try to reconstruct the data; it just finds better, more compact coordinates on the manifold. For example, instead of the pixel-intensity image $x(x,y)$, find a measure $t$ of how far the face has translated.

Multidimensional Scaling
Idea: find a low-dimensional embedding that mimics only the distances between the points X in the original space. Construct another set of low-dimensional (say 2D) points with coordinates Y that maintain the pairwise distances.
A dissimilarity $d(x_i, x_j)$ is a function of two inputs such that
$$d(x_i, x_j) \geq 0, \qquad d(x_i, x_i) = 0, \qquad d(x_i, x_j) = d(x_j, x_i)$$
A distance metric is stricter: it also satisfies the triangle inequality
$$d(x_i, x_k) \leq d(x_i, x_j) + d(x_j, x_k)$$
Standard example: the Euclidean $\ell_2$ metric $d(x_i, x_j) = \|x_i - x_j\|_2$.
Assume that for N objects we compute a dissimilarity matrix $\Delta$ which tells us how far apart they are: $\Delta_{ij} = d(x_i, x_j)$.

Multidimensional Scaling
Given the dissimilarities $\Delta$ between the original X points under the original metric $d(\cdot)$, find Y points with dissimilarities D under another metric $d'(\cdot)$ such that D is similar to $\Delta$:
$$\Delta_{ij} = d(x_i, x_j), \qquad D_{ij} = d'(y_i, y_j)$$
We want to find Y's that minimize some measure of the difference between D and $\Delta$, e.g.:
Least-squares Stress: $\mathrm{Stress}(y_1, \dots, y_N) = \sum_{ij} \big(D_{ij} - \Delta_{ij}\big)^2$
Invariant Stress: $\mathrm{InvStress} = \mathrm{Stress}(Y) \big/ \sum_{i<j} D_{ij}^2$
Sammon Mapping: $\sum_{ij} \frac{1}{\Delta_{ij}} \big(D_{ij} - \Delta_{ij}\big)^2$
Strain: $\mathrm{trace}\big( J(\Delta^2 - D^2)\, J\, (\Delta^2 - D^2) \big)$ where $J = I - \frac{1}{N}\mathbf{1}\mathbf{1}^T$
Some of these are global, some are local; they are typically minimized by gradient descent.
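A minimal gradient-descent sketch for the least-squares stress, assuming a fixed learning rate and step count (none of this code comes from the slides):

```python
import numpy as np

def mds_stress(Delta, dim=2, steps=2000, lr=1e-3, seed=0):
    """Minimize Stress(Y) = sum_ij (D_ij - Delta_ij)^2 by gradient descent.
    Delta: (N, N) dissimilarity matrix. Returns Y of shape (N, dim)."""
    rng = np.random.default_rng(seed)
    N = Delta.shape[0]
    Y = rng.normal(scale=1e-2, size=(N, dim))
    for _ in range(steps):
        diff = Y[:, None, :] - Y[None, :, :]       # y_i - y_j for all pairs
        D = np.sqrt((diff ** 2).sum(-1) + 1e-12)   # embedded distances D_ij
        G = (D - Delta) / D                        # (D_ij - Delta_ij) / D_ij
        np.fill_diagonal(G, 0.0)
        grad = 4.0 * (G[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad                             # gradient step on the Y coordinates
    return Y
```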

MDS Example: 3D to 2D
We have distances from cities to cities; these lie on the surface of a sphere (the Earth) in 3D space. The reconstructed 2D points on a plane capture the essential properties (what happens near the poles?).

MDS Example: Multi-D to 2D
A more elaborate example: we have a correlation matrix between crimes, which is of arbitrary dimensionality. Hack: convert the correlations to dissimilarities and show the reconstructed Y.

Locally Linear Embedding
Instead of trying to preserve ALL pairwise distances, only preserve SOME of them, namely the distances across nearby points. This lets us unwrap the manifold! Also, distances are only locally valid: the Euclidean distance is only similar to the geodesic distance at small scales. How do we pick which distances? Find the k nearest neighbors of each point and only preserve those.

LLE with K-Nearest Neighbors
Start with unconnected points. Compute all pairwise distances $A_{ij} = d(x_i, x_j)$. Connect each point to its k closest points: $B_{ij} = 1$ if $A_{ij}$ is among the k smallest entries of row $A_{i*}$. Then symmetrize the connectivity matrix: $B_{ij} = \max(B_{ij}, B_{ji})$.
Example connectivity matrix for six points $x_1, \dots, x_6$:
     x1 x2 x3 x4 x5 x6
x1    0  0  1  1  0  0
x2    0  0  0  1  1  0
x3    1  0  0  0  0  1
x4    1  1  0  0  0  0
x5    0  1  0  1  0  0
x6    0  0  1  1  0  0
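A short numpy sketch of the neighborhood construction above (the function name `knn_graph` is illustrative, not from the slides):

```python
import numpy as np

def knn_graph(X, k):
    """Symmetrized k-nearest-neighbor connectivity matrix B for data X of shape (N, D)."""
    diff = X[:, None, :] - X[None, :, :]
    A = np.sqrt((diff ** 2).sum(-1))       # pairwise distances A_ij = d(x_i, x_j)
    np.fill_diagonal(A, np.inf)            # exclude self-connections
    N = X.shape[0]
    B = np.zeros((N, N), dtype=int)
    for i in range(N):
        B[i, np.argsort(A[i])[:k]] = 1     # connect x_i to its k closest points
    return np.maximum(B, B.T)              # symmetrize: B_ij = max(B_ij, B_ji)
```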

LLE
Instead of distances, look at the neighborhood of each point, and preserve the reconstruction of each point from its neighbors in the low-dimensional space. Find the K nearest neighbors of each point, then describe each neighborhood by the best weights on the neighbors for reconstructing the point:
$$\varepsilon(W) = \sum_i \Big\| x_i - \sum_j W_{ij} x_j \Big\|^2 \quad \text{subject to} \quad \sum_j W_{ij} = 1 \;\; \forall i$$
Then find the best low-dimensional vectors that still have the same weights:
$$\Phi(Y) = \sum_i \Big\| y_i - \sum_j W_{ij} y_j \Big\|^2 \quad \text{subject to} \quad \sum_{i=1}^N y_i = 0, \quad \sum_{i=1}^N y_i y_i^T = I$$

LLE
Finding the W's (a convex combination of weights on the neighbors):
$$\varepsilon(W) = \sum_i \varepsilon_i(W_i) \quad \text{where} \quad \varepsilon_i(W_i) = \Big\| x_i - \sum_j W_{ij} x_j \Big\|^2$$
Using $\sum_j W_{ij} = 1$,
$$\varepsilon_i(W_i) = \Big\| \sum_j W_{ij}\,(x_i - x_j) \Big\|^2 = \sum_{jk} W_{ij} W_{ik}\, (x_i - x_j)^T (x_i - x_k) = \sum_{jk} W_{ij} W_{ik}\, C_{jk}$$
$$W_i^* = \arg\min_w \; \tfrac{1}{2} w^T C w - \lambda\,(w^T \mathbf{1} - 1)$$
1) Take the derivative and set it to zero. 2) Solve the linear system $C w = \lambda \mathbf{1}$, i.e. $w = \lambda\, C^{-1}\mathbf{1}$. 3) Find $\lambda$ from the constraint $w^T \mathbf{1} = 1$. 4) Find $w$.
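A sketch of the weight solve, following the steps above; the small regularizer `reg` is an addition not in the slides (useful when the local Gram matrix C is singular, e.g. if k exceeds the input dimension):

```python
import numpy as np

def lle_weights(X, B, reg=1e-3):
    """Reconstruction weights W. X: (N, D) data, B: (N, N) 0/1 neighbor matrix."""
    N = X.shape[0]
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.flatnonzero(B[i])
        Z = X[nbrs] - X[i]                               # neighbors shifted to x_i
        C = Z @ Z.T                                      # C_jk = (x_i - x_j)^T (x_i - x_k)
        C += reg * np.trace(C) * np.eye(len(nbrs))       # regularize a singular C
        w = np.linalg.solve(C, np.ones(len(nbrs)))       # solve C w = lambda 1 (up to scale)
        W[i, nbrs] = w / w.sum()                         # rescale so sum_j W_ij = 1
    return W
```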

LLE
Finding the Y's (new low-dimensional points that agree with the W's):
$$\Phi(Y) = \sum_i \Big\| y_i - \sum_j W_{ij} y_j \Big\|^2 \quad \text{subject to} \quad \sum_{i=1}^N y_i = 0, \quad \sum_{i=1}^N y_i y_i^T = I$$
Expanding the quadratic,
$$\Phi(Y) = \sum_i \Big( y_i^T y_i - \sum_k W_{ik}\, y_i^T y_k - \sum_j W_{ij}\, y_j^T y_i + \sum_{jk} W_{ij} W_{ik}\, y_j^T y_k \Big) = \sum_{jk} M_{jk}\, y_j^T y_k = \mathrm{tr}\big(M Y Y^T\big)$$
where $M_{jk} = \delta_{jk} - W_{jk} - W_{kj} + \sum_i W_{ij} W_{ik}$ and Y is a matrix whose rows are the y vectors. To minimize this subject to the constraints, set Y to the bottom d+1 eigenvectors of M (discarding the constant eigenvector).
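A matching sketch of the eigenvector step; it uses the equivalent form $M = (I - W)^T (I - W)$, which expands to the $M_{jk}$ above (again illustrative code, not from the slides):

```python
import numpy as np

def lle_embed(W, d):
    """Low-dimensional coordinates Y (one row per point) from the LLE weights W."""
    N = W.shape[0]
    M = (np.eye(N) - W).T @ (np.eye(N) - W)   # M_jk = delta_jk - W_jk - W_kj + sum_i W_ij W_ik
    evals, evecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    return evecs[:, 1:d + 1]                  # bottom d+1 eigenvectors, dropping the constant one
```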

LLE Results
Synthetic data on an S-shaped manifold: we have noisy 3D samples X and obtain a 2D LLE embedding Y. Real data on face images: each x is an image that has been rasterized into a vector; the dots are the reconstructed two-dimensional Y points.

LLE Results Top=PCA Bottom=LLE

Kernel Principal Components Analysis
Recall that PCA approximates the data with eigenvectors, a mean and coefficients:
$$x_i \approx \mu + \sum_{j=1}^{C} c_{ij} v_j$$
Get the eigenvectors that best approximate the covariance, $\Sigma = V \Lambda V^T$, e.g.
$$\begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{12} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{13} & \Sigma_{23} & \Sigma_{33} \end{pmatrix} = \begin{pmatrix} v_1 & v_2 & v_3 \end{pmatrix} \begin{pmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{pmatrix} \begin{pmatrix} v_1 & v_2 & v_3 \end{pmatrix}^T$$
The eigenvectors are orthonormal, $v_i^T v_j = \delta_{ij}$. In the coordinates of v the Gaussian is diagonal with covariance $\Lambda$; higher eigenvalues mean higher variance, so use those directions first. To compute the coefficients: $c_{ij} = (x_i - \mu)^T v_j$.
How to extend PCA to make it nonlinear? Kernels!

Kernel PCA
Idea: replace the dot products in PCA with kernel evaluations. Recall that (for zero-mean data) we could do PCA on the DxD covariance matrix
$$C = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T, \qquad \text{with eigenvalues and eigenvectors satisfying } \lambda v = C v,$$
or on the NxN Gram matrix of the data, $K_{ij} = x_i^T x_j$. For nonlinearity, do PCA on feature expansions:
$$\bar{C} = \frac{1}{N} \sum_{i=1}^{N} \phi(x_i)\,\phi(x_i)^T$$
Instead of doing the feature expansion explicitly, use a kernel, e.g. a d-th order polynomial $K_{ij} = k(x_i, x_j) = \phi(x_i)^T \phi(x_j) = (x_i^T x_j)^d$. As usual, the kernel must satisfy Mercer's theorem. Assume, for simplicity, that all feature data is zero-mean: $\sum_{i=1}^{N} \phi(x_i) = 0$.
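As a concrete check (not from the slides), the d = 2 homogeneous polynomial kernel on 2-dimensional inputs corresponds to the explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

def poly_kernel(X, Y, d=2):
    """K_ij = (x_i^T y_j)^d, the homogeneous polynomial kernel."""
    return (X @ Y.T) ** d

def phi2(x):
    """Explicit feature map for d = 2 in two dimensions."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x[None], y[None])[0, 0])   # (x^T y)^2 = 1.0
print(phi2(x) @ phi2(y))                     # same value via the explicit expansion
```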

Kernel PCA
Efficiently find and use the eigenvectors of $\bar{C}$: $\lambda v = \bar{C} v$.
Dot either side of this equation with a feature vector:
$$\lambda\, \phi(x_i)^T v = \phi(x_i)^T \bar{C} v$$
The eigenvectors lie in the span of the feature vectors: $v = \sum_{i=1}^{N} \alpha_i\, \phi(x_i)$. Combining the equations:
$$\lambda\, \phi(x_i)^T \sum_{j=1}^{N} \alpha_j \phi(x_j) = \phi(x_i)^T \Big\{ \frac{1}{N} \sum_{k=1}^{N} \phi(x_k)\phi(x_k)^T \Big\} \sum_{j=1}^{N} \alpha_j \phi(x_j)$$
$$\lambda \sum_{j=1}^{N} \alpha_j K_{ij} = \frac{1}{N} \sum_{k=1}^{N} K_{ik} \sum_{j=1}^{N} \alpha_j K_{kj}$$
$$\lambda K \alpha = \frac{1}{N} K^2 \alpha \quad \Longrightarrow \quad N\lambda\, \alpha = K \alpha$$

Kernel PCA
From before we had $\lambda\, \phi(x_i)^T v = \phi(x_i)^T \bar{C} v$; this is an eigenvalue equation, $N\lambda\, \alpha = K \alpha$.
Get the eigenvectors $\alpha$ and eigenvalues of K; the eigenvalues of K are N times $\lambda$. For each eigenvector $\alpha^k$ there is an eigenvector $v^k$. We want the eigenvectors v to be normalized, $(v^k)^T v^k = 1$:
$$1 = (v^k)^T v^k = \Big( \sum_{i=1}^{N} \alpha_i^k \phi(x_i) \Big)^T \Big( \sum_{j=1}^{N} \alpha_j^k \phi(x_j) \Big) = (\alpha^k)^T K \alpha^k = N\lambda_k\, (\alpha^k)^T \alpha^k \quad \Longrightarrow \quad (\alpha^k)^T \alpha^k = \frac{1}{N\lambda_k}$$
We can now use the alphas alone for doing PCA projection and reconstruction!

Kernel PCA
To compute the k-th projection coefficient of a new point $\phi(x)$:
$$c_k = \phi(x)^T v^k = \phi(x)^T \sum_{i=1}^{N} \alpha_i^k \phi(x_i) = \sum_{i=1}^{N} \alpha_i^k\, k(x, x_i)$$
Reconstruction*:
$$\hat{\phi}(x) = \sum_{k=1}^{K} c_k v^k = \sum_{k=1}^{K} \sum_{j=1}^{N} c_k\, \alpha_j^k\, \phi(x_j)$$
*Pre-image problem: a linear combination in Hilbert space can go outside the image of $\phi$.
We can now do nonlinear PCA, and do PCA on non-vector data. Nonlinear KPCA eigenvectors satisfy the same properties as usual PCA, but in Hilbert space. These eigenvectors: 1) the top q have maximum variance; 2) reconstruction with the top q has minimum mean squared error; 3) they are uncorrelated/orthogonal; 4) the top ones have maximum mutual information with the inputs.
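A minimal numpy sketch of projection with the alphas, assuming K has already been centered (next slide) and that the top q eigenvalues are positive; the helper names are illustrative, not from the slides:

```python
import numpy as np

def kpca_fit(K, q):
    """Alphas for the top q kernel PCA components of an N x N (centered) kernel matrix K."""
    evals, evecs = np.linalg.eigh(K)               # eigenvalues of K are N * lambda
    evals, evecs = evals[::-1][:q], evecs[:, ::-1][:, :q]
    return evecs / np.sqrt(evals)                  # so that (alpha^k)^T alpha^k = 1 / (N lambda_k)

def kpca_project(k_new, alphas):
    """c_k = sum_i alpha_i^k k(x, x_i); k_new holds the kernel values against the training set."""
    return k_new @ alphas
```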

Centering Kernel PCA
So far we assumed the feature data was zero-mean: $\sum_{i=1}^{N} \phi(x_i) = 0$. What we actually want is the centered feature map
$$\tilde{\phi}(x_j) = \phi(x_j) - \frac{1}{N} \sum_{i=1}^{N} \phi(x_i)$$
How to do this without touching feature space? Use kernels:
$$\tilde{K}_{ij} = \tilde{\phi}(x_i)^T \tilde{\phi}(x_j) = \Big( \phi(x_i) - \frac{1}{N}\sum_{k=1}^{N} \phi(x_k) \Big)^T \Big( \phi(x_j) - \frac{1}{N}\sum_{k=1}^{N} \phi(x_k) \Big) = K_{ij} - \frac{1}{N}\sum_{k=1}^{N} K_{kj} - \frac{1}{N}\sum_{k=1}^{N} K_{ik} + \frac{1}{N^2}\sum_{k=1}^{N}\sum_{l=1}^{N} K_{kl}$$
So we can get the alpha eigenvectors from $\tilde{K}$, obtained by adjusting the old K.
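In matrix form the adjustment is $\tilde{K} = K - \mathbf{1}_N K - K \mathbf{1}_N + \mathbf{1}_N K \mathbf{1}_N$, where $\mathbf{1}_N$ is the matrix with all entries $\frac{1}{N}$; a short sketch (not from the slides):

```python
import numpy as np

def center_kernel(K):
    """Centered kernel: subtract row means and column means, add back the grand mean."""
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    return K - one @ K - K @ one + one @ K @ one
```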

Kernel PCA Results
KPCA on a 2D dataset. Left to right: the polynomial kernel order goes from 1 to 3, where order 1 = linear = PCA. Top to bottom: from the top eigenvector to weaker eigenvectors.

Kernel PCA Results
Use the coefficients of the KPCA to train a linear SVM classifier to recognize chairs from their images, using various polynomial kernel degrees, where degree 1 = linear, as in regular PCA.

Kernel PCA Results
Use the coefficients of the KPCA to train a linear SVM classifier to recognize characters from their images, using various polynomial kernel degrees, where degree 1 = linear, as in regular PCA (the worst case in these experiments). Performance is inferior to nonlinear SVMs (why?).

Semidefinite Embedding
Also known as Maximum Variance Unfolding. Similar to LLE and kernel PCA: like LLE, it maintains only the distances within each neighborhood. Stretch all the data while maintaining those distances, then apply PCA (or KPCA).

Semidefinite Embedding
To visualize high-dimensional data $\{x_1, \dots, x_N\}$:
PCA and Kernel PCA (Schölkopf et al.):
- Get a matrix A of affinities between pairs, $A_{ij} = k(x_i, x_j)$
- SVD A and view the top projections
Semidefinite Embedding (Weinberger & Saul):
- Get the k-nearest-neighbors graph of the data
- Get the matrix A
- Use a max-trace SDP to stretch the graph A into a positive-definite graph K
- SVD K and view the top projections

Semidefinite Embedding
SDE unfolds (pulls apart) the knn-connected graph C but preserves the pairwise distances where $C_{ij} = 1$:
$$\max_{K \in \kappa} \sum_i \lambda_i \qquad \kappa = \Big\{ K \in \mathbb{R}^{N \times N} : \; K \succeq 0, \;\; \sum_{ij} K_{ij} = 0, \;\; K_{ii} + K_{jj} - K_{ij} - K_{ji} = A_{ii} + A_{jj} - A_{ij} - A_{ji} \;\text{ if } C_{ij} = 1 \Big\}$$
SDE's stretching of the graph improves the visualization.
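A sketch of this SDP using cvxpy as a stand-in for the YALMIP/Matlab setup mentioned below (the slides do not prescribe this code; for symmetric K the distance constraint uses $2K_{ij}$):

```python
import numpy as np
import cvxpy as cp

def sde(A, C):
    """Max-trace SDP for SDE. A: affinity/Gram matrix, C: 0/1 knn connectivity matrix."""
    N = A.shape[0]
    K = cp.Variable((N, N), PSD=True)                  # K is positive semidefinite
    cons = [cp.sum(K) == 0]                            # centering: sum_ij K_ij = 0
    for i in range(N):
        for j in range(i + 1, N):
            if C[i, j] == 1:                           # preserve connected pairwise distances
                cons.append(K[i, i] + K[j, j] - 2 * K[i, j]
                            == A[i, i] + A[j, j] - 2 * A[i, j])
    cp.Problem(cp.Maximize(cp.trace(K)), cons).solve()
    evals, evecs = np.linalg.eigh(K.value)             # then eig/SVD K and view top projections
    return evecs[:, ::-1] * np.sqrt(np.maximum(evals[::-1], 0))
```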

SDE Optimization with YALMIP
Hierarchy of problem classes, from most restricted to most general, all with polynomial-time algorithms:
LP < QP < QCQP < SDP < CP
(Linear Programming < Quadratic Programming < Quadratically Constrained Quadratic Programming < Semidefinite Programming < Convex Programming)

SDE Optimization with YALMIP
LP (fast): $\min_x\, b^T x \;\;\text{s.t.}\;\; c_i^T x \geq \alpha_i \;\forall i$
QP: $\min_x\, \tfrac{1}{2} x^T H x + b^T x \;\;\text{s.t.}\;\; c_i^T x \geq \alpha_i \;\forall i$
QCQP: $\min_x\, \tfrac{1}{2} x^T H x + b^T x \;\;\text{s.t.}\;\; c_i^T x \geq \alpha_i \;\forall i, \;\; x^T x \leq \eta$
SDP: $\min_K\, \mathrm{tr}(BK) \;\;\text{s.t.}\;\; \mathrm{tr}(C_i^T K) \geq \alpha_i \;\forall i, \;\; K \succeq 0$
CP (slow, roughly $O(N^3)$): $\min_x\, f(x) \;\;\text{s.t.}\;\; g(x) \leq \alpha$
All of the above are handled by the YALMIP package for Matlab: Google it, download and install!

SDE Results
SDE unfolds (pulls apart) the knn-connected graph C. SDE's stretching of the graph improves the visualization here.

SDE Results
SDE unfolds (pulls apart) the knn-connected graph C before doing PCA, getting more use (energy) out of the top eigenvectors than PCA does. SDE's stretching of the graph improves the visualization here.

SDE Problems, and MVE
But SDE's stretching can worsen the visualization! Spokes experiment: MVE vs. SDE vs. PCA. We want to pull apart only in the visualized dimensions and flatten down the remaining ones.

Minimum Volume Embedding
To reduce dimension, drive the energy into the top (e.g. 2) dimensions. Maximize the fidelity, i.e. the percentage of energy in the top dimensions:
$$F(K) = \frac{\lambda_1 + \lambda_2}{\sum_i \lambda_i}$$
(e.g. $F(K) = 0.98$ vs. $F(K) = 0.29$). This is equivalent to maximizing $\lambda_1 + \lambda_2 - \beta \sum_i \lambda_i$ for some $\beta$; assume $\beta = 1/2$.

Minimum Volume Embedding
Stretch in the top $d < D$ dimensions and squash the rest:
$$\max_{K \in \kappa} \; \sum_{i=1}^{d} \lambda_i - \sum_{i=d+1}^{D} \lambda_i$$
This is the simplest linear-spectral SDP, with
$$\alpha = (\alpha_1, \dots, \alpha_d, \alpha_{d+1}, \dots, \alpha_D) = (+1, \dots, +1, -1, \dots, -1)$$
Effectively it maximizes the eigengap between the d-th and (d+1)-th eigenvalues.
A variational bound on the cost gives an iterated, monotonic SDP: lock V and solve the SDP for K; lock K and solve the SVD for V.

MVE Optimization: Spectral SDP
A spectral function $f(\lambda_1, \dots, \lambda_D)$ acts on the eigenvalues of a matrix K. SDP packages use restricted cost functions over the hull $\kappa$:
Trace SDPs: $\max_{K \in \kappa} \mathrm{tr}(BK)$
Logdet SDPs: $\max_{K \in \kappa} \sum_i \log \lambda_i$
Consider richer SDPs (assume the $\lambda_i$ are in decreasing order):
Linear-spectral SDPs: $\max_{K \in \kappa} \sum_i \alpha_i \lambda_i$
Problem: the last one doesn't fit nicely into standard SDP code (like YALMIP or CSDP). We need an iterative algorithm.

MVE Optimization: Spectral SDP
$$g(K) = \sum_i \alpha_i \lambda_i$$
If the alphas are ordered we get a variational SDP problem (a Procrustes problem):
$$\max_{K \in \kappa} g(K) = \max_{K \in \kappa} \sum_i \alpha_i \lambda_i = \max_{K \in \kappa} \sum_i \alpha_i \lambda_i\, \mathrm{tr}(v_i v_i^T) \quad \text{s.t.}\;\; K v_i = \lambda_i v_i,\; \lambda_i \geq \lambda_{i+1},\; v_i^T v_j = \delta_{ij}$$
$$= \max_{K \in \kappa} \sum_i \alpha_i\, \mathrm{tr}(K v_i v_i^T) = \max_{K \in \kappa} \mathrm{tr}\Big( K \sum_i \alpha_i v_i v_i^T \Big) = \max_{K \in \kappa} \max_{U} \mathrm{tr}\Big( K \sum_i \alpha_i u_i u_i^T \Big) \quad \text{s.t.}\;\; u_i^T u_j = \delta_{ij}, \;\text{ if } \alpha_i \geq \alpha_{i+1}$$
For the max over K use an SDP; for the max over U use an SVD. Iterate to obtain monotonic improvement.
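A sketch of the resulting iteration, reusing the SDE constraint set and cvxpy as before (this code is an assumption, not prescribed by the slides):

```python
import numpy as np
import cvxpy as cp

def mve(A, C, d=2, iters=5):
    """Iterated SDP/SVD for MVE. A: affinity matrix, C: 0/1 connectivity, d: target dimension."""
    N = A.shape[0]
    alpha = np.concatenate([np.ones(d), -np.ones(N - d)])    # +1 on the top d, -1 on the rest
    K_val = A.copy()
    for _ in range(iters):
        evals, evecs = np.linalg.eigh(K_val)
        V = evecs[:, ::-1]                                   # lock V (decreasing eigenvalue order)
        B = V @ np.diag(alpha) @ V.T                         # B = sum_i alpha_i v_i v_i^T
        K = cp.Variable((N, N), PSD=True)                    # lock B, solve max tr(BK) over kappa
        cons = [cp.sum(K) == 0]
        for i in range(N):
            for j in range(i + 1, N):
                if C[i, j] == 1:
                    cons.append(K[i, i] + K[j, j] - 2 * K[i, j]
                                == A[i, i] + A[j, j] - 2 * A[i, j])
        cp.Problem(cp.Maximize(cp.trace(B @ K)), cons).solve()
        K_val = K.value
    return K_val
```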

MVE Optimization: Spectral SDP
Theorem: if the alphas are decreasing, the objective $g(K) = \sum_i \alpha_i \lambda_i$ with $\alpha_i \geq \alpha_{i+1}$ is convex.
Proof: recall from (Overton & Womersley, 1991), (Fan, 1949) and (Bach & Jordan, 2003) that the sum of the d top eigenvalues of a p.d. matrix is convex: $f_d(K) = \sum_{i=1}^{d} \lambda_i$ is convex. Our linear-spectral cost is a combination of these:
$$g(K) = \alpha_D f_D(K) + \sum_{i=1}^{D-1} (\alpha_i - \alpha_{i+1})\, f_i(K) = \alpha_D\, \mathrm{tr}(K) + \sum_{i=1}^{D-1} (\alpha_i - \alpha_{i+1})\, f_i(K)$$
This is a trace (linear) term plus a conic combination of convex functions, hence convex.

MVE Pseudocode Download at www.metablake.com/mve

MVE Results
Spokes experiment: visualization and spectra. Converges in about 5 iterations.

MVE Results
Swiss roll visualization (connectivity via knn, d set to 2): PCA vs. SDE vs. MVE. The same convergence is obtained under random initialization or K = A.

knn Embedding with MVE
MVE does better even with knn connectivity. Face images with spectra: PCA vs. SDE vs. MVE.

knn Embedding with MVE
MVE does better even with knn connectivity. Digit images visualization with spectra: PCA vs. SDE vs. MVE.

Graph Embedding with MVE
Collaboration network with spectra: PCA vs. SDE vs. MVE.

MVE Results
Evaluate the fidelity, i.e. the % energy in the top 2 dimensions:

             kPCA    SDE     MVE
Star         95.0%   29.9%   100.0%
Swiss Roll   45.8%   99.9%   99.9%
Twos         18.4%   88.4%   97.8%
Faces        31.4%   83.6%   99.2%
Network       5.9%   95.3%   99.2%

MVE for Spanning Trees
Instead of knn, use a maximum-weight spanning tree to connect the points (Kruskal's algorithm: connect points with short edges first, skip edges that create loops, stop when we have a tree). Tree connectivity can fold under SDE or MVE, so add constraints on all pairs to keep all distances from shrinking (call this SDE-FULL or MVE-FULL):
$$K \in \kappa_{\text{original}} \quad \text{and} \quad K_{ii} + K_{jj} - K_{ij} - K_{ji} \geq A_{ii} + A_{jj} - A_{ij} - A_{ji} \;\; \forall\, i,j$$
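A sketch of the tree construction with Kruskal's algorithm, treating A as a distance matrix so that adding the shortest edges first gives the maximum-weight tree in affinity terms (the function name and union-find details are illustrative, not from the slides):

```python
import numpy as np

def spanning_tree(A):
    """Kruskal: add shortest edges first, skip edges that create loops, stop at a tree."""
    N = A.shape[0]
    parent = list(range(N))                    # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path compression
            i = parent[i]
        return i

    edges = sorted((A[i, j], i, j) for i in range(N) for j in range(i + 1, N))
    C = np.zeros((N, N), dtype=int)
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                           # only join separate components (no loops)
            parent[ri] = rj
            C[i, j] = C[j, i] = 1              # add the edge to the tree connectivity
    return C
```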

MVE for Spanning Trees Tree connectivity with degree=2. Top: taxonomy tree of 30 species of salamanders Bottom: taxonomy tree of 56 species of crustaceans