Advanced Machine Learning & Perception

Principal Components Analysis Encode data on linear (flat) manifold as steps along its axes x j y j = µ + C c vi i=1 ij Best choice of µ, c and v is least squares or equivalently maximum Gaussian likelihood error = x j y 2 N j = j =1 N j =1 x j µ C c vi i=1 ij Take derivatives of error over µ, c and v and set to zero 2 µ = 1 N N j =1 x j, v = eig 1 N N j =1 x j µ ( )( x j µ ) T, c = ( x ij i µ ) T v j

Manifold Learning & Embedding Data is often not Gaussian and not in a linear subspace Consider image of face being translated from left-to-right x t = T t x0 nonlinear! How to capture the true coordinates of the data on the manifold or embedding space and represent it compactly? Unlike PCA, Embedding does not try to reconstruct the data Just finds better more compact coordinates on the manifold Example, instead of pixel intensity image (x,y) find a measure (t) of how far the face has translated. t

Multidimensional Scaling Idea: find low dimensional embedding that mimics only the distances between points X in original space Construct another set of low dimensional (say 2D) points with coordinates Y that maintain the pairwise distances A Dissimilarity d(x i,x j ) is a function of two inputs such that ( ) 0 ( ) = 0 ( ) = d ( x j, x i ) d x i, x j d x i, x i d x i, x j A Distance Metric is stricter, satisfies triangle inequality: ( ) d ( x i, x j ) + d ( x j, x k ) d x i, x k ( ) = 1 2 Standard example: Euclidean l2 metric d x i, x j Assume for N objects, we compute a dissimilarity Δ matrix which tells us how far they are Δ ij = d x i, x j ( ) x i x 2 j

Multidimensional Scaling Given dissimilarity Δ between original X points under original d() metric, find Y points with dissimilarity D under another d () metric such that D is similar to Δ Δ ij = d x i, x ( ) j D ij = d '( y i, y ) j Want to find Y s that minimize some difference from D to Δ Eg. Least Squares Stress = Stress ( y 1,, y ) N = D ij Δ ij Eg. Invariant Stress = Eg. Sammon Mapping = Eg. Strain = InvStress = ij 1 Δ ij Stress ( Y ) 2 i<j D ij ( ) 2 D ij Δ ij trace J ( Δ 2 D 2 )J Δ 2 D 2 ij ( ) 2 Some are global Some are local Gradient descent ( ( )) wherej = I 1 N 1 1 T

MDS Example 3D to 2D Have distances from cities to cities, these are on the surface of a sphere (Earth) in 3D space Reconstructed 2D points on plane capture essential properties (poles?)

MDS Example Multi-D to 2D More elaborate example Have correlation matrix between crimes. These are arbitrary dimensionality. Hack: convert correlation to dissimilarity and show reconstructed Y

Locally Linear Embedding Instead of trying to preserve ALL pairwise distances only preserve SOME distances across nearby points: Lets us unwrap manifold! Also, distances are only locally valid Euclidean distance is only similar to geodesic at small scales How do we pick which distances? Find the k nearest neighbors of each point and only preserve those

LLE with K-Nearest Neighbors Start with unconnected points x 1 x 2 Compute pairs of distances A ij = d(x i,x j ) Connect each point to its k closest points B ij = A ij <= sort(a i* ) k Then symmetrize the connectivity matrix B ij = max(b ji, B ij ) x x4 x 3 5 x 6 x 1 x 2 x 3 x 4 x 5 x 6 x 1 0 0 1 1 0 0 x 2 0 0 0 1 1 0 x 3 1 0 0 0 0 1 x 4 1 1 0 0 0 0 x 5 0 1 0 1 0 0 x 6 0 0 1 1 0 0

LLE Instead of distance, look at neighborhood of each point. Preserve reconstruction of point with neighbors in low dim Find K nearest neighbors for each point Describe neighborhood as best weights on neighbors to reconstruct the point ε( W ) = subject to ( ) = i x i W ij j W ij x j j = 1 i Find best vectors that still have same weights 2 Φ Y y i W yj ij i j 2 subject to N y i = 0, i=1 N i=1 y T yi i = I

LLE Finding W s (convex combination of weights on neighbors): ( ) = i ε i ( W i ) ε W ε i ( W i ) = x i j W ijx j j = W ij xi x j where ε i W i ( ) = x i j W ijx j 2 ( ) 2 = Wij j x i x j ( ) j ( ) ( ) T ( xi x k ) ( ) T ( W xi ij x ) j jk = W ij W xi ik x j = W ij W ik C jk jk and recall j W ij = 1 W * 1 i = arg min w 2 wt Cw λ( w T 1) 1) Take Deriv & Set to 0 2) Solve Linear system Cw λ( 1 ) = 0 C w λ = 1 3) Find λ 4) Find w 2 w T 1 = 1 λ w T 1 = 1 λ

LLE Finding Y s (new low-d points that agree with the W s) 2 Φ( Y ) N N = i y i W yj ij subject to y i = 0, y T yi i = I = = i i y i W yj j ij ( k ) T T T T y yi i W yi yk ik W yj yi ij + W ij W yj yk ik ( ) T yi W ik yk k Where Y is a matrix whose rows are the y vectors To minimize the above subject to constraints we set Y as the bottom d+1 eigenvectors of M j jk ( ) jk ( ) y j T yk i = δ jk W jk W kj + W ij W ik jk = M jk yj T yk ( ) = tr MYY T j i=1 i=1

LLE Results Synthetic data on S-manifold Have noisy 3D samples X Get 2D LLE embedding Y Real data on face images Each x is an image that has been rasterized into a vector 1 3 0 2 Dots are reconstructed two-dimensional Y points

LLE Results Top=PCA Bottom=LLE

Kernel Principal Components Analysis Recall, PCA approximates the data with eigenvectors, mean and coefficients: x i µ + Get eigenvectors that best approximating the covariance: Σ =V ΛV T Σ 11 Σ 12 Σ 13 Σ 12 Σ 22 Σ 23 Σ 13 Σ 23 Σ 33 = v 1 v2 v3 T Eigenvectors are orthonormal: v i v j = δ ij In coordinates of v, Gaussian is diagonal, cov = Λ Higher eigenvalues are higher variance, use those first To compute the coefficients: c ij = x i µ How to extend PCA to make it nonlinear? Kernels! λ 1 λ 2 λ 3 λ 4 λ 1 0 0 0 λ 2 0 0 0 λ 3 v1 ( ) T v j v2 C j =1 v3 c ij v j T

Kernel PCA Idea: replace dot-products in PCA with kernel evaluations. Recall, could do PCA on DxD covariance matrix of data C = 1 N T x xi N i λv = C Evals & If data is v i=1 Evecs zero-mean or NxN Gram matrix of data: K ij = x T satisfy i x j For nonlinearity, do PCA on feature expansions: N φ( x i )φ( x ) T i C = 1 i=1 N Instead of doing explicit feature expansion, use kernel I.e. d-th order polynomial K ij = k ( x i,x ) j = φ( x ) T i φ x j As usual, kernel must satisfy Mercer s theorem Assume, for simplicity, all N φ( x ) feature data is zero-mean i = 0 i=1 ( ) = ( x T i x ) d j

Kernel PCA Efficiently find & use eigenvectors of C-bar: λv = Cv Can dot either side of above equation with feature vector: ( ) T v = φ ( xi ) T Cv λφ x i Eigenvectors are in span of feature vectors: Combine equations: λφ( x i ) T v = φ ( xi ) T Cv ( ) T N α j φ( x j ) ( ) T N C α j φ( x j ) λφ x i λφ x i { } = φ x j =1 i ( ) T N α j φ( x j ) { } = φ x j =1 i λ N j =1 α j K ij = 1 N ( ) T 1 N λk α = K 2 α N λ α = K α N { } i=1 v = N i=1 α i φ( x ) i { N φ x N ( k )φ( x k ) T } N α k=1 j φ x j =1 j k=1 K ik N j =1 α j K kj ( ) { }

Kernel PCA From before, we had: λφ x i this is an eig equation! N λα = Kα Get eigenvectors α and eigenvalues of K Eigenvalues are N times λ For each eigenvector α k there is an eigenvector v k Want eigenvectors v to be normalized: v k Can now use alphas only for doing PCA projection & reconstruction! ( ) T v = φ ( xi ) T Cv ( ) T v k = 1 ( N α k i φ( x )) T N i=1 i α k j φ x j =1 j ( ( )) = 1 ( α k ) T Kα k = 1 ( α k ) T N λ k α k = 1 α k ( ) T α k = 1 N λ k

Kernel PCA To compute k th projection coefficient of a new point φ(x) { } = N α i ( ) T v k = φ( x) T N α k i φ( x ) i ( ) c k = φ x k k x,x i=1 i=1 i Reconstruction*: φ ( x) K = c k v k K N = α k k=1 i k ( x,x ) N i α k i=1 j φ( x ) k=1 j =1 j *Pre-image problem, linear combo in Hilbert goes outside Can now do nonlinear PCA and do PCA on non-vectors Nonlinear KPCA eigenvectors satisfy same properties as usual PCA but in Hilbert space. These evecs: 1) Top q have max variance 2) Top q reconstruction has with min mean square error 3) Are uncorrelated/orthogonal 4) Top have max mutual with inputs

Centering Kernel PCA So far, we had assumed the N data was zero-mean: φ( x i ) = 0 i=1 We want this: φ ( x ) j = φ( x j ) 1 N N φ( x ) i=1 i How to do without touching feature space? Use kernels K ij = φ ( x ) T i φ ( x ) j = ( φ( x i ) 1 N φ( x ) N ) T ( k=1 k φ( x j ) 1 N φ( x ) N ) k=1 k = φ( x ) T i φ( x j ) 1 N 1 N N k=1 = K ij 1 N ( ) φ x k Can get alpha eigenvectors from K tilde by adjusting old K N φ( x ) T i φ( x ) k N k=1 K kj k=1 T φ x j ( ) + 1 1 N N N N k=1 l =1 1 N K N k=1 ik + 1 N 2 φ( x ) T k φ( x ) l N N k=1 K l =1 kl

Kernel PCA Results KPCA on 2d dataset Left-to-right Kernel poly order goes from 1 to 3 1=linear=PCA Top-to-bottom top evec to weaker evecs

Kernel PCA Results Use coefficients of the KPCA for training a linear SVM classifier to recognize chairs from their images. Use various polynomial kernel degrees where 1=linear as in regular PCA

Kernel PCA Results Use coefficients of the KPCA for training a linear SVM classifier to recognize characters from their images. Use various polynomial kernel degrees where 1=linear as in regular PCA (worst case in experiments) Inferior performance to nonlinear SVMs (why??)

Semidefinite Embedding Also known as Maximum Variance Unfolding Similar to LLE and kernel PCA Like LLE, maintains only distance in the neighborhood Stretch all the data while maintaining the distances: Then apply PCA (or kpca)

Semidefinite Embedding To visualize high-dimensional {x 1,,x N } data: PCA and Kernel PCA (Sholkopf et al): -Get matrix A of affinities between pairs A ij =k(x i,x j ) -SVD A & view top projections Semidefinite Embedding (Weinberger, Saul): -Get k-nearest neighbors graph of data -Get matrix A -Use max trace SDP to stretch stretch graph A into PD graph K -SVD K & view top projections

Semidefinite Embedding SDE unfolds (pulls apart) knn connected graph C but preserves pairwise distances when C ij =1 max K λ i i κ = K R N N s.t.k 0 ij s.t. K ij = 0 s.t. K κ s.t.k ii + K jj K ij K ji = A ii + A jj A ij A ji if C ij = 1 SDE s stretching of graph improves the visualization

SDE Optimization with YALMIP Linear Programming <Quadratic Programming <Quadratically Constrained Quadratic Programming <Semidefinite Programming <Convex Programming <Polynomial Time Algorithms LP QP QCQP SDP CP P

SDE Optimization with YALMIP LP fast O(N 3 ) min x b T x s.t. ci T x αi i QP min x 1 2 x T H x + b T x s.t. ci T x αi i QCQP min x 1 2 x T H x + b T x s.t. ci T x αi i, x T x η SDP min K tr ( BK ) s.t. tr ( C T i K) α i i, K 0 ALL above in the YALMIP package for Matab! Google it, download and install! CP slow O(N 3 ) min x f x ( ) s.t. g ( x ) α

SDE Results SDE unfolds (pulls apart) knn connected graph C SDE s stretching of graph improves the visualization here

SDE Results SDE unfolds (pulls apart) knn connected graph C before doing PCA Gets more use or energy out of the top eigenvectors than PCA SDE s stretching of graph improves visualization here

SDE Problems MVE But SDE stretching could worsen visualization! Spokes Experiment: PCA Want to pull apart only in visualized dimensions Flatten down remaining ones vs.

Tony Jebara Minimum Volume Embedding To reduce dimension, drive energy into top (e.g. 2) dims Maximize fidelity or % energy in top dims ( ) = λ 1 + λ 2 F K i λ i F ( K) = 0.98 F ( K) = 0.29 Equivalent to maximizing Assume β=1/2 λ 1 + λ 2 β i λ i for some β 33

Minimum Volume Embedding Stretch in d<d top dimensions and squash rest. max K D λ i d λ i=1 i i=d +1 s.t. K κ Simplest Linear-Spectral SDP α = α 1 α d α d +1 α D = +1 +1 1 1 Effectively maximizes Eigengap between d th and d+1 th λ

MVE Optimization: Spectral SDP Spectral function: f(λ 1,,λ D ) eigenvalues of a matrix K SDP packages use restricted cost functions over hull kappa. Trace SDPs max K κ tr ( BK ) Logdet SDPs max K κ i log λ i Consider richer SDPs (assume λ i in decreasing order) Linear-Spectral SDPs max K κ i α i λ i Problem: last one doesn t fit nicely into standard SDP code (like YALMIP or CSDP). Need an iterative algorithm

MVE Optimization: Spectral SDP ( ) = α i λ i g K i If alphas are ordered get a variational SDP problem (a Procrustes problem). max K κ g K ( ) = max K κ i α i λ i = max K κ α i λ i tr v T i ( i v i ) s.t. Kv i = λ i v i T = max K κ i α i tr ( λ i v i v i ) λ i λ i+1 T = max K κ i α i tr ( Kv i v i ) v T i v j = δ ij T = max K κ tr ( K i α i v i v i ) = max min tr K α u u T K κ U ( i i i i ) s.t. u T i u j = δ ij if α i α i+1 T max K κ max U tr ( K α i u i u i ) s.t. u T i u j = δ ij if α i α i+1 i For max over K use SDP. For max over U use SVD. Iterate to obtain monotonic improvement.

MVE Optimization: Spectral SDP Theorem: if alpha decreasing the objective g K is convex. ( ) = α i λ i s.t.α i α i+1 i Proof: Recall from (Overton & Womersley 91) and (Fan 49) and (Bach & Jordan 03) Sum of d top eigenvalues of p.d. matrix is convex f d d ( K) = λ i i=1 convex Our linear-spectral cost is a combination of these 1 ( ) = α D f D ( K) + ( α i α i+1 ) f i ( K i=d 1 ) 1 = α D tr ( K) + α i α i+1 f i ( K) g K i=d 1 Trace (linear) + conic combo of convex fn s =convex

MVE Pseudocode Download at www.metablake.com/mve

MVE Results Spokes experiment visualization and spectra Converges in ~5 iterations

MVE Results Swissroll Visualization (Connectivity via knn) (d is set to 2) PCA SDE Same convergence under random initialization or K=A MVE

Tony Jebara knn Embedding with MVE MVE does better even with knn connectivity Face images with spectra PCA SDE MVE 42

Tony Jebara knn Embedding with MVE MVE does better even with knn Digit images visualization with spectra PCA SDE MVE 43

Tony Jebara Graph Embedding with MVE Collaboration network with spectra PCA SDE MVE 44 44

MVE Results Evaluate fidelity or % energy in top 2 dims kpca SDE MVE Star 95.0% 29.9% 100.0% Swiss Roll 45.8% 99.9% 99.9% Twos 18.4% 88.4% 97.8% Faces 31.4% 83.6% 99.2% Network 5.9% 95.3% 99.2%

MVE for Spanning Trees Instead of knn, use maximum weight spanning tree to connect points (Kruskal s algo: connect points with short edges first, skip edges that create loops, stop when tree) Tree connectivity can fold under SDE or MVE. Add constraints on all pairs to keep all distances from shrinking (call this SDE-FULL or MVE-FULL) K original κ and K ii + K jj K ij K ji A ii + A jj A ij A ji

MVE for Spanning Trees Tree connectivity with degree=2. Top: taxonomy tree of 30 species of salamanders Bottom: taxonomy tree of 56 species of crustaceans