Manifold Regularization

Vikas Sindhwani
Department of Computer Science, University of Chicago
Joint work with Mikhail Belkin and Partha Niyogi

TTI-C Talk, September 14, 2004
The Problem of Learning

Each labeled example $(x, y)$ is drawn from an unknown probability distribution on $X \times Y$. A learning algorithm maps the training data to an element $f$ of a hypothesis space of functions $X \to Y$; $f$ should provide good labels for future examples.

Regularization: choose a simple function that agrees with the data.
The Problem of Learning

Notions of simplicity are the key to successful learning. Here is a simple function that agrees with the data.
Learning and Prior Knowledge

But simplicity is a relative concept: prior knowledge of the marginal distribution $P_X$ can modify our notions of simplicity.
Motivation

How can we exploit prior knowledge of the marginal distribution $P_X$? More practically: how can we use unlabeled examples drawn from $P_X$?

Why is this important?
- Natural data has structure to exploit.
- Natural learning is largely semi-supervised.
- Labels are expensive; unlabeled data is cheap and plentiful.
Contributions

- A data-dependent, geometric regularization framework for learning from examples.
- Representer theorems provide the form of the solutions.
- Extensions of SVM and RLS for semi-supervised learning.
- Regularized spectral clustering and dimensionality reduction.
- The problem of out-of-sample extension in graph methods is resolved.
- Good empirical performance.
Regularization in RKHS

Learning in a Reproducing Kernel Hilbert Space $\mathcal{H}_K$:

$f^\star = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma \|f\|_K^2$

Regularized Least Squares (RLS): $V(x, y, f) = (y - f(x))^2$
Support Vector Machine (SVM): $V(x, y, f) = (1 - y f(x))_+$
What are RKHS?

Hilbert spaces of functions with a nice property: if two functions are close in the norm derived from the inner product, their values at every point are close.

Reproducing property: the evaluation functional $f \mapsto f(x)$ is linear and continuous. By the Riesz representation theorem, there exists $K_x \in \mathcal{H}$ such that $f(x) = \langle f, K_x \rangle$.

Kernel function: $K(x, z) = \langle K_x, K_z \rangle$. The RKHS $\mathcal{H}_K$ is the closure of $\mathrm{span}\{K_x : x \in X\}$.
Why RKHS?

Rich function spaces with complexity control, e.g. the Gaussian kernel $K(x, z) = \exp(-\|x - z\|^2 / 2\sigma^2)$.

Representer theorems show that the minimizer has the form

$f^\star(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x)$

and therefore the optimization reduces to finding the $l$ coefficients $\alpha_i$.

Motivates kernelization (KPCA, KFD, etc.). Good empirical performance.
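To make the representer theorem concrete, here is a minimal sketch of kernel RLS (my own illustration, not code from the talk; all function names are made up). Substituting the expansion $f(x) = \sum_i \alpha_i K(x_i, x)$ into the RLS objective and setting the gradient to zero gives the linear system $(K + \gamma l I)\alpha = y$:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def rls_fit(X, y, gamma=0.1, sigma=1.0):
    # Representer theorem: f*(x) = sum_i alpha_i K(x_i, x).
    # Plugging this into (1/l)||y - K alpha||^2 + gamma alpha^T K alpha
    # and setting the gradient to zero yields (K + gamma*l*I) alpha = y.
    l = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + gamma * l * np.eye(l), y)

def rls_predict(alpha, X_train, X_test, sigma=1.0):
    # Evaluate f*(x) = sum_i alpha_i K(x_i, x) at the test points.
    return gaussian_kernel(X_test, X_train, sigma) @ alpha
```

With a tiny regularizer the fitted function nearly interpolates the labels at the training points; larger $\gamma$ trades fit for smoothness.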
Known Marginal

If the marginal $P_X$ is known, solve:

$f^\star = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \|f\|_I^2$

Extrinsic and intrinsic regularization: $\gamma_A$ controls complexity in the ambient space; $\gamma_I$ controls complexity in the intrinsic geometry of $P_X$.
Continuous Representer Theorem

Assume that the penalty term $\|f\|_I^2$ is sufficiently smooth with respect to the RKHS norm. Then the solution to the optimization problem exists and admits the representation

$f^\star(x) = \sum_{i=1}^{l} \alpha_i K(x_i, x) + \int_{\mathcal{M}} \alpha(z) K(x, z) \, dP_X(z)$

where $\mathcal{M} = \mathrm{supp}(P_X)$ is the support of the marginal.
A Manifold Regularizer

If $\mathcal{M}$, the support of the marginal, is a compact submanifold, it seems natural to choose:

$\|f\|_I^2 = \int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2 \, dP_X$

and to find $f^\star \in \mathcal{H}_K$ that minimizes:

$\frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I \int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2 \, dP_X$
Laplace-Beltrami Operator

The intrinsic regularizer is a quadratic form involving the Laplace-Beltrami operator $\Delta_{\mathcal{M}}$ on the manifold:

$\int_{\mathcal{M}} \|\nabla_{\mathcal{M}} f\|^2 = \int_{\mathcal{M}} f \, \Delta_{\mathcal{M}} f$

because some calculus on manifolds (integration by parts via Stokes' theorem, using that the integral of the divergence of any vector field vanishes on a compact manifold) establishes this identity.
Passage to the Discrete

In reality, $P_X$ is unknown and sampled only via examples $x_1, \ldots, x_{l+u}$. Labels are not required for an empirical estimate of $\|f\|_I^2$.

Manifold $\mathcal{M}$ --> adjacency graph with heat-kernel weights $W_{ij} = \exp(-\|x_i - x_j\|^2 / 4t)$.
Laplace-Beltrami operator $\Delta_{\mathcal{M}}$ --> graph Laplacian $L = D - W$.

$\|f\|_I^2 \approx \frac{1}{(l+u)^2} \mathbf{f}^T L \mathbf{f}$, where $\mathbf{f} = (f(x_1), \ldots, f(x_{l+u}))^T$.
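The discrete estimate can be sketched as follows (my own illustration; the experiments in the talk may use k-nearest-neighbor graphs rather than the fully connected graph assumed here). The identity $\mathbf{f}^T L \mathbf{f} = \frac{1}{2} \sum_{i,j} W_{ij} (f_i - f_j)^2$ is what makes $L$ a smoothness penalty on the graph:

```python
import numpy as np

def graph_laplacian(X, t=1.0):
    # Heat-kernel weights W_ij = exp(-||x_i - x_j||^2 / (4t)),
    # here on a fully connected graph with no self-loops.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (4.0 * t))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))   # degree matrix
    return D - W, W

def intrinsic_norm(f, L):
    # Empirical estimate of ||f||_I^2, up to the 1/(l+u)^2 factor.
    return f @ L @ f
```

Note that $L \mathbf{1} = 0$: constant functions incur no intrinsic penalty, as expected.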
Algorithms

We have motivated the following optimization problem: find $f^\star \in \mathcal{H}_K$ that minimizes:

$\frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \frac{\gamma_I}{(l+u)^2} \mathbf{f}^T L \mathbf{f}$

Laplacian RLS: $V = (y_i - f(x_i))^2$
Laplacian SVM: $V = (1 - y_i f(x_i))_+$
Empirical Representer Theorem

The minimizer admits an expansion

$f^\star(x) = \sum_{i=1}^{l+u} \alpha_i K(x_i, x)$

Proof: write any $f \in \mathcal{H}_K$ as $f = \sum_{i=1}^{l+u} \alpha_i K(x_i, \cdot) + f_\perp$, with $f_\perp$ orthogonal to $\mathrm{span}\{K(x_i, \cdot)\}$. By the reproducing property, $f_\perp$ does not change the values $f(x_i)$, but it increases the norm. So $f_\perp = 0$ at the minimizer.
Laplacian RLS

By the representer theorem, the problem becomes finite-dimensional. For Laplacian RLS, we find $\alpha \in \mathbb{R}^{l+u}$ that minimizes:

$\frac{1}{l} (Y - JK\alpha)^T (Y - JK\alpha) + \gamma_A \alpha^T K \alpha + \frac{\gamma_I}{(l+u)^2} \alpha^T K L K \alpha$

The solution is:

$\alpha^\star = \left( JK + \gamma_A l I + \frac{\gamma_I l}{(l+u)^2} LK \right)^{-1} Y$

where $K$ is the Gram matrix over labeled and unlabeled points, $Y = (y_1, \ldots, y_l, 0, \ldots, 0)^T$, and $J = \mathrm{diag}(1, \ldots, 1, 0, \ldots, 0)$ with $l$ ones followed by $u$ zeros.
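The closed form above fits in a few lines (a sketch of my own, assuming the Gram matrix K and graph Laplacian L over all l+u points are precomputed; the function name is made up). A useful sanity check: with $\gamma_I = 0$ the unlabeled coefficients vanish and the labeled block reduces to the ordinary RLS system $(K_{ll} + \gamma_A l I)\alpha = y$.

```python
import numpy as np

def laprls_solve(K, L, y, l, gamma_A, gamma_I):
    # alpha* = (J K + gamma_A l I + gamma_I l/(l+u)^2 L K)^{-1} Y
    # K, L are (l+u) x (l+u); y holds the l labels; J zeros out unlabeled rows.
    n = K.shape[0]                               # n = l + u
    J = np.diag(np.r_[np.ones(l), np.zeros(n - l)])
    Y = np.r_[y, np.zeros(n - l)]                # labels padded with zeros
    M = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n ** 2) * (L @ K)
    return np.linalg.solve(M, Y)
```

The returned vector gives the classifier $f^\star(x) = \sum_{i=1}^{l+u} \alpha_i^\star K(x_i, x)$, defined everywhere, not just on the graph.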
Laplacian SVM

For Laplacian SVMs, we solve a QP in $\beta \in \mathbb{R}^l$:

$\beta^\star = \arg\max_{\beta} \; \sum_{i=1}^{l} \beta_i - \frac{1}{2} \beta^T Q \beta$

subject to: $\sum_{i=1}^{l} \beta_i y_i = 0$ and $0 \le \beta_i \le \frac{1}{l}$, where

$Q = Y J K \left( 2\gamma_A I + \frac{2\gamma_I}{(l+u)^2} LK \right)^{-1} J^T Y$

(with $Y = \mathrm{diag}(y_1, \ldots, y_l)$), and then invert a linear system:

$\alpha^\star = \left( 2\gamma_A I + \frac{2\gamma_I}{(l+u)^2} LK \right)^{-1} J^T Y \beta^\star$
Manifold Regularization

Input: $l$ labeled and $u$ unlabeled examples.
Output: $f : X \to \mathbb{R}$.

Algorithm:
1. Construct the adjacency graph. Compute the graph Laplacian $L$.
2. Choose a kernel $K(\cdot, \cdot)$. Compute the Gram matrix.
3. Choose $\gamma_A$ and $\gamma_I$. (?)
4. Compute $\alpha^\star$.
5. Output $f^\star(x) = \sum_{i=1}^{l+u} \alpha_i^\star K(x_i, x)$.
Unity of Learning

[Diagram: one framework spans the spectrum.]
- Supervised: SVM / RLS.
- Partially supervised: graph regularization; Laplacian SVM / RLS, with out-of-sample extension.
- Unsupervised: graph mincut / spectral clustering; regularized spectral clustering, with out-of-sample extension.
Regularized Spectral Clustering

Unsupervised manifold regularization:

$\min_{f \in \mathcal{H}_K} \; \gamma \|f\|_K^2 + \frac{1}{u^2} \sum_{i,j} W_{ij} \left( f(x_i) - f(x_j) \right)^2$

subject to $\sum_i f(x_i) = 0$ and $\sum_i f(x_i)^2 = 1$.

The representer theorem $f = \sum_i \alpha_i K(x_i, \cdot)$ leads to an eigenvalue problem: $f^\star$ is the smallest-eigenvalue eigenvector, where $P$ projects orthogonal to the constant vector $\mathbf{1}$.
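A minimal sketch of the unsupervised case (my own illustration, not necessarily the talk's exact eigenproblem). Working directly with the vector of values $\mathbf{f} = K\alpha$, the objective becomes $\gamma \mathbf{f}^T K^{-1} \mathbf{f} + \frac{1}{u^2}\mathbf{f}^T L \mathbf{f}$ (since $\|f\|_K^2 = \alpha^T K \alpha = \mathbf{f}^T K^{-1}\mathbf{f}$); projecting orthogonal to the constant vector and taking the smallest nonzero eigenvector gives the cluster split, while $\alpha = K^{-1}\mathbf{f}$ supplies the out-of-sample extension:

```python
import numpy as np

def reg_spectral_clustering(X, gamma=1e-4, sigma=2.0, t=1.0):
    u = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))          # Gram matrix
    W = np.exp(-d2 / (4.0 * t))
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(1)) - W                     # graph Laplacian
    P = np.eye(u) - np.ones((u, u)) / u           # projects orthogonal to 1
    A = P @ (gamma * np.linalg.inv(K) + L / u ** 2) @ P
    vals, vecs = np.linalg.eigh(A)                # ascending eigenvalues
    f = vecs[:, 1]     # vecs[:, 0] is the constant direction (eigenvalue ~ 0)
    alpha = np.linalg.solve(K, f)  # out-of-sample: f(x) = sum_i alpha_i K(x_i, x)
    return np.sign(f), alpha
```

On two well-separated clusters, the sign of $\mathbf{f}$ recovers the partition, and `alpha` extends the partition function to points outside the graph.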
Experiments: Synthetic

[Figure: six panels of decision boundaries on synthetic 2-D data, comparing SVM and Laplacian SVM for various $(\gamma_A, \gamma_I)$ settings, e.g. $\gamma_A = 0.3125$ with varying $\gamma_I$, and $\gamma_A = 10^{-6}$ and $\gamma_A = 0.1$ in the bottom row.]
Related Algorithms

Transductive SVMs [Joachims; Vapnik]: treat the missing labels as extra optimization variables and minimize the SVM objective jointly over $f$ and the labels of the unlabeled examples.

Semi-supervised SVMs [Bennett; Fung et al.]: a mathematical-programming formulation of the same idea, penalizing each unlabeled point by the hinge loss under its better label, $\min\!\big((1 - f(x))_+, \, (1 + f(x))_+\big)$.

Measure-based regularization [Bousquet et al.]: penalize gradients weighted by the density, $\int \langle \nabla f(x), \nabla f(x) \rangle \, p(x) \, dx$, so that $f$ varies mainly in low-density regions.
Experiments: Synthetic Data

[Figure: decision boundaries of SVM, Transductive SVM, and Laplacian SVM on the same synthetic 2-D dataset.]
Experiments: Digits

[Figure: error rates of RLS vs. LapRLS, SVM vs. LapSVM, and TSVM vs. LapSVM across 45 binary classification problems. Bottom row: out-of-sample extension, plotting unlabeled-set error against test-set error for LapRLS and LapSVM, and performance-deviation plots for SVM (o) and TSVM (x) against LapSVM.]
Experiments: Digits

[Figure: average error rate as a function of the number of labeled examples (2 to 128), comparing RLS vs. LapRLS and SVM vs. LapSVM on test (T) and unlabeled (U) sets.]
Experiments: Speech

[Figure: error rates on the unlabeled and test sets for RLS vs. LapRLS and for SVM vs. TSVM vs. LapSVM, for three choices of labeled speaker.]
Experiments: Speech

[Figure: test-set error plotted against unlabeled-set error for RLS, LapRLS, SVM, and LapSVM in two experiments; agreement along the diagonal indicates that the out-of-sample extension matches transductive performance.]
Experiments: Text

Method        PRBEP          Error
k-NN          73.2           13.3
SGT           86.2            6.2
Naive Bayes                  12.9
Cotraining                    6.2
SVM           76.39 (5.6)    10.41 (2.5)
TSVM          88.15 (1.0)     5.22 (0.5)
LapSVM        87.73 (2.3)     5.41 (1.0)
RLS           73.49 (6.2)    11.68 (2.7)
LapRLS        86.37 (3.1)     5.99 (1.4)

(Standard deviations in parentheses.)
Experiments: Text

[Figure: PRBEP as a function of the number of labeled examples (2 to 64) for RLS vs. LapRLS and SVM vs. LapSVM on unlabeled (U) and test (T) sets; bottom row: LapSVM performance on unlabeled and test sets for three unlabeled-set sizes.]
Future Work

- Generalization as a function of labeled and unlabeled examples.
- Additional structure: structured outputs, invariances.
- Active learning, feature selection.
- Efficient algorithms: linear methods, sparse solutions.
- Applications: bioinformatics, text, speech, vision, ...