Advanced Machine Learning


Advanced Machine Learning: Learning Kernels. Mehryar Mohri (mohri@), Courant Institute & Google Research.

Outline: kernel methods; learning kernels scenario; learning bounds; algorithms.

Machine Learning Components (diagram): the user supplies the features; these, together with the sample, are fed to the algorithm, which outputs a hypothesis h.

Machine Learning Components (diagram, annotated): choosing the features is the critical task left to the user, while the algorithm itself is the main focus of the ML literature; the output is a hypothesis h.

Kernel Methods. Features are implicitly defined via the choice of a PDS kernel $K \colon X \times X \to \mathbb{R}$, interpreted as a similarity measure: there exists a feature map $\Phi \colon X \to \mathbb{H}_K$ such that $\forall x, y \in X,\ \langle \Phi(x), \Phi(y)\rangle = K(x, y)$. Flexibility: the PDS kernel can be chosen arbitrarily. Kernels help extend a variety of algorithms to non-linear predictors, e.g., SVMs, KRR, SVR, KPCA. The PDS condition is directly related to the convexity of the optimization problem.

Example: Polynomial Kernels. Definition: $\forall x, y \in \mathbb{R}^N,\ K(x, y) = (x \cdot y + c)^d$, with $c > 0$. Example: for $N = 2$ and $d = 2$,
$K(x, y) = (x_1 y_1 + x_2 y_2 + c)^2 = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\,x_1 x_2 \\ \sqrt{2c}\,x_1 \\ \sqrt{2c}\,x_2 \\ c \end{bmatrix} \cdot \begin{bmatrix} y_1^2 \\ y_2^2 \\ \sqrt{2}\,y_1 y_2 \\ \sqrt{2c}\,y_1 \\ \sqrt{2c}\,y_2 \\ c \end{bmatrix}.$
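
As a quick sanity check of the identity above, the following Python sketch (the helper names poly_kernel and phi are illustrative, not from the slides) verifies numerically that the degree-2 polynomial kernel equals the inner product of the explicit feature vectors:

    import numpy as np

    def poly_kernel(x, y, c=1.0, d=2):
        """Polynomial kernel K(x, y) = (x . y + c)^d."""
        return (np.dot(x, y) + c) ** d

    def phi(x, c=1.0):
        """Explicit degree-2 feature map for N = 2 (illustrative)."""
        x1, x2 = x
        return np.array([x1**2, x2**2,
                         np.sqrt(2) * x1 * x2,
                         np.sqrt(2 * c) * x1,
                         np.sqrt(2 * c) * x2,
                         c])

    x = np.array([0.3, -1.2])
    y = np.array([2.0, 0.5])
    # Same value as the explicit inner product, with no explicit lifting needed.
    assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))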

XOR Problem. Use the second-degree polynomial kernel with $c = 1$. In the original $(x_1, x_2)$ plane, the four XOR points $(1, 1), (-1, -1), (1, -1), (-1, 1)$ are not linearly separable. Under the corresponding feature map they become $(1, 1, +\sqrt{2}, +\sqrt{2}, +\sqrt{2}, 1)$, $(1, 1, +\sqrt{2}, -\sqrt{2}, -\sqrt{2}, 1)$, $(1, 1, -\sqrt{2}, +\sqrt{2}, -\sqrt{2}, 1)$, and $(1, 1, -\sqrt{2}, -\sqrt{2}, +\sqrt{2}, 1)$, which are linearly separable by the hyperplane $x_1 x_2 = 0$ (the $\sqrt{2}\,x_1 x_2$ coordinate).
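
A minimal numerical illustration of this slide (the sketch itself is not from the slides; the XOR-style labels and the feature map phi are as defined above): the labels agree with the sign of the $\sqrt{2}\,x_1 x_2$ coordinate of the mapped points, so a linear separator in feature space suffices.

    import numpy as np

    def phi(x, c=1.0):
        x1, x2 = x
        return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                         np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c])

    points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
    labels = [+1, +1, -1, -1]          # XOR-style labeling: sign of x1 * x2

    # Linear separator in feature space: weight vector selecting the sqrt(2)*x1*x2 coordinate.
    w = np.array([0, 0, 1, 0, 0, 0])
    for x, y in zip(points, labels):
        assert np.sign(w @ phi(np.array(x))) == y   # all four points correctly classified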

Other Standard PDS Kernels. Gaussian kernels: $K(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$, $\sigma \neq 0$; this is the normalized kernel associated to $(x, x') \mapsto \exp\!\left(\frac{x \cdot x'}{\sigma^2}\right)$. Sigmoid kernels: $K(x, y) = \tanh(a(x \cdot y) + b)$, $a, b \geq 0$.
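
For concreteness, a small sketch (parameter and function names are illustrative, not from the slides) computing a Gaussian kernel matrix on a sample and checking that it is symmetric PSD:

    import numpy as np

    def gaussian_kernel_matrix(X, sigma=1.0):
        """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows of X."""
        sq = np.sum(X**2, axis=1)
        sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
        return np.exp(-np.maximum(sq_dists, 0.0) / (2 * sigma**2))

    X = np.random.RandomState(0).randn(5, 3)
    K = gaussian_kernel_matrix(X, sigma=2.0)
    assert np.allclose(K, K.T) and np.all(np.linalg.eigvalsh(K) > -1e-10)  # symmetric, PSD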

SVM (Cortes and Vapnik, 1995; Boser, Guyon, and Vapnik, 1992).
Primal: $\min_{w, b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^m \big(1 - y_i(w \cdot \Phi(x_i) + b)\big)_+$.
Dual: $\max_{\alpha}\ \sum_{i=1}^m \alpha_i - \frac{1}{2}\sum_{i,j=1}^m \alpha_i\alpha_j y_i y_j K(x_i, x_j)$ subject to: $0 \leq \alpha_i \leq C \wedge \sum_{i=1}^m \alpha_i y_i = 0,\ i \in [1, m]$.
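
In practice this dual is rarely solved by hand; as a sketch (assuming scikit-learn is available; the dataset and the Gaussian kernel choice are illustrative), an SVM can be trained directly on a precomputed kernel matrix:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = rng.randn(40, 3)
    y = np.sign(X[:, 0] + 0.1 * rng.randn(40))

    # Gaussian kernel matrix with sigma = 1
    K = np.exp(-((X[:, None, :] - X[None, :, :])**2).sum(-1) / 2.0)

    clf = SVC(C=1.0, kernel="precomputed")   # solves the dual above for the given Gram matrix
    clf.fit(K, y)
    print("training accuracy:", clf.score(K, y))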

Kernel Ridge Regression (Hoerl and Kennard, 1970; Saunders et al., 1998).
Primal: $\min_{w}\ \lambda\|w\|^2 + \sum_{i=1}^m (w \cdot \Phi(x_i) - y_i)^2$.
Dual: $\max_{\alpha \in \mathbb{R}^m}\ -\alpha^\top (K + \lambda I)\alpha + 2\,\alpha^\top y$.
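
The dual has the closed-form solution $\alpha = (K + \lambda I)^{-1} y$; a minimal numpy sketch (variable names and the linear-kernel example are illustrative):

    import numpy as np

    def krr_fit(K, y, lam=1.0):
        """Dual KRR solution alpha = (K + lam I)^{-1} y."""
        m = K.shape[0]
        return np.linalg.solve(K + lam * np.eye(m), y)

    def krr_predict(K_test_train, alpha):
        """Prediction h(x) = sum_i alpha_i K(x, x_i)."""
        return K_test_train @ alpha

    rng = np.random.RandomState(1)
    X = rng.randn(30, 2)
    y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(30)
    K = X @ X.T                                  # linear kernel for illustration
    alpha = krr_fit(K, y, lam=0.5)
    print(np.mean((krr_predict(K, alpha) - y) ** 2))   # small training error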

Questions. How should the user choose the kernel? This problem is similar to that of selecting features for other learning algorithms: with a poor choice, learning is made very difficult; with a good choice, even poor learners can succeed. This requirement on the user is thus critical: can it be lessened? Is a more automatic selection of features possible?

Outline: kernel methods; learning kernels scenario; learning bounds; algorithms.

Standard Learning with Kernels (diagram): the user supplies a kernel K; the algorithm, given the sample, returns a hypothesis h.

Learning Kernel Framework (diagram): the user supplies a kernel family $\mathcal{K}$; the algorithm, given the sample, returns both a kernel and a hypothesis, i.e., a pair (K, h).

Kernel Families. Most frequently used kernel families, for $q \geq 1$:
$\mathcal{K}_q = \Big\{K_\mu : K_\mu = \sum_{k=1}^p \mu_k K_k,\ \mu = (\mu_1, \ldots, \mu_p)^\top \in \Delta_q\Big\}$, with $\Delta_q = \{\mu : \mu \geq 0,\ \|\mu\|_q = 1\}$.
Hypothesis sets: $H_q = \{h \in \mathbb{H}_K : K \in \mathcal{K}_q,\ \|h\|_{\mathbb{H}_K} \leq 1\}$.
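
A small sketch of this parameterization (function names are illustrative, not from the slides): given base kernel matrices $K_1, \ldots, K_p$, form the combined matrix $K_\mu$ for a non-negative weight vector normalized in $L_q$ norm.

    import numpy as np

    def normalize_q(mu, q=1.0):
        """Clip to non-negative values and rescale so that ||mu||_q = 1."""
        mu = np.maximum(mu, 0.0)
        return mu / (np.sum(mu ** q) ** (1.0 / q))

    def combined_kernel(base_kernels, mu):
        """K_mu = sum_k mu_k K_k for a list of (m x m) base kernel matrices."""
        return sum(m_k * K_k for m_k, K_k in zip(mu, base_kernels))

    rng = np.random.RandomState(0)
    X = rng.randn(20, 4)
    base_kernels = [np.outer(X[:, k], X[:, k]) for k in range(4)]   # single-feature kernels
    mu = normalize_q(rng.rand(4), q=1.0)                            # point of the L1 simplex
    K_mu = combined_kernel(base_kernels, mu)
    assert np.all(np.linalg.eigvalsh(K_mu) > -1e-10)                # still PSD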

Relation between Norms. Lemma: for $p, q \in (0, +\infty]$ with $p \leq q$, the following holds: $\forall x \in \mathbb{R}^N,\ \|x\|_q \leq \|x\|_p \leq N^{\frac{1}{p} - \frac{1}{q}}\|x\|_q$.
Proof: for the left inequality, observe that for $x \neq 0$,
$\Big[\frac{\|x\|_p}{\|x\|_q}\Big]^p = \sum_{i=1}^N \Big[\frac{|x_i|}{\|x\|_q}\Big]^p \geq \sum_{i=1}^N \Big[\frac{|x_i|}{\|x\|_q}\Big]^q = 1,$
since each ratio is at most $1$ and $p \leq q$. The right inequality follows from Hölder's inequality:
$\|x\|_p = \Big[\sum_{i=1}^N |x_i|^p\Big]^{\frac{1}{p}} \leq \Big[\Big(\sum_{i=1}^N (|x_i|^p)^{\frac{q}{p}}\Big)^{\frac{p}{q}}\Big(\sum_{i=1}^N 1\Big)^{1 - \frac{p}{q}}\Big]^{\frac{1}{p}} = \|x\|_q\, N^{\frac{1}{p} - \frac{1}{q}}.$

Single Kernel Guarantee (Koltchinskii and Panchenko, 2002). Theorem: fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h$ with $\|h\|_{\mathbb{H}_K} \leq 1$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{\mathrm{Tr}[K]}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$

Multiple Kernel Guarantee (Cortes, MM, and Rostamizadeh, 2010). Theorem: fix $\rho > 0$. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_q$ and any integer $1 \leq s \leq r$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{s\,\|\mathbf{u}\|_s}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$
with $\mathbf{u} = (\mathrm{Tr}[K_1], \ldots, \mathrm{Tr}[K_p])$.

Proof. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then
$\widehat{\mathfrak{R}}_S(H_q) = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{h \in H_q}\sum_{i=1}^m \sigma_i h(x_i)\Big]
= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q,\ \alpha^\top K_\mu\alpha \leq 1}\sum_{i,j=1}^m \sigma_i\alpha_j K_\mu(x_i, x_j)\Big]
= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q,\ \alpha^\top K_\mu\alpha \leq 1}\sigma^\top K_\mu\alpha\Big]
= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q,\ \|K_\mu^{1/2}\alpha\| \leq 1}\big(K_\mu^{1/2}\sigma\big)^\top\big(K_\mu^{1/2}\alpha\big)\Big]
= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q}\big[\sigma^\top K_\mu\sigma\big]^{1/2}\Big] \quad\text{(Cauchy-Schwarz)}
= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q}\sqrt{\mu\cdot\mathbf{u}_\sigma}\Big], \quad\text{with } \mathbf{u}_\sigma = (\sigma^\top K_1\sigma, \ldots, \sigma^\top K_p\sigma)^\top
= \frac{1}{m}\,\mathbb{E}_\sigma\big[\sqrt{\|\mathbf{u}_\sigma\|_r}\big] \quad\text{(definition of the dual norm)}.$

Lemma (Cortes, MM, and Rostamizadeh, 2010): let $K$ be a kernel matrix for a finite sample. Then, for any integer $r$,
$\mathbb{E}_\sigma\big[(\sigma^\top K\sigma)^r\big] \leq \big(r\,\mathrm{Tr}[K]\big)^r.$
Proof: combinatorial argument.
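
A quick Monte Carlo sanity check of this inequality (purely illustrative, not part of the slides): estimate $\mathbb{E}_\sigma[(\sigma^\top K\sigma)^r]$ over random Rademacher vectors and compare it to $(r\,\mathrm{Tr}[K])^r$.

    import numpy as np

    rng = np.random.RandomState(0)
    X = rng.randn(10, 3)
    K = X @ X.T                                    # a PSD kernel matrix
    r = 3

    sigma = rng.choice([-1.0, 1.0], size=(20000, K.shape[0]))        # Rademacher draws
    lhs = np.mean(np.einsum("ni,ij,nj->n", sigma, K, sigma) ** r)    # empirical E[(s'Ks)^r]
    rhs = (r * np.trace(K)) ** r
    assert lhs <= rhs
    print(lhs, "<=", rhs)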

Proof (continued). For any integer $1 \leq s \leq r$,
$\widehat{\mathfrak{R}}_S(H_q) = \frac{1}{m}\,\mathbb{E}_\sigma\big[\sqrt{\|\mathbf{u}_\sigma\|_r}\big] \leq \frac{1}{m}\,\mathbb{E}_\sigma\big[\sqrt{\|\mathbf{u}_\sigma\|_s}\big]
= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\Big[\sum_{k=1}^p (\sigma^\top K_k\sigma)^s\Big]^{\frac{1}{2s}}\Big]
\leq \frac{1}{m}\Big[\sum_{k=1}^p \mathbb{E}_\sigma\big[(\sigma^\top K_k\sigma)^s\big]\Big]^{\frac{1}{2s}} \quad\text{(Jensen's inequality)}
\leq \frac{1}{m}\Big[\sum_{k=1}^p \big(s\,\mathrm{Tr}[K_k]\big)^s\Big]^{\frac{1}{2s}} \quad\text{(lemma)}
= \frac{\sqrt{s\,\|\mathbf{u}\|_s}}{m}.$

$L_1$ Learning Bound (Cortes, MM, and Rostamizadeh, 2010). Corollary: fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_1$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{e\lceil\log p\rceil\,\max_{k=1}^p \mathrm{Tr}[K_k]}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$
Observations: weak (logarithmic) dependency on $p$; the bound remains informative even for $p \gg m$, since $\mathrm{Tr}[K_k] \leq m\,\max_x K_k(x, x)$.
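
To get a feel for the scale of the complexity term, a short sketch (the numbers are illustrative choices, not values from the slides) evaluating $\frac{2}{\rho}\sqrt{e\lceil\log p\rceil\max_k \mathrm{Tr}[K_k]}/m$ for base kernels with bounded diagonal:

    import numpy as np

    m, p, rho, R = 1000, 10000, 1.0, 1.0
    max_trace = m * R**2                    # Tr[K_k] <= m * max_x K_k(x, x) = m R^2
    l1_term = (2 / rho) * np.sqrt(np.e * np.ceil(np.log(p)) * max_trace) / m
    print(f"L1 complexity term ~ {l1_term:.3f}  (p = {p}, m = {m})")
    # The term grows only like sqrt(log p), so it stays meaningful even when p >> m.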

Proof. For $q = 1$ (so $r = +\infty$), the bound
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{s\,\|\mathbf{u}\|_s}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$
holds for any integer $s \geq 1$. Moreover,
$s\,\|\mathbf{u}\|_s = s\Big[\sum_{k=1}^p \mathrm{Tr}[K_k]^s\Big]^{\frac{1}{s}} \leq s\,p^{\frac{1}{s}}\max_{k=1}^p \mathrm{Tr}[K_k].$
The function $s \mapsto s\,p^{\frac{1}{s}}$ reaches its minimum at $s = \log p$; choosing the integer $s = \lceil\log p\rceil$ yields the $e\lceil\log p\rceil$ factor of the corollary.

Lower Bound. Tight bound: the $\sqrt{\log p}$ dependency cannot be improved; the argument is based on the VC dimension or an explicit example.
Observations: consider the case $\rho = 1$, $X = \{-1, +1\}^p$, and the canonical projection kernels $K_k(x, x') = x_k x'_k$. Then $H_1$ contains $J_p = \{x \mapsto s\,x_k : k \in [1, p],\ s \in \{-1, +1\}\}$ and $\mathrm{VCdim}(J_p) = \Theta(\log p)$. For $h \in J_p$, $\widehat{R}_\rho(h) = \widehat{R}(h)$, and the VC lower bound is in $\Omega\big(\sqrt{\mathrm{VCdim}(J_p)/m}\big)$.

Pseudo-Dimension Bound (Srebro and Ben-David, 2006). Assume that $K_k(x, x) \leq R^2$ for all $k \in [1, p]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$R(h) \leq \widehat{R}_\rho(h) + \sqrt{\frac{8}{m}\Big[2 + p\log\frac{128\,e\,m^3 R^2}{\rho^2 p} + 256\,\frac{R^2}{\rho^2}\log\frac{\rho\,e\,m}{8R}\log\frac{128\,m\,R^2}{\rho^2} + \log\frac{1}{\delta}\Big]}.$
Observations: the bound is additive in $p$ (modulo log terms); it is not informative for $p > m$; it is based on the pseudo-dimension of the kernel family; similar guarantees hold for other families.

Comparison (plot of the $L_1$ learning bound and the pseudo-dimension bound, for $\rho/R = 0.2$).

$L_q$ Learning Bound (Cortes, MM, and Rostamizadeh, 2010). Corollary: fix $\rho > 0$. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_q$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{r\,p^{\frac{1}{r}}\max_{k=1}^p \mathrm{Tr}[K_k]}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$
Observations: mild dependency on $p$ (of order $p^{\frac{1}{2r}}$); $\mathrm{Tr}[K_k] \leq m\,\max_x K_k(x, x)$.

Lower Bound. Tight bound: the $p^{\frac{1}{2r}}$ dependency cannot be improved; in particular, the $p^{\frac{1}{4}}$ dependency is tight for $L_2$ regularization.
Observations: take all base kernels equal, $K_1 = \cdots = K_p$. Then $\sum_{k=1}^p \mu_k K_k = \big(\sum_{k=1}^p \mu_k\big) K_1$ and, for $\mu \geq 0$, $\sum_{k=1}^p \mu_k \leq p^{1 - \frac{1}{q}}\|\mu\|_q = p^{\frac{1}{r}}$ (Hölder's inequality). Thus, $H_q$ coincides with $\{h \in \mathbb{H}_{K_1} : \|h\|_{\mathbb{H}_{K_1}} \leq p^{\frac{1}{2r}}\}$.

Outline: kernel methods; learning kernels scenario; learning bounds; algorithms.

General LK Formulation - SVMs. Notation: $\mathcal{K}$, a set of PDS kernel functions; $\mathbf{K}$, the set of kernel matrices associated to $\mathcal{K}$, assumed convex; $\mathbf{Y} \in \mathbb{R}^{m\times m}$, the diagonal matrix with $\mathbf{Y}_{ii} = y_i$.
Optimization problem:
$\min_{K \in \mathbf{K}}\ \max_{\alpha}\ 2\,\mathbf{1}^\top\alpha - \alpha^\top\mathbf{Y}K\mathbf{Y}\alpha$ subject to: $0 \leq \alpha \leq C,\ \alpha^\top y = 0$.
Convex problem: the objective is a pointwise maximum over $\alpha$ of functions linear in $K$, hence convex in $K$.

Parameterized LK Formulation. Notation: $(K_\mu)_{\mu \in M}$, a parameterized set of PDS kernel functions, with $M$ a convex set and $\mu \mapsto K_\mu$ a concave function; $\mathbf{Y} \in \mathbb{R}^{m\times m}$, the diagonal matrix with $\mathbf{Y}_{ii} = y_i$.
Optimization problem:
$\min_{\mu \in M}\ \max_{\alpha}\ 2\,\mathbf{1}^\top\alpha - \alpha^\top\mathbf{Y}K_\mu\mathbf{Y}\alpha$ subject to: $0 \leq \alpha \leq C,\ \alpha^\top y = 0$.
Convex problem: the objective is a pointwise maximum over $\alpha$ of functions convex in $\mu$, hence convex in $\mu$.

Non-Negative Combinations. Let $K_\mu = \sum_{k=1}^p \mu_k K_k$ with $\mu \in \Delta_1$. By von Neumann's generalized minimax theorem (convexity w.r.t. $\mu$, concavity w.r.t. $\alpha$, $\Delta_1$ convex and compact, $A$ convex and compact, with $A = \{\alpha : 0 \leq \alpha \leq C,\ \alpha^\top y = 0\}$):
$\min_{\mu \in \Delta_1}\max_{\alpha \in A}\ 2\,\mathbf{1}^\top\alpha - \alpha^\top\mathbf{Y}K_\mu\mathbf{Y}\alpha
= \max_{\alpha \in A}\min_{\mu \in \Delta_1}\ 2\,\mathbf{1}^\top\alpha - \alpha^\top\mathbf{Y}K_\mu\mathbf{Y}\alpha
= \max_{\alpha \in A}\ 2\,\mathbf{1}^\top\alpha - \max_{\mu \in \Delta_1}\alpha^\top\mathbf{Y}K_\mu\mathbf{Y}\alpha
= \max_{\alpha \in A}\ 2\,\mathbf{1}^\top\alpha - \max_{k \in [1, p]}\alpha^\top\mathbf{Y}K_k\mathbf{Y}\alpha.$

Non-Negative Combinations (Lanckriet et al., 2004). Optimization problem: in view of the previous analysis, the problem can be rewritten as the following QCQP:
$\max_{\alpha, t}\ 2\,\mathbf{1}^\top\alpha - t$ subject to: $\forall k \in [1, p],\ t \geq \alpha^\top\mathbf{Y}K_k\mathbf{Y}\alpha;\ 0 \leq \alpha \leq C;\ \alpha^\top y = 0$.
Complexity (interior-point methods): $O(pm^3)$.
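
A compact sketch of this QCQP using cvxpy (assuming cvxpy is available; the base kernels are made numerically PSD with a small ridge, and the dataset and variable names are illustrative):

    import numpy as np
    import cvxpy as cp

    rng = np.random.RandomState(0)
    m, p, C = 30, 3, 1.0
    X = rng.randn(m, 5)
    y = np.sign(X[:, 0] + 0.1 * rng.randn(m))
    Ks = [np.outer(X[:, k], X[:, k]) + 1e-6 * np.eye(m) for k in range(p)]  # PSD base kernels

    alpha = cp.Variable(m)
    t = cp.Variable()
    constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
    # One quadratic constraint per base kernel: t >= alpha' Y K_k Y alpha.
    constraints += [cp.quad_form(cp.multiply(y, alpha), K) <= t for K in Ks]
    problem = cp.Problem(cp.Maximize(2 * cp.sum(alpha) - t), constraints)
    problem.solve()
    print("optimal value:", problem.value)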

Equivalent Primal Formulation. Optimization problem:
$\min_{w,\ \mu \in \Delta_q}\ \frac{1}{2}\sum_{k=1}^p \frac{\|w_k\|^2}{\mu_k} + C\sum_{i=1}^m \max\Big(0,\ 1 - y_i\sum_{k=1}^p w_k\cdot\Phi_k(x_i)\Big).$

Lots of Optimization Solutions. QCQP (Lanckriet et al., 2004). Wrapper methods interleaving a call to an SVM solver with an update of $\mu$: SILP (Sonnenburg et al., 2006); reduced gradient (SimpleMKL) (Rakotomamonjy et al., 2008); Newton's method (Kloft et al., 2009); mirror descent (Nath et al., 2009); on-line method (Orabona and Jie, 2011). Many other methods have been proposed.

Does It Work? Experiments: this algorithm and its different optimization solutions often do not significantly outperform the simple uniform combination kernel in practice, an observation corroborated by several NIPS workshops. Alternative algorithms do yield significant improvements (see the empirical results of (Gönen and Alpaydin, 2011)): centered alignment-based LK algorithms (Cortes, MM, and Rostamizadeh, 2010 and 2012) and non-linear combinations of kernels (Cortes, MM, and Rostamizadeh, 2009).

LK Formulation - KRR (Cortes, MM, and Rostamizadeh, 2009). Kernel family: non-negative combinations with $L_q$ regularization.
Optimization problem:
$\min_{\mu}\ \max_{\alpha}\ -\alpha^\top\Big(\sum_{k=1}^p \mu_k K_k + \lambda I\Big)\alpha + 2\,\alpha^\top y$ subject to: $\mu \geq 0,\ \|\mu - \mu_0\|_q \leq \Lambda$.
Convex optimization: the objective is a pointwise maximum over $\alpha$ of functions linear in $\mu$, hence convex in $\mu$.

Projected Gradient. The maximization problem in $\alpha$ admits the closed-form solution $\alpha = (K_\mu + \lambda I)^{-1} y$, which reduces the problem to
$\min_{\mu}\ y^\top(K_\mu + \lambda I)^{-1} y$ subject to: $\mu \geq 0,\ \|\mu - \mu_0\|_2 \leq \Lambda$.
This is a convex optimization problem; one solution is projection-based gradient descent, using
$\frac{\partial F}{\partial \mu_k} = \mathrm{Tr}\Big[\frac{\partial\, y^\top(K_\mu + \lambda I)^{-1}y}{\partial(K_\mu + \lambda I)}\,\frac{\partial(K_\mu + \lambda I)}{\partial\mu_k}\Big] = -\mathrm{Tr}\Big[(K_\mu + \lambda I)^{-1}yy^\top(K_\mu + \lambda I)^{-1}\frac{\partial(K_\mu + \lambda I)}{\partial\mu_k}\Big] = -\mathrm{Tr}\Big[(K_\mu + \lambda I)^{-1}yy^\top(K_\mu + \lambda I)^{-1}K_k\Big] = -y^\top(K_\mu + \lambda I)^{-1}K_k(K_\mu + \lambda I)^{-1}y = -\alpha^\top K_k\alpha.$

Proj. Grad. KRR - $L_2$ Reg.
ProjectionBasedGradientDescent$((K_k)_{k\in[1,p]}, \mu_0)$:
    $\mu' \leftarrow \mu_0$; initialize $\mu$ with any value such that $\|\mu' - \mu\| > \epsilon$
    while $\|\mu' - \mu\| > \epsilon$ do
        $\mu \leftarrow \mu'$
        $\alpha \leftarrow (K_\mu + \lambda I)^{-1} y$
        $\mu' \leftarrow \mu + \eta\,(\alpha^\top K_1\alpha, \ldots, \alpha^\top K_p\alpha)$
        for $k \leftarrow 1$ to $p$ do $\mu'_k \leftarrow \max(0, \mu'_k)$
        $\mu' \leftarrow \mu_0 + \Lambda\,\frac{\mu' - \mu_0}{\|\mu' - \mu_0\|}$
    return $\mu'$
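
A minimal Python sketch of this projected-gradient scheme (assuming the $L_2$ constraint $\|\mu - \mu_0\| \leq \Lambda$, step size eta, and tolerance eps; all names are illustrative, and details such as the stopping test and the projection, applied here only when the iterate leaves the ball, follow the pseudocode above only loosely):

    import numpy as np

    def projected_gradient_krr(base_kernels, y, mu0, lam=1.0, Lambda=1.0,
                               eta=0.01, eps=1e-4, max_iter=200):
        """Minimize y' (K_mu + lam I)^{-1} y over mu >= 0, ||mu - mu0|| <= Lambda."""
        m = y.shape[0]
        mu_new = mu0.copy()
        mu = mu0 + 10 * eps                  # any value entering the loop
        it = 0
        while np.linalg.norm(mu_new - mu) > eps and it < max_iter:
            mu = mu_new
            K_mu = sum(w * K for w, K in zip(mu, base_kernels))
            alpha = np.linalg.solve(K_mu + lam * np.eye(m), y)
            grad = np.array([alpha @ K @ alpha for K in base_kernels])  # equals -dF/dmu_k
            mu_new = np.maximum(mu + eta * grad, 0.0)                   # gradient step + positivity
            diff = mu_new - mu0
            norm = np.linalg.norm(diff)
            if norm > Lambda:                                           # project onto the L2 ball
                mu_new = mu0 + Lambda * diff / norm
            it += 1
        return mu_new

    rng = np.random.RandomState(0)
    X = rng.randn(25, 3)
    y = X[:, 0] + 0.1 * rng.randn(25)
    Ks = [np.outer(X[:, k], X[:, k]) for k in range(3)]
    mu = projected_gradient_krr(Ks, y, mu0=np.ones(3))
    print("learned kernel weights:", mu)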

Interpolated Step KRR - $L_2$ Reg.
InterpolatedIterativeAlgorithm$((K_k)_{k\in[1,p]}, \mu_0)$:
    $\alpha' \leftarrow (K_{\mu_0} + \lambda I)^{-1} y$; initialize $\alpha$ with any value such that $\|\alpha' - \alpha\| > \epsilon$
    while $\|\alpha' - \alpha\| > \epsilon$ do
        $\alpha \leftarrow \alpha'$
        $v \leftarrow (\alpha^\top K_1\alpha, \ldots, \alpha^\top K_p\alpha)$
        $\mu \leftarrow \mu_0 + \Lambda\,\frac{v}{\|v\|}$
        $\alpha' \leftarrow \eta\,\alpha + (1 - \eta)(K_\mu + \lambda I)^{-1} y$
    return $\alpha'$
Simple and very efficient: few iterations (fewer than 15) in practice.
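
And a corresponding Python sketch of the interpolated update (the interpolation parameter eta, tolerance eps, and iteration cap are illustrative choices, not values from the slides):

    import numpy as np

    def interpolated_krr_kernel_learning(base_kernels, y, mu0, lam=1.0, Lambda=1.0,
                                         eta=0.5, eps=1e-6, max_iter=50):
        """Alternate a closed-form alpha update with an L2-ball update of mu."""
        m = y.shape[0]
        K_mu0 = sum(w * K for w, K in zip(mu0, base_kernels))
        alpha_new = np.linalg.solve(K_mu0 + lam * np.eye(m), y)
        alpha = alpha_new + 10 * eps                 # any value entering the loop
        it = 0
        while np.linalg.norm(alpha_new - alpha) > eps and it < max_iter:
            alpha = alpha_new
            v = np.array([alpha @ K @ alpha for K in base_kernels])
            mu = mu0 + Lambda * v / np.linalg.norm(v)
            K_mu = sum(w * K for w, K in zip(mu, base_kernels))
            alpha_new = eta * alpha + (1 - eta) * np.linalg.solve(K_mu + lam * np.eye(m), y)
            it += 1
        return mu, alpha_new

    rng = np.random.RandomState(1)
    X = rng.randn(25, 3)
    y = X[:, 1] + 0.1 * rng.randn(25)
    Ks = [np.outer(X[:, k], X[:, k]) for k in range(3)]
    mu, alpha = interpolated_krr_kernel_learning(Ks, y, mu0=np.ones(3))
    print("mu:", mu)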

$L_2$-Regularized Combinations (Cortes, MM, and Rostamizadeh, 2009). Dense combinations are beneficial when using many kernels. Combining kernels based on single features can be viewed as principled feature weighting. [Plots: performance as a function of the number of single-feature base kernels on the DVD and Reuters (acq) tasks, comparing the uniform-combination baseline with $L_1$- and $L_2$-regularized combinations.]

Conclusion. Solid theoretical guarantees suggest using a large number of base kernels. There is a broad literature on optimization techniques, but these often bring no significant improvement over the uniform combination. Recent algorithms do achieve significant improvements, in particular non-linear combinations. Many theoretical and algorithmic questions remain to be explored.

References
Olivier Bousquet and Daniel J. L. Herrmann. On the complexity of learning the kernel matrix. In NIPS, 2002.
Corinna Cortes, Marius Kloft, and Mehryar Mohri. Learning kernels using local Rademacher complexity. In NIPS, 2013.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combinations of kernels. In NIPS, 2009.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In ICML, 2010.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Two-stage learning kernel methods. In ICML, 2010.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. JMLR 13:795-828, 2012.

References
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Tutorial: Learning Kernels. ICML 2011, Bellevue, Washington, July 2011.
Zakria Hussain and John Shawe-Taylor. Improved loss bounds for multiple kernel learning. In AISTATS, 2011 [see arXiv for corrected version].
Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. JMLR 13:1865-1890, 2012.
V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.
Vladimir Koltchinskii and Ming Yuan. Sparse recovery in large ensembles of kernel machines on-line learning and bandits. In COLT, 2008.
Gert Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael Jordan. Learning the kernel matrix with semidefinite programming. JMLR, 5, 2004.

References
Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. JMLR 12:2211-2268, 2011.
Nathan Srebro and Shai Ben-David. Learning bounds for support vector machines with learned kernels. In COLT, 2006.
Yiming Ying and Colin Campbell. Generalization bounds for learning the kernel problem. In COLT, 2009.