Advanced Machine Learning


1 Advanced Machine Learning: Learning Kernels. Mehryar Mohri, Courant Institute & Google Research.

2 Outline
- Kernel methods.
- Learning kernels scenario.
- Learning bounds.
- Algorithms.

3 Machine Learning Components
[diagram: user → features → sample → algorithm → hypothesis h]

4 Machine Learning Components
[diagram as above, annotated: choosing the features is the critical task; the algorithm is the main focus of the ML literature]

5 Kernel Methods
- Features $\Phi \colon X \to \mathbb{H}_K$ implicitly defined via the choice of a PDS kernel $K$, interpreted as a similarity measure: $\forall x, y \in X$, $\langle \Phi(x), \Phi(y) \rangle = K(x, y)$.
- Flexibility: the PDS kernel can be chosen arbitrarily.
- Helps extend a variety of algorithms to non-linear predictors, e.g., SVMs, KRR, SVR, KPCA.
- The PDS condition is directly related to the convexity of the optimization problem.

6 Example - Polynomial Kernels
- Definition: $\forall x, y \in \mathbb{R}^N$, $K(x, y) = (x \cdot y + c)^d$, with $c > 0$.
- Example: for $N = 2$ and $d = 2$,
$K(x, y) = (x_1 y_1 + x_2 y_2 + c)^2 = \big(x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2c}\,x_1,\ \sqrt{2c}\,x_2,\ c\big) \cdot \big(y_1^2,\ y_2^2,\ \sqrt{2}\,y_1 y_2,\ \sqrt{2c}\,y_1,\ \sqrt{2c}\,y_2,\ c\big).$
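
As a quick numeric check of this identity, here is a minimal NumPy sketch (the function names poly_kernel and phi and the test points are our choices, not from the slides): the kernel value coincides with the inner product of the explicit six-dimensional features.

import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    # Polynomial kernel K(x, y) = (x . y + c)^d.
    return (np.dot(x, y) + c) ** d

def phi(x, c=1.0):
    # Explicit feature map for N = 2, d = 2 (six dimensions).
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2.0) * x1 * x2,
                     np.sqrt(2.0 * c) * x1,
                     np.sqrt(2.0 * c) * x2,
                     c])

x, y = np.array([1.0, -2.0]), np.array([0.5, 3.0])
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))  # kernel = inner product of features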

7 XOR Problem
Use a second-degree polynomial kernel with $c = 1$. The four XOR points map to the six-dimensional features $(x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ 1)$:
$(1, 1) \mapsto (1, 1, +\sqrt{2}, +\sqrt{2}, +\sqrt{2}, 1)$, $(-1, 1) \mapsto (1, 1, -\sqrt{2}, -\sqrt{2}, +\sqrt{2}, 1)$,
$(-1, -1) \mapsto (1, 1, +\sqrt{2}, -\sqrt{2}, -\sqrt{2}, 1)$, $(1, -1) \mapsto (1, 1, -\sqrt{2}, +\sqrt{2}, -\sqrt{2}, 1)$.
[figure: the four points plotted in the input coordinates $(x_1, x_2)$ and in the feature coordinates $(x_1^2, \sqrt{2}\,x_1 x_2)$]
Linearly non-separable in the input space; linearly separable in the feature space by the hyperplane $x_1 x_2 = 0$.

8 Other Standard PDS Kernels
- Gaussian kernels: $K(x, y) = \exp\Big(-\frac{\|x - y\|^2}{2\sigma^2}\Big)$, $\sigma \neq 0$; normalized kernel of $(x, x') \mapsto \exp\big(\frac{x \cdot x'}{\sigma^2}\big)$.
- Sigmoid kernels: $K(x, y) = \tanh(a (x \cdot y) + b)$, $a, b \geq 0$.
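
A minimal NumPy sketch of how a Gaussian Gram matrix might be computed (an illustration; the function name and the squared-distance expansion are our choices):

import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows of X.
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))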

9 SVM (Cortes and Vapnik, 1995; Boser, Guyon, and Vapnik, 1992)
Primal:
$\min_{w, b} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^m \max\big(0,\ 1 - y_i (w \cdot \Phi(x_i) + b)\big).$
Dual:
$\max_{\alpha} \ \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to: $0 \leq \alpha_i \leq C \wedge \sum_{i=1}^m \alpha_i y_i = 0,\ i \in [1, m]$.
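
For illustration, one way to solve this dual numerically is to hand a precomputed Gram matrix to an off-the-shelf solver; a minimal sketch using scikit-learn's SVC with kernel="precomputed" on toy XOR-like data (the data and parameter values are arbitrary choices of ours):

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # XOR-like labels

K = rbf_kernel(X, X, gamma=0.5)              # Gram matrix K(x_i, x_j)
clf = SVC(C=1.0, kernel="precomputed")       # maximizes the dual above internally
clf.fit(K, y)
print(clf.score(K, y))                       # training accuracy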

10 Kernel Ridge Regression (Hoerl and Kennard, 1970; Saunders et al., 1998)
Primal:
$\min_{w} \ \lambda \|w\|^2 + \sum_{i=1}^m (w \cdot \Phi(x_i) - y_i)^2.$
Dual:
$\max_{\alpha \in \mathbb{R}^m} \ -\alpha^\top (K + \lambda I)\alpha + 2\,\alpha^\top y.$
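
Since the dual maximizer is available in closed form, $\alpha = (K + \lambda I)^{-1} y$, a few lines of NumPy suffice; a minimal sketch (the function names and the lam parameter standing for $\lambda$ are our notation):

import numpy as np

def krr_dual_fit(K, y, lam=0.1):
    # alpha = (K + lam * I)^{-1} y, the maximizer of the dual objective.
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def krr_predict(K_cross, alpha):
    # h(x') = sum_i alpha_i K(x_i, x'); K_cross[j, i] = K(x_i, x'_j) for test points x'_j.
    return K_cross @ alpha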

11 Questions
How should the user choose the kernel?
- problem similar to that of selecting features for other learning algorithms.
- poor choice: learning made very difficult.
- good choice: even poor learners could succeed.
The requirement from the user is thus critical:
- can this requirement be lessened?
- is a more automatic selection of features possible?

12 Outline
- Kernel methods.
- Learning kernels scenario.
- Learning bounds.
- Algorithms.

13 Standard Learning with Kernels
[diagram: user supplies the kernel K; sample → algorithm → hypothesis h]

14 Learning Kernel Framework
[diagram: user supplies a kernel family $\mathcal{K}$; sample → algorithm → pair $(K, h)$]

15 Kernel Families
Most frequently used kernel families, $q \geq 1$:
$\mathcal{K}_q = \Big\{ K_\mu : K_\mu = \sum_{k=1}^p \mu_k K_k,\ \mu = (\mu_1, \ldots, \mu_p)^\top \in \Delta_q \Big\}$ with $\Delta_q = \{\mu : \mu \geq 0,\ \|\mu\|_q = 1\}$.
Hypothesis sets:
$H_q = \big\{ h \in \mathbb{H}_K : K \in \mathcal{K}_q,\ \|h\|_{\mathbb{H}_K} \leq 1 \big\}.$

16 Relation between Norms
Lemma: for $p, q \in (0, +\infty]$ with $p \leq q$, the following holds: $\forall x \in \mathbb{R}^N$,
$\|x\|_q \leq \|x\|_p \leq N^{\frac{1}{p} - \frac{1}{q}} \|x\|_q.$
Proof: for the left inequality, observe that for $x \neq 0$,
$\Big[\frac{\|x\|_p}{\|x\|_q}\Big]^p = \sum_{i=1}^N \Big[\frac{|x_i|}{\|x\|_q}\Big]^p \geq \sum_{i=1}^N \Big[\frac{|x_i|}{\|x\|_q}\Big]^q = 1,$
since each ratio $|x_i|/\|x\|_q$ is at most $1$ and $p \leq q$. The right inequality follows from Hölder's inequality:
$\|x\|_p = \Big[\sum_{i=1}^N |x_i|^p\Big]^{\frac{1}{p}} \leq \Big[\sum_{i=1}^N \big(|x_i|^p\big)^{\frac{q}{p}}\Big]^{\frac{1}{p}\cdot\frac{p}{q}} \Big[\sum_{i=1}^N 1^{\frac{q}{q-p}}\Big]^{\frac{1}{p}\big(1 - \frac{p}{q}\big)} = \|x\|_q\, N^{\frac{1}{p} - \frac{1}{q}}.$
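
The lemma is easy to check numerically; a small NumPy sketch (an illustration, with arbitrary choices of $p$, $q$, and $N$):

import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.normal(size=N)
for p, q in [(1.0, 2.0), (2.0, 3.0), (1.0, 4.0)]:
    norm_p = np.linalg.norm(x, p)
    norm_q = np.linalg.norm(x, q)
    # ||x||_q <= ||x||_p <= N^(1/p - 1/q) ||x||_q
    assert norm_q <= norm_p <= N ** (1.0 / p - 1.0 / q) * norm_q + 1e-9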

17 Single Kernel Guarantee (Koltchinskii and Panchenko, 2002)
Theorem: fix $\rho > 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_1$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{\mathrm{Tr}[K]}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$

18 Multiple Kernel Guarantee (Cortes, MM, and Rostamizadeh, 2010)
Theorem: fix $\rho > 0$. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_q$ and any integer $1 \leq s \leq r$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{s\,\|u\|_s}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$
with $u = \big(\mathrm{Tr}[K_1], \ldots, \mathrm{Tr}[K_p]\big)^\top$.

19 Proof
Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then:
$\widehat{\mathfrak{R}}_S(H_q) = \frac{1}{m} E_\sigma\Big[\sup_{h \in H_q} \sum_{i=1}^m \sigma_i h(x_i)\Big]$
$= \frac{1}{m} E_\sigma\Big[\sup_{\mu \in \Delta_q,\ \alpha^\top K_\mu \alpha \leq 1} \sum_{i,j=1}^m \sigma_i \alpha_j K_\mu(x_i, x_j)\Big]$
$= \frac{1}{m} E_\sigma\Big[\sup_{\mu \in \Delta_q,\ \|K_\mu^{1/2}\alpha\| \leq 1} \big(K_\mu^{1/2}\sigma\big) \cdot \big(K_\mu^{1/2}\alpha\big)\Big]$
$= \frac{1}{m} E_\sigma\Big[\sup_{\mu \in \Delta_q} \sqrt{\sigma^\top K_\mu \sigma}\Big]$ (Cauchy-Schwarz)
$= \frac{1}{m} E_\sigma\Big[\sup_{\mu \in \Delta_q} \sqrt{\mu \cdot u_\sigma}\Big]$, with $u_\sigma = \big(\sigma^\top K_1 \sigma, \ldots, \sigma^\top K_p \sigma\big)^\top$
$= \frac{1}{m} E_\sigma\big[\sqrt{\|u_\sigma\|_r}\big].$ (definition of the dual norm)

20 Lemma (Cortes, MM, and Rostamizadeh, 2010)
Lemma: let $K$ be a kernel matrix for a finite sample. Then, for any integer $r$,
$E_\sigma\big[(\sigma^\top K \sigma)^r\big] \leq \big(r\,\mathrm{Tr}[K]\big)^r.$
Proof: combinatorial argument.

21 Proof
For any $1 \leq s \leq r$:
$\widehat{\mathfrak{R}}_S(H_q) = \frac{1}{m} E_\sigma\big[\sqrt{\|u_\sigma\|_r}\big] \leq \frac{1}{m} E_\sigma\big[\sqrt{\|u_\sigma\|_s}\big]$
$= \frac{1}{m} E_\sigma\Big[\Big(\sum_{k=1}^p (\sigma^\top K_k \sigma)^s\Big)^{\frac{1}{2s}}\Big]$
$\leq \frac{1}{m} \Big[\sum_{k=1}^p E_\sigma\big[(\sigma^\top K_k \sigma)^s\big]\Big]^{\frac{1}{2s}}$ (Jensen's inequality)
$\leq \frac{1}{m} \Big[\sum_{k=1}^p \big(s\,\mathrm{Tr}[K_k]\big)^s\Big]^{\frac{1}{2s}}$ (lemma)
$= \frac{\sqrt{s\,\|u\|_s}}{m}.$

22 $L_1$ Learning Bound (Cortes, MM, and Rostamizadeh, 2010)
Corollary: fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_1$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{e\lceil \log p \rceil \max_{k=1}^p \mathrm{Tr}[K_k]}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$
- weak dependency on $p$.
- bound valid for $p \gg m$: since $\mathrm{Tr}[K_k] \leq m \max_x K_k(x, x)$, the middle term is at most $\frac{2}{\rho}\sqrt{\frac{e\lceil \log p \rceil}{m}}\,\max_x \sqrt{K_k(x, x)}$.

23 Proof
For $q = 1$, the bound holds for any integer $s \geq 1$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{s\,\|u\|_s}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}},$
with $s\,\|u\|_s = s\Big[\sum_{k=1}^p \mathrm{Tr}[K_k]^s\Big]^{\frac{1}{s}} \leq s\,p^{\frac{1}{s}} \max_{k=1}^p \mathrm{Tr}[K_k].$
The function $s \mapsto s\,p^{1/s}$ reaches its minimum at $s = \log p$; choosing the integer $s = \lceil \log p \rceil$ gives $s\,p^{1/s} \leq e\lceil \log p \rceil$, which yields the corollary.

24 Lower Bound
Tight bound: the $\sqrt{\log p}$ dependency cannot be improved.
- argument based on VC dimension, or by example.
Observations:
- take $X = \{-1, +1\}^p$ and the canonical projection kernels $K_k(x, x') = x_k x'_k$; then $H_1$ contains $J_p = \{x \mapsto s\,x_k : k \in [1, p],\ s \in \{-1, +1\}\}$.
- $\mathrm{VCdim}(J_p) = \Theta(\log p)$.
- for $\rho = 1$ and $h \in J_p$, the margin loss coincides with the classification error: $\widehat{R}_\rho(h) = \widehat{R}(h)$.
- VC lower bound: $\Omega\big(\sqrt{\mathrm{VCdim}(J_p)/m}\big)$.

25 Pseudo-Dimension Bound (Srebro and Ben-David, 2006)
Assume that $K_k(x, x) \leq R^2$ for all $k \in [1, p]$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$R(h) \leq \widehat{R}_\rho(h) + \sqrt{\frac{8}{m}\Big[2 + p \log\frac{128\,e\,m^3 R^2}{\rho^2\,p} + 256\,\frac{R^2}{\rho^2}\log\frac{\rho\,e\,m}{8R}\,\log\frac{128\,m R^2}{\rho^2}\Big]}$, plus the usual $\sqrt{\log(1/\delta)/m}$-type confidence term.
- bound additive in $p$ (modulo log terms); not informative for $p > m$.
- based on the pseudo-dimension of the kernel family.
- similar guarantees for other families.

26 Comparison
[figure: comparison of the $L_1$ and pseudo-dimension bound complexity terms as a function of $p$, for $\rho/R = 0.2$]

27 $L_q$ Learning Bound (Cortes, MM, and Rostamizadeh, 2010)
Corollary: fix $\rho > 0$. Let $q, r \geq 1$ with $\frac{1}{q} + \frac{1}{r} = 1$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in H_q$:
$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho}\frac{\sqrt{r\,p^{\frac{1}{r}} \max_{k=1}^p \mathrm{Tr}[K_k]}}{m} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$
- mild dependency on $p$.
- $\mathrm{Tr}[K_k] \leq m \max_x K_k(x, x)$.

28 Lower Bound
Tight bound: the $p^{\frac{1}{2r}}$ dependency cannot be improved; in particular, $p^{\frac{1}{4}}$ is tight for $L_2$ regularization.
Observations: take equal kernels, $K_1 = \cdots = K_p$. Then:
- $K_\mu = \sum_{k=1}^p \mu_k K_k = \big(\sum_{k=1}^p \mu_k\big) K_1$, and $\|h\|^2_{\mathbb{H}_{K_\mu}} = \|h\|^2_{\mathbb{H}_{K_1}} \big/ \sum_{k=1}^p \mu_k$ for $\sum_{k=1}^p \mu_k \neq 0$.
- $\sum_{k=1}^p \mu_k \leq p^{\frac{1}{r}} \|\mu\|_q = p^{\frac{1}{r}}$ (Hölder's inequality).
- thus, $H_q$ coincides with $\{h \in \mathbb{H}_{K_1} : \|h\|_{\mathbb{H}_{K_1}} \leq p^{\frac{1}{2r}}\}$.

29 Outline
- Kernel methods.
- Learning kernels scenario.
- Learning bounds.
- Algorithms.

30 General LK Formulation - SVMs
Notation:
- $\mathcal{K}$: set of PDS kernel functions.
- $\mathbf{K}$: set of kernel matrices associated to $\mathcal{K}$, assumed convex.
- $Y \in \mathbb{R}^{m \times m}$: diagonal matrix with $Y_{ii} = y_i$.
Optimization problem:
$\min_{K \in \mathbf{K}} \max_{\alpha} \ 2\,\mathbf{1}^\top\alpha - \alpha^\top Y K Y \alpha$ subject to: $0 \leq \alpha \leq C \wedge \alpha^\top y = 0$.
Convex problem: the function is linear in $K$, and a pointwise maximum of convex functions is convex.

31 Parameterized LK Formulation
Notation:
- $(K_\mu)_{\mu \in M}$: parameterized set of PDS kernel functions, with $M$ a convex set and $\mu \mapsto K_\mu$ a concave function.
- $Y \in \mathbb{R}^{m \times m}$: diagonal matrix with $Y_{ii} = y_i$.
Optimization problem:
$\min_{\mu \in M} \max_{\alpha} \ 2\,\mathbf{1}^\top\alpha - \alpha^\top Y K_\mu Y \alpha$ subject to: $0 \leq \alpha \leq C \wedge \alpha^\top y = 0$.
Convex problem: the function is convex in $\mu$, and a pointwise maximum of convex functions is convex.

32 Non-Negative Combinations
Let $K_\mu = \sum_{k=1}^p \mu_k K_k$ with $\mu \in \Delta_1$. By von Neumann's generalized minimax theorem (convexity w.r.t. $\mu$, concavity w.r.t. $\alpha$, $\Delta_1$ convex and compact, $A$ convex and compact):
$\min_{\mu \in \Delta_1} \max_{\alpha \in A} \ 2\,\mathbf{1}^\top\alpha - \alpha^\top Y K_\mu Y \alpha = \max_{\alpha \in A} \min_{\mu \in \Delta_1} \ 2\,\mathbf{1}^\top\alpha - \alpha^\top Y K_\mu Y \alpha$
$= \max_{\alpha \in A} \ 2\,\mathbf{1}^\top\alpha - \max_{\mu \in \Delta_1} \alpha^\top Y K_\mu Y \alpha$
$= \max_{\alpha \in A} \ 2\,\mathbf{1}^\top\alpha - \max_{k \in [1, p]} \alpha^\top Y K_k Y \alpha.$

33 Non-Negative Combinations (Lanckriet et al., 2004)
Optimization problem: in view of the previous analysis, the problem can be rewritten as the following QCQP:
$\max_{\alpha, t} \ 2\,\mathbf{1}^\top\alpha - t$
subject to: $\forall k \in [1, p],\ t \geq \alpha^\top Y K_k Y \alpha$; $0 \leq \alpha \leq C \wedge \alpha^\top y = 0$.
Complexity (interior-point methods): $O(p\,m^3)$.
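
As an illustration of this QCQP (a sketch of ours, not the original authors' code), one can express each quadratic constraint through a Cholesky-type factor of $K_k$ and hand the problem to cvxpy; the function name and the small diagonal jitter are our choices:

import cvxpy as cp
import numpy as np

def lk_svm_qcqp(Ks, y, C=1.0):
    # Ks: list of p PSD kernel matrices (m x m); y: labels in {-1, +1} as floats.
    m = len(y)
    alpha, t = cp.Variable(m), cp.Variable()
    Ya = cp.multiply(y, alpha)                    # Y alpha, with Y = diag(y)
    constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
    for K in Ks:
        # t >= alpha' Y K Y alpha, via ||L' Y alpha||^2 with K ~= L L'
        L = np.linalg.cholesky(K + 1e-9 * np.eye(m))
        constraints.append(cp.sum_squares(L.T @ Ya) <= t)
    problem = cp.Problem(cp.Maximize(2 * cp.sum(alpha) - t), constraints)
    problem.solve()
    return alpha.value, t.value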

34 Equivalent Primal Formulation
Optimization problem:
$\min_{w,\ \mu \in \Delta_q} \ \frac{1}{2} \sum_{k=1}^p \frac{\|w_k\|^2}{\mu_k} + C \sum_{i=1}^m \max\Big(0,\ 1 - y_i \sum_{k=1}^p w_k \cdot \Phi_k(x_i)\Big).$

35 Lots of Optimization Solutions
- QCQP (Lanckriet et al., 2004).
- Wrapper methods interleaving calls to an SVM solver with updates of $\mu$:
  - SILP (Sonnenburg et al., 2006).
  - reduced gradient (SimpleMKL) (Rakotomamonjy et al., 2008).
  - Newton's method (Kloft et al., 2009).
  - mirror descent (Nath et al., 2009).
  - on-line method (Orabona and Jie, 2011).
- Many other methods proposed.

36 Does It Work?
Experiments:
- this algorithm and its different optimization solutions often do not significantly outperform the simple uniform combination kernel in practice!
- observations corroborated by NIPS workshops.
Alternative algorithms with significant improvement (see the empirical results of (Gönen and Alpaydin, 2011)):
- centered alignment-based LK algorithms (Cortes, MM, and Rostamizadeh, 2010 and 2012).
- non-linear combinations of kernels (Cortes, MM, and Rostamizadeh, 2009).

37 LK Formulation - KRR (Cortes, MM, and Rostamizadeh, 2009)
Kernel family: non-negative combinations with $L_q$ regularization.
Optimization problem:
$\min_{\mu} \max_{\alpha} \ -\alpha^\top\Big(\sum_{k=1}^p \mu_k K_k + \lambda I\Big)\alpha + 2\,\alpha^\top y$
subject to: $\mu \geq 0 \wedge \|\mu - \mu_0\|_q \leq \Lambda$.
Convex optimization: linearity in $\mu$ and convexity of the pointwise maximum.

38 Projected Gradient
Solving the maximization problem in $\alpha$ (closed-form solution $\alpha = (K_\mu + \lambda I)^{-1} y$) reduces the problem to
$\min_{\mu} \ y^\top (K_\mu + \lambda I)^{-1} y$ subject to: $\mu \geq 0 \wedge \|\mu - \mu_0\|_2 \leq \Lambda$.
Convex optimization problem; one solution uses projection-based gradient descent, with
$\frac{\partial F}{\partial \mu_k} = -\mathrm{Tr}\Big[(K_\mu + \lambda I)^{-1} y y^\top (K_\mu + \lambda I)^{-1} \frac{\partial (K_\mu + \lambda I)}{\partial \mu_k}\Big]$
$= -\mathrm{Tr}\big[(K_\mu + \lambda I)^{-1} y y^\top (K_\mu + \lambda I)^{-1} K_k\big]$
$= -y^\top (K_\mu + \lambda I)^{-1} K_k (K_\mu + \lambda I)^{-1} y = -\alpha^\top K_k \alpha.$

39 Proj. Grad. KRR - $L_2$ Reg.
ProjectionBasedGradientDescent($(K_k)_{k \in [1,p]},\ \mu_0$)
   $\mu \leftarrow \mu_0$; $\mu' \leftarrow \mu_0 + 2\epsilon\,\mathbf{1}$  (so that the loop is entered)
   while $\|\mu' - \mu\| > \epsilon$ do
      $\mu' \leftarrow \mu$
      $\alpha \leftarrow (K_\mu + \lambda I)^{-1} y$
      $\mu \leftarrow \mu + \eta\,(\alpha^\top K_1 \alpha, \ldots, \alpha^\top K_p \alpha)^\top$
      for $k \leftarrow 1$ to $p$ do $\mu_k \leftarrow \max(0, \mu_k)$
      $\mu \leftarrow \mu_0 + \Lambda\,\frac{\mu - \mu_0}{\|\mu - \mu_0\|}$
   return $\mu$
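
A runnable NumPy transcription of this pseudocode might look as follows (a sketch: the step size eta, the tolerance eps, and the choice to project back onto the ball only when the constraint is violated are our assumptions):

import numpy as np

def projected_gradient_krr(Ks, y, mu0, lam=0.1, eta=0.01, Lam=1.0, eps=1e-4, max_iter=1000):
    # Ks: array of shape (p, m, m); minimizes F(mu) = y' (K_mu + lam I)^{-1} y
    # over mu >= 0, ||mu - mu0||_2 <= Lam.
    p, m, _ = Ks.shape
    mu = mu0.astype(float).copy()
    for _ in range(max_iter):
        mu_prev = mu.copy()
        K_mu = np.tensordot(mu, Ks, axes=1)               # sum_k mu_k K_k
        alpha = np.linalg.solve(K_mu + lam * np.eye(m), y)
        grad_neg = np.array([alpha @ K @ alpha for K in Ks])  # = -dF/dmu_k
        mu = mu + eta * grad_neg                          # gradient descent step on F
        mu = np.maximum(mu, 0.0)                          # enforce mu >= 0
        diff = mu - mu0
        norm = np.linalg.norm(diff)
        if norm > Lam:                                    # project onto the L2 ball
            mu = mu0 + Lam * diff / norm
        if np.linalg.norm(mu - mu_prev) <= eps:
            break
    return mu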

40 Interpolated Step KRR - $L_2$ Reg.
InterpolatedIterativeAlgorithm($(K_k)_{k \in [1,p]},\ \mu_0$)
   $\alpha \leftarrow (K_{\mu_0} + \lambda I)^{-1} y$; $\alpha' \leftarrow \alpha + 2\epsilon\,\mathbf{1}$
   while $\|\alpha' - \alpha\| > \epsilon$ do
      $\alpha' \leftarrow \alpha$
      $v \leftarrow (\alpha^\top K_1 \alpha, \ldots, \alpha^\top K_p \alpha)^\top$
      $\mu \leftarrow \mu_0 + \Lambda\,\frac{v}{\|v\|}$
      $\alpha \leftarrow \eta\,\alpha + (1 - \eta)\,(K_\mu + \lambda I)^{-1} y$
   return $\alpha$
Simple and very efficient: few iterations (fewer than 15).
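
Similarly, a NumPy sketch of the interpolated iterative algorithm (the interpolation parameter eta and the stopping rule are our assumptions):

import numpy as np

def interpolated_krr(Ks, y, mu0, lam=0.1, eta=0.5, Lam=1.0, eps=1e-6, max_iter=50):
    # Ks: array of shape (p, m, m); returns the dual vector alpha.
    p, m, _ = Ks.shape
    K0 = np.tensordot(mu0, Ks, axes=1)
    alpha = np.linalg.solve(K0 + lam * np.eye(m), y)
    for _ in range(max_iter):
        alpha_prev = alpha.copy()
        v = np.array([alpha @ K @ alpha for K in Ks])
        mu = mu0 + Lam * v / np.linalg.norm(v)            # step along the gradient direction
        K_mu = np.tensordot(mu, Ks, axes=1)
        alpha = eta * alpha + (1.0 - eta) * np.linalg.solve(K_mu + lam * np.eye(m), y)
        if np.linalg.norm(alpha - alpha_prev) <= eps:     # typically converges in few iterations
            break
    return alpha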

41 $L_2$-Regularized Combinations (Cortes, MM, and Rostamizadeh, 2009)
- Dense combinations are beneficial when using many kernels.
- Combining kernels based on single features can be viewed as principled feature weighting.
[figure: performance vs. number of kernels on the DVD and Reuters (acq) datasets, comparing the uniform baseline with $L_1$- and $L_2$-regularized combinations]

42 Conclusion
- Solid theoretical guarantees suggesting the use of a large number of base kernels.
- Broad literature on optimization techniques, but often no significant improvement over the uniform combination.
- Recent algorithms with significant improvements, in particular non-linear combinations.
- Still many theoretical and algorithmic questions left to explore.

43 References
- Olivier Bousquet and Daniel J. L. Herrmann. On the complexity of learning the kernel matrix. In NIPS, 2002.
- Corinna Cortes, Marius Kloft, and Mehryar Mohri. Learning kernels using local Rademacher complexity. In NIPS, 2013.
- Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combinations of kernels. In NIPS, 2009.
- Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In ICML, 2010.
- Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Two-stage learning kernel methods. In ICML, 2010.
- Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. JMLR 13:795-828, 2012.

44 References
- Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Tutorial: Learning kernels. ICML 2011, Bellevue, Washington, July 2011.
- Zakria Hussain and John Shawe-Taylor. Improved loss bounds for multiple kernel learning. In AISTATS, 2011 [see arXiv for corrected version].
- Sham M. Kakade, Shai Shalev-Shwartz, and Ambuj Tewari. Regularization techniques for learning with matrices. JMLR 13:1865-1890, 2012.
- Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1), 2002.
- Vladimir Koltchinskii and Ming Yuan. Sparse recovery in large ensembles of kernel machines. In COLT, 2008.
- Gert Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael Jordan. Learning the kernel matrix with semidefinite programming. JMLR 5:27-72, 2004.

45 References
- Mehmet Gönen and Ethem Alpaydin. Multiple kernel learning algorithms. JMLR 12:2211-2268, 2011.
- Nathan Srebro and Shai Ben-David. Learning bounds for support vector machines with learned kernels. In COLT, 2006.
- Yiming Ying and Colin Campbell. Generalization bounds for learning the kernel problem. In COLT, 2009.
