Learning Kernels - Tutorial Part III: Theoretical Guarantees.


1 Learning Kernels - Tutorial Part III: Theoretical Guarantees. Corinna Cortes (Google Research), Mehryar Mohri (Courant Institute & Google Research), Afshin Rostamizadeh (UC Berkeley).

2 Standard Learning with Kernels (diagram: the user supplies a kernel $K$; the learning algorithm receives the sample and $K$ and returns a hypothesis $h$.)

3 Learning Kernel Framework (diagram: the user supplies a kernel family $\mathcal{K}$; the learning algorithm receives the sample and $\mathcal{K}$ and returns a pair $(K, h)$, both a kernel and a hypothesis.)

4 Learning Kernels Theoretical questions: what is the price to pay for relaxing the requirement that the user specify a single kernel? how does the choice of the kernel family affect generalization?

5 Part III Non-negative combinations. General case.

6 Kernel Families Most frequently used kernel families, for $q \ge 1$:
$$\mathcal{K}_q = \Big\{ K_\mu = \sum_{k=1}^p \mu_k K_k : \mu \in \Delta_q \Big\}, \quad \text{with } \Delta_q = \big\{ \mu : \mu \ge 0,\ \|\mu\|_q = 1 \big\}.$$
Hypothesis sets:
$$H_q = \big\{ h \in \mathbb{H}_K : K \in \mathcal{K}_q,\ \|h\|_{\mathbb{H}_K} \le 1 \big\}.$$
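
As a concrete illustration (a minimal NumPy sketch; the data, kernel widths, and helper names are ours, not from the tutorial), here is how a kernel $K_\mu \in \mathcal{K}_1$ is formed from base Gaussian Gram matrices; a convex combination of PSD Gram matrices is again PSD:

```python
import numpy as np

def gaussian_gram(X, gamma):
    """Gram matrix of the Gaussian kernel exp(-gamma ||x - x'||^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def combined_kernel(base_kernels, mu):
    """K_mu = sum_k mu_k K_k for mu >= 0 (here on the L1 simplex)."""
    return sum(m_k * K for m_k, K in zip(mu, base_kernels))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Ks = [gaussian_gram(X, g) for g in (0.1, 1.0, 10.0)]  # p = 3 base kernels
mu = np.array([0.2, 0.5, 0.3])                        # mu >= 0, ||mu||_1 = 1
K_mu = combined_kernel(Ks, mu)                        # a member of K_1
```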

7 Rademacher Complexity Empirical Rademacher complexity of $H$ for a sample $S = (x_1, \ldots, x_m)$:
$$\widehat{\mathfrak{R}}_S(H) = \mathbb{E}_\sigma\Big[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^m \sigma_i h(x_i)\Big],$$
where the $\sigma_i$s are independent uniform random variables taking values in $\{-1, +1\}$. Rademacher complexity of $H$:
$$\mathfrak{R}_m(H) = \mathbb{E}_{S \sim D^m}\big[\widehat{\mathfrak{R}}_S(H)\big].$$
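
For the kernel hypothesis sets used here, the inner supremum has a closed form, $\sup_{\|h\|_{\mathbb{H}_K} \le 1} \sum_i \sigma_i h(x_i) = \sqrt{\sigma^\top K \sigma}$ (this is the Cauchy-Schwarz step used in the proof on slide 13), which makes a Monte-Carlo estimate of $\widehat{\mathfrak{R}}_S$ straightforward. A small sketch (our own illustration; function and argument names are assumptions):

```python
import numpy as np

def empirical_rademacher(K, n_draws=2000, seed=0):
    """Monte-Carlo estimate of hat{R}_S for {h : ||h||_K <= 1}, using
    sup_{||h|| <= 1} sum_i sigma_i h(x_i) = sqrt(sigma^T K sigma)."""
    rng = np.random.default_rng(seed)
    m = K.shape[0]
    sigmas = rng.choice([-1.0, 1.0], size=(n_draws, m))
    sups = np.sqrt(np.einsum("ni,ij,nj->n", sigmas, K, sigmas))
    return sups.mean() / m
```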

8 Single Kernel Margin Bound Theorem (Koltchinskii and Panchenko, 2002): fix $\rho > 0$. Assume that $K(x, x) \le R^2$ for all $x$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$$R(h) \le \widehat{R}_\rho(h) + 2\sqrt{\frac{R^2/\rho^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$

9 Early Learning Kernel Bounds (Bousquet and Herrmann, 2003; Lanckriet et al., 2004): for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$$R(h) \le \widehat{R}_\rho(h) + \frac{1}{\rho}\sqrt{\frac{\max_{k=1}^p \mathrm{Tr}(K_k)\,\max_{k=1}^p \frac{\|K_k\|}{\mathrm{Tr}(K_k)}}{m}\,\log\frac{1}{\delta}}.$$
but, bound always greater than one (Srebro and Ben-David, 2006)! other bound of (Lanckriet et al., 2004) for the linear combination case also always greater than one!

10 Multiplicative Learning Bound (Lanckriet et al., 2004) Assume that $K_k(x, x) \le R^2$ for all $k \in [1, p]$ and all $x$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$$R(h) \le \widehat{R}_\rho(h) + O\bigg(\sqrt{\frac{p\,R^2/\rho^2}{m}}\bigg).$$
bound multiplicative in $p$ (number of kernels).

11 Additive Learning Bound (Srebro and Ben-David, 2006) Assume that $K_k(x, x) \le R^2$ for all $k \in [1, p]$ and all $x$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$$R(h) \le \widehat{R}_\rho(h) + \sqrt{\frac{8\Big(2 + p \log\frac{128 e m^3 R^2}{\rho^2 p} + 256\,\frac{R^2}{\rho^2}\log\frac{\rho e m}{8R}\log\frac{128 m R^2}{\rho^2}\Big) + \log(1/\delta)}{m}}.$$
bound additive in $p$ (modulo log terms). not informative for $p > m$. similar guarantees for other families, based on the pseudo-dimension of the kernel family.

12 New Data-Dependent Bound (CC, MM, and AR, 2010) Theorem: for any sample $S$ of size $m$ and any positive integer $r$,
$$\widehat{\mathfrak{R}}_S(H_1) \le \frac{\sqrt{r\,\|u\|_r}}{m}, \quad \text{with } u = \big(\mathrm{Tr}[K_1], \ldots, \mathrm{Tr}[K_p]\big).$$
similarity with single kernel bound. can be used directly to derive an algorithm.
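
Since the bound holds for every positive integer $r$, it can be minimized over $r$ in practice. A sketch of that computation (our helper; it plugs directly into the displayed bound):

```python
import numpy as np

def l1_rademacher_bound(base_kernels, r_max=50):
    """min over integer r of sqrt(r * ||u||_r) / m,
    with u = (Tr[K_1], ..., Tr[K_p])."""
    m = base_kernels[0].shape[0]
    u = np.array([np.trace(K) for K in base_kernels])
    return min(np.sqrt(r * np.linalg.norm(u, ord=r)) / m
               for r in range(1, r_max + 1))
```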

13 New Data-Dependent Bound Proof: let $q, r \ge 1$ with $\frac{1}{q} + \frac{1}{r} = 1$.
$$\begin{aligned}
\widehat{\mathfrak{R}}_S(H_q)
&= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{h \in H_q} \sum_{i=1}^m \sigma_i h(x_i)\Big]
 = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q,\ \alpha^\top K_\mu \alpha \le 1} \sum_{i,j=1}^m \sigma_i \alpha_j K_\mu(x_i, x_j)\Big] \\
&= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q,\ \|K_\mu^{1/2}\alpha\| \le 1} \big\langle K_\mu^{1/2}\sigma,\, K_\mu^{1/2}\alpha \big\rangle\Big]
 = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q} \sqrt{\sigma^\top K_\mu \sigma}\Big] && \text{(Cauchy-Schwarz)} \\
&= \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sup_{\mu \in \Delta_q} \sqrt{\mu \cdot u_\sigma}\Big]
 = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sqrt{\|u_\sigma\|_r}\Big], && \text{(definition of dual norm)}
\end{aligned}$$
with $u_\sigma = \big(\sigma^\top K_1 \sigma, \ldots, \sigma^\top K_p \sigma\big)$.
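
The dual-norm step can be checked numerically: for $u \ge 0$, $\sup_{\mu \ge 0,\,\|\mu\|_q \le 1} \mu \cdot u = \|u\|_r$, attained at $\mu_k \propto u_k^{r-1}$. A quick sanity check (our illustration, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.random(5)                    # u >= 0, as in the proof (u_sigma >= 0)
q, r = 1.5, 3.0                      # conjugate exponents: 1/1.5 + 1/3 = 1
mu = u ** (r - 1)
mu /= np.linalg.norm(mu, ord=q)      # place mu on the Lq sphere
print(mu @ u, np.linalg.norm(u, ord=r))   # the two values coincide
```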

14 New Data-Dependent Bound Proof: in the following, $r \ge 1$ is an arbitrary integer.
$$\begin{aligned}
\widehat{\mathfrak{R}}_S(H_1) = \frac{1}{m}\,\mathbb{E}_\sigma\Big[\sqrt{\|u_\sigma\|_\infty}\Big]
&\le \frac{1}{m}\,\mathbb{E}_\sigma\bigg[\Big(\sum_{k=1}^p (\sigma^\top K_k \sigma)^r\Big)^{\frac{1}{2r}}\bigg] && \big(\forall r \ge 1,\ \|u_\sigma\|_\infty \le \|u_\sigma\|_r\big) \\
&\le \frac{1}{m}\,\Big(\sum_{k=1}^p \mathbb{E}_\sigma\big[(\sigma^\top K_k \sigma)^r\big]\Big)^{\frac{1}{2r}} && \text{(Jensen's inequality)} \\
&\le \frac{1}{m}\,\Big(\sum_{k=1}^p \big(r\,\mathrm{Tr}[K_k]\big)^r\Big)^{\frac{1}{2r}} = \frac{\sqrt{r\,\|u\|_r}}{m}. && \text{(lemma)}
\end{aligned}$$

15 Key Lemma Lemma: let $K$ be a kernel matrix for a finite sample. Then, for any integer $r \ge 1$,
$$\mathbb{E}_\sigma\big[(\sigma^\top K \sigma)^r\big] \le \big(r\,\mathrm{Tr}[K]\big)^r.$$
proof based on a combinatorial argument.
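
The lemma is easy to probe by simulation (our own check, not a substitute for the combinatorial proof):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.normal(size=(20, 20))
K = B @ B.T                                        # a PSD kernel matrix
sigma = rng.choice([-1.0, 1.0], size=(100_000, 20))
quad = np.einsum("ni,ij,nj->n", sigma, K, sigma)   # sigma^T K sigma
for r in (1, 2, 3):
    lhs = np.mean(quad ** r)
    rhs = (r * np.trace(K)) ** r
    print(f"r={r}: E[(s^T K s)^r] = {lhs:.3e} <= (r Tr[K])^r = {rhs:.3e}")
```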

16 New Learning Bound - L1 (CC, MM, and AR, 2010) Theorem: assume that $K_k(x, x) \le R^2$ for all $k \in [1, p]$ and all $x$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_1$,
$$R(h) \le \widehat{R}_\rho(h) + 2\sqrt{\frac{e \lceil \log p \rceil\, R^2/\rho^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
very weak dependency on $p$, no extra log terms. analysis based on Rademacher complexity. bound valid even for $p \ge m$. see also (Kakade et al., 2010).
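
A sketch evaluating the slack of this bound (our helper with illustrative defaults for $R$, $\rho$, $\delta$; it simply plugs into the displayed formula) shows how mild the growth in $p$ is:

```python
import numpy as np

def l1_bound_slack(p, m, R=1.0, rho=0.2, delta=0.01):
    """2 sqrt(e ceil(log p) (R/rho)^2 / m) + sqrt(log(1/delta) / (2m))."""
    complexity = 2 * np.sqrt(np.e * np.ceil(np.log(p)) * (R / rho) ** 2 / m)
    confidence = np.sqrt(np.log(1 / delta) / (2 * m))
    return complexity + confidence

for p in (10, 10**3, 10**6):                 # a 10^5-fold increase in p...
    print(p, l1_bound_slack(p, m=10**6))     # ...barely moves the slack
```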

17 Comparison (figure: the bound of (Srebro & Ben-David, 2006) vs. our 2010 bound as a function of $m$ (in millions), for $p = m$, $m^{5/6}$, $m^{4/5}$, $m^{3/5}$, $m^{1/3}$, and a constant value of $p$, with $\rho/R = 0.2$.)

18 Lower Bound Tight bound: the $\log p$ dependency cannot be improved. argument based on the VC dimension of the following example. Observations: case $X = \{-1, +1\}^p$, canonical projection kernels $K_k(x, x') = x_k x'_k$. $H_1$ contains $J_p = \{x \mapsto s\,x_k : k \in [1, p],\ s \in \{-1, +1\}\}$. $\mathrm{VCdim}(J_p) = \Omega(\log p)$. for $\rho = 1$ and $h \in J_p$, $\widehat{R}_\rho(h) = \widehat{R}(h)$. VC lower bound: $R(h) - \widehat{R}(h) = \Omega\big(\sqrt{\mathrm{VCdim}(J_p)/m}\big)$.

19 New Paper Recent claim (Hussain and Shawe-Taylor, AISTATS 2011): additive bound in terms of $\log p$, instead of multiplicative. main proof incorrect: probabilistic bound on the Rademacher complexity, but slack term left out of the proof of theorem 8; adding it back yields a multiplicative bound. however: the authors are preparing a new version (private communication: J. Shawe-Taylor).

20 New Learning Bound - Lq (CC, MM, and AR, 2010) Theorem: let $q, r \ge 1$ with $\frac{1}{q} + \frac{1}{r} = 1$ and $r$ an integer. Assume that $K_k(x, x) \le R^2$ for all $k \in [1, p]$ and all $x$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_q$,
$$R(h) \le \widehat{R}_\rho(h) + 2\,p^{\frac{1}{2r}}\sqrt{\frac{r\,R^2/\rho^2}{m}} + \sqrt{\frac{\log\frac{1}{\delta}}{2m}}.$$
mild dependency on $p$. analysis based on Rademacher complexity.
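
A companion sketch for the $L_q$ case (again our helper with illustrative defaults, plugging into the displayed formula); for $q = 2$ the conjugate exponent is $r = 2$ and the dependency is $p^{1/4}$:

```python
import numpy as np

def lq_bound_slack(p, m, r, R=1.0, rho=0.2, delta=0.01):
    """2 p^(1/(2r)) sqrt(r (R/rho)^2 / m) + sqrt(log(1/delta) / (2m))."""
    complexity = 2 * p ** (1 / (2 * r)) * np.sqrt(r * (R / rho) ** 2 / m)
    return complexity + np.sqrt(np.log(1 / delta) / (2 * m))
```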

21 Lower Bound Tight bound: the $p^{\frac{1}{2r}}$ dependency cannot be improved. in particular, $p^{\frac{1}{4}}$ is tight for $L_2$ regularization. Observations: take all kernels equal: $\sum_{k=1}^p \mu_k K_k = \big(\sum_{k=1}^p \mu_k\big) K_1$, so that $\|h\|^2_{\mathbb{H}_{K_1}} = \big(\sum_{k=1}^p \mu_k\big)\,\|h\|^2_{\mathbb{H}_{K_\mu}}$ for $\sum_{k=1}^p \mu_k \ne 0$. $\sum_{k=1}^p \mu_k \le p^{\frac{1}{r}}\,\|\mu\|_q = p^{\frac{1}{r}}$ (Hölder's inequality). thus, $H_q$ coincides with $\big\{h \in \mathbb{H}_{K_1} : \|h\|_{\mathbb{H}_{K_1}} \le p^{\frac{1}{2r}}\big\}$.

22 Comparison L1 vs Lq (figure: the $L_1$ and $L_q$ bounds as a function of $m$ (in millions), for $p = m$, $p = m^{1/3}$, and $p = 20$.)

23 Conclusion Theory: tight generalization bounds for learning kernels with $L_1$ or $L_q$ regularization ($\log p$ and $p^{\frac{1}{2r}}$ dependency). mild dependency on $p$. similar proof and analysis for other regularizations. Applications: results suggest using a large number of kernels. recent results show significant improvements (CC, MM, AR, ICML 2010).

24 Part III Non-negative combinations. General case.

25 Kernel Family General case: $\mathcal{K}$ a family of kernels bounded by $R$. finite pseudo-dimension: $\mathrm{Pdim}(\mathcal{K}) < \infty$. general hypothesis set: $H_{\mathcal{K}} = \big\{h \in \mathbb{H}_K : K \in \mathcal{K},\ \|h\|_{\mathbb{H}_K} \le 1\big\}$.

26 Shattering Definition: let $H$ be a hypothesis set of functions from $X$ to $\mathbb{R}$. $A = \{x_1, \ldots, x_m\}$ is shattered by $H$ if there exist $t_1, \ldots, t_m \in \mathbb{R}$ such that
$$\Bigg|\Bigg\{\begin{pmatrix} \mathrm{sgn}\big(L(h(x_1), f(x_1)) - t_1\big) \\ \vdots \\ \mathrm{sgn}\big(L(h(x_m), f(x_m)) - t_m\big) \end{pmatrix} : h \in H \Bigg\}\Bigg| = 2^m.$$
(figure: thresholds $t_1, t_2$ at points $x_1, x_2$.)
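
For a finite pool of hypotheses and fixed thresholds, the condition can be checked by brute force. A small sketch (ours; it tests the sign-pattern condition on real values, i.e. the Pollard form of slide 27, with the losses $L(h(x_i), f(x_i))$ playing the role of the values on this slide):

```python
import numpy as np

def is_pseudo_shattered(values, t):
    """values[j, i] = j-th hypothesis evaluated at x_i; t = thresholds.
    Shattered iff all 2^m above/below-threshold patterns are realized."""
    m = values.shape[1]
    patterns = {tuple(v > t) for v in np.asarray(values)}
    return len(patterns) == 2 ** m

# Example: four functions on two points realize all 2^2 patterns.
vals = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
print(is_pseudo_shattered(vals, t=np.array([0.5, 0.5])))   # True
```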

27 Pseudo-Dimension Definition: let $H$ be a hypothesis set of functions from $X$ to $\mathbb{R}$. The pseudo-dimension of $H$, $\mathrm{Pdim}(H)$, is the size of the largest set shattered by $H$. Definition (Pollard, 1984; equivalent, see also (Vapnik, 1995)):
$$\mathrm{Pdim}(H) = \mathrm{VCdim}\big(\big\{(x, t) \mapsto 1_{(h(x) - t) > 0} : h \in H\big\}\big).$$

28 Pseudo-Dimension - Properties Theorem: pseudo-dimension of hyperplanes:
$$\mathrm{Pdim}\big(\{x \mapsto w \cdot x + b : w \in \mathbb{R}^N, b \in \mathbb{R}\}\big) = N + 1.$$
Theorem: pseudo-dimension of a vector space of real-valued functions $H$:
$$\mathrm{Pdim}(H) = \dim(H).$$
Theorem: pseudo-dimension of $\varphi(H) = \{\varphi \circ h : h \in H\}$, where $\varphi$ is a monotone function:
$$\mathrm{Pdim}(\varphi(H)) \le \mathrm{Pdim}(H).$$

29 General Pdim Learning Bound (Srebro and Ben-David, 2006) Let $\mathcal{K}$ be a family of kernel functions bounded by $R$ and let $d = \mathrm{Pdim}(\mathcal{K})$. Then, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H_{\mathcal{K}}$,
$$R(h) \le \widehat{R}_\rho(h) + \sqrt{\frac{8\Big(2 + d \log\frac{128 e m^3 R^2}{\rho^2 d} + 256\,\frac{R^2}{\rho^2}\log\frac{\rho e m}{8R}\log\frac{128 m R^2}{\rho^2}\Big) + \log(1/\delta)}{m}}.$$
bound additive in $d$ (modulo log terms). not informative for $d > m$.

30 Application: Linear Combinations Linear and non-negative combinations of base kernels (previous section):
$$\mathcal{K}_{\mathrm{lin}} = \Big\{ K_\mu = \sum_{k=1}^p \mu_k K_k : \sum_{k=1}^p \mu_k = 1,\ K_\mu \succeq 0 \Big\}, \qquad \mathcal{K}_1 = \Big\{ K_\mu = \sum_{k=1}^p \mu_k K_k : \sum_{k=1}^p \mu_k = 1,\ \mu \ge 0 \Big\}.$$
Since $\mathcal{K}_1 \subseteq \mathcal{K}_{\mathrm{lin}}$:
$$\mathrm{Pdim}(\mathcal{K}_1) \le \mathrm{Pdim}(\mathcal{K}_{\mathrm{lin}}) \le \dim\Big(\Big\{\sum_{k=1}^p \mu_k K_k : \mu \in \mathbb{R}^p\Big\}\Big) = p.$$

31 Application: Gaussian Kernels Gaussian kernels with a fixed covariance matrix:
$$\mathcal{K}_{\mathrm{Gaussian}} = \big\{(x_1, x_2) \mapsto \exp\big(-(x_2 - x_1)^\top A (x_2 - x_1)\big) : A \in S^N_+\big\}.$$
since $\exp$ is monotone and
$$\big\{(x_1, x_2) \mapsto (x_2 - x_1)^\top A (x_2 - x_1) : A \in S^N_+\big\} = \Big\{(x_1, x_2) \mapsto \sum_{i,j=1}^N A_{ij} (x_2 - x_1)_i (x_2 - x_1)_j : A \in S^N_+\Big\}$$
$$\subseteq \mathrm{span}\big\{(x_1, x_2) \mapsto (x_2 - x_1)_i (x_2 - x_1)_j : 1 \le i \le j \le N\big\},$$
we obtain $\mathrm{Pdim}(\mathcal{K}_{\mathrm{Gaussian}}) \le \frac{N(N+1)}{2}$. similarly, for diagonal $A$, $\mathrm{Pdim}(\mathcal{K}_{\mathrm{Gaussian}}) \le N$.
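
A member of $\mathcal{K}_{\mathrm{Gaussian}}$ is easy to instantiate (a minimal sketch; names are ours), which makes the parameterization by $A \in S^N_+$ concrete:

```python
import numpy as np

def gaussian_kernel(x1, x2, A):
    """K(x1, x2) = exp(-(x2 - x1)^T A (x2 - x1)), A PSD."""
    d = x2 - x1
    return np.exp(-d @ A @ d)

rng = np.random.default_rng(3)
B = rng.normal(size=(4, 4))
A = B @ B.T                          # A in S^N_+ by construction
x1, x2 = rng.normal(size=4), rng.normal(size=4)
print(gaussian_kernel(x1, x2, A))
```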

32 References Bousquet, Olivier and Herrmann, Daniel J. L. On the complexity of learning the kernel matrix. In NIPS, 2003. Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Generalization bounds for learning kernels. In ICML, 2010. Cortes, Corinna, Mohri, Mehryar, and Rostamizadeh, Afshin. Two-stage learning kernel methods. In ICML, 2010. Hussain, Zakria and Shawe-Taylor, John. Improved loss bounds for multiple kernel learning. In AISTATS, 2011. Kakade, Sham M., Shalev-Shwartz, Shai, and Tewari, Ambuj. Applications of strong convexity - strong smoothness duality to learning with matrices. arXiv, v1. Koltchinskii, V. and Panchenko, D. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002. Koltchinskii, Vladimir and Yuan, Ming. Sparse recovery in large ensembles of kernel machines. In COLT, 2008.

33 References Lanckriet, Gert, Cristianini, Nello, Bartlett, Peter, El Ghaoui, Laurent, and Jordan, Michael. Learning the kernel matrix with semidefinite programming. JMLR, 5, 2004. Srebro, Nathan and Ben-David, Shai. Learning bounds for support vector machines with learned kernels. In COLT, 2006. Ying, Yiming and Campbell, Colin. Generalization bounds for learning the kernel problem. In COLT, 2009.
