Learning Bounds for Support Vector Machines with Learned Kernels


Transcription:

Learning Bounds for Support Vector Machines with Learned Kernels
Nati Srebro (TTI-Chicago), Shai Ben-David (University of Waterloo)
Mostly based on a paper presented at COLT 06

Kernelized Large-Margin Linear Classification
[Figure: feature vectors $\phi(x)$ inside a ball of radius $B$, separated with margin $\gamma$]
$K(x_1,x_2) = \langle \phi(x_1), \phi(x_2) \rangle$ implicitly defines a Hilbert space in which we seek large-margin separation; it represents our prior knowledge, or bias. With $K(x,x) \le B^2$:
estimation error = E[error] $-$ training error $\le \sqrt{\dfrac{O\big((B/\gamma)^2\big) - \log\delta}{n}}$
where $\delta$ is the failure probability and $n$ the sample size; the sample complexity scales as $(B/\gamma)^2$.
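
As a concrete illustration of these quantities, here is a minimal sketch (my own, not from the talk): it builds a Gaussian-kernel Gram matrix, reads off $B = \sup_x \sqrt{K(x,x)}$, and evaluates the $(B/\gamma)^2$-style estimation term with constants and log factors omitted. The function names `gaussian_kernel_matrix` and `margin_estimation_term` are illustrative.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def margin_estimation_term(B, gamma, n, delta=0.05):
    """Rough sqrt(((B/gamma)^2 + log(1/delta)) / n) term, constants omitted."""
    return np.sqrt(((B / gamma) ** 2 + np.log(1.0 / delta)) / n)

X = np.random.randn(100, 5)
K = gaussian_kernel_matrix(X, sigma=2.0)
B = np.sqrt(K.diagonal().max())   # sup_x sqrt(K(x, x)); equals 1 for a Gaussian kernel
print(B, margin_estimation_term(B, gamma=0.1, n=len(X)))
```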

Learning the Kernel
Success of learning rests on the choice of a good kernel, appropriate for the task. How can we know which kernel is good for the task at hand?
Jointly learn the classifier and the kernel, using the training data: search for a kernel from some family $\mathcal{K}$ of allowed kernels.
Learn the bandwidth, or covariance matrix, of a Gaussian kernel, or other kernel parameters [Cristianini+98][Chapelle+02][Keerthi02] etc.
Linear, or convex, combination of base kernels [Lanckriet+02,04][Crammer+03]; applications, especially in Bioinformatics [Sonnenburg+05][Ben-Hur&Noble05] etc.
More flexibility means lower approximation error, but higher estimation error. What is the sample complexity cost of this flexibility?
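
A minimal sketch (illustrative, not from the talk) of the convex-combination family: given base Gram matrices $K_1,\dots,K_k$, any simplex weighting $\lambda$ yields a valid kernel $\sum_i \lambda_i K_i$. Here the weights are fixed by hand; actual multiple-kernel-learning methods choose them by optimizing a margin-based objective on the training data.

```python
import numpy as np

def convex_combination(kernels, lam):
    """Return sum_i lam[i] * K_i for simplex weights (lam >= 0, sum(lam) = 1)."""
    lam = np.asarray(lam, dtype=float)
    assert np.all(lam >= 0) and np.isclose(lam.sum(), 1.0)
    return sum(l * K for l, K in zip(lam, kernels))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
# Base kernels: Gaussian kernels at a few bandwidths.
base = [np.exp(-sq_dists / (2.0 * s ** 2)) for s in (0.5, 1.0, 2.0)]
lam = [0.2, 0.3, 0.5]                  # fixed here; learned from data in MKL methods
K = convex_combination(base, lam)      # still a valid (psd) kernel matrix
```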

Outline
With a fixed kernel: estimation error $\le \sqrt{\big(O((B/\gamma)^2) - \log\delta\big)/n}$.
How does this change when the kernel is learned from some family $\mathcal{K}$? What is the cost of learning the kernel?
Main result: a learning bound for general kernel families, with an additive increase to the sample complexity.
Examples: bounds for specific families. Learn $\sum_i \alpha_i K_i$, or just use $\sum_i K_i$? Group Lasso (block-$\ell_1$).
On demand: the proof technique (very simple), and why using the Rademacher complexity can't work.

Previous Bounds: Specific Kernel Families
$\mathcal{K}_{\mathrm{convex}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \big\{ \sum_i \lambda_i K_i \,\big|\, \lambda_i \ge 0 \text{ and } \sum_i \lambda_i = 1 \big\}$:
estimation error $\le \sqrt{\dfrac{2\,k\,(B/\gamma)^2 - \log\delta}{n}}$  [Lanckriet+ JMLR 2004]
$\mathcal{K}^{l}_{\mathrm{Gaussian}} \stackrel{\mathrm{def}}{=} \big\{ (x_1,x_2) \mapsto e^{-(x_1-x_2)^\top A (x_1-x_2)} \,\big|\, \text{psd } A \in \mathbb{R}^{l\times l} \big\}$ ($l$ = input dimensionality):
estimation error $\le \sqrt{\dfrac{2\,C_l\,(B/\gamma)^2 - \log\delta}{n}}$, where $C_l$ is an unspecified function of the input dimensionality  [Micchelli+ 2005]
Both suggest a multiplicative increase in the required sample size.

Finite Cardinality: $\mathcal{K} = \{K_1, K_2, \dots, K_{|\mathcal{K}|}\}$
For a single kernel $K$:
$\Pr\big[\,\exists\,\text{margin-}\gamma\text{ classifier w.r.t. }K\text{ with estimation error} > \sqrt{\big(O((B/\gamma)^2) - \log\delta\big)/n}\,\big] < \delta$  (the "bad event" for kernel $K$)
For a finite kernel family $\mathcal{K}$, set $\delta \mapsto \delta/|\mathcal{K}|$ and take a union bound over the bad events:
$\Pr\big[\,\exists K \in \mathcal{K},\ \text{margin-}\gamma\text{ classifier w.r.t. }K\text{ with estimation error} > \sqrt{\big(O((B/\gamma)^2) - \log(\delta/|\mathcal{K}|)\big)/n}\,\big] < |\mathcal{K}|\cdot\tfrac{\delta}{|\mathcal{K}|} = \delta$
that is,
$\Pr\big[\,\exists K \in \mathcal{K},\ \text{margin-}\gamma\text{ classifier w.r.t. }K\text{ with estimation error} > \sqrt{\big(O((B/\gamma)^2 + \log|\mathcal{K}|) - \log\delta\big)/n}\,\big] < \delta$
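
To see how mild the union-bound cost is, here is a tiny arithmetic sketch (constants and log factors dropped, as in the $O(\cdot)$ statements above) comparing the single-kernel estimation term with the finite-family term for several family sizes; the numbers are placeholders.

```python
import numpy as np

def single_kernel_term(B, gamma, n, delta):
    return np.sqrt(((B / gamma) ** 2 + np.log(1 / delta)) / n)

def finite_family_term(B, gamma, n, delta, card):
    # Union bound: replace delta by delta / |K|, i.e. add log|K| under the root.
    return np.sqrt(((B / gamma) ** 2 + np.log(card) + np.log(1 / delta)) / n)

B, gamma, n, delta = 1.0, 0.1, 10_000, 0.05
for card in (1, 10, 1_000, 10 ** 6):
    ratio = finite_family_term(B, gamma, n, delta, card) / single_kernel_term(B, gamma, n, delta)
    print(card, round(ratio, 3))
```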

Main Result
An additive bound for general kernel families, in terms of their pseudo-dimension. For any $K$ chosen from $\mathcal{K}$, and any classifier with margin $\gamma$ with respect to $K$:
estimation error $\le \sqrt{\dfrac{16 + 8\,d_\phi \log\frac{128\,e^3 n B^2}{\gamma^2 d_\phi} + 2048\big(\frac{B}{\gamma}\big)^2 \log\frac{\gamma e n}{8B}\,\log\frac{128\,n B^2}{\gamma^2} - \log\delta}{n}} = \sqrt{\dfrac{\tilde O\big((B/\gamma)^2 + d_\phi(\mathcal{K})\big) - \log\delta}{n}}$
Sample complexity: $(B/\gamma)^2 + d_\phi(\mathcal{K})$.
$d_\phi(\mathcal{K})$ = pseudo-dimension of $\mathcal{K}$ = VC-dimension of the subgraphs of the kernels, $K \mapsto \{(x_1,x_2,t) \mid K(x_1,x_2) < t\}$.
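
To make the subgraph definition concrete, here is a small, purely illustrative sketch (not from the talk): for a one-parameter Gaussian-bandwidth family it labels triples $(x_1, x_2, t)$ by the indicator $K_\sigma(x_1,x_2) < t$ and brute-force checks, over a parameter grid, how many of the $2^m$ sign patterns are realized; $d_\phi$ is the largest $m$ for which some set of triples realizes all of them. The helper names are made up for this sketch.

```python
import numpy as np

def gaussian_k(x1, x2, sigma):
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

def realized_patterns(triples, sigmas):
    """Sign patterns of [K_sigma(x1, x2) < t] over the triples, hit by the bandwidth family."""
    patterns = set()
    for s in sigmas:
        patterns.add(tuple(bool(gaussian_k(x1, x2, s) < t) for (x1, x2, t) in triples))
    return patterns

rng = np.random.default_rng(1)
triples = [(rng.standard_normal(3), rng.standard_normal(3), rng.uniform(0.0, 1.0))
           for _ in range(2)]
hit = realized_patterns(triples, sigmas=np.logspace(-2, 2, 400))
print(len(hit), "of", 2 ** len(triples), "patterns realized")   # shattered iff all are hit
```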

Bounds for Specific Kernel Families
$\mathcal{K}_{\mathrm{convex}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \big\{ \sum_i \lambda_i K_i \,\big|\, \lambda_i \ge 0 \text{ and } \sum_i \lambda_i = 1 \big\}$
Previous result: estimation error $\le \sqrt{\big(2\,k\,(B/\gamma)^2 - \log\delta\big)/n}$  [Lanckriet+ JMLR 2004]
$\mathcal{K}_{\mathrm{linear}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \big\{ \sum_i \lambda_i K_i \,\big|\, K_\lambda \text{ is psd and } \sum_i \lambda_i = 1 \big\}$
No previous bounds.
Applying our result: $d_\phi(\mathcal{K}_{\mathrm{linear}}),\ d_\phi(\mathcal{K}_{\mathrm{convex}}) \le k$, so for both families
estimation error $\le \sqrt{\big(\tilde O((B/\gamma)^2 + k) - \log\delta\big)/n}$

Bounds for Specific Kernel Families
$\mathcal{K}^{l}_{\mathrm{Gaussian}} \stackrel{\mathrm{def}}{=} \big\{ (x_1,x_2) \mapsto e^{-(x_1-x_2)^\top A (x_1-x_2)} \,\big|\, \text{psd } A \in \mathbb{R}^{l\times l} \big\}$ ($l$ = input dimensionality)
Previous result: estimation error $\le \sqrt{\big(2\,C_l\,(B/\gamma)^2 - \log\delta\big)/n}$, with $C_l$ an unspecified function of the input dimensionality  [Micchelli+ 2005]
Applying our result: $d_\phi(\mathcal{K}_{\mathrm{Gaussian}}) \le l(l+1)/2$, so
estimation error $\le \sqrt{\big(\tilde O((B/\gamma)^2 + l^2) - \log\delta\big)/n}$
Only diagonal $A$: $d_\phi \le l$. Only $\mathrm{rank}(A) \le k$: $d_\phi \le kl\log_2(8ekl)$.
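
A minimal sketch (illustrative only) of this kernel family and its restricted variants, with the slide's pseudo-dimension bounds noted in the comments; the helper name `mahalanobis_gaussian` is made up.

```python
import numpy as np

def mahalanobis_gaussian(X, A):
    """Gram matrix of K_A(x1, x2) = exp(-(x1 - x2)^T A (x1 - x2)); A symmetric psd."""
    D = X[:, None, :] - X[None, :, :]                    # pairwise differences
    return np.exp(-np.einsum('ija,ab,ijb->ij', D, A, D))

l, k = 4, 2
rng = np.random.default_rng(0)
X = rng.standard_normal((30, l))
L = rng.standard_normal((l, l))
A_full = L @ L.T                                         # full psd A:   d_phi <= l(l+1)/2
A_diag = np.diag(rng.uniform(size=l))                    # diagonal A:   d_phi <= l
R = rng.standard_normal((l, k))
A_lowrank = R @ R.T                                      # rank <= k A:  d_phi <= k*l*log2(8*e*k*l)
K = mahalanobis_gaussian(X, A_full)
```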

Additive vs. Multiplicative
$\mathcal{K}_{\mathrm{convex}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \big\{ \sum_i \lambda_i K_i \,\big|\, \lambda_i \ge 0 \text{ and } \sum_i \lambda_i = 1 \big\}$
Sample complexity analysis: if a predictor attains error err at margin $\gamma$ relative to some $K \in \mathcal{K}$, how many samples are needed to get error err$+\epsilon$?
Answer according to the multiplicative bound: $O\big(k(B/\gamma)^2/\epsilon^2\big)$
Answer according to our (additive) bound: $\tilde O\big(((B/\gamma)^2 + k)/\epsilon^2\big)$
Relaxed approach: just use $\sum_i K_i$.

Feature Space View
Instead of multiple kernels $K_i$, we can think of the implied feature spaces directly: $\phi(x) = \big(\sqrt{\alpha_1}\,\phi_1(x), \sqrt{\alpha_2}\,\phi_2(x), \dots, \sqrt{\alpha_k}\,\phi_k(x)\big)$ and $w = (w_1, w_2, \dots, w_k)$, where $K_i(x,x') = \langle \phi_i(x), \phi_i(x') \rangle$. Weighting each feature space by $\alpha_i$ gives $K = \sum_i \alpha_i K_i$.
Relaxed approach: use the unweighted feature space $\phi(x)$, i.e. $K = \sum_i K_i$, with $\|w\|^2 = \sum_i \|w_i\|^2$. The norm required in the unweighted space is at most the norm in any weighted space, while $B_K^2 = kB^2$.
Estimation bound: $O\big(\sqrt{kB^2\|w\|^2/n}\big)$.
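
A small, self-contained check (illustrative; explicit linear feature maps $\phi_i(x) = M_i x$ are chosen only so the inner products can be computed exactly) that concatenating the weighted feature maps reproduces $\sum_i \alpha_i K_i(x,x')$:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d = 3, 5
maps = [rng.standard_normal((d, d)) for _ in range(k)]   # explicit feature maps phi_i(x) = M_i x
alpha = np.array([0.2, 0.3, 0.5])
x, xp = rng.standard_normal(d), rng.standard_normal(d)

phi_x  = np.concatenate([np.sqrt(a) * (M @ x)  for a, M in zip(alpha, maps)])
phi_xp = np.concatenate([np.sqrt(a) * (M @ xp) for a, M in zip(alpha, maps)])
lhs = phi_x @ phi_xp                                              # <phi(x), phi(x')>
rhs = sum(a * ((M @ x) @ (M @ xp)) for a, M in zip(alpha, maps))  # sum_i alpha_i K_i(x, x')
print(np.isclose(lhs, rhs))                                       # True
```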

Additive vs. Multiplicative
$\mathcal{K}_{\mathrm{convex}}(K_1,\dots,K_k) \stackrel{\mathrm{def}}{=} \big\{ \sum_i \lambda_i K_i \,\big|\, \lambda_i \ge 0 \text{ and } \sum_i \lambda_i = 1 \big\}$
Sample complexity analysis: if a predictor attains error err at margin $\gamma$ relative to some $K \in \mathcal{K}$, how many samples are needed to get error err$+\epsilon$?
Answer according to the multiplicative bound: $O\big(k(B/\gamma)^2/\epsilon^2\big)$
Answer according to our (additive) bound: $\tilde O\big(((B/\gamma)^2 + k)/\epsilon^2\big)$
Relaxed approach: just use $\sum_i K_i$. Margin $\gamma$ relative to some $K \in \mathcal{K}$ implies margin $\gamma$ relative to $\sum_i K_i$, and $B^2_{\sum_i K_i} = \sup_x \sum_i K_i(x,x) \le kB^2$, so the sample complexity is $O\big(k(B/\gamma)^2/\epsilon^2\big)$.
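
A tiny arithmetic sketch (constants and log factors suppressed; the numbers are placeholders) of the two sample-complexity estimates above, learning the combination versus using the unweighted sum:

```python
def learn_combination(B, gamma, k, eps):
    """Additive bound for learning over K_convex: ((B/gamma)^2 + k) / eps^2."""
    return ((B / gamma) ** 2 + k) / eps ** 2

def use_unweighted_sum(B, gamma, k, eps):
    """Using sum_i K_i: B^2 grows to k * B^2, giving k * (B/gamma)^2 / eps^2."""
    return k * (B / gamma) ** 2 / eps ** 2

B, gamma, eps = 1.0, 0.1, 0.05
for k in (2, 10, 100):
    print(k, learn_combination(B, gamma, k, eps), use_unweighted_sum(B, gamma, k, eps))
```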

Learn $\sum_i \alpha_i K_i$ or use $\sum_i K_i$?
Relative to margin $\gamma$ for some $\sum_i \alpha_i K_i$:
Learn $\sum_i \alpha_i K_i$: error of learned predictor $\le$ error of best margin-$\gamma$ predictor with some $\sum_i \alpha_i K_i$ $+\ \sqrt{\tilde O((B/\gamma)^2 + k)/n}$
Use $\sum_i K_i$: error of learned predictor $\le$ error of best margin-$\gamma$ predictor with some $\sum_i \alpha_i K_i$ $+\ \sqrt{O(k(B/\gamma)^2)/n}$
Do we have enough samples to afford the factor of $k$? Is the decrease in estimation error worth the computational cost? (Maybe not, if we have enough data and the estimation error is small anyway.)
Relative to margin $\gamma$ for $\sum_i (1/k) K_i$:
Use $\sum_i K_i$: error of learned predictor $\le$ error of best margin-$\gamma$ predictor with $\sum_i (1/k)K_i$ $+\ \sqrt{O((B/\gamma)^2)/n}$
Learning the weights gives more flexibility, hence lower approximation error, but a $\sqrt{k/n}$ increase to the estimation error. Is the decrease in approximation worth the increase in estimation (and the extra computational cost)?

Alternate View: Group Lasso
Instead of multiple kernels $K_i$, we can think of the implied feature spaces directly: $\phi(x) = \big(\sqrt{\alpha_1}\,\phi_1(x), \dots, \sqrt{\alpha_k}\,\phi_k(x)\big)$ and $w = (w_1,\dots,w_k)$, where $K_i(x,x') = \langle \phi_i(x), \phi_i(x') \rangle$; weighting each feature space by $\alpha_i$ gives $K = \sum_i \alpha_i K_i$.
Relaxed approach: use the unweighted feature space $\phi(x)$, i.e. $K = \sum_i K_i$, with $\|w\|^2 = \sum_i \|w_i\|^2$ and $B_K^2 = kB^2$; the norm required in the unweighted space is at most the norm in any weighted space. Estimation bound: $O\big(\sqrt{kB^2 \sum_i \|w_i\|^2/n}\big)$.
[Bach et al 04] Learning with $\mathcal{K}_{\mathrm{convex}}$ is equivalent to using the unweighted feature space $\phi(x)$ with the block-$\ell_1$ regularizer $\sum_i \|w_i\|$.
Estimation bound for group lasso: $\tilde O\Big(\sqrt{\big(B^2 (\sum_i \|w_i\|)^2 + k\big)/n}\Big)$, where $\|w\|^2 = \sum_i \|w_i\|^2 \le \big(\sum_i \|w_i\|\big)^2$.
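
A short sketch (illustrative) of the two penalties being compared, block-$\ell_1$ versus plain $\ell_2$ over the blocks $w_1,\dots,w_k$, together with the inequality $\|w\| \le \sum_i \|w_i\|$ used above:

```python
import numpy as np

def block_l1(blocks):
    """Group-lasso penalty: sum_i ||w_i||."""
    return sum(np.linalg.norm(w) for w in blocks)

def l2(blocks):
    """Plain L2 norm of the concatenated w: sqrt(sum_i ||w_i||^2)."""
    return np.sqrt(sum(np.linalg.norm(w) ** 2 for w in blocks))

rng = np.random.default_rng(0)
blocks = [rng.standard_normal(4) for _ in range(5)]       # w_1, ..., w_k
print(l2(blocks) <= block_l1(blocks))                     # True: ||w|| <= sum_i ||w_i||
```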

Proof Sketch
Bound the pseudo-dimension $d_\phi(\mathcal{K})$. A standard result on covering numbers in terms of $d_\phi$ gives a covering of $\mathcal{K}$ of size $(L)^{d_\phi(\mathcal{K})}$; standard results on covering numbers of the unit sphere give a covering of $F_K$ of size $(L)^{(B/\epsilon)^2}$.
Construct a covering for $F_{\mathcal{K}}$ as a cross-product: for each kernel $K$ in the covering of $\mathcal{K}$, take the covering of $F_K$. This gives a covering of $F_{\mathcal{K}}$ of size $(L)^{d_\phi(\mathcal{K})} \cdot (L)^{(B/\epsilon)^2}$.
Lemma: if $K$, $K'$ are similar as real-valued functions, every $K$-classifier can be approximated by a $K'$-classifier.
Conclude with generalization bounds in terms of $\log(\text{covering number})$.
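
The cross-product step just multiplies the two covering sizes, so their logs add; a trivial sketch of the resulting log covering number ($L$ stands for the slide's unspecified base factor; the numbers are placeholders):

```python
import numpy as np

def log_cover_size(d_phi, B, eps, L):
    """log of (L)^{d_phi} * (L)^{(B/eps)^2}: the logs of the two cover sizes add."""
    return d_phi * np.log(L) + (B / eps) ** 2 * np.log(L)

print(log_cover_size(d_phi=10, B=1.0, eps=0.1, L=50.0))
```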

Rademacher vs. Covering Numbers
Other bounds rely on calculating the Rademacher complexity $R[F_{\mathcal{K}}]$ of the class of (unit-norm) classifiers with respect to any $K \in \mathcal{K}$. $R[F_{\mathcal{K}}]$ scales with the scale of the functions in $F_{\mathcal{K}}$, i.e. with $B$, and the generalization bounds depend on $R[F_{\mathcal{K}}]/\gamma$. Bounds based on the Rademacher complexity therefore necessarily have a multiplicative dependence on $B/\gamma$.
Covering numbers allow us to combine scale-sensitive and finite-dimensionality (scale-insensitive) arguments, at the cost of messier log factors.

Learning Bounds for SVMs with Learned Kernels
Nati Srebro, Shai Ben-David
Bound on the estimation error for a large-margin classifier with respect to a kernel which is chosen, from a family $\mathcal{K}$, based on the training data:
estimation error $\le \sqrt{\big(\tilde O\big(d_\phi(\mathcal{K}) + (B/\gamma)^2\big) - \log\delta\big)/n}$, where $d_\phi(\mathcal{K})$ is the pseudo-dimension of $\mathcal{K}$ as a family of real-valued functions.
Valid for generic kernelized $L_2$-regularized learning. Easy to obtain bounds for further kernel families. For $\mathcal{K}_{\mathrm{convex}}$: using $\sum_i K_i$ may require $k$ times more data.