Approximation Theoretical Questions for SVMs
1 Ingo Steinwart, LA-UR, October 20, 2007
2 Informal Description of the Learning Goal
$X$: space of input samples. $Y$: space of labels, usually $Y \subset \mathbb{R}$.
Already observed samples: $T = ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$.
Goal: with the help of $T$, find a function $f : X \to \mathbb{R}$ that predicts the label $y$ for new, unseen $x$.
3 Illustration: Binary Classification
Problem: the set $X$ is divided into two unknown classes $X_{-1}$ and $X_1$.
Goal: find the classes $X_{-1}$ and $X_1$ approximately.
[Figure. Left: negative (blue) and positive (red) samples. Right: behaviour of a decision function (green) $f : X \to Y$ with the induced regions $\{f = -1\}$ and $\{f = 1\}$.]
4 Formal Definition of Statistical Learning
Basic assumptions:
$P$ is an unknown probability measure on $X \times Y$.
$T = ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$ is sampled from $P^n$.
Future pairs $(x, y)$ will also be sampled from $P$.
$L : Y \times \mathbb{R} \to [0, \infty]$ is a loss function that measures the cost $L(y, t)$ of predicting $y$ by $t$.
Goal: find a function $f_T : X \to \mathbb{R}$ with small risk
$$R_{L,P}(f_T) := \int_{X \times Y} L\bigl(y, f_T(x)\bigr) \, dP(x, y).$$
Interpretation: the average future cost of predicting by $f_T$ should be small.
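To make the risk concrete, here is a minimal numpy sketch (an editorial addition, not part of the original slides) that evaluates the empirical counterpart $R_{L,T}(f) = \frac{1}{n}\sum_{i=1}^n L(y_i, f(x_i))$ for the hinge loss; the data and the predictor are illustrative placeholders.

```python
import numpy as np

def hinge_loss(y, t):
    """Hinge loss L(y, t) = max(0, 1 - y*t), applied elementwise."""
    return np.maximum(0.0, 1.0 - y * t)

def empirical_risk(f, X, y, loss=hinge_loss):
    """Empirical risk R_{L,T}(f) = (1/n) * sum_i L(y_i, f(x_i))."""
    return np.mean(loss(y, f(X)))

# Toy usage: a linear decision function on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0])                    # labels in {-1, +1}
f = lambda Z: Z @ np.array([1.0, 0.0])  # hypothetical predictor
print(empirical_risk(f, X, y))          # average hinge loss of f on T
```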
5 Questions in Statistical Learning I
Bayes risk:
$$R_{L,P}^* := \inf\{\, R_{L,P}(f) \mid f : X \to \mathbb{R} \text{ measurable} \,\}.$$
A function attaining this infimum is denoted by $f^*_{L,P}$.
Learning method: assigns to every training set $T$ a predictor $f_T : X \to \mathbb{R}$.
Consistency: a learning method is called universally consistent if
$$R_{L,P}(f_T) \to R_{L,P}^* \quad \text{in probability} \tag{1}$$
for $n \to \infty$ and every probability measure $P$ on $X \times Y$.
Good news: many learning methods are universally consistent. First result: Stone (1977), AoS.
6 Questions in Statistical Learning II
Rates: does there exist a learning method and a convergence rate $a_n \to 0$ such that
$$E_{T \sim P^n}\, R_{L,P}(f_T) - R_{L,P}^* \le C_P\, a_n, \qquad n \ge 1,$$
for every probability measure $P$ on $X \times Y$?
Bad news (Devroye, 1982, IEEE TPAMI): No! (if $|Y| \ge 2$, $|X| = \infty$, and $L$ is non-trivial)
Good news: Yes, if one makes some mild?! assumptions on $P$. Too many results in this direction to mention them.
7 Reproducing Kernel Hilbert Spaces I
$k : X \times X \to \mathbb{R}$ is a kernel $:\!\iff$ there exist a Hilbert space $H$ and a map $\Phi : X \to H$ with $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$ for all $x, x' \in X$
$\iff$ all matrices $(k(x_i, x_j))_{i,j=1}^n$ are symmetric and positive semi-definite.
RKHS of $k$: the smallest such $H$ that consists of functions.
Construction: take the completion of
$$\Bigl\{ \sum_{i=1}^n \alpha_i k(x_i, \cdot) : n \in \mathbb{N},\ \alpha_1, \dots, \alpha_n \in \mathbb{R},\ x_1, \dots, x_n \in X \Bigr\}$$
equipped with the dot product
$$\Bigl\langle \sum_{i=1}^n \alpha_i k(x_i, \cdot),\ \sum_{j=1}^m \beta_j k(\hat{x}_j, \cdot) \Bigr\rangle := \sum_{i=1}^n \sum_{j=1}^m \alpha_i \beta_j\, k(x_i, \hat{x}_j).$$
Feature map: $\Phi : x \mapsto k(x, \cdot)$.
8 Reproducing Kernel Hilbert Spaces II
Polynomial kernels: for $a \ge 0$ and $m \in \mathbb{N}$ let
$$k(x, x') := (\langle x, x' \rangle + a)^m, \qquad x, x' \in \mathbb{R}^d.$$
Gaussian RBF kernels: for $\sigma > 0$ let
$$k_\sigma(x, x') := \exp(-\sigma^2 \|x - x'\|_2^2), \qquad x, x' \in \mathbb{R}^d.$$
The parameter $1/\sigma$ is called the width.
Denseness of Gaussian RKHSs: the RKHS $H_\sigma$ of $k_\sigma$ is dense in $L_p(\mu)$ for all $p \in [1, \infty)$ and all probability measures $\mu$ on $\mathbb{R}^d$.
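A brief sketch (added for illustration, not from the talk) computing Gram matrices for both kernels and checking symmetry and positive semi-definiteness, as in the kernel characterization on the previous slide:

```python
import numpy as np

def poly_kernel(X, Xp, a=1.0, m=2):
    """Polynomial kernel k(x, x') = (<x, x'> + a)^m."""
    return (X @ Xp.T + a) ** m

def gauss_kernel(X, Xp, sigma=1.0):
    """Gaussian RBF kernel k_sigma(x, x') = exp(-sigma^2 * ||x - x'||_2^2)."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Xp**2, axis=1)[None, :] - 2.0 * X @ Xp.T)
    return np.exp(-sigma**2 * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
for K in (poly_kernel(X, X), gauss_kernel(X, X)):
    assert np.allclose(K, K.T)                    # symmetric
    assert np.linalg.eigvalsh(K).min() > -1e-6    # positive semi-definite
```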
9 Support Vector Machines I
Support vector machines (SVMs) solve the problem
$$f_{T,\lambda} = \arg\min_{f \in H}\ \lambda \|f\|_H^2 + \frac{1}{n} \sum_{i=1}^n L\bigl(y_i, f(x_i)\bigr), \tag{2}$$
where $H$ is an RKHS, $T = ((x_1, y_1), \dots, (x_n, y_n)) \in (X \times Y)^n$ is a training set, $\lambda > 0$ is a free regularization parameter, and $L : Y \times \mathbb{R} \to [0, \infty)$ is a convex loss, e.g.
hinge loss: $L(y, t) := \max\{0, 1 - yt\}$
least squares loss: $L(y, t) := (y - t)^2$.
Representer theorem: the unique solution is of the form $f_{T,\lambda} = \sum_{i=1}^n \alpha_i k(x_i, \cdot)$. Minimization actually takes place over $\{\alpha_1, \dots, \alpha_n\}$.
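For the least squares loss, problem (2) even has a closed form: inserting $f = \sum_i \alpha_i k(x_i, \cdot)$ turns the objective into $\lambda\, \alpha^\top K \alpha + \frac{1}{n}\|K\alpha - y\|_2^2$, whose first-order condition is solved by $\alpha = (K + n\lambda I)^{-1} y$. A numpy sketch of this special case (my derivation; the talk gives no code):

```python
import numpy as np

def svm_least_squares(K, y, lam):
    """Solve (2) for the least squares loss via the representer theorem.

    The stationarity condition K(n*lam*alpha + K*alpha - y) = 0 is
    satisfied by alpha = (K + n*lam*I)^{-1} y.
    """
    n = len(y)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict(alpha, K_cross):
    """Evaluate f(x'_j) = sum_i alpha_i k(x_i, x'_j); K_cross[i, j] = k(x_i, x'_j)."""
    return K_cross.T @ alpha
```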
10 Support Vector Machines II
Questions:
Universally consistent?
Learning rates?
Efficient algorithms?
Performance on real-world problems?
Additional properties?
11 An Oracle Inequality: Assumptions
Assumptions and notations:
$L(y, 0) \le 1$ for all $y \in Y$.
$L(y, \cdot) : \mathbb{R} \to [0, \infty)$ is convex and has a minimum in $[-1, 1]$.
Clipping: $\hat{t} := \max\{-1, \min\{1, t\}\}$.
$L$ is locally Lipschitz: $|L(y, t) - L(y, t')| \le |t - t'|$ for $y \in Y$, $t, t' \in [-1, 1]$.
This yields $L(y, \hat{t}) \le |L(y, \hat{t}) - L(y, 0)| + L(y, 0) \le 2$.
Variance bound: there exist $\vartheta \in [0, 1]$ and $V \ge 2$ such that for all $f : X \to \mathbb{R}$:
$$E_P\bigl(L \circ \hat{f} - L \circ f^*_{L,P}\bigr)^2 \le V \cdot \Bigl(E_P\bigl(L \circ \hat{f} - L \circ f^*_{L,P}\bigr)\Bigr)^\vartheta,$$
where $L \circ f$ denotes the map $(x, y) \mapsto L(y, f(x))$ and $\hat{f}$ the clipped $f$.
12 Entropy Numbers
Let $S : E \to F$ be a bounded linear operator and $n \ge 1$. The $n$-th (dyadic) entropy number of $S$ is defined by
$$e_n(S) := \inf\Bigl\{ \varepsilon > 0 : \exists\, x_1, \dots, x_{2^{n-1}} \in F \text{ with } S B_E \subset \bigcup_{i=1}^{2^{n-1}} (x_i + \varepsilon B_F) \Bigr\}.$$
13 An Oracle Inequality
Oracle inequality (slightly simplified): let $H$ be a separable RKHS of a measurable kernel with $\|k\|_\infty \le 1$.
Entropy assumption: there exist $p \in (0, 1)$ and $a \ge 1$ with
$$E_{T_X \sim P_X^n}\, e_i\bigl(\mathrm{id} : H \to L_2(T_X)\bigr) \le a\, i^{-\frac{1}{2p}}, \qquad i, n \ge 1.$$
Fix an $f_0 \in H$ and a $B_0 \ge 1$ such that $\|L \circ f_0\|_\infty \le B_0$. Then there exists a constant $K > 0$ such that with probability $P^n$ not less than $1 - e^{-\tau}$ we have
$$R_{L,P}(\hat{f}_{T,\lambda}) - R_{L,P}^* \le 9\bigl(\lambda \|f_0\|_H^2 + R_{L,P}(f_0) - R_{L,P}^*\bigr) + K \Bigl(\frac{a^{2p}}{\lambda^p n}\Bigr)^{\frac{1}{2 - p - \vartheta + \vartheta p}} + 3\Bigl(\frac{72 V \tau}{n}\Bigr)^{\frac{1}{2 - \vartheta}} + \frac{30 B_0 \tau}{n}.$$
14 A Simplification
Consider the approximation error function
$$A(\lambda) := \min_{f \in H}\bigl(\lambda \|f\|_H^2 + R_{L,P}(f) - R_{L,P}^*\bigr)$$
and its (unique) minimizer $f_{P,\lambda}$.
$\Longrightarrow$ For $f_0 := f_{P,\lambda}$ we can choose a $B_0$ of the order $1 + \sqrt{A(\lambda)/\lambda}$, since $\|f_{P,\lambda}\|_\infty \le \|f_{P,\lambda}\|_H \le \sqrt{A(\lambda)/\lambda}$.
Refined oracle inequality:
$$R_{L,P}(\hat{f}_{T,\lambda}) - R_{L,P}^* \le 9 A(\lambda) + K \Bigl(\frac{a^{2p}}{\lambda^p n}\Bigr)^{\frac{1}{2 - p - \vartheta + \vartheta p}} + 3\Bigl(\frac{72 V \tau}{n}\Bigr)^{\frac{1}{2 - \vartheta}} + \frac{30 \tau}{n}\sqrt{\frac{A(\lambda)}{\lambda}} + \frac{60 \tau}{n}.$$
15 Consistency
Assumptions:
$L(y, t) \le c(1 + |t|^q)$ for all $y \in Y$ and $t \in \mathbb{R}$.
$H$ is dense in $L_q(P_X)$.
$\Longrightarrow A(\lambda) \to 0$ for $\lambda \to 0$.
$\Longrightarrow$ The SVM is consistent whenever we choose $\lambda_n$ such that $\lambda_n \to 0$ and $\sup_n (n\lambda_n)^{-1} < \infty$.
16 Learning Rates
Assumptions:
There exist constants $c \ge 1$ and $\beta \in (0, 1]$ such that $A(\lambda) \le c\lambda^\beta$, $\lambda > 0$. Note: $\beta = 1 \iff f^*_{L,P} \in H$.
$L$ is Lipschitz continuous (e.g. hinge loss).
$\Longrightarrow$ Choosing $\lambda_n \sim n^{-\alpha}$ we obtain a polynomial learning rate (Zhou et al., 2005?).
Some calculations show that the best learning rate we can obtain is
$$n^{-\min\bigl\{\frac{2\beta}{\beta + 1},\ \frac{\beta}{\beta(2 - p - \vartheta + \vartheta p) + p}\bigr\}}.$$
It is achieved by
$$\lambda_n \sim n^{-\min\bigl\{\frac{2}{\beta + 1},\ \frac{1}{\beta(2 - p - \vartheta + \vartheta p) + p}\bigr\}}.$$
But $\beta$ (and often also $\vartheta$ and $p$) are unknown!
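To make the exponents tangible, a tiny sketch (editorial addition, illustrative values only) that evaluates the rate exponent for given $\beta$, $\vartheta$, $p$:

```python
def rate_exponent(beta, theta, p):
    """Exponent r of the learning rate n^{-r} from the slide above."""
    return min(2 * beta / (beta + 1),
               beta / (beta * (2 - p - theta + theta * p) + p))

# E.g. beta = 1 (f*_{L,P} in H), theta = 1 and small p give a rate close to n^{-1}:
print(rate_exponent(beta=1.0, theta=1.0, p=0.01))  # about 0.99
```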
17 Adaptivity
For $T = ((x_1, y_1), \dots, (x_n, y_n))$ define $m := \lfloor n/2 \rfloor + 1$ and
$T_1 := ((x_1, y_1), \dots, (x_m, y_m))$,
$T_2 := ((x_{m+1}, y_{m+1}), \dots, (x_n, y_n))$,
i.e. split $T$ into $T_1$ and $T_2$. Fix an $n^{-2}$-net $\Lambda_n$ of $(0, 1]$.
Training: use $T_1$ to find $f_{T_1,\lambda}$, $\lambda \in \Lambda_n$.
Validation: use $T_2$ to determine a $\lambda_{T_2} \in \Lambda_n$ that satisfies
$$R_{L,T_2}\bigl(\hat{f}_{T_1,\lambda_{T_2}}\bigr) = \min_{\lambda \in \Lambda_n} R_{L,T_2}\bigl(\hat{f}_{T_1,\lambda}\bigr).$$
$\Longrightarrow$ This yields a consistent learning method with learning rate
$$n^{-\min\bigl\{\frac{2\beta}{\beta + 1},\ \frac{\beta}{\beta(2 - p - \vartheta + \vartheta p) + p}\bigr\}}.$$
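A sketch of this split-and-validate scheme for the least squares loss (editorial addition with assumed helpers; a plain grid stands in for the $n^{-2}$-net $\Lambda_n$):

```python
import numpy as np

def train_validate(X, y, kernel, lambdas):
    """Train f_{T1,lam} on T1, then pick the lam minimizing the empirical
    risk R_{L,T2} on T2 (least squares loss for concreteness)."""
    n = len(y)
    m = n // 2 + 1
    X1, y1, X2, y2 = X[:m], y[:m], X[m:], y[m:]
    K11, K12 = kernel(X1, X1), kernel(X1, X2)   # K12[i, j] = k(x_i, x'_j)
    best = (None, np.inf, None)
    for lam in lambdas:
        alpha = np.linalg.solve(K11 + m * lam * np.eye(m), y1)  # f_{T1,lam}
        risk = np.mean((K12.T @ alpha - y2) ** 2)               # R_{L,T2}
        if risk < best[1]:
            best = (lam, risk, alpha)
    return best  # (lambda_{T2}, validation risk, coefficients alpha)
```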
18 Discussion
The oracle inequality can be generalized to regularized risk minimizers.
The presented oracle inequality yields the fastest known rates in many cases.
In some cases these rates are known to be optimal in a minimax sense.
Oracle inequalities can be used to design adaptive strategies that learn fast without knowing key parameters of $P$.
Question: which distributions can be learned fast?
19 Discussion II
Observations:
Data often lies in high-dimensional spaces, but not uniformly.
Regression: the target is often smooth (but not always).
Classification: how much do the classes overlap?
20 Observations
The relation between the RKHS $H$ and the distribution $P$ is described by two quantities:
The constants $a$ and $p$ in
$$E_{T_X \sim P_X^n}\, e_i\bigl(\mathrm{id} : H \to L_2(T_X)\bigr) \le a\, i^{-\frac{1}{2p}}, \qquad i, n \ge 1.$$
The approximation error function
$$A(\lambda) := \min_{f \in H}\bigl(\lambda \|f\|_H^2 + R_{L,P}(f) - R_{L,P}^*\bigr).$$
Task: find realistic assumptions on $P$ such that both quantities are small for commonly used kernels.
21 Entropy Numbers
Consider the integral operator $T_k : L_2(P_X) \to L_2(P_X)$ defined by
$$T_k f(x) := \int_X k(x, x') f(x')\, dP_X(x').$$
Question: what is the relation between the eigenvalues of $T_k$ and $E_{T_X \sim P_X^n}\, e_i(\mathrm{id} : H \to L_2(T_X))$?
Question: what is the behaviour if $X \subset \mathbb{R}^d$ but $P_X$ is not absolutely continuous with respect to the Lebesgue measure?
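One standard numerical handle on the first question (my sketch, relying on the usual Monte Carlo heuristic, not a result from the talk): the eigenvalues of the normalized Gram matrix $K/n$, with $x_1, \dots, x_n \sim P_X$, approximate the leading eigenvalues of $T_k$.

```python
import numpy as np

def gauss_kernel(X, Xp, sigma=1.0):
    sq = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Xp**2, axis=1)[None, :] - 2.0 * X @ Xp.T)
    return np.exp(-sigma**2 * sq)

def empirical_spectrum(X, kernel):
    """Eigenvalues of K/n, K[i, j] = k(x_i, x_j): a Monte Carlo
    approximation of the spectrum of the integral operator T_k."""
    K = kernel(X, X)
    return np.sort(np.linalg.eigvalsh(K / len(X)))[::-1]

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))   # x_i sampled from P_X = uniform on [0,1]^2
print(empirical_spectrum(X, gauss_kernel)[:10])  # rapidly decaying eigenvalues
```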
22 Approximation Error Function
For the least squares loss we have
$$A(\lambda) = \inf_{f \in H}\ \lambda \|f\|_H^2 + \|f - f^*_{L,P}\|_{L_2(P_X)}^2.$$
For Lipschitz continuous losses we have
$$A(\lambda) \le \inf_{f \in H}\ \lambda \|f\|_H^2 + \|f - f^*_{L,P}\|_{L_1(P_X)}.$$
Smale & Zhou '03: in both cases the behaviour of $A(\lambda)$ for $\lambda \to 0$ can be characterized by the $K$-functional of the pair $(H, L_p(P_X))$.
Question: what happens if $X \subset \mathbb{R}^d$ but $P_X$ is not absolutely continuous with respect to the Lebesgue measure?
23 Gaussian Kernels
Observations:
The Gaussian kernel is successfully used in many applications.
It has a parameter $\sigma$ that is almost never fixed a priori.
Question: how does $\sigma$ influence the learning rates?
24 Approximation Quantities for Gaussian Kernels
For $H_\sigma$, the RKHS of the Gaussian kernel $k_\sigma$, the entropy assumption is of the form
$$E_{T_X \sim P_X^n}\, e_i\bigl(\mathrm{id} : H_\sigma \to L_2(T_X)\bigr) \le a_\sigma\, i^{-\frac{1}{2p}}, \qquad i, n \ge 1.$$
The approximation error function also depends on $\sigma$:
$$A_\sigma(\lambda) = \inf_{f \in H_\sigma}\ \lambda \|f\|_{H_\sigma}^2 + R_{L,P}(f) - R_{L,P}^*.$$
25 Oracle Inequality
Oracle inequality using Gaussian kernels:
$$R_{L,P}(\hat{f}_{T,\lambda}) - R_{L,P}^* \le 9\bigl(\lambda \|f_0\|_{H_\sigma}^2 + R_{L,P}(f_0) - R_{L,P}^*\bigr) + K \Bigl(\frac{a_\sigma^{2p}}{\lambda^p n}\Bigr)^{\frac{1}{2 - p - \vartheta + \vartheta p}} + 3\Bigl(\frac{72 V \tau}{n}\Bigr)^{\frac{1}{2 - \vartheta}} + \frac{30 B_0(\sigma) \tau}{n}.$$
Usually $\sigma$ becomes larger with the sample size.
Task: find estimates that are good in $\sigma$ and $i$ (or $\lambda$) simultaneously.
26 An Estimate for the Entropy Numbers
Theorem (S. & Scovel, AoS 2007): let $X \subset \mathbb{R}^d$ be compact. Then for all $\varepsilon > 0$ and $0 < p < 1$ there exists a constant $c_{\varepsilon,p} \ge 1$ such that
$$E_{T_X \sim P_X^n}\, e_i\bigl(\mathrm{id} : H_\sigma \to L_2(T_X)\bigr) \le c_{\varepsilon,p}\, \sigma^{\frac{(1-p)(1+\varepsilon)d}{2p}}\, i^{-\frac{1}{2p}}.$$
This estimate does not consider properties of $P_X$.
Questions: how good is this estimate? For which $P_X$ can it be significantly improved?
27 Distance to the Decision Boundary
$\eta(x) := P(y = 1 \mid x)$. $X_{-1} := \{\eta < 1/2\}$ and $X_1 := \{\eta > 1/2\}$.
For $x \in X \subset \mathbb{R}^d$ we define
$$\Delta(x) := \begin{cases} d(x, X_{-1}), & \text{if } x \in X_1, \\ d(x, X_1), & \text{if } x \in X_{-1}, \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$
where $d(x, A)$ denotes the distance between $x$ and the set $A$.
Interpretation: $\Delta(x)$ measures the distance of $x$ to the decision boundary.
28 Margin Exponents
Margin exponent: there exist $c \ge 1$ and $\alpha > 0$ such that
$$P_X\bigl(\Delta(x) \le t\bigr) \le c\, t^\alpha, \qquad t > 0.$$
Example: $X \subset \mathbb{R}^d$ compact with positive volume, $P_X$ the uniform distribution, decision boundary linear or a circle $\Longrightarrow \alpha = 1$. This remains true under transformations.
Margin-noise exponent: there exist $c \ge 1$ and $\beta > 0$ such that
$$\int_{\{\Delta(x) \le t\}} |2\eta(x) - 1|\, dP_X \le c\, t^\beta, \qquad t > 0.$$
Interpretation: the exponent $\beta$ measures how much $|2\eta - 1|\, dP_X$ is concentrated near the decision boundary.
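A quick Monte Carlo sanity check of the linear-boundary example (my addition): for $P_X$ uniform on $[0,1]^2$ and the boundary $\{x_1 = 1/2\}$, the distance is $\Delta(x) = |x_1 - 1/2|$ and $P_X(\Delta(x) \le t) = 2t$, i.e. $\alpha = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(200_000, 2))

# Linear decision boundary {x_1 = 1/2}: Delta(x) = |x_1 - 1/2|.
delta = np.abs(X[:, 0] - 0.5)

for t in (0.02, 0.05, 0.1, 0.2):
    print(t, np.mean(delta <= t))  # approximately 2*t, so c*t^alpha with alpha = 1
```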
29 A Bound on the Approximation Error
Theorem (S. & Scovel, AoS 2007): let $X \subset \mathbb{R}^d$ be compact and $P$ a distribution on $X \times \{-1, 1\}$ with margin-noise exponent $\beta$ (and constant $c$). Let $L$ be the hinge loss. Then there exist constants $c_d > 0$ and $c_{d,\beta} > 0$ such that for all $\sigma > 0$ and $\lambda > 0$ there exists an $f \in H_\sigma$ satisfying $\|f\|_\infty \le 1$ and
$$\lambda \|f\|_{H_\sigma}^2 + R_{L,P}(f) - R_{L,P}^* \le c_d\, \lambda \sigma^d + c_{d,\beta}\, c\, \sigma^{-\beta}.$$
Remarks: not optimal in $\lambda$. How can this be improved? Better dependence on the dimension?!
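Balancing the two terms indicates how $\sigma$ should be coupled to $\lambda$; a short informal calculation (my addition, assuming the bound above):

```latex
% Minimizing  c_d \lambda \sigma^d + c_{d,\beta}\, c\, \sigma^{-\beta}  over \sigma > 0:
% the two terms are of the same order for
\sigma \sim \lambda^{-\frac{1}{\beta + d}},
\qquad\text{giving}\qquad
A_\sigma(\lambda) \;\lesssim\; \lambda^{\frac{\beta}{\beta + d}},
% i.e. an approximation exponent beta' = beta/(beta + d) in A(lambda) <= c lambda^{beta'}.
```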
30 Conclusion
Oracle inequalities can be used to design adaptive SVMs.
For which distributions do such adaptive SVMs learn fast?
Bounds for entropy numbers.
Bounds for the approximation error.