Classification with Reject Option

Size: px
Start display at page:

Download "Classification with Reject Option"

Transcription

1 Classification with Reject Option Bartlett and Wegkamp (2008) Wegkamp and Yuan (2010) February 17, 2012

2 Outline. Introduction.. Classification with reject option. Spirit of the papers BW Infinite sample consistency.. Bounding excess risk.. Convergence of excess risk. Results.. Generalized hinge loss (SVM) BW Further generalization WY2010

3 Introduction reject option classification: observed data: {X, Y } X {±1} discriminant function: f : X R, typically classify by sign(f ) reject option: withhold for cost d when f < δ, and proceed as usual if f > δ for specified d, δ motivation: want to avoid making hard decisions

4 Introduction loss loss as a function of Yf (X ) typically: l(z) = { 1 if z 0 0 if z > 0 reject option: 1 if z < δ l d,δ (z) = 0 if z δ d if z δ for some d [0, 1/2], δ (0, 1)

5 Introduction loss loss as a function of Yf (X ) reject option: 1 if z < δ l d,δ (z) = 0 if z δ d if z δ for some d [0, 1/2], δ (0, 1) if d 0 always reject, if d > 1/2 never reject intuition: d is necessary amount of confidence for classifying

6 Introduction Bayes rule minimize L d,δ (f ) = E{l d,δ (Yf (X ))} f d (X ) = where η(x ) = P(Y = +1 X ) +1 if η(x ) > 1 d 0 if d η(x ) 1 d 1 if η(x ) < d follows from conditional expectation: E Y X {l d,δ (Yf (X ))} = η(x ) l d,δ (f (X )) + (1 η(x )) l d,δ ( f (X ))

7 Introduction Bayes rule minimize L d,δ (f ) = E{l d,δ (Yf (X ))} f d (X ) = where η(x ) = P(Y = +1 X ) +1 if η(x ) > 1 d 0 if d η(x ) 1 d 1 if η(x ) < d in practice, minimize empirical risk: P n (f ) = 1 n n l d,δ (Y i f (X i )) i=1 minimizing P n is NP hard

8 General spirit propose convex surrogate loss φ d consider φ d risk: Q(f ) = E{φ d (Yf (X )} minimize empirical risk: Q n (f ) let ˆf n = argmin f F Q n study the performance of φ d.. consistency.. bounding excess risk.. convergence rates for empirical maximizer

9 General spirit. consistency.. f d = argmin f Q(f ).. asymptotic minimizer of φ d risk is same as for l d,δ risk.. able to recover f d. bounding excess risk.. want some non-decreasing function ρ( ) such that: L d,δ (f ) L d,δ (f d ) ρ(q(f ) Q(f d )) L d,δ (f ) ρ( Q(f )).. guarantee performance of f. convergence rate for empirical maximizer.. compute rate for Q(ˆf n ) 0.. combine with bound on excess risk

10 Generalized hinge loss convex surrogate of the form: 1 ( ) 1 d d z if z 0 φ d (z) = 1 z if 0 < z 1 0 if z > 1 consistency Proposition 1 The Bayes discriminant function fd Q with: dq(f d ) = L d,δ(f d ) minimizes risk

11 Generalized hinge loss convex surrogate of the form: 1 ( ) 1 d d z if z 0 φ d (z) = 1 z if 0 < z 1 0 if z > 1 excess risk bounds note: for δ 1 d, l d,φ (X ) φ d (X ) Theorem 2 for fixed f, d [0, 1/2): for all δ (0, 1/2], for all δ [1/2, 1 d], L d,δ (f ) d δ Q(f ) L d,δ(f ) Q(f ) recommend using δ = 1/2

12 Generalized hinge loss convex surrogate of the form: 1 ( ) 1 d d z if z 0 φ d (z) = 1 z if 0 < z 1 0 if z > 1 excess risk bounds note: for δ 1 d, l d,φ (X ) φ d (X ) Theorem 2 for fixed f, (δ, d) = (0, 1/2): L(f ) Q(f ) risk bounds for usual classification with hinge loss

13 Generalized hinge loss convex surrogate of the form: 1 ( ) 1 d d z if z 0 φ d (z) = 1 z if 0 < z 1 0 if z > 1 convergence rates bounds of the form: EQ(ˆf n ) Q(fd ) 2 inf ( Q(f ) Q(f d ) ) + ɛ n f F. approximation error for space F.. method of sieves. estimation error for finite sample n authors results

14 Generalized hinge loss convergence rates Margin condition: we say that η satisfies the margin condition at d with exponent α > 0 if there is a c 1 such that for all t > 0, P{ η(x ) d t} ct α P{ η(x ) (1 d) t} ct α Bernstein class: we say that G L 2 (P) is a (β, B)-Bernstein class with respect to the probability measure P, with β (0, 1] and B 1 if every g G satisfies Pg 2 B(Pg) β

15 Generalized hinge loss convergence rates Lemma 8 if η satisfies the margin condition at d with exponent α, then for any class F of measurable uniformly bounded functions, the class G = {g f : f F} has a Bernstein exponent β = where g f (x, y) = φ d (yf (x)) φ d (yf d (x)). the excess φ d loss. E(g f (X, Y )) = Q(f ) α 1+α

16 Generalized hinge loss convergence rates Theorem 12 if η satisfies the margin condition at d with exponent α, F is a countable class of functions f : X R satisfying f B, and F satisfies log N(ɛ, L, F) Cɛ p for all ɛ > 0 and some 0 p 2, then there exists a constant C independent of n, such that EQ(ˆf n ) Q(fd ) 2 inf ( Q(f ) Q(f d ) ) + C n 1+α 2+p+α+pα f F

17 Generalized hinge loss - QCQP can be solved as the following quadratic constraint quadratic programming problem (QCQP) Theorem 5 for any x 1,..., x n X and y 1,..., y n {±1}, let ˆf H be the solution to where r > 0 minimize f n φ d (y i f (x i )) i=1 such that f 2 r 2

18 Generalized hinge loss - QCQP can be solved as the following quadratic constraint quadratic programming problem (QCQP) Theorem 5 Then we can represent ˆf as the finite sum ˆf (x) = n i=1 ˆα ik(x i, x), where ˆα 1,..., ˆα n is the solution to the following QCQP: 1 n min α i,ξ i γ i n i=1 such that i,j ( ξ i + 1 2d d γ i ) α i α j K(x i, x j ) r 2 n ξ i 0, ξ i 1 y i α j K(x i, x j ) j=1 n γ i 0, γ i y i α j K(x i, x j ) j=1

19 Generalization generalize to the following loss 1 if g(x ) Y and g(x ) 0 l[g(x ), Y ] = d if g(x ) = 0 0 if g(x ) = Y with classification rule +1 if f (X ) > δ C(f (X ), δ) = 0 if f (X ) δ 1 if f (X ) < δ equivalence to previous case can be seen by: l[c(f (X ), δ), Y ] l d,δ (Yf (X )) since looking at a family of surrogates we want to separate δ from the loss function

20 Generalization consistency Theorem 1 assue for convex φ, the classification rule C(fφ, δ) for some δ > 0 is infinite sample consistent if and only if. φ (δ) and φ ( δ) both exist. φ (0) < 0. φ (δ) φ (δ)+φ ( δ) = d for strictly convex φ, at most one value of δ for generalized hinge loss, holds for δ (0, 1)

21 Generalization excess risk bounds Theorem 9 assume φ is a convex infinite sample consistent surrogate and suppose that there exists constants C > 0 and s 1 such that then, η d s C s Q η ( δ) (1 η) d s C s Q η (δ) R[C(f, δ)] 2C[ Q(f )] 1/s where Q η (z) = η(x )φ(z) + (1 η(x ))φ( z)

22 Generalization excess risk bounds Theorem 10 further assume 2 s (η 1/2) s + C s Q η ( δ) 2 s ((1 η) 1/2) s + C s Q η (δ) then, R[C(f, δ)] C[ Q(f )] 1/s Theorem 11 if margin condition holds for some c 1 and α 0, then, R[C(f, δ)] K[ Q(f )] 1/(s+β βs) where K depends on c and α, and β = α/(1 + α)

23 Generalization convergence rates Theorem 18 assume that f B for all f F and let γ (0, 1). with probability at least 1 γ Q(ˆf n ) inf Q(f ) + 3L ( L 2 f F n + 8 2c + B ) log(nn /γ) 6 n where N n = N(1/n, L, F) relies on:. Lipschitz continuity of φ. constraint on modulus of convexity of Q

24 Generalization convergence rates Corollary 19 under the assumptions of Theorems 9 and 18, we have, with probability at least 1 γ { R(C(ˆf n, δ)) 2C inf Q(f ) + 3L ( L 2 f F n + 8 2c + LB 3 ) log(nn /γ) n } 1/s if the margin condition also holds, with probability at least 1 γ { R(C(ˆf n, δ)) K inf Q(f ) + 3L ( L 2 f F n + 8 2c + LB 3 ) log(nn /γ) n } 1/(sβ βs)

25 Miscellanea other losses considered: least squares exponential logistic squared hinge DWD further extensions: asymmetric loss hinge loss with L 1 regularization YW Bernoulli, 2011

Lecture 10 February 23

Lecture 10 February 23 EECS 281B / STAT 241B: Advanced Topics in Statistical LearningSpring 2009 Lecture 10 February 23 Lecturer: Martin Wainwright Scribe: Dave Golland Note: These lecture notes are still rough, and have only

More information

Learning with Rejection

Learning with Rejection Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,

More information

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization

More information

Statistical Properties of Large Margin Classifiers

Statistical Properties of Large Margin Classifiers Statistical Properties of Large Margin Classifiers Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Jordan, Jon McAuliffe, Ambuj Tewari. slides

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

PETER L. BARTLETT AND MARTEN H. WEGKAMP

PETER L. BARTLETT AND MARTEN H. WEGKAMP CLASSIFICATION WITH A REJECT OPTION USING A HINGE LOSS PETER L. BARTLETT AND MARTEN H. WEGKAMP Abstract. We consier the problem of binary classification where the classifier can, for a particular cost,

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Bayes rule and Bayes error. Donglin Zeng, Department of Biostatistics, University of North Carolina

Bayes rule and Bayes error. Donglin Zeng, Department of Biostatistics, University of North Carolina Bayes rule and Bayes error Definition If f minimizes E[L(Y, f (X))], then f is called a Bayes rule (associated with the loss function L(y, f )) and the resulting prediction error rate, E[L(Y, f (X))],

More information

Fast learning rates for plug-in classifiers under the margin condition

Fast learning rates for plug-in classifiers under the margin condition Fast learning rates for plug-in classifiers under the margin condition Jean-Yves Audibert 1 Alexandre B. Tsybakov 2 1 Certis ParisTech - Ecole des Ponts, France 2 LPMA Université Pierre et Marie Curie,

More information

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University

Chapter 9. Support Vector Machine. Yongdai Kim Seoul National University Chapter 9. Support Vector Machine Yongdai Kim Seoul National University 1. Introduction Support Vector Machine (SVM) is a classification method developed by Vapnik (1996). It is thought that SVM improved

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Classification objectives COMS 4771

Classification objectives COMS 4771 Classification objectives COMS 4771 1. Recap: binary classification Scoring functions Consider binary classification problems with Y = { 1, +1}. 1 / 22 Scoring functions Consider binary classification

More information

Large Margin Classifiers: Convexity and Classification

Large Margin Classifiers: Convexity and Classification Large Margin Classifiers: Convexity and Classification Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mike Collins, Mike Jordan, David McAllester,

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Linear Support Vector Machine Kernelized SVM Kernels 2 From ERM to RLM Empirical Risk Minimization in the binary

More information

Calibrated Surrogate Losses

Calibrated Surrogate Losses EECS 598: Statistical Learning Theory, Winter 2014 Topic 14 Calibrated Surrogate Losses Lecturer: Clayton Scott Scribe: Efrén Cruz Cortés Disclaimer: These notes have not been subjected to the usual scrutiny

More information

Classification Methods with Reject Option Based on Convex Risk Minimization

Classification Methods with Reject Option Based on Convex Risk Minimization Journal of Machine Learning Research 11 (010) 111-130 Submitte /09; Revise 11/09; Publishe 1/10 Classification Methos with Reject Option Base on Convex Risk Minimization Ming Yuan H. Milton Stewart School

More information

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu

More information

Consistency of Nearest Neighbor Methods

Consistency of Nearest Neighbor Methods E0 370 Statistical Learning Theory Lecture 16 Oct 25, 2011 Consistency of Nearest Neighbor Methods Lecturer: Shivani Agarwal Scribe: Arun Rajkumar 1 Introduction In this lecture we return to the study

More information

AdaBoost and other Large Margin Classifiers: Convexity in Classification

AdaBoost and other Large Margin Classifiers: Convexity in Classification AdaBoost and other Large Margin Classifiers: Convexity in Classification Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mikhail Traskin. slides at

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Lecture Notes 15 Prediction Chapters 13, 22, 20.4.

Lecture Notes 15 Prediction Chapters 13, 22, 20.4. Lecture Notes 15 Prediction Chapters 13, 22, 20.4. 1 Introduction Prediction is covered in detail in 36-707, 36-701, 36-715, 10/36-702. Here, we will just give an introduction. We observe training data

More information

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels Karl Stratos June 21, 2018 1 / 33 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot, we get creative in two

More information

BINARY CLASSIFICATION

BINARY CLASSIFICATION BINARY CLASSIFICATION MAXIM RAGINSY The problem of binary classification can be stated as follows. We have a random couple Z = X, Y ), where X R d is called the feature vector and Y {, } is called the

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

Accelerated Training of Max-Margin Markov Networks with Kernels

Accelerated Training of Max-Margin Markov Networks with Kernels Accelerated Training of Max-Margin Markov Networks with Kernels Xinhua Zhang University of Alberta Alberta Innovates Centre for Machine Learning (AICML) Joint work with Ankan Saha (Univ. of Chicago) and

More information

A Study of Relative Efficiency and Robustness of Classification Methods

A Study of Relative Efficiency and Robustness of Classification Methods A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

Support Vector Machines for Classification and Regression

Support Vector Machines for Classification and Regression CIS 520: Machine Learning Oct 04, 207 Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

Introduction to Logistic Regression and Support Vector Machine

Introduction to Logistic Regression and Support Vector Machine Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Does Modeling Lead to More Accurate Classification?

Does Modeling Lead to More Accurate Classification? Does Modeling Lead to More Accurate Classification? A Comparison of the Efficiency of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang

More information

Support Vector Machines with a Reject Option

Support Vector Machines with a Reject Option Support Vector Machines with a Reject Option Yves Grandvalet, 2, Alain Rakotomamonjy 3, Joseph Keshet 2 and Stéphane Canu 3 Heudiasyc, UMR CNRS 6599 2 Idiap Research Institute Université de Technologie

More information

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab

More information

12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016

12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016 12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses

More information

Modelli Lineari (Generalizzati) e SVM

Modelli Lineari (Generalizzati) e SVM Modelli Lineari (Generalizzati) e SVM Corso di AA, anno 2018/19, Padova Fabio Aiolli 19/26 Novembre 2018 Fabio Aiolli Modelli Lineari (Generalizzati) e SVM 19/26 Novembre 2018 1 / 36 Outline Linear methods

More information

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs E0 270 Machine Learning Lecture 5 (Jan 22, 203) Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in

More information

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 A. Kernels 1. Let X be a finite set. Show that the kernel

More information

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information

Statistical learning with Lipschitz and convex loss functions

Statistical learning with Lipschitz and convex loss functions Statistical learning with Lipschitz and convex loss functions Geoffrey Chinot, Guillaume Lecué and Matthieu Lerasle October 3, 08 Abstract We obtain risk bounds for Empirical Risk Minimizers ERM and minmax

More information

l 1 -Regularized Linear Regression: Persistence and Oracle Inequalities

l 1 -Regularized Linear Regression: Persistence and Oracle Inequalities l -Regularized Linear Regression: Persistence and Oracle Inequalities Peter Bartlett EECS and Statistics UC Berkeley slides at http://www.stat.berkeley.edu/ bartlett Joint work with Shahar Mendelson and

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Decentralized decision making with spatially distributed data

Decentralized decision making with spatially distributed data Decentralized decision making with spatially distributed data XuanLong Nguyen Department of Statistics University of Michigan Acknowledgement: Michael Jordan, Martin Wainwright, Ram Rajagopal, Pravin Varaiya

More information

Support Vector Machines, Kernel SVM

Support Vector Machines, Kernel SVM Support Vector Machines, Kernel SVM Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, 2017 1 / 40 Outline 1 Administration 2 Review of last lecture 3 SVM

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Divergences, surrogate loss functions and experimental design

Divergences, surrogate loss functions and experimental design Divergences, surrogate loss functions and experimental design XuanLong Nguyen University of California Berkeley, CA 94720 xuanlong@cs.berkeley.edu Martin J. Wainwright University of California Berkeley,

More information

Curve learning. p.1/35

Curve learning. p.1/35 Curve learning Gérard Biau UNIVERSITÉ MONTPELLIER II p.1/35 Summary The problem The mathematical model Functional classification 1. Fourier filtering 2. Wavelet filtering Applications p.2/35 The problem

More information

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Statistical learning theory, Support vector machines, and Bioinformatics

Statistical learning theory, Support vector machines, and Bioinformatics 1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.

More information

On Information Divergence Measures, Surrogate Loss Functions and Decentralized Hypothesis Testing

On Information Divergence Measures, Surrogate Loss Functions and Decentralized Hypothesis Testing On Information Divergence Measures, Surrogate Loss Functions and Decentralized Hypothesis Testing XuanLong Nguyen Martin J. Wainwright Michael I. Jordan Electrical Engineering & Computer Science Department

More information

On the Consistency of AUC Pairwise Optimization

On the Consistency of AUC Pairwise Optimization On the Consistency of AUC Pairwise Optimization Wei Gao and Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing University Collaborative Innovation Center of Novel Software Technology

More information

On divergences, surrogate loss functions, and decentralized detection

On divergences, surrogate loss functions, and decentralized detection On divergences, surrogate loss functions, and decentralized detection XuanLong Nguyen Computer Science Division University of California, Berkeley xuanlong@eecs.berkeley.edu Martin J. Wainwright Statistics

More information

How Good is a Kernel When Used as a Similarity Measure?

How Good is a Kernel When Used as a Similarity Measure? How Good is a Kernel When Used as a Similarity Measure? Nathan Srebro Toyota Technological Institute-Chicago IL, USA IBM Haifa Research Lab, ISRAEL nati@uchicago.edu Abstract. Recently, Balcan and Blum

More information

Surrogate loss functions, divergences and decentralized detection

Surrogate loss functions, divergences and decentralized detection Surrogate loss functions, divergences and decentralized detection XuanLong Nguyen Department of Electrical Engineering and Computer Science U.C. Berkeley Advisors: Michael Jordan & Martin Wainwright 1

More information

Review: Support vector machines. Machine learning techniques and image analysis

Review: Support vector machines. Machine learning techniques and image analysis Review: Support vector machines Review: Support vector machines Margin optimization min (w,w 0 ) 1 2 w 2 subject to y i (w 0 + w T x i ) 1 0, i = 1,..., n. Review: Support vector machines Margin optimization

More information

Sparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results

Sparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results Sparseness Versus Estimating Conditional Probabilities: Some Asymptotic Results Peter L. Bartlett 1 and Ambuj Tewari 2 1 Division of Computer Science and Department of Statistics University of California,

More information

Lecture 7: Kernels for Classification and Regression

Lecture 7: Kernels for Classification and Regression Lecture 7: Kernels for Classification and Regression CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 15, 2011 Outline Outline A linear regression problem Linear auto-regressive

More information

Subgradient Descent. David S. Rosenberg. New York University. February 7, 2018

Subgradient Descent. David S. Rosenberg. New York University. February 7, 2018 Subgradient Descent David S. Rosenberg New York University February 7, 2018 David S. Rosenberg (New York University) DS-GA 1003 / CSCI-GA 2567 February 7, 2018 1 / 43 Contents 1 Motivation and Review:

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

This is an author-deposited version published in : Eprints ID : 17710

This is an author-deposited version published in :   Eprints ID : 17710 Open Archive TOULOUSE Archive Ouverte (OATAO) OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible. This is an author-deposited

More information

Warm up: risk prediction with logistic regression

Warm up: risk prediction with logistic regression Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T

More information

Introduction to Machine Learning (67577) Lecture 5

Introduction to Machine Learning (67577) Lecture 5 Introduction to Machine Learning (67577) Lecture 5 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Nonuniform learning, MDL, SRM, Decision Trees, Nearest Neighbor Shai

More information

Support Vector Machines

Support Vector Machines Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Confidence Sets with Expected Sizes for Multiclass Classification

Confidence Sets with Expected Sizes for Multiclass Classification Journal of Machine Learning Research 18 2017 1-28 Submitted 11/16; Revised 9/17; Published 10/17 Confidence Sets with Expected Sizes for Multiclass Classification Christophe Denis LAMA UMR-CNRS 8050 Université

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

Machine Learning. Support Vector Machines. Manfred Huber

Machine Learning. Support Vector Machines. Manfred Huber Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

More information

Support vector machines

Support vector machines Support vector machines Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 SVM, kernel methods and multiclass 1/23 Outline 1 Constrained optimization, Lagrangian duality and KKT 2 Support

More information

Support Vector Machines for Classification: A Statistical Portrait

Support Vector Machines for Classification: A Statistical Portrait Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,

More information

Nonparametric Bayesian Methods

Nonparametric Bayesian Methods Nonparametric Bayesian Methods Debdeep Pati Florida State University October 2, 2014 Large spatial datasets (Problem of big n) Large observational and computer-generated datasets: Often have spatial and

More information

TaylorBoost: First and Second-order Boosting Algorithms with Explicit Margin Control

TaylorBoost: First and Second-order Boosting Algorithms with Explicit Margin Control TaylorBoost: First and Second-order Boosting Algorithms with Explicit Margin Control Mohammad J. Saberian Hamed Masnadi-Shirazi Nuno Vasconcelos Department of Electrical and Computer Engineering University

More information

Support Vector Machines and Bayes Regression

Support Vector Machines and Bayes Regression Statistical Techniques in Robotics (16-831, F11) Lecture #14 (Monday ctober 31th) Support Vector Machines and Bayes Regression Lecturer: Drew Bagnell Scribe: Carl Doersch 1 1 Linear SVMs We begin by considering

More information

Convexity, Classification, and Risk Bounds

Convexity, Classification, and Risk Bounds Convexity, Classification, and Risk Bounds Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@stat.berkeley.edu Michael I. Jordan Computer

More information

The Margin Vector, Admissible Loss and Multi-class Margin-based Classifiers

The Margin Vector, Admissible Loss and Multi-class Margin-based Classifiers The Margin Vector, Admissible Loss and Multi-class Margin-based Classifiers Hui Zou University of Minnesota Ji Zhu University of Michigan Trevor Hastie Stanford University Abstract We propose a new framework

More information

Plug-in Approach to Active Learning

Plug-in Approach to Active Learning Plug-in Approach to Active Learning Stanislav Minsker Stanislav Minsker (Georgia Tech) Plug-in approach to active learning 1 / 18 Prediction framework Let (X, Y ) be a random couple in R d { 1, +1}. X

More information

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution

More information

Stanford Statistics 311/Electrical Engineering 377

Stanford Statistics 311/Electrical Engineering 377 I. Bayes risk in classification problems a. Recall definition (1.2.3) of f-divergence between two distributions P and Q as ( ) p(x) D f (P Q) : q(x)f dx, q(x) where f : R + R is a convex function satisfying

More information

Surrogate regret bounds for generalized classification performance metrics

Surrogate regret bounds for generalized classification performance metrics Surrogate regret bounds for generalized classification performance metrics Wojciech Kotłowski Krzysztof Dembczyński Poznań University of Technology PL-SIGML, Częstochowa, 14.04.2016 1 / 36 Motivation 2

More information

Distirbutional robustness, regularizing variance, and adversaries

Distirbutional robustness, regularizing variance, and adversaries Distirbutional robustness, regularizing variance, and adversaries John Duchi Based on joint work with Hongseok Namkoong and Aman Sinha Stanford University November 2017 Motivation We do not want machine-learned

More information

A talk on Oracle inequalities and regularization. by Sara van de Geer

A talk on Oracle inequalities and regularization. by Sara van de Geer A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

A Bahadur Representation of the Linear Support Vector Machine

A Bahadur Representation of the Linear Support Vector Machine A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support

More information

Corrections. ir j r mod n 2 ). (i r j r ) mod n should be d. p. 83, l. 2 [p. 80, l. 5]: d. p. 92, l. 1 [p. 87, l. 2]: Λ Z d should be Λ Z d.

Corrections. ir j r mod n 2 ). (i r j r ) mod n should be d. p. 83, l. 2 [p. 80, l. 5]: d. p. 92, l. 1 [p. 87, l. 2]: Λ Z d should be Λ Z d. Corrections We list here typos and errors that have been found in the published version of the book. We welcome any additional corrections you may find! We indicate page and line numbers for the published

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Nonparametric Decentralized Detection using Kernel Methods

Nonparametric Decentralized Detection using Kernel Methods Nonparametric Decentralized Detection using Kernel Methods XuanLong Nguyen xuanlong@cs.berkeley.edu Martin J. Wainwright, wainwrig@eecs.berkeley.edu Michael I. Jordan, jordan@cs.berkeley.edu Electrical

More information

Decentralized Detection and Classification using Kernel Methods

Decentralized Detection and Classification using Kernel Methods Decentralized Detection and Classification using Kernel Methods XuanLong Nguyen Computer Science Division University of California, Berkeley xuanlong@cs.berkeley.edu Martin J. Wainwright Electrical Engineering

More information

Lecture 10: Support Vector Machine and Large Margin Classifier

Lecture 10: Support Vector Machine and Large Margin Classifier Lecture 10: Support Vector Machine and Large Margin Classifier Applied Multivariate Analysis Math 570, Fall 2014 Xingye Qiao Department of Mathematical Sciences Binghamton University E-mail: qiao@math.binghamton.edu

More information

ASSESSING ROBUSTNESS OF CLASSIFICATION USING ANGULAR BREAKDOWN POINT

ASSESSING ROBUSTNESS OF CLASSIFICATION USING ANGULAR BREAKDOWN POINT Submitted to the Annals of Statistics ASSESSING ROBUSTNESS OF CLASSIFICATION USING ANGULAR BREAKDOWN POINT By Junlong Zhao, Guan Yu and Yufeng Liu, Beijing Normal University, China State University of

More information

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song

STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song STAT 992 Paper Review: Sure Independence Screening in Generalized Linear Models with NP-Dimensionality J.Fan and R.Song Presenter: Jiwei Zhao Department of Statistics University of Wisconsin Madison April

More information

Statistical Methods for SVM

Statistical Methods for SVM Statistical Methods for SVM Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot,

More information

Support Vector Machine

Support Vector Machine Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

More information