Statistical Properties of Large Margin Classifiers

Statistical Properties of Large Margin Classifiers

Peter Bartlett
Division of Computer Science and Department of Statistics, UC Berkeley
Joint work with Mike Jordan, Jon McAuliffe, and Ambuj Tewari.

Slides at http://www.stat.berkeley.edu/~bartlett/talks

The Pattern Classification Problem

i.i.d. data $(X, Y), (X_1, Y_1), \ldots, (X_n, Y_n)$ from $\mathcal{X} \times \{\pm 1\}$.

Use the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ to choose $f_n : \mathcal{X} \to \mathbb{R}$ with small risk,
$$R(f_n) = \Pr(\operatorname{sign}(f_n(X)) \ne Y) = E\,\ell(Y, f(X)).$$

Natural approach: minimize the empirical risk,
$$\hat{R}(f) = \hat{E}\,\ell(Y, f(X)) = \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, f(X_i)).$$

Often intractable... Replace the 0-1 loss $\ell$ with a convex surrogate $\phi$.
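As a concrete illustration of the surrogate idea (my own sketch, not from the slides), the snippet below compares the empirical 0-1 risk with an empirical hinge-loss risk for an arbitrary linear scorer on hypothetical toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: X in R^2, labels in {-1, +1} (hypothetical example data).
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * rng.normal(size=200))
y[y == 0] = 1

w = np.array([1.0, 0.0])              # an arbitrary linear scorer f(x) = <w, x>
margins = y * (X @ w)                 # margins Y f(X)

emp_risk_01 = np.mean(margins <= 0)   # empirical 0-1 risk (zero margin counted as error)
emp_risk_hinge = np.mean(np.maximum(0.0, 1.0 - margins))  # empirical phi-risk, hinge surrogate

print(f"empirical 0-1 risk:   {emp_risk_01:.3f}")
print(f"empirical hinge risk: {emp_risk_hinge:.3f}")
```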

Large Margin Algorithms

Consider the margins, $Y f(X)$. Define a margin cost function $\phi : \mathbb{R} \to \mathbb{R}^+$.

Define the $\phi$-risk of $f : \mathcal{X} \to \mathbb{R}$ as $R_\phi(f) = E\,\phi(Y f(X))$.

Choose $f \in \mathcal{F}$ to minimize the $\phi$-risk. (For example, use the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ to minimize the empirical $\phi$-risk,
$$\hat{R}_\phi(f) = \hat{E}\,\phi(Y f(X)) = \frac{1}{n} \sum_{i=1}^{n} \phi(Y_i f(X_i)),$$
or a regularized version.)

Large Margin Algorithms

AdaBoost: $\mathcal{F} = \operatorname{span}(\mathcal{G})$ for a VC-class $\mathcal{G}$, with $\phi(\alpha) = \exp(-\alpha)$. Minimizes $\hat{R}_\phi(f)$ using greedy basis selection and line search.

Support vector machines with 2-norm soft margin: $\mathcal{F}$ is a ball in a reproducing kernel Hilbert space $H$, with $\phi(\alpha) = (\max(0, 1 - \alpha))^2$. The algorithm minimizes $\hat{R}_\phi(f) + \lambda \|f\|_H^2$.
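To make the AdaBoost line concrete, here is a minimal sketch of greedy basis selection with exact line search on the empirical exponential risk, assuming a hypothetical base class of axis-aligned decision stumps; it illustrates the idea rather than reproducing the slides' implementation.

```python
import numpy as np

def stump_predictions(X):
    """Enumerate a small hypothetical base class G of axis-aligned stumps
    h(x) = s * sign(x[j] - t) over features j, thresholds t, signs s."""
    hs = []
    for j in range(X.shape[1]):
        for t in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            h = np.sign(X[:, j] - t)
            h[h == 0] = 1
            hs.append(h)
            hs.append(-h)
    return np.array(hs)                      # shape (num_stumps, n)

def adaboost_exp_loss(X, y, rounds=50):
    """Greedy basis selection + line search on (1/n) sum_i exp(-y_i f(x_i))."""
    H = stump_predictions(X)                 # base-class outputs on the sample
    f = np.zeros(len(y))                     # current scores f(x_i)
    for _ in range(rounds):
        w = np.exp(-y * f)
        w /= w.sum()                         # AdaBoost weights
        err = (w * (H != y)).sum(axis=1)     # weighted 0-1 error of each stump
        j = np.argmin(err)                   # greedy basis selection
        eps = np.clip(err[j], 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)  # exact line search for exp loss
        f += alpha * H[j]
    return f

# Hypothetical usage on toy data:
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)
scores = adaboost_exp_loss(X, y)
print("training error:", np.mean(np.sign(scores) != y))
```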

Large Margin Algorithms

Many other variants:
- Neural net classifiers: $\phi(\alpha) = (\max(0,\, 0.8 - \alpha))^2$.
- Support vector machines with 1-norm soft margin: $\phi(\alpha) = \max(0,\, 1 - \alpha)$.
- L2Boost, LS-SVMs: $\phi(\alpha) = (1 - \alpha)^2$.
- Logistic regression: $\phi(\alpha) = \log(1 + \exp(-2\alpha))$.
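For reference, here is a small sketch collecting the margin cost functions listed above as Python callables; reading the neural-net cost as a truncated quadratic shifted to 0.8 is my interpretation of the slide, not something it states explicitly.

```python
import numpy as np

# Margin cost functions phi(alpha), where alpha = y * f(x).
phis = {
    "exponential (AdaBoost)":               lambda a: np.exp(-a),
    "hinge (1-norm SVM)":                   lambda a: np.maximum(0.0, 1.0 - a),
    "truncated quadratic (2-norm SVM)":     lambda a: np.maximum(0.0, 1.0 - a) ** 2,
    "neural net (shifted, clipped square)": lambda a: np.maximum(0.0, 0.8 - a) ** 2,
    "least squares (L2Boost, LS-SVM)":      lambda a: (1.0 - a) ** 2,
    "logistic":                             lambda a: np.log1p(np.exp(-2.0 * a)),
}

alphas = np.linspace(-2, 2, 5)
for name, phi in phis.items():
    print(f"{name:40s}", np.round(phi(alphas), 3))
```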

Large Margin Algorithms

[Figure: the surrogate losses plotted against the margin $\alpha$: exponential, hinge, logistic, and truncated quadratic.]

Statistical Consequences of Using a Convex Cost

Bayes risk consistency? For which $\phi$?
- (Lugosi and Vayatis, 2004), (Mannor, Meir and Zhang, 2002): regularized boosting.
- (Zhang, 2004), (Steinwart, 2003): SVM.
- (Jiang, 2004): boosting with early stopping.

Statistical Consequences of Using a Convex Cost

How is risk related to $\phi$-risk?
- (Lugosi and Vayatis, 2004), (Steinwart, 2003): asymptotic results.
- (Zhang, 2004): comparison theorem.

Convergence rates? Estimating conditional probabilities?

Overview
- Relating excess risk to excess $\phi$-risk.
- The approximation/estimation decomposition and universal consistency.
- Kernel classifiers: sparseness versus probability estimation.

Definitions and Facts

$R(f) = \Pr(\operatorname{sign}(f(X)) \ne Y)$ (risk)
$R_\phi(f) = E\,\phi(Y f(X))$ ($\phi$-risk)
$\eta(x) = \Pr(Y = 1 \mid X = x)$ (conditional probability)
$R^* = \inf_f R(f)$, $\quad R_\phi^* = \inf_f R_\phi(f)$.

$\eta$ defines an optimal classifier: $R^* = R\big(\operatorname{sign}(\eta(\cdot) - 1/2)\big)$.

Notice: $R_\phi(f) = E\big(E[\phi(Y f(X)) \mid X]\big)$, and the conditional $\phi$-risk is
$$E[\phi(Y f(X)) \mid X = x] = \eta(x)\,\phi(f(x)) + (1 - \eta(x))\,\phi(-f(x)).$$

Definitions

Conditional $\phi$-risk:
$$E[\phi(Y f(X)) \mid X = x] = \eta(x)\,\phi(f(x)) + (1 - \eta(x))\,\phi(-f(x)).$$

Optimal conditional $\phi$-risk for $\eta \in [0, 1]$:
$$H(\eta) = \inf_{\alpha \in \mathbb{R}} \big(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\big).$$

$$R_\phi^* = E\,H(\eta(X)).$$
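A worked example (standard calculations, not from the slide): for the hinge loss $\phi(\alpha) = \max(0, 1 - \alpha)$, the infimum is attained at $\alpha = \operatorname{sign}(2\eta - 1)$ when $\eta \ne 1/2$, giving
$$H(\eta) = 2\min(\eta, 1 - \eta);$$
for the exponential loss $\phi(\alpha) = e^{-\alpha}$, setting the derivative $-\eta e^{-\alpha} + (1 - \eta)e^{\alpha}$ to zero gives $\alpha = \tfrac{1}{2}\log\tfrac{\eta}{1 - \eta}$ and
$$H(\eta) = 2\sqrt{\eta(1 - \eta)}.$$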

Optimal Conditional $\phi$-risk: Example

[Figure, left panel: $\phi(\alpha)$, $\phi(-\alpha)$, and the conditional $\phi$-risks $C_{0.3}(\alpha)$ and $C_{0.7}(\alpha)$ plotted against $\alpha$. Right panel: $\alpha^*(\eta)$, $H(\eta)$, and $\psi(\theta)$.]

Definitions

Optimal conditional $\phi$-risk for $\eta \in [0, 1]$:
$$H(\eta) = \inf_{\alpha \in \mathbb{R}} \big(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\big).$$

Optimal conditional $\phi$-risk with incorrect sign:
$$H^-(\eta) = \inf_{\alpha:\ \alpha(2\eta - 1) \le 0} \big(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\big).$$

Note: $H^-(\eta) \ge H(\eta)$ and $H^-(1/2) = H(1/2)$.

Definitions

$$H(\eta) = \inf_{\alpha \in \mathbb{R}} \big(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\big),$$
$$H^-(\eta) = \inf_{\alpha:\ \alpha(2\eta - 1) \le 0} \big(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\big).$$

Definition: for $\eta \ne 1/2$, $\phi$ is classification-calibrated if $H^-(\eta) > H(\eta)$,

i.e., pointwise optimization of the conditional $\phi$-risk leads to the correct sign. (cf. Lin (2001))
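For instance (a standard check, not on the slide): for the hinge loss, the constraint $\alpha(2\eta - 1) \le 0$ forces the constrained minimum to $\alpha = 0$, so $H^-(\eta) = 1$, while $H(\eta) = 2\min(\eta, 1 - \eta) < 1$ for every $\eta \ne 1/2$; hence the hinge loss is classification-calibrated.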

Definitions

Definition: Given $\phi$, define $\psi : [0, 1] \to [0, \infty)$ by $\psi = \tilde\psi^{**}$, where
$$\tilde\psi(\theta) = H^-\!\left(\frac{1 + \theta}{2}\right) - H\!\left(\frac{1 + \theta}{2}\right).$$

Here, $g^{**}$ is the Fenchel-Legendre biconjugate of $g$:
$$\operatorname{epi}(g^{**}) = \overline{\operatorname{co}}(\operatorname{epi}(g)), \qquad \operatorname{epi}(g) = \{(x, t) : x \in [0, 1],\ g(x) \le t\}.$$

$\psi$-transform: Example

$\psi$ is the best convex lower bound on
$$\tilde\psi(\theta) = H^-\!\left(\frac{1 + \theta}{2}\right) - H\!\left(\frac{1 + \theta}{2}\right),$$
the excess conditional $\phi$-risk when the sign is incorrect.

$\psi = \tilde\psi^{**}$ is the biconjugate of $\tilde\psi$:
$$\operatorname{epi}(\psi) = \overline{\operatorname{co}}(\operatorname{epi}(\tilde\psi)), \qquad \operatorname{epi}(\tilde\psi) = \{(\theta, t) : \theta \in [0, 1],\ \tilde\psi(\theta) \le t\}.$$

$\psi$ is the functional convex hull of $\tilde\psi$.

[Figure: $\alpha^*(\eta)$, $H(\eta)$, and $\psi(\theta)$.]

The Relationship between Excess Risk and Excess $\phi$-risk

Theorem:
1. For any $P$ and $f$, $\psi\big(R(f) - R^*\big) \le R_\phi(f) - R_\phi^*$.
2. This bound cannot be improved.
3. Near-minimal $\phi$-risk implies near-minimal risk precisely when $\phi$ is classification-calibrated.
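To see what the theorem gives for particular surrogates (these $\psi$-transforms are standard computations, not stated on this slide): for the hinge loss $\psi(\theta) = \theta$, for the exponential loss $\psi(\theta) = 1 - \sqrt{1 - \theta^2} \ge \theta^2/2$, and for the quadratic loss $\psi(\theta) = \theta^2$, so part 1 yields
$$R(f) - R^* \le R_\phi(f) - R_\phi^* \ \ \text{(hinge)}, \qquad R(f) - R^* \le \sqrt{2\big(R_\phi(f) - R_\phi^*\big)} \ \ \text{(exponential)}, \qquad R(f) - R^* \le \sqrt{R_\phi(f) - R_\phi^*} \ \ \text{(quadratic)}.$$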

The Relationship between Excess Risk and Excess $\phi$-risk

Theorem:
1. For any $P$ and $f$, $\psi\big(R(f) - R^*\big) \le R_\phi(f) - R_\phi^*$.
2. This bound cannot be improved: for $|\mathcal{X}| \ge 2$, $\epsilon > 0$ and $\theta \in [0, 1]$, there is a $P$ and an $f$ with
$$R(f) - R^* = \theta, \qquad \psi(\theta) \le R_\phi(f) - R_\phi^* \le \psi(\theta) + \epsilon.$$
3. Near-minimal $\phi$-risk implies near-minimal risk precisely when $\phi$ is classification-calibrated.

The Relationship between Excess Risk and Excess $\phi$-risk

Theorem:
1. For any $P$ and $f$, $\psi\big(R(f) - R^*\big) \le R_\phi(f) - R_\phi^*$.
2. This bound cannot be improved.
3. The following conditions are equivalent:
(a) $\phi$ is classification-calibrated.
(b) $\psi(\theta_i) \to 0$ iff $\theta_i \to 0$.
(c) $R_\phi(f_i) \to R_\phi^*$ implies $R(f_i) \to R^*$.

The proof involves Jensen's inequality.

Classification-calibrated $\phi$

Theorem: If $\phi$ is convex, then $\phi$ is classification-calibrated iff $\phi$ is differentiable at $0$ and $\phi'(0) < 0$.

Theorem: If $\phi$ is classification-calibrated, then there is a $\gamma > 0$ such that for all $\alpha \in \mathbb{R}$, $\gamma\,\phi(\alpha) \ge \mathbf{1}[\alpha \le 0]$.
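A small numerical sketch (my own illustration, assuming the definitions above): it approximates $H(\eta)$ and $H^-(\eta)$ on a grid of $\alpha$ values and checks $H^-(\eta) > H(\eta)$ for a few convex losses and a few values of $\eta \ne 1/2$.

```python
import numpy as np

def H_and_Hminus(phi, eta, alphas=np.linspace(-10, 10, 20001)):
    """Grid approximations of H(eta) and H^-(eta) for a margin loss phi."""
    vals = eta * phi(alphas) + (1 - eta) * phi(-alphas)
    H = vals.min()
    wrong_sign = alphas * (2 * eta - 1) <= 0          # constraint defining H^-
    return H, vals[wrong_sign].min()

losses = {
    "hinge":       lambda a: np.maximum(0.0, 1.0 - a),
    "exponential": lambda a: np.exp(-a),
    "quadratic":   lambda a: (1.0 - a) ** 2,
}

for name, phi in losses.items():
    for eta in (0.3, 0.7, 0.9):
        H, Hm = H_and_Hminus(phi, eta)
        print(f"{name:12s} eta={eta:.1f}  H={H:.3f}  H^-={Hm:.3f}  calibrated here: {Hm > H}")
```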

Overview
- Relating excess risk to excess $\phi$-risk.
- The approximation/estimation decomposition and universal consistency.
- Kernel classifiers: sparseness versus probability estimation.

The Approximation/Estimation Decomposition

The algorithm chooses
$$f_n = \arg\min_{f \in \mathcal{F}_n} \big(\hat{R}_\phi(f) + \lambda_n \Omega(f)\big).$$

We can decompose the excess risk estimate as
$$\psi\big(R(f_n) - R^*\big) \le R_\phi(f_n) - R_\phi^* = \underbrace{R_\phi(f_n) - \inf_{f \in \mathcal{F}_n} R_\phi(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}_n} R_\phi(f) - R_\phi^*}_{\text{approximation error}}.$$

The Approximation/Estimation Decomposition

$$\psi\big(R(f_n) - R^*\big) \le R_\phi(f_n) - R_\phi^* = \underbrace{R_\phi(f_n) - \inf_{f \in \mathcal{F}_n} R_\phi(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}_n} R_\phi(f) - R_\phi^*}_{\text{approximation error}}.$$

- Approximation and estimation errors are in terms of $R_\phi$, not $R$: like a regression problem.
- With a rich class and appropriate regularization, $R_\phi(f_n) \to R_\phi^*$ (e.g., $\mathcal{F}_n$ gets large slowly, or $\lambda_n \to 0$ slowly).
- Universal consistency ($R(f_n) \to R^*$) iff $\phi$ is classification-calibrated.

Overview
- Relating excess risk to excess $\phi$-risk.
- The approximation/estimation decomposition and universal consistency.
- Kernel classifiers: sparseness versus probability estimation.

Estimating Conditional Probabilities

Does a large margin classifier $f_n$ allow estimates of the conditional probability $\eta(x) = \Pr(Y = 1 \mid X = x)$, say, asymptotically?
- Confidence-rated predictions are of interest for many decision problems.
- Probabilities are useful for combining decisions.

Estimating Conditional Probabilities

If $\phi$ is convex, we can write
$$H(\eta) = \inf_{\alpha \in \mathbb{R}} \big(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\big) = \eta\,\phi(\alpha^*(\eta)) + (1 - \eta)\,\phi(-\alpha^*(\eta)),$$
where
$$\alpha^*(\eta) = \arg\min_{\alpha} \big(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\big) \in \mathbb{R} \cup \{\pm\infty\}.$$

Recall:
$$R_\phi^* = E\,H(\eta(X)) = E\,\phi\big(Y\,\alpha^*(\eta(X))\big), \qquad \eta(x) = \Pr(Y = 1 \mid X = x).$$
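For example (standard closed forms, not on the slide): for the exponential and logistic losses, $\alpha^*(\eta) = \tfrac{1}{2}\log\tfrac{\eta}{1 - \eta}$; for the truncated quadratic (2-norm SVM) and least-squares losses, $\alpha^*(\eta) = 2\eta - 1$; and for the hinge loss, $\alpha^*(\eta) = \operatorname{sign}(2\eta - 1)$.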

Estimating Conditional Probabilities

$$\alpha^*(\eta) = \arg\min_{\alpha} \big(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\big) \in \mathbb{R} \cup \{\pm\infty\}.$$

Examples of $\alpha^*(\eta)$ versus $\eta \in [0, 1]$:

[Figure: $\alpha^*(\eta)$ for the L2-SVM, $\phi(\alpha) = ((1 - \alpha)_+)^2$, and the L1-SVM, $\phi(\alpha) = (1 - \alpha)_+$.]

Estimating Conditional Probabilities

[Figure: $\alpha^*(\eta)$ versus $\eta$ for the L2-SVM and L1-SVM.]

If $\alpha^*(\eta)$ is not invertible, that is, there are $\eta_1 \ne \eta_2$ with $\alpha^*(\eta_1) = \alpha^*(\eta_2)$, then there are distributions $P$ and functions $f_n$ with $R_\phi(f_n) \to R_\phi^*$ but such that $f_n(x)$ cannot be used to estimate $\eta(x)$.

e.g., $f_n(x) \to \alpha^*(\eta_1) = \alpha^*(\eta_2)$: is $\eta(x) = \eta_1$ or $\eta(x) = \eta_2$?
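As a sketch of what invertibility buys (my own illustration, assuming the closed form $\alpha^*(\eta) = 2\eta - 1$ for the truncated quadratic loss): an L2-SVM score can be mapped back to a probability estimate, while for the hinge loss $\alpha^*(\eta) = \operatorname{sign}(2\eta - 1)$ has no inverse.

```python
import numpy as np

def eta_hat_from_l2svm(scores):
    """Invert alpha*(eta) = 2*eta - 1 (truncated quadratic / L2-SVM loss):
    eta = (1 + alpha) / 2, clipped to [0, 1]."""
    return np.clip((1.0 + np.asarray(scores)) / 2.0, 0.0, 1.0)

# Hypothetical scores from a phi-risk-consistent L2-SVM:
scores = np.array([-0.8, -0.2, 0.0, 0.4, 0.9])
print(eta_hat_from_l2svm(scores))   # approx [0.1, 0.4, 0.5, 0.7, 0.95]

# For the hinge (L1-SVM), alpha*(eta) = sign(2*eta - 1): a score near +1 only
# reveals eta > 1/2, so no analogous eta_hat exists.
```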

Kernel classifiers and sparseness

Kernel classification methods:
$$f_n = \arg\min_{f \in H} \Big(\hat{E}\,\phi(Y f(X)) + \lambda_n \|f\|^2\Big),$$
where $H$ is a reproducing kernel Hilbert space (RKHS) with norm $\|\cdot\|$, and $\lambda_n > 0$ is a regularization parameter.

Representer theorem: the solution of the optimization problem can be represented as
$$f_n(x) = \sum_{i=1}^{n} \alpha_i\,k(x, x_i).$$

Data points $x_i$ with $\alpha_i \ne 0$ are called support vectors (SVs). Sparseness (number of support vectors $\ll n$) means faster evaluation of the classifier.
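Here is a minimal sketch of this recipe (my own illustration, not the slides' algorithm): a Gaussian-kernel machine trained by gradient descent on the empirical truncated-quadratic $\phi$-risk plus the RKHS penalty, parameterized through the representer theorem.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def fit_kernel_machine(X, y, lam=0.1, lr=0.01, steps=2000):
    """Minimize (1/n) sum_i phi(y_i f(x_i)) + lam * ||f||_H^2 over
    f(x) = sum_j alpha_j k(x, x_j), with phi the truncated quadratic loss."""
    n = len(y)
    K = gaussian_kernel(X, X)
    alpha = np.zeros(n)
    for _ in range(steps):
        f = K @ alpha                                  # f(x_i) for all i
        margins = y * f
        # derivative of phi(y f) w.r.t. f: -2 y max(0, 1 - y f)
        grad_f = -2.0 * y * np.maximum(0.0, 1.0 - margins)
        # chain rule through f = K alpha, plus gradient of lam * alpha^T K alpha
        grad_alpha = K @ grad_f / n + 2.0 * lam * (K @ alpha)
        alpha -= lr * grad_alpha
    return alpha, K

# Hypothetical usage:
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1.0, -1.0)
alpha, K = fit_kernel_machine(X, y)
print("training error:", np.mean(np.sign(K @ alpha) != y))
```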

Sparseness: Steinwart's results

For the L1-SVM and L2-SVM, Steinwart proved that the asymptotic fraction of SVs is $E\,G(\eta(X))$ (under some technical assumptions). The function $G(\eta)$ depends on the loss function used:

[Figure: $G(\eta)$ versus $\eta$ for the L2-SVM and L1-SVM.]

The L2-SVM doesn't produce sparse solutions (asymptotically), while the L1-SVM does.

Recall: the L2-SVM can estimate $\eta$ while the L1-SVM cannot.

Sparseness versus Estimating Conditional Probabilities

The ability to estimate conditional probabilities always causes a loss of sparseness:
- A lower bound on the asymptotic fraction of data that become SVs can be written as $E\,G(\eta(X))$.
- $G(\eta)$ is 1 throughout the region where probabilities can be estimated.
- The region where $G(\eta) = 1$ is an interval centered at $1/2$.

Asymptotically Sharp Result

For loss functions of the form $\phi(t) = h((t_0 - t)_+)$, where $h$ is convex, differentiable and $h'(0) > 0$: if the kernel $k$ is analytic and universal (and the underlying $P_X$ is continuous and non-trivial), then for a regularization sequence $\lambda_n \to 0$ sufficiently slowly,
$$\frac{|\{i : \alpha_i \ne 0\}|}{n} \;\xrightarrow{P}\; E\,G(\eta(X)),$$
where
$$G(\eta) = \begin{cases} \eta/\gamma & 0 \le \eta \le \gamma, \\ 1 & \gamma < \eta < 1 - \gamma, \\ (1 - \eta)/\gamma & 1 - \gamma \le \eta \le 1. \end{cases}$$
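A quick sketch of what this limit looks like numerically (my own illustration; the distribution of $\eta(X)$ below is an arbitrary assumption, not from the slides):

```python
import numpy as np

def G(eta, gamma):
    """Piecewise-linear asymptotic SV-fraction profile from the slide."""
    eta = np.asarray(eta, dtype=float)
    return np.where(eta <= gamma, eta / gamma,
           np.where(eta < 1 - gamma, 1.0, (1 - eta) / gamma))

# Monte Carlo estimate of E G(eta(X)) under an assumed eta(X) ~ Uniform(0, 1),
# for which the exact value works out to 1 - gamma.
rng = np.random.default_rng(0)
etas = rng.uniform(0.0, 1.0, size=1_000_000)
for gamma in (0.1, 0.3, 0.5):
    print(f"gamma={gamma:.1f}  E G(eta(X)) ~= {G(etas, gamma).mean():.3f}  (exact: {1 - gamma:.3f})")
```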

Example

$$\phi(t) = \tfrac{1}{3}\big((1 - t)_+\big)^2 + \tfrac{2}{3}(1 - t)_+$$

[Figure, left: $\alpha^*(\eta)$ versus $\eta$. Right: $G(\eta)$ versus $\eta$.]

Overview
- Relating excess risk to excess $\phi$-risk.
- The approximation/estimation decomposition and universal consistency.
- Kernel classifiers:
  - No sparseness where $\alpha^*(\eta)$ is invertible.
  - Can design $\phi$ to trade off sparseness and probability estimation.

Slides at http://www.stat.berkeley.edu/~bartlett/talks