AdaBoost and other Large Margin Classifiers: Convexity in Classification


AdaBoost and other Large Margin Classifiers: Convexity in Classification

Peter Bartlett
Division of Computer Science and Department of Statistics, UC Berkeley
Joint work with Mikhail Traskin.

Slides at http://www.cs.berkeley.edu/~bartlett

The Pattern Classification Problem

i.i.d. (X, Y), (X_1, Y_1), ..., (X_n, Y_n) from X × {±1}.

Use the data (X_1, Y_1), ..., (X_n, Y_n) to choose f_n : X → R with small risk,

    R(f_n) = Pr(sign(f_n(X)) ≠ Y) = E l(Y, f_n(X)).

Natural approach: minimize the empirical risk,

    R̂(f) = Ê l(Y, f(X)) = (1/n) Σ_{i=1}^n l(Y_i, f(X_i)).

Often intractable... Replace the 0-1 loss, l, with a convex surrogate, φ.

Large Margin Algorithms

Consider the margins, Y f(X).
Define a margin cost function φ : R → R_+.
Define the φ-risk of f : X → R as R_φ(f) = E φ(Y f(X)).
Choose f ∈ F to minimize the φ-risk (e.g., use the data (X_1, Y_1), ..., (X_n, Y_n) to minimize the empirical φ-risk,

    R̂_φ(f) = Ê φ(Y f(X)) = (1/n) Σ_{i=1}^n φ(Y_i f(X_i)),

or a regularized version).

Large Margin Algorithms

AdaBoost: F = span(G) for a VC-class G, with φ(α) = exp(−α).
Minimizes R̂_φ(f) using greedy basis selection and line search:

    f_{t+1} = f_t + α_{t+1} g_{t+1},
    R̂_φ(f_t + α_{t+1} g_{t+1}) = min_{α ∈ R, g ∈ G} R̂_φ(f_t + α g).

Effective in applications: real-time face detection, spoken dialogue systems, ...

Large Margin Algorithms

Many other variants:

Support vector machines with 1-norm soft margin.
    F = ball in a reproducing kernel Hilbert space, H.
    φ(α) = max(0, 1 − α).
    The algorithm minimizes R̂_φ(f) + λ ‖f‖²_H.

Neural net classifiers:
    φ(α) = max(0, 0.8 − α)².

L2Boost, LS-SVMs:
    φ(α) = (1 − α)².

Logistic regression:
    φ(α) = log(1 + exp(−2α)).
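
To make these concrete, here is a minimal sketch (illustrative, not from the slides) of the margin cost functions as Python functions of the margin α = y f(x), together with the empirical φ-risk; the 0.8 margin in the neural-net (truncated quadratic) loss is taken from the slide above.

    import numpy as np

    # Margin cost functions phi(alpha), with alpha = y * f(x). Vectorized over numpy arrays.

    def exponential(alpha):          # AdaBoost
        return np.exp(-alpha)

    def hinge(alpha):                # SVM with 1-norm soft margin
        return np.maximum(0.0, 1.0 - alpha)

    def truncated_quadratic(alpha):  # "neural net" loss from the slide
        return np.maximum(0.0, 0.8 - alpha) ** 2

    def squared(alpha):              # L2Boost, LS-SVMs
        return (1.0 - alpha) ** 2

    def logistic(alpha):             # logistic regression (phi(0) = log 2)
        return np.log1p(np.exp(-2.0 * alpha))

    def empirical_phi_risk(scores, y, phi):
        # scores = f(x_i), y in {-1, +1}; returns (1/n) * sum_i phi(y_i * f(x_i)).
        return np.mean(phi(y * scores))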

Large Margin Algorithms

[Figure: the four margin cost functions φ(α) plotted against α (exponential, hinge, logistic, truncated quadratic).]

Statistical Consequences of Using a Convex Cost

Is AdaBoost universally consistent? Other φ?

(Lugosi and Vayatis, 2004), (Mannor, Meir and Zhang, 2002): regularized boosting.
(Jiang, 2004): process consistency of AdaBoost, for certain probability distributions.
(Zhang, 2004), (Steinwart, 2003): SVM.

Statistical Consequences of Using a Convex Cost

How is risk related to φ-risk?

(Lugosi and Vayatis, 2004), (Steinwart, 2003): asymptotic.
(Zhang, 2004): comparison theorem.

Overview

Relating excess risk to excess φ-risk:
    ψ-transform: best possible bound.
    conditions on φ.
    (with Mike Jordan and Jon McAuliffe)
Universal consistency of AdaBoost.

Definitions and Facts

    R(f) = Pr(sign(f(X)) ≠ Y)        risk
    R_φ(f) = E φ(Y f(X))             φ-risk
    η(x) = Pr(Y = 1 | X = x)         conditional probability
    R* = inf_f R(f)
    R*_φ = inf_f R_φ(f)

η defines an optimal classifier: R* = R(sign(η(·) − 1/2)).

Definitions and Facts

    R(f) = Pr(sign(f(X)) ≠ Y)        risk
    R_φ(f) = E φ(Y f(X))             φ-risk
    η(x) = Pr(Y = 1 | X = x)         conditional probability
    R* = inf_f R(f)
    R*_φ = inf_f R_φ(f)

η defines an optimal classifier: R* = R(sign(η(·) − 1/2)).

Notice: R_φ(f) = E(E[φ(Y f(X)) | X]), and the conditional φ-risk is

    E[φ(Y f(X)) | X = x] = η(x) φ(f(x)) + (1 − η(x)) φ(−f(x)).

Definitions

Conditional φ-risk:

    E[φ(Y f(X)) | X = x] = η(x) φ(f(x)) + (1 − η(x)) φ(−f(x)).

Optimal conditional φ-risk for η ∈ [0, 1]:

    H(η) = inf_{α ∈ R} (η φ(α) + (1 − η) φ(−α)).

    R*_φ = E H(η(X)).
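
A quick numerical sanity check (not part of the slides): H(η) can be approximated by minimizing the conditional φ-risk over a grid of α. For the exponential loss, the minimizer is α*(η) = (1/2) log(η/(1 − η)) and H(η) = 2√(η(1 − η)), which the sketch below reproduces.

    import numpy as np

    phi_exp = lambda a: np.exp(-a)   # exponential (AdaBoost) loss

    def conditional_phi_risk(eta, alpha, phi):
        # C_eta(alpha) = eta * phi(alpha) + (1 - eta) * phi(-alpha)
        return eta * phi(alpha) + (1.0 - eta) * phi(-alpha)

    def H(eta, phi, grid=np.linspace(-10.0, 10.0, 20001)):
        # Numerical approximation of H(eta) = inf_alpha C_eta(alpha).
        return conditional_phi_risk(eta, grid, phi).min()

    for eta in (0.3, 0.5, 0.7, 0.9):
        print(eta, H(eta, phi_exp), 2 * np.sqrt(eta * (1 - eta)))
    # The last two columns agree: for the exponential loss, H(eta) = 2*sqrt(eta*(1-eta)).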

Optimal Conditional φ-risk: Example

[Figure: left panel plots φ(α), φ(−α), and the conditional φ-risks C_0.3(α) and C_0.7(α) against α; right panel plots α*(η), H(η), and ψ(θ).]

Definitions

Optimal conditional φ-risk for η ∈ [0, 1]:

    H(η) = inf_{α ∈ R} (η φ(α) + (1 − η) φ(−α)).

Optimal conditional φ-risk with incorrect sign:

    H⁻(η) = inf_{α : α(2η−1) ≤ 0} (η φ(α) + (1 − η) φ(−α)).

Note: H⁻(η) ≥ H(η), and H⁻(1/2) = H(1/2).

Definitions

    H(η) = inf_{α ∈ R} (η φ(α) + (1 − η) φ(−α)),
    H⁻(η) = inf_{α : α(2η−1) ≤ 0} (η φ(α) + (1 − η) φ(−α)).

Definition: For η ≠ 1/2, φ is classification-calibrated if H⁻(η) > H(η), i.e., pointwise optimization of the conditional φ-risk leads to the correct sign. (c.f. Lin (2001))

The ψ-transform

Definition: Given convex φ, define ψ : [0, 1] → [0, ∞) by

    ψ(θ) = φ(0) − H((1 + θ)/2).

(The definition is a little more involved for non-convex φ.)

[Figure: α*(η), H(η), and ψ(θ).]
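
For convex φ, the ψ-transform is easy to compute numerically from this definition. A small sketch (illustrative only), checked against the closed form ψ(θ) = 1 − √(1 − θ²) that follows from H(η) = 2√(η(1 − η)) for the exponential loss:

    import numpy as np

    def psi(theta, phi, grid=np.linspace(-10.0, 10.0, 20001)):
        # psi(theta) = phi(0) - H((1 + theta) / 2), for convex phi.
        eta = (1.0 + theta) / 2.0
        H_eta = (eta * phi(grid) + (1.0 - eta) * phi(-grid)).min()
        return phi(0.0) - H_eta

    phi_exp = lambda a: np.exp(-a)
    for theta in (0.0, 0.25, 0.5, 0.9):
        print(theta, psi(theta, phi_exp), 1.0 - np.sqrt(1.0 - theta ** 2))
    # For the exponential loss, psi(theta) = 1 - sqrt(1 - theta^2).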

The Relationship between Excess Risk and Excess φ-risk

Theorem:
1. For any P and f,  ψ(R(f) − R*) ≤ R_φ(f) − R*_φ.
2. For |X| ≥ 2, ε > 0 and θ ∈ [0, 1], there is a P and an f with
       R(f) − R* = θ   and   ψ(θ) ≤ R_φ(f) − R*_φ ≤ ψ(θ) + ε.
3. The following conditions are equivalent:
   (a) φ is classification-calibrated.
   (b) ψ(θ_i) → 0 iff θ_i → 0.
   (c) R_φ(f_i) → R*_φ implies R(f_i) → R*.

Classification-calibrated φ

If φ is classification-calibrated, then ψ(θ_i) → 0 iff θ_i → 0. Since the function ψ is always convex, in that case it is strictly increasing and so has an inverse. Thus, we can write

    R(f) − R* ≤ ψ⁻¹(R_φ(f) − R*_φ).
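
As a worked example (not on the slide): for the exponential loss, H(η) = 2√(η(1 − η)), so

    ψ(θ) = 1 − √(1 − θ²) ≥ θ²/2,

and hence the comparison theorem gives the explicit bound

    R(f) − R* ≤ ψ⁻¹(R_φ(f) − R*_φ) ≤ √(2 (R_φ(f) − R*_φ)).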

Excess Risk Bounds: Proof Idea

Facts:
    H(η) and H⁻(η) are symmetric about η = 1/2.
    H(1/2) = H⁻(1/2), hence ψ(0) = 0.
    ψ(θ) is convex.
    ψ(θ) = H⁻((1 + θ)/2) − H((1 + θ)/2).

Excess Risk Bounds: Proof Idea

The excess risk of f : X → R is

    R(f) − R* = E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] |2η(X) − 1|).

Thus,

    ψ(R(f) − R*)
        ≤ E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] ψ(|2η(X) − 1|))          (ψ convex, ψ(0) = 0)
        = E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] (H⁻(η(X)) − H(η(X))))    (definition of ψ)
        ≤ E(φ(Y f(X)) − H(η(X)))                                      (H minimizes the conditional φ-risk)
        = R_φ(f) − R*_φ.                                              (definition of R*_φ)

Classification-calibrated φ

Theorem: If φ is convex, then

    φ is classification-calibrated  ⟺  φ is differentiable at 0 and φ′(0) < 0.

Theorem: If φ is classification-calibrated, then there is a γ > 0 such that for all α ∈ R,

    γ φ(α) ≥ 1[α ≤ 0].
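
A small numerical illustration of the first theorem (a self-contained sketch, using the same losses as the earlier snippet): each of the convex losses on the slides is differentiable at 0 with φ′(0) < 0, and so is classification-calibrated.

    import numpy as np

    losses = {
        "exponential":         lambda a: np.exp(-a),
        "hinge":               lambda a: np.maximum(0.0, 1.0 - a),
        "truncated quadratic": lambda a: np.maximum(0.0, 0.8 - a) ** 2,
        "squared":             lambda a: (1.0 - a) ** 2,
        "logistic":            lambda a: np.log1p(np.exp(-2.0 * a)),
    }

    h = 1e-6
    for name, phi in losses.items():
        deriv_at_0 = (phi(h) - phi(-h)) / (2 * h)   # central-difference estimate of phi'(0)
        print(f"{name:20s}  phi'(0) ~ {deriv_at_0:+.2f}  calibrated: {deriv_at_0 < 0}")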

Overview

Relating excess risk to excess φ-risk.
Universal consistency of AdaBoost:
    The approximation/estimation decomposition.
    AdaBoost: previous results.
    Universal consistency.
    (with Mikhail Traskin)

Universal Consistency

Assume: i.i.d. data (X, Y), (X_1, Y_1), ..., (X_n, Y_n) from X × Y (with Y = {±1}).
Consider a method f_n = A((X_1, Y_1), ..., (X_n, Y_n)), e.g.,

    f_n = AdaBoost((X_1, Y_1), ..., (X_n, Y_n), t_n).

Definition: We say that the method is universally consistent if, for all distributions P,

    R(f_n) → R*  a.s.,

where R is the risk and R* is the Bayes risk:

    R(f) = Pr(Y ≠ sign(f(X))),   R* = inf_f R(f).

The Approximation/Estimation Decomposition

Consider an algorithm that chooses

    f_n = arg min_{f ∈ F_n} R̂_φ(f),

or a regularized version,

    f_n = arg min_{f ∈ F} (R̂_φ(f) + λ_n Ω(f)).

We can decompose the excess risk estimate as

    ψ(R(f_n) − R*) ≤ R_φ(f_n) − R*_φ
                   = [R_φ(f_n) − inf_{f ∈ F_n} R_φ(f)]   (estimation error)
                     + [inf_{f ∈ F_n} R_φ(f) − R*_φ].    (approximation error)

The Approximation/Estimation Decomposition

    ψ(R(f_n) − R*) ≤ R_φ(f_n) − R*_φ
                   = [R_φ(f_n) − inf_{f ∈ F_n} R_φ(f)]   (estimation error)
                     + [inf_{f ∈ F_n} R_φ(f) − R*_φ].    (approximation error)

Approximation and estimation errors are in terms of R_φ, not R. Like a regression problem.
With a rich class and appropriate regularization, R_φ(f_n) → R*_φ (e.g., F_n gets large slowly, or λ_n → 0 slowly).
Then universal consistency (R(f_n) → R*) holds iff φ is classification-calibrated.

Overview

Relating excess risk to excess φ-risk.
Universal consistency of AdaBoost:
    The approximation/estimation decomposition.
    AdaBoost: previous results.
    Universal consistency.

AdaBoost

Sample, S_n = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × {±1})^n.
Number of iterations, T.

    function AdaBoost(S_n, T)
        f_0 := 0
        for t from 1, ..., T
            (α_t, h_t) := arg min_{α ∈ R, h ∈ F} (1/n) Σ_{i=1}^n exp(−y_i (f_{t−1}(x_i) + α h(x_i)))
            f_t := f_{t−1} + α_t h_t
        return f_T
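
A minimal runnable sketch of this procedure (illustrative, not the authors' code), taking the base class F to be decision stumps so the inner arg min can be done by exhaustive search over stumps together with the closed-form line search for the exponential loss:

    import numpy as np

    def ada_boost(X, y, T):
        """Greedy coordinate descent on the empirical exponential risk.
        X is (n, d), y is in {-1, +1}; F = decision stumps."""
        n, d = X.shape
        w = np.full(n, 1.0 / n)        # weights w_i proportional to exp(-y_i f_{t-1}(x_i))
        model = []                     # list of (feature, threshold, sign, alpha)
        for _ in range(T):
            best = None
            for j in range(d):         # exhaustive search over stumps
                for thr in np.unique(X[:, j]):
                    for s in (+1, -1):
                        pred = s * np.where(X[:, j] > thr, 1, -1)
                        err = w[pred != y].sum() / w.sum()   # weighted 0-1 error
                        if best is None or err < best[0]:
                            best = (err, j, thr, s)
            err, j, thr, s = best
            err = np.clip(err, 1e-12, 1 - 1e-12)
            alpha = 0.5 * np.log((1 - err) / err)            # exact line search for exp loss
            pred = s * np.where(X[:, j] > thr, 1, -1)
            w = w * np.exp(-alpha * y * pred)                # update exponential weights
            model.append((j, thr, s, alpha))

        def f(X_new):                  # the returned real-valued combination f_T
            out = np.zeros(len(X_new))
            for j, thr, s, alpha in model:
                out += alpha * s * np.where(X_new[:, j] > thr, 1, -1)
            return out                 # classify with sign(f(x))
        return f

The stump search is quadratic in n per round, which is fine for illustration; the point is only that each round performs the greedy step f_t = f_{t−1} + α_t h_t above.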

Previous results: Regularized versions

AdaBoost greedily minimizes

    R̂_φ(f) = (1/n) Σ_{i=1}^n exp(−Y_i f(X_i))

over f ∈ span(F). (Notice that, for many interesting basis classes F, the infimum is zero.)

Instead of AdaBoost, consider a regularized version of its criterion.

Previous results: Regularized versions

1. Minimize R̂_φ(f) over f ∈ γ_n co(F), the scaled convex hull of F.
2. Minimize R̂_φ(f) + λ_n ‖f‖_1 over f ∈ span(F), where ‖f‖_1 = inf{γ : f ∈ γ co(F)}.

For suitable choices of the parameters (γ_n and λ_n), these algorithms are universally consistent.
(Lugosi and Vayatis, 2004), (Zhang, 2004)

Previous results: Bounded step size

    function AdaBoostWithBoundedStepSize(S_n, T)
        f_0 := 0
        for t from 1, ..., T
            (α_t, h_t) := arg min_{α ∈ R, h ∈ F} (1/n) Σ_{i=1}^n exp(−y_i (f_{t−1}(x_i) + α h(x_i)))
            f_t := f_{t−1} + min{α_t, ε} h_t
        return f_T

For suitable choices of the parameters (T = T_n and ε = ε_n), this algorithm is universally consistent.
(Zhang and Yu, 2005), (Bickel, Ritov, Zakai, 2006)

Previous results about AdaBoost

AdaBoost greedily minimizes

    R̂_φ(f) = (1/n) Σ_{i=1}^n exp(−Y_i f(X_i))

over f ∈ span(F).

Consider AdaBoost with early stopping: f_n is the function returned by AdaBoost after t_n steps.
How should we choose t_n?
Note: The infimum is often zero. Don't want t_n too large.

Previous result about AdaBoost: Process consistency

Theorem: [Jiang, 2004] For a (suitable) basis class defined on R^d, and for all probability distributions P satisfying certain smoothness assumptions, there is a sequence t_n such that f_n = AdaBoost(S_n, t_n) satisfies R(f_n) → R* a.s.

Conditions on the distribution P are unnatural and cannot be checked.
How should the stopping time t_n grow with sample size n? Does it need to depend on the distribution P?
Rates?

Overview

Relating excess risk to excess φ-risk.
Universal consistency of AdaBoost:
    The approximation/estimation decomposition.
    AdaBoost: previous results.
    Universal consistency.

The key theorem

Assume d_VC(F) < ∞. (Otherwise AdaBoost must stop and fail after one step.)
Assume

    lim_{λ→∞} inf {R_φ(f) : f ∈ λ co(F)} = R*_φ,

where

    R_φ(f) = E exp(−Y f(X)),   R*_φ = inf_f R_φ(f).

That is, the approximation error is zero. For example, F is the class of linear threshold functions, or of binary trees with axis-orthogonal decisions in R^d and at least d + 1 leaves.

The key theorem

Theorem: If

    d_VC(F) < ∞,
    R*_φ = lim_{λ→∞} inf {R_φ(f) : f ∈ λ co(F)},
    t_n → ∞,  t_n = O(n^(1−α)) for some α > 0,

then AdaBoost is universally consistent.

The key theorem: Idea of proof

We show R_φ(f_tn) → R*_φ, which implies R(f_tn) → R*, since the loss function α ↦ exp(−α) is classification-calibrated.

Step 1. Notice that we can clip f_tn: if we define π_λ(f) as x ↦ max{−λ, min{λ, f(x)}}, then

    R_φ(π_λ(f_tn)) → R*_φ  ⟹  R(π_λ(f_tn)) → R*,

and R(π_λ(f_tn)) = R(f_tn), so R(f_tn) → R*.
We will need to let the clipping relax (λ_n → ∞).

[Figure: the clipping function π_1(x).]

The key theorem: Idea of proof

Step 2. Use VC-theory (for clipped combinations of t functions from F) to show that, with high probability,

    R_φ(π_λ(f_t)) ≤ R̂_φ(π_λ(f_t)) + c(λ) √(d_VC(F) t log t / n),

where R̂_φ is the empirical version of R_φ, R̂_φ(f) = E_n exp(−Y f(X)).

The key theorem: Idea of proof

Step 3. The clipping only hurts for small values of the exponential criterion:

    R̂_φ(π_λ(f_t)) ≤ R̂_φ(f_t) + e^(−λ).
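
A quick numerical illustration of Step 3 (a sketch with synthetic scores, not from the paper): clipping a score at ±λ only increases the per-example exponential loss on examples already classified with margin larger than λ, and then by at most e^(−λ), so the empirical φ-risk increases by at most e^(−λ).

    import numpy as np

    rng = np.random.default_rng(0)
    n, lam = 1000, 2.0
    scores = rng.normal(scale=4.0, size=n)        # arbitrary real-valued f(x_i)
    y = rng.choice([-1, 1], size=n)

    clipped = np.clip(scores, -lam, lam)          # pi_lambda(f)
    risk_raw  = np.mean(np.exp(-y * scores))      # empirical exp-risk of f
    risk_clip = np.mean(np.exp(-y * clipped))     # empirical exp-risk of pi_lambda(f)

    print(risk_clip <= risk_raw + np.exp(-lam))   # True: clipping costs at most e^(-lambda)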

The key theorem: Idea of proof

Step 4. Apply the numerical convergence result of (Bickel et al, 2006): for any comparison function f̄ ∈ F_λ = λ co(F),

    R̂_φ(f_t) ≤ R̂_φ(f̄) + ε(λ, t).

Here, we exploit an attractive property of the exponential loss function and the fact that the classifiers are binary-valued: the second derivative of R̂_φ in a basis direction is large whenever R̂_φ is large. This keeps the steps taken by AdaBoost from being too large.

The key theorem: Idea of proof

Step 5. Relate R̂_φ(f̄) to R_φ(f̄).
Choosing λ_n to grow suitably slowly, we can choose f̄_n so that R_φ(f̄_n) → R*_φ (by assumption), and then, for t_n = O(n^(1−α)), we have the result.

The key theorem: Idea of proof

[Figure (built up over several slides): schematic of the proof, relating R*, R̄_n, the iterate f_tn, its clipped version π_λn(f_tn), and the comparison function f̄_n.]

Open Problems

Other loss functions? e.g., LogitBoost uses α ↦ log(1 + exp(−2α)) in place of exp(−α).
(The difficulty is the behaviour of the second derivative of R̂_φ in the direction of a basis function. For the numerical convergence results, we want it large whenever R̂_φ is large.)

Real-valued basis functions? (The same issue arises.)

Rates? The bottleneck is the rate of decrease of R̂_φ(f_t). The numerical convergence result only ensures that it decreases to R̂_φ(f̄) at rate (log t)^(−1/2). This seems pessimistic.

Overview

Relating excess risk to excess φ-risk.
Universal consistency of AdaBoost.

Slides at http://www.cs.berkeley.edu/~bartlett