AdaBoost and other Large Margin Classifiers: Convexity in Classification
Peter Bartlett
Division of Computer Science and Department of Statistics, UC Berkeley
Joint work with Mikhail Traskin.
Slides at http://www.cs.berkeley.edu/~bartlett
The Pattern Classification Problem
i.i.d. $(X, Y), (X_1, Y_1), \ldots, (X_n, Y_n)$ from $\mathcal{X} \times \{\pm 1\}$.
Use data $(X_1, Y_1), \ldots, (X_n, Y_n)$ to choose $f_n : \mathcal{X} \to \mathbb{R}$ with small risk,
$$R(f_n) = \Pr\bigl(\mathrm{sign}(f_n(X)) \neq Y\bigr) = E\,\ell(Y, f(X)).$$
Natural approach: minimize the empirical risk,
$$\hat{R}(f) = \hat{E}\,\ell(Y, f(X)) = \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, f(X_i)).$$
Often intractable... Replace the 0-1 loss $\ell$ with a convex surrogate $\phi$.
Large Margin Algorithms
Consider the margins, $Y f(X)$.
Define a margin cost function $\phi : \mathbb{R} \to \mathbb{R}_+$.
Define the $\phi$-risk of $f : \mathcal{X} \to \mathbb{R}$ as $R_\phi(f) = E\,\phi(Y f(X))$.
Choose $f \in \mathcal{F}$ to minimize the $\phi$-risk. (e.g., use the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ to minimize the empirical $\phi$-risk,
$$\hat{R}_\phi(f) = \hat{E}\,\phi(Y f(X)) = \frac{1}{n} \sum_{i=1}^{n} \phi(Y_i f(X_i)),$$
or a regularized version.)
Large Margin Algorithms
AdaBoost: $\mathcal{F} = \mathrm{span}(\mathcal{G})$ for a VC-class $\mathcal{G}$, $\phi(\alpha) = \exp(-\alpha)$.
Minimizes $\hat{R}_\phi(f)$ using greedy basis selection and line search:
$$f_{t+1} = f_t + \alpha_{t+1} g_{t+1}, \qquad \hat{R}_\phi(f_t + \alpha_{t+1} g_{t+1}) = \min_{\alpha \in \mathbb{R},\, g \in \mathcal{G}} \hat{R}_\phi(f_t + \alpha g).$$
Effective in applications: real-time face detection, spoken dialogue systems, ...
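For $\pm 1$-valued base classifiers the line search has a familiar closed form (a standard AdaBoost fact, not stated on this slide): with normalized weights $w_i^{(t)} \propto \exp(-y_i f_t(x_i))$, $\sum_i w_i^{(t)} = 1$, and weighted error $\epsilon_{t+1} = \sum_i w_i^{(t)}\, 1[g_{t+1}(x_i) \neq y_i]$,
$$\alpha_{t+1} = \frac{1}{2} \log \frac{1 - \epsilon_{t+1}}{\epsilon_{t+1}},$$
so the step size grows as the weighted error of the chosen base classifier shrinks.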
Large Margin Algorithms
Many other variants.
Support vector machines with 1-norm soft margin:
$\mathcal{F}$ = ball in a reproducing kernel Hilbert space $H$, $\phi(\alpha) = \max(0, 1 - \alpha)$.
Algorithm minimizes $\hat{R}_\phi(f) + \lambda \|f\|_H^2$.
Neural net classifiers: $\phi(\alpha) = \bigl(\max(0,\, 0.8 - \alpha)\bigr)^2$.
L2Boost, LS-SVMs: $\phi(\alpha) = (1 - \alpha)^2$.
Logistic regression: $\phi(\alpha) = \log(1 + \exp(-2\alpha))$.
Large Margin Algorithms
[Figure: the surrogate losses plotted as functions of the margin $\alpha$ — exponential, hinge, logistic, and truncated quadratic.]
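A minimal Python sketch of these surrogates, useful for reproducing the plot; the truncated quadratic is written in its common form with margin target 1, which is an assumption rather than a detail taken from the slide:

```python
import numpy as np
import matplotlib.pyplot as plt

# Margin cost functions phi(alpha), where alpha = y * f(x).
losses = {
    "exponential": lambda a: np.exp(-a),
    "hinge": lambda a: np.maximum(0.0, 1.0 - a),
    "logistic": lambda a: np.log1p(np.exp(-2.0 * a)),
    "truncated quadratic": lambda a: np.maximum(0.0, 1.0 - a) ** 2,
}

alpha = np.linspace(-2.0, 2.0, 400)
for name, phi in losses.items():
    plt.plot(alpha, phi(alpha), label=name)
plt.axvline(0.0, color="gray", lw=0.5)
plt.xlabel("margin alpha")
plt.ylabel("phi(alpha)")
plt.legend()
plt.show()
```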
Statistical Consequences of Using a Convex Cost
Is AdaBoost universally consistent? Other $\phi$?
(Lugosi and Vayatis, 2004), (Mannor, Meir and Zhang, 2002): regularized boosting.
(Jiang, 2004): process consistency of AdaBoost, for certain probability distributions.
(Zhang, 2004), (Steinwart, 2003): SVM.
Statistical Consequences of Using a Convex Cost
How is risk related to $\phi$-risk?
(Lugosi and Vayatis, 2004), (Steinwart, 2003): asymptotic.
(Zhang, 2004): comparison theorem.
Overview
Relating excess risk to excess $\phi$-risk:
the $\psi$-transform: best possible bound;
conditions on $\phi$.
(with Mike Jordan and Jon McAuliffe)
Universal consistency of AdaBoost.
Definitions and Facts
$R(f) = \Pr(\mathrm{sign}(f(X)) \neq Y)$ — risk
$R_\phi(f) = E\,\phi(Y f(X))$ — $\phi$-risk
$\eta(x) = \Pr(Y = 1 \mid X = x)$ — conditional probability
$R^* = \inf_f R(f)$, $\quad R_\phi^* = \inf_f R_\phi(f)$
$\eta$ defines an optimal classifier: $R^* = R\bigl(\mathrm{sign}(\eta(x) - 1/2)\bigr)$.
Definitions and Facts
$R(f) = \Pr(\mathrm{sign}(f(X)) \neq Y)$ — risk
$R_\phi(f) = E\,\phi(Y f(X))$ — $\phi$-risk
$\eta(x) = \Pr(Y = 1 \mid X = x)$ — conditional probability
$R^* = \inf_f R(f)$, $\quad R_\phi^* = \inf_f R_\phi(f)$
$\eta$ defines an optimal classifier: $R^* = R\bigl(\mathrm{sign}(\eta(x) - 1/2)\bigr)$.
Notice: $R_\phi(f) = E\bigl(E[\phi(Y f(X)) \mid X]\bigr)$, and the conditional $\phi$-risk is
$$E[\phi(Y f(X)) \mid X = x] = \eta(x)\,\phi(f(x)) + (1 - \eta(x))\,\phi(-f(x)).$$
Definitions
Conditional $\phi$-risk:
$$E[\phi(Y f(X)) \mid X = x] = \eta(x)\,\phi(f(x)) + (1 - \eta(x))\,\phi(-f(x)).$$
Optimal conditional $\phi$-risk for $\eta \in [0, 1]$:
$$H(\eta) = \inf_{\alpha \in \mathbb{R}} \bigl(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\bigr).$$
$R_\phi^* = E\,H(\eta(X))$.
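As a worked example with the exponential loss (a standard computation, not spelled out on this slide): with $\phi(\alpha) = e^{-\alpha}$, setting the derivative of $\eta e^{-\alpha} + (1 - \eta) e^{\alpha}$ to zero gives
$$\alpha^*(\eta) = \frac{1}{2} \log \frac{\eta}{1 - \eta}, \qquad H(\eta) = 2\sqrt{\eta(1 - \eta)},$$
so the pointwise minimizer is half the log-odds, and $H$ is largest at $\eta = 1/2$.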
Optimal Conditional $\phi$-risk: Example
[Figure: left panel — $\phi(\alpha)$, $\phi(-\alpha)$, and the conditional $\phi$-risks $C_{0.3}(\alpha)$ and $C_{0.7}(\alpha)$; right panel — $\alpha^*(\eta)$, $H(\eta)$, and $\psi(\theta)$.]
Definitions
Optimal conditional $\phi$-risk for $\eta \in [0, 1]$:
$$H(\eta) = \inf_{\alpha \in \mathbb{R}} \bigl(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\bigr).$$
Optimal conditional $\phi$-risk with incorrect sign:
$$H^-(\eta) = \inf_{\alpha :\, \alpha(2\eta - 1) \le 0} \bigl(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\bigr).$$
Note: $H^-(\eta) \ge H(\eta)$ and $H^-(1/2) = H(1/2)$.
Definitions
$$H(\eta) = \inf_{\alpha \in \mathbb{R}} \bigl(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\bigr), \qquad H^-(\eta) = \inf_{\alpha :\, \alpha(2\eta - 1) \le 0} \bigl(\eta\,\phi(\alpha) + (1 - \eta)\,\phi(-\alpha)\bigr).$$
Definition: for $\eta \neq 1/2$, $\phi$ is classification-calibrated if $H^-(\eta) > H(\eta)$,
i.e., pointwise optimization of the conditional $\phi$-risk leads to the correct sign. (c.f. Lin (2001))
The $\psi$-transform
Definition: Given convex $\phi$, define $\psi : [0, 1] \to [0, \infty)$ by
$$\psi(\theta) = \phi(0) - H\!\left(\frac{1 + \theta}{2}\right).$$
(The definition is a little more involved for non-convex $\phi$.)
[Figure: $\alpha^*(\eta)$, $H(\eta)$, and $\psi(\theta)$.]
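Continuing the exponential-loss example, and adding the hinge loss for comparison (both computations are standard, not taken from the slide): $\phi(0) = 1$ in both cases, and
$$\phi(\alpha) = e^{-\alpha}: \quad \psi(\theta) = 1 - \sqrt{1 - \theta^2}; \qquad \phi(\alpha) = \max(0, 1 - \alpha): \quad H(\eta) = 2\min(\eta, 1 - \eta), \ \ \psi(\theta) = \theta.$$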
The Relationship between Excess Risk and Excess $\phi$-risk
Theorem:
1. For any $P$ and $f$, $\ \psi(R(f) - R^*) \le R_\phi(f) - R_\phi^*$.
2. For $|\mathcal{X}| \ge 2$, $\epsilon > 0$ and $\theta \in [0, 1]$, there is a $P$ and an $f$ with $R(f) - R^* = \theta$ and
$$\psi(\theta) \le R_\phi(f) - R_\phi^* \le \psi(\theta) + \epsilon.$$
3. The following conditions are equivalent:
(a) $\phi$ is classification-calibrated.
(b) $\psi(\theta_i) \to 0$ iff $\theta_i \to 0$.
(c) $R_\phi(f_i) \to R_\phi^*$ implies $R(f_i) \to R^*$.
Classification-calibrated $\phi$
If $\phi$ is classification-calibrated, then $\psi(\theta_i) \to 0$ iff $\theta_i \to 0$.
Since the function $\psi$ is always convex, in that case it is strictly increasing and so has an inverse.
Thus, we can write
$$R(f) - R^* \le \psi^{-1}\bigl(R_\phi(f) - R_\phi^*\bigr).$$
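For instance (a standard consequence, not on this slide): for the exponential loss, $\psi(\theta) = 1 - \sqrt{1 - \theta^2} \ge \theta^2/2$, so
$$R(f) - R^* \le \sqrt{2\bigl(R_\phi(f) - R_\phi^*\bigr)},$$
while the hinge loss, with $\psi(\theta) = \theta$, gives $R(f) - R^* \le R_\phi(f) - R_\phi^*$.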
Excess Risk Bounds: Proof Idea
Facts:
$H(\eta)$ and $H^-(\eta)$ are symmetric about $\eta = 1/2$.
$H(1/2) = H^-(1/2)$, hence $\psi(0) = 0$.
$\psi(\theta)$ is convex.
$$\psi(\theta) = H^-\!\left(\frac{1 + \theta}{2}\right) - H\!\left(\frac{1 + \theta}{2}\right).$$
Excess Risk Bounds: Proof Idea
Excess risk of $f : \mathcal{X} \to \mathbb{R}$ is
$$R(f) - R^* = E\bigl(1[\mathrm{sign}(f(X)) \neq \mathrm{sign}(\eta(X) - 1/2)]\,|2\eta(X) - 1|\bigr).$$
Thus,
$$\begin{aligned}
\psi(R(f) - R^*) &\le E\bigl(1[\mathrm{sign}(f(X)) \neq \mathrm{sign}(\eta(X) - 1/2)]\,\psi(|2\eta(X) - 1|)\bigr) && (\psi \text{ convex}, \ \psi(0) = 0)\\
&= E\Bigl(1[\mathrm{sign}(f(X)) \neq \mathrm{sign}(\eta(X) - 1/2)]\,\bigl(H^-(\eta(X)) - H(\eta(X))\bigr)\Bigr) && (\text{definition of } \psi)\\
&\le E\bigl(\phi(Y f(X)) - H(\eta(X))\bigr) && (H \text{ minimizes conditional } \phi\text{-risk})\\
&= R_\phi(f) - R_\phi^*. && (\text{definition of } R_\phi^*)
\end{aligned}$$
Classification-calibrated $\phi$
Theorem: If $\phi$ is convex,
$$\phi \text{ is classification-calibrated} \iff \phi \text{ is differentiable at } 0 \text{ and } \phi'(0) < 0.$$
Theorem: If $\phi$ is classification-calibrated, then
$$\exists \gamma > 0 \ \ \forall \alpha \in \mathbb{R}, \quad \gamma\,\phi(\alpha) \ge 1[\alpha \le 0].$$
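A quick check with the losses above (my computation, not on the slide): the exponential, hinge, and logistic losses are all differentiable at $0$ with
$$\frac{d}{d\alpha} e^{-\alpha}\Big|_{0} = -1, \qquad \frac{d}{d\alpha} \max(0, 1 - \alpha)\Big|_{0} = -1, \qquad \frac{d}{d\alpha} \log(1 + e^{-2\alpha})\Big|_{0} = -1,$$
so all three are classification-calibrated; and, for example, $\gamma = 1/\log 2$ works in the second theorem for the logistic loss, since $\log(1 + e^{-2\alpha}) \ge \log 2$ for $\alpha \le 0$.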
Overview
Relating excess risk to excess $\phi$-risk.
Universal consistency of AdaBoost:
the approximation/estimation decomposition;
AdaBoost: previous results;
universal consistency.
(with Mikhail Traskin)
Universal Consistency
Assume: i.i.d. data, $(X, Y), (X_1, Y_1), \ldots, (X_n, Y_n)$ from $\mathcal{X} \times \mathcal{Y}$ (with $\mathcal{Y} = \{\pm 1\}$).
Consider a method $f_n = A((X_1, Y_1), \ldots, (X_n, Y_n))$, e.g., $f_n = \text{AdaBoost}((X_1, Y_1), \ldots, (X_n, Y_n), t_n)$.
Definition: We say that the method is universally consistent if, for all distributions $P$,
$$R(f_n) \xrightarrow{a.s.} R^*,$$
where $R$ is the risk and $R^*$ is the Bayes risk:
$$R(f) = \Pr\bigl(Y \neq \mathrm{sign}(f(X))\bigr), \qquad R^* = \inf_f R(f).$$
The Approximation/Estimation Decomposition
Consider an algorithm that chooses
$$f_n = \arg\min_{f \in \mathcal{F}} \hat{R}_\phi(f) + \lambda_n \Omega(f).$$
We can decompose the excess risk estimate as
$$\psi\bigl(R(f_n) - R^*\bigr) \le R_\phi(f_n) - R_\phi^* = \underbrace{R_\phi(f_n) - \inf_{f \in \mathcal{F}_n} R_\phi(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}_n} R_\phi(f) - R_\phi^*}_{\text{approximation error}}.$$
The Approximation/Estimation Decomposition
Consider an algorithm that chooses
$$f_n = \arg\min_{f \in \mathcal{F}_n} \hat{R}_\phi(f).$$
We can decompose the excess risk estimate as
$$\psi\bigl(R(f_n) - R^*\bigr) \le R_\phi(f_n) - R_\phi^* = \underbrace{R_\phi(f_n) - \inf_{f \in \mathcal{F}_n} R_\phi(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}_n} R_\phi(f) - R_\phi^*}_{\text{approximation error}}.$$
The Approximation/Estimation Decomposition
$$\psi\bigl(R(f_n) - R^*\bigr) \le R_\phi(f_n) - R_\phi^* = \underbrace{R_\phi(f_n) - \inf_{f \in \mathcal{F}_n} R_\phi(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}_n} R_\phi(f) - R_\phi^*}_{\text{approximation error}}.$$
Approximation and estimation errors are in terms of $R_\phi$, not $R$. Like a regression problem.
With a rich class and appropriate regularization, $R_\phi(f_n) \to R_\phi^*$. (e.g., $\mathcal{F}_n$ gets large slowly, or $\lambda_n \to 0$ slowly.)
Universal consistency ($R(f_n) \to R^*$) iff $\phi$ is classification-calibrated.
Overview
Relating excess risk to excess $\phi$-risk.
Universal consistency of AdaBoost:
the approximation/estimation decomposition;
AdaBoost: previous results;
universal consistency.
AdaBoost
Sample: $S_n = ((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \{\pm 1\})^n$
Number of iterations: $T$

function AdaBoost($S_n$, $T$)
    $f_0 := 0$
    for $t$ from $1, \ldots, T$:
        $(\alpha_t, h_t) := \arg\min_{\alpha \in \mathbb{R},\, h \in \mathcal{F}} \ \frac{1}{n} \sum_{i=1}^{n} \exp\bigl(-y_i (f_{t-1}(x_i) + \alpha h(x_i))\bigr)$
        $f_t := f_{t-1} + \alpha_t h_t$
    return $f_T$
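A runnable Python sketch of this loop, assuming axis-aligned decision stumps as the base class $\mathcal{F}$; the stump search and the closed-form step size $\alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$ are standard AdaBoost ingredients rather than details from the slide.

```python
import numpy as np

def best_stump(X, y, w):
    """Exhaustively search axis-aligned stumps h(x) = s * sign(x[j] - thr),
    returning the one with smallest weighted error under weights w."""
    n, d = X.shape
    best = (np.inf, 0, 0.0, 1)          # (weighted error, feature, threshold, sign)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            pred = np.where(X[:, j] > thr, 1, -1)
            for s in (1, -1):
                err = np.sum(w[pred * s != y])
                if err < best[0]:
                    best = (err, j, thr, s)
    return best

def adaboost(X, y, T):
    """AdaBoost with the exponential loss: f_t = f_{t-1} + alpha_t h_t."""
    n = X.shape[0]
    f = np.zeros(n)                      # current scores f_{t-1}(x_i) on the sample
    model = []
    for t in range(T):
        w = np.exp(-y * f)
        w /= w.sum()                     # weights proportional to exp(-y_i f_{t-1}(x_i))
        err, j, thr, s = best_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)   # exact line search for +-1-valued h
        h = s * np.where(X[:, j] > thr, 1, -1)
        f += alpha * h
        model.append((alpha, j, thr, s))
    return model

def predict(model, X):
    scores = sum(a * s * np.where(X[:, j] > thr, 1, -1) for a, j, thr, s in model)
    return np.where(scores >= 0, 1, -1)

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
model = adaboost(X, y, T=50)
print("training error:", np.mean(predict(model, X) != y))
```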
Previous results: Regularized versions
AdaBoost greedily minimizes
$$\hat{R}_\phi(f) = \frac{1}{n} \sum_{i=1}^{n} \exp(-Y_i f(X_i))$$
over $f \in \mathrm{span}(\mathcal{F})$.
(Notice that, for many interesting basis classes $\mathcal{F}$, the infimum is zero.)
Instead of AdaBoost, consider a regularized version of its criterion.
Previous results: Regularized versions
1. Minimize $\hat{R}_\phi(f)$ over $f \in \gamma_n\,\mathrm{co}(\mathcal{F})$, the scaled convex hull of $\mathcal{F}$.
2. Minimize $\hat{R}_\phi(f) + \lambda_n \|f\|_1$ over $f \in \mathrm{span}(\mathcal{F})$, where $\|f\|_1 = \inf\{\gamma : f \in \gamma\,\mathrm{co}(\mathcal{F})\}$.
For suitable choices of the parameters ($\gamma_n$ and $\lambda_n$), these algorithms are universally consistent.
(Lugosi and Vayatis, 2004), (Zhang, 2004)
Previous results: Bounded step size
function AdaBoostWithBoundedStepSize($S_n$, $T$)
    $f_0 := 0$
    for $t$ from $1, \ldots, T$:
        $(\alpha_t, h_t) := \arg\min_{\alpha \in \mathbb{R},\, h \in \mathcal{F}} \ \frac{1}{n} \sum_{i=1}^{n} \exp\bigl(-y_i (f_{t-1}(x_i) + \alpha h(x_i))\bigr)$
        $f_t := f_{t-1} + \min\{\alpha_t, \epsilon\}\, h_t$
    return $f_T$
For suitable choices of the parameters ($T = T_n$ and $\epsilon = \epsilon_n$), this algorithm is universally consistent.
(Zhang and Yu, 2005), (Bickel, Ritov, Zakai, 2006)
Previous results about AdaBoost
AdaBoost greedily minimizes
$$\hat{R}_\phi(f) = \frac{1}{n} \sum_{i=1}^{n} \exp(-Y_i f(X_i))$$
over $f \in \mathrm{span}(\mathcal{F})$.
Consider AdaBoost with early stopping: $f_n$ is the function returned by AdaBoost after $t_n$ steps.
How should we choose $t_n$?
Note: the infimum is often zero, so we don't want $t_n$ too large.
Previous result about AdaBoost: Process consistency
Theorem: [Jiang, 2004] For a (suitable) basis class defined on $\mathbb{R}^d$, and for all probability distributions $P$ satisfying certain smoothness assumptions, there is a sequence $t_n$ such that $f_n = \text{AdaBoost}(S_n, t_n)$ satisfies $R(f_n) \xrightarrow{a.s.} R^*$.
Conditions on the distribution $P$ are unnatural and cannot be checked.
How should the stopping time $t_n$ grow with sample size $n$? Does it need to depend on the distribution $P$?
Rates?
Overview
Relating excess risk to excess $\phi$-risk.
Universal consistency of AdaBoost:
the approximation/estimation decomposition;
AdaBoost: previous results;
universal consistency.
The key theorem
Assume $d_{VC}(\mathcal{F}) < \infty$. (Otherwise AdaBoost must stop, and fail, after one step.)
Assume
$$\lim_{\lambda \to \infty} \inf\{R_\phi(f) : f \in \lambda\,\mathrm{co}(\mathcal{F})\} = R_\phi^*,$$
where $R_\phi(f) = E \exp(-Y f(X))$ and $R_\phi^* = \inf_f R_\phi(f)$. That is, the approximation error is zero.
For example, $\mathcal{F}$ is linear threshold functions, or binary trees with axis-orthogonal decisions in $\mathbb{R}^d$ and at least $d + 1$ leaves.
The key theorem
Theorem: If
$d_{VC}(\mathcal{F}) < \infty$,
$R_\phi^* = \lim_{\lambda \to \infty} \inf\{R_\phi(f) : f \in \lambda\,\mathrm{co}(\mathcal{F})\}$,
$t_n \to \infty$, and
$t_n = O(n^{1 - \alpha})$ for some $\alpha > 0$,
then AdaBoost is universally consistent.
The key theorem: Idea of proof
We show $R_\phi(f_{t_n}) \to R_\phi^*$, which implies $R(f_{t_n}) \to R^*$, since the loss function $\alpha \mapsto \exp(-\alpha)$ is classification-calibrated.
Step 1. Notice that we can clip $f_{t_n}$: if we define $\pi_\lambda(f)$ as $x \mapsto \max\{-\lambda, \min\{\lambda, f(x)\}\}$, then clipping preserves the sign, so $R(\pi_\lambda(f_{t_n})) = R(f_{t_n})$, and
$$R_\phi(\pi_\lambda(f_{t_n})) \to R_\phi^* \ \Rightarrow\ R(\pi_\lambda(f_{t_n})) \to R^* \ \Rightarrow\ R(f_{t_n}) \to R^*.$$
We will need to relax the clipping ($\lambda_n \to \infty$).
[Figure: the clipping function $\pi_1$.]
The key theorem: Idea of proof
Step 2. Use VC theory (for clipped combinations of $t$ functions from $\mathcal{F}$) to show that, with high probability,
$$R_\phi(\pi_\lambda(f_t)) \le \hat{R}_\phi(\pi_\lambda(f_t)) + c(\lambda)\,\sqrt{\frac{d_{VC}(\mathcal{F})\, t \log t}{n}},$$
where $\hat{R}_\phi$ is the empirical version of $R_\phi$: $\hat{R}_\phi(f) = E_n \exp(-Y f(X))$.
The key theorem: Idea of proof
Step 3. The clipping only hurts for small values of the exponential criterion:
$$\hat{R}_\phi(\pi_\lambda(f_t)) \le \hat{R}_\phi(f_t) + e^{-\lambda}.$$
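To see why (a one-line check, not spelled out on the slide): clipping changes the loss of example $i$ only if $|f_t(x_i)| > \lambda$, and it can only increase the loss when $y_i f_t(x_i) > \lambda$, where the clipped loss equals $e^{-\lambda}$. Hence
$$\hat{R}_\phi(\pi_\lambda(f_t)) - \hat{R}_\phi(f_t) \le \frac{1}{n} \sum_{i :\, y_i f_t(x_i) > \lambda} e^{-\lambda} \le e^{-\lambda}.$$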
The key theorem: Idea of proof
Step 4. Apply the numerical convergence result of (Bickel et al., 2006): for any comparison function $\bar{f} \in \mathcal{F}^\lambda = \lambda\,\mathrm{co}(\mathcal{F})$,
$$\hat{R}_\phi(f_t) \le \hat{R}_\phi(\bar{f}) + \epsilon(\lambda, t).$$
Here, we exploit an attractive property of the exponential loss function and the fact that classifiers are binary-valued: the second derivative of $\hat{R}_\phi$ in a basis direction is large whenever $\hat{R}_\phi$ is large. This keeps the steps taken by AdaBoost from being too large.
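Concretely (a quick computation, not spelled out on the slide): for $h \in \mathcal{F}$ taking values in $\{\pm 1\}$,
$$\frac{d^2}{d\alpha^2}\,\hat{R}_\phi(f + \alpha h)\Big|_{\alpha = 0} = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i h(x_i)\bigr)^2 \exp(-y_i f(x_i)) = \hat{R}_\phi(f),$$
since $(y_i h(x_i))^2 = 1$; so the curvature along any basis direction equals the current value of the criterion.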
The key theorem: Idea of proof
Step 5. Relate $\hat{R}_\phi(\bar{f})$ to $R_\phi(\bar{f})$.
Choosing $\lambda_n \to \infty$ suitably slowly, we can choose $\bar{f}_n$ so that $R_\phi(\bar{f}_n) \to R_\phi^*$ (by assumption), and then for $t_n = O(n^{1 - \alpha})$, we have the result.
The key theorem: Idea of proof
[Figure, built up over several slides: a schematic of the proof, relating the Bayes risk $R^*$, the AdaBoost iterate $f_{t_n}$, its clipped version $\pi_{\lambda_n}(f_{t_n})$, and the comparison function $\bar{f}_n$.]
Open Problems
Other loss functions? e.g., LogitBoost uses $\alpha \mapsto \log(1 + \exp(-2\alpha))$ in place of $\exp(-\alpha)$.
(The difficulty is the behaviour of the second derivative of $\hat{R}_\phi$ in the direction of a basis function. For the numerical convergence results, we want it large whenever $\hat{R}_\phi$ is large.)
Real-valued basis functions? (The same issue arises.)
Rates? The bottleneck is the rate of decrease of $\hat{R}_\phi(f_t)$. The numerical convergence result only ensures that it decreases to $\hat{R}_\phi(\bar{f})$ at rate $\log^{-1/2} t$. This seems pessimistic.
Overview
Relating excess risk to excess $\phi$-risk.
Universal consistency of AdaBoost.
Slides at http://www.cs.berkeley.edu/~bartlett