AdaBoost and other Large Margin Classifiers: Convexity in Classification
1 AdaBoost and other Large Margin Classifiers: Convexity in Classification Peter Bartlett Division of Computer Science and Department of Statistics UC Berkeley Joint work with Mikhail Traskin. slides at bartlett 1
2 The Pattern Classification Problem
i.i.d. (X, Y), (X_1, Y_1), ..., (X_n, Y_n) from X × {±1}.
Use the data (X_1, Y_1), ..., (X_n, Y_n) to choose f_n : X → R with small risk,
R(f_n) = Pr(sign(f_n(X)) ≠ Y) = E ℓ(Y, f_n(X)).
Natural approach: minimize the empirical risk,
R̂(f) = Ê ℓ(Y, f(X)) = (1/n) Σ_{i=1}^n ℓ(Y_i, f(X_i)).
Often intractable... Replace the 0-1 loss ℓ with a convex surrogate, φ. 2
3 Large Margin Algorithms
Consider the margins, Y f(X).
Define a margin cost function φ : R → R_+.
Define the φ-risk of f : X → R as R_φ(f) = E φ(Y f(X)).
Choose f ∈ F to minimize the φ-risk.
(e.g., use the data (X_1, Y_1), ..., (X_n, Y_n) to minimize the empirical φ-risk,
R̂_φ(f) = Ê φ(Y f(X)) = (1/n) Σ_{i=1}^n φ(Y_i f(X_i)),
or a regularized version.) 3
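Purely as an illustration (added here, not on the slides), the empirical φ-risk for the exponential surrogate can be computed in a couple of lines; the function name and the toy data below are hypothetical.

```python
import numpy as np

def empirical_phi_risk(f_values, y, phi=lambda a: np.exp(-a)):
    """Empirical phi-risk: (1/n) * sum_i phi(y_i * f(x_i))."""
    return np.mean(phi(y * f_values))

# Hypothetical toy sample: f(x_i) for three points, two classified correctly.
y = np.array([+1.0, -1.0, +1.0])
f_values = np.array([0.5, -1.2, -0.3])
print(empirical_phi_risk(f_values, y))     # empirical exponential-loss risk
print(np.mean(np.sign(f_values) != y))     # empirical 0-1 risk, for comparison
```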
4 Large Margin Algorithms
AdaBoost: F = span(G) for a VC-class G, φ(α) = exp(−α).
Minimizes R̂_φ(f) using greedy basis selection and line search:
f_{t+1} = f_t + α_{t+1} g_{t+1}, where R̂_φ(f_t + α_{t+1} g_{t+1}) = min_{α∈R, g∈G} R̂_φ(f_t + α g).
Effective in applications: real-time face detection, spoken dialogue systems, ... 4
5 Large Margin Algorithms
Many other variants:
Support vector machines with 1-norm soft margin: F = ball in a reproducing kernel Hilbert space H, φ(α) = max(0, 1 − α); the algorithm minimizes R̂_φ(f) + λ‖f‖²_H.
Neural net classifiers: φ(α) = max(0, (0.8 − α)²).
L2Boost, LS-SVMs: φ(α) = (1 − α)².
Logistic regression: φ(α) = log(1 + exp(−2α)). 5
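Purely as an illustration (added here, not on the slides), the main surrogates above can be written down and compared with the 0-1 loss in a few lines of Python:

```python
import numpy as np

# Margin cost functions phi(alpha), with alpha = y * f(x).
exponential = lambda a: np.exp(-a)                      # AdaBoost
hinge       = lambda a: np.maximum(0.0, 1.0 - a)        # SVM, 1-norm soft margin
quadratic   = lambda a: (1.0 - a) ** 2                  # L2Boost, LS-SVM
logistic    = lambda a: np.log(1.0 + np.exp(-2.0 * a))  # logistic regression

# Evaluate each surrogate on a few margins, next to the 0-1 loss it replaces.
alphas = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("0-1      ", (alphas <= 0).astype(float))
for name, phi in [("exp", exponential), ("hinge", hinge),
                  ("quad", quadratic), ("logistic", logistic)]:
    print(f"{name:9s}", np.round(phi(alphas), 3))
```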
6 Large Margin Algorithms
[Figure: the margin cost functions φ(α) (exponential, hinge, logistic, truncated quadratic) plotted as functions of the margin α.] 6
7 Statistical Consequences of Using a Convex Cost Is AdaBoost universally consistent? Other φ? (Lugosi and Vayatis, 2004), (Mannor, Meir and Zhang, 2002): regularized boosting. (Jiang, 2004): process consistency of AdaBoost, for certain probability distributions. (Zhang, 2004), (Steinwart, 2003): SVM. 7
8 Statistical Consequences of Using a Convex Cost How is risk related to φ-risk? (Lugosi and Vayatis, 2004), (Steinwart, 2003): asymptotic. (Zhang, 2004): comparison theorem. 8
9 Overview Relating excess risk to excess φ-risk. ψ-transform: best possible bound. conditions on φ. (with Mike Jordan and Jon McAuliffe) Universal consistency of AdaBoost. 9
10 Definitions and Facts
η(x) = Pr(Y = 1 | X = x)    (conditional probability)
R(f) = Pr(sign(f(X)) ≠ Y)    (risk)
R_φ(f) = E φ(Y f(X))    (φ-risk)
R* = inf_f R(f),  R*_φ = inf_f R_φ(f).
η defines an optimal classifier: R* = R(sign(η(X) − 1/2)). 10
11 Definitions and Facts
η(x) = Pr(Y = 1 | X = x), R(f) = Pr(sign(f(X)) ≠ Y), R_φ(f) = E φ(Y f(X)),
R* = inf_f R(f), R*_φ = inf_f R_φ(f), and R* = R(sign(η(X) − 1/2)).
Notice: R_φ(f) = E(E[φ(Y f(X)) | X]), and the conditional φ-risk is
E[φ(Y f(X)) | X = x] = η(x) φ(f(x)) + (1 − η(x)) φ(−f(x)). 11
12 Definitions
Conditional φ-risk: E[φ(Y f(X)) | X = x] = η(x) φ(f(x)) + (1 − η(x)) φ(−f(x)).
Optimal conditional φ-risk, for η ∈ [0, 1]: H(η) = inf_{α∈R} (η φ(α) + (1 − η) φ(−α)).
R*_φ = E H(η(X)). 12
13 Optimal Conditional φ-risk: Example
[Figure: for an example φ, the curves φ(α) and φ(−α), the conditional φ-risks C_0.3(α) and C_0.7(α), the minimizer α*(η), the optimal conditional φ-risk H(η), and the ψ-transform ψ(θ).]
14 Definitions
Optimal conditional φ-risk, for η ∈ [0, 1]: H(η) = inf_{α∈R} (η φ(α) + (1 − η) φ(−α)).
Optimal conditional φ-risk with incorrect sign: H⁻(η) = inf_{α: α(2η−1) ≤ 0} (η φ(α) + (1 − η) φ(−α)).
Note: H⁻(η) ≥ H(η) and H⁻(1/2) = H(1/2). 14
15 Definitions
H(η) = inf_{α∈R} (η φ(α) + (1 − η) φ(−α)),
H⁻(η) = inf_{α: α(2η−1) ≤ 0} (η φ(α) + (1 − η) φ(−α)).
Definition: for η ≠ 1/2, φ is classification-calibrated if H⁻(η) > H(η), i.e., pointwise optimization of the conditional φ-risk leads to the correct sign. (c.f. Lin (2001)) 15
16 The ψ-transform
Definition: Given convex φ, define ψ : [0, 1] → [0, ∞) by
ψ(θ) = φ(0) − H((1 + θ)/2).
(The definition is a little more involved for non-convex φ.)
[Figure: α*(η), H(η), and ψ(θ) for an example φ.]
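To make the ψ-transform concrete, here is a small numerical sketch (added for illustration). It computes H and ψ by direct minimization for the exponential loss φ(α) = e^{−α} and checks them against the standard closed forms H(η) = 2√(η(1 − η)) and ψ(θ) = 1 − √(1 − θ²).

```python
import numpy as np
from scipy.optimize import minimize_scalar

phi = lambda a: np.exp(-a)   # exponential loss

def H(eta):
    # Optimal conditional phi-risk: inf_alpha  eta*phi(alpha) + (1-eta)*phi(-alpha).
    obj = lambda a: eta * phi(a) + (1 - eta) * phi(-a)
    return minimize_scalar(obj, bounds=(-20, 20), method="bounded").fun

def psi(theta):
    # psi-transform for convex phi: phi(0) - H((1 + theta) / 2).
    return phi(0.0) - H((1 + theta) / 2)

for theta in [0.0, 0.25, 0.5, 0.9]:
    closed_form = 1 - np.sqrt(1 - theta ** 2)   # known psi for the exponential loss
    assert abs(psi(theta) - closed_form) < 1e-6
    print(theta, psi(theta))
```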
17 The Relationship between Excess Risk and Excess φ-risk
Theorem:
1. For any P and f, ψ(R(f) − R*) ≤ R_φ(f) − R*_φ.
2. For |X| ≥ 2, ε > 0 and θ ∈ [0, 1], there is a P and an f with R(f) − R* = θ and ψ(θ) ≤ R_φ(f) − R*_φ ≤ ψ(θ) + ε.
3. The following conditions are equivalent:
(a) φ is classification-calibrated.
(b) ψ(θ_i) → 0 iff θ_i → 0.
(c) R_φ(f_i) → R*_φ implies R(f_i) → R*. 17
18 Classification-calibrated φ
If φ is classification-calibrated, then ψ(θ_i) → 0 iff θ_i → 0. Since the function ψ is always convex with ψ(0) = 0, in that case it is strictly increasing and so has an inverse. Thus, we can write
R(f) − R* ≤ ψ⁻¹(R_φ(f) − R*_φ). 18
19 Excess Risk Bounds: Proof Idea
Facts:
H(η) and H⁻(η) are symmetric about η = 1/2.
H(1/2) = H⁻(1/2), hence ψ(0) = 0.
ψ(θ) is convex.
ψ(θ) = H⁻((1 + θ)/2) − H((1 + θ)/2). 19
20 Excess Risk Bounds: Proof Idea
The excess risk of f : X → R is
R(f) − R* = E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] |2η(X) − 1|).
Thus,
ψ(R(f) − R*)
  ≤ E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] ψ(|2η(X) − 1|))    (ψ convex, ψ(0) = 0)
  = E(1[sign(f(X)) ≠ sign(η(X) − 1/2)] (H⁻(η(X)) − H(η(X))))    (definition of ψ)
  ≤ E(φ(Y f(X)) − H(η(X)))    (H minimizes the conditional φ-risk)
  = R_φ(f) − R*_φ.    (definition of R*_φ) 20
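As a sanity check on the comparison theorem (an added illustration, not on the slides), the following sketch evaluates both sides of ψ(R(f) − R*) ≤ R_φ(f) − R*_φ for the exponential loss on a hypothetical three-point distribution.

```python
import numpy as np

# Hypothetical discrete distribution: three x-values with marginal p(x) and eta(x) = P(Y=1 | x).
p = np.array([0.5, 0.3, 0.2])
eta = np.array([0.9, 0.4, 0.55])
f = np.array([1.0, 0.2, -0.7])               # an arbitrary scoring function f(x)

phi = lambda a: np.exp(-a)
psi = lambda t: 1 - np.sqrt(1 - t ** 2)      # psi-transform of the exponential loss

R_phi      = np.sum(p * (eta * phi(f) + (1 - eta) * phi(-f)))   # E phi(Y f(X))
R_phi_star = np.sum(p * 2 * np.sqrt(eta * (1 - eta)))           # E H(eta(X)), H(eta) = 2 sqrt(eta(1-eta))
R      = np.sum(p * np.where(f > 0, 1 - eta, eta))              # Pr(sign(f(X)) != Y)
R_star = np.sum(p * np.minimum(eta, 1 - eta))                   # Bayes risk

assert psi(R - R_star) <= R_phi - R_phi_star + 1e-12
print(psi(R - R_star), R_phi - R_phi_star)
```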
24 Classification-calibrated φ
Theorem: If φ is convex, then φ is classification-calibrated ⟺ φ is differentiable at 0 and φ′(0) < 0.
Theorem: If φ is classification-calibrated, then there is a γ > 0 such that for all α ∈ R, γφ(α) ≥ 1[α ≤ 0]. 21
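Both statements are easy to check numerically for the surrogates listed earlier (an added illustration; finite differences stand in for φ′(0), and γ = 1/φ(0) happens to work for these four losses).

```python
import numpy as np

losses = {
    "exponential": lambda a: np.exp(-a),
    "hinge":       lambda a: np.maximum(0.0, 1.0 - a),
    "quadratic":   lambda a: (1.0 - a) ** 2,
    "logistic":    lambda a: np.log(1.0 + np.exp(-2.0 * a)),
}

alphas = np.linspace(-5.0, 5.0, 2001)
zero_one = (alphas <= 0).astype(float)
h = 1e-6

for name, phi in losses.items():
    # Calibration criterion for convex phi: phi'(0) < 0 (central difference).
    slope_at_zero = (phi(h) - phi(-h)) / (2 * h)
    assert slope_at_zero < 0
    # A gamma > 0 with gamma * phi(alpha) >= 1[alpha <= 0]; gamma = 1/phi(0) suffices here.
    gamma = 1.0 / phi(0.0)
    assert np.all(gamma * phi(alphas) >= zero_one)
    print(name, slope_at_zero, gamma)
```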
25 Overview Relating excess risk to excess φ-risk. Universal consistency of AdaBoost. The approximation/estimation decomposition. AdaBoost: Previous results. Universal consistency. (with Mikhail Traskin) 22
26 Universal Consistency
Assume: i.i.d. data (X, Y), (X_1, Y_1), ..., (X_n, Y_n) from X × Y (with Y = {±1}).
Consider a method f_n = A((X_1, Y_1), ..., (X_n, Y_n)), e.g., f_n = AdaBoost((X_1, Y_1), ..., (X_n, Y_n), t_n).
Definition: We say that the method is universally consistent if, for all distributions P, R(f_n) → R* a.s., where R is the risk and R* is the Bayes risk:
R(f) = Pr(Y ≠ sign(f(X))), R* = inf_f R(f). 23
27 The Approximation/Estimation Decomposition
Consider an algorithm that chooses f_n = arg min_{f∈F} R̂_φ(f) + λ_n Ω(f).
We can decompose the excess risk estimate as
ψ(R(f_n) − R*) ≤ R_φ(f_n) − R*_φ = [R_φ(f_n) − inf_{f∈F_n} R_φ(f)]  (estimation error)  +  [inf_{f∈F_n} R_φ(f) − R*_φ]  (approximation error). 24
28 The Approximation/Estimation Decomposition
Consider an algorithm that chooses f_n = arg min_{f∈F_n} R̂_φ(f).
We can decompose the excess risk estimate as
ψ(R(f_n) − R*) ≤ R_φ(f_n) − R*_φ = [R_φ(f_n) − inf_{f∈F_n} R_φ(f)]  (estimation error)  +  [inf_{f∈F_n} R_φ(f) − R*_φ]  (approximation error). 24
29 The Approximation/Estimation Decomposition
ψ(R(f_n) − R*) ≤ R_φ(f_n) − R*_φ = [R_φ(f_n) − inf_{f∈F_n} R_φ(f)]  (estimation error)  +  [inf_{f∈F_n} R_φ(f) − R*_φ]  (approximation error).
Approximation and estimation errors are in terms of R_φ, not R. Like a regression problem.
With a rich class and appropriate regularization, R_φ(f_n) → R*_φ (e.g., F_n gets large slowly, or λ_n → 0 slowly).
Universal consistency (R(f_n) → R*) iff φ is classification-calibrated. 25
30 Overview Relating excess risk to excess φ-risk. Universal consistency of AdaBoost. The approximation/estimation decomposition. AdaBoost: Previous results. Universal consistency. 26
31 AdaBoost
Sample: S_n = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × {±1})^n. Number of iterations: T.
function AdaBoost(S_n, T)
  f_0 := 0
  for t from 1, ..., T
    (α_t, h_t) := arg min_{α∈R, h∈F} (1/n) Σ_{i=1}^n exp(−y_i (f_{t−1}(x_i) + α h(x_i)))
    f_t := f_{t−1} + α_t h_t
  return f_T 27
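Below is a minimal runnable Python sketch of this loop (added for illustration; the stump learner and the data are hypothetical). It uses axis-aligned decision stumps as the base class F, and the closed-form line search α_t = ½ log((1 − err_t)/err_t), which is the exact minimizer of the exponential criterion when the base classifiers are ±1-valued.

```python
import numpy as np

def best_stump(X, y, w):
    """Weighted-error-minimizing axis-aligned stump h(x) = s * sign(x[j] - theta)."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for theta in np.unique(X[:, j]):
            for s in (+1.0, -1.0):
                pred = s * np.where(X[:, j] > theta, 1.0, -1.0)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, (j, theta, s))
    return best

def stump_predict(params, X):
    j, theta, s = params
    return s * np.where(X[:, j] > theta, 1.0, -1.0)

def adaboost(X, y, T):
    """Greedy coordinate descent on the empirical exponential risk."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                    # weights proportional to exp(-y_i f_{t-1}(x_i))
    ensemble = []
    for _ in range(T):
        err, params = best_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)   # avoid an infinite step
        alpha = 0.5 * np.log((1 - err) / err)  # exact line search for {-1,+1}-valued h
        pred = stump_predict(params, X)
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append((alpha, params))
    return lambda Xnew: sum(a * stump_predict(p, Xnew) for a, p in ensemble)

# Hypothetical data: two Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)), rng.normal(1.0, 1.0, (100, 2))])
y = np.concatenate([-np.ones(100), np.ones(100)])
f_T = adaboost(X, y, T=50)
print("training error:", np.mean(np.sign(f_T(X)) != y))
```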
32 Previous results: Regularized versions
AdaBoost greedily minimizes R̂_φ(f) = (1/n) Σ_{i=1}^n exp(−Y_i f(X_i)) over f ∈ span(F).
(Notice that, for many interesting basis classes F, the infimum is zero.)
Instead of AdaBoost, consider a regularized version of its criterion. 28
33 Previous results: Regularized versions
1. Minimize R̂_φ(f) over f ∈ γ_n co(F), the scaled convex hull of F.
2. Minimize R̂_φ(f) + λ_n ‖f‖_1 over f ∈ span(F), where ‖f‖_1 = inf{γ : f ∈ γ co(F)}.
For suitable choices of the parameters (γ_n and λ_n), these algorithms are universally consistent. (Lugosi and Vayatis, 2004), (Zhang, 2004) 29
34 Previous results: Bounded step size
function AdaBoostWithBoundedStepSize(S_n, T)
  f_0 := 0
  for t from 1, ..., T
    (α_t, h_t) := arg min_{α∈R, h∈F} (1/n) Σ_{i=1}^n exp(−y_i (f_{t−1}(x_i) + α h(x_i)))
    f_t := f_{t−1} + min{α_t, ε} h_t
  return f_T
For suitable choices of the parameters (T = T_n and ε = ε_n), this algorithm is universally consistent. (Zhang and Yu, 2005), (Bickel, Ritov, Zakai, 2006) 30
35 Previous results about AdaBoost
AdaBoost greedily minimizes R̂_φ(f) = (1/n) Σ_{i=1}^n exp(−Y_i f(X_i)) over f ∈ span(F).
Consider AdaBoost with early stopping: f_n is the function returned by AdaBoost after t_n steps.
How should we choose t_n? Note: the infimum is often zero, so we don't want t_n too large. 31
36 Previous result about AdaBoost: Process consistency
Theorem [Jiang, 2004]: For a (suitable) basis class defined on R^d, and for all probability distributions P satisfying certain smoothness assumptions, there is a sequence t_n such that f_n = AdaBoost(S_n, t_n) satisfies R(f_n) → R* a.s.
Conditions on the distribution P are unnatural and cannot be checked.
How should the stopping time t_n grow with sample size n? Does it need to depend on the distribution P? Rates? 32
37 Overview Relating excess risk to excess φ-risk. Universal consistency of AdaBoost. The approximation/estimation decomposition. AdaBoost: Previous results. Universal consistency. 33
38 The key theorem
Assume d_VC(F) < ∞. (Otherwise AdaBoost must stop, and fail, after one step.)
Assume lim_{λ→∞} inf{R_φ(f) : f ∈ λ co(F)} = R*_φ, where R_φ(f) = E exp(−Y f(X)) and R*_φ = inf_f R_φ(f). That is, the approximation error is zero.
For example: F is the class of linear threshold functions, or of binary trees with axis-orthogonal decisions in R^d and at least d + 1 leaves. 34
39 The key theorem
Theorem: If
d_VC(F) < ∞,
R*_φ = lim_{λ→∞} inf{R_φ(f) : f ∈ λ co(F)},
t_n → ∞, and t_n = O(n^{1−α}) for some α > 0,
then AdaBoost is universally consistent. 35
40 The key theorem: Idea of proof
We show R_φ(f_{t_n}) → R*_φ, which implies R(f_{t_n}) → R*, since the loss function α ↦ exp(−α) is classification-calibrated.
Step 1. Notice that we can clip f_{t_n}: if we define π_λ(f) as x ↦ max{−λ, min{λ, f(x)}}, then R_φ(π_λ(f_{t_n})) → R*_φ implies R(π_λ(f_{t_n})) → R*, and R(π_λ(f_{t_n})) = R(f_{t_n}), since clipping does not change the sign.
We will need to relax the clipping (λ_n → ∞).
[Figure: the clipping function π_1(x).] 36
41 The key theorem: Idea of proof
Step 2. Use VC theory (for clipped combinations of t functions from F) to show that, with high probability,
R_φ(π_λ(f_t)) ≤ R̂_φ(π_λ(f_t)) + c(λ) √(d_VC(F) t log t / n),
where R̂_φ is the empirical version of R_φ: R̂_φ(f) = E_n exp(−Y f(X)). 37
42 The key theorem: Idea of proof
Step 3. The clipping only hurts for small values of the exponential criterion:
R̂_φ(π_λ(f_t)) ≤ R̂_φ(f_t) + e^{−λ}.
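A small numeric check of Steps 1 and 3 (an added illustration with hypothetical scores and labels): clipping never changes the sign, so the 0-1 risk is unaffected, and a clipped term exp(−y_i π_λ(f)(x_i)) can exceed the unclipped one only when y_i f(x_i) > λ, and then by at most e^{−λ}.

```python
import numpy as np

rng = np.random.default_rng(1)
f_values = rng.normal(0.0, 3.0, 1000)        # hypothetical scores f(x_i)
y = rng.choice([-1.0, 1.0], 1000)            # hypothetical labels

def emp_exp_risk(scores, labels):
    return np.mean(np.exp(-labels * scores))

for lam in (0.5, 1.0, 5.0):
    clipped = np.clip(f_values, -lam, lam)   # pi_lambda(f), evaluated on the sample
    # Step 1: clipping preserves the sign, hence the empirical 0-1 risk.
    assert np.mean(np.sign(clipped) != y) == np.mean(np.sign(f_values) != y)
    # Step 3: the empirical exponential criterion increases by at most exp(-lambda).
    assert emp_exp_risk(clipped, y) <= emp_exp_risk(f_values, y) + np.exp(-lam)
```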
43 The key theorem: Idea of proof
Step 4. Apply the numerical convergence result of (Bickel et al, 2006): for any comparison function f̄ ∈ F_λ = λ co(F),
R̂_φ(f_t) ≤ R̂_φ(f̄) + ε(λ, t).
Here, we exploit an attractive property of the exponential loss function and the fact that the classifiers are binary-valued: the second derivative of R̂_φ in a basis direction is large whenever R̂_φ is large. This keeps the steps taken by AdaBoost from being too large. 39
44 The key theorem: Idea of proof
Step 5. Relate R̂_φ(f̄) to R_φ(f̄). Choosing λ_n suitably slowly, we can choose f̄_n so that R_φ(f̄_n) → R*_φ (by assumption), and then for t_n = O(n^{1−α}) we have the result. 40
45–50 The key theorem: Idea of proof
[Figure, built up over six slides: a schematic of the chain of approximations in the proof, relating the empirical criterion R̂_n and the true φ-risk R, the iterate f_{t_n}, its clipped version π_{λ_n}(f_{t_n}), the comparison function f̄_n, and the optimal φ-risk R*.] 41
51 Open Problems
Other loss functions? e.g., LogitBoost uses α ↦ log(1 + exp(−2α)) in place of α ↦ exp(−α). (The difficulty is the behaviour of the second derivative of R̂_φ in the direction of a basis function. For the numerical convergence results, we want it large whenever R̂_φ is large.)
Real-valued basis functions? (The same issue arises.)
Rates? The bottleneck is the rate of decrease of R̂_φ(f_t). The numerical convergence result only ensures that it decreases to R̂_φ(f̄) at rate log^{−1/2} t. This seems pessimistic. 42
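To see the curvature issue concretely (an added numerical illustration with hypothetical margins): for the exponential loss and a ±1-valued basis direction h, the second derivative of the empirical criterion along h equals the criterion itself, so it is automatically large whenever R̂_φ is large; for the logistic loss the corresponding second derivative is bounded by 1 no matter how large the criterion is.

```python
import numpy as np

rng = np.random.default_rng(3)
margins = rng.normal(-1.0, 1.0, 500)    # hypothetical margins y_i * f(x_i), mostly negative

# Exponential loss: phi = phi'', so along any {-1,+1}-valued basis direction the
# curvature of the empirical criterion at alpha = 0 equals the criterion itself.
exp_criterion = np.mean(np.exp(-margins))
exp_curvature = np.mean(np.exp(-margins))
assert np.isclose(exp_criterion, exp_curvature)

# Logistic loss: phi(m) = log(1 + exp(-2m)), phi''(m) = 4 exp(2m) / (1 + exp(2m))^2 <= 1.
log_criterion = np.mean(np.log1p(np.exp(-2.0 * margins)))
log_curvature = np.mean(4.0 * np.exp(2.0 * margins) / (1.0 + np.exp(2.0 * margins)) ** 2)
print(exp_criterion, exp_curvature)     # equal, and large when the criterion is large
print(log_criterion, log_curvature)     # curvature stays below 1
```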
52 Overview Relating excess risk to excess φ-risk. Universal consistency of AdaBoost. slides at bartlett 43