Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 8: Boosting (and Compression Schemes)
Boosting the Error

If we have an efficient learning algorithm that, for any distribution realizable by $H_n$, returns a predictor guaranteed to be slightly better than chance, i.e. with error $\epsilon_0 = \frac{1}{2} - \gamma < \frac{1}{2}$ (for some $\gamma > 0$), can we construct a new efficient algorithm that, for any distribution realizable by $H_n$, achieves arbitrarily low error $\epsilon$?

- Posed (as a theoretical question) by Valiant and Kearns, Harvard 1988
- Solved by MIT student Robert Schapire, 1990
- AdaBoost algorithm by Schapire and Yoav Freund, AT&T 1995

(Leslie Valiant, Michael Kearns, Rob Schapire, Yoav Freund)
AdaBoost

Input: training set $S = \{(x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m)\}$ and a weak learner $A$, which will be applied to distributions $D$ over $S$.
- If thinking of $A(S')$ as accepting a sample $S'$: each $(x,y) \in S'$ is set to $(x_i,y_i)$ w.p. $D_i$ (independently and with replacement).
- Can often think of $A$ as operating on a weighted sample, with weights $D_i$.
Output: hypothesis $h$.

Initialize $D^1 = \left(\frac{1}{m}, \frac{1}{m}, \ldots, \frac{1}{m}\right)$
For $t = 1, \ldots, T$:
  $h_t = A(D^t)$
  $\epsilon_t = L_{D^t}(h_t) = \sum_i D^t_i\,\mathbf{1}[h_t(x_i) \neq y_i]$
  $\alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$
  $D^{t+1}_i = \frac{D^t_i \exp(-\alpha_t y_i h_t(x_i))}{\sum_j D^t_j \exp(-\alpha_t y_j h_t(x_j))}$
Output: $h_T(x) = \mathrm{sign}\!\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$
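A minimal Python sketch of the loop above, assuming the weak learner is a decision stump chosen to minimize the weighted error (the stump weak learner and the toy data are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def weak_learner(X, y, D):
    """Assumed weak learner: decision stump minimizing the weighted 0-1 error under D."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sgn in (1, -1):
                pred = sgn * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(D * (pred != y))
                if err < best[0]:
                    best = (err, (j, thr, sgn))
    j, thr, sgn = best[1]
    return lambda Z: sgn * np.where(Z[:, j] > thr, 1, -1)

def adaboost(X, y, T):
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D^1 = (1/m, ..., 1/m)
    hs, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)              # h_t = A(D^t)
        eps = np.sum(D * (h(X) != y))          # eps_t = L_{D^t}(h_t)
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t
        D = D * np.exp(-alpha * y * h(X))      # reweight ...
        D /= D.sum()                           # ... and normalize by Z_t
        hs.append(h); alphas.append(alpha)
    return lambda Z: np.sign(sum(a * h(Z) for a, h in zip(alphas, hs)))

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X.sum(axis=1))
h_T = adaboost(X, y, T=20)
print("training error:", np.mean(h_T(X) != y))
```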
AdaBoost: Weight Update

Recall $\epsilon_t = L_{D^t}(h_t) = \sum_i D^t_i\,\mathbf{1}[h_t(x_i)\neq y_i]$, $\alpha_t = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$, and $D^{t+1}_i \propto D^t_i \exp(-\alpha_t y_i h_t(x_i))$ (the algorithm above).

The update increases weight on errors and decreases it on correct points:
$D^{t+1}_i \;\propto\; \begin{cases} D^t_i\,\sqrt{\frac{1-\epsilon_t}{\epsilon_t}} & \text{if } h_t(x_i)\neq y_i \\ D^t_i\,\sqrt{\frac{\epsilon_t}{1-\epsilon_t}} & \text{if } h_t(x_i)= y_i \end{cases}$
AdaBoost: Weight Update

$D^{t+1}_i = \frac{D^t_i \exp(-\alpha_t y_i h_t(x_i))}{Z_t}
= \frac{1}{Z_t}\begin{cases} D^t_i\,\sqrt{\frac{1-\epsilon_t}{\epsilon_t}} & \text{if } h_t(x_i)\neq y_i \\ D^t_i\,\sqrt{\frac{\epsilon_t}{1-\epsilon_t}} & \text{if } h_t(x_i)= y_i \end{cases}
= \begin{cases} \frac{D^t_i}{2\epsilon_t} & \text{if } h_t(x_i)\neq y_i \\ \frac{D^t_i}{2(1-\epsilon_t)} & \text{if } h_t(x_i)= y_i \end{cases}$

$Z_t = \sum_{h_t(x_i)\neq y_i} D^t_i\sqrt{\frac{1-\epsilon_t}{\epsilon_t}} + \sum_{h_t(x_i)= y_i} D^t_i\sqrt{\frac{\epsilon_t}{1-\epsilon_t}}
= \epsilon_t\sqrt{\frac{1-\epsilon_t}{\epsilon_t}} + (1-\epsilon_t)\sqrt{\frac{\epsilon_t}{1-\epsilon_t}}
= 2\sqrt{\epsilon_t(1-\epsilon_t)}$

$L_{D^{t+1}}(h_t) = \sum_{h_t(x_i)\neq y_i} D^{t+1}_i = \sum_{h_t(x_i)\neq y_i} \frac{D^t_i}{2\epsilon_t} = \frac{\epsilon_t}{2\epsilon_t} = \frac{1}{2}$
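A quick numeric sanity check of the last identity (a sketch with made-up weights and predictions, not lecture code): after one reweighting step, the error of $h_t$ under $D^{t+1}$ comes out to exactly $\frac{1}{2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 10
D = rng.random(m); D /= D.sum()        # arbitrary distribution D^t over S
y = rng.choice([-1, 1], size=m)        # labels
pred = y.copy(); pred[:3] = -pred[:3]  # h_t errs on exactly the first 3 points

eps = np.sum(D * (pred != y))          # eps_t = L_{D^t}(h_t)
alpha = 0.5 * np.log((1 - eps) / eps)
D_next = D * np.exp(-alpha * y * pred)
D_next /= D_next.sum()                 # divide by Z_t

print(np.sum(D_next * (pred != y)))    # 0.5 up to floating-point error
```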
AdaBoost: Weight Update (cont.)

The update increases weight on errors and decreases it on correct points in exactly the proportion that makes
$L_{D^{t+1}}(h_t) = 0.5$,
i.e. the reweighted distribution renders the latest weak hypothesis no better than chance, so the next call to the weak learner must return something new.
AdaBoost as Learning a Linear Classifier

Recall: $h_T(x) = \mathrm{sign}\!\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$.

Let $B$ = all hypotheses output by $A$ (the "base class", e.g. decision stumps), define the feature map $\phi(x)[h] = h(x)$, and set $w[h] = \sum_{t:\,h_t = h}\alpha_t$. Then
$h_T \in \{\,\mathrm{sign}(\langle w, \phi(x)\rangle) \mid w\in\mathbb{R}^B\,\}$, the class of halfspaces $L_B$.

Exponential loss: $L_S^{\exp}(w) = \frac{1}{m}\sum_i \ell^{\exp}(h_w(x_i); y_i)$ with $\ell^{\exp}(h_w(x), y) = e^{-y\langle w,\phi(x)\rangle}$.

Each step of AdaBoost is coordinate descent on $L_S^{\exp}(w)$:
- Choose the coordinate $h$ of $\phi(x)$ such that $\frac{\partial L_S^{\exp}(w)}{\partial w[h]}$ is very negative
- Update $w[h] = \arg\min L_S^{\exp}(w)$, keeping $w[h']$ unchanged for all $h'\neq h$
Coordinate Descent on $L_S^{\exp}(w)$

$\frac{\partial L_S^{\exp}(w)}{\partial w[h]}
= \frac{\partial}{\partial w[h]}\,\frac{1}{m}\sum_i e^{-y_i h_w(x_i)}
= -\frac{1}{m}\sum_i e^{-y_i h_w(x_i)}\, y_i\,\frac{\partial h_w(x_i)}{\partial w[h]}
= -\frac{1}{m}\sum_i e^{-y_i h_w(x_i)}\, y_i\, h(x_i)$

At the start of round $T$ (so $h_w = \sum_{t=1}^{T-1}\alpha_t h_t$):
$= -\frac{1}{m}\sum_i e^{-y_i \sum_{t=1}^{T-1}\alpha_t h_t(x_i)}\, y_i\, h(x_i) \;\propto\; 2 L_{D^T}(h) - 1$,  since $e^{-y_i\sum_{t=1}^{T-1}\alpha_t h_t(x_i)} \propto D^T_i$.

So minimizing $L_{D^T}(h)$ is the same as making $\frac{\partial L_S^{\exp}(w)}{\partial w[h]}$ as negative as possible.

Updating $w[h]$: set $w^t[h_t] = w^{t-1}[h_t] + \alpha$ with $\alpha = \arg\min_\alpha L_S^{\exp}(w^t)$:
$0 = \frac{\partial L_S^{\exp}(w^t)}{\partial\alpha} \;\propto\; 2 L_{D^{t+1}}(h_t) - 1 \quad\Longrightarrow\quad$ choose $\alpha$ such that $L_{D^{t+1}}(h_t) = \frac{1}{2}$.
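To see that this exact line search recovers AdaBoost's step size (a step filled in here, not on the slide): writing $f_{t-1}(x) = \sum_{s<t}\alpha_s h_s(x)$, so that $D^t_i \propto e^{-y_i f_{t-1}(x_i)}$,

\[
0 = \frac{\partial}{\partial\alpha}\,\frac{1}{m}\sum_i e^{-y_i\left(f_{t-1}(x_i)+\alpha h_t(x_i)\right)}
\;\propto\; e^{\alpha}\!\!\sum_{h_t(x_i)\neq y_i}\!\! D^t_i \;-\; e^{-\alpha}\!\!\sum_{h_t(x_i)= y_i}\!\! D^t_i
\;=\; e^{\alpha}\epsilon_t - e^{-\alpha}(1-\epsilon_t),
\]

so $e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t}$ and $\alpha = \frac{1}{2}\log\frac{1-\epsilon_t}{\epsilon_t}$, which is exactly $\alpha_t$.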
AdaBoost: Minimizing $L_S(h)$

Theorem: If $\epsilon_t \leq \frac{1}{2}-\gamma$ for all $t$, then $L_S^{01}(h_T) \leq L_S^{\exp}(h_T) \leq e^{-2\gamma^2 T}$.

Proof: Unrolling the update, $D^{T+1}_i = \frac{1}{m\,\prod_{t=1}^T Z_t}\, e^{-y_i\sum_{t=1}^T\alpha_t h_t(x_i)}$, so
$L_S^{\exp}(h_T) = \frac{1}{m}\sum_i e^{-y_i\sum_{t=1}^T\alpha_t h_t(x_i)} = \left(\prod_{t=1}^T Z_t\right)\sum_i D^{T+1}_i = \prod_{t=1}^T Z_t = \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} \leq \left((1-2\gamma)(1+2\gamma)\right)^{T/2} = (1-4\gamma^2)^{T/2} \leq e^{-2\gamma^2 T}$.

If $A(\cdot)$ is a weak learner with $(\epsilon_0 = \frac{1}{2}-\gamma,\ \delta_0)$, and the distribution is realizable, $L_D(h^\star)=0$ for some $h^\star\in H$: then $L_S(h^\star)=0$, so $L_{D^t}(h^\star)=0$ and each $D^t$ is realizable; hence w.p. $\geq 1-\delta_0$ in each round $\epsilon_t \leq \frac{1}{2}-\gamma$, and w.p. $\geq 1-\delta_0 T$, $L_S(h_T) \leq e^{-2\gamma^2 T}$.

To get any $\epsilon>0$, run AdaBoost for $T = \frac{\log(1/\epsilon)}{2\gamma^2}$ rounds.
Setting $\epsilon = \frac{1}{2m}$: after $T = \frac{\log(2m)}{2\gamma^2}$ rounds, $L_S(h_T) = 0$!  But what about $L_D(h_T)$?
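Two small steps in the proof, filled in for completeness: the per-round factor bound uses $1+x \leq e^x$, and the "$L_S = 0$" claim uses the fact that the empirical 0-1 error only takes values in multiples of $\frac{1}{m}$:

\[
Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)} \;\leq\; 2\sqrt{\left(\tfrac{1}{2}-\gamma\right)\left(\tfrac{1}{2}+\gamma\right)} = \sqrt{1-4\gamma^2} \;\leq\; e^{-2\gamma^2},
\qquad\text{so}\qquad \prod_{t=1}^T Z_t \leq e^{-2\gamma^2 T},
\]

and once $e^{-2\gamma^2 T} \leq \frac{1}{2m} < \frac{1}{m}$ (i.e. $T \geq \frac{\log(2m)}{2\gamma^2}$), the bound forces $L_S^{01}(h_T) = 0$.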
Sparse Linear Classifiers

Recall: $h_T(x) = \mathrm{sign}\!\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$.
Let $B$ = all hypotheses output by $A$ (base class, e.g. decision stumps). Then
$h_T \in \{\,\mathrm{sign}(\langle w,\phi(x)\rangle) \mid w\in\mathbb{R}^B,\ \|w\|_0 \leq T\,\} \subseteq$ the class of halfspaces $L_B$.
Sparse Linear Classifiers

Recall: $h_T(x) = \mathrm{sign}\!\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$.
Let $B$ = all hypotheses output by $A$ (base class, e.g. decision stumps). Then
$h_T \in \{\,\mathrm{sign}(\langle w,\phi(x)\rangle) \mid w\in\mathbb{R}^B,\ \|w\|_0 \leq T\,\}$, the class of sparse halfspaces $L_{B,T}$.

You already saw: $\mathrm{VCdim}(L_{B,T}) \leq O(T\log|B|)$ (a counting sketch follows below).
Even if $B$ is infinite (e.g. in the case of decision stumps): $\mathrm{VCdim}(L_{B,T}) \leq \tilde{O}(T\cdot\mathrm{VCdim}(B))$.
Sample complexity: $m = O\!\left(\frac{\log(m)\,\mathrm{VCdim}(B)}{\gamma^2\,\epsilon}\right) = \tilde{O}\!\left(\frac{\mathrm{VCdim}(B)}{\gamma^2\,\epsilon}\right)$.
But what if the weak learner is improper and $\mathrm{VCdim}(B) = \infty$?
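One way to see the finite-$|B|$ bound quoted above (a standard growth-function counting sketch, not from the slides): every behavior of a $T$-sparse combination on $m$ points is obtained by first choosing which $T$ base hypotheses are active and then choosing a halfspace over their outputs, so

\[
\Pi_{L_{B,T}}(m) \;\leq\; \binom{|B|}{T}\cdot \Pi_{\text{halfspaces in }\mathbb{R}^T}(m)
\;\leq\; |B|^T\left(\frac{em}{T+1}\right)^{T+1}.
\]

Requiring $2^m \leq \Pi_{L_{B,T}}(m)$ for a shattered set of size $m$ gives $m \leq O(T\log|B|)$. For infinite $B$ with $d = \mathrm{VCdim}(B)$, replacing $|B|^T$ by $\Pi_B(m)^T \leq (em/d)^{dT}$ gives $\mathrm{VCdim}(L_{B,T}) \leq \tilde{O}(T\cdot d)$.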
Compression Bounds

Focus on the realizable case, and on learning rules such that $L_S(A(S)) = 0$.

Suppose $A(S)$ depends only on the first $r < m$ examples, $A(x_1,y_1,\ldots,x_m,y_m) = \tilde{A}(x_1,y_1,\ldots,x_r,y_r)$. Then
$\forall^\delta_{S\sim D^m}$:  $L_{S[r+1:m]}(\tilde{A}(S[1{:}r])) = 0 \;\Rightarrow\; L_D(A(S)) \leq \frac{\log(1/\delta)}{m-r}$.

In fact, the same holds for any predetermined indices $i_1,\ldots,i_r$, if $A(S)$ depends only on $(x_{i_1},y_{i_1}),\ldots,(x_{i_r},y_{i_r})$.

Now consider $A(S) = \tilde{A}(S[I(S)])$ with $I: (X\times Y)^m \to \{1..m\}^r$. That is, $A(S)$ can be represented using $r$ training points, but we need to choose which ones. Taking a union bound over the $m^r$ choices of indices:
$L_D(A(S)) \leq \frac{r\log m + \log(1/\delta)}{m-r}$
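The first bound is the usual "fresh test set" argument, filled in here: conditioned on the first $r$ points, $h = \tilde{A}(S[1{:}r])$ is a fixed hypothesis and the remaining $m-r$ points are i.i.d. and independent of it, so

\[
\Pr\!\left[\,L_{S[r+1:m]}(h) = 0 \;\middle|\; L_D(h) > \epsilon\,\right] \;\leq\; (1-\epsilon)^{m-r} \;\leq\; e^{-\epsilon(m-r)},
\]

and setting $e^{-\epsilon(m-r)} = \delta$ gives $\epsilon = \frac{\log(1/\delta)}{m-r}$. The union bound over the $m^r$ possible index tuples multiplies the failure probability by $m^r$, which adds $r\log m$ to the numerator.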
Compression Schemes

$A(S)$ is $r$-compressing if $A(S) = \tilde{A}(S[I(S)])$ for some $I: (X\times Y)^m \to \{1..m\}^r$.

Axis-aligned rectangles: $I(S)$ = {leftmost positive, rightmost positive, topmost positive, bottommost positive}, so $r = 4$ (see the sketch below).
Halfspaces in $\mathbb{R}^d$: a bit trickier, but can be done with $r = d+1$ (for non-homogeneous halfspaces).

If $A$ is $r$-compressing and $L_S(A(S)) = 0$, then for $m > 2r$:  $\forall^\delta_{S\sim D^m}$  $L_D(A(S)) \leq 2\,\frac{r\log m + \log(1/\delta)}{m}$.

By the VC lower bound: FINDCONS$_H$ is $r$-compressing $\Rightarrow$ $\mathrm{VCdim}(H) \leq \tilde{O}(r)$; in fact, $\mathrm{VCdim}(H) \leq O(r)$.
Conjecture (Manfred Warmuth): every $H$ has a $\mathrm{VCdim}(H)$-compressing FINDCONS$_H$.
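A minimal sketch of the axis-aligned rectangle compression scheme in Python (the 2-d setting, data, and helper names `compress`/`reconstruct` are illustrative assumptions): the four extreme positive points already determine the tightest consistent rectangle, so $r=4$ labeled examples suffice to reconstruct a consistent predictor.

```python
import numpy as np

def compress(X, y):
    """I(S): indices of the leftmost, rightmost, bottommost and topmost positive points."""
    pos = np.where(y == 1)[0]
    return [pos[np.argmin(X[pos, 0])], pos[np.argmax(X[pos, 0])],
            pos[np.argmin(X[pos, 1])], pos[np.argmax(X[pos, 1])]]

def reconstruct(X_sub, y_sub):
    """A-tilde: tightest axis-aligned rectangle containing the compressed (positive) points."""
    lo, hi = X_sub.min(axis=0), X_sub.max(axis=0)
    return lambda Z: np.where(np.all((Z >= lo) & (Z <= hi), axis=1), 1, -1)

# usage: the reconstructed predictor is consistent with the full sample
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.where(np.all(np.abs(X) <= 0.5, axis=1), 1, -1)   # target rectangle [-0.5, 0.5]^2
idx = compress(X, y)
h = reconstruct(X[idx], y[idx])
print("consistent on S:", bool(np.all(h(X) == y)), "| compressed to", len(idx), "points")
```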
Back to Boosting

$A(S)$ is an $(\epsilon_0 = \frac{1}{2}-\gamma,\ \delta_0)$ weak learner that uses $m_0$ samples. Boost the confidence to get a $(\frac{1}{2}-\frac{\gamma}{2},\ \delta)$ learner that uses
$m_1(\delta) = O\!\left(m_0\,\frac{\log(1/\delta)}{\log(1/\delta_0)} + \frac{\log(1/\delta)+\log\log(1/\delta_0)}{\gamma^2}\right)$ samples.

Run AdaBoost on $m$ samples for $T = \frac{\log(2m)}{2\gamma^2}$ iterations, each time using $m' = m_1(\delta/(2T))$ samples for the weak learner, to get $L_S(h_T) = 0$, with $h_T = \sum_{t=1}^T \alpha_t h_t$.

Since each $h_t = A(\text{subsample of size } m')$, the tuple $(h_1,\ldots,h_T)$ has a compression scheme with $r = T\cdot m'$ points. But what about the $\alpha_t$???
Partial Compression

Instead of $r$ training points specifying $A(S)$ exactly, suppose they only specify a limited set of hypotheses in which $A(S)$ lies:
$I: (X\times Y)^m \to \{1..m\}^r$,  $F: (X\times Y)^r \to$ hypothesis classes, each with $\mathrm{VCdim} \leq D$,  and $A(S) \in F(S[I(S)])$.

Theorem: If $A(S)$ has a compression scheme as above and $L_S(A(S)) = 0$, then for $m \geq 2r + D$,  $\forall^\delta_{S\sim D^m}$:
$L_D(A(S)) \leq O\!\left(\frac{D + r\log m + \log(2/\delta)}{m}\right)$

Proof outline: take a union bound, over the choices of indices $I(S)$, of the VC-based uniform convergence bounds, each time using just the points outside $I(S)$.
Back to Boosting

$A(S)$ is an $(\epsilon_0 = \frac{1}{2}-\gamma,\ \delta_0)$ weak learner that uses $m_0$ samples. Boost the confidence to get a $(\frac{1}{2}-\frac{\gamma}{2},\ \delta)$ learner that uses $m_1(\delta) = O\!\left(m_0\,\frac{\log(1/\delta)}{\log(1/\delta_0)} + \frac{\log(1/\delta)+\log\log(1/\delta_0)}{\gamma^2}\right)$ samples.

Run AdaBoost on $m$ samples for $T = \frac{\log(2m)}{2\gamma^2}$ iterations, each time using $m' = m_1(\delta/(2T))$ samples for the weak learner, to get $L_S(h_T) = 0$ with $h_T = \sum_{t=1}^T \alpha_t h_t$. The $T\cdot m'$ compressed points specify the class $F(I(S)) = L_{\{h_1,\ldots,h_T\}}$ of linear combinations of $h_1,\ldots,h_T$, which contains $h_T$ and has VC dimension at most $T$.

Conclusion (for fixed $\epsilon_0, \delta_0$):
$L_D(h_T) \leq O\!\left(\frac{T + T m'\log m + \log(1/\delta)}{m}\right) = O\!\left(\frac{m_0\,\log^2 m\,\log(1/\delta)}{m}\right)$,
so $m(\epsilon,\delta) = O\!\left(\frac{m_0\,\log^2(1/\epsilon)\,\log(1/\delta)}{\epsilon}\right)$ (with constants depending on $\frac{1}{\gamma^2}$ and $\log(1/\delta_0)$).
AdaBoost In Practice

Complexity control is in terms of sparsity (#iterations) $T$:
- Realizable case (MDL): use the first $T$ such that $L_S(h_T) = 0$
- More realistically (SRM): use validation / cross-validation to select $T$ (see the sketch below)

[Figure: training and test error as a function of the number of boosting rounds, with snapshots at T = 5, 100, 1000]

Even after $L_S(h_T) = 0$, AdaBoost keeps improving the $\ell_1$ margin.
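A sketch of the SRM-style choice of $T$ on held-out data (illustrative assumptions: the stump weak learner, the synthetic noisy data, and all variable names): run AdaBoost once, accumulate the per-round scores on a validation set, and pick the prefix length with the lowest validation error.

```python
import numpy as np

def stump(X, y, D):
    """Assumed weak learner: best single-coordinate threshold under weights D."""
    best = (np.inf, None)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sgn in (1, -1):
                pred = sgn * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(D * (pred != y))
                if err < best[0]:
                    best = (err, (j, thr, sgn))
    j, thr, sgn = best[1]
    return lambda Z: sgn * np.where(Z[:, j] > thr, 1, -1)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = np.sign(X[:, 0] + X[:, 1])
y[rng.random(400) < 0.1] *= -1                       # 10% label noise
Xtr, ytr, Xva, yva = X[:200], y[:200], X[200:], y[200:]

D = np.full(200, 1 / 200)
score_tr, score_va = np.zeros(200), np.zeros(200)    # running sum_t alpha_t h_t(x)
val_err = []
for t in range(100):
    h = stump(Xtr, ytr, D)
    eps = np.sum(D * (h(Xtr) != ytr))
    alpha = 0.5 * np.log((1 - eps) / eps)
    D = D * np.exp(-alpha * ytr * h(Xtr)); D /= D.sum()
    score_tr += alpha * h(Xtr)
    score_va += alpha * h(Xva)
    val_err.append(np.mean(np.sign(score_va) != yva))

T_star = int(np.argmin(val_err)) + 1                 # validation (SRM-style) choice of T
print("chosen T:", T_star, "| validation error:", val_err[T_star - 1])
print("final training error:", np.mean(np.sign(score_tr) != ytr))
```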
Interpretations of AdaBoost

- Boosting weak learning to get arbitrarily small error
  - Theory is for the realizable case
  - Shows efficient weak and strong learning are equivalent
- Ensemble method for combining many simpler predictors
  - E.g. combining decision stumps or decision trees
  - Other ensemble methods: bagging, averaging, gating networks
- Method for learning using sparse linear predictors over a large (infinite?) dimensional feature space
  - Sparsity controls complexity
  - Number of iterations controls sparsity
- Coordinate-wise optimization of $L_S^{\exp}(w)$
  - We'll get back to this when we talk about real-valued loss
- Learning (in high dimensions) with large $\ell_1$ margin
  - Learning guarantee in terms of the $\ell_1$ margin
  - We'll get back to this when we talk about the $\ell_1$ margin
Just one more thing
Back to Hardness of Agnostic Learning

$H = \{\,x \mapsto \langle w,x\rangle > 0 \mid w\in\mathbb{R}^n\,\}$,  $H^k_n = \{\,h_1 \wedge h_2 \wedge \cdots \wedge h_k \mid h_i\in H\,\}$

Lemma: $\exists h\in H^k_n$ with $L_D(h) = 0 \;\Rightarrow\; \exists h\in H$ with $L_D(h) < \frac{1}{2} - \frac{1}{2k^2}$.

Want to show: $H$ is hard to learn agnostically. Use: $H^k_n$ is hard to learn even in the realizable case (under lattice-based crypto assumptions).

If $H$ were efficiently agnostically learnable, then by the lemma we would get an efficient weak learner for $H^k_n$ with $\gamma = \frac{1}{2k^2}$, and hence (by boosting) $H^k_n$ would be efficiently learnable in the realizable case, e.g. for $k(n) = n$.

Conclusion: subject to lattice-based crypto hardness, halfspaces are not efficiently agnostically learnable (even improperly).