COS 511: Theoretical Machine Learning
Lecturer: Rob Schapire        Lecture #10
Scribe: Max Goer              March 07

1 Rademacher Complexity Bounds

Recall the following theorem from last lecture:

Theorem 1. With probability at least 1 − δ, the following two inequalities hold for all h ∈ H:

    err(h) ≤ êrr(h) + R_m(H) + O(√(ln(1/δ)/m))
    err(h) ≤ êrr(h) + R̂_S(H) + O(√(ln(1/δ)/m))        (1)

1.1 Bound with respect to |H|

Theorem 2. For |H| < ∞:

    R̂_S(H) ≤ √( 2 ln|H| / m )

Proof. We will prove this as a corollary of another theorem later in the course.

This shows that we can bound the Rademacher complexity with respect to |H|, which was the first measure of complexity introduced in this course. Note that if we plug this into (1) we recover the error bound from previous lectures (up to constant factors).

1.2 Bound with respect to Π_H(S)

We would like to drop the assumption that |H| < ∞. Note that R̂_S(H) depends only on the points x_i ∈ S. Thus we only need to look at how the hypotheses h ∈ H behave on S.

Theorem 3.

    R̂_S(H) ≤ √( 2 ln|Π_H(S)| / m )

Proof. As noted above, we only need to consider behaviors of the hypotheses on S. Let

    H′ = {one representative from H for each behavior on S}.

Note that H′ ⊆ H and |H′| = |Π_H(S)| ≤ Π_H(m) ≤ 2^m < ∞. Then:

    R̂_S(H) = E_σ[ sup_{h ∈ H} (1/m) Σ_{i=1}^m σ_i h(x_i) ]
            = E_σ[ sup_{h ∈ H′} (1/m) Σ_{i=1}^m σ_i h(x_i) ]
            = R̂_S(H′)
            ≤ √( 2 ln|H′| / m )
            = √( 2 ln|Π_H(S)| / m )

The second equality follows from the fact that for the h ∈ H that attains the first supremum, there exists h′ ∈ H′ that attains the same value. This implies that the supremum over H is no greater than the supremum over H′. Furthermore, the supremum over H′ is no greater than the supremum over H, as H′ ⊆ H. These two observations imply that the suprema are equal. The inequality on the fourth line follows from Theorem 2. The last equality follows from the definition of the growth function and the construction of H′.

Hence we can bound the Rademacher complexity with respect to the growth function, the second measure of complexity introduced in this course.
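The original notes contain no code; as a supplementary illustration of the quantity bounded by Theorem 2, the following sketch (hypothetical, not from the lecture) represents a small finite hypothesis class by its ±1 prediction vectors on S, estimates R̂_S(H) by averaging over random sign vectors σ, and compares the estimate with the bound √(2 ln|H|/m). The class size, sample size, and all names are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)

    m = 200          # number of sample points in S
    num_h = 50       # |H|: size of the (finite) hypothesis class

    # Each hypothesis is represented by its vector of +/-1 predictions on S.
    H = rng.choice([-1, +1], size=(num_h, m))

    def empirical_rademacher(H, num_trials=2000, rng=rng):
        """Monte Carlo estimate of R_S(H) = E_sigma[ sup_h (1/m) sum_i sigma_i h(x_i) ]."""
        m = H.shape[1]
        total = 0.0
        for _ in range(num_trials):
            sigma = rng.choice([-1, +1], size=m)    # random signs sigma_i
            total += np.max(H @ sigma) / m          # sup over h of (1/m) sum_i sigma_i h(x_i)
        return total / num_trials

    estimate = empirical_rademacher(H)
    bound = np.sqrt(2 * np.log(num_h) / m)          # Theorem 2: sqrt(2 ln|H| / m)
    print(f"Monte Carlo estimate of R_S(H): {estimate:.3f}")
    print(f"Theorem 2 bound sqrt(2 ln|H|/m): {bound:.3f}")

For randomly chosen prediction vectors the estimate typically comes out somewhat below the bound, which is consistent with Theorem 2 being tight only up to the behavior of the class.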
1.3 Bound with respect to the VC Dimension

We can also bound the Rademacher complexity by the third complexity measure: the VC dimension.

Theorem 4. Let d = VCdim(H). Then for d ≥ 1:

    R̂_S(H) ≤ √( 2d ln(em/d) / m )

Proof. This follows immediately from Sauer's lemma (see lecture #6) and Theorem 3.
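Before moving on to boosting, here is a second illustrative sketch (again not part of the notes) that makes Theorems 3 and 4 concrete for an infinite class: one-dimensional threshold classifiers h_θ(x) = +1 if x ≥ θ, else −1. This class has growth function Π_H(m) = m + 1 and VC dimension 1, so even though H is infinite, only m + 1 behaviors on S matter. The sample and all names below are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)

    m = 200
    x = np.sort(rng.uniform(0.0, 1.0, size=m))     # sample points on the line

    # Enumerate the m+1 distinct behaviors of threshold classifiers on S:
    # behavior k labels x_k, ..., x_{m-1} as +1 and the rest as -1.
    behaviors = []
    for k in range(m + 1):
        h = -np.ones(m)
        h[k:] = +1.0
        behaviors.append(h)
    behaviors = np.array(behaviors)                # this is Pi_H(S), of size m+1

    def empirical_rademacher(B, num_trials=2000, rng=rng):
        """Monte Carlo estimate of R_S over the finite set of behaviors B."""
        m = B.shape[1]
        total = 0.0
        for _ in range(num_trials):
            sigma = rng.choice([-1, +1], size=m)
            total += np.max(B @ sigma) / m
        return total / num_trials

    estimate = empirical_rademacher(behaviors)
    growth_bound = np.sqrt(2 * np.log(m + 1) / m)           # Theorem 3 with |Pi_H(S)| = m+1
    d = 1                                                    # VCdim of thresholds
    vc_bound = np.sqrt(2 * d * np.log(np.e * m / d) / m)     # Theorem 4
    print(f"estimate {estimate:.3f}, growth bound {growth_bound:.3f}, VC bound {vc_bound:.3f}")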
2 Boosting

Boosting has its origins in PAC learning. Recall the definition of (strong) PAC learning:

Definition 1. We say that C is (strongly) PAC learnable if there exists an algorithm A such that for all c ∈ C, for all true distributions D, for all ε > 0, and for all δ > 0, A gets m = poly(1/ε, 1/δ) examples and finds a hypothesis h_A such that:

    Pr[err(h_A) ≤ ε] ≥ 1 − δ.

But what happens if we can't get the error arbitrarily close to 0? Is learning all or none? To answer these questions, we introduce the notion of weak PAC learning.

Definition 2. We say that C is weakly PAC learnable if there exists an algorithm A and there exists γ > 0 such that for all c ∈ C, for all true distributions D, and for all δ > 0, A gets m = poly(1/δ) examples and finds a hypothesis h_A such that:

    Pr[err(h_A) ≤ 1/2 − γ] ≥ 1 − δ.

Note the absence of ε and the presence of γ in the definition of weak PAC learning. In strong PAC learning, we need to be able to make the error arbitrarily small. In weak PAC learning we just require that the error can be brought down below some threshold. As γ > 0, we still require that we can do better than random guessing.

A natural question to ask is whether strong and weak PAC learning are equivalent. Moreover, if this is true, we would like to have an algorithm that converts a weak PAC learning algorithm into a strong PAC learning algorithm. We will see that boosting accomplishes this.

Definition 3. A boosting algorithm is an algorithm that converts a weak learning algorithm into a strong learning algorithm.

It is important to note that both strong and weak PAC learning are distribution-free. The following example will shed more light on the importance of this.

2.1 An example for learning with a fixed distribution

Let C be the set of all concepts over {0,1}^n ∪ {z}, where z ∉ {0,1}^n. Let D be the distribution that assigns mass 1/4 to the point z and has mass 3/4 uniformly distributed over {0,1}^n. That is:

    Pr_{x∼D}[x = k] = 1/4              if k = z
                    = 3/(4 · 2^n)      if k ∈ {0,1}^n

Consider the hypothesis that predicts c(z) if x = z and simply flips a coin otherwise. Eventually we will get a sample of z and thus learn c(z) (we can identify such an example because z ∉ {0,1}^n). This hypothesis will always correctly predict the point z and predict with 50% accuracy otherwise. Thus its error is:

    err_D(h_a) = (3/4)(1/2) = 3/8 < 1/2

Hence (if we drop the distribution-freeness from the definition) C is weakly PAC learnable for the fixed distribution D. However, VCdim(C) ≥ 2^n, hence by Theorem 1 of lecture #7, C is not strongly PAC learnable (again modifying the definition to a fixed distribution) using any algorithm. This is because we would need at least Ω(2^n) examples, which is not polynomial. Hence we cannot necessarily convert a weak learning algorithm into a strong one if we fix the distribution.
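The small simulation below (not from the notes) estimates the error of the coin-flipping hypothesis in the example above. The dimension n, the particular labeling of the cube, and the assumption that z has already been seen (so that c(z) is known) are all hypothetical modeling choices; the point is only that the estimated error comes out near 3/8.

    import numpy as np

    rng = np.random.default_rng(2)

    n = 8            # dimension of the cube {0,1}^n
    trials = 20000   # draws from D used to estimate the error

    # Target concept c: an arbitrary fixed labeling of {0,1}^n and of the special point z.
    c_z = +1
    cube_labels = {}
    def c(x):
        if x == "z":
            return c_z
        if x not in cube_labels:
            cube_labels[x] = int(rng.choice([-1, +1]))
        return cube_labels[x]

    # The hypothesis from the example: it has already seen z, so it predicts c(z) on z
    # and flips a fair coin on every other point.
    def h(x):
        if x == "z":
            return c_z
        return int(rng.choice([-1, +1]))

    errors = 0
    for _ in range(trials):
        # Draw x ~ D: mass 1/4 on z, mass 3/4 uniform over {0,1}^n.
        if rng.random() < 0.25:
            x = "z"
        else:
            x = tuple(int(b) for b in rng.integers(0, 2, size=n))
        errors += (h(x) != c(x))

    print(f"estimated err_D(h): {errors / trials:.3f}   (expected value: 3/8 = 0.375)")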
2.2 The setup

We are given:

- S = {(x_1, y_1), ..., (x_m, y_m)} drawn from the true distribution D, where x_i ∈ X and y_i ∈ {−1, +1};
- access to a weak learner A which, for any distribution D̃ (not necessarily the same as D), given examples drawn from D̃, computes h ∈ H (the hypothesis space of the weak learner) such that:

    Pr[err_D̃(h) ≤ 1/2 − γ] ≥ 1 − δ

The following diagram illustrates what the weak learner does. [diagram omitted]

Our goal is to find a final hypothesis H (we don't require that H ∈ H) such that:

    Pr[err_D(H) ≤ ε] ≥ 1 − δ

Note that we use the true distribution D, and not D̃, for the last probability.

2.3 The main idea

The main idea behind boosting is to run the weak learning algorithm several times and combine the hypotheses from each run. To do this effectively, we need to force the weak algorithm to learn by giving it a different D̃ on every run. The following diagram illustrates this: [diagram omitted]
2.4 The AdaBoost algorithm

We will now analyze the AdaBoost algorithm. The pseudocode is given below.

Algorithm 1 AdaBoost
procedure AdaBoost(S, T)
    D_1(i) = 1/m
    for t = 1, ..., T do
        construct D_t
        run A on D_t (sample from D_t)
        get h_t from A
        ε_t = err_{D_t}(h_t) = 1/2 − γ_t
        choose α_t > 0
        D_{t+1}(i) = (D_t(i) / Z_t) · e^{−α_t}   if h_t(x_i) = y_i
        D_{t+1}(i) = (D_t(i) / Z_t) · e^{α_t}    otherwise
    output H(x) = sign( Σ_{t=1}^T α_t h_t(x) )

In the above algorithm, D_t(i) is the weight on (x_i, y_i) under D_t. We can think of this as a distribution of weight or importance over the x_i ∈ S. Z_t is simply a normalizing factor that makes D_{t+1} a probability distribution. Note that the update places more weight on previously misclassified examples and less weight on previously correctly classified examples. α_t is unspecified in the code above; we will determine it later.

AdaBoost relies on the weak learning assumption:

    γ_t ≥ γ > 0
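As a complement to the pseudocode, here is a minimal NumPy sketch (not part of the original notes) of the same loop on a hypothetical toy problem. Two simplifications are my own: the weak learner reweights the training set directly instead of sampling from D_t, and α_t is already set to (1/2) ln((1 − ε_t)/ε_t), the value derived in Lemma 3 below. The data, the stump weak learner, and all names are illustrative.

    import numpy as np

    rng = np.random.default_rng(3)

    # Toy data: five +/-1 features, label = majority vote of the features.
    m, n_features, T = 200, 5, 20
    X = rng.choice([-1.0, +1.0], size=(m, n_features))
    y = np.sign(X.sum(axis=1))                    # the sum of an odd number of +/-1's is never 0

    def weak_learner(X, y, D):
        """Weak learner A: pick the single-coordinate stump (or its negation)
        with the smallest weighted error under the current distribution D."""
        best = None
        for feature in range(X.shape[1]):
            for sign in (-1.0, +1.0):
                pred = sign * X[:, feature]       # features are +/-1, so this is already a classifier
                eps = float(np.sum(D * (pred != y)))
                if best is None or eps < best[0]:
                    best = (eps, feature, sign)
        return best

    D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
    hypotheses, alphas = [], []
    for t in range(T):
        eps_t, feature, sign = weak_learner(X, y, D)
        eps_t = min(max(eps_t, 1e-10), 1 - 1e-10) # guard against a perfect (or useless) stump
        alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)
        pred = sign * X[:, feature]
        D = D * np.exp(-alpha_t * y * pred)       # more weight on mistakes, less on correct points
        D = D / D.sum()                           # divide by Z_t so D_{t+1} is a distribution
        hypotheses.append((feature, sign))
        alphas.append(alpha_t)

    F = sum(a * s * X[:, f] for a, (f, s) in zip(alphas, hypotheses))
    H = np.sign(F)
    print(f"training error of H: {float(np.mean(H != y)):.3f}")

Running the weak learner on the reweighted sample rather than on a fresh sample from D_t is the usual "boosting by reweighting" shortcut; it does not change the analysis that follows.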
2.5 Example

See slides on course website.

2.6 Bound on the empirical error of AdaBoost

Theorem 5. The final hypothesis H output by AdaBoost satisfies the following:

    êrr(H) ≤ Π_{t=1}^T 2√(ε_t(1 − ε_t))
           = exp( − Σ_{t=1}^T RE(1/2 ‖ ε_t) )
           = Π_{t=1}^T √(1 − 4γ_t²)
           ≤ exp( −2 Σ_{t=1}^T γ_t² )
           ≤ e^{−2γ²T}      (additionally, if the weak learning assumption holds)

Proof. The second line holds by definition. The third line holds as ε_t = 1/2 − γ_t. The fourth line follows from the fact that 1 + x ≤ e^x. The final line follows by the weak learning assumption. Thus it is sufficient to show that the first line holds, which follows from Lemmas 2 and 3 (stated later).

Lemma 1.

    D_{T+1}(i) = exp(−y_i F(x_i)) / ( m Π_{t=1}^T Z_t )

where F(x_i) = Σ_{t=1}^T α_t h_t(x_i).

Proof. Note that D_{t+1}(i) = D_t(i) e^{−α_t y_i h_t(x_i)} / Z_t by definition. We can now solve for D_{T+1} recursively:

    D_{T+1}(i) = D_1(i) · exp(−α_1 y_i h_1(x_i))/Z_1 · exp(−α_2 y_i h_2(x_i))/Z_2 ··· exp(−α_T y_i h_T(x_i))/Z_T
               = (1/m) · exp( −y_i Σ_{t=1}^T α_t h_t(x_i) ) / Π_{t=1}^T Z_t
               = exp(−y_i F(x_i)) / ( m Π_{t=1}^T Z_t )

Lemma 2.

    êrr(H) ≤ Π_{t=1}^T Z_t

Proof.

    êrr(H) = (1/m) Σ_{i=1}^m 1{H(x_i) ≠ y_i}
           = (1/m) Σ_{i=1}^m 1{y_i F(x_i) ≤ 0}
           ≤ (1/m) Σ_{i=1}^m e^{−y_i F(x_i)}
           = (1/m) Σ_{i=1}^m m D_{T+1}(i) Π_{t=1}^T Z_t
           = ( Π_{t=1}^T Z_t ) Σ_{i=1}^m D_{T+1}(i)
           = Π_{t=1}^T Z_t

Line 3 follows from the fact that e^{−y_i F(x_i)} > 0 if y_i F(x_i) > 0, and e^{−y_i F(x_i)} ≥ 1 if y_i F(x_i) ≤ 0. Line 4 follows from Lemma 1. The last line follows from the fact that D_{T+1} is a probability distribution.
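The proofs of Lemmas 1 and 2 use only the update rule, not the particular value of α_t, so they can be sanity-checked numerically with arbitrary positive α_t and arbitrary ±1 weak-hypothesis outputs. The quick check below (not from the notes; all values are hypothetical) iterates the update, compares D_{T+1} with the closed form of Lemma 1, and verifies that the training error of sign(F) is at most Π_t Z_t.

    import numpy as np

    rng = np.random.default_rng(4)

    m, T = 10, 6
    y = rng.choice([-1.0, +1.0], size=m)
    h = rng.choice([-1.0, +1.0], size=(T, m))     # h[t, i] = h_t(x_i), arbitrary +/-1 values
    alpha = rng.uniform(0.1, 1.0, size=T)         # arbitrary positive alpha_t

    # Iterate the AdaBoost update and record the normalizers Z_t.
    D = np.full(m, 1.0 / m)
    Z = []
    for t in range(T):
        unnormalized = D * np.exp(-alpha[t] * y * h[t])
        Z.append(unnormalized.sum())
        D = unnormalized / Z[-1]

    # Lemma 1: D_{T+1}(i) = exp(-y_i F(x_i)) / (m * prod_t Z_t), with F = sum_t alpha_t h_t.
    F = (alpha[:, None] * h).sum(axis=0)
    closed_form = np.exp(-y * F) / (m * np.prod(Z))
    print("Lemma 1 check, max |difference|:", float(np.max(np.abs(D - closed_form))))

    # Lemma 2: the training error of H = sign(F) is at most prod_t Z_t.
    train_err = float(np.mean(np.sign(F) != y))
    print(f"training error {train_err:.3f} <= prod Z_t = {float(np.prod(Z)):.3f}")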
Lemma 3.

    Z_t = 2√(ε_t(1 − ε_t))

Proof.

    Z_t = Σ_{i=1}^m D_t(i) e^{−y_i α_t h_t(x_i)}
        = Σ_{i: y_i ≠ h_t(x_i)} D_t(i) e^{α_t} + Σ_{i: y_i = h_t(x_i)} D_t(i) e^{−α_t}
        = ε_t e^{α_t} + (1 − ε_t) e^{−α_t}                                   (2)

The last equality follows because Σ_{i: y_i ≠ h_t(x_i)} D_t(i) = err_{D_t}(h_t) = ε_t.

We choose α_t so that the empirical error is minimized. By Lemma 2, this corresponds to minimizing Z_t. This yields:

    α_t = (1/2) ln( (1 − ε_t) / ε_t )

This is also the α_t we use in the algorithm. Plugging this into equation (2) gives the desired result.
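A quick numerical check (not part of the notes) of this last step: for a hypothetical ε_t = 0.3 it compares Z_t(α) on a grid with the minimizer α_t = (1/2) ln((1 − ε_t)/ε_t) from Lemma 3, and verifies that the minimum value is 2√(ε_t(1 − ε_t)).

    import numpy as np

    eps = 0.3                                     # a weak hypothesis with error eps_t = 0.3 (gamma_t = 0.2)

    def Z(alpha, eps=eps):
        """Z_t as a function of alpha_t, from equation (2)."""
        return eps * np.exp(alpha) + (1 - eps) * np.exp(-alpha)

    alpha_star = 0.5 * np.log((1 - eps) / eps)    # the minimizer derived in Lemma 3

    alphas = np.linspace(0.01, 2.0, 1000)
    grid_min = alphas[np.argmin(Z(alphas))]

    print(f"alpha* = {alpha_star:.4f}, grid minimizer ~ {grid_min:.4f}")
    print(f"Z(alpha*) = {Z(alpha_star):.4f}, 2*sqrt(eps*(1-eps)) = {2*np.sqrt(eps*(1-eps)):.4f}")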