Mehryar Mohri
Foundations of Machine Learning
Courant Institute of Mathematical Sciences
Homework assignment 3
October 31, 2016
Due: A. November 11, 2016; B. November 22, 2016

A. Boosting

1. Implement AdaBoost with boosting stumps and apply the algorithm to the spambase data set of HW2, with the same training and test sets. Plot the average cross-validation error plus or minus one standard deviation as a function of the number of rounds of boosting $T$, by selecting the value of this parameter out of $\{10, 10^2, \ldots, 10^k\}$ for a suitable value of $k$, as in HW2. Let $T$ be the best value found for the parameter. Plot the error on the training and test set as a function of the number of rounds of boosting for $t \in [1, T]$. Compare your results with those obtained using SVMs in HW2.

Solution:

[Figure 1: Cross-validation error shown with $\pm 1$ standard deviation (figure courtesy of Chaitanya Rudra).]

[Figure 2: Test and training error (figure courtesy of Chaitanya Rudra).]

In Figure 1 we see a typical plot of the average cross-validation error shown with $\pm 1$ standard deviation. Note that the error decreases exponentially and eventually levels off after roughly 400 iterations. In Figure 2 we see the training and test error for $T = 400$ rounds of boosting. The test error eventually levels off, while the training error continues to decrease towards zero. Overall, the AdaBoost algorithm performs only slightly better than the SVM algorithm, increasing accuracy by about 1%.
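
For concreteness, here is a minimal NumPy sketch of AdaBoost with decision stumps of the kind used above; it is not the code that produced the figures. Loading the spambase training/test split from HW2 is assumed to be done by the caller, labels are taken in $\{-1, +1\}$, the stump search is a naive scan over all thresholds, and the cross-validation loop over $T \in \{10, 10^2, \ldots\}$ is omitted.

# A minimal sketch of AdaBoost with decision stumps (not the official solution code).
# Assumes X is an (m, d) feature matrix and y is an array with values in {-1, +1};
# loading the HW2 spambase split and the cross-validation loop are left out.
import numpy as np

def fit_stump(X, y, D):
    """Return (feature j, threshold theta, polarity s, weighted error) minimizing the
    weighted error of the stump x -> s * sign(x[j] - theta) under the distribution D."""
    m, d = X.shape
    best = (0, 0.0, 1, np.inf)
    for j in range(d):
        for theta in np.unique(X[:, j]):
            pred = np.where(X[:, j] > theta, 1, -1)
            err = np.sum(D[pred != y])
            # polarity +1 has error err; flipping the stump gives error 1 - err
            for s, e in ((1, err), (-1, 1.0 - err)):
                if e < best[3]:
                    best = (j, theta, s, e)
    return best

def adaboost_stumps(X, y, T):
    """Run T rounds of AdaBoost; return the list of (alpha_t, stump_t)."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)              # uniform initial distribution D_1
    ensemble = []
    for _ in range(T):
        j, theta, s, eps = fit_stump(X, y, D)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)
        h = s * np.where(X[:, j] > theta, 1, -1)
        D = D * np.exp(-alpha * y * h)   # up-weight misclassified points
        D = D / D.sum()                  # normalize (divide by Z_t)
        ensemble.append((alpha, (j, theta, s)))
    return ensemble

def predict(ensemble, X):
    f = np.zeros(X.shape[0])
    for alpha, (j, theta, s) in ensemble:
        f += alpha * s * np.where(X[:, j] > theta, 1, -1)
    return np.sign(f)

A cross-validation wrapper would simply call adaboost_stumps on each training fold for each candidate value of $T$ and average the held-out errors, as in the plots above.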

2. Consider the following variant of the classification problem where, in addition to the positive and negative labels $+1$ and $-1$, points may be labeled with $0$. This can correspond to cases where the true label of a point is unknown, a situation that often arises in practice, or, more generally, to the fact that the learning algorithm incurs no loss for predicting $-1$ or $+1$ for such a point. Let $\mathcal{X}$ be the input space and let $\mathcal{Y} = \{-1, 0, +1\}$. As in standard binary classification, the loss of $f\colon \mathcal{X} \to \mathbb{R}$ on a pair $(x, y) \in \mathcal{X} \times \mathcal{Y}$ is defined by $1_{y f(x) < 0}$. Consider a sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathcal{X} \times \mathcal{Y})^m$ and a hypothesis set $H$ of base functions taking values in $\{-1, 0, +1\}$. For a base hypothesis $h_t \in H$ and a distribution $D_t$ over indices $i \in [1, m]$, define $\epsilon_t^s$ for $s \in \{-1, 0, +1\}$ by $\epsilon_t^s = \operatorname{E}_{i \sim D_t}\big[1_{y_i h_t(x_i) = s}\big]$.

(a) Derive a boosting-style algorithm for this setting in terms of the $\epsilon_t^s$, using the same objective function as that of AdaBoost. You should carefully justify the definition of the algorithm.

Solution: A suitable boosting-style algorithm is simply AdaBoost run with a possibly different step size $\alpha_t$.

Recall these definitions from the description of AdaBoost: the final hypothesis is $f(x) = \sum_t \alpha_t h_t(x)$ and the normalization constant in round $t$ is $Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$. We proved in class that
$$\frac{1}{m} \sum_i 1_{y_i f(x_i) < 0} \;\le\; \frac{1}{m} \sum_i \exp(-y_i f(x_i)) \;=\; \prod_t Z_t,$$
and that AdaBoost's step size can be derived by minimizing this objective in each round $t$. Taking that same approach, observe that
$$Z_t = \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i)) = \epsilon_t^0 + \epsilon_t^- \exp(\alpha_t) + \epsilon_t^+ \exp(-\alpha_t).$$
Differentiating the right-hand side with respect to $\alpha_t$ and setting it equal to zero gives $\epsilon_t^- \exp(\alpha_t) - \epsilon_t^+ \exp(-\alpha_t) = 0$, which shows that $Z_t$ is minimized by letting $\alpha_t = \frac{1}{2} \log\left(\frac{\epsilon_t^+}{\epsilon_t^-}\right)$.

(b) What is the weak-learning assumption in this setting?

Solution: One possible assumption is
$$\frac{\epsilon_t^+ - \epsilon_t^-}{\sqrt{1 - \epsilon_t^0}} \ge \gamma > 0.$$
Informally, this assumption says that the difference between the accuracy and the error of each weak hypothesis is non-negligible relative to the fraction of examples on which the hypothesis makes any prediction at all. In part (d) we will prove that this assumption suffices to drive the training error to zero.

(c) Write the full pseudocode of the algorithm.

Solution:

1. Given: training examples $((x_1, y_1), \ldots, (x_m, y_m))$.
2. Initialize $D_1$ to the uniform distribution on the training examples.
3. For $t = 1, \ldots, T$:
   a. $h_t \leftarrow$ base classifier in $H$.
   b. $\alpha_t \leftarrow \frac{1}{2} \log\left(\frac{\epsilon_t^+}{\epsilon_t^-}\right)$.
   c. For each $i = 1, \ldots, m$: $D_{t+1}(i) \leftarrow \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t \leftarrow \sum_i D_t(i) \exp(-\alpha_t y_i h_t(x_i))$ is the normalization constant.
4. $f \leftarrow \sum_{t=1}^T \alpha_t h_t$.
5. Return $\operatorname{sign}(f)$.
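
To make the pseudocode concrete, here is a minimal Python sketch of this variant, assuming each base classifier is a function with values in $\{-1, 0, +1\}$ and that the base-learner selection in step 3a is provided by a caller-supplied oracle get_base_classifier (a hypothetical name, not part of the assignment).

# A minimal sketch of the boosting variant from part (c), assuming each base learner
# is a function h(x) with values in {-1, 0, +1} and a hypothetical weak-learner
# oracle get_base_classifier(D, X, y) returns one such h in each round.
import numpy as np

def boost_with_abstentions(X, y, T, get_base_classifier, eps_floor=1e-12):
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                     # step 2: uniform D_1
    ensemble = []                               # list of (alpha_t, h_t)
    for _ in range(T):                          # step 3
        h = get_base_classifier(D, X, y)        # step 3a
        preds = np.array([h(x) for x in X])     # values in {-1, 0, +1}
        s = y * preds                           # y_i h_t(x_i) in {-1, 0, +1}
        eps_plus = max(D[s == +1].sum(), eps_floor)
        eps_minus = max(D[s == -1].sum(), eps_floor)
        alpha = 0.5 * np.log(eps_plus / eps_minus)   # step 3b
        D = D * np.exp(-alpha * s)              # step 3c: reweight ...
        D = D / D.sum()                         # ... and divide by Z_t
        ensemble.append((alpha, h))
    def f(x):                                   # steps 4-5: sign of the weighted sum
        return np.sign(sum(a * h(x) for a, h in ensemble))
    return f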

(d) Give an upper bound on the training error of the algorithm as a function of the number of rounds of boosting and the $\epsilon_t^s$.

Solution: Plugging the value of $\alpha_t$ from part (a) into $Z_t = \epsilon_t^0 + \epsilon_t^- \exp(\alpha_t) + \epsilon_t^+ \exp(-\alpha_t)$ yields $Z_t = \epsilon_t^0 + 2\sqrt{\epsilon_t^- \epsilon_t^+}$. Therefore
$$\frac{1}{m} \sum_i 1_{y_i f(x_i) < 0} \;\le\; \prod_t Z_t \;=\; \prod_t \Big(\epsilon_t^0 + 2\sqrt{\epsilon_t^- \epsilon_t^+}\Big).$$
Moreover, if the weak learning assumption from part (b) is satisfied, then
$$\epsilon_t^0 + 2\sqrt{\epsilon_t^- \epsilon_t^+} = \epsilon_t^0 + \sqrt{(1 - \epsilon_t^0)^2 - (\epsilon_t^+ - \epsilon_t^-)^2} = \epsilon_t^0 + (1 - \epsilon_t^0)\sqrt{1 - \frac{(\epsilon_t^+ - \epsilon_t^-)^2}{(1 - \epsilon_t^0)^2}} \le \sqrt{1 - \frac{(\epsilon_t^+ - \epsilon_t^-)^2}{1 - \epsilon_t^0}} \le \sqrt{1 - \gamma^2}.$$
The first equality follows from $(\epsilon_t^+ + \epsilon_t^-)^2 - (\epsilon_t^+ - \epsilon_t^-)^2 = 4 \epsilon_t^+ \epsilon_t^-$ (just multiply out and gather terms) and $\epsilon_t^+ + \epsilon_t^- = 1 - \epsilon_t^0$. The first inequality follows from the fact that the square root is concave on $[0, \infty)$, and thus $\lambda \sqrt{x} + (1 - \lambda)\sqrt{y} \le \sqrt{\lambda x + (1 - \lambda) y}$ for $\lambda \in [0, 1]$; here it is applied with $\lambda = \epsilon_t^0$, $x = 1$, and $y = 1 - (\epsilon_t^+ - \epsilon_t^-)^2 / (1 - \epsilon_t^0)^2$. The last inequality follows from the weak learning assumption. Therefore we have
$$\frac{1}{m} \sum_i 1_{y_i f(x_i) < 0} \le \big(1 - \gamma^2\big)^{T/2} \le \exp\left(-\frac{\gamma^2 T}{2}\right),$$
where we used $1 + x \le \exp(x)$.

B. On-line learning

The objective of this problem is to show how another regret minimization algorithm can be defined and studied. Let $L$ be a loss function convex in its first argument and taking values in $[0, M]$. We will adopt the notation used in the lectures and assume $N > e^2$. Additionally, for any expert $i \in [1, N]$, we denote by $r_{t,i}$ the instantaneous regret of that expert at time $t \in [1, T]$, $r_{t,i} = L(\hat{y}_t, y_t) - L(y_{t,i}, y_t)$, and by $R_{t,i}$ its cumulative regret up to time $t$: $R_{t,i} = \sum_{s=1}^t r_{s,i}$. For convenience, we also define $R_{0,i} = 0$ for all $i \in [1, N]$. For any $x \in \mathbb{R}$, $(x)_+$ denotes $\max(x, 0)$, that is, the positive part of $x$, and for $x = (x_1, \ldots, x_N) \in \mathbb{R}^N$, $(x)_+ = ((x_1)_+, \ldots, (x_N)_+)$.

Let $\alpha > 2$ and consider the algorithm that predicts at round $t \in [1, T]$ according to $\hat{y}_t = \frac{\sum_{i=1}^N w_{t,i}\, y_{t,i}}{\sum_{i=1}^N w_{t,i}}$, with the weight $w_{t,i}$ defined based on the $(\alpha - 1)$-th power of the regret up to time $t - 1$: $w_{t,i} = (R_{t-1,i})_+^{\alpha - 1}$. The potential function we use to analyze the algorithm is based on the function $\Phi$ defined over $\mathbb{R}^N$ by
$$\Phi\colon x \mapsto \|(x)_+\|_\alpha^2 = \Big[\sum_{i=1}^N (x_i)_+^\alpha\Big]^{2/\alpha}.$$
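
Before turning to the analysis, here is a minimal Python sketch of this forecaster, assuming the convex loss and the experts' predictions are supplied by the caller. At rounds where all cumulative regrets are non-positive (for example $t = 1$), all weights vanish; the problem statement leaves the prediction unspecified in that case, and the sketch falls back to uniform weights as one possible convention.

# A minimal sketch of the forecaster defined above: weights are the (alpha - 1)-th
# power of the positive part of the cumulative regrets. The convex loss loss(y_hat, y)
# and the expert predictions are assumed to be provided by the caller.
import numpy as np

class PolyWeightedForecaster:
    def __init__(self, n_experts, alpha):
        assert alpha > 2
        self.alpha = alpha
        self.R = np.zeros(n_experts)        # cumulative regrets R_{t-1,i}, with R_0 = 0

    def predict(self, expert_preds):
        """expert_preds: array of the y_{t,i}. Returns the weighted average y_hat_t."""
        w = np.maximum(self.R, 0.0) ** (self.alpha - 1)   # w_{t,i} = (R_{t-1,i})_+^(alpha-1)
        if w.sum() == 0.0:                  # all regrets non-positive: unspecified case,
            w = np.ones_like(w)             # fall back to the uniform convex combination
        return np.dot(w, expert_preds) / w.sum()

    def update(self, expert_preds, y, loss):
        """Predict, observe the outcome y, and update the cumulative regrets."""
        y_hat = self.predict(expert_preds)
        r = loss(y_hat, y) - np.array([loss(p, y) for p in expert_preds])
        self.R += r                         # R_{t,i} = R_{t-1,i} + r_{t,i}
        return y_hat

For instance, with the squared loss $L(\hat{y}, y) = (\hat{y} - y)^2$ one would call update(expert_preds, y, lambda p, y: (p - y) ** 2) at each round.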

1. Show that $\Phi$ is twice differentiable over $\mathbb{R}^N \setminus B$, where $B$ is defined as follows: $B = \{u \in \mathbb{R}^N \colon (u)_+ = 0\}$.

Solution: For $x \notin B$, we can write $\Phi(x) = \phi_2\big(\sum_{i=1}^N \phi_1(x_i)\big)$, where $x_1, \ldots, x_N$ denote the components of $x$ and $\phi_1 \colon \mathbb{R} \to \mathbb{R}$ and $\phi_2 \colon (0, +\infty) \to \mathbb{R}$ are the functions defined by
$$\forall u \in \mathbb{R}, \quad \phi_1(u) = (u)_+^\alpha \tag{1}$$
$$\forall u \in (0, +\infty), \quad \phi_2(u) = u^{2/\alpha}. \tag{2}$$
It is not hard to see that both $\phi_1$ and $\phi_2$ are twice (continuously) differentiable, which shows, by composition, the same property for $\Phi$ over $\mathbb{R}^N \setminus B$.

2. For any $t \in [1, T]$, let $r_t$ denote the vector of instantaneous regrets, $r_t = (r_{t,1}, \ldots, r_{t,N})$, and similarly $R_t = (R_{t,1}, \ldots, R_{t,N})$. We define the potential function as $\Phi(R_t) = \|(R_t)_+\|_\alpha^2$. Compute $\nabla \Phi(R_{t-1})$ for $R_{t-1} \notin B$ and show that $\nabla \Phi(R_{t-1}) \cdot r_t \le 0$ (hint: use the convexity of the loss with respect to the first argument).

Solution: For any $i \in [1, N]$,
$$\frac{\partial \Phi(R_{t-1})}{\partial R_{t-1,i}} = 2\, (R_{t-1,i})_+^{\alpha-1} \Big[\sum_{j=1}^N (R_{t-1,j})_+^\alpha\Big]^{\frac{2}{\alpha}-1} = 2\, w_{t,i}\, \|(R_{t-1})_+\|_\alpha^{2-\alpha}.$$
Thus, we can write
$$\operatorname{sign}\big(\nabla \Phi(R_{t-1}) \cdot r_t\big) = \operatorname{sign}\Big(\sum_{i=1}^N w_{t,i}\, r_{t,i}\Big) = \operatorname{sign}\Big(\sum_{i=1}^N w_{t,i}\, \big(L(\hat{y}_t, y_t) - L(y_{t,i}, y_t)\big)\Big) = \operatorname{sign}\Big(\sum_{i=1}^N \frac{w_{t,i}}{\sum_{j=1}^N w_{t,j}}\, \big(L(\hat{y}_t, y_t) - L(y_{t,i}, y_t)\big)\Big). \tag{3}$$

Now, by the convexity of $L$ in its first argument, the following holds:
$$\sum_{i=1}^N \frac{w_{t,i}}{\sum_{j=1}^N w_{t,j}}\, \big(L(\hat{y}_t, y_t) - L(y_{t,i}, y_t)\big) = L(\hat{y}_t, y_t) - \sum_{i=1}^N \frac{w_{t,i}}{\sum_{j=1}^N w_{t,j}}\, L(y_{t,i}, y_t) \le 0,$$
since $\hat{y}_t = \frac{\sum_{i=1}^N w_{t,i}\, y_{t,i}}{\sum_{i=1}^N w_{t,i}}$.

3. (Bonus question) Prove the inequality $r^\top [\nabla^2 \Phi(u)]\, r \le 2(\alpha - 1) \|r\|_\alpha^2$, valid for all $r \in \mathbb{R}^N$ and $u \in \mathbb{R}^N \setminus B$ (hint: write the Hessian $\nabla^2 \Phi(u)$ as the sum of a diagonal matrix and a positive semi-definite matrix multiplied by $(2 - \alpha)$. Also, use Hölder's inequality generalizing Cauchy-Schwarz: for any $p > 1$ and $q > 1$ with $\frac{1}{p} + \frac{1}{q} = 1$ and $u, v \in \mathbb{R}^N$, $u \cdot v \le \|u\|_p \|v\|_q$).

Solution: As already seen, for any $i \in [1, N]$, we have
$$\frac{\partial \Phi(u)}{\partial u_i} = 2\, (u_i)_+^{\alpha - 1}\, \|(u)_+\|_\alpha^{2 - \alpha}. \tag{4}$$
For $j \ne i$, we obtain:
$$\frac{\partial^2 \Phi(u)}{\partial u_j \partial u_i} = 2(2 - \alpha)\, (u_j)_+^{\alpha - 1} (u_i)_+^{\alpha - 1}\, \|(u)_+\|_\alpha^{2(1 - \alpha)}. \tag{5}$$
We also have
$$\frac{\partial^2 \Phi(u)}{\partial u_i^2} = 2(\alpha - 1)\, (u_i)_+^{\alpha - 2}\, \|(u)_+\|_\alpha^{2 - \alpha} + 2(2 - \alpha)\, (u_i)_+^{2(\alpha - 1)}\, \|(u)_+\|_\alpha^{2(1 - \alpha)}. \tag{6}$$
Consider the diagonal matrix $D = \operatorname{diag}\big((u_1)_+^{\alpha - 2}, \ldots, (u_N)_+^{\alpha - 2}\big)$ and the matrix $M$ defined by $M_{ij} = (u_j)_+^{\alpha - 1} (u_i)_+^{\alpha - 1}$. In view of the previous expressions, we can write
$$\nabla^2 \Phi(u) = 2(\alpha - 1)\, \|(u)_+\|_\alpha^{2 - \alpha}\, D + 2(2 - \alpha)\, \|(u)_+\|_\alpha^{2(1 - \alpha)}\, M. \tag{7}$$
$M$ is a positive semi-definite matrix since $M = v v^\top$ with $v = \big((u_1)_+^{\alpha - 1}, \ldots, (u_N)_+^{\alpha - 1}\big)^\top$. Thus, for any $r$, $r^\top M r \ge 0$. Since $2 - \alpha < 0$, we can write
$$r^\top \nabla^2 \Phi(u)\, r \le 2(\alpha - 1)\, \|(u)_+\|_\alpha^{2 - \alpha}\, r^\top D r = 2(\alpha - 1)\, \|(u)_+\|_\alpha^{2 - \alpha} \sum_{i=1}^N (u_i)_+^{\alpha - 2}\, r_i^2.$$
By Hölder's inequality (with $p = \frac{\alpha}{\alpha - 2}$ and $q = \frac{\alpha}{2}$), the following holds:
$$\sum_{i=1}^N (u_i)_+^{\alpha - 2}\, r_i^2 \le \Big[\sum_{i=1}^N \big((u_i)_+^{\alpha - 2}\big)^{\frac{\alpha}{\alpha - 2}}\Big]^{\frac{\alpha - 2}{\alpha}} \Big[\sum_{i=1}^N \big(r_i^2\big)^{\frac{\alpha}{2}}\Big]^{\frac{2}{\alpha}} = \|(u)_+\|_\alpha^{\alpha - 2}\, \|r\|_\alpha^2.$$
This implies that $r^\top \nabla^2 \Phi(u)\, r \le 2(\alpha - 1)\, \|r\|_\alpha^2$.
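
As an informal sanity check of this bonus inequality (not part of the proof), one can build the Hessian from equation (7) numerically and compare the quadratic form against $2(\alpha - 1)\|r\|_\alpha^2$ on random inputs outside $B$:

# Numerical sanity check of r^T [Hessian Phi(u)] r <= 2(alpha-1) * ||r||_alpha^2,
# using the closed-form Hessian from equation (7). Illustrative only.
import numpy as np

def hessian_phi(u, alpha):
    up = np.maximum(u, 0.0)                      # (u)_+
    norm = np.sum(up ** alpha) ** (1.0 / alpha)  # ||(u)_+||_alpha, assumed > 0
    D = np.diag(up ** (alpha - 2))
    v = up ** (alpha - 1)
    M = np.outer(v, v)
    return (2 * (alpha - 1) * norm ** (2 - alpha) * D
            + 2 * (2 - alpha) * norm ** (2 * (1 - alpha)) * M)

rng = np.random.default_rng(0)
alpha, N = 4.0, 6
for _ in range(1000):
    u = rng.normal(size=N)
    if np.all(u <= 0):                           # stay outside the set B
        continue
    r = rng.normal(size=N)
    lhs = r @ hessian_phi(u, alpha) @ r
    rhs = 2 * (alpha - 1) * np.sum(np.abs(r) ** alpha) ** (2.0 / alpha)
    assert lhs <= rhs + 1e-8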

4. Using the answers to the two previous questions and Taylor's formula, show that for all $t \ge 1$,
$$\Phi(R_t) - \Phi(R_{t-1}) \le (\alpha - 1)\, \|r_t\|_\alpha^2,$$
if $\gamma R_{t-1} + (1 - \gamma) R_t \notin B$ for all $\gamma \in [0, 1]$.

Solution: By assumption, the segment $[R_{t-1}, R_t]$ does not meet $B$, and thus $\Phi$ is twice differentiable over the open segment $(R_{t-1}, R_t)$. Thus, by Taylor's formula, there exists $u \in (R_{t-1}, R_t)$ such that
$$\Phi(R_t) - \Phi(R_{t-1}) = \nabla \Phi(R_{t-1}) \cdot r_t + \tfrac{1}{2}\, r_t^\top \nabla^2 \Phi(u)\, r_t \le \tfrac{1}{2}\, r_t^\top \nabla^2 \Phi(u)\, r_t \le (\alpha - 1)\, \|r_t\|_\alpha^2,$$
where we used the inequalities obtained in the two previous questions.

5. Suppose there exists $\gamma \in [0, 1]$ such that $(1 - \gamma) R_{t-1} + \gamma R_t \in B$. Show that $\Phi(R_t) \le (\alpha - 1)\, \|r_t\|_\alpha^2$.

Solution: For that $\gamma$, by definition, we have $(R_{t-1} + \gamma r_t)_+ = 0$, since $(1 - \gamma) R_{t-1} + \gamma R_t = R_{t-1} + \gamma r_t$. Observe that for any two scalars $c$ and $d$, $(c + d)_+ \le (c)_+ + (d)_+$, and therefore, componentwise,
$$(R_{t-1} + r_t)_+ \le (R_{t-1} + \gamma r_t)_+ + (1 - \gamma)(r_t)_+ = (1 - \gamma)(r_t)_+ \le (r_t)_+. \tag{8}$$
Since $u \mapsto \|(u)_+\|_\alpha^2$ is a non-decreasing function of each component $u_i$, this yields $\Phi(R_t) = \Phi(R_{t-1} + r_t) \le \|(r_t)_+\|_\alpha^2$. Since $\alpha > 2$, this implies $\Phi(R_t) \le (\alpha - 1)\, \|(r_t)_+\|_\alpha^2 \le (\alpha - 1)\, \|r_t\|_\alpha^2$.

6. Using the two previous questions, derive an upper bound on $\Phi(R_T)$ expressed in terms of $T$, $N$, and $M$.

Solution: In view of the two previous questions, the inequality $\Phi(R_t) - \Phi(R_{t-1}) \le (\alpha - 1)\, \|r_t\|_\alpha^2$ holds in all cases (in the setting of question 5 it follows since $\Phi(R_{t-1}) \ge 0$). Summing up these inequalities for all $t \in [1, T]$, we obtain
$$\Phi(R_T) = \Phi(R_T) - \Phi(R_0) = \sum_{t=1}^T \big[\Phi(R_t) - \Phi(R_{t-1})\big] \le (\alpha - 1) \sum_{t=1}^T \|r_t\|_\alpha^2.$$
Since $|r_{t,i}| \le M$ for all $t$ and $i$, this yields $\Phi(R_T) \le (\alpha - 1)\, N^{2/\alpha} M^2\, T$.

7. Show that $\Phi(R_T)$ admits as a lower bound the square of the regret $R_T$ of the algorithm.

Solution: By definition of the regret (here $R_T$ on the left-hand side denotes the scalar regret, while the argument of $\Phi$ is the vector of cumulative regrets),
$$R_T = \max_{i \in [1, N]} R_{T,i} \le \max_{i \in [1, N]} (R_{T,i})_+ \le \|(R_T)_+\|_\alpha = \sqrt{\Phi(R_T)}.$$
Note that this only implies $\Phi(R_T) \ge \big((R_T)_+\big)^2$, which is not exactly the statement given in the question. However, it suffices for proving upper bounds on the regret, which is the motivation behind our construction of $\Phi$.

8. Using the two previous questions, give an upper bound on the regret $R_T$. For what value of $\alpha$ is the bound the most favorable? Give a simple expression of the upper bound on the regret for a suitable approximation of that optimal value.

Solution: In view of the inequalities of the two previous questions, we have
$$R_T \le \sqrt{(\alpha - 1)\, N^{2/\alpha} M^2\, T}.$$
For $N > e^2$, the function $\alpha \mapsto (\alpha - 1) N^{2/\alpha}$ reaches its minimum over $(2, +\infty)$ at $\alpha = \log N + \sqrt{\log^2 N - 2 \log N}$, which is approximately $2 \log N$. Plugging in that value yields the bound
$$R_T \le M \sqrt{(2 \log N - 1)\, e\, T}.$$
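
For completeness, here is the short calculation behind the stated minimizer, which the solution leaves implicit; it is a routine first-order condition:
$$\log\big((\alpha - 1) N^{2/\alpha}\big) = \log(\alpha - 1) + \frac{2 \log N}{\alpha},
\qquad
\frac{1}{\alpha - 1} - \frac{2 \log N}{\alpha^2} = 0
\;\Longleftrightarrow\;
\alpha^2 - 2(\log N)\,\alpha + 2 \log N = 0
\;\Longleftrightarrow\;
\alpha = \log N \pm \sqrt{\log^2 N - 2 \log N},$$
where the roots are real since $N > e^2$, that is, $\log N > 2$. Only the larger root lies in $(2, +\infty)$; the derivative is negative just above $\alpha = 2$ and positive for large $\alpha$, so this critical point is indeed the minimum, and for large $N$ it is approximately $2 \log N$.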
