
Mehryar Mohri
Foundations of Machine Learning
Courant Institute of Mathematical Sciences
Homework assignment 3
November 16, 2017                              Due: Dec 01, 2017

A. Kernels

Show that the following kernels $K$ are PDS:

1. For all integers $n > 0$, $K(x, y) = \sum_{i=1}^N \cos^n(x_i^2 - y_i^2)$ over $\mathbb{R}^N \times \mathbb{R}^N$.

Solution: Since $\cos(x - y) = \cos x \cos y + \sin x \sin y$, $\cos(x - y)$ can be written as the dot product of the vectors
\[
\Phi(x) = \begin{bmatrix} \cos x \\ \sin x \end{bmatrix}
\quad \text{and} \quad
\Phi(y) = \begin{bmatrix} \cos y \\ \sin y \end{bmatrix};
\]
thus the kernel $(x, y) \mapsto \cos(x - y)$ is PDS. As a consequence, $(x, y) \mapsto \cos(x_i^2 - y_i^2)$ is also PDS, since for any coefficients $c_1, \ldots, c_m$,
\[
\sum_{i, j} c_i c_j \cos(x_i^2 - x_j^2) = \sum_{i, j} c_i c_j \cos(x_i' - x_j'),
\quad \text{with } x_i' = x_i^2 \text{ for all } i.
\]
Raising to the power $n$ preserves the PDS property (product of PDS kernels), and so does summing over the coordinates $i = 1, \ldots, N$; thus $K$ is PDS.

2. $K(x, y) = \min(x, y) - xy$ over $[0, 1] \times [0, 1]$.

Solution: Let $\langle \cdot, \cdot \rangle$ denote the inner product over the set of measurable functions on $[0, 1]$: $\langle f, g \rangle = \int_0^1 f(t) g(t)\, dt$. Note that
\[
k_1(x, y) = \int 1_{t \in [0, x]}\, 1_{t \in [0, y]}\, dt = \langle 1_{[0, x]}, 1_{[0, y]} \rangle
\quad \text{and} \quad
k_2(x, y) = \int 1_{t \in [x, 1]}\, 1_{t \in [y, 1]}\, dt = \langle 1_{[x, 1]}, 1_{[y, 1]} \rangle.
\]
Since they are defined as inner products, both $k_1$ and $k_2$ are PDS. Note that $k_1(x, y) = \min(x, y)$ and $k_2(x, y) = 1 - \max(x, y)$. Observe that $K = k_1 k_2$, since $\min(x, y)\max(x, y) = xy$; thus $K$ is PDS as a product of two PDS kernels.
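As a quick numerical sanity check (not a substitute for the proofs above), one can draw random points and confirm that the Gram matrices of these two kernels have no negative eigenvalues. A minimal sketch assuming NumPy; the kernel expressions follow the problem statement, with $n = 3$ chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)

def gram(kernel, X):
    """Gram matrix K[i, j] = kernel(X[i], X[j])."""
    return np.array([[kernel(x, y) for y in X] for x in X])

# Kernel 1: K(x, y) = sum_i cos^n(x_i^2 - y_i^2), here with n = 3.
k_cos = lambda x, y, n=3: np.sum(np.cos(x**2 - y**2) ** n)
X1 = rng.normal(size=(20, 5))                 # 20 random points in R^5

# Kernel 2: K(x, y) = min(x, y) - x * y on [0, 1].
k_min = lambda x, y: min(x, y) - x * y
X2 = rng.uniform(size=30)                     # 30 random points in [0, 1]

for name, K in [("cos^n", gram(k_cos, X1)), ("min - xy", gram(k_min, X2))]:
    lam_min = np.linalg.eigvalsh(K).min()
    print(f"{name}: smallest Gram eigenvalue = {lam_min:.2e}")  # non-negative up to round-off
```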

3. For $\sigma > 0$, $K(x, y) = e^{-\|x - y\|/\sigma^2}$ over $\mathbb{R}^N \times \mathbb{R}^N$.

Solution: It suffices to show that $K$ is the normalized kernel associated to the kernel $K'$ defined for all $(x, y) \in \mathbb{R}^N \times \mathbb{R}^N$ by $K'(x, y) = e^{\phi(x, y)}$, where
\[
\phi(x, y) = \frac{1}{\sigma^2}\big[\|x\| + \|y\| - \|x - y\|\big],
\]
and to show that $K'$ is PDS. For the first part, observe that
\[
\frac{K'(x, y)}{\sqrt{K'(x, x)\, K'(y, y)}}
= e^{\phi(x, y) - \frac{1}{2}\phi(x, x) - \frac{1}{2}\phi(y, y)}
= e^{-\|x - y\|/\sigma^2}.
\]
To show that $K'$ is PDS, it suffices to show that $\phi$ is PDS, since composition with a power series with non-negative coefficients (here $\exp$) preserves the PDS property. Now, for any $c_1, \ldots, c_n \in \mathbb{R}$, let $c_0 = -\sum_{i=1}^n c_i$; then we can write
\[
\begin{aligned}
\sum_{i, j = 1}^n c_i c_j \phi(x_i, x_j)
&= \frac{1}{\sigma^2} \sum_{i, j = 1}^n c_i c_j \big[\|x_i\| + \|x_j\| - \|x_i - x_j\|\big] \\
&= \frac{1}{\sigma^2} \Big[-\sum_{i=1}^n c_0 c_i \|x_i\| - \sum_{j=1}^n c_0 c_j \|x_j\| - \sum_{i, j = 1}^n c_i c_j \|x_i - x_j\|\Big]
= -\frac{1}{\sigma^2} \sum_{i, j = 0}^n c_i c_j \|x_i - x_j\|,
\end{aligned}
\]
with $x_0 = 0$. Now, for any $z \geq 0$, the following equality holds:
\[
z^{1/2} = \frac{1}{2\,\Gamma(\tfrac{1}{2})} \int_0^{+\infty} \frac{1 - e^{-tz}}{t^{3/2}}\, dt.
\]
Thus,
\[
-\frac{1}{\sigma^2} \sum_{i, j = 0}^n c_i c_j \|x_i - x_j\|
= -\frac{1}{2\sigma^2 \Gamma(\tfrac{1}{2})} \int_0^{+\infty} \sum_{i, j = 0}^n c_i c_j \frac{1 - e^{-t\|x_i - x_j\|^2}}{t^{3/2}}\, dt
= \frac{1}{2\sigma^2 \Gamma(\tfrac{1}{2})} \int_0^{+\infty} \frac{\sum_{i, j = 0}^n c_i c_j e^{-t\|x_i - x_j\|^2}}{t^{3/2}}\, dt,
\]
where the last step uses $\sum_{i, j = 0}^n c_i c_j = \big(\sum_{i=0}^n c_i\big)^2 = 0$. Since a Gaussian kernel is PDS, the inequality $\sum_{i, j = 0}^n c_i c_j e^{-t\|x_i - x_j\|^2} \geq 0$ holds and the right-hand side is non-negative. Thus, the inequality
\[
-\frac{1}{\sigma^2} \sum_{i, j = 0}^n c_i c_j \|x_i - x_j\| \geq 0
\]
holds, which shows that $\phi$ is PDS.

Alternatively, one can also apply the theorem on page 43 of the lecture slides on kernel methods to reduce the problem to showing that the norm $G(x, y) = \|x - y\|$ is an NDS function. This can be shown through a direct application of the definition of NDS together with the representation of the norm given in the hint.
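The same kind of check applies to this kernel. The sketch below (NumPy, with $\sigma = 1$ and random Gaussian points chosen arbitrarily) verifies both the normalization identity $K(x, y) = K'(x, y)/\sqrt{K'(x, x) K'(y, y)}$ and the positive semi-definiteness of the Gram matrix of $K$ up to round-off:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
X = rng.normal(size=(25, 4))          # 25 random points in R^4

# phi(x, y) = (||x|| + ||y|| - ||x - y||) / sigma^2, and K'(x, y) = exp(phi(x, y)).
norms = np.linalg.norm(X, axis=1)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
phi = (norms[:, None] + norms[None, :] - dists) / sigma**2
K_prime = np.exp(phi)

# The normalized kernel should coincide with K(x, y) = exp(-||x - y|| / sigma^2).
d = np.sqrt(np.diag(K_prime))
K_normalized = K_prime / np.outer(d, d)
K_direct = np.exp(-dists / sigma**2)
print("max |normalized - direct| =", np.abs(K_normalized - K_direct).max())

# PDS check: smallest eigenvalue of the Gram matrix (non-negative up to round-off).
print("smallest eigenvalue =", np.linalg.eigvalsh(K_direct).min())
```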

B. Boosting

In class, we showed that AdaBoost can be viewed as coordinate descent applied to a convex upper bound on the empirical error. Here, we consider instead an algorithm seeking to minimize the empirical margin loss. For any $0 \leq \rho < 1$, using the same notation as in class, let
\[
\widehat{R}_\rho(f) = \frac{1}{m} \sum_{i=1}^m 1_{y_i f(x_i) \leq \rho}
\]
denote the empirical margin loss of a function $f$ of the form $f = \frac{\sum_{t=1}^T \alpha_t h_t}{\sum_{t=1}^T \alpha_t}$ for a labeled sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$.

1. Show that $\widehat{R}_\rho(f)$ can be upper bounded as follows:
\[
\widehat{R}_\rho(f) \leq \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{t=1}^T \alpha_t h_t(x_i) + \rho \sum_{t=1}^T \alpha_t\Big).
\]

Solution: The following shows how $\widehat{R}_\rho(f)$ can be upper-bounded:
\[
\widehat{R}_\rho(f)
= \frac{1}{m} \sum_{i=1}^m 1_{y_i f(x_i) \leq \rho}
= \frac{1}{m} \sum_{i=1}^m 1_{-y_i \sum_{t=1}^T \alpha_t h_t(x_i) + \rho \sum_{t=1}^T \alpha_t \geq 0}
\leq \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{t=1}^T \alpha_t h_t(x_i) + \rho \sum_{t=1}^T \alpha_t\Big).
\]

2. For any $\rho > 0$, let $G_\rho$ be the objective function defined for all $\alpha \geq 0$ by
\[
G_\rho(\alpha) = \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{j=1}^N \alpha_j h_j(x_i) + \rho \sum_{j=1}^N \alpha_j\Big),
\]
with $h_j \in H$ for all $j \in [1, N]$, with the notation used in class in the boosting lecture. Show that $G_\rho$ is convex and differentiable.

Solution: Since $\exp$ is convex and composition with an affine function of $\alpha$ preserves convexity, each term of the objective is a convex function, and $G_\rho$ is convex as a sum of convex functions. The differentiability follows directly from that of $\exp$.
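For concreteness, here is a minimal NumPy sketch of the objective $G_\rho$; storing the base-classifier predictions in a matrix H with H[i, j] = h_j(x_i) is our own convention, and the convexity shown above is checked empirically along a random segment:

```python
import numpy as np

def G_rho(alpha, H, y, rho):
    """(1/m) sum_i exp(-y_i sum_j alpha_j h_j(x_i) + rho sum_j alpha_j)."""
    margins = y * (H @ alpha)                  # y_i * sum_j alpha_j h_j(x_i)
    return np.mean(np.exp(-margins + rho * np.sum(alpha)))

# Empirical convexity check along a random segment (toy data).
rng = np.random.default_rng(0)
m, N, rho = 50, 10, 0.1
H = rng.choice([-1.0, 1.0], size=(m, N))       # H[i, j] = h_j(x_i)
y = rng.choice([-1.0, 1.0], size=m)
a, b = rng.uniform(size=N), rng.uniform(size=N)
for lam in np.linspace(0.0, 1.0, 5):
    mix = G_rho(lam * a + (1 - lam) * b, H, y, rho)
    chord = lam * G_rho(a, H, y, rho) + (1 - lam) * G_rho(b, H, y, rho)
    assert mix <= chord + 1e-8                 # convexity: value on the segment lies below the chord
```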

3. Derive a boosting-style algorithm $A_\rho$ by applying (maximum) coordinate descent to $G_\rho$. You should justify in detail the derivation of the algorithm, in particular the choice of the base classifier selected at each round and that of the step. Compare both to their counterparts in AdaBoost.

Solution: By definition of $G_\rho$, for any direction $e_t$ we can write
\[
G_\rho(\alpha_{t-1} + \eta e_t) = \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i) + \rho \sum_{s=1}^{t-1} \alpha_s\Big)\, e^{-y_i \eta h_t(x_i) + \rho \eta}.
\]
Let $D_1$ be the uniform distribution, that is $D_1(i) = \frac{1}{m}$ for all $i \in [1, m]$, and for any $t \in [2, T]$, define $D_t$ by
\[
D_t(i) = \frac{D_{t-1}(i) \exp(-y_i \alpha_{t-1} h_{t-1}(x_i))}{Z_{t-1}},
\quad \text{with} \quad
Z_{t-1} = \sum_{i=1}^m D_{t-1}(i) \exp(-y_i \alpha_{t-1} h_{t-1}(x_i)).
\]
Observe that
\[
D_t(i) = \frac{\exp\big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\big)}{m \prod_{s=1}^{t-1} Z_s}.
\]
Thus,
\[
\begin{aligned}
\frac{dG_\rho(\alpha_{t-1} + \eta e_t)}{d\eta}\bigg|_{\eta = 0}
&= \frac{1}{m} \sum_{i=1}^m [-y_i h_t(x_i) + \rho] \exp\Big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i) + \rho \sum_{s=1}^{t-1} \alpha_s\Big) \\
&= \sum_{i=1}^m [-y_i h_t(x_i) + \rho]\, D_t(i) \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1} \alpha_s} \\
&= \Big[-\sum_{i=1}^m y_i h_t(x_i) D_t(i) + \rho\Big] \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1} \alpha_s} \\
&= \big[-(1 - \epsilon_t) + \epsilon_t + \rho\big] \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1} \alpha_s}
= \big[2\epsilon_t - 1 + \rho\big] \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1} \alpha_s},
\end{aligned}
\]
where $\epsilon_t = \sum_{i=1}^m D_t(i)\, 1_{y_i h_t(x_i) < 0}$. The direction selected is the one minimizing $2\epsilon_t - 1 + \rho$, that is, the one minimizing $\epsilon_t$. Thus, the algorithm selects at each round the base classifier with the smallest weighted error, as in the case of AdaBoost.
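The closed-form derivative above can be compared against a finite-difference estimate. The sketch below uses toy data and arbitrary previous steps (all names are ours), builds $D_t$ and the $Z_s$ exactly as defined above, and evaluates both sides for a candidate base classifier:

```python
import numpy as np

rng = np.random.default_rng(2)
m, rho = 40, 0.05
y = rng.choice([-1.0, 1.0], size=m)
H = rng.choice([-1.0, 1.0], size=(m, 4))      # columns: h_1, h_2, h_3 already used; h_4 is the candidate h_t
alpha_prev = rng.uniform(0.1, 0.5, size=3)    # steps alpha_1, alpha_2, alpha_3 already taken (toy values)

def G(alphas, cols):
    margins = y * (H[:, cols] @ alphas)
    return np.mean(np.exp(-margins + rho * np.sum(alphas)))

# Build D_t and the normalizers Z_s exactly as in the derivation above.
D = np.full(m, 1.0 / m)
Zs = []
for s in range(3):
    w = D * np.exp(-alpha_prev[s] * y * H[:, s])
    Zs.append(w.sum())
    D = w / w.sum()

eps_t = D[y * H[:, 3] < 0].sum()              # weighted error of the candidate under D_t
analytic = (2 * eps_t - 1 + rho) * np.prod(Zs) * np.exp(rho * alpha_prev.sum())

eta = 1e-6                                    # one-sided finite difference at eta = 0
numeric = (G(np.append(alpha_prev, eta), [0, 1, 2, 3]) - G(alpha_prev, [0, 1, 2])) / eta
print(analytic, numeric)                      # the two values agree to roughly five decimal places
```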

To determine the step $\eta$ selected at round $t$, we solve the following equation:
\[
\begin{aligned}
& \frac{dG_\rho(\alpha_{t-1} + \eta e_t)}{d\eta} = 0 \\
\Longleftrightarrow\ & \sum_{i=1}^m D_t(i) \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{-y_i \eta h_t(x_i) + \rho \eta}\, [-y_i h_t(x_i) + \rho] = 0 \\
\Longleftrightarrow\ & \sum_{y_i h_t(x_i) > 0} D_t(i)\, e^{-\eta(1 - \rho)} (-1 + \rho) + \sum_{y_i h_t(x_i) < 0} D_t(i)\, e^{\eta(1 + \rho)} (1 + \rho) = 0 \\
\Longleftrightarrow\ & -(1 - \epsilon_t)\, e^{-\eta(1 - \rho)} (1 - \rho) + \epsilon_t\, e^{\eta(1 + \rho)} (1 + \rho) = 0 \\
\Longleftrightarrow\ & e^{2\eta} = \frac{(1 - \rho)(1 - \epsilon_t)}{(1 + \rho)\,\epsilon_t} \\
\Longleftrightarrow\ & \eta = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t} - \frac{1}{2} \log \frac{1 + \rho}{1 - \rho}.
\end{aligned}
\]
Thus, the algorithm differs from AdaBoost only by the choice of the step, which it chooses more conservatively: the step size is smaller than that of AdaBoost by the additive constant $\frac{1}{2} \log \frac{1 + \rho}{1 - \rho}$.

4. What is the equivalent of the weak learning assumption for $A_\rho$? (Hint: use non-negativity of the step value.)

Solution: For the coordinate descent algorithm to make progress at each round, the step size selected along the descent direction must be non-negative, that is
\[
\frac{(1 - \rho)(1 - \epsilon_t)}{(1 + \rho)\,\epsilon_t} > 1
\Longleftrightarrow (1 - \rho)(1 - \epsilon_t) > (1 + \rho)\,\epsilon_t
\Longleftrightarrow 1 - \rho > 2\epsilon_t
\Longleftrightarrow \epsilon_t < \frac{1 - \rho}{2}.
\]
Thus, the error of the base classifier chosen must be at least $\rho/2$ better than one half.
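The step and the weak-learning condition can be illustrated with a few lines of Python (a small sketch; the values of $\epsilon_t$ and $\rho$ are arbitrary): the $A_\rho$ step is AdaBoost's step shifted down by the constant above, and it is positive exactly when $\epsilon_t < (1 - \rho)/2$.

```python
import numpy as np

def step_adaboost(eps):
    return 0.5 * np.log((1 - eps) / eps)

def step_a_rho(eps, rho):
    # alpha_t = (1/2) log((1 - eps_t)/eps_t) - (1/2) log((1 + rho)/(1 - rho))
    return step_adaboost(eps) - 0.5 * np.log((1 + rho) / (1 - rho))

rho = 0.2
for eps in [0.05, 0.15, 0.25, 0.35, 0.45, 0.55]:
    # The step is positive exactly when eps < (1 - rho)/2, i.e. under the weak-learning condition.
    assert (step_a_rho(eps, rho) > 0) == (eps < (1 - rho) / 2)
print("A_0 step equals AdaBoost's step:", np.isclose(step_a_rho(0.3, 0.0), step_adaboost(0.3)))
```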

5. Give the full pseudocode of the algorithm $A_\rho$. What can you say about the $A_0$ algorithm?

Solution: The normalization factor $Z_t$ can be expressed in terms of $\epsilon_t$ and $\rho$ using its definition:
\[
\begin{aligned}
Z_t &= \sum_{i=1}^m D_t(i) \exp(-y_i \alpha_t h_t(x_i))
= e^{-\alpha_t}(1 - \epsilon_t) + e^{\alpha_t} \epsilon_t \\
&= \sqrt{\frac{1 + \rho}{1 - \rho}}\,\sqrt{\epsilon_t(1 - \epsilon_t)} + \sqrt{\frac{1 - \rho}{1 + \rho}}\,\sqrt{\epsilon_t(1 - \epsilon_t)}
= \frac{2\sqrt{\epsilon_t(1 - \epsilon_t)}}{\sqrt{1 - \rho^2}}
= 2\Big[\frac{\epsilon_t(1 - \epsilon_t)}{1 - \rho^2}\Big]^{\frac{1}{2}}.
\end{aligned}
\]
The pseudocode of the algorithm is then given in Figure 1. $A_0$ coincides exactly with AdaBoost.

    A_rho(S = ((x_1, y_1), ..., (x_m, y_m)))
     1  for i <- 1 to m do
     2      D_1(i) <- 1/m
     3  for t <- 1 to T do
     4      h_t <- base classifier in H with small error eps_t = Pr_{i ~ D_t}[h_t(x_i) != y_i]
     5      alpha_t <- (1/2) log((1 - eps_t)/eps_t) - (1/2) log((1 + rho)/(1 - rho))
     6      Z_t <- 2 [eps_t (1 - eps_t) / (1 - rho^2)]^(1/2)        (normalization factor)
     7      for i <- 1 to m do
     8          D_{t+1}(i) <- D_t(i) exp(-alpha_t y_i h_t(x_i)) / Z_t
     9  f <- sum_{t=1}^T alpha_t h_t
    10  return h = sgn(f)

Figure 1: Algorithm $A_\rho$ for $H \subseteq \{-1, +1\}^X$.
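A short Python rendering of Figure 1 is sketched below (NumPy only, threshold stumps as base classifiers; helper names such as `fit_a_rho` are ours); it follows the pseudocode line by line and reduces to AdaBoost when $\rho = 0$.

```python
import numpy as np

def fit_a_rho(X, y, T=100, rho=0.0):
    """Sketch of algorithm A_rho (Figure 1) with decision stumps; rho = 0 gives AdaBoost."""
    m, d = X.shape
    # Candidate stumps h(x) = sign * (+1 if x[j] <= thr else -1).
    stumps = [(j, thr, s) for j in range(d)
                          for thr in np.unique(X[:, j])
                          for s in (1.0, -1.0)]

    def h(stump, Xq):
        j, thr, s = stump
        return s * np.where(Xq[:, j] <= thr, 1.0, -1.0)

    preds = np.column_stack([h(st, X) for st in stumps])       # m x (number of stumps)
    D = np.full(m, 1.0 / m)                                     # D_1(i) = 1/m
    ensemble = []                                               # list of (alpha_t, stump_t)
    for _ in range(T):
        errors = D @ (preds * y[:, None] < 0).astype(float)    # weighted error of every stump
        t = int(np.argmin(errors))                              # smallest weighted error, as in AdaBoost
        eps = errors[t]
        if not 0.0 < eps < (1.0 - rho) / 2.0:                   # weak-learning condition (also guards eps = 0)
            break
        alpha = 0.5 * np.log((1 - eps) / eps) - 0.5 * np.log((1 + rho) / (1 - rho))
        Z = 2.0 * np.sqrt(eps * (1 - eps) / (1 - rho**2))       # normalization factor of Figure 1
        D = D * np.exp(-alpha * y * preds[:, t]) / Z
        D /= D.sum()                                            # guard against round-off drift
        ensemble.append((alpha, stumps[t]))

    def predict(Xq):                                            # h = sgn(f) with f = sum_t alpha_t h_t
        return np.sign(sum(a * h(st, Xq) for a, st in ensemble))
    return predict
```

With $\rho = 0$, the step and the normalization factor reduce to AdaBoost's $\frac{1}{2}\log\frac{1 - \epsilon_t}{\epsilon_t}$ and $2\sqrt{\epsilon_t(1 - \epsilon_t)}$, consistent with the remark that $A_0$ coincides exactly with AdaBoost.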

6. Implement the $A_\rho$ algorithm. The algorithm admits two parameters: the number of rounds $T$ and $\rho$. Experiment with this algorithm to tackle the same classification problem as the one described in the second homework assignment, with the same choice of training and test sets and the same cross-validation set-up. Use a grid search to determine the best values of $\rho$ (e.g., $\rho \in \{2^{-10}, 2^{-9}, \ldots, 2^{-1}\}$) and $T \in \{100, 200, 500, 1000\}$, and report the test error obtained. Compare your result with the one obtained by running AdaBoost using the same set-up. Also, compare these results with those obtained in the second homework assignment using SVMs.

Solution: The following figures show the cross-validation results, within one standard deviation, for various values of $\rho$ and numbers of boosting rounds; note that $\rho = 0$ corresponds to AdaBoost. [Cross-validation error plots omitted.] The lowest cross-validation error occurs for margin $\rho = 2^{-10}$ and 1000 boosting rounds. The following table shows the training/test errors for $A_{\rho = 2^{-10}}$ and AdaBoost with $T = 1000$, and the SVM from the last homework:

    Algorithm     | A_{rho = 2^{-10}} | AdaBoost | SVM
    10-CV Error   |                   |          |
    Test Error    |                   |          |
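One possible shape for the cross-validation grid search is sketched below; it assumes the `fit_a_rho` sketch above and scikit-learn's `KFold`, and leaves the data loading of the second homework assignment aside:

```python
import numpy as np
from sklearn.model_selection import KFold

def cv_error(X, y, T, rho, n_splits=10, seed=0):
    """Average k-fold cross-validation error of A_rho, using the fit_a_rho sketch above."""
    errs = []
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        predict = fit_a_rho(X[train_idx], y[train_idx], T=T, rho=rho)
        errs.append(np.mean(predict(X[val_idx]) != y[val_idx]))
    return np.mean(errs), np.std(errs)

# Grid over rho in {0} U {2^-10, ..., 2^-1} and T in {100, 200, 500, 1000};
# X, y stand for the training data of the second homework assignment (not loaded here).
# grid = [(T, rho) for T in (100, 200, 500, 1000)
#                  for rho in [0.0] + [2.0**-k for k in range(10, 0, -1)]]
# best_T, best_rho = min(grid, key=lambda p: cv_error(X, y, T=p[0], rho=p[1])[0])
```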

7. For both the $A_\rho$ results and AdaBoost, plot the cumulative margins after 500 rounds of boosting, that is, plot the fraction of training points with margin less than or equal to $\theta$ as a function of $\theta \in [0, 1]$.

Solution: In the graph below, the x-axis is $\theta$ and the y-axis is the fraction of points with margin less than $\theta$. Different colors correspond to different $A_\rho$: the red line is $A_{\rho = 2^{-5}}$ and the blue line is AdaBoost ($A_0$). As expected, for $\rho > 0$, $A_\rho$ achieves a larger margin for more points compared to AdaBoost. The orange line ($A_{2^{-10}}$) is almost identical to AdaBoost. [Cumulative margin plot omitted.]
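A plotting sketch for such cumulative margin curves is given below (matplotlib). It assumes the `fit_a_rho` sketch above is modified to also expose its `ensemble` list and its stump predictor `h`; the argument `ensembles`, mapping each $\rho$ to the $(\alpha_t, h_t)$ list obtained after 500 rounds, is a hypothetical container of ours.

```python
import numpy as np
import matplotlib.pyplot as plt

def normalized_margins(ensemble, h, X, y):
    """y_i f(x_i) for f = (sum_t alpha_t h_t) / (sum_t alpha_t); ensemble = [(alpha_t, stump_t), ...]."""
    alphas = np.array([a for a, _ in ensemble])
    F = sum(a * h(st, X) for a, st in ensemble)
    return y * F / alphas.sum()

def plot_cumulative_margins(ensembles, h, X, y):
    """ensembles: hypothetical dict mapping rho to the (alpha_t, stump_t) list after 500 rounds."""
    thetas = np.linspace(0.0, 1.0, 200)
    colors = {0.0: "blue", 2.0**-5: "red", 2.0**-10: "orange"}   # colors as in the solution text
    for rho, ensemble in ensembles.items():
        margins = normalized_margins(ensemble, h, X, y)
        plt.plot(thetas, [(margins <= t).mean() for t in thetas],
                 color=colors.get(rho), label=f"rho = {rho:g}")
    plt.xlabel("theta")
    plt.ylabel("fraction of training points with margin <= theta")
    plt.legend()
    plt.show()
```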

8. Bound on $\widehat{R}_\rho(f)$.

(a) Prove the upper bound
\[
\widehat{R}_\rho(f) \leq \exp\Big(\rho \sum_{t=1}^T \alpha_t\Big) \prod_{t=1}^T Z_t,
\]
where the normalization factors $Z_t$ are defined as in the case of AdaBoost in class, with $\alpha_t$ the step chosen by $A_\rho$ at round $t$.

Solution: In view of the definition of $D_t$ and the bound derived in question 1, we can write
\[
\widehat{R}_\rho(f)
\leq \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{t=1}^T \alpha_t h_t(x_i) + \rho \sum_{t=1}^T \alpha_t\Big)
= \frac{1}{m} \sum_{i=1}^m m \Big[\prod_{t=1}^T Z_t\Big] D_{T+1}(i) \exp\Big(\rho \sum_{t=1}^T \alpha_t\Big)
= \exp\Big(\rho \sum_{t=1}^T \alpha_t\Big) \prod_{t=1}^T Z_t.
\]

(b) Give the expression of $Z_t$ as a function of $\rho$ and $\epsilon_t$, where $\epsilon_t$ is the weighted error of the hypothesis found by $A_\rho$ at round $t$ (defined in the same way as for AdaBoost in class). Use that to prove the following upper bound:
\[
\widehat{R}_\rho(f) \leq \exp\Big(-\sum_{t=1}^T D\Big(\frac{1 - \rho}{2}\,\Big\|\,\epsilon_t\Big)\Big),
\]
where $D(p \,\|\, q)$ denotes the binary relative entropy of $p$ and $q$:
\[
D(p \,\|\, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q},
\]
for any $p, q \in [0, 1]$.

Solution: Define $u$ by $u = \sqrt{\frac{1 - \rho}{1 + \rho}}$. The expression of $Z_t$ was already given above; it can be written $Z_t = \sqrt{\epsilon_t(1 - \epsilon_t)}\,(u + u^{-1})$. Plugging that expression into the bound of the previous question and using the expression of $\alpha_t$, namely $e^{\alpha_t} = u \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}$, gives
\[
\widehat{R}_\rho(f)
\leq \prod_{t=1}^T e^{\alpha_t \rho}\, \sqrt{\epsilon_t(1 - \epsilon_t)}\,(u + u^{-1})
= \prod_{t=1}^T u^\rho \Big(\frac{1 - \epsilon_t}{\epsilon_t}\Big)^{\frac{\rho}{2}} \sqrt{\epsilon_t(1 - \epsilon_t)}\,(u + u^{-1})
= \big(u^{1+\rho} + u^{-(1 - \rho)}\big)^T \prod_{t=1}^T \epsilon_t^{\frac{1 - \rho}{2}} (1 - \epsilon_t)^{\frac{1 + \rho}{2}}.
\]
Observe that
\[
u^{1+\rho} + u^{-(1 - \rho)}
= \Big(\frac{1 - \rho}{1 + \rho}\Big)^{\frac{1 + \rho}{2}} + \Big(\frac{1 + \rho}{1 - \rho}\Big)^{\frac{1 - \rho}{2}}
= \frac{(1 - \rho) + (1 + \rho)}{(1 + \rho)^{\frac{1 + \rho}{2}} (1 - \rho)^{\frac{1 - \rho}{2}}}
= \frac{2}{(1 + \rho)^{\frac{1 + \rho}{2}} (1 - \rho)^{\frac{1 - \rho}{2}}}.
\]
We also have
\[
\log\Big[\epsilon_t^{\frac{1 - \rho}{2}} (1 - \epsilon_t)^{\frac{1 + \rho}{2}}\Big]
= \frac{1 - \rho}{2} \log \epsilon_t + \frac{1 + \rho}{2} \log(1 - \epsilon_t)
= -D\Big(\frac{1 - \rho}{2}\,\Big\|\,\epsilon_t\Big) + \log \frac{(1 - \rho)^{\frac{1 - \rho}{2}} (1 + \rho)^{\frac{1 + \rho}{2}}}{2}.
\]
Combining these two identities gives
\[
\widehat{R}_\rho(f) \leq \exp\Big(-\sum_{t=1}^T D\Big(\frac{1 - \rho}{2}\,\Big\|\,\epsilon_t\Big)\Big).
\]

(c) Assume that for all $t \in [1, T]$, $\frac{1 - \rho}{2} - \epsilon_t > \gamma > 0$. Use the result of the previous question to show that $\widehat{R}_\rho(f) \leq \exp(-2\gamma^2 T)$. (Hint: you can use Pinsker's inequality: $D(p \,\|\, q) \geq 2(p - q)^2$ for all $p, q \in [0, 1]$.) Show that for $T > \frac{\log m}{2\gamma^2}$, all points have margin at least $\rho$.

Solution: By Pinsker's inequality, we have $D\big(\frac{1 - \rho}{2} \,\big\|\, \epsilon_t\big) \geq 2\big[\frac{1 - \rho}{2} - \epsilon_t\big]^2 \geq 2\gamma^2$. Thus, we can write $\widehat{R}_\rho(f) \leq \exp(-2\gamma^2 T)$. If this upper bound is less than $1/m$, then $\widehat{R}_\rho(f) = 0$ and every training point has margin at least $\rho$. The inequality $\exp(-2\gamma^2 T) < 1/m$ is equivalent to $T > \frac{\log m}{2\gamma^2}$.
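The per-round identity behind part (b), $e^{\rho \alpha_t} Z_t = \exp(-D(\frac{1 - \rho}{2} \,\|\, \epsilon_t))$, as well as Pinsker's inequality and the round threshold of part (c), can be spot-checked numerically; a small sketch with arbitrary toy values of $\epsilon_t$, $\rho$, and $m$:

```python
import numpy as np

def bin_rel_entropy(p, q):
    """D(p || q) = p log(p/q) + (1 - p) log((1 - p)/(1 - q))."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

rho, m = 0.1, 1000
for eps in [0.25, 0.3, 0.4]:
    # Part (b): exp(rho * alpha_t) * Z_t = exp(-D((1 - rho)/2 || eps_t)).
    alpha = 0.5 * np.log((1 - eps) / eps) - 0.5 * np.log((1 + rho) / (1 - rho))
    Z = 2.0 * np.sqrt(eps * (1 - eps) / (1 - rho**2))
    assert abs(np.exp(rho * alpha) * Z - np.exp(-bin_rel_entropy((1 - rho) / 2, eps))) < 1e-9

    # Part (c): Pinsker's inequality and the number of rounds forcing every margin above rho.
    gamma = (1 - rho) / 2 - eps
    assert bin_rel_entropy((1 - rho) / 2, eps) >= 2 * gamma**2
    T_min = np.log(m) / (2 * gamma**2)
    print(f"eps = {eps}: gamma = {gamma:.2f}, need T > {T_min:.1f} rounds for m = {m}")
```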
