Mehryar Mohri
Foundations of Machine Learning
Courant Institute of Mathematical Sciences

Homework assignment 3
November 16, 2017        Due: Dec 01, 2017

A. Kernels

Show that the following kernels K are PDS:

1. For all integers n > 0, K(x, y) = Σ_{i=1}^N cos^n(x_i² − y_i²) over R^N × R^N.

Solution: Since cos(x − y) = cos x cos y + sin x sin y, cos(x − y) can be written as the dot product of the vectors
$$\Phi(x) = \begin{bmatrix} \cos x \\ \sin x \end{bmatrix} \quad\text{and}\quad \Phi(y) = \begin{bmatrix} \cos y \\ \sin y \end{bmatrix};$$
thus the kernel K'(x, y) = cos(x − y) is PDS. As a consequence, each kernel (x, y) ↦ cos(x_i² − y_i²) is PDS as well, since for any coefficients c_1, …, c_m and points x_1, …, x_m in R,
$$\sum_{i,j} c_i c_j \cos(x_i^2 - x_j^2) = \sum_{i,j} c_i c_j \cos(x_i' - x_j') \geq 0,$$
with x_i' = x_i² for all i. Since products (here the n-fold product defining cos^n) and sums of PDS kernels are PDS, K is PDS.

2. K(x, y) = min(x, y) − xy over [0, 1] × [0, 1].

Solution: Let ⟨·, ·⟩ denote the inner product over the set of measurable functions on [0, 1]: ⟨f, g⟩ = ∫_0^1 f(t) g(t) dt. Note that
$$k_1(x, y) = \int_0^1 1_{t \in [0,x]}\, 1_{t \in [0,y]}\, dt = \langle 1_{[0,x]}, 1_{[0,y]} \rangle
\quad\text{and}\quad
k_2(x, y) = \int_0^1 1_{t \in [x,1]}\, 1_{t \in [y,1]}\, dt = \langle 1_{[x,1]}, 1_{[y,1]} \rangle.$$
Since they are defined as inner products, both k_1 and k_2 are PDS. Note that k_1(x, y) = min(x, y) and k_2(x, y) = 1 − max(x, y). Observe that K = k_1 k_2, since min(x, y)(1 − max(x, y)) = min(x, y) − min(x, y) max(x, y) = min(x, y) − xy; thus K is PDS as a product of two PDS kernels.

3. For any σ > 0, K(x, y) = e^{−‖x−y‖/σ²} over R^N × R^N.

Solution: It suffices to show that K is the normalized kernel associated to the kernel K' defined by
$$\forall (x, y) \in \mathbb{R}^N \times \mathbb{R}^N, \quad K'(x, y) = e^{\varphi(x, y)},$$
where φ(x, y) = (1/σ²)[‖x‖ + ‖y‖ − ‖x − y‖], and to show that K' is PDS. For the first part, observe that
$$\frac{K'(x, y)}{\sqrt{K'(x, x)\, K'(y, y)}} = e^{\varphi(x, y) - \frac{1}{2}\varphi(x, x) - \frac{1}{2}\varphi(y, y)} = e^{-\frac{\|x - y\|}{\sigma^2}}.$$
To show that K' is PDS, it suffices to show that φ is PDS, since composition with a power series with non-negative coefficients (here exp) preserves the PDS property. Now, for any c_1, …, c_n ∈ R, let c_0 = −Σ_{i=1}^n c_i; then we can write
$$\begin{aligned}
\sum_{i,j=1}^n c_i c_j\, \varphi(x_i, x_j)
&= \frac{1}{\sigma^2} \sum_{i,j=1}^n c_i c_j \big[\|x_i\| + \|x_j\| - \|x_i - x_j\|\big] \\
&= \frac{1}{\sigma^2} \Big[ -\sum_{i=1}^n c_0 c_i \|x_i\| - \sum_{j=1}^n c_0 c_j \|x_j\| - \sum_{i,j=1}^n c_i c_j \|x_i - x_j\| \Big] \\
&= -\frac{1}{\sigma^2} \sum_{i,j=0}^n c_i c_j \|x_i - x_j\|,
\end{aligned}$$
with x_0 = 0. Now, for any z ≥ 0, the following equality holds:
$$z = \frac{1}{2\Gamma(\frac{1}{2})} \int_0^{+\infty} \frac{1 - e^{-t z^2}}{t^{3/2}}\, dt.$$
Thus,
$$\begin{aligned}
-\frac{1}{\sigma^2} \sum_{i,j=0}^n c_i c_j \|x_i - x_j\|
&= -\frac{1}{2\Gamma(\frac{1}{2})\sigma^2} \int_0^{+\infty} \frac{\sum_{i,j=0}^n c_i c_j \big(1 - e^{-t\|x_i - x_j\|^2}\big)}{t^{3/2}}\, dt \\
&= \frac{1}{2\Gamma(\frac{1}{2})\sigma^2} \int_0^{+\infty} \frac{\sum_{i,j=0}^n c_i c_j\, e^{-t\|x_i - x_j\|^2}}{t^{3/2}}\, dt,
\end{aligned}$$
since Σ_{i,j=0}^n c_i c_j = (Σ_{i=0}^n c_i)² = 0 by definition of c_0. Since a Gaussian kernel is PDS, the inequality Σ_{i,j=0}^n c_i c_j e^{−t‖x_i−x_j‖²} ≥ 0 holds and the right-hand side is non-negative. Thus, the inequality
$$-\frac{1}{\sigma^2} \sum_{i,j=0}^n c_i c_j \|x_i - x_j\| \geq 0$$
holds, which shows that φ is PDS.

Alternatively, one can also apply the theorem on page 43 of the lecture slides on kernel methods to reduce the problem to showing that the norm G(x, y) = ‖x − y‖ is an NDS function. This can be shown through a direct application of the definition of NDS, together with the representation of the norm given in the hint.
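As an aside (not part of the required solution), the PDS property of each of the three kernels above can be sanity-checked numerically: a kernel is PDS if and only if every Gram matrix it induces is positive semidefinite. The Python sketch below, in which the choices n = 3, N = 2, σ = 1 and the random sample points are arbitrary, builds Gram matrices on random samples and checks that their smallest eigenvalues are non-negative up to floating-point error.

    import numpy as np

    def k1(x, y, n=3):
        # Kernel A.1: sum_i cos^n(x_i^2 - y_i^2)
        return np.sum(np.cos(x**2 - y**2) ** n)

    def k2(x, y):
        # Kernel A.2: min(x, y) - x*y on [0, 1]
        return min(x, y) - x * y

    def k3(x, y, sigma=1.0):
        # Kernel A.3: exp(-||x - y|| / sigma^2)  (norm, not squared norm)
        return np.exp(-np.linalg.norm(x - y) / sigma**2)

    def min_gram_eigenvalue(kernel, points):
        # Smallest eigenvalue of the Gram matrix induced by `kernel` on `points`.
        G = np.array([[kernel(x, y) for y in points] for x in points])
        return np.linalg.eigvalsh(G).min()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))    # sample points in R^2 for kernels A.1 and A.3
    U = rng.uniform(size=20)        # sample points in [0, 1] for kernel A.2

    # Each value should be >= 0 up to numerical error (e.g., > -1e-10).
    print(min_gram_eigenvalue(k1, X), min_gram_eigenvalue(k2, U), min_gram_eigenvalue(k3, X))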

B. Boosting

In class, we showed that AdaBoost can be viewed as coordinate descent applied to a convex upper bound on the empirical error. Here, we consider instead an algorithm seeking to minimize the empirical margin loss. For any 0 ≤ ρ < 1, using the same notation as in class, let
$$R_\rho(f) = \frac{1}{m} \sum_{i=1}^m 1_{y_i f(x_i) \leq \rho}$$
denote the empirical margin loss of a function f of the form f = (Σ_{t=1}^T α_t h_t)/(Σ_{t=1}^T α_t), for a labeled sample S = ((x_1, y_1), …, (x_m, y_m)).

1. Show that R_ρ(f) can be upper bounded as follows:
$$R_\rho(f) \leq \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{t=1}^T \alpha_t h_t(x_i) + \rho \sum_{t=1}^T \alpha_t\Big).$$

Solution: The following shows how R_ρ(f) can be upper-bounded:
$$\begin{aligned}
R_\rho(f)
&= \frac{1}{m} \sum_{i=1}^m 1_{y_i f(x_i) - \rho \leq 0} \\
&= \frac{1}{m} \sum_{i=1}^m 1_{-y_i \sum_{t=1}^T \alpha_t h_t(x_i) + \rho \sum_{t=1}^T \alpha_t \geq 0} \\
&\leq \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{t=1}^T \alpha_t h_t(x_i) + \rho \sum_{t=1}^T \alpha_t\Big),
\end{aligned}$$
where the last step uses the inequality 1_{u ≥ 0} ≤ e^u.

2. For any ρ > 0, let G_ρ be the objective function defined for all α ≥ 0 by
$$G_\rho(\alpha) = \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{j=1}^N \alpha_j h_j(x_i) + \rho \sum_{j=1}^N \alpha_j\Big),$$
with h_j ∈ H for all j ∈ [1, N], with the notation used in class in the boosting lecture. Show that G_ρ is convex and differentiable.

Solution: Since exp is convex and composition with an affine function of α preserves convexity, each term of the objective is a convex function, and G_ρ is convex as a sum of convex functions. Differentiability follows directly from that of exp.
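As a quick numerical illustration (not part of the required solution), the sketch below draws a toy ensemble with random ±1 base-classifier outputs and random non-negative weights and checks the inequality of question 1; note that the right-hand side of that inequality is exactly the objective G_ρ of question 2 evaluated at the selected base classifiers. The sizes m and T and the value ρ = 0.1 are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(1)
    m, T, rho = 50, 10, 0.1

    H = rng.choice([-1.0, 1.0], size=(T, m))   # H[t, i] plays the role of h_t(x_i)
    y = rng.choice([-1.0, 1.0], size=m)
    alpha = rng.uniform(0.0, 1.0, size=T)      # non-negative mixture weights

    scores = alpha @ H                         # sum_t alpha_t h_t(x_i)
    f = scores / alpha.sum()                   # normalized ensemble f = sum_t alpha_t h_t / sum_t alpha_t
    margin_loss = np.mean(y * f <= rho)                              # R_rho(f)
    G_rho_value = np.mean(np.exp(-y * scores + rho * alpha.sum()))   # upper bound of question 1

    print(margin_loss, G_rho_value)
    assert margin_loss <= G_rho_value          # the inequality of question 1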

3. Derive a boosting-style algorithm A_ρ by applying maximum coordinate descent to G_ρ. You should justify in detail the derivation of the algorithm, in particular the choice of the base classifier selected at each round and that of the step. Compare both to their counterparts in AdaBoost.

Solution: By definition of G_ρ, for any direction e_t and step η, we can write
$$G_\rho(\alpha_{t-1} + \eta e_t) = \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i) + \rho \sum_{s=1}^{t-1} \alpha_s\Big)\, e^{-y_i \eta h_t(x_i) + \rho \eta}.$$
Let D_1 be the uniform distribution, that is D_1(i) = 1/m for all i ∈ [1, m], and for any t ∈ [2, T] define D_t by
$$D_t(i) = \frac{D_{t-1}(i) \exp(-y_i \alpha_{t-1} h_{t-1}(x_i))}{Z_{t-1}},
\quad\text{with}\quad
Z_{t-1} = \sum_{i=1}^m D_{t-1}(i) \exp(-y_i \alpha_{t-1} h_{t-1}(x_i)).$$
Observe that
$$D_t(i) = \frac{\exp\big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\big)}{m \prod_{s=1}^{t-1} Z_s}.$$
Thus,
$$\begin{aligned}
\frac{dG_\rho(\alpha_{t-1} + \eta e_t)}{d\eta}\Big|_{\eta = 0}
&= \frac{1}{m} \sum_{i=1}^m \big[-y_i h_t(x_i) + \rho\big] \exp\Big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i) + \rho \sum_{s=1}^{t-1} \alpha_s\Big) \\
&= \sum_{i=1}^m \big[-y_i h_t(x_i) + \rho\big]\, D_t(i) \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1} \alpha_s} \\
&= \Big[-\sum_{i=1}^m y_i h_t(x_i) D_t(i) + \rho\Big] \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1} \alpha_s} \\
&= \big[-(1 - \epsilon_t) + \epsilon_t + \rho\big] \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1} \alpha_s} \\
&= \big[2\epsilon_t - 1 + \rho\big] \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1} \alpha_s},
\end{aligned}$$
where ε_t = Σ_{i=1}^m D_t(i) 1_{−y_i h_t(x_i) > 0}. The direction selected is the one minimizing 2ε_t − 1 + ρ, that is ε_t. Thus, the algorithm selects at each round the base classifier with the smallest weighted error, as in the case of AdaBoost.

To determine the step η selected at round t, we solve the equation dG_ρ(α_{t−1} + η e_t)/dη = 0:
$$\begin{aligned}
\frac{dG_\rho(\alpha_{t-1} + \eta e_t)}{d\eta} = 0
&\Leftrightarrow \sum_{i=1}^m D_t(i) \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1}\alpha_s}\, e^{-y_i \eta h_t(x_i) + \rho \eta}\, \big[-y_i h_t(x_i) + \rho\big] = 0 \\
&\Leftrightarrow \sum_{i:\, y_i h_t(x_i) > 0} D_t(i)\, e^{-\eta(1 - \rho)} (-1 + \rho) + \sum_{i:\, y_i h_t(x_i) < 0} D_t(i)\, e^{\eta(1 + \rho)} (1 + \rho) = 0 \\
&\Leftrightarrow -(1 - \epsilon_t)\, e^{-\eta(1 - \rho)} (1 - \rho) + \epsilon_t\, e^{\eta(1 + \rho)} (1 + \rho) = 0 \\
&\Leftrightarrow e^{2\eta} = \frac{1 - \rho}{1 + \rho} \cdot \frac{1 - \epsilon_t}{\epsilon_t} \\
&\Leftrightarrow \eta = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t} - \frac{1}{2} \log \frac{1 + \rho}{1 - \rho}.
\end{aligned}$$
Thus, the algorithm differs from AdaBoost only by the choice of the step, which it chooses more conservatively: the step size is smaller than that of AdaBoost by the additive constant ½ log[(1 + ρ)/(1 − ρ)].

4. What is the equivalent of the weak learning assumption for A_ρ? (Hint: use the non-negativity of the step value.)

Solution: For the coordinate descent algorithm to make progress at each round, the step size selected along the descent direction must be non-negative, that is
$$\frac{1 - \rho}{1 + \rho} \cdot \frac{1 - \epsilon_t}{\epsilon_t} > 1
\;\Leftrightarrow\; (1 - \rho)(1 - \epsilon_t) > \rho \epsilon_t + \epsilon_t = (1 + \rho)\epsilon_t
\;\Leftrightarrow\; \epsilon_t < \frac{1 - \rho}{2}.$$
Thus, the error of the base classifier chosen must be at least ρ/2 better than one half.

5. Give the full pseudocode of the algorithm A_ρ. What can you say about the A_0 algorithm?

Solution: The normalization factor Z_t can be expressed in terms of ε_t and ρ using its definition:
$$\begin{aligned}
Z_t &= \sum_{i=1}^m D_t(i) \exp(-y_i \alpha_t h_t(x_i)) = e^{-\alpha_t}(1 - \epsilon_t) + e^{\alpha_t}\epsilon_t \\
&= \sqrt{\frac{1 + \rho}{1 - \rho}} \sqrt{(1 - \epsilon_t)\epsilon_t} + \sqrt{\frac{1 - \rho}{1 + \rho}} \sqrt{(1 - \epsilon_t)\epsilon_t} \\
&= \sqrt{\epsilon_t(1 - \epsilon_t)} \Bigg[\sqrt{\frac{1 + \rho}{1 - \rho}} + \sqrt{\frac{1 - \rho}{1 + \rho}}\Bigg]
= \frac{2\sqrt{\epsilon_t(1 - \epsilon_t)}}{\sqrt{1 - \rho^2}}.
\end{aligned}$$
The pseudocode of the algorithm is then given in Figure 1. A_0 coincides exactly with AdaBoost.

A_ρ(S = ((x_1, y_1), …, (x_m, y_m)))
 1   for i ← 1 to m do
 2       D_1(i) ← 1/m
 3   for t ← 1 to T do
 4       h_t ← base classifier in H with small error ε_t = P_{i∼D_t}[h_t(x_i) ≠ y_i]
 5       α_t ← ½ log[(1 − ε_t)/ε_t] − ½ log[(1 + ρ)/(1 − ρ)]
 6       Z_t ← 2[ε_t(1 − ε_t)]^{1/2}/(1 − ρ²)^{1/2}        (normalization factor)
 7       for i ← 1 to m do
 8           D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i))/Z_t
 9   f ← Σ_{t=1}^T α_t h_t
10   return h = sgn(f)

Figure 1: Algorithm A_ρ, for H ⊆ {−1, +1}^X.
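One possible Python implementation of the pseudocode of Figure 1 is sketched below; it is an illustration under explicit assumptions, not the reference code for the course. In particular, the base hypothesis set H is taken to be decision stumps (the homework does not fix H here), and the stump search is a plain unoptimized scan over features and thresholds. Since Z_t is exactly the sum that renormalizes the weights, the update divides by the empirical sum, which is equivalent to dividing by Z_t; ρ = 0 recovers AdaBoost, as noted in question 5.

    import numpy as np

    def best_stump(X, y, D):
        # Return (feature j, threshold thr, sign s, weighted error) of the stump
        # h(x) = s * sign(x_j - thr) with the smallest weighted error under D.
        best = (0, 0.0, 1.0, np.inf)
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                pred = np.where(X[:, j] > thr, 1.0, -1.0)
                for s in (1.0, -1.0):
                    err = np.sum(D[s * pred != y])
                    if err < best[3]:
                        best = (j, thr, s, err)
        return best

    def stump_predict(X, j, thr, s):
        return s * np.where(X[:, j] > thr, 1.0, -1.0)

    def A_rho(X, y, T=100, rho=0.0):
        # Boosting algorithm A_rho of Figure 1; rho = 0 recovers AdaBoost.
        m = X.shape[0]
        D = np.full(m, 1.0 / m)                          # D_1(i) = 1/m
        stumps, alphas = [], []
        for t in range(T):
            j, thr, s, eps = best_stump(X, y, D)         # classifier with smallest weighted error
            eps = np.clip(eps, 1e-10, 1.0 - 1e-10)
            if eps >= (1.0 - rho) / 2.0:                 # weak-learning condition of question 4 violated
                break
            alpha = 0.5 * np.log((1 - eps) / eps) - 0.5 * np.log((1 + rho) / (1 - rho))
            D = D * np.exp(-alpha * y * stump_predict(X, j, thr, s))
            D /= D.sum()                                 # dividing by the sum equals dividing by Z_t
            stumps.append((j, thr, s))
            alphas.append(alpha)
        def f(X_new):                                    # unnormalized score sum_t alpha_t h_t(x)
            scores = np.zeros(X_new.shape[0])
            for (j, thr, s), a in zip(stumps, alphas):
                scores += a * stump_predict(X_new, j, thr, s)
            return scores
        return f, np.array(alphas)

    # Usage (hypothetical arrays): f, alphas = A_rho(X_train, y_train, T=500, rho=2**-5)
    #                              y_pred = np.sign(f(X_test))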

6. Implement the A_ρ algorithm. The algorithm admits two parameters: the number of rounds T and ρ. Experiment with this algorithm to tackle the same classification problem as the one described in the second homework assignment, with the same choice of training and test sets and the same cross-validation set-up. Use a grid search to determine the best values of ρ (e.g., ρ ∈ {2^{−10}, 2^{−9}, …, 2^{−1}}) and T ∈ {100, 200, 500, 1000}, and report the test error obtained. Compare your result with the one obtained by running AdaBoost using the same set-up. Also, compare these results with those obtained in the second homework assignment using SVMs.

Solution: [Figures omitted: cross-validation error, shown within one standard deviation, for various values of ρ and numbers of boosting rounds; ρ = 0 corresponds to AdaBoost.] The lowest cross-validation error, 0.057, occurs for margin ρ = 2^{−10} and 1000 boosting rounds. The following table shows the cross-validation and test errors for A_{ρ=2^{−10}} and AdaBoost with T = 1000, and for the SVM from the last homework assignment.

  Algorithm      A_{ρ=2^{−10}}    AdaBoost    SVM
  10-CV Error    0.0567           0.0573      0.0673
  Test Error     0.0531           0.05        0.06
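The data set from the second homework assignment is not reproduced in this document, so the sketch below uses scikit-learn's breast_cancer data purely as a placeholder (an assumption, not the course data set) and reuses the hypothetical A_rho function from the previous sketch. It illustrates one way to run the 10-fold cross-validation grid search over ρ and T; with the unoptimized stump search above the full grid would be slow in practice, so the grid is best trimmed for a quick test.

    import numpy as np
    from sklearn.datasets import load_breast_cancer            # placeholder data set (assumption)
    from sklearn.model_selection import StratifiedKFold, train_test_split

    data = load_breast_cancer()
    X, y = data.data, np.where(data.target == 1, 1.0, -1.0)    # labels in {-1, +1}
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    rhos = [0.0] + [2.0 ** -k for k in range(10, 0, -1)]        # rho = 0 is AdaBoost
    Ts = [100, 200, 500, 1000]
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

    results = {}
    for rho in rhos:
        for T in Ts:
            fold_errors = []
            for tr, va in cv.split(X_train, y_train):
                f, _ = A_rho(X_train[tr], y_train[tr], T=T, rho=rho)   # from the previous sketch
                fold_errors.append(np.mean(np.sign(f(X_train[va])) != y_train[va]))
            results[(rho, T)] = (np.mean(fold_errors), np.std(fold_errors))

    (best_rho, best_T), (best_cv_err, _) = min(results.items(), key=lambda kv: kv[1][0])
    f, _ = A_rho(X_train, y_train, T=best_T, rho=best_rho)
    test_err = np.mean(np.sign(f(X_test)) != y_test)
    print("best rho = %g, T = %d, 10-CV error = %.4f, test error = %.4f"
          % (best_rho, best_T, best_cv_err, test_err))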

7. For both the A_ρ results and AdaBoost, plot the cumulative margins after 500 rounds of boosting, that is, plot the fraction of training points with margin less than or equal to θ as a function of θ ∈ [0, 1].

Solution: [Figure omitted.] In the graph, the x-axis is θ and the y-axis is the fraction of points with margin less than θ. Different colors correspond to different A_ρ: the red line is A_{ρ=2^{−5}} and the blue line is AdaBoost (A_0). As expected, for ρ > 0, A_ρ achieves a larger margin for more points compared to AdaBoost. A_{2^{−10}} (orange line) is almost identical to AdaBoost. (A plotting sketch is given at the end of this section.)

8. Bound on R_ρ(f).

(a) Prove the upper bound
$$R_\rho(f) \leq \exp\Big(\rho \sum_{t=1}^T \alpha_t\Big) \prod_{t=1}^T Z_t,$$
where the normalization factors Z_t are defined as in the case of AdaBoost in class, with α_t the step chosen by A_ρ at round t.

Solution: In view of the definition of D_t and the bound derived in question 1, we can write
$$\begin{aligned}
R_\rho(f)
&\leq \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{t=1}^T \alpha_t h_t(x_i) + \rho \sum_{t=1}^T \alpha_t\Big) \\
&= \frac{1}{m} \sum_{i=1}^m \Big[m \prod_{t=1}^T Z_t\Big] D_{T+1}(i)\, \exp\Big(\rho \sum_{t=1}^T \alpha_t\Big) \\
&= \exp\Big(\rho \sum_{t=1}^T \alpha_t\Big) \prod_{t=1}^T Z_t.
\end{aligned}$$

(b) Give the expression of Z_t as a function of ρ and ε_t, where ε_t is the weighted error of the hypothesis found by A_ρ at round t, defined in the same way as for AdaBoost in class. Use that to prove the following upper bound:
$$R_\rho(f) \leq \exp\Big(-\sum_{t=1}^T D\Big(\tfrac{1 - \rho}{2}\,\Big\|\, \epsilon_t\Big)\Big),$$
where D(p ‖ q) denotes the binary relative entropy of p and q:
$$D(p \,\|\, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q},$$
for any p, q ∈ [0, 1].

Solution: Define u by u = √[(1 − ρ)/(1 + ρ)]. The expression of Z_t was already given above: Z_t = √[ε_t(1 − ε_t)] (u + u^{−1}), and the expression of α_t gives e^{α_t} = u √[(1 − ε_t)/ε_t]. Plugging these into the bound of the previous question gives
$$\begin{aligned}
R_\rho(f)
&\leq \prod_{t=1}^T e^{\rho \alpha_t}\, \sqrt{\epsilon_t(1 - \epsilon_t)}\, \big(u + u^{-1}\big) \\
&= \prod_{t=1}^T u^{\rho} \Big(\frac{1 - \epsilon_t}{\epsilon_t}\Big)^{\rho/2} \sqrt{\epsilon_t(1 - \epsilon_t)}\, \big(u + u^{-1}\big) \\
&= \prod_{t=1}^T \big(u^{1+\rho} + u^{-(1 - \rho)}\big)\, \epsilon_t^{\frac{1 - \rho}{2}} (1 - \epsilon_t)^{\frac{1 + \rho}{2}}.
\end{aligned}$$

Observe that
$$u^{1+\rho} + u^{-(1-\rho)}
= \Big(\frac{1 - \rho}{1 + \rho}\Big)^{\frac{1+\rho}{2}} + \Big(\frac{1 + \rho}{1 - \rho}\Big)^{\frac{1-\rho}{2}}
= \frac{(1 - \rho) + (1 + \rho)}{(1 + \rho)^{\frac{1+\rho}{2}} (1 - \rho)^{\frac{1-\rho}{2}}}
= \frac{1}{\big(\frac{1+\rho}{2}\big)^{\frac{1+\rho}{2}} \big(\frac{1-\rho}{2}\big)^{\frac{1-\rho}{2}}}.$$
We also have
$$\begin{aligned}
\log\Big[\epsilon_t^{\frac{1-\rho}{2}} (1 - \epsilon_t)^{\frac{1+\rho}{2}}\Big]
&= \frac{1 - \rho}{2}\log \epsilon_t + \frac{1 + \rho}{2}\log(1 - \epsilon_t) \\
&= -D\Big(\frac{1 - \rho}{2}\,\Big\|\,\epsilon_t\Big) + \frac{1 - \rho}{2}\log\frac{1 - \rho}{2} + \frac{1 + \rho}{2}\log\frac{1 + \rho}{2}.
\end{aligned}$$
Combining these two identities shows that each factor satisfies
$$\big(u^{1+\rho} + u^{-(1-\rho)}\big)\, \epsilon_t^{\frac{1-\rho}{2}} (1 - \epsilon_t)^{\frac{1+\rho}{2}} = \exp\Big(-D\Big(\tfrac{1 - \rho}{2}\,\Big\|\,\epsilon_t\Big)\Big),$$
which gives
$$R_\rho(f) \leq \exp\Big(-\sum_{t=1}^T D\Big(\tfrac{1 - \rho}{2}\,\Big\|\,\epsilon_t\Big)\Big).$$

(c) Assume that for all t ∈ [1, T], (1 − ρ)/2 − ε_t > γ > 0. Use the result of the previous question to show that R_ρ(f) ≤ exp(−2γ²T). (Hint: you can use Pinsker's inequality: D(p ‖ q) ≥ 2(p − q)² for all p, q ∈ [0, 1].) Show that for T > (log m)/(2γ²), all points have margin at least ρ.

Solution: By Pinsker's inequality, we have
$$D\Big(\frac{1 - \rho}{2}\,\Big\|\,\epsilon_t\Big) \geq 2\Big[\frac{1 - \rho}{2} - \epsilon_t\Big]^2 > 2\gamma^2.$$
Thus, we can write
$$R_\rho(f) \leq \exp(-2\gamma^2 T).$$
Thus, if this upper bound is less than 1/m, then R_ρ(f) = 0 and every training point has margin at least ρ. The inequality exp(−2γ²T) < 1/m is equivalent to T > (log m)/(2γ²).
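For the cumulative margin plot of question 7 referenced above, one possible sketch is the following; it reuses the hypothetical A_rho, X_train and y_train from the earlier sketches (assumptions, not the course code or data), computes the normalized margins y_i f(x_i)/Σ_t α_t on the training set, and plots the fraction of training points with margin at most θ for θ ∈ [0, 1].

    import numpy as np
    import matplotlib.pyplot as plt

    def cumulative_margins(f, alphas, X, y, thetas):
        # Fraction of points whose normalized margin y_i f(x_i) / sum_t alpha_t is at most theta.
        margins = y * f(X) / np.sum(alphas)
        return np.array([np.mean(margins <= th) for th in thetas])

    thetas = np.linspace(0.0, 1.0, 200)
    for rho, label in [(0.0, "AdaBoost (A_0)"), (2 ** -5, "A_rho, rho = 2^-5")]:
        f, alphas = A_rho(X_train, y_train, T=500, rho=rho)   # from the earlier sketches
        plt.plot(thetas, cumulative_margins(f, alphas, X_train, y_train, thetas), label=label)
    plt.xlabel("theta")
    plt.ylabel("fraction of points with margin <= theta")
    plt.legend()
    plt.show()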