Mehryar Mohri
Foundations of Machine Learning
Courant Institute of Mathematical Sciences

Homework assignment 3
November 16, 2017        Due: Dec 01, 2017

A. Kernels

Show that the following kernels K are PDS:

1. For all integers n > 0, K(x, y) = Σ_{i=1}^N cos^n(x_i² − y_i²) over R^N × R^N.

Solution: Since cos(x − y) = cos x cos y + sin x sin y, cos(x − y) can be written as the dot product of the vectors
$$\Phi(x) = \begin{bmatrix} \cos x \\ \sin x \end{bmatrix} \quad\text{and}\quad \Phi(y) = \begin{bmatrix} \cos y \\ \sin y \end{bmatrix};$$
thus the kernel K'(x, y) = cos(x − y) is PDS. As a consequence, each kernel (x, y) ↦ cos(x_i² − y_i²) is PDS as well, since for any coefficients c_1, …, c_m and points x_1, …, x_m in R,
$$\sum_{i,j} c_i c_j \cos(x_i^2 - x_j^2) = \sum_{i,j} c_i c_j \cos(x_i' - x_j') \geq 0,$$
with x_i' = x_i² for all i. Since products (here the n-fold product defining cos^n) and sums of PDS kernels are PDS, K is PDS.

2. K(x, y) = min(x, y) − xy over [0, 1] × [0, 1].

Solution: Let ⟨·, ·⟩ denote the inner product over the set of measurable functions on [0, 1]: ⟨f, g⟩ = ∫_0^1 f(t) g(t) dt. Note that
$$k_1(x, y) = \int_0^1 1_{t \in [0,x]}\, 1_{t \in [0,y]}\, dt = \langle 1_{[0,x]}, 1_{[0,y]} \rangle
\quad\text{and}\quad
k_2(x, y) = \int_0^1 1_{t \in [x,1]}\, 1_{t \in [y,1]}\, dt = \langle 1_{[x,1]}, 1_{[y,1]} \rangle.$$
Since they are defined as inner products, both k_1 and k_2 are PDS. Note that k_1(x, y) = min(x, y) and k_2(x, y) = 1 − max(x, y). Observe that K = k_1 k_2, since min(x, y)(1 − max(x, y)) = min(x, y) − min(x, y) max(x, y) = min(x, y) − xy; thus K is PDS as a product of two PDS kernels.

3. For any σ > 0, K(x, y) = e^{−‖x−y‖/σ²} over R^N × R^N.

Solution: It suffices to show that K is the normalized kernel associated to the kernel K' defined by
$$\forall (x, y) \in \mathbb{R}^N \times \mathbb{R}^N, \quad K'(x, y) = e^{\varphi(x, y)},$$
where φ(x, y) = (1/σ²)[‖x‖ + ‖y‖ − ‖x − y‖], and to show that K' is PDS. For the first part, observe that
$$\frac{K'(x, y)}{\sqrt{K'(x, x)\, K'(y, y)}} = e^{\varphi(x, y) - \frac{1}{2}\varphi(x, x) - \frac{1}{2}\varphi(y, y)} = e^{-\frac{\|x - y\|}{\sigma^2}}.$$
To show that K' is PDS, it suffices to show that φ is PDS, since composition with a power series with non-negative coefficients (here exp) preserves the PDS property. Now, for any c_1, …, c_n ∈ R, let c_0 = −Σ_{i=1}^n c_i; then we can write
$$\begin{aligned}
\sum_{i,j=1}^n c_i c_j\, \varphi(x_i, x_j)
&= \frac{1}{\sigma^2} \sum_{i,j=1}^n c_i c_j \big[\|x_i\| + \|x_j\| - \|x_i - x_j\|\big] \\
&= \frac{1}{\sigma^2} \Big[ -\sum_{i=1}^n c_0 c_i \|x_i\| - \sum_{j=1}^n c_0 c_j \|x_j\| - \sum_{i,j=1}^n c_i c_j \|x_i - x_j\| \Big] \\
&= -\frac{1}{\sigma^2} \sum_{i,j=0}^n c_i c_j \|x_i - x_j\|,
\end{aligned}$$
with x_0 = 0. Now, for any z ≥ 0, the following equality holds:
$$z = \frac{1}{2\Gamma(\frac{1}{2})} \int_0^{+\infty} \frac{1 - e^{-t z^2}}{t^{3/2}}\, dt.$$
Thus,
$$\begin{aligned}
-\frac{1}{\sigma^2} \sum_{i,j=0}^n c_i c_j \|x_i - x_j\|
&= -\frac{1}{2\Gamma(\frac{1}{2})\sigma^2} \int_0^{+\infty} \frac{\sum_{i,j=0}^n c_i c_j \big(1 - e^{-t\|x_i - x_j\|^2}\big)}{t^{3/2}}\, dt \\
&= \frac{1}{2\Gamma(\frac{1}{2})\sigma^2} \int_0^{+\infty} \frac{\sum_{i,j=0}^n c_i c_j\, e^{-t\|x_i - x_j\|^2}}{t^{3/2}}\, dt,
\end{aligned}$$
since Σ_{i,j=0}^n c_i c_j = (Σ_{i=0}^n c_i)² = 0 by definition of c_0. Since a Gaussian kernel is PDS, the inequality Σ_{i,j=0}^n c_i c_j e^{−t‖x_i−x_j‖²} ≥ 0 holds and the right-hand side is non-negative. Thus, the inequality
$$-\frac{1}{\sigma^2} \sum_{i,j=0}^n c_i c_j \|x_i - x_j\| \geq 0$$
holds, which shows that φ is PDS.

Alternatively, one can also apply the theorem on page 43 of the lecture slides on kernel methods to reduce the problem to showing that the norm G(x, y) = ‖x − y‖ is an NDS function. This can be shown through a direct application of the definition of NDS, together with the representation of the norm given in the hint.
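As an aside (not part of the required solution), the PDS property of each of the three kernels above can be sanity-checked numerically: a kernel is PDS if and only if every Gram matrix it induces is positive semidefinite. The Python sketch below, in which the choices n = 3, N = 2, σ = 1 and the random sample points are arbitrary, builds Gram matrices on random samples and checks that their smallest eigenvalues are non-negative up to floating-point error.

    import numpy as np

    def k1(x, y, n=3):
        # Kernel A.1: sum_i cos^n(x_i^2 - y_i^2)
        return np.sum(np.cos(x**2 - y**2) ** n)

    def k2(x, y):
        # Kernel A.2: min(x, y) - x*y on [0, 1]
        return min(x, y) - x * y

    def k3(x, y, sigma=1.0):
        # Kernel A.3: exp(-||x - y|| / sigma^2)  (norm, not squared norm)
        return np.exp(-np.linalg.norm(x - y) / sigma**2)

    def min_gram_eigenvalue(kernel, points):
        # Smallest eigenvalue of the Gram matrix induced by `kernel` on `points`.
        G = np.array([[kernel(x, y) for y in points] for x in points])
        return np.linalg.eigvalsh(G).min()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))    # sample points in R^2 for kernels A.1 and A.3
    U = rng.uniform(size=20)        # sample points in [0, 1] for kernel A.2

    # Each value should be >= 0 up to numerical error (e.g., > -1e-10).
    print(min_gram_eigenvalue(k1, X), min_gram_eigenvalue(k2, U), min_gram_eigenvalue(k3, X))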

B. Boosting

In class, we showed that AdaBoost can be viewed as coordinate descent applied to a convex upper bound on the empirical error. Here, we consider instead an algorithm seeking to minimize the empirical margin loss. For any 0 ≤ ρ < 1, using the same notation as in class, let
$$R_\rho(f) = \frac{1}{m} \sum_{i=1}^m 1_{y_i f(x_i) \leq \rho}$$
denote the empirical margin loss of a function f of the form f = (Σ_{t=1}^T α_t h_t)/(Σ_{t=1}^T α_t), for a labeled sample S = ((x_1, y_1), …, (x_m, y_m)).

1. Show that R_ρ(f) can be upper bounded as follows:
$$R_\rho(f) \leq \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{t=1}^T \alpha_t h_t(x_i) + \rho \sum_{t=1}^T \alpha_t\Big).$$

Solution: The following shows how R_ρ(f) can be upper-bounded:
$$\begin{aligned}
R_\rho(f)
&= \frac{1}{m} \sum_{i=1}^m 1_{y_i f(x_i) - \rho \leq 0} \\
&= \frac{1}{m} \sum_{i=1}^m 1_{-y_i \sum_{t=1}^T \alpha_t h_t(x_i) + \rho \sum_{t=1}^T \alpha_t \geq 0} \\
&\leq \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{t=1}^T \alpha_t h_t(x_i) + \rho \sum_{t=1}^T \alpha_t\Big),
\end{aligned}$$
where the last step uses the inequality 1_{u ≥ 0} ≤ e^u.

2. For any ρ > 0, let G_ρ be the objective function defined for all α ≥ 0 by
$$G_\rho(\alpha) = \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{j=1}^N \alpha_j h_j(x_i) + \rho \sum_{j=1}^N \alpha_j\Big),$$
with h_j ∈ H for all j ∈ [1, N], with the notation used in class in the boosting lecture. Show that G_ρ is convex and differentiable.

Solution: Since exp is convex and composition with an affine function of α preserves convexity, each term of the objective is a convex function, and G_ρ is convex as a sum of convex functions. Differentiability follows directly from that of exp.
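As a quick numerical illustration (not part of the required solution), the sketch below draws a toy ensemble with random ±1 base-classifier outputs and random non-negative weights and checks the inequality of question 1; note that the right-hand side of that inequality is exactly the objective G_ρ of question 2 evaluated at the selected base classifiers. The sizes m and T and the value ρ = 0.1 are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(1)
    m, T, rho = 50, 10, 0.1

    H = rng.choice([-1.0, 1.0], size=(T, m))   # H[t, i] plays the role of h_t(x_i)
    y = rng.choice([-1.0, 1.0], size=m)
    alpha = rng.uniform(0.0, 1.0, size=T)      # non-negative mixture weights

    scores = alpha @ H                         # sum_t alpha_t h_t(x_i)
    f = scores / alpha.sum()                   # normalized ensemble f = sum_t alpha_t h_t / sum_t alpha_t
    margin_loss = np.mean(y * f <= rho)                              # R_rho(f)
    G_rho_value = np.mean(np.exp(-y * scores + rho * alpha.sum()))   # upper bound of question 1

    print(margin_loss, G_rho_value)
    assert margin_loss <= G_rho_value          # the inequality of question 1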

3. Derive a boosting-style algorithm A_ρ by applying maximum coordinate descent to G_ρ. You should justify in detail the derivation of the algorithm, in particular the choice of the base classifier selected at each round and that of the step. Compare both to their counterparts in AdaBoost.

Solution: By definition of G_ρ, for any direction e_t and step η, we can write
$$G_\rho(\alpha_{t-1} + \eta e_t) = \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i) + \rho \sum_{s=1}^{t-1} \alpha_s\Big)\, e^{-y_i \eta h_t(x_i) + \rho \eta}.$$
Let D_1 be the uniform distribution, that is D_1(i) = 1/m for all i ∈ [1, m], and for any t ∈ [2, T] define D_t by
$$D_t(i) = \frac{D_{t-1}(i) \exp(-y_i \alpha_{t-1} h_{t-1}(x_i))}{Z_{t-1}},
\quad\text{with}\quad
Z_{t-1} = \sum_{i=1}^m D_{t-1}(i) \exp(-y_i \alpha_{t-1} h_{t-1}(x_i)).$$
Observe that
$$D_t(i) = \frac{\exp\big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\big)}{m \prod_{s=1}^{t-1} Z_s}.$$
Thus,
$$\begin{aligned}
\frac{dG_\rho(\alpha_{t-1} + \eta e_t)}{d\eta}\Big|_{\eta = 0}
&= \frac{1}{m} \sum_{i=1}^m \big[-y_i h_t(x_i) + \rho\big] \exp\Big(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i) + \rho \sum_{s=1}^{t-1} \alpha_s\Big) \\
&= \sum_{i=1}^m \big[-y_i h_t(x_i) + \rho\big]\, D_t(i) \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1} \alpha_s} \\
&= \Big[-\sum_{i=1}^m y_i h_t(x_i) D_t(i) + \rho\Big] \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1} \alpha_s} \\
&= \big[-(1 - \epsilon_t) + \epsilon_t + \rho\big] \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1} \alpha_s} \\
&= \big[2\epsilon_t - 1 + \rho\big] \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1} \alpha_s},
\end{aligned}$$
where ε_t = Σ_{i=1}^m D_t(i) 1_{−y_i h_t(x_i) > 0}. The direction selected is the one minimizing 2ε_t − 1 + ρ, that is ε_t. Thus, the algorithm selects at each round the base classifier with the smallest weighted error, as in the case of AdaBoost.

To determine the step η selected at round t, we solve the equation dG_ρ(α_{t−1} + η e_t)/dη = 0:
$$\begin{aligned}
\frac{dG_\rho(\alpha_{t-1} + \eta e_t)}{d\eta} = 0
&\Leftrightarrow \sum_{i=1}^m D_t(i) \Big[\prod_{s=1}^{t-1} Z_s\Big] e^{\rho \sum_{s=1}^{t-1}\alpha_s}\, e^{-y_i \eta h_t(x_i) + \rho \eta}\, \big[-y_i h_t(x_i) + \rho\big] = 0 \\
&\Leftrightarrow \sum_{i:\, y_i h_t(x_i) > 0} D_t(i)\, e^{-\eta(1 - \rho)} (-1 + \rho) + \sum_{i:\, y_i h_t(x_i) < 0} D_t(i)\, e^{\eta(1 + \rho)} (1 + \rho) = 0 \\
&\Leftrightarrow -(1 - \epsilon_t)\, e^{-\eta(1 - \rho)} (1 - \rho) + \epsilon_t\, e^{\eta(1 + \rho)} (1 + \rho) = 0 \\
&\Leftrightarrow e^{2\eta} = \frac{1 - \rho}{1 + \rho} \cdot \frac{1 - \epsilon_t}{\epsilon_t} \\
&\Leftrightarrow \eta = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t} - \frac{1}{2} \log \frac{1 + \rho}{1 - \rho}.
\end{aligned}$$
Thus, the algorithm differs from AdaBoost only by the choice of the step, which it chooses more conservatively: the step size is smaller than that of AdaBoost by the additive constant ½ log[(1 + ρ)/(1 − ρ)].

4. What is the equivalent of the weak learning assumption for A_ρ? (Hint: use the non-negativity of the step value.)

Solution: For the coordinate descent algorithm to make progress at each round, the step size selected along the descent direction must be non-negative, that is
$$\frac{1 - \rho}{1 + \rho} \cdot \frac{1 - \epsilon_t}{\epsilon_t} > 1
\;\Leftrightarrow\; (1 - \rho)(1 - \epsilon_t) > \rho \epsilon_t + \epsilon_t = (1 + \rho)\epsilon_t
\;\Leftrightarrow\; \epsilon_t < \frac{1 - \rho}{2}.$$
Thus, the error of the base classifier chosen must be at least ρ/2 better than one half.

5. Give the full pseudocode of the algorithm A_ρ. What can you say about the A_0 algorithm?

Solution: The normalization factor Z_t can be expressed in terms of ε_t and ρ using its definition:
$$\begin{aligned}
Z_t &= \sum_{i=1}^m D_t(i) \exp(-y_i \alpha_t h_t(x_i)) = e^{-\alpha_t}(1 - \epsilon_t) + e^{\alpha_t}\epsilon_t \\
&= \sqrt{\frac{1 + \rho}{1 - \rho}} \sqrt{(1 - \epsilon_t)\epsilon_t} + \sqrt{\frac{1 - \rho}{1 + \rho}} \sqrt{(1 - \epsilon_t)\epsilon_t} \\
&= \sqrt{\epsilon_t(1 - \epsilon_t)} \Bigg[\sqrt{\frac{1 + \rho}{1 - \rho}} + \sqrt{\frac{1 - \rho}{1 + \rho}}\Bigg]
= \frac{2\sqrt{\epsilon_t(1 - \epsilon_t)}}{\sqrt{1 - \rho^2}}.
\end{aligned}$$
The pseudocode of the algorithm is then given in Figure 1. A_0 coincides exactly with AdaBoost.

A_ρ(S = ((x_1, y_1), …, (x_m, y_m)))
 1   for i ← 1 to m do
 2       D_1(i) ← 1/m
 3   for t ← 1 to T do
 4       h_t ← base classifier in H with small error ε_t = P_{i∼D_t}[h_t(x_i) ≠ y_i]
 5       α_t ← ½ log[(1 − ε_t)/ε_t] − ½ log[(1 + ρ)/(1 − ρ)]
 6       Z_t ← 2[ε_t(1 − ε_t)]^{1/2}/(1 − ρ²)^{1/2}        (normalization factor)
 7       for i ← 1 to m do
 8           D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i))/Z_t
 9   f ← Σ_{t=1}^T α_t h_t
10   return h = sgn(f)

Figure 1: Algorithm A_ρ, for H ⊆ {−1, +1}^X.
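One possible Python implementation of the pseudocode of Figure 1 is sketched below; it is an illustration under explicit assumptions, not the reference code for the course. In particular, the base hypothesis set H is taken to be decision stumps (the homework does not fix H here), and the stump search is a plain unoptimized scan over features and thresholds. Since Z_t is exactly the sum that renormalizes the weights, the update divides by the empirical sum, which is equivalent to dividing by Z_t; ρ = 0 recovers AdaBoost, as noted in question 5.

    import numpy as np

    def best_stump(X, y, D):
        # Return (feature j, threshold thr, sign s, weighted error) of the stump
        # h(x) = s * sign(x_j - thr) with the smallest weighted error under D.
        best = (0, 0.0, 1.0, np.inf)
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                pred = np.where(X[:, j] > thr, 1.0, -1.0)
                for s in (1.0, -1.0):
                    err = np.sum(D[s * pred != y])
                    if err < best[3]:
                        best = (j, thr, s, err)
        return best

    def stump_predict(X, j, thr, s):
        return s * np.where(X[:, j] > thr, 1.0, -1.0)

    def A_rho(X, y, T=100, rho=0.0):
        # Boosting algorithm A_rho of Figure 1; rho = 0 recovers AdaBoost.
        m = X.shape[0]
        D = np.full(m, 1.0 / m)                          # D_1(i) = 1/m
        stumps, alphas = [], []
        for t in range(T):
            j, thr, s, eps = best_stump(X, y, D)         # classifier with smallest weighted error
            eps = np.clip(eps, 1e-10, 1.0 - 1e-10)
            if eps >= (1.0 - rho) / 2.0:                 # weak-learning condition of question 4 violated
                break
            alpha = 0.5 * np.log((1 - eps) / eps) - 0.5 * np.log((1 + rho) / (1 - rho))
            D = D * np.exp(-alpha * y * stump_predict(X, j, thr, s))
            D /= D.sum()                                 # dividing by the sum equals dividing by Z_t
            stumps.append((j, thr, s))
            alphas.append(alpha)
        def f(X_new):                                    # unnormalized score sum_t alpha_t h_t(x)
            scores = np.zeros(X_new.shape[0])
            for (j, thr, s), a in zip(stumps, alphas):
                scores += a * stump_predict(X_new, j, thr, s)
            return scores
        return f, np.array(alphas)

    # Usage (hypothetical arrays): f, alphas = A_rho(X_train, y_train, T=500, rho=2**-5)
    #                              y_pred = np.sign(f(X_test))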

6. Implement the A_ρ algorithm. The algorithm admits two parameters: the number of rounds T and ρ. Experiment with this algorithm to tackle the same classification problem as the one described in the second homework assignment, with the same choice of training and test sets and the same cross-validation set-up. Use a grid search to determine the best values of ρ (e.g., ρ ∈ {2^{−10}, 2^{−9}, …, 2^{−1}}) and T ∈ {100, 200, 500, 1000}, and report the test error obtained. Compare your result with the one obtained by running AdaBoost using the same set-up. Also, compare these results with those obtained in the second homework assignment using SVMs.

Solution: [Figures omitted: cross-validation error, shown within one standard deviation, for various values of ρ and numbers of boosting rounds; ρ = 0 corresponds to AdaBoost.] The lowest cross-validation error, 0.057, occurs for margin ρ = 2^{−10} and 1000 boosting rounds. The following table shows the cross-validation and test errors for A_{ρ=2^{−10}} and AdaBoost with T = 1000, and for the SVM from the last homework assignment.

  Algorithm      A_{ρ=2^{−10}}    AdaBoost    SVM
  10-CV Error    0.0567           0.0573      0.0673
  Test Error     0.0531           0.05        0.06
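The data set from the second homework assignment is not reproduced in this document, so the sketch below uses scikit-learn's breast_cancer data purely as a placeholder (an assumption, not the course data set) and reuses the hypothetical A_rho function from the previous sketch. It illustrates one way to run the 10-fold cross-validation grid search over ρ and T; with the unoptimized stump search above the full grid would be slow in practice, so the grid is best trimmed for a quick test.

    import numpy as np
    from sklearn.datasets import load_breast_cancer            # placeholder data set (assumption)
    from sklearn.model_selection import StratifiedKFold, train_test_split

    data = load_breast_cancer()
    X, y = data.data, np.where(data.target == 1, 1.0, -1.0)    # labels in {-1, +1}
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    rhos = [0.0] + [2.0 ** -k for k in range(10, 0, -1)]        # rho = 0 is AdaBoost
    Ts = [100, 200, 500, 1000]
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

    results = {}
    for rho in rhos:
        for T in Ts:
            fold_errors = []
            for tr, va in cv.split(X_train, y_train):
                f, _ = A_rho(X_train[tr], y_train[tr], T=T, rho=rho)   # from the previous sketch
                fold_errors.append(np.mean(np.sign(f(X_train[va])) != y_train[va]))
            results[(rho, T)] = (np.mean(fold_errors), np.std(fold_errors))

    (best_rho, best_T), (best_cv_err, _) = min(results.items(), key=lambda kv: kv[1][0])
    f, _ = A_rho(X_train, y_train, T=best_T, rho=best_rho)
    test_err = np.mean(np.sign(f(X_test)) != y_test)
    print("best rho = %g, T = %d, 10-CV error = %.4f, test error = %.4f"
          % (best_rho, best_T, best_cv_err, test_err))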

7. For both the A_ρ results and AdaBoost, plot the cumulative margins after 500 rounds of boosting, that is, plot the fraction of training points with margin less than or equal to θ as a function of θ ∈ [0, 1].

Solution: [Figure omitted.] In the graph, the x-axis is θ and the y-axis is the fraction of points with margin less than θ. Different colors correspond to different A_ρ: the red line is A_{ρ=2^{−5}} and the blue line is AdaBoost (A_0). As expected, for ρ > 0, A_ρ achieves a larger margin for more points compared to AdaBoost. A_{2^{−10}} (orange line) is almost identical to AdaBoost. (A plotting sketch is given at the end of this section.)

8. Bound on R_ρ(f).

(a) Prove the upper bound
$$R_\rho(f) \leq \exp\Big(\rho \sum_{t=1}^T \alpha_t\Big) \prod_{t=1}^T Z_t,$$
where the normalization factors Z_t are defined as in the case of AdaBoost in class, with α_t the step chosen by A_ρ at round t.

Solution: In view of the definition of D_t and the bound derived in question 1, we can write
$$\begin{aligned}
R_\rho(f)
&\leq \frac{1}{m} \sum_{i=1}^m \exp\Big(-y_i \sum_{t=1}^T \alpha_t h_t(x_i) + \rho \sum_{t=1}^T \alpha_t\Big) \\
&= \frac{1}{m} \sum_{i=1}^m \Big[m \prod_{t=1}^T Z_t\Big] D_{T+1}(i)\, \exp\Big(\rho \sum_{t=1}^T \alpha_t\Big) \\
&= \exp\Big(\rho \sum_{t=1}^T \alpha_t\Big) \prod_{t=1}^T Z_t.
\end{aligned}$$

(b) Give the expression of Z_t as a function of ρ and ε_t, where ε_t is the weighted error of the hypothesis found by A_ρ at round t, defined in the same way as for AdaBoost in class. Use that to prove the following upper bound:
$$R_\rho(f) \leq \exp\Big(-\sum_{t=1}^T D\Big(\tfrac{1 - \rho}{2}\,\Big\|\, \epsilon_t\Big)\Big),$$
where D(p ‖ q) denotes the binary relative entropy of p and q:
$$D(p \,\|\, q) = p \log \frac{p}{q} + (1 - p) \log \frac{1 - p}{1 - q},$$
for any p, q ∈ [0, 1].

Solution: Define u by u = √[(1 − ρ)/(1 + ρ)]. The expression of Z_t was already given above: Z_t = √[ε_t(1 − ε_t)] (u + u^{−1}), and the expression of α_t gives e^{α_t} = u √[(1 − ε_t)/ε_t]. Plugging these into the bound of the previous question gives
$$\begin{aligned}
R_\rho(f)
&\leq \prod_{t=1}^T e^{\rho \alpha_t}\, \sqrt{\epsilon_t(1 - \epsilon_t)}\, \big(u + u^{-1}\big) \\
&= \prod_{t=1}^T u^{\rho} \Big(\frac{1 - \epsilon_t}{\epsilon_t}\Big)^{\rho/2} \sqrt{\epsilon_t(1 - \epsilon_t)}\, \big(u + u^{-1}\big) \\
&= \prod_{t=1}^T \big(u^{1+\rho} + u^{-(1 - \rho)}\big)\, \epsilon_t^{\frac{1 - \rho}{2}} (1 - \epsilon_t)^{\frac{1 + \rho}{2}}.
\end{aligned}$$

Observe that
$$u^{1+\rho} + u^{-(1-\rho)}
= \Big(\frac{1 - \rho}{1 + \rho}\Big)^{\frac{1+\rho}{2}} + \Big(\frac{1 + \rho}{1 - \rho}\Big)^{\frac{1-\rho}{2}}
= \frac{(1 - \rho) + (1 + \rho)}{(1 + \rho)^{\frac{1+\rho}{2}} (1 - \rho)^{\frac{1-\rho}{2}}}
= \frac{1}{\big(\frac{1+\rho}{2}\big)^{\frac{1+\rho}{2}} \big(\frac{1-\rho}{2}\big)^{\frac{1-\rho}{2}}}.$$
We also have
$$\begin{aligned}
\log\Big[\epsilon_t^{\frac{1-\rho}{2}} (1 - \epsilon_t)^{\frac{1+\rho}{2}}\Big]
&= \frac{1 - \rho}{2}\log \epsilon_t + \frac{1 + \rho}{2}\log(1 - \epsilon_t) \\
&= -D\Big(\frac{1 - \rho}{2}\,\Big\|\,\epsilon_t\Big) + \frac{1 - \rho}{2}\log\frac{1 - \rho}{2} + \frac{1 + \rho}{2}\log\frac{1 + \rho}{2}.
\end{aligned}$$
Combining these two identities shows that each factor satisfies
$$\big(u^{1+\rho} + u^{-(1-\rho)}\big)\, \epsilon_t^{\frac{1-\rho}{2}} (1 - \epsilon_t)^{\frac{1+\rho}{2}} = \exp\Big(-D\Big(\tfrac{1 - \rho}{2}\,\Big\|\,\epsilon_t\Big)\Big),$$
which gives
$$R_\rho(f) \leq \exp\Big(-\sum_{t=1}^T D\Big(\tfrac{1 - \rho}{2}\,\Big\|\,\epsilon_t\Big)\Big).$$

(c) Assume that for all t ∈ [1, T], (1 − ρ)/2 − ε_t > γ > 0. Use the result of the previous question to show that R_ρ(f) ≤ exp(−2γ²T). (Hint: you can use Pinsker's inequality: D(p ‖ q) ≥ 2(p − q)² for all p, q ∈ [0, 1].) Show that for T > (log m)/(2γ²), all points have margin at least ρ.

Solution: By Pinsker's inequality, we have
$$D\Big(\frac{1 - \rho}{2}\,\Big\|\,\epsilon_t\Big) \geq 2\Big[\frac{1 - \rho}{2} - \epsilon_t\Big]^2 > 2\gamma^2.$$
Thus, we can write
$$R_\rho(f) \leq \exp(-2\gamma^2 T).$$
Thus, if this upper bound is less than 1/m, then R_ρ(f) = 0 and every training point has margin at least ρ. The inequality exp(−2γ²T) < 1/m is equivalent to T > (log m)/(2γ²).
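For the cumulative margin plot of question 7 referenced above, one possible sketch is the following; it reuses the hypothetical A_rho, X_train and y_train from the earlier sketches (assumptions, not the course code or data), computes the normalized margins y_i f(x_i)/Σ_t α_t on the training set, and plots the fraction of training points with margin at most θ for θ ∈ [0, 1].

    import numpy as np
    import matplotlib.pyplot as plt

    def cumulative_margins(f, alphas, X, y, thetas):
        # Fraction of points whose normalized margin y_i f(x_i) / sum_t alpha_t is at most theta.
        margins = y * f(X) / np.sum(alphas)
        return np.array([np.mean(margins <= th) for th in thetas])

    thetas = np.linspace(0.0, 1.0, 200)
    for rho, label in [(0.0, "AdaBoost (A_0)"), (2 ** -5, "A_rho, rho = 2^-5")]:
        f, alphas = A_rho(X_train, y_train, T=500, rho=rho)   # from the earlier sketches
        plt.plot(thetas, cumulative_margins(f, alphas, X_train, y_train, thetas), label=label)
    plt.xlabel("theta")
    plt.ylabel("fraction of points with margin <= theta")
    plt.legend()
    plt.show()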