Mehryar Mohri
Foundations of Machine Learning
Courant Institute of Mathematical Sciences
Homework assignment 2
October 25, 2017. Due: November 08, 2017

A. Growth function

Growth function of stump functions. For any $x \in \mathbb{R}$ and $\theta \in \mathbb{R}$, let $\phi_\theta$ denote the threshold function that assigns sign $+1$ to $x \geq \theta$ and $-1$ otherwise: $\phi_\theta(x) = 2 \cdot 1_{x \geq \theta} - 1$. Let $H$ be the family of functions mapping $\mathbb{R}^N$ to $\{-1, +1\}$ defined by
$$H = \big\{x \mapsto s\,\phi_\theta(x_i) \colon i \in [1, N],\ \theta \in \mathbb{R},\ s \in \{-1, +1\}\big\},$$
where $x_i$ is the $i$th coordinate of $x \in \mathbb{R}^N$.

1. Show that the following upper bound holds for the growth function of $H$: $\Pi_m(H) \leq 2mN$.

Solution: Consider first the different possible ways of labeling $m$ points using threshold functions based on a fixed coordinate. Since two thresholds with no point in between lead to the same classification, there are $(m+1)$ distinct threshold values to consider for $m$ distinct coordinate values, which gives $(m+1)$ ways of labeling the points using threshold functions labeling points before the threshold as $-1$. There are $(m-1)$ additional ways of labeling the points using threshold functions labeling points before the threshold as $+1$ (the left-most and right-most thresholds can be obtained using the other family). Thus, this generates $2m$ distinct ways of labeling the points per coordinate, which implies at most $2mN$ dichotomies over all $N$ dimensions.

2. Let $H_2$ be the family of functions mapping $\mathbb{R}^N$ to $\{-1, +1\}$ defined by
$$H_2 = \big\{x \mapsto s\,1_{\phi_\theta(x_i)=+1}\,\phi_{\theta'}(x_j) + s'\,1_{\phi_\theta(x_i)=-1}\,\phi_{\theta'}(x_j) \colon i \neq j,\ i, j \in [1, N],\ \theta, \theta' \in \mathbb{R},\ s, s' \in \{-1, +1\}\big\}.$$
Show that $\Pi_m(H_2) = O(m^2 N^2)$. Give an explicit upper bound on $\Pi_m(H_2)$.

Solution: There are $N(N-1)/2$ pairs $(i, j)$. Each choice leads to at most $(2m)^2$ ways of labeling the points. Thus, $\Pi_m(H_2) \leq (2m)^2\, N(N-1)/2 = 2m^2 N(N-1)$.
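The $2m$ count per coordinate can be checked by brute force. The following quick sketch (illustrative only, not part of the assignment) enumerates every labeling of $m$ distinct one-dimensional points achievable by $x \mapsto s\,\phi_\theta(x)$, using one candidate threshold below all points, one in each gap, and one above all points:

```python
def stump_labelings(points):
    """All distinct labelings of 1-D points by stumps x -> s * phi_theta(x),
    where phi_theta(x) = +1 if x >= theta else -1 and s in {-1, +1}.
    One candidate threshold below all points, one per gap, one above."""
    xs = sorted(points)
    thetas = [xs[0] - 1.0]
    thetas += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    thetas += [xs[-1] + 1.0]
    labelings = set()
    for theta in thetas:
        for s in (-1, 1):
            labelings.add(tuple(s * (1 if x >= theta else -1) for x in points))
    return labelings

# For m distinct points on one coordinate there are exactly 2m labelings,
# matching the (m + 1) + (m - 1) count in the solution.
for m in range(1, 9):
    assert len(stump_labelings(list(range(m)))) == 2 * m
```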
B. VC-dimension

1. VC-dimension of circles in the plane.

(a) Show that the VC-dimension of the circles in the plane is 3.

Solution: It is clear that any two points can be shattered by a circle, and any three non-collinear points can also be shattered. Now, given any four points, there are three cases. First, the convex hull of the four points may be a triangle; labeling the points on the triangle as positive and the point inside as negative is then a dichotomy that cannot be realized by a circle. Second, if the convex hull of the four points is a quadrilateral, then labeling the farther of the two diagonally opposite pairs of points as positive and the other two as negative is a dichotomy that cannot be realized. Finally, if the four points are collinear, the dichotomy with alternating positive and negative labels cannot be realized. Thus, the VC-dimension of the set of all circles in the plane is 3.

(b) Let $H_1$ and $H_2$ be two families of functions mapping from $X$ to $\{0, 1\}$ and $H = \{h_1 h_2 \colon h_1 \in H_1, h_2 \in H_2\}$ their product. Show that $\Pi_H(m) \leq \Pi_{H_1}(m)\,\Pi_{H_2}(m)$.

Solution: By definition,
$$\Pi_H(m) = \max_{\{x_1, \ldots, x_m\} \subseteq X} \big|\{(h_1(x_1) h_2(x_1), \ldots, h_1(x_m) h_2(x_m)) \colon h_1 \in H_1,\ h_2 \in H_2\}\big|$$
$$\leq \left[\max_{\{x_1, \ldots, x_m\} \subseteq X} \big|\{(h_1(x_1), \ldots, h_1(x_m)) \colon h_1 \in H_1\}\big|\right] \left[\max_{\{x_1, \ldots, x_m\} \subseteq X} \big|\{(h_2(x_1), \ldots, h_2(x_m)) \colon h_2 \in H_2\}\big|\right] = \Pi_{H_1}(m)\,\Pi_{H_2}(m).$$

(c) Give an upper bound on the VC-dimension of the family of intersections of $k$ circles in the plane.

Solution: Let $H = \{\prod_{i=1}^k h_i \colon h_i \in H_c\}$, where $H_c$ is the family of circles in the plane (viewed as indicator functions mapping to $\{0, 1\}$, so that the product corresponds to the intersection). Then $H$ is the family of intersections of $k$ circles. From (a), the VC-dimension of $H_c$ is 3. Let $m$ be the VC-dimension of $H$. Sauer's lemma and (b) imply
$$2^m = \Pi_H(m) \leq (\Pi_{H_c}(m))^k \leq \left(\frac{em}{3}\right)^{3k}.$$
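The inequality in (b) can be spot-checked numerically on small finite classes. The sketch below is illustrative only: the threshold and interval classes are made up for the test, and it simply counts dichotomies on a fixed point set for the two classes and for their product:

```python
def num_dichotomies(hypotheses, points):
    """Number of distinct labelings a hypothesis set induces on the points."""
    return len({tuple(h(x) for x in points) for h in hypotheses})

# Two made-up finite classes mapping R to {0, 1}: thresholds and intervals.
thresholds = [lambda x, t=t: int(x >= t) for t in range(-1, 6)]
intervals = [lambda x, a=a, b=b: int(a <= x <= b)
             for a in range(-1, 6) for b in range(a, 6)]
products = [lambda x, h1=h1, h2=h2: h1(x) * h2(x)
            for h1 in thresholds for h2 in intervals]

pts = [0, 1, 2, 3, 4]
lhs = num_dichotomies(products, pts)
rhs = num_dichotomies(thresholds, pts) * num_dichotomies(intervals, pts)
assert lhs <= rhs  # Pi_H(m) <= Pi_H1(m) * Pi_H2(m) on this sample
```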
Let $n = 6k \log_2(3k)$. If we can show that $2^n > \left(\frac{en}{3}\right)^{3k}$, then by definition, $m < n = 6k \log_2(3k)$. In fact,
$$\left(\frac{en}{3}\right)^{3k} < 2^n \iff 3k \log_2(2ek \log_2(3k)) < 6k \log_2(3k) \iff \log_2(2ek \log_2(3k)) < \log_2(9k^2) \iff 2ek \log_2(3k) < 9k^2 \iff \log_2(3k) < \frac{9k}{2e}.$$
The last inequality holds for $k = 2$, since $\log_2(6) \approx 2.6 < 9/e \approx 3.3$. Furthermore, the derivative of the left-hand side is $\frac{1}{k \ln 2} \leq \frac{1}{2 \ln 2} \approx 0.72$ for $k \geq 2$, while the derivative of the right-hand side is $\frac{9}{2e} \approx 1.65$. Therefore $\log_2(3k) < \frac{9k}{2e}$ for all $k \geq 2$. Finally, when $k = 1$ the VC-dimension of $H$ is 3, which is also less than $6 \log_2 3$. Therefore, $\mathrm{VCdim}(H) < 6k \log_2(3k)$.

2. VC-dimension of decision trees. A full binary tree is a tree in which each node is either a leaf, or is an internal node with exactly two child nodes.

(a) Show that a full binary tree with $n$ internal nodes has exactly $n + 1$ leaves (Hint: you can proceed by induction).

Solution: The proof is by induction on $n$. For $n = 1$, there are exactly two leaves, that is, $n + 1$. Assume now that the equality holds for all trees with at most $n$ internal nodes, and let $T$ be a tree with $n + 1$ internal nodes. The root node admits two subtrees $T_L$ and $T_R$, each with at most $n$ internal nodes. By the induction hypothesis, the number of leaves of $T_L$ is $\mathrm{inodes}(T_L) + 1$ and, similarly, the number of leaves of $T_R$ is $\mathrm{inodes}(T_R) + 1$. Thus, the number of leaves of $T$ is $\mathrm{leaves}(T_L) + \mathrm{leaves}(T_R) = \mathrm{inodes}(T_L) + \mathrm{inodes}(T_R) + 2 = \mathrm{inodes}(T) + 1$, taking into account the root node.

(b) A binary decision tree is a full binary tree with each leaf labeled with $+1$ or $-1$ and each internal node labeled with a question. A binary decision tree classifies a point as follows: starting with the root of the tree, if the internal node question applied to the point admits a positive answer, then the current node becomes the right child, otherwise it becomes the left child. This is repeated until a leaf node
is reached. The label assigned to the point is then the sign of that leaf node. Suppose the node questions are of the form $x_i > 0$, $i \in [1, N]$. Show that the VC-dimension of the set of binary decision trees with $n$ internal nodes in dimension $N$ is at most $(2n+1) \log_2(N+2)$ (Hint: bound the cardinality of the set). Use that to derive an upper bound on the Rademacher complexity of that set.

Solution: A binary decision tree with $n$ internal nodes has exactly $n + 1$ leaves. Each internal node can be labeled with an integer from $\{1, \ldots, N\}$ indicating which dimension is queried to make a binary split, and each leaf can be labeled with $\pm 1$ to indicate the classification made at that leaf. Fix an ordering of the nodes and leaves and consider all possible labelings of this sequence. There can be no more than $(N+2)^{2n+1}$ distinct binary trees and, thus, the VC-dimension of this finite set of hypotheses can be no larger than $(2n+1) \log_2(N+2) = O(n \log N)$. Writing $d$ for this VC-dimension, the Rademacher complexity can be bounded (via Massart's lemma combined with Sauer's lemma, since $\log \Pi_H(m) \leq d(\log m + 1)$) using
$$\mathcal{R}_m(H) \leq \sqrt{\frac{2d(\log m + 1)}{m}} \leq \sqrt{\frac{2(2n+1) \log_2(N+2)(\log m + 1)}{m}}.$$

C. Support Vector Machines

1. Download and install the libsvm software library from:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/

2. Consider the spambase data set: http://archive.ics.uci.edu/ml/datasets/spambase. Download a shuffled version of that dataset from
http://www.cs.nyu.edu/~mohri/ml17/spambase.data.shuffled
Use the libsvm scaling tool to scale the features of all the data. Use the first 3000 examples for training, the last 1601 for testing. The scaling parameters should be computed only on the training data and then applied to the test data.
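As a sketch of the train-only scaling step, the snippet below (NumPy only; the random matrix is a stand-in for the actual spambase features, which must be downloaded separately) fits min/max parameters on the first 3000 rows and applies them unchanged to the rest, mirroring what libsvm's svm-scale does with its -s (save parameters) and -r (restore parameters) options:

```python
import numpy as np

def fit_scaler(X_train, lo=-1.0, hi=1.0):
    """Per-feature min/max computed on the training split only,
    mimicking svm-scale's default target range of [-1, 1]."""
    mn, mx = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)  # guard against constant features
    def transform(X):
        return lo + (hi - lo) * (X - mn) / span
    return transform

rng = np.random.default_rng(0)
X = rng.normal(size=(4601, 57))        # stand-in for the spambase matrix
X_train, X_test = X[:3000], X[3000:]
scale = fit_scaler(X_train)            # parameters from training data only
X_train_s, X_test_s = scale(X_train), scale(X_test)
```

Note that the scaled test features may fall slightly outside [-1, 1]; that is expected, and preferable to leaking test statistics into the scaler.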
3. Consider the binary classification task that consists of predicting whether an e-mail message is spam using the 57 features. Use SVMs combined with polynomial kernels to tackle this binary classification problem. To do that, randomly split the training data into ten equal-sized disjoint sets. For each value of the polynomial degree, $d = 1, 2, 3, 4$, plot the average cross-validation error plus or minus one standard deviation as a function of $C$ (let other parameters of polynomial kernels in libsvm be equal to their default values), varying $C$ in powers of 2, from a small value $C = 2^{-k}$ to $C = 2^{k}$, for some value of $k$. $k$ should be chosen so that you see a significant variation in training error, starting from a very high training error to a low training error. Expect longer training times with libsvm as the value of $C$ increases.

Solution: Figure 1 shows the average cross-validation performance as a function of the regularization parameter $C$.

Figure 1: Average error according to 10-fold cross-validation, with error bars indicating one standard deviation.

Based on Figure 1, we choose $C^* = 2^{7}$ and $d^* = 1$. (Note: your values of $C$ and $d$ may differ.)

4. Let $(C^*, d^*)$ be the best pair found previously. Fix $C$ to be $C^*$. Plot the
ten-fold cross-validation error and the test error for the hypotheses obtained as a function of $d$. Plot the average number of support vectors obtained as a function of $d$. How many of the support vectors lie on the margin hyperplanes?

Solution: Figure 2 shows the cross-validation error and test error with $C$ fixed at $C^* = 2^{7}$, as well as the number of support vectors and of support vectors lying on the margin hyperplanes. Note that the CV error is slightly higher than the test error. The number of marginal support vectors is computed by counting the support vectors corresponding to a slack of zero.

Figure 2: Cross-validation and test error versus degree (left panel); number of support vectors and marginal support vectors versus degree (right panel).

5. For any $d$, let $K_d$ denote the polynomial kernel of degree $d$. Show that for any fixed integer $u > 0$,
$$G_u = \frac{2}{u(u+1)} \sum_{1 \leq i \leq j \leq u} K_i K_j$$
is a PDS kernel. Use SVMs combined with the polynomial kernel $G_4$ to tackle the same binary classification problem as in the previous questions: as in the previous questions, use ten-fold cross-validation to determine the best value of $C$; report the ten-fold cross-validation error and the test error of the hypothesis obtained when training with $G_u$.

Solution: It follows from the closure properties of PDS kernels (PDS kernels are closed under sum and product) that $G_u$ is a PDS kernel. Figure 3 shows the cross-validation error and test error for different values of $C$. The optimal value of $C$ is $C^* = 2^{13}$; the validation error for $C^*$ is 0.067 and the test error is 0.058. (Note again that your values may differ.)
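As a sanity check of the closure argument, one can form $G_u$ explicitly and verify that its Gram matrix is symmetric positive semidefinite. The sketch below assumes the simplest libsvm-style polynomial kernel $K_d(x, y) = (\gamma \langle x, y \rangle)^d$ with coef0 = 0 (libsvm's default); the data are random stand-ins:

```python
import numpy as np

def G(u, X, Y, gamma=None):
    """G_u = 2/(u(u+1)) * sum_{1 <= i <= j <= u} K_i K_j, with
    K_d(x, y) = (gamma * <x, y>)**d (polynomial kernel, coef0 = 0)."""
    if gamma is None:
        gamma = 1.0 / X.shape[1]     # libsvm's default gamma = 1/num_features
    base = gamma * (X @ Y.T)         # Gram matrix of the linear kernel
    total = sum((base ** i) * (base ** j)
                for i in range(1, u + 1) for j in range(i, u + 1))
    return 2.0 / (u * (u + 1)) * total

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
K = G(4, X, X)
# A PDS kernel must yield a symmetric positive semidefinite Gram matrix.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-8
```

A callable such as `lambda A, B: G(4, A, B)` can then be passed as a custom kernel to libsvm-style wrappers (e.g. scikit-learn's `SVC(kernel=...)`), though precomputing the Gram matrix once is usually faster for cross-validation.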
Figure 3: Cross-validation error for $G_4$.

D. Rademacher complexity

Let $p > 2$ and let $q$ be its conjugate: $1/p + 1/q = 1$. Let $H$ be the family of linear functions defined over $\{x \in \mathbb{R}^N \colon \|x\|_p \leq r\}$ by
$$H = \big\{x \mapsto w \cdot x \colon \|w\|_q \leq \Lambda_q\big\},$$
for some $r > 0$ and $\Lambda_q > 0$. Give an upper bound on $\mathcal{R}_m(H)$ in terms of $\Lambda_q$ and $r$, assuming that for $p > 2$ the following inequality holds for all $z_1, \ldots, z_m \in \mathbb{R}$:
$$\mathop{\mathbb{E}}_\sigma\Big[\Big|\sum_{i=1}^m \sigma_i z_i\Big|^p\Big] \leq \Big[2p \sum_{i=1}^m z_i^2\Big]^{p/2}.$$
Solution:
$$\begin{aligned}
\widehat{\mathcal{R}}_S(H) &= \frac{1}{m} \mathop{\mathbb{E}}_\sigma\Big[\sup_{\|w\|_q \leq \Lambda_q} \sum_{i=1}^m \sigma_i\, w \cdot x_i\Big] \\
&= \frac{1}{m} \mathop{\mathbb{E}}_\sigma\Big[\sup_{\|w\|_q \leq \Lambda_q} w \cdot \sum_{i=1}^m \sigma_i x_i\Big] \\
&= \frac{\Lambda_q}{m} \mathop{\mathbb{E}}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i x_i\Big\|_p\Big] && \text{(def. of dual norm)} \\
&\leq \frac{\Lambda_q}{m} \Big[\mathop{\mathbb{E}}_\sigma\Big\|\sum_{i=1}^m \sigma_i x_i\Big\|_p^p\Big]^{1/p} && \text{(Jensen's ineq.)} \\
&= \frac{\Lambda_q}{m} \Big[\sum_{j=1}^N \mathop{\mathbb{E}}_\sigma\Big|\sum_{i=1}^m \sigma_i x_{ij}\Big|^p\Big]^{1/p} && \text{(def. of } \|\cdot\|_p\text{)} \\
&\leq \frac{\Lambda_q}{m} \Big[\sum_{j=1}^N \Big(2p \sum_{i=1}^m x_{ij}^2\Big)^{p/2}\Big]^{1/p} && \text{(assumption)} \\
&= \frac{\Lambda_q \sqrt{2p}}{m} \Big[\sum_{j=1}^N \Big(\sum_{i=1}^m x_{ij}^2\Big)^{p/2}\Big]^{1/p} \\
&\leq \frac{\Lambda_q \sqrt{2p}}{m} \Big[\sum_{i=1}^m \Big(\sum_{j=1}^N |x_{ij}|^p\Big)^{2/p}\Big]^{1/2} && \text{(Minkowski's ineq. for } \|\cdot\|_{p/2}\text{)} \\
&= \frac{\Lambda_q \sqrt{2p}}{m} \Big[\sum_{i=1}^m \|x_i\|_p^2\Big]^{1/2} \\
&\leq \Lambda_q r \sqrt{\frac{2p}{m}}. && (\|x_i\|_p \leq r)
\end{aligned}$$
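The final bound admits a quick Monte Carlo sanity check (illustrative only; $p = 3$ and the sample are arbitrary choices). The supremum over $\|w\|_q \leq \Lambda_q$ is computed exactly via the dual-norm identity $\sup_{\|w\|_q \leq \Lambda_q} w \cdot v = \Lambda_q \|v\|_p$, and the empirical Rademacher average is compared against $\Lambda_q r \sqrt{2p/m}$:

```python
import numpy as np

p, q = 3.0, 1.5                     # conjugate exponents: 1/p + 1/q = 1
Lam_q, r, m, N = 2.0, 1.0, 50, 10

rng = np.random.default_rng(0)
X = rng.normal(size=(m, N))
X /= np.linalg.norm(X, ord=p, axis=1, keepdims=True)  # now ||x_i||_p = r = 1

# sup over ||w||_q <= Lam_q equals Lam_q * ||sum_i sigma_i x_i||_p,
# so the empirical Rademacher complexity can be averaged directly.
sigmas = rng.choice([-1.0, 1.0], size=(5000, m))
R_hat = Lam_q / m * np.mean(np.linalg.norm(sigmas @ X, ord=p, axis=1))

bound = Lam_q * r * np.sqrt(2 * p / m)
assert R_hat <= bound   # the Monte Carlo estimate respects the bound
```

The gap between `R_hat` and `bound` is typically substantial, reflecting the loose $\sqrt{2p}$ constant in the assumed moment inequality.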