Mehryar Mohri
Foundations of Machine Learning
Courant Institute of Mathematical Sciences

Homework assignment 2
October 25, 2017
Due: November 08, 2017

A. Growth function

Growth function of stump functions. For any $x \in \mathbb{R}$ and $\theta \in \mathbb{R}$, let $\phi_\theta$ denote the threshold function that assigns sign $+1$ to $x \le \theta$, $-1$ otherwise: $\phi_\theta(x) = 2\,1_{x \le \theta} - 1$. Let $H$ be the family of functions mapping $\mathbb{R}^N$ to $\{-1, +1\}$ defined by
$$H = \big\{ x \mapsto s\,\phi_\theta(x_i) \colon i \in [1, N],\ \theta \in \mathbb{R},\ s \in \{-1, +1\} \big\},$$
where $x_i$ is the $i$th coordinate of $x \in \mathbb{R}^N$.

1. Show that the following upper bound holds for the growth function of $H$:
$$\Pi_m(H) \le 2mN.$$

Solution: Consider first the different possible ways of labeling $m$ points using threshold functions based on a fixed coordinate. Since two thresholds with no point in between lead to the same classification, there are $(m+1)$ distinct threshold values to consider for $m$ distinct coordinate values, which gives $(m+1)$ ways of labeling the points using threshold functions labeling points before the threshold as $+1$. There are $(m-1)$ additional ways of labeling the points using threshold functions labeling points before the threshold as $-1$ (the left-most and right-most thresholds can be obtained using the other family). Thus, this generates $2m$ distinct ways of labeling the points, which implies at most $2mN$ dichotomies over all $N$ dimensions.

2. Let $H_2$ be the family of functions mapping $\mathbb{R}^N$ to $\{-1, +1\}$ defined by
$$H_2 = \big\{ x \mapsto s\,1_{\phi_\theta(x_i) = +1}\,\phi_{\theta'}(x_j) + s'\,1_{\phi_\theta(x_i) = -1}\,\phi_{\theta'}(x_j) \colon i \ne j,\ i, j \in [1, N],\ \theta, \theta' \in \mathbb{R},\ s, s' \in \{-1, +1\} \big\}.$$
Show that $\Pi_m(H_2) = O(m^2 N^2)$. Give an explicit upper bound on $\Pi_m(H_2)$.

Solution: There are $N(N-1)/2$ pairs $(i, j)$. Each choice leads to at most $(2m)^2$ ways of labeling the points. Thus, $\Pi_m(H_2) \le (2m)^2 N(N-1)/2 = 2m^2 N(N-1)$.
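As an illustrative aside (not part of the original assignment), the short Python sketch below brute-force enumerates the distinct labelings of a random sample realized by the stump family $H$ and checks the count against the $2mN$ bound of question 1. All function and variable names here are ours.

```python
import numpy as np

def stump_dichotomies(X):
    """Enumerate all labelings of the rows of X realized by
    H = {x -> s * phi_theta(x_i)}, sweeping every coordinate i,
    every effective threshold theta, and both signs s."""
    m, N = X.shape
    labelings = set()
    for i in range(N):
        vals = np.sort(np.unique(X[:, i]))
        # One threshold below all values, one between each consecutive
        # pair, and one above all values covers every distinct labeling.
        thetas = np.concatenate(
            ([vals[0] - 1.0], (vals[:-1] + vals[1:]) / 2.0, [vals[-1] + 1.0]))
        for theta in thetas:
            base = np.where(X[:, i] <= theta, 1, -1)  # phi_theta(x_i)
            labelings.add(tuple(base))    # s = +1
            labelings.add(tuple(-base))   # s = -1
    return labelings

rng = np.random.default_rng(0)
for m, N in [(5, 2), (8, 3), (12, 4)]:
    count = len(stump_dichotomies(rng.standard_normal((m, N))))
    assert count <= 2 * m * N
    print(f"m={m}, N={N}: {count} dichotomies <= {2 * m * N}")
```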

B. VC-dimension

1. VC-dimension of circles in the plane.

(a) Show that the VC-dimension of circles in the plane is 3.

Solution: It is clear that any two points can be shattered by a circle. Any three non-colinear points can also be shattered. Now, given any four points, there are three cases. First, the convex hull of the four points may be a triangle. If so, labeling the points on the triangle as positive and the point inside as negative is a dichotomy that cannot be realized by a circle. Second, if the convex hull of the four points is a quadrilateral, then choosing the farther of the two pairs of diagonally opposite points as positive and the other two as negative is a dichotomy that cannot be realized. Finally, if the four points are colinear, the dichotomy of alternating positives and negatives cannot be realized. Thus, the VC-dimension of circles in the plane is 3.

(b) Let $H_1$ and $H_2$ be two families of functions mapping from $X$ to $\{0, 1\}$ and let $H = \{h_1 h_2 \colon h_1 \in H_1, h_2 \in H_2\}$ be their product. Show that $\Pi_H(m) \le \Pi_{H_1}(m)\,\Pi_{H_2}(m)$.

Solution: By definition,
$$\Pi_H(m) = \max_{\{x_1, \ldots, x_m\} \subseteq X} \big|\{ (h_1(x_1) h_2(x_1), \ldots, h_1(x_m) h_2(x_m)) \colon h_1 \in H_1,\ h_2 \in H_2 \}\big|$$
$$\le \Big[\max_{\{x_1, \ldots, x_m\} \subseteq X} \big|\{ (h_1(x_1), \ldots, h_1(x_m)) \colon h_1 \in H_1 \}\big|\Big] \Big[\max_{\{x_1, \ldots, x_m\} \subseteq X} \big|\{ (h_2(x_1), \ldots, h_2(x_m)) \colon h_2 \in H_2 \}\big|\Big]$$
$$= \Pi_{H_1}(m)\,\Pi_{H_2}(m).$$

(c) Give an upper bound on the VC-dimension of the family of intersections of $k$ circles in the plane.

Solution: Let $H = \{ \cap_{i=1}^k h_i \colon h_i \in H_c \}$, where $H_c$ is the family of circles in the plane. Then $H$ is the family of intersections of $k$ circles (the indicator of an intersection is the product of the indicators, so (b) applies). From (a), the VC-dimension of $H_c$ is 3. Let $m$ be the VC-dimension of $H$. Sauer's lemma and (b) imply
$$2^m = \Pi_H(m) \le (\Pi_{H_c}(m))^k \le \Big(\frac{em}{3}\Big)^{3k}.$$

Let $n = 6k \log_2(3k)$. If we can show that $2^n > (en/3)^{3k}$, then by definition, $m < n = 6k \log_2(3k)$. In fact,
$$\Big(\frac{en}{3}\Big)^{3k} < 2^n \iff 3k \log_2\!\big(2ek \log_2(3k)\big) < 6k \log_2(3k) \iff \log_2\!\big(2ek \log_2(3k)\big) < \log_2(9k^2) \iff 2ek \log_2(3k) < 9k^2 \iff \log_2(3k) < \frac{9k}{2e}.$$
The last inequality holds for $k = 2$, since $\log_2(6) \approx 2.6 < 9/e \approx 3.3$. Furthermore, the derivative of the left-hand side is $\frac{1}{k \ln 2} \le 0.75$ for $k \ge 2$, while the derivative of the right-hand side is $\frac{9}{2e} \approx 1.65$. Therefore $\log_2(3k) < \frac{9k}{2e}$ for all $k \ge 2$. Finally, when $k = 1$ the VC-dimension of $H$ is 3, which is less than $6 \log_2 3$. Therefore, $\mathrm{VCdim}(H) < 6k \log_2(3k)$.

2. VC-dimension of decision trees. A full binary tree is a tree in which each node is either a leaf, or an internal node admitting exactly two child nodes.

(a) Show that a full binary tree with $n$ internal nodes has exactly $n + 1$ leaves (Hint: you can proceed by induction).

Solution: The proof is by induction on the number of internal nodes. For $n = 1$, there are exactly two leaves, that is $n + 1$. Assume now that the equality holds for all trees with at most $n$ internal nodes. Let $T$ be a tree with $n + 1$ internal nodes. The root node admits two subtrees $T_L$ and $T_R$, each with at most $n$ internal nodes. By induction, the number of leaves of $T_L$ is $\mathrm{inodes}(T_L) + 1$ and similarly the number of leaves of $T_R$ is $\mathrm{inodes}(T_R) + 1$. Thus, the number of leaves of $T$ is $\mathrm{leaves}(T_L) + \mathrm{leaves}(T_R) = \mathrm{inodes}(T_L) + \mathrm{inodes}(T_R) + 2 = \mathrm{inodes}(T) + 1$, taking into account the root node.

(b) A binary decision tree is a full binary tree with each leaf labeled with $+1$ or $-1$ and each internal node labeled with a question. A binary decision tree classifies a point as follows: starting with the root of the tree, if the internal node question applied to the point admits a positive answer, then the current node becomes the right child, otherwise it becomes the left child.
This is repeated until a leaf node is reached. The label assigned to the point is then the sign of that leaf node. Suppose the node questions are of the form $x_i > 0$, $i \in [1, N]$. Show that the VC-dimension of the set of binary decision trees with $n$ internal nodes in dimension $N$ is at most $(2n + 1) \log_2(N + 2)$ (Hint: bound the cardinality of the set). Use that to derive an upper bound on the Rademacher complexity of that set.

Solution: A binary decision tree with $n$ internal nodes has exactly $n + 1$ leaves. Each internal node can be labeled with an integer from $\{1, \ldots, N\}$ indicating which dimension is queried to make a binary split, and each leaf can be labeled with $\pm 1$ to indicate the classification made at that leaf. Fix an ordering of the nodes and leaves and consider all possible labelings of this sequence. There can be no more than $(N + 2)^{2n + 1}$ distinct binary trees and, thus, the VC-dimension of this finite set of hypotheses can be no larger than $\log_2\big((N + 2)^{2n + 1}\big) = (2n + 1) \log_2(N + 2) = O(n \log N)$. The Rademacher complexity can then be bounded in terms of the VC-dimension $d$ using
$$R_m(H) \le \sqrt{\frac{2d(\log m + 1)}{m}} \le \sqrt{\frac{2(2n + 1) \log_2(N + 2)(\log m + 1)}{m}}.$$

C. Support Vector Machines

1. Download and install the libsvm software library from:
http://www.csie.ntu.edu.tw/~cjlin/libsvm/

2. Consider the spambase data set http://archive.ics.uci.edu/ml/datasets/spambase. Download a shuffled version of that dataset from
http://www.cs.nyu.edu/~mohri/ml17/spambase.data.shuffled
Use the libsvm scaling tool to scale the features of all the data. Use the first 3000 examples for training, the last 1601 for testing. The scaling parameters should be computed only on the training data and then applied to the test data.
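The intended tool here is libsvm's svm-scale utility. As an equivalent illustration only, the Python sketch below reproduces the same protocol in NumPy: the file name is an assumed local path, and the layout follows the UCI spambase description (57 features, 0/1 spam label in the last column).

```python
import numpy as np

DATA_PATH = "spambase.data.shuffled"  # assumed local copy of the shuffled file

data = np.loadtxt(DATA_PATH, delimiter=",")
X, y = data[:, :57], np.where(data[:, 57] > 0, 1, -1)  # map {0,1} labels to {-1,+1}

X_train, y_train = X[:3000], y[:3000]  # first 3000 examples for training
X_test, y_test = X[3000:], y[3000:]    # last 1601 examples for testing

# Per-feature scaling parameters computed on the training split only, then
# applied unchanged to the test split -- the same behavior as saving
# svm-scale's range file with -s and reusing it with -r.
lo, hi = X_train.min(axis=0), X_train.max(axis=0)
span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant features

def scale(A):
    return 2.0 * (A - lo) / span - 1.0  # svm-scale's default range [-1, 1]

X_train_s, X_test_s = scale(X_train), scale(X_test)
```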

3. Consider the binary classification task that consists of predicting whether an e-mail message is spam using the 57 features. Use SVMs combined with polynomial kernels to tackle this binary classification problem. To do that, randomly split the training data into ten equal-sized disjoint sets. For each value of the polynomial degree, $d = 1, 2, 3, 4$, plot the average cross-validation error plus or minus one standard deviation as a function of $C$ (let the other parameters of polynomial kernels in libsvm be equal to their default values), varying $C$ in powers of 2, starting from a small value $C = 2^{-k}$ to $C = 2^{k}$, for some value of $k$. $k$ should be chosen so that you see a significant variation in training error, starting from a very high training error to a low training error. Expect longer training times with libsvm as the value of $C$ increases.

Solution: Figure 1 shows the average cross-validation performance as a function of the regularization parameter $C$. Based on Figure 1, we choose $C^* = 2^7$ and $d^* = 1$. (Note: your values of $C$ and $d$ may differ.)

Figure 1: Average error according to 10-fold cross-validation, with error bars indicating one standard deviation.
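A minimal sketch of this cross-validation sweep, using scikit-learn's SVC (a libsvm wrapper) instead of the libsvm command line; it assumes the scaled X_train_s and y_train from the sketch above, and the grid endpoints are illustrative rather than prescribed.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

gamma = 1.0 / X_train_s.shape[1]  # libsvm's default gamma is 1/num_features
for d in (1, 2, 3, 4):
    for k in range(-6, 15):
        C = 2.0 ** k
        # coef0=0.0 matches libsvm's default polynomial kernel parameters.
        clf = SVC(C=C, kernel="poly", degree=d, gamma=gamma, coef0=0.0)
        errs = 1.0 - cross_val_score(clf, X_train_s, y_train, cv=10)
        print(f"d={d} C=2^{k:+d}: cv error {errs.mean():.3f} +/- {errs.std():.3f}")
```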

4. Let $(C^*, d^*)$ be the best pair found previously. Fix $C$ to be $C^*$. Plot the ten-fold cross-validation error and the test errors for the hypotheses obtained as a function of $d$. Plot the average number of support vectors obtained as a function of $d$. How many of the support vectors lie on the margin hyperplanes?

Solution: Figure 2 shows the cross-validation error and test error for $C$ fixed at $C^* = 2^7$, together with the number of support vectors and of support vectors on the margin hyperplanes. Note that the CV error is slightly higher than the test error. The number of marginal support vectors is calculated by finding the number of support vectors corresponding to a slack of zero.

Figure 2: Cross-validation and test error versus degree (left panel); number of support vectors and marginal support vectors versus degree (right panel).

5. For any $d$, let $K_d$ denote the polynomial kernel of degree $d$. Show that for any fixed integer $u > 0$,
$$G_u = \frac{2}{u(u + 1)} \sum_{1 \le i \le j \le u} K_i K_j$$
is a PDS kernel. Use SVMs combined with the polynomial kernel $G_4$ to tackle the same binary classification problem as in the previous questions: as in the previous questions, use ten-fold cross-validation to determine the best value of $C$; report the ten-fold cross-validation error and the test error of the hypothesis obtained when training with $G_u$.

Solution: It follows from the closure properties of PDS kernels (PDS kernels are closed under sums and products, hence under this nonnegative combination) that $G_u$ is a PDS kernel. Figure 3 shows the cross-validation error and test error for different values of $C$. The optimal value of $C$ is $C^* = 2^{13}$; the validation error for $C^*$ is 0.067 and the test error is 0.058. (Note again that your values may differ.)

Figure 3: Cross-validation error for $G_4$.
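A sketch of training with $G_4$ via a precomputed Gram matrix, under the same assumptions as the previous sketches (scaled splits in memory, scikit-learn as the libsvm wrapper) and using the normalization $2/(u(u+1))$ as reconstructed above:

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel
from sklearn.svm import SVC

def G(XA, XB, u=4, gamma=None, coef0=0.0):
    """Gram matrix of G_u = 2/(u(u+1)) * sum_{1<=i<=j<=u} K_i K_j, with K_d
    the degree-d polynomial kernel; entrywise products of PDS Gram matrices
    remain PDS (Schur product theorem)."""
    Ks = [polynomial_kernel(XA, XB, degree=d, gamma=gamma, coef0=coef0)
          for d in range(1, u + 1)]
    total = sum(Ks[i] * Ks[j] for i in range(u) for j in range(i, u))
    return 2.0 * total / (u * (u + 1))

clf = SVC(C=2.0 ** 13, kernel="precomputed")
clf.fit(G(X_train_s, X_train_s), y_train)
test_err = 1.0 - clf.score(G(X_test_s, X_train_s), y_test)
print(f"test error with G_4: {test_err:.3f}")
```

In the same setup, the support-vector counts asked for in question 4 are available as len(clf.support_), and the marginal support vectors are those with dual coefficients strictly below $C$ in absolute value (rows of clf.dual_coef_).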

D. Rademacher complexity

Let $p > 2$ and let $q$ be its conjugate: $1/p + 1/q = 1$. Let $H$ be the family of linear functions defined over $\{x \in \mathbb{R}^N \colon \|x\|_p \le r_p\}$ by
$$H = \big\{ x \mapsto w \cdot x \colon \|w\|_q \le \Lambda_q \big\},$$
for some $r_p > 0$ and $\Lambda_q > 0$. Give an upper bound on $R_m(H)$ in terms of $\Lambda_q$ and $r_p$, assuming that for $p > 2$ the following inequality holds for all $z_1, \ldots, z_m \in \mathbb{R}$:
$$\mathbb{E}_\sigma\Big[\Big|\sum_{i=1}^m \sigma_i z_i\Big|^p\Big] \le \Big(\frac{p}{2}\Big)^{p/2} \Big[\sum_{i=1}^m z_i^2\Big]^{p/2}.$$

Solution: For a sample $S = (x_1, \ldots, x_m)$,
$$
\begin{aligned}
R_S(H) &= \frac{1}{m} \mathbb{E}_\sigma\Big[\sup_{\|w\|_q \le \Lambda_q} \sum_{i=1}^m \sigma_i\, w \cdot x_i\Big] \\
&= \frac{1}{m} \mathbb{E}_\sigma\Big[\sup_{\|w\|_q \le \Lambda_q} w \cdot \sum_{i=1}^m \sigma_i x_i\Big] \\
&= \frac{\Lambda_q}{m} \mathbb{E}_\sigma\Big[\Big\|\sum_{i=1}^m \sigma_i x_i\Big\|_p\Big] && \text{(def. of dual norm)} \\
&= \frac{\Lambda_q}{m} \mathbb{E}_\sigma\Big[\Big[\sum_{j=1}^N \Big|\sum_{i=1}^m \sigma_i x_{ij}\Big|^p\Big]^{1/p}\Big] && \big(\|x\|_p = [\textstyle\sum_j |x_j|^p]^{1/p},\ \text{def. of norm}\big) \\
&\le \frac{\Lambda_q}{m} \Big[\sum_{j=1}^N \mathbb{E}_\sigma\Big|\sum_{i=1}^m \sigma_i x_{ij}\Big|^p\Big]^{1/p} && \text{(Jensen's ineq.)} \\
&\le \frac{\Lambda_q}{m} \Big[\sum_{j=1}^N \Big(\frac{p}{2}\Big)^{p/2} \Big[\sum_{i=1}^m x_{ij}^2\Big]^{p/2}\Big]^{1/p} && \text{(assumption)} \\
&\le \Lambda_q \sqrt{\frac{p}{2m}}\, \Big[\frac{1}{m} \sum_{i=1}^m \|x_i\|_p^2\Big]^{1/2} && \text{(triangle ineq. in } \ell_{p/2}) \\
&\le \Lambda_q r_p \sqrt{\frac{p}{2m}}. && (\|x_i\|_p \le r_p)
\end{aligned}
$$
The next-to-last step uses $\big[\sum_{j=1}^N (\sum_{i=1}^m x_{ij}^2)^{p/2}\big]^{2/p} \le \sum_{i=1}^m \big[\sum_{j=1}^N |x_{ij}|^p\big]^{2/p} = \sum_{i=1}^m \|x_i\|_p^2$, which holds by the triangle inequality in $\ell_{p/2}$ since $p/2 \ge 1$. Since the final bound does not depend on the sample $S$, the same bound holds for $R_m(H)$.
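As a numerical sanity check of the final bound (our own addition, with illustrative parameter values), the sketch below estimates the empirical Rademacher complexity by Monte Carlo, using the dual-norm identity $\sup_{\|w\|_q \le \Lambda_q} w \cdot v = \Lambda_q \|v\|_p$ from the first steps of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, N, p = 200, 50, 4.0            # p > 2, as in the problem statement
Lambda_q, r_p = 1.0, 1.0

# Sample points with ||x_i||_p <= r_p.
X = rng.standard_normal((m, N))
X *= rng.uniform(size=(m, 1)) * r_p / np.linalg.norm(X, ord=p, axis=1, keepdims=True)

# R_S(H) = (1/m) E_sigma[ sup_{||w||_q <= Lambda_q} w . sum_i sigma_i x_i ]
#        = (Lambda_q/m) E_sigma[ ||sum_i sigma_i x_i||_p ]   (dual norm)
trials = 5000
vals = [Lambda_q * np.linalg.norm(rng.choice([-1.0, 1.0], size=m) @ X, ord=p) / m
        for _ in range(trials)]
print("Monte Carlo estimate of R_S(H):", np.mean(vals))
print("bound Lambda_q r_p sqrt(p/(2m)):", Lambda_q * r_p * np.sqrt(p / (2 * m)))
```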