Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 7

1 Introduction

Jean Honorio, jhonorio@purdue.edu

So far we have seen some techniques for proving generalization for finite or countable hypothesis classes (e.g., the union bound), as well as for infinite hypothesis classes (e.g., primal-dual witness, Rademacher complexity).

Example 1. Let us start with a very simple example: classification of one-dimensional data. Our task is to assign a binary label in $\{0, 1\}$ to an input $z \in \mathbb{R}$. In this example, the hypothesis class is the set of threshold functions:

$$F = \{f : \mathbb{R} \to \{0,1\} \mid f(z) = \mathbb{1}[z > \theta],\ \theta \in \mathbb{R}\} \cup \{f : \mathbb{R} \to \{0,1\} \mid f(z) = \mathbb{1}[z < \theta],\ \theta \in \mathbb{R}\}$$

Could we use union bounds in this case? The cardinality of $F$ is equal to the number of scalars in $\mathbb{R}$.¹ Instead of counting the number of functions in $F$, we will count the possible ways in which $n$ training samples could be labeled with functions in $F$.

2 Growth function

We will assume an arbitrary domain $Z$ and a dataset $S = \{z_1, \dots, z_n\}$ containing $n$ samples, where $z_i \in Z$ for all $i$. In general, we will assume a hypothesis class $F \subseteq \{f \mid f : Z \to \{0,1\}\}$. We will use the following shorthand notation:

$$F(S) = \{(f(z_1), \dots, f(z_n)) \in \{0,1\}^n \mid f \in F\}$$

That is, $F(S)$ contains all the $\{0,1\}^n$ vectors that can be produced by applying the functions in $F$ to the dataset $S$. A natural measure of complexity is the following.

Definition 7.1. The growth function (or shatter coefficient) of the hypothesis class $F \subseteq \{f \mid f : Z \to \{0,1\}\}$ for $n$ samples is:

$$G(F, n) = \max_{S \in Z^n} |F(S)|$$

¹ $\mathbb{R}$ is not a countable set, by Cantor's diagonalization argument.
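Since the threshold class is simple, the growth function can be checked by brute force. The following Python sketch (not part of the original notes; `threshold_labelings` is a name chosen here for illustration) enumerates $F(S)$ directly, using the fact that for a finite sample only thresholds placed below the minimum, between consecutive sorted points, or above the maximum can produce distinct labelings:

```python
def threshold_labelings(S):
    """Enumerate F(S) for the threshold class of Example 1 by brute force.

    For a finite sample S, only thresholds below the minimum, between
    consecutive sorted points, or above the maximum can produce distinct
    labelings, so a finite sweep over candidate thresholds suffices.
    """
    zs = sorted(S)
    cands = [zs[0] - 1.0]
    cands += [(a + b) / 2.0 for a, b in zip(zs, zs[1:])]
    cands += [zs[-1] + 1.0]
    labelings = set()
    for t in cands:
        labelings.add(tuple(int(z > t) for z in S))  # f(z) = 1[z > theta]
        labelings.add(tuple(int(z < t) for z in S))  # f(z) = 1[z < theta]
    return labelings

for n in range(1, 7):
    S = list(range(n))  # any n distinct points realize the worst case here
    print(n, len(threshold_labelings(S)))  # prints 2n, anticipating G(F, n) = 2n
```

For this class, any set of $n$ distinct points achieves the maximum in Definition 7.1, so a single choice of $S$ suffices.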

Note that the growth function does not depend on the specific training set $S$; it is a measure of the worst case among all possible training sets. Clearly $G(F, n) \le 2^n$, but often it is much smaller.

Example 1 (continued). Assume we sort all samples in $S$ in increasing order, and recall that we have threshold functions; thus, after sorting, it is not possible to label three consecutive samples as 0, 1, 0 or as 1, 0, 1. In other words, all samples to the left of the threshold should be 0 and all samples to the right should be 1 (or, alternatively, all samples to the left should be 1 and all samples to the right should be 0). Let us see this more graphically. Each column is one of the $n$ samples, and each row is a possible $\{0,1\}^n$ vector in $F(S)$:

(1, 0, 0, 0, ..., 0)
(1, 1, 0, 0, ..., 0)
(1, 1, 1, 0, ..., 0)
 ...
(1, 1, 1, 1, ..., 1)
(0, 1, 1, 1, ..., 1)
(0, 0, 1, 1, ..., 1)
(0, 0, 0, 1, ..., 1)
 ...
(0, 0, 0, 0, ..., 0)

Clearly, $G(F, n) = 2n$ for Example 1.

3 Vapnik-Chervonenkis (VC) dimension

Definition 7.2. The VC dimension of the hypothesis class $F \subseteq \{f \mid f : Z \to \{0,1\}\}$ is:

$$VC(F) = \max\,\{ n \in \mathbb{N} \mid G(F, n) = 2^n \}$$

As with the growth function, the VC dimension does not depend on the specific training set $S$. Also, by definition, if $VC(F) = d$ then for all $n > d$ we have $G(F, n) < 2^n$ (the inequality is strict). In the following section, we show a less obvious result.
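Definition 7.2 can be explored numerically. The sketch below (illustrative only; it reuses the hypothetical `threshold_labelings` helper from Section 2) finds the largest $n$ within a cap for which $G(F, n) = 2^n$. For a general class such a finite sweep only certifies a lower bound on the VC dimension, but for the threshold class the continued example below confirms that the answer, 2, is exact.

```python
def vc_dimension_sweep(max_n=8):
    """Largest n <= max_n with G(F, n) = 2^n for the threshold class."""
    vc = 0
    for n in range(1, max_n + 1):
        S = list(range(n))
        if len(threshold_labelings(S)) == 2 ** n:
            vc = n
    return vc

print(vc_dimension_sweep())  # prints 2
```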

Example 1 (continued). As before, assume we sort all samples in $S$ in increasing order, and recall that we have threshold functions. Let us list all possible $\{0,1\}^n$ vectors and strike out the ones that are not in the set $F(S)$ (marked with ✗ below):

n = 1          n = 2          n = 3
(0)            (0, 0)         (0, 0, 0)
(1)            (0, 1)         (0, 0, 1)
               (1, 0)         (0, 1, 0)  ✗
               (1, 1)         (0, 1, 1)
                              (1, 0, 0)
                              (1, 0, 1)  ✗
                              (1, 1, 0)
                              (1, 1, 1)

G(F, 1) = 2    G(F, 2) = 4    G(F, 3) = 6

Clearly, $VC(F) = 2$ for Example 1. In fact, we previously found that $G(F, n) = 2n$, so by Definition 7.2 we have:

$$VC(F) = \max\,\{ n \in \mathbb{N} \mid G(F, n) = 2^n \} = \max\,\{ n \in \mathbb{N} \mid 2n = 2^n \} = 2$$

4 Sauer-Shelah lemma

Lemma 7.1. The growth function and the VC dimension of a hypothesis class $F \subseteq \{f \mid f : Z \to \{0,1\}\}$ fulfill:

$$G(F, n) \le \sum_{i=0}^{VC(F)} \binom{n}{i} \le (n+1)^{VC(F)}$$
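Before the proof, a quick numeric sanity check (illustrative only): for the threshold class we found $G(F, n) = 2n$ and $VC(F) = 2$, so Lemma 7.1 predicts $2n \le \binom{n}{0} + \binom{n}{1} + \binom{n}{2} \le (n+1)^2$ for all $n$.

```python
from math import comb

def sauer_bound(n, d):
    """The Sauer-Shelah bound: sum_{i=0}^{d} C(n, i)."""
    return sum(comb(n, i) for i in range(d + 1))

d = 2  # VC dimension of the threshold class in Example 1
for n in range(1, 11):
    assert 2 * n <= sauer_bound(n, d) <= (n + 1) ** d  # G(F, n) = 2n here
print("Lemma 7.1 verified numerically for n = 1..10")
```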

Proof. The right-hand side is just a consequence of the binomial theorem, so we will concentrate on the left-hand side. We will use proof by induction. First, define for clarity:

$$H(n, d) = \sum_{i=0}^{d} \binom{n}{i}$$

Since the binomial coefficient fulfills $\binom{n}{i} = \binom{n-1}{i} + \binom{n-1}{i-1}$, it is clear that:

$$H(n, d) = H(n-1, d) + H(n-1, d-1) \qquad (1)$$

We can restate the claim as follows: if $VC(F) \le d$, then:

$$G(F, n) \le H(n, d) \qquad (2)$$

Base case. We show that eq. (2) holds for $n = 1$ and all $d \ge 1$. Since we have only one sample, there are only two possible $\{0,1\}^1$ vectors, $(0)$ and $(1)$, and therefore:

$$G(F, 1) \le |\{(0), (1)\}| = 2$$

On the other hand:

$$H(1, d) = \sum_{i=0}^{d} \binom{1}{i} = \binom{1}{0} + \binom{1}{1} = 2$$

Thus $G(F, 1) \le H(1, d) = 2$, and eq. (2) holds for $n = 1$ and all $d \ge 1$.

Inductive step. Assume that eq. (2) holds for $n - 1$ and all $d \ge 1$; we show that it holds for $n$ and $d$. Fix a dataset $S$ and define:

$$S = \{z_1, z_2, \dots, z_n\}, \qquad S_2 = \{z_2, \dots, z_n\}$$

Furthermore, define:

$$F_2 = F(S_2)$$

$$F_2' = \{(f(z_2), \dots, f(z_n)) \mid f \in F \text{ such that } (\exists f' \in F)\ f'(z_1) = 1 - f(z_1) \text{ and } f'(z_i) = f(z_i) \text{ for } i = 2, \dots, n\}$$

That is, $F_2'$ contains the restrictions to $S_2$ that appear in $F(S)$ with both possible labels of $z_1$. Let us see this more graphically. For $F(S)$, each column is one of the $n$ samples, and each row is a possible $\{0,1\}^n$ vector. For $F_2$ and $F_2'$, each column is one of the $n - 1$ samples, and each row is a possible $\{0,1\}^{n-1}$ vector. Here $b_i \in \{0,1\}$.

There are three cases:                              F(S)                  F_2                F_2'
1. Two vectors in F(S) match in entries 2 to n,     (0, b_2, ..., b_n)    (b_2, ..., b_n)    (b_2, ..., b_n)
   but not in entry 1                               (1, b_2, ..., b_n)
2. A vector in F(S) is unique in entries 2 to n,    (0, b_2, ..., b_n)    (b_2, ..., b_n)    --
   and entry 1 is 0
3. A vector in F(S) is unique in entries 2 to n,    (1, b_2, ..., b_n)    (b_2, ..., b_n)    --
   and entry 1 is 1

Let $c_1$, $c_2$ and $c_3$ be the number of times cases 1, 2 and 3 occur in $S$, respectively. The next table shows the number of binary vectors in $F(S)$, $F_2$ and $F_2'$:

Case                                  |F(S)|              |F_2|              |F_2'|
1. Two matching vectors               2 c_1               c_1                c_1
2. Unique vector, entry 1 is 0        c_2                 c_2                0
3. Unique vector, entry 1 is 1        c_3                 c_3                0
Total number of binary vectors        2c_1 + c_2 + c_3    c_1 + c_2 + c_3    c_1

From the above, it is clear that:

$$|F(S)| = |F_2| + |F_2'| \qquad (3)$$

$$|F_2| \le |F(S)|, \qquad 2\,|F_2'| \le |F(S)|$$
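The decomposition in eq. (3) can be verified concretely for the threshold class (an illustrative sketch; it reuses the hypothetical `threshold_labelings` helper from Section 2), splitting on the label of $z_1$ exactly as in the proof:

```python
def split_first(labelings):
    """Compute F_2 and F_2' from F(S) by splitting on the label of z_1."""
    tails_with_0 = {v[1:] for v in labelings if v[0] == 0}
    tails_with_1 = {v[1:] for v in labelings if v[0] == 1}
    F2 = tails_with_0 | tails_with_1        # all restrictions to S_2
    F2_prime = tails_with_0 & tails_with_1  # tails realized with both labels of z_1
    return F2, F2_prime

for n in range(2, 8):
    FS = threshold_labelings(list(range(n)))
    F2, F2p = split_first(FS)
    assert len(FS) == len(F2) + len(F2p)    # eq. (3)
    print(n, len(FS), len(F2), len(F2p))    # e.g., n = 3: 6 = 4 + 2
```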

Recall that Definition 7.2 (VC dimension) depends on powers of 2. Moreover, if $F_2'$ shatters a set $T \subseteq S_2$, then, by construction of $F_2'$, the class $F$ shatters $T \cup \{z_1\}$: every labeling of $T$ is realized together with both labels of $z_1$. Thus, from the above, it is clear that if $VC(F) \le d$ then:

$$VC(F_2) \le d, \qquad VC(F_2') \le d - 1$$

Recall that the number of samples in $S$ is $n$, while the number of samples in $S_2$ is $n - 1$. From eq. (3), the inductive hypothesis applied to $F_2$ and $F_2'$, and eq. (1), we have:

$$|F(S)| = |F_2| + |F_2'| \le H(n-1, d) + H(n-1, d-1) = H(n, d)$$

Since the choice of $S$ was arbitrary, the above holds for any dataset $S$, and thus:

$$G(F, n) = \max_{S \in Z^n} |F(S)| \le H(n, d)$$

Therefore, eq. (2) holds and we have proved our claim.

5 Massart lemma and Rademacher complexity

Lemma 7.2. Let $A$ be a finite subset of $\mathbb{R}^n$, and let $\sigma = (\sigma_1, \dots, \sigma_n)$ be $n$ independent Rademacher random variables. We have:

$$\mathbb{E}_\sigma\Big[ \sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i \Big] \le \sqrt{2 \log |A|}\ \sup_{a \in A} \|a\|_2$$
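Before the proof, the bound can be sanity-checked by Monte Carlo simulation. The sketch below is illustrative only; the set $A$, its size, and the number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 50                    # dimension and |A|
A = rng.normal(size=(m, n))      # a finite subset of R^n, one point per row

trials = 20000
sigma = rng.choice([-1.0, 1.0], size=(trials, n))   # Rademacher draws
lhs = np.mean(np.max(sigma @ A.T, axis=1))          # estimates E[sup_a <sigma, a>]
rhs = np.sqrt(2 * np.log(m)) * np.max(np.linalg.norm(A, axis=1))

print(f"Monte Carlo LHS ~ {lhs:.3f} <= RHS = {rhs:.3f}")
assert lhs <= rhs
```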

Proof. For any $t > 0$ we have:

$$
\begin{aligned}
\exp\Big(t\, \mathbb{E}_\sigma\Big[\sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i\Big]\Big)
&\le \mathbb{E}_\sigma\Big[\exp\Big(t \sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i\Big)\Big] && \text{(4.a)} \\
&= \mathbb{E}_\sigma\Big[\sup_{a \in A} \exp\Big(t \sum_{i=1}^{n} \sigma_i a_i\Big)\Big] \\
&\le \sum_{a \in A} \mathbb{E}_\sigma\Big[\exp\Big(t \sum_{i=1}^{n} \sigma_i a_i\Big)\Big] \\
&= \sum_{a \in A} \prod_{i=1}^{n} \mathbb{E}_{\sigma_i}\big[\exp(t \sigma_i a_i)\big] \\
&= \sum_{a \in A} \prod_{i=1}^{n} \frac{\exp(t a_i) + \exp(-t a_i)}{2} \\
&\le \sum_{a \in A} \prod_{i=1}^{n} \exp\Big(\tfrac{1}{2} t^2 a_i^2\Big) && \text{(4.b)} \\
&= \sum_{a \in A} \exp\Big(\tfrac{1}{2} t^2 \|a\|_2^2\Big) \\
&\le |A| \exp\Big(\tfrac{1}{2} t^2 \sup_{a \in A} \|a\|_2^2\Big)
\end{aligned}
$$

where the step in eq. (4.a) follows from Jensen's inequality, and the step in eq. (4.b) follows since for all $z \in \mathbb{R}$ we have $(e^z + e^{-z})/2 \le e^{z^2/2}$. By taking logarithms and dividing by $t$ on both sides of the above, we have:

$$\mathbb{E}_\sigma\Big[\sup_{a \in A} \sum_{i=1}^{n} \sigma_i a_i\Big] \le \frac{\log |A|}{t} + \frac{t}{2} \sup_{a \in A} \|a\|_2^2$$

In order to minimize the function $f(t) = \frac{\log |A|}{t} + \frac{t}{2} \sup_{a \in A} \|a\|_2^2$, we make the derivative equal to zero and solve for $t$. That is:

$$0 = \frac{\partial f(t)}{\partial t} = -\frac{\log |A|}{t^2} + \frac{1}{2} \sup_{a \in A} \|a\|_2^2$$

Thus, $t = \frac{\sqrt{2 \log |A|}}{\sup_{a \in A} \|a\|_2}$. Plugging this back into the above, we prove our claim.
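Two small numeric checks of the steps used above (illustrative only): the scalar inequality behind eq. (4.b), and the claim that the stated $t$ minimizes $f(t)$, with minimum value equal to the bound of Lemma 7.2.

```python
import numpy as np

# (e^z + e^{-z}) / 2 = cosh(z) <= e^{z^2 / 2} for all z, used in eq. (4.b)
z = np.linspace(-10, 10, 100001)
assert np.all(np.cosh(z) <= np.exp(z ** 2 / 2))

# f(t) = log|A|/t + (t/2) s^2 attains its minimum s * sqrt(2 log|A|)
# at t* = sqrt(2 log|A|) / s, where s = sup_a ||a||_2
logA, s = np.log(50.0), 3.0
f = lambda t: logA / t + 0.5 * t * s ** 2
t_star = np.sqrt(2 * logA) / s
ts = np.linspace(0.01, 10.0, 100001)
assert np.isclose(f(t_star), s * np.sqrt(2 * logA))
assert f(t_star) <= f(ts).min() + 1e-9
print("checks passed")
```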

Lemma 7.3. Let $F \subseteq \{f \mid f : Z \to \{0,1\}\}$ be a hypothesis class. The empirical Rademacher complexity (Definition 5.2) of the hypothesis class $F$ with respect to $n$ samples is bounded as follows:

$$\hat{R}(F) \le \sqrt{\frac{2 \log G(F, n)}{n}}$$

Proof.

$$
\begin{aligned}
\hat{R}(F) &= \mathbb{E}_\sigma\Big[\sup_{h \in F} \frac{1}{n} \sum_{i=1}^{n} \sigma_i h(z_i)\Big] && \text{(5.a)} \\
&= \frac{1}{n}\, \mathbb{E}_\sigma\Big[\sup_{a \in F(S)} \sum_{i=1}^{n} \sigma_i a_i\Big] \\
&\le \frac{1}{n} \sqrt{2 \log |F(S)|}\ \sup_{a \in F(S)} \|a\|_2 && \text{(5.b)} \\
&\le \frac{1}{n} \sqrt{2 \log G(F, n)}\ \sqrt{n} && \text{(5.c)} \\
&= \sqrt{\frac{2 \log G(F, n)}{n}}
\end{aligned}
$$

where the step in eq. (5.a) follows from Definition 5.2. The step in eq. (5.b) follows from Massart's lemma (Lemma 7.2). The step in eq. (5.c) follows since $|F(S)| \le G(F, n)$ for every $S \in Z^n$, and since for all $a \in \{0,1\}^n$ we have $\|a\|_2 \le \sqrt{n}$.
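Putting everything together, the sketch below (illustrative only; it reuses the hypothetical `threshold_labelings` helper from Section 2) compares a Monte Carlo estimate of the empirical Rademacher complexity of the threshold class against the bound of Lemma 7.3 with $G(F, n) = 2n$:

```python
import numpy as np

def empirical_rademacher(labelings, trials=20000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_{a in F(S)} (1/n) sum_i sigma_i a_i ]."""
    A = np.array(sorted(labelings), dtype=float)  # |F(S)| x n matrix
    n = A.shape[1]
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(trials, n))
    return np.mean(np.max(sigma @ A.T, axis=1)) / n

for n in [5, 20, 100]:
    est = empirical_rademacher(threshold_labelings(list(range(n))))
    bound = np.sqrt(2 * np.log(2 * n) / n)  # Lemma 7.3 with G(F, n) = 2n
    print(f"n = {n:4d}   estimate ~ {est:.3f}   bound = {bound:.3f}")
```

Both quantities decay roughly like $\sqrt{\log n / n}$, which is the qualitative content of Lemma 7.3.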