Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3

Tolstikhin Ilya

Abstract. In this lecture we will prove the VC-bound, which provides a high-probability excess risk bound for the ERM algorithm when performing binary classification over classes of finite VC dimension. This result generalizes the agnostic bound for finite classes, discussed in the previous lecture. Most of the material follows the exposition of Bousquet et al. (2004). I also invite the interested students to think about questions marked with blue. You won't get extra points for them, but you will certainly get a better understanding of the material.

Let's recall the setting and some basic facts. We have an input space X and an output space Y := {0, 1}. There is an unknown probability distribution P over X × Y. We receive a training sample S := (X_i, Y_i)_{i=1}^n of i.i.d. input-output pairs from P. We fix a set of classifiers H. We denote the expected risk for any h ∈ H as

    L(h) := P_{(X,Y)∼P}{ h(X) ≠ Y }

and the empirical risk as

    L_n(h) := (1/n) Σ_{i=1}^n 1[h(X_i) ≠ Y_i].

We introduce the Empirical Risk Minimization (ERM) algorithm ĥ := ĥ(S, H):

    L_n(ĥ) = inf_{g∈H} L_n(g).

We will require the following concentration inequality, introduced in the second lecture:

Theorem 1 (Hoeffding's inequality). Let ξ_1, ..., ξ_n be independent random variables such that ξ_i ∈ [a_i, b_i], a_i, b_i ∈ R, for i = 1, ..., n, with probability one. Denote Z_n := Σ_{i=1}^n ξ_i. Then for any ε > 0 it holds that:

    P{ Z_n − E[Z_n] ≥ ε } ≤ exp( −2ε² / Σ_{i=1}^n (b_i − a_i)² ).

The same inequality holds for P{ E[Z_n] − Z_n ≥ ε }. Moreover,

    P{ |Z_n − E[Z_n]| ≥ ε } ≤ 2 exp( −2ε² / Σ_{i=1}^n (b_i − a_i)² ).

Show that the third inequality of Theorem 1 follows simply from the first two ones. The union bound is our favourite trick!
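
To make the definitions concrete, here is a small illustrative Python sketch (everything in it, the distribution P, the finite class of threshold classifiers h_t(x) = 1[x > t], and all numeric values, is made up for the example): it draws a training sample S, evaluates the empirical risk L_n(h_t) for each classifier in the class, and returns the empirical risk minimizer ĥ.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200

    # A made-up distribution P over X x Y: X uniform on [0, 1],
    # Y = 1[X > 0.3] with the label flipped with probability 0.1.
    X = rng.uniform(0.0, 1.0, size=n)
    Y = (X > 0.3).astype(int)
    flip = rng.uniform(size=n) < 0.1
    Y = np.where(flip, 1 - Y, Y)

    # A finite class H = {h_1, ..., h_N} of thresholds h_t(x) = 1[x > t].
    thresholds = np.linspace(0.0, 1.0, 21)            # N = 21 classifiers

    def empirical_risk(t, X, Y):
        # L_n(h_t) = (1/n) * sum_i 1[h_t(X_i) != Y_i]
        predictions = (X > t).astype(int)
        return np.mean(predictions != Y)

    risks = np.array([empirical_risk(t, X, Y) for t in thresholds])
    t_erm = thresholds[np.argmin(risks)]              # ERM: minimize L_n over H
    print("ERM threshold:", t_erm, "empirical risk:", risks.min())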

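The statement of Theorem 1 can likewise be checked by simulation (again a purely illustrative sketch with made-up numbers: ξ_i uniform on [0, 1], so Σ_i (b_i − a_i)² = n); the Monte Carlo frequency of the deviation should stay below the Hoeffding bound.

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials, eps = 50, 50_000, 5.0

    # xi_i independent and bounded in [0, 1]; here E[Z_n] = n / 2.
    xi = rng.uniform(0.0, 1.0, size=(trials, n))
    Z = xi.sum(axis=1)

    empirical = np.mean(Z - n * 0.5 >= eps)
    bound = np.exp(-2 * eps ** 2 / n)     # exp(-2 eps^2 / sum_i (b_i - a_i)^2)
    print("P{Z_n - E[Z_n] >= eps} ~", empirical, "  Hoeffding bound:", bound)
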
2 Agnostic bound for finite classes

Let's shortly recall the agnostic excess risk bound for finite classes, introduced in the second lecture. We will provide a slightly modified proof leading to minor changes in the constant factors:

Theorem 2. Assume H = {h_1, ..., h_N}. Then for any δ > 0, with probability larger than 1 − δ the following holds:

    L(ĥ) ≤ min_{i=1,...,N} L(h_i) + √( 2 (log N + log(2/δ)) / n ).    (Th2)

Proof. For our further discussion it will be useful to recall the idea behind a proof. Assuming h* is the minimizer of the expected risk over H we may write:

    L(ĥ) − L(h*) = L(ĥ) − L_n(ĥ) + L_n(ĥ) − L_n(h*) + L_n(h*) − L(h*)
                 ≤ L(ĥ) − L_n(ĥ) + L_n(h*) − L(h*)                                  (*)
                 ≤ sup_{h∈H} ( L(h) − L_n(h) ) + sup_{h∈H} ( L_n(h) − L(h) )
                 ≤ 2 sup_{h∈H} | L(h) − L_n(h) |.                                    (1)

Next we write:

    P{ sup_{h∈H} |L(h) − L_n(h)| ≥ ε } = P{ ∪_{i=1,...,N} { |L(h_i) − L_n(h_i)| ≥ ε } }    (2)
                                        ≤ Σ_{i=1}^N P{ |L(h_i) − L_n(h_i)| ≥ ε },         (3)

where we used the union bound in the last line. We may now apply Hoeffding's inequality of Theorem 1 and get:

    P{ sup_{h∈H} |L(h) − L_n(h)| ≥ ε } ≤ N · 2 e^{−2nε²} = 2N e^{−2nε²}.

We want the rhs of the previous inequality to be smaller than δ. In other words, we want to find ε such that:

    δ = 2N e^{−2nε²}.

Solving the equation for ε we get:

    ε = √( log(2N/δ) / (2n) ).

Note that for this choice of ε we have P{ sup_{h∈H} |L(h) − L_n(h)| ≥ ε } ≤ δ, or equivalently P{ sup_{h∈H} |L(h) − L_n(h)| < ε } ≥ 1 − δ. In other words, with probability larger than 1 − δ we have

    sup_{h∈H} |L(h) − L_n(h)| ≤ √( log(2N/δ) / (2n) ).

Inserting this bound back to (1) we conclude the proof.
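
To get a feeling for the size of the excess-risk term in (Th2), one can simply evaluate it numerically (an illustration with made-up values of N, n and δ, using the constants exactly as they appear in (Th2)):

    import numpy as np

    def finite_class_excess_term(N, n, delta):
        # sqrt(2 * (log N + log(2 / delta)) / n), the last term of (Th2)
        return np.sqrt(2.0 * (np.log(N) + np.log(2.0 / delta)) / n)

    for n in [100, 1_000, 10_000, 100_000]:
        print(n, finite_class_excess_term(N=1000, n=n, delta=0.05))

The printed values shrink like 1/√n: the excess-risk term vanishes as the sample size grows, which is exactly the behaviour discussed again at the end of the lecture.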

Try to slightly improve this result. You may replace √( 2 (log N + log(2/δ)) / n ) in the upper bound with √( (log N + log(2/δ)) / (2n) ) + √( log(2/δ) / (2n) ). For this, get back to (*) and do something smarter. Notice that h* does not depend on S, so why upper bound the last two terms with a supremum?

3 One step further: infinite classes H, VC-bound

The main goal of this lecture is to drop the assumption of Theorem 2 that the class H is finite. Now we assume that H may be infinite. Actually, there can be uncountably many classifiers in H (just think about linear classifiers in R^d, or simply about thresholds in one dimension).

3.1 Spoiler

Before even introducing all the necessary definitions, let us start with the statement of the theorem which we are going to prove.

Theorem 3 (VC-bound). For any δ > 0, with probability larger than 1 − δ it holds that:

    L(ĥ) ≤ inf_{g∈H} L(g) + 2 √( 2 (log S_H(2n) + log(4/δ)) / n ).

Compare this bound to (Th2). It looks almost the same, but N is replaced with S_H(2n), a quantity known as the growth function, which will be introduced later in the proof. For now it is instructive to note the similarity between these two results: perhaps it means that we can proceed with the same (or almost the same) proof, where, magically, the N events appearing on lines (2)–(3) will be eventually replaced with S_H(2n) events? It turns out that this is indeed the case! In the following we present the proof of Theorem 3.

3.2 Debugging the proof

Can we still repeat the proof of Theorem 2? Let's assume for now that there is h* ∈ H such that L(h*) = inf_{g∈H} L(g). (Show that generally this is not true.) It turns out that we can still repeat the first steps, but we can no longer apply the union bound. Indeed, the union bound P(∪_i A_i) ≤ Σ_i P(A_i) holds at most for a countable set of events A_i. In our case, as we already mentioned, we may end up with uncountably many events. In summary, we can not apply step (2)–(3) any more.

Let's try to find a workaround. What is actually causing the problem? Note that L_n(h) appearing in lines (2) and (3) still takes only finitely many values as h runs through H (prove this yourself!). If we had only L_n(h) appearing inside of the probability sign in (2), we could still enumerate all the different values of L_n(h), get back to finitely many events, and proceed with all the previous steps. The real problem is the L(h) term, which also appears in the events of (2). In principle, L(h) can take any value between 0 and 1 for h ∈ H (prove this yourself!). This is the reason we may end up with uncountably many events. Fortunately, the following nontrivial inequality helps us to get rid of the adversarial L(h) term:

Lemma 4 (Symmetrization inequality). Assume S' := (X'_i, Y'_i)_{i=1}^n is an independent copy of S, that is, S ∪ S' forms a sequence of 2n i.i.d. input-output pairs distributed according to P. Denote

    L'_n(h) := (1/n) Σ_{i=1}^n 1[h(X'_i) ≠ Y'_i].

Then for any ε > 0 such that nε² ≥ 2, it holds that:

    P_S{ sup_{h∈H} ( L(h) − L_n(h) ) ≥ ε } ≤ 2 P_{S,S'}{ sup_{h∈H} ( L'_n(h) − L_n(h) ) ≥ ε/2 }.
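
The "finitely many values" observation can be made tangible with a small sketch for the (hypothetical, chosen only for illustration) class of one-dimensional thresholds h_t(x) = 1[x > t]: although there are uncountably many thresholds t, on a fixed sample of size n they produce at most n + 1 distinct loss vectors (1[h_t(X_1) ≠ Y_1], ..., 1[h_t(X_n) ≠ Y_n]), so L_n(h_t) takes only a handful of distinct values.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 10
    X = rng.uniform(size=n)
    Y = rng.integers(0, 2, size=n)

    # All thresholds lying between two consecutive sorted points behave identically,
    # so it suffices to try one threshold per gap (plus one below all points).
    candidate_ts = np.concatenate(([-np.inf], np.sort(X)))

    loss_vectors = set()
    for t in candidate_ts:
        predictions = (X > t).astype(int)
        loss_vectors.add(tuple((predictions != Y).astype(int)))

    print("sample size n:", n)
    print("distinct loss vectors:", len(loss_vectors))          # at most n + 1
    print("distinct values of L_n:", sorted({sum(v) / n for v in loss_vectors}))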

The same inequality also holds for sup_{h∈H} ( L_n(h) − L(h) ).

3.3 Modifying the proof: getting rid of L(h)

Now, let us return to the beginning and try to apply this result:

    L(ĥ) − L(h*) = L(ĥ) − L_n(ĥ) + L_n(ĥ) − L_n(h*) + L_n(h*) − L(h*)
                 ≤ L(ĥ) − L_n(ĥ) + L_n(h*) − L(h*)
                 ≤ sup_{h∈H} ( L(h) − L_n(h) ) + sup_{h∈H} ( L_n(h) − L(h) ).

As we already know, if for two events A and B it holds that A ⊆ B, then necessarily P(A) ≤ P(B). This gives us

    P{ L(ĥ) − L(h*) ≥ ε } ≤ P{ sup_{h∈H} ( L(h) − L_n(h) ) + sup_{h∈H} ( L_n(h) − L(h) ) ≥ ε }.    (4)

Also note that, by the same reason, for any random variables a and b we have

    P{ a + b ≥ ε } ≤ P{ {a ≥ ε/2} ∪ {b ≥ ε/2} } ≤ P{ a ≥ ε/2 } + P{ b ≥ ε/2 }.

Applying this to (4) and using Lemma 4 we get:

    P{ L(ĥ) − L(h*) ≥ ε } ≤ P{ sup_{h∈H} ( L(h) − L_n(h) ) ≥ ε/2 } + P{ sup_{h∈H} ( L_n(h) − L(h) ) ≥ ε/2 }
                          ≤ 4 P_{S,S'}{ sup_{h∈H} ( L'_n(h) − L_n(h) ) ≥ ε/2 }.    (5)

At this point, note that no matter what h is, L'_n(h) − L_n(h) can take only finitely many values (prove this yourself!). The value of L'_n(h) − L_n(h) depends only on the projection of H on the double sample S ∪ S', where for any sample S_m := (X_j, Y_j)_{j=1}^m we define a projection in the following way:

    H_{S_m} := { ( 1[h(X_1) ≠ Y_1], 1[h(X_2) ≠ Y_2], ..., 1[h(X_m) ≠ Y_m] ) : h ∈ H } ⊆ {0, 1}^m.

Note that H_{S∪S'} is a subset of {0, 1}^{2n}, and thus its cardinality card(H_{S∪S'}) is upper bounded by 2^{2n}. We may write

    P{ L(ĥ) − L(h*) ≥ ε } ≤ 4 P_{S,S'}{ max_{v ∈ H_{S∪S'}} ( L'_n(v) − L_n(v) ) ≥ ε/2 },

where we have overloaded the notations L_n(v) and L'_n(v) in a natural way. All in all, it seems like we may now proceed with the original steps (2)–(3) to bound the rhs of the previous inequality, since the supremum is now over a finite set. This is indeed what we did during the lecture, but the thing is, this step is not quite correct. Notice that the union bound assumes that the events A_i are fixed. In our case, there are finitely many events A_v := { L'_n(v) − L_n(v) ≥ ε/2 } indexed by v, but they all depend on the random samples S and S', so the union bound (at least in its usual form) can not be applied.
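
For intuition about how loose the trivial bound card(H_{S∪S'}) ≤ 2^{2n} can be, here is a quick count on a double sample for the same illustrative threshold class as before (all data made up):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 12
    XX = rng.uniform(size=2 * n)             # features of the double sample S u S'
    YY = rng.integers(0, 2, size=2 * n)      # labels of the double sample

    candidate_ts = np.concatenate(([-np.inf], np.sort(XX)))
    projection = {tuple(((XX > t).astype(int) != YY).astype(int)) for t in candidate_ts}

    print("card(H_{S u S'}) =", len(projection))    # at most 2n + 1 = 25 for thresholds
    print("trivial bound 2^(2n) =", 2 ** (2 * n))   # 16,777,216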

3.4 Another neat trick: Rademacher symmetrization

Instead, we will proceed with a trick commonly known as Rademacher symmetrization. The next lines are taken from Section 12.4 of Devroye et al. (1996). Introduce random variables σ_1, ..., σ_n which are all independent (also independent from S and S') and take values −1 and +1 with probabilities 0.5. Rewrite (5) in the following way:

    P{ L(ĥ) − L(h*) ≥ ε } ≤ 4 P_{S,S'}{ sup_{h∈H} (1/n) Σ_{i=1}^n ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] ) ≥ ε/2 }

and notice that the distribution of

    sup_{h∈H} (1/n) Σ_{i=1}^n ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] )

is the same as the distribution of

    sup_{h∈H} (1/n) Σ_{i=1}^n σ_i ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] )

(prove this yourself!). We may thus write

    P{ L(ĥ) − L(h*) ≥ ε } ≤ 4 P_{S,S'}{ sup_{h∈H} (1/n) Σ_{i=1}^n ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] ) ≥ ε/2 }
                          = 4 P_{σ,S,S'}{ sup_{h∈H} (1/n) Σ_{i=1}^n σ_i ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] ) ≥ ε/2 }.

Next we use the tower rule of expectation, which can be written for any event A and any random variable Z as P(A) = E_Z[ P(A | Z) ]. This gives us

    P{ L(ĥ) − L(h*) ≥ ε } ≤ 4 E_{S,S'}[ P_σ{ sup_{h∈H} (1/n) Σ_{i=1}^n σ_i ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] ) ≥ ε/2 | S ∪ S' } ].

It is left to bound the conditional probability appearing inside of the expected value. Using our definition of the projection we may rewrite

    P_σ{ sup_{h∈H} (1/n) Σ_{i=1}^n σ_i ( 1[h(X'_i) ≠ Y'_i] − 1[h(X_i) ≠ Y_i] ) ≥ ε/2 | S ∪ S' }
        = P_σ{ max_{v ∈ H_{S∪S'}} (1/n) Σ_{i=1}^n σ_i ( v'_i − v_i ) ≥ ε/2 | S ∪ S' },

where we once again (perhaps confusingly) used v_i and v'_i to denote the indicators 1[h_v(X_i) ≠ Y_i] and 1[h_v(X'_i) ≠ Y'_i], where h_v ∈ H is any classifier with projection equal to v. Notice that, because we conditioned on S and S', these sets are now fixed, and thus the projection H_{S∪S'} is not random any more, but instead just some fixed subset of {0, 1}^{2n}. We may now safely use our initial (2)–(3) trick (the union bound) and write

    P_σ{ max_{v ∈ H_{S∪S'}} (1/n) Σ_{i=1}^n σ_i ( v'_i − v_i ) ≥ ε/2 | S ∪ S' } ≤ Σ_{v ∈ H_{S∪S'}} P_σ{ (1/n) Σ_{i=1}^n σ_i ( v'_i − v_i ) ≥ ε/2 | S ∪ S' }.

The individual probabilities may be again bounded using Hoeffding's inequality (prove it yourself!):

    P_σ{ (1/n) Σ_{i=1}^n σ_i ( v'_i − v_i ) ≥ ε/2 | S ∪ S' } ≤ exp( −2 (nε/2)² / (4n) ) = e^{−nε²/8}.
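
The Hoeffding step above is easy to check by simulation for one fixed pair of vectors v, v' (an illustrative sketch; the two vectors and all numbers are arbitrary and made up). Conditionally on S and S' only the signs σ_i are random, and the Monte Carlo frequency should stay below e^{−nε²/8}.

    import numpy as np

    rng = np.random.default_rng(4)
    n, eps, trials = 100, 0.4, 50_000

    # Two fixed 0/1 vectors playing the role of the projected loss vectors v and v'.
    v = rng.integers(0, 2, size=n)
    v_prime = rng.integers(0, 2, size=n)
    diff = v_prime - v                        # entries in {-1, 0, +1}

    # Only the Rademacher signs are random now.
    sigma = rng.choice([-1, 1], size=(trials, n))
    stat = (sigma * diff).mean(axis=1)        # (1/n) sum_i sigma_i (v'_i - v_i)

    print("Monte Carlo estimate of P_sigma{stat >= eps/2}:", np.mean(stat >= eps / 2))
    print("Hoeffding bound e^{-n eps^2 / 8}:", np.exp(-n * eps ** 2 / 8))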

3.5 VC combinatorics

Putting all the bits together we finally get:

    P{ L(ĥ) − L(h*) ≥ ε } ≤ 4 e^{−nε²/8} E_{S,S'}[ card(H_{S∪S'}) ].

Again, making the upper bound equal to δ and solving for ε, we get that for any δ > 0, with probability larger than 1 − δ it holds that:

    L(ĥ) ≤ inf_{g∈H} L(g) + 2 √( 2 (log E_H(2n) + log(4/δ)) / n ),

where we denoted E_H(n) := E_S[ card(H_S) ]. The quantity E_H(n) is known as the VC entropy. Obviously, the VC entropy can be upper bounded in the following (perhaps extremely crude) way:

    E_H(n) ≤ S_H(n) := sup_{S : card(S) = n} card(H_S).

All we did is replace the average (expectation) with the maximum value. The quantity S_H(n) is commonly known as the growth function. We showed that with probability larger than 1 − δ it also holds that:

    L(ĥ) ≤ inf_{g∈H} L(g) + 2 √( 2 (log S_H(2n) + log(4/δ)) / n ).

This concludes the proof of Theorem 3.

But are we satisfied with this result? The good thing about Theorem 2 is that, as the sample size n grows to infinity, the last term on the rhs of (Th2) decreases to zero, showing that the performance of ERM achieves the best possible one. Does Theorem 3 have the same behaviour? Of course, the answer depends on the growth function S_H(2n), which is defined purely by the geometry of H. As we already mentioned, the trivial upper bound gives S_H(2n) ≤ 2^{2n}. However, if we insert it in the VC-bound we end up with 2 √( 2 (2n log 2 + log(4/δ)) / n ), which does not tend to zero. An important question is: how should H look like so that log S_H(2n)/n → 0 as n → ∞? The answer to this question is hidden in the following definition:

Definition 5 (VC dimension). The VC dimension of the class H is the largest n such that S_H(n) = 2^n. If there is no such n, we say that H has infinite VC dimension.
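
Definition 5 can be tried out by brute force on simple classes. The sketch below (illustrative only; the threshold class and the sampling scheme are made up) estimates S_H(n) for one-dimensional thresholds h_t(x) = 1[x > t] by maximizing the number of distinct labelings over random samples; for this class any sample of n distinct points yields exactly n + 1 labelings, so the largest n with S_H(n) = 2^n is n = 1 and the VC dimension equals 1.

    import numpy as np

    def growth_function_thresholds(n, trials=50):
        # Estimate S_H(n) for h_t(x) = 1[x > t] by maximizing the number of
        # distinct labelings over random samples of size n.
        rng = np.random.default_rng(5)
        best = 0
        for _ in range(trials):
            X = rng.uniform(size=n)
            ts = np.concatenate(([-np.inf], np.sort(X)))
            labelings = {tuple((X > t).astype(int)) for t in ts}
            best = max(best, len(labelings))
        return best

    for n in range(1, 6):
        print(n, growth_function_thresholds(n), 2 ** n)   # S_H(n) = n + 1 vs 2^n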

The following fact establishes the polynomial growth of S_H(n) for classes H of finite VC dimension. (There is a curious history behind this lemma: it was apparently proved simultaneously by several groups around the late 60's and early 70's, including Vapnik and Chervonenkis, Sauer, and Shelah and Perles. A wonderful overview of this fact can be found in Léon Bottou's slides, available online at http://leon.bottou.org/.)

Lemma 6 (Vapnik, Chervonenkis, Sauer, Shelah). Let H be a class of VC dimension d < ∞. Then for all n it holds that

    S_H(n) ≤ Σ_{i=0}^d (n choose i),

and for all n ≥ d it holds that:

    S_H(n) ≤ (en/d)^d.

We may finally state the following bound, which behaves exactly like the one of the original Theorem 2:

Theorem 7 (VC-bound). Assume H has a VC dimension d < ∞. For any δ > 0, with probability larger than 1 − δ it holds that:

    L(ĥ) ≤ inf_{g∈H} L(g) + 2 √( 2 (d log(2en/d) + log(4/δ)) / n ).

References

Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. Introduction to statistical learning theory. Lecture Notes in Artificial Intelligence, 2004.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.