VC dimension and Model Selection
Overview
- PAC model: review
- VC dimension: definition, examples
- Sample complexity: lower bound, upper bound
- Model selection
PAC model: Setting
- A distribution $D$ over $X$ (unknown)
- Target function: $c_t \in C$, where $c_t : X \to \{0,1\}$
- Hypothesis: $h \in H$, where $h : X \to \{0,1\}$
- Error probability: $error(h) = \Pr_{D}[h(x) \neq c_t(x)]$
- Oracle: $EX(c_t, D)$
PAC model: Definition
- $C$ and $H$ are concept classes over $X$.
- $C$ is PAC learnable by $H$ if there exists an algorithm $A$ such that:
  - for any distribution $D$ over $X$ and any $c_t \in C$,
  - for every input $\epsilon$ and $\delta$,
  - $A$, with access to $EX(c_t, D)$, outputs a hypothesis $h \in H$
  - such that with probability $1-\delta$ we have $error(h) < \epsilon$.
- Complexities: sample size and running time.
PAC model: last week
- For a finite hypothesis class $H$, sample size $m$:
  - Realizable case: $m > \frac{1}{\epsilon}\ln\frac{|H|}{\delta}$
  - Non-realizable case: $m > \frac{2}{\epsilon^2}\ln\frac{2|H|}{\delta}$
- Impossibility results:
  - $m > \frac{1}{2}\log|H|$
  - $m > \frac{1}{4\epsilon}\ln\frac{1}{\delta}$
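To make the finite-class bounds concrete, here is a minimal Python sketch that evaluates them; the function names and the example values of $|H|$, $\epsilon$, $\delta$ are illustrative, and natural logarithms are assumed.

```python
import math

def sample_size_realizable(h_size, eps, delta):
    """Realizable-case bound: m > (1/eps) * ln(|H| / delta)."""
    return math.ceil(math.log(h_size / delta) / eps)

def sample_size_agnostic(h_size, eps, delta):
    """Non-realizable bound: m > (2/eps^2) * ln(2|H| / delta)."""
    return math.ceil(2.0 * math.log(2.0 * h_size / delta) / eps ** 2)

if __name__ == "__main__":
    # illustrative values: |H| = 2^20 hypotheses, eps = 0.1, delta = 0.05
    print(sample_size_realizable(2 ** 20, 0.1, 0.05))  # ~169
    print(sample_size_agnostic(2 ** 20, 0.1, 0.05))    # ~3511
```

Note how the agnostic bound pays a $1/\epsilon^2$ rather than $1/\epsilon$ dependence, which dominates the difference between the two printed values.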
VC dimension: motivation
- Infinite hypothesis classes: thresholds, rectangles.
- Today: the general notion of VC dimension.
- Applies to both the realizable and the non-realizable case.
VC dimension: definition
- Notation: $C$ is a concept class; $S$ is a sample.
- Projection: $\Pi_C(S) = \{ c|_S : c \in C \}$
- Shattering: $C$ shatters $S$ if $|\Pi_C(S)| = 2^{|S|}$.
- VC dimension: the size of the largest shattered set,
  $\max\{ d : \exists S,\ |S| = d,\ |\Pi_C(S)| = 2^{|S|} \}$
- If no maximum exists, the VC dimension is infinite: for every $d$ there is a shattered set of size $d$.
VC dimension: thresholds
- $c_\theta(x) = I(x \geq \theta)$
- VC $\geq 1$: take $S = \{0.5\}$: $c_{0.3}(0.5) = 1$ and $c_{0.6}(0.5) = 0$.
- VC $< 2$: take $S = \{z_1, z_2\}$ and assume $z_1 < z_2$. The labeling $c(z_1) = 1$, $c(z_2) = 0$ is unrealizable, since $z_1 \geq \theta$ implies $z_2 \geq \theta$.
- Hence VC-dim = 1.
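The shattering definition is easy to check by brute force for small finite cases. Below is a minimal Python sketch; the helper names `projections` and `shatters` are ours, and the threshold grid is an illustrative discretization of the class.

```python
def projections(hypotheses, sample):
    """Pi_C(S): the set of distinct labelings the class induces on the sample."""
    return {tuple(h(x) for x in sample) for h in hypotheses}

def shatters(hypotheses, sample):
    """C shatters S iff |Pi_C(S)| = 2^|S|."""
    return len(projections(hypotheses, sample)) == 2 ** len(sample)

# Thresholds c_theta(x) = I(x >= theta), over a small grid of thetas.
thresholds = [lambda x, t=t: int(x >= t) for t in (0.0, 0.3, 0.6, 1.0)]

print(shatters(thresholds, [0.5]))       # True: VC-dim >= 1
print(shatters(thresholds, [0.2, 0.5]))  # False: the labeling (1, 0) is never induced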
VC dimension: union of intervals
- Class: unions of finitely many (but unboundedly many) intervals on $[0,1]$.
- Any $d$ points can be shattered, for every $d$: VC-dim = $\infty$.
VC dimension: convex polygons
- Class: convex polygons in the plane (with any number of vertices).
- Any $d$ points can be shattered, for every $d$: place the points on a circle and, for any labeling, take the polygon whose vertices are exactly the positive points. VC-dim = $\infty$.
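A quick sanity check of the interval claim: for any labeling of any finite point set on the line, a union of tiny intervals around the positive points realizes it. A minimal sketch, with illustrative helper names, assuming the points are spaced further apart than `width`:

```python
def intervals_for_labeling(points, labels, width=1e-6):
    """Realize an arbitrary labeling of points on the line by a union of
    intervals: one tiny interval around each positive point."""
    return [(x - width, x + width) for x, y in zip(points, labels) if y == 1]

def classify(intervals, x):
    return int(any(a <= x <= b for a, b in intervals))

points = [0.1, 0.3, 0.5, 0.7, 0.9]
labels = [1, 0, 1, 1, 0]
ivs = intervals_for_labeling(points, labels)
print([classify(ivs, x) for x in points] == labels)  # True, for every labeling
```

Since this works for every $d$ and every labeling, no finite sample bound based on VC dimension applies to this class.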
VC dimension: hyperplanes
- $c_{w,\theta}(x) = sign(\sum_i w_i x_i + \theta)$
- VC dimension $\geq d+1$: take $S = \{0, e_1, \ldots, e_d\}$.
- Given a labeling $L \in \{-1,+1\}^{d+1}$, define $c_{w,\theta}$ by $w_i = L(e_i)$ and $\theta = L(0)/2$.
- Then $c_{w,\theta}(0) = sign(\theta) = L(0)$ and $c_{w,\theta}(e_i) = sign(w_i + \theta) = L(e_i)$.
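The lower-bound construction can be verified exhaustively for small $d$. A minimal NumPy sketch (the function name is ours):

```python
import itertools

import numpy as np

def shatter_witness(d):
    """Verify that halfspaces sign(w.x + theta) shatter S = {0, e_1, ..., e_d}
    using the construction w_i = L(e_i), theta = L(0)/2."""
    S = [np.zeros(d)] + [e for e in np.eye(d)]
    for L in itertools.product([-1, +1], repeat=d + 1):
        w, theta = np.array(L[1:], dtype=float), L[0] / 2.0
        preds = [int(np.sign(w @ x + theta)) for x in S]
        if preds != list(L):
            return False
    return True

print(shatter_witness(3))  # True: all 2^(d+1) labelings are realized
```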
VC dimension: hyperplanes (upper bound)
- VC dimension $< d+2$. For contradiction, assume there is a shattered set $S$ with $|S| = d+2$.
- Radon's Theorem: for any $S \subseteq R^d$ with $|S| \geq d+2$, there exists $S' \subseteq S$ such that $conv(S') \cap conv(S \setminus S') \neq \emptyset$.
- Label $S'$ positive and $S \setminus S'$ negative, and let $c_{w,\theta}$ be a hyperplane realizing this labeling (it exists, since $S$ is shattered).
- Let POS be the positive halfspace and NEG the negative halfspace of $c_{w,\theta}$.
VC dimension: hyperplanes (upper bound, cont.)
- $conv(S') \subseteq POS$ and $conv(S \setminus S') \subseteq NEG$, since halfspaces are closed under convex combinations.
- Radon's Theorem: $conv(S') \cap conv(S \setminus S') \neq \emptyset$.
- However, $POS \cap NEG = \emptyset$. Contradiction!
- There is no such shattered set $S$, so VC-dim $< d+2$. QED
VC dim: sample lower bound
- Theorem: if VC-dim$(C) = d+1$, then $m \geq \frac{d}{16\epsilon}$.
- Proof: let $\{z_0, z_1, \ldots, z_d\}$ be a shattered set and define
  $D(x) = 1 - 8\epsilon$ if $x = z_0$; $\frac{8\epsilon}{d}$ if $x = z_i$; $0$ otherwise.
- Target function: $c_t(z_0) = 1$; each $c_t(z_i) = 0$ or $1$ independently with probability $\frac{1}{2}$.
- Let RARE $= \{z_1, \ldots, z_d\}$.
- Assume $|S \cap RARE| \leq \frac{d}{2}$; then $|UNSEEN| \geq \frac{d}{2}$, and
  $\Pr[\text{error}] \geq \frac{1}{2} \cdot \frac{8\epsilon}{d} \cdot |UNSEEN| \geq 2\epsilon$.
VC dim: sample lower bound (cont.)
- $E[|S \cap RARE|] = 8\epsilon m \leq \frac{d}{2}$ for $m \leq \frac{d}{16\epsilon}$.
- Hence (by a Markov-style argument) $\Pr[|S \cap RARE| \leq \frac{d}{2}] \geq \frac{1}{2}$.
- So with probability at least $\frac{1}{2}$, the error is at least $2\epsilon$. QED
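The lower-bound argument can also be checked empirically. The following sketch samples from the hard distribution with $m = d/(16\epsilon)$ and estimates the error mass any learner incurs on unseen rare points (on which it must guess a fair coin); all parameter values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def lower_bound_experiment(d=100, eps=0.01, trials=2000):
    """With m = d/(16*eps) samples from the hard distribution, many rare
    points stay unseen, and a learner errs with probability ~2*eps."""
    m = int(d / (16 * eps))
    p_rare = 8 * eps / d                      # mass of each rare point z_1..z_d
    probs = np.array([1 - 8 * eps] + [p_rare] * d)
    errors = []
    for _ in range(trials):
        counts = rng.multinomial(m, probs)
        unseen = np.sum(counts[1:] == 0)      # rare points never sampled
        # on each unseen point the random label is guessed with prob 1/2:
        errors.append(0.5 * p_rare * unseen)
    return np.mean(errors)

print(lower_bound_experiment())  # roughly >= 2 * eps = 0.02
```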
VC dim: sample upper bound
- An incorrect proof: for a sample $S$, $C_S = \Pi_C(S)$ is finite, so apply the finite class bound $m \geq \frac{1}{\epsilon}\log\frac{|\Pi_C(S)|}{\delta}$.
- Problem: $S$ itself defines $C_S = \Pi_C(S)$; the class depends on the sample.
- Solution: take $2m$ points $S = S_1 \cup S_2$ and use the randomization of the split into $S_1$ and $S_2$.
- Benefit: we can work with the fixed projection $\Pi_C(S)$.
VC dim: sample upper bound (cont.)
- Bad concepts: $Bad = \{h : error(h) > \epsilon\}$.
- $S$ is a hitting set if for every $h \in Bad$ there exists $x \in S$ with $c_t(x) \neq h(x)$.
- Goal: compute the probability that $S$ is a hitting set.
- Event A: $S_1$ is not a hitting set, i.e., there exists $h \in Bad$ consistent with $S_1$. We want to bound $\Pr[A]$.
- Event B: there exists $h \in Bad$ consistent with $S_1$ that has $\epsilon m$ errors on $S_2$.
VC dim: sample upper bound (cont.)
- $\Pr[B] = \Pr[B|A]\Pr[A]$, since B implies A.
- Bounding $\Pr[B|A]$: fix such an $h$; its expected number of errors on $S_2$ is $\epsilon m$, attained with probability at least $\frac{1}{2}$.
- Result: $\Pr[A] \leq 2\Pr[B]$.
- Let $F = \Pi_C(S_1 \cup S_2)$. Fix $h \in F$ that is consistent with $S_1$ and has $\epsilon m$ errors on $S_2$; let $l$ be its number of errors.
- Compute the probability over the random partition into $S_1$ and $S_2$.
VC dim: sample upper bound (cont.)
- Number of ways to place the $l$ error points among the $2m$ points: $\binom{2m}{l}$; number of placements that keep $h$ consistent on $S_1$ (all errors in $S_2$): $\binom{m}{l}$.
- Probability bound:
  $\frac{\binom{m}{l}}{\binom{2m}{l}} = \prod_{i=0}^{l-1}\frac{m-i}{2m-i} \leq 2^{-l}$
- Union bound over $h \in F$: $\Pr[B] \leq |F|\,2^{-\epsilon m}$, so $\Pr[A] \leq 2\Pr[B] \leq 2|F|\,2^{-\epsilon m}$.
VC dim: sample upper bound (cont.)
- High confidence: require $\delta \geq 2|F|\,2^{-\epsilon m}$, i.e., $m \geq \frac{1}{\epsilon}\log\frac{2|F|}{\delta}$.
- Need to bound $|F| = |\Pi_C(S_1 \cup S_2)|$.
- Sauer-Shelah Lemma: if VC-dim$(C) = d$ and $|S| = 2m$, then
  $|\Pi_C(S)| \leq \sum_{i=0}^{d}\binom{2m}{i}$
- Bound: $\leq 2^{2m}$ for $m \leq d$, and $\leq 2(2m)^d$ for $m > d$.
VC dim: Sampling Theorem
- Sample bound: $m \geq \frac{1}{\epsilon}\log\frac{4(2m)^d}{\delta} = \frac{1}{\epsilon}\log\frac{4}{\delta} + \frac{d}{\epsilon}\log 2m$
- Realizable case: $m = O\left(\frac{1}{\epsilon}\log\frac{1}{\delta} + \frac{d}{\epsilon}\log\frac{d}{\epsilon}\right)$
- Non-realizable case: $m = O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta} + \frac{d}{\epsilon^2}\log\frac{d}{\epsilon}\right)$, by the same proof methodology.
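Since the bound is implicit in $m$ (it appears on both sides through $\log 2m$), one can solve it numerically. A minimal sketch using fixed-point iteration, with natural logarithms and illustrative parameters:

```python
import math

def vc_sample_bound(d, eps, delta):
    """Solve the implicit realizable-case bound
    m >= (1/eps) * (log(4/delta) + d * log(2m)) by fixed-point iteration."""
    m = 1.0
    for _ in range(100):
        m = (math.log(4.0 / delta) + d * math.log(2.0 * m)) / eps
    return math.ceil(m)

# e.g. halfspaces in R^10 (VC-dim d = 11), eps = 0.1, delta = 0.05:
print(vc_sample_bound(11, 0.1, 0.05))
```

The iteration converges because the right-hand side grows only logarithmically in $m$, so near the fixed point its slope is well below one.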
Rademacher Complexity
- Motivation: tighter, distribution-dependent bounds.
- Notation: $f : X \to \{-1,+1\}$, $f \in F$; Rademacher variables $\sigma_i$ with $\Pr[\sigma_i = +1] = \Pr[\sigma_i = -1] = \frac{1}{2}$.
Rademacher Complexity: definition
- Definition (Rademacher complexity): for a sample $S = (x_1, \ldots, x_m)$ of size $m$:
  $R_S(F) = E_\sigma\left[\max_{f \in F} \frac{1}{m}\sum_{i=1}^{m}\sigma_i f(x_i)\right]$
  $R_D(F) = E_S[R_S(F)]$
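For a finite (or discretized) class, $R_S(F)$ can be estimated by Monte Carlo directly from the definition. A minimal sketch over discretized thresholds; the helper name and all sample sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(sample, hypotheses, n_sigma=2000):
    """Monte-Carlo estimate of R_S(F) = E_sigma[max_f (1/m) sum_i sigma_i f(x_i)]
    for a finite class of {-1,+1}-valued hypotheses."""
    m = len(sample)
    # rows: hypotheses, columns: sample points, entries in {-1, +1}
    preds = np.array([[f(x) for x in sample] for f in hypotheses])
    sigmas = rng.choice([-1, 1], size=(n_sigma, m))
    # for each sigma draw, maximize the correlation over the class
    return np.mean(np.max(sigmas @ preds.T, axis=1)) / m

# discretized thresholds f_t(x) = sign(x - t) on a sample from U[0, 1]
sample = rng.uniform(0, 1, size=50)
thresholds = [lambda x, t=t: 1 if x >= t else -1 for t in np.linspace(0, 1, 101)]
print(empirical_rademacher(sample, thresholds))  # small: thresholds barely fit noise
```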
Rademacher Complexity: expected overfitting
- Theorem (expected overfitting):
  $E_S\left[\max_{f \in F}\left(\frac{1}{m}\sum_{i=1}^{m} f(x_i) - E_D[f(x)]\right)\right] \leq 2R_D(F)$
- Proof: the two-sample (ghost sample) trick. Add an independent sample $S' = (x'_1, \ldots, x'_m)$; since $E_D[f(x)] = E_{S'}\left[\frac{1}{m}\sum_i f(x'_i)\right]$, the left-hand side is at most
  $E_{S,S'}\left[\max_{f \in F}\frac{1}{m}\sum_{i=1}^{m}\left(f(x_i) - f(x'_i)\right)\right]$
Rademacher Complexity: expected overfitting = E S,S max f F 1 m m i=1 σi (f x i f(x i )) E S max f F 1 m m i=1 σi f x i +E S max f F 1 m m i=1 σi f x i = 2R D (F) QED Introduction to Machine Learning 25
Rademachar Theorem With probability 1-δ, for every h H: ε h ε h + R D H + ln 2 δ 2m ε h + R S H + 3 ln(2 δ ) 2m Introduction to Machine Learning 26
Model Selection: Outline
- Motivation
- Overfitting
- Structural Risk Minimization
- Hypothesis Validation
Motivation
- Problems: we have too few examples, and a very rich hypothesis class. How can we find the best hypothesis?
- Alternatively: we usually choose the hypothesis class ourselves. How rich a class do we want, and how should we go about choosing it?
Overfitting
- Concept class: intervals on a line.
- Can classify any training set with zero training error.
- Is zero training error the only goal?!
Overfitting: Intervals
- We can always get zero training error!
- But are we actually interested in zero training error?!
Overfitting: Intervals
- Training errors as a function of the number of intervals:

  intervals:  0   1   2   3   4
  errors:     7   3   2   1   0
Overfitting
- A simple concept plus noise can look like a very complex concept.
- Causes: an insufficient number of examples, plus noise (e.g., noise rate 1/3).
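The following sketch illustrates the phenomenon with intervals: a memorizing hypothesis (a tiny interval around each positive training point) drives the training error to essentially zero on noisy data, yet generalizes far worse than ERM over single intervals. All names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_sample(m, noise=0.2):
    """Labels from the target interval [0.3, 0.7], flipped with prob `noise`."""
    xs = rng.uniform(0, 1, m)
    ys = ((xs >= 0.3) & (xs <= 0.7)).astype(int)
    return xs, ys ^ (rng.random(m) < noise)

xs, ys = noisy_sample(200)
xt, yt = noisy_sample(10000)

# Rich class: a tiny interval around every positive training point
# (essentially zero training error: pure memorization).
pos = xs[ys == 1]
memorize = lambda x: int(np.any(np.abs(pos - x) < 1e-9))

# Poor class: ERM over single intervals [a, b] on a coarse grid.
grid = np.linspace(0, 1, 21)
lo, hi = min(((a, b) for a in grid for b in grid if a < b),
             key=lambda ab: np.mean(((xs >= ab[0]) & (xs <= ab[1])).astype(int) != ys))
simple = lambda x: int(lo <= x <= hi)

for name, h in [("memorize", memorize), ("single interval", simple)]:
    print(name, np.mean([h(x) != y for x, y in zip(xt, yt)]))
# memorization wins on the training set but loses badly on fresh data
```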
Model Selection
[Figure: error vs. complexity. The training error decreases with complexity, the generalization error is U-shaped, and the gap between them is the complexity penalty.]
Theoretical Model
- Nested hypothesis classes: $H_1 \subseteq H_2 \subseteq H_3 \subseteq \cdots \subseteq H_i \subseteq \cdots$
- There is a target function $c_t(x)$; the setting is non-realizable.
- True errors: $\varepsilon(h) = \Pr[h \neq c_t]$, $\varepsilon_i = \inf_{h \in H_i}\varepsilon(h)$, and $\varepsilon(h^*) = \inf_i \varepsilon_i$, where $h^*$ is the best hypothesis.
- Training error: $\hat{\varepsilon}(h) = \frac{1}{m}\sum_{i=1}^{m} I[h(x_i) \neq c_t(x_i)]$, $\hat{\varepsilon}_i = \inf_{h \in H_i}\hat{\varepsilon}(h)$.
Theoretical Model (cont.)
- Complexity of $h$: $d(h) = \min_i \{ i : h \in H_i \}$.
- Penalty-based approach: add a penalty that grows with $d(h)$, and choose the hypothesis which minimizes $\hat{\varepsilon}(h) + penalty(h)$.
Structural Risk Minimization
- Parameters: $\lambda_i$ and $\delta_i$ such that $\Pr[\exists h \in H_i : |\varepsilon(h) - \hat{\varepsilon}(h)| > \lambda_i] \leq \delta_i$, with $\sum_i \delta_i = \delta$ (e.g., $\delta_i = \delta/2^i$).
- Implies: with probability $1-\delta$, for every $h \in H$: $|\varepsilon(h) - \hat{\varepsilon}(h)| \leq \lambda_{d(h)}$.
Structural Risk Minimization (cont.)
- Setting $penalty(h) = \lambda_{d(h)}$:
  - Finite $H_i$: $\lambda_i = \sqrt{\frac{\log(|H_i|/\delta_i)}{m}}$
  - VC-dim$(H_i) = i$: $\lambda_i = \sqrt{\frac{i\log(i/\delta_i)}{m}}$
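A minimal sketch of SRM for finite nested classes, using the penalty above with $\delta_i = \delta/2^i$; the interface (lists of hypotheses and a `train_error` callable) is an illustrative assumption:

```python
import math

def srm_select(classes, train_error, m, delta):
    """SRM sketch for finite nested classes H_1 in H_2 in ...: choose the
    argmin of training error + penalty_i, where
    penalty_i = sqrt(log(|H_i| / delta_i) / m) and delta_i = delta / 2**i."""
    best, best_score = None, float("inf")
    for i, H in enumerate(classes, start=1):
        delta_i = delta / 2 ** i
        penalty = math.sqrt(math.log(len(H) / delta_i) / m)
        h = min(H, key=train_error)          # ERM within H_i
        score = train_error(h) + penalty
        if score < best_score:
            best, best_score = h, score
    return best
```

Note the design choice: the split $\delta_i = \delta/2^i$ charges richer classes a larger confidence penalty, which is what lets the union bound run over infinitely many classes.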
SRM: Performance
- Theorem: let $h^*$ be the best hypothesis and $g_{srm}$ the SRM choice. With probability $1-\delta$:
  $\varepsilon(h^*) \leq \varepsilon(g_{srm}) \leq \varepsilon(h^*) + 2\,penalty(h^*)$
- Note: the bound depends only on $h^*$.
SRM: Proof
- Bounding the error within $H_i$:
  $\Pr[|\varepsilon(g_{srm}) - \hat{\varepsilon}(g_{srm})| > \lambda_{srm}] \leq \Pr[\exists h \in H_{srm} : |\varepsilon(h) - \hat{\varepsilon}(h)| > \lambda_{srm}] \leq \delta_{srm}$
- Bounding the error across classes: with probability $1-\delta$,
  $\varepsilon(g_{srm}) \leq \hat{\varepsilon}(g_{srm}) + \lambda_{srm} \leq \hat{\varepsilon}(h^*) + \lambda^* \leq \varepsilon(h^*) + 2\lambda^*$
  where the middle step uses that $g_{srm}$ minimizes $\hat{\varepsilon}(h) + \lambda_{d(h)}$.
- QED
Hypothesis Validation
- Split the sample into a training part and a selection part.
- Using the training part, select from each $H_i$ a candidate $g_i$.
- Using the selection sample, select among $g_1, \ldots, g_m$.
- Split sizes: a $(1-\gamma)m$ training set and a $\gamma m$ selection set.
Hypothesis Validation: Algorithm
- Using $(1-\gamma)m$ examples $S_1$: let $\hat{\varepsilon}_1(h)$ be the error on $S_1$, and set $g_i = \arg\min_{h \in H_i} \hat{\varepsilon}_1(h)$.
- Using $\gamma m$ examples $S_2$: let $\hat{\varepsilon}_2(h)$ be the error on $S_2$, and set $g_{HV} = \arg\min_{g_i \in G} \hat{\varepsilon}_2(g_i)$.
- Return $g_{HV}$.
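A minimal sketch of the algorithm; the interface (each $H_i$ a list of callables, the sample a list of $(x, y)$ pairs) is an illustrative assumption:

```python
def hypothesis_validation(classes, sample, gamma=0.2):
    """Hypothesis-validation sketch: ERM per class on the training split,
    then pick the candidate with the lowest error on the selection split."""
    split = int((1 - gamma) * len(sample))
    s1, s2 = sample[:split], sample[split:]

    def err(h, data):
        return sum(h(x) != y for x, y in data) / len(data)

    candidates = [min(H, key=lambda h: err(h, s1)) for H in classes]
    return min(candidates, key=lambda h: err(h, s2))
```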
Hypothesis Validation: Performance
- Errors: $\varepsilon_{hv}(m)$ = error of HV using $m$ examples; $\varepsilon_A(m)$ = error of any algorithm $A$ using $m$ examples.
- The only restriction on $A$ is that it selects $g_i$ from $H_i$; for example, $A$ may use any penalty function.
- Theorem: with probability $1-\delta$:
  $\varepsilon_{hv}(m) \leq \varepsilon_A((1-\gamma)m) + 2\sqrt{\frac{\ln(2m/\delta)}{\gamma m}}$
Hypothesis Validation: Analysis
- For a single candidate: $\Pr[|\varepsilon(g_i) - \hat{\varepsilon}_2(g_i)| > \lambda] \leq 2e^{-\lambda^2\gamma m}$
- Union bound: $\Pr[\exists i : |\varepsilon(g_i) - \hat{\varepsilon}_2(g_i)| > \lambda] \leq 2|G|e^{-\lambda^2\gamma m} = \delta$
- Since $|G| \leq m$: $\lambda = \sqrt{\frac{\ln(2m/\delta)}{\gamma m}}$
- Chain: $\varepsilon(g_{HV}) \leq \hat{\varepsilon}_2(g_{HV}) + \lambda \leq \hat{\varepsilon}_2(g_i) + \lambda \leq \varepsilon(g_i) + 2\lambda$, using that $g_{HV}$ minimizes $\hat{\varepsilon}_2$ over $G$.
Summary
- PAC model: generalization bounds, Empirical Risk Minimization
- VC dimension
- Rademacher complexity
- Model selection: Structural Risk Minimization (SRM), hypothesis validation