VC dimension and Model Selection

1 VC dimension and Model Selection

2 Overview
- PAC model: review
- VC dimension: definition, examples
- Sample size: lower bound, upper bound
- Model Selection

3 PAC model: Setting
- A distribution D over X (unknown)
- Target function: c_t from C, c_t: X → {0,1}
- Hypothesis: h from H, h: X → {0,1}
- Error probability: error(h) = Pr_D[h(x) ≠ c_t(x)]
- Oracle: EX(c_t, D)

4 PAC model: Definition
C and H are concept classes over X. C is PAC learnable by H if there exists an algorithm A such that:
- for every distribution D over X, every c_t in C, and every inputs ε and δ,
- A, having access to EX(c_t, D), outputs a hypothesis h in H,
- and with probability 1-δ we have error(h) < ε.
Complexities: sample size, running time.

5 PAC model last week
For a finite hypothesis class H, sample size m:
- Realizable case: m > (1/ε) ln(|H|/δ)
- Non-realizable case: m > (2/ε²) ln(2|H|/δ)
Impossibility results:
- m > (1/2) log|H|
- m > (1/(4ε)) ln(1/δ)
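
As a quick numeric check (not from the slides), a minimal sketch that plugs values into the two finite-class bounds above; the class size |H| = 2^20 and the accuracy parameters are made-up examples:

```python
import math

def realizable_bound(H_size, eps, delta):
    # m > (1/eps) * ln(|H| / delta)
    return math.ceil(math.log(H_size / delta) / eps)

def nonrealizable_bound(H_size, eps, delta):
    # m > (2/eps^2) * ln(2|H| / delta)
    return math.ceil(2 * math.log(2 * H_size / delta) / eps ** 2)

print(realizable_bound(2 ** 20, eps=0.05, delta=0.01))      # ~370 examples
print(nonrealizable_bound(2 ** 20, eps=0.05, delta=0.01))   # ~15,300 examples
```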

6 VC dimension: motivation
- Infinite hypothesis classes: thresholds, rectangles
- Today: the general notion of VC dimension
- Applies both to the realizable and the non-realizable case

7 VC dimension: definition
- Notation: C is a concept class; S is a sample
- Projection: Π_C(S) = { c ∩ S : c ∈ C }
- Shattering: C shatters S if |Π_C(S)| = 2^|S|
- VC dimension: the size of the largest shattered S, i.e., max{ d : ∃S, |S| = d, |Π_C(S)| = 2^d }
- If there is no maximum, the VC dimension is infinite: for every d there is a shattered set of size d

8 VC dimension: Threshold
- c_θ(x) = I(x ≥ θ)
- VC-dim ≥ 1: S = {0.5} is shattered (one threshold gives label 1, another gives label 0)
- VC-dim < 2: for S = {z_1, z_2} with z_1 < z_2, no threshold realizes c(z_1) = 1, c(z_2) = 0
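
A small brute-force sketch of the shattering definition above, using the threshold class as the running example (the grid of thresholds and the test points are arbitrary choices for illustration):

```python
def labelings(hypotheses, points):
    # Projection Pi_C(S): the set of label vectors the class realizes on the sample.
    return {tuple(h(x) for x in points) for h in hypotheses}

def shatters(hypotheses, points):
    # C shatters S iff all 2^|S| labelings are realized.
    return len(labelings(hypotheses, points)) == 2 ** len(points)

# Threshold class c_theta(x) = I(x >= theta), theta on a fine grid over [-1, 2].
thresholds = [lambda x, t=t: int(x >= t) for t in [i / 100 for i in range(-100, 201)]]

print(shatters(thresholds, [0.5]))        # True:  VC-dim >= 1
print(shatters(thresholds, [0.3, 0.7]))   # False: the labeling (1, 0) is never realized
```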

9 VC dimension: union of intervals
- Hypotheses: unions of intervals on [0,1]; finitely many intervals, but the number is unbounded
- Any d points can be shattered (surround each positive point with its own small interval)
- VC-dim = infinity

10 VC dimension: convex polygon
- Hypotheses: convex polygons (with an unbounded number of vertices)
- Any d points in convex position (e.g., on a circle) can be shattered: take the polygon spanned by the positive points
- VC dimension = infinity

11 VC dimension: hyperplane
- c_{w,θ}(x) = sign(Σ_i w_i x_i + θ)
- VC dimension ≥ d+1
- Take S = {0, e_1, ..., e_d}
- Given a labeling L: S → {-1,+1}, define c_{w,θ} with w_i = L(e_i) and θ = L(0)/2
- Then c_{w,θ}(0) = sign(θ) = L(0) and c_{w,θ}(e_i) = sign(w_i + θ) = L(e_i), so S is shattered
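
A sketch verifying the lower-bound construction above for a small dimension (d = 3 is an arbitrary choice; numpy is assumed to be available):

```python
import itertools
import numpy as np

d = 3
S = [np.zeros(d)] + [np.eye(d)[i] for i in range(d)]   # S = {0, e_1, ..., e_d}

def halfspace(w, theta, x):
    return 1 if np.dot(w, x) + theta > 0 else -1

# For every labeling L of the d+1 points, set w_i = L(e_i) and theta = L(0)/2,
# then check that the halfspace reproduces L exactly (so S is shattered).
for L in itertools.product([-1, 1], repeat=d + 1):
    theta = L[0] / 2
    w = np.array(L[1:], dtype=float)
    assert all(halfspace(w, theta, x) == label for x, label in zip(S, L))

print("all", 2 ** (d + 1), "labelings realized, so VC-dim >= d + 1")
```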

12 VC dimension: hyperplane
- VC dimension < d+2
- Proof by contradiction: assume there is a shattered set S with |S| = d+2
- Radon's Theorem: any S ⊆ R^d with |S| ≥ d+2 has a subset S' such that conv(S') ∩ conv(S \ S') ≠ ∅
- Label S' positive and S \ S' negative; since S is shattered, let c_{w,θ} be a separating hyperplane
- Let POS be the positive halfspace and NEG the negative halfspace of c_{w,θ}

13 VC dimension: hyperplane
- conv(S') ⊆ POS and conv(S \ S') ⊆ NEG, since halfspaces are closed under convex combinations
- Radon's Theorem: conv(S') ∩ conv(S \ S') ≠ ∅
- However POS ∩ NEG = ∅. Contradiction!
- There is no such shattered set S, so VC-dim < d+2. QED

14 VC dim: Sample lower bound
Theorem: if VC-dim(C) = d+1, then PAC learning C requires m ≥ d/(16ε) examples.
Proof sketch:
- Let {z_0, z_1, ..., z_d} be a shattered set and define the distribution
  D(x) = 1 - 8ε for x = z_0; D(x) = 8ε/d for x = z_i, i ≥ 1; D(x) = 0 otherwise
- Target function: c_t(z_0) = 1; each c_t(z_i) is 0 or 1 independently with probability 1/2
- RARE = {z_1, ..., z_d}
- Assume |S ∩ RARE| ≤ d/2; then |UNSEEN| ≥ d/2 points of RARE never appear in the sample
- Pr[error] ≥ (1/2) · (8ε/d) · |UNSEEN| ≥ 2ε

15 VC dim: Sample lower bound
- E[|S ∩ RARE|] = 8εm ≤ d/2 when m ≤ d/(16ε)
- Hence Pr[|S ∩ RARE| ≤ d/2] ≥ 1/2
- With probability at least 1/2 the error is at least 2ε. QED

16 VC dim: sample upper bound
An incorrect proof:
- For a sample S, C_S = Π_C(S) is finite
- Apply the finite-class bound: m ≥ (1/ε) log(|Π_C(S)|/δ)
- Problem: S itself defines C_S = Π_C(S)
The fix:
- Take 2m points, S = S_1 ∪ S_2
- Use the randomization in the split into S_1 and S_2
- Benefit: we can bound |Π_C(S)|

17 VC dim: sample upper bound
- Bad concepts: Bad = { h : error(h) > ε }
- Hitting set S: for every h in Bad there exists x in S with c_t(x) ≠ h(x)
- Goal: compute the probability of S being a hitting set
- Event A: S_1 is not a hitting set, i.e., there exists h in Bad consistent with S_1. Pr[A] ≤ ???
- Event B: there exists h in Bad that is consistent with S_1 and has ≥ εm errors on S_2

18 VC dim: sample upper bound
- Pr[B] = Pr[B|A] Pr[A], since B implies A
- Pr[B|A] ≥ 1/2: fix such an h; its expected number of errors on S_2 is ≥ εm, and with probability at least 1/2 it has ≥ εm errors
- Result: Pr[A] ≤ 2 Pr[B]
- Let F = Π_C(S_1 ∪ S_2) and fix h in F that is consistent with S_1 and has ≥ εm errors on S_2; let l be its number of errors
- Compute the probability of this event over the random partition into S_1 and S_2

19 VC dim: sample upper bound
- Number of total partitions of the l error points: C(2m, l)
- Number of partitions which make h consistent on S_1: C(m, l)
- Probability bound: C(m, l) / C(2m, l) = ∏_{i=0}^{l-1} (m-i)/(2m-i) ≤ 2^{-l}
- Union bound over h in F: Pr[B] ≤ |F| · 2^{-εm}
- Pr[A] ≤ 2 Pr[B] ≤ 2 |F| · 2^{-εm}
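
A one-off numeric check (illustrative values only) of the partition-probability bound above:

```python
from math import comb

def consistent_partition_fraction(m, l):
    # C(m, l) / C(2m, l): fraction of splits that leave all l error points in S_2.
    return comb(m, l) / comb(2 * m, l)

m, l = 100, 10
p = consistent_partition_fraction(m, l)
print(p, 2 ** (-l), p <= 2 ** (-l))   # True: the fraction is at most 2^{-l}
```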

20 VC dim: sample upper bound
- High confidence: δ ≥ 2|F| · 2^{-εm}, i.e., m ≥ (1/ε) log(2|F|/δ)
- Need to bound |F| = |Π_C(S_1 ∪ S_2)|
- Sauer-Shelah Lemma: if VC-dim(C) = d and |S| = 2m, then |Π_C(S)| ≤ Σ_{i=0}^{d} C(2m, i)
- Bound: ≤ 2^{2m} for m ≤ d, and ≤ 2(2m)^d for m > d
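
A short sketch that evaluates the Sauer-Shelah bound above and compares it with the cruder forms (m and d are arbitrary illustrative values):

```python
from math import comb

def sauer_shelah(n, d):
    # Bound on |Pi_C(S)| for |S| = n when VC-dim(C) = d.
    return sum(comb(n, i) for i in range(d + 1))

m, d = 50, 5
print(sauer_shelah(2 * m, d))   # sum_{i<=d} C(2m, i): polynomial in m
print(2 * (2 * m) ** d)         # the 2*(2m)^d bound used on the slide
print(2 ** (2 * m))             # the trivial 2^{2m} bound
```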

21 VC dim: Sampling Theorem
Realizable case (by the proof methodology above):
- Sample bound: m ≥ (1/ε) log(4(2m)^d / δ)
- Solving for m: m = O( (1/ε) log(1/δ) + (d/ε) log(d/ε) )
Non-realizable case:
- m = O( (1/ε²) log(1/δ) + (d/ε²) log(d/ε) )

22 Rademacher Complexity
- Motivation: tighter, distribution-dependent bounds
- Notation: functions f: X → {-1,+1}, f ∈ F
- Rademacher variables: Pr[σ_i = +1] = Pr[σ_i = -1] = 1/2

23 Rademacher Complexity
Definition (Rademacher Complexity), for a sample S = (x_1, ..., x_m) of size m:
- R_S(F) = E_σ[ max_{f∈F} (1/m) Σ_{i=1}^m σ_i f(x_i) ]
- R_D(F) = E_S[ R_S(F) ]
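
A Monte Carlo sketch of the empirical Rademacher complexity R_S(F) defined above, using the threshold class on random points as an illustrative F (numpy assumed; the threshold grid and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(F_values, n_trials=2000, rng=rng):
    # Estimates R_S(F) = E_sigma[ max_f (1/m) sum_i sigma_i f(x_i) ].
    # F_values has shape (num_hypotheses, m) and holds f(x_i) for each f in F.
    num_f, m = F_values.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1, 1], size=m)        # Rademacher signs
        total += np.max(F_values @ sigma) / m      # max over f of (1/m) sum_i sigma_i f(x_i)
    return total / n_trials

# Example: thresholds c_theta(x) = sign(x - theta) evaluated on m random points in [0, 1].
m = 30
x = np.sort(rng.uniform(0, 1, size=m))
thetas = np.linspace(0, 1, 50)
F_values = np.array([np.where(x >= t, 1.0, -1.0) for t in thetas])
print(empirical_rademacher(F_values))   # small: thresholds are a low-complexity class
```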

24 Rademacher Complexity: expected overfitting
Theorem (expected overfitting):
  E_S[ max_{f∈F} ( (1/m) Σ_{i=1}^m f(x_i) - E_D[f(x)] ) ] ≤ 2 R_D(F)
Proof (two-sample trick): introduce a ghost sample S' = (x'_1, ..., x'_m), so that E_D[f(x)] = E_{S'}[ (1/m) Σ_i f(x'_i) ]:
  E_S[ max_f ( (1/m) Σ_i f(x_i) - E_{S'}[(1/m) Σ_i f(x'_i)] ) ] ≤ E_{S,S'}[ max_f (1/m) Σ_i ( f(x_i) - f(x'_i) ) ]

25 Rademacher Complexity: expected overfitting
  = E_{S,S',σ}[ max_f (1/m) Σ_i σ_i ( f(x_i) - f(x'_i) ) ]
  ≤ E_{S,σ}[ max_f (1/m) Σ_i σ_i f(x_i) ] + E_{S',σ}[ max_f (1/m) Σ_i σ_i f(x'_i) ]
  = 2 R_D(F)
QED

26 Rademacher Theorem
With probability 1-δ, for every h ∈ H:
  ε(h) ≤ ε̂(h) + R_D(H) + √( ln(2/δ) / (2m) ) ≤ ε̂(h) + R_S(H) + 3 √( ln(2/δ) / (2m) )

27 Model selection - Outline
- Motivation
- Overfitting
- Structural Risk Minimization
- Hypothesis Validation

28 Motivation
Problems:
- We have too few examples
- We have a very rich hypothesis class
- How can we find the best hypothesis?
Alternatively:
- Usually we choose the hypothesis class ourselves
- How rich a class do we want?
- How should we go about choosing it?

29 Overfitting
- Concept class: intervals on a line
- It can classify any training set with zero training error
- Is zero training error the only goal?!

30 Overfitting: Intervals
- We can always get zero training error!
- But are we really interested in zero training error?!

31 Overfitting: Intervals
[Figure: training errors as a function of the number of intervals]

32 Overfitting
- Simple concept plus noise
- A very complex concept
- Insufficient number of examples + noise (rate 1/3)

33 Model Selection
[Figure: training error, generalization error, and complexity penalty as a function of model complexity]

34 Theoretical Model
- Nested hypothesis classes: H_1 ⊆ H_2 ⊆ H_3 ⊆ ... ⊆ H_i ⊆ ...
- There is a target function c_t(x); the setting is non-realizable
True errors:
- ε(h) = Pr[h ≠ c_t]
- ε_i = inf_{h ∈ H_i} ε(h)
- ε(h*) = inf_i ε_i, where h* is the best hypothesis
Training error:
- ε̂(h) = (1/m) Σ_{j=1}^m I[h(x_j) ≠ c_t(x_j)]
- ε̂_i = inf_{h ∈ H_i} ε̂(h)

35 Theoretical Model
- Complexity of h: d(h) = min{ i : h ∈ H_i }
- Add a penalty that grows with d(h)
- Penalty-based selection: choose the hypothesis which minimizes ε̂(h) + penalty(h)

36 Structural Risk Minimization
- Parameters: λ_i and δ_i such that Pr[ ∃h ∈ H_i : |ε(h) - ε̂(h)| > λ_i ] ≤ δ_i
- Σ_i δ_i = δ, e.g., δ_i = δ/2^i
- Implies: with probability 1-δ, for every h ∈ H, |ε(h) - ε̂(h)| ≤ λ_{d(h)}

37 Structural Risk Minimization
Setting the penalty: penalty(h) = λ_{d(h)}
- Finite H_i: λ_i = √( log(|H_i|/δ_i) / m )
- VC-dim(H_i) = i: λ_i = √( i log(i/δ_i) / m )
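
A minimal sketch of penalty-based SRM selection using the finite-class penalty above with δ_i = δ/2^i; the per-class training errors and class sizes are made-up toy numbers:

```python
import math

def srm_select(erm_results, m, delta):
    # Pick the class index minimizing training error + penalty over nested classes.
    # erm_results: list of (train_error_i, |H_i|) for H_1 <= H_2 <= ...
    best_i, best_score = None, float("inf")
    for i, (train_err, H_size) in enumerate(erm_results, start=1):
        delta_i = delta / 2 ** i
        penalty = math.sqrt(math.log(H_size / delta_i) / m)   # lambda_i for finite H_i
        score = train_err + penalty
        if score < best_score:
            best_i, best_score = i, score
    return best_i, best_score

# Toy numbers: richer classes fit the training data better but pay a larger penalty.
print(srm_select([(0.20, 10), (0.12, 100), (0.05, 10_000), (0.04, 10 ** 8)], m=500, delta=0.05))
```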

38 SRM: Performance
Theorem: let h* be the best hypothesis and g_srm the SRM choice. With probability 1-δ:
  ε(h*) ≤ ε(g_srm) ≤ ε(h*) + 2 penalty(h*)
Note: the bound depends only on h*.

39 Proof
Bounding the error within H_i:
  Pr[ |ε(g_srm) - ε̂(g_srm)| > λ_srm ] ≤ Pr[ ∃h ∈ H_srm : |ε(h) - ε̂(h)| > λ_srm ] ≤ δ_srm
Bounding the error across the classes (writing λ_srm = penalty(g_srm) and λ* = penalty(h*)):
  ε(g_srm) ≤ ε̂(g_srm) + λ_srm ≤ ε̂(h*) + λ* ≤ ε(h*) + 2λ*
QED

40 Hypothesis Validation
- Split the sample into a training part and a selection part
- Using the training part, select a candidate g_i from each H_i
- Using the selection sample, select among g_1, ..., g_m
- Split sizes: (1-γ)m for the training set, γm for the selection set

41 Hypothesis Validation: Algorithm
Using (1-γ)m examples (the set S_1):
- ε̂_1(h) = error on S_1
- g_i = argmin_{h ∈ H_i} ε̂_1(h)
Using γm examples (the set S_2):
- ε̂_2(h) = error on S_2
- g_HV = argmin_{g_i ∈ G} ε̂_2(g_i)
Return g_HV
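
A sketch of the validation procedure above on a toy 1-d problem; the per-class ERM interface and the two toy classes are assumptions made for illustration, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def hypothesis_validation(X, y, erm_per_class, gamma=0.3, rng=rng):
    # Split into a (1-gamma)m training part and a gamma*m selection part,
    # run each class's ERM on the training part, pick by selection error.
    m = len(y)
    perm = rng.permutation(m)
    split = int((1 - gamma) * m)
    train, select = perm[:split], perm[split:]
    candidates = [erm(X[train], y[train]) for erm in erm_per_class]        # g_1, ..., g_k
    val_errors = [np.mean(g(X[select]) != y[select]) for g in candidates]  # hat{eps}_2(g_i)
    best = int(np.argmin(val_errors))
    return candidates[best], val_errors[best]

# Toy classes: H_1 = constant classifiers, H_2 = threshold classifiers on one feature.
def erm_constant(Xt, yt):
    c = int(np.mean(yt) >= 0.5)
    return lambda X: np.full(len(X), c)

def erm_threshold(Xt, yt):
    ts = np.unique(Xt)
    errs = [np.mean((Xt >= t).astype(int) != yt) for t in ts]
    t_best = ts[int(np.argmin(errs))]
    return lambda X: (X >= t_best).astype(int)

X = rng.uniform(0, 1, size=200)
y = (X >= 0.6).astype(int) ^ (rng.uniform(size=200) < 0.1).astype(int)   # noisy threshold concept
g_hv, val_err = hypothesis_validation(X, y, [erm_constant, erm_threshold])
print(val_err)
```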

42 Hypo. Validation: Performance
Errors:
- ε_HV(m) = error of Hypothesis Validation using m examples
- ε_A(m) = error of A using m examples, for any algorithm A whose only restriction is that it selects g_i from H_i (for example, any penalty-based rule)
Theorem: with probability 1-δ,
  ε_HV(m) ≤ ε_A((1-γ)m) + 2 √( ln(2m/δ) / (γm) )

43 Hypo. Validation: Analysis
- Pr[ |ε(g_i) - ε̂_2(g_i)| > λ ] ≤ 2 e^{-λ² γm}
- Pr[ ∃i : |ε(g_i) - ε̂_2(g_i)| > λ ] ≤ 2 |G| e^{-λ² γm} = δ
- Since |G| ≤ m: λ = √( ln(2m/δ) / (γm) )
- For every i: ε(g_HV) ≤ ε̂_2(g_HV) + λ ≤ ε̂_2(g_i) + λ ≤ ε(g_i) + 2λ

44 Summary
- PAC model
- Generalization bounds: Empirical Risk Minimization, VC dimension, Rademacher complexity
- Model Selection: Structural Risk Minimization (SRM), hypothesis validation
