Introduction to Machine Learning


Machine Learning: Jordan Boyd-Graber, University of Maryland
RADEMACHER COMPLEXITY
Slides adapted from Rob Schapire

Recap
Rademacher complexity provides nice guarantees:
$$R(h) \leq \hat{R}(h) + \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} \qquad (1)$$
But in practice it is hard to compute for real hypothesis classes. Is there a relationship with simpler combinatorial measures?
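
To make the recap concrete, here is a minimal sketch (not from the slides) that estimates the empirical Rademacher complexity of a finite hypothesis class by Monte Carlo, directly from the definition $\hat{\mathfrak{R}}_S(H) = \mathbb{E}_\sigma\left[\sup_{h \in H} \frac{1}{m}\sum_i \sigma_i h(x_i)\right]$. The 1-D threshold class and all names are illustrative assumptions.

```python
import numpy as np

def empirical_rademacher(predictions, n_trials=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    predictions: (n_hypotheses, m) array of +/-1 labels, one row per
    hypothesis evaluated on the fixed sample S.
    """
    rng = np.random.default_rng(seed)
    n_hyp, m = predictions.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher variables
        # sup over h of (1/m) * sum_i sigma_i h(x_i)
        total += np.max(predictions @ sigma) / m
    return total / n_trials

# Illustrative finite class: 1-D threshold classifiers on 20 sample points.
x = np.sort(np.random.default_rng(1).uniform(0, 1, size=20))
thresholds = np.linspace(0, 1, 50)
preds = np.where(x[None, :] >= thresholds[:, None], 1.0, -1.0)
print(empirical_rademacher(preds))  # small: thresholds are a simple class
```

Even this brute-force estimate only works because the class is finite and tiny, which is the point of the slide: for real hypothesis classes we need combinatorial surrogates.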

Growth Function
Define the growth function $\Pi_H$ for a hypothesis set $H$ as:
$$\forall m, \quad \Pi_H(m) \triangleq \max_{\{x_1, \dots, x_m\} \subseteq X} \left|\left\{ (h(x_1), \dots, h(x_m)) : h \in H \right\}\right| \qquad (2)$$
i.e., the number of ways $m$ points can be classified using $H$.
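
As a sanity check (not part of the slides), here is a brute-force sketch that computes $\Pi_H(m)$ for 1-D threshold classifiers by enumerating the distinct dichotomies they induce on a sample; the class and function names are assumptions for illustration.

```python
import numpy as np

def growth_function_thresholds(m, n_candidate_samples=200, seed=0):
    """Brute-force Pi_H(m) for H = {x -> sign(x - t)} on the real line.

    Searches over random samples of size m and counts the distinct
    labelings ("dichotomies") thresholds induce; returns the max count.
    """
    rng = np.random.default_rng(seed)
    best = 0
    for _ in range(n_candidate_samples):
        x = np.sort(rng.uniform(0, 1, size=m))
        # Thresholds between consecutive points realize every distinct labeling.
        cuts = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
        dichotomies = {tuple(np.where(x >= t, 1, -1)) for t in cuts}
        best = max(best, len(dichotomies))
    return best

for m in range(1, 6):
    print(m, growth_function_thresholds(m))  # expect m + 1: linear, not 2^m
```

For thresholds the growth function is $m + 1$, far below $2^m$, which previews why simple classes generalize well.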

Rademacher Complexity vs. Growth Function
If $G$ is a function class taking values in $\{-1, +1\}$, then
$$\mathfrak{R}_m(G) \leq \sqrt{\frac{2 \ln \Pi_G(m)}{m}} \qquad (3)$$
Uses Massart's lemma.

Vapnik-Chervonenkis Dimension
$$\mathrm{VC}(H) \triangleq \max \left\{ m : \Pi_H(m) = 2^m \right\} \qquad (4)$$
The size of the largest set that can be fully shattered by $H$.

VC Dimension for Hypotheses
Need upper and lower bounds:
- Lower bound: exhibit an example of $d$ points that $H$ shatters
- Upper bound: prove that no set of $d + 1$ points can be shattered by $H$ (harder)

Intervals
What is the VC dimension of $[a, b]$ intervals on the real line?
What about two points? Two points can be perfectly classified, so the VC dimension is at least 2.
What about three points? No set of three points can be shattered: for ordered points $x_1 < x_2 < x_3$, no interval can realize the labeling $+, -, +$.
Thus, the VC dimension of intervals is 2.
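
A quick brute-force check of this argument (my sketch, not the slides'): enumerate candidate interval endpoints and count which labelings are realizable.

```python
import itertools

def interval_shatters(points):
    """Check whether the class of intervals [a, b] shatters these points.

    A labeling is realizable iff the positive points form a contiguous
    block; we verify every labeling by brute force over candidate endpoints.
    """
    points = sorted(points)
    # Candidate endpoints: midpoints plus sentinels outside the range.
    grid = [points[0] - 1] + [
        (p + q) / 2 for p, q in zip(points, points[1:])
    ] + [points[-1] + 1]
    realizable = set()
    for a, b in itertools.product(grid, repeat=2):
        if a <= b:
            realizable.add(tuple(a <= p <= b for p in points))
    return len(realizable) == 2 ** len(points)

print(interval_shatters([0.1, 0.7]))       # True: two points are shattered
print(interval_shatters([0.1, 0.5, 0.9]))  # False: +,-,+ is unreachable
```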

Sine Functions
Consider the hypothesis class that classifies points on a line as being above or below a sine wave:
$$\left\{ x \mapsto \operatorname{sign}(\sin(\omega x)) : \omega \in \mathbb{R} \right\} \qquad (5)$$
Can you shatter three points? Can you shatter four points? How many points can you shatter?
Thus, the VC dimension of sine functions on the line is infinite: for any $m$, the points $x_i = 2^{-i}$ can be shattered by an appropriate choice of $\omega$.
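
A quick numerical check (mine, not the slides'): scan $\omega$ and confirm that every labeling of three generic points is realized by some $\operatorname{sign}(\sin(\omega x))$. The sample points and scan range are illustrative assumptions.

```python
import numpy as np

def sine_dichotomies(points, omegas):
    """Collect the labelings x -> sign(sin(omega * x)) realizes on the points."""
    pts = np.asarray(points)
    labels = np.where(np.sin(np.outer(omegas, pts)) > 0, 1, -1)
    return {tuple(row) for row in labels}

pts = [1.0, 2.0, 4.0]  # illustrative sample on the line
found = sine_dichotomies(pts, np.linspace(0.01, 50, 200_000))
print(len(found))  # 8 = 2^3: every labeling of the three points is realized
```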

Connecting VC Dimension with the Growth Function
VC dimension encodes the complexity of a hypothesis class, but we want to connect it to Rademacher complexity and the growth function so we can prove generalization bounds.
Theorem (Sauer's Lemma). Let $H$ be a hypothesis set with VC dimension $d$. Then for all $m$,
$$\Pi_H(m) \leq \Phi_d(m) \equiv \sum_{i=0}^{d} \binom{m}{i} \qquad (6)$$
This is good because each term of the sum is $\binom{m}{i} = \frac{m(m-1)\cdots(m-i+1)}{i!} = O(m^d)$, so the whole sum is $O(m^d)$. When we plug this into the learning error bounds, $\log \Pi_H(2m) = \log O(m^d) = O(d \log m)$.
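
A small numerical illustration (mine, not the slides'): for fixed $d$, $\Phi_d(m)$ matches $2^m$ up to $m = d$ and then grows only polynomially, which is exactly why Sauer's lemma gives nontrivial bounds.

```python
from math import comb

def phi(d, m):
    """Sauer's lemma bound: Phi_d(m) = sum_{i=0}^{d} C(m, i)."""
    return sum(comb(m, i) for i in range(d + 1))

d = 3
for m in [3, 5, 10, 20, 50]:
    print(m, phi(d, m), 2 ** m)  # Phi_d(m) = 2^m until m = d, then polynomial
```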

Proof of Sauer's Lemma
Prelim: We'll proceed by induction. Our two base cases are:
- If $m = 0$, $\Pi_H(m) = 1$. You have no data, so there's only one (degenerate) labeling.
- If $d = 0$, $\Pi_H(m) = 1$. If you can't even shatter a single point, then it's a fixed function.

Induction Step
Assume the claim holds for all $m', d'$ for which $m' + d' < m + d$. We are given $H$, $|S| = m$, $S = \{x_1, \dots, x_m\}$, and $d$ is the VC dimension of $H$.
Build two new hypothesis spaces on $S' = \{x_1, \dots, x_{m-1}\}$: $H_1$ is the restriction of $H$ to $S'$, and $H_2$ contains the concepts of $H_1$ whose extensions to $x_m$ appear both ways in $H$. $H_2$ encodes where hypotheses that agree on the first $m - 1$ points differ on $x_m$.

What is the VC dimension of $H_1$ and $H_2$?
If a set is shattered by $H_1$, then it is also shattered by $H$:
$$\text{VC-dim}(H_1) \leq \text{VC-dim}(H) = d \qquad (7)$$
If a set $T$ is shattered by $H_2$, then $T \cup \{x_m\}$ is shattered by $H$, since there are two hypotheses in $H$ for every element of $H_2$ (one for each label of $x_m$):
$$\text{VC-dim}(H_2) \leq d - 1 \qquad (8)$$

Bounding the Growth Function
$$\begin{aligned}
\Pi_H(S) &= |H_1| + |H_2| && (9)\\
&\leq \sum_{i=0}^{d} \binom{m-1}{i} + \sum_{i=0}^{d-1} \binom{m-1}{i} && (10)\\
&= \sum_{i=0}^{d} \left[ \binom{m-1}{i} + \binom{m-1}{i-1} \right] && (11)\\
&= \sum_{i=0}^{d} \binom{m}{i} && (12)\\
&= \Phi_d(m) && (13)
\end{aligned}$$
Step (10) applies the induction hypothesis to $H_1$ (VC dimension $\leq d$) and $H_2$ (VC dimension $\leq d - 1$) on $m - 1$ points. We can rewrite (10) as (11) because $\binom{x}{-1} = 0$, and (11) becomes (12) by Pascal's triangle: $\binom{m-1}{i} + \binom{m-1}{i-1} = \binom{m}{i}$.

Wait a minute...
Is this combinatorial expression really $O(m^d)$?
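
The slides pose this as a question; as a sketch of the standard textbook argument (not spelled out on the slides), for $m \geq d$:
$$\Phi_d(m) = \sum_{i=0}^{d} \binom{m}{i}
\leq \sum_{i=0}^{d} \binom{m}{i} \left(\frac{m}{d}\right)^{d-i}
= \left(\frac{m}{d}\right)^{d} \sum_{i=0}^{d} \binom{m}{i} \left(\frac{d}{m}\right)^{i}
\leq \left(\frac{m}{d}\right)^{d} \left(1 + \frac{d}{m}\right)^{m}
\leq \left(\frac{em}{d}\right)^{d}$$
so $\log \Pi_H(m) \leq d \log \frac{em}{d} = O(d \log m)$, as claimed earlier.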

Generalization Bounds
Combining our previous generalization results with Sauer's lemma, we have that for a hypothesis class $H$ with VC dimension $d$, for any $\delta > 0$, with probability at least $1 - \delta$, for any $h \in H$:
$$R(h) \leq \hat{R}(h) + \sqrt{\frac{2d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} \qquad (14)$$
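
To see how the bound behaves, here is a small sketch (mine, not the slides') that evaluates the right-hand side of (14) numerically; the function name and parameter choices are illustrative.

```python
from math import e, log, sqrt

def vc_generalization_gap(d, m, delta=0.05):
    """Upper bound on R(h) - R_hat(h) from the VC bound (14).

    d: VC dimension, m: sample size, delta: failure probability.
    Assumes m >= d so that log(e * m / d) >= 1.
    """
    complexity = sqrt(2 * d * log(e * m / d) / m)
    confidence = sqrt(log(1 / delta) / (2 * m))
    return complexity + confidence

for m in [100, 1_000, 10_000, 100_000]:
    print(m, round(vc_generalization_gap(d=2, m=m), 4))  # shrinks ~ sqrt(1/m)
```

With $d = 2$ (intervals), the gap decays roughly like $\sqrt{d \log m / m}$, so squaring the sample size roughly halves the bound.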

Whew!
We now have some theory down. We're now going to see if we can find an algorithm that has good VC dimension... and works well in practice: Support Vector Machines.
In class: more VC dimension examples.
