Introduction to Machine Learning
1 Introduction to Machine Learning (FEATURE ENGINEERING). Machine Learning: Jordan Boyd-Graber, University of Maryland. Slides adapted from Eli Upfal.
2-3 What does it mean to learn something? What are the things that we're learning? What does it mean to be learnable? This provides a framework for reasoning about what we can theoretically learn. Sometimes theoretically learnable things are very difficult; sometimes things that should be hard actually work.
4-7 Example: a Californian has just moved to Colorado. When is it nice outside? The Californian has a perfect thermometer, but natives call 50F (10C) nice. Each temperature is an observation x, the Coloradan concept of nice is c(x), and the Californian wants to learn a hypothesis h(x) close to c(x). Generalization error:

R(h) = Pr_{x∼D}[h(x) ≠ c(x)] = E_{x∼D}[1[h(x) ≠ c(x)]]   (1)

[Notation: 1[x] = 1 iff x is true, 0 otherwise.]
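The indicator form of equation (1) is straightforward to estimate numerically. Below is a minimal Python sketch of a Monte Carlo estimate of R(h) for the temperature example; the concept interval, the hypothesis interval, and the temperature distribution are all invented for illustration rather than taken from the slides.

```python
import random

def c(x):
    """Hypothetical true Coloradan concept of "nice": an interval of temperatures (F)."""
    return 45.0 <= x <= 75.0

def h(x):
    """Hypothetical learned hypothesis: a slightly narrower interval."""
    return 50.0 <= x <= 70.0

def estimate_risk(num_samples=100_000, seed=0):
    """Monte Carlo estimate of R(h) = E_{x~D}[1[h(x) != c(x)]],
    assuming D is uniform over 20F..100F (an assumption, not from the slides)."""
    rng = random.Random(seed)
    mistakes = sum(h(x) != c(x)
                   for x in (rng.uniform(20.0, 100.0) for _ in range(num_samples)))
    return mistakes / num_samples

print(f"Estimated generalization error R(h) ~ {estimate_risk():.3f}")
```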
8-12 Probably Correct. The Californian gets n random examples. The best rule that conforms with the examples is the interval [a,b] spanned by the positive examples. Let [c,d] be the correct (unknown) rule, so that c ≤ a ≤ b ≤ d, and let ε be the probability mass of the gap between the two rules (the regions [c,a] and [b,d]). The probability of being wrong is the probability that the n samples missed [c,a] and [b,d].
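The core of the argument on the next slides is that m independent samples are unlikely to all miss a region of probability mass ε. The rough Python simulation below (the uniform distribution on [0,1] and the placement of the gap region are arbitrary choices for illustration) compares the empirical miss probability with the (1 − ε)^m and e^{−εm} quantities used in the derivation.

```python
import math
import random

def all_miss_probability(m, eps, trials=50_000, seed=0, left=0.25):
    """Empirical probability that none of m uniform samples on [0,1] lands in
    [left, left + eps), a region of probability mass eps (the 'bad event')."""
    rng = random.Random(seed)
    misses = sum(
        all(not (left <= rng.random() < left + eps) for _ in range(m))
        for _ in range(trials)
    )
    return misses / trials

eps = 0.1
for m in (10, 30, 60):
    emp = all_miss_probability(m, eps)
    print(f"m={m:3d}  empirical={emp:.4f}  (1-eps)^m={(1 - eps) ** m:.4f}  "
          f"e^(-eps*m)={math.exp(-eps * m):.4f}")
```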
13-19 PAC-learning definition.

Definition (PAC-learnable). A concept class C is PAC-learnable if there exist an algorithm A and a polynomial function f such that for any ε > 0 and δ > 0, for any distribution D over X and any target concept c ∈ C, and for any sample size m ≥ f(1/ε, 1/δ, n, |c|),

Pr_{S∼D^m}[R(h_S) ≤ ε] ≥ 1 − δ.   (2)

Reading the pieces: S is the sample we learn from; D is the data distribution the sample comes from; h_S is the hypothesis we learn; R(h_S) is its generalization error; ε is our bound on the generalization error (e.g., we want it to be better than 0.1); δ is the probability of learning a hypothesis with error greater than ε (e.g., 0.05).
20-27 Is a Californian learning temperature PAC-learnable? The bad event happens if no training point lands in [c,a] or [b,d]. Because the samples are independent,

Pr[x_1 ∉ [c,a] ∧ ... ∧ x_m ∉ [c,a]] = ∏_{i=1}^{m} Pr[x_i ∉ [c,a]].   (3)

We want the probability of a point landing there (or in [b,d]) to be less than ε; the probability that every sample misses a region of mass ε is

Pr[x_1 ∉ [c,a] ∧ ... ∧ x_m ∉ [c,a]] = (1 − ε)^m ≤ e^{−εm},   (4)

using the inequality 1 + x ≤ e^x. The analysis is symmetric for [c,a] and [b,d], which gives the factor of 2 below; δ corresponds to the probability of a bad hypothesis. We want the generalization error to exceed ε with probability less than δ, so solving for m:

Pr[R(h) ≥ ε] ≤ δ   (5)
2e^{−εm} ≤ δ   (6)
−εm ≤ ln(δ/2)   (7)   (take the log of both sides)
m ≥ (1/ε) ln(2/δ)   (8)   (the direction of the inequality flips when you divide by −ε)
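As a quick sanity check on line (8), here is a small calculation of the sample size that the bound requires; the specific ε and δ values are just illustrative choices.

```python
import math

def interval_sample_size(eps, delta):
    """Sample size from the slide's bound m >= (1/eps) * ln(2/delta)."""
    return math.ceil((1.0 / eps) * math.log(2.0 / delta))

# Error at most 10% with probability at least 95% (illustrative values)
print(interval_sample_size(eps=0.10, delta=0.05))   # 37
```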
28 Consistent Hypotheses, Finite Spaces. It is possible to prove that specific problems are learnable (and we will!). Can we do something more general? Yes, for finite hypothesis spaces H with c ∈ H that are also consistent with the training data.

Theorem (Learning bound for finite, consistent H). Let H be a finite set of functions mapping from X to Y, and let A be an algorithm that for any i.i.d. sample S returns a consistent hypothesis (training error R̂(h) = 0). Then for any ε, δ > 0, the concept is PAC-learnable with sample size

m ≥ (1/ε) (ln|H| + ln(1/δ)).   (9)
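The theorem becomes a one-line calculation once |H| is pinned down. The sketch below plugs in a hypothetical finite class (intervals whose endpoints lie on a grid of 1,000 temperatures) to show that |H| enters only through its logarithm; the grid size, ε, and δ are illustrative assumptions.

```python
import math

def finite_h_sample_size(h_size, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)), the consistent finite-H bound in (9)."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Hypothetical finite class: intervals [lo, hi] with endpoints on a 1,000-point grid.
grid = 1000
h_size = grid * (grid - 1) // 2 + grid   # choose lo < hi, plus single-point intervals
print(h_size, finite_h_sample_size(h_size, eps=0.05, delta=0.01))
```

Doubling the grid resolution roughly quadruples |H| but only adds ln 4 ≈ 1.4 to the ln|H| term, which is why even very large finite hypothesis spaces remain learnable with modest amounts of data.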
29-32 Proof: Setup. We want to bound the probability that some h ∈ H is consistent and has error more than ε:

Pr[∃ h ∈ H : R̂(h) = 0 ∧ R(h) > ε]   (10)
= Pr[(R̂(h_1) = 0 ∧ R(h_1) > ε) ∨ ... ∨ (R̂(h_|H|) = 0 ∧ R(h_|H|) > ε)]
≤ Σ_{h ∈ H} Pr[R̂(h) = 0 ∧ R(h) > ε]   (11)   (union bound)
≤ Σ_{h ∈ H} Pr[R̂(h) = 0 | R(h) > ε]   (12)   (definition of conditional probability)
33-38 Proof: Connection back to interval learning. When the generalization error is greater than ε, we bound the probability of seeing no inconsistent points in training for a single hypothesis h:

Pr[R̂(h) = 0 | R(h) > ε] ≤ (1 − ε)^m.   (13)

But this must hold for all of the hypotheses in H, so

Pr[∃ h ∈ H : R̂(h) = 0 ∧ R(h) > ε] ≤ |H| (1 − ε)^m.   (14)

We set the right-hand side equal to δ and solve for m:

|H| (1 − ε)^m ≤ |H| e^{−mε} = δ
ln δ = ln|H| − mε   (apply the log to both sides)
ln(1/δ) + ln|H| = mε   (move ln|H| to the other side and rewrite −ln δ = ln 1 − ln δ = ln(1/δ))
m = (1/ε) (ln|H| + ln(1/δ))   (divide by ε)
39-42 But what does it all mean?

m ≥ (1/ε) (ln|H| + ln(1/δ))   (15)

Confidence: more certainty (smaller δ) means more training data. Complexity: more complicated hypothesis spaces (larger |H|) need more training data.

Scary Question: what is |H| for logistic regression?
43-44 What's next... In class: examples of PAC learnability. Next time: how to deal with infinite hypothesis spaces.

Takeaway: even though we can't prove anything about logistic regression, it still works. However, using the theory will lead us to a better classification technique: support vector machines.