Introduction to Machine Learning


1 Introduction to Machine Learning: Vapnik-Chervonenkis Theory. Barnabás Póczos

2 Empirical Risk and True Risk

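3 Empirical Risk
Shorthand:
True risk of $f$ (deterministic): $R(f) = \mathbb{E}[L(f(X), Y)]$
Bayes risk: $R^* = \inf_f R(f)$
Let us use the empirical counterpart:
Empirical risk: $\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), Y_i)$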

4 Empirical Risk Minimization
Law of Large Numbers: for a fixed predictor, the empirical risk converges to the true risk as the sample size grows.
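The convergence above can be illustrated with a small simulation. This is only a sketch: the true risk of 0.3, the seed, and the function name `empirical_risk` are assumptions chosen for illustration, and each 0-1 loss is simulated directly as a Bernoulli draw instead of coming from a real classifier.

```python
import random

def empirical_risk(n, true_risk=0.3, seed=0):
    """Average of n simulated 0-1 losses, each equal to 1 with probability true_risk."""
    rng = random.Random(seed)
    losses = [1 if rng.random() < true_risk else 0 for _ in range(n)]
    return sum(losses) / n

# The empirical risk should approach the assumed true risk as n grows.
for n in (10, 1_000, 100_000):
    print(n, empirical_risk(n))
```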

5 Overfitting in Classification with ERM
Generative model:
Bayes classifier:
Bayes risk:
(Picture from David Pal)

6 Overfitting in Classification with ERM
n-order thresholded polynomials
Empirical risk:
Bayes risk:
(Picture from David Pal)

7 Overfitting in Regression
If we allow very complicated predictors, we could overfit the training data.
Example: regression with a polynomial of order k (degree k-1).
[Figure: fits of the same data for k=1 (constant), k=2 (linear), k=3 (quadratic), and a higher-order polynomial]
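A minimal sketch of the same effect, under stated assumptions: the data set is hypothetical (the true relation is y = x, observed with alternating noise of +/-0.5), and the "very complicated predictor" is the interpolating polynomial through all six training points (k=6), built with Lagrange interpolation. It fits the training data perfectly but does poorly on noise-free test points.

```python
def lagrange_interpolant(xs, ys):
    """Return the unique polynomial of degree len(xs)-1 through all (x, y) pairs."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            basis = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    basis *= (x - xj) / (xi - xj)
            total += yi * basis
        return total
    return p

def mse(predict, xs, ys):
    """Mean squared error of a predictor on the given points."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training data: y = x plus alternating noise of +/-0.5.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [x + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(train_x)]
# Noise-free test points between the training points.
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = list(test_x)

mean_y = sum(train_y) / len(train_y)
constant = lambda x: mean_y                      # k=1: constant predictor
interp = lagrange_interpolant(train_x, train_y)  # k=6: passes through every training point

print("constant: train", mse(constant, train_x, train_y), "test", mse(constant, test_x, test_y))
print("interp:   train", mse(interp, train_x, train_y), "test", mse(interp, test_x, test_y))
```

The interpolant drives the empirical risk to zero, yet its error on the test points stays well above zero, which is the overfitting pattern the slide describes.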

8 Solutions to Overfitting

9 Solutions to Overfitting: Structural Risk Minimization
Notation: risk, empirical risk
Goal:
(Model error, approximation error)
Solution: Structural Risk Minimization (SRM)

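10 Big Picture
Ultimate goal: $R(f_n^*) - R^* \to 0$, where $f_n^*$ is the classifier the learning algorithm produces, $R$ is the true risk, and $R^*$ is the Bayes risk.
Decomposition: $R(f_n^*) - R^* = \big(R(f_n^*) - \inf_{f \in \mathcal{F}} R(f)\big) + \big(\inf_{f \in \mathcal{F}} R(f) - R^*\big)$, i.e. estimation error plus approximation error.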

11 Effect of Model Complexity
If we allow very complicated predictors, we could overfit the training data.
[Figure: prediction error on training data versus model complexity, for a fixed number of training data]
Empirical risk is no longer a good indicator of true risk.

12 Classification using the 0-1 loss

13 The Bayes Classifier
Lemma I:
Lemma II:
Proofs: Lemma I: trivial from the definition. Lemma II: surprisingly long calculation.

14 The Bayes Classifier
$f_n^* = \arg\min_{f \in \mathcal{F}} \hat{R}_n(f)$: this is what the learning algorithm produces.
We will need these definitions, please copy them!

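15 The Bayes Classifier
Theorem I (bound on the estimation error): $R(f_n^*) - \inf_{f \in \mathcal{F}} R(f) \le 2 \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$
Here $f_n^*$ is what the learning algorithm produces, and $R(f_n^*)$ is the true risk of what the learning algorithm produces.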

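16 Proof of Theorem I
Theorem I (bound on the estimation error): $R(f_n^*) - \inf_{f \in \mathcal{F}} R(f) \le 2 \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$, where $R(f_n^*)$ is the true risk of what the learning algorithm produces.
Proof: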
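The proof itself did not survive in the text of this slide. The standard argument is sketched below; it is not necessarily the one given on the slide. It assumes that $f_n^*$ minimizes the empirical risk over $\mathcal{F}$ and that some $f^*_{\mathcal{F}} \in \mathcal{F}$ attains $\inf_{f \in \mathcal{F}} R(f)$ (the symbol $f^*_{\mathcal{F}}$ is introduced here for illustration).

```latex
\begin{align*}
R(f_n^*) - R(f^*_{\mathcal{F}})
  &= \big[R(f_n^*) - \hat{R}_n(f_n^*)\big]
   + \big[\hat{R}_n(f_n^*) - \hat{R}_n(f^*_{\mathcal{F}})\big]
   + \big[\hat{R}_n(f^*_{\mathcal{F}}) - R(f^*_{\mathcal{F}})\big] \\
  &\le \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| + 0
   + \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| \\
  &= 2 \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|
\end{align*}
```

The middle bracket is at most zero because $f_n^*$ has the smallest empirical risk in $\mathcal{F}$; the outer two brackets are each bounded by the worst-case deviation.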

17 The Bayes Classifier
Theorem II: $|\hat{R}_n(f_n^*) - R(f_n^*)| \le \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$, where $f_n^*$ is what the learning algorithm produces.
Proof: trivial.

18 Corollary
Corollary:
Main message: it's enough to derive upper bounds for $\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$.

19 Illustration of the Risks

20 It's enough to derive upper bounds for $\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$.
It is a random variable that we need to bound!
We will bound it with tail bounds!

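21 Hoeffding's inequality (1963)
Let $Z_1, \dots, Z_n$ be independent with $Z_i \in [a_i, b_i]$. Then for all $\varepsilon > 0$: $P\big(\big|\frac{1}{n}\sum_{i=1}^{n} (Z_i - \mathbb{E}[Z_i])\big| > \varepsilon\big) \le 2 \exp\big(-\frac{2 n^2 \varepsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\big)$
Special case: if $Z_i \in [0, 1]$ for all $i$, the bound becomes $2 e^{-2 n \varepsilon^2}$.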

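22 Binomial distributions
Our goal is to bound $|\hat{R}_n(f) - R(f)|$ for a fixed $f$.
With the 0-1 loss, each loss $L(f(X_i), Y_i)$ is Bernoulli(p) with $p = R(f)$, so $n \hat{R}_n(f)$ is Binomial(n, p).
Therefore, from Hoeffding we have: $P(|\hat{R}_n(f) - R(f)| > \varepsilon) \le 2 e^{-2 n \varepsilon^2}$
Yuppie!!!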

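23 Inversion
From Hoeffding we have: $P(|\hat{R}_n(f) - R(f)| > \varepsilon) \le 2 e^{-2 n \varepsilon^2}$. Set $\delta = 2 e^{-2 n \varepsilon^2}$ and solve for $\varepsilon$.
Therefore, with probability at least $1 - \delta$: $|\hat{R}_n(f) - R(f)| \le \sqrt{\frac{\log(2/\delta)}{2n}}$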
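A short numeric sketch of the inversion step, assuming the special case of Hoeffding's inequality for losses in [0, 1], i.e. the tail bound 2·exp(-2nε²). The function names are chosen here for illustration.

```python
import math

def hoeffding_tail(n, eps):
    """Upper bound 2*exp(-2*n*eps^2) on P(|empirical risk - true risk| > eps)."""
    return 2.0 * math.exp(-2.0 * n * eps ** 2)

def hoeffding_eps(n, delta):
    """Deviation eps at which the tail bound equals delta (the inverted bound)."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

n, delta = 1000, 0.05
eps = hoeffding_eps(n, delta)
print(eps, hoeffding_tail(n, eps))  # the second value recovers delta up to rounding
```

Quadrupling the sample size halves the deviation, since ε scales as 1/√n.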

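24 Union Bound
Our goal is to bound: $\max_{1 \le i \le N} |\hat{R}_n(f_i) - R(f_i)|$ for a finite class $\mathcal{F} = \{f_1, \dots, f_N\}$.
We already know: for each fixed $f_i$, $P(|\hat{R}_n(f_i) - R(f_i)| > \varepsilon) \le 2 e^{-2 n \varepsilon^2}$
Theorem [tail bound on the deviation in the worst case]: $P\big(\max_{1 \le i \le N} |\hat{R}_n(f_i) - R(f_i)| > \varepsilon\big) \le 2 N e^{-2 n \varepsilon^2}$
Worst case error: this is not the worst classifier in terms of classification accuracy! Worst case means that the empirical risk of classifier $f$ is the furthest from its true risk!
Proof: apply the union bound, $P(A_1 \cup \dots \cup A_N) \le \sum_{i=1}^{N} P(A_i)$, to the events $A_i = \{|\hat{R}_n(f_i) - R(f_i)| > \varepsilon\}$.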

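25 Inversion of Union Bound
We already know: $P\big(\max_{1 \le i \le N} |\hat{R}_n(f_i) - R(f_i)| > \varepsilon\big) \le 2 N e^{-2 n \varepsilon^2}$
Therefore, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$: $|\hat{R}_n(f) - R(f)| \le \sqrt{\frac{\log N + \log(2/\delta)}{2n}}$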

26 Inversion of Union Bound
The larger the N, the looser the bound.
This result is distribution free: it is true for all P(X,Y) distributions.
It is useless if N is big or infinite (e.g. all possible hyperplanes).
It can be fixed with McDiarmid's inequality and the VC dimension.
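To see how the bound loosens with N, here is a sketch that evaluates the inverted union bound, assuming the form ε = sqrt((log N + log(2/δ)) / (2n)) obtained from the tail bound 2N·exp(-2nε²). The function name and the sample values are illustrative.

```python
import math

def finite_class_eps(n, num_classifiers, delta):
    """Uniform deviation bound for a finite class of num_classifiers functions."""
    return math.sqrt((math.log(num_classifiers) + math.log(2.0 / delta)) / (2.0 * n))

n, delta = 1000, 0.05
for num_classifiers in (1, 10, 10**3, 10**6, 10**12):
    print(num_classifiers, finite_class_eps(n, num_classifiers, delta))
```

The dependence on N is only logarithmic, but the bound still diverges as N grows without limit, which is why an infinite class needs a different argument.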

27 Concentration and Expected Value


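28 The Expected Error
Our goal is to bound: $\mathbb{E}\big[\max_{1 \le i \le N} |\hat{R}_n(f_i) - R(f_i)|\big]$
We already know (tail bound, concentration inequality): $P\big(\max_{1 \le i \le N} |\hat{R}_n(f_i) - R(f_i)| > \varepsilon\big) \le 2 N e^{-2 n \varepsilon^2}$
Theorem [expected deviation in the worst case]: $\mathbb{E}\big[\max_{1 \le i \le N} |\hat{R}_n(f_i) - R(f_i)|\big] \le \sqrt{\frac{\log(2N)}{2n}}$
Worst case deviation: this is not the worst classifier in terms of classification accuracy! Worst case means that the empirical risk of classifier $f$ is the furthest from its true risk!
Proof: we already know a tail bound. (From that we actually get a slightly weaker inequality, oh well.)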

29 Function classes with infinitely many elements

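30 McDiarmid's Bounded Difference Inequality
Let $Z_1, \dots, Z_n$ be independent, and let $g$ satisfy the bounded difference condition: changing the $i$-th argument alone changes the value of $g$ by at most $c_i$. Then $P\big(g(Z_1, \dots, Z_n) - \mathbb{E}[g(Z_1, \dots, Z_n)] \ge \varepsilon\big) \le \exp\big(-\frac{2 \varepsilon^2}{\sum_{i=1}^{n} c_i^2}\big)$
It follows that $P\big(|g(Z_1, \dots, Z_n) - \mathbb{E}[g(Z_1, \dots, Z_n)]| \ge \varepsilon\big) \le 2 \exp\big(-\frac{2 \varepsilon^2}{\sum_{i=1}^{n} c_i^2}\big)$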

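31 Bounded Difference Condition
Our main goal is to bound $\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$.
Lemma: $P\big(\big|\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| - \mathbb{E}\big[\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|\big]\big| \ge \varepsilon\big) \le 2 e^{-2 n \varepsilon^2}$
Proof: Let $g$ denote the following function: $g(Z_1, \dots, Z_n) = \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$, with $Z_i = (X_i, Y_i)$.
Observation: with the 0-1 loss, changing a single $Z_i$ changes $g$ by at most $1/n$ => McDiarmid can be applied to $g$ with $c_i = 1/n$!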

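32 Bounded Difference Condition
Corollary: with probability at least $1 - \delta$, $\big|\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| - \mathbb{E}\big[\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|\big]\big| \le \sqrt{\frac{\log(2/\delta)}{2n}}$
It remains to bound the expected value. The Vapnik-Chervonenkis inequality does that with the shatter coefficient (and VC dimension)!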

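33 Vapnik-Chervonenkis inequality
Our main goal is to bound $\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$.
We already know: $\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$ concentrates around its expected value (McDiarmid).
Vapnik-Chervonenkis inequality: $P\big(\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| > \varepsilon\big) \le 8 S_{\mathcal{F}}(n) e^{-n \varepsilon^2 / 32}$, where $S_{\mathcal{F}}(n)$ is the shatter coefficient of $\mathcal{F}$.
Corollary (Vapnik-Chervonenkis theorem): $\mathbb{E}\big[\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|\big] \le 2 \sqrt{\frac{\log S_{\mathcal{F}}(n) + \log 2}{n}}$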

34 Shattering

35 How many points can a linear boundary classify exactly in 1D?
2 pts: there exists a placement such that all labelings can be classified.
3 pts: the alternating labeling (e.g. - + -) cannot be classified.
The answer is 2.

36 How many points can a linear boundary classify exactly in 2D?
3 pts: there exists a placement such that all labelings can be classified.
4 pts: no matter how we place 4 points, there is a labeling that cannot be classified.
The answer is 3.

37 How many points can a linear boundary classify exactly in 3D?
The answer is 4 (place the points at the vertices of a tetrahedron).
How many points can a linear boundary classify exactly in d dimensions?
The answer is d+1.
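The 1D case can be checked by brute force. The sketch below assumes distinct points on the line and classifiers of the form sign(w·x + b), which in 1D are thresholds with either orientation; a labeling is realizable exactly when the labels, read in sorted order, change sign at most once. The function names are chosen here for illustration.

```python
from itertools import product

def realizable_1d(points, labels):
    """True if some classifier sign(w*x + b) reproduces the +1/-1 labels on distinct points."""
    sorted_labels = [lab for _, lab in sorted(zip(points, labels))]
    # A threshold classifier flips the label at most once along the line.
    changes = sum(1 for a, b in zip(sorted_labels, sorted_labels[1:]) if a != b)
    return changes <= 1

def shattered_1d(points):
    """True if every one of the 2^n labelings of the points is realizable."""
    return all(realizable_1d(points, labels)
               for labels in product((-1, 1), repeat=len(points)))

print(shattered_1d([0.0, 1.0]))        # two points
print(shattered_1d([0.0, 1.0, 2.0]))   # three points: the labeling - + - fails
```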

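38 Growth function, Shatter coefficient
Definition: the behaviors of $\mathcal{F}$ on points $x_1, \dots, x_n$ are the distinct label vectors $\mathcal{F}(x_1, \dots, x_n) = \{(f(x_1), \dots, f(x_n)) : f \in \mathcal{F}\}$ (the number of behaviors is 5 in this example).
Growth function, shatter coefficient: $S_{\mathcal{F}}(n) = \max_{x_1, \dots, x_n} |\mathcal{F}(x_1, \dots, x_n)|$, the maximum number of behaviors on n points.

39 Growth function, Shatter coefficient
Growth function, shatter coefficient: the maximum number of behaviors on n points.
Example: half spaces in 2D. [Figure: points labeled + and - by a half space]

40 # behaviors, VC-dimension
Growth function, shatter coefficient: $S_{\mathcal{F}}(n)$, the maximum number of behaviors on n points.
Definition (shattering): $\mathcal{F}$ shatters $x_1, \dots, x_n$ if it realizes all $2^n$ labelings, i.e. $|\mathcal{F}(x_1, \dots, x_n)| = 2^n$.
Definition (VC-dimension): $V_{\mathcal{F}} = \max\{n : S_{\mathcal{F}}(n) = 2^n\}$, the largest number of points that can be shattered.
Note: $S_{\mathcal{F}}(n) \le 2^n$.

41 # behaviors, VC-dimension: Definition

42 VC-dimension
[Figure: points labeled + and -, placed so as to maximize the number of different behaviors]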

43 Examples

44 VC dim. of decision stumps (axis-aligned linear separators) in 2D
What's the VC dim. of decision stumps in 2D?
There is a placement of 3 pts that can be shattered => VC dim >= 3

45 VC dim. of decision stumps (axis-aligned linear separators) in 2D
What's the VC dim. of decision stumps in 2D?
If VC dim = 3, then for all placements of 4 pts, there exists a labeling that can't be realized.
Cases: 3 collinear; 1 in convex hull of other 3; quadrilateral.
=> VC dim = 3

46 VC dim. of axis-parallel rectangles in 2D
What's the VC dim. of axis-parallel rectangles in 2D?
There is a placement of 3 pts that can be shattered => VC dim >= 3

47 VC dim. of axis-parallel rectangles in 2D
There is a placement of 4 pts that can be shattered => VC dim >= 4

48 VC dim. of axis-parallel rectangles in 2D
What's the VC dim. of axis-parallel rectangles in 2D?
If VC dim = 4, then for all placements of 5 pts, there exists a labeling that can't be realized.
Cases shown: 4 collinear; pentagon; points in convex hull; 1 in convex hull.
=> VC dim = 4
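The rectangle example can be checked by brute force on specific point sets. The sketch below assumes classifiers that label a point positive exactly when it lies inside a closed axis-parallel rectangle; a labeling is then realizable if and only if the bounding box of the positive points contains no negative point. The point sets (a diamond, and the diamond plus its center) are chosen here for illustration. The code tests only these placements, so it shows VC dim >= 4 but does not by itself prove that no 5-point placement can be shattered.

```python
from itertools import product

def rectangle_realizable(points, labels):
    """True if some axis-parallel rectangle contains exactly the points labeled True."""
    positives = [p for p, lab in zip(points, labels) if lab]
    if not positives:
        return True  # a rectangle placed away from all points labels everything negative
    x_lo, x_hi = min(x for x, _ in positives), max(x for x, _ in positives)
    y_lo, y_hi = min(y for _, y in positives), max(y for _, y in positives)
    # The tightest candidate is the bounding box of the positives; it must exclude all negatives.
    return not any(
        (not lab) and x_lo <= x <= x_hi and y_lo <= y <= y_hi
        for (x, y), lab in zip(points, labels)
    )

def rectangle_shattered(points):
    """True if all 2^n labelings of the points are realizable by rectangles."""
    return all(rectangle_realizable(points, labels)
               for labels in product((False, True), repeat=len(points)))

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
print(rectangle_shattered(diamond))             # 4 points in a diamond
print(rectangle_shattered(diamond + [(0, 0)]))  # adding the center breaks shattering
```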

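49 Sauer's Lemma
We already know that $S_{\mathcal{F}}(n) \le 2^n$ [exponential in n].
Sauer's lemma: the VC dimension can be used to upper bound the shattering coefficient: $S_{\mathcal{F}}(n) \le \sum_{i=0}^{V_{\mathcal{F}}} \binom{n}{i}$
Corollary: $S_{\mathcal{F}}(n) \le (n+1)^{V_{\mathcal{F}}}$ [polynomial in n]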
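A quick numeric comparison of the three quantities, assuming the binomial-sum form of Sauer's lemma and the corollary bound (n+1)^d, where d stands for the VC dimension. The function name is illustrative.

```python
import math

def sauer_bound(n, vc_dim):
    """Sum of binomial coefficients C(n, i) for i = 0..vc_dim."""
    return sum(math.comb(n, i) for i in range(vc_dim + 1))

vc_dim = 3
for n in (3, 10, 100):
    # Sauer bound, polynomial corollary, and the trivial exponential bound.
    print(n, sauer_bound(n, vc_dim), (n + 1) ** vc_dim, 2 ** n)
```

For n well above the VC dimension, the binomial sum is far below 2^n, which is what makes the resulting bounds useful.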

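50 Vapnik-Chervonenkis inequality
Vapnik-Chervonenkis inequality [we don't prove this]: $P\big(\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| > \varepsilon\big) \le 8 S_{\mathcal{F}}(n) e^{-n \varepsilon^2 / 32}$
From Sauer's lemma: $S_{\mathcal{F}}(n) \le (n+1)^{V_{\mathcal{F}}}$, where $V_{\mathcal{F}}$ is the VC dimension of $\mathcal{F}$.
Since the estimation error is at most $2 \sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$ (Theorem I), therefore
Estimation error: $P\big(R(f_n^*) - \inf_{f \in \mathcal{F}} R(f) > \varepsilon\big) \le 8 (n+1)^{V_{\mathcal{F}}} e^{-n \varepsilon^2 / 128}$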

51 Linear (hyperplane) classifiers
We already know that the VC dimension of linear classifiers in d dimensions is d+1.
Estimation error:

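52 Vapnik-Chervonenkis Theorem
We already know from McDiarmid: $\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|$ deviates from its expected value by more than $\varepsilon$ with probability at most $2 e^{-2 n \varepsilon^2}$.
Vapnik-Chervonenkis inequality: $P\big(\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)| > \varepsilon\big) \le 8 S_{\mathcal{F}}(n) e^{-n \varepsilon^2 / 32}$
Corollary (Vapnik-Chervonenkis theorem): $\mathbb{E}\big[\sup_{f \in \mathcal{F}} |\hat{R}_n(f) - R(f)|\big] \le 2 \sqrt{\frac{\log S_{\mathcal{F}}(n) + \log 2}{n}}$
[We don't prove them]
We already know (Hoeffding + union bound for a finite function class with N elements): $P\big(\max_{1 \le i \le N} |\hat{R}_n(f_i) - R(f_i)| > \varepsilon\big) \le 2 N e^{-2 n \varepsilon^2}$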

53 PAC Bound for the Estimation Error
VC theorem:
Inversion:
Estimation error:
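The formulas on this slide did not survive extraction, so the following is only a sketch of what such an inversion looks like under explicit assumptions: the textbook form of the VC inequality with constants 8 and 32, the Sauer corollary S(n) <= (n+1)^d for a class of VC dimension d, and the factor 2 from Theorem I. The constants on the original slide may differ, and the function names are chosen for illustration.

```python
import math

def vc_eps(n, vc_dim, delta):
    """Deviation eps solving 8 * (n+1)^vc_dim * exp(-n * eps^2 / 32) = delta.

    Assumes the textbook constants 8 and 32; the slide's own constants are not recoverable.
    """
    log_shatter = vc_dim * math.log(n + 1)  # log of the assumed bound (n+1)^vc_dim
    return math.sqrt(32.0 * (log_shatter + math.log(8.0 / delta)) / n)

def estimation_error_bound(n, vc_dim, delta):
    """With probability >= 1 - delta, the estimation error is at most 2 * eps (Theorem I)."""
    return 2.0 * vc_eps(n, vc_dim, delta)

for n in (10**4, 10**6, 10**8):
    print(n, estimation_error_bound(n, vc_dim=3, delta=0.05))
```

With these constants the bound is loose for small samples, but it is finite for an infinite class and shrinks roughly as sqrt(d·log(n)/n).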

54 What you need to know
The complexity of the classifier depends on the number of points that can be classified exactly.
Finite case: number of hypotheses.
Infinite case: shattering coefficient, VC dimension.

55 Thanks for your attention

56 Attic

57 Proof of Sauer's Lemma
Write all different behaviors on a sample $(x_1, x_2, \dots, x_n)$ in a matrix, one row per behavior:
0 0 0
0 1 0
1 1 1
1 0 0
0 1 1
VC dim = 2

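58 Proof of Sauer's Lemma
Shattered subsets of columns: the subsets of columns on which the rows of the matrix show every possible 0/1 pattern.
We will prove that (number of rows) <= (number of shattered subsets of columns) <= $\sum_{i=0}^{V_{\mathcal{F}}} \binom{n}{i}$.
Therefore, $S_{\mathcal{F}}(n) \le \sum_{i=0}^{V_{\mathcal{F}}} \binom{n}{i}$.
In this example: $\binom{3}{0} + \binom{3}{1} + \binom{3}{2} = 7$, since VC = 2, n = 3.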

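59 Proof of Sauer's Lemma
Shattered subsets of columns:
Lemma 1: (number of shattered subsets of columns) <= $\sum_{i=0}^{V_{\mathcal{F}}} \binom{n}{i}$. In this example the right-hand side is 7.
Lemma 2: (number of rows of A) <= (number of shattered subsets of columns of A), for any binary matrix A with no repeated rows. In this example: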
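Both lemmas can be checked on the example by enumeration. The sketch below assumes the flattened digits of the example ("0 0 0 0 1 0 1 1 1 1 0 0 0 1 1") form a 5-by-3 matrix read row by row, and it counts the empty set of columns as shattered, consistent with the 1 <= 1 base case of Lemma 2. The function name is illustrative.

```python
from itertools import combinations

# Assumed layout of the example: five behaviors (rows) on n = 3 points (columns).
A = [(0, 0, 0), (0, 1, 0), (1, 1, 1), (1, 0, 0), (0, 1, 1)]

def shattered_column_subsets(rows):
    """All subsets of columns on which the rows show every possible 0/1 pattern."""
    n = len(rows[0])
    shattered = []
    for size in range(n + 1):
        for cols in combinations(range(n), size):
            patterns = {tuple(row[c] for c in cols) for row in rows}
            if len(patterns) == 2 ** size:
                shattered.append(cols)
    return shattered

shattered = shattered_column_subsets(A)
vc_dim = max(len(cols) for cols in shattered)
print(len(A), len(shattered), vc_dim)  # rows, shattered subsets, size of largest shattered subset
```

Under this reading the column pair (second, third) is not shattered because the pattern (0, 1) never occurs, so the chain of inequalities for the example is 5 <= 6 <= 7.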

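60 Proof of Lemma 1
Shattered subsets of columns:
Lemma 1: (number of shattered subsets of columns) <= $\sum_{i=0}^{V_{\mathcal{F}}} \binom{n}{i}$. In this example the right-hand side is 7.
Proof: a shattered subset of columns has at most $V_{\mathcal{F}}$ elements, and the n columns have exactly $\sum_{i=0}^{V_{\mathcal{F}}} \binom{n}{i}$ subsets with at most $V_{\mathcal{F}}$ elements. Q.E.D.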

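61 Proof of Lemma 2
Lemma 2: (number of rows of A) <= (number of shattered subsets of columns of A), for any binary matrix A with no repeated rows.
Proof: induction on the number of columns.
Base case: A has one column. There are three cases:
A = (0) => 1 <= 1
A = (1) => 1 <= 1
A has the two rows (0) and (1) => 2 <= 2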

62 Proof of Lemma 2
Inductive case: A has at least two columns.
We have,
By induction (fewer columns)

63 Proof of Lemma 2
because
Q.E.D.

64 Solution to Overfitting
2nd issue:
Solution:

65 Approximation with the Hinge loss and quadratic loss
(Picture taken from R. Herbrich)

66 Underfitting
Bayes risk =

67 Underfitting
Best linear classifier:
The empirical risk of the best linear classifier:

68 Underfitting
Best quadratic classifier:
Same as the Bayes risk => good fit!

69 Structural Risk Minimization
Ultimate goal: estimation error + approximation error -> 0, i.e. the risk of the learned classifier approaches the Bayes risk.
So far we studied when the estimation error -> 0, but we also want the approximation error -> 0.
Many different variants penalize overly complex models to avoid overfitting.
