Learning theory Lecture 4
1 Learning theory Lecture 4 David Sontag New York University Slides adapted from Carlos Guestrin & Luke Zettlemoyer
2 What's next? We gave several machine learning algorithms: the perceptron, the linear support vector machine (SVM), and the SVM with kernels (e.g., polynomial or Gaussian). How do we guarantee that the learned classifier will perform well on test data? How much training data do we need?
3 Example: Perceptron applied to spam classification. In your homework 1, you trained a spam classifier using the perceptron. The training error was always zero. With few data points, there was a big gap between training error and test error! [Figure: test error (percentage misclassified) vs. number of training examples, showing the gap test error − training error shrinking below 0.02 as the number of examples grows.]
4 How much training data do you need? Depends on what hypothesis class the learning algorithm considers. For example, consider a memorization-based learning algorithm. Input: training data S = { (x_i, y_i) }. Output: a function f(x) which, if there exists (x_i, y_i) in S such that x = x_i, predicts y_i, and otherwise predicts the majority label. This learning algorithm will always obtain zero training error, but it will take a huge amount of training data to obtain small test error (i.e., its generalization performance is horrible). Linear classifiers are powerful precisely because of their simplicity: generalization is easy to guarantee.
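To make the memorization-based learner concrete, here is a minimal Python sketch (the class name and interface are invented for illustration; they are not from the slides):

```python
from collections import Counter

class MemorizationLearner:
    """Predicts the stored label for previously seen inputs,
    and the majority training label for everything else."""

    def fit(self, X, y):
        # Memorize every training example exactly.
        self.memory = {tuple(x_i): y_i for x_i, y_i in zip(X, y)}
        # Fall back to the majority label for unseen inputs.
        self.majority = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.memory.get(tuple(x), self.majority)

# Zero training error by construction, but terrible generalization:
learner = MemorizationLearner().fit([(0, 1), (1, 0), (1, 1)], [1, 0, 1])
print(learner.predict((0, 1)))  # memorized -> 1
print(learner.predict((5, 5)))  # unseen    -> majority label 1
```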
5 Roadmap of lecture: 1. Generalization of finite hypothesis spaces. 2. VC-dimension. Will show that linear classifiers need to see approximately d training points, where d is the dimension of the feature vectors. Explains the good performance we obtained using the perceptron (we had a few thousand features)! 3. Margin-based generalization. Applies to infinite-dimensional feature vectors (e.g., Gaussian kernel). [Figure: test error (percentage misclassified) vs. number of training examples for the perceptron algorithm on spam classification. Figure from Cynthia Rudin]
6 How big should your validation set be? In PS1, you tried many configurations of your algorithms (averaged vs. regular perceptron, max # of iterations) and chose the one that had the smallest validation error. Suppose in total you tested |H| = 40 different classifiers on the validation set of m held-out e-mails. The best classifier obtains 98% accuracy on these m e-mails! But what is the true classification accuracy? How large does m need to be so that we can guarantee that the best configuration (measured on the validation set) is truly good?
7 A simple setting. Classification with m data points. Finite number of possible hypotheses (e.g., 40 spam classifiers); H_c ⊆ H is the set of hypotheses consistent with the data. A learner finds a hypothesis h that is consistent with the training data, i.e., gets zero error in training: error_train(h) = 0. (Assume for now that one of the classifiers gets 100% accuracy on the m e-mails; we'll handle the 98% case afterward.) What is the probability that h has more than ε true error, error_true(h) ≥ ε?
8 Refresher on probability: outcomes. An outcome space Ω specifies the possible outcomes that we would like to reason about, e.g. Ω = {heads, tails} for a coin toss, or Ω = {1, 2, 3, 4, 5, 6} for a die toss. We specify a probability p(x) for each outcome x such that p(x) ≥ 0 and Σ_{x∈Ω} p(x) = 1. E.g., p(heads) = 0.6, p(tails) = 0.4.
9 Refresher on probability: events. An event is a subset of the outcome space, e.g. E = {2, 4, 6} (even die tosses) or O = {1, 3, 5} (odd die tosses). The probability of an event is given by the sum of the probabilities of the outcomes it contains: p(E) = Σ_{x∈E} p(x). E.g., p(E) = p(2) + p(4) + p(6) = 1/2 if the die is fair.
10 Refresher on probability: union bound. P(A or B or C or D or …) ≤ P(A) + P(B) + P(C) + P(D) + … More precisely, p(A ∪ B) = p(A) + p(B) − p(A ∩ B) ≤ p(A) + p(B). Q: When is this a tight bound? A: For disjoint events (i.e., non-overlapping circles).
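A quick numerical illustration of the union bound (a sketch; the two events chosen here are illustrative, not from the slides):

```python
from fractions import Fraction

# Fair six-sided die: each outcome has probability 1/6.
p = {x: Fraction(1, 6) for x in range(1, 7)}

A = {2, 4, 6}   # even tosses
B = {4, 5, 6}   # tosses >= 4 (overlaps with A)

def prob(event):
    return sum(p[x] for x in event)

print(prob(A | B))         # exact: P(A or B) = 2/3
print(prob(A) + prob(B))   # union bound:      1
# The bound is tight only when the events are disjoint.
```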
11 Refresher on probability: independence. Two events A and B are independent if p(A ∩ B) = p(A)p(B). Are the two events in the figure independent? No! p(A ∩ B) = 0, but p(A)p(B) > 0, so p(A ∩ B) ≠ p(A)p(B).
12 Refresher on probability: independence. Two events A and B are independent if p(A ∩ B) = p(A)p(B). Suppose our outcome space consisted of two different dice: Ω = all pairs of faces from 2 die tosses, 6² = 36 outcomes. (Analogy: the outcome space defines all possible sequences of e-mails in the training set.) The probability of each outcome is defined as a product, e.g. p(1,1) = a_1 b_1 and p(1,2) = a_1 b_2, where the parameters a_1, …, a_6 and b_1, …, b_6 satisfy Σ_{i=1}^{6} a_i = 1 and Σ_{j=1}^{6} b_j = 1.
13 Refresher on probability: independence. Two events A and B are independent if p(A ∩ B) = p(A)p(B). Are these events independent? Yes! Let A be an event about the first die only (analogy: asking about the first e-mail in the training set) and B an event about the second die only (analogy: asking about the second e-mail in the training set). For example, with A = "first die shows 1" and B = "second die shows 2": p(A) = Σ_{j=1}^{6} a_1 b_j = a_1, p(B) = Σ_{i=1}^{6} a_i b_2 = b_2, and p(A ∩ B) = a_1 b_2 = p(A)p(B).
14 Refresher on probability: discrete random variables. A random variable X is a mapping X : Ω → D, where D is some set (e.g., the integers); it induces a partition of all outcomes. For some x ∈ D, we say p(X = x) = p({ω ∈ Ω : X(ω) = x}), the probability that variable X assumes state x. Notation: Val(X) = the set D of all values assumed by X (we will interchangeably call these the values or states of variable X). Example: Ω = all pairs of faces from 2 die tosses.
15 Refresher on probability: discrete random variables. p(X) is a distribution: Σ_{x∈Val(X)} p(X = x) = 1. E.g., X_1 may refer to the value of the first die, and X_2 to the value of the second die. We call two random variables X and Y identically distributed if Val(X) = Val(Y) and p(X = s) = p(Y = s) for all s in Val(X). With p(1,1) = a_1 b_1, p(1,2) = a_1 b_2, etc. (two die tosses, Σ_i a_i = 1, Σ_j b_j = 1), X_1 and X_2 are NOT identically distributed.
16 Refresher on probability: discrete random variables. p(X) is a distribution: Σ_{x∈Val(X)} p(X = x) = 1. E.g., X_1 may refer to the value of the first die, and X_2 to the value of the second die. We call two random variables X and Y identically distributed if Val(X) = Val(Y) and p(X = s) = p(Y = s) for all s in Val(X). If instead both dice share the same parameters, i.e. p(1,1) = a_1 a_1, p(1,2) = a_1 a_2, etc. (with Σ_i a_i = 1), then X_1 and X_2 ARE identically distributed.
17 Refresher on probability: discrete random variables. X = x is simply an event, so we can apply the union bound, etc. Two random variables X and Y are independent if p(X = x, Y = y) = p(X = x) p(Y = y) for all x ∈ Val(X), y ∈ Val(Y). (The joint probability p(X = x, Y = y) is formally given by the event X = x ∩ Y = y.) The expectation of X is defined as E[X] = Σ_{x∈Val(X)} p(X = x) · x. If X is binary valued, i.e. x is either 0 or 1, then E[X] = p(X = 0)·0 + p(X = 1)·1 = p(X = 1).
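A tiny worked check of the binary-expectation identity E[X] = p(X = 1) (the distribution below is an illustrative choice, not from the slides):

```python
from fractions import Fraction

# Binary random variable: E[X] = p(X=0)*0 + p(X=1)*1 = p(X=1).
p = {0: Fraction(2, 5), 1: Fraction(3, 5)}
expectation = sum(prob * x for x, prob in p.items())
print(expectation)          # 3/5
print(expectation == p[1])  # True
```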
18 A simple setting (recap). Classification with m data points. Finite number of possible hypotheses (e.g., 40 spam classifiers); H_c ⊆ H is the set of hypotheses consistent with the data. A learner finds a hypothesis h that is consistent with the training data, i.e., gets zero error in training: error_train(h) = 0. (Assume for now that one of the classifiers gets 100% accuracy on the m e-mails; we'll handle the 98% case afterward.) What is the probability that h has more than ε true error, error_true(h) ≥ ε?
19 How likely is a single hypothesis to get m data points right? The probability of a hypothesis h incorrectly classifying is ε_h = Σ_{(x,y)} p̂(x, y) 1[h(x) ≠ y]. Let Z_i^h be a random variable that takes two values: 1 if h correctly classifies the i-th data point, and 0 otherwise. The Z_i^h variables are independent and identically distributed (i.i.d.) with Pr(Z_i^h = 0) = Σ_{(x,y)} p̂(x, y) 1[h(x) ≠ y] = ε_h. What is the probability that h classifies m data points correctly? Pr(h gets m i.i.d. data points right) = (1 − ε_h)^m ≤ e^{−ε_h m}
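The last inequality uses 1 − ε ≤ e^{−ε}. A quick numerical sanity check (ε_h and m below are arbitrary illustrative values):

```python
import math

eps_h, m = 0.05, 200
exact = (1 - eps_h) ** m      # P(h with true error eps_h gets all m points right)
bound = math.exp(-eps_h * m)  # exponential upper bound from the slide
print(exact, bound, exact <= bound)  # ~3.5e-05  ~4.5e-05  True
```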
20 Are we done? Pr(h gets m i.i.d. data points right | error_true(h) ≥ ε) ≤ e^{−εm}. Says: if h gets m data points correct, then with very high probability (i.e., 1 − e^{−εm}) it is close to perfect (i.e., will have error ≤ ε). But this only considers one hypothesis! Suppose 1 billion classifiers were tried, and each was a random function. For m small enough, one of the functions will classify all points correctly, yet all have very large true error.
21 How likely is the learner to pick a bad hypothesis? Pr(h gets m i.i.d. data points right | error_true(h) ≥ ε) ≤ e^{−εm}. Suppose there are |H_c| hypotheses consistent with the training data. How likely is the learner to pick a bad one, i.e., one with true error ≥ ε? We need a bound that holds for all of them: P(error_true(h_1) ≥ ε OR error_true(h_2) ≥ ε OR … OR error_true(h_{|H_c|}) ≥ ε) ≤ Σ_k P(error_true(h_k) ≥ ε) [union bound] ≤ Σ_k (1−ε)^m [bound on individual h_k] ≤ |H| (1−ε)^m [since |H_c| ≤ |H|] ≤ |H| e^{−mε} [since (1−ε) ≤ e^{−ε} for 0 ≤ ε ≤ 1].
22 Generalization error of finite hypothesis spaces [Haussler 88]. We just proved the following result. Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent on the training data, P(error_true(h) > ε) ≤ |H| e^{−mε}.
23 Using a PAC bound. Typically, 2 use cases: (1) pick ε and δ, compute m; (2) pick m and δ, compute ε. Argument: we are willing to tolerate a δ probability of having ≥ ε error. Since P(error_true(h) ≥ ε) ≤ |H| e^{−mε}, requiring |H| e^{−mε} ≤ δ and taking logs gives ln|H| − mε ≤ ln δ, so for all h, with probability 1 − δ the following holds (either case 1 or case 2). Case 1: m ≥ (ln|H| + ln(1/δ)) / ε. Case 2: ε ≥ (ln|H| + ln(1/δ)) / m. E.g., for ε = δ = 0.01 and |H| = 40, we need m ≥ 830. Log dependence on |H|: OK if H is of exponential size (but not doubly exponential). ε has a stronger influence than δ; ε shrinks at rate O(1/m).
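A small sketch reproducing the m ≥ 830 calculation (case 1); the function name is made up, and we round up to the next integer:

```python
import math

def consistent_learner_sample_size(H_size, eps, delta):
    """Smallest m with |H| * exp(-m * eps) <= delta (case 1 of the PAC bound)."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

print(consistent_learner_sample_size(H_size=40, eps=0.01, delta=0.01))  # 830
```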
24 Limitations of the Haussler 88 bound. There may be no consistent hypothesis h (i.e., no h with error_train(h) = 0). Size of the hypothesis space: what if |H| is really big? What if it is continuous? First goal: can we get a bound for a learner with nonzero training error error_train(h) on the data set?
25 Question: what's the expected error of a hypothesis? The probability of a hypothesis incorrectly classifying is θ = Σ_{(x,y)} p̂(x, y) 1[h(x) ≠ y]. Let's now let Z_i^h be a random variable that takes two values: 1 if h correctly classifies the i-th data point, and 0 otherwise. The Z_i^h variables are independent and identically distributed (i.i.d.) with Pr(Z_i^h = 0) = Σ_{(x,y)} p̂(x, y) 1[h(x) ≠ y]. Estimating the true error probability is like estimating the parameter of a coin! Let X_i = 1 − Z_i^h indicate a mistake on point i, so p(X_i = 1) = θ is the true error probability and (1/m) Σ_i X_i is the observed fraction of points incorrectly classified, with E[(1/m) Σ_{i=1}^{m} X_i] = (1/m) Σ_{i=1}^{m} E[X_i] = θ (by linearity of expectation). Chernoff bound: for m i.i.d. coin flips X_1, …, X_m, where X_i ∈ {0,1} and 0 < ε < 1: P(θ − (1/m) Σ_i X_i > ε) ≤ e^{−2mε²}.
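A small simulation of the coin-flip analogy (a sketch; θ, m, and ε are made-up values): the observed error fraction concentrates around the true error, and the empirical deviation probability stays below the (two-sided) Chernoff/Hoeffding bound.

```python
import math
import random

theta, m, eps, trials = 0.3, 200, 0.05, 20000   # illustrative values
random.seed(0)

bad = 0
for _ in range(trials):
    # X_i = 1 if the hypothesis misclassifies the i-th point (probability theta).
    mistakes = sum(random.random() < theta for _ in range(m))
    if abs(mistakes / m - theta) > eps:
        bad += 1

print(bad / trials)                     # empirical P(|estimate - theta| > eps)
print(2 * math.exp(-2 * m * eps ** 2))  # two-sided Hoeffding bound, ~0.74
```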
26 Generalization bound for |H| hypotheses. Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h, Pr(|error_true(h) − error_D(h)| > ε) ≤ |H| e^{−2mε²}. Why? Same reasoning as before: use the union bound over individual Chernoff bounds.
27 PAC bound and bias-variance tradeoff. For all h, with probability at least 1 − δ: error_true(h) ≤ error_D(h) + sqrt( (ln|H| + ln(1/δ)) / (2m) ), where the first term plays the role of the bias and the second the variance. For large |H|: low bias (assuming we can find a good h), high variance (because the bound is looser). For small |H|: high bias (is there a good h?), low variance (tighter bound).
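A sketch evaluating the square-root ("variance") term of this bound for a few illustrative values of m (the numbers are made up):

```python
import math

def generalization_gap(H_size, m, delta):
    """With probability >= 1 - delta: error_true(h) <= error_D(h) + this gap."""
    return math.sqrt((math.log(H_size) + math.log(1.0 / delta)) / (2.0 * m))

for m in (100, 1000, 10000):
    print(m, round(generalization_gap(H_size=40, m=m, delta=0.05), 3))
# Growing |H| loosens the bound (more variance); more data m tightens it.
```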
28 What about continuous hypothesis spaces? A continuous hypothesis space has |H| = ∞. Infinite variance??? No: we only care about the maximum number of points that can be classified exactly!
29 How many points can a linear boundary classify exactly? (1-D) 2 points: yes! 3 points: no. (The figure shows the possible labelings, 8 in total for 3 points.)
30 Shattering and Vapnik-Chervonenkis dimension. A set of points is shattered by a hypothesis space H iff: for all ways of splitting the examples into positive and negative subsets, there exists some consistent hypothesis h. The VC dimension of H over input space X is the size of the largest finite subset of X shattered by H.
31 How many points can a linear boundary classify exactly? (2-D) 3 points: yes! 4 points: no. [Figure from Chris Burges]
32 How many points can a linear boundary classify exactly? (d-D) A linear classifier sign(Σ_{j=1..d} w_j x_j + b) can represent all assignments of possible labels to d+1 points, but not to d+2! Thus, the VC-dimension of d-dimensional linear classifiers is d+1. (The bias term b is required.) Rule of thumb: the number of parameters in the model often matches the max number of points. Question: can we get a bound for error as a function of the number of points that can be completely labeled?
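A quick numerical check (a sketch, not from the slides) that d+1 points in general position can be given any labeling by a linear classifier with a bias term: solving the (d+1)×(d+1) system [X 1][w; b] = y fits each labeling exactly, so the predicted signs match the labels.

```python
import numpy as np

d = 3
rng = np.random.default_rng(0)
X = rng.standard_normal((d + 1, d))        # d+1 random points in R^d
A = np.hstack([X, np.ones((d + 1, 1))])    # augment with the bias column

label_rng = np.random.default_rng(1)
for _ in range(5):                         # spot-check a few labelings
    y = label_rng.choice([-1.0, 1.0], size=d + 1)
    wb = np.linalg.solve(A, y)             # w = wb[:d], b = wb[d]
    assert np.all(np.sign(A @ wb) == y)    # this labeling is realized
print("all sampled labelings realized by some (w, b)")
```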
33 PAC bound using VC dimension. VC dimension: the number of training points that can be classified exactly (shattered) by hypothesis space H! It measures the relevant size of the hypothesis space. Same bias/variance tradeoff as always; now it is just a function of VC(H). Note: all of this theory is for binary classification; it can be generalized to multi-class and also regression.
34 What is the VC-dimension of rectangle classifiers? First, show that there are 4 points that can be shattered. Then, show that no set of 5 points can be shattered. [Figures from Anand Bhaskar, Ilya Sukhar]
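A brute-force shattering check for axis-aligned rectangles (a sketch, not from the slides). It relies on the fact that if any rectangle contains exactly the chosen positive points, then the bounding box of those positives does too, so checking the bounding box suffices.

```python
from itertools import product

def rectangle_shatters(points):
    """True if axis-aligned rectangles (label + inside, - outside)
    realize every labeling of the given points."""
    for labels in product([False, True], repeat=len(points)):
        pos = [p for p, lab in zip(points, labels) if lab]
        if not pos:
            continue  # an empty rectangle handles the all-negative labeling
        # Smallest rectangle containing all positives: their bounding box.
        lo_x, hi_x = min(x for x, _ in pos), max(x for x, _ in pos)
        lo_y, hi_y = min(y for _, y in pos), max(y for _, y in pos)
        # If any negative point falls inside, no rectangle is consistent.
        for (x, y), lab in zip(points, labels):
            if not lab and lo_x <= x <= hi_x and lo_y <= y <= hi_y:
                return False
    return True

# Four points in a "diamond" configuration can be shattered ...
print(rectangle_shatters([(0, 1), (0, -1), (1, 0), (-1, 0)]))  # True
# ... showing VC-dimension >= 4 (the slides argue no 5 points work).
```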
35 Generalization bounds using VC dimension. Linear classifiers: VC(H) = d+1, for d features plus constant term b. Classifiers using the Gaussian kernel: VC(H) = ∞. [Figure from Chris Burges; figure from mblondel.org showing the Gaussian kernel as a function of the squared Euclidean distance]
36 Gap tolerant classifiers. Suppose the data lies in R^d in a ball of diameter D. Consider a hypothesis class H of linear classifiers that can only classify point sets with margin at least M. What is the largest set of points that H can shatter? [Figure from Chris Burges: a point set with D = 2 and M = 3/2 that cannot be shattered; points are annotated with Φ ∈ {0, +1, −1} and Y ∈ {0, +1, −1}.] VC dimension = min(d, D²/M²). Since the margin is M = 2γ = 2/||w||, the SVM attempts to minimize ||w||², which minimizes the VC-dimension!
37 Gap tolerant classifiers (continued). Suppose the data lies in R^d in a ball of diameter D. Consider a hypothesis class H of linear classifiers that can only classify point sets with margin at least M. What is the largest set of points that H can shatter? [Figure from Chris Burges, as on the previous slide: D = 2, M = 3/2.] VC dimension = min(d, D²/M²). What is R = D/2 for the Gaussian kernel? R = max_x ||Φ(x)|| = max_x sqrt(K(x, x)) = 1. What is ||w||²? ||w||² = ||Σ_i α_i y_i Φ(x_i)||² = Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j), and the margin is M = 2/||w||.
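A numerical sketch of the ||w||² computation for a Gaussian-kernel classifier; the data points and dual coefficients α below are arbitrary illustrative values, not the output of an actual SVM solver.

```python
import numpy as np

def gaussian_kernel(X, gamma=1.0):
    # K(x, x') = exp(-gamma * ||x - x'||^2); note K(x, x) = 1, so R = 1.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.5], [2.0, 2.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
alpha = np.array([0.5, 0.3, 0.4, 0.4])      # illustrative dual coefficients

K = gaussian_kernel(X)
w_norm_sq = (alpha * y) @ K @ (alpha * y)   # ||w||^2 computed via the kernel only
print(w_norm_sq)                            # margin M = 2 / sqrt(w_norm_sq)
```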
38 What you need to know. Finite hypothesis space: derive the results by counting the number of hypotheses. The complexity of the classifier depends on the number of points that can be classified exactly: finite case, the number of hypotheses considered; infinite case, the VC dimension. VC dimension of gap tolerant classifiers, used to justify the SVM. Bias-variance tradeoff in learning theory.