
1 Learning Theory
Sridhar Mahadevan
University of Massachusetts

2 Topics
- Probability theory meets machine learning
- Concentration inequalities: Chebyshev, Chernoff, Hoeffding, and Markov
- Bounds on generalization error
- Structural risk minimization
- Growth functions and VC dimension
- The VC theorem: the most important theorem in machine learning

3 Generalization
The essence of learning is generalization. This requires making inferences about a population from a sample.
Example: In the class poll, 8 out of 21 students said they could make the class on Wednesday. What inferences can we make for the entire class?

4 Polling
In the 2012 Presidential election, the New York Times 538 political analysis column predicted President Obama would be reelected with 98% probability and win 303 electoral college votes. President Obama was reelected with 332 electoral college votes.
On November 3rd, 1948, most news organizations incorrectly predicted that President Truman would be defeated by challenger Dewey.

5 Incorrect Polling (figure only)

6 The Law of Averages
If you toss a coin a large number of times, which of the following statements is true?
- If you get a lot of heads, then tails should start coming up.
- The number of heads should gradually get closer to the number of tails.
- The chance error (the difference between the number of heads and half the number of tosses) increases.
- The difference between the percentage of heads and 50% decreases.

7 Borel-Cantelli Lemma
The most famous theorem in probability theory!
Theorem: If $\{A_n, n \geq 1\}$ is a sequence of events such that
$\sum_{n=1}^{\infty} P(A_n) < \infty$
then it follows that
$P(A_n \text{ i.o.}) = 0$

8 Law of Large Numbers
Let $X_n$ be a sequence of random variables, and $S_n = \sum_{i=1}^{n} X_i$. Then
$\frac{S_n - E(S_n)}{n} \to 0$
Bernoulli (1713): Let the $X_i$ be Bernoulli distributed random variables with success probability $p$. For $\epsilon > 0$,
$\lim_{n \to \infty} P\left(\left|\frac{S_n}{n} - p\right| \geq \epsilon\right) = 0$
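The convergence of $S_n/n$ is easy to see empirically. A minimal sketch in Python (numpy assumed; the success probability is set to the class-poll proportion purely as an example):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 8 / 21  # success probability, e.g. the class-poll proportion

# Simulate a long sequence of Bernoulli trials and track the running mean S_n / n.
flips = rng.binomial(1, p, size=100_000)
running_mean = np.cumsum(flips) / np.arange(1, len(flips) + 1)

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"n = {n:>6}:  S_n/n = {running_mean[n - 1]:.4f}   (p = {p:.4f})")
```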

9 Chebyshev Inequality
Let $X$ be a random variable and $g(x)$ be any non-negative function. Then, for any $r > 0$:
$P(g(X) \geq r) \leq \frac{E(g(X))}{r}$

10 Chebyshev Inequality
Proof:
$E[g(X)] = \int g(x) f_X(x)\,dx \geq \int_{x: g(x) \geq r} g(x) f_X(x)\,dx \geq r \int_{x: g(x) \geq r} f_X(x)\,dx = r\, P(g(X) \geq r)$

11 Example
Let $X$ be a random variable with mean $E(X) = \mu$ and variance $Var(X) = \sigma^2$. Then
$P\left(\frac{(X-\mu)^2}{\sigma^2} \geq t^2\right) \leq \frac{1}{t^2}$
$P(|X - \mu| \geq t\sigma) \leq \frac{1}{t^2}$
$P(|X - \mu| < 2\sigma) \geq \frac{3}{4}$
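A quick numerical check of the second inequality (numpy assumed; the exponential distribution is just an illustrative choice, with mean and standard deviation both equal to 1):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)   # mean 1, std 1
mu, sigma = x.mean(), x.std()

for t in [1.5, 2.0, 3.0]:
    empirical = np.mean(np.abs(x - mu) >= t * sigma)  # P(|X - mu| >= t*sigma)
    print(f"t = {t}: empirical tail = {empirical:.4f},  Chebyshev bound 1/t^2 = {1 / t**2:.4f}")
```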

12 Markov Inequality
Let $X$ be a nonnegative random variable. Then for any value $r > 0$, it follows that
$P(X \geq r) \leq \frac{E(X)}{r}$
Proof: Straightforward consequence of Chebyshev's inequality (take $g(x) = x$).

13 Chernoff Bounds
In a seminal paper in 1952, Chernoff introduced a powerful approach to proving concentration inequalities.
Chernoff bounds have been used widely in computer science (e.g., probabilistic algorithms, large graphs).
The most important example of a Chernoff bound is the Hoeffding inequality.

14 Chernoff Bounds
Let $X_i$, $1 \leq i \leq n$, be a set of mutually independent random variables such that
$P(X_i = +1) = P(X_i = -1) = \frac{1}{2}$
Let $S_n = \sum_i X_i$. Then, it follows that for any $a > 0$:
$P(S_n > a) < e^{-a^2/2n}$

15 Chernoff's Method
First, note a trivial consequence of the Markov inequality:
$P(X > \alpha E(X)) < \frac{1}{\alpha}$
Exponentiate what is to be proved and use Markov's inequality:
$P(S_n > a) = P(e^{\lambda S_n} > e^{\lambda a}) < \frac{E(e^{\lambda S_n})}{e^{\lambda a}}$
for some arbitrary $\lambda > 0$ (to be selected later to tighten the bound).

16 Chernoff's Method
Note that
$E(e^{\lambda X_i}) = \frac{e^{\lambda} + e^{-\lambda}}{2} = \cosh(\lambda)$
Hence, it follows that
$E(e^{\lambda S_n}) = E(e^{\lambda \sum_i X_i}) = \prod_i E(e^{\lambda X_i}) = \cosh^n(\lambda) \leq e^{n\lambda^2/2}$
because $\cosh(\lambda) \leq e^{\lambda^2/2}$.
Finally, we get
$P(S_n > a) < \frac{e^{n\lambda^2/2}}{e^{\lambda a}} = e^{n\lambda^2/2 - \lambda a}$
Crucial last step: optimize $\lambda$ to tighten the bound.

17 Chernoff's Method
Recap: here is what we have so far:
$P(S_n > a) < \frac{e^{n\lambda^2/2}}{e^{\lambda a}} = e^{n\lambda^2/2 - \lambda a}$
Select $\lambda > 0$ to make the bound as tight as possible. Find the gradient:
$\frac{d}{d\lambda}\left(\frac{n\lambda^2}{2} - \lambda a\right) = n\lambda - a = 0$
which gives $\lambda = \frac{a}{n}$. Putting this value of $\lambda$ above, we get our final bound:
$P(S_n > a) < e^{-a^2/2n}$
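A Monte Carlo sanity check of this final bound for sums of fair ±1 coin flips; a sketch assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 100, 200_000

# S_n is a sum of n independent variables, each +1 or -1 with probability 1/2.
signs = rng.choice([-1, 1], size=(trials, n))
s = signs.sum(axis=1)

for a in [10, 20, 30]:
    empirical = np.mean(s > a)
    bound = np.exp(-a**2 / (2 * n))
    print(f"a = {a}: P(S_n > a) ~ {empirical:.4f}   Chernoff bound = {bound:.4f}")
```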

18 Hoeffding Inequality
Let $X_i$, $i = 1,\ldots,n$, be independent samples with $a_i \leq X_i \leq b_i$. Let $S_n = \sum_i X_i$. Then:
$P(S_n - E S_n \geq \epsilon) \leq e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}$
$P(S_n - E S_n \leq -\epsilon) \leq e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}$

19 Proof Sketch
As before, we use the Chernoff bounding method and Markov's inequality:
$P(S_n - E S_n \geq \epsilon) = P(e^{\lambda(S_n - E S_n)} \geq e^{\lambda\epsilon}) \leq \frac{\prod_i E(e^{\lambda(X_i - E X_i)})}{e^{\lambda\epsilon}}$
We also use the following inequality (Hoeffding's lemma): for a zero-mean random variable $X$ with $a \leq X \leq b$,
$E(e^{\lambda X}) \leq e^{\lambda^2 (b - a)^2 / 8}$
This follows from the convexity of the exponential function.
Finally, we tune $\lambda$ above to tighten the bound as much as possible.

20 Union Bound
Given a set of (not necessarily independent) events $A_1, \ldots, A_k$:
$P(A_1 \cup \cdots \cup A_k) \leq P(A_1) + \cdots + P(A_k)$
This is very useful in converting one-sided bounds into two-sided bounds:
$P(|S_n - E S_n| \geq \epsilon) \leq 2 e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}$
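A tiny numerical illustration (numpy assumed): the two events below are highly dependent, and the union bound still holds; it is simply loose.

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.normal(size=1_000_000)

a1 = z > 1.0           # event A1
a2 = z > 1.5           # event A2 (a subset of A1, so strongly dependent)

print("P(A1 or A2)   ~", np.mean(a1 | a2))
print("P(A1) + P(A2) ~", np.mean(a1) + np.mean(a2))   # union bound: always an upper bound
```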

21 Hoeffding Inequality
Let $X_i$, $i = 1,\ldots,n$, be i.i.d. samples from a Bernoulli distribution with probability of success $p$. Let $\hat{p} = \frac{1}{n} \sum_i X_i$.
Theorem: In this special case, the Hoeffding inequality states that for any $\epsilon > 0$
$P(|p - \hat{p}| > \epsilon) \leq 2 e^{-2\epsilon^2 n}$
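The bound can be inverted to ask how many samples are needed before $\hat{p}$ is reliable. A small sketch in plain Python (the helper names are illustrative):

```python
import math

def hoeffding_bound(n, eps):
    """Upper bound on P(|p - p_hat| > eps) after n Bernoulli samples."""
    return 2 * math.exp(-2 * eps**2 * n)

def samples_needed(eps, delta):
    """Smallest n guaranteeing P(|p - p_hat| > eps) <= delta, from 2 exp(-2 eps^2 n) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps**2))

print(hoeffding_bound(n=21, eps=0.1))        # class poll: 21 students, +/- 0.1 accuracy
print(samples_needed(eps=0.05, delta=0.05))  # n needed for 5% error at 95% confidence
```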

22 Class Poll Revisited
(figure: plot of probability versus error, comparing the class poll with the Hoeffding bound)

23 Classification Error
Given a hypothesis $h \in H$ approximating some true target function $f$, we can define its training error $\hat{\epsilon}(h)$ on a dataset $D$ as
$\hat{\epsilon}(h) = \frac{1}{|D|} \sum_{(x,y) \in D} \mathbf{1}(h(x) \neq y)$
The true error of $h$ given any (unknown) distribution $P$ on the entire (unseen) data space $X$ is given as
$\epsilon(h) = P_{(x,y) \sim P}(h(x) \neq y)$
How does training error relate to true error?
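To make the two definitions concrete, a small sketch (numpy assumed; the stump h and target f below are purely illustrative): the training error is an average over the finite sample D, while the true error can only be approximated by drawing a very large fresh sample from P.

```python
import numpy as np

rng = np.random.default_rng(3)

def h(x):
    """An illustrative hypothesis: a threshold (decision stump) on a 1-D input."""
    return np.where(x > 0.5, 1, -1)

def f(x):
    """The (unknown) target function, used here only to generate labels."""
    return np.where(x > 0.6, 1, -1)

def error(hyp, x, y):
    """Fraction of examples on which hyp disagrees with the label."""
    return np.mean(hyp(x) != y)

x_train = rng.uniform(0, 1, size=50)          # small training set D
x_big = rng.uniform(0, 1, size=1_000_000)     # stand-in for the whole distribution P

print("training error:", error(h, x_train, f(x_train)))
print("approx. true error:", error(h, x_big, f(x_big)))   # should be close to 0.1
```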

24 Generalization Error
From the Hoeffding inequality, it follows that
$P(|\epsilon(h) - \hat{\epsilon}(h)| > \epsilon) \leq 2 e^{-2\epsilon^2 n}$
How do we generalize this over all possible hypotheses in our hypothesis space $H$?
For finite hypothesis spaces, we can use the union bound:
$P(A_1 \cup \cdots \cup A_k) \leq P(A_1) + \cdots + P(A_k)$
if $A_1, \ldots, A_k$ are $k$ different events (which may not be independent).

25 Generalization Error
Let us define $A_i$ as the event that $|\epsilon(h_i) - \hat{\epsilon}(h_i)| > \epsilon$.
So, the probability that some hypothesis in our space of hypotheses has unacceptable error is bounded by
$P(\exists h \in H: |\epsilon(h) - \hat{\epsilon}(h)| > \epsilon) \leq \sum_{i=1}^{k} P(A_i) \leq 2k\, e^{-2\epsilon^2 n}$
Here, $|H| = k$ (our space of hypotheses is finite).

26 Number of Examples Needed
If we set the reliability at $1 - \delta$, for some $\delta \in (0,1)$, then equating $\delta = 2k\, e^{-2\epsilon^2 n}$, we can compute the number of examples needed for reliable learning as
$n \geq \frac{1}{2\epsilon^2} \ln \frac{2|H|}{\delta}$
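A small sketch of this sample-size formula in plain Python (the helper name is illustrative); note the logarithmic dependence on |H| and the 1/ε² dependence on the accuracy:

```python
import math

def samples_for_finite_class(h_size, eps, delta):
    """n >= (1 / (2 eps^2)) * ln(2 |H| / delta), from delta = 2 |H| exp(-2 eps^2 n)."""
    return math.ceil(math.log(2 * h_size / delta) / (2 * eps**2))

# Increasing |H| by a factor of 1000 only adds a logarithmic amount to n;
# halving eps quadruples n.
for h_size in [10, 1_000, 1_000_000]:
    print(h_size, samples_for_finite_class(h_size, eps=0.05, delta=0.05))
```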

27 Generalization Error Bound
For fixed reliability $1 - \delta$ and hypothesis space $H$, we can compute the generalization error as
$|\epsilon(h) - \hat{\epsilon}(h)| \leq \sqrt{\frac{1}{2n} \ln \frac{2|H|}{\delta}}$
Note that this only holds for finite hypothesis spaces, where $|H| < \infty$.
For the infinite case, which is more common in machine learning, we need to introduce the concept of growth functions.

28 Finding Consistent Hypotheses
Class             Example                                   Complexity
Conjunctions      A_1 ∧ A_2                                 Polynomial
k-DNF             (A_1 ∧ A_3) ∨ (A_4 ∧ A_2) ∨ ...           Polynomial
k-CNF             (A_1 ∨ A_3) ∧ (A_4 ∨ A_2) ∧ ...           Polynomial
3-layer NN        Feedforward net                           NP-hard
k-Decision List   Simple decision tree                      Polynomial
LTU               Perceptron                                Polynomial
k-term DNF        Term_1 ∨ Term_2                           NP-hard
For polynomial-time learning, we also need to take into account the complexity of finding a low-error (or zero-training-error) hypothesis.

29 PAC Results for Concept Classes
Class         Sample complexity   PAC-time learnable?
Pure conj.    Polynomial          Polynomial
k-CNF         Polynomial          Polynomial
k-term DNF    Polynomial          No, unless NP = RP
Interesting fact: k-term DNF ⊆ k-CNF. So, learning a larger set can be easier than learning a smaller set!

30 Structural Risk Minimization
With probability $1 - \delta$, the true error of a learned classifier $\hat{h}$ can be bounded as
$\epsilon(\hat{h}) \leq \left(\min_{h \in H} \epsilon(h)\right) + 2\sqrt{\frac{1}{2n} \ln \frac{2|H|}{\delta}}$
This follows from the Hoeffding bound, because
$\epsilon(\hat{h}) \leq \hat{\epsilon}(\hat{h}) + \epsilon \leq \hat{\epsilon}(h^*) + \epsilon \leq \epsilon(h^*) + 2\epsilon$
where $h^* = \arg\min_{h \in H} \epsilon(h)$ and $\hat{h}$ minimizes the training error.

31 Structural Risk Minimization
With probability $1 - \delta$, the true error of a learned classifier $\hat{h}$ can be bounded as
$\epsilon(\hat{h}) \leq \left(\min_{h \in H} \epsilon(h)\right) + 2\sqrt{\frac{1}{2n} \ln \frac{2|H|}{\delta}}$
If we switch from a less expressive hypothesis space to a more expressive one, then the bias error due to the first term may decrease. However, the variance error due to the second term increases!

32 Dichotomies
How do we extend the previous analysis to infinite hypothesis spaces (e.g., hyperplanes, polynomial functions, etc.)?
Let us assume a hypothesis $h \in H$ maps each example $x_1, \ldots, x_n \in D$ to the set $\{-1, +1\}$. The dichotomies defined by $H$ on $D$ are
$H(x_1, \ldots, x_n) = \{ (h(x_1), \ldots, h(x_n)) : h \in H \}$
The growth function $m_H(n)$ is defined as the maximum number of dichotomies generated by $H$ on any dataset $X$ of size $n$.

33 Examples of Growth Functions
- Consider a dataset $X$ of $n$ points in the plane $\mathbb{R}^2$ and the hypothesis space $H$ of all lines in $\mathbb{R}^2$. What is $m_H(3)$?
- What is the growth function $m_H(4)$ for the space of hypotheses defined by perceptrons on the Euclidean space $\mathbb{R}^2$?
- Let $H$ be defined as $h : \mathbb{R} \to \{-1,+1\}$ of the form $h(x) = \text{sgn}(x - a)$, where $\text{sgn}(x) = -1$ if $x < 0$ and $+1$ otherwise. What is $m_H(n)$? (A brute-force check appears below.)
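The third example can be checked by brute force: for the 1-D thresholds $h(x) = \text{sgn}(x - a)$, one candidate threshold per gap between the sorted points is enough to enumerate every dichotomy. A sketch assuming numpy:

```python
import numpy as np

def threshold_dichotomies(points):
    """Distinct dichotomies of h_a(x) = +1 if x >= a else -1, over all thresholds a."""
    pts = np.sort(np.asarray(points, dtype=float))
    # One representative threshold below all points, one in each gap, one above all points.
    candidates = np.concatenate(([pts[0] - 1.0], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1.0]))
    return {tuple(np.where(pts >= a, 1, -1)) for a in candidates}

rng = np.random.default_rng(4)
for n in [1, 2, 3, 5, 8]:
    pts = rng.uniform(0, 1, size=n)
    print(f"n = {n}: {len(threshold_dichotomies(pts))} dichotomies  (n + 1 = {n + 1})")
```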

34 Shattering and Breakpoints
A set $X$ with $|X| = n$ is said to be shattered by $H$ if $m_H(n) = 2^n$; that is, the hypothesis space $H$ produces all possible $2^n$ dichotomies on the dataset $X$ of $n$ examples.
If no dataset $X$ of size $n$ can be shattered, $n$ is called a breakpoint for $H$.
Example: What is the breakpoint of the hypothesis space of perceptrons on the plane?

35 Tree-Structured Attributes
(figure: a tree of shape attributes)
any_shape
  convex
    regular_polygon: triangle, hexagon, square
    ellipse: circle, proper_ellipse
  non_convex: crescent, channel
What is the breakpoint of the above hypothesis space H?

36 VC Dimension
The VC dimension (for Vapnik-Chervonenkis) $VC(H)$ of a hypothesis space $H$ is defined as the largest $n$ for which the growth function $m_H(n) = 2^n$; that is, the largest $n$ for which some dataset $X$ of $n$ examples is shattered by $H$.
Equivalently, if $k$ is the smallest breakpoint of $H$ (no dataset of size $k$ can be shattered), then $VC(H) = k - 1$.
Example: What is the VC dimension of the hypothesis space of perceptrons on the plane?
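Whether a particular point set is shattered by perceptrons in the plane can be checked by brute force: for every labeling, test linear separability with a small feasibility LP. A sketch assuming numpy and scipy; note that concluding VC(H) < 4 would require checking every 4-point configuration, not just the one below.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Is there (w, b) with y_i (w . x_i + b) >= 1 for all i?  (Feasibility LP.)"""
    points, labels = np.asarray(points, float), np.asarray(labels, float)
    # Constraints: -y_i * (w . x_i + b) <= -1, variables z = (w1, w2, b), unbounded.
    a_ub = -labels[:, None] * np.hstack([points, np.ones((len(points), 1))])
    res = linprog(c=[0, 0, 0], A_ub=a_ub, b_ub=-np.ones(len(points)),
                  bounds=[(None, None)] * 3)
    return res.status == 0

def shattered(points):
    """True if perceptrons in the plane realize all 2^n labelings of these points."""
    return all(separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=len(points)))

print(shattered([(0, 0), (1, 0), (0, 1)]))           # 3 points in general position: True
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))   # 4 corners of a square: False (XOR labeling)
```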

37 VC Bounds
Finite case: With probability $1 - \delta$, the true error of a learned classifier $\hat{h}$ can be bounded as
$\epsilon(\hat{h}) \leq \left(\min_{h \in H} \epsilon(h)\right) + 2\sqrt{\frac{1}{2n} \ln \frac{2|H|}{\delta}}$
Infinite case:
$\epsilon(\hat{h}) \leq \left(\min_{h \in H} \epsilon(h)\right) + 2\sqrt{\frac{8}{n} \ln \frac{4\, m_H(2n)}{\delta}}$
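As a rough illustration, the infinite-case bound can be evaluated for the 1-D threshold class of slide 33, whose growth function is m_H(n) = n + 1. The sketch below (plain Python) takes the constants exactly as written in the bound above.

```python
import math

def growth_threshold(n):
    """Growth function of 1-D thresholds h(x) = sgn(x - a): n + 1 dichotomies."""
    return n + 1

def vc_bound_term(n, delta, growth):
    """The extra error term 2 * sqrt((8 / n) * ln(4 * m_H(2n) / delta))."""
    return 2 * math.sqrt((8 / n) * math.log(4 * growth(2 * n) / delta))

# The bound is vacuous for small n and shrinks slowly as n grows.
for n in [100, 1_000, 10_000, 100_000]:
    print(n, round(vc_bound_term(n, delta=0.05, growth=growth_threshold), 4))
```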

38 Vapnik-Chervonenkis (VC) Theorem
We can now state the celebrated VC theorem, the most important theoretical result in machine learning!
For any hypothesis space $H$ with VC dimension $VC(H)$, given a classifier $\hat{h}$ found by training on a finite dataset of $n$ examples, its generalization error can be bounded with probability $1 - \delta$ by
$\epsilon(\hat{h}) \leq \epsilon(h^*) + O\left(\sqrt{\frac{1}{n} \ln \frac{1}{\delta} + \frac{VC(H)}{n} \log \frac{n}{VC(H)}}\right)$
where $h^* = \arg\min_{h \in H} \epsilon(h)$.
