Introduction to Machine Learning (67577) Lecture 5


1 Introduction to Machine Learning (67577), Lecture 5. Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem. Non-uniform learning, MDL, SRM, Decision Trees, Nearest Neighbor.

2 Outline
  1. Minimum Description Length
  2. Non-uniform learnability
  3. Structural Risk Minimization
  4. Decision Trees
  5. Nearest Neighbor and Consistency

3 How to Express Prior Knowledge. So far, the learner expresses prior knowledge by specifying the hypothesis class H.

4 Other Ways to Express Prior Knowledge. Occam's razor: "A short explanation is preferred over a longer one" (William of Occam). Also: "Things that look alike must be alike."

6 Outline (next: Minimum Description Length).

7 Bias to Shorter Description. Let H be a countable hypothesis class. Let w : H → [0, 1] be such that ∑_{h ∈ H} w(h) ≤ 1. The weight w(h) reflects prior knowledge on how important the hypothesis h is.

8 Example: Description Length. Suppose that each h ∈ H is described by some word d(h) ∈ {0, 1}* (e.g., H is the class of all Python programs). Suppose that the description language is prefix-free, namely, for every h ≠ h′, d(h) is not a prefix of d(h′) (always achievable by including an end-of-word symbol). Let |h| be the length of d(h), and set w(h) = 2^(−|h|). Kraft's inequality implies that ∑_h w(h) ≤ 1. Proof: define a probability over the words in d(H) as follows: repeatedly toss an unbiased coin until the sequence of outcomes is a member of d(H), and then stop. Since d(H) is prefix-free, this is a valid probability distribution over d(H), and the probability of getting d(h) is exactly 2^(−|h|) = w(h).
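
For intuition, here is a tiny sketch (not from the lecture; the four descriptions below are made up) checking that a prefix-free code satisfies Kraft's inequality, so the weights w(h) = 2^(−|d(h)|) are a legal weighting:

    # hypothetical prefix-free descriptions for four hypotheses
    codes = {"h1": "0", "h2": "10", "h3": "110", "h4": "111"}

    def is_prefix_free(words):
        words = list(words)
        return not any(a != b and b.startswith(a) for a in words for b in words)

    weights = {h: 2.0 ** (-len(d)) for h, d in codes.items()}
    assert is_prefix_free(codes.values())
    assert sum(weights.values()) <= 1.0
    print(weights)   # {'h1': 0.5, 'h2': 0.25, 'h3': 0.125, 'h4': 0.125}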

14 Bias to Shorter Description.
Theorem (Minimum Description Length (MDL) bound): let w : H → [0, 1] be such that ∑_{h ∈ H} w(h) ≤ 1. Then, with probability of at least 1 − δ over S ~ D^m,
∀h ∈ H, L_D(h) ≤ L_S(h) + sqrt( (−log(w(h)) + log(2/δ)) / (2m) ).
Compare to the VC bound:
∀h ∈ H, L_D(h) ≤ L_S(h) + C · sqrt( (VCdim(H) + log(2/δ)) / (2m) ).
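
To get a feel for the numbers, the MDL gap can be evaluated directly; a minimal sketch (the 1000-bit description length, m and δ below are made-up values, and −log(w(h)) = |h| log 2 as noted a couple of slides later):

    import math

    def mdl_gap(bits, m, delta):
        # -log w(h) = |h| * log 2 when w(h) = 2**(-|h|)
        return math.sqrt((bits * math.log(2) + math.log(2 / delta)) / (2 * m))

    print(mdl_gap(bits=1000, m=100_000, delta=0.01))   # about 0.06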

16 Proof. For every h, define δ_h = w(h) · δ. By Hoeffding's bound, for every h,
D^m({S : L_D(h) > L_S(h) + sqrt( log(2/δ_h) / (2m) )}) ≤ δ_h.
Applying the union bound,
D^m({S : ∃h ∈ H, L_D(h) > L_S(h) + sqrt( log(2/δ_h) / (2m) )}) ≤ ∑_{h ∈ H} D^m({S : L_D(h) > L_S(h) + sqrt( log(2/δ_h) / (2m) )}) ≤ ∑_{h ∈ H} δ_h ≤ δ.
Since log(2/δ_h) = −log(w(h)) + log(2/δ), this is exactly the bound stated in the theorem.

19 Bound Minimization.
MDL bound: ∀h ∈ H, L_D(h) ≤ L_S(h) + sqrt( (−log(w(h)) + log(2/δ)) / (2m) )
VC bound: ∀h ∈ H, L_D(h) ≤ L_S(h) + C · sqrt( (VCdim(H) + log(2/δ)) / (2m) )
Recall that our goal is to minimize L_D(h) over h ∈ H.
Minimizing the VC bound leads to the ERM rule.
Minimizing the MDL bound leads to the MDL rule:
MDL(S) ∈ argmin_{h ∈ H} [ L_S(h) + sqrt( (−log(w(h)) + log(2/δ)) / (2m) ) ]
When w(h) = 2^(−|h|) we obtain −log(w(h)) = |h| log(2).
Explicit tradeoff between bias (small L_S(h)) and complexity (small |h|).
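
As a concrete illustration (a sketch, not the lecture's code; the candidate format below, pairs of a predictor and its description length in bits, is an assumption), the MDL rule over a finite pool of candidates could look like:

    import math

    def mdl_select(candidates, S, delta):
        """Pick the candidate minimizing L_S(h) + sqrt((|h| log 2 + log(2/delta)) / (2m)).

        candidates: list of (predict_fn, description_length_in_bits) pairs (hypothetical format).
        S: list of (x, y) labeled examples.
        """
        m = len(S)

        def empirical_error(predict):
            return sum(1 for x, y in S if predict(x) != y) / m

        def objective(predict, bits):
            return empirical_error(predict) + math.sqrt(
                (bits * math.log(2) + math.log(2 / delta)) / (2 * m))

        return min(candidates, key=lambda c: objective(*c))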

25 MDL guarantee.
Theorem: for every h ∈ H, w.p. ≥ 1 − δ over S ~ D^m we have
L_D(MDL(S)) ≤ L_D(h) + sqrt( (−log(w(h)) + log(2/δ)) / (2m) ).
Example: take H to be the class of all Python programs, with |h| the code length (in bits).
Assume there is h* ∈ H with L_D(h*) = 0. Then, for every ε, δ, there exists a sample size m s.t. D^m({S : L_D(MDL(S)) ≤ ε}) ≥ 1 − δ.
MDL is a Universal Learner.
29 (Aside) Viewing d(h) as the shortest program that computes h links the MDL rule to Kolmogorov complexity (Andrei Kolmogorov, Ray Solomonoff).

30 Contradiction to the fundamental theorem of learning? Take again H to be all Python programs. Note that VCdim(H) = ∞. By the No-Free-Lunch theorem, we can't (PAC) learn H. So how come we can learn H using MDL?

31 Outline (next: Non-uniform learnability).

32 Non-uniform learning.
Definition (Non-uniformly learnable): H is non-uniformly learnable if there exist a learner A and m^NUL_H : (0, 1)² × H → N s.t. for every ε, δ ∈ (0, 1) and every h ∈ H, if m ≥ m^NUL_H(ε, δ, h) then for every D, D^m({S : L_D(A(S)) ≤ L_D(h) + ε}) ≥ 1 − δ.
The number of required examples depends on ε, δ, and h.
Definition (Agnostic PAC learnable): H is agnostically PAC learnable if there exist A and m_H : (0, 1)² → N s.t. for every ε, δ ∈ (0, 1), if m ≥ m_H(ε, δ), then for every D and every h ∈ H, D^m({S : L_D(A(S)) ≤ L_D(h) + ε}) ≥ 1 − δ.
The number of required examples depends only on ε and δ.

33 Non-uniform learning vs. PAC learning.
Corollary: let H be the class of all computable functions. Then H is non-uniformly learnable, with sample complexity
m^NUL_H(ε, δ, h) ≤ (−log(w(h)) + log(2/δ)) / (2ε²),
but H is not PAC learnable.
We saw that the VC dimension characterizes PAC learnability. What characterizes non-uniform learnability?

35 Characterizing Non-uniform Learnability. Theorem: a class H ⊆ {0, 1}^X is non-uniformly learnable if and only if it is a countable union of PAC learnable hypothesis classes.

36 Proof (non-uniformly learnable ⇒ countable union).
Assume that H is non-uniformly learnable using A with sample complexity m^NUL_H.
For every n ∈ N, let H_n = {h ∈ H : m^NUL_H(1/8, 1/7, h) ≤ n}.
Clearly, H = ∪_{n ∈ N} H_n.
For every D s.t. ∃h ∈ H_n with L_D(h) = 0, we have that D^n({S : L_D(A(S)) ≤ 1/8}) ≥ 6/7.
The fundamental theorem of statistical learning implies that VCdim(H_n) < ∞, and therefore H_n is agnostic PAC learnable.

41 Proof (countable union ⇒ non-uniformly learnable).
Assume H = ∪_{n ∈ N} H_n, with VCdim(H_n) = d_n < ∞.
Choose w : N → [0, 1] s.t. ∑_n w(n) ≤ 1, e.g. w(n) = 6 / (π² n²).
Choose δ_n = δ · w(n) and ε_n = C · sqrt( (d_n + log(1/δ_n)) / m ).
By the fundamental theorem, for every n,
D^m({S : ∃h ∈ H_n, L_D(h) > L_S(h) + ε_n}) ≤ δ_n.
Applying the union bound over n, we obtain
D^m({S : ∃n, ∃h ∈ H_n, L_D(h) > L_S(h) + ε_n}) ≤ ∑_n δ_n ≤ δ.
This yields a generic non-uniform learning rule.

47 Outline (next: Structural Risk Minimization).

48 Structural Risk Minimization (SRM).
SRM(S) ∈ argmin_{h ∈ H} [ L_S(h) + min_{n : h ∈ H_n} C · sqrt( (d_n − log(w(n)) + log(1/δ)) / m ) ]
As in the analysis of MDL, it is easy to show that for every h ∈ H,
L_D(SRM(S)) ≤ L_S(h) + min_{n : h ∈ H_n} C · sqrt( (d_n − log(w(n)) + log(1/δ)) / m ).
Hence, SRM is a generic non-uniform learner with sample complexity
m^NUL_H(ε, δ, h) ≤ min_{n : h ∈ H_n} C · (d_n − log(w(n)) + log(1/δ)) / ε².
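
A minimal SRM sketch, under stated assumptions: the nested classes are given as a list of ERM solvers with known VC dimensions, the interface erm(S) -> (h, train_error) is hypothetical, and w(n) = 6/(π² n²) as in the proof above:

    import math

    def srm_select(erm_solvers, vc_dims, S, delta, C=1.0):
        """Run ERM in each class H_n and pick the hypothesis minimizing
        L_S(h) + C * sqrt((d_n - log w(n) + log(1/delta)) / m), with w(n) = 6/(pi^2 n^2)."""
        m = len(S)
        best_score, best_h = None, None
        for n, (erm, d_n) in enumerate(zip(erm_solvers, vc_dims), start=1):
            h, train_err = erm(S)                       # hypothetical ERM interface
            w_n = 6.0 / (math.pi ** 2 * n ** 2)
            penalty = C * math.sqrt((d_n - math.log(w_n) + math.log(1.0 / delta)) / m)
            if best_score is None or train_err + penalty < best_score:
                best_score, best_h = train_err + penalty, h
        return best_h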

51 No-free-lunch for non-uniform learnability. Claim: for any infinite domain set X, the class H = {0, 1}^X is not a countable union of classes of finite VC dimension. Hence, such a class H is not non-uniformly learnable.

52 The cost of weaker prior knowledge.
Suppose H = ∪_n H_n, where VCdim(H_n) = n.
Suppose that some h* ∈ H_n has L_D(h*) = 0.
With this prior knowledge, we can apply ERM on H_n, and the sample complexity is C · (n + log(1/δ)) / ε².
Without this prior knowledge, SRM will need C · (n + log(π² n² / 6) + log(1/δ)) / ε² examples.
That is, we pay an order of log(n)/ε² more examples for not knowing n in advance.
SRM for model selection (figure).

54 Outline (next: Decision Trees).

55 Decision Trees. Example of a decision tree:
Color?
  other: not-tasty
  pale green to pale yellow: Softness?
    other: not-tasty
    gives slightly to palm pressure: tasty

56 VC dimension of Decision Trees. Claim: consider the class of decision trees over X with k leaves; then the VC dimension of this class is k. Proof: a set of k instances that arrive at the k different leaves can be shattered. A set of k + 1 instances can't be shattered, since at least 2 instances must arrive at the same leaf.

57 Description Language for Decision Trees.
Suppose X = {0, 1}^d and splitting rules are of the form 1[x_i = 1] for some i ∈ [d]. Consider the class of all such decision trees over X.
Claim: this class contains {0, 1}^X and hence its VC dimension is |X| = 2^d.
But we can bias toward small trees. A tree with n nodes can be described as n + 1 blocks, each of size log₂(d + 3) bits, indicating (in depth-first order) one of:
  an internal node of the form 1[x_i = 1] for some i ∈ [d],
  a leaf whose value is 1,
  a leaf whose value is 0,
  end of the code.
We can then apply the MDL learning rule: search for a tree with n nodes that minimizes
L_S(h) + sqrt( ((n + 1) log₂(d + 3) + log(2/δ)) / (2m) ).
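
A small sketch of this objective (assuming a tree is summarized by its node count n; the numbers in the example call are made up):

    import math

    def tree_description_bits(n_nodes, d):
        # n + 1 blocks, each naming one of: d possible splits, leaf 1, leaf 0, end of code
        return (n_nodes + 1) * math.log2(d + 3)

    def tree_mdl_objective(train_error, n_nodes, d, m, delta):
        bits = tree_description_bits(n_nodes, d)
        return train_error + math.sqrt((bits + math.log(2 / delta)) / (2 * m))

    # a 7-node tree over d = 20 boolean features, m = 1000 examples, 5% train error
    print(tree_mdl_objective(0.05, n_nodes=7, d=20, m=1000, delta=0.05))   # about 0.19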

67 Decision Tree Algorithms. Minimizing the above objective exactly is an NP-hard problem, so we use a greedy approach: Iterative Dichotomizer 3 (ID3). Following the MDL principle, it attempts to create a small tree with low train error. Proposed by Ross Quinlan.

68 ID3(S, A)
Input: training set S, feature subset A ⊆ [d].
  if all examples in S are labeled 1, return the leaf 1
  if all examples in S are labeled 0, return the leaf 0
  if A = ∅, return a leaf whose value is the majority of labels in S
  else:
    let j = argmax_{i ∈ A} Gain(S, i)
    if all examples in S have the same label, return a leaf whose value is the majority of labels in S
    else:
      let T_1 be the tree returned by ID3({(x, y) ∈ S : x_j = 1}, A \ {j})
      let T_2 be the tree returned by ID3({(x, y) ∈ S : x_j = 0}, A \ {j})
      return the tree rooted at the question "x_j = 1?", with T_1 as the "yes" subtree and T_2 as the "no" subtree
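
A compact, runnable sketch of this procedure (not the lecture's code; features are assumed binary and stored in dicts, the gain function is passed in, and the guard against empty splits is an addition for robustness):

    def id3(S, A, gain):
        """S: list of (x, y) with x a dict of {0,1}-valued features and y in {0,1}.
        A: set of still-usable feature names. gain: callable gain(S, i)."""
        labels = [y for _, y in S]
        majority = int(2 * sum(labels) >= len(labels))
        if all(y == 1 for y in labels):
            return 1                              # leaf 1
        if all(y == 0 for y in labels):
            return 0                              # leaf 0
        if not A or len(set(labels)) == 1:
            return majority                       # leaf: majority label
        j = max(A, key=lambda i: gain(S, i))      # best split according to Gain
        S1 = [(x, y) for x, y in S if x[j] == 1]
        S0 = [(x, y) for x, y in S if x[j] == 0]
        if not S1 or not S0:                      # degenerate split (added guard)
            return majority
        return (j, id3(S1, A - {j}, gain), id3(S0, A - {j}, gain))

    def predict(tree, x):
        while isinstance(tree, tuple):            # internal node: (feature, yes-subtree, no-subtree)
            j, yes, no = tree
            tree = yes if x[j] == 1 else no
        return tree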

69 Gain Measures.
Gain(S, i) = C(P_S[y = 1]) − ( P_S[x_i = 1] · C(P_S[y = 1 | x_i = 1]) + P_S[x_i = 0] · C(P_S[y = 1 | x_i = 0]) ),
where C is one of:
  Train error: C(a) = min{a, 1 − a}
  Information gain: C(a) = −a log(a) − (1 − a) log(1 − a)
  Gini index: C(a) = 2a(1 − a)
(The slide plots C(a) against a for the three measures.)
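
The three measures and the induced gain, in a short sketch that plugs into the id3 sketch above (empirical probabilities are plain frequencies; the function names are mine):

    import math

    def c_error(a):
        return min(a, 1 - a)

    def c_entropy(a):
        return 0.0 if a in (0.0, 1.0) else -a * math.log(a) - (1 - a) * math.log(1 - a)

    def c_gini(a):
        return 2 * a * (1 - a)

    def make_gain(C):
        def gain(S, i):
            m = len(S)
            p_y = sum(y for _, y in S) / m                      # P_S[y = 1]
            S1 = [y for x, y in S if x[i] == 1]
            S0 = [y for x, y in S if x[i] == 0]
            p_x = len(S1) / m                                   # P_S[x_i = 1]
            p1 = sum(S1) / len(S1) if S1 else 0.0               # P_S[y = 1 | x_i = 1]
            p0 = sum(S0) / len(S0) if S0 else 0.0               # P_S[y = 1 | x_i = 0]
            return C(p_y) - (p_x * C(p1) + (1 - p_x) * C(p0))
        return gain

    # e.g. tree = id3(S, set(S[0][0].keys()), make_gain(c_entropy))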

70 Pruning, Random Forests, ... In the exercise you'll learn about additional practical variants: pruning the tree, Random Forests, and dealing with real-valued features.

71 Outline (next: Nearest Neighbor and Consistency).

72 Nearest Neighbor. "Things that look alike must be alike." Memorize the training set S = (x_1, y_1), ..., (x_m, y_m). Given a new x, find the k closest points in S and return the majority vote among their labels.

73 1-Nearest Neighbor: Voronoi Tessellation (figure).

74 1-Nearest Neighbor: Voronoi Tessellation. Unlike ERM, SRM, MDL, etc., there is no H. At training time: do nothing. At test time: search S for the nearest neighbors.
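
A minimal k-NN predictor matching this description (a sketch, not the lecture's code; plain Euclidean distance, points as tuples of floats):

    import math
    from collections import Counter

    def knn_predict(S, x, k=1):
        """Return the majority label among the k points of S closest to x."""
        dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        neighbors = sorted(S, key=lambda xy: dist(xy[0], x))[:k]
        return Counter(y for _, y in neighbors).most_common(1)[0][0]

    # training time: do nothing (just keep S); test time: search S
    S = [((0.0, 0.0), 0), ((1.0, 1.0), 1), ((0.9, 1.2), 1)]
    print(knn_predict(S, (1.0, 0.9), k=3))   # 1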

75 Analysis of k-nn X = [0, 1] d, Y = {0, 1}, D is a distribution over X Y, D X is the marginal distribution over X, and η : R d R is the conditional probability over the labels, that is, η(x) = P[y = 1 x]. Shai Shalev-Shwartz (Hebrew U) IML Lecture 5 MDL,SRM,trees,neighbors 34 / 39

76 Analysis of k-nn X = [0, 1] d, Y = {0, 1}, D is a distribution over X Y, D X is the marginal distribution over X, and η : R d R is the conditional probability over the labels, that is, η(x) = P[y = 1 x]. Recall: the Bayes optimal rule (that is, the hypothesis that minimizes L D (h) over all functions) is h (x) = 1 [η(x)>1/2]. Shai Shalev-Shwartz (Hebrew U) IML Lecture 5 MDL,SRM,trees,neighbors 34 / 39

77 Analysis of k-nn X = [0, 1] d, Y = {0, 1}, D is a distribution over X Y, D X is the marginal distribution over X, and η : R d R is the conditional probability over the labels, that is, η(x) = P[y = 1 x]. Recall: the Bayes optimal rule (that is, the hypothesis that minimizes L D (h) over all functions) is h (x) = 1 [η(x)>1/2]. Prior knowledge: η is c-lipschitz. Namely, for all x, x X, η(x) η(x ) c x x Shai Shalev-Shwartz (Hebrew U) IML Lecture 5 MDL,SRM,trees,neighbors 34 / 39

78 Analysis of k-nn X = [0, 1] d, Y = {0, 1}, D is a distribution over X Y, D X is the marginal distribution over X, and η : R d R is the conditional probability over the labels, that is, η(x) = P[y = 1 x]. Recall: the Bayes optimal rule (that is, the hypothesis that minimizes L D (h) over all functions) is h (x) = 1 [η(x)>1/2]. Prior knowledge: η is c-lipschitz. Namely, for all x, x X, η(x) η(x ) c x x Theorem: Let h S be the k-nn rule, then, ( ) 8 ( E S D m[l D(h S )] 1 + L D (h ) + 6 c ) d + k m 1/(d+1). k Shai Shalev-Shwartz (Hebrew U) IML Lecture 5 MDL,SRM,trees,neighbors 34 / 39

79 k-nearest Neighbor: Bias-Complexity Tradeoff k-nearest Neighbor: Data Fit / Complexity Tradeoff h*= S= k=1 k=50 Shai Shalev-Shwartz (Hebrew U) k=5 k=100 IML Lecture 5 k=12 k=200 MDL,SRM,trees,neighbors 35 / 39

80 Curse of Dimensionality.
E_{S~D^m}[L_D(h_S)] ≤ (1 + sqrt(8/k)) · L_D(h*) + (6 c sqrt(d) + k) · m^(−1/(d+1)).
Suppose L_D(h*) = 0. Then, to have error ≤ ε we need m ≥ (4 c sqrt(d) / ε)^(d+1).
The number of examples grows exponentially with the dimension, and this is not an artifact of the analysis:
Theorem: for any c > 1 and every learner, there exists a distribution over [0, 1]^d × {0, 1} such that η(x) is c-Lipschitz, the Bayes error of the distribution is 0, but for sample sizes m ≤ (c + 1)^d / 2, the true error of the learner is greater than 1/4.
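
To see how fast this grows, a tiny calculation of the required sample size (using illustrative values c = 1 and ε = 0.1):

    import math

    def examples_needed(d, c=1.0, eps=0.1):
        # m >= (4 c sqrt(d) / eps)^(d+1), from the realizable-case bound above
        return (4 * c * math.sqrt(d) / eps) ** (d + 1)

    for d in (1, 2, 5, 10):
        print(d, f"{examples_needed(d):.2e}")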

82 Contradicting the No-Free-Lunch?
E_{S~D^m}[L_D(h_S)] ≤ (1 + sqrt(8/k)) · L_D(h*) + (6 c sqrt(d) + k) · m^(−1/(d+1)).
Seemingly, we learn the class of all functions over [0, 1]^d. But this class is not learnable even in the non-uniform model...
There is no contradiction: the number of required examples depends on the Lipschitzness of η (the parameter c), which depends on D.
PAC: m(ε, δ). Non-uniform: m(ε, δ, h). Consistency: m(ε, δ, h, D).

84 Issues with Nearest Neighbor. Need to store the entire training set ("replace intelligence with fast memory"). Curse of dimensionality (we'll later learn dimensionality reduction methods). The computational problem of finding the nearest neighbor. What is the correct metric between objects? Success depends on the Lipschitzness of η, which depends on the right metric.

85 Summary. Expressing prior knowledge: hypothesis class, weighting hypotheses, metric. Weaker notions of learnability: PAC is stronger than non-uniform, which is stronger than consistency. Learning rules: ERM, MDL, SRM. Decision trees. Nearest Neighbor.


More information

Chapter 3: Decision Tree Learning

Chapter 3: Decision Tree Learning Chapter 3: Decision Tree Learning CS 536: Machine Learning Littman (Wu, TA) Administration Books? New web page: http://www.cs.rutgers.edu/~mlittman/courses/ml03/ schedule lecture notes assignment info.

More information

Outline. Training Examples for EnjoySport. 2 lecture slides for textbook Machine Learning, c Tom M. Mitchell, McGraw Hill, 1997

Outline. Training Examples for EnjoySport. 2 lecture slides for textbook Machine Learning, c Tom M. Mitchell, McGraw Hill, 1997 Outline Training Examples for EnjoySport Learning from examples General-to-specific ordering over hypotheses [read Chapter 2] [suggested exercises 2.2, 2.3, 2.4, 2.6] Version spaces and candidate elimination

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning Lecture 3: Decision Trees p. Decision

More information

1 Review of The Learning Setting

1 Review of The Learning Setting COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we

More information

Tufts COMP 135: Introduction to Machine Learning

Tufts COMP 135: Introduction to Machine Learning Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Logistic Regression Many slides attributable to: Prof. Mike Hughes Erik Sudderth (UCI) Finale Doshi-Velez (Harvard)

More information

Domain Adaptation Can Quantity Compensate for Quality?

Domain Adaptation Can Quantity Compensate for Quality? Domain Adaptation Can Quantity Compensate for Quality? hai Ben-David David R. Cheriton chool of Computer cience University of Waterloo Waterloo, ON N2L 3G1 CANADA shai@cs.uwaterloo.ca hai halev-hwartz

More information

Decision Tree Learning Mitchell, Chapter 3. CptS 570 Machine Learning School of EECS Washington State University

Decision Tree Learning Mitchell, Chapter 3. CptS 570 Machine Learning School of EECS Washington State University Decision Tree Learning Mitchell, Chapter 3 CptS 570 Machine Learning School of EECS Washington State University Outline Decision tree representation ID3 learning algorithm Entropy and information gain

More information

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Discriminative v. generative

Discriminative v. generative Discriminative v. generative Naive Bayes 2 Naive Bayes P (x ij,y i )= Y i P (y i ) Y j P (x ij y i ) P (y i =+)=p MLE: max P (x ij,y i ) a j,b j,p p = 1 N P [yi =+] P (x ij =1 y i = ) = a j P (x ij =1

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

Computational learning theory. PAC learning. VC dimension.

Computational learning theory. PAC learning. VC dimension. Computational learning theory. PAC learning. VC dimension. Petr Pošík Czech Technical University in Prague Faculty of Electrical Engineering Dept. of Cybernetics COLT 2 Concept...........................................................................................................

More information

Decision Trees.

Decision Trees. . Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information