Introduction to Machine Learning (67577) Lecture 5


1 Introduction to Machine Learning (67577), Lecture 5. Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem. Non-uniform learning, MDL, SRM, Decision Trees, Nearest Neighbor.

2 Outline
  1. Minimum Description Length
  2. Non-uniform learnability
  3. Structural Risk Minimization
  4. Decision Trees
  5. Nearest Neighbor and Consistency

3 How to Express Prior Knowledge. So far, the learner expresses prior knowledge by specifying the hypothesis class H.

4 Other Ways to Express Prior Knowledge. Occam's razor: "A short explanation is preferred over a longer one" (William of Occam). Also: "Things that look alike must be alike."

6 Outline (next: Minimum Description Length).

7 Bias to Shorter Description. Let H be a countable hypothesis class. Let w : H → [0, 1] be such that ∑_{h ∈ H} w(h) ≤ 1. The weight w(h) reflects prior knowledge on how important the hypothesis h is.

8 Example: Description Length. Suppose that each h ∈ H is described by some word d(h) ∈ {0, 1}* (e.g., H is the class of all Python programs). Suppose that the description language is prefix-free, namely, for every h ≠ h′, d(h) is not a prefix of d(h′) (always achievable by including an end-of-word symbol). Let |h| be the length of d(h), and set w(h) = 2^(−|h|). Kraft's inequality implies that ∑_h w(h) ≤ 1. Proof: define a probability over the words in d(H) as follows: repeatedly toss an unbiased coin until the sequence of outcomes is a member of d(H), and then stop. Since d(H) is prefix-free, this is a valid probability distribution over d(H), and the probability of getting d(h) is exactly 2^(−|h|) = w(h).
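
For intuition, here is a tiny sketch (not from the lecture; the four descriptions below are made up) checking that a prefix-free code satisfies Kraft's inequality, so the weights w(h) = 2^(−|d(h)|) are a legal weighting:

    # hypothetical prefix-free descriptions for four hypotheses
    codes = {"h1": "0", "h2": "10", "h3": "110", "h4": "111"}

    def is_prefix_free(words):
        words = list(words)
        return not any(a != b and b.startswith(a) for a in words for b in words)

    weights = {h: 2.0 ** (-len(d)) for h, d in codes.items()}
    assert is_prefix_free(codes.values())
    assert sum(weights.values()) <= 1.0
    print(weights)   # {'h1': 0.5, 'h2': 0.25, 'h3': 0.125, 'h4': 0.125}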

14 Bias to Shorter Description.
Theorem (Minimum Description Length (MDL) bound): let w : H → [0, 1] be such that ∑_{h ∈ H} w(h) ≤ 1. Then, with probability of at least 1 − δ over S ~ D^m,
∀h ∈ H, L_D(h) ≤ L_S(h) + sqrt( (−log(w(h)) + log(2/δ)) / (2m) ).
Compare to the VC bound:
∀h ∈ H, L_D(h) ≤ L_S(h) + C · sqrt( (VCdim(H) + log(2/δ)) / (2m) ).
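
To get a feel for the numbers, the MDL gap can be evaluated directly; a minimal sketch (the 1000-bit description length, m and δ below are made-up values, and −log(w(h)) = |h| log 2 as noted a couple of slides later):

    import math

    def mdl_gap(bits, m, delta):
        # -log w(h) = |h| * log 2 when w(h) = 2**(-|h|)
        return math.sqrt((bits * math.log(2) + math.log(2 / delta)) / (2 * m))

    print(mdl_gap(bits=1000, m=100_000, delta=0.01))   # about 0.06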

16 Proof. For every h, define δ_h = w(h) · δ. By Hoeffding's bound, for every h,
D^m({S : L_D(h) > L_S(h) + sqrt( log(2/δ_h) / (2m) )}) ≤ δ_h.
Applying the union bound,
D^m({S : ∃h ∈ H, L_D(h) > L_S(h) + sqrt( log(2/δ_h) / (2m) )}) ≤ ∑_{h ∈ H} D^m({S : L_D(h) > L_S(h) + sqrt( log(2/δ_h) / (2m) )}) ≤ ∑_{h ∈ H} δ_h ≤ δ.
Since log(2/δ_h) = −log(w(h)) + log(2/δ), this is exactly the bound stated in the theorem.

19 Bound Minimization.
MDL bound: ∀h ∈ H, L_D(h) ≤ L_S(h) + sqrt( (−log(w(h)) + log(2/δ)) / (2m) )
VC bound: ∀h ∈ H, L_D(h) ≤ L_S(h) + C · sqrt( (VCdim(H) + log(2/δ)) / (2m) )
Recall that our goal is to minimize L_D(h) over h ∈ H.
Minimizing the VC bound leads to the ERM rule.
Minimizing the MDL bound leads to the MDL rule:
MDL(S) ∈ argmin_{h ∈ H} [ L_S(h) + sqrt( (−log(w(h)) + log(2/δ)) / (2m) ) ]
When w(h) = 2^(−|h|) we obtain −log(w(h)) = |h| log(2).
Explicit tradeoff between bias (small L_S(h)) and complexity (small |h|).
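
As a concrete illustration (a sketch, not the lecture's code; the candidate format below, pairs of a predictor and its description length in bits, is an assumption), the MDL rule over a finite pool of candidates could look like:

    import math

    def mdl_select(candidates, S, delta):
        """Pick the candidate minimizing L_S(h) + sqrt((|h| log 2 + log(2/delta)) / (2m)).

        candidates: list of (predict_fn, description_length_in_bits) pairs (hypothetical format).
        S: list of (x, y) labeled examples.
        """
        m = len(S)

        def empirical_error(predict):
            return sum(1 for x, y in S if predict(x) != y) / m

        def objective(predict, bits):
            return empirical_error(predict) + math.sqrt(
                (bits * math.log(2) + math.log(2 / delta)) / (2 * m))

        return min(candidates, key=lambda c: objective(*c))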

25 MDL guarantee.
Theorem: for every h ∈ H, w.p. ≥ 1 − δ over S ~ D^m we have
L_D(MDL(S)) ≤ L_D(h) + sqrt( (−log(w(h)) + log(2/δ)) / (2m) ).
Example: take H to be the class of all Python programs, with |h| the code length (in bits).
Assume there is h* ∈ H with L_D(h*) = 0. Then, for every ε, δ, there exists a sample size m s.t. D^m({S : L_D(MDL(S)) ≤ ε}) ≥ 1 − δ.
MDL is a Universal Learner.
29 (Aside) Viewing d(h) as the shortest program that computes h links the MDL rule to Kolmogorov complexity (Andrei Kolmogorov, Ray Solomonoff).

30 Contradiction to the fundamental theorem of learning? Take again H to be all Python programs. Note that VCdim(H) = ∞. By the No-Free-Lunch theorem, we can't (PAC) learn H. So how come we can learn H using MDL?

31 Outline (next: Non-uniform learnability).

32 Non-uniform learning.
Definition (Non-uniformly learnable): H is non-uniformly learnable if there exist a learner A and m^NUL_H : (0, 1)² × H → N s.t. for every ε, δ ∈ (0, 1) and every h ∈ H, if m ≥ m^NUL_H(ε, δ, h) then for every D, D^m({S : L_D(A(S)) ≤ L_D(h) + ε}) ≥ 1 − δ.
The number of required examples depends on ε, δ, and h.
Definition (Agnostic PAC learnable): H is agnostically PAC learnable if there exist A and m_H : (0, 1)² → N s.t. for every ε, δ ∈ (0, 1), if m ≥ m_H(ε, δ), then for every D and every h ∈ H, D^m({S : L_D(A(S)) ≤ L_D(h) + ε}) ≥ 1 − δ.
The number of required examples depends only on ε and δ.

33 Non-uniform learning vs. PAC learning.
Corollary: let H be the class of all computable functions. Then H is non-uniformly learnable, with sample complexity
m^NUL_H(ε, δ, h) ≤ (−log(w(h)) + log(2/δ)) / (2ε²),
but H is not PAC learnable.
We saw that the VC dimension characterizes PAC learnability. What characterizes non-uniform learnability?

35 Characterizing Non-uniform Learnability. Theorem: a class H ⊆ {0, 1}^X is non-uniformly learnable if and only if it is a countable union of PAC learnable hypothesis classes.

36 Proof (non-uniformly learnable ⇒ countable union).
Assume that H is non-uniformly learnable using A with sample complexity m^NUL_H.
For every n ∈ N, let H_n = {h ∈ H : m^NUL_H(1/8, 1/7, h) ≤ n}.
Clearly, H = ∪_{n ∈ N} H_n.
For every D s.t. ∃h ∈ H_n with L_D(h) = 0, we have that D^n({S : L_D(A(S)) ≤ 1/8}) ≥ 6/7.
The fundamental theorem of statistical learning implies that VCdim(H_n) < ∞, and therefore H_n is agnostic PAC learnable.

41 Proof (countable union ⇒ non-uniformly learnable).
Assume H = ∪_{n ∈ N} H_n, with VCdim(H_n) = d_n < ∞.
Choose w : N → [0, 1] s.t. ∑_n w(n) ≤ 1, e.g. w(n) = 6 / (π² n²).
Choose δ_n = δ · w(n) and ε_n = C · sqrt( (d_n + log(1/δ_n)) / m ).
By the fundamental theorem, for every n,
D^m({S : ∃h ∈ H_n, L_D(h) > L_S(h) + ε_n}) ≤ δ_n.
Applying the union bound over n, we obtain
D^m({S : ∃n, ∃h ∈ H_n, L_D(h) > L_S(h) + ε_n}) ≤ ∑_n δ_n ≤ δ.
This yields a generic non-uniform learning rule.

47 Outline (next: Structural Risk Minimization).

48 Structural Risk Minimization (SRM).
SRM(S) ∈ argmin_{h ∈ H} [ L_S(h) + min_{n : h ∈ H_n} C · sqrt( (d_n − log(w(n)) + log(1/δ)) / m ) ]
As in the analysis of MDL, it is easy to show that for every h ∈ H,
L_D(SRM(S)) ≤ L_S(h) + min_{n : h ∈ H_n} C · sqrt( (d_n − log(w(n)) + log(1/δ)) / m ).
Hence, SRM is a generic non-uniform learner with sample complexity
m^NUL_H(ε, δ, h) ≤ min_{n : h ∈ H_n} C · (d_n − log(w(n)) + log(1/δ)) / ε².
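
A minimal SRM sketch, under stated assumptions: the nested classes are given as a list of ERM solvers with known VC dimensions, the interface erm(S) -> (h, train_error) is hypothetical, and w(n) = 6/(π² n²) as in the proof above:

    import math

    def srm_select(erm_solvers, vc_dims, S, delta, C=1.0):
        """Run ERM in each class H_n and pick the hypothesis minimizing
        L_S(h) + C * sqrt((d_n - log w(n) + log(1/delta)) / m), with w(n) = 6/(pi^2 n^2)."""
        m = len(S)
        best_score, best_h = None, None
        for n, (erm, d_n) in enumerate(zip(erm_solvers, vc_dims), start=1):
            h, train_err = erm(S)                       # hypothetical ERM interface
            w_n = 6.0 / (math.pi ** 2 * n ** 2)
            penalty = C * math.sqrt((d_n - math.log(w_n) + math.log(1.0 / delta)) / m)
            if best_score is None or train_err + penalty < best_score:
                best_score, best_h = train_err + penalty, h
        return best_h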

51 No-free-lunch for non-uniform learnability. Claim: for any infinite domain set X, the class H = {0, 1}^X is not a countable union of classes of finite VC dimension. Hence, such a class H is not non-uniformly learnable.

52 The cost of weaker prior knowledge.
Suppose H = ∪_n H_n, where VCdim(H_n) = n.
Suppose that some h* ∈ H_n has L_D(h*) = 0.
With this prior knowledge, we can apply ERM on H_n, and the sample complexity is C · (n + log(1/δ)) / ε².
Without this prior knowledge, SRM will need C · (n + log(π² n² / 6) + log(1/δ)) / ε² examples.
That is, we pay an order of log(n)/ε² more examples for not knowing n in advance.
SRM for model selection (figure).

54 Outline (next: Decision Trees).

55 Decision Trees. Example of a decision tree:
Color?
  other: not-tasty
  pale green to pale yellow: Softness?
    other: not-tasty
    gives slightly to palm pressure: tasty

56 VC dimension of Decision Trees. Claim: consider the class of decision trees over X with k leaves; then the VC dimension of this class is k. Proof: a set of k instances that arrive at the k different leaves can be shattered. A set of k + 1 instances can't be shattered, since at least 2 instances must arrive at the same leaf.

57 Description Language for Decision Trees.
Suppose X = {0, 1}^d and splitting rules are of the form 1[x_i = 1] for some i ∈ [d]. Consider the class of all such decision trees over X.
Claim: this class contains {0, 1}^X and hence its VC dimension is |X| = 2^d.
But we can bias toward small trees. A tree with n nodes can be described as n + 1 blocks, each of size log₂(d + 3) bits, indicating (in depth-first order) one of:
  an internal node of the form 1[x_i = 1] for some i ∈ [d],
  a leaf whose value is 1,
  a leaf whose value is 0,
  end of the code.
We can then apply the MDL learning rule: search for a tree with n nodes that minimizes
L_S(h) + sqrt( ((n + 1) log₂(d + 3) + log(2/δ)) / (2m) ).
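
A small sketch of this objective (assuming a tree is summarized by its node count n; the numbers in the example call are made up):

    import math

    def tree_description_bits(n_nodes, d):
        # n + 1 blocks, each naming one of: d possible splits, leaf 1, leaf 0, end of code
        return (n_nodes + 1) * math.log2(d + 3)

    def tree_mdl_objective(train_error, n_nodes, d, m, delta):
        bits = tree_description_bits(n_nodes, d)
        return train_error + math.sqrt((bits + math.log(2 / delta)) / (2 * m))

    # a 7-node tree over d = 20 boolean features, m = 1000 examples, 5% train error
    print(tree_mdl_objective(0.05, n_nodes=7, d=20, m=1000, delta=0.05))   # about 0.19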

67 Decision Tree Algorithms. Minimizing the above objective exactly is an NP-hard problem, so we use a greedy approach: Iterative Dichotomizer 3 (ID3). Following the MDL principle, it attempts to create a small tree with low train error. Proposed by Ross Quinlan.

68 ID3(S, A)
Input: training set S, feature subset A ⊆ [d].
  if all examples in S are labeled 1, return the leaf 1
  if all examples in S are labeled 0, return the leaf 0
  if A = ∅, return a leaf whose value is the majority of labels in S
  else:
    let j = argmax_{i ∈ A} Gain(S, i)
    if all examples in S have the same label, return a leaf whose value is the majority of labels in S
    else:
      let T_1 be the tree returned by ID3({(x, y) ∈ S : x_j = 1}, A \ {j})
      let T_2 be the tree returned by ID3({(x, y) ∈ S : x_j = 0}, A \ {j})
      return the tree rooted at the question "x_j = 1?", with T_1 as the "yes" subtree and T_2 as the "no" subtree
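
A compact, runnable sketch of this procedure (not the lecture's code; features are assumed binary and stored in dicts, the gain function is passed in, and the guard against empty splits is an addition for robustness):

    def id3(S, A, gain):
        """S: list of (x, y) with x a dict of {0,1}-valued features and y in {0,1}.
        A: set of still-usable feature names. gain: callable gain(S, i)."""
        labels = [y for _, y in S]
        majority = int(2 * sum(labels) >= len(labels))
        if all(y == 1 for y in labels):
            return 1                              # leaf 1
        if all(y == 0 for y in labels):
            return 0                              # leaf 0
        if not A or len(set(labels)) == 1:
            return majority                       # leaf: majority label
        j = max(A, key=lambda i: gain(S, i))      # best split according to Gain
        S1 = [(x, y) for x, y in S if x[j] == 1]
        S0 = [(x, y) for x, y in S if x[j] == 0]
        if not S1 or not S0:                      # degenerate split (added guard)
            return majority
        return (j, id3(S1, A - {j}, gain), id3(S0, A - {j}, gain))

    def predict(tree, x):
        while isinstance(tree, tuple):            # internal node: (feature, yes-subtree, no-subtree)
            j, yes, no = tree
            tree = yes if x[j] == 1 else no
        return tree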

69 Gain Measures.
Gain(S, i) = C(P_S[y = 1]) − ( P_S[x_i = 1] · C(P_S[y = 1 | x_i = 1]) + P_S[x_i = 0] · C(P_S[y = 1 | x_i = 0]) ),
where C is one of:
  Train error: C(a) = min{a, 1 − a}
  Information gain: C(a) = −a log(a) − (1 − a) log(1 − a)
  Gini index: C(a) = 2a(1 − a)
(The slide plots C(a) against a for the three measures.)
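
The three measures and the induced gain, in a short sketch that plugs into the id3 sketch above (empirical probabilities are plain frequencies; the function names are mine):

    import math

    def c_error(a):
        return min(a, 1 - a)

    def c_entropy(a):
        return 0.0 if a in (0.0, 1.0) else -a * math.log(a) - (1 - a) * math.log(1 - a)

    def c_gini(a):
        return 2 * a * (1 - a)

    def make_gain(C):
        def gain(S, i):
            m = len(S)
            p_y = sum(y for _, y in S) / m                      # P_S[y = 1]
            S1 = [y for x, y in S if x[i] == 1]
            S0 = [y for x, y in S if x[i] == 0]
            p_x = len(S1) / m                                   # P_S[x_i = 1]
            p1 = sum(S1) / len(S1) if S1 else 0.0               # P_S[y = 1 | x_i = 1]
            p0 = sum(S0) / len(S0) if S0 else 0.0               # P_S[y = 1 | x_i = 0]
            return C(p_y) - (p_x * C(p1) + (1 - p_x) * C(p0))
        return gain

    # e.g. tree = id3(S, set(S[0][0].keys()), make_gain(c_entropy))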

70 Pruning, Random Forests, ... In the exercise you'll learn about additional practical variants: pruning the tree, Random Forests, and dealing with real-valued features.

71 Outline (next: Nearest Neighbor and Consistency).

72 Nearest Neighbor. "Things that look alike must be alike." Memorize the training set S = (x_1, y_1), ..., (x_m, y_m). Given a new x, find the k closest points in S and return the majority vote among their labels.

73 1-Nearest Neighbor: Voronoi Tessellation (figure).

74 1-Nearest Neighbor: Voronoi Tessellation. Unlike ERM, SRM, MDL, etc., there is no H. At training time: do nothing. At test time: search S for the nearest neighbors.
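
A minimal k-NN predictor matching this description (a sketch, not the lecture's code; plain Euclidean distance, points as tuples of floats):

    import math
    from collections import Counter

    def knn_predict(S, x, k=1):
        """Return the majority label among the k points of S closest to x."""
        dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        neighbors = sorted(S, key=lambda xy: dist(xy[0], x))[:k]
        return Counter(y for _, y in neighbors).most_common(1)[0][0]

    # training time: do nothing (just keep S); test time: search S
    S = [((0.0, 0.0), 0), ((1.0, 1.0), 1), ((0.9, 1.2), 1)]
    print(knn_predict(S, (1.0, 0.9), k=3))   # 1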

75 Analysis of k-nn X = [0, 1] d, Y = {0, 1}, D is a distribution over X Y, D X is the marginal distribution over X, and η : R d R is the conditional probability over the labels, that is, η(x) = P[y = 1 x]. Shai Shalev-Shwartz (Hebrew U) IML Lecture 5 MDL,SRM,trees,neighbors 34 / 39

76 Analysis of k-nn X = [0, 1] d, Y = {0, 1}, D is a distribution over X Y, D X is the marginal distribution over X, and η : R d R is the conditional probability over the labels, that is, η(x) = P[y = 1 x]. Recall: the Bayes optimal rule (that is, the hypothesis that minimizes L D (h) over all functions) is h (x) = 1 [η(x)>1/2]. Shai Shalev-Shwartz (Hebrew U) IML Lecture 5 MDL,SRM,trees,neighbors 34 / 39

77 Analysis of k-nn X = [0, 1] d, Y = {0, 1}, D is a distribution over X Y, D X is the marginal distribution over X, and η : R d R is the conditional probability over the labels, that is, η(x) = P[y = 1 x]. Recall: the Bayes optimal rule (that is, the hypothesis that minimizes L D (h) over all functions) is h (x) = 1 [η(x)>1/2]. Prior knowledge: η is c-lipschitz. Namely, for all x, x X, η(x) η(x ) c x x Shai Shalev-Shwartz (Hebrew U) IML Lecture 5 MDL,SRM,trees,neighbors 34 / 39

78 Analysis of k-nn X = [0, 1] d, Y = {0, 1}, D is a distribution over X Y, D X is the marginal distribution over X, and η : R d R is the conditional probability over the labels, that is, η(x) = P[y = 1 x]. Recall: the Bayes optimal rule (that is, the hypothesis that minimizes L D (h) over all functions) is h (x) = 1 [η(x)>1/2]. Prior knowledge: η is c-lipschitz. Namely, for all x, x X, η(x) η(x ) c x x Theorem: Let h S be the k-nn rule, then, ( ) 8 ( E S D m[l D(h S )] 1 + L D (h ) + 6 c ) d + k m 1/(d+1). k Shai Shalev-Shwartz (Hebrew U) IML Lecture 5 MDL,SRM,trees,neighbors 34 / 39

79 k-nearest Neighbor: Bias-Complexity Tradeoff k-nearest Neighbor: Data Fit / Complexity Tradeoff h*= S= k=1 k=50 Shai Shalev-Shwartz (Hebrew U) k=5 k=100 IML Lecture 5 k=12 k=200 MDL,SRM,trees,neighbors 35 / 39

80 Curse of Dimensionality.
E_{S~D^m}[L_D(h_S)] ≤ (1 + sqrt(8/k)) · L_D(h*) + (6 c sqrt(d) + k) · m^(−1/(d+1)).
Suppose L_D(h*) = 0. Then, to have error ≤ ε we need m ≥ (4 c sqrt(d) / ε)^(d+1).
The number of examples grows exponentially with the dimension, and this is not an artifact of the analysis:
Theorem: for any c > 1 and every learner, there exists a distribution over [0, 1]^d × {0, 1} such that η(x) is c-Lipschitz, the Bayes error of the distribution is 0, but for sample sizes m ≤ (c + 1)^d / 2, the true error of the learner is greater than 1/4.
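
To see how fast this grows, a tiny calculation of the required sample size (using illustrative values c = 1 and ε = 0.1):

    import math

    def examples_needed(d, c=1.0, eps=0.1):
        # m >= (4 c sqrt(d) / eps)^(d+1), from the realizable-case bound above
        return (4 * c * math.sqrt(d) / eps) ** (d + 1)

    for d in (1, 2, 5, 10):
        print(d, f"{examples_needed(d):.2e}")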

82 Contradicting the No-Free-Lunch?
E_{S~D^m}[L_D(h_S)] ≤ (1 + sqrt(8/k)) · L_D(h*) + (6 c sqrt(d) + k) · m^(−1/(d+1)).
Seemingly, we learn the class of all functions over [0, 1]^d. But this class is not learnable even in the non-uniform model...
There is no contradiction: the number of required examples depends on the Lipschitzness of η (the parameter c), which depends on D.
PAC: m(ε, δ). Non-uniform: m(ε, δ, h). Consistency: m(ε, δ, h, D).

84 Issues with Nearest Neighbor. Need to store the entire training set ("replace intelligence with fast memory"). Curse of dimensionality (we'll later learn dimensionality reduction methods). The computational problem of finding the nearest neighbor. What is the correct metric between objects? Success depends on the Lipschitzness of η, which depends on the right metric.

85 Summary. Expressing prior knowledge: hypothesis class, weighting hypotheses, metric. Weaker notions of learnability: PAC is stronger than non-uniform, which is stronger than consistency. Learning rules: ERM, MDL, SRM. Decision trees. Nearest Neighbor.


More information

Chapter 3: Decision Tree Learning

Chapter 3: Decision Tree Learning Chapter 3: Decision Tree Learning CS 536: Machine Learning Littman (Wu, TA) Administration Books? New web page: http://www.cs.rutgers.edu/~mlittman/courses/ml03/ schedule lecture notes assignment info.

More information

Outline. Training Examples for EnjoySport. 2 lecture slides for textbook Machine Learning, c Tom M. Mitchell, McGraw Hill, 1997

Outline. Training Examples for EnjoySport. 2 lecture slides for textbook Machine Learning, c Tom M. Mitchell, McGraw Hill, 1997 Outline Training Examples for EnjoySport Learning from examples General-to-specific ordering over hypotheses [read Chapter 2] [suggested exercises 2.2, 2.3, 2.4, 2.6] Version spaces and candidate elimination

More information

Lecture 3: Decision Trees

Lecture 3: Decision Trees Lecture 3: Decision Trees Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning Lecture 3: Decision Trees p. Decision

More information

1 Review of The Learning Setting

1 Review of The Learning Setting COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we

More information

Tufts COMP 135: Introduction to Machine Learning

Tufts COMP 135: Introduction to Machine Learning Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Logistic Regression Many slides attributable to: Prof. Mike Hughes Erik Sudderth (UCI) Finale Doshi-Velez (Harvard)

More information

Domain Adaptation Can Quantity Compensate for Quality?

Domain Adaptation Can Quantity Compensate for Quality? Domain Adaptation Can Quantity Compensate for Quality? hai Ben-David David R. Cheriton chool of Computer cience University of Waterloo Waterloo, ON N2L 3G1 CANADA shai@cs.uwaterloo.ca hai halev-hwartz

More information

Decision Tree Learning Mitchell, Chapter 3. CptS 570 Machine Learning School of EECS Washington State University

Decision Tree Learning Mitchell, Chapter 3. CptS 570 Machine Learning School of EECS Washington State University Decision Tree Learning Mitchell, Chapter 3 CptS 570 Machine Learning School of EECS Washington State University Outline Decision tree representation ID3 learning algorithm Entropy and information gain

More information

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin

Learning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Discriminative v. generative

Discriminative v. generative Discriminative v. generative Naive Bayes 2 Naive Bayes P (x ij,y i )= Y i P (y i ) Y j P (x ij y i ) P (y i =+)=p MLE: max P (x ij,y i ) a j,b j,p p = 1 N P [yi =+] P (x ij =1 y i = ) = a j P (x ij =1

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

Computational learning theory. PAC learning. VC dimension.

Computational learning theory. PAC learning. VC dimension. Computational learning theory. PAC learning. VC dimension. Petr Pošík Czech Technical University in Prague Faculty of Electrical Engineering Dept. of Cybernetics COLT 2 Concept...........................................................................................................

More information

Decision Trees.

Decision Trees. . Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

More information