Introduction to Machine Learning (67577) Lecture 5
Introduction to Machine Learning (67577), Lecture 5
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
Topics: Non-uniform learning, MDL, SRM, Decision Trees, Nearest Neighbor
Outline
1. Minimum Description Length
2. Non-uniform learnability
3. Structural Risk Minimization
4. Decision Trees
5. Nearest Neighbor and Consistency
How to Express Prior Knowledge
So far, the learner expresses prior knowledge by specifying the hypothesis class $H$.
Other Ways to Express Prior Knowledge
Occam's razor: "A short explanation is preferred over a longer one" (William of Occam)
"Things that look alike must be alike"
Bias to Shorter Description
Let $H$ be a countable hypothesis class.
Let $w : H \to \mathbb{R}$ be such that $\sum_{h \in H} w(h) \le 1$.
The function $w$ reflects prior knowledge on how important $h$ is.
Example: Description Length
Suppose that each $h \in H$ is described by some word $d(h) \in \{0,1\}^*$. E.g.: $H$ is the class of all Python programs.
Suppose that the description language is prefix-free, namely, for every $h \neq h'$, $d(h)$ is not a prefix of $d(h')$. (Always achievable by including an end-of-word symbol.)
Let $|h|$ be the length of $d(h)$. Then, set $w(h) = 2^{-|h|}$.
Kraft's inequality implies that $\sum_h w(h) \le 1$.
Proof: define a probability over the words $\{d(h)\}$ as follows: repeatedly toss an unbiased coin until the sequence of outcomes is a member of $\{d(h)\}$, and then stop. Since the language is prefix-free, this is a valid probability over $\{d(h)\}$, and the probability of obtaining $d(h)$ is $w(h)$.
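Kraft's inequality can be checked numerically on any prefix-free code; the four-word code below is a hypothetical example, not from the slides:

```python
# Sanity check of Kraft's inequality for a prefix-free code:
# if no codeword is a prefix of another, then sum_h 2^{-|h|} <= 1.

def is_prefix_free(words):
    """Return True if no word in the list is a proper prefix of another."""
    return not any(u != v and v.startswith(u) for u in words for v in words)

def kraft_sum(words):
    """Compute sum over codewords of 2^{-length}, i.e. sum_h w(h)."""
    return sum(2.0 ** -len(w) for w in words)

# A small prefix-free description language (hypothetical example).
code = ["0", "10", "110", "111"]
assert is_prefix_free(code)
assert kraft_sum(code) <= 1.0  # Kraft's inequality (here it holds with equality)
```

Dropping any one codeword makes the sum strictly less than 1, which is the usual case for a description language that does not use every binary string.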
Bias to Shorter Description
Theorem (Minimum Description Length (MDL) bound): Let $w : H \to \mathbb{R}$ be such that $\sum_{h \in H} w(h) \le 1$. Then, with probability of at least $1 - \delta$ over $S \sim D^m$ we have:
$$\forall h \in H, \quad L_D(h) \le L_S(h) + \sqrt{\frac{-\log(w(h)) + \log(2/\delta)}{2m}}$$
Compare to the VC bound:
$$\forall h \in H, \quad L_D(h) \le L_S(h) + C\sqrt{\frac{\mathrm{VCdim}(H) + \log(2/\delta)}{2m}}$$
Proof
For every $h$, define $\delta_h = w(h)\,\delta$.
By Hoeffding's bound, for every $h$,
$$D^m\!\left(\left\{S : L_D(h) > L_S(h) + \sqrt{\tfrac{\log(2/\delta_h)}{2m}}\right\}\right) \le \delta_h$$
Applying the union bound,
$$D^m\!\left(\left\{S : \exists h \in H,\ L_D(h) > L_S(h) + \sqrt{\tfrac{\log(2/\delta_h)}{2m}}\right\}\right) \le \sum_{h \in H} D^m\!\left(\left\{S : L_D(h) > L_S(h) + \sqrt{\tfrac{\log(2/\delta_h)}{2m}}\right\}\right) \le \sum_{h \in H} \delta_h \le \delta.$$
Since $\log(2/\delta_h) = -\log(w(h)) + \log(2/\delta)$, this is exactly the bound in the theorem.
Bound Minimization
MDL bound: $\forall h \in H,\ L_D(h) \le L_S(h) + \sqrt{\frac{-\log(w(h)) + \log(2/\delta)}{2m}}$
VC bound: $\forall h \in H,\ L_D(h) \le L_S(h) + C\sqrt{\frac{\mathrm{VCdim}(H) + \log(2/\delta)}{2m}}$
Recall that our goal is to minimize $L_D(h)$ over $h \in H$.
Minimizing the VC bound leads to the ERM rule.
Minimizing the MDL bound leads to the MDL rule:
$$\mathrm{MDL}(S) \in \operatorname*{argmin}_{h \in H} \left[ L_S(h) + \sqrt{\frac{-\log(w(h)) + \log(2/\delta)}{2m}} \right]$$
When $w(h) = 2^{-|h|}$ we obtain $-\log(w(h)) = |h|\log(2)$.
This is an explicit tradeoff between bias (small $L_S(h)$) and complexity (small $|h|$).
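For a finite candidate set the MDL rule can be written down directly. A minimal sketch, assuming $w(h) = 2^{-|h|}$; the function signature and the toy hypotheses below are illustrative assumptions, not part of the slides:

```python
import math

def mdl_rule(hypotheses, description_len, empirical_error, m, delta):
    """MDL rule with w(h) = 2^{-|h|}: return the hypothesis minimizing
    L_S(h) + sqrt((|h| * log 2 + log(2/delta)) / (2m))."""
    def bound(h):
        penalty = (description_len(h) * math.log(2) + math.log(2 / delta)) / (2 * m)
        return empirical_error(h) + math.sqrt(penalty)
    return min(hypotheses, key=bound)

# Toy usage (hypothetical numbers): a short hypothesis with slightly higher
# training error beats a very long hypothesis that fits the sample perfectly.
hs = [("short", 10, 0.05), ("long", 5000, 0.0)]
best = mdl_rule(hs, description_len=lambda h: h[1],
                empirical_error=lambda h: h[2], m=100, delta=0.05)
# best is the "short" hypothesis: its complexity penalty is far smaller.
```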
MDL guarantee
Theorem: For every $h^* \in H$, w.p. $\ge 1 - \delta$ over $S \sim D^m$ we have:
$$L_D(\mathrm{MDL}(S)) \le L_D(h^*) + \sqrt{\frac{-\log(w(h^*)) + \log(2/\delta)}{2m}}$$
Example: Take $H$ to be the class of all Python programs, with $|h|$ the code length (in bits).
Assume there exists $h^* \in H$ with $L_D(h^*) = 0$. Then, for every $\epsilon, \delta$, there exists a sample size $m$ s.t. $D^m(\{S : L_D(\mathrm{MDL}(S)) \le \epsilon\}) \ge 1 - \delta$.
MDL is a Universal Learner.
[Annotated slide, partly illegible: the weight $w(h) = 2^{-|d(h)|}$, based on the length of a prefix-free description $d(h)$, connects MDL to Kolmogorov complexity; among hypotheses consistent with $S$, $\mathrm{MDL}(S) = \operatorname{argmax}_h w(h) = \operatorname{argmin}_h |d(h)|$. Pictured: Andrei Kolmogorov and Ray Solomonoff.]
Contradiction to the fundamental theorem of learning?
Take again $H$ to be all Python programs.
Note that $\mathrm{VCdim}(H) = \infty$.
By the No-Free-Lunch theorem, we can't (PAC) learn $H$.
So how come we can learn $H$ using MDL?
Non-uniform learning
Definition (Non-uniformly learnable): $H$ is non-uniformly learnable if there exist $A$ and $m^{NUL}_H : (0,1)^2 \times H \to \mathbb{N}$ s.t. for all $\epsilon, \delta \in (0,1)$ and $h \in H$, if $m \ge m^{NUL}_H(\epsilon, \delta, h)$ then for every $D$, $D^m(\{S : L_D(A(S)) \le L_D(h) + \epsilon\}) \ge 1 - \delta$.
The number of required examples depends on $\epsilon$, $\delta$, and $h$.
Definition (Agnostic PAC learnable): $H$ is agnostically PAC learnable if there exist $A$ and $m_H : (0,1)^2 \to \mathbb{N}$ s.t. for all $\epsilon, \delta \in (0,1)$, if $m \ge m_H(\epsilon, \delta)$, then for every $D$ and every $h \in H$, $D^m(\{S : L_D(A(S)) \le L_D(h) + \epsilon\}) \ge 1 - \delta$.
The number of required examples depends only on $\epsilon$ and $\delta$.
Non-uniform learning vs. PAC learning
Corollary: Let $H$ be the class of all computable functions. Then:
$H$ is non-uniformly learnable, with sample complexity $m^{NUL}_H(\epsilon, \delta, h) \le \frac{-\log(w(h)) + \log(2/\delta)}{2\epsilon^2}$.
$H$ is not PAC learnable.
We saw that the VC dimension characterizes PAC learnability. What characterizes non-uniform learnability?
Characterizing Non-uniform Learnability
Theorem: A class $H \subseteq \{0,1\}^X$ is non-uniformly learnable if and only if it is a countable union of PAC learnable hypothesis classes.
Proof (Non-uniformly learnable $\Rightarrow$ countable union)
Assume that $H$ is non-uniformly learnable using $A$ with sample complexity $m^{NUL}_H$.
For every $n \in \mathbb{N}$, let $H_n = \{h \in H : m^{NUL}_H(1/8, 1/7, h) \le n\}$.
Clearly, $H = \bigcup_{n \in \mathbb{N}} H_n$.
For every $D$ s.t. there exists $h \in H_n$ with $L_D(h) = 0$ we have that $D^n(\{S : L_D(A(S)) \le 1/8\}) \ge 6/7$.
The fundamental theorem of statistical learning implies that $\mathrm{VCdim}(H_n) < \infty$, and therefore $H_n$ is agnostic PAC learnable.
Proof (Countable union $\Rightarrow$ non-uniformly learnable)
Assume $H = \bigcup_{n \in \mathbb{N}} H_n$, with $\mathrm{VCdim}(H_n) = d_n < \infty$.
Choose $w : \mathbb{N} \to [0,1]$ s.t. $\sum_n w(n) \le 1$. E.g. $w(n) = \frac{6}{\pi^2 n^2}$.
Choose $\delta_n = \delta\, w(n)$ and $\epsilon_n = C\sqrt{\frac{d_n + \log(1/\delta_n)}{m}}$.
By the fundamental theorem, for every $n$,
$$D^m(\{S : \exists h \in H_n,\ L_D(h) > L_S(h) + \epsilon_n\}) \le \delta_n.$$
Applying the union bound over $n$ we obtain
$$D^m(\{S : \exists n, \exists h \in H_n,\ L_D(h) > L_S(h) + \epsilon_n\}) \le \sum_n \delta_n \le \delta.$$
This yields a generic non-uniform learning rule.
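A quick numeric check (illustrative, not part of the proof) that the weights $w(n) = 6/(\pi^2 n^2)$ used above indeed sum to at most 1, with the partial sums approaching 1 from below:

```python
import math

def weight(n):
    """w(n) = 6 / (pi^2 * n^2); over all n >= 1 these sum to exactly 1."""
    return 6.0 / (math.pi ** 2 * n ** 2)

partial = sum(weight(n) for n in range(1, 100001))
assert partial <= 1.0
assert partial > 0.99  # the tail beyond n = 100000 is tiny
```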
47 Outline 1 Minimum Description Length 2 Non-uniform learnability 3 Structural Risk Minimization 4 Decision Trees 5 Nearest Neighbor and Consistency Shai Shalev-Shwartz (Hebrew U) IML Lecture 5 MDL,SRM,trees,neighbors 19 / 39
Structural Risk Minimization (SRM)
$$\mathrm{SRM}(S) \in \operatorname*{argmin}_{h \in H} \left[ L_S(h) + \min_{n : h \in H_n} C\sqrt{\frac{d_n - \log(w(n)) + \log(1/\delta)}{m}} \right]$$
As in the analysis of MDL, it is easy to show that for every $h \in H$,
$$L_D(\mathrm{SRM}(S)) \le L_S(h) + \min_{n : h \in H_n} C\sqrt{\frac{d_n - \log(w(n)) + \log(1/\delta)}{m}}$$
Hence, SRM is a generic non-uniform learner with sample complexity
$$m^{NUL}_H(\epsilon, \delta, h) \le \min_{n : h \in H_n} C\,\frac{d_n - \log(w(n)) + \log(1/\delta)}{\epsilon^2}$$
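The generic rule can be sketched as follows, assuming access to an ERM routine for each $H_n$ and using $w(n) = 6/(\pi^2 n^2)$; the constant $C$, the interface, and the toy usage are illustrative assumptions, not from the slides:

```python
import math

def srm(classes, erm, L_S, m, delta, C=1.0):
    """Pick the hypothesis minimizing the SRM objective
    L_S(h) + C * sqrt((d_n - log w(n) + log(1/delta)) / m).
    `classes` is a list of (vc_dim, hypothesis_space) pairs, indexed n = 1, 2, ...;
    `erm(space)` returns an empirical risk minimizer over that space;
    `L_S(h)` returns the empirical error of h."""
    best, best_val = None, float("inf")
    for n, (d_n, space) in enumerate(classes, start=1):
        h = erm(space)
        w_n = 6.0 / (math.pi ** 2 * n ** 2)
        penalty = C * math.sqrt((d_n - math.log(w_n) + math.log(1 / delta)) / m)
        val = L_S(h) + penalty
        if val < best_val:
            best, best_val = h, val
    return best

# Toy usage (hypothetical): a simple class with some training error vs. a
# richer class that fits perfectly; which one wins depends on the sample size m.
classes = [(1, "simple"), (10, "rich")]
losses = {"simple": 0.3, "rich": 0.0}
pick = srm(classes, erm=lambda space: space,
           L_S=lambda h: losses[h], m=10000, delta=0.05)
```

With a large sample the richer class's penalty is negligible and it is selected; with a very small sample the penalty dominates and SRM falls back to the simpler class.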
No-free-lunch for non-uniform learnability
Claim: For any infinite domain set $X$, the class $H = \{0,1\}^X$ is not a countable union of classes of finite VC-dimension.
Hence, such classes $H$ are not non-uniformly learnable.
The cost of weaker prior knowledge
Suppose $H = \bigcup_n H_n$, where $\mathrm{VCdim}(H_n) = n$.
Suppose that some $h^* \in H_n$ has $L_D(h^*) = 0$.
With this prior knowledge, we can apply ERM on $H_n$, and the sample complexity is $C\,\frac{n + \log(1/\delta)}{\epsilon^2}$.
Without this prior knowledge, SRM will need $C\,\frac{n + \log(\pi^2 n^2/6) + \log(1/\delta)}{\epsilon^2}$ examples.
That is, we pay on the order of $\log(n)/\epsilon^2$ more examples for not knowing $n$ in advance.
SRM can therefore be used for model selection.
Decision Trees
[Example tree:]
Color? — other: not-tasty; pale green to pale yellow: ask Softness?
Softness? — other: not-tasty; gives slightly to palm pressure: tasty
VC dimension of Decision Trees
Claim: Consider the class of decision trees over $X$ with $k$ leaves. Then the VC dimension of this class is $k$.
Proof: A set of $k$ instances that arrive at the different leaves can be shattered. A set of $k+1$ instances can't be shattered, since 2 of the instances must arrive at the same leaf.
Description Language for Decision Trees
Suppose $X = \{0,1\}^d$ and splitting rules are according to $1_{[x_i = 1]}$ for some $i \in [d]$.
Consider the class of all such decision trees over $X$.
Claim: This class contains $\{0,1\}^X$ and hence its VC dimension is $|X| = 2^d$.
But we can bias toward small trees.
A tree with $n$ nodes can be described as $n+1$ blocks, each of size $\log_2(d+3)$ bits, indicating (in depth-first order):
- An internal node of the form $1_{[x_i = 1]}$ for some $i \in [d]$
- A leaf whose value is 1
- A leaf whose value is 0
- End of the code
Can apply the MDL learning rule: search for the tree with $n$ nodes that minimizes
$$L_S(h) + \sqrt{\frac{(n+1)\log_2(d+3) + \log(2/\delta)}{2m}}$$
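The block count in this encoding can be sketched in a few lines; the tuple-based tree representation is an illustrative assumption, not from the slides:

```python
import math

def description_bits(tree, d):
    """Length in bits of the depth-first encoding of a decision tree over
    X = {0,1}^d. `tree` is either ('leaf', 0 or 1) or ('node', i, left, right).
    Each of the d + 3 symbols (d features, leaf-0, leaf-1, end-of-code) takes
    ceil(log2(d + 3)) bits; a tree with n nodes uses n + 1 blocks (the extra
    block is the end-of-code symbol)."""
    bits_per_block = math.ceil(math.log2(d + 3))
    def nodes(t):  # count internal nodes and leaves
        if t[0] == "leaf":
            return 1
        _, _i, left, right = t
        return 1 + nodes(left) + nodes(right)
    return (nodes(tree) + 1) * bits_per_block

# One split on x_1 with two leaves: n = 3 nodes over d = 5 features,
# so 4 blocks of ceil(log2(8)) = 3 bits each, i.e. 12 bits in total.
t = ("node", 1, ("leaf", 0), ("leaf", 1))
```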
Decision Tree Algorithms
Finding a minimal tree that fits the data is an NP-hard problem...
Greedy approach: Iterative Dichotomizer 3 (ID3)
Following the MDL principle, it attempts to create a small tree with low train error.
Proposed by Ross Quinlan.
ID3(S, A)
Input: training set S, feature subset A ⊆ [d]
- If all examples in S are labeled 1, return a leaf with value 1.
- If all examples in S are labeled 0, return a leaf with value 0.
- If A = ∅, return a leaf whose value is the majority of labels in S.
- Else:
  - Let j = argmax_{i ∈ A} Gain(S, i).
  - If all examples in S have the same label, return a leaf whose value is the majority of labels in S.
  - Else:
    - Let T_1 be the tree returned by ID3({(x, y) ∈ S : x_j = 1}, A \ {j}).
    - Let T_2 be the tree returned by ID3({(x, y) ∈ S : x_j = 0}, A \ {j}).
    - Return the tree that splits on "x_j = 1?", following T_1 when the answer is yes and T_2 when it is no.
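A minimal runnable sketch of the pseudocode above, using the train-error gain; the tuple-based tree representation, the `error_gain` helper, `predict`, and the guard against degenerate splits are my own illustrative choices, not from the slides:

```python
def error_gain(S, i):
    """Gain(S, i) with the train-error criterion C(a) = min(a, 1 - a)."""
    def C(a):
        return min(a, 1.0 - a)
    def p1(subset):  # empirical P[y = 1] on a subset
        return sum(y for _, y in subset) / len(subset) if subset else 0.0
    m = len(S)
    pos = [(x, y) for x, y in S if x[i] == 1]
    neg = [(x, y) for x, y in S if x[i] == 0]
    return C(p1(S)) - (len(pos) / m * C(p1(pos)) + len(neg) / m * C(p1(neg)))

def id3(S, A):
    """Greedy tree growing. S: list of (x, y) with x a tuple of bits and
    y in {0, 1}; A: set of usable feature indices.
    Returns ('leaf', label) or ('node', j, T1, T2), with T1 for x_j = 1."""
    labels = [y for _, y in S]
    if all(y == 1 for y in labels):
        return ("leaf", 1)
    if all(y == 0 for y in labels):
        return ("leaf", 0)
    majority = int(sum(labels) * 2 >= len(labels))
    if not A:
        return ("leaf", majority)
    j = max(A, key=lambda i: error_gain(S, i))
    S1 = [(x, y) for x, y in S if x[j] == 1]
    S0 = [(x, y) for x, y in S if x[j] == 0]
    if not S1 or not S0:  # degenerate split: stop with the majority label
        return ("leaf", majority)
    return ("node", j, id3(S1, A - {j}), id3(S0, A - {j}))

def predict(tree, x):
    """Route x down the tree to a leaf and return its label."""
    if tree[0] == "leaf":
        return tree[1]
    _, j, T1, T2 = tree
    return predict(T1 if x[j] == 1 else T2, x)
```

On a small sample such as the AND function over two bits, the learned tree fits the training set exactly, which is what the greedy rule aims for before any pruning.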
Gain Measures
$$\mathrm{Gain}(S, i) = C\big(\mathbb{P}_S[y]\big) - \Big( \mathbb{P}_S[x_i]\, C\big(\mathbb{P}_S[y \mid x_i]\big) + \mathbb{P}_S[\lnot x_i]\, C\big(\mathbb{P}_S[y \mid \lnot x_i]\big) \Big)$$
Train error: $C(a) = \min\{a, 1-a\}$
Information gain: $C(a) = -a\log(a) - (1-a)\log(1-a)$
Gini index: $C(a) = 2a(1-a)$
[Plot: the three criteria $C(a)$ for $a \in [0,1]$; all peak at $a = 1/2$.]
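The three criteria differ only in the impurity function $C$; a small sketch (the function names are mine, not from the slides):

```python
import math

def c_error(a):
    """Train error: C(a) = min(a, 1 - a)."""
    return min(a, 1.0 - a)

def c_entropy(a):
    """Information gain: C(a) = -a log a - (1-a) log(1-a), with C(0) = C(1) = 0."""
    if a in (0.0, 1.0):
        return 0.0
    return -a * math.log(a) - (1.0 - a) * math.log(1.0 - a)

def c_gini(a):
    """Gini index: C(a) = 2a(1 - a)."""
    return 2.0 * a * (1.0 - a)

def gain(S, i, C):
    """Gain(S, i) = C(P_S[y]) - (P_S[x_i] C(P_S[y|x_i]) + P_S[not x_i] C(P_S[y|not x_i]))."""
    def p1(ys):  # empirical P[y = 1]
        return sum(ys) / len(ys) if ys else 0.0
    m = len(S)
    pos = [y for x, y in S if x[i] == 1]
    neg = [y for x, y in S if x[i] == 0]
    return C(p1([y for _, y in S])) - (
        len(pos) / m * C(p1(pos)) + len(neg) / m * C(p1(neg)))
```

For a feature that determines the label exactly, both conditional terms vanish and the gain equals $C(\mathbb{P}_S[y])$ under every criterion.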
Pruning, Random Forests, ...
In the exercise you'll learn about additional practical variants:
- Pruning the tree
- Random Forests
- Dealing with real-valued features
Nearest Neighbor
"Things that look alike must be alike"
Memorize the training set $S = (x_1, y_1), \ldots, (x_m, y_m)$.
Given a new $x$, find the $k$ closest points in $S$ and return the majority vote among their labels.
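The rule itself is a few lines of brute-force search; Euclidean distance and breaking vote ties toward label 1 are my own choices here, not specified on the slide:

```python
import math

def knn_predict(S, x, k):
    """Majority vote among the k training points closest to x.
    S is a list of (x_i, y_i) with x_i a tuple of reals and y_i in {0, 1}."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    neighbors = sorted(S, key=lambda xy: dist(xy[0], x))[:k]
    votes = sum(y for _, y in neighbors)
    return int(votes * 2 >= k)  # ties broken toward label 1

# Toy usage: two clusters on the line.
S = [((0.0,), 0), ((0.1,), 0), ((1.0,), 1), ((1.1,), 1)]
```

Note that all the work happens at prediction time: "training" is just storing $S$, exactly as described above.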
1-Nearest Neighbor: Voronoi Tessellation
[Figure: the 1-NN rule partitions the domain into Voronoi cells, one per training point.]
Unlike ERM, SRM, MDL, etc., there's no hypothesis class $H$:
At training time: do nothing.
At test time: search $S$ for the nearest neighbors.
Analysis of k-NN
$X = [0,1]^d$, $Y = \{0,1\}$, $D$ is a distribution over $X \times Y$, $D_X$ is the marginal distribution over $X$, and $\eta : \mathbb{R}^d \to \mathbb{R}$ is the conditional probability over the labels, that is, $\eta(x) = \mathbb{P}[y = 1 \mid x]$.
Recall: the Bayes optimal rule (that is, the hypothesis that minimizes $L_D(h)$ over all functions) is $h^*(x) = 1_{[\eta(x) > 1/2]}$.
Prior knowledge: $\eta$ is $c$-Lipschitz. Namely, for all $x, x' \in X$, $|\eta(x) - \eta(x')| \le c\,\|x - x'\|$.
Theorem: Let $h_S$ be the $k$-NN rule. Then
$$\mathbb{E}_{S \sim D^m}[L_D(h_S)] \le \left(1 + \sqrt{\tfrac{8}{k}}\right) L_D(h^*) + \left(6c\sqrt{d} + k\right) m^{-1/(d+1)}.$$
79 k-nearest Neighbor: Bias-Complexity Tradeoff k-nearest Neighbor: Data Fit / Complexity Tradeoff h*= S= k=1 k=50 Shai Shalev-Shwartz (Hebrew U) k=5 k=100 IML Lecture 5 k=12 k=200 MDL,SRM,trees,neighbors 35 / 39
Curse of Dimensionality
$$\mathbb{E}_{S \sim D^m}[L_D(h_S)] \le \left(1 + \sqrt{\tfrac{8}{k}}\right) L_D(h^*) + \left(6c\sqrt{d} + k\right) m^{-1/(d+1)}.$$
Suppose $L_D(h^*) = 0$. Then, to have error $\le \epsilon$ we need $m \ge (4c\sqrt{d}/\epsilon)^{d+1}$.
The number of examples grows exponentially with the dimension. This is not an artifact of the analysis:
Theorem: For any $c > 1$ and every learner, there exists a distribution over $[0,1]^d \times \{0,1\}$ such that $\eta(x)$ is $c$-Lipschitz and the Bayes error of the distribution is 0, but for sample sizes $m \le (c+1)^d / 2$, the true error of the learner is greater than $1/4$.
Contradicting the No-Free-Lunch?
$$\mathbb{E}_{S \sim D^m}[L_D(h_S)] \le \left(1 + \sqrt{\tfrac{8}{k}}\right) L_D(h^*) + \left(6c\sqrt{d} + k\right) m^{-1/(d+1)}.$$
Seemingly, we learn the class of all functions over $[0,1]^d$. But this class is not learnable even in the non-uniform model...
There's no contradiction: the number of required examples depends on the Lipschitzness of $\eta$ (the parameter $c$), which depends on $D$.
PAC: $m(\epsilon, \delta)$; non-uniform: $m(\epsilon, \delta, h)$; consistency: $m(\epsilon, \delta, h, D)$.
Issues with Nearest Neighbor
- Need to store the entire training set: it replaces intelligence with fast memory.
- Curse of dimensionality: we'll later learn dimensionality reduction methods.
- Computational problem of finding the nearest neighbor.
- What is the correct metric between objects? Success depends on the Lipschitzness of $\eta$, which depends on the right metric.
Summary
- Expressing prior knowledge: hypothesis class, weighting hypotheses, metric
- Weaker notions of learnability: PAC is stronger than non-uniform, which is stronger than consistency
- Learning rules: ERM, MDL, SRM
- Decision Trees
- Nearest Neighbor
More informationActive Learning: Disagreement Coefficient
Advanced Course in Machine Learning Spring 2010 Active Learning: Disagreement Coefficient Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz In previous lectures we saw examples in which
More informationCS6375: Machine Learning Gautam Kunapuli. Decision Trees
Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s
More informationComputational Learning Theory
Computational Learning Theory Pardis Noorzad Department of Computer Engineering and IT Amirkabir University of Technology Ordibehesht 1390 Introduction For the analysis of data structures and algorithms
More informationComputational Learning Theory: Shattering and VC Dimensions. Machine Learning. Spring The slides are mainly from Vivek Srikumar
Computational Learning Theory: Shattering and VC Dimensions Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 This lecture: Computational Learning Theory The Theory of Generalization
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification Stochastic
More informationComputational Learning Theory
CS 446 Machine Learning Fall 2016 OCT 11, 2016 Computational Learning Theory Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes 1 PAC Learning We want to develop a theory to relate the probability of successful
More information12. Structural Risk Minimization. ECE 830 & CS 761, Spring 2016
12. Structural Risk Minimization ECE 830 & CS 761, Spring 2016 1 / 23 General setup for statistical learning theory We observe training examples {x i, y i } n i=1 x i = features X y i = labels / responses
More informationMachine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015
Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and
More informationComputational Learning Theory. CS 486/686: Introduction to Artificial Intelligence Fall 2013
Computational Learning Theory CS 486/686: Introduction to Artificial Intelligence Fall 2013 1 Overview Introduction to Computational Learning Theory PAC Learning Theory Thanks to T Mitchell 2 Introduction
More informationLecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof. Ganesh Ramakrishnan
Lecture 24: Other (Non-linear) Classifiers: Decision Tree Learning, Boosting, and Support Vector Classification Instructor: Prof Ganesh Ramakrishnan October 20, 2016 1 / 25 Decision Trees: Cascade of step
More informationPAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht
PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationLittlestone s Dimension and Online Learnability
Littlestone s Dimension and Online Learnability Shai Shalev-Shwartz Toyota Technological Institute at Chicago The Hebrew University Talk at UCSD workshop, February, 2009 Joint work with Shai Ben-David
More informationOn the tradeoff between computational complexity and sample complexity in learning
On the tradeoff between computational complexity and sample complexity in learning Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Joint work with Sham
More informationEfficiently Training Sum-Product Neural Networks using Forward Greedy Selection
Efficiently Training Sum-Product Neural Networks using Forward Greedy Selection Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Greedy Algorithms, Frank-Wolfe and Friends
More informationthe tree till a class assignment is reached
Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal
More informationDecision Tree Learning
Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,
More informationDecision Trees. Tirgul 5
Decision Trees Tirgul 5 Using Decision Trees It could be difficult to decide which pet is right for you. We ll find a nice algorithm to help us decide what to choose without having to think about it. 2
More informationComputational Learning Theory. Definitions
Computational Learning Theory Computational learning theory is interested in theoretical analyses of the following issues. What is needed to learn effectively? Sample complexity. How many examples? Computational
More informationThe sample complexity of agnostic learning with deterministic labels
The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College
More informationGeneralization and Overfitting
Generalization and Overfitting Model Selection Maria-Florina (Nina) Balcan February 24th, 2016 PAC/SLT models for Supervised Learning Data Source Distribution D on X Learning Algorithm Expert / Oracle
More informationSupervised Learning via Decision Trees
Supervised Learning via Decision Trees Lecture 4 1 Outline 1. Learning via feature splits 2. ID3 Information gain 3. Extensions Continuous features Gain ratio Ensemble learning 2 Sequence of decisions
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 8: Boosting (and Compression Schemes) Boosting the Error If we have an efficient learning algorithm that for any distribution
More informationAgnostic Online learnability
Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes
More informationCOMS 4771 Introduction to Machine Learning. Nakul Verma
COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory Problem set 1 Due: Monday, October 10th Please send your solutions to learning-submissions@ttic.edu Notation: Input space: X Label space: Y = {±1} Sample:
More informationOnline Learning Summer School Copenhagen 2015 Lecture 1
Online Learning Summer School Copenhagen 2015 Lecture 1 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Online Learning Shai Shalev-Shwartz (Hebrew U) OLSS Lecture
More informationMachine Learning in the Data Revolution Era
Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,
More informationEmpirical Risk Minimization, Model Selection, and Model Assessment
Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,
More informationIntroduction to Statistical Learning Theory
Introduction to Statistical Learning Theory Definition Reminder: We are given m samples {(x i, y i )} m i=1 Dm and a hypothesis space H and we wish to return h H minimizing L D (h) = E[l(h(x), y)]. Problem
More informationIntroduction to Machine Learning
Introduction to Machine Learning Vapnik Chervonenkis Theory Barnabás Póczos Empirical Risk and True Risk 2 Empirical Risk Shorthand: True risk of f (deterministic): Bayes risk: Let us use the empirical
More informationDecision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag
Decision Trees Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Supervised Learning Input: labelled training data i.e., data plus desired output Assumption:
More informationIntroduction to Machine Learning
Introduction to Machine Learning 236756 Prof. Nir Ailon Lecture 4: Computational Complexity of Learning & Surrogate Losses Efficient PAC Learning Until now we were mostly worried about sample complexity
More informationGeneralization Bounds
Generalization Bounds Here we consider the problem of learning from binary labels. We assume training data D = x 1, y 1,... x N, y N with y t being one of the two values 1 or 1. We will assume that these
More informationVC Dimension Review. The purpose of this document is to review VC dimension and PAC learning for infinite hypothesis spaces.
VC Dimension Review The purpose of this document is to review VC dimension and PAC learning for infinite hypothesis spaces. Previously, in discussing PAC learning, we were trying to answer questions about
More informationA Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007
Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized
More informationQuestion of the Day. Machine Learning 2D1431. Decision Tree for PlayTennis. Outline. Lecture 4: Decision Tree Learning
Question of the Day Machine Learning 2D1431 How can you make the following equation true by drawing only one straight line? 5 + 5 + 5 = 550 Lecture 4: Decision Tree Learning Outline Decision Tree for PlayTennis
More informationLecture 3: Decision Trees
Lecture 3: Decision Trees Cognitive Systems - Machine Learning Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning last change November 26, 2014 Ute Schmid (CogSys,
More informationCS340 Machine learning Lecture 5 Learning theory cont'd. Some slides are borrowed from Stuart Russell and Thorsten Joachims
CS340 Machine learning Lecture 5 Learning theory cont'd Some slides are borrowed from Stuart Russell and Thorsten Joachims Inductive learning Simplest form: learn a function from examples f is the target
More informationhttp://imgs.xkcd.com/comics/electoral_precedent.png Statistical Learning Theory CS4780/5780 Machine Learning Fall 2012 Thorsten Joachims Cornell University Reading: Mitchell Chapter 7 (not 7.4.4 and 7.5)
More informationGeneralization bounds
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question
More informationComputational Learning Theory: Probably Approximately Correct (PAC) Learning. Machine Learning. Spring The slides are mainly from Vivek Srikumar
Computational Learning Theory: Probably Approximately Correct (PAC) Learning Machine Learning Spring 2018 The slides are mainly from Vivek Srikumar 1 This lecture: Computational Learning Theory The Theory
More informationDecision Trees. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. February 5 th, Carlos Guestrin 1
Decision Trees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University February 5 th, 2007 2005-2007 Carlos Guestrin 1 Linear separability A dataset is linearly separable iff 9 a separating
More informationIntroduction to Machine Learning
Introduction to Machine Learning PAC Learning and VC Dimension Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE
More informationLearning Theory Continued
Learning Theory Continued Machine Learning CSE446 Carlos Guestrin University of Washington May 13, 2013 1 A simple setting n Classification N data points Finite number of possible hypothesis (e.g., dec.
More informationLearning Theory. Piyush Rai. CS5350/6350: Machine Learning. September 27, (CS5350/6350) Learning Theory September 27, / 14
Learning Theory Piyush Rai CS5350/6350: Machine Learning September 27, 2011 (CS5350/6350) Learning Theory September 27, 2011 1 / 14 Why Learning Theory? We want to have theoretical guarantees about our
More informationCS 6375: Machine Learning Computational Learning Theory
CS 6375: Machine Learning Computational Learning Theory Vibhav Gogate The University of Texas at Dallas Many slides borrowed from Ray Mooney 1 Learning Theory Theoretical characterizations of Difficulty
More informationNotes on Machine Learning for and
Notes on Machine Learning for 16.410 and 16.413 (Notes adapted from Tom Mitchell and Andrew Moore.) Learning = improving with experience Improve over task T (e.g, Classification, control tasks) with respect
More informationLecture 3: Empirical Risk Minimization
Lecture 3: Empirical Risk Minimization Introduction to Learning and Analysis of Big Data Kontorovich and Sabato (BGU) Lecture 3 1 / 11 A more general approach We saw the learning algorithms Memorize and
More informationIntroduction to Machine Learning (67577)
Introduction to Machine Learning (67577) Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Deep Learning Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks
More informationUnderstanding Machine Learning A theory Perspective
Understanding Machine Learning A theory Perspective Shai Ben-David University of Waterloo MLSS at MPI Tubingen, 2017 Disclaimer Warning. This talk is NOT about how cool machine learning is. I am sure you
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU10701 11. Learning Theory Barnabás Póczos Learning Theory We have explored many ways of learning from data But How good is our classifier, really? How much data do we
More informationInstance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Outline Non-parametric approach Unsupervised: Non-parametric density estimation Parzen Windows Kn-Nearest
More informationOnline Learning, Mistake Bounds, Perceptron Algorithm
Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which
More informationMachine Learning and Data Mining. Decision Trees. Prof. Alexander Ihler
+ Machine Learning and Data Mining Decision Trees Prof. Alexander Ihler Decision trees Func-onal form f(x;µ): nested if-then-else statements Discrete features: fully expressive (any func-on) Structure:
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 7: Computational Complexity of Learning Agnostic Learning Hardness of Learning via Crypto Assumption: No poly-time algorithm
More informationMachine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017
Machine Learning Regularization and Feature Selection Fabio Vandin November 13, 2017 1 Learning Model A: learning algorithm for a machine learning task S: m i.i.d. pairs z i = (x i, y i ), i = 1,..., m,
More informationChapter 6: Classification
Chapter 6: Classification 1) Introduction Classification problem, evaluation of classifiers, prediction 2) Bayesian Classifiers Bayes classifier, naive Bayes classifier, applications 3) Linear discriminant
More information1 A Lower Bound on Sample Complexity
COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #7 Scribe: Chee Wei Tan February 25, 2008 1 A Lower Bound on Sample Complexity In the last lecture, we stopped at the lower bound on
More informationComputational Learning Theory. CS534 - Machine Learning
Computational Learning Theory CS534 Machine Learning Introduction Computational learning theory Provides a theoretical analysis of learning Shows when a learning algorithm can be expected to succeed Shows
More informationPAC Learning Introduction to Machine Learning. Matt Gormley Lecture 14 March 5, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University PAC Learning Matt Gormley Lecture 14 March 5, 2018 1 ML Big Picture Learning Paradigms:
More informationLecture 25 of 42. PAC Learning, VC Dimension, and Mistake Bounds
Lecture 25 of 42 PAC Learning, VC Dimension, and Mistake Bounds Thursday, 15 March 2007 William H. Hsu, KSU http://www.kddresearch.org/courses/spring2007/cis732 Readings: Sections 7.4.17.4.3, 7.5.17.5.3,
More informationGeneralization, Overfitting, and Model Selection
Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification MariaFlorina (Nina) Balcan 10/05/2016 Reminders Midterm Exam Mon, Oct. 10th Midterm Review Session
More informationSupport vector machines Lecture 4
Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The
More informationDecision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014
Decision Trees Machine Learning CSEP546 Carlos Guestrin University of Washington February 3, 2014 17 Linear separability n A dataset is linearly separable iff there exists a separating hyperplane: Exists
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationAn Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI
An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process
More informationDecision trees COMS 4771
Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationORIE 4741: Learning with Big Messy Data. Generalization
ORIE 4741: Learning with Big Messy Data Generalization Professor Udell Operations Research and Information Engineering Cornell September 23, 2017 1 / 21 Announcements midterm 10/5 makeup exam 10/2, by
More informationDecision Trees. Introduction. Some facts about decision trees: They represent data-classification models.
Decision Trees Introduction Some facts about decision trees: They represent data-classification models. An internal node of the tree represents a question about feature-vector attribute, and whose answer
More informationCOS 402 Machine Learning and Artificial Intelligence Fall Lecture 3: Learning Theory
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 3: Learning Theory Sanjeev Arora Elad Hazan Admin Exercise 1 due next Tue, in class Enrolment Recap We have seen: AI by introspection
More informationChapter 3: Decision Tree Learning
Chapter 3: Decision Tree Learning CS 536: Machine Learning Littman (Wu, TA) Administration Books? New web page: http://www.cs.rutgers.edu/~mlittman/courses/ml03/ schedule lecture notes assignment info.
More informationOutline. Training Examples for EnjoySport. 2 lecture slides for textbook Machine Learning, c Tom M. Mitchell, McGraw Hill, 1997
Outline Training Examples for EnjoySport Learning from examples General-to-specific ordering over hypotheses [read Chapter 2] [suggested exercises 2.2, 2.3, 2.4, 2.6] Version spaces and candidate elimination
More informationLecture 3: Decision Trees
Lecture 3: Decision Trees Cognitive Systems II - Machine Learning SS 2005 Part I: Basic Approaches of Concept Learning ID3, Information Gain, Overfitting, Pruning Lecture 3: Decision Trees p. Decision
More information1 Review of The Learning Setting
COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we
More informationTufts COMP 135: Introduction to Machine Learning
Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/ Logistic Regression Many slides attributable to: Prof. Mike Hughes Erik Sudderth (UCI) Finale Doshi-Velez (Harvard)
More informationDomain Adaptation Can Quantity Compensate for Quality?
Domain Adaptation Can Quantity Compensate for Quality? hai Ben-David David R. Cheriton chool of Computer cience University of Waterloo Waterloo, ON N2L 3G1 CANADA shai@cs.uwaterloo.ca hai halev-hwartz
More informationDecision Tree Learning Mitchell, Chapter 3. CptS 570 Machine Learning School of EECS Washington State University
Decision Tree Learning Mitchell, Chapter 3 CptS 570 Machine Learning School of EECS Washington State University Outline Decision tree representation ID3 learning algorithm Entropy and information gain
More informationLearning Theory. Machine Learning CSE546 Carlos Guestrin University of Washington. November 25, Carlos Guestrin
Learning Theory Machine Learning CSE546 Carlos Guestrin University of Washington November 25, 2013 Carlos Guestrin 2005-2013 1 What now n We have explored many ways of learning from data n But How good
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationDiscriminative v. generative
Discriminative v. generative Naive Bayes 2 Naive Bayes P (x ij,y i )= Y i P (y i ) Y j P (x ij y i ) P (y i =+)=p MLE: max P (x ij,y i ) a j,b j,p p = 1 N P [yi =+] P (x ij =1 y i = ) = a j P (x ij =1
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,
More informationComputational learning theory. PAC learning. VC dimension.
Computational learning theory. PAC learning. VC dimension. Petr Pošík Czech Technical University in Prague Faculty of Electrical Engineering Dept. of Cybernetics COLT 2 Concept...........................................................................................................
More informationDecision Trees.
. Machine Learning Decision Trees Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de
More information