Learning Theory. Aarti Singh and Barnabas Poczos. Machine Learning / Apr 17, Slides courtesy: Carlos Guestrin

1 Learning Theory
Aarti Singh and Barnabas Poczos
Machine Learning / Apr 17, 2014
Slides courtesy: Carlos Guestrin

2 Learning Theory
We have explored many ways of learning from data. But:
How good is our classifier, really?
How much data do I need to make it good enough?

3 A simple setting
Classification with m i.i.d. data points.
Finite number of possible hypotheses (e.g., decision trees of depth d).
A learner finds a hypothesis h that is consistent with the training data, i.e., it gets zero error in training: $\mathrm{error}_{\mathrm{train}}(h) = 0$.
What is the probability that h has more than $\epsilon$ true error, $\mathrm{error}_{\mathrm{true}}(h) \ge \epsilon$?
Even if h makes zero errors on the training data, it may make errors on test data.

4 How likely is a bad hypothesis to get m data points right?
Consider a bad hypothesis h, i.e., $\mathrm{error}_{\mathrm{true}}(h) \ge \epsilon$.
Probability that h gets one data point right: $\le 1 - \epsilon$
Probability that h gets m data points right: $\le (1 - \epsilon)^m$

5 How likely is a learner to pick a bad hypothesis?
Usually there are many (say k) bad hypotheses in the class: $h_1, h_2, \dots, h_k$ s.t. $\mathrm{error}_{\mathrm{true}}(h_i) \ge \epsilon$ for $i = 1, \dots, k$.
Probability that the learner picks a bad hypothesis
$\le$ Probability that some bad hypothesis is consistent with m data points
$=$ Prob($h_1$ consistent with m data points OR $h_2$ consistent with m data points OR ... OR $h_k$ consistent with m data points)
$\le$ Prob($h_1$ consistent with m data points) + Prob($h_2$ consistent with m data points) + ... + Prob($h_k$ consistent with m data points)   (union bound: loose, but works)
$\le k (1 - \epsilon)^m$

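6 How likely is a learner to pick a bad hypothesis?
Usually there are many (say k) bad hypotheses in the class: $h_1, h_2, \dots, h_k$ s.t. $\mathrm{error}_{\mathrm{true}}(h_i) \ge \epsilon$ for $i = 1, \dots, k$.
Probability that the learner picks a bad hypothesis
$\le k (1 - \epsilon)^m \le |H| (1 - \epsilon)^m \le |H| e^{-\epsilon m}$
Here $|H|$ is the size of the hypothesis class. The second step uses $k \le |H|$, and the last step uses $1 - \epsilon \le e^{-\epsilon}$.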
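The chain of inequalities on this slide can be checked numerically. The following is a minimal Python sketch; the function name `consistent_bad_hypothesis_bounds` and the example values of |H|, ε, and m are illustrative assumptions, not part of the original slides.

```python
import math


def consistent_bad_hypothesis_bounds(num_hypotheses, epsilon, m):
    """Return the two upper bounds from the slide as a pair.

    First value:  |H| * (1 - epsilon)^m   (union bound over at most |H| bad hypotheses)
    Second value: |H| * exp(-epsilon * m) (looser, but easier to invert for m)
    """
    union_bound = num_hypotheses * (1.0 - epsilon) ** m
    exponential_bound = num_hypotheses * math.exp(-epsilon * m)
    return union_bound, exponential_bound


if __name__ == "__main__":
    # Hypothetical setting: 1000 hypotheses, epsilon = 0.05.
    for m in (10, 100, 1000):
        union_bound, exponential_bound = consistent_bad_hypothesis_bounds(1000, 0.05, m)
        print(m, union_bound, exponential_bound)
```

For small m both values exceed 1 and say nothing; the bound only becomes informative once m is large enough for the exponential decay to overcome the factor |H|.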

7 PAC (Probably Approximately Correct) bound
Theorem [Haussler 88]: Hypothesis space H finite, dataset D with m i.i.d. samples, $0 < \epsilon < 1$: for any learned hypothesis h that is consistent on the training data:
$P(\mathrm{error}_{\mathrm{true}}(h) \ge \epsilon) \le |H| e^{-m\epsilon} \le \delta$
Equivalently, with probability $\ge 1 - \delta$: $\mathrm{error}_{\mathrm{true}}(h) \le \epsilon$
Important: the PAC bound holds for all h, but doesn't guarantee that the algorithm finds the best h!

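8 Using a PAC bound
$|H| e^{-m\epsilon} \le \delta$
Given $\epsilon$ and $\delta$, yields sample complexity: #training data, $m \ge \frac{\ln|H| + \ln\frac{1}{\delta}}{\epsilon}$
Given m and $\delta$, yields error bound: error, $\epsilon \ge \frac{\ln|H| + \ln\frac{1}{\delta}}{m}$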
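Both uses of the bound amount to solving $|H| e^{-m\epsilon} \le \delta$ for one unknown. A small Python sketch follows; the function names `haussler_sample_size` and `haussler_error_bound` and the example numbers are hypothetical, chosen only for illustration.

```python
import math


def haussler_sample_size(num_hypotheses, epsilon, delta):
    """Smallest integer m with m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((math.log(num_hypotheses) + math.log(1.0 / delta)) / epsilon)


def haussler_error_bound(num_hypotheses, m, delta):
    """Smallest epsilon guaranteed by the bound: (ln|H| + ln(1/delta)) / m."""
    return (math.log(num_hypotheses) + math.log(1.0 / delta)) / m


if __name__ == "__main__":
    # Hypothetical setting: |H| = 1000, epsilon = 0.1, delta = 0.05.
    m = haussler_sample_size(1000, 0.1, 0.05)
    print("samples needed:", m)
    print("error bound at that m:", haussler_error_bound(1000, m, 0.05))
```

Note that m grows only logarithmically in |H| and in 1/δ, but linearly in 1/ε.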

9 Limitations of Haussler 88 bound
Consistent classifier: requires h with zero error in training, $\mathrm{error}_{\mathrm{train}}(h) = 0$.
Dependence on size of hypothesis space: $m \ge \frac{\ln|H| + \ln\frac{1}{\delta}}{\epsilon}$. What if $|H|$ is too big or H is continuous?

10 What if our classifier does not have zero error on the training data?
A learner with zero training errors may make mistakes on the test set.
What about a learner with $\mathrm{error}_{\mathrm{train}}(h) \ne 0$ on the training set?
The error of a hypothesis is like estimating the parameter of a coin!
$\mathrm{error}_{\mathrm{true}}(h) := P(h(X) \ne Y) \equiv P(H = 1) =: \theta$
$\mathrm{error}_{\mathrm{train}}(h) := \frac{1}{m} \sum_i \mathbf{1}_{h(X_i) \ne Y_i} \equiv \frac{1}{m} \sum_i Z_i =: \hat{\theta}$

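11 Hoeffding's bound for a single hypothesis
Consider m i.i.d. flips $x_1, \dots, x_m$, where $x_i \in \{0, 1\}$, of a coin with parameter $\theta$, and let $\hat{\theta} = \frac{1}{m} \sum_i x_i$. For $0 < \epsilon < 1$:
$P(|\hat{\theta} - \theta| \ge \epsilon) \le 2 e^{-2m\epsilon^2}$
For a single hypothesis h:
$P(|\mathrm{error}_{\mathrm{true}}(h) - \mathrm{error}_{\mathrm{train}}(h)| \ge \epsilon) \le 2 e^{-2m\epsilon^2}$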
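The coin-flip form of the bound is easy to compare against simulation. The sketch below is illustrative only: the function names, the coin parameter 0.3, and the trial count are assumptions made here, and the simulation uses Python's standard `random` module with a fixed seed.

```python
import math
import random


def hoeffding_bound(m, epsilon):
    """Right-hand side of the bound: 2 * exp(-2 * m * epsilon^2)."""
    return 2.0 * math.exp(-2.0 * m * epsilon ** 2)


def empirical_deviation_rate(theta, m, epsilon, trials, seed=0):
    """Fraction of simulated experiments where |theta_hat - theta| >= epsilon.

    Each experiment flips a coin with parameter theta m times and
    estimates theta by the fraction of heads.
    """
    rng = random.Random(seed)
    bad_experiments = 0
    for _ in range(trials):
        heads = sum(1 for _ in range(m) if rng.random() < theta)
        if abs(heads / m - theta) >= epsilon:
            bad_experiments += 1
    return bad_experiments / trials


if __name__ == "__main__":
    m, epsilon = 100, 0.1
    print("Hoeffding bound:", hoeffding_bound(m, epsilon))
    print("simulated rate: ", empirical_deviation_rate(0.3, m, epsilon, trials=2000))
```

The simulated rate is typically well below the bound, since Hoeffding's inequality holds for every value of θ and is not tight for any particular one.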

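12 PAC bound for |H| hypotheses
For each hypothesis $h_i$:
$P(|\mathrm{error}_{\mathrm{true}}(h_i) - \mathrm{error}_{\mathrm{train}}(h_i)| \ge \epsilon) \le 2 e^{-2m\epsilon^2}$
What if we are comparing |H| hypotheses? Union bound.
Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, $0 < \epsilon < 1$: for any learned hypothesis $h \in H$:
$P(|\mathrm{error}_{\mathrm{true}}(h) - \mathrm{error}_{\mathrm{train}}(h)| \ge \epsilon) \le 2 |H| e^{-2m\epsilon^2} \le \delta$
Important: the PAC bound holds for all h, but doesn't guarantee that the algorithm finds the best h!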

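13 PAC bound and Bias-Variance tradeoff
$P(|\mathrm{error}_{\mathrm{true}}(h) - \mathrm{error}_{\mathrm{train}}(h)| \ge \epsilon) \le 2 |H| e^{-2m\epsilon^2} \le \delta$
Equivalently, with probability $\ge 1 - \delta$:
$\mathrm{error}_{\mathrm{true}}(h) \le \mathrm{error}_{\mathrm{train}}(h) + \sqrt{\frac{\ln|H| + \ln\frac{2}{\delta}}{2m}}$
Fixed m:
Hypothesis space    error_train(h)    square-root term
complex             small             large
simple              large             small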
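The square-root term comes from setting $2|H|e^{-2m\epsilon^2} = \delta$ and solving for $\epsilon$. A minimal Python sketch of that term follows; the name `generalization_gap` and the example values are assumptions for illustration.

```python
import math


def generalization_gap(num_hypotheses, m, delta):
    """Square-root term of the bound: sqrt((ln|H| + ln(2/delta)) / (2m))."""
    return math.sqrt((math.log(num_hypotheses) + math.log(2.0 / delta)) / (2.0 * m))


if __name__ == "__main__":
    # Fixed m: a larger hypothesis space widens the gap between training and true error.
    m, delta = 1000, 0.05
    for num_hypotheses in (10, 1000, 10**6, 10**12):
        print(num_hypotheses, generalization_gap(num_hypotheses, m, delta))
```

With m fixed, the term grows with |H|, which is the "variance" side of the tradeoff; the training error, the "bias" side, generally shrinks as the hypothesis space grows.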

14 What about the size of the hypothesis space?
$2 |H| e^{-2m\epsilon^2} \le \delta$
Sample complexity: $m \ge \frac{1}{2\epsilon^2} \left( \ln|H| + \ln\frac{2}{\delta} \right)$
How large is the hypothesis space?

15 Number of decision trees of depth k
Recursive solution: given n attributes,
$H_k$ = number of decision trees of depth k
$H_0 = 2$
$H_k$ = (#choices of root attribute) * (#possible left subtrees) * (#possible right subtrees) $= n \cdot H_{k-1} \cdot H_{k-1}$
Write $L_k = \log_2 H_k$:
$L_0 = 1$
$L_k = \log_2 n + 2 L_{k-1} = \log_2 n + 2(\log_2 n + 2 L_{k-2}) = \log_2 n + 2 \log_2 n + 2^2 \log_2 n + \dots + 2^{k-1} (\log_2 n + 2 L_0)$
So $L_k = (2^k - 1)(1 + \log_2 n) + 1$
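The closed form can be checked by unrolling the recursion $L_k = \log_2 n + 2 L_{k-1}$ directly. The following Python sketch does this; the function names are hypothetical and the attribute counts are example values.

```python
import math


def log2_num_trees_recursive(depth, n):
    """Unroll L_k = log2(n) + 2 * L_{k-1}, starting from L_0 = 1."""
    log2_count = 1.0  # L_0 = log2(H_0) = log2(2)
    for _ in range(depth):
        log2_count = math.log2(n) + 2.0 * log2_count
    return log2_count


def log2_num_trees_closed_form(depth, n):
    """Closed form from the slide: L_k = (2^k - 1) * (1 + log2(n)) + 1."""
    return (2 ** depth - 1) * (1.0 + math.log2(n)) + 1.0


if __name__ == "__main__":
    n = 8  # hypothetical number of attributes
    for depth in range(6):
        print(depth, log2_num_trees_recursive(depth, n), log2_num_trees_closed_form(depth, n))
```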

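16 PAC bound for decision trees of depth k
$m \ge \frac{1}{2\epsilon^2} \left( \left[ (2^k - 1)(1 + \log_2 n) + 1 \right] \ln 2 + \ln\frac{2}{\delta} \right)$
Bad!!! Number of points is exponential in depth k!
But, for m data points, the decision tree can't get too big: the number of leaves is never more than the number of data points.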
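To see the exponential growth concretely, one can plug $\ln|H_k| = \left[(2^k - 1)(1 + \log_2 n) + 1\right] \ln 2$ into the sample complexity $m \ge \frac{1}{2\epsilon^2}(\ln|H| + \ln\frac{2}{\delta})$. This Python sketch assumes that substitution; the function name `depth_k_sample_size` and the values of n, ε, and δ are illustrative.

```python
import math


def depth_k_sample_size(depth, n, epsilon, delta):
    """Samples sufficient for depth-k trees over n attributes under the PAC bound."""
    log2_h = (2 ** depth - 1) * (1.0 + math.log2(n)) + 1.0
    ln_h = log2_h * math.log(2.0)  # convert log base 2 to natural log
    return math.ceil((ln_h + math.log(2.0 / delta)) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    # Hypothetical setting: 8 attributes, epsilon = 0.1, delta = 0.05.
    for depth in range(1, 11):
        print(depth, depth_k_sample_size(depth, 8, 0.1, 0.05))
```

Each extra level of depth roughly doubles the required number of samples once the $2^k$ term dominates.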

17 Number of decision trees with k leaves
$H_k$ = number of decision trees with k leaves
$H_1 = 2$
$H_k$ = (#choices of root attribute) * [(#left subtrees with 1 leaf) * (#right subtrees with k-1 leaves) + (#left subtrees with 2 leaves) * (#right subtrees with k-2 leaves) + ... + (#left subtrees with k-1 leaves) * (#right subtrees with 1 leaf)]
$H_k = n \sum_{i=1}^{k-1} H_i H_{k-i} = n^{k-1} C_{k-1}$   ($C_{k-1}$: Catalan number)
Loose bound (using Stirling's approximation): $H_k \le n^{k-1} 2^{2k-1}$
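The Catalan identity can be verified by running the recursion. The sketch below is written under an explicit assumption: it takes the base case as 1, i.e. it counts tree structures without leaf labels, because that is the base case for which $n^{k-1} C_{k-1}$ holds exactly. With the slide's $H_1 = 2$, each count picks up an extra factor of $2^k$ for the leaf labels. The function names are hypothetical.

```python
from math import comb


def count_tree_structures(k, n):
    """Run H_k = n * sum_{i=1}^{k-1} H_i * H_{k-i} with base case H_1 = 1.

    The base case of 1 is an assumption made here: it counts unlabeled
    tree structures over n attributes, ignoring the labels at the leaves.
    """
    counts = [0, 1]  # counts[j] holds H_j; index 0 is unused
    for leaves in range(2, k + 1):
        total = sum(counts[i] * counts[leaves - i] for i in range(1, leaves))
        counts.append(n * total)
    return counts[k]


def catalan(j):
    """j-th Catalan number, C_j = binom(2j, j) / (j + 1)."""
    return comb(2 * j, j) // (j + 1)


if __name__ == "__main__":
    n = 3  # hypothetical number of attributes
    for k in range(1, 9):
        exact = count_tree_structures(k, n)
        loose = n ** (k - 1) * 2 ** (2 * k - 1)
        print(k, exact, n ** (k - 1) * catalan(k - 1), loose)
```

The loose bound follows from $C_{k-1} \le 4^{k-1}$, which keeps the logarithm of the count linear in k.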

18 Number of decision trees
With k leaves: $\log_2 H_k \le (k - 1) \log_2 n + 2k - 1$, linear in k, so the number of points m is linear in #leaves.
With depth k: $\log_2 H_k = (2^k - 1)(1 + \log_2 n) + 1$, exponential in k, so the number of points m is exponential in depth.

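19 PAC bound for decision trees with k leaves: Bias-Variance revisited
With probability $\ge 1 - \delta$:
$\mathrm{error}_{\mathrm{true}}(h) \le \mathrm{error}_{\mathrm{train}}(h) + \sqrt{\frac{\ln|H| + \ln\frac{2}{\delta}}{2m}}$
With $H_k \le n^{k-1} 2^{2k-1}$, we get
$\mathrm{error}_{\mathrm{true}}(h) \le \mathrm{error}_{\mathrm{train}}(h) + \sqrt{\frac{(k-1) \ln n + (2k-1) \ln 2 + \ln\frac{2}{\delta}}{2m}}$
Number of leaves    error_train(h)    square-root term
k = m               0                 large (~ > 1/2)
k < m               > 0               small (~ < 1/2)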
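The two rows of the table can be reproduced by evaluating the square-root term for different numbers of leaves. This is a minimal Python sketch; the function name `leaves_gap` and the values of n, m, and δ are assumptions chosen for illustration.

```python
import math


def leaves_gap(k, n, m, delta):
    """Square-root term for trees with k leaves over n attributes and m samples."""
    ln_h = (k - 1) * math.log(n) + (2 * k - 1) * math.log(2.0)
    return math.sqrt((ln_h + math.log(2.0 / delta)) / (2.0 * m))


if __name__ == "__main__":
    # Hypothetical setting: 10 attributes, 1000 training points, delta = 0.05.
    n, m, delta = 10, 1000, 0.05
    for k in (2, 10, 100, 1000):
        print(k, leaves_gap(k, n, m, delta))
```

When k equals m the term exceeds 1/2, so the bound is vacuous even though the training error can be zero; for k much smaller than m the term is small, at the price of nonzero training error.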

20 What did we learn from decision trees?
Bias-Variance tradeoff formalized:
$\mathrm{error}_{\mathrm{true}}(h) \le \mathrm{error}_{\mathrm{train}}(h) + \sqrt{\frac{(k-1) \ln n + (2k-1) \ln 2 + \ln\frac{2}{\delta}}{2m}}$
Moral of the story: complexity of learning is not measured in terms of the size of the hypothesis space, but in the maximum number of points that allows consistent classification.
Complexity m: no bias, lots of variance.
Lower than m: some bias, less variance.

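21 What about continuous hypothesis spaces?
$\mathrm{error}_{\mathrm{true}}(h) \le \mathrm{error}_{\mathrm{train}}(h) + \sqrt{\frac{\ln|H| + \ln\frac{2}{\delta}}{2m}}$
Continuous hypothesis space: $|H| = \infty$. Infinite variance???
As with decision trees, we only care about the maximum number of points that can be classified exactly!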
