Learning Theory
Aarti Singh and Barnabas Poczos
Machine Learning 10-701/15-781, Apr 17, 2014
Slides courtesy: Carlos Guestrin
Learning Theory
We have explored many ways of learning from data. But:
How good is our classifier, really?
How much data do I need to make it good enough?
A simple setting
Classification with m i.i.d. data points.
Finite number of possible hypotheses (e.g., decision trees of depth d).
A learner finds a hypothesis h that is consistent with the training data, i.e., gets zero error in training: error_train(h) = 0.
What is the probability that h has more than ε true error, error_true(h) ≥ ε?
Even if h makes zero errors on the training data, it may make errors on test data.
How likely is a bad hypothesis to get m data points right?
Consider a bad hypothesis h, i.e., error_true(h) ≥ ε.
Probability that h gets one data point right: ≤ 1 − ε.
Probability that h gets m data points right: ≤ (1 − ε)^m.
How likely is the learner to pick a bad hypothesis?
Usually there are many (say k) bad hypotheses in the class: h_1, h_2, ..., h_k s.t. error_true(h_i) ≥ ε for i = 1, ..., k.
Probability that the learner picks a bad hypothesis
= Probability that some bad hypothesis is consistent with the m data points
= Prob(h_1 consistent with m data points OR h_2 consistent with m data points OR ... OR h_k consistent with m data points)
≤ Prob(h_1 consistent with m data points) + Prob(h_2 consistent with m data points) + ... + Prob(h_k consistent with m data points)   [union bound: loose, but works]
≤ k (1 − ε)^m.
How likely is the learner to pick a bad hypothesis?
With k bad hypotheses h_1, ..., h_k s.t. error_true(h_i) ≥ ε:
Probability that the learner picks a bad hypothesis ≤ k (1 − ε)^m ≤ |H| (1 − ε)^m ≤ |H| e^{−εm},
where |H| is the size of the hypothesis class (so k ≤ |H|), and the last step uses 1 − ε ≤ e^{−ε}.
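A quick numeric illustration of how fast this bound shrinks with m; the values of |H| and ε below are made-up examples, not from the slides. A minimal Python sketch:

```python
import math

# Illustrative values (assumed, not from the slides).
H_size = 10**6   # size of the hypothesis class |H|
eps = 0.1        # threshold on the true error of a "bad" hypothesis

for m in [50, 100, 200, 400]:
    union_bound = H_size * (1 - eps) ** m       # |H| (1 - eps)^m
    relaxed     = H_size * math.exp(-eps * m)   # |H| e^{-eps m}, using 1 - eps <= e^{-eps}
    print(f"m = {m:3d}   |H|(1-eps)^m = {union_bound:.3g}   |H| e^(-eps m) = {relaxed:.3g}")
```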
PAC (Probably Approximately Correct) bound
Theorem [Haussler '88]: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent on the training data,
P(error_true(h) > ε) ≤ |H| e^{−εm} ≤ δ.
Equivalently, with probability ≥ 1 − δ:  error_true(h) ≤ (ln|H| + ln(1/δ)) / m.
Important: the PAC bound holds for all consistent h, but it doesn't guarantee that the algorithm finds the best h!
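The equivalence between the two forms is just algebra; spelling out the step the slide skips (set the failure probability to δ and solve):

\[
|H|\, e^{-\epsilon m} \le \delta
\;\Longleftrightarrow\;
m \ge \frac{1}{\epsilon}\Bigl(\ln|H| + \ln\tfrac{1}{\delta}\Bigr)
\;\Longleftrightarrow\;
\epsilon \ge \frac{1}{m}\Bigl(\ln|H| + \ln\tfrac{1}{\delta}\Bigr).
\]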
Using a PAC bound
Given ε and δ, yields the sample complexity (# training data):  m ≥ (ln|H| + ln(1/δ)) / ε.
Given m and δ, yields the error bound:  error_true(h) ≤ (ln|H| + ln(1/δ)) / m.
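As a quick sanity check, both formulas are easy to evaluate; this is a minimal sketch with made-up values of |H|, ε, δ and m (not from the slides):

```python
import math

def sample_complexity(H_size, eps, delta):
    """m >= (ln|H| + ln(1/delta)) / eps  -- Haussler-style bound, consistent learner."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def error_bound(H_size, m, delta):
    """error_true(h) <= (ln|H| + ln(1/delta)) / m  for an h consistent with m samples."""
    return (math.log(H_size) + math.log(1 / delta)) / m

# Illustrative numbers (assumed, not from the slides).
print(sample_complexity(H_size=10**6, eps=0.1, delta=0.05))  # about 169 samples
print(error_bound(H_size=10**6, m=1000, delta=0.05))         # about 0.017
```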
Limitations of the Haussler '88 bound
Requires a consistent classifier h, i.e., zero error in training: error_train(h) = 0.
Dependence on the size of the hypothesis space:  m ≥ (ln|H| + ln(1/δ)) / ε.
What if H is too big, or H is continuous?
What if our classifier does not have zero error on the training data?
A learner with zero training error may still make mistakes on the test set. What about a learner with error_train(h) > 0 on the training set?
The error of a hypothesis is like estimating the parameter of a coin!
error_true(h) := P(h(X) ≠ Y)   is the analogue of the coin's bias   θ := P(Z = 1),
error_train(h) := (1/m) Σ_i 1[h(x_i) ≠ y_i]   is the analogue of the empirical estimate   θ̂ := (1/m) Σ_i Z_i.
Hoeffding's bound for a single hypothesis
Consider m i.i.d. flips Z_1, ..., Z_m, where Z_i ∈ {0,1}, of a coin with parameter θ. For 0 < ε < 1:
P(|θ − θ̂| > ε) ≤ 2 e^{−2mε²}.
For a single hypothesis h:
P(|error_true(h) − error_train(h)| > ε) ≤ 2 e^{−2mε²}.
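Hoeffding's inequality is easy to check by simulation; this sketch draws many batches of m coin flips and compares the empirical deviation probability to the 2e^{−2mε²} bound (θ, m, ε and the number of trials are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, m, eps, trials = 0.3, 100, 0.1, 100_000

flips = rng.random((trials, m)) < theta           # trials x m Bernoulli(theta) coin flips
theta_hat = flips.mean(axis=1)                    # empirical estimate per trial
empirical = np.mean(np.abs(theta_hat - theta) > eps)
hoeffding = 2 * np.exp(-2 * m * eps**2)           # 2 e^{-2 m eps^2}

print(f"P(|theta_hat - theta| > eps) ~ {empirical:.4f}   Hoeffding bound: {hoeffding:.4f}")
```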
PAC bound for |H| hypotheses
For each hypothesis h_i:  P(|error_true(h_i) − error_train(h_i)| > ε) ≤ 2 e^{−2mε²}.
What if we are comparing |H| hypotheses? Union bound again.
Theorem: Hypothesis space H finite, dataset D with m i.i.d. samples, 0 < ε < 1: for any learned hypothesis h ∈ H,
P(|error_true(h) − error_train(h)| > ε) ≤ 2 |H| e^{−2mε²} ≤ δ.
Important: the PAC bound holds for all h, but it doesn't guarantee that the algorithm finds the best h!
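The step from this theorem to the bound on the next slide is again just solving for ε; making it explicit:

\[
2|H|\, e^{-2m\epsilon^2} \le \delta
\;\Longleftrightarrow\;
\epsilon \ge \sqrt{\frac{\ln|H| + \ln\frac{2}{\delta}}{2m}},
\]

so with probability ≥ 1 − δ, every h ∈ H satisfies |error_true(h) − error_train(h)| ≤ sqrt( (ln|H| + ln(2/δ)) / (2m) ).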
PAC bound and bias-variance tradeoff
2 |H| e^{−2mε²} ≤ δ. Equivalently, with probability ≥ 1 − δ:
error_true(h) ≤ error_train(h) + sqrt( (ln|H| + ln(2/δ)) / (2m) ).
For fixed m:
complex hypothesis space → error_train small, complexity term large;
simple hypothesis space → error_train large, complexity term small.
What about the size of the hypothesis space?
Sample complexity from 2 |H| e^{−2mε²} ≤ δ:  m ≥ (ln|H| + ln(2/δ)) / (2ε²).
How large is the hypothesis space?
Number of decision trees of depth k
Recursive solution, given n attributes:
H_k = number of decision trees of depth k.
H_0 = 2.
H_k = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees) = n · H_{k−1} · H_{k−1} = n · H_{k−1}².
Write L_k = log₂ H_k. Then L_0 = 1 and
L_k = log₂ n + 2 L_{k−1} = log₂ n + 2(log₂ n + 2 L_{k−2}) = log₂ n + 2 log₂ n + 2² log₂ n + ... + 2^{k−1}(log₂ n + 2 L_0).
So L_k = (2^k − 1)(1 + log₂ n) + 1.
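A short sketch (assumed, not from the slides) that unrolls the recursion H_k = n · H_{k−1}² and checks it against the closed form for L_k:

```python
import math

def num_trees_of_depth(n, k):
    """H_k = n * H_{k-1}^2 with H_0 = 2 (a depth-0 tree is a single +/- leaf)."""
    H = 2
    for _ in range(k):
        H = n * H * H
    return H

# Check the closed form L_k = log2(H_k) = (2^k - 1)(1 + log2 n) + 1 (illustrative n).
n = 4
for k in range(5):
    L_recursive = math.log2(num_trees_of_depth(n, k))
    L_closed = (2**k - 1) * (1 + math.log2(n)) + 1
    print(f"k = {k}   L_k (recursion) = {L_recursive:.1f}   L_k (closed form) = {L_closed:.1f}")
```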
PAC bound for decision trees of depth k
m ≥ (ln|H_k| + ln(2/δ)) / (2ε²),  with  ln|H_k| = L_k ln 2 = ((2^k − 1)(1 + log₂ n) + 1) ln 2.
Bad!!! The number of points needed is exponential in the depth k!
But for m data points, a decision tree can't get too big: the number of leaves is never more than the number of data points.
Number of decision trees with k leaves
H_k = number of decision trees with k leaves.
H_1 = 2.
H_k = (# choices of root attribute) × [ (# left subtrees with 1 leaf) × (# right subtrees with k−1 leaves) + (# left subtrees with 2 leaves) × (# right subtrees with k−2 leaves) + ... + (# left subtrees with k−1 leaves) × (# right subtrees with 1 leaf) ]
    = n Σ_{i=1}^{k−1} H_i H_{k−i} = n^{k−1} C_{k−1}   (C_{k−1}: Catalan number).
Loose bound (using Stirling's approximation): H_k ≤ n^{k−1} 2^{2k−1}.
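These counts are easy to compute directly; this sketch (illustrative n, following the slide's count H_k = n^{k−1} C_{k−1}) compares log₂ H_k to the loose bound (k−1) log₂ n + 2k − 1:

```python
import math

def catalan(j):
    """Catalan number C_j = binom(2j, j) / (j + 1)."""
    return math.comb(2 * j, j) // (j + 1)

def log2_num_trees_with_leaves(n, k):
    """log2 of H_k = n^(k-1) * C_{k-1}, the count used on the slide."""
    return (k - 1) * math.log2(n) + math.log2(catalan(k - 1))

n = 10  # illustrative number of attributes
for k in [2, 5, 10, 20]:
    exact = log2_num_trees_with_leaves(n, k)
    bound = (k - 1) * math.log2(n) + 2 * k - 1
    print(f"k = {k:2d}   log2 H_k = {exact:6.1f}   loose bound = {bound:6.1f}")
```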
Number of decision trees
With k leaves: log₂ H_k ≤ (k − 1) log₂ n + 2k − 1, linear in k, so the number of points m needed is linear in the number of leaves.
With depth k: log₂ H_k = (2^k − 1)(1 + log₂ n) + 1, exponential in k, so the number of points m needed is exponential in the depth.
PAC bound for decision trees with k leaves: bias-variance revisited
With probability ≥ 1 − δ:
error_true(h) ≤ error_train(h) + sqrt( (ln|H_k| + ln(2/δ)) / (2m) ).
With H_k ≤ n^{k−1} 2^{2k−1}, we get
error_true(h) ≤ error_train(h) + sqrt( ((k − 1) ln n + (2k − 1) ln 2 + ln(2/δ)) / (2m) ).
k = m: error_train(h) = 0, but the complexity term is large (roughly > ½).
k < m: error_train(h) > 0, but the complexity term is small (roughly < ½).
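To see the tradeoff numerically, the complexity term can be evaluated as a function of k; n, m and δ below are illustrative choices, not from the slides:

```python
import math

def complexity_term(n, k, m, delta):
    """sqrt(((k-1) ln n + (2k-1) ln 2 + ln(2/delta)) / (2m)) -- the 'variance' penalty."""
    num = (k - 1) * math.log(n) + (2 * k - 1) * math.log(2) + math.log(2 / delta)
    return math.sqrt(num / (2 * m))

# Illustrative setting (assumed): n = 10 attributes, m = 1000 points, delta = 0.05.
n, m, delta = 10, 1000, 0.05
for k in [10, 100, 1000]:
    print(f"k = {k:4d}   complexity term = {complexity_term(n, k, m, delta):.2f}")
# k = m = 1000 gives a penalty above 1/2 (vacuous even though error_train = 0),
# while smaller k keeps the penalty small at the cost of some training error.
```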
What did we learn from decision trees?
The bias-variance tradeoff, formalized:
error_true(h) ≤ error_train(h) + sqrt( ((k − 1) ln n + (2k − 1) ln 2 + ln(2/δ)) / (2m) ).
Moral of the story: the complexity of learning is not measured in terms of the size of the hypothesis space, but in terms of the maximum number of points that allows consistent classification.
Complexity ≈ m: no bias, lots of variance. Complexity lower than m: some bias, less variance.
What about continuous hypothesis spaces?
Continuous hypothesis space: |H| = ∞. Infinite variance???
As with decision trees, we only care about the maximum number of points that can be classified exactly!