Learning Theory Continued



Machine Learning CSE446
Carlos Guestrin, University of Washington
May 13

A simple setting
- Classification
  - N data points
  - Finite number of possible hypotheses (e.g., decision trees of depth d)
- A learner finds a hypothesis h that is consistent with the training data
  - Gets zero training error: $\text{error}_{\text{train}}(h) = 0$
- What is the probability that h has more than ε true error?
  - $\text{error}_{\text{true}}(h) \ge \epsilon$

Generalization error in finite hypothesis spaces [Haussler '88]
- Theorem: Hypothesis space H finite, dataset D with N i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent on the training data:
  $$P(\text{error}_{\text{true}}(h) > \epsilon) \le |H| e^{-N\epsilon}$$

Limitations of the Haussler '88 bound
$$P(\text{error}_{\text{true}}(h) > \epsilon) \le |H| e^{-N\epsilon}$$
- Requires a consistent classifier
- Depends on the size of the hypothesis space
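To get a feel for the Haussler bound, the sketch below evaluates $|H| e^{-N\epsilon}$ and solves it for N by setting the bound to at most δ, which gives $N \ge (\ln|H| + \ln(1/\delta))/\epsilon$. The function names and the example values (|H| = 1000, ε = 0.1, δ = 0.05) are illustrative assumptions, not part of the slides.

```python
import math

def haussler_bound(h_size, n, eps):
    """Upper bound |H| * exp(-N * eps) on P(error_true(h) > eps) for a consistent h."""
    return h_size * math.exp(-n * eps)

def samples_needed(h_size, eps, delta):
    """Smallest integer N with |H| * exp(-N * eps) <= delta."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Hypothetical example values chosen only for illustration.
n = samples_needed(h_size=1000, eps=0.1, delta=0.05)
print("N needed:", n)
print("bound at that N:", haussler_bound(1000, n, 0.1))
```

Note that the required N grows only logarithmically in |H| and in 1/δ, but linearly in 1/ε.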

What if our classifier does not have zero error on the training data?
- A learner with zero training errors may still make mistakes on the test set
- What about a learner with $\text{error}_{\text{train}}(h)$ on the training set?

Generalization bound for |H| hypotheses
- Theorem: Hypothesis space H finite, dataset D with N i.i.d. samples, 0 < ε < 1: for any learned hypothesis h:
  $$P(\text{error}_{\text{true}}(h) - \text{error}_{\text{train}}(h) > \epsilon) \le |H| e^{-2N\epsilon^2}$$

PAC bound and Bias-Variance tradeoff
$$P(\text{error}_{\text{true}}(h) - \text{error}_{\text{train}}(h) > \epsilon) \le |H| e^{-2N\epsilon^2}$$
or, after moving some terms around, with probability at least 1−δ:
$$\text{error}_{\text{true}}(h) \le \text{error}_{\text{train}}(h) + \sqrt{\frac{\ln|H| + \ln\frac{1}{\delta}}{2N}}$$
- Important: the PAC bound holds for all h, but doesn't guarantee that the algorithm finds the best h!!!

What about the size of the hypothesis space?
$$N \ge \frac{\ln|H| + \ln\frac{1}{\delta}}{2\epsilon^2}$$
- How large is the hypothesis space?
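The two forms of the PAC bound above can be checked numerically. The sketch below computes the gap term $\sqrt{(\ln|H| + \ln(1/\delta))/(2N)}$ and the corresponding sample size $N \ge (\ln|H| + \ln(1/\delta))/(2\epsilon^2)$. The function names and example values are assumptions made for illustration.

```python
import math

def pac_gap(h_size, n, delta):
    """Bound on error_true(h) - error_train(h) that holds with probability >= 1 - delta."""
    return math.sqrt((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * n))

def pac_samples(h_size, eps, delta):
    """Smallest integer N for which the gap term is at most eps."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / (2.0 * eps ** 2))

# Hypothetical example values chosen only for illustration.
n = pac_samples(h_size=1000, eps=0.1, delta=0.05)
print("N needed:", n)
print("gap at that N:", pac_gap(1000, n, 0.05))
```

Compared with the consistent-learner case, the dependence on ε is now 1/ε² rather than 1/ε.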

Boolean formulas with m binary features
$$N \ge \frac{\ln|H| + \ln\frac{1}{\delta}}{2\epsilon^2}$$

Number of decision trees of depth k
$$N \ge \frac{\ln|H| + \ln\frac{1}{\delta}}{2\epsilon^2}$$
Recursive solution, given m attributes:
- $H_k$ = number of decision trees of depth k
- $H_0 = 2$
- $H_{k+1}$ = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees) = $m \cdot H_k \cdot H_k$

Write $L_k = \log_2 H_k$:
- $L_0 = 1$
- $L_{k+1} = \log_2 m + 2L_k$
- So $L_k = (2^k - 1)(1 + \log_2 m) + 1$
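The recurrence for the number of depth-k trees can be checked against a closed form by direct computation. The sketch below iterates $H_{k+1} = m \cdot H_k \cdot H_k$ from $H_0 = 2$ and compares $\log_2 H_k$ with $(2^k - 1)(1 + \log_2 m) + 1$; the additive 1 is what makes the closed form satisfy $L_0 = 1$. The function names and the choice m = 4 are illustrative assumptions.

```python
import math

def count_trees(k, m):
    """H_k from the recurrence H_0 = 2, H_{k+1} = m * H_k * H_k."""
    h = 2
    for _ in range(k):
        h = m * h * h
    return h

def log2_trees_closed_form(k, m):
    """Closed form L_k = (2^k - 1)(1 + log2 m) + 1 for log2 of H_k."""
    return (2 ** k - 1) * (1 + math.log2(m)) + 1

# m = 4 is a hypothetical number of attributes used only for this check.
for k in range(5):
    print(k, math.log2(count_trees(k, 4)), log2_trees_closed_form(k, 4))
```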

PAC bound for decision trees of depth k
$$N \ge \frac{2^k \log m + \ln\frac{1}{\delta}}{2\epsilon^2}$$
- Bad!!! Number of points is exponential in depth!
- But, for N data points, the decision tree can't get too big
  - Number of leaves is never more than the number of data points

Number of Decision Trees with k Leaves
- Number of decision trees of depth k is really, really big:
  - $\ln|H|$ is about $2^k \log m$
- Decision trees with up to k leaves:
  - $|H|$ is about $m^k k^{2k}$
  - A very loose bound

PAC bound for decision trees with k leaves: Bias-Variance revisited
$$\ln|H_{\text{DTs } k \text{ leaves}}| \le 2k(\ln m + \ln k)$$
$$\text{error}_{\text{true}}(h) \le \text{error}_{\text{train}}(h) + \sqrt{\frac{\ln|H| + \ln\frac{1}{\delta}}{2N}}$$
$$\text{error}_{\text{true}}(h) \le \text{error}_{\text{train}}(h) + \sqrt{\frac{2k(\ln m + \ln k) + \ln\frac{1}{\delta}}{2N}}$$

What did we learn from decision trees?
- Bias-Variance tradeoff formalized:
  $$\text{error}_{\text{true}}(h) \le \text{error}_{\text{train}}(h) + \sqrt{\frac{2k(\ln m + \ln k) + \ln\frac{1}{\delta}}{2N}}$$
- Moral of the story: complexity of learning is not measured in terms of the size of the hypothesis space, but in the maximum number of points that allows consistent classification
  - Complexity N: no bias, lots of variance
  - Lower than N: some bias, less variance
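The bias-variance reading of the k-leaves bound is easier to see with numbers. The sketch below evaluates the complexity term $\sqrt{(2k(\ln m + \ln k) + \ln(1/\delta))/(2N)}$ for a few values of k. The function name and the values m = 10, N = 1000, δ = 0.05 are assumptions for illustration only.

```python
import math

def leaves_gap(k, m, n, delta):
    """Complexity term of the PAC bound for decision trees with k leaves."""
    ln_h = 2 * k * (math.log(m) + math.log(k))  # upper bound on ln|H|
    return math.sqrt((ln_h + math.log(1.0 / delta)) / (2.0 * n))

# Hypothetical setting: m = 10 attributes, N = 1000 points, delta = 0.05.
for k in (1, 2, 5, 10, 50):
    print(k, round(leaves_gap(k, 10, 1000, 0.05), 3))
```

Larger trees can lower the training error, but the gap term grows with k, which is the variance side of the tradeoff.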

What about continuous hypothesis spaces?
$$\text{error}_{\text{true}}(h) \le \text{error}_{\text{train}}(h) + \sqrt{\frac{\ln|H| + \ln\frac{1}{\delta}}{2N}}$$
- Continuous hypothesis space:
  - $|H| = \infty$
  - Infinite variance???
- As with decision trees, we only care about the maximum number of points that can be classified exactly!
  - Called VC dimension; see readings for details

What you need to know
- Finite hypothesis space
  - Derive results
  - Counting the number of hypotheses
  - Mistakes on training data
- Complexity of the classifier depends on the number of points that can be classified exactly
  - Finite case: decision trees
  - Infinite case: VC dimension
- Bias-Variance tradeoff in learning theory
- Remember: will your algorithm find the best classifier?

Clustering: K-means

Clustering images
[Figure: set of images] [Goldberger et al.]

Clustering web search results
[Figure: clustering web search results]

Some Data
[Figure: some data]

K-means
1. Ask user how many clusters they'd like (e.g., k=5).
2. Randomly guess k cluster Center locations.
3. Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints.)
4. Each Center finds the centroid of the points it owns...
5. ...and jumps there.
6. Repeat until terminated!

K-means
- Randomly initialize k centers: $\mu^{(0)} = \mu_1^{(0)}, \ldots, \mu_k^{(0)}$
- Classify: assign each point $j \in \{1, \ldots, N\}$ to the nearest center:
  $$C^{(t)}(j) \leftarrow \arg\min_i \|\mu_i - x_j\|^2$$
- Recenter: $\mu_i$ becomes the centroid of its points:
  $$\mu_i^{(t+1)} \leftarrow \arg\min_{\mu} \sum_{j : C(j) = i} \|\mu - x_j\|^2$$
  - Equivalent to $\mu_i \leftarrow$ average of its points!
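The classify and recenter steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the initialization from k randomly chosen data points, the rule that an empty cluster keeps its old center, and the toy data are all assumptions.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    """Classify step: index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
            for p in points]

def recenter(points, labels, centers):
    """Recenter step: each center moves to the mean of the points it owns."""
    new_centers = []
    for i, old in enumerate(centers):
        owned = [p for p, c in zip(points, labels) if c == i]
        if not owned:  # assumption: an empty cluster keeps its old center
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(p[d] for p in owned) / len(owned)
                                 for d in range(len(old))))
    return new_centers

def kmeans(points, k, seed=0, max_iters=100):
    """Alternate classify and recenter until the assignment stops changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumption: start from k distinct data points
    labels = assign(points, centers)
    for _ in range(max_iters):
        centers = recenter(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels

# Toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

On data this well separated the two groups are recovered, but in general the result depends on the random initialization.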

What is K-means optimizing?
- Potential function F(µ,C) of centers µ and point allocations C:
  $$F(\mu, C) = \sum_{j=1}^{N} \|\mu_{C(j)} - x_j\|^2$$
- Optimal K-means: $\min_{\mu} \min_{C} F(\mu, C)$

Does K-means converge??? Part 1
- Optimize potential function: $\min_{\mu} \min_{C} F(\mu, C)$
- Fix µ, optimize C

Does K-means converge??? Part 2
- Optimize potential function: $\min_{\mu} \min_{C} F(\mu, C)$
- Fix C, optimize µ

Coordinate descent algorithms
- Want: $\min_a \min_b F(a, b)$
- Coordinate descent:
  - fix a, minimize b
  - fix b, minimize a
  - repeat
- Converges!!!
  - if F is bounded
  - to a (often good) local optimum
    - as we saw in the applet (play with it!)
    - (For LASSO it converged to the optimum)
- K-means is a coordinate descent algorithm!
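The coordinate descent view can be checked by recording the potential after every half-step: neither "fix µ, optimize C" nor "fix C, optimize µ" can increase it. The sketch below does this for 1-D points, taking the potential to be the sum of squared distances from each point to its assigned center. The function names, the toy data, and the deliberately poor starting centers are illustrative assumptions.

```python
def potential(points, centers, labels):
    """F(mu, C): sum of squared distances from each point to its assigned center."""
    return sum((x - centers[c]) ** 2 for x, c in zip(points, labels))

def best_labels(points, centers):
    """Fix mu, optimize C: assign every point to its nearest center."""
    return [min(range(len(centers)), key=lambda i: (x - centers[i]) ** 2)
            for x in points]

def best_centers(points, labels, centers):
    """Fix C, optimize mu: mean of owned points (an empty cluster keeps its center)."""
    out = []
    for i, old in enumerate(centers):
        owned = [x for x, c in zip(points, labels) if c == i]
        out.append(sum(owned) / len(owned) if owned else old)
    return out

points = [0.0, 1.0, 2.0, 9.0, 10.0, 11.0]
centers = [0.0, 1.0]  # deliberately poor starting centers
labels = best_labels(points, centers)
history = [potential(points, centers, labels)]
for _ in range(10):
    centers = best_centers(points, labels, centers)
    history.append(potential(points, centers, labels))
    labels = best_labels(points, centers)
    history.append(potential(points, centers, labels))

print(centers)
print(history[:6])
```

Because the potential is bounded below by zero and never increases, the alternation must settle, though only a local optimum is guaranteed.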

How many points can a linear boundary classify exactly? (1-D)

How many points can a linear boundary classify exactly? (2-D)

How many points can a linear boundary classify exactly? (d-D)

PAC bound using VC dimension
- Number of training points that can be classified exactly is the VC dimension!!!
  - Measures the relevant size of the hypothesis space, as with decision trees with k leaves

Shattering a set of points

VC dimension

PAC bound using VC dimension
- Bound for infinite dimension hypothesis spaces: (equation not preserved in this transcript)

Examples of VC dimension
- Linear classifiers:
  - VC(H) = d+1, for d features plus constant term b
- Neural networks:
  - VC(H) = #parameters
  - Local minima mean NNs will probably not find the best parameters
- 1-Nearest neighbor?

Another VC dim. example: What can we shatter?
- What's the VC dim. of decision stumps in 2-D?

Another VC dim. example: What can't we shatter?
- What's the VC dim. of decision stumps in 2-D?
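The decision-stump question can be explored by brute force: enumerate every labeling an axis-aligned stump can produce on a given point set and check whether all $2^n$ labelings appear. The sketch below assumes a stump thresholds a single feature and may use either polarity; the function names and the two point sets are illustrative choices, and checking specific sets is not a proof for all sets.

```python
def stump_labelings(points):
    """All labelings of `points` realizable by an axis-aligned decision stump."""
    realizable = set()
    for d in range(len(points[0])):
        values = sorted({p[d] for p in points})
        # Candidate thresholds: one below every value, one between each
        # pair of consecutive distinct values along feature d.
        thresholds = [values[0] - 1.0]
        thresholds += [(v + w) / 2.0 for v, w in zip(values, values[1:])]
        for t in thresholds:
            for sign in (1, -1):  # both polarities of the stump
                realizable.add(tuple(sign if p[d] > t else -sign for p in points))
    return realizable

def is_shattered(points):
    """True if stumps realize every one of the 2^n labelings of the points."""
    return len(stump_labelings(points)) == 2 ** len(points)

three = [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0)]
four = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0), (1.0, 2.0)]
print(is_shattered(three), len(stump_labelings(three)))
print(is_shattered(four), len(stump_labelings(four)))
```

For these particular sets, the three points are shattered while the four-point diamond is not, since no stump can separate it into two pairs.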
