Learning Theory Continued
Machine Learning CSE446
Carlos Guestrin, University of Washington
May 13, 2013

A simple setting
- Classification with N data points
- Finite number of possible hypotheses (e.g., decision trees of depth d)
- A learner finds a hypothesis h that is consistent with the training data, i.e., gets zero training error: error_train(h) = 0
- What is the probability that h has more than ε true error, i.e., error_true(h) ≥ ε?
Generalization error in finite hypothesis spaces [Haussler '88]
- Theorem: For a finite hypothesis space H, a dataset D with N i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h that is consistent with the training data:
  P(error_true(h) > ε) ≤ |H|·e^(−Nε)

Limitations of the Haussler '88 bound
- Requires a consistent classifier: P(error_true(h) > ε) ≤ |H|·e^(−Nε)
- Depends on the size of the hypothesis space |H|
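To get a feel for the numbers, here is a minimal sketch (plain Python; the values of |H|, ε, and δ are illustrative choices, not taken from the slides) that inverts the Haussler bound to ask how many samples make the failure probability at most δ:

```python
import math

def haussler_sample_complexity(H_size, eps, delta):
    """Smallest N with |H| * exp(-N * eps) <= delta,
    i.e. N >= (ln|H| + ln(1/delta)) / eps."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

# Illustrative hypothesis-space size |H| ~ 1e6 (not computed from the slides).
print(haussler_sample_complexity(H_size=1e6, eps=0.1, delta=0.05))   # 169
print(haussler_sample_complexity(H_size=1e6, eps=0.01, delta=0.05))  # 1682
```

Note that N grows only logarithmically in |H| and 1/δ, but linearly in 1/ε.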
What if our classifier does not have zero error on the training data?
- A learner with zero training error may still make mistakes on the test set
- What about a learner with training error error_train(h) > 0?

Generalization bound for |H| hypotheses
- Theorem: For a finite hypothesis space H, a dataset D with N i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h:
  P(error_true(h) − error_train(h) > ε) ≤ |H|·e^(−2Nε²)
PAC bound and bias-variance tradeoff
- From P(error_true(h) − error_train(h) > ε) ≤ |H|·e^(−2Nε²), after moving some terms around, with probability at least 1 − δ:
  error_true(h) ≤ error_train(h) + sqrt( (ln|H| + ln(1/δ)) / (2N) )
- Important: the PAC bound holds for all h, but it doesn't guarantee that the algorithm finds the best h!!!

What about the size of the hypothesis space?
- Sample complexity: N ≥ (ln|H| + ln(1/δ)) / (2ε²)
- How large is the hypothesis space?
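A minimal sketch of the second form of the bound, computing the additive gap sqrt((ln|H| + ln(1/δ)) / (2N)) for a few values of N (|H| and δ are again illustrative):

```python
import math

def generalization_gap(H_size, delta, N):
    """Additive term sqrt((ln|H| + ln(1/delta)) / (2N)) from the PAC bound."""
    return math.sqrt((math.log(H_size) + math.log(1.0 / delta)) / (2 * N))

for N in (100, 1000, 10000):
    gap = generalization_gap(H_size=1e6, delta=0.05, N=N)
    print(f"N={N:6d}: error_true <= error_train + {gap:.3f}")
# The gap shrinks like 1/sqrt(N): roughly 0.29, 0.09, 0.03 for these settings.
```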
Boolean formulas with m binary features
- A hypothesis is a Boolean function of m binary features, and there are 2^(2^m) such functions, so ln|H| = 2^m·ln 2
- Plugging into N ≥ (ln|H| + ln(1/δ)) / (2ε²) requires a number of samples exponential in m

Number of decision trees of depth k
- Recursive solution, given m attributes. Let H_k = number of decision trees of depth k:
  - H_0 = 2
  - H_{k+1} = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees) = m · H_k · H_k
- Write L_k = log₂ H_k. Then L_0 = 1 and L_{k+1} = log₂ m + 2·L_k, so L_k = (2^k − 1)(1 + log₂ m) + 1
- Plug into N ≥ (ln|H| + ln(1/δ)) / (2ε²)
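A small sanity check of the recurrence against the closed form (nothing here beyond the recurrence on the slide; m = 10 is just an illustrative number of attributes):

```python
import math

def log2_num_trees(k, m):
    """L_k = log2(# decision trees of depth k over m attributes), via the recurrence."""
    L = 1.0  # L_0 = log2(H_0) = log2(2) = 1
    for _ in range(k):
        L = math.log2(m) + 2 * L  # L_{k+1} = log2(m) + 2 * L_k
    return L

m = 10  # illustrative number of attributes
for k in range(5):
    closed_form = (2 ** k - 1) * (1 + math.log2(m)) + 1
    assert abs(log2_num_trees(k, m) - closed_form) < 1e-9
    print(f"depth {k}: log2 |H_k| = {closed_form:.1f}")
# log2 |H_k| roughly doubles with each extra level of depth.
```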
PAC bound for decision trees of depth k
- Bad!!! N ≥ (2^k·ln m + ln(1/δ)) / (2ε²): the number of points needed is exponential in the depth!
- But for N data points, a decision tree can't get too big: the number of leaves is never more than the number of data points

Number of decision trees with k leaves
- The number of decision trees of depth k is really, really big: ln|H| is about 2^k·ln m
- Decision trees with up to k leaves: |H| is about m^k·k^(2k), a very loose bound
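A quick numeric comparison of the two counts (m = 100 features is an illustrative choice), using ln|H| ≈ 2^k·ln m for depth-k trees and the looser ln|H| ≤ 2k(ln m + ln k) for k-leaf trees:

```python
import math

m = 100  # illustrative number of features
print("  k   depth-k: 2^k ln m    k leaves: 2k(ln m + ln k)")
for k in (2, 5, 10, 20):
    depth_based = (2 ** k) * math.log(m)
    leaf_based = 2 * k * (math.log(m) + math.log(k))
    print(f"{k:3d}   {depth_based:18.1f}    {leaf_based:12.1f}")
# The depth-based count explodes exponentially in k; the leaf-based count
# grows only slightly faster than linearly in k.
```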
PAC bound for decision trees with k leaves: bias-variance revisited
- ln|H_{DTs with k leaves}| ≤ 2k(ln m + ln k)
- Plugging into error_true(h) ≤ error_train(h) + sqrt( (ln|H| + ln(1/δ)) / (2N) ) gives:
  error_true(h) ≤ error_train(h) + sqrt( (2k(ln m + ln k) + ln(1/δ)) / (2N) )

What did we learn from decision trees?
- Bias-variance tradeoff formalized:
  error_true(h) ≤ error_train(h) + sqrt( (2k(ln m + ln k) + ln(1/δ)) / (2N) )
- Moral of the story: the complexity of learning is measured not by the size of the hypothesis space, but by the maximum number of points that allows consistent classification
  - Complexity ≈ N: no bias, lots of variance
  - Complexity lower than N: some bias, less variance
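A minimal sketch of this tradeoff: as the number of leaves k grows, training error typically drops but the complexity term grows. The training-error values below are made up purely for illustration; only the complexity term is computed from the bound.

```python
import math

def complexity_term(k, m, N, delta):
    """sqrt((2k(ln m + ln k) + ln(1/delta)) / (2N)) for k-leaf trees."""
    return math.sqrt((2 * k * (math.log(m) + math.log(k))
                      + math.log(1 / delta)) / (2 * N))

m, N, delta = 100, 5000, 0.05
fake_train_error = {2: 0.30, 8: 0.15, 32: 0.05, 128: 0.01}  # illustrative only
for k, tr in fake_train_error.items():
    bound = tr + complexity_term(k, m, N, delta)
    print(f"k={k:4d}: train={tr:.2f}  bound on true error <= {bound:.2f}")
# Small k: high bias, small complexity term. Large k: low training error,
# but the complexity (variance) term grows.
```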
What about continuous hypothesis spaces?
- In error_true(h) ≤ error_train(h) + sqrt( (ln|H| + ln(1/δ)) / (2N) ), a continuous hypothesis space has |H| = ∞. Infinite variance???
- As with decision trees, we only care about the maximum number of points that can be classified exactly! This is called the VC dimension; see the readings for details

What you need to know
- Finite hypothesis spaces: derive the results, count the number of hypotheses, handle mistakes on the training data
- The complexity of the classifier depends on the number of points that can be classified exactly: finite case (decision trees), infinite case (VC dimension)
- Bias-variance tradeoff in learning theory
- Remember: will your algorithm find the best classifier?
Clustering: K-means
Machine Learning CSE446
Carlos Guestrin, University of Washington
May 13, 2013

Clustering images
- Set of images [Goldberger et al.]
Clustering web search results

Some Data
K-means
1. Ask the user how many clusters they'd like (e.g., k = 5)
2. Randomly guess k cluster center locations
3. Each datapoint finds out which center it's closest to (thus each center "owns" a set of datapoints)
4. Each center finds the centroid of the points it owns...
5. ...and jumps there
6. Repeat until terminated!

K-means, more formally
- Randomly initialize k centers: μ^(0) = μ_1^(0), ..., μ_k^(0)
- Classify: assign each point j ∈ {1, ..., m} to the nearest center:
  C(j) ← argmin_i ||μ_i − x_j||²
- Recenter: μ_i becomes the centroid of its points:
  μ_i^(t+1) ← argmin_μ Σ_{j : C(j)=i} ||μ − x_j||²
  Equivalent to μ_i being the average of its points! (A small code sketch follows below.)
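A minimal NumPy sketch of these two alternating steps (the random data and the choice of k are illustrative; this is not the exact code used in the course):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means: alternate the classify and recenter steps."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iters):
        # Classify: assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Recenter: each center moves to the centroid of the points it owns.
        new_centers = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving, so assignments can no longer change
        centers = new_centers
    return centers, assign

# Three illustrative Gaussian blobs in 2-D.
X = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])
centers, assign = kmeans(X, k=3)
print(centers)
```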
What is K-means optimizing?
- Potential function F(μ, C) of centers μ and point allocations C:
  F(μ, C) = Σ_{j=1}^{N} ||μ_{C(j)} − x_j||²
- Optimal K-means: min_μ min_C F(μ, C)

Does K-means converge??? Part 1
- Optimize the potential function: min_μ min_C F(μ, C)
- Fix μ, optimize C: each point picks its nearest center, which can only decrease F
Does K-means converge??? Part 2
- Optimize the potential function: min_μ min_C F(μ, C)
- Fix C, optimize μ: each center moves to the centroid of its points, which again can only decrease F

Coordinate descent algorithms
- Want: min_a min_b F(a, b)
- Coordinate descent: fix a, minimize over b; fix b, minimize over a; repeat (a small sketch follows below)
- Converges!!! (if F is bounded) to a (often good) local optimum
  - As we saw in the applet (play with it!)
  - (For LASSO it converged to the optimum)
- K-means is a coordinate descent algorithm!
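A minimal sketch of coordinate descent on a toy convex quadratic (the function F below is made up for illustration; its coordinate-wise minimizers are available in closed form):

```python
# F(a, b) = a^2 + b^2 + a*b - 3a, a convex quadratic with minimum at (2, -1).
def F(a, b):
    return a * a + b * b + a * b - 3 * a

a, b = 0.0, 0.0
for it in range(20):
    a = (3.0 - b) / 2.0   # fix b, minimize over a: dF/da = 2a + b - 3 = 0
    b = -a / 2.0          # fix a, minimize over b: dF/db = 2b + a = 0
    print(f"iter {it:2d}: a={a:.4f} b={b:.4f} F={F(a, b):.4f}")
# F decreases monotonically and (a, b) -> (2, -1), the global optimum here
# because F is convex; K-means in general only reaches a local optimum.
```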
How many points can a linear boundary classify exactly? (1-D)
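A minimal brute-force check for the 1-D case (the point sets are illustrative): a 1-D linear boundary is a threshold plus a sign, and the sketch below enumerates which ±1 labelings of 2 and 3 points such classifiers can realize.

```python
import itertools
import numpy as np

def can_shatter(points):
    """Can threshold classifiers sign(x - t) or sign(t - x)
    realize every +/-1 labeling of the given 1-D points?"""
    points = np.sort(np.asarray(points, dtype=float))
    # Candidate thresholds: below, between, and above the points.
    thresholds = np.concatenate(([points[0] - 1],
                                 (points[:-1] + points[1:]) / 2,
                                 [points[-1] + 1]))
    for labels in itertools.product([-1, 1], repeat=len(points)):
        realizable = any(np.array_equal(s * np.sign(points - t), labels)
                         for t in thresholds for s in (-1, 1))
        if not realizable:
            return False
    return True

print(can_shatter([0.0, 1.0]))        # True: 2 points can be shattered
print(can_shatter([0.0, 1.0, 2.0]))   # False: the labeling (+1, -1, +1) is impossible
```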
How many points can a linear boundary classify exactly? (2-D)

How many points can a linear boundary classify exactly? (d-D)
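A similar brute-force check for the 2-D case (the point sets are illustrative), testing linear separability of each labeling with a small feasibility LP via SciPy:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Is there (w, b) with y_i * (w . x_i + b) >= 1 for all i? (feasibility LP)"""
    A = -np.asarray(y)[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (X.shape[1] + 1), method="highs")
    return res.success

def can_shatter(X):
    return all(separable(X, y) for y in itertools.product([-1, 1], repeat=len(X)))

three = np.array([[0, 0], [1, 0], [0, 1]])           # not collinear
four  = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])   # the XOR labeling fails
print(can_shatter(three))  # True  -> 3 points shattered in 2-D
print(can_shatter(four))   # False -> consistent with VC dimension 3 for 2-D lines
```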
PAC bound using VC dimension
- The number of training points that can be classified exactly is the VC dimension!!!
- It measures the relevant size of the hypothesis space, as with decision trees with k leaves

Shattering a set of points
VC dimension

PAC bound using VC dimension
- The number of training points that can be classified exactly is the VC dimension!!!
- It measures the relevant size of the hypothesis space, as with decision trees with k leaves
- Bound for infinite hypothesis spaces:
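The bound itself is not legible in the extracted slide; one standard form of the VC generalization bound (the exact constants vary by textbook, so take this as the usual statement rather than necessarily the one shown in lecture) is, with probability at least 1 − δ:

  error_true(h) ≤ error_train(h) + sqrt( ( VC(H)·(ln(2N / VC(H)) + 1) + ln(4/δ) ) / N )

As in the finite case, the complexity term shrinks like 1/sqrt(N), but |H| is replaced by the VC dimension.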
Examples of VC dimension
- Linear classifiers: VC(H) = d + 1, for d features plus the constant term b
- Neural networks: VC(H) = #parameters
  - Local minima mean NNs will probably not find the best parameters
- 1-Nearest neighbor?

Another VC dim. example - What can we shatter?
- What's the VC dimension of decision stumps in 2-D?
Another VC dim. example - What can't we shatter?
- What's the VC dimension of decision stumps in 2-D?

What you need to know
- Finite hypothesis spaces: derive the results, count the number of hypotheses, handle mistakes on the training data
- The complexity of the classifier depends on the number of points that can be classified exactly: finite case (decision trees), infinite case (VC dimension)
- Bias-variance tradeoff in learning theory
- Remember: will your algorithm find the best classifier?