Learning From Data Lecture 7 Approximation Versus Generalization


1. Learning From Data, Lecture 7: Approximation Versus Generalization

The VC Dimension
Approximation Versus Generalization
Bias and Variance
The Learning Curve

M. Magdon-Ismail, CSCI 4100/6100

2. Recap: The Vapnik-Chervonenkis Bound (VC Bound)

For any $\epsilon > 0$,
$$\mathbb{P}\left[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\right] \le 4\, m_{\mathcal H}(2N)\, e^{-\epsilon^2 N/8}$$
(finite $\mathcal H$: $\mathbb{P}\left[\,|E_{in}(g) - E_{out}(g)| > \epsilon\,\right] \le 2\,|\mathcal H|\, e^{-2\epsilon^2 N}$).

Equivalently, for any $\epsilon > 0$,
$$\mathbb{P}\left[\,|E_{in}(g) - E_{out}(g)| \le \epsilon\,\right] \ge 1 - 4\, m_{\mathcal H}(2N)\, e^{-\epsilon^2 N/8}$$
(finite $\mathcal H$: $\mathbb{P}\left[\,|E_{in}(g) - E_{out}(g)| \le \epsilon\,\right] \ge 1 - 2\,|\mathcal H|\, e^{-2\epsilon^2 N}$).

So, with probability at least $1 - \delta$,
$$E_{out}(g) \le E_{in}(g) + \sqrt{\tfrac{8}{N}\log\tfrac{4\, m_{\mathcal H}(2N)}{\delta}}
\qquad \text{(finite } \mathcal H\text{: } E_{out}(g) \le E_{in}(g) + \sqrt{\tfrac{1}{2N}\log\tfrac{2\,|\mathcal H|}{\delta}}\text{)}.$$

If $k$ is a break point,
$$m_{\mathcal H}(N) \le \sum_{i=0}^{k-1} \binom{N}{i} \le N^{k-1} + 1.$$

3. The VC Dimension

Since $m_{\mathcal H}(N) \le N^{k-1} + 1$, the tightest bound is obtained with the smallest break point $k$.

Definition [VC Dimension]: $d_{vc} = k - 1$ for the smallest break point $k$. Equivalently, the VC dimension is the largest $N$ that can be shattered ($m_{\mathcal H}(N) = 2^N$).

$N \le d_{vc}$: $\mathcal H$ could shatter our data ($\mathcal H$ can shatter some $N$ points).
$N > d_{vc}$: $N$ is a break point for $\mathcal H$; $\mathcal H$ cannot possibly shatter our data.

Hence $m_{\mathcal H}(N) \le N^{d_{vc}} + 1$, and
$$E_{out}(g) \le E_{in}(g) + O\!\left(\sqrt{\tfrac{d_{vc}\log N}{N}}\right).$$
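
Not part of the slides: a quick numerical check of the two growth-function bounds above, with $d_{vc} = k - 1$. The function names are mine.

```python
from math import comb

def sauer_bound(N, d_vc):
    """Sauer's bound on the growth function: m_H(N) <= sum_{i=0}^{d_vc} C(N, i)."""
    return sum(comb(N, i) for i in range(d_vc + 1))

def poly_bound(N, d_vc):
    """The looser polynomial bound from the slide: N^d_vc + 1."""
    return N ** d_vc + 1

d_vc = 3
for N in (1, 2, 3, 4, 5, 10, 20):
    s, p = sauer_bound(N, d_vc), poly_bound(N, d_vc)
    assert s <= p  # the polynomial bound dominates Sauer's bound
    print(f"N={N:2d}  sauer={s:7d}  poly={p:7d}  2^N={2 ** N}")
```

For $N \le d_{vc}$ the Sauer sum equals $2^N$ (shattering is still possible); beyond $d_{vc}$ it falls polynomially behind the exponential.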

4. The VC Dimension is an Effective Number of Parameters

    H                          # parameters    d_vc
    2-D perceptron             3               3
    1-D positive ray           1               1
    2-D positive rectangles    4               4
    2-D positive convex sets   -               infinite

There are models with few parameters but infinite $d_{vc}$.
There are models with redundant parameters but small $d_{vc}$.

5. VC Dimension of the Perceptron in $\mathbb{R}^d$ is $d+1$

This can be shown in two steps:
1. $d_{vc} \ge d+1$. What needs to be shown?
2. $d_{vc} \le d+1$. What needs to be shown?

For step 1, which of the following suffices?
(a) There is a set of $d+1$ points that can be shattered.
(b) There is a set of $d+1$ points that cannot be shattered.
(c) Every set of $d+1$ points can be shattered.
(d) Every set of $d+1$ points cannot be shattered.

For step 2, which of the following suffices?
(a) There is a set of $d+1$ points that can be shattered.
(b) There is a set of $d+2$ points that cannot be shattered.
(c) Every set of $d+2$ points can be shattered.
(d) Every set of $d+1$ points cannot be shattered.
(e) Every set of $d+2$ points cannot be shattered.

6. VC Dimension of the Perceptron in $\mathbb{R}^d$ is $d+1$ (step 1 answer)

Step 1 ($d_{vc} \ge d+1$) requires (a): there is a set of $d+1$ points that the perceptron can shatter. (One standard construction: the origin together with the $d$ standard basis vectors; see the sketch after slide 7.)

7. VC Dimension of the Perceptron in $\mathbb{R}^d$ is $d+1$ (step 2 answer)

Step 2 ($d_{vc} \le d+1$) requires (e): every set of $d+2$ points cannot be shattered, i.e., $d+2$ is a break point for the perceptron.
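
A concrete check of step 1, not in the slides: take the origin plus the $d$ standard basis vectors, prepend the constant coordinate $x_0 = 1$, and note that the resulting $(d+1)\times(d+1)$ matrix $X$ is invertible, so any dichotomy $y$ is realized by solving $Xw = y$. A minimal sketch, assuming NumPy:

```python
import numpy as np

d = 4  # dimension; any d >= 1 works
# Rows: the origin and the d standard basis vectors, each with x_0 = 1 prepended.
X = np.hstack([np.ones((d + 1, 1)), np.vstack([np.zeros(d), np.eye(d)])])

rng = np.random.default_rng(0)
for _ in range(5):  # spot-check a few random dichotomies
    y = rng.choice([-1.0, 1.0], size=d + 1)
    w = np.linalg.solve(X, y)  # X is invertible, so X w = y exactly
    assert np.all(np.sign(X @ w) == y)
print(f"perceptron shatters these {d + 1} points in R^{d}")
```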

8. A Single Parameter Characterizes Complexity

finite $\mathcal H$:
$$E_{out}(g) \le E_{in}(g) + \sqrt{\tfrac{1}{2N}\log\tfrac{2\,|\mathcal H|}{\delta}}$$

general $\mathcal H$:
$$E_{out}(g) \le E_{in}(g) + \underbrace{\sqrt{\tfrac{8}{N}\log\tfrac{4\left((2N)^{d_{vc}} + 1\right)}{\delta}}}_{\text{penalty for model complexity } \Omega(d_{vc})}$$

[Figure: error versus complexity. As $|\mathcal H|$ (or $d_{vc}$) grows, the in-sample error falls while the model-complexity penalty rises; the out-of-sample error, their sum, is minimized at an intermediate complexity.]
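
To see how the penalty term behaves, here is a small sketch (mine, not from the slides) that evaluates $\Omega$ using the polynomial bound $m_{\mathcal H}(2N) \le (2N)^{d_{vc}} + 1$:

```python
import math

def vc_penalty(N, d_vc, delta=0.1):
    """Model-complexity penalty Omega(d_vc) from the VC bound,
    with m_H(2N) bounded by (2N)^d_vc + 1."""
    return math.sqrt(8.0 / N * math.log(4.0 * ((2.0 * N) ** d_vc + 1.0) / delta))

for d_vc in (1, 2, 5, 10, 20):
    print(f"d_vc={d_vc:2d}  Omega={vc_penalty(N=1000, d_vc=d_vc):.3f}")
```

At fixed $N$ the penalty grows with $d_{vc}$ while $E_{in}$ typically falls, which is exactly the tradeoff in the figure.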

9. Sample Complexity: How Many Data Points Do You Need?

Set the error bar at $\epsilon$ and solve for $N$:
$$\epsilon = \sqrt{\tfrac{8}{N}\ln\tfrac{4\left((2N)^{d_{vc}} + 1\right)}{\delta}}
\quad\Longrightarrow\quad
N = \tfrac{8}{\epsilon^2}\ln\tfrac{4\left((2N)^{d_{vc}} + 1\right)}{\delta} = O(d_{vc}\ln N).$$

Example: $d_{vc} = 3$, error bar $\epsilon = 0.1$, confidence 90% ($\delta = 0.1$). A simple iterative method works well. Trying $N = 1000$ we get
$$N \ge \tfrac{8}{0.1^2}\ln\tfrac{4\left((2\cdot 1000)^3 + 1\right)}{0.1} \approx 21{,}193.$$
We continue iteratively and converge to $N \approx 30{,}000$. If $d_{vc} = 4$, $N \approx 40{,}000$; for $d_{vc} = 5$, $N \approx 50{,}000$. ($N \propto d_{vc}$, but these are gross overestimates.)

Practical rule of thumb: $N = 10\, d_{vc}$.
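
The slide's iterative method as a sketch (function name and stopping rule are mine); it reproduces the roughly 30,000 figure for $d_{vc} = 3$ up to rounding:

```python
import math

def sample_complexity(d_vc, eps, delta, N0=1000.0, tol=1.0):
    """Solve N = (8/eps^2) ln(4((2N)^d_vc + 1)/delta) by fixed-point iteration.
    The right-hand side grows only logarithmically in N, so iterating converges."""
    N = N0
    while True:
        N_next = (8.0 / eps ** 2) * math.log(4.0 * ((2.0 * N) ** d_vc + 1.0) / delta)
        if abs(N_next - N) < tol:
            return N_next
        N = N_next

for d_vc in (3, 4, 5):
    print(f"d_vc={d_vc}: N ~ {sample_complexity(d_vc, eps=0.1, delta=0.1):,.0f}")
```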

10. Theory Versus Practice

The VC analysis allows us to reach outside the data for a general $\mathcal H$:
- a single parameter characterizes the complexity of $\mathcal H$; $d_{vc}$ depends only on $\mathcal H$.
- $E_{in}$ can reach outside $\mathcal D$ to $E_{out}$ when $d_{vc}$ is finite.

In practice, the VC bound is loose:
- Hoeffding is a worst-case bound;
- $m_{\mathcal H}(N)$ counts the worst-case number of dichotomies, not the average or likely case;
- the polynomial bound on $m_{\mathcal H}(N)$ is loose.

Still, it is a good guide: models with small $d_{vc}$ are good, and roughly $10\, d_{vc}$ examples are needed to get good generalization.

11. The Test Set

Another way to estimate $E_{out}(g)$ is to use a test set to obtain $E_{test}(g)$.

$E_{test}$ is a better estimate than $E_{in}$: you don't pay the price for fitting, so you can use $|\mathcal H| = 1$ in the Hoeffding bound with $E_{test}$.

Both test and training estimates have variance, but the training estimate has an optimistic bias due to selection (fitting the data); a test set has no bias.

The price for a test set is fewer training examples. (Why is this bad?) $E_{test} \approx E_{out}$, but now $E_{out}$ itself may be bad.
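
Not on the slide: with a single fixed hypothesis, the finite-$\mathcal H$ Hoeffding bound applies with $|\mathcal H| = 1$, which gives a simple test-set error bar. A sketch:

```python
import math

def test_error_bar(N_test, delta=0.05):
    """With probability >= 1 - delta, |E_test - E_out| <= sqrt(ln(2/delta) / (2 N_test)).
    Valid because the test set evaluates one fixed hypothesis (|H| = 1)."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * N_test))

for n in (100, 1_000, 10_000):
    print(f"N_test={n:6d}  error bar ~ {test_error_bar(n):.3f}")
```

A 1,000-point test set already pins $E_{out}$ to within about $\pm 0.04$ at 95% confidence, but those are 1,000 points you no longer train on.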

12. The VC Bound Quantifies Approximation Versus Generalization

The best $\mathcal H$ is $\mathcal H = \{f\}$, but you are better off buying a lottery ticket than hoping to pick it.

Larger $d_{vc}$: better chance of approximating $f$ ($E_{in} \approx 0$).
Smaller $d_{vc}$: better chance of generalizing out of sample ($E_{in} \approx E_{out}$).

$$E_{out} \le E_{in} + \Omega(d_{vc}).$$

The VC analysis depends only on $\mathcal H$; it is independent of $f$, $P(x)$, and the learning algorithm.

13. Bias-Variance Analysis

Another way to quantify the tradeoff:
1. How well can the learning approximate $f$?
   ...as opposed to how well the learning did approximate $f$ in sample ($E_{in}$).
2. How close can you get to that approximation with a finite data set?
   ...as opposed to how close $E_{in}$ is to $E_{out}$.

Bias-variance analysis applies to squared error (classification and regression).
Bias-variance analysis can take the learning algorithm into account: different learning algorithms can have different $E_{out}$ when applied to the same $\mathcal H$!

14. A Simple Learning Problem

Two data points. Two hypothesis sets:
$\mathcal H_0$: $h(x) = b$
$\mathcal H_1$: $h(x) = ax + b$

15. Let's Repeat the Experiment Many Times

For each data set $\mathcal D$, you get a different $g^{\mathcal D}$. So, for a fixed $x$, $g^{\mathcal D}(x)$ is a random value, depending on $\mathcal D$.

16. What's Happening on Average

[Figure: for each of $\mathcal H_0$ and $\mathcal H_1$, the average hypothesis $\bar g(x)$ plotted against the sine target, with a band showing the variability of $g^{\mathcal D}(x)$.]

We can define the average prediction at $x$,
$$\bar g(x) = \mathbb{E}_{\mathcal D}\left[g^{\mathcal D}(x)\right] \approx \tfrac{1}{K}\left(g^{\mathcal D_1}(x) + \cdots + g^{\mathcal D_K}(x)\right),$$
and how variable that prediction is,
$$\mathrm{var}(x) = \mathbb{E}_{\mathcal D}\left[\left(g^{\mathcal D}(x) - \bar g(x)\right)^2\right] = \mathbb{E}_{\mathcal D}\left[g^{\mathcal D}(x)^2\right] - \bar g(x)^2.$$
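
The whole experiment fits in a short Monte Carlo sketch. Assumptions not spelled out on the slide: the target is the book's $f(x) = \sin(\pi x)$ with $x$ uniform on $[-1, 1]$ and no noise; the number of data sets $K$ and the test grid are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def bias_variance(fit, K=20_000, N=2, n_x=1_000):
    """Estimate gbar(x), var(x), bias(x) for a fitting rule, averaged over x.
    fit(x, y, x_test) returns the fitted hypothesis evaluated on x_test."""
    x_test = np.linspace(-1, 1, n_x)
    f_test = np.sin(np.pi * x_test)
    s = np.zeros(n_x)   # running sum of g^D(x)
    s2 = np.zeros(n_x)  # running sum of g^D(x)^2
    for _ in range(K):
        x = rng.uniform(-1, 1, N)
        y = np.sin(np.pi * x)
        g = fit(x, y, x_test)
        s += g
        s2 += g * g
    g_bar = s / K
    var_x = s2 / K - g_bar ** 2        # E[g^2] - gbar^2
    bias_x = (g_bar - f_test) ** 2
    return bias_x.mean(), var_x.mean()  # average over x

# H0: constant h(x) = b (least squares gives b = mean(y)).
fit_h0 = lambda x, y, xt: np.full_like(xt, y.mean())
# H1: the line h(x) = ax + b through the two points (N = 2 interpolation).
fit_h1 = lambda x, y, xt: y[0] + (y[1] - y[0]) / (x[1] - x[0]) * (xt - x[0])

for name, fit in [("H0", fit_h0), ("H1", fit_h1)]:
    bias, var = bias_variance(fit)
    print(f"{name}: bias={bias:.2f}  var={var:.2f}  E_out={bias + var:.2f}")
```

Under these assumptions the output approximately reproduces the numbers on slide 19.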

17. $E_{out}$ on a Test Point for Data $\mathcal D$

[Figure: the fitted $g^{\mathcal D}(x)$ versus the target $f(x)$, with the squared gap at the test point $x$.]

$$E_{out}^{\mathcal D}(x) = \left(g^{\mathcal D}(x) - f(x)\right)^2 \qquad \text{squared error, a random value depending on } \mathcal D$$
$$E_{out}(x) = \mathbb{E}_{\mathcal D}\left[E_{out}^{\mathcal D}(x)\right] \qquad \text{expected } E_{out}(x) \text{ before seeing } \mathcal D$$

18. The Bias-Variance Decomposition

$$
\begin{aligned}
E_{out}(x) &= \mathbb{E}_{\mathcal D}\left[\left(g^{\mathcal D}(x) - f(x)\right)^2\right] \\
&= \mathbb{E}_{\mathcal D}\left[g^{\mathcal D}(x)^2 - 2\, g^{\mathcal D}(x) f(x) + f(x)^2\right] \\
&= \mathbb{E}_{\mathcal D}\left[g^{\mathcal D}(x)^2\right] - 2\, \bar g(x) f(x) + f(x)^2
\qquad \text{(understand this step; the rest is just algebra)} \\
&= \mathbb{E}_{\mathcal D}\left[g^{\mathcal D}(x)^2\right] - \bar g(x)^2 + \bar g(x)^2 - 2\, \bar g(x) f(x) + f(x)^2 \\
&= \underbrace{\mathbb{E}_{\mathcal D}\left[g^{\mathcal D}(x)^2\right] - \bar g(x)^2}_{\mathrm{var}(x)} + \underbrace{\left(\bar g(x) - f(x)\right)^2}_{\mathrm{bias}(x)}.
\end{aligned}
$$

$$E_{out}(x) = \mathrm{bias}(x) + \mathrm{var}(x)$$

[Figure: a very small model has large bias and small var ($\mathcal H$ sits far from $f$ but barely moves with the data); a very large model has small bias and large var.]

If you take the average over $x$: $E_{out} = \mathrm{bias} + \mathrm{var}$.

19. Back to $\mathcal H_0$ and $\mathcal H_1$; and, Our Winner Is...

[Figure: $\bar g(x)$ against the sine target with variability bands, for each model.]

    H_0:  bias = 0.50,  var = 0.25,  E_out = 0.75
    H_1:  bias = 0.21,  var = 1.69,  E_out = 1.90

With two data points, the winner is $\mathcal H_0$.

20. Match Learning Power to the Data, ...Not to $f$

                 2 data points                       5 data points
    H_0:  bias = 0.50, var = 0.25, E_out = 0.75   bias = 0.50, var = 0.10, E_out = 0.60
    H_1:  bias = 0.21, var = 1.69, E_out = 1.90   bias = 0.21, var = 0.21, E_out = 0.42

With 2 points $\mathcal H_0$ wins; with 5 points the variance of $\mathcal H_1$ drops enough that $\mathcal H_1$ wins.

21. Learning Curves: When Does the Balance Tip?

[Figure: expected error versus number of data points $N$, for a simple model and a complex model. In each plot, $E_{out}$ decreases and $E_{in}$ increases toward a common asymptote; the complex model reaches a lower asymptote but needs more data before its $E_{out}$ drops below the simple model's.]

$$E_{out} = \mathbb{E}_{x}\left[E_{out}(x)\right]$$
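
One way to generate such curves numerically (my sketch, reusing the $\sin(\pi x)$ setup assumed earlier; degree 0 plays the simple model, degree 1 the complex one):

```python
import numpy as np

rng = np.random.default_rng(1)

def learning_curve(degree, Ns, trials=5_000):
    """Expected E_in and E_out versus N for least-squares polynomial fits
    to the noiseless target sin(pi x), with x ~ Uniform[-1, 1]."""
    x_test = np.linspace(-1, 1, 1_000)
    f_test = np.sin(np.pi * x_test)
    out = []
    for N in Ns:
        e_in = e_out = 0.0
        for _ in range(trials):
            x = rng.uniform(-1, 1, N)
            y = np.sin(np.pi * x)
            w = np.polyfit(x, y, degree)                       # least-squares fit
            e_in += np.mean((np.polyval(w, x) - y) ** 2)       # error on D
            e_out += np.mean((np.polyval(w, x_test) - f_test) ** 2)
        out.append((N, e_in / trials, e_out / trials))
    return out

for degree in (0, 1):
    print(f"degree {degree}:")
    for N, e_in, e_out in learning_curve(degree, Ns=(2, 5, 10, 20, 50)):
        print(f"  N={N:3d}  E_in={e_in:.3f}  E_out={e_out:.3f}")
```

$E_{in}$ climbs and $E_{out}$ falls toward the same asymptote (the bias), and the crossover where $\mathcal H_1$ overtakes $\mathcal H_0$ appears between $N = 2$ and $N = 5$, consistent with slide 20.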

22. Decomposing the Learning Curve

[Figure: the same learning curve decomposed two ways. VC analysis: $E_{out} = E_{in} + $ generalization error. Bias-variance analysis: $E_{out} = $ bias $+$ variance, where the bias is the horizontal asymptote and the variance shrinks as $N$ grows.]

VC analysis: pick $\mathcal H$ that can generalize and has a good chance to fit the data.
Bias-variance analysis: pick $(\mathcal H, \mathcal A)$ to approximate $f$ and not behave wildly after seeing the data.
