Learning From Data Lecture 5 Training Versus Testing


1 Learning From Data Lecture 5: Training Versus Testing
The Two Questions of Learning; Theory of Generalization (E_in ≈ E_out); An Effective Number of Hypotheses; A Combinatorial Puzzle.
M. Magdon-Ismail, CSCI 4100/6100

2 Recap: The Two Questions of Learning
1. Can we make sure that E_out(g) is close enough to E_in(g)?
2. Can we make E_in(g) small enough?
The Hoeffding generalization bound:
    E_out(g) <= E_in(g) + sqrt( (1/(2N)) ln(2|H|/δ) ),
where the square-root term is the generalization error bar.
[Figure: error versus model complexity -- the in-sample error falls and the error bar (model complexity penalty) rises as |H| grows.]
E_in is training (e.g. the practice exam); E_out is testing (e.g. the real exam).
There is a tradeoff when picking H.
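A quick way to get a feel for this bound is to evaluate the error bar numerically. A minimal sketch (not from the slides; the values of N, |H|, and δ are made-up examples):

```python
import math

def hoeffding_error_bar(N, M, delta):
    """Error bar sqrt((1/(2N)) * ln(2M/delta)) for a finite hypothesis set of size M."""
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

# Example: N = 1000 training points, |H| = 100 hypotheses, delta = 0.05.
print(hoeffding_error_bar(1000, 100, 0.05))   # ~0.064
```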

3 What Will the Theory of Generalization Achieve?
Current bound (finite H):
    E_out(g) <= E_in(g) + sqrt( (1/(2N)) ln(2|H|/δ) )
New bound (to be derived):
    E_out(g) <= E_in(g) + sqrt( (8/N) ln(4 m_H(2N)/δ) )
[Figure: the same error-versus-model-complexity tradeoff for each bound, with the growth function m_H in place of |H| in the new one.]
The new bound will be applicable to infinite H.
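For intuition on why the new bound helps, plug a polynomial growth function into the new error bar and watch it shrink as N grows. A sketch, assuming for illustration the growth function m_H(N) = N + 1 of the positive-ray model that appears later:

```python
import math

def vc_style_bar(N, m_H, delta=0.05):
    """Error bar sqrt((8/N) * ln(4 * m_H(2N) / delta)) from the new bound."""
    return math.sqrt(8 / N * math.log(4 * m_H(2 * N) / delta))

# Polynomial growth function m_H(N) = N + 1 (illustrative; it is the positive-ray example).
for N in (100, 1000, 10000, 100000):
    print(N, round(vc_style_bar(N, lambda n: n + 1), 3))   # the bar shrinks toward 0
```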

4 Why is |H| an Overkill?
How did |H| come in? Through the bad events
    B_g = { |E_out(g) - E_in(g)| > ε },    B_m = { |E_out(h_m) - E_in(h_m)| > ε }.
We do not know which g the algorithm will pick, so we use a worst-case union bound:
    P[B_g] <= P[B_1 or B_2 or ... or B_|H|] <= sum_{m=1}^{|H|} P[B_m].
[Figure: overlapping events B_1, B_2, B_3.]
The B_m are events (sets of outcomes); they can overlap.
If the B_m overlap, the union bound is loose.
If many h_m are similar, the B_m overlap.
There are effectively fewer than |H| hypotheses, so we can replace |H| by something smaller.
|H| fails to account for similarity between hypotheses.
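The looseness of the union bound when hypotheses are similar shows up in a tiny Monte Carlo experiment. A sketch under made-up assumptions: two hypotheses whose per-example errors agree on 95% of points, both with E_out = 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
N, eps, trials = 100, 0.1, 20000
count_B1 = count_B2 = count_any = 0

for _ in range(trials):
    shared = rng.random(N) < 0.5                      # h1's per-example errors (E_out = 0.5)
    flip = rng.random(N) < 0.05                       # the 5% of points where h2 may differ
    e1 = shared
    e2 = np.where(flip, rng.random(N) < 0.5, shared)  # h2's errors (also E_out = 0.5)
    B1 = abs(e1.mean() - 0.5) > eps                   # bad event for h1
    B2 = abs(e2.mean() - 0.5) > eps                   # bad event for h2
    count_B1 += B1
    count_B2 += B2
    count_any += (B1 or B2)

print("P[B1]       ~", count_B1 / trials)
print("P[B2]       ~", count_B2 / trials)
print("P[B1 or B2] ~", count_any / trials)             # close to P[B1]: the events overlap
print("union bound ~", (count_B1 + count_B2) / trials) # roughly twice as large, i.e. loose
```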

5 Measuring the Diversity (Size) of H
We need a way to measure the diversity of H.
A simple idea: fix any set of N data points. If H is diverse, it should be able to implement all functions ... on these N points.

6 A Data Set Reveals the True Colors of an H
[Figure: an example of a large hypothesis set H.]

7 A Data Set Reveals the True Colors of an H
[Figure: the same H, seen through the eyes of the data set D.]

8 A Data Set Reveals the True Colors of an H
From the point of view of D, the entire H is just one dichotomy.

9 An Effective Number of Hypotheses
If H is diverse, it should be able to implement many dichotomies; |H| only captures the maximum possible diversity of H.
Consider an h in H and a data set x_1, ..., x_N. Then h gives us an N-tuple of ±1's,
    (h(x_1), ..., h(x_N)),
a dichotomy of the inputs.
If H is diverse, we get many different dichotomies. If H contains similar functions, we only get a few dichotomies.
The growth function quantifies this.
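Collecting dichotomies is easy to do in code: evaluate every hypothesis on the same points and keep the distinct ±1 tuples. A minimal sketch (the tiny threshold hypothesis set below is an illustrative assumption, chosen so that several hypotheses are similar):

```python
def dichotomies(H, points):
    """Distinct tuples (h(x_1), ..., h(x_N)) induced by the hypotheses in H."""
    return {tuple(h(x) for x in points) for h in H}

# A deliberately redundant H: four threshold functions on the real line, three of them similar.
H = [lambda x, t=t: +1 if x > t else -1 for t in (0.0, 0.01, 0.02, 5.0)]
points = [1.0, 2.0, 3.0]

D = dichotomies(H, points)
print(len(H), "hypotheses, but only", len(D), "dichotomies:", D)   # 4 hypotheses, 2 dichotomies
```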

10 The Growth Function m_H(N)
Define the restriction of H to the inputs x_1, x_2, ..., x_N:
    H(x_1, ..., x_N) = { (h(x_1), ..., h(x_N)) : h in H }    (the set of dichotomies induced by H).
The growth function m_H(N) is the size of the largest set of dichotomies induced by H:
    m_H(N) = max over x_1, ..., x_N of |H(x_1, ..., x_N)|,    so m_H(N) <= 2^N.
Can we replace |H| by m_H, an effective number of hypotheses?
Replacing |H| with 2^N is no help in the bound. (Why?)
We want m_H(N) <= poly(N) to get a useful error bar. (The error bar is sqrt( (1/(2N)) ln(2|H|/δ) ).)
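To answer the (Why?) numerically: substituting 2^N for |H| gives an error bar that never shrinks, whereas a polynomial such as N + 1 drives it to zero. A sketch (the polynomial is the positive-ray growth function, used here only for illustration):

```python
import math

def bar_with_2_to_N(N, delta=0.05):
    """sqrt((1/(2N)) ln(2 * 2^N / delta)) -- tends to sqrt(ln(2)/2) ~ 0.59, never shrinks."""
    return math.sqrt((math.log(2 / delta) + N * math.log(2)) / (2 * N))

def bar_with_poly(N, delta=0.05):
    """sqrt((1/(2N)) ln(2 * (N+1) / delta)) -- goes to zero as N grows."""
    return math.sqrt(math.log(2 * (N + 1) / delta) / (2 * N))

for N in (100, 1000, 10000):
    print(N, round(bar_with_2_to_N(N), 3), round(bar_with_poly(N), 3))
```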

11 Example: 2-D Perceptron Model
[Figure: a dichotomy the perceptron cannot implement; three points on which it can implement all 8 dichotomies; four points on which it can implement at most 14 dichotomies.]
m_H(3) = 8 = 2^3.   m_H(4) = 14 < 2^4.   What is m_H(5)?
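These counts can be checked by brute force: for each of the 2^N labelings, test whether a separating line exists via a small LP feasibility problem. A sketch assuming scipy is available (the point coordinates are made up, chosen in general position):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Is there (w, b) with y_i * (w . x_i + b) >= 1 for all i?  (LP feasibility check)"""
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])   # variables are (w1, w2, b)
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * 3)
    return res.success

def count_dichotomies(X):
    """Number of labelings of the rows of X that a 2-D perceptron can implement."""
    return sum(separable(X, np.array(y))
               for y in itertools.product([-1, 1], repeat=len(X)))

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # 3 points in general position
X4 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 4 points (a square)
print(count_dichotomies(X3), count_dichotomies(X4))              # expect 8 and 14
```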

12 Example: 1-D Positive Ray Model
    h(x) = sign(x - w_0)
[Figure: points x_1, ..., x_N on a line, with the region to the right of w_0 labeled +.]
Consider N points. There are N + 1 dichotomies, depending on which gap you put w_0 in:
    m_H(N) = N + 1.
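The count m_H(N) = N + 1 is easy to verify by enumeration, since only the gap containing w_0 matters. A minimal sketch with made-up inputs:

```python
def positive_ray_dichotomies(xs):
    """All dichotomies of the inputs xs under h(x) = sign(x - w0), over choices of w0."""
    xs = sorted(xs)
    # One candidate threshold below all points, one in each gap, one above all points.
    thresholds = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(+1 if x > w0 else -1 for x in xs) for w0 in thresholds}

xs = [0.3, 1.2, 2.7, 4.0, 5.5]
print(len(positive_ray_dichotomies(xs)))   # N + 1 = 6
```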

13 Example: Positive Rectangles in 2-D
[Figure: N = 4 points that a rectangle can label arbitrarily; N = 5 points, where some point falls inside any rectangle defined by the others.]
H implements all dichotomies on 4 points, but with 5 points some point will be inside a rectangle defined by the others:
    m_H(4) = 2^4,    m_H(5) < 2^5.
We have not computed m_H(5) -- not impossible, but tricky.
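Rectangle dichotomies can also be counted by brute force, because only edges just inside or just outside the point coordinates matter. A sketch; the diamond of 4 points plus a 5th point at the center is a made-up configuration matching the slide's picture, and the code assumes coordinates differ by more than 0.02:

```python
import itertools

def rectangle_dichotomies(points):
    """Dichotomies of `points` under axis-aligned rectangles (inside = +1, outside = -1)."""
    xs = sorted({p[0] for p in points})
    ys = sorted({p[1] for p in points})
    # Candidate edges just inside/outside each coordinate, plus one beyond all points.
    cand_x = [xs[0] - 1] + [x + d for x in xs for d in (-0.01, 0.01)] + [xs[-1] + 1]
    cand_y = [ys[0] - 1] + [y + d for y in ys for d in (-0.01, 0.01)] + [ys[-1] + 1]
    dichs = set()
    for x1, x2 in itertools.combinations(sorted(cand_x), 2):
        for y1, y2 in itertools.combinations(sorted(cand_y), 2):
            dichs.add(tuple(+1 if (x1 <= p[0] <= x2 and y1 <= p[1] <= y2) else -1
                            for p in points))
    return dichs

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]           # 4 points in a diamond
print(len(rectangle_dichotomies(diamond)))              # 2^4 = 16
print(len(rectangle_dichotomies(diamond + [(0, 0)])))   # a 5th point in the middle: fewer than 2^5
```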

14 Example Growth Functions
    2-D perceptron:         m_H(3) = 8,   m_H(4) = 14,   m_H(5) = ?
    1-D pos. ray:           m_H(N) = N + 1
    2-D pos. rectangles:    m_H(4) = 2^4,   m_H(5) < 2^5
When m_H(N) drops below 2^N, there is hope for the generalization bound.
A break point is any n for which m_H(n) < 2^n.

15 A Combinatorial Puzzle
[Figure: a set of dichotomies, listed as rows of ±1's over the columns x_1, x_2, x_3.]

16 A Combinatorial Puzzle
[Figure: a set of dichotomies over x_1, x_2, x_3 in which two points are shattered -- some pair of columns shows all four ± patterns.]

17 A Combinatorial Puzzle
[Figure: another set of dichotomies over x_1, x_2, x_3; here no pair of points is shattered.]

18 A Combinatorial Puzzle
[Figure: the N = 3 table (columns x_1, x_2, x_3) next to an N = 4 table (columns x_1, x_2, x_3, x_4).]
For N = 3, 4 dichotomies is the maximum with no 2 points shattered.
If N = 4, how many dichotomies are possible with no 2 points shattered?
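The puzzle can be settled by exhaustive search: enumerate subsets of the 2^N possible dichotomies and return the largest size at which no pair of points is shattered. A sketch; it reproduces the maximum of 4 for N = 3 and computes the answer for N = 4:

```python
import itertools

def pair_shattered(dichs, i, j):
    """Do the dichotomies show all four +/- patterns on the pair of points (i, j)?"""
    return len({(d[i], d[j]) for d in dichs}) == 4

def max_dichotomies_no_pair_shattered(N):
    all_dichs = list(itertools.product([-1, +1], repeat=N))
    for size in range(len(all_dichs), 0, -1):          # try the largest sets first
        for subset in itertools.combinations(all_dichs, size):
            if not any(pair_shattered(subset, i, j)
                       for i, j in itertools.combinations(range(N), 2)):
                return size

print(max_dichotomies_no_pair_shattered(3))   # 4, matching the slide
print(max_dichotomies_no_pair_shattered(4))   # the N = 4 puzzle
```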
