ECS171: Machine Learning

1 ECS171: Machine Learning, Lecture 6: Training versus Testing (LFD 2.1). Cho-Jui Hsieh, UC Davis, Jan 29, 2018

2 Preamble to the theory

4 Training versus testing. Out-of-sample error (generalization error): $E_{out} = \mathbb{E}_x[e(h(x), f(x))]$. What we want: small $E_{out}$. In-sample error (training error): $E_{in} = \frac{1}{N} \sum_{n=1}^{N} e(h(x_n), f(x_n))$. This is what we can minimize.
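
The two errors can be sketched numerically. This is only an illustration under assumptions that are not in the lecture: a hypothetical 1-D target f(x) = sign(x), a hypothetical hypothesis h(x) = sign(x - 0.2), inputs uniform on [-1, 1], and 0/1 pointwise error e. E_out is approximated with a large fresh sample, since the expectation over x is usually not available in practice.

```python
import random

def f(x):
    """Hypothetical target: +1 for x >= 0, else -1."""
    return 1 if x >= 0 else -1

def h(x):
    """Hypothetical hypothesis: same rule with the threshold misplaced at 0.2."""
    return 1 if x >= 0.2 else -1

def zero_one_error(xs):
    """Average pointwise 0/1 error e(h(x), f(x)) over the points xs."""
    return sum(h(x) != f(x) for x in xs) / len(xs)

rng = random.Random(0)
train = [rng.uniform(-1, 1) for _ in range(100)]        # N = 100 data points
fresh = [rng.uniform(-1, 1) for _ in range(200_000)]    # stand-in for E_x[...]

e_in = zero_one_error(train)             # computable from the data we have
e_out_estimate = zero_one_error(fresh)   # h and f disagree on [0, 0.2), so E_out = 0.1
print(e_in, e_out_estimate)
```

Only `e_in` is available to a learning algorithm; the fresh sample plays the role of the unknown distribution.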

6 The 2 questions of learning. $E_{out}(g) \approx 0$ is achieved through $E_{out}(g) \approx E_{in}(g)$ and $E_{in}(g) \approx 0$. Learning is thus split into 2 questions: (1) Can we make sure that $E_{out}(g) \approx E_{in}(g)$? Hoeffding's inequality (?) (2) Can we make $E_{in}(g)$ small? Optimization (done)

9 What the theory will achieve. Currently we only know $P[|E_{in}(g) - E_{out}(g)| > \epsilon] \le 2M e^{-2\epsilon^2 N}$. What if $M = \infty$? (e.g., perceptron) Todo: We will establish a finite quantity to replace $M$: $P[|E_{in}(g) - E_{out}(g)| > \epsilon] \overset{?}{\le} 2\, m_{\mathcal{H}}(N)\, e^{-2\epsilon^2 N}$. Study $m_{\mathcal{H}}(N)$ to understand the trade-off for model complexity.
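
To see why a large M is a problem, the right-hand side of the finite-M bound can be evaluated directly. The function name and the values of M, N, and ε below are illustrative choices, not taken from the slides.

```python
import math

def hoeffding_union_bound(M, N, eps):
    """Right-hand side 2 * M * exp(-2 * eps^2 * N) of the finite-M bound."""
    return 2 * M * math.exp(-2 * eps**2 * N)

for M in (1, 100, 10_000):
    # A value above 1 means the bound is vacuous for that M.
    print(M, hoeffding_union_bound(M, N=100, eps=0.1))
```

With N = 100 and ε = 0.1, M = 10,000 already pushes the right-hand side above 1, so the bound says nothing, and M = ∞ makes it useless for every N.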

10 Reducing M to a finite number

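12 Where did the M come from? The bad events $B_m$: $|E_{in}(h_m) - E_{out}(h_m)| > \epsilon$, each with probability $\le 2e^{-2\epsilon^2 N}$. The union bound: $P[B_1 \text{ or } B_2 \text{ or } \cdots \text{ or } B_M] \le \underbrace{P[B_1] + P[B_2] + \cdots + P[B_M]}_{\text{consider worst case: no overlaps}} \le 2M e^{-2\epsilon^2 N}$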
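
How loose the union bound is when bad events overlap can be checked with a small simulation. This is a sketch under assumed settings that are not in the lecture: target f(x) = sign(x), inputs uniform on [-1, 1], and M = 10 nearly identical threshold hypotheses h_m(x) = sign(x - t_m). It compares the empirical frequency of "at least one bad event" with the sum of the individual frequencies and with 2M e^{-2ε²N}.

```python
import math
import random

rng = random.Random(1)
N, eps, trials = 200, 0.05, 1000
thresholds = [0.2 + 0.001 * m for m in range(10)]   # M = 10 similar hypotheses

def in_sample_error(t, xs):
    """0/1 error of h(x) = sign(x - t) against f(x) = sign(x) on xs."""
    return sum((x >= t) != (x >= 0) for x in xs) / len(xs)

bad_counts = [0] * len(thresholds)   # how often each B_m occurs
union_count = 0                      # how often at least one B_m occurs
for _ in range(trials):
    xs = [rng.uniform(-1, 1) for _ in range(N)]
    # E_out(h_m) = t_m / 2, because h_m and f disagree exactly on [0, t_m).
    bad = [abs(in_sample_error(t, xs) - t / 2) > eps for t in thresholds]
    for m, is_bad in enumerate(bad):
        bad_counts[m] += is_bad
    union_count += any(bad)

p_union = union_count / trials             # estimate of P[B_1 or ... or B_M]
sum_of_p = sum(bad_counts) / trials        # estimate of P[B_1] + ... + P[B_M]
hoeffding = 2 * len(thresholds) * math.exp(-2 * eps**2 * N)
print(p_union, sum_of_p, hoeffding)
```

Because the hypotheses are so similar, their bad events tend to occur on the same samples, which is the overlap the union bound ignores.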

17 Can we improve on M? $\Delta E_{out}$: change in the $+1$ and $-1$ areas. $\Delta E_{in}$: change in the labels of data points. $|E_{in}(h_1) - E_{out}(h_1)| \approx |E_{in}(h_2) - E_{out}(h_2)|$. Overlapped events!

18 What can we replace M with? Instead of the whole input space, let's consider a finite set of input points. How many patterns of red and blue can you get?

26 Dichotomies: mini-hypotheses. A hypothesis: $h : \mathcal{X} \to \{-1, +1\}$. A dichotomy: $h : \{x_1, x_2, \ldots, x_N\} \to \{-1, +1\}$. Number of hypotheses $|\mathcal{H}|$ can be infinite. Number of dichotomies $|\mathcal{H}(x_1, x_2, \ldots, x_N)|$: at most $2^N$. Candidate for replacing $M$.
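
The gap between the number of hypotheses and the number of dichotomies can be made concrete. The sketch below assumes a hypothetical hypothesis set of 1-D thresholds h_a(x) = sign(x - a), sampled on a fine grid of a, and three arbitrary input points; none of these choices come from the slides.

```python
def dichotomies(hypotheses, points):
    """Set of distinct label patterns the hypotheses produce on the points."""
    return {tuple(h(x) for x in points) for h in hypotheses}

def make_threshold(a):
    """Hypothetical hypothesis h_a(x) = +1 if x > a, else -1."""
    return lambda x: 1 if x > a else -1

# A fine grid of thresholds stands in for the infinitely many values of a.
hypotheses = [make_threshold(a / 1000) for a in range(-2000, 2001)]
points = [-0.5, 0.1, 0.7]

patterns = dichotomies(hypotheses, points)
print(len(hypotheses), len(patterns), 2 ** len(points))   # 4001 4 8
```

Thousands of hypotheses collapse to 4 dichotomies on these 3 points, well below the ceiling of 2^3 = 8.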

28 The growth function. The growth function counts the most dichotomies on any $N$ points: $m_{\mathcal{H}}(N) = \max_{x_1, \ldots, x_N \in \mathcal{X}} |\mathcal{H}(x_1, \ldots, x_N)|$. The growth function satisfies $m_{\mathcal{H}}(N) \le 2^N$.

29 Growth function for perceptrons. Compute $m_{\mathcal{H}}(3)$ in 2-D space. What's $|\mathcal{H}(x_1, x_2, x_3)|$?

30 Growth function for perceptrons. Compute $m_{\mathcal{H}}(3)$ in 2-D space when $\mathcal{H}$ is the perceptron (linear hyperplanes): $m_{\mathcal{H}}(3) = 8$

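32 Growth function for perceptrons. Compute $m_{\mathcal{H}}(3)$ in 2-D space when $\mathcal{H}$ is the perceptron (linear hyperplanes). Some placements of the three points (for example, collinear points) produce fewer than 8 dichotomies. That doesn't matter, because we only count the most dichotomies.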

35 Growth function for perceptrons. What's $m_{\mathcal{H}}(4)$? (At least) two dichotomies are missing: $m_{\mathcal{H}}(4) = 14 < 2^4$
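
The perceptron counts can be checked empirically. The sketch below samples random lines and records the label patterns they produce, so it can only miss patterns and the result is a lower bound on the number of dichotomies. The point sets, the sampling ranges, and the function name are illustrative assumptions, not part of the lecture.

```python
import math
import random

def sampled_dichotomies(points, samples=20_000, seed=0):
    """Label patterns on `points` realised by randomly sampled lines.

    Each line is h(x) = sign(w . x + b) with w a random unit vector.
    Sampling can only miss patterns, so the count is a lower bound.
    """
    rng = random.Random(seed)
    patterns = set()
    for _ in range(samples):
        theta = rng.uniform(0, 2 * math.pi)
        w0, w1 = math.cos(theta), math.sin(theta)
        b = rng.uniform(-3, 3)
        patterns.add(tuple(1 if w0 * x + w1 * y + b > 0 else -1
                           for x, y in points))
    return patterns

triangle = [(0, 0), (1, 0), (0, 1)]          # three points in general position
collinear = [(0, 0), (1, 0), (2, 0)]         # a degenerate placement
square = [(0, 0), (1, 0), (0, 1), (1, 1)]    # four points

print(len(sampled_dichotomies(triangle)))    # 8 of 2^3
print(len(sampled_dichotomies(collinear)))   # 6 of 2^3
print(len(sampled_dichotomies(square)))      # 14 of 2^4
```

For the square, the two patterns that never appear are the ones that label diagonally opposite corners alike and the two diagonals differently, which no single line can separate.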

36 Example I: positive rays

37 Example II: positive intervals

38 Example III: convex sets. $\mathcal{H}$ is the set of $h : \mathbb{R}^2 \to \{-1, +1\}$ such that the region where $h(x) = +1$ is convex. How many dichotomies can we generate?

41 Example III: convex sets. $\mathcal{H}$ is the set of $h : \mathbb{R}^2 \to \{-1, +1\}$ such that the region where $h(x) = +1$ is convex. $m_{\mathcal{H}}(N) = 2^N$ for any $N$. We say the $N$ points are shattered by convex sets.

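42 The 3 growth functions. $\mathcal{H}$ is positive rays: $m_{\mathcal{H}}(N) = N + 1$. $\mathcal{H}$ is positive intervals: $m_{\mathcal{H}}(N) = \frac{1}{2}N^2 + \frac{1}{2}N + 1$. $\mathcal{H}$ is convex sets: $m_{\mathcal{H}}(N) = 2^N$.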
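
The first two formulas can be confirmed by brute force. The sketch assumes the usual definitions (a positive ray is h(x) = +1 for x > a; a positive interval is h(x) = +1 for lo < x < hi) and enumerates one representative threshold per gap between sorted points. The function names are illustrative.

```python
def _cuts(points):
    """One threshold below all points, one in each gap, one above all points."""
    xs = sorted(points)
    return [xs[0] - 1] + [(l + r) / 2 for l, r in zip(xs, xs[1:])] + [xs[-1] + 1]

def positive_ray_dichotomies(points):
    """Patterns of h(x) = +1 for x > a, -1 otherwise, over all thresholds a."""
    return {tuple(1 if x > a else -1 for x in points) for a in _cuts(points)}

def positive_interval_dichotomies(points):
    """Patterns of h(x) = +1 for lo < x < hi, -1 otherwise."""
    cuts = _cuts(points)
    return {tuple(1 if lo < x < hi else -1 for x in points)
            for lo in cuts for hi in cuts if lo <= hi}

for n in range(1, 8):
    points = list(range(n))              # any n distinct points give the maximum
    rays = len(positive_ray_dichotomies(points))
    intervals = len(positive_interval_dichotomies(points))
    assert rays == n + 1
    assert intervals == n * (n + 1) // 2 + 1     # equals N^2/2 + N/2 + 1
    print(n, rays, intervals, 2 ** n)
```

The last column, 2^N, is the count for convex sets, which is reached by placing the N points on a circle so that any subset forms the vertices of a convex polygon.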

45 What's next? Remember the inequality $P[|E_{in} - E_{out}| > \epsilon] \le 2M e^{-2\epsilon^2 N}$. What happens if we replace $M$ by $m_{\mathcal{H}}(N)$? $m_{\mathcal{H}}(N)$ polynomial $\Longrightarrow$ Good! How to show $m_{\mathcal{H}}(N)$ is polynomial?
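
Why a polynomial growth function is "good" can be seen by plugging the three example growth functions into the tentative expression 2 m_H(N) e^{-2ε²N}. This takes the slide's question at face value and is not the final form of the bound; ε = 0.1 is an arbitrary choice. The computation works with logarithms because 2^N overflows a float for large N.

```python
import math

def log_tentative_bound(log_growth, N, eps=0.1):
    """Natural log of 2 * m_H(N) * exp(-2 * eps^2 * N)."""
    return math.log(2) + log_growth(N) - 2 * eps**2 * N

log_growth_functions = {
    "positive rays": lambda N: math.log(N + 1),
    "positive intervals": lambda N: math.log(N * (N + 1) / 2 + 1),
    "convex sets": lambda N: N * math.log(2),
}

for N in (100, 1000, 10_000):
    for name, log_growth in log_growth_functions.items():
        # A negative log means the expression is below 1, i.e. non-vacuous.
        print(N, name, round(log_tentative_bound(log_growth, N), 2))
```

For the two polynomial growth functions the exponential decay eventually wins and the log becomes very negative, while for convex sets it keeps growing with N.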

46 Conclusions Next class: LFD 2.1, 2.2 Questions?
