ECS171: Machine Learning
Lecture 6: Training versus Testing (LFD 2.1)
Cho-Jui Hsieh, UC Davis
Jan 29, 2018
Preamble to the theory
Training versus testing

Out-of-sample error (generalization error): what we want is small $E_{\text{out}}$, where
$$E_{\text{out}}(h) = \mathbb{E}_{\mathbf{x}}\big[e\big(h(\mathbf{x}), f(\mathbf{x})\big)\big]$$

In-sample error (training error): this is what we can actually minimize,
$$E_{\text{in}}(h) = \frac{1}{N}\sum_{n=1}^{N} e\big(h(\mathbf{x}_n), f(\mathbf{x}_n)\big)$$
The 2 questions of learning

$E_{\text{out}}(g) \approx 0$ is achieved through $E_{\text{out}}(g) \approx E_{\text{in}}(g)$ and $E_{\text{in}}(g) \approx 0$.

Learning is thus split into 2 questions:
1. Can we make sure that $E_{\text{out}}(g) \approx E_{\text{in}}(g)$? Hoeffding's inequality (?)
2. Can we make $E_{\text{in}}(g)$ small? Optimization (done)
What the theory will achieve

Currently we only know
$$P\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \le 2Me^{-2\epsilon^2 N}$$

What if $M = \infty$? (e.g., the perceptron)

Todo: we will establish a finite quantity that replaces $M$,
$$P\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \overset{?}{\le} 2\,m_{\mathcal{H}}(N)\,e^{-2\epsilon^2 N},$$
and study $m_{\mathcal{H}}(N)$ to understand the trade-off for model complexity.
Reducing M to a finite number
Where did the M come from?

The bad events $B_m$: $|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon$, each with probability at most $2e^{-2\epsilon^2 N}$.

The union bound:
$$P[B_1 \text{ or } B_2 \text{ or } \cdots \text{ or } B_M] \le \underbrace{P[B_1] + P[B_2] + \cdots + P[B_M]}_{\text{consider worst case: no overlaps}} \le 2Me^{-2\epsilon^2 N}$$
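To get a feel for the numbers, here is a minimal sketch (the parameter values are illustrative choices of mine, not from the lecture) that evaluates the union-bound guarantee $2Me^{-2\epsilon^2 N}$ for a few values of $M$, $N$, and $\epsilon$:

```python
import numpy as np

def hoeffding_union_bound(M, N, eps):
    """Union bound on P[|E_in(g) - E_out(g)| > eps] for a finite hypothesis set of size M."""
    return 2 * M * np.exp(-2 * eps**2 * N)

# Illustrative values only: the bound is useful when it is well below 1,
# and it loosens linearly as the number of hypotheses M grows.
for M in (1, 100, 10_000):
    for N in (100, 1_000, 10_000):
        print(f"M={M:>6}, N={N:>6}: bound = {hoeffding_union_bound(M, N, eps=0.1):.3g}")
```

For $M = \infty$ (as with the perceptron) the right-hand side is vacuous, which is exactly the problem the rest of the lecture addresses.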
Can we improve on M?

For two similar hypotheses $h_1$ and $h_2$:
$E_{\text{out}}$: changes only through the (small) change in the $+1$ and $-1$ areas
$E_{\text{in}}$: changes only through the change in labels of the data points

So $|E_{\text{in}}(h_1) - E_{\text{out}}(h_1)| \approx |E_{\text{in}}(h_2) - E_{\text{out}}(h_2)|$: the bad events largely overlap!
What can we replace M with?

Instead of the whole input space, let's consider a finite set of input points.
How many patterns of red and blue can you get?
Dichotomies: mini-hypotheses

A hypothesis: $h: \mathcal{X} \to \{-1, +1\}$
A dichotomy: $h: \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\} \to \{-1, +1\}$
The number of hypotheses $|\mathcal{H}|$ can be infinite
The number of dichotomies $|\mathcal{H}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)|$ is at most $2^N$
Candidate for replacing M
The growth function

The growth function counts the most dichotomies on any $N$ points:
$$m_{\mathcal{H}}(N) = \max_{\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathcal{X}} |\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N)|$$

The growth function satisfies $m_{\mathcal{H}}(N) \le 2^N$.
Growth function for perceptrons

Compute $m_{\mathcal{H}}(3)$ in 2-D space when $\mathcal{H}$ is the perceptron (linear separators). What is $|\mathcal{H}(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)|$?
For three points in general position, every one of the $2^3$ labelings can be separated by a line, so $m_{\mathcal{H}}(3) = 8$.
A degenerate placement (e.g., three collinear points) admits fewer dichotomies, but that doesn't matter, because we only count the most dichotomies over any placement of the points.

What is $m_{\mathcal{H}}(4)$?
For any four points, (at least) two dichotomies are missing (the two "XOR-like" labelings), so $m_{\mathcal{H}}(4) = 14 < 2^4$.
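These counts can be checked numerically. Below is a minimal sketch (the point sets and the random-sampling approach are my own choices, not from the lecture) that estimates how many dichotomies a 2-D perceptron $h(\mathbf{x}) = \text{sign}(\mathbf{w}^\top\mathbf{x} + b)$ realizes on a fixed point set; with enough samples it recovers 8 for three points in general position, 6 for three collinear points (which is why the growth function takes the max over placements), and 14 for four points:

```python
import numpy as np

def count_perceptron_dichotomies(points, n_samples=200_000, seed=0):
    """Lower-bound estimate of |H(x_1, ..., x_N)| for 2-D perceptrons via random sampling."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_samples, 2))      # random weight vectors
    b = rng.normal(size=(n_samples, 1))      # random biases
    signs = np.sign(W @ points.T + b)        # (n_samples, N) label patterns
    signs[signs == 0] = 1                    # break ties on the decision boundary
    return len({tuple(row) for row in signs.astype(int)})

triangle  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])               # general position
collinear = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])               # degenerate placement
square    = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print(count_perceptron_dichotomies(triangle))   # 8  = 2^3, so m_H(3) = 8
print(count_perceptron_dichotomies(collinear))  # 6  < 8: fewer here, but the max is what counts
print(count_perceptron_dichotomies(square))     # 14 < 2^4, so m_H(4) = 14
```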
Example I: positive rays

$\mathcal{H}$ consists of $h: \mathbb{R} \to \{-1, +1\}$ of the form $h(x) = \text{sign}(x - a)$: $+1$ to the right of a threshold $a$, $-1$ to its left.
Example II: positive intervals

$\mathcal{H}$ consists of $h: \mathbb{R} \to \{-1, +1\}$ with $h(x) = +1$ inside an interval $[\ell, r]$ and $-1$ outside.
Example III: convex sets

$\mathcal{H}$ is the set of $h: \mathbb{R}^2 \to \{-1, +1\}$ whose $+1$ region is convex.
How many dichotomies can we generate?
Placing the $N$ points on a circle, every labeling can be realized by taking the convex hull of the $+1$ points, so $m_{\mathcal{H}}(N) = 2^N$ for any $N$.
We say the $N$ points are shattered by convex sets.
The 3 growth functions

$\mathcal{H}$ is positive rays: $m_{\mathcal{H}}(N) = N + 1$
$\mathcal{H}$ is positive intervals: $m_{\mathcal{H}}(N) = \frac{1}{2}N^2 + \frac{1}{2}N + 1$
$\mathcal{H}$ is convex sets: $m_{\mathcal{H}}(N) = 2^N$
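The first two closed forms are easy to verify by enumeration (the convex-set case is the geometric circle argument above). A minimal sketch, with helper functions and the choice of $N$ being mine rather than from the lecture:

```python
import numpy as np
from itertools import combinations_with_replacement

def ray_dichotomies(xs):
    """Positive rays: h(x) = +1 iff x > a. Enumerate dichotomies on the points xs."""
    xs = np.sort(xs)
    cuts = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    return {tuple(np.where(xs > a, 1, -1)) for a in cuts}

def interval_dichotomies(xs):
    """Positive intervals: h(x) = +1 iff a < x < b (a == b yields the all -1 dichotomy)."""
    xs = np.sort(xs)
    cuts = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    return {tuple(np.where((xs > a) & (xs < b), 1, -1))
            for a, b in combinations_with_replacement(cuts, 2)}

N = 6
xs = np.random.default_rng(0).uniform(size=N)
print(len(ray_dichotomies(xs)), N + 1)                       # 7, 7   -> N + 1
print(len(interval_dichotomies(xs)), N * (N + 1) // 2 + 1)   # 22, 22 -> N^2/2 + N/2 + 1
```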
What's next?

Remember the inequality
$$P\big[\,|E_{\text{in}} - E_{\text{out}}| > \epsilon\,\big] \le 2Me^{-2\epsilon^2 N}$$

What happens if we replace $M$ by $m_{\mathcal{H}}(N)$? If $m_{\mathcal{H}}(N)$ is polynomial: good!
How do we show that $m_{\mathcal{H}}(N)$ is polynomial?
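As a preview of why this matters, here is a minimal sketch (the numbers and the stand-in polynomial $N^2 + 1$ are illustrative choices of mine, not from the lecture) comparing the bound $2\,m_{\mathcal{H}}(N)\,e^{-2\epsilon^2 N}$ on a log scale when $m_{\mathcal{H}}(N)$ is polynomial versus exponential:

```python
import numpy as np

eps = 0.1
ln10 = np.log(10)
print(f"{'N':>6} {'log10 bound, m_H(N)=N^2+1':>28} {'log10 bound, m_H(N)=2^N':>26}")
for N in (100, 1_000, 10_000):
    exp_term = -2 * eps**2 * N / ln10                       # log10 of exp(-2 eps^2 N)
    log10_poly = np.log10(2 * (N**2 + 1)) + exp_term        # polynomial growth function
    log10_expo = np.log10(2) + N * np.log10(2) + exp_term   # m_H(N) = 2^N: no improvement
    print(f"{N:>6} {log10_poly:>28.1f} {log10_expo:>26.1f}")
```

With a polynomial growth function the bound goes to zero as $N$ grows; with $m_{\mathcal{H}}(N) = 2^N$ it does not (for this $\epsilon$), which is why showing that $m_{\mathcal{H}}(N)$ is polynomial is the key step.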
Conclusions Next class: LFD 2.1, 2.2 Questions?