ECS171: Machine Learning
Lecture 6: Training versus Testing (LFD 2.1)
Cho-Jui Hsieh, UC Davis
Jan 29, 2018
Preamble to the theory
Training versus testing

Out-of-sample error (generalization error): what we want is small $E_{\text{out}}$, where
$$E_{\text{out}}(h) = \mathbb{E}_{\mathbf{x}}\big[e\big(h(\mathbf{x}), f(\mathbf{x})\big)\big]$$

In-sample error (training error): this is what we can actually minimize,
$$E_{\text{in}}(h) = \frac{1}{N}\sum_{n=1}^{N} e\big(h(\mathbf{x}_n), f(\mathbf{x}_n)\big)$$
The 2 questions of learning

$E_{\text{out}}(g) \approx 0$ is achieved through $E_{\text{out}}(g) \approx E_{\text{in}}(g)$ and $E_{\text{in}}(g) \approx 0$.

Learning is thus split into 2 questions:
1. Can we make sure that $E_{\text{out}}(g) \approx E_{\text{in}}(g)$? Hoeffding's inequality (?)
2. Can we make $E_{\text{in}}(g)$ small? Optimization (done)
What the theory will achieve

Currently we only know
$$P\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \le 2Me^{-2\epsilon^2 N}$$

What if $M = \infty$? (e.g., the perceptron)

Todo: we will establish a finite quantity that replaces $M$,
$$P\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \overset{?}{\le} 2\,m_{\mathcal{H}}(N)\,e^{-2\epsilon^2 N},$$
and study $m_{\mathcal{H}}(N)$ to understand the trade-off for model complexity.
Reducing M to a finite number
Where did the M come from?

The bad events $B_m$: $|E_{\text{in}}(h_m) - E_{\text{out}}(h_m)| > \epsilon$, each with probability at most $2e^{-2\epsilon^2 N}$.

The union bound:
$$P[B_1 \text{ or } B_2 \text{ or } \cdots \text{ or } B_M] \le \underbrace{P[B_1] + P[B_2] + \cdots + P[B_M]}_{\text{consider worst case: no overlaps}} \le 2Me^{-2\epsilon^2 N}$$
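To get a feel for the numbers, here is a minimal sketch (the parameter values are illustrative choices of mine, not from the lecture) that evaluates the union-bound guarantee $2Me^{-2\epsilon^2 N}$ for a few values of $M$, $N$, and $\epsilon$:

```python
import numpy as np

def hoeffding_union_bound(M, N, eps):
    """Union bound on P[|E_in(g) - E_out(g)| > eps] for a finite hypothesis set of size M."""
    return 2 * M * np.exp(-2 * eps**2 * N)

# Illustrative values only: the bound is useful when it is well below 1,
# and it loosens linearly as the number of hypotheses M grows.
for M in (1, 100, 10_000):
    for N in (100, 1_000, 10_000):
        print(f"M={M:>6}, N={N:>6}: bound = {hoeffding_union_bound(M, N, eps=0.1):.3g}")
```

For $M = \infty$ (as with the perceptron) the right-hand side is vacuous, which is exactly the problem the rest of the lecture addresses.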
Can we improve on M?

For two similar hypotheses $h_1$ and $h_2$:
$E_{\text{out}}$: changes only through the (small) change in the $+1$ and $-1$ areas
$E_{\text{in}}$: changes only through the change in labels of the data points

So $|E_{\text{in}}(h_1) - E_{\text{out}}(h_1)| \approx |E_{\text{in}}(h_2) - E_{\text{out}}(h_2)|$: the bad events largely overlap!
What can we replace M with?

Instead of the whole input space, let's consider a finite set of input points.
How many patterns of red and blue can you get?
Dichotomies: mini-hypotheses

A hypothesis: $h: \mathcal{X} \to \{-1, +1\}$
A dichotomy: $h: \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\} \to \{-1, +1\}$
The number of hypotheses $|\mathcal{H}|$ can be infinite
The number of dichotomies $|\mathcal{H}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)|$ is at most $2^N$
Candidate for replacing M
The growth function

The growth function counts the most dichotomies on any $N$ points:
$$m_{\mathcal{H}}(N) = \max_{\mathbf{x}_1, \ldots, \mathbf{x}_N \in \mathcal{X}} |\mathcal{H}(\mathbf{x}_1, \ldots, \mathbf{x}_N)|$$

The growth function satisfies $m_{\mathcal{H}}(N) \le 2^N$.
Growth function for perceptrons

Compute $m_{\mathcal{H}}(3)$ in 2-D space when $\mathcal{H}$ is the perceptron (linear separators). What is $|\mathcal{H}(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)|$?
For three points in general position, every one of the $2^3$ labelings can be separated by a line, so $m_{\mathcal{H}}(3) = 8$.
A degenerate placement (e.g., three collinear points) admits fewer dichotomies, but that doesn't matter, because we only count the most dichotomies over any placement of the points.

What is $m_{\mathcal{H}}(4)$?
For any four points, (at least) two dichotomies are missing (the two "XOR-like" labelings), so $m_{\mathcal{H}}(4) = 14 < 2^4$.
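These counts can be checked numerically. Below is a minimal sketch (the point sets and the random-sampling approach are my own choices, not from the lecture) that estimates how many dichotomies a 2-D perceptron $h(\mathbf{x}) = \text{sign}(\mathbf{w}^\top\mathbf{x} + b)$ realizes on a fixed point set; with enough samples it recovers 8 for three points in general position, 6 for three collinear points (which is why the growth function takes the max over placements), and 14 for four points:

```python
import numpy as np

def count_perceptron_dichotomies(points, n_samples=200_000, seed=0):
    """Lower-bound estimate of |H(x_1, ..., x_N)| for 2-D perceptrons via random sampling."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_samples, 2))      # random weight vectors
    b = rng.normal(size=(n_samples, 1))      # random biases
    signs = np.sign(W @ points.T + b)        # (n_samples, N) label patterns
    signs[signs == 0] = 1                    # break ties on the decision boundary
    return len({tuple(row) for row in signs.astype(int)})

triangle  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])               # general position
collinear = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])               # degenerate placement
square    = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print(count_perceptron_dichotomies(triangle))   # 8  = 2^3, so m_H(3) = 8
print(count_perceptron_dichotomies(collinear))  # 6  < 8: fewer here, but the max is what counts
print(count_perceptron_dichotomies(square))     # 14 < 2^4, so m_H(4) = 14
```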
Example I: positive rays

$\mathcal{H}$ consists of $h: \mathbb{R} \to \{-1, +1\}$ of the form $h(x) = \text{sign}(x - a)$: $+1$ to the right of a threshold $a$, $-1$ to its left.
Example II: positive intervals

$\mathcal{H}$ consists of $h: \mathbb{R} \to \{-1, +1\}$ with $h(x) = +1$ inside an interval $[\ell, r]$ and $-1$ outside.
Example III: convex sets

$\mathcal{H}$ is the set of $h: \mathbb{R}^2 \to \{-1, +1\}$ whose $+1$ region is convex.
How many dichotomies can we generate?
Placing the $N$ points on a circle, every labeling can be realized by taking the convex hull of the $+1$ points, so $m_{\mathcal{H}}(N) = 2^N$ for any $N$.
We say the $N$ points are shattered by convex sets.
The 3 growth functions

$\mathcal{H}$ is positive rays: $m_{\mathcal{H}}(N) = N + 1$
$\mathcal{H}$ is positive intervals: $m_{\mathcal{H}}(N) = \frac{1}{2}N^2 + \frac{1}{2}N + 1$
$\mathcal{H}$ is convex sets: $m_{\mathcal{H}}(N) = 2^N$
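The first two closed forms are easy to verify by enumeration (the convex-set case is the geometric circle argument above). A minimal sketch, with helper functions and the choice of $N$ being mine rather than from the lecture:

```python
import numpy as np
from itertools import combinations_with_replacement

def ray_dichotomies(xs):
    """Positive rays: h(x) = +1 iff x > a. Enumerate dichotomies on the points xs."""
    xs = np.sort(xs)
    cuts = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    return {tuple(np.where(xs > a, 1, -1)) for a in cuts}

def interval_dichotomies(xs):
    """Positive intervals: h(x) = +1 iff a < x < b (a == b yields the all -1 dichotomy)."""
    xs = np.sort(xs)
    cuts = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    return {tuple(np.where((xs > a) & (xs < b), 1, -1))
            for a, b in combinations_with_replacement(cuts, 2)}

N = 6
xs = np.random.default_rng(0).uniform(size=N)
print(len(ray_dichotomies(xs)), N + 1)                       # 7, 7   -> N + 1
print(len(interval_dichotomies(xs)), N * (N + 1) // 2 + 1)   # 22, 22 -> N^2/2 + N/2 + 1
```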
What's next?

Remember the inequality
$$P\big[\,|E_{\text{in}} - E_{\text{out}}| > \epsilon\,\big] \le 2Me^{-2\epsilon^2 N}$$

What happens if we replace $M$ by $m_{\mathcal{H}}(N)$? If $m_{\mathcal{H}}(N)$ is polynomial: good!
How do we show that $m_{\mathcal{H}}(N)$ is polynomial?
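As a preview of why this matters, here is a minimal sketch (the numbers and the stand-in polynomial $N^2 + 1$ are illustrative choices of mine, not from the lecture) comparing the bound $2\,m_{\mathcal{H}}(N)\,e^{-2\epsilon^2 N}$ on a log scale when $m_{\mathcal{H}}(N)$ is polynomial versus exponential:

```python
import numpy as np

eps = 0.1
ln10 = np.log(10)
print(f"{'N':>6} {'log10 bound, m_H(N)=N^2+1':>28} {'log10 bound, m_H(N)=2^N':>26}")
for N in (100, 1_000, 10_000):
    exp_term = -2 * eps**2 * N / ln10                       # log10 of exp(-2 eps^2 N)
    log10_poly = np.log10(2 * (N**2 + 1)) + exp_term        # polynomial growth function
    log10_expo = np.log10(2) + N * np.log10(2) + exp_term   # m_H(N) = 2^N: no improvement
    print(f"{N:>6} {log10_poly:>28.1f} {log10_expo:>26.1f}")
```

With a polynomial growth function the bound goes to zero as $N$ grows; with $m_{\mathcal{H}}(N) = 2^N$ it does not (for this $\epsilon$), which is why showing that $m_{\mathcal{H}}(N)$ is polynomial is the key step.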
Conclusions Next class: LFD 2.1, 2.2 Questions?