Computational and Statistical Learning Theory, TTIC 31120, Prof. Nati Srebro
Lecture 2: PAC Learning and VC Theory I
From Adversarial Online to Statistical

Three reasons to move from worst-case deterministic to stochastic:
- Deal with errors: what if the data is not exactly realizable by H?
- Avoid non-learnability due to a very specific, adversarial order of examples.
- Train on a dedicated training set, then predict on the population.
The Statistical Learning Model

Unknown source distribution D over (x, y). Describes reality: what we want to classify, and what it should be classified as. E.g., a joint distribution over (image, character) pairs.

Can think of D as:
- a distribution over x, together with a conditional distribution of y given x (possibly deterministic, y = f(x)): the distribution over images we expect to see (we don't expect to see uniformly distributed images), and what character each image represents;
- or as a distribution over y, together with a conditional distribution of x given y: a distribution over characters ('e' is more likely than '&'), and, for each character, a distribution over possible images of that character.

Goal: find a predictor h with small expected error
  $L_D(h) = \Pr_{(x,y)\sim D}[h(x) \ne y]$
(also called generalization error, risk, or true error), based on a sample $S = ((x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m))$ of m training points $(x_t,y_t)$ drawn i.i.d. from D (we can also write: $S \sim D^m$).
The Statistical Learning Model

Unknown source distribution D over (x, y). Goal: find a predictor h with small expected error $L_D(h) = \Pr_{(x,y)\sim D}[h(x) \ne y]$, based on a sample $S = ((x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m))$ of training points $(x_t,y_t)$ i.i.d. from D, i.e. $S \sim D^m$.

Statistical (batch) learning:
1. Receive a training set $S \sim D^m$.
2. Learn $h = A(S)$ using a learning rule $A : (X \times Y)^m \to Y^X$.
3. Use h on future examples drawn from D, suffering expected error $L_D(h)$.

Main assumption: i.i.d. samples, drawn from the same distribution D we will later use the predictor on.
Expected vs. Empirical Error

What we care about is the expected error $L_D(h) = \Pr_{(x,y)\sim D}[h(x) \ne y]$. Why not just minimize it directly? Because D is unknown; all we have is the sample.

For a given sample S we can calculate the empirical error
  $L_S(h) = \frac{1}{m} \sum_{t=1}^{m} [[\, h(x_t) \ne y_t \,]]$

How do we use the empirical error? Is it a good estimate for the expected error? How good?
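To make the distinction concrete, here is a minimal Python sketch (the source distribution, the 0.3 changepoint, and the 10% noise rate are all invented for illustration): it computes $L_S(h)$ on a small sample and Monte-Carlo estimates $L_D(h)$ from a large fresh one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy source distribution D (an assumption for illustration):
# x uniform on [0, 1], y = +1 iff x >= 0.3, with 10% label noise.
def draw_sample(m):
    x = rng.uniform(0, 1, size=m)
    y = np.where(x >= 0.3, 1, -1)
    flip = rng.uniform(size=m) < 0.1
    y[flip] *= -1
    return x, y

# A fixed predictor h (a threshold at 0.5, chosen arbitrarily).
def h(x):
    return np.where(x >= 0.5, 1, -1)

# Empirical error L_S(h) = (1/m) * sum of [[h(x_t) != y_t]]
x_train, y_train = draw_sample(50)
L_S = np.mean(h(x_train) != y_train)

# Expected error L_D(h), approximated on a huge fresh sample.
x_big, y_big = draw_sample(1_000_000)
L_D = np.mean(h(x_big) != y_big)

print(f"L_S(h) = {L_S:.3f}  (from m=50 points)")
print(f"L_D(h) ~ {L_D:.3f}  (Monte Carlo estimate)")
```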
The Empirical Error as an Estimator for the Expected Error

How close are the expected and empirical errors, $L_S(h)$ vs. $L_D(h)$?
- $L_D(h) = \Pr_{(x,y)\sim D}[h(x) \ne y]$ is a number.
- $L_S(h) = \frac{1}{m}\sum_{t=1}^{m} [[\, h(x_t) \ne y_t \,]]$ is a random variable, with $m \cdot L_S(h) \sim \mathrm{Binom}(m, L_D(h)) \approx N\!\left(L_D(h),\ \frac{L_D(h)(1-L_D(h))}{m}\right)$.

Hoeffding bound on the tail of a Binomial: $\Pr_{Z \sim \mathrm{Binom}(m,p)}[\,|Z - \mathbb{E}Z| > t\,] \le 2e^{-2t^2/m}$

Conclusion: with probability $\ge 1-\delta$,
  $|L_D(h) - L_S(h)| \le \sqrt{\frac{\log(2/\delta)}{2m}}$
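A quick simulation of this conclusion, under an assumed true error $L_D(h) = 0.3$: since $m \cdot L_S(h) \sim \mathrm{Binom}(m, L_D(h))$, we can draw many independent samples and check how often the deviation exceeds the Hoeffding radius.

```python
import numpy as np

rng = np.random.default_rng(1)

m, p, delta = 200, 0.3, 0.05                 # sample size, true error L_D(h), failure prob.
eps = np.sqrt(np.log(2 / delta) / (2 * m))   # Hoeffding deviation radius

# Draw many independent samples S and record the empirical error each time.
trials = 100_000
L_S = rng.binomial(m, p, size=trials) / m    # m * L_S(h) ~ Binom(m, p)

violations = np.mean(np.abs(L_S - p) > eps)
print(f"Hoeffding radius eps = {eps:.4f}")
print(f"empirical P(|L_S - L_D| > eps) = {violations:.4f}  (bound: {delta})")
```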
Empirical Risk Minimization

$\mathrm{ERM}_H(S) = \hat{h} = \arg\min_{h \in H} L_S(h)$

Can we conclude that w.p. $\ge 1-\delta$, $L_D(\hat h) \le L_S(\hat h) + \sqrt{\frac{\log(2/\delta)}{2m}}$? No: Hoeffding applies to a single h fixed before seeing S, while $\hat h$ is chosen based on S.
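A minimal ERM implementation for a finite class of thresholds (the class, the grid, and the distribution are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy distribution (an assumption): x uniform on [0,1], y = +1 iff x >= 0.3,
# with 10% label noise.
def draw_sample(m):
    x = rng.uniform(0, 1, size=m)
    y = np.where(x >= 0.3, 1, -1)
    y[rng.uniform(size=m) < 0.1] *= -1
    return x, y

# Finite hypothesis class H: thresholds on a grid (|H| = 101).
thresholds = np.linspace(0, 1, 101)

def erm(x, y):
    """Return the threshold in H minimizing the empirical error L_S(h)."""
    errs = [np.mean(np.where(x >= th, 1, -1) != y) for th in thresholds]
    return thresholds[int(np.argmin(errs))]

x, y = draw_sample(100)
print(f"ERM threshold: {erm(x, y):.2f}  (true changepoint 0.3, 10% label noise)")
```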
Uniform Convergence and the Union Bound

For each h we have: $\Pr_S[\,|L_S(h) - L_D(h)| > t\,] \le 2e^{-2mt^2}$

And so:
  $\Pr_S[\,\exists h \in H:\ |L_S(h) - L_D(h)| > t\,] \le \sum_{h \in H} \Pr_S[\,|L_S(h) - L_D(h)| > t\,] \le |H| \cdot 2e^{-2mt^2}$

Theorem: For any hypothesis class H and any D,
  $\Pr_{S\sim D^m}\left[\forall h \in H,\ |L_D(h) - L_S(h)| \le \sqrt{\frac{\log|H| + \log(2/\delta)}{2m}}\right] \ge 1-\delta$

Another way to view the derivation: apply Hoeffding to each h with failure probability $\delta/|H|$:
  $\Pr_S\left[|L_S(h) - L_D(h)| > \sqrt{\frac{\log(2|H|/\delta)}{2m}}\right] \le \frac{\delta}{|H|}$
and then $\log(2|H|/\delta) = \log|H| + \log(2/\delta)$.
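The theorem can be checked numerically. In the sketch below (toy distribution and threshold class, all invented for illustration), $L_D(h_\theta) = |\theta - 0.3|$ is available in closed form, so we can measure how often the worst deviation over H exceeds the bound.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setup (all invented): x ~ Uniform[0,1], y = +1 iff x >= 0.3, no noise.
# For h_th(x) = sign(x - th), the expected error is exactly L_D(h_th) = |th - 0.3|.
thresholds = np.linspace(0, 1, 101)          # finite class H, |H| = 101
L_D = np.abs(thresholds - 0.3)

m, delta = 200, 0.05
bound = np.sqrt((np.log(len(thresholds)) + np.log(2 / delta)) / (2 * m))

worst = np.empty(1000)
for i in range(len(worst)):
    x = rng.uniform(0, 1, size=m)
    y = np.where(x >= 0.3, 1, -1)
    preds = np.where(x[None, :] >= thresholds[:, None], 1, -1)   # |H| x m predictions
    L_S = np.mean(preds != y, axis=1)                            # all empirical errors
    worst[i] = np.max(np.abs(L_S - L_D))

print(f"uniform-convergence bound: {bound:.3f}")
print(f"P(sup_h |L_S(h) - L_D(h)| > bound) = {np.mean(worst > bound):.4f}  (<= {delta})")
```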
Empirical Risk Minimization

Theorem (post-hoc guarantee): For any H and any D, w.p. $\ge 1-\delta$ over $S \sim D^m$:
  $L_D(\hat h) \le L_S(\hat h) + \sqrt{\frac{\log|H| + \log(2/\delta)}{2m}}$

Theorem (a-priori guarantee): For any H and any D, w.p. $\ge 1-\delta$ over $S \sim D^m$:
  $L_D(\hat h) \le \inf_{h \in H} L_D(h) + 2\sqrt{\frac{\log|H| + \log(2/\delta)}{2m}}$

Proof: write $\epsilon = \sqrt{\frac{\log|H| + \log(2/\delta)}{2m}}$. If indeed $\forall h \in H,\ |L_D(h) - L_S(h)| \le \epsilon$, then for any $h^* \in H$:
  $L_D(\hat h) \le L_S(\hat h) + \epsilon \le L_S(h^*) + \epsilon \le L_D(h^*) + 2\epsilon$
($\hat h$ minimizes $L_S$, and so $L_S(\hat h) \le L_S(h)$ for any h, including $h^*$).
Post-Hoc Guarantee

Theorem: For any H and any D, w.p. $\ge 1-\delta$ over $S \sim D^m$:
  $L_D(\hat h) \le L_S(\hat h) + \sqrt{\frac{\log|H| + \log(2/\delta)}{2m}}$

Performance guarantee: without ANY assumptions about D (i.e., about reality), if we somehow find a predictor h with low $L_S(h)$, we are ensured (with high probability) that it will perform well on future examples.

Instead of bounding via the training set, we can use an independent test set S' (e.g., split the available examples into a training set S and a test set S'). From Hoeffding, since A(S) is random but depends only on S, independent of S':
  $L_D(A(S)) \le L_{S'}(A(S)) + \sqrt{\frac{\log(1/\delta)}{2|S'|}}$

Even better using tighter Binomial tail bounds, or even better numerically with a Gaussian approximation of the Binomial, or an entropy-based bound [see homework].
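A sketch of the resulting test-set confidence bound (the error counts below are made up):

```python
import numpy as np

# Test-set guarantee: after training on S, evaluate on an independent test set S'.
def test_set_bound(n_test_errors, n_test, delta=0.05):
    """One-sided Hoeffding bound: L_D <= L_S' + sqrt(log(1/delta) / (2|S'|))."""
    L_test = n_test_errors / n_test
    return L_test + np.sqrt(np.log(1 / delta) / (2 * n_test))

# e.g., 37 mistakes on 1000 held-out points:
print(f"w.p. >= 0.95, L_D <= {test_set_bound(37, 1000):.4f}")
```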
A-Priori Learning Guarantee

Theorem: For any H and any D, w.p. $\ge 1-\delta$ over $S \sim D^m$:
  $L_D(\hat h) \le \underbrace{\inf_{h \in H} L_D(h)}_{\text{approximation error}} + \underbrace{2\sqrt{\frac{\log|H| + \log(2/\delta)}{2m}}}_{\text{estimation error}}$

If we assume, based on our expert knowledge, that there exists a good predictor $h^* \in H$, then with enough examples we can learn a predictor that's almost as good, and we know how many examples we'll need:

For any $\delta, \epsilon > 0$, using $m = \frac{2(\log|H| + \log(2/\delta))}{\epsilon^2}$ samples is enough to ensure $L_D(\hat h) \le L_D(h^*) + \epsilon$ w.p. $\ge 1-\delta$.
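The sample-size formula is easy to evaluate. Below, a sketch comparing a small threshold class to a crude model of "all 100-line programs" (at most 8000 characters from a 128-symbol alphabet; this encoding is an assumption, only meant to show the $\log|H|$ scaling):

```python
import math

def sample_complexity(log_H, eps, delta):
    """m = ceil(2 (log|H| + log(2/delta)) / eps^2) suffices for the a-priori
    guarantee: w.p. >= 1 - delta, L_D(h_hat) <= L_D(h*) + eps."""
    return math.ceil(2 * (log_H + math.log(2 / delta)) / eps**2)

# 101 thresholds vs. "all 100-line programs", crudely encoded as at most
# 8000 characters over a 128-symbol alphabet (so log|H| <= 8000 log 128).
print(sample_complexity(math.log(101), eps=0.05, delta=0.01))         # ~ 7,900
print(sample_complexity(8000 * math.log(128), eps=0.05, delta=0.01))  # ~ 31 million
```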
Cardinality and Learning

We saw: all finite hypothesis classes are learnable. If we assume there is a good predictor in the class, with enough samples we'll be able to learn it (fairly powerful: includes all 100-line programs). Sample complexity scales with $\log|H|$.

- Is cardinality the only thing controlling learnability and sample complexity?
- Is this sample complexity bound always tight?
- Are all classes of the same cardinality equally complex?
- Are there infinite classes that are learnable?
Probably Approximately Correct (PAC) Learning [Leslie Valiant]

Definition: A hypothesis class H is PAC-learnable (in the realizable case) if there exists a learning rule A such that $\forall \epsilon, \delta > 0$, $\exists m(\epsilon,\delta)$, such that for every D s.t. $L_D(h) = 0$ for some $h \in H$ (i.e., D is realizable by H), w.p. $\ge 1-\delta$ over $S \sim D^{m(\epsilon,\delta)}$: $L_D(A(S)) \le \epsilon$.

Definition: A hypothesis class H is agnostically PAC-learnable if there exists a learning rule A such that $\forall \epsilon, \delta > 0$, $\exists m(\epsilon,\delta)$, such that for every D, w.p. $\ge 1-\delta$ over $S \sim D^{m(\epsilon,\delta)}$: $L_D(A(S)) \le \inf_{h \in H} L_D(h) + \epsilon$.

Sample complexity of a learning rule A: $m_{A,H}(\epsilon,\delta) = \min m$ s.t. for every D, w.p. $\ge 1-\delta$ over $S \sim D^m$, $L_D(A(S)) \le \inf_{h \in H} L_D(h) + \epsilon$.

Sample complexity for learning a hypothesis class: $m_H(\epsilon,\delta) = \min_A m_{A,H}(\epsilon,\delta)$.
Cardinality and Sample Complexity

We saw: all finite hypothesis classes are PAC learnable, with
  $m_H(\epsilon,\delta) \le m_{\mathrm{ERM},H}(\epsilon,\delta) \le O\!\left(\frac{\log|H| + \log(1/\delta)}{\epsilon^2}\right)$

- Are there infinite classes that are PAC learnable?
- Is the bound on $m_H$ always tight? Can $m_H$ be smaller than $\log|H|$?
- Are all classes of the same cardinality equally complex? E.g.:
  $X = \{1,\ldots,100\}$, $H = \{\pm1\}^X$
  $X = \{1,\ldots,2^{100}\}$ (about $10^{30}$ points), $H = \{h_\theta(x) = [[\,x \le \theta\,]] : \theta \in \{1,\ldots,2^{100}\}\}$
The Growth Function

For $C = \{x_1, x_2, \ldots, x_m\} \subseteq X$:
  $\Gamma_H(C) = |\{\, (h(x_1), h(x_2), \ldots, h(x_m)) \in \{\pm1\}^m : h \in H \,\}|$
  $\Gamma_H(m) = \max_{C \subseteq X,\ |C| = m} \Gamma_H(C)$

E.g.:
- $X = \{1,\ldots,100\}$, $H = \{\pm1\}^X$: $\Gamma_H(m) = \min(2^m, 2^{100})$
- $X = \{1,\ldots,2^{100}\}$ (about $10^{30}$ points), $H = \{h_\theta(x) = [[\,x \le \theta\,]] : \theta \in \{1,\ldots,2^{100}\}\}$: $\Gamma_H(m) = \min(m+1, 2^{100})$
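$\Gamma_H$ can be computed by brute force in small cases. A sketch for a scaled-down version of the threshold example ($X = \{1,\ldots,20\}$ instead of $\{1,\ldots,2^{100}\}$):

```python
import itertools

def growth(X, H, m):
    """Gamma_H(m): max over m-subsets C of X of the number of distinct
    behavior vectors (h(x_1), ..., h(x_m)) realized by hypotheses h in H."""
    best = 0
    for C in itertools.combinations(X, m):
        behaviors = {tuple(h(x) for x in C) for h in H}
        best = max(best, len(behaviors))
    return best

# Scaled-down thresholds: X = {1, ..., 20}, h_th(x) = +1 iff x <= th.
X = range(1, 21)
H = [lambda x, th=th: +1 if x <= th else -1 for th in range(0, 21)]

for m in range(1, 6):
    print(f"Gamma_H({m}) = {growth(X, H, m)}   (expect m+1 = {m + 1})")
```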
Uniform Convergence using the Growth Function

Theorem: For any hypothesis class H and any D,
  $\Pr_{S\sim D^m}\left[\forall h \in H,\ |L_D(h) - L_S(h)| \le 4\sqrt{\frac{\log\Gamma_H(2m) + \log(2/\delta)}{2m}}\right] \ge 1-\delta$
Proof: homework 1

Conclusion: For any H and any D, w.p. $\ge 1-\delta$ over $S \sim D^m$,
  $L_D(\hat h) \le L_S(\hat h) + 4\sqrt{\frac{\log\Gamma_H(2m) + \log(2/\delta)}{2m}}$
and
  $L_D(\hat h) \le \inf_{h \in H} L_D(h) + 8\sqrt{\frac{\log\Gamma_H(2m) + \log(2/\delta)}{2m}}$
Shattering and VC Dimension

$C = \{x_1,\ldots,x_m\}$ is shattered by H if $\Gamma_H(C) = 2^m$, i.e., we can get all $2^m$ behaviors:
  $\forall (y_1,\ldots,y_m) \in \{\pm1\}^m,\ \exists h \in H$ s.t. $\forall i,\ h(x_i) = y_i$

The VC-dimension of H is the largest number of points that can be shattered by H:
  $\mathrm{VCdim}(H) = \max m$ s.t. $\Gamma_H(m) = 2^m$
If $\Gamma_H(m) = 2^m$ for every m, then $\mathrm{VCdim}(H) = \infty$.
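Both definitions translate directly into a brute-force check over a finite domain. A sketch (the cap max_d is just a safety bound for the search):

```python
import itertools

def shatters(H, C):
    """H shatters C iff all 2^|C| label patterns (h(x_1),...,h(x_m)) occur."""
    return len({tuple(h(x) for x in C) for h in H}) == 2 ** len(C)

def vcdim(H, X, max_d=10):
    """VC dimension of H restricted to a finite domain X, by brute force:
    the largest d such that some d-subset of X is shattered."""
    d = 0
    while d < max_d and any(shatters(H, C)
                            for C in itertools.combinations(X, d + 1)):
        d += 1
    return d

# H = {+/-1}^X on X = {1,...,6}: all 2^6 = 64 functions; VCdim should be |X| = 6.
X = list(range(1, 7))
H = [lambda x, signs=signs: signs[x - 1]
     for signs in itertools.product([+1, -1], repeat=len(X))]
print(vcdim(H, X))   # 6
```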
VC Dimension: Examples

- $X = \{1,\ldots,100\}$, $H = \{\pm1\}^X$: VCdim = 100
- Discrete thresholds: $X = \{1,\ldots,2^{100}\}$ (about $10^{30}$ points), $H = \{h_\theta(x) = [[\,x \le \theta\,]] : \theta \in \{1,\ldots,2^{100}\}\}$: VCdim = 1
- Continuous thresholds: $X = \mathbb{R}$, $H = \{h_\theta(x) = [[\,x < \theta\,]] : \theta \in \mathbb{R}\}$: only one point can be shattered; VCdim = 1
- Intervals: $X = \mathbb{R}$, $H = \{h_{a,b}(x) = [[\,a \le x \le b\,]] : a, b \in \mathbb{R}\}$: can shatter any two points; with three points, can't realize the labeling (+, -, +). VCdim = 2
- Axis-aligned rectangles: can shatter 1, 2, 3 points. Some sets of 4 points can't be shattered (is this a problem? No: shattering only requires some set of that size to be shattered). Some sets of 4 points can be shattered. Can't shatter 5 points. VCdim = 4
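The threshold and interval claims can be verified mechanically with the same shatters helper (redefined here so the snippet stands alone; the parameter grid is coarse but sufficient for these points):

```python
def shatters(H, C):
    # H shatters C iff all 2^|C| sign patterns appear among (h(x) for x in C).
    return len({tuple(h(x) for x in C) for h in H}) == 2 ** len(C)

grid = [x + 0.5 for x in range(0, 5)]   # enough parameter values for these points

# Thresholds h_theta(x) = [[x < theta]]: VCdim = 1.
thresholds = [lambda x, t=t: +1 if x < t else -1 for t in grid]
print(shatters(thresholds, [1.0]))        # True: one point is shattered
print(shatters(thresholds, [1.0, 2.0]))   # False: the labeling (-, +) is unrealizable

# Intervals h_{a,b}(x) = [[a <= x <= b]]: shatter 2 points, not 3 (no + - +).
intervals = [lambda x, a=a, b=b: +1 if a <= x <= b else -1
             for a in grid for b in grid if a <= b]
print(shatters(intervals, [1.0, 2.0]))       # True
print(shatters(intervals, [1.0, 2.0, 3.0]))  # False: (+, -, +) is unrealizable
```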
Sauer-Shelah-VC Lemma

If $\mathrm{VCdim}(H) = D$, then:
  $\Gamma_H(m) \le \sum_{i=0}^{D} \binom{m}{i}$,  and for $m > D$:  $\sum_{i=0}^{D} \binom{m}{i} \le \left(\frac{em}{D}\right)^D$

(Norbert Sauer, Saharon Shelah, Alexey Chervonenkis, Vladimir Vapnik)
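A numeric sanity check of both inequalities (D = 3 is arbitrary):

```python
import math

def sauer_bound(m, D):
    """Sauer-Shelah-VC bound on the growth function: sum_{i=0}^{D} C(m, i)."""
    return sum(math.comb(m, i) for i in range(D + 1))

D = 3
for m in [3, 5, 10, 50]:
    s = sauer_bound(m, D)
    if m > D:
        print(f"m={m:2d}: {s:6d} <= (em/D)^D = {(math.e * m / D) ** D:10.1f}")
    else:
        print(f"m={m:2d}: {s:6d}  (= 2^m: no restriction yet)")
```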
Conclusion: VC Learning Guarantees

Recall: w.p. $\ge 1-\delta$ over $S \sim D^m$,
  $L_D(\hat h) \le L_S(\hat h) + 4\sqrt{\frac{\log\Gamma_H(2m) + \log(2/\delta)}{2m}}$

From Sauer, $\log\Gamma_H(2m) \le \mathrm{VCdim}\cdot\log\frac{2em}{\mathrm{VCdim}} = O(\mathrm{VCdim}\cdot\log m)$. We therefore have: w.p. $\ge 1-\delta$ over $S \sim D^m$,
  $L_D(\hat h) \le L_S(\hat h) + O\left(\sqrt{\frac{\mathrm{VCdim}(H)\,\log m + \log(1/\delta)}{m}}\right)$

With a very complex proof, this can be improved to: w.p. $\ge 1-\delta$ over $S \sim D^m$,
  $L_D(\hat h) \le L_S(\hat h) + O\left(\sqrt{\frac{\mathrm{VCdim}(H) + \log(1/\delta)}{m}}\right)$
VC Learning Guarantees

Conclusion: if $\mathrm{VCdim}(H) < \infty$ then H is agnostically PAC learnable using $\mathrm{ERM}_H$: w.p. $\ge 1-\delta$ over $S \sim D^m$,
  $L_D(\hat h) \le L_S(\hat h) + O\left(\sqrt{\frac{\mathrm{VCdim}(H) + \log(1/\delta)}{m}}\right)$
The sample complexity is bounded by:
  $m(\epsilon,\delta) \le O\left(\frac{\mathrm{VCdim}(H) + \log(1/\delta)}{\epsilon^2}\right)$

- Finite classes are PAC-learnable, with $m_{\mathrm{ERM},H}(\epsilon,\delta) = O\left(\frac{\log|H| + \log(1/\delta)}{\epsilon^2}\right)$
- VC classes are PAC-learnable, with $m_{\mathrm{ERM},H}(\epsilon,\delta) = O\left(\frac{\mathrm{VCdim}(H) + \log(1/\delta)}{\epsilon^2}\right)$

Can a class with infinite VC-dimension be learnable? Can the sample complexity be lower than the VC dimension?
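For the discrete-threshold example from before ($|H| = 2^{100}$ but VCdim = 1), the two bounds differ dramatically; a sketch with constants dropped:

```python
import math

def m_finite(H_size, eps, delta):
    # O() bound via cardinality: (log|H| + log(1/delta)) / eps^2, constants dropped.
    return (math.log(H_size) + math.log(1 / delta)) / eps**2

def m_vc(vcdim, eps, delta):
    # O() bound via VC dimension: (VCdim + log(1/delta)) / eps^2, constants dropped.
    return (vcdim + math.log(1 / delta)) / eps**2

# Discrete thresholds over X = {1, ..., 2^100}: |H| = 2^100 but VCdim = 1.
eps, delta = 0.05, 0.01
print(f"cardinality bound: ~{m_finite(2**100, eps, delta):,.0f} samples")  # ~ 30,000
print(f"VC bound:          ~{m_vc(1, eps, delta):,.0f} samples")           # ~ 2,200
```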
VC-Dimension: More Examples

- Circles in $\mathbb{R}^2$: $H = \{h_{c,r}(x) = [[\,\|x - c\| \le r\,]] : c \in \mathbb{R}^2, r \in \mathbb{R}\}$. Can shatter 3 points.
- Circles and their complements: can shatter 4 points.
- Circles around the origin: $H = \{h_r(x) = [[\,\|x\| \le r\,]] : r \in \mathbb{R}\}$. Can shatter only 1 point.
- Axis-aligned ellipses: $H = \left\{h_{c,r_1,r_2}(x) = \left[\left[\frac{(x_1-c_1)^2}{r_1^2} + \frac{(x_2-c_2)^2}{r_2^2} \le 1\right]\right] : c \in \mathbb{R}^2,\ r_1, r_2 \in \mathbb{R}\right\}$. Can shatter 4 points.
- General ellipses: can shatter 5 points.

Upper bounds?
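For instance, the claim that circles shatter 3 points can be confirmed by brute force over a grid of centers and radii (the specific triangle and grid below are arbitrary choices):

```python
def shatters(H, C):
    # Same helper as before: all 2^|C| sign patterns must be realized.
    return len({tuple(h(x) for x in C) for h in H}) == 2 ** len(C)

# Three points forming a triangle; candidate circles (center, radius) on a grid.
pts = [(0.0, 0.0), (2.0, 0.0), (1.0, 2.0)]
centers = [(i * 0.5, j * 0.5) for i in range(-2, 7) for j in range(-2, 7)]
radii = [k * 0.25 for k in range(0, 20)]
circles = [lambda x, c=c, r=r: +1 if (x[0]-c[0])**2 + (x[1]-c[1])**2 <= r**2 else -1
           for c in centers for r in radii]

print(shatters(circles, pts))   # True: circles shatter these 3 points
```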