Computational and Statistical Learning Theory
TTIC 31120, Prof. Nati Srebro
Lecture 2: PAC Learning and VC Theory I

From Adversarial Online to Statistical

Three reasons to move from the worst-case deterministic setting to the stochastic one:
- Deal with errors: what if the data is not exactly realized by H?
- Avoid non-learnability caused by a very specific, adversarial order of examples.
- Train on a dedicated training set, then predict on the population.

The Statistical Learning Model

Unknown source distribution D over (x, y). D describes reality: what we want to classify, and what it should be classified as, e.g. a joint distribution over (image, character) pairs such as an image of the letter 'b' paired with the label 'b'.
We can think of D as a distribution over x together with the conditional y|x (possibly deterministic, y = f(x)): a distribution over the images we expect to see (we don't expect to see uniformly distributed images), and, for each image, which character it represents.
Or as a distribution over y together with x|y: a distribution over characters ('e' is more likely than '&'), and, for each character, a distribution over possible images of that character.
Goal: find a predictor h with small expected error L_D(h) = P_{(x,y)~D}[h(x) ≠ y] (also called generalization error, risk, or true error), based on a sample S = ((x_1, y_1), (x_2, y_2), ..., (x_m, y_m)) of training points (x_t, y_t) drawn i.i.d. from D (we also write S ~ D^m).

The Statistical Learning Model

Unknown source distribution D over (x, y).
Goal: find a predictor h with small expected error L_D(h) = P_{(x,y)~D}[h(x) ≠ y], based on a sample S = ((x_1, y_1), ..., (x_m, y_m)) of training points (x_t, y_t) drawn i.i.d. from D, i.e. S ~ D^m.
Statistical (batch) learning:
1. Receive a training set S ~ D^m.
2. Learn h = A(S) using a learning rule A: (X × Y)^m → Y^X.
3. Use h on future examples drawn from D, suffering expected error L_D(h).
Main assumption: i.i.d. samples, drawn from the same distribution D on which we will later use the predictor.
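
To make the protocol concrete, here is a minimal sketch in Python, assuming a toy source distribution D (uniform x on [0, 1] with noisy threshold labels) and a simple learning rule over threshold predictors; these specifics are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_D(m):
    """Toy source distribution D: x ~ Uniform[0, 1], y = +1 iff x > 0.5, flipped w.p. 0.1."""
    x = rng.uniform(0, 1, m)
    y = np.where(x > 0.5, 1, -1)
    flip = rng.uniform(0, 1, m) < 0.1
    return x, np.where(flip, -y, y)

def A(S_x, S_y, thetas=np.linspace(0, 1, 101)):
    """A simple learning rule over the threshold class h_theta(x) = sign(x - theta)."""
    errs = [(np.sign(S_x - t) != S_y).mean() for t in thetas]
    return thetas[int(np.argmin(errs))]

S_x, S_y = sample_D(100)                 # 1. receive a training set S ~ D^m
theta = A(S_x, S_y)                      # 2. learn h = A(S)
T_x, T_y = sample_D(100_000)             # 3. fresh draws from D approximate L_D(h)
print("estimated L_D(h):", (np.sign(T_x - theta) != T_y).mean())
```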

Expected vs Empirical Error

What we care about is the expected error L_D(h) = P_{(x,y)~D}[h(x) ≠ y]. Why not just minimize it directly? Because we do not know D.
For a given sample S we can calculate the empirical error L_S(h) = (1/m) Σ_{t=1}^m [[h(x_t) ≠ y_t]].
How do we use the empirical error? Is it a good estimate of the expected error? How good?

The Empirical Error as an Estimator for the Expected Error

How close is the empirical error to the expected error, L_S(h) ≈ L_D(h)?
L_D(h) = P_{(x,y)~D}[h(x) ≠ y] is a number; L_S(h) = (1/m) Σ_{t=1}^m [[h(x_t) ≠ y_t]] is a random variable, with m·L_S(h) ~ Binom(m, L_D(h)) ≈ N( m·L_D(h), m·L_D(h)(1 − L_D(h)) ).
Hoeffding bound on the tail of a Binomial: P_{Z~Binom(m,p)}( |Z − E[Z]| > t ) ≤ 2e^{−2t²/m}.
Conclusion: with probability at least 1 − δ, |L_D(h) − L_S(h)| ≤ √( log(2/δ) / (2m) ).
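
A small simulation of this concentration statement, with assumed toy values (m = 200, L_D(h) = 0.3, δ = 0.05) that are not from the lecture: the empirical error of a fixed predictor is a scaled Binomial, and the fraction of runs deviating from L_D(h) by more than the Hoeffding radius should be at most δ.

```python
import numpy as np

rng = np.random.default_rng(1)

m, L_D, delta = 200, 0.3, 0.05        # assumed toy values
trials = 20_000

# Empirical error of a fixed h: the mean of m Bernoulli(L_D) losses, i.e. Binom(m, L_D) / m.
L_S = rng.binomial(m, L_D, size=trials) / m

eps = np.sqrt(np.log(2 / delta) / (2 * m))   # Hoeffding radius for confidence 1 - delta
print("Hoeffding radius:", eps)
print("fraction of trials with |L_S - L_D| > eps:",
      (np.abs(L_S - L_D) > eps).mean(), "(guaranteed to be at most", delta, ")")
```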

Empirical Risk Minimization

ERM_H(S) = ĥ = argmin_{h∈H} L_S(h)
Can we conclude that, with probability at least 1 − δ, |L_D(ĥ) − L_S(ĥ)| ≤ √( log(2/δ) / (2m) )?
Not directly: the Hoeffding bound applies to a fixed h chosen before seeing the sample, whereas ĥ depends on S.
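
The sketch below illustrates why this is subtle: with a large finite class and a small sample, ERM can drive the empirical error far below the true error, so the fixed-h Hoeffding bound cannot be applied to ĥ as-is. The toy setup (purely random labels and random hypotheses over X = {0, ..., 99}) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: X = {0, ..., 99}, labels are fair coin flips, so L_D(h) = 0.5 for every h.
m, n_hyp = 20, 100_000
S_x = rng.integers(0, 100, m)
S_y = rng.choice([-1, 1], m)

# A finite class H of n_hyp random labelings of X (each row is one hypothesis).
H = rng.choice([-1, 1], size=(n_hyp, 100))

# ERM: pick the hypothesis with the smallest empirical error on S.
emp_errs = (H[:, S_x] != S_y).mean(axis=1)
h_hat = H[np.argmin(emp_errs)]

print("L_S(h_hat):", emp_errs.min())                     # typically far below 0.5 ...
print("L_D(h_hat) = 0.5 (labels are independent of x)")  # ... while the true error stays 0.5
```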

Uniform Convergence and the Union Bound

For each fixed h we have: P_S( |L_S(h) − L_D(h)| ≥ t ) ≤ 2e^{−2mt²}.
And so: P_S( ∃h∈H, |L_S(h) − L_D(h)| ≥ t ) ≤ Σ_{h∈H} P_S( |L_S(h) − L_D(h)| ≥ t ) ≤ |H| · 2e^{−2mt²}.
Theorem: For any hypothesis class H and any D, with probability at least 1 − δ over S ~ D^m, ∀h∈H: |L_D(h) − L_S(h)| ≤ √( (log|H| + log(2/δ)) / (2m) ).
Another way to view the derivation: require P_S( |L_S(h) − L_D(h)| ≥ √( log(2/δ') / (2m) ) ) ≤ δ' for each h, with δ' = δ/|H|; then log(2/δ') = log(2|H|/δ) = log|H| + log(2/δ).

Empirical Risk Minimization

Post-Hoc Guarantee. Theorem: For any H and any D, with probability at least 1 − δ over S ~ D^m,
L_D(ĥ) ≤ L_S(ĥ) + √( (log|H| + log(2/δ)) / (2m) ).
A-Priori Guarantee. Theorem: For any H and any D, with probability at least 1 − δ over S ~ D^m,
L_D(ĥ) ≤ inf_{h∈H} L_D(h) + 2√( (log|H| + log(2/δ)) / (2m) ).
Proof: write ε = √( (log|H| + log(2/δ)) / (2m) ). If indeed ∀h∈H, |L_D(h) − L_S(h)| ≤ ε, then for any h ∈ H:
L_D(ĥ) ≤ L_S(ĥ) + ε ≤ L_S(h) + ε ≤ L_D(h) + 2ε,
where the middle step uses that ĥ minimizes L_S, so L_S(ĥ) ≤ L_S(h) for any h, including the (near-)optimal one.

Post-Hoc Guarantee

Theorem: For any H and any D, with probability at least 1 − δ over S ~ D^m, L_D(ĥ) ≤ L_S(ĥ) + √( (log|H| + log(2/δ)) / (2m) ).
Performance guarantee: without ANY assumptions about D (i.e. about reality), if we somehow find a predictor h with low L_S(h), we are assured (with high probability) that it will perform well on future examples.
Alternatively, use an independent test set S' (e.g. split the available examples into a training set S and a test set S'). From Hoeffding: L_D(A(S)) ≤ L_{S'}(A(S)) + √( log(1/δ) / (2|S'|) ), since A(S) is random but depends only on S, independent of S'.
Even better: use tighter Binomial tail bounds, or, numerically, a Gaussian approximation of the Binomial or an entropy-based bound [see homework].

A-Priori Learning Guarantee

Theorem: For any H and any D, with probability at least 1 − δ over S ~ D^m,
L_D(ĥ) ≤ inf_{h∈H} L_D(h) + 2√( (log|H| + log(2/δ)) / (2m) ),
where the first term is the approximation error and the second is the estimation error.
If we assume, based on our expert knowledge, that there exists a good predictor h* ∈ H, then with enough examples we can learn a predictor that is almost as good, and we know how many examples we will need:
for any δ, ε > 0, using m = 2( log|H| + log(2/δ) ) / ε² samples is enough to ensure L_D(ĥ) ≤ L_D(h*) + ε with probability at least 1 − δ.
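
A small helper computing this sample-size bound; the function name and the example numbers are assumptions for illustration.

```python
import math

def sample_complexity_finite(H_size, eps, delta):
    """m = 2(log|H| + log(2/delta)) / eps^2: samples sufficient (by the a-priori guarantee)
    for L_D(h_hat) <= L_D(h*) + eps with probability at least 1 - delta."""
    return math.ceil(2 * (math.log(H_size) + math.log(2 / delta)) / eps**2)

# Example with assumed numbers: a class of 2^20 hypotheses, eps = 0.05, delta = 0.01.
print(sample_complexity_finite(2**20, eps=0.05, delta=0.01))
```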

Cardinality and Learning

We saw: all finite hypothesis classes are learnable. If we assume there is a good predictor in the class, then with enough samples we will be able to learn it (fairly powerful: this includes, e.g., the class of all 100-line programs). The sample complexity scales with log|H|.
Is cardinality the only thing controlling learnability and sample complexity? Is this sample complexity bound always tight? Are all classes of the same cardinality equally complex? Are there infinite classes that are learnable?

Probably Approximately Correct (PAC) Learning [Leslie Valiant]

Definition: A hypothesis class H is PAC-learnable (in the realizable case) if there exists a learning rule A such that for all ε, δ > 0 there is a sample size m(ε, δ) such that for every D with L_D(h) = 0 for some h ∈ H (i.e. D is realizable by H), with probability at least 1 − δ over S ~ D^{m(ε,δ)}, L_D(A(S)) ≤ ε.
Definition: A hypothesis class H is agnostically PAC-learnable if there exists a learning rule A such that for all ε, δ > 0 there is a sample size m(ε, δ) such that for every D, with probability at least 1 − δ over S ~ D^{m(ε,δ)}, L_D(A(S)) ≤ inf_{h∈H} L_D(h) + ε.

Probably Approximately Correct (PAC), continued

With the two definitions above, we can also define sample complexities.
Sample complexity of a learning rule A for H: m_{A,H}(ε, δ) = min{ m : ∀D, with probability at least 1 − δ over S ~ D^m, L_D(A(S)) ≤ inf_{h∈H} L_D(h) + ε }.
Sample complexity of learning a hypothesis class: m_H(ε, δ) = min_A m_{A,H}(ε, δ).

Cardinality and Sample Complexity

We saw: all finite hypothesis classes are PAC-learnable, with m_H(ε, δ) ≤ m_{ERM,H}(ε, δ) ≤ O( (log|H| + log(1/δ)) / ε² ).
Are there infinite classes that are PAC-learnable? Is the bound on m_H always tight? Can m_H be smaller than log|H|? Are all classes of the same cardinality equally complex?
E.g. X = {1, ..., 100} with H = {±1}^X, versus X = {1, ..., 2^100} (about 10^30) with H = { h_θ(x) = [[x ≤ θ]] : θ ∈ {1, ..., 2^100} }.

The Growth Function

For C = (x_1, x_2, ..., x_m) ∈ X^m:
Γ_H(C) = | { (h(x_1), h(x_2), ..., h(x_m)) ∈ {±1}^m : h ∈ H } |
Γ_H(m) = max_{C ∈ X^m} Γ_H(C)
E.g. X = {1, ..., 100}, H = {±1}^X: Γ_H(m) = min(2^m, 2^100).
X = {1, ..., 2^100} (about 10^30), H = { [[x ≤ θ]] : θ ∈ {1, ..., 2^100} }: Γ_H(m) = min(m + 1, 2^100).
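
A brute-force computation of Γ_H(C) for the discrete-threshold example, enumerating the distinct behavior vectors on a fixed sample C; the small range of θ values stands in for {1, ..., 2^100} and is an assumption for illustration.

```python
import numpy as np

def growth_on_sample(C, thetas):
    """Gamma_H(C): number of distinct labelings of the points in C realized by
    the threshold class h_theta(x) = +1 iff x <= theta."""
    C = np.asarray(C)
    behaviors = {tuple(np.where(C <= t, 1, -1)) for t in thetas}
    return len(behaviors)

C = [3, 17, 42, 99]                   # a sample of m = 4 points
thetas = range(1, 2**10 + 1)          # small stand-in for theta in {1, ..., 2^100}
print(growth_on_sample(C, thetas))    # prints 5 = m + 1, matching Gamma_H(m) = min(m + 1, ...)
```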

Uniform Convergence using the Growth Function

Theorem: For any hypothesis class H and any D, with probability at least 1 − δ over S ~ D^m,
∀h∈H: |L_D(h) − L_S(h)| ≤ √( 4( log Γ_H(2m) + log(2/δ) ) / (2m) ).
Proof: Homework 1.
Conclusion: For any H and any D, with probability at least 1 − δ over S ~ D^m,
L_D(ĥ) ≤ L_S(ĥ) + √( 4( log Γ_H(2m) + log(2/δ) ) / (2m) )
and L_D(ĥ) ≤ inf_{h∈H} L_D(h) + √( 8( log Γ_H(2m) + log(2/δ) ) / m ).

Shattering and VC Dimension

A set C = (x_1, ..., x_m) is shattered by H if Γ_H(C) = 2^m, i.e. we can realize all 2^m behaviors: ∀ y_1, ..., y_m ∈ {±1}, ∃h∈H s.t. ∀i, h(x_i) = y_i.
The VC dimension of H is the largest number of points that can be shattered by H: VCdim(H) = max{ m : Γ_H(m) = 2^m }.
If Γ_H(m) = 2^m for every m, then VCdim(H) = ∞.

VC Dimension: Examples

- X = {1, ..., 100}, H = {±1}^X: VCdim = 100.
- Discrete thresholds: X = {1, ..., 2^100} (about 10^30), H = { [[x ≤ θ]] : θ ∈ {1, ..., 2^100} }: VCdim = 1.
- Continuous thresholds: X = R, H = { h_θ(x) = [[x < θ]] : θ ∈ R }: only one point can be shattered, so VCdim = 1.
- Intervals: X = R, H = { h_{a,b}(x) = [[a ≤ x ≤ b]] : a, b ∈ R }: can shatter any two points; with three points we cannot realize the labeling +, −, +; VCdim = 2.
- Axis-aligned rectangles: can shatter 1, 2, 3 points. Some sets of 4 points cannot be shattered; is this a problem? No: shattering some set of a given size is enough, and some sets of 4 points can be shattered. No set of 5 points can be shattered, so VCdim = 4.
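
A brute-force shattering check for the interval class, as a sanity check on the examples above; the helper name and the test points are assumptions.

```python
def interval_shatters(points):
    """Check by brute force whether the interval class { [a, b] } shatters the given 1-D points.
    On a finite point set, every achievable labeling is achieved by an interval whose
    endpoints are themselves sample points (plus the empty interval)."""
    pts = sorted(points)
    labelings = {tuple(+1 if a <= x <= b else -1 for x in pts)
                 for a in pts for b in pts}
    labelings.add(tuple(-1 for _ in pts))        # the empty interval
    return len(labelings) == 2 ** len(pts)       # are all sign patterns realized?

print(interval_shatters([1.0, 2.0]))             # True: any two distinct points are shattered
print(interval_shatters([1.0, 2.0, 3.0]))        # False: the labeling +, -, + is impossible
```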

Sauer-Shelah-VC Lemma

If VCdim(H) = D, then Γ_H(m) ≤ Σ_{i=0}^{D} (m choose i), and for m > D, Γ_H(m) ≤ (em/D)^D.
[Norbert Sauer, Saharon Shelah, Alexey Chervonenkis, Vladimir Vapnik]
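
A quick numeric check of the two forms of the bound, with assumed example values m = 100 and D = 5:

```python
import math

def sauer_bound(m, D):
    """Sum_{i=0}^{D} C(m, i): the Sauer-Shelah-VC bound on Gamma_H(m) when VCdim(H) = D."""
    return sum(math.comb(m, i) for i in range(D + 1))

def sauer_bound_loose(m, D):
    """(e*m/D)^D: the looser closed form, valid for m > D."""
    return (math.e * m / D) ** D

m, D = 100, 5
print(sauer_bound(m, D), "<=", sauer_bound_loose(m, D), "<<", 2 ** m)
```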

Conclusion: VC Learning Guarantees

Recall: with probability at least 1 − δ over S ~ D^m, L_D(ĥ) ≤ L_S(ĥ) + √( 4( log Γ_H(2m) + log(2/δ) ) / (2m) ).
From Sauer, log Γ_H(2m) ≤ VCdim · log( 2em / VCdim ) = O( VCdim · log m ).
We therefore have: with probability at least 1 − δ over S ~ D^m, L_D(ĥ) ≤ L_S(ĥ) + O( √( ( VCdim(H) · log m + log(1/δ) ) / m ) ).
With a very complex proof, this can be improved to: with probability at least 1 − δ over S ~ D^m, L_D(ĥ) ≤ L_S(ĥ) + O( √( ( VCdim(H) + log(1/δ) ) / m ) ).

VC Learning Guarantees

Conclusion: If VCdim(H) < ∞, then H is agnostically PAC-learnable using ERM_H:
with probability at least 1 − δ over S ~ D^m, L_D(ĥ) ≤ L_S(ĥ) + O( √( ( VCdim(H) + log(1/δ) ) / m ) ),
and the sample complexity is bounded by m(ε, δ) ≤ O( ( VCdim(H) + log(1/δ) ) / ε² ).
Finite classes are PAC-learnable, with m_{ERM,H}(ε, δ) = O( ( log|H| + log(1/δ) ) / ε² ).
VC classes are PAC-learnable, with m_{ERM,H}(ε, δ) = O( ( VCdim(H) + log(1/δ) ) / ε² ).
Can a class with infinite VC dimension be learnable? Can the sample complexity be lower than the VC dimension?
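
Comparing the two sample-complexity bounds on the running discrete-threshold example; the constants suppressed by the O(·) are taken to be 1 here, which is an assumption for illustration.

```python
import math

def m_finite(log_H, eps, delta):
    """(log|H| + log(1/delta)) / eps^2: the finite-class bound with its constant set to 1."""
    return math.ceil((log_H + math.log(1 / delta)) / eps**2)

def m_vc(vcdim, eps, delta):
    """(VCdim(H) + log(1/delta)) / eps^2: the VC bound with its constant set to 1."""
    return math.ceil((vcdim + math.log(1 / delta)) / eps**2)

# Discrete thresholds over X = {1, ..., 2^100}: |H| = 2^100, but VCdim = 1.
eps, delta = 0.05, 0.01
print("finite-class bound:", m_finite(100 * math.log(2), eps, delta))
print("VC bound:          ", m_vc(1, eps, delta))
```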

VC Dimension: More Examples

- Circles in R²: H = { h_{c,r}(x) = [[ ||x − c|| ≤ r ]] : c ∈ R², r ∈ R }: can shatter 3 points.
- Circles and their complements: can shatter 4 points.
- Circles around the origin: H = { h_r(x) = [[ ||x|| ≤ r ]] : r ∈ R }: can shatter only 1 point.
- Axis-aligned ellipses: H = { h_{c,r_1,r_2}(x) = [[ (x_1 − c_1)²/r_1² + (x_2 − c_2)²/r_2² ≤ 1 ]] : c ∈ R², r_1, r_2 ∈ R }: can shatter 4 points.
- General ellipses: can shatter 5 points.
What are the corresponding upper bounds on the VC dimensions?