
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 3: Learning Theory Sanjeev Arora Elad Hazan

Admin: Exercise 1 due next Tue, in class. Enrolment.

Recap. We have seen: AI by introspection (naïve methods); classification by decision trees; informally: overfitting and generalization.

Agenda: statistical & computational learning theory for learning from examples; the setting; generalization; the ERM algorithm; sample complexity; the fundamental theorem of statistical learning for finite hypothesis classes; the role of optimization in learning from examples.

Wanted: a theory for learning from examples. Captures spam detection, chair classification, ... Generic (doesn't commit to a particular method). Answers our questions from last lecture: overfitting, generalization, sample complexity.

Setting. Input: X = domain of examples (emails, pictures, documents, ...). Output: Y = label space (for this talk, binary Y = {0,1}). Data access model: the learner can obtain i.i.d. samples from D = distribution over (X,Y) (the world). Goal: produce a hypothesis h: X → Y with low generalization error.

Learning from examples
A learning problem: L = (X, Y, c, ℓ)
X = domain of examples (emails, pictures, documents, ...)
Y = label space (for this talk, binary Y = {0,1})
D = distribution over (X,Y) (the world)
Data access model: the learner can obtain i.i.d. samples from D
Concept = mapping c: X → Y
Loss function ℓ: Y × Y → R, such as ℓ(y_1, y_2) = 1[y_1 ≠ y_2]
Goal: produce a hypothesis h: X → Y with low generalization error
err(h) = E_{x∼D}[ℓ(h(x), c(x))]; today: err(h) = Pr_{x∼D}[h(x) ≠ c(x)]
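
As a concrete illustration (not from the slides; the distribution, concept, and hypothesis below are toy assumptions), here is the 0-1 loss and a Monte Carlo estimate of err(h) in Python:

import random

def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 if the prediction disagrees with the label, else 0."""
    return 1 if y_pred != y_true else 0

def estimate_generalization_error(h, sample_from_D, c, n=100_000):
    """Monte Carlo estimate of err(h) = Pr_{x~D}[h(x) != c(x)].

    h: hypothesis X -> Y, c: concept X -> Y,
    sample_from_D: draws one example x from the marginal of D on X.
    """
    draws = (sample_from_D() for _ in range(n))
    return sum(zero_one_loss(h(x), c(x)) for x in draws) / n

# Toy world (assumed): x uniform on [0, 1], the concept is a threshold at 0.5,
# and h uses a slightly wrong threshold, so err(h) is about 0.1.
c = lambda x: 1 if x >= 0.5 else 0
h = lambda x: 1 if x >= 0.6 else 0
print(estimate_generalization_error(h, random.random, c))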

No free lunch
Learning algorithm: 1. observe m samples S = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} ∼ D from the world distribution. 2. Produce a hypothesis h: X → Y = {0,1}.
Extreme overfitting: h(x) = y_i if x = x_i for some sample point, and 0 otherwise.
No-free-lunch theorem: Consider any domain X of size |X| = 2m, and any algorithm A which outputs a hypothesis h: X → {0,1} given an m-element sample S. Then there exists a concept c: X → {0,1} and a distribution D such that err(c) = 0, yet with probability at least 1/7, err(A(S)) ≥ 1/8.
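
A toy sketch (the domain size, sample size, and random concept are assumptions for illustration) of the "extreme overfitting" memorizer: it has zero empirical error, yet its generalization error stays far from zero, in the spirit of the no-free-lunch theorem.

import random

def memorizer(samples):
    """Extreme overfitting: h(x) = y_i if x equals a training point x_i, else 0."""
    table = dict(samples)
    return lambda x: table.get(x, 0)

# Toy setup: X = {0, ..., 2m-1}, uniform distribution D, and a concept chosen
# by a fair coin flip on each point.
random.seed(1)
m = 50
X = list(range(2 * m))
concept = {x: random.randint(0, 1) for x in X}

train = [(x, concept[x]) for x in random.choices(X, k=m)]   # m i.i.d. samples
h = memorizer(train)

train_err = sum(h(x) != y for x, y in train) / len(train)   # always 0
true_err = sum(h(x) != concept[x] for x in X) / len(X)      # err(h) under uniform D
print(train_err, true_err)                                  # 0.0 and typically ~0.3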

Definition: learning from examples w.r.t. a hypothesis class
A learning problem: L = (X, Y, c, ℓ, H)
X = domain of examples (emails, pictures, documents, ...)
Y = label space (for this talk, binary Y = {0,1})
D = distribution over (X,Y) (the world)
Data access model: the learner can obtain i.i.d. samples from D
Concept = mapping c: X → Y
Loss function ℓ: Y × Y → R, such as ℓ(y_1, y_2) = 1[y_1 ≠ y_2]
H = class of hypotheses: H ⊆ {X → Y}
Goal: produce a hypothesis h ∈ H with low generalization error err(h) = E_{x∼D}[ℓ(h(x), c(x))]

Realizability
A learning problem L = (X, Y, c, ℓ, H) is realizable if there exists a hypothesis in H with zero generalization error: ∃ h ∈ H s.t. err(h) = E_{x∼D}[ℓ(h(x), c(x))] = 0.
If this is not satisfied, agnostic (competitive) learning: try to minimize the generalization error relative to the best in class, i.e. minimize err(h) − min_{h' ∈ H} err(h').

Examples
Apple factory: apples are sweet (box) or sour (for export). Features of apples: weight and diameter, each distributed uniformly at random in a certain range. The concept: X, Y, c = ? Reasonable loss function? Reasonable hypothesis class? Realizable?
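
One plausible modeling choice for the apple factory (a hypothetical answer, not the slides'): X = (weight in grams, diameter in cm), Y = {0: sour, 1: sweet}, 0-1 loss, and H = axis-aligned rectangles over the two features; the problem is realizable if the true concept really is such a rectangle. A minimal sketch:

def rectangle_hypothesis(w_lo, w_hi, d_lo, d_hi):
    """h(x) = 1 (sweet) iff weight and diameter both lie inside the rectangle."""
    return lambda x: 1 if (w_lo <= x[0] <= w_hi and d_lo <= x[1] <= d_hi) else 0

h = rectangle_hypothesis(150, 300, 6, 12)   # illustrative, made-up bounds
print(h((200, 8)), h((390, 18)))            # 1, 0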

Examples
MPG example from last lecture: X, Y, c, H = ?

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          low           low         low     high          75to78     asia
bad   6          medium        medium      medium  medium        70to74     america
bad   4          medium        medium      medium  low           75to78     europe
bad   8          high          high        high    low           70to74     america
bad   6          medium        medium      medium  medium        70to74     america
bad   4          low           medium      low     medium        70to74     asia
bad   4          low           medium      low     low           70to74     asia
bad   8          high          high        high    low           75to78     america
...   (rows omitted)
bad   8          high          high        high    low           70to74     america
good  8          high          medium      high    high          79to83     america
bad   8          high          high        high    low           75to78     america
good  4          low           low         low     low           79to83     america
bad   6          medium        medium      medium  high          75to78     america
good  4          medium        low         low     low           79to83     america
good  4          low           low         medium  high          79to83     america
bad   8          high          high        high    low           70to74     america
good  4          low           medium      low     medium        75to78     europe
bad   5          medium        medium      medium  medium        75to78     europe

Examples
Spam detection: X, Y, c = ? Reasonable loss function? Reasonable hypothesis class? Realizable? Same questions for chair classification.

PAC learnability
Learning algorithm: 1. observe m samples S = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} ∼ D from the world distribution. 2. Produce a hypothesis h ∈ H.
A learning problem L = (X, Y, H, c, ℓ) is PAC-learnable if there exists a learning algorithm such that for every δ, ε > 0 there exists m = f(ε, δ, H) < ∞ such that, after observing a sample S of size |S| = m, the algorithm returns a hypothesis h ∈ H for which, with probability at least 1 − δ, it holds that err(h) ≤ ε.
f(ε, δ, H) is called the sample complexity.

Agnostic PAC learnability
A learning problem L = (X, Y, H, c, ℓ) is agnostically PAC-learnable if there exists a learning algorithm such that for every δ, ε > 0 there exists m = f(ε, δ, H) < ∞ such that, after observing a sample S of size |S| = m, the algorithm returns a hypothesis h ∈ H for which, with probability at least 1 − δ, it holds that err(h) ≤ min_{h' ∈ H} err(h') + ε.

How good is this definition?
Generality: any distribution, domain, label set, loss function. The distribution is unknown to the learner. No algorithmic component. Realizable?
Uncovered: adversarial / changing environments; algorithms.

What can be PAC learned, and how?
Natural classes? Algorithms for PAC learning?
ERM algorithm (Empirical Risk Minimization):
1. Sample |S| = f(ε, δ, H) labelled examples from D.
2. Find and return the hypothesis that minimizes the loss on these examples: h_ERM = argmin_{h ∈ H} err_S(h), where err_S(h) = (1/m) Σ_{i=1..m} ℓ(h(x_i), y_i).
Computational efficiency? Decision trees?
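
A minimal Python sketch of the ERM rule over a small finite class (the threshold class, the grid, and the target concept are illustrative assumptions, not part of the lecture):

import random

def empirical_error(h, S):
    """err_S(h) = (1/m) * sum of 0-1 losses of h on the sample S."""
    return sum(h(x) != y for x, y in S) / len(S)

def erm(H, S):
    """Empirical Risk Minimization: return a hypothesis in H minimizing err_S."""
    return min(H, key=lambda h: empirical_error(h, S))

# Finite hypothesis class (assumed): thresholds on a grid of 101 points.
H = [(lambda t: (lambda x: 1 if x >= t else 0))(i / 100) for i in range(101)]

# World (assumed): x uniform on [0, 1], labeled by a threshold concept at 0.37,
# so the problem is realizable within H.
c = lambda x: 1 if x >= 0.37 else 0
S = [(x, c(x)) for x in (random.random() for _ in range(200))]

h_erm = erm(H, S)
print(empirical_error(h_erm, S))   # 0.0 in this realizable case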

Fundamental theorem for finite H
Theorem: Every realizable learning problem L = (X, Y, H, c, ℓ) with finite H is PAC-learnable with sample complexity |S| = O((log|H| + log(1/δ)) / ε) using the ERM algorithm.
VERY powerful, examples coming up. Explicitly algorithmic. Captures overfitting, generalization... Infinite hypothesis classes? But first, the proof.

Fundamental theorem - proof
Proof:
1. Let S = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} ∼ D be the sample from the world, of size m = (log|H| + log(1/δ)) / ε.
2. ERM does NOT learn L only if err(h_ERM) > ε with probability larger than δ.
3. We know that err_S(h_ERM) = 0 (why? by realizability some h ∈ H is consistent with S, and ERM returns a minimizer of the empirical error).
Then:
Pr[err(h_ERM) > ε] ≤ Pr[∃ h ∈ H s.t. err(h) > ε and err_S(h) = 0]
  ≤ Σ_{h ∈ H: err(h) > ε} Pr[err_S(h) = 0]                        (union bound)
  = Σ_{h: err(h) > ε} Pr[∀ i: h(x_i) = y_i | err(h) > ε]
  = Σ_{h: err(h) > ε} Π_{i=1..m} Pr[h(x_i) = y_i | err(h) > ε]    (independence)
  ≤ |H| (1 − ε)^m ≤ |H| e^{−εm} = |H| e^{−(log|H| + log(1/δ))} = δ.
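
A small simulation (all parameters are illustrative assumptions) that instantiates the proof's setting, a finite class of random labelings on a finite domain with a realizable target, and checks that ERM's failure probability Pr[err(h_ERM) > ε] indeed stays below δ when m matches the bound:

import math, random

eps, delta = 0.1, 0.05
random.seed(0)
domain = list(range(200))                           # finite domain X (assumed)
H = {f"h{j}": {x: random.randint(0, 1) for x in domain} for j in range(50)}
target = "h0"                                       # realizable: the concept is in H

# Sample size from the theorem: m = (ln|H| + ln(1/delta)) / eps.
m = math.ceil((math.log(len(H)) + math.log(1 / delta)) / eps)

def true_err(name):
    return sum(H[name][x] != H[target][x] for x in domain) / len(domain)

trials, failures = 2000, 0
for _ in range(trials):
    S = [(x, H[target][x]) for x in random.choices(domain, k=m)]
    # ERM over the finite class: pick any hypothesis with minimal empirical error.
    h = min(H, key=lambda name: sum(H[name][x] != y for x, y in S))
    if true_err(h) > eps:
        failures += 1
print(m, failures / trials)   # failure rate should be well below delta = 0.05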

Examples, statistical learning theorem
Theorem: Every realizable learning problem L = (X, Y, H, c, ℓ) with finite H is PAC-learnable with sample complexity |S| = O((log|H| + log(1/δ)) / ε) using the ERM algorithm.
Apple factory: weight is measured in grams, on a 100-400 scale; diameter in centimeters, 3-20.
Spam classification using decision trees of size 20 nodes, over 200K words.
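
To get a feel for the numbers in the spam example, here is a rough plug-in calculation; the count of 20-node decision trees over 200K word features is a crude, assumed over-estimate rather than the slides' answer:

import math

def sample_complexity(log_H, eps, delta):
    """m = (ln|H| + ln(1/delta)) / eps, taking the natural log of |H| directly."""
    return math.ceil((log_H + math.log(1 / delta)) / eps)

# Crude over-count (assumed): a binary tree with at most 20 internal nodes, each
# testing one of 200,000 word features, with binary labels on its 21 leaves.
log_H = 20 * math.log(200_000) + 21 * math.log(2)

print(sample_complexity(log_H, eps=0.01, delta=0.01))   # roughly 26,000 examples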

We stopped here. To be continued next class + start optimization!