Learning Theory
Aarti Singh and Barnabas Poczos
Machine Learning 10-701/15-781, Apr 17, 2014
Slides courtesy: Carlos Guestrin

Learning Theory
We have explored many ways of learning from data. But how good is our classifier, really? How much data do I need to make it good enough?

A simple setting
Classification with m i.i.d. data points and a finite number of possible hypotheses (e.g., decision trees of depth d). A learner finds a hypothesis h that is consistent with the training data, i.e. gets zero error in training: error_train(h) = 0.
What is the probability that h has more than ε true error, error_true(h) ≥ ε? Even if h makes zero errors on the training data, it may still make errors on test data.

How likely is a bad hypothesis to get m data points right?
Consider a bad hypothesis h, i.e. one with error_true(h) ≥ ε.
Probability that h gets one data point right: ≤ 1 − ε.
Probability that h gets m data points right: ≤ (1 − ε)^m.
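
A minimal numeric sketch of this decay; ε and the values of m below are arbitrary illustrative choices:

```python
# The chance that a single "bad" hypothesis (true error >= epsilon) looks
# perfect on m i.i.d. training points shrinks exponentially with m.
epsilon = 0.1
for m in [10, 50, 100, 500]:
    p_consistent = (1 - epsilon) ** m  # <= (1 - eps)^m by the argument above
    print(f"m = {m:4d}: P(bad h consistent with all m points) <= {p_consistent:.2e}")
```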

How likely is a learner to pick a bad hypothesis?
Usually there are many (say k) bad hypotheses in the class: h_1, h_2, …, h_k s.t. error_true(h_i) ≥ ε for i = 1, …, k.
Probability that the learner picks a bad hypothesis = probability that some bad hypothesis is consistent with the m data points:
Prob(h_1 consistent with m data points OR h_2 consistent with m data points OR … OR h_k consistent with m data points)
≤ Prob(h_1 consistent with m data points) + Prob(h_2 consistent with m data points) + … + Prob(h_k consistent with m data points)   [union bound: loose, but it works]
≤ k (1 − ε)^m.

How likely is a learner to pick a bad hypothesis?
Usually there are many, many (say k) bad hypotheses in the class: h_1, h_2, …, h_k s.t. error_true(h_i) ≥ ε for i = 1, …, k.
Probability that the learner picks a bad hypothesis:
≤ k (1 − ε)^m ≤ |H| (1 − ε)^m ≤ |H| e^(−εm),
where |H| is the size of the hypothesis class and we used (1 − ε)^m ≤ e^(−εm).
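
A quick sanity check, with arbitrary ε and m values, of the inequality (1 − ε)^m ≤ e^(−εm) used in the last step:

```python
import math

# Check (1 - eps)^m <= exp(-eps * m) for a few arbitrary epsilon / m pairs.
for epsilon in [0.01, 0.1, 0.3]:
    for m in [10, 100]:
        lhs = (1 - epsilon) ** m
        rhs = math.exp(-epsilon * m)
        print(f"eps = {epsilon:4.2f}, m = {m:3d}: "
              f"(1-eps)^m = {lhs:.3e} <= e^(-eps*m) = {rhs:.3e} -> {lhs <= rhs}")
```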

PAC (Probably Approximately Correct) bound
Theorem [Haussler, '88]: If the hypothesis space H is finite and D is a dataset of m i.i.d. samples, then for 0 < ε < 1 and for any learned hypothesis h that is consistent with the training data:
P(error_true(h) > ε) ≤ |H| e^(−mε) ≤ δ.
Equivalently, with probability ≥ 1 − δ over the training sample, error_true(h) ≤ ε whenever |H| e^(−mε) ≤ δ.
Important: the PAC bound holds for all consistent h, but it doesn't guarantee that the algorithm finds the best h!

Using a PAC bound
Require |H| e^(−mε) ≤ δ.
Given ε and δ, this yields the sample complexity: m ≥ (ln|H| + ln(1/δ)) / ε training data points.
Given m and δ, this yields the error bound: ε ≥ (ln|H| + ln(1/δ)) / m.
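
A small sketch of these two uses of the bound; the hypothesis-class size, ε, δ, and m below are made-up illustrative inputs, not values from the slides:

```python
import math

def sample_complexity(h_size, epsilon, delta):
    """Smallest m with |H| e^(-m eps) <= delta, i.e. m >= (ln|H| + ln(1/delta)) / eps."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

def error_bound(h_size, m, delta):
    """Smallest eps guaranteed w.p. >= 1 - delta: eps >= (ln|H| + ln(1/delta)) / m."""
    return (math.log(h_size) + math.log(1 / delta)) / m

print(sample_complexity(h_size=10**6, epsilon=0.1, delta=0.05))  # ~169 samples
print(error_bound(h_size=10**6, m=1000, delta=0.05))             # ~0.017 true error
```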

Limitations of the Haussler '88 bound
It assumes a consistent classifier: an h with zero error in training, error_train(h) = 0.
It depends on the size of the hypothesis space: m ≥ (ln|H| + ln(1/δ)) / ε. What if |H| is too big, or H is continuous?

What if our classifier does not have zero error on the training data?
A learner with zero training error may still make mistakes on the test set. What about a learner with error_train(h) ≠ 0 on the training set?
The error of a hypothesis is like estimating the parameter of a coin!
error_true(h) := P(h(X) ≠ Y), just like the coin's bias P(heads) =: θ.
error_train(h) := (1/m) Σ_i 1{h(x_i) ≠ y_i}, just like the empirical estimate (1/m) Σ_i Z_i =: θ̂ from m flips Z_i.

Hoeffding's bound for a single hypothesis
Consider m i.i.d. flips x_1, …, x_m with x_i ∈ {0,1} of a coin with parameter θ, and let θ̂ = (1/m) Σ_i x_i. For 0 < ε < 1:
P(|θ − θ̂| > ε) ≤ 2e^(−2mε²).
For a single hypothesis h:
P(|error_true(h) − error_train(h)| > ε) ≤ 2e^(−2mε²).
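
A small simulation sketch of Hoeffding's inequality for a single coin; θ, m, ε, and the number of trials are arbitrary choices:

```python
import math
import random

random.seed(0)
theta, m, epsilon, trials = 0.3, 100, 0.1, 10_000

# Estimate P(|theta_hat - theta| > eps) by repeated sampling and compare to the bound.
violations = 0
for _ in range(trials):
    theta_hat = sum(1 if random.random() < theta else 0 for _ in range(m)) / m
    violations += abs(theta_hat - theta) > epsilon

print("empirical P(|theta_hat - theta| > eps):", violations / trials)
print("Hoeffding bound 2 exp(-2 m eps^2):     ", 2 * math.exp(-2 * m * epsilon ** 2))
```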

PAC bound for |H| hypotheses
For each hypothesis h_i: P(|error_true(h_i) − error_train(h_i)| > ε) ≤ 2e^(−2mε²).
What if we are comparing |H| hypotheses? Union bound again.
Theorem: If the hypothesis space H is finite and D is a dataset of m i.i.d. samples, then for 0 < ε < 1 and for any learned hypothesis h ∈ H:
P(|error_true(h) − error_train(h)| > ε) ≤ 2|H| e^(−2mε²) ≤ δ.
Important: the PAC bound holds for all h ∈ H, but it doesn't guarantee that the algorithm finds the best h!
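
A sketch contrasting the sample complexity implied by this bound with the earlier consistent-learner bound, for made-up |H|, ε, and δ; the 1/ε² versus 1/ε scaling is the point:

```python
import math

H_size, epsilon, delta = 10**6, 0.05, 0.05

# Consistent (zero training error) case: |H| e^(-m eps) <= delta.
m_consistent = math.ceil((math.log(H_size) + math.log(1 / delta)) / epsilon)
# General case via Hoeffding + union bound: 2|H| e^(-2 m eps^2) <= delta.
m_agnostic = math.ceil((math.log(H_size) + math.log(2 / delta)) / (2 * epsilon ** 2))

print("m needed, consistent learner:", m_consistent)  # scales like 1/eps
print("m needed, agnostic learner:  ", m_agnostic)    # scales like 1/eps^2
```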

PAC bound and the Bias-Variance tradeoff
Setting 2|H| e^(−2mε²) ≤ δ: equivalently, with probability ≥ 1 − δ,
error_true(h) ≤ error_train(h) + √((ln|H| + ln(2/δ)) / (2m)).
For fixed m: a complex hypothesis space gives a small error_train(h) but a large second term; a simple hypothesis space gives a large error_train(h) but a small second term.
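
A sketch of how the "variance" term in this bound grows with |H| for a fixed m; the class sizes, m, and δ are illustrative only:

```python
import math

def complexity_term(h_size, m, delta):
    # sqrt((ln|H| + ln(2/delta)) / (2m)), the second term of the bound above
    return math.sqrt((math.log(h_size) + math.log(2 / delta)) / (2 * m))

m, delta = 1000, 0.05
for h_size in [10, 10**3, 10**6, 10**12]:
    print(f"|H| = {h_size:>16,}: variance term = {complexity_term(h_size, m, delta):.3f}")
```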

What about the size of the hypothesis space?
Sample complexity: solving 2|H| e^(−2mε²) ≤ δ for m gives m ≥ (ln|H| + ln(2/δ)) / (2ε²).
How large is the hypothesis space?

Number of decision trees of depth k
Recursive solution, given n attributes:
H_k = number of decision trees of depth k.
H_0 = 2.
H_k = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees) = n × H_(k−1) × H_(k−1).
Write L_k = log_2 H_k:
L_0 = 1
L_k = log_2 n + 2 L_(k−1) = log_2 n + 2(log_2 n + 2 L_(k−2)) = log_2 n + 2 log_2 n + 2² log_2 n + … + 2^(k−1)(log_2 n + 2 L_0).
So L_k = (2^k − 1)(1 + log_2 n) + 1.
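
A short sketch checking the recursion against the closed form, for a hypothetical number of attributes n (working with L_k = log_2 H_k keeps the numbers manageable):

```python
import math

def log2_trees_of_depth(k, n):
    """L_k via the recursion L_0 = 1, L_k = log2(n) + 2 L_(k-1)."""
    L = 1.0
    for _ in range(k):
        L = math.log2(n) + 2 * L
    return L

n = 10  # illustrative number of attributes
for k in range(1, 6):
    closed_form = (2 ** k - 1) * (1 + math.log2(n)) + 1
    print(f"k = {k}: recursion = {log2_trees_of_depth(k, n):.3f}, "
          f"closed form = {closed_form:.3f}")
```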

PAC bound for decision trees of depth k
Plugging ln|H_k| = ((2^k − 1)(1 + log_2 n) + 1) ln 2 into the sample complexity m ≥ (ln|H_k| + ln(2/δ)) / (2ε²): bad!!! The number of points required is exponential in the depth k!
But for m data points, a decision tree can't get too big: the number of leaves is never more than the number of data points.
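
A sketch of how this requirement blows up with depth, plugging the closed form for ln|H_k| into the sample-complexity expression above; n, ε, and δ are illustrative only:

```python
import math

def m_needed_for_depth(k, n, epsilon, delta):
    ln_H = ((2 ** k - 1) * (1 + math.log2(n)) + 1) * math.log(2)  # ln|H_k|
    return math.ceil((ln_H + math.log(2 / delta)) / (2 * epsilon ** 2))

n, epsilon, delta = 10, 0.1, 0.05
for k in [2, 4, 6, 8, 10]:
    print(f"depth k = {k:2d}: m >= {m_needed_for_depth(k, n, epsilon, delta):,}")
```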

Number of decision trees with k leaves
H_k = number of decision trees with k leaves.
H_1 = 2.
H_k = (# choices of root attribute) × [(# left subtrees with 1 leaf) × (# right subtrees with k−1 leaves) + (# left subtrees with 2 leaves) × (# right subtrees with k−2 leaves) + … + (# left subtrees with k−1 leaves) × (# right subtrees with 1 leaf)],
i.e. H_k = n Σ_{i=1}^{k−1} H_i H_(k−i) = n^(k−1) C_(k−1), where C_(k−1) is the (k−1)-th Catalan number.
Loose bound (using Stirling's approximation): H_k ≤ n^(k−1) 2^(2k−1).
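
A sketch following the slide's count H_k = n^(k−1) C_(k−1) and its loose bound, with an arbitrary n; math.comb gives the Catalan numbers directly:

```python
import math

def catalan(j):
    return math.comb(2 * j, j) // (j + 1)

n = 5  # illustrative number of attributes
for k in range(1, 8):
    H_k = n ** (k - 1) * catalan(k - 1)      # slide's count
    bound = n ** (k - 1) * 2 ** (2 * k - 1)  # loose bound from Stirling
    print(f"k = {k}: H_k = {H_k:>12,}   bound = {bound:>12,}")
```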

Number of decision trees
With k leaves: log_2 H_k ≤ (k − 1) log_2 n + 2k − 1, which is linear in k, so the number of points m needed is linear in the number of leaves.
With depth k: log_2 H_k = (2^k − 1)(1 + log_2 n) + 1, which is exponential in k, so the number of points m needed is exponential in the depth.

PAC bound for decision trees with k leaves: Bias-Variance revisited
With probability ≥ 1 − δ:
error_true(h) ≤ error_train(h) + √(((k − 1) ln n + (2k − 1) ln 2 + ln(2/δ)) / (2m)),
using H_k ≤ n^(k−1) 2^(2k−1).
With k = m: error_train(h) = 0 (no bias), but the second term is large (roughly > ½): lots of variance.
With k < m: error_train(h) > 0 (some bias), but the second term is small (roughly < ½): less variance.
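
A sketch of how the variance term behaves as the number of leaves k approaches m; n, m, and δ are illustrative values, not from the slides:

```python
import math

def variance_term(k, n, m, delta):
    # sqrt(((k-1) ln n + (2k-1) ln 2 + ln(2/delta)) / (2m))
    return math.sqrt(((k - 1) * math.log(n) + (2 * k - 1) * math.log(2)
                      + math.log(2 / delta)) / (2 * m))

n, m, delta = 10, 1000, 0.05
for k in [10, 100, 500, 1000]:
    print(f"k = {k:4d} leaves: variance term = {variance_term(k, n, m, delta):.2f}")
```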

What did we learn from decision trees?
The Bias-Variance tradeoff formalized:
error_true(h) ≤ error_train(h) + √(((k − 1) ln n + (2k − 1) ln 2 + ln(2/δ)) / (2m)).
Moral of the story: the complexity of learning is not measured by the size of the hypothesis space, but by the maximum number of points that allows consistent classification.
Complexity = m: no bias, lots of variance. Complexity lower than m: some bias, less variance.

What about continuous hypothesis spaces?
A continuous hypothesis space has |H| = ∞. Infinite variance???
As with decision trees, we only care about the maximum number of points that can be classified exactly!