Learning Theory. Machine Learning CSE546, Carlos Guestrin, University of Washington. November 25, 2013.

What now?
- We have explored many ways of learning from data.
- But:
  - How good is our classifier, really?
  - How much data do I need to make it good enough?

A simple setting
- Classification: N data points, and a finite number of possible hypotheses (e.g., decision trees of depth d).
- A learner finds a hypothesis h that is consistent with the training data, i.e., gets zero training error: error_train(h) = 0.
- What is the probability that h has more than ε true error, error_true(h) ≥ ε?

How likely is a bad hypothesis to get N data points right?
- A hypothesis h that is consistent with the training data got N i.i.d. points right; h is "bad" if it gets all this data right but has high true error.
- Prob. that an h with error_true(h) ≥ ε gets one data point right: at most 1 − ε.
- Prob. that an h with error_true(h) ≥ ε gets N data points right: at most (1 − ε)^N (a numeric sketch follows below).
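A minimal sketch (not part of the original slides) that checks the (1 − ε)^N reasoning empirically: it simulates a hypothesis with true error ε and estimates how often it happens to be consistent with N i.i.d. points. The values eps = 0.1 and N = 50 are arbitrary illustrative choices.

```python
import numpy as np

# A hypothesis with true error eps classifies each i.i.d. point correctly
# with probability 1 - eps, so it is consistent with all N points with
# probability (1 - eps)**N.
eps, N = 0.1, 50          # illustrative values, not from the slides
analytic = (1 - eps) ** N

# Monte Carlo check: draw N Bernoulli "mistake" indicators per trial and
# count the trials in which the bad hypothesis makes zero mistakes.
rng = np.random.default_rng(0)
trials = 100_000
mistakes = rng.random((trials, N)) < eps
empirical = (~mistakes.any(axis=1)).mean()

print(f"(1 - eps)^N = {analytic:.4f}, Monte Carlo estimate = {empirical:.4f}")
```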

But there are many possible hypotheses that are consistent with the training data.

How likely is the learner to pick a bad hypothesis?
- Prob. that an h with error_true(h) ≥ ε gets N data points right: at most (1 − ε)^N.
- There are k hypotheses consistent with the data. How likely is the learner to pick a bad one?

Union bound
- P(A or B or C or D or ...) ≤ P(A) + P(B) + P(C) + P(D) + ...

How likely is the learner to pick a bad hypothesis?
- Prob. that a particular h with error_true(h) ≥ ε gets N data points right: at most (1 − ε)^N.
- There are k hypotheses consistent with the data. How likely is it that the learner picks a bad one out of these k choices? By the union bound, at most k(1 − ε)^N.

Generalization error in finite hypothesis spaces [Haussler '88]
- Theorem: Hypothesis space H finite, dataset D with N i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent on the training data,

    P(error_true(h) > ε) ≤ |H| e^{-Nε}

Using a PAC bound
- Typically, two use cases:
  - Case 1: pick ε and δ, solve for N.
  - Case 2: pick N and δ, solve for ε.
- Setting |H| e^{-Nε} ≤ δ and solving gives N ≥ (ln|H| + ln(1/δ)) / ε, or equivalently ε ≥ (ln|H| + ln(1/δ)) / N. (A sketch of both cases follows below.)
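A small sketch of the two use cases, assuming the consistent-learner bound |H| e^{-Nε} ≤ δ above; the function names sample_size and error_bound and the example |H| = 10^6 are mine, chosen only for illustration.

```python
import math

def sample_size(H_size, eps, delta):
    """Use case 1: pick eps and delta; solve |H| e^{-N eps} <= delta for N."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def error_bound(H_size, N, delta):
    """Use case 2: pick N and delta; solve |H| e^{-N eps} <= delta for eps."""
    return (math.log(H_size) + math.log(1 / delta)) / N

# Hypothetical example: a finite hypothesis class with |H| = 10**6.
H_size = 10 ** 6
print(sample_size(H_size, eps=0.1, delta=0.05))   # N needed for (eps, delta)
print(error_bound(H_size, N=1000, delta=0.05))    # eps guaranteed with N points
```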

Summary: Generalization error in finite hypothesis spaces [Haussler '88]
- Theorem: Hypothesis space H finite, dataset D with N i.i.d. samples, 0 < ε < 1: for any learned hypothesis h that is consistent on the training data,

    P(error_true(h) > ε) ≤ |H| e^{-Nε}

- Even if h makes zero errors on the training data, it may make errors on the test data.

Limitations of the Haussler '88 bound
- Consistent classifier: the bound only applies when h has zero training error.
- Size of hypothesis space: the bound depends on |H|, which may be huge or infinite.

What if our classifier does not have zero error on the training data?
- A learner with zero training error may still make mistakes on the test set.
- What about a learner with training error error_train(h) on the training set?

Simpler question: what's the expected error of a hypothesis?
- The error of a hypothesis is like estimating the parameter of a coin!
- Chernoff bound: for N i.i.d. coin flips x_1, …, x_N, where x_j ∈ {0,1} and P(x_j = 1) = θ, and for 0 < ε < 1:

    P( θ − (1/N) Σ_{j=1}^N x_j > ε ) ≤ e^{-2Nε²}

  (An empirical check of this bound follows below.)
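A quick empirical check of the Chernoff bound for a biased coin, not taken from the slides; θ, N, ε, the seed, and the number of trials are arbitrary illustrative values. The observed tail probability should sit below the e^{-2Nε²} bound (usually well below, since the bound is loose).

```python
import numpy as np

# Check P(theta - mean(x_1..x_N) > eps) <= exp(-2 N eps^2) by simulation.
theta, N, eps = 0.3, 200, 0.05   # hypothetical values for illustration
rng = np.random.default_rng(1)

trials = 100_000
flips = rng.random((trials, N)) < theta          # x_j in {0, 1}, P(x_j = 1) = theta
deviation = theta - flips.mean(axis=1)           # theta minus the empirical mean
empirical = np.mean(deviation > eps)
bound = np.exp(-2 * N * eps ** 2)

print(f"empirical tail = {empirical:.4f}, Chernoff bound = {bound:.4f}")
```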

Using the Chernoff bound to estimate the error of a single hypothesis
- Treating each prediction of h as a coin flip (mistake or not):

    P( θ − (1/N) Σ_{j=1}^N x_j > ε ) ≤ e^{-2Nε²}

But we are comparing many hypotheses: union bound
- For each hypothesis h_i:

    P( error_true(h_i) − error_train(h_i) > ε ) ≤ e^{-2Nε²}

- What if I am comparing two hypotheses, h_1 and h_2? By the union bound, the probability that either deviates by more than ε is at most 2 e^{-2Nε²}.

Generalization bound for |H| hypotheses
- Theorem: Hypothesis space H finite, dataset D with N i.i.d. samples, 0 < ε < 1: for any learned hypothesis h,

    P( error_true(h) − error_train(h) > ε ) ≤ |H| e^{-2Nε²}

PAC bound and the bias-variance tradeoff
- From

    P( error_true(h) − error_train(h) > ε ) ≤ |H| e^{-2Nε²}

  or, after setting the right-hand side to δ and moving some terms around, with probability at least 1 − δ:

    error_true(h) ≤ error_train(h) + sqrt( (ln|H| + ln(1/δ)) / (2N) )

- Important: the PAC bound holds for all h, but it doesn't guarantee that the algorithm finds the best h! (A small sketch of evaluating this bound follows below.)
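A minimal sketch of evaluating this bound for given numbers; the function name pac_bound and all the example values (training error, ln|H|, N, δ) are hypothetical choices made only for illustration.

```python
import math

def pac_bound(train_error, ln_H, N, delta):
    """error_true <= error_train + sqrt((ln|H| + ln(1/delta)) / (2N)),
    holding with probability at least 1 - delta."""
    return train_error + math.sqrt((ln_H + math.log(1 / delta)) / (2 * N))

# Hypothetical numbers: 5% training error, ln|H| = 20, N = 2000, delta = 0.05.
print(pac_bound(train_error=0.05, ln_H=20.0, N=2000, delta=0.05))
```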

What about the size of the hypothesis space?
- The sample complexity implied by the PAC bound is

    N ≥ (ln|H| + ln(1/δ)) / (2ε²)

- How large is the hypothesis space?

Boolean formulas with m binary features
- There are 2^(2^m) Boolean functions of m binary features, so ln|H| = 2^m ln 2 and

    N ≥ (2^m ln 2 + ln(1/δ)) / (2ε²)

  which is exponential in the number of features m. (A sketch follows below.)
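A sketch of the resulting sample complexity for Boolean formulas, assuming |H| = 2^(2^m) as above; the ε and δ values and the chosen feature counts are illustrative.

```python
import math

def samples_for_boolean_formulas(m, eps, delta):
    """N >= (ln|H| + ln(1/delta)) / (2 eps^2) with |H| = 2^(2^m),
    i.e. ln|H| = 2^m ln 2: exponential in the number of features m."""
    ln_H = (2 ** m) * math.log(2)
    return math.ceil((ln_H + math.log(1 / delta)) / (2 * eps ** 2))

for m in (5, 10, 20):
    print(m, samples_for_boolean_formulas(m, eps=0.1, delta=0.05))
```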

Number of decision trees of depth k
- Recursive solution, given m attributes. Let H_k = number of decision trees of depth k:
  - H_0 = 2
  - H_{k+1} = (# choices of root attribute) × (# possible left subtrees) × (# possible right subtrees) = m · H_k · H_k
- Recall the sample complexity N ≥ (ln|H| + ln(1/δ)) / (2ε²).
- Write L_k = log_2 H_k:
  - L_0 = 1
  - L_{k+1} = log_2 m + 2 L_k
  - So L_k = (2^k − 1)(1 + log_2 m) + 1 (the recursion is sketched in code below)

PAC bound for decision trees of depth k
- Bad!!!

    N ≥ (2^k ln m + ln(1/δ)) / (2ε²)  (approximately)

  The number of points needed is exponential in the depth!
- But, for N data points, a decision tree can't get too big: the number of leaves is never more than the number of data points.
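A small sketch of the L_k recursion and its closed form; the helper name log2_num_trees_of_depth and the values m = 10, k = 5 are mine, chosen only for illustration.

```python
import math

def log2_num_trees_of_depth(k, m):
    """L_k = log2(H_k), where H_0 = 2 and H_{k+1} = m * H_k * H_k,
    so L_0 = 1 and L_{k+1} = log2(m) + 2 * L_k."""
    L = 1.0
    for _ in range(k):
        L = math.log2(m) + 2 * L
    return L

# Check the closed form L_k = (2^k - 1)(1 + log2 m) + 1 for illustrative values.
m, k = 10, 5
closed_form = (2 ** k - 1) * (1 + math.log2(m)) + 1
assert abs(log2_num_trees_of_depth(k, m) - closed_form) < 1e-9
print(closed_form)   # log2 of the number of depth-5 trees over 10 attributes
```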

Number of decision trees with k leaves
- The number of decision trees of depth k is really, really big: ln|H| is about 2^k log m.
- Decision trees with up to k leaves: |H| is about m^k k^{2k} (a very loose bound).

PAC bound for decision trees with k leaves: bias-variance revisited
- Using ln|H_{DTs, k leaves}| ≤ 2k(ln m + ln k) in

    error_true(h) ≤ error_train(h) + sqrt( (ln|H| + ln(1/δ)) / (2N) )

  gives

    error_true(h) ≤ error_train(h) + sqrt( (2k(ln m + ln k) + ln(1/δ)) / (2N) )

  (A sketch of this bound follows below.)
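A sketch of how the k-leaf bound's complexity term grows with k, using ln|H| ≤ 2k(ln m + ln k) from the slide; the function name and the values of m, N, δ, k, and the assumed zero training error are hypothetical.

```python
import math

def dt_leaves_pac_bound(train_error, k, m, N, delta):
    """error_true <= error_train + sqrt((2k(ln m + ln k) + ln(1/delta)) / (2N)),
    using ln|H_{DTs, k leaves}| <= 2k(ln m + ln k)."""
    ln_H = 2 * k * (math.log(m) + math.log(k))
    return train_error + math.sqrt((ln_H + math.log(1 / delta)) / (2 * N))

# Illustrative numbers: with zero training error assumed throughout, the
# complexity term of the bound grows as the number of leaves k grows.
for k in (4, 16, 64):
    print(k, round(dt_leaves_pac_bound(train_error=0.0, k=k, m=20, N=5000, delta=0.05), 3))
```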

What did we learn from decision trees?
- The bias-variance tradeoff, formalized:

    error_true(h) ≤ error_train(h) + sqrt( (2k(ln m + ln k) + ln(1/δ)) / (2N) )

- Moral of the story: the complexity of learning is measured not in terms of the size of the hypothesis space, but in the maximum number of points that allows consistent classification.
  - Complexity ≈ N: no bias, lots of variance.
  - Complexity lower than N: some bias, less variance.

What about continuous hypothesis spaces?

    error_true(h) ≤ error_train(h) + sqrt( (ln|H| + ln(1/δ)) / (2N) )

- Continuous hypothesis space: |H| = ∞. Infinite variance???
- As with decision trees, we only care about the maximum number of points that can be classified exactly! This is called the VC dimension; see the readings for details.

What you need to know
- Finite hypothesis spaces:
  - Deriving the results
  - Counting the number of hypotheses
  - Handling mistakes on the training data
- The complexity of the classifier depends on the number of points that can be classified exactly:
  - Finite case: decision trees
  - Infinite case: VC dimension
- The bias-variance tradeoff in learning theory
- Remember: will your algorithm find the best classifier?