Learning Theory, Overfitting, Bias-Variance Decomposition

Learning Theory, Overfitting, Bias-Variance Decomposition. Machine Learning 10-601B. Seyoung Kim. Many of these slides are derived from Tom Mitchell and Ziv Bar-Joseph. Thanks!

Any(!) learner that outputs a hypothesis consistent with all training examples (i.e., an h contained in VS_{H,D})

What it means [Haussler, 1988]: the probability that the version space is not ε-exhausted after m training examples is at most $|H|e^{-\varepsilon m}$. Suppose we want this probability to be at most δ. How many training examples suffice? Requiring $|H|e^{-\varepsilon m} \le \delta$ gives $m \ge \frac{1}{\varepsilon}\left(\ln|H| + \ln(1/\delta)\right)$.

Here ε is the difference between the training error and true error of the output hypothesis (the one with lowest training error)

Additive Hoeffding Bounds, Agnostic Learning. Given m independent flips of a coin with true Pr(heads) = θ, we can bound the error in the maximum-likelihood estimate $\hat\theta$: $\Pr[\theta > \hat\theta + \varepsilon] \le e^{-2m\varepsilon^2}$. Relevance to agnostic learning: for any single hypothesis h, $\Pr[\mathrm{error}_{true}(h) > \mathrm{error}_{train}(h) + \varepsilon] \le e^{-2m\varepsilon^2}$. But we must consider all hypotheses in H: $\Pr[(\exists h \in H)\ \mathrm{error}_{true}(h) > \mathrm{error}_{train}(h) + \varepsilon] \le |H|e^{-2m\varepsilon^2}$. Now we assume this probability is bounded by δ. Then we have $m \ge \frac{1}{2\varepsilon^2}\left(\ln|H| + \ln(1/\delta)\right)$.
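As a quick numerical check of these two finite-|H| bounds, here is a minimal Python sketch (the function names and the example hypothesis-space size are made up for illustration) that evaluates the consistent-learner bound and the agnostic bound side by side.

```python
import math

def pac_bound_consistent(h_size, eps, delta):
    """Consistent learner, finite H: m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

def pac_bound_agnostic(h_size, eps, delta):
    """Agnostic case (Hoeffding): m >= (1/(2*eps^2)) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * eps ** 2))

# Illustrative example: a hypothesis space with one million hypotheses
H_SIZE = 10 ** 6
print(pac_bound_consistent(H_SIZE, eps=0.1, delta=0.05))  # 169 examples
print(pac_bound_agnostic(H_SIZE, eps=0.1, delta=0.05))    # 841 examples
```

Note how the agnostic bound pays a 1/ε² (rather than 1/ε) price, since it must account for the gap between training and true error of the best-fit hypothesis.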

Question: If H = {h | h: X → Y} is infinite, what measure of complexity should we use in place of |H|? Answer: the largest subset of X for which H can guarantee zero training error (regardless of the target function c). The VC dimension of H is the size of this subset.

A dichotomy of a set S: a labeling of each member of S as positive or negative. In the figure, each ellipse corresponds to a possible dichotomy: positive inside the ellipse, negative outside the ellipse.

VC(H)=3

Sample Complexity based on VC dimension. How many randomly drawn examples suffice to ε-exhaust VS_{H,D} with probability at least (1-δ), i.e., to guarantee that any hypothesis that perfectly fits the training data is probably (1-δ) approximately (ε) correct? $m \ge \frac{1}{\varepsilon}\left(4\log_2(2/\delta) + 8\,VC(H)\log_2(13/\varepsilon)\right)$. Compare to our earlier result based on |H|: $m \ge \frac{1}{\varepsilon}\left(\ln|H| + \ln(1/\delta)\right)$.
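To get a feel for the numbers, a small sketch (assuming the Blumer et al. form of the bound quoted above; the function name is illustrative) that evaluates the VC-based sample complexity:

```python
import math

def sample_complexity_vc(vc_dim, eps, delta):
    """VC-based bound: m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps)

# Linear separators in the plane have VC(H) = 3 (see below)
print(sample_complexity_vc(3, eps=0.1, delta=0.05))   # 1899 examples
# The earlier |H|-based bound with |H| = 10**6 gives only 169 examples,
# but it is unusable when H is infinite.
```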

VC dimension: examples. Consider 1-dim real-valued input X; we want to learn c: X → {0,1}. What is the VC dimension of: Open intervals: VC(H1) = 1, VC(H2) = 2. Closed intervals: VC(H3) = 2, VC(H4) = 3.
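The exact hypothesis classes H1–H4 are defined by figures that did not survive transcription. As an illustration of how such claims can be checked, here is a brute-force shattering test, a sketch under the assumption that the class is "positive inside a closed interval [a, b]": it confirms that two points can be shattered but three cannot.

```python
import itertools

def interval_labeling(points, a, b):
    """Label each point 1 if it falls inside the closed interval [a, b], else 0."""
    return tuple(1 if a <= x <= b else 0 for x in points)

def shattered_by_intervals(points):
    """Can 'positive inside [a, b]' hypotheses realize every labeling of points?
    It suffices to try endpoints at midpoints between sorted points and just
    outside the range, plus the all-negative (empty-interval) labeling."""
    xs = sorted(points)
    cands = ([xs[0] - 1.0]
             + [(xs[i] + xs[i + 1]) / 2 for i in range(len(xs) - 1)]
             + [xs[-1] + 1.0])
    achievable = {tuple(0 for _ in points)}  # empty interval: everything negative
    for a, b in itertools.combinations_with_replacement(cands, 2):
        achievable.add(interval_labeling(points, a, b))
    return len(achievable) == 2 ** len(points)

print(shattered_by_intervals([0.0, 1.0]))       # True: intervals shatter 2 points
print(shattered_by_intervals([0.0, 1.0, 2.0]))  # False: labeling (1, 0, 1) is unreachable
```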

VC dimension: examples. What is the VC dimension of lines in a plane, H_2 = {((w_0 + w_1 x_1 + w_2 x_2) > 0) → y = 1}? VC(H_2) = 3. For H_n = linear separating hyperplanes in n dimensions, VC(H_n) = n + 1.

For any finite hypothesis space H, can you give an upper bound on VC(H) in terms of |H|? (Hint: yes.) Assume VC(H) = K, which means H can shatter K examples. For K examples, there are 2^K possible labelings, so |H| ≥ 2^K. Thus K ≤ log₂|H|.

Tightness of Bounds on Sample Complexity. How many examples m suffice to assure that any hypothesis that fits the training data perfectly is probably (1-δ) approximately (ε) correct? How tight is this bound? Lower bound on sample complexity (Ehrenfeucht et al., 1989): Consider any class C of concepts such that VC(C) > 1, any learner L, any 0 < ε < 1/8, and any 0 < δ < 0.01. Then there exists a distribution and a target concept in C such that if L observes fewer examples than $\max\left[\frac{1}{\varepsilon}\log\frac{1}{\delta},\ \frac{VC(C)-1}{32\varepsilon}\right]$, then with probability at least δ, L outputs a hypothesis h with $\mathrm{error}_{\mathcal{D}}(h) > \varepsilon$.

Agnostic Learning: VC Bounds for Decision Trees [Schölkopf and Smola, 2002]. With probability at least (1-δ), every h ∈ H satisfies $\mathrm{error}_{true}(h) \le \mathrm{error}_{train}(h) + \sqrt{\frac{VC(H)\left(\ln\frac{2m}{VC(H)} + 1\right) + \ln\frac{4}{\delta}}{m}}$.
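To see how the complexity penalty grows with model class size, a small sketch (assuming the bound in the form quoted above; names are illustrative) that prints the penalty term for several VC dimensions at a fixed training-set size:

```python
import math

def vc_penalty(vc_dim, m, delta):
    """Complexity term of the VC bound:
    sqrt((VC(H) * (ln(2m / VC(H)) + 1) + ln(4 / delta)) / m)."""
    return math.sqrt((vc_dim * (math.log(2 * m / vc_dim) + 1) + math.log(4 / delta)) / m)

m, delta = 10_000, 0.05
for vc in [10, 100, 1000]:
    # training error + this penalty upper-bounds the true error (w.p. >= 1 - delta)
    print(f"VC(H) = {vc:>4}: penalty = {vc_penalty(vc, m, delta):.3f}")
```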

What You Should Know. Sample complexity varies with the learning setting: the learner actively queries the trainer vs. examples arrive at random. Within the PAC learning setting, we can bound the probability that the learner will output a hypothesis with a given error: for ANY consistent learner (case where c ∈ H), and for ANY best-fit hypothesis (agnostic learning, where perhaps c is not in H). VC dimension as a measure of the complexity of H. Conference on Learning Theory: http://www.learningtheory.org. Avrim Blum's course on Machine Learning Theory: https://www.cs.cmu.edu/~avrim/ml14/

OVERFITTING, BIAS/VARIANCE TRADE-OFF

What is a good model? [Figure comparing a robust model with the model actually built, evaluated on known data and on new data; legend contrasts low robustness with low quality / high robustness.]

Two sources of errors. Now let's look more closely into the two sources of errors in a function approximator. In the following we show how the expected loss decomposes into bias and variance.

Expected loss, Bias/Variance Decomposition. Let y be the true (target) output, let h(x) = E[y | x] be the optimal predictor, and let f(x) be our actual predictor, which will incur the following expected loss:
$E[(f(x) - y)^2] = \int (f(x) - y)^2\, p(x,y)\, dx\, dy$
$= \int (f(x) - h(x) + h(x) - y)^2\, p(x,y)\, dx\, dy$
$= \int \left[(f(x) - h(x))^2 + 2(f(x) - h(x))(h(x) - y) + (h(x) - y)^2\right] p(x,y)\, dx\, dy$
The cross term integrates to zero because h(x) = E[y | x]. The first term is the part we can influence by changing our predictor f(x); the last term is a noise term, and we can do no better than this, so it is a lower bound on the expected loss.

Expected loss, Bias/Variance Decomposition.
$E[(f(x) - y)^2] = \int (f(x) - h(x))^2\, p(x)\, dx + \int (h(x) - y)^2\, p(x,y)\, dx\, dy$
Take the expectation over different datasets D. We will assume f(x) = f(x; w) is a parametric model whose parameters w are fit to a training set D; write this fitted predictor as f(x; D). E_D[f(x; D)] is the expected predictor over the multiple training datasets, and the dataset-averaged first term splits into a Bias² part and a Variance part, derived next.

Proof: Expected loss, Bias/Variance Decomposition.
$E_D[(f(x;D) - h(x))^2]$
$= E_D[(f(x;D) - E_D[f(x;D)] + E_D[f(x;D)] - h(x))^2]$
$= E_D[(f(x;D) - E_D[f(x;D)])^2 + 2(f(x;D) - E_D[f(x;D)])(E_D[f(x;D)] - h(x)) + (E_D[f(x;D)] - h(x))^2]$
$= \underbrace{(E_D[f(x;D)] - h(x))^2}_{\text{Bias}^2} + \underbrace{E_D[(f(x;D) - E_D[f(x;D)])^2]}_{\text{Variance}}$
The cross term vanishes because $E_D[f(x;D) - E_D[f(x;D)]] = 0$. Putting things together: expected loss = (bias)² + variance + noise.

expected loss = (bias)² + variance + noise
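To make the decomposition concrete, here is a minimal simulation sketch (assumed setup: targets y = sin(2πx) plus Gaussian noise, a degree-3 polynomial least-squares learner as f(x; D)); the sum bias² + variance + noise should approximately match the directly estimated expected loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_h(x):            # h(x) = E[y | x], the optimal predictor
    return np.sin(2 * np.pi * x)

sigma = 0.3               # noise standard deviation -> noise term = sigma**2
n_train, n_datasets, degree = 25, 500, 3
x_test = np.linspace(0, 1, 200)

preds = np.empty((n_datasets, x_test.size))
for d in range(n_datasets):
    x = rng.uniform(0, 1, n_train)
    y = true_h(x) + rng.normal(0, sigma, n_train)
    w = np.polyfit(x, y, degree)              # fit f(x; D) on this dataset
    preds[d] = np.polyval(w, x_test)          # predictions on fixed test inputs

mean_pred = preds.mean(axis=0)                           # E_D[f(x; D)]
bias2 = np.mean((mean_pred - true_h(x_test)) ** 2)       # bias^2, averaged over x
variance = np.mean(preds.var(axis=0))                    # variance, averaged over x
noise = sigma ** 2

# Expected loss estimated directly: squared error against fresh noisy targets
y_test = true_h(x_test) + rng.normal(0, sigma, (n_datasets, x_test.size))
expected_loss = np.mean((preds - y_test) ** 2)

print(f"bias^2 + variance + noise = {bias2 + variance + noise:.4f}")
print(f"directly estimated loss   = {expected_loss:.4f}")
```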

Regularized Regression. Recall linear regression: $y = X^T\beta + \varepsilon$, with least-squares estimate $\beta^* = \arg\min_\beta (y - X^T\beta)^T(y - X^T\beta) = \arg\min_\beta \|y - X^T\beta\|^2$.
L2-regularized LR: $\beta^* = \arg\min_\beta \|y - X^T\beta\|^2 + \lambda\|\beta\|_2^2$, where $\|\beta\|_2^2 = \sum_i \beta_i^2$.
L1-regularized LR: $\beta^* = \arg\min_\beta \|y - X^T\beta\|^2 + \lambda\|\beta\|_1$, where $\|\beta\|_1 = \sum_i |\beta_i|$.
λ controls the bias/variance trade-off.
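As an illustration of that trade-off, a sketch (assumptions: degree-9 polynomial features on synthetic sine data, ridge regression solved in closed form) that estimates bias² and variance over many training sets for several values of λ; small λ gives low bias and high variance, large λ the reverse.

```python
import numpy as np

rng = np.random.default_rng(1)

def features(x, degree=9):
    """Polynomial feature map, columns x^0 .. x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

x_test = np.linspace(0, 1, 200)
h_test = np.sin(2 * np.pi * x_test)            # true h(x)

for lam in [1e-6, 1e-3, 1e-1, 10.0]:
    preds = []
    for _ in range(200):                        # many training sets D
        x = rng.uniform(0, 1, 20)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 20)
        w = ridge_fit(features(x), y, lam)
        preds.append(features(x_test) @ w)
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - h_test) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"lambda={lam:g}: bias^2={bias2:.3f}, variance={var:.3f}")
```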

Bias² + variance vs. regularizer. Bias² + variance predicts the (shape of the) test error quite well. However, bias and variance cannot be computed in practice, since doing so relies on knowing the true distribution of x and y (and hence h(x) = E[y | x]).

Bayes Error Rate. The fundamental performance limit for a classification problem: a lower bound on the error rate achievable by any algorithm on a given problem, i.e., the error rate of the optimal decision rule.

Bayes Error Rate: Two Classes. For a two-class classification problem, x is the input feature vector and ω₁, ω₂ are the two classes. The Bayes optimal decision rule is: classify as ω₁ if $P(x \mid \omega_1)P(\omega_1) > P(x \mid \omega_2)P(\omega_2)$, and classify as ω₂ if $P(x \mid \omega_1)P(\omega_1) < P(x \mid \omega_2)P(\omega_2)$.

Bayes Error Rate: Two Classes. Given this optimal decision rule, with R₁ and R₂ the regions where the rule decides ω₁ and ω₂ respectively, the error rate is
$P(\text{error}) = P(x \in R_2, \omega_1) + P(x \in R_1, \omega_2)$
$= P(x \in R_2 \mid \omega_1)P(\omega_1) + P(x \in R_1 \mid \omega_2)P(\omega_2)$
$= \int_{R_2} p(x \mid \omega_1)P(\omega_1)\, dx + \int_{R_1} p(x \mid \omega_2)P(\omega_2)\, dx$
The Bayes error rate gives the irreducible error: a fundamental property of the problem, not of the classifier.
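For a concrete instance, a small sketch (assumed setting: 1-D input with Gaussian class conditionals and equal priors; names are illustrative) that computes the Bayes error rate by numerically integrating the smaller of the two weighted class densities.

```python
import numpy as np
from scipy.stats import norm

prior1, prior2 = 0.5, 0.5
p1 = norm(loc=0.0, scale=1.0)   # p(x | omega_1)
p2 = norm(loc=2.0, scale=1.0)   # p(x | omega_2)

x = np.linspace(-8.0, 10.0, 20001)
dx = x[1] - x[0]
weighted1 = p1.pdf(x) * prior1  # P(omega_1) p(x | omega_1)
weighted2 = p2.pdf(x) * prior2  # P(omega_2) p(x | omega_2)

# The optimal rule picks the class with the larger weighted density at each x,
# so the error contribution at x is the smaller one; integrate it over x.
bayes_error = np.sum(np.minimum(weighted1, weighted2)) * dx
print(f"Bayes error rate: {bayes_error:.4f}")   # about 0.1587 for these Gaussians
```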

Classification Example. [Figure: two panels contrasting a simple problem with a hard problem.]

Bayes Error Rate: Multiple Classes. For c-class classification:
$P(\text{correct}) = \sum_{i=1}^{c} P(x \in R_i, \omega_i) = \sum_{i=1}^{c} P(x \in R_i \mid \omega_i)P(\omega_i) = \sum_{i=1}^{c} \int_{R_i} p(x \mid \omega_i)P(\omega_i)\, dx$
$P(\text{error}) = 1 - P(\text{correct})$
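The same numerical approach extends to this multi-class formula. A sketch (three assumed Gaussian classes with equal priors) that integrates the largest weighted density to get P(correct), then takes P(error) = 1 − P(correct):

```python
import numpy as np
from scipy.stats import norm

priors = [1 / 3, 1 / 3, 1 / 3]
densities = [norm(loc=mu, scale=1.0) for mu in (0.0, 2.0, 4.0)]  # p(x | omega_i)

x = np.linspace(-8.0, 12.0, 20001)
dx = x[1] - x[0]
weighted = np.array([pr * d.pdf(x) for pr, d in zip(priors, densities)])

# In region R_i the rule decides omega_i, i.e., where its weighted density is
# largest, so P(correct) is the integral of the pointwise maximum.
p_correct = np.sum(weighted.max(axis=0)) * dx
print(f"P(error) = {1 - p_correct:.4f}")
```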

Summary: Overfitting; Bias-variance decomposition; Bayes error rate.