Bias-Variance in Machine Learning

Bias-Variance: Outline
Underfitting/overfitting: why are complex hypotheses bad?
A simple example of bias/variance
Error as bias + variance for regression, with brief comments on how it extends to classification
Measuring bias, variance, and error
Bagging, a way to reduce variance
Bias-variance for classification

Bias/Variance is a Way to Understand Overfitting and Underfitting. [Figure: Error/Loss on an unseen test set D_test and on the training set D_train, plotted against model complexity from a simple classifier to a complex classifier; error is high both when the classifier is too simple and when it is too complex.]

Bias-Variance: An Example

Example (Tom Dietterich, Oregon State).

Example (Tom Dietterich, Oregon State): the same experiment, repeated with 50 samples of 20 points each.

The true function f can't be fit perfectly with hypotheses from our class H (lines) → Error 1. Fix: a more expressive set of hypotheses H. We don't get the best hypothesis from H because of noise/small sample size → Error 2. Fix: a less expressive set of hypotheses H. (Noise is similar to Error 1.)

Bias-Variance Decomposition: Regression

Bias and variance for regression. For regression, we can easily decompose the error of the learned model into two parts: bias (Error 1) and variance (Error 2). Bias: the class of models can't fit the data. Fix: a more expressive model class. Variance: the class of models could fit the data, but doesn't, because it's hard to fit. Fix: a less expressive model class.

Bias-variance decomposition of error. The quantity to analyze is $E_{D,\varepsilon}\{(f(x)+\varepsilon - h_D(x))^2\}$, where $D$ is the dataset, $\varepsilon$ is the noise, $f$ is the true function, and $h_D$ is learned from $D$. Fix a test case $x$, then do this experiment:
1. Draw a size-$n$ sample $D = (x_1,y_1),\ldots,(x_n,y_n)$.
2. Train a linear regressor $h_D$ using $D$.
3. Draw one test example $(x, f(x)+\varepsilon)$.
4. Measure the squared error of $h_D$ on that example.
What is the expected error?
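To make the experiment concrete, here is a minimal Python sketch of it, a Monte Carlo check rather than anything from the slides; the true function (np.sin), the noise level, the sample size, and the test point x = 5 are all assumptions chosen for illustration.

```python
# Hypothetical sketch of the experiment above: repeatedly draw a dataset D,
# fit a linear regressor h_D, and measure squared error at a fixed test x.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)           # assumed "true" function (illustrative only)
sigma, n, x_test = 0.3, 20, 5.0   # noise level, sample size, fixed test case x

preds, sq_errs = [], []
for _ in range(2000):                         # many draws of the dataset D
    x = rng.uniform(0, 10, size=n)
    y = f(x) + rng.normal(0, sigma, size=n)   # y_i = f(x_i) + eps
    w1, w0 = np.polyfit(x, y, deg=1)          # train a linear regressor h_D
    y_hat = w1 * x_test + w0                  # h_D(x) at the fixed test case
    y_obs = f(x_test) + rng.normal(0, sigma)  # one noisy test example (x, f(x)+eps)
    preds.append(y_hat)
    sq_errs.append((y_obs - y_hat) ** 2)

preds = np.array(preds)
h_bar = preds.mean()                          # approximates E_D[h_D(x)]
print("expected squared error:", np.mean(sq_errs))
print("bias^2 + noise:", (f(x_test) - h_bar) ** 2 + sigma ** 2)
print("variance:", preds.var())
```

Up to Monte Carlo error, the last two printed quantities should add up to the first, which is exactly the decomposition derived on the next slides.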

Bias-variance decomposition of error. Notation, to simplify this:
$f \equiv f(x)+\varepsilon$ (true function value plus noise),
$\hat y = \hat y_D \equiv h_D(x)$ (the prediction learned from dataset $D$),
$h \equiv E_D\{h_D(x)\}$ (the long-term expectation of the learner's prediction on this $x$, averaged over many datasets $D$),
and the quantity to decompose is $E_{D,\varepsilon}\{(f(x)+\varepsilon - h_D(x))^2\}$.

Bias-variance decomposition of error. With $h \equiv E_D\{h_D(x)\}$, $\hat y = \hat y_D \equiv h_D(x)$, and $f \equiv f(x)+\varepsilon$:
$E_{D,\varepsilon}\{(f-\hat y)^2\} = E\{([f-h]+[h-\hat y])^2\}$
$= E\{[f-h]^2 + [h-\hat y]^2 + 2[f-h][h-\hat y]\}$
$= E\{[f-h]^2 + [h-\hat y]^2 + 2[fh - f\hat y - h^2 + h\hat y]\}$
$= E[(f-h)^2] + E[(h-\hat y)^2] + 2\,(E[fh] - E[f\hat y] - E[h^2] + E[h\hat y])$
The cross terms vanish, because
$E_{D,\varepsilon}\{(f(x)+\varepsilon)\cdot E_D\{h_D(x)\}\} = E_{D,\varepsilon}\{(f(x)+\varepsilon)\cdot h_D(x)\}$, so $E[fh] = E[f\hat y]$, and
$E_{D,\varepsilon}\{E_D\{h_D(x)\}\cdot E_D\{h_D(x)\}\} = E_{D,\varepsilon}\{E_D\{h_D(x)\}\cdot h_D(x)\}$, so $E[h^2] = E[h\hat y]$.

Bias-variance decomposition of error. Dropping the vanishing cross terms:
$E_{D,\varepsilon}\{(f-\hat y)^2\} = E\{([f-h]+[h-\hat y])^2\} = E[(f-h)^2] + E[(h-\hat y)^2]$,
again with $h \equiv E_D\{h_D(x)\}$, $\hat y = \hat y_D \equiv h_D(x)$, $f \equiv f(x)+\varepsilon$.
The first term is BIAS²: the squared difference between the best possible prediction for $x$, $f(x)$, and our long-term expectation of what the learner will do if we averaged over many datasets $D$, $E_D[h_D(x)]$.
The second term is VARIANCE: the squared difference between our long-term expectation of the learner's performance, $E_D[h_D(x)]$, and what we expect in a representative run on a dataset $D$ ($\hat y$).

[Figure: bias and variance illustrated at the test point x = 5.]

Bias-variance decomposition. This is something real that you can (approximately) measure experimentally, if you have synthetic data. Different learners and model classes have different tradeoffs:
large bias / small variance: few features, heavy regularization, highly pruned decision trees, large-k k-NN;
small bias / high variance: many features, less regularization, unpruned trees, small-k k-NN.

Bias-Variance Decomposition: Classification

A generalization of the bias-variance decomposition to other loss functions. Take an arbitrary real-valued loss $L(y, y')$ with $L(y, y') = L(y', y)$, $L(y, y) = 0$, and $L(y, y') \neq 0$ if $y \neq y'$.
Define the optimal prediction: $y^* = \arg\min_{y'} E_t[L(t, y')]$.
Define the main prediction of the learner (relative to the loss and the distribution over training sets): $y_m = \arg\min_{y'} E_D\{L(y, y')\}$, where $y = h_D(x)$. For 0/1 loss, the main prediction is the most common class predicted by $h_D(x)$, weighting the $h_D$'s by $\Pr(D)$.
Define the bias of the learner: $\mathrm{Bias}(x) = L(y^*, y_m)$.
Define the variance of the learner: $\mathrm{Var}(x) = E_D[L(y_m, y)]$.
Define the noise for $x$: $N(x) = E_t[L(t, y^*)]$.
Claim: $E_{D,t}[L(t, y)] = c_1 N(x) + \mathrm{Bias}(x) + c_2 \mathrm{Var}(x)$, where (for two-class 0/1 loss) $c_1 = 2\Pr_D[y = y^*] - 1$ and $c_2 = 1$ if $y_m = y^*$, $-1$ otherwise.
Domingos, "A Unified Bias-Variance Decomposition and its Applications", ICML 2000.
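As a sanity check on these definitions, here is a hedged Python sketch that computes the optimal prediction, main prediction, bias, variance, and noise at a single test point under 0/1 loss, given predictions from many training sets and several samples of the (noisy) label; the function names and toy inputs are invented for illustration.

```python
# Illustrative sketch of the definitions above for 0/1 loss at one test point x.
from collections import Counter

def zero_one(a, b):
    return 0 if a == b else 1

def decompose_point(preds, labels):
    """preds: predictions y = h_D(x) over many training sets D.
       labels: samples t of the (possibly noisy) label at x."""
    y_star = Counter(labels).most_common(1)[0][0]  # optimal prediction: argmin_y' E_t[L(t, y')]
    y_main = Counter(preds).most_common(1)[0][0]   # main prediction: argmin_y' E_D[L(y, y')]
    bias = zero_one(y_star, y_main)                                  # Bias(x) = L(y*, y_m)
    variance = sum(zero_one(y_main, y) for y in preds) / len(preds)  # Var(x) = E_D[L(y_m, y)]
    noise = sum(zero_one(t, y_star) for t in labels) / len(labels)   # N(x) = E_t[L(t, y*)]
    return y_star, y_main, bias, variance, noise

# Toy usage: predictions over 10 bootstrapped training sets, 5 samples of the label.
print(decompose_point(preds=[1, 1, 0, 1, 1, 0, 1, 1, 1, 0], labels=[1, 1, 1, 0, 1]))
```

For 0/1 loss the two argmins reduce to taking the most common value, which is why the sketch just uses Counter.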

Bias and variance. For classification, we can also decompose the error of a learned classifier into two terms: bias and variance. Bias: the class of models can't fit the data. Fix: a more expressive model class. Variance: the class of models could fit the data, but doesn't, because it's hard to fit. Fix: a less expressive model class.

Bias-Variance Decomposition: Measuring

Bias-variance decomposition. This is something real that you can (approximately) measure experimentally, if you have synthetic data or if you're clever. You need to somehow approximate $E_D\{h_D(x)\}$, i.e., construct many variants of the dataset D.

Background: Bootstrap sampling.
Input: dataset D. Output: many variants of D: D_1, ..., D_T.
For t = 1, ..., T:
  D_t = { }
  For i = 1, ..., |D|: pick (x, y) uniformly at random from D (i.e., with replacement) and add it to D_t.
Some examples never get picked (~37%); some are picked 2x, 3x, ...
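A minimal Python transcription of this pseudocode; the function name and seed are illustrative.

```python
# Bootstrap sampling: draw |D| examples uniformly with replacement, T times.
import random

def bootstrap_variants(D, T, seed=0):
    """D: a list of (x, y) pairs. Returns T variants of D, each drawn with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(D) for _ in range(len(D))] for _ in range(T)]

# Each variant has |D| examples; on average ~37% of the originals never get picked
# (they are "out of bag"), while others appear 2x, 3x, ...
```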

Measuring bias-variance with bootstrap sampling. Create B bootstrap variants of D (approximating many draws of D). For each bootstrap dataset: T_b is the resampled dataset and U_b are the out-of-bag examples. Train a hypothesis h_b on T_b, then test h_b on each x in U_b. Now for each (x, y) example we have many predictions h_1(x), h_2(x), ..., so we can estimate (ignoring noise):
variance: the ordinary variance of h_1(x), ..., h_n(x);
bias: average(h_1(x), ..., h_n(x)) − y.
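The measurement procedure can be sketched as below for a regression-style learner; `fit` is any learner that returns a prediction function, and all names and defaults are assumptions rather than code from the lecture.

```python
# Hypothetical sketch of the bootstrap bias/variance measurement above.
import numpy as np
from collections import defaultdict

def bias_variance_bootstrap(D, fit, B=50, seed=0):
    """D: list of (x, y) pairs; fit: data -> prediction function."""
    rng = np.random.default_rng(seed)
    preds = defaultdict(list)                    # example index -> out-of-bag predictions
    n = len(D)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)         # bootstrap dataset T_b (with replacement)
        oob = set(range(n)) - set(idx.tolist())  # out-of-bag examples U_b
        h_b = fit([D[i] for i in idx])           # train hypothesis h_b on T_b
        for i in oob:                            # test h_b on each x in U_b
            preds[i].append(h_b(D[i][0]))
    bias2, var = [], []
    for i, p in preds.items():
        p = np.array(p)
        bias2.append((p.mean() - D[i][1]) ** 2)  # (average prediction - y)^2
        var.append(p.var())                      # ordinary variance of h_1(x), ..., h_B(x)
    return float(np.mean(bias2)), float(np.mean(var))

# Example learner matching the earlier regression sketch: a degree-1 polynomial fit.
def fit_line(data):
    xs, ys = zip(*data)
    w1, w0 = np.polyfit(xs, ys, deg=1)
    return lambda x: w1 * x + w0
```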

Applying Bias-Variance Analysis. By measuring the bias and variance on a problem, we can determine how to improve our model. If bias is high, we need to allow our model to be more complex. If variance is high, we need to reduce the complexity of the model. Bias-variance analysis also suggests a way to reduce variance: bagging (later).

Bagging

Bootstrap Aggregation (Bagging). Use the bootstrap to create B variants of D. Learn a classifier from each variant. Vote the learned classifiers to predict on a test example.

Bagging (bootstrap aggregation), breaking it down:
Input: dataset D and YFCL (your favorite classifier learner; note that you can use any learner you like!).
Output: a classifier h_D-BAG.
Use the bootstrap to construct variants D_1, ..., D_T.
For t = 1, ..., T: train YFCL on D_t to get h_t. (You can also test h_t on the out-of-bag examples.)
To classify x with h_D-BAG: classify x with h_1, ..., h_T and predict the most frequently predicted class for x (majority vote).
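A minimal sketch of this recipe in Python; YFCL is passed in as any function from a dataset to a classifier, and the names are illustrative.

```python
# Hypothetical bagging sketch: yfcl is any learner, data -> classifier(x).
import random
from collections import Counter

def bag(D, yfcl, T=25, seed=0):
    """D: list of (x, y) pairs. Returns the bagged classifier h_D-BAG."""
    rng = random.Random(seed)
    classifiers = []
    for _ in range(T):
        D_t = [rng.choice(D) for _ in range(len(D))]  # bootstrap variant D_t
        classifiers.append(yfcl(D_t))                 # train YFCL on D_t to get h_t
    def h_bag(x):                                     # majority vote of h_1, ..., h_T
        votes = Counter(h(x) for h in classifiers)
        return votes.most_common(1)[0][0]
    return h_bag
```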

Experiments (Freund and Schapire).

Bagged, minimally pruned decision trees

Generally, bagged decision trees eventually outperform the linear classifier, if the data is large enough and clean enough.

Bagging (bootstrap aggregation). Experimentally, especially with minimal pruning, decision trees have low bias but high variance. Bagging usually improves performance for decision trees and similar methods: it reduces variance without increasing the bias (much).

More detail on bias-variance and bagging for classification (thanks to Tom Dietterich, MLSS 2014).

(Recap of the unified bias-variance decomposition slide above: Domingos, ICML 2000.)

More detail on Domingos's model. Noisy channel: $y_i = \mathrm{noise}(f(x_i))$, where $f(x_i)$ is the true label of $x_i$ and the noise process noise(·) may change $y \to y'$. $h = h_D$ is the hypothesis learned from $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$. For a test case $(x^*, y^*)$ and predicted label $h(x^*)$, the loss is $L(h(x^*), y^*)$; for instance, $L(h(x^*), y^*) = 1$ if there is an error, else 0.

More detail on Domingos's model. We want to decompose $E_{D,P}\{L(h(x^*), y^*)\}$, where $m$ is the size of $D$ and $(x^*, y^*) \sim P$.
The main prediction of the learner is $y_m(x^*) = \arg\min_{y'} E_{D,P}\{L(h(x^*), y')\}$: the most common $h_D(x^*)$ among all possible $D$'s, weighted by $\Pr(D)$.
Bias is $B(x^*) = L(y_m(x^*), f(x^*))$.
Variance is $V(x^*) = E_{D,P}\{L(h_D(x^*), y_m(x^*))\}$.
Noise is $N(x^*) = L(y^*, f(x^*))$.

More detail on Domingos's model. We want to decompose $E_{D,P}\{L(h(x^*), y^*)\}$.
The main prediction of the learner, $y_m(x^*)$: for 0/1 loss, the most common $h_D(x^*)$ over $D$'s.
Bias, $B(x^*) = L(y_m(x^*), f(x^*))$: main prediction vs. true label.
Variance, $V(x^*) = E_{D,P}\{L(h_D(x^*), y_m(x^*))\}$: this hypothesis vs. main prediction.
Noise, $N(x^*) = L(y^*, f(x^*))$: true label vs. observed label.

More detail on Domingos's model. We will decompose $E_{D,P}\{L(h(x^*), y^*)\}$ into:
Bias, $B(x^*) = L(y_m(x^*), f(x^*))$: main prediction vs. true label; this is 0 or 1, not a random variable.
Variance, $V(x^*) = E_{D,P}\{L(h_D(x^*), y_m(x^*))\}$: this hypothesis vs. main prediction.
Noise, $N(x^*) = L(y^*, f(x^*))$: true label vs. observed label.

Case analysis of error

Analysis of error: the unbiased case, where the main prediction is correct. An error occurs either with variance but no noise, or with noise but no variance.

Analysis of error: the biased case, where the main prediction is wrong. An error occurs either with no noise and no variance, or with both noise and variance.

Analysis of error: overall. Hopefully we'll be in this case more often, if we've chosen a good classifier. The interaction terms are usually small.

Analysis of error: without noise (which is hard to estimate anyway). As with regression, we can experimentally (approximately) measure bias and variance with bootstrap replicates. Typically the variance is broken down into biased variance, Vb, and unbiased variance, Vu.
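As I read Domingos's definitions, this split can be estimated from the same bootstrap predictions; the sketch below (0/1 loss, invented names) treats variance on points where the main prediction is correct as unbiased variance Vu and variance on points where it is wrong as biased variance Vb, so that the error contribution, ignoring noise, is roughly Bias + Vu − Vb.

```python
# Hypothetical sketch: split estimated variance into unbiased (Vu) and biased (Vb)
# parts under 0/1 loss, from bootstrap predictions at each test point.
from collections import Counter

def bias_vu_vb(preds_per_point, true_labels):
    """preds_per_point[i]: predictions for example i over bootstrap replicates."""
    B = Vu = Vb = 0.0
    n = len(true_labels)
    for preds, y in zip(preds_per_point, true_labels):
        y_main = Counter(preds).most_common(1)[0][0]      # main prediction at this point
        v = sum(p != y_main for p in preds) / len(preds)  # V(x) = E_D[L(y_m, h_D(x))]
        if y_main == y:
            Vu += v        # unbiased point: variance hurts
        else:
            B += 1
            Vb += v        # biased point: variance can help (it may flip to the correct label)
    return B / n, Vu / n, Vb / n   # average bias, unbiased variance, biased variance
```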

K-NN Experiments

Tree Experiments

Tree stump experiments (depth 2): bias is reduced (!).

Large tree experiments (depth 10): bias is not changed much; variance is reduced.