Bias-Variance Tradeoff. David Dalpiaz STAT 430, Fall 2017


Announcements

- Homework 03 released
- Regrade policy
- Style policy?

Statistical Learning

- Supervised Learning
  - Regression
    - Parametric
    - Non-Parametric
  - Classification
- Unsupervised Learning

Regression Setup

Given a random pair $(X, Y) \in \mathbb{R}^p \times \mathbb{R}$, we would like to predict $Y$ with some function of $X$, say, $f(X)$.

Define the squared error loss of estimating $Y$ using $f(X)$ as

$$L(Y, f(X)) \triangleq (Y - f(X))^2$$

We call the expected loss the risk of estimating $Y$ using $f(X)$:

$$R(Y, f(X)) \triangleq E[L(Y, f(X))] = E_{X, Y}\left[(Y - f(X))^2\right]$$

Minimizing Risk

After conditioning on $X$,

$$E_{X, Y}\left[(Y - f(X))^2\right] = E_X\, E_{Y \mid X}\left[(Y - f(X))^2 \mid X = x\right]$$

we see that the risk is minimized by the conditional mean

$$f(x) = E(Y \mid X = x)$$

We call this the regression function.
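To see why, expand the inner conditional expectation for a fixed $x$:

$$E_{Y \mid X}\left[(Y - f(x))^2 \mid X = x\right] = V[Y \mid X = x] + \left(E[Y \mid X = x] - f(x)\right)^2$$

The first term does not depend on $f$, and the second is minimized, at zero, by taking $f(x) = E(Y \mid X = x)$.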

Estimating $f$

Given data $\mathcal{D} = \{(x_i, y_i)\} \in \mathbb{R}^p \times \mathbb{R}$, our goal is to find some $\hat{f}$ that is a good estimate of the regression function $f$.

Expected Prediction Error

$$\text{EPE}\left(Y, \hat{f}(X)\right) \triangleq E_{X, Y, \mathcal{D}}\left[\left(Y - \hat{f}(X)\right)^2\right]$$

Reducible and Irreducible Error

$$\text{EPE}\left(Y, \hat{f}(x)\right) = E_{Y \mid X, \mathcal{D}}\left[\left(Y - \hat{f}(X)\right)^2 \mid X = x\right] = \underbrace{E_{\mathcal{D}}\left[\left(f(x) - \hat{f}(x)\right)^2\right]}_{\text{reducible error}} + \underbrace{V_{Y \mid X}\left[Y \mid X = x\right]}_{\text{irreducible error}}$$

Bias and Variance

Recall the definition of the bias of an estimator:

$$\text{bias}(\hat{\theta}) \triangleq E\left[\hat{\theta}\right] - \theta$$

Also recall the definition of the variance of an estimator:

$$V(\hat{\theta}) = \text{var}(\hat{\theta}) \triangleq E\left[\left(\hat{\theta} - E\left[\hat{\theta}\right]\right)^2\right]$$
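As a concrete example (not on the original slide): for an iid sample $Y_1, \ldots, Y_n$ with mean $\mu$ and variance $\sigma^2$, the sample mean $\bar{Y}$ has $\text{bias}(\bar{Y}) = E[\bar{Y}] - \mu = 0$ and $V(\bar{Y}) = \sigma^2 / n$, so it is unbiased with variance shrinking as $n$ grows.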

Bias and Variance

Figure 1: Dartboard analogy of bias and variance.

Bias-Variance Decomposition

$$\text{MSE}\left(f(x), \hat{f}(x)\right) = E_{\mathcal{D}}\left[\left(f(x) - \hat{f}(x)\right)^2\right] = \underbrace{\left(f(x) - E\left[\hat{f}(x)\right]\right)^2}_{\text{bias}^2\left(\hat{f}(x)\right)} + \underbrace{E\left[\left(\hat{f}(x) - E\left[\hat{f}(x)\right]\right)^2\right]}_{\text{var}\left(\hat{f}(x)\right)}$$
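The decomposition follows by adding and subtracting $E[\hat{f}(x)]$ inside the square. The cross term

$$2\left(f(x) - E\left[\hat{f}(x)\right]\right) E\left[E\left[\hat{f}(x)\right] - \hat{f}(x)\right]$$

vanishes, since $E\left[E[\hat{f}(x)] - \hat{f}(x)\right] = 0$, leaving only the squared bias and the variance.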

Bias-Variance Decomposition

$$\text{MSE}\left(f(x), \hat{f}(x)\right) = \text{bias}^2\left(\hat{f}(x)\right) + \text{var}\left(\hat{f}(x)\right)$$

Bias-Variance Decomposition

Figure: Decomposition of prediction error. Three panels (more dominant variance, decomposition of prediction error, more dominant bias) plot squared bias, variance, Bayes error, and EPE against model complexity.

Expected Test Error

Figure: (Expected) test and train error versus model complexity. Moving from low to high complexity, bias goes from high to low, while variance goes from low to high.

Simulation Study, Regression Function

We will illustrate these decompositions, most importantly the bias-variance tradeoff, through simulation. Suppose we would like to train a model to learn the true regression function

$$f(x) = x^2$$

f = function(x) {
  x ^ 2
}

Simulation Study, Regression Function

More specifically, we'd like to predict an observation, $Y$, given that $X = x$, by using $\hat{f}(x)$, where

$$E[Y \mid X = x] = f(x) = x^2$$

and

$$V[Y \mid X = x] = \sigma^2.$$

Simulation Study, Data Generating Process

To carry out a concrete simulation example, we need to fully specify the data generating process. We do so with the following R code.

get_sim_data = function(f, sample_size = 100) {
  # x uniform on [0, 1]; y normal with mean f(x) and sd 0.3
  x = runif(n = sample_size, min = 0, max = 1)
  y = rnorm(n = sample_size, mean = f(x), sd = 0.3)
  data.frame(x, y)
}
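As a quick check of the generator (a usage sketch, not from the original slides), we can simulate one dataset and inspect it:

sim_data = get_sim_data(f)
head(sim_data)  # 100 rows: x in [0, 1], y = x^2 plus N(0, 0.3^2) noise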

Simulation Study, Models

Using this setup, we will generate datasets, $\mathcal{D}$, with a sample size $n = 100$ and fit four models.

predict(fit0, x) $= \hat{f}_0(x) = \hat{\beta}_0$
predict(fit1, x) $= \hat{f}_1(x) = \hat{\beta}_0 + \hat{\beta}_1 x$
predict(fit2, x) $= \hat{f}_2(x) = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2$
predict(fit9, x) $= \hat{f}_9(x) = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \ldots + \hat{\beta}_9 x^9$

Simulation Study, Trained Models

Figure: Four polynomial models (y ~ 1, y ~ poly(x, 1), y ~ poly(x, 2), y ~ poly(x, 9)) and the true regression function, fit to a single simulated dataset.
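A minimal sketch of how a figure like this could be reproduced with base R graphics (the colors and styling are assumptions, not the original plotting code):

# fit the four models to one simulated dataset
sim_data = get_sim_data(f)
fit_0 = lm(y ~ 1, data = sim_data)
fit_1 = lm(y ~ poly(x, degree = 1), data = sim_data)
fit_2 = lm(y ~ poly(x, degree = 2), data = sim_data)
fit_9 = lm(y ~ poly(x, degree = 9), data = sim_data)

# overlay the fitted curves and the truth on a grid
x_grid = data.frame(x = seq(0, 1, by = 0.01))
plot(y ~ x, data = sim_data, col = "grey", pch = 20)
lines(x_grid$x, predict(fit_0, x_grid), col = "red")
lines(x_grid$x, predict(fit_1, x_grid), col = "orange")
lines(x_grid$x, predict(fit_2, x_grid), col = "blue")
lines(x_grid$x, predict(fit_9, x_grid), col = "green")
lines(x_grid$x, f(x_grid$x), lty = 2)  # truth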

Simulation Study, Repeated Training

Figure: The intercept-only model (y ~ 1) and the degree-9 model (y ~ poly(x, 9)) fit to three different simulated datasets.

Simulation Study, KNN

Figure: KNN fits with k = 5 and k = 100 on three different simulated datasets.
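The slides do not show the KNN code; here is a minimal sketch of such fits, assuming the FNN package and its knn.reg() function rather than whatever was actually used:

library(FNN)  # assumption: FNN's knn.reg() for KNN regression

sim_data = get_sim_data(f)
x_grid = data.frame(x = seq(0, 1, by = 0.01))

# k = 5: flexible fit (low bias, high variance)
pred_k5 = knn.reg(train = sim_data["x"], test = x_grid, y = sim_data$y, k = 5)$pred
# k = 100 = n: predicts the overall mean everywhere (high bias, low variance)
pred_k100 = knn.reg(train = sim_data["x"], test = x_grid, y = sim_data$y, k = 100)$pred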

Simulation Study, Setup

set.seed(1)
n_sims = 250
n_models = 4
x = data.frame(x = 0.90)
predictions = matrix(0, nrow = n_sims, ncol = n_models)

Simulation Study, Running Simulations

for (sim in 1:n_sims) {
  sim_data = get_sim_data(f)

  # fit models
  fit_0 = lm(y ~ 1, data = sim_data)
  fit_1 = lm(y ~ poly(x, degree = 1), data = sim_data)
  fit_2 = lm(y ~ poly(x, degree = 2), data = sim_data)
  fit_9 = lm(y ~ poly(x, degree = 9), data = sim_data)

  # get predictions
  predictions[sim, 1] = predict(fit_0, x)
  predictions[sim, 2] = predict(fit_1, x)
  predictions[sim, 3] = predict(fit_2, x)
  predictions[sim, 4] = predict(fit_9, x)
}

Simulation Study, Results

Figure: Simulated predictions at x = 0.90 for each polynomial degree (0, 1, 2, 9).

Bias-Variance Tradeoff

- As complexity increases, bias decreases.
- As complexity increases, variance increases.

Simulation Study, Quantities of Interest

$$\text{MSE}\left(f(0.90), \hat{f}_k(0.90)\right) = \underbrace{\left(E\left[\hat{f}_k(0.90)\right] - f(0.90)\right)^2}_{\text{bias}^2\left(\hat{f}_k(0.90)\right)} + \underbrace{E\left[\left(\hat{f}_k(0.90) - E\left[\hat{f}_k(0.90)\right]\right)^2\right]}_{\text{var}\left(\hat{f}_k(0.90)\right)}$$

Estimation Using Simulation

$$\widehat{\text{MSE}}\left(f(0.90), \hat{f}_k(0.90)\right) = \frac{1}{n_{\text{sims}}} \sum_{i=1}^{n_{\text{sims}}} \left(f(0.90) - \hat{f}_k^{(i)}(0.90)\right)^2$$

$$\widehat{\text{bias}}\left(\hat{f}_k(0.90)\right) = \left(\frac{1}{n_{\text{sims}}} \sum_{i=1}^{n_{\text{sims}}} \hat{f}_k^{(i)}(0.90)\right) - f(0.90)$$

$$\widehat{\text{var}}\left(\hat{f}_k(0.90)\right) = \frac{1}{n_{\text{sims}}} \sum_{i=1}^{n_{\text{sims}}} \left(\hat{f}_k^{(i)}(0.90) - \frac{1}{n_{\text{sims}}} \sum_{j=1}^{n_{\text{sims}}} \hat{f}_k^{(j)}(0.90)\right)^2$$

Here $\hat{f}_k^{(i)}$ denotes the degree-$k$ model trained on the $i$-th simulated dataset.
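Equivalently (a minimal sketch, not from the slides), these estimates can be computed directly from the predictions matrix filled in by the simulation loop above:

f_x = f(0.90)  # true value of the regression function at x = 0.90, i.e. 0.81

est_bias = apply(predictions, 2, mean) - f_x
est_var  = apply(predictions, 2, function(p) mean((p - mean(p)) ^ 2))
est_mse  = apply(predictions, 2, function(p) mean((f_x - p) ^ 2))

data.frame(degree = c(0, 1, 2, 9), mse = est_mse,
           bias_sq = est_bias ^ 2, variance = est_var)

With the $1/n_{\text{sims}}$ variance estimator, the estimated MSE decomposes exactly as estimated squared bias plus estimated variance.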

Simulation Study, Results

Degree   Mean Squared Error   Bias Squared   Variance
0        0.22643              0.22476        0.00167
1        0.00829              0.00508        0.00322
2        0.00387              0.00005        0.00381
9        0.01019              0.00002        0.01017

If Time

Note that $\hat{f}_9(x)$ is unbiased (its squared bias above is essentially zero), yet its variance makes its MSE larger than that of $\hat{f}_2(x)$.

Some live coding.