IEOR 165 Lecture 7


1 Bias-Variance Tradeoff

Consider the case of parametric regression with $\beta \in \mathbb{R}$, and suppose we would like to analyze the error of the estimate $\hat{\beta}$ in comparison to the true parameter $\beta$. There are a number of ways that we could characterize this error. For mathematical and computational reasons, a popular choice is the squared loss: the error between the estimate and the true parameter is quantified as $\mathbb{E}[(\hat{\beta} - \beta)^2]$.

By doing some algebra, we can better characterize the nature of this error measure. In particular, we have that

$$\begin{aligned}
\mathbb{E}[(\hat{\beta} - \beta)^2] &= \mathbb{E}[(\hat{\beta} - \mathbb{E}\hat{\beta} + \mathbb{E}\hat{\beta} - \beta)^2] \\
&= \mathbb{E}[(\mathbb{E}\hat{\beta} - \beta)^2] + \mathbb{E}[(\hat{\beta} - \mathbb{E}\hat{\beta})^2] + 2\,\mathbb{E}[(\mathbb{E}\hat{\beta} - \beta)(\hat{\beta} - \mathbb{E}\hat{\beta})] \\
&= \mathbb{E}[(\mathbb{E}\hat{\beta} - \beta)^2] + \mathbb{E}[(\hat{\beta} - \mathbb{E}\hat{\beta})^2] + 2(\mathbb{E}\hat{\beta} - \beta)\,\mathbb{E}[\hat{\beta} - \mathbb{E}\hat{\beta}] \\
&= \mathbb{E}[(\mathbb{E}\hat{\beta} - \beta)^2] + \mathbb{E}[(\hat{\beta} - \mathbb{E}\hat{\beta})^2],
\end{aligned}$$

where the cross term vanishes because $\mathbb{E}[\hat{\beta} - \mathbb{E}\hat{\beta}] = 0$. The term $\mathbb{E}[(\hat{\beta} - \mathbb{E}\hat{\beta})^2]$ is clearly the variance of the estimate $\hat{\beta}$. The other term $\mathbb{E}[(\mathbb{E}\hat{\beta} - \beta)^2]$ measures how far away the average estimate $\mathbb{E}\hat{\beta}$ is from the true value, and it is common to define $\mathrm{bias}(\hat{\beta}) = \mathbb{E}\hat{\beta} - \beta$. With this notation, we have that

$$\mathbb{E}[(\hat{\beta} - \beta)^2] = \mathrm{bias}(\hat{\beta})^2 + \mathrm{var}(\hat{\beta}).$$

This equation states that the expected estimation error, as measured by the squared loss, is equal to the bias squared plus the variance, and in fact there is a tradeoff between these two aspects of an estimate. It is worth making three comments:

1. If $\mathrm{bias}(\hat{\beta}) = \mathbb{E}\hat{\beta} - \beta = 0$, then the estimate $\hat{\beta}$ is said to be unbiased.

2. This bias-variance tradeoff also exists for vector-valued parameters $\beta \in \mathbb{R}^p$, for nonparametric estimates, and for other models.

3. The term "overfit" is sometimes used to refer to a model with low bias but extremely high variance.
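As a sanity check on this identity, the short sketch below (my own illustration, not part of the lecture; the shrunken-sample-mean estimator and all numerical values are arbitrary choices) estimates both sides of the decomposition by Monte Carlo.

    import numpy as np

    rng = np.random.default_rng(0)
    beta, sigma, n, trials = 2.0, 1.0, 10, 200_000

    # Hypothetical estimator: the sample mean of n draws, shrunk toward zero,
    # so that it has both nonzero bias and nonzero variance.
    shrink = 0.8
    beta_hat = shrink * rng.normal(beta, sigma / np.sqrt(n), size=trials)

    mse = np.mean((beta_hat - beta) ** 2)   # E[(beta_hat - beta)^2]
    bias = np.mean(beta_hat) - beta         # E[beta_hat] - beta
    var = np.var(beta_hat)                  # E[(beta_hat - E beta_hat)^2]

    print(mse, bias ** 2 + var)  # the two values should nearly coincide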

2 Example: Estimating Variance

Suppose $X_i \sim \mathcal{N}(\mu, \sigma^2)$ for $i = 1, \ldots, n$ are iid random variables, where $\mu, \sigma^2$ are both unknown. There are two commonly used estimators for the variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2 \qquad\text{and}\qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2 = \frac{n-1}{n}s^2.$$

Though these two estimators look similar, they are quite different and have different uses. The estimator $s^2$ is used in hypothesis testing, whereas the estimator $\hat{\sigma}^2$ is the MLE of the variance.

The first comparison we make is the expectation of the two estimators. In particular, observe that

$$\mathbb{E}[s^2] = \mathbb{E}\Big[\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2\Big] = \frac{1}{n-1}\,\mathbb{E}\Big[\sum_{i=1}^n \big(X_i^2 - 2X_i\bar{X} + \bar{X}^2\big)\Big] = \frac{1}{n-1}\Big(\mathbb{E}\sum_{i=1}^n X_i^2 - 2\,\mathbb{E}\sum_{i=1}^n X_i\bar{X} + \mathbb{E}\sum_{i=1}^n \bar{X}^2\Big).$$

Examining the second term, note that

$$\sum_{i=1}^n X_i\bar{X} = \sum_{i=1}^n X_i \cdot \frac{1}{n}\sum_{j=1}^n X_j = \frac{1}{n}\Big(\sum_{i=1}^n X_i^2 + \sum_{i=1}^n\sum_{j \neq i} X_iX_j\Big).$$

Hence, we have

$$\mathbb{E}\sum_{i=1}^n X_i\bar{X} = \mathbb{E}\Big[\frac{1}{n}\Big(\sum_{i=1}^n X_i^2 + \sum_{i=1}^n\sum_{j \neq i} X_iX_j\Big)\Big] = \frac{1}{n}\big(n\,\mathbb{E}X^2 + n(n-1)(\mathbb{E}X)^2\big) = \mathbb{E}X^2 + (n-1)(\mathbb{E}X)^2.$$

Examining the third term, note that

$$\sum_{i=1}^n \bar{X}^2 = n\bar{X}^2 = n\Big(\frac{1}{n}\sum_{j=1}^n X_j\Big)^2 = \frac{1}{n}\Big(\sum_{j=1}^n X_j^2 + \sum_{j=1}^n\sum_{k \neq j} X_jX_k\Big).$$

Hence, we have

$$\mathbb{E}\sum_{i=1}^n \bar{X}^2 = \mathbb{E}X^2 + (n-1)(\mathbb{E}X)^2.$$

Returning to the above expression for $\mathbb{E}[s^2]$, we have

$$\mathbb{E}[s^2] = \frac{1}{n-1}\Big(n\,\mathbb{E}X^2 - 2\,\mathbb{E}X^2 - 2(n-1)(\mathbb{E}X)^2 + \mathbb{E}X^2 + (n-1)(\mathbb{E}X)^2\Big) = \frac{(n-1)\,\mathbb{E}X^2 - (n-1)(\mathbb{E}X)^2}{n-1} = \mathbb{E}X^2 - (\mathbb{E}X)^2 = \sigma^2.$$
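Before moving on, here is a quick numerical check of this calculation (a sketch of my own, not part of the notes; the values of mu, sigma, and n are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n, trials = 3.0, 2.0, 5, 200_000

    X = rng.normal(mu, sigma, size=(trials, n))
    s2 = X.var(axis=1, ddof=1)  # s^2, i.e., the 1/(n-1) normalization

    print(s2.mean(), sigma ** 2)  # E[s^2] should be close to sigma^2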

As a result, we also have that

$$\mathbb{E}[\hat{\sigma}^2] = \mathbb{E}\Big[\frac{n-1}{n}s^2\Big] = \frac{n-1}{n}\sigma^2.$$

In this example, we would say that $s^2$ is an unbiased estimate of the variance because $\mathrm{bias}(s^2) = \mathbb{E}[s^2] - \sigma^2 = \sigma^2 - \sigma^2 = 0$. However, $\hat{\sigma}^2$ is not an unbiased estimate of the variance because $\mathrm{bias}(\hat{\sigma}^2) = \mathbb{E}[\hat{\sigma}^2] - \sigma^2 = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\sigma^2/n$.

It is tempting to say that $s^2$ is a better estimate of the variance than $\hat{\sigma}^2$ because the former is unbiased and the latter is biased. However, let us examine the situation more closely. In particular, suppose we compute the variance of both estimators. From the definition of the $\chi^2$ distribution, we know that $s^2 \sim \sigma^2\,\chi^2_{n-1}/(n-1)$, where $\chi^2_{n-1}$ denotes a $\chi^2$ distribution with $n-1$ degrees of freedom. The variance of a $\chi^2_{n-1}$ distribution is $2(n-1)$, by properties of the $\chi^2$ distribution. Hence, by properties of variance, we have that

$$\mathrm{var}(s^2) = \frac{2\sigma^4}{n-1} \qquad\text{and}\qquad \mathrm{var}(\hat{\sigma}^2) = \frac{2(n-1)}{n^2}\sigma^4.$$

Since $n^2 > (n-1)^2$, dividing by $(n-1)n^2$ gives the inequality $1/(n-1) > (n-1)/n^2$. This means that $\mathrm{var}(s^2) > \mathrm{var}(\hat{\sigma}^2)$: in words, the variance of $\hat{\sigma}^2$ is lower than the variance of $s^2$.

Given that one estimate has lower bias and higher variance than the other, the natural question to ask is which estimate has lower estimation error. Using the bias-variance tradeoff equation, we have

$$\mathbb{E}[(s^2 - \sigma^2)^2] = \mathrm{bias}(s^2)^2 + \mathrm{var}(s^2) = 0^2 + \frac{2\sigma^4}{n-1} = \frac{2\sigma^4}{n-1}$$

and

$$\mathbb{E}[(\hat{\sigma}^2 - \sigma^2)^2] = \mathrm{bias}(\hat{\sigma}^2)^2 + \mathrm{var}(\hat{\sigma}^2) = \frac{\sigma^4}{n^2} + \frac{2(n-1)}{n^2}\sigma^4 = \frac{2n-1}{n^2}\sigma^4 = \frac{(2n-1)(n-1)}{2n^2}\cdot\frac{2\sigma^4}{n-1}.$$

But note that $(2n-1)(n-1) < 2n(n-1) < 2n^2$. As a result, $\frac{(2n-1)(n-1)}{2n^2} < 1$. This means that $\mathbb{E}[(\hat{\sigma}^2 - \sigma^2)^2] < \mathbb{E}[(s^2 - \sigma^2)^2]$, and so $\hat{\sigma}^2$ has lower estimation error than $s^2$ when measuring error with the squared loss.
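The same conclusion can be seen in simulation. The sketch below (an illustration of mine, not from the notes; the parameter values are arbitrary) estimates the squared-error risk of both estimators and compares them with the closed-form expressions derived above.

    import numpy as np

    rng = np.random.default_rng(2)
    mu, sigma, n, trials = 0.0, 1.0, 5, 500_000

    X = rng.normal(mu, sigma, size=(trials, n))
    s2 = X.var(axis=1, ddof=1)          # unbiased estimator s^2
    sigma2_hat = X.var(axis=1, ddof=0)  # MLE estimator sigma_hat^2

    mse_s2 = np.mean((s2 - sigma ** 2) ** 2)
    mse_mle = np.mean((sigma2_hat - sigma ** 2) ** 2)

    print(mse_s2, 2 * sigma ** 4 / (n - 1))            # ~ 2 sigma^4 / (n-1)
    print(mse_mle, (2 * n - 1) * sigma ** 4 / n ** 2)  # ~ (2n-1) sigma^4 / n^2, smaller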

3 Stein's Paradox

The situation of estimating the variance of a Gaussian, where a biased estimator has lower estimation error than an unbiased estimator, is not an exceptional case. Rather, the general situation is that biased estimators can have lower estimation errors. Such a conclusion holds even in surprising cases, such as the one described below.

Suppose we have jointly independent $X_i \sim \mathcal{N}(\mu_i, 1)$ for $i = 1, \ldots, n$, with $\mu_i \neq \mu_j$ for $i \neq j$. This is a model in which we have $n$ measurements from Gaussians, where the mean of each measurement is different. It is worth emphasizing that in this model we only have one measurement from each different Gaussian.

Now for this model, it is natural to consider the problem of estimating the mean of each Gaussian, and the obvious choice is to use the estimator

$$\hat{\mu} = \begin{bmatrix} X_1 & X_2 & \cdots & X_n \end{bmatrix}.$$

This is an obvious choice because

$$\mathbb{E}\hat{\mu} = \begin{bmatrix} \mu_1 & \mu_2 & \cdots & \mu_n \end{bmatrix}.$$

It is hard to imagine that any other estimator could do better in terms of estimation error than this estimator, because (i) we only have one data point for each Gaussian, and (ii) the Gaussians are jointly independent. However, this is exactly the conclusion of Stein's paradox. It turns out that if $n \geq 3$, then the following James-Stein estimator

$$\hat{\mu}_{JS} = \Big(1 - \frac{n-2}{\|\hat{\mu}\|_2^2}\Big)\hat{\mu},$$

where $\hat{\mu}$ is as defined above, has strictly lower estimation error under the squared loss than that of $\hat{\mu}$. This is paradoxical because the estimation of the means of the Gaussians can be impacted by the total number of Gaussians, even when the Gaussians are jointly independent.

The intuition is that as we try to jointly estimate more means, we can reduce the estimation error by purposely adding some bias to the estimate. In this case, we bias the estimates of the means towards zero. This is sometimes called shrinkage in statistics, because we are shrinking the values towards zero. The error introduced by biasing the estimate is compensated by a greater reduction in the variance of the estimate.
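The paradox can also be illustrated numerically. The following sketch (my own; the particular means are an arbitrary choice) compares the total squared error of $\hat{\mu}$ and $\hat{\mu}_{JS}$ over repeated draws.

    import numpy as np

    rng = np.random.default_rng(3)
    n, trials = 10, 100_000
    mu = np.linspace(0.5, 2.0, n)  # arbitrary distinct means

    X = rng.normal(mu, 1.0, size=(trials, n))  # one observation per Gaussian, per trial

    mu_hat = X                                  # naive estimator
    norm2 = np.sum(mu_hat ** 2, axis=1, keepdims=True)
    mu_js = (1 - (n - 2) / norm2) * mu_hat      # James-Stein estimator

    risk_naive = np.mean(np.sum((mu_hat - mu) ** 2, axis=1))
    risk_js = np.mean(np.sum((mu_js - mu) ** 2, axis=1))

    print(risk_naive, risk_js)  # the James-Stein risk should be strictly smaller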

4 Proportional Shrinkage for OLS

Given that adding some bias can improve the estimation error, it is interesting to consider how we can bias OLS estimates. Recall that the OLS estimate with data $(x_i, y_i)$ for $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, $i = 1, \ldots, n$ is given by

$$\hat{\beta} = \arg\min_\beta \|Y - X\beta\|_2^2,$$

where the matrix $X \in \mathbb{R}^{n \times p}$ and the vector $Y \in \mathbb{R}^n$ are such that the $i$-th row of $X$ is $x_i$ and the $i$-th entry of $Y$ is $y_i$. One popular approach to adding shrinkage to OLS is known as ridge regression, Tikhonov regularization, or L2-regularization. It is given by the solution to the following optimization problem:

$$\hat{\beta} = \arg\min_\beta \|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2,$$

where $\lambda \geq 0$ is a tuning parameter. We will return to the question of how to choose the value of $\lambda$ in a future lecture.

To understand why this method adds statistical shrinkage, consider the one-dimensional special case where $x_i \in \mathbb{R}$. In this case, the estimate is given by

$$\begin{aligned}
\hat{\beta} &= \arg\min_\beta \sum_{i=1}^n (y_i - x_i\beta)^2 + \lambda\beta^2 \\
&= \arg\min_\beta \sum_{i=1}^n \big(y_i^2 - 2y_ix_i\beta + x_i^2\beta^2\big) + \lambda\beta^2 \\
&= \arg\min_\beta\; -2\Big(\sum_{i=1}^n y_ix_i\Big)\beta + \Big(\lambda + \sum_{i=1}^n x_i^2\Big)\beta^2 + \sum_{i=1}^n y_i^2.
\end{aligned}$$

Next, setting the derivative of the objective $J(\beta)$ equal to zero gives

$$\frac{dJ}{d\beta} = -2\sum_{i=1}^n y_ix_i + 2\Big(\lambda + \sum_{i=1}^n x_i^2\Big)\beta = 0 \quad\Longrightarrow\quad \hat{\beta} = \frac{\sum_{i=1}^n y_ix_i}{\lambda + \sum_{i=1}^n x_i^2}.$$

When $\lambda = 0$, this is simply the OLS estimate. If $\lambda > 0$, then the denominator of the estimate is larger; hence, the estimate will have a smaller absolute value (i.e., it will shrink towards zero). If we can choose a good value of $\lambda$, then the estimate will have lower estimation error. We will consider how to choose the value of $\lambda$ in a future lecture.
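A short sketch of this one-dimensional formula (my own illustration, with synthetic data and arbitrary values of lambda) shows the shrinkage effect directly:

    import numpy as np

    rng = np.random.default_rng(4)
    n, beta_true = 50, 1.5

    x = rng.normal(size=n)
    y = beta_true * x + rng.normal(scale=0.5, size=n)

    for lam in [0.0, 1.0, 10.0, 100.0]:
        beta_hat = np.sum(y * x) / (lam + np.sum(x ** 2))
        print(lam, beta_hat)  # the estimate shrinks toward zero as lambda grows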