Regression Estimation - Least Squares and Maximum Likelihood

Regression Estimation - Least Squares and Maximum Likelihood
Dr. Frank Wood, fwood@stat.columbia.edu
Linear Regression Models, Lecture 3

Least Squares Max(min)imization

Function to minimize w.r.t. β_0, β_1:

Q = \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_i))^2

Minimize this by maximizing -Q. Find the partial derivatives and set both equal to zero (go to board).

Normal Equations

The results of this minimization step are called the normal equations; b_0 and b_1 are called point estimators of β_0 and β_1, respectively:

\sum Y_i = n b_0 + b_1 \sum X_i
\sum X_i Y_i = b_0 \sum X_i + b_1 \sum X_i^2

This is a system of two equations in two unknowns. The solution follows (write these on board).

Solution to Normal Equations

After a lot of algebra one arrives at

b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}
b_0 = \bar{Y} - b_1 \bar{X}

where \bar{X} = \frac{\sum X_i}{n} and \bar{Y} = \frac{\sum Y_i}{n}.
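As a quick illustration (my own sketch, not from the slides), the closed-form b_1 and b_0 can be computed directly in Python; the simulated data and the true line y = 2x + 9 below simply mirror the plots that follow.

```python
# Sketch: least-squares slope and intercept from the normal-equation solution.
# The simulated data (true line y = 2x + 9, noise scale 2) are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(1.0, 11.0)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)

Xbar, Ybar = X.mean(), Y.mean()
b1 = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
b0 = Ybar - b1 * Xbar
print(b1, b0)  # should land near the true slope 2 and intercept 9
```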

Least Squares Fit?

[Figure: Response/Output vs. Predictor/Input scatter with two lines - the least-squares estimate y = 2.09x + 8.36 (mse 4.15) and the true line y = 2x + 9 (mse 4.22).]

Guess #1

[Figure: same data with the guess y = 0x + 21.2 (mse 37.1) against the true line y = 2x + 9 (mse 4.22).]

Guess #2

[Figure: same data with the guess y = 1.5x + 13 (mse 7.84) against the true line y = 2x + 9 (mse 4.22).]

Looking Ahead: Matrix Least Squares

\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} =
\begin{bmatrix} X_1 & 1 \\ X_2 & 1 \\ \vdots & \vdots \\ X_n & 1 \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_0 \end{bmatrix}

The solution to this equation is the solution to least squares linear regression (and the maximum likelihood solution under the normal error distribution assumption).
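A minimal sketch of the matrix formulation, assuming NumPy is available: stack the X values and a column of ones into a design matrix and let np.linalg.lstsq solve the least-squares problem (the example data are invented).

```python
# Sketch: solving the matrix least-squares problem Y = [X 1] [b1 b0]^T.
import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(1.0, 10.0, 25)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)

A = np.column_stack([X, np.ones_like(X)])     # rows [X_i, 1], as in the slide
beta, _, _, _ = np.linalg.lstsq(A, Y, rcond=None)
print(beta)                                   # [b1, b0], close to [2, 9]
```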

Questions to Ask
- Is the relationship really linear?
- What is the distribution of the errors?
- Is the fit good?
- How much of the variability of the response is accounted for by including the predictor variable?
- Is the chosen predictor variable the best one?

Is This Better?

[Figure: same data fit with a 7th-order polynomial (mse 3.18), Response/Output vs. Predictor/Input.]

Goals for First Half of Course
- How to do linear regression
- Self-familiarization with software tools
- How to interpret standard linear regression results
- How to derive tests
- How to assess and address deficiencies in regression models

Properties of Solution

The i-th residual is defined to be e_i = Y_i - \hat{Y}_i.

The sum of the residuals is zero:

\sum_i e_i = \sum_i (Y_i - b_0 - b_1 X_i) = \sum_i Y_i - n b_0 - b_1 \sum_i X_i = 0

by the first normal equation.

Properties of Solution

The sum of the observed values Y_i equals the sum of the fitted values \hat{Y}_i:

\sum_i \hat{Y}_i = \sum_i (b_1 X_i + b_0)
= \sum_i (b_1 X_i + \bar{Y} - b_1 \bar{X})
= b_1 \sum_i X_i + n\bar{Y} - b_1 n \bar{X}
= b_1 n \bar{X} + \sum_i Y_i - b_1 n \bar{X}
= \sum_i Y_i

Properties of Solution

The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the level of the predictor variable in the i-th trial:

\sum_i X_i e_i = \sum_i X_i (Y_i - b_0 - b_1 X_i) = \sum_i X_i Y_i - b_0 \sum_i X_i - b_1 \sum_i X_i^2 = 0

by the second normal equation.

Properties of Solution

The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the fitted value of the response variable for the i-th trial:

\sum_i \hat{Y}_i e_i = \sum_i (b_0 + b_1 X_i) e_i = b_0 \sum_i e_i + b_1 \sum_i e_i X_i = 0

by the previous two properties.
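A quick numerical check of these three residual properties (my own sketch, not from the lecture; the simulated data are arbitrary):

```python
# Sketch: verify sum(e_i) = 0, sum(X_i e_i) = 0, sum(Yhat_i e_i) = 0 for an OLS fit.
import numpy as np

rng = np.random.default_rng(2)
X = np.arange(1.0, 21.0)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
Yhat = b0 + b1 * X
e = Y - Yhat

print(e.sum(), (X * e).sum(), (Yhat * e).sum())  # all zero up to rounding error
```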

Properties of Solution

The regression line always goes through the point (\bar{X}, \bar{Y}).

Estimating Error Term Variance σ^2
- Review estimation in the non-regression setting.
- Show estimation results for the regression setting.

Estimation Review

An estimator is a rule that tells how to calculate the value of an estimate based on the measurements contained in a sample, e.g. the sample mean

\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i

Point Estimators and Bias

Point estimator: \hat{\theta} = f(\{Y_1, ..., Y_n\})
Unknown quantity / parameter: θ

Definition: the bias of the estimator \hat{\theta} is

B(\hat{\theta}) = E(\hat{\theta}) - \theta

One Sample Example

[Figure: samples drawn from a normal density with µ = 5, σ = 0.75, with the estimate \hat{\theta} and the true θ marked; run bias_example_plot.m]

Distribution of Estimator

If the estimator is a function of the samples and the distribution of the samples is known, then the distribution of the estimator can (often) be determined.

Methods:
- Distribution (CDF) functions
- Transformations
- Moment generating functions
- Jacobians (change of variable)

Example

Samples from a Normal(µ, σ^2) distribution: Y_i ~ Normal(µ, σ^2)

Estimate the population mean:

\theta = \mu, \quad \hat{\theta} = \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i

Sampling Distribution of the Estimator

First moment:

E(\hat{\theta}) = E\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right) = \frac{1}{n} \sum_{i=1}^{n} E(Y_i) = \frac{n\mu}{n} = \theta

This is an example of an unbiased estimator: B(\hat{\theta}) = E(\hat{\theta}) - \theta = 0.

Variance of Estimator

Definition: the variance of an estimator is

V(\hat{\theta}) = E([\hat{\theta} - E(\hat{\theta})]^2)

Remember:
V(cY) = c^2 V(Y)
V(\sum_{i=1}^{n} Y_i) = \sum_{i=1}^{n} V(Y_i), but only if the Y_i are independent with finite variance.

Example Estimator Variance

For the sample-mean estimator (here with normal samples of variance σ^2):

V(\hat{\theta}) = V\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2} \sum_{i=1}^{n} V(Y_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}

Note the assumptions (independence and finite variance).
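A small simulation confirming that the variance of the sample mean is σ^2/n (my own sketch; µ, σ, n, and the replication count are arbitrary):

```python
# Sketch: empirical variance of the sample mean vs. the theoretical sigma^2 / n.
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 5.0, 0.75, 25, 100_000
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print(means.var())      # empirical variance of the estimator
print(sigma ** 2 / n)   # theoretical value, 0.0225 here
```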

Distribution of Sample Mean Estimator

[Figure: histogram of the sample-mean estimator computed from 1000 simulated samples.]

Bias Variance Trade-off

The mean squared error of an estimator is

MSE(\hat{\theta}) = E([\hat{\theta} - \theta]^2)

It can be re-expressed as

MSE(\hat{\theta}) = V(\hat{\theta}) + B(\hat{\theta})^2

MSE = VAR + BIAS^2 Proof

MSE(\hat{\theta}) = E((\hat{\theta} - \theta)^2)
= E(([\hat{\theta} - E(\hat{\theta})] + [E(\hat{\theta}) - \theta])^2)
= E([\hat{\theta} - E(\hat{\theta})]^2) + 2 E([E(\hat{\theta}) - \theta][\hat{\theta} - E(\hat{\theta})]) + E([E(\hat{\theta}) - \theta]^2)
= V(\hat{\theta}) + 2 E(E(\hat{\theta})[\hat{\theta} - E(\hat{\theta})] - \theta[\hat{\theta} - E(\hat{\theta})]) + (B(\hat{\theta}))^2
= V(\hat{\theta}) + 2(0 - 0) + (B(\hat{\theta}))^2
= V(\hat{\theta}) + (B(\hat{\theta}))^2

Trade-off

Think of variance as confidence and bias as correctness; the intuitions (largely) apply. Sometimes a biased estimator can produce lower MSE if it lowers the variance.
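To make the trade-off concrete, here is a sketch of my own (with arbitrary numbers) in which a deliberately biased estimator, the sample mean shrunk toward zero, has lower MSE than the unbiased sample mean; it also checks the MSE = variance + bias^2 decomposition from the previous slide.

```python
# Sketch: a biased (shrunk) mean estimator can beat the unbiased sample mean
# in MSE, and MSE decomposes as variance + bias^2.
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, n, reps = 1.0, 3.0, 10, 200_000
Y = rng.normal(theta, sigma, size=(reps, n))

for name, est in [("sample mean", Y.mean(axis=1)),
                  ("shrunk mean", 0.5 * Y.mean(axis=1))]:
    bias = est.mean() - theta
    mse = ((est - theta) ** 2).mean()
    print(name, mse, est.var() + bias ** 2)   # the two columns should agree
```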

Estimating Error Term Variance σ^2

Regression model:
- The variance of each observation Y_i is σ^2 (the same as for the error term ε_i).
- Each Y_i comes from a different probability distribution, with a different mean that depends on the level X_i.
- The deviation of an observation Y_i must therefore be calculated around its own estimated mean.

s^2 Estimator for σ^2

s^2 = MSE = \frac{SSE}{n-2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\sum e_i^2}{n-2}

MSE is an unbiased estimator of σ^2: E(MSE) = σ^2.

The sum of squares SSE has n - 2 degrees of freedom associated with it.
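A sketch (again with invented data) of computing s^2 = SSE/(n-2) after a least-squares fit; with noise standard deviation 2, the estimate should land near σ^2 = 4.

```python
# Sketch: s^2 = MSE = SSE / (n - 2) as an estimate of the error variance.
import numpy as np

rng = np.random.default_rng(5)
X = np.arange(1.0, 31.0)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)

s2 = np.sum(e ** 2) / (X.size - 2)
print(s2)   # unbiased estimate of sigma^2, near 4 here
```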

Normal Error Regression Model

No matter how the error terms ε_i are distributed, the least squares method provides unbiased point estimators of β_0 and β_1 that also have minimum variance among all unbiased linear estimators.

To set up interval estimates and make tests we need to specify the distribution of the ε_i.

We will assume that the ε_i are normally distributed.

Normal Error Regression Model

Y_i = \beta_0 + \beta_1 X_i + \epsilon_i

- Y_i is the value of the response variable in the i-th trial
- β_0 and β_1 are parameters
- X_i is a known constant, the value of the predictor variable in the i-th trial
- ε_i ~ iid N(0, σ^2), i = 1, ..., n

Notational Convention

When you see ε_i ~ iid N(0, σ^2), it is read as: ε_i is distributed independently and identically according to a normal distribution with mean 0 and variance σ^2.

Examples:
θ ~ Poisson(λ)
z ~ G(θ)

Maximum Likelihood Principle

The method of maximum likelihood chooses as estimates those values of the parameters that are most consistent with the sample data.

Likelihood Function

If X_i ~ F(Θ), i = 1, ..., n, then the likelihood function is

L(\{X_i\}_{i=1}^{n}, \Theta) = \prod_{i=1}^{n} F(X_i; \Theta)

Example: N(10, 3) Density, Single Observation

[Figure: a single sample plotted against the N(10, 3) density; N = 10, -log likelihood = 4.3038.]

Example: N(10, 3) Density, Single Observation (Again)

[Figure: another single sample plotted against the N(10, 3) density; N = 10, -log likelihood = 4.3038.]

Example: N(10, 3) Density, Multiple Observations

[Figure: multiple samples plotted against the N(10, 3) density; N = 10, -log likelihood = 36.2204.]

Maximum Likelihood Estimation

The likelihood function can be maximized w.r.t. the parameter(s) Θ; doing this, one can arrive at estimators for the parameters.

L(\{X_i\}_{i=1}^{n}, \Theta) = \prod_{i=1}^{n} F(X_i; \Theta)

To do this, find solutions to (analytically or by following the gradient)

\frac{dL(\{X_i\}_{i=1}^{n}, \Theta)}{d\Theta} = 0

Important Trick

Never (well, almost never) maximize the likelihood function directly; maximize the log likelihood function instead:

\log L(\{X_i\}_{i=1}^{n}, \Theta) = \log\left(\prod_{i=1}^{n} F(X_i; \Theta)\right) = \sum_{i=1}^{n} \log F(X_i; \Theta)

Quite often the log of the density is easier to work with mathematically.
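A sketch of the trick in practice, assuming SciPy is available: minimize the negative log likelihood of a N(µ, σ^2) sample numerically. The data, seed, and the reparameterization through log σ are my own choices, not part of the lecture.

```python
# Sketch: maximize the log likelihood (by minimizing its negative) for N(mu, sigma^2).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(6)
data = rng.normal(10.0, np.sqrt(3.0), size=200)   # samples from N(10, 3)

def neg_log_lik(params):
    mu, log_sigma = params                        # log sigma keeps sigma > 0
    return -np.sum(norm.logpdf(data, mu, np.exp(log_sigma)))

res = minimize(neg_log_lik, x0=[data.mean(), np.log(data.std())])
print(res.x[0], np.exp(res.x[1]) ** 2)            # roughly mu = 10, sigma^2 = 3
```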

ML Normal Regression

Likelihood function:

L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} e^{-\frac{1}{2\sigma^2}(Y_i - \beta_0 - \beta_1 X_i)^2}
= \frac{1}{(2\pi\sigma^2)^{n/2}} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1 X_i)^2}

Maximizing this w.r.t. the parameters (how?) yields the estimators on the next slide.
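As a sanity check (my own sketch, not from the lecture), maximizing this regression log likelihood numerically recovers the same b_0 and b_1 as least squares; SciPy and the simulated data are assumptions of the example.

```python
# Sketch: numerically maximizing the normal-regression log likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)
X = np.arange(1.0, 21.0)
Y = 2.0 * X + 9.0 + rng.normal(scale=2.0, size=X.size)

def neg_log_lik(params):
    b0, b1, log_sigma = params
    return -np.sum(norm.logpdf(Y, b0 + b1 * X, np.exp(log_sigma)))

res = minimize(neg_log_lik, x0=[Y.mean(), 0.0, np.log(Y.std())])
print(res.x[:2])   # should match the least-squares b0 and b1 (near 9 and 2)
```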

Maximum Likelihood Estimator(s)

β_0: b_0, the same as in the least squares case
β_1: b_1, the same as in the least squares case
σ^2: \hat{\sigma}^2 = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n}

Note that the ML estimator of σ^2 is biased, since s^2 is unbiased and

s^2 = MSE = \frac{n}{n-2}\hat{\sigma}^2
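A small simulation of the bias (my own sketch; the sample size, σ^2, and replication count are arbitrary): averaged over many simulated data sets, SSE/n under-estimates σ^2 while SSE/(n-2) hits it, consistent with s^2 = n/(n-2) \hat{\sigma}^2.

```python
# Sketch: the ML variance estimator SSE/n is biased; SSE/(n-2) is unbiased.
import numpy as np

rng = np.random.default_rng(8)
n, sigma2, reps = 10, 4.0, 50_000
X = np.arange(1.0, n + 1.0)
sse = np.empty(reps)

for r in range(reps):
    Y = 2.0 * X + 9.0 + rng.normal(scale=np.sqrt(sigma2), size=n)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    sse[r] = np.sum((Y - b0 - b1 * X) ** 2)

print((sse / n).mean(), (sse / (n - 2)).mean())   # about 3.2 vs. 4.0 here
```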

Comments
- Least squares minimizes the squared error between the prediction and the true output.
- The normal distribution is fully characterized by its first two central moments (mean and variance).
- Food for thought: what does the bias in the ML estimator of the error variance mean, and where does it come from?