Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood


Least Squares Max(min)imization

Function to minimize w.r.t. β_0, β_1:

Q = \sum_{i=1}^{n} (Y_i - (\beta_0 + \beta_1 X_i))^2

Minimize this by maximizing -Q: find the partial derivatives and set both equal to zero.

\frac{\partial Q}{\partial \beta_0} = 0 \qquad \frac{\partial Q}{\partial \beta_1} = 0

Normal Equations

The results of this maximization step are called the normal equations. b_0 and b_1 are called point estimators of β_0 and β_1, respectively.

\sum Y_i = n b_0 + b_1 \sum X_i

\sum X_i Y_i = b_0 \sum X_i + b_1 \sum X_i^2

This is a system of two equations and two unknowns. The solution is given by...

Solution to Normal Equations

After a lot of algebra one arrives at

b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}

b_0 = \bar{Y} - b_1 \bar{X}

where

\bar{X} = \frac{\sum X_i}{n} \qquad \bar{Y} = \frac{\sum Y_i}{n}
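As an illustration (not from the original slides), here is a minimal NumPy sketch of these closed-form estimators on made-up data; the seed, sample size, and true coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.uniform(0, 10, size=n)                 # hypothetical predictor values
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=n)   # hypothetical response: true intercept 2.0, slope 0.5

# Closed-form least squares estimators from the normal equations
b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
print(b0, b1)   # should land near the true intercept and slope
```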

Least Squares Fit

Guess #1

Guess #2

Looking Ahead: Matrix Least Squares

\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} X_1 & 1 \\ X_2 & 1 \\ \vdots & \vdots \\ X_n & 1 \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_0 \end{bmatrix}

The solution to this equation is the solution to least squares linear regression (and to maximum likelihood under the normal error distribution assumption).
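A hedged sketch of the matrix formulation (illustration only): np.linalg.lstsq solves the least squares problem directly, and the coefficients match the closed-form b_0, b_1 above. The data-generating line is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(0, 1, size=n)   # hypothetical data

# Design matrix laid out as on the slide: [X_i, 1], so coefficients come back as [b1, b0]
A = np.column_stack([X, np.ones_like(X)])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
b1_mat, b0_mat = coef
print(b0_mat, b1_mat)   # same values as the closed-form estimators
```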

Questions to Ask Is the relationship really linear? What is the distribution of the errors? Is the fit good? How much of the variability of the response is accounted for by including the predictor variable? Is the chosen predictor variable the best one?

Is This Better?

Goals for First Half of Course: how to do linear regression; self-familiarization with software tools; how to interpret standard linear regression results; how to derive tests; how to assess and address deficiencies in regression models.

Estimators for β_0, β_1, σ² We want to establish properties of estimators for β_0, β_1, and σ² so that we can construct hypothesis tests and so forth. We will start by establishing some properties of the regression solution.

Properties of Solution The i-th residual is defined to be e_i = Y_i - \hat{Y}_i. The sum of the residuals is zero:

\sum_i e_i = \sum_i (Y_i - b_0 - b_1 X_i) = \sum_i Y_i - n b_0 - b_1 \sum_i X_i = 0

Properties of Solution The sum of the observed values Y_i equals the sum of the fitted values \hat{Y}_i:

\sum_i Y_i = \sum_i \hat{Y}_i

since

\sum_i \hat{Y}_i = \sum_i (b_1 X_i + b_0) = \sum_i (b_1 X_i + \bar{Y} - b_1 \bar{X}) = b_1 \sum_i X_i + n\bar{Y} - b_1 n\bar{X} = b_1 n\bar{X} + \sum_i Y_i - b_1 n\bar{X} = \sum_i Y_i

Properties of Solution The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the level of the predictor variable in the i-th trial:

\sum_i X_i e_i = \sum_i X_i (Y_i - b_0 - b_1 X_i) = \sum_i X_i Y_i - b_0 \sum_i X_i - b_1 \sum_i X_i^2 = 0

Properties of Solution The regression line always goes through the point (X̄, Ȳ).
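The following is a quick numerical check of these properties on made-up data (illustration only, not part of the original slides).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X = rng.uniform(0, 10, size=n)
Y = 1.0 + 0.8 * X + rng.normal(0, 1, size=n)     # hypothetical data

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)                            # residuals e_i = Y_i - Yhat_i

print(np.isclose(e.sum(), 0))                    # sum of residuals is zero
print(np.isclose((X * e).sum(), 0))              # sum of X_i-weighted residuals is zero
print(np.isclose(b0 + b1 * X.mean(), Y.mean()))  # the line passes through (Xbar, Ybar)
```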

Estimating Error Term Variance σ² Review estimation in the non-regression setting. Show estimation results for the regression setting.

Estimation Review An estimator is a rule that tells how to calculate the value of an estimate based on the measurements contained in a sample, e.g. the sample mean

\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i

Point Estimators and Bias

Point estimator: \hat{\theta} = f(\{Y_1, \ldots, Y_n\})

Unknown quantity / parameter: θ

Definition: the bias of an estimator is B(\hat{\theta}) = E(\hat{\theta}) - \theta

One Sample Example

Distribution of Estimator If the estimator is a function of the samples and the distribution of the samples is known, then the distribution of the estimator can (often) be determined. Methods: distribution (CDF) functions, transformations, moment generating functions, Jacobians (change of variable).

Example Samples from a Normal(µ, σ²) distribution: Y_i ∼ Normal(µ, σ²). Estimate the population mean:

\theta = \mu, \qquad \hat{\theta} = \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i

Sampling Distribution of the Estimator First moment:

E(\hat{\theta}) = E\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right) = \frac{1}{n} \sum_{i=1}^{n} E(Y_i) = \frac{n\mu}{n} = \theta

This is an example of an unbiased estimator: B(\hat{\theta}) = E(\hat{\theta}) - \theta = 0.

Variance of Estimator Definition: the variance of an estimator is

Var(\hat{\theta}) = E([\hat{\theta} - E(\hat{\theta})]^2)

Remember:

Var(cY) = c^2 \, Var(Y)

Var\left(\sum_{i=1}^{n} Y_i\right) = \sum_{i=1}^{n} Var(Y_i)

the latter only if the Y_i are independent with finite variance.

Example Estimator Variance For the Normal(µ, σ²) mean estimator of the previous example,

Var(\hat{\theta}) = Var\left(\frac{1}{n} \sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2} \sum_{i=1}^{n} Var(Y_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}

Note the assumptions (independence, finite variance).
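An illustrative Monte Carlo check of unbiasedness and of the σ²/n variance formula; the values of µ, σ, n, and the number of replications are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n = 10.0, 3.0, 25          # hypothetical population parameters and sample size
reps = 200_000

# Draw many samples of size n and compute the sample mean of each
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print(means.mean())         # approx mu: the sample mean is unbiased
print(means.var())          # approx sigma^2 / n: variance of the estimator
print(sigma ** 2 / n)
```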

Central Limit Theorem Review

Central Limit Theorem: Let Y_1, Y_2, ..., Y_n be iid random variables with E(Y_i) = µ and Var(Y_i) = σ² < ∞. Define

U_n = \sqrt{n} \, \frac{\bar{Y} - \mu}{\sigma}, \qquad \text{where } \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i \qquad (1)

Then the distribution function of U_n converges to a standard normal distribution function as n → ∞. Alternately,

P(a \le U_n \le b) \to \int_a^b \frac{1}{\sqrt{2\pi}} e^{-u^2/2} \, du \qquad (2)
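A small simulation sketch of this convergence, using an intentionally non-normal (exponential) population; all choices here are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 50, 100_000
mu, sigma = 1.0, 1.0                    # mean and standard deviation of an Exponential(1) population

# U_n = sqrt(n) * (Ybar - mu) / sigma, computed for many replications
samples = rng.exponential(scale=1.0, size=(reps, n))
U = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

# Compare P(-1 <= U_n <= 1) against the standard normal probability
print(np.mean((U >= -1) & (U <= 1)))
print(stats.norm.cdf(1) - stats.norm.cdf(-1))    # approx 0.6827
```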

Distribution of sample mean estimator

Bias Variance Trade-off The mean squared error of an estimator,

MSE(\hat{\theta}) = E([\hat{\theta} - \theta]^2)

can be re-expressed as

MSE(\hat{\theta}) = Var(\hat{\theta}) + (B(\hat{\theta}))^2

MSE = VAR + BIAS² Proof

MSE(\hat{\theta}) = E((\hat{\theta} - \theta)^2)
= E(([\hat{\theta} - E(\hat{\theta})] + [E(\hat{\theta}) - \theta])^2)
= E([\hat{\theta} - E(\hat{\theta})]^2) + 2 E([E(\hat{\theta}) - \theta][\hat{\theta} - E(\hat{\theta})]) + E([E(\hat{\theta}) - \theta]^2)
= Var(\hat{\theta}) + 2 E(E(\hat{\theta})[\hat{\theta} - E(\hat{\theta})] - \theta[\hat{\theta} - E(\hat{\theta})]) + (B(\hat{\theta}))^2
= Var(\hat{\theta}) + 2(0 + 0) + (B(\hat{\theta}))^2
= Var(\hat{\theta}) + (B(\hat{\theta}))^2

The cross term vanishes because E(\hat{\theta}) and θ are constants and E(\hat{\theta} - E(\hat{\theta})) = 0.
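A numeric illustration of the decomposition (with made-up numbers), using a deliberately biased shrinkage estimator of a normal mean; in this particular setup the biased estimator also happens to have lower MSE than the sample mean, previewing the trade-off on the next slide.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, reps = 1.0, 4.0, 10, 200_000    # hypothetical setup

Y = rng.normal(mu, sigma, size=(reps, n))
theta_hat = 0.8 * Y.mean(axis=1)              # biased "shrinkage" estimator of mu

mse  = np.mean((theta_hat - mu) ** 2)
var  = theta_hat.var()
bias = theta_hat.mean() - mu

print(mse)                   # approximately var + bias^2
print(var + bias ** 2)
print(sigma ** 2 / n)        # MSE of the unbiased sample mean, for comparison
```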

Trade-off Think of variance as confidence and bias as correctness; the everyday intuitions (largely) apply. Sometimes choosing a biased estimator can result in an overall lower MSE if it exhibits lower variance. Bayesian methods (later in the course) specifically introduce bias.

Estimating Error Term Variance σ² Regression model: the variance of each observation Y_i is σ² (the same as for the error term ɛ_i). Each Y_i comes from a different probability distribution, with a different mean that depends on the level X_i. The deviation of an observation Y_i must therefore be calculated around its own estimated mean.

s² Estimator for σ²

s^2 = MSE = \frac{SSE}{n-2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n-2} = \frac{\sum e_i^2}{n-2}

MSE is an unbiased estimator of σ²: E(MSE) = σ². The sum of squares SSE has n − 2 degrees of freedom associated with it. Cochran's theorem (later in the course) tells us where degrees of freedom come from and how to calculate them.
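A minimal sketch of computing s² from a fitted line, in the same made-up-data style as the earlier snippets; the true error standard deviation is chosen as 1.5, so the estimate should land near σ² = 2.25.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(0, 1.5, size=n)   # hypothetical data with sigma = 1.5

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
e = Y - (b0 + b1 * X)

sse = np.sum(e ** 2)
s2 = sse / (n - 2)           # MSE = SSE / (n - 2), the unbiased estimator of sigma^2
print(s2)                    # should be near 2.25
```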

Normal Error Regression Model No matter how the error terms ɛ_i are distributed, the least squares method provides unbiased point estimators of β_0 and β_1 that also have minimum variance among all unbiased linear estimators. To set up interval estimates and make tests we need to specify the distribution of the ɛ_i. We will assume that the ɛ_i are normally distributed.

Normal Error Regression Model

Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad i = 1, \ldots, n

Y_i is the value of the response variable in the i-th trial; β_0 and β_1 are parameters; X_i is a known constant, the value of the predictor variable in the i-th trial; ɛ_i ∼_iid N(0, σ²) (note this is different: now we know the distribution).

Notational Convention When you see ɛ_i ∼_iid N(0, σ²), it is read as "ɛ_i is distributed identically and independently according to a normal distribution with mean 0 and variance σ²". Examples: θ ∼ Poisson(λ), z ∼ G(θ).

Maximum Likelihood Principle The method of maximum likelihood chooses as estimates those values of the parameters that are most consistent with the sample data.

Likelihood Function If X_i ∼ F(Θ), i = 1, ..., n, then the likelihood function is

L(\{X_i\}_{i=1}^{n}, \Theta) = \prod_{i=1}^{n} F(X_i; \Theta)

Example, N(10, 3) Density, Single Obs.

Example, N(10, 3) Density, Single Obs. Again

Example, N(10, 3) Density, Multiple Obs.
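As a numerical companion to the N(10, 3) density examples above (not part of the original slides), here is a sketch that evaluates the likelihood of a few hypothetical observations under different candidate means; 3 is treated as the variance.

```python
import numpy as np
from scipy import stats

x = np.array([9.1, 10.4, 11.2, 8.7, 10.9])          # hypothetical observations

# Likelihood of the whole sample under N(mu, 3) for a few candidate means
for mu in (8.0, 10.0, 12.0):
    L = np.prod(stats.norm.pdf(x, loc=mu, scale=np.sqrt(3)))
    print(mu, L)     # largest for the candidate nearest the sample mean (about 10.06)
```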

Maximum Likelihood Estimation The likelihood function can be maximized w.r.t. the parameter(s) Θ; doing this, one can arrive at estimators for the parameters.

L(\{X_i\}_{i=1}^{n}, \Theta) = \prod_{i=1}^{n} F(X_i; \Theta)

To do this, find solutions (analytically or by following the gradient) to

\frac{dL(\{X_i\}_{i=1}^{n}, \Theta)}{d\Theta} = 0

Important Trick (Almost) never maximize the likelihood function directly; maximize the log likelihood function instead.

\log(L(\{X_i\}_{i=1}^{n}, \Theta)) = \log\left(\prod_{i=1}^{n} F(X_i; \Theta)\right) = \sum_{i=1}^{n} \log(F(X_i; \Theta))

Quite often the log of the density is easier to work with mathematically.
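A short illustration of why this trick also matters numerically (illustration only): with many observations the raw likelihood underflows to zero in floating point, while the log likelihood remains a perfectly usable finite number.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(10, np.sqrt(3), size=2000)    # hypothetical large sample from N(10, 3)

dens = stats.norm.pdf(x, loc=10, scale=np.sqrt(3))
print(np.prod(dens))                # product of densities underflows to 0.0
print(np.sum(np.log(dens)))         # log likelihood: a finite, workable number
```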

ML Normal Regression Likelihood function:

L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} \, e^{-\frac{1}{2\sigma^2}(Y_i - \beta_0 - \beta_1 X_i)^2} = \frac{1}{(2\pi\sigma^2)^{n/2}} \, e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2}

which, if you maximize (how?) w.r.t. the parameters, gives you...
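One way to answer the "how?" in code rather than analytically (an illustrative sketch, not the slides' derivation): numerically maximize the log likelihood and check that the coefficient estimates agree with the closed-form least squares ones. The data and starting values are made up.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(7)
n = 80
X = rng.uniform(0, 10, size=n)
Y = 2.0 + 0.5 * X + rng.normal(0, 1.5, size=n)      # hypothetical data

def neg_log_lik(params):
    b0, b1, log_sigma = params                      # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(stats.norm.logpdf(Y, loc=b0 + b1 * X, scale=sigma))

res = optimize.minimize(neg_log_lik, x0=np.zeros(3))
b0_ml, b1_ml = res.x[0], res.x[1]

# Closed-form least squares estimates for comparison
b1_ls = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0_ls = Y.mean() - b1_ls * X.mean()
print(b0_ml, b1_ml)     # agree with the least squares values up to optimizer tolerance
print(b0_ls, b1_ls)
```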

Maximum Likelihood Estimator(s)

β_0: b_0, the same as in the least squares case
β_1: b_1, the same as in the least squares case
σ²: \hat{\sigma}^2 = \frac{\sum_i (Y_i - \hat{Y}_i)^2}{n}

Note that the ML estimator of σ² is biased, as s² is unbiased and

s^2 = MSE = \frac{n}{n-2} \hat{\sigma}^2
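A quick simulation sketch of this bias (true values made up): averaged over many simulated data sets, σ̂² comes in low by roughly the factor (n − 2)/n, while s² centers on the true σ².

```python
import numpy as np

rng = np.random.default_rng(8)
n, reps = 15, 50_000
sigma2 = 4.0                                    # hypothetical true error variance
X = rng.uniform(0, 10, size=n)                  # fixed design reused across replications

sse = np.empty(reps)
for r in range(reps):
    Y = 1.0 + 2.0 * X + rng.normal(0, np.sqrt(sigma2), size=n)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    sse[r] = np.sum((Y - (b0 + b1 * X)) ** 2)

print(np.mean(sse / n))          # ML estimator: roughly sigma2 * (n - 2) / n, biased low
print(np.mean(sse / (n - 2)))    # s^2 = MSE: roughly sigma2, unbiased
```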

Comments Least squares minimizes the squared error between the prediction and the true output. The normal distribution is fully characterized by its first two moments (mean and variance). Food for thought: what does the bias in the ML estimator of the error variance mean, and where does it come from?