Simple Regression: Model Setup, Estimation, Inference, Prediction, Model Diagnostics. Multiple Regression: Model Setup and Estimation.


Statistical Computation, Math 475. Jimin Ding, Department of Mathematics, Washington University in St. Louis. www.math.wustl.edu/~jmding/math475/index.html. October 10, 2013. Part IV: Regression and Ridge Regression.

Outline: simple regression (model setup, estimation, inference, prediction, model diagnostics); multiple regression (model setup and estimation); ridge regression.

Examples

Classical regression models deal with a continuous response variable. Let Y denote the response (dependent) variable and X the explanatory (independent) variable (predictor). Examples: X: entrance test scores, Y: grade point average (GPA) by the end of freshman year; X: height, Y: weight; X: education level, Y: income. The covariate X is usually continuous in regression, while categorical covariates are commonly investigated in ANOVA (analysis of variance). But we can also use a regression model to study the relationship between a continuous response and a categorical covariate by creating dummy variables; this is equivalent to ANOVA/ANCOVA/MANOVA to some extent. A small sketch of dummy coding follows below.
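As a small illustration of the dummy-variable idea above, here is a minimal numpy sketch on made-up data (the level names and numbers are illustrative assumptions, not from the lecture): a three-level categorical covariate is coded as an intercept plus two indicator columns, so the regression coefficients recover the group means, as in one-way ANOVA.

```python
import numpy as np

# Hypothetical data: income (Y) and education level (categorical X with 3 levels).
levels = np.array(["HS", "BS", "MS"])
x = np.array(["HS", "BS", "BS", "MS", "HS", "MS", "BS", "MS"])
y = np.array([30.0, 45.0, 50.0, 60.0, 28.0, 65.0, 48.0, 62.0])

# Dummy coding: intercept plus indicators for "BS" and "MS" ("HS" is the baseline).
X = np.column_stack([np.ones(len(x)),
                     (x == "BS").astype(float),
                     (x == "MS").astype(float)])

# Least squares fit; coefficients are the baseline mean and differences from it.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                 # [mean(HS), mean(BS)-mean(HS), mean(MS)-mean(HS)]
print([y[x == lev].mean() for lev in levels])   # group means, for comparison
```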

Model Setup

Simple linear regression with one covariate: Y_i = β_0 + β_1 X_i + e_i, i = 1, ..., n, where Y_i and X_i are the i-th observed response and covariate, e_i is the random error, and β_0 and β_1 are the unknown parameters. Assumptions: 1. E(e_i) = 0 for all i. 2. Var(e_i) = σ² for all i. (Homogeneity) 3. e_i and e_j are independent for any i ≠ j. (Independence) 4. e_i iid ~ N(0, σ²). (Normality)

General Goal of Study

Describe the relationship between the explanatory and response variables. Estimate β_0, β_1, σ². Predict/forecast the response variable for a new predictor value; predict E(Y_new | X_new) = β_0 + β_1 X_new. Inference: test whether the relationship is statistically significant, e.g., find a CI for β_1 to see whether it includes 0, or test H_0: β_1 = 0. Prediction interval for the response variable.

Estimation

Estimator: a function of the data that provides a guess about the value of a parameter of interest. Example: X̄ can be used to estimate µ. Criteria: how do we choose a "good" estimator? The smaller the following quantities are, the better the estimator is: bias E(β̂) − β, variance Var(β̂), and MSE E{(β̂ − β)²}. Question: the distribution of β̂ is usually unknown since the distribution of Y is unknown, so we approximate bias, variance, and MSE by their sample versions.

Least Squares Estimator

To predict Y well in simple linear regression, it is natural to obtain the estimators by minimizing Q = Σ_{i=1}^n [Y_i − (β_0 + β_1 X_i)]², the so-called least squares criterion. The resulting estimators, β̂_0 = Ȳ − β̂_1 X̄ and β̂_1 = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)² = ρ̂_{X,Y} S_Y / S_X, are the least squares estimators (LSE). A numerical sketch follows below.
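A minimal numpy sketch of the closed-form LSE above, using made-up data (the simulated sample and true coefficients are illustrative assumptions):

```python
import numpy as np

# Made-up data for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

x_bar, y_bar = x.mean(), y.mean()

# Closed-form LSE from the least squares criterion.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Equivalent form: sample correlation times S_Y / S_X.
beta1_alt = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)

print(beta0_hat, beta1_hat, beta1_alt)   # beta1_hat and beta1_alt agree
```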

Regression Line and Sums of Squares in the ANOVA Table

[Figure: scatter plot of (x, y) with the fitted regression line, marking a fitted value Ŷ_i and an observation Y_i.]

Fitted regression line: y = β̂_0 + β̂_1 x. Fitted values: Ŷ_i = β̂_0 + β̂_1 X_i. Residuals: ê_i = Y_i − Ŷ_i. SSE (sum of squared errors): Σ_{i=1}^n ê_i². SSTO (total sum of squares): Σ_{i=1}^n (Y_i − Ȳ)². SSR (regression sum of squares): Σ_{i=1}^n (Ŷ_i − Ȳ)². DF: degrees of freedom; the DF of the residuals equals the number of observations minus the number of parameters in the model. MSE: SSE divided by the DF of the error term, i.e., (1/(n − 2)) Σ_{i=1}^n ê_i² = σ̂² (an estimator of σ²). For the LSE, Σ_{i=1}^n ê_i = 0 and Σ_{i=1}^n ê_i² is minimized. (See the numerical sketch below.)

Best Linear Unbiased Estimator (BLUE)

The LSE are unbiased estimators: E(MSE) = E(σ̂²) = σ², E(β̂_0) = β_0, and E(β̂_1) = β_1. Both β̂_0 and β̂_1 are linear functions of the observations Y_1, ..., Y_n. Among all linear unbiased estimators, the LSE has the smallest variance; in other words, it is more precise than any other linear unbiased estimator. Remark: the BLUE property still holds without the normality assumption (4).

Other Types of Estimators

Least absolute deviation estimator (LAD): a robust estimator. Weighted least squares estimator (WLSE): an estimator that adjusts for heterogeneity. Maximum likelihood estimator (MLE): the MLE is based on the likelihood function and hence is only valid under assumption (4).
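A short numpy sketch of the ANOVA quantities above, on the same made-up data as the earlier sketch, verifying the identity SSR + SSE = SSTO and that the LSE residuals sum to zero:

```python
import numpy as np

# Made-up data, as in the earlier sketch.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

n = len(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x          # fitted values
resid = y - y_hat                  # residuals

sse = np.sum(resid ** 2)                   # error sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)      # regression sum of squares
ssto = np.sum((y - y.mean()) ** 2)         # total sum of squares
mse = sse / (n - 2)                        # estimate of sigma^2

print(np.isclose(ssr + sse, ssto))    # ANOVA identity SSR + SSE = SSTO
print(np.isclose(resid.sum(), 0.0))   # residuals sum to zero for the LSE
```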

MLE

Recall the likelihood: L(β_0, β_1, σ²) = Π_{i=1}^n f(Y_i) = Π_{i=1}^n (2πσ²)^{−1/2} exp{ −(Y_i − β_0 − β_1 X_i)² / (2σ²) }. The parameters (β_0, β_1, σ²) that maximize this likelihood function are defined as the MLE of (β_0, β_1, σ²), denoted (β̂_{0,ML}, β̂_{1,ML}, σ̂²_{ML}).

Properties of MLE

In simple linear regression with the normality assumption on the errors, the LSE for the β's are the same as the MLE for the β's (they differ for σ²), so the MLE for the β's are also BLUE. In most models, MLE are: consistent (β̂ → β in probability or a.s.); sufficient (f(Y_1, ..., Y_n | θ̂_ML) does not depend on θ); MVUE (minimum variance unbiased estimator); asymptotically efficient (Var(β̂) reaches the Cramér–Rao lower bound, i.e., minimum variance).

Confidence Interval

Note: β̂_0, β̂_1, σ̂² are functions of the data and so are random variables. As point estimators, they only provide a guess about the true parameters and will change if the data change. We also want a range guess for β_0, β_1, σ², which guarantees that with a certain probability the true parameter will be in the range. For example, if we repeat the experiment 100 times and collect 100 sets of data, 95 out of the 100 guessed ranges will contain the true parameter. Such a range is called a confidence interval. CI for β_0: β̂_0 ± t_{1−α/2, n−2} se(β̂_0); CI for β_1: β̂_1 ± t_{1−α/2, n−2} se(β̂_1).

Standard Errors

Var(β̂_0) = Var(Ȳ − β̂_1 X̄) = σ² (1/n + X̄² / Σ_{i=1}^n (X_i − X̄)²), and Var(β̂_1) = Var( Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)² ) = σ² / Σ_{i=1}^n (X_i − X̄)². But σ² is unknown, so we use the estimator σ̂² = MSE in place of σ² when estimating the standard errors.
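A minimal sketch of the standard errors and confidence intervals above, on the same made-up data (the 95% level and the scipy t quantile are the only ingredients added here):

```python
import numpy as np
from scipy.stats import t

# Made-up data, continuing the earlier sketches.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x.mean()
mse = np.sum((y - beta0 - beta1 * x) ** 2) / (n - 2)   # estimate of sigma^2

se_beta1 = np.sqrt(mse / sxx)
se_beta0 = np.sqrt(mse * (1 / n + x.mean() ** 2 / sxx))

alpha = 0.05
tq = t.ppf(1 - alpha / 2, df=n - 2)
print("95% CI for beta0:", (beta0 - tq * se_beta0, beta0 + tq * se_beta0))
print("95% CI for beta1:", (beta1 - tq * se_beta1, beta1 + tq * se_beta1))
```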

Standard Errors and Distribution of Estimators

Hence se(β̂_0) = sqrt( MSE (1/n + X̄² / Σ_{i=1}^n (X_i − X̄)²) ) and se(β̂_1) = sqrt( MSE / Σ_{i=1}^n (X_i − X̄)² ). Under the normality assumption, (β̂_p − β_p)/se(β̂_p) ~ t_{n−2}; without the normality assumption, (β̂_p − β_p)/se(β̂_p) ~ t_{n−2} approximately, for p = 0, 1.

Hypothesis Test

H_0: β_1 = 0 vs. H_a: β_1 ≠ 0 (whether Y_i depends on X_i or not). Test statistic: t* = (β̂_1 − 0)/se(β̂_1) ~ t_{n−2} under H_0. Two-sided p-value = P(|t_{n−2}| > |t*|).

[Figure: t_29 density curve with the two-sided p-value region shaded in the tails.]

Confidence Interval for the Mean of a New Observation

Let Ŷ_new = β̂_0 + β̂_1 X_new. Then Var(Ŷ_new) = Var(β̂_0 + β̂_1 X_new) = σ² (1/n + (X_new − X̄)² / Σ_{i=1}^n (X_i − X̄)²). Hence a 100(1 − α)% CI for the mean value E(Y_new) = β_0 + β_1 X_new is Ŷ_new ± t_{1−α/2, n−2} sqrt( MSE (1/n + (X_new − X̄)² / Σ_{i=1}^n (X_i − X̄)²) ), which is labeled CLM in SAS output. (A numerical sketch of the test and this interval follows below.)
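A minimal sketch of the t-test and the CLM-style interval above, on the same made-up data (the new predictor value x_new = 1.0 is an arbitrary illustration):

```python
import numpy as np
from scipy.stats import t

# Made-up data, as before.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x.mean()
mse = np.sum((y - beta0 - beta1 * x) ** 2) / (n - 2)

# t-test of H0: beta1 = 0.
t_stat = beta1 / np.sqrt(mse / sxx)
p_value = 2 * (1 - t.cdf(abs(t_stat), df=n - 2))
print("t* =", t_stat, " two-sided p-value =", p_value)

# 95% CI for the mean response at a new X value (CLM-style interval).
x_new = 1.0
y_hat_new = beta0 + beta1 * x_new
se_mean = np.sqrt(mse * (1 / n + (x_new - x.mean()) ** 2 / sxx))
tq = t.ppf(0.975, df=n - 2)
print("CI for E(Y_new):", (y_hat_new - tq * se_mean, y_hat_new + tq * se_mean))
```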

Prediction Interval (PI)

A 100(1 − α)% prediction interval is an interval I such that P(Y_new ∈ I) = 1 − α. Note that Y_new = β_0 + β_1 X_new + e_new is random in the sense that Var(Y_new) = σ² ≠ 0. We can derive Y_new − Ŷ_new ~ N(0, σ² (1 + 1/n + (X_new − X̄)² / Σ_{i=1}^n (X_i − X̄)²)), so that 1 − α = P( |Y_new − Ŷ_new| / sqrt( MSE (1 + 1/n + (X_new − X̄)² / Σ_{i=1}^n (X_i − X̄)²) ) ≤ t_{1−α/2, n−2} ). Hence the 100(1 − α)% PI for Y_new, labeled CLI in SAS output, is Ŷ_new ± t_{1−α/2, n−2} sqrt( MSE (1 + 1/n + (X_new − X̄)² / Σ_{i=1}^n (X_i − X̄)²) ).

Simultaneous Confidence/Prediction Bands

Confidence band for E(Y): unlike the prediction interval and the confidence interval for the mean, this is a simultaneous band for the entire regression line: Ŷ_i ± W sqrt( MSE (1/n + (X_i − X̄)² / Σ_{i=1}^n (X_i − X̄)²) ), where W² = 2F(1 − α; 2, n − 2), for i = 1, ..., n. Prediction band for the Y_i's over the entire region: Ŷ_i ± S (or B) sqrt( MSE (1 + 1/n + (X_i − X̄)² / Σ_{i=1}^n (X_i − X̄)²) ), where S² = gF(1 − α; g, n − 2) (Scheffé type) and B = t(1 − α/(2g), n − 2) (Bonferroni type).

Coefficient of Determination

The coefficient of determination is defined as R² = SSR/SSTO = 1 − SSE/SSTO ∈ [0, 1]. Interpretation: the proportional reduction of total variation associated with using the predictor variable X. The larger R² is, the more of the total variation of Y is explained by X. In simple linear regression, R² is the same as ρ̂². An R² close to 0 does not imply that X and Y are unrelated; it simply means that the linear correlation between X and Y is small.

F-test for Goodness of Fit

H_0: reduced model Y_i = β_0 + e_i vs. H_a: full model Y_i = β_0 + β_1 X_i + e_i. This is equivalent to testing H_0: β_1 = 0. Test statistic: F* = MSR/MSE ~ F(1, n − 2) under H_0, where MSR = SSR/(number of model parameters − 1). (Note that F_{1,df} = t²_{df}.) P-value: P(F_{1, n−2} > F*). (A numerical sketch follows below.)
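A minimal sketch of the prediction interval and the overall F-test above, again on the made-up data, checking numerically that F* equals the square of the t statistic for β_1:

```python
import numpy as np
from scipy.stats import t, f

# Made-up data, as before.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x
sse = np.sum((y - y_hat) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
mse = sse / (n - 2)

# 95% prediction interval at a new X value; the "1 +" term widens it vs. the CLM interval.
x_new = 1.0
y_hat_new = beta0 + beta1 * x_new
se_pred = np.sqrt(mse * (1 + 1 / n + (x_new - x.mean()) ** 2 / sxx))
tq = t.ppf(0.975, df=n - 2)
print("PI:", (y_hat_new - tq * se_pred, y_hat_new + tq * se_pred))

# Overall F-test for goodness of fit; F* equals the square of the t statistic for beta1.
F_stat = (ssr / 1) / mse
print("F* =", F_stat, " p-value =", 1 - f.cdf(F_stat, 1, n - 2))
print("t^2 =", (beta1 / np.sqrt(mse / sxx)) ** 2)
```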

General F-test

If model 1 (the reduced model) is nested within model 2 (the full model), the comparison between the two models can be done by an F-test: F* = [ (SSE(R) − SSE(F)) / (df(R) − df(F)) ] / [ SSE(F) / df(F) ]. The general F-test is commonly used in model selection.

Residual Plots

Residuals against the index of observations: 1. symmetric around 0; 2. constant variability; 3. no serial correlation. Residuals against the predicted values (see the prototype residual plots below for possible problematic patterns). QQ plot/normality plot: checks the normality assumption; under normality, the residuals should lie close to the reference line, i.e., look linear.

Prototype Residual Plots

[Figure: prototype residual plots illustrating typical problematic patterns.]

Outliers and Influential Observations

Cook's distance is defined as D_i = Σ_{j=1}^n (Ŷ_j − Ŷ_{j(i)})² / (p MSE) = [ ê_i² / (p MSE) ] [ h_ii / (1 − h_ii)² ]. The value of Cook's distance for each observation measures the degree to which the predicted values change if that observation is left out of the regression. If an observation has an unusually large Cook's distance, it may be worth further investigation (small influence: D < 0.1; large influence: D > 0.5). Although the first definition requires fitting the regression n times, it can be simplified so that the model needs to be fit only once. Here h_ii is the i-th diagonal element of the hat matrix H = X(X^T X)^{-1} X^T. (A numerical sketch follows below.)
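A minimal sketch of Cook's distance above, on the made-up data, computing the one-fit (leverage) form and checking it against the leave-one-out definition for a single observation:

```python
import numpy as np

# Made-up data, as before; p = 2 parameters (intercept and slope).
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

n, p = len(x), 2
X = np.column_stack([np.ones(n), x])              # design matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix
y_hat = H @ y
resid = y - y_hat
mse = np.sum(resid ** 2) / (n - p)
h = np.diag(H)                                    # leverages h_ii

# Cook's distance via the one-fit formula.
D = (resid ** 2 / (p * mse)) * (h / (1 - h) ** 2)

# Check against the leave-one-out definition for one observation.
i = 0
Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
beta_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)    # fit without observation i
D_i_direct = np.sum((y_hat - X @ beta_i) ** 2) / (p * mse)
print(np.isclose(D[i], D_i_direct), D[:5])
```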

Multiple Regression: Model Setup

Two predictors: Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + e_i. More generally, consider p − 1 predictors X_i1, ..., X_i(p−1): Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ... + β_{p−1} X_i(p−1) + e_i, for i = 1, ..., n. We may write this in matrix form as Y = Xβ + e, where Y = (Y_1, ..., Y_n)^T, β = (β_0, ..., β_{p−1})^T, e = (e_1, ..., e_n)^T ~ N(0, σ² I), and X is the n × p design matrix.

Estimation

The LSE is β̂_LS = (X^T X)^{-1} X^T Y, with Cov(β̂_LS) = σ² (X^T X)^{-1}; the standard errors se(β̂_LS) are the square roots of the diagonal elements of MSE (X^T X)^{-1}. Under the normality assumption, β̂_ML = β̂_LS. Fitted values: Ŷ = X β̂ = X(X^T X)^{-1} X^T Y. Hat matrix: H = X(X^T X)^{-1} X^T. Residuals: ê = Y − Ŷ = (I − H)Y, with Cov(ê) = σ² (I − H).

ANOVA Table

SSE: ê^T ê = Y^T (I − H) Y. SSTO: Σ_{i=1}^n (Y_i − Ȳ)² = Y^T Y − (1/n) Y^T J Y, where J is the n × n matrix with all entries equal to 1. SSR = SSTO − SSE. MSE = SSE/(n − p). MSTO = SSTO/(n − 1). MSR = SSR/(p − 1). Overall F-test: H_0: β_1 = β_2 = ... = β_{p−1} = 0 vs. H_a: not all of these β's are zero; F* = MSR/MSE, compared with F_{1−α; p−1, n−p}. R² = SSR/SSTO = 1 − SSE/SSTO. Adjusted R² adjusts for the number of predictors: (1 − R²_A)/(n − 1) = (1 − R²)/(n − p), i.e., R²_A = 1 − ((n − 1)/(n − p))(1 − R²). (A matrix-form sketch follows below.)

Selection of Covariates

Forward selection: start from no covariates. Backward selection: start from all covariates. Stepwise selection: a combination of backward and forward.
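A minimal matrix-form sketch of the multiple regression quantities above, on made-up data with two predictors (the simulated design and coefficients are illustrative assumptions):

```python
import numpy as np

# Made-up data with two predictors (p = 3 including the intercept).
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

# LSE, hat matrix, fitted values, residuals.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T
y_hat = H @ y
resid = y - y_hat

# ANOVA quantities, R^2, adjusted R^2, and the overall F statistic.
sse = resid @ resid
ssto = np.sum((y - y.mean()) ** 2)
ssr = ssto - sse
mse = sse / (n - p)
r2 = ssr / ssto
r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)
F_stat = (ssr / (p - 1)) / mse
se_beta = np.sqrt(mse * np.diag(XtX_inv))   # standard errors of beta_hat
print(beta_hat, se_beta, r2, r2_adj, F_stat)
```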

When Collinearity Happens

Adding or deleting a predictor changes R² substantially. Type III SSR depends heavily on the other variables in the model. se(β̂_k) is large. Predictors are not significant individually, but the simple regression on each covariate is significant.

Remedies for Collinearity

Center and standardize the predictors, which might be helpful in polynomial regression when some of the predictors are badly scaled. Drop the correlated variables by model selection, when (1) the predictor is not significant and (2) the reduced model after dropping the predictor fits the data nearly as well as the full model. Add new observations (not always possible, e.g., with economic or business data). Use an index of several variables (PCA). Ridge regression (below).

Variance Inflation Factor (VIF)

Let R²_j be the coefficient of determination of X_j regressed on all other predictors (R²_j is the R² of the regression model X_ij = β_0 + β_1 X_i1 + ... + β_{j−1} X_i(j−1) + β_{j+1} X_i(j+1) + ... + e_ij). Define VIF_j = 1/(1 − R²_j). Then VIF_j ≥ 1 since R²_j ≤ 1 for all j. If VIF_j > 10, we usually believe the variable has enough shared variation with the other predictors to cause a collinearity problem. In the standardized regression, Var(β̂_j) = σ² VIF_j. In SAS, the VIF table is reported in PROC REG by adding the VIF option to the MODEL statement. (A VIF computation is sketched below.)

Motivation for Ridge Regression

Recall that β̂_LS = (X^T X)^{-1} X^T Y, with E(β̂_LS) = β and Var(β̂_LS) = σ² (X^T X)^{-1}. When X^T X is nearly singular (its determinant is close to 0), the LSE is unbiased but has large variance, which leads to a large mean squared error of the estimator. The idea of ridge regression is to reduce variance at the cost of increasing bias.

[Figure: density curves over beta contrasting an unbiased, high-variance estimator with a biased, lower-variance one.]
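A minimal sketch of the VIF definition above, on made-up data in which two predictors are deliberately nearly collinear (the data-generating choices are assumptions for illustration only):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of regressing y on X (X includes an intercept column)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Made-up data with two deliberately correlated predictors.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)          # nearly collinear with x1
x3 = rng.normal(size=n)
predictors = np.column_stack([x1, x2, x3])

# VIF_j = 1 / (1 - R^2_j), regressing each predictor on all the others.
for j in range(predictors.shape[1]):
    others = np.delete(predictors, j, axis=1)
    Xj = np.column_stack([np.ones(n), others])
    r2_j = r_squared(Xj, predictors[:, j])
    print(f"VIF_{j + 1} = {1 / (1 - r2_j):.1f}")   # large for x1 and x2, near 1 for x3
```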

Ridge Regression

The ridge regression estimator is β̂_r(b) = (X^T X + bI)^{-1} X^T Y, where b is a constant chosen by the user, referred to as the tuning parameter. As b increases, the bias increases but the variance decreases, and β̂_r(b) → 0 (componentwise). One may choose the tuning parameter b so that the ridge trace (β̂_r(b) plotted against b) gets flat and the VIF_j (plotted against b) drop to around 1. (A numerical sketch of the ridge trace appears at the end of this document.)

SAS Programs

PROC GLM; PROC REG; PROC CATMOD; PROC GENMOD; PROC LOGISTIC; PROC NLIN; PROC PLS; PROC MIXED.

Reading Assignment

Textbook: Chapter 5 and Chapter 9.
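To close, a minimal numpy sketch of the ridge estimator and ridge trace described above, using the same made-up collinear data as the VIF sketch (the penalty grid is an arbitrary illustration, and for simplicity the intercept is penalized along with the slopes, exactly as in the formula on the slide):

```python
import numpy as np

# Made-up collinear data, as in the VIF sketch.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def ridge(X, y, b):
    """Ridge estimator (X'X + bI)^{-1} X'y for tuning parameter b."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + b * np.eye(p), X.T @ y)

# Ridge trace: the coefficients shrink toward 0 as b grows; b = 0 recovers the LSE.
for b in [0.0, 0.1, 1.0, 10.0, 100.0]:
    print(f"b = {b:6.1f}  beta_r(b) = {np.round(ridge(X, y, b), 3)}")
```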