Simple Regression: Model Setup, Estimation, Inference, Prediction, Model Diagnostics. Multiple Regression: Model Setup and Estimation.


Statistical Computation, Math 475. Jimin Ding, Department of Mathematics, Washington University in St. Louis. www.math.wustl.edu/~jmding/math475/index.html. October 10, 2013. Part IV: Regression and Ridge Regression.

Outline: simple regression (model setup, estimation, inference, prediction, model diagnostics); multiple regression (model setup and estimation); ridge regression.

Examples

Classical regression models deal with a continuous response variable. Let Y denote the response (dependent) variable and X the explanatory (independent) variable (predictor). Examples: X: entrance test scores, Y: grade point average (GPA) by the end of freshman year; X: height, Y: weight; X: education level, Y: income. The covariate X is usually continuous in regression, while categorical covariates are commonly investigated in ANOVA (analysis of variance). But we can also use a regression model to study the relationship between a continuous response and a categorical covariate by creating dummy variables; this is equivalent to ANOVA/ANCOVA/MANOVA to some extent. A small sketch of dummy coding follows below.
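As a small illustration of the dummy-variable idea above, here is a minimal numpy sketch on made-up data (the level names and numbers are illustrative assumptions, not from the lecture): a three-level categorical covariate is coded as an intercept plus two indicator columns, so the regression coefficients recover the group means, as in one-way ANOVA.

```python
import numpy as np

# Hypothetical data: income (Y) and education level (categorical X with 3 levels).
levels = np.array(["HS", "BS", "MS"])
x = np.array(["HS", "BS", "BS", "MS", "HS", "MS", "BS", "MS"])
y = np.array([30.0, 45.0, 50.0, 60.0, 28.0, 65.0, 48.0, 62.0])

# Dummy coding: intercept plus indicators for "BS" and "MS" ("HS" is the baseline).
X = np.column_stack([np.ones(len(x)),
                     (x == "BS").astype(float),
                     (x == "MS").astype(float)])

# Least squares fit; coefficients are the baseline mean and differences from it.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                                 # [mean(HS), mean(BS)-mean(HS), mean(MS)-mean(HS)]
print([y[x == lev].mean() for lev in levels])   # group means, for comparison
```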

Model Setup

Simple linear regression with one covariate: Y_i = β_0 + β_1 X_i + e_i, i = 1, ..., n, where Y_i and X_i are the i-th observed response and covariate, e_i is the random error, and β_0 and β_1 are the unknown parameters. Assumptions: 1. E(e_i) = 0 for all i. 2. Var(e_i) = σ² for all i. (Homogeneity) 3. e_i and e_j are independent for any i ≠ j. (Independence) 4. e_i iid ~ N(0, σ²). (Normality)

General Goal of Study

Describe the relationship between the explanatory and response variables. Estimate β_0, β_1, σ². Predict/forecast the response variable for a new predictor value; predict E(Y_new | X_new) = β_0 + β_1 X_new. Inference: test whether the relationship is statistically significant, e.g., find a CI for β_1 to see whether it includes 0, or test H_0: β_1 = 0. Prediction interval for the response variable.

Estimation

Estimator: a function of the data that provides a guess about the value of a parameter of interest. Example: X̄ can be used to estimate µ. Criteria: how do we choose a "good" estimator? The smaller the following quantities are, the better the estimator is: bias E(β̂) − β, variance Var(β̂), and MSE E{(β̂ − β)²}. Question: the distribution of β̂ is usually unknown since the distribution of Y is unknown, so we approximate bias, variance, and MSE by their sample versions.

Least Squares Estimator

To predict Y well in simple linear regression, it is natural to obtain the estimators by minimizing Q = Σ_{i=1}^n [Y_i − (β_0 + β_1 X_i)]², the so-called least squares criterion. The resulting estimators, β̂_0 = Ȳ − β̂_1 X̄ and β̂_1 = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)² = ρ̂_{X,Y} S_Y / S_X, are the least squares estimators (LSE). A numerical sketch follows below.
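A minimal numpy sketch of the closed-form LSE above, using made-up data (the simulated sample and true coefficients are illustrative assumptions):

```python
import numpy as np

# Made-up data for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

x_bar, y_bar = x.mean(), y.mean()

# Closed-form LSE from the least squares criterion.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# Equivalent form: sample correlation times S_Y / S_X.
beta1_alt = np.corrcoef(x, y)[0, 1] * y.std(ddof=1) / x.std(ddof=1)

print(beta0_hat, beta1_hat, beta1_alt)   # beta1_hat and beta1_alt agree
```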

Regression Line and Sums of Squares in the ANOVA Table

[Figure: scatter plot of (x, y) with the fitted regression line, marking a fitted value Ŷ_i and an observation Y_i.]

Fitted regression line: y = β̂_0 + β̂_1 x. Fitted values: Ŷ_i = β̂_0 + β̂_1 X_i. Residuals: ê_i = Y_i − Ŷ_i. SSE (sum of squared errors): Σ_{i=1}^n ê_i². SSTO (total sum of squares): Σ_{i=1}^n (Y_i − Ȳ)². SSR (regression sum of squares): Σ_{i=1}^n (Ŷ_i − Ȳ)². DF: degrees of freedom; the DF of the residuals equals the number of observations minus the number of parameters in the model. MSE: SSE divided by the DF of the error term, i.e., (1/(n − 2)) Σ_{i=1}^n ê_i² = σ̂² (an estimator of σ²). For the LSE, Σ_{i=1}^n ê_i = 0 and Σ_{i=1}^n ê_i² is minimized. (See the numerical sketch below.)

Best Linear Unbiased Estimator (BLUE)

The LSE are unbiased estimators: E(MSE) = E(σ̂²) = σ², E(β̂_0) = β_0, and E(β̂_1) = β_1. Both β̂_0 and β̂_1 are linear functions of the observations Y_1, ..., Y_n. Among all linear unbiased estimators, the LSE has the smallest variance; in other words, it is more precise than any other linear unbiased estimator. Remark: the BLUE property still holds without the normality assumption (4).

Other Types of Estimators

Least absolute deviation estimator (LAD): a robust estimator. Weighted least squares estimator (WLSE): an estimator that adjusts for heterogeneity. Maximum likelihood estimator (MLE): the MLE is based on the likelihood function and hence is only valid under assumption (4).
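A short numpy sketch of the ANOVA quantities above, on the same made-up data as the earlier sketch, verifying the identity SSR + SSE = SSTO and that the LSE residuals sum to zero:

```python
import numpy as np

# Made-up data, as in the earlier sketch.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

n = len(x)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

y_hat = beta0 + beta1 * x          # fitted values
resid = y - y_hat                  # residuals

sse = np.sum(resid ** 2)                   # error sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)      # regression sum of squares
ssto = np.sum((y - y.mean()) ** 2)         # total sum of squares
mse = sse / (n - 2)                        # estimate of sigma^2

print(np.isclose(ssr + sse, ssto))    # ANOVA identity SSR + SSE = SSTO
print(np.isclose(resid.sum(), 0.0))   # residuals sum to zero for the LSE
```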

MLE

Recall the likelihood: L(β_0, β_1, σ²) = Π_{i=1}^n f(Y_i) = Π_{i=1}^n (2πσ²)^{−1/2} exp{ −(Y_i − β_0 − β_1 X_i)² / (2σ²) }. The parameters (β_0, β_1, σ²) that maximize this likelihood function are defined as the MLE of (β_0, β_1, σ²), denoted (β̂_{0,ML}, β̂_{1,ML}, σ̂²_{ML}).

Properties of MLE

In simple linear regression with the normality assumption on the errors, the LSE for the β's are the same as the MLE for the β's (they differ for σ²), so the MLE for the β's are also BLUE. In most models, MLE are: consistent (β̂ → β in probability or a.s.); sufficient (f(Y_1, ..., Y_n | θ̂_ML) does not depend on θ); MVUE (minimum variance unbiased estimator); asymptotically efficient (Var(β̂) reaches the Cramér–Rao lower bound, i.e., minimum variance).

Confidence Interval

Note: β̂_0, β̂_1, σ̂² are functions of the data and so are random variables. As point estimators, they only provide a guess about the true parameters and will change if the data change. We also want a range guess for β_0, β_1, σ², which guarantees that with a certain probability the true parameter will be in the range. For example, if we repeat the experiment 100 times and collect 100 sets of data, 95 out of the 100 guessed ranges will contain the true parameter. Such a range is called a confidence interval. CI for β_0: β̂_0 ± t_{1−α/2, n−2} se(β̂_0); CI for β_1: β̂_1 ± t_{1−α/2, n−2} se(β̂_1).

Standard Errors

Var(β̂_0) = Var(Ȳ − β̂_1 X̄) = σ² (1/n + X̄² / Σ_{i=1}^n (X_i − X̄)²), and Var(β̂_1) = Var( Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^n (X_i − X̄)² ) = σ² / Σ_{i=1}^n (X_i − X̄)². But σ² is unknown, so we use the estimator σ̂² = MSE in place of σ² when estimating the standard errors.
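A minimal sketch of the standard errors and confidence intervals above, on the same made-up data (the 95% level and the scipy t quantile are the only ingredients added here):

```python
import numpy as np
from scipy.stats import t

# Made-up data, continuing the earlier sketches.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x.mean()
mse = np.sum((y - beta0 - beta1 * x) ** 2) / (n - 2)   # estimate of sigma^2

se_beta1 = np.sqrt(mse / sxx)
se_beta0 = np.sqrt(mse * (1 / n + x.mean() ** 2 / sxx))

alpha = 0.05
tq = t.ppf(1 - alpha / 2, df=n - 2)
print("95% CI for beta0:", (beta0 - tq * se_beta0, beta0 + tq * se_beta0))
print("95% CI for beta1:", (beta1 - tq * se_beta1, beta1 + tq * se_beta1))
```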

Standard Errors and Distribution of Estimators

Hence se(β̂_0) = sqrt( MSE (1/n + X̄² / Σ_{i=1}^n (X_i − X̄)²) ) and se(β̂_1) = sqrt( MSE / Σ_{i=1}^n (X_i − X̄)² ). Under the normality assumption, (β̂_p − β_p)/se(β̂_p) ~ t_{n−2}; without the normality assumption, (β̂_p − β_p)/se(β̂_p) ~ t_{n−2} approximately, for p = 0, 1.

Hypothesis Test

H_0: β_1 = 0 vs. H_a: β_1 ≠ 0 (whether Y_i depends on X_i or not). Test statistic: t* = (β̂_1 − 0)/se(β̂_1) ~ t_{n−2} under H_0. Two-sided p-value = P(|t_{n−2}| > |t*|).

[Figure: t_29 density curve with the two-sided p-value region shaded in the tails.]

Confidence Interval for the Mean of a New Observation

Let Ŷ_new = β̂_0 + β̂_1 X_new. Then Var(Ŷ_new) = Var(β̂_0 + β̂_1 X_new) = σ² (1/n + (X_new − X̄)² / Σ_{i=1}^n (X_i − X̄)²). Hence a 100(1 − α)% CI for the mean value E(Y_new) = β_0 + β_1 X_new is Ŷ_new ± t_{1−α/2, n−2} sqrt( MSE (1/n + (X_new − X̄)² / Σ_{i=1}^n (X_i − X̄)²) ), which is labeled CLM in SAS output. (A numerical sketch of the test and this interval follows below.)
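A minimal sketch of the t-test and the CLM-style interval above, on the same made-up data (the new predictor value x_new = 1.0 is an arbitrary illustration):

```python
import numpy as np
from scipy.stats import t

# Made-up data, as before.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x.mean()
mse = np.sum((y - beta0 - beta1 * x) ** 2) / (n - 2)

# t-test of H0: beta1 = 0.
t_stat = beta1 / np.sqrt(mse / sxx)
p_value = 2 * (1 - t.cdf(abs(t_stat), df=n - 2))
print("t* =", t_stat, " two-sided p-value =", p_value)

# 95% CI for the mean response at a new X value (CLM-style interval).
x_new = 1.0
y_hat_new = beta0 + beta1 * x_new
se_mean = np.sqrt(mse * (1 / n + (x_new - x.mean()) ** 2 / sxx))
tq = t.ppf(0.975, df=n - 2)
print("CI for E(Y_new):", (y_hat_new - tq * se_mean, y_hat_new + tq * se_mean))
```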

Prediction Interval (PI)

A 100(1 − α)% prediction interval is an interval I such that P(Y_new ∈ I) = 1 − α. Note that Y_new = β_0 + β_1 X_new + e_new is random in the sense that Var(Y_new) = σ² ≠ 0. We can derive Y_new − Ŷ_new ~ N(0, σ² (1 + 1/n + (X_new − X̄)² / Σ_{i=1}^n (X_i − X̄)²)), so that 1 − α = P( |Y_new − Ŷ_new| / sqrt( MSE (1 + 1/n + (X_new − X̄)² / Σ_{i=1}^n (X_i − X̄)²) ) ≤ t_{1−α/2, n−2} ). Hence the 100(1 − α)% PI for Y_new, labeled CLI in SAS output, is Ŷ_new ± t_{1−α/2, n−2} sqrt( MSE (1 + 1/n + (X_new − X̄)² / Σ_{i=1}^n (X_i − X̄)²) ).

Simultaneous Confidence/Prediction Bands

Confidence band for E(Y): unlike the prediction interval and the confidence interval for the mean, this is a simultaneous band for the entire regression line: Ŷ_i ± W sqrt( MSE (1/n + (X_i − X̄)² / Σ_{i=1}^n (X_i − X̄)²) ), where W² = 2F(1 − α; 2, n − 2), for i = 1, ..., n. Prediction band for the Y_i's over the entire region: Ŷ_i ± S (or B) sqrt( MSE (1 + 1/n + (X_i − X̄)² / Σ_{i=1}^n (X_i − X̄)²) ), where S² = gF(1 − α; g, n − 2) (Scheffé type) and B = t(1 − α/(2g), n − 2) (Bonferroni type).

Coefficient of Determination

The coefficient of determination is defined as R² = SSR/SSTO = 1 − SSE/SSTO ∈ [0, 1]. Interpretation: the proportional reduction of total variation associated with using the predictor variable X. The larger R² is, the more of the total variation of Y is explained by X. In simple linear regression, R² is the same as ρ̂². An R² close to 0 does not imply that X and Y are unrelated; it simply means that the linear correlation between X and Y is small.

F-test for Goodness of Fit

H_0: reduced model Y_i = β_0 + e_i vs. H_a: full model Y_i = β_0 + β_1 X_i + e_i. This is equivalent to testing H_0: β_1 = 0. Test statistic: F* = MSR/MSE ~ F(1, n − 2) under H_0, where MSR = SSR/(number of model parameters − 1). (Note that F_{1,df} = t²_{df}.) P-value: P(F_{1, n−2} > F*). (A numerical sketch follows below.)
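A minimal sketch of the prediction interval and the overall F-test above, again on the made-up data, checking numerically that F* equals the square of the t statistic for β_1:

```python
import numpy as np
from scipy.stats import t, f

# Made-up data, as before.
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

n = len(x)
sxx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x
sse = np.sum((y - y_hat) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
mse = sse / (n - 2)

# 95% prediction interval at a new X value; the "1 +" term widens it vs. the CLM interval.
x_new = 1.0
y_hat_new = beta0 + beta1 * x_new
se_pred = np.sqrt(mse * (1 + 1 / n + (x_new - x.mean()) ** 2 / sxx))
tq = t.ppf(0.975, df=n - 2)
print("PI:", (y_hat_new - tq * se_pred, y_hat_new + tq * se_pred))

# Overall F-test for goodness of fit; F* equals the square of the t statistic for beta1.
F_stat = (ssr / 1) / mse
print("F* =", F_stat, " p-value =", 1 - f.cdf(F_stat, 1, n - 2))
print("t^2 =", (beta1 / np.sqrt(mse / sxx)) ** 2)
```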

General F-test

If model 1 (the reduced model) is nested within model 2 (the full model), the comparison between the two models can be done by an F-test: F* = [ (SSE(R) − SSE(F)) / (df(R) − df(F)) ] / [ SSE(F) / df(F) ]. The general F-test is commonly used in model selection.

Residual Plots

Residuals against the index of observations: 1. symmetric around 0; 2. constant variability; 3. no serial correlation. Residuals against the predicted values (see the prototype residual plots below for possible problematic patterns). QQ plot/normality plot: checks the normality assumption; under normality, the residuals should lie close to the reference line, i.e., look linear.

Prototype Residual Plots

[Figure: prototype residual plots illustrating typical problematic patterns.]

Outliers and Influential Observations

Cook's distance is defined as D_i = Σ_{j=1}^n (Ŷ_j − Ŷ_{j(i)})² / (p MSE) = [ ê_i² / (p MSE) ] [ h_ii / (1 − h_ii)² ]. The value of Cook's distance for each observation measures the degree to which the predicted values change if that observation is left out of the regression. If an observation has an unusually large Cook's distance, it may be worth further investigation (small influence: D < 0.1; large influence: D > 0.5). Although the first definition requires fitting the regression n times, it can be simplified so that the model needs to be fit only once. Here h_ii is the i-th diagonal element of the hat matrix H = X(X^T X)^{-1} X^T. (A numerical sketch follows below.)
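A minimal sketch of Cook's distance above, on the made-up data, computing the one-fit (leverage) form and checking it against the leave-one-out definition for a single observation:

```python
import numpy as np

# Made-up data, as before; p = 2 parameters (intercept and slope).
rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

n, p = len(x), 2
X = np.column_stack([np.ones(n), x])              # design matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix
y_hat = H @ y
resid = y - y_hat
mse = np.sum(resid ** 2) / (n - p)
h = np.diag(H)                                    # leverages h_ii

# Cook's distance via the one-fit formula.
D = (resid ** 2 / (p * mse)) * (h / (1 - h) ** 2)

# Check against the leave-one-out definition for one observation.
i = 0
Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
beta_i = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)    # fit without observation i
D_i_direct = np.sum((y_hat - X @ beta_i) ** 2) / (p * mse)
print(np.isclose(D[i], D_i_direct), D[:5])
```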

Multiple Regression: Model Setup

Two predictors: Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + e_i. More generally, consider p − 1 predictors X_i1, ..., X_i(p−1): Y_i = β_0 + β_1 X_i1 + β_2 X_i2 + ... + β_{p−1} X_i(p−1) + e_i, for i = 1, ..., n. We may write this in matrix form as Y = Xβ + e, where Y = (Y_1, ..., Y_n)^T, β = (β_0, ..., β_{p−1})^T, e = (e_1, ..., e_n)^T ~ N(0, σ² I), and X is the n × p design matrix.

Estimation

The LSE is β̂_LS = (X^T X)^{-1} X^T Y, with Cov(β̂_LS) = σ² (X^T X)^{-1}; the standard errors se(β̂_LS) are the square roots of the diagonal elements of MSE (X^T X)^{-1}. Under the normality assumption, β̂_ML = β̂_LS. Fitted values: Ŷ = X β̂ = X(X^T X)^{-1} X^T Y. Hat matrix: H = X(X^T X)^{-1} X^T. Residuals: ê = Y − Ŷ = (I − H)Y, with Cov(ê) = σ² (I − H).

ANOVA Table

SSE: ê^T ê = Y^T (I − H) Y. SSTO: Σ_{i=1}^n (Y_i − Ȳ)² = Y^T Y − (1/n) Y^T J Y, where J is the n × n matrix with all entries equal to 1. SSR = SSTO − SSE. MSE = SSE/(n − p). MSTO = SSTO/(n − 1). MSR = SSR/(p − 1). Overall F-test: H_0: β_1 = β_2 = ... = β_{p−1} = 0 vs. H_a: not all of these β's are zero; F* = MSR/MSE, compared with F_{1−α; p−1, n−p}. R² = SSR/SSTO = 1 − SSE/SSTO. Adjusted R² adjusts for the number of predictors: (1 − R²_A)/(n − 1) = (1 − R²)/(n − p), i.e., R²_A = 1 − ((n − 1)/(n − p))(1 − R²). (A matrix-form sketch follows below.)

Selection of Covariates

Forward selection: start from no covariates. Backward selection: start from all covariates. Stepwise selection: a combination of backward and forward.
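A minimal matrix-form sketch of the multiple regression quantities above, on made-up data with two predictors (the simulated design and coefficients are illustrative assumptions):

```python
import numpy as np

# Made-up data with two predictors (p = 3 including the intercept).
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

# LSE, hat matrix, fitted values, residuals.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
H = X @ XtX_inv @ X.T
y_hat = H @ y
resid = y - y_hat

# ANOVA quantities, R^2, adjusted R^2, and the overall F statistic.
sse = resid @ resid
ssto = np.sum((y - y.mean()) ** 2)
ssr = ssto - sse
mse = sse / (n - p)
r2 = ssr / ssto
r2_adj = 1 - (n - 1) / (n - p) * (1 - r2)
F_stat = (ssr / (p - 1)) / mse
se_beta = np.sqrt(mse * np.diag(XtX_inv))   # standard errors of beta_hat
print(beta_hat, se_beta, r2, r2_adj, F_stat)
```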

When Collinearity Happens

Adding or deleting a predictor changes R² substantially. Type III SSR depends heavily on the other variables in the model. se(β̂_k) is large. Predictors are not significant individually, but the simple regression on each covariate is significant.

Remedies for Collinearity

Center and standardize the predictors, which might be helpful in polynomial regression when some of the predictors are badly scaled. Drop the correlated variables by model selection, when (1) the predictor is not significant and (2) the reduced model after dropping the predictor fits the data nearly as well as the full model. Add new observations (not always possible, e.g., with economic or business data). Use an index of several variables (PCA). Ridge regression (below).

Variance Inflation Factor (VIF)

Let R²_j be the coefficient of determination of X_j regressed on all other predictors (R²_j is the R² of the regression model X_ij = β_0 + β_1 X_i1 + ... + β_{j−1} X_i(j−1) + β_{j+1} X_i(j+1) + ... + e_ij). Define VIF_j = 1/(1 − R²_j). Then VIF_j ≥ 1 since R²_j ≤ 1 for all j. If VIF_j > 10, we usually believe the variable has enough shared variation with the other predictors to cause a collinearity problem. In the standardized regression, Var(β̂_j) = σ² VIF_j. In SAS, the VIF table is reported in PROC REG by adding the VIF option to the MODEL statement. (A VIF computation is sketched below.)

Motivation for Ridge Regression

Recall that β̂_LS = (X^T X)^{-1} X^T Y, with E(β̂_LS) = β and Var(β̂_LS) = σ² (X^T X)^{-1}. When X^T X is nearly singular (its determinant is close to 0), the LSE is unbiased but has large variance, which leads to a large mean squared error of the estimator. The idea of ridge regression is to reduce variance at the cost of increasing bias.

[Figure: density curves over beta contrasting an unbiased, high-variance estimator with a biased, lower-variance one.]
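A minimal sketch of the VIF definition above, on made-up data in which two predictors are deliberately nearly collinear (the data-generating choices are assumptions for illustration only):

```python
import numpy as np

def r_squared(X, y):
    """R^2 of regressing y on X (X includes an intercept column)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Made-up data with two deliberately correlated predictors.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)          # nearly collinear with x1
x3 = rng.normal(size=n)
predictors = np.column_stack([x1, x2, x3])

# VIF_j = 1 / (1 - R^2_j), regressing each predictor on all the others.
for j in range(predictors.shape[1]):
    others = np.delete(predictors, j, axis=1)
    Xj = np.column_stack([np.ones(n), others])
    r2_j = r_squared(Xj, predictors[:, j])
    print(f"VIF_{j + 1} = {1 / (1 - r2_j):.1f}")   # large for x1 and x2, near 1 for x3
```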

Ridge Regression

The ridge regression estimator is β̂_r(b) = (X^T X + bI)^{-1} X^T Y, where b is a constant chosen by the user, referred to as the tuning parameter. As b increases, the bias increases but the variance decreases, and β̂_r(b) → 0 (componentwise). One may choose the tuning parameter b so that the ridge trace (β̂_r(b) plotted against b) gets flat and the VIF_j (plotted against b) drop to around 1. (A numerical sketch of the ridge trace appears at the end of this document.)

SAS Programs

PROC GLM; PROC REG; PROC CATMOD; PROC GENMOD; PROC LOGISTIC; PROC NLIN; PROC PLS; PROC MIXED.

Reading Assignment

Textbook: Chapter 5 and Chapter 9.
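To close, a minimal numpy sketch of the ridge estimator and ridge trace described above, using the same made-up collinear data as the VIF sketch (the penalty grid is an arbitrary illustration, and for simplicity the intercept is penalized along with the slopes, exactly as in the formula on the slide):

```python
import numpy as np

# Made-up collinear data, as in the VIF sketch.
rng = np.random.default_rng(2)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

def ridge(X, y, b):
    """Ridge estimator (X'X + bI)^{-1} X'y for tuning parameter b."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + b * np.eye(p), X.T @ y)

# Ridge trace: the coefficients shrink toward 0 as b grows; b = 0 recovers the LSE.
for b in [0.0, 0.1, 1.0, 10.0, 100.0]:
    print(f"b = {b:6.1f}  beta_r(b) = {np.round(ridge(X, y, b), 3)}")
```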