Lecture One: A Quick Review/Overview on Regular Linear Regression Models


Outline

The topics to be covered include:

- Model Specification
- Estimation (LS estimators and MLEs)
- Hypothesis Testing and Model Diagnostics (t-test, F-test and ANOVA table, residuals, outliers, other diagnostic tools)

1 Model Specification

Basics

Statistical regression models are tools used to study the (non-deterministic) statistical relationship between one variable and other variables. A statistical relationship describes general patterns between the variables of interest. Unlike models in some other fields, the pattern that describes a statistical relationship is not deterministic. For example, a son's height is related to his father's and mother's heights, but one can never deterministically tell the son's height based on either his father's or mother's height, or both.

Linear model formulation:

    Response (y) = linear function in covariates (x) + random error

[Insert Figure 1.1 here]

Two kinds of variables appear in a regression model:

- Response variable or dependent variable (often denoted as y). This variable is treated as a random variable in regression. It is usually required to take numerical, often normal-like, values in a linear regression model.

- Covariate variables, independent variables, or explanatory variables (often denoted as x). These variables are treated as fixed and non-random in a linear regression model. (Sometimes statisticians also view the covariate variables as random variables, but in that case one has to specify the regression model through conditional probability, conditioning on the covariates. Also, "independent" in "independent variable" has nothing to do with the concept of independent random variables.) One justification comes from experimental design, where the covariates or independent variables take pre-determined values. The type of a covariate variable can be numeric (not necessarily normal-like), categorical (e.g., answers yes/no, blood types A/B/O/AB, ratings excellent/good/fair/bad, etc.), or mixed.

All covariate variables are categorical ⇒ ANOVA model.

Mixed type of covariate variables (both numerical and categorical) ⇒ ANCOVA model.

Linear Regression Models

The simple linear regression model is the model where we have only a single covariate variable:

    y_i = β_0 + β_1 x_i + ε_i,  for i = 1, 2, ..., n,

where β = (β_0, β_1)^T are the unknown regression parameters and the ε's are random errors. Common assumptions on the ε's are one of the following:

- The random errors ε_1, ..., ε_n are uncorrelated with mean zero and common variance σ^2 (an unknown parameter).

- (Stronger) The random errors ε_1, ..., ε_n are independent and identically distributed (i.i.d.) random variables from N(0, σ^2).

β_0 is known as the intercept and β_1 is the slope parameter. When the covariate x is numerical, one can interpret β_1 as the expected/average increase in the response y per unit increase in the covariate x. When the covariate is categorical (e.g., in ANOVA models), β_1 can be interpreted as the expected/average change in the response due to the choice of the category.

[Bring back Figure 1.1 here]

The multiple linear regression model (some books also call it the general linear regression model) is the case where we have multiple covariate variables. In matrix form, it is

    y = Xβ + ε,

where

    y = (y_1, y_2, ..., y_n)^T,  ε = (ε_1, ε_2, ..., ε_n)^T,  β = (β_0, β_1, ..., β_p)^T,

and X is the n × (p+1) design matrix whose ith row is (1, x_i1, x_i2, ..., x_ip). The assumptions on the ε's are the same as before, and the unknown parameters are the regression parameters β and the error variance σ^2.
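As a quick illustration of the matrix form y = Xβ + ε, the pieces can be set up directly in numpy. This is a toy sketch with made-up numbers, not the lecture's SAS example:

```python
import numpy as np

# Toy illustration of the matrix form y = X beta + epsilon.
# All numbers here are invented for demonstration.
rng = np.random.default_rng(0)
n, p = 5, 2                            # n observations, p covariates
x = rng.normal(size=(n, p))            # covariates, treated as fixed
X = np.column_stack([np.ones(n), x])   # design matrix: leading column of 1's
beta = np.array([1.0, 2.0, -0.5])      # (beta_0, beta_1, beta_2)^T
eps = rng.normal(scale=0.3, size=n)    # random errors, iid N(0, 0.09)
y = X @ beta + eps                     # response vector
```

Note that X has p + 1 columns because the intercept gets its own column of ones.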

2 Estimation

We need to estimate the unknown quantities/parameters in the model from data.

Least Squares (LS) Estimators

The least squares (LS) estimators of β are the values (in the parameter space) that minimize the sum of squared distances between the responses and the regression line:

    β̂_LS = argmin_β Q(β),  or equivalently β̂_LS solves the equations ∂Q(β)/∂β = 0,

where

    Q(β) = Σ_{i=1}^n {y_i − (β_0 + β_1 x_i1 + ... + β_p x_ip)}^2 = (y − Xβ)^T (y − Xβ).

In particular,

    β̂_LS = (X^T X)^{-1} X^T y.

[Insert Figure 1.1b here]

The LS estimator of σ^2 is

    σ̂^2_LS = {1/(n − (p+1))} Σ_{i=1}^n (y_i − ŷ_i)^2 = {1/(n − (p+1))} (y − Xβ̂_LS)^T (y − Xβ̂_LS).

From now on, we are going to use s^2 to denote this σ̂^2_LS.

Maximum Likelihood Estimators (MLEs)

When the random errors ε's are normally distributed, the likelihood function of the observed responses y = (y_1, y_2, ..., y_n)^T is

    L(β, σ^2 | y) = Π_{i=1}^n f(y_i | β, σ^2) = Π_{i=1}^n (2πσ^2)^{-1/2} exp{−(y_i − μ_i)^2/(2σ^2)} = (2πσ^2)^{-n/2} exp{−Σ_{i=1}^n (y_i − μ_i)^2/(2σ^2)},

where μ_i = β_0 + β_1 x_i1 + β_2 x_i2 + ... + β_p x_ip.

The MLEs of β and σ^2 maximize the likelihood function, i.e.,

    (β̂_MLE, σ̂^2_MLE) = argmax_{β,σ} L(β, σ^2 | y).
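The closed-form LS estimator above is a one-liner in numpy. This is a minimal sketch on simulated data (parameter values chosen arbitrarily), not the lecture's worked example:

```python
import numpy as np

# Closed-form LS estimator and the variance estimators, on simulated data.
rng = np.random.default_rng(1)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# beta_hat_LS = (X^T X)^{-1} X^T y; solve() avoids forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - (p + 1))   # LS estimator of sigma^2, divides by n-(p+1)
sigma2_mle = resid @ resid / n       # MLE of sigma^2, divides by n instead
```

The two variance estimators differ only in the divisor, matching the relation σ̂²_MLE = {(n − (p+1))/n} σ̂²_LS given below.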

In particular,

    β̂_MLE = (X^T X)^{-1} X^T y = β̂_LS,  and  σ̂^2_MLE = (1/n) (y − Xβ̂_LS)^T (y − Xβ̂_LS) = {(n − (p+1))/n} σ̂^2_LS.

From now on, we are going to use β̂ to denote either β̂_MLE or β̂_LS.

[Insert Figure 1.2's here (SAS code and output)]

Some (theoretical/nice) properties of the estimators

- When the random errors ε's are normally distributed, one can prove that β̂ ~ N(β, σ^2 (X^T X)^{-1}) and (n − (p+1)) s^2/σ^2 ~ χ^2_{n−(p+1)}. In addition, β̂ and s^2 are independent.

- From this, we know that both β̂ and σ̂^2_LS = s^2 are unbiased estimators, i.e., E β̂ = β and E s^2 = σ^2.

- (Gauss-Markov Theorem) The estimators β̂ = (X^T X)^{-1} X^T y are the best linear unbiased estimators (BLUEs) of β.
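The unbiasedness claim E β̂ = β can be checked informally by Monte Carlo. This is an illustration rather than a proof; the design matrix and parameter values below are arbitrary:

```python
import numpy as np

# Average the LS estimates over many simulated data sets; the average
# should be close to the true beta, illustrating E[beta_hat] = beta.
rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
beta = np.array([2.0, -1.0])
reps = 2000
estimates = np.empty((reps, 2))
for r in range(reps):
    y = X @ beta + rng.normal(scale=0.5, size=n)       # fresh errors each rep
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)   # refit
mean_est = estimates.mean(axis=0)                      # close to (2, -1)
```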

3 Hypothesis Testing and Diagnostics

Hypothesis Testing

Student t-test for the jth regression parameter β_j. One often wants to test whether the jth covariate has a significant contribution to the response or not. In mathematical terms, the hypotheses are H_0: β_j = 0 versus H_1: β_j ≠ 0. To formally conduct the test, the t-statistic

    T_j = β̂_j / se(β̂_j)

is compared to the Student t-distribution with n − (p+1) degrees of freedom. When the absolute value |T_j| is large (corresponding to a small p-value), we reject the null hypothesis H_0: β_j = 0.

[Bring back Figure 1.2 here (SAS output)]

F-test for the regression line. To test whether the regression is significant or not (i.e., whether any of the slope parameters are significantly different from zero), one turns to the F-test and the analysis of variance table (ANOVA table). In mathematical terms, the hypotheses are H_0: β_1 = β_2 = ... = β_p = 0 versus H_1: at least one of the β's is not equal to zero. To formally conduct the test, the F-statistic

    F = {Σ_{i=1}^n (ŷ_i − ȳ)^2 / p} / {Σ_{i=1}^n (y_i − ŷ_i)^2 / (n − p − 1)}

is compared to the F-distribution with degrees of freedom p and n − (p+1). If the F-statistic F is large (corresponding to a small p-value), we reject the null hypothesis H_0: β_1 = β_2 = ... = β_p = 0. The results are usually summarized in an ANOVA table.

[Bring back Figure 1.2 here (SAS output)]

The quantity Σ_{i=1}^n (y_i − ȳ)^2 is known as the total sum of squares (SSTO), Σ_{i=1}^n (y_i − ŷ_i)^2 is known as the sum of squares due to error, or the residual sum of squares (SSE), and Σ_{i=1}^n (ŷ_i − ȳ)^2 is known as the sum of squares due to regression (SSR). Here, ȳ is the sample mean of the y_i's.

One can prove the equation

    SSTO = SSR + SSE.   (1)

Also, SSE/σ^2 ~ χ^2_{n−p−1}; under H_0, SSR/σ^2 ~ χ^2_p and SSTO/σ^2 ~ χ^2_{n−1}; and SSR and SSE are independent.
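The sums of squares and the overall F-statistic can be computed directly, and the identity SSTO = SSR + SSE in equation (1) holds numerically. A sketch on hypothetical simulated data:

```python
import numpy as np

# ANOVA decomposition and the overall F-statistic on simulated data.
rng = np.random.default_rng(3)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 1.5, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
ssto = np.sum((y - y.mean()) ** 2)      # total sum of squares
sse = np.sum((y - y_hat) ** 2)          # residual sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares

# This is compared to the F(p, n - (p+1)) distribution to get a p-value.
F = (ssr / p) / (sse / (n - p - 1))
```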

The ratio

    R^2 = SSR / SSTO

is called the coefficient of multiple determination. The closer it is to 1, the larger the proportion of variability explained by the regression.

[Bring back Figure 1.2 here (SAS output)]

F-test for nested models. To answer the question whether a subset of the regression parameters is significant or not in a linear regression model, we use the F-test for nested models. Without loss of generality, assume that we want to test whether the last p − q (q < p) parameters are zero or not, that is, H_0: β_{q+1} = ... = β_p = 0 versus H_1: at least one of these p − q parameters is not equal to 0. Under the null hypothesis H_0, the regression model has q covariates and we call this model the reduced model (R); the model under H_1 has p covariates and we call it the full model (F). Under each model, we have the sum of squares (SS) equation (1). To formally conduct the test, the F-statistic

    F = [{SSR(F) − SSR(R)}/(p − q)] / [SSE(F)/(n − p − 1)]

is compared to the F-distribution with degrees of freedom p − q and n − (p+1). If the F-statistic F is large (corresponding to a small p-value), we reject the null hypothesis H_0: β_{q+1} = ... = β_p = 0. The results are sometimes summarized in an ANOVA table.

Diagnostics for Influential Points

The matrix H = X (X^T X)^{-1} X^T is known as the hat matrix. The name comes from the equation ŷ = Xβ̂ = X (X^T X)^{-1} X^T y = H y.

Denote by h_ii the ith diagonal element of H. All 0 ≤ h_ii < 1 and Σ_{i=1}^n h_ii = p + 1.

We call observations high leverage points if their corresponding h_ii values are very large (close to 1). High leverage points are also called x-outliers (outliers in the covariate space).

If h_ii is large (close to 1) ⇒ var(e_i) = σ^2 (1 − h_ii) is small (here, the residual e_i = y_i − ŷ_i) ⇒ the fitted regression line must be close to y_i ⇒ the location of the ith observation is important and very influential to the fitted regression line!
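The hat-matrix diagonals and the heuristic leverage rule h_ii > 2(p+1)/n (stated below) can be sketched as follows, with one deliberately extreme x-value planted in hypothetical data:

```python
import numpy as np

# Hat matrix H = X (X^T X)^{-1} X^T and its diagonal (leverage) values.
rng = np.random.default_rng(4)
n, p = 20, 1
x = rng.normal(size=n)
x[0] = 8.0                              # plant an x-outlier
X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                          # leverage of each observation

# trace(H) = p + 1, and the planted point gets leverage close to 1.
flagged = np.where(h > 2 * (p + 1) / n)[0]   # heuristic rule from the notes
```

Observation 0 is flagged as a high leverage point; the other h_ii values stay near their average (p+1)/n.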

Heuristic rule: any observation with h_ii > 2(p+1)/n is considered a high leverage point.

Residuals and Deletion Diagnostics

The residual e_i = y_i − ŷ_i has variance (1 − h_ii) σ^2. A standardized form of the residual (called the standardized residual, or internally studentized residual) is

    r_i = e_i / sqrt{(1 − h_ii) s^2}.

- One objection to this usual definition of residuals in diagnostics is that it does not take account of the influence an outlier has on the model fitting.

One way to take account of the influence of an outlier is to re-fit the model with the offending observation deleted. This leads to the deletion residual

    e_{i(i)} = y_i − ŷ_{i(i)}.

Here, the subscript (i) is taken (here and subsequently) to mean that the model has been re-fitted without the ith observation. For example, ŷ_{i(i)} means the predicted value of y_i based on the model fit in which observation i is omitted.

Remark: one can prove that e_{i(i)} = e_i / (1 − h_ii). So, in fact, we do not need to repeat the entire analysis to compute these deletion residuals.

The variance of the deletion residual e_{i(i)} is var(e_{i(i)}) = σ^2 / (1 − h_ii). A standardized form of the deletion residual (called the studentized residual, or externally studentized residual) is

    t_i = e_{i(i)} / sqrt{s^2_(i) / (1 − h_ii)} = ... = e_i {(n − p − 2) / [(1 − h_ii)(n − p − 1) s^2 − e_i^2]}^{1/2}.

The t_i value is often compared to the quantiles of the Student t_{n−p−2} distribution. If the absolute value |t_i| is larger than the 1 − α/2 percentile of the t-distribution, t_{n−p−2}(1 − α/2), the ith observation is a potential outlier (at level α). Sometimes, people prefer a more conservative rule with a Bonferroni adjustment, in which α is replaced by α/n.
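The shortcut e_{i(i)} = e_i/(1 − h_ii) can be verified numerically by actually refitting without observation i. A toy sketch (the index i = 3 is arbitrary):

```python
import numpy as np

# Verify the deletion-residual identity e_{i(i)} = e_i / (1 - h_ii)
# by brute-force refitting without observation i.
rng = np.random.default_rng(5)
n = 15
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat                          # ordinary residuals
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

i = 3
keep = np.arange(n) != i
b_del = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
e_del = y[i] - X[i] @ b_del          # deletion residual, by refitting
shortcut = e[i] / (1 - h[i])         # same quantity, no refit needed
```

The two values agree to machine precision, which is exactly why the remark says the entire analysis need not be repeated.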

[Bring back Figure 1.2 here (SAS output)]

Measures of Influence

DFBETAS measures the effect of deleting a single observation (say, the ith observation) on the estimator of a particular regression parameter (say, the kth parameter β_k):

    (DFBETAS)_{k(i)} = (β̂_k − β̂_{k(i)}) / (s_(i) sqrt{c_kk}),

where c_kk is the kth diagonal entry of (X^T X)^{-1}.

- Remark: the key part is the difference of the betas in the numerator; the denominator is a sort of standardization (note, se(β̂_k) = σ sqrt{c_kk}).

The interpretation is that if this is large, then the ith observation has an undue influence on the kth parameter estimate.

Heuristic rule: flag any value of |DFBETAS| that is bigger than 1 for a small data set, or bigger than 2/sqrt{n} for a large data set.

DFFITS measures how the fitted value of y_i is affected by deleting observation i from the fit:

    (DFFITS)_i = (ŷ_i − ŷ_{i(i)}) / (s_(i) sqrt{h_ii}) = t_i sqrt{h_ii / (1 − h_ii)}.

It can be viewed to some degree as a combined measure of influence that takes into account high leverage as well as the size of the residual.

Heuristic rule: one recommendation is to consider an observation influential if |DFFITS| is greater than 1 in the case of small data sets, or greater than 2 sqrt{p/n} for large data sets.

Cook's Distance is intended as an overall measure of the influence of the ith observation on all parameter estimates:

    D_i = (β̂ − β̂_(i))^T X^T X (β̂ − β̂_(i)) / {(p+1) s^2} = {e_i^2 / ((p+1) s^2)} {h_ii / (1 − h_ii)^2}.

Heuristic rule: a suggested interpretation is to identify the ith observation as influential if D_i is greater than the 10% quantile of the F_{p+1, n−(p+1)}

distribution, and highly influential if it is greater than the 50% quantile of the F_{p+1, n−(p+1)} distribution.

[Bring back Figure 1.2 here (SAS output)]

Graphical Methods

The regular residual plots. Useful residual plots include:

- For fixed j (a fixed covariate variable), plot the residuals e_i against the jth covariate values x_ij. This serves as a check on the linearity of the relationship.

- Plot the residuals e_i against any x-variables omitted from the model. This serves as a check on whether the omitted variable should indeed be in the model or not.

- Plot the residuals e_i against the fitted values ŷ_i. This helps to detect ways in which the model changes with the overall level of the process.

The general idea is that any systematic variability detectable in the plot potentially represents some feature of the data not captured by the current model. If the model is a good fit, then the residuals should look totally random.

Atkinson (1985) proposed the use of a half-normal plot, in which the absolute values of the deletion residuals are ordered and the kth largest absolute residual is plotted against z((k + n − 1/8)/(2n + 1/2)). Here, z(α) is the α-percentile of the standard normal distribution.

The half-normal plot can be used to detect influential points or potential outliers. To identify outlying residuals, Atkinson suggested a simulation technique which can produce a simulated envelope. This envelope constitutes a band such that the plotted residuals are likely to fall within the band if the fitted model is correct. Some details of the simulation are:

Step 1. For each of the n cases, generate a new y_i from N(ŷ_i, s^2). Fit a linear regression to the new data set of size n, with the covariates keeping their original values. Order the absolute deletion residuals in ascending order.

Step 2. Repeat Step 1 eighteen times (19 times in total).

Step 3. For each k, k = 1, 2, ..., n, assemble the kth smallest absolute residuals from the 19 groups and determine the minimum value, the mean, and the maximum value of these 19 kth smallest absolute residuals.

Step 4. Plot these minimum, mean, and maximum values against z((k + n − 1/8)/(2n + 1/2)) on the half-normal probability plot for the original data and connect the points by straight lines.

The partial residual plot is useful for identifying the nature of the relationship for an independent variable under consideration, say the kth covariate x_k = (x_1k, x_2k, ..., x_nk)^T, for addition to the regression model. Define the partial residuals for the kth covariate as

    e_i^[k] = e_i + β̂_k x_ik,  for i = 1, 2, ..., n.

The partial residual plot for the kth covariate plots e_i^[k] against x_ik. If the response y_i is linearly related to the kth covariate x_ik, the points should lie more or less around a straight line.

Other Topics (not going into details)

- Transformations of variables: on the responses (e.g., the Box-Cox transformation and variance-stabilizing transformations), and on the covariates (e.g., the Box-Tidwell transformation)
- Model selection criteria (e.g., C_p, AIC, BIC) and stepwise regressions
- Robust regressions (software available only in R or S-Plus)
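Returning to Atkinson's simulated envelope, Steps 1-4 above can be sketched in numpy. Plotting is omitted; only the envelope band and the half-normal plotting positions are computed, on hypothetical data:

```python
import numpy as np
from statistics import NormalDist

def sorted_abs_deletion_resids(X, y):
    """Absolute deletion residuals |e_i / (1 - h_ii)|, in ascending order."""
    A = np.linalg.solve(X.T @ X, X.T)   # (X^T X)^{-1} X^T
    h = np.diag(X @ A)                  # leverages
    e = y - X @ (A @ y)                 # ordinary residuals
    return np.sort(np.abs(e / (1 - h)))

rng = np.random.default_rng(7)
n = 30
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(size=n)

# Fit once to get y_hat and s^2 for the simulation in Step 1.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
s2 = np.sum((y - y_hat) ** 2) / (n - 2)

# Steps 1-3: 19 simulated data sets, each refit with the original covariates,
# then take min / mean / max of the ordered absolute deletion residuals.
sims = np.array([
    sorted_abs_deletion_resids(X, y_hat + rng.normal(scale=np.sqrt(s2), size=n))
    for _ in range(19)
])
lo, mid, hi = sims.min(axis=0), sims.mean(axis=0), sims.max(axis=0)

# Step 4: half-normal plotting positions z((k + n - 1/8) / (2n + 1/2)).
z = np.array([NormalDist().inv_cdf((k + n - 0.125) / (2 * n + 0.5))
              for k in range(1, n + 1)])
```

The arrays (z, lo), (z, mid), (z, hi) are what Step 4 would plot as the envelope, alongside the ordered absolute deletion residuals of the original data.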

4 References

Atkinson, A. C. (1985). Plots, Transformations and Regression. Oxford University Press, Oxford.

Neter, J., Kutner, M., Nachtsheim, C., and Wasserman, W. (1996). Applied Linear Regression Models, 3rd edition. Irwin, Chicago.


More information

Correlation and Regression

Correlation and Regression Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

11 Hypothesis Testing

11 Hypothesis Testing 28 11 Hypothesis Testing 111 Introduction Suppose we want to test the hypothesis: H : A q p β p 1 q 1 In terms of the rows of A this can be written as a 1 a q β, ie a i β for each row of A (here a i denotes

More information

Ma 3/103: Lecture 24 Linear Regression I: Estimation

Ma 3/103: Lecture 24 Linear Regression I: Estimation Ma 3/103: Lecture 24 Linear Regression I: Estimation March 3, 2017 KC Border Linear Regression I March 3, 2017 1 / 32 Regression analysis Regression analysis Estimate and test E(Y X) = f (X). f is the

More information

Simple and Multiple Linear Regression

Simple and Multiple Linear Regression Sta. 113 Chapter 12 and 13 of Devore March 12, 2010 Table of contents 1 Simple Linear Regression 2 Model Simple Linear Regression A simple linear regression model is given by Y = β 0 + β 1 x + ɛ where

More information

Statistical Modelling in Stata 5: Linear Models

Statistical Modelling in Stata 5: Linear Models Statistical Modelling in Stata 5: Linear Models Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 07/11/2017 Structure This Week What is a linear model? How good is my model? Does

More information

Diagnostics for Linear Models With Functional Responses

Diagnostics for Linear Models With Functional Responses Diagnostics for Linear Models With Functional Responses Qing Shen Edmunds.com Inc. 2401 Colorado Ave., Suite 250 Santa Monica, CA 90404 (shenqing26@hotmail.com) Hongquan Xu Department of Statistics University

More information

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B Simple Linear Regression 35 Problems 1 Consider a set of data (x i, y i ), i =1, 2,,n, and the following two regression models: y i = β 0 + β 1 x i + ε, (i =1, 2,,n), Model A y i = γ 0 + γ 1 x i + γ 2

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

Chapter 12 - Lecture 2 Inferences about regression coefficient

Chapter 12 - Lecture 2 Inferences about regression coefficient Chapter 12 - Lecture 2 Inferences about regression coefficient April 19th, 2010 Facts about slope Test Statistic Confidence interval Hypothesis testing Test using ANOVA Table Facts about slope In previous

More information

Inference for Regression

Inference for Regression Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu

More information

Correlation in Linear Regression

Correlation in Linear Regression Vrije Universiteit Amsterdam Research Paper Correlation in Linear Regression Author: Yura Perugachi-Diaz Student nr.: 2566305 Supervisor: Dr. Bartek Knapik May 29, 2017 Faculty of Sciences Research Paper

More information

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007

STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 LAST NAME: SOLUTIONS FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 302 STA 1001 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator.

More information

Multivariate Linear Regression Models

Multivariate Linear Regression Models Multivariate Linear Regression Models Regression analysis is used to predict the value of one or more responses from a set of predictors. It can also be used to estimate the linear association between

More information

6. Multiple Linear Regression

6. Multiple Linear Regression 6. Multiple Linear Regression SLR: 1 predictor X, MLR: more than 1 predictor Example data set: Y i = #points scored by UF football team in game i X i1 = #games won by opponent in their last 10 games X

More information

Chapter 8: Hypothesis Testing Lecture 9: Likelihood ratio tests

Chapter 8: Hypothesis Testing Lecture 9: Likelihood ratio tests Chapter 8: Hypothesis Testing Lecture 9: Likelihood ratio tests Throughout this chapter we consider a sample X taken from a population indexed by θ Θ R k. Instead of estimating the unknown parameter, we

More information

The Masking and Swamping Effects Using the Planted Mean-Shift Outliers Models

The Masking and Swamping Effects Using the Planted Mean-Shift Outliers Models Int. J. Contemp. Math. Sciences, Vol. 2, 2007, no. 7, 297-307 The Masking and Swamping Effects Using the Planted Mean-Shift Outliers Models Jung-Tsung Chiang Department of Business Administration Ling

More information

Chapter 6 Multiple Regression

Chapter 6 Multiple Regression STAT 525 FALL 2018 Chapter 6 Multiple Regression Professor Min Zhang The Data and Model Still have single response variable Y Now have multiple explanatory variables Examples: Blood Pressure vs Age, Weight,

More information

Linear regression. We have that the estimated mean in linear regression is. ˆµ Y X=x = ˆβ 0 + ˆβ 1 x. The standard error of ˆµ Y X=x is.

Linear regression. We have that the estimated mean in linear regression is. ˆµ Y X=x = ˆβ 0 + ˆβ 1 x. The standard error of ˆµ Y X=x is. Linear regression We have that the estimated mean in linear regression is The standard error of ˆµ Y X=x is where x = 1 n s.e.(ˆµ Y X=x ) = σ ˆµ Y X=x = ˆβ 0 + ˆβ 1 x. 1 n + (x x)2 i (x i x) 2 i x i. The

More information

Chapter 12: Multiple Linear Regression

Chapter 12: Multiple Linear Regression Chapter 12: Multiple Linear Regression Seungchul Baek Department of Statistics, University of South Carolina STAT 509: Statistics for Engineers 1 / 55 Introduction A regression model can be expressed as

More information

Scatter plot of data from the study. Linear Regression

Scatter plot of data from the study. Linear Regression 1 2 Linear Regression Scatter plot of data from the study. Consider a study to relate birthweight to the estriol level of pregnant women. The data is below. i Weight (g / 100) i Weight (g / 100) 1 7 25

More information

Categorical Predictor Variables

Categorical Predictor Variables Categorical Predictor Variables We often wish to use categorical (or qualitative) variables as covariates in a regression model. For binary variables (taking on only 2 values, e.g. sex), it is relatively

More information

STAT763: Applied Regression Analysis. Multiple linear regression. 4.4 Hypothesis testing

STAT763: Applied Regression Analysis. Multiple linear regression. 4.4 Hypothesis testing STAT763: Applied Regression Analysis Multiple linear regression 4.4 Hypothesis testing Chunsheng Ma E-mail: cma@math.wichita.edu 4.4.1 Significance of regression Null hypothesis (Test whether all β j =

More information

Finding Relationships Among Variables

Finding Relationships Among Variables Finding Relationships Among Variables BUS 230: Business and Economic Research and Communication 1 Goals Specific goals: Re-familiarize ourselves with basic statistics ideas: sampling distributions, hypothesis

More information

Introduction to Estimation Methods for Time Series models. Lecture 1

Introduction to Estimation Methods for Time Series models. Lecture 1 Introduction to Estimation Methods for Time Series models Lecture 1 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 1 SNS Pisa 1 / 19 Estimation

More information

STAT5044: Regression and Anova. Inyoung Kim

STAT5044: Regression and Anova. Inyoung Kim STAT5044: Regression and Anova Inyoung Kim 2 / 51 Outline 1 Matrix Expression 2 Linear and quadratic forms 3 Properties of quadratic form 4 Properties of estimates 5 Distributional properties 3 / 51 Matrix

More information

4 Multiple Linear Regression

4 Multiple Linear Regression 4 Multiple Linear Regression 4. The Model Definition 4.. random variable Y fits a Multiple Linear Regression Model, iff there exist β, β,..., β k R so that for all (x, x 2,..., x k ) R k where ε N (, σ

More information

Chapter 4: Regression Models

Chapter 4: Regression Models Sales volume of company 1 Textbook: pp. 129-164 Chapter 4: Regression Models Money spent on advertising 2 Learning Objectives After completing this chapter, students will be able to: Identify variables,

More information

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University.

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University. Summer School in Statistics for Astronomers V June 1 - June 6, 2009 Regression Mosuk Chow Statistics Department Penn State University. Adapted from notes prepared by RL Karandikar Mean and variance Recall

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression September 24, 2008 Reading HH 8, GIll 4 Simple Linear Regression p.1/20 Problem Data: Observe pairs (Y i,x i ),i = 1,...n Response or dependent variable Y Predictor or independent

More information

Chapter 1. Linear Regression with One Predictor Variable

Chapter 1. Linear Regression with One Predictor Variable Chapter 1. Linear Regression with One Predictor Variable 1.1 Statistical Relation Between Two Variables To motivate statistical relationships, let us consider a mathematical relation between two mathematical

More information

Linear models and their mathematical foundations: Simple linear regression

Linear models and their mathematical foundations: Simple linear regression Linear models and their mathematical foundations: Simple linear regression Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/21 Introduction

More information

1 Least Squares Estimation - multiple regression.

1 Least Squares Estimation - multiple regression. Introduction to multiple regression. Fall 2010 1 Least Squares Estimation - multiple regression. Let y = {y 1,, y n } be a n 1 vector of dependent variable observations. Let β = {β 0, β 1 } be the 2 1

More information

Multiple Regression Analysis. Part III. Multiple Regression Analysis

Multiple Regression Analysis. Part III. Multiple Regression Analysis Part III Multiple Regression Analysis As of Sep 26, 2017 1 Multiple Regression Analysis Estimation Matrix form Goodness-of-Fit R-square Adjusted R-square Expected values of the OLS estimators Irrelevant

More information

22s:152 Applied Linear Regression. Take random samples from each of m populations.

22s:152 Applied Linear Regression. Take random samples from each of m populations. 22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each

More information

Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur

Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur Applied Multivariate Statistical Modeling Prof. J. Maiti Department of Industrial Engineering and Management Indian Institute of Technology, Kharagpur Lecture - 29 Multivariate Linear Regression- Model

More information

MLR Model Checking. Author: Nicholas G Reich, Jeff Goldsmith. This material is part of the statsteachr project

MLR Model Checking. Author: Nicholas G Reich, Jeff Goldsmith. This material is part of the statsteachr project MLR Model Checking Author: Nicholas G Reich, Jeff Goldsmith This material is part of the statsteachr project Made available under the Creative Commons Attribution-ShareAlike 3.0 Unported License: http://creativecommons.org/licenses/by-sa/3.0/deed.en

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

Chapter 11 - Lecture 1 Single Factor ANOVA

Chapter 11 - Lecture 1 Single Factor ANOVA Chapter 11 - Lecture 1 Single Factor ANOVA April 7th, 2010 Means Variance Sum of Squares Review In Chapter 9 we have seen how to make hypothesis testing for one population mean. In Chapter 10 we have seen

More information

MATH11400 Statistics Homepage

MATH11400 Statistics Homepage MATH11400 Statistics 1 2010 11 Homepage http://www.stats.bris.ac.uk/%7emapjg/teach/stats1/ 4. Linear Regression 4.1 Introduction So far our data have consisted of observations on a single variable of interest.

More information

Scatter plot of data from the study. Linear Regression

Scatter plot of data from the study. Linear Regression 1 2 Linear Regression Scatter plot of data from the study. Consider a study to relate birthweight to the estriol level of pregnant women. The data is below. i Weight (g / 100) i Weight (g / 100) 1 7 25

More information

Linear Models and Estimation by Least Squares

Linear Models and Estimation by Least Squares Linear Models and Estimation by Least Squares Jin-Lung Lin 1 Introduction Causal relation investigation lies in the heart of economics. Effect (Dependent variable) cause (Independent variable) Example:

More information