Multiple Linear Regression

University of California, San Diego
Instructor: Ery Arias-Castro

Passenger car mileage

Consider the carmpg dataset. Variables:

  make.model  make and model
  vol         cubic feet of cab space
  hp          engine horsepower
  mpg         average miles per gallon
  sp          top speed (mph)
  wt          vehicle weight (100 lb)

Goal: predict a car's gas consumption based on its characteristics.
Graphics: pairwise scatterplots and possibly individual boxplots. To [R].

Scatterplot highlights

- hp, sp and wt seem correlated.
- vol and wt seem correlated. Highly correlated predictors can lead to a misinterpretation of the fit.
- mpg seems polynomial in hp and sp.
- mpg versus wt shows a fan shape, indicating unequal variances.

Least squares regression

We fit a linear model:

  mpg = β₀ + β₁ vol + β₂ hp + β₃ sp + β₄ wt

With data {(x_i, y_i) : i = 1, …, n}, where x_i = (x_{i,1}, …, x_{i,p}), we fit

  y_i = β₀ + β₁ x_{i,1} + ⋯ + β_p x_{i,p} + ε_i

Standard assumptions: the measurement errors are i.i.d. normal with mean zero, i.e.

  ε₁, …, ε_n ~ iid N(0, σ²)

The least squares fit minimizes the error sum of squares

  SSE(β₀, β₁, …, β_p) = Σ_{i=1}^n (y_i − β₀ − β₁ x_{i,1} − ⋯ − β_p x_{i,p})²

Under the standard assumptions, this fit corresponds to the MLE.

Fitted values, residuals and standard error

The fitted (predicted) values are defined as

  ŷ_i = β̂₀ + β̂₁ x_{i,1} + ⋯ + β̂_p x_{i,p}

The residuals are defined as

  e_i = y_i − ŷ_i

Our estimate of σ² is the mean squared error of the fit:

  σ̂² = (1/(n − p − 1)) Σ_{i=1}^n (y_i − ŷ_i)² = (1/(n − p − 1)) Σ_{i=1}^n e_i²

Matrix interpretation

Let X be the n × (p + 1) matrix with row vectors {(1, x_i) : i = 1, …, n}. Define y = (y₁, …, y_n), ε = (ε₁, …, ε_n) and β = (β₀, β₁, …, β_p). The model is

  y = Xβ + ε

The least squares estimate β̂ minimizes ‖y − Xβ‖². If the columns of X are linearly independent, then

  β̂ = (XᵀX)⁻¹ Xᵀ y

Note that the estimate is linear in the response. Define the hat matrix

  H = X (XᵀX)⁻¹ Xᵀ

H is the orthogonal projection onto span(X). The fitted values may be expressed as

  ŷ = X (XᵀX)⁻¹ Xᵀ y = H y

and the residuals as

  e = y − ŷ = (I − H) y
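As a concrete illustration of these formulas (not part of the original slides), here is a minimal numpy sketch on made-up data; the variable names and toy numbers are my own. It computes β̂ = (XᵀX)⁻¹Xᵀy and the hat matrix H.

```python
import numpy as np

# Toy data (made up): n = 8 observations, p = 2 predictors plus intercept.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
X = np.column_stack([np.ones(8), x1, x2])   # n x (p+1) design matrix
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])

# Least squares estimate: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X^T X)^{-1} X^T, the orthogonal projection onto span(X)
H = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = H @ y   # fitted values  y_hat = H y
e = y - y_hat   # residuals      e = (I - H) y
```

Since H is an orthogonal projection, H² = H = Hᵀ and the residuals are orthogonal to every column of X; those identities make convenient sanity checks for any hand-rolled implementation.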

Distributions

Assume that the ε_i's are i.i.d. normal with mean zero. Define M = (XᵀX)⁻¹ ∈ ℝ^{(p+1)×(p+1)}.

β̂ = (β̂₀, …, β̂_p) has the multivariate normal distribution with mean β (unbiased) and covariance matrix σ²M. That is, the β̂_j's are jointly normal with

  E(β̂_j) = β_j,   cov(β̂_j, β̂_k) = σ² M_{jk}

For σ̂², we have

  (n − p − 1) σ̂²/σ² ~ χ²_{n−p−1}

β̂ and σ̂² are independent.

t-tests

Consequently,

  (β̂_j − β_j) / √(σ̂² M_{jj}) ~ t_{n−p−1}

In general, if c = (c₀, c₁, …, c_p)ᵀ ∈ ℝ^{p+1}, then

  (cᵀβ̂ − cᵀβ) / ŝe(cᵀβ̂) ~ t_{n−p−1},   where ŝe(cᵀβ̂) = σ̂ √(cᵀMc)

In particular, the following is a level-(1 − α) confidence interval for β₀ + β₁x₁ + ⋯ + β_p x_p = βᵀx, where x = (1, x₁, …, x_p):

  β̂ᵀx ± t^{α/2}_{n−p−1} ŝe(β̂ᵀx)

To [R].
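To make the t-interval concrete, here is a small numpy sketch (toy data of my own, not from the slides). The quantile t^{0.025}_5 ≈ 2.571 is hard-coded from a t table, since this example has n − p − 1 = 5 degrees of freedom.

```python
import numpy as np

# Toy design (made up): intercept plus two predictors, n = 8.
X = np.column_stack([np.ones(8),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])
n, k = X.shape                  # k = p + 1

M = np.linalg.inv(X.T @ X)
beta_hat = M @ X.T @ y
e = y - X @ beta_hat
sigma2_hat = e @ e / (n - k)    # sigma^2 estimate, on n - p - 1 df

# Confidence interval for beta^T x at a point x = (1, x_1, ..., x_p)
x_new = np.array([1.0, 4.5, 4.5])
se = np.sqrt(sigma2_hat * x_new @ M @ x_new)   # se = sigma_hat * sqrt(x^T M x)
t_quant = 2.571                 # upper 0.025 quantile of t_5, from a t table
ci = (beta_hat @ x_new - t_quant * se, beta_hat @ x_new + t_quant * se)
```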

Confidence bands / regions

Using the fact that

  (β̂ − β)ᵀ XᵀX (β̂ − β) / ((p + 1) σ̂²) ~ F_{p+1,n−p−1}

the following defines a level-(1 − α) confidence region (an ellipsoid) for β:

  (β̂ − β)ᵀ XᵀX (β̂ − β) / ((p + 1) σ̂²) ≤ F^α_{p+1,n−p−1}

Using the same fact, Scheffé's S-method gives that

  cᵀβ̂ ± ((p + 1) F^α_{p+1,n−p−1})^{1/2} ŝe(cᵀβ̂)

is a level-(1 − α) confidence interval for cᵀβ simultaneously for all c ∈ ℝ^{p+1}. As a special case, we obtain the following confidence band for βᵀx:

  β̂ᵀx ± ((p + 1) F^α_{p+1,n−p−1})^{1/2} ŝe(β̂ᵀx)

Comparing models

Say we want to test

  H₀: β₁ = ⋯ = β_p = 0   versus   H₁: β_j ≠ 0 for some j = 1, …, p

This is the same as comparing

  H₀: y_i = β₀ + ε_i
  H₁: y_i = β₀ + β₁ x_{i,1} + ⋯ + β_p x_{i,p} + ε_i

The model under H₀ is a submodel of the model under H₁.

Analysis of variance

The residual sum of squares under H₀ is the total sum of squares

  SS_Y = Σ_{i=1}^n (y_i − ȳ)²

It has n − 1 degrees of freedom. The residual sum of squares under H₁ is

  SSE = Σ_{i=1}^n (y_i − ŷ_i)²

It has n − p − 1 degrees of freedom. The sum of squares due to regression is

  SSreg = SS_Y − SSE = Σ_{i=1}^n (ŷ_i − ȳ)²

It has p degrees of freedom. The ANOVA F-test rejects for large values of

  F = (SSreg/p) / (SSE/(n − p − 1))

Under H₀, F ~ F_{p,n−p−1}.

Coefficient of (multiple) determination

Defined as

  R² = 1 − SSE/SS_Y

Interpreted as the fraction of variation in y explained by x. R² always increases when variables are added. The adjusted R² incorporates the number of predictors:

  R²_a = 1 − (SSE/(n − p − 1)) / (SS_Y/(n − 1))
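Continuing the toy example (my own made-up numbers), this sketch computes the ANOVA decomposition and the F statistic for H₀: β₁ = ⋯ = β_p = 0; because the model contains an intercept, SS_Y = SSreg + SSE exactly.

```python
import numpy as np

X = np.column_stack([np.ones(8),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])
n, k = X.shape
p = k - 1

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

SS_Y = np.sum((y - y.mean()) ** 2)      # total sum of squares, n - 1 df
SSE = np.sum((y - y_hat) ** 2)          # residual sum of squares, n - p - 1 df
SSreg = SS_Y - SSE                      # regression sum of squares, p df

F = (SSreg / p) / (SSE / (n - p - 1))   # compare to F_{p, n-p-1}
R2 = 1.0 - SSE / SS_Y
R2_adj = 1.0 - (SSE / (n - p - 1)) / (SS_Y / (n - 1))
```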

Testing whether a subset of the variables is zero

Consider a subset of variables J ⊂ {1, …, p}. We want to test

  H₀: β_j = 0 for all j ∈ J   versus   H₁: β_j ≠ 0 for some j ∈ J

This is the same as comparing

  H₀: y_i = β₀ + Σ_{j∉J} β_j x_{i,j} + ε_i
  H₁: y_i = β₀ + Σ_{j=1}^p β_j x_{i,j} + ε_i

The model under H₀ is a submodel of the model under H₁.

Analysis of variance

Let SSE_J be the residual sum of squares for the model under H₀, and SSE the residual sum of squares for the model under H₁. The ANOVA F-test rejects for large values of

  F = ((SSE_J − SSE)/|J|) / (SSE/(n − p − 1))

Under the null, F ~ F_{|J|,n−p−1}. In particular, testing β_j = 0 versus β_j ≠ 0 using the F-test above is equivalent to using the t-test described earlier (F = t²).
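The equivalence between the partial F-test and the t-test can be checked numerically. In this sketch (toy data of my own) we drop the single predictor x₂, so J = {2}, and verify F = t².

```python
import numpy as np

X = np.column_stack([np.ones(8),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])
n, k = X.shape

def rss(Z, y):
    """Residual sum of squares of the least squares fit of y on Z."""
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.sum((y - Z @ b) ** 2)

SSE = rss(X, y)            # full model
SSE_J = rss(X[:, :2], y)   # submodel without the last predictor (|J| = 1)

F = ((SSE_J - SSE) / 1) / (SSE / (n - k))   # F_{1, n-p-1} under the null

# t statistic for the same coefficient in the full model
M = np.linalg.inv(X.T @ X)
beta_hat = M @ X.T @ y
sigma2_hat = SSE / (n - k)
t_stat = beta_hat[2] / np.sqrt(sigma2_hat * M[2, 2])
```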

Testing whether a given set of linear combinations is zero

Consider a full-rank q × (p + 1) matrix A. We want to test

  H₀: Aβ = 0   versus   H₁: Aβ ≠ 0

Assuming A = (A₁, A₂) with A₂ q × q invertible, we are comparing the two models

  H₀: y = (X₁ − X₂A₂⁻¹A₁) β₁ + ε
  H₁: y = Xβ + ε

The model under H₀ is a submodel of the model under H₁.

Analysis of variance

Let SSE_A be the residual sum of squares for the model under H₀, and SSE the residual sum of squares for the model under H₁. The ANOVA F-test rejects for large values of

  F = ((SSE_A − SSE)/rank(A)) / (SSE/(n − p − 1))

Under the null, F ~ F_{rank(A),n−p−1}.

Spreading

Spreading the variables often equalizes the leverage of the data points and improves the fit, both numerically (e.g. R²) and visually. A typical situation is an accumulation of data points near 0, in which case applying a logarithm or a square root helps spread the data more evenly. Example: mammals dataset.

Standardized variables

Standardizing the variables removes the units and makes comparing the magnitudes of the coefficients easier. One way to do so is to make all the variables mean 0 and unit length:

  y ← (y − ȳ)/√(SS_Y)   and   x_j ← (x_j − x̄_j)/√(SS_{X_j})

An intercept is no longer needed. Standardizing the variables changes the estimates β̂_j, but not σ̂², the multiple R² or the p-values of the tests we saw.

Checking assumptions

Can we rely on those p-values? In other words, are the assumptions correct?

1. Mean zero, i.e. model accuracy
2. Equal variances, i.e. homoscedasticity
3. Independence
4. Normally distributed

Residuals

We look at the residuals as substitutes for the errors. Remember that e = (e₁, …, e_n), with

  e = y − ŷ = y − Xβ̂ = (I − H) y

Under the standard assumptions, e is multivariate normal with mean zero and covariance σ²(I − H). In particular,

  cov(e_i, e_j) = −σ² h_{ij} for i ≠ j,   var(e_i) = σ²(1 − h_{ii})

Standardized residuals

In practice, the residuals are often corrected for unequal variances.

Internally studentized residuals (what R uses):

  r_i = e_i / (σ̂ √(1 − h_{ii}))

Note that

  r_i²/(n − p − 1) ~ Beta(1/2, (n − p − 2)/2),   so that r_i ≈ t_{n−p−1}

Externally studentized residuals:

  t_i = e_i / (σ̂_{(i)} √(1 − h_{ii}))

where σ̂_{(i)} comes from the fit with the ith observation removed. Note that

  t_i ~ t_{n−p−2}

Checking model accuracy

With several predictors, the residuals-vs-fitted-values plot may be misleading. Instead, we look at partial residual plots, which focus on one variable at a time. Say we want to look at the influence of predictor X_j = (x_{1,j}, …, x_{n,j}) (the jth column of X) on the response y.

Added variable plots take into account the influence of the other predictors:
1. Regress y on all the predictors excluding X_j, obtaining residuals e^{(j)}.
2. Regress X_j on all the predictors excluding X_j, obtaining residuals X^{(j)}.
3. Plot e^{(j)} versus X^{(j)}.

Component plus residual plots are another alternative:
1. Plot e + β̂_j X_j versus X_j.

Checking homoscedasticity

Partial residual plots may help assess whether the errors have equal variances or not. A fan shape in the plot is an indication that the errors do not have equal variances.
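A small numpy check (toy data and variable names of my own) of the two kinds of studentized residuals: t_i is computed by actually refitting without case i, and compared against the standard shortcut t_i = r_i √((n − p − 2)/(n − p − 1 − r_i²)).

```python
import numpy as np

X = np.column_stack([np.ones(8),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])
n, k = X.shape                  # k = p + 1

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                  # leverages h_ii
e = y - H @ y
sigma2 = e @ e / (n - k)

# Internally studentized residuals (R's rstandard)
r = e / np.sqrt(sigma2 * (1.0 - h))

# Externally studentized residuals (R's rstudent), via leave-one-out refits
t_res = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    Xi, yi = X[keep], y[keep]
    bi = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)
    s2_i = np.sum((yi - Xi @ bi) ** 2) / (n - 1 - k)   # sigma_(i)^2
    t_res[i] = e[i] / np.sqrt(s2_i * (1.0 - h[i]))
```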

Checking normality

A q-q plot of the standardized residuals is used.

Checking for independence

Checking for independence often requires defining a dependency structure explicitly and testing against it. (Otherwise we would need a number of replicates of the data, i.e. multiple samples of the same size.) If dependency is expected, then it is preferable to fit a time series model (for the errors). In that context, we could use something like the Durbin-Watson test to assess whether the errors are autocorrelated.

Outliers

Outliers are points that are unusual compared to the bulk of the observations.

An outlier in predictor is a data point (x_i, y_i) such that x_i is away from the bulk of the predictor vectors. Such a point has high leverage.

An outlier in response is a data point (x_i, y_i) such that y_i is away from the trend implied by the other observations.

A point that is both an outlier in predictor and in response is said to be influential.

Detecting outliers

To detect outliers in predictor, plotting the hat values h_{ii} may be useful (h_{ii} > 2p/n is considered suspect). To detect outliers in response, plotting the residuals may be useful.

Leave-one-out diagnostics

If β̂_{(i)} is the least squares coefficient vector computed with the ith case deleted, then

  β̂ − β̂_{(i)} = (XᵀX)⁻¹ x_i e_i / (1 − h_{ii})

Define

  ŷ_{(i)} = X β̂_{(i)}

In particular, ŷ_{(i)i} is the value at x_i predicted by the model fitted without the observation (x_i, y_i).

Cook's distances

Cook's distances are

  D_i = ‖ŷ − ŷ_{(i)}‖² / ((p + 1) σ̂²)   (change in fitted values)
      = (β̂ − β̂_{(i)})ᵀ XᵀX (β̂ − β̂_{(i)}) / ((p + 1) σ̂²)   (change in coefficients)
      = r_i² h_{ii} / ((p + 1)(1 − h_{ii}))   (combination of residuals and hat values)

D_i > F^{0.05}_{p+1,n−p−1} is considered suspect. To [R].

DFBETAS and DFFITS

DFBETAS:

  DFBETAS_{j(i)} = (β̂_j − β̂_{j(i)}) / √(σ̂²_{(i)} (XᵀX)⁻¹_{jj})

|DFBETAS_{j(i)}| > 2/√n is considered suspect.

DFFITS:

  DFFITS_i = (ŷ_i − ŷ_{(i)i}) / (σ̂_{(i)} √(h_{ii})) = t_i √(h_{ii}/(1 − h_{ii}))

|DFFITS_i| > 2√(p/n) is considered suspect.
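The three expressions for Cook's distance are algebraically identical; the following sketch (again with made-up data) computes D_i both from the change in fitted values under leave-one-out refits and from the shortcut in residuals and hat values.

```python
import numpy as np

X = np.column_stack([np.ones(8),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])
n, k = X.shape                        # k = p + 1

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)
H = X @ np.linalg.solve(XtX, X.T)
h = np.diag(H)
e = y - H @ y
sigma2 = e @ e / (n - k)
r = e / np.sqrt(sigma2 * (1.0 - h))   # internally studentized residuals

# Shortcut: D_i = r_i^2 h_ii / ((p+1)(1 - h_ii))
D_short = r ** 2 * h / (k * (1.0 - h))

# Direct: refit without case i and measure the change in fitted values
D_direct = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    D_direct[i] = np.sum((X @ (beta_hat - b_i)) ** 2) / (k * sigma2)
```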

Multicollinearity

When the predictors are nearly linearly dependent, several issues arise:

1. Interpretation is difficult.
2. The estimates have large variances:

  var(β̂_j) = σ² / ((1 − Q_j²) SS_{X_j})

where Q_j² is the R² when regressing X_j on the X_k, k ≠ j.

3. The fit is numerically unstable, since XᵀX is almost singular.

Correlation between predictors

To detect multicollinearity, we look for large correlations between predictors. For that we compute the correlation matrix of the predictors, with coefficients cor(X_j, X_k). To [R].

Variance inflation factors

Alternatively, one can look at the variance inflation factors:

  VIF_j = 1 / (1 − Q_j²)

If the predictors are standardized to unit norm, XᵀX is a correlation matrix and we have

  VIF = diag((XᵀX)⁻¹)

VIF_j > 10 (same as Q_j² > 90%) is considered suspect. To [R].
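VIFs can be computed either from the auxiliary-regression Q_j² or from the inverse correlation matrix of the predictors; the sketch below (my own toy numbers) does both, and the two routes agree.

```python
import numpy as np

# Two correlated predictors (made-up data); the intercept is handled by centering.
Z = np.column_stack([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
                     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]])
n, p = Z.shape

# Route 1: VIF_j = 1 / (1 - Q_j^2), Q_j^2 from regressing X_j on the others
vif_reg = np.empty(p)
for j in range(p):
    others = np.column_stack([np.ones(n), np.delete(Z, j, axis=1)])
    b = np.linalg.lstsq(others, Z[:, j], rcond=None)[0]
    sse_j = np.sum((Z[:, j] - others @ b) ** 2)
    ss_xj = np.sum((Z[:, j] - Z[:, j].mean()) ** 2)
    vif_reg[j] = ss_xj / sse_j          # = 1 / (1 - Q_j^2)

# Route 2: diagonal of the inverse correlation matrix of the predictors
R = np.corrcoef(Z, rowvar=False)
vif_corr = np.diag(np.linalg.inv(R))
```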

Condition indices

Another option is to examine the condition indices of XᵀX:

  κ_j = λ_max / λ_j

where the λ_j's are the eigenvalues of XᵀX. κ_j > 1000 is considered suspect. λ_max/λ_min is called the condition number of the matrix XᵀX. To [R].

Dealing with questionable assumptions

If some of the assumptions are doubtful, it is important to correct the model, for otherwise we cannot rely on our results (estimates + tests).

Dealing with curvature

When the model accuracy is questionable, we can:

1. Transform the variables and/or the response, which often involves guessing. This is particularly useful for spreading the variables.
2. Augment the model by adding other terms, perhaps with interactions. This is often used in conjunction with model selection so as to avoid overfitting.

Dealing with heteroscedasticity

When homoscedasticity is questionable, we can:

1. Apply a variance stabilizing transformation to the response.
2. Fit the model by weighted least squares.

Both involve guessing at the dependency of the variance as a function of the variables.

Variance stabilizing transformations

Here are a few common transformations of the response to stabilize the variance:

  σ² ∝ E(y_i)              (Poisson)    change to √y
  σ² ∝ E(y_i)(1 − E(y_i))  (binomial)   change to sin⁻¹(√y)
  σ² ∝ E(y_i)²                          change to log(y)
  σ² ∝ E(y_i)³                          change to 1/√y
  σ² ∝ E(y_i)⁴                          change to 1/y

In general, we want a transformation ψ such that

  ψ′(E y_i)² var(y_i) ≈ 1

Weighted least squares

Suppose

  y_i = β₀ + βᵀx_i + ε_i,   ε_i ~ N(0, σ_i²)

Maximum likelihood estimation of β₀, β corresponds to the weighted least squares solution that minimizes

  SSE(β₀, β) = Σ_{i=1}^n w_i (y_i − β₀ − βᵀx_i)²,   w_i = 1/σ_i²

To determine appropriate weights, different approaches are used:
- Guess how σ_i² varies with x_i, i.e. the shape of var(y_i | x_i).
- Assume a parametric model for the weights and use maximum likelihood estimation, solved by iteratively reweighted least squares.

Using weights is helpful in situations where some observations are more reliable than others, mostly because of variability.
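A sketch of weighted least squares (toy data and weights of my own): the WLS solution (XᵀWX)⁻¹XᵀWy is the same as OLS after rescaling each row by √w_i, which is a convenient way both to compute it and to test it.

```python
import numpy as np

X = np.column_stack([np.ones(8),
                     [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])
w = np.array([1.0, 1.0, 0.5, 0.5, 0.25, 0.25, 0.1, 0.1])   # w_i = 1/sigma_i^2 (guessed)

# Normal equations with weight matrix W = diag(w)
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent: ordinary least squares on sqrt(w)-rescaled data
sw = np.sqrt(w)
beta_ols = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
```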

Generalized least squares

Weighted least squares assumes that the errors are uncorrelated. Generalized least squares assumes a more general form for the covariance of the errors, namely σ²V, where V is usually known. Assuming both the design matrix X and the covariance matrix V are full rank, the maximum likelihood estimate is

  β̂ = (XᵀV⁻¹X)⁻¹ XᵀV⁻¹ y

Dealing with outliers

We have identified an outlier. Was it recorded properly?
- No: correct it, or remove it from the fit.
- Yes: decide whether to include it in the fit.

Genuine outliers carry information, so simply discarding them amounts to losing that information. However, strong outliers can compromise the quality of the overall fit. A possible option is to model outliers and non-outliers separately.

Dealing with multicollinearity

If interpretation is the main goal, drop a few redundant predictors. If prediction is the main goal, use a model selection procedure.
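Generalized least squares can be implemented by whitening: if V = LLᵀ (Cholesky), regressing L⁻¹y on L⁻¹X by ordinary least squares gives (XᵀV⁻¹X)⁻¹XᵀV⁻¹y. A minimal sketch, with an AR(1)-style V of my own choosing:

```python
import numpy as np

n = 8
X = np.column_stack([np.ones(n), np.arange(1.0, n + 1.0)])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 5.8, 7.1, 7.9])

# Assumed error covariance (up to sigma^2): AR(1)-like, V_ij = 0.5^|i-j|
idx = np.arange(n)
V = 0.5 ** np.abs(idx[:, None] - idx[None, :])

# Direct formula: beta = (X^T V^{-1} X)^{-1} X^T V^{-1} y
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Whitening: V = L L^T, then OLS on (L^{-1} X, L^{-1} y)
L = np.linalg.cholesky(V)
Xw = np.linalg.solve(L, X)
yw = np.linalg.solve(L, y)
beta_white = np.linalg.lstsq(Xw, yw, rcond=None)[0]
```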

Chapter 10 Building the Regression Model II: Diagnostics Chapter 10 Building the Regression Model II: Diagnostics 許湘伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 41 10.1 Model Adequacy for a Predictor Variable-Added

More information

Leverage. the response is in line with the other values, or the high leverage has caused the fitted model to be pulled toward the observed response.

Leverage. the response is in line with the other values, or the high leverage has caused the fitted model to be pulled toward the observed response. Leverage Some cases have high leverage, the potential to greatly affect the fit. These cases are outliers in the space of predictors. Often the residuals for these cases are not large because the response

More information

Outline. Remedial Measures) Extra Sums of Squares Standardized Version of the Multiple Regression Model

Outline. Remedial Measures) Extra Sums of Squares Standardized Version of the Multiple Regression Model Outline 1 Multiple Linear Regression (Estimation, Inference, Diagnostics and Remedial Measures) 2 Special Topics for Multiple Regression Extra Sums of Squares Standardized Version of the Multiple Regression

More information

Review: General Approach to Hypothesis Testing. 1. Define the research question and formulate the appropriate null and alternative hypotheses.

Review: General Approach to Hypothesis Testing. 1. Define the research question and formulate the appropriate null and alternative hypotheses. 1 Review: Let X 1, X,..., X n denote n independent random variables sampled from some distribution might not be normal!) with mean µ) and standard deviation σ). Then X µ σ n In other words, X is approximately

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7

MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is

More information

x 21 x 22 x 23 f X 1 X 2 X 3 ε

x 21 x 22 x 23 f X 1 X 2 X 3 ε Chapter 2 Estimation 2.1 Example Let s start with an example. Suppose that Y is the fuel consumption of a particular model of car in m.p.g. Suppose that the predictors are 1. X 1 the weight of the car

More information

Simple Linear Regression for the MPG Data

Simple Linear Regression for the MPG Data Simple Linear Regression for the MPG Data 2000 2500 3000 3500 15 20 25 30 35 40 45 Wgt MPG What do we do with the data? y i = MPG of i th car x i = Weight of i th car i =1,...,n n = Sample Size Exploratory

More information

Chapter 16. Simple Linear Regression and Correlation

Chapter 16. Simple Linear Regression and Correlation Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

1 The classical linear regression model

1 The classical linear regression model THE UNIVERSITY OF CHICAGO Booth School of Business Business 41912, Spring Quarter 2012, Mr Ruey S Tsay Lecture 4: Multivariate Linear Regression Linear regression analysis is one of the most widely used

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

Multiple Linear Regression

Multiple Linear Regression Andrew Lonardelli December 20, 2013 Multiple Linear Regression 1 Table Of Contents Introduction: p.3 Multiple Linear Regression Model: p.3 Least Squares Estimation of the Parameters: p.4-5 The matrix approach

More information

Remedial Measures for Multiple Linear Regression Models

Remedial Measures for Multiple Linear Regression Models Remedial Measures for Multiple Linear Regression Models Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 1 / 25 Outline

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

CAS MA575 Linear Models

CAS MA575 Linear Models CAS MA575 Linear Models Boston University, Fall 2013 Midterm Exam (Correction) Instructor: Cedric Ginestet Date: 22 Oct 2013. Maximal Score: 200pts. Please Note: You will only be graded on work and answers

More information

Need for Several Predictor Variables

Need for Several Predictor Variables Multiple regression One of the most widely used tools in statistical analysis Matrix expressions for multiple regression are the same as for simple linear regression Need for Several Predictor Variables

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

2. Regression Review

2. Regression Review 2. Regression Review 2.1 The Regression Model The general form of the regression model y t = f(x t, β) + ε t where x t = (x t1,, x tp ), β = (β 1,..., β m ). ε t is a random variable, Eε t = 0, Var(ε t

More information

Chapter 6 Multiple Regression

Chapter 6 Multiple Regression STAT 525 FALL 2018 Chapter 6 Multiple Regression Professor Min Zhang The Data and Model Still have single response variable Y Now have multiple explanatory variables Examples: Blood Pressure vs Age, Weight,

More information

Applied linear statistical models: An overview

Applied linear statistical models: An overview Applied linear statistical models: An overview Gunnar Stefansson 1 Dept. of Mathematics Univ. Iceland August 27, 2010 Outline Some basics Course: Applied linear statistical models This lecture: A description

More information

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure. STATGRAPHICS Rev. 9/13/213 Calibration Models Summary... 1 Data Input... 3 Analysis Summary... 5 Analysis Options... 7 Plot of Fitted Model... 9 Predicted Values... 1 Confidence Intervals... 11 Observed

More information

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti

Prepared by: Prof. Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti Prepared by: Prof Dr Bahaman Abu Samah Department of Professional Development and Continuing Education Faculty of Educational Studies Universiti Putra Malaysia Serdang M L Regression is an extension to

More information

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij = K. Model Diagnostics We ve already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting by using residuals are more informative for assessing

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012

Lecture 3: Linear Models. Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 Lecture 3: Linear Models Bruce Walsh lecture notes Uppsala EQG course version 28 Jan 2012 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector of observed

More information

MATH 644: Regression Analysis Methods

MATH 644: Regression Analysis Methods MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100

More information

Multivariate Linear Regression Models

Multivariate Linear Regression Models Multivariate Linear Regression Models Regression analysis is used to predict the value of one or more responses from a set of predictors. It can also be used to estimate the linear association between

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Multiple Sample Numerical Data

Multiple Sample Numerical Data Multiple Sample Numerical Data Analysis of Variance, Kruskal-Wallis test, Friedman test University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 /

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Bivariate Paired Numerical Data

Bivariate Paired Numerical Data Bivariate Paired Numerical Data Pearson s correlation, Spearman s ρ and Kendall s τ, tests of independence University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html

More information

Linear Models, Problems

Linear Models, Problems Linear Models, Problems John Fox McMaster University Draft: Please do not quote without permission Revised January 2003 Copyright c 2002, 2003 by John Fox I. The Normal Linear Model: Structure and Assumptions

More information

Formal Statement of Simple Linear Regression Model

Formal Statement of Simple Linear Regression Model Formal Statement of Simple Linear Regression Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor

More information

Well-developed and understood properties

Well-developed and understood properties 1 INTRODUCTION TO LINEAR MODELS 1 THE CLASSICAL LINEAR MODEL Most commonly used statistical models Flexible models Well-developed and understood properties Ease of interpretation Building block for more

More information

Diagnostics and Remedial Measures

Diagnostics and Remedial Measures Diagnostics and Remedial Measures Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Diagnostics and Remedial Measures 1 / 72 Remedial Measures How do we know that the regression

More information

Introduction to Linear regression analysis. Part 2. Model comparisons

Introduction to Linear regression analysis. Part 2. Model comparisons Introduction to Linear regression analysis Part Model comparisons 1 ANOVA for regression Total variation in Y SS Total = Variation explained by regression with X SS Regression + Residual variation SS Residual

More information

The Model Building Process Part I: Checking Model Assumptions Best Practice

The Model Building Process Part I: Checking Model Assumptions Best Practice The Model Building Process Part I: Checking Model Assumptions Best Practice Authored by: Sarah Burke, PhD 31 July 2017 The goal of the STAT T&E COE is to assist in developing rigorous, defensible test

More information

18.S096 Problem Set 3 Fall 2013 Regression Analysis Due Date: 10/8/2013

18.S096 Problem Set 3 Fall 2013 Regression Analysis Due Date: 10/8/2013 18.S096 Problem Set 3 Fall 013 Regression Analysis Due Date: 10/8/013 he Projection( Hat ) Matrix and Case Influence/Leverage Recall the setup for a linear regression model y = Xβ + ɛ where y and ɛ are

More information

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B

Problems. Suppose both models are fitted to the same data. Show that SS Res, A SS Res, B Simple Linear Regression 35 Problems 1 Consider a set of data (x i, y i ), i =1, 2,,n, and the following two regression models: y i = β 0 + β 1 x i + ε, (i =1, 2,,n), Model A y i = γ 0 + γ 1 x i + γ 2

More information

Statistics - Lecture Three. Linear Models. Charlotte Wickham 1.

Statistics - Lecture Three. Linear Models. Charlotte Wickham   1. Statistics - Lecture Three Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Linear Models 1. The Theory 2. Practical Use 3. How to do it in R 4. An example 5. Extensions

More information