Lecture 6. Multiple linear regression. Model validation, model choice

Size: px
Start display at page:

Download "Lecture 6. Multiple linear regression. Model validation, model choice"

Transcription

1 Lecture 6. Multiple linear regression. Model validation, model choice Jesper Rydén Department of Mathematics, Uppsala University Regression and Analysis of Variance autumn 2015

2 Influential observations Influential point: A point whose removal from the data set would cause a large change in the fit. Measures of influence: Change in the coefficients ˆβ ˆβ (i) Change in the fit, X T ( β β (i) ) = ŷ ŷ (i). Some examples: DFBETA, DFFIT, PRESS, Cook s distance

3 Cook s distance and PRESS Cook s distance: D i = 1 (k + 1)s 2 (ˆβ ˆβ(i)) X X (ˆβ ˆβ(i)) = [e(s) i ] 2 h ii k + 1 (1 h ii ) 2 PRESS: e i e i, 1 = y i ŷ i (i) = 1 h ii

4 Example: Cook s distance Dataset mtcars in R: mpg wt + hp. mpg ~ wt + hp Cook s distance Index

5 Cut-off values? John Fox (1997): These criteria are no substitute for graphical examination of the residuals. Hat values: Exceeding 2(k + 1)/n. Studentized residuals: Outside the range e i 2. Influence, Cook s distance: D i > 4/(n k 1)

6 The problem of collinearity (Multicollinearity, ill conditioning) Subject which can be seen from different views: Statistics Inflated variances of regression coefficients. Tools for diagnosis: correlations, variance inflation factors. Numerical analysis Rounding errors, inaccuracy. Tools for diagnosis: singular values, condition numbers.

7 Collinearity: matrix formulation Recall that in the usual OLS model, β = (X T X) 1 X T y, Cov[ˆβ] = σ 2 (X T X) 1 Problem if X T X is singular. All matrices have generalized inverses, but in the singular case, these are not unique. In practice: columns of X are nearly linearly dependent, leads to near singularity. This is often the case e.g. with econometric data: highly correlated explanatory variables. It is recommended to scale X. Several authors recommend models both centred and scaled: controversy on the subject.

8 Stability of LS plane

9 Residual sum of squares

10 Geometrical interpretation: an unstable table Question: Dear Professor Mean, Could you describe the term collinearity for me? I understand that it has to do with variables which are not totally independent, but that is all I know! Prof. Mean (not your average professor): When you have overlap among some of the variables, it takes more data to disentangle the individual effects of these variables. Think of it as a table where you push two of the four legs away from the corners and close to the middle of the table. Such a table will be very unstable.

11 Note on partial correlation NE (in Swedish): Den korrelation mellan två variabler som kvarstår efter det att den del i de bägge variablerna som kan predikteras (linjärt) från en tredje variabel har eliminerats. Consider the random variables X 1, X 2 och X 3 with correlation coefficients ρ 12, ρ 13 och ρ 23. Related sample correlation coefficients: r 12, r 13 och r 23. Partial correlation coefficient): r 12 3 = r 12 r 13 r 23 (1 r13 2 )(1 r 23 2 ) (See e.g. G.W. Snedecor, W.G. Cochran: Statistical Methods (6th ed) R-rutin: pcor.test

12 Example Random sample, 142 older women from (IA) and Nebraska (NE). Variables: From data: A: Age, B: Blood pressure, C: blood cholesterol r AB = , r AC = , r BC = One usually assumes that B and C increase with age. Are B and C correlated merely because of their common association, or is there a real relation at every age? We compute: r BC A = ( )( ). = Statistical test can be performed, on (142-3) degrees of freedom. It turns out that correlation is not significant. It may be that within the several age groups, B and C are uncorrelated.

13 Multiple correlation Multiple correlation: Definition. The sample multiple correlation between x 1 and x 2,..., x k is defined by R 2 1 (2,...,k) = wt 12 W 1 22 w 12 w 11 where and Z = W = Z T Z x 11 x 1... x 1k x k..... x n1 x 1... x nk x k

14 Tolerance and Variance Inflational Factor Denote by Rj 2 the R 2 between the variable x j and a linear combination of all other independent variables. The tolerance TOL j is defined as TOL j = 1 R 2 j and is close to one if x j is not closely related to other regressor variables. The variance inflation factor VIF j is given by VIF j = 1 TOL j A value of VIF j close to 1 indicates no relationship, large values indicate precense of multicollinearity. Rule of thumb: VIF > 10 of concern.

15 Eigenvalues and condition numbers Denote by λ 1,..., λ k the eigenvalues of the matrix X T X. Condition number: λ 1 κ = λ k Serious concern: κ > 20, κ > 30. One may also study condition numbers like λ 1 /λ j.

16 Model selection

17 Variable selection Select the best subset of predictors... Want to explain the data in the simplest way; redundant predictors should be removed. Cf. the principle of Occam s razor, lex parsimoniae Unnecessary predictors will add noise to the estimation of other quantities that we are interested; degrees of freedom will be wasted. Collinearity is caused by having too many variables trying to do the same job. Cost: If the model is to be used for prediction, time and/or money can be saved by not measuring redundant predictors [Faraway: Practical regression and Anova using R. Chapter 10]

18 Introductory example: Hierarchical models and selection of variables Lower-order terms should not be removed from the model before higher-order terms in the same variable. Polynomial models y = β 0 + β 1 x + β 2 x 2 + ɛ Suppose after LS estimation that the regression summary shows the term in x not significant, but the term in x 2 is. Remove x term, reduced model: y = β 0 + β 2 x 2 + ɛ However, choice of scale (unit) matter (e.g. x x + a). Not good, x term reappears: y = β 0 + β 2 a 2 + 2β 2 ax + β 2 x 2 + ɛ

19 Dropping variables Correct model (fuller one). Effects on Estimates of β j Estimation of error variance Covariance matrix of estimates Predicted values Blackboard

20 Criterion based procedures Various rules of thumb: The number of observations should be 5 10 times the number of independent variables; n k 1 10; n 5k How many possible models? With k independent variables: 2 k 1 models k = 3 k = 5 k = 10 k =

21 Criterion based procedures Some common criteria: Information criteria: AIC and BIC Adjusted R 2 Predicted Residual Sum of Squares (PRESS) Mallow s C p

22 AIC and BIC Assume p potential predictors. Akaike s Information Criterion (AIC) and Bayesian Information Criterion (BIC): where SS E = (y i ŷ i ) 2. AIC = n ln(ss E /n) + 2p BIC = n ln(ss E /n) + p ln n Want to minimize AIC or BIC Larger models fit better; have smaller SS E but more parameters Try to balance fit and model size BIC penalizes larger models more heavily and tend to prefer smaller models in comparison to AIC.

23 Adjusted R 2 Recall coefficient of determination R 2 ; interpreted as percentage of variance of explained: R 2 (ŷi ȳ) 2 = (yi ȳ) 2 = 1 (ŷi y i ) 2 (yi ȳ) 2 = 1 SS E SS T Adding a variable to a model decreases SS E and thus increases R 2. Adjusted R 2, denoted as R 2 a : R 2 a = 1 SS E /(n p) SS T /(n 1) = 1 ( ) n 1 (1 R 2 ) n p

24 PRESS (Predicted residual sum of squares) Definition: PRESS = n i=1 ɛ 2 i, 1 where ɛ i, 1 are residuals calculated without case i in the fit. Thus a cross validation approach The model with lowest PRESS is chosen.

25 Mallow s C p Statistic: C p = (SS E ) p ˆσ 2 + 2p n where ˆσ 2 is an estimate of variance from the model with all predictors and (SS E ) p is the SS E from a model with p parameters. C p is easy to compute For the full model, C p = p exactly Desirable: C p p, p small

26 Model building: Exploring regression data The fundamental axiom of this data analysis is the declaration: The data analyst knows more than the computer. For example, most data analysts will know: (i) Water freezes at 0 C, so a discontinuity at that temperature might not be surprising. (ii) People tend to round their age to the nearest five-year boundary. (iii) The Arab oil embargo made 1973 an unusal year for many economic measures. (iv) The lab technician who handled Sample 3 is new on the job. H.V. Henderson, P.F. Velleman. Biometrika, 1981.

27 Backward elimination A critical α c is chosen, p to remove. 1. Start with all predictors in the model. 2. Remove the predictor with the highest p value greater than α c. 3. Refit the model, goto step Stop when all p values are less than α c.

28 Forward selection The backward method reversed. 1. Start with no variables in the model. 2. For all predictors not in the model, check the p value if they are added to the model. Choose the one with lowest p value less than α c. 3. Continue until no new predictors can be added.

29 Stepwise: Some drawbacks Due to the one-at-a-time nature of adding/dropping variables, it is possible to miss the optimal model. The p values used should not be treated too literally. There is much multiple testing occurring: validity is dubious. The procedures are not directly linked to the final objectives of prediction or exaplanation. Stepwise methods tend to pick models that are smaller than desirable for prediction purposes. (Simple example: simple linear regression; slope not significant)

30 Statistical strategy and model uncertainty Some tactics: Diagnostics Checking of assumptions: constant variance, linearity, normality, outliers, influential points, serial correlation and collinearity Transformation Transforming the response (Box Cox), transforming the predictors; tests and polynomial regression Variable selection Stepwise and criterion based methods What order? Faraway recommends: Diagnostics Transformation Variable selection Diagnostics

31 Avoid too much analysis Torture the data long enough, and sooner or later it will confess. Avoid complex data models for small datasets. Try to obtain new data to validate your proposed model. (Maybe set aside some of the existing data for this purpose) Use past experience with similar data to guide the choice of model.

32 Example: Population data from US states Data from the 50 states collected from the US Bureau of the Census, variables: y Life.Exp Life expectancy in years ( ) x 1 Population Population estimate (July 1, 1975) x 2 Income Per capita income (1974) x 3 Illiteracy Illiteracy (1970, percent of population) x 4 Murder Murder and non-negligent manslaughter rate per population (1976) x 5 HS.Grad Percent high-school graduates (1970) x 6 Frost Mean number of days with min temperature below 32 F (0 C), in capital or large city x 7 Area Land area in square miles. Example from Faraway (2002): Practical Regression and Anova using R

33

34 R output lm(formula = Life.Exp ~., data = statedata) Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 7.094e e < 2e-16 *** Population 5.180e e Income e e Illiteracy 3.382e e Murder e e e-08 *** HS.Grad 4.893e e * Frost e e Area e e Signif. codes: 0 *** ** 0.01 * Residual standard error: on 42 degrees of freedom Multiple R-squared: ,Adjusted R-squared: F-statistic: on 7 and 42 DF, p-value: 2.534e-10

35 The F test, formulated in terms of R 2 Consider a general model with RG 2 and a hypothesis model with RH 2. The quantity in the F test in the linear model can be formulated F = R2 G R2 H / 1 R2 G k l N k

36 Transformations Transformations may arise for several reasons in regression models. 1./ Some models are formulated in a multiplicative manner, e.g. µ(x) = αx β, Y = µ(1 + ɛ) or µ(x) = αe βx, Y = µ(1 + ɛ) 2./ Sometimes the response variable is transformed to cope with non-constant variance. A common technique is Box Cox transformation (Box and Cox, 1964).

37 Example, multiplicative model: A gravity model for migration Gravity model for migration between two cities (e.g. Abler et al. (1971)). Y ij : D ij : P i, P j : Gravity model: Number of migrants moving from city i to city j Distance between the cities Populations of the cities Y ij = α Pβ i P γ j D δ ij ɛ ij where α, β, γ, δ are unknown parameters to be estimated from data.

38 Identifiability If X T X is singular and cannot be inverted, there will be infinitely many solutions to the normal equations. Problem when X is not of full rank: columns are linearly dependent. Examples: Weight of a person in pounds and kg: both variables in the model. Number of years of studies at different levels as well as total number of years of studies included in the model. If p > n, more variables than cases: supersaturated model. If p = n, saturated model.

39 Identifiability There are insufficient data to estimate the parameters of interest or There are more parameters than are necessary to model the data.

10. Alternative case influence statistics

10. Alternative case influence statistics 10. Alternative case influence statistics a. Alternative to D i : dffits i (and others) b. Alternative to studres i : externally-studentized residual c. Suggestion: use whatever is convenient with the

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 Time allowed: 3 HOURS. STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002 This is an open book exam: all course notes and the text are allowed, and you are expected to use your own calculator.

More information

Lecture 2. Simple linear regression

Lecture 2. Simple linear regression Lecture 2. Simple linear regression Jesper Rydén Department of Mathematics, Uppsala University jesper@math.uu.se Regression and Analysis of Variance autumn 2014 Overview of lecture Introduction, short

More information

Topic 18: Model Selection and Diagnostics

Topic 18: Model Selection and Diagnostics Topic 18: Model Selection and Diagnostics Variable Selection We want to choose a best model that is a subset of the available explanatory variables Two separate problems 1. How many explanatory variables

More information

L7: Multicollinearity

L7: Multicollinearity L7: Multicollinearity Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Introduction ï Example Whats wrong with it? Assume we have this data Y

More information

14 Multiple Linear Regression

14 Multiple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

Multiple linear regression S6

Multiple linear regression S6 Basic medical statistics for clinical and experimental research Multiple linear regression S6 Katarzyna Jóźwiak k.jozwiak@nki.nl November 15, 2017 1/42 Introduction Two main motivations for doing multiple

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

Regression Model Building

Regression Model Building Regression Model Building Setting: Possibly a large set of predictor variables (including interactions). Goal: Fit a parsimonious model that explains variation in Y with a small set of predictors Automated

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

Lecture 5: Omitted Variables, Dummy Variables and Multicollinearity

Lecture 5: Omitted Variables, Dummy Variables and Multicollinearity Lecture 5: Omitted Variables, Dummy Variables and Multicollinearity R.G. Pierse 1 Omitted Variables Suppose that the true model is Y i β 1 + β X i + β 3 X 3i + u i, i 1,, n (1.1) where β 3 0 but that the

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Final Review. Yang Feng.   Yang Feng (Columbia University) Final Review 1 / 58 Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple

More information

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin

Regression Review. Statistics 149. Spring Copyright c 2006 by Mark E. Irwin Regression Review Statistics 149 Spring 2006 Copyright c 2006 by Mark E. Irwin Matrix Approach to Regression Linear Model: Y i = β 0 + β 1 X i1 +... + β p X ip + ɛ i ; ɛ i iid N(0, σ 2 ), i = 1,..., n

More information

Model Selection. Frank Wood. December 10, 2009

Model Selection. Frank Wood. December 10, 2009 Model Selection Frank Wood December 10, 2009 Standard Linear Regression Recipe Identify the explanatory variables Decide the functional forms in which the explanatory variables can enter the model Decide

More information

COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 26, 2005, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTION

COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 26, 2005, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTION COMPREHENSIVE WRITTEN EXAMINATION, PAPER III FRIDAY AUGUST 26, 2005, 9:00 A.M. 1:00 P.M. STATISTICS 174 QUESTION Answer all parts. Closed book, calculators allowed. It is important to show all working,

More information

ST430 Exam 1 with Answers

ST430 Exam 1 with Answers ST430 Exam 1 with Answers Date: October 5, 2015 Name: Guideline: You may use one-page (front and back of a standard A4 paper) of notes. No laptop or textook are permitted but you may use a calculator.

More information

Chapter 14 Student Lecture Notes 14-1

Chapter 14 Student Lecture Notes 14-1 Chapter 14 Student Lecture Notes 14-1 Business Statistics: A Decision-Making Approach 6 th Edition Chapter 14 Multiple Regression Analysis and Model Building Chap 14-1 Chapter Goals After completing this

More information

Collinearity: Impact and Possible Remedies

Collinearity: Impact and Possible Remedies Collinearity: Impact and Possible Remedies Deepayan Sarkar What is collinearity? Exact dependence between columns of X make coefficients non-estimable Collinearity refers to the situation where some columns

More information

Lecture 10. Factorial experiments (2-way ANOVA etc)

Lecture 10. Factorial experiments (2-way ANOVA etc) Lecture 10. Factorial experiments (2-way ANOVA etc) Jesper Rydén Matematiska institutionen, Uppsala universitet jesper@math.uu.se Regression and Analysis of Variance autumn 2014 A factorial experiment

More information

Prediction of Bike Rental using Model Reuse Strategy

Prediction of Bike Rental using Model Reuse Strategy Prediction of Bike Rental using Model Reuse Strategy Arun Bala Subramaniyan and Rong Pan School of Computing, Informatics, Decision Systems Engineering, Arizona State University, Tempe, USA. {bsarun, rong.pan}@asu.edu

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Lecture 18: Simple Linear Regression

Lecture 18: Simple Linear Regression Lecture 18: Simple Linear Regression BIOS 553 Department of Biostatistics University of Michigan Fall 2004 The Correlation Coefficient: r The correlation coefficient (r) is a number that measures the strength

More information

ST430 Exam 2 Solutions

ST430 Exam 2 Solutions ST430 Exam 2 Solutions Date: November 9, 2015 Name: Guideline: You may use one-page (front and back of a standard A4 paper) of notes. No laptop or textbook are permitted but you may use a calculator. Giving

More information

UNIVERSITY OF MASSACHUSETTS. Department of Mathematics and Statistics. Basic Exam - Applied Statistics. Tuesday, January 17, 2017

UNIVERSITY OF MASSACHUSETTS. Department of Mathematics and Statistics. Basic Exam - Applied Statistics. Tuesday, January 17, 2017 UNIVERSITY OF MASSACHUSETTS Department of Mathematics and Statistics Basic Exam - Applied Statistics Tuesday, January 17, 2017 Work all problems 60 points are needed to pass at the Masters Level and 75

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression ST 430/514 Recall: a regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates).

More information

Lecture 6: Linear Regression (continued)

Lecture 6: Linear Regression (continued) Lecture 6: Linear Regression (continued) Reading: Sections 3.1-3.3 STATS 202: Data mining and analysis October 6, 2017 1 / 23 Multiple linear regression Y = β 0 + β 1 X 1 + + β p X p + ε Y ε N (0, σ) i.i.d.

More information

Lecture 3: Multiple Regression

Lecture 3: Multiple Regression Lecture 3: Multiple Regression R.G. Pierse 1 The General Linear Model Suppose that we have k explanatory variables Y i = β 1 + β X i + β 3 X 3i + + β k X ki + u i, i = 1,, n (1.1) or Y i = β j X ji + u

More information

How the mean changes depends on the other variable. Plots can show what s happening...

How the mean changes depends on the other variable. Plots can show what s happening... Chapter 8 (continued) Section 8.2: Interaction models An interaction model includes one or several cross-product terms. Example: two predictors Y i = β 0 + β 1 x i1 + β 2 x i2 + β 12 x i1 x i2 + ɛ i. How

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

holding all other predictors constant

holding all other predictors constant Multiple Regression Numeric Response variable (y) p Numeric predictor variables (p < n) Model: Y = b 0 + b 1 x 1 + + b p x p + e Partial Regression Coefficients: b i effect (on the mean response) of increasing

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 3: More on linear regression (v3) Ramesh Johari ramesh.johari@stanford.edu 1 / 59 Recap: Linear regression 2 / 59 The linear regression model Given: n outcomes Y i, i = 1,...,

More information

Matematické Metody v Ekonometrii 7.

Matematické Metody v Ekonometrii 7. Matematické Metody v Ekonometrii 7. Multicollinearity Blanka Šedivá KMA zimní semestr 2016/2017 Blanka Šedivá (KMA) Matematické Metody v Ekonometrii 7. zimní semestr 2016/2017 1 / 15 One of the assumptions

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

Lecture 4: Regression Analysis

Lecture 4: Regression Analysis Lecture 4: Regression Analysis 1 Regression Regression is a multivariate analysis, i.e., we are interested in relationship between several variables. For corporate audience, it is sufficient to show correlation.

More information

LINEAR REGRESSION. Copyright 2013, SAS Institute Inc. All rights reserved.

LINEAR REGRESSION. Copyright 2013, SAS Institute Inc. All rights reserved. LINEAR REGRESSION LINEAR REGRESSION REGRESSION AND OTHER MODELS Type of Response Type of Predictors Categorical Continuous Continuous and Categorical Continuous Analysis of Variance (ANOVA) Ordinary Least

More information

Linear Models 1. Isfahan University of Technology Fall Semester, 2014

Linear Models 1. Isfahan University of Technology Fall Semester, 2014 Linear Models 1 Isfahan University of Technology Fall Semester, 2014 References: [1] G. A. F., Seber and A. J. Lee (2003). Linear Regression Analysis (2nd ed.). Hoboken, NJ: Wiley. [2] A. C. Rencher and

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

Multiple Regression Analysis. Part III. Multiple Regression Analysis

Multiple Regression Analysis. Part III. Multiple Regression Analysis Part III Multiple Regression Analysis As of Sep 26, 2017 1 Multiple Regression Analysis Estimation Matrix form Goodness-of-Fit R-square Adjusted R-square Expected values of the OLS estimators Irrelevant

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate

More information

Linear Regression Models

Linear Regression Models Linear Regression Models November 13, 2018 1 / 89 1 Basic framework Model specification and assumptions Parameter estimation: least squares method Coefficient of determination R 2 Properties of the least

More information

Psychology Seminar Psych 406 Dr. Jeffrey Leitzel

Psychology Seminar Psych 406 Dr. Jeffrey Leitzel Psychology Seminar Psych 406 Dr. Jeffrey Leitzel Structural Equation Modeling Topic 1: Correlation / Linear Regression Outline/Overview Correlations (r, pr, sr) Linear regression Multiple regression interpreting

More information

Multiple Linear Regression

Multiple Linear Regression Andrew Lonardelli December 20, 2013 Multiple Linear Regression 1 Table Of Contents Introduction: p.3 Multiple Linear Regression Model: p.3 Least Squares Estimation of the Parameters: p.4-5 The matrix approach

More information

Lecture 2. The Simple Linear Regression Model: Matrix Approach

Lecture 2. The Simple Linear Regression Model: Matrix Approach Lecture 2 The Simple Linear Regression Model: Matrix Approach Matrix algebra Matrix representation of simple linear regression model 1 Vectors and Matrices Where it is necessary to consider a distribution

More information

STAT 4385 Topic 06: Model Diagnostics

STAT 4385 Topic 06: Model Diagnostics STAT 4385 Topic 06: Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 1/ 40 Outline Several Types of Residuals Raw, Standardized, Studentized

More information

Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines)

Dr. Maddah ENMG 617 EM Statistics 11/28/12. Multiple Regression (3) (Chapter 15, Hines) Dr. Maddah ENMG 617 EM Statistics 11/28/12 Multiple Regression (3) (Chapter 15, Hines) Problems in multiple regression: Multicollinearity This arises when the independent variables x 1, x 2,, x k, are

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Lecture 4 Multiple linear regression

Lecture 4 Multiple linear regression Lecture 4 Multiple linear regression BIOST 515 January 15, 2004 Outline 1 Motivation for the multiple regression model Multiple regression in matrix notation Least squares estimation of model parameters

More information

SCHOOL OF MATHEMATICS AND STATISTICS Autumn Semester

SCHOOL OF MATHEMATICS AND STATISTICS Autumn Semester RESTRICTED OPEN BOOK EXAMINATION (Not to be removed from the examination hall) Data provided: "Statistics Tables" by H.R. Neave PAS 371 SCHOOL OF MATHEMATICS AND STATISTICS Autumn Semester 2008 9 Linear

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Linear Regression Measurement & Evaluation of HCC Systems

Linear Regression Measurement & Evaluation of HCC Systems Linear Regression Measurement & Evaluation of HCC Systems Linear Regression Today s goal: Evaluate the effect of multiple variables on an outcome variable (regression) Outline: - Basic theory - Simple

More information

Tutorial 6: Linear Regression

Tutorial 6: Linear Regression Tutorial 6: Linear Regression Rob Nicholls nicholls@mrc-lmb.cam.ac.uk MRC LMB Statistics Course 2014 Contents 1 Introduction to Simple Linear Regression................ 1 2 Parameter Estimation and Model

More information

Review: Second Half of Course Stat 704: Data Analysis I, Fall 2014

Review: Second Half of Course Stat 704: Data Analysis I, Fall 2014 Review: Second Half of Course Stat 704: Data Analysis I, Fall 2014 Tim Hanson, Ph.D. University of South Carolina T. Hanson (USC) Stat 704: Data Analysis I, Fall 2014 1 / 13 Chapter 8: Polynomials & Interactions

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression September 24, 2008 Reading HH 8, GIll 4 Simple Linear Regression p.1/20 Problem Data: Observe pairs (Y i,x i ),i = 1,...n Response or dependent variable Y Predictor or independent

More information

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects

Contents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects Contents 1 Review of Residuals 2 Detecting Outliers 3 Influential Observations 4 Multicollinearity and its Effects W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 32 Model Diagnostics:

More information

MATH 644: Regression Analysis Methods

MATH 644: Regression Analysis Methods MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Supervised Learning: Regression I Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Some of the

More information

Föreläsning /31

Föreläsning /31 1/31 Föreläsning 10 090420 Chapter 13 Econometric Modeling: Model Speci cation and Diagnostic testing 2/31 Types of speci cation errors Consider the following models: Y i = β 1 + β 2 X i + β 3 X 2 i +

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

Linear regression. Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear.

Linear regression. Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear. Linear regression Linear regression is a simple approach to supervised learning. It assumes that the dependence of Y on X 1,X 2,...X p is linear. 1/48 Linear regression Linear regression is a simple approach

More information

Correlation and Regression

Correlation and Regression Correlation and Regression Marc H. Mehlman marcmehlman@yahoo.com University of New Haven All models are wrong. Some models are useful. George Box the statistician knows that in nature there never was a

More information

Linear model selection and regularization

Linear model selection and regularization Linear model selection and regularization Problems with linear regression with least square 1. Prediction Accuracy: linear regression has low bias but suffer from high variance, especially when n p. It

More information

Statistical Methods for Data Mining

Statistical Methods for Data Mining Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Linear regression Linear regression is a simple approach to supervised learning. It assumes that the dependence

More information

Model Selection Procedures

Model Selection Procedures Model Selection Procedures Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Model Selection Procedures Consider a regression setting with K potential predictor variables and you wish to explore

More information

Chapter 3 Multiple Regression Complete Example

Chapter 3 Multiple Regression Complete Example Department of Quantitative Methods & Information Systems ECON 504 Chapter 3 Multiple Regression Complete Example Spring 2013 Dr. Mohammad Zainal Review Goals After completing this lecture, you should be

More information

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77 Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical

More information

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015

Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Introduction to Linear Regression Rebecca C. Steorts September 15, 2015 Today (Re-)Introduction to linear models and the model space What is linear regression Basic properties of linear regression Using

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

STAT 540: Data Analysis and Regression

STAT 540: Data Analysis and Regression STAT 540: Data Analysis and Regression Wen Zhou http://www.stat.colostate.edu/~riczw/ Email: riczw@stat.colostate.edu Department of Statistics Colorado State University Fall 205 W. Zhou (Colorado State

More information

Regression diagnostics

Regression diagnostics Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model

More information

Generating OLS Results Manually via R

Generating OLS Results Manually via R Generating OLS Results Manually via R Sujan Bandyopadhyay Statistical softwares and packages have made it extremely easy for people to run regression analyses. Packages like lm in R or the reg command

More information

1 Introduction 1. 2 The Multiple Regression Model 1

1 Introduction 1. 2 The Multiple Regression Model 1 Multiple Linear Regression Contents 1 Introduction 1 2 The Multiple Regression Model 1 3 Setting Up a Multiple Regression Model 2 3.1 Introduction.............................. 2 3.2 Significance Tests

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore What is Multiple Linear Regression Several independent variables may influence the change in response variable we are trying to study. When several independent variables are included in the equation, the

More information

Econ107 Applied Econometrics

Econ107 Applied Econometrics Econ107 Applied Econometrics Topics 2-4: discussed under the classical Assumptions 1-6 (or 1-7 when normality is needed for finite-sample inference) Question: what if some of the classical assumptions

More information

x3,..., Multiple Regression β q α, β 1, β 2, β 3,..., β q in the model can all be estimated by least square estimators

x3,..., Multiple Regression β q α, β 1, β 2, β 3,..., β q in the model can all be estimated by least square estimators Multiple Regression Relating a response (dependent, input) y to a set of explanatory (independent, output, predictor) variables x, x 2, x 3,, x q. A technique for modeling the relationship between variables.

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Available online at (Elixir International Journal) Statistics. Elixir Statistics 49 (2012)

Available online at   (Elixir International Journal) Statistics. Elixir Statistics 49 (2012) 10108 Available online at www.elixirpublishers.com (Elixir International Journal) Statistics Elixir Statistics 49 (2012) 10108-10112 The detention and correction of multicollinearity effects in a multiple

More information

Simple Regression Model Setup Estimation Inference Prediction. Model Diagnostic. Multiple Regression. Model Setup and Estimation.

Simple Regression Model Setup Estimation Inference Prediction. Model Diagnostic. Multiple Regression. Model Setup and Estimation. Statistical Computation Math 475 Jimin Ding Department of Mathematics Washington University in St. Louis www.math.wustl.edu/ jmding/math475/index.html October 10, 2013 Ridge Part IV October 10, 2013 1

More information

Biostatistics. Correlation and linear regression. Burkhardt Seifert & Alois Tschopp. Biostatistics Unit University of Zurich

Biostatistics. Correlation and linear regression. Burkhardt Seifert & Alois Tschopp. Biostatistics Unit University of Zurich Biostatistics Correlation and linear regression Burkhardt Seifert & Alois Tschopp Biostatistics Unit University of Zurich Master of Science in Medical Biology 1 Correlation and linear regression Analysis

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods.

Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods. TheThalesians Itiseasyforphilosopherstoberichiftheychoose Data Analysis and Machine Learning Lecture 12: Multicollinearity, Bias-Variance Trade-off, Cross-validation and Shrinkage Methods Ivan Zhdankin

More information

5. Linear Regression

5. Linear Regression 5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4

More information

Regression and the 2-Sample t

Regression and the 2-Sample t Regression and the 2-Sample t James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Regression and the 2-Sample t 1 / 44 Regression

More information

1 Multiple Regression

1 Multiple Regression 1 Multiple Regression In this section, we extend the linear model to the case of several quantitative explanatory variables. There are many issues involved in this problem and this section serves only

More information

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima Applied Statistics Lecturer: Serena Arima Hypothesis testing for the linear model Under the Gauss-Markov assumptions and the normality of the error terms, we saw that β N(β, σ 2 (X X ) 1 ) and hence s

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

x 21 x 22 x 23 f X 1 X 2 X 3 ε

x 21 x 22 x 23 f X 1 X 2 X 3 ε Chapter 2 Estimation 2.1 Example Let s start with an example. Suppose that Y is the fuel consumption of a particular model of car in m.p.g. Suppose that the predictors are 1. X 1 the weight of the car

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Introduction to Regression

Introduction to Regression Regression Introduction to Regression If two variables covary, we should be able to predict the value of one variable from another. Correlation only tells us how much two variables covary. In regression,

More information

Ch 3: Multiple Linear Regression

Ch 3: Multiple Linear Regression Ch 3: Multiple Linear Regression 1. Multiple Linear Regression Model Multiple regression model has more than one regressor. For example, we have one response variable and two regressor variables: 1. delivery

More information

MASM22/FMSN30: Linear and Logistic Regression, 7.5 hp FMSN40:... with Data Gathering, 9 hp

MASM22/FMSN30: Linear and Logistic Regression, 7.5 hp FMSN40:... with Data Gathering, 9 hp Selection criteria Example Methods MASM22/FMSN30: Linear and Logistic Regression, 7.5 hp FMSN40:... with Data Gathering, 9 hp Lecture 5, spring 2018 Model selection tools Mathematical Statistics / Centre

More information

No other aids are allowed. For example you are not allowed to have any other textbook or past exams.

No other aids are allowed. For example you are not allowed to have any other textbook or past exams. UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Sample Exam Note: This is one of our past exams, In fact the only past exam with R. Before that we were using SAS. In

More information

Introduction and Single Predictor Regression. Correlation

Introduction and Single Predictor Regression. Correlation Introduction and Single Predictor Regression Dr. J. Kyle Roberts Southern Methodist University Simmons School of Education and Human Development Department of Teaching and Learning Correlation A correlation

More information

Introduction and Background to Multilevel Analysis

Introduction and Background to Multilevel Analysis Introduction and Background to Multilevel Analysis Dr. J. Kyle Roberts Southern Methodist University Simmons School of Education and Human Development Department of Teaching and Learning Background and

More information

Lecture 11. Correlation and Regression

Lecture 11. Correlation and Regression Lecture 11 Correlation and Regression Overview of the Correlation and Regression Analysis The Correlation Analysis In statistics, dependence refers to any statistical relationship between two random variables

More information

Lecture 1: Linear Models and Applications

Lecture 1: Linear Models and Applications Lecture 1: Linear Models and Applications Claudia Czado TU München c (Claudia Czado, TU Munich) ZFS/IMS Göttingen 2004 0 Overview Introduction to linear models Exploratory data analysis (EDA) Estimation

More information