Univariate analysis. Simple and Multiple Regression.


Univariate analysis

Example - linear regression equation: y = ax + c

Least squares criterion: choose a and c so that

    Σ(y_obs - y_calc)² = Σ(y_obs - (ax + c))² = minimum

Setting the derivatives with respect to a and c to zero gives the normal equations

    a Σx² + c Σx = Σxy
    a Σx + nc = Σy

which are solved for a and c.

How best to summarise the data?

Correlation coefficient:

    r² = 1 - SSE/SST

where SSE is the sum-squared error and SST is the total sum of squares:

    SSE = S_yy - S_xy²/S_xx,   SST = S_yy
    S_xx = Σ(x - x̄)² = n σ_x²
    S_yy = Σ(y - ȳ)² = n σ_y²
    S_xy = Σ(x - x̄)(y - ȳ) = n σ_xy

Establish the equation for the best-fit line, y = ax + c, where:

    c = y intercept (constant)
    a = slope of best-fit line
    y = dependent variable
    x = independent variable

Terminology

The best-fit line is the same as the regression line; a is the regression coefficient for x; x is the predictor or regressor variable for y.

How accurate is the description?

[Figures: Symptom Index vs Drug A (dose in mg), a very good fit, and Symptom Index vs Drug B (dose in mg), a moderate fit]
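A minimal numpy sketch of these formulas (the dose-response numbers are made up for illustration):

    import numpy as np

    # Hypothetical dose-response data
    x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # dose in mg
    y = np.array([12.0, 18.0, 29.0, 34.0, 45.0])  # symptom index

    Sxx = np.sum((x - x.mean()) ** 2)
    Syy = np.sum((y - y.mean()) ** 2)
    Sxy = np.sum((x - x.mean()) * (y - y.mean()))

    a = Sxy / Sxx                # slope: solves the normal equations
    c = y.mean() - a * x.mean()  # intercept

    SSE = Syy - Sxy ** 2 / Sxx   # sum-squared error around the fitted line
    SST = Syy                    # total sum of squares
    print(f"y = {a:.3f}x + {c:.3f}, r^2 = {1 - SSE / SST:.3f}")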

Significance test

Simple regression uses a t-test to establish whether or not the model describes a significant proportion of the variance in the data. Multiple regression uses an Analysis of Variance to discover if the proportion of variance in the data explained by the model is significant. These tests are reported in the software output.

Explaining the distribution of a spatial phenomenon requires the analysis of relationships between the phenomenon and potential explanatory variables. The two most useful methods in GIS spatial analysis are multiple regression analysis and logistic regression analysis. Regression analysis is used to examine the relationship between the study phenomenon and multiple explanatory variables:

    Y = β0 + β1·X1 + β2·X2 + ... + βn·Xn + ε

where Y denotes the dependent variable and the Xs are explanatory variables; β0 represents the intercept, βn denotes the estimated parameter of the variable Xn, and ε is the randomly distributed error term. Regression is the single most widely used statistical technique in the social sciences.

Multiple regression can be implemented in matrix form (MathCAD example). Multiple regression is simply an extension of the simple linear model; the difference is that there are more independent variables:

    Yi = β0 + β1·X1i + β2·X2i + ... + βp·Xpi + εi

The assumptions for multiple regression are similar to those for simple linear regression: the average value of the dependent variable Y is a linear combination of the independent variables; the only random component is the error term ε, and the independent variables are assumed to be fixed and independent of ε; errors between observations are uncorrelated, and normally distributed with a mean of zero and a constant variance σ².

While we were able to solve for a single explanatory variable using a spreadsheet, as we add more explanatory variables it becomes harder to solve, especially when the number of independent variables is greater than 3. However, through the use of matrices, the task is far less difficult. We can structure the regression equations into a matrix as

    y = Xβ + ε

where X is the observation matrix of independent variables and β is the vector of unknown parameters. So, if we had 4 observations and three explanatory variables, our matrices would look like the following:

        | 1  x1,1  x2,1  x3,1 |        | β0 |
    X = | 1  x1,2  x2,2  x3,2 |    β = | β1 |
        | 1  x1,3  x2,3  x3,3 |        | β2 |
        | 1  x1,4  x2,4  x3,4 |        | β3 |

The reason we have a 1 in the first column is that we have to include the intercept parameter; therefore we really have 4 unknowns to solve for (the coefficient for each explanatory variable, plus the intercept). Using basic least squares with matrix algebra is fairly simple once you have a computer program to do the work: we simply solve

    β = (XᵀX)⁻¹ Xᵀy
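To make the matrix set-up concrete, a minimal numpy sketch with four observations and three explanatory variables (the numbers are made up; in practice np.linalg.lstsq or a regression library is preferable to forming the inverse explicitly):

    import numpy as np

    # Design matrix: the leading column of ones carries the intercept
    X = np.array([
        [1.0, 2.0, 5.0, 1.0],
        [1.0, 4.0, 2.0, 3.0],
        [1.0, 6.0, 8.0, 2.0],
        [1.0, 8.0, 3.0, 5.0],
    ])
    y = np.array([10.0, 14.0, 22.0, 25.0])

    # Least-squares solution: beta = (X'X)^(-1) X'y
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta)  # [b0, b1, b2, b3]

    # With only four observations and four unknowns the fit is exact;
    # more observations than unknowns give a genuine least-squares fit.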

We will not be performing the math, but it will be useful to create the matrices, just to see how it all gets formed. In our example, suppose a gas utility company is trying to estimate revenue. They may have determined that heating cost is a function of the temperature, the amount of insulation in an attic, and the age of a furnace. They decided to look at 20 customer sites and quantify the data as shown:

    Home   Heating Cost ($)   Temperature (°F)   Attic Insulation (in)   Age of Furnace (yr)
     1          250                 35                    3                      6
     2          360                 29                    4                     10
     3          165                 36                    7                      3
     4           43                 60                    6                      9
     5           92                 65                    5                      6
     6          200                 30                    5                      5
     7          355                 10                    6                      7
     8          290                  7                   10                     10
     9          230                 21                    9                     11
    10          120                 55                    2                      5
    11           73                 54                   12                      4
    12          205                 48                    5                      1
    13          400                 20                    5                     15
    14          320                 39                    4                      7
    15           72                 60                    8                      6
    16          272                 20                    5                      8
    17           94                 58                    7                      3
    18          190                 40                    8                     11
    19          235                 27                    9                      8
    20          139                 30                    7                      5

We now need to store this information in a matrix (we are only going to do the first 4 rows, just to keep it simple). You can see how, for the first four rows, we have defined the monthly heating cost, the intercept, the temperature, the insulation, and the age of the furnace:

    | 250 |   | 1  35  3   6 |   | β0 |
    | 360 | = | 1  29  4  10 | · | β1 | + ε
    | 165 |   | 1  36  7   3 |   | β2 |
    |  43 |   | 1  60  6   9 |   | β3 |

Now, using least-squares principles with matrix algebra, we can come up with our unknown coefficients.

Microsoft Excel has an excellent regression tool for relatively small problems. You will find it under Tools -> Data Analysis. Once you select the tool, an interactive dialog steps you through the regression wizard; this is where you enter the range for the Y values (a single column) and the X values (multiple columns). You should type the numbers into Excel and attempt to perform the regression yourself; check your answers against ours.

What this tells us is that our R Square value is quite high (0.80), representing a good fit, and we have a standard error of 51 (in dollars) for our observations.

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R           0.8968
    R Square             0.8042
    Adjusted R Square    0.7675
    Standard Error       51.05
    Observations         20
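The same fit can be reproduced outside Excel. A minimal numpy sketch using the table above (a library such as statsmodels would also report the standard errors and P-values directly):

    import numpy as np

    # Columns: heating cost ($), temperature (°F), insulation (in), furnace age (yr)
    data = np.array([
        [250, 35, 3, 6], [360, 29, 4, 10], [165, 36, 7, 3], [43, 60, 6, 9],
        [92, 65, 5, 6], [200, 30, 5, 5], [355, 10, 6, 7], [290, 7, 10, 10],
        [230, 21, 9, 11], [120, 55, 2, 5], [73, 54, 12, 4], [205, 48, 5, 1],
        [400, 20, 5, 15], [320, 39, 4, 7], [72, 60, 8, 6], [272, 20, 5, 8],
        [94, 58, 7, 3], [190, 40, 8, 11], [235, 27, 9, 8], [139, 30, 7, 5],
    ], dtype=float)
    y = data[:, 0]
    X = np.column_stack([np.ones(len(data)), data[:, 1:]])  # intercept column first

    beta, rss, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    r_squared = 1 - rss[0] / np.sum((y - y.mean()) ** 2)
    print(beta)       # approx [427.2, -4.58, -14.83, 6.10]
    print(r_squared)  # approx 0.80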

The next chart tells us our coefficient values (intercept, temperature, insulation, age of furnace). It also tells us each P-value, a measure of significance. All the P-values except that for age of furnace are very low, meaning those coefficients are significant at the 95% level.

                      Coefficients   Standard Error   t Stat   P-value
    Intercept            427.194         59.60          7.17   2.1E-06
    Temperature (°F)      -4.583          0.772        -5.93   1.0E-05
    Attic Insulation     -14.831          4.754        -3.12   0.0066
    Age of Furnace         6.101          4.016         1.52   0.1479

So, what we now have is a formula:

    Cost to Heat Home = 427.19 - 4.58(temperature) - 14.83(attic insulation) + 6.10(age of the furnace)

What it means:

The intercept is 427.19; this is the cost of heating when all the independent variables are equal to zero. The regression coefficients for the mean temperature and the amount of attic insulation are both negative. This is logical: as the outside temperature increases, the cost of heating the house goes down. For each degree the mean temperature increases, we expect the heating cost to decrease by $4.583 per month.

The P-values are significant at α = 0.05 (P < 0.05) for all the coefficients except that of the variable age of furnace (β3), so we can conclude that those coefficients are significantly different from zero. However, if we examine the P-value for age of furnace, we see that it is not significant at α = 0.05; we cannot conclude that it is significantly different from zero, and we could drop this variable from the model.

Now let's use the model. If a person with no attic insulation decided to add some, what would they save at a given average temperature? Rather than rerunning things, we'll go with the first conclusions:

    Cost to Heat Home = 427.19 - 4.58(temperature) - 14.83(attic insulation) + 6.10(age of the furnace)

Evaluating the formula twice, once with no insulation and once with the added insulation, while holding temperature and furnace age fixed, gives two predicted monthly costs; the difference is the predicted saving, and since the other terms cancel it comes to $14.83 per month for each inch of insulation added. A utility company could then use this information to determine how much revenue it would generate if it provided service to a neighborhood.

R² - Goodness of fit

    Model Summary
    R       R Square   Adjusted R Square   Std. Error of the Estimate
    .7(a)      .5            .399                  7.734
    a. Predictors: (Constant), AGE, GENDER, INCOME

For multiple regression, R² will get larger every time another independent variable (regressor or predictor) is added to the model, yet a new regressor may provide only a tiny improvement in the amount of variance in the data explained by the model. We therefore need to establish the value of each additional regressor in predicting the DV.

R²adj, the adjusted R-square, takes into account the number of regressors in the model. It is calculated as

    R²adj = 1 - (1 - R²)(N - 1)/(N - n - 1)

where N = number of data points and n = number of regressors.

How well does a model explain the variation in the dependent variable? Effectiveness vs efficiency:

Effectiveness: maximises R², ie the proportion of variance explained by the model.

Efficiency: maximises the increase in R²adj upon adding another regressor; if a new regressor doesn't add much to the variance explained, it is not worth adding.

Note that R²adj will always be smaller than R².
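The adjusted R² formula is easy to apply directly; a small helper (the function name is ours), checked against the Excel output above:

    def adjusted_r_squared(r2: float, n_points: int, n_regressors: int) -> float:
        """R2_adj = 1 - (1 - R2)(N - 1) / (N - n - 1)."""
        return 1 - (1 - r2) * (n_points - 1) / (n_points - n_regressors - 1)

    # Heating example: R2 = 0.804 with N = 20 observations and n = 3 regressors
    print(adjusted_r_squared(0.804, 20, 3))  # ~0.767, matching the Excel output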

How well does a model explain the variation in the dependent variable?

Effectiveness:
    0-25%    very poor, and likely to be unacceptable
    25-50%   poor, but may be acceptable
    50-75%   good
    75-90%   very good
    90%+     likely that there is something wrong

Are the regressors, taken together, significantly associated with the dependent variable?

    ANOVA(b)
                  Sum of Squares   df   Mean Square    F      Sig.
    Regression        1065.388      3      355.129    4.35    .008(a)
    Residual          3720.05      22
    Total             4785.438     25
    a. Predictors: (Constant), AGE, GENDER, INCOME
    b. Dependent Variable: DEPRESS

The Analysis of Variance test checks whether the model, as a whole, has a significant relationship with the DV. Part of the predictive value of each regressor may be shared by one or more of the other regressors in the model, so the model must be considered as a whole. Read the result off the ANOVA table in the output, and report it as you did in the ANOVA course.

What relationship does each individual regressor have with the dependent variable?

    Coefficients(a)
                  Unstandardized Coefficients   Standardized Coefficients
                       B          Std. Error           Beta                 t       Sig.
    (Constant)       68.85          15.444                                 4.46     .000
    INCOME          -9.34E-02         .029             -.68               -3.18     .008
    GENDER            3.36            8.94              .075                .376    .708
    AGE              -.16             .344             -.10                -.47     .646
    a. Dependent Variable: DEPRESS

In the SPSS output table entitled Coefficients, the column headed Unstandardized Coefficients - B gives the regression coefficient for each regressor variable. The units of each coefficient are the same as those of its variable.

For example, with dependent variable score on a video game and regressor time of day, a B coefficient for time of day of 844.57 gives score = 844.57 × time + constant. This means that for every increase of one hour in the variable time of day, we would predict a person's score to increase by 844.57 points.

Which regressor has the most effect on the dependent variable? The units for each regression coefficient are different, so we must standardise them if we want to compare one with another. The column headed Standardized Coefficients - Beta does this: we can compare the Beta weights for each regressor variable to compare the effect of each on the dependent variable.
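To see how the standardised Beta weights relate to the unstandardised B coefficients, a short sketch (hypothetical data; Beta is just B rescaled by the ratio of standard deviations, Beta = B · s_x / s_y):

    import numpy as np

    # Hypothetical data: video-game score vs time of day (hours)
    time_of_day = np.array([9.0, 11.0, 14.0, 18.0, 21.0])
    score = np.array([4000.0, 6500.0, 8000.0, 12000.0, 14500.0])

    B = np.polyfit(time_of_day, score, 1)[0]  # unstandardised slope (points per hour)

    # Standardised Beta: the slope you would get after z-scoring both variables
    beta = B * time_of_day.std(ddof=1) / score.std(ddof=1)
    print(f"B = {B:.2f} points per hour, Beta = {beta:.3f}")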