ST 370 Correlation and Regression


Chapter 5 ST 370 - Correlation and Regression

Readings: Chapter 11.1-11.4, 11.7.2-11.8, Chapter 12.1-12.2

Recap: So far we've learned:
- Why we want a random sample and how to achieve it (sampling scheme)
- How to use randomization, replication, and control/blocking to create a valid experiment
- Methods for summarizing and graphing data in meaningful ways
- How to analyze CRD-type designs to investigate which factors are important for a particular response

Next we'll look at how to deal with quantitative explanatory variables and a quantitative response.

Multi-Way ANOVA vs Linear Regression

Multi-Way ANOVA
- Used with categorical explanatory variable(s) and a quantitative response
- Determines which factors are significantly associated with the response

Correlation and Linear Regression
- Used with quantitative explanatory variable(s) and a quantitative response
- Determines if there is a significant linear relationship between the variables
- Not necessarily cause and effect!

Scatterplots and Correlation

Goal: Determine the relationship between two quantitative variables!

A scatterplot displays the relationship between two quantitative variables; each dot corresponds to one observation.

Example: Using number of cups of coffee to predict quiz score (observational study).

Cups of Coffee   Quiz Score
      10             16
       4              7
       2              5
       3              7
       8             11
       1              5
       2              8
       7              9
       5             10
       6             11
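
A minimal SAS sketch for entering these data and drawing the scatterplot; the dataset name coffee and the variable names cups and score are choices made here, not names from the original notes:

data coffee;
   input cups score;   * one (x, y) pair per line;
   datalines;
10 16
4 7
2 5
3 7
8 11
1 5
2 8
7 9
5 10
6 11
;
run;

proc sgplot data=coffee;
   scatter x=cups y=score;   * quiz score against cups of coffee;
run;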

Examining a Scatterplot: We are looking for 3 things:
- Form
- Direction
- Strength

Correlation

We will focus on the correlation: a statistic that measures the strength and direction of the linear association between two quantitative variables. There is no need to distinguish between the explanatory variable and the response for correlation!

Properties of Correlation:
- Unitless quantity
- ρ and R (or r) are always between -1 and 1
- Sign indicates the direction of the association
- Distance from 0 (magnitude) indicates the strength of the association

Interpreting Correlation:
- ρ (or R/r) close to 1 = strong positive linear association
- ρ (or R/r) close to 0 = little or no linear association
- ρ (or R/r) close to -1 = strong negative linear association
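
For reference, the sample correlation is computed with the standard formula below (the shorthand S_xy, S_xx, S_yy is introduced here for compactness and is not from the original notes):

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}$$

Standardizing by the spreads of x and y is what makes r unitless and keeps it between -1 and 1.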

Try the applet at http://www.rossmanchance.com/applets/guesscorrelation.html

- What would a correlation of 1 or -1 look like on a scatterplot?
- Correlation is not resistant to outliers. Why?

Example: For the cups of coffee and quiz score example, we can find the sample correlation r (which estimates the true underlying population correlation ρ) and interpret it.

Cups of Coffee   Quiz Score
      10             16
       4              7
       2              5
       3              7
       8             11
       1              5
       2              8
       7              9
       5             10
       6             11
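
A minimal SAS sketch for computing r, reusing the hypothetical coffee dataset entered above:

proc corr data=coffee;
   var cups score;   * prints the correlation for each pair of variables listed;
run;

For these data r works out to about 0.91, a strong positive linear association; as a consistency check, this squares to the r² = 0.8298 reported later in these notes.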

Linear Regression

What if I want to use an explanatory variable (x) to predict a response (y)?

Simple Linear Regression (SLR) finds the line of best fit between one quantitative explanatory variable x and the response y, measured on the same subjects.

Data come in the form of pairs of observations (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) - the same setup as for correlation.

The model is often of the form

$$Y_i = \beta_0 + \beta_1 x_i + E_i$$

Note: $\beta_0$ and $\beta_1$ are true unknown parameters. When we have data we fit this model and get estimates denoted by $\hat{\beta}_0 = b_0$ and $\hat{\beta}_1 = b_1$.

How do we get the estimates? By the method of least squares: choose the line that minimizes the sum of squared vertical distances between the observed points and the line.

Names for the line with the estimates plugged in: fitted line, regression line, prediction line, prediction equation, or least squares line:

$$\hat{y} = b_0 + b_1 x$$
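
Written out (a standard worked summary, using the S_xy, S_xx shorthand from the correlation formula above): minimizing $SS(E) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$ over $b_0$ and $b_1$ gives

$$b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{y} - b_1\bar{x}$$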

Interpretation of the line: the fitted line gives an estimate of the mean response at each value of x.

Major use of the regression line: predicting the response (called ŷ) for a given value of x. Just plug the new x into the equation!

Warning! Extrapolation (predicting at x values outside the range of the observed data) is unreliable!

Interpretation of parameter estimates:

$\hat{\beta}_0 = b_0$ = sample intercept: the predicted response when x = 0 (meaningful only when x = 0 makes sense for the data)

$\hat{\beta}_1 = b_1$ = sample slope: the predicted change in the response for a one-unit increase in x

Example: Coffee and Quiz Score

1. Find the fitted least squares regression line.
2. Interpret the slope.
3. Interpret the intercept. Is it meaningful in this example?
4. Jenny happens to stop in the coffee shop. She notices your plot and says, "I had 5 cups of coffee the night before the quiz. Guess what my score was!"
5. Jenny actually scored a 14. Find her residual.
6. If Enrique drank 12 cups of coffee, can we safely predict his quiz score?
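
One way to answer question 1 is to let SAS do the least squares fit; this sketch again assumes the hypothetical coffee dataset from earlier:

proc reg data=coffee;
   model score = cups;   * fit score as a linear function of cups;
run;

As a check on hand computation, the formulas above give $b_1 = S_{xy}/S_{xx} = 79.8/77.6 \approx 1.03$ and $b_0 = \bar{y} - b_1\bar{x} \approx 3.96$, so the fitted line is roughly ŷ = 3.96 + 1.03x.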

Note: Always graph your data to check the linearity assumption. The four data sets plotted below give the same fitted line, but only the first one makes sense!

Note: Causation vs Correlation. An association between variables does not mean that one variable causes the changes in another. Only well-designed experiments can establish that an explanatory variable causes a change in a response variable.

How do we know we have a Significant Linear Relationship?

IDEA: Given data, we find $\hat{y} = b_0 + b_1 x$, an estimate of the (true) mean response. If we did another experiment, would we get the same data? No: the estimates vary from sample to sample, so we need a statistical test to see if there is a significant linear relationship between the explanatory and response variables.

What hypothesis do we want to test?

$$H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0$$

How do we investigate this hypothesis? ANOVA! ANOVA = ANalysis Of VAriance. What variation are we measuring here?

Which line above gives more evidence of a significant linear relationship? Why?

A measure of the total amount of variability in the response is the sample variance of Y. Consider its numerator, which we call the total sum of squares:

$$SS(Tot) = \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad \text{(variability of the observations about the mean)}, \qquad df_{Tot} = n - 1$$

This variability is partitioned into independent components:

Sum of squares due to regression, SS(R) (or sum of squares model, SS(M)):

$$SS(R) = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \quad \text{(variability of the fitted line about the mean)}, \qquad df_R = 1$$

Sum of squares due to error, SS(E):

$$SS(E) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{(variability of the observations about the fitted line)}, \qquad df_E = n - 2$$

Note: SS(Tot) = SS(R) + SS(E) and df_Tot = df_R + df_E. (Recall: df = number of independent pieces of data going into the sum of squares.)

The ANOVA table for simple linear regression:

Source       df      Sum of Squares   Mean Square   F-Ratio
Regression   1       SS(R)            MS(R)         MS(R)/MS(E)
Error        n - 2   SS(E)            MS(E)
Total        n - 1   SS(Tot)

The meaning of the pieces is very similar to Multi-Way ANOVA!

Mean squares represent standardized measures of variation due to the different sources and are given by MS(source) = SS(source)/df_source.

F-Ratio: ratios of mean squares often follow an F-distribution and are appropriate for testing different hypotheses of interest. In this case, to test

$$H_0: \beta_1 = 0 \quad \text{vs} \quad H_1: \beta_1 \neq 0$$

use the test statistic

$$F = MS(R)/MS(E)$$

This is used to find a p-value, which can then be used to determine if we have a statistically significant result.
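
As a worked check, computing the decomposition by hand from the coffee data given earlier:

$$SS(Tot) = 98.9, \qquad SS(R) = 82.06, \qquad SS(E) = 98.9 - 82.06 = 16.84$$

with degrees of freedom 9, 1, and 8 respectively, so $MS(R) = 82.06$, $MS(E) = 16.84/8 \approx 2.10$, and $F = 82.06/2.10 \approx 39$. An F-ratio this large on (1, 8) degrees of freedom has a very small p-value, so the linear relationship is significant. (Consistency check: $SS(R)/SS(Tot) = 82.06/98.9 \approx 0.8298$, matching the r² quoted below.)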

Conducting the analysis in StatCrunch

(StatCrunch screenshots of the regression dialog and output appeared here in the original notes.)


r², the coefficient of determination

$$r^2 = SS(R)/SS(Tot)$$

Interpretation: r² is the proportion of the variability in the response explained by the linear relationship with the explanatory variable.

1. If r² is high, then the line is good for prediction.
2. In the coffee example, we obtained r² = 0.8298. Interpretation: about 83% of the variability in quiz scores is explained by the linear relationship with cups of coffee.

Example: One type of fuel is biodiesel, which comes from plants. An experiment was done to determine how much biodiesel could be generated from a certain type of plant grown in different media. The final biomass was also recorded on the 44 plants from the experiment. Let's consider these two variables: the log of biodiesel and biomass.

proc reg data=bioexp;
   model logbiodiesel = biomass / clb;   * clb requests confidence limits for the parameter estimates;
run;

1. Describe the scatterplot (3 things!).
2. Give the value of the sample correlation.
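
For question 1, you could draw the scatterplot with something like the following sketch, assuming the same bioexp dataset and variable names used in the proc reg call above:

proc sgplot data=bioexp;
   scatter x=biomass y=logbiodiesel;   * explanatory variable on the horizontal axis;
run;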

Output from Proc Reg for the Biomass and Log(Biodiesel) Example

The REG Procedure
Model: MODEL1
Dependent Variable: logbiodiesel

Number of Observations Read   44
Number of Observations Used   44

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1         26.76509      26.76509     21.14   <.0001
Error             42         53.18557       1.26632
Corrected Total   43         79.95066

Root MSE          1.12531   R-Square   0.3348
Dependent Mean    2.46474   Adj R-Sq   0.3189
Coeff Var        45.65627

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1              0.27977          0.50463      0.55     0.5822   -0.73862   1.29816
biomass      1              0.19769          0.04300      4.60     <.0001    0.11091   0.28447

3. State and interpret the value of the MSE.
4. Determine if there is a significant linear relationship between these variables. State your reasoning and your hypotheses.
5. Give the fitted line and use it to predict for a biomass of 10. If you have an observed value of 4 for a biomass of 10, what is the residual?
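
As a check for question 5, plugging into the fitted line from the output above: $\hat{y} = 0.27977 + 0.19769(10) \approx 2.26$, and the residual for an observed value of 4 is $4 - 2.26 \approx 1.74$.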

Consider the regression of Y = brain size (MRI counts) on x = height (in). Use the output below to answer the following:

1. Interpret the scatterplot and state the correlation.
2. What percentage of the variability in the MRI counts is explained by height?
3. Determine the fitted regression line.
4. Is there a significant linear relationship between the variables? Why or why not?
5. Does the intercept have a meaningful interpretation?
6. Predict brain size for an individual who is 60 inches tall.

(The SAS output for the brain size regression appeared here in the original notes.)

Multiple Linear Regression (Optional Reading)

These ideas extend easily to the case of more than one quantitative explanatory variable. This is called Multiple Linear Regression.

Model: $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + E_i$

- The $\beta_j$ are called partial slopes.
- The x's can include quadratic terms, cubics, or interaction terms: $x_1^2$, $x_1^3$, $x_1 x_2$.
- Fitted model: $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_p x_p$
- The model is often still fit using least squares, i.e. by minimizing $SS(E) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.

Example (taken from Probability and Statistics, Devore): Soil and sediment adsorption, the extent to which chemicals collect in a condensed form on the surface, is an important characteristic influencing the effectiveness of pesticides and various agricultural chemicals. A study was done on 13 soil samples that measured Y = phosphate adsorption index, X₁ = amount of extractable aluminum, and X₂ = amount of extractable iron. The data are given below:

Adsorption (y)   Aluminum (x₁)   Iron (x₂)
       4               13            61
      18               21           175
      14               24           111
      18               23           124
      26               64           130
      26               38           173
      21               33           169
      30               61           169
      28               39           160
      36               71           244
      65              112           257
      62               88           333
      40               54           199
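
The output below can be produced with a proc reg call along these lines; the dataset name soil is an assumption, while the variable names adsorp, aluminum, and iron match the output shown:

proc reg data=soil;
   model adsorp = aluminum iron / clb;   * two explanatory variables with confidence limits;
run;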

Output from Proc Reg for the Adsorption Example

The REG Procedure
Model: MODEL1
Dependent Variable: adsorp

Number of Observations Read                 14
Number of Observations Used                 13
Number of Observations with Missing Values   1

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2       3529.90308    1764.95154     92.03   <.0001
Error             10        191.78922      19.17892
Corrected Total   12       3721.69231

Root MSE          4.37937   R-Square   0.9485
Dependent Mean   29.84615   Adj R-Sq   0.9382
Coeff Var        14.67316

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   95% Confidence Limits
Intercept    1             -7.35066          3.48467     -2.11     0.0611   -15.11498   0.41366
aluminum     1              0.34900          0.07131      4.89     0.0006     0.19012   0.50788
iron         1              0.11273          0.02969      3.80     0.0035     0.04658   0.17889
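
As a brief illustration of how the fitted multiple regression is used (the input values aluminum = 50 and iron = 150 are made up for this example):

$$\hat{y} = -7.35066 + 0.34900(50) + 0.11273(150) \approx 27.0$$

Each partial slope is interpreted holding the other variable fixed: for example, $b_1 = 0.349$ is the predicted change in adsorption for a one-unit increase in aluminum with iron held constant.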