Elementary Statistics Lecture 3 Association: Contingency, Correlation and Regression

Elementary Statistics Lecture 3 Association: Contingency, Correlation and Regression Chong Ma Department of Statistics University of South Carolina chongm@email.sc.edu Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 1 / 48

Outline 1 Association 2 Association For Categorical Variables Testing Categorical Variables for Independence Determining the Strength of the Association 3 Association For Quantitative Variables Linear Association: Direction and Strength Statistical Model Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 2 / 48

Association Association Between Two Variables An association exists between two variables if particular values of one variable are more likely to occur with certain values of the other variable. Response Variable and Explanatory Variable The response variable is the outcome variable on which comparisons are made for different values of the explanatory variable. Examples: Survival status is the response variable and smoking status is the explanatory variable. CO2 is the response variable and a country's amount of gasoline use for automobiles is the explanatory variable. GPA is the response variable and the number of hours per week spent studying is the explanatory variable. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 3 / 48

Outline 1 Association 2 Association For Categorical Variables Testing Categorical Variables for Independence Determining the Strength of the Association 3 Association For Quantitative Variables Linear Association: Direction and Strength Statistical Model Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 4 / 48

Association Between Two Categorical Variables
                Pesticide Status
Food Type       Present    Not Present    Total
Organic              29           98         127
Conventional     19,485        7,086      26,571
Total            19,514        7,184      26,698
Table 1: Frequencies for food type and pesticide status.
                Pesticide Status
Food Type       Present    Not Present    Total
Organic           0.228        0.772         1
Conventional      0.733        0.267         1
Total             0.731        0.269         1
Table 2: Conditional proportions of pesticide status for the two food types.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 5 / 48

Proportions
Conditional Proportions:
P̂(present | Organic) = 29/127 = 0.228
P̂(present | Conventional) = 19,485/26,571 = 0.733
Marginal Proportion:
P̂(present) = 19,514/26,698 = 0.731
Figure 1: Stacked bar graph. It stacks the proportions of pesticides present and not present on top of each other.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 6 / 48
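These proportions can be reproduced in R. A minimal sketch, assuming the counts from Table 1; the object and dimension names are illustrative, not from the slides:
pesticide <- matrix(c(29, 98, 19485, 7086), nrow = 2, byrow = TRUE,
                    dimnames = list(Food = c("Organic", "Conventional"),
                                    Pesticide = c("Present", "Not Present")))
prop.table(pesticide, margin = 1)    # conditional proportions within each food type (rows of Table 2)
colSums(pesticide) / sum(pesticide)  # marginal proportions of pesticide status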

Independence and Dependence (Association)
Naive approach: an association exists if the proportion with pesticide present differs greatly between the two food types, and vice versa. How big is big?
Chi-Squared Test:
χ²_df = Σ (observed count − expected count)² / expected count
where
expected cell count = (row total × column total) / total sample size
and
df = (r − 1) × (c − 1)
Rule of Thumb: The larger the χ² (and the smaller the p-value), the more statistical evidence there is for an association between the categorical variables.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 7 / 48
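As a worked illustration with Table 1: the expected count for the Organic/Present cell is 127 × 19,514 / 26,698 ≈ 92.8, so that cell contributes (29 − 92.8)²/92.8 ≈ 43.9 to the statistic; summing the four cell contributions (43.89 + 119.21 + 0.21 + 0.57, shown later in Table 3) gives χ² ≈ 163.9 on (2 − 1)(2 − 1) = 1 degree of freedom.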

Chi-Square (χ²) Distribution Figure 2: χ² distribution with degrees of freedom 1. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 8 / 48

Chi-Square (χ²) Distribution Figure 3: χ² distributions with degrees of freedom 1 and 2, respectively. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 9 / 48

Chi-Square (χ²) Distribution Figure 4: χ² distributions with degrees of freedom 1, 2 and 3, respectively. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 10 / 48

Chi-Square (χ²) Distribution Figure 5: χ² distributions with degrees of freedom 1, 2, 3 and 4, respectively. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 11 / 48

Chi-Square (χ²) Distribution Figure 6: χ² distributions with degrees of freedom 1, 2, 3, 4 and 10, respectively. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 12 / 48
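Curves like those in Figures 2-6 can be drawn with base R. A minimal sketch; the plotting choices (range, line types) are illustrative only:
curve(dchisq(x, df = 1), from = 0, to = 30, xlab = "x", ylab = "density")
for (k in c(2, 3, 4, 10)) curve(dchisq(x, df = k), from = 0, to = 30, add = TRUE, lty = k)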

Pesticide in Organic Foods
                Present                  Not Present              Total
Organic         29 (observed)            98 (observed)            127
                93 (expected)            34 (expected)
                43.89 (contribution)     119.21 (contribution)
Conventional    19,485 (observed)        7,086 (observed)         26,571
                19,421 (expected)        7,150 (expected)
                0.21 (contribution)      0.57 (contribution)
Total           19,514                   7,184                    26,698
Table 3: Chi-squared test of independence (no association) between food type and pesticide status.
Pearson's Chi-squared test with Yates' continuity correction
data: pes
X-squared = 161.32, df = 1, p-value < 2.2e-16
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 13 / 48
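The printed output above can be reproduced with base R. A minimal sketch; the object name pes matches the output header, the rest is assumed:
pes <- matrix(c(29, 98, 19485, 7086), nrow = 2, byrow = TRUE,
              dimnames = list(Food = c("Organic", "Conventional"),
                              Pesticide = c("Present", "Not Present")))
chisq.test(pes)            # Yates' continuity correction is applied by default for 2x2 tables
chisq.test(pes)$expected   # expected cell counts, as in Table 3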

Strength Of The Association: Difference of Proportions
            Stress                    Depression
Gender    Yes    No    Total        Yes    No    Total
Female    44%    56%   100%         11%    89%   100%
Male      20%    80%   100%          6%    94%   100%
Table 4: Survey of college freshmen at UCLA in 2013 on feeling frequently stressed and feeling frequently depressed.
Insights:
P̂(Stress | Female) − P̂(Stress | Male) = 44% − 20% = 24%
P̂(Depression | Female) − P̂(Depression | Male) = 11% − 6% = 5%
There is evidence of a greater difference between females and males in their feelings of stress than of depression.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 14 / 48

The Ratio of Proportions: Relative Risk
            Developed Flu
Group      Yes     No       Total
Vaccine     26     3,874    3,900
Control     70     3,830    3,900
Table 5: Results from a vaccine efficacy study. Subjects were either vaccinated with a cell-derived flu vaccine or given a placebo (control group).
Difference of proportions:
P̂(Yes | Control) − P̂(Yes | Vaccine) = 70/3900 − 26/3900 = 0.0179 − 0.0067 = 0.0112
Insight: The difference of proportions itself is very small in magnitude, which might mistakenly lead us to conclude that flu status is independent of vaccination status.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 15 / 48

The Ratio of Proportions: Relative Risk
            Developed Flu
Group      Yes       No        Total
Vaccine    0.0067    0.9933      1
Control    0.0179    0.9821      1
Table 6: Conditional proportions of flu status for the vaccinated and unvaccinated (control) groups.
Ratio of proportions (relative risk):
Relative Risk = P̂(Yes | Control) / P̂(Yes | Vaccine) = 0.0179/0.0067 = 2.67
Insight: Unvaccinated subjects are 2.67 times as likely to develop the flu as vaccinated subjects; "2.67 times" is the relative risk.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 16 / 48

Odds and Odds Ratio
In Table 6, denote p̂₁ = P̂(Yes | Vaccine) and p̂₂ = P̂(Yes | Control).
Odds: the ratio of the two conditional proportions in each row.
Odds₁ = p̂₁/(1 − p̂₁) = 0.0067/0.9933 = 0.0067 ≈ 1/150
Odds₂ = p̂₂/(1 − p̂₂) = 0.0179/0.9821 = 0.0182 ≈ 1/50
Odds Ratio: the ratio of the two odds.
[p̂₂/(1 − p̂₂)] / [p̂₁/(1 − p̂₁)] = 0.0182/0.0067 = 2.72
Insights: The odds compare the proportion of subjects developing the flu to those who do not, within the vaccinated group and within the unvaccinated group. The odds ratio tells us how many times larger one odds is than the other.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 17 / 48
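The three measures of association for Table 5 can be computed directly in R. A minimal sketch; the variable names are illustrative:
p_vaccine <- 26 / 3900    # estimated P(flu | vaccine)
p_control <- 70 / 3900    # estimated P(flu | control)
p_control - p_vaccine                                           # difference of proportions, about 0.011
p_control / p_vaccine                                           # relative risk, about 2.7
(p_control / (1 - p_control)) / (p_vaccine / (1 - p_vaccine))   # odds ratio, about 2.72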

Odds and Odds Ratio
Interpretation: In the control group, for every one subject developing the flu, about 50 did not develop it. In the vaccinated group, for every one subject developing the flu, about 150 did not develop it. The odds of developing flu in the control group are 2.7 (about 3) times the odds in the vaccinated group.
Properties of the Odds Ratio: The odds ratio can be any nonnegative value. An odds ratio equal to 1 corresponds to independence. Values farther from 1 represent stronger association.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 18 / 48

Smoking and Health
A survey of 1,314 women in the United Kingdom asked each woman whether she was a smoker. Twenty years later, a follow-up survey observed whether each woman was dead or still alive.
            Survival Status
Smoker     Dead    Alive    Total
Yes         139     443       582
No          230     502       732
Total       369     945     1,314
Table 7: Smoking status and 20-year survival in women.
Question: What is the response variable? The explanatory variable? Construct a contingency table that shows the conditional proportions of survival status given smoking status.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 19 / 48
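For reference, the conditional proportions of death from Table 7 are P̂(Dead | Smoker) = 139/582 ≈ 0.24 and P̂(Dead | Nonsmoker) = 230/732 ≈ 0.31, so with age ignored, smokers appear to have the higher survival rate; Tables 8 and 9 below show how this comparison changes once age is taken into account.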

Smoking and Health
              18-34            35-54            55-64            65+
Smoker     Dead   Alive     Dead   Alive     Dead   Alive     Dead   Alive
Yes           5     174       41     198       51      64       42      7
No            6     213       19     180       40      81      165     28
Table 8: Four contingency tables relating smoking status and survival status for these 1,314 women, separated into four age groups.
Example: Construct the contingency table that shows the conditional proportions of deaths for smokers and nonsmokers by age.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 20 / 48

Smoking and Health
              Age Group
Smoker      18-34    35-54    55-64    65+
Yes          2.8%    17.2%    44.3%    85.7%
No           2.7%     9.5%    33.1%    85.5%
Difference   0.1%     7.7%    11.2%     0.2%
Table 9: Conditional proportions of deaths for smokers and nonsmokers by age.
Insights: When the data are examined separately by age group, smokers have lower survival rates (higher death rates) than nonsmokers in every age group, the opposite of what the combined table suggests.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 21 / 48
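Table 9 can be reproduced from Table 8 in a few lines of R. A minimal sketch; the vector layout and names are illustrative:
dead_s  <- c(5, 41, 51, 42);  alive_s  <- c(174, 198, 64, 7)    # smokers, age groups 18-34, 35-54, 55-64, 65+
dead_ns <- c(6, 19, 40, 165); alive_ns <- c(213, 180, 81, 28)   # nonsmokers, same age groups
round(100 * dead_s  / (dead_s  + alive_s),  1)   # 2.8 17.2 44.3 85.7
round(100 * dead_ns / (dead_ns + alive_ns), 1)   # 2.7  9.5 33.1 85.5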

Outline 1 Association 2 Association For Categorical Variables Testing Categorical Variables for Independence Determining the Strength of the Association 3 Association For Quantitative Variables Linear Association: Direction and Strength Statistical Model Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 22 / 48

Association Between Two Quantitative Variables
We are interested in whether the value of the response variable Y can be reliably predicted from the value of the explanatory (predictor) variable X.
We need to examine the nature and strength of the relationship between the two variables.
We develop a statistical model to predict new values of the response variable using the predictor variable: simple linear regression (a straight line) or polynomial regression (a curve).
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 23 / 48

Correlation
The correlation summarizes the strength and direction of the linear (straight-line) association between two quantitative variables. Denoted by r, it takes values between −1 and +1.
r = (1/(n − 1)) Σ_{i=1}^n ((x_i − x̄)/s_x)((y_i − ȳ)/s_y) = (1/(n − 1)) Σ_{i=1}^n z_{x_i} z_{y_i}
where n is the number of observations, x̄ and ȳ are the means, and s_x and s_y are the standard deviations of x and y.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 24 / 48
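A minimal R sketch of this formula, checked against the built-in cor(); the example data (mtcars) is illustrative, not from the slides:
x <- mtcars$wt; y <- mtcars$mpg                  # any two quantitative variables
n <- length(x)
r_manual <- sum(scale(x) * scale(y)) / (n - 1)   # scale() produces the z-scores
c(r_manual, cor(x, y))                           # both give the same correlation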

Correlation: Direction and Strength Figure 7: Data points in the upper-right and lower-left quadrants make a positive contribution to the value of the correlation r; data points in the upper-left and lower-right quadrants make a negative contribution. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 25 / 48

Correlation: Direction and Strength Figure 8: Scatterplots for several values of r, of varying sign and magnitude. The closer the data points fall to a straight line, the closer the correlation r gets to ±1, and the stronger the linear association between x and y. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 26 / 48

Correlation: Direction and Strength Figure 9: Two scatterplots of curved relationships, each with r = 0. The correlation r poorly describes the association when the relationship is curved. Always construct a scatterplot to display a relationship between two quantitative variables. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 27 / 48

Simple Linear Regression Figure 10: The scatterplot of x = weight (in pounds) and y = mileage (miles per gallon of gas) for 2004-model cars with automatic transmission. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 28 / 48

Simple Linear Regression Figure 11: The scatterplot of x = weight (in pounds) and y = mileage superimposed with an arbitrary straight line. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 29 / 48

Simple Linear Regression Figure 12: The scatterplot of x = weight (in pounds) and y = mileage superimposed with an arbitrary straight line. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 30 / 48

Simple Linear Regression Figure 13: The scatterplot of x = weight (in pounds) and y = mileage superimposed with an arbitrary straight line. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 31 / 48

Simple Linear Regression Figure 14: The scatterplot of x = weight (in pounds) and y = mileage superimposed with an arbitrary straight line. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 32 / 48

Simple Linear Regression Figure 15: The scatterplot of x = weight (in pounds) and y = mileage superimposed with the straight line of best fit. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 33 / 48

Least Squares Method
Assume that there is an underlying straight-line relationship between the predictor variable X and the response variable Y. This relationship can be written as a regression equation, denoted by
µ_Y = β₀ + β₁X
where µ_Y denotes the population mean of Y for all the subjects at a particular value of X. In practice, we estimate the regression equation using the sample data, and the estimated regression equation is denoted by
ŷ = b₀ + b₁x
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 34 / 48

Least Squares Method
The least squares method finds the optimal line through the data points by making the sum of squared residuals as small as possible:
SSE = Σ_{i=1}^n (y_i − ŷ_i)²
The optimal line is called the regression line. Here y_i − ŷ_i is the residual for the i-th subject, where y_i is the observed response value, x_i is the predictor value, and ŷ_i = b₀ + b₁x_i.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 35 / 48
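For reference, the minimizing values have a closed form (a standard result, not shown on the slides): b₁ = r (s_y / s_x) and b₀ = ȳ − b₁ x̄, where r is the correlation and s_x, s_y are the sample standard deviations.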

Car Weights and MPGs
Call:
lm(formula = car$mileage ~ car$weight)
Coefficients:
              Estimate     Std. Error   t value   Pr(>|t|)
(Intercept)   45.6453593   2.6027584    17.537    8.30e-15 ***
car$weight    -0.0052220   0.0006271    -8.328    2.14e-08 ***
Residual standard error: 3.016 on 23 degrees of freedom
Multiple R-squared: 0.751, Adjusted R-squared: 0.7401
F-statistic: 69.35 on 1 and 23 DF, p-value: 2.138e-08
The estimated regression equation is ŷ = 45.65 − 0.005x
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 36 / 48
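A minimal sketch of the R call behind this output; the data frame car with columns mileage and weight is assumed from the formula shown and is not provided with the slides:
fit <- lm(car$mileage ~ car$weight)   # simple linear regression of mileage on weight
summary(fit)                          # coefficient table, residual standard error, R-squared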

Car Weights and MPGs Figure 16: The scatterplot of x = weight (in pounds) and y = mileage superimposed with the estimated regression line ŷ = 45.65 − 0.005x. Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 37 / 48

Intercept and Coefficient
For the estimated regression equation ŷ = 45.65 − 0.005x, what do the y-intercept and x-coefficient mean?
b₀ = 45.65: the predicted MPG (y) when weight x = 0, which is meaningless here. It is a kind of extrapolation, and extrapolation is too dangerous to rely on!
b₁ = −0.005: the predicted miles per gallon decreases by 0.005, on average, for each additional pound of car weight.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 38 / 48

Residuals
Given the estimated regression equation ŷ = 45.65 − 0.005x, complete the table by calculating the estimated MPGs and residuals.
model             weight (x)   mileage (y)    ŷ      y − ŷ
Toyota Corolla      2590          38
Jeep Liberty        4115          21
Hummer H2           6400          17
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 39 / 48

Residuals
model             weight (x)   mileage (y)    ŷ        y − ŷ
Toyota Corolla      2590          38          32.12     5.88
Jeep Liberty        4115          21          24.16    −3.16
Hummer H2           6400          17          12.22     4.78
Table 10: Estimated MPGs and residuals for 3 different cars.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 40 / 48
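Note that these fitted values use the unrounded coefficients from the lm() output (45.6453593 and −0.0052220); the rounded equation ŷ = 45.65 − 0.005x would give about 32.70 rather than 32.12 for the Corolla. A minimal R sketch of the calculation; the vector names are illustrative:
new_wt <- c(2590, 4115, 6400)
y_hat  <- 45.6453593 - 0.0052220 * new_wt   # fitted MPGs: about 32.12, 24.16, 12.22
c(38, 21, 17) - y_hat                       # residuals:   about  5.88, -3.16,  4.78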

R-square
How to assess the quality of the regression line? Ignoring x and its relationship with y, we can predict y merely using the sample mean ȳ. Then the total sum of squares equals
SST = Σ_{i=1}^n (y_i − ȳ)²
Instead, using the estimated regression equation, the sum of squared residuals is
SSE = Σ_{i=1}^n (y_i − ŷ_i)²
The R-square gives a summary measure of association and describes the predictive power:
R² = (SST − SSE) / SST
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 41 / 48

R-square
The R-square (R²) summarizes the reduction in prediction error when using the regression equation rather than ȳ to predict y. R² is also interpreted as how much of the variability in y can be explained by x using the simple linear regression. In simple linear regression, R² is related to the correlation r by
R² = r²
where r = (1/(n − 1)) Σ_{i=1}^n ((x_i − x̄)/s_x)((y_i − ȳ)/s_y).
For instance, in the car weight and MPG example, R² = 0.75 indicates that 75% of the variability in MPG is explained by the simple linear regression of MPG on weight.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 42 / 48
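A minimal R sketch of this decomposition; it assumes the car data frame and the fitted object fit from the earlier lm() call:
sst <- sum((car$mileage - mean(car$mileage))^2)   # total sum of squares
sse <- sum(residuals(fit)^2)                      # sum of squared residuals
(sst - sse) / sst                                 # R-squared, about 0.751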

Polynomial Regression
Assume that there is an underlying curved relationship between the predictor variable X and the response variable Y. This relationship can be written as a polynomial regression equation, denoted by
µ_Y = β₀ + β₁X + β₂X²
where µ_Y denotes the population mean of Y for all the subjects at a particular value of X. In practice, we estimate the regression equation using the sample data, and the estimated regression equation is denoted by
ŷ = b₀ + b₁x + b₂x²
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 43 / 48

Polynomial Regression Figure 17: The scatterplot of weight (x) and MPG (y) superimposed with the estimated polynomial regression ŷ = 67 − 0.016x + 1.2×10⁻⁶x². Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 44 / 48

Polynomial Regression
Call:
lm(formula = car$mileage ~ car$weight + I(car$weight^2))
Coefficients:
                   Estimate     Std. Error   t value   Pr(>|t|)
(Intercept)        6.700e+01    8.843e+00     7.577    1.44e-07 ***
car$weight        -1.571e-02    4.224e-03    -3.718    0.0012 **
I(car$weight^2)    1.218e-06    4.862e-07     2.505    0.0202 *
Residual standard error: 2.72 on 22 degrees of freedom
Multiple R-squared: 0.8062, Adjusted R-squared: 0.7886
F-statistic: 45.77 on 2 and 22 DF, p-value: 1.447e-08
The estimated regression equation is ŷ = 67 − 0.016x + 1.2×10⁻⁶x²
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 45 / 48
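A minimal sketch of the R call behind this output, again assuming the car data frame used earlier:
fit2 <- lm(car$mileage ~ car$weight + I(car$weight^2))   # I() adds the quadratic term to the model
summary(fit2)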

Polynomial Regression
Using the polynomial regression, R² = 0.81 indicates that 81% of the variability in the response variable MPG can be explained by the polynomial regression of MPG on the predictor variable weight.
Calculate the predicted MPGs and residuals using the estimated polynomial equation ŷ = 67 − 0.016x + 1.2×10⁻⁶x².
model             weight (x)   weight² (x²)   mileage (y)    ŷ      y − ŷ
Toyota Corolla      2590        6.7×10⁶           38
Jeep Liberty        4115       16.93×10⁶          21
Hummer H2           6400       40.9×10⁶           17
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 46 / 48

Polynomial Regression
model             weight (x)   weight² (x²)   mileage (y)    ŷ        y − ŷ
Toyota Corolla      2590        6.7×10⁶           38          34.49     3.51
Jeep Liberty        4115       16.93×10⁶          21          22.99    −1.99
Hummer H2           6400       40.9×10⁶           17          16.36     0.64
Table 11: Predicted MPGs and residuals using the estimated polynomial regression equation ŷ = 67 − 0.016x + 1.2×10⁻⁶x².
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 47 / 48
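A minimal sketch of the same calculation in R, using the coefficients printed in the lm() output above; small rounding differences from Table 11 are expected:
new_wt <- c(2590, 4115, 6400)
67.00 - 1.571e-2 * new_wt + 1.218e-6 * new_wt^2   # roughly 34.48, 22.98, 16.35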

Model Selection
Given two fitted models, the simple linear regression model and the polynomial regression model, which one should you select to make predictions?
Rule of Thumb: consider the visual fit, prefer a larger R², and prefer the simpler model when the fits are comparable.
Chong Ma (Statistics, USC) STAT 201 Elementary Statistics 48 / 48
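One common way to weigh "larger R²" against "simpler model" in R is an F-test comparing the two nested fits; a sketch assuming the objects fit and fit2 from the earlier lm() calls:
anova(fit, fit2)   # tests whether the quadratic term significantly improves the fit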