Data Analysis and Statistical Methods Statistics 651


http://www.stat.tamu.edu/~suhasini/teaching.html
Lecture 32, Suhasini Subba Rao

Previous lecture

We are interested in whether an independent variable $x$ (for example, shoe size) exerts any influence on a dependent variable $Y$ (for example, height). We do not observe the entire population, but rather a sample of, say, 5 people, and for each person we observe their shoe size and height. We plot height against shoe size and fit the best line through these points. In general the line will not pass exactly through the points: knowledge of shoe size will not give you the exact height, but it will give some indication.

Height/shoe size example 1

Let $x_i$ be the shoe size and $y_i$ the height of person $i$. We observe the height and shoe size of 5 people:

Height $y_i$:     6  14  10  14  26
Shoe size $x_i$:  1   3   4   5   7

[Scatter plot: the points are the observations and the line is the line of best fit.]

For these data, $S_{xy} = 60$ and $S_{xx} = 20$. Then
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{60}{20} = 3, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x = 14 - 3 \times 4 = 2.$$
The line of best fit is $\hat Y = 2 + 3x$. Therefore, if I want to predict the average height of someone with size 6 feet, it is $\hat Y(6) = 2 + 3 \times 6 = 20$.
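As a quick check of this arithmetic, here is a minimal sketch in Python (assuming numpy is available; the variable names are illustrative, not from the lecture) that computes $S_{xx}$, $S_{xy}$, and the least squares estimates for these five observations:

```python
import numpy as np

# Shoe sizes (x) and heights (y) of the five people in the example.
x = np.array([1, 3, 4, 5, 7], dtype=float)
y = np.array([6, 14, 10, 14, 26], dtype=float)

# Sums of squares and cross products about the means.
Sxx = np.sum((x - x.mean()) ** 2)              # 20
Sxy = np.sum((x - x.mean()) * (y - y.mean()))  # 60

beta1_hat = Sxy / Sxx                        # slope: 60/20 = 3
beta0_hat = y.mean() - beta1_hat * x.mean()  # intercept: 14 - 3*4 = 2

print(f"line of best fit: y = {beta0_hat:.0f} + {beta1_hat:.0f}x")
print("predicted mean height at shoe size 6:", beta0_hat + beta1_hat * 6)  # 20.0
```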

Residuals

Recall that when we did ANOVA, we calculated the residuals. In that case, the residuals were the differences between the sample mean in each group and the observations in that group. In the case of regression, the residuals are the differences between each $Y_i$ and the corresponding point on the line of best fit, $\hat Y_i = \hat\beta_0 + \hat\beta_1 x_i$. The residuals are calculated as
$$\hat\varepsilon_i = Y_i - \hat Y_i = Y_i - \hat\beta_0 - \hat\beta_1 x_i.$$
Normally you do not have to calculate the residuals by hand; the statistical software should give them. Look at the shoe size/height example above: the residual is the vertical distance between each point and the corresponding value on the fitted line.

The residuals in the shoe size/height example

The predictor of $y_i$ given $x_i$ is $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$, hence the predictive line is $\hat y = 2 + 3x$. We use this to obtain the residuals:

Height $y_i$:                                             6  14  10  14  26
Shoe size $x_i$:                                          1   3   4   5   7
$\hat y_i = 2 + 3x_i$:                                    5  11  14  17  23
Error in prediction $\hat\varepsilon_i = y_i - \hat y_i$: 1   3  -4  -3   3

Residuals are very important. If the residuals are small, the line fits the data very well; if the residuals are large, the line does not fit so well. Of course, in many situations the fit will not be particularly good, because the association between the $x$ and $y$ variables is weak. However, even a weak association can still be important. For example, consider the amount of milk drunk and the height of a child. The link is not strong: drinking lots of milk certainly does not guarantee a taller child, but there is a link. A strong link can be seen when the cloud of points clings closely to the line; a weak link can be seen when the cloud of points does not cling to the line, but there appears to be a trend nevertheless. A worked sketch of the residual calculation follows.
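As a minimal sketch (Python with numpy assumed; names illustrative), the residuals for the shoe size/height example can be computed directly from the fitted line $\hat y = 2 + 3x$:

```python
import numpy as np

x = np.array([1, 3, 4, 5, 7], dtype=float)      # shoe sizes
y = np.array([6, 14, 10, 14, 26], dtype=float)  # heights

y_hat = 2 + 3 * x   # fitted values on the line y = 2 + 3x
resid = y - y_hat   # residuals eps_i = y_i - y_hat_i

print("fitted values:", y_hat)  # [ 5. 11. 14. 17. 23.]
print("residuals:   ", resid)   # [ 1.  3. -4. -3.  3.]
# For a least squares fit with an intercept, the residuals always sum to zero.
print("sum of residuals:", resid.sum())  # 0.0
```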

Example 2

As one part of a study of commercial bank branches, data are obtained on the number of independent businesses ($x$) and the number of banks ($y$) in 12 different areas. The commercial centers of cities are excluded. The data are given below:

busi.  92  116  124  210  216  267  306  378  415  502  615  703
banks   3    2    3    5    4    5    5    6    7    7    9    9

Plot the data: does a straight line seem plausible? Calculate the regression equation (with $y$ as the dependent variable).

Solution 2

Plot the data. Summary statistics: $\bar X = 328.7$, $\bar Y = 5.4$. To obtain $S_{XX}$ and $S_{XY}$:

$x$:            92   116   124   210   216   267   306   378   415   502   615   703
$x - \bar x$: -237  -213  -205  -119  -113   -62   -23    49    86   173   286   374
$y$:             3     2     3     5     4     5     5     6     7     7     9     9
$y - \bar y$: -2.4  -3.4  -2.4  -0.4  -1.4  -0.4  -0.4   0.6   1.6   1.6   3.6   3.6

Using the table above we have $S_{XX} = \sum_{i=1}^{12}(X_i - \bar X)^2 = 436262$ and $S_{XY} = \sum_{i=1}^{12}(X_i - \bar X)(Y_i - \bar Y) = 4844$. So the parameter estimators are
$$\hat\beta_1 = S_{XY}/S_{XX} = 4844/436262 = 0.011$$
and
$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X = 5.4 - 0.011 \times 328.7 = 1.78.$$
This gives the predictive equation $\hat Y = 1.78 + 0.011X$, from which we obtain the fitted values and the estimated residuals $\hat\varepsilon_i = y_i - \hat y_i$:

$x$:                   92    116    124    210    216    267    306    378    415    502    615    703
$y$:                    3      2      3      5      4      5      5      6      7      7      9      9
$\hat y$:             2.8    3.0    3.1    4.1    4.1    4.7    5.1    5.9    6.3    7.3    8.5    9.5
$\hat\varepsilon_i$: 0.22  -1.04  -0.13   0.92  -0.14   0.30  -0.13   0.08   0.67  -0.29   0.47  -0.52

The start of inference on the parameters

So, given any set of paired observations, we can fit a line. But how good, or significant, is the line? As always, the numerical value of $\hat\beta_1$ on its own has no meaning: even if $\hat\beta_1$ is small, it does not mean that $x$ has no influence on $Y$. Consider an example where $Y$ is generated by the equation $Y = 2 + 0.00004x$. Here the slope $\beta_1 = 0.00004$ is very small, but it still totally determines $Y$. Hence, to determine whether $\hat\beta_1$ is significant or not, we need to compare it to the amount of error there is in $\hat\beta_1$. This is like comparing the sample mean to its standard error.

The sample and true slope

For every random sample we draw, we will get new values of $(Y_i, x_i)$, which give us different points on the X-Y graph and also a different line of best fit. In other words, each new sample leads to a new $\hat\beta_0$ and $\hat\beta_1$; hence $\hat\beta_0$ and $\hat\beta_1$ are random variables and have a distribution. What do we mean by a random sample, and what are $\hat\beta_0$ and $\hat\beta_1$ estimating? We start by understanding what $(Y, x)$ means.
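As a sketch of how one might verify these numbers (Python, assuming numpy and scipy are installed; `scipy.stats.linregress` is a standard least squares routine, not something used in the lecture):

```python
import numpy as np
from scipy import stats

# Number of independent businesses (x) and number of banks (y) in 12 areas.
busi = np.array([92, 116, 124, 210, 216, 267, 306, 378, 415, 502, 615, 703],
                dtype=float)
banks = np.array([3, 2, 3, 5, 4, 5, 5, 6, 7, 7, 9, 9], dtype=float)

fit = stats.linregress(busi, banks)
print(f"slope:     {fit.slope:.4f}")      # 0.0111 (about 4844/436262)
print(f"intercept: {fit.intercept:.3f}")  # 1.767 (the lecture rounds to 1.78)

resid = banks - (fit.intercept + fit.slope * busi)
print(np.round(resid, 2))  # small differences from the hand calculation are rounding
```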

Returning to the shoe size/height example

Recall that if we randomly select someone from the population of all people, there is wide variation in their height. Suppose we restrict ourselves to the subpopulation of all people with size 12 feet. Then we narrow the amount of variation in height. Size 12 is quite a large shoe size, hence it is reasonable to suppose that people with size 12 feet are also quite tall. However, even if you know that someone has size 12 feet, you will not know their exact height; their shoe size will only give an indication of their height. You could say that, on average, a person with size 12 feet is 6 feet tall, but there is much variation about this average.

It also seems reasonable to suppose that the average varies according to the subpopulation of shoe sizes considered. Indeed, it is likely the average will be $\beta_0 + \beta_1 x$, where $x$ denotes the shoe size (for now, ignore what $\beta_0$ and $\beta_1$ are). Why this formula? Remember, before we had a population of people, and the average height was taken over the entire population. Now we have restricted ourselves to the subpopulation of people with size 12 feet, and the average is taken over this subpopulation. It is quite reasonable to suppose that the average of this subpopulation should be quite large. Compare this with the subpopulation of people with size 2 feet: this is a small size, so it is quite likely that the average height of people with size 2 feet is small, say 4.11 feet. Or compare with the subpopulation of people with size 6 feet: this size is fairly typical, hence we would expect the average height of people with this shoe size to be about 5.4 feet.

Now let us look at the general average $\beta_0 + \beta_1 x$ again. If $\beta_1$ is positive, it tells us that the average height grows with shoe size. This fits with our intuition that height increases with shoe size. Hence modelling the mean height as $\beta_0 + \beta_1 x$ seems quite reasonable. But remember, there is a lot of variation about the mean; that is, the height of a randomly selected person with size 12 feet may be quite far from the mean $\beta_0 + \beta_1 \times 12$. To model the variation we add an error to the mean.

The linear regression model

The linear regression model is
$$Y = \beta_0 + \beta_1 x + \underbrace{\varepsilon}_{\text{error}},$$
where the error $\varepsilon$ is due to random variation (the differences between individuals). The height of a randomly selected person with size $x$ feet is $Y = \beta_0 + \beta_1 x + \varepsilon$, but we do not know what the intercept $\beta_0$ or the slope $\beta_1$ is.

Examples. Suppose for now it is known that $\beta_0 = 1$ and $\beta_1 = 2$. We model the height of a person who has size 1 feet as
$$Y = 1 + 2 \times 1 + \underbrace{\varepsilon}_{\text{random variation}}.$$
Observe that 3 is the average height of a person with size 1 feet. The error $\varepsilon$ is the variation about the mean; it arises because every individual is different (this is an "individual effect"). A small simulation from this model is sketched below.
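To make the model concrete, here is a minimal simulation sketch (Python with numpy assumed; the value of the error standard deviation and the list of shoe sizes are made up for illustration) that draws heights as $Y = 1 + 2x + \varepsilon$ with normal errors:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

beta0, beta1 = 1.0, 2.0  # true intercept and slope (known here, unknown in practice)
sigma = 1.5              # assumed standard deviation of the error eps (illustrative)

shoe_size = np.array([1, 3, 4, 5, 7, 12, 14], dtype=float)
eps = rng.normal(0.0, sigma, size=shoe_size.size)  # individual variation
height = beta0 + beta1 * shoe_size + eps

# Each simulated height scatters around its subpopulation mean beta0 + beta1*x.
print("subpopulation means:", beta0 + beta1 * shoe_size)  # e.g. 3 at x = 1
print("simulated heights:  ", np.round(height, 2))
```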

We model the height of a person who has size 14 feet as
$$Y = 1 + 2 \times 14 + \underbrace{\varepsilon}_{\text{random variation}}.$$
Observe that 29 is the average height of a person with size 14 feet. In reality we will not know $\beta_0$ and $\beta_1$; we only observe the data $(y_i, x_i)$ (recall that in the shoe size/height example we had 5 observations), and $\hat\beta_0$ and $\hat\beta_1$ are estimates of $\beta_0$ and $\beta_1$ based on the sample. Since $\hat\beta_0$ and $\hat\beta_1$ are estimators of $\beta_0$ and $\beta_1$, we will want to do the obvious things: construct CIs for $\beta_0$ and $\beta_1$ and carry out statistical tests.

Assumptions for inference

In order to make inference on $\beta_0$ and $\beta_1$ we assume:

(1) A straight line fits the data: $Y = \beta_0 + \beta_1 X + \varepsilon$. In practice this means that if we plot the data, we see a cloud of dots about a line, and not, for example, a cloud of dots about a curve.

(2) The errors $\varepsilon$ are approximately normal (we check this below).

(3) The variance of the errors $\varepsilon$ is the same for all values of $x$ (checking this can be a bit tricky and is not discussed here).

Under these assumptions we are able to obtain the distribution of $\hat\beta_0$ and $\hat\beta_1$, which we can use to construct confidence intervals and hypothesis tests.

1. Checking for normality of the residuals

It is very easy to check for normality: effectively, we just make a QQ plot of the residuals. Recall that we calculate the residuals by fitting the line of best fit $\hat Y = \hat\beta_0 + \hat\beta_1 x$ and estimating
$$\hat\varepsilon_i = Y_i - \hat\beta_0 - \hat\beta_1 x_i.$$
After estimating the residuals (there are as many of them as there are observations), we can make a QQ plot of the residuals to check for normality, and also a boxplot to check for outliers. A large deviation from normality or many outliers will give inaccurate CIs and misleading results in statistical tests.

2. Checking whether the linear model is correct

To check whether the linear model is appropriate, fit the linear model to the data $(x_1, Y_1), \ldots, (x_n, Y_n)$, obtain the parameter estimators $\hat\beta_0$ and $\hat\beta_1$, estimate the residuals $\hat\varepsilon_i$, and make a scatter plot of $\hat\varepsilon_i$ against $x_i$. If you still see a pattern in the scatter plot (what looks like a relationship), then the linear model may not be the most appropriate; if the scatter plot looks random, then the linear model may be appropriate. A pattern would suggest that there is still some, probably nonlinear, dependence between $Y_i$ and $x_i$ which has not been taken into account by the linear model. To overcome this problem we can transform $x_i$ (for example, look at the relationship between $Y_i$ and $x_i^2$). Both diagnostic plots are sketched below.
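As a sketch of both diagnostics (Python, assuming numpy, scipy, and matplotlib are installed; the data are the five shoe size/height pairs from earlier):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x = np.array([1, 3, 4, 5, 7], dtype=float)      # shoe sizes
y = np.array([6, 14, 10, 14, 26], dtype=float)  # heights

fit = stats.linregress(x, y)
resid = y - (fit.intercept + fit.slope * x)  # estimated residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

# (1) QQ plot of the residuals: points near the line suggest normality.
stats.probplot(resid, dist="norm", plot=ax1)
ax1.set_title("QQ plot of residuals")

# (2) Residuals against x: a patternless cloud supports the linear model.
ax2.scatter(x, resid)
ax2.axhline(0.0, linestyle="--")
ax2.set_xlabel("x")
ax2.set_ylabel("residual")
ax2.set_title("Residuals vs x")

plt.tight_layout()
plt.show()
```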