Linear Regression


Scatter plot of data from the study

Consider a study to relate birthweight to the estriol level of pregnant women. The data are below.

  i   Estriol   Weight (g/100)      i   Estriol   Weight (g/100)
  1      7          25             17     17          32
  2      9          25             18     25          32
  3      9          25             19     27          34
  4     12          27             20     15          34
  5     14          27             21     15          34
  6     16          27             22     15          35
  7     16          24             23     16          35
  8     14          30             24     19          34
  9     16          30             25     18          35
 10     16          31             26     17          36
 11     17          30             27     18          37
 12     19          31             28     20          38
 13     21          30             29     22          40
 14     24          28             30     25          39
 15     15          32             31     24          43
 16     16          32

[Scatter plot of birthweight (g/100) against estriol level.]

Important features of the scatter plot:
- Two variables are associated with each unit.
- There appears to be a rough trend or relationship between the two variables.
- This relationship is not exactly precise, in that there exists substantial variation or scatter.

Important questions concerning the scatter plot:
- Can we summarize the relationship with a linear equation (line)?
- Is the apparent trend statistically significant?
- Can we characterize the variability about a summary line?

Linear regression is a statistical method used to address these questions.

The simple linear regression model

A simple linear regression model is a summary of the relationship between a dependent variable (or response variable) Y and an independent variable (or covariate) X. Y is assumed to be a random variable while, even if X is a random variable, we condition on it (assume it is fixed). Essentially, we are interested in knowing the behavior of Y given we know X = x.
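To make this concrete, here is a minimal sketch in Python (assuming numpy and matplotlib are available; the arrays transcribe the table above) that reproduces the scatter plot:

```python
import numpy as np
import matplotlib.pyplot as plt

# Estriol level and birthweight (g/100) for the 31 women in the study.
estriol = np.array([7, 9, 9, 12, 14, 16, 16, 14, 16, 16, 17, 19, 21, 24, 15, 16,
                    17, 25, 27, 15, 15, 15, 16, 19, 18, 17, 18, 20, 22, 25, 24])
weight  = np.array([25, 25, 25, 27, 27, 27, 24, 30, 30, 31, 30, 31, 30, 28, 32, 32,
                    32, 32, 34, 34, 34, 35, 35, 34, 35, 36, 37, 38, 40, 39, 43])

plt.scatter(estriol, weight)
plt.xlabel("Estriol level")
plt.ylabel("Birthweight (g/100)")
plt.show()
```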

[Scatter plots of the data on two scales; on the wider scale the axes extend to the origin.]

Let E[Y | x] represent the expected value of Y given X = x. We see that (on average) Y is increasing with X. To describe this relationship, I could define E[Y | x] = µ_{Y|x} = βx, where β > 0; but note that then, when x = 0, E[Y | X = 0] = 0, which is not what we see in the data. To account for this (non-zero intercept), I could define

E[Y | x] = µ_{Y|x} = βx + α, where β > 0 and α > 0.

This line is the population regression line. α and β are the population regression coefficients. α is the y-intercept of the line (E[Y | X = 0]). β is the slope of the line: it gives the change in the mean value of Y that corresponds to a one-unit increase in X. If β > 0, the mean increases as X increases; if β < 0, the mean decreases as X increases.

Note that even if the population regression line accurately describes the relationship between the two variables, individual measurements of these variables will not necessarily fall on that line. To account for this, an error term ɛ, which represents the variation of responses Y conditional on x, is introduced. The full linear regression model takes the following form:

Y = α + βx + ɛ,

where ɛ is a normally distributed random variable with mean 0 and variance σ².
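To see what the model asserts, here is a minimal simulation sketch; the parameter values α = 21.5, β = 0.6, σ = 3.8 are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameters (illustration only; the slides
# do not specify population values).
alpha, beta, sigma = 21.5, 0.6, 3.8

x = np.linspace(7, 27, 31)               # covariate values, treated as fixed
eps = rng.normal(0.0, sigma, size=31)    # errors: mean 0, variance sigma**2
y = alpha + beta * x + eps               # Y = alpha + beta*x + epsilon

print(np.round(y[:5], 1))
```

Each fresh draw of eps scatters the responses around the population regression line alpha + beta*x, which is exactly the behavior the error term is meant to capture.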

Linear Regression - Summary of Assumptions:
- Values of X are fixed (at x).
- The outcomes of Y are normally distributed, independent random variables with mean µ_{Y|x} and variance σ²_{Y|x}.
- σ²_{Y|x} is the same for all x. This assumption of constant variability across all x values is known as homoscedasticity (σ²_{Y|x} = σ²).
- The relationship between µ_{Y|x} and x is described by the straight line µ_{Y|X=x} = α + βx.
- Error terms ɛ_i and ɛ_j are independent of each other (i ≠ j).

Linear Regression - Notation:

We have been using capital X or Y to denote a random variable and x or y to denote the values that the respective random variables could assume. In linear regression, we assume that values of X are fixed (not random). The notation above is consistent with this idea, and many books follow it. At the same time, many books (including your text) do not. We will follow the notation in your book. The full linear regression model then takes the form

y = α + βx + ɛ,

where ɛ is a normally distributed random variable with mean 0 and variance σ². (x_i, y_i) denote data values. There are 31 (x_i, y_i) pairs shown in the scatterplot.

You have 31 (x_i, y_i) pairs. Recall these questions:
- Can we summarize the relationship with a linear equation (line)?
- Is the apparent trend statistically significant?
- Can we characterize the variability about some line?

Summarizing the relationship with a linear equation. We would like the line to be as close to the data as possible. The line is given in general by α + βx. For a fixed x_i, the corresponding point on the line is α + βx_i. Consider measuring the distance from the data point y_i to the line.

The distance from the data point y_i to the value of the line for a given x_i is

y_i − α − βx_i.

We could sum up all of these differences; as always, it's a good idea to square the differences so that they don't cancel out. Let S denote the sum of squared differences (distances) between the data points and the line:

S = Σ_{i=1}^n (y_i − α − βx_i)².

The least squares line, or estimated regression line, is the line that minimizes S. Finding the least squares line means finding the values of α and β that minimize S. With a little calculus, you can show that these values are

β̂ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²  and  α̂ = ȳ − β̂ x̄.

When we estimate α and β based on data ((x_i, y_i) pairs), the estimates α̂ and β̂ are called estimated regression coefficients, or just regression coefficients. Once estimates α̂ and β̂ of α and β have been computed, the predicted value of y_i given x_i is obtained from the estimated regression line:

ŷ_i = α̂ + β̂ x_i,

where ŷ_i is the prediction of the true value of y_i, for observation i, i = 1, ..., n. In our example, n = 31,

β̂ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)² = 0.608, and α̂ = ȳ − β̂ x̄ = 21.52.

The estimated regression line is therefore

ŷ_i = 21.52 + 0.608 x_i.

[Scatter plot of the data with the estimated regression line ŷ_i = 21.52 + 0.608 x_i overlaid.]
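As a check on the arithmetic, here is a minimal sketch in Python (assuming numpy; the arrays repeat the study data so the block stands alone) that evaluates these formulas:

```python
import numpy as np

estriol = np.array([7, 9, 9, 12, 14, 16, 16, 14, 16, 16, 17, 19, 21, 24, 15, 16,
                    17, 25, 27, 15, 15, 15, 16, 19, 18, 17, 18, 20, 22, 25, 24])
weight  = np.array([25, 25, 25, 27, 27, 27, 24, 30, 30, 31, 30, 31, 30, 28, 32, 32,
                    32, 32, 34, 34, 34, 35, 35, 34, 35, 36, 37, 38, 40, 39, 43])

xbar, ybar = estriol.mean(), weight.mean()

# Least squares estimates from the closed-form expressions above.
beta_hat  = np.sum((estriol - xbar) * (weight - ybar)) / np.sum((estriol - xbar) ** 2)
alpha_hat = ybar - beta_hat * xbar

print(beta_hat, alpha_hat)   # approximately 0.608 and 21.52

# Predicted values from the estimated regression line.
y_hat = alpha_hat + beta_hat * estriol
```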

The estimated regression line (ŷ_i = 21.52 + 0.608 x_i) is shown along with the data below (on a different scale, with the sample means marked). Note that (1/n) Σ_{i=1}^n x_i = 17.22 and (1/n) Σ_{i=1}^n y_i = 32. Also note that when x_i = 0, ŷ_i = 21.52.

[Scatter plot of the data with the estimated regression line, on a scale extending to the origin; the sample means are marked.]

You have 31 (x_i, y_i) pairs. Recall these questions:
- Can we summarize the relationship with a linear equation (line)?
- Is the apparent trend statistically significant?
- Can we characterize the variability about some line?

Linear regression inference

We would like to be able to use the least-squares regression line ŷ = α̂ + β̂x to make inference about the population regression line

µ_{y|x} = α + βx.

We begin by noting that α̂ and β̂ are point estimates of the population intercept and slope, respectively. As always, if we obtain different data values, the estimates will change. Suppose we are interested in β and want to infer something about β using the information in β̂. In particular, we want to know whether there is evidence to reject the null H_0: β = 0.

If the linear regression model holds and we were to select repeated samples of size n from the underlying population and calculate a least squares line for each set of observations, the estimated values of α and β would vary from sample to sample. To test any hypotheses regarding the underlying population regression coefficients, we need to know (or estimate) the standard errors of the estimates. It can be shown that

se(β̂) = σ_{y|x} / sqrt(Σ_{i=1}^n (x_i − x̄)²)  and  se(α̂) = σ_{y|x} sqrt(1/n + x̄² / Σ_{i=1}^n (x_i − x̄)²).

Note that the standard errors of the estimated coefficients α̂ and β̂ depend on σ_{y|x}, which is usually unknown. We can estimate it using

s_{y|x} = sqrt(Σ_{i=1}^n (y_i − ŷ_i)² / (n − 2)).
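The standard-error formulas translate directly into code; a minimal sketch (the estimates are recomputed so the block stands alone):

```python
import numpy as np

estriol = np.array([7, 9, 9, 12, 14, 16, 16, 14, 16, 16, 17, 19, 21, 24, 15, 16,
                    17, 25, 27, 15, 15, 15, 16, 19, 18, 17, 18, 20, 22, 25, 24])
weight  = np.array([25, 25, 25, 27, 27, 27, 24, 30, 30, 31, 30, 31, 30, 28, 32, 32,
                    32, 32, 34, 34, 34, 35, 35, 34, 35, 36, 37, 38, 40, 39, 43])

n = len(estriol)
xbar = estriol.mean()
beta_hat  = np.sum((estriol - xbar) * (weight - weight.mean())) / np.sum((estriol - xbar) ** 2)
alpha_hat = weight.mean() - beta_hat * xbar
y_hat = alpha_hat + beta_hat * estriol

# Estimate of sigma_{y|x}: residual sum of squares over n - 2 degrees of freedom.
s_yx = np.sqrt(np.sum((weight - y_hat) ** 2) / (n - 2))

Sxx = np.sum((estriol - xbar) ** 2)
se_beta  = s_yx / np.sqrt(Sxx)
se_alpha = s_yx * np.sqrt(1.0 / n + xbar ** 2 / Sxx)

print(s_yx, se_beta, se_alpha)   # se_beta is approximately 0.147
```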

Note that the difference between the true y_i and the predicted value of y_i, denoted by ŷ_i, is y_i − ŷ_i. This difference is called the residual. The estimate of the standard deviation σ_{y|x} is

s_{y|x} = sqrt(Σ_{i=1}^n (y_i − ŷ_i)² / (n − 2)).

We can now test H_0: β = 0 against H_A: β ≠ 0. Note that if the null is true, then β̂ / se(β̂) follows a t distribution with n − 2 degrees of freedom. In our example, the evaluated standard error is

s_{y|x} / sqrt(Σ_{i=1}^n (x_i − x̄)²) = 0.147,

and the evaluated test statistic is 0.608 / 0.147 = 4.14. Note that from table A.4, P(T_29 ≥ 3.659) = 0.0005, so we reject the null (at α = 0.01 or α = 0.05).

Correlation Coefficient

In simple linear regression analysis, we saw that the estimated slope of the regression line, β̂, is a measure of the linear association between a dependent variable y and an independent variable x. However, the value of β̂ depends on which variable we declare the independent variable and which we declare the dependent variable. Also, β̂ depends on the dispersion of the variables. A measure of the linear association between two variables X and Y which is independent of these concerns is the Pearson correlation coefficient r. It turns out that r is related to the regression coefficient β̂ by

r = β̂ (s_x / s_y), or equivalently β̂ = r (s_y / s_x).

Note that R² denotes a quantity called the coefficient of determination, and R² = r².

The statistic r has the following properties:
- −1 ≤ r ≤ 1.
- r = 1 iff y = α̂ + β̂x for some β̂ > 0.
- r = −1 iff y = α̂ + β̂x for some β̂ < 0.
- r remains the same under location-scale transforms of x and y.
- r measures the extent of linear association; r tends to be close to zero if there is no linear association between x and y.

The correlation coefficient r is computed from the sample and estimates the population correlation coefficient ρ:

r = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / ((n − 1) s_x s_y),

where s_x and s_y are the sample standard deviations of the x and y values, respectively.
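Finally, a minimal sketch of the t test and the correlation coefficient, with scipy.stats.linregress used as an independent check (assuming scipy is available):

```python
import numpy as np
from scipy import stats

estriol = np.array([7, 9, 9, 12, 14, 16, 16, 14, 16, 16, 17, 19, 21, 24, 15, 16,
                    17, 25, 27, 15, 15, 15, 16, 19, 18, 17, 18, 20, 22, 25, 24])
weight  = np.array([25, 25, 25, 27, 27, 27, 24, 30, 30, 31, 30, 31, 30, 28, 32, 32,
                    32, 32, 34, 34, 34, 35, 35, 34, 35, 36, 37, 38, 40, 39, 43])

n = len(estriol)
xbar = estriol.mean()
beta_hat  = np.sum((estriol - xbar) * (weight - weight.mean())) / np.sum((estriol - xbar) ** 2)
alpha_hat = weight.mean() - beta_hat * xbar
y_hat = alpha_hat + beta_hat * estriol
s_yx = np.sqrt(np.sum((weight - y_hat) ** 2) / (n - 2))
se_beta = s_yx / np.sqrt(np.sum((estriol - xbar) ** 2))

# t statistic for H0: beta = 0, on n - 2 degrees of freedom.
t_stat = beta_hat / se_beta                      # approximately 4.14
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)    # two-sided p-value

# Pearson correlation via r = beta_hat * (s_x / s_y), and R^2 = r^2.
r = beta_hat * estriol.std(ddof=1) / weight.std(ddof=1)

print(t_stat, p_val, r, r ** 2)
print(stats.linregress(estriol, weight))         # independent check of slope, intercept, r, p
```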