Lecture 9: Linear Regression


Goals
- Develop the basic concepts of linear regression from a probabilistic framework
- Estimate parameters and test hypotheses with linear models
- Perform linear regression in R

Regression
Regression is a technique for modeling and analyzing numerical data. It exploits the relationship between two or more variables so that we can gain information about one of them from the values of the others. Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships.

Regression Lingo
In the model Y = X1 + X2 + X3:
- Y is the dependent variable (also called the outcome or response variable)
- X1, X2, X3 are the independent variables (also called predictors or explanatory variables)

Why Linear Regression?
Suppose we want to model the dependent variable Y in terms of three predictors, X1, X2, X3: Y = f(X1, X2, X3). Typically we will not have enough data to estimate f directly, so we usually have to assume it has some restricted form, such as linear: Y = X1 + X2 + X3.

Linear Regression is a Probabilistic Model
Much of mathematics is devoted to studying variables that are deterministically related to one another, as in y = β0 + β1x, where the slope is β1 = Δy/Δx. But we are interested in understanding the relationship between variables related in a nondeterministic fashion.

A Linear Probabilistic Model
Definition: there exist parameters β0, β1, and σ² such that for any fixed value of the independent variable x, the dependent variable is related to x through the model equation
y = β0 + β1x + ε
where ε is a random variable assumed to be N(0, σ²). The line y = β0 + β1x is the true regression line; each observation deviates from it by its own random error ε.
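To make the model concrete, here is a minimal simulation sketch (in Python with NumPy rather than the R used later in the lecture; the parameter values are illustrative) that draws data from y = β0 + β1x + ε:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 7.5, 0.5, 3.0       # illustrative parameter values

x = np.linspace(0, 100, 10_000)
eps = rng.normal(0.0, sigma, size=x.size)  # eps ~ N(0, sigma^2)
y = beta0 + beta1 * x + eps

# For fixed x, Y scatters around the true regression line beta0 + beta1*x
resid = y - (beta0 + beta1 * x)
print(resid.mean())   # approximately 0
print(resid.std())    # approximately sigma = 3
```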

Implications
The expected value of Y is a linear function of X, but for fixed x the variable Y differs from its expected value by a random amount. Formally, let x* denote a particular value of the independent variable x; then our linear probabilistic model says:
E(Y | x*) = μ_{Y|x*} = mean value of Y when x is x*
V(Y | x*) = σ²_{Y|x*} = variance of Y when x is x*

Graphical Interpretation
The true regression line y = β0 + β1x passes through the conditional means: μ_{Y|x1} = β0 + β1x1 and μ_{Y|x2} = β0 + β1x2. For example, if x = height and y = weight, then μ_{Y|x=60} is the average weight of all individuals 60 inches tall in the population.

One More Example
Suppose the relationship between the independent variable height (x) and dependent variable weight (y) is described by a simple linear regression model with true regression line y = 7.5 + 0.5x and σ = 3.
Q1: What is the interpretation of β1 = 0.5? The expected change in weight associated with a 1-unit increase in height.
Q2: If x = 20, what is the expected value of Y? μ_{Y|x=20} = 7.5 + 0.5(20) = 17.5
Q3: If x = 20, what is P(Y > 22)? P(Y > 22 | x = 20) = P(Z > (22 - 17.5)/3) = 1 - Φ(1.5) = 0.067
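The Q3 calculation can be checked numerically; this sketch uses only the Python standard library, with the standard normal CDF written in terms of the error function:

```python
from math import erf, sqrt

def normal_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu = 7.5 + 0.5 * 20          # E(Y | x = 20) = 17.5
z = (22 - mu) / 3            # standardize with sigma = 3
p = 1 - normal_cdf(z)        # P(Y > 22 | x = 20)
print(round(p, 3))           # 0.067
```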

Estimating Model Parameters
Point estimates β̂0 and β̂1 are obtained by the principle of least squares, which minimizes
f(β0, β1) = Σ_{i=1..n} [y_i - (β0 + β1 x_i)]²
The resulting estimates are
β̂1 = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)²
β̂0 = ȳ - β̂1 x̄
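A quick sketch of the least-squares formulas on a tiny made-up data set (the data are hypothetical, chosen to lie exactly on the line y = 2 + 3x so the answer is easy to verify):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.0, 8.0, 11.0, 14.0, 17.0])   # exactly y = 2 + 3x

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # slope estimate
b0 = ybar - b1 * xbar                                           # intercept estimate
print(b0, b1)  # 2.0 3.0
```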

Predicted and Residual Values
Predicted, or fitted, values are the values of y predicted by the least-squares regression line, obtained by plugging x_1, x_2, ..., x_n into the estimated regression line:
ŷ_1 = β̂0 + β̂1 x_1
ŷ_2 = β̂0 + β̂1 x_2
Residuals are the deviations between observed and predicted values:
e_1 = y_1 - ŷ_1
e_2 = y_2 - ŷ_2

Residuals Are Useful!
They allow us to calculate the error sum of squares (SSE):
SSE = Σ_{i=1..n} e_i² = Σ_{i=1..n} (y_i - ŷ_i)²
which in turn allows us to estimate σ²:
σ̂² = SSE / (n - 2)
as well as an important statistic referred to as the coefficient of determination:
r² = 1 - SSE/SST, where SST = Σ_{i=1..n} (y_i - ȳ)²
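These quantities are easy to compute directly; this sketch uses hypothetical data lying near the line y = 2x:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # hypothetical data near y = 2x

xbar, ybar = x.mean(), y.mean()
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar
yhat = b0 + b1 * x                       # fitted values

sse = np.sum((y - yhat) ** 2)            # error sum of squares
sst = np.sum((y - ybar) ** 2)            # total sum of squares
sigma2_hat = sse / (len(x) - 2)          # estimate of sigma^2
r2 = 1 - sse / sst                       # coefficient of determination
print(round(r2, 4))                      # close to 1 for this nearly linear data
```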

Multiple Linear Regression
An extension of the simple linear regression model to two or more independent variables:
y = β0 + β1 x1 + β2 x2 + ... + βk xk + ε
Example: Expression = Baseline + Age + Tissue + Sex + Error
Partial regression coefficients: β_i is the effect on the dependent variable of increasing the i-th independent variable by 1 unit, holding all other predictors constant.

Categorical Independent Variables
Qualitative variables are easily incorporated into the regression framework through dummy variables. Simple example: sex can be coded as 0/1. But what if a categorical variable contains three levels, e.g. a genotype coded as x_i = 0 if AA, 1 if AG, 2 if GG?

Categorical Independent Variables
The previous 0/1/2 coding treats genotype as a quantitative variable, forcing the AG effect to fall exactly halfway between AA and GG. The solution is to set up a series of dummy variables; in general, k levels require k - 1 dummy variables:
x1 = 1 if AA, 0 otherwise
x2 = 1 if AG, 0 otherwise

Genotype  x1  x2
AA         1   0
AG         0   1
GG         0   0
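The dummy coding can be sketched in a few lines of pure Python (the genotype vector is illustrative; GG serves as the reference level, coded x1 = x2 = 0):

```python
genotypes = ["AA", "AG", "GG", "AG", "AA"]

x1 = [1 if g == "AA" else 0 for g in genotypes]   # indicator for AA
x2 = [1 if g == "AG" else 0 for g in genotypes]   # indicator for AG
# GG is the baseline: x1 = 0 and x2 = 0

print(list(zip(genotypes, x1, x2)))
```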

Hypothesis Testing: Model Utility Test (or Omnibus Test)
The first thing we want to know after fitting a model is whether any of the independent variables (X's) are significantly related to the dependent variable (Y):
H0: β1 = β2 = ... = βk = 0
HA: at least one β_i ≠ 0
The test statistic is
f = [R² / k] / [(1 - R²) / (n - (k + 1))]
Rejection region: f > F_{α, k, n-(k+1)}
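As a sketch, the F statistic can be computed directly from R², k, and n (the values below are hypothetical); the result is then compared against an F table with k and n - (k + 1) degrees of freedom:

```python
# Hypothetical fit: R^2 = 0.40 with k = 3 predictors and n = 50 observations
r2, k, n = 0.40, 3, 50

f_stat = (r2 / k) / ((1 - r2) / (n - (k + 1)))
print(round(f_stat, 2))  # compare with F_{alpha, 3, 46} from an F table
```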

Equivalent ANOVA Formulation of Omnibus Test
We can also frame this in our now-familiar ANOVA framework: partition the total variation into two components, SSE (unexplained variation) and SSR (variation explained by the linear model).

Source      df        Sum of Squares        MS                    F
Regression  k         SSR = Σ(ŷ_i - ȳ)²     MSR = SSR/k           MSR/MSE
Error       n-(k+1)   SSE = Σ(y_i - ŷ_i)²   MSE = SSE/(n-(k+1))
Total       n-1       SST = Σ(y_i - ȳ)²

Rejection region: F > F_{α, k, n-(k+1)}

F Test for Subsets of Independent Variables
A powerful tool in multiple regression analyses is the ability to compare two models. For instance, say we want to compare:
Full model: y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + ε
Reduced model: y = β0 + β1 x1 + β2 x2 + ε
This is again an example of ANOVA. Let SSE_R be the error sum of squares for the reduced model with l predictors, and SSE_F the error sum of squares for the full model with k predictors. Then
f = [(SSE_R - SSE_F) / (k - l)] / [SSE_F / (n - (k + 1))]

Example of Model Comparison
We have a quantitative trait and want to test the effects at two markers, M1 and M2, in n = 100 individuals.
Full model: Trait = Mean + M1 + M2 + (M1*M2) + error
Reduced model: Trait = Mean + M1 + M2 + error
f = [(SSE_R - SSE_F) / (3 - 2)] / [SSE_F / (100 - (3 + 1))] = (SSE_R - SSE_F) / (SSE_F / 96)
Rejection region: f > F_{α, 1, 96}
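Plugging hypothetical error sums of squares into the comparison formula:

```python
# Hypothetical sums of squares from fitting both models to n = 100 observations
sse_reduced = 250.0   # reduced model, l = 2 predictors
sse_full = 200.0      # full model, k = 3 predictors
n, k, l = 100, 3, 2

f_stat = ((sse_reduced - sse_full) / (k - l)) / (sse_full / (n - (k + 1)))
print(round(f_stat, 2))  # compare with F_{alpha, 1, 96} from an F table
```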

Hypothesis Tests of Individual Regression Coefficients
Hypothesis tests for each β̂_i can be done with simple t-tests:
H0: β_i = 0
HA: β_i ≠ 0
T = (β̂_i - β_i) / se(β̂_i)
Critical value: t_{α/2, n-(k+1)}
Confidence intervals are equally easy to obtain:
β̂_i ± t_{α/2, n-(k+1)} · se(β̂_i)
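A sketch of the t statistic and confidence interval under H0: β_i = 0, using hypothetical estimates (the critical value 2.052 is t_{0.025, 27} from a standard t table, with df = n - (k + 1) = 27):

```python
# Hypothetical: beta_hat = 0.8 with standard error 0.25, n = 30, k = 2 predictors
beta_hat, se, n, k = 0.8, 0.25, 30, 2

t_stat = beta_hat / se                  # tests H0: beta_i = 0
t_crit = 2.052                          # t_{0.025, 27} from a t table
ci = (beta_hat - t_crit * se, beta_hat + t_crit * se)
print(round(t_stat, 2), ci)             # 0 lies outside the CI, so reject H0
```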

Checking Assumptions
It is critically important to examine the data and check the assumptions underlying the regression model:
- Outliers
- Normality
- Constant variance
- Independence among residuals
Standard diagnostic plots include:
- scatter plots of y versus x_i (outliers)
- QQ plot of residuals (normality)
- residuals versus fitted values (independence, constant variance)
- residuals versus x_i (outliers, constant variance)
We'll explore diagnostic plots in more detail in R.

Fixed vs. Random Effects Models
In ANOVA and regression analyses, our independent variables can be treated as fixed or random.
Fixed effects: variables whose levels are either sampled exhaustively or are the only ones considered relevant to the experimenter.
Random effects: variables whose levels are randomly sampled from a large population of levels.
Example from our recent AJHG paper: Expression = Baseline + Population + Individual + Error