MS&E 226: Small Data
Lecture 2: Linear Regression (v3)
Ramesh Johari (rjohari@stanford.edu)
September 29, 2017

Summarizing a sample

A sample

Suppose Y = (Y_1, ..., Y_n) is a sample of real-valued observations. Simple statistics:

Sample mean: Ȳ = (1/n) ∑_{i=1}^n Y_i.

Sample median: order the Y_i from lowest to highest; the median is the average of the (n/2)th and (n/2 + 1)st elements of this list (if n is even), or the ((n + 1)/2)th element (if n is odd). More robust to outliers.

Sample standard deviation: σ̂_Y = √( (1/(n − 1)) ∑_{i=1}^n (Y_i − Ȳ)² ). Measures dispersion of the data. (Why n − 1? See homework.)

Example in R

Children's IQ scores + mothers' characteristics, from the National Longitudinal Survey of Youth (via [DAR]). Download from the course site; it lives in child.iq/kidiq.dta.

> library(foreign)
> kidiq = read.dta("arm_data/child.iq/kidiq.dta")
> mean(kidiq$kid_score)
[1] 86.79724
> median(kidiq$kid_score)
[1] 90
> sd(kidiq$kid_score)
[1] 20.41069

Relationships

Modeling relationships

We focus on a particular type of summarization: modeling the relationship between observations.

Formally:
Let Y_i, i = 1, ..., n, be the ith observed (real-valued) outcome, and let Y = (Y_1, ..., Y_n).
Let X_ij, i = 1, ..., n, j = 1, ..., p, be the ith observation of the jth (real-valued) covariate. Let X_i = (X_i1, ..., X_ip), and let X be the matrix whose rows are the X_i.

Pictures and names

How to visualize Y and X?

Names for the Y_i's: outcomes, response variables, target variables, dependent variables.

Names for the X_ij's: covariates, features, regressors, predictors, explanatory variables, independent variables.

X is also called the design matrix.

Example in R

The kidiq dataset loaded earlier contains the following columns:

kid_score   Child's score on IQ test
mom_hs      Did mom complete high school?
mom_iq      Mother's score on IQ test
mom_work    Working mother?
mom_age     Mother's age at birth of child

[ Note: Always question how variables are defined! ]

Reasonable question: How is kid_score related to the other variables?

Example in R

> kidiq
  kid_score mom_hs    mom_iq mom_work mom_age
1        65      1 121.11753        4      27
2        98      1  89.36188        4      25
3        85      1 115.44316        4      27
4        83      1  99.44964        3      25
5       115      1  92.74571        4      27
6        98      0 107.90184        1      18
...

We will treat kid_score as our outcome variable.

Continuous variables

Variables such as kid_score and mom_iq are continuous variables: they are naturally real-valued.

For now we only consider outcome variables that are continuous (like kid_score).

Note: even continuous variables can be constrained:
Both kid_score and mom_iq must be positive.
mom_age must be a positive integer.

Categorical variables

Other variables take on only finitely many values, e.g.:

mom_hs is 1 (resp., 0) if mom did (resp., did not) complete high school.

mom_work is a code that ranges from 1 to 4:
1 = did not work in first three years of child's life
2 = worked in 2nd or 3rd year of child's life
3 = worked part-time in first year of child's life
4 = worked full-time in first year of child's life

These are categorical variables (or factors).
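
As a side note, one way to tell R that a coded variable like mom_work is categorical rather than numeric is to convert it to a factor. A minimal sketch, assuming the kidiq data frame loaded earlier; the column name mom_work_f and the level labels are illustrative shorthand, not part of the original data:

# Sketch: encode mom_work as a categorical variable (factor) instead of a number.
# Assumes kidiq was loaded earlier; mom_work_f and the labels are illustrative.
kidiq$mom_work_f <- factor(kidiq$mom_work,
                           levels = 1:4,
                           labels = c("no_work_yr1_3", "work_yr2_3",
                                      "part_time_yr1", "full_time_yr1"))
table(kidiq$mom_work_f)   # counts of observations in each category
# In a formula such as kid_score ~ mom_work_f, lm() would then create
# indicator (dummy) columns for the categories rather than using the codes 1-4.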

Modeling relationships

Goal: Find a functional relationship f such that:

Y_i ≈ f(X_i).

This is our first example of a model. We use models for lots of things:
Associations and correlations
Predictions
Causal relationships

Linear regression models

Linear relationships

We first focus on modeling the relationship between outcomes and covariates as linear. In other words: find coefficients β̂_0, ..., β̂_p such that: [1]

Y_i ≈ β̂_0 + β̂_1 X_i1 + ... + β̂_p X_ip.

This is a linear regression model.

[1] We use hats on variables to denote quantities computed from data. In this case, whatever the coefficients are, they will have to be computed from the data we were given.

Matrix notation

We can compactly represent a linear model using matrix notation:

Let β̂ = [β̂_0, β̂_1, ..., β̂_p] be the (p + 1) × 1 column vector of coefficients.

Expand X to have p + 1 columns, where the first column (indexed j = 0) is X_i0 = 1 for all i.

Then the linear regression model is that for each i:

Y_i ≈ X_i β̂,

or even more compactly,

Y ≈ X β̂.

Matrix notation

A picture of Y, X, and β̂: [figure omitted in transcription; see the sketch below]
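
In lieu of the picture, here is a minimal sketch of what Y and the design matrix X look like in R for the kidiq example, using model.matrix() to build X (including the leading column of 1's); this assumes kidiq is loaded as above:

# Sketch: the outcome vector Y and design matrix X for the kidiq example.
# Assumes kidiq was loaded earlier.
Y <- kidiq$kid_score
X <- model.matrix(~ 1 + mom_iq + mom_hs, data = kidiq)
head(X)    # first column "(Intercept)" is all 1's; remaining columns are covariates
dim(X)     # n rows and p + 1 columns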

Example in R

Running pairs(kidiq) gives us this plot: [scatterplot matrix of kid_score, mom_hs, mom_iq, mom_work, and mom_age; omitted in transcription]

Looks like kid_score is positively correlated with mom_iq.

Example in R

Let's build a simple regression model of kid_score against mom_iq.

> fm = lm(formula = kid_score ~ 1 + mom_iq, data = kidiq)
> display(fm)
lm(formula = kid_score ~ 1 + mom_iq, data = kidiq)
            coef.est coef.se
(Intercept) 25.80    5.92
mom_iq       0.61    0.06
...

In other words: kid_score ≈ 25.80 + 0.61 × mom_iq.

Note: You can get the display function and other helpers by installing the arm package in R (using install.packages("arm")).

Example in R

Here is the model plotted against the data: [scatterplot of kid_score vs. mom_iq with the fitted line; omitted in transcription]

> library(ggplot2)
> ggplot(data = kidiq, aes(x = mom_iq, y = kid_score)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

Note: Install the ggplot2 package using install.packages("ggplot2").

Example in R: Multiple regression

We can include multiple covariates in our linear model.

> fm = lm(data = kidiq, formula = kid_score ~ 1 + mom_iq + mom_hs)
> display(fm)
lm(formula = kid_score ~ 1 + mom_iq + mom_hs, data = kidiq)
            coef.est coef.se
(Intercept) 25.73    5.88
mom_iq       0.56    0.06
mom_hs       5.95    2.21

(Note that the coefficient on mom_iq is different now...we will discuss why later.)

How to choose β̂?

There are many ways to choose β̂. We focus primarily on ordinary least squares (OLS):

Choose β̂ so that

SSE = sum of squared errors = ∑_{i=1}^n (Y_i − Ŷ_i)²

is minimized, where

Ŷ_i = X_i β̂ = β̂_0 + ∑_{j=1}^p β̂_j X_ij

is the fitted value of the ith observation.

This is what R (typically) does when you call lm. (Later in the course we develop one justification for this choice.)
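
As a quick illustration (not part of the original slides), the SSE of a fitted model can be computed directly; a sketch, assuming kidiq is loaded as above:

# Sketch: compute the SSE that OLS minimizes, for the fit of kid_score on mom_iq.
# Assumes kidiq was loaded earlier.
fm  <- lm(kid_score ~ 1 + mom_iq, data = kidiq)
sse <- sum((kidiq$kid_score - fitted(fm))^2)   # sum of squared errors
sse                                            # equivalently: sum(residuals(fm)^2)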

Questions to ask

Here are some important questions to be asking:
Is the resulting model a good fit?
Does it make sense to use a linear model?
Is minimizing SSE the right objective?

We start down this road by working through the algebra of linear regression.

Ordinary least squares: Solution

OLS solution

From here on out we assume that p < n and X has full rank p + 1. (What does p < n mean, and why do we need it?)

Theorem. The vector β̂ that minimizes SSE is given by:

β̂ = (Xᵀ X)⁻¹ Xᵀ Y.

(Check that the dimensions make sense here: β̂ is (p + 1) × 1.)
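
As a sanity check (a sketch, assuming kidiq is loaded as above), we can compute (Xᵀ X)⁻¹ Xᵀ Y directly and compare it with the coefficients returned by lm():

# Sketch: closed-form OLS solution vs. lm()'s coefficients.
# Assumes kidiq was loaded earlier.
X <- model.matrix(~ 1 + mom_iq + mom_hs, data = kidiq)
y <- kidiq$kid_score
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves (X'X) beta = X'y
cbind(beta_hat, coef(lm(kid_score ~ 1 + mom_iq + mom_hs, data = kidiq)))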

OLS solution: Intuition

The SSE is the squared Euclidean norm of Y − Ŷ:

SSE = ∑_{i=1}^n (Y_i − Ŷ_i)² = ‖Y − Ŷ‖² = ‖Y − Xβ̂‖².

Note that as we vary β̂ we range over linear combinations of the columns of X. The collection of all such linear combinations is the subspace spanned by the columns of X.

So the linear regression question is: What is the closest such linear combination to Y?

OLS solution: Geometry

[figure omitted in transcription]

OLS solution: Algebraic proof

Based on [SM], Exercise 3B14:

Observe that Xᵀ X is symmetric and invertible. (Why?)

Note that Xᵀ r̂ = 0, where r̂ = Y − Xβ̂ is the vector of residuals. (To check this, substitute the formula for β̂: Xᵀ r̂ = Xᵀ Y − Xᵀ X(Xᵀ X)⁻¹ Xᵀ Y = 0.) In other words: the residual vector is orthogonal to every column of X.

Now consider any vector γ that is (p + 1) × 1. Note that:

Y − Xγ = r̂ + X(β̂ − γ).

Since r̂ is orthogonal to X, we get:

‖Y − Xγ‖² = ‖r̂‖² + ‖X(β̂ − γ)‖².

The preceding value is minimized when X(β̂ − γ) = 0. Since X has rank p + 1, the preceding equation has the unique solution γ = β̂.

Hat matrix (useful for later)

Since

Ŷ = Xβ̂ = X(Xᵀ X)⁻¹ Xᵀ Y,

we have Ŷ = HY, where

H = X(Xᵀ X)⁻¹ Xᵀ.

H is called the hat matrix. It projects Y into the subspace spanned by the columns of X. It is symmetric and idempotent (H² = H).
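
A minimal numerical sketch of these properties, assuming kidiq is loaded as above (differences should be zero up to floating-point rounding):

# Sketch: build the hat matrix for the simple regression and check its properties.
# Assumes kidiq was loaded earlier.
X <- model.matrix(~ 1 + mom_iq, data = kidiq)
H <- X %*% solve(t(X) %*% X) %*% t(X)         # n x n hat matrix
max(abs(H - t(H)))                            # ~ 0: H is symmetric
max(abs(H %*% H - H))                         # ~ 0: H is idempotent (H^2 = H)
fm <- lm(kid_score ~ 1 + mom_iq, data = kidiq)
max(abs(H %*% kidiq$kid_score - fitted(fm)))  # ~ 0: HY equals the fitted values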

Residuals and R²

Residuals

Let r̂ = Y − Ŷ = Y − Xβ̂ be the vector of residuals. Our analysis shows us that r̂ is orthogonal to every column of X. In particular, r̂ is orthogonal to the all-1's vector (the first column of X), so:

Ȳ = (1/n) ∑_{i=1}^n Y_i = (1/n) ∑_{i=1}^n Ŷ_i.

In other words, the residuals sum to zero, and the original and fitted values have the same sample mean.
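
These facts are easy to check numerically; a sketch, assuming kidiq is loaded as above:

# Sketch: residuals are orthogonal to the columns of X, sum to zero,
# and Y and Yhat have the same sample mean. Assumes kidiq was loaded earlier.
fm <- lm(kid_score ~ 1 + mom_iq, data = kidiq)
X  <- model.matrix(fm)
r  <- residuals(fm)
t(X) %*% r                                   # ~ 0 for each column of X
sum(r)                                       # ~ 0: residuals sum to zero
c(mean(kidiq$kid_score), mean(fitted(fm)))   # identical sample means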

Residuals

Since r̂ is orthogonal to every column of X, we use the Pythagorean theorem to get:

‖Y‖² = ‖r̂‖² + ‖Ŷ‖².

Using equality of the sample means we get:

‖Y‖² − nȲ² = ‖r̂‖² + ‖Ŷ‖² − nȲ².
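
A quick numerical check of this decomposition (a sketch, using the same fit as before):

# Sketch: check ||Y||^2 = ||r||^2 + ||Yhat||^2 for the kid_score ~ mom_iq fit.
# Assumes kidiq was loaded earlier.
fm <- lm(kid_score ~ 1 + mom_iq, data = kidiq)
y  <- kidiq$kid_score
c(sum(y^2), sum(residuals(fm)^2) + sum(fitted(fm)^2))   # the two sides should agree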

Residuals

How do we interpret

‖Y‖² − nȲ² = ‖r̂‖² + ‖Ŷ‖² − nȲ² ?

Note that (1/(n − 1)) (‖Y‖² − nȲ²) is the sample variance of Y. [2]
Note that (1/(n − 1)) (‖Ŷ‖² − nȲ²) is the sample variance of Ŷ.

So this relation suggests how much of the variation in Y is explained by Ŷ.

[2] Note that the (adjusted) sample variance is usually defined as (1/(n − 1)) ∑_{i=1}^n (Y_i − Ȳ)². You should check this is equal to the expression on the slide!

R²

Formally:

R² = ∑_{i=1}^n (Ŷ_i − Ȳ)² / ∑_{i=1}^n (Y_i − Ȳ)²

is a measure of the fit of the model, with 0 ≤ R² ≤ 1. [3] When R² is large, much of the outcome sample variance is explained by the fitted values.

Note that R² is an in-sample measurement of fit: we used the data itself to construct a fit to the data.

[3] Note that this result depends on Ȳ also being the sample mean of the fitted values, which in turn depends on the fact that the all-1's vector is part of X, i.e., that our linear model has an intercept term.
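
We can also compute R² directly from this definition and compare it with the value R reports; a sketch, assuming kidiq is loaded as above:

# Sketch: R^2 from its definition vs. the value reported by summary().
# Assumes kidiq was loaded earlier.
fm   <- lm(kid_score ~ 1 + mom_iq, data = kidiq)
y    <- kidiq$kid_score
yhat <- fitted(fm)
r2   <- sum((yhat - mean(y))^2) / sum((y - mean(y))^2)
c(r2, summary(fm)$r.squared)                 # both about 0.20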

Example in R

The full output of our model earlier includes R²:

> fm = lm(data = kidiq, formula = kid_score ~ 1 + mom_iq)
> display(fm)
lm(formula = kid_score ~ 1 + mom_iq, data = kidiq)
            coef.est coef.se
(Intercept) 25.80    5.92
mom_iq       0.61    0.06
---
n = 434, k = 2
residual sd = 18.27, R-Squared = 0.20

Note: residual sd is the sample standard deviation of the residuals.

Example in R

We can plot the residuals for our earlier model against the fitted values: [plot of residuals(fm) vs. fitted(fm); omitted in transcription]

> fm = lm(data = kidiq, formula = kid_score ~ 1 + mom_iq)
> plot(fitted(fm), residuals(fm))
> abline(0, 0)

Note: We generally plot residuals against fitted values, not the original outcomes. You will investigate why on your next problem set.

Questions

What do you hope to see when you plot the residuals?
Why might R² be high, yet the model fit poorly?
Why might R² be low, and yet the model be useful?
What happens to R² if we add additional covariates to the model?