MS&E 226: Small Data Lecture 2: Linear Regression (v3) Ramesh Johari rjohari@stanford.edu September 29, 2017 1 / 36
Summarizing a sample 2 / 36
A sample
Suppose $Y = (Y_1, \ldots, Y_n)$ is a sample of real-valued observations. Simple statistics:
Sample mean: $\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i$.
Sample median: Order the $Y_i$ from lowest to highest. The median is the average of the $(n/2)$th and $(n/2 + 1)$st elements of this list (if $n$ is even), or the $((n+1)/2)$th element of this list (if $n$ is odd). More robust to outliers.
Sample standard deviation: $\hat{\sigma}_Y = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2}$. Measures dispersion of the data. (Why $n - 1$? See homework.)
3 / 36
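The lecture computes these statistics in R; as a language-agnostic check of the formulas themselves, here is a pure-Python sketch on a small made-up sample (the values are hypothetical, not the kidiq data):

```python
# Toy sample (hypothetical values, not the kidiq data).
y = [65, 98, 85, 83, 115, 98]
n = len(y)

# Sample mean.
mean = sum(y) / n

# Sample median: sort, then average the two middle elements when n is even.
ys = sorted(y)
median = (ys[n // 2 - 1] + ys[n // 2]) / 2 if n % 2 == 0 else ys[n // 2]

# Sample standard deviation: note the division by n - 1, not n.
sd = (sum((yi - mean) ** 2 for yi in y) / (n - 1)) ** 0.5
```

These match R's mean(), median(), and sd(), which also use the $n - 1$ convention for the standard deviation.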
Example in R
Children's IQ scores + mothers' characteristics from the National Longitudinal Survey of Youth (via [DAR]). Download from the course site; lives in child.iq/kidiq.dta
> library(foreign)
> kidiq = read.dta("arm_data/child.iq/kidiq.dta")
> mean(kidiq$kid_score)
[1] 86.79724
> median(kidiq$kid_score)
[1] 90
> sd(kidiq$kid_score)
[1] 20.41069
4 / 36
Relationships 5 / 36
Modeling relationships
We focus on a particular type of summarization: modeling the relationship between observations. Formally:
Let $Y_i$, $i = 1, \ldots, n$, be the $i$th observed (real-valued) outcome. Let $Y = (Y_1, \ldots, Y_n)$.
Let $X_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$, be the $i$th observation of the $j$th (real-valued) covariate. Let $X_i = (X_{i1}, \ldots, X_{ip})$.
Let $X$ be the matrix whose rows are the $X_i$.
6 / 36
Pictures and names
How to visualize $Y$ and $X$?
Names for the $Y_i$'s: outcomes, response variables, target variables, dependent variables.
Names for the $X_{ij}$'s: covariates, features, regressors, predictors, explanatory variables, independent variables.
$X$ is also called the design matrix.
7 / 36
Example in R
The kidiq dataset loaded earlier contains the following columns:
kid_score: child's score on IQ test
mom_hs: did mom complete high school?
mom_iq: mother's score on IQ test
mom_work: working mother?
mom_age: mother's age at birth of child
[ Note: Always question how variables are defined! ]
Reasonable question: How is kid_score related to the other variables?
8 / 36
Example in R
> kidiq
  kid_score mom_hs    mom_iq mom_work mom_age
1        65      1 121.11753        4      27
2        98      1  89.36188        4      25
3        85      1 115.44316        4      27
4        83      1  99.44964        3      25
5       115      1  92.74571        4      27
6        98      0 107.90184        1      18
...
We will treat kid_score as our outcome variable.
9 / 36
Continuous variables Variables such as kid_score and mom_iq are continuous variables: they are naturally real-valued. For now we only consider outcome variables that are continuous (like kid_score). Note: even continuous variables can be constrained: Both kid_score and mom_iq must be positive. mom_age must be a positive integer. 10 / 36
Categorical variables
Other variables take on only finitely many values, e.g.:
mom_hs is 1 (resp., 0) if mom did (resp., did not) complete high school
mom_work is a code that ranges from 1 to 4:
1 = did not work in first three years of child's life
2 = worked in 2nd or 3rd year of child's life
3 = worked part-time in first year of child's life
4 = worked full-time in first year of child's life
These are categorical variables (or factors).
11 / 36
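One common way a multi-level code like mom_work enters a linear model is as indicator ("dummy") columns, one per non-baseline level; R's lm() performs an analogous expansion when a column is declared a factor. A small illustrative sketch (toy values, hypothetical encoding):

```python
# Toy mom_work-style codes (hypothetical values, not the kidiq data).
mom_work = [4, 4, 4, 3, 4, 1]
levels = [2, 3, 4]  # level 1 is the baseline, absorbed by the intercept

# One indicator column per non-baseline level: row i gets a 1 in the
# column for its level, and 0s elsewhere.
dummies = [[1.0 if v == lev else 0.0 for lev in levels] for v in mom_work]
```

A baseline-level observation (code 1 here) gets all zeros, so its effect is carried entirely by the intercept.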
Modeling relationships
Goal: Find a functional relationship $f$ such that:
$Y_i \approx f(X_i)$
This is our first example of a model. We use models for lots of things:
Associations and correlations
Predictions
Causal relationships
12 / 36
Linear regression models 13 / 36
Linear relationships
We first focus on modeling the relationship between outcomes and covariates as linear. In other words: find coefficients $\hat{\beta}_0, \ldots, \hat{\beta}_p$ such that:
$Y_i \approx \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip}$.
This is a linear regression model.¹
¹We use hats on variables to denote quantities computed from data. In this case, whatever the coefficients are, they will have to be computed from the data we were given.
14 / 36
Matrix notation
We can compactly represent a linear model using matrix notation:
Let $\hat{\beta} = [\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p]^\top$ be the $(p + 1) \times 1$ column vector of coefficients.
Expand $X$ to have $p + 1$ columns, where the first column (indexed $j = 0$) is $X_{i0} = 1$ for all $i$.
Then the linear regression model is that for each $i$:
$Y_i \approx X_i \hat{\beta}$,
or even more compactly,
$Y \approx X \hat{\beta}$.
15 / 36
Matrix notation
A picture of $Y$, $X$, and $\hat{\beta}$: [Figure]
16 / 36
Example in R
Running pairs(kidiq) gives us a scatterplot matrix of all pairs of variables: kid_score, mom_hs, mom_iq, mom_work, mom_age. [Figure: pairs plot]
Looks like kid_score is positively correlated with mom_iq.
17 / 36
Example in R
Let's build a simple regression model of kid_score against mom_iq.
> fm = lm(formula = kid_score ~ 1 + mom_iq, data = kidiq)
> display(fm)
lm(formula = kid_score ~ 1 + mom_iq, data = kidiq)
coef.est coef.se
(Intercept) 25.80 5.92
mom_iq 0.61 0.06
...
In other words: kid_score ≈ 25.80 + 0.61 × mom_iq.
Note: You can get the display function and other helpers by installing the arm package in R (using install.packages("arm")).
18 / 36
Example in R
Here is the model plotted against the data: [Figure: scatterplot of kid_score vs. mom_iq with the fitted line]
> library(ggplot2)
> ggplot(data = kidiq, aes(x = mom_iq, y = kid_score)) +
    geom_point() + geom_smooth(method = "lm", se = FALSE)
Note: Install the ggplot2 package using install.packages("ggplot2").
19 / 36
Example in R: Multiple regression We can include multiple covariates in our linear model. > fm = lm(data = kidiq, formula = kid_score ~ 1 + mom_iq + mom_hs) > display(fm) lm(formula = kid_score ~ 1 + mom_iq + mom_hs, data = kidiq) coef.est coef.se (Intercept) 25.73 5.88 mom_iq 0.56 0.06 mom_hs 5.95 2.21 (Note that the coefficient on mom_iq is different now...we will discuss why later.) 20 / 36
How to choose $\hat{\beta}$?
There are many ways to choose $\hat{\beta}$. We focus primarily on ordinary least squares (OLS): choose $\hat{\beta}$ so that
$\text{SSE} = \text{sum of squared errors} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2$
is minimized, where
$\hat{Y}_i = X_i \hat{\beta} = \hat{\beta}_0 + \sum_{j=1}^p \hat{\beta}_j X_{ij}$
is the fitted value of the $i$th observation.
This is what R (typically) does when you call lm. (Later in the course we develop one justification for this choice.)
21 / 36
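For the single-covariate case the OLS coefficients have a simple closed form (slope = sample covariance over sample variance of the covariate). This pure-Python sketch, on made-up toy data rather than the kidiq dataset, fits them and lets you check that nearby coefficient choices only increase the SSE:

```python
# Toy data (hypothetical, not the kidiq dataset).
x = [100, 110, 120, 90, 105]
y = [88, 95, 101, 80, 92]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Closed-form OLS for one covariate: slope = cov(x, y) / var(x).
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

def sse(beta0, beta1):
    """Sum of squared errors for a candidate coefficient pair."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))

# The OLS fit should beat any alternative coefficient pair.
best = sse(b0, b1)
```

For example, sse(b0 + 0.5, b1) and sse(b0, b1 + 0.01) both exceed best on this data.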
Questions to ask Here are some important questions to be asking: Is the resulting model a good fit? Does it make sense to use a linear model? Is minimizing SSE the right objective? We start down this road by working through the algebra of linear regression. 22 / 36
Ordinary least squares: Solution 23 / 36
OLS solution
From here on out we assume that $p < n$ and $X$ has full rank $= p + 1$. (What does $p < n$ mean, and why do we need it?)
Theorem. The vector $\hat{\beta}$ that minimizes SSE is given by:
$\hat{\beta} = (X^\top X)^{-1} X^\top Y$.
(Check that the dimensions make sense here: $\hat{\beta}$ is $(p + 1) \times 1$.)
24 / 36
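A minimal sketch of the theorem's formula for $p = 1$ (intercept plus one covariate), with the $2 \times 2$ inverse written out explicitly; the data are made up, and in practice you would use a numerically stable solver rather than an explicit inverse:

```python
# Toy data (hypothetical, not the kidiq dataset).
x = [100, 110, 120, 90, 105]
y = [88, 95, 101, 80, 92]

# Design matrix X with a leading column of 1s (the intercept column).
X = [[1.0, xi] for xi in x]

# X^T X (2x2) and X^T Y (length 2).
xtx = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(2)]

# Explicit inverse of the 2x2 matrix X^T X.
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
inv = [[xtx[1][1] / det, -xtx[0][1] / det],
       [-xtx[1][0] / det, xtx[0][0] / det]]

# beta_hat = (X^T X)^{-1} X^T Y.
beta = [inv[i][0] * xty[0] + inv[i][1] * xty[1] for i in range(2)]
```

On this toy data the result agrees with the cov/var closed form for simple regression.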
OLS solution: Intuition
The SSE is the squared Euclidean norm of $Y - \hat{Y}$:
$\text{SSE} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \|Y - \hat{Y}\|^2 = \|Y - X\hat{\beta}\|^2$.
Note that as we vary $\hat{\beta}$, we range over linear combinations of the columns of $X$. The collection of all such linear combinations is the subspace spanned by the columns of $X$.
So the linear regression question is: What is the closest such linear combination to $Y$?
25 / 36
OLS solution: Geometry 26 / 36
OLS solution: Algebraic proof [ ]
Based on [SM], Exercise 3B14:
Observe that $X^\top X$ is symmetric and invertible. (Why?)
Note that $X^\top \hat{r} = 0$, where $\hat{r} = Y - X\hat{\beta}$ is the vector of residuals: indeed, $X^\top \hat{r} = X^\top Y - X^\top X (X^\top X)^{-1} X^\top Y = 0$. In other words: the residual vector is orthogonal to every column of $X$.
Now consider any $(p + 1) \times 1$ vector $\gamma$. Note that:
$Y - X\gamma = \hat{r} + X(\hat{\beta} - \gamma)$.
Since $\hat{r}$ is orthogonal to the columns of $X$, we get:
$\|Y - X\gamma\|^2 = \|\hat{r}\|^2 + \|X(\hat{\beta} - \gamma)\|^2$.
The preceding value is minimized when $X(\hat{\beta} - \gamma) = 0$. Since $X$ has rank $p + 1$, the preceding equation has the unique solution $\gamma = \hat{\beta}$.
27 / 36
Hat matrix (useful for later) [ ]
Since
$\hat{Y} = X\hat{\beta} = X(X^\top X)^{-1} X^\top Y$,
we have $\hat{Y} = HY$, where:
$H = X(X^\top X)^{-1} X^\top$.
$H$ is called the hat matrix. It projects $Y$ into the subspace spanned by the columns of $X$. It is symmetric and idempotent ($H^2 = H$).
28 / 36
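These projection properties can be checked numerically. This sketch (made-up toy data, intercept plus one covariate) builds $H$ explicitly and exposes its symmetry and idempotence:

```python
# Toy design (hypothetical data): intercept plus one covariate.
x = [100, 110, 120, 90, 105]
y = [88, 95, 101, 80, 92]
n = len(x)
X = [[1.0, xi] for xi in x]

# (X^T X)^{-1} via an explicit 2x2 inverse.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
inv = [[xtx[1][1] / det, -xtx[0][1] / det],
       [-xtx[1][0] / det, xtx[0][0] / det]]

# Hat matrix H = X (X^T X)^{-1} X^T, an n x n projection matrix.
H = [[sum(X[i][a] * inv[a][b] * X[j][b] for a in range(2) for b in range(2))
      for j in range(n)] for i in range(n)]

# Fitted values are Y_hat = H Y.
yhat = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
```

A further property worth noticing: the trace of $H$ equals $p + 1$ (here 2), the number of columns of $X$, a fact that reappears when counting degrees of freedom.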
Residuals and R² 29 / 36
Residuals
Let $\hat{r} = Y - \hat{Y} = Y - X\hat{\beta}$ be the vector of residuals. Our analysis shows us that $\hat{r}$ is orthogonal to every column of $X$.
In particular, $\hat{r}$ is orthogonal to the all-1's vector (the first column of $X$), so:
$\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i = \frac{1}{n} \sum_{i=1}^n \hat{Y}_i = \bar{\hat{Y}}$.
In other words, the residuals sum to zero, and the original and fitted values have the same sample mean.
30 / 36
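A quick numerical check of this slide's claims on made-up toy data: with an intercept in the model, the residuals sum to zero, so $Y$ and $\hat{Y}$ share a sample mean, and the residuals are orthogonal to the covariate column as well:

```python
# Toy data (hypothetical, not the kidiq dataset).
x = [100, 110, 120, 90, 105]
y = [88, 95, 101, 80, 92]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Simple-regression OLS fit (closed form).
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

yhat = [b0 + b1 * xi for xi in x]
resid = [yi - yh for yi, yh in zip(y, yhat)]

# resid sums to (numerically) zero, so mean(yhat) == mean(y);
# resid is also orthogonal to the covariate column x.
```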
Residuals
Since $\hat{r}$ is orthogonal to every column of $X$, it is orthogonal to $\hat{Y} = X\hat{\beta}$ as well, so the Pythagorean theorem gives:
$\|Y\|^2 = \|\hat{r}\|^2 + \|\hat{Y}\|^2$.
Using equality of sample means ($\bar{Y} = \bar{\hat{Y}}$), we get:
$\|Y\|^2 - n\bar{Y}^2 = \|\hat{r}\|^2 + \|\hat{Y}\|^2 - n\bar{\hat{Y}}^2$.
31 / 36
Residuals
How do we interpret
$\|Y\|^2 - n\bar{Y}^2 = \|\hat{r}\|^2 + \|\hat{Y}\|^2 - n\bar{\hat{Y}}^2$?
Note that $\frac{1}{n-1}(\|Y\|^2 - n\bar{Y}^2)$ is the sample variance of $Y$.²
Note that $\frac{1}{n-1}(\|\hat{Y}\|^2 - n\bar{\hat{Y}}^2)$ is the sample variance of $\hat{Y}$.
So this relation suggests how much of the variation in $Y$ is explained by $\hat{Y}$.
²Note that the (adjusted) sample variance is usually defined as $\frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2$. You should check this is equal to the expression on the slide!
32 / 36
R²
Formally:
$R^2 = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{\hat{Y}})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2}$
is a measure of the fit of the model, with $0 \leq R^2 \leq 1$.³
When $R^2$ is large, much of the outcome sample variance is explained by the fitted values.
Note that $R^2$ is an in-sample measurement of fit: we used the data itself to construct a fit to the data.
³Note that this result depends on $\bar{Y} = \bar{\hat{Y}}$, which in turn depends on the fact that the all-1's vector is part of $X$, i.e., that our linear model has an intercept term.
33 / 36
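A sketch on made-up toy data verifying both views of $R^2$: the ratio of sample variances above, and the equivalent $1 - \text{SSE}/\text{SST}$ form (they agree because of the Pythagorean decomposition, which requires an intercept in the model):

```python
# Toy data (hypothetical, not the kidiq dataset).
x = [100, 110, 120, 90, 105]
y = [88, 95, 101, 80, 92]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Simple-regression OLS fit (closed form).
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

# With an intercept, mean(yhat) == ybar, so both sums center on ybar.
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))   # residual sum of squares
ssm = sum((yh - ybar) ** 2 for yh in yhat)             # "explained" sum of squares
sst = sum((yi - ybar) ** 2 for yi in y)                # total sum of squares

r2 = ssm / sst   # equivalently, 1 - sse / sst
```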
Example in R The full output of our model earlier includes R 2 : > fm = lm(data = kidiq, formula = kid_score ~ 1 + mom_iq) > display(fm) lm(formula = kid_score ~ 1 + mom_iq, data = kidiq) coef.est coef.se (Intercept) 25.80 5.92 mom_iq 0.61 0.06 --- n = 434, k = 2 residual sd = 18.27, R-Squared = 0.20 Note: residual sd is the sample standard deviation of the residuals. 34 / 36
Example in R
We can plot the residuals for our earlier model: [Figure: residuals(fm) vs. fitted(fm)]
> fm = lm(data = kidiq, formula = kid_score ~ 1 + mom_iq)
> plot(fitted(fm), residuals(fm))
> abline(0, 0)
Note: We generally plot residuals against fitted values, not the original outcomes. You will investigate why on your next problem set.
35 / 36
Questions What do you hope to see when you plot the residuals? Why might R 2 be high, yet the model fit poorly? Why might R 2 be low, and yet the model be useful? What happens to R 2 if we add additional covariates to the model? 36 / 36