Simple linear regression
Business Statistics 41000, Fall 2015

Topics
1. conditional distributions, squared error, means and variances
2. linear prediction
3. signal + noise and R² goodness of fit
4. homoscedasticity, normality of errors, and prediction intervals
5. hypothesis tests and confidence intervals for regression parameters
6. the lm() command in R
7. uses of regression
Reading: OpenIntro ch. 7, Naked Statistics excerpt.

Predicting birth weight, given mother's age
Consider trying to predict the birth weight of a newborn baby. Does knowing the mother's age help us at all? In the language of random variables, if Y = birth weight and X = mother's age, is Y independent of X? To answer this question, we simply look at the conditional distribution of Y | X = x for different ages, x = {18, 19, 20, ..., 45}. If the distributions differ by age, that means that Y has a statistical dependence on X. In terms of our very large NBER data set, this means making density plots, disaggregated by mother's age.

Predicting birth weight, given mother's age
[Figure: densities of birth weight in grams for 18-year-old vs 35-year-old mothers]
So, age definitely matters. But how do we turn these density plots into point predictions?

Loss criteria: squared error
We have seen that making decisions in random settings means trying to do well on average, and that quantifying this requires picking a utility function. In prediction settings, it is common to refer to a loss function, which gauges how far off our prediction was. We seek to make the prediction that gives the minimum average loss. A commonly studied loss function is called squared error loss: (ŷ - y)², where y is a realization of our random variable Y and ŷ (pronounced "y-hat") is our prediction.

The mean is the best prediction under squared error
The reason squared error (ŷ - y)² is so widely used is that the optimal prediction takes a particularly convenient form. To minimize our average squared error loss, we just set ŷ = E(Y). In other words, if we think this loss function is reasonable, we only need to worry about finding the mean of Y!

The variance then measures predictability
If we go back to the definition of variance and squint hard, we see that it can be interpreted as the average squared error when using the mean as our prediction:
V(Y) = Σ_y Pr(Y = y) (ŷ - y)² = Σ_y p(y) (E(Y) - y)².
This means that if I had to predict infant birth weight, I would predict E(Y) = 3367 grams. Furthermore, this prediction will give me average, or mean, squared error equal to V(Y) = 327463.5. This seems too large, because it is in grams squared. For this reason we often talk about root mean squared error, which is √V(Y). In this case that is 572.24 grams.
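
A minimal R sketch of this calculation, assuming the bw data frame (with a BirthWt column) that is loaded in the R slides below:

ybar <- mean(bw$BirthWt)              # the best single-number prediction under squared error
mse  <- mean((ybar - bw$BirthWt)^2)   # average squared error of that prediction (the variance, up to n vs n - 1)
sqrt(mse)                             # root mean squared error, back on the grams scale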

Conditional prediction
So what happened to X? Didn't mother's age matter? Indeed, it did. If we did not know the mother's age, we would go with E(Y) as our birthweight prediction. If we do know the mother's age, then we should predict with the conditional mean E(Y | X = x). How well we are able to predict at each age is then measured by the conditional variance V(Y | X = x). We can figure out our overall prediction error by taking a weighted average of the variance at each age.
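
A sketch of the conditional version in R, again assuming the bw data frame (with BirthWt and Age columns) used in the R slides below:

cond_mean <- tapply(bw$BirthWt, bw$Age, mean)   # E(Y | X = x) for each observed age
cond_var  <- tapply(bw$BirthWt, bw$Age, var)    # V(Y | X = x) for each observed age
w <- table(bw$Age) / nrow(bw)                   # proportion of mothers at each age
sum(w * cond_var)                               # weighted average of the conditional variances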

Birthweight vs age: E(Y | X = x)
[Figure: mean birth weight in grams vs mother's age (20 to 45)]
What's going on at the higher age range?

Birthweight vs age: E(Y | X = x) ± 2σ_x
[Figure: conditional mean birth weight plus/minus two conditional standard deviations, by mother's age]
Why does σ_x have an x subscript?

NBA height and weight: E(Y | X = x)
[Figure: NBA player weight in lbs vs height in inches, with conditional means]
A few heights have only one observation. Is that problematic?

Square feet and house price
[Figure: house price in dollars vs square feet]
Many data sets have no shared values of predictor variables. What then?

Square feet and house price: E(Y | X = x)
[Figure: house price in dollars vs square feet, with binned conditional means]
We can always bin the observations and compute the expected value within each bin. We call this discretizing a continuous covariate.
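
One way the binning step might look in R, assuming the house data frame (with Price and SqFt columns) that is loaded in the R slides below:

bins <- cut(house$SqFt, breaks = 5)    # discretize square footage into 5 equal-width bins
tapply(house$Price, bins, mean)        # conditional mean price within each bin
tapply(house$Price, bins, length)      # how many houses fall in each bin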

Linear prediction
For every unique value of our predictor variable X = x, we can try to determine E(Y | X = x), and this is the best prediction we can make according to mean squared error. Sometimes, it is more convenient to have a simple rule that takes a value of x and produces a prediction ŷ_x directly, without worrying if that guess is the best possible. A particularly simple and popular type of rule is a linear prediction rule: ŷ_x = a + bx for two numbers a and b.

Lines
Remember that a is called the intercept of the line and b is called the slope. Slope is often thought of as "rise over run": it is the number of units up the line moves for every unit of x we move to the right.
[Figure: a straight line plotted over x from 0 to 5]
How do we decide what slope and intercept to use for prediction?

Linear prediction
It turns out that the optimal linear prediction rule ŷ_x = a + bx according to mean squared error can be found using the following formulas:
b = cor(X, Y) · (σ_Y / σ_X)   and   a = E(Y) - b·E(X).
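
These formulas are easy to check numerically; a sketch in R, assuming the bw data frame used in the R slides below:

b <- cor(bw$Age, bw$BirthWt) * sd(bw$BirthWt) / sd(bw$Age)   # cor(X, Y) * sigma_Y / sigma_X
a <- mean(bw$BirthWt) - b * mean(bw$Age)                     # E(Y) - b * E(X)
c(intercept = a, slope = b)
coef(lm(BirthWt ~ Age, data = bw))                           # the same numbers, from lm()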

Least squares
Applying this rule to a data set gives us the least squares line. "Least squares" refers to the fact that we are using the mean squared error as our criterion. Applied to a data set (an empirical distribution), mean squared prediction error looks like this:
E([Ŷ - Y]²) = (1/n) Σ_{i=1}^n (ŷ_i - y_i)².
For a linear prediction rule with intercept a and slope b this can be written
(1/n) Σ_{i=1}^n (a + b·x_i - y_i)².
Finding the best linear predictor based on observed data is called linear regression.

Mother's age versus birthweight
The least squares line-of-best-fit for this subset of the birthweight data is a = 3177 and b = 5.575.
[Figure: birth weight in grams vs mother's age, with the fitted line]
What are the units of a? How about the units of b?

NBA height versus weight
The least squares line-of-best-fit for the NBA data is a = -293.33 and b = 6.513.
[Figure: weight in lbs vs height in inches, with the fitted line]
What would you predict for a player who is 6 feet (72 inches) tall?

Square feet versus sale price
The least squares line-of-best-fit for the housing data is a = -10091 and b = 70.23.
[Figure: price in dollars vs square feet, with the fitted line]
What would a house with zero square feet cost according to this prediction rule?

Signal + noise
Prediction rules can help us think about individual observations in terms of a trend component and an error component, a signal and some noise. The trend component is just the prediction part, ŷ_i. The error component is whatever is left over: e_i = y_i - ŷ_i. In the case of a least-squares line with intercept a and slope b our observed errors, or residuals, are just e_i = y_i - (a + b·x_i). In practical settings, what might a large residual suggest about a particular observation? For the birthweight data? For the NBA data? For the housing data?
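
In R, the residuals can be computed directly from the definition or extracted from a fitted lm object; a sketch using the birthweight fit from the R slides below:

bwfit <- lm(BirthWt ~ Age, data = bw)
e <- bw$BirthWt - fitted(bwfit)                        # e_i = y_i - (a + b * x_i)
all.equal(as.numeric(e), as.numeric(resid(bwfit)))     # identical to the stored residuals
head(order(abs(e), decreasing = TRUE))                 # indices of the largest residuals, worth a look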

Investigating big residuals
Residuals allow us to define outliers, conditional on an observed predictor value.
[Figure: residuals in dollars vs square feet for the housing data]
Here, we find a house that is approximately $55K above the price trend for a house of its size. It turns out to be a four bedroom brick house in a nice neighborhood.

Investigating big residuals
[Figure: NBA weight in lbs vs height in inches]
Turns out Shaq was a big fella, even for someone his height.

Sum of squared residuals
Using the notion of residuals, we can think about our mean squared error (in-sample) as simply (1/n) Σ_i e_i². It would be nice if the residuals were all small, but what counts as small depends on the units of X and Y. To cook up a unit-less measure of "small", consider this fact:
Σ_i (y_i - ȳ)² = Σ_i (a + b·x_i - ȳ)² + Σ_i e_i².
The left-hand side is the total variation in Y (n times the sample variance). The first term on the right is the portion of that variation due to the trend line. The second term is the portion due to the residuals.

Decomposing the variance
Σ_i (y_i - ȳ)² = Σ_i (a + b·x_i - ȳ)² + Σ_i e_i².
[Figure: house price vs square feet, with the fitted line and the mean]
The variation about the mean breaks down as the part due to the trend line (ORANGE) and the part due to the residuals (BLUE).

R² measure of goodness-of-fit
Taking this idea further we can think of this equation as
Σ_i (y_i - ȳ)² = Σ_i (a + b·x_i - ȳ)² + Σ_i e_i²,
that is, sum-of-squares-total = sum-of-squares-regression + sum-of-squares-error, or SST = SSR + SSE. Accordingly, a unit-less measure of the residuals being small is the ratio
R² = variation explained / total variation = SSR / SST.

R² is the squared correlation
Let's do a little algebra:
R² = Σ_i (a + b·x_i - ȳ)² / Σ_i (y_i - ȳ)²
   = Σ_i (a + b·x_i - a - b·x̄)² / Σ_i (y_i - ȳ)²
   = b² Σ_i (x_i - x̄)² / Σ_i (y_i - ȳ)²
   = b² σ_x² / σ_y²
   = ρ².
If you define the best linear prediction line first, then this is a really nice way to think about correlation: its square is the fraction of the variation in one variable that is attributable to its linear trend in the other variable.
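
A quick numerical check of the decomposition and of R² = ρ², sketched in R with the birthweight data:

bwfit <- lm(BirthWt ~ Age, data = bw)   # the birthweight fit used in the R slides below
y    <- bw$BirthWt
yhat <- fitted(bwfit)
SST  <- sum((y - mean(y))^2)
SSR  <- sum((yhat - mean(y))^2)
SSE  <- sum(resid(bwfit)^2)
c(SST, SSR + SSE)                # the two should match
c(SSR / SST, cor(bw$Age, y)^2)   # R-squared equals the squared correlation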

Homoscedastic, normal prediction model
There's nothing that says our residuals have to be homoscedastic, meaning the noise is about the same size for any given predictor value. Likewise, the conditional distribution P(Y | X = x) need not be a normal distribution. It is nonetheless common to make these assumptions. To the extent that they are reasonable, we can construct confidence intervals for our predictions using
Y | X = x ~ N(a + bx, s²),
where s² = (1/n) Σ_i e_i² is the sample mean squared error. Notice that s² does not depend on x.
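
A small sketch of s² in R, assuming the bwfit object fitted in the R slides below; note that R's reported "residual standard error" divides by n - 2 rather than n, so it will differ very slightly:

s2 <- mean(resid(bwfit)^2)   # (1/n) * sum of squared residuals, as defined above
sqrt(s2)                     # close to, but not identical to, the residual standard error from summary()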

Normal linear prediction density
For 18 year old mothers, the normal linear prediction model seems reasonable:
[Figure: observed density vs normal linear prediction density of birth weight in grams]

Normal linear prediction density
For 45 year old mothers, the assumptions appear to break down:
[Figure: observed density vs normal linear prediction density of birth weight in grams]
Note, the red curve is based on only 75 data points; however, that is enough to tell that the mean is probably off.

Prediction intervals
Based on the prediction model Y | X = x ~ N(a + bx, s²), we can construct prediction intervals by taking plus/minus 1, 2, 3 (etc.) standard deviations. For example, a 95% prediction interval for the price of a house of 2,150 square feet is given by
-10.09 + 70.23(2.15) ± 2(22.48),
where our units are thousands of dollars and thousands of square feet.
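
The arithmetic behind this interval, sketched in R (units as in the slide: thousands of dollars and thousands of square feet):

yhat <- -10.09 + 70.23 * 2.15   # point prediction for a 2,150 square foot house
yhat + c(-2, 2) * 22.48         # 95% prediction interval, roughly 96 to 186 thousand dollars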

Prediction intervals
A 99.7% prediction interval for the weight of a 6'8" NBA player is given by
-293.33 + 6.51(80) ± 3(14.76),
where our units are pounds and inches. A 99.7% interval for a five foot player would be
-293.33 + 6.51(60) ± 3(14.76).
Why might we not trust this one as much?

In-sample versus out-of-sample prediction
When we create a least-squares linear predictor from data, we're finding the best linear predictor for observations sampled at random from our database. When we find the sample correlation, sample standard deviations, and sample means, we are building a linear predictor that is optimal for the empirical distribution. Usually, we are not playing this stylized game of predicting random points from our already-observed data. Rather, we want to predict well out-of-sample.

Sampling variability
What we want is the best linear predictor for the population. We can think of the data-based least squares linear predictor as an estimate of the desired linear predictor. As usual, the hope is that our sample data looks reasonably close to what other such samples would look like. As we get more and more data, our estimate should get closer and closer to the truth (the estimate should get close to the estimand). As usual, the question is: how can we gauge how close we are?

Sampling variability
An intuitive, mechanical way to simulate sampling variability is bootstrapping. This means that we sample our observed data (with replacement) to get a pseudo-sample. We then fit the least-squares line to this pseudo-sample and store the resulting estimate. We then repeat thousands of times.
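
A minimal bootstrap sketch in R, assuming the nba data frame that is loaded in the R slides below:

set.seed(41000)                      # for reproducibility
B <- 2000                            # number of bootstrap pseudo-samples
boot_coefs <- matrix(NA, nrow = B, ncol = 2)
for (i in 1:B) {
  rows <- sample(nrow(nba), replace = TRUE)                       # resample rows with replacement
  boot_coefs[i, ] <- coef(lm(weight ~ height, data = nba[rows, ]))
}
apply(boot_coefs, 2, sd)             # bootstrap standard errors for the intercept and slope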

Bootstrapped linear predictor
[Figure: bootstrapped least-squares lines for the NBA height (inches) and weight (pounds) data]

Sampling distribution for linear predictor coefficients
It turns out that the coefficients (the intercept and the slope) of the least-squares linear predictor are approximately normally distributed. As with other normal distribution-based hypothesis tests and confidence intervals, all that we need are our standard errors, the standard deviations of the sampling distributions. Fortunately, most software provides these standard errors for us (and much more). So before we look at the details of doing statistical tests on the parameters of a linear prediction model, let us take a look at what the R output looks like.

Regression in R
To load in the birthweight data:
bw <- read.csv(url("http://classpage/birthdata.csv"), header = TRUE)
To fit the linear regression use the lm() command:
bwfit <- lm(BirthWt ~ Age, data = bw)
Here BirthWt is the response variable, or outcome variable, and Age is the predictor variable, or covariate, or regressor, or feature.

Regression in R
To see the regression object, simply type its name at the command line:
> bwfit
Call:
lm(formula = BirthWt ~ Age, data = bw)
Coefficients:
(Intercept)          Age
   3105.275        9.594
Why are these fitted values different from the ones on slide 18?

Regression in R
We can extract more information using the summary() command:
> summary(bwfit)
Call:
lm(formula = BirthWt ~ Age, data = bw)
Residuals:
    Min      1Q  Median      3Q     Max
-3225.2  -308.0    28.1   357.2  3315.8
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 3105.2747     6.1987  500.95   <2e-16 ***
Age            9.5935     0.2211   43.38   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 569.6 on 203847 degrees of freedom
Multiple R-squared:  0.009149,  Adjusted R-squared:  0.009144
F-statistic:  1882 on 1 and 203847 DF,  p-value: < 2.2e-16

Regression in R
We can also call the anova() command, which stands for analysis of variance.
> anova(bwfit)
Analysis of Variance Table
Response: BirthWt
              Df     Sum Sq   Mean Sq F value    Pr(>F)
Age            1 6.1069e+08 610687971  1882.1 < 2.2e-16 ***
Residuals 203847 6.6142e+10    324469
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let's check that the sum-of-squares regression divided by the sum-of-squares total equals R² from the previous slide.
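
The suggested check, sketched in R using the anova object above:

av <- anova(bwfit)
ss <- av[["Sum Sq"]]   # the regression and residual sums of squares, in table order
ss[1] / sum(ss)        # about 0.0091, matching the Multiple R-squared from summary(bwfit)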

Regression in R
To load in the NBA height/weight data:
nba <- read.table(url("http://classpage/nba.txt"), header = TRUE)
To fit the linear regression use the lm() command:
nbafit <- lm(weight ~ height, data = nba)
Here weight is the response variable, or outcome variable, and height is the predictor variable, or covariate, or regressor, or feature.

Regression in R
> summary(nbafit)
Call:
lm(formula = weight ~ height, data = nba)
Residuals:
    Min      1Q  Median      3Q     Max
-49.211  -9.698   0.328   9.257  64.738
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -293.3295    17.8254  -16.46   <2e-16 ***
height         6.5128     0.2257   28.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.76 on 322 degrees of freedom
Multiple R-squared:  0.7211,  Adjusted R-squared:  0.7203
F-statistic: 832.6 on 1 and 322 DF,  p-value: < 2.2e-16

Prediction interval
Let's make a prediction interval from this regression output. We said before that a 99.7% prediction interval for the weight of a 6'8" NBA player is given by
-293.33 + 6.51(80) ± 3(14.76),
but where did these numbers come from? Referring back to the previous slide, a = -293.33 is the intercept, b = 6.51 is the slope coefficient, and s = 14.76 is the residual standard error.
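
The same interval can be reproduced in R; predict() gives a slightly wider interval because it also accounts for uncertainty in the estimated coefficients:

yhat <- -293.33 + 6.51 * 80     # point prediction at 80 inches
yhat + c(-3, 3) * 14.76         # plus/minus three residual standard deviations
predict(nbafit, newdata = data.frame(height = 80),
        interval = "prediction", level = 0.997)   # R's built-in version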

Regression in R
To load in the housing data:
house <- read.table(url("http://classpage/midcity.txt"), header = TRUE)
To fit the linear regression use the lm() command:
housefit <- lm(Price ~ SqFt, data = house)
Here Price is the response variable, or outcome variable, and SqFt is the predictor variable, or covariate, or regressor, or feature.

Regression in R
> summary(housefit)
Call:
lm(formula = Price ~ SqFt, data = house)
Residuals:
   Min     1Q Median     3Q    Max
-46.59 -16.64  -1.61  15.12  54.83
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -10.091     18.966  -0.532    0.596
SqFt          70.226      9.426   7.450  1.3e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.48 on 126 degrees of freedom
Multiple R-squared:  0.3058,  Adjusted R-squared:  0.3003
F-statistic:  55.5 on 1 and 126 DF,  p-value: 1.302e-11

Confidence interval
Let's make a confidence interval for the true slope coefficient β from this regression output. For a 95% confidence interval we take the point estimate and then take plus/minus 1.96 standard errors:
70.226 ± 1.96(9.43) = (51.74 to 88.71).
Referring back to the previous slide, b = 70.226 is the estimated slope coefficient, and se_b = 9.43 is the associated standard error.
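
In R, the same interval can be computed by hand or with confint(), which uses t rather than normal quantiles and so is very slightly wider:

70.226 + c(-1.96, 1.96) * 9.426   # by hand: roughly (51.7, 88.7)
confint(housefit, level = 0.95)   # intervals for both the intercept and the slope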

Confidence interval
What about a confidence interval for the true intercept parameter α? For a 95% confidence interval we take the point estimate of the intercept and then take plus/minus 1.96 standard errors:
-10 ± 1.96(18.97) = (-47.18 to 27.18).
Referring back to the previous slide, a = -10 is the estimated intercept, and se_a = 18.97 is the associated standard error. Can we reject the null hypothesis that α = 0?

p-values
R is set up to automatically consider two-sided hypothesis tests that the parameters of the linear predictor are exactly zero. To this end, it reports the associated p-values by computing z-scores. For example, for the intercept α, we find the test statistic
z = (a - 0) / se_a.
For the housing data, this is
z = (a - 0) / se_a = (-10 - 0) / 18.97 = -0.527.
To find the p-value we use 2·pnorm(-0.527) = 0.598. This differs only slightly from the output from R (due mainly to rounding).

p-values
Now let's do the slope parameter:
z = (b - 0) / se_b = (70.226 - 0) / 9.426 = 7.45.
R calls these test statistics t statistics because the adjustment to the normal that it makes for small sample sizes yields a t distribution. In any case, seven and a half standard deviations gives a very tiny p-value. Why does the null hypothesis of β = 0 make sense for the slope parameter; in what sense is zero special?
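
The two calculations above, sketched in R:

z_int <- (-10.091 - 0) / 18.966   # intercept test statistic
2 * pnorm(-abs(z_int))            # about 0.595, close to R's reported 0.596
z_slope <- (70.226 - 0) / 9.426   # slope test statistic
2 * pnorm(-abs(z_slope))          # essentially zero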

Applications of regression
It is no exaggeration to say that linear regression is used everywhere: credit ratings, law enforcement (racial profiling!), recommender systems (Amazon and Netflix), professional sports analytics (Moneyball), hedge fund performance evaluation, medical diagnostics, and on and on...

Super Crunchers

Netflix

Moneyball

3 Rules

Warren Buffett versus John Maynard Keynes
A popular financial model is called the market model. Essentially, you consider the returns of a particular investor, fund, or stock as the response variable in a linear regression with the S&P 500 (or another large market index) as the predictor variable. One may then interpret the linear regression coefficients, the intercept α and the slope β, as excess returns and a risk factor. What do you suppose is meant by these terms? In the next homework, you will look at the investing records of two titans of finance and see which one seems to have the better performance.

Correlation does not imply causation
So what does that mean for us? Think about this: straight teeth are probably associated with the price of the car a person drives. For prediction, this is all well and good. What would you think about a parent who buys her kid a new Lambo because she wants to fix his fugly grill?