Lecture 4: Simple Linear Regression Models, with Hints at Their Estimation
36-401, Fall 2017, Section B

1 The Simple Linear Regression Model

Let's recall the simple linear regression model from last time. This is a statistical model with two variables $X$ and $Y$, where we try to predict $Y$ from $X$. The assumptions of the model are as follows:

1. The distribution of $X$ is arbitrary (and perhaps $X$ is even non-random).
2. If $X = x$, then $Y = \beta_0 + \beta_1 x + \epsilon$, for some constants ("coefficients", "parameters") $\beta_0$ and $\beta_1$, and some random noise variable $\epsilon$.
3. $E[\epsilon \mid X = x] = 0$ (no matter what $x$ is), $Var[\epsilon \mid X = x] = \sigma^2$ (no matter what $x$ is).
4. $\epsilon$ is uncorrelated across observations.

To elaborate, with multiple data points $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, the model says that, for each $i \in 1:n$,

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \qquad (1)$$

where the noise variables $\epsilon_i$ all have the same expectation (0) and the same variance ($\sigma^2$), and $Cov[\epsilon_i, \epsilon_j] = 0$ (unless $i = j$, of course).

1.1 Plug-In Estimates

In lecture 1, we saw that the optimal linear predictor of $Y$ from $X$ has slope $\beta_1 = Cov[X, Y]/Var[X]$, and intercept $\beta_0 = E[Y] - \beta_1 E[X]$. A common tactic in devising estimators is to use what's sometimes called the plug-in principle, where we find equations for the parameters which would hold if we knew the full distribution, and plug in the sample versions of the population quantities. We saw this in the last lecture, where we estimated $\beta_1$ by the ratio of the sample covariance to the sample variance:

$$\hat{\beta}_1 = \frac{c_{XY}}{s^2_X} \qquad (2)$$
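To make the plug-in recipe concrete, here is a minimal sketch of computing $\hat{\beta}_0$ and $\hat{\beta}_1$ from sample moments. (This is in Python, while these notes' own examples use R, and the data points are invented for illustration.)

```python
# Plug-in estimates of the simple-linear-regression coefficients:
# replace the population covariance and variance with their sample versions.
# (Illustrative sketch; the data below are made up.)

def plug_in_estimates(x, y):
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    c_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    s2_x = sum((xi - x_bar) ** 2 for xi in x) / n
    beta1_hat = c_xy / s2_x                # slope: sample cov / sample var
    beta0_hat = y_bar - beta1_hat * x_bar  # intercept
    return beta0_hat, beta1_hat

# Noise-free check: points lying exactly on the line y = 2 + 3x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 + 3.0 * xi for xi in x]
b0, b1 = plug_in_estimates(x, y)
print(b0, b1)  # recovers 2.0 and 3.0
```

On noise-free data the plug-in estimates recover the line's coefficients exactly; with noisy data they recover them only approximately, which is what the convergence results below are about.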
We also saw, in the notes to the last lecture, that so long as the law of large numbers holds,

$$\hat{\beta}_1 \rightarrow \beta_1 \qquad (3)$$

as $n \rightarrow \infty$. It follows easily that

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \qquad (4)$$

will also converge on $\beta_0$.

1.2 Least Squares Estimates

An alternative way of estimating the simple linear regression model starts from the objective we are trying to reach, rather than from the formula for the slope. Recall, from lecture 1, that the true optimal slope and intercept are the ones which minimize the mean squared error:

$$(\beta_0, \beta_1) = \operatorname*{argmin}_{(b_0, b_1)} E\left[(Y - (b_0 + b_1 X))^2\right] \qquad (5)$$

This is a function of the complete distribution, so we can't get it from data, but we can approximate it with data. The in-sample, empirical, or training MSE is

$$\widehat{MSE}(b_0, b_1) \equiv \frac{1}{n}\sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2 \qquad (6)$$

Notice that this is a function of $b_0$ and $b_1$; it is also, of course, a function of the data, $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, but we will generally suppress that in our notation. If our samples are all independent, for any fixed $(b_0, b_1)$, the law of large numbers tells us that $\widehat{MSE}(b_0, b_1) \rightarrow MSE(b_0, b_1)$ as $n \rightarrow \infty$. So it doesn't seem unreasonable to try minimizing the in-sample error, which we can compute, as a proxy for minimizing the true MSE, which we can't. Where does it lead us?

Start by taking the derivatives with respect to the intercept and the slope:

$$\frac{\partial \widehat{MSE}}{\partial b_0} = \frac{1}{n}\sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))(-2) \qquad (7)$$

$$\frac{\partial \widehat{MSE}}{\partial b_1} = \frac{1}{n}\sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))(-2x_i) \qquad (8)$$

Set these to zero at the optimum $(\hat{\beta}_0, \hat{\beta}_1)$:

$$\frac{1}{n}\sum_{i=1}^{n} (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)) = 0 \qquad (9)$$

$$\frac{1}{n}\sum_{i=1}^{n} (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)) x_i = 0 \qquad (10)$$
These are often called the normal equations for least-squares estimation, or the estimating equations: a system of two equations in two unknowns, whose solution gives the estimate. Many people would, at this point, remove the factor of $1/n$, but I think it makes it easier to understand the next steps. The first equation, re-written, gives

$$\bar{y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{x} = 0 \qquad (11)$$

and the second

$$\overline{xy} - \hat{\beta}_0 \bar{x} - \hat{\beta}_1 \overline{x^2} = 0 \qquad (12)$$

From Eq. 11,

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \qquad (13)$$

Substituting this into the remaining equation,

$$0 = \overline{xy} - \bar{y}\bar{x} + \hat{\beta}_1 \bar{x}\bar{x} - \hat{\beta}_1 \overline{x^2} \qquad (14)$$

$$0 = c_{XY} - \hat{\beta}_1 s^2_X \qquad (15)$$

$$\hat{\beta}_1 = \frac{c_{XY}}{s^2_X} \qquad (16)$$

That is, the least-squares estimate of the slope is our old friend the plug-in estimate of the slope, and thus the least-squares intercept is also the plug-in intercept.

Going forward: The equivalence between the plug-in estimator and the least-squares estimator is a bit of a special case for linear models. In some non-linear models, least squares is quite feasible (though the optimum can only be found numerically, not in closed form); in others, plug-in estimates are more useful than optimization.

1.3 Bias, Variance and Standard Error of Parameter Estimates

Whether we think of it as deriving from plugging in or from least squares, we can work out some of the properties of this estimator of the coefficients, using the model assumptions. We'll start with the slope, $\hat{\beta}_1$.
$$\hat{\beta}_1 = \frac{c_{XY}}{s^2_X} \qquad (17)$$

$$= \frac{\frac{1}{n}\sum_i x_i y_i - \bar{x}\bar{y}}{s^2_X} \qquad (18)$$

$$= \frac{\frac{1}{n}\sum_i x_i(\beta_0 + \beta_1 x_i + \epsilon_i) - \bar{x}(\beta_0 + \beta_1 \bar{x} + \bar{\epsilon})}{s^2_X} \qquad (19)$$

$$= \frac{\beta_0 \bar{x} + \beta_1 \overline{x^2} + \frac{1}{n}\sum_i x_i \epsilon_i - \bar{x}\beta_0 - \beta_1 \bar{x}^2 - \bar{x}\bar{\epsilon}}{s^2_X} \qquad (20)$$

$$= \beta_1 + \frac{\frac{1}{n}\sum_i x_i \epsilon_i - \bar{x}\bar{\epsilon}}{s^2_X} \qquad (21)$$

Since $\bar{x}\bar{\epsilon} = \frac{1}{n}\sum_i \bar{x}\epsilon_i$,

$$\hat{\beta}_1 = \beta_1 + \frac{\frac{1}{n}\sum_i (x_i - \bar{x})\epsilon_i}{s^2_X} \qquad (22)$$

This representation of the slope estimate shows that it is equal to the true slope ($\beta_1$) plus something which depends on the noise terms (the $\epsilon_i$, and their sample average $\bar{\epsilon}$). We'll use this to find the expected value and the variance of the estimator $\hat{\beta}_1$.

In the next couple of paragraphs, I am going to treat the $x_i$ as non-random variables. This is appropriate in designed or controlled experiments, where we get to choose their values. In randomized experiments or in observational studies, obviously the $x_i$ aren't necessarily fixed; however, these expressions will be correct for the conditional expectation $E[\hat{\beta}_1 \mid x_1, \ldots, x_n]$ and conditional variance $Var[\hat{\beta}_1 \mid x_1, \ldots, x_n]$, and I will come back to how we get the unconditional expectation and variance.

Expected value and bias: Recall that $E[\epsilon_i \mid X_i] = 0$, so

$$\frac{1}{s^2_X}\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}) E[\epsilon_i] = 0 \qquad (23)$$

Thus,

$$E[\hat{\beta}_1] = \beta_1 \qquad (24)$$

Since the bias of an estimator is the difference between its expected value and the truth, $\hat{\beta}_1$ is an unbiased estimator of the optimal slope.
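The representation of $\hat{\beta}_1$ as the true slope plus a weighted sum of the noise terms is an exact algebraic identity, so it can be checked numerically. A hedged sketch (Python rather than this course's R; the dataset, seed, and parameter values are all made up): generate data with known $\beta_0$, $\beta_1$, and noise, then compare the fitted slope to the true slope plus the weighted-noise term.

```python
# Check the identity beta1_hat = beta1 + [ (1/n) sum_i (x_i - xbar) eps_i ] / s2_x
# on one made-up dataset where we know the true coefficients and the noise.
import random

random.seed(0)
beta0, beta1 = 2.0, 3.0
x = [random.uniform(0, 5) for _ in range(20)]
eps = [random.gauss(0, 1) for _ in range(20)]
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
s2_x = sum((xi - xbar) ** 2 for xi in x) / n
beta1_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n / s2_x

# Right-hand side of the representation: true slope plus weighted noise
rhs = beta1 + sum((xi - xbar) * ei for xi, ei in zip(x, eps)) / n / s2_x

print(beta1_hat, rhs)  # equal up to floating-point rounding
```

Because the identity holds for every realization of the noise, not just on average, the two numbers agree to machine precision.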
(To repeat what I'm sure you remember from mathematical statistics: "bias" here is a technical term, meaning no more and no less than $E[\hat{\beta}_1] - \beta_1$. An unbiased estimator could still make systematic mistakes; for instance, it could underestimate 99% of the time, provided that the 1% of the time it over-estimates, it does so by much more than it under-estimates. Moreover, unbiased estimators are not necessarily superior to biased ones: the total error depends on both the bias of the estimator and its variance, and there are many situations where you can remove lots of bias at the cost of adding a little variance. Least squares for simple linear regression happens not to be one of them, but you shouldn't expect that as a general rule.)

Turning to the intercept,

$$E[\hat{\beta}_0] = E\left[\bar{Y} - \hat{\beta}_1 \bar{X}\right] \qquad (25)$$
$$= \beta_0 + \beta_1 \bar{X} - E[\hat{\beta}_1]\bar{X} \qquad (26)$$
$$= \beta_0 + \beta_1 \bar{X} - \beta_1 \bar{X} \qquad (27)$$
$$= \beta_0 \qquad (28)$$

so it, too, is unbiased.

Variance and Standard Error: Using the formula for the variance of a sum from lecture 1, and the model assumption that all the $\epsilon_i$ are uncorrelated with each other,

$$Var[\hat{\beta}_1] = Var\left[\beta_1 + \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})\epsilon_i}{s^2_X}\right] \qquad (29)$$

$$= \frac{Var\left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})\epsilon_i\right]}{(s^2_X)^2} \qquad (30)$$

$$= \frac{\frac{1}{n^2}\sum_{i=1}^{n}(x_i - \bar{x})^2 Var[\epsilon_i]}{(s^2_X)^2} \qquad (31)$$

$$= \frac{\sigma^2 s^2_X / n}{(s^2_X)^2} \qquad (32)$$

$$= \frac{\sigma^2}{n s^2_X} \qquad (33)$$

In words, this says that the variance of the slope estimate goes up as the noise around the regression line ($\sigma^2$) gets bigger, and goes down as we have more observations ($n$), which are further spread out along the horizontal axis ($s^2_X$); it should not be surprising that it's easier to work out the slope of a line from many, well-separated points on the line than from a few points smushed together.

The standard error of an estimator is just its standard deviation, or the square root of its variance:

$$se(\hat{\beta}_1) = \frac{\sigma}{\sqrt{n s^2_X}} \qquad (34)$$
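The formula $Var[\hat{\beta}_1] = \sigma^2/(n s^2_X)$ can be checked by simulation, holding the $x_i$ fixed across replications (the conditional-on-$x$ setting). This is an illustrative Python sketch; the design, sample size, and parameter values are invented, and the Gaussian noise is just a convenient choice, since the variance formula itself needs no Gaussianity.

```python
# Monte Carlo check of Var(beta1_hat) = sigma^2 / (n * s2_x),
# holding the x_i fixed across replications. Made-up design and parameters.
import random

random.seed(1)
beta0, beta1, sigma = 2.0, 3.0, 1.5
x = [i / 10 for i in range(50)]          # fixed design, n = 50
n = len(x)
xbar = sum(x) / n
s2_x = sum((xi - xbar) ** 2 for xi in x) / n

def fit_slope(y):
    ybar = sum(y) / n
    c_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n
    return c_xy / s2_x

slopes = []
for _ in range(5000):
    y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
    slopes.append(fit_slope(y))

mean_slope = sum(slopes) / len(slopes)
var_slope = sum((b - mean_slope) ** 2 for b in slopes) / len(slopes)
theory = sigma ** 2 / (n * s2_x)
print(mean_slope, var_slope, theory)  # mean near beta1; variances close
```

The simulated variance agrees with the formula up to Monte Carlo error, and the simulated mean of the slope estimates sits near the true slope, illustrating unbiasedness at the same time.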
I will leave working out the variance of $\hat{\beta}_0$ as an exercise.

Unconditional-on-X Properties: The last few paragraphs, as I said, have looked at the expectation and variance of $\hat{\beta}_1$ conditional on $x_1, \ldots, x_n$, either because the $x$'s really are non-random (e.g., controlled by us), or because we're just interested in conditional inference. If we do care about unconditional properties, then we still need to find $E[\hat{\beta}_1]$ and $Var[\hat{\beta}_1]$, not just $E[\hat{\beta}_1 \mid x_1, \ldots, x_n]$ and $Var[\hat{\beta}_1 \mid x_1, \ldots, x_n]$. Fortunately, this is easy, so long as the simple linear regression model holds. To get the unconditional expectation, we use the "law of total expectation":

$$E[\hat{\beta}_1] = E\left[E[\hat{\beta}_1 \mid X_1, \ldots, X_n]\right] \qquad (35)$$
$$= E[\beta_1] = \beta_1 \qquad (36)$$

That is, the estimator is unconditionally unbiased. To get the unconditional variance, we use the "law of total variance":

$$Var[\hat{\beta}_1] = E\left[Var[\hat{\beta}_1 \mid X_1, \ldots, X_n]\right] + Var\left[E[\hat{\beta}_1 \mid X_1, \ldots, X_n]\right] \qquad (37)$$
$$= E\left[\frac{\sigma^2}{n s^2_X}\right] + Var[\beta_1] \qquad (38)$$
$$= \frac{\sigma^2}{n} E\left[\frac{1}{s^2_X}\right] \qquad (39)$$

(since $\beta_1$ is a constant, $Var[\beta_1] = 0$).

1.4 Parameter Interpretation; Causality

Two of the parameters are easy to interpret.

$\sigma^2$ is the variance of the noise around the regression line; $\sigma$ is a typical distance of a point from the line. ("Typical" here in a special sense: it's the root-mean-squared distance, rather than, say, the average absolute distance.)

$\beta_0$ is simply the expected value of $Y$ when $X$ is 0, $E[Y \mid X = 0]$. The point $X = 0$ usually has no special significance, but this setting does ensure that the line goes through the point $(E[X], E[Y])$.

The interpretation of the slope is both very straightforward and very tricky. Mathematically, it's easy to convince yourself that, for any $x$,

$$\beta_1 = E[Y \mid X = x + 1] - E[Y \mid X = x] \qquad (40)$$

or, for any $x_1, x_2$,

$$\beta_1 = \frac{E[Y \mid X = x_2] - E[Y \mid X = x_1]}{x_2 - x_1} \qquad (41)$$

This is just saying that the slope of a line is rise/run.
The tricky part is that we have a very strong, natural tendency to interpret this as telling us something about causation: "If we change $X$ by 1, then on average $Y$ will change by $\beta_1$." This interpretation is usually completely unsupported by the analysis. If I use an old-fashioned mercury thermometer, the height of mercury in the tube usually has a nice linear relationship with the temperature of the room the thermometer is in. This linear relationship goes both ways, so we could regress temperature ($Y$) on mercury height ($X$). But if I manipulate the height of the mercury (say, by changing the ambient pressure, or shining a laser into the tube, etc.), changing the height $X$ will not, in fact, change the temperature outside.

The right way to interpret $\beta_1$ is not as the result of a change, but as an expected difference. The correct catch-phrase would be something like "If we select two sets of cases from the un-manipulated distribution where $X$ differs by 1, we expect $Y$ to differ by $\beta_1$." This covers the thermometer example, and every other one I can think of. It is, I admit, much more inelegant than "If $X$ changes by 1, $Y$ changes by $\beta_1$ on average", but it has the advantage of being true, which the other does not. There are circumstances where regression can be a useful part of causal inference, but we will need a lot more tools to grasp them; that will come towards the end of the course.

2 The Gaussian-Noise Simple Linear Regression Model

We have, so far, assumed comparatively little about the noise term $\epsilon$. The advantage of this is that our conclusions apply to lots of different situations; the drawback is that there's really not all that much more to say about our estimator $\hat{\beta}$ or our predictions than we've already gone over. If we made more detailed assumptions about $\epsilon$, we could make more precise inferences. There are lots of forms of distributions for $\epsilon$ which we might contemplate, and which are compatible with the assumptions of the simple linear regression model (Figure 1). The one which has become the most common over the last two centuries is to assume $\epsilon$ follows a Gaussian distribution.
The result is the Gaussian-noise simple linear regression model:[1]

1. The distribution of $X$ is arbitrary (and perhaps $X$ is even non-random).
2. If $X = x$, then $Y = \beta_0 + \beta_1 x + \epsilon$, for some constants ("coefficients", "parameters") $\beta_0$ and $\beta_1$, and some random noise variable $\epsilon$.
3. $\epsilon \sim N(0, \sigma^2)$, independent of $X$.

[1] Our textbook, rather old-fashionedly, calls this the "normal error" model rather than "Gaussian noise". I dislike this: "normal" is an over-loaded word in math, while "Gaussian" is (comparatively) specific; "error" made sense in Gauss's original context of modeling, specifically, errors of observation, but is misleading generally; and calling Gaussian distributions "normal" suggests they are much more common than they really are.
par(mfrow=c(2,3))
curve(dnorm(x), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))
curve(exp(-abs(x))/2, from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))
curve(sqrt(pmax(0,1-x^2))/(pi/2), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))
curve(dt(x,3), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))
curve(dgamma(x+1.5, shape=1.5, scale=1), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))
curve(0.5*dgamma(x+1.5, shape=1.5, scale=1) + 0.5*dgamma(0.5-x, shape=0.5, scale=1), from=-3, to=3, xlab=expression(epsilon), ylab="", ylim=c(0,1))
par(mfrow=c(1,1))

Figure 1: Some possible noise distributions for the simple linear regression model, since all have $E[\epsilon] = 0$, and could get any variance by scaling. (The model is even compatible with each observation taking $\epsilon$ from a different distribution.) From top left to bottom right: Gaussian; double-exponential ("Laplacian"); circular distribution; $t$ with 3 degrees of freedom; a gamma distribution (shape 1.5, scale 1) shifted to have mean 0; mixture of two gammas with shape 1.5 and shape 0.5, each off-set to have expectation 0. The first three were all used as error models in the 18th and 19th centuries. (See the source file for how to get the code below the figure.)
4. $\epsilon$ is independent across observations.

You will notice that these assumptions are strictly stronger than those of the simple linear regression model. More exactly, the first two assumptions are the same, while the third and fourth assumptions of the Gaussian-noise model imply the corresponding assumptions of the other model. This means that everything we have done so far directly applies to the Gaussian-noise model. On the other hand, the stronger assumptions let us say more. They tell us, exactly, the probability distribution for $Y$ given $X$, and so will let us get exact distributions for predictions and for other inferential statistics.

Why the Gaussian noise model? Why should we think that the noise around the regression line would follow a Gaussian distribution, independent of $X$? There are two big reasons.

1. Central limit theorem. The noise might be due to adding up the effects of lots of little random causes, all nearly independent of each other and of $X$, where each of the effects is of roughly similar magnitude. Then the central limit theorem will take over, and the distribution of the sum of effects will indeed be pretty Gaussian. For Gauss's original context, $X$ was (simplifying) "Where is such-and-such a planet in space?", $Y$ was "Where does an astronomer record the planet as appearing in the sky?", and noise came from defects in the telescope, eye-twitches, atmospheric distortions, etc., etc., so this was pretty reasonable. It is clearly not a universal truth of nature, however, nor even something we should expect to hold true as a general rule, as the name "normal" suggests.

2. Mathematical convenience. Assuming Gaussian noise lets us work out a very complete theory of inference and prediction for the model, with lots of closed-form answers to questions like "What is the optimal estimate of the variance?" or "What is the probability that we'd see a fit this good from a line with a non-zero intercept if the true line goes through the origin?", etc., etc.
Answering such questions without the Gaussian-noise assumption needs somewhat more advanced techniques, and much more advanced computing; we'll get to it towards the end of the class.

2.1 Visualizing the Gaussian Noise Model

The Gaussian noise model gives us not just an expected value for $Y$ at each $x$, but a whole conditional distribution for $Y$ at each $x$. To visualize it, then, it's not enough to just sketch a curve; we need a three-dimensional surface, showing, for each combination of $x$ and $y$, the probability density of $Y$ around that $y$ given that $x$. Figure 2 illustrates.

2.2 Maximum Likelihood vs. Least Squares

As you remember from your mathematical statistics class, the likelihood of a parameter value on a data set is the probability density at the data under
x.seq <- seq(from=-1, to=3, length.out=50)
y.seq <- seq(from=-5, to=15, length.out=50)
cond.pdf <- function(x,y) { dnorm(y, mean=10-5*x, sd=0.5) }
z <- outer(x.seq, y.seq, cond.pdf)
persp(x.seq, y.seq, z, ticktype="detailed", phi=75, xlab="x", ylab="y", zlab=expression(p(y|x)), cex.axis=0.8)

Figure 2: Illustrating how the conditional pdf of $Y$ varies as a function of $X$, for a hypothetical Gaussian noise simple linear regression where $\beta_0 = 10$, $\beta_1 = -5$, and $\sigma^2 = (0.5)^2$. The perspective is adjusted so that we are looking nearly straight down from above on the surface. (Can you find a better viewing angle?) See help(persp) for the 3D plotting (especially the examples), and help(outer) for the outer function, which takes all combinations of elements from two vectors and pushes them through a function. How would you modify this so that the regression line went through the origin with a slope of 4/3 and a standard deviation of 5?
those parameters. We could not work with the likelihood with the simple linear regression model, because it didn't specify enough about the distribution to let us calculate a density. With the Gaussian-noise model, however, we can write down a likelihood.[2]

By the model's assumptions, if we think the parameters are $b_0, b_1, s^2$ (reserving the Greek letters for their true values), then $Y \mid X = x \sim N(b_0 + b_1 x, s^2)$, and $Y_i$ and $Y_j$ are independent given $X_i$ and $X_j$, so the over-all likelihood is

$$\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi s^2}} e^{-\frac{(y_i - (b_0 + b_1 x_i))^2}{2s^2}} \qquad (42)$$

As usual, we work with the log-likelihood, which gives us the same information[3] but replaces products with sums:

$$L(b_0, b_1, s^2) = -\frac{n}{2}\log 2\pi - n \log s - \frac{1}{2s^2}\sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2 \qquad (43)$$

We recall from mathematical statistics that when we've got a likelihood function, we generally want to maximize it. That is, we want to find the parameter values which make the data we observed as likely, as probable, as the model will allow. (This may not be very likely; that's another issue.) We recall from calculus that one way to maximize is to take derivatives and set them to zero.

$$\frac{\partial L}{\partial b_0} = -\frac{1}{2s^2}\sum_{i=1}^{n} 2(y_i - (b_0 + b_1 x_i))(-1) \qquad (44)$$

$$\frac{\partial L}{\partial b_1} = -\frac{1}{2s^2}\sum_{i=1}^{n} 2(y_i - (b_0 + b_1 x_i))(-x_i) \qquad (45)$$

Notice that when we set these derivatives to zero, all the multiplicative constants (in particular, the prefactor of $\frac{2}{2s^2}$) go away. We are left with

$$\sum_{i=1}^{n} (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)) = 0 \qquad (46)$$

$$\sum_{i=1}^{n} (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)) x_i = 0 \qquad (47)$$

These are, up to a factor of $1/n$, exactly the equations we got from the method of least squares (Eqs. 9-10). That means that the least squares solution is the maximum likelihood estimate under the Gaussian noise model; this is no coincidence.[4]

[2] Strictly speaking, this is a conditional (on $X$) likelihood; but only pedants use the adjective in this context.
[3] Why is this?
[4] It's no coincidence because, to put it somewhat anachronistically, what Gauss did was ask himself "for what distribution of the noise would least squares maximize the likelihood?".
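The equivalence can also be seen numerically: for fixed $s^2$ the log-likelihood is a decreasing function of the in-sample MSE, so the least-squares fit maximizes it over $(b_0, b_1)$, and the likelihood over $s^2$ peaks at the in-sample MSE. A hedged Python sketch (the data and the perturbation grid are invented, and this is not part of the notes' own R code):

```python
# Check numerically that the least-squares fit maximizes the Gaussian
# log-likelihood: perturbing (b0, b1) or rescaling s2 never raises L.
# Illustrative sketch with made-up data.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 2.9, 5.3, 6.8, 9.1, 11.2]
n = len(x)

def log_lik(b0, b1, s2):
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * n * math.log(s2) - sse / (2 * s2)

# Least-squares estimates via the plug-in formulas
xbar, ybar = sum(x) / n, sum(y) / n
b1_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
b0_hat = ybar - b1_hat * xbar
s2_hat = sum((yi - (b0_hat + b1_hat * xi)) ** 2 for xi, yi in zip(x, y)) / n

best = log_lik(b0_hat, b1_hat, s2_hat)
worse = [log_lik(b0_hat + d0, b1_hat + d1, s2_hat)
         for d0 in (-0.5, 0, 0.5) for d1 in (-0.2, 0, 0.2) if (d0, d1) != (0, 0)]
print(best, max(worse))  # best exceeds every perturbed value
```

The same check at $1.5\,\hat{s}^2$ and $0.5\,\hat{s}^2$ (with $(b_0, b_1)$ held at the least-squares values) shows the likelihood also drops when the variance moves away from the in-sample MSE, anticipating the derivative-in-$s$ calculation below.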
Now let's take the derivative with respect to $s$:

$$\frac{\partial L}{\partial s} = -\frac{n}{s} + \frac{2}{2s^3}\sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2 \qquad (48)$$

Setting this to 0 at the optimum, and multiplying through by $\hat{\sigma}^3/n$, we get

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i))^2 \qquad (49)$$

Notice that the right-hand side is just the in-sample mean squared error.

Other models: Maximum likelihood estimates of the regression curve coincide with least-squares estimates when the noise around the curve is additive, Gaussian, of constant variance, and both independent of $X$ and of other noise terms. These were all assumptions we used in setting up a log-likelihood which was, up to constants, proportional to the (negative) mean squared error. If any of those assumptions fail, maximum likelihood and least squares estimates can diverge, though sometimes the MLE solves a generalized least squares problem (as we'll see later in this course).

Exercises

To think through, not to hand in.

1. Show that if $E[\epsilon \mid X = x] = 0$ for all $x$, then $Cov[X, \epsilon] = 0$. Would this still be true if $E[\epsilon \mid X = x] = a$ for some other constant $a$?

2. Find the variance of $\hat{\beta}_0$. Hint: Do you need to worry about covariance between $\bar{Y}$ and $\hat{\beta}_1$?
More informationCEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering
CEE 5 Autum 005 Ucertaity Cocepts for Geotechical Egieerig Basic Termiology Set A set is a collectio of (mutually exclusive) objects or evets. The sample space is the (collectively exhaustive) collectio
More informationThis is an introductory course in Analysis of Variance and Design of Experiments.
1 Notes for M 384E, Wedesday, Jauary 21, 2009 (Please ote: I will ot pass out hard-copy class otes i future classes. If there are writte class otes, they will be posted o the web by the ight before class
More informationLecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting
Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would
More informationAAEC/ECON 5126 FINAL EXAM: SOLUTIONS
AAEC/ECON 5126 FINAL EXAM: SOLUTIONS SPRING 2015 / INSTRUCTOR: KLAUS MOELTNER This exam is ope-book, ope-otes, but please work strictly o your ow. Please make sure your ame is o every sheet you re hadig
More informationECO 312 Fall 2013 Chris Sims LIKELIHOOD, POSTERIORS, DIAGNOSING NON-NORMALITY
ECO 312 Fall 2013 Chris Sims LIKELIHOOD, POSTERIORS, DIAGNOSING NON-NORMALITY (1) A distributio that allows asymmetry differet probabilities for egative ad positive outliers is the asymmetric double expoetial,
More informationLecture 24 Floods and flood frequency
Lecture 4 Floods ad flood frequecy Oe of the thigs we wat to kow most about rivers is what s the probability that a flood of size will happe this year? I 100 years? There are two ways to do this empirically,
More informationChapter 6 Sampling Distributions
Chapter 6 Samplig Distributios 1 I most experimets, we have more tha oe measuremet for ay give variable, each measuremet beig associated with oe radomly selected a member of a populatio. Hece we eed to
More informationA quick activity - Central Limit Theorem and Proportions. Lecture 21: Testing Proportions. Results from the GSS. Statistics and the General Population
A quick activity - Cetral Limit Theorem ad Proportios Lecture 21: Testig Proportios Statistics 10 Coli Rudel Flip a coi 30 times this is goig to get loud! Record the umber of heads you obtaied ad calculate
More informationIf, for instance, we were required to test whether the population mean μ could be equal to a certain value μ
STATISTICAL INFERENCE INTRODUCTION Statistical iferece is that brach of Statistics i which oe typically makes a statemet about a populatio based upo the results of a sample. I oesample testig, we essetially
More informationResponse Variable denoted by y it is the variable that is to be predicted measure of the outcome of an experiment also called the dependent variable
Statistics Chapter 4 Correlatio ad Regressio If we have two (or more) variables we are usually iterested i the relatioship betwee the variables. Associatio betwee Variables Two variables are associated
More informationMath 155 (Lecture 3)
Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,
More informationTABLES AND FORMULAS FOR MOORE Basic Practice of Statistics
TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics Explorig Data: Distributios Look for overall patter (shape, ceter, spread) ad deviatios (outliers). Mea (use a calculator): x = x 1 + x 2 + +
More informationLecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting
Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would
More informationThe Method of Least Squares. To understand least squares fitting of data.
The Method of Least Squares KEY WORDS Curve fittig, least square GOAL To uderstad least squares fittig of data To uderstad the least squares solutio of icosistet systems of liear equatios 1 Motivatio Curve
More informationLet us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.
Lecture 5 Let us give oe more example of MLE. Example 3. The uiform distributio U[0, ] o the iterval [0, ] has p.d.f. { 1 f(x =, 0 x, 0, otherwise The likelihood fuctio ϕ( = f(x i = 1 I(X 1,..., X [0,
More informationAxis Aligned Ellipsoid
Machie Learig for Data Sciece CS 4786) Lecture 6,7 & 8: Ellipsoidal Clusterig, Gaussia Mixture Models ad Geeral Mixture Models The text i black outlies high level ideas. The text i blue provides simple
More informationAlgebra of Least Squares
October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal
More informationOutline. Linear regression. Regularization functions. Polynomial curve fitting. Stochastic gradient descent for regression. MLE for regression
REGRESSION 1 Outlie Liear regressio Regularizatio fuctios Polyomial curve fittig Stochastic gradiet descet for regressio MLE for regressio Step-wise forward regressio Regressio methods Statistical techiques
More informationDS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10
DS 00: Priciples ad Techiques of Data Sciece Date: April 3, 208 Name: Hypothesis Testig Discussio #0. Defie these terms below as they relate to hypothesis testig. a) Data Geeratio Model: Solutio: A set
More informationMachine Learning Theory Tübingen University, WS 2016/2017 Lecture 12
Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig
More informationExpectation and Variance of a random variable
Chapter 11 Expectatio ad Variace of a radom variable The aim of this lecture is to defie ad itroduce mathematical Expectatio ad variace of a fuctio of discrete & cotiuous radom variables ad the distributio
More informationTMA4245 Statistics. Corrected 30 May and 4 June Norwegian University of Science and Technology Department of Mathematical Sciences.
Norwegia Uiversity of Sciece ad Techology Departmet of Mathematical Scieces Corrected 3 May ad 4 Jue Solutios TMA445 Statistics Saturday 6 May 9: 3: Problem Sow desity a The probability is.9.5 6x x dx
More informationChapter 23: Inferences About Means
Chapter 23: Ifereces About Meas Eough Proportios! We ve spet the last two uits workig with proportios (or qualitative variables, at least) ow it s time to tur our attetios to quatitative variables. For
More informationLast time: Moments of the Poisson distribution from its generating function. Example: Using telescope to measure intensity of an object
6.3 Stochastic Estimatio ad Cotrol, Fall 004 Lecture 7 Last time: Momets of the Poisso distributio from its geeratig fuctio. Gs () e dg µ e ds dg µ ( s) µ ( s) µ ( s) µ e ds dg X µ ds X s dg dg + ds ds
More informationGeometry of LS. LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT
OCTOBER 7, 2016 LECTURE 3 GEOMETRY OF LS, PROPERTIES OF σ 2, PARTITIONED REGRESSION, GOODNESS OF FIT Geometry of LS We ca thik of y ad the colums of X as members of the -dimesioal Euclidea space R Oe ca
More informationIn this section we derive some finite-sample properties of the OLS estimator. b is an estimator of β. It is a function of the random sample data.
17 3. OLS Part III I this sectio we derive some fiite-sample properties of the OLS estimator. 3.1 The Samplig Distributio of the OLS Estimator y = Xβ + ε ; ε ~ N[0, σ 2 I ] b = (X X) 1 X y = f(y) ε is
More informationECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors
ECONOMETRIC THEORY MODULE XIII Lecture - 34 Asymptotic Theory ad Stochastic Regressors Dr. Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Asymptotic theory The asymptotic
More informationMath 113 Exam 3 Practice
Math Exam Practice Exam will cover.-.9. This sheet has three sectios. The first sectio will remid you about techiques ad formulas that you should kow. The secod gives a umber of practice questios for you
More informationTable 12.1: Contingency table. Feature b. 1 N 11 N 12 N 1b 2 N 21 N 22 N 2b. ... a N a1 N a2 N ab
Sectio 12 Tests of idepedece ad homogeeity I this lecture we will cosider a situatio whe our observatios are classified by two differet features ad we would like to test if these features are idepedet
More information4.3 Growth Rates of Solutions to Recurrences
4.3. GROWTH RATES OF SOLUTIONS TO RECURRENCES 81 4.3 Growth Rates of Solutios to Recurreces 4.3.1 Divide ad Coquer Algorithms Oe of the most basic ad powerful algorithmic techiques is divide ad coquer.
More informationRecall the study where we estimated the difference between mean systolic blood pressure levels of users of oral contraceptives and non-users, x - y.
Testig Statistical Hypotheses Recall the study where we estimated the differece betwee mea systolic blood pressure levels of users of oral cotraceptives ad o-users, x - y. Such studies are sometimes viewed
More informationOptimally Sparse SVMs
A. Proof of Lemma 3. We here prove a lower boud o the umber of support vectors to achieve geeralizatio bouds of the form which we cosider. Importatly, this result holds ot oly for liear classifiers, but
More informationLecture 7: Density Estimation: k-nearest Neighbor and Basis Approach
STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.
More informationRead through these prior to coming to the test and follow them when you take your test.
Math 143 Sprig 2012 Test 2 Iformatio 1 Test 2 will be give i class o Thursday April 5. Material Covered The test is cummulative, but will emphasize the recet material (Chapters 6 8, 10 11, ad Sectios 12.1
More informationChapter 6 Principles of Data Reduction
Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a
More informationCURRICULUM INSPIRATIONS: INNOVATIVE CURRICULUM ONLINE EXPERIENCES: TANTON TIDBITS:
CURRICULUM INSPIRATIONS: wwwmaaorg/ci MATH FOR AMERICA_DC: wwwmathforamericaorg/dc INNOVATIVE CURRICULUM ONLINE EXPERIENCES: wwwgdaymathcom TANTON TIDBITS: wwwjamestatocom TANTON S TAKE ON MEAN ad VARIATION
More information10-701/ Machine Learning Mid-term Exam Solution
0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it
More informationECE 901 Lecture 13: Maximum Likelihood Estimation
ECE 90 Lecture 3: Maximum Likelihood Estimatio R. Nowak 5/7/009 The focus of this lecture is to cosider aother approach to learig based o maximum likelihood estimatio. Ulike earlier approaches cosidered
More informationEconomics Spring 2015
1 Ecoomics 400 -- Sprig 015 /17/015 pp. 30-38; Ch. 7.1.4-7. New Stata Assigmet ad ew MyStatlab assigmet, both due Feb 4th Midterm Exam Thursday Feb 6th, Chapters 1-7 of Groeber text ad all relevat lectures
More informationDefinitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients.
Defiitios ad Theorems Remember the scalar form of the liear programmig problem, Miimize, Subject to, f(x) = c i x i a 1i x i = b 1 a mi x i = b m x i 0 i = 1,2,, where x are the decisio variables. c, b,
More informationDiscrete Mathematics for CS Spring 2005 Clancy/Wagner Notes 21. Some Important Distributions
CS 70 Discrete Mathematics for CS Sprig 2005 Clacy/Wager Notes 21 Some Importat Distributios Questio: A biased coi with Heads probability p is tossed repeatedly util the first Head appears. What is the
More informationProblems from 9th edition of Probability and Statistical Inference by Hogg, Tanis and Zimmerman:
Math 224 Fall 2017 Homework 4 Drew Armstrog Problems from 9th editio of Probability ad Statistical Iferece by Hogg, Tais ad Zimmerma: Sectio 2.3, Exercises 16(a,d),18. Sectio 2.4, Exercises 13, 14. Sectio
More informationRiemann Sums y = f (x)
Riema Sums Recall that we have previously discussed the area problem I its simplest form we ca state it this way: The Area Problem Let f be a cotiuous, o-egative fuctio o the closed iterval [a, b] Fid
More informationExample: Find the SD of the set {x j } = {2, 4, 5, 8, 5, 11, 7}.
1 (*) If a lot of the data is far from the mea, the may of the (x j x) 2 terms will be quite large, so the mea of these terms will be large ad the SD of the data will be large. (*) I particular, outliers
More informationJanuary 25, 2017 INTRODUCTION TO MATHEMATICAL STATISTICS
Jauary 25, 207 INTRODUCTION TO MATHEMATICAL STATISTICS Abstract. A basic itroductio to statistics assumig kowledge of probability theory.. Probability I a typical udergraduate problem i probability, we
More informationFrequentist Inference
Frequetist Iferece The topics of the ext three sectios are useful applicatios of the Cetral Limit Theorem. Without kowig aythig about the uderlyig distributio of a sequece of radom variables {X i }, for
More informationEXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY
EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY GRADUATE DIPLOMA, 016 MODULE : Statistical Iferece Time allowed: Three hours Cadidates should aswer FIVE questios. All questios carry equal marks. The umber
More information