Week 3 Linear Regression I


Colin Floyd

POL 200B, Spring 2014

Linear regression is the most commonly used statistical technique. A linear regression captures the relationship between two or more phenomena with a straight line. The value of linear regression in the study of social phenomena is a matter of some debate. In one of your background readings for this week, Charles Wheelan refers to regression analysis as the "miracle elixir." Others have been much less enthusiastic. In an article entitled "Econometrics - Alchemy or Science?", economist David Hendry bemoaned the rise of linear regression in his profession, saying: "Econometricians have found their Philosophers' Stone; it is called regression analysis and is used for transforming data into 'significant' results!" [1]

More than thirty years after Hendry's article appeared, linear regression continues to be the workhorse of quantitative analysis in the social sciences. Even if you don't plan on running regressions for the rest of your career, it's probably good to know what the alchemists are up to.

1 What is it?

Let's start with the familiar scatter plot from Red State, Blue State... that you re-created as part of the first coding assignment. On the x-axis is the average income in each state in [...]. On the y-axis is the vote share received by George Bush in the 2004 presidential election. As you may recall, the trend is negative: higher income is correlated with lower Bush vote share.

[1] Hendry, David F. "Econometrics - Alchemy or Science?" Economica (1980): p

[Figure: scatter plot of percent_bush (y-axis) against income (x-axis), with each state labeled by its abbreviation]

We could come up with a line to capture the relationship between income and Republican vote share. Why do we want to make a line, you might ask? Well, imagine you were living in the 1800s and you didn't have a computer. Drawing a line and measuring its slope might seem like a reasonable way of capturing a relationship between two phenomena... or maybe it might not, but it is in any event what people chose to do.

The attraction of drawing a line is that it allows you to extrapolate beyond your data. Extrapolating can be very useful, but it has some very obvious risks, as illustrated here by xkcd:

[Figure: xkcd comic]

So we want to make a line. Looking at our Red State, Blue State... data, we could probably imagine drawing a line through the points just by eyeballing it. It might look something like this:

[Figure: the percent_bush vs. income scatter plot with a line drawn through the points, labeled "Linear Regression Line"]

Once we have a line, we can measure its slope and find the place where it crosses the y-axis. If you haven't repressed all memories of high school math, you might remember seeing the equation for a straight line written like this:

    y = mx + b

In this equation, m is the slope of the line, and b is the y-intercept. We sometimes call m and b "parameters" to distinguish them from x and y, which are "variables." The equation we get from a linear regression is very similar, but we've changed the symbols used for the parameters and put them in a different order:

    Y = β₀ + βX

In statistics jargon, Y is called the dependent variable, or sometimes the outcome variable. X is called the independent variable, or sometimes the explanatory variable or predictor variable. The y-intercept is now written at the beginning of the equation, and instead of b it is now β₀ ("beta naught"). The y-intercept is often referred to as the constant. The slope is now β, which is sometimes referred to as the regression coefficient (or just "coefficient").

As you can probably guess, eyeballing it is not how statisticians prefer to find the slope and intercept. So how do we find values for β and β₀? We can actually get some valuable insights from thinking about eyeballing it. If we wanted to do an especially good job at eyeballing it, we'd try to get the line as close as possible to the most points. One formal mathematical way to accomplish this goal is to use a technique called least squares, as explained below.

2 Ordinary Least Squares

Since our line doesn't pass through each point, we can measure the vertical distance between each point and the line. This is called the point's residual. A point that has a small residual is very close to the line (a residual of zero means that the point is on the line). A point that has a large residual is farther away from the line. Here is our linear regression from above,

showing the residuals for Utah and Massachusetts.

[Figure: the percent_bush vs. income scatter plot with the linear regression line and the residuals marked]

The residual for Utah is 0.123. Therefore, in Utah, Bush's vote share was 12.3 percentage points higher than our line would predict for a state with that income. The residual for Massachusetts is -0.059. In Massachusetts, Bush's vote share was 5.9 percentage points lower than our line would predict for a state with that income.

Aside from eyeballing it, how do we find out which line has the smallest possible residuals? One method is to use the least squares technique. This technique is called least squares because we find the line that minimizes the sum of the squared residuals. "Why the squared residuals?" you quite reasonably ask. Well, looking at our two residuals above, we notice that Utah has a positive residual, but Massachusetts has a negative residual. We just want to minimize the magnitude of the residuals, so we square them to make them all positive. If this sounds familiar, that's because we used a similar technique when calculating standard deviations and correlation coefficients.

The decision to square the residuals instead of just taking their absolute values is something that made a lot of sense in the pre-computer era, but doesn't really make that much sense now. As you might guess, squaring the residuals gives more weight to outlying points.
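The least squares idea above can be sketched in a few lines of Python (used here instead of STATA so that the arithmetic is visible). This is only an illustration under stated assumptions: the five data points are made up, the function names `fit_line` and `sum_squared_residuals` are hypothetical, and the slope and intercept formulas are the standard closed-form solution for a one-variable least squares line, which these notes do not derive.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Standard closed-form least squares solution for one explanatory variable.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Add up the squared vertical distances between each point and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data for illustration only (NOT the Red State, Blue State... data).
incomes = [2.5, 3.0, 3.5, 4.0, 4.5]            # average income, in $10,000s
bush_share = [0.62, 0.57, 0.55, 0.46, 0.45]    # vote share as a proportion

b0, b = fit_line(incomes, bush_share)
best = sum_squared_residuals(incomes, bush_share, b0, b)

# Any other line, such as one with a slightly higher intercept, does worse.
nudged = sum_squared_residuals(incomes, bush_share, b0 + 0.01, b)

print("intercept:", round(b0, 3), "slope:", round(b, 3))
print("fitted line beats nudged line:", best < nudged)
```

Moving the intercept or the slope away from the fitted values always increases the sum of squared residuals, which is all that "least squares" means.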

To run a regression in STATA you simply use the reg command as follows:

    reg Y-var X-var

So, for our Red State, Blue State... data, the dependent variable (Y) is Bush vote share (percent_bush) and the independent variable (X) is income. Here's what happens if we run the command:

[Figure: STATA regression output]

As usual, STATA includes a whole bunch of information that you don't really need. The value of β is the coefficient of income, which is about -0.000011. The value of β₀ is the constant (written as _cons for some reason) and is about 0.893.

3 What does it all mean?

So, now we know that we're drawing a straight line and STATA can tell us the values of the slope and the intercept for that line. But what does it mean for β to be about -0.000011? Isn't that tiny? It's practically zero! We'll get there, but first let's put the values of the parameters into our model:

    Y ≈ 0.893 - 0.000011X

Okay, now we have a sort of weird-looking equation for a line. Let's remember what each of the pieces refers to:

    Predicted Bush vote share ≈ 89.3% - 0.0011% × Income

Using this equation, we can make a guess about the vote share Bush would get for any value of state income. Notice that we can predict Bush's vote share for nonsensical incomes. For example, our y-intercept, β₀, is actually the predicted value of Bush's vote share in a hypothetical state where the average income is $0.

In general, β is the average amount we predict Y will increase or decrease if X is increased by 1 unit. Therefore, for every $1 increase in state income, we predict an average decrease in Bush's vote share of about 0.0011 percentage points. The differences in income between our states are usually bigger than $1, so if we wanted to make our equation more pliable, we could rewrite β as a fraction with meaningful increments in the independent (X) and dependent (Y) variables. For example, we could rewrite -0.000011 as -0.0011% / $1 or -11% / $10,000.

Consider two states, one with an average income of $30,000 and one with an average income of $40,000. Because the difference in state income is $10,000, we predict that Bush's vote share in the second state would be 11 percentage points lower than in the first.
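The two-state comparison above can be checked with a short Python sketch. It assumes the rounded values reported in the regression table in these notes (an intercept of 0.893 and a slope of -0.11 per $10,000 of income), so it will not reproduce STATA's unrounded predictions exactly; the function name `predicted_bush_share` is hypothetical.

```python
# Rounded values taken from the regression table in these notes.
INTERCEPT = 0.893        # predicted vote share (as a proportion) at income $0
SLOPE_PER_10K = -0.11    # change in vote share per $10,000 of state income


def predicted_bush_share(income_dollars):
    """Predicted Bush vote share (a proportion) for an average state income."""
    return INTERCEPT + SLOPE_PER_10K * (income_dollars / 10_000)


share_30k = predicted_bush_share(30_000)
share_40k = predicted_bush_share(40_000)
difference = share_40k - share_30k

print("state with $30,000 income:", round(share_30k, 3))
print("state with $40,000 income:", round(share_40k, 3))
print("difference:", round(difference, 3))
```

With these rounded inputs the predictions are about 56.3% and 45.3%, a gap of 11 percentage points, matching the slope per $10,000.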

4 Presenting the results

In our example, the regression output from STATA looked like this:

[Figure: STATA regression output]

In a paper, it would probably be reported as something like this:

Table 1: Dependent variable: Bush vote share in 2004 (a)

    Variable            Coefficient    (Std. Err.)
    State Income (b)    -0.11**        (0.022)
    Intercept            0.893**       (0.071)

    (a) Results are OLS estimates.
    (b) State income is measured per $10,000.
    Significance levels: * = p < 5%, ** = p < 1%

I want to emphasize that this is not necessarily a good way of presenting information! These tables are not usually much more helpful than the regression output from STATA and exhibit some rather unfortunate trends in the field. So why learn about them? Well, keeping with the theme of this lecture, we're going to figure out what's going on in a table like this because it's good to know what the alchemists are doing (so you can call them on their alchemy!).

So what do we have here? The values we found for β₀ and β are in the left-hand column. I divided the state incomes by $10,000 so that the coefficient on income (β) reflects the change in vote share for a $10,000 increase in income. The intercept (β₀) is also reported. So far this is not too horrible.

I'm sure you noticed my beautiful bedazzled coefficients! Don't the stars make them look important? The table shows significance levels (p-values) using stars. This is a very common practice, but I personally think it's a bad one. Presenting your results with stars next to the significant ones is often a sign of "star gazing." [2]

The table also includes the standard error of each parameter. Like a p-value, the standard error is supposed to give you some sense of the statistical uncertainty associated with the value we've found for the parameter. Unfortunately, both of these measures are quite poorly understood. In fact, you might be wondering what we are so uncertain about. Given our state vote shares and state incomes, the values we get for β and β₀ are completely determined. In a sense, they're facts. The value we have for β or β₀ is no more uncertain than the value of the mean income in our data. All of the measures of statistical uncertainty refer to the uncertainty we should take into account if we want to generalize our results.

In the next section we'll talk about three different ways that statistical uncertainty is measured: standard errors, 95% confidence intervals, and p-values. But first, let's take a stab at defining what is, and what isn't, statistical uncertainty.

5 Statistical uncertainty

In statistics, the word "uncertainty," like its cousin "significance," does not have its normal English meaning. Normally, we think of uncertainty as the result of difficulties we have in figuring out how things really are. Consider the income measure in our Red State, Blue State... example. One source of uncertainty (in the normal English sense) might be that people don't report all of their income, such as money they've made under the table or hidden in an offshore tax shelter. We can probably think of many more sources of uncertainty in our income data, not to mention sources of uncertainty in measuring Bush's vote share (recall the hanging chad fiasco). Statistical uncertainty, however, is not designed to give you information about these real-world sources of uncertainty in your data. In fact, they are completely ignored. Rather, measures of statistical uncertainty are designed to help you identify errors that arise from the fact that you are looking only at a sample.

We'll talk below about three measures of statistical uncertainty, but first let's think a bit about how they apply to the technique we just learned, namely linear regression. The measures of statistical uncertainty we'll discuss are somewhat important if you want to generalize your statistical findings. But what does "generalize" even mean? In this context, it means that you want to take the equation you got from your linear regression and use it to predict vote shares for values of income that are not currently in your data set. For example, using the Red State, Blue State... data, we might want to use the equation we found to make predictions about the 2012 election results. [3]

On the other hand, if we don't care about generalizability, these measures of uncertainty are not actually important and, worse, can be completely misleading! We could just look at the good old standard deviation. Quantitative methodologists often get all up in arms about quantifying uncertainty, but this debate assumes that we are trying to generalize our results to some other population that we don't currently have data about (i.e., that there is something to be uncertain about!). Sometimes, this kind of generalizability is not the point of your quantitative research. If this is the case, though, I'd advise you to steer clear of running linear regressions altogether. You can accomplish a lot just by presenting graphs and describing your data set.

There are cases where generalizability is actually the goal, for example, if you want to use survey results gathered by interviewing a sample of the US population to say something about the population as a whole. In this case, statistical uncertainty is a very useful tool. However, even in this case, it can be misleading.

[2] A great book about statistical significance is The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives by Deirdre McCloskey and Stephen Ziliak (University of Michigan Press, 2008). An article-length treatment can be found here:

[3] It's pretty obvious that this would not be the best idea. Clearly some things have changed between 2004 and 2012, not to mention the fact that there are totally different presidential candidates! For the standard errors from our 2004 sample to give us accurate information about how we could generalize to 2012, we would have to assume that our 50 states from 2004 are a random sample from some larger population that includes those same 50 states in

5.1 Standard Errors

Here's our (bad) table:

Table 2: Dependent variable: Bush vote share in 2004 (a)

    Variable            Coefficient    (Std. Err.)
    State Income (b)    -0.11**        (0.022)
    Intercept            0.893**       (0.071)

    (a) Results are OLS estimates.
    (b) State income is measured per $10,000.
    Significance levels: * = p < 5%, ** = p < 1%

As is often done, I've put the standard errors in parentheses next to their corresponding coefficients. So the standard error of β (-0.11) is 0.022 and the standard error of β₀ (0.893) is 0.071.

A standard error is like a standard deviation, but a standard deviation of what? Well, this all goes back to assuming that your data is a random sample from some larger population. If your data were, in fact, such a sample, you could imagine taking many different samples and then running your regression on each of the samples. [4] Each time you ran the regression you could save the value of the coefficient on income. Then you'd have a giant list of different values you got for β. If β actually has some underlying average value in the population at large, then most of your different βs would be kind of close to that real value because you were taking representative (random) samples.

In fact, if you had lots and lots and lots of perfectly random samples and calculated the β from each one of them, the values of β could be described by a perfect bell curve (or normal distribution) centered on the mean of β. If this were the case, then about 68% of the values would fall within 1 standard deviation of the mean and about 95% of the values would fall within 2 standard deviations of the mean. If you've taken statistics before, you might have seen this infamous chart:

[4] Sort of like conducting many studies and recording the results you got each time.

Graphic from Wikipedia article Standard Deviation available here: http://goo.

Note: µ is just another symbol one can use for the mean of a variable (what we called x̄).

What does this chart mean? Just that if the values of your variable β were perfectly normally distributed around a mean of µ, each of the sections on the graph would contain a certain proportion of your βs (the proportion in each section is shown by its label on the chart). For example, the section between µ and 1σ would include the βs that are greater than the mean (µ) and less than the mean plus one standard deviation. In theory, this would be 34.1% of your data. [5]

The standard error is just an estimate of the standard deviation of this hypothetical normal distribution that was reverse-engineered to represent some larger population that your sample could have been randomly drawn from. The information in our data that is used to reverse-engineer the hypothetical population is:

- The dependence between the variables (i.e., the strength of correlation between X and Y).
- The number of observations we have (i.e., if we have 50 states in our data set, the number of observations we have is 50).

Higher dependence will, naturally, lead to a lower standard error. If there is a strong relationship between the variables in our data, and our data is a random sample from this hypothetical population, there's probably a strong relationship between the variables in the hypothetical population. A larger sample size will also, of course, lead to smaller standard errors, since a larger sample will be more representative of the population at large.

As a rule of thumb, statisticians say that if the standard error is less than half of the estimate that our equation spits out, we're in pretty good shape. Using the standard error, STATA also calculates a 95% confidence interval.

[5] In reality, unless you had a really huge sample you would probably just get something kind of close to 34.1%.

5.2 95% Confidence Intervals

I didn't show the 95% confidence intervals in the regression table, but here's where they show up in the STATA output:

[Figure: STATA regression output with the 95% confidence interval for the coefficient of income circled]

I've circled the 95% confidence interval for the coefficient of income. The 95% confidence interval is given as a range. In this case, the range goes from [...] to [...]. Both of these values are calculated directly from the standard error. The upper bound (top) of the 95% confidence interval is just the value of β plus 1.96 times its standard error (i.e., [...]). The lower bound (bottom) of the 95% confidence interval is just the value of β minus 1.96 times its standard error (i.e., [...]).
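As a rough check on the "plus or minus 1.96 standard errors" recipe, here is a small Python sketch. It assumes the rounded table values from these notes (β = -0.11 per $10,000 of income, standard error 0.022), so the bounds it prints need not match the interval circled in the STATA output, which comes from unrounded values. The function name `confidence_interval` is hypothetical.

```python
BETA = -0.11        # coefficient on income (per $10,000), rounded, from the table
STD_ERR = 0.022     # its standard error, rounded, from the table
Z_95 = 1.96         # multiplier used in these notes for a 95% interval


def confidence_interval(estimate, std_err, multiplier=Z_95):
    """Return (lower, upper) bounds: estimate -/+ multiplier * standard error."""
    half_width = multiplier * std_err
    return estimate - half_width, estimate + half_width


lower, upper = confidence_interval(BETA, STD_ERR)
print("lower bound:", round(lower, 3))
print("upper bound:", round(upper, 3))
```

With these inputs the interval runs from roughly -0.153 to -0.067 per $10,000 of income. Both bounds are below zero, which is the same information the stars in the table are trying to convey.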

Why 1.96 times the standard error? The 95% confidence interval just gives the values on this distribution such that 95% of the βs would fall in that region (look back up at the normal distribution above if you're confused). By definition, this is between β - 1.96 × (the standard error) and β + 1.96 × (the standard error). [6]

It's interesting to puzzle a little over what the 95% confidence interval corresponds to, especially since it's not quite what the name implies! If you were to re-run your regression over and over using new random samples, and calculated a 95% confidence interval each time, then about 95% of those intervals would contain the real value of β. This is a little different from simply saying that we are 95% sure that the real value of β is in the interval given.

Related to the 95% confidence interval is the p-value. We first talked about p-values when discussing group comparisons. There, a p-value told us how likely we would be to observe a difference at least as large as the one in our sample containing two groups (like men and women) if the difference were actually zero in the larger population. In the regression context, the p-value is slightly different.

[6] Most of the time you could just use 2, but 1.96 is technically exactly where 95% would be in a very, very large sample.

5.3 P-values in Regression

A p-value is a loose way of testing how good our line is at making predictions for a larger population, one about which we don't have information. Given the information we know about our sample, a p-value tells us how likely we would be to get a value of β (or β₀) at least as far from zero as the one we found if its value in the overall population were actually zero. The actual calculations for p-values are fairly involved (and not particularly important), but fortunately STATA will give them to you when you run the reg command. As in the group comparison context, this measure is a fairly loose one.

6 Multiple Regression

So far we've only considered linear regression with one dependent variable (Y) and one independent variable (X). Using the same framework, we can include more independent variables. Adding more variables is often referred to as "controlling" for them. This can be a bit misleading. I think that the best way to understand multiple regression is through a simple example.

Using Gallup poll data from [...], I added a new variable to the Red State, Blue State... data: the percentage of the state population who identified as Catholic. [7] I call this variable percent_catholic. If we make a scatter plot of Bush's vote share vs. the percent Catholic in each state, we see the following familiar pattern:

[Figure: scatter plot of percent_bush (y-axis) against percent_catholic (x-axis), with each state labeled by its abbreviation]

As the percentage of Catholics in a state increases, the percentage of the vote share received by Bush in that state decreases. If we run a regression where our dependent variable (Y) is Bush's 2004 vote share and our independent variable (X) is the percent Catholic, we can again describe it using the following equation:

    Y = β₀ + βX

We get the following results for β₀ and β from STATA:

[Figure: STATA regression output]

[7] Gallup published estimates of the percent of the population identifying as Protestant, Other Christian, Catholic, Mormon, Jewish, or None for 48 states (Alaska and Hawaii are not included). The results are based on telephone interviews with 62,744 randomly selected national adults, aged 18 and older, conducted in Gallup Polls between 2000 and [...]. For details see:

The value of β₀ is 0.649, meaning that in a hypothetical state which had 0 percent Catholics we would predict that Bush would get 64.9% of the vote. The value of β is about -0.5. We can re-write this as the following ratio:

    -5% Bush Vote / 10% Catholic

Just like before, we can translate this into an equation:

    Predicted Bush vote share = 64.9% - (5% Bush Vote / 10% Catholic) × Percent Catholic

What will happen if we add income? If we do a regression using both percent Catholic and income, we have two pieces of information for each state. We know that on their own each of these explanatory variables (Xs) can be used to predict vote share. Including both pieces of information should give us a better prediction.

Would our prediction necessarily be better? Well, this depends on the degree to which percent Catholic and average income are actually giving us different information. For example, if percent Catholic is strongly related to average income and average income is strongly related to vote share, then maybe including both is just like including average income twice. (This may be unintuitive to think about at first; don't worry.)

Say we modified our regression equation to have two explanatory variables:

    Y = β₀ + β₁X₁ + β₂X₂
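To make the two-variable equation concrete, here is a hedged Python sketch of how such a line can be fit by least squares, taking X₁ to be percent Catholic and X₂ to be income. Everything in it is an assumption for illustration: the data are made up (the outcome is built from known coefficients 0.9, -0.5, and -0.1 so the fit can be checked), the function names are hypothetical, and solving the so-called normal equations is one standard way to compute a least squares fit, not necessarily what STATA does internally.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left unchanged.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_two_predictors(x1, x2, y):
    """Return [b0, b1, b2] minimizing the sum of squared residuals."""
    rows = [[1.0, a, b] for a, b in zip(x1, x2)]   # leading 1.0 is the intercept
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Made-up data for illustration only (NOT the Gallup or election data).
percent_catholic = [0.10, 0.20, 0.30, 0.40, 0.15, 0.35]   # proportions
income = [2.5, 3.5, 3.0, 4.5, 4.0, 2.8]                   # in $10,000s
bush_share = [0.9 - 0.5 * c - 0.1 * i
              for c, i in zip(percent_catholic, income)]

b0, b1, b2 = fit_two_predictors(percent_catholic, income, bush_share)
print(round(b0, 3), round(b1, 3), round(b2, 3))
```

Because the made-up outcome lies exactly on a plane, the fit recovers the coefficients used to build it. Real data would leave residuals, and each coefficient would describe the predicted change in Y for a one-unit change in that X while the other X is held fixed.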

Now X₁ is percent Catholic and X₂ is average income. They each have their own coefficient, β₁ and β₂. If we run the regression in STATA we get the following results:

[Figure: STATA regression output]

We'll talk more about this next week!

Appendix Linear Regression Therapy?

It occurred to me as I was writing these notes that "regression" is an English word that doesn't seem particularly related to the statistical technique that bears its name. Any X-Files aficionado knows that hypnotic regression therapy is the technique special agent Fox Mulder uses to help people recover repressed memories of alien encounters! Here, and in general outside of statistics, regression means a return to a former state or condition. How did it come to be the name of the most common statistical technique?

Remember our old friend Sir Francis Galton, the naturalist, cousin of Charles Darwin, and sweet pea enthusiast? Galton developed linear regression to figure out how traits like intelligence might be passed down from parents to offspring. The kind of regression Galton was thinking about was "regression to the mean," or as he called it, "reversion towards mediocrity."

In his research on the relationship between parents' and offspring's heights, Galton found that very tall or very short parents tended to have children who were less extreme in height. Galton assumed that this kind of process also applied to traits like intelligence. When combined with Galton's belief that less intelligent people were more fertile than intelligent people, it was clear to him that, if left unchecked, these forces would lead society towards intellectual mediocrity. Following this (incorrect) logic, Galton claimed that "it would be quite practicable to produce a highly-gifted race of men by judicious marriages during several consecutive generations." [8]

In order to quantify just how much of a regression to the mean one should expect for a particular trait such as height, Galton plotted the heights of parents against the heights of their children. Galton then constructed straight lines that seemed to capture the direction of the trend and calculated their slopes. It is from this checkered past that linear regression gets its name.

The kind of linear regression Galton was doing fell out of fashion, but, as often happens in statistics, the name stuck. Linear regression as we know it today was developed in the 1920s by R.A. Fisher, who combined the work Galton (and Pearson) were doing with the least-squares method developed by the mathematician Carl Friedrich Gauss. Fisher saw linear regression as a way of giving a more definite meaning to the value of a correlation coefficient. As Fisher put it in a paper written in 1925: [9]

"The idea of regression is usually introduced in connection with the theory of correlation, but it is in reality a more general, and, in some respects, a simpler idea, and the regression coefficients are of interest and scientific importance in many classes of data where the correlation coefficient, if used at all, is an artificial concept of no real utility."

[8] For the morbidly curious, you can see Galton's book, Hereditary Genius (1869) here: mugu.com/galton/books/hereditary-genius/.

[9] Fisher, Ronald Aylmer. Statistical methods for research workers. Genesis Publishing,
