POLI 618 Notes. Stuart Soroka, Department of Political Science, McGill University. March 2010
These pages were written originally as my own lecture notes, but are now designed to be distributed to students taking the stats methods course POLI 618 at McGill University. They are also freely available online, at snsoroka.com. The notes draw on a good number of statistics texts, including Kennedy's Econometrics, Greene's Econometric Analysis, and a number of volumes in Sage's quantitative methods series. That said, please do keep in mind that they are just lecture notes: there are errors and omissions, and for no single topic is there enough information here to learn statistics from the notes alone. (There are of course many textbooks better equipped for that purpose.) The notes are nonetheless a useful background guide to POLI 618 and perhaps, more generally, to some of the basic statistics most common in empirical political science. If you find errors (and you will), please do let me know. Thanks, Stuart Soroka stuart.soroka@mcgill.ca
Table of Contents

Variance, Covariance and Correlation
Introducing Bivariate Ordinary Least Squares Regression
Multivariate Ordinary Least Squares Regression
Error, and Model Fit
Assumptions of OLS Regression
Nonlinearities
Collinearity and Multicollinearity
Heteroskedasticity
Outliers
Models for Dichotomous Data: Linear Probability Models; Nonlinear Probability Model: Logistic Regression; An Alternative Description: The Latent Variable Model; Nonlinear Probability Model: Probit Regression; Maximum Likelihood Estimation; Interpretation & Goodness of Fit Measures for Categorical Models
Models for Categorical Data: Ordinal Outcomes; Nominal Outcomes
Time Series: Autocorrelation; Univariate Statistics; Bivariate Statistics; Multivariate Models
Significance Tests: Distribution Functions; The Chi-Square Test; The t Test; The F Test
Factor Analysis: Background: Correlations and Factor Analysis; An Algebraic Description; Factor Analysis Results; Rotated Factor Analyses
Variance, Covariance and Correlation

Let's begin with $Y_i$, a continuous variable measuring some value for each individual (i) in a representative sample of the population. $Y_i$ can be income, or age, or a thermometer score expressing degrees of approval for a presidential candidate. Variance in our variable $Y_i$ is calculated as follows:

(1) $S^2_Y = \frac{\sum (Y_i - \bar{Y})^2}{N-1}$, or

(2) $S^2_Y = \frac{N \sum Y_i^2 - \left(\sum Y_i\right)^2}{N(N-1)}$,

where both versions are equivalent, and the latter is referred to as the computational formula (because it is, in principle, easier to calculate by hand). Note that the equation is pretty simple: we are interested in variance in $Y_i$, and Equation 1 is basically taking the average of each individual $Y_i$'s variance around the mean ($\bar{Y}$). There are a few tricky parts. First, the differences between each individual $Y_i$ and $\bar{Y}$ (that is, $Y_i - \bar{Y}$) are squared in Equation 1, so that negative values do not cancel out positive values (since squaring leads to only positive values). Second, we use N-1 as the denominator rather than N (where N is the number of cases). This produces a more conservative (slightly inflated) result, in light of the fact that we're working with a sample variance rather than the population variance: the values of $Y_i$ in our (hopefully) representative sample, versus the values of $Y_i$ that we believe may exist in the total real-world population. For a small-N sample, where we might suspect that we under-estimate the variance in the population, using N-1 effectively adjusts the estimated variance upwards. With a large-N sample, the difference between N-1 and N is increasingly marginal. That the adjustment matters more for small samples than for big samples reflects our increasing confidence in the representativeness of our sample as it grows.
(Note that some texts distinguish between $S^2_Y$ and $\sigma^2_Y$, where the Roman S is the sample variance and the Greek $\sigma$ is the population variance. Indeed, some texts will distinguish between sample values and population values using Roman and Greek versions across the board: b for an estimated slope coefficient, for instance, and $\beta$ for the actual slope in the population. I am not this systematic below.) The standard deviation is a simple function of variance:
(3) $S_Y = \sqrt{S^2_Y} = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{N-1}}$,

so standard deviations are also indications of the extent to which a given variable varies around its mean. $S_Y$ is important for understanding distributions and significance tests, as we shall see below. So far, we've looked only at univariate statistics: statistics describing a single variable. Most of the time, though, what we want to do is describe relationships between two (or more) variables. Covariance, a measure of common variance between two variables, or how much two variables change together, is calculated as follows:

(4) $S_{XY} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N-1}$, or

(5) $S_{XY} = \frac{N \sum X_i Y_i - \sum X_i \sum Y_i}{N(N-1)}$,

the latter of which is the computational formula. Again, we use N-1 as the denominator, for the same reasons as above. Pearson's correlation coefficient is also built from covariances and standard deviations, as follows:

(6) $r = \frac{S_{XY}}{S_X S_Y}$, or

(7) $r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$,

where $S_{XY}$ is the sample covariance between $X_i$ and $Y_i$, and $S_X$ and $S_Y$ are the sample standard deviations of $X_i$ and $Y_i$ respectively. (Note the relationship between Equation 7 and the preceding equations for standard deviations and covariances, Equation 3 and Equation 4.)
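These formulas are easy to verify numerically. Here is a minimal Python sketch of Equations 1, 2, 4 and 6, using only the standard library; the data in the checks are invented for illustration and do not come from the notes:

```python
import math

def variance(y):
    """Sample variance (Equation 1): squared deviations from the mean, over N-1."""
    n = len(y)
    ybar = sum(y) / n
    return sum((yi - ybar) ** 2 for yi in y) / (n - 1)

def variance_computational(y):
    """The computational formula (Equation 2); algebraically identical to Equation 1."""
    n = len(y)
    return (n * sum(yi ** 2 for yi in y) - sum(y) ** 2) / (n * (n - 1))

def covariance(x, y):
    """Sample covariance (Equation 4)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Pearson's r (Equation 6): covariance over the product of the two SDs."""
    return covariance(x, y) / (math.sqrt(variance(x)) * math.sqrt(variance(y)))
```

With any small series you can confirm that the definitional and computational formulas agree, and that a variable plotted perfectly against a multiple of itself yields r = 1.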
Introducing Bivariate Ordinary Least Squares Regression

Take a simple data series, and plot it. [Data table and scatterplot of X and Y omitted in this version.] What we want to do is describe the relationship between X and Y. Essentially, we want to draw a line through the dots, and describe that line. Given that the data here are relatively simple, we can just do this by hand, and describe the line using two basic properties, $\alpha$ and $\beta$, where $\alpha$, the constant, is in this case equal to 1, and $\beta$, the slope, is 1 (the increase in Y) divided by 2 (the increase in X) = .5. So we can produce an
equation for this line, allowing us to predict values of Y based on values of X. The general model is

(8) $Y_i = \alpha + \beta X_i$,

and the particular model in this case is Y = 1 + .5X. Note that the constant is simply a function of the means of both X and Y, along with the slope. That is:

(9) $\alpha = \bar{Y} - \beta \bar{X}$

So, following Equation 9, $\alpha = \bar{Y} - \beta\bar{X} = 3.5 - (.5)(5) = 3.5 - 2.5 = 1$. This is pretty simple. The difficulty is that real data aren't like this: they don't fall along a perfect line. They're likely more like this: [data table and scatterplot omitted]. Now, note that we can draw any number of lines that will satisfy Equation 8. All that matters is that the line goes through the means of X and Y. So the means are:
$\bar{X} = 5$, $\bar{Y} = 3.75$. And let's make up an equation where Y = 3.75 when X = 5:

Y = $\alpha$ + $\beta$X
3.75 = $\alpha$ + $\beta$(5)
3.75 = 4 + (-.05)(5)
3.75 = 4 + (-.25)

So here it is: Y = 4 + (-.05)X. Plotted, it looks like this: [plot omitted]. Note that this new model has to be expressed in a slightly different manner, including an error term:

(10) $Y_i = \alpha + \beta X_i + \epsilon_i$, or, alternatively:

(11) $Y_i = \hat{Y}_i + \epsilon_i$,

where $\hat{Y}_i$ are the estimated values of the actual $Y_i$, and where the error can be expressed in the following ways:

(12) $\epsilon_i = Y_i - \hat{Y}_i$, or $\epsilon_i = Y_i - (\alpha + \beta X_i)$.

So we've now accounted for the fact that we work with messy data, and that there will consequently be a certain degree of error in the model. This is
inevitable, of course, since we're trying to draw a straight line through points that are unlikely to be perfectly distributed along a straight line. Of course, the line above won't do: it quite clearly does not describe the relationship between X and Y. What we need is a method of deriving a model that better describes the effect that X has on Y; essentially, a method that draws a line that comes as close to all the dots as possible. Or, more precisely, a model that minimizes the total amount of error ($\epsilon_i$). We first need a measure of the total amount of error, the degree to which our predictions miss the actual values of $Y_i$. We can't simply take the sum of all errors, $\sum \epsilon_i$, because positive and negative errors can cancel each other out. We could take the sum of the absolute values, $\sum |\epsilon_i|$, which in fact is used in some estimations. The norm is to use the sum of squared errors, the SSE, or $\sum \epsilon_i^2$. This sum is most greatly affected by large errors: by squaring residuals, large residuals take on very large magnitudes. An estimation of Equation 10 that tries to minimize $\sum \epsilon_i^2$ accordingly tries especially hard to avoid large errors. (By implication, outlying cases will have a particularly strong effect on the overall estimation. We return to this in the section on outliers below.) This is what we are trying to do in ordinary least squares (OLS) regression: minimize the SSE, and have an estimate of $\beta$ (on which our estimate of $\alpha$ relies) that comes as close to all the dots as is possible. Least-squares coefficients for simple bivariate regression are estimated as follows:

(13) $\beta = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$, or
(14) $\beta = \frac{N \sum X_i Y_i - \sum X_i \sum Y_i}{N \sum X_i^2 - \left(\sum X_i\right)^2}$.

The latter is referred to as the computational formula, as it's supposed to be easier to compute by hand. (I actually prefer the former, which I find easier to compute, and which has the added advantage of nicely illustrating the important features of OLS regression.) We can use Equation 13 to calculate the least squares estimate for the above data. [The case-by-case table of $X_i - \bar{X}$, $Y_i - \bar{Y}$, and their products is omitted; its key results are $\bar{X} = 5$, $\bar{Y} = 3.75$, $\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 9$, and $\sum (X_i - \bar{X})^2 = 20$.] So solving Equation 13 with the values above looks like this:

$\beta = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{9}{20} = .45$

And we can use these results in Equation 9 to find the constant:

$\alpha = \bar{Y} - \beta\bar{X} = 3.75 - (.45)(5) = 3.75 - 2.25 = 1.5$

So the final model looks like this: $Y_i = 1.5 + (.45)X_i$
Using this model, we can easily see what the individual predicted values ($\hat{Y}_i$) are, as well as the associated errors ($\epsilon_i = Y_i - \hat{Y}_i$). [The case-by-case table is omitted; recall $\bar{X} = 5$ and $\bar{Y} = 3.75$.] One further note about Equation 13, and our means of estimating OLS slope coefficients. Recall the equations for variance (Equation 1) and covariance (Equation 4). If we take the ratio of covariance and variance, as follows,

(15) $\frac{S_{XY}}{S_X^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})\,/\,(N-1)}{\sum (X_i - \bar{X})^2\,/\,(N-1)}$,

we can adjust somewhat to produce the following,

(16) $\frac{S_{XY}}{S_X^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$,

where Equation 16 simply drops the N-1 denominators, which cancel each other out. More importantly, Equation 16 looks suspiciously (indeed, exactly) like the formula for $\beta$ (Equation 13). $\beta$ is thus essentially a ratio of the covariance between X and Y to the variance of X, as follows:

(17) $\beta_{YX} = \frac{S_{YX}}{S_X^2}$

This should make sense when we consider the standard interpretation of $\beta$: for a one-unit shift in X, how much does Y change?
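The equivalence between Equation 13 and the covariance/variance ratio in Equation 17 is easy to confirm in code. A Python sketch, with made-up data (the raw values from the worked example above are not reproduced in this version of the notes):

```python
def ols_bivariate(x, y):
    """Slope via Equation 13 and constant via Equation 9."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)
    alpha = my - beta * mx                       # Equation 9
    return alpha, beta

def cov_var_ratio(x, y):
    """Equation 17: the same slope, as sample covariance over sample variance of X.
    The N-1 denominators cancel, leaving Equation 13."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    s_xx = sum((a - mx) ** 2 for a in x) / (n - 1)
    return s_xy / s_xx

# Illustrative data (not from the notes):
x = [1, 2, 3, 4]
y = [2, 4, 5, 7]
alpha, beta = ols_bivariate(x, y)   # beta = 1.6, alpha = 0.5
```

For any data series the two functions return the same slope, which is the point of Equations 15-17.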
Multivariate Ordinary Least Squares Regression

Things are more complicated for multiple, or multivariate, regression, where there is more than one independent variable. The standard OLS multivariate model is nevertheless a relatively simple extension of bivariate regression. Imagine, for instance, plotting a line through dots plotted along two X axes, in what amounts to three-dimensional space. [Three-dimensional scatterplot omitted.] This is all we're doing in multivariate regression: drawing a line through these dots, where values of Y are driven by a combination of $X_1$ and $X_2$, and where the model itself would be as follows:

(18) $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i$.

That said, when we have more than two regressors, we start plotting lines through four- and five-dimensional space, and that gets hard to draw. Least squares coefficients for multiple regression with two regressors, as in Equation 18, are calculated as follows:

(19) $\beta_1 = \frac{\sum (X_{1i} - \bar{X}_1)(Y_i - \bar{Y}) \sum (X_{2i} - \bar{X}_2)^2 - \sum (X_{2i} - \bar{X}_2)(Y_i - \bar{Y}) \sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum (X_{1i} - \bar{X}_1)^2 \sum (X_{2i} - \bar{X}_2)^2 - \left(\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)\right)^2}$

and

(20) $\beta_2 = \frac{\sum (X_{2i} - \bar{X}_2)(Y_i - \bar{Y}) \sum (X_{1i} - \bar{X}_1)^2 - \sum (X_{1i} - \bar{X}_1)(Y_i - \bar{Y}) \sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sum (X_{1i} - \bar{X}_1)^2 \sum (X_{2i} - \bar{X}_2)^2 - \left(\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)\right)^2}$,

and the constant is now estimated as follows:

(21) $\alpha = \bar{Y} - \beta_1 \bar{X}_1 - \beta_2 \bar{X}_2$.
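Equations 19-21 are tedious by hand but entirely mechanical in code. A Python sketch, written with centered sums; the data in the check below are invented so that Y is an exact linear function of the two regressors, which the formulas should recover exactly:

```python
def ols_two_regressors(x1, x2, y):
    """Slopes and constant for Y = a + b1*X1 + b2*X2 (Equations 19-21)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Centered cross-products and sums of squares:
    s1y = sum((a - m1) * (b - my) for a, b in zip(x1, y))
    s2y = sum((a - m2) * (b - my) for a, b in zip(x2, y))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((a - m2) ** 2 for a in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    denom = s11 * s22 - s12 ** 2          # zero only if X1, X2 exactly collinear
    b1 = (s1y * s22 - s2y * s12) / denom  # Equation 19
    b2 = (s2y * s11 - s1y * s12) / denom  # Equation 20
    a = my - b1 * m1 - b2 * m2            # Equation 21
    return a, b1, b2
```

Note the denominator: when the two regressors are perfectly collinear it is zero and the slopes are undefined, which previews the multicollinearity discussion below.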
Error, and Model Fit

The standard deviation of the residuals, or the standard error of the slope, is as follows:

(22) $SE_\beta = \sqrt{\frac{\sum \epsilon_i^2}{N-2}}$,

or, more generally,

(23) $SE_\beta = \sqrt{\frac{\sum \epsilon_i^2}{N-K-1}}$.

Equation 22 is the same as Equation 23, except that the former is a simple version that applies to bivariate regression only (where K = 1), and the latter is a more general version that applies to multivariate regression with any number of independent variables. N in these equations refers to the total number of cases, while K is the total number of independent variables in the model. The $SE_\beta$ is a useful measure of the fit of a regression slope: it gives you the average error of the prediction. It's also used to test the significance of the slope coefficient. For instance, if we are going to be 95% confident that our estimate is significantly different from zero, zero should not fall within the interval $\beta \pm 2(SE_\beta)$. Alternatively, if we are using t-statistics to examine coefficients' significance, then the ratio of $\beta$ to $SE_\beta$ should be roughly 2. Assuming you remember the basic sampling and distributional material in your basic statistics course, this reasoning should sound familiar. Here's a quick refresher: Testing model fit is based on some standard beliefs about distributions. Normal distributions are unimodal, symmetric, and are described by the following probability distribution:

(24) $p(Y) = \frac{e^{-(Y - \mu_Y)^2 / 2\sigma_Y^2}}{\sqrt{2\pi\sigma_Y^2}}$,

where p(Y) refers to the probability of a given value of Y, and where the shape of the curve is determined by only two values: the population mean, $\mu_Y$, and its variance, $\sigma_Y^2$. (Also see our discussion of distribution functions, below.) Assuming two distributions with the same mean (of zero, for instance), the effect of changing variances is something like this:
[Figure omitted: two normal curves with the same mean and different variances.] We know that many natural phenomena follow a normal distribution. So we assume that many political phenomena do as well. Indeed, where the current case is concerned, we believe that our estimated slope coefficient, $\beta$, is one of a distribution of possible $\beta$'s we might find in repeated samples. These $\beta$'s are normally distributed, with a standard deviation that we try to estimate from our data. We also know that in any normal distribution, roughly 68% of all cases fall within plus or minus one standard deviation from the mean, and 95% of all cases fall within plus or minus two standard deviations from the mean. It follows that our slope should not be within two standard errors of zero. If it is, we cannot be 95% confident that our coefficient is significantly different from zero; that is, we cannot reject the null hypothesis that there is no significant effect. Going through this process step-by-step is useful. Let's begin with our estimated bivariate model from above, where the model is $Y_i = 1.5 + (.45)X_i$. [The case-by-case table of predicted values and squared errors is omitted; recall $\bar{X} = 5$ and $\bar{Y} = 3.75$, and note that $\sum \epsilon_i^2 = 2.7$.] Based on Equation 22, we calculate the standard error of the slope as follows:
$SE_\beta = \sqrt{\frac{\sum \epsilon_i^2}{N-2}} = \sqrt{\frac{2.7}{4-2}} = \sqrt{1.35} = 1.16$

So, we can be 95% confident that the slope estimate in the population is .45 ± (2 × 1.16), or .45 ± 2.32. Zero is certainly within this interval, so our results are not statistically significant. This is mainly due to our very small sample size. Imagine the same slope and sum of squared errors, but based on a sample of 200 cases:

$SE_\beta = \sqrt{\frac{\sum \epsilon_i^2}{N-2}} = \sqrt{\frac{2.7}{198}} = \sqrt{.014} = .118$

Now we can be 95% confident that the slope estimate in the population is .45 ± (2 × .118), or .45 ± .236. Zero is not within this interval, so our results in this case would be statistically significant. Just to recap, our decision about the statistical significance of the slope is based on a combination of the magnitude of the slope ($\beta$), the total amount of error in the estimate (using the $SE_\beta$), and the sample size (N, used in our calculation of the $SE_\beta$). Any one of these things can contribute to significant findings: a greater slope, less error, and/or a larger sample size. (Here, we saw the effect that sample size can have.) Another means of examining the overall model fit (that is, including all independent variables in a multivariate context) is by looking at the proportion of the total variation in $Y_i$ explained by the model. First, total variation can be decomposed into explained and unexplained components as follows:

TSS is the Total Sum of Squares
RSS is the Regression Sum of Squares (note that some texts call this RegSS)
ESS is the Error Sum of Squares (some texts call this the residual sum of squares, RSS)

So, TSS = RSS + ESS, where

(25) $TSS = \sum (Y_i - \bar{Y})^2$,

(26) $RSS = \sum (\hat{Y}_i - \bar{Y})^2$, and

(27) $ESS = \sum (Y_i - \hat{Y}_i)^2$.
We're basically dividing up the total variance in $Y_i$ around its mean (TSS) into two parts: the variance accounted for in the regression model (RSS), and the variance not accounted for by the regression model (ESS). Indeed, we can illustrate on a case-by-case basis the variance from the mean that is accounted for by the model, and the remaining, unaccounted-for, variance: [figure omitted]. All the explained variance (squared) is summed to form RSS; all the unexplained variance (squared) is summed to form ESS. Using these terms, the coefficient of determination, more commonly the $R^2$, is calculated as follows:

(28) $R^2 = \frac{RSS}{TSS}$, or $R^2 = 1 - \frac{ESS}{TSS}$, or $R^2 = \frac{TSS - ESS}{TSS}$.

Or, alternatively, following from Equations 25-27:

(29) $R^2 = \frac{RSS}{TSS} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = \frac{\sum (Y_i - \bar{Y})^2 - \sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}$

And we can estimate all of this for our data. [The case-by-case table is omitted; the resulting sums are TSS = 6.74, RSS = 4.04, and ESS = 2.7.]
The coefficient of determination is thus $R^2 = \frac{RSS}{TSS} = \frac{4.04}{6.74} = .599$. The coefficient of determination is calculated the same way for multivariate regression. The $R^2$ has one problem, though: it can only ever increase or stay the same as variables are added to the equation. More to the point, including extra variables can never lower the $R^2$, and the measure accordingly does not reward model parsimony. If you want a measure that does so, you need to use a correction for degrees of freedom (sometimes called an adjusted R-squared):

(30) $\bar{R}^2 = 1 - \frac{ESS/(N-K-1)}{TSS/(N-1)}$

Note that this should only make a difference when the sample size is relatively small, or the number of independent variables is relatively large. You can see in Equation 30 that, holding fit constant, increasing the number of variables shrinks N-K-1, inflates the error term ESS/(N-K-1), and thus reduces the adjusted $R^2$; the penalty bites hardest when N is small. One further note about the coefficient of determination: in the bivariate case, the $R^2$ is equivalent to the square of Pearson's r (Equation 6). That is,

(31) $r = \frac{S_{XY}}{S_X S_Y} = \sqrt{R^2_{XY}}$.

There is, then, a clear relationship between the correlation coefficient and the coefficient of determination. There is also a relationship between a bivariate correlation coefficient and the regression coefficient. Let's begin with an equation for the regression coefficient, as in Equation 17 above:

(32) $\beta_{XY} = \frac{S_{XY}}{S_X^2}$,

and rearrange these terms to isolate the covariance:

(33) $S_{XY} = \beta_{XY} S_X^2$.

Now, let's substitute this for $S_{XY}$ in the equation for correlation (Equation 6):

(34) $r_{XY} = \frac{S_{XY}}{S_X S_Y} = \frac{\beta_{XY} S_X^2}{S_X S_Y}$.
So the correlation coefficient and the bivariate regression coefficient are simple functions of each other. More clearly:

(35) $r_{XY} = \beta_{XY} \frac{S_X}{S_Y}$, and

(36) $\beta_{XY} = r_{XY} \frac{S_Y}{S_X}$.

The relationship between the two in multivariate regression is of course much more complicated. But the point is that all these measures - measures capturing various aspects of the relationship between two (or more) variables - are related to each other, each a function of a given set of variances and covariances.
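Both $R^2$ measures (Equations 28 and 30) are simple functions of the sums of squares in Equations 25-27. A Python sketch, checked against hypothetical observed and fitted values rather than the worked example's data:

```python
def fit_stats(y, yhat, k):
    """R-squared (Equation 28) and adjusted R-squared (Equation 30).
    y: observed values; yhat: model predictions; k: number of regressors."""
    n = len(y)
    ybar = sum(y) / n
    tss = sum((yi - ybar) ** 2 for yi in y)               # Equation 25
    ess = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # Equation 27
    r2 = 1 - ess / tss
    adj_r2 = 1 - (ess / (n - k - 1)) / (tss / (n - 1))
    return r2, adj_r2

# Hypothetical data: r2 = 0.8, adjusted r2 = 0.7 with one regressor.
r2, adj = fit_stats([1, 2, 3, 4], [1.5, 1.5, 3.5, 3.5], k=1)
```

A perfect fit (ESS = 0) gives $R^2 = 1$, and the adjusted version is always at or below the unadjusted one when the model explains anything at all.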
Assumptions of OLS regression

The preceding OLS linear regression models are unbiased and efficient (that is, they provide the Best Linear Unbiased Estimator, or BLUE) provided five assumptions are not violated. If any of these assumptions are violated, the regular linear OLS model ceases to be unbiased and/or efficient. The assumptions themselves, as well as problems resulting from violating each one, are listed below (drawn from Kennedy, Econometrics). Of course, many data or models violate one or more of these assumptions, so much of what we have to cover now is how to deal with these problems.

1. Y can be calculated as a linear function of X, plus a disturbance term. Problems: wrong regressors, nonlinearity, changing parameters.

2. The expected value of the disturbance is zero; the mean of $\epsilon$ is zero. Problems: biased intercept.

3. Disturbance terms have the same variance and are not correlated with one another. Problems: heteroskedasticity, autocorrelated errors.

4. Observations of X are fixed in repeated samples; it is possible to repeat the sample with the same independent values. Problems: errors in variables, autoregression, simultaneity.

5. The number of observations is greater than the number of independent variables, and there are no exact linear relationships between the independent variables. Problems: multicollinearity.
Nonlinearities

So far, we've assumed that the relationship between $Y_i$ and $X_i$ is linear. In many cases, this will not be true. We could imagine any number of non-linear relationships. Here are just two common possibilities: [figures omitted: an exponential increase, and a ceiling effect]. We can of course estimate a linear relationship in both cases; it doesn't capture the actual relationship very well, though. In order to better capture the relationship between Y and X, we may want to adjust our variables to represent this non-linearity. Let's begin with the basic multivariate model,

(37) $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \epsilon_i$.

Where a single X is believed to have a nonlinear relationship with Y, the simplest approach is to manipulate that X: to use $X^2$ in place of X, for instance:

(38) $Y_i = \alpha + \beta_1 X_{1i}^2 + \beta_2 X_{2i} + \epsilon_i$.

This may capture the exponential increase depicted in the first figure above. To capture the ceiling effect in the second figure, we could use both the linear (X) and quadratic ($X^2$) terms, with the expectation that the coefficient on the former ($\beta_1$) would be positive and large, and the coefficient on the latter ($\beta_2$) would be negative and small:

(39) $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{1i}^2 + \beta_3 X_{2i} + \epsilon_i$.

The coefficient on the quadratic will gradually, and increasingly, reduce the positive effect of $X_1$. Indeed, if the effect of the quadratic is great enough, it can, in combination with the linear version of $X_1$, produce a line that increases, peaks, and then begins to decrease.
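The linear-plus-quadratic specification in Equation 39 is still linear in the coefficients, so it can be estimated by ordinary least squares with X and $X^2$ treated as two separate regressors. A Python sketch, checked on fabricated data that follow an exact ceiling-shaped curve (so the coefficients should be recovered exactly):

```python
def fit_quadratic(x, y):
    """OLS of Y on X and X^2 (the quadratic part of Equation 39, with no
    additional regressor). Uses the standard two-regressor slope formulas."""
    z = [v ** 2 for v in x]                    # the constructed quadratic term
    n = len(y)
    mx, mz, my = sum(x) / n, sum(z) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    szy = sum((a - mz) * (b - my) for a, b in zip(z, y))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((a - mz) ** 2 for a in z)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    d = sxx * szz - sxz ** 2
    b1 = (sxy * szz - szy * sxz) / d           # linear coefficient
    b2 = (szy * sxx - sxy * sxz) / d           # quadratic coefficient
    return my - b1 * mx - b2 * mz, b1, b2      # (constant, b1, b2)

# Fabricated ceiling-shaped data: y = 2 + 3x - 0.5x^2 exactly.
x = [0, 1, 2, 3, 4]
y = [2.0, 4.5, 6.0, 6.5, 6.0]
a, b1, b2 = fit_quadratic(x, y)   # recovers a = 2, b1 = 3, b2 = -0.5
```

As the notes describe, the positive linear coefficient and small negative quadratic coefficient together produce a curve that rises and then flattens (here it peaks at x = 3).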
Of course, these are just two of the simplest (and most common) nonlinearities. You can imagine any number of different non-linear relationships; most can be captured by some kind of mathematical adjustment to the regressors. Sometimes we believe there is a nonlinear relationship between all the Xs and Y; that is, all Xs combined have a nonlinear effect on Y, for instance:

(40) $Y_i = (\alpha + \beta_1 X_{1i} + \beta_3 X_{2i})^2 + \epsilon_i$.

The easiest way to estimate this is not Equation 40, though, but rather an adjustment as follows:

(41) $\sqrt{Y_i} = \alpha + \beta_1 X_{1i} + \beta_3 X_{2i} + \epsilon_i$.

Here, we simply transform the dependent variable. I've replaced the squared version of the right-hand side (RHS) variables with the square root of the left-hand side (LHS) because it's a simple example of a nonlinear transformation. It's not the most common, however. The most common is taking the log of Y, as follows:

(42) $\ln(Y_i) = \alpha + \beta_1 X_{1i} + \beta_3 X_{2i} + \epsilon_i$.

Doing so serves two purposes. First, we might believe that the shape of the effect of our RHS variables on $Y_i$ is actually nonlinear, and specifically logistic in shape (an S-curve); this transformation may quite nicely capture that nonlinearity. Second, taking the log of $Y_i$ can solve a distributional problem with that variable. OLS estimations work more efficiently with variables that are normally distributed. If $Y_i$ has a great many small values and a long right-hand tail (as many of our variables will; income, for instance), then taking the log of $Y_i$ often does a nice job of generating a more normal distribution. This example highlights a second reason for transforming a variable, on the LHS or RHS. Sometimes, a transformation is based on a particular shape of an effect, based on theory. Other times, a transformation is used to fix a non-normally distributed variable.
The first kind of transformation is based on theoretical expectations; the second is based on a statistical problem. (In practice, separating the two is not always easy.)
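A quick way to see the distributional payoff of the log transformation in Equation 42 is to log a right-skewed variable and watch the mean move back toward the median. A Python sketch with invented income values (none of this comes from course data; the mean-minus-median gap is used here only as a crude skew indicator):

```python
import math

def log_transform(y):
    """ln(Y), as in Equation 42. Values must be strictly positive;
    zeros (e.g., zero incomes) are typically recoded before logging."""
    return [math.log(v) for v in y]

def skew_gap(v):
    """Mean minus median: a rough indicator of right skew (positive when
    a long right tail pulls the mean above the median)."""
    mean = sum(v) / len(v)
    median = sorted(v)[len(v) // 2]
    return mean - median

# Invented, right-skewed "incomes" (in thousands):
incomes = [20, 25, 30, 40, 60, 120, 400]
logged = log_transform(incomes)
# The gap shrinks dramatically after logging.
```

The raw series has a mean far above its median; the logged series is much more symmetric, which is exactly the distributional fix described above.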
Collinearity and Multicollinearity

When there is a linear relationship among the regressors, the OLS coefficients are not uniquely identified. This is not a problem if your goal is only to predict Y: multicollinearity will not affect the overall prediction of the regression model. If your goal is to understand how the individual RHS variables impact Y, however, multicollinearity is a big problem. One problem is that the individual p-values can be misleading: confidence intervals on the regression coefficients will be very wide. Essentially, what we are concerned about is the correlation amongst regressors, for instance, $X_1$ and $X_2$:

(43) $r_{12} = \frac{\sum (X_{1i} - \bar{X}_1)(X_{2i} - \bar{X}_2)}{\sqrt{\sum (X_{1i} - \bar{X}_1)^2 \sum (X_{2i} - \bar{X}_2)^2}}$.

This is of course just a simple adjustment to the Pearson's r equation (Equation 7). Equation 43 deals just with the relationship between two variables, however, and we are often worried about a more complicated situation: one in which a given regressor is correlated with a combination of several, or even all, of the other regressors in a model. (Note that this multicollinearity can exist even if there are no striking bivariate relationships between regressors.) Multicollinearity is perhaps most easily depicted as a regression model in which one X is regressed on all others. That is, for the regression model,

(44) $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \epsilon_i$,

we might be concerned that the following regression produces strong results:

(45) $X_{1i} = \alpha + \beta_2 X_{2i} + \beta_3 X_{3i} + \beta_4 X_{4i} + \epsilon_i$.

If $X_1$ is well predicted by $X_2$ through $X_4$, it will be very difficult to identify the slope (and error) for $X_1$ separately from the set of other slopes (and errors). (The slopes and errors for the other variables may be affected as well.) Variance inflation factors (VIFs) are one measure that can be used to detect multicollinearity.
Essentially, VIFs are a scaled version of the multiple correlation coefficient between variable j and the rest of the independent variables. Specifically,

(46) $VIF_j = \frac{1}{1 - R_j^2}$,

where $R_j^2$ would be based on results from a model as in Equation 45. If $R_j^2$ equals zero (i.e., no correlation between $X_j$ and the remaining independent
variables), then $VIF_j$ equals 1. This is the minimum value. As $R_j^2$ increases, however, the denominator of Equation 46 decreases, and the estimated VIF rises as a consequence. A value greater than 10 represents a pretty big multicollinearity problem. VIFs tell us how much the variance of the estimated regression coefficient is 'inflated' by the existence of correlation among the predictor variables in the model. The square root of the VIF actually tells us how much the standard error is inflated. [A table drawn from the Sage volume by Fox, "Coefficient Variance Inflation as a Function of Inter-Regressor Multiple Correlation," showing $R_j^2$, the corresponding VIF, and the resulting inflation of $SE_{\beta_j}$, is omitted in this version.] Ways of dealing with multicollinearity include (a) dropping variables, (b) combining multiple collinear variables into a single measure, and/or (c) if collinearity is only moderate, and all variables are of substantive importance to the model, simply interpreting coefficients and standard errors taking into account the effects of multicollinearity.
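Equation 46 is trivial to compute once you have $R_j^2$. In the simplest case, where there is only one other regressor, $R_j^2$ from the auxiliary regression in Equation 45 is just the squared bivariate correlation, so the VIF can be sketched in a few lines of Python (invented data in the checks; a real application would regress $X_j$ on all remaining regressors):

```python
def vif_two(x1, x2):
    """VIF for X1 when the only other regressor is X2 (Equation 46).
    With one other regressor, R^2_j is the squared Pearson correlation."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r2 = s12 ** 2 / (s11 * s22)      # squared correlation = R^2_j here
    return 1 / (1 - r2)
```

Uncorrelated regressors give the minimum VIF of 1; a near-duplicate regressor drives the VIF far past the rule-of-thumb threshold of 10.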
Heteroskedasticity

Heteroskedasticity refers to unequal variance in the regression errors. Note that there can be heteroskedasticity relating to the effect of individual independent variables, and also heteroskedasticity related to the combined effect of all independent variables. (In addition, there can be heteroskedasticity in terms of unequal variance over time.) The following figure portrays the standard case of heteroskedasticity, where the variance in Y (and thus the regression error as well) is systematically related to values of X. [Figure omitted: a scatterplot in which the spread of Y widens as X increases.] The difficulty here is that the error of the slope will be poorly estimated: it will over-estimate the error at small values of X, and under-estimate the error at large values of X. Diagnosing heteroskedasticity is often easiest by looking at a plot of errors ($\epsilon_i$) by values of the dependent variable ($Y_i$). Basically, we begin with the standard bivariate model of $Y_i$,

(47) $Y_i = \alpha + \beta X_i + \epsilon_i$,

and then plot the resulting values of $\epsilon_i$ by $Y_i$. If we did so for the data in the preceding figure, then the resulting residuals plot would look as follows:
[Residuals plot omitted.] As $Y_i$ increases here, so too does the variance in $\epsilon_i$. There are of course other possible (heteroskedastic) relationships between $Y_i$ and $\epsilon_i$; for instance, the variance may be much greater in the middle of the range. Any version of heteroskedasticity presents problems for OLS models. When the sample size is relatively small, these diagnostic graphs are probably the best means of identifying heteroskedasticity. When the sample size is large, there are too many dots on the graph to distinguish what's going on. There are several tests for heteroskedasticity, however. The Breusch-Pagan test tests for a relationship between the error and the independent variables. It starts with a standard multivariate regression model,

(48) $Y_i = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki} + \epsilon_i$,

and then substitutes the estimated errors, squared, for the dependent variable:

(49) $\hat{\epsilon}_i^2 = \alpha + \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots + \beta_k X_{ki} + \nu_i$.

We then use a standard F-test to test the joint significance of the coefficients in Equation 49. If they are significant, there is some kind of systematic relationship between the independent variables and the error.
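The core of the Breusch-Pagan procedure, the auxiliary regression in Equation 49, can be sketched for the bivariate case in a few lines. This is a simplified illustration only: it reports the auxiliary regression's $R^2$ (how much of the squared-residual variation X explains) rather than the F or LM statistic a full test would compute, and the data in the checks are fabricated:

```python
def ols(x, y):
    """Bivariate OLS constant and slope (Equations 13 and 9)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

def breusch_pagan_r2(x, y):
    """Auxiliary R^2 for a bivariate Breusch-Pagan check (Equation 49):
    regress squared residuals on X. High values suggest the error variance
    is systematically related to X (heteroskedasticity)."""
    a, b = ols(x, y)
    e2 = [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]  # squared residuals
    aa, bb = ols(x, e2)                                      # Equation 49
    me = sum(e2) / len(e2)
    tss = sum((v - me) ** 2 for v in e2)
    ess = sum((v - (aa + bb * xi)) ** 2 for xi, v in zip(x, e2))
    return 1 - ess / tss if tss > 0 else 0.0
```

With errors whose spread grows with X, the auxiliary $R^2$ is large; with constant-spread errors, it is near zero.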
Outliers

Recall that OLS regression pays particularly close attention to avoiding large errors. It follows that outliers (cases that are unusual) can have a particularly large effect on an estimated regression slope. Consider the following two possibilities, where a single outlier has a huge effect on the estimated slope:

[figures omitted: two scatterplots, each with a single outlying case pulling the fitted line away from the bulk of the data]

Hat values (h_i) are the common measure of leverage in a regression. It is possible to express the fitted values (Ŷ_j) in terms of the observed values (Y_i):

(50) Ŷ_j = h_1j·Y_1 + h_2j·Y_2 + ... + h_nj·Y_n = Σ_{i=1}^{n} h_ij·Y_i.

The coefficient, or weight, h_ij captures the contribution of each observation Y_i to the fitted value Ŷ_j. Outlying cases can usually not be discovered by looking at residuals; OLS estimation tries, after all, to minimize the error for high-leverage cases. In fact, the variance in residuals is in part a function of leverage,

(51) V(E_i) = σ²(1 − h_i).

The greater the hat value in Equation 51, the lower the variance. How can we identify high-leverage cases? Sometimes, simply plotting the data can be very helpful. Also, we can look closely at residuals. Start with the formula for standardized residuals, as follows,

(52) E′_i = E_i / (S_E·√(1 − h_i)),

which simply expresses each residual as a number (or increment) of standard deviations in E_i. The problem with Equation 52 is that case i is included in the
estimation of the variance; what we really want is a sense of how case i looks in relation to the variance in all other cases. This is the studentized residual,

(53) E*_i = E_i / (S_E(−i)·√(1 − h_i)),

and it provides a good indication of just how far out a given case is in relation to all other cases. (To test significance, the statistic follows a t-distribution with N−K−2 degrees of freedom.)

Note that you can estimate studentized residuals in a quite different way (though with the same results). Start by defining a variable D, equal to 1 for case i and equal to 0 for all other cases. Now, for a multivariate regression model as follows:

(54) Y_i = α + β_1 X_1 + β_2 X_2 + ... + β_k X_k + ε_i,

add variable D and estimate,

(55) Y_i = α + β_1 X_1 + β_2 X_2 + ... + β_k X_k + γD_i + ε_i.

This is referred to as a mean-shift outlier model, and the t-statistic for γ provides a test equivalent to the studentized residual.

What do we do if we have outliers? That depends. If there are reasons to believe the case is abnormal, then sometimes it's best just to drop it from the dataset. If you believe the case is correct, or justifiable, in spite of the fact that it's an outlier, then you may choose to keep it in the model. At a minimum, you will want to test your model with and without the outlier, to explore the extent to which your results are driven by a single case (or, in the case of several outliers, a small number of cases).
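For the bivariate case, hat values and studentized residuals can be computed directly. A pure-Python sketch with hypothetical data (ten well-behaved cases plus one gross outlier); the deletion identity used to compute S_E(−i) without literally refitting is the standard one:

```python
import math

# Hypothetical data: ten cases close to y = 2x, plus a high-leverage outlier.
x = list(range(1, 11)) + [30]
y = [2 * xi + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x[:-1])] + [20.0]

n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

# Bivariate hat values: h_i = 1/n + (x_i - xbar)^2 / Sxx
h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Full-sample OLS fit and residuals
ybar = sum(y) / n
beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
alpha = ybar - beta * xbar
e = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]

# Externally studentized residuals (Eq. 53): S_E(-i) leaves case i out,
# computed here via the standard deletion identity rather than n refits.
s2 = sum(ei ** 2 for ei in e) / (n - 2)
t = []
for ei, hi in zip(e, h):
    s2_minus_i = ((n - 2) * s2 - ei ** 2 / (1 - hi)) / (n - 3)
    t.append(ei / math.sqrt(s2_minus_i * (1 - hi)))

worst = max(range(n), key=lambda i: abs(t[i]))
print(worst, round(h[worst], 2), round(t[worst], 1))
```

Note that the outlier's raw residual is unremarkable, because the fitted line chases the high-leverage case; only its hat value and studentized residual give it away.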
Models for dichotomous data

Linear Probability Models

Let's begin with a simple definition of our binary dependent variable. We have a variable, Y_i, which only takes on the values 0 or 1. We want to predict when Y_i is equal to 0, or 1; put differently, we want to know for each individual case i the probability that Y_i is equal to 1, given X_i. More formally,

(56) E(Y_i) = Pr(Y_i = 1 | X_i),

which states that the expected value of Y_i is equal to the probability that Y_i is equal to one, given X_i. Now, a linear probability model simply estimates Pr(Y_i = 1) in the same way as we would estimate an interval-level Y_i:

(57) Pr(Y_i = 1) = α + βX_i.

There are two difficulties with this kind of model. First, while the estimated slope coefficients are unbiased, the standard errors are incorrect due to heteroskedasticity (errors increase in the middle range, first negative, then positive). Graphing the data with a regular linear regression line, for instance, would look something like this:

[figure omitted: binary Y_i plotted against X_i, with a linear regression line running below 0 and above 1 at the extremes of X_i]

The second problem with the linear probability model is that it will generate predictions that are greater than 1 and/or less than 0 (as shown in the preceding figure), even though these are nonsensical where probabilities are concerned. As a consequence, it is desirable to transform either the LHS or RHS of the model so predictions are both realistic and efficient.
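The boundary problem is easy to demonstrate. A small pure-Python sketch with hypothetical data: a binary outcome fit by OLS, producing fitted "probabilities" below 0 and above 1 at the extremes of X.

```python
# Hypothetical data: a binary outcome that switches from 0 to 1 as x crosses zero.
x = list(range(-10, 11))
y = [0 if xi < 0 else 1 for xi in x]

# Fit the linear probability model Pr(Y=1) = a + b*x by OLS (Eq. 57).
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
    sum((xi - xbar) ** 2 for xi in x)
a = ybar - b * xbar

pred = [a + b * xi for xi in x]
print(round(min(pred), 3), round(max(pred), 3))
```

The minimum fitted value is negative and the maximum exceeds one, exactly the nonsensical predictions described above.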
Nonlinear Probability Model: Logistic Regression

One option is to transform Y_i, to develop a nonlinear probability model. To extend the range beyond 0 to 1, we first transform the probability into the odds,

(58) Pr(Y_i = 1 | X_i) / Pr(Y_i = 0 | X_i) = Pr(Y_i = 1 | X_i) / (1 − Pr(Y_i = 1 | X_i)),

which indicate how often something happens relative to how often it does not, and range from 0 to infinity as Pr(Y_i = 1 | X_i) approaches 1. We then take the log of this to get,

(59) ln( Pr(Y_i = 1 | X_i) / (1 − Pr(Y_i = 1 | X_i)) ),

or more simply,

(60) ln( p_i / (1 − p_i) ),

where,

(61) p_i = Pr(Y_i = 1 | X_i).

The model in Equation 60 then captures the log odds that something will happen. By taking the log, we've effectively stretched out the ends of the 0 to 1 range, and consequently have a comparatively unconstrained dependent variable that can be used without difficulty in a linear model, where

(62) ln( p_i / (1 − p_i) ) = βX_i.

Just to make clear the effects of our transformation, here's what taking the log odds of a simple probability looks like:
Probability   Odds              Logit
.01           1/99  = .0101     −4.60
.05           5/95  = .0526     −2.94
.10           10/90 = .1111     −2.20
.30           30/70 = .4286     −.85
.50           50/50 = 1         0
.70           70/30 = 2.333     .85
.90           90/10 = 9         2.20
.95           95/5  = 19        2.94
.99           99/1  = 99        4.60

Note that there is another way of representing a logit model, essentially the inverse (un-logging of both sides) of Equation 62:

(63) Pr(Y_i = 1 | X_i) = exp(βX_i) / (1 + exp(βX_i)).

Just to be clear, we can work our way backwards from Equation 63 to Equation 62 as follows:

(64) Pr(Y_i = 1 | X_i) = exp(βX_i) / (1 + exp(βX_i)), and
     Pr(Y_i = 0 | X_i) = 1 / (1 + exp(βX_i)), or 1 − Pr(Y_i = 1 | X_i).

So,

(65) p_i / (1 − p_i) = [ exp(βX_i) / (1 + exp(βX_i)) ] / [ 1 / (1 + exp(βX_i)) ] = exp(βX_i),

and,

(66) p_i / (1 − p_i) = exp(βX_i),

which when logging both sides becomes,

(67) ln( p_i / (1 − p_i) ) = βX_i.
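The columns of the table can be reproduced directly. A quick sketch of the transformation and its inverse (Equation 63):

```python
import math

def logit(p):
    """Log odds of a probability (Eq. 60)."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse transformation (Eq. 63): maps any real number back into (0, 1)."""
    return math.exp(z) / (1 + math.exp(z))

for p in [0.01, 0.05, 0.10, 0.30, 0.50, 0.70, 0.90, 0.95, 0.99]:
    odds = p / (1 - p)
    print(f"{p:.2f}  odds={odds:8.4f}  logit={logit(p):6.2f}")
```

Note the symmetry around p = .50 (where the logit is exactly zero), and that the inverse transformation recovers each probability.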
The notation in Equation 62 is perhaps the most useful in connecting logistic with probit and other non-linear estimations for binary data. The logit transformation is just one possible transformation that effectively maps the linear prediction into the 0 to 1 interval, allowing us to retain the fundamentally linear structure of the model while at the same time avoiding the contradiction of probabilities below 0 or above 1. Many cumulative density functions (CDFs) will meet this requirement. (Note that CDFs define the probability mass to the left of a given value of X; they are closely related to PDFs, a CDF being the integral of a PDF, and PDFs are dealt with in more detail in the section on significance tests.)

Equation 63 is in contrast useful for thinking about the logit model as just one example of transformations in which Pr(Y_i = 1) is a function of a non-linear transformation of the RHS variables, based on any number of CDFs. A more general version of Equation 63 is, then,

(68) Pr(Y_i = 1 | X_i) = F(βX_i),

where F is the logistic CDF for the logit model, as follows,

(69) Pr(Y_i = 1 | X_i) = F(βX_i), where F(x) = 1 / (1 + exp(−(x − µ)/s)),

but F could just as easily be the normal CDF for the probit model, or a variety of other CDFs. How do we know which CDF to use? The CDF we choose should reflect our beliefs about the distribution of Y_i, or, alternatively (and equivalently), the distribution of error in Y_i. We discuss this more below.

An Alternative Description: The Latent Variable Model

Another way to draw the link between logistic and regular regression is through the latent variable model, which posits that there is an unobserved, latent variable Y*_i, where

(70) Y*_i = βX_i + ε_i,

and the link between the observed binary Y_i and the latent Y*_i is as follows:

(71) Y_i = 1 if Y*_i > 0, and

(72) Y_i = 0 if Y*_i ≤ 0.
Using this example, the relationship between the observed binary Y_i and the latent Y*_i can be graphed as follows:
[figure omitted: latent Y*_i plotted against X_i, with the regression line crossing zero and an error distribution drawn around the line]

So, at any given value of X_i there is a given probability that Y*_i is greater than zero. This figure also shows how our beliefs about the distribution of error (ε_i) are fundamental: there is a distribution of possible outcomes in Y*_i when, in this figure, X_i = 4. For a probit model, we assume that Var(ε_i) = 1; for a logit model, we assume that Var(ε_i) = π²/3. Other CDFs make other assumptions.

The distribution of error (ε_i) at any given value of X_i is related to a non-linear increase in the probability that Y_i = 1. Indeed, we can show this non-linear shift first by plotting a distribution of ε_i at each value of X_i, and then by looking at how the movement of this distribution across the zero line shifts the probability that Y_i = 1:
[figure omitted: error distributions at increasing values of X_i, with the probability mass above zero growing non-linearly]

As the thick part of the distribution moves across the zero line, the probability increases dramatically.

Nonlinear Probability Model: Probit Regression

As noted above, probit models are based on the same logic as logistic models. Again, they can be thought of as a non-linear transformation of the LHS or RHS variables. The only difference for probit models is that rather than assume a logistic distribution, we assume a normal one. In Equation 68, then, F would now be the cumulative density function for a normal distribution.

Why assume a normal distribution? The critical question is: why assume a logistic one? We typically assume a logistic distribution because it is very close to normal, and estimating a logistic model is computationally much easier than estimating a probit model. We now have faster computers, so there is now less reason to rely on logit rather than probit models. That said, logit has some advantages where teaching is concerned. Compared to probit, it's very simple.
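One way to see how little rides on the logit/probit choice: rescale the logistic CDF to unit variance (so it is comparable to the standard normal) and compare the two curves. A pure-Python sketch:

```python
import math

def logistic_cdf(z, s=1.0):
    """Logistic CDF with location 0 and scale s (variance s^2 * pi^2 / 3)."""
    return 1 / (1 + math.exp(-z / s))

def normal_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Rescale the logistic so its variance matches the standard normal's
# (s = sqrt(3)/pi gives variance 1), then compare the two curves on a grid.
s = math.sqrt(3) / math.pi
grid = [i / 100 for i in range(-400, 401)]
max_gap = max(abs(logistic_cdf(z, s) - normal_cdf(z)) for z in grid)
print(round(max_gap, 3))
```

The maximum vertical gap between the two CDFs is only about two percentage points, which is why logit and probit results are nearly always substantively indistinguishable.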
Maximum Likelihood Estimation

Models for categorical variables are not estimated using OLS, but using maximum likelihood. ML estimates are the values of the parameters that have the greatest likelihood (that is, the maximum likelihood) of generating the observed sample of data, if the assumptions of the model are true. For a simple model like Y_i = α + βX_i, an ML estimation looks at many different possible values of α and β, and finds the combination which is most likely to have generated the observed values of Y_i.

[figure omitted: observed values of Y_i on the horizontal axis, with two candidate probability distributions, A and B, overlaid]

Take, for instance, the above graph, which shows the observed values of Y_i on the bottom axis. There are two different probability distributions, one produced by one set of parameters, A, and one produced by another set of parameters, B. MLE asks which distribution seems more likely to have produced the observed data. Here, it looks like the B parameters have an estimated distribution more likely to produce the observed data.

Alternatively, consider the following. If we are interested in the probability that Y_i = 1, given a certain set of parameters (p), then an ML estimation is interested in the likelihood of p given the observed data,

(73) L(p | Y_i).

This is a likelihood function. Finding the best set of parameters is an iterative process, which starts somewhere and starts searching; different optimization algorithms may start in slightly different places, and conduct the search differently; all base their decision about searching for parameters on the rate of improvement in the model. (The way in which model fit is judged is addressed below.)
Note that our being vague about "parameters" here is purposeful. As analysts, the parameters we are thinking about are the coefficients for the various independent variables (βX). The parameters critical to the ML estimation, however, are those that define the shape of the distribution; for a normal distribution, for instance, these are the mean (µ) and variance (σ²) (see Equation 24). Every set of parameters, βX, however, produces a given estimated normal distribution of Y_i with mean µ and variance σ²; the ML estimation tries to find the βX producing the distribution most likely to have generated our observed data.

Note also that while we speak about ML estimations maximizing the likelihood equation, in practice programs maximize the log of the likelihood, which simplifies computations considerably (and gives the same results). Because the likelihood is always between 0 and 1, the log likelihood is always negative. We can see this in the iteration log in STATA logit estimates, for instance.
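A deliberately crude sketch of the ML idea, using a hypothetical sample and a grid search over candidate means for a normal distribution with known variance. Real estimators use smarter iterative algorithms, but the logic of "pick the parameters most likely to have generated the observed data" is the same:

```python
import math

# Hypothetical sample of Y_i, treated as draws from a normal distribution
# with unknown mean mu and known variance 1.
y = [4.2, 5.1, 3.8, 4.9, 5.4, 4.4, 5.0, 4.6]

def log_likelihood(mu):
    """Sum of normal log-densities: negative here, since each density < 1."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - mu) ** 2 for yi in y)

# Crude "search": evaluate the log likelihood over a grid of candidate means
# and keep the value most likely to have generated the data.
grid = [i / 100 for i in range(300, 701)]  # candidate mu from 3.00 to 7.00
best_mu = max(grid, key=log_likelihood)
print(best_mu, round(log_likelihood(best_mu), 2))
```

The grid maximizer lands (up to grid resolution) on the sample mean, which is indeed the analytic ML estimate of a normal mean, and the maximized log likelihood is negative, as described above.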
Interpretation & Goodness of Fit Measures for Categorical Models

Indeed, the -2 log likelihood is the measure of model fit for most categorical models. It is as follows,

(74) −2(LL_B − LL_A),

where LL_A is the log likelihood of finding our sample of Y_i in a distribution produced by our parameterized model, and LL_B is the log likelihood of finding our sample of Y_i in the distribution produced when all parameters are restricted to 0. Essentially, then, we're looking at the total improvement in the model's predictive power: the difference between our model and no model (save for a distributional assumption). Multiplying this difference by −2 has the (albeit mysterious) advantage of producing a statistic that is asymptotically χ²-distributed. There are various versions of a pseudo-R² for categorical models, usually based on some manipulation of the -2 log likelihood.

To interpret individual coefficients resulting from a categorical model, we usually transform them into odds ratios (from log-odds ratios, which are not readily interpretable). This transformation is relatively simple. Recall that one version of the logit model is as follows,

(75) ln( p_i / (1 − p_i) ) = βX_i.

This is the log odds ratio, of course, equivalent to the following,

(76) p_i / (1 − p_i) = exp(βX_i).

This transformation of coefficients produces odds ratios, where each exponentiated coefficient now expresses the multiplicative change in the odds that Y_i is equal to 1 (rather than 0) associated with a one-unit increase in X_i. (There are equivalent transformations for probit coefficients.)
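Putting the pieces together, here is a pure-Python sketch on simulated (hypothetical) data: a bivariate logit fit by Newton-Raphson, the −2(LL_B − LL_A) fit statistic, and the odds ratio exp(β). One small liberty: the "no model" baseline here keeps an intercept (the sample proportion), as most software does, rather than restricting literally all parameters to zero.

```python
import math, random

# Simulated data: Pr(Y=1) follows a logit with true a = -1, b = 0.8.
rng = random.Random(618)
x = [rng.uniform(-3, 3) for _ in range(400)]
y = [1 if rng.random() < 1 / (1 + math.exp(-(-1 + 0.8 * xi))) else 0 for xi in x]

def log_lik(a, b):
    """Binary log likelihood under the logit model (Eq. 63)."""
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + math.exp(-(a + b * xi)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return ll

# Newton-Raphson maximization of the log likelihood (intercept a, slope b).
a = b = 0.0
for _ in range(25):
    g_a = g_b = w = wx = wxx = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + math.exp(-(a + b * xi)))
        g_a += yi - p                       # gradient, intercept
        g_b += xi * (yi - p)                # gradient, slope
        v = p * (1 - p)                     # information weights
        w += v; wx += v * xi; wxx += v * xi * xi
    det = w * wxx - wx * wx
    a += (wxx * g_a - wx * g_b) / det       # 2x2 inverse times gradient
    b += (-wx * g_a + w * g_b) / det

# Model fit: -2(LL_B - LL_A), with an intercept-only null model.
p0 = sum(y) / len(y)
ll_null = sum(yi * math.log(p0) + (1 - yi) * math.log(1 - p0) for yi in y)
chi2 = -2 * (ll_null - log_lik(a, b))
odds_ratio = math.exp(b)   # Eq. 76: multiplicative change in odds per unit x
print(round(b, 2), round(odds_ratio, 2), round(chi2, 1))
```

The estimated slope lands near the true 0.8, the likelihood-ratio statistic is far beyond any χ²(1) critical value, and exponentiating the slope gives its odds-ratio interpretation.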
Models for Categorical Data

Ordinal Outcomes

For models where the dependent variable is categorical, but ordered, ordered logit is the most appropriate modelling strategy. A typical description begins with a latent variable Y*_i, which is a function of

(77) Y*_i = βX_i + ε_i,

and a link between the observed ordinal Y_i and the latent Y*_i as follows:

(78) Y_i = 1 if Y*_i ≤ δ_1, and
     Y_i = 2 if δ_1 < Y*_i ≤ δ_2, and
     Y_i = 3 if Y*_i > δ_2,

where δ_1 and δ_2 are unknown threshold parameters to be estimated along with the β in Equation 77. We can restate the model, then, as follows:

(79) Pr(Y_i = 1 | X_i) = Pr(βX_i + ε_i ≤ δ_1) = Pr(ε_i ≤ δ_1 − βX_i), and
     Pr(Y_i = 2 | X_i) = Pr(δ_1 < βX_i + ε_i ≤ δ_2) = Pr(δ_1 − βX_i < ε_i ≤ δ_2 − βX_i), and
     Pr(Y_i = 3 | X_i) = Pr(βX_i + ε_i > δ_2) = Pr(ε_i > δ_2 − βX_i).

The last statement of each line here makes clear the importance that the distribution of error plays in the estimation: the probability of a given outcome can be expressed as the probability that the error is, in the first line for instance, smaller than the difference between δ and the estimated value. This set of statements can also be expressed as follows, adding hats to denote estimated values, substituting the predicted Ŷ_i for βX_i, and inserting a given cumulative distribution function, F, from which we derive our probability estimates:

(80) p̂_i1 = Pr(ε_i ≤ δ̂_1 − Ŷ_i) = F(δ̂_1 − Ŷ_i), and
     p̂_i2 = Pr(δ̂_1 − Ŷ_i < ε_i ≤ δ̂_2 − Ŷ_i) = F(δ̂_2 − Ŷ_i) − F(δ̂_1 − Ŷ_i), and
     p̂_i3 = Pr(ε_i > δ̂_2 − Ŷ_i) = 1 − F(δ̂_2 − Ŷ_i),

where F can again be the logistic CDF (for ordered logit), but also the normal CDF (for ordered probit), and so on. Again, using the logistic version as the
example is far easier, and we can express the whole system in another way, as follows:

(81) ln( p_1 / (1 − p_1) ) = δ_1 − βX_i, and
     ln( (p_1 + p_2) / (1 − p_1 − p_2) ) = δ_2 − βX_i, and
     ...
     ln( (p_1 + p_2 + ... + p_k) / (1 − p_1 − p_2 − ... − p_k) ) = δ_k − βX_i,

where each cumulative split has its own cut-point δ but shares the same β. Note that these models rest on the parallel slopes assumption: the slope coefficients do not vary between different categories of the dependent variable (i.e., from the first to second category, the second to third category, and so on). If this assumption is unreasonable, a multinomial model is more appropriate. (In fact, this assumption can be tested by fitting a multinomial model and examining differences and similarities in coefficients across categories.) And now, when we talk about odds ratios, we are talking about a shift in the odds of falling into a given category (m) or above,

(82) OR(m) = Pr(Y_i ≥ m) / Pr(Y_i < m).

Nominal Outcomes

Multinomial logit is essentially a series of logit regressions examining the probability that Y_i = m rather than Y_i = k, where k is a reference category. This means that one category of the dependent variable is set aside as the reference category, and all models show the probability of Y_i being one outcome rather than outcome k. Say, for instance, there are four outcomes: k, m, n, and q, where k is the reference category. The models estimated are:

(83) ln( Pr(Y_i = m) / Pr(Y_i = k) ) = β_m X, and
     ln( Pr(Y_i = n) / Pr(Y_i = k) ) = β_n X, and
     ln( Pr(Y_i = q) / Pr(Y_i = k) ) = β_q X.

These models explore the variables that distinguish each of m, n, and q from k. Any category can be the base category, of course.
It may be that it is additionally interesting to see how q is distinguished from the other categories, in which case the following models can be estimated:

(84) ln( Pr(Y_i = k) / Pr(Y_i = q) ) = β_k X, and
     ln( Pr(Y_i = m) / Pr(Y_i = q) ) = β_m X, and
     ln( Pr(Y_i = n) / Pr(Y_i = q) ) = β_n X.

Results for multinomial logit models aren't expressed as odds ratios, since odds ratios refer to the probability of an outcome divided by one minus that probability. Rather, multinomial results are expressed as a risk ratio, or relative risk, which is easily calculated by taking the exponential of the log risk-ratio, where the log risk-ratio is the estimated coefficient itself: the log of the probability of one outcome relative to the probability of the reference outcome.
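The probabilities implied by the system in Equation 83 can be sketched directly, with hypothetical coefficients: probabilities for all categories come from a softmax over the linear predictors, with the reference category's coefficient fixed at zero, and exponentiating a coefficient gives the relative risk ratio.

```python
import math

# Hypothetical multinomial-logit coefficients for outcomes m, n, q relative to
# reference category k (whose coefficient is fixed at zero), with a single x.
beta = {"k": 0.0, "m": 0.7, "n": -0.4, "q": 1.2}

def probs(x):
    """Category probabilities implied by Eq. 83: softmax over linear parts."""
    num = {c: math.exp(b * x) for c, b in beta.items()}
    total = sum(num.values())
    return {c: v / total for c, v in num.items()}

p = probs(x=1.0)
# The log risk-ratio relative to k recovers the coefficient (at x = 1)...
log_rr_m = math.log(p["m"] / p["k"])
# ...and exponentiating a coefficient gives the relative risk ratio per unit x.
rr_m = math.exp(beta["m"])
print(round(sum(p.values()), 6), round(log_rr_m, 3), round(rr_m, 3))
```

The category probabilities sum to one by construction, and the log of the ratio Pr(Y = m)/Pr(Y = k) reproduces β_m, which is the sense in which each equation in the system is "just" a binary logit against the reference category.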
Econometric Modelling Prof. Rudra P. Pradhan Department of Management Indian Institute of Technology, Kharagpur Module No. # 01 Lecture No. # 28 LOGIT and PROBIT Model Good afternoon, this is doctor Pradhan
More informationECON2228 Notes 2. Christopher F Baum. Boston College Economics. cfb (BC Econ) ECON2228 Notes / 47
ECON2228 Notes 2 Christopher F Baum Boston College Economics 2014 2015 cfb (BC Econ) ECON2228 Notes 2 2014 2015 1 / 47 Chapter 2: The simple regression model Most of this course will be concerned with
More informationSociology 593 Exam 2 Answer Key March 28, 2002
Sociology 59 Exam Answer Key March 8, 00 I. True-False. (0 points) Indicate whether the following statements are true or false. If false, briefly explain why.. A variable is called CATHOLIC. This probably
More informationChapte The McGraw-Hill Companies, Inc. All rights reserved.
12er12 Chapte Bivariate i Regression (Part 1) Bivariate Regression Visual Displays Begin the analysis of bivariate data (i.e., two variables) with a scatter plot. A scatter plot - displays each observed
More informationCorrelation & Simple Regression
Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.
More informationCorrelation and Regression
Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class
More informationChapter 16. Simple Linear Regression and Correlation
Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will
More informationRegression, part II. I. What does it all mean? A) Notice that so far all we ve done is math.
Regression, part II I. What does it all mean? A) Notice that so far all we ve done is math. 1) One can calculate the Least Squares Regression Line for anything, regardless of any assumptions. 2) But, if
More informationMATH CRASH COURSE GRA6020 SPRING 2012
MATH CRASH COURSE GRA6020 SPRING 2012 STEFFEN GRØNNEBERG Contents 1. Basic stuff concerning equations and functions 2 2. Sums, with the Greek letter Sigma (Σ) 3 2.1. Why sums are so important to us 3 2.2.
More informationOrdinary Least Squares Regression Explained: Vartanian
Ordinary Least Squares Regression Explained: Vartanian When to Use Ordinary Least Squares Regression Analysis A. Variable types. When you have an interval/ratio scale dependent variable.. When your independent
More informationISQS 5349 Final Exam, Spring 2017.
ISQS 5349 Final Exam, Spring 7. Instructions: Put all answers on paper other than this exam. If you do not have paper, some will be provided to you. The exam is OPEN BOOKS, OPEN NOTES, but NO ELECTRONIC
More informationDiscrete Dependent Variable Models
Discrete Dependent Variable Models James J. Heckman University of Chicago This draft, April 10, 2006 Here s the general approach of this lecture: Economic model Decision rule (e.g. utility maximization)
More informationECON 497 Midterm Spring
ECON 497 Midterm Spring 2009 1 ECON 497: Economic Research and Forecasting Name: Spring 2009 Bellas Midterm You have three hours and twenty minutes to complete this exam. Answer all questions and explain
More informationAlgebra & Trig Review
Algebra & Trig Review 1 Algebra & Trig Review This review was originally written for my Calculus I class, but it should be accessible to anyone needing a review in some basic algebra and trig topics. The
More informationLECTURE 10. Introduction to Econometrics. Multicollinearity & Heteroskedasticity
LECTURE 10 Introduction to Econometrics Multicollinearity & Heteroskedasticity November 22, 2016 1 / 23 ON PREVIOUS LECTURES We discussed the specification of a regression equation Specification consists
More informationDo not copy, post, or distribute
14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible
More informationWooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares
Wooldridge, Introductory Econometrics, 4th ed. Chapter 15: Instrumental variables and two stage least squares Many economic models involve endogeneity: that is, a theoretical relationship does not fit
More informationRockefeller College University at Albany
Rockefeller College University at Albany PAD 705 Handout: Suggested Review Problems from Pindyck & Rubinfeld Original prepared by Professor Suzanne Cooper John F. Kennedy School of Government, Harvard
More informationGeneralized Linear Models
York SPIDA John Fox Notes Generalized Linear Models Copyright 2010 by John Fox Generalized Linear Models 1 1. Topics I The structure of generalized linear models I Poisson and other generalized linear
More informationLECTURE 15: SIMPLE LINEAR REGRESSION I
David Youngberg BSAD 20 Montgomery College LECTURE 5: SIMPLE LINEAR REGRESSION I I. From Correlation to Regression a. Recall last class when we discussed two basic types of correlation (positive and negative).
More informationFormal Statement of Simple Linear Regression Model
Formal Statement of Simple Linear Regression Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor
More informationGeneralized Models: Part 1
Generalized Models: Part 1 Topics: Introduction to generalized models Introduction to maximum likelihood estimation Models for binary outcomes Models for proportion outcomes Models for categorical outcomes
More informationGov 2000: 9. Regression with Two Independent Variables
Gov 2000: 9. Regression with Two Independent Variables Matthew Blackwell Fall 2016 1 / 62 1. Why Add Variables to a Regression? 2. Adding a Binary Covariate 3. Adding a Continuous Covariate 4. OLS Mechanics
More informationOverview. Overview. Overview. Specific Examples. General Examples. Bivariate Regression & Correlation
Bivariate Regression & Correlation Overview The Scatter Diagram Two Examples: Education & Prestige Correlation Coefficient Bivariate Linear Regression Line SPSS Output Interpretation Covariance ou already
More informationChapter 9 Regression with a Binary Dependent Variable. Multiple Choice. 1) The binary dependent variable model is an example of a
Chapter 9 Regression with a Binary Dependent Variable Multiple Choice ) The binary dependent variable model is an example of a a. regression model, which has as a regressor, among others, a binary variable.
More informationHeteroskedasticity. y i = β 0 + β 1 x 1i + β 2 x 2i β k x ki + e i. where E(e i. ) σ 2, non-constant variance.
Heteroskedasticity y i = β + β x i + β x i +... + β k x ki + e i where E(e i ) σ, non-constant variance. Common problem with samples over individuals. ê i e ˆi x k x k AREC-ECON 535 Lec F Suppose y i =
More informationInferences for Regression
Inferences for Regression An Example: Body Fat and Waist Size Looking at the relationship between % body fat and waist size (in inches). Here is a scatterplot of our data set: Remembering Regression In
More informationBiostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras
Biostatistics and Design of Experiments Prof. Mukesh Doble Department of Biotechnology Indian Institute of Technology, Madras Lecture - 39 Regression Analysis Hello and welcome to the course on Biostatistics
More informationIntroduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016
Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An
More informationSo far our focus has been on estimation of the parameter vector β in the. y = Xβ + u
Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,
More informationHierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!
Hierarchical Generalized Linear Models ERSH 8990 REMS Seminar on HLM Last Lecture! Hierarchical Generalized Linear Models Introduction to generalized models Models for binary outcomes Interpreting parameter
More informationEstimating σ 2. We can do simple prediction of Y and estimation of the mean of Y at any value of X.
Estimating σ 2 We can do simple prediction of Y and estimation of the mean of Y at any value of X. To perform inferences about our regression line, we must estimate σ 2, the variance of the error term.
More informationAP Statistics. Chapter 6 Scatterplots, Association, and Correlation
AP Statistics Chapter 6 Scatterplots, Association, and Correlation Objectives: Scatterplots Association Outliers Response Variable Explanatory Variable Correlation Correlation Coefficient Lurking Variables
More informationStatistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018
Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical
More informationMultiple Regression Analysis
Multiple Regression Analysis y = β 0 + β 1 x 1 + β 2 x 2 +... β k x k + u 2. Inference 0 Assumptions of the Classical Linear Model (CLM)! So far, we know: 1. The mean and variance of the OLS estimators
More informationHypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima
Applied Statistics Lecturer: Serena Arima Hypothesis testing for the linear model Under the Gauss-Markov assumptions and the normality of the error terms, we saw that β N(β, σ 2 (X X ) 1 ) and hence s
More informationLinear Regression 9/23/17. Simple linear regression. Advertising sales: Variance changes based on # of TVs. Advertising sales: Normal error?
Simple linear regression Linear Regression Nicole Beckage y " = β % + β ' x " + ε so y* " = β+ % + β+ ' x " Method to assess and evaluate the correlation between two (continuous) variables. The slope of
More informationReview of Multiple Regression
Ronald H. Heck 1 Let s begin with a little review of multiple regression this week. Linear models [e.g., correlation, t-tests, analysis of variance (ANOVA), multiple regression, path analysis, multivariate
More informationECON3150/4150 Spring 2015
ECON3150/4150 Spring 2015 Lecture 3&4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo January 29, 2015 1 / 67 Chapter 4 in S&W Section 17.1 in S&W (extended OLS assumptions) 2
More informationExploratory Factor Analysis and Principal Component Analysis
Exploratory Factor Analysis and Principal Component Analysis Today s Topics: What are EFA and PCA for? Planning a factor analytic study Analysis steps: Extraction methods How many factors Rotation and
More informationMath 423/533: The Main Theoretical Topics
Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)
More informationA Re-Introduction to General Linear Models
A Re-Introduction to General Linear Models Today s Class: Big picture overview Why we are using restricted maximum likelihood within MIXED instead of least squares within GLM Linear model interpretation
More informationMultiple Linear Regression
Andrew Lonardelli December 20, 2013 Multiple Linear Regression 1 Table Of Contents Introduction: p.3 Multiple Linear Regression Model: p.3 Least Squares Estimation of the Parameters: p.4-5 The matrix approach
More informationRewrap ECON November 18, () Rewrap ECON 4135 November 18, / 35
Rewrap ECON 4135 November 18, 2011 () Rewrap ECON 4135 November 18, 2011 1 / 35 What should you now know? 1 What is econometrics? 2 Fundamental regression analysis 1 Bivariate regression 2 Multivariate
More information