Outline. 1 Association in tables. 2 Logistic regression. 3 Extensions to multinomial, ordinal regression. sociology. 4 Multinomial logistic regression

Size: px

Start display at page:

Download "Outline. 1 Association in tables. 2 Logistic regression. 3 Extensions to multinomial, ordinal regression. sociology. 4 Multinomial logistic regression"

Virgil Hart
5 years ago
Views:

1 UL Winter School Quantitative Stream Unit B2: Categorical Data Analysis Brendan Halpin Department of Socioy University of Limerick January 15 16, Overview Overview This 3-day section focuses on categorical data Analysis of data in tables, association between categorical variables : regression with a binary variable as dependent Extensions of istic regression multinomial, ordinal. Draws heavily on Chs 8 (part) and 15 of Agresti & Finlay 3 4 Association Tables display association between categorical variables Made evident by patterns of percentages Tested by χ 2 test How do we characterise association? Is there association? What form does it take? How strong is it? 5 6 Is there association? What form does it take? This is what the χ 2 test determines evidence of association Does not characterise nature or size! Depends on N Other tests exist, such as Fisher s exact test Examine percentages Compare observed and expected: residuals Standardised residuals: behave like z, i.e., should lie in range 2 : +2 about 95% of time, if independence is true z = O E E(1 row proportion)(1 col proportion) O E = E(1 R T )(1 C T ) 7 8

2 How strong is it? Ordinal variables Many possible measures of association Difference in proportions Ratio of proportions or relative rate Ratio of odds or odds ratio (see Ordinal variables may have more structured association Simpler pattern, anaous to correlation X high, Y high; X low, Y low 9 10 Characterising ordinal association Higher order tables Focus on concordant/discordant pairs Pairs of cases which differ on both variables Concordant: case that is higher on one variable also higher on other Discordant: higher on one, lower on the other Gamma, ˆγ = C D C+D Values range 1 γ +1 like correlation in interpretation Has asymptotic standard error t-test possible We can consider association in higher-order tables, e.g., 3-way Is the association between A and B the same for different values of C? Does the association between A and B dissapear if we control for C? Simpson s paradox etc. Scouting 1/3 Scouting example (ch 10): negative association between scouting and delinquency Control for family characteristics (church attendance) and it disappears See also death penalty example: note pattern of odds ratios Cochran-Mantel-Haenszel test: 2 2 k table H 0 : within each of k 2 2 panels, OR = 1 delinq scout 0 1 Total Total Scouting 2/3 Scouting 3/ church and delinq scout church scout Total Total

3 Loglinear modelling More complex questions and larger tables can be handled by linear modelling Treats all variables as dependent variables Can test null hypothesis of independence, as well as specified patterns of interaction Linear Probability Model OLS regression requires interval dependent variable Binary or yes/no dependent variables are not suitable Nor are rates, e.g., n successes out of m trials Errors are distinctly not normal While predicted value can be read as a probability, can depart from 0:1 range Particular difficulties with multiple explanatory variables. OLS gives the linear probability model in this case: Pr(Y = 1) = a + bx data is 0/1, prediction is probability Assumptions violated, but if predicted probabilities in range , not too bad See credit card example: becomes unrealistic only at very low or high income Logistic transformation Probability is bounded [0 : 1] OLS predicted value is unbounded How to transform probability to : range? Odds: p 1 p Log of odds: range is 0 : p 1 p has range : uses this as the dependent variable: ( ) Pr(Y = 1) = a + bx 1 Pr(Y = 1) Alternatively: Pr(Y = 1) 1 Pr(Y = 1) = ea+bx Or: Pr(Y = 1) = ea+bx 1 + e a+bx = e a bx Parameters The b parameter ) is the effect of a unit change in X on ( Pr(Y =1) 1 Pr(Y =1) This implies a multiplicative change of e b Pr(Y =1) in 1 Pr(Y =1), in the Odds Thus an odds ratio But the effect of b on P depends on the level of b See credit card example Death penalty example allows us to see the link between odds ratios and estimates In practice, inference is similar to OLS though based on a different ic For each explanatory variable, H 0 : β = 0 is the interesting null z = ˆβ SE is approximately normally distributed (large sample property) ( ) 2 More usually, the Wald test is used: ˆβ SE has a χ 2 distribution with one degree of freedom 23 24

4 Likelihood ratio tests Nested models The likelihood ratio test is thought more robust than the Wald test for smaller samples Where l 0 is the likelihood of the model without X j, and l 1 that with it, the quantity ( 2 l ) 0 = 2 ( l 0 l 1 ) l 1 is χ 2 distributed with one degree of freedom ( More generally, 2 lo l 1 ) tests nested models: where model 1 contains all the variables in model 0, plus m extra ones, it tests the null that all the extra $β$s are zero (χ 2 with m df) If we compare a model against the null model (no explanatory variables, it tests H 0 : β 1 = β 2 =... = β k = 0 Strong anay with F -test in OLS Maximum likelihood Maximum likelihood estimation Maximum likelihood Iterative search What is this likelihood? Unlike OLS, istic regression (and many, many other models) are extimated by maximum likelihood estimation In general this works by choosing values for the parameter estimates which maximise the probability (likelihood) of observing the actual data OLS can be ML estimated, and yields exactly the same results Sometimes the values can be chosen analytically A likelihood function is written, defining the probability of observing the actual data given parameter estimates Differential calculus derives the values of the parameters that maximise the likelihood, for a given data set Often, such closed form solutions are not possible, and the values for the parameters are chosen by a systematic computerised search (multiple iterations) Extremely flexible, allows estimation of a vast range of complex models within a single framework Maximum likelihood Likelihood as a quantity Tabular data Tabular data Either way, a given model yields a specific maximum likelihood for a give data set This is a probability, henced bounded [$0:1$] Reported as -likelihood, hence bounded [$- :0$] Thus is usually a large negative number Where an iterative solution is used, likelihood at each stage is usually reported normally getting nearer 0 at each step If all the explanatory variables are categorical (or have few fixed values) your data set can be represented as a table If we think of it as a table where each cell contains n yeses and m n noes (n successes out of m trials) we can fit grouped istic regression n successes out of m trials implies a binomial distribution of degree m n m n = α + βx The parameter estimates will be exactly the same as if the data were treated individually Tabular data Tabular data and goodness of fit Goodness of fit and accuracy of classification Fit with individual data But unlike with individual data, we can calculate goodness of fit, by relating observed successes to predicted in each cell If these are close we cannot reject the null hypothesis that the model is incorrect (i.e., you want a high p-value) Where l i is the likelihood of the current model, and l s is the likelihood of the saturated model the test statistic is ( 2 l ) i l s The saturated model predicts perfectly and has as many parameters as there are settings (cells in the table) The test has df of number of settings less number of parameters estimated, and is χ 2 distributed Where the number of settings (combinations of values of explanatory variables) is large, this approach to fit is not feasible Cannot be used with continuous covariates Hosmer-Lemeshow statistic attempts to create an anay Divide sample into deciles of predicted probability Calculate a fit measure based on observed and predicted numbers in the ten groups Simulation shows this is χ 2 distributed with 2 df Not a perfect solution, sensitive to how the cuts are made Pseudo-R 2 measures exist, but none approaches the clean interpretation as in OLS See general/psuedo_rsquareds.htm 31 32

5 Goodness of fit and accuracy of classification Predicting outcomes Goodness of fit and accuracy of classification Some problems Another way of assessing the adequacy of a it model is its accuracy of classification: True yes True no Predicted yes a c Predicted no b d Proportion correctly classified: a a+d a+b+c+d d c+d b+d Sensitivity: a+b ; Specificity: b False positive: ; False negative: c a+c Stata: estat class Zero cells in tables can cause problems: no yeses or no noes for particular settings Not automatically a problem but can give rise to attempts to estimate a parameter as or + If this happens, you will see a large parameter estimate and a huge standard error In individual data, sometimes certain combinations of variables have only successes or only failures In Stata, these cases are dropped from estimation you need to be aware of this as it changes the interpretation (you may wish to drop one of the offending variables instead) Extensions to multinomial, ordinal regression Multinomial istic regression Multinomial istic regression baseline category extension of binary What if we have multiple possible outcomes, not just two? is binary: yes/no Many interesting dependent variables have multiple categories voting intention by party first destination after second-level education housing tenure type We can use binary istic by recoding into two categories dropping all but two categories But that would lose information Multinomial istic regression baseline category extension of binary Multinomial istic regression Another idea: Pick one of the J categories as baseline For each of J 1 other categories, fit binary models contrasting that category with baseline Multinomial istic effectively does that, fitting J 1 models simultaneously P(Y = j) P(Y = J) = α j + β j X, j = 1,..., c 1 Which category is baseline is not critically important, but better for interpretation if it is reasonably large and coherent (i.e. "Other" is a poor choice) Multinomial istic regression baseline category extension of binary Predicting p from formula Multinomial istic regression Interpreting example, inference Example π j = α j + β j X π J π j αj +βj X = e π J αj +βj X π j = π J e J 1 J 1 π J = 1 π k = 1 π J π J = k=1 k=1 e αk+βkx J 1 k=1 eαk+βkx = 1 J k=1 eαk+βkx π j = αj +βj e X J k=1 eαk+βkx Let s attempt to predict housing tenure Owner occupier Local authority renter Private renter using age and employment status Employed Unemployed Not in labour force mit ten3 age i.eun 39 40

6 Multinomial istic regression Interpreting example, inference Stata output Multinomial istic regression Interpreting example, inference Interpretation Multinomial istic regression Number of obs = LR chi2(6) = Prob > chi2 = Log likelihood = Pseudo R2 = ten3 Coef. Std. Err. z P>z [95% Conf. Interval] 1 (base outcome) 2 age eun _cons age eun _cons Stata chooses category 1 (owner) as baseline Each panel is similar in interpretation to a binary regression on that category versus baseline Effects are on the of the odds of being in category j versus the baseline Multinomial istic regression Interpreting example, inference Ordinal it At one level inference is the same: Wald test for H o : β k = 0 LR test between nested models However, each variable has J 1 parameters Better to consider the LR test for dropping the variable across all contrasts: H 0 : j : β j k = 0 Thus retain a variable even for contrasts where it is insignificant as long as it has an effect overall Which category is baseline affects the parameter estimates but not the fit (-likelihood, predicted values, LR test on variables) Ordinal it Predicting ordinal outcomes Ordinal it Stereotype it Stereotype it While mit is attractive for multi-category outcomes, it is imparsimonious For nominal variables this is necessary, but for ordinal variables there should be a better way We consider three useful models Stereotype it it Continuation ratio or sequential it Each approaches the problem is a different way If outcome is ordinal we should see a pattern in the parameter estimates:. mit educ c.age i.sex if age>30 [...] Multinomial istic regression Number of obs = LR chi2(4) = Prob > chi2 = Log likelihood = Pseudo R2 = educ Coef. Std. Err. z P>z [95% Conf. Interval] Hi age sex _cons Med age sex _cons Lo (base outcome) Ordinal it Stereotype it Ordered parameter estimates Ordinal it Stereotype it Scale factor Low education is the baseline The effect of age: for high vs low for medium vs low 0.000, implicitly for low vs low Sex: , and Stereotype it fits a scale factor φ to the parameter estimates to capture this pattern Compare mit: P(Y = j) P(Y = J) = α j + β 1j X 1 + β 2j X, j = 1,..., J 1 with sit P(Y = j) P(Y = J) = α j + φ j β 1 X 1 + φ j β 2 X 2, j = 1,..., J 1 φ is zero for the baseline category, and 1 for the maximum It won t necessarily rank your categories in the right order: sometimes the effects of other variables do not coincide with how you see the ordinality 47 48

7 Ordinal it Stereotype it Sit example Ordinal it Stereotype it Interpreting φ. sit educ age i.sex if age>30 [...] Stereotype istic regression Number of obs = Wald chi2(2) = Log likelihood = Prob > chi2 = ( 1) [phi1_1]_cons = 1 educ Coef. Std. Err. z P>z [95% Conf. Interval] age sex /phi1_1 1 (constrained) /phi1_ /phi1_3 0 (base outcome) /theta /theta /theta3 0 (base outcome) (educ=lo is the base outcome) With low education as the baseline, we find φ estimates thus: High 1 Medium Low 0 That is, averaging across the variables, the effect of medium vs low is times that of high vs low The /theta terms are the α j s Ordinal it Stereotype it Surprises from sit Ordinal it Stereotype it Recover by including non-linear age sit is not guaranteed to respect the order Stereotype istic regression Number of obs = Wald chi2(2) = Log likelihood = Prob > chi2 = ( 1) [phi1_1]_cons = 1 educ Coef. Std. Err. z P>z [95% Conf. Interval] age sex /phi1_1 1 (constrained) /phi1_ /phi1_3 0 (base outcome) /theta /theta /theta3 0 (base outcome) (educ=lo is the base outcome) age has a strongly non-linear effect and changes the order of φ Stereotype istic regression Number of obs = Wald chi2(3) = Log likelihood = Prob > chi2 = ( 1) [phi1_1]_cons = 1 educ Coef. Std. Err. z P>z [95% Conf. Interval] age c.age#c.age sex /phi1_1 1 (constrained) /phi1_ /phi1_3 0 (base outcome) /theta /theta /theta3 0 (base outcome) (educ=lo is the base outcome) Ordinal it Stereotype it Stereotype it Ordinal it The proportional odds model Stereotype it treats ordinality as ordinality in terms of the explanatory variables There can be therefore disagreements between variables about the pattern of ordinality It can be extended to more dimensions, which makes sense for categorical variables whose categories can be thought of as arrayed across more than one dimension See Long and Freese, Ch 6.8 The most commonly used ordinal istic model has another ic It assumes the ordinal variable is based on an unobserved latent variable Unobserved cutpoints divide the latent variable into the groups indexed by the observed ordinal variable The model estimates the effects on the of the odds of being higher rather than lower across the cutpoints Ordinal it The model Ordinal it J 1 contrasts again, but different For j = 1 to J 1, P(Y > j) P(Y <= j) = α j + βx Only one β per variable, whose interpretation is the effect on the odds of being higher rather than lower One α per contrast, taking account of the fact that there are different proportions in each one But rather than compare categories against a baseline it splits into high and low, with all the data involved each time 55 56

Ordinal it An example Ordinal it Ordered istic: Stata output Using data from the BHPS, we predict the probability of each of 5 ordered responses to the assertion "homosexual relationships are wrong"

14 Prob > chi2 = 0.0000 Log likelihood = -17802.088 Pseudo R2 = 0.0593 ropfamr Coef. Std. Err. z P>z [95% Conf. Interval] 2.rsex.8339045.033062 25.22 0.000.7691041.8987048 rage -.0371618.0009172-40.

8 Ordinal it An example Ordinal it Ordered istic: Stata output Using data from the BHPS, we predict the probability of each of 5 ordered responses to the assertion "homosexual relationships are wrong" Answers from 1: strongly agree, to 5: strongly disagree Sex and age as predictors descriptively women and younger people are more likely to disagree (i.e., have high values) Ordered istic regression Number of obs = LR chi2(2) = Prob > chi2 = Log likelihood = Pseudo R2 = ropfamr Coef. Std. Err. z P>z [95% Conf. Interval] 2.rsex rage /cut /cut /cut /cut Ordinal it Interpretation Ordinal it Linear predictor The betas are straightforward: The effect for women is The OR is e.8339 or Women s odds of being on the "approve" rather than the "disapprove" side of each contrast are times as big as men s Each year of age reduced the -odds by (OR 0.964). The cutpoints are odd: Stata sets up the model in terms of cutpoints in the latent variable, so they are actually α j Thus the α + βx or linear predictor for the contrast between strongly agree (1) and the rest is (2-5 versus 1) female age Between strongly disagree (5) and the rest (1-4 versus 5) and so on female age Ordinal it Predicted odds Ordinal it Predicted odds per contrast The predicted -odds lines are straight and parallel The highest relates to the 1-4 vs 5 contrast Parallel lines means the effect of a variable is the same across all contrasts Exponentiating, this means that the multiplicative effect of a variable is the same on all contrasts: hence "proportional odds" This is a key assumption Ordinal it Predicted probabilities relative to contrasts Ordinal it Predicted probabilities relative to contrasts We predict the probabilities of being above a particular contrast in the standard way Since age has a negative effect, downward sloping sigmoid curves Sigmoid curves are also parallel (same shape, shifted left-right) We get probabilities for each of the five states by subtraction 63 64

9 Ordinal it Ordinal it Testing proportional odds The key elements of inference are standard: Wald tests and LR tests Since there is only one parameter per variable it is more straightforward than MNL However, the key assumption of proportional odds (that there is only one parameter per variable) is often wrong. The effect of a variable on one contrast may differ from another Long and Freese s SPost Stata add-on contains a test for this It is possible to fit each contrast as a binary it The brant command does this, and tests that the parameter estimates are the same across the contrast It needs to use Stata s old-fashioned xi: prefix to handle categorical variables: xi: oit ropfamr i.rsex rage brant, detail Ordinal it Brant test output Ordinal it What to do?. brant, detail Estimated coefficients from j-1 binary regressions y>1 y>2 y>3 y>4 _Irsex_ rage _cons Brant Test of Parallel Regression Assumption Variable chi2 p>chi2 df All _Irsex_ rage A significant test statistic provides evidence that the parallel regression assumption has been violated. In this case the assumption is violated for both variables, but looking at the individual estimates, the differences are not big It s a big data set (14k cases) so it s easy to find departures from assumptions However, the departures can be meaningful. In this case it is worth fitting the "Generalised Ordinal Logit" model Ordinal it Generalised Ordinal Logit Ordinal it Sequential it Sequential it This extends the proportional odds model in this fashion P(Y > j) P(Y <= j) = α j + β j x That is, each variable has a per-contrast parameter At the most imparsimonious this is like a reparameterisation of the MNL in ordinal terms However, can constrain βs to be constant for some variables Get something intermediate, with violations of PO accommodated, but the parsimony of a single parameter where that is acceptable Download Richard William s goit2 to fit this model: ssc install goit2 Different ways of looking at ordinality suit different ordinal regression formations categories arrayed in one (or more) dimension(s): sit categories derived by dividing an unobserved continuum: oit etc categories that represent successive stages: the continuation-ratio model Where you get to higher stages by passing through lower ones, in which you could also stay Educational qualification: you can only progress to the next stage if you have completed all the previous ones Promotion: you can only get to a higher grade by passing through the lower grades Ordinal it Sequential it "Continuation ratio" model Ordinal it Sequential it J 1 contrasts again, again different Here the question is, given you reached level j, what is your chance of going further: P(Y > j) P(Y = j) = α + βx j For each level, the sample is anyone in level j or higher, and the outcome is being in level j + 1 or higher That is, for each contrast except the lowest, you drop the cases that didn t make it that far But rather than splitting high and low, with all the data involved each time, it drops cases below the baseline 71 72

Ordinal it Sequential it Fitting CR Ordinal it Sequential it seqit This model implies one equation for each contrast Can be fitted by hand by defining outcome variable and subsample for each contrast

10 Ordinal it Sequential it Fitting CR Ordinal it Sequential it seqit This model implies one equation for each contrast Can be fitted by hand by defining outcome variable and subsample for each contrast (ed has 4 values): Maarten Buis s seqit does it more or less automatically: gen con1 = ed>1 gen con2 = ed>2 replace con2 =. if ed<=1 gen con3 = ed>3 replace con3 =. if ed<=2 it con1 odoby i.osex it con2 odoby i.osex it con3 odoby i.osex seqit ed odoby i.osex, tree(1 : 2 3 4, 2 : 3 4, 3 : 4 ) you need to specify the contrasts You can impose constraints to make parameters equal across contrasts Alternative specific options Cit and the basic idea If we have information on the outcomes (e.g., how expensive they are) but not on the individuals, we can use cit Alternative specific options ascit and practical examples Modelling count data: Poisson distribution If we have information on the outcomes and also on the individuals, we can use ascit, alternative-specific it See examples in Long and Freese Count data raises other problems: Non-negative, integer, open-ended Neither a rate nor suitable for a continuous error distribution If the mean count is high (e.g., counts of newspaper readership) the normal approximation is good If the events are independent, with expected value (mean) µ, their distribution is Poisson: P(X = k) = µk e µ k! The variance is equal to the mean: also µ Poisson Density Poisson regression In the GLM framework we can use a regression to predict the mean value, and use the Poisson distribution to account for variation around the predicted value: Poisson regression Log link: effects on the count are multiplicative Poisson distribution: variation around the predicted count (not ) is Poisson In Stata, either poisson count xvar1 xvar2 or glm count xvar1 xvar2, link() family(poisson) 79 80

11 Over dispersion Negative binomial regression If the variance is higher than predicted, the SEs are not correct This may be due to omitted variables Or to events which are not independent of each other Adding relevant variables can help Negative binomial regression is a form of poisson regression that takes an extra parameter to account for overdispersion. If there is no overdispersion, poisson is more efficient Zero inflated, zero-truncated Another issue with count data is the special role of zero Sometimes we have zero-inflated data, where there is a subset of the population which is always zero and a subset which varies (and may be zero). This means that there are more zeros than poisson would predict, which will bias its estimates. Zero-inflated poisson (zip) can take account of this, by modelling the process that leads to the all-zeroes state Zero-inflated negative binomial (zinb) is a version that also copes with over dispersion Zero truncation also arises, for example in data collection schemes that only see cases with at least one event. ztp deals with this. 83

Outline. 1 Introduction. 2 Logistic regression. 3 Margins and graphical representations of effects. 4 Multinomial logistic regression.

Outline. 1 Introduction. 2 Logistic regression. 3 Margins and graphical representations of effects. 4 Multinomial logistic regression. Outline Brendan Halpin, Socioical Research Methods Cluster, Dept of Socioy, University of Limerick June 20-21 2016 1 2 3 4 Multinomial istic regression 5 6 1 2 What this is About these notes What this