Homework Assignment 5

Biostatistics 201B Homework 5
March 5th, 2018
Due Date: Friday, March 16th, 2018

Note: To save you time and headache I have provided the graphics needed for Problem 8, parts (a) and (c). I have included the commands in the instruction file for those who are interested in being able to replicate them. I have also marked certain parts of the last problem as optional for purposes of assignment credit to reduce the burden for the last week of class. However, all the parts are relevant for the final exam, so be sure you go through the solutions even if you don't turn them in.

Note: There are 6 warm-up problems and 2 turn-in problems on this assignment, covering Poisson and negative binomial regression and mixed models. Next week I will post some additional practice problems on survival analysis, but these will NOT need to be turned in. Like the regular warm-up problems they will serve as examples and provide practice on these topics for the final. You must submit the 2 turn-in problems together with the corresponding STATA/SAS printouts to receive full credit. The assignment is officially due in class on the last day of lecture, but you may turn it in to the TA folder (in CHS) any time before 5:00 with no penalty. I won't take assignments after that, though, as I will need to post the solutions so people can study for the final exam.

Note: Output from any calculations done in STATA or SAS MUST be included with your assignment for full credit. If I do not specify which way to do a problem, you may choose whether to do it by hand or on the computer. All the STATA/SAS commands needed to complete this homework are given at the end of the assignment and will be reviewed in the lab. You do not need to hand in a separate lab report; simply turn in the relevant output as part of your homework.

Note: You are encouraged to work with fellow students, as necessary, on these problems. However, each of you MUST write up your solution ON YOUR OWN and IN YOUR OWN WORDS. The style of your write-up is as important as getting the correct answer. Your solutions should be easy to follow and contain English explanations of what you are doing and why. You do not have to write an essay for each problem, but you should give enough comments so that someone who has not seen the problem statement can understand your work. You do not have to type your assignments. However, if they are too sloppy to read, too hard to understand, or give just numbers with no comments, you WILL lose points.

Problems labeled (ACM) are adapted from the text Computer-Aided Multivariate Analysis by Afifi, Clark and May. Problems labeled (AA) are adapted from An Introduction to Categorical Data Analysis by Alan Agresti. Problems labeled (ATS) use data sets from UCLA's Academic Technology Services web site.

Warm-up Problems

(1) Poisson and Negative Binomial Regression Basics:

(a) Explain what the distributions, link functions and systematic components are for Poisson and Negative Binomial GLMs.

(b) Give examples of circumstances where you might want to use each of these models.

(c) Explain the concepts of overdispersion and zero-inflation, give examples of how each of these situations might arise in a Poisson or Negative Binomial model, and explain what you could do to adjust for them.

(2) Munching (Computer) Chips (AA 4.6 and 4.8): An experimenter analyzes imperfection rates for two processes used to fabricate silicon wafers for computer chips. Each treatment is applied to 10 chips and the number of imperfections is recorded. The thickness of the silicon coating is also reported (0 = thin and 1 = thick). Use the data to answer the following questions.

(a) Fit a simple Poisson regression using treatment as the predictor. Does there appear to be a significant difference between the two processes? Show two ways you could perform this test.

(b) Give a brief interpretation of the regression coefficients and their confidence intervals, both on the log scale and on the mean scale.

(c) Now add the thickness of the silicon coating as a predictor and rerun the model. Is the new model a significant improvement over the previous model? Describe as carefully as you can the effect of the coating thickness on the number of imperfections. What are you assuming about the joint effect of process and coating thickness by fitting the model this way?

(3) More Sports Fanatics (AA 4.14): British fans are mad (literally!) about their soccer, even more than New Zealanders are about rugby. This data set gives the number of fans in attendance (in thousands) and the total number of arrests in the season for soccer teams in the second division of the British football league.

(a) Fit a Poisson regression model for number of arrests using attendance as an offset variable. Explain (i) why it is important to use an offset variable here and (ii) what the interpretation of the intercept from the model is. (Note that there are no predictors here other than the offset!)

(b) Plot arrests vs attendance and overlay the prediction equation. Obtain residuals from the model and use them to identify teams that had much larger or much smaller than expected numbers of arrests.
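For reference in Problems 1 and 3, and using the same notation as the offset discussion in the STATA command section below, the Poisson GLM and its offset version can be written as

Y ~ Poisson(µ), with log(µ) = β0 + β1 X1 + ... + βp Xp

log(µ/t) = β0, or equivalently log(µ) = log(t) + β0

where in Problem 3 the "exposure" t is the attendance and there are no predictors other than the intercept, so the model says the expected number of arrests is a constant multiple of the number of fans attending.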

(4) Camping Data (ATS): This problem goes in depth into the fishing example I used in class. The outcome variable is the number of fish caught (numfish) by 250 groups of people who went to a wildlife park. The predictors are the number of people in the group (persons), the number of children in the group (children) and whether or not they were camping (camper = 1 for yes and 0 for no). Use the data to answer the following questions.

(a) Fit a Poisson regression for the number of fish caught with number of people, number of children and whether or not the party was camping as predictors.

(b) Give careful interpretations of the coefficients of persons, children and camper status and their confidence intervals (i) on the log scale and (ii) on the mean scale. Which of these variables appear to be significant predictors?

(c) Plot the Pearson residuals versus the fitted values for this model and also obtain the deviance and Pearson goodness of fit tests. Do you think it is reasonable to use these measures here? Are there any obvious outliers? Does the model seem to be well calibrated? If not, can you identify where the misfit is occurring?

(d) Now we evaluate the issue of over-dispersion. Suggest some plausible reasons why there might be over-dispersion in this data set and then check for it in the following ways:

(1) Using basic descriptive statistics, obtain the mean and variance of the numfish variable. Do this for various subgroups of the data set (e.g. camping or not, with children or not, etc.). What do these statistics suggest about over-dispersion?

(2) Calculate the approximate dispersion factor based on the Pearson chi-squared goodness of fit statistic.

(3) Explain what your plot of residuals from part (c) suggests about over-dispersion.

(e) Rerun the model using a negative binomial regression. Does this model confirm your conclusions about over-dispersion from part (d)? Explain briefly.

(f) Give brief interpretations of the coefficients from the negative binomial regression. Are your conclusions (as to magnitude, direction and significance) about the effects of the predictors any different from those in part (b)?

In the next part of the problem we consider the issue of zero inflation.

(g) Explain intuitively why we might expect zero-inflation in the context of this problem. Do you expect any of these predictors to be especially predictive of who would be a "certain 0"?

(h) Continuing with your idea of descriptive statistics from part (d), calculate the expected number of 0's for various subgroups of the data set assuming a Poisson distribution and then examine the actual number of 0's. Do we seem to have a zero-inflation problem? (Note: You may find it helpful to remember that for a Poisson random variable the probability of k events is P(Y = k) = µ^k e^(-µ)/k!.)

(i) Fit a zero-inflated Poisson model to the data using persons, children and camper as the predictors for the Poisson component and experimenting with various variables for the zero-inflation component of the model. Does this model seem to be a significant improvement over a standard Poisson regression? Explain briefly.

(j) Give brief interpretations of the coefficients for each component of the zero-inflated Poisson model. Which variables seem to be significant in each part of the model? Explain as carefully as you can what this model suggests about how these variables affect numbers of fish caught.

(k) Now fit a zero-inflated negative binomial model using the same model set-up as in part (j). Is this model superior to the zero-inflated Poisson model of part (j)? Is it superior to the plain negative binomial model of part (e)? Explain briefly in each instance. What does that tell you about the issues of over-dispersion and zero-inflation? Do any of your conceptual conclusions from part (j) change with this model?
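For this problem, one possible way to string together the STATA commands described at the end of the assignment is sketched below. This is only an outline: mufish and presids are variable names I have made up for the predicted counts and the Pearson residuals, and the approximate dispersion factor asked about in (d)(2) is just the Pearson chi-squared statistic divided by its degrees of freedom.

poisson numfish persons children camper
estat gof, pearson
predict mufish
gen presids = (numfish - mufish)/sqrt(mufish)
scatter presids mufish
nbreg numfish camper children persons
zip numfish camper persons children, inflate(camper persons children) vuong
zinb numfish camper persons children, inflate(camper persons children) vuong zip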

(5) Mixed Model Basics:

(a) Explain the conceptual differences between the repeated measures, random effects, and hierarchical model formulations of techniques for correlated data.

(b) Explain the difference between a fixed effect and a random effect and give examples of each.

(c) Describe several of the common covariance structures for correlated data and when you might want to use them.

(6) This Assignment Makes My Blood Pressure Rise: The blood pressure data I used as an example in class are included in the data set for this assignment. Here we will analyze them in more detail. The primary data are given in long form, where the variables are subject id (subj), blood pressure (bp), group (0 = placebo or 1 = medication), time (baseline or after 3 months of treatment) and a group by time interaction (groupbytime), with one row for each time point. I have also included four extra variables, bp0p, bp3p, bp0m, and bp3m, which give the blood pressure values at times 0 and 3 months separately for the placebo (p) and medication (m) groups. These will be useful for parts (a)-(c).

(a) Obtain means and standard deviations for each of the two treatment groups at each of the two time points. What do these values suggest intuitively about the results of the trial?

(b) Is there a difference in blood pressure between the two groups at baseline? What about at the end of treatment? Evaluate this formally by performing a simple test at each time point. What do your results suggest about treatment? Why might you not simply report the end of treatment analysis as your final result and not worry about the repeated measures?

(c) Is there evidence of change over time within either the placebo group or the treatment group? Carry out an appropriate simple test within each group to answer this question. Why is it not sufficient to simply note that there is a significant improvement in the treatment group over time?

(d) Now use STATA or SAS to fit a model with data from both groups at both timepoints, using the repeated measures formulation of the model and assuming a compound symmetric covariance structure. (In STATA, before you start, use the xtset command to tell STATA you are working with panel data (see the instructions below).) Use blood pressure as your outcome and group, time and a group by time interaction as predictors. Use the model to answer the following questions:

(i) Is the model significant overall? Write down the hypotheses and the corresponding p-value for an appropriate test.

(ii) Is there evidence of a significant group by time interaction? Explain as carefully as you can what this tells you about the effect of the treatment. Incorporate the direction and magnitude of the various regression coefficients into your explanation. Does it matter whether the coefficients of the individual group and time variables are significant? What are they telling you?

(iii) In class we discussed other possible covariance structures for correlated data, including the AR(1) model and the unstructured covariance model. Do we need either of these structures here? Explain briefly.

(e) Rerun the model from part (d) using a mixed model point of view. What random effect(s) do you need to include to obtain the same fit as in part (d)? Verify that you in fact get the same answer. In what circumstances would the random effects formulation have been better for these data?

(f) Was it necessary in this problem to use a more complex covariance structure rather than simply assuming the errors were independent? You should discuss this (i) conceptually, (ii) by obtaining the correlation between the two timepoints and (iii) by looking at an appropriate test on your model printout.

(g) Now try running the model again including both a random intercept and a random slope. Intuitively, do you think this will help much? How could you check whether this model is an improvement over the random intercept model?
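A minimal sketch of the STATA commands for parts (d), (e) and (g) of Problem 6, taken from the command section at the end of the assignment (the random-slope line uses the xtmixed syntax described there):

xtset subj time, delta(3)
xtreg bp group time groupbytime, pa corr(exchangeable)
xtreg bp group time groupbytime, mle
xtmixed bp group time groupbytime || subj: time, covariance(independent) mle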

Problems To Turn In

(7) I Wish I Could Play Hookey From 201B (ACM Problems and 12.26): This problem examines factors associated with adolescent students being absent from school without a valid excuse. For each of n = 252 students, the number of times absent in the last month was recorded (abbreviated nhookey for the number of times the student "played hookey") along with possible predictors including age, gender (1 = male and 0 = female) and the degree to which the student liked school (rated from 1 = liked it very much to 5 = disliked it very much). Our goal in this problem is to build a model for school absences.

(a) Fit the Poisson model with no predictors and explain as carefully as you can the meaning of the estimated intercept. (Note: you may find it helpful to exponentiate it!)

(b) Plot mean number of days absent as a function of (i) likeschool and (ii) age. Do there appear to be significant relationships between these variables and numbers of days absent? Do the relationships appear linear?

(c) Fit a Poisson regression for the number of days the adolescents in the sample were absent from school with gender, age and how much the student liked school as predictors. Obtain two versions: one in which likeschool is treated as continuous and one in which it is treated as categorical. Which version of the model do you think fits better and why? From here on out, for simplicity, treat the likeschool variable as continuous rather than categorical.

(d) Give careful interpretations of the coefficients of the age, gender and likeschool variables and their confidence intervals (i) on the log scale and (ii) on the mean scale. Which of these variables appear to be significant predictors?

(e) Now suppose that the number of absences for some of the students were measured over 1 month and some were measured over 3 months, as indicated by the variable hmonth. Refit the model for the number of days absent using this variable as the offset. How does this change your answers to part (d)? Note: For the rest of the problem revert to thinking of the number of absences as all corresponding to one-month intervals as in parts (a)-(d).

(f) In this part of the problem we evaluate the issue of over-dispersion several different ways. First suggest some plausible reasons why there might be over-dispersion in this data set and then check for it in the following ways:

(1) Using basic descriptive statistics, obtain the mean and variance of the nhookey variable. (i) What do these statistics suggest about over-dispersion and (ii) why might this assessment be too crude to evaluate the presence of over-dispersion in the model overall?

(2) Evaluate the goodness of fit of the model by calculating the Pearson chi-squared goodness of fit statistic and approximating the dispersion factor. Explain what each of these checks tells you.

(3) Obtain the Pearson residuals for the model and plot them as a function of the fitted values. What does this plot suggest about over-dispersion? Do there appear to be any outliers? Briefly justify your answer.

(g) Rerun the model using a negative binomial regression. Does this model confirm your conclusions about over-dispersion from part (f)? Explain briefly.

(h) Give brief interpretations of the coefficients from the negative binomial regression. Are your conclusions (as to magnitude, direction and significance) about the effects of the predictors any different from those in part (d)?

In the next part of the problem we consider the issue of zero inflation.

(i) Explain intuitively why we might expect zero-inflation in the context of this problem. Of our three predictors, which would you expect to be most associated with a "certain zero"? Explain briefly.

(j) Using the estimated rate of unexcused absences from your model in part (a), calculate the expected number of 0's we would see in a sample of 252 students if absences have a Poisson distribution. (Note: You may find it helpful to remember that for a Poisson random variable the probability of k events is P(Y = k) = µ^k e^(-µ)/k!; a worked version of this calculation is sketched after the problem.) How many 0's did we actually observe? What does this suggest about whether we have a zero-inflation issue? (Note: This is a fairly crude approximation because we haven't accounted for the covariates, but it is rather suggestive. You can of course take this approach further and break it down by gender, age bin and likeschool categories.)

(k) Fit a zero-inflated Poisson model to the data using age, gender and likeschool as the predictors for both the Poisson component and the zero-inflation component of the model. Does this model seem to be a significant improvement over a standard Poisson regression? Explain briefly.

(l) Give brief interpretations of the coefficients for each component of the zero-inflated Poisson model. Which variables seem to be significant in each part of the model? Explain as carefully as you can what this model suggests about how these variables affect school absences.

(m) Now fit a zero-inflated negative binomial model using the same model set-up as in part (k). Is this model superior to the zero-inflated Poisson model of part (k)? Is it superior to the plain negative binomial model of part (g)? Explain briefly in each instance. What does that tell you about the issues of over-dispersion and zero-inflation? Do any of your conceptual conclusions from part (l) change with this model?
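To make the zero-count calculation in part (j) concrete: under a Poisson model with estimated mean rate µ per student per month, P(Y = 0) = e^(-µ), so the expected number of zeros among n students is n × e^(-µ). For example, with a purely hypothetical value µ = 1.5 (not taken from the data; use the exponentiated intercept from your own part (a) fit), this would give 252 × e^(-1.5) ≈ 56 expected zeros, to be compared with the observed number of students with no absences.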

(8) Statistics Makes My Teeth Chatter: This problem is based on a dental research data set, courtesy of my colleague Dr. Rob Weiss. The outcome variable is the distance (in millimeters) from the center of the pituitary gland to the pterygomaxillary fissure (you can check out Wikipedia to get a picture of where this is in the skull in relation to the jaw). The idea is to see how this distance changes developmentally in children and in particular whether there are any differences between boys and girls. Developmental trajectories are one of the most common targets of longitudinal mixed models. This data set includes 11 girls and 16 boys, each measured every two years at ages 8, 10, 12 and 14. The data are included in both long form (dentalid, distance, sex, year and yearbysex, with one row for each time point) and in wide form (dentalid2, distance at each age, sex2; one row for each child) in the HW5 data set. Note that sex = 1 for girls and 0 for boys. I have used sex and year here instead of gender and age to differentiate the variables from those in Problem 7.

Note: Before you start, if you are using STATA, use the xtset command (details in the instruction section) to tell STATA you are going to be working with panel data with subjects indexed by the variable dentalid and time indexed by the variable year, which is given in two-year intervals. The command is

xtset dentalid year, delta(2)

If you want to switch to working with a different data set you would then need to use the command xtset, clear.

(a) A profile or spaghetti plot shows the trajectory for each subject (observed points connected by lines). I have provided such a plot in the accompanying graphics file, color-coded so you can distinguish girls from boys since we are interested in gender differences. Describe any overall patterns you see in the plot as well as any perceived differences between boys and girls. (Instructions for creating the plot are given in the command section but you do not need to reproduce it.)

(b) Calculate the correlations between the timepoints (i.e. distance at age 8 with distance at age 10, etc.) separately for boys and girls. (This will be easiest to do using the wide form of the data.) Do you see any interesting patterns? What does this suggest about whether we need to use a random effects/repeated measures model?

(c) I also obtained the mean distances at each age separately for boys and girls and used them to create a scatterplot of distance versus age as a function of group (see the graphics file). This is a convenient way of condensing the information from the profile plot in part (a) and is sometimes referred to as an empirical summary plot (instructions are at the end of the assignment for those who are interested). Does there appear to be an overall difference in distance between boys and girls? Does there appear to be a difference in slope (distance vs year) between the groups?

(d) Fit a model to these data using distance as the outcome and year as the predictor, using the mixed model formulation with a random intercept covariance structure. (Specify the MLE fitting procedure so that you can use likelihood ratio tests for some of the follow-up questions.) Is the model overall significant? Explain briefly what this tells you.

(e) Now rerun the model from part (d) adding sex and the yearbysex interaction as additional predictors. Is there any evidence of gender differences in the year (age) trajectories? Answer this question by performing a likelihood ratio test comparing the model from part (d) (with year only) to the new model which adds the two terms involving gender. Explain as carefully as you can what this tells you, incorporating the direction and magnitude of the various regression coefficients into your explanation.

(f) Using your model from part (e), obtain the estimated mean distance for each group at each age and plot the results. Compare this plot to the plot from part (c) and explain any differences that you see. Can you say that one plot is more "right" than the other?

(g) Is there evidence that you needed to include the random effects in your model from part (e) instead of just using ordinary least squares regression? Explain briefly by using values from your part (e) printout. Is this consistent with your answer to part (b)? Optional: You can also do this by rerunning the model from part (e) using an independence covariance structure (i.e. equivalent to ordinary least squares!) and performing a likelihood ratio test. How does the within subject variance compare to the between subject variance? (Note: STATA doesn't do an independent covariance structure in MLE random effects, so if you are using STATA you need to fit an OLS regression and then use the command estat ic to get the log likelihood and information criteria to compare to the mixed model.)

(h) Optional: Now rerun the model from part (e) including a random slope as well as a random intercept. Does this appear to improve the model? Explain briefly.

(i) Optional: There are mean structures more general than the group by continuous age set-up we have used above. Similarly, there are covariance structures more general than the random intercept/random slope specification we used in part (h). Describe (i) the most general mean structure and (ii) at least two alternate covariance structures and explain why they might or might not be appropriate here. You do not need to actually fit the models.

(j) Optional: Run the models you described in part (i). Do you think the more general mean structure represents a significant improvement over the mean structure from part (e)? (Use the random intercept/compound symmetric covariance structure for this comparison to keep things simple.) What do you think is the best covariance structure? (Use the mean structure from part (e) to keep things simple.) A good way to present your results is to tabulate the covariance structure, the number of parameters it involves, -2 times the log likelihood, and the AIC and BIC information criteria. You can use likelihood ratio chi-squared tests to compare each of your covariance models to (i) the independence model and (ii) the completely unstructured covariance model. Why can you not necessarily directly compare the other models using the likelihood ratio test?
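A minimal sketch of the STATA commands for parts (d), (e) and (h) of Problem 8, again taken from the command section at the end of the assignment (yearonly and full are labels I have made up for the stored fits, used here so the two gender terms in part (e) can be tested with lrtest):

xtset dentalid year, delta(2)
xtreg distance year, mle
estimates store yearonly
xtreg distance year sex yearbysex, mle
estimates store full
lrtest yearonly full
xtmixed distance year sex yearbysex || dentalid: year, covariance(independent) mle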

STATA and SAS Commands

For this assignment you need to be able to fit Poisson/negative binomial and repeated measures/mixed effects models. The necessary commands are given below. To save you time I have given pretty complete details for the commands for the turn-in problems.

Commands in STATA

(1) Poisson Regression: Poisson regression works pretty much the same way as other regression models. You use the poisson command followed by the names of your outcome and predictor variables. For example, for warm-up Problem 2, predicting the number of imperfections in computer chips as a function of manufacturing process, the command would be

poisson imperfections treatment

If you want to fit a Poisson model with an offset you need to be a little bit careful. STATA provides both an offset option and an exposure option. What I talked about as an offset in class (and what most people mean when they say offset) is what STATA calls exposure, namely the amount of time (or other units) over which events could take place. The important thing is to understand how STATA treats these two options. If you use offset then STATA includes the offset variable as a predictor in its raw units with a coefficient constrained to 1. This is not really what we usually want, since if we are thinking of the offset as the time or other per-unit component of the Poisson rate then our model is really log(µ/t) = Xβ or log(µ) = log(t) + Xβ, meaning we want the offset in the model on the log scale with a coefficient constrained to 1. Thus to use offset we have to first take the log of our units variable. STATA's exposure option includes the variable in the model on the log scale with the coefficient constrained to 1, which is usually what we want. If we use this option we don't have to transform the offset variable first. For example, in Warm-up Problem 3, where we are looking at arrests at British soccer matches, our offset variable is attendance; we would expect the number of arrests to be proportional to the size of the crowd. You could also think of this as exposure since each person there had a chance to get arrested. In that problem we are fitting a model with no predictors except the offset, so our two versions of the statements become

gen logatt = log(attendance)
poisson arrests, offset(logatt)
poisson arrests, exposure(attendance)

Many of the follow-up commands for the Poisson model are like those for other GLMs. For example, you can use predict varname to obtain the predicted counts for the values in your sample and store them in the new variable varname. This is the default for the predict command with the Poisson model. You can, of course, use other options to get other estimated quantities, but the poisson model doesn't have as many options as logit or probit. In particular, it doesn't automatically create the various residuals. If you want to get those automatically you can use the glm command instead of the poisson command (see below). Of course it is not hard to create the residuals manually. For example, suppose I want to create the Pearson residuals for a Poisson regression model. The Pearson residuals are the observed minus the predicted values divided by the standard deviation of the prediction. For a Poisson random variable the mean and the variance are equal, so the estimated standard deviation for the predicted mean should just be the square root of the predicted mean. For instance, for warm-up problem 2, suppose we have saved the predicted number of imperfections in the variable predimp. Then the residuals would be created as

gen impresids = (imperfections - predimp)/sqrt(predimp)

Various other post-estimation commands also work with poisson. For instance, you can store the results using estimates store mymodelname and then compare two nested models using the lrtest command to get a likelihood ratio chi-squared test. You can also use estat gof to obtain the deviance goodness of fit test and estat gof, pearson to get the Pearson chi-squared goodness of fit test.
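For instance, the comparison of the treatment-only model to the model that adds coating thickness in warm-up Problem 2(c) could be run roughly as follows (a sketch only; I am assuming the coating variable is named thickness in the data set, and m1 and m2 are made-up labels for the stored fits):

poisson imperfections treatment
estimates store m1
poisson imperfections treatment thickness
estimates store m2
lrtest m1 m2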

(2) Negative Binomial Regression: To fit a standard negative binomial regression in STATA you use the nbreg command followed by your outcome and predictor variables. For instance, for the fishing data of warm-up Problem 4 we would type

nbreg numfish camper children persons

The other options for nbreg are basically the same as those for Poisson regression so I won't repeat them here. The only other thing of importance to note is that STATA automatically generates the test comparing the negative binomial regression to a Poisson regression. Specifically, it gives a likelihood ratio chi-squared test for the over-dispersion parameter it calls alpha at the bottom of the printout. A small p-value means there was over-dispersion and the negative binomial model fits better.

(3) Zero-inflated Models: Both Poisson regression and negative binomial regression can be fit with a zero-inflation option. The commands zip and zinb replace the commands poisson and nbreg respectively. You also have to use an inflate option to specify which variables you want to use to predict the probability of certain zeros. These can be the same as or different from the variables used for the count part of the model. If you want to test whether there was, in fact, zero-inflation, you use the vuong option, which performs (surprise!) Vuong's test. If you also include the zip option in the negative binomial regression it will give you both the over-dispersion test and the zero-inflation test. The commands, applied to the fishing data of Warm-up Problem 4, are

zip numfish camper persons children, inflate(camper persons children) vuong
zinb numfish camper persons children, inflate(camper persons children) vuong zip

(4) The General GLM Command (Optional): In addition to having specific commands for logistic, poisson, and other generalized linear models, STATA has a glm command in which you can specify the distribution and link function as part of the regression statement. The basic syntax, using Poisson regression with a log link as an example, is

glm Y X1 X2 X3, family(poisson) link(log)

Note that by default STATA uses the link that is canonical or natural for the specified distribution. For Poisson regression this is the log link, so you don't in fact have to include the link statement. The advantage to using glm for Poisson regression is that it has more post-estimation and other options, including calculation of the residuals (don't ask me why they don't include these with the poisson command!). You can look at the help file on glm for additional details.
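For example, the fishing model of warm-up Problem 4 could be fit with glm and the residuals requested directly rather than built by hand (a sketch only; glmfit and glmpresid are variable names I have made up):

glm numfish persons children camper, family(poisson)
predict glmfit, mu
predict glmpresid, pearson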

(5) Review of Useful Old Commands:

(a) Data Summaries: The summarize command: To obtain means, standard deviations and the like for a given variable you type summarize followed by the variable name. You can obtain this summary information for subgroups of the data set using the bysort option, specifying as many grouping variables as you like. For example, for warm-up problem 6, to get the means and standard deviations of the blood pressure scores for each of the treatment groups at each timepoint you would type

bysort group time: summarize bp

Similarly, for turn-in problem 8(c), to get the mean distances for boys and girls at the 4 ages you would type

bysort sex year: summarize distance

To get the mean and standard deviation of the number of fish caught in Warm-up Problem 4, subsetted by the number of children in the party, I would type

bysort children: summarize numfish

The same command, using table instead of summarize, would allow you to see the frequency tables of numbers of fish caught by subgroup.

(b) t-tests: To perform both two-sample t-tests and paired t-tests in STATA you use the ttest command. If the data for the two groups are stored in two separate columns the default is a paired t-test. For instance, for warm-up problem 6(c), comparing blood pressure before and after treatment for the placebo group, you would type:

ttest bp0p = bp3p

If you want a two-sample t-test and the data are stored in a single column with the group label in a separate column you use the by() option. Note that you can also use the if option to restrict the test to a certain subset of the points. For example, in warm-up problem 6(b), if we want to compare the placebo and medication groups at baseline we would type:

ttest bp if time==0, by(group)

(c) Correlation: To find the pairwise correlations between any group of variables simply type corr followed by the variable names. As usual you can use the if option to restrict to subsets of the data. For example, in turn-in Problem 8, to get the correlations between distances at the four age points separately for boys and girls we would type

corr distance8 distance10 distance12 distance14 if sex2==1
corr distance8 distance10 distance12 distance14 if sex2==0

(d) Scatter Plots: To obtain a scatterplot of Y vs X you simply type

scatter Y X

The scatter plot can accept more than one Y variable for the same X variable and will color code the plot so you can tell which outcome is which. This would be useful, for instance, in warm-up Problem 3(b) where I ask you to compare the observed and predicted numbers of arrests as a function of the number of people attending the game. Assuming my predicted values have been stored in myarrests, the command would be

scatter arrests myarrests attendance

(e) Scatterplots of Bin Means: In Problem 7(b) I ask you to plot mean number of absences from school as a function of whether the child likes school and age. You can of course obtain the bin means as described in (a) above by typing

bysort likeschool: summarize nhookey

but then you have to enter the 5 means into a new column. An alternative is to use the egen command as follows:

bysort likeschool: egen likemeans = mean(nhookey)

This creates a new variable, likemeans, whose values are the average number of absences for people with the specified rating of liking school. If you then type

scatter likemeans likeschool

you will get the desired plot. Age, of course, works the same way.

(f) Including Categorical Variables in a Model: To do this you can either create your own dummy variables or, as part of the model statement in later versions of STATA, you can attach the prefix i. to the variable name. For instance, for Problem 7(c), where you are asked to try treating likeschool as categorical, you would either need to create 4 dummy variables (since likeschool has 5 categories) or else tell STATA that likeschool was categorical. The resulting two versions of the commands would be

poisson nhookey age gender like2 like3 like4 like5
poisson nhookey age gender i.likeschool

I already created the dummy variables for you, but if you wanted to do it yourself one approach would be the following:

gen like2 = 0
replace like2 = 1 if likeschool==2

And similarly for the other categories.

(6) Mixed Models: The basic commands for mixed models in STATA are xtreg and xtmixed. The command xtreg does the basics while xtmixed allows more complex covariance structures. Before fitting one of these models you need to tell STATA that your data are in panel format, meaning that they have repeated measurements, and you need to tell it what variable identifies the units (subjects) and what variable indexes the repeats (usually a time variable). In later versions of STATA you can also use the delta option to specify what the spacing is between time points. For example, the command for setting up the data for warm-up problem 6 would be

xtset subj time, delta(3)

The timepoints are baseline and 3 months, so specifying delta(3) tells STATA to parameterize time in terms of months rather than 3-month blocks. For turn-in problem 8, since the ages are at two-year intervals, we would use

xtset dentalid year, delta(2)

If you have other variables in your data set for other models you may need to clear the panel settings. To do this you type

xtset, clear

If you are not sure of the current panel settings and want to check them, simply type xtset.

Once you have your data structure declared you can use the xtreg and xtmixed commands to run models using either the repeated measures or the mixed effects model framework. For the repeated measures framework you have to use the pa option, which stands for "population average" and says we are specifying a covariance pattern. You then specify the structure you want using the corr() option. The most common covariance pattern is compound symmetry, which STATA calls exchangeable. To run this set-up for the blood pressure data of warm-up problem 6 we would type

xtreg bp group time groupbytime, pa corr(exchangeable)

In early versions of STATA you also have to specify, as part of the options, what variable is serving as the unit identifier using i(). For this problem we would have

xtreg bp group time groupbytime, pa corr(exchangeable) i(subj)

Of course there are other covariance options which you can substitute for exchangeable. These include unstructured (which is the most general) and ar # (which is the autoregressive structure that has the correlation decrease as the spacing between the timepoints gets larger). For the ar model in older versions of STATA you have to tell it what variable gives the time units using the option t(timevar). For example, for the optional bonus part in turn-in problem 8 you would type:

xtreg distance sex year, pa corr(ar 1) i(dentalid) t(year)

To run a model using the random effects formulation you use the option mle for maximum likelihood random effects. The basic random intercept model requires no additional options. For the data of warm-up problem 6 we would simply type

xtreg bp group time groupbytime, mle

For turn-in problems 8(d) and 8(e) our commands would be

xtreg distance year, mle
xtreg distance year sex yearbysex, mle

In this case the i(subj) option doesn't seem to be necessary in the early versions of the package, though you can still use it. With the random intercept model STATA gives you estimates of the standard deviation for the random intercept (sigma_u), the standard deviation of the residual errors (sigma_e) and the correlation between timepoints (rho). It also provides a test of whether the random intercept was needed by checking whether its variance is greater than 0. To get more complex random effects models, for instance with random slopes, you need to use xtmixed. The basic syntax, using warm-up problem 6(g) as an example, is

xtmixed bp group time groupbytime || subj: time, covariance(independent) mle

Note that I used the mle option so that I could use a likelihood ratio test to compare this model to the model with just a random intercept. For turn-in Problem 8(h) the corresponding code would be

xtmixed distance year sex yearbysex || dentalid: year, covariance(independent) mle

The || tells STATA you are getting ready to specify the random effects part of the model. It is followed by your unit variable (here subj for the blood pressure data or dentalid for the dental data) and a colon and then any of the predictor variables for which you want random effects. The covariance() option is used to specify whether there is a correlation between the random effects. I have specified independent to indicate no such correlation, which is usually fine unless the model is very complicated. As noted above, using the mle option allows you to use likelihood ratio tests to compare this to other models.

(7) Profile (Spaghetti) and Empirical Summary Plots (Optional): In repeated measures/random effects models it is often useful to be able to plot the trajectory for each subject/unit in the model (the so-called profile or spaghetti plot). Here are two approaches. STATA has a built-in function xtline which produces the correct basic format. The syntax for turn-in problem 8(a) would be

xtline distance, i(dentalid) t(year) overlay legend(off)

Basically you specify your outcome variable, the subject id variable and the time variable and tell it to overlay the points on the lines. The legend(off) option just suppresses some extra printout. Actually you don't even need the i() and t() options if you've already used the xtset command to tell STATA you are using panel data (see above). The problem with this command is that it isn't easy to color code the lines by subgroups, which can be really helpful for detecting important effects like the gender differences that we are interested in for problem 8. To do this you can construct the plot manually instead. First we make sure that the data are sorted by the subject and time variables. For problem 8(a) that would be

sort dentalid year

You can also do the sorting from within the data editor. Next we need to make a new variable containing the y value of the NEXT observation from that subject:

gen nextdistance = distance[_n+1] if dentalid==dentalid[_n+1]

The if dentalid==dentalid[_n+1] code ensures that we won't tell STATA the y value following the last observation from a subject is equal to the first y value from the next subject! Similarly we create a new variable containing the next x value:

gen nextyear = year[_n+1] if dentalid==dentalid[_n+1]

Look at the data in the data editor to make sure you understand what these commands do. Now we can use the twoway plot type named pcspike to connect adjacent pairs of (y,x) observations:

twoway pcspike distance year nextdistance nextyear

To use separate colors for boys and girls we add some additional options:

twoway (pcspike distance year nextdistance nextyear if sex==1, lcolor(pink)) (pcspike distance year nextdistance nextyear if sex==0, lcolor(blue))

For Problem 8(c) we also need to be able to plot the group means for each group at each age. We could take the easy (but inelegant!) route and simply use the summarize command for the means subgrouped by age and sex and then manually enter them as new columns and use the two plots as above. Alternatively we can use the following more sophisticated code:

bysort sex year: egen avgdistance = mean(distance)
twoway (connected avgdistance year if dentalid==1, mcolor(pink) lcolor(pink)) (connected avgdistance year if dentalid==12, mcolor(blue) lcolor(blue))

Note that we have picked ids for one subject of each gender; it doesn't matter which. Basically we are replotting the means at each time for each subject, which is overkill but works nicely.

Commands in SAS

(1) Summaries By Group: Descriptive statistics in SAS are obtained using proc univariate, proc freq and proc means. You specify the procedure, the data set and the variables you want to summarize using the var statement. If you want to split the summaries by group you use a class statement to specify the grouping variable. As an example I use the fishing data from warm-up problem 4. The first set of commands below gives detailed summary statistics for the number of fish caught, the number of people and the number of children. If I wanted to subgroup the number of fish caught by whether or not the people were camping I would add the class statement as shown in the second set of commands. I could also use proc means, specifying which summary statistics I want. The third set of commands below gives a table with the mean, standard deviation, minimum and maximum for the number of fish split by the levels of the camper variable. You can include as many variables as you like in the var statement. The final set of commands gives the frequencies for each of the observed values for the camper, persons and children variables using proc freq. This procedure can be expanded to give contingency tables by typing var1*var2 in the tables statement, and you can add the option /chisq to get the various contingency table tests as well.

proc univariate data = tmp1.hw4;
var numfish persons children;

proc univariate data = tmp1.hw4;
class camper;
var numfish;
histogram numfish / midpoints = 0 to 50 by 1;

proc means data = tmp1.hw4 mean std min max var;
class camper;
var numfish;

proc freq data = tmp1.hw4;
tables camper children persons camper*children/chisq;

(2) Including Categorical Variables In A Model: Categorical variables are something SAS has historically handled more easily than STATA. In basically any regression model, from OLS to a zero-inflated negative binomial model, all you have to do is add a line with a class command and list the variables you want treated as categorical. If the variables are multi-category, SAS will give tests corresponding to whether the group of dummy variables representing the categories significantly improves the model (i.e. a partial F test in OLS and a likelihood ratio chi-squared test in a GLM). The only tricky issue is what SAS specifies as the reference category. In the sections below I describe some of the referencing options. This was also discussed in the commands for HW4.

(3) Poisson Regression: Poisson regression in SAS is generally fit using proc genmod, which is shorthand for generalized linear model. In general for proc genmod you have to specify the distribution and link function. If you do not specify the distribution, normality is assumed. If you do not specify the link, SAS defaults to the canonical link. For the Poisson distribution this is the log link, so you can skip that part. Like other regression models in SAS, proc genmod allows a class statement which you can use to tell it which variables are categorical. You can add an option descending which tells it that you want to use the lowest value (i.e. 0) as the reference value. Otherwise it defaults to using the highest value as the reference. The model statement has the same form as in standard regression, namely Y = X1 X2 ... After it you can add the option type3, which will make SAS automatically present global tests for each categorical variable. The model statement is also where you specify the form of the GLM using dist and link options. Below is the syntax we would use to run a Poisson model for the fishing data of warm-up problem 4. The first basic version treats all the variables as continuous (for an indicator variable taking on only the values 0 and 1 it doesn't matter whether you indicate it is categorical or not). Note that I didn't really need the link option. Also note that SAS automatically gives you the deviance and Pearson goodness of fit statistics along with the log likelihood.

proc genmod data = tmp1.hw5;
model numfish = camper persons children/dist = poisson link = log;

The second version, below, treats the camper and children variables as categorical for illustrative purposes. Note the descending option. The default tests for the option type3 are likelihood ratio chi-squared tests. If you add the option wald after type3 you get Wald tests instead.

proc genmod data = tmp1.hw5;
class camper children/ descending;
model numfish = camper persons children/type3 dist = poisson link = log;

There are many other options you can use. For example, the option obstats used after the model statement prints out a table of things like fitted values and residuals for the points in the data set. Beware of doing this with a large data set as you will get MANY pages of printout. The option offset = variable is used to include an offset term in the model. Note that like STATA, SAS uses the variable on the raw scale in the model, so to get what we were calling the offset or exposure variable in class you need to take the natural log first. I illustrate these commands below for the soccer arrest data from warm-up problem 3 using the new logattendance variable I created:

proc genmod data = tmp1.hw5;
model arrests = / dist = poisson link = log offset = logattendance obstats;

(4) Negative Binomial Regression: This works basically exactly the same way as Poisson regression except that for the distribution we use negbin. The various other options work the same way. The basic commands for the fishing data would look like

proc genmod data = tmp1.hw5;
model numfish = camper persons children/dist = negbin;

(5) Zero Inflated Models: To get zero-inflated versions of the Poisson or negative binomial models you can still use proc genmod but you have to add a few extra options. For example, for the zero-inflated Poisson you specify the distribution as zip instead of poisson and you have to tell it what model to fit for the zeros. The syntax for the fishing example of warm-up problem 4 is shown below using all three variables in both parts of the model, treated as continuous. Since the log link is the default I have left that part of the statement off, but you are free to specify it. If you wanted to you could add a class statement with the usual options to treat some of the variables as categorical, and you could change which variables were involved in which part of the model.

proc genmod data = tmp1.hw5;
model numfish = camper persons children/dist = zip;
zeromodel camper persons children/link = logit;

The set-up for negative binomial models is similar except our distribution is the zero-inflated negative binomial, zinb. The basic code for the fishing data is:

proc genmod data = tmp1.hw5;
model numfish = camper persons children/dist = zinb;
zeromodel camper persons children/link = logit;

The SAS zero-inflated negative binomial procedure does give a Wald confidence interval for the dispersion, but neither the zip nor zinb models currently have the Vuong test for the zero-inflation built in. There is a macro which you can download from the web if you feel like making the effort. However, significance of the coefficients in the zeromodel is usually a sufficient indication that the model is an improvement. There is also a special procedure, proc countreg, which is more specific to count data and has some extra options. There are several fitting methods. The one that matches the proc genmod procedure is called quasi-Newton. You can get this with the method = qn option in the proc statement. The basic code for the Poisson model is as follows:

proc countreg data = tmp1.hw5 method = qn;
model numfish = camper persons children/dist = zip;
zeromodel numfish ~ camper persons children;

You can even fit these sorts of models using another procedure, proc nlmixed (which stands for non-linear mixed models); this is not necessary here but is useful in some circumstances.

(6) Repeated Measures/Mixed Models: SAS actually has a more flexible mixed models platform in many ways than STATA. The basic command is proc mixed. As usual, you need to specify your data set, use the class statement to specify what variables are categorical and give a model statement. The only really new thing is that we need either a repeated or random statement to deal with the correlations between measurements within subjects. Naturally, the repeated statement is used if you want to directly specify the correlation structure among the measurements, while the random statement is natural if you are thinking in a random effects framework. SAS expects data for such models to be in what is called long form, which means that you have one row per observation per subject (so if we had 4 measurements per subject as in turn-in problem 8 we would have 4 rows of data for each subject, one for each measurement point). Note also that SAS defaults to REML (restricted maximum likelihood estimation). You can change the fitting method to maximum likelihood (or other things) by specifying method = ML in the proc mixed statement.


More information

Investigating Models with Two or Three Categories

Investigating Models with Two or Three Categories Ronald H. Heck and Lynn N. Tabata 1 Investigating Models with Two or Three Categories For the past few weeks we have been working with discriminant analysis. Let s now see what the same sort of model might

More information

LDA Midterm Due: 02/21/2005

LDA Midterm Due: 02/21/2005 LDA.665 Midterm Due: //5 Question : The randomized intervention trial is designed to answer the scientific questions: whether social network method is effective in retaining drug users in treatment programs,

More information

Multilevel/Mixed Models and Longitudinal Analysis Using Stata

Multilevel/Mixed Models and Longitudinal Analysis Using Stata Multilevel/Mixed Models and Longitudinal Analysis Using Stata Isaac J. Washburn PhD Research Associate Oregon Social Learning Center Summer Workshop Series July 2010 Longitudinal Analysis 1 Longitudinal

More information

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878 Contingency Tables I. Definition & Examples. A) Contingency tables are tables where we are looking at two (or more - but we won t cover three or more way tables, it s way too complicated) factors, each

More information

Multinomial Logistic Regression Models

Multinomial Logistic Regression Models Stat 544, Lecture 19 1 Multinomial Logistic Regression Models Polytomous responses. Logistic regression can be extended to handle responses that are polytomous, i.e. taking r>2 categories. (Note: The word

More information

Modelling Rates. Mark Lunt. Arthritis Research UK Epidemiology Unit University of Manchester

Modelling Rates. Mark Lunt. Arthritis Research UK Epidemiology Unit University of Manchester Modelling Rates Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 05/12/2017 Modelling Rates Can model prevalence (proportion) with logistic regression Cannot model incidence in

More information

Poisson regression: Further topics

Poisson regression: Further topics Poisson regression: Further topics April 21 Overdispersion One of the defining characteristics of Poisson regression is its lack of a scale parameter: E(Y ) = Var(Y ), and no parameter is available to

More information

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion Biostokastikum Overdispersion is not uncommon in practice. In fact, some would maintain that overdispersion is the norm in practice and nominal dispersion the exception McCullagh and Nelder (1989) Overdispersion

More information

(c) Interpret the estimated effect of temperature on the odds of thermal distress.

(c) Interpret the estimated effect of temperature on the odds of thermal distress. STA 4504/5503 Sample questions for exam 2 1. For the 23 space shuttle flights that occurred before the Challenger mission in 1986, Table 1 shows the temperature ( F) at the time of the flight and whether

More information

Generalised linear models. Response variable can take a number of different formats

Generalised linear models. Response variable can take a number of different formats Generalised linear models Response variable can take a number of different formats Structure Limitations of linear models and GLM theory GLM for count data GLM for presence \ absence data GLM for proportion

More information

Time-Invariant Predictors in Longitudinal Models

Time-Invariant Predictors in Longitudinal Models Time-Invariant Predictors in Longitudinal Models Today s Class (or 3): Summary of steps in building unconditional models for time What happens to missing predictors Effects of time-invariant predictors

More information

Generalized Models: Part 1

Generalized Models: Part 1 Generalized Models: Part 1 Topics: Introduction to generalized models Introduction to maximum likelihood estimation Models for binary outcomes Models for proportion outcomes Models for categorical outcomes

More information

Generalized linear models

Generalized linear models Generalized linear models Douglas Bates November 01, 2010 Contents 1 Definition 1 2 Links 2 3 Estimating parameters 5 4 Example 6 5 Model building 8 6 Conclusions 8 7 Summary 9 1 Generalized Linear Models

More information

Time-Invariant Predictors in Longitudinal Models

Time-Invariant Predictors in Longitudinal Models Time-Invariant Predictors in Longitudinal Models Today s Topics: What happens to missing predictors Effects of time-invariant predictors Fixed vs. systematically varying vs. random effects Model building

More information

Case-control studies C&H 16

Case-control studies C&H 16 Case-control studies C&H 6 Bendix Carstensen Steno Diabetes Center & Department of Biostatistics, University of Copenhagen bxc@steno.dk http://bendixcarstensen.com PhD-course in Epidemiology, Department

More information

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014 LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R Liang (Sally) Shan Nov. 4, 2014 L Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers

More information

Chapter 5: Logistic Regression-I

Chapter 5: Logistic Regression-I : Logistic Regression-I Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu] D. Bandyopadhyay

More information

Model Assumptions; Predicting Heterogeneity of Variance

Model Assumptions; Predicting Heterogeneity of Variance Model Assumptions; Predicting Heterogeneity of Variance Today s topics: Model assumptions Normality Constant variance Predicting heterogeneity of variance CLP 945: Lecture 6 1 Checking for Violations of

More information

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements:

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements: Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis Anna E. Barón, Keith E. Muller, Sarah M. Kreidler, and Deborah H. Glueck Acknowledgements: The project was supported

More information

Stat 587: Key points and formulae Week 15

Stat 587: Key points and formulae Week 15 Odds ratios to compare two proportions: Difference, p 1 p 2, has issues when applied to many populations Vit. C: P[cold Placebo] = 0.82, P[cold Vit. C] = 0.74, Estimated diff. is 8% What if a year or place

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Multilevel Models in Matrix Form Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Today s Lecture Linear models from a matrix perspective An example of how to do

More information

Section Poisson Regression

Section Poisson Regression Section 14.13 Poisson Regression Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Data Analysis II 1 / 26 Poisson regression Regular regression data {(x i, Y i )} n i=1,

More information

An Introduction to Multilevel Models. PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012

An Introduction to Multilevel Models. PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012 An Introduction to Multilevel Models PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012 Today s Class Concepts in Longitudinal Modeling Between-Person vs. +Within-Person

More information

12 Generalized linear models

12 Generalized linear models 12 Generalized linear models In this chapter, we combine regression models with other parametric probability models like the binomial and Poisson distributions. Binary responses In many situations, we

More information

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1 Parametric Modelling of Over-dispersed Count Data Part III / MMath (Applied Statistics) 1 Introduction Poisson regression is the de facto approach for handling count data What happens then when Poisson

More information

Interactions among Continuous Predictors

Interactions among Continuous Predictors Interactions among Continuous Predictors Today s Class: Simple main effects within two-way interactions Conquering TEST/ESTIMATE/LINCOM statements Regions of significance Three-way interactions (and beyond

More information

Generalized Multilevel Models for Non-Normal Outcomes

Generalized Multilevel Models for Non-Normal Outcomes Generalized Multilevel Models for Non-Normal Outcomes Topics: 3 parts of a generalized (multilevel) model Models for binary, proportion, and categorical outcomes Complications for generalized multilevel

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

A (Brief) Introduction to Crossed Random Effects Models for Repeated Measures Data

A (Brief) Introduction to Crossed Random Effects Models for Repeated Measures Data A (Brief) Introduction to Crossed Random Effects Models for Repeated Measures Data Today s Class: Review of concepts in multivariate data Introduction to random intercepts Crossed random effects models

More information

Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 2017, Chicago, Illinois

Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 2017, Chicago, Illinois Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 217, Chicago, Illinois Outline 1. Opportunities and challenges of panel data. a. Data requirements b. Control

More information

Generalized Linear Models

Generalized Linear Models York SPIDA John Fox Notes Generalized Linear Models Copyright 2010 by John Fox Generalized Linear Models 1 1. Topics I The structure of generalized linear models I Poisson and other generalized linear

More information

Introduction to SAS proc mixed

Introduction to SAS proc mixed Faculty of Health Sciences Introduction to SAS proc mixed Analysis of repeated measurements, 2017 Julie Forman Department of Biostatistics, University of Copenhagen 2 / 28 Preparing data for analysis The

More information

Monday 7 th Febraury 2005

Monday 7 th Febraury 2005 Monday 7 th Febraury 2 Analysis of Pigs data Data: Body weights of 48 pigs at 9 successive follow-up visits. This is an equally spaced data. It is always a good habit to reshape the data, so we can easily

More information

Contrasting Marginal and Mixed Effects Models Recall: two approaches to handling dependence in Generalized Linear Models:

Contrasting Marginal and Mixed Effects Models Recall: two approaches to handling dependence in Generalized Linear Models: Contrasting Marginal and Mixed Effects Models Recall: two approaches to handling dependence in Generalized Linear Models: Marginal models: based on the consequences of dependence on estimating model parameters.

More information

1 A Review of Correlation and Regression

1 A Review of Correlation and Regression 1 A Review of Correlation and Regression SW, Chapter 12 Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

Time-Invariant Predictors in Longitudinal Models

Time-Invariant Predictors in Longitudinal Models Time-Invariant Predictors in Longitudinal Models Topics: What happens to missing predictors Effects of time-invariant predictors Fixed vs. systematically varying vs. random effects Model building strategies

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

Mohammed. Research in Pharmacoepidemiology National School of Pharmacy, University of Otago

Mohammed. Research in Pharmacoepidemiology National School of Pharmacy, University of Otago Mohammed Research in Pharmacoepidemiology (RIPE) @ National School of Pharmacy, University of Otago What is zero inflation? Suppose you want to study hippos and the effect of habitat variables on their

More information

Introduction to Generalized Models

Introduction to Generalized Models Introduction to Generalized Models Today s topics: The big picture of generalized models Review of maximum likelihood estimation Models for binary outcomes Models for proportion outcomes Models for categorical

More information

Introduction to SAS proc mixed

Introduction to SAS proc mixed Faculty of Health Sciences Introduction to SAS proc mixed Analysis of repeated measurements, 2017 Julie Forman Department of Biostatistics, University of Copenhagen Outline Data in wide and long format

More information

Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 2017, Boston, Massachusetts

Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 2017, Boston, Massachusetts Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 217, Boston, Massachusetts Outline 1. Opportunities and challenges of panel data. a. Data requirements b. Control

More information

Simple, Marginal, and Interaction Effects in General Linear Models

Simple, Marginal, and Interaction Effects in General Linear Models Simple, Marginal, and Interaction Effects in General Linear Models PRE 905: Multivariate Analysis Lecture 3 Today s Class Centering and Coding Predictors Interpreting Parameters in the Model for the Means

More information

SAS Code for Data Manipulation: SPSS Code for Data Manipulation: STATA Code for Data Manipulation: Psyc 945 Example 1 page 1

SAS Code for Data Manipulation: SPSS Code for Data Manipulation: STATA Code for Data Manipulation: Psyc 945 Example 1 page 1 Psyc 945 Example page Example : Unconditional Models for Change in Number Match 3 Response Time (complete data, syntax, and output available for SAS, SPSS, and STATA electronically) These data come from

More information

Designing Multilevel Models Using SPSS 11.5 Mixed Model. John Painter, Ph.D.

Designing Multilevel Models Using SPSS 11.5 Mixed Model. John Painter, Ph.D. Designing Multilevel Models Using SPSS 11.5 Mixed Model John Painter, Ph.D. Jordan Institute for Families School of Social Work University of North Carolina at Chapel Hill 1 Creating Multilevel Models

More information

Generalized linear models

Generalized linear models Generalized linear models Outline for today What is a generalized linear model Linear predictors and link functions Example: estimate a proportion Analysis of deviance Example: fit dose- response data

More information

Longitudinal Data Analysis of Health Outcomes

Longitudinal Data Analysis of Health Outcomes Longitudinal Data Analysis of Health Outcomes Longitudinal Data Analysis Workshop Running Example: Days 2 and 3 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development

More information

Basic Regressions and Panel Data in Stata

Basic Regressions and Panel Data in Stata Developing Trade Consultants Policy Research Capacity Building Basic Regressions and Panel Data in Stata Ben Shepherd Principal, Developing Trade Consultants 1 Basic regressions } Stata s regress command

More information

A course in statistical modelling. session 09: Modelling count variables

A course in statistical modelling. session 09: Modelling count variables A Course in Statistical Modelling SEED PGR methodology training December 08, 2015: 12 2pm session 09: Modelling count variables Graeme.Hutcheson@manchester.ac.uk blackboard: RSCH80000 SEED PGR Research

More information

1 Introduction to Minitab

1 Introduction to Minitab 1 Introduction to Minitab Minitab is a statistical analysis software package. The software is freely available to all students and is downloadable through the Technology Tab at my.calpoly.edu. When you

More information

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institute of Technology, Madras Module

More information

STAT 7030: Categorical Data Analysis

STAT 7030: Categorical Data Analysis STAT 7030: Categorical Data Analysis 5. Logistic Regression Peng Zeng Department of Mathematics and Statistics Auburn University Fall 2012 Peng Zeng (Auburn University) STAT 7030 Lecture Notes Fall 2012

More information

More Accurately Analyze Complex Relationships

More Accurately Analyze Complex Relationships SPSS Advanced Statistics 17.0 Specifications More Accurately Analyze Complex Relationships Make your analysis more accurate and reach more dependable conclusions with statistics designed to fit the inherent

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population

More information

Review of CLDP 944: Multilevel Models for Longitudinal Data

Review of CLDP 944: Multilevel Models for Longitudinal Data Review of CLDP 944: Multilevel Models for Longitudinal Data Topics: Review of general MLM concepts and terminology Model comparisons and significance testing Fixed and random effects of time Significance

More information

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses ST3241 Categorical Data Analysis I Multicategory Logit Models Logit Models For Nominal Responses 1 Models For Nominal Responses Y is nominal with J categories. Let {π 1,, π J } denote the response probabilities

More information

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math.

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math. Regression, part II I. What does it all mean? A) Notice that so far all we ve done is math. 1) One can calculate the Least Squares Regression Line for anything, regardless of any assumptions. 2) But, if

More information

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs)

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs) 36-309/749 Experimental Design for Behavioral and Social Sciences Dec 1, 2015 Lecture 11: Mixed Models (HLMs) Independent Errors Assumption An error is the deviation of an individual observed outcome (DV)

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

Sections 4.1, 4.2, 4.3

Sections 4.1, 4.2, 4.3 Sections 4.1, 4.2, 4.3 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1/ 32 Chapter 4: Introduction to Generalized Linear Models Generalized linear

More information

Fitting a Straight Line to Data

Fitting a Straight Line to Data Fitting a Straight Line to Data Thanks for your patience. Finally we ll take a shot at real data! The data set in question is baryonic Tully-Fisher data from http://astroweb.cwru.edu/sparc/btfr Lelli2016a.mrt,

More information

Analysing categorical data using logit models

Analysing categorical data using logit models Analysing categorical data using logit models Graeme Hutcheson, University of Manchester The lecture notes, exercises and data sets associated with this course are available for download from: www.research-training.net/manchester

More information

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds Chapter 6 Logistic Regression In logistic regression, there is a categorical response variables, often coded 1=Yes and 0=No. Many important phenomena fit this framework. The patient survives the operation,

More information

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response)

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response) Model Based Statistics in Biology. Part V. The Generalized Linear Model. Logistic Regression ( - Response) ReCap. Part I (Chapters 1,2,3,4), Part II (Ch 5, 6, 7) ReCap Part III (Ch 9, 10, 11), Part IV

More information

BIOS 625 Fall 2015 Homework Set 3 Solutions

BIOS 625 Fall 2015 Homework Set 3 Solutions BIOS 65 Fall 015 Homework Set 3 Solutions 1. Agresti.0 Table.1 is from an early study on the death penalty in Florida. Analyze these data and show that Simpson s Paradox occurs. Death Penalty Victim's

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

1 The basics of panel data

1 The basics of panel data Introductory Applied Econometrics EEP/IAS 118 Spring 2015 Related materials: Steven Buck Notes to accompany fixed effects material 4-16-14 ˆ Wooldridge 5e, Ch. 1.3: The Structure of Economic Data ˆ Wooldridge

More information

Package HGLMMM for Hierarchical Generalized Linear Models

Package HGLMMM for Hierarchical Generalized Linear Models Package HGLMMM for Hierarchical Generalized Linear Models Marek Molas Emmanuel Lesaffre Erasmus MC Erasmus Universiteit - Rotterdam The Netherlands ERASMUSMC - Biostatistics 20-04-2010 1 / 52 Outline General

More information

Soc 63993, Homework #7 Answer Key: Nonlinear effects/ Intro to path analysis

Soc 63993, Homework #7 Answer Key: Nonlinear effects/ Intro to path analysis Soc 63993, Homework #7 Answer Key: Nonlinear effects/ Intro to path analysis Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised February 20, 2015 Problem 1. The files

More information

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis Lecture 6: Logistic Regression Analysis Christopher S. Hollenbeak, PhD Jane R. Schubart, PhD The Outcomes Research Toolbox Review Homework 2 Overview Logistic regression model conceptually Logistic regression

More information

An Introduction to Path Analysis

An Introduction to Path Analysis An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving

More information

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels.

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. Contingency Tables Definition & Examples. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. (Using more than two factors gets complicated,

More information

Comparing IRT with Other Models

Comparing IRT with Other Models Comparing IRT with Other Models Lecture #14 ICPSR Item Response Theory Workshop Lecture #14: 1of 45 Lecture Overview The final set of slides will describe a parallel between IRT and another commonly used

More information

Recent Developments in Multilevel Modeling

Recent Developments in Multilevel Modeling Recent Developments in Multilevel Modeling Roberto G. Gutierrez Director of Statistics StataCorp LP 2007 North American Stata Users Group Meeting, Boston R. Gutierrez (StataCorp) Multilevel Modeling August

More information

Introduction Fitting logistic regression models Results. Logistic Regression. Patrick Breheny. March 29

Introduction Fitting logistic regression models Results. Logistic Regression. Patrick Breheny. March 29 Logistic Regression March 29 Introduction Binary outcomes are quite common in medicine and public health: alive/dead, diseased/healthy, infected/not infected, case/control Assuming that these outcomes

More information