Homework Assignment 5

Biostatistics 201B Homework 5
March 5th, 2018
Due Date: Friday, March 16th, 2018

Note: To save you time and headache I have provided the graphics needed for Problem 8, parts (a) and (c). I have included the commands in the instruction file for those who are interested in being able to replicate them. I have also marked certain parts of the last problem as optional for purposes of assignment credit to reduce the burden for the last week of class. However, all the parts are relevant for the final exam, so be sure you go through the solutions even if you don't turn them in.

Note: There are 6 warm-up problems and 2 turn-in problems on this assignment, covering Poisson and negative binomial regression and mixed models. Next week I will post some additional practice problems on survival analysis, but these will NOT need to be turned in. Like the regular warm-up problems they will serve as examples and provide practice on these topics for the final. You must submit the 2 turn-in problems together with the corresponding STATA/SAS printouts to receive full credit. The assignment is officially due in class on the last day of lecture, but you may turn it in to the TA folder (in CHS) any time before 5:00 with no penalty. I won't take assignments after that, though, as I will need to post the solutions so people can study for the final exam.

Note: Output from any calculations done in STATA or SAS MUST be included with your assignment for full credit. If I do not specify which way to do a problem, you may choose whether to do it by hand or on the computer. All the STATA/SAS commands needed to complete this homework are given at the end of the assignment and will be reviewed in the lab. You do not need to hand in a separate lab report; simply turn in the relevant output as part of your homework.

Note: You are encouraged to work with fellow students, as necessary, on these problems. However, each of you MUST write up your solution ON YOUR OWN and IN YOUR OWN WORDS. The style of your write-up is as important as getting the correct answer. Your solutions should be easy to follow and contain English explanations of what you are doing and why. You do not have to write an essay for each problem, but you should give enough comments so that someone who has not seen the problem statement can understand your work. You do not have to type your assignments. However, if they are too sloppy to read, too hard to understand, or give just numbers with no comments, you WILL lose points.

Problems labeled (ACM) are adapted from the text Computer-Aided Multivariate Analysis by Afifi, Clark and May. Problems labeled (AA) are adapted from An Introduction to Categorical Data Analysis by Alan Agresti. Problems labeled (ATS) use data sets from UCLA's Academic Technology Services web site.

Warm-up Problems

(1) Poisson and Negative Binomial Regression Basics:

(a) Explain what the distributions, link functions and systematic components are for Poisson and Negative Binomial GLMs.

(b) Give examples of circumstances where you might want to use each of these models.

(c) Explain the concepts of overdispersion and zero-inflation, give examples of how each of these situations might arise in a Poisson or Negative Binomial model, and explain what you could do to adjust for them.

(2) Munching (Computer) Chips (AA 4.6 and 4.8): An experimenter analyzes imperfection rates for two processes used to fabricate silicon wafers for computer chips. Each treatment is applied to 10 chips and the number of imperfections is recorded. The thickness of the silicon coating is also reported (0 = thin and 1 = thick). Use the data to answer the following questions.

(a) Fit a simple Poisson regression using treatment as the predictor. Does there appear to be a significant difference between the two processes? Show two ways you could perform this test.

(b) Give a brief interpretation of the regression coefficients and their confidence intervals, both on the log scale and on the mean scale.

(c) Now add the thickness of the silicon coating as a predictor and rerun the model. Is the new model a significant improvement over the previous model? Describe as carefully as you can the effect of the coating thickness on the number of imperfections. What are you assuming about the joint effect of process and coating thickness by fitting the model this way?

(3) More Sports Fanatics (AA 4.14): British fans are mad (literally!) about their soccer, even more than New Zealanders are about rugby. This data set gives the number of fans in attendance (in thousands) and the total number of arrests in the season for soccer teams in the second division of the British football league.

(a) Fit a Poisson regression model for number of arrests using attendance as an offset variable. Explain (i) why it is important to use an offset variable here and (ii) what the interpretation of the intercept from the model is. (Note that there are no predictors here other than the offset!)

(b) Plot arrests vs attendance and overlay the prediction equation. Obtain residuals from the model and use them to identify teams that had much larger or much smaller than expected numbers of arrests.
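For reference in Problems 1 and 3, and using the same notation as the offset discussion in the STATA command section below, the Poisson GLM and its offset version can be written as

Y ~ Poisson(µ), with log(µ) = β0 + β1 X1 + ... + βp Xp

log(µ/t) = β0, or equivalently log(µ) = log(t) + β0

where in Problem 3 the "exposure" t is the attendance and there are no predictors other than the intercept, so the model says the expected number of arrests is a constant multiple of the number of fans attending.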

(4) Camping Data (ATS): This problem goes in depth into the fishing example I used in class. The outcome variable is the number of fish caught (numfish) by 250 groups of people who went to a wildlife park. The predictors are the number of people in the group (persons), the number of children in the group (children) and whether or not they were camping (camper = 1 for yes and 0 for no). Use the data to answer the following questions.

(a) Fit a Poisson regression for the number of fish caught with number of people, number of children and whether or not the party was camping as predictors.

(b) Give careful interpretations of the coefficients of persons, children and camper status and their confidence intervals (i) on the log scale and (ii) on the mean scale. Which of these variables appear to be significant predictors?

(c) Plot the Pearson residuals versus the fitted values for this model and also obtain the deviance and Pearson goodness of fit tests. Do you think it is reasonable to use these measures here? Are there any obvious outliers? Does the model seem to be well calibrated? If not, can you identify where the misfit is occurring?

(d) Now we evaluate the issue of over-dispersion. Suggest some plausible reasons why there might be over-dispersion in this data set and then check for it in the following ways:

(1) Using basic descriptive statistics, obtain the mean and variance of the numfish variable. Do this for various subgroups of the data set (e.g. camping or not, with children or not, etc.). What do these statistics suggest about over-dispersion?

(2) Calculate the approximate dispersion factor based on the Pearson chi-squared goodness of fit statistic.

(3) Explain what your plot of residuals from part (c) suggests about over-dispersion.

(e) Rerun the model using a negative binomial regression. Does this model confirm your conclusions about over-dispersion from part (d)? Explain briefly.

(f) Give brief interpretations of the coefficients from the negative binomial regression. Are your conclusions (as to magnitude, direction and significance) about the effects of the predictors any different from those in part (b)?

In the next part of the problem we consider the issue of zero inflation.

(g) Explain intuitively why we might expect zero-inflation in the context of this problem. Do you expect any of these predictors to be especially predictive of who would be a "certain 0"?

(h) Continuing with your idea of descriptive statistics from part (d), calculate the expected number of 0's for various subgroups of the data set assuming a Poisson distribution and then examine the actual number of 0's. Do we seem to have a zero-inflation problem? (Note: You may find it helpful to remember that for a Poisson random variable the probability of k events is P(Y = k) = µ^k e^(-µ)/k!.)

(i) Fit a zero-inflated Poisson model to the data using persons, children and camper as the predictors for the Poisson component and experimenting with various variables for the zero-inflation component of the model. Does this model seem to be a significant improvement over a standard Poisson regression? Explain briefly.

(j) Give brief interpretations of the coefficients for each component of the zero-inflated Poisson model. Which variables seem to be significant in each part of the model? Explain as carefully as you can what this model suggests about how these variables affect numbers of fish caught.

(k) Now fit a zero-inflated negative binomial model using the same model set-up as in part (j). Is this model superior to the zero-inflated Poisson model of part (j)? Is it superior to the plain negative binomial model of part (e)? Explain briefly in each instance. What does that tell you about the issues of over-dispersion and zero-inflation? Do any of your conceptual conclusions from part (j) change with this model?
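For this problem, one possible way to string together the STATA commands described at the end of the assignment is sketched below. This is only an outline: mufish and presids are variable names I have made up for the predicted counts and the Pearson residuals, and the approximate dispersion factor asked about in (d)(2) is just the Pearson chi-squared statistic divided by its degrees of freedom.

poisson numfish persons children camper
estat gof, pearson
predict mufish
gen presids = (numfish - mufish)/sqrt(mufish)
scatter presids mufish
nbreg numfish camper children persons
zip numfish camper persons children, inflate(camper persons children) vuong
zinb numfish camper persons children, inflate(camper persons children) vuong zip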

(5) Mixed Model Basics:

(a) Explain the conceptual differences between the repeated measures, random effects, and hierarchical model formulations of techniques for correlated data.

(b) Explain the difference between a fixed effect and a random effect and give examples of each.

(c) Describe several of the common covariance structures for correlated data and when you might want to use them.

(6) This Assignment Makes My Blood Pressure Rise: The blood pressure data I used as an example in class are included in the data set for this assignment. Here we will analyze them in more detail. The primary data are given in long form, where the variables are subject id (subj), blood pressure (bp), group (0 = placebo or 1 = medication), time (baseline or after 3 months of treatment) and a group by time interaction (groupbytime), with one row for each time point. I have also included four extra variables, bp0p, bp3p, bp0m, and bp3m, which give the blood pressure values at times 0 and 3 months separately for the placebo (p) and medication (m) groups. These will be useful for parts (a)-(c).

(a) Obtain means and standard deviations for each of the two treatment groups at each of the two time points. What do these values suggest intuitively about the results of the trial?

(b) Is there a difference in blood pressure between the two groups at baseline? What about at the end of treatment? Evaluate this formally by performing a simple test at each time point. What do your results suggest about treatment? Why might you not simply report the end of treatment analysis as your final result and not worry about the repeated measures?

(c) Is there evidence of change over time within either the placebo group or the treatment group? Carry out an appropriate simple test within each group to answer this question. Why is it not sufficient to simply note that there is a significant improvement in the treatment group over time?

(d) Now use STATA or SAS to fit a model with data from both groups at both timepoints, using the repeated measures formulation of the model and assuming a compound symmetric covariance structure. (In STATA, before you start, use the xtset command to tell STATA you are working with panel data (see the instructions below).) Use blood pressure as your outcome and group, time and a group by time interaction as predictors. Use the model to answer the following questions:

(i) Is the model significant overall? Write down the hypotheses and the corresponding p-value for an appropriate test.

(ii) Is there evidence of a significant group by time interaction? Explain as carefully as you can what this tells you about the effect of the treatment. Incorporate the direction and magnitude of the various regression coefficients into your explanation. Does it matter whether the coefficients of the individual group and time variables are significant? What are they telling you?

(iii) In class we discussed other possible covariance structures for correlated data, including the AR(1) model and the unstructured covariance model. Do we need either of these structures here? Explain briefly.

(e) Rerun the model from part (d) using a mixed model point of view. What random effect(s) do you need to include to obtain the same fit as in part (d)? Verify that you in fact get the same answer. In what circumstances would the random effects formulation have been better for these data?

(f) Was it necessary in this problem to use a more complex covariance structure rather than simply assuming the errors were independent? You should discuss this (i) conceptually, (ii) by obtaining the correlation between the two timepoints and (iii) by looking at an appropriate test on your model printout.

(g) Now try running the model again including both a random intercept and a random slope. Intuitively, do you think this will help much? How could you check whether this model is an improvement over the random intercept model?
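A minimal sketch of the STATA commands for parts (d), (e) and (g) of Problem 6, taken from the command section at the end of the assignment (the random-slope line uses the xtmixed syntax described there):

xtset subj time, delta(3)
xtreg bp group time groupbytime, pa corr(exchangeable)
xtreg bp group time groupbytime, mle
xtmixed bp group time groupbytime || subj: time, covariance(independent) mle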

Problems To Turn In

(7) I Wish I Could Play Hookey From 201B (ACM Problems and 12.26): This problem examines factors associated with adolescent students being absent from school without a valid excuse. For each of n = 252 students, the number of times absent in the last month was recorded (abbreviated nhookey for the number of times the student "played hookey") along with possible predictors including age, gender (1 = male and 0 = female) and the degree to which the student liked school (rated from 1 = liked it very much to 5 = disliked it very much). Our goal in this problem is to build a model for school absences.

(a) Fit the Poisson model with no predictors and explain as carefully as you can the meaning of the estimated intercept. (Note: you may find it helpful to exponentiate it!)

(b) Plot mean number of days absent as a function of (i) likeschool and (ii) age. Do there appear to be significant relationships between these variables and numbers of days absent? Do the relationships appear linear?

(c) Fit a Poisson regression for the number of days the adolescents in the sample were absent from school with gender, age and how much the student liked school as predictors. Obtain two versions: one in which likeschool is treated as continuous and one in which it is treated as categorical. Which version of the model do you think fits better and why? From here on out, for simplicity, treat the likeschool variable as continuous rather than categorical.

(d) Give careful interpretations of the coefficients of the age, gender and likeschool variables and their confidence intervals (i) on the log scale and (ii) on the mean scale. Which of these variables appear to be significant predictors?

(e) Now suppose that the number of absences for some of the students were measured over 1 month and some were measured over 3 months, as indicated by the variable hmonth. Refit the model for the number of days absent using this variable as the offset. How does this change your answers to part (d)? Note: For the rest of the problem revert to thinking of the number of absences as all corresponding to one-month intervals as in parts (a)-(d).

(f) In this part of the problem we evaluate the issue of over-dispersion several different ways. First suggest some plausible reasons why there might be over-dispersion in this data set and then check for it in the following ways:

(1) Using basic descriptive statistics, obtain the mean and variance of the nhookey variable. (i) What do these statistics suggest about over-dispersion and (ii) why might this assessment be too crude to evaluate the presence of over-dispersion in the model overall?

(2) Evaluate the goodness of fit of the model by calculating the Pearson chi-squared goodness of fit statistic and approximating the dispersion factor. Explain what each of these checks tells you.

(3) Obtain the Pearson residuals for the model and plot them as a function of the fitted values. What does this plot suggest about over-dispersion? Do there appear to be any outliers? Briefly justify your answer.

(g) Rerun the model using a negative binomial regression. Does this model confirm your conclusions about over-dispersion from part (f)? Explain briefly.

(h) Give brief interpretations of the coefficients from the negative binomial regression. Are your conclusions (as to magnitude, direction and significance) about the effects of the predictors any different from those in part (d)?

In the next part of the problem we consider the issue of zero inflation.

(i) Explain intuitively why we might expect zero-inflation in the context of this problem. Of our three predictors, which would you expect to be most associated with a "certain zero"? Explain briefly.

(j) Using the estimated rate of unexcused absences from your model in part (a), calculate the expected number of 0's we would see in a sample of 252 students if absences have a Poisson distribution. (Note: You may find it helpful to remember that for a Poisson random variable the probability of k events is P(Y = k) = µ^k e^(-µ)/k!; a worked version of this calculation is sketched after the problem.) How many 0's did we actually observe? What does this suggest about whether we have a zero-inflation issue? (Note: This is a fairly crude approximation because we haven't accounted for the covariates, but it is rather suggestive. You can of course take this approach further and break it down by gender, age bin and likeschool categories.)

(k) Fit a zero-inflated Poisson model to the data using age, gender and likeschool as the predictors for both the Poisson component and the zero-inflation component of the model. Does this model seem to be a significant improvement over a standard Poisson regression? Explain briefly.

(l) Give brief interpretations of the coefficients for each component of the zero-inflated Poisson model. Which variables seem to be significant in each part of the model? Explain as carefully as you can what this model suggests about how these variables affect school absences.

(m) Now fit a zero-inflated negative binomial model using the same model set-up as in part (k). Is this model superior to the zero-inflated Poisson model of part (k)? Is it superior to the plain negative binomial model of part (g)? Explain briefly in each instance. What does that tell you about the issues of over-dispersion and zero-inflation? Do any of your conceptual conclusions from part (l) change with this model?
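To make the zero-count calculation in part (j) concrete: under a Poisson model with estimated mean rate µ per student per month, P(Y = 0) = e^(-µ), so the expected number of zeros among n students is n × e^(-µ). For example, with a purely hypothetical value µ = 1.5 (not taken from the data; use the exponentiated intercept from your own part (a) fit), this would give 252 × e^(-1.5) ≈ 56 expected zeros, to be compared with the observed number of students with no absences.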

(8) Statistics Makes My Teeth Chatter: This problem is based on a dental research data set, courtesy of my colleague Dr. Rob Weiss. The outcome variable is the distance (in millimeters) from the center of the pituitary gland to the pterygomaxillary fissure (you can check out Wikipedia to get a picture of where this is in the skull in relation to the jaw). The idea is to see how this distance changes developmentally in children and in particular whether there are any differences between boys and girls. Developmental trajectories are one of the most common targets of longitudinal mixed models. This data set includes 11 girls and 16 boys, each measured every two years at ages 8, 10, 12 and 14. The data are included in both long form (dentalid, distance, sex, year and yearbysex, with one row for each time point) and in wide form (dentalid2, distance at each age, sex2; one row for each child) in the HW5 data set. Note that sex = 1 for girls and 0 for boys. I have used sex and year here instead of gender and age to differentiate the variables from those in Problem 7.

Note: Before you start, if you are using STATA, use the xtset command (details in the instruction section) to tell STATA you are going to be working with panel data with subjects indexed by the variable dentalid and time indexed by the variable year, which is given in two-year intervals. The command is

xtset dentalid year, delta(2)

If you want to switch to working with a different data set you would then need to use the command xtset, clear.

(a) A profile or spaghetti plot shows the trajectory for each subject (observed points connected by lines). I have provided such a plot in the accompanying graphics file, color-coded so you can distinguish girls from boys since we are interested in gender differences. Describe any overall patterns you see in the plot as well as any perceived differences between boys and girls. (Instructions for creating the plot are given in the command section but you do not need to reproduce it.)

(b) Calculate the correlations between the timepoints (i.e. distance at age 8 with distance at age 10, etc.) separately for boys and girls. (This will be easiest to do using the wide form of the data.) Do you see any interesting patterns? What does this suggest about whether we need to use a random effects/repeated measures model?

(c) I also obtained the mean distances at each age separately for boys and girls and used them to create a scatterplot of distance versus age as a function of group (see the graphics file). This is a convenient way of condensing the information from the profile plot in part (a) and is sometimes referred to as an empirical summary plot (instructions are at the end of the assignment for those who are interested). Does there appear to be an overall difference in distance between boys and girls? Does there appear to be a difference in slope (distance vs year) between the groups?

(d) Fit a model to these data using distance as the outcome and year as the predictor, using the mixed model formulation with a random intercept covariance structure. (Specify the MLE fitting procedure so that you can use likelihood ratio tests for some of the follow-up questions.) Is the model overall significant? Explain briefly what this tells you.

(e) Now rerun the model from part (d) adding sex and the yearbysex interaction as additional predictors. Is there any evidence of gender differences in the year (age) trajectories? Answer this question by performing a likelihood ratio test comparing the model from part (d) (with year only) to the new model which adds the two terms involving gender. Explain as carefully as you can what this tells you, incorporating the direction and magnitude of the various regression coefficients into your explanation.

(f) Using your model from part (e), obtain the estimated mean distance for each group at each age and plot the results. Compare this plot to the plot from part (c) and explain any differences that you see. Can you say that one plot is more "right" than the other?

(g) Is there evidence that you needed to include the random effects in your model from part (e) instead of just using ordinary least squares regression? Explain briefly by using values from your part (e) printout. Is this consistent with your answer to part (b)? Optional: You can also do this by rerunning the model from part (e) using an independence covariance structure (i.e. equivalent to ordinary least squares!) and performing a likelihood ratio test. How does the within subject variance compare to the between subject variance? (Note: STATA doesn't do an independent covariance structure in MLE random effects, so if you are using STATA you need to fit an OLS regression and then use the command estat ic to get the log likelihood and information criteria to compare to the mixed model.)

(h) Optional: Now rerun the model from part (e) including a random slope as well as a random intercept. Does this appear to improve the model? Explain briefly.

(i) Optional: There are mean structures more general than the group by continuous age set-up we have used above. Similarly, there are covariance structures more general than the random intercept/random slope specification we used in part (h). Describe (i) the most general mean structure and (ii) at least two alternate covariance structures and explain why they might or might not be appropriate here. You do not need to actually fit the models.

(j) Optional: Run the models you described in part (i). Do you think the more general mean structure represents a significant improvement over the mean structure from part (e)? (Use the random intercept/compound symmetric covariance structure for this comparison to keep things simple.) What do you think is the best covariance structure? (Use the mean structure from part (e) to keep things simple.) A good way to present your results is to tabulate the covariance structure, the number of parameters it involves, -2 times the log likelihood, and the AIC and BIC information criteria. You can use likelihood ratio chi-squared tests to compare each of your covariance models to (i) the independence model and (ii) the completely unstructured covariance model. Why can you not necessarily directly compare the other models using the likelihood ratio test?
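A minimal sketch of the STATA commands for parts (d), (e) and (h) of Problem 8, again taken from the command section at the end of the assignment (yearonly and full are labels I have made up for the stored fits, used here so the two gender terms in part (e) can be tested with lrtest):

xtset dentalid year, delta(2)
xtreg distance year, mle
estimates store yearonly
xtreg distance year sex yearbysex, mle
estimates store full
lrtest yearonly full
xtmixed distance year sex yearbysex || dentalid: year, covariance(independent) mle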

STATA and SAS Commands

For this assignment you need to be able to fit Poisson/negative binomial and repeated measures/mixed effects models. The necessary commands are given below. To save you time I have given pretty complete details for the commands for the turn-in problems.

Commands in STATA

(1) Poisson Regression: Poisson regression works pretty much the same way as other regression models. You use the poisson command followed by the names of your outcome and predictor variables. For example, for warm-up Problem 2, predicting the number of imperfections in computer chips as a function of manufacturing process, the command would be

poisson imperfections treatment

If you want to fit a Poisson model with an offset you need to be a little bit careful. STATA provides both an offset option and an exposure option. What I talked about as an offset in class (and what most people mean when they say offset) is what STATA calls exposure, namely the amount of time (or other units) over which events could take place. The important thing is to understand how STATA treats these two options. If you use offset then STATA includes the offset variable as a predictor in its raw units with a coefficient constrained to 1. This is not really what we usually want, since if we are thinking of the offset as the time or other per-unit component of the Poisson rate then our model is really log(µ/t) = Xβ or log(µ) = log(t) + Xβ, meaning we want the offset in the model on the log scale with a coefficient constrained to 1. Thus to use offset we have to first take the log of our units variable. STATA's exposure option includes the variable in the model on the log scale with the coefficient constrained to 1, which is usually what we want. If we use this option we don't have to transform the offset variable first. For example, in Warm-up Problem 3, where we are looking at arrests at British soccer matches, our offset variable is attendance; we would expect the number of arrests to be proportional to the size of the crowd. You could also think of this as exposure since each person there had a chance to get arrested. In that problem we are fitting a model with no predictors except the offset, so our two versions of the statements become

gen logatt = log(attendance)
poisson arrests, offset(logatt)
poisson arrests, exposure(attendance)

Many of the follow-up commands for the Poisson model are like those for other GLMs. For example, you can use predict varname to obtain the predicted counts for the values in your sample and store them in the new variable varname. This is the default for the predict command with the Poisson model. You can, of course, use other options to get other estimated quantities, but the poisson model doesn't have as many options as logit or probit. In particular, it doesn't automatically create the various residuals. If you want to get those automatically you can use the glm command instead of the poisson command (see below). Of course it is not hard to create the residuals manually. For example, suppose I want to create the Pearson residuals for a Poisson regression model. The Pearson residuals are the observed minus the predicted values divided by the standard deviation of the prediction. For a Poisson random variable the mean and the variance are equal, so the estimated standard deviation for the predicted mean should just be the square root of the predicted mean. For instance, for warm-up problem 2, suppose we have saved the predicted number of imperfections in the variable predimp. Then the residuals would be created as

gen impresids = (imperfections - predimp)/sqrt(predimp)

Various other post-estimation commands also work with poisson. For instance, you can store the results using estimates store mymodelname and then compare two nested models using the lrtest command to get a likelihood ratio chi-squared test. You can also use estat gof to obtain the deviance goodness of fit test and estat gof, pearson to get the Pearson chi-squared goodness of fit test.
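For instance, the comparison of the treatment-only model to the model that adds coating thickness in warm-up Problem 2(c) could be run roughly as follows (a sketch only; I am assuming the coating variable is named thickness in the data set, and m1 and m2 are made-up labels for the stored fits):

poisson imperfections treatment
estimates store m1
poisson imperfections treatment thickness
estimates store m2
lrtest m1 m2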

(2) Negative Binomial Regression: To fit a standard negative binomial regression in STATA you use the nbreg command followed by your outcome and predictor variables. For instance, for the fishing data of warm-up Problem 4 we would type

nbreg numfish camper children persons

The other options for nbreg are basically the same as those for Poisson regression so I won't repeat them here. The only other thing of importance to note is that STATA automatically generates the test comparing the negative binomial regression to a Poisson regression. Specifically, it gives a likelihood ratio chi-squared test for the over-dispersion parameter it calls alpha at the bottom of the printout. A small p-value means there was over-dispersion and the negative binomial model fits better.

(3) Zero-inflated Models: Both Poisson regression and negative binomial regression can be fit with a zero-inflation option. The commands zip and zinb replace the commands poisson and nbreg respectively. You also have to use an inflate option to specify which variables you want to use to predict the probability of certain zeros. These can be the same as or different from the variables used for the count part of the model. If you want to test whether there was, in fact, zero-inflation, you use the vuong option, which performs (surprise!) Vuong's test. If you also include the zip option in the negative binomial regression it will give you both the over-dispersion test and the zero-inflation test. The commands, applied to the fishing data of Warm-up Problem 4, are

zip numfish camper persons children, inflate(camper persons children) vuong
zinb numfish camper persons children, inflate(camper persons children) vuong zip

(4) The General GLM Command (Optional): In addition to having specific commands for logistic, poisson, and other generalized linear models, STATA has a glm command in which you can specify the distribution and link function as part of the regression statement. The basic syntax, using Poisson regression with a log link as an example, is

glm Y X1 X2 X3, family(poisson) link(log)

Note that by default STATA uses the link that is canonical or natural for the specified distribution. For Poisson regression this is the log link, so you don't in fact have to include the link statement. The advantage to using glm for Poisson regression is that it has more post-estimation and other options, including calculation of the residuals (don't ask me why they don't include these with the poisson command!). You can look at the help file on glm for additional details.
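For example, the fishing model of warm-up Problem 4 could be fit with glm and the residuals requested directly rather than built by hand (a sketch only; glmfit and glmpresid are variable names I have made up):

glm numfish persons children camper, family(poisson)
predict glmfit, mu
predict glmpresid, pearson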

(5) Review of Useful Old Commands:

(a) Data Summaries: The summarize command: To obtain means, standard deviations and the like for a given variable you type summarize followed by the variable name. You can obtain this summary information for subgroups of the data set using the bysort option, specifying as many grouping variables as you like. For example, for warm-up problem 6, to get the means and standard deviations of the blood pressure scores for each of the treatment groups at each timepoint you would type

bysort group time: summarize bp

Similarly, for turn-in problem 8(c), to get the mean distances for boys and girls at the 4 ages you would type

bysort sex year: summarize distance

To get the mean and standard deviation of the number of fish caught in Warm-up Problem 4, subsetted by the number of children in the party, I would type

bysort children: summarize numfish

The same command, using table instead of summarize, would allow you to see the frequency tables of numbers of fish caught by subgroup.

(b) t-tests: To perform both two-sample t-tests and paired t-tests in STATA you use the ttest command. If the data for the two groups are stored in two separate columns the default is a paired t-test. For instance, for warm-up problem 6(c), comparing blood pressure before and after treatment for the placebo group, you would type:

ttest bp0p = bp3p

If you want a two-sample t-test and the data are stored in a single column with the group label in a separate column you use the by() option. Note that you can also use the if option to restrict the test to a certain subset of the points. For example, in warm-up problem 6(b), if we want to compare the placebo and medication groups at baseline we would type:

ttest bp if time==0, by(group)

(c) Correlation: To find the pairwise correlations between any group of variables simply type corr followed by the variable names. As usual you can use the if option to restrict to subsets of the data. For example, in turn-in Problem 8, to get the correlations between distances at the four age points separately for boys and girls we would type

corr distance8 distance10 distance12 distance14 if sex2==1
corr distance8 distance10 distance12 distance14 if sex2==0

(d) Scatter Plots: To obtain a scatterplot of Y vs X you simply type

scatter Y X

The scatter plot can accept more than one Y variable for the same X variable and will color code the plot so you can tell which outcome is which. This would be useful, for instance, in warm-up Problem 3(b) where I ask you to compare the observed and predicted numbers of arrests as a function of the number of people attending the game. Assuming my predicted values have been stored in myarrests, the command would be

scatter arrests myarrests attendance

(e) Scatterplots of Bin Means: In Problem 7(b) I ask you to plot mean number of absences from school as a function of whether the child likes school and age. You can of course obtain the bin means as described in (a) above by typing

bysort likeschool: summarize nhookey

but then you have to enter the 5 means into a new column. An alternative is to use the egen command as follows:

bysort likeschool: egen likemeans = mean(nhookey)

This creates a new variable, likemeans, whose values are the average number of absences for people with the specified rating of liking school. If you then type

scatter likemeans likeschool

you will get the desired plot. Age, of course, works the same way.

(f) Including Categorical Variables in a Model: To do this you can either create your own dummy variables or, as part of the model statement in later versions of STATA, you can attach the prefix i. to the variable name. For instance, for Problem 7(c), where you are asked to try treating likeschool as categorical, you would either need to create 4 dummy variables (since likeschool has 5 categories) or else tell STATA that likeschool was categorical. The resulting two versions of the commands would be

poisson nhookey age gender like2 like3 like4 like5
poisson nhookey age gender i.likeschool

I already created the dummy variables for you, but if you wanted to do it yourself one approach would be the following:

gen like2 = 0
replace like2 = 1 if likeschool==2

And similarly for the other categories.

(6) Mixed Models: The basic commands for mixed models in STATA are xtreg and xtmixed. The command xtreg does the basics while xtmixed allows more complex covariance structures. Before fitting one of these models you need to tell STATA that your data are in panel format, meaning that they have repeated measurements, and you need to tell it what variable identifies the units (subjects) and what variable indexes the repeats (usually a time variable). In later versions of STATA you can also use the delta option to specify what the spacing is between time points. For example, the command for setting up the data for warm-up problem 6 would be

xtset subj time, delta(3)

The timepoints are baseline and 3 months, so specifying delta(3) tells STATA to parameterize time in terms of months rather than 3-month blocks. For turn-in problem 8, since the ages are at two-year intervals, we would use

xtset dentalid year, delta(2)

If you have other variables in your data set for other models you may need to clear the panel settings. To do this you type

xtset, clear

If you are not sure of the current panel settings and want to check them, simply type xtset.

Once you have your data structure declared you can use the xtreg and xtmixed commands to run models using either the repeated measures or the mixed effects model framework. For the repeated measures framework you have to use the pa option, which stands for "population average" and says we are specifying a covariance pattern. You then specify the structure you want using the corr() option. The most common covariance pattern is compound symmetry, which STATA calls exchangeable. To run this set-up for the blood pressure data of warm-up problem 6 we would type

xtreg bp group time groupbytime, pa corr(exchangeable)

In early versions of STATA you also have to specify, as part of the options, what variable is serving as the unit identifier using i(). For this problem we would have

xtreg bp group time groupbytime, pa corr(exchangeable) i(subj)

Of course there are other covariance options which you can substitute for exchangeable. These include unstructured (which is the most general) and ar # (which is the autoregressive structure that has the correlation decrease as the spacing between the timepoints gets larger). For the ar model in older versions of STATA you have to tell it what variable gives the time units using the option t(timevar). For example, for the optional bonus part in turn-in problem 8 you would type:

xtreg distance sex year, pa corr(ar 1) i(dentalid) t(year)

To run a model using the random effects formulation you use the option mle for maximum likelihood random effects. The basic random intercept model requires no additional options. For the data of warm-up problem 6 we would simply type

xtreg bp group time groupbytime, mle

For turn-in problems 8(d) and 8(e) our commands would be

xtreg distance year, mle
xtreg distance year sex yearbysex, mle

In this case the i(subj) option doesn't seem to be necessary in the early versions of the package, though you can still use it. With the random intercept model STATA gives you estimates of the standard deviation for the random intercept (sigma_u), the standard deviation of the residual errors (sigma_e) and the correlation between timepoints (rho). It also provides a test of whether the random intercept was needed by checking whether its variance is greater than 0. To get more complex random effects models, for instance with random slopes, you need to use xtmixed. The basic syntax, using warm-up problem 6(g) as an example, is

xtmixed bp group time groupbytime || subj: time, covariance(independent) mle

Note that I used the mle option so that I could use a likelihood ratio test to compare this model to the model with just a random intercept. For turn-in Problem 8(h) the corresponding code would be

xtmixed distance year sex yearbysex || dentalid: year, covariance(independent) mle

The || tells STATA you are getting ready to specify the random effects part of the model. It is followed by your unit variable (here subj for the blood pressure data or dentalid for the dental data) and a colon and then any of the predictor variables for which you want random effects. The covariance() option is used to specify whether there is a correlation between the random effects. I have specified independent to indicate no such correlation, which is usually fine unless the model is very complicated. As noted above, using the mle option allows you to use likelihood ratio tests to compare this to other models.

(7) Profile (Spaghetti) and Empirical Summary Plots (Optional): In repeated measures/random effects models it is often useful to be able to plot the trajectory for each subject/unit in the model (the so-called profile or spaghetti plot). Here are two approaches. STATA has a built-in function xtline which produces the correct basic format. The syntax for turn-in problem 8(a) would be

xtline distance, i(dentalid) t(year) overlay legend(off)

Basically you specify your outcome variable, the subject id variable and the time variable and tell it to overlay the points on the lines. The legend(off) option just suppresses some extra printout. Actually you don't even need the i() and t() options if you've already used the xtset command to tell STATA you are using panel data (see above). The problem with this command is that it isn't easy to color code the lines by subgroups, which can be really helpful for detecting important effects like the gender differences that we are interested in for problem 8. To do this you can construct the plot manually instead. First we make sure that the data are sorted by the subject and time variables. For problem 8(a) that would be

sort dentalid year

You can also do the sorting from within the data editor. Next we need to make a new variable containing the y value of the NEXT observation from that subject:

gen nextdistance = distance[_n+1] if dentalid==dentalid[_n+1]

The if dentalid==dentalid[_n+1] code ensures that we won't tell STATA the y value following the last observation from a subject is equal to the first y value from the next subject! Similarly we create a new variable containing the next x value:

gen nextyear = year[_n+1] if dentalid==dentalid[_n+1]

Look at the data in the data editor to make sure you understand what these commands do. Now we can use the twoway plot type named pcspike to connect adjacent pairs of (y,x) observations:

twoway pcspike distance year nextdistance nextyear

To use separate colors for boys and girls we add some additional options:

twoway (pcspike distance year nextdistance nextyear if sex==1, lcolor(pink)) (pcspike distance year nextdistance nextyear if sex==0, lcolor(blue))

For Problem 8(c) we also need to be able to plot the group means for each group at each age. We could take the easy (but inelegant!) route and simply use the summarize command for the means subgrouped by age and sex and then manually enter them as new columns and use the two plots as above. Alternatively we can use the following more sophisticated code:

bysort sex year: egen avgdistance = mean(distance)
twoway (connected avgdistance year if dentalid==1, mcolor(pink) lcolor(pink)) (connected avgdistance year if dentalid==12, mcolor(blue) lcolor(blue))

Note that we have picked ids for one subject of each gender; it doesn't matter which. Basically we are replotting the means at each time for each subject, which is overkill but works nicely.

Commands in SAS

(1) Summaries By Group: Descriptive statistics in SAS are obtained using proc univariate, proc freq and proc means. You specify the procedure, the data set and the variables you want to summarize using the var statement. If you want to split the summaries by group you use a class statement to specify the grouping variable. As an example I use the fishing data from warm-up problem 4. The first set of commands below gives detailed summary statistics for the number of fish caught, the number of people and the number of children. If I wanted to subgroup the number of fish caught by whether or not the people were camping I would add the class statement as shown in the second set of commands. I could also use proc means, specifying which summary statistics I want. The third set of commands below gives a table with the mean, standard deviation, minimum and maximum for the number of fish split by the levels of the camper variable. You can include as many variables as you like in the var statement. The final set of commands gives the frequencies for each of the observed values for the camper, persons and children variables using proc freq. This procedure can be expanded to give contingency tables by typing var1*var2 in the tables statement, and you can add the option /chisq to get the various contingency table tests as well.

proc univariate data = tmp1.hw4;
var numfish persons children;

proc univariate data = tmp1.hw4;
class camper;
var numfish;
histogram numfish / midpoints = 0 to 50 by 1;

proc means data = tmp1.hw4 mean std min max var;
class camper;
var numfish;

proc freq data = tmp1.hw4;
tables camper children persons camper*children/chisq;

(2) Including Categorical Variables In A Model: Categorical variables are something SAS has historically handled more easily than STATA. In basically any regression model, from OLS to a zero-inflated negative binomial model, all you have to do is add a line with a class command and list the variables you want treated as categorical. If the variables are multi-category, SAS will give tests corresponding to whether the group of dummy variables representing the categories significantly improves the model (i.e. a partial F test in OLS and a likelihood ratio chi-squared test in a GLM). The only tricky issue is what SAS specifies as the reference category. In the sections below I describe some of the referencing options. This was also discussed in the commands for HW4.

(3) Poisson Regression: Poisson regression in SAS is generally fit using proc genmod, which is shorthand for generalized linear model. In general for proc genmod you have to specify the distribution and link function. If you do not specify the distribution, normality is assumed. If you do not specify the link, SAS defaults to the canonical link. For the Poisson distribution this is the log link, so you can skip that part. Like other regression models in SAS, proc genmod allows a class statement which you can use to tell it which variables are categorical. You can add an option descending which tells it that you want to use the lowest value (i.e. 0) as the reference value. Otherwise it defaults to using the highest value as the reference. The model statement has the same form as in standard regression, namely Y = X1 X2 ... After it you can add the option type3, which will make SAS automatically present global tests for each categorical variable. The model statement is also where you specify the form of the GLM using dist and link options. Below is the syntax we would use to run a Poisson model for the fishing data of warm-up problem 4. The first basic version treats all the variables as continuous (for an indicator variable taking on only the values 0 and 1 it doesn't matter whether you indicate it is categorical or not). Note that I didn't really need the link option. Also note that SAS automatically gives you the deviance and Pearson goodness of fit statistics along with the log likelihood.

proc genmod data = tmp1.hw5;
model numfish = camper persons children/dist = poisson link = log;

The second version, below, treats the camper and children variables as categorical for illustrative purposes. Note the descending option. The default tests for the option type3 are likelihood ratio chi-squared tests. If you add the option wald after type3 you get Wald tests instead.

proc genmod data = tmp1.hw5;
class camper children/ descending;
model numfish = camper persons children/type3 dist = poisson link = log;

There are many other options you can use. For example, the option obstats used after the model statement prints out a table of things like fitted values and residuals for the points in the data set. Beware of doing this with a large data set as you will get MANY pages of printout. The option offset = variable is used to include an offset term in the model. Note that like STATA, SAS uses the variable on the raw scale in the model, so to get what we were calling the offset or exposure variable in class you need to take the natural log first. I illustrate these commands below for the soccer arrest data from warm-up problem 3 using the new logattendance variable I created:

proc genmod data = tmp1.hw5;
model arrests = / dist = poisson link = log offset = logattendance obstats;

(4) Negative Binomial Regression: This works basically exactly the same way as Poisson regression except that for the distribution we use negbin. The various other options work the same way. The basic commands for the fishing data would look like

proc genmod data = tmp1.hw5;
model numfish = camper persons children/dist = negbin;

(5) Zero Inflated Models: To get zero-inflated versions of the Poisson or negative binomial models you can still use proc genmod but you have to add a few extra options. For example, for the zero-inflated Poisson you specify the distribution as zip instead of poisson and you have to tell it what model to fit for the zeros. The syntax for the fishing example of warm-up problem 4 is shown below using all three variables in both parts of the model, treated as continuous. Since the log link is the default I have left that part of the statement off, but you are free to specify it. If you wanted to you could add a class statement with the usual options to treat some of the variables as categorical, and you could change which variables were involved in which part of the model.

proc genmod data = tmp1.hw5;
model numfish = camper persons children/dist = zip;
zeromodel camper persons children/link = logit;

The set-up for negative binomial models is similar except our distribution is the zero-inflated negative binomial, zinb. The basic code for the fishing data is:

proc genmod data = tmp1.hw5;
model numfish = camper persons children/dist = zinb;
zeromodel camper persons children/link = logit;

The SAS zero-inflated negative binomial procedure does give a Wald confidence interval for the dispersion, but neither the zip nor zinb models currently have the Vuong test for the zero-inflation built in. There is a macro which you can download from the web if you feel like making the effort. However, significance of the coefficients in the zeromodel is usually a sufficient indication that the model is an improvement. There is also a special procedure, proc countreg, which is more specific to count data and has some extra options. There are several fitting methods. The one that matches the proc genmod procedure is called quasi-Newton. You can get this with the method = qn option in the proc statement. The basic code for the Poisson model is as follows:

proc countreg data = tmp1.hw5 method = qn;
model numfish = camper persons children/dist = zip;
zeromodel numfish ~ camper persons children;

You can even fit these sorts of models using another procedure, proc nlmixed (which stands for non-linear mixed models); this is not necessary here but is useful in some circumstances.

(6) Repeated Measures/Mixed Models: SAS actually has a more flexible mixed models platform in many ways than STATA. The basic command is proc mixed. As usual, you need to specify your data set, use the class statement to specify what variables are categorical and give a model statement. The only really new thing is that we need either a repeated or random statement to deal with the correlations between measurements within subjects. Naturally, the repeated statement is used if you want to directly specify the correlation structure among the measurements, while the random statement is natural if you are thinking in a random effects framework. SAS expects data for such models to be in what is called long form, which means that you have one row per observation per subject (so if we had 4 measurements per subject as in turn-in problem 8 we would have 4 rows of data for each subject, one for each measurement point). Note also that SAS defaults to REML (restricted maximum likelihood estimation). You can change the fitting method to maximum likelihood (or other things) by specifying method = ML in the proc mixed statement.


More information

Investigating Models with Two or Three Categories

Investigating Models with Two or Three Categories Ronald H. Heck and Lynn N. Tabata 1 Investigating Models with Two or Three Categories For the past few weeks we have been working with discriminant analysis. Let s now see what the same sort of model might

More information

LDA Midterm Due: 02/21/2005

LDA Midterm Due: 02/21/2005 LDA.665 Midterm Due: //5 Question : The randomized intervention trial is designed to answer the scientific questions: whether social network method is effective in retaining drug users in treatment programs,

More information

Multilevel/Mixed Models and Longitudinal Analysis Using Stata

Multilevel/Mixed Models and Longitudinal Analysis Using Stata Multilevel/Mixed Models and Longitudinal Analysis Using Stata Isaac J. Washburn PhD Research Associate Oregon Social Learning Center Summer Workshop Series July 2010 Longitudinal Analysis 1 Longitudinal

More information

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878 Contingency Tables I. Definition & Examples. A) Contingency tables are tables where we are looking at two (or more - but we won t cover three or more way tables, it s way too complicated) factors, each

More information

Multinomial Logistic Regression Models

Multinomial Logistic Regression Models Stat 544, Lecture 19 1 Multinomial Logistic Regression Models Polytomous responses. Logistic regression can be extended to handle responses that are polytomous, i.e. taking r>2 categories. (Note: The word

More information

Modelling Rates. Mark Lunt. Arthritis Research UK Epidemiology Unit University of Manchester

Modelling Rates. Mark Lunt. Arthritis Research UK Epidemiology Unit University of Manchester Modelling Rates Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 05/12/2017 Modelling Rates Can model prevalence (proportion) with logistic regression Cannot model incidence in

More information

Poisson regression: Further topics

Poisson regression: Further topics Poisson regression: Further topics April 21 Overdispersion One of the defining characteristics of Poisson regression is its lack of a scale parameter: E(Y ) = Var(Y ), and no parameter is available to

More information

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion Biostokastikum Overdispersion is not uncommon in practice. In fact, some would maintain that overdispersion is the norm in practice and nominal dispersion the exception McCullagh and Nelder (1989) Overdispersion

More information

(c) Interpret the estimated effect of temperature on the odds of thermal distress.

(c) Interpret the estimated effect of temperature on the odds of thermal distress. STA 4504/5503 Sample questions for exam 2 1. For the 23 space shuttle flights that occurred before the Challenger mission in 1986, Table 1 shows the temperature ( F) at the time of the flight and whether

More information

Generalised linear models. Response variable can take a number of different formats

Generalised linear models. Response variable can take a number of different formats Generalised linear models Response variable can take a number of different formats Structure Limitations of linear models and GLM theory GLM for count data GLM for presence \ absence data GLM for proportion

More information

Time-Invariant Predictors in Longitudinal Models

Time-Invariant Predictors in Longitudinal Models Time-Invariant Predictors in Longitudinal Models Today s Class (or 3): Summary of steps in building unconditional models for time What happens to missing predictors Effects of time-invariant predictors

More information

Generalized Models: Part 1

Generalized Models: Part 1 Generalized Models: Part 1 Topics: Introduction to generalized models Introduction to maximum likelihood estimation Models for binary outcomes Models for proportion outcomes Models for categorical outcomes

More information

Generalized linear models

Generalized linear models Generalized linear models Douglas Bates November 01, 2010 Contents 1 Definition 1 2 Links 2 3 Estimating parameters 5 4 Example 6 5 Model building 8 6 Conclusions 8 7 Summary 9 1 Generalized Linear Models

More information

Time-Invariant Predictors in Longitudinal Models

Time-Invariant Predictors in Longitudinal Models Time-Invariant Predictors in Longitudinal Models Today s Topics: What happens to missing predictors Effects of time-invariant predictors Fixed vs. systematically varying vs. random effects Model building

More information

Case-control studies C&H 16

Case-control studies C&H 16 Case-control studies C&H 6 Bendix Carstensen Steno Diabetes Center & Department of Biostatistics, University of Copenhagen bxc@steno.dk http://bendixcarstensen.com PhD-course in Epidemiology, Department

More information

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014 LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R Liang (Sally) Shan Nov. 4, 2014 L Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers

More information

Chapter 5: Logistic Regression-I

Chapter 5: Logistic Regression-I : Logistic Regression-I Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu] D. Bandyopadhyay

More information

Model Assumptions; Predicting Heterogeneity of Variance

Model Assumptions; Predicting Heterogeneity of Variance Model Assumptions; Predicting Heterogeneity of Variance Today s topics: Model assumptions Normality Constant variance Predicting heterogeneity of variance CLP 945: Lecture 6 1 Checking for Violations of

More information

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements:

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements: Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis Anna E. Barón, Keith E. Muller, Sarah M. Kreidler, and Deborah H. Glueck Acknowledgements: The project was supported

More information

Stat 587: Key points and formulae Week 15

Stat 587: Key points and formulae Week 15 Odds ratios to compare two proportions: Difference, p 1 p 2, has issues when applied to many populations Vit. C: P[cold Placebo] = 0.82, P[cold Vit. C] = 0.74, Estimated diff. is 8% What if a year or place

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2

Multilevel Models in Matrix Form. Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Multilevel Models in Matrix Form Lecture 7 July 27, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Today s Lecture Linear models from a matrix perspective An example of how to do

More information

Section Poisson Regression

Section Poisson Regression Section 14.13 Poisson Regression Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Data Analysis II 1 / 26 Poisson regression Regular regression data {(x i, Y i )} n i=1,

More information

An Introduction to Multilevel Models. PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012

An Introduction to Multilevel Models. PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012 An Introduction to Multilevel Models PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 25: December 7, 2012 Today s Class Concepts in Longitudinal Modeling Between-Person vs. +Within-Person

More information

12 Generalized linear models

12 Generalized linear models 12 Generalized linear models In this chapter, we combine regression models with other parametric probability models like the binomial and Poisson distributions. Binary responses In many situations, we

More information

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1 Parametric Modelling of Over-dispersed Count Data Part III / MMath (Applied Statistics) 1 Introduction Poisson regression is the de facto approach for handling count data What happens then when Poisson

More information

Interactions among Continuous Predictors

Interactions among Continuous Predictors Interactions among Continuous Predictors Today s Class: Simple main effects within two-way interactions Conquering TEST/ESTIMATE/LINCOM statements Regions of significance Three-way interactions (and beyond

More information

Generalized Multilevel Models for Non-Normal Outcomes

Generalized Multilevel Models for Non-Normal Outcomes Generalized Multilevel Models for Non-Normal Outcomes Topics: 3 parts of a generalized (multilevel) model Models for binary, proportion, and categorical outcomes Complications for generalized multilevel

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

A (Brief) Introduction to Crossed Random Effects Models for Repeated Measures Data

A (Brief) Introduction to Crossed Random Effects Models for Repeated Measures Data A (Brief) Introduction to Crossed Random Effects Models for Repeated Measures Data Today s Class: Review of concepts in multivariate data Introduction to random intercepts Crossed random effects models

More information

Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 2017, Chicago, Illinois

Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 2017, Chicago, Illinois Longitudinal Data Analysis Using Stata Paul D. Allison, Ph.D. Upcoming Seminar: May 18-19, 217, Chicago, Illinois Outline 1. Opportunities and challenges of panel data. a. Data requirements b. Control

More information

Generalized Linear Models

Generalized Linear Models York SPIDA John Fox Notes Generalized Linear Models Copyright 2010 by John Fox Generalized Linear Models 1 1. Topics I The structure of generalized linear models I Poisson and other generalized linear

More information

Introduction to SAS proc mixed

Introduction to SAS proc mixed Faculty of Health Sciences Introduction to SAS proc mixed Analysis of repeated measurements, 2017 Julie Forman Department of Biostatistics, University of Copenhagen 2 / 28 Preparing data for analysis The

More information

Monday 7 th Febraury 2005

Monday 7 th Febraury 2005 Monday 7 th Febraury 2 Analysis of Pigs data Data: Body weights of 48 pigs at 9 successive follow-up visits. This is an equally spaced data. It is always a good habit to reshape the data, so we can easily

More information

Contrasting Marginal and Mixed Effects Models Recall: two approaches to handling dependence in Generalized Linear Models:

Contrasting Marginal and Mixed Effects Models Recall: two approaches to handling dependence in Generalized Linear Models: Contrasting Marginal and Mixed Effects Models Recall: two approaches to handling dependence in Generalized Linear Models: Marginal models: based on the consequences of dependence on estimating model parameters.

More information

1 A Review of Correlation and Regression

1 A Review of Correlation and Regression 1 A Review of Correlation and Regression SW, Chapter 12 Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

Time-Invariant Predictors in Longitudinal Models

Time-Invariant Predictors in Longitudinal Models Time-Invariant Predictors in Longitudinal Models Topics: What happens to missing predictors Effects of time-invariant predictors Fixed vs. systematically varying vs. random effects Model building strategies

More information

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011) Ron Heck, Fall 2011 1 EDEP 768E: Seminar in Multilevel Modeling rev. January 3, 2012 (see footnote) Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October

More information

Mohammed. Research in Pharmacoepidemiology National School of Pharmacy, University of Otago

Mohammed. Research in Pharmacoepidemiology National School of Pharmacy, University of Otago Mohammed Research in Pharmacoepidemiology (RIPE) @ National School of Pharmacy, University of Otago What is zero inflation? Suppose you want to study hippos and the effect of habitat variables on their

More information

Introduction to Generalized Models

Introduction to Generalized Models Introduction to Generalized Models Today s topics: The big picture of generalized models Review of maximum likelihood estimation Models for binary outcomes Models for proportion outcomes Models for categorical

More information

Introduction to SAS proc mixed

Introduction to SAS proc mixed Faculty of Health Sciences Introduction to SAS proc mixed Analysis of repeated measurements, 2017 Julie Forman Department of Biostatistics, University of Copenhagen Outline Data in wide and long format

More information

Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 2017, Boston, Massachusetts

Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 2017, Boston, Massachusetts Longitudinal Data Analysis Using SAS Paul D. Allison, Ph.D. Upcoming Seminar: October 13-14, 217, Boston, Massachusetts Outline 1. Opportunities and challenges of panel data. a. Data requirements b. Control

More information

Simple, Marginal, and Interaction Effects in General Linear Models

Simple, Marginal, and Interaction Effects in General Linear Models Simple, Marginal, and Interaction Effects in General Linear Models PRE 905: Multivariate Analysis Lecture 3 Today s Class Centering and Coding Predictors Interpreting Parameters in the Model for the Means

More information

SAS Code for Data Manipulation: SPSS Code for Data Manipulation: STATA Code for Data Manipulation: Psyc 945 Example 1 page 1

SAS Code for Data Manipulation: SPSS Code for Data Manipulation: STATA Code for Data Manipulation: Psyc 945 Example 1 page 1 Psyc 945 Example page Example : Unconditional Models for Change in Number Match 3 Response Time (complete data, syntax, and output available for SAS, SPSS, and STATA electronically) These data come from

More information

Designing Multilevel Models Using SPSS 11.5 Mixed Model. John Painter, Ph.D.

Designing Multilevel Models Using SPSS 11.5 Mixed Model. John Painter, Ph.D. Designing Multilevel Models Using SPSS 11.5 Mixed Model John Painter, Ph.D. Jordan Institute for Families School of Social Work University of North Carolina at Chapel Hill 1 Creating Multilevel Models

More information

Generalized linear models

Generalized linear models Generalized linear models Outline for today What is a generalized linear model Linear predictors and link functions Example: estimate a proportion Analysis of deviance Example: fit dose- response data

More information

Longitudinal Data Analysis of Health Outcomes

Longitudinal Data Analysis of Health Outcomes Longitudinal Data Analysis of Health Outcomes Longitudinal Data Analysis Workshop Running Example: Days 2 and 3 University of Georgia: Institute for Interdisciplinary Research in Education and Human Development

More information

Basic Regressions and Panel Data in Stata

Basic Regressions and Panel Data in Stata Developing Trade Consultants Policy Research Capacity Building Basic Regressions and Panel Data in Stata Ben Shepherd Principal, Developing Trade Consultants 1 Basic regressions } Stata s regress command

More information

A course in statistical modelling. session 09: Modelling count variables

A course in statistical modelling. session 09: Modelling count variables A Course in Statistical Modelling SEED PGR methodology training December 08, 2015: 12 2pm session 09: Modelling count variables Graeme.Hutcheson@manchester.ac.uk blackboard: RSCH80000 SEED PGR Research

More information

1 Introduction to Minitab

1 Introduction to Minitab 1 Introduction to Minitab Minitab is a statistical analysis software package. The software is freely available to all students and is downloadable through the Technology Tab at my.calpoly.edu. When you

More information

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI

Module 03 Lecture 14 Inferential Statistics ANOVA and TOI Introduction of Data Analytics Prof. Nandan Sudarsanam and Prof. B Ravindran Department of Management Studies and Department of Computer Science and Engineering Indian Institute of Technology, Madras Module

More information

STAT 7030: Categorical Data Analysis

STAT 7030: Categorical Data Analysis STAT 7030: Categorical Data Analysis 5. Logistic Regression Peng Zeng Department of Mathematics and Statistics Auburn University Fall 2012 Peng Zeng (Auburn University) STAT 7030 Lecture Notes Fall 2012

More information

More Accurately Analyze Complex Relationships

More Accurately Analyze Complex Relationships SPSS Advanced Statistics 17.0 Specifications More Accurately Analyze Complex Relationships Make your analysis more accurate and reach more dependable conclusions with statistics designed to fit the inherent

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population

More information

Review of CLDP 944: Multilevel Models for Longitudinal Data

Review of CLDP 944: Multilevel Models for Longitudinal Data Review of CLDP 944: Multilevel Models for Longitudinal Data Topics: Review of general MLM concepts and terminology Model comparisons and significance testing Fixed and random effects of time Significance

More information

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses ST3241 Categorical Data Analysis I Multicategory Logit Models Logit Models For Nominal Responses 1 Models For Nominal Responses Y is nominal with J categories. Let {π 1,, π J } denote the response probabilities

More information

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math.

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math. Regression, part II I. What does it all mean? A) Notice that so far all we ve done is math. 1) One can calculate the Least Squares Regression Line for anything, regardless of any assumptions. 2) But, if

More information

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs)

36-309/749 Experimental Design for Behavioral and Social Sciences. Dec 1, 2015 Lecture 11: Mixed Models (HLMs) 36-309/749 Experimental Design for Behavioral and Social Sciences Dec 1, 2015 Lecture 11: Mixed Models (HLMs) Independent Errors Assumption An error is the deviation of an individual observed outcome (DV)

More information

Introducing Generalized Linear Models: Logistic Regression

Introducing Generalized Linear Models: Logistic Regression Ron Heck, Summer 2012 Seminars 1 Multilevel Regression Models and Their Applications Seminar Introducing Generalized Linear Models: Logistic Regression The generalized linear model (GLM) represents and

More information

Sections 4.1, 4.2, 4.3

Sections 4.1, 4.2, 4.3 Sections 4.1, 4.2, 4.3 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1/ 32 Chapter 4: Introduction to Generalized Linear Models Generalized linear

More information

Fitting a Straight Line to Data

Fitting a Straight Line to Data Fitting a Straight Line to Data Thanks for your patience. Finally we ll take a shot at real data! The data set in question is baryonic Tully-Fisher data from http://astroweb.cwru.edu/sparc/btfr Lelli2016a.mrt,

More information

Analysing categorical data using logit models

Analysing categorical data using logit models Analysing categorical data using logit models Graeme Hutcheson, University of Manchester The lecture notes, exercises and data sets associated with this course are available for download from: www.research-training.net/manchester

More information

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds Chapter 6 Logistic Regression In logistic regression, there is a categorical response variables, often coded 1=Yes and 0=No. Many important phenomena fit this framework. The patient survives the operation,

More information

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response)

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response) Model Based Statistics in Biology. Part V. The Generalized Linear Model. Logistic Regression ( - Response) ReCap. Part I (Chapters 1,2,3,4), Part II (Ch 5, 6, 7) ReCap Part III (Ch 9, 10, 11), Part IV

More information

BIOS 625 Fall 2015 Homework Set 3 Solutions

BIOS 625 Fall 2015 Homework Set 3 Solutions BIOS 65 Fall 015 Homework Set 3 Solutions 1. Agresti.0 Table.1 is from an early study on the death penalty in Florida. Analyze these data and show that Simpson s Paradox occurs. Death Penalty Victim's

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

1 The basics of panel data

1 The basics of panel data Introductory Applied Econometrics EEP/IAS 118 Spring 2015 Related materials: Steven Buck Notes to accompany fixed effects material 4-16-14 ˆ Wooldridge 5e, Ch. 1.3: The Structure of Economic Data ˆ Wooldridge

More information

Package HGLMMM for Hierarchical Generalized Linear Models

Package HGLMMM for Hierarchical Generalized Linear Models Package HGLMMM for Hierarchical Generalized Linear Models Marek Molas Emmanuel Lesaffre Erasmus MC Erasmus Universiteit - Rotterdam The Netherlands ERASMUSMC - Biostatistics 20-04-2010 1 / 52 Outline General

More information

Soc 63993, Homework #7 Answer Key: Nonlinear effects/ Intro to path analysis

Soc 63993, Homework #7 Answer Key: Nonlinear effects/ Intro to path analysis Soc 63993, Homework #7 Answer Key: Nonlinear effects/ Intro to path analysis Richard Williams, University of Notre Dame, https://www3.nd.edu/~rwilliam/ Last revised February 20, 2015 Problem 1. The files

More information

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis Lecture 6: Logistic Regression Analysis Christopher S. Hollenbeak, PhD Jane R. Schubart, PhD The Outcomes Research Toolbox Review Homework 2 Overview Logistic regression model conceptually Logistic regression

More information

An Introduction to Path Analysis

An Introduction to Path Analysis An Introduction to Path Analysis PRE 905: Multivariate Analysis Lecture 10: April 15, 2014 PRE 905: Lecture 10 Path Analysis Today s Lecture Path analysis starting with multivariate regression then arriving

More information

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels.

Contingency Tables. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. Contingency Tables Definition & Examples. Contingency tables are used when we want to looking at two (or more) factors. Each factor might have two more or levels. (Using more than two factors gets complicated,

More information

Comparing IRT with Other Models

Comparing IRT with Other Models Comparing IRT with Other Models Lecture #14 ICPSR Item Response Theory Workshop Lecture #14: 1of 45 Lecture Overview The final set of slides will describe a parallel between IRT and another commonly used

More information

Recent Developments in Multilevel Modeling

Recent Developments in Multilevel Modeling Recent Developments in Multilevel Modeling Roberto G. Gutierrez Director of Statistics StataCorp LP 2007 North American Stata Users Group Meeting, Boston R. Gutierrez (StataCorp) Multilevel Modeling August

More information

Introduction Fitting logistic regression models Results. Logistic Regression. Patrick Breheny. March 29

Introduction Fitting logistic regression models Results. Logistic Regression. Patrick Breheny. March 29 Logistic Regression March 29 Introduction Binary outcomes are quite common in medicine and public health: alive/dead, diseased/healthy, infected/not infected, case/control Assuming that these outcomes

More information