
Running head: QUASI-NONCONVERGENCE LOGISTIC REGRESSION

Handling Quasi-Nonconvergence in Logistic Regression: Technical Details and an Applied Example

Jeffrey M. Miller, Northcentral University
M. David Miller, University of Florida

Author Note
Jeffrey M. Miller, Ph.D., College of Education, Northcentral University. M. David Miller, Ph.D., Research & Evaluation Methodology, University of Florida. Correspondence regarding this article should be addressed to Jeffrey M. Miller, 4117 SW 20th Ave., #73, Gainesville, FL. Contact: jeffmiller.research@gmail.com

Abstract
Nonconvergence is a concern for any iterative data analysis process. However, there are instances in which convergence is obtained for the overall solution but not for a specific estimate. In most software packages, this problem is not easy to notice unless the researcher has a priori knowledge of reasonable solutions. Hence, faulty inferences can be disguised by an apparently successful estimation procedure; we term this situation quasi-nonconvergence. This type of nonconvergence occurs in logistic regression models when the data are quasi-completely separated, that is, when prediction is completely or nearly completely perfect. Firth (1993) presented a penalized likelihood correction that was extended by Heinze and Ploner (2003) to solve the quasi-nonconvergence problem. This procedure was applied to educational research data to demonstrate its success in eliminating the problem.
Keywords: quasi-nonconvergence, nonconvergence, logistic regression

Handling Quasi-Nonconvergence in Logistic Regression: Technical Details and an Applied Example
Many researchers have experienced nonconvergence errors in which, for one reason or another, a maximum likelihood solution cannot be calculated or does not exist. Conditions for nonconvergence include sparseness of data, multiple maxima, unspecified boundary constraints, and data separation. Logistic regression analyses are especially prone to data separation issues resulting in complete or quasi-complete separation. This article provides calculations for ML estimates in logistic regression, a solution to the data separation problem, and an example using real data.
Logistic regression is often used to describe and/or predict a binary outcome given a set of covariates. The technique is widely used in research on topics including dropping out of school (Suh, Suh, & Houston, 2007), retention and graduation (Wohlgemuth, Whalen, Sullivan, Nading, Mack, & Wang, 2007), and skipping classes (Kimberly, 2007).
In the parlance of generalized linear modeling (McCullagh & Nelder, 1989; Nelder & Wedderburn, 1972), the logistic function has a random, a systematic, and a linking component. The random component specifies information about the response variable. In this case, given n randomly selected, identically distributed, and independent trials, the random component is specified as Y ~ B(n, π) (Agresti, 1996). This is to say that outcome Y has a binomial (B) distribution, with the parameter n representing the number of trials and the parameter π representing the probability that Y equals one. As explained by McCullagh and Nelder (1989), the probability mass function for k successes can be calculated by hand when

provided the total number of trials (n), the number of successes (k), and the probability of a success (p):

f(k; n, p) = \frac{n!}{k!(n-k)!}\, p^{k} (1-p)^{n-k}.   (1)

The systematic component for the logistic regression model contains the P covariates that are specified as predictors of the probability that Y = 1 through their estimators. The link function specifies how the random component should be related to the systematic component. For typical regression analyses (i.e., ordinary least squares regression), we equate the mean of the response variable to the predictor variables and thereby assume the identity link. However, if we do this with a binary response variable, then we can feasibly obtain a regression prediction equation that permits predicted values less than zero or greater than one (Hair, Black, Babin, Anderson, & Tatham, 2006; Agresti, 1996). This is undesirable since we are modeling a probability that is bounded between zero and one. Further, ordinary least squares regression models assume normally distributed residuals, which is rarely the case for binary response variables. The typical (i.e., canonical) link for logistic regression is the logit, which is the natural log of the odds, where the odds equal the probability divided by one minus the probability:

g(\pi) = \log\!\left(\frac{\pi}{1-\pi}\right).   (2)

Combining the random component, systematic component, and link function yields the logistic generalized linear model

\text{logit}(\pi) = \log\!\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_P X_P.   (3)
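To make Equations (1) through (3) concrete, the short R sketch below evaluates the binomial probability mass function by hand, checks it against R's built-in dbinom(), and defines the logit link and its inverse. The values of n, k, and p are illustrative assumptions, not quantities from the article.

# Binomial probability mass function of Equation (1), computed by hand
# (n, k, and p are illustrative values only)
n <- 10; k <- 7; p <- 0.6
pmf_by_hand <- factorial(n) / (factorial(k) * factorial(n - k)) * p^k * (1 - p)^(n - k)
all.equal(pmf_by_hand, dbinom(k, size = n, prob = p))  # TRUE

# Logit link of Equation (2) and its inverse, used later for predicted probabilities
logit     <- function(pi)  log(pi / (1 - pi))
inv_logit <- function(eta) 1 / (1 + exp(-eta))
inv_logit(logit(0.25))  # recovers 0.25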

Since natural logs can be reversed through exponentiation, and since odds can be converted to probabilities, the fitted equation can be used to predict probabilities via

\pi_i = \frac{\exp\!\left(\sum_{p=1}^{P} x_{ip}\beta_p\right)}{1 + \exp\!\left(\sum_{p=1}^{P} x_{ip}\beta_p\right)},   (4)

or, equivalently,

\Pr(y = 1 \mid \mathbf{x}) = \{1 + \exp(-\mathbf{x}'\boldsymbol{\beta})\}^{-1} = \frac{\exp(\beta_1 X_1 + \dots + \beta_P X_P)}{1 + \exp(\beta_1 X_1 + \dots + \beta_P X_P)}.   (5)

Obtaining the required P + 1 estimates of β requires maximum likelihood estimation. An iterative process is used to find the values of the coefficients that are most likely (i.e., that maximize the likelihood) given the data. This is done by iteratively solving the likelihood function

L(\boldsymbol{\beta}; \mathbf{y}) = \prod_{i=1}^{N} \frac{n_i!}{y_i!(n_i - y_i)!}\, \pi_i^{y_i} (1 - \pi_i)^{n_i - y_i}.   (6)

First and second derivatives of this function are required in order to determine the maximizing estimates. The calculations are simplified after applying algebraic manipulations to the function. First, we rewrite the term with the subtracted exponent, (n_i − y_i), as a quotient of powers. Second, we return to the generalized linear model for logistic regression, exponentiate both sides of the equation, and solve for π. The result of exponentiation can be substituted into the left-hand side of the likelihood equation, and the solution for π can be substituted into the right-hand side. This simplification contains terms that are powers of powers; thus, the equation can be simplified further, since a number raised to a power that is itself raised to a power is equal to that number raised to the product of the powers. Finally, further simplification by applying the

natural logarithm yields a log-likelihood function that readily permits calculation of first and second derivatives:

l(\boldsymbol{\beta}) = \sum_{i=1}^{N}\left[\, y_i \left(\sum_{p=1}^{P} x_{ip}\beta_p\right) - n_i \log\!\left(1 + \exp\!\left(\sum_{p=1}^{P} x_{ip}\beta_p\right)\right)\right].   (7)

Then, differentiating the linear predictor with respect to each coefficient,

\frac{\partial}{\partial \beta_p} \sum_{p=1}^{P} x_{ip}\beta_p = x_{ip},   (8)

which leads to P + 1 equations to be solved for the β_p by setting them to zero:

\frac{\partial l(\boldsymbol{\beta})}{\partial \beta_p} = \sum_{i=1}^{N}\left( y_i x_{ip} - n_i \pi_i x_{ip} \right).   (9)

This first derivative of the log-likelihood with respect to the parameters is also known as the score function, U, or the gradient of the log-likelihood. Next, differentiating once more,

\frac{\partial^2 l(\boldsymbol{\beta})}{\partial \beta_p\, \partial \beta_{p'}} = \frac{\partial}{\partial \beta_{p'}} \sum_{i=1}^{N}\left( y_i x_{ip} - n_i \pi_i x_{ip} \right)   (10)

leads to another set of equations in β:

\frac{\partial^2 l(\boldsymbol{\beta})}{\partial \beta_p\, \partial \beta_{p'}} = -\sum_{i=1}^{N} n_i x_{ip}\, \pi_i (1 - \pi_i)\, x_{ip'}.   (11)

During the iterative procedure, a maximum is achieved if and when the matrix of second derivatives is negative definite. Note that the equation for the second derivative includes the term π_i(1 − π_i), which is also the variance of a binomial variable; hence, it is not a coincidence that the inverse of the negative matrix of second derivatives serves as the covariance matrix for the maximum likelihood estimates.
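Before turning to the practical difficulties, the following R sketch carries out the Newton-Raphson iteration implied by Equations (9) and (11) for a single-predictor model with binary (n_i = 1) outcomes and checks the result against R's glm(). The simulated data, starting values, and iteration limit are purely illustrative assumptions.

set.seed(1)                                   # illustrative simulated data
x <- cbind(1, rnorm(200))                     # design matrix with an intercept column
beta_true <- c(-0.5, 1.2)
y <- rbinom(200, size = 1, prob = 1 / (1 + exp(-x %*% beta_true)))

beta <- c(0, 0)                               # starting values
for (iter in 1:25) {
  eta <- as.vector(x %*% beta)
  pr  <- 1 / (1 + exp(-eta))                  # fitted probabilities, Equation (4)
  U   <- t(x) %*% (y - pr)                    # score vector, Equation (9)
  H   <- -t(x) %*% (x * (pr * (1 - pr)))      # matrix of second derivatives, Equation (11)
  step <- solve(-H, U)                        # Newton-Raphson update
  beta <- beta + as.vector(step)
  if (max(abs(step)) < 1e-8) break            # stop once the updates are negligible
}
cbind(newton = beta, glm = coef(glm(y ~ x[, 2], family = binomial)))  # should agree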

Iteratively solving the equations is a tedious process. Ferron and Hess (2007) provide an excellent didactic example of both maximum likelihood estimation and the iterative procedure for structural equation models. Succinctly, starting values for the estimates are declared, and succeeding iterations are based on the first two terms of a Taylor expansion, leading to subsequent estimates that are, ideally, closer and closer to the maximizing solution.
A problem can arise when maximizing estimates for data that suffer from complete or quasi-complete separation (Albert & Anderson, 1984). This is to say that the probability of y = 1 or y = 0 is nearly perfectly predictable from a predictor or set of predictors (Webb, Wilson, & Chong, 2004). An extreme example of the separation issue occurs when a 2 × 2 analysis has one cell containing all 0s or all 1s (Heinze & Ploner, 2003). The problem has been addressed in detail in the fields of medicine (Heinze & Ploner, 2003) and biometrics (Firth, 1993). However, no research was found that addressed the issue using educational data.
Convergence for logistic regression models is affected by the configuration of the observed values in the sample space (So, 1999; Santner & Duffy, 1986; Albert & Anderson, 1984). This is simple to conceptualize by imagining a scatterplot relating y = the probability of being in elementary school to x = subject age (ranging from 5 to 35). There would obviously be two distinct clusters of observations; this is the configuration for this space. Suppose that the resulting maximum likelihood estimate for the age coefficient perfectly predicts being enrolled or not being enrolled in elementary school. In fact, given perfect predictability, the maximized log-likelihood is zero because the fit is perfect. This is complete separation, and it would be apparent in the data after sorting

by age. In a scatterplot, a line could be drawn between the two clusters; no member of one cluster would appear either on the line or within the other cluster.
A more realistic scenario is one that involves borderline observations. Due to month-of-birth differences, there is presumably an age at which only some observations will be classified as enrolled in elementary school. The scatterplot would still neatly divide the clusters; however, some observations would fall on the line itself. In this case, a maximum likelihood solution is reported and the maximized log-likelihood is usually very close to zero. However, the dispersion matrix will usually take on unrealistically large or small values. In other words, without prior knowledge regarding reasonable estimates of the coefficients and variances, the apparently successful convergence may create the illusion of accurate results, leading to incorrect inferences and interpretations; we term this situation quasi-nonconvergence.
If the researcher has an idea of what the coefficients and/or variances should be, then the problem can be identified, since the reported estimates may be absurd. Variances tend to be extremely inflated (Webb, Wilson, & Chong, 2004), and odds ratios are often infinite (Heinze, 2006); for example, SAS may report an odds ratio estimate only as an extreme "greater than" bound. Program logs and/or output sometimes provide a clue, although there is much variability in how the clue may be presented (Zorn, 2005). SAS reports, "Quasi-complete separation of data points detected. Warning: The maximum likelihood estimate may not exist. Warning: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration. Validity of the model fit is questionable." SPSS reports, "Estimation terminated at iteration because maximum iterations have been reached. Final solution cannot be found."
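These symptoms are easiest to recognize after provoking them deliberately. The minimal R sketch below uses an assumed toy data set mirroring the school-age example above, in which age 10 occurs with both outcomes (quasi-complete separation); fitting it by ordinary maximum likelihood with glm() typically produces an extreme coefficient, an enormous standard error, and the R warning discussed next.

# Illustrative data assumed for demonstration: everyone aged 10 or younger is
# enrolled except one borderline child, so age 10 occurs with both outcomes.
sep_data <- data.frame(
  age  = c(5, 6, 7, 8, 9, 10, 10, 11, 12, 15, 20, 25, 30, 35),
  elem = c(1, 1, 1, 1, 1,  1,  0,  0,  0,  0,  0,  0,  0,  0)
)

fit_ml <- glm(elem ~ age, family = binomial, data = sep_data)
# Typically warns: "fitted probabilities numerically 0 or 1 occurred"
summary(fit_ml)  # extreme age coefficient with an enormous standard error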

The R program reports, "fitted probabilities numerically 0 or 1 occurred in:", which is not to say that the observed responses for a variable were all 0s or 1s. Stata actually eliminates variables under complete separation in order to produce a solution, and it fails to provide any estimates under quasi-complete separation.
Researchers encountering complete or quasi-complete separation tend to take arbitrary steps to eliminate the problem. The simplest solution is to delete the predictor that is responsible for the separation. This assumes that the problem is due to a particular predictor and not a linear combination (Heinze, 2006). Some researchers insert artificial data to eliminate separation. Another possibility is to use exact logistic regression as proposed by Cox (1970); however, this procedure leads to degenerate estimates when some or all of the predictors are continuous (Heinze & Schemper, 2002). Zorn (2005) presents many examples of published research that attempted to resolve separation issues using such procedures; the procedures have been reiterated by others (Heinze, 2006; Heinze & Schemper, 2002).
A more appropriate procedure, penalized maximum likelihood, is more recent. The procedure is based on research by Firth (1993) that was not originally concerned with separation, and it was applied to resolving separation issues in logistic regression by Heinze and Ploner (2003). It has long been known that maximum likelihood logistic regression estimates are biased; Firth (1993) refers to findings of 3.4% asymptotic bias away from the true value (Copas, 1988). Hence, Firth (1993) proceeded to construct a penalized likelihood correction. This correction is a shrinkage quantity intended to remove the bias. The correction, known in the Bayesian literature as the Jeffreys invariant prior (Zorn, 2005), adds one-half of the natural logarithm of the determinant of the information matrix to the log-likelihood

while concurrently adjusting the score function. As the log-likelihood approaches zero over the iterations, the adjustment serves to counteract the bias. The resulting penalized likelihood, log-likelihood, and score equations are displayed below:

L^{*}(\boldsymbol{\beta}) = L(\boldsymbol{\beta})\,\lvert I(\boldsymbol{\beta})\rvert^{1/2},   (12)

\ln L^{*}(\boldsymbol{\beta}) = \ln L(\boldsymbol{\beta}) + 0.5 \ln \lvert I(\boldsymbol{\beta})\rvert,   (13)

U^{*}(\beta_p) = U(\beta_p) + 0.5\, \mathrm{tr}\!\left[ I(\boldsymbol{\beta})^{-1}\, \frac{\partial I(\boldsymbol{\beta})}{\partial \beta_p} \right] = 0.   (14)

Heinze and Ploner (2003) found that this correction also eliminates what they term the nonconvergence bug due to separation issues. They noted that the correction will always provide a finite solution in the presence of separation. Heinze and Schemper (2002) found 97.5% coverage for the confidence intervals as well as high power. Zorn (2005) noted that "because they are shrunken towards zero, penalized-likelihood estimates will typically be smaller in absolute value than standard MLEs, though their standard errors will also be reduced, yielding similar inferences about the significance of parameter estimates for those parameters whose MLE is finite" (p. 160).
This is not to say that penalized maximum likelihood is a panacea for separation issues. Caution is advised when constructing confidence intervals for the odds ratio. The penalized likelihood can be asymmetric around the estimate, leading to inappropriate coverage for traditional Wald intervals. In this case, the analyst should construct profile likelihood intervals for the penalized estimates (Heinze, 2006; Heinze & Ploner, 2003; Zorn, 2005). Another caution is not to ignore other modeling problems; for example, penalized likelihood does not resolve issues of multicollinearity (Heinze & Schemper, 2002).
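As one illustration of the penalized approach, the sketch below refits the quasi-separated toy data from the earlier sketch using the logistf package, Heinze and Ploner's R implementation of the Firth correction. The comment about default profile penalized likelihood limits is an assumption about the package defaults and should be checked against its documentation.

# install.packages("logistf")  # Heinze & Ploner's penalized-likelihood package
library(logistf)

sep_data <- data.frame(  # same illustrative quasi-separated data as before
  age  = c(5, 6, 7, 8, 9, 10, 10, 11, 12, 15, 20, 25, 30, 35),
  elem = c(1, 1, 1, 1, 1,  1,  0,  0,  0,  0,  0,  0,  0,  0)
)

fit_firth <- logistf(elem ~ age, data = sep_data)
summary(fit_firth)    # finite coefficient and standard error; the reported
                      # limits are (by default) profile penalized-likelihood limits
exp(coef(fit_firth))  # odds ratio on a finite, interpretable scale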

The primary difference between Firth's correction and those of others (Rubin & Schenker, 1987; Clogg, Rubin, Schenker, Schultz, & Weidman, 1991) is that the other procedures are not iterative and hence do not apply the correction over the course of the maximization. Heinze and Ploner (2004) wrote SAS, S-PLUS, and R macros to implement penalized maximum likelihood estimation; the code is freely available online. The code also produces figures to help determine whether Wald confidence intervals or profile-based intervals are more appropriate.

Applied Example

The data for this example were extracted from the Internet Access in U.S. Public Schools survey (NCES, 2005), collected in the Fall of 2005. The survey was developed to measure the extent of Internet usage and aspects related to Internet usage, such as the type of connection, control of content access, and integration into the curriculum. Responses were obtained from 1,104 school technology coordinators (or the staff members most knowledgeable about school Internet access). For the purposes of this example, listwise deletion was used to remove all observations with missing data on one or more of the variables used as predictors or responses, resulting in an analyzed sample size of 822. Further listwise deletions were made in order to produce quasi-complete separation, reducing the sample size to 814.
The binary response variable was item Q7DA: Students with Disabilities (0 = No Access to Internet, 1 = Access to Internet). There were three binary predictors: NSLEVEL: school level (0 = Secondary, 1 = Elementary); Q9AB: What technologies or other procedures does your school use to prevent student access to inappropriate material

on the Internet? Intranet (0 = No, 1 = Yes); and Q10: Does your school allow students to access instructional computers with Internet access at times other than regular school hours? (0 = No, 1 = Yes). In addition, there were three continuous predictors: Q4A: the number of computers in the school that currently have Internet access; PCTMIN: percent minority enrollment; and FLEP: percent of students eligible for free or reduced-price school lunch.
The adjusted variable was NSLEVEL, such that the probability of a student having Internet access was perfectly predictable for elementary schools but not for secondary schools. In other words, for elementary schools, all students with disabilities had Internet access; for secondary schools, not all students with disabilities had Internet access. If prediction had also been perfect for secondary schools, the condition would have been complete, rather than quasi-complete, separation.
These data were analyzed with logistic regression using SAS PROC LOGISTIC. Table 1 displays the partial SAS output containing the warning of quasi-complete separation. However, a maximum likelihood solution does appear to exist. Further, the maximum likelihood estimates appear to be reasonable until one inspects the standard errors and notices the value of 100.8 for the school-level (NSLEVEL) variable, which is also the standard error for the intercept. These estimates are displayed in Table 2. The problem is transparent when inspecting the maximum likelihood estimates for the odds ratios. The school-level variable has a reported odds ratio estimate of <0.001, with a 95% Wald confidence interval running from <0.001 to an extremely large upper limit. Given the warning and the unreasonable odds ratio estimate, we conclude quasi-nonconvergence for this solution. These results are displayed in Table 3.
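For readers working in R rather than SAS, the hedged sketch below shows how the same model could be specified; nces is a hypothetical data frame assumed to hold the extracted NCES variables under the names used above, and the logistf() refit parallels the penalized SAS reanalysis reported next. The published results were produced with SAS PROC LOGISTIC and the Heinze-Ploner SAS macro, so this sketch is illustrative only.

# Hedged sketch: 'nces' is an assumed data frame containing the extracted NCES
# variables under the names used in the text.
library(logistf)

fit_ml <- glm(Q7DA ~ NSLEVEL + Q9AB + Q10 + Q4A + PCTMIN + FLEP,
              family = binomial, data = nces)
summary(fit_ml)     # expect the quasi-separation symptoms summarized in Tables 2-3

fit_firth <- logistf(Q7DA ~ NSLEVEL + Q9AB + Q10 + Q4A + PCTMIN + FLEP,
                     data = nces)
summary(fit_firth)  # finite, interpretable estimates, paralleling Tables 4-5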

The data were analyzed again using the penalized log-likelihood macro in SAS. This time there were no nonconvergence warnings. As seen in Table 4, the standard errors for the intercept and the school-level (NSLEVEL) variable are more reasonable. Table 5 suggests that, for school level, the odds of students with disabilities having Internet access are substantially higher for elementary schools than for secondary schools, a considerably more reasonable estimate than the previously reported odds ratio of <.001. It is also interesting to note that maximum likelihood produced a p-value of 0.929 for school level, which strongly suggests a lack of support for this variable as a predictor, whereas the penalized maximum likelihood results indicate an extremely low p-value for this variable. This tells us that quasi-nonconvergence should not be ignored.

Conclusion

The primary purpose of this article is to inform applied researchers about a technique for handling a common problem that occurs when analyzing binary response variables. Logistic regression was reviewed from the perspective of generalized linear models, followed by an extensive treatment of maximum likelihood estimation. Not all logistic regression models truly converge. Quasi-nonconvergence can arise due to complete or quasi-complete separation of the data in the configuration space. Much previous research has addressed the problem by eliminating the culprit variables, transforming the data, or even inserting artificial data. We also suspect that some research has not addressed the problem at all, since statistical software does not provide a clear indication of the problem or of how to handle it.

Firth (1993) proposed penalized likelihood as a mechanism for correcting the bias inherent in logistic regression. Heinze and Ploner (2003) extended penalized likelihood estimation to resolve nonconvergence in logistic regression due to separation issues. Their research demonstrated adequate coverage and power for odds ratio estimates that always have a finite solution.
It is hoped that researchers will become more aware of the potential for quasi-nonconvergence in logistic regression, which may otherwise go unnoticed. A converging maximum likelihood estimation does not guarantee a converging parameter estimate, and the consequences of reporting such results may be incorrect or even nonsensical inferences. Researchers are encouraged to obtain the macro for penalized maximum likelihood and to apply it when faced with complete or quasi-complete data separation.

References

Agresti, A. (1996). An introduction to categorical data analysis. New York: John Wiley and Sons.

Albert, A., & Anderson, J. A. (1984). On the existence of maximum likelihood estimates in logistic regression models. Biometrika, 71.

Clogg, C. C., Rubin, D. B., Schenker, N., Schultz, B., & Weidman, L. (1991). Multiple imputation of industry and occupation codes in census public-use samples using Bayesian logistic regression. Journal of the American Statistical Association, 86.

Copas, J. B. (1988). Binary regression models for contaminated data (with discussion). Journal of the Royal Statistical Society: Series B, 50.

Cox, D. (1970). Analysis of binary data. New York: Wiley.

Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80.

Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). Multivariate data analysis (6th ed.). Upper Saddle River, NJ: Pearson Education.

Heinze, G. (2006). A comparative investigation of methods for logistic regression with separated or nearly separated data. Statistics in Medicine, 25.

Heinze, G., & Ploner, M. (2004). A SAS macro, S-PLUS library, and R package to perform logistic regression without convergence problems (Technical Report 2/2004). Vienna: Medical University of Vienna, Department of Computer Sciences, Section of Clinical Biometrics.

Heinze, G., & Ploner, M. (2003). Fixing the nonconvergence bug in logistic regression with SPLUS and SAS. Computer Methods and Programs in Biomedicine, 71.

Heinze, G., & Schemper, M. (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine, 21.

Kimberly, L. (2007). Who's skipping school: Characteristics of truants in 8th and 10th grade. Journal of School Health, 77.

McCullagh, P., & Nelder, J. A. (1989). Generalized linear models. London: Chapman and Hall.

Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A, 135.

Rubin, D. B., & Schenker, N. (1987). Logit-based interval estimation for binomial data using the Jeffreys prior. In C. C. Clogg (Ed.), Sociological Methodology 1987. Washington, DC: American Sociological Association.

Santner, T. J., & Duffy, E. D. (1986). A note on A. Albert and J. A. Anderson's conditions for the existence of maximum likelihood estimates in logistic regression models. Biometrika, 73.

So, Y. (1999). A tutorial on logistic regression (SAS Technical Report). Cary, NC.

Suh, S., Suh, J., & Houston, I. (2007). Predictors of categorical at-risk high school dropouts. Journal of Counseling & Development, 85.

Webb, M. C., Wilson, J. R., & Chong, J. (2004). An analysis of quasi-complete binary data with logistic models: Applications to alcohol abuse data. Journal of Data Science, 2.

Wohlgemuth, D., Whalen, D., Sullivan, J., Nading, C., Mack, S., & Wang, Y. (2007). Financial, academic, and environmental influences on the retention and graduation of students. Journal of College Student Retention: Research, Theory & Practice, 8.

Zorn, C. (2005). A solution to separation in binary response models. Political Analysis, 13.

Table 1
SAS Output Displaying Warning of Quasi-Complete Separation

The LOGISTIC Procedure
Model Convergence Status
Quasi-complete separation of data points detected.
Warning: The maximum likelihood estimate may not exist.
Warning: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood iteration.
Warning: The validity of the model fit is questionable.

Table 2
SAS Output Displaying ML Logit Estimates

Parameter    DF    Estimate    Standard Error    Chi-Square    Pr > ChiSq
Intercept
NSLEVEL
Q9AB
Q10
Q4A
PCTMIN
FLEP

Table 3
SAS Output Displaying ML Odds Ratio Estimates

Effect            Point Estimate    95% Wald Confidence Limits
NSLEVEL 0 vs. 1   <0.001            <0.001
Q9AB 0 vs. 1
Q10 0 vs. 1
Q4A
PCTMIN
FLEP

Table 4
SAS Output Displaying Penalized ML Logit Estimates

FL estimates and Wald confidence limits and tests
NOTE: Confidence interval for Intercept based on Wald method.

Variable    Parameter Estimate    Standard Error    Lower 95% c.l.    Upper 95% c.l.    Pr > Chi-Square
INTERCEP                                                                                <.0001
NSLEVEL
Q9AB
Q4A
PCTMIN
FLEP

Table 5
SAS Output Displaying Penalized ML Odds Ratio Estimates

Variable    Odds Estimate    Lower 95% c.l.    Upper 95% c.l.    Pr > Chi-Square
NSLEVEL
Q9AB
Q4A
PCTMIN
FLEP
