LISA Short Course Series: Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R
Liang (Sally) Shan
Nov. 4, 2014
Laboratory for Interdisciplinary Statistical Analysis
LISA helps VT researchers benefit from the use of Statistics.
Collaboration: Visit our website to request personalized statistical advice and assistance with:
  Designing Experiments
  Analyzing Data
  Interpreting Results
  Grant Proposals
  Software (R, SAS, JMP, Minitab...)
LISA statistical collaborators aim to explain concepts in ways useful for your research.
Great advice right now: Meet with LISA before collecting your data.
LISA also offers:
  Educational Short Courses: Designed to help graduate students apply statistics in their research.
  Walk-In Consulting: Available Monday-Friday from 1-3 PM in the Old Security Building (OSB) for questions <30 mins. See our website for additional times and locations.
All services are FREE for VT researchers. We assist with research, not class projects or homework.
www.lisa.stat.vt.edu
Outline
1. What is CDA?
2. Contingency Table
3. Measures of Association
4. Test of Independence & Test of Symmetry
5. What is GLM? When should we use it?
6. How does a GLM work?
7. Logistic Regression
8. Poisson Regression
What is Categorical Data Analysis (CDA)?
Dependent Variable     Independent Variables   Model
Continuous (Normal)    Continuous              Ordinary Linear Regression
Continuous (Normal)    Categorical             ANOVA
Continuous (Normal)    Mixed                   ANCOVA
Categorical            Categorical             CDA
Contingency Table
A contingency table is a rectangular table having I rows for categories of X and J columns for categories of Y. The cells of the table represent the I × J possible outcomes.
Contingency Table: Example 1: Heart attack vs. aspirin use
The data are from a report on the relationship between aspirin use and heart attacks by the Physicians' Health Study Research Group at Harvard Medical School. They form a 2 × 3 contingency table.
Contingency Table: Generating a Contingency Table in R
Input the 2 × 3 table in R as a 2 × 3 matrix.
Change the matrix to a table using the function as.table(), because some functions work better with tables than with matrices.
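A minimal sketch of these two steps. The counts are those of Example 2 (the Celexa study) from this course, used as stand-in data for a 2 × 3 table; the object names `counts` and `tab` are illustrative.

```r
# Step 1: enter the 2 x 3 table as a matrix (filled row by row).
counts <- matrix(c(2, 3, 7,
                   2, 8, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Treatment = c("Celexa", "Placebo"),
                                 Outcome   = c("Worse", "Same", "Better")))

# Step 2: convert the matrix to a table object.
tab <- as.table(counts)

print(tab)
print(addmargins(tab))  # adds row and column totals
```

Naming the dimensions through `dimnames` is optional, but it makes later output easier to read.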
Measures of Association
Continuous variables: Pearson correlation coefficient
Ordinal variables: Spearman rank correlation coefficient
Nominal variables: Phi coefficient and Cramer's V
Measures of Association: Phi Coefficient
The phi coefficient (φ) measures the association between two binary variables.
Its value ranges from -1 to +1, where +1 indicates perfect positive association, -1 indicates perfect negative association, and 0 indicates no association.
The square of the phi coefficient is related to the chi-squared statistic for a 2 × 2 contingency table: $\phi^2 = \chi^2 / n$
Measures of Association: Cramer's V
Cramer's V measures the association between two nominal variables. It varies from 0 (no association) to 1 (complete association) and can reach 1 only when the two variables are equal to each other.
Measures of Association in R
Comments:
1. When the two variables are binary, Cramer's V is the same as the absolute value of the phi coefficient.
2. In R, under library(psych), use the function phi() for the phi coefficient.
3. In R, under library(vcd), use the function assocstats() for Cramer's V.
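As a rough check of the definitions, both measures can also be computed in base R from the Pearson chi-squared statistic, without the psych or vcd packages. The sketch below uses the 2 × 2 job-approval counts of Example 3 and the 2 × 3 Celexa counts of Example 2 from this course. The helper `cramers_v` is a hypothetical name, and the formula V = sqrt(χ² / (n · min(I-1, J-1))) is the standard textbook definition, assumed here rather than taken from the slides.

```r
# Hypothetical helper: Cramer's V from the uncorrected Pearson chi-squared.
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

approval <- matrix(c(794, 150,
                      86, 570), nrow = 2, byrow = TRUE)
celexa <- matrix(c(2, 3, 7,
                   2, 8, 2), nrow = 2, byrow = TRUE)

# Phi for a 2 x 2 table: (n11*n22 - n12*n21) / sqrt(product of the margins).
phi <- (approval[1, 1] * approval[2, 2] - approval[1, 2] * approval[2, 1]) /
  sqrt(prod(rowSums(approval)) * prod(colSums(approval)))

chi2_approval <- unname(chisq.test(approval, correct = FALSE)$statistic)

print(phi)                            # signed phi coefficient
print(chi2_approval / sum(approval))  # equals phi^2
print(cramers_v(approval))            # equals |phi| for a 2 x 2 table
print(cramers_v(celexa))              # Cramer's V for the 2 x 3 table
```

`correct = FALSE` turns off the continuity correction that `chisq.test()` applies to 2 × 2 tables by default; with the correction on, the identity φ² = χ²/n would not hold exactly.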
Test of Independence
Large sample size: Chi-square Test
Small sample size: Fisher's Exact Test
Test of Independence (Chi-square Test)
          Column 1                   Column 2                   Total
Row 1     $\pi_{11} = n_{11}/n$      $\pi_{12} = n_{12}/n$      $\pi_{1+} = n_{1+}/n$
Row 2     $\pi_{21} = n_{21}/n$      $\pi_{22} = n_{22}/n$      $\pi_{2+} = n_{2+}/n$
Total     $\pi_{+1} = n_{+1}/n$      $\pi_{+2} = n_{+2}/n$      1
H0: Row and Column are independent: $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all i, j
Ha: Row and Column are not independent: $\pi_{ij} \neq \pi_{i+}\pi_{+j}$ for some i and j
Test of Independence (Chi-square Test)
Under H0: $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all i, j, the expected count in each cell is $\mu_{ij} = n\,\pi_{i+}\pi_{+j} = n_{i+} n_{+j} / n$.
Test of Independence (Fisher's Exact Test)
When any of the expected counts falls below 5, the chi-square test is not appropriate. Instead, we use Fisher's Exact Test.
Example 2: The following data are from a Stanford University study of the effectiveness of the antidepressant Celexa in the treatment of compulsive shopping.
                  Outcome
Treatment    Worse    Same    Better
Celexa       2        3       7
Placebo      2        8       2
Test of Independence in R
Chi-Square Test: use the R function chisq.test()
Fisher's Exact Test: use the R function fisher.test()
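Both functions can be tried on the Celexa counts from Example 2. This is a sketch, not the course's own script: the object names are illustrative, and the expected counts are computed by hand as (row total × column total) / n only to show why the chi-square approximation is doubtful for this table.

```r
celexa <- as.table(matrix(c(2, 3, 7,
                            2, 8, 2),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(Treatment = c("Celexa", "Placebo"),
                                          Outcome   = c("Worse", "Same", "Better"))))

# Expected counts under independence: (row total * column total) / n.
expected <- outer(rowSums(celexa), colSums(celexa)) / sum(celexa)
print(expected)

# Chi-square test; R warns here because some expected counts are below 5,
# so the warning is suppressed and the result is shown for comparison only.
chi_res <- suppressWarnings(chisq.test(celexa))
print(chi_res)

# Fisher's exact test, which also accepts tables larger than 2 x 2.
fisher_res <- fisher.test(celexa)
print(fisher_res)
```

The same expected counts are available afterwards as `chi_res$expected`.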
Test of Symmetry: Matched Pairs
Example 3: Suppose two surveys on the President's job approval were conducted one month apart on 1600 Americans, and the results are summarized in the following table. (Source: Agresti, 1990)
Is there a significant difference in job approval rating?
                     2nd Survey
1st Survey      Approve    Disapprove
Approve         794        150
Disapprove      86         570
Test of Symmetry: Matched Pairs
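One standard test of symmetry for a 2 × 2 matched-pairs table is McNemar's test, available in base R as `mcnemar.test()`. The slides do not name the test, so using McNemar's test here is an assumption; the counts are those of Example 3.

```r
approval <- matrix(c(794, 150,
                      86, 570),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(First  = c("Approve", "Disapprove"),
                                   Second = c("Approve", "Disapprove")))

# McNemar's test uses only the off-diagonal (discordant) counts.
# correct = FALSE gives the uncorrected statistic (n12 - n21)^2 / (n12 + n21).
mc_res <- mcnemar.test(approval, correct = FALSE)
print(mc_res)

# The same statistic computed by hand from the two discordant cells.
by_hand <- (150 - 86)^2 / (150 + 86)
print(by_hand)
```

By default `mcnemar.test()` applies a continuity correction, which gives a slightly smaller statistic than the hand calculation.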
What is GLM?
Wikipedia defines the generalized linear model (GLM) as a flexible generalization of ordinary linear regression that allows for response variables whose distribution is other than normal. (Source: http://en.wikipedia.org/wiki/generalized_linear_model)
General linear model vs. Generalized linear model
General linear model
  Special cases: ANOVA, ANCOVA, MANOVA, MANCOVA, linear regression, mixed model
  Function in R: lm()
  Typical estimation method: Least squares, best linear unbiased prediction
Generalized linear model
  Special cases: Linear regression, logistic regression, Poisson regression
  Function in R: glm()
  Typical estimation method: Maximum likelihood
(Source: http://en.wikipedia.org/wiki/comparison_of_general_and_generalized_linear_models)
Ordinary Linear Regression
Ordinary Linear Regression (OLR) investigates and models the linear relationship between independent variables and dependent variables that are continuous.
The simplest regression is simple linear regression, which models the linear relationship between a single independent variable and a single dependent variable.
Simple Linear Regression Model: $y = \beta_0 + \beta_1 x + \epsilon$
where y is the dependent variable, $\beta_0$ the intercept, $\beta_1$ the slope, x the independent variable, and $\epsilon$ the random error.
Assumptions in OLR & when the assumptions are violated
The assumptions are:
  The true relationship between x and y is linear.
  The errors are normally distributed with mean zero and unknown common variance $\sigma^2$.
  The errors are uncorrelated.
Possible approaches when the assumptions of a normally distributed dependent variable with constant variance are violated:
  Data transformations
  Weighted least squares
  Generalized linear model (GLM)
GLM Model
Generalized Linear Model: $g(\mu) = \beta_0 + \beta_1 x$
The function g is called the link function because it connects the mean $\mu$ and the linear predictor $\beta_0 + \beta_1 x$.
The dependent variable's distribution must come from the Exponential Family of Distributions, which includes the Normal, Bernoulli, Binomial, Poisson, Gamma, etc.
3 Components
  Random: Identifies the dependent variable Y and its probability distribution.
  Systematic: Independent variables in a linear predictor function.
  Link function: Invertible function g that links the mean of the dependent variable to the systematic component.
Random Component
Normal: continuous, symmetric, mean $\mu$ and variance $\sigma^2$
Bernoulli: 0 or 1, mean p and variance p(1-p); a special case of the Binomial
Poisson: non-negative integer (0, 1, 2, ...), mean $\lambda$ and variance $\lambda$; the number of events in a fixed time interval
Types of GLMs
Recall: The link function relates the mean of the dependent variable to the linear predictor.
Distribution of Dependent Variable   Link Function   Independent Variable   Model
Normal                               Identity        Continuous             Ordinary Linear Regression
Normal                               Identity        Categorical            Analysis of Variance
Normal                               Identity        Mixed                  Analysis of Covariance
Binomial                             Logit           Mixed                  Logistic Regression
Poisson                              Log             Mixed                  Poisson Regression
GLM and Ordinary linear regression
Ordinary linear regression is a special case of GLM. In OLR, the 3 components for GLM are:
  Random: the dependent variable is normally distributed with mean $\mu$ and variance $\sigma^2$
  Systematic: Independent variables in a linear predictor function $\beta_0 + \beta_1 x$
  Link function: Identity link $g(\mu) = \mu$
Therefore, the GLM model for ordinary linear regression is $E(Y) = \mu = \beta_0 + \beta_1 x$
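This equivalence can be checked numerically. The sketch below fits the same simulated data (made up for illustration, not course data) with `lm()` and with `glm()` using the Gaussian family and identity link; the coefficient estimates should agree.

```r
set.seed(1)                       # simulated data, for illustration only
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))

print(coef(fit_lm))
print(coef(fit_glm))              # same estimates as lm()
```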
Model Evaluation: Deviance
Deviance: measures how closely the predicted values from the fitted model match the actual values from the raw data.
Definition: Deviance = -2[log-likelihood(proposed model) - log-likelihood(saturated model)]
A saturated model is a model that fits the data perfectly, so its log-likelihood is the maximum. It has as many parameters as observations and hence provides no simplification at all.
Under the null hypothesis that the proposed model is adequate, the deviance asymptotically follows a chi-squared distribution. The degrees of freedom are n-p, where n is the number of observations and p is the number of model parameters.
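The definition can be verified directly for a Poisson model, where the saturated model sets each fitted mean equal to the observed count. This is a sketch on simulated counts (made up for illustration); the variable names are not from the course.

```r
set.seed(2)                       # simulated counts, for illustration only
x <- runif(40)
y <- rpois(40, lambda = exp(0.5 + 1.0 * x))

fit <- glm(y ~ x, family = poisson)

ll_model <- as.numeric(logLik(fit))
# Saturated model: each fitted mean equals the observed count.
ll_saturated <- sum(dpois(y, lambda = y, log = TRUE))

dev_by_hand <- -2 * (ll_model - ll_saturated)
print(dev_by_hand)
print(deviance(fit))              # matches dev_by_hand
print(df.residual(fit))           # n - p = 40 - 2
```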
Inference in GLM
Goodness of Fit test
  The null hypothesis is that the model is a good alternative to the saturated model.
  The deviance is the Likelihood Ratio Statistic (LRS) for this test.
Likelihood Ratio test
  Allows for the comparison of one model to another, nested model by looking at the difference in deviance of the two models.
  Null Hypothesis: the predictor variables in Model 1 that are not found in Model 2 are not significant to the model fit.
  Alternative Hypothesis: the predictor variables in Model 1 that are not found in Model 2 are significant to the model fit.
  Under the null hypothesis, the LRS asymptotically follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters.
  For nested models, the simpler model never has the smaller deviance.
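A sketch of the likelihood ratio test in R, on simulated data made up for illustration. Following the wording above, `model1` contains a predictor (`x2`) that `model2` does not; the names `x1`, `x2`, `lrs`, and `p_value` are illustrative.

```r
set.seed(3)                       # simulated data, for illustration only
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- rbinom(100, size = 1, prob = plogis(-0.5 + 1.0 * x1))

model2 <- glm(y ~ x1,      family = binomial)   # simpler model
model1 <- glm(y ~ x1 + x2, family = binomial)   # adds x2

# Likelihood ratio statistic: difference in deviance between nested models.
lrs <- deviance(model2) - deviance(model1)
df  <- df.residual(model2) - df.residual(model1)
p_value <- pchisq(lrs, df = df, lower.tail = FALSE)
print(c(LRS = lrs, df = df, p = p_value))

# anova() reports the same comparison.
lrt_table <- anova(model2, model1, test = "Chisq")
print(lrt_table)
```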
Model Comparison in GLM
Two additional measures for model comparison are:
Akaike Information Criterion (AIC)
  Penalizes a model for having many parameters.
  $AIC = -2\log L + 2p$, where L is the likelihood and p is the number of parameters in the model.
  The smaller the AIC, the better the model.
Bayesian Information Criterion (BIC)
  $BIC = -2\log L + \ln(n)\,p$, where p is the number of parameters in the model and n is the number of observations.
  Usually penalizes additional parameters more strongly than AIC.
  The smaller the BIC, the better the model.
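Both criteria are available in base R as `AIC()` and `BIC()`. The sketch below, on simulated counts made up for illustration, recomputes them from the log-likelihood to show that the formulas above are what R reports for a Poisson GLM.

```r
set.seed(4)                       # simulated counts, for illustration only
x <- runif(60)
y <- rpois(60, lambda = exp(0.3 + 0.8 * x))
fit <- glm(y ~ x, family = poisson)

ll <- as.numeric(logLik(fit))     # maximized log-likelihood
p  <- length(coef(fit))           # number of parameters
n  <- nobs(fit)                   # number of observations

aic_by_hand <- -2 * ll + 2 * p
bic_by_hand <- -2 * ll + log(n) * p

print(c(aic_by_hand, AIC(fit)))
print(c(bic_by_hand, BIC(fit)))
```

For families with an estimated dispersion parameter, such as the Gaussian, R counts that parameter as well, so p in the hand calculation would need to include it.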
Summary of GLM
Setup of GLM
Inference in GLM: Deviance and Likelihood Ratio Test
  Test goodness of fit for the proposed GLM model
  Test the significance of a predictor variable or set of predictor variables in the model
Model Comparison in GLM
  AIC
  BIC
Logistic Regression
Logistic regression is a regression technique for predicting the outcome of a binary dependent variable.
Example: y = 1 (success) or y = 0 (failure)
Random Component: the dependent variable follows a Bernoulli distribution
  Probability of success: p
  Probability of failure: 1-p
The probability of obtaining y=1 or y=0 is given by the probability mass function of the Bernoulli distribution: $P(Y = y) = p^{y}(1-p)^{1-y}$, y = 0, 1
Mean(Y): $\mu = p$
Logistic Regression
Systematic Component: $\beta_0 + \beta_1 x$
The link function in logistic regression is the logit link: $g(\mu) = \log\frac{\mu}{1-\mu} = \log\frac{p}{1-p}$
$\frac{p}{1-p}$ is called the odds.
Therefore, the Logistic Regression Model is $\log\frac{p}{1-p} = \beta_0 + \beta_1 x$
Note: By transformation, we get $p = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$, which guarantees p is between 0 and 1.
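The transformation in the note can be checked numerically. The coefficient values below are hypothetical, chosen only for illustration; `plogis()` is base R's inverse logit.

```r
# Hypothetical coefficient values, for illustration only.
beta0 <- -1
beta1 <- 0.8
x <- seq(-10, 10, by = 0.5)

eta <- beta0 + beta1 * x                 # linear predictor
p   <- exp(eta) / (1 + exp(eta))         # inverse of the logit link

print(range(p))                          # stays strictly inside (0, 1)
print(all.equal(p, plogis(eta)))         # plogis() is R's built-in inverse logit
print(all.equal(log(p / (1 - p)), eta))  # applying the logit recovers eta
```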
Logistic Regression: Interpretation
The fitted logistic regression model is $\log\frac{\hat p}{1-\hat p} = \hat\beta_0 + \hat\beta_1 x$
If x is a binary variable, labeled 1 and 0:
$\log\frac{\hat p(x=1)}{1-\hat p(x=1)} = \hat\beta_0 + \hat\beta_1$ and $\log\frac{\hat p(x=0)}{1-\hat p(x=0)} = \hat\beta_0$
Subtracting, $\log\frac{\text{odds}(x=1)}{\text{odds}(x=0)} = \hat\beta_1$, so $\frac{\text{odds}(x=1)}{\text{odds}(x=0)} = \exp(\hat\beta_1)$
When interpreting $\hat\beta_1$, it is easiest to take the odds ratio approach: Odds Ratio = $\exp(\hat\beta_1)$
Increasing x by 1 unit multiplies the estimated odds of success by $\exp(\hat\beta_1)$.
Logistic Regression in R
1. Create a single vector of 0's and 1's for the response variable.
2. Use the function glm() with family=binomial to fit the model.
3. Test for goodness of fit and significance of predictors.
4. Interpretation.
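The four steps might look as follows. The data are hypothetical (50 subjects per group, invented for illustration), chosen so that the odds ratio can be verified by hand; object names are illustrative.

```r
# Hypothetical data: 50 subjects with x = 1 (30 successes) and
# 50 subjects with x = 0 (15 successes).
x <- rep(c(1, 0), each = 50)
y <- c(rep(1, 30), rep(0, 20),    # x = 1 group
       rep(1, 15), rep(0, 35))    # x = 0 group

# Steps 1-2: y is a 0/1 vector; fit with family = binomial (logit link).
fit  <- glm(y ~ x, family = binomial)
null <- glm(y ~ 1, family = binomial)

# Step 3: significance of x via a likelihood ratio test against the null model.
print(anova(null, fit, test = "Chisq"))
print(summary(fit)$coefficients)

# Step 4: odds ratio = exp(slope); here it equals (30/20) / (15/35) = 3.5.
odds_ratio <- unname(exp(coef(fit)["x"]))
print(odds_ratio)
```

With ungrouped 0/1 responses like these, the residual deviance is not a reliable goodness-of-fit statistic, so this sketch only tests the predictor.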
Poisson Regression
Poisson regression is a regression technique for predicting the outcome of a count dependent variable.
The dependent variable measures the number of occurrences in a given time frame. Outcomes are equal to 0, 1, 2, ...
Examples:
  Number of penalties during a football game.
  Number of customers who shop at a grocery store on a given day.
  Number of car accidents at an intersection during a period of time.
Poisson Regression
Random Component: the dependent variable follows a Poisson distribution, i.e. $Y \sim \text{Poisson}(\lambda)$
  The Poisson distribution takes into account that the data are counts.
  The variance and mean are the same; both are equal to $\lambda$.
Systematic Component: $\beta_0 + \beta_1 x$
Link function: Log link, where $g(\mu) = \log\mu = \log\lambda$
Therefore, the Poisson Regression Model is $\log\lambda = \beta_0 + \beta_1 x$
Poisson Regression: Interpretation
The Poisson Regression Model is $\log\lambda = \log E(Y) = \beta_0 + \beta_1 x$
Interpretation of Poisson regression coefficients: for a one-unit increase in the independent variable, the log of the expected count changes by the respective regression coefficient, given that the other predictor variables in the model are held constant. Equivalently, the expected count is multiplied by $\exp(\beta_1)$.
Poisson Regression in R
1. Input data where y is a column of counts.
2. Use the function glm() with family=poisson to fit the model.
3. Test for goodness of fit and significance of predictors.
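A sketch of these steps on a small hypothetical data set (counts invented for illustration), with a binary predictor so that the fitted rate ratio can be checked against the group means. The goodness-of-fit step compares the residual deviance to a chi-square distribution with n-p degrees of freedom.

```r
# Hypothetical counts for two groups (x = 0 and x = 1).
dat <- data.frame(
  x = rep(c(0, 1), each = 5),
  y = c(2, 3, 1, 4, 2,    # x = 0, mean 2.4
        5, 7, 6, 4, 8)    # x = 1, mean 6
)

fit <- glm(y ~ x, family = poisson, data = dat)
print(summary(fit)$coefficients)   # Wald tests for the coefficients

# Goodness of fit: compare residual deviance to chi-square with n - p df.
gof_p <- pchisq(deviance(fit), df = df.residual(fit), lower.tail = FALSE)
print(gof_p)

# exp(slope) is the ratio of expected counts: 6 / 2.4 = 2.5.
rate_ratio <- unname(exp(coef(fit)["x"]))
print(rate_ratio)
```

A large goodness-of-fit p-value gives no evidence against the Poisson model; with only ten observations the chi-square approximation is rough, so the result should be read loosely.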
Please don't forget to fill in the sign-in sheet and to complete the survey that will be sent to you by email. Thank you!