LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

Laboratory for Interdisciplinary Statistical Analysis
LISA helps VT researchers benefit from the use of statistics.
Collaboration: Visit our website to request personalized statistical advice and assistance with:
- Designing experiments
- Analyzing data
- Interpreting results
- Grant proposals
- Software (R, SAS, JMP, Minitab...)
LISA statistical collaborators aim to explain concepts in ways useful for your research. Great advice right now: Meet with LISA before collecting your data.
LISA also offers:
- Educational Short Courses: Designed to help graduate students apply statistics in their research.
- Walk-In Consulting: Available Monday-Friday from 1-3 PM in the Old Security Building (OSB) for questions <30 mins. See our website for additional times and locations.
All services are FREE for VT researchers. We assist with research, not class projects or homework.
www.lisa.stat.vt.edu

Outline
1. What is CDA?
2. Contingency Table
3. Measures of Association
4. Test of Independence & Test of Symmetry
5. What is GLM? When should we use it?
6. How GLM works?
7. Logistic Regression
8. Poisson Regression

What is Categorical Data Analysis (CDA)?

| Dependent Variable | Independent Variables | Model |
|---|---|---|
| Continuous (Normal) | Continuous | Ordinary Linear Regression |
| Continuous (Normal) | Categorical | ANOVA |
| Continuous (Normal) | Mixed | ANCOVA |
| Categorical | Categorical | CDA |

Contingency Table
A contingency table is a rectangular table having $I$ rows for categories of $X$ and $J$ columns for categories of $Y$. The cells of the table represent the $I \times J$ possible outcomes.

Contingency Table: Example 1, Heart attack vs. Aspirin use
The data in this example are from a report on the relationship between aspirin use and heart attacks by the Physicians' Health Study Research Group at Harvard Medical School. They form a 2 × 3 contingency table.

Contingency Table: Generating a Contingency Table in R
- Input the 2 × 3 table in R as a 2 × 3 matrix.
- Change the matrix to a table using the function as.table(), because some functions are happier with tables than matrices.
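The two steps above can be sketched as follows. The aspirin counts are not reproduced in this transcription, so this sketch uses the 2 × 3 counts from Example 2 (the Celexa study) that appears later in these slides; the object names `counts` and `tab` are illustrative choices, not from the slides.

```r
# Step 1: enter the 2 x 3 counts as a matrix (rows = treatment, columns = outcome).
# Counts are taken from Example 2 (Celexa study) in this document.
counts <- matrix(c(2, 3, 7,
                   2, 8, 2),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Treatment = c("Celexa", "Placebo"),
                                 Outcome   = c("Worse", "Same", "Better")))

# Step 2: convert the matrix to a table object.
tab <- as.table(counts)

print(tab)
print(addmargins(tab))   # same table with row and column totals added
```

Naming the dimensions through `dimnames` keeps the row and column labels attached to the table in later output.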

Measures of Association
- Continuous variables: Pearson correlation coefficient
- Ordinal variables: Pearson correlation coefficient
- Nominal variables: Phi coefficient and Cramér's V

Measures of Association: Phi Coefficient
The phi coefficient ($\phi$) measures the association between two binary variables. Its value ranges from -1 to +1, where +1 indicates perfect positive association, -1 indicates perfect negative association, and 0 indicates no association. The square of the phi coefficient is related to the chi-squared statistic for a 2 × 2 contingency table:
$\phi^2 = \chi^2 / n$

Measures of Association: Cramér's V
Cramér's V measures the association between two nominal variables. It varies from 0 (no association) to 1 (complete association) and can reach 1 only when the two variables are equal to each other.

Measures of Association in R
Comments:
1. When the two variables are binary, Cramér's V equals the absolute value of the phi coefficient.
2. In R, under library(psych), use the function phi() for the phi coefficient.
3. In R, under library(vcd), use the function assocstats() for Cramér's V.
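As a rough illustration that does not require the psych or vcd packages, Cramér's V can also be computed in base R from the chi-squared statistic, using the standard formula $V = \sqrt{\chi^2 / (n \cdot \min(I-1, J-1))}$. The helper name `cramers_v` is hypothetical, and the data are the Celexa counts from Example 2 in these slides.

```r
# Celexa counts from Example 2 in this document
celexa <- as.table(matrix(c(2, 3, 7,
                            2, 8, 2),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(Treatment = c("Celexa", "Placebo"),
                                          Outcome   = c("Worse", "Same", "Better"))))

# Hypothetical helper (not from the slides): Cramer's V from the Pearson chi-squared statistic
cramers_v <- function(tab) {
  # correct = FALSE: no continuity correction; warnings about small expected counts are silenced
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

v <- cramers_v(celexa)
print(v)

# For a 2 x 2 table, k = 1 and the same helper returns |phi| = sqrt(chi2 / n).
# Package route named on the slide, if vcd is installed: vcd::assocstats(celexa)
```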

Test of Independence
- Large sample size: Chi-square test
- Small sample size: Fisher's exact test

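Test of Independence (Chi-square Test)

| | Column 1 | Column 2 | Total |
|---|---|---|---|
| Row 1 | $\pi_{11}$ ($\hat\pi_{11} = n_{11}/n$) | $\pi_{12}$ ($\hat\pi_{12} = n_{12}/n$) | $\pi_{1+}$ ($\hat\pi_{1+} = n_{1+}/n$) |
| Row 2 | $\pi_{21}$ ($\hat\pi_{21} = n_{21}/n$) | $\pi_{22}$ ($\hat\pi_{22} = n_{22}/n$) | $\pi_{2+}$ ($\hat\pi_{2+} = n_{2+}/n$) |
| Total | $\pi_{+1}$ ($\hat\pi_{+1} = n_{+1}/n$) | $\pi_{+2}$ ($\hat\pi_{+2} = n_{+2}/n$) | 1 |

$H_0$: Row and Column are independent, i.e. $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i, j$
$H_a$: Row and Column are not independent, i.e. $\pi_{ij} \neq \pi_{i+}\pi_{+j}$ for some $i$ and $j$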

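Test of Independence (Chi-square Test)
$H_0$: Row and Column are independent, i.e. $\pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i, j$
$H_a$: Row and Column are not independent, i.e. $\pi_{ij} \neq \pi_{i+}\pi_{+j}$ for some $i$ and $j$
Under $H_0$, the expected count in each cell is estimated by
$\hat\mu_{ij} = n\,\hat\pi_{i+}\hat\pi_{+j} = n_{i+}\,n_{+j} / n$
The Pearson chi-square statistic compares the observed counts with these expected counts:
$X^2 = \sum_{i,j} (n_{ij} - \hat\mu_{ij})^2 / \hat\mu_{ij}$, with $(I-1)(J-1)$ degrees of freedom.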

Test of Independence (Fisher's Exact Test)
When any of the expected counts falls below 5, the chi-square test is not appropriate. Instead, we use Fisher's exact test.
Example 2: The following data are from a Stanford University study of the effectiveness of the antidepressant Celexa in the treatment of compulsive shopping.

| Treatment | Worse | Same | Better |
|---|---|---|---|
| Celexa | 2 | 3 | 7 |
| Placebo | 2 | 8 | 2 |

Test of Independence in R
- Chi-square test: use the R function chisq.test()
- Fisher's exact test: use the R function fisher.test()
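A minimal sketch of both functions applied to the Celexa data from Example 2. The object names are illustrative, and no particular p-value is claimed here; run the code to see the results.

```r
# Celexa counts from Example 2 in this document
celexa <- as.table(matrix(c(2, 3, 7,
                            2, 8, 2),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(Treatment = c("Celexa", "Placebo"),
                                          Outcome   = c("Worse", "Same", "Better"))))

# Chi-square test; R warns here because some expected counts are below 5
chi <- suppressWarnings(chisq.test(celexa))
print(chi$expected)    # expected counts n_i+ * n_+j / n under independence
print(chi$statistic)   # Pearson chi-square statistic

# Fisher's exact test, the appropriate choice for these small counts
fis <- fisher.test(celexa)
print(fis$p.value)
```

Inspecting `chi$expected` first is a quick way to decide which of the two tests applies.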

Test of Symmetry: Matched Pairs
Example 3: Suppose two surveys on the President's job approval were conducted one month apart on 1600 Americans, and the result is summarized in the following table. (Source: Agresti, 1990) Is there a significant difference in job approval rating?

| 1st Survey \ 2nd Survey | Approve | Disapprove |
|---|---|---|
| Approve | 794 | 150 |
| Disapprove | 86 | 570 |

Test of Symmetry: Matched Pairs
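The body of this slide is not available in the transcription. One standard test of symmetry for a 2 × 2 matched-pairs table is McNemar's test, so the sketch below assumes that is the intended method; it uses the Example 3 counts and base R's mcnemar.test().

```r
# Example 3 counts: rows = 1st survey, columns = 2nd survey
approval <- matrix(c(794, 150,
                      86, 570),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(First  = c("Approve", "Disapprove"),
                                   Second = c("Approve", "Disapprove")))

# McNemar's test uses only the off-diagonal (discordant) cells: 150 and 86.
# correct = FALSE gives the uncorrected statistic (n12 - n21)^2 / (n12 + n21).
mc <- mcnemar.test(approval, correct = FALSE)
print(mc)
```

A small p-value would indicate that the approval rating changed between the two surveys.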

What is GLM?
Wikipedia defines the generalized linear model (GLM) as a flexible generalization of ordinary linear regression that allows for response variables that have a distribution other than the normal distribution. (Source: http://en.wikipedia.org/wiki/generalized_linear_model)
LISA: GLMs & CDA in R, Nov. 4, 2014

General linear model vs. Generalized linear model

| | General linear model | Generalized linear model |
|---|---|---|
| Special cases | ANOVA, ANCOVA, MANOVA, MANCOVA, linear regression, mixed model | Linear regression, logistic regression, Poisson regression |
| Functions in R | lm() | glm() |
| Typical estimation method | Least squares, best linear unbiased prediction | Maximum likelihood |

(Source: http://en.wikipedia.org/wiki/comparison_of_general_and_generalized_linear_models)

Ordinary Linear Regression
Ordinary Linear Regression (OLR) investigates and models the linear relationship between independent variables and dependent variables that are continuous. The simplest regression is simple linear regression, which models the linear relationship between a single independent variable and a single dependent variable.
Simple Linear Regression Model: $y = \beta_0 + \beta_1 x + \epsilon$
where $y$ is the dependent variable, $\beta_0$ the intercept, $\beta_1$ the slope, $x$ the independent variable, and $\epsilon$ the random error.

Assumptions in OLR & when the assumptions are violated
The assumptions are:
- The true relationship between x and y is linear.
- The errors are normally distributed with mean zero and unknown common variance $\sigma^2$.
- The errors are uncorrelated.
Possible approaches when the assumptions of a normally distributed dependent variable with constant variance are violated:
- Data transformations
- Weighted least squares
- Generalized linear model (GLM)

GLM Model
Generalized Linear Model: $g(\mu) = \beta_0 + \beta_1 x$
The function $g$ is called the link function because it connects the mean $\mu$ and the linear predictor $\beta_0 + \beta_1 x$.
The dependent variable's distribution must come from the Exponential Family of Distributions, which includes Normal, Bernoulli, Binomial, Poisson, Gamma, etc.
3 Components:
- Random: Identifies the dependent variable $Y$ and its probability distribution.
- Systematic: Independent variables in a linear predictor function.
- Link function: Invertible function $g(\cdot)$ that links the mean of the dependent variable to the systematic component.

Random Component
- Normal: continuous, symmetric, mean $\mu$ and variance $\sigma^2$
- Bernoulli: 0 or 1, mean $p$ and variance $p(1-p)$; a special case of the Binomial
- Poisson: non-negative integer (0, 1, 2, ...), mean $\lambda$ and variance $\lambda$; number of events in a fixed time interval

Types of GLMs
Recall: The link function relates the mean of the dependent variable to the linear model.

| Distribution of dependent variable | Link function | Independent variable | Model |
|---|---|---|---|
| Normal | Identity | Continuous | Ordinary Linear Regression |
| Normal | Identity | Categorical | Analysis of Variance |
| Normal | Identity | Mixed | Analysis of Covariance |
| Binomial | Logit | Mixed | Logistic Regression |
| Poisson | Log | Mixed | Poisson Regression |

GLM and Ordinary linear regression
Ordinary linear regression is a special case of GLM. In OLR, the 3 components for GLM are:
- Random: the dependent variable is normally distributed with mean $\mu$ and variance $\sigma^2$
- Systematic: independent variables in a linear predictor function $\beta_0 + \beta_1 x$
- Link function: identity link $g(\mu) = \mu$
Therefore, the GLM model for ordinary linear regression is $E(Y) = \mu = \beta_0 + \beta_1 x$
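A small check of this special case, assuming R's built-in `cars` data set (chosen only for illustration; it is not used in the slides): fitting the same model with lm() and with glm() using a Gaussian family and identity link should give the same coefficients.

```r
# Ordinary linear regression via least squares
fit_lm <- lm(dist ~ speed, data = cars)

# The same model as a GLM: Normal random component, identity link
fit_glm <- glm(dist ~ speed, family = gaussian(link = "identity"), data = cars)

print(coef(fit_lm))
print(coef(fit_glm))

# TRUE if the two sets of estimates agree up to numerical tolerance
print(all.equal(coef(fit_lm), coef(fit_glm)))
```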

Model Evaluation: Deviance
Deviance measures how closely the predicted values from the fitted model match the actual values from the raw data.
Definition: Deviance = $-2\,[\,\text{log-likelihood(proposed model)} - \text{log-likelihood(saturated model)}\,]$
A saturated model is a model that fits the data perfectly, so its log-likelihood is the maximum. It has as many parameters as observations and hence provides no simplification at all.
The deviance has an asymptotic chi-squared null distribution with $n - p$ degrees of freedom, where $n$ is the number of observations and $p$ is the number of model parameters.
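To make the definition concrete, the sketch below computes the deviance by hand for a Poisson model and compares it with R's deviance(). It assumes the built-in `warpbreaks` data set (not from the slides). For a Poisson model, the saturated model sets each fitted mean equal to the observed count.

```r
# Poisson model for the number of warp breaks (built-in data, used only for illustration)
fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
y <- warpbreaks$breaks

# Log-likelihood of the proposed model
ll_model <- as.numeric(logLik(fit))

# Log-likelihood of the saturated model: fitted mean equals the observed count
ll_saturated <- sum(dpois(y, lambda = y, log = TRUE))

# Deviance from the definition
dev_manual <- -2 * (ll_model - ll_saturated)

print(c(manual = dev_manual, builtin = deviance(fit)))
print(df.residual(fit))   # degrees of freedom n - p
```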

Inference in GLM
Goodness of Fit test
- The null hypothesis is that the model is a good alternative to the saturated model.
- The deviance is the Likelihood Ratio Statistic (LRS).
Likelihood Ratio test
- Allows for the comparison of one model to another, nested model by looking at the difference in deviance of the two models.
- Null hypothesis: the predictor variables in Model 1 that are not found in Model 2 are not significant to the model fit.
- Alternative hypothesis: the predictor variables in Model 1 that are not found in Model 2 are significant to the model fit.
- Under the null hypothesis, the LRS approximately follows a chi-square distribution with degrees of freedom equal to the difference in the number of parameters.
- Simpler models have larger deviance.
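A sketch of the likelihood ratio test between two nested models, again assuming the built-in `warpbreaks` data for illustration. Here `big` plays the role of Model 1 and `small` the role of Model 2; both names are illustrative.

```r
# Model 2 (simpler) and Model 1 (adds the predictor 'tension')
small <- glm(breaks ~ wool,           family = poisson, data = warpbreaks)
big   <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

# Likelihood ratio statistic = difference in deviance
lrs <- deviance(small) - deviance(big)
df  <- df.residual(small) - df.residual(big)
p_value <- pchisq(lrs, df = df, lower.tail = FALSE)
print(c(LRS = lrs, df = df, p = p_value))

# The same test using anova()
lrt <- anova(small, big, test = "Chisq")
print(lrt)
```

The statistic is non-negative because the simpler model can never have a smaller deviance than the model that contains it.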

Model Comparison in GLM
Two additional measures for model comparison are:
Akaike Information Criterion (AIC)
- Penalizes a model for having many parameters
- $\text{AIC} = -2 \log L + 2p$, where $p$ is the number of parameters in the model
- The smaller the AIC, the better the model
Bayesian Information Criterion (BIC)
- $\text{BIC} = -2 \log L + \ln(n)\,p$, where $p$ is the number of parameters in the model and $n$ is the number of observations
- Usually a stronger penalty for additional parameters than AIC
- The smaller the BIC, the better the model
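The two formulas can be checked against R's AIC() and BIC(). This sketch assumes the built-in `warpbreaks` data and a Poisson model, for which the number of parameters equals the number of regression coefficients.

```r
fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

ll <- as.numeric(logLik(fit))   # maximized log-likelihood
p  <- length(coef(fit))         # number of parameters
n  <- nobs(fit)                 # number of observations

aic_manual <- -2 * ll + 2 * p
bic_manual <- -2 * ll + log(n) * p

print(c(aic_manual = aic_manual, aic_builtin = AIC(fit)))
print(c(bic_manual = bic_manual, bic_builtin = BIC(fit)))
```

BIC penalizes each parameter more heavily than AIC whenever ln(n) > 2, that is, for n of 8 or more.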

Summary of GLM
- Setup of GLM
- Inference in GLM: Deviance and Likelihood Ratio Test
  - Test goodness of fit for the proposed GLM model
  - Test the significance of a predictor variable or set of predictor variables in the model
- Model Comparison in GLM: AIC, BIC

Logistic Regression
Logistic regression is a regression technique for predicting the outcome of a binary dependent variable.
Example: $y = 1$ (success), $y = 0$ (failure)
Random Component: the dependent variable follows a Bernoulli distribution
- Probability of success: $p$
- Probability of failure: $1 - p$
The probability of obtaining $y = 1$ or $y = 0$ is given by the probability mass function of the Bernoulli distribution:
$P(Y = y) = p^{y} (1 - p)^{1 - y}, \quad y = 0, 1$
Mean(Y): $\mu = p$

Logistic Regression
Systematic Component: $\beta_0 + \beta_1 x$
The link function in logistic regression is the logit link:
$g(\mu) = \log\frac{\mu}{1 - \mu} = \log\frac{p}{1 - p}$
The ratio $\frac{p}{1 - p}$ is called the odds.
Therefore, the Logistic Regression Model is
$\log\frac{p}{1 - p} = \beta_0 + \beta_1 x$
Note: By transformation, we get $p = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$, which guarantees $p$ is between 0 and 1.
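A short numeric illustration of the logit link and its inverse. The function names `logit` and `inv_logit` are illustrative; base R provides the same transformations as qlogis() and plogis().

```r
# Logit link: maps a probability in (0, 1) to the whole real line
logit <- function(p) log(p / (1 - p))

# Inverse logit: maps any linear predictor value back into (0, 1)
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

eta <- c(-5, -1, 0, 1, 5)   # example values of beta0 + beta1 * x
p <- inv_logit(eta)
print(p)                    # every value lies strictly between 0 and 1

print(all.equal(logit(p), eta))    # logit undoes inv_logit
print(all.equal(p, plogis(eta)))   # matches base R's plogis()
```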

Logistic Regression: Interpretation
The fitted logistic regression model is
$\log\frac{\hat p}{1 - \hat p} = \hat\beta_0 + \hat\beta_1 x$
If $x$ is a binary variable, and we label $x$ as 1 and 0:
$\log\frac{\hat p(x=1)}{1 - \hat p(x=1)} = \hat\beta_0 + \hat\beta_1$ and $\log\frac{\hat p(x=0)}{1 - \hat p(x=0)} = \hat\beta_0$
$\log\frac{\text{odds}_{x=1}}{\text{odds}_{x=0}} = \hat\beta_1$, so $\frac{\text{odds}_{x=1}}{\text{odds}_{x=0}} = \exp(\hat\beta_1)$
When interpreting $\hat\beta_1$, it is easy to take the odds ratio approach:
Odds Ratio $= \exp(\hat\beta_1)$
The estimated multiplicative change in the odds of success from increasing $x$ by 1 unit is $\exp(\hat\beta_1)$.

Logistic Regression in R
1. Create a single vector of 0's and 1's for the response variable.
2. Use the function glm() with family=binomial to fit the model.
3. Test for goodness of fit and significance of predictors.
4. Interpretation.
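The four steps might look like the following sketch. It assumes the built-in `mtcars` data (not from the slides), with the 0/1 variable `am` as the response and `wt` as the single predictor. For step 3 it uses a likelihood ratio test against the intercept-only model, which is one common choice rather than the only one.

```r
# Step 1: the response is a vector of 0's and 1's
y <- mtcars$am
print(table(y))

# Step 2: fit the logistic regression model
fit <- glm(am ~ wt, family = binomial, data = mtcars)
print(summary(fit))

# Step 3: likelihood ratio test of the predictor against the intercept-only model
null_fit <- glm(am ~ 1, family = binomial, data = mtcars)
print(anova(null_fit, fit, test = "Chisq"))

# Step 4: interpret the slope as an odds ratio
odds_ratio <- exp(coef(fit)["wt"])
print(odds_ratio)   # multiplicative change in the odds of am = 1 per unit increase in wt
```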

Poisson Regression
Poisson regression is a regression technique for predicting the outcome of a count dependent variable. The dependent variable measures the number of occurrences in a given time frame, so the outcomes are 0, 1, 2, ...
Examples:
- Number of penalties during a football game.
- Number of customers who shop at a grocery store on a given day.
- Number of car accidents at an intersection during a period of time.

Poisson Regression
Random Component: the dependent variable follows a Poisson distribution, i.e. $Y \sim \text{Poisson}(\lambda)$
- The Poisson distribution takes into account that the data are counts
- The variance and mean are the same; both are equal to $\lambda$
Systematic Component: $\beta_0 + \beta_1 x$
Link function: Log link, where $g(\mu) = \log\mu = \log(\lambda)$
Therefore, the Poisson Regression Model is $\log\lambda = \beta_0 + \beta_1 x$

Poisson Regression: Interpretation
The Poisson Regression Model is $\log\lambda = \log(E(Y)) = \beta_0 + \beta_1 x$
Interpretation of Poisson regression coefficients: for a one-unit increase in the independent variable, the log of the expected count changes by the corresponding regression coefficient, given that the other predictor variables in the model are held constant. Equivalently, the expected count is multiplied by $\exp(\beta_1)$.

Poisson Regression in R
1. Input data where y is a column of counts.
2. Use the function glm() with family=poisson to fit the model.
3. Test for goodness of fit and significance of predictors.
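These steps can be sketched as follows, assuming the built-in `warpbreaks` data (not from the slides), where `breaks` is a column of counts. The goodness-of-fit check compares the residual deviance with a chi-square distribution, as described on the deviance slide.

```r
# Step 1: data with a column of counts
print(head(warpbreaks))

# Step 2: fit the Poisson regression model
fit <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
print(summary(fit))

# Step 3a: goodness of fit; a small p-value suggests the model fits poorly
gof_p <- pchisq(deviance(fit), df = df.residual(fit), lower.tail = FALSE)
print(gof_p)

# Step 3b: significance of predictors via sequential likelihood ratio tests
print(anova(fit, test = "Chisq"))

# Interpretation: exp(coefficient) is the multiplicative change in the expected count
rate_ratios <- exp(coef(fit))
print(rate_ratios)
```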

Please don't forget to fill out the sign-in sheet and to complete the survey that will be sent to you by email. Thank you!