Continuing with Binary and Count Outcomes

Gov 2001 Section 8: Continuing with Binary and Count Outcomes
Konstantin Kashin
March 27, 2013
Thanks to Jen Pan, Brandon Stewart, Iain Osgood, and Patrick Lam for contributing to this material.

Outline: Administrative Issues; Zero-Inflated Logistic Regression; Counts: Poisson Model; Counts: Negative Binomial Model

Replication Paper: You will receive a group to re-replicate tonight. Re-replication is due Wednesday, April 3 at 7pm. Aim to be helpful, not critical! Any questions about expectations?

Outline: Administrative Issues; Zero-Inflated Logistic Regression; Counts: Poisson Model; Counts: Negative Binomial Model

Why Zero-Inflation? What if we knew that something in our data were mismeasured? For example, what if we thought that some of our data were systematically zero rather than randomly zero? This could happen when: 1. some data are spoiled or lost, or 2. survey respondents answer zero to an ordered question just to get the survey done. If our data are mismeasured in some systematic way, our estimates will be off.

A Working Example: Fishing. You're trying to figure out the probability of catching a fish in a park from a survey. People were asked: how many children were in the group, how many people were in the group, and whether they caught a fish.

A Working Example: Fishing. The problem is, some people didn't even fish! These people have systematically zero fish.

The Model. We're going to assume that whether or not the person fished is the outcome of a Bernoulli trial:

Y_i = 0 with probability \psi_i
Y_i \sim \text{Logistic} with probability 1 - \psi_i

The Model. We can write out the distribution of Y_i as:

P(Y_i = y_i | \beta, \psi_i) = \psi_i + (1 - \psi_i)\left(1 - \frac{1}{1 + e^{-X_i\beta}}\right) if y_i = 0
P(Y_i = y_i | \beta, \psi_i) = (1 - \psi_i)\left(\frac{1}{1 + e^{-X_i\beta}}\right) if y_i = 1

And we can put covariates on \psi:

\psi_i = \frac{1}{1 + e^{-z_i\gamma}}

Deriving the Likelihood. The likelihood function is proportional to the probability of Y_i:

L(\beta, \psi_i | Y_i) \propto P(Y_i | \beta, \psi_i)
= \left[\psi_i + (1 - \psi_i)\left(1 - \frac{1}{1 + e^{-X_i\beta}}\right)\right]^{1 - Y_i} \left[(1 - \psi_i)\left(\frac{1}{1 + e^{-X_i\beta}}\right)\right]^{Y_i}
= \left[\frac{1}{1 + e^{-z_i\gamma}} + \left(1 - \frac{1}{1 + e^{-z_i\gamma}}\right)\left(1 - \frac{1}{1 + e^{-X_i\beta}}\right)\right]^{1 - Y_i} \left[\left(1 - \frac{1}{1 + e^{-z_i\gamma}}\right)\left(\frac{1}{1 + e^{-X_i\beta}}\right)\right]^{Y_i}

Deriving the Likelihood. Multiplying over all observations we get:

L(\beta, \gamma | Y) = \prod_{i=1}^{n} \left[\frac{1}{1 + e^{-z_i\gamma}} + \left(1 - \frac{1}{1 + e^{-z_i\gamma}}\right)\left(1 - \frac{1}{1 + e^{-X_i\beta}}\right)\right]^{1 - Y_i} \left[\left(1 - \frac{1}{1 + e^{-z_i\gamma}}\right)\left(\frac{1}{1 + e^{-X_i\beta}}\right)\right]^{Y_i}

Deriving the Likelihood. Taking the log we get:

\ln L = \sum_{i=1}^{n} \left\{ Y_i \ln\left[(1 - \psi_i)\left(\frac{1}{1 + e^{-X_i\beta}}\right)\right] + (1 - Y_i) \ln\left[\psi_i + (1 - \psi_i)\left(1 - \frac{1}{1 + e^{-X_i\beta}}\right)\right] \right\}
= \sum_{i=1}^{n} \left\{ Y_i \ln\left[\left(1 - \frac{1}{1 + e^{-z_i\gamma}}\right)\left(\frac{1}{1 + e^{-X_i\beta}}\right)\right] + (1 - Y_i) \ln\left[\frac{1}{1 + e^{-z_i\gamma}} + \left(1 - \frac{1}{1 + e^{-z_i\gamma}}\right)\left(1 - \frac{1}{1 + e^{-X_i\beta}}\right)\right] \right\}

Let's program this in R. Load and get the data ready:

fish <- read.table("http://www.ats.ucla.edu/stat/r/dae/fish.csv", sep = ",", header = TRUE)
X <- fish[c("child", "persons")]
Z <- fish[c("persons")]
X <- as.matrix(cbind(1, X))
Z <- as.matrix(cbind(1, Z))
y <- ifelse(fish$count > 0, 1, 0)

Let's program this in R. Write out the log-likelihood function:

ll.zilogit <- function(par, X, Z, y){
  beta <- par[1:ncol(X)]                    # logit coefficients
  gamma <- par[(ncol(X)+1):length(par)]     # structural-zero (inflation) coefficients
  phi <- 1/(1 + exp(-Z %*% gamma))          # psi_i: probability of a structural zero
  pie <- 1/(1 + exp(-X %*% beta))           # pi_i: probability of a 1, given no structural zero
  sum(y*log((1-phi)*pie) + (1-y)*log(phi + (1-phi)*(1-pie)))
}

Let's program this in R. Optimize to get the results:

par <- rep(1, (ncol(X) + ncol(Z)))
out <- optim(par, ll.zilogit, Z = Z, X = X, y = y, method = "BFGS",
             control = list(fnscale = -1), hessian = TRUE)
out$par
[1]  1.507470 -2.686476  1.447307  1.876404 -1.247189
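As a quick check, standard errors can be read off the inverted negative Hessian, the same quantity used for simulation on the next slide:

se <- sqrt(diag(solve(-out$hessian)))    # variance-covariance matrix is the inverse of the negative Hessian
cbind(estimate = out$par, std.error = se)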

Plotting to See the Relationship. These numbers don't mean a lot to us, so we can plot the predicted probabilities of a group having not fished. First, we have to simulate our gammas:

varcv.par <- solve(-out$hessian)
library(mvtnorm)
sim.pars <- rmvnorm(10000, out$par, varcv.par)
sim.z <- sim.pars[, (ncol(X)+1):length(par)]

Plotting to See the Relationship. We then generate the predicted probabilities that different-sized groups did not fish:

person.vec <- seq(1, 4)
Zcovariates <- cbind(1, person.vec)
exp.holder <- matrix(NA, ncol = 4, nrow = 10000)
for(i in 1:length(person.vec)){
  exp.holder[, i] <- 1/(1 + exp(-Zcovariates[i, ] %*% t(sim.z)))
}

Plotting to See the Relationship. Using these numbers, we can plot the densities of the probabilities, to get a sense of both the probability and the uncertainty:

plot(density(exp.holder[, 4]), col = "blue", xlim = c(0, 1),
     main = "Probability of a Structural Zero", xlab = "Probability")
lines(density(exp.holder[, 3]), col = "red")
lines(density(exp.holder[, 2]), col = "green")
lines(density(exp.holder[, 1]), col = "black")
legend(.7, 12, legend = c("One Person", "Two People", "Three People", "Four People"),
       col = c("black", "green", "red", "blue"), lty = 1)

Plotting to See the Relationship. [Figure: densities of the probability of a structural zero (not fishing) for groups of one, two, three, and four people; probability on the x-axis from 0 to 1.]

Outline: Administrative Issues; Zero-Inflated Logistic Regression; Counts: Poisson Model; Counts: Negative Binomial Model

The Poisson Distribution. It's a discrete probability distribution which gives the probability that some number of events will occur in a fixed period of time. Examples: 1. the number of terrorist attacks in a given year, 2. the number of publications by a professor in a career, 3. the number of times the word "hope" is used in a Barack Obama speech, 4. the number of songs on a pop music CD.

The Poisson Distribution. Here's the probability mass function (PMF) for a random variable Y that is distributed Pois(\lambda):

Pr(Y = y) = \frac{\lambda^y}{y!} e^{-\lambda}

Suppose Y \sim Pois(3). What's Pr(Y = 4)?

Pr(Y = 4) = \frac{3^4}{4!} e^{-3} \approx 0.168

[Figure: the Pois(3) distribution, Pr(Y = y) for y = 0 through 10.]
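R's built-in dpois() gives the same answer:

dpois(4, lambda = 3)   # 0.168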

The Poisson Distribution. One more time, the probability mass function for a random variable Y that is distributed Pois(\lambda):

Pr(Y = y) = \frac{\lambda^y}{y!} e^{-\lambda}

Using a little bit of series manipulation, it isn't too hard to show that

E[Y] = \sum_{y=0}^{\infty} y \frac{\lambda^y}{y!} e^{-\lambda} = \lambda.

It also turns out that Var(Y) = \lambda, a feature of the model we will discuss later on.
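A quick simulation makes the equal-mean-and-variance property concrete (lambda = 3 here is just an illustrative choice):

draws <- rpois(100000, lambda = 3)
mean(draws)   # approximately 3
var(draws)    # also approximately 3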

The Poisson Distribution. Poisson data arise when some discrete event occurs (possibly multiple times) at a constant rate over some fixed time period. This constant rate assumption can be restated: the probability of an event occurring at any moment is independent of whether an event has occurred at any other moment. The full derivation of the distribution involves some other technical first principles, but the above is the most important.

The Poisson Model for Event Counts. 1. The stochastic component: Y_i \sim Pois(\lambda_i). 2. The systematic component: \lambda_i = \exp(X_i\beta). The likelihood is therefore:

L(\beta | X, y) = \prod_{i=1}^{n} \frac{\lambda_i^{y_i}}{y_i!} e^{-\lambda_i}

The Poisson Model for Event Counts. And the log-likelihood:

\ln L(\beta | X, y) = \sum_{i=1}^{n} \left[ y_i \ln \lambda_i - \ln(y_i!) - \lambda_i \right]
= \sum_{i=1}^{n} \left[ y_i \ln(\exp(X_i\beta)) - \ln(y_i!) - \exp(X_i\beta) \right]
\propto \sum_{i=1}^{n} \left[ y_i (X_i\beta) - \exp(X_i\beta) \right],

where the last line drops \ln(y_i!) because it does not involve \beta.
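This log-likelihood can be programmed directly, mirroring the zero-inflated logit code earlier in the section. A minimal sketch, assuming X is a covariate matrix with a leading column of ones and y is a vector of counts:

ll.poisson <- function(beta, X, y){
  xb <- X %*% beta
  sum(y * xb - exp(xb))    # the ln(y_i!) term is dropped because it does not involve beta
}
# It could then be maximized as before, e.g.:
# out <- optim(rep(0, ncol(X)), ll.poisson, X = X, y = y, method = "BFGS",
#              control = list(fnscale = -1), hessian = TRUE)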


Comparing with the Linear Model. Possible dimensions for comparison: 1. the distribution of Y | X, 2. the shape of the mean function, 3. assumptions about Var(Y | X), 4. calculating fitted values, 5. the meaning of the intercept and slope. Generally: the linear model (OLS) is biased, inefficient, and inconsistent for count data!
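A small simulated illustration of these comparison dimensions (the data here are made up for the sketch):

set.seed(1)
x <- runif(500, -2, 2)
y <- rpois(500, exp(0.5 + 1.2 * x))       # true mean is exp(0.5 + 1.2x); variance equals the mean
ols <- lm(y ~ x)                          # linear mean function, constant-variance assumption
pois <- glm(y ~ x, family = poisson)      # exponential mean function, Var(Y|X) = E(Y|X)
min(fitted(ols))                          # the linear fit typically dips below zero at the low end
min(fitted(pois))                         # the Poisson fit cannot produce negative fitted values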

Example: Civil Conflict in Northern Ireland. Background: a conflict largely along religious lines about the status of Northern Ireland within the United Kingdom, and the division of resources and political power between Northern Ireland's Protestant (mainly Unionist) and Catholic (mainly Republican) communities. The data: the number of Republican deaths in every month from 1969, the beginning of sustained violence, to 2001 (at which point most organized violence had subsided), along with the unemployment rates in the two main religious communities.

Example: Civil Conflict in Northern Ireland

Example: Civil Conflict in Northern Ireland. The model: let Y_i = the number of Republican deaths in month i. Our sole predictor for the moment will be U^C = the unemployment rate among Northern Ireland's Catholics. Our model is then:

Y_i \sim Pois(\lambda_i) and \lambda_i = E[Y_i | U^C_i] = \exp(\beta_0 + \beta_1 U^C_i).

Estimate (just as we have all along!):

mod <- zelig(repdeaths ~ cathunemp, data = troubles, model = "poisson")
summary(mod)$coefficients

             Estimate Std. Error  z value     Pr(>|z|)
(Intercept)  1.295875  0.1805327 7.178064 7.070547e-13
cathunemp    1.406498  0.6689819 2.102445 3.551432e-02
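For reference, the same model can be fit in base R; Zelig's "poisson" model wraps glm() with a log link (this sketch assumes the troubles data frame is loaded):

mod.glm <- glm(repdeaths ~ cathunemp, family = poisson, data = troubles)
summary(mod.glm)$coefficients   # should match the zelig output above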

Our fitted model:

\lambda_i = E[Y_i | U^C_i] = \exp(1.296 + 1.407 \, U^C_i)

[Figure: repdeaths (y-axis, 0 to 50) plotted against cathunemp (x-axis, 0.10 to 0.40).]

Some fitted and predicted values. Suppose U^C is equal to 0.2.

library(MASS)   # for mvrnorm
mod.coef <- coef(mod); mod.vcov <- vcov(mod)
beta.draws <- mvrnorm(10000, mod.coef, mod.vcov)
lambda.draws <- exp(beta.draws[, 1] + .2 * beta.draws[, 2])
outcome.draws <- rpois(10000, lambda.draws)

[Figure: two panels. Left: density of expected values E[Y | U] (roughly 4.4 to 5.2). Right: histogram of predicted values Y | U (0 to 15).]

Some fitted and predicted values. Is the difference between expected and predicted values clear? What kind of uncertainty is accounted for in each of the two distributions? Expected values reflect estimation uncertainty only; predicted values reflect both estimation uncertainty and fundamental uncertainty.

Overdispersion. 36% of observations lie outside the 2.5% or 97.5% quantiles of the Poisson distribution that we are alleging generated them. [Figure: repdeaths (0 to 50) versus cathunemp (0.10 to 0.40).]
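One way a check like this could be computed, as a sketch (it assumes the troubles data frame and plugs in the fitted coefficients from above):

lambda.hat <- exp(1.296 + 1.407 * troubles$cathunemp)
lo <- qpois(0.025, lambda.hat)
hi <- qpois(0.975, lambda.hat)
mean(troubles$repdeaths < lo | troubles$repdeaths > hi)   # share of months outside the central 95% interval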

Outline: Administrative Issues; Zero-Inflated Logistic Regression; Counts: Poisson Model; Counts: Negative Binomial Model

The Negative Binomial Model. The variance of the Poisson distribution is only equal to its mean if the probability of an event occurring at any moment is independent of whether an event has occurred at any other moment, and if the occurrence rate is constant. We can perturb this second assumption (constant rate) in order to derive a distribution which can handle both violations of the constant rate assumption and violations of the independence of events (or no contagion) assumption. The trick is to assume that \lambda varies, within the same observation span, according to a new parameter we will introduce, called \zeta.

Alternative Parameterization. Here's the new stochastic component:

Y_i | \lambda_i, \zeta_i \sim Poisson(\zeta_i \lambda_i)
\zeta_i \sim Gamma\left(\frac{1}{\sigma^2 - 1}, \frac{1}{\sigma^2 - 1}\right)

Note that this Gamma distribution has a mean of 1, so Poisson(\zeta_i \lambda_i) has mean \lambda_i. The variance of this Gamma distribution is \sigma^2 - 1, which means that as \sigma^2 goes to 1, the distribution of \zeta_i collapses to a spike over 1.

Alternative Parameterization. Using a similar approach to that described in UPM, pp. 51-52, we can derive the marginal distribution of Y as

Y_i \sim Negbin(\lambda_i, \sigma^2),

where

f_{nb}(y_i | \lambda_i, \sigma^2) = \frac{\Gamma\left(\frac{\lambda_i}{\sigma^2 - 1} + y_i\right)}{y_i! \, \Gamma\left(\frac{\lambda_i}{\sigma^2 - 1}\right)} \left(\frac{\sigma^2 - 1}{\sigma^2}\right)^{y_i} \left(\sigma^2\right)^{-\frac{\lambda_i}{\sigma^2 - 1}}

Notes: 1. \lambda_i > 0 and \sigma^2 > 1. 2. E[Y_i] = \lambda_i and Var[Y_i] = \lambda_i \sigma^2. What value of \sigma^2 would be evidence against overdispersion? 3. We still have the same old systematic component: \lambda_i = \exp(X_i\beta).
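This parameterization maps onto R's built-in negative binomial with size = \lambda/(\sigma^2 - 1) and prob = 1/\sigma^2, which gives a quick way to check the moments by simulation (the values of lambda and sigma2 below are arbitrary illustration choices):

lambda <- 5; sigma2 <- 3
draws <- rnbinom(100000, size = lambda/(sigma2 - 1), prob = 1/sigma2)
mean(draws)   # approximately lambda = 5
var(draws)    # approximately lambda * sigma2 = 15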

Estimates.

mod <- zelig(repdeaths ~ cathunemp, data = troubles, model = "negbin")
summary(mod)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.2959     0.1805   7.178 7.07e-13 ***
cathunemp     1.4065     0.6690   2.102   0.0355 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Theta: 0.8551
Std. Err.: 0.0754
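For reference, Zelig's "negbin" model is based on MASS::glm.nb, which can also be run directly (assuming the troubles data frame is loaded). Note that glm.nb reports overdispersion through theta, with Var(Y) = mu + mu^2/theta, a different parameterization than the sigma^2 used on the earlier slides:

library(MASS)
mod.nb <- glm.nb(repdeaths ~ cathunemp, data = troubles)
summary(mod.nb)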

Overdispersion Handled! 5.68% of observations lie at or above the 95% quantile of the Negative Binomial distribution that we are alleging generated them. [Figure: repdeaths (0 to 50) versus cathunemp (0.10 to 0.40).]

Other Models. Note that there are many other count models: the Generalized Event Count (GEC) model, the zero-inflated Poisson, the zero-inflated negative binomial, zero-truncated models, and hurdle models.
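Several of these are available off the shelf in R, for example in the pscl package. A sketch using the fish data from earlier in the section (modeling the count outcome itself rather than the binary indicator used above):

library(pscl)
zip <- zeroinfl(count ~ child + persons | persons, data = fish, dist = "poisson")   # count part | zero-inflation part
hrd <- hurdle(count ~ child + persons | persons, data = fish)                        # hurdle analogue
summary(zip)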