Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20

Similar documents
Linear Regression Models P8111

Single-level Models for Binary Responses

Generalized Linear Models

9 Generalized Linear Models

Generalized Linear Models 1

Generalized linear models

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

STAT5044: Regression and Anova

Logistic Regressions. Stat 430

Introduction to General and Generalized Linear Models

Generalized Linear Models. Kurt Hornik

Introduction to General and Generalized Linear Models

Generalized Linear Models

11. Generalized Linear Models: An Introduction

Generalized Linear Models and Exponential Families

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches

22s:152 Applied Linear Regression. Example: Study on lead levels in children. Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples.

Linear Regression With Special Variables

Classification. Chapter Introduction. 6.2 The Bayes classifier

Outline of GLMs. Definitions

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Generalized Linear Models: An Introduction

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Chapter 14 Logistic Regression, Poisson Regression, and Generalized Linear Models

Linear models and their mathematical foundations: Simple linear regression

Lecture 14: Introduction to Poisson Regression

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Generalized Linear Models. stat 557 Heike Hofmann

Lecture 13: More on Binary Data

Generalized Linear Models for Non-Normal Data

Generalized Linear Models Introduction

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Correlation and regression

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Linear, Generalized Linear, and Mixed-Effects Models in R. Linear and Generalized Linear Models in R Topics

Binary Regression. GH Chapter 5, ISL Chapter 4. January 31, 2017

Chapter 1. Modeling Basics

Introduction to Generalized Linear Models

Now consider the case where E(Y) = µ = Xβ and V (Y) = σ 2 G, where G is diagonal, but unknown.

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

GLM I An Introduction to Generalized Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

Hierarchical Generalized Linear Models. ERSH 8990 REMS Seminar on HLM Last Lecture!

Generalized Linear Models (GLZ)

STA216: Generalized Linear Models. Lecture 1. Review and Introduction

Generalized linear models

STAT 7030: Categorical Data Analysis

Introduction to the Logistic Regression Model

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

12 Modelling Binomial Response Data

Generalized linear models for binary data. A better graphical exploratory data analysis. The simple linear logistic regression model

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Econometrics II. Seppo Pynnönen. Spring Department of Mathematics and Statistics, University of Vaasa, Finland

Statistical Modelling with Stata: Binary Outcomes

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

Likelihoods for Generalized Linear Models

Sections 4.1, 4.2, 4.3

" M A #M B. Standard deviation of the population (Greek lowercase letter sigma) σ 2

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Logistic Regression. Seungjin Choi

A Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46

COMPLEMENTARY LOG-LOG MODEL

H-LIKELIHOOD ESTIMATION METHOOD FOR VARYING CLUSTERED BINARY MIXED EFFECTS MODEL

Ch 2: Simple Linear Regression

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random

Lecture 9: Linear Regression

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 18.1 Logistic Regression (Dose - Response)

Regression Model Building

Generalized Linear Models

Generalized Models: Part 1

Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression

STAT5044: Regression and Anova

Logistic regression: Miscellaneous topics

Categorical data analysis Chapter 5

Generalized linear models

Poisson regression 1/15

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

Introduction to the Generalized Linear Model: Logistic regression and Poisson regression

Chapter 22: Log-linear regression for Poisson counts

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Analysing categorical data using logit models

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Introduction to Generalized Models

Semiparametric Generalized Linear Models

Mixed models in R using the lme4 package Part 7: Generalized linear mixed models

Introducing Generalized Linear Models: Logistic Regression

LOGISTIC REGRESSION Joseph M. Hilbe

Chapter 9 Regression with a Binary Dependent Variable. Multiple Choice. 1) The binary dependent variable model is an example of a

Logistic Regression. Some slides from Craig Burkett. STA303/STA1002: Methods of Data Analysis II, Summer 2016 Michael Guerzhoy

Introduction To Logistic Regression

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

Ron Heck, Fall Week 8: Introducing Generalized Linear Models: Logistic Regression 1 (Replaces prior revision dated October 20, 2011)

Lecture 12: Effect modification, and confounding in logistic regression

Formulary Applied Econometrics

Outline. Mixed models in R using the lme4 package Part 5: Generalized linear mixed models. Parts of LMMs carried over to GLMMs

Generalized Linear Models

WU Weiterbildung. Linear Mixed Models

Transcription:

Logistic regression 11 Nov 2010 Logistic regression (EPFL) Applied Statistics 11 Nov 2010 1 / 20

Modeling overview Want to capture important features of the relationship between a (set of) variable(s) and one or more response(s) Many models are of the form g(y ) = f (x) + error Differences in the form of g, f and distributional assumptions about the error term Logistic regression (EPFL) Applied Statistics 11 Nov 2010 2 / 20

Examples of models Linear: Y = β 0 + β 1 x + ɛ Linear: Y = β 0 + β 1 x + β 2 x 2 + ɛ (Intrinsically) Nonlinear: Y = αx β 1 x γ 2 x 3 δ + ɛ Generalized Linear Model (e.g. Binomial): log p 1 p = β 0 + β 1 x + β 2 x 2 Proportional Hazards (in Survival Analysis): h(t) = h 0 (t) exp(βx) Logistic regression (EPFL) Applied Statistics 11 Nov 2010 3 / 20

Linear modeling A simple linear model: E(Y ) = β 0 + β 1 x Gaussian measurement model: Y = β 0 + β 1 x + ɛ, ɛ N(0, σ 2 ) More generally: Y = X β + ɛ, where Y is n 1, X is n p, β is p 1, ɛ is n 1, often assumed N(0, σ 2 I n n ) Logistic regression (EPFL) Applied Statistics 11 Nov 2010 4 / 20

Analysis of designed experiments An important use of linear models Define a (design) matrix X so that for response variable Y : E(Y ) = X β, where β is a vector of parameters (or contrasts) Many ways to define design matrix/contrasts Logistic regression (EPFL) Applied Statistics 11 Nov 2010 5 / 20

Model fitting For the standard (fixed effects) linear model, estimation is usually by least squares Can be more complicated with random effects or when x-variables are subject to measurement error as well Logistic regression (EPFL) Applied Statistics 11 Nov 2010 6 / 20

Model checking Examination of residuals Normality Time effects Nonconstant variance Curvature Detection of influential observations Logistic regression (EPFL) Applied Statistics 11 Nov 2010 7 / 20

Linear regression model (again) Linear model Y = β 0 + β 1 x 1 + β 2 x 2 + + β k x k + ɛ, ɛ N(0, σ 2 ) Another way to write this: Y N(µ, σ 2 ), µ = β 0 + β 1 x 1 + β 2 x 2 + + β k x k Suitable for a continuous response NOT suitable for a binary response Logistic regression (EPFL) Applied Statistics 11 Nov 2010 8 / 20

Modified model Instead of modeling the response directly, could instead model the probability of 1 Problems: could lead to fitted values outside of [0, 1] normality assumption on errors is wrong Instead of modeling the expected response directly as a linear function of the predictors, model a suitable transformation For binary data, this is generally taken to be the logit (or logistic) transformation Logistic regression (EPFL) Applied Statistics 11 Nov 2010 9 / 20

Logit transformation logit(p) = log Therefore, p 1 p = β 0 + β 1 x 1 + β 2 x 2 + + β k x k p(x 1,... x k ) = exp(β 0 + β 1 x 1 + β 2 x 2 + + β k x k ) 1 + exp(β 0 + β 1 x 1 + β 2 x 2 + + β k x k ) The parameter β k is such that exp(β k ) is the odds that the response takes value 1 when x k increases by one, when the remaining variables are constant Estimate parameters by maximum likelihood Logistic regression (EPFL) Applied Statistics 11 Nov 2010 10 / 20

Binary response in a linear model In a standard linear model, the response variable is modeled as a normally distributed However, if the response variable is binary, it does not make sense to model the outcome as normal Generalized linear models (GLMs) are an extension of linear models to model non-normal response variables We are using logistic regression for a binary response Logistic regression (EPFL) Applied Statistics 11 Nov 2010 11 / 20

Generalized linear models: some theory Allows unified treatment of statistical methods for several important classes of models Response Y assumed to have exponential family distribution: f (y) = exp[a(y)b(θ) + c(θ) + d(y)] For a standard linear model Y = β 0 + β 1 x 1 +... + β k x k + ɛ, with ɛ N(0, σ 2 ) The expected response is E[Y x] = β 0 + β 1 x 1 +... + β k x k Let η denote the linear predictor η = β 0 + β 1 x 1 +... + β k x k For a standard linear model, E[Y x] = η In a generalized linear model, there is a link function g between η and the expected response: g(e[y x]) = η For a standard linear model, g(y) = y (identity link) Logistic regression (EPFL) Applied Statistics 11 Nov 2010 12 / 20

Link function When the response variable is binary (with values coded as 0 or 1), then E[Y x] = P(Y = 1 x) A convenient function in this case is E[Y x] = P(Y = 1 x) = eη 1+e η The corresponding link function (inverse of this function) is called the logit logit(x) = log(x/(1 x)) Regression using this model is called logistic regression Logistic regression (EPFL) Applied Statistics 11 Nov 2010 13 / 20

Link function: examples Family Name Link binomial Gamma gaussian inverse.gaussian poisson logit D probit cloglog identity D inverse D log D 1/mu^2 D sqrt Logistic regression (EPFL) Applied Statistics 11 Nov 2010 14 / 20

Analogous to linear regression The logit function g has many of the desirable properties of a linear regression model Mathematically convenient and flexible Can meaningfully interpret parameters Linear in the parameters A difference: Error distribution is binomial (not normal) Logistic regression (EPFL) Applied Statistics 11 Nov 2010 15 / 20

Fitting the model For linear regression, typically use least squares When outcome dichotomous, the nice statistical properties of least squares estimators no longer hold The general estimation method that leads to least squares (for normally distributed errors) is maximum likelihood Logistic regression (EPFL) Applied Statistics 11 Nov 2010 16 / 20

Maximum likelihood estimation Likelihood: f (x i ) = p(x i ) y i [1 p(x i )] 1 y i Assuming independent observations, the likelihood l(β) = n i=1 f (x i) log likelihood L(β) = log[l(β)] = n i=1 (y i log(p(x i )) + (1 y i ) log(1 p(x i ))) To find β that maximize the log likelihood, differentiate wrt each β i and set the derivative equal to 0 In linear regression these equations are easily solved In logistic regression, these are nonlinear in β and are solved iteratively Logistic regression (EPFL) Applied Statistics 11 Nov 2010 17 / 20

Assessing model fit In linear regression, an anova table partitions SST, the total sum of squared deviations of observations about their mean, into two parts: SSE, or residual (observed - predicted) sum of squares SSR, or regression sum of squares Large SSR suggests the explanatory variable(s) is(are) important Use same guiding principle in logistic regression: compare observed response to predicted response obtained from models with/without the variable(s) Comparison based on log likelihood function Logistic regression (EPFL) Applied Statistics 11 Nov 2010 18 / 20

Deviance In standard linear models, estimate parameters by minimizing residual sum of squares (Equivalent to ML for normal model) In GLM, estimate parameters by ML The deviance is (proportional to) 2 l (Analogous to SSE) Obtaining absolute measure of goodness of fit depends on some assumptions that may not be satisfied in practice Usually focus on comparing competing models When the models are nested, can carry out likelihood ratio test Logistic regression (EPFL) Applied Statistics 11 Nov 2010 19 / 20

Comparing models In linear regression, consider coefficient significant if (squared) standardized value ˆβ/SE( ˆβ) is large Can also do this for logistic regression (Wald test), but there are some problems with it Preferred approach: likelihood ratio test Deviance D = 2 n i=1 y i log ( ˆpi y i ) + (1 y i ) log ( ) 1 ˆpi 1 y i To compare models, compute G = D(submodel) D(bigger model) Under the null (i.e. the submodel), G χ 2 with df = difference in the number of estimated parameters Logistic regression (EPFL) Applied Statistics 11 Nov 2010 20 / 20