Lecture 10: Introduction to Logistic Regression
Ani Manichaikul (amanicha@jhsph.edu), 2 May 2007

Binomial Model
- n independent trials (e.g., coin tosses)
- p = probability of success on each trial (e.g., p = 1/2 = Pr(Heads))
- Y = number of successes out of n trials (e.g., Y = number of heads)

Logistic Regression
- Regression for a response variable that follows a binomial distribution

Binomial Distribution
- Recall the binomial model, and the binomial distribution:
  $P(Y = y) = \binom{n}{y} p^y (1-p)^{n-y}$
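To make the binomial model concrete, here is a minimal Python sketch (not part of the original lecture) that evaluates these probabilities for a fair coin; scipy's `binom.pmf` plays the role of the formula above.

```python
# Minimal sketch: binomial probabilities P(Y = y) for a fair coin.
from scipy.stats import binom

n, p = 10, 0.5                    # 10 tosses, Pr(Heads) = 1/2
for y in range(n + 1):
    print(y, binom.pmf(y, n, p))  # probability of exactly y heads
```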

Why can't we use regular regression (SLR or MLR)?

Example
- Consider a phase I clinical trial in which 35 independent patients are given a new medication for pain relief. Of the 35 patients, 22 report significant relief one hour after medication.
- Question: How effective is the drug?

Cannot use Linear Regression
- The response, Y, is NOT normally distributed.
- The variability of Y is NOT constant, since the variance, Var(Y) = pq, depends on the expected response, E(Y) = p.
- The predicted/fitted values must be such that the corresponding probabilities are between 0 and 1.

Model
- Y = # patients who get relief
- n = 35 patients (trials)
- p = probability of relief for any patient

The truth we seek in the population
- How effective is the drug? What is p?
- Get the best estimate of p given the data.
- Determine the margin of error: a range of plausible values for p.

Maximum Likelihood Method
- The method of maximum likelihood estimation chooses values for parameter estimates which make the observed data maximally likely under the specified model.

Maximum Likelihood
- So, estimate p by choosing the value for p which makes the observed data maximally likely, i.e., choose the p that makes the value of Pr(Y = 22) maximal.
- The ML estimate is $\hat{p} = y/n = 22/35 = 0.63$, the estimated proportion of patients who will experience relief.

Maximum Likelihood
- For the binomial model, we have observed Y = y and
  $P(Y = y) = \binom{n}{y} p^y (1-p)^{n-y}$
- So for this example,
  $P(Y = 22) = \binom{35}{22} p^{22} (1-p)^{13}$

[Figure: the likelihood function Pr(22 of 35) plotted against p = Pr(event); the curve peaks at the MLE, p = 0.63.]
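As an illustration of the maximum likelihood idea (a Python sketch, not the lecture's software), we can evaluate the binomial likelihood over a grid of candidate values of p and confirm that the maximizer agrees with y/n:

```python
# Sketch of the calculation behind the likelihood plot:
# evaluate L(p) = P(Y = 22 | n = 35, p) on a grid and locate its maximum.
import numpy as np
from scipy.stats import binom

n, y = 35, 22
grid = np.linspace(0.01, 0.99, 981)
lik = binom.pmf(y, n, grid)      # likelihood at each candidate p
p_hat = grid[np.argmax(lik)]
print(p_hat, y / n)              # both ~0.63: the grid maximum agrees with y/n
```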

Confidence Interval for p
- Variance of $\hat{p}$: $\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n} = \frac{pq}{n}$
- Standard error of $\hat{p}$: $\sqrt{\frac{pq}{n}}$
- Estimate of the standard error of $\hat{p}$: $\sqrt{\frac{\hat{p}\hat{q}}{n}}$

Confidence Interval for p
- 95% confidence interval for the true proportion, p:
  $\hat{p} \pm 1.96\sqrt{\frac{\hat{p}\hat{q}}{n}} = 0.63 \pm 1.96\sqrt{\frac{(0.63)(0.37)}{35}}$
  $= (0.63 - 1.96(0.082),\ 0.63 + 1.96(0.082)) = (0.47, 0.79)$

Conclusion
- Based upon our clinical trial, in which 22 of 35 patients experience relief, we estimate that 63% of persons who receive the new drug experience relief within 1 hour (95% CI: 47% to 79%).
- Whether 63% (47% to 79%) represents an effective drug will depend on many things, especially on the science of the problem. Sore throat pain? Arthritis pain? Accidentally-cut-your-leg-off pain?
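A quick sketch of the same Wald-type interval in Python, assuming the normal approximation used above:

```python
# Sketch: 95% Wald interval for a binomial proportion.
import math

n, y = 35, 22
p_hat = y / n                                  # 0.63
se = math.sqrt(p_hat * (1 - p_hat) / n)        # ~0.082
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(round(lo, 2), round(hi, 2))              # 0.47 0.79
```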

Aside: Probabilities and Odds
- The odds of an event are defined as
  $\mathrm{odds}(Y=1) = \frac{P(Y=1)}{P(Y=0)} = \frac{P(Y=1)}{1 - P(Y=1)} = \frac{p}{1-p}$
- We can go back and forth between odds and probabilities:
  odds = $\frac{p}{1-p}$, and p = odds/(odds + 1)

Odds Ratio
- We saw that an odds ratio (OR) can be helpful for comparisons. Recall the Vitamin A trial:
  $OR = \frac{\mathrm{odds}(\text{death} \mid \text{Vit. A})}{\mathrm{odds}(\text{death} \mid \text{No Vit. A})}$
- The OR here describes the benefit of Vitamin A therapy. We saw for this example that OR = 0.59: an estimated 41% reduction in mortality.
- The OR is a building block for logistic regression.
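The two conversion identities are easy to sanity-check in code; the helper names below are illustrative, not from the lecture:

```python
# Sketch of the odds <-> probability identities above.
def odds(p):
    return p / (1 - p)          # odds = p / (1 - p)

def prob(o):
    return o / (o + 1)          # p = odds / (odds + 1)

print(odds(0.63))               # ~1.70
print(prob(odds(0.63)))         # 0.63: the two maps invert each other
```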

Logistic Regression
- Suppose we want to ask whether the new drug is better than a placebo, and have the following observed data:

  Relief?    No    Yes   Total
  Drug       13    22    35
  Placebo    20    15    35

Odds Ratio
  $OR = \frac{\mathrm{odds}(\text{relief} \mid \text{Drug})}{\mathrm{odds}(\text{relief} \mid \text{Placebo})} = \frac{P(\text{relief} \mid \text{Drug}) / [1 - P(\text{relief} \mid \text{Drug})]}{P(\text{relief} \mid \text{Placebo}) / [1 - P(\text{relief} \mid \text{Placebo})]} = \frac{0.63/(1-0.63)}{0.43/(1-0.43)} = 2.26$

Confidence Intervals for p
[Figure: 95% confidence intervals for p in the Placebo and Drug groups, plotted on the probability scale from 0 to 1.]

Confidence Interval for the OR
- The CI uses Woolf's method for the standard error of $\log(\hat{OR})$:
  $se(\log(\hat{OR})) = \sqrt{\frac{1}{22} + \frac{1}{13} + \frac{1}{15} + \frac{1}{20}} = 0.489$
- Find $(L, U) = \log(\hat{OR}) \pm 1.96\, se(\log(\hat{OR}))$, then take $(e^L, e^U)$.
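A short Python sketch (the lecture itself shows Stata output) reproducing the odds ratio and Woolf interval from the 2x2 table:

```python
# Sketch: OR and Woolf CI from the 2x2 table above.
import math

a, b = 22, 13     # Drug:    relief, no relief
c, d = 15, 20     # Placebo: relief, no relief

or_hat = (a * d) / (b * c)                       # cross-product ratio, 2.26
se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)        # Woolf's SE, ~0.489
L = math.log(or_hat) - 1.96 * se_log
U = math.log(or_hat) + 1.96 * se_log
print(round(or_hat, 2), round(math.exp(L), 2), round(math.exp(U), 2))
# ~2.26 (0.87, 5.88): the slide's (0.86, 5.9) up to rounding
```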

Interpretation
- OR = 2.26, 95% CI: (0.86, 5.9)
- The drug is an estimated 2.26 times better than the placebo. But could the difference be due to chance alone?

Logistic Regression
- Can we set up a model for this similar to what we've done in ANOVA and regression?
- Idea: model the log odds of the event (in this example, relief) as a function of predictor variables.

Model
  $\log[\mathrm{odds}(\text{relief} \mid Tx)] = \log\left(\frac{P(\text{relief} \mid Tx)}{P(\text{no relief} \mid Tx)}\right) = \beta_0 + \beta_1 \cdot Tx$
  where Tx = 1 if Drug, 0 if Placebo.
- Then:
  $\log(\mathrm{odds}(\text{relief} \mid \text{Drug})) = \beta_0 + \beta_1$
  $\log(\mathrm{odds}(\text{relief} \mid \text{Placebo})) = \beta_0$
  $\log(\mathrm{odds}(R \mid D)) - \log(\mathrm{odds}(R \mid P)) = \beta_1$

And ' odds(r D) $ & odds(r P) # % " Thus: log % " = ( And: OR = exp(( ) = e (!! So: exp(( ) = odds ratio of relief for patients taking the Drug-vs-patients taking the Placebo. It s the same! So, why go to all the trouble of setting up a linear model? What if there is a biologic reason to expect that the rate of relief (and perhaps drug efficacy) is age dependent? Logistic Regression Logit estimates Number of obs = 7 LR chi2() = 2.83 Prob > chi2 =.926 Log likelihood = -46.9969 Pseudo R2 =.292 ------------------------------------------------------------------------------ y Coef. Std. Err. z P> z [95% Conf. Interval] -------------+---------------------------------------------------------------- drug.837752.48892.66.96 -.444926.77243 _cons -.287682.34565 -.84.4 -.957372.38773 ------------------------------------------------------------------------------ Estimates: log( odds(relief) ) = ˆ (ˆ ( + Drug = -.288 +.84(Drug) Adding other variables What if Pr(relief) = function of Drug or Placebo AND Age We could easily include age in a model such as: log( odds(relief) ) = ( + ( Drug + ( 2 Age Therefore: OR = exp(.84) = 2.26!

Logistic Regression
- As in MLR, we can include many additional covariates. For a logistic regression model with p predictors:
  $\log(\mathrm{odds}(Y=1)) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$
  where $\mathrm{odds}(Y=1) = \frac{\Pr(Y=1)}{1 - \Pr(Y=1)} = \frac{\Pr(Y=1)}{\Pr(Y=0)}$
- Thus:
  $\log\left(\frac{\Pr(Y=1)}{1 - \Pr(Y=1)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$
- But why use log(odds)?

Logistic regression
- Linear regression might estimate anything in $(-\infty, +\infty)$, not just a proportion in the range of 0 to 1.
- Logistic regression is a way to estimate a proportion (between 0 and 1), as well as some related items.

Linear models for binary outcomes
- We would like to use something like what we know from linear regression:
  Continuous outcome = $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots$
- How can we turn a proportion into a continuous outcome?

Transforming a proportion
- The odds are always positive:
  $\mathrm{odds} = \frac{p}{1-p} \in [0, +\infty)$
- The log odds is continuous:
  $\text{log odds} = \ln\left(\frac{p}{1-p}\right) \in (-\infty, +\infty)$

Logit Function
- Relates the log-odds (logit) to p = Pr(Y = 1).
[Figure: the logit function, log-odds plotted against probability of success from 0 to 1.]

Key Relationships
  Measure                                    Min     Max    Name
  Pr(Y = 1)                                  0       1      probability
  Pr(Y = 1) / [1 - Pr(Y = 1)]                0       inf    odds
  log( Pr(Y = 1) / [1 - Pr(Y = 1)] )         -inf    inf    log-odds or logit

Logit transformation
- Relating log-odds, probabilities, and parameters in logistic regression. Suppose the model:
  $\mathrm{logit}(p) = \beta_0 + \beta_1 X$, i.e., $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X$
- Take anti-logs:
  $\frac{p}{1-p} = \exp(\beta_0 + \beta_1 X)$
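A small sketch of the logit transform and its inverse (often called the expit); the function names here are illustrative:

```python
# Sketch: the logit maps (0, 1) onto the whole real line, and back.
import numpy as np

def logit(p):
    return np.log(p / (1 - p))       # (0, 1) -> (-inf, +inf)

def expit(x):
    return 1 / (1 + np.exp(-x))      # inverse: (-inf, +inf) -> (0, 1)

p = np.array([0.01, 0.25, 0.5, 0.75, 0.99])
print(logit(p))                      # symmetric about 0 at p = 0.5
print(expit(logit(p)))               # recovers p
```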

Solve for p
  $p = (1-p)\exp(\beta_0 + \beta_1 X)$
  $p = \exp(\beta_0 + \beta_1 X) - p\exp(\beta_0 + \beta_1 X)$
  $p + p\exp(\beta_0 + \beta_1 X) = \exp(\beta_0 + \beta_1 X)$
  $p\{1 + \exp(\beta_0 + \beta_1 X)\} = \exp(\beta_0 + \beta_1 X)$
  $p = \frac{\exp(\beta_0 + \beta_1 X)}{1 + \exp(\beta_0 + \beta_1 X)}$

What's the point?
- We can determine the probability of success for a specific set of covariates, X, after running a logistic regression model.

Example: Dependence of Blindness on Age
- The following data concern the Aegean island of Kalytos, where inhabitants suffer from a congenital eye disease whose effects become more marked with age. Samples of 50 people were taken at five different ages, and the numbers of blind people were counted.

Example: Data
  Age                 20       35       45       55       70
  Number blind / 50   6 / 50   7 / 50   26 / 50  37 / 50  44 / 50

Question
- The scientific question of interest is to determine how the probability of blindness is related to age in this population.
- Let $p_i$ = Pr(a person in age class i is blind).

Model 1
  $\mathrm{logit}(p_i) = \beta_0^*$
- $\beta_0^*$ = log-odds of blindness for all ages
- $\exp(\beta_0^*)$ = odds of blindness for all ages
- There is no age dependence in this model.

Model 2
  $\mathrm{logit}(p_i) = \beta_0 + \beta_1(\text{age}_i - 45)$
- $\beta_0$ = log-odds of blindness among 45 year olds
- $\exp(\beta_0)$ = odds of blindness among 45 year olds
- $\beta_1$ = difference in log-odds of blindness comparing a group that is one year older than another
- $\exp(\beta_1)$ = odds ratio of blindness comparing a group that is one year older than another

Results: Model 1

Iteration 0:  log likelihood = -173.08674

Logit estimates                               Number of obs   =        250
                                              LR chi2(0)      =       0.00
                                              Prob > chi2     =          .
Log likelihood = -173.08674                   Pseudo R2       =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |  -.0800427   .1265924    -0.63   0.527    -.3281593    .1680739
------------------------------------------------------------------------------

  $\mathrm{logit}(\hat{p}_i) = -0.08$, or $\hat{p}_i = \frac{\exp(-0.08)}{1 + \exp(-0.08)} = 0.48$
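Model 1's estimate is just the sample log-odds, which is quick to verify (a sketch, assuming 120 of the 250 sampled people are blind, as in the data table above):

```python
# Sketch: the intercept-only MLE is the observed log-odds of blindness.
import math

p_hat = 120 / 250                        # 120 of 250 sampled people blind
print(p_hat)                             # 0.48
print(math.log(p_hat / (1 - p_hat)))     # ~-0.080, the _cons estimate above
```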

Results: Model 2

Logit estimates                               Number of obs   =        250
                                              LR chi2(1)      =      99.30
                                              Prob > chi2     =     0.0000
Log likelihood = -123.43444                   Pseudo R2       =     0.2869

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0940683   .0119755     7.86   0.000     .0705967    .1175399
       _cons |  -4.356185   .5700966    -7.64   0.000    -5.473549   -3.238812
------------------------------------------------------------------------------

  $\mathrm{logit}(\hat{p}_i) = -4.4 + 0.094 \cdot \text{age}_i$
or
  $\hat{p}_i = \frac{\exp(-4.4 + 0.094\,\text{age}_i)}{1 + \exp(-4.4 + 0.094\,\text{age}_i)}$
In the centered parameterization of Model 2, $\hat{\beta}_0 = -4.36 + 0.094 \times 45 = -0.12$, so equivalently $\mathrm{logit}(\hat{p}_i) = -0.12 + 0.094(\text{age}_i - 45)$.

What about the Odds Ratio?
- Maximum likelihood estimates: $\hat{OR} = \exp(\hat{\beta}_1) = 1.10$, s.e.$(\hat{OR}) = 0.013$
- z-test of H0: $\exp(\beta_1) = 1$: z = 7.86, p-val = 0.000
- 95% C.I.: (1.07, 1.12) (computed on the log scale, then exponentiated)
- It appears that blindness is age dependent.
- Note: exp(0) = 1. Where is this fact useful?

Test of significance
- Is the addition of the age variable in the model important?
- Maximum likelihood estimates: $\hat{\beta}_1 = 0.094$, s.e.$(\hat{\beta}_1) = 0.012$
- z-test of H0: $\beta_1 = 0$: z = 7.855, p-val = 0.000
- 95% C.I.: (0.07, 0.12)

Model 1
[Figure: observed proportions blind versus predicted proportions from the intercept-only model, plotted against age; the predicted curve is flat at 0.48.]
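For readers following along in Python rather than Stata, here is a sketch that fits Model 2 to the grouped counts; statsmodels' binomial GLM with a (successes, failures) response is one standard way to handle grouped data like this:

```python
# Sketch: refit the blindness-vs-age model from the grouped binomial counts.
import numpy as np
import statsmodels.api as sm

age = np.array([20, 35, 45, 55, 70])
blind = np.array([6, 7, 26, 37, 44])
n = np.full(5, 50)

X = sm.add_constant(age)                      # columns: intercept, age
resp = np.column_stack([blind, n - blind])    # (successes, failures)
fit = sm.GLM(resp, X, family=sm.families.Binomial()).fit()

print(fit.params)               # ~[-4.36, 0.094], matching the output above
print(np.exp(fit.params[1]))    # OR per year of age, ~1.10
print(fit.predict(X))           # fitted Pr(blind) at ages 20, 35, 45, 55, 70
```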

Model 2
[Figure: observed proportions blind versus predicted proportions from the model including age, plotted against age; the fitted curve tracks the observed proportions closely.]

Summary
- Logistic regression gives us a framework in which to model binary outcomes.
- It uses the structure of linear models, with outcomes modelled as a function of covariates.
- Many concepts carry over from linear regression: interactions, linear splines, tests of significance for coefficients.
- But all coefficients will have different interpretations in logistic regression.

Conclusion
- Model 2 clearly fits better than Model 1!
- Including age in our model is better than the intercept alone.