22s:152 Applied Linear Regression. Example: Study on lead levels in children. Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression

22s:152 Applied Linear Regression
Ch. 14 (sec. 1) and Ch. 15 (sec. 1 & 4): Logistic Regression

Logistic Regression

When the response variable is a binary variable, such as 0 or 1, live or die, fail or succeed, then we approach our modeling a little differently.

What is wrong with our previous modeling? Let's look at an example...

Example: Study on lead levels in children

- Binary response called high: either high lead blood level (1) or low lead blood level (0).
- One continuous predictor variable called soil: a measure of the level of lead in the soil in the subject's backyard.

> sl <- read.csv("lead.csv")
> attach(sl)
> head(sl)
  high soil
1    1  290
2    0   90
3    1  894
4    0   93
5    1   40
6    1   40

The dependent variable plotted against the independent variable:

[Plot: high vs. soil, with soil running from 0 to 5000]

Consider the usual regression model:

    Y_i = β_0 + β_1 x_i + ε_i

The fitted model and residuals:

    Ŷ = 0.378 + 0.0003 x

[Plots: the fitted line over the original and jittered points, residuals (lm.out$residuals) vs. fitted values (lm.out$fitted.values), and a normal Q-Q plot of the residuals]

All sorts of violations here...
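A minimal R sketch of the naive linear fit and its diagnostic plots, assuming (as above) that lead.csv has the columns high (0/1 response) and soil (continuous predictor):

    sl <- read.csv("lead.csv")

    lm.out <- lm(high ~ soil, data = sl)  # ordinary least squares on a 0/1 response
    coef(lm.out)                          # compare to Y-hat = 0.378 + 0.0003 x

    par(mfrow = c(1, 3))
    plot(sl$soil, jitter(sl$high, factor = 0.2),   # jitter the 0/1 points for visibility
         xlab = "soil", ylab = "high (jittered)")
    abline(lm.out)                                 # fitted line over the data
    plot(lm.out$fitted.values, lm.out$residuals,   # residuals vs. fitted values
         xlab = "fitted values", ylab = "residuals")
    qqnorm(lm.out$residuals)                       # normal Q-Q plot of the residuals
    qqline(lm.out$residuals)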

Violations:

- Our usual ε_i ~ N(0, σ²) isn't reasonable. The errors are not normal, and they don't have the same variability across all x-values.

- Predictions for observations can be outside [0,1]. For x = 3000, Ŷ = 1.4, and this is not possibly an average of the 0s and 1s at x = 3000 (like a conditional mean given x):

      Ŷ | x=3000 = 1.4, which is not in [0,1].

What is a better way to model a 0-1 response using regression? At each x value, there's a certain chance of a 0 and a certain chance of a 1.

[Plot: high vs. soil, with soil running from 0 to 5000]

For example, in this data set, it looks more likely to get a 1 at higher values of x. Or, P(Y = 1) changes with the x-value.

Conditioning on a given x, we have P(Y = 1), and P(Y = 1) ∈ (0,1). We will consider this probability of getting a 1 given the x-value(s) in our modeling...

    π_i = P(Y_i = 1 | X_i)

The previous plot shows that P(Y = 1) depends on the value of x. It is actually a transformation of this probability π_i that we will use as our response in the regression model.

Odds of an Event

First, let's discuss the odds of an event. When two fair coins are flipped,

    P(two heads) = 1/4
    P(not two heads) = 3/4

The odds in favor of getting two heads is:

    odds = P(2 heads) / P(not 2 heads) = (1/4) / (3/4) = 1/3

sometimes referred to as "1 to 3" odds. You're 3 times as likely to not get 2 heads as you are to get 2 heads.
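The odds arithmetic is easy to mirror in R; a quick sketch (the names p and odds are illustrative, not from the notes):

    p <- 1/4                 # P(two heads) with two fair coins
    odds <- p / (1 - p)      # odds in favor of two heads
    odds                     # 1/3, i.e. "1 to 3" odds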

For a binary variable Y (2 possible outcomes), the odds in favor of Y = 1 is

    P(Y = 1) / (1 − P(Y = 1)) = P(Y = 1) / P(Y = 0)

For example, if P(heart attack) = 0.0018, then the odds of a heart attack is

    0.0018 / (1 − 0.0018) = 0.0018 / 0.9982 = 0.001803

The ratio of the odds for two different groups is also a quantity of interest. For example, consider heart attacks for a male nonsmoker vs. a male smoker.

Suppose P(heart attack) = 0.0036 for a male smoker, and P(heart attack) = 0.0018 for a male nonsmoker.

Then, the odds ratio (O.R.) for a heart attack in nonsmoker vs. smoker is

    O.R. = odds of a heart attack for non-smoker / odds of a heart attack for smoker
         = [ P(heart attack | non-smoker) / (1 − P(heart attack | non-smoker)) ] / [ P(heart attack | smoker) / (1 − P(heart attack | smoker)) ]
         = (0.0018 / 0.9982) / (0.0036 / 0.9964)
         = 0.499

Interpretation of the odds ratio for binary 0-1 data:

    O.R. ≥ 0

- If O.R. = 1.0, then P(Y = 1) is the same in both samples.
- If O.R. < 1.0, then P(Y = 1) is less in the numerator group than in the denominator group.
- O.R. = 0 if and only if P(Y = 1) = 0 in the numerator sample.

Back to Logistic Regression...

The response variable we will model is a transformation of P(Y_i = 1) for a given X_i. The transformation is the logit transformation:

    logit(a) = log( a / (1 − a) )

The response variable we will use:

    logit[ P(Y_i = 1) ] = log( P(Y_i = 1) / (1 − P(Y_i = 1)) )

This is the log_e of the odds that Y_i = 1. Notice:

    P(Y_i = 1) ∈ (0,1)  ⇒  P(Y_i = 1) / (1 − P(Y_i = 1)) ∈ (0,∞)
    and  −∞ < log[ P(Y_i = 1) / (1 − P(Y_i = 1)) ] < ∞

The logistic regression model:

    log[ P(Y_i = 1) / (1 − P(Y_i = 1)) ] = β_0 + β_1 X_1i + ··· + β_k X_ki

The response on the left isn't bounded by [0,1] even though the Y-values themselves are (having the response bounded by [0,1] was a problem before). The response on the left can feasibly be any positive or negative quantity. This is a nice characteristic, because the right side of the equation can potentially give any possible predicted value.

The logistic regression model is a GENERALIZED LINEAR MODEL: a linear model on the right, something other than the usual continuous Y on the left.
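A short R sketch of the odds-ratio arithmetic and the logit transformation (qlogis() is R's built-in logit, log(a/(1−a)), and plogis() is its inverse; the object names are illustrative):

    p_ns <- 0.0018                 # P(heart attack | non-smoker), from the notes
    p_s  <- 0.0036                 # P(heart attack | smoker)
    odds_ns <- p_ns / (1 - p_ns)   # 0.001803
    odds_s  <- p_s  / (1 - p_s)
    odds_ns / odds_s               # O.R. = 0.499

    qlogis(0.5)                    # logit(0.5) = 0: even odds
    plogis(0)                      # inverse logit of 0 = 0.5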

Let's look at the fitted logistic regression model for the lead level data with one X covariate:

    log[ P̂(Y_i = 1) / (1 − P̂(Y_i = 1)) ] = β̂_0 + β̂_1 X_i = −1.5160 + 0.0027 X_i

[Plot: high (jittered) vs. soil, with the fitted S-shaped curve]

The curve represents E(Y_i | x_i). And...

    E(Y_i | x_i) = 0 · P(Y_i = 0 | x_i) + 1 · P(Y_i = 1 | x_i) = P(Y_i = 1 | x_i)

We can manipulate the regression model to put it in terms of P(Y_i = 1) = E(Y_i). If we take this regression model

    log[ P(Y_i = 1) / (1 − P(Y_i = 1)) ] = −1.5160 + 0.0027 X_i

and solve for P(Y_i = 1), we get...

    P(Y_i = 1) = exp(−1.5160 + 0.0027 X_i) / [ 1 + exp(−1.5160 + 0.0027 X_i) ]

The value on the right is bounded to [0,1]. Because our β_1 is positive:

- As X gets larger, P(Y_i = 1) goes to 1.
- As X gets smaller, P(Y_i = 1) goes to 0.

The fitted curve, i.e. the function of X_i on the right, is S-shaped (sigmoidal).

In the general model with one covariate:

    log[ P(Y_i = 1) / (1 − P(Y_i = 1)) ] = β_0 + β_1 X_i

which means

    P(Y_i = 1) = exp(β_0 + β_1 X_i) / [ 1 + exp(β_0 + β_1 X_i) ]
               = 1 / ( 1 + exp[ −(β_0 + β_1 X_i) ] )

If β_1 is positive, we get the S-shape that goes to 1 as X_i goes to ∞ and goes to 0 as X_i goes to −∞ (as in the lead example). If β_1 is negative, we get the opposite S-shape that goes to 0 as X_i goes to ∞ and goes to 1 as X_i goes to −∞.

Another way to think of logistic regression is that we're modeling a Bernoulli random variable occurring for each X_i, and the Bernoulli parameter π_i depends on the covariate value. Then,

    Y_i | X_i ~ Bernoulli(π_i)

where π_i represents P(Y_i = 1 | X_i), and

    E(Y_i | X_i) = π_i  and  V(Y_i | X_i) = π_i (1 − π_i)

We're thinking in terms of the conditional distribution of Y | X (we don't have constant variance across the X values; the mean and variance are tied together). Writing π_i as a function of X_i,

    π_i = exp(β_0 + β_1 X_i) / [ 1 + exp(β_0 + β_1 X_i) ]
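A minimal sketch of the fitted S-curve, using the rounded estimates from the notes (the helper name pi_hat is illustrative; plogis() is R's built-in inverse logit):

    pi_hat <- function(x) plogis(-1.5160 + 0.0027 * x)  # P(Y = 1 | soil = x)

    curve(pi_hat(x), from = 0, to = 5000,
          xlab = "soil", ylab = "P(Y = 1 | soil)")      # sigmoidal, rising since beta1 > 0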

Interpretation of parameters (for the lead levels example)

Intercept β_0: When X_i = 0, there is no lead in the soil in the backyard. Then,

    log[ P(Y_i = 1) / (1 − P(Y_i = 1)) ] = β_0

So, β_0 is the log-odds that a randomly selected child with no lead in their backyard has a high lead blood level. Or,

    π_i = e^(β_0) / ( 1 + e^(β_0) )

is the chance that a kid with no lead in their backyard has a high lead blood level. For this data, β̂_0 = −1.5160, or π̂_i = 0.1800.

Coefficient β_1: Consider the log_e of the odds ratio (O.R.) of having high lead blood levels for the following 2 groups...

1. those with exposure level of x
2. those with exposure level of x + 1

    log[ (π_2 / (1 − π_2)) / (π_1 / (1 − π_1)) ] = log[ e^(β_0) e^(β_1 (x+1)) / ( e^(β_0) e^(β_1 x) ) ] = log( e^(β_1) ) = β_1

β_1 is the log_e of the O.R. for a unit increase in x. It compares the groups with x exposure and (x + 1) exposure. Or, un-doing the log, e^(β_1) is the O.R. comparing the two groups.

For this data, β̂_1 = 0.0027, or e^(0.0027) = 1.0027. A unit increase in X increases the odds of having a high lead blood level by a factor of 1.0027. Though this value is small, the range of the X values is large (40 to 5255), so it can have a substantial impact when you consider the full spectrum of x values.

Prediction

What is the predicted probability of high lead blood level for a child with X = 500?

    P(Y_i = 1) = exp(−1.5160 + 0.0027 X_i) / [ 1 + exp(−1.5160 + 0.0027 X_i) ]
               = 1 / ( 1 + exp[ −(−1.5160 + 0.0027 X_i) ] )
               = 1 / ( 1 + exp[ 1.5160 − 0.0027 · 500 ] )
               = 0.4586

What is the predicted probability of high lead blood level for a child with X = 4000?

    P(Y_i = 1) = 1 / ( 1 + exp[ 1.5160 − 0.0027 · 4000 ] ) = 0.9999

[Plot: high (jittered) vs. soil with the fitted S-curve]
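The predicted probabilities above are quick to verify in R with the rounded estimates from the notes (pi_hat is the same illustrative helper as before):

    pi_hat <- function(x) plogis(-1.5160 + 0.0027 * x)
    pi_hat(c(0, 500, 4000))   # approximately 0.1800, 0.4586, 0.9999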

What is the predicted probability of high lead blood level for a child with X = 0?

    P(Y_i = 1) = 1 / ( 1 + exp[ 1.5160 ] ) = 0.1800

So, there's still a chance of having a high lead blood level even when the backyard doesn't have any lead. This is why we don't see the low end of our fitted S-curve go to zero.

Testing Significance of Covariate

> glm.out <- glm(high ~ soil, family = binomial(logit))
> summary(glm.out)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5160568  0.3380483  -4.485 7.30e-06 ***
soil         0.0027202  0.0005385   5.051 4.39e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Soil is a significant predictor in the logistic regression model.

A couple things...

- The method of maximum likelihood is used to fit the model.
- Estimates β̂_j are asymptotically normal, so R uses Z-tests (or Wald tests) for covariate significance in the output. [NOTE: squaring a z-statistic gets you a chi-squared statistic with 1 df; see the sketch below.]
- General concepts are easily extended to logistic regression with many covariates.
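A small sketch of the Wald-test note, using the z value for soil from the summary output (pnorm() and pchisq() are R's normal and chi-squared distribution functions):

    z <- 5.051                                 # z value for soil from summary(glm.out)
    2 * pnorm(-abs(z))                         # two-sided normal p-value, ~4.4e-07
    pchisq(z^2, df = 1, lower.tail = FALSE)    # same p-value via chi-squared with 1 df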