Logistic Regression 21/05


Recall that we are trying to solve a classification problem in which the features xᵢ can be continuous or discrete (coded as 0/1) and the response y is discrete (0/1). Logistic regression is a standard statistical method that can be used for classification. It plays the role for classification that linear regression plays for regression problems, and it often performs well: in a real-life classification problem, logistic regression is likely to be a good method in practice.

Recall [see last week's notes] that we can't just use the model yᵢ = β₀ + β₁x₁ + … + βₚxₚ, because this could make predictions for yᵢ outside (0, 1). There are many ways of transforming a value in (−∞, ∞) to a value in (0, 1) [see board]. A common one in statistics is the logistic transformation.

Logistic Transformation

f(x) = log( x / (1 − x) )

Also called the log-odds. If p is the probability of an event, p/(1−p) is called the odds; e.g. p = 1/3 corresponds to odds of 1/2. Note: in betting, odds are usually quoted as odds against, which is (1−p)/p. For example:

Ronnie O'Sullivan 2/1
Mark Selby 6/1
Neil Robertson 7/1
Ding Junhui 10/1

P(Ding Junhui wins): the odds against are 10, so (1 − p)/p = 10, giving p = 1/11.

Note: an event with a probability greater than ½ is quoted as odds-on. For example, a horse with a probability of winning of 2/3 has odds p/(1−p) = 2, which is quoted as "2-1 on" or "1-2".
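
As a quick sanity check of this conversion, here is a minimal R sketch (the odds are the ones quoted above; the player names are just labels):

odds_against <- c("O'Sullivan" = 2, "Selby" = 6, "Robertson" = 7, "Ding" = 10)
p <- 1 / (odds_against + 1)   # (1-p)/p = odds against, so p = 1/(odds + 1)
round(p, 4)
# O'Sullivan      Selby  Robertson       Ding
#     0.3333     0.1429     0.1250     0.0909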

[Plots: the odds p/(1−p); the logit f(x); the inverse logit.]

f is useful for transforming variables. Often, if a variable is positive by nature, you should take its log. Similarly, if a variable takes values in (0, 1), you should try taking its logit. Other possibilities: the CDF F(x) of any continuous probability distribution, which maps (−∞, ∞) to (0, 1), with its inverse F⁻¹ going the other way. The logit is particularly convenient because it means that various inferences from logistic regression can be expressed in closed form (just as with multiple linear regression, this requires assumptions which you should know!).
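
A minimal R sketch of the logit and its inverse; note that R also provides these as qlogis() and plogis(), the quantile and distribution functions of the logistic distribution:

logit <- function(x) log(x / (1 - x))      # maps (0,1) to (-Inf, Inf)
invlogit <- function(z) 1 / (1 + exp(-z))  # maps (-Inf, Inf) to (0,1)
logit(0.5)      # 0
invlogit(0)     # 0.5
qlogis(0.5)     # built-in logit: 0
plogis(0)       # built-in inverse logit: 0.5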

Logistic Regression

log[ P(yᵢ = 1) / P(yᵢ = 0) ] = β₀ + β₁x₁ + … + βₚxₚ

Alternatively, this could be written:

yᵢ ~ Bernoulli( logit⁻¹(β₀ + β₁x₁ + … + βₚxₚ) )

Fitted in software by calculating the likelihood and maximising it. (More difficult than for linear regression.)
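
A minimal sketch of fitting this model in R with glm(); the data here are simulated, and the true coefficients −0.5 and 2 are made up purely for illustration:

set.seed(1)
x <- rnorm(200)
p <- 1 / (1 + exp(-(-0.5 + 2 * x)))     # logit^{-1}(beta_0 + beta_1 * x)
y <- rbinom(200, 1, p)                  # Bernoulli responses
fit <- glm(y ~ x, family = "binomial")  # maximum-likelihood fit
coef(fit)                               # estimates should be close to (-0.5, 2)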

Interpreting βᵢ

The βᵢ are log(odds ratios): exp(βᵢ) is the multiplicative factor by which odds(y = 1) increases for a one-unit increase in xᵢ (for xᵢ continuous). To interpret this on the probability scale, you can roughly divide βᵢ by 4 (recommended in Gelman & Hill). This only works near p = 1/2, by considering the slope of the logit curve [demonstrated in class].
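
A quick numeric check of the divide-by-4 rule in R (the coefficient value 0.8 here is hypothetical):

invlogit <- function(z) 1 / (1 + exp(-z))  # as defined earlier
beta <- 0.8
invlogit(beta / 2) - invlogit(-beta / 2)   # exact change in p over a unit interval centred where p = 1/2: about 0.197
beta / 4                                   # the divide-by-4 approximation: 0.2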

glm(formula = ifelse(sp == "B", 1, 0) ~ . - index, data = crabs)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.42869  -0.11730   0.01784   0.13141   0.34693

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.88303    0.09299   9.496  < 2e-16 ***
sexM        -0.13242    0.05027  -2.634 0.009116 **
FL          -0.26735    0.02692  -9.933  < 2e-16 ***
RW          -0.07513    0.02232  -3.365 0.000923 ***
CL          -0.04001    0.03545  -1.129 0.260457
CW           0.25063    0.02304  10.879  < 2e-16 ***
BD          -0.21638    0.03032  -7.137 1.88e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.0317067)

Null deviance: 50.0000 on 199 degrees of freedom
Residual deviance: 6.1194 on 193 degrees of freedom
AIC: -113.8

Number of Fisher Scoring iterations: 2

Deviance = −2 × (maximised value of the log-likelihood). A measure of how well the model fits overall. It is made up of the sum of the squared deviance residuals, which are used in place of ordinary (Pearson) residuals:

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.42869  -0.11730   0.01784   0.13141   0.34693

Confidence intervals for coefficients can be calculated using the normal distribution, e.g. for sexM: −0.13242 ± 1.96 × 0.05027.
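
A sketch of computing such an interval from a fitted model object in R (assuming the fit above is stored as model; confint(model) would instead give profile-likelihood intervals):

cc <- summary(model)$coefficients
est <- cc["sexM", "Estimate"]
se  <- cc["sexM", "Std. Error"]
est + c(-1, 1) * 1.96 * se    # approximate 95% confidence interval for the sexM coefficient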

Null deviance: 50.0000 on 199 degrees of freedom
Residual deviance: 6.1194 on 193 degrees of freedom
AIC: -113.8

If the added variables are just random noise, the deviance is expected to go down by about 1 for each variable added. Under this assumption, the drop in deviance (null minus residual) is chi-squared distributed with (199 − 193) degrees of freedom:

1 - pchisq(50 - 6.11, df = 199 - 193)
[1] 7.772806e-08

This is the analogue of the ANOVA F-test for the overall significance of the model.
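
The same overall test can be carried out with anova() in R; a sketch, assuming the fitted model is stored as model (for a binomial fit, where the dispersion is fixed at 1, this reproduces the pchisq calculation above):

null_model <- update(model, . ~ 1)         # intercept-only model
anova(null_model, model, test = "Chisq")   # drop in deviance compared with chi-squared on 6 df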

Note: just as in linear regression, you should usually centre the variables before doing the analysis. This makes the intercept more meaningful and improves the convergence of the algorithm that fits the model.

Note 2: As usual, SAS outputs quite a bit more. The extra table of statistics (Percent Concordant etc.) contains measures of association between the fitted probabilities and the observed values of y.
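
A minimal sketch of centring the continuous predictors in the crabs data (the factor sex is left alone; column names as in the output above):

crabs_c <- crabs
num_cols <- c("FL", "RW", "CL", "CW", "BD")
crabs_c[num_cols] <- scale(crabs[num_cols], center = TRUE, scale = FALSE)
# the intercept now describes a crab with average measurements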

Using logistic regression for classification

Recall that this section is about classifying things. We can get a 0/1 prediction from logistic regression just by rounding the fitted probability (up to 1 at 0.5 and above, down to 0 below 0.5). Sometimes y = 1 is rare, and you want to include 50% 1's and 50% 0's in your data for logistic regression; then you should adjust the cutoff used for prediction. This is called case-control sampling.

folds <- sample(1:5, 200, replace=TRUE)
cv <- rep(0, 5)
for (i in 1:5){
  train <- crabs[folds != i, ]
  test <- crabs[folds == i, ]
  y <- ifelse(train$sp == "B", 1, 0)
  model <- glm(y ~ ., data=train[,-1], family="binomial")
  pred <- predict(model, test[,-1])
  pred <- 1/(1 + exp(-pred))   # inverse logit of the linear predictor
  cv[i] <- sum(ifelse(pred > 0.5, 1, 0) == ifelse(test[,1] == "B", 1, 0)) / nrow(test)
}

In the crabs data the model classifies perfectly, but logistic regression gives a warning. This is because the fitting algorithm fails to converge if the classes are separated by a hyperplane (which is the case in this example):

Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: fitted probabilities numerically 0 or 1 occurred

Example (if time):

library(MASS)
data(Pima.tr)
data(Pima.te)
model <- glm(type ~ ., data=Pima.tr, family="binomial")
summary(model)
pred <- predict(model, Pima.te)
pred <- 1/(1 + exp(-pred))
pred <- round(pred)
sum(pred == ifelse(Pima.te$type == "Yes", 1, 0)) / nrow(Pima.te)
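
Note that predict() can apply the inverse logit for us, so the two middle lines above can be replaced by a single call with type = "response":

pred <- round(predict(model, Pima.te, type = "response"))   # fitted probabilities, then round to 0/1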