MS&E 226: Small Data
Lecture 9: Logistic regression (v2)
Ramesh Johari
ramesh.johari@stanford.edu
1 / 28

Regression methods for binary outcomes 2 / 28

Binary outcomes
For the duration of this lecture, suppose the outcome variable $Y_i \in \{0, 1\}$ for each $i$. (Much of what we cover generalizes to discrete outcomes with more than two levels, but we focus on the binary case.) 3 / 28

Linear regression?
Why doesn't linear regression work well for prediction? A picture:
[Figure: scatter plot of the binary outcome Y (taking values 0 and 1) against X.]
4 / 28

Logistic regression
At its core, logistic regression is a method that directly addresses this issue with linear regression: it produces fitted values that always lie in [0, 1].
Input: Sample data X and Y.
Output: A fitted model $\hat{f}(\cdot)$, where we interpret $\hat{f}(X)$ as an estimate of the probability that the corresponding outcome Y is equal to 1.
5 / 28

Logistic regression: Basics 6 / 28

The population model
Logistic regression assumes the following population model:
\[ P(Y = 1 \mid X) = \frac{\exp(X\beta)}{1 + \exp(X\beta)} = 1 - P(Y = 0 \mid X). \]
A plot in the case with only one covariate, and $\beta_0 = 0$, $\beta_1 = 1$:
[Figure: the curve $P(Y = 1 \mid X)$ rising from 0 to 1 as X ranges from -10 to 10.]
7 / 28

Logistic curve
The function g given by:
\[ g(q) = \log\left( \frac{q}{1 - q} \right) \]
is called the logit function. It has the following inverse, called the logistic curve:
\[ g^{-1}(z) = \frac{\exp(z)}{1 + \exp(z)}. \]
In terms of g, we can write the population model as:
\[ P(Y = 1 \mid X) = g^{-1}(X\beta). \]
(This is one example of a generalized linear model (GLM); for a GLM, g is called the link function.)
8 / 28
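
A quick numerical check of this correspondence in R (not part of the slides): base R's plogis is exactly the logistic curve $g^{-1}$, and the logit g inverts it.

logit    <- function(q) log(q / (1 - q))        # g
logistic <- function(z) exp(z) / (1 + exp(z))   # g^{-1}

z <- seq(-3, 3, by = 0.5)
all.equal(logistic(z), plogis(z))   # TRUE: plogis is the logistic curve g^{-1}
all.equal(logit(plogis(z)), z)      # TRUE: g inverts g^{-1}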

Logistic curve
Note that from this curve we see some important characteristics of logistic regression:
- The logistic curve is increasing. Therefore, in logistic regression, larger values of covariates that have positive coefficients will tend to increase the probability that Y = 1.
- When z > 0, then $g^{-1}(z) > 1/2$; when z < 0, then $g^{-1}(z) < 1/2$. Therefore, when $X\beta > 0$, Y is more likely to be one than zero; and conversely, when $X\beta < 0$, Y is more likely to be zero than one.
9 / 28

Interpretations: Log odds ratio
In many ways, the choice of a logistic regression model is a matter of practical convenience, rather than any fundamental understanding of the population: it allows us to neatly employ regression techniques for binary data.
One way to interpret the model:
- Given a probability 0 < q < 1, the quantity q/(1 - q) is called the odds ratio. The odds ratio lies in $(0, \infty)$.
- So logistic regression uses a model that suggests the log odds ratio of Y given X is linear in the covariates. Note the log odds ratio lies in $(-\infty, \infty)$.
10 / 28
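
To see this directly, take the log odds ratio under the population model from slide 7:
\[
\log \frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)}
  = \log \frac{\exp(X\beta) / (1 + \exp(X\beta))}{1 / (1 + \exp(X\beta))}
  = \log \exp(X\beta) = X\beta.
\]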

Interpretations: Latent variables
Another way to interpret the model is the latent variable approach:
- Suppose that, given a vector of covariates X, a logistic random variable Z is sampled independently: $P(Z < z) = g^{-1}(z)$.
- Define Y = 1 if $X\beta + Z > 0$, and Y = 0 otherwise.
By this definition, Y = 1 if and only if $Z > -X\beta$; since the logistic distribution is symmetric about zero, this event has probability $g^{-1}(X\beta)$: exactly the logistic regression population model.
11 / 28
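
A small simulation illustrating this (a sketch, not from the slides; the coefficient and covariate values are made up): sampling Y through the latent logistic variable reproduces the logistic probability.

set.seed(1)
beta <- c(0.5, -1)     # hypothetical coefficients
x    <- c(1, 0.3)      # hypothetical covariate vector (first entry is the intercept)
n    <- 1e5

z <- rlogis(n)                           # latent logistic noise: P(Z < z) = plogis(z) = g^{-1}(z)
y <- as.integer(sum(x * beta) + z > 0)   # Y = 1 iff X*beta + Z > 0

mean(y)                 # simulated P(Y = 1 | X)
plogis(sum(x * beta))   # model probability g^{-1}(X*beta); the two should agree closely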

Interpretations: Latent variables
The latent variable interpretation is particularly popular in econometrics, where it is a first example of discrete choice modeling. For example, in modeling customer choice over whether or not to buy a product, suppose:
- Each customer has a feature vector X.
- This customer's utility for purchasing the item is the (random) quantity $X\beta + Z$, and the utility for not purchasing the item is zero.
Then the probability the customer purchases the item (i.e., has positive utility for purchasing) is exactly the logistic regression model. This is a very basic example of a random utility model for customer choice.
12 / 28

Fitting and interpreting logistic regression models 13 / 28

Finding logistic regression coefficients
So far we have only discussed the population model for logistic regression; we haven't yet described an actual method for finding coefficients.
Logistic regression uses the population model we've discussed to suggest a way to find the coefficients $\hat\beta$ of a fitted model. The basic outline is:
- Assume that the population follows the logistic regression model.
- Find coefficients (parameters) $\hat\beta$ that maximize the chance of seeing the data Y and X, given the logistic regression population model.
This is an example of the maximum likelihood approach to fitting a model. (We will discuss it more when we discuss inference.) For now we investigate what the results of this fitting procedure yield for real data.
14 / 28
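
As an illustration of what "maximize the chance of seeing the data" means here, a minimal sketch of the maximum likelihood fit written out with optim (the data are simulated and the variable names are made up; in practice glm does this for you, as on the next slide):

set.seed(1)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))              # design matrix with an intercept column
beta_true <- c(-0.5, 1.0, -2.0)
y <- rbinom(n, 1, plogis(X %*% beta_true))     # simulate binary outcomes from the model

# Negative log-likelihood of the logistic regression model
negloglik <- function(b) {
  p <- plogis(X %*% b)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

fit <- optim(par = rep(0, ncol(X)), fn = negloglik, method = "BFGS")
fit$par                                        # maximum likelihood estimate of beta
coef(glm(y ~ X[, -1], family = "binomial"))    # should be essentially identical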

Example: The CORIS dataset
Recall: 462 South African males evaluated for heart disease.
Outcome variable: Coronary heart disease (chd).
Covariates:
- Systolic blood pressure (sbp)
- Cumulative tobacco use (tobacco)
- LDL cholesterol (ldl)
- Adiposity (adiposity)
- Family history of heart disease (famhist)
- Type A behavior (typea)
- Obesity (obesity)
- Current alcohol consumption (alcohol)
- Age (age)
15 / 28

Logistic regression
To run logistic regression in R, we use the glm function, as follows:

> library(arm)   # for display()
> fm = glm(formula = chd ~ ., family = "binomial", data = coris)
> display(fm)
            coef.est coef.se
(Intercept) -6.15    1.31
...
famhist      0.93    0.23
...
obesity     -0.06    0.04
...
16 / 28

Interpreting the output
Recall that if a coefficient is positive, it increases the probability that the outcome is 1 in the fitted model (since the logistic function is increasing).
So, for example, obesity has a negative coefficient. What does this mean? Do you believe the implication?
17 / 28

Interpreting the output
In linear regression, coefficients were directly interpretable; this is not as straightforward for logistic regression. Some approaches to interpretation of $\hat\beta_j$ (holding all other covariates constant):
- $\hat\beta_j$ is the additive change in the log odds ratio for Y per unit change in $X_j$.
- $\exp(\hat\beta_j)$ is the multiplicative factor by which the odds ratio for Y changes per unit change in $X_j$.
18 / 28
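
For example, applied to the fitted model fm from two slides back (a sketch; it assumes fm is still in the workspace and that famhist is coded numerically, so its coefficient is named "famhist" as in the output above):

exp(coef(fm)["famhist"])   # about exp(0.93) = 2.5: family history multiplies the odds of chd by roughly 2.5
exp(coef(fm)["obesity"])   # about exp(-0.06) = 0.94: each added unit of obesity multiplies the odds by roughly 0.94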

The divide by 4 rule
We can also directly differentiate the fitted model with respect to $X_j$. By differentiating $g^{-1}(X\hat\beta)$, we find that the change in the (fitted) probability Y = 1 per unit change in $X_j$ is:
\[ \left( \frac{\exp(X\hat\beta)}{[1 + \exp(X\hat\beta)]^2} \right) \hat\beta_j. \]
Note that the term in parentheses cannot be any larger than 1/4. Therefore, $|\hat\beta_j| / 4$ is an upper bound on the magnitude of the change in the fitted probability that Y = 1, per unit change in $X_j$.
19 / 28
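
A quick numerical illustration of the rule (a sketch; the 0.93 reuses the famhist coefficient from the fitted model above):

# The derivative factor p(1 - p), where p = plogis(z), is maximized at z = 0
z <- seq(-6, 6, by = 0.01)
max(plogis(z) * (1 - plogis(z)))   # = 0.25, attained at z = 0

0.93 / 4                           # about 0.23: upper bound on the change in the fitted
                                   # probability of chd per unit change in famhist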

Residuals
Here is what happens when we plot residuals vs. fitted values:
[Figure: residuals.coris plotted against fitted.coris.]
Why is this plot not informative?
20 / 28

Residuals
An alternative approach: bin the residuals based on fitted value, and then plot the average residual in each bin. E.g., when we divide the data into 50 bins, here is what we obtain:
[Figure: binned residual plot of average residual against expected values.]
21 / 28
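
One way to produce such a plot by hand (a sketch, assuming the fitted model fm and the coris data frame from before, with residuals taken on the response scale; the arm package's binnedplot function produces a similar plot directly):

fitted.coris    <- fitted(fm)                   # fitted probabilities
residuals.coris <- coris$chd - fitted.coris     # response-scale residuals

# 50 bins with roughly equal numbers of observations, cut at quantiles of the fitted values
bins <- cut(fitted.coris,
            breaks = quantile(fitted.coris, probs = seq(0, 1, length.out = 51)),
            include.lowest = TRUE)
avg_fitted   <- tapply(fitted.coris, bins, mean)
avg_residual <- tapply(residuals.coris, bins, mean)

plot(avg_fitted, avg_residual,
     xlab = "Expected Values", ylab = "Average residual",
     main = "Binned residual plot")
abline(h = 0, lty = 2)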

Logistic regression as a classifier 22 / 28

Classification
Logistic regression serves as a classifier in the following natural way:
- Given the estimated coefficients $\hat\beta$ and a new covariate vector X, compute $X\hat\beta$.
- If the resulting value is positive, return Y = 1 as the predicted value.
- If the resulting value is negative, return Y = 0 as the predicted value.
23 / 28
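
In R this amounts to thresholding the linear predictor at zero, or equivalently the fitted probability at 1/2 (a sketch, again assuming the fm object from before; new_data is a hypothetical data frame of new observations):

eta  <- predict(fm, newdata = new_data, type = "link")       # X %*% beta-hat
prob <- predict(fm, newdata = new_data, type = "response")   # g^{-1}(X %*% beta-hat)

yhat1 <- as.integer(eta > 0)
yhat2 <- as.integer(prob > 0.5)
identical(yhat1, yhat2)    # TRUE: the two rules agree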

Linear classification
The idea is to directly use the model to choose the most likely value for Y, given the model: when $X\hat\beta > 0$, the fitted model gives $P(Y = 1 \mid X, \hat\beta) > 1/2$, so we make the prediction that Y = 1.
Note that logistic regression is therefore an example of a linear classifier: the boundary between covariate vectors where we predict Y = 1 (resp., Y = 0) is linear.
24 / 28

Tuning logistic regression
As with other classifiers, we can tune logistic regression to trade off false positives and false negatives. In particular, suppose we choose a threshold t, and predict Y = 1 whenever $X\hat\beta > t$.
When t = 0, we recover the classification rule on the preceding slide. This is the rule that minimizes average 0-1 loss on the training data.
What happens to our classifier when $t \to \infty$? What happens when $t \to -\infty$?
As usual, you can plot an ROC curve for the resulting family of classifiers as t varies.
25 / 28
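
A sketch of tracing out the ROC curve by sweeping the threshold t (it reuses fm and coris from before, and for simplicity evaluates on the training data):

eta <- predict(fm, type = "link")      # linear predictor on the training data
y   <- coris$chd

thresholds <- sort(unique(eta))
tpr <- sapply(thresholds, function(t) mean(eta[y == 1] > t))   # true positive rate at threshold t
fpr <- sapply(thresholds, function(t) mean(eta[y == 0] > t))   # false positive rate at threshold t

plot(fpr, tpr, type = "l", xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)                  # reference line for a random classifier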

Regularization 26 / 28

Regularized logistic regression
As with linear regression, regularized logistic regression is often used in the presence of many features. In practice, the most common regularization technique is to add the penalty $\lambda \sum_j |\hat\beta_j|$ to the maximum likelihood problem; this is the equivalent of the lasso for logistic regression.
As for linear regression, this penalty has the effect of selecting a subset of the parameters (for sufficiently large values of the regularization penalty $\lambda$).
27 / 28
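
In R this is commonly done with the glmnet package (a sketch, assuming the coris data frame; glmnet takes a numeric covariate matrix rather than a formula):

library(glmnet)

x <- model.matrix(chd ~ ., data = coris)[, -1]   # covariate matrix (drop the intercept column)
y <- coris$chd

# alpha = 1 gives the lasso penalty lambda * sum(|beta_j|); cv.glmnet picks lambda by cross-validation
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
coef(cvfit, s = "lambda.min")    # sparse coefficient vector: some entries are exactly zero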

Do you believe the model?
Logistic regression highlights one of the inherent tensions in working with models:
- If we take the population model literally, it is meant to be a probabilistic description of the process that generated our data.
- At the same time, logistic regression is an estimation procedure that largely exists for its technical convenience. (Do we really believe that the population model is logistic?)
At times like this it is useful to remember:
"All models are wrong, but some are more useful than others." (George E.P. Box)
28 / 28