Logistic Regression. Advanced Methods for Data Analysis (36-402/36-608) Spring 2014

1 Classification

1.1 Introduction to classification

Classification, like regression, is a predictive task, but one in which the outcome takes values across discrete categories. Classification problems are very common (arguably just as common as, or perhaps even more common than, regression problems!). Examples:

- Predicting whether a patient will develop breast cancer or remain healthy, given genetic information
- Predicting whether or not a user will like a new product, based on user covariates and a history of his/her previous ratings
- Predicting the region of Italy in which a brand of olive oil was made, based on its chemical composition
- Predicting the next elected president, based on various social, political, and historical measurements

Classification is fundamentally a different problem than regression, and so we will need different tools. In this lecture we will learn one of the most common tools: logistic regression. You should know that there are many, many more methods beyond this one (just like there are many methods for estimating the regression function).

1.2 Why not just use least squares?

Before we present logistic regression, we address the (reasonable) question: why not just use least squares? Consider a classification problem in which we are given samples (x_i, y_i), i = 1, ..., n, and as usual x_i ∈ R^p denotes predictor measurements, and y_i discrete outcomes. In this classification problem, we will assume that y_i can take two values, which we will write (without loss of generality) as 0 and 1. Then, to predict a future outcome Y from a predictor measurement X, we might consider performing linear regression. I.e., we fit coefficients β̂ according to the familiar criterion

$$\min_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (y_i - \beta^T x_i)^2,$$

and to predict the class of Y from X, we round β̂^T X to whichever class is closest, 0 or 1. What is the problem with this, if any? For purely predictive purposes, this actually is not a crazy idea; it tends to give decent predictions. But there are two drawbacks:

1. We cannot use any of the well-established routines for statistical inference with least squares (e.g., confidence intervals, etc.), because these are based on a model in which the outcome is continuously distributed. At an even more basic level, it is hard to precisely interpret β̂.

2. We cannot use this method when the number of classes exceeds 2. If we were to simply code the response as 1, ..., K for a number of classes K > 2, then the ordering here would be arbitrary, but it actually matters. (You might think we could get around this by modeling one class versus the rest with a binary coding, performing K separate regressions, and then using the strongest prediction at the end, given an input X. This is flawed too, however, as we would likely encounter a problem called masking: some class j never ends up being predicted in favor of the others, regardless of the input.)

2 Logistic regression

2.1 The logistic model

Throughout this section we will assume that the outcome has two classes, for simplicity. (We return to the general K class setup at the end.) Logistic regression starts with a different model setup than linear regression: instead of modeling Y as a function of X directly, we model the probability that Y is equal to class 1, given X. First, abbreviate p(X) = P(Y = 1 | X). Then the logistic model is

$$p(X) = \frac{\exp(\beta^T X)}{1 + \exp(\beta^T X)}. \qquad (1)$$

The function on the right-hand side above is called the sigmoid of β^T X. What does it look like for β^T X large and positive? For β^T X large and negative? Plot it to gather some intuition (i.e., plot e^a / (1 + e^a) as a function of a).

Rearranged, the equation (1) says that

$$\log\bigg(\frac{p(X)}{1 - p(X)}\bigg) = \beta^T X. \qquad (2)$$

The left-hand side above is called the log odds or logit of p(X), and is written as logit p(X). In general, logit(a) = log(a / (1 - a)).

Note that assuming (1) (or equivalently, (2)) is a modeling decision, just like it is a modeling decision to use linear regression.

Also note that, to include an intercept term of the form β_0 + β^T X, we just append a 1 to the vector X of predictors, as we do in linear regression.

2.2 Interpreting coefficients

How can we interpret the role of the coefficients β = (β_1, ..., β_p) ∈ R^p in (1) (i.e., in (2))? One nice feature of the logistic model is that it comes equipped with a useful interpretation for these coefficients. Write

$$\frac{p(X)}{1 - p(X)} = e^{\beta^T X} = e^{\beta_1 X_1 + \cdots + \beta_p X_p}.$$

The left-hand side above is the odds of class 1 (conditional on X). We can see that increasing X_j by one unit, while keeping all other predictors fixed, multiplies the odds by e^{β_j}. This is because

$$e^{\beta_1 X_1 + \cdots + \beta_j (X_j + 1) + \cdots + \beta_p X_p} = e^{\beta_1 X_1 + \cdots + \beta_j X_j + \cdots + \beta_p X_p} \cdot e^{\beta_j}.$$
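To gather the intuition suggested above, here is a minimal sketch (not part of the original notes) that plots the sigmoid e^a / (1 + e^a) and checks the odds-multiplication fact numerically; the coefficient and input values are made up purely for illustration.

```python
# Sketch: plot the sigmoid e^a / (1 + e^a), and verify numerically that bumping
# X_j by one unit multiplies the odds by e^{beta_j}. Values here are made up.
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(a):
    return np.exp(a) / (1.0 + np.exp(a))

a = np.linspace(-6, 6, 200)
plt.plot(a, sigmoid(a))              # near 0 for large negative a, near 1 for large positive a
plt.xlabel("a")
plt.ylabel("sigmoid(a)")
plt.show()

beta = np.array([0.5, -1.2])         # hypothetical coefficients (no intercept here)
x = np.array([1.0, 2.0])
x_bumped = x + np.array([1.0, 0.0])  # increase X_1 by one unit
odds = lambda z: np.exp(beta @ z)    # odds p(z) / (1 - p(z)) = e^{beta^T z}
print(odds(x_bumped) / odds(x), np.exp(beta[0]))   # both print e^{beta_1} ≈ 1.6487
```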

Equivalently, write

$$\log\bigg(\frac{p(X)}{1 - p(X)}\bigg) = \beta^T X = \beta_1 X_1 + \cdots + \beta_p X_p.$$

Now increasing X_j by one unit, while keeping all other predictors fixed, changes the log odds by β_j.

It will help to get comfortable with the concept of odds, and log odds, if you haven't done so already in another class. Note that probabilities q close to 0 or 1 have odds q / (1 - q) close to 0 or ∞, respectively. And probabilities q close to 0 or 1 have log odds log(q / (1 - q)) close to -∞ or ∞, respectively.

2.3 Maximum likelihood estimation

Given samples (x_i, y_i) ∈ R^p × {0, 1}, i = 1, ..., n, we let p(x_i) = P(y_i = 1 | x_i), and assume

$$\log\bigg(\frac{p(x_i)}{1 - p(x_i)}\bigg) = \beta^T x_i, \quad i = 1, \ldots, n.$$

To construct an estimate β̂ of the coefficients, we will use the principle of maximum likelihood. I.e., assuming independence of the samples, the likelihood (conditional on x_i, i = 1, ..., n) is

$$L(\beta) = \prod_{i : y_i = 1} p(x_i) \prod_{i : y_i = 0} \big(1 - p(x_i)\big) = \prod_{i=1}^n p(x_i)^{y_i} \big(1 - p(x_i)\big)^{1 - y_i}.$$

We will choose β̂ to maximize this likelihood criterion.

Note that maximizing a function is the same as maximizing the log of that function (because log is monotone increasing). Therefore β̂ is equivalently chosen to maximize the log likelihood

$$\ell(\beta) = \sum_{i=1}^n \Big[ y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \Big].$$

It helps to rearrange this as

$$\ell(\beta) = \sum_{i=1}^n \bigg[ y_i \log\bigg(\frac{p(x_i)}{1 - p(x_i)}\bigg) + \log\big(1 - p(x_i)\big) \bigg].$$

Finally, plugging in log(p(x_i) / (1 - p(x_i))) = x_i^T β and using 1 - p(x_i) = 1 / (1 + exp(x_i^T β)), i = 1, ..., n,

$$\ell(\beta) = \sum_{i=1}^n \Big[ y_i \, x_i^T \beta - \log\big(1 + \exp(x_i^T \beta)\big) \Big]. \qquad (3)$$

You can see that, unlike the least squares criterion for regression, this criterion ℓ(β) does not have a closed-form expression for its maximizer (e.g., try taking its partial derivatives and setting them equal to zero). Hence we have to run an optimization algorithm to find β̂.
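As a concrete illustration (not from the notes), here is a minimal sketch that maximizes (3) with a generic numerical optimizer on synthetic data; scipy.optimize.minimize and the made-up "true" coefficients are choices for this example only, not the method used in the notes.

```python
# Sketch: maximize the log likelihood (3) numerically on synthetic data, by
# minimizing its negative with a generic optimizer. The true coefficients below
# are made up for illustration.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # first column of 1s = intercept
beta_true = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

def neg_log_lik(beta):
    # l(beta) = sum_i [ y_i * x_i^T beta - log(1 + exp(x_i^T beta)) ], negated
    eta = X @ beta
    return -np.sum(y * eta - np.logaddexp(0.0, eta))

beta_hat = minimize(neg_log_lik, np.zeros(p + 1)).x
print(beta_hat)   # should be reasonably close to beta_true
```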

Somewhat remarkably, we can maximize (3) by running repeated weighted least squares regressions! For those of you who have learned a little bit of optimization, this is actually just an instantiation of Newton's method. Applied to the criterion (3), we refer to it as iteratively reweighted least squares, or IRLS.

In short: estimation of β̂ in logistic regression is more involved than it is in linear regression, but it is possible to do so by iteratively using linear regression software.

2.4 Decision boundary

Suppose that we have formed the estimate β̂ of the logistic regression coefficients, as discussed in the last section. To predict the outcome of a new input x ∈ R^p, we form

$$\hat p(x) = \frac{\exp(\hat\beta^T x)}{1 + \exp(\hat\beta^T x)},$$

and then predict the associated class according to

$$\hat f(x) = \begin{cases} 0 & \hat p(x) \le 0.5 \\ 1 & \hat p(x) > 0.5. \end{cases}$$

Equivalently, we can study the log odds

$$\mathrm{logit}\, \hat p(x) = \hat\beta^T x,$$

and predict the associated class using

$$\hat f(x) = \begin{cases} 0 & \hat\beta^T x \le 0 \\ 1 & \hat\beta^T x > 0. \end{cases}$$

The set of all x ∈ R^p such that

$$\hat\beta^T x = \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p = 0$$

is called the decision boundary between classes 0 and 1. On either side of this boundary, we would predict one class or the other.

Remembering the intercept, we would rewrite the decision boundary as

$$\hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_p x_p = 0. \qquad (4)$$

This is a point when p = 1, it is a line when p = 2, and in general it is a (p - 1)-dimensional subspace. We would therefore say that logistic regression has a linear decision boundary; this is because the equation (4) is linear in x.

2.5 Inference

A lot of the standard machinery for inference in linear regression carries over to logistic regression. Recall that we can solve for the logistic regression coefficients β̂ by performing repeated weighted linear regressions; hence we can simply think of the logistic regression estimates β̂ as the result of a single weighted linear regression, the last one in this sequence (upon convergence). Confidence intervals for β̂_j, j = 1, ..., p, and so forth, are then all obtained from this weighted linear regression perspective. We will not go into detail here, but such inferential tools are implemented in software, and it helps to be aware of where they come from.
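For readers who want to see the repeated weighted least squares idea in code, here is a minimal sketch of the IRLS iteration referenced in Sections 2.3 and 2.5; it reuses the synthetic-data setup from the previous sketch and deliberately omits the convergence checks and safeguards a real implementation would include.

```python
# Sketch: iteratively reweighted least squares (IRLS) for the criterion (3).
# Each iteration is one weighted least squares solve; no convergence checks or
# step-size safeguards here. Synthetic data as in the previous sketch.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0, 2.0])))))

def irls(X, y, n_iter=15):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X @ beta))     # current fitted probabilities
        w = prob * (1.0 - prob)                    # weights (diagonal of W)
        z = X @ beta + (y - prob) / w              # working response
        XtW = X.T * w                              # X^T W, with W stored as a vector
        beta = np.linalg.solve(XtW @ X, XtW @ z)   # weighted least squares step
    return beta

print(irls(X, y))   # agrees with the maximum likelihood estimate from before
```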

2.6 Multinomial regression

With more than two classes, the story is similar. Now we use an extension of the logistic model called the multinomial model, which, given K classes for the outcome Y, takes the form

$$P(Y = 1 \mid X) = \frac{\exp(\beta_1^T X)}{1 + \sum_{j=1}^{K-1} \exp(\beta_j^T X)}$$
$$P(Y = 2 \mid X) = \frac{\exp(\beta_2^T X)}{1 + \sum_{j=1}^{K-1} \exp(\beta_j^T X)}$$
$$\vdots$$
$$P(Y = K - 1 \mid X) = \frac{\exp(\beta_{K-1}^T X)}{1 + \sum_{j=1}^{K-1} \exp(\beta_j^T X)}$$
$$P(Y = K \mid X) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(\beta_j^T X)}.$$

Equivalently, we can write this in log odds form as

$$\log\bigg(\frac{P(Y = 1 \mid X)}{P(Y = K \mid X)}\bigg) = \beta_1^T X$$
$$\log\bigg(\frac{P(Y = 2 \mid X)}{P(Y = K \mid X)}\bigg) = \beta_2^T X$$
$$\vdots$$
$$\log\bigg(\frac{P(Y = K - 1 \mid X)}{P(Y = K \mid X)}\bigg) = \beta_{K-1}^T X.$$

The interpretation of coefficients is similar to before: increasing X_l by one unit while keeping all other predictors fixed changes log( P(Y = j | X) / P(Y = K | X) ) by β_{jl}.

Estimation proceeds by maximum likelihood, as before; inference is again drawn from treating the estimates as the result of a single weighted linear regression.

Finally, predictions are made at an input x by forming

$$\hat p_1(x) = \frac{\exp(\hat\beta_1^T x)}{1 + \sum_{j=1}^{K-1} \exp(\hat\beta_j^T x)}$$
$$\hat p_2(x) = \frac{\exp(\hat\beta_2^T x)}{1 + \sum_{j=1}^{K-1} \exp(\hat\beta_j^T x)}$$
$$\vdots$$
$$\hat p_{K-1}(x) = \frac{\exp(\hat\beta_{K-1}^T x)}{1 + \sum_{j=1}^{K-1} \exp(\hat\beta_j^T x)}$$
$$\hat p_K(x) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(\hat\beta_j^T x)},$$

and then predicting the class according to

$$\hat f(x) = \operatorname*{argmax}_{j = 1, \ldots, K} \hat p_j(x).$$
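To make the prediction rule concrete, here is a minimal sketch (not from the notes) that forms p̂_1(x), ..., p̂_K(x) from K - 1 coefficient vectors, with class K as the baseline, and then takes the argmax; the "fitted" coefficients below are made up for illustration, not estimates from any real fit.

```python
# Sketch: multinomial class probabilities and the argmax prediction rule, with
# K = 3 classes, class K as the baseline, and made-up "fitted" coefficients.
import numpy as np

beta_hat = np.array([[0.5, -1.0, 0.3],    # beta_hat_1: (intercept, coef for x_1, coef for x_2)
                     [-0.2, 0.8, -0.5]])  # beta_hat_2

def predict(x):
    x1 = np.concatenate([[1.0], x])                  # append a 1 for the intercept
    scores = np.concatenate([beta_hat @ x1, [0.0]])  # baseline class K gets score 0
    probs = np.exp(scores) / np.sum(np.exp(scores))  # p_hat_1, ..., p_hat_K
    return np.argmax(probs) + 1, probs               # classes labeled 1, ..., K

label, probs = predict(np.array([0.4, -1.2]))
print(label, probs)
```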