Logistic Regression and Generalized Linear Models

Sridhar Mahadevan, mahadeva@cs.umass.edu
University of Massachusetts, CMPSCI 689

Topics
- Generative vs. discriminative models: in many cases it is difficult to model the data with a parametric class-conditional density P(X | ω, θ), yet a linear decision boundary is often adequate to separate the classes (Gaussian densities with a shared covariance matrix also produce a linear decision boundary).
- Logistic regression: a discriminative model for classification that produces linear decision boundaries.
- The model fitting problem is solved using maximum likelihood.
- An iterative gradient-based algorithm solves the nonlinear maximum likelihood equations: recursive weighted least squares regression.
- Logistic regression is an instance of a generalized linear model (GLM), a framework covering a large variety of exponential-family models. GLMs can in turn be extended to generalized additive models (GAMs).

Discriminative vs. Generative Models
Both generative and discriminative approaches address the problem of modeling the discriminant function P(y | x): the distribution of output labels (or values) y conditioned on the input x.
In generative models, we estimate both P(y) and P(x | y), and use Bayes rule to compute the discriminant:
    P(y | x) ∝ P(y) P(x | y)
Discriminative approaches model the conditional distribution P(y | x) directly and ignore the marginal P(x).
We now turn to several instances of discriminative models: logistic regression in this class, and later several others, including support vector machines.

Generalized Linear Models
In linear regression, we model the output y as a linear function of the input variables plus a noise term that is zero-mean, constant-variance Gaussian:
    y = g(x) + ε
where the conditional mean is E(y | x) = g(x) = β^T x (with β_0 an offset term), and ε is the noise term.
We saw earlier that the maximum likelihood framework justifies the squared-error loss function, provided the errors are IID Gaussian (the variance does not matter).
We want to generalize this idea of specifying a model family by specifying the type of error distribution:
- When the output variable y is discrete (e.g., binary or multinomial), the noise term is not Gaussian but binomial or multinomial.
- A change in the mean is accompanied by a change in the variance, and we want the model to couple mean and variance accordingly.
Generalized linear models provide a rich toolkit of models based on specifying the error distribution.

Logit Function
Since the output variable y only takes on the values 0 and 1 (for binary classification), we need a way of representing E(y | x) so that it stays in (0, 1). One convenient form is the sigmoid, or logistic, function. Let the input be a vector x = (x_1, ..., x_p). The logistic function is S-shaped and approaches 0 (as β^T x → -∞) or 1 (as β^T x → ∞):
    P(y = 1 | x, β) = µ(x | β) = e^{β^T x} / (1 + e^{β^T x}) = 1 / (1 + e^{-β^T x})
    P(y = 0 | x, β) = 1 - µ(x | β) = 1 / (1 + e^{β^T x})
We assume an extra input x_0 = 1, so that β_0 acts as an offset. Inverting the transformation gives the logit function:
    g(x | β) = log [ µ(x | β) / (1 - µ(x | β)) ] = β^T x
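
As a small illustration, the logistic and logit functions can be written directly in R (a sketch; the function names sigmoid and logit are ours, not the slides'):

    sigmoid <- function(a) 1 / (1 + exp(-a))     # maps the real line to (0, 1)
    logit   <- function(mu) log(mu / (1 - mu))   # maps (0, 1) back to the real line

    # The two are inverses of each other:
    a <- seq(-4, 4, by = 0.5)
    all.equal(logit(sigmoid(a)), a)              # TRUE, up to floating-point error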

Logistic Regression
[Figure: network diagram of logistic regression, with inputs x_0, x_1, x_2 connected to the output y through weights β_0, β_1, β_2.]

Example Dataset for Logistic Regression
The data set we analyze concerns coronary heart disease in South Africa. The chd response (output) variable is binary (yes, no), and there are 9 predictor variables: systolic blood pressure (sbp), tobacco, ldl, famhist, obesity, alcohol, age, adiposity, and typea. There are 462 instances, of which 160 are cases (positive instances) and 302 are controls (negative instances).
Let's focus on a subset of the predictors: sbp, tobacco, ldl, famhist, obesity, alcohol, and age. We want to fit a model of the form
    P(chd = 1 | x, β) = 1 / (1 + e^{-β^T x})
where
    β^T x = β_0 + β_1 x_sbp + β_2 x_tobacco + β_3 x_ldl + β_4 x_famhist + β_5 x_age + β_6 x_alcohol + β_7 x_obesity

Noise Model for Logistic Regression
Let us try to represent the logistic regression model as y = µ(x | β) + ε and ask ourselves what sort of noise model ε represents. Since y takes on the value 1 with probability µ(x | β), it follows that ε can only take on two possible values:
- If y = 1, then ε = 1 - µ(x | β), which happens with probability µ(x | β).
- Conversely, if y = 0, then ε = -µ(x | β), which happens with probability 1 - µ(x | β).
This analysis shows that the error term in logistic regression is a binomially distributed random variable. Its moments can be computed readily:
    E(ε) = µ(x | β)(1 - µ(x | β)) - (1 - µ(x | β))µ(x | β) = 0   (the error term has mean 0)
    Var(ε) = E(ε²) - (E ε)² = E(ε²) = µ(x | β)(1 - µ(x | β))   (show this!)
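
A quick simulation illustrates these moments (a sketch; the value of µ used here is arbitrary and purely illustrative):

    set.seed(1)
    mu  <- 0.3                               # mu(x | beta) for some fixed input x
    y   <- rbinom(1e6, size = 1, prob = mu)  # simulated outputs
    eps <- y - mu                            # the noise term epsilon
    mean(eps)                                # approximately 0
    var(eps)                                 # approximately mu * (1 - mu) = 0.21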

Maximum Likelihood for LR
Suppose we want to fit a logistic regression model to a dataset of n observations (x_1, y_1), ..., (x_n, y_n). The conditional likelihood of a single observation is simply
    P(y_i | x_i, β) = µ(x_i | β)^{y_i} (1 - µ(x_i | β))^{1 - y_i}
Hence the conditional likelihood of the entire dataset can be written as
    P(Y | X, β) = ∏_{i=1}^n µ(x_i | β)^{y_i} (1 - µ(x_i | β))^{1 - y_i}
and the conditional log-likelihood is then simply
    l(β | X, Y) = ∑_{i=1}^n [ y_i log µ(x_i | β) + (1 - y_i) log(1 - µ(x_i | β)) ]

Maximum Likelihood for LR
We solve the conditional log-likelihood equation by taking gradients:
    ∂l(β | X, Y)/∂β_k = ∑_{i=1}^n [ y_i (1/µ(x_i | β)) ∂µ(x_i | β)/∂β_k - (1 - y_i) (1/(1 - µ(x_i | β))) ∂µ(x_i | β)/∂β_k ]
Using the fact that ∂µ(x_i | β)/∂β_k = ∂/∂β_k [ 1/(1 + e^{-β^T x_i}) ] = µ(x_i | β)(1 - µ(x_i | β)) x^i_k, we get
    ∂l(β | X, Y)/∂β_k = ∑_{i=1}^n x^i_k (y_i - µ(x_i | β))
Setting this to 0, and since x_0 = 1, the first component of these equations reduces to
    ∑_{i=1}^n y_i = ∑_{i=1}^n µ(x_i | β)
That is, the expected number of instances of each class must match the observed number.
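
These quantities are straightforward to code directly; a minimal R sketch (the function names are illustrative, X is assumed to be an n × (p+1) design matrix with a leading column of ones, and y a 0/1 response vector):

    # Conditional log-likelihood l(beta | X, Y)
    loglik <- function(beta, X, y) {
      mu <- 1 / (1 + exp(-drop(X %*% beta)))
      sum(y * log(mu) + (1 - y) * log(1 - mu))
    }

    # Gradient: the k-th component is sum_i x^i_k (y_i - mu(x_i | beta)), i.e. X'(y - mu)
    loglik_grad <- function(beta, X, y) {
      mu <- 1 / (1 + exp(-drop(X %*% beta)))
      drop(t(X) %*% (y - mu))
    }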

Newton-Raphson Method
Newton's method is a general procedure for finding the roots of an equation f(θ) = 0. It is based on the recursion
    θ_{t+1} = θ_t - f(θ_t) / f'(θ_t)
We want to find the maximum of the log-likelihood. A maximum of a function f(θ) occurs exactly where its derivative f'(θ) = 0, so plugging f'(θ) in for f(θ) above gives
    θ_{t+1} = θ_t - f'(θ_t) / f''(θ_t)
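
For intuition, a one-dimensional R sketch of the root-finding recursion, applied to the illustrative function f(θ) = θ² - 2 (whose positive root is √2):

    newton <- function(f, fprime, theta0, tol = 1e-10, maxit = 100) {
      theta <- theta0
      for (t in seq_len(maxit)) {
        step  <- f(theta) / fprime(theta)   # Newton step f(theta) / f'(theta)
        theta <- theta - step
        if (abs(step) < tol) break
      }
      theta
    }

    newton(function(th) th^2 - 2, function(th) 2 * th, theta0 = 1)   # about 1.414214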

Fisher Scoring
In logistic regression the parameter β is a vector, so we have to use the multivariate Newton-Raphson update
    β_{t+1} = β_t - H^{-1} ∇_β l(β_t | X, Y)
Here ∇_β l(β_t | X, Y) is the vector of partial derivatives of the log-likelihood, and H, with entries H_ij = ∂²l(β | X, Y)/∂β_i ∂β_j, is the Hessian matrix of second-order derivatives. The use of Newton's method to solve the conditional log-likelihood equations is called Fisher scoring.

Fisher Scoring for Maximum Likelihood
Taking second derivatives of the likelihood score equations gives
    ∂²l(β | X, Y)/∂β_k ∂β_m = - ∑_{i=1}^n x^i_k x^i_m µ(x_i | β)(1 - µ(x_i | β))
We can use matrix notation to write the Newton-Raphson algorithm for logistic regression. Define the n × n diagonal matrix
    W = diag( µ(x_1 | β)(1 - µ(x_1 | β)), µ(x_2 | β)(1 - µ(x_2 | β)), ..., µ(x_n | β)(1 - µ(x_n | β)) )
Let Y be the n × 1 column vector of output values, X the n × (p + 1) design matrix of input values, and P the column vector of fitted probabilities µ(x_i | β).

Iterative Weighted Least Squares
The gradient of the log-likelihood can be written in matrix form as
    ∂l(β | X, Y)/∂β = ∑_{i=1}^n x_i (y_i - µ(x_i | β)) = X^T (Y - P)
The Hessian can be written as
    ∂²l(β | X, Y)/∂β ∂β^T = -X^T W X
The Newton-Raphson update then becomes
    β_new = β_old + (X^T W X)^{-1} X^T (Y - P)
          = (X^T W X)^{-1} X^T W ( X β_old + W^{-1} (Y - P) )
          = (X^T W X)^{-1} X^T W Z
where Z ≡ X β_old + W^{-1} (Y - P).
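
Putting the update into code, here is a minimal R sketch of the iteration (not the slides' own code; it assumes X is an n × (p+1) design matrix whose first column is all ones and y is a 0/1 response vector):

    irls_logistic <- function(X, y, maxit = 25, tol = 1e-8) {
      beta <- rep(0, ncol(X))
      for (t in seq_len(maxit)) {
        p <- 1 / (1 + exp(-drop(X %*% beta)))   # fitted probabilities, the vector P
        w <- p * (1 - p)                        # diagonal entries of W
        z <- drop(X %*% beta) + (y - p) / w     # adjusted response Z = X beta + W^{-1}(Y - P)
        beta_new <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))   # (X'WX)^{-1} X'WZ
        if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
        beta <- beta_new
      }
      beta
    }

On the same data, the result should agree (up to numerical tolerance) with R's own glm(y ~ X - 1, family = binomial), since glm also fits by iteratively reweighted least squares.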

Weighted Least Squares Regression
Weighted least squares regression finds the best least-squares solution to the equation WAx ≈ Wb:
    (WA)^T WA x̂ = (WA)^T Wb
    x̂ = (A^T C A)^{-1} A^T C b,   where C = W^T W
Returning to logistic regression, we now see that β_new = (X^T W X)^{-1} X^T W Z is a weighted least-squares regression, where X plays the role of the matrix A above, W is a diagonal weight matrix with entries µ(x_i | β)(1 - µ(x_i | β)), and Z corresponds to the vector b above. It is termed recursive weighted least squares because at each step the weight matrix W keeps changing (since the β's are changing). We can visualize RWLS as solving
    β_new = argmin_β (Z - Xβ)^T W (Z - Xβ)
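
In R, each such weighted step could equally be computed with lm and its weights argument, which minimizes the weighted residual sum of squares (a sketch; the helper name wls_step is ours):

    wls_step <- function(X, z, w) {
      # Solves argmin_b (z - X b)' W (z - X b) with W = diag(w);
      # the "- 1" drops lm's intercept because X already contains a column of ones.
      coef(lm(z ~ X - 1, weights = w))
    }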

Stochastic Gradient Ascent
Newton's method is often referred to as a second-order method because it involves the Hessian. This can be difficult in large problems because it requires matrix inversion. One way to avoid this is to accept slower convergence in exchange for less work at each step. For each training instance (x, y) we can derive an incremental gradient update rule from
    ∂l(β | x, y)/∂β_j = x_j (y - µ(x | β))
The stochastic gradient ascent rule for instance (x_i, y_i) is then
    β_j ← β_j + α (y_i - µ(x_i | β)) x^i_j
Unlike Newton's method, the convergence of this update rule can be erratic. It depends on tuning the learning rate α so that the steps are neither too small nor too large, and possibly on a cooling schedule.
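
A minimal R sketch of stochastic gradient ascent for logistic regression (not the slides' code; the constant learning rate alpha and the number of passes over the data are illustrative choices):

    sga_logistic <- function(X, y, alpha = 0.01, epochs = 50) {
      beta <- rep(0, ncol(X))
      for (e in seq_len(epochs)) {
        for (i in sample(nrow(X))) {                      # visit instances in random order
          mu_i <- 1 / (1 + exp(-sum(X[i, ] * beta)))      # mu(x_i | beta)
          beta <- beta + alpha * (y[i] - mu_i) * X[i, ]   # incremental gradient step
        }
      }
      beta
    }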

The LMS Algorithm
The LMS (least mean square) algorithm solves least-squares regression problems incrementally, taking the gradient of the loss function with respect to the parameters and adjusting the weights for each data instance. The hypothesis is
    h(x | β) = β_0 + ∑_{j=1}^p β_j x_j = β^T x
Given a data set D, we want to find the vector β that minimizes the (mean) squared error loss
    L(h) = (1/2) ∑_i (y_i - h(x_i; β))²
LMS algorithm: compute the gradient on a particular instance,
    ∂L(h)/∂β_j = -(y - h(x | β)) x_j
and adjust the weight in the direction of decreasing error (the negative gradient):
    β_j ← β_j + α (y - h(x | β)) x_j
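
The corresponding R sketch for LMS differs from the logistic update only in using the linear prediction β^T x in place of the logistic mean (again, the learning rate is an illustrative choice):

    lms <- function(X, y, alpha = 1e-3, epochs = 50) {
      beta <- rep(0, ncol(X))
      for (e in seq_len(epochs)) {
        for (i in sample(nrow(X))) {
          resid <- y[i] - sum(X[i, ] * beta)     # y - h(x | beta)
          beta  <- beta + alpha * resid * X[i, ]
        }
      }
      beta
    }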

Logistic Regression vs LDA
Recall from Bayes decision theory that when the class-conditional densities P(x | ω_i, µ_i, Σ) share the same underlying covariance matrix Σ, the decision boundary that separates the classes is a line (or hyperplane). Such Bayesian classifiers are called linear discriminant classifiers. Logistic regression also produces a decision boundary that is a hyperplane, since its discriminant function was shown to be
    g(x | β) = log [ µ(x | β) / (1 - µ(x | β)) ] = β^T x
So, if both Gaussian LDA and logistic regression produce linear decision boundaries, which is preferable? As a general rule, logistic regression works in a larger class of problems because it does not assume that the underlying class-conditional densities are Gaussian.

Generalized Linear Models
A generalized linear model is specified using two functions:
- A link function that describes how the mean depends on the linear predictor: g(µ) = η, where η = β^T x. In logistic regression the link function is the logit, g(µ) = log[µ / (1 - µ)] = β^T x; in linear regression the link function is simply the identity, g(µ) = µ. The inverse link g^{-1}(η) = µ describes how the mean µ is recovered from the linear predictor; for logistic regression the inverse link is the sigmoid.
- A variance function that specifies how the variance of the output y depends on the mean: Var(y) = φ V(µ), where φ is a constant. In logistic regression the variance is µ(1 - µ), because the distribution is binomial; in linear regression the variance function is simply 1, because the error term is modeled as having constant variance.

Generalized Linear Models
    Distribution   Link Function        Variance Function
    Gaussian       µ                    1
    Binomial       log[µ / (1 - µ)]     µ(1 - µ)
    Poisson        log µ                µ
    Gamma          1/µ                  µ²
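
R's built-in family objects encode exactly these link and variance functions, which makes the table easy to check interactively (these accessors are part of base R's stats package):

    fam <- binomial()
    fam$linkfun(0.25)        # logit link: log(0.25 / 0.75)
    fam$linkinv(0)           # inverse link (sigmoid): 0.5
    fam$variance(0.25)       # mu * (1 - mu) = 0.1875
    poisson()$variance(3)    # mu = 3
    Gamma()$variance(3)      # mu^2 = 9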

Multiway Classification
As one more example of a generalized linear model, we generalize logistic regression to a multinomial output (e.g., sorting email into several categories). The model specifies
    log [ P(Y = 1 | X = x, β) / P(Y = K | X = x, β) ] = β_1^T x
    log [ P(Y = 2 | X = x, β) / P(Y = K | X = x, β) ] = β_2^T x
    ...
    log [ P(Y = K-1 | X = x, β) / P(Y = K | X = x, β) ] = β_{K-1}^T x
It easily follows that
    P(Y = k | X = x, β) = e^{β_k^T x} / (1 + ∑_{l=1}^{K-1} e^{β_l^T x}),   k = 1, ..., K - 1
    P(Y = K | X = x, β) = 1 / (1 + ∑_{l=1}^{K-1} e^{β_l^T x})
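
A small R sketch of these class-probability formulas (illustrative code: B is assumed to be a (p+1) × (K-1) matrix whose columns are β_1, ..., β_{K-1}):

    multiclass_probs <- function(x, B) {
      e <- exp(drop(t(B) %*% x))   # e^{beta_k' x} for k = 1, ..., K - 1
      c(e, 1) / (1 + sum(e))       # (P(Y = 1 | x), ..., P(Y = K | x)); last entry is class K
    }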

Multinomial Link Function
The multinomial PMF can be written as a member of the exponential family. Writing ỹ_i = 1{y = i} for i = 1, ..., K-1,
    P(y | φ) = φ_1^{1{y=1}} φ_2^{1{y=2}} ... φ_K^{1{y=K}}
             = φ_1^{ỹ_1} φ_2^{ỹ_2} ... φ_K^{1 - ∑_{i=1}^{K-1} ỹ_i}
             = e^{η^T ỹ - a(η)}
The last expression is an instance of the generic form of the exponential family (check the Weber handout or earlier notes). Instead of a single link function, we now have a vector of links:
    η = ( log(φ_1/φ_K), log(φ_2/φ_K), ..., log(φ_{K-1}/φ_K) )^T

Fitting GLM Models in R
The statistics package R has comprehensive built-in support for fitting generalized linear models (and many related models as well). This is a very brief introduction to model fitting in R; see the R documentation, as well as the excellent text Statistical Models in S by Chambers and Hastie. The main function is glm(formula, family, ...), where formula is a symbolic description of the model to be fit, and family describes the error distribution and link function to be used.

Heart Disease Dataset
The data set we are analyzing is coronary heart disease in South Africa. The chd response (output) variable is binary (yes, no), and there are 9 predictor variables: systolic blood pressure (sbp), tobacco, ldl, famhist, obesity, alcohol, age, adiposity, and typea.
The R command
    glm(chd ~ sbp + tobacco + ldl + famhist + age + alcohol + obesity, family = binomial, data = heart)
fits a logistic regression model to the heart disease data set (assuming the dataset has been loaded as the data frame heart). The command
    glm(chd ~ sbp + tobacco + ldl + famhist + age + alcohol + obesity, family = binomial(link = "probit"), data = heart)
fits the same data using the probit link (the inverse CDF of the standard normal distribution) instead of the logit.
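
Once the model is fit, the usual R accessors apply (a sketch, assuming as above that the data frame is named heart):

    fit <- glm(chd ~ sbp + tobacco + ldl + famhist + age + alcohol + obesity,
               family = binomial, data = heart)
    summary(fit)                      # coefficients, standard errors, deviance
    coef(fit)                         # the fitted beta vector
    predict(fit, type = "response")   # fitted probabilities mu(x | beta)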

Heart Disease Data
The following model was found to be the best fit to the data, using maximum likelihood:
    Coefficient       Estimate
    (Intercept)      -4.1290787
    sbp               0.0057608
    tobacco           0.0795237
    ldl               0.1847710
    famhistpresent    0.9391330
    age               0.0425344
    alcohol           0.0006058
    obesity          -0.0345467

Polynomial Regression
Load the data with data(Cars93), which contains information about 93 models of cars; names(Cars93) lists the attributes that make up the dataset:
    [1] "Manufacturer"       "Model"              "Type"
    [4] "Min.Price"          "Price"              "Max.Price"
    [7] "MPG.city"           "MPG.highway"        "AirBags"
    [10] "DriveTrain"         "Cylinders"          "EngineSize"
    [13] "Horsepower"         "RPM"                "Rev.per.mile"
    [16] "Man.trans.avail"    "Fuel.tank.capacity" "Passengers"
    [19] "Length"             "Wheelbase"          "Width"
    [22] "Turn.circle"        "Rear.seat.room"     "Luggage.room"
    [25] "Weight"             "Origin"             "Make"
Suppose we want to fit a polynomial regression model with Price as the predictor variable and Weight as the response variable. We can plot these two variables using
    plot(Weight ~ Price, data = Cars93)

Fitting GLM Models in R: Cars
[Figure: scatter plot of Weight versus Price for the Cars93 data.]

Fitting GLM Models in R
The R command to fit a d-th degree polynomial is
    glm(Weight ~ poly(Price, d), family = gaussian, data = Cars93)
Notice how we specified the family to be gaussian (which uses the identity link).
[Figure: Weight versus Price for the Cars93 data with the fitted polynomial model.]
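
Putting the Cars93 example end to end (a sketch; Cars93 ships with the MASS package, and the degree 2 used here is an arbitrary illustrative choice):

    library(MASS)                    # provides the Cars93 data frame
    fit <- glm(Weight ~ poly(Price, 2), family = gaussian, data = Cars93)
    plot(Weight ~ Price, data = Cars93)
    ord <- order(Cars93$Price)
    lines(Cars93$Price[ord], fitted(fit)[ord])   # overlay the fitted polynomial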

Generalized Additive Models
In the regression setting, a generalized additive model has the form
    E(Y | X_1, ..., X_p) = α + f_1(X_1) + f_2(X_2) + ... + f_p(X_p)
Here the f_i are unspecified smooth (nonparametric) functions. In the classification setting, a generalized additive logistic regression model has the form
    log [ µ(x | β) / (1 - µ(x | β)) ] = α + f_1(X_1) + f_2(X_2) + ... + f_p(X_p)
See Section 9.1.2 of Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning for how to fit additive logistic regression models to an email-spam dataset.
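
For completeness, a generalized additive logistic regression of this form could be fit in R with the mgcv package (a hedged sketch: it reuses the heart data frame assumed earlier, and the choice of which predictors to smooth is purely illustrative):

    library(mgcv)
    gam_fit <- gam(chd ~ s(sbp) + s(tobacco) + s(ldl) + famhist + s(age),
                   family = binomial, data = heart)
    summary(gam_fit)   # effective degrees of freedom and significance of each smooth term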