Logistic Regression. Seungjin Choi


Logistic Regression
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
http://mlg.postech.ac.kr/~seungjin

Binary Classification

Binary Classification

Consider a binary classification problem, i.e., $\mathcal{C} = \{C_1, C_2\}$. Given a particular vector $x$, we wish to assign it to one of the two classes. To this end, we use Bayes' rule and calculate the relevant posterior probability:

$$
P(C_1 \mid x) = \frac{P(x \mid C_1)\,P(C_1)}{P(x)}
= \frac{P(x \mid C_1)\,P(C_1)}{P(x \mid C_1)\,P(C_1) + P(x \mid C_2)\,P(C_2)}
= \frac{1}{1 + \frac{P(x \mid C_2)\,P(C_2)}{P(x \mid C_1)\,P(C_1)}}
= \frac{1}{1 + \exp\left\{ -\log\left[\frac{P(x \mid C_1)}{P(x \mid C_2)}\right] - \log\left[\frac{P(C_1)}{P(C_2)}\right] \right\}}.
$$

It can be written in the form of the logistic function:

$$P(C_1 \mid x) = \frac{1}{1 + e^{-\xi}},$$

where

$$\xi = \underbrace{\log\left[\frac{P(x \mid C_1)}{P(x \mid C_2)}\right]}_{\text{likelihood ratio}} + \underbrace{\log\left[\frac{P(C_1)}{P(C_2)}\right]}_{\text{prior ratio}}.$$

Choose a particular form for the class-conditional density function $P(x \mid C_i)$.

Multivariate Gaussian Model

Assume that the class-conditional densities are multivariate Gaussian with identical covariance matrix $\Sigma$:

$$P(x \mid C_i) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu_i)^\top \Sigma^{-1} (x - \mu_i) \right\}.$$

After a simple calculation, we have

$$P(C_1 \mid x) = \frac{1}{1 + e^{-(w^\top x + b)}},$$

where

$$w = \Sigma^{-1}(\mu_1 - \mu_2), \qquad b = \frac{1}{2}(\mu_2 + \mu_1)^\top \Sigma^{-1} (\mu_2 - \mu_1) + \log\left[\frac{P(C_1)}{P(C_2)}\right].$$
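As a sanity check, the sketch below (a minimal NumPy example; the class means, shared covariance, and priors are made-up values for illustration) compares the Bayes posterior computed directly from the two Gaussian densities with the logistic form $\sigma(w^\top x + b)$ using the $w$ and $b$ above.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
D = 2
mu1, mu2 = np.array([1.0, 0.5]), np.array([-0.5, -1.0])   # assumed class means
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])                 # assumed shared covariance
p1, p2 = 0.6, 0.4                                          # assumed class priors

# Logistic-form parameters derived on the slide
Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)
b = 0.5 * (mu2 + mu1) @ Sigma_inv @ (mu2 - mu1) + np.log(p1 / p2)

x = rng.normal(size=D)                                     # an arbitrary test point

# Posterior via Bayes' rule with the Gaussian class-conditionals
num = multivariate_normal.pdf(x, mu1, Sigma) * p1
den = num + multivariate_normal.pdf(x, mu2, Sigma) * p2
posterior_bayes = num / den

# Posterior via the logistic function
posterior_logistic = 1.0 / (1.0 + np.exp(-(w @ x + b)))

print(posterior_bayes, posterior_logistic)  # the two numbers agree
```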

Properties of Logistic Function

- $\sigma(\xi) \to 0$ as $\xi \to -\infty$.
- $\sigma(\xi) \to 1$ as $\xi \to \infty$.
- $\sigma(-\xi) = 1 - \sigma(\xi)$.
- $\frac{d}{d\xi}\left[\sigma(\xi)\right] = \sigma(\xi)\,\sigma(-\xi)$.

[Figure: plot of $\sigma(\xi)$ for $\xi \in [-5, 5]$, rising monotonically from 0 to 1.]
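These identities are easy to verify numerically; the snippet below is a small sketch that checks the symmetry property and the derivative identity with a central finite difference.

```python
import numpy as np

def sigma(xi):
    return 1.0 / (1.0 + np.exp(-xi))

xi = np.linspace(-5, 5, 11)
eps = 1e-6

# sigma(-xi) = 1 - sigma(xi)
assert np.allclose(sigma(-xi), 1.0 - sigma(xi))

# d/dxi sigma(xi) = sigma(xi) * sigma(-xi), checked by central differences
finite_diff = (sigma(xi + eps) - sigma(xi - eps)) / (2 * eps)
assert np.allclose(finite_diff, sigma(xi) * sigma(-xi), atol=1e-8)
```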

Maximum Likelihood Formulation

Logistic Regression

Predict a binary output $y_n \in \{0, 1\}$ from an input $x_n$. Logistic regression models the input-output relation by a conditional Bernoulli distribution with

$$\mathbb{E}[y_n \mid x_n] = P(y_n = 1 \mid x_n) = \sigma(w^\top x_n),$$

where

$$\sigma(\xi) = \frac{1}{1 + e^{-\xi}} = \frac{e^{\xi}}{1 + e^{\xi}}.$$

Discriminative model: directly model $p(y \mid x)$.

Logistic Regression: MLE

Given $\{(x_n, y_n) \mid n = 1, \ldots, N\}$, the likelihood is given by

$$p(y \mid X, w) = \prod_{n=1}^{N} p(y_n = 1 \mid x_n)^{y_n} \left(1 - p(y_n = 1 \mid x_n)\right)^{1 - y_n}
= \prod_{n=1}^{N} \sigma(w^\top x_n)^{y_n} \left(1 - \sigma(w^\top x_n)\right)^{1 - y_n}.$$

The log-likelihood function is given by

$$\mathcal{L} = \sum_{n=1}^{N} \log p(y_n \mid x_n) = \sum_{n=1}^{N} \left\{ y_n \log \sigma_n + (1 - y_n) \log(1 - \sigma_n) \right\},$$

where $\sigma_n = \sigma(w^\top x_n)$.

This is a nonlinear function of $w$ whose maximum cannot be computed in closed form. Iteratively reweighted least squares (IRLS), derived from Newton's method, is a popular algorithm.
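As a concrete reference point, here is a minimal NumPy sketch of this log-likelihood. Note that the code stores examples as rows of X (whereas the slides use columns), and the clipping constant is an assumption added purely for numerical safety.

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def log_likelihood(w, X, y, eps=1e-12):
    """L(w) = sum_n y_n log sigma_n + (1 - y_n) log(1 - sigma_n).

    X: (N, D) matrix whose rows are the inputs x_n; y: (N,) vector of 0/1 labels.
    """
    s = np.clip(sigmoid(X @ w), eps, 1 - eps)  # sigma_n, clipped to avoid log(0)
    return np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))
```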

Newton's Method

Mathematical Preliminaries

- Gradient
- Hessian matrix
- Gradient descent/ascent
- Newton's method

Gradient

Consider a real-valued function $f(x)$ that takes a real-valued vector $x \in \mathbb{R}^D$ as input, $f(x): \mathbb{R}^D \to \mathbb{R}$. The gradient of $f(x)$ is defined by

$$\nabla f(x) = \frac{\partial f(x)}{\partial x} = \left[ \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_D} \right]^\top.$$

Hessian Matrix

If $f(x)$ belongs to the class $\mathcal{C}^2$, the Hessian matrix $H$ is defined as the symmetric matrix with $(i, j)$-element $\frac{\partial^2 f(x)}{\partial x_i \partial x_j}$:

$$H = \nabla^2 f(x) = \left[ \frac{\partial^2 f(x)}{\partial x_i \partial x_j} \right] =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_D} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_D \partial x_1} & \frac{\partial^2 f}{\partial x_D \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_D^2}
\end{bmatrix}
= \frac{\partial}{\partial x}\left[ \nabla f(x) \right]^\top.$$
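A finite-difference check is a handy way to validate analytic gradients and Hessians like the ones derived later for logistic regression. The sketch below uses an arbitrary quadratic test function (our own choice, not from the slides) whose gradient is $Ax - b$ and whose Hessian is $A$.

```python
import numpy as np

def f(x, A, b):
    # Test function f(x) = 0.5 x^T A x - b^T x (chosen only for illustration)
    return 0.5 * x @ A @ x - b @ x

def numerical_gradient(func, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (func(x + e) - func(x - e)) / (2 * eps)
    return g

D = 3
rng = np.random.default_rng(1)
A = rng.normal(size=(D, D))
A = A @ A.T + np.eye(D)            # symmetric positive definite
b = rng.normal(size=D)
x = rng.normal(size=D)

g_analytic = A @ x - b             # gradient of f
H_analytic = A                     # Hessian of f (constant for a quadratic)
g_numeric = numerical_gradient(lambda z: f(z, A, b), x)
print(np.allclose(g_analytic, g_numeric, atol=1e-5))  # True
```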

Gradient Descent/Ascent

Gradient descent/ascent is a simple (first-order) iterative method for minimization/maximization.

Gradient descent (iterative minimization):

$$w^{(k+1)} = w^{(k)} - \eta \left( \frac{\partial \mathcal{J}}{\partial w} \right).$$

Gradient ascent (iterative maximization):

$$w^{(k+1)} = w^{(k)} + \eta \left( \frac{\partial \mathcal{J}}{\partial w} \right).$$

Learning rate: $\eta > 0$.
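A minimal gradient-descent loop looks like the sketch below; the quadratic objective, step size, and iteration count are arbitrary choices for illustration.

```python
import numpy as np

# Minimize J(w) = 0.5 ||A w - y||^2 by gradient descent (illustrative objective)
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
y = rng.normal(size=20)

def grad_J(w):
    return A.T @ (A @ w - y)                   # dJ/dw

w = np.zeros(3)
eta = 1.0 / (np.linalg.norm(A, ord=2) ** 2)    # step size <= 1/L guarantees convergence here
for k in range(500):
    w = w - eta * grad_J(w)                    # descent step; use "+" to ascend an objective

print(w, np.linalg.lstsq(A, y, rcond=None)[0])  # gradient descent vs. closed-form solution
```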

Newton's Method

The basic idea of Newton's method is to optimize a quadratic approximation of the objective function $\mathcal{J}(w)$ around the current point $w^{(k)}$. The second-order Taylor series expansion of $\mathcal{J}(w)$ at the current point $w^{(k)}$ gives

$$\widetilde{\mathcal{J}}(w) = \mathcal{J}(w^{(k)}) + \left[\nabla \mathcal{J}(w^{(k)})\right]^\top \left(w - w^{(k)}\right) + \frac{1}{2} \left(w - w^{(k)}\right)^\top \nabla^2 \mathcal{J}(w^{(k)}) \left(w - w^{(k)}\right),$$

where $\nabla^2 \mathcal{J}(w^{(k)})$ is the Hessian of $\mathcal{J}(w)$ evaluated at $w = w^{(k)}$. Differentiating this with respect to $w$ and setting it equal to $0$ leads to

$$\nabla \mathcal{J}(w^{(k)}) + \nabla^2 \mathcal{J}(w^{(k)})\, w - \nabla^2 \mathcal{J}(w^{(k)})\, w^{(k)} = 0.$$

Thus we have

$$w = w^{(k)} - \left[\nabla^2 \mathcal{J}(w^{(k)})\right]^{-1} \nabla \mathcal{J}(w^{(k)}).$$

Hence Newton's method is of the form

$$w^{(k+1)} = w^{(k)} - \left[\nabla^2 \mathcal{J}(w^{(k)})\right]^{-1} \nabla \mathcal{J}(w^{(k)}).$$

Remark: the Hessian $\nabla^2 \mathcal{J}(w^{(k)})$ should be positive definite at every iteration $k$.
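The following sketch applies a single Newton step to the same kind of quadratic objective used above (again an illustrative choice). For a quadratic, one step lands exactly on the minimizer, and the positive-definiteness remark corresponds to the Hessian $A^\top A$ being invertible here.

```python
import numpy as np

# J(w) = 0.5 ||A w - y||^2: gradient A^T(Aw - y), Hessian A^T A (constant)
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 3))
y = rng.normal(size=20)

w = np.zeros(3)
grad = A.T @ (A @ w - y)
hess = A.T @ A                           # positive definite when A has full column rank
w_new = w - np.linalg.solve(hess, grad)  # one Newton step

print(np.allclose(w_new, np.linalg.lstsq(A, y, rcond=None)[0]))  # True: exact for a quadratic
```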

Illustration of Newton's Method

[Figure: the objective $f(x)$ and its quadratic approximation $f_{\text{quad}}(x)$ around the current point $x_k$, with the Newton step to $x_k + d_k$; panel (a) convex case, panel (b) non-convex case. Figure source: Murphy's book.]

Logistic Regression: Algorithms

Logistic Regression: Gradient Ascent

Gradient ascent learning has the form

$$w^{\text{new}} = w^{\text{old}} + \eta \left( \frac{\partial \mathcal{L}}{\partial w} \right).$$

Calculate the gradient:

$$\frac{\partial \mathcal{L}}{\partial w} = \sum_{n=1}^{N} \left\{ y_n - \sigma(w^\top x_n) \right\} x_n.$$

Hence, the gradient ascent update rule for $w$ is

$$w^{\text{new}} = w^{\text{old}} + \eta \sum_{n=1}^{N} \left\{ y_n - \sigma\left( (w^{\text{old}})^\top x_n \right) \right\} x_n.$$
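Putting the update rule into code, here is a minimal batch gradient-ascent sketch for fitting $w$. The synthetic data, step size, and iteration count are assumptions made for illustration, and the code stores examples as rows of X rather than columns as on the slides.

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

# Synthetic binary data (illustrative only)
rng = np.random.default_rng(0)
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(N) < sigmoid(X @ w_true)).astype(float)

# Batch gradient ascent on the log-likelihood
w = np.zeros(D)
eta = 0.5 / N                               # learning rate (assumed)
for _ in range(2000):
    grad = X.T @ (y - sigmoid(X @ w))       # sum_n (y_n - sigma_n) x_n
    w = w + eta * grad

print(w)   # roughly recovers w_true, up to estimation noise from finite data
```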

Detailed Calculation of Gradient

Recall the log-likelihood

$$\mathcal{L} = \sum_{n=1}^{N} \left[ y_n \log \sigma_n + (1 - y_n) \log(1 - \sigma_n) \right].$$

Using $\frac{\partial \sigma_n}{\partial w} = \sigma_n (1 - \sigma_n)\, x_n$, calculate

$$\frac{\partial \mathcal{L}}{\partial w}
= \sum_{n=1}^{N} \left[ \frac{y_n}{\sigma_n} - \frac{1 - y_n}{1 - \sigma_n} \right] \sigma_n (1 - \sigma_n)\, x_n
= \sum_{n=1}^{N} \left[ y_n (1 - \sigma_n) - (1 - y_n)\, \sigma_n \right] x_n
= \sum_{n=1}^{N} (y_n - \sigma_n)\, x_n.$$

Detailed Calculation of Hessian

Calculate the Hessian:

$$\nabla^2 \mathcal{L}
= \frac{\partial}{\partial w} \left[ \nabla \mathcal{L} \right]^\top
= \frac{\partial}{\partial w} \left[ \sum_{n=1}^{N} (y_n - \sigma_n)\, x_n \right]^\top
= -\sum_{n=1}^{N} \sigma_n (1 - \sigma_n)\, x_n x_n^\top.$$

We set the objective function $\mathcal{J}(w)$ as the negative log-likelihood:

$$\mathcal{J}(w) = -\mathcal{L}(w) = -\sum_{n=1}^{N} \left[ y_n \log \sigma_n + (1 - y_n) \log(1 - \sigma_n) \right].$$

Thus, the gradient and the Hessian are

$$\nabla \mathcal{J}(w) = -\sum_{n=1}^{N} (y_n - \sigma_n)\, x_n, \qquad
\nabla^2 \mathcal{J}(w) = \sum_{n=1}^{N} \sigma_n (1 - \sigma_n)\, x_n x_n^\top.$$
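In code, these two quantities can be computed in a vectorized way as in the sketch below (rows of X as examples; the function names are our own, not from the slides).

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def grad_J(w, X, y):
    """Gradient of the negative log-likelihood: -sum_n (y_n - sigma_n) x_n."""
    return -X.T @ (y - sigmoid(X @ w))

def hess_J(w, X):
    """Hessian of the negative log-likelihood: sum_n sigma_n (1 - sigma_n) x_n x_n^T."""
    s = sigmoid(X @ w)
    return X.T @ (X * (s * (1 - s))[:, None])   # equivalent to X^T diag(s(1-s)) X
```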

Logistic Regression: IRLS

Newton's update step has the form

$$\Delta w = \underbrace{\left[ \sum_{n=1}^{N} \sigma_n (1 - \sigma_n)\, x_n x_n^\top \right]^{-1}}_{\text{inverse of Hessian, } [\nabla^2 \mathcal{J}(w)]^{-1}}
\underbrace{\left[ \sum_{n=1}^{N} (y_n - \sigma_n)\, x_n \right]}_{\text{gradient of } \mathcal{L},\ \nabla \mathcal{L}(w) = -\nabla \mathcal{J}(w)},$$

so that $w^{\text{new}} = w^{\text{old}} + \Delta w$. Newton's update reduces to iteratively reweighted least squares (IRLS):

$$\Delta w = \left( X S X^\top \right)^{-1} X S b,$$

where

$$S = \begin{bmatrix} \sigma_1(1 - \sigma_1) & & 0 \\ & \ddots & \\ 0 & & \sigma_N(1 - \sigma_N) \end{bmatrix}, \qquad
b = \begin{bmatrix} \dfrac{y_1 - \sigma_1}{\sigma_1(1 - \sigma_1)} \\ \vdots \\ \dfrac{y_N - \sigma_N}{\sigma_N(1 - \sigma_N)} \end{bmatrix}.$$

Alternatively we write

$$w_{k+1} = w_k + \left( X S_k X^\top \right)^{-1} X S_k b_k,$$

where $S_k$ and $b_k$ are computed using $w_k$. That is,

$$w_{k+1} = w_k + \left( X S_k X^\top \right)^{-1} X S_k b_k
= \left( X S_k X^\top \right)^{-1} \left[ \left( X S_k X^\top \right) w_k + X S_k b_k \right]
= \left( X S_k X^\top \right)^{-1} X S_k \left[ X^\top w_k + b_k \right].$$

Algorithm Outline

Algorithm 1: IRLS
Input: $\{(x_n, y_n) \mid n = 1, \ldots, N\}$
1: Initialize $w = 0$ and $w_0 = \log\left(\bar{y}/(1 - \bar{y})\right)$, where $\bar{y}$ is the mean of the labels
2: repeat
3:   for $n = 1, \ldots, N$ do
4:     Compute $\eta_n = w^\top x_n + w_0$
5:     Compute $\sigma_n = \sigma(\eta_n)$
6:     Compute $s_n = \sigma_n(1 - \sigma_n)$
7:     Compute $z_n = \eta_n + \frac{y_n - \sigma_n}{s_n}$
8:   end for
9:   Construct $S = \mathrm{diag}(s_{1:N})$
10:  Update $w = \left( X S X^\top \right)^{-1} X S z$
11: until convergence
12: return $w$
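A direct NumPy transcription of this outline might look like the sketch below. It uses rows of X as examples and assumes any bias has been absorbed into $w$ via a constant feature (so the separate $w_0$ of the outline is folded in); the convergence tolerance and the clipping of $s_n$ are added for numerical robustness and are not part of the slide.

```python
import numpy as np

def sigmoid(xi):
    return 1.0 / (1.0 + np.exp(-xi))

def irls(X, y, max_iter=100, tol=1e-8):
    """IRLS for logistic regression, following Algorithm 1.

    X: (N, D) inputs as rows; y: (N,) labels in {0, 1}.
    Append a constant column to X beforehand if a bias term is wanted.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(max_iter):
        eta = X @ w                                    # eta_n = w^T x_n
        s = sigmoid(eta)                               # sigma_n
        weights = np.clip(s * (1 - s), 1e-10, None)    # s_n, kept away from 0
        z = eta + (y - s) / weights                    # working responses z_n
        # Weighted least squares: w = (X^T S X)^{-1} X^T S z in row convention
        A = X.T @ (X * weights[:, None])
        b = X.T @ (weights * z)
        w_new = np.linalg.solve(A, b)
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w
```

For a model with a bias, one would call it as `irls(np.hstack([X, np.ones((len(X), 1))]), y)`.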

Generalized Linear Models

Generalized Linear Models

The GLM is a flexible generalization of ordinary linear regression that allows for response variables with error distributions other than the normal distribution. The GLM consists of three elements:

1. A probability distribution from the exponential family.
2. A linear predictor $\eta = X^\top \theta$.
3. A link function $g(\cdot)$ such that $\mathbb{E}[y \mid X] = \mu = g^{-1}(\eta)$.

Logistic Regression: Generalized Linear Model

Bernoulli distribution for a univariate binary random variable $y \in \{0, 1\}$ with mean $\mu$:

$$p(y \mid \mu) = \mu^y (1 - \mu)^{1 - y}.$$

The logit transform (log-odds), $\theta = w^\top x = \log\left(\frac{\mu}{1 - \mu}\right)$, maps $[0, 1] \to \mathbb{R}$.

The logistic function, which is the inverse logit, gives $\mu = \frac{1}{1 + e^{-\theta}} = \sigma(\theta)$.

Thus, we have

$$p(y \mid \theta) = \sigma(\theta)^y \left(1 - \sigma(\theta)\right)^{1 - y} = \sigma(\theta)^y\, \sigma(-\theta)^{1 - y}.$$
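A quick numerical check of the logit/inverse-logit relationship (the values of $\mu$ below are chosen arbitrarily):

```python
import numpy as np

mu = np.array([0.1, 0.25, 0.5, 0.9])               # arbitrary probabilities in (0, 1)
theta = np.log(mu / (1 - mu))                       # logit (log-odds)
assert np.allclose(1 / (1 + np.exp(-theta)), mu)    # sigma(logit(mu)) == mu
```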

Multiclass Extension

Multiclass Logistic Regression (Multinomial Logistic Regression)

Model the input-output relation by a softmax transformation of the logits $\theta_k = w_k^\top x_n$:

$$p(y_n = k \mid x_n) = \mathrm{softmax}(\theta_k) = \frac{\exp\{w_k^\top x_n\}}{\sum_{j=1}^{K} \exp\{w_j^\top x_n\}}.$$

Given $Y \in \mathbb{R}^{K \times N}$ (each column $y_n \in \mathbb{R}^K$ follows the 1-of-$K$ coding) and $X \in \mathbb{R}^{D \times N}$, the likelihood is given by

$$p(Y \mid X, w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(y_n = k \mid x_n)^{Y_{k,n}}.$$

The log-likelihood is given by

$$\mathcal{L} = \sum_{n=1}^{N} \sum_{k=1}^{K} Y_{k,n} \log\left[ p(y_n = k \mid x_n) \right].$$

One can apply Newton's update to derive IRLS, as in logistic regression.
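As a final illustration, here is a minimal NumPy sketch of the softmax probabilities, the multiclass log-likelihood, and its gradient. This sketch follows the slides' column convention for X; the gradient expression $\sum_n (Y_{k,n} - p_{k,n})\, x_n$ for each $w_k$ is the standard result and is not derived on the slides, and the small constant inside the log is an assumption for numerical safety.

```python
import numpy as np

def softmax(logits):
    # logits: (K, N); subtract the per-column max for numerical stability
    z = logits - logits.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def multiclass_log_likelihood(W, X, Y):
    """L = sum_n sum_k Y[k, n] * log p(y_n = k | x_n).

    W: (K, D) with rows w_k; X: (D, N) with columns x_n; Y: (K, N) one-of-K columns.
    """
    P = softmax(W @ X)                    # P[k, n] = p(y_n = k | x_n)
    return np.sum(Y * np.log(P + 1e-12))

def multiclass_grad(W, X, Y):
    """Gradient w.r.t. W: row k is sum_n (Y[k, n] - P[k, n]) x_n^T (standard result)."""
    P = softmax(W @ X)
    return (Y - P) @ X.T                  # shape (K, D)
```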