Multinomial Logistic Regression Algorithms


Karl Gregory
10/03/2018

Let $Y_1, \ldots, Y_n \in \{1, \ldots, K\}$ be independent random variables and let $x_1, \ldots, x_n \in \mathbb{R}^{d+1}$ be fixed vectors such that
$$Y_i \sim \text{Multinomial}(1, \eta_1(x_i; \theta), \ldots, \eta_K(x_i; \theta)),$$
where $\theta = (\theta_1^T, \ldots, \theta_{K-1}^T)^T$ is a parameter vector with $\theta_1, \ldots, \theta_{K-1} \in \mathbb{R}^{d+1}$,
$$\eta_k(x; \theta) = e^{x^T \theta_k}\Big[1 + \sum_{l=1}^{K-1} e^{x^T \theta_l}\Big]^{-1} \quad \text{for } k = 1, \ldots, K-1,$$
and
$$\eta_K(x; \theta) = \Big[1 + \sum_{l=1}^{K-1} e^{x^T \theta_l}\Big]^{-1}.$$

The probability mass function of $Y_i$ is thus given by
$$P(Y_i = y) = \prod_{k=1}^{K} [\eta_k(x_i; \theta)]^{\mathbf{1}(y = k)}\,\mathbf{1}(y \in \{1, \ldots, K\})$$
for $i = 1, \ldots, n$, so the likelihood function is
$$L(\theta; Y_1, \ldots, Y_n) = \prod_{i=1}^{n} \prod_{k=1}^{K} [\eta_k(x_i; \theta)]^{\mathbf{1}(Y_i = k)},$$
and the log-likelihood function is
$$\ell(\theta; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbf{1}(Y_i = k) \log \eta_k(x_i; \theta).$$

We do not have a closed form for the maximum likelihood estimator $\hat{\theta}$ of $\theta$, so we must find $\hat{\theta}$ numerically. We consider three algorithms: coordinate descent, a ridge-stabilized Newton-Raphson algorithm, and a fixed-Hessian Newton-Raphson algorithm.

In developing these algorithms, we will make use of the fact that
$$\frac{\partial}{\partial \theta_k} \eta_l(x; \theta) = \eta_l(x; \theta)[\mathbf{1}(k = l) - \eta_k(x; \theta)]\,x$$
for all $1 \leq k \leq K-1$ and $1 \leq l \leq K$, which can be shown without too much trouble. Moreover, we have
$$\frac{\partial}{\partial \theta_k} \ell(\theta; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} [\mathbf{1}(Y_i = k) - \eta_k(x_i; \theta)]\,x_i$$
for $k = 1, \ldots, K-1$ and
$$\frac{\partial^2}{\partial \theta_k \partial \theta_l^T} \ell(\theta; Y_1, \ldots, Y_n) = -\sum_{i=1}^{n} \eta_k(x_i; \theta)[\mathbf{1}(k = l) - \eta_l(x_i; \theta)]\,x_i x_i^T$$
for $1 \leq k, l \leq K-1$.
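Before turning to the algorithms, here is a quick numerical sanity check of the score formula above. It is an added sketch, not part of the original notes; the toy data and the helper names (eta.chk, loglik.chk) are purely illustrative.

# Added check: compare the analytic score with a finite-difference gradient
# of the log-likelihood on a tiny simulated data set.
set.seed(1)
n.chk <- 20; d.chk <- 2; K.chk <- 3
x.chk <- cbind(1, matrix(rnorm(n.chk * d.chk), n.chk, d.chk))
theta.chk <- matrix(rnorm((d.chk + 1) * (K.chk - 1)), d.chk + 1, K.chk - 1)

eta.chk <- function(theta) {
  # probabilities eta_1,...,eta_K under the parameterization above (theta_K = 0)
  E <- exp(x.chk %*% cbind(theta, 0))
  E / rowSums(E)
}

y.chk <- apply(eta.chk(theta.chk), 1, function(p) sample(1:K.chk, 1, prob = p))

loglik.chk <- function(theta.vec) {
  P <- eta.chk(matrix(theta.vec, d.chk + 1, K.chk - 1))
  sum(log(P[cbind(1:n.chk, y.chk)]))
}

# analytic score: sum_i [1(Y_i = k) - eta_k(x_i)] x_i stacked over k = 1,...,K-1
P <- eta.chk(theta.chk)
score.chk <- as.vector(t(x.chk) %*% ((outer(y.chk, 1:(K.chk - 1), "==") + 0) - P[, 1:(K.chk - 1)]))

# central finite-difference gradient of the log-likelihood
fd.chk <- sapply(seq_along(theta.chk), function(j) {
  h <- 1e-6; e <- replace(numeric(length(theta.chk)), j, h)
  (loglik.chk(as.vector(theta.chk) + e) - loglik.chk(as.vector(theta.chk) - e)) / (2 * h)
})

max(abs(score.chk - fd.chk))  # numerically negligible if the score formula is right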

From these derivatives we define the $(K-1)(d+1) \times 1$ score vector
$$S(\theta) = \left( \sum_{i=1}^{n} [\mathbf{1}(Y_i = k) - \eta_k(x_i; \theta)]\,x_i \right)_{1 \leq k \leq K-1}$$
and the $(K-1)(d+1) \times (K-1)(d+1)$ Hessian matrix
$$H(\theta) = \left( -\sum_{i=1}^{n} \eta_k(x_i; \theta)[\mathbf{1}(k = l) - \eta_l(x_i; \theta)]\,x_i x_i^T \right)_{1 \leq k, l \leq K-1}.$$

Coordinate descent

The idea of coordinate descent algorithms is to iteratively maximize a function with respect to one block of parameters at a time, holding the others fixed, cycling repeatedly through the blocks until convergence. Precisely, in our case, given initial values $\theta_1^-, \ldots, \theta_{K-1}^-$, coordinate descent prescribes repeating the following until convergence:

Outer loop: Update $\theta_1^-$ to $\theta_1^+$ by
$$\theta_1^+ \leftarrow \underset{\theta_1}{\arg\max}\; \ell(\theta_1, \theta_2^-, \ldots, \theta_{K-1}^-; Y_1, \ldots, Y_n),$$
then for $k = 2, \ldots, K-2$ update $\theta_k^-$ to $\theta_k^+$ by
$$\theta_k^+ \leftarrow \underset{\theta_k}{\arg\max}\; \ell(\theta_1^+, \ldots, \theta_{k-1}^+, \theta_k, \theta_{k+1}^-, \ldots, \theta_{K-1}^-; Y_1, \ldots, Y_n),$$
and finally update $\theta_{K-1}^-$ to $\theta_{K-1}^+$ by
$$\theta_{K-1}^+ \leftarrow \underset{\theta_{K-1}}{\arg\max}\; \ell(\theta_1^+, \ldots, \theta_{K-2}^+, \theta_{K-1}; Y_1, \ldots, Y_n).$$

Each of these maximizations may be carried out using a Newton-Raphson algorithm. For any $k = 1, \ldots, K-1$, the Newton-Raphson updates are as follows:

Inner loop: For current values $\theta_1^-, \ldots, \theta_{K-1}^-$, update $\theta_k^-$ to $\theta_k^+$ by
$$\theta_k^+ \leftarrow \theta_k^- - H_{k,k}^{-1}(\theta^-)S_k(\theta^-),$$
where
$$S_k(\theta) = \sum_{i=1}^{n} [\mathbf{1}(Y_i = k) - \eta_k(x_i; \theta)]\,x_i$$
and
$$H_{k,k}(\theta) = -\sum_{i=1}^{n} \eta_k(x_i; \theta)[1 - \eta_k(x_i; \theta)]\,x_i x_i^T.$$
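As an aside, a single inner-loop update can be written directly in this raw Newton form. The helper below is an added sketch, not part of the original code; its name and argument layout (which mirror the objects built in the implementation further below: XX is the $n \times (d+1)$ design matrix, Y takes values in $1, \ldots, K$, and THETA is the $(d+1) \times K$ parameter matrix with last column fixed at zero) are illustrative.

# Added sketch of one raw Newton-Raphson update for block theta_k,
# i.e. theta_k <- theta_k - H_kk^{-1} S_k.
newton.step.k <- function(k, THETA, XX, Y) {
  ETA <- exp(XX %*% THETA)
  ETA <- ETA / rowSums(ETA)                    # row-wise softmax
  p.k <- ETA[, k]
  S.k <- t(XX) %*% ((Y == k) - p.k)            # score S_k(theta) for block k
  H.kk <- -t(XX) %*% (p.k * (1 - p.k) * XX)    # Hessian block H_{k,k}(theta)
  THETA[, k] <- THETA[, k] - solve(H.kk, S.k)  # raw Newton update of block k
  THETA
}

Cycling this map over $k = 1, \ldots, K-1$ reproduces the outer loop; the weighted least squares reformulation derived next gives exactly the same update and is what the full implementation below actually uses.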

These updates may be reformulated so that $\theta_k^+$ is the solution to a weighted least squares problem. If we define the $n \times n$ matrix
$$W_k(\theta) = \text{diag}\big(\eta_k(x_1; \theta)[1 - \eta_k(x_1; \theta)], \ldots, \eta_k(x_n; \theta)[1 - \eta_k(x_n; \theta)]\big)$$
and the $n \times 1$ vector
$$U_k(\theta) = \big(\mathbf{1}(Y_1 = k) - \eta_k(x_1; \theta), \ldots, \mathbf{1}(Y_n = k) - \eta_k(x_n; \theta)\big)^T,$$
then we may write
$$H_{k,k}(\theta) = -X^T W_k(\theta) X \quad \text{and} \quad S_k(\theta) = X^T U_k(\theta).$$
Then we may express the update as
$$\theta_k^+ = \theta_k^- + (X^T W_k(\theta^-) X)^{-1} X^T U_k(\theta^-) = (X^T W_k(\theta^-) X)^{-1} X^T W_k(\theta^-)\big[X\theta_k^- + W_k^{-1}(\theta^-) U_k(\theta^-)\big].$$
Letting
$$Z_k(\theta^-) = X\theta_k^- + W_k^{-1}(\theta^-) U_k(\theta^-),$$
we have
$$\theta_k^+ = \theta_k^- - H_{k,k}^{-1}(\theta^-) S_k(\theta^-) = (X^T W_k(\theta^-) X)^{-1} X^T W_k(\theta^-) Z_k(\theta^-).$$
Finally, defining
$$\tilde{Z}_k = W_k^{1/2}(\theta^-) Z_k(\theta^-) \quad \text{and} \quad \tilde{X}_k = W_k^{1/2}(\theta^-) X,$$
we may write
$$\theta_k^+ = (\tilde{X}_k^T \tilde{X}_k)^{-1} \tilde{X}_k^T \tilde{Z}_k = \underset{\theta}{\arg\min}\; \|\tilde{Z}_k - \tilde{X}_k \theta\|_2^2.$$
A good reference for this method is Friedman, Hastie, and Tibshirani (2010).

The code below implements the coordinate descent algorithm for computing the maximum likelihood estimator of $\theta$.

# In all the R code, there are K sets of parameters so that
# THETA = (theta_1,...,theta_K), but for identifiability theta_K = 0 always.

# define the softmax function
softmax <- function(x) exp(x)/sum(exp(x))

# generate some multinomial data:
n <- 50
d <- 2
K <- 3

XX <- cbind(rep(1,n), matrix(rnorm(n*d), n, d))
THETA <- cbind(matrix(rnorm((d+1)*(K-1)), d+1, K-1), rep(0, d+1))
ETA <- t(apply(XX %*% THETA, 1, softmax))
Y <- apply(ETA, 1, FUN = sample, size = 1, x = 1:K, replace = FALSE)

# initialize parameter values
THETA.1 <- matrix(0, d+1, K)
ETA.1 <- 0
ETA.0 <- 1 # let the loop begin

# set convergence criterion
delta <- .001
iter <- 0

while( max(abs(ETA.0 - ETA.1)) > delta ) # stop when fitted probabilities cease to change
{
  ETA.0 <- ETA.1
  THETA.0 <- THETA.1

  for( k in 1:(K-1) )
  {
    THETA.0[,k] <- THETA.1[,k] + 1 # let the inner loop begin

    while( max(abs(THETA.1[,k] - THETA.0[,k])) > delta )
    {
      THETA.0[,k] <- THETA.1[,k]

      p.k <- t(apply(XX %*% THETA.0, 1, softmax))[,k]
      w.k <- p.k * (1 - p.k)
      Z.k <- XX %*% THETA.0[,k] + ((Y == k) - p.k) / w.k

      W.k <- diag(w.k)
      XX.w.k <- sqrt(W.k) %*% XX
      Z.w.k <- sqrt(w.k) * Z.k

      theta.1.k <- solve(t(XX.w.k) %*% XX.w.k) %*% t(XX.w.k) %*% Z.w.k
      THETA.1[,k] <- theta.1.k
    }
  }

  ETA.1 <- t(apply(XX %*% THETA.1, 1, softmax)) # refresh fitted probabilities for the convergence check
  iter <- iter + 1
}

THETA.1.CD <- THETA.1
print(iter)

## [1] 8

print(THETA.1.CD)

##            [,1]      [,2] [,3]
## [1,]  0.1429447 2.0140096    0
## [2,] -0.5358643 0.8258555    0
## [3,]  0.6997993 1.7411876    0

These results are very close (try playing with the convergence criterion) to those from the package nnet, as can be seen after some re-parameterizing:

library(nnet)

# relabel Y so that we get the same answer from the multinom function,
# which uses a different identifiability constraint:
Y.multinom <- -Y
multinom.coefs <- as.matrix(summary(multinom(Y.multinom ~ XX[,-1]))$coefficients)

## # weights:  12 (6 variable)
## initial  value 54.930614
## iter  10 value 31.325856
## final  value 31.308472
## converged

if(K == 2) multinom.coefs <- t(multinom.coefs)
THETA.1.nnet <- t(as.matrix(rbind(multinom.coefs[(K-1):1,], rep(0, ncol(XX)))))
print(THETA.1.nnet)

##                     -1        -2
## (Intercept)  0.1500389 2.0181569 0
## XX[, -1]1   -0.5356763 0.8252182 0
## XX[, -1]2    0.7040395 1.7437920 0
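As a small added check (not in the original notes), the two fits can also be compared entrywise; this assumes the objects THETA.1.CD and THETA.1.nnet created above.

# entrywise comparison of the coordinate descent fit and the nnet fit (added check)
print(max(abs(THETA.1.CD - THETA.1.nnet)))  # should be small; tightening delta above brings the fits closer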

Ridge-stabilized Newton-Raphson

Given an initial value $\theta^-$ of the parameter $\theta$, the Newton-Raphson algorithm prescribes the updates
$$\theta^+ \leftarrow \theta^- - H^{-1}(\theta^-)S(\theta^-).$$
This algorithm, however, is fraught with convergence issues for $K \geq 3$: sometimes the Hessian is not negative definite, and sometimes the quadratic approximation on which each Newton-Raphson update is based is very poor. The following algorithm, taken from Goldfeld, Quandt, and Trotter (1966), addresses these problems in each iteration.

Let $\lambda_1$ be the greatest eigenvalue of $H(\theta^-)$ and let $R$ be a constant which plays the role of controlling the step size of the algorithm. Then the ridge-stabilized Newton-Raphson algorithm proceeds via the updates
$$\theta^+ \leftarrow \theta^- - H_\alpha^{-1}(\theta^-)S(\theta^-),$$
where, for $\alpha = \lambda_1 + R\|S(\theta^-)\|_2$,
$$H_\alpha(\theta^-) = \begin{cases} H(\theta^-) - \alpha I & \text{if } \alpha > 0 \\ H(\theta^-) & \text{if } \alpha \leq 0. \end{cases}$$
The modified Hessian $H_\alpha(\theta^-)$ is always negative definite.
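To see why: when $\alpha > 0$, the largest eigenvalue of $H(\theta^-) - \alpha I$ is $\lambda_1 - \alpha = -R\|S(\theta^-)\|_2 < 0$, so all eigenvalues are negative; when $\alpha \leq 0$, the Hessian is already negative definite. The snippet below, an added illustration with arbitrary toy numbers rather than part of the original code, applies the shift to a small indefinite matrix.

# Added toy illustration of the alpha-shift on an indefinite 2 x 2 "Hessian"
H.toy <- matrix(c(1, 2, 2, -3), 2, 2)  # eigenvalues of mixed sign
S.toy <- c(0.5, -1)                    # toy score vector
R.toy <- 1                             # toy step-size constant

lambda.1 <- max(eigen(H.toy)$values)
alpha <- lambda.1 + R.toy * sqrt(sum(S.toy^2))
H.alpha <- if (alpha > 0) H.toy - alpha * diag(2) else H.toy

eigen(H.toy)$values    # mixed signs
eigen(H.alpha)$values  # all strictly negative: the shifted matrix is negative definite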

Goldfeld, Quandt, and Trotter (1966) prescribe a way to adjust the constant $R$ in each iteration. This is implemented in the code below, which carries out the algorithm.

# define function to evaluate the negative log-likelihood
negll <- function(THETA, Y, XX)
{
  K <- max(Y)
  negll <- 0
  PROBS <- t(apply(XX %*% THETA, 1, softmax))
  for( k in 1:K )
  {
    negll <- negll - sum( (Y == k) * log(PROBS[,k]) )
  }
  return(negll)
}

# initialize parameter values
THETA.1 <- matrix(0, d+1, K)
ETA.0 <- 1 # let the loop begin

# initialize iteration counter and create empty vector in which to store
# evaluations of the negative log-likelihood function after every step
iter <- 1
negll.vals <- numeric()
negll.vals[1] <- negll(THETA.1, Y, XX)

R <- 10 # set initial value for the "radius", which is kind of like a step size
SS <- 1 # allow algorithm to begin

while( max(abs(SS)) > delta )
{
  ETA.0 <- ETA.1
  THETA.0 <- THETA.1

  ETA.1 <- t(apply(XX %*% THETA.0, 1, softmax))

  # Build Hessian HH and score vector SS
  HH <- matrix(NA, (K-1)*(d+1), (K-1)*(d+1))
  SS <- numeric((K-1)*(d+1))

  for( k1 in 1:(K-1) )
  {
    ind1 <- ((k1-1)*(d+1) + 1):(k1*(d+1))
    SS[ind1] <- t(XX) %*% ((Y == k1) - ETA.1[,k1])

    for( k2 in 1:(K-1) )
    {
      ind2 <- ((k2-1)*(d+1) + 1):(k2*(d+1))

      if(k1 == k2)
      {
        # diagonal block: - sum_i eta_k (1 - eta_k) x_i x_i^T
        HH[ind1,ind2] <- - t(XX) %*% diag(ETA.1[,k1]*(1 - ETA.1[,k1])) %*% XX
      } else {
        # off-diagonal block: + sum_i eta_{k1} eta_{k2} x_i x_i^T (see the Hessian formula above)
        HH[ind1,ind2] <- t(XX) %*% diag(ETA.1[,k1]*ETA.1[,k2]) %*% XX
      }
    }
  }

  HH[is.na(HH)] <- 0

  # compute alpha
  lambda.1 <- max(eigen(HH)$values)
  norm.SS <- sqrt(sum(SS^2))
  alpha <- lambda.1 + R * norm.SS

  # compute modified Hessian
  BB <- HH - (alpha > 0) * alpha * diag((K-1)*(d+1))

  # get update
  Theta.0.long <- as.vector(THETA.0[,-K])
  Theta.1.long <- Theta.0.long - solve(BB) %*% SS # do update with modified Hessian

  THETA.1 <- cbind(matrix(Theta.1.long, ncol = K-1), rep(0, d+1))

  # adjust R by comparing the actual change in the negative log-likelihood
  # with the change predicted by the local quadratic approximation
  dTheta <- Theta.1.long - Theta.0.long
  negll.lqa <- -( -negll.vals[iter] + SS %*% dTheta + (1/2)*t(dTheta) %*% HH %*% dTheta )

  negll.vals[iter+1] <- negll(THETA.1, Y, XX)

  delta.negll <- negll.vals[iter+1] - negll.vals[iter]
  delta.negll.lqa <- negll.lqa - negll.vals[iter]

  z <- delta.negll / delta.negll.lqa

  if( z < 0 ){
    R <- R * 4
  } else if( (.7 < z) & (z < 1.3) ){
    R <- R * 0.4
  } else if( z > 2 ){
    R <- R * 4
  }

  iter <- iter + 1
}

plot(negll.vals, xlab = "iterations")

[Figure: negll.vals plotted against the iteration number for the ridge-stabilized Newton-Raphson algorithm; the negative log-likelihood decreases and levels off as the algorithm converges.]

THETA.1.ridgestabilizedNR <- THETA.1
print(THETA.1.ridgestabilizedNR)

##            [,1]      [,2] [,3]
## [1,]  0.1496101 2.0176628    0
## [2,] -0.5358097 0.8253049    0
## [3,]  0.7037386 1.7435231    0

We see that the results are essentially the same.
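As an additional added sanity check (not part of the original notes), the same negative log-likelihood can be handed to a general-purpose optimizer; the wrapper negll.vec below is illustrative and reuses negll, XX, Y, d, and K from the code above.

# Added cross-check: minimize the same negative log-likelihood with optim() (BFGS)
negll.vec <- function(theta.free) {
  negll(cbind(matrix(theta.free, d + 1, K - 1), rep(0, d + 1)), Y, XX)
}
opt <- optim(rep(0, (d + 1) * (K - 1)), negll.vec, method = "BFGS")
THETA.optim <- cbind(matrix(opt$par, d + 1, K - 1), rep(0, d + 1))
print(THETA.optim) # should agree with the estimates above up to the tolerances used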

Fixed-Hessian algorithm

A disadvantage of the Newton-Raphson algorithm is that the Hessian must be inverted in every iteration, which incurs a high computational cost. Böhning (1992) proposed an algorithm (see the paper for complete details) which replaces the inverse of the Hessian in all iterations with a single fixed matrix. As a result, the algorithm tends to require more iterations, but each iteration is very cheap.

Given an initial parameter value $\theta^-$, Böhning (1992) prescribes the updates
$$\theta^+ \leftarrow \theta^- - B^{-1}S(\theta^-),$$
where
$$B = -\frac{K}{2}\,(K^{-1}I - K^{-2}\mathbf{1}\mathbf{1}^T) \otimes X^T X,$$
where $I$ is the $(K-1) \times (K-1)$ identity matrix and $\mathbf{1}$ is a $(K-1) \times 1$ vector of ones. This gives
$$B^{-1} = -2\,(I + \mathbf{1}\mathbf{1}^T) \otimes (X^T X)^{-1}.$$
Note that the matrix $-(K^{-1}I - K^{-2}\mathbf{1}\mathbf{1}^T) \otimes X^T X$ is the Hessian under the uniform multinomial distribution, that is, when $\eta_1(x_i; \theta) = \cdots = \eta_K(x_i; \theta) = 1/K$ for $i = 1, \ldots, n$. See Böhning (1992) for further details. The advantage of this algorithm is that we only need to compute the matrix $B^{-1}$ once.

The code below implements the algorithm:

# initialize parameter values
THETA.1 <- matrix(0, d+1, K)
ETA.0 <- 1 # let the loop begin

iter <- 1
negll.vals <- numeric()
negll.vals[1] <- negll(THETA.1, Y, XX)

# Build B inverse:
B.inv <- -2 * (diag(K-1) + matrix(1, K-1, K-1)) %x% solve( t(XX) %*% XX )

# allow algorithm to begin
SS <- 1

while( max(abs(SS)) > delta )
{
  ETA.0 <- ETA.1
  THETA.0 <- THETA.1

  ETA.1 <- t(apply(XX %*% THETA.0, 1, softmax))

  # Build score vector SS
  SS <- numeric((K-1)*(d+1))

  for( k1 in 1:(K-1) )
  {
    ind1 <- ((k1-1)*(d+1) + 1):(k1*(d+1))
    SS[ind1] <- t(XX) %*% ((Y == k1) - ETA.1[,k1])
  }

  # Perform update
  Theta.0.long <- as.vector(THETA.0[,-K])
  Theta.1.long <- Theta.0.long - B.inv %*% SS
  THETA.1 <- cbind(matrix(Theta.1.long, ncol = K-1), rep(0, d+1))

  negll.vals[iter+1] <- negll(THETA.1, Y, XX)
  iter <- iter + 1
}

plot(negll.vals, xlab = "iterations")

[Figure: negll.vals plotted against the iteration number for the fixed-Hessian algorithm; the negative log-likelihood decreases and levels off as the algorithm converges.]

THETA.1.fixedHNR <- THETA.1
print(THETA.1.fixedHNR)

##            [,1]      [,2] [,3]
## [1,]  0.1506317 2.0183429    0
## [2,] -0.5351952 0.8253456    0
## [3,]  0.7042880 1.7439444    0
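Two added numerical checks (not in the original notes) of the facts used above: that $-2(I + \mathbf{1}\mathbf{1}^T) \otimes (X^T X)^{-1}$ is indeed the inverse of $B$, and that $H(\theta) - B$ is positive semidefinite, which is Böhning's bound and is what keeps the fixed-matrix updates well behaved. The snippet reuses XX, ETA.1, K, d, and B.inv from the code above and rebuilds the Hessian as in the ridge-stabilized section.

# Added check 1: B %*% B.inv is the identity (up to rounding)
B <- -(K/2) * (diag(K-1)/K - matrix(1, K-1, K-1)/K^2) %x% (t(XX) %*% XX)
max(abs(B %*% B.inv - diag((K-1)*(d+1))))  # numerically zero

# Added check 2: H(theta) - B is positive semidefinite at the fitted probabilities ETA.1
HH <- matrix(0, (K-1)*(d+1), (K-1)*(d+1))
for (k1 in 1:(K-1)) for (k2 in 1:(K-1)) {
  ind1 <- ((k1-1)*(d+1) + 1):(k1*(d+1))
  ind2 <- ((k2-1)*(d+1) + 1):(k2*(d+1))
  HH[ind1, ind2] <- if (k1 == k2) {
    -t(XX) %*% (ETA.1[,k1]*(1 - ETA.1[,k1]) * XX)
  } else {
    t(XX) %*% (ETA.1[,k1]*ETA.1[,k2] * XX)
  }
}
min(eigen(HH - B, symmetric = TRUE)$values)  # should be >= 0 up to rounding error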

References

Böhning, Dankmar. 1992. "Multinomial Logistic Regression Algorithm." Annals of the Institute of Statistical Mathematics 44 (1): 197-200.

Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. 2010. "Regularization Paths for Generalized Linear Models via Coordinate Descent." Journal of Statistical Software 33 (1): 1-22.

Goldfeld, Stephen M., Richard E. Quandt, and Hale F. Trotter. 1966. "Maximization by Quadratic Hill-Climbing." Econometrica 34: 541-51.