Multiclass Logistic Regression


Multiclass Logistic Regression
Sargur N. Srihari
University at Buffalo, State University of New York, USA

Topics in Linear Classification using Probabilistic Discriminative Models
Generative vs Discriminative
1. Fixed basis functions in linear classification
2. Logistic Regression (two-class)
3. Iterative Reweighted Least Squares (IRLS)
4. Multiclass Logistic Regression
5. Probit Regression
6. Canonical Link Functions

Topics in Multiclass Logistic Regression
Multiclass Classification Problem
Softmax Regression
Softmax Regression Implementation
Softmax and Training
One-hot vector representation
Objective function and gradient
Summary of concepts in Logistic Regression
Example of 3-class Logistic Regression

Multi-class Classification Problem
Categories: K = 10. Examples: N = 100.

Softmax Regression
In the two-class case, $p(C_1|\phi) = y(\phi) = \sigma(w^T\phi + b)$, where $\phi = [\phi_1,..,\phi_M]^T$, $w = [w_1,..,w_M]^T$ and $a = w^T\phi + b$ is the activation.
For K classes, we work with the softmax function instead of the logistic sigmoid (softmax regression):
$p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
where $a_k = w_k^T\phi + b_k$, $k = 1,..,K$, $w_k = [w_{k1},..,w_{kM}]^T$ and $a = \{a_1,..,a_K\}$.
We learn a set of K weight vectors $\{w_1,..,w_K\}$ and biases $b$. Arranging the weight vectors as a matrix W:
$a = W^T\phi + b, \quad W = [w_1 \cdots w_K] = \begin{pmatrix} w_{11} & \cdots & w_{K1} \\ \vdots & & \vdots \\ w_{1M} & \cdots & w_{KM} \end{pmatrix}, \quad y = \mathrm{softmax}(a), \quad y_i = \frac{\exp(a_i)}{\sum_j \exp(a_j)}$
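As an illustration of the softmax mapping above, here is a minimal NumPy sketch (not from the slides; the activation values are made up). Subtracting the maximum activation before exponentiating leaves the result unchanged but avoids overflow.

```python
import numpy as np

def softmax(a):
    """Map an activation vector a = (a_1, .., a_K) to class probabilities."""
    a = a - np.max(a)          # shift for stability; softmax is unchanged by a constant shift
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])  # hypothetical activations a_k = w_k^T phi + b_k
print(softmax(a))              # the outputs are positive and sum to 1
```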

Softmax Regression Implementation
3-class Logistic Regression with 3 inputs:
$a = W^T x + b, \quad W = [W_1\ W_2\ W_3] = \begin{pmatrix} W_{1,1} & W_{1,2} & W_{1,3} \\ W_{2,1} & W_{2,2} & W_{2,3} \\ W_{3,1} & W_{3,2} & W_{3,3} \end{pmatrix}, \quad y = \mathrm{softmax}(a), \quad y_i = \frac{\exp(a_i)}{\sum_{j=1}^{3}\exp(a_j)}$
The slide shows what the network computes, in matrix multiplication notation, with an example; a sketch follows below.
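A minimal sketch of this forward pass, assuming NumPy; the weights, biases and input values are invented for illustration only.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

# Columns of W hold the per-class weight vectors w_k; all numbers are invented.
W = np.array([[ 0.2, -0.5,  0.1],
              [ 1.5,  1.3, -0.3],
              [-0.4,  0.2,  0.9]])
b = np.array([0.0, 0.1, -0.1])
x = np.array([1.0, 2.0, 0.5])

a = W.T @ x + b        # activations a = W^T x + b, one per class
y = softmax(a)         # class probabilities
print(a, y)
```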

Softmax and Training
We use maximum likelihood to determine the parameters $\{w_k\}$, $k = 1,..,K$.
The exp within $\mathrm{softmax}(a_i) = \exp(a_i)/\sum_j \exp(a_j)$ works very well when training using the log-likelihood, because the log-likelihood can undo the exp of the softmax:
$\log \mathrm{softmax}(a_i) = a_i - \log \sum_j \exp(a_j)$
The input $a_i$ always has a direct contribution to the cost. Because this term cannot saturate, learning can proceed even if the second term becomes very small. The first term encourages $a_i$ to be pushed up; the second term encourages all the activations $a$ to be pushed down.
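The log-softmax identity above is also what makes the computation numerically safe. A small sketch (an illustration, not from the slides) using the log-sum-exp trick:

```python
import numpy as np

def log_softmax(a):
    """log softmax(a_i) = a_i - log sum_j exp(a_j), computed stably via log-sum-exp."""
    a_max = np.max(a)
    return a - (a_max + np.log(np.sum(np.exp(a - a_max))))

a = np.array([1000.0, 999.0, 0.0])   # naive exp(a) would overflow here
print(log_softmax(a))                 # finite values whose exponentials sum to 1
```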

Derivatives
The multiclass logistic regression model is
$p(C_k|\phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
For maximum likelihood we will need the derivatives of $y_k$ with respect to all of the activations $a_j$. These are given by
$\frac{\partial y_k}{\partial a_j} = y_k(I_{kj} - y_j)$
where $I_{kj}$ are the elements of the identity matrix.
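As a sanity check on this derivative formula, here is a hypothetical NumPy sketch that builds the full Jacobian $\partial y_k/\partial a_j = y_k(I_{kj}-y_j)$ and compares one column against finite differences; the activation values are invented.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

def softmax_jacobian(a):
    """dy_k/da_j = y_k (I_kj - y_j), returned as a K x K matrix."""
    y = softmax(a)
    return np.diag(y) - np.outer(y, y)

a = np.array([0.5, -1.0, 2.0])       # hypothetical activations
J = softmax_jacobian(a)

# finite-difference check of the column for a_1 (index j = 1)
eps, j = 1e-6, 1
e_j = np.eye(3)[j]
numeric = (softmax(a + eps * e_j) - softmax(a - eps * e_j)) / (2 * eps)
print(np.allclose(J[:, j], numeric, atol=1e-6))   # True
```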

One-hot Vector Representation
Classes $C_1,..,C_K$ are represented by a 1-of-K scheme. One-hot vector: class $C_k$ is a K-dimensional vector $t = [t_1,..,t_K]^T$, $t_i \in \{0,1\}$.
With K = 6, class $C_3$ is $(0,0,1,0,0,0)^T$, with $t_1 = t_2 = t_4 = t_5 = t_6 = 0$ and $t_3 = 1$.
The class probabilities obey $\sum_{k=1}^{K} p(C_k) = 1$, and likewise $\sum_{k=1}^{K} t_k = 1$.
If $p(t_k = 1) = \mu_k$ then $p(t) = \prod_{k=1}^{K} \mu_k^{t_k}$, where $\mu = (\mu_1,..,\mu_K)^T$; e.g., the probability of $C_3$ is $p([0,0,1,0,0,0]^T) = \mu_3$.
Why use a one-hot representation? If we used numerical categories 1, 2, 3 we would impute ordinality. We can now use the simpler Bernoulli form instead of the multinoulli. A small encoding sketch follows below.
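A tiny sketch of the 1-of-K encoding described above (the helper name is invented), reproducing the K = 6, $C_3$ example:

```python
import numpy as np

def one_hot(k, K):
    """1-of-K encoding: class C_k (1-indexed) as a K-dimensional binary vector."""
    t = np.zeros(K, dtype=int)
    t[k - 1] = 1
    return t

print(one_hot(3, 6))   # [0 0 1 0 0 0], i.e. class C_3 with K = 6, matching the slide
```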

Target Matrix T
Classes have values 1,..,K, each represented as a K-dimensional binary vector. We have N labeled samples, so instead of a target vector t we have an N × K target matrix T (rows index samples, columns index classes):
$T = \begin{pmatrix} t_{11} & \cdots & t_{1K} \\ \vdots & & \vdots \\ t_{N1} & \cdots & t_{NK} \end{pmatrix}$
Note that $t_{nk}$ corresponds to sample n and class k.

Objective Function & Gradient
Likelihood of the observations:
$p(T|w_1,..,w_K) = \prod_{n=1}^{N}\prod_{k=1}^{K} p(C_k|\phi_n)^{t_{nk}} = \prod_{n=1}^{N}\prod_{k=1}^{K} y_{nk}^{t_{nk}}$
where, for feature vector $\phi_n$,
$y_{nk} = y_k(\phi_n) = \frac{\exp(w_k^T\phi_n)}{\sum_j \exp(w_j^T\phi_n)}$
and T is the N × K target matrix with elements $t_{nk}$ (as defined earlier).
Objective function: the negative log-likelihood
$E(w_1,...,w_K) = -\ln p(T|w_1,..,w_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk}$
known as the cross-entropy error for the multi-class case.
Gradient of the error function with respect to parameter $w_j$ (using $y_k(\phi) = \exp(a_k)/\sum_j \exp(a_j)$, $a_k = w_k^T\phi$, $\partial y_k/\partial a_j = y_k(I_{kj} - y_j)$ where $I_{kj}$ are elements of the identity matrix, and $\sum_k t_{nk} = 1$):
$\nabla_{w_j} E(w_1,...,w_K) = \sum_{n=1}^{N}(y_{nj} - t_{nj})\phi_n$
i.e., error × feature vector.
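A minimal sketch of the cross-entropy error and its gradient as written above, assuming NumPy; the function name and the tiny random data set are invented for illustration.

```python
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy_and_grad(W, Phi, T):
    """E = -sum_n sum_k t_nk ln y_nk; gradient w.r.t. w_j is sum_n (y_nj - t_nj) phi_n.

    Phi: N x M design matrix, T: N x K one-hot targets,
    W:   M x K matrix whose columns are w_1, .., w_K.
    """
    Y = softmax_rows(Phi @ W)          # y_nk
    E = -np.sum(T * np.log(Y))
    grad = Phi.T @ (Y - T)             # M x K; column j is the gradient w.r.t. w_j
    return E, grad

# tiny invented problem: N = 5 samples, M = 3 features, K = 3 classes
rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))
T = np.eye(3)[rng.integers(0, 3, size=5)]
W = np.zeros((3, 3))
print(cross_entropy_and_grad(W, Phi, T))
```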

Gradient Descent
The gradient has the same form as for the sum-of-squares error function with the linear model and the cross-entropy error for the logistic regression model, i.e., the product of the error $(y_{nj} - t_{nj})$ times the basis function $\phi_n$.
We can use the sequential algorithm, in which inputs are presented one at a time and the weight vector is updated using
$w^{(\tau+1)} = w^{(\tau)} - \eta\nabla E_n$
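A sketch of this sequential (stochastic) update for the multiclass model, assuming NumPy; the data are random, so the reported accuracy is only illustrative.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

# Sequential update: w_j <- w_j - eta * (y_nj - t_nj) * phi_n, for every class j at once.
rng = np.random.default_rng(1)
N, M, K = 100, 3, 3
Phi = rng.normal(size=(N, M))             # invented features
labels = rng.integers(0, K, size=N)
T = np.eye(K)[labels]                     # one-hot targets
W = np.zeros((M, K))
eta = 0.1

for epoch in range(20):
    for n in rng.permutation(N):          # present inputs one at a time
        y_n = softmax(W.T @ Phi[n])
        W -= eta * np.outer(Phi[n], y_n - T[n])
print("training accuracy:", np.mean((Phi @ W).argmax(axis=1) == labels))
```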

Machine Learning Srihari ewton-raphson update gives IRLS Hessian matrix comprises blocks of size M x M Block j,k is given by wk wj E(w 1,...,w K = y nk (I kj y nj φ n o of blocks is also M x M, each corresponding to a pair of classes (with redundancy Hessian matrix is positive-definite, therefore error function has a unique minimum Batch Algorithm based on ewton-raphson φ n T 13

Summary of Logistic Regression Concepts
Definition of gradient and Hessian
Gradient and Hessian in Linear Regression
Gradient and Hessian in 2-class Logistic Regression

Definitions of Gradient and Hessian
The first derivative of a scalar function E(w) with respect to a vector $w = [w_1, w_2]^T$ is a vector called the gradient of E(w):
$\nabla E(w) = \frac{d}{dw}E(w) = \begin{pmatrix} \partial E/\partial w_1 \\ \partial E/\partial w_2 \end{pmatrix}$
The second derivative of E(w) is a matrix called the Hessian:
$H = \nabla\nabla E(w) = \frac{d^2}{dw^2}E(w) = \begin{pmatrix} \partial^2 E/\partial w_1^2 & \partial^2 E/\partial w_1\partial w_2 \\ \partial^2 E/\partial w_2\partial w_1 & \partial^2 E/\partial w_2^2 \end{pmatrix}$
If there are M elements in the vector, then the gradient is an M × 1 vector and the Hessian is a matrix with M² elements.
(The Jacobian matrix consists of the first derivatives of a vector-valued function with respect to a vector.)

Use of Gradient & Hessian in ML
Training samples: $(x_n, t_n)$, $n = 1,..,N$
Inputs: M × 1 vectors $\phi(x_n) = [\phi_0(x_n), \phi_1(x_n),..., \phi_{M-1}(x_n)]^T$, with $\phi_0(x) = 1$ a dummy feature
Outputs: $t = (t_1,..,t_N)^T$
Design matrix: $\Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_{M-1}(x_1) \\ \vdots & & \vdots \\ \phi_0(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}$
The error surface for M = 2 is a paraboloid with a single global minimum.
For linear regression (sum-of-squares error):
$E(w) = \frac{1}{2}\sum_{n=1}^{N}[w^T\phi(x_n) - t_n]^2$, where $w = (w_0, w_1,..,w_{M-1})^T$
For logistic regression (cross-entropy error):
$E(w) = -\ln p(t|w) = -\sum_{n=1}^{N}\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\}$, where $y_n = \sigma(w^T\phi(x_n))$
For stochastic gradient descent we need $\nabla E_n(w)$, where $E(w) = \sum_n E_n(w)$:
$w^{(\tau+1)} = w^{(\tau)} - \eta\nabla E_n(w)$
For the Newton-Raphson update we need both $\nabla E(w)$ and $H = \nabla\nabla E(w)$:
$w^{(new)} = w^{(old)} - H^{-1}\nabla E(w)$

Gradient and Hessian for Linear Regression
Sum-of-squares error (equivalent to maximum likelihood):
$E(w) = \frac{1}{2}\sum_{n=1}^{N}[w^T\phi(x_n) - t_n]^2$
Gradient of E:
$\nabla E(w) = \sum_{n=1}^{N}[w^T\phi_n - t_n]\phi_n = \Phi^T\Phi w - \Phi^T t$
Setting the gradient to zero gives $w_{ML} = (\Phi^T\Phi)^{-1}\Phi^T t$.
Hessian of E:
$H = \nabla\nabla E(w) = \Phi^T\Phi$
Newton-Raphson:
$w^{(new)} = w^{(old)} - (\Phi^T\Phi)^{-1}\{\Phi^T\Phi w^{(old)} - \Phi^T t\} = (\Phi^T\Phi)^{-1}\Phi^T t$
which is the same solution as with gradient descent (here $\Phi$, $\phi(x_n)$ and $t$ are as defined earlier).
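A small sketch illustrating the point above, assuming NumPy and invented data: because the Hessian of the sum-of-squares error is constant, a single Newton-Raphson step from any starting point lands exactly on the normal-equations solution.

```python
import numpy as np

rng = np.random.default_rng(3)
Phi = rng.normal(size=(20, 4)); Phi[:, 0] = 1.0   # dummy feature phi_0 = 1
t = rng.normal(size=20)                           # invented targets

w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)    # normal-equations solution

w_old = np.zeros(4)
grad = Phi.T @ Phi @ w_old - Phi.T @ t            # gradient of E at w_old
H = Phi.T @ Phi                                   # constant Hessian
w_new = w_old - np.linalg.solve(H, grad)          # one Newton step
print(np.allclose(w_new, w_ml))                   # True
```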

Gradient & Hessian: 2-class Logistic Regression
Cross-entropy error:
$E(w) = -\ln p(t|w) = -\sum_{n=1}^{N}\{t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\}$, where $y_n = \sigma(w^T\phi(x_n))$
Gradient of E:
$\nabla E(w) = \sum_{n=1}^{N}(y_n - t_n)\phi(x_n) = \Phi^T(y - t)$, where $y = (y_1,..,y_N)^T$ and $t = (t_1,..,t_N)^T$
Hessian of E:
$H = \nabla\nabla E(w) = \sum_{n=1}^{N} y_n(1 - y_n)\phi(x_n)\phi^T(x_n) = \Phi^T R\Phi$
where R is the N × N diagonal matrix with elements $R_{nn} = y_n(1 - y_n)$, $y_n = \sigma(w^T\phi(x_n))$.
The Hessian is not constant and depends on w through R. Since H is positive-definite (i.e., for arbitrary u, $u^T H u > 0$), the error function is a convex function of w and so has a unique minimum.
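A minimal IRLS sketch for the two-class case built from the gradient and Hessian above; it assumes NumPy, the function name is invented, and the labels are drawn with noise so the toy data are not linearly separable.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls_logistic(Phi, t, n_iter=10):
    """Newton-Raphson / IRLS for 2-class logistic regression.

    grad = Phi^T (y - t);  H = Phi^T R Phi with R_nn = y_n (1 - y_n).
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        grad = Phi.T @ (y - t)
        r = y * (1.0 - y)                           # diagonal of R
        H = Phi.T @ (r[:, None] * Phi) + 1e-8 * np.eye(M)
        w -= np.linalg.solve(H, grad)
    return w

# tiny invented two-class problem
rng = np.random.default_rng(4)
Phi = rng.normal(size=(40, 3)); Phi[:, 0] = 1.0     # dummy feature
t = (rng.random(40) < sigmoid(Phi @ np.array([0.3, 1.0, -2.0]))).astype(float)
print(irls_logistic(Phi, t))
```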

Gradient & Hessian: Multi-class Logistic Regression
Cross-entropy error:
$E(w_1,...,w_K) = -\ln p(T|w_1,..,w_K) = -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\ln y_{nk}$, where $y_{nk} = y_k(\phi(x_n)) = \frac{\exp(w_k^T\phi(x_n))}{\sum_j \exp(w_j^T\phi(x_n))}$
Gradient of E:
$\nabla_{w_j} E(w_1,...,w_K) = \sum_{n=1}^{N}(y_{nj} - t_{nj})\phi(x_n)$
Hessian of E:
$\nabla_{w_k}\nabla_{w_j} E(w_1,...,w_K) = \sum_{n=1}^{N} y_{nk}(I_{kj} - y_{nj})\phi(x_n)\phi^T(x_n)$
Each element of the Hessian needs M multiplications and additions; since there are M² elements in the matrix, the computation is O(M³).
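As a check on the gradient formula above, a hypothetical NumPy sketch comparing the analytic gradient with central finite differences on invented data:

```python
import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(W, Phi, T):
    return -np.sum(T * np.log(softmax_rows(Phi @ W)))

# Check numerically that the gradient w.r.t. w_j is sum_n (y_nj - t_nj) phi(x_n).
rng = np.random.default_rng(5)
Phi = rng.normal(size=(10, 3))
T = np.eye(3)[rng.integers(0, 3, size=10)]
W = rng.normal(size=(3, 3))

analytic = Phi.T @ (softmax_rows(Phi @ W) - T)     # column j is the gradient w.r.t. w_j
numeric = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (cross_entropy(Wp, Phi, T) - cross_entropy(Wm, Phi, T)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```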

An Example of 3-class Logistic Regression
Input data, with $\phi_0(x) = 1$ as a dummy feature.

Three-class Logistic Regression (initial state): three weight vectors, gradient, and Hessian (9 × 9, with some 3 × 3 blocks repeated).

Final Weight Vector, Gradient and Hessian (3-class)
Number of iterations: 6. Error (initial and final): 38.9394 and 1.5000e-009.