Multiclass Logistic Regression



Sargur N. Srihari, University at Buffalo, State University of New York, USA

Machine Learning Srihari

Topics in Linear Classification using Probabilistic Discriminative Models

Generative vs Discriminative
1. Fixed basis functions in linear classification
2. Logistic Regression (two-class)
3. Iterative Reweighted Least Squares (IRLS)
4. Multiclass Logistic Regression
5. Probit Regression
6. Canonical Link Functions

Topics in Multiclass Logistic Regression

Multiclass Classification Problem
Softmax Regression
Softmax Regression Implementation
Softmax and Training
One-hot vector representation
Objective function and gradient
Summary of concepts in Logistic Regression
Example of 3-class Logistic Regression

Multi-class Classification Problem

Categories: K = 10. Examples: N = 100.

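Softmax Regression

In the two-class case, $p(C_1 \mid \phi) = y(\phi) = \sigma(w^T\phi + b)$, where $\phi = [\phi_1, \ldots, \phi_M]^T$, $w = [w_1, \ldots, w_M]^T$, and $a = w^T\phi + b$ is the activation.

For K classes, we work with the softmax function instead of the logistic sigmoid (softmax regression):

$$p(C_k \mid \phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$

where $a_k = w_k^T\phi + b_k$ for $k = 1, \ldots, K$, $w_k = [w_{k1}, \ldots, w_{kM}]^T$, and $a = \{a_1, \ldots, a_K\}$.

We learn a set of K weight vectors $\{w_1, \ldots, w_K\}$ and biases $b$. Arranging the weight vectors as the columns of a matrix $W$:

$$a = W^T\phi + b, \qquad W = \begin{bmatrix} w_1 & \cdots & w_K \end{bmatrix} = \begin{bmatrix} w_{11} & \cdots & w_{K1} \\ \vdots & & \vdots \\ w_{1M} & \cdots & w_{KM} \end{bmatrix}$$

$$y = \mathrm{softmax}(a), \qquad y_i = \frac{\exp(a_i)}{\sum_{j=1}^{K} \exp(a_j)}$$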
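The softmax definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original slides: the function name `softmax` and the max-subtraction trick (a standard way to avoid overflow in `exp`) are assumptions of this sketch.

```python
import numpy as np

def softmax(a):
    """Map activations a (shape (K,) or (N, K)) to class probabilities."""
    a = np.asarray(a, dtype=float)
    # Subtracting the maximum does not change the result (it cancels in the
    # ratio) but keeps exp() from overflowing for large activations.
    shifted = a - a.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative activations for K = 3 classes.
y = softmax([1.0, 2.0, 3.0])
```

The outputs are non-negative and sum to one, and the largest activation receives the largest probability.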

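Softmax Regression Implementation

3-class logistic regression with 3 inputs. In matrix multiplication notation, the network computes

$$a = W^T x + b, \qquad W = \begin{bmatrix} W_1 & W_2 & W_3 \end{bmatrix} = \begin{bmatrix} W_{1,1} & W_{1,2} & W_{1,3} \\ W_{2,1} & W_{2,2} & W_{2,3} \\ W_{3,1} & W_{3,2} & W_{3,3} \end{bmatrix}$$

$$y = \mathrm{softmax}(a), \qquad y_i = \frac{\exp(a_i)}{\sum_{j=1}^{3} \exp(a_j)}$$

[Figure: an example network]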
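The matrix form $a = W^T x + b$ followed by softmax can be sketched as below. The weight values, the input `x`, and the helper name `forward` are made up for illustration; only the structure (columns of `W` are the per-class weight vectors) follows the slide.

```python
import numpy as np

def forward(W, b, x):
    """Compute y = softmax(W^T x + b) for one input vector x."""
    a = W.T @ x + b              # one activation per class
    e = np.exp(a - a.max())      # shift by the max for numerical stability
    return e / e.sum()

# Hypothetical 3 x 3 weight matrix: column k is the weight vector of class k.
W = np.array([[0.1,  0.2,  0.3],
              [0.0, -0.1,  0.4],
              [0.5,  0.0, -0.2]])
b = np.zeros(3)
x = np.array([1.0, 2.0, 3.0])

y = forward(W, b, x)             # class probabilities for the 3 classes
```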

Softmax and Training

We use maximum likelihood to determine the parameters $\{w_k\}$, $k = 1, \ldots, K$.

The exp within softmax,

$$\mathrm{softmax}(a)_i = \frac{\exp(a_i)}{\sum_j \exp(a_j)},$$

works very well when training using the log-likelihood, because the log-likelihood can undo the exp of softmax:

$$\log \mathrm{softmax}(a)_i = a_i - \log \sum_j \exp(a_j)$$

The input $a_i$ always has a direct contribution to the cost. Because this first term cannot saturate, learning can proceed even if the contribution of $a_i$ to the second term becomes very small. The first term encourages $a_i$ to be pushed up; the second term encourages all of $a$ to be pushed down.
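The identity $\log \mathrm{softmax}(a)_i = a_i - \log \sum_j \exp(a_j)$ can be evaluated directly, without forming the softmax first. The sketch below assumes a hypothetical helper `log_softmax` and uses the usual max-shift for the log-sum-exp term; the activations are chosen to be extreme on purpose.

```python
import numpy as np

def log_softmax(a):
    """Return log softmax(a)_i = a_i - log(sum_j exp(a_j)) for a 1-D array."""
    a = np.asarray(a, dtype=float)
    m = a.max()
    # log-sum-exp with the maximum factored out, so exp() never overflows
    log_sum_exp = m + np.log(np.exp(a - m).sum())
    return a - log_sum_exp

# Activations so extreme that a naive exp(a) would overflow or underflow.
a = np.array([-1000.0, 0.0, 1000.0])
ls = log_softmax(a)
```

Even for the class with the tiny probability, the log-probability stays finite and remains linear in $a_i$, which is the non-saturating first term described above.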

Derivatives

The multiclass logistic regression model is

$$p(C_k \mid \phi) = y_k(\phi) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$

For maximum likelihood we will need the derivatives of $y_k$ with respect to all of the activations $a_j$. These are given by

$$\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)$$

where $I_{kj}$ are the elements of the identity matrix.
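The derivative formula can be checked numerically. The following sketch builds the matrix with entries $y_k(I_{kj} - y_j)$ and compares it with central finite differences; the function names, the test point `a`, and the step size `eps` are choices made for this illustration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def softmax_jacobian(a):
    """J[k, j] = dy_k / da_j = y_k (I_kj - y_j)."""
    y = softmax(a)
    return np.diag(y) - np.outer(y, y)

a = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(a)

# Central finite differences: column j holds dy / da_j.
eps = 1e-6
I = np.eye(a.size)
J_num = np.column_stack([
    (softmax(a + eps * I[j]) - softmax(a - eps * I[j])) / (2 * eps)
    for j in range(a.size)
])
```

Each column of `J` sums to zero, reflecting the fact that the outputs always sum to one.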

One-hot vector representation

Classes $C_1, \ldots, C_K$ are represented by a 1-of-K scheme.

One-hot vector: class $C_k$ is represented by a K-dimensional vector $t = [t_1, \ldots, t_K]^T$ with $t_i \in \{0, 1\}$ and $t_k = 1$. With K = 6, class $C_3$ is $(0,0,1,0,0,0)^T$, with $t_1 = t_2 = t_4 = t_5 = t_6 = 0$ and $t_3 = 1$.

The class probabilities and the target components both sum to one:

$$\sum_{k=1}^{K} p(C_k) = \sum_{k=1}^{K} t_k = 1$$

If $p(t_k = 1) = \mu_k$, then

$$p(t) = \prod_{k=1}^{K} \mu_k^{t_k}, \qquad \mu = (\mu_1, \ldots, \mu_K)^T$$

e.g., the probability of $C_3$ is $p([0,0,1,0,0,0]^T) = \mu_3^1 = \mu_3$.

Why use a one-hot representation? If we used numerical categories 1, 2, 3 we would impute ordinality. We can now use the simpler Bernoulli instead of the multinoulli.

Target Matrix, T

Classes have values 1, ..., K, each represented as a K-dimensional binary vector. We have N labeled samples, so instead of a target vector t we have an N x K target matrix T, with one row per sample and one column per class:

$$T = \begin{bmatrix} t_{11} & \cdots & t_{1K} \\ \vdots & & \vdots \\ t_{N1} & \cdots & t_{NK} \end{bmatrix}$$

Note that $t_{nk}$ corresponds to sample n and class k.
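A minimal sketch of the 1-of-K encoding and the resulting N x K target matrix follows. It assumes zero-based integer labels (so class $C_3$ is index 2); the helper name `one_hot` and the probabilities in `mu` are illustrative values, not taken from the slides.

```python
import numpy as np

def one_hot(labels, K):
    """Build the N x K target matrix T from zero-based integer class labels."""
    labels = np.asarray(labels)
    T = np.zeros((labels.size, K))
    T[np.arange(labels.size), labels] = 1.0   # t_nk = 1 for the class of sample n
    return T

# Three samples with K = 6 classes; the first sample belongs to C_3 (index 2).
T = one_hot([2, 0, 5], K=6)

# Probability of a one-hot target under class probabilities mu:
# p(t) = prod_k mu_k ** t_k, which picks out the single mu_k with t_k = 1.
mu = np.array([0.1, 0.2, 0.3, 0.1, 0.2, 0.1])
p = np.prod(mu ** T[0])
```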

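Objective Function & Gradient

Likelihood of the observations:

$$p(T \mid w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k \mid \phi_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}$$

where, for feature vector $\phi_n$,

$$y_{nk} = y_k(\phi_n) = \frac{\exp(w_k^T \phi_n)}{\sum_j \exp(w_j^T \phi_n)}$$

and T is the N x K target matrix with elements $t_{nk}$.

Objective function: the negative log-likelihood

$$E(w_1, \ldots, w_K) = -\ln p(T \mid w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$$

This is known as the cross-entropy error for multi-class classification.

Gradient of the error function with respect to parameter $w_j$:

$$\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, \phi_n$$

that is, error times feature vector. The derivation uses $a_k = w_k^T \phi$, the constraint $\sum_k t_{nk} = 1$, and

$$\frac{\partial y_k}{\partial a_j} = y_k (I_{kj} - y_j)$$

where $I_{kj}$ are the elements of the identity matrix.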
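The cross-entropy error and its gradient can be written compactly in matrix form and checked against finite differences. In this sketch the weights are stored as an M x K matrix `W` whose column j is $w_j$, so the gradient for all classes is `Phi.T @ (Y - T)`; the random data, the function names, and the step size are assumptions made for illustration.

```python
import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)      # row-wise shift for stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(W, Phi, T):
    """E = -sum_n sum_k t_nk ln y_nk, with Y = softmax(Phi W) row by row."""
    Y = softmax_rows(Phi @ W)
    return -np.sum(T * np.log(Y))

def gradient(W, Phi, T):
    """Column j is sum_n (y_nj - t_nj) phi_n."""
    Y = softmax_rows(Phi @ W)
    return Phi.T @ (Y - T)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))                 # N = 5 samples, M = 3 features
T = np.eye(3)[[0, 1, 2, 1, 0]]                # one-hot targets, K = 3
W = rng.normal(size=(3, 3))

G = gradient(W, Phi, T)

# Central finite-difference approximation of the same gradient.
eps = 1e-6
G_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        D = np.zeros_like(W)
        D[i, j] = eps
        G_num[i, j] = (cross_entropy(W + D, Phi, T)
                       - cross_entropy(W - D, Phi, T)) / (2 * eps)
```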

Gradient Descent

The gradient has the same form as for the sum-of-squares error function with the linear model and for the cross-entropy error with the logistic regression model, i.e., the product of the error $(y_{nj} - t_{nj})$ and the basis function $\phi_n$.

We can use the sequential algorithm, in which inputs are presented one at a time and the weight vector is updated using

$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n$$
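The sequential update can be sketched as follows, applying $w_j \leftarrow w_j - \eta (y_{nj} - t_{nj}) \phi_n$ for all classes at once via an outer product. The three-point toy data set, the learning rate `eta=0.1`, the number of passes, and the helper names are all assumptions chosen for this illustration.

```python
import numpy as np

def log_softmax(a):
    s = a - a.max()
    return s - np.log(np.exp(s).sum())

def sgd_step(W, phi_n, t_n, eta):
    """One sequential update from a single sample (phi_n, t_n)."""
    y = np.exp(log_softmax(W.T @ phi_n))
    # Column j of the outer product is (y_nj - t_nj) * phi_n.
    return W - eta * np.outer(phi_n, y - t_n)

def total_error(W, Phi, T):
    """Cross-entropy error summed over all samples."""
    return -sum(t @ log_softmax(W.T @ phi) for phi, t in zip(Phi, T))

# Toy data: one sample per class; the first feature is the dummy feature 1.
Phi = np.array([[1.0,  2.0,  0.0],
                [1.0,  0.0,  2.0],
                [1.0, -2.0, -2.0]])
T = np.eye(3)

W = np.zeros((3, 3))
error_before = total_error(W, Phi, T)
for epoch in range(100):
    for phi_n, t_n in zip(Phi, T):
        W = sgd_step(W, phi_n, t_n, eta=0.1)
error_after = total_error(W, Phi, T)
```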

Newton-Raphson update gives IRLS

The Hessian matrix comprises blocks of size M x M. Block j,k is given by

$$\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj})\, \phi_n \phi_n^T$$

There are K x K such blocks, each corresponding to a pair of classes (with redundancy), so the full Hessian is KM x KM.

The Hessian matrix is positive-definite, therefore the error function has a unique minimum.

This gives a batch algorithm based on Newton-Raphson.
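The block structure of the multiclass Hessian can be assembled directly from the formula for block j,k. The sketch below uses random data and a hypothetical function name `hessian`; it only checks the shape, the symmetry, and that no eigenvalue is negative beyond round-off.

```python
import numpy as np

def hessian(W, Phi):
    """Assemble the KM x KM Hessian from its M x M blocks.

    Block (j, k) = sum_n y_nk (I_kj - y_nj) phi_n phi_n^T.
    """
    N, M = Phi.shape
    K = W.shape[1]
    A = Phi @ W
    A = A - A.max(axis=1, keepdims=True)
    Y = np.exp(A)
    Y = Y / Y.sum(axis=1, keepdims=True)
    H = np.zeros((K * M, K * M))
    for j in range(K):
        for k in range(K):
            # One scalar weight per sample: y_nk (I_kj - y_nj).
            r = Y[:, k] * (float(j == k) - Y[:, j])
            H[j * M:(j + 1) * M, k * M:(k + 1) * M] = Phi.T @ (r[:, None] * Phi)
    return H

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))      # N = 20 samples, M = 3 features
W = rng.normal(size=(3, 3))         # K = 3 classes, column j is w_j
H = hessian(W, Phi)
min_eig = np.linalg.eigvalsh(H).min()
```

With M = 3 and K = 3 the result is a 9 x 9 matrix, matching the size quoted in the 3-class example.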

Summary of Logistic Regression concepts

Definition of gradient and Hessian
Gradient and Hessian in Linear Regression
Gradient and Hessian in 2-class Logistic Regression

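Definitions of Gradient and Hessian

The first derivative of a scalar function $E(w)$ with respect to a vector $w = [w_1, w_2]^T$ is a vector called the gradient of $E(w)$:

$$\nabla E(w) = \frac{d}{dw} E(w) = \begin{bmatrix} \dfrac{\partial E}{\partial w_1} \\[2mm] \dfrac{\partial E}{\partial w_2} \end{bmatrix}$$

If there are M elements in the vector, then the gradient is an M x 1 vector.

The second derivative of $E(w)$ is a matrix called the Hessian:

$$H = \nabla\nabla E(w) = \frac{d^2}{dw^2} E(w) = \begin{bmatrix} \dfrac{\partial^2 E}{\partial w_1^2} & \dfrac{\partial^2 E}{\partial w_1 \partial w_2} \\[2mm] \dfrac{\partial^2 E}{\partial w_2 \partial w_1} & \dfrac{\partial^2 E}{\partial w_2^2} \end{bmatrix}$$

The Hessian is a matrix with $M^2$ elements.

The Jacobian matrix consists of the first derivatives of a vector-valued function with respect to a vector.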

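Use of Gradient & Hessian in ML

Training samples: $n = 1, \ldots, N$

Inputs: M x 1 vectors $\phi(x_n) = (\phi_0(x_n), \phi_1(x_n), \ldots, \phi_{M-1}(x_n))^T$

Outputs: $t = (t_1, \ldots, t_N)^T$

Design matrix, with $\phi_0(x) = 1$ a dummy feature:

$$\Phi = \begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{bmatrix}$$

[Figure: error surface for M = 2, a paraboloid with a single global minimum]

For linear regression (sum-of-squared error):

$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \left[ w^T \phi(x_n) - t_n \right]^2, \qquad w = (w_0, w_1, \ldots, w_{M-1})^T$$

For logistic regression (cross-entropy error):

$$E(w) = -\ln p(t \mid w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}, \qquad y_n = \sigma\!\left(w^T \phi(x_n)\right)$$

For stochastic gradient descent we need $\nabla E_n(w)$, where $E(w) = \sum_n E_n(w)$:

$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_n(w)$$

For the Newton-Raphson update we need both $\nabla E(w)$ and $H = \nabla\nabla E(w)$:

$$w^{(new)} = w^{(old)} - H^{-1} \nabla E(w)$$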

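Gradient and Hessian for Linear Regression

Sum-of-squared errors (equivalent to maximum likelihood):

$$E(w) = \frac{1}{2} \sum_{n=1}^{N} \left[ w^T \phi(x_n) - t_n \right]^2$$

Here $\Phi$ is the N x M design matrix whose n-th row is $\phi(x_n)^T = (\phi_0(x_n), \phi_1(x_n), \ldots, \phi_{M-1}(x_n))$, and $t = (t_1, \ldots, t_N)^T$.

Gradient of E:

$$\nabla E(w) = \sum_{n=1}^{N} \left[ w^T \phi_n - t_n \right] \phi_n = \Phi^T \Phi w - \Phi^T t$$

Setting the gradient to zero gives $w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T t$.

Hessian of E:

$$H = \nabla\nabla E(w) = \Phi^T \Phi$$

Newton-Raphson:

$$w^{(new)} = w^{(old)} - (\Phi^T \Phi)^{-1} \left\{ \Phi^T \Phi w^{(old)} - \Phi^T t \right\} = (\Phi^T \Phi)^{-1} \Phi^T t$$

This is the least-squares solution, reached in a single Newton-Raphson step from any starting point; it is the same solution that gradient descent converges to.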
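The one-step property of Newton-Raphson for linear regression can be checked numerically. This sketch uses synthetic data (a noisy line with made-up coefficients) and compares a single Newton step from zero with NumPy's least-squares solver; all data values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.column_stack([np.ones(N), x])          # phi_0(x) = 1 is the dummy feature
t = 1.0 + 2.0 * x + 0.1 * rng.normal(size=N)    # synthetic targets

w_old = np.zeros(2)
grad = Phi.T @ Phi @ w_old - Phi.T @ t          # gradient of E at w_old
H = Phi.T @ Phi                                 # Hessian, independent of w

# One Newton-Raphson step: w_new = w_old - H^{-1} grad.
w_new = w_old - np.linalg.solve(H, grad)

# Reference least-squares solution for comparison.
w_ml = np.linalg.lstsq(Phi, t, rcond=None)[0]
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice; it computes the same step.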

Gradient & Hessian: 2-class Logistic Regression

Cross-entropy error:

$$E(w) = -\ln p(t \mid w) = -\sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right\}, \qquad y_n = \sigma\!\left(w^T \phi(x_n)\right)$$

Here $y = (y_1, \ldots, y_N)^T$, $t = (t_1, \ldots, t_N)^T$, and $\Phi$ is the N x M design matrix whose n-th row is $\phi(x_n)^T = (\phi_0(x_n), \phi_1(x_n), \ldots, \phi_{M-1}(x_n))$.

Gradient of E:

$$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n)\, \phi(x_n) = \Phi^T (y - t)$$

Hessian of E:

$$H = \nabla\nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n)\, \phi(x_n) \phi^T(x_n) = \Phi^T R\, \Phi$$

R is an N x N diagonal matrix with elements $R_{nn} = y_n (1 - y_n) = \sigma(w^T \phi(x_n)) \left(1 - \sigma(w^T \phi(x_n))\right)$.

The Hessian is not constant; it depends on w through R. Since H is positive-definite (i.e., $u^T H u > 0$ for arbitrary nonzero u), the error function is a convex function of w and so has a unique minimum.
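Putting the gradient $\Phi^T(y - t)$ and the Hessian $\Phi^T R \Phi$ into the Newton-Raphson update gives the IRLS iteration for two-class logistic regression. The sketch below is one way to write it, assuming a small hand-made data set whose classes overlap (so that a finite minimum exists), a zero starting point, and a fixed iteration count; none of these come from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=20):
    """Newton-Raphson (IRLS) for 2-class logistic regression, starting at w = 0."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        R = y * (1.0 - y)                       # diagonal of R, one entry per sample
        grad = Phi.T @ (y - t)                  # Phi^T (y - t)
        H = Phi.T @ (R[:, None] * Phi)          # Phi^T R Phi
        w = w - np.linalg.solve(H, grad)        # w_new = w_old - H^{-1} grad
    return w

# Hypothetical 1-D inputs with overlapping classes.
x = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0])
t = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])
Phi = np.column_stack([np.ones_like(x), x])     # dummy feature plus x

w = irls(Phi, t)
final_grad = Phi.T @ (sigmoid(Phi @ w) - t)
```

Unlike the linear regression case, R changes with w, so the step has to be repeated until the gradient vanishes.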

Gradient & Hessian: Multi-class Logistic Regression

Cross-entropy error:

$$E(w_1, \ldots, w_K) = -\ln p(T \mid w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$$

where T is the N x K target matrix with elements $t_{nk}$, $\phi(x_n) = (\phi_0(x_n), \phi_1(x_n), \ldots, \phi_{M-1}(x_n))^T$, $\Phi$ is the N x M design matrix whose n-th row is $\phi(x_n)^T$, and

$$y_{nk} = y_k(\phi(x_n)) = \frac{\exp(w_k^T \phi(x_n))}{\sum_j \exp(w_j^T \phi(x_n))}$$

Gradient of E:

$$\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj})\, \phi(x_n)$$

Hessian of E:

$$\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj})\, \phi(x_n) \phi^T(x_n)$$

Each element of the Hessian needs M multiplications and additions. Since there are $M^2$ elements in the matrix, the computation is $O(M^3)$.

An Example of 3-class Logistic Regression

[Figure: input data; $\phi_0(x) = 1$ is a dummy feature]

Three-class Logistic Regression

[Figure: three weight vectors (initial), gradient, and Hessian (9 x 9, with some 3 x 3 blocks repeated)]

Final Weight Vector, Gradient and Hessian (3-class)

[Figure: weight vector, gradient, and Hessian]

Number of iterations: 6
Error (initial and final): , e-009
