Generalized Linear Models and Logistic Regression

1 Department of Electrical and Computer Engineering University of Texas at Austin

2 Machine Learning - What do we want the machines to learn?

We want humanoid agents capable of making intelligent decisions. Decisions often need to be made based on a target outcome associated with a specified input. So if a machine can predict the target output after seeing an input, we can easily program it (using a few simple if-else statements) to make the decision associated with that output.

For example, if a robot can predict whether it is about to fall based on inputs from various motor sensors, it can be programmed to initiate balancing motor functions to prevent the fall.

3 Predictive Modeling

The process of learning to predict an outcome from a given input is known as Predictive Modeling. The output can be:

- A continuous value. This is called Regression. For example, predicting the force acting on the right limb of a robot given inputs from motor sensors.
- A discrete label. This is called Classification. For example, predicting whether the robot will fall or not based on the sensor inputs.

Supervised Learning is an important class of methods for predictive modeling.

4 Supervised Learning - Introduction

Given a set of input-output pairs, we want to learn a function $f$ that relates an input to the corresponding output. This is Supervised Learning.

- The learnt function can then be used to predict the output for unseen input values; this process is known as Prediction.
- The set of given input-output pairs is known as the Training Set. Learning of the function $f$ is supervised by the training set.
- The set of unseen input values is known as the Test Set.
- Inputs are also called features, attributes, or independent variables.
- Training and test sets can be assumed to be sampled i.i.d. from an unknown joint distribution $P(x, y) = P(y \mid x)\,P(x)$.

5 Supervised Learning - Preliminaries

Let $\mathcal{X}$ be the domain of the attributes.

- For a real-valued input (input from a single sensor), $\mathcal{X} \subseteq \mathbb{R}$.
- For a vector-valued input (from multiple sensors), $\mathcal{X} \subseteq \mathbb{R}^D$.
- For a categorical attribute, $\mathcal{X} \subseteq S$, a set of possible values of the attribute. For example, the set of allowed rotations of a limb.

Let $\mathcal{Y}$ be the domain of the target output.

- For regression, $\mathcal{Y} \subseteq \mathbb{R}$, or $\mathcal{Y} \subseteq \mathbb{R}^D$ if the output is vector valued.
- For classification, $\mathcal{Y} \subseteq S$, a set of possible labels. For the robot example, $S = \{0, 1\}$, with 0 for fall and 1 for no fall.

Given a training set $\{(x_i, y_i)\}_{i=1}^{N}$ of $N$ input-output pairs, we need to learn a function $f : \mathcal{X} \to \mathcal{Y}$ such that $y = f(x)$. $f$ is the model that relates $\mathcal{X}$ to $\mathcal{Y}$, and we want to learn this model.

6 Supervised Learning - Preliminaries

The space of all possible functions $f : \mathcal{X} \to \mathcal{Y}$ is infinite. Learning thus requires searching this infinite space for a function $f$ that satisfies the input-output pairs of the training set, which is very expensive. So:

- We restrict the search to a sub-space of functions.
- The sub-space is represented by a parameterized class of functions having a particular form, $y \approx f(x; \theta)$.
- The goal then is to learn the parameters $\theta$ that give the function which best approximates the output within this restricted class.

For example, we can look only for linear functions of the inputs, $f(x; \theta) = \theta^T x$. This is the case for linear models. Or we can search for a more complicated non-linear function for a better approximation, for example $f(x; \theta) = g(\theta^T x)$, where $g(\cdot)$ is some non-linear function.

7 Supervised Learning - Noise Model

Restricting the search to a sub-space results in an approximate estimate of the output:

$$y_i = f(x_i; \theta) + \epsilon_i$$

- $\epsilon_i$ is the residue/noise that the learnt function fails to account for because of the restricted search.
- Since each input-output pair $(x_i, y_i)$ is sampled from $P(x, y)$, we learn a model for the expected value of the output:

$$E[y_i] = f(x_i; \theta) + E[\epsilon_i]$$

- The finite size of the training set is another approximation, since the expected value is now based on a finite sample.
- The associated noise $\epsilon_i$ is also random and has the same form of distribution as $P(y \mid x_i)$. (Note that we only need to shift $y_i$ by the deterministic value $f(x_i; \theta)$ to get $\epsilon_i = y_i - f(x_i; \theta)$.)
- Different distribution assumptions for the output, $P(y \mid x)$, give rise to different models.

We will first look at simple models, least squares regression and logistic regression, and then generalize them to a class of models known as Generalized Linear Models (GLMs).

8 Least Squares Regression - Gaussian Noise Model

The domain of target output values is $\mathcal{Y} \subseteq \mathbb{R}$. It is reasonable to assume that the predicted output $f(x_i; \theta)$ is corrupted by zero-mean Gaussian noise:

$$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

The target output $y_i$ is then also normally distributed:

$$y_i \sim \mathcal{N}(f(x_i; \theta), \sigma^2)$$

The training samples are assumed to be drawn i.i.d. from the distribution $P(x, y)$:

$$P(\{y_i\} \mid \{x_i\})\,P(\{x_i\}) = \prod_{i=1}^{N} \mathcal{N}\big(y_i \mid f(x_i; \theta), \sigma^2\big)\,P(\{x_i\})$$

We can learn the parameters $\theta$ by maximizing the log-likelihood:

$$\theta^* = \arg\max_\theta \sum_{i=1}^{N} \log\big(P(y_i \mid x_i)\,P(x_i)\big)$$

$$\theta^* = \arg\min_\theta \sum_{i=1}^{N} \big(y_i - f(x_i; \theta)\big)^2 + \text{constant}$$

The constant collects the terms independent of $\theta$; together with the positive scale factor $1/(2\sigma^2)$ on the sum, it does not affect the optimization for $\theta$ and can be ignored.

9 Least Squares Regression - Gaussian Noise Model

Least squares regression arises when we assume the output is normally distributed. For the special case of linear models, we can assume $f(x_i; \theta) = \theta^T x_i$:

$$\theta^* = \arg\min_\theta \sum_{i=1}^{N} \big(y_i - \theta^T x_i\big)^2$$

This is the commonly used Linear Regression or Linear Least Squares, discovered by Gauss! A closed-form solution for $\theta$ can be obtained by taking the derivative of the objective function above and setting it to zero. The solution is the famous pseudo-inverse solution

$$\theta^* = (X^T X)^{-1} X^T y$$

What happens when $y$ is not normally distributed? We will now look at models that arise when $y$ has a non-normal distribution. We start with the case where $y$ has a Bernoulli distribution. This gives rise to the well-known Logistic Regression.
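The closed-form solution can be illustrated with a minimal Python sketch (the slides themselves use MATLAB). It assumes a single input plus a bias term, so that $X^T X$ is a 2x2 matrix that can be inverted by hand; the function name `fit_line` and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ theta0 + theta1 * x via the normal equations."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular; the inputs must not all be equal")
    # Apply the explicit inverse of the 2x2 matrix X^T X to X^T y
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (n * sxy - sx * sy) / det
    return theta0, theta1


# Toy data lying exactly on the line y = 1 + 2x (an assumption for illustration)
theta0, theta1 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(theta0, theta1)
```

For more than a couple of features one would normally solve the linear system with a numerical library rather than form the inverse explicitly.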

10 Logistic Regression - Introduction

The output is now a Bernoulli random variable with domain $\mathcal{Y} = \{0, 1\}$. The problem is really a classification problem, since we have just two labels. The name "regression" has stuck because it keeps the method consistent with a general class of models called GLMs, of which logistic regression is a special case.

Remember that we model the expected value of the target, and that the noise is generally assumed to be a zero-mean random variable:

$$E[y_i] = f(x_i; \theta) = P(y_i = 1 \mid x_i; \theta)$$

The function that we want to learn is thus constrained to have range $[0, 1]$. This is generally done by applying a non-linear squashing function to $\theta^T x_i$, so that the output of the linear model lies in $[0, 1]$. For logistic regression it is the sigmoid function:

$$f(x_i; \theta) = \sigma(\theta^T x_i) = \frac{1}{1 + \exp(-\theta^T x_i)}$$
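As a rough illustration, the squashing step might be coded as below in Python (the slides use MATLAB). The two-branch form is one common way to avoid overflow in `exp` for large negative inputs; the names `sigmoid` and `predict_proba` are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so exp(z) is at most 1
    return ez / (1.0 + ez)


def predict_proba(theta, x):
    """P(y = 1 | x; theta) for the linear model theta^T x."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))


# x carries a leading 1 for the bias term (an assumption of this sketch)
print(predict_proba([-3.0, 1.0], [1.0, 4.0]))
```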

11 Log Odds

Modeling $E[y]$ by a sigmoid function over a linear model is equivalent to modeling the log-odds by the linear model. This is easy to prove:

$$E[y] = P(y = 1 \mid x) = \frac{1}{1 + \exp(-\theta^T x)}$$

$$\exp(-\theta^T x) = \frac{1 - P(y = 1 \mid x)}{P(y = 1 \mid x)}$$

$$\theta^T x = \log\left(\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}\right)$$

$\log\left(\frac{P(y = 1 \mid x)}{P(y = 0 \mid x)}\right)$ is the log-odds for the Bernoulli random variable.
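The equivalence can be checked numerically. The short Python sketch below (function names are hypothetical) applies the sigmoid to a few linear scores and confirms that the log-odds of the resulting probability recovers the score.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Log-odds (logit) of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# Each printed pair should agree up to floating-point rounding
for z in (-2.0, 0.0, 0.5, 3.0):
    print(z, log_odds(sigmoid(z)))
```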

12 Logistic Sigmoid Function

Figure: Logistic Sigmoid Function $\sigma(x)$

- A smooth, monotonic, well-behaved function that squashes the real line to the range $[0, 1]$.
- Well-behaved implies continuity, double differentiability, etc., which are important during optimization.
- For logistic regression, it is the conditional probability of label 1 given the attributes.
- In Bayes decision theory, it is the posterior probability of the positive class. We will not cover Bayes decision theory in these slides.
- Useful property: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$. We will use this when learning the parameters.
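The derivative property can be sanity-checked against a finite-difference approximation. This is a small Python sketch with hypothetical helper names; the step size `h` is an arbitrary choice.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def central_difference(f, z, h=1e-5):
    """Numerical derivative of f at z by a symmetric difference."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-3.0, 0.0, 2.0):
    print(z, sigmoid_grad(z), central_difference(sigmoid, z))
```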

13 Linear Decision Boundary

Modeling the expected value via a sigmoid function applied to a linear model results in a linear separating hyperplane. This can easily be proved. In classification, the decision boundary is given by

$$P(y = 1 \mid x) = P(y = 0 \mid x)$$

$$P(y = 1 \mid x) = 1 - P(y = 1 \mid x)$$

$$\frac{1}{1 + \exp(-\theta^T x)} = \frac{\exp(-\theta^T x)}{1 + \exp(-\theta^T x)}$$

$$\exp(-\theta^T x) = 1 \quad\Longrightarrow\quad \theta^T x = 0$$

A linear decision boundary.
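In code, this means a label can be assigned from the sign of the linear score alone, without evaluating the sigmoid. A minimal Python sketch, with hypothetical names and parameter values chosen only for illustration:

```python
def linear_score(theta, x):
    return sum(t * v for t, v in zip(theta, x))


def classify(theta, x):
    """Label 1 when theta^T x >= 0, which is the same as P(y = 1 | x) >= 0.5."""
    return 1 if linear_score(theta, x) >= 0 else 0


# With theta = [-3, 1] and x = [1, x1], the boundary sits at x1 = 3
theta = [-3.0, 1.0]
print(classify(theta, [1.0, 2.9]), classify(theta, [1.0, 3.1]))
```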

14 Parameter Learning

We modeled $P(y_i = 1 \mid x_i, \theta) = \sigma(\theta^T x_i)$. Likewise, $P(y_i = 0 \mid x_i, \theta) = 1 - \sigma(\theta^T x_i)$. The distribution of the Bernoulli random variable is then given by

$$P(y_i \mid x_i, \theta) = \sigma(\theta^T x_i)^{y_i}\,\big(1 - \sigma(\theta^T x_i)\big)^{(1 - y_i)}$$

With the i.i.d. assumption on the training set, the likelihood is given by

$$P(y \mid x, \theta) = \prod_{i=1}^{N} \sigma(\theta^T x_i)^{y_i}\,\big(1 - \sigma(\theta^T x_i)\big)^{(1 - y_i)}$$

The log-likelihood is given by

$$L(\theta) = \sum_{i=1}^{N} y_i \log \sigma(\theta^T x_i) + (1 - y_i) \log\big(1 - \sigma(\theta^T x_i)\big)$$

$\theta$ can be learnt by maximizing $L(\theta)$:

$$\theta^* = \arg\max_\theta L(\theta)$$

Because of the sigmoid function, a closed-form solution does not exist, so we turn to gradient methods for optimization.
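The log-likelihood might be evaluated as in the Python sketch below. It assumes each input row already includes a leading 1 for the bias, and it computes $\log \sigma(z)$ through `log1p` to stay finite for large scores; all names are hypothetical.

```python
import math


def log_sigmoid(z):
    """log(sigma(z)) computed without forming sigma(z) explicitly."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_likelihood(theta, X, y):
    """Bernoulli log-likelihood L(theta) summed over the training set."""
    total = 0.0
    for xi, yi in zip(X, y):
        z = sum(t * v for t, v in zip(theta, xi))
        # log(1 - sigma(z)) equals log(sigma(-z))
        total += yi * log_sigmoid(z) + (1 - yi) * log_sigmoid(-z)
    return total


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
print(log_likelihood([0.0, 0.0], X, y))   # every point has probability 0.5
print(log_likelihood([-3.0, 2.0], X, y))  # a better fit gives a larger value
```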

15 Gradient Ascent for Optimization

$L(\theta)$ is a concave function; its negative is known as the cross-entropy. The global maximum can be obtained using gradient ascent: starting from an initial point, move in the direction of steepest ascent, i.e. in the direction of the gradient. We get the following iterative update procedure:

$$\theta^{(t+1)} = \theta^{(t)} + \alpha\,\nabla L(\theta^{(t)})$$

The gradient is given by

$$\nabla L(\theta^{(t)}) = \sum_{i=1}^{N} \big(y_i - \sigma(\theta^{(t)T} x_i)\big)\,x_i$$

- $\alpha$ is the learning rate. Selecting it by line search ensures that the likelihood increases at every iteration, but this can be time consuming.
- In practice, $\alpha$ is gradually decreased as the iterations progress to avoid drastic jumps in the search space.
- To avoid an expensive line search, other methods such as Newton's method or stochastic gradient ascent can be used.
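The update rule can be sketched in a few lines of Python. This is a minimal illustration with a fixed step size rather than a line search; the function names, the step size, the iteration count, and the six-point toy dataset (chosen so that the classes overlap) are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(theta, X, y):
    """Gradient of the log-likelihood: sum_i (y_i - sigma(theta^T x_i)) x_i."""
    g = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        err = yi - sigmoid(sum(t * v for t, v in zip(theta, xi)))
        for j, v in enumerate(xi):
            g[j] += err * v
    return g


def gradient_ascent(X, y, alpha=0.1, iters=5000):
    """Batch gradient ascent with a constant learning rate alpha."""
    theta = [0.0] * len(X[0])
    for _ in range(iters):
        g = gradient(theta, X, y)
        theta = [t + alpha * gj for t, gj in zip(theta, g)]
    return theta


# Rows carry a leading 1 for the bias; labels overlap around x = 2..3
X = [[1.0, float(v)] for v in range(6)]
y = [0, 0, 1, 0, 1, 1]
theta = gradient_ascent(X, y)
print(theta, gradient(theta, X, y))
```

At the returned parameters the gradient should be close to zero, which is the stationarity condition for the maximum of the concave log-likelihood.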

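16 IRLS - Iterated Re-weighted Least Squares

Another method is to minimize the negative log-likelihood $E(\theta) = -L(\theta)$ (the same as maximizing the likelihood) by Newton's method, which involves the inverse of the Hessian matrix:

$$\theta^{(t+1)} = \theta^{(t)} - \big(\nabla^2 E(\theta^{(t)})\big)^{-1}\,\nabla E(\theta^{(t)})$$

With $X$ the matrix whose rows are $x_i^T$, the Hessian is given by

$$\nabla^2 E(\theta^{(t)}) = \sum_{i=1}^{N} \sigma(\theta^{(t)T} x_i)\big(1 - \sigma(\theta^{(t)T} x_i)\big)\,x_i x_i^T = X^T W^{(t)} X$$

$$W^{(t)} = \mathrm{diag}\Big(\sigma(\theta^{(t)T} x_i)\big(1 - \sigma(\theta^{(t)T} x_i)\big)\Big)$$

The update for the parameters is then given by

$$\theta^{(t+1)} = \theta^{(t)} - \big(X^T W^{(t)} X\big)^{-1} X^T \big(\sigma(X\theta^{(t)}) - y\big)$$

$$\theta^{(t+1)} = \big(X^T W^{(t)} X\big)^{-1} X^T W^{(t)} z, \qquad z = X\theta^{(t)} - \big(W^{(t)}\big)^{-1}\big(\sigma(X\theta^{(t)}) - y\big)$$

- If the Hessian is positive definite (which it is here whenever $X$ has full column rank), a unique optimum is found.
- The update is a weighted least squares problem with weights $W^{(t)}$.
- The weight matrix depends on the current estimate $\theta^{(t)}$ and needs to be re-estimated at every iteration. Hence the name IRLS.
- For large problems, the Hessian can be a large matrix with an expensive inverse computation.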
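A compact Python sketch of the Newton/IRLS iteration follows. It solves the linear system $X^T W X\,s = X^T(\sigma(X\theta) - y)$ with a small Gaussian-elimination helper instead of forming the inverse; the helper `solve`, the function `irls`, the iteration count, and the toy data are assumptions, and no step damping or singularity handling is included.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def irls(X, y, iters=25):
    """Newton's method on the negative log-likelihood of logistic regression."""
    d = len(X[0])
    theta = [0.0] * d
    for _ in range(iters):
        p = [sigmoid(sum(t * v for t, v in zip(theta, xi))) for xi in X]
        w = [pi * (1.0 - pi) for pi in p]
        # Hessian X^T W X and gradient X^T (p - y) of the negative log-likelihood
        H = [[sum(wi * xi[a] * xi[b] for wi, xi in zip(w, X)) for b in range(d)]
             for a in range(d)]
        g = [sum((pi - yi) * xi[a] for pi, yi, xi in zip(p, y, X)) for a in range(d)]
        step = solve(H, g)
        theta = [t - s for t, s in zip(theta, step)]
    return theta


X = [[1.0, float(v)] for v in range(6)]
y = [0, 0, 1, 0, 1, 1]
print(irls(X, y))
```

On data that is perfectly linearly separable the weights shrink toward zero and the system becomes ill-conditioned, so a practical implementation would add safeguards that this sketch leaves out.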

17 Stochastic Gradient Ascent

The incremental counterpart of the gradient ascent method: we update the parameters as we receive more training points. The gradient is now computed only for the newly obtained point, which gives the following update:

$$\theta^{(t+1)} = \theta^{(t)} + \alpha\,\big(y_i - \sigma(\theta^{(t)T} x_i)\big)\,x_i$$

- It can be slow to reach the optimum compared with batch updates, where we sum the gradients over all the training points.
- For large-scale problems with streaming data, this is often the method of choice, and it is one reason for the popularity of logistic regression (and GLMs in general) for large-scale predictive modeling problems.
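The per-point update could look like the following Python sketch. It simulates a stream by shuffling a small fixed dataset each epoch; the function name, learning rate, epoch count, seed, and data are assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def stochastic_gradient_ascent(X, y, alpha=0.05, epochs=200, seed=0):
    """Update theta from one training point at a time."""
    rng = random.Random(seed)
    theta = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the points in a new random order
        for i in order:
            err = y[i] - sigmoid(sum(t * v for t, v in zip(theta, X[i])))
            theta = [t + alpha * err * v for t, v in zip(theta, X[i])]
    return theta


X = [[1.0, float(v)] for v in range(6)]
y = [0, 0, 1, 0, 1, 1]
print(stochastic_gradient_ascent(X, y))
```

With a constant learning rate the iterates keep fluctuating around the optimum, which is why a decreasing schedule is commonly used.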

18 Multi-Class Classification - SoftMax Regression

Logistic regression can be generalized to multiple classes by SoftMax Regression. For the $K$-class case, the target is a $K$-dimensional vector $y$ such that $y_k = 1$ if the training example belongs to class $k$. Following the standard noise model, we model the expected value of each $y_k$ as

$$E[y_{ik}] = P(y_{ik} = 1 \mid x_i, \theta_k) = \frac{\exp(\theta_k^T x_i)}{\sum_{k'=1}^{K} \exp(\theta_{k'}^T x_i)}$$

The likelihood for the training set can then be written as

$$P(Y \mid X, \{\theta_k\}) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left(\frac{\exp(\theta_k^T x_i)}{\sum_{k'=1}^{K} \exp(\theta_{k'}^T x_i)}\right)^{y_{ik}}$$

- The likelihood can be maximized by any of the methods discussed to learn the parameters.
- In SoftMax Regression, there is an implicit Multinomial distribution assumption for the target output.
- During prediction on the test set, the learnt parameters estimate the probability of each class. Normally, the final classification is the class with the maximum estimated probability.
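A minimal Python sketch of the prediction step is shown below. Subtracting the maximum score before exponentiating is a standard trick to avoid overflow and does not change the result; the function names and the example parameters are hypothetical.

```python
import math


def softmax(scores):
    """Class probabilities from linear scores theta_k^T x."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(thetas, x):
    """Return (most probable class index, list of class probabilities)."""
    scores = [sum(t * v for t, v in zip(theta_k, x)) for theta_k in thetas]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Three classes, inputs with a leading 1 for the bias (illustrative values)
thetas = [[0.0, -1.0], [0.0, 0.0], [-2.0, 1.0]]
print(predict_class(thetas, [1.0, 5.0]))
```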

19 Fisher Iris Dataset

Figure: Fisher Iris Data

- One of the most widely used datasets in machine learning.
- Comes with Matlab! Just use the command load fisheriris to load the data into the workspace.
- Three classes with 50 instances each. Setosa is linearly separable from Versicolor and Virginica.
- Independent variables: sepal length, sepal width, petal length, petal width (all in cm).

Let's try to classify setosa from the other two classes.

20

    >> load fisheriris
    >> X = meas(:,1:2);
    >> Y(1:50) = 0; Y(51:150) = 1; Y = Y';
    >> beta = glmfit(X, Y, 'binomial', 'link', 'logit');
    >> scatter(X(1:50,1), X(1:50,2), '+g');
    >> hold on
    >> scatter(X(51:150,1), X(51:150,2), '+r');
    >> x = 4:0.01:8;
    >> y = (-beta(2).* x - beta(1))./ beta(3);
    >> plot(x,y);

Figure: Classifying fisher iris dataset using features (1,2) and (3,4)

glmfit uses IRLS for iteratively learning the model parameters.

21 The two classes are linearly separable in the four dimensional space of the given independent variables. Here is an illustration using first three attributes. Figure: Classifying fisher iris dataset using features (1,2,3)

22 Train-Test Split

    >> ind_test = randsample(150, 50);
    >> X_test = X(ind_test, :); Y_test = Y(ind_test);
    >> ind_train = setdiff(1:150, ind_test);
    >> X_train = X(ind_train, :); Y_train = Y(ind_train);
    >> betas = glmfit(X_train, Y_train, 'binomial', 'link', 'logit');
    >> Y_hat = glmval(betas, X_test, 'logit');

Figure: Classifying fisher iris dataset - train and test sets

Only one point was misclassified. The classes, however, seem to be strongly linearly separable. Let us now look at a more challenging dataset which might not be linearly separable.

23 E. Coli Dataset

Protein localization sites for E. coli. The available classes are:

1. cp (cytoplasm)
2. im (inner membrane without signal sequence)
3. pp (periplasm)
4. imu (inner membrane, uncleavable signal sequence)
5. om (outer membrane)
6. oml (outer membrane lipoprotein)
7. iml (inner membrane lipoprotein)
8. ims (inner membrane, cleavable signal sequence)

We focus only on the two classes cp and im, as they form the bulk of the available data (they are the two largest classes, with 143 and 77 instances respectively).

There are 7 predictive attributes:

1. mcg: McGeoch's method for signal sequence recognition.
2. gvh: von Heijne's method for signal sequence recognition.
3. lip: von Heijne's Signal Peptidase II consensus sequence score. Binary attribute.
4. chg: Presence of charge on N-terminus of predicted lipoproteins. Binary attribute.
5. aac: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
6. alm1: score of the ALOM membrane spanning region prediction program.
7. alm2: score of the ALOM program after excluding putative cleavable signal regions from the sequence.

24 Logistic Regression on Ecoli Dataset

    >> data = dlmread('ecoli.txt');
    >> X = data(:,1:7); Y = data(:,8);
    >> betas = glmfit(X, Y, 'binomial', 'link', 'logit');
    >> betas
    [ ]

Some of the coefficients are very small relative to the others. This suggests that the corresponding dimensions may contribute little to predicting the target class (coefficient size also depends on the scale of each attribute). Let us do a PCA on the data matrix and then do logistic regression.

25

    >> data = dlmread('ecoli.txt');
    >> X = data(:,1:7); Y = data(:,8);
    >> [coeff, X1, latent] = princomp(X);   % newer MATLAB releases provide pca instead
    >> beta = glmfit(X1(:,1:3), Y, 'binomial', 'link', 'logit');
    >> scatter3(X1(1:143,1), X1(1:143,2), X1(1:143,3), '+g');
    >> hold on
    >> scatter3(X1(144:220,1), X1(144:220,2), X1(144:220,3), '+r');
    >> y = -0.6:0.01:0.6; x = -0.6:0.01:0.6;
    >> z = (-beta(2).* x - beta(3).* y - beta(1))./ beta(4);
    >> plot3(x,y,z);
    >> beta = [ ]

Figure: Classifying E.coli using first three principal components

It gives a reasonable separating hyperplane even though the classes are not linearly separable.

26 Gradient Ascent Implementation

    function [theta] = learn_theta(X, Y, prec, niters)
    log_sigmoid = @(x) 1./ (1 + exp(-x));
    linear_model = @(X, theta) X * theta;
    % Add a column of ones to X to account for the bias term
    X = [ones(length(Y), 1) X];
    theta_current = rand(size(X,2), 1);
    theta_new = 2 * theta_current;
    t = 1;
    alpha_current = 10;
    eta = 0.5; % the update rule for the step size
    theta_stack(t,:) = theta_current';
    alpha_update_iters = 1000;
    counter = 1;
    % Gradient ascent iterations
    while (norm(theta_new - theta_current) > prec && t < niters)
        theta_current = theta_new;
        if (t == counter * alpha_update_iters)
            % update only every alpha_update_iters
            alpha_current = update_alpha(alpha_current, eta);
            counter = counter + 1;
        end
        theta_new = theta_current + alpha_current .* X' * ...
            (Y - log_sigmoid(linear_model(X, theta_current)));
        t = t + 1;
        theta_stack(t,:) = theta_new';
    end
    theta = theta_new; % return the most recent iterate
    % Plot the trajectory of every parameter
    x = 1:size(theta_stack, 1);
    for ii = 1:length(theta)
        plot(x, theta_stack(:,ii)');
        hold on;
    end
    return;

    % Step size selection
    function [alpha_ret] = update_alpha(alpha_current, eta)
    alpha_ret = eta * alpha_current;
    if (alpha_ret < 0.001)
        alpha_ret = alpha_current;
    end
    return;

27 Gradient Ascent on Fisher Iris Dataset Figure: Parameter convergence and performance using gradient ascent

28

- Gradient ascent can easily be implemented in Matlab using matrix operations. Each iteration is considerably cheaper than an IRLS iteration, since it avoids the expensive inverse computation.
- We do, however, need to play around with the step size $\alpha$ to get reasonable convergence of the parameters.
- The plots in the last slide were generated by starting with a high $\alpha = 0.1$ and then multiplying it by 0.5 every 1000 iterations until it reached the lower bound enforced by update_alpha in the code.
- We want longer steps in the beginning to quickly reach the basin of convergence, but need to reduce the step size progressively to avoid oscillations around the optimum.
- Note that the separation is not as good as with glmfit. This is because we used a simple step-size selection rule and stopped early. Because $L(\theta)$ is concave, both IRLS and gradient ascent reach the same optimum when run to convergence.
- Other sophisticated (but expensive!) line-search methods can be used for step-size selection.

Now let us see how gradient ascent behaves when the classes are not linearly separable. Let us try to separate virginica from the other two classes.

29

Figure: Parameter convergence and performance using gradient ascent in the linearly non-separable case

This seems to be a reasonable separation. Note that the code for gradient ascent accepts X and Y as inputs. To generate these plots I just changed the class labels in Y to separate virginica from the other two classes.

30 Train-Test Split for Fisher Iris Dataset

Figure: Classifying fisher iris dataset - train and test sets

As in the case of IRLS, only one point was misclassified using the gradient ascent method.

31 Gradient Ascent for E.coli Dataset

Figure: Parameter convergence for E.coli dataset

    >> Y_hat = glmval(theta, X, 'logit');
    >> idx = find(Y_hat >= 0.8);
    >> Y_hat(idx) = 1;
    >> idx = find(Y_hat < 0.8);
    >> Y_hat(idx) = 0;

- glmval computes the expected value of the target by applying the inverse link function to the linear model. For logistic regression, this is the probability of the target being 1.
- We threshold at 0.8 to get the target class.
- The model correctly classifies 217 (98.64%) training instances.

32 Generalizing to other distributions

We now generalize the formulation to other possible distributions that the target output can assume. We will consider a rich class of distributions called the Exponential Family of Distributions. Many known distributions belong to this family, such as Gaussian, Bernoulli, Binomial, Gamma, Beta and many more. The class of models obtained from this family is known as Generalized Linear Models (GLMs).

Let's begin with the basics of the Exponential Family of Distributions and see how many commonly known distributions are special cases.

33 Exponential Family of Distributions

A parameterized family of distributions having the form

$$p(x \mid \theta) = \exp\big(\langle t(x), \theta \rangle - \Psi(\theta)\big)\,p_0(x)$$

A few properties:

- In canonical form, $t(x) = x$.
- $\theta$ is the natural parameter of the distribution.
- $\Psi(\cdot)$ is the cumulant function associated with the distribution.
- $p_0(x)$ is the base measure with respect to which the density is defined (integrals are taken in the Lebesgue-Stieltjes sense).

We are only interested in the property that relates $E[x]$ to the natural parameter $\theta$:

$$E[x] = \nabla\Psi(\theta)$$

This property forms the basis of GLMs. $(\nabla\Psi)^{-1}(\cdot)$ has a special name, the canonical link function: the simplest function that links the expected value to the natural parameter.

Table: Important special members of Exponential family

    Distribution   $\Psi(\theta)$                  $\nabla\Psi(\theta)$                              $(\nabla\Psi)^{-1}(t)$
    Bernoulli      $\log(1 + \exp(\theta))$        $\frac{\exp(\theta)}{1 + \exp(\theta)}$           $\log\left(\frac{t}{1 - t}\right)$
    Binomial       $N \log(1 + \exp(\theta))$      $\frac{N \exp(\theta)}{1 + \exp(\theta)}$         $\log\left(\frac{t}{N - t}\right)$
    Poisson        $\exp(\theta)$                  $\exp(\theta)$                                    $\log t$
    Gaussian       $\frac{\theta^2}{2}$            $\theta$                                          $t$
    Gamma          $-\log(-\theta)$                $-\frac{1}{\theta}$                               $-\frac{1}{t}$

Note that the inverse link function for the Bernoulli distribution is the logistic sigmoid function.
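The mean property can be checked numerically for two rows of the table. The Python sketch below differentiates the Poisson and Bernoulli cumulant functions by finite differences and compares the result with the mean computed directly; all function names, the step size, and the truncation of the Poisson sum are assumptions of this illustration.

```python
import math


def numeric_derivative(f, theta, h=1e-5):
    return (f(theta + h) - f(theta - h)) / (2.0 * h)


def bernoulli_cumulant(theta):
    return math.log1p(math.exp(theta))


def poisson_cumulant(theta):
    return math.exp(theta)


def poisson_mean(theta, terms=200):
    """E[x] for a Poisson with rate exp(theta), summed term by term."""
    lam = math.exp(theta)
    pmf = math.exp(-lam)  # P(x = 0)
    mean = 0.0
    for k in range(1, terms):
        pmf *= lam / k    # P(x = k) from P(x = k - 1)
        mean += k * pmf
    return mean


theta = 0.7
print(numeric_derivative(poisson_cumulant, theta), poisson_mean(theta))
# For a Bernoulli variable the mean is P(x = 1), the sigmoid of theta
print(numeric_derivative(bernoulli_cumulant, theta), 1.0 / (1.0 + math.exp(-theta)))
```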

34 Generalized Linear Models (GLMs)

The target output is now distributed according to a particular member of the exponential family of distributions. We assume a linear model for the natural parameter of the distribution:

$$\theta = \beta^T x$$

(We changed notation slightly!) We thus model the expected value of the target via the inverse canonical link function computed at the linear model, $E[y] = \nabla\Psi(\beta^T x)$, or equivalently

$$(\nabla\Psi)^{-1}(E[y]) = \beta^T x$$

We obtain the following expression for the likelihood, where $y$ is now the random variable:

$$P(y \mid x; \beta) = \exp\big(\langle t(y), \beta^T x \rangle - \Psi(\beta^T x)\big)\,p_0(y)$$

The likelihood can then be maximized using the methods described in previous slides to learn the model parameters $\beta$.
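For a canonical-form GLM with $t(y) = y$, differentiating the log of the likelihood above gives the gradient $\sum_i \big(y_i - \nabla\Psi(\beta^T x_i)\big)\,x_i$, so the gradient ascent loop has the same shape for every family member and only the mean function changes. The Python sketch below illustrates this with Poisson regression; the function name, step size, iteration count, and the three-point dataset are assumptions.

```python
import math


def fit_canonical_glm(X, y, mean_fn, alpha=0.01, iters=5000):
    """Gradient ascent on the log-likelihood of a canonical GLM.

    mean_fn is the gradient of the cumulant function, e.g. math.exp for
    Poisson regression or the logistic sigmoid for logistic regression.
    """
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            resid = yi - mean_fn(eta)
            for j, v in enumerate(xi):
                grad[j] += resid * v
        beta = [b + alpha * g for b, g in zip(beta, grad)]
    return beta


# Counts that double with each unit of x, so the fit should approach
# intercept 0 and slope log(2)
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
counts = [1, 2, 4]
print(fit_canonical_glm(X, counts, math.exp))
```

The fixed step size works for this tiny example; for the Poisson mean function, which grows exponentially, larger problems would typically need a smaller or adaptive step.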

35 Generalized Linear Models - GLMs

- GLMs are an important class of prediction models covering a large class of probability distributions.
- They can model a large class of problems by using more complicated link functions that need not be derived from the cumulant function. For example, in probit regression, the inverse link function is the CDF of a standard normal random variable.
- Parameters can be easily learnt by maximizing the likelihood using gradient methods.

Table: Important special members of Generalized Linear Models

    Distribution   Support            Prediction Model
    Bernoulli      $\{0, 1\}$         Logistic Regression
    Poisson        $\mathbb{Z}_+$     Poisson Regression
    Gaussian       $\mathbb{R}$       Least Squares
    Gamma          $\mathbb{R}_+$     Gamma Regression

36 Summary

- GLMs in general, and Logistic Regression in particular, are supervised learning methods catering to a large class of distributions called the exponential family of distributions.
- Simple, scalable parameter learning via gradient methods.
- Linear separating hyperplane for logistic regression.
- More complicated models can be obtained using more complicated, non-canonical link functions.

37 References

- Pattern Recognition and Machine Learning. Christopher M. Bishop, Springer.
- Generalized Linear Models. P. McCullagh and J.A. Nelder, Chapman & Hall/CRC.
- Machine Learning. Thomas M. Mitchell, McGraw-Hill Higher Education.
- Generalized Linear Models, with Application in Engineering and Sciences. R.H. Myers, D.C. Montgomery and C.G. Vining, John Wiley & Sons.
- A probabilistic classification system for predicting the cellular localization sites of proteins. Paul Horton and Kenta Nakai, Intelligent Systems in Molecular Biology, St. Louis, USA.
- E.coli Dataset. UCI Machine Learning Repository.
