MLPR: Logistic Regression and Neural Networks Machine Learning and Pattern Recognition Amos Storkey Amos Storkey MLPR: Logistic Regression and Neural Networks 1/28
Outline 1 Logistic Regression 2 Multi-layer perceptrons Amos Storkey MLPR: Logistic Regression and Neural Networks 2/28
Recap Generative Models for Real Variables: P(y, x) as a Gaussian, y real. Compute P(y x) by conditioning. Conditional Models for Real Variables: regression - P(y x). E.g. predict price of particular stock. Generative models for classes: Class conditional models: P(y, x) = P(x y)p(y). y discrete. Condition to get P(y x). Conditional Models for Classes? Have used class conditional modelling: P(y x) P(x y)p(y). This is a generative approach. Now model P(y x) directly. This is a conditional approach. Don t bother modelling P(x). We will introduce the logistic trick. Amos Storkey MLPR: Logistic Regression and Neural Networks 4/28
Which is the correct model? Two approaches encode different assumptions. Generative assumption: classes exist because data is drawn from two different distributions. Discriminative assumption, class label is drawn dependent on the value of x. Generative: Class Data. Discriminative: Data Class. Amos Storkey MLPR: Logistic Regression and Neural Networks 5/28
Example The weight of men and women. Men and women have different weight distributions because of characteristics of gender: men are on average taller, and are therefore more likely to have a higher weight. Weight and heart attacks. Obesity is a contributory factor to heart attacks. We do not expect someone s current weight to be determined by the heart attack they are going to have in the future! The underlying distribution of people s weight does not affect the chance of someone with a given weight having a heart attack. E.g. if the whole population on average lost weight, does not affect the model. Can ignore the distribution of people s weight. Amos Storkey MLPR: Logistic Regression and Neural Networks 6/28
Is this rule hard and fast? No. In a given stationary (i.e. no distributions are changing) circumstance, with no missing data, either approach can be used. If the conditional approach is used in a situation where a generative approach is more appropriate, it just models the P(x y) and P(y) implicitly through P(y x) = P(x y)p(y)/p(x). The conditional approach often has the advantage that more flexible model can be used for P(y x) than for P(x y). Amos Storkey MLPR: Logistic Regression and Neural Networks 7/28
Two Class Discrimination Consider a two class case: y {0, 1}. Use a model of the form P(y = 1 x) = g(x; w) g must be between 0 and 1. Furthermore the fact that probabilities sum to one means P(y = 0 x) = 1 g(x; w) What should we propose for g? Amos Storkey MLPR: Logistic Regression and Neural Networks 8/28
The logistic trick We need two things: A function that returns probabilities (i.e. stays between 0 and 1). But in regression (any form of regression) we used a function that returned values between ( and ). Use a simple trick to convert any regression model to a model of class probabilities. Squash it! The logistic (or sigmoid) function provides a means for this. g(x) = σ(x) 1/(1 + exp( x)). As x goes from to, so σ(x) goes from 0 to 1. Amos Storkey MLPR: Logistic Regression and Neural Networks 9/28
The Logistic Function 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 6 4 2 0 2 4 6 The Logistic Function σ(x) = 1 1+exp( x). Amos Storkey MLPR: Logistic Regression and Neural Networks 10/28
That is it! Almost. That is all we need. We can now convert any regression model to a classification model. Consider linear regression f = b + w T x. For linear regression P(y x) = N(y; f (x), σ 2 ) is Gaussian with mean f. Change the prediction by adding in the logistic function. This changes the likelihood... P(y = 1 x) = σ(f (x)) = σ(b + w T x). Linear regression/linear parameter models + logistic trick (use of sigmoid squashing) = logistic regression. Probability of 1 changes with distance from some hyperplane. Amos Storkey MLPR: Logistic Regression and Neural Networks 11/28
The Linear Decision Boundary 0 0 0 0 0 0 0 w 1 1 1 1 1 1 1 1 For two dimensional data the decision boundary is a line. Amos Storkey MLPR: Logistic Regression and Neural Networks 12/28
Logistic regression The bias parameter b shifts (for constant w) the position of the hyperplane, but does not alter the angle. The direction of the vector w affects the angle of the hyperplane. The hyperplane is perpendicular to w. The magnitude of the vector w effects how certain the classifications are. For small w most of the probabilities within a region of the decision boundary will be near to 0.5. For large w probabilities in the same region will be close to 1 or 0. Amos Storkey MLPR: Logistic Regression and Neural Networks 13/28
Likelihood Assume data is independent and identically distributed. For parameters Ω, the likelihood is p(d Ω) = N N P(y n x n ) = P(y = 1 x n ) y n (1 P(y = 1 x n )) 1 y n n=1 n=1 Hence the log likelihood is log P(D Ω) = N y n log P(y = 1 x n ) + (1 y n ) log (1 P(y = 1 x n )) (2) n=1 (1) Amos Storkey MLPR: Logistic Regression and Neural Networks 14/28
Logistic Regression Log Likelihood Using our assumed logistic regression model, the log likelihood becomes log P(D w, b) = N y n log σ(b+w T x n )+(1 y n ) log ( 1 σ(b + w T x n ) ) n=1 (3) For maximum likelihood we wish to maximise this value w.r.t the parameters w and b. Cannot do this explicitly as before. Use an iterative procedure. Amos Storkey MLPR: Logistic Regression and Neural Networks 15/28
Gradients As before we can calculate the gradients of the log likelihood. Gradient of logistic function is σ (x) = σ(x)(1 σ(x)). w L = N (y n σ(b + w T x n ))x n (4) n=1 L N b = (y n σ(b + w T x n )) (5) n=1 This cannot be solved directly to find the maximum. Have to revert to an iterative procedure. - E.g. Gradient Ascent Amos Storkey MLPR: Logistic Regression and Neural Networks 16/28
Gradient Ascent Consider the likelihood as a surface: a function of the parameters. Want to find the maximum likelihood value. In other words we want to find the highest point of the likelihood surface - the top of the hill. We propose a dumb hill climbing approach. Make sure you take each step in the steepest direction (locally). Eventually we will get to a point where whatever direction we step in will take us down. We are at a top. Note we are not necessarily at the top, but are at a top. We ignore this issue for the moment. Amos Storkey MLPR: Logistic Regression and Neural Networks 17/28
Gradient Ascent for Logistic Regression Choose some step size (or more accurately a learning rate) η. Initialise at some position in parameter space. Presume we are in position (w, b). At each step, move to position w new = w + η w L (6) b new = b + η L b (7) Iterate the stepping until some stopping criterion is reached. This might be when w and b don t change much anymore (equivalently all the partial derivatives are nearly zero). More on optimisation methods next lecture. Amos Storkey MLPR: Logistic Regression and Neural Networks 18/28
Multiple targets Just as with regression, we can have vector target values y by treating each target independently. We can have multinomial targets by using the softmax function on K separate targets: P(y = k x) = softmax(f) = exp(f k(x)) k exp(f k (x)) Amos Storkey MLPR: Logistic Regression and Neural Networks 19/28
Generalising to features Just as with regression, we can replace the x with features φ(x) in logistic regression too. Just compute the features ahead of time. Just as in regression, there is still the curse of dimensionality. Amos Storkey MLPR: Logistic Regression and Neural Networks 20/28
Reminder Up to now, for multivariate f we have f i = w ij Φ j (x) (8) j where i counts over outputs, and j counts over features and the bias term has been included as a separate feature. f is used to predict y: P(y x) = N(y, f, σ 2 I) for multivariate regression. P(y x) = σ(f) for multivariate binary classification. P(y x) = softmax(f) for any multinomial variable. Amos Storkey MLPR: Logistic Regression and Neural Networks 22/28
Feature adaptation in regression and logistic regression. One approach for overcoming curse of dimensionality: Parameterise features. Adjust the parameters of the features to find good ones. One simple way is to make each feature take a logistic regression form! φ j (x) = σ(w T j x + b j ) Hierarchy: output y depends on φ j which in turn depends on x. What makes a feature good? It aids in modelling the data. Maximum likelihood: maximize the likelihood w.r.t. all the parameters, including the parameters in the features. Amos Storkey MLPR: Logistic Regression and Neural Networks 23/28
MLP architecture Example: 1 hidden layer (bottom to top)..... output units... hidden units...... input units Amos Storkey MLPR: Logistic Regression and Neural Networks 24/28
-1 6 2-8 -4 9 7 3 5 Output is x y f = (6σ(7x + 3y 8) + 2σ(9x + 5y 4) 1) Amos Storkey MLPR: Logistic Regression and Neural Networks 25/28
Multilayer Perceptron, Neural Network This model is called various things. Feedforward Neural Network: by analogy each input x k output f i and feature φ j is seen as a local processing unit (neuron) that takes its inputs and transforms them to get its output. Do not stretch this analogy as they never were good models of neurons... or of neural networks. Amos Storkey MLPR: Logistic Regression and Neural Networks 26/28
Calculating Derivatives We can calculate the derivatives. Collect all the parameters in the output layer into w y. Use gradients for optimisation. Collect all the parameters in the feature layer into w φ. For MLP regression, the derivative w.r.t. some w is log P(y x) w = 1 N σ 2 (f (φ(x n, w φ ), w y ) y n ) f (φ(xn, w φ ), w y ) w n=1 If w is a parameter in w y then can compute this derivative directly. But if w is in w φ we need to use chain rule. f (φ(x n, w φ ), w y ) w = f (φ(xn, w φ ), w y ) φ(x n, w φ ) φ w The use of the chain rule in neural networks has become known as back-propagation. Amos Storkey MLPR: Logistic Regression and Neural Networks 27/28
Summary The difference between generative and discriminative models. The logistic function. Logistic regression. Hyperplane decision boundaries. The Perceptron. The likelihood for logistic regression. Amos Storkey MLPR: Logistic Regression and Neural Networks 28/28