MLPR: Logistic Regression and Neural Networks

Machine Learning and Pattern Recognition
Amos Storkey

Outline
1. Logistic Regression
2. Multi-layer perceptrons

Recap
Generative models for real variables: model P(y, x) as a Gaussian, with y real; compute P(y|x) by conditioning.
Conditional models for real variables: regression, modelling P(y|x) directly, e.g. predicting the price of a particular stock.
Generative models for classes: class-conditional models, P(y, x) = P(x|y)P(y), with y discrete; condition to get P(y|x).
Conditional models for classes? So far we have used class-conditional modelling, P(y|x) ∝ P(x|y)P(y); this is a generative approach. Now we model P(y|x) directly: a conditional approach, with no need to model P(x). We will introduce the logistic trick.

Which is the correct model?
The two approaches encode different assumptions. Generative assumption: classes exist because the data are drawn from two different distributions. Discriminative assumption: the class label is drawn dependent on the value of x. Generative: Class → Data. Discriminative: Data → Class.

Example
The weight of men and women: men and women have different weight distributions because of characteristics of gender; men are on average taller, and are therefore more likely to have a higher weight. A generative view fits here.
Weight and heart attacks: obesity is a contributory factor in heart attacks, but we do not expect someone's current weight to be determined by the heart attack they are going to have in the future! The underlying distribution of people's weights does not affect the chance of someone with a given weight having a heart attack; e.g. if the whole population lost weight on average, the conditional model would not change. We can therefore ignore the distribution of people's weights.

Is this rule hard and fast?
No. In a stationary setting (i.e. no distributions are changing), with no missing data, either approach can be used. If the conditional approach is used where a generative approach is more appropriate, it simply models P(x|y) and P(y) implicitly through P(y|x) = P(x|y)P(y)/P(x). The conditional approach often has the advantage that a more flexible model can be used for P(y|x) than for P(x|y).

Two-Class Discrimination
Consider a two-class case: y ∈ {0, 1}. Use a model of the form P(y = 1|x) = g(x; w), where g must lie between 0 and 1. Because probabilities sum to one, P(y = 0|x) = 1 − g(x; w). What should we propose for g?

The logistic trick
We need a function that returns probabilities (i.e. stays between 0 and 1). But in regression (any form of regression) we used a function that returned values in (−∞, ∞). A simple trick converts any regression model into a model of class probabilities: squash it! The logistic (or sigmoid) function provides the means: σ(x) = 1/(1 + exp(−x)). As x goes from −∞ to ∞, σ(x) goes from 0 to 1.

[Figure: the logistic function σ(x) = 1/(1 + exp(−x)), plotted for x from −6 to 6; the curve rises smoothly from near 0 to near 1.]

That is it! Almost.
That is all we need: we can now convert any regression model into a classification model. Consider linear regression, f(x) = b + wᵀx. In linear regression P(y|x) = N(y; f(x), σ²) is Gaussian with mean f(x). Change the prediction by adding the logistic function; this changes the likelihood: P(y = 1|x) = σ(f(x)) = σ(b + wᵀx). Linear regression (or any linear parameter model) + the logistic trick (sigmoid squashing) = logistic regression. The probability of class 1 changes with distance from a hyperplane.
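
As a concrete illustration, here is a minimal numpy sketch of this prediction rule; the weights, bias and data points are made up for the example, not taken from the slides.

```python
import numpy as np

def sigmoid(a):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_prob(X, w, b):
    """P(y = 1 | x) = sigmoid(b + w^T x) for each row of X."""
    return sigmoid(X @ w + b)

# Hypothetical parameters and inputs for illustration.
w = np.array([2.0, -1.0])
b = 0.5
X = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [3.0, -2.0]])
print(predict_prob(X, w, b))  # values in (0, 1); larger b + w.x gives values closer to 1
```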

The Linear Decision Boundary
[Figure: points labelled 0 on one side of the boundary and points labelled 1 on the other, with the weight vector w perpendicular to the boundary.] For two-dimensional data the decision boundary is a line.

Logistic regression
The bias parameter b shifts (for constant w) the position of the hyperplane, but does not alter its angle. The direction of the vector w sets the angle of the hyperplane: the hyperplane is perpendicular to w. The magnitude of w affects how certain the classifications are: for small ‖w‖ most of the probabilities within a region around the decision boundary are near 0.5; for large ‖w‖ probabilities in the same region are close to 1 or 0.
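
A small numeric check of the last point, continuing the hypothetical example above (it reuses sigmoid, w and b): scaling (w, b) by a constant c leaves the decision boundary b + wᵀx = 0 unchanged, but pushes probabilities away from 0.5.

```python
x = np.array([0.5, 0.2])          # a point on the class-1 side of the boundary
for c in [0.5, 1.0, 5.0]:
    p = sigmoid(c * (x @ w + b))  # same boundary, rescaled weight magnitude
    print(c, p)                   # larger c pushes p towards 1 (or 0 on the other side)
```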

Likelihood
Assume the data are independent and identically distributed. For parameters Ω, the likelihood is

P(D|Ω) = Π_{n=1}^{N} P(y_n|x_n) = Π_{n=1}^{N} P(y = 1|x_n)^{y_n} (1 − P(y = 1|x_n))^{1−y_n}   (1)

Hence the log likelihood is

log P(D|Ω) = Σ_{n=1}^{N} [ y_n log P(y = 1|x_n) + (1 − y_n) log(1 − P(y = 1|x_n)) ]   (2)

Logistic Regression Log Likelihood
Using our assumed logistic regression model, the log likelihood becomes

log P(D|w, b) = Σ_{n=1}^{N} [ y_n log σ(b + wᵀx_n) + (1 − y_n) log(1 − σ(b + wᵀx_n)) ]   (3)

For maximum likelihood we wish to maximise this value with respect to the parameters w and b. This cannot be done in closed form as before; we use an iterative procedure.
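
A minimal sketch of log likelihood (3) in numpy, continuing the running example above (the labels y are made up for illustration):

```python
def log_likelihood(X, y, w, b):
    """Log likelihood (3) of binary labels y under logistic regression."""
    p = sigmoid(X @ w + b)                      # P(y = 1 | x_n) for each n
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0, 1, 1])                         # hypothetical labels for the X above
print(log_likelihood(X, y, w, b))
```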

Gradients
As before we can calculate the gradients of the log likelihood. The derivative of the logistic function is σ′(x) = σ(x)(1 − σ(x)).

∇_w L = Σ_{n=1}^{N} (y_n − σ(b + wᵀx_n)) x_n   (4)

∂L/∂b = Σ_{n=1}^{N} (y_n − σ(b + wᵀx_n))   (5)

Setting these to zero cannot be solved directly for the maximum, so we have to use an iterative procedure, e.g. gradient ascent.
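
Gradients (4) and (5) in numpy, as a sketch matching the log-likelihood function above:

```python
def gradients(X, y, w, b):
    """Gradients (4) and (5) of the log likelihood w.r.t. w and b."""
    err = y - sigmoid(X @ w + b)    # (y_n - P(y = 1 | x_n)) for each n
    grad_w = X.T @ err              # sum over n of err_n * x_n
    grad_b = np.sum(err)
    return grad_w, grad_b
```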

Gradient Ascent
Consider the likelihood as a surface: a function of the parameters. We want to find the maximum likelihood value, in other words the highest point of the likelihood surface: the top of the hill. We propose a simple hill-climbing approach: take each step in the (locally) steepest uphill direction. Eventually we reach a point where whatever direction we step in takes us down; we are at a top. Note that we are not necessarily at the top, but at a top (a local maximum). We ignore this issue for the moment.

Gradient Ascent for Logistic Regression
Choose some step size (more accurately, a learning rate) η. Initialise at some position (w, b) in parameter space. At each step, move to

w_new = w + η ∇_w L   (6)
b_new = b + η ∂L/∂b   (7)

Iterate the steps until some stopping criterion is reached, for example when w and b no longer change much (equivalently, all the partial derivatives are nearly zero). More on optimisation methods next lecture.
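
Putting the pieces together, a sketch of the full gradient-ascent loop, updates (6) and (7), reusing the gradients function and the X, y from the running example; the learning rate and stopping tolerance are illustrative choices, not values from the lecture.

```python
def fit_logistic_regression(X, y, eta=0.1, tol=1e-6, max_iters=5000):
    """Maximum likelihood for logistic regression by gradient ascent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iters):
        grad_w, grad_b = gradients(X, y, w, b)
        w = w + eta * grad_w
        b = b + eta * grad_b
        # Stop when the gradient is nearly zero (parameters barely change).
        if np.max(np.abs(grad_w)) < tol and abs(grad_b) < tol:
            break
    return w, b

w_hat, b_hat = fit_logistic_regression(X, y)
print(predict_prob(X, w_hat, b_hat))  # fitted probabilities for the training points
```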

Multiple targets
Just as with regression, we can have vector target values y by treating each target independently. We can have multinomial targets by using the softmax function on K separate outputs:

P(y = k|x) = softmax(f)_k = exp(f_k(x)) / Σ_{k'} exp(f_{k'}(x))
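
A minimal softmax sketch in numpy (the max-subtraction is a standard numerical-stability step, not part of the slide, and does not change the result):

```python
def softmax(f):
    """P(y = k | x) from a vector of K scores f_k(x)."""
    f = f - np.max(f)             # stabilises exp without changing the ratios
    e = np.exp(f)
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1; largest score gets largest probability
```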

Generalising to features
Just as with regression, we can replace x with features φ(x) in logistic regression too: simply compute the features ahead of time. And just as in regression, there is still the curse of dimensionality.
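
For instance, a quadratic feature map computed ahead of time and fed into the same fitting routine from above; the particular choice of features here is just an illustration.

```python
def quadratic_features(X):
    """phi(x) = (x1, x2, x1^2, x2^2, x1*x2): fixed features, computed up front."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

Phi = quadratic_features(X)
w_phi, b_phi = fit_logistic_regression(Phi, y)  # same algorithm, new inputs
```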

Reminder
Up to now, for multivariate f we have

f_i = Σ_j w_ij φ_j(x)   (8)

where i counts over outputs, j counts over features, and the bias term has been included as a separate feature. f is used to predict y: P(y|x) = N(y; f, σ²I) for multivariate regression; P(y|x) = σ(f) for multivariate binary classification; P(y|x) = softmax(f) for a multinomial variable.

Feature adaptation in regression and logistic regression
One approach to overcoming the curse of dimensionality: parameterise the features, and adjust the feature parameters to find good ones. One simple way is to make each feature take a logistic regression form itself: φ_j(x) = σ(w_jᵀx + b_j). This gives a hierarchy: the output y depends on the φ_j, which in turn depend on x. What makes a feature good? That it aids in modelling the data. Maximum likelihood: maximise the likelihood with respect to all the parameters, including the parameters inside the features.

MLP architecture
Example: one hidden layer. [Figure: network drawn bottom to top, with input units at the bottom, hidden units in the middle and output units at the top.]

[Figure: a small network with inputs x and y, two hidden units and one output, with the weights and biases marked on the edges.] The output is f = 6σ(7x + 3y − 8) + 2σ(9x + 5y − 4) − 1.
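
A sketch evaluating exactly this two-hidden-unit network (reusing sigmoid from above), to make the composition of logistic features explicit:

```python
def tiny_mlp(x, y):
    """f = 6*sigma(7x + 3y - 8) + 2*sigma(9x + 5y - 4) - 1, the network above."""
    h1 = sigmoid(7 * x + 3 * y - 8)   # first hidden (feature) unit
    h2 = sigmoid(9 * x + 5 * y - 4)   # second hidden (feature) unit
    return 6 * h1 + 2 * h2 - 1        # linear output layer with bias -1

print(tiny_mlp(0.0, 0.0), tiny_mlp(1.0, 1.0))
```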

Multilayer Perceptron, Neural Network
This model goes by various names. Feedforward neural network: by analogy, each input x_k, output f_i and feature φ_j is seen as a local processing unit (a neuron) that takes its inputs and transforms them to produce its output. Do not stretch this analogy: these models never were good models of neurons... or of neural networks.

Calculating Derivatives
We can calculate the derivatives and use the gradients for optimisation. Collect all the parameters of the output layer into w_y, and all the parameters of the feature layer into w_φ. For MLP regression, the derivative with respect to some parameter w is

∂/∂w log P(y|x) = −(1/σ²) Σ_{n=1}^{N} (f(φ(x_n, w_φ), w_y) − y_n) ∂f(φ(x_n, w_φ), w_y)/∂w

If w is a parameter in w_y then this derivative can be computed directly. But if w is in w_φ we need the chain rule:

∂f(φ(x_n, w_φ), w_y)/∂w = (∂f(φ(x_n, w_φ), w_y)/∂φ) (∂φ(x_n, w_φ)/∂w)

The use of the chain rule in neural networks has become known as back-propagation.
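
A sketch of these derivatives for a one-hidden-layer regression MLP with squared error, which matches the Gaussian log likelihood up to the 1/σ² factor; the parameter names and shapes are illustrative, and sigmoid and numpy are reused from the earlier blocks.

```python
def mlp_forward(X, W_phi, b_phi, w_y, b_y):
    """Hidden features phi = sigma(X W_phi + b_phi); output f = phi w_y + b_y."""
    phi = sigmoid(X @ W_phi + b_phi)   # shape (N, H)
    f = phi @ w_y + b_y                # shape (N,)
    return f, phi

def mlp_gradients(X, y, W_phi, b_phi, w_y, b_y):
    """Gradients of 0.5 * sum_n (f_n - y_n)^2 via the chain rule (back-propagation)."""
    f, phi = mlp_forward(X, W_phi, b_phi, w_y, b_y)
    err = f - y                                              # d(loss)/df for each n
    # Output-layer parameters w_y, b_y: derivative computed directly.
    grad_w_y = phi.T @ err
    grad_b_y = np.sum(err)
    # Feature-layer parameters W_phi, b_phi: chain rule through phi,
    # using sigma'(a) = sigma(a) * (1 - sigma(a)).
    dphi = (err[:, None] * w_y[None, :]) * phi * (1 - phi)   # shape (N, H)
    grad_W_phi = X.T @ dphi
    grad_b_phi = np.sum(dphi, axis=0)
    return grad_w_y, grad_b_y, grad_W_phi, grad_b_phi
```

To train, one would step downhill on the squared error with these gradients (equivalently, uphill on the log likelihood), just as in the logistic regression loop above.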

Summary
The difference between generative and discriminative models. The logistic function. Logistic regression. Hyperplane decision boundaries. The perceptron. The likelihood for logistic regression.