Engineering Part IIB: Module 4F10 Statistical Pattern Processing
Lecture 6: Multi-Layer Perceptrons I
Phil Woodland: pcw@eng.cam.ac.uk
Michaelmas 2012

Introduction

In the next two lectures we will look at Multi-Layer Perceptrons (MLPs), which are more powerful than single-layer models. It will be seen that MLPs can be trained as discriminative models to yield class posteriors. MLPs are a type of artificial neural network: the computation is performed by a set of (many) simple units with weighted connections between them. Learning algorithms exist to set the values of the weights, and the same basic structure (with different weight values) is able to perform many tasks.

In this and the following lecture we will consider:
- the overall structure of multi-layer perceptrons;
- the decision boundaries that they can form;
- training criteria;
- networks as posterior probability estimators;
- the basic error back-propagation training algorithm;
- improved training methods.

Application of MLPs

- Application of multi-layer perceptron neural networks to vision problems
- Application of Multilayer Perceptron Network for Tagging Parts-of-Speech
- An application of the multilayer perceptron: solar radiation maps in Spain
- Application of multilayer perceptron networks to laser diode nonlinearity determination for radio-over-fibre mobile communications
- A Neural Network Based System for Intrusion Detection and Classification of Attacks
- Water quality sensing using multi-layer perceptron artificial neural networks
- Evaluation of a multilayer perceptron based model to forecast O3 and NO2 hourly levels

Many applications in many areas! Recently there has been much interest, in various areas of speech recognition, in Deep Neural Networks: MLPs with many hidden layers.

Multi-Layer Perceptron

A multi-layer perceptron is needed to handle the XOR problem. More generally, multi-layer perceptrons allow a neural network to perform arbitrary mappings. The aim is to map an input vector x into an output y(x).

[Figure: a 2-hidden-layer network, with inputs x_1, ..., x_d feeding the first layer, then a second layer, then an output layer producing y_1(x), ..., y_K(x).]

The layers may be described as:
- Input layer: accepts the data vector or pattern.
- Hidden layers: one or more layers. Each accepts the outputs from the previous layer, weights them, and passes the result through a (normally non-linear) activation function.
- Output layer: takes the outputs from the final hidden layer, weights them, and possibly passes the result through an output non-linearity, to produce the target values.
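To make the XOR claim concrete, here is a minimal sketch (an illustration, not from the handout) of a one-hidden-layer network of Heaviside/threshold units with hand-chosen weights that computes XOR; the weight values are illustrative assumptions.

```python
import numpy as np

def step(z):
    """Heaviside threshold activation: 1 if z >= 0, else 0."""
    return (z >= 0).astype(float)

def xor_mlp(x):
    """One-hidden-layer MLP of threshold units computing XOR.

    Hidden unit 1 fires for x1 OR x2; hidden unit 2 fires for x1 AND x2;
    the output unit fires for (OR and not AND), i.e. XOR.
    """
    W1 = np.array([[1.0, 1.0],    # OR unit weights
                   [1.0, 1.0]])   # AND unit weights
    b1 = np.array([-0.5, -1.5])   # OR / AND thresholds
    w2 = np.array([1.0, -1.0])    # output combines OR minus AND
    b2 = -0.5
    h = step(W1 @ x + b1)
    return step(w2 @ h + b2)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, xor_mlp(np.array(x, dtype=float)))
# prints 0, 1, 1, 0 - a mapping no single-layer perceptron can produce
```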

Possible Decision Boundaries

The decision boundaries that may be produced vary with the network topology. Here only threshold activation functions (as in the single-layer perceptron) are used. There are three situations to consider:
- Single layer: able to position a hyperplane in the input space (the SLP).
- Two layers (one hidden layer): able to describe a decision boundary which surrounds a single convex region of the input space.
- Three layers (two hidden layers): able to generate arbitrary decision boundaries.

[Figure: panels (1), (2) and (3) illustrate the boundaries for the three cases.]

Note: any decision boundary can be approximated arbitrarily closely by a two-layer network with sigmoidal activation functions, given enough hidden units.

Number of Hidden Units

The number of hidden layers determines the decision boundaries that can be generated. In choosing the number of layers, the following considerations are made:
- Multi-layer networks are harder to train than single-layer networks.
- Two-layer networks (one hidden layer) with sigmoidal activation functions can model any decision boundary.
- Two-layer networks (hidden layer with sigmoidal activation functions) are commonly used in pattern recognition, although there is increasing interest in deep networks.

How many units should there be in each layer?
- The number of output units is often determined by the number of output classes.
- The number of inputs is determined by the number of input dimensions.
- The number of hidden units is a design issue. The problems are: too few, and the network will not model complex decision boundaries; too many, and the network will have poor generalisation.

Hidden Layer Perceptron

The hidden- and output-layer perceptron is a generalisation of the SLP: the weighted input is passed to a general activation function, rather than a threshold. Consider a single perceptron, and assume that there are n units at the previous level (k-1), so the inputs are x_j = y_j(x^{(k-1)}).

[Figure: inputs x_1, ..., x_n are weighted by w_{i1}, ..., w_{in}, summed together with the bias w_{i0} to give z_i, and passed through the activation function to give y_i(x).]

The output from the perceptron, y_i(x), may be written as

  y_i(x) = \phi(z_i) = \phi(\mathbf{w}_i'\mathbf{x} + w_{i0}) = \phi\left( \sum_{j=1}^{n} w_{ij} x_j + w_{i0} \right)

where \phi() is the activation function.

Note: the activation function is not necessarily non-linear. However, if linear activation functions are used throughout, much of the power of the network is lost.
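To see why linear activations lose the network's power, the following sketch (an illustration, not from the handout; layer sizes and random weights are assumptions) checks numerically that stacking layers with the identity activation collapses to a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with identity (linear) activations, illustrative sizes.
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)

def two_linear_layers(x):
    """y = W2 (W1 x + b1) + b2 with linear activations throughout."""
    return W2 @ (W1 @ x + b1) + b2

# The same mapping as a single linear layer: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2

x = rng.normal(size=3)
print(np.allclose(two_linear_layers(x), W @ x + b))  # True
```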

Activation Functions

A variety of non-linear activation functions may be used. The general form is y_i(x) = \phi(\mathbf{w}_i'\mathbf{x} + w_{i0}) = \phi(z_i), and there are n units (perceptrons) at the current level.

Heaviside (or step) function:

  y_i(x) = \begin{cases} 0, & z_i < 0 \\ 1, & z_i \geq 0 \end{cases}

Used in threshold units; the output is binary (gives a discriminant function).

Sigmoid (or logistic) function:

  y_i(x) = \frac{1}{1 + \exp(-z_i)}

The output is continuous, 0 \leq y_i(x) \leq 1.

Softmax (or normalised exponential, or generalised logistic) function:

  y_i(x) = \frac{\exp(z_i)}{\sum_{j=1}^{n} \exp(z_j)}

The output is positive and the outputs at the current level sum to 1, so 0 \leq y_i(x) \leq 1.

Hyperbolic tangent (tanh) function:

  y_i(x) = \frac{\exp(z_i) - \exp(-z_i)}{\exp(z_i) + \exp(-z_i)}

The output is continuous, -1 \leq y_i(x) \leq 1.

Activation Functions (cont)

[Figure: the step (green), sigmoid (red) and tanh (blue) activation functions plotted over the range -5 to 5.]

- Softmax is often used at the output layer, as the sum-to-one constraint is enforced (e.g. for a probability distribution over output classes).
- The sigmoid can be used as an approximation to a 0/1 threshold.
- Linear output units can be used for regression problems.
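As a quick reference, here is a minimal sketch (using numpy, not from the handout) of the four activation functions defined above; subtracting the maximum inside the softmax is a standard numerical precaution added here, not something the notes discuss.

```python
import numpy as np

def heaviside(z):
    """Step function: 0 for z < 0, 1 for z >= 0."""
    return (z >= 0).astype(float)

def sigmoid(z):
    """Logistic sigmoid, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Normalised exponential over a vector z; outputs are positive and sum to 1.

    Subtracting max(z) leaves the result unchanged but avoids overflow.
    """
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def tanh(z):
    """Hyperbolic tangent, output in (-1, 1)."""
    return np.tanh(z)

z = np.array([-2.0, 0.0, 3.0])
print(heaviside(z), sigmoid(z), softmax(z), tanh(z), sep="\n")
```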

Notation Used

Consider a multi-layer perceptron with:
- d-dimensional input data;
- L hidden layers (L+1 layers including the output layer);
- N^{(k)} units in the k-th level;
- K-dimensional output.

x^{(k)} is the input to the k-th layer, and \tilde{x}^{(k)} is the extended input to the k-th layer,

  \tilde{x}^{(k)} = \begin{bmatrix} 1 \\ x^{(k)} \end{bmatrix}

[Figure: a single unit at level k, with inputs x_1, ..., x_{N^{(k-1)}}, weights w_{i1}, ..., w_{iN^{(k-1)}}, bias w_{i0}, summed to give z_i and passed through the activation function to give y_i.]

W^{(k)} is the weight matrix of the k-th layer. By definition this is an N^{(k)} \times N^{(k-1)} matrix.

Notation (cont)

\tilde{W}^{(k)} is the weight matrix including the bias weights of the k-th layer. By definition this is an N^{(k)} \times (N^{(k-1)} + 1) matrix,

  \tilde{W}^{(k)} = \begin{bmatrix} w_0^{(k)} & W^{(k)} \end{bmatrix}

z^{(k)} is the N^{(k)}-dimensional vector defined as z^{(k)} = \tilde{W}^{(k)} \tilde{x}^{(k)}.

y^{(k)} is the output from the k-th layer, so y_j^{(k)} = \phi(z_j^{(k)}).

All hidden-layer activation functions are assumed to be the same, \phi(). Initially assume that the output activation function is also \phi(). These matrix-notation feed-forward equations may then be used for an MLP with input x and output y(x):

  x^{(1)} = x
  x^{(k)} = y^{(k-1)}
  z^{(k)} = \tilde{W}^{(k)} \tilde{x}^{(k)}
  y^{(k)} = \phi(z^{(k)})
  y(x) = y^{(L+1)}

where 1 \leq k \leq L+1. The target values for training the network will be denoted t(x) for training example x.
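The feed-forward equations above translate directly into code. Below is a minimal sketch (an illustration; the layer sizes, random weights, and the choice of a sigmoid hidden activation with a softmax output are assumptions, not values from the handout).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def forward(x, weights, hidden_phi=sigmoid, output_phi=softmax):
    """Feed-forward pass: x^(1) = x; z^(k) = W~^(k) x~^(k); y^(k) = phi(z^(k)).

    `weights` is a list of L+1 extended matrices W~^(k), each of shape
    (N^(k), N^(k-1) + 1), with the bias weights in the first column.
    """
    y = x
    for k, W_tilde in enumerate(weights):
        x_tilde = np.concatenate(([1.0], y))   # extended input [1; x]
        z = W_tilde @ x_tilde                  # z^(k)
        phi = output_phi if k == len(weights) - 1 else hidden_phi
        y = phi(z)                             # y^(k)
    return y                                   # y(x) = y^(L+1)

# Example: d=3 inputs, one hidden layer of 4 units, K=2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3 + 1)), rng.normal(size=(2, 4 + 1))]
print(forward(rng.normal(size=3), weights))
```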

Training Criteria

A variety of training criteria may be used. Assume we have supervised training examples \{(x_1, t_1), \ldots, (x_n, t_n)\}. Some standard examples are:

Least squares error: one of the most common training criteria,

  E = \frac{1}{2}\sum_{p=1}^{n} \|y(x_p) - t_p\|^2 = \frac{1}{2}\sum_{p=1}^{n}\sum_{i=1}^{K} (y_i(x_p) - t_{pi})^2

This may be derived by considering the targets as being corrupted by zero-mean Gaussian-distributed noise.

Cross-entropy for two classes: consider the case where t is binary (and a softmax output is used). The expression is

  E = -\sum_{p=1}^{n} \left( t_p \log(y(x_p)) + (1 - t_p)\log(1 - y(x_p)) \right)

This expression goes to zero with the perfect mapping; compare logistic regression training.

Cross-entropy for multiple classes: the above expression becomes (again with a softmax output)

  E = -\sum_{p=1}^{n} \sum_{i=1}^{K} t_{pi} \log(y_i(x_p))

The minimum value is now non-zero: it represents the entropy of the target values.
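A minimal sketch of the least-squares and cross-entropy criteria as batch loss functions (an illustration, not from the handout; the small epsilon inside the logarithms is an added numerical safeguard).

```python
import numpy as np

def least_squares(Y, T):
    """E = 1/2 sum_p sum_i (y_i(x_p) - t_pi)^2, for outputs Y and targets T of shape (n, K)."""
    return 0.5 * np.sum((Y - T) ** 2)

def cross_entropy_binary(y, t, eps=1e-12):
    """E = -sum_p ( t_p log y_p + (1 - t_p) log(1 - y_p) ), for scalar outputs."""
    y = np.clip(y, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

def cross_entropy_multiclass(Y, T, eps=1e-12):
    """E = -sum_p sum_i t_pi log y_i(x_p), for (n, K) outputs and 1-of-K targets."""
    return -np.sum(T * np.log(np.clip(Y, eps, 1.0)))

# Example with n=2 training patterns and K=3 classes (illustrative values).
Y = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
T = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(least_squares(Y, T), cross_entropy_multiclass(Y, T))
```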

Posterior Probabilities

Consider the multi-class classification training problem with d-dimensional feature vectors x, K-dimensional network output y(x), and targets t. We want the output of the network, y(x), to approximate the posterior distribution of the set of K classes, so

  y_i(x) \approx P(\omega_i|x)

Consider training a network with 1-out-of-K coding, i.e.

  t_i = \begin{cases} 1 & \text{if } x \in \omega_i \\ 0 & \text{if } x \notin \omega_i \end{cases}

The network will act as a d-dimensional to K-dimensional mapping. Can we interpret the output of the network? A couple of assumptions will be made:
- the errors from each output are independent (not true of softmax);
- the error is a function of the magnitude of the difference between output and target,

  E = \sum_{i=1}^{K} f(|y_i(x) - t_i|)

Posterior Probabilities (cont)

In the limit of infinite training data,

  E\{E\} = \sum_{i=1}^{K} \int \sum_{t} f(|y_i(x) - t_i|) P(t|x)\, p(x)\, dx

As we are using the 1-out-of-K coding,

  P(t|x) = \prod_{i=1}^{K} \sum_{j=1}^{K} \delta(t_i - \delta_{ij}) P(\omega_j|x), \quad \text{where } \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & \text{otherwise} \end{cases}

This yields

  E\{E\} = \sum_{i=1}^{K} \int \left( f(1 - y_i(x)) P(\omega_i|x) + f(y_i(x))(1 - P(\omega_i|x)) \right) p(x)\, dx

Differentiating and equating to zero, for both the least-squares and cross-entropy criteria the solution is

  y_i(x) = P(\omega_i|x)

i.e. the network is a discriminative model. Some limitations exist for this to be valid:
- an infinite amount of training data, or knowledge of the correct joint distribution p(x, t);
- the topology of the network is complex enough that the final error is small;
- a good training algorithm for the network, i.e. one that finds the global optimum.
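As a worked check of this step for the least-squares case, where f(a) = a^2/2 (a sketch of the standard argument, not reproduced from the handout):

```latex
% Least-squares case: f(a) = a^2/2, so the integrand for output i is
%   g(y_i) = \tfrac{1}{2}(1 - y_i)^2 P(\omega_i|x) + \tfrac{1}{2} y_i^2 \bigl(1 - P(\omega_i|x)\bigr).
% Differentiating with respect to y_i(x) and setting the result to zero:
\frac{\partial g}{\partial y_i}
  = -(1 - y_i)\,P(\omega_i|x) + y_i\,\bigl(1 - P(\omega_i|x)\bigr)
  = y_i - P(\omega_i|x) = 0
\quad\Longrightarrow\quad
y_i(x) = P(\omega_i|x).
```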

Compensating for Different Priors

For generative models, Bayes' rule is used to obtain the posterior probability,

  P(\omega_j|x) = \frac{p(x|\omega_j) P(\omega_j)}{p(x)}

where the class priors, P(\omega_j), and the class-conditional densities, p(x|\omega_j), are estimated separately. For some tasks the two use different training data sets (for example, in speech recognition, the language model and the acoustic model). How can this difference in priors between the training and test conditions be built into the MLP framework, where the posterior probability is calculated directly? (This is applicable to all discriminative models.)

Again using Bayes' rule,

  p(x|\omega_j) \propto \frac{P(\omega_j|x)}{P(\omega_j)}

Thus if the posterior is divided by the training-data prior, a value proportional to the class-conditional probability is obtained. The standard form of Bayes' rule may now be used with the appropriate, different, prior.
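A minimal sketch of this prior compensation (an illustration; the prior and posterior values are made up). The network posteriors are divided by the training priors to give scaled likelihoods, which are then recombined with the test-condition priors.

```python
import numpy as np

def compensate_priors(posteriors, train_priors, test_priors):
    """Convert network posteriors (trained under train_priors) into
    posteriors appropriate for test_priors.

    posteriors / train_priors is proportional to p(x|omega_j); multiplying
    by the test priors and renormalising applies Bayes' rule again.
    """
    scaled_likelihoods = posteriors / train_priors   # proportional to p(x|omega_j)
    unnormalised = scaled_likelihoods * test_priors
    return unnormalised / np.sum(unnormalised)

# Illustrative 3-class example.
posteriors   = np.array([0.6, 0.3, 0.1])   # network output y(x)
train_priors = np.array([0.5, 0.3, 0.2])   # class frequencies in the training data
test_priors  = np.array([0.2, 0.3, 0.5])   # priors believed to hold at test time
print(compensate_priors(posteriors, train_priors, test_priors))
```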

Error Back Propagation

So far the training of MLPs has not been discussed. Originally, interest in MLPs was limited by difficulties in training them; these problems were partly addressed by the development of the error back-propagation algorithm.

The error back-propagation algorithm is based on gradient descent. Hence the activation function must be differentiable: we need to be able to compute the derivative of the error function with respect to the weights of all layers. All gradients are evaluated at the current model parameters.

[Figure: a network with inputs x_1, x_2, ..., x_d, a hidden layer, and an output layer producing the outputs y_1(x), y_2(x), ..., y_K(x).]

Output Layer or SLP

We want to minimise, for example, the squared error between the target output, t_p, and the current output value y(x_p). Assume the training criterion is least squares and the activation function is a sigmoid. The cost function may be written as

  E = \frac{1}{2}\sum_{p=1}^{n} (y(x_p) - t_p)'(y(x_p) - t_p) = \sum_{p=1}^{n} E^{(p)}

To simplify notation, we will consider only a single observation x with associated target values t and current network output y(x); the error for this single observation is denoted E.

How does the error change as y(x) changes?

  \frac{\partial E}{\partial y(x)} = y(x) - t

But we are not interested in y(x): how do we find the effect of varying the weights?
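Continuing this reasoning one step with the chain rule gives the output-layer weight gradient. The sketch below (an illustration, not the handout's derivation) uses a sigmoid output and the squared error for a single observation, where \sigma'(z) = \sigma(z)(1 - \sigma(z)); the sizes and random values are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_layer_gradient(x_tilde, W_tilde, t):
    """Gradient of E = 1/2 ||y - t||^2 with respect to the output-layer weights W~.

    y = sigmoid(W~ x~), so by the chain rule
      dE/dz_i = (y_i - t_i) * y_i * (1 - y_i)   and   dE/dW~_ij = dE/dz_i * x~_j.
    """
    z = W_tilde @ x_tilde
    y = sigmoid(z)
    delta = (y - t) * y * (1.0 - y)     # dE/dz, element-wise
    return np.outer(delta, x_tilde)     # dE/dW~, same shape as W_tilde

# Illustrative check against a finite-difference approximation.
rng = np.random.default_rng(0)
x_tilde = np.concatenate(([1.0], rng.normal(size=3)))   # extended input [1; x]
W_tilde = rng.normal(size=(2, 4))
t = np.array([1.0, 0.0])
grad = output_layer_gradient(x_tilde, W_tilde, t)

eps, (i, j) = 1e-6, (0, 2)
W_plus, W_minus = W_tilde.copy(), W_tilde.copy()
W_plus[i, j] += eps
W_minus[i, j] -= eps
E = lambda W: 0.5 * np.sum((sigmoid(W @ x_tilde) - t) ** 2)
print(grad[i, j], (E(W_plus) - E(W_minus)) / (2 * eps))  # should agree closely
```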