Neural Networks and the Back-propagation Algorithm


Francisco S. Melo

In these notes, we provide a brief overview of the main concepts concerning neural networks and the back-propagation algorithm. We closely follow the presentation in [1], and refer to [1, 2, 3] for further details.

Throughout these notes, random variables are represented with upper-case letters, such as X or Z, and samples of a random variable with the corresponding lower-case letter, such as x or z. When random variables are vector-valued, we use subscripts to indicate specific components, as in X_k or Z_k. The corresponding samples are represented using bold-face letters, such as x or z, with individual components x_k and z_k, respectively. When considering an indexed family of vector-valued data-points, we use indexed bold-face symbols to denote the elements in the family, as in x_n or z_n.

1 The Perceptron

Artificial neural networks (ANNs) arose as an attempt to model mathematically the process by which information is handled by the brain. Learning methods based on neural networks are general and relatively simple to implement, making them a widely used class of methods when complex real-world data must be interpreted; examples include the recognition of handwritten digits, spoken words or faces.

ANNs correspond to networks of densely connected nodes, known as neurons, each of which is a small processing unit. The simplest model of such

a network consists of a single unit, known as the perceptron, represented in the diagram of Fig. 1.

Figure 1: Representation of the perceptron, with inputs x_0, x_1, ..., x_p, weights w_0, w_1, ..., w_p, activation a, threshold function, and output ŷ.

The perceptron takes as input a vector x = [x_1, ..., x_p] of p real-valued inputs, from which it computes the activation a, a linear combination of these inputs:

    a = w_0 + \sum_{i=1}^{p} w_i x_i = w^\top x.

Note that we included one additional weight, w_0, that is independent of the input and is known as the bias. To provide a uniform treatment of the weights in the perceptron, it is customary to consider one additional input, x_0, that is constant and equal to 1, i.e., x_0 ≡ 1. This fictitious input is included in the representation of Fig. 1.

The output of the perceptron, ŷ, is computed as the image of the activation a under a threshold function σ:

    ŷ(x) = \sigma(a) = \begin{cases} 1 & \text{if } a > 0 \\ -1 & \text{otherwise.} \end{cases}

The perceptron can then be used for binary classification tasks, where the inputs x for which ŷ(x) = 1 correspond to the positive instances and those for which ŷ(x) = −1 correspond to the negative instances. Geometrically, the data-points classified by the perceptron as belonging to the positive class are those whose inner product with the weight vector w is positive (see Fig. 2).

Figure 2: Decision boundary for the perceptron, given the weight vector w: the half-plane where w^\top x > 0 corresponds to the positive class, and the complementary half-plane to the negative class.

1.1 Perceptron Learning Rule

To determine the process by which the perceptron is trained, it is necessary to define an error function with respect to which the performance of the

perceptron is to be measured (remember that this is one of the fundamental elements necessary to define a learning task). While the number of misclassified data-points is a natural candidate for an error function, it is not amenable to easy analytical treatment. Instead, we introduce the so-called perceptron criterion. Note, first of all, that a data-point x_n in class y_n (with y_n ∈ {−1, 1}) is properly classified by the perceptron if w^\top x_n y_n > 0. Given a training data-set D, let M denote the set of misclassified data-points. The perceptron criterion seeks to minimize the error

    E(w) = -\sum_{n \in M} w^\top x_n y_n.

To minimize the error E, we adopt a general gradient descent approach, whereby the minimum of a general real-valued function F(z) is gradually approximated by the sequence {z^{(1)}, z^{(2)}, ...} defined recursively by

    z^{(\tau+1)} = z^{(\tau)} - \eta \nabla_z F(z^{(\tau)}),

where η is a positive step-size. Specifically, in the case of the perceptron, the weight vector w is adjusted as

    w ← w - \eta \nabla_w E(w) = w + \eta \sum_{n \in M} x_n y_n.    (1)

In practice, two modifications are generally made to the learning rule in (1). The first is to consider incremental updates, where the weight vector is updated one data-point at a time. The second arises from noting that the output of the perceptron remains unchanged if w is multiplied by a positive constant, which allows us to fix the step-size at η = 1. The training process for the perceptron can thus be summarized as follows. Given the data-set D = {(x_n, y_n), n = 1, ..., N},

1. For each pair (x_n, y_n) ∈ D, if w^\top x_n y_n > 0, move to the next pair.

2. Otherwise, adjust w according to the learning rule

    w ← w + x_n y_n.    (2)

While the training rule for perceptrons is straightforward to implement, perceptrons are restricted to linear decision boundaries, which means that they are unable to learn classifiers for data that is not linearly separable.
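The two-step rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the notes: the toy data-set and the epoch cap are our own choices, and a constant input x_0 = 1 is prepended to each data-point so that w_0 plays the role of the bias.

```python
def train_perceptron(data, max_epochs=100):
    """Perceptron learning rule: on each mistake, w <- w + x_n * y_n.

    `data` is a list of (x, y) pairs with labels y in {-1, +1}; a
    constant input x[0] = 1 is assumed to have been prepended to
    each x, so that w[0] plays the role of the bias w_0.
    """
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            activation = sum(wi * xi for wi, xi in zip(w, x))
            if activation * y <= 0:        # misclassified (or on the boundary)
                w = [wi + xi * y for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                  # converged: data linearly separable
            return w
    return w

# Toy linearly separable data-set (hypothetical values; bias input prepended).
data = [([1.0, 0.0, 0.0], -1), ([1.0, 1.0, 1.0], 1),
        ([1.0, 0.9, 0.9], 1), ([1.0, 0.2, 0.1], -1)]
w = train_perceptron(data)
```

On linearly separable data such as this, the loop terminates with every training pair satisfying w^\top x_n y_n > 0; on non-separable data it simply stops after max_epochs, illustrating the limitation noted above.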
2 Multilayer Perceptron

A multilayer perceptron (MLP), also known as a feed-forward neural network or multilayer feedforward network, is a network of densely connected units

similar to the perceptron discussed in Section 1, but having a non-linear threshold function. The units in an MLP are arranged in layers: units connected directly to the inputs of the network constitute the input layer, while those whose output corresponds to the output of the network constitute the output layer. All other, intermediate layers are referred to as hidden layers. An example of a multilayer perceptron is depicted in Fig. 3.

Figure 3: Example of a multilayer perceptron with two input units, four hidden units and one output unit.

Multilayer perceptrons are able to represent a much richer set of functions than those representable using a single perceptron. In fact, the universal approximation theorem states that a multilayer perceptron with a single hidden layer containing a finite number of hidden neurons, and with a suitable activation function (e.g., a sigmoid), can approximate with arbitrarily small error any continuous function defined over any compact subset of R^p.

Each unit in an MLP is similar to the perceptron surveyed in Section 1 and depicted in Fig. 1. However, for purposes of training, it is convenient that the output(s) of the network be differentiable functions of the inputs, for which reason the neurons in an MLP are usually defined with differentiable threshold functions. A common threshold function is the logistic sigmoid function

    \sigma(a) = \frac{1}{1 + \exp(-a)},

depicted in Fig. 4.

Figure 4: The sigmoid threshold function σ(a) = 1/(1 + exp(−a)).
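As a quick numerical aside (not part of the original notes), the logistic sigmoid maps any activation into (0, 1), takes the value 1/2 at the decision threshold a = 0, and satisfies the identity σ'(a) = σ(a)(1 − σ(a)), which is what makes its derivative so convenient during training:

```python
import math

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a, h=1e-6):
    """Numerical derivative of sigma at a (central difference)."""
    return (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)

# sigma(0) = 0.5 at the decision threshold; large |a| saturates near 0 or 1.
print(sigmoid(0.0))   # 0.5

# The identity sigma'(a) = sigma(a) * (1 - sigma(a)) holds numerically.
for a in (-2.0, 0.0, 1.5):
    s = sigmoid(a)
    assert abs(sigmoid_prime(a) - s * (1 - s)) < 1e-6
```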

Figure 5: Artificial neural network with one hidden layer: input units i_1 and i_2, hidden units j_1 and j_2, and output unit k, with weights w_{ij} on the connections between units.

The output of the network can be computed by propagating the input information through the network, in a process known as forward propagation. To illustrate this process, consider the ANN model depicted in detail in Fig. 5. We denote by w_i the vector of weights associated with unit i and by w_{ij} the weight associated with the connection between the output of unit i and unit j. Given the input vector x to the network, the output of input units i_1 and i_2 is given by

    z_{i_1} = \sigma(w_{i_1}^\top x),    z_{i_2} = \sigma(w_{i_2}^\top x),

which corresponds to the forward propagation of the input x through the first layer in the network. The two outputs z_{i_1} and z_{i_2} now act as inputs for the second layer in the network. Letting z_i = [z_{i_1}, z_{i_2}], it follows that

    z_{j_1} = \sigma(w_{j_1}^\top z_i),    z_{j_2} = \sigma(w_{j_2}^\top z_i),

which corresponds to the propagation of z_i through the second layer in the network. Finally, the output of the network is given by

    \hat{y}(x) = z_k = \sigma(w_k^\top z_j),

where, as before, we defined z_j = [z_{j_1}, z_{j_2}].

2.1 The Back-propagation Algorithm

As before, to determine the process by which an MLP can be trained, it is necessary to define an error function with respect to which the performance of the MLP is to be measured. Given a data-set D = {(x_n, y_n), n = 1, ..., N}, we adopt the error function

    E(w) = \frac{1}{2} \sum_{n=1}^{N} (\hat{y}(x_n) - y_n)^2.

We have, for simplicity, considered the case where there is one single output ŷ to the network, but the reasoning can be trivially replicated to accommodate vector outputs.
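The forward-propagation computation just described can be sketched as follows. The layer structure mirrors Fig. 5, while the numeric weights are made-up illustrative values (the notes do not specify any); a bias input fixed at 1 is prepended at every layer, matching the fictitious input x_0 ≡ 1.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, layers):
    """Forward-propagate the input x through the network.

    `layers` is a list of layers; each layer is a list of weight
    vectors, one per unit, whose first component multiplies a
    fictitious bias input fixed at 1 (matching x_0 = 1 in the text).
    Returns the list of unit outputs z for every layer; the last
    entry holds the network output yhat.
    """
    outputs, z = [], list(x)
    for layer in layers:
        z = [1.0] + z                                              # bias input
        a = [sum(wi * zi for wi, zi in zip(w, z)) for w in layer]  # activations
        z = [sigmoid(ai) for ai in a]                              # unit outputs
        outputs.append(z)
    return outputs

# Layer structure of Fig. 5 (units i_1, i_2 -> j_1, j_2 -> k); the numeric
# weights are purely illustrative.
layers = [
    [[0.1, 0.4, -0.2], [-0.3, 0.2, 0.5]],   # weights of units i_1, i_2
    [[0.2, -0.5, 0.3], [0.1, 0.6, -0.4]],   # weights of units j_1, j_2
    [[-0.1, 0.7, 0.2]],                      # weights of output unit k
]
y_hat = forward([0.5, -1.0], layers)[-1][0]
```

Storing the per-layer outputs (rather than just ŷ) is deliberate: the back-propagation algorithm derived next reuses exactly these intermediate quantities.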

Figure 6: Unit j in the network, receiving inputs z_i, ..., z_l through weights w_{ij}, ..., w_{lj}, and feeding subsequent units k through weights w_{jk}.

To minimize the error E, we again adopt a gradient descent approach, where the weights w of the network are adjusted according to the rule

    w ← w - \eta \nabla_w E(w),

where ∇_w E(w) denotes the gradient of the error function with respect to the weights in the network. The back-propagation algorithm provides a simple and efficient way of propagating the error information backwards through the network, allowing for successive updates of the weights from the output to the input.

To derive the back-propagation learning rule, we start by writing the error function as

    E(w) = \sum_{n=1}^{N} E_n(w),    with    E_n(w) = \frac{1}{2} (\hat{y}(x_n) - y_n)^2.

As with the perceptron, the updates to the weights can be done incrementally, using one data-point at a time, with the update rule

    w ← w - \eta \nabla_w E_n(w).

It remains to determine the gradient ∇_w E_n(w). Let us then focus on one particular unit in the network, say, unit j, and determine the components of ∇_w E_n(w) comprising the derivatives of E_n(w) with respect to the weights w_{ij} entering unit j (see Fig. 6). By the chain rule, the derivative with respect to w_{ij} can be written as

    \frac{\partial E_n}{\partial w_{ij}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ij}}.

To simplify the notation, we henceforth write

    \delta_j = \frac{\partial E_n}{\partial a_j}.

Moreover, it follows from the definition of the activation that

    \frac{\partial a_j}{\partial w_{ij}} = z_i.

Combining the two, we get

    \frac{\partial E_n}{\partial w_{ij}} = \delta_j z_i.    (3)

Let us now compute the term δ_j. If j is the output unit, then

    E_n = \frac{1}{2} (\sigma(a_j) - y_n)^2,

and we immediately get

    \delta_j = \sigma'(a_j) (\sigma(a_j) - y_n),    (4)

which, in the case of the logistic sigmoid function, yields

    \delta_j = \sigma(a_j) (1 - \sigma(a_j)) (\sigma(a_j) - y_n).

On the other hand, if j is not an output unit, E_n depends on a_j through all units k to which unit j is connected (see Fig. 6). In other words,

    \delta_j = \frac{\partial E_n}{\partial a_j} = \sum_{k=1}^{K} \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j} = \sum_{k=1}^{K} \delta_k \frac{\partial a_k}{\partial a_j}.

Finally, we have that

    \frac{\partial a_k}{\partial a_j} = w_{jk} \sigma'(a_j),

and thus

    \delta_j = \sigma'(a_j) \sum_{k=1}^{K} w_{jk} \delta_k.    (5)

Note that, as evidenced in (5), the derivative δ_j for unit j can be computed by propagating the derivatives δ_k of the subsequent nodes backwards through the network. In conclusion, the back-propagation algorithm can be summarized as follows. Given the data-set D = {(x_n, y_n), n = 1, ..., N},

1. For each pair (x_n, y_n) ∈ D, forward propagate the input x_n through the network to compute ŷ(x_n). In this process, compute the activations a_j for all hidden and output units.

2. Evaluate δ_j for the output units using (4).

3. Back-propagate the δ's using (5), determining δ_j for all hidden units in the network.

4. For all nodes in the network, compute the derivatives ∂E_n/∂w_{ij} using (3).

5. Update each weight w_{ij} using the rule

    w_{ij} ← w_{ij} - \eta \frac{\partial E_n}{\partial w_{ij}}.    (6)

References

[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer Science, 2006.

[2] Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, 2nd edition, 1998.

[3] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
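Finally, steps 1–5 of the back-propagation algorithm can be put together in a short end-to-end sketch for a sigmoid network with one hidden layer. The layer sizes, random initial weights and step-size η = 0.5 are illustrative assumptions, not values taken from the notes; repeatedly applying the incremental update on a single pair (x_n, y_n) drives the network output toward the target.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def backprop_update(x, y, layers, eta=0.5):
    """One incremental update w <- w - eta * dE_n/dw for all weights.

    `layers` is a list of layers, each a list of per-unit weight
    vectors whose first component is the bias weight. Returns the
    network output computed during the forward pass (step 1).
    """
    # Step 1: forward pass, storing layer inputs z and activations a.
    inputs, activations, z = [], [], list(x)
    for layer in layers:
        z = [1.0] + z
        inputs.append(z)
        a = [sum(wi * zi for wi, zi in zip(w, z)) for w in layer]
        activations.append(a)
        z = [sigmoid(ai) for ai in a]
    y_hat = z[0]

    # Step 2: delta for the single output unit, eq. (4) with the sigmoid.
    s = sigmoid(activations[-1][0])
    deltas = [[s * (1 - s) * (s - y)]]

    # Step 3: back-propagate the deltas through the hidden layers, eq. (5).
    for l in range(len(layers) - 2, -1, -1):
        layer_deltas = []
        for j, a_j in enumerate(activations[l]):
            s = sigmoid(a_j)
            # Index j + 1 skips the bias weight of the downstream unit k.
            downstream = sum(layers[l + 1][k][j + 1] * deltas[0][k]
                             for k in range(len(layers[l + 1])))
            layer_deltas.append(s * (1 - s) * downstream)
        deltas.insert(0, layer_deltas)

    # Steps 4-5: dE_n/dw_ij = delta_j * z_i, eq. (3); update rule, eq. (6).
    for l, layer in enumerate(layers):
        for j, w in enumerate(layer):
            for i in range(len(w)):
                w[i] -= eta * deltas[l][j] * inputs[l][i]
    return y_hat

# A 2-input network with two hidden units and one output, randomly
# initialized; train it to push the output toward y = 1 on a fixed input.
random.seed(0)
layers = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)],
          [[random.uniform(-1, 1) for _ in range(3)]]]
before = backprop_update([0.5, -1.0], 1.0, layers)
for _ in range(200):
    after = backprop_update([0.5, -1.0], 1.0, layers)
```

Because the sigmoid output is confined to (0, 1) and each step follows the negative gradient of E_n, the output after repeated updates lies strictly closer to the target than the initial output.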