Artificial Neural Networks MGS 2018 - Lecture 2
OVERVIEW
- Biological Neural Networks
- Cell Topology: Input, Output, and Hidden Layers
- Functional description
- Cost functions
- Training ANNs
- Back-Propagation
ARTIFICIAL NEURAL NETS
- The feed-forward neural network / multilayer perceptron is one of many kinds of ANN
- Fixed number of basis functions
- Each basis function is adaptive (i.e. tuned by a parameter)
- We focus on the multilayer perceptron
- Really multiple layers of logistic regression models
- Continuous nonlinearities (the perceptron's nonlinearity is not continuous)
- Neural nets are good, but the likelihood function is not a convex function of the model parameters
ORIGINS
- Originally developed as algorithms that mimic the brain (bio-inspired)
- Developed around the 1960s (Perceptron model, Frank Rosenblatt)
- Killed off for some time by Marvin Minsky, who led a personal vendetta to divert funding to AI research (book: Perceptrons)
- Resurgence in the 1980s and 1990s
- Recent resurgence due to increased computational power and the advent of Deep Learning
BIO-INSPIRATION
[Figure: biological neuron: input wires, signal processing, output wire]
LOGISTIC REGRESSION
Linear combinations of fixed nonlinear basis functions:
$$y(\mathbf{x}, \mathbf{w}) = f\left( \sum_{j=1}^{M} w_j \, \phi_j(\mathbf{x}) \right)$$
- Non-linear activation function $f$ for classification
- Identity for regression
ANNs replace these by parameterised basis functions:
$$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}$$
The superscript $(1)$ indicates layer 1 (the input layer).
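A minimal numpy sketch of this first-layer computation; the sizes D = 3, M = 4 and the random weights are illustrative assumptions, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D, M = 3, 4                      # input dimensionality, number of hidden units
x = rng.normal(size=D)           # a single input vector x_1..x_D
W1 = rng.normal(size=(M, D))     # weights w_ji^(1); row j holds the weights into hidden unit j
b1 = rng.normal(size=M)          # biases w_j0^(1)

# a_j = sum_i w_ji^(1) x_i + w_j0^(1)  for every hidden unit j
a = W1 @ x + b1
print(a)
```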
ACTIVATION AND OUTPUT
[Figure: single unit with activation $a$ passed through $f(\cdot)$ to give output $y$]
- $a$ = (net) activation
- $y = f(a)$ = output
ACTIVATION AND OUTPUT
[Figure: unit $j$ receiving inputs $x_1 \ldots x_k$ through weights $w_{j1} \ldots w_{jk}$]
$$a = \sum_{i=1}^{k} w_{ji} x_i, \qquad y = f(a)$$
The output $y$ in turn serves as input to units in the next layer.
PERCEPTRON
A perceptron maps a real-valued input $\mathbf{x}$ to a binary output:
$$f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases}$$
We focus on logistic regression functions rather than perceptrons; perceptrons have mostly historical value.
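A short sketch of the perceptron decision rule above; the weight vector and bias are arbitrary illustrative values:

```python
import numpy as np

def perceptron(x, w, b):
    """Binary threshold unit: 1 if w.x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative parameters: this perceptron fires when x_1 + x_2 exceeds 1.
w = np.array([1.0, 1.0])
b = -1.0
print(perceptron(np.array([0.3, 0.4]), w, b))  # 0, since 0.7 - 1 < 0
print(perceptron(np.array([0.8, 0.9]), w, b))  # 1, since 1.7 - 1 > 0
```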
SIMPLE NETWORK
The simplest ANNs consist of:
- a layer of D input nodes
- a layer of hidden nodes
- a layer of output nodes
- fully connected between layers
INPUT LAYER
One input node for every feature/dimension. The outputs of the input layer are combined linearly to form the inputs to the hidden units:
$$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}$$
[Figure: network with inputs $x_0, x_1, \ldots, x_D$, hidden units $z_0, z_1, \ldots, z_M$ (weights $w^{(1)}$), and outputs $y_1, \ldots, y_K$ (weights $w^{(2)}$)]
CURSE OF DIMENSIONALITY
Fundamental machine learning concept: the Curse of Dimensionality. When D becomes large, learning problems can become very difficult. For example:
- when dividing a space in $\mathbb{R}^D$ into regular cells, the number of cells grows exponentially with D (see the sketch below)
- in linear regression, a polynomial model of order M has on the order of $D^M$ coefficients
- a sphere in high dimension has most of its volume in an infinitesimally thin slice near the surface
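A quick numeric illustration of the first point, assuming k = 10 divisions per dimension (an arbitrary choice):

```python
# Number of cells when each of D dimensions is divided into k regular intervals: k**D
k = 10
for D in (1, 2, 3, 10, 100):
    print(f"D = {D:3d}: {k**D:.3g} cells")   # grows exponentially with D
```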
CURSE OF DIM (EXAMPLE)
[Figures: regular sub-division of feature space; growth of the number of coefficients of a general polynomial]
HIDDEN LAYER
Hidden layer(s) can:
- have an arbitrary number of nodes/units
- have an arbitrary number of links from input nodes and to output nodes (or to the next hidden layer)
- be stacked: there can be multiple hidden layers
The default is a fully interconnected graph, i.e. every input node is linked to every hidden node, and every hidden node to every output node.
HIDDEN UNIT ACTIVATION
$$z_j = h(a_j)$$
Common choices for $h(\cdot)$ are the logistic sigmoid or tanh:
$$h(a) = \frac{1}{1 + e^{-a}}, \qquad h(a) = \tanh(a)$$
RELU
The rectified linear unit $f(x) = \max(0, x)$, smoothly approximated by the softplus:
$$\sum_{i=1}^{\infty} \sigma(x - i + 0.5) \approx \log(1 + e^{x})$$
New trend, responsible for a great deal of Deep Learning's success:
- no vanishing-gradient problem
- can model any positive real value
- can stimulate sparseness
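A sketch of the hidden-unit activations discussed on these slides (logistic sigmoid, tanh, ReLU) together with the softplus from the formula above; the check of the sum-of-sigmoids approximation truncates the infinite sum at 99 terms:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(0.0, a)

def softplus(a):
    # Smooth approximation to ReLU: log(1 + e^a)
    return np.log1p(np.exp(a))

a = np.linspace(-3, 3, 7)
print(sigmoid(a))
print(np.tanh(a))
print(relu(a))
print(softplus(a))

# The sum of shifted sigmoids approximates the softplus:
approx = sum(sigmoid(a - i + 0.5) for i in range(1, 100))
print(np.max(np.abs(approx - softplus(a))))   # small, on the order of 1e-2
```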
OUTPUT LAYER
The output layer can be:
- a single node for binary classification
- a single node for regression
- n nodes for multi-class classification
One network can also cover multiple output variables, thus increasing the number of nodes.
OUTPUT UNIT ACTIVATION
The output unit activation transformation depends on the output type:
- Regression: $y_k = a_k$ (identity)
- Binary classification: $y_k = \sigma(a_k)$, with $\sigma(a) = \frac{1}{1 + \exp(-a)}$
- Multi-class classification: $y_k = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$ (softmax)
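A minimal sketch of the three output transformations, assuming `a` holds the output-unit activations for one pattern:

```python
import numpy as np

def identity(a):
    # Regression outputs
    return a

def sigmoid(a):
    # Binary classification output
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Multi-class classification outputs
    e = np.exp(a - np.max(a))   # shift for numerical stability
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
print(softmax(a))               # non-negative and sums to 1
```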
OVERALL NETWORK FUNCTION
The network can be represented as a single function of the input variables and weights:
$$y_k(\mathbf{x}, \mathbf{w}) = \sum_{j=1}^{M} w_{kj}^{(2)} \, h\!\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)}$$
[Figure: network with inputs $x_0, x_1, \ldots, x_D$, hidden units $z_0, z_1, \ldots, z_M$, and outputs $y_1, \ldots, y_K$]
SHORTER NETWORK FUNCTION
Biases can be incorporated as unity-valued units ($x_0 = z_0 = 1$):
$$y_k(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M} w_{kj}^{(2)} \, h\!\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right)$$
[Figure: same network with bias units $x_0$ and $z_0$ fixed to 1]
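A minimal sketch of the complete network function, with the bias absorbed as constant units $x_0 = z_0 = 1$ as above; the layer sizes, random weights, tanh hidden units, and identity outputs are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 3, 4, 2                      # inputs, hidden units, outputs

W1 = rng.normal(size=(M, D + 1))       # w_ji^(1); column 0 holds the biases
W2 = rng.normal(size=(K, M + 1))       # w_kj^(2); column 0 holds the biases

def forward(x):
    x = np.concatenate(([1.0], x))     # prepend the bias unit x_0 = 1
    z = np.tanh(W1 @ x)                # hidden activations z_j = h(a_j)
    z = np.concatenate(([1.0], z))     # prepend the bias unit z_0 = 1
    return W2 @ z                      # outputs y_k (identity output units)

print(forward(rng.normal(size=D)))
```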
TERMINOLOGY
Networks can easily be generalised to an arbitrary number of layers. This gives rise to confusion in naming conventions. The network shown is either a:
- 3-layer network (counting the number of layers of units; our definition)
- 1-hidden-layer network
- 2-layer network (counting the number of layers of adaptive weights)
NETWORK TOPOLOGY
Variations include:
- arbitrary number of layers
- fewer hidden units than input units (causes, in effect, dimensionality reduction, equivalent to PCA)
- skip-layer connections (see below)
- fully/sparsely interconnected networks
A large number of possible weight assignments lead to identical functionality: a factor of $M! \, 2^M$ per hidden layer (see the sketch below).
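The M!·2^M weight-space symmetry can be checked numerically: with tanh hidden units, flipping the sign of all weights into and out of a hidden unit, or permuting the hidden units, leaves the network function unchanged. A small sketch with illustrative random weights:

```python
import numpy as np

rng = np.random.default_rng(2)
D, M, K = 3, 4, 2
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)

def forward(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

x = rng.normal(size=D)
y = forward(x, W1, b1, W2, b2)

# Sign flip on hidden unit 0 (uses tanh(-a) = -tanh(a)): 2^M such symmetries.
W1f, b1f, W2f = W1.copy(), b1.copy(), W2.copy()
W1f[0] *= -1; b1f[0] *= -1; W2f[:, 0] *= -1
print(np.allclose(y, forward(x, W1f, b1f, W2f, b2)))                     # True

# Permutation of the hidden units: M! such symmetries.
perm = rng.permutation(M)
print(np.allclose(y, forward(x, W1[perm], b1[perm], W2[:, perm], b2)))   # True
```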
EXPRESSIVE POWER
TRAINING A NETWORK
[Figure: three-layer network with input $x_1 \ldots x_d$, hidden units $y_1 \ldots y_{n_H}$ (weights $w_{ji}$), output units $z_1 \ldots z_c$ (weights $w_{kj}$), and target values $t_1 \ldots t_c$]
ERROR FUNCTIONS
In order to optimise the performance of ANNs, an error function on the training set must be minimised. This is done by adjusting:
- the weights connecting nodes and the parameters of the non-linear functions $h(a)$: intrinsic parameters (optimised during training)
- the network architecture: hyper-parameters (optimised by measuring and comparing generalisation error on validation data)
BACKPROPAGATION
- Used to calculate derivatives of the error function efficiently
- Errors propagate backwards layer by layer
W-DEPENDENT ERROR FUNCTION
[Figure: error surface $E(\mathbf{w})$ over weight space $(w_1, w_2)$ with stationary points $w_A$, $w_B$, $w_C$]
MOVING IN ERROR SPACE
Making a small step in weight space, $\mathbf{w} \to \mathbf{w} + \delta\mathbf{w}$, results in a change in error:
$$\delta E \simeq \delta\mathbf{w}^{\mathrm{T}} \nabla E(\mathbf{w})$$
$\nabla E(\mathbf{w})$ points in the direction of greatest change.
Stop condition: $\nabla E(\mathbf{w}) = 0$
[Figure: error surface $E(\mathbf{w})$ over $(w_1, w_2)$ with points $w_A$, $w_B$, $w_C$]
ERROR FUNCTIONS
Regression:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \| y(\mathbf{x}_n, \mathbf{w}) - \mathbf{t}_n \|^2$$
Binary classification (cross-entropy error):
$$E(\mathbf{w}) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$
Multiple independent binary classifications:
$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} \{ t_{nk} \ln y_{nk} + (1 - t_{nk}) \ln(1 - y_{nk}) \}$$
ERROR FUNCTIONS
Multi-class classification (mutually exclusive classes):
$$E(\mathbf{w}) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \ln y_k(\mathbf{x}_n, \mathbf{w})$$
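Minimal numpy sketches of these error functions, assuming `y` holds network outputs and `t` the targets for N patterns; a real implementation would also guard against log(0):

```python
import numpy as np

def sum_of_squares(y, t):
    """Regression: 0.5 * sum_n ||y_n - t_n||^2."""
    return 0.5 * np.sum((y - t) ** 2)

def binary_cross_entropy(y, t):
    """Binary (or multiple independent binary) classification."""
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def multiclass_cross_entropy(y, t):
    """Mutually exclusive classes; t is one-hot, y is a softmax output."""
    return -np.sum(t * np.log(y))

# Illustrative values: 2 patterns, 3 classes.
t = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
y = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(multiclass_cross_entropy(y, t))
```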
NO ANALYTICAL HOPE
- The error function has a highly nonlinear dependence on the weights
- Many points in weight space where the gradient vanishes
- Many inequivalent stationary points (local minima)
- No hope for an analytical solution
- Use iterative numerical procedures (e.g. backprop):
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \Delta\mathbf{w}^{(\tau)}$$
GRADIENT DESCENT
Repeat until convergence:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \ldots, \theta_k) \qquad \text{for } j = 0, \ldots, k$$
The multi-dimensional case required here:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \, \nabla E(\mathbf{w}^{(\tau)})$$
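A sketch of the plain gradient-descent loop on a toy quadratic error, chosen only so the gradient is easy to write down; the learning rate and iteration count are illustrative:

```python
import numpy as np

w_star = np.array([3.0, -1.0])           # minimiser of the toy error function

def grad_E(w):
    # Gradient of E(w) = ||w - w_star||^2
    return 2.0 * (w - w_star)

w = np.zeros(2)
eta = 0.1                                # learning rate
for _ in range(200):                     # "repeat until convergence", here a fixed budget
    w = w - eta * grad_E(w)
print(w)                                 # close to w_star
```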
GRADIENT DESCENT VARIANTS
Gradient descent is a poor algorithm by itself. Better variants exist:
- conjugate gradients
- quasi-Newton methods
- stochastic gradient descent
- ballistic methods
New methods come out regularly due to their importance to the field.
References:
- Fletcher, Practical Methods of Optimization (2nd ed.), Wiley, 1987
- Gill, Murray, and Wright, Practical Optimization, Academic Press, 1981
- Nocedal and Wright, Numerical Optimization, Springer, 1999
BACKPROPAGATION
- Used to calculate derivatives of the error function efficiently
- Errors propagate backwards layer by layer
Iterative minimisation of the error function:
1. Calculate the derivatives of the error function w.r.t. the weights
2. Use the derivatives to adjust the weights
"Backpropagation" refers to the calculation of the derivatives.
GENERAL FORMULATION
Backprop applies to:
- arbitrary feed-forward topologies
- differentiable nonlinear activation functions
- a broad class of error functions
General error function formulation:
$$E(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} E_n(\mathbf{w})$$
SIMPLE LINEAR CASE
[Figure: linear output unit with inputs $x_1 \ldots x_m$ and weights $w_{ji}$, computing $y_k = \sum_i w_{ji} x_i$]
For a single input pattern $\mathbf{x}_n$:
$$E_n = \frac{1}{2} \sum_{k} (y_{nk} - t_{nk})^2 \qquad \text{(sum over all output nodes)}$$
TOPOLOGY & VARIABLES
Note the index order in $w_{ji}$: from unit $i$ to unit $j$.
[Figure: network with inputs $x_0, x_1, \ldots, x_D$, hidden units $z_0, z_1, \ldots, z_M$ (weights $w^{(1)}$), and outputs $y_1, \ldots, y_K$ (weights $w^{(2)}$)]
Error for a single training pattern and a single output:
$$E_{nk} = \frac{1}{2} (y_{nk} - t_{nk})^2$$
Gradient with respect to a single weight $w_{ji}$:
$$\frac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj}) \, x_{ni}$$
A local computation for a single weight, involving the product of an error signal and an input variable.
FORWARD PROPAGATION
$$a_j = \sum_i w_{ji} z_i, \qquad z_j = h(a_j)$$
[Figure: unit $j$ with inputs $z_i$ through weights $w_{ji}$, activation $a_j$, and output $z_j = h(a_j)$ feeding units $k$ in the next layer]
Each pattern in the training set results in particular values of $z_i$, $a_j$, and $z_j$, calculated through forward propagation through the network.
PRODUCT RULE
$E_n$'s dependency on $w_{ji}$ is only through the summed input $a_j$, and therefore the chain rule can be applied:
$$\frac{\partial E_n}{\partial w_{ji}} = \frac{\partial E_n}{\partial a_j} \frac{\partial a_j}{\partial w_{ji}}$$
With $\delta_j \equiv \frac{\partial E_n}{\partial a_j}$ (called the errors) and $\frac{\partial a_j}{\partial w_{ji}} = z_i$ (the cell outputs of the previous layer):
$$\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$$
NODE DELTAS
[Figure: activations $z_i$, $z_j$ flow forward; deltas flow backward, from $\delta_k$ through weights $w_{kj}$ to $\delta_j$]
Output nodes:
$$\delta_k = y_k - t_k$$
Hidden nodes:
$$\delta_j \equiv \frac{\partial E_n}{\partial a_j} = \sum_k \frac{\partial E_n}{\partial a_k} \frac{\partial a_k}{\partial a_j} \quad \Rightarrow \quad \delta_j = h'(a_j) \sum_k w_{kj} \, \delta_k \qquad \text{(the backpropagation formula)}$$
ERROR BACKPROPAGATION
1. Apply an input vector to the network and propagate forward
2. Evaluate $\delta_k$ for all output units
3. Backpropagate the $\delta$'s to obtain $\delta_j$ for all hidden units
4. Evaluate the error derivatives as $\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$ ($\delta_j$ is the result of backprop, $z_i$ the result of the forward activation of the network)
Weight update: $\Delta w_{kj} = -\eta \, \delta_k z_j$
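A compact sketch of these four steps for the two-layer network above, assuming tanh hidden units, identity output units, sum-of-squares error, a single training pattern, and illustrative random initial weights; the derivatives are then used for a gradient-descent weight update:

```python
import numpy as np

rng = np.random.default_rng(3)
D, M, K = 3, 4, 2
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)

def backprop_step(x, t, eta=0.1):
    global W1, b1, W2, b2
    # 1. Forward propagation
    a = W1 @ x + b1
    z = np.tanh(a)
    y = W2 @ z + b2                        # identity output units
    # 2. Output deltas: delta_k = y_k - t_k
    delta_k = y - t
    # 3. Backpropagate: delta_j = h'(a_j) * sum_k w_kj delta_k, with h'(a) = 1 - tanh(a)^2
    delta_j = (1.0 - z ** 2) * (W2.T @ delta_k)
    # 4. Error derivatives: dE/dw = delta * (input feeding that weight)
    dW2, db2 = np.outer(delta_k, z), delta_k
    dW1, db1 = np.outer(delta_j, x), delta_j
    # Gradient-descent weight update
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1
    return 0.5 * np.sum((y - t) ** 2)      # sum-of-squares error for this pattern

x, t = rng.normal(size=D), np.array([1.0, 0.0])
for _ in range(50):
    err = backprop_step(x, t)
print(err)                                 # the error shrinks as the weights are adjusted
```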
DEEP LEARNING
- Basically a neural network with many hidden layers
- Major breakthrough in pre-training: first treat each layer as an unsupervised restricted Boltzmann machine to initialise the weights, then do standard supervised backpropagation
- Can be used for unsupervised learning and dimensionality reduction
[Figure: deep autoencoder: encoder layers of 2000, 1000, 500 units and a 30-unit code layer (weights $W_1 \ldots W_4$), unrolled into a decoder with transposed weights $W_4^{\mathrm{T}} \ldots W_1^{\mathrm{T}}$]