C4 Phenomenological Modeling - Regression & Neural Networks 4040-849-03: Computational Modeling and Simulation Instructor: Linwei Wang
Recall… The simple and multiple linear regression function ŷ(x) = a_0 + a_1 x_1 + a_2 x_2 + … + a_n x_n, which can be viewed as a black box (shown on the right): each input node value is multiplied by a corresponding weight, the results are added up, and the output value is obtained as this sum plus a constant (the so-called bias). This is exactly what one neuron does in an ANN!
Building Blocks of ANN Our crude way to simulate the brain electronically: multiple inputs; weights that can be negative to represent excitatory or inhibitory influences; output: the activation
Artificial Neural Network Mathematically, an artificial neuron is modeled as d(x) = f(w^T x + w_0), where f is a nonlinear function (transfer/activation function), e.g.: Threshold: f(y) = 0 for y < 0, f(y) = 1 for y ≥ 0. Sigmoid: f(y) = 1/(1 + e^(−cy)). Arctangent: f(y) = (1/π) arctan(y) + 1/2. In short: multiple linear regression + a nonlinear activation function
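A minimal sketch of the neuron model and the three activation functions named above (function names are my own choices, not from the slides); the steepness constant c defaults to 1:

```python
import numpy as np

def threshold(y):
    """Threshold activation: 0 for y < 0, 1 for y >= 0."""
    return np.where(y < 0, 0.0, 1.0)

def sigmoid(y, c=1.0):
    """Logistic sigmoid 1 / (1 + e^{-c y}); c controls the steepness."""
    return 1.0 / (1.0 + np.exp(-c * y))

def arctan_act(y):
    """Arctangent squashed into (0, 1): (1/pi) * arctan(y) + 1/2."""
    return np.arctan(y) / np.pi + 0.5

def neuron(x, w, w0, f=sigmoid):
    """One artificial neuron: d(x) = f(w^T x + w_0)."""
    return f(np.dot(w, x) + w0)
```

All three activations map a zero excitation to 1/2 (or to 1 for the threshold, which jumps at 0), so an untrained neuron with zero weights outputs a neutral value.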
Feed-Forward Neural Network Put all the neurons into a network. Arbitrary number of layers (it is common to use two layers of weights); arbitrary number of neurons in each layer. Common ANNs have 3 layers: input layer x, hidden layer z, output layer y, with z_i = f_h(w_i^T x + w_{i,0}) and y = f_o(v^T z + v_0). Number of parameters to tune: h(n+1) + m(h+1). Each layer acts in the same way but with different coefficients and/or nonlinear functions
Feed-Forward Neural Network Generalize: skip-layer connections. Still looking at 3 layers (input layer x, hidden layer z, output layer y): z_i = f_h(w_i^T x + w_{i,0}), y = f_o(v^T z + v_0 + w_o^T x). Number of parameters to tune: h(n+1) + m(h+1) + mn. The single-hidden-layer feed-forward neural network can approximate any continuous function by increasing the size of the hidden layer
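The forward pass above, with the optional skip-layer connection, can be sketched as follows (a toy illustration with made-up shapes; sigmoid is assumed for both f_h and f_o):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def forward(x, W, w0, v, v0, w_skip=None):
    """Single-hidden-layer feed-forward pass:
    z_i = f_h(w_i^T x + w_{i,0});  y = f_o(v^T z + v_0 [+ w_o^T x]).
    W: (h, n) hidden weights, w0: (h,) hidden biases,
    v: (h,) output weights, v0: scalar output bias,
    w_skip: optional (n,) skip-layer weights from input to output."""
    z = sigmoid(W @ x + w0)
    excitation = v @ z + v0
    if w_skip is not None:
        excitation += w_skip @ x   # skip-layer connection
    return sigmoid(excitation)

# Parameter counts from the slide: h(n+1) + m(h+1), plus m*n with skip links
n, h, m = 4, 3, 1
assert h * (n + 1) + m * (h + 1) == 19
assert h * (n + 1) + m * (h + 1) + m * n == 23
```

With all weights and biases zero, every hidden unit and the output sit at sigmoid(0) = 0.5, matching the neutral-start picture above.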
The Learning Process Each neural network possesses knowledge contained in the values of the connection weights. Modifying the knowledge stored in the network as a function of experience implies a learning rule for changing the values of the weights: minimization of RSQ (the residual sum of squares), min (1/2) Σ_{i=1}^m (y_i − ŷ_i)^2. Learning as gradient descent: the back-propagation algorithm
Back-Propagation Algorithm To adjust the weights for each unit such that the error between the desired output and the actual output is reduced (minimizing RSQ). Learning uses the method of gradient descent: compute the gradient of the error function, i.e., the error derivatives with respect to the weights, EW (how the error changes as each weight is increased or decreased slightly). Must guarantee the continuity and differentiability of the error function. Activation function: e.g. the sigmoid f(x) = 1/(1 + e^(−x)), with df(x)/dx = e^(−x)/(1 + e^(−x))^2 = f(x)(1 − f(x))
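The identity f′(x) = c·f(x)(1 − f(x)) (the slide's form with c = 1) is what makes the sigmoid convenient in backpropagation: the derivative is computed from the already-stored output. A quick numerical check against a central finite difference:

```python
import numpy as np

def sigmoid(x, c=1.0):
    return 1.0 / (1.0 + np.exp(-c * x))

def sigmoid_deriv(x, c=1.0):
    """Analytic derivative c * f(x) * (1 - f(x))."""
    f = sigmoid(x, c)
    return c * f * (1.0 - f)

# Verify against a central finite difference at a few points
for x in (-2.0, 0.0, 1.5):
    h = 1e-6
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid_deriv(x)) < 1e-8
```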
Back-Propagation Algorithm Activation function: e.g. the sigmoid f(x) = 1/(1 + exp(−(Σ_{i=1}^n w_i x_i + w_0))). Guarantees the continuity and differentiability of the error function; valued between 0 and 1. Local minima can still occur
Back-Propagation Algorithm Find a local minimum of the error function E = (1/2) Σ_{i=1}^m (y_i − ŷ_i)^2. The ANN is initialized with randomly chosen weights. The gradient of the error function, ∇E = (∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_l), is computed and used to correct the initial weights (assume l weights in the network); EW is computed recursively. Each weight is updated as Δw_i = −γ ∂E/∂w_i, i = 1, …, l, where γ is the learning constant, which defines the step length of each iteration in the negative gradient direction
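The update rule Δw = −γ ∂E/∂w can be illustrated on a one-weight toy problem (the data and the learning constant γ = 0.05 are made up for this sketch):

```python
import numpy as np

# Toy data: one weight, no bias; E = (1/2) * sum((y - w*x)^2)
x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x                        # the true weight is 2

w = 0.0                            # arbitrary initialization
gamma = 0.05                       # learning constant (step length)
for _ in range(200):
    grad = -np.sum((y - w * x) * x)    # dE/dw
    w += -gamma * grad                 # step in the negative gradient direction
```

Each iteration moves w a distance γ along the negative gradient; too large a γ would overshoot and diverge, too small a γ converges slowly.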
Back-Propagation Algorithm Now let's forget about training sets and learning for a moment. Our objective is to find a method for efficiently calculating the gradient of the network function with respect to the weights of the network. Because our network is a complex chain of function compositions (addition, weighted edges, nonlinear activations), we expect the chain rule of calculus to play a major role in finding the gradient. Let's start with a 1D network
B-Diagram Feed-forward step: information comes from the left, and each unit evaluates its function f on its right side and the derivative f′ on its left side. Both results are stored in the unit; only the one on the right side is transmitted to the units connected to the right. Backpropagation step: run the whole network backwards, using the stored results. [Figure: a single computing unit storing the derivative of its function on the left side; separation into an addition unit and an activation unit]
Three Basic Cases Function composition. Forward: evaluate the composed functions as usual. Backward: the input from the right of the network is the constant 1; incoming information is multiplied by the value stored in the unit's left side, and the result (the traversing value) is the derivative of the function composition. Any sequence of function compositions can be evaluated in this way and its derivative obtained in the backpropagation step: the network is used backwards with the input 1, and at each node the product with the value stored in the left side is computed
Three Basic Cases Function addition. Forward: the incoming values are added. Backward: all incoming edges to a unit fan out the traversing value at this node and distribute it to the connected units to the left; when two right-to-left paths meet, the computed traversing values are added
Three Basic Cases Weighted edges. Forward: the traversing value is multiplied by the edge weight. Backward: the backpropagated value is multiplied by the same edge weight
Steps of the Backpropagation Algorithm Consider a network with a single real input x and network function F; the derivative F′(x) is computed in two phases. Feedforward: the input x is fed into the network; the primitive functions at the nodes and their derivatives are evaluated and stored at each node. Backpropagation: the constant 1 is fed into the output unit and the network is run backwards; incoming information to a node is added, and the result is multiplied by the value stored in the left part of the unit; the result is transmitted to the left of the unit. The result collected at the input unit is F′(x). It can be proved that this works in arbitrary feed-forward networks with differentiable activation functions at the nodes
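The two phases above can be sketched for a 1D chain of primitive functions (the specific chain sin → square → exp is an arbitrary example of mine, not from the slides):

```python
import math

# Chain of primitive functions with their derivatives: F(x) = f3(f2(f1(x)))
chain = [
    (math.sin, math.cos),                   # f1 = sin,    f1' = cos
    (lambda u: u * u, lambda u: 2 * u),     # f2 = square, f2' = 2u
    (math.exp, math.exp),                   # f3 = exp,    f3' = exp
]

def backprop_1d(x):
    """Feedforward: evaluate each node, storing its derivative on the 'left side'.
    Backpropagation: feed the constant 1 in from the right and multiply by the
    stored derivatives, right to left."""
    stored = []
    u = x
    for f, fprime in chain:
        stored.append(fprime(u))    # left side of the unit
        u = f(u)                    # right side, transmitted onward
    g = 1.0                         # the constant 1 fed into the output unit
    for d in reversed(stored):
        g *= d
    return u, g                     # F(x), F'(x)
```

The backward pass is just the chain rule: the product of the stored node derivatives, accumulated right to left.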
Steps of the Backpropagation Algorithm F(x) = φ(w_1 F_1(x) + w_2 F_2(x) + … + w_m F_m(x)), so with s = w_1 F_1(x) + … + w_m F_m(x): F′(x) = φ′(s)(w_1 F′_1(x) + w_2 F′_2(x) + … + w_m F′_m(x))
Generalization to More Inputs The feed-forward step remains unchanged, and all left-side slots of the units are filled as usual. In the backpropagation step we can identify two subnetworks
Learning with Backpropagation The feed-forward step is computed in the usual way, but we also store the output of each unit on its right side. We then perform the backpropagation in the network. If we fix our attention on one of the weights, say w_ij, whose associated edge points from the i-th to the j-th node in the network, the weight can be treated as an input channel into the subnetwork made of all paths starting at w_ij and ending in the single output unit of the network. The information fed into this subnetwork in the feed-forward step was o_i w_ij (o_i being the stored output of unit i). Backpropagation computes the gradient of the error E with respect to this input: ∂E/∂w_ij = o_i · ∂E/∂(o_i w_ij), the usual result of backpropagation at one node with respect to one input
Learning with Backpropagation The backpropagation is performed in the usual way. All subnetworks defined by each weight of the network can be handled simultaneously, but we additionally store at each node i: the output o_i of the node in the feed-forward step, and the cumulative result of the backward computation up to this node (the backpropagated error δ_j). Then ∂E/∂w_ij = o_i δ_j. Once all partial derivatives are computed, we perform gradient descent by adding to each weight the correction Δw_ij = −γ o_i δ_j
Layered Networks Notation: n input, k hidden, m output units; weight matrices W_1, W_2. The excitation of the j-th hidden unit: net_j^{(1)} = Σ_{i=1}^{n+1} w_ij^{(1)} ô_i. The output of this unit: o_j^{(1)} = s(Σ_{i=1}^{n+1} w_ij^{(1)} ô_i). In matrix form: o^{(1)} = s(ô W_1), o^{(2)} = s(ô^{(1)} W_2)
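The matrix-form forward pass can be sketched directly from these formulas; ô denotes the output vector extended by the constant bias input 1, so W_1 is (n+1) × k and W_2 is (k+1) × m:

```python
import numpy as np

def s(x):
    """Sigmoid activation, applied componentwise."""
    return 1.0 / (1.0 + np.exp(-x))

def extend(o):
    """o_hat: append the constant 1 that feeds the bias weights."""
    return np.append(o, 1.0)

def forward(o, W1, W2):
    """o^{(1)} = s(o_hat W_1),  o^{(2)} = s(o_hat^{(1)} W_2).
    W1: (n+1, k), W2: (k+1, m)."""
    o1 = s(extend(o) @ W1)
    o2 = s(extend(o1) @ W2)
    return o1, o2
```

Storing o^{(1)} and o^{(2)} here is exactly the "feed-forward computation" step of the training loop described next.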
Layered Networks Let's consider a single input-output pair (o, t), i.e., one training pair. Backpropagation: (1) feed-forward computation, (2) backpropagation to the output layer, (3) backpropagation to the hidden layer, (4) weight updates. The algorithm stops when the value of the error function is sufficiently small. [Figure: extended network for computing the error]
Layered Networks Feed-forward computation: the vector o is presented to the network; the vectors o^{(1)} and o^{(2)} are computed and stored. The derivatives of the activation functions are also stored at each unit. Backpropagation to the output layer: we are interested in ∂E/∂w_ij^{(2)}
Layered Networks Backpropagation to the output layer: interested in ∂E/∂w_ij^{(2)}. Backpropagated error: δ_j^{(2)} = o_j^{(2)}(1 − o_j^{(2)})(o_j^{(2)} − t_j). Partial derivative: ∂E/∂w_ij^{(2)} = o_i^{(1)} δ_j^{(2)} = [o_j^{(2)}(1 − o_j^{(2)})(o_j^{(2)} − t_j)] o_i^{(1)}
Layered Networks Backpropagation to the hidden layer: interested in ∂E/∂w_ij^{(1)}. Backpropagated error: δ_j^{(1)} = o_j^{(1)}(1 − o_j^{(1)}) Σ_{q=1}^m w_jq^{(2)} δ_q^{(2)}. Partial derivative: ∂E/∂w_ij^{(1)} = o_i δ_j^{(1)}
Layered Networks Weight updates. Hidden-output layer: Δw_ij^{(2)} = −γ o_i^{(1)} δ_j^{(2)}, i = 1, …, k+1; j = 1, …, m. Input-hidden layer: Δw_ij^{(1)} = −γ o_i δ_j^{(1)}, i = 1, …, n+1; j = 1, …, k, where o_{n+1} = o_{k+1}^{(1)} = 1 (the bias inputs). Make the corrections to the weights only after the backpropagated error has been computed for all units in the network!!!! Otherwise the corrections become intertwined with the backpropagation, and the computed corrections no longer correspond to the negative gradient direction
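The four steps for one training pair (o, t) can be sketched as a single update function; sigmoid units are assumed in both layers (so the stored derivative is o(1 − o)), and the weight matrices carry the bias row as their last row:

```python
import numpy as np

def s(x):
    return 1.0 / (1.0 + np.exp(-x))

def ext(o):
    """Append the constant bias input 1."""
    return np.append(o, 1.0)

def backprop_step(o, t, W1, W2, gamma=0.5):
    """One gradient-descent step for a single pair (o, t), following the
    delta formulas on the slides. W1: (n+1, k), W2: (k+1, m). Corrections
    are applied only after all deltas have been computed."""
    # (1) Feed-forward computation, storing both layer outputs
    o1 = s(ext(o) @ W1)
    o2 = s(ext(o1) @ W2)
    # (2) Backpropagated error at the output layer
    d2 = o2 * (1 - o2) * (o2 - t)
    # (3) Backpropagated error at the hidden layer (drop the bias row of W2)
    d1 = o1 * (1 - o1) * (W2[:-1, :] @ d2)
    # (4) Weight updates: Delta w_ij = -gamma * o_i * delta_j
    W2_new = W2 - gamma * np.outer(ext(o1), d2)
    W1_new = W1 - gamma * np.outer(ext(o), d1)
    return W1_new, W2_new
```

Note that both new weight matrices are computed from the old ones before either is overwritten, respecting the warning above about not intertwining corrections with the backpropagation.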
More than One Training Set If we have p training pairs: Batch / offline updates: compute Δ_1 w_ij^{(1)}, Δ_2 w_ij^{(1)}, …, Δ_p w_ij^{(1)}; the necessary update is Δw_ij^{(1)} = Δ_1 w_ij^{(1)} + Δ_2 w_ij^{(1)} + … + Δ_p w_ij^{(1)}. Online / sequential updates: the corrections do not exactly follow the negative gradient direction, but if the training pairs are selected randomly, the search direction oscillates around the exact gradient direction and, on average, the algorithm implements a form of descent in the error function. Adding some noise to the gradient direction can help to avoid falling into shallow local minima. Also, it is very expensive to compute the exact gradient direction when the training set is large
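The two update schedules can be contrasted on a one-weight toy problem (the per-pair errors E_k = (w − t_k)²/2 and the targets are invented for this sketch): the batch step follows the exact summed gradient and converges to the minimizer (the mean target), while the sequential pass re-evaluates each pair's gradient at the current weight and only hovers near it.

```python
import numpy as np

targets = np.array([1.0, 3.0, 2.0])                   # p = 3 training pairs
grad_fns = [lambda w, t=t: w - t for t in targets]    # dE_k/dw for E_k = (w - t_k)^2 / 2

def batch_step(w, gamma=0.1):
    """Offline: sum the per-pair corrections, then apply once (exact gradient)."""
    return w - gamma * sum(g(w) for g in grad_fns)

def online_pass(w, gamma=0.1):
    """Sequential: apply each pair's correction immediately (noisy direction)."""
    for g in grad_fns:
        w = w - gamma * g(w)
    return w

w_batch = w_online = 0.0
for _ in range(200):
    w_batch = batch_step(w_batch)
    w_online = online_pass(w_online)
```

Here the batch iterate converges to the exact minimizer w = 2, while the online iterate settles slightly off it, illustrating the oscillation around the true gradient direction.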
Backpropagation in Matrix Form Input-output: o^{(1)} = s(ô W_1), o^{(2)} = s(ô^{(1)} W_2). The derivatives (stored in the feed-forward step), collected in diagonal matrices: D_2 = diag(o_1^{(2)}(1 − o_1^{(2)}), o_2^{(2)}(1 − o_2^{(2)}), …, o_m^{(2)}(1 − o_m^{(2)})) and D_1 = diag(o_1^{(1)}(1 − o_1^{(1)}), o_2^{(1)}(1 − o_2^{(1)}), …, o_k^{(1)}(1 − o_k^{(1)})). The stored derivative of the quadratic error: e = (o_1^{(2)} − t_1, o_2^{(2)} − t_2, …, o_m^{(2)} − t_m)^T
Backpropagation in Matrix Form The m-dimensional vector of the backpropagated error up to the output units: δ^{(2)} = D_2 e. The k-dimensional vector of the backpropagated error up to the hidden layer: δ^{(1)} = D_1 W_2 δ^{(2)}. The corrections for the two weight matrices: ΔW_2^T = −γ δ^{(2)} ô^{(1)}, ΔW_1^T = −γ δ^{(1)} ô. We can generalize this to l layers: δ^{(l)} = D_l e, δ^{(i)} = D_i W_{i+1} δ^{(i+1)} for i = 1, …, l−1, or, fully expanded, δ^{(i)} = D_i W_{i+1} D_{i+1} W_{i+2} ⋯ W_{l−1} D_{l−1} W_l D_l e
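The l-layer recursion δ^{(l)} = D_l e, δ^{(i)} = D_i W_{i+1} δ^{(i+1)} is short enough to sketch directly; the argument conventions (lists indexed from layer 1, weight matrices without their bias rows) are my own choices for the sketch:

```python
import numpy as np

def backprop_deltas(Ds, Ws, e):
    """Backpropagated errors for an l-layer network.
    Ds = [D_1, ..., D_l]: stored diagonal derivative matrices per layer.
    Ws = [W_2, ..., W_l]: weight matrices WITHOUT their bias rows.
    Returns [delta^{(1)}, ..., delta^{(l)}]."""
    l = len(Ds)
    deltas = [None] * l
    deltas[-1] = Ds[-1] @ e                             # delta^{(l)} = D_l e
    for i in range(l - 2, -1, -1):
        deltas[i] = Ds[i] @ Ws[i] @ deltas[i + 1]       # delta^{(i)} = D_i W_{i+1} delta^{(i+1)}
    return deltas
```

For l = 2 this reduces exactly to δ^{(2)} = D_2 e and δ^{(1)} = D_1 W_2 δ^{(2)} from the slide.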
Back-Propagation Summary First, compute EA: the rate at which the error changes as the activity level of a unit is changed. Output layer: the difference between the actual and desired outputs. Hidden layer: identify all the weights between that hidden unit and the output units it connects to; multiply those weights by the EAs of those output units and add the products. Other layers: computed in a similar fashion, from layer to layer in the direction opposite to the way activities propagate through the network (hence the name back-propagation). Second, the EW for each connection of a unit is the product of the EA and the activity through the incoming connection
More General ANN Feed-forward ANN: signals travel one way only, from input to output. Feedback ANN: signals travel in both directions through loops in the network; very powerful, but can become extremely complicated. Automatic detection of nonlinearities: an ANN describes the nonlinear dependency of the response variable on the independent variables without a previous explicit specification of this nonlinear dependency
Generalization and Overfitting Back to the investment example: a 3-layer ANN with h = 3 vs. h = 6 hidden units. Which model is better?
Generalization and Overfitting Generalization: suppose two mathematical models (S,Q,M) and (S,Q,M*) have been set up using a training dataset D_train. Then (S,Q,M) is said to generalize better than (S,Q,M*) on a test dataset D_test with respect to some error criterion E if (S,Q,M) produces a smaller value of E on D_test than (S,Q,M*). It is not sufficient to look at a model's performance only on the dataset used to construct the model if one wants to achieve good predictive capabilities. Better predictions are obtained from models that describe the essential tendency of the data instead of following random oscillations
Generalization and Overfitting Overfitting: a mathematical model (S,Q,M) is said to overfit a training dataset D_train with respect to an error criterion E and a test dataset D_test if another model (S,Q,M*) with a larger error on D_train generalizes better to D_test. Regularization methods can be used to reduce overfitting, using modified fitting criteria that penalize the roughness of the ANN. Weight decay: roughness is associated with large values of the weight parameters, so the sum of squares of the network weights is added to the fitting criterion