C4 Phenomenological Modeling - Regression & Neural Networks, 4040-849-03: Computational Modeling and Simulation. Instructor: Linwei Wang


Recall: the simple multiple linear regression function ŷ(x) = a_0 + a_1 x_1 + a_2 x_2 + ... + a_n x_n can be viewed as a black box: each input node value is multiplied by a corresponding weight, the results are added up, and the output is this sum plus a constant (the so-called bias). This is exactly what one neuron does in an ANN!
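To make the black-box picture concrete, here is a minimal Python sketch (not from the course materials; the inputs, weights, and bias values are invented for illustration) of a single linear neuron computing a weighted sum plus a bias:

```python
import numpy as np

def linear_neuron(x, a, a0):
    """Multiple linear regression viewed as one neuron:
    multiply each input by its weight, sum, and add the bias a0."""
    return np.dot(a, x) + a0

x = np.array([1.0, 2.0, 0.5])     # example inputs x_1 ... x_n
a = np.array([0.4, -0.2, 1.1])    # example weights a_1 ... a_n
a0 = 0.3                          # bias (constant term)
print(linear_neuron(x, a, a0))    # ŷ(x) = a_0 + a_1 x_1 + ... + a_n x_n
```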

Building Blocks of ANN Our crude way to simulate the brain electronically: a unit has multiple inputs; its weights can be positive or negative, representing excitatory or inhibitory influences; its output is the activation.

Artificial Neural Network Mathematically, an artificial neuron is modeled as d(x) = f(w^T x + w_0), where f is a non-linear function (transfer/activation function), e.g. the threshold function f(y) = 0 for y < 0 and 1 for y ≥ 0, the sigmoid f(y) = 1/(1 + e^(-cy)), or f(y) = (1/π) arctan(y) + 1/2. In other words: multiple linear regression plus a nonlinear activation function.
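As an illustration, the following sketch implements the neuron d(x) = f(w^T x + w_0) with the threshold and sigmoid activations; the particular input and weight values, and the default c = 1 for the sigmoid, are assumptions made only for this example:

```python
import numpy as np

def threshold(y):
    """Threshold activation: 0 for y < 0, 1 for y >= 0."""
    return np.where(y < 0, 0.0, 1.0)

def sigmoid(y, c=1.0):
    """Sigmoid activation f(y) = 1 / (1 + exp(-c*y))."""
    return 1.0 / (1.0 + np.exp(-c * y))

def neuron(x, w, w0, f):
    """Artificial neuron d(x) = f(w^T x + w_0)."""
    return f(np.dot(w, x) + w0)

x = np.array([0.5, -1.0, 2.0])    # example inputs
w = np.array([0.8, 0.3, -0.5])    # example weights
print(neuron(x, w, 0.1, sigmoid))
print(neuron(x, w, 0.1, threshold))
```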

Feed-Forward Neural Network Put all the neurons into a network. An arbitrary number of layers is possible, but it is common to use two layers of weights, with an arbitrary number of neurons in each layer. Common ANNs therefore have 3 layers: input layer x, hidden layer z, and output layer y, with z_i = f_h(w_i^T x + w_{i,0}) and y = f_o(v^T z + v_0). The number of parameters to tune is h(n+1) + m(h+1). Each layer acts in the same way but with different coefficients and/or nonlinear functions.
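A hedged NumPy sketch of this forward pass is given below; the sigmoid activations, the layer sizes n = 3, h = 4, m = 2, and the random weights are illustrative assumptions, and the printed count is the formula h(n+1) + m(h+1):

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def forward(x, W, w0, V, v0):
    """Two weight layers: hidden z_i = f_h(w_i^T x + w_i0), output y = f_o(V z + v0)."""
    z = sigmoid(W @ x + w0)    # hidden layer (h units)
    y = sigmoid(V @ z + v0)    # output layer (m units)
    return z, y

n, h, m = 3, 4, 2                                    # illustrative layer sizes
rng = np.random.default_rng(0)
W, w0 = rng.normal(size=(h, n)), rng.normal(size=h)  # input-to-hidden weights and biases
V, v0 = rng.normal(size=(m, h)), rng.normal(size=m)  # hidden-to-output weights and biases
x = rng.normal(size=n)
z, y = forward(x, W, w0, V, v0)
print("parameters to tune:", h * (n + 1) + m * (h + 1))
print("output:", y)
```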

Feed-Forward Neural Network Generalize: skip-layer connections. Still with 3 layers (input layer x, hidden layer z, output layer y), we have z_i = f_h(w_i^T x + w_{i,0}) and y = f_o(v^T z + v_0 + w_o^T x). The number of parameters to tune is h(n+1) + m(h+1) + mn. The single-hidden-layer feed-forward neural network can approximate any continuous function by increasing the size of the hidden layer.

The Learning Process Each neural network possesses knowledge contained in the values of its connection weights. Modifying the knowledge stored in the network as a function of experience implies a learning rule for changing the values of the weights: minimization of the residual sum of squares (RSQ), learning as gradient descent, the back-propagation algorithm: min { (1/2) Σ_{i=1}^m (y_i − ŷ_i)^2 }.

Back-Propagation Algorithm The goal is to adjust the weights of each unit such that the error between the desired output and the actual output is reduced (minimizing the RSQ). Learning uses the method of gradient descent, which requires computing the gradient of the error function, i.e., the error derivatives with respect to the weights, EW (how the error changes as each weight is increased or decreased slightly). This requires continuity and differentiability of the error function. Activation function: e.g. the sigmoid f(x) = 1/(1 + e^(-cx)), with df(x)/dx = c e^(-cx)/(1 + e^(-cx))^2 = c f(x)(1 − f(x)).
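A quick numerical check of this derivative identity (the choice c = 2 and the evaluation point are arbitrary, chosen only for illustration):

```python
import numpy as np

def sigmoid(x, c=2.0):
    return 1.0 / (1.0 + np.exp(-c * x))

def sigmoid_deriv(x, c=2.0):
    """Analytic derivative c * f(x) * (1 - f(x))."""
    f = sigmoid(x, c)
    return c * f * (1 - f)

x, eps = 0.3, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # finite-difference estimate
print(sigmoid_deriv(x), numeric)                             # the two values agree
```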

Back-Propagation Algorithm Activation function: e.g. the sigmoid f(x) = 1/(1 + exp(−(Σ_{i=1}^n w_i x_i + w_0))). It guarantees the continuity and differentiability of the error function and is valued between 0 and 1. Local minima can still occur.

Back-Propagation Algorithm Find a local minimum of the error function E = (1/2) Σ_{i=1}^m (y_i − ŷ_i)^2. The ANN is initialized with randomly chosen weights. The gradient of the error function, EW, is computed and used to correct the initial weights; it is computed recursively. Assuming l weights in the network, ∇E = (∂E/∂w_1, ∂E/∂w_2, ..., ∂E/∂w_l) and Δw_i = −γ ∂E/∂w_i for i = 1, ..., l, where the learning constant γ defines the step length of each iteration in the negative gradient direction.
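The following sketch applies the update Δw_i = −γ ∂E/∂w_i with a finite-difference estimate of the gradient on an invented linear-regression problem; back-propagation, developed next, computes the same gradient analytically and far more efficiently:

```python
import numpy as np

def error(w, X, y):
    """E = 1/2 * sum_i (y_i - ŷ_i)^2 for a linear model ŷ = X @ w."""
    return 0.5 * np.sum((y - X @ w) ** 2)

def numeric_gradient(E, w, eps=1e-6):
    """Estimate each ∂E/∂w_i by increasing/decreasing w_i slightly."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        g[i] = (E(wp) - E(wm)) / (2 * eps)
    return g

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                              # invented data
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=20)
w = rng.normal(size=3)                                    # randomly chosen initial weights
gamma = 0.01                                              # learning constant (step length)
for _ in range(200):
    w = w - gamma * numeric_gradient(lambda v: error(v, X, y), w)
print(w)                                                  # approaches [1, -2, 0.5]
```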

Back-Propagation Algorithm Now let us forget about training sets and learning for a moment. Our objective is a method for efficiently calculating the gradient of the network function with respect to the weights of the network. Because our network is a complex chain of function compositions (addition, weighted edges, nonlinear activations), we expect the chain rule of calculus to play a major role in finding the gradient. Let us start with a 1D network.

B-Diagram Feed-forward step: information comes from the left; each unit evaluates its primitive function f (stored on its right side) and the derivative f′ (stored on its left side). Both results are stored in the unit, but only the one on the right side is transmitted to the units connected to the right. Backpropagation step: the whole network is run backwards, using the stored results. (A single computing unit can be separated into an addition unit and an activation unit.)

Three Basic Cases Function composition. Backward step: the input from the right of the network is the constant 1; the incoming information is multiplied by the value stored in the unit's left side, and the result (the traversing value) is the derivative of the function composition. Any sequence of function compositions can be evaluated in this way and its derivative obtained in the backpropagation step: the network is used backwards with input 1, and at each node the product with the value stored in the left side is computed.

Three Basic Cases Function addition. Backward step: the traversing value arriving at a node fans out over all incoming edges and is distributed to the connected units to the left; when two right-to-left paths meet, the computed traversing values are added.

Three Basic Cases Weighted edges. Forward step: the incoming information is multiplied by the edge weight w. Backward step: the traversing value is likewise multiplied by the same weight w.

Steps of the Backpropagation Algorithm Consider a network with a single real input x and network function F; the derivative F′(x) is computed in two phases. Feedforward: the input x is fed into the network; the primitive functions at the nodes and their derivatives are evaluated and stored at each node. Backpropagation: the constant 1 is fed into the output unit and the network is run backwards; incoming information to a node is added, the result is multiplied by the value stored in the left part of the unit, and that result is transmitted to the left of the unit. The result collected at the input unit is F′(x). It can be proved that this works in arbitrary feed-forward networks with differentiable activation functions at the nodes.
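A small worked example of these two phases, using the invented composition F(x) = sigmoid(sigmoid(x)^2): the feedforward pass stores each primitive function value and its derivative, and the backward pass multiplies a traversing value of 1 by the stored derivatives:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.7
# Feedforward: evaluate each primitive function and store its derivative.
f1, d1 = sigmoid(x), sigmoid(x) * (1 - sigmoid(x))      # unit 1: sigmoid
f2, d2 = f1 ** 2, 2 * f1                                # unit 2: square
f3, d3 = sigmoid(f2), sigmoid(f2) * (1 - sigmoid(f2))   # unit 3: sigmoid
print("F(x):", f3)

# Backpropagation: feed the constant 1 from the right and multiply by the
# stored derivatives while traversing the network to the left.
traversing = 1.0
traversing *= d3
traversing *= d2
traversing *= d1
print("F'(x) via backpropagation:", traversing)

# Check against a finite-difference approximation.
F = lambda t: sigmoid(sigmoid(t) ** 2)
eps = 1e-6
print("F'(x) numerically:        ", (F(x + eps) - F(x - eps)) / (2 * eps))
```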

Steps of the Backpropagation Algorithm For a unit that receives m weighted inputs: F(x) = φ(w_1 F_1(x) + w_2 F_2(x) + ... + w_m F_m(x)), so F′(x) = φ′(s)(w_1 F′_1(x) + w_2 F′_2(x) + ... + w_m F′_m(x)), where s = w_1 F_1(x) + ... + w_m F_m(x).

Generalization to More Inputs The feed-forward step remains unchanged, and all left-side slots of the units are filled as usual. In the backpropagation step we can identify two subnetworks, one for each input variable, and run the backward pass through each of them.

Learning with Backpropagation The feed-forward step is computed in the usual way, but we also store the output of each unit on its right side. We then perform backpropagation in the network. If we fix our attention on one of the weights, say w_ij, whose associated edge points from the i-th to the j-th node in the network, the weight can be treated as an input channel into the subnetwork made of all paths starting at w_ij and ending in the single output unit of the network. The information fed into this subnetwork in the feed-forward step was o_i w_ij (with o_i the stored output of unit i). Backpropagation computes the gradient of the error E with respect to this input, and the usual result in backpropagation at one node with regard to one input gives ∂E/∂w_ij = o_i · ∂E/∂(o_i w_ij).

Learning with Backpropagation Backpropagation is performed in the usual way. All subnetworks defined by each weight of the network can be handled simultaneously, but we additionally store at each node i: the output o_i of the node in the feed-forward step, and the cumulative result of the backward computation up to this node (the backpropagated error δ_j), so that ∂E/∂w_ij = o_i δ_j. Once all partial derivatives have been computed, we perform gradient descent by adding to each weight the correction Δw_ij = −γ o_i δ_j.

Layered Networks Notation: n input, k hidden, and m output units, with weight matrices W_1 and W_2. The excitation (net input) of the j-th hidden unit is net_j^(1) = Σ_{i=1}^{n+1} w_ij^(1) ô_i, and the output of this unit is o_j^(1) = s(Σ_{i=1}^{n+1} w_ij^(1) ô_i), where ô denotes the input vector extended with a constant entry 1 for the bias. In matrix form: o^(1) = s(ô W_1) and o^(2) = s(ô^(1) W_2).

Layered Networks Consider a single input-output pair (o, t), i.e., one training example. Backpropagation proceeds in four steps: feed-forward computation, backpropagation to the output layer, backpropagation to the hidden layer, and weight updates. It stops when the value of the error function is sufficiently small. (Figure: extended network for computing the error.)

Layered Networks Feed-forward computation: the vector o is presented to the network, and the vectors o^(1) and o^(2) are computed and stored. The derivatives of the activation functions are also stored at each unit. Backpropagation to the output layer: we are interested in ∂E/∂w_ij^(2). (Figure: extended network for computing the error.)

Layered Networks Backpropagation to the output layer: we are interested in ∂E/∂w_ij^(2). Backpropagated error: δ_j^(2) = o_j^(2)(1 − o_j^(2))(o_j^(2) − t_j). Partial derivative: ∂E/∂w_ij^(2) = o_i^(1) δ_j^(2) = [o_j^(2)(1 − o_j^(2))(o_j^(2) − t_j)] o_i^(1).

Layered Networks Backpropagation to the hidden layer: we are interested in ∂E/∂w_ij^(1). Backpropagated error: δ_j^(1) = o_j^(1)(1 − o_j^(1)) Σ_{q=1}^{m} w_jq^(2) δ_q^(2). Partial derivative: ∂E/∂w_ij^(1) = o_i δ_j^(1).

Layered Networks Weight updates. Hidden-to-output layer: Δw_ij^(2) = −γ o_i^(1) δ_j^(2) for i = 1, ..., k+1 and j = 1, ..., m. Input-to-hidden layer: Δw_ij^(1) = −γ o_i δ_j^(1) for i = 1, ..., n+1 and j = 1, ..., k, where o_{n+1} = o_{k+1}^(1) = 1 are the bias entries. Make the corrections to the weights only after the backpropagated error has been computed for all units in the network! Otherwise the corrections become intertwined with the backpropagation, and the computed corrections no longer correspond to the negative gradient direction.
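Tying the four steps together, here is a hedged NumPy sketch of one backpropagation step for a single training pair, following the formulas above; the XOR data, the hidden-layer size of 3, and the learning constant γ = 0.5 are illustrative choices, and training can still end in a local minimum depending on the random initialization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(o, t, W1, W2, gamma=0.5):
    """One backpropagation step for a single training pair (o, t).
    W1: (n+1) x k input-to-hidden weights, W2: (k+1) x m hidden-to-output weights,
    following the convention o^(1) = s(ô W1), o^(2) = s(ô^(1) W2)."""
    # Feed-forward computation (extend the vectors with a constant 1 for the bias)
    o_hat  = np.append(o, 1.0)                      # ô
    o1     = sigmoid(o_hat @ W1)                    # hidden outputs o^(1)
    o1_hat = np.append(o1, 1.0)                     # ô^(1)
    o2     = sigmoid(o1_hat @ W2)                   # network outputs o^(2)

    # Backpropagation to the output layer
    delta2 = o2 * (1 - o2) * (o2 - t)               # δ^(2)
    # Backpropagation to the hidden layer (bias row of W2 excluded)
    delta1 = o1 * (1 - o1) * (W2[:-1, :] @ delta2)  # δ^(1)

    # Weight updates, applied only after all backpropagated errors are known
    W2 -= gamma * np.outer(o1_hat, delta2)          # Δw^(2)_ij = -γ o^(1)_i δ^(2)_j
    W1 -= gamma * np.outer(o_hat, delta1)           # Δw^(1)_ij = -γ o_i δ^(1)_j
    return W1, W2, 0.5 * np.sum((o2 - t) ** 2)

# Illustrative use: try to learn XOR with n = 2 inputs, k = 3 hidden units, m = 1 output
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 3)), rng.normal(size=(4, 1))
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
for epoch in range(5000):
    for o, t in data:
        W1, W2, E = backprop_step(np.array(o, float), np.array(t, float), W1, W2)
for o, t in data:
    o1_hat = np.append(sigmoid(np.append(o, 1.0) @ W1), 1.0)
    print(o, t, sigmoid(o1_hat @ W2))
```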

More than One Training Set If we have p training examples: Batch / offline updates: compute Δ_1 w_ij^(1), Δ_2 w_ij^(1), ..., Δ_p w_ij^(1); the necessary update is Δw_ij^(1) = Δ_1 w_ij^(1) + Δ_2 w_ij^(1) + ... + Δ_p w_ij^(1). Online / sequential updates: the corrections do not exactly follow the negative gradient direction, but if the training examples are selected randomly, the search direction oscillates around the exact gradient direction and, on average, the algorithm implements a form of descent in the error function. Adding some noise to the gradient direction can help to avoid falling into shallow local minima, and it is very expensive to compute the exact gradient direction when the training set is large.
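A minimal sketch contrasting the two update schemes on an invented one-weight model (the data, learning constant, and number of passes are arbitrary illustrative choices):

```python
import numpy as np

def grad_E(w, example):
    """Gradient of E = 1/2 (y - w*x)^2 for one example (x, y): a toy one-weight model."""
    x, y = example
    return -(y - w * x) * x

examples = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # invented (x, y) pairs with y ≈ 2x
gamma, rng = 0.05, np.random.default_rng(0)

# Batch / offline: sum the corrections from all p examples, then apply them once per pass
w_batch = 0.0
for _ in range(100):
    w_batch -= gamma * sum(grad_E(w_batch, ex) for ex in examples)

# Online / sequential: apply a correction after each randomly selected example
w_online = 0.0
for _ in range(100):
    for i in rng.permutation(len(examples)):
        w_online -= gamma * grad_E(w_online, examples[i])

print(w_batch, w_online)   # both approach the least-squares weight w ≈ 2
```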

Backpropagation in Matrix Form Input-output: o^(1) = s(ô W_1), o^(2) = s(ô^(1) W_2). The derivatives stored in the feed-forward step are collected in the diagonal matrices D_2 = diag(o_1^(2)(1 − o_1^(2)), ..., o_m^(2)(1 − o_m^(2))) and D_1 = diag(o_1^(1)(1 − o_1^(1)), ..., o_k^(1)(1 − o_k^(1))), and the stored derivatives of the quadratic error form the vector e = (o_1^(2) − t_1, o_2^(2) − t_2, ..., o_m^(2) − t_m)^T.

Backpropagation in Matrix Form The m-dimensional vector of the backpropagated error up to the output units is δ^(2) = D_2 e, and the k-dimensional vector of the backpropagated error up to the hidden layer is δ^(1) = D_1 W_2 δ^(2). The corrections for the two weight matrices are ΔW_2^T = −γ δ^(2) ô^(1) and ΔW_1^T = −γ δ^(1) ô (with ô and ô^(1) treated as row vectors). This generalizes to l layers: δ^(l) = D_l e and δ^(i) = D_i W_{i+1} δ^(i+1) for i = 1, ..., l−1, or equivalently δ^(i) = D_i W_{i+1} D_{i+1} W_{i+2} ... D_{l−1} W_l D_l e.
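A compact sketch of the matrix-form recursion; for brevity this version omits the bias terms (an assumption of the sketch, not of the lecture) and uses random illustrative values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
n, k, m = 3, 4, 2                        # illustrative layer sizes
W1 = rng.normal(size=(n, k))             # input-to-hidden weights
W2 = rng.normal(size=(k, m))             # hidden-to-output weights
o  = rng.normal(size=n)                  # one input vector
t  = rng.normal(size=m)                  # its target

# Feed-forward step, row-vector convention: o^(1) = s(o W1), o^(2) = s(o^(1) W2)
o1 = sigmoid(o @ W1)
o2 = sigmoid(o1 @ W2)

# Stored derivatives as diagonal matrices, and the error vector e
D2 = np.diag(o2 * (1 - o2))
D1 = np.diag(o1 * (1 - o1))
e  = o2 - t

# Backpropagated errors and weight corrections in matrix form
delta2 = D2 @ e                          # δ^(2) = D2 e
delta1 = D1 @ W2 @ delta2                # δ^(1) = D1 W2 δ^(2)
gamma = 0.1
dW2 = -gamma * np.outer(o1, delta2)      # equivalent to ΔW2^T = -γ δ^(2) o^(1)
dW1 = -gamma * np.outer(o, delta1)       # equivalent to ΔW1^T = -γ δ^(1) o
print(dW1.shape, dW2.shape)              # match the shapes of W1 and W2
```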

Back-Propagation Summary First, compute EA, the rate at which the error changes as the activity level of a unit is changed. Output layer: the difference between the actual and desired outputs. Hidden layer: identify all the weights between that hidden unit and the output units it connects to, multiply those weights by the EAs of those output units, and add the products. Other layers: computed in a similar fashion, from layer to layer in the direction opposite to the way activities propagate through the network (hence the name back-propagation). Second, the EW for each connection into a unit is the product of the unit's EA and the activity through the incoming connection.

More General ANN Feed-forward ANN: signals travel one way only, from input to output. Feedback ANN: signals travel in both directions through loops in the network; very powerful, and can become extremely complicated. Automatic detection of nonlinearities: an ANN describes the nonlinear dependency of the response variable on the independent variables without an explicit prior specification of this nonlinear dependency.

Generalization and Overfitting Back to the investment example: a 3-layer ANN with hidden-layer size h = 3 versus h = 6. Which model is better?

Generalization and Overfitting Generalization: Suppose two mathematical models (S,Q,M) and (S,Q,M*) have been set up using a training dataset D_train. Then (S,Q,M) is said to generalize better than (S,Q,M*) on a test dataset D_test with respect to some error criterion E if (S,Q,M) produces a smaller value of E on D_test than (S,Q,M*). It is not sufficient to look at a model's performance only on the dataset used to construct the model if one wants to achieve good predictive capabilities; better predictions are obtained from models which describe the essential tendency of the data instead of following random oscillations.

Generalization and Overfitting Overfitting: A mathematical model (S,Q,M) is said to overfit a training dataset D_train with respect to an error criterion E and a test dataset D_test if another model (S,Q,M*) with a larger error on D_train generalizes better to D_test. Regularization methods can be used to reduce overfitting by using modified fitting criteria that penalize the roughness of the ANN. Weight decay: roughness is associated with large values of the weight parameters, so the sum of squared weights of the network is included in the fitting criterion.
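A minimal sketch of such a weight-decay fitting criterion; the penalty coefficient lam and the example weights and outputs are illustrative assumptions, not values from the course:

```python
import numpy as np

def regularized_error(weights, y, y_hat, lam=0.01):
    """Modified fitting criterion: squared error plus a weight-decay penalty.
    lam controls how strongly large weights (roughness) are penalized."""
    rsq = 0.5 * np.sum((y - y_hat) ** 2)
    decay = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return rsq + decay

# Example with invented weight matrices and outputs
weights = [np.array([[0.5, -1.2], [2.0, 0.1]]), np.array([[0.7], [-0.3]])]
y, y_hat = np.array([1.0, 0.0]), np.array([0.8, 0.2])
print(regularized_error(weights, y, y_hat))
# With this criterion the gradient of each weight gains an extra term lam * w,
# so the update Δw = -γ (∂RSQ/∂w + lam * w) shrinks large weights toward zero.
```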