Multilayer Perceptrons and Backpropagation


Multilayer Perceptrons and Backpropagation. Informatics 1 CG: Lecture 7. Chris Lucas, School of Informatics, University of Edinburgh. January 31, 2017. (Slides adapted from Mirella Lapata's.)

Reading: Kevin Gurney's Introduction to Neural Networks, Chapters 5 to 6.5.

Recap: Perceptrons. Connectionism is a computer modeling approach inspired by neural networks. Anatomy of a connectionist model: units and connections. The Perceptron as a linear classifier. A learning algorithm for Perceptrons. Key limitation: only works for linearly separable data.

Recap: Perceptrons. [Diagram: a single perceptron with inputs $x_1, \ldots, x_n$ plus a bias input of 1, weights $w_0, w_1, \ldots, w_n$, weighted sum $\sum_{i=0}^{n} w_i x_i$, and threshold output $y = 1$ if the sum is $\geq 0$, else $y = 0$.]
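To make the diagram concrete, here is a minimal Python sketch of a perceptron's forward pass; the function name, the AND example, and the weight values are illustrative choices, not from the slides.

import numpy as np

def perceptron_output(x, w):
    """Threshold unit: y = 1 if sum_i w_i * x_i >= 0, else y = 0.
    x includes a leading 1 so that w[0] acts as the bias weight w_0."""
    u = np.dot(w, x)            # weighted sum of inputs
    return 1 if u >= 0 else 0

# Example: a perceptron computing logical AND of two binary inputs.
w = np.array([-1.5, 1.0, 1.0])  # bias weight w_0 = -1.5
print(perceptron_output(np.array([1, 1, 1]), w))  # 1 AND 1 -> 1
print(perceptron_output(np.array([1, 1, 0]), w))  # 1 AND 0 -> 0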

Multilayer Perceptrons (MLPs). [Diagram: input layer, hidden layer, output layer.] MLPs are feed-forward neural networks, organized in layers: one input layer, one or more hidden layers, and one output layer. Each node in a layer is connected to every node in the next layer. Each connection has a weight (which can be zero).

Activation Functions. [Diagram: a neuron with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, net input $h$, and output $y$.] Step function: outputs 0 or 1. Sigmoid function: outputs a real value between 0 and 1.
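A small Python sketch of the two activation functions; the function names and the slope parameter default are my own choices.

import numpy as np

def step(u):
    """Step (threshold) activation: outputs 0 or 1."""
    return np.where(u >= 0, 1.0, 0.0)

def sigmoid(u, a=1.0):
    """Logistic sigmoid with slope a: outputs a real value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a * u))

u = np.linspace(-5, 5, 11)
print(step(u))      # hard 0/1 outputs
print(sigmoid(u))   # smooth values between 0 and 1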

Sigmoids. [Figure: sigmoid curves.]

Learning with MLPs. [Diagram: input layer, hidden layer, output layer.] As with perceptrons, finding the right weights is very hard! Solution technique: learning! Learning: adjusting the weights based on training examples.

Supervised Learning: General Idea. 1. Send the MLP an input pattern, x, from the training set. 2. Get the output from the MLP, y. 3. Compare y with the right answer, or target, t, to get the error quantity. 4. Use the error quantity to modify the weights, so next time y will be closer to t. 5. Repeat with another x from the training set. When updating weights after seeing x, the network doesn't just change the way it deals with x; it changes the way it deals with other inputs too, including inputs it has not yet seen! Generalization is the ability to deal accurately with unseen inputs.

Learning and Error Minimization. Recall the Perceptron Learning Rule, which minimizes the difference between the actual and desired outputs: $w_i \leftarrow w_i + \eta (t - o) x_i$. Error Function: Mean Squared Error (MSE). An error function represents such a difference over a set of inputs: $E(\mathbf{w}) = \frac{1}{2N} \sum_{p=1}^{N} (t_p - o_p)^2$ where $N$ is the number of patterns, $t_p$ is the target output for pattern $p$, and $o_p$ is the output obtained for pattern $p$. (Strictly this is MSE/2; the factor of 2 makes little difference, but makes life easier later on!)
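A quick numerical sketch of this error function in Python; the data values are an invented example.

import numpy as np

def half_mse(targets, outputs):
    """E(w) = 1/(2N) * sum_p (t_p - o_p)^2, i.e. MSE divided by 2."""
    targets = np.asarray(targets, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    return np.sum((targets - outputs) ** 2) / (2 * len(targets))

# Three training patterns with targets 1, 0, 1 and network outputs 0.9, 0.2, 0.4.
print(half_mse([1, 0, 1], [0.9, 0.2, 0.4]))  # (0.01 + 0.04 + 0.36) / 6 ≈ 0.068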

Gradient Descent. One technique that can be used for minimizing functions is gradient descent. Can we use this on our error function $E$? We would like a learning rule that tells us how to update weights, like this: $w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$. But what should $\Delta w_{ij}$ be?

Gradient and Derivatives: The Idea. The derivative is a measure of the rate of change of a function as its input changes. For a function $y = f(x)$, the derivative $\frac{dy}{dx}$ indicates how much $y$ changes in response to changes in $x$. If $x$ and $y$ are real numbers, and if the graph of $y$ is plotted against $x$, the derivative measures the slope or gradient of the line at each point, i.e., it describes the steepness or incline.

Gradient and Derivatives: The Idea. $\frac{dy}{dx} > 0$ implies that $y$ increases as $x$ increases; if we want to find the minimum of $y$, we should reduce $x$. $\frac{dy}{dx} < 0$ implies that $y$ decreases as $x$ increases; if we want to find the minimum of $y$, we should increase $x$. $\frac{dy}{dx} = 0$ implies that we are at a minimum, a maximum, or a plateau. To get closer to the minimum: $x_{\text{new}} = x_{\text{old}} - \eta \frac{dy}{dx}$.
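A minimal gradient-descent loop on a one-variable function; the example function $y = (x - 3)^2$, the starting point, and the learning rate are my own choices.

def dy_dx(x):
    """Derivative of y = (x - 3)^2, whose minimum is at x = 3."""
    return 2 * (x - 3)

x = 0.0      # starting point
eta = 0.1    # learning rate
for _ in range(50):
    x = x - eta * dy_dx(x)   # x_new = x_old - eta * dy/dx
print(x)     # close to 3.0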

Gradient and Derivatives: The Idea. So we know how to use derivatives to adjust one input value. But we have several weights to adjust! We need to use partial derivatives. A partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant. If $y = f(x_1, x_2)$, then we can have $\frac{\partial y}{\partial x_1}$ and $\frac{\partial y}{\partial x_2}$. In our learning-rule case, if we can work out the partial derivatives, we can use this rule to update the weights: $w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$ where $\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$.

Summary So Far. We learned what a multilayer perceptron is. We know a learning rule for updating weights in order to minimise the error: $w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$ where $\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$. $\Delta w_{ij}$ tells us in which direction and how much we should change each weight to roll down the slope (descend the gradient) of the error function $E$. So, how do we calculate $\frac{\partial E}{\partial w_{ij}}$?

Using Gradient Descent to Minimize the Error. [Diagram: neuron $i$ with activation function $f$, receiving input from neuron $j$ via weight $w_{ij}$.] The mean squared error function $E$, which we want to minimize: $E(\mathbf{w}) = \frac{1}{2N} \sum_{p=1}^{N} (t_p - o_p)^2$.

Using Gradient Descent to Minimize the Error. If we use a sigmoid activation function $f$, then the output of neuron $i$ for pattern $p$ is: $o_i^p = f(u_i) = \frac{1}{1 + e^{-a u_i}}$ where $a$ is a pre-defined constant and $u_i$ is the result of the input function in neuron $i$: $u_i = \sum_j w_{ij} x_{ij}$.

Using Gradient Descent to Minimize the Error. For the $p$th pattern and the $i$th neuron, we use gradient descent on the error function: $\Delta w_{ij} = -\eta \frac{\partial E^p}{\partial w_{ij}} = \eta (t_i^p - o_i^p) f'(u_i) x_{ij}$ where $f'(u_i) = \frac{df}{du_i}$ is the derivative of $f$ with respect to $u_i$. If $f$ is the sigmoid function, $f'(u_i) = a f(u_i)(1 - f(u_i))$.

Using Gradient Descent to Minimize the Error. We can update weights after processing each pattern, using the rule: $\Delta w_{ij} = \eta (t_i^p - o_i^p) f'(u_i) x_{ij} = \eta \delta_i^p x_{ij}$. This is known as the generalized delta rule. We need to use the derivative of the activation function $f$, so $f$ must be differentiable! The threshold activation function is not continuous and thus not differentiable. The sigmoid has a derivative that is easy to calculate.
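A sketch of the generalized delta rule for a single sigmoid output neuron on one pattern; the Python function and variable names, learning rate, and example values are illustrative assumptions.

import numpy as np

def sigmoid(u, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * u))

def delta_rule_update(w, x, t, eta=0.5, a=1.0):
    """One generalized-delta-rule update for a sigmoid neuron.
    w: weights, x: inputs (one pattern), t: target output."""
    u = np.dot(w, x)               # input function u_i = sum_j w_ij x_ij
    o = sigmoid(u, a)              # output o_i = f(u_i)
    f_prime = a * o * (1.0 - o)    # f'(u_i) = a f(u_i)(1 - f(u_i))
    delta = (t - o) * f_prime      # delta_i = (t_i - o_i) f'(u_i)
    return w + eta * delta * x     # w_ij <- w_ij + eta * delta_i * x_ij

w = np.array([0.1, -0.2, 0.05])
x = np.array([1.0, 0.5, -1.0])
w = delta_rule_update(w, x, t=1.0)
print(w)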

Updating Output vs Hidden Neurons. We can update output neurons using the generalized delta rule: $\Delta w_{ij} = \eta \delta_i^p x_{ij}$ with $\delta_i^p = (t_i^p - o_i^p) f'(u_i)$. This $\delta_i^p$ is only good for the output neurons, since it relies on target outputs. But we don't have target outputs for the hidden nodes! What can we use instead? $\delta_i^p = \left( \sum_k w_{ki} \delta_k^p \right) f'(u_i)$. This rule propagates error back from output nodes to hidden nodes. In effect, it blames hidden nodes according to how much influence they had. So, now we have rules for updating both output and hidden neurons!
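A sketch of computing hidden-layer deltas from output-layer deltas; the layer sizes, weights, and delta values below are invented for illustration.

import numpy as np

def sigmoid(u, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * u))

a = 1.0
u_hidden = np.array([0.2, -0.4, 0.9])        # net inputs u_i of 3 hidden neurons
o_hidden = sigmoid(u_hidden, a)              # hidden outputs f(u_i)
W_out = np.array([[0.3, -0.1, 0.5],          # w_ki: weight from hidden neuron i
                  [0.2,  0.4, -0.3]])        #       to output neuron k
delta_out = np.array([0.05, -0.02])          # output deltas (t_k - o_k) f'(u_k)

# delta_i = (sum_k w_ki * delta_k) * f'(u_i), with f'(u) = a f(u)(1 - f(u))
f_prime_hidden = a * o_hidden * (1.0 - o_hidden)
delta_hidden = (W_out.T @ delta_out) * f_prime_hidden
print(delta_hidden)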

Backpropagation Illustration. 1. Present the pattern at the input layer. 2. Propagate forward activations. 3. Calculate the error for the output neurons and propagate the error backward. 4. Calculate $\frac{\partial E}{\partial w_{ij}}$. 5. Repeat for all patterns and change each weight by the sum.

Online Backpropagation
1: Initialize all weights to small random values.
2: repeat
3: for each training example do
4: Forward-propagate the input features of the example to determine the MLP's outputs.
5: Back-propagate the error to compute $\Delta w_{ij}$ for all weights $w_{ij}$.
6: Update the weights using $\Delta w_{ij}$.
7: end for
8: until stopping criteria reached
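Putting the algorithm together, here is a minimal, self-contained sketch of online backpropagation for a one-hidden-layer MLP in Python/NumPy. The XOR data, the layer sizes, the learning rate, and the number of epochs are my own illustrative choices, not from the slides, and convergence depends on the random initialization.

import numpy as np

rng = np.random.default_rng(0)
a, eta = 1.0, 0.5                         # sigmoid slope and learning rate

def f(u):                                 # sigmoid activation
    return 1.0 / (1.0 + np.exp(-a * u))

# XOR problem: 2 inputs -> 3 hidden neurons -> 1 output (with bias units).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0], dtype=float)

W1 = rng.normal(scale=0.5, size=(3, 3))   # hidden weights (last column = bias)
W2 = rng.normal(scale=0.5, size=(1, 4))   # output weights (last column = bias)

for epoch in range(10000):                # repeat until stopping criterion
    for x, t in zip(X, T):                # for each training example
        # Forward pass
        x1 = np.append(x, 1.0)            # inputs plus bias
        h = f(W1 @ x1)                    # hidden activations
        h1 = np.append(h, 1.0)            # hidden activations plus bias
        o = f(W2 @ h1)                    # output activation

        # Backward pass: deltas for output and hidden neurons
        delta_o = (t - o) * a * o * (1 - o)
        delta_h = (W2[:, :3].T @ delta_o) * a * h * (1 - h)

        # Weight updates: w_ij <- w_ij + eta * delta_i * x_ij
        W2 += eta * np.outer(delta_o, h1)
        W1 += eta * np.outer(delta_h, x1)

for x in X:
    h1 = np.append(f(W1 @ np.append(x, 1.0)), 1.0)
    print(x, np.round(f(W2 @ h1), 2))     # outputs should approach 0, 1, 1, 0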

Summary. We learned what a multilayer perceptron is. We have some intuition about using gradient descent on an error function. We know a learning rule for updating weights in order to minimize the error: $\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$. If we use the squared error, we get the generalized delta rule: $\Delta w_{ij} = \eta \delta_i^p x_{ij}$. We know how to calculate $\delta_i^p$ for output and hidden layers. We can use this rule to learn an MLP's weights with the backpropagation algorithm. Next lecture: a neural network model of the past tense.