Neural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Similar documents
SPSS, University of Texas at Arlington. Topics in Machine Learning-EE 5359 Neural Networks

y(x n, w) t n 2. (1)

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Multilayer Perceptron

Lab 5: 16 th April Exercises on Neural Networks

AN INTRODUCTION TO NEURAL NETWORKS. Scott Kuindersma November 12, 2009

Neural Networks and the Back-propagation Algorithm

Neural Networks Learning the network: Backprop , Fall 2018 Lecture 4

Revision: Neural Network

Neural Networks, Computation Graphs. CMSC 470 Marine Carpuat

1 What a Neural Network Computes

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Neural Networks (Part 1) Goals for the lecture

Introduction Neural Networks - Architecture Network Training Small Example - ZIP Codes Summary. Neural Networks - I. Henrik I Christensen

Artificial Intelligence

Linear Models for Regression

ARTIFICIAL INTELLIGENCE. Artificial Neural Networks

Artifical Neural Networks

Neural Networks. Nicholas Ruozzi University of Texas at Dallas

Unit 8: Introduction to neural networks. Perceptrons

Multilayer Neural Networks

Data Mining Part 5. Prediction

CSE446: Neural Networks Spring Many slides are adapted from Carlos Guestrin and Luke Zettlemoyer

Feedforward Neural Nets and Backpropagation

Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions

Course 395: Machine Learning - Lectures

ECE521 Lectures 9 Fully Connected Neural Networks

Machine Learning. Neural Networks

CSCI567 Machine Learning (Fall 2018)

Machine Learning

COMP 551 Applied Machine Learning Lecture 14: Neural Networks

Lecture 4: Perceptrons and Multilayer Perceptrons

Neural networks. Chapter 20. Chapter 20 1

Artificial Neural Networks Examination, June 2005

ARTIFICIAL NEURAL NETWORK PART I HANIEH BORHANAZAD

Neural Network Training

AI Programming CS F-20 Neural Networks

Neural Nets Supervised learning

2015 Todd Neller. A.I.M.A. text figures 1995 Prentice Hall. Used by permission. Neural Networks. Todd W. Neller

Neural Networks biological neuron artificial neuron 1

Nonlinear Classification

Introduction to Machine Learning

Neural Networks and Deep Learning

Neural Networks: Backpropagation

COGS Q250 Fall Homework 7: Learning in Neural Networks Due: 9:00am, Friday 2nd November.

) (d o f. For the previous layer in a neural network (just the rightmost layer if a single neuron), the required update equation is: 2.

Introduction to Neural Networks

Machine Learning and Data Mining. Linear classification. Kalev Kask

Neural Networks. Xiaojin Zhu Computer Sciences Department University of Wisconsin, Madison. slide 1

CSC 411 Lecture 10: Neural Networks

Single layer NN. Neuron Model

Multilayer Neural Networks

Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers

Artificial Neural Networks. Edward Gatt

Logistic Regression & Neural Networks

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

4. Multilayer Perceptrons

Artificial Neural Networks

Multilayer Perceptrons and Backpropagation

Learning and Neural Networks

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Artificial Neural Networks

Pattern Classification

Lecture 5: Logistic Regression. Neural Networks

Artificial Neural Networks Examination, March 2004

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

(Feed-Forward) Neural Networks Dr. Hajira Jabeen, Prof. Jens Lehmann

Deep Feedforward Networks

Multilayer Perceptron = FeedForward Neural Network

CSE 190 Fall 2015 Midterm DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO START!!!!

Reading Group on Deep Learning Session 1

CMSC 421: Neural Computation. Applications of Neural Networks

Sections 18.6 and 18.7 Analysis of Artificial Neural Networks

Lecture 4: Training a Classifier

Back-Propagation Algorithm. Perceptron Gradient Descent Multilayered neural network Back-Propagation More on Back-Propagation Examples

Artificial Neural Network

Feed-forward Networks Network Training Error Backpropagation Applications. Neural Networks. Oliver Schulte - CMPT 726. Bishop PRML Ch.

Artificial Neural Networks. MGS Lecture 2

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Artificial Neural Networks" and Nonparametric Methods" CMPSCI 383 Nov 17, 2011!

Machine Learning. Neural Networks. (slides from Domingos, Pardo, others)

Unit III. A Survey of Neural Network Model

Intelligent Systems Discriminative Learning, Neural Networks

Machine Learning (CSE 446): Neural Networks

Linear discriminant functions

Multilayer Neural Networks. (sometimes called Multilayer Perceptrons or MLPs)

Neural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17

NONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Security Analytics. Topic 6: Perceptron and Support Vector Machine

Multilayer Perceptrons (MLPs)

Classification goals: Make 1 guess about the label (Top-1 error) Make 5 guesses about the label (Top-5 error) No Bounding Box

Linear Models for Classification

Neural Networks. Bishop PRML Ch. 5. Alireza Ghane. Feed-forward Networks Network Training Error Backpropagation Applications

Training Multi-Layer Neural Networks. - the Back-Propagation Method. (c) Marcin Sydow

Neural networks CMSC 723 / LING 723 / INST 725 MARINE CARPUAT. Slides credit: Graham Neubig

CSC321 Lecture 6: Backpropagation

The perceptron learning algorithm is one of the first procedures proposed for learning in neural network models and is mostly credited to Rosenblatt.

Sections 18.6 and 18.7 Artificial Neural Networks

CS 6501: Deep Learning for Computer Graphics. Basics of Neural Networks. Connelly Barnes

Transcription:

Neural Networks CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1

Perceptrons x 0 = 1 x 1 x 2 z = h w T x Output: z x D A perceptron is a function that maps D-dimensional vectors to real numbers. For notational convenience, we add a zero-th dimension to every input vector, that is always equal to 1. x 0 is called the bias input. It is always equal to 1. x = 1 x 1 x 2 x D w 0 is called the bias weight. It is optimized during training. 2

Perceptrons x 0 = 1 x 1 x 2 x D z = h w T x Output: z A perceptron computes its output z in two steps: First step: a = w T x = Second step: z = h a D i=0 w i x i D In a single formula: z = h i=0 w i x i 3

Perceptrons x 0 = 1 x 1 x 2 x D z = h w T x Output: z A perceptron computes its output z in two steps: First step: a = w T x = Second step: z = h a D i=0 w i x i h is called an activation function. For example, h could be the sigmoidal function σ a = 1 1+e a 4

Perceptrons x 0 = 1 x 1 x 2 x D z = h w T x Output: z We have seen perceptrons before, we just did not call them perceptrons. For example, logistic regression produces a classifier function y x = σ w Τ φ(x). If we set φ x = x and h = σ, then y x is a perceptron. 5

Perceptrons x 0 = 1 and Neurons x 1 z = h w T x Output: z x 2 x D Perceptrons are inspired by neurons. Neurons are the cells forming the nervous system, and the brain. Neurons somehow sum up their inputs, and if the sum exceeds a threshold, they "fire". Since brains are "intelligent", computer scientists have been hoping that perceptron-based systems can be used to model intelligence. 6

Activation Functions A perceptron produces output z = h w T x. One choice for the activation function h: the step function. h a = 0, if a < 0 1, if a 0 The step function is useful for providing some intuitive examples. It is not useful for actual real-world systems. Reason: it is not differentiable, it does not allow optimization via gradient descent. 7

Activation Functions A perceptron produces output z = h w T x. Another choice for the activation function h(a): the sigmoidal function. σ a = 1 1+e a The sigmoidal is often used in real-world systems. It is a differentiable function, it allows use of gradient descent. 8

Example: The AND Perceptron Suppose we use the step function for activation. Suppose boolean value false is represented as number 0. Suppose boolean value true is represented as number 1. Then, the perceptron below computes the boolean AND function: x 0 = 1 false AND false = false false AND true = false true AND false = false true AND true = true x 1 z = h w T x Output: z x 2 9

Example: The AND Perceptron Verification: If x 1 = 0 and x 2 = 0: w T x = 1.5 + 1 0 + 1 0 = 1.5. h w T x = h 1.5 = 0. Corresponds to case false AND false = false. x 0 = 1 false AND false = false false AND true = false true AND false = false true AND true = true x 1 z = h w T x Output: z x 2 10

Example: The AND Perceptron Verification: If x 1 = 0 and x 2 = 1: w T x = 1.5 + 1 0 + 1 1 = 0.5. h w T x = h 0.5 = 0. Corresponds to case false AND true = false. x 0 = 1 false AND false = false false AND true = false true AND false = false true AND true = true x 1 z = h w T x Output: z x 2 11

Example: The AND Perceptron Verification: If x 1 = 1 and x 2 = 0: w T x = 1.5 + 1 1 + 1 0 = 0.5. h w T x = h 0.5 = 0. Corresponds to case true AND false = false. x 0 = 1 false AND false = false false AND true = false true AND false = false true AND true = true x 1 z = h w T x Output: z x 2 12

Example: The AND Perceptron Verification: If x 1 = 1 and x 2 = 1: w T x = 1.5 + 1 1 + 1 1 = 0.5. h w T x = h 0.5 = 1. Corresponds to case true AND true = true. x 0 = 1 false AND false = false false AND true = false true AND false = false true AND true = true x 1 z = h w T x Output: z x 2 13

Example: The OR Perceptron Suppose we use the step function for activation. Suppose boolean value false is represented as number 0. Suppose boolean value true is represented as number 1. Then, the perceptron below computes the boolean OR function: x 0 = 1 false OR false = false false OR true = true true OR false = true true OR true = true x 1 z = h w T x Output: z x 2 14

Example: The OR Perceptron Verification: If x 1 = 0 and x 2 = 0: w T x = 0.5 + 1 0 + 1 0 = 0.5. h w T x = h 0.5 = 0. Corresponds to case false OR false = false. x 0 = 1 false OR false = false false OR true = true true OR false = true true OR true = true x 1 z = h w T x Output: z x 2 15

Example: The OR Perceptron Verification: If x 1 = 0 and x 2 = 1: w T x = 0.5 + 1 0 + 1 1 = 0.5. h w T x = h 0.5 = 1. Corresponds to case false OR true = true. x 0 = 1 false OR false = false false OR true = true true OR false = true true OR true = true x 1 z = h w T x Output: z x 2 16

Example: The OR Perceptron Verification: If x 1 = 1 and x 2 = 0: w T x = 0.5 + 1 1 + 1 0 = 0.5. h w T x = h 0.5 = 1. Corresponds to case true OR false = true. x 0 = 1 false OR false = false false OR true = true true OR false = true true OR true = true x 1 z = h w T x Output: z x 2 17

Example: The OR Perceptron Verification: If x 1 = 1 and x 2 = 1: w T x = 0.5 + 1 1 + 1 1 = 1.5. h w T x = h 1.5 = 1. Corresponds to case true OR true = true. x 0 = 1 false OR false = false false OR true = true true OR false = true true OR true = true x 1 z = h w T x Output: z x 2 18

Example: The NOT Perceptron Suppose we use the step function for activation. Suppose boolean value false is represented as number 0. Suppose boolean value true is represented as number 1. Then, the perceptron below computes the boolean NOT function: x 0 = 1 NOT(false) = true NOT(true) = false x 1 w 1 = 1 z = h w T x Output: z 19

Example: The NOT Perceptron Verification: If x 1 = 0: w T x = 0.5 1 0 = 0.5. h w T x = h 0.5 = 1. Corresponds to case NOT(false) = true. x 0 = 1 NOT(false) = true NOT(true) = false x 1 w 1 = 1 z = h w T x Output: z 20

Example: The NOT Perceptron Verification: If x 1 = 1: w T x = 0.5 1 1 = 0.5. h w T x = h 0.5 = 0. Corresponds to case NOT(true) = false. x 0 = 1 NOT(false) = true NOT(true) = false x 1 w 1 = 1 z = h w T x Output: z 21

The XOR Function false XOR false = false false XOR true = true true XOR false = true true XOR true = false As before, we represent false with 0 and true with 1. The figure shows the four input points of the XOR function. green corresponds to output value true. red corresponds to output value false. The two classes (true and false) are not linearly separable. Therefore, no perceptron can compute the XOR function. 22

Neural Networks A neural network is built using perceptrons as building blocks. The inputs to some perceptrons are outputs of other perceptrons. Here is an example neural network computing the XOR function. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 23

Neural Networks To simplify the picture, we do not show the bias input anymore. We just show the bias weights w j,0. Besides the bias input, there are two inputs: x 1, x 2. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 24

Neural Networks This neural network example consists of six units: Three input units (including the not-shown bias input). Three perceptrons. Yes, inputs count as units. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 25

Neural Networks Weights are denoted as w ji. Weight w ji belongs to the edge that connects the output of unit i with an input of unit j. Units 0, 1,, D are the input units (units 0, 1, 2 in this example). Unit 3 x 1 Unit 5 Output: x 2 Unit 4 26

Neural Network Layers Oftentimes, neural networks are organized into layers. The input layer is the initial layer of input units (units 0, 1, 2 in our example). The output layer is at the end (unit 5 in our example). Zero, one or more hidden layers can be between the input and output layers. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 27

Neural Network Layers There is only one hidden layer in our example, containing units 4 and 5. Each hidden layer's inputs are outputs from the previous layer. Each hidden layer's outputs are inputs to the next layer. The first hidden layer's inputs come from the input layer. The last hidden layer's outputs are inputs to the output layer. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 28

Feedforward Networks Feedforward networks are networks where there are no directed loops. If there are no loops, the output of a neuron cannot (directly or indirectly) influence its input. While there are varieties of neural networks that are not feedforward or layered, our main focus will be layered feedforward networks. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 29

Computing the Output Notation: L is the number of layers. Layer 1 is the input layer, layer L is the output layer. Given values for the input units, output is computed as follows: For (l = 2; l L; l = l + 1): Compute the outputs of layer L, given the outputs of layer L-1. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 30

Computing the Output To compute the outputs of layer l (where l > 1), we simply need to compute the output of each perceptron belonging to layer l. For each such perceptron, its inputs are coming from outputs of perceptrons at layer l 1. Remember, we compute layer outputs in increasing order of l. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 31

What Neural Networks Can Compute An individual perceptron is a linear classifier. The weights of the perceptron define a linear boundary between two classes. Layered feedforward neural networks with one hidden layer can compute any continuous function. Layered feedforward neural networks with two hidden layers can compute any mathematical function. This has been known for decades, and is one reason scientists have been optimistic about the potential of neural networks to model intelligent systems. Another reason is the analogy between neural networks and biological brains, which have been a standard of intelligence we are still trying to achieve. There is only one catch: How do we find the right weights? 32

Training a Neural Network In linear regression, for the sum-of-squares error, we could find the best weights using a closed-form formula. In logistic regression, for the cross-entropy error, we could find the best weights using an iterative method. In neural networks, we cannot find the best weights (unless we have an astronomical amount of luck). We only have optimization methods that find local minima of the error function. Still, in recent years such methods have produced spectacular results in real-world applications. 33

Notation for Training Set We define w to be the vector of all weights in the neural network. We have a set X of N training examples. X = {x 1, x 2,, x N } Each x n is a (D+1)-dimensional column vector. Dimension 0 is the bias input, always set to 1. x n = (1, x n1, x n2,, x nd ) We also have a set T of N target outputs. T = t 1, t 2,, t N t n is the target output for training example x n. Each t n is a K-dimensional column vector: t n = (t n1, t n2,, t nk ) Note: K typically is not equal to D. 34

Perceptron Learning Before we discuss how to train an entire neural network, we start with a single perceptron. Remember: given input x n, a perceptron computes its output z using this formula: z(x) = h w T x We use sum-of-squares as our error function. E n (w) is the contribution of training example x n : E n w = 1 2 z(x n) t n 2 N The overall error E is defined as: E w = n=1 E n w 35

Perceptron Learning Suppose that a perceptron is using the step function as its activation function h. h a = 0, if a < 0 1, if a 0 z(x) = h w T x = 0, if wt x < 0 1, if w T x 0 Can we apply gradient descent in that case? No, because E(w) is not differentiable. Small changes of w usually lead to no changes in h w T x. The only exception is when the change in w causes h w T x to switch signs (from positive to negative, or from negative to positive). 36

Perceptron Learning A better option is setting h to the sigmoid function: z x = h w T x = 1 1 + e wt x Then, measured just on a single training object x n, the error E n (w) is defined as: E n w = 1 2 t n z x n 2 = 1 2 t n 1 1 + e wt x n 2 Note: here we use the sum-of-squares error, and not the crossentropy error that we used for logistic regression. Also note: if our neural network is a single perceptron, then the target output t n is one-dimensional. 37

Computing the Gradient E n w = 1 2 t n z x n 2 = 1 2 t n 1 1+e wt xn 2 In this form, E n w is differentiable. If we do the calculations, the gradient turns out to be: E n w = 1 2 t n z x n z x n 1 z x n x n Note that E n is a (D+1) dimensional vector. It is a w scalar (shown in red) multiplied by vector x n. 38

Weight Update E n w = 1 2 t n z x n z x n 1 z x n x n So, we update the weight vector w as follows: w = w η t n z x n z x n 1 z x n x n As before, η is the learning rate parameter. It is a positive real number that should be chosen carefully, so as not to be too big or too small. In terms of individual weights w d, the update rule is: w d = w d η t n z x n z x n 1 z x n x nd 39

Perceptron Learning - Summary Input: Training inputs x 1,,, x N, target outputs t 1,, t N 1. Extend each x n to a (D+1) dimensional vector, by adding 1 (the bias input) as the value for dimension 0. 2. Initialize weights w d to small random numbers. For example, set each w d between -0.1 and 0.1 3. For n = 1 to N: 1. Compute z x n. 2. For d = 0 to D: w d = w d η t n z x n z x n 1 z x n x nd 4. If some stopping criterion has been met, exit. 5. Else, go to step 3. 40

Stopping Criterion At step 4 of the perceptron learning algorithm, we need to decide whether to stop or not. One thing we can do is: Compute the cumulative squared error E(w) of the perceptron at that point: N E w = E n w n=1 = N n=1 1 2 t n z x n 2 Compare the current value of E w with the value of E w computed at the previous iteration. If the difference is too small (e.g., smaller than 0.00001) we stop. 41

Using Perceptrons for Multiclass Problems A perceptron outputs a number between 0 and 1. This is sufficient only for binary classification problems. For more than two classes, there are many different options. We will follow a general approach called one-versusall classification. 42

One-Versus-All Perceptrons Suppose we have K classes C 1,, C K, where K > 2. We have training inputs x 1,,, x N, and target values t 1,, t N. Each target value t n is a K-dimensional vector: t n = (t n1, t n2,, t nk ) t nk = 0 if the class of x n is not C k. t nk = 1 if the class of x n is C k. For each class C k, train a perceptron z k by using t nk as the target value for x n. So, perceptron z k is trained to recognize if an object belongs to class C k or not. In total, we train K perceptrons, one for each class. 43

One-Versus-All Perceptrons To classify a test pattern x: Compute the responses z k (x) for all K perceptrons. Find the perceptron z k such that the value z k (x) is higher than all other responses. Output that the class of x is C k. In summary: we assign x to the class whose perceptron produced the highest output value for x. 44

Neural Network Notation U is the total number of units in the neural network. Each unit, is denoted as P j, where 0 j U 1. Units P 0,, P D are the input units. Unit P 0 is the bias input, always equal to 1. We denote by w ji the weight of the edge connecting the output of P i to an input of P j. We denote by z j the output of unit P j. If 0 j D, then z j = x j. We denote by vector z the vector of all outputs: z = (z 0, z 1,, z U 1 ) 45

Target Value Notation To make notation more convenient, we treat target value t n as a U-dimensional vector. In other words, the dimensionality of t n is equal to the number of units in the network. If P j is the k-th output unit, and x n belongs to class C k, then: t nj = 1. t n will have values 0 in all other dimensions. This way, there is a one-to-one correspondence between dimensions of t n and dimensions of z. We do this only to simplify notation. See formula for defining error, in the next slide. 46

Squared Error for Neural Networks We define Y to be the set of output units: Y = P j P j belongs to the output layer As usual, we denote by E n (w) the contribution that training input x n makes to the overall error E(w). E n (w) = 1 2 j:p j Y t nj z j 2 Note that only the output from output units contributes to the error. If P j is not an output unit, then t nj and z j get ignored. 47

Squared Error for Neural Networks As usual, we denote by E(w) the overall error over all training examples: N E w = E n w n=1 = 1 2 N n=1 j:p j Y t nj z j 2 This is now a double summation. We sum over all training examples x n. For each x n, we sum over all perceptrons in the output layer. 48

Training Neural Networks To train a neural network, we follow the same approach of sequential learning that we followed for training single perceptrons: Given a training example x n and target output t n : Compute the training error E n (w). Compute the gradient E n w. Based on the gradient, we can update all weights. The process of computing the gradient and updating neural network weights is called backpropagation. We will see the solution when we use the sigmoidal function as activation function h. 49

Computing the Gradient Overall, we want to compute E n w. E n w is a vector containing all weights w ji. Therefore, it suffices to compute, for each w ji, the partial derivative E n w ji. In order to compute E n w ji, we will use this strategy: Decompose E n into a composition of simpler functions. Compute the derivative of each of those simpler functions. Apply the chain rule to obtain E n w ji. 50

Decomposing the Error Function Let P j be a perceptron in the neural network. Define function a j x n, w = w ji z i x n, w i a j x n, w is simply the weighted sum of the inputs of P j, given current input x n and given the current value of w. Define function z j α = σ α = 1 1+e a. The output of P j is z j a j x n, w. 51

Decomposing the Error Function Define y j to be a vector containing all outputs of all perceptrons belonging to the same layer as P j. Define function E nj y j to be the error of the network given the outputs of all perceptrons of the layer that P j belongs to. Intuition for E nj y j : Suppose that you do not know anything about layers before the layer of perceptron P j. Suppose that you know y j, and all layers after the layer of P j. Then, you can still compute the output of the network, and the error E n. 52

Visualizing Function E nj As long as we can see the outputs of the layer that P j belongs to, and all layers after the layer of P j, we can compute the output of the network, and the error E n. Previous layers: unknown?????? Layer of P j Unit P q Unit P q+1 Unit P j Unit P r Unit Unit Unit Unit Output layer Unit Unit 53

Decomposing the Error Function We have defined three auxiliary functions: a j x n, w z j α E nj y j Suppose that the perceptrons belonging to the same layer as P j are indexed as P q, P q+1,, P r 1, P r Then, E n is a composition of functions E nj, z j, a j. E n x n, w = E nj y j = E nj z q a q x n, w,, z r a r x n, w 54

Computing the Gradient of E n E n x n, w = E nj z q a q x n, w,, z r a r x n, w Then, we can compute E n w ji by applying the chain rule: E n w ji = E nj z j z j a j a j w ji We will compute each of these three terms. 55

Computing a j w ji E n w ji = E nj z j z j a j a j w ji a j = w ji u w ju z u x n, w w ji = z i x n, w Remember, z i x n, w is just the output of unit P i. The outputs of all units are straightforward to compute, given x n and w. So, computing a j w ji is entirely straightforward. 56

Computing z j a j E n w ji = E nj z j z j a j a j w ji z j = a j σ a j a j = σ a j 1 σ a j = z j 1 z j One of the reasons we like using the sigmoidal function for activation is that its derivative has such a simple form. 57

Computing E nj z j, Case 1: If P j Is an Output Unit If P j is an output unit, then z j is an output of the entire network. z j contributes to the error the term 1 t 2 2 nj z j Therefore: E nj z j = 1 2 t nj z j 2 z j = z j t nj 58

Updating Weights of Output Units If P j is an output unit, then we have computed all the terms we need for E n w ji. E n w ji = E nj z j z j a j a j w ji E n w ji = z j t nj z j 1 z j z i So, if w ji is the weight of an output unit, we update it as follows: w ji = w ji η z j t nj z j 1 z j z i 59

Computing E nj z j, Case 2: If P j Is a Hidden Unit Let P j be a hidden unit. Define L to be the set of all units in the layer after P j. E nj z j = u L E nj z u z u a u a u z j We need to compute these three terms. 60

Computing E nj z j, Case 2: If P j Is a Hidden Unit E nj z j = u L E nj z u z u a u a u z j a u = i w ui z i x n, w z j z j = w uj We computed this already, a few slides ago. z u a u = z u 1 z u 61

Computing E nj z j, Case 2: If P j Is a Hidden Unit E nj z j = u L E nj z u z u a u a u z j In the previous slide, we computed z u a u and a u z j. So, the formula becomes: E nj E nj = z z j z u 1 z u w uj u u L 62

Computing E nj z j, Case 2: If P j Is a Hidden Unit E nj z j = u L E nj z u z u 1 z u w uj Notice that E nj z j is defined using E nj z u. This is a recursive definition. To compute the values for a layer, we use the values from the next layer. This is why the whole algorithm is called backpropagation. We propagate computations from the output layer backwards towards the input layer. 63

Formula for Hidden Units From the previous slides, we have these formulas: E n w ji = E nj z j z j a j a j w ji = E nj z j z j 1 z j z i E nj z j = u L E nj z u z u 1 z u w uj We can combine these formulas, to compute E n w ji for any weight of any hidden unit. Just remember to start the computations from the output layer, and move backwards towards the input layer. 64

Simplifying Notation The previous formulas are sufficient and will work, but look complicated. We can simplify the formulas considerably, by defining: δ j = E nj z j z j a j Then, if we combine calculations we already did: If P j is an output unit, then: δ j = z j t nj z j 1 z j If P j is a hidden unit, then: δ j = δ u w uj u L z j 1 z j 65

Final Backpropagation Formula Using the definition of δ j from the previous slide, we finally get a very simple formula: E n w ji = δ j z i Therefore, given a training input x n, and given a positive learning rate η, each weight w ji is updated using this formula: w ji = w ji ηδ j z i 66

Backpropagation for One Object Step 1: Initialize Input Layer We will now see how to apply backpropagation, step by step, in pseudocode style, for a single training object. First, given a training example x n, and its target output t n, we must initialize the input units: // Array z will store, for every perceptron P j, its output. double z[] = new double[u] // Update the input layer, set inputs equal to x n. For j = 0 to D: z[j] = x nj // x nj is the j-th dimension of x n. 67

Backpropagation for One Object Step 2: Compute Outputs // we create an array a, which will store, for every // perceptron P j, the weighted sum of the inputs of P j. double a[] = new double[u] // Update the rest of the layers: For l = 2 to L: // L is the number of layers: For each perceptron P j in layer l: a[j] = z[j] = h a[j] = w ji z[i] i // weighted sum of inputs of P j 1 1+ e a[j] // output of unit P j 68

Backpropagation for One Object Step 3: Compute New δ Values // we create an array δ, which will store, for every // perceptron P j, value δ j. double δ[] = new double[u] For each output unit P j : δ j = z[j] t nj z[j] 1 z[j] For l = L 1 to 2: // MUST be decreasing order of l For each perceptron P j in layer l: δ j = u: P u layer l+1 δ[u]w uj z j 1 z j 69

Backpropagation for One Object Step 4: Update Weights For l = 2 to L: // Order does not matter here, we can go // from 2 to L or from L to 2. For each perceptron P j in layer l: For each perceptron P i in the preceding layer l 1: w ji = w ji η δ j z[i] IMPORTANT: Do Step 3 before Step 4. Do NOT do steps 3 and 4 as a single loop. All δ values must be computed using the old values of weights. Then, all weights must be updated using the new δ values. 70

Inputs: Backpropagation Summary N D-dimensional training objects x 1,, x N. The associated target values t 1,, t N, which are U-dimensional vectors. 1. Extend each x n to a (D+1) dimensional vector, by adding the bias input as the value for the zero-th dimension. 2. Initialize weights w ji to small random numbers. For example, set each w u,v between -0.1 and 0.1. 3. last_error = E(w) 4. For n = 1 to N: Given x n, update weights w ji as described in the previous slides. 5. err = E(w) 6. If err last_error < threshold, exit. // threshold can be 0.00001. 7. Else: last_error = err, go to step 4. 71

Classification with Neural Networks Suppose we have K classes C 1,, C K, where K > 2. Each class C k corresponds to an output perceptron P u. Given a test pattern x to classify: Extend x to a (D+1)-dimensional vector, by adding the bias input as value for dimension 0. Compute outputs for all units of the network, working from the input layer towards the output layer. Find the output unit P u with the highest output z u. Return the class that corresponds to P u. 72

Structure of Neural Networks Backpropagation describes how to learn weights. However, it does not describe how to learn the structure: How many layers? How many units at each layer? These are parameters that we have to choose somehow. A good way to choose such parameters is by using a validation set, containing examples and their class labels. The validation set should be separate (disjoint) from the training set. 73

Structure of Neural Networks To choose the best structure for a neural network using a validation set, we try many different parameters (number of layers, number of units per layer). For each choice of parameters: We train several neural networks using backpropagation. We measure how well each neural network classifies the validation examples. Why not train just one neural network? 74

Structure of Neural Networks To choose the best structure for a neural network using a validation set, we try many different parameters (number of layers, number of units per layer). For each choice of parameters: We train several neural networks using backpropagation. We measure how well each neural network classifies the validation examples. Why not train just one neural network? Each network is randomly initialized, so after backpropagation it can end up being different from the other networks. At the end, we select the neural network that did best on the validation set. 75