Neural Networks CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1
Perceptrons x 0 = 1 x 1 x 2 z = h w T x Output: z x D A perceptron is a function that maps D-dimensional vectors to real numbers. For notational convenience, we add a zero-th dimension to every input vector, that is always equal to 1. x 0 is called the bias input. It is always equal to 1. x = 1 x 1 x 2 x D w 0 is called the bias weight. It is optimized during training. 2
Perceptrons x 0 = 1 x 1 x 2 x D z = h w T x Output: z A perceptron computes its output z in two steps: First step: a = w T x = Second step: z = h a D i=0 w i x i D In a single formula: z = h i=0 w i x i 3
Perceptrons x 0 = 1 x 1 x 2 x D z = h w T x Output: z A perceptron computes its output z in two steps: First step: a = w T x = Second step: z = h a D i=0 w i x i h is called an activation function. For example, h could be the sigmoidal function σ a = 1 1+e a 4
Perceptrons x 0 = 1 x 1 x 2 x D z = h w T x Output: z We have seen perceptrons before, we just did not call them perceptrons. For example, logistic regression produces a classifier function y x = σ w Τ φ(x). If we set φ x = x and h = σ, then y x is a perceptron. 5
Perceptrons x 0 = 1 and Neurons x 1 z = h w T x Output: z x 2 x D Perceptrons are inspired by neurons. Neurons are the cells forming the nervous system, and the brain. Neurons somehow sum up their inputs, and if the sum exceeds a threshold, they "fire". Since brains are "intelligent", computer scientists have been hoping that perceptron-based systems can be used to model intelligence. 6
Activation Functions A perceptron produces output z = h w T x. One choice for the activation function h: the step function. h a = 0, if a < 0 1, if a 0 The step function is useful for providing some intuitive examples. It is not useful for actual real-world systems. Reason: it is not differentiable, it does not allow optimization via gradient descent. 7
Activation Functions A perceptron produces output z = h w T x. Another choice for the activation function h(a): the sigmoidal function. σ a = 1 1+e a The sigmoidal is often used in real-world systems. It is a differentiable function, it allows use of gradient descent. 8
Example: The AND Perceptron Suppose we use the step function for activation. Suppose boolean value false is represented as number 0. Suppose boolean value true is represented as number 1. Then, the perceptron below computes the boolean AND function: x 0 = 1 false AND false = false false AND true = false true AND false = false true AND true = true x 1 z = h w T x Output: z x 2 9
Example: The AND Perceptron Verification: If x 1 = 0 and x 2 = 0: w T x = 1.5 + 1 0 + 1 0 = 1.5. h w T x = h 1.5 = 0. Corresponds to case false AND false = false. x 0 = 1 false AND false = false false AND true = false true AND false = false true AND true = true x 1 z = h w T x Output: z x 2 10
Example: The AND Perceptron Verification: If x 1 = 0 and x 2 = 1: w T x = 1.5 + 1 0 + 1 1 = 0.5. h w T x = h 0.5 = 0. Corresponds to case false AND true = false. x 0 = 1 false AND false = false false AND true = false true AND false = false true AND true = true x 1 z = h w T x Output: z x 2 11
Example: The AND Perceptron Verification: If x 1 = 1 and x 2 = 0: w T x = 1.5 + 1 1 + 1 0 = 0.5. h w T x = h 0.5 = 0. Corresponds to case true AND false = false. x 0 = 1 false AND false = false false AND true = false true AND false = false true AND true = true x 1 z = h w T x Output: z x 2 12
Example: The AND Perceptron Verification: If x 1 = 1 and x 2 = 1: w T x = 1.5 + 1 1 + 1 1 = 0.5. h w T x = h 0.5 = 1. Corresponds to case true AND true = true. x 0 = 1 false AND false = false false AND true = false true AND false = false true AND true = true x 1 z = h w T x Output: z x 2 13
Example: The OR Perceptron Suppose we use the step function for activation. Suppose boolean value false is represented as number 0. Suppose boolean value true is represented as number 1. Then, the perceptron below computes the boolean OR function: x 0 = 1 false OR false = false false OR true = true true OR false = true true OR true = true x 1 z = h w T x Output: z x 2 14
Example: The OR Perceptron Verification: If x 1 = 0 and x 2 = 0: w T x = 0.5 + 1 0 + 1 0 = 0.5. h w T x = h 0.5 = 0. Corresponds to case false OR false = false. x 0 = 1 false OR false = false false OR true = true true OR false = true true OR true = true x 1 z = h w T x Output: z x 2 15
Example: The OR Perceptron Verification: If x 1 = 0 and x 2 = 1: w T x = 0.5 + 1 0 + 1 1 = 0.5. h w T x = h 0.5 = 1. Corresponds to case false OR true = true. x 0 = 1 false OR false = false false OR true = true true OR false = true true OR true = true x 1 z = h w T x Output: z x 2 16
Example: The OR Perceptron Verification: If x 1 = 1 and x 2 = 0: w T x = 0.5 + 1 1 + 1 0 = 0.5. h w T x = h 0.5 = 1. Corresponds to case true OR false = true. x 0 = 1 false OR false = false false OR true = true true OR false = true true OR true = true x 1 z = h w T x Output: z x 2 17
Example: The OR Perceptron Verification: If x 1 = 1 and x 2 = 1: w T x = 0.5 + 1 1 + 1 1 = 1.5. h w T x = h 1.5 = 1. Corresponds to case true OR true = true. x 0 = 1 false OR false = false false OR true = true true OR false = true true OR true = true x 1 z = h w T x Output: z x 2 18
Example: The NOT Perceptron Suppose we use the step function for activation. Suppose boolean value false is represented as number 0. Suppose boolean value true is represented as number 1. Then, the perceptron below computes the boolean NOT function: x 0 = 1 NOT(false) = true NOT(true) = false x 1 w 1 = 1 z = h w T x Output: z 19
Example: The NOT Perceptron Verification: If x 1 = 0: w T x = 0.5 1 0 = 0.5. h w T x = h 0.5 = 1. Corresponds to case NOT(false) = true. x 0 = 1 NOT(false) = true NOT(true) = false x 1 w 1 = 1 z = h w T x Output: z 20
Example: The NOT Perceptron Verification: If x 1 = 1: w T x = 0.5 1 1 = 0.5. h w T x = h 0.5 = 0. Corresponds to case NOT(true) = false. x 0 = 1 NOT(false) = true NOT(true) = false x 1 w 1 = 1 z = h w T x Output: z 21
The XOR Function false XOR false = false false XOR true = true true XOR false = true true XOR true = false As before, we represent false with 0 and true with 1. The figure shows the four input points of the XOR function. green corresponds to output value true. red corresponds to output value false. The two classes (true and false) are not linearly separable. Therefore, no perceptron can compute the XOR function. 22
Neural Networks A neural network is built using perceptrons as building blocks. The inputs to some perceptrons are outputs of other perceptrons. Here is an example neural network computing the XOR function. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 23
Neural Networks To simplify the picture, we do not show the bias input anymore. We just show the bias weights w j,0. Besides the bias input, there are two inputs: x 1, x 2. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 24
Neural Networks This neural network example consists of six units: Three input units (including the not-shown bias input). Three perceptrons. Yes, inputs count as units. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 25
Neural Networks Weights are denoted as w ji. Weight w ji belongs to the edge that connects the output of unit i with an input of unit j. Units 0, 1,, D are the input units (units 0, 1, 2 in this example). Unit 3 x 1 Unit 5 Output: x 2 Unit 4 26
Neural Network Layers Oftentimes, neural networks are organized into layers. The input layer is the initial layer of input units (units 0, 1, 2 in our example). The output layer is at the end (unit 5 in our example). Zero, one or more hidden layers can be between the input and output layers. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 27
Neural Network Layers There is only one hidden layer in our example, containing units 4 and 5. Each hidden layer's inputs are outputs from the previous layer. Each hidden layer's outputs are inputs to the next layer. The first hidden layer's inputs come from the input layer. The last hidden layer's outputs are inputs to the output layer. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 28
Feedforward Networks Feedforward networks are networks where there are no directed loops. If there are no loops, the output of a neuron cannot (directly or indirectly) influence its input. While there are varieties of neural networks that are not feedforward or layered, our main focus will be layered feedforward networks. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 29
Computing the Output Notation: L is the number of layers. Layer 1 is the input layer, layer L is the output layer. Given values for the input units, output is computed as follows: For (l = 2; l L; l = l + 1): Compute the outputs of layer L, given the outputs of layer L-1. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 30
Computing the Output To compute the outputs of layer l (where l > 1), we simply need to compute the output of each perceptron belonging to layer l. For each such perceptron, its inputs are coming from outputs of perceptrons at layer l 1. Remember, we compute layer outputs in increasing order of l. Unit 3 x 1 Unit 5 Output: x 2 Unit 4 31
What Neural Networks Can Compute An individual perceptron is a linear classifier. The weights of the perceptron define a linear boundary between two classes. Layered feedforward neural networks with one hidden layer can compute any continuous function. Layered feedforward neural networks with two hidden layers can compute any mathematical function. This has been known for decades, and is one reason scientists have been optimistic about the potential of neural networks to model intelligent systems. Another reason is the analogy between neural networks and biological brains, which have been a standard of intelligence we are still trying to achieve. There is only one catch: How do we find the right weights? 32
Training a Neural Network In linear regression, for the sum-of-squares error, we could find the best weights using a closed-form formula. In logistic regression, for the cross-entropy error, we could find the best weights using an iterative method. In neural networks, we cannot find the best weights (unless we have an astronomical amount of luck). We only have optimization methods that find local minima of the error function. Still, in recent years such methods have produced spectacular results in real-world applications. 33
Notation for Training Set We define w to be the vector of all weights in the neural network. We have a set X of N training examples. X = {x 1, x 2,, x N } Each x n is a (D+1)-dimensional column vector. Dimension 0 is the bias input, always set to 1. x n = (1, x n1, x n2,, x nd ) We also have a set T of N target outputs. T = t 1, t 2,, t N t n is the target output for training example x n. Each t n is a K-dimensional column vector: t n = (t n1, t n2,, t nk ) Note: K typically is not equal to D. 34
Perceptron Learning Before we discuss how to train an entire neural network, we start with a single perceptron. Remember: given input x n, a perceptron computes its output z using this formula: z(x) = h w T x We use sum-of-squares as our error function. E n (w) is the contribution of training example x n : E n w = 1 2 z(x n) t n 2 N The overall error E is defined as: E w = n=1 E n w 35
Perceptron Learning Suppose that a perceptron is using the step function as its activation function h. h a = 0, if a < 0 1, if a 0 z(x) = h w T x = 0, if wt x < 0 1, if w T x 0 Can we apply gradient descent in that case? No, because E(w) is not differentiable. Small changes of w usually lead to no changes in h w T x. The only exception is when the change in w causes h w T x to switch signs (from positive to negative, or from negative to positive). 36
Perceptron Learning A better option is setting h to the sigmoid function: z x = h w T x = 1 1 + e wt x Then, measured just on a single training object x n, the error E n (w) is defined as: E n w = 1 2 t n z x n 2 = 1 2 t n 1 1 + e wt x n 2 Note: here we use the sum-of-squares error, and not the crossentropy error that we used for logistic regression. Also note: if our neural network is a single perceptron, then the target output t n is one-dimensional. 37
Computing the Gradient E n w = 1 2 t n z x n 2 = 1 2 t n 1 1+e wt xn 2 In this form, E n w is differentiable. If we do the calculations, the gradient turns out to be: E n w = 1 2 t n z x n z x n 1 z x n x n Note that E n is a (D+1) dimensional vector. It is a w scalar (shown in red) multiplied by vector x n. 38
Weight Update E n w = 1 2 t n z x n z x n 1 z x n x n So, we update the weight vector w as follows: w = w η t n z x n z x n 1 z x n x n As before, η is the learning rate parameter. It is a positive real number that should be chosen carefully, so as not to be too big or too small. In terms of individual weights w d, the update rule is: w d = w d η t n z x n z x n 1 z x n x nd 39
Perceptron Learning - Summary Input: Training inputs x 1,,, x N, target outputs t 1,, t N 1. Extend each x n to a (D+1) dimensional vector, by adding 1 (the bias input) as the value for dimension 0. 2. Initialize weights w d to small random numbers. For example, set each w d between -0.1 and 0.1 3. For n = 1 to N: 1. Compute z x n. 2. For d = 0 to D: w d = w d η t n z x n z x n 1 z x n x nd 4. If some stopping criterion has been met, exit. 5. Else, go to step 3. 40
Stopping Criterion At step 4 of the perceptron learning algorithm, we need to decide whether to stop or not. One thing we can do is: Compute the cumulative squared error E(w) of the perceptron at that point: N E w = E n w n=1 = N n=1 1 2 t n z x n 2 Compare the current value of E w with the value of E w computed at the previous iteration. If the difference is too small (e.g., smaller than 0.00001) we stop. 41
Using Perceptrons for Multiclass Problems A perceptron outputs a number between 0 and 1. This is sufficient only for binary classification problems. For more than two classes, there are many different options. We will follow a general approach called one-versusall classification. 42
One-Versus-All Perceptrons Suppose we have K classes C 1,, C K, where K > 2. We have training inputs x 1,,, x N, and target values t 1,, t N. Each target value t n is a K-dimensional vector: t n = (t n1, t n2,, t nk ) t nk = 0 if the class of x n is not C k. t nk = 1 if the class of x n is C k. For each class C k, train a perceptron z k by using t nk as the target value for x n. So, perceptron z k is trained to recognize if an object belongs to class C k or not. In total, we train K perceptrons, one for each class. 43
One-Versus-All Perceptrons To classify a test pattern x: Compute the responses z k (x) for all K perceptrons. Find the perceptron z k such that the value z k (x) is higher than all other responses. Output that the class of x is C k. In summary: we assign x to the class whose perceptron produced the highest output value for x. 44
Neural Network Notation U is the total number of units in the neural network. Each unit, is denoted as P j, where 0 j U 1. Units P 0,, P D are the input units. Unit P 0 is the bias input, always equal to 1. We denote by w ji the weight of the edge connecting the output of P i to an input of P j. We denote by z j the output of unit P j. If 0 j D, then z j = x j. We denote by vector z the vector of all outputs: z = (z 0, z 1,, z U 1 ) 45
Target Value Notation To make notation more convenient, we treat target value t n as a U-dimensional vector. In other words, the dimensionality of t n is equal to the number of units in the network. If P j is the k-th output unit, and x n belongs to class C k, then: t nj = 1. t n will have values 0 in all other dimensions. This way, there is a one-to-one correspondence between dimensions of t n and dimensions of z. We do this only to simplify notation. See formula for defining error, in the next slide. 46
Squared Error for Neural Networks We define Y to be the set of output units: Y = P j P j belongs to the output layer As usual, we denote by E n (w) the contribution that training input x n makes to the overall error E(w). E n (w) = 1 2 j:p j Y t nj z j 2 Note that only the output from output units contributes to the error. If P j is not an output unit, then t nj and z j get ignored. 47
Squared Error for Neural Networks As usual, we denote by E(w) the overall error over all training examples: N E w = E n w n=1 = 1 2 N n=1 j:p j Y t nj z j 2 This is now a double summation. We sum over all training examples x n. For each x n, we sum over all perceptrons in the output layer. 48
Training Neural Networks To train a neural network, we follow the same approach of sequential learning that we followed for training single perceptrons: Given a training example x n and target output t n : Compute the training error E n (w). Compute the gradient E n w. Based on the gradient, we can update all weights. The process of computing the gradient and updating neural network weights is called backpropagation. We will see the solution when we use the sigmoidal function as activation function h. 49
Computing the Gradient Overall, we want to compute E n w. E n w is a vector containing all weights w ji. Therefore, it suffices to compute, for each w ji, the partial derivative E n w ji. In order to compute E n w ji, we will use this strategy: Decompose E n into a composition of simpler functions. Compute the derivative of each of those simpler functions. Apply the chain rule to obtain E n w ji. 50
Decomposing the Error Function Let P j be a perceptron in the neural network. Define function a j x n, w = w ji z i x n, w i a j x n, w is simply the weighted sum of the inputs of P j, given current input x n and given the current value of w. Define function z j α = σ α = 1 1+e a. The output of P j is z j a j x n, w. 51
Decomposing the Error Function Define y j to be a vector containing all outputs of all perceptrons belonging to the same layer as P j. Define function E nj y j to be the error of the network given the outputs of all perceptrons of the layer that P j belongs to. Intuition for E nj y j : Suppose that you do not know anything about layers before the layer of perceptron P j. Suppose that you know y j, and all layers after the layer of P j. Then, you can still compute the output of the network, and the error E n. 52
Visualizing Function E nj As long as we can see the outputs of the layer that P j belongs to, and all layers after the layer of P j, we can compute the output of the network, and the error E n. Previous layers: unknown?????? Layer of P j Unit P q Unit P q+1 Unit P j Unit P r Unit Unit Unit Unit Output layer Unit Unit 53
Decomposing the Error Function We have defined three auxiliary functions: a j x n, w z j α E nj y j Suppose that the perceptrons belonging to the same layer as P j are indexed as P q, P q+1,, P r 1, P r Then, E n is a composition of functions E nj, z j, a j. E n x n, w = E nj y j = E nj z q a q x n, w,, z r a r x n, w 54
Computing the Gradient of E n E n x n, w = E nj z q a q x n, w,, z r a r x n, w Then, we can compute E n w ji by applying the chain rule: E n w ji = E nj z j z j a j a j w ji We will compute each of these three terms. 55
Computing a j w ji E n w ji = E nj z j z j a j a j w ji a j = w ji u w ju z u x n, w w ji = z i x n, w Remember, z i x n, w is just the output of unit P i. The outputs of all units are straightforward to compute, given x n and w. So, computing a j w ji is entirely straightforward. 56
Computing z j a j E n w ji = E nj z j z j a j a j w ji z j = a j σ a j a j = σ a j 1 σ a j = z j 1 z j One of the reasons we like using the sigmoidal function for activation is that its derivative has such a simple form. 57
Computing E nj z j, Case 1: If P j Is an Output Unit If P j is an output unit, then z j is an output of the entire network. z j contributes to the error the term 1 t 2 2 nj z j Therefore: E nj z j = 1 2 t nj z j 2 z j = z j t nj 58
Updating Weights of Output Units If P j is an output unit, then we have computed all the terms we need for E n w ji. E n w ji = E nj z j z j a j a j w ji E n w ji = z j t nj z j 1 z j z i So, if w ji is the weight of an output unit, we update it as follows: w ji = w ji η z j t nj z j 1 z j z i 59
Computing E nj z j, Case 2: If P j Is a Hidden Unit Let P j be a hidden unit. Define L to be the set of all units in the layer after P j. E nj z j = u L E nj z u z u a u a u z j We need to compute these three terms. 60
Computing E nj z j, Case 2: If P j Is a Hidden Unit E nj z j = u L E nj z u z u a u a u z j a u = i w ui z i x n, w z j z j = w uj We computed this already, a few slides ago. z u a u = z u 1 z u 61
Computing E nj z j, Case 2: If P j Is a Hidden Unit E nj z j = u L E nj z u z u a u a u z j In the previous slide, we computed z u a u and a u z j. So, the formula becomes: E nj E nj = z z j z u 1 z u w uj u u L 62
Computing E nj z j, Case 2: If P j Is a Hidden Unit E nj z j = u L E nj z u z u 1 z u w uj Notice that E nj z j is defined using E nj z u. This is a recursive definition. To compute the values for a layer, we use the values from the next layer. This is why the whole algorithm is called backpropagation. We propagate computations from the output layer backwards towards the input layer. 63
Formula for Hidden Units From the previous slides, we have these formulas: E n w ji = E nj z j z j a j a j w ji = E nj z j z j 1 z j z i E nj z j = u L E nj z u z u 1 z u w uj We can combine these formulas, to compute E n w ji for any weight of any hidden unit. Just remember to start the computations from the output layer, and move backwards towards the input layer. 64
Simplifying Notation The previous formulas are sufficient and will work, but look complicated. We can simplify the formulas considerably, by defining: δ j = E nj z j z j a j Then, if we combine calculations we already did: If P j is an output unit, then: δ j = z j t nj z j 1 z j If P j is a hidden unit, then: δ j = δ u w uj u L z j 1 z j 65
Final Backpropagation Formula Using the definition of δ j from the previous slide, we finally get a very simple formula: E n w ji = δ j z i Therefore, given a training input x n, and given a positive learning rate η, each weight w ji is updated using this formula: w ji = w ji ηδ j z i 66
Backpropagation for One Object Step 1: Initialize Input Layer We will now see how to apply backpropagation, step by step, in pseudocode style, for a single training object. First, given a training example x n, and its target output t n, we must initialize the input units: // Array z will store, for every perceptron P j, its output. double z[] = new double[u] // Update the input layer, set inputs equal to x n. For j = 0 to D: z[j] = x nj // x nj is the j-th dimension of x n. 67
Backpropagation for One Object Step 2: Compute Outputs // we create an array a, which will store, for every // perceptron P j, the weighted sum of the inputs of P j. double a[] = new double[u] // Update the rest of the layers: For l = 2 to L: // L is the number of layers: For each perceptron P j in layer l: a[j] = z[j] = h a[j] = w ji z[i] i // weighted sum of inputs of P j 1 1+ e a[j] // output of unit P j 68
Backpropagation for One Object Step 3: Compute New δ Values // we create an array δ, which will store, for every // perceptron P j, value δ j. double δ[] = new double[u] For each output unit P j : δ j = z[j] t nj z[j] 1 z[j] For l = L 1 to 2: // MUST be decreasing order of l For each perceptron P j in layer l: δ j = u: P u layer l+1 δ[u]w uj z j 1 z j 69
Backpropagation for One Object Step 4: Update Weights For l = 2 to L: // Order does not matter here, we can go // from 2 to L or from L to 2. For each perceptron P j in layer l: For each perceptron P i in the preceding layer l 1: w ji = w ji η δ j z[i] IMPORTANT: Do Step 3 before Step 4. Do NOT do steps 3 and 4 as a single loop. All δ values must be computed using the old values of weights. Then, all weights must be updated using the new δ values. 70
Inputs: Backpropagation Summary N D-dimensional training objects x 1,, x N. The associated target values t 1,, t N, which are U-dimensional vectors. 1. Extend each x n to a (D+1) dimensional vector, by adding the bias input as the value for the zero-th dimension. 2. Initialize weights w ji to small random numbers. For example, set each w u,v between -0.1 and 0.1. 3. last_error = E(w) 4. For n = 1 to N: Given x n, update weights w ji as described in the previous slides. 5. err = E(w) 6. If err last_error < threshold, exit. // threshold can be 0.00001. 7. Else: last_error = err, go to step 4. 71
Classification with Neural Networks Suppose we have K classes C 1,, C K, where K > 2. Each class C k corresponds to an output perceptron P u. Given a test pattern x to classify: Extend x to a (D+1)-dimensional vector, by adding the bias input as value for dimension 0. Compute outputs for all units of the network, working from the input layer towards the output layer. Find the output unit P u with the highest output z u. Return the class that corresponds to P u. 72
Structure of Neural Networks Backpropagation describes how to learn weights. However, it does not describe how to learn the structure: How many layers? How many units at each layer? These are parameters that we have to choose somehow. A good way to choose such parameters is by using a validation set, containing examples and their class labels. The validation set should be separate (disjoint) from the training set. 73
Structure of Neural Networks To choose the best structure for a neural network using a validation set, we try many different parameters (number of layers, number of units per layer). For each choice of parameters: We train several neural networks using backpropagation. We measure how well each neural network classifies the validation examples. Why not train just one neural network? 74
Structure of Neural Networks To choose the best structure for a neural network using a validation set, we try many different parameters (number of layers, number of units per layer). For each choice of parameters: We train several neural networks using backpropagation. We measure how well each neural network classifies the validation examples. Why not train just one neural network? Each network is randomly initialized, so after backpropagation it can end up being different from the other networks. At the end, we select the neural network that did best on the validation set. 75