
Topics in Machine Learning - EE 5359: Neural Networks. University of Texas at Arlington.

The Perceptron. A perceptron is a function that maps D-dimensional input vectors to real numbers; its output is z = h(w^T x). For notational convenience, we add a zero-th dimension to every input vector that is always equal to 1. x_0 is called the bias input; it is always equal to 1. w_0 is called the bias weight; it is optimized during training.

Perceptron Function. A perceptron computes its output z in two steps. First step (linear combination): a = sum_{d=0}^{D} w_d x_d = w^T x. Second step (nonlinear mapping / activation function): z = h(a), e.g., the sigmoid function h(a) = 1/(1 + e^(-a)). In a single formula: z = h(w^T x).
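To make the two-step computation concrete, here is a minimal Python sketch (not part of the original slides); the function names and the example weight and input values are illustrative placeholders.

```python
import math

def sigmoid(a):
    """Sigmoid activation: h(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def perceptron_output(w, x):
    """Two-step perceptron computation: weighted sum, then activation."""
    a = sum(wd * xd for wd, xd in zip(w, x))   # first step: a = w^T x
    return sigmoid(a)                          # second step: z = h(a)

# Example: D = 2 inputs plus the bias input x_0 = 1 (values are arbitrary).
w = [-1.0, 0.5, 0.5]   # w_0 is the bias weight
x = [1.0, 2.0, 3.0]    # x_0 = 1 is the bias input
print(perceptron_output(w, x))
```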

Motivation. We have seen perceptrons before, we just did not call them perceptrons. For example, logistic regression produces a classifier function y(x). If we take h to be the sigmoid and a = w^T x, then y is a perceptron. Perceptrons are inspired by neurons. Neurons are the cells forming the nervous system and the brain. Neurons somehow sum up their inputs, and if the sum exceeds a threshold, they "fire". Since brains are "intelligent", scientists have been hoping that perceptron-based systems can be used to model intelligence.

Activation Functions. A perceptron produces output z = h(w^T x). One choice for the activation function h: the step function, h(a) = 0 if a < 0, and h(a) = 1 if a >= 0 (a piecewise-constant function). The step function is useful for providing some intuitive examples. It is not useful for actual real-world systems, because it is not differentiable and therefore does not allow optimization via gradient descent.

Sigmoid Activation Function. A perceptron produces output z = h(w^T x). Another choice for the activation function h: the sigmoid function, h(a) = 1/(1 + e^(-a)). The sigmoid is often used in real-world systems. It is a differentiable function, so it allows the use of gradient descent.

Logical AND Perceptron. Suppose we use the step function for activation, boolean value false is represented as the number 0, and boolean value true is represented as the number 1. Then a perceptron with suitably chosen weights (shown in the slide's figure, which is not reproduced in this transcript) computes the boolean AND function: false AND false = false; false AND true = false; true AND false = false; true AND true = true.

Logical OR Perceptron. Under the same conventions (step activation, false = 0, true = 1), a perceptron with suitably chosen weights (figure not reproduced here) computes the boolean OR function: false OR false = false; false OR true = true; true OR false = true; true OR true = true.

Logical NOT Perceptron. Under the same conventions, a perceptron with suitably chosen weights (figure not reproduced here) computes the boolean NOT function: NOT(false) = true; NOT(true) = false.
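The slides' figures with the actual weight values did not survive transcription, so the following Python sketch uses one standard choice of weights that reproduces the AND, OR, and NOT truth tables above; the specific values (-1.5, -0.5, 0.5, and so on) are an assumption, not taken from the slides.

```python
def step(a):
    """Step activation: 0 if a < 0, 1 if a >= 0."""
    return 1 if a >= 0 else 0

def perceptron(w, x):
    """Step-function perceptron; x[0] must be the bias input 1."""
    return step(sum(wd * xd for wd, xd in zip(w, x)))

# One possible choice of weights (illustrative, not from the slides):
AND_W = [-1.5, 1.0, 1.0]   # fires only when both inputs are 1
OR_W  = [-0.5, 1.0, 1.0]   # fires when at least one input is 1
NOT_W = [ 0.5, -1.0]       # fires when its single input is 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2,
              perceptron(AND_W, [1, x1, x2]),
              perceptron(OR_W,  [1, x1, x2]))
print(perceptron(NOT_W, [1, 0]), perceptron(NOT_W, [1, 1]))
```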

Neural Networks. A neural network is built using perceptrons as building blocks. The inputs to some perceptrons are outputs of other perceptrons. Here is an example of a neural network computing the XOR function. (Figure: the input units feed two hidden perceptrons, Unit 3 and Unit 4, whose outputs feed the output perceptron, Unit 5.)
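The figure's weight values are likewise not reproduced in the transcript, so the sketch below wires up one possible weight assignment under which units 3, 4, and 5 compute XOR (unit 3 acting as OR, unit 4 as AND, unit 5 combining them); these values are illustrative, not from the slides.

```python
def step(a):
    return 1 if a >= 0 else 0

def unit(w, inputs):
    """Step perceptron; inputs[0] is the bias input 1."""
    return step(sum(wi * zi for wi, zi in zip(w, inputs)))

def xor_network(x1, x2):
    """One possible weight assignment for the XOR network of units 3, 4, 5."""
    z3 = unit([-0.5, 1.0, 1.0], [1, x1, x2])    # unit 3: OR(x1, x2)
    z4 = unit([-1.5, 1.0, 1.0], [1, x1, x2])    # unit 4: AND(x1, x2)
    z5 = unit([-0.5, 1.0, -1.0], [1, z3, z4])   # unit 5: OR but not AND
    return z5

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_network(x1, x2))
```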

Details. This neural network example consists of six units: three input units (including the not-shown bias input) and three perceptrons. Yes, inputs count as units. Weights are denoted as w_ji. Weight w_ji belongs to the edge that connects the output of unit i with an input of unit j. Units 0, 1, and 2 are the input units in this example.

Network Layers. Oftentimes, neural networks are organized into layers. The input layer is the initial layer of input units (units 0, 1, 2 in our example). The output layer is at the end (unit 5 in our example). Zero, one, or more hidden layers can be between the input and output layers.

Network Layers. There is only one hidden layer in our example, containing units 3 and 4. Each hidden layer's inputs are outputs from the previous layer, and each hidden layer's outputs are inputs to the next layer. The first hidden layer's inputs come from the input layer. The last hidden layer's outputs are inputs to the output layer.

Feedforward Networks. Feedforward networks are networks with no directed loops. If there are no loops, the output of a neuron cannot (directly or indirectly) influence its input. While there are varieties of neural networks that are not feedforward or layered, our main focus will be layered feedforward networks.

Layers and Output. Notation: L is the number of layers. Layer 1 is the input layer, layer L is the output layer. Given values for the input units, the output of the network is computed as follows. For l = 2, ..., L: compute the outputs of layer l, given the outputs of layer l-1. To compute the outputs of layer l (where l > 1), we simply need to compute the output of each perceptron belonging to layer l. For each such perceptron, its inputs come from outputs of perceptrons at layer l-1. Remember, we compute layer outputs in increasing order of l.
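A small Python sketch of this layer-by-layer computation follows; the list-of-weight-matrices data layout and the function names are assumptions made for the example, since the slides describe the procedure abstractly.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(layers, x):
    """Forward pass through a layered feedforward network.

    `layers` is a list of weight matrices (one per non-input layer);
    layers[l][j] holds the weights of perceptron j in that layer, with
    index 0 being the bias weight.  This layout is an illustrative
    assumption, not notation from the slides.
    """
    outputs = x                      # outputs of layer 1 (the input layer)
    for weight_matrix in layers:     # layers 2, ..., L, in increasing order
        extended = [1.0] + outputs   # prepend the bias input
        outputs = [sigmoid(sum(w_d * z_d for w_d, z_d in zip(w_row, extended)))
                   for w_row in weight_matrix]
    return outputs                   # outputs of the output layer

# A 2-input network with one hidden layer of 2 units and 1 output unit:
hidden = [[-0.5, 1.0, 1.0], [-1.5, 1.0, 1.0]]
output = [[-0.5, 1.0, -1.0]]
print(forward([hidden, output], [1.0, 0.0]))
```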

Why Neural Nets Are Worth Using. An individual perceptron can be trained as a linear classifier: the weights of the perceptron define a linear boundary between two classes. Layered feedforward neural nets with one hidden layer can approximate any continuous function to arbitrary accuracy. Layered feedforward neural nets with two hidden layers can approximate any function. Another reason for their appeal is the resemblance between neural networks and biological brains.

NN Weight Training. In linear regression, the residual sum of squares (RSS) was minimized to find the best weights, using a closed-form formula. In logistic regression, the log-likelihood was maximized to find the best weights, using an iterative Newton-Raphson method. In neural networks, we cannot find the globally optimal weights; we only have optimization methods that find local minima of the error (cost) function. Still, in recent years such methods have produced spectacular results in real-world applications.

Basic Training Set Notation. We define w to be the vector of all weights in the neural network. We have a set of N training examples x_1, x_2, ..., x_N. Each x_n is a (D+1)-dimensional column vector; dimension 0 is the bias input, always set to 1: x_n = (1, x_{n,1}, ..., x_{n,D})^T. We also have a set of N target outputs t_1, t_2, ..., t_N, where t_n is the target output for training example x_n. Each t_n is a K-dimensional column vector: t_n = (t_{n,1}, ..., t_{n,K})^T. Note: K typically is not equal to D.

Training a Single Perceptron. Given input x_n, a perceptron computes its output using this formula: z_n = h(w^T x_n). We use sum-of-squares as our error function. The contribution of training example x_n is E_n(w) = (1/2)(z_n - t_n)^2. The overall error is defined as E(w) = sum_{n=1}^{N} E_n(w). The weights are chosen as the minimizer of E(w).

Learning Using the Step Function. Suppose the perceptron uses the step function as its activation function h. Can gradient descent be applied in that case? No, because h (and therefore E(w)) is not differentiable. Small changes of w usually lead to no changes in E(w); the only exception is when a change in w causes the weighted sum w^T x_n to switch sign (from positive to negative, or from negative to positive).

Learning Using the Sigmoid. Set h to the sigmoid function: h(a) = 1/(1 + e^(-a)). Then, measured just on a single training example x_n, the error is defined as: E_n(w) = (1/2)(z_n - t_n)^2 = (1/2)(h(w^T x_n) - t_n)^2 = (1/2)(1/(1 + e^(-w^T x_n)) - t_n)^2. Also note: if our neural network is a single perceptron, then the target output t_n is one-dimensional.

Minimizing E(w). E_n(w) = (1/2)(1/(1 + e^(-w^T x_n)) - t_n)^2. In this form, E_n(w) is differentiable. The gradient turns out to be: ∇E_n(w) = (z_n - t_n) z_n (1 - z_n) x_n, where z_n = h(w^T x_n). Note that ∇E_n(w) is a (D+1)-dimensional vector: it is a scalar, (z_n - t_n) z_n (1 - z_n), multiplied by the vector x_n.
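For readers who want the intermediate steps, the gradient can be derived with the chain rule and the sigmoid identity h'(a) = h(a)(1 - h(a)); the worked derivation below is an expansion added here, not text from the slides.

```latex
\begin{align*}
E_n(w) &= \tfrac{1}{2}\bigl(h(w^\top x_n) - t_n\bigr)^2, \qquad z_n = h(w^\top x_n),\\
\frac{\partial E_n}{\partial w_d}
  &= (z_n - t_n)\,\frac{\partial z_n}{\partial w_d}
   = (z_n - t_n)\,h'(w^\top x_n)\,x_{n,d}
   = (z_n - t_n)\,z_n\,(1 - z_n)\,x_{n,d},\\
\nabla E_n(w) &= (z_n - t_n)\,z_n\,(1 - z_n)\,x_n .
\end{align*}
```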

Updating the Weights. ∇E_n(w) = (z_n - t_n) z_n (1 - z_n) x_n. So, we update the weight vector as follows: w = w - η (z_n - t_n) z_n (1 - z_n) x_n. As before, η is the learning rate parameter. It is a positive real number that should be chosen carefully, so as not to be too big or too small. In terms of individual weights, the update rule is: w_d = w_d - η (z_n - t_n) z_n (1 - z_n) x_{n,d}.

Perceptron Training Algorithm. Input: training inputs x_1, ..., x_N and target outputs t_1, ..., t_N.
1. Extend each x_n to a (D+1)-dimensional vector, by adding a 1 (the bias input) as the value for dimension 0.
2. Initialize the weights w_d to random numbers close to zero, e.g., set each w_d between -0.1 and 0.1.
3. For n = 1 to N:
   1. Compute z_n = h(w^T x_n).
   2. For d = 0 to D: w_d = w_d - η (z_n - t_n) z_n (1 - z_n) x_{n,d}.
4. If some stopping criterion has been met, exit.
5. Else, go to step 3.
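Here is a runnable Python sketch of this training algorithm, including the stopping criterion of the next slide; parameter names such as eta, threshold, and max_epochs, and the choice of the OR function as training data, are illustrative assumptions.

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_perceptron(xs, ts, eta=0.1, threshold=1e-4, max_epochs=10000):
    """Gradient-descent training of a single sigmoid perceptron.

    xs: list of (D+1)-dimensional inputs (x[0] = 1, the bias input).
    ts: list of scalar targets in [0, 1].
    """
    D1 = len(xs[0])
    w = [random.uniform(-0.1, 0.1) for _ in range(D1)]        # step 2
    last_error = float('inf')
    for _ in range(max_epochs):
        for x, t in zip(xs, ts):                               # step 3
            z = sigmoid(sum(wd * xd for wd, xd in zip(w, x)))
            for d in range(D1):    # w_d -= eta * (z - t) * z * (1 - z) * x_d
                w[d] -= eta * (z - t) * z * (1.0 - z) * x[d]
        error = 0.5 * sum((sigmoid(sum(wd * xd for wd, xd in zip(w, x))) - t) ** 2
                          for x, t in zip(xs, ts))
        if abs(last_error - error) < threshold:                # stopping criterion
            break
        last_error = error
    return w

# Learn the OR function (inputs already extended with the bias input):
xs = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
ts = [0, 1, 1, 1]
print(train_perceptron(xs, ts))
```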

Stopping Criterion. At step 4 of the perceptron training algorithm, we need to decide whether to stop or not. Compute the cumulative squared error E(w) of the perceptron at that point: E(w) = sum_{n=1}^{N} (1/2)(z_n - t_n)^2. Compare the current value of E(w) with the value of E(w) computed at the previous iteration. If the difference is too small (e.g., smaller than 0.0001), we stop.

Multiclass Classification Problems. A perceptron output ranges between 0 and 1. This is sufficient only for binary classification problems. For more than two classes, we will follow a general approach called one-versus-all classification. Consider classes C_1, ..., C_K, where K > 2, training inputs x_1, ..., x_N, and target values t_1, ..., t_N. Each target value t_n is a K-dimensional vector: t_{n,k} = 0 if the class of x_n is not C_k, and t_{n,k} = 1 if the class of x_n is C_k. For each class C_k, we train a perceptron by using t_{n,k} as the target value for x_n. That perceptron is trained to recognize whether an object belongs to class C_k or not. In total, we train K perceptrons, one for each class.

Classification with One-Versus-All Perceptrons. To classify a test pattern x: calculate the responses of all K perceptrons; find the perceptron whose output is the highest among the K outputs; output that the class of x is the class corresponding to that perceptron. In summary: we assign x to the class whose perceptron produced the highest output value for x.
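A minimal Python sketch of this decision rule (assuming the K perceptrons have already been trained; the function name and argument layout are illustrative):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def classify_one_vs_all(perceptron_weights, x):
    """One-versus-all classification with K trained perceptrons.

    perceptron_weights: list of K weight vectors, one per class.
    x: (D+1)-dimensional input with x[0] = 1 (the bias input).
    Returns the index k of the class whose perceptron responds most strongly.
    """
    responses = [sigmoid(sum(wd * xd for wd, xd in zip(w, x)))
                 for w in perceptron_weights]
    return max(range(len(responses)), key=lambda k: responses[k])
```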

Generalizing to Neural Nets. U is the total number of units in the neural network. Each unit is denoted as P_u, where 0 <= u <= U-1. Units P_0, ..., P_D are the input units, and unit P_0 is the bias input, always equal to 1. We denote by w_ji the weight of the edge connecting the output of P_i to an input of P_j. We denote by z_j the output of unit P_j; for an input unit (j <= D), z_j is simply the corresponding entry of the input vector. We denote by z the vector of all outputs: z = (z_0, z_1, ..., z_{U-1}).

Notational Conventions. We treat target value t_n as a U-dimensional vector: the dimensionality of t_n is equal to the number of units in the network. If P_u is the k-th output unit, and x_n belongs to class C_k, then t_{n,u} = 1; t_n has values 0 in all other dimensions. This gives a one-to-one correspondence between the dimensions of t_n and the dimensions of the output vector z, for notational simplicity.

Squared Error Cost. We define O to be the set of output units. Denote by E_n(w) the contribution that training input x_n makes to the overall error: E_n(w) = (1/2) sum_{P_u in O} (z_u - t_{n,u})^2. Only the output from output units contributes to the error. If P_u is not an output unit, then z_u and t_{n,u} get ignored.

Backpropagation Training in Neural Networks. We follow the same approach of iterative learning that we followed for training single perceptrons. Given a training example x_n and target output t_n: compute the training error E_n(w); compute the gradient ∇E_n(w); update all weights using the gradient. The process of computing the gradient and updating the weights is called backpropagation. We will work through the details using the sigmoid as the activation function.

Starting with the Gradient. We want to compute the gradient vector ∇E_n(w). w is a vector containing all weights. Therefore, it suffices to compute, for each weight w_ji, the partial derivative ∂E_n/∂w_ji. To compute ∂E_n/∂w_ji: decompose E_n into a composition of simpler functions; compute the derivative of each of those simpler functions; apply the chain rule to obtain ∂E_n/∂w_ji.

Cost Function Reformulation. Let P_j be a perceptron in the neural network. Define function a_j(x_n, w) to be the weighted sum of the inputs of P_j, given the current input x_n and the current value of w: a_j = sum_i w_ji z_i. Define function z_j(x_n, w) to be the output of P_j: z_j = h(a_j).

Decomposing the Cost. Define y_j to be a vector containing the outputs of all perceptrons belonging to the same layer as perceptron P_j. Define function E_nj(y_j) to be the error of the network given the outputs of all perceptrons of the layer that P_j belongs to. Intuition for E_nj(y_j): suppose that you do not know anything about the layers before the layer of perceptron P_j, but you know y_j and all layers after the layer of P_j. Then you can still compute the output of the network, and the error.

Visualization of E_nj. As long as we can see the outputs of the layer that P_j belongs to, and of all layers after the layer of P_j, we can compute the output of the network, and the error. (Figure: the layers before P_j's layer are marked as unknown; P_j's layer and the following layers, up to the output layer, are shown.)

Decomposing the Error Function. Define three auxiliary functions: a_j(x_n, w), the weighted sum of the inputs of P_j; h, the activation function, so that z_j = h(a_j); and E_nj(y_j), the error as a function of the outputs of P_j's layer. Suppose that the perceptrons belonging to the same layer as P_j are indexed as P_{j_1}, ..., P_{j_M}. Then E_n is a composition of the functions E_nj, h, and a_j.

Gradient of E_n. Then, we can compute ∂E_n/∂w_ji by applying the chain rule: ∂E_n/∂w_ji = (∂E_n/∂z_j)(∂z_j/∂a_j)(∂a_j/∂w_ji). We will compute each of these three terms.

Computing ∂a_j/∂w_ji. Remember, a_j = sum_i w_ji z_i, so ∂a_j/∂w_ji = z_i, which is just the output of unit P_i. The outputs of all units are straightforward to compute, given x_n and w. So, computing ∂a_j/∂w_ji is entirely straightforward.

Computing ∂z_j/∂a_j. Since z_j = h(a_j) and h is the sigmoid, ∂z_j/∂a_j = h(a_j)(1 - h(a_j)) = z_j (1 - z_j). One of the reasons we like using the sigmoid function for activation is that its derivative has such a simple form.
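The identity h'(a) = h(a)(1 - h(a)) is easy to sanity-check numerically; the short Python sketch below compares it against a finite-difference estimate (the helper names are illustrative).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_derivative(a):
    z = sigmoid(a)
    return z * (1.0 - z)          # h'(a) = h(a) * (1 - h(a))

# Compare against a numerical (finite-difference) derivative:
eps = 1e-6
for a in (-2.0, 0.0, 1.5):
    numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
    print(a, sigmoid_derivative(a), numeric)
```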

Computing ∂E_n/∂z_j, Case 1: Output Unit. If P_j is an output unit, then z_j is an output of the entire network. It contributes to the error the term (1/2)(z_j - t_{n,j})^2. Therefore: ∂E_n/∂z_j = z_j - t_{n,j}.

Updating the Output Unit Weights. If P_j is an output unit, then we have computed all the terms we need for ∂E_n/∂w_ji: ∂E_n/∂w_ji = (z_j - t_{n,j}) z_j (1 - z_j) z_i. So, if w_ji is a weight of an output unit, we update it as follows: w_ji = w_ji - η (z_j - t_{n,j}) z_j (1 - z_j) z_i.

Computing ∂E_n/∂z_j, Case 2: Hidden Unit. Let P_j be a hidden unit. Define S_j to be the set of all units in the layer after P_j. Applying the chain rule through those units: ∂E_n/∂z_j = sum_{P_u in S_j} (∂E_n/∂z_u)(∂z_u/∂a_u)(∂a_u/∂z_j). We need to compute these three terms.

Details in Computing ∂E_n/∂z_j. We computed ∂z_u/∂a_u already, a few slides ago: it equals z_u (1 - z_u). Also, since a_u = sum_j w_uj z_j, we have ∂a_u/∂z_j = w_uj. So, the formula becomes: ∂E_n/∂z_j = sum_{P_u in S_j} (∂E_n/∂z_u) z_u (1 - z_u) w_uj.

Why "Backpropagation"? ∂E_n/∂z_j = sum_{P_u in S_j} (∂E_n/∂z_u) z_u (1 - z_u) w_uj. Notice that ∂E_n/∂z_j is defined using ∂E_n/∂z_u for the units P_u of the next layer. This is a recursive definition: to compute the values for a layer, we use the values from the next layer. This is why the whole algorithm is called backpropagation: we propagate computations from the output layer backwards towards the input layer.

Hidden Unit Updates. From the previous slides, we have these formulas: ∂E_n/∂w_ji = (∂E_n/∂z_j) z_j (1 - z_j) z_i, and ∂E_n/∂z_j = sum_{P_u in S_j} (∂E_n/∂z_u) z_u (1 - z_u) w_uj. We can combine these formulas to compute ∂E_n/∂w_ji for any weight w_ji of any hidden unit. We start the computations from the output layer, and move backwards towards the input layer.

Simplifying the Notation. The previous formulas are sufficient and will work, but they look complicated. We simplify them considerably by defining: δ_j = (∂E_n/∂z_j) z_j (1 - z_j). Then, combining calculations we already did: if P_j is an output unit, δ_j = (z_j - t_{n,j}) z_j (1 - z_j); if P_j is a hidden unit, δ_j = z_j (1 - z_j) sum_{P_u in S_j} δ_u w_uj.

Backpropagation Formula. Using the definition of δ_j from the previous slide, we finally get a very simple formula: ∂E_n/∂w_ji = δ_j z_i. Given a training input x_n, and given a positive learning rate η, each weight w_ji is updated using this formula: w_ji = w_ji - η δ_j z_i.
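Collected in one place, the backpropagation equations from the last few slides read as follows (a restatement in the notation used above, added here for reference):

```latex
\begin{align*}
\delta_j &= (z_j - t_{n,j})\, z_j (1 - z_j) && \text{if } P_j \text{ is an output unit},\\
\delta_j &= z_j (1 - z_j) \sum_{P_u \in S_j} \delta_u\, w_{uj} && \text{if } P_j \text{ is a hidden unit},\\
\frac{\partial E_n}{\partial w_{ji}} &= \delta_j\, z_i, \qquad
w_{ji} \leftarrow w_{ji} - \eta\, \delta_j\, z_i .
\end{align*}
```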

Step 1: Initialize the Input Layer. First, given a training example x_n and its target output t_n, we must initialize the input units. Array z will store, for every perceptron P_u, its output z_u:
double[] z = new double[U]
Update the input layer, setting its outputs equal to x_n. For j = 0 to D: z[j] = x_{n,j}   % x_{n,j} is the j-th entry of x_n

Step 2: Output Computations. Create an array a, which will store, for every perceptron P_u, the weighted sum of the inputs of P_u:
double[] a = new double[U]
Update the rest of the layers. For l = 2 to L (L is the number of layers), for each perceptron P_j in layer l:
a[j] = sum_i w_ji z[i]   % weighted sum of the inputs of P_j
z[j] = h(a[j])

Step 3: Compute New δ Values. Create an array delta, which will store, for every perceptron P_u, the value δ_u:
double[] delta = new double[U]
For each output unit P_j: delta[j] = (z[j] - t_{n,j}) z[j] (1 - z[j])
For l = L-1 down to 2:   % MUST be in decreasing order of l
For each perceptron P_j in layer l: delta[j] = z[j] (1 - z[j]) sum_{P_u in S_j} delta[u] w_uj

Step 4: Weight Updating. For l = 2 to L (the order does not matter here; we can go from 2 to L or from L to 2), for each perceptron P_j in layer l, and for each perceptron P_i in the preceding layer l-1:
w_ji = w_ji - η delta[j] z[i]
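Putting steps 1 through 4 together, here is a runnable Python sketch of one backpropagation pass for a layered sigmoid network; the nested-list weight layout, the function names, and the tiny 2-2-1 example network are assumptions made for illustration, since the slides index units globally rather than per layer.

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def backprop_pass(layers, x, t, eta):
    """One backpropagation pass (steps 1-4) for a layered sigmoid network.

    layers: list of weight matrices, one per non-input layer;
            layers[l][j][i] is the weight from unit i of the previous layer
            (i = 0 being the bias) into unit j.  Weights are updated in place.
    x: input vector (without the bias entry); t: target vector for the output layer.
    """
    # Steps 1-2: forward pass, remembering every layer's outputs.
    zs = [[1.0] + list(x)]                       # input-layer outputs (with bias)
    for W in layers:
        prev = zs[-1]
        out = [sigmoid(sum(w_i * z_i for w_i, z_i in zip(w_row, prev)))
               for w_row in W]
        zs.append([1.0] + out)                   # prepend the bias for the next layer
    # Step 3: delta values, computed from the output layer backwards.
    deltas = [None] * len(layers)
    out = zs[-1][1:]                             # output-layer activations (drop bias)
    deltas[-1] = [(z - tk) * z * (1.0 - z) for z, tk in zip(out, t)]
    for l in range(len(layers) - 2, -1, -1):     # MUST be in decreasing layer order
        z_l = zs[l + 1][1:]
        deltas[l] = [z * (1.0 - z) *
                     sum(deltas[l + 1][u] * layers[l + 1][u][j + 1]
                         for u in range(len(layers[l + 1])))
                     for j, z in enumerate(z_l)]
    # Step 4: weight updates (the order over layers does not matter).
    for l, W in enumerate(layers):
        prev = zs[l]
        for j, w_row in enumerate(W):
            for i in range(len(w_row)):
                w_row[i] -= eta * deltas[l][j] * prev[i]

# Tiny usage example: a 2-2-1 network with random initial weights.
random.seed(0)
net = [[[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(2)],
       [[random.uniform(-0.1, 0.1) for _ in range(3)] for _ in range(1)]]
backprop_pass(net, [0.0, 1.0], [1.0], eta=0.5)
print(net[1])
```

In the full algorithm of the next slide, this per-example pass would be repeated for all N training examples in each iteration, until the stopping criterion is met.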

The Backpropagation Algorithm. Inputs: N D-dimensional training vectors x_1, ..., x_N, and the associated target values t_1, ..., t_N, which are U-dimensional vectors.
1. Extend each x_n to a (D+1)-dimensional vector, by adding the bias input.
2. Initialize the weights to small random numbers; for example, set each w_uv between -0.1 and 0.1.
3. last_error = E(w)
4. For n = 1 to N: given x_n, update the weights as described in the previous slides.
5. err = E(w)
6. If |err - last_error| < threshold, exit.
7. Else: last_error = err; go to step 4.

Classification Using Neural Networks. Suppose we have K classes C_1, ..., C_K. Each class C_k corresponds to an output perceptron. Given a test pattern x to classify: extend x to a (D+1)-dimensional vector by adding the bias input; compute the outputs for all units of the network, working from the input layer towards the output layer; find the output unit with the highest output; return the class C_k that corresponds to that output unit. There are further design issues, such as selecting the number of layers and the number of units per layer.