Classification with Perceptrons. Reading:

Classification with Perceptrons Reading: Chapters 1-3 of Michael Nielsen's online book on neural networks covers the basics of perceptrons and multilayer neural networks We will cover material in Chapters 1-2 only, but you re encouraged to read Chapter 3 (and the rest of the book!) as well

Perceptrons as simplified neurons Input is (,, ) input w 1 Weights are (w 1, w 2, w n ) Output y is 1 ( the neuron fires ) if the sum of the inputs times the weights is greater or equal to the threshold: w 2 y output w n

Perceptrons as simplified neurons Input is (,, ) input w 1 +1 w 0 Weights are (w 1, w 2, w n ) Output y is 1 ( the neuron fires ) if the sum of the inputs times the weights is greater or equal to the threshold: If w 1 + w 2 + + w n >threshold then y =1, else y = 0 w 2 y output w n

w 0 is called the bias w 0 is called the threshold input w 2 Perceptrons as simplified neurons w 1 +1 w 0 Input is (,, ) Weights are (w 1, w 2, w n ) Output y is 1 ( the neuron fires ) if the sum of the inputs times the weights is greater or equal to the threshold: If w 1 + w 2 + + w n >threshold then y =1, else y = 0 y output w n

w 0 is called the bias w 0 is called the threshold input Perceptrons as simplified neurons w 1 +1 w 0 Input is (,, ) Weights are (w 1, w 2, w n ) Output y is 1 ( the neuron fires ) if the sum of the inputs times the weights is greater or equal to the threshold: If w 1 + w 2 + + w n >threshold then y =1, else y = 0 w 2 y output If w 1 + w 2 + + w n > w 0 then y =1, else y = 0 w n

w 0 is called the bias w 0 is called the threshold input Perceptrons as simplified neurons w 1 x 0 +1 w 0 Input is (,, ) Weights are (w 1, w 2, w n ) Output y is 1 ( the neuron fires ) if the sum of the inputs times the weights is greater or equal to the threshold: If w 1 + w 2 + + w n >threshold then y =1, else y = 0 w 2 y output If w 1 + w 2 + + w n > w 0 then y =1, else y = 0 If w 0 + w 1 + w 2 + + w n > 0 then y =1, else y = 0 w n

w 0 is called the bias w 0 is called the threshold input Perceptrons as simplified neurons x 0 +1 Input is (,, ) Weights are (w 1, w 2, w n ) Output y is 1 ( the neuron fires ) if the sum of the inputs times the weights is greater or equal to the threshold: w 1 w 0 output = y(x) = a(w 0 x 0 + w 1 + w 2 + + w n ) w 2 y output where a(z) = 0 if z 0 1 if z > 0 w n

w 0 is called the bias w 0 is called the threshold input Perceptrons as simplified neurons x 0 +1 Input is (,, ) Weights are (w 1, w 2, w n ) Output y is +1 ( the neuron fires ) if the sum of the inputs times the weights is greater or equal to the threshold: w 1 w 0 output = y(x) = a(w 0 x 0 + w 1 + w 2 + + w n ) w 2 w n y output where a(z) = 0 if z 0 1 if z > 0 Let x = (x 0,,, ) w = (w 0, w 1, w 2, w n ) Then: y = a(w x)

Decision Surfaces Assume data can be separated into two classes, positive and negative, by a linear decision surface Assuming data is n-dimensional, a perception represents a (n 1)-dimensional hyperplane that separates the data into two classes, positive (1) and negative (0)

Feature 1 Feature 2

Example where line won t work? Feature 1 Feature 2

Example What is the predicted class y? 1 +1 4 1 4 1 y

Geometry of the perceptron Hyperplane In 2D: w 1 + w 2 + w 0 = 0 Feature 1 = w 1 w 2 w 0 w 2 Feature 2

In-Class Exercise 1

input Perceptron Learning x 0 +1 w 1 w 0 w 2 y output w n

input Perceptron Learning x 0 +1 w 1 w 0 w 2 y output w n Input instance: x k = (,, ), with target class t k {0,1}

input Perceptron Learning x 0 +1 Goal is to use the training data to learn a set of weights that will: w 1 w 0 (1) correctly classify the training data w 2 (2) generalize to unseen data y output w n Input instance: x k = (,, ), with target class t k {0,1}

Perceptron Learning Learning is often framed as an optimization problem: Find w that minimizes average loss : M J(w) = 1 M L(w, x k,t k ) k=1 where M is number of training examples and L is a loss function One part of the art of ML is to define a good loss function Here, define the loss function as follows: Let y = a(w x) L(w, x k,t k ) = 1 ( 2 t k y) 2 "squared loss" How to solve this minimization problem? Gradient descent

Gradient descent To find direction of steepest descent, take the derivative of J(w) with respect to w A vector derivative is called a gradient : J (w) J(w) = J, J,, J w 0 w 1 w n

Here is how we change each weight: For i = 0 to n: w i w i + Δw i where Δw i = η J w i and η is the learning rate (ie, size of step to take downhill)

True (or batch ) gradient descent One epoch = one iteration through the training data After each epoch, compute average loss over the training set: J(w) = 1 M M k=1 L(w, x k,t k ) = 1 M M k=1 1 2 ( t k y) 2 Change the weights to move in direction of steepest descent in the average-loss surface: J(w) w 1 w 2 From T M Mitchell, Machine Learning

Problem with true gradient descent: Training process is slow Training process can land in local optimum Common approach to this: use stochastic gradient descent: Instead of doing weight update after all training examples have been processed, do weight update after each training example has been processed (ie, perceptron output has been calculated) Stochastic gradient descent approximates true gradient descent increasingly well as η 1/

We defined J = 1 ( 2 t y ) 2 Derivation of perceptron learning rule (stochastic gradient descent) Here, use J = 1 2 ( t (w x) ) 2 = 1 ( 2 t (w 1 + w 2 + + w n )) 2 Then, 1 0 0 ( ) y = a w x k y = w x k J k = ( t (w x) ) x i w i But we'll use J k = ( t y) x i w i Δw i = η J = η(t k y k k )x i w i This is called the perceptron learning rule

Perceptron Learning Algorithm Start with small random weights, w = (w 0, w 1, w 2,, w n ), where w i 05,05 Repeat until accuracy on the training data stops increasing or for a maximum number of epochs (iterations through training set): For k = 1 to M (total number in training set): 1 Select next training example (x k, t k ) 2 Run the perceptron with input x k and weights w to obtain y 3 If y t k, update weights: for i = 0,, n: [ ] ; Note that bias weight w 0 is changed just like all other weights! 4 Go to 1 w i w i + Δw i where Δw i = η (t k y k )x i k

Training set: ((0, 0), 1) Example +1 4 (1, 1), 0) (1, 0), 0) 1 2 y w i w i + Δw i where Initial weights: {w 0, w 1, w 2 } = {04, 01, 02) Δw i = η (t k y k )x i k What is initial accuracy? Apply perceptron learning rule for one epoch with η = 01 What is new accuracy?

In-Class Exercise 2

Recognizing Handwritten Digits MNIST dataset 60,000 training examples 10,000 test examples Each example is a 28 28- pixel image, where each pixel is a grayscale value in [0,255] 28 pixels Label: 2 See csv files First value in each row is the target class 28 pixels

Perceptron architecture for handwritten digits classification bias input 1 0 785 inputs (= 28 28 + 1) Fully connected 1 2 3 4 5 6 7 8 9

Preprocessing (To keep weights from growing very large) Scale each feature to a fraction between 0 and 1: x í = x i 255

Processing an input At each output, compute: x 0 1 0 1 n w i x i i=0 2 3 The output with highest value is the prediction Fully connected 4 5 6 If correct, do nothing If not correct, adjust weights according to perceptron learning rule 1 7 8 See assignment for details 9

For each training example k, input x k to the perceptron Suppose x k is a 2 t k = 1 for the 2 output x 0 1 t k = 0 for all other outputs 1 Fully connected Example 0 1 2 3 4 5 6 7 8 9 For each output, calculate 735 w i x i i=0 Set y = 1 for outputs that are greater than 0 Set y = 0 otherwise Then for all weights coming into each output: w i w i + Δw i where Δw i = η (t k y k )x i k

Homework 1 Implement perceptron (785 inputs, 10 outputs) and perceptron learning algorithm in any programming language you would like Download MNIST data from class website Preprocess MNIST data: Scale each feature to a fraction between 0 and 1

Homework 1 Experiments Train perceptrons with three different learning rates: η = 0001, 001, and 01 For each learning rate: 1 Choose small random weights w i [ 05,05] 2 Repeat cycling through the training data until the accuracy on the training data has essentially stopped improving (ie, the difference between training accuracy from one epoch to the next is less than some small number, like 001) After the initialization, and after each epoch (one cycle through training data), compute accuracy on training and test set (for plot), without changing weights

Homework 1: Presenting results For each experiment, plot training and test accuracy over epochs: Accuracy (%) Epoch

For each experiment, give confusion matrix: Actual class Predicted class 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 In each cell (i,j) put number of test examples that were predicted (classified) as class i, and whose actual class is class j

Homework 1 FAQ Q: Can I use code for perceptrons from another source? A: No, you need to write your own code Q: How long will it take to train the perceptrons? A: Depends on your computer and your code, but probably an hour or more Q: What accuracy should I expect on the test set? A: It depends on initial weights, but probably over 80% Q: Can I ask questions while doing the assignment? A: Yes, feel free to ask questions! You can ask me, the TA, and fellow students for help (but not code) You can also send questions to the class mailing list (and you can answer questions as well) Q: Should I wait until the last minute? A: No!!! You only have a week and a half Start as soon as possible

Homework: Report Report should be in pdf format and include graphs and formatted confusion matrices Code should be in plain text file

In-Class Exercise 3

Quiz on Tuesday Time allotted: 30 minutes Format: You are allowed to bring in one (double-sided) page of notes to use during the quiz You may bring/use a calculator, but you don t really need one What you need to know: Perceptrons: input, output, weights, bias, threshold Given an input vector x and perceptron weights, how to calculate output Given perceptron weights, how to sketch separation line (in two dimensions) Perceptron learning algorithm Difference between true and stochastic gradient descent How to interpret / fill in a confusion matrix How all-pairs classification method works Questions will be similar to examples and exercises we ve done in class