Classification with Perceptrons


Classification with Perceptrons. Reading: Chapters 1-3 of Michael Nielsen's online book on neural networks cover the basics of perceptrons and multilayer neural networks. We will cover material in Chapters 1-2 only, but you're encouraged to read Chapter 3 (and the rest of the book!) as well.

Perceptrons as simplified neurons. The input is (x_1, x_2, ..., x_n) and the weights are (w_1, w_2, ..., w_n). The output y is 1 ("the neuron fires") if the weighted sum of the inputs exceeds the threshold:

If w_1 x_1 + w_2 x_2 + ... + w_n x_n > threshold, then y = 1, else y = 0.

The threshold can be folded into the weights by adding an extra input x_0 that is always +1, with weight w_0 = -threshold. Here w_0 is called the bias, and -w_0 is the threshold:

If w_0 x_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n > 0, then y = 1, else y = 0.

Equivalently, using the activation function a(z) = 0 if z ≤ 0 and a(z) = 1 if z > 0:

output = y(x) = a(w_0 x_0 + w_1 x_1 + w_2 x_2 + ... + w_n x_n).

Letting x = (x_0, x_1, x_2, ..., x_n) and w = (w_0, w_1, w_2, ..., w_n), this is simply y = a(w · x).
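As a concrete illustration of the vector form y = a(w · x), here is a minimal sketch in Python with NumPy; the example weights and input below are made up for illustration, not taken from the slides.

```python
import numpy as np

def perceptron_output(w, x):
    """Compute y = a(w . x), where a(z) = 1 if z > 0, else 0.
    Both w and x include the bias term: x[0] is fixed at +1 and w[0] is the bias weight."""
    return 1 if np.dot(w, x) > 0 else 0

# Illustrative values:
w = np.array([-0.5, 0.8, 0.3])   # bias weight w0, then w1, w2
x = np.array([1.0, 0.6, 0.2])    # x0 = +1 (bias input), then x1, x2
print(perceptron_output(w, x))   # prints 1, since w . x = -0.5 + 0.48 + 0.06 > 0
```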

Decision Surfaces. Assume the data can be separated into two classes, positive and negative, by a linear decision surface. If the data is n-dimensional, a perceptron represents an (n − 1)-dimensional hyperplane that separates the data into the two classes, positive (1) and negative (0).

[Figures: two-class data plotted in a 2-D feature space (axes: Feature 1, Feature 2), separated by linear decision surfaces.]

Example: where won't a line work?

Example: What is the predicted class y? [Diagram: a small perceptron with a +1 bias input and specific numeric inputs and weights.]

Geometry of the perceptron. The decision hyperplane in 2D is w_1 x_1 + w_2 x_2 + w_0 = 0, which can be rewritten as x_2 = -(w_1 / w_2) x_1 - w_0 / w_2, a line in the (Feature 1, Feature 2) plane.
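To make the 2-D geometry concrete, this small sketch solves w_1 x_1 + w_2 x_2 + w_0 = 0 for x_2 to get the slope and intercept of the separating line; the weights used are assumed for illustration.

```python
def decision_line(w0, w1, w2):
    """Return (slope, intercept) of the 2-D decision boundary
    w1*x1 + w2*x2 + w0 = 0, rewritten as x2 = -(w1/w2)*x1 - w0/w2."""
    return -w1 / w2, -w0 / w2

# Illustrative weights (not from the slides):
slope, intercept = decision_line(w0=-1.0, w1=0.5, w2=0.25)
print(slope, intercept)   # -2.0 4.0, i.e. x2 = -2*x1 + 4
```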

In-Class Exercise 1

Perceptron Learning. Goal: use the training data to learn a set of weights that will (1) correctly classify the training data and (2) generalize to unseen data. A training instance is x^k = (x_1^k, ..., x_n^k), with target class t^k ∈ {0, 1}.

Perceptron Learning. Learning is often framed as an optimization problem: find w that minimizes the average loss J(w) = (1/M) Σ_{k=1}^{M} L(w, x^k, t^k), where M is the number of training examples and L is a loss function. One part of the art of machine learning is to define a good loss function. Here, with y = a(w · x), define the loss as the squared loss L(w, x^k, t^k) = ½ (t^k − y)². How do we solve this minimization problem? Gradient descent.
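A small sketch of the average squared loss J(w) defined above, assuming NumPy arrays; the data and weights here are illustrative placeholders.

```python
import numpy as np

def average_squared_loss(w, X, t):
    """J(w) = (1/M) * sum_k 0.5 * (t^k - y^k)^2, with y^k = a(w . x^k)."""
    y = (X @ w > 0).astype(float)        # perceptron outputs for all M examples
    return np.mean(0.5 * (t - y) ** 2)

# Illustrative data: 3 examples, each with bias input x0 = 1 plus two features.
X = np.array([[1, 0.2, 0.7], [1, 0.9, 0.1], [1, 0.5, 0.5]])
t = np.array([1.0, 0.0, 1.0])
w = np.array([-0.3, 0.4, 0.6])
print(average_squared_loss(w, X, t))     # about 0.167: one of three examples is misclassified
```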

Gradient descent. To find the direction of steepest descent, take the derivative of J(w) with respect to w. A vector derivative is called a gradient: ∇J(w) = (∂J/∂w_0, ∂J/∂w_1, ..., ∂J/∂w_n).

Here is how we change each weight. For i = 0 to n: w_i ← w_i + Δw_i, where Δw_i = -η ∂J/∂w_i and η is the learning rate (i.e., the size of the step to take downhill).

True (or "batch") gradient descent. One epoch = one iteration through the training data. After each epoch, compute the average loss over the training set: J(w) = (1/M) Σ_{k=1}^{M} L(w, x^k, t^k) = (1/M) Σ_{k=1}^{M} ½ (t^k − y^k)². Change the weights to move in the direction of steepest descent on the average-loss surface. [Figure: the error surface J(w) over two weights w_1, w_2, from T. M. Mitchell, Machine Learning.]

Problems with true gradient descent: the training process is slow, and it can land in a local optimum. A common approach is to use stochastic gradient descent: instead of updating the weights after all training examples have been processed, update them after each training example has been processed (i.e., after the perceptron output has been calculated). Stochastic gradient descent approximates true gradient descent increasingly well as η → 0.

Derivation of the perceptron learning rule (stochastic gradient descent). We defined the per-example loss J^k = ½ (t − y)², where y = a(w · x^k). Because the step activation a is not differentiable, here we instead use J^k = ½ (t − w · x^k)² = ½ (t − (w_0 x_0 + w_1 x_1 + ... + w_n x_n))². Then ∂J^k/∂w_i = −(t − w · x^k) x_i. But we'll use ∂J^k/∂w_i ≈ −(t − y) x_i, giving Δw_i = −η ∂J^k/∂w_i = η (t^k − y^k) x_i^k. This is called the perceptron learning rule.
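A one-step sketch of the rule Δw_i = η (t^k − y^k) x_i^k for a single training example; the weight, input, and target values below are illustrative.

```python
import numpy as np

def perceptron_rule_update(w, x, t, eta):
    """Apply one stochastic-gradient (perceptron rule) update: w_i += eta * (t - y) * x_i."""
    y = 1.0 if np.dot(w, x) > 0 else 0.0
    return w + eta * (t - y) * x

w = np.array([0.2, -0.3, 0.5])                       # w0 (bias), w1, w2
x = np.array([1.0, 1.0, 0.0])                        # x0 = +1, then features
print(perceptron_rule_update(w, x, t=1, eta=0.1))    # [0.3, -0.2, 0.5]: y was 0, t is 1
```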

Perceptron Learning Algorithm. Start with small random weights w = (w_0, w_1, w_2, ..., w_n), where each w_i ∈ [-0.5, 0.5]. Repeat until accuracy on the training data stops increasing, or for a maximum number of epochs (iterations through the training set):
For k = 1 to M (total number of training examples):
1. Select the next training example (x^k, t^k).
2. Run the perceptron with input x^k and weights w to obtain y.
3. If y ≠ t^k, update the weights: for i = 0, ..., n: w_i ← w_i + Δw_i, where Δw_i = η (t^k − y^k) x_i^k. Note that the bias weight w_0 is changed just like all the other weights.
4. Go to 1.
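The algorithm above, written out as a minimal Python/NumPy sketch for a single two-class output unit; the learning rate, epoch cap, and stopping rule are placeholders, not a definitive implementation.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100, seed=0):
    """Perceptron learning algorithm for one two-class output unit.
    X: (M, n+1) array whose first column is the bias input x0 = 1.
    t: (M,) array of targets in {0, 1}. Returns the learned weight vector w."""
    rng = np.random.default_rng(seed)
    M, d = X.shape
    w = rng.uniform(-0.5, 0.5, size=d)          # small random initial weights
    best_acc = -1.0
    for epoch in range(max_epochs):
        for k in range(M):                      # one pass through the data = one epoch
            y = 1.0 if np.dot(w, X[k]) > 0 else 0.0
            if y != t[k]:                       # update only on a misclassification
                w += eta * (t[k] - y) * X[k]
        acc = np.mean((X @ w > 0) == (t == 1))  # training accuracy after this epoch
        if acc <= best_acc:                     # stop once accuracy stops improving
            break
        best_acc = acc
    return w
```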

Example. Training set: ((0, 0), 1), ((1, 1), 0), ((1, 0), 0). Initial weights: {w_0, w_1, w_2} = {0.4, 0.1, 0.2}. Weight update: w_i ← w_i + Δw_i, where Δw_i = η (t^k − y^k) x_i^k. What is the initial accuracy? Apply the perceptron learning rule for one epoch with η = 0.1. What is the new accuracy?

In-Class Exercise 2

Recognizing Handwritten Digits. MNIST dataset: 60,000 training examples and 10,000 test examples. Each example is a 28 × 28-pixel image, where each pixel is a grayscale value in [0, 255]. [Image: a sample 28 × 28 digit image with label 2.] See the CSV files; the first value in each row is the target class.

Perceptron architecture for handwritten digit classification: 785 inputs (= 28 × 28 pixels + 1 bias input fixed at +1), fully connected to 10 outputs, one per digit class 0-9.

Preprocessing (to keep the weights from growing very large): scale each feature to a fraction between 0 and 1: x_i' = x_i / 255.
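A sketch of the loading and preprocessing step, assuming the CSV layout described above (label first, then 784 pixel values per row); the filename in the commented call is a placeholder for the files on the class website.

```python
import numpy as np

def load_mnist_csv(path):
    """Load an MNIST CSV file: first column is the label, the rest are pixels in [0, 255].
    Returns (X, labels), where X has a leading bias column of 1s and pixels scaled to [0, 1]."""
    data = np.loadtxt(path, delimiter=",")
    labels = data[:, 0].astype(int)
    pixels = data[:, 1:] / 255.0                              # scale each feature to [0, 1]
    X = np.hstack([np.ones((pixels.shape[0], 1)), pixels])    # prepend bias input x0 = 1
    return X, labels

# Placeholder filename:
# X_train, t_train = load_mnist_csv("mnist_train.csv")
```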

Processing an input. At each output unit, compute Σ_{i=0}^{n} w_i x_i (here n = 784). The output with the highest value is the prediction. If the prediction is correct, do nothing; if not, adjust the weights according to the perceptron learning rule. See the assignment for details.

Example. For each training example k, input x^k to the perceptron. Suppose x^k is a 2: then t^k = 1 for the "2" output and t^k = 0 for all other outputs. For each output, calculate Σ_{i=0}^{784} w_i x_i; set y = 1 for outputs whose sum is greater than 0 and y = 0 otherwise. Then, for all weights coming into each output: w_i ← w_i + Δw_i, where Δw_i = η (t^k − y^k) x_i^k.
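One way to code the per-example step described above, assuming a (10, 785) weight matrix W and a preprocessed input vector x with the bias entry first; the function and variable names are illustrative, not from the assignment.

```python
import numpy as np

def process_example(W, x, label, eta=0.1):
    """One stochastic update of a 10-output perceptron (digits 0-9).
    W: (10, 785) float weights; x: (785,) input with x[0] = 1; label: correct digit."""
    sums = W @ x                          # one weighted sum per output unit
    prediction = int(np.argmax(sums))     # the output with the highest value is the prediction
    if prediction != label:               # adjust weights only if the prediction is wrong
        t = np.zeros(10)
        t[label] = 1.0                    # target: 1 for the correct output, 0 for the others
        y = (sums > 0).astype(float)      # thresholded activation of each output
        W += eta * np.outer(t - y, x)     # perceptron rule applied to every output's weights
    return prediction
```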

Homework 1. Implement a perceptron (785 inputs, 10 outputs) and the perceptron learning algorithm, in any programming language you would like. Download the MNIST data from the class website. Preprocess the MNIST data: scale each feature to a fraction between 0 and 1.

Homework 1 Experiments. Train perceptrons with three different learning rates: η = 0.001, 0.01, and 0.1. For each learning rate:
1. Choose small random weights w_i ∈ [-0.5, 0.5].
2. Repeat cycling through the training data until the accuracy on the training data has essentially stopped improving (i.e., the difference in training accuracy from one epoch to the next is less than some small number, like 0.01).
After the initialization, and after each epoch (one cycle through the training data), compute the accuracy on the training and test sets (for the plots), without changing the weights.

Homework 1: Presenting results. For each experiment, plot training and test accuracy over epochs (y-axis: Accuracy (%); x-axis: Epoch).

For each experiment, give a confusion matrix: a 10 × 10 table over the classes 0-9, with predicted class along one dimension and actual class along the other. In each cell (i, j), put the number of test examples that were predicted (classified) as class i and whose actual class is class j.
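A small sketch for filling in the confusion matrix as defined above, with cell (i, j) counting examples predicted as class i whose actual class is j; the array names are illustrative.

```python
import numpy as np

def confusion_matrix(predicted, actual, n_classes=10):
    """cm[i, j] = number of test examples predicted as class i whose actual class is j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for p, a in zip(predicted, actual):
        cm[p, a] += 1
    return cm
```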

Homework 1 FAQ
Q: Can I use code for perceptrons from another source?
A: No, you need to write your own code.
Q: How long will it take to train the perceptrons?
A: It depends on your computer and your code, but probably an hour or more.
Q: What accuracy should I expect on the test set?
A: It depends on the initial weights, but probably over 80%.
Q: Can I ask questions while doing the assignment?
A: Yes, feel free to ask questions! You can ask me, the TA, and fellow students for help (but not code). You can also send questions to the class mailing list (and you can answer questions as well).
Q: Should I wait until the last minute?
A: No!!! You only have a week and a half. Start as soon as possible.

Homework: Report. The report should be in PDF format and include graphs and formatted confusion matrices. Code should be in a plain text file.

In-Class Exercise 3

Quiz on Tuesday. Time allotted: 30 minutes. Format: you are allowed to bring in one (double-sided) page of notes to use during the quiz. You may bring/use a calculator, but you don't really need one.
What you need to know:
- Perceptrons: input, output, weights, bias, threshold
- Given an input vector x and perceptron weights, how to calculate the output
- Given perceptron weights, how to sketch the separation line (in two dimensions)
- The perceptron learning algorithm
- The difference between true and stochastic gradient descent
- How to interpret / fill in a confusion matrix
- How the all-pairs classification method works
Questions will be similar to the examples and exercises we've done in class.