The Perceptron algorithm


The Perceptron Algorithm, Tirgul (recitation) 3, November 2016

Agnostic PAC Learnability. A hypothesis class H is agnostic PAC learnable if there exists a function m_H : (0,1)² → ℕ and a learning algorithm with the following property: for every ε, δ ∈ (0,1) and for every distribution D over X × Y, when running the learning algorithm on m ≥ m_H(ε, δ) i.i.d. examples generated by D, the algorithm returns a hypothesis h such that, with probability at least 1 − δ (over the choice of the m training examples), L_D(h) ≤ min_{h'∈H} L_D(h') + ε.

Agnostic PAC Learnability: L_D(h) ≤ min_{h'∈H} L_D(h') + ε. Goal: find h* = argmin_{h∈H} L_D(h).

When Life Gives You Lemons, Make Lemonade. We do have our sample set S, and we hope it represents the distribution pretty well (the i.i.d. assumption). So why can't we just minimize the error over the training set? In other words: Empirical Risk Minimization.

Empirical Risk Minimization. Examples: Consistent, Halving. The empirical risk is L_S(h) = |{i ∈ [m] : h(x_i) ≠ y_i}| / m.
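As a quick illustration of the quantity above, here is a minimal sketch of computing the empirical risk of a predictor h on a sample S; the toy predictor, data, and names are purely illustrative:

```python
import numpy as np

def empirical_risk(h, X, y):
    """L_S(h): the fraction of sample points on which h disagrees with the label."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Toy 1-D sample and a threshold predictor that happens to be consistent with it.
X = np.array([[-2.0], [-1.0], [0.5], [2.0]])
y = np.array([-1, -1, 1, 1])
h = lambda x: 1 if x[0] > 0 else -1
print(empirical_risk(h, X, y))  # 0.0
```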

Linear Predictors: ERM Approach

Introduction. Linear predictors are efficient and intuitive, and they fit the data reasonably well in many natural learning problems. Several hypothesis classes belong to this family: linear regression, logistic regression, the Perceptron.

Example

Linear Predictors. The different hypothesis classes of linear predictors are compositions of a function φ: ℝ → Y on a class H of linear functions. Binary classification: φ is the sign function sign(x). Regression: φ is the identity function φ(x) = x.

Halfspaces. Designed for binary classification problems: X = ℝ^d, Y = {±1}. H_halfspaces = {x ↦ sign(⟨w, x⟩ + b) : w ∈ ℝ^d, b ∈ ℝ}. Geometric illustration (d = 2): each hypothesis defines a hyperplane perpendicular to the vector w. Instances above the hyperplane are labeled positive; instances below it are labeled negative.
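A small sketch of a single halfspace hypothesis, under the (illustrative) convention that sign(0) is mapped to +1; function and variable names are not from the slides:

```python
import numpy as np

def halfspace(w, b):
    """Return the hypothesis x -> sign(<w, x> + b), mapping sign(0) to +1."""
    return lambda x: 1 if np.dot(w, x) + b >= 0 else -1

# d = 2: the decision boundary is the line x1 + x2 = 1, perpendicular to w = (1, 1).
h = halfspace(np.array([1.0, 1.0]), -1.0)
print(h(np.array([2.0, 2.0])), h(np.array([0.0, 0.0])))  # 1 -1
```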

Adding a Bias. Add b (a bias) into w as an extra coordinate: w' = (b, w_1, w_2, …, w_d) ∈ ℝ^(d+1), and add a value of 1 to every x ∈ X: x' = (1, x_1, x_2, …, x_d) ∈ ℝ^(d+1). Thus, each affine function in ℝ^d can be rewritten as a homogeneous linear function in ℝ^(d+1).
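A minimal numeric sketch of this bias trick, with illustrative values for w, b, and x:

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -1.0
x = np.array([2.0, 2.0])

# Fold the bias into the weights and prepend a constant 1 to the instance.
w_aug = np.concatenate(([b], w))    # w' = (b, w_1, ..., w_d) in R^(d+1)
x_aug = np.concatenate(([1.0], x))  # x' = (1, x_1, ..., x_d) in R^(d+1)

# The affine function <w, x> + b equals the homogeneous linear function <w', x'>.
print(np.dot(w, x) + b, np.dot(w_aug, x_aug))  # 3.0 3.0
```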

The Dot Product. Algebraic definition: w · x = Σ_{i=1}^{n} w_i x_i = w_1 x_1 + w_2 x_2 + … + w_n x_n. Notation: ⟨w, x⟩ = wᵀx. Example: a = (0, 3), b = (4, 0), so a · b = 0·4 + 3·0 = 0.

The Dot Product. Geometric definition: a · b = |a| |b| cos θ, where |x| is the magnitude of the vector x and θ is the angle between a and b. If θ = 90°, then a · b = 0; if θ = 0°, then a · b = |a| |b|. This implies that the dot product of a vector with itself is a · a = |a|², which gives |a| = √(a · a), the formula for the Euclidean length of the vector.
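A quick numeric check of both definitions, using the vectors a = (0, 3) and b = (4, 0) from the previous slide:

```python
import numpy as np

a = np.array([0.0, 3.0])
b = np.array([4.0, 0.0])

# Algebraic definition: sum of coordinate-wise products.
print(np.dot(a, b))  # 0.0 -- the vectors are orthogonal

# Geometric definition: a.b = |a||b|cos(theta), so theta can be recovered from the dot product.
cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.degrees(np.arccos(cos_theta)))  # 90.0

# <a, a> = |a|^2, hence the Euclidean length is sqrt(<a, a>).
print(np.sqrt(np.dot(a, a)), np.linalg.norm(a))  # 3.0 3.0
```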

The Decision Boundary. The Perceptron tries to find a straight line that separates the positive examples from the negative ones: a line in 2D, a plane in 3D, a hyperplane in higher dimensions. This is called a decision boundary.

The Linearly Separable Case. The linearly separable case: a perfect decision boundary exists (the realizable case), i.e. it is possible to separate all the positive examples from all the negative ones with a hyperplane.

Finding an ERM Halfspace. In the separable case: linear programming, or the Perceptron algorithm (Rosenblatt, 1957). In the non-separable case: learn a halfspace that minimizes a different loss function, e.g. logistic regression.
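For the linear-programming route, one common formulation (a sketch, not the only one) asks for a w with y_i⟨w, x_i⟩ ≥ 1 for all i under a zero objective, i.e. a pure feasibility problem. The scipy-based helper below and its names are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

def erm_halfspace_lp(X, y):
    """Find w (bias folded in) with y_i * <w, x_i> >= 1 for all i, if the data are separable."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # the bias trick from before
    A_ub = -y[:, None] * X_aug                          # y_i <w, x_i> >= 1  <=>  -y_i x_i^T w <= -1
    b_ub = -np.ones(X.shape[0])
    res = linprog(c=np.zeros(X_aug.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * X_aug.shape[1])
    return res.x if res.success else None

# Separable toy data (the OR function used later in these slides).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, 1])
print(erm_halfspace_lp(X, y))  # some separating (b, w1, w2), necessarily with b < 0 and w1, w2 > 0
```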

Perceptron. x_i are the inputs and w_i are the weights. The inputs x_i are multiplied by the weights w_i, and the neuron sums these values. If the sum is greater than the threshold θ, the neuron fires (outputs 1); otherwise, it does not.

Finding θ. We now need to learn both w and θ.

Finding θ. θ is equivalent to the bias parameter b we mentioned previously: the firing condition Σ_i w_i x_i > θ is the same as Σ_i w_i x_i − θ > 0. Reminder: we added a bias coordinate, so the threshold is adjustable through the weights and we don't need to learn another parameter: fix an extra input x_0 = 1 and let its weight w_0 = −θ play the role of the threshold.

Perceptron for Halfspaces

Perceptron for Halfspaces. Our goal is to have ∀i, y_i⟨w, x_i⟩ > 0. After an update on example i: y_i⟨w^(t+1), x_i⟩ = y_i⟨w^(t) + y_i x_i, x_i⟩ = y_i⟨w^(t), x_i⟩ + ‖x_i‖². The update rule therefore makes the Perceptron "more correct" on the i-th example.
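A minimal sketch of the Perceptron for halfspaces in this homogeneous form (bias already folded into w, labels in {−1, +1}); the learning-rate parameter η discussed on the next slide is included with a default of 1, and the names are illustrative:

```python
import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=1000):
    """Repeatedly sweep the data; whenever y_i <w, x_i> <= 0, update w <- w + eta * y_i * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):               # in the separable case this loop stops early
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:     # example i is not yet classified correctly
                w = w + eta * y_i * x_i       # the update makes w "more correct" on example i
                mistakes += 1
        if mistakes == 0:                     # a full pass with no mistakes: w separates the data
            return w
    return w
```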

The Learning Rate η. The update rule is w^(t+1) = w^(t) + y_i x_i. We could add a parameter η: w^(t+1) = w^(t) + η y_i x_i, which controls how much the weights change. In the separable case η has no effect (proof: HW). If η = 1: the weights change a lot whenever there is a wrong answer, so the network can be unstable and never settle down. If η is very small: the weights need to see inputs more often before they change significantly, so the network takes longer to learn. Typically we choose 0.1 ≤ η ≤ 0.4.

Example: Logic Function OR. Data of the OR logic function and a plot of the data points:
x1 x2 | y
 0  0 | -1
 0  1 | +1
 1  0 | +1
 1  1 | +1

A Feasibility Problem. Suppose the algorithm found a weight vector that classifies all of the examples correctly. There are many different weight vectors that give correct outputs! We are only interested in finding some set of weights that works: feasibility. A feasibility (or satisfiability) problem is the problem of finding any feasible solution, without regard to an objective value.

Example: Logic Function OR. The Perceptron network for this problem: a single neuron with the two inputs x_1, x_2 and a constant bias input of 1, each with its own weight.

Example: Logic Function OR. We need to find the three weights. Initially w^(1) = (0, 0, 0). First input: x_1 = (0, 0), y_1 = −1; including the bias, x_1 = (1, 0, 0). Value of the neuron: w^(1) · x_1 = 0·1 + 0·0 + 0·0 = 0, so y_1(w^(1) · x_1) = 0 ≤ 0. Update: w^(2) = w^(1) + (−1)(1, 0, 0) = (−1, 0, 0).

Example: Logic Function OR. w^(2) = (−1, 0, 0). Second input: x_2 = (0, 1), y_2 = 1; including the bias, x_2 = (1, 0, 1). Value of the neuron: w^(2) · x_2 = −1·1 + 0·0 + 0·1 = −1, so y_2(w^(2) · x_2) = −1 ≤ 0. Update: w^(3) = w^(2) + (1)(1, 0, 1) = (0, 0, 1).

Example: Logic Function OR. w^(3) = (0, 0, 1). Third input: x_3 = (1, 0), y_3 = 1; including the bias, x_3 = (1, 1, 0). Value of the neuron: w^(3) · x_3 = 0·1 + 0·1 + 1·0 = 0, so y_3(w^(3) · x_3) = 0 ≤ 0. Update: w^(4) = w^(3) + (1)(1, 1, 0) = (1, 1, 1).

Example: Logic Function OR. w^(4) = (1, 1, 1). Fourth input: x_4 = (1, 1), y_4 = 1; including the bias, x_4 = (1, 1, 1). Value of the neuron: w^(4) · x_4 = 1·1 + 1·1 + 1·1 = 3, so y_4(w^(4) · x_4) = 3 > 0. No update.

Example: Logic Function OR. Not done yet! w^(4) = (1, 1, 1). First input again: x_1 = (0, 0), y_1 = −1; including the bias, x_1 = (1, 0, 0). Value of the neuron: w^(4) · x_1 = 1·1 + 1·0 + 1·0 = 1, so y_1(w^(4) · x_1) = −1 ≤ 0. We need to update again.

Example: Logic Function OR. We've been through all the inputs once, but that doesn't mean we're finished! We need to go through the inputs again, until the weights settle down and stop changing. When the data are inseparable, the weights may never stop changing. The full run is sketched in code below.
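A short, self-contained script that runs exactly this procedure on the OR data (bias coordinate prepended, zero initial weights, η = 1); its first three updates reproduce the weight vectors computed above, and with this example ordering it settles at w = (−1, 2, 2):

```python
import numpy as np

# OR data with a leading bias coordinate of 1; labels in {-1, +1}.
X = np.array([[1., 0., 0.],
              [1., 0., 1.],
              [1., 1., 0.],
              [1., 1., 1.]])
y = np.array([-1, 1, 1, 1])

w = np.zeros(3)
for epoch in range(20):
    updated = False
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(w, x_i) <= 0:    # mistake on this example: update
            w = w + y_i * x_i
            updated = True
            print(f"epoch {epoch}: updated on {x_i} -> w = {w}")
    if not updated:                      # a full pass with no updates: the weights have settled
        break

print("final weights:", w)               # (-1, 2, 2) separates the OR data
```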

When to stop? The algorithm runs over the dataset many times. How do we decide when to stop learning (in general)?

Validation Set. Training set: used to train the algorithm, i.e. to adjust the weights. Validation set: used to keep track of how well it is doing, and to verify that any increase in accuracy over the training data yields an increase in accuracy over a dataset that the network wasn't trained on. Test set: used to produce the final results, i.e. to test the final solution and confirm the actual predictive power of the algorithm.

Validation Set. The proportions of the train/validation/test sets are typically 60:20:20 (after the dataset has been shuffled!). Alternatively, K-fold cross-validation: the dataset is randomly partitioned into K subsets; one subset is used for validation and the algorithm is trained on all the others; then a different subset is left out and a new model is trained; the process is repeated for all K subsets, and finally the model that produced the lowest validation error is used. Leave-one-out: the algorithm is validated on one piece of data and trained on all the rest, repeated N times, where N is the length of the dataset.
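A hedged sketch of the 60:20:20 split described above, assuming the dataset is given as numpy arrays X and y (the function name and fixed seed are illustrative):

```python
import numpy as np

def train_val_test_split(X, y, seed=0):
    """Shuffle the data, then split it 60:20:20 into train / validation / test sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))       # shuffle first, as the slide insists
    X, y = X[perm], y[perm]
    n_train, n_val = int(0.6 * len(X)), int(0.2 * len(X))
    return ((X[:n_train], y[:n_train]),
            (X[n_train:n_train + n_val], y[n_train:n_train + n_val]),
            (X[n_train + n_val:], y[n_train + n_val:]))
```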

Overfitting. Rather than finding a general function (left), our network matches the training inputs perfectly, including the noise in them (right). This reduces its generalization capabilities.

Back to: When to stop? If we plot the error during training, it typically decreases fairly quickly during the first few training iterations; then the reduction slows down as the learning algorithm performs small changes to find the exact local minimum. Note: this graph is general and does not necessarily describe the behavior of the error rate while training the Perceptron, because the Perceptron does not guarantee that there will be fewer mistakes on the next iterations.

When to stop? We don't want to stop training until the local minimum has been found, but training for too long leads to overfitting. This is where the validation set comes in useful.

When to stop? We train the network for some predetermined amount of time, and then use the validation set to estimate how well the network is generalizing. We then carry on training for a few more iterations, and repeat the whole process.

When to stop? At some stage the error on the validation set will start increasing again, because the network has stopped learning about the function that generated the data and has started to learn about the noise in the data itself. At this stage we stop the training. This technique is called early stopping.

When to stop? Thus, the validation set is used to prevent overfitting and to monitor the generalization ability of the network: if the accuracy over the training set increases but the accuracy over the validation set stays the same or decreases, then we have caused overfitting and should stop training.
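One way to code the early-stopping idea for the Perceptron specifically is sketched below: train one epoch at a time, track the validation error, and return the best weights once the validation error has stopped improving. This is an illustrative adaptation (labels in {−1, +1}, bias folded into w), not an algorithm given in the slides:

```python
import numpy as np

def perceptron_early_stopping(X_tr, y_tr, X_val, y_val, max_epochs=200, patience=10):
    """Train one epoch at a time; stop once the validation error has not improved for `patience` epochs."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_err, stale = w.copy(), np.inf, 0
    for _ in range(max_epochs):
        for x_i, y_i in zip(X_tr, y_tr):                # one Perceptron epoch on the training set
            if y_i * np.dot(w, x_i) <= 0:
                w = w + y_i * x_i
        val_err = np.mean(np.sign(X_val @ w) != y_val)  # error on data the weights were not trained on
        if val_err < best_err:
            best_w, best_err, stale = w.copy(), val_err, 0
        else:
            stale += 1                                  # validation error stopped improving
            if stale >= patience:
                break
    return best_w
```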

Perceptron Variant. The pocket algorithm keeps the best solution seen so far "in its pocket", and returns the solution in the pocket rather than the last solution.
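A possible sketch of the pocket variant; the random choice of a misclassified example, the stopping rule, and the names are illustrative choices, not fixed by the slide:

```python
import numpy as np

def pocket_perceptron(X, y, max_updates=1000, seed=0):
    """Perceptron that keeps the best weight vector seen so far 'in its pocket'."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    pocket_w, pocket_err = w.copy(), np.mean(np.sign(X @ w) != y)
    for _ in range(max_updates):
        wrong = np.nonzero(y * (X @ w) <= 0)[0]
        if len(wrong) == 0:
            return w                                  # separable and solved: the last w is perfect
        i = rng.choice(wrong)                         # update on some misclassified example
        w = w + y[i] * X[i]
        err = np.mean(np.sign(X @ w) != y)
        if err < pocket_err:                          # better than the pocket: replace its contents
            pocket_w, pocket_err = w.copy(), err
    return pocket_w                                   # return the pocket, not the last solution
```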

Perceptron Bound Theorem. Note: γ is called the margin.
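For reference, a conventional statement of this bound (Novikoff's theorem) and a short proof sketch, assuming homogeneous halfspaces, ‖w*‖ = 1, and R = max_i ‖x_i‖:

```latex
\documentclass{article}
\usepackage{amsmath,amssymb,amsthm}
\newtheorem{theorem}{Theorem}
\begin{document}
\begin{theorem}[Perceptron mistake bound]
Suppose there exists $w^\ast$ with $\|w^\ast\| = 1$ and $y_i \langle w^\ast, x_i \rangle \ge \gamma > 0$
for every example $(x_i, y_i)$, and let $R = \max_i \|x_i\|$.
Then the Perceptron performs at most $(R/\gamma)^2$ updates.
\end{theorem}
\begin{proof}[Proof sketch]
Each update increases $\langle w^\ast, w \rangle$ by at least $\gamma$, while $\|w\|^2$ grows by at most
$R^2$ per update. Hence after $T$ updates
$T\gamma \le \langle w^\ast, w^{(T+1)} \rangle \le \|w^{(T+1)}\| \le \sqrt{T}\,R$,
which gives $T \le (R/\gamma)^2$.
\end{proof}
\end{document}
```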