The Perceptron Algorithm

The Perceptron Algorithm
Greg Grudic, Machine Learning

Questions?

Binary Classification

A binary classifier is a mapping from a set of $d$ inputs to a single output which can take on one of TWO values. In the most general setting: inputs $x \in \mathbb{R}^d$, output $y \in \{-1, +1\}$. Specifying the output classes as $-1$ and $+1$ is arbitrary! It is often done as a mathematical convenience.

A Binary Classifier

Given learning data $(x_1, y_1), \ldots, (x_N, y_N)$, a model is constructed:

$$x \;\longrightarrow\; \text{Classification Model } M(x) \;\longrightarrow\; \hat{y} \in \{-1, +1\}$$

Linear Separating Hyper-Planes

[Figure: a hyperplane in the $(x_1, x_2)$ plane separating the $y = +1$ region from the $y = -1$ region]

On the boundary: $\beta_0 + \sum_{i=1}^{d} \beta_i x_i = 0$

On the $y = +1$ side: $\beta_0 + \sum_{i=1}^{d} \beta_i x_i > 0$

On the $y = -1$ side: $\beta_0 + \sum_{i=1}^{d} \beta_i x_i < 0$

Linear Separating Hyper-Planes

The Model:

$$\hat{y} = M(x) = \operatorname{sgn}\left(\hat\beta_0 + (\hat\beta_1, \ldots, \hat\beta_d)^T x\right)$$

Where:

$$\operatorname{sgn}[A] = \begin{cases} +1 & \text{if } A > 0 \\ -1 & \text{otherwise} \end{cases}$$

The decision boundary:

$$\hat\beta_0 + (\hat\beta_1, \ldots, \hat\beta_d)^T x = 0$$
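A minimal sketch of this model in Python, assuming the parameters are held in a NumPy array (the function name `predict` is illustrative; the slides give no code):

```python
import numpy as np

def predict(beta0: float, beta: np.ndarray, x: np.ndarray) -> int:
    """Linear classifier: sgn(beta0 + beta^T x), with sgn[A] = +1
    if A > 0 and -1 otherwise, matching the definition above."""
    return 1 if beta0 + beta @ x > 0 else -1
```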

Linear Separating Hyper-Planes

The model parameters are $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$. The hat on the betas means that they are estimated from the data. Many different learning algorithms have been proposed for determining $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$.

Rosenblatt's Perceptron Learning Algorithm

Dates back to the 1950s and is the motivation behind Neural Networks. The algorithm:

- Start with a random hyperplane $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$
- Incrementally modify the hyperplane so that misclassified points move closer to the correct side of the boundary
- Stop when all learning examples are correctly classified
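A minimal sketch of this loop in Python, using the update rule derived on the slides below; `rho` and the epoch cap `max_epochs` are illustrative additions (the cap guards against the non-separable case discussed later):

```python
import numpy as np

def perceptron_train(X, Y, rho=1.0, max_epochs=1000):
    """Online perceptron: start from a random hyperplane, nudge it
    toward each misclassified point, stop when all are correct."""
    N, d = X.shape
    rng = np.random.default_rng(0)
    beta0, beta = rng.normal(), rng.normal(size=d)
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(X, Y):
            if y * (beta0 + beta @ x) <= 0:  # misclassified point
                beta0 += rho * y             # push boundary toward it
                beta = beta + rho * y * x
                errors += 1
        if errors == 0:                      # all examples correct
            return beta0, beta
    return beta0, beta                       # may not have converged
```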

Rosenblatt's Perceptron Learning Algorithm

The algorithm is based on the following property: the signed distance of any point $x$ to the boundary is

$$D = \frac{\hat\beta_0 + (\hat\beta_1, \ldots, \hat\beta_d)^T x}{\sqrt{\sum_{i=1}^{d} \hat\beta_i^2}}$$

Therefore, if $M$ is the set of misclassified learning examples, we can push them closer to the boundary by minimizing

$$D(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d) = -\sum_{i \in M} y_i \left(\hat\beta_0 + (\hat\beta_1, \ldots, \hat\beta_d)^T x_i\right)$$

Rosenblatt's Minimization Function

This is classic Machine Learning! First define a cost (risk/loss) function in model parameter space:

$$D(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d) = -\sum_{i \in M} y_i \left(\hat\beta_0 + \sum_{k=1}^{d} \hat\beta_k x_{ik}\right)$$

Then find an algorithm that modifies $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$ so that this cost function is minimized. One such algorithm is Gradient Descent.
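A direct transcription of this cost into NumPy; `margins` collects $y_i(\hat\beta_0 + \hat\beta^T x_i)$ for all points, and points with non-positive margin form the misclassified set $M$:

```python
import numpy as np

def perceptron_cost(beta0, beta, X, Y):
    """D = -sum over misclassified i of y_i * (beta0 + beta^T x_i).
    Zero when nothing is misclassified, positive otherwise."""
    margins = Y * (X @ beta + beta0)
    return -margins[margins <= 0].sum()
```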

Gradient Descent

[Figure: the cost surface $D(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$ plotted over the $(\hat\beta_0, \hat\beta_1)$ parameter plane]

The Gradient Descent Algorithm

$$\hat\beta_i \;\leftarrow\; \hat\beta_i - \rho \, \frac{\partial D(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)}{\partial \hat\beta_i}$$

Where the learning rate is defined by $\rho > 0$.
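In code, a generic gradient-descent loop might look like this sketch, where `grad_fn` stands for whatever computes the partial derivatives of $D$ with respect to the parameters:

```python
import numpy as np

def gradient_descent(params, grad_fn, rho=0.1, steps=100):
    """Repeatedly move the parameter vector against the gradient
    of the cost, scaled by the learning rate rho > 0."""
    for _ in range(steps):
        params = params - rho * grad_fn(params)
    return params
```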

The Gradient Descent Algorithm for the Perceptron

$$\frac{\partial D}{\partial \hat\beta_0} = -\sum_{i \in M} y_i, \qquad \frac{\partial D}{\partial \hat\beta_j} = -\sum_{i \in M} y_i x_{ij}, \quad j = 1, \ldots, d$$

Two versions of the Perceptron Algorithm:

Online (update one misclassified point $i$ at a time):

$$\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_d \end{pmatrix} \;\leftarrow\; \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_d \end{pmatrix} + \rho \begin{pmatrix} y_i \\ y_i x_{i1} \\ \vdots \\ y_i x_{id} \end{pmatrix}$$

Batch (update all misclassified points at once):

$$\begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_d \end{pmatrix} \;\leftarrow\; \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_d \end{pmatrix} + \rho \begin{pmatrix} \sum_{i \in M} y_i \\ \sum_{i \in M} y_i x_{i1} \\ \vdots \\ \sum_{i \in M} y_i x_{id} \end{pmatrix}$$
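The online version was sketched earlier; here is a matching sketch of one batch step in NumPy, assuming `X` is the $N \times d$ input matrix and `Y` the label vector introduced on the next slide:

```python
import numpy as np

def perceptron_batch_step(beta0, beta, X, Y, rho):
    """One batch gradient-descent step: sum the update over all
    currently misclassified points (the set M) at once."""
    mis = Y * (X @ beta + beta0) <= 0             # mask selecting M
    beta0 = beta0 + rho * Y[mis].sum()
    beta = beta + rho * (Y[mis][:, None] * X[mis]).sum(axis=0)
    return beta0, beta
```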

The Learning Data

Training Data: $(x_1, y_1), \ldots, (x_N, y_N)$

Matrix representation of $N$ learning examples of $d$-dimensional inputs:

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1d} \\ \vdots & \ddots & \vdots \\ x_{N1} & \cdots & x_{Nd} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}$$
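For instance, a hypothetical toy dataset in this matrix form ($N = 4$, $d = 2$) as NumPy arrays:

```python
import numpy as np

X = np.array([[ 2.0,  1.0],
              [ 1.0,  3.0],
              [-1.0, -2.0],
              [-2.0, -1.0]])    # N x d matrix of inputs
Y = np.array([1, 1, -1, -1])    # length-N labels in {-1, +1}

beta0, beta = perceptron_train(X, Y)   # using the sketch above
```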

The Good Theoretical Properties of the Perceptron Algorithm

If a solution exists, the algorithm will always converge in a finite number of steps!

Question: does a solution always exist?

Linearly Separable Data

Which of these datasets are separable by a linear boundary?

[Figure: two 2-D scatter plots of + and - points, labeled a) and b)]

Linearly Separable Data

Which of these datasets are separable by a linear boundary?

[Figure: the same two datasets; dataset b) is marked "Not Linearly Separable!"]

Bad Theoretical Properties of the Perceptron Algorithm

- If the data is not linearly separable, the algorithm cycles forever and cannot converge! This property stopped active research in this area between 1968 and 1984 (Perceptrons, Minsky and Papert, 1969).
- Even when the data is separable, there are infinitely many solutions. Which solution is best?
- When the data is linearly separable, the number of steps to converge can be very large (it depends on the size of the gap between the classes).

What about Nonlinear Data?

Data that is not linearly separable is called nonlinear data. Nonlinear data can often be mapped (via nonlinear basis functions) into a space where it is linearly separable.

Nonlinear Models

The Linear Model:

$$\hat{y} = M(x) = \operatorname{sgn}\left(\hat\beta_0 + \sum_{i=1}^{d} \hat\beta_i x_i\right)$$

The Nonlinear (basis function) Model:

$$\hat{y} = M(x) = \operatorname{sgn}\left(\hat\beta_0 + \sum_{i=1}^{k} \hat\beta_i \phi_i(x)\right)$$

Examples of nonlinear basis functions:

$$\phi_1(x) = x_2^2, \quad \phi_2(x) = x_3^2, \quad \phi_3(x) = x_2 x_4, \quad \phi_4(x) = \sin(x_5)$$
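A sketch of such a basis expansion in Python; the particular $\phi_i$ here are illustrative choices in the spirit of the examples above, and the perceptron is then trained on $\phi(x)$ instead of $x$:

```python
import numpy as np

def phi(x):
    """Map an input into a nonlinear feature space; a linear
    boundary in phi-space is nonlinear in the original x-space."""
    return np.array([x[0]**2, x[1]**2, x[0] * x[1], np.sin(x[0])])
```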

Linear Separating Hyper-Planes in Nonlinear Basis Function Space

[Figure: a hyperplane in the $(\phi_1, \phi_2)$ plane separating the $y = +1$ region from the $y = -1$ region]

On the boundary: $\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i = 0$

On the $y = +1$ side: $\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i > 0$

On the $y = -1$ side: $\beta_0 + \sum_{i=1}^{k} \beta_i \phi_i < 0$

An Example.8.6.4 : y=+ : y=- Φ.2 x 2 -.2 -.4.9 : y=+ : y=- -.6.8 -.8.7 - - -.8 -.6 -.4 -.2.2.4.6.8 x φ 2 = x 2 2.6.5.4.3.2...2.3.4.5.6.7.8.9 Greg Grudic Machine Learning 22 φ = x 2

Kernels as Nonlinear Transformations

Polynomial:

$$K(x_i, x_j) = \left(\langle x_i, x_j \rangle + q\right)^k$$

Sigmoid:

$$K(x_i, x_j) = \tanh\left(\kappa \langle x_i, x_j \rangle + \theta\right)$$

Gaussian or Radial Basis Function (RBF):

$$K(x_i, x_j) = \exp\left(-\frac{1}{2\sigma^2}\,\|x_i - x_j\|^2\right)$$
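Sketches of these three kernels in NumPy ($q$, $k$, $\kappa$, $\theta$, and $\sigma$ are the kernel hyperparameters; the default values here are illustrative):

```python
import numpy as np

def poly_kernel(xi, xj, q=1.0, k=2):
    return (xi @ xj + q) ** k                  # polynomial kernel

def sigmoid_kernel(xi, xj, kappa=1.0, theta=0.0):
    return np.tanh(kappa * (xi @ xj) + theta)  # sigmoid kernel

def rbf_kernel(xi, xj, sigma=1.0):
    # Gaussian / RBF kernel: exp(-||xi - xj||^2 / (2 sigma^2))
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))
```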

The Kernel Model

Training Data: $(x_1, y_1), \ldots, (x_N, y_N)$

$$\hat{y} = M(x) = \operatorname{sgn}\left(\hat\beta_0 + \sum_{i=1}^{N} \hat\beta_i K(x_i, x)\right)$$

The number of basis functions equals the number of training examples, unless some of the betas get set to zero.
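A minimal sketch of prediction with this model (one beta per training example, plus the bias; `kernel` is any of the kernel functions sketched above):

```python
def kernel_predict(beta0, betas, X_train, x, kernel):
    """Kernel model: sgn(beta0 + sum_i beta_i * K(x_i, x))."""
    s = beta0 + sum(b * kernel(xi, x) for b, xi in zip(betas, X_train))
    return 1 if s > 0 else -1
```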

Gram (Kernel) Matrix

Training Data: $(x_1, y_1), \ldots, (x_N, y_N)$

$$K = \begin{pmatrix} K(x_1, x_1) & \cdots & K(x_1, x_N) \\ \vdots & \ddots & \vdots \\ K(x_N, x_1) & \cdots & K(x_N, x_N) \end{pmatrix}$$

Properties: positive (semi-)definite, symmetric, positive on the diagonal, $N$ by $N$.
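Building the Gram matrix from a kernel function, e.g. the `rbf_kernel` sketch above:

```python
import numpy as np

def gram_matrix(X, kernel):
    """N x N matrix with K[i, j] = kernel(x_i, x_j); symmetric,
    with positive entries on the diagonal."""
    N = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(N)]
                     for i in range(N)])
```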

Picking a Model Structure?

How do you pick the kernels and the kernel parameters? These are called learning parameters or hyperparameters. Two approaches to choosing learning parameters:

- Bayesian: learning parameters must maximize the probability of correct classification on future data, based on prior biases.
- Frequentist: use the training data to learn the model parameters $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_d)$, and use validation data to pick the best hyperparameters.

Perceptron Algorithm Convergence

Two problems:

- No convergence when the data is not separable in basis function space.
- Infinitely many solutions when the data is separable.

Can we modify the algorithm to fix these problems?