Perceptron. Subhransu Maji. CMPSCI 689: Machine Learning. 3 February 2015

Perceptron. Subhransu Maji. CMPSCI 689: Machine Learning. 3 February 2015 / 5 February 2015

So far in the class:
Decision trees. Inductive bias: use a combination of a small number of features.
Nearest neighbor classifier. Inductive bias: all features are equally good.
Today: perceptrons. Inductive bias: use all features, but some more than others.

A neuron (or how our brains work). [Figure: a biological neuron, Neuroscience 101.]

Perceptron. Inputs are feature values, and each feature has a weight. Sum them into the activation:

$\text{activation}(w, x) = \sum_i w_i x_i = w^T x$

If the activation is > b, output class 1; otherwise, output class 2. [Figure: inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ feeding a unit that compares $w^T x$ against the threshold $b$.] The bias can be folded into the weights by augmenting the input: replacing $x$ with $(x, 1)$ and $w$ with $(w, b)$ gives $(w, b)^T (x, 1) = w^T x + b$.
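A minimal NumPy sketch of this prediction rule (the function names and example values are assumptions for illustration, not from the slides):

```python
import numpy as np

def activation(w, x):
    # activation(w, x) = sum_i w_i * x_i = w^T x
    return np.dot(w, x)

def predict(w, x, b=0.0):
    # output class +1 if the activation exceeds the threshold b, otherwise -1
    return 1 if activation(w, x) > b else -1

w = np.array([2.0, -1.0, 0.5])   # assumed weights
x = np.array([1.0, 3.0, 2.0])    # assumed feature values
print(predict(w, x, b=0.0))      # -1, since 2 - 3 + 1 = 0 is not > 0
```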

Example: spam. Imagine 3 features (spam is the positive class): "free" (number of occurrences of "free"), "money" (number of occurrences of "money"), and BIAS (intercept, always has value 1). [Figure: a table pairing the email's feature vector x with the weight vector w; if $w^T x > 0$, the email is classified as SPAM.]
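A small worked version of this spam example; the feature counts and weights below are made up for illustration, not taken from the slide's figure:

```python
import numpy as np

# features: ["free", "money", "BIAS"]
x = np.array([2.0, 1.0, 1.0])    # hypothetical email: "free" twice, "money" once, bias = 1
w = np.array([4.0, 2.0, -3.0])   # hypothetical weights

score = np.dot(w, x)             # w^T x = 4*2 + 2*1 + (-3)*1 = 7
print("SPAM" if score > 0 else "HAM")   # 7 > 0, so SPAM
```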

Geometry of the perceptron. In the space of feature vectors, examples are points (in D dimensions) and a weight vector defines a hyperplane (a D-1 dimensional object): the decision boundary $w^T x = 0$, with $w$ as its normal vector. One side corresponds to y = +1, the other side to y = -1. Perceptrons are therefore also called linear classifiers.

Learning a perceptron. Input: training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.

Perceptron training algorithm [Rosenblatt 57]:
  initialize $w \leftarrow [0, \ldots, 0]$
  for iter = 1, ..., T:
    for i = 1, ..., n:
      predict according to the current model: $\hat{y}_i = +1$ if $w^T x_i > 0$, and $\hat{y}_i = -1$ if $w^T x_i \le 0$
      if $y_i = \hat{y}_i$: no change
      else: $w \leftarrow w + y_i x_i$

[Figure: a misclassified example $x_i$ with $y_i = -1$ and the resulting update to $w$.]

The algorithm is error driven and online: each update increases the activation on positive examples and decreases it on negative ones. In practice, randomize the order of the training examples.
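A compact NumPy sketch of this training loop (function and variable names are my own, not from the slides):

```python
import numpy as np

def perceptron_train(X, y, T=10, seed=0):
    """X: (n, d) feature matrix; y: labels in {+1, -1}; T: number of passes."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(seed)
    for _ in range(T):
        for i in rng.permutation(n):           # randomize the order of examples
            y_hat = 1 if w @ X[i] > 0 else -1  # predict with the current model
            if y_hat != y[i]:                  # error driven: update only on mistakes
                w += y[i] * X[i]
    return w
```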

Properties of perceptrons.
Separability: some parameters will classify the training data perfectly.
Convergence: if the training data is separable, then perceptron training will eventually converge [Block 62, Novikoff 62].
Mistake bound: the maximum number of mistakes is related to the margin $\gamma$. Assuming $\|x_i\| \le 1$,

$\#\text{mistakes} < \frac{1}{\gamma^2}$, where $\gamma = \max_{w} \min_{(x_i, y_i)} y_i\, w^T x_i$ such that $\|w\| = 1$.

Review: geometry.

Proof of convergence. Let $\hat{w}$ be the separating hyperplane with margin $\gamma$ (so $\|\hat{w}\| = 1$ and $y_i \hat{w}^T x_i \ge \gamma$ for all $i$).

Step 1: $w$ is getting closer to $\hat{w}$. After the k-th update,
$\hat{w}^T w^{(k)} = \hat{w}^T (w^{(k-1)} + y_i x_i)$   (update rule)
$= \hat{w}^T w^{(k-1)} + y_i \hat{w}^T x_i$   (algebra)
$\ge \hat{w}^T w^{(k-1)} + \gamma$   (definition of margin)
$\ge k\gamma$.
Since $\|\hat{w}\| = 1$, Cauchy-Schwarz gives $k\gamma \le \hat{w}^T w^{(k)} \le \|w^{(k)}\|$.

Step 2: bound the norm.
$\|w^{(k)}\|^2 = \|w^{(k-1)} + y_i x_i\|^2$   (update rule)
$= \|w^{(k-1)}\|^2 + 2 y_i w^{(k-1)T} x_i + \|y_i x_i\|^2 \le \|w^{(k-1)}\|^2 + \|y_i x_i\|^2$   (the cross term is $\le 0$ because updates happen only on mistakes)
$\le \|w^{(k-1)}\|^2 + 1$   (since $\|x_i\| \le 1$)
$\le k$,
so $\|w^{(k)}\| \le \sqrt{k}$.

Combining the two bounds: $k\gamma \le \|w^{(k)}\| \le \sqrt{k}$, hence $k \le \frac{1}{\gamma^2}$.

Limitations of perceptrons.
Convergence: if the data isn't separable, the training algorithm may not terminate; noise can cause this, and some simple functions (e.g. XOR) are not separable.
Mediocre generalization: the algorithm finds a solution that barely separates the data.
Overtraining: test/validation accuracy rises and then falls; overtraining is a kind of overfitting.

A problem with perceptron training. Problem: updates on later examples can take over. Suppose there are 10000 training examples. The algorithm learns a weight vector on the first 100 examples, gets the next 9899 points correct, then gets the 10000th point wrong and updates the weight vector. This single update can completely ruin the weight vector (we get 50% error). [Figure: $w^{(9999)}$, $w^{(10000)}$, and the offending example $x_{10000}$.] The fix: voted and averaged perceptrons (Freund and Schapire, 1999).

Voted perceptron. Key idea: remember how long each weight vector survives. Let $w^{(1)}, w^{(2)}, \ldots, w^{(K)}$ be the sequence of weights obtained by the perceptron learning algorithm, and $c^{(1)}, c^{(2)}, \ldots, c^{(K)}$ be the survival times for each of these: a weight that gets updated immediately gets c = 1, a weight that survives another round gets c = 2, etc. Then

$\hat{y} = \text{sign}\left( \sum_{k=1}^{K} c^{(k)} \, \text{sign}\left( w^{(k)T} x \right) \right)$

Voted perceptron training algorithm.
Input: training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Output: list of pairs $(w^{(1)}, c^{(1)}), (w^{(2)}, c^{(2)}), \ldots, (w^{(K)}, c^{(K)})$
  initialize $k = 1$, $c^{(1)} = 0$, $w^{(1)} \leftarrow [0, \ldots, 0]$
  for iter = 1, ..., T:
    for i = 1, ..., n:
      predict according to the current model: $\hat{y}_i = +1$ if $w^{(k)T} x_i > 0$, and $\hat{y}_i = -1$ if $w^{(k)T} x_i \le 0$
      if $y_i = \hat{y}_i$: $c^{(k)} \leftarrow c^{(k)} + 1$
      else: $w^{(k+1)} \leftarrow w^{(k)} + y_i x_i$, $c^{(k+1)} \leftarrow 1$, $k \leftarrow k + 1$

Better generalization, but not very practical (every intermediate weight vector must be stored and evaluated at prediction time).
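A sketch of the voted perceptron in NumPy, following the algorithm above; as before, the names are assumptions for illustration:

```python
import numpy as np

def voted_perceptron_train(X, y, T=10):
    """Return the list of (weight vector, survival count) pairs."""
    n, d = X.shape
    weights = [np.zeros(d)]   # w^(1)
    counts = [0]              # c^(1)
    for _ in range(T):
        for i in range(n):
            y_hat = 1 if weights[-1] @ X[i] > 0 else -1
            if y_hat == y[i]:
                counts[-1] += 1                            # current vector survives this example
            else:
                weights.append(weights[-1] + y[i] * X[i])  # start a new weight vector
                counts.append(1)
    return list(zip(weights, counts))

def voted_predict(pairs, x):
    # weighted vote of each weight vector's own prediction
    vote = sum(c * (1 if w @ x > 0 else -1) for w, c in pairs)
    return 1 if vote > 0 else -1
```

The storage cost of one vector per mistake is what makes this variant impractical for large problems.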

Averaged perceptron. Key idea: remember how long each weight vector survives. As before, let $w^{(1)}, w^{(2)}, \ldots, w^{(K)}$ be the sequence of weights obtained by the perceptron learning algorithm, and $c^{(1)}, c^{(2)}, \ldots, c^{(K)}$ be the survival times for each of these (a weight that gets updated immediately gets c = 1, a weight that survives another round gets c = 2, etc.). Then

$\hat{y} = \text{sign}\left( \sum_{k=1}^{K} c^{(k)}\, w^{(k)T} x \right) = \text{sign}\left( \bar{w}^T x \right)$, where $\bar{w} = \sum_{k=1}^{K} c^{(k)} w^{(k)}$.

It performs similarly to the voted perceptron, but is much more practical.

Averaged perceptron training algorithm.
Input: training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Output: $\bar{w}$
  initialize $c = 0$, $w \leftarrow [0, \ldots, 0]$, $\bar{w} \leftarrow [0, \ldots, 0]$
  for iter = 1, ..., T:
    for i = 1, ..., n:
      predict according to the current model: $\hat{y}_i = +1$ if $w^T x_i > 0$, and $\hat{y}_i = -1$ if $w^T x_i \le 0$
      if $y_i = \hat{y}_i$: $c \leftarrow c + 1$
      else: $\bar{w} \leftarrow \bar{w} + c\,w$ (update the average), $w \leftarrow w + y_i x_i$, $c \leftarrow 1$
  return $\bar{w} + c\,w$
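The same idea in NumPy with the lazy averaging from the pseudocode above (again a sketch, with assumed names):

```python
import numpy as np

def averaged_perceptron_train(X, y, T=10):
    n, d = X.shape
    w = np.zeros(d)       # current weight vector
    w_avg = np.zeros(d)   # running (unnormalized) sum of c * w
    c = 0                 # survival count of the current w
    for _ in range(T):
        for i in range(n):
            y_hat = 1 if w @ X[i] > 0 else -1
            if y_hat == y[i]:
                c += 1
            else:
                w_avg += c * w        # fold the outgoing vector into the average
                w = w + y[i] * X[i]
                c = 1
    return w_avg + c * w              # include the final surviving vector
```

Prediction is then a single dot product, sign(w_bar @ x), which is why the averaged variant is the practical one.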

Comparison of perceptron variants. [Figure: performance of the perceptron variants on the MNIST dataset, from Freund and Schapire 1999.]

Improving perceptrons.
Multilayer perceptrons: learn non-linear functions of the input (neural networks).
Separators with good margins: improve the generalization ability of the classifier (support vector machines).
Feature mapping: learn non-linear functions of the input using a perceptron; we will later learn to do this efficiently using kernels. For example, the map $\phi: (x_1, x_2) \mapsto (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ turns the ellipse $\frac{x_1^2}{a^2} + \frac{x_2^2}{b^2} = 1$ into the linear boundary $\frac{z_1}{a^2} + \frac{z_3}{b^2} = 1$ in the mapped coordinates $(z_1, z_2, z_3) = \phi(x_1, x_2)$.
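A tiny numerical check of this feature map (the ellipse axes a and b below are arbitrary assumptions):

```python
import numpy as np

def phi(x1, x2):
    # explicit quadratic feature map from the slide: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

a, b = 2.0, 1.0                            # assumed ellipse axes
x1, x2 = a * np.cos(0.7), b * np.sin(0.7)  # a point on the ellipse x1^2/a^2 + x2^2/b^2 = 1
z = phi(x1, x2)
# in the mapped space the ellipse becomes the linear constraint z1/a^2 + z3/b^2 = 1
print(z[0] / a**2 + z[2] / b**2)           # prints 1.0 (up to floating point error)
```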

Slides credit. Some slides are adapted from Dan Klein at UC Berkeley and from the CIML book by Hal Daume. The figure comparing the perceptron variants is from Freund and Schapire.