Perceptron
Subhransu Maji
CMPSCI 689: Machine Learning
3 February 2015 / 5 February 2015
So far in the class
- Decision trees. Inductive bias: use a combination of a small number of features.
- Nearest neighbor classifier. Inductive bias: all features are equally good.
- Perceptrons (today). Inductive bias: use all features, but some more than others.
A neuron (or how our brains work)
Neuroscience 101
[Figure: a biological neuron]
Perceptron
- Inputs are feature values $x_1, x_2, x_3, \ldots$; each feature has a weight $w_1, w_2, w_3, \ldots$
- The activation is the weighted sum of the features:
  $\mathrm{activation}(w, x) = \sum_i w_i x_i = w^T x$
- If the activation is $> b$, output class 1; otherwise, output class 2.
- The threshold can be absorbed as a bias weight: augment $x$ to $(x, 1)$ and $w$ to $(w, b)$, so that $w^T x + b = (w, b)^T (x, 1)$ and the decision reduces to checking the sign of the augmented activation.
[Figure: inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ feeding a threshold unit]
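A minimal sketch of the decision rule and the bias-absorption trick above, assuming NumPy; the example weights and features are invented for illustration, not from the slides:

```python
import numpy as np

def activation(w, x):
    """Weighted sum of the features: w^T x."""
    return np.dot(w, x)

def predict(w, x, b=0.0):
    """Class +1 if w^T x + b > 0, else class -1."""
    return 1 if activation(w, x) + b > 0 else -1

# Absorbing the bias: append a constant-1 feature to x and b to w,
# so that w^T x + b = (w, b)^T (x, 1).
w, b = np.array([0.5, -1.2, 2.0]), 0.3   # made-up values
x = np.array([1.0, 0.0, -1.0])
w_aug = np.append(w, b)                  # (w, b)
x_aug = np.append(x, 1.0)                # (x, 1)
assert np.isclose(activation(w, x) + b, activation(w_aug, x_aug))
print(predict(w, x, b))                  # same as predict(w_aug, x_aug, 0)
```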
Example: Spam
Imagine 3 features (spam is the positive class):
- free (number of occurrences of "free")
- money (number of occurrences of "money")
- BIAS (intercept, always has value 1)
An email with feature vector $x$ is classified as SPAM when $w^T x > 0$.
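A worked instance of the spam rule; the weight values here are hypothetical, chosen only to make the arithmetic concrete:

```python
import numpy as np

features = ["free", "money", "BIAS"]
w = np.array([4.0, 2.0, -3.0])          # assumed weights, not from the lecture
x = np.array([2.0, 1.0, 1.0])           # "free" twice, "money" once, BIAS = 1

score = np.dot(w, x)                    # w^T x = 8 + 2 - 3 = 7
print("SPAM" if score > 0 else "HAM")   # w^T x > 0, so SPAM
```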
Geometry of the perceptron
In the space of feature vectors:
- examples are points (in D dimensions)
- a weight vector is a hyperplane $w^T x = 0$ (a (D-1)-dimensional object)
- one side of the hyperplane corresponds to y = +1, the other to y = -1
Perceptrons are also called linear classifiers.
Learning a perceptron
Input: training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Perceptron training algorithm [Rosenblatt, 1957]:
- Initialize $w \leftarrow [0, \ldots, 0]$
- for iter = 1, ..., T
  - for i = 1, ..., n
    - predict according to the current model: $\hat{y}_i = +1$ if $w^T x_i > 0$, $-1$ if $w^T x_i \le 0$
    - if $y_i = \hat{y}_i$: no change
    - else: $w \leftarrow w + y_i x_i$
The algorithm is error driven and online; an update increases the activation $y_i w^T x_i$ on the mistaken example; randomize the order of the examples in each pass.
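A sketch of this training loop in NumPy, under the convention that the bias has already been absorbed into an augmented feature:

```python
import numpy as np

def perceptron_train(X, y, T=10, rng=None):
    """X: (n, d) array of examples, y: labels in {+1, -1}.
    Returns the learned weight vector."""
    n, d = X.shape
    w = np.zeros(d)                      # initialize w <- [0, ..., 0]
    rng = rng or np.random.default_rng(0)
    for _ in range(T):                   # for iter = 1, ..., T
        for i in rng.permutation(n):     # randomize the order each pass
            y_hat = 1 if w @ X[i] > 0 else -1
            if y_hat != y[i]:            # mistake: nudge w toward y_i * x_i
                w = w + y[i] * X[i]
    return w
```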
Properties of perceptrons
- Separability: some setting of the parameters classifies the training data perfectly.
- Convergence: if the training data is separable, then perceptron training will eventually converge [Block, 1962; Novikoff, 1962].
- Mistake bound: the maximum number of mistakes is related to the margin $\gamma$. Assuming $\|x_i\| \le 1$,
  $\#\text{mistakes} < \frac{1}{\gamma^2}$, where $\gamma = \max_{w : \|w\| = 1} \; \min_{(x_i, y_i)} y_i w^T x_i$
Review: geometry
[Figure: geometry of the margin]
Proof of convergence
Let $\hat{w}$ (with $\|\hat{w}\| = 1$) be the separating hyperplane with margin $\gamma$, and let $w^{(k)}$ be the weight vector after the k-th update.

Step 1: $w$ is getting closer to $\hat{w}$:
$\hat{w}^T w^{(k)} = \hat{w}^T (w^{(k-1)} + y_i x_i)$   (update rule)
$= \hat{w}^T w^{(k-1)} + \hat{w}^T y_i x_i$   (algebra)
$\ge \hat{w}^T w^{(k-1)} + \gamma$   (definition of margin)
So after k updates, $\hat{w}^T w^{(k)} \ge k\gamma$, and since $\|\hat{w}\| = 1$, $\|w^{(k)}\| \ge k\gamma$.

Step 2: bound the norm:
$\|w^{(k)}\|^2 = \|w^{(k-1)} + y_i x_i\|^2 = \|w^{(k-1)}\|^2 + 2\,y_i\,w^{(k-1)T} x_i + \|x_i\|^2$   (update rule)
$\le \|w^{(k-1)}\|^2 + \|x_i\|^2$   (the cross term is $\le 0$, since the example was misclassified)
$\le \|w^{(k-1)}\|^2 + 1 \le k$   (since $\|x_i\| \le 1$)
So $\|w^{(k)}\| \le \sqrt{k}$.

Combining the two bounds:
$k\gamma \le \|w^{(k)}\| \le \sqrt{k} \implies k \le \frac{1}{\gamma^2}$
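A small empirical check of the mistake bound; the synthetic dataset and the chosen reference separator below are assumptions for illustration. The margin computed against the (non-optimal) reference vector lower-bounds the best margin, so $1/\gamma^2$ computed from it is still a valid upper bound on the mistakes:

```python
import numpy as np

rng = np.random.default_rng(0)
w_ref = np.array([1.0, 1.0]) / np.sqrt(2)        # unit-norm reference separator
X = rng.uniform(-1, 1, size=(500, 2))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # enforce ||x|| <= 1
margins = X @ w_ref
keep = np.abs(margins) > 0.05                    # discard points too close to the boundary
X, y = X[keep], np.sign(margins[keep])

gamma = np.min(y * (X @ w_ref))                  # margin w.r.t. the reference vector
w, mistakes = np.zeros(2), 0
for _ in range(100):                             # run passes until no errors remain
    errors = 0
    for i in range(len(X)):
        if y[i] * (w @ X[i]) <= 0:               # mistake: apply the update
            w += y[i] * X[i]
            mistakes += 1
            errors += 1
    if errors == 0:
        break
print(mistakes, "<=", 1 / gamma**2)              # the bound from the proof
```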
Limitations of perceptrons
- Convergence: if the data isn't separable, the training algorithm may not terminate
  - noise can cause this
  - some simple functions are not separable (e.g., XOR)
- Mediocre generalization: the algorithm finds a solution that barely separates the data
- Overtraining: test/validation accuracy rises and then falls; overtraining is a kind of overfitting
A problem with perceptron training
Problem: updates on later examples can take over. Suppose there are 10,000 training examples:
- the algorithm learns a weight vector on the first 100 examples
- it gets the next 9,899 points correct
- it gets the 10,000th point wrong and updates the weight vector: $w^{(9999)} \to w^{(10000)}$ after the update on $x_{10000}$
- this single update can completely ruin the weight vector (e.g., 50% error)
Solution: voted and averaged perceptrons (Freund and Schapire, 1999)
Voted perceptron
Key idea: remember how long each weight vector survives.
Let $w^{(1)}, w^{(2)}, \ldots, w^{(K)}$ be the sequence of weight vectors obtained by the perceptron learning algorithm, and let $c^{(1)}, c^{(2)}, \ldots, c^{(K)}$ be the survival times for each of these:
- a weight vector that gets updated immediately gets c = 1
- a weight vector that survives another round gets c = 2, etc.
Then each weight vector casts c weighted votes:
$\hat{y} = \mathrm{sign}\left(\sum_{k=1}^{K} c^{(k)} \, \mathrm{sign}\!\left(w^{(k)T} x\right)\right)$
Voted perceptron training algorithm
Input: training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Output: list of pairs $(w^{(1)}, c^{(1)}), (w^{(2)}, c^{(2)}), \ldots, (w^{(K)}, c^{(K)})$
Initialize: $k = 1$, $c^{(1)} = 0$, $w^{(1)} \leftarrow [0, \ldots, 0]$
- for iter = 1, ..., T
  - for i = 1, ..., n
    - predict according to the current model: $\hat{y}_i = +1$ if $w^{(k)T} x_i > 0$, $-1$ if $w^{(k)T} x_i \le 0$
    - if $y_i = \hat{y}_i$: $c^{(k)} = c^{(k)} + 1$
    - else: $w^{(k+1)} = w^{(k)} + y_i x_i$; $c^{(k+1)} = 1$; $k = k + 1$
Better generalization than the plain perceptron, but not very practical: all K weight vectors must be stored and evaluated at test time.
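A sketch of the voted variant, following the pseudocode above; it makes the impracticality concrete, since the list of stored weight vectors grows with the number of mistakes:

```python
import numpy as np

def voted_perceptron_train(X, y, T=10):
    """Returns the list of (weight vector, survival count) pairs."""
    w, c = np.zeros(X.shape[1]), 0
    W, C = [], []                          # the pairs (w^(k), c^(k))
    for _ in range(T):
        for i in range(len(X)):
            y_hat = 1 if w @ X[i] > 0 else -1
            if y_hat == y[i]:
                c += 1                     # current w survives another round
            else:
                W.append(w.copy()); C.append(c)   # retire w with its count
                w = w + y[i] * X[i]
                c = 1                      # the new w starts with c = 1
    W.append(w.copy()); C.append(c)        # keep the final weight vector too
    return W, C

def voted_predict(W, C, x):
    """Each stored w casts c votes of sign(w^T x)."""
    votes = sum(c * (1 if w @ x > 0 else -1) for w, c in zip(W, C))
    return 1 if votes > 0 else -1
```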
Averaged perceptron
Key idea: the same bookkeeping as the voted perceptron, but the weight vectors are averaged rather than voted.
Let $w^{(1)}, w^{(2)}, \ldots, w^{(K)}$ be the sequence of weight vectors obtained by the perceptron learning algorithm, and $c^{(1)}, c^{(2)}, \ldots, c^{(K)}$ their survival times. Then
$\hat{y} = \mathrm{sign}\left(\sum_{k=1}^{K} c^{(k)} \, w^{(k)T} x\right) = \mathrm{sign}\!\left(\bar{w}^T x\right)$
Performs similarly to the voted perceptron, but is much more practical: only the single averaged vector $\bar{w}$ is needed at test time.
Averaged perceptron training algorithm
Input: training data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Output: averaged weight vector $\bar{w}$
Initialize: $c = 0$, $w = [0, \ldots, 0]$, $\bar{w} = [0, \ldots, 0]$
- for iter = 1, ..., T
  - for i = 1, ..., n
    - predict according to the current model: $\hat{y}_i = +1$ if $w^T x_i > 0$, $-1$ if $w^T x_i \le 0$
    - if $y_i = \hat{y}_i$: $c = c + 1$
    - else: $\bar{w} \leftarrow \bar{w} + c\,w$ (update the average); $w \leftarrow w + y_i x_i$; $c = 1$
Return $\bar{w} + c\,w$
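A sketch of the averaged variant matching the pseudocode above; the updates are identical to the plain perceptron, but a running weighted sum of the weight vectors replaces the stored list:

```python
import numpy as np

def averaged_perceptron_train(X, y, T=10):
    """Returns the (unnormalized) averaged weight vector w_bar."""
    d = X.shape[1]
    w, w_bar, c = np.zeros(d), np.zeros(d), 0
    for _ in range(T):
        for i in range(len(X)):
            y_hat = 1 if w @ X[i] > 0 else -1
            if y_hat == y[i]:
                c += 1                    # current w survives another round
            else:
                w_bar = w_bar + c * w     # fold the retired w into the sum
                w = w + y[i] * X[i]
                c = 1
    return w_bar + c * w                  # include the last surviving w

# Prediction is now a single dot product: sign(w_bar @ x).
```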
Comparison of perceptron variants
[Figure: test error of the perceptron variants on the MNIST dataset, from Freund and Schapire, 1999]
Improving perceptrons
- Multilayer perceptrons: learn non-linear functions of the input (neural networks).
- Separators with good margins: improve the generalization ability of the classifier (support vector machines).
- Feature mapping: learn non-linear functions of the input using a (linear) perceptron, e.g.
  $\phi(x_1, x_2) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$
  maps the ellipse $\frac{x_1^2}{a^2} + \frac{x_2^2}{b^2} = 1$ to the plane $\frac{z_1}{a^2} + \frac{z_3}{b^2} = 1$, which is linear in feature space (see the sketch below). We will learn to do this efficiently using kernels.
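A quick numerical check of the feature map above; the ellipse parameters a and b are arbitrary choices for illustration:

```python
import numpy as np

def phi(x1, x2):
    """The quadratic feature map from the slide."""
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

a, b = 2.0, 1.0                                          # assumed ellipse axes
t = np.linspace(0, 2 * np.pi, 5)
pts = np.stack([a * np.cos(t), b * np.sin(t)], axis=1)   # points on the ellipse
for x1, x2 in pts:
    z = phi(x1, x2)
    # Every point on the ellipse satisfies the linear equation
    # z1/a^2 + z3/b^2 = 1 in feature space:
    print(z[0] / a**2 + z[2] / b**2)                     # prints ~1.0 each time
```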
Slides credit
Some slides were adapted from Dan Klein at UC Berkeley and from the CIML book by Hal Daumé III. The figure comparing the perceptron variants is from Freund and Schapire (1999).