Linear models and the perceptron algorithm
(Chapters 1, 3 in the book)

Preliminaries

Definition: the Euclidean dot product between two vectors is the expression

    $w^\top x = \sum_{i=1}^{d} w_i x_i$

The dot product is also referred to as the inner product or scalar product. It is sometimes denoted $w \cdot x$ (hence the name dot product).

Geometric interpretation: the dot product between two unit vectors is the cosine of the angle between them. The dot product between a vector and a unit vector is the length of the vector's projection in that direction. In general,

    $w^\top x = \|w\| \, \|x\| \cos(\theta)$

where the norm of a vector is $\|x\| = \sqrt{x^\top x}$.

Labeled data

A labeled dataset is $D = \{(x_i, y_i)\}_{i=1}^{N}$, where the inputs $x_i \in \mathbb{R}^d$ are d-dimensional vectors. The labels are discrete for classification problems (e.g. +1, -1 for binary classification).
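As a quick illustration of these definitions, here is a minimal NumPy sketch (not from the slides; the variable names are mine):

    import numpy as np

    w = np.array([3.0, 4.0])
    x = np.array([1.0, 2.0])

    dot = w @ x                  # w^T x = sum_i w_i x_i  (same as np.dot(w, x))
    norm_w = np.sqrt(w @ w)      # ||w|| = sqrt(w^T w), i.e. np.linalg.norm(w)
    cos_theta = dot / (norm_w * np.linalg.norm(x))  # cosine of the angle between w and x
    proj_len = dot / norm_w      # length of x's projection onto the direction of w

For unit vectors the denominators are 1, so the dot product is the cosine itself.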
Labeled data (continued)

For a regression problem the labels are continuous values (e.g. when estimating a linear function).

[Figure: scatter plot of data points with a fitted straight line, illustrating linear regression.]

Linear models (linear decision boundaries)

Discriminant/scoring function: $w^\top x + b$, where $w$ is the weight vector and $b$ is the bias.

Decision boundary: all $x$ such that $w^\top x + b = 0$. The boundary separates the region where $w^\top x + b > 0$ from the region where $w^\top x + b < 0$.

For linear models the decision boundary is a line in 2-d, a plane in 3-d, and a hyperplane in higher dimensions.
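A minimal sketch of the resulting decision rule, assuming NumPy and labels in {+1, -1} (the function name predict is mine):

    import numpy as np

    def predict(w, b, X):
        """Classify the rows of X with a linear model: sign(w^T x + b)."""
        scores = X @ w + b            # discriminant/scoring function for each row
        return np.where(scores > 0, 1, -1)

    w = np.array([1.0, -2.0])         # weight vector
    b = 0.5                           # bias
    X = np.array([[2.0, 0.0],         # score  2.5 -> class +1
                  [0.0, 2.0]])        # score -3.5 -> class -1
    print(predict(w, b, X))           # [ 1 -1]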
Using the discriminant to make a prediction:

    $y = \mathrm{sign}(w^\top x + b)$

The sign function equals 1 when its argument is positive and -1 otherwise.

Question: what can you say about the decision boundary when b = 0? (It passes through the origin.)

Linear models for regression

When using a linear model for regression, the scoring function itself is the prediction:

    $y = w^\top x + b$

Least squares linear regression. The target is noisy, $P(y|x)$ with $y = f(x) + \epsilon$, and the hypothesis is linear, $h(x) = w^\top x$. The in-sample error is

    $E_{\mathrm{in}}(h) = \frac{1}{N} \sum_{n=1}^{N} (h(x_n) - y_n)^2$

and the out-of-sample error is

    $E_{\mathrm{out}}(h) = \mathbb{E}_x\left[(h(x) - y)^2\right]$

Why linear?

- It's a good baseline: always start simple.
- Linear models are stable.
- Linear models are less likely to overfit the training data because they have fewer parameters (though they can sometimes underfit).
- Often all you need when the data is high dimensional.
- Lots of scalable algorithms.
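To make the least-squares setup concrete, here is a small sketch that generates data from a noisy linear target and minimizes E_in (the names w_true, w_hat, etc. are mine):

    import numpy as np

    rng = np.random.default_rng(0)

    # Noisy target y = f(x) + eps with a linear f, as above.
    N, d = 100, 3
    X = rng.normal(size=(N, d))
    w_true = np.array([2.0, -1.0, 0.5])
    y = X @ w_true + 0.1 * rng.normal(size=N)

    # Least squares minimizes E_in(h) = (1/N) sum_n (h(x_n) - y_n)^2 over h(x) = w^T x.
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    E_in = np.mean((X @ w_hat - y) ** 2)   # in-sample squared error
    print(w_hat, E_in)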
From linear to non-linear

There is a neat mathematical trick that will enable us to use linear classifiers to create non-linear decision boundaries!

[Figure: left, the original data, which is not linearly separable; right, the transformed data $(x', y') = (x^2, y^2)$, which is.]

Linearly separable data

Linearly separable data: there exists a linear decision boundary separating the classes, i.e. $w^\top x + b > 0$ on one class and $w^\top x + b < 0$ on the other.

The bias and homogeneous coordinates

Formulating a model without a bias does not reduce the expressivity of the model, because we can obtain a bias using the following trick (sketched in code below):

- Add another dimension $x_0$ to each input and set it to 1.
- Learn a weight vector of dimension d+1 in this extended space, and interpret $w_0$ as the bias term.

With the notation $w = (w_0, w_1, \ldots, w_d)$ and $x = (1, x_1, \ldots, x_d)$, and writing $\tilde{w} = (w_1, \ldots, w_d)$, we have

    $w^\top x = w_0 + \tilde{w}^\top (x_1, \ldots, x_d)$

(See page 7 in the book.)
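A minimal sketch of the homogeneous-coordinates trick (the helper name to_homogeneous is mine):

    import numpy as np

    def to_homogeneous(X):
        """Prepend a constant feature x_0 = 1 to each input row."""
        return np.hstack([np.ones((X.shape[0], 1)), X])

    X = np.array([[2.0, 3.0],
                  [4.0, 5.0]])
    Xh = to_homogeneous(X)             # rows are now (1, x_1, ..., x_d)

    w = np.array([0.5, 1.0, -1.0])     # (w_0, w_1, w_2); w_0 plays the role of the bias
    b, w_tilde = w[0], w[1:]
    assert np.allclose(Xh @ w, X @ w_tilde + b)   # w^T x = w_0 + w~^T x~

The non-linear trick works the same way: map the inputs through a feature transform (e.g. squaring each coordinate) and fit a linear model in the transformed space.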
Finding a good hyperplane

We would like a classifier that fits the data, i.e. we would like to find a vector $w$ that minimizes

    $E_{\mathrm{in}} = \frac{1}{N} \sum_{i=1}^{N} [h(x_i) \neq f(x_i)] = \frac{1}{N} \sum_{i=1}^{N} [\mathrm{sign}(w^\top x_i) \neq y_i]$

This is a difficult problem because of the discrete nature of the indicator and sign functions (it is known to be NP-hard).

The perceptron algorithm (Rosenblatt, 1957)

Idea: iterate over the training examples, and update the weight vector in a way that makes $x_i$ more likely to be correctly classified. (Note: we are learning a classifier without a bias term, i.e. in homogeneous coordinates.)

Assume that $x_i$ is misclassified and is a positive example, i.e. $w^\top x_i < 0$. We would like to update $w$ to $w'$ such that

    $w'^\top x_i > w^\top x_i$

This can be achieved by choosing $0 < \eta \le 1$ and setting

    $w' = w + \eta\, x_i$

where $\eta$ is the learning rate.

[Figure: scatter plot with Age and Income axes illustrating a linear separator.]

Rosenblatt, Frank (1957). The Perceptron--a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory. (See also the book.)

The perceptron algorithm

If $x_i$ is a negative example, the update needs to go in the opposite direction. Overall, we can summarize the two cases as:

    $w' = w + \eta\, y_i x_i$

    Input: labeled data D in homogeneous coordinates
    Output: weight vector w

    w = 0
    converged = false
    while not converged:
        converged = true
        for i in 1,...,N:
            if x_i is misclassified:
                update w and set converged = false
    return w

Since the algorithm is not guaranteed to converge if the data is not linearly separable, you need to set a limit T on the number of iterations:

    Input: labeled data D in homogeneous coordinates
    Output: weight vector w

    w = 0
    converged = false
    while not converged and number of iterations < T:
        converged = true
        for i in 1,...,N:
            if x_i is misclassified:
                update w and set converged = false
    return w
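A runnable NumPy version of the pseudocode above, assuming inputs already in homogeneous coordinates and labels in {+1, -1} (function and parameter names are mine):

    import numpy as np

    def perceptron(X, y, eta=1.0, max_epochs=1000):
        """Perceptron learning algorithm.

        X : (N, d+1) array in homogeneous coordinates (first column all ones).
        y : (N,) array of labels in {+1, -1}.
        Returns the learned weight vector w, where w[0] acts as the bias.
        """
        w = np.zeros(X.shape[1])
        for _ in range(max_epochs):           # the limit T on the number of passes
            converged = True
            for xi, yi in zip(X, y):
                if np.sign(w @ xi) != yi:     # x_i is misclassified
                    w = w + eta * yi * xi     # w' = w + eta * y_i * x_i
                    converged = False
            if converged:
                break
        return w

On linearly separable data this halts with E_in = 0; otherwise it simply stops after max_epochs passes. (Note that np.sign(0) = 0 differs from every label, so a point exactly on the boundary counts as misclassified, which is also what makes the first update fire when w = 0.)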
The perceptron algorithm

The algorithm is guaranteed to converge if the data is linearly separable, and does not converge otherwise.

Issues with the algorithm:

- It returns an arbitrary hyperplane that separates the two classes, which may not be the best one from the learning perspective.
- It does not converge if the data is not separable (it can be halted after a fixed number of iterations).

There are variants of the algorithm that address these issues (to some extent).

The pocket algorithm

    Input: labeled data D in homogeneous coordinates
    Output: a weight vector

    w = 0, pocket = 0
    converged = false
    while not converged and number of iterations < T:
        converged = true
        for i in 1,...,N:
            if x_i is misclassified:
                update w and set converged = false
                if w leads to a better E_in than pocket:
                    pocket = w
    return pocket

Gallant, S. I. (1990). Perceptron-based learning algorithms. IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179-191.

Image classification: intensity and symmetry features

A feature is an important property of the input that you think is useful for classification (dictionary.com: a prominent or conspicuous part or characteristic).

To distinguish between the digits 1 and 5 we consider the level of symmetry (the difference between the image and its flipped version) and the overall intensity (the fraction of pixels that are dark). The input is then $x = (1, x_1, x_2)$ and the linear model is $w = (w_0, w_1, w_2)$, with $d_{vc} = 3$.

Pocket vs. perceptron

Comparison on image data: distinguishing between the digits 1 and 5 (see page 83 in the book).

[Figure: E_in and E_out (log scale) vs. iteration number t, for PLA and for the pocket algorithm. Figure credit: Malik Magdon-Ismail, Learning From Data lecture slides.]
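A sketch of the pocket variant under the same assumptions as the perceptron code above (names are mine): it runs the same updates but keeps a copy of the best weights seen so far.

    import numpy as np

    def pocket(X, y, eta=1.0, max_epochs=1000):
        """Pocket algorithm: PLA updates, but return the best-so-far weights."""
        def E_in(w):
            return np.mean(np.sign(X @ w) != y)   # fraction of misclassified points

        w = np.zeros(X.shape[1])
        best_w, best_err = w.copy(), E_in(w)
        for _ in range(max_epochs):               # the limit T on the number of passes
            converged = True
            for xi, yi in zip(X, y):
                if np.sign(w @ xi) != yi:         # x_i is misclassified
                    w = w + eta * yi * xi         # usual perceptron update
                    converged = False
                    err = E_in(w)
                    if err < best_err:            # pocket the weights if E_in improved
                        best_w, best_err = w.copy(), err
            if converged:
                break
        return best_w

Evaluating E_in after every update is what makes the pocket algorithm noticeably slower than plain PLA; on non-separable data this is the price of returning a sensible hypothesis.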