Preliminaries
Linear models: the perceptron and closest centroid algorithms
Chapter 1, 7

Definition: The Euclidean dot product between two vectors is the expression

    w^T x = sum_{i=1}^d w_i x_i

The dot product is also referred to as inner product or scalar product. It is sometimes denoted w . x (hence the name dot product).

[Figure: a linear decision boundary separating the region where w^T x - b > 0 from the region where w^T x - b < 0]

Geometric interpretation: The dot product between two unit vectors is the cosine of the angle between them. The dot product between a vector and a unit vector is the length of its projection in that direction. And in general:

    w^T x = ||w|| ||x|| cos(theta)

The norm of a vector:

    ||x|| = sqrt(x^T x)

Labeled data

A labeled dataset:

    D = {(x_i, y_i)}_{i=1}^n

where the x_i in R^d are d-dimensional vectors. The labels are discrete for classification problems (e.g. +1, -1 for binary classification).
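As a concrete illustration, the dot product, norm, and cosine formula above can be sketched in plain Python (the function names are mine, not from the text):

```python
import math

def dot(w, x):
    # Euclidean dot product: sum of coordinate-wise products
    return sum(wi * xi for wi, xi in zip(w, x))

def norm(x):
    # ||x|| = sqrt(x . x)
    return math.sqrt(dot(x, x))

w = [3.0, 4.0]
x = [1.0, 0.0]   # a unit vector

print(dot(w, x))    # 3.0: the length of the projection of w onto x
print(norm(w))      # 5.0
# cosine of the angle between w and x: (w . x) / (||w|| ||x||)
print(dot(w, x) / (norm(w) * norm(x)))
```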
Linear models (linear decision boundaries)

Discriminant/scoring function:

    f(x) = w . x - b

where w is the weight vector and b is the bias.

Decision boundary: all x such that

    f(x) = w . x - b = 0

The boundary separates the region where w^T x - b > 0 from the region where w^T x - b < 0. For linear models the decision boundary is a line in 2-d, a plane in 3-d, and a hyperplane in higher dimensions.

Linear models for regression

For a regression problem the labels are continuous values, and the task is estimating a linear function.

[Figure: scatter plot of a roughly linear relationship with values from about 40 to 90 on the vertical axis against 140 to 200 on the horizontal axis]
Using the discriminant to make a prediction:

    y-hat = sign(w . x - b)

where the sign function equals +1 when its argument is positive and -1 otherwise.

Question: what can you say about the decision boundary when b = 0? (It passes through the origin.)

Linear models for regression

When using a linear model for regression the scoring function is the prediction:

    y-hat = w . x - b

Why linear?

It's a good baseline: always start simple.
Linear models are stable.
Linear models are less likely to overfit the training data because they have relatively few parameters. They can sometimes underfit.
Often all you need when the data is high dimensional.
Lots of scalable algorithms.
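A minimal sketch of making predictions with a linear discriminant (the weights and inputs are illustrative, not from the text):

```python
def predict(w, b, x):
    # discriminant f(x) = w . x - b; predict the sign of f(x)
    f = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if f > 0 else -1

w = [1.0, -1.0]
b = 0.5

print(predict(w, b, [2.0, 0.0]))   # f = 1.5 > 0, so the prediction is +1
print(predict(w, b, [0.0, 2.0]))   # f = -2.5 < 0, so the prediction is -1
```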
Large margin classification

[Figure: a linear decision boundary with a large margin between the two classes]

Non-linear large margin classifiers

From linear to non-linear

There is a neat mathematical trick that will enable us to use linear classifiers to create non-linear decision boundaries!

[Figure: left panel shows the original data, which is not linearly separable; right panel shows the transformed data, which is]

Original data: not linearly separable. Transformed data: (x_1, x_2) -> (x_1^2, x_2^2).
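Assuming the transform is the squared-coordinate map (x_1, x_2) -> (x_1^2, x_2^2) (reconstructed from the garbled slide), the trick can be sketched as follows: points near the origin are surrounded by points further out, so no line separates them, but after squaring the coordinates the classes are split by the line z_1 + z_2 = 1 (the data here is my own illustration):

```python
def transform(p):
    # the feature map (x1, x2) -> (x1^2, x2^2)
    x1, x2 = p
    return (x1 * x1, x2 * x2)

# two classes that are not linearly separable in the original space:
# positives near the origin, negatives further out in every direction
pos = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
neg = [(2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]

# after the transform, z1 + z2 equals the squared distance from the
# origin, so the line z1 + z2 = 1 separates the two classes
for p in pos:
    z1, z2 = transform(p)
    assert z1 + z2 < 1
for p in neg:
    z1, z2 = transform(p)
    assert z1 + z2 > 1
```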
The closest centroid algorithm

Define:

    mu(+) = (1/|Pos|) sum_{i : y_i = +1} x_i
    mu(-) = (1/|Neg|) sum_{i : y_i = -1} x_i

where |Pos|/|Neg| is the number of positive/negative examples. This is the center of mass of the positive/negative examples.

Classify an input x according to which center of mass it is closest to.

[Figure: the two centroids mu(+) and mu(-), their midpoint (mu(+) + mu(-))/2, and the weight vector w = mu(+) - mu(-)]

Let's express this as a linear classifier! (See the textbook.) Our hyperplane is going to be perpendicular to the vector that connects the two centers of mass. Therefore:

    w = mu(+) - mu(-)

To find the bias term we use the fact that the midpoint between the means is on the hyperplane, i.e.

    w . (mu(+) + mu(-))/2 - b = 0

With a little algebra (recall that the norm of a vector is ||x|| = sqrt(x . x)):

    b = (1/2) (mu(+) - mu(-)) . (mu(+) + mu(-)) = (1/2) (||mu(+)||^2 - ||mu(-)||^2)
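The closest centroid rule and its linear form can be sketched in a few lines of Python (the tiny dataset is my own example):

```python
def centroid(points):
    # center of mass: coordinate-wise average of the points
    d = len(points[0])
    return [sum(p[j] for p in points) / len(points) for j in range(d)]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

pos = [[2.0, 2.0], [3.0, 3.0]]   # positive examples
neg = [[0.0, 0.0], [1.0, -1.0]]  # negative examples

mu_pos = centroid(pos)           # mu(+)
mu_neg = centroid(neg)           # mu(-)

# linear form of the classifier:
w = [p - n for p, n in zip(mu_pos, mu_neg)]            # w = mu(+) - mu(-)
b = 0.5 * (dot(mu_pos, mu_pos) - dot(mu_neg, mu_neg))  # b = (||mu(+)||^2 - ||mu(-)||^2) / 2

def predict(x):
    # w . x - b > 0 exactly when x is closer to mu(+) than to mu(-)
    return 1 if dot(w, x) - b > 0 else -1

print(predict([2.5, 2.5]))   # 1
print(predict([0.0, 1.0]))   # -1
```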
The perceptron (Rosenblatt, 1957)

One of the first classifiers proposed goes back to 1957!

Linearly separable data

Linearly separable data: there exists a linear decision boundary separating the classes.

[Figure: a linearly separable dataset with the regions w^T x - b > 0 and w^T x - b < 0]

The bias and homogeneous coordinates

In some cases we will use algorithms that learn a discriminant function without a bias term. This does not reduce the expressivity of the model because we can obtain a bias using the following trick: add another dimension x_0 to each input and set it to 1. Learn a weight vector of dimension d+1 in this extended space, and interpret w_0 as the bias term.

With the notation

    w = (w_1, ..., w_d)    w~ = (w_0, w_1, ..., w_d)    x~ = (1, x_1, ..., x_d)

we have that:

    w~ . x~ = w_0 + w . x

(See the book.)

The perceptron update

Idea: iterate over the training examples, and update the weight vector w in a way that makes x_i more likely to be correctly classified. Note: we're learning a classifier without a bias term.

Let's assume that x_i is misclassified, and is a positive example, i.e.

    w . x_i < 0

We would like to update w to w' such that

    w' . x_i > w . x_i

This can be achieved by choosing

    w' = w + eta * x_i

where 0 < eta <= 1 is the learning rate.

Rosenblatt, Frank (1957), The Perceptron--a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory. (Section 7 in the book.)
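The homogeneous-coordinate trick and a single perceptron update can be sketched as (the particular weights and example are mine):

```python
def extend(x):
    # homogeneous coordinates: prepend x_0 = 1 so w[0] plays the role of the bias
    return [1.0] + list(x)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

eta = 1.0  # learning rate, 0 < eta <= 1

# a misclassified positive example: w . x < 0 even though y = +1
w = [0.0, -1.0, 0.0]
x = extend([2.0, 1.0])     # -> [1.0, 2.0, 1.0]
assert dot(w, x) < 0

# update: w' = w + eta * x  (for a misclassified positive example)
w_new = [wi + eta * xi for wi, xi in zip(w, x)]
assert dot(w_new, x) > dot(w, x)   # the score of this example has increased
print(w_new)   # [1.0, 1.0, 1.0]
```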
If x_i is a negative example, the update needs to be in the opposite direction. Overall, we can summarize the two cases as:

    w' = w + eta * y_i * x_i

The perceptron algorithm

Input: labeled data D in homogeneous coordinates
Output: weight vector w

    w = 0
    converged = false
    while not converged:
        converged = true
        for i in 1, ..., |D|:
            if x_i is misclassified:
                update w and set converged = false

The perceptron criterion

The algorithm makes sense, but let's try to derive it in a more principled way. The algorithm is trying to find a vector w that separates positive from negative examples. We can express that as:

    y_i (w . x_i) > 0,    i = 1, ..., n

For a given weight vector w, the degree to which this does not hold can be expressed as:

    E(w) = - sum_{i : x_i is misclassified} y_i (w . x_i)

Do we want to find w that minimizes or maximizes this criterion?

Digression: gradient descent

Given a function E(w), the gradient

    grad E(w) = (dE(w)/dw_1, ..., dE(w)/dw_d)

is the direction of steepest ascent. Therefore to minimize E(w), take a step in the direction of the negative of the gradient. Notice that the gradient is perpendicular to contours of equal E(w).

We can now express gradient descent as:

    w^(t+1) = w^(t) - eta * grad E(w^(t))

where w^(t) is the weight vector at iteration t. The constant eta is called the step size (or the learning rate when used in the context of machine learning).

Images from http://en.wikipedia.org/wiki/Gradient_descent
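The pseudocode above can be fleshed out as a runnable sketch (inputs in homogeneous coordinates, no separate bias; the cap on epochs and the toy dataset are my own additions):

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def perceptron(X, y, eta=1.0, max_epochs=100):
    # X: inputs already in homogeneous coordinates; y: labels in {+1, -1}
    w = [0.0] * len(X[0])
    for _ in range(max_epochs):       # cap the epochs in case the data is not separable
        converged = True
        for xi, yi in zip(X, y):
            if yi * dot(w, xi) <= 0:  # x_i is misclassified (or on the boundary)
                # w' = w + eta * y_i * x_i
                w = [wj + eta * yi * xij for wj, xij in zip(w, xi)]
                converged = False
        if converged:
            break
    return w

# a tiny linearly separable dataset (x_0 = 1 added by hand)
X = [[1.0, 2.0, 1.0], [1.0, 2.0, -1.0], [1.0, -2.0, 1.0], [1.0, -2.0, -1.0]]
y = [1, 1, -1, -1]
w = perceptron(X, y)
assert all(yi * dot(w, xi) > 0 for xi, yi in zip(X, y))  # every example correctly classified
```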
Let's apply gradient descent to the perceptron criterion:

    E(w) = - sum_{i : x_i is misclassified} y_i (w . x_i)

    dE(w)/dw = - sum_{i : x_i is misclassified} y_i x_i

    w^(t+1) = w^(t) - eta * dE(w)/dw = w^(t) + eta * sum_{i : x_i is misclassified} y_i x_i

Which is exactly the perceptron algorithm!

The algorithm is guaranteed to converge if the data is linearly separable, and does not converge otherwise.

Issues with the algorithm:
The algorithm chooses an arbitrary hyperplane that separates the two classes. It may not be the best one from the learning perspective.
It does not converge if the data is not separable (we can halt it after a fixed number of iterations).
There are variants of the algorithm that address these issues (to some extent).

Perceptron for regression

Replace the update equation with:

    w' = w + eta * (y_i - y-hat_i) * x_i

This is not likely to converge, so the algorithm is run for a fixed number of training epochs.

Training epoch: one complete run through the training data.
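The regression update can be sketched as a runnable example (the dataset, step size, and epoch count are my own choices; the targets are generated by the noiseless linear function y = 2x + 1, so the weights should approach [1, 2]):

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def perceptron_regression(X, y, eta=0.01, epochs=500):
    # X: inputs in homogeneous coordinates; y: continuous targets
    w = [0.0] * len(X[0])
    for _ in range(epochs):       # run for a fixed number of training epochs
        for xi, yi in zip(X, y):
            y_hat = dot(w, xi)    # prediction with the current weights
            # w' = w + eta * (y_i - y_hat_i) * x_i
            w = [wj + eta * (yi - y_hat) * xij for wj, xij in zip(w, xi)]
    return w

# targets from y = 2*x + 1, inputs with x_0 = 1 prepended
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = perceptron_regression(X, y)
print(w)   # approximately [1.0, 2.0]: w[0] recovers the bias, w[1] the slope
```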