
Linear models and the perceptron algorithm
Chapters 1, 7

Preliminaries

Definition: The Euclidean dot product between two vectors is the expression

    w^T x = \sum_{i=1}^{d} w_i x_i

The dot product is also referred to as the inner product or scalar product. It is sometimes denoted w \cdot x (hence the name dot product).

Geometric interpretation: the dot product between two unit vectors is the cosine of the angle between them. The dot product between a vector and a unit vector is the length of its projection in that direction. And in general:

    w^T x = \|w\| \|x\| \cos(\theta)

The norm of a vector: \|x\| = \sqrt{x^T x}.

Labeled data

A labeled dataset: D = \{(x_i, y_i)\}_{i=1}^{n}, where the x_i \in \mathbb{R}^d are d-dimensional vectors. The labels y_i are discrete for classification problems (e.g. +1, -1 for binary classification) and are continuous values for a regression problem.

Linear models

Linear models for classification have linear decision boundaries: they predict one class in the region where w^T x + b > 0 and the other class where w^T x + b < 0. Linear models for regression estimate a linear function of the input.
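As a concrete illustration of these preliminaries, here is a minimal NumPy sketch (the vectors and the bias below are made-up values, not from the slides) computing the dot product, the norm, the cosine of the angle between two vectors, and the sign of a linear discriminant w^T x + b:

    import numpy as np

    # Two example vectors in R^3 (values chosen arbitrarily for illustration)
    w = np.array([2.0, -1.0, 0.5])
    x = np.array([1.0, 3.0, -2.0])

    dot = w @ x                      # Euclidean dot product w^T x
    norm_x = np.sqrt(x @ x)          # norm of x: sqrt(x^T x)
    norm_w = np.linalg.norm(w)       # the same quantity via the library routine

    # Geometric interpretation: w^T x = ||w|| ||x|| cos(theta)
    cos_theta = dot / (norm_w * norm_x)

    # A linear model classifies x according to the sign of w^T x + b
    b = 0.5
    prediction = np.sign(w @ x + b)  # +1 or -1 (0 only exactly on the boundary)

    print(dot, norm_x, cos_theta, prediction)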

Discriminant/scoring function

The discriminant (or scoring) function of a linear model is

    f(x) = w^T x + b

where w is the weight vector and b is the bias.

Decision boundary: all x such that f(x) = w^T x + b = 0. For linear models the decision boundary is a line in 2-d, a plane in 3-d, and a hyperplane in higher dimensions. (What can you say about the decision boundary when b = 0? It passes through the origin.)

Using the discriminant to make a prediction:

    y = sign(w^T x + b)

where the sign function equals +1 when its argument is positive and -1 otherwise.

Linear models for regression

When using a linear model for regression, the scoring function itself is the prediction:

    y = w^T x + b

Why linear?

- It's a good baseline: always start simple.
- Linear models are stable.
- Linear models are less likely to overfit the training data because they have relatively few parameters; they can sometimes underfit.
- They are often all you need when the data is high dimensional.
- There are lots of scalable algorithms.

Least squares linear regression (slide credit: Malik Magdon-Ismail, Learning From Data, Linear Classification and Regression)

Noisy target: P(y|x), with y = f(x) + \epsilon. For a hypothesis h(x) = w^T x, the in-sample and out-of-sample errors are

    E_{in}(h) = \frac{1}{N} \sum_{n=1}^{N} (h(x_n) - y_n)^2
    E_{out}(h) = E_x[(h(x) - y)^2]

Matrix representation: collecting the inputs as the rows of a matrix X and the targets in a vector y, E_{in}(w) = \frac{1}{N} \|Xw - y\|^2.
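The matrix representation makes the least squares fit a one-line computation. Below is a minimal sketch (with synthetic data; the generating weights and noise level are our own choices, not from the slides) that absorbs the bias into the weight vector and solves the least squares problem with np.linalg.lstsq:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic regression data: y = 1 + 3*x1 - 2*x2 + noise
    n, d = 100, 2
    X = rng.normal(size=(n, d))
    y = 1.0 + X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=n)

    # Absorb the bias by prepending a column of ones to each input
    X_aug = np.hstack([np.ones((n, 1)), X])

    # Least squares: minimize E_in(w) = (1/N) * ||X_aug w - y||^2
    w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    print(w)  # approximately [1, 3, -2]: the bias followed by the two weights

    # In-sample error of the fitted model
    E_in = np.mean((X_aug @ w - y) ** 2)
    print(E_in)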

From linear to non-linear

Linearly separable data: there exists a linear decision boundary separating the classes, i.e. a w and b such that w^T x + b > 0 for all positive examples and w^T x + b < 0 for all negative examples.

There is a neat mathematical trick that will enable us to use linear classifiers to create non-linear decision boundaries! Example: the original data is not linearly separable, but the transformed data (x'_1, x'_2) = (x_1^2, x_2^2) is.

The bias and homogeneous coordinates

In some cases we will use algorithms that learn a discriminant function without a bias term. This does not reduce the expressivity of the model, because we can obtain a bias using the following trick: add another dimension x_0 to each input and set it to 1; learn a weight vector of dimension d+1 in this extended space, and interpret w_0 as the bias term. With the notation

    w = (w_1, ..., w_d),   \tilde{w} = (w_0, w_1, ..., w_d),   \tilde{x} = (1, x_1, ..., x_d)

we have that

    \tilde{w} \cdot \tilde{x} = w_0 + w \cdot x

(see the book).

Finding a good hyperplane (Rosenblatt, 1957)

We would like a classifier that fits the data, i.e. we would like to find a vector w that minimizes

    E_{in}(h) = \frac{1}{n} \sum_{i=1}^{n} [h(x_i) \neq f(x_i)] = \frac{1}{n} \sum_{i=1}^{n} [sign(w \cdot x_i) \neq y_i]

This is a difficult problem because of the discrete nature of the indicator and sign functions (it is known to be NP-hard).

Idea: iterate over the training examples, and update the weight vector in a way that makes x_i more likely to be correctly classified. (Note: we're learning a classifier without a bias term.) Assume that x_i is misclassified and is a positive example, i.e. w \cdot x_i < 0. We would like to update w to w' such that

    w' \cdot x_i > w \cdot x_i

This can be achieved by choosing

    w' = w + \eta x_i,   with 0 < \eta

where \eta is the learning rate.

Rosenblatt, Frank (1957), The Perceptron: A Perceiving and Recognizing Automaton. Report 85-460-1, Cornell Aeronautical Laboratory.
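A minimal sketch of the two tricks just described, using a made-up dataset (points inside vs. outside the unit circle; none of the names or values below come from the slides): a nonlinear feature transform makes the data linearly separable, and a leading coordinate fixed to 1 lets a single weight vector also encode the bias.

    import numpy as np

    rng = np.random.default_rng(1)

    # Points inside a circle of radius 1 are positive, points outside are negative.
    # This data is NOT linearly separable in the original coordinates (x1, x2).
    X = rng.uniform(-2, 2, size=(200, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)

    # Nonlinear feature transform: (x1, x2) -> (x1^2, x2^2).
    # In the transformed space the classes are separated by the line z1 + z2 = 1.
    Z = X ** 2

    # Homogeneous coordinates: prepend x0 = 1 so a single weight vector
    # (w0, w1, ..., wd) also encodes the bias term w0.
    Z_aug = np.hstack([np.ones((Z.shape[0], 1)), Z])

    # The weight vector w = (1, -1, -1) implements sign(1 - z1 - z2),
    # which classifies this dataset perfectly in the transformed space.
    w = np.array([1.0, -1.0, -1.0])
    predictions = np.sign(Z_aug @ w)
    print(np.mean(predictions == y))  # fraction correctly classified (should be 1.0)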

If x_i is a negative example, the update needs to be in the opposite direction. Overall, we can summarize the two cases as:

    w' = w + \eta y_i x_i

The perceptron algorithm

    Output: weight vector w
    w = 0
    while not converged:
        for i in 1, ..., |D|:
            if x_i is misclassified:
                update w and set converged = false
    return w

Since the algorithm is not guaranteed to converge, you need to set a limit on the number of iterations:

    Output: weight vector w
    w = 0
    while not converged and number of iterations < T:
        for i in 1, ..., |D|:
            if x_i is misclassified:
                update w and set converged = false
    return w

The algorithm makes sense, but let's try to derive it in a more principled way. The algorithm is trying to find a vector w that separates the positive from the negative examples. We can express that as:

    y_i (w \cdot x_i) > 0,   i = 1, ..., n

For a given weight vector w, the degree to which this does not hold can be expressed as:

    E(w) = -\sum_{i \in M} y_i (w \cdot x_i)

where M is the set of misclassified examples. Do we want to find the w that minimizes or maximizes this criterion? (Minimizes: every misclassified example contributes a positive term to E(w).)

Digression: gradient descent

Given a function E(w), the gradient is the direction of steepest ascent. Therefore, to minimize E(w), take a step in the direction of the negative of the gradient. Notice that the gradient is perpendicular to contours of equal E(w). We can now express gradient descent as:

    w(t+1) = w(t) - \eta \nabla E(w(t))

where

    \nabla E(w) = \left( \frac{\partial E(w)}{\partial w_1}, ..., \frac{\partial E(w)}{\partial w_d} \right)

and w(t) is the weight vector at iteration t. The constant \eta is called the step size (or the learning rate when used in the context of machine learning). (Gradient descent illustrations: http://en.wikipedia.org/wiki/gradient_descent)

Let's apply gradient descent to the perceptron criterion:

    E(w) = -\sum_{i \in M} y_i (w \cdot x_i)
    \frac{\partial E(w)}{\partial w} = -\sum_{i \in M} y_i x_i
    w(t+1) = w(t) - \eta \frac{\partial E(w)}{\partial w} = w(t) + \eta \sum_{i \in M} y_i x_i

which is exactly the perceptron algorithm: applied to one misclassified example at a time, the update is w' = w + \eta y_i x_i.
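The pseudocode above translates directly into a short NumPy implementation. This is a sketch of the perceptron learning algorithm as described in the slides (no bias term; prepend a constant 1 to each input, as described earlier, if a bias is needed); the function name, the default learning rate, and the toy dataset are our own choices.

    import numpy as np

    def perceptron(X, y, eta=1.0, max_iter=1000):
        """Perceptron learning algorithm.

        X : (n, d) array of inputs (prepend a column of ones to learn a bias).
        y : (n,) array of labels in {-1, +1}.
        Returns the learned weight vector w.
        """
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(max_iter):           # limit on the number of iterations
            converged = True
            for i in range(n):
                if y[i] * (w @ X[i]) <= 0:      # x_i is misclassified
                    w = w + eta * y[i] * X[i]   # update: w' = w + eta * y_i * x_i
                    converged = False
            if converged:                   # one full pass with no mistakes
                break
        return w

    # Usage on a small linearly separable dataset (values made up for illustration)
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.5]])
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # homogeneous coordinates
    y = np.array([1, 1, -1, -1])
    w = perceptron(X, y)
    print(w, np.sign(X @ w))   # learned weights and the (correct) predictions

Treating points with y_i (w \cdot x_i) = 0 as misclassified (the <= 0 test) also handles the initial weight vector w = 0.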

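For completeness, here is a sketch of the batch form of this derivation: gradient descent on the perceptron criterion E(w) = -\sum_{i \in M} y_i (w \cdot x_i). Again, the function name, step size, and toy data are our own choices, not from the slides.

    import numpy as np

    def perceptron_criterion_gd(X, y, eta=0.1, steps=100):
        """Batch gradient descent on E(w) = -sum_{i in M} y_i (w . x_i),
        where M is the current set of misclassified examples."""
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            margins = y * (X @ w)
            M = margins <= 0                          # misclassified examples
            if not M.any():                           # all examples correct: E(w) = 0
                break
            grad = -(y[M, None] * X[M]).sum(axis=0)   # dE/dw = -sum_{i in M} y_i x_i
            w = w - eta * grad                        # w(t+1) = w(t) - eta * grad
        return w

    # Example run on a small separable dataset
    X = np.hstack([np.ones((4, 1)),
                   np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.5]])])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    print(perceptron_criterion_gd(X, y))

Taken one misclassified example at a time instead of summing over M, this gradient step reduces to the perceptron update above.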
The algorithm is guaranteed to converge if the data is linearly separable, and does not converge otherwise.

Issues with the algorithm:
- The algorithm chooses an arbitrary hyperplane that separates the two classes. It may not be the best one from the learning perspective.
- It does not converge if the data is not separable (it can be halted after a fixed number of iterations).

There are variants of the algorithm that address these issues (to some extent).

The pocket algorithm

    Output: a weight vector w_pocket
    w = 0, w_pocket = 0
    while not converged and number of iterations < T:
        for i in 1, ..., |D|:
            if x_i is misclassified:
                update w and set converged = false
                if w leads to a better E_in than w_pocket:
                    w_pocket = w
    return w_pocket

Gallant, S. I. (1990). Perceptron-based learning algorithms. IEEE Transactions on Neural Networks, vol. 1, no. 2, pp. 179-191.

Image classification with intensity and symmetry features

A feature is an important property of the input that you think is useful for classification (dictionary.com: "a prominent or conspicuous part or characteristic"). In this case we consider the level of symmetry of an image (comparing the image with its flipped version) and its overall intensity (the fraction of pixels that are dark):

    x = (1, x_1, x_2),   w = (w_0, w_1, w_2)

a linear model with d_{vc} = 3.

Pocket vs. perceptron on the digits data (slide credit: Malik Magdon-Ismail, Learning From Data, Linear Classification and Regression)

Comparison on image data: distinguishing between the digits 1 and 5 with PLA and with the pocket algorithm (see the book). [Figure: E_in and E_out, error on a log scale, versus the iteration number t, for PLA and for the pocket algorithm.]
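A sketch of the pocket algorithm in the same style as the perceptron code above: perform perceptron updates, but keep "in the pocket" the best weight vector seen so far according to the in-sample error E_in. The helper names and the noisy toy dataset are our own illustration.

    import numpy as np

    def in_sample_error(w, X, y):
        """Fraction of training examples misclassified by sign(w . x)."""
        return np.mean(np.sign(X @ w) != y)

    def pocket(X, y, eta=1.0, max_iter=1000):
        """Pocket algorithm: perceptron updates, returning the best w seen so far."""
        n, d = X.shape
        w = np.zeros(d)
        w_pocket = w.copy()
        best_error = in_sample_error(w_pocket, X, y)
        for _ in range(max_iter):
            converged = True
            for i in range(n):
                if y[i] * (w @ X[i]) <= 0:          # x_i is misclassified
                    w = w + eta * y[i] * X[i]       # perceptron update
                    converged = False
                    error = in_sample_error(w, X, y)
                    if error < best_error:          # better E_in than the pocket?
                        best_error = error
                        w_pocket = w.copy()
            if converged:
                break
        return w_pocket

    # On non-separable data the perceptron never converges, but the pocket
    # algorithm still returns the best hyperplane encountered within max_iter.
    rng = np.random.default_rng(2)
    X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
    y = np.sign(X[:, 1] - X[:, 2])
    y[rng.random(200) < 0.1] *= -1          # flip 10% of the labels: not separable
    w_best = pocket(X, y, max_iter=50)
    print(in_sample_error(w_best, X, y))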