CSE 151 Machine Learning Instructor: Kamalika Chaudhuri
Linear Classification
Given labeled data (x_i, y_i), i = 1, ..., n, where each x_i is a feature vector and each label y_i is +1 or -1, find a hyperplane that separates the positive examples from the negative ones.
Linear Classification Process
[Figure: Training Data → Classification Algorithm → Prediction Rule: vector w]
How do you classify a test example x? Output y = sign(⟨w, x⟩).
Linear Classification
Given labeled data (x_i, y_i), i = 1, ..., n, where y_i is +1 or -1, find a hyperplane through the origin that separates the two classes. Here w is the normal vector to the hyperplane: for a point x on one side of the hyperplane, ⟨w, x⟩ > 0; for a point x on the other side, ⟨w, x⟩ < 0.
The Perceptron Algorithm
Goal: Given labeled data (x_i, y_i), i = 1, ..., n, where y_i is +1 or -1, find a vector w such that the corresponding hyperplane separates the two classes.
Perceptron Algorithm:
1. Initially, w_1 = y_1 x_1
2. For t = 2, 3, ...:
   If y_t ⟨w_t, x_t⟩ < 0 (a mistake), then: w_{t+1} = w_t + y_t x_t
   Else: w_{t+1} = w_t
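The updates above translate directly to code. A minimal NumPy sketch (function and argument names are mine, not from the slides):

```python
import numpy as np

def perceptron(X, y, passes=1):
    """Perceptron as on the slide: X is (n, d), y holds +1/-1 labels."""
    w = y[0] * X[0].astype(float)            # w1 = y1 x1
    for _ in range(passes):
        for xt, yt in zip(X, y):
            if yt * np.dot(w, xt) < 0:       # mistake: x_t on the wrong side
                w = w + yt * xt              # w_{t+1} = w_t + y_t x_t
            # else: w_{t+1} = w_t (no change)
    return w
```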
Linear Separability
[Figure: two datasets, one linearly separable, one not linearly separable]
Measure of Separability: Margin
For a unit vector w and training set S, the margin of w is defined as:
γ = min_{x ∈ S} ⟨w, x⟩ / ‖x‖
i.e., the minimum cosine of the angle between w and x, over all x in S.
[Figure: a low-margin separator and a high-margin separator]
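Given this definition, the margin of a candidate direction is a one-liner to compute; a small sketch, with names of my own choosing:

```python
import numpy as np

def margin(w, X):
    """gamma = min over x in X of <w, x> / ||x||, with w normalized to unit length."""
    w = w / np.linalg.norm(w)
    return min(np.dot(w, x) / np.linalg.norm(x) for x in X)
```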
Real Examples: Margin
[Figure: real datasets illustrating a low-margin and a high-margin separator, using the same definition of γ as above]
When does Perceptron work?
Theorem: If the training points are linearly separable with margin γ, and ‖x_i‖ = 1 for all x_i in the training data, then the number of mistakes made by perceptron is at most 1/γ².
Observations:
1. The lower the margin, the more mistakes.
2. We may need more than one pass over the training data to get a classifier that makes no mistakes on the training set.
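One way to see the theorem empirically: normalize the data (the theorem assumes ‖x_i‖ = 1), run perceptron until a clean pass, and count mistakes. A sketch, with names of my own choosing:

```python
import numpy as np

def perceptron_mistakes(X, y, max_passes=100):
    """Count perceptron mistakes until a clean pass; the theorem bounds
    this count by 1/gamma^2 when the normalized data has margin gamma."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # enforce ||x_i|| = 1
    w = y[0] * X[0]                                   # w1 = y1 x1
    mistakes = 0
    for _ in range(max_passes):
        clean_pass = True
        for xt, yt in zip(X, y):
            if yt * np.dot(w, xt) < 0:                # mistake
                w = w + yt * xt
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break
    return mistakes
```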
What if the data is not linearly separable?
Ideally, we want to find a linear separator that makes the minimum number of mistakes on the training data, but this problem is NP-hard.
Voted Perceptron
Perceptron:
1. Initially: m = 1, w_1 = y_1 x_1
2. For t = 2, 3, ...:
   If y_t ⟨w_m, x_t⟩ ≤ 0 then: w_{m+1} = w_m + y_t x_t; m = m + 1
3. Output w_m
Voted Perceptron:
1. Initially: m = 1, w_1 = y_1 x_1, c_1 = 1
2. For t = 2, 3, ...:
   If y_t ⟨w_m, x_t⟩ ≤ 0 then: w_{m+1} = w_m + y_t x_t; m = m + 1; c_m = 1
   Else: c_m = c_m + 1
3. Output (w_1, c_1), (w_2, c_2), ..., (w_m, c_m)
Voted Perceptron: Classification
How do you classify an example x? Output: sign( Σ_{i=1}^m c_i sign(⟨w_i, x⟩) )
Problem: we have to store all m classifiers.
Averaged Perceptron
The updates are the same as in the voted perceptron; only the classification rule changes. How do you classify an example x? Output: sign( Σ_{i=1}^m c_i ⟨w_i, x⟩ ) = sign( ⟨Σ_{i=1}^m c_i w_i, x⟩ ), so only the single vector Σ_i c_i w_i needs to be stored. A sketch of both variants follows.
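Both variants share the same training loop; a minimal sketch of the updates and the two prediction rules (helper names are mine):

```python
import numpy as np

def voted_perceptron(X, y):
    """Voted perceptron updates from the slides: returns the (w_i, c_i) pairs."""
    ws = [y[0] * X[0].astype(float)]          # w1 = y1 x1
    cs = [1]                                  # c1 = 1
    for xt, yt in zip(X[1:], y[1:]):
        if yt * np.dot(ws[-1], xt) <= 0:      # mistake
            ws.append(ws[-1] + yt * xt)       # w_{m+1} = w_m + y_t x_t
            cs.append(1)                      # c_{m+1} = 1
        else:
            cs[-1] += 1                       # current w survives one more example
    return ws, cs

def predict_voted(ws, cs, x):
    # sign( sum_i c_i * sign(<w_i, x>) )
    return np.sign(sum(c * np.sign(np.dot(w, x)) for w, c in zip(ws, cs)))

def predict_averaged(ws, cs, x):
    # sign( <sum_i c_i w_i, x> ): only this single summed vector need be stored
    return np.sign(np.dot(sum(c * w for w, c in zip(ws, cs)), x))
```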
Voted vs. Averaged
How do you classify an example x?
Averaged: sign( Σ_{i=1}^m c_i ⟨w_i, x⟩ )
Voted: sign( Σ_{i=1}^m c_i sign(⟨w_i, x⟩) )
Example where voted and averaged differ: w_1 = x_1, w_2 = x_2, w_3 = x_3, c_1 = c_2 = c_3 = 1.
Averaged outputs +1 when ⟨x_1 + x_2 + x_3, x⟩ > 0; Voted outputs +1 when any two of ⟨x_1, x⟩, ⟨x_2, x⟩, ⟨x_3, x⟩ are > 0.
Perceptron: An Example
Data: ((4, 0), +1), ((-1, 1), +1), ((0, -1), -1), ((2, -2), -1)
Step 1: w_1 = y_1 x_1 = (4, 0), c_1 = 1
Step 2: (x_2, y_2) = ((-1, 1), +1). y_2 ⟨w_1, x_2⟩ = -4 < 0: mistake! Update: w_2 = w_1 + y_2 x_2 = (3, 1), c_2 = 1
Step 3: (x_3, y_3) = ((0, -1), -1). y_3 ⟨w_2, x_3⟩ = 1 > 0: correct! Update: c_2 = 2
Step 4: (x_4, y_4) = ((2, -2), -1). y_4 ⟨w_2, x_4⟩ = -4 < 0: mistake! Update: w_3 = w_2 + y_4 x_4 = (1, 3), c_3 = 1
[Figure: the data points and the decision boundaries for w_1, w_2, w_3]
Final Voted Perceptron: sign( sign(⟨w_1, x⟩) + 2 sign(⟨w_2, x⟩) + sign(⟨w_3, x⟩) )
Final Averaged Perceptron: sign( ⟨w_1 + 2w_2 + w_3, x⟩ )
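The trace can be checked mechanically. A short script (the signs of the data points and labels are reconstructed from the updates shown above, where the extraction had dropped the minus signs):

```python
import numpy as np

# The four training points, with signs restored from the slide's updates
data = [((4, 0), 1), ((-1, 1), 1), ((0, -1), -1), ((2, -2), -1)]

w = np.array([4.0, 0.0])                    # Step 1: w1 = y1 x1
for xt, yt in data[1:]:
    xt = np.asarray(xt, dtype=float)
    if yt * np.dot(w, xt) < 0:
        w = w + yt * xt
        print("mistake, new w =", w)        # (3, 1) after x2; (1, 3) after x4
    else:
        print("correct, w stays", w)
```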
Summary
1. Linear classification
2. Perceptron
3. Voted and Averaged Perceptron
4. Some issues
Margins
[Figure: a linear separator and the max-margin separator]
Suppose the data is linearly separable with some margin. Perceptron stops once it finds any linear separator, but sometimes we would like to compute the max-margin separator. This is done by Support Vector Machines (SVMs).
More General Linear Classification
Given labeled data (x_i, y_i), i = 1, ..., n, where y_i is +1 or -1, find a hyperplane (not necessarily through the origin) that separates the two classes.
We can reduce general linear classification to linear classification through the origin as follows. For each (x_i, y_i), where x_i is d-dimensional, construct (z_i, y_i), where z_i is the (d+1)-dimensional vector [1, x_i]. Any hyperplane through the origin in z-space corresponds to a hyperplane (not necessarily through the origin) in x-space, and vice versa.
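This reduction is one line in code; a sketch under the convention z = [1, x] (names are mine):

```python
import numpy as np

def lift(X):
    """Prepend a constant-1 coordinate: x (d-dim) -> z = [1, x] ((d+1)-dim).
    A separator through the origin in z-space, w = [b, v], corresponds to
    the affine separator <v, x> + b = 0 in x-space."""
    return np.hstack([np.ones((len(X), 1)), X])
```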
Categorical Features
Categorical features have no inherent ordering. Examples: Gender = {Male, Female}; State = {Alaska, Hawaii, California}.
Convert a categorical feature with k categories to a k-dimensional indicator vector: e.g., Alaska maps to [1, 0, 0], Hawaii to [0, 1, 0], and California to [0, 0, 1].
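A minimal sketch of this encoding:

```python
def one_hot(value, categories):
    """k-dimensional indicator encoding of a categorical value, e.g.
    one_hot('Hawaii', ['Alaska', 'Hawaii', 'California']) -> [0, 1, 0]."""
    return [1 if value == c else 0 for c in categories]
```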
Multiclass Linear Classification
Main idea: reduce to multiple binary classification problems.
One-vs-all: for k classes, solve k binary classification problems: class 1 vs. the rest, class 2 vs. the rest, ..., class k vs. the rest.
All-vs-all: for k classes, solve k(k-1)/2 binary classification problems: class i vs. class j, for i = 1, ..., k and j = i+1, ..., k. To classify, combine the binary decisions by voting.
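A possible one-vs-all sketch built on the perceptron from earlier; combining by the largest score is one common convention, not something the slides specify:

```python
import numpy as np

def one_vs_all_train(X, y, k, train_binary):
    """Train k binary separators; separator i gets label +1 for class i, -1 otherwise.
    train_binary can be the perceptron sketched earlier."""
    return [train_binary(X, np.where(y == i, 1, -1)) for i in range(k)]

def one_vs_all_predict(ws, x):
    # pick the class whose separator is most confident, i.e. largest <w_i, x>
    return int(np.argmax([np.dot(w, x) for w in ws]))
```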
The Confusion Matrix
[Figure: a k × k confusion matrix, here with k = 6 classes]
M_ij = #examples that have label j but are classified as i
N_j = #examples with label j
C_ij = M_ij / N_j
The confusion matrix indicates which classes are easy and hard to classify. A high diagonal entry means the class is easy to classify; a high off-diagonal entry means high scope for confusion between two classes.
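A sketch of computing C from predictions, assuming labels 0, ..., k-1 and that every class appears at least once in the true labels (names are mine):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, k):
    """C[i, j] = M[i, j] / N[j], with integer labels 0, ..., k-1."""
    M = np.zeros((k, k))
    for t, p in zip(y_true, y_pred):
        M[p, t] += 1                          # label j = t, classified as i = p
    return M / M.sum(axis=0, keepdims=True)   # column j sums to N_j
```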
Kernels
Feature Maps
[Figure: a dataset in (X_1, X_2)-space mapped by φ to (Z_1, Z_2, Z_3)-space]
Data which is not linearly separable may be linearly separable in another feature space. In the figure, the data is linearly separable in z-space, where (Z_1, Z_2, Z_3) = (X_1², √2·X_1X_2, X_2²).
Feature Maps
Let φ(x) be a feature map, for example φ(x) = (x_1², x_1x_2, x_2²). We want to run perceptron in the feature space φ(x).
Feature Maps
[Figure: a dataset in (X_1, X_2)-space that is not linearly separable]
What feature map will make this dataset linearly separable?
Feature Maps
Let φ(x) be a feature map, for example φ(x) = (x_1², x_1, x_2). We want to run perceptron in the feature space φ(x).
Many linear classification algorithms, including perceptron, SVMs, etc., can be written purely in terms of dot products ⟨x, z⟩. We can simply change these dot products to ⟨φ(x), φ(z)⟩. For a feature map φ, we define the corresponding kernel K by K(x, z) = ⟨φ(x), φ(z)⟩. Thus we can write the perceptron algorithm in terms of K, as sketched below.
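A minimal sketch of the kernelized perceptron in dual form (variable names are mine; alpha_j counts the updates triggered by example j, so the weight vector in feature space is never formed explicitly):

```python
import numpy as np

def kernel_perceptron(X, y, K, passes=1):
    """Perceptron written only in terms of K(x, z) = <phi(x), phi(z)>.
    Implicitly, w = sum_j alpha_j * y_j * phi(x_j)."""
    n = len(X)
    alpha = np.zeros(n)
    alpha[0] = 1                                  # w1 = y1 phi(x1)
    for _ in range(passes):
        for t in range(n):
            score = sum(alpha[j] * y[j] * K(X[j], X[t]) for j in range(n))
            if y[t] * score < 0:                  # mistake in feature space
                alpha[t] += 1                     # w <- w + y_t phi(x_t)
    return alpha

quadratic = lambda x, z: np.dot(x, z) ** 2        # the kernel from the next slide
```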
Kernels
For a feature map φ, we define the corresponding kernel K by K(x, z) = ⟨φ(x), φ(z)⟩.
Example 1: K(x, z) = (⟨x, z⟩)²
Kernels
For K(x, z) = (⟨x, z⟩)²:
K(x, z) = ( Σ_{i=1}^d x_i z_i )² = Σ_{i=1}^d Σ_{j=1}^d x_i x_j z_i z_j = Σ_{i,j=1}^d (x_i x_j)(z_i z_j)
So for d = 3, φ(x) = [x_1², x_1x_2, x_1x_3, x_2x_1, x_2², x_2x_3, x_3x_1, x_3x_2, x_3²]ᵀ.
Time to compute K directly: O(d). Time to compute K through the feature map: O(d²).
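A quick numeric check that the O(d) direct computation and the O(d²) feature-map computation agree:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

# phi(v) = [v_i * v_j for all i, j], the d = 3 feature map above
phi = lambda v: np.array([v[i] * v[j] for i in range(3) for j in range(3)])

print(np.dot(x, z) ** 2)           # direct: O(d)   -> 1024.0
print(np.dot(phi(x), phi(z)))      # lifted: O(d^2) -> 1024.0
```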
Kernels
For a feature map φ, we define the corresponding kernel K by K(x, z) = ⟨φ(x), φ(z)⟩.
Examples:
1. K(x, z) = (⟨x, z⟩)²
2. K(x, z) = (⟨x, z⟩ + c)²
In-class problem: what is the feature map for this kernel?