Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML)

Size: px

Start display at page:

Download "Machine Learning. Kernels. Fall (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang. (Chap. 12 of CIML)"

Debra Stevens
5 years ago
Views:

1 Machine Learning Fall 2017 Kernels (Kernels, Kernelized Perceptron and SVM) Professor Liang Huang (Chap. 12 of CIML)

2 Nonlinear Features x4: -1 x1: +1 x3: +1 x2: -1 Concatenated (combined) features XOR: x = (x 1, x 2, x 1 x 2 ) income: add degree + major Perceptron Map data into feature space x! (x) Solution in span of (x i )

3 Quadratic Features Separating surfaces are Circles, hyperbolae, parabolae

4 Kernels as dot products Problem Extracting features can sometimes be very costly. Example: second order features in 1000 dimensions. This leads to numbers. For higher order polynomial features much worse. Solution Don t compute the features, try to compute dot products implicitly. For some features this works... Definition A kernel function k : X X! R is a symmetric function in its arguments for which the following property holds k(x, x 0 )=h (x), (x 0 )i for some feature map. If k(x, x 0 ) is much cheaper to compute than (x)...

5 Quadratic Kernel x4: -1 x1: +1 x3: +1 Quadratic Features in R 2 Dot Product x2: -1 h (x), (x 0 )i = (x) := x 2 1, p 2x 1 x 2,x 2 2 D x 2 1, p 2x 1 x 2,x 2 2 = hx, x 0 i 2., = k(x, x 0 ) for x in R n, quadratic ɸ: naive: ɸ(x): O(n 2 ) ɸ(x) ɸ(x ): O(n 2 ) kernel k(x,x ): O(n) x 0 12, p E 2x 0 1x 0 2,x Insight Trick works for any polynomials of order d via hx, x 0 i d.

6 The Perceptron on features { } {± nction initialize w, b =0 repeat Pick (x i,y i ) from data if y i (w (x i )+b) apple 0 then w 0 = w + y i (x i ) b 0 = b + y i until y i (w (x i )+b) > 0 for all i d Nothing happens if classified correctly Weight X vector is linear combination w = X i2i i (x i ) Classifier is (implicitly) a linear combination of inner products f(x) = X i2i i h (x i ), (x)i

7 Kernelized Perceptron { } {± } initialize f =0 repeat Functional Form Pick (x i,y i ) from data if y i f(x i ) apple 0 then f( ) f( )+y i k(x i, )+y i until y i f(x i ) > 0 for all i d instead of updating w, now update α i increase its vote by 1 i i + y i Weight vector is linear combination w = X i2i i (x i ) Classifier is linear combination of inner products f(x) = X i2i i h (x i ), (x)i = X i2i i k(x i,x)

8 Kernelized Perceptron Primal Form update weights Dual Form update linear coefficients w w + y i (x i ) classify f(k) =w (x) i i + y i implicitly equivalent to: w = X i (x i ) i2i Nothing happens if classified correctly Weight vector is linear combination w = X i2i i (x i ) Classifier is linear combination of inner products f(x) = X i2i i h (x i ), (x)i = X i2i i k(x i,x)

9 Kernelized Perceptron Primal Form update weights Dual Form update linear coefficients w w + y i (x i ) i i + y i classify f(k) =w (x) w = X i2i implicitly equivalent to: i (x i ) classify f(x) =w i (x i )] (x) slow O(d 2 ) = X i2i (x) =[ X i2i i h (x i ), (x)i fast O(d) = X i2i i k(x i,x)

10 Kernelized Perceptron initialize i =0 for all i Dual Form repeat update linear coefficients Pick (x i,y i ) from data i i + y i until if y i f(x i ) apple 0 i y i f(x i ) > 0 then i + y i for all if #features >> #examples, dual is easier; otherwise primal is easier i implicitly w = X i2i classify f(x) =w slow O(d 2 ) = X i2i i (x i ) (x) =[ X i2i i h (x i ), i (x)i (x i )] (x) fast O(d) = X i2i i k(x i,x)

11 Kernelized Perceptron Primal Perceptron update weights Dual Perceptron update linear coefficients w w + y i (x i ) i i + y i classify f(k) =w (x) implicitly w = X i2i i (x i ) if #features >> #examples, dual is easier; otherwise primal is easier Q: when is #features >> #examples? A: higher-order polynomial kernels or exponential kernels (inf. dim.)

12 Kernelized Perceptron Pros/Cons of Kernel in Dual pros: Dual Perceptron update linear coefficients no need to compute ɸ(x) (time) i i + y i no need to store ɸ(x) and w (memory) implicitly w = X i (x i ) i2i cons: sum over all misclassified training examples for test classify f(x) =w i (x i )] (x) need to store all misclassified training examples (memory) = X i2i (x) =[ X i2i i h (x i ), (x)i slow O(d 2 ) called support vector set SVM will minimize this set! = X i2i i k(x i,x) fast O(d)

13 Kernelized Perceptron Primal Perceptron update on new param. x1: -1 w = (0, -1) x2: +1 w = (2, 0) x3: +1 w = (2, -1) x 1 (0, 1) : 1 x 2 (2, 1) : +1 x 3 (0, 1) : +1 Dual Perceptron update on new param. w (implicit) x1: -1 α = (-1, 0, 0) -x1 x2: +1 α = (-1, 1, 0) -x1 + x2 x3: +1 α = (-1, 1, 1) -x1 + x2 + x3 linear kernel (identity map) final implicit w = (2, -1) geometric interpretation of dual classification: sum of dot-products with x2 & x3 bigger than dot-product with x1 (agreement w/ positive > w/ negative)

14 XOR Example x4: -1 x1: +1 Dual Perceptron update on new param. w (implicit) x1: +1 α = (+1, 0, 0, 0) φ(x1) x3: +1 x2: -1 x2: -1 α = (+1, -1, 0, 0) φ(x1) - φ(x2) k(x, x 0 ) = (x x 0 ) 2, (x) = (x 2 1, x 2 2, p 2x 1 x 2 ) w = (0, 0, 2 p 2) classification rule in dual/geom: in dual/algebra: (x x 1 ) 2 > (x x 2 ) 2 ) cos 2 1 > cos 2 2 ) cos 1 > cos 2 x1: +1 x2: -1 (x x 1 ) 2 > (x x 2 ) 2 ) (x 1 + x 2 ) 2 > (x 1 x 2 ) 2 ) x 1 x 2 > 0 also verify in primal

15 Circle Example?? Dual Perceptron update on new param. w (implicit) x1: +1 α = (+1, 0, 0, 0) φ(x1) x2: -1 α = (+1, -1, 0, 0) φ(x1) - φ(x2) k(x, x 0 ) = (x x 0 ) 2, (x) = (x 2 1, x 2 2, p 2x 1 x 2 )

16 Polynomial Kernels Idea We want to extend k(x, x 0 )=hx, x 0 i 2 to k(x, x 0 )=(hx, x 0 i + c) d where c>0and d 2 N. Prove that such a kernel corresponds to a dot product. Proof strategy +c is just augmenting space. Simple and straightforward: compute the explicit sum simpler proof: set x 0 = sqrt(c) given by the kernel, i.e. mx k(x, x 0 )=(hx, x 0 i + c) d d = (hx, x 0 i) i c d i i=0 i Individual terms (hx, x 0 i) i are dot products for some i (x).

17 Circle Example x2 Dual Perceptron x5 x1 x3 update on new param. w (implicit) x1: +1 α = (+1, 0, 0, 0, 0) φ(x1) x4 x2: -1 α = (+1, -1, 0, 0, 0) φ(x1) - φ(x2) x3: -1 α = (+1, -1, -1, 0, 0) k(x, x 0 ) = (x x 0 ) 2, (x) = (x 2 1, x 2 2, p 2x 1 x 2 ) k(x, x 0 ) = (x x 0 + 1) 2, (x) =?

18 Examples you only need to know polynomial and gaussian. Examples of kernels k(x, x 0 ) Linear hx, x 0 i Laplacian RBF exp ( kx x 0 k) Gaussian RBF exp kx x 0 k 2 distorts distance Polynomial (hx, x 0 i + ci) d,c 0, d2 N B-Spline B 2n+1 (x x 0 ) Cond. Expectation E c [p(x c)p(x 0 c)] distorts angle Simple trick for checking Mercer s condition Compute the Fourier transform of the kernel and check that it is nonnegative.

19 Kernel Summary For a feature map ɸ, find a magic function k, s.t.: the dot-product ɸ(x) ɸ(x ) = k(x, x ) this k(x, x ) should be much faster than ɸ(x) k(x, x ) should be computable in O(n) if x in R n ɸ(x) is much slower: O(n d ) for poly d, more for Gaussian But for any k function, is there a ɸ s.t. ɸ(x) ɸ(x ) = k(x,x )? Examples of kernels k(x, x 0 ) Linear hx, x 0 i Laplacian RBF exp ( kx x 0 k) Gaussian RBF exp kx x 0 k 2 Polynomial (hx, x 0 i + ci) d,c 0, d2 N

20 Mercer s Theorem The Theorem For any symmetric function k : X X! R which is square integrable in X X and which satisfies Z k(x, x 0 )f(x)f(x 0 )dxdx 0 0 for all f 2 L 2 (X) X X there exist i : X! R and numbers i 0 where k(x, x 0 )= X i i i(x) i (x 0 ) for all x, x 0 2 X. Interpretation Double integral is the continuous version of a vectormatrix-vector multiplication. For positive semidefinite matrices we have X X k(xi,x j ) i j 0

21 Properties Distance in Feature Space Distance between points in feature space via d(x, x 0 ) 2 :=k (x) (x 0 )k 2 =h (x), (x)i 2h (x), (x 0 )i + h (x 0 ), (x 0 )i =k(x, x)+k(x 0,x 0 ) 2k(x, x) Kernel Matrix To compare observations we compute dot products, so we study the matrix K given by K ij = h (x i ), (x j )i = k(x i,x j ) where x i are the training patterns. Similarity Measure The entries K ij tell us the overlap between (x j ), so k(x i,x j ) is a similarity measure. (x i ) and

22 Kernelized Pegasos for SVM for HW2, you don t need to randomly choose training examples. just go over all training examples in the original order, and call that an epoch (same as HW1).

23 σ =1.0 C = f(x) =1 Gaussian RBF kernel (default in sklearn) f(x) =0 f(x) = 1 f(x) = NX i α i y i exp ³ x x i 2 /2σ 2 + b

24 σ =1.0 C =100 Decrease C, gives wider (soft) margin

25 σ =1.0 C =10 f(x) = NX i α i y i exp ³ x x i 2 /2σ 2 + b

26 σ =1.0 C = f(x) = NX i α i y i exp ³ x x i 2 /2σ 2 + b

27 σ =0.25 C = Decrease sigma, moves towards nearest neighbour classifier

28 σ =0.1 C = f(x) = NX i α i y i exp ³ x x i 2 /2σ 2 + b

29 Polynomial Kernels this is in contrast with C: smaller C => wide margin (underfitting) larger C => narrow margin (overfitting)

30 Overfitting vs. Overfitting

31 From SVM to Nearest Neighbor for each test example x, decide its label by the training example closest to x decision boundary highly non-linear (Voronoi) k-nearest neighbor (k-nn): smoother boundaries

32 K = 1 K = 7 Training data Testing data Training data Testing data error = 0.0 error = 0.15 error = error = K = 3 K = 21 Training data Testing data Training data Testing data error = error = error = error =

33 K = 1 small k: overfitting K = 7 Training data Testing data Training data Testing data error = 0.0 error = 0.15 error = error = K = 3 K = 21 large k: underfitting Training data Testing data Training data Testing data error = error = what about k=n? error = error =

34 SVM vs. Nearest Neighbor support vectors few all

36 d f b a c e

Nonlinearity & Preprocessing

Nonlinearity & Preprocessing Nonlinear Features x4: -1 x1: +1 x3: +1 x2: -1 Concatenated (combined) features XOR: x = (x 1, x 2, x 1 x 2 ) income: add degree + major Perceptron Map data into feature space