Kernel methods, kernel SVM and ridge regression



1 Kernel methods, kernel SVM and ridge regression
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

2 Collaborative Filtering

3 Collaborative Filtering
R: rating matrix; U: user factor; V: movie factor
$\min_{U,V} f(U, V) = \| R - U V^\top \|_F^2 \quad \text{s.t. } U \ge 0,\ V \ge 0,\ k \ll m, n$
Low rank matrix approximation approach
Probabilistic matrix factorization
Bayesian probabilistic matrix factorization


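4 Parameter Estimation and Prediction
The Bayesian approach treats the unknown parameters as a random variable:
$P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)} = \frac{P(D \mid \theta) P(\theta)}{\int P(D \mid \theta) P(\theta)\, d\theta}$
Posterior mean estimation: $\theta_{bayes} = \int \theta\, P(\theta \mid D)\, d\theta$
Maximum likelihood approach: $\theta_{ML} = \arg\max_\theta P(D \mid \theta)$, and $\theta_{MAP} = \arg\max_\theta P(\theta \mid D)$
[Figure: graphical model with $\theta$, $X_N$ and $X_{new}$]
Bayesian prediction, taking into account all possible values of $\theta$:
$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta) P(\theta \mid D)\, d\theta$
A frequentist prediction uses a plug-in estimator:
$P(x_{new} \mid D) = P(x_{new} \mid \theta_{ML})$ or $P(x_{new} \mid D) = P(x_{new} \mid \theta_{MAP})$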
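To make the four quantities above concrete, here is a minimal sketch for a coin-flip model. The data, the Beta(2, 2) prior, and all variable names are illustrative assumptions, not taken from the slides. It compares the ML, MAP, and posterior-mean estimates and computes the Bayesian prediction by numerical integration over a grid.

```python
import numpy as np

# Hypothetical data: 10 coin flips, 1 = heads (7 heads in total).
D = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])
n, heads = len(D), int(D.sum())

# Assumed prior theta ~ Beta(a, b); a = b = 2 is an illustrative choice.
a, b = 2.0, 2.0

theta_ml = heads / n                            # argmax_theta P(D | theta)
theta_map = (heads + a - 1) / (n + a + b - 2)   # argmax_theta P(theta | D)
theta_bayes = (heads + a) / (n + a + b)         # posterior mean

# Bayesian prediction P(x_new = 1 | D): integrate P(x_new = 1 | theta) = theta
# against the posterior, approximated by a sum over a uniform grid.
grid = np.linspace(0.0, 1.0, 100001)
post = grid ** (heads + a - 1) * (1 - grid) ** (n - heads + b - 1)
post /= post.sum()                              # normalise the grid weights
p_new_bayes = float((grid * post).sum())

# Plug-in (frequentist) predictions simply reuse the point estimates.
p_new_ml, p_new_map = theta_ml, theta_map
```

For this model the Bayesian prediction coincides with the posterior mean, while the two plug-in predictions differ from it.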

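5 PMF: Parameter Estimation
Parameter estimation: MAP estimate
$\theta_{MAP} = \arg\max_\theta P(\theta \mid D, \alpha) = \arg\max_\theta P(\theta, D \mid \alpha) = \arg\max_\theta P(D \mid \theta) P(\theta \mid \alpha)$
In the paper: [equation image not preserved]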

6 PMF: Interpret prior as regularization
Maximize the posterior distribution with respect to the parameters U and V.
This is equivalent to minimizing the sum-of-squares error function with a quadratic regularization term (plug in the Gaussians and take the log).
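The equivalence above can be sketched in code: with Gaussian noise and Gaussian priors, the negative log posterior is a masked squared error plus quadratic penalties on U and V. The sketch below minimizes that objective by plain gradient descent. The rating matrix, the regularization weights `lam_u` and `lam_v`, the step size, and the function names are all assumptions for illustration, not the paper's settings.

```python
import numpy as np

def pmf_objective(R, mask, U, V, lam_u, lam_v):
    """Sum-of-squares error on observed entries plus quadratic regularization."""
    err = mask * (R - U @ V.T)
    return 0.5 * np.sum(err ** 2) + 0.5 * lam_u * np.sum(U ** 2) + 0.5 * lam_v * np.sum(V ** 2)

def pmf_map(R, mask, k=2, lam_u=0.1, lam_v=0.1, lr=0.01, steps=500, seed=0):
    """MAP estimate of U (users x k) and V (movies x k) by gradient descent."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.standard_normal((m, k))
    V = 0.1 * rng.standard_normal((n, k))
    history = []
    for _ in range(steps):
        err = mask * (R - U @ V.T)          # residual on observed ratings only
        grad_U = -err @ V + lam_u * U       # gradient of the objective w.r.t. U
        grad_V = -err.T @ U + lam_v * V     # gradient of the objective w.r.t. V
        U -= lr * grad_U
        V -= lr * grad_V
        history.append(pmf_objective(R, mask, U, V, lam_u, lam_v))
    return U, V, history

# Hypothetical 4 users x 5 movies rating matrix; 0 marks a missing rating.
R = np.array([[5, 4, 0, 1, 0],
              [4, 0, 0, 1, 1],
              [1, 1, 0, 5, 4],
              [0, 1, 5, 4, 0]], dtype=float)
mask = (R > 0).astype(float)
U, V, history = pmf_map(R, mask)
```

In this reading, larger `lam_u` and `lam_v` correspond to tighter (smaller-variance) Gaussian priors on the factors.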

7 Bayesian PMF: predicting new ratings
Bayesian prediction, taking into account all possible values of $\theta$:
$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta) P(\theta \mid D)\, d\theta$
In the paper, this is done by integrating out all parameters and hyperparameters.

8 Bayesian PMF: overall algorithm
[Figure: algorithm listing from the paper, not preserved]

9 Nonlinear classifier
[Figure: nonlinear decision boundaries vs. linear SVM decision boundaries]

10 Nonlinear regression
[Figure: linear regression vs. a nonlinear regression function, with bands three standard deviations above and below the mean]
For nonlinear regression where we also want to estimate the variance, we need advanced methods such as Gaussian processes and kernel regression.

11 Nonconventional clusters
Need more advanced methods, such as kernel methods or spectral clustering, to work.

12 Nonlinear principal component analysis
[Figure: PCA vs. nonlinear PCA]

13 Support Vector Machines (SVM)
$\min_w \frac{1}{2} w^\top w + C \sum_j \xi_j$
$\text{s.t. } (w^\top x_j + b)\, y_j \ge 1 - \xi_j,\ \xi_j \ge 0,\ \forall j$
$\xi_j$: slack variables
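As a small illustration of the primal problem, the sketch below evaluates the objective for a given (w, b). For fixed w and b, the smallest slack that satisfies both constraints is $\xi_j = \max(0, 1 - (w^\top x_j + b) y_j)$, which is what the code uses. The data and the function name `svm_primal_objective` are hypothetical.

```python
import numpy as np

def svm_primal_objective(w, b, X, y, C):
    """Return (objective value, slack vector) for the soft-margin SVM primal.

    X holds one data point per row; y holds labels in {-1, +1}.
    """
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)   # smallest feasible slack per point
    return 0.5 * float(w @ w) + C * float(xi.sum()), xi

# Hypothetical toy data: the second point lies inside the margin.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0])
w, b = np.array([1.0, 0.0]), 0.0
objective, xi = svm_primal_objective(w, b, X, y, C=1.0)
```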


14 SVM for nonlinear problem
Solve a nonlinear problem with a linear relation in feature space.
[Figure: transform the data points; a non-linear decision boundary becomes a linear decision boundary in feature space]
Nonlinear clustering, principal component analysis, canonical correlation analysis

15 SVM for nonlinear problems
Some problems need complicated and even infinite features: $\phi(x) = (x, x^2, x^3, x^4, \ldots)^\top$
Explicitly computing high-dimensional features is time consuming and makes the subsequent optimization costly.
[Figure: nonlinear decision boundaries vs. linear SVM decision boundaries]

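16 SVM Lagrangian
Primal problem:
$\min_{w,\xi} \frac{1}{2} w^\top w + C \sum_j \xi_j \quad \text{s.t. } (w^\top x_j + b)\, y_j \ge 1 - \xi_j,\ \xi_j \ge 0,\ \forall j$
Lagrangian:
$L(w, \xi, \alpha, \beta) = \frac{1}{2} w^\top w + C \sum_j \xi_j + \sum_j \alpha_j \left(1 - \xi_j - (w^\top x_j + b)\, y_j\right) - \sum_j \beta_j \xi_j, \quad \alpha_j \ge 0,\ \beta_j \ge 0$
Taking the derivative of $L(w, \xi, \alpha, \beta)$ with respect to $w$ and $\xi$, we have
$w = \sum_j \alpha_j y_j x_j$
$b = y_k - w^\top x_k$ for any $k$ such that $0 < \alpha_k < C$
Here $x_j$ can be replaced by infinite dimensional features $\phi(x_j)$.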
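The stationarity conditions behind these expressions can be written out explicitly. The following worked step is a sketch derived from the Lagrangian as stated on this slide; it also shows where the dual constraints $\sum_j \alpha_j y_j = 0$ and $0 \le \alpha_j \le C$ come from.

```latex
\begin{align*}
\frac{\partial L}{\partial w} &= w - \sum_j \alpha_j y_j x_j = 0
  &&\Rightarrow\quad w = \sum_j \alpha_j y_j x_j \\
\frac{\partial L}{\partial b} &= -\sum_j \alpha_j y_j = 0
  &&\Rightarrow\quad \sum_j \alpha_j y_j = 0 \\
\frac{\partial L}{\partial \xi_j} &= C - \alpha_j - \beta_j = 0
  &&\Rightarrow\quad \alpha_j = C - \beta_j \le C \quad (\text{since } \beta_j \ge 0)
\end{align*}
```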

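17 SVM dual problem
Plug $w$ and $b$ into the Lagrangian to obtain the dual problem:
$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$
$\text{s.t. } \sum_i \alpha_i y_i = 0,\ 0 \le \alpha_i \le C$
The data enter only through inner products: $x_i^\top x_j = \sum_l x_{il} x_{jl}$, or in feature space $\phi(x_i)^\top \phi(x_j) = \sum_l \phi(x_i)_l\, \phi(x_j)_l$
It is a quadratic program; solve for $\alpha$, then we get
$w = \sum_j \alpha_j y_j x_j$
$b = y_k - w^\top x_k$ for any $k$ such that $0 < \alpha_k < C$
Data points corresponding to nonzero $\alpha_i$ are called support vectors.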


18 Kernel Functions
Denote the inner product as a function: $k(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$
[Figure: kernel values between pairs of objects, K = 0.6 and K = 0.2]
[Figure: each graph maps to a count vector (# node, # edge, # triangle, # rectangle, # pentagon); the inner product of two count vectors gives K = 0.5]
[Figure: kernel value between DNA strings such as ACAAGAT, GCCATTG, TCCCCCG, K = 0.7]

19 Problem of explicitly constructing features
If we explicitly construct the feature map $\phi(x): \mathbb{R}^m \to F$, the feature space can grow really large really quickly.
E.g. polynomial features of degree $d$: $x_1^d,\ x_1 x_2 \cdots x_d,\ x_1^2 x_2 \cdots x_{d-1}, \ldots$
The total number of such features is huge for large $m$ and $d$:
$\binom{d + m - 1}{d} = \frac{(d + m - 1)!}{d!\,(m - 1)!}$
For $d = 6$, $m = 100$, there are 1.6 billion terms.
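The count quoted above can be checked directly with the binomial formula. This is a quick sketch; the helper name `num_monomials` is an assumption for illustration.

```python
from math import comb

def num_monomials(m, d):
    """Number of distinct degree-d monomials in m variables: C(d + m - 1, d)."""
    return comb(d + m - 1, d)

count = num_monomials(m=100, d=6)
print(count)  # 1609344100, i.e. about 1.6 billion
```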

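20 Can we avoid expanding the features?
Rather than computing the features explicitly and then computing the inner product, can we merge the two steps using a clever kernel function $k(x_i, x_j)$?
E.g. polynomial kernel with $d = 2$ for $x = (x_1, x_2)^\top$, $y = (y_1, y_2)^\top$:
$\phi(x)^\top \phi(y) = \begin{pmatrix} x_1^2 \\ x_1 x_2 \\ x_2 x_1 \\ x_2^2 \end{pmatrix}^\top \begin{pmatrix} y_1^2 \\ y_1 y_2 \\ y_2 y_1 \\ y_2^2 \end{pmatrix} = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x^\top y)^2$
This needs only $O(m)$ computation!
Polynomial kernel of degree $d$: $k(x, y) = (x^\top y)^d = \phi(x)^\top \phi(y)$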
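A numerical check of the degree-2 identity, as a minimal sketch: the explicit feature map below lists all ordered products $x_a x_b$ (so it has $m^2$ entries), and the kernel computes $(x^\top y)^2$ directly. The function names and the random test vectors are assumptions for illustration.

```python
import numpy as np

def phi_degree2(x):
    """Explicit degree-2 features: all ordered products x_a * x_b (length m^2)."""
    return np.outer(x, x).ravel()

def poly_kernel(x, y, d=2):
    """Polynomial kernel (x^T y)^d, computed in O(m) time."""
    return float(np.dot(x, y)) ** d

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

explicit = float(phi_degree2(x) @ phi_degree2(y))   # O(m^2) work
implicit = poly_kernel(x, y)                        # O(m) work
```

Both values agree up to floating-point rounding, but the second never builds the $m^2$-dimensional feature vector.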

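21 What k(x, y) can be called a kernel function?
$k(x, y)$ is equivalent to first computing the feature $\phi(x)$ and then taking the inner product: $k(x, y) = \phi(x)^\top \phi(y)$
Take a dataset $D = \{x_1, x_2, x_3, \ldots, x_n\}$.
Compute the pairwise kernel function $k(x_i, x_j)$ and form an $n \times n$ kernel matrix (Gram matrix):
$K = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \cdots & k(x_n, x_n) \end{pmatrix}$
$k(x, y)$ is a kernel function iff the matrix $K$ is positive semidefinite for every such dataset: $\forall v \in \mathbb{R}^n,\ v^\top K v \ge 0$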
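The positive semidefiniteness condition can be probed numerically on a sample dataset. The sketch below builds the Gram matrix and inspects its smallest eigenvalue; the helper names, the tolerance, and the two candidate functions are assumptions. Passing the check on one dataset does not prove that a function is a kernel, but failing it on any dataset shows that it is not.

```python
import numpy as np

def gram_matrix(kernel, X):
    """n x n matrix of pairwise kernel values for the rows of X."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(K, tol=1e-8):
    """True if the symmetrised matrix has no eigenvalue below -tol."""
    K = 0.5 * (K + K.T)
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

def rbf(x, y):
    return float(np.exp(-np.sum((x - y) ** 2) / 2.0))   # Gaussian RBF, sigma = 1

def neg_distance(x, y):
    return -float(np.linalg.norm(x - y))                # not a kernel

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))

rbf_ok = is_psd(gram_matrix(rbf, X))
neg_ok = is_psd(gram_matrix(neg_distance, X))
```

The negative-distance matrix has a zero diagonal and nonzero off-diagonal entries, so its eigenvalues sum to zero and at least one must be negative.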

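22 Kernel trick
In the dual problem of SVMs, replace the inner product $\phi(x_i)^\top \phi(x_j)$ by the kernel $k(x_i, x_j)$:
$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j)$
$\text{s.t. } \sum_i \alpha_i y_i = 0,\ 0 \le \alpha_i \le C$
It is a quadratic program; solve for $\alpha$, then we get
$w = \sum_j \alpha_j y_j \phi(x_j)$
$b = y_k - w^\top \phi(x_k)$ for any $k$ such that $0 < \alpha_k < C$
Evaluate the decision boundary on a new data point:
$f(x) = w^\top \phi(x) = \left(\sum_j \alpha_j y_j \phi(x_j)\right)^\top \phi(x) = \sum_j \alpha_j y_j\, k(x_j, x)$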
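The last identity means the decision function never needs $w$ explicitly. The sketch below evaluates $f(x) = \sum_j \alpha_j y_j k(x_j, x)$ and, for the linear kernel, compares it with the explicit $w^\top x$. The `alpha` values here are hypothetical placeholders that merely satisfy $\sum_j \alpha_j y_j = 0$; they are not the solution of any quadratic program.

```python
import numpy as np

def decision_function(alpha, y, X_train, kernel, x):
    """f(x) = sum_j alpha_j y_j k(x_j, x), as on the slide (no offset b added)."""
    return float(sum(a * yj * kernel(xj, x) for a, yj, xj in zip(alpha, y, X_train)))

def linear(u, v):
    return float(np.dot(u, v))

# Hypothetical training points (one per row), labels and dual variables.
X_train = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.5, 0.2, 0.3, 0.4])   # placeholder values; sum(alpha * y) = 0

x_new = np.array([0.5, 1.0])
w = (alpha * y) @ X_train                # explicit w, possible for the linear kernel
f_kernel = decision_function(alpha, y, X_train, linear, x_new)
f_explicit = float(w @ x_new)
```

Swapping `linear` for any other kernel function changes the feature space without changing the rest of the code.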

23 Typical kernels for vector data
Polynomial of degree $d$: $k(x, y) = (x^\top y)^d$
Polynomial of degree up to $d$: $k(x, y) = (x^\top y + c)^d$
Exponential kernel (infinite degree polynomials): $k(x, y) = \exp(s \cdot x^\top y)$
Gaussian RBF kernel: $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$
Laplace kernel: $k(x, y) = \exp\left(-\frac{\|x - y\|}{2\sigma^2}\right)$
Exponentiated distance: $k(x, y) = \exp\left(-\frac{d(x, y)^2}{s^2}\right)$
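These kernels translate almost line for line into NumPy. The sketch below follows the formulas as written on the slide, including the $2\sigma^2$ scaling in the Laplace kernel; the function names and default parameter values are assumptions.

```python
import numpy as np

def polynomial(x, y, d=2):
    return float(np.dot(x, y)) ** d

def polynomial_up_to(x, y, d=2, c=1.0):
    return (float(np.dot(x, y)) + c) ** d

def exponential(x, y, s=1.0):
    return float(np.exp(s * np.dot(x, y)))

def gaussian_rbf(x, y, sigma=1.0):
    return float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)))

def laplace(x, y, sigma=1.0):
    return float(np.exp(-np.linalg.norm(x - y) / (2 * sigma ** 2)))

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])   # x^T y = 1, ||x - y||^2 = 13

values = {
    "polynomial": polynomial(x, y),
    "polynomial_up_to": polynomial_up_to(x, y),
    "exponential": exponential(x, y),
    "gaussian_rbf": gaussian_rbf(x, y),
    "laplace": laplace(x, y),
}
```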

24 Shape of some kernels
Translation invariant kernel: $k(x, y) = g(x - y)$
The decision boundary is a weighted sum of bumps: $f(x) = \sum_j \alpha_j y_j\, k(x_j, x)$

25 How to construct more complicated kernels
Know the feature space $\phi(x)$, but find a fast way to compute the inner product $k(x, y)$. E.g. string kernels and graph kernels.
Find a function $k(x, y)$ and prove it is positive semidefinite. Make sure the function captures some useful similarity between data points.
Combine existing kernels to get new kernels. Which combinations still result in a kernel?

26 Combining kernels I
Positive weighted combinations of kernels are kernels:
If $k_1(x, y)$ and $k_2(x, y)$ are kernels and $\alpha, \beta \ge 0$, then $k(x, y) = \alpha k_1(x, y) + \beta k_2(x, y)$ is a kernel.
A weighted combination kernel is like a concatenated feature space:
$k_1(x, y) = \phi(x)^\top \phi(y), \quad k_2(x, y) = \psi(x)^\top \psi(y)$
$k(x, y) = \begin{pmatrix} \phi(x) \\ \psi(x) \end{pmatrix}^\top \begin{pmatrix} \phi(y) \\ \psi(y) \end{pmatrix}$
Mappings between spaces give you kernels:
If $k(x, y)$ is a kernel, then $k(\phi(x), \phi(y))$ is a kernel, e.g. $k(x, y) = x^2 y^2$
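A short numerical check of the concatenation view, under stated assumptions: $\phi$ is the identity map (linear kernel), $\psi$ is the explicit degree-2 map, and the weights enter the concatenated vector as $\sqrt{\alpha}$ and $\sqrt{\beta}$ so that the inner product reproduces $\alpha k_1 + \beta k_2$ exactly. The names and values are illustrative.

```python
import numpy as np

def phi(x):
    return x                          # features of the linear kernel k1

def psi(x):
    return np.outer(x, x).ravel()     # features of the degree-2 kernel k2

def concat_features(x, a, b):
    """Concatenated feature vector whose inner product equals a*k1 + b*k2."""
    return np.concatenate([np.sqrt(a) * phi(x), np.sqrt(b) * psi(x)])

rng = np.random.default_rng(1)
x, y = rng.standard_normal(4), rng.standard_normal(4)
a, b = 0.7, 2.5                       # nonnegative weights (alpha, beta on the slide)

k1 = float(x @ y)
k2 = float(x @ y) ** 2
weighted_sum = a * k1 + b * k2
via_features = float(concat_features(x, a, b) @ concat_features(y, a, b))
```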

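27 Kernel combination II
Products of kernels are kernels:
If $k_1(x, y)$ and $k_2(x, y)$ are kernels, then $k(x, y) = k_1(x, y)\, k_2(x, y)$ is a kernel.
A product kernel is like using a tensor product feature space:
$k_1(x, y) = \phi(x)^\top \phi(y), \quad k_2(x, y) = \psi(x)^\top \psi(y)$
$k(x, y) = \left(\phi(x) \otimes \psi(x)\right)^\top \left(\phi(y) \otimes \psi(y)\right)$
$k(x, y) = k_1(x, y)\, k_2(x, y) \cdots k_d(x, y)$ is a kernel with higher order tensor features:
$k(x, y) = \left(\phi(x) \otimes \psi(x) \otimes \cdots \otimes \xi(x)\right)^\top \left(\phi(y) \otimes \psi(y) \otimes \cdots \otimes \xi(y)\right)$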
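The tensor product claim can be verified with `np.kron`, which forms the Kronecker product of two feature vectors. In this sketch $\phi$ and $\psi$ are hypothetical explicit feature maps chosen only for illustration.

```python
import numpy as np

def phi(x):
    return x                                   # linear features

def psi(x):
    return np.array([1.0, x[0], x[1] ** 2])    # an arbitrary illustrative feature map

rng = np.random.default_rng(2)
x, y = rng.standard_normal(3), rng.standard_normal(3)

k1 = float(phi(x) @ phi(y))
k2 = float(psi(x) @ psi(y))
product = k1 * k2

# Inner product in the tensor product feature space.
via_tensor = float(np.kron(phi(x), psi(x)) @ np.kron(phi(y), psi(y)))
```

The tensor feature vector has length 3 x 3 = 9 here, while the product kernel only needs the two scalar kernel values.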

28 Nonlinear regression
[Figure: linear regression vs. a nonlinear regression function, with bands three standard deviations above and below the mean]
For nonlinear regression where we also want to estimate the variance, we need advanced methods such as Gaussian processes and kernel regression.

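29 Ridge regression
A dataset $X = (x_1, \ldots, x_n) \in \mathbb{R}^{d \times n}$, $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$
With some regularization parameter $\lambda$, the goal is to find the regression weights $w$:
$w^* = \arg\min_w \sum_i (y_i - x_i^\top w)^2 + \lambda \|w\|^2 = \arg\min_w \|y - X^\top w\|^2 + \lambda \|w\|^2$
Find $w$: take the derivative of the objective and set it to zero:
$w = (X X^\top + \lambda I)^{-1} X y$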

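30 Matrix inversion lemma
Find $w$: take the derivative of the objective and set it to zero:
$w = (X X^\top + \lambda I_d)^{-1} X y$
By the matrix inversion lemma:
$w = (X X^\top + \lambda I_d)^{-1} X y = X (X^\top X + \lambda I_n)^{-1} y$
Evaluating $w$ at a new data point $x$, we have
$w^\top x = y^\top (X^\top X + \lambda I_n)^{-1} X^\top x$
The above expression only depends on inner products!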
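The two expressions for $w$ can be compared numerically. The sketch below uses random data with the slide's layout (one data point per column of $X$, so $X$ is $d \times n$); the sizes, seed, and $\lambda$ are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 3, 8, 0.5
X = rng.standard_normal((d, n))      # columns are data points
y = rng.standard_normal(n)
x_new = rng.standard_normal(d)

# Primal form: solve a d x d system.
w_primal = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# Dual form: solve an n x n system that involves only inner products X^T X.
dual_coef = np.linalg.solve(X.T @ X + lam * np.eye(n), y)
w_dual = X @ dual_coef

# Prediction at a new point, written purely with inner products X^T x_new.
pred_primal = float(w_primal @ x_new)
pred_dual = float(dual_coef @ (X.T @ x_new))
```

The dual form is the one that generalizes: both `X.T @ X` and `X.T @ x_new` can be swapped for kernel evaluations.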

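31 Kernel ridge regression
$w^\top x = y^\top (X^\top X + \lambda I_n)^{-1} X^\top x$ only depends on inner products!
$X^\top X = \begin{pmatrix} x_1^\top x_1 & \cdots & x_1^\top x_n \\ \vdots & \ddots & \vdots \\ x_n^\top x_1 & \cdots & x_n^\top x_n \end{pmatrix}, \quad X^\top x = \begin{pmatrix} x_1^\top x \\ \vdots \\ x_n^\top x \end{pmatrix}$
Kernel ridge regression: replace the inner product by a kernel function:
$X^\top X \to K = \left(k(x_i, x_j)\right)_{n \times n}$
$X^\top x \to k_x = \left(k(x_i, x)\right)_{n \times 1}$
$f(x) = y^\top (K + \lambda I_n)^{-1} k_x$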
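A minimal sketch of kernel ridge regression with a Gaussian RBF kernel follows. Unlike the slide, data points are stored as rows here, which is the usual NumPy layout. The function names, the toy sine data, and the values of `lam` and `sigma` are assumptions for illustration.

```python
import numpy as np

def rbf_kernel_matrix(A, B, sigma=1.0):
    """Pairwise Gaussian RBF values between rows of A (n x d) and B (m x d)."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def krr_fit(X, y, lam, sigma):
    """Return the coefficients (K + lam * I)^{-1} y."""
    K = rbf_kernel_matrix(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, coef, X_new, sigma):
    """f(x) = k_x^T (K + lam * I)^{-1} y for each row x of X_new."""
    return rbf_kernel_matrix(X_new, X_train, sigma) @ coef

# Hypothetical noise-free 1-D data from a sine curve.
X_train = np.linspace(0.0, 2.0 * np.pi, 40)[:, None]
y_train = np.sin(X_train[:, 0])

coef = krr_fit(X_train, y_train, lam=1e-3, sigma=0.5)
pred_mid = float(krr_predict(X_train, coef, np.array([[np.pi / 2.0]]), sigma=0.5)[0])
```

Because $K$ is symmetric, $y^\top (K + \lambda I_n)^{-1} k_x$ equals $k_x^\top (K + \lambda I_n)^{-1} y$, which is the form the code computes.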

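32 Kernel ridge regression
Use the Gaussian RBF kernel $k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$
[Figure: fitted functions for three settings: large $\sigma$, large $\lambda$; small $\sigma$, small $\lambda$; small $\sigma$, large $\lambda$]
Use cross-validation to choose the parameters.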
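One way to carry out the cross-validation step is a grid search with k-fold splits, sketched below. The grid values, the number of folds, the noisy sine data, and all function names are assumptions; the slides do not specify a procedure.

```python
import numpy as np

def rbf(A, B, sigma):
    """Pairwise Gaussian RBF values between rows of A and rows of B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def cv_error(X, y, sigma, lam, n_folds=5):
    """Mean squared validation error of kernel ridge regression over the folds."""
    indices = np.arange(len(X))
    errors = []
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        K = rbf(X[train], X[train], sigma)
        coef = np.linalg.solve(K + lam * np.eye(len(train)), y[train])
        pred = rbf(X[held_out], X[train], sigma) @ coef
        errors.append(np.mean((pred - y[held_out]) ** 2))
    return float(np.mean(errors))

# Hypothetical noisy 1-D data; inputs are drawn at random, so contiguous folds are random subsets.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2.0 * np.pi, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(60)

grid = [(s, l) for s in (0.05, 0.5, 5.0) for l in (1e-3, 1e-1, 10.0)]
scores = {params: cv_error(X, y, *params) for params in grid}
best_sigma, best_lam = min(scores, key=scores.get)
```

A small $\sigma$ with a large $\lambda$ shrinks predictions toward zero, so that corner of the grid should score poorly on data like this.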
