Basis Expansion and Nonlinear SVM
Kai Yu
Linear Classifiers
f(x) = w^\top x + b, z(x) = \mathrm{sign}(f(x))
Understanding these helps with more general cases, e.g., nonlinear models.
Nonlinear Classifiers via Basis Expansion
f(x) = w^\top h(x) + b, z(x) = \mathrm{sign}(f(x))
Nonlinear basis functions: h(x) = [h_1(x), h_2(x), \ldots, h_m(x)]
f(x) = w^\top x + b is the special case where h(x) = x.
This structure explains a lot of classification models, including SVMs.
Outline
- Representation theorem
- Kernel trick
- Understand regularization
- Nonlinear logistic regression
- General basis expansion functions
- Summary
Review: the QP for Linear SVMs
After a lot of algebra, we obtain the Lagrange dual
L_D = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j x_i^\top x_j
The solution has the form w = \sum_i \alpha_i y_i x_i.
In other words, the solution w is in span(x_1, x_2, ..., x_N).
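As a quick numerical check of this span property, here is a minimal sketch (using scikit-learn, my choice rather than anything the slides prescribe): for a linear kernel, the fitted weight vector can be rebuilt from the dual coefficients \alpha_i y_i and the support vectors.

```python
# A minimal sketch (assumes scikit-learn): rebuild w from the dual solution
# and confirm it lies in the span of the training points.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # linearly separable labels

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors, so this product
# is exactly w = sum_i alpha_i y_i x_i (non-support points have alpha_i = 0).
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))             # True
```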
A More General Result
RKHS representation theorem (Kimeldorf & Wahba, 1971)
In its simplest form: if L(w^\top x, y) is convex w.r.t. w, the solution of
\min_w \sum_i L(w^\top x_i, y_i) + \lambda \|w\|^2
has the form w = \sum_i \alpha_i x_i.
Proof sketch: decompose w = w_\parallel + w_\perp with w_\parallel \in span(x_1, ..., x_N); the orthogonal part w_\perp does not change any w^\top x_i but only increases \|w\|^2, so the optimum has w_\perp = 0.
Note: the conclusion is general, not only for SVMs.
For General Basis Expansion Functions
The solution of
\min_w \sum_i L(w^\top h(x_i), y_i) + \lambda \|w\|^2
has the form w = \sum_i \alpha_i h(x_i).
Kernel
Define the Mercer kernel as k(x_i, x_j) = h(x_i)^\top h(x_j).
Kernel Trick
Apply the representation theorem w = \sum_i \alpha_i h(x_i); we have
f(x) = \sum_i \alpha_i k(x_i, x)
\|w\|^2 = \sum_{i,j=1}^N \alpha_i \alpha_j k(x_i, x_j) = \alpha^\top K \alpha
\min_\alpha \sum_i L\big(\textstyle\sum_j \alpha_j k(x_j, x_i), y_i\big) + \lambda \alpha^\top K \alpha
Primal and Kernel Formulations
Primal: \min_w \sum_i L(w^\top h(x_i), y_i) + \lambda \|w\|^2
Kernel: k(x_i, x_j) = h(x_i)^\top h(x_j)
Dual: \min_\alpha \sum_i L\big(\textstyle\sum_j \alpha_j k(x_j, x_i), y_i\big) + \lambda \alpha^\top K \alpha
Given a kernel, we don't even need h(x)! Really?
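To make the "we don't need h(x)" point concrete, here is a minimal sketch with squared loss, where the dual objective has the closed-form minimizer \alpha = (K + \lambda I)^{-1} y (kernel ridge regression). The squared-loss choice and the RBF kernel, introduced on the next slide, are my illustration, not something the slides mandate.

```python
# Kernel ridge regression: alpha = (K + lambda*I)^{-1} y minimizes
# sum_i (f(x_i) - y_i)^2 + lambda * alpha^T K alpha; predictions use only
# k(x_i, x), never h(x) itself.
import numpy as np

def rbf_kernel(A, B, c=1.0):
    # k(x, x') = exp(-||x - x'||^2 / c)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / c)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)

lam = 0.1
alpha = np.linalg.solve(rbf_kernel(X, X) + lam * np.eye(len(X)), y)

X_new = np.array([[0.5]])
f_new = rbf_kernel(X_new, X) @ alpha         # f(x) = sum_i alpha_i k(x_i, x)
print(f_new, np.sin(0.5))                    # close to the true value
```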
Popular Kernels
k(x, x') is a symmetric, positive (semi-)definite function.
- dth-degree polynomial: K(x, x') = (1 + \langle x, x' \rangle)^d
- Radial basis: K(x, x') = \exp(-\|x - x'\|^2 / c)
Example (d = 2, two inputs):
K(x, x') = (1 + \langle x, x' \rangle)^2 = (1 + x_1 x_1' + x_2 x_2')^2
= 1 + 2 x_1 x_1' + 2 x_2 x_2' + (x_1 x_1')^2 + (x_2 x_2')^2 + 2 x_1 x_1' x_2 x_2'
With h_1(x) = 1, h_2(x) = \sqrt{2} x_1, h_3(x) = \sqrt{2} x_2, h_4(x) = x_1^2, h_5(x) = x_2^2, and h_6(x) = \sqrt{2} x_1 x_2, we get K(x, x') = h(x)^\top h(x').
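This identity is easy to verify numerically; a minimal sketch (the test points are arbitrary):

```python
# Check the degree-2 expansion above: (1 + <x, x'>)^2 must equal
# h(x)^T h(x') for the six listed features.
import numpy as np

def h(x):
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(np.isclose((1 + x @ xp) ** 2, h(x) @ h(xp)))   # True
```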
Non-linear Feature Mapping
[Figure: a 1-D dataset that is linearly separable by a threshold on x.]
But what if the dataset is just too hard?
[Figure: a 1-D dataset that no threshold on x can separate.]
How about mapping data to a higher-dimensional space, e.g., x \mapsto (x, x^2)?
[Figure: the same data in the (x, x^2) plane, now linearly separable.]
Nonlinear Feature Mapping
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
h: x \mapsto h(x)
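A tiny numeric illustration of the picture on the previous slide (toy data of my choosing):

```python
# 1-D labels depending on |x| admit no separating threshold on x alone,
# but the map h(x) = (x, x^2) separates the classes with the line x^2 = 1.
import numpy as np

x = np.array([-2.0, -1.5, -0.5, 0.0, 0.4, 1.6, 2.2])
y = np.where(np.abs(x) > 1, 1, -1)          # outer points vs. inner points

H = np.column_stack([x, x ** 2])            # h: x -> (x, x^2)
print(all(np.sign(H[:, 1] - 1.0) == y))     # True: separable in (x, x^2)
```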
Various Equivalent Formulations
Parametric form: \min_w \sum_i L(w^\top h(x_i), y_i) + \lambda \|w\|^2
Dual form: \min_\alpha \sum_i L\big(\textstyle\sum_j \alpha_j k(x_j, x_i), y_i\big) + \lambda \alpha^\top K \alpha
Nonparametric form: \min_f \sum_i L(f(x_i), y_i) + \lambda \|f\|_{H_k}^2
In each case, the regularization term tells what kind of f(x) is preferred.
Regularization Induced by the Kernel (or Basis Functions)
Eigen expansion: K(x, y) = \sum_i \gamma_i \phi_i(x) \phi_i(y), and f(x) = \sum_i c_i \phi_i(x)
\|f\|_{H_K}^2 \stackrel{\text{def}}{=} \sum_i c_i^2 / \gamma_i
A desired kernel is a smoothing operator: smoother eigenfunctions \phi_i tend to have larger eigenvalues \gamma_i.
What does this mean?
Understand Regularization
If we push down the regularization term \|f\|_{H_K}^2 \stackrel{\text{def}}{=} \sum_i c_i^2 / \gamma_i:
- In f(x), minor components \phi_i(x) with smaller \gamma_i are penalized more heavily, so principal components are preferred in f(x).
- A desired kernel is a smoothing operator, i.e., its principal components are smoother functions, so the regularization encourages f(x) to be smooth.
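One way to see this concretely (a sketch; it assumes eigenvectors of the kernel matrix on a dense grid approximate the eigenfunctions \phi_i): eigendecompose an RBF kernel matrix and use sign changes as a crude roughness measure.

```python
# Leading eigenvectors of an RBF kernel matrix are smooth (few sign changes);
# trailing ones oscillate, and those are exactly the components that the
# penalty sum_i c_i^2 / gamma_i discourages most.
import numpy as np

x = np.linspace(0, 1, 100)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)

gammas, phis = np.linalg.eigh(K)              # eigh returns ascending order
gammas, phis = gammas[::-1], phis[:, ::-1]    # re-sort to descending

for i in [0, 1, 2, 10]:
    changes = int(np.sum(np.diff(np.sign(phis[:, i])) != 0))
    print(f"gamma_{i} = {gammas[i]:.2e}, sign changes = {changes}")
```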
Understanding Regularization
\|f\|_{H_K}^2 \stackrel{\text{def}}{=} \sum_i c_i^2 / \gamma_i
Using what kernel? Using what feature (for a linear model)? Using what h(x)? Using what functional norm \|f\|_{H_k}^2?
All pointing to one thing: what kind of functions are preferred a priori.
Nonlinear Logistic Regression
So far, the things we discussed:
- representation theorem,
- kernel trick,
- regularization,
are not limited to SVMs. They are all applicable to logistic regression. The only difference is the loss function.
Nonlinear Logistic Regression
[Figure: binomial log-likelihood loss and the support vector (hinge) loss, plotted against the margin y f(x).]
Parametric form: \min_w \sum_i \ln\big(1 + e^{-y_i w^\top h(x_i)}\big) + \lambda \|w\|^2
Nonparametric form: \min_f \sum_i \ln\big(1 + e^{-y_i f(x_i)}\big) + \lambda \|f\|_{H_k}^2
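A minimal sketch of the nonparametric form, fitting \alpha by plain gradient descent (the toy data, kernel width, step size, and iteration count are all arbitrary choices of mine):

```python
# Kernel logistic regression: minimize over alpha
#   sum_i ln(1 + exp(-y_i (K alpha)_i)) + lambda * alpha^T K alpha
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = np.where((X ** 2).sum(1) > 1.5, 1.0, -1.0)   # a nonlinear concept

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 1.0)                            # RBF kernel matrix

lam, lr = 0.01, 0.01
alpha = np.zeros(len(y))
for _ in range(2000):
    margins = y * (K @ alpha)
    # gradient of the loss term is K @ (-y * sigmoid(-margins));
    # gradient of the regularizer is 2 * lam * K @ alpha
    alpha -= lr * (K @ (-y / (1.0 + np.exp(margins))) + 2 * lam * (K @ alpha))

f = K @ alpha                                    # f(x_i) = sum_j alpha_j k(x_j, x_i)
print((np.sign(f) == y).mean())                  # training accuracy
```

Predictions on new points use f(x) = \sum_i \alpha_i k(x_i, x), exactly as for the kernel SVM; only the loss changed.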
Logistic Regression vs. SVM
- Both can be linear or nonlinear, parametric or nonparametric; the main difference is the loss.
- They are very similar in performance.
- Logistic regression outputs probabilities, useful for scoring confidence.
- Logistic regression is easier to extend to multiple classes.
- 10 years ago, one was old and the other was new. Now, both are old.
Many Known Classification Models Follow a Similar Structure
Models that learn w and h(x) together:
- Neural networks
- RBF networks
- Learning VQ (LVQ)
- Boosting
SVMs, linear classifiers, and logistic regression follow the same structure, with h(x) fixed in advance.
Develop Your Own Stuff!
By deciding:
- Which loss function? Hinge, least squares, ...
- What form of h(x)? RBF, logistic, tree, ...
- Infinite-dimensional or finite-dimensional h(x)?
- Learning h(x) or not?
- How to optimize? QP, L-BFGS, functional gradient, ...
you can obtain various classification algorithms, as sketched below.
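For example, picking least-squares loss, a fixed finite set of RBF basis functions, and a direct linear solve yields the following toy classifier (a sketch; every constant and name here is illustrative):

```python
# One point in the design space: squared loss + m fixed RBF basis functions
# h_j(x) = exp(-||x - mu_j||^2 / c) + ridge penalty, solved in closed form.
import numpy as np

def rbf_design(X, centers, c=0.5):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / c)                       # N x m basis matrix h(x)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sign(np.sin(X[:, 0]))                     # a nonlinear target

centers = X[rng.choice(len(X), size=20, replace=False)]   # m = 20 centers
H = rbf_design(X, centers)
lam = 0.1
w = np.linalg.solve(H.T @ H + lam * np.eye(H.shape[1]), H.T @ y)

z = np.sign(H @ w)                               # z(x) = sign(f(x))
print((z == y).mean())                           # training accuracy
```

Swapping the squared loss for hinge loss and the linear solve for a QP would move toward an RBF-basis SVM; learning the centers as well would move toward an RBF network.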
Parametric vs. Nonparametric Models
- If h(x) is finite-dimensional, we have a parametric model f(x) = w^\top h(x). Training complexity is O(Nm^3).
- If h(x) is nonlinear and infinite-dimensional, we have to use the kernel trick. This is a nonparametric model; the training complexity is around O(N^3).
- Nonparametric models, including kernel SVMs, Gaussian processes, Dirichlet processes, etc., are elegant in math, but nontrivial for large-scale computation.
Summary
- Representation theorem and kernels
- Regularization prefers principal eigenfunctions of the kernel (induced by basis functions)
- Basis expansion: a general framework for classification models, e.g., nonlinear logistic regression, SVMs, ...