Applied Machine Learning
Annalisa Marsico
OWL RNA Bioinformatics group, Max Planck Institute for Molecular Genetics
Free University of Berlin
29 April 2015 (SoSe 2015)
Support Vector Machines (SVMs)
1. One of the most widely used and successful approaches to training a classifier
2. Based on the idea of maximizing the margin as the objective function
3. Based on the idea of kernel functions
Kernel Regression
Linear Regression
We wish to learn f: X → Y, where X = <X_1, ..., X_p>, Y is real-valued, and p = number of features
Learn f(x) = x · w = <x, w> = x^T w
where w* = argmin_w Σ_j (y_j - x_j · w)^2 + λ||w||^2
Vectors, data points, inner products
Consider f(x) = x · w = <x, w> = x^T w, where x = [3 1] and w = [1 2]
[Figure: the vectors x and w plotted in the (x_1, x_2) plane, with angle θ between them]
For any two vectors, their dot product (aka inner product) equals the product of their lengths times the cosine of the angle between them:
<x, w> = Σ_{j=1}^{p} x_j w_j = ||x|| ||w|| cos θ
Linear Regression – Primal Form
Learn f(x) = x · w = <x, w> = x^T w
where w* = argmin_w ||y - Xw||^2 + λ||w||^2   (the second term is the regularization term)
Solve by taking the derivative w.r.t. w and setting it to zero:
w* = (X^T X + λI)^{-1} X^T y
So: f(x_new) = x_new^T w* = x_new^T (X^T X + λI)^{-1} X^T y
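To make the closed-form solution concrete, here is a minimal NumPy sketch (the function name ridge_primal and the toy data are our own, not from the slides):

    import numpy as np

    def ridge_primal(X, y, lam):
        # closed-form primal ridge solution: w* = (X^T X + lam*I)^{-1} X^T y
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

    # toy usage: n = 5 examples, p = 2 features
    X = np.array([[3., 1.], [1., 2.], [2., 2.], [0., 1.], [1., 0.]])
    y = X @ np.array([1., 2.]) + 0.1 * np.random.randn(5)  # noisy linear target
    w = ridge_primal(X, y, lam=0.1)
    f_new = np.array([2., 1.]) @ w  # prediction f(x_new) = x_new^T w*

Using np.linalg.solve instead of explicitly inverting X^T X + λI is the standard, numerically safer way to evaluate this formula.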
Linear Regression – Primal Form
Learn f(x) = x · w = <x, w> = x^T w
where w* = argmin_w ||y - Xw||^2 + λ||w||^2
Solution: w* = (X^T X + λI)^{-1} X^T y
Interesting observation: w lies in the space spanned by the training examples (why?)
Linear Regression – Dual Form
Learn f(x) = x · w = <x, w> = x^T w
where w* = argmin_w ||y - Xw||^2 + λ||w||^2
Solution: w* = (X^T X + λI)^{-1} X^T y
Dual form – use the fact that w = Σ_j α_j x_j:
Learn f(x) = Σ_j α_j <x_j, x>
Solution: α = (XX^T + λI)^{-1} y
A lot of dot products...
Key ingredients of the Dual Solution
Step 1: Compute α = (K + λI)^{-1} y, where K = XX^T, that is K_ij = <x_i, x_j>
Step 2: Evaluate on a new point x by g(x) = Σ_j α_j <x_j, x>
Important observation: both steps only involve inner products between input data points
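A minimal NumPy sketch of the two steps, here with the plain linear kernel K = XX^T (the function names are our own):

    import numpy as np

    def dual_fit(X, y, lam):
        # Step 1: alpha = (K + lam*I)^{-1} y, with K_ij = <x_i, x_j>
        K = X @ X.T
        n = X.shape[0]
        return np.linalg.solve(K + lam * np.eye(n), y)

    def dual_predict(X_train, alpha, x_new):
        # Step 2: g(x) = sum_j alpha_j <x_j, x> -- only inner products needed
        return (X_train @ x_new) @ alpha

This gives the same predictions as the primal closed form, but it only ever touches the data through dot products, which is what makes the kernel substitution on the next slides possible.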
Kernel Functions
Since the computation only involves dot products, we can substitute for all occurrences of <·,·> a kernel function k that computes:
k(x, x') = <Φ(x), Φ(x')>
Φ is a function from the current space to a (higher-dimensional) feature space F, defined by the mapping x → Φ(x) ∈ F
Kernel Functions
[Figure: points x_1, x_2 in the original space are mapped by Φ to Φ(x_1), Φ(x_2) in a projected, higher-dimensional space with axes u_1, u_2]
What the kernel function k does is give us some other operation (in the original space) which is equivalent to computing dot products in the higher-dimensional space:
k(x, x') = <Φ(x), Φ(x')>,  k: X × X → R
Linear Regression – Dual Form
Learn f(x) = x · w = <x, w> = x^T w
where w* = argmin_w ||y - Xw||^2 + λ||w||^2
Solution: w* = (X^T X + λI)^{-1} X^T y
Dual form – use the fact that w = Σ_j α_j x_j:
Learn f(x) = Σ_j α_j <x_j, x>
Solution: α = (XX^T + λI)^{-1} y
By doing this we gain in computational efficiency!
Example: Quadratic kernel
Suppose we have data originally in 2D, but project it into 3D using Φ:
x = (x_1, x_2)  →  Φ(x) = (x_1^2, √2 x_1 x_2, x_2^2)
This converts our linear regression problem into quadratic regression!
But we can use the following kernel function to calculate dot products in the projected 3D space in terms of operations in the 2D space:
<Φ(x), Φ(x')> = <x, x'>^2 = k(x, x')
And use it to train and apply our regression function, never leaving the 2D space:
f(x) = Σ_j α_j k(x_j, x),  α = (K + λI)^{-1} y,  K_ij = k(x_i, x_j)
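A quick numeric check of the identity <Φ(x), Φ(x')> = <x, x'>^2 (a sketch; the helper names are our own):

    import numpy as np

    def phi(x):
        # explicit 2D -> 3D quadratic feature map
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    def k_quad(x, z):
        # quadratic kernel: the same value, computed without leaving 2D
        return (x @ z) ** 2

    x, z = np.array([3., 1.]), np.array([1., 2.])
    assert np.isclose(phi(x) @ phi(z), k_quad(x, z))  # both equal 25.0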
Implications of the kernel trick
Consider for example computing a regression function over 1000 images, each represented as a pixel vector of 32 × 32 = 1024 pixels.
By using the quadratic kernel we implicitly work in a roughly 1,000,000-dimensional feature space,
but actually use less computation for the learning phase than we did in the original space:
we invert a 1000 × 1000 matrix (one entry per pair of training examples) instead of a 1024 × 1024 matrix.
Some common kernels
- Polynomial of degree d: K(x, z) = <x, z>^d
- Polynomial of degree up to d: K(x, z) = (<x, z> + c)^d
- Gaussian / radial kernels (polynomials of all orders – the projected space has infinite dimensions): K(x, z) = exp(-||x - z||^2 / (2σ^2))
- Linear kernel: K(x, z) = <x, z>
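Each of these is a one-liner to implement; a minimal sketch (function names our own):

    import numpy as np

    def k_poly(x, z, d, c=0.0):
        # degree-d polynomial; c > 0 gives all degrees up to d
        return (x @ z + c) ** d

    def k_gauss(x, z, sigma):
        # Gaussian / radial (RBF) kernel
        return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

    def k_linear(x, z):
        # plain dot product: recovers the non-kernelized method
        return x @ z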
Key points about kernels
- Many learning tasks are framed as optimization problems
- These have primal and dual formulations
- The dual version is framed in terms of dot products between the x's
- Kernel functions k(x, z) allow calculating dot products <Φ(x), Φ(z)> without actually projecting x into Φ(x)
- This leads to major efficiencies, and to the ability to use very high-dimensional (virtual) feature spaces
- We can learn nonlinear functions
Kernel-Based Classifiers
Linear Classifier
[Figure: two classes of points, with several candidate separating lines]
Which line is better?
Pick the one with the largest margin!
Parametrizing the decision boundary
w · x + b > 0 on one side, w · x + b < 0 on the other
Labels y_j ∈ {-1, +1} (the class)
Maximizing the margin
[Figure: separating hyperplane with margin γ on either side]
Margin = distance of the closest examples from the decision line / hyperplane
Margin = γ = a / ||w||
Labels y_j ∈ {-1, +1}
Maximizing the margin corresponds to minimizing ||w||!
SVM: Maximize the margin
max_w γ = a / ||w||
s.t. (w · x_j + b) y_j ≥ a for all training examples j
Note: a is arbitrary (we can normalize the equations by a)
Labels y_j ∈ {-1, +1}
Support Vector Machine (primal form)
max_w γ = 1 / ||w||
s.t. (w · x_j + b) y_j ≥ 1
Primal form:
min_w w · w
s.t. (w · x_j + b) y_j ≥ 1
Solve efficiently by quadratic programming (QP) – well-studied solution algorithms
This is the non-kernelized version of SVMs!
SVMs (from primal form to dual form)
With kernel regression we had to go from the primal form of our optimization problem to its dual version, expressed in a way that only requires computing dot products.
We do the same for SVMs.
Everything that applies to kernel regression applies to SVMs, but with a different objective function: the margin.
SVMs (from primal form to dual form)
Primal form: solve for w, b
min_w w · w
s.t. (w · x_j + b) y_j ≥ 1 for all j training examples
Classification test for a new x: w · x + b > 0
Dual form: solve for α_1, ..., α_n
max_α Σ_j α_j - ½ Σ_{j,k} α_j α_k y_j y_k <x_j, x_k>
s.t. α_j ≥ 0 for all j training examples, and Σ_j α_j y_j = 0
Classification test for a new x: Σ_j α_j y_j <x_j, x> + b ≥ 0
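In practice the QP is handed to a solver; a minimal scikit-learn sketch of a linear SVM (the toy data are our own; a very large C approximates the hard margin discussed here):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2., 2.], [1., 3.], [2., 3.], [-1., -1.], [-2., 0.], [0., -2.]])
    y = np.array([1, 1, 1, -1, -1, -1])

    clf = SVC(kernel='linear', C=1e6)  # large C ~ hard-margin SVM
    clf.fit(X, y)
    print(clf.support_vectors_)      # the x_j with alpha_j > 0
    print(clf.dual_coef_)            # alpha_j * y_j for the support vectors
    print(clf.predict([[1., 1.]]))   # sign of sum_j alpha_j y_j <x_j, x> + b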
Support Vectors
[Figure: separating hyperplane with margin γ on either side; the points lying on the margin are the support vectors]
Σ_j α_j y_j <x_j, x> + b > 0  corresponds to  w · x + b > 0
Σ_j α_j y_j <x_j, x> + b < 0  corresponds to  w · x + b < 0
The linear hyperplane is defined by the support vectors:
- moving other points a little doesn't change the decision boundary
- we only need to store the support vectors to predict labels of new points
This is the hard-margin Support Vector Machine.
Kernel SVMs
Because the dual form only depends on dot products, we can apply the kernel trick to work in a (virtual) projected space Φ: X → F
Primal form: solve for w, b in the projected higher-dimensional space
min_w w · w
s.t. (w · Φ(x_j) + b) y_j ≥ 1 for all j training examples
Classification test for a new x: w · Φ(x) + b > 0
Dual form: solve for α_1, ..., α_n
max_α Σ_j α_j - ½ Σ_{j,k} α_j α_k y_j y_k k(x_j, x_k)
s.t. α_j ≥ 0 for all j training examples, and Σ_j α_j y_j = 0
Classification test for a new x: Σ_j α_j y_j k(x_j, x) + b ≥ 0
SVM decision surface using a Gaussian kernel
[Figure: training points plotted in the original 2D (x_1, x_2) space; circled points are the support vectors – the training examples with nonzero α_j; contour lines correspond to f(x)]
f(x) = w · Φ(x) + b = b + Σ_j α_j y_j k(x, x_j),  with  k(x, x_j) = exp(-||x - x_j||^2 / (2σ^2))
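A minimal sketch of fitting such a surface with scikit-learn, where the gamma parameter plays the role of 1/(2σ^2) (the toy data, a circular class boundary, are our own):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)  # nonlinear boundary

    clf = SVC(kernel='rbf', gamma=0.5, C=1.0)
    clf.fit(X, y)
    print(len(clf.support_))              # number of support vectors
    print(clf.decision_function(X[:3]))   # f(x) for the first three points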
SVMs with Soft Margin
Allow errors in classification:
min_w w · w + C · (# mistakes)
s.t. (w · Φ(x_j) + b) y_j ≥ 1 for all correctly classified training examples j
Maximize the margin and minimize the number of mistakes on the training data; C is a trade-off parameter
Problem: this is not a QP, and it treats all errors equally
What if the data are not linearly separable?
Allow errors in classification:
min_w w · w + C Σ_j ζ_j
s.t. (w · Φ(x_j) + b) y_j ≥ 1 - ζ_j for all j training examples
ζ_j = slack variable (> 1 if x_j is misclassified)
We pay a linear penalty for mistakes; C is a trade-off parameter
This is still a QP.
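The trade-off parameter C is exposed directly in most SVM libraries; a small sketch of how it changes the fit on noisy, non-separable data (data generation is our own):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 2))
    y = np.where(X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=100) > 0, 1, -1)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel='linear', C=C).fit(X, y)
        # small C: wide margin, many support vectors, more training errors
        # large C: narrow margin, fewer training errors tolerated
        print(C, len(clf.support_), clf.score(X, y))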
Variable selection with SVMs
Forward selection: all features are tried separately and the best-performing one, f_1, is retained. Then all remaining features are added in turn and the best pair {f_1, f_2} is retained. Then all remaining features are added in turn and the best trio {f_1, f_2, f_3} is retained, and so on, until the performance stops increasing or all features have been exhausted. (A runnable sketch follows the pseudocode.)
Pseudocode:
F = full set of features; S = selected features = {}; p* = best performance so far = 0
while F ≠ {}:
    for each feature f in F:
        for k in [1..K] folds:   # cross-validation
            split D into T (training) and V (validation)
            train a model M on T using features S ∪ {f}
            compute the performance of M on V
        compute the average performance over the K folds
    choose the feature f* that leads to the best average performance p'
    if p' > p*:  p* = p';  S = S ∪ {f*};  F = F \ {f*}
    else: stop
output the features of S in order of inclusion (importance)
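A runnable sketch of the same loop with scikit-learn (variable names are our own; any classifier with a fit/score interface would do in place of the linear SVM):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=8, random_state=0)

    remaining = list(range(X.shape[1]))   # F: feature indices still available
    selected = []                         # S: features chosen so far
    best_perf = 0.0                       # p*

    while remaining:
        # try adding each remaining feature and cross-validate
        scores = {f: cross_val_score(SVC(kernel='linear'),
                                     X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_star = max(scores, key=scores.get)
        if scores[f_star] > best_perf:
            best_perf = scores[f_star]
            selected.append(f_star)
            remaining.remove(f_star)
        else:
            break

    print(selected, best_perf)  # features in order of inclusion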
Variable selection with SVMs
Recursive feature elimination: at first, all features are used to train an SVM and the margin γ is computed. Then, for each feature f, a new margin γ' is computed using the feature set F' = F \ {f}. The feature f leading to the smallest difference between γ and γ' is considered least valuable and is discarded. The process is repeated until the performance starts degrading. (A scikit-learn sketch follows the pseudocode.)
Pseudocode:
F = full set of features; p* = best performance so far = 0; t = threshold on the tolerated performance drop
while F ≠ {}:
    train an SVM on D (training set), using cross-validation to tune the parameters
    p = performance obtained with the best set of parameters
    if the performance has not degraded by more than t:
        p* = max(p, p*);  oldF = F
        for each feature f in F: compute the change in margin when f is removed
        discard the feature leading to the smallest change: F = F \ {f}
    else:
        F = oldF; stop
output the features that are left in F
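scikit-learn ships this idea (repeatedly ranking features by the weights of a linear SVM and dropping the weakest) as sklearn.feature_selection.RFE; a minimal sketch on our own toy data:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=8, random_state=0)

    # a linear kernel is needed: RFE ranks features by the coefficients w
    rfe = RFE(estimator=SVC(kernel='linear'), n_features_to_select=3)
    rfe.fit(X, y)
    print(rfe.support_)   # boolean mask of the retained features
    print(rfe.ranking_)   # 1 = retained; higher = eliminated earlier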
SVM Summary
- Objective: maximize the margin between the decision surface and the data
- Primal and dual formulations: the dual represents the classifier decision in terms of the support vectors
- Kernel SVMs: learn a linear decision boundary in a high-dimensional space while working in the original low-dimensional space
- Handling noisy data: soft margin with slack variables, again in primal and dual forms
- SVM algorithm: quadratic programming optimization – a single global minimum
Applications of SVMs in Bioinformatics
- Gene function prediction (from microarray data, RNA-seq)
- Cancer tissue classification
- Remote homology detection in proteins (structure & sequence features)
- Translation initiation site recognition in DNA (from distal sequences)
- Promoter prediction (from sequence alone or other genomic features)
- Protein localization
- Virtual screening of small molecules