Applied Machine Learning Annalisa Marsico


Applied Machine Learning Annalisa Marsico OWL RNA Bioinformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 29 April, SoSe 2015

Support Vector Machines (SVMs) 1. One of the most widely used and successful approaches to train a classifier 2. Based on the idea of maximizing the margin as the objective function 3. Based on the idea of kernel functions

Kernel Regression

Linear Regression We wish to learn f: X → Y, where X = <X_1, ..., X_p>, Y real-valued, p = number of features. Learn f(x) = x·w = <x, w> = xᵀw, where w = argmin_w ||y − Xw||² + λ||w||²

Vectors, data points, inner products Consider f(x) = x·w = <x, w> = xᵀw, where x = [3 1] and w = [1 2]. [Figure: x and w drawn in the (x1, x2) plane with angle θ between them.] For any two vectors, their dot product (aka inner product) is equal to the product of their lengths times the cosine of the angle between them: <x, w> = Σ_{j=1..p} x_j w_j = ||x|| ||w|| cos θ
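As a quick sanity check of this identity, here is a minimal Python/NumPy sketch (my own, using the x = [3 1] and w = [1 2] from the slide): it computes the dot product coordinate-wise and again via lengths and angle.

import numpy as np

x = np.array([3.0, 1.0])
w = np.array([1.0, 2.0])

# Coordinate-wise dot product: sum_j x_j * w_j
dot = x @ w                                                # 3*1 + 1*2 = 5

# Same quantity via the geometric formula ||x|| ||w|| cos(theta)
theta = np.arctan2(w[1], w[0]) - np.arctan2(x[1], x[0])    # angle between w and x
geometric = np.linalg.norm(x) * np.linalg.norm(w) * np.cos(theta)

print(dot, geometric)                                      # both print 5.0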

Linear Regression Primal Form Learn f(x) = x·w = <x, w> = xᵀw, where w = argmin_w ||y − Xw||² + λ||w||² (regularization term). Solve by taking the derivative w.r.t. w and setting it to zero: w = (XᵀX + λI)⁻¹ Xᵀ y. So: f(x_new) = x_newᵀ w = x_newᵀ (XᵀX + λI)⁻¹ Xᵀ y
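For illustration, a minimal NumPy sketch of this primal closed-form solution; the synthetic data and the choice λ = 0.1 are my own, not from the lecture.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))                        # n training examples, p features
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
lam = 0.1                                          # regularization strength lambda

# Primal solution: w = (X^T X + lambda I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Prediction on a new point: f(x_new) = x_new^T w
x_new = rng.normal(size=p)
print(x_new @ w)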

Linear Regression Primal Form Learn f(x) = x·w = <x, w> = xᵀw, where w = argmin_w ||y − Xw||² + λ||w||². Solution: w = (XᵀX + λI)⁻¹ Xᵀ y. Interesting observation: w lies in the space spanned by the training examples (why?)

Linear Regression Dual Form Learn f(x) = x·w = <x, w> = xᵀw, where w = argmin_w ||y − Xw||² + λ||w||². Solution: w = (XᵀX + λI)⁻¹ Xᵀ y. Dual form uses the fact that w = Σ_j α_j x_j. Learn f(x) = Σ_j α_j <x_j, x>. Solution: α = (XXᵀ + λI)⁻¹ y. A lot of dot products...

Key ingredients of Dual Solution Step 1: Compute α = (K + λI)⁻¹ y, where K = XXᵀ, that is K_jk = <x_j, x_k>. Step 2: Evaluate on a new point x by g(x) = Σ_j α_j <x_j, x>. Important observation: both steps only involve inner products between input data points
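The same toy regression solved in dual form, as a sketch: only the Gram matrix K = XXᵀ and inner products with the new point are needed, and the prediction agrees with the primal one (same synthetic data and λ as in the sketch above).

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
lam = 0.1

# Step 1: alpha = (K + lambda I)^{-1} y, with K_jk = <x_j, x_k>
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Step 2: g(x) = sum_j alpha_j <x_j, x>
x_new = rng.normal(size=p)
g = alpha @ (X @ x_new)

# Agrees with the primal closed-form prediction
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(g, x_new @ w_primal)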

Kernel Functions Since the computation only involves dot products, we can substitute for all occurrences of <·,·> a kernel function k that computes: k(x, x′) = <Φ(x), Φ(x′)>. Φ is a function from the current space to a feature (higher-dimensional) space F, defined by the mapping Φ: x ↦ Φ(x) ∈ F

Kernel Functions [Figure: points x1, x2 in the original space mapped by Φ to Φ(x1), Φ(x2) in a higher-dimensional projected space.] What the kernel function k does is give us an operation in the original space which is equivalent to computing dot products in the higher-dimensional space: k(x, x′) = <Φ(x), Φ(x′)>, with k: X × X → R

Linear Regression Dual Form Learn f(x) = x·w = <x, w> = xᵀw, where w = argmin_w ||y − Xw||² + λ||w||². Solution: w = (XᵀX + λI)⁻¹ Xᵀ y. Dual form uses the fact that w = Σ_j α_j x_j. Learn f(x) = Σ_j α_j <x_j, x>. Solution: α = (XXᵀ + λI)⁻¹ y. By doing that we can gain in computational efficiency!

Example: Quadratic kernel Suppose we have data originally in 2D, but project it into 3D using Φ: x = (x_1, x_2), Φ(x) = (x_1², √2 x_1 x_2, x_2²). This converts our linear regression problem into quadratic regression! But we can use the following kernel function to calculate dot products in the projected 3D space in terms of operations in the 2D space: <Φ(x), Φ(x′)> = <x, x′>² ≡ k(x, x′). And use it to train and apply our regression function, never leaving the 2D space: f(x) = Σ_j α_j k(x_j, x), α = (K + λI)⁻¹ y, K_jk = k(x_j, x_k)
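A small numerical check of this identity (a sketch; the two test points are arbitrary): the explicit 3D feature map and the 2D kernel evaluation give the same number.

import numpy as np

def phi(x):
    # Explicit feature map into 3D: (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def quad_kernel(x, z):
    # Quadratic kernel, evaluated entirely in the original 2D space
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi(x) @ phi(z), quad_kernel(x, z))   # both give 1.0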

Implications of the kernel trick Consider, for example, computing a regression function over 1000 images represented by pixel vectors of 32 x 32 = 1024 pixels. By using the quadratic kernel we implicitly work in a roughly 1,000,000-dimensional feature space, but actually use less computation for the learning phase than we did in the original space: we invert a 1000 x 1000 kernel matrix instead of a 1024 x 1024 matrix

Some common kernels Polynomial of degree d: K(x, z) = <x, z>^d. Polynomial of degree up to d: K(x, z) = (<x, z> + c)^d. Gaussian / radial kernels (polynomials of all orders; the projected space has infinite dimensions): K(x, z) = exp(−||x − z||² / (2σ²)). Linear kernel: K(x, z) = <x, z>
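These kernels are one-liners to implement; a NumPy sketch (the default values of d, c and σ are arbitrary choices of mine):

import numpy as np

def linear_kernel(x, z):
    return x @ z

def poly_kernel(x, z, d=2, c=1.0):
    # Polynomial of degree up to d: (<x, z> + c)^d
    return (x @ z + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    # Radial basis function: exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x, z = np.array([1.0, 0.5]), np.array([0.0, 2.0])
print(linear_kernel(x, z), poly_kernel(x, z), gaussian_kernel(x, z))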

Key points about kernels Many learning tasks are framed as optimization problems. There are primal and dual formulations of these optimization problems; the dual version is framed in terms of dot products between the x's. Kernel functions k(x, z) allow calculating dot products <Φ(x), Φ(z)> without actually projecting x into Φ(x). This leads to major efficiencies and the ability to use very high-dimensional (virtual) feature spaces. We can learn nonlinear functions

Kernel-Based Classifiers

Linear Classifier Which line is better?

Pick the one with the largest margin!

Parametrizing the decision boundary w·x + b > 0 on one side, w·x + b < 0 on the other. Labels y ∈ {−1, 1} (class)

Maximizing the margin Margin = distance of the closest examples from the decision line / hyperplane: margin = γ = a/||w||. Labels y ∈ {−1, 1} (class)

Maximizing the margin Margin = distance of the closest examples from the decision line / hyperplane: margin = γ = a/||w||. Labels y ∈ {−1, 1} (class). Maximizing the margin corresponds to minimizing ||w||!

SVM: Maximize the margin Margin = γ = a/||w||. max_{w,b} γ = a/||w|| s.t. y_j (w·x_j + b) ≥ a for all j. Note: a is arbitrary (we can normalize the equations by a). Labels y ∈ {−1, 1} (class)

Support Vector Machine (primal form) max_{w,b} γ = 1/||w|| s.t. y_j (w·x_j + b) ≥ 1. Primal form: min_{w,b} w·w s.t. y_j (w·x_j + b) ≥ 1. Solve efficiently by quadratic programming (QP): well-studied solution algorithms. Non-kernelized version of SVMs!

SVMs (from primal form to dual form) With kernel regression we had to go from the primal form of our optimization problem to the dual version of it, expressed in a way that we only need to compute dot products. We do the same for SVMs: all things which apply to kernel regression apply to SVMs, but with a different objective function, the margin

SVMs (from primal form to dual form) Primal form: solve for w, b: min_{w,b} w·w s.t. y_j (w·x_j + b) ≥ 1 for all j training examples. Classification test for a new x: w·x + b > 0. Dual form: solve for α_1, ..., α_n: max_α Σ_j α_j − ½ Σ_j Σ_k α_j α_k y_j y_k <x_j, x_k> s.t. α_j ≥ 0 for all j training examples and Σ_j α_j y_j = 0. Classification test for a new x: Σ_j α_j y_j <x_j, x> + b > 0
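In practice the QP is rarely coded by hand; a hedged sketch with scikit-learn's SVC (the toy data are mine): a very large C approximates the hard-margin problem, and the fitted model exposes the support vectors and the products α_j y_j.

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (two clusters)
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)            # very large C ~ hard margin
clf.fit(X, y)

print(clf.support_vectors_)                  # the x_j with non-zero alpha_j
print(clf.dual_coef_)                        # the corresponding alpha_j * y_j
print(clf.decision_function([[4.0, 4.0]]))   # sign gives the predicted class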

Support Vectors Σ_j α_j y_j <x_j, x> + b > 0 on one side, Σ_j α_j y_j <x_j, x> + b < 0 on the other (equivalently w·x + b > 0 / w·x + b < 0). The linear hyperplane is defined by the support vectors: moving other points a little doesn't change the decision boundary, and we only need to store the support vectors to predict labels of new points. Hard-margin Support Vector Machine

Kernel SVMs Because the dual form only depends on dot products, we can apply the kernel trick to work in a (virtual) projected space Φ: X → F. Primal form: solve for w, b in the projected higher-dimensional space: min_{w,b} w·w s.t. y_j (w·Φ(x_j) + b) ≥ 1 for all j training examples. Classification test for a new x: w·Φ(x) + b > 0. Dual form: solve for α_1, ..., α_n: max_α Σ_j α_j − ½ Σ_j Σ_k α_j α_k y_j y_k k(x_j, x_k) s.t. α_j ≥ 0 for all j training examples and Σ_j α_j y_j = 0. Classification test for a new x: Σ_j α_j y_j k(x_j, x) + b > 0

SVM decision surface using Gaussian Kernel f(x) = w·Φ(x) + b = b + Σ_j α_j y_j k(x, x_j) = b + Σ_j α_j y_j exp(−||x − x_j||² / (2σ²)). Circled points are the support vectors: training examples with non-zero α_j. Points are plotted in the original 2D space; contour lines correspond to f(x)
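A sketch of the same construction with scikit-learn's RBF kernel (synthetic 2D data of my own; note scikit-learn parametrizes the kernel as exp(−γ||x − x′||²), so γ = 1/(2σ²)). The decision value can be recomputed by hand from the support vectors, their α_j y_j and the intercept b:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # circular class boundary

sigma = 1.0
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=10.0).fit(X, y)

# f(x) = b + sum_j alpha_j y_j exp(-||x - x_j||^2 / (2 sigma^2)), sum over support vectors
x = np.array([0.5, 0.5])
k = np.exp(-np.sum((clf.support_vectors_ - x) ** 2, axis=1) / (2 * sigma ** 2))
f_manual = clf.intercept_[0] + clf.dual_coef_[0] @ k

print(f_manual, clf.decision_function([x])[0])   # the two values agree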

SVMs with Soft Margin Allow errors in classification: min_{w,b} w·w + C·(# mistakes) s.t. y_j (w·Φ(x_j) + b) ≥ 1 for all j training examples. Maximize the margin and minimize the number of mistakes on the training data; C is a trade-off parameter. Not a QP, and treats all errors equally

What if the data are not linearly separable? Allow errors in classification: min_{w,b,ζ} w·w + C Σ_j ζ_j s.t. y_j (w·Φ(x_j) + b) ≥ 1 − ζ_j for all j training examples. ζ_j = slack variable (> 1 if x_j is misclassified). Pay a linear penalty for mistakes; C is a trade-off parameter. Still a QP
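A short sketch of the C trade-off on overlapping (not linearly separable) toy data of my own: small C tolerates more slack and gives a wider margin with more support vectors, large C penalizes mistakes heavily.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=-1.0, size=(30, 2)),
               rng.normal(loc=1.0, size=(30, 2))])
y = np.array([-1] * 30 + [1] * 30)       # overlapping classes

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_[0])   # geometric margin 1/||w||
    print(f"C={C:>6}: margin={margin:.3f}, #support vectors={len(clf.support_)}")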

Variable selection with SVMs Forward Selection: all features are tried separately and the one performing best, f1, is retained. Then, all remaining features are added in turn and the best pair (f1, f2) is retained. Then all the remaining features are added in turn and the best trio (f1, f2, f3) is retained. And so on, until the performance stops increasing or until all features have been exhausted. Pseudocode:
F = full set of features, S = selected features = {}, p = current performance = 0, oldp = previous performance = −1
while F ≠ {} and p > oldp:
    for each feature f in F:
        for each of the k folds:   # cross-validation
            split D into T (training) and V (validation)
            train a model M on T using features S ∪ {f}
            compute the performance of M on V
        compute the average performance over the k folds
    choose the feature f* that leads to the best performance p*
    if p* > p: oldp = p, p = p*, S = S ∪ {f*}, F = F \ {f*}; else stop
Output features in order of importance
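A runnable sketch of this forward-selection loop (an illustration under my own choices, not the course code): a linear SVM as the model M, 5-fold cross-validation as the performance estimate, and scikit-learn's breast cancer dataset as D.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))

F = list(range(X.shape[1]))   # full set of features (column indices)
S = []                        # selected features, in order of importance
p, oldp = 0.0, -1.0           # current and previous performance

while F and p > oldp:
    # Score each remaining feature added to the current selection S (5-fold CV)
    scores = {f: cross_val_score(model, X[:, S + [f]], y, cv=5).mean() for f in F}
    best_f, best_p = max(scores.items(), key=lambda kv: kv[1])
    if best_p > p:            # keep the feature only if performance improves
        oldp, p = p, best_p
        S.append(best_f)
        F.remove(best_f)
    else:
        break

print("selected features:", S, "cross-validated accuracy:", round(p, 3))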

Variable selection with SVMs Recursive feature elimination: at first all features are used to train an SVM and the margin γ is computed. Then, for each feature f, a new margin γ_f is computed using the feature set F′ = F \ {f}. The feature f leading to the smallest difference between γ and γ_f is considered least valuable and is discarded. The process is repeated until the performance starts degrading. Pseudocode:
F = full set of features, p = current performance = 0, p* = best performance = 0, t = threshold on p
while F ≠ {}:
    train an SVM on D (training set), using cross-validation to tune parameters
    p = the performance obtained with the best set of parameters
    if p > p* − t:
        p* = p, oldF = F
        for each feature f in F: compute the difference in performance when f is removed
        discard the feature that leads to the smallest difference
    else:
        F = oldF; stop
Output the features that are left in F
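scikit-learn ships a related procedure, RFE, which retrains a linear SVM and eliminates the feature with the smallest weight magnitude at each step; this is a common proxy for the margin-difference criterion described above rather than the exact recipe. A hedged sketch:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Eliminate one feature per iteration until 10 remain, retraining the linear SVM each time
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=1)
selector.fit(X, y)

print("kept features:", [i for i, keep in enumerate(selector.support_) if keep])
print("ranking (1 = kept, larger = eliminated earlier):", selector.ranking_)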

SVM Summary Objective: maximize the margin between the decision surface and the data. Primal and dual formulations: the dual represents the classifier decision in terms of the support vectors. Kernel SVMs: learn a linear decision boundary in a high-dimensional space while working in the original low-dimensional space. Handling noisy data: soft margin with slack variables, again with primal and dual forms. SVM algorithm: quadratic programming optimization with a single global minimum

Applications of SVMs in Bioinformatics Gene function prediction (from microarray data, RNA-seq) Cancer tissue classification Remote homology detection in proteins (structure & sequence features) Translation initiation site recognition in DNA (from distal sequences) Promoter prediction (from sequence alone or other genomic features) Protein localization Virtual screening of small molecules