Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM


1 Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM. Jean-Philippe Vert, Bioinformatics Center, Kyoto University, Japan. Jean-Philippe.Vert@mines.org. Human Genome Center, University of Tokyo, Japan, July 17-19, 2002.

2 3-day outline. Day 1: Introduction to SVM. Day 2: Applications in bioinformatics. Day 3: Advanced topics and current research.

3 Today's outline. 1. SVM: a brief overview (FAQ) 2. Simplest SVM: linear classifier for separable data 3. More useful SVM: linear classifiers for general data 4. Even more useful SVM: non-linear classifiers for general data 5. Remarks

4 Part 1 SVM: a brief overview (FAQ)

5 What is a SVM? A family of learning algorithms for the classification of objects into two classes (it also works for regression). Input: a training set $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ of objects $x_i \in \mathcal{X}$ and their known classes $y_i \in \{-1, +1\}$. Output: a classifier $f : \mathcal{X} \to \{-1, +1\}$ which predicts the class $f(x)$ for any (new) object $x \in \mathcal{X}$.
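To make this input/output contract concrete, here is a minimal sketch using scikit-learn (an assumption of this example; the toy vectors and labels are made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Toy training set S = {(x_i, y_i)}: four objects in R^2 with classes in {-1, +1}
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.5]])
y_train = np.array([-1, -1, +1, +1])

# Learn a classifier f : X -> {-1, +1}
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# Predict the class f(x) of a new object x
print(clf.predict([[2.5, 3.0]]))   # e.g. array([1])
```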

6 Examples of classification tasks (more tomorrow). Optical character recognition: x is an image, y a character. Text classification: x is a text, y is a category (topic, spam / non-spam...). Medical diagnosis: x is a set of features (age, sex, blood type, genome...), y indicates the risk. Protein secondary structure prediction: x is a string, y is a secondary structure.

7 Pattern recognition example (figure)

8 Are there other methods for classification? Bayesian classifier (based on maximum a posteriori probability) Fisher linear discriminant Neural networks Expert systems (rule-based) Decision tree...

9 Why is it gaining popularity? Good performance in real-world applications. Computational efficiency (no local minima, sparse representation...). Robust in high dimension (e.g., images, microarray data, texts). Sound theoretical foundations.

10 Why is it so efficient? Still a research subject. It always tries to classify objects with large confidence, which prevents overfitting. No strong hypothesis on the data generation process (contrary to Bayesian approaches).

11 What is overfitting? There is always a trade-off between good classification of the training set and good classification of future objects (generalization performance). Overfitting means fitting the training data too closely, which degrades the generalization performance. Very important in large dimensions, or with complex non-linear classifiers.

12 Overfitting example (figure)

13 What is Vapnik's Statistical Learning Theory? The mathematical foundation of SVM. It gives conditions for a learning algorithm to generalize well: the capacity of the set of classifiers which can be learned must be controlled.

14 Why is it relevant for bioinformatics? Classification problems are very common (structure, function, localization prediction; analysis of microarray data; ...). Small training sets in high dimension are common. Extensions of SVM to non-vector objects (strings, graphs...) are natural.

15 Part 2 Simplest SVM: Linear SVM for separable training sets

16 Framework. We suppose that the objects are finite-dimensional real vectors: $\mathcal{X} = \mathbb{R}^m$, and an object is $\vec{x} = (x_1, \ldots, x_m)$. Each $x_i$ can for example be a feature of a more general object. Example: a protein sequence can be converted to a 20-dimensional vector by taking its amino-acid composition.
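As a sketch of this feature construction (plain Python; the helper name amino_acid_composition and the example peptide are ours, for illustration):

```python
# The 20 standard amino acids, in a fixed order
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Map a protein sequence to its 20-dimensional amino-acid composition vector."""
    sequence = sequence.upper()
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1           # guard against an empty sequence
    return [c / total for c in counts]

# Example with a made-up short peptide
print(amino_acid_composition("MKTAYIAKQR"))
```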

17 Vectors and inner product (figure: two vectors $\vec{x}$ and $\vec{x}'$ in the plane). Inner product: $\vec{x} \cdot \vec{x}' = x_1 x'_1 + x_2 x'_2 \,(+ \cdots + x_m x'_m) = \|\vec{x}\| \, \|\vec{x}'\| \cos(\vec{x}, \vec{x}')$.

18 Linear classifier. Hyperplane: $\vec{w}\cdot\vec{x} + b = 0$. Half space $\vec{w}\cdot\vec{x} + b < 0$: class $-1$. Half space $\vec{w}\cdot\vec{x} + b > 0$: class $+1$. Classification is based on the sign of the decision function: $f_{\vec{w},b}(\vec{x}) = \vec{w}\cdot\vec{x} + b$.
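A minimal sketch of this decision rule (NumPy; the weight vector and offset are arbitrary values for illustration):

```python
import numpy as np

def linear_classify(w, b, x):
    """Return the class +1 or -1 given by the sign of f_{w,b}(x) = w.x + b."""
    return 1 if np.dot(w, x) + b > 0 else -1

w = np.array([1.0, -2.0])   # normal vector of the hyperplane (illustrative)
b = 0.5                      # offset (illustrative)
print(linear_classify(w, b, np.array([3.0, 1.0])))   # w.x + b = 1.5 > 0, so +1
```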

19 Linearly separable training set (figure: two classes of points separated by the hyperplane $\vec{w}\cdot\vec{x}+b=0$, with the half spaces $\vec{w}\cdot\vec{x}+b > 0$ and $\vec{w}\cdot\vec{x}+b < 0$).

20 Which one is the best? (figure)

21 Vapnik's answer: LARGEST MARGIN (figure: the separating hyperplane with margin $\gamma$ on each side).

22 How to find the optimal hyperplane? For a given linear classifier $f_{\vec{w},b}$, consider the tube defined by the values $-1$ and $+1$ of the decision function (figure: the hyperplanes $\vec{w}\cdot\vec{x}+b = -1$, $\vec{w}\cdot\vec{x}+b = 0$ and $\vec{w}\cdot\vec{x}+b = +1$, with points $\vec{x}_1$ and $\vec{x}_2$ on the first two, and the outer regions $\vec{w}\cdot\vec{x}+b > +1$ and $\vec{w}\cdot\vec{x}+b < -1$).

23 The width of the tube is $1/\|\vec{w}\|$. Indeed, the points $\vec{x}_1$ and $\vec{x}_2$ (with $\vec{x}_2 - \vec{x}_1$ parallel to $\vec{w}$) satisfy: $\vec{w}\cdot\vec{x}_1 + b = 0$ and $\vec{w}\cdot\vec{x}_2 + b = 1$. By subtracting we get $\vec{w}\cdot(\vec{x}_2 - \vec{x}_1) = 1$, and therefore: $\gamma = \|\vec{x}_2 - \vec{x}_1\| = \frac{1}{\|\vec{w}\|}$.

24 All training points should be on the right side of the tube. For positive examples ($y_i = 1$) this means: $\vec{w}\cdot\vec{x}_i + b \geq 1$. For negative examples ($y_i = -1$) this means: $\vec{w}\cdot\vec{x}_i + b \leq -1$. Both cases are summarized as follows: $\forall i = 1, \ldots, N, \quad y_i(\vec{w}\cdot\vec{x}_i + b) \geq 1$.

25 Finding the optimal hyperplane. The optimal hyperplane is defined by the pair $(\vec{w}, b)$ which solves the following problem: minimize $\|\vec{w}\|^2$ under the constraints: $\forall i = 1, \ldots, N, \quad y_i(\vec{w}\cdot\vec{x}_i + b) - 1 \geq 0$. This is a classical quadratic program.

26 How to find the minimum of a convex function? If $h(u_1, \ldots, u_n)$ is a convex and differentiable function of $n$ variables, then $\vec{u}^*$ is a minimum if and only if: $\nabla h(\vec{u}^*) = \left(\frac{\partial h}{\partial u_1}(\vec{u}^*), \ldots, \frac{\partial h}{\partial u_n}(\vec{u}^*)\right) = \vec{0}$. (figure: a convex function $h(u)$ with its minimum at $u^*$)

27 How to find the minimum of a convex function with linear constraints? Suppose that we want the minimum of $h(\vec{u})$ under the constraints: $g_i(\vec{u}) \geq 0$, $i = 1, \ldots, N$, where each function $g_i(\vec{u})$ is affine. We introduce one variable $\alpha_i$ for each constraint and consider the Lagrangian: $L(\vec{u}, \vec{\alpha}) = h(\vec{u}) - \sum_{i=1}^{N} \alpha_i g_i(\vec{u})$.

28 Lagrangian method (ctd.) For each $\vec{\alpha}$ we can look for the $\vec{u}_{\vec{\alpha}}$ which minimizes $L(\vec{u}, \vec{\alpha})$ (with no constraint), and denote the dual function: $\bar{L}(\vec{\alpha}) = \min_{\vec{u}} L(\vec{u}, \vec{\alpha})$. The dual variable $\vec{\alpha}^*$ which maximizes $\bar{L}(\vec{\alpha})$ gives the solution of the primal minimization problem with constraints: $\vec{u}^* = \vec{u}_{\vec{\alpha}^*}$.

29 Application to the optimal hyperplane. In order to minimize: $\frac{1}{2}\|\vec{w}\|^2$ under the constraints: $\forall i = 1, \ldots, N, \quad y_i(\vec{w}\cdot\vec{x}_i + b) - 1 \geq 0$, we introduce one dual variable $\alpha_i$ for each constraint, i.e., for each training point. The Lagrangian is: $L(\vec{w}, b, \vec{\alpha}) = \frac{1}{2}\|\vec{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left( y_i(\vec{w}\cdot\vec{x}_i + b) - 1 \right)$.
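For completeness, the step implicit between this slide and the next: setting the partial derivatives of the Lagrangian with respect to $\vec{w}$ and $b$ to zero gives
\[
\frac{\partial L}{\partial \vec{w}} = \vec{w} - \sum_{i=1}^{N} \alpha_i y_i \vec{x}_i = \vec{0}
\quad\Longrightarrow\quad
\vec{w} = \sum_{i=1}^{N} \alpha_i y_i \vec{x}_i,
\qquad
\frac{\partial L}{\partial b} = -\sum_{i=1}^{N} \alpha_i y_i = 0,
\]
and substituting these back into $L(\vec{w}, b, \vec{\alpha})$ yields the dual function $\bar{L}(\vec{\alpha})$ of the next slide.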

30 Solving the dual problem. The dual problem is to find the $\vec{\alpha}^*$ which maximizes $\bar{L}(\vec{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, \vec{x}_i\cdot\vec{x}_j$, under the (simple) constraints $\alpha_i \geq 0$ (for $i = 1, \ldots, N$) and $\sum_{i=1}^{N} \alpha_i y_i = 0$. $\vec{\alpha}^*$ can easily be found using classical optimization software.
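As a sketch of what solving this with "classical optimization software" can look like, the dual can be handed to a generic quadratic-program solver. This assumes the third-party cvxopt package, a NumPy array X of training vectors and labels y in {-1, +1}; the function name is ours:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_hard_margin_dual(X, y):
    """Maximize the SVM dual by minimizing its negative with a generic QP solver (sketch)."""
    N = X.shape[0]
    K = X @ X.T                                 # Gram matrix of inner products x_i . x_j
    P = matrix(np.outer(y, y) * K)              # P_ij = y_i y_j x_i . x_j
    q = matrix(-np.ones(N))                     # minimize -sum_i alpha_i
    G = matrix(-np.eye(N))                      # -alpha_i <= 0, i.e. alpha_i >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))  # equality constraint sum_i alpha_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])                   # the optimal alpha*
```

The returned $\vec{\alpha}^*$ is what the next slide uses to recover the optimal hyperplane.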

31 Recovering the optimal hyperplane. Once $\vec{\alpha}^*$ is found, we recover the pair $(\vec{w}^*, b^*)$ corresponding to the optimal hyperplane. $\vec{w}^*$ is given by: $\vec{w}^* = \sum_{i=1}^{N} \alpha_i^* y_i \vec{x}_i$, and the decision function is therefore: $f^*(\vec{x}) = \vec{w}^*\cdot\vec{x} + b^* = \sum_{i=1}^{N} \alpha_i^* y_i \, \vec{x}_i\cdot\vec{x} + b^*$.

32 Interpretation: support vectors (figure: points with $\alpha = 0$ lie away from the margin; the support vectors, with $\alpha > 0$, lie on the margin).

33 Simplest SVM: conclusion Finds the optimal hyperplane, which corresponds to the largest margin Can be solved easily using a dual formulation The solution is sparse: the number of support vectors can be very small compared to the size of the training set Only support vectors are important for prediction of future points. All other points can be forgotten.

34 Part 3 More useful SVM: Linear SVM for general training sets

35 In general, training sets are not linearly separable (figure)

36 What goes wrong? The dual problem, maximize $\bar{L}(\vec{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, \vec{x}_i\cdot\vec{x}_j$, under the constraints $\alpha_i \geq 0$ (for $i = 1, \ldots, N$) and $\sum_{i=1}^{N} \alpha_i y_i = 0$, has no solution: the larger some $\alpha_i$, the larger the function to maximize.

37 Forcing a solution. One solution is to limit the range of $\vec{\alpha}$, to be sure that a solution exists. For example, maximize $\bar{L}(\vec{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, \vec{x}_i\cdot\vec{x}_j$, under the constraints: $0 \leq \alpha_i \leq C$ for $i = 1, \ldots, N$, and $\sum_{i=1}^{N} \alpha_i y_i = 0$.
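In the QP sketch given after slide 30, only the inequality constraints change: each $\alpha_i$ is now also bounded above by $C$ (again assuming cvxopt; the function name is ours):

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_soft_margin_dual(X, y, C=1.0):
    """Same dual as before, but with box constraints 0 <= alpha_i <= C (sketch)."""
    N = X.shape[0]
    K = X @ X.T
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(N))
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))        # -alpha_i <= 0 and alpha_i <= C
    h = matrix(np.hstack([np.zeros(N), C * np.ones(N)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])
```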

38 Interpretation (figure: points with $\alpha = 0$ outside the margin, points with $0 < \alpha < C$ on the margin, and points with $\alpha = C$ inside the margin or misclassified).

39 Remarks. This formulation finds a trade-off between minimizing the training error and maximizing the margin. Other formulations are possible to adapt SVM to general training sets. All properties of the separable case are conserved (support vectors, sparseness, computational efficiency...).

40 Part 4 General SVM: Non-linear classifiers for general training sets

41 Sometimes linear classifiers are not interesting (figure)

42 Solution: non-linear mapping $\Phi$ to a feature space (figure)

43 Example. Let $\Phi(\vec{x}) = (x_1^2, x_2^2)$, $\vec{w} = (1, 1)$ and $b = -R^2$. Then the decision function is: $f(\vec{x}) = x_1^2 + x_2^2 - R^2 = \vec{w}\cdot\Phi(\vec{x}) + b$ (figure: in the input space, the boundary $f(\vec{x}) = 0$ is the circle of radius $R$).

44 Kernel (simple but important). For a given mapping $\Phi$ from the space of objects $\mathcal{X}$ to some feature space, the kernel of two objects $x$ and $x'$ is the inner product of their images in the feature space: $\forall x, x' \in \mathcal{X}, \quad K(x, x') = \Phi(x)\cdot\Phi(x')$. Example: if $\Phi(\vec{x}) = (x_1^2, x_2^2)$, then $K(\vec{x}, \vec{x}') = \Phi(\vec{x})\cdot\Phi(\vec{x}') = x_1^2 (x'_1)^2 + x_2^2 (x'_2)^2$.
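A small numerical check of this identity for the mapping above (NumPy; the example vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit mapping Phi(x) = (x1^2, x2^2)."""
    return np.array([x[0] ** 2, x[1] ** 2])

def kernel(x, xp):
    """K(x, x') = x1^2 x1'^2 + x2^2 x2'^2, computed without building Phi."""
    return x[0] ** 2 * xp[0] ** 2 + x[1] ** 2 * xp[1] ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)), kernel(x, xp))   # both print 13.0
```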

45 Training a SVM in the feature space. Replace each $\vec{x}\cdot\vec{x}'$ in the SVM algorithm by $K(x, x')$. The dual problem is to maximize $\bar{L}(\vec{\alpha}) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$, under the constraints: $0 \leq \alpha_i \leq C$ for $i = 1, \ldots, N$, and $\sum_{i=1}^{N} \alpha_i y_i = 0$.
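In code, this substitution can be done by training on a precomputed Gram matrix. A sketch with scikit-learn (an assumption of this example), using the kernel of the previous slide and made-up toy data:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.5, 0.2], [0.3, -0.4], [2.0, 1.5], [-1.8, 2.2]])
y = np.array([-1, -1, +1, +1])

# Gram matrix for K(x, x') = x1^2 x1'^2 + x2^2 x2'^2
K = (X ** 2) @ (X ** 2).T

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K, y)                          # train on kernel values instead of vectors

X_new = np.array([[1.5, -1.0]])
K_new = (X_new ** 2) @ (X ** 2).T      # kernel values between the new object and the training set
print(clf.predict(K_new))
```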

46 Predicting with a SVM in the feature space. The decision function becomes: $f(x) = \vec{w}^*\cdot\Phi(x) + b^* = \sum_{i=1}^{N} \alpha_i^* y_i K(x_i, x) + b^*$.

47 The kernel trick. The explicit computation of $\Phi(x)$ is not necessary: the kernel $K(x, x')$ is enough. SVMs work implicitly in the feature space. It is sometimes possible to easily compute kernels which correspond to complex, large-dimensional feature spaces.

48 Kernel example. For any vector $\vec{x} = (x_1, x_2)$, consider the mapping: $\Phi(\vec{x}) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2, \sqrt{2}\,x_1, \sqrt{2}\,x_2, 1)$. The associated kernel is: $K(\vec{x}, \vec{x}') = \Phi(\vec{x})\cdot\Phi(\vec{x}') = (x_1 x'_1 + x_2 x'_2 + 1)^2 = (\vec{x}\cdot\vec{x}' + 1)^2$.

49 Classical kernels for vectors. Polynomial: $K(x, x') = (x\cdot x' + 1)^d$. Gaussian radial basis function: $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$. Sigmoid: $K(x, x') = \tanh(\kappa\, x\cdot x' + \theta)$.
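These kernels are short to write down explicitly; a minimal NumPy sketch (parameter defaults are illustrative, not prescribed by the slides):

```python
import numpy as np

def polynomial_kernel(x, xp, d=2):
    """K(x, x') = (x.x' + 1)^d"""
    return (np.dot(x, xp) + 1) ** d

def gaussian_kernel(x, xp, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, xp, kappa=1.0, theta=0.0):
    """K(x, x') = tanh(kappa * x.x' + theta)"""
    return np.tanh(kappa * np.dot(x, xp) + theta)
```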

50 Example: classification with a Gaussian kernel. $f(\vec{x}) = \sum_{i=1}^{N} \alpha_i^* y_i \exp\left(-\frac{\|\vec{x} - \vec{x}_i\|^2}{2\sigma^2}\right)$ (figure: non-linear decision boundary).

51 Part 5 Conclusion (day 1)

52 Conclusion SVM is a simple but extremely powerful learning algorithm for binary classification The freedom to choose the kernel offers wonderful opportunities (see day 3: one can design kernels for non-vector objects such as strings, graphs...) More information : http://www.kernel-machines.org Lecture notes (draft) on my homepage