Support vector machines, Kernel methods, and Applications in bioinformatics


1 Support vector machines, Kernel methods, and Applications in bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group Machine Learning in Bioinformatics conference, October 17th, 2003, Brussels, Belgium

2 Overview 1. Support Vector Machines and kernel methods 2. Application: Protein remote homology detection 3. Application: Extracting pathway activity from gene expression data

3 Part 1: Support Vector Machines (SVM) and Kernel Methods

4-6 The pattern recognition problem Learn a discrimination rule from labelled examples, then use it to predict the class of new points.

7 Pattern recognition examples Medical diagnosis (e.g., from microarrays). Druggability/activity of chemical compounds. Gene function, structure, localization. Protein interactions.

8-10 Support Vector Machines for pattern recognition An object x is represented by the vector Φ(x) in a feature space; the SVM then looks for a linear separation with large margin in the feature space.

11-15 Large margin separation Find a hyperplane H separating the two classes with the largest margin m, allowing slack errors $e_i$, by solving:
$$\min_{H,m} \left\{ \frac{1}{m^2} + C \sum_i e_i \right\}$$

16 Dual formulation The classification of a new point x is the sign of:
$$f(x) = \sum_{i=1}^n \alpha_i y_i K(x, x_i) + b,$$
where $\alpha$ solves:
$$\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C \ \text{for } i = 1, \dots, n, \quad \sum_{i=1}^n \alpha_i y_i = 0,$$
with the notation $K(x, x') = \Phi(x) \cdot \Phi(x')$.

17 The kernel trick for SVM The separation can be found without knowing Φ(x): only the kernel matters, $K(x, y) = \Phi(x) \cdot \Phi(y)$. Simple kernels K(x, y) can correspond to complex Φ. SVM can therefore work with any sort of data as soon as a kernel is defined.
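As an illustration (not from the slides), here is a minimal sketch of the kernel trick in practice: scikit-learn's SVC accepts a precomputed Gram matrix, so the SVM never sees Φ(x). The kernel choice and the toy data are assumptions for the example.

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(X1, X2, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), for all pairs of rows
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

# Hypothetical toy data: two noisy classes in R^2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# Training uses only the n x n Gram matrix K(x_i, x_j)
clf = SVC(kernel="precomputed", C=1.0).fit(gaussian_kernel(X, X), y)

# Predicting a new point only needs K(x, x_i) against the training points
X_new = np.array([[1.5, 1.5]])
print(clf.predict(gaussian_kernel(X_new, X)))
```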

18 Kernel examples
Linear: $K(x, x') = x \cdot x'$
Polynomial: $K(x, x') = (x \cdot x' + c)^d$
Gaussian RBF: $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$
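As a sketch, the three kernels above for vector inputs (assuming NumPy arrays x and y of equal length):

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y

def polynomial_kernel(x, y, c=1.0, d=3):
    return (x @ y + c) ** d

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))
```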

19 Kernels For any set X, a function $K : X \times X \to \mathbb{R}$ is a kernel iff:
it is symmetric: $K(x, y) = K(y, x)$,
it is positive semi-definite: $\sum_{i,j} a_i a_j K(x_i, x_j) \ge 0$ for all $a_i \in \mathbb{R}$ and $x_i \in X$.

20 Advantages of SVM Works well on real-world applications. Large dimensions and noise: OK (?). Can be applied to any kind of data as soon as a kernel is available.

21 Examples: SVM in bioinformatics Gene functional classification from microarray: Brown et al. (2000), Pavlidis et al. (2001). Tissue classification from microarray: Mukherjee et al. (1999), Furey et al. (2000), Guyon et al. (2001). Protein family prediction from sequence: Jaakkola et al. (1998). Protein secondary structure prediction: Hua et al. (2001). Protein subcellular localization prediction from sequence: Hua et al. (2001).

22 Kernel methods Let K(x, y) be a given kernel. Then it is possible to perform other linear algorithms implicitly in the feature space, such as: computing the distance between points, principal component analysis (PCA), canonical correlation analysis (CCA).

23 Compute the distance between objects
$$d(g_1, g_2)^2 = \|\Phi(g_1) - \Phi(g_2)\|^2 = \left(\Phi(g_1) - \Phi(g_2)\right) \cdot \left(\Phi(g_1) - \Phi(g_2)\right) = \Phi(g_1) \cdot \Phi(g_1) + \Phi(g_2) \cdot \Phi(g_2) - 2\,\Phi(g_1) \cdot \Phi(g_2)$$
hence
$$d(g_1, g_2)^2 = K(g_1, g_1) + K(g_2, g_2) - 2 K(g_1, g_2)$$

24 Distance to the center of mass Center of mass: $m = \frac{1}{N} \sum_{i=1}^N \Phi(g_i)$, hence:
$$\|\Phi(g_1) - m\|^2 = \Phi(g_1) \cdot \Phi(g_1) - 2\,\Phi(g_1) \cdot m + m \cdot m = K(g_1, g_1) - \frac{2}{N} \sum_{i=1}^N K(g_1, g_i) + \frac{1}{N^2} \sum_{i,j=1}^N K(g_i, g_j)$$
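A sketch of both computations, assuming a precomputed Gram matrix K with K[i, j] = K(g_i, g_j) as a NumPy array:

```python
import numpy as np

def kernel_distance_sq(K, i, j):
    # d(g_i, g_j)^2 = K(g_i, g_i) + K(g_j, g_j) - 2 K(g_i, g_j)
    return K[i, i] + K[j, j] - 2 * K[i, j]

def dist_to_center_sq(K, i):
    # ||Phi(g_i) - m||^2, with m the center of mass in the feature space
    N = K.shape[0]
    return K[i, i] - 2 * K[i].sum() / N + K.sum() / N ** 2
```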

25 Principal component analysis It is equivalent to finding the eigenvectors of the Gram matrix
$$K = \left(\Phi(g_i) \cdot \Phi(g_j)\right)_{i,j=1 \dots N} = \left(K(g_i, g_j)\right)_{i,j=1 \dots N}$$
Useful to project the objects onto low-dimensional spaces (feature extraction).
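A minimal kernel-PCA sketch, assuming a Gram matrix K as above; the feature-space centering step, standard for kernel PCA, is included even though the slide omits it:

```python
import numpy as np

def kernel_pca(K, n_components=2):
    N = K.shape[0]
    # Center the data implicitly in the feature space
    one = np.ones((N, N)) / N
    Kc = K - one @ K - K @ one + one @ K @ one
    # Eigendecomposition of the centered Gram matrix
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[idx], vecs[:, idx]
    # Projection of object i on component k is sqrt(lambda_k) * v_k(i)
    return vecs * np.sqrt(np.maximum(vals, 0))
```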

26 Canonical correlation analysis $K_1$ and $K_2$ are two kernels for the same objects. CCA can be performed by solving the following generalized eigenvalue problem:
$$\begin{pmatrix} 0 & K_1 K_2 \\ K_2 K_1 & 0 \end{pmatrix} \xi = \rho \begin{pmatrix} K_1^2 & 0 \\ 0 & K_2^2 \end{pmatrix} \xi$$
Useful to find correlations between different representations of the same objects (e.g., genes, ...).

27 Part 2: Local alignment kernel for strings (with H. Saigo, N. Ueda, T. Akutsu, preprint 2003)

28 Motivations Develop a kernel for strings adapted to protein / DNA sequences. Several methods have been adopted in bioinformatics to measure the similarity between sequences, but they are not valid kernels. How can we mimic them?

29 Related work
Spectrum kernel (Leslie et al.): $K(x_1 \dots x_m, y_1 \dots y_n) = \sum_{i=1}^{m-k} \sum_{j=1}^{n-k} \delta(x_i \dots x_{i+k},\ y_j \dots y_{j+k})$.
Fisher kernel (Jaakkola et al.): given a statistical model $(p_\theta,\ \theta \in \Theta \subset \mathbb{R}^d)$, set $\phi(x) = \nabla_\theta \log p_\theta(x)$ and use the Fisher information matrix.
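A brute-force sketch of the k-spectrum kernel as written above; real implementations use tries or suffix structures for efficiency:

```python
def spectrum_kernel(x, y, k=3):
    # Count matching k-mer pairs between the two strings
    return sum(x[i:i + k] == y[j:j + k]
               for i in range(len(x) - k + 1)
               for j in range(len(y) - k + 1))

print(spectrum_kernel("GATTACA", "ATTAC", k=3))  # ATT, TTA, TAC match -> 3
```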

30 Local alignment For two strings x and y, a local alignment π with gaps is, e.g.:
ABCD EF G HI JKL
MNO EEPQRGS I TUVWX
Its score is $s(x, y, \pi) = s(e, E) + s(f, F) + s(g, G) + s(i, I) - s(\text{gaps})$

31 Smith-Waterman (SW) score $SW(x, y) = \max_{\pi \in \Pi(x, y)} s(x, y, \pi)$. Computed by dynamic programming. Not a kernel in general.
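A minimal sketch of the Smith-Waterman dynamic program, assuming a simple match/mismatch score and a linear gap penalty (real protein applications use a substitution matrix such as BLOSUM and affine gaps):

```python
import numpy as np

def smith_waterman(x, y, match=2, mismatch=-1, gap=-1):
    # H[i, j] = best score of a local alignment ending at x[i-1], y[j-1]
    H = np.zeros((len(x) + 1, len(y) + 1))
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + s,  # align x[i-1] with y[j-1]
                          H[i - 1, j] + gap,    # gap in y
                          H[i, j - 1] + gap)    # gap in x
    return H.max()

print(smith_waterman("GATTACA", "ATTAC"))  # 10.0: ATTAC aligns exactly
```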

32 Convolution kernels (Haussler 99) Let $K_1$ and $K_2$ be two kernels for strings. Their convolution is the following valid kernel:
$$K_1 \star K_2(x, y) = \sum_{x_1 x_2 = x,\ y_1 y_2 = y} K_1(x_1, y_1)\, K_2(x_2, y_2)$$
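A direct (and deliberately naive) sketch of this convolution, summing over every split of x and y into a prefix and a suffix; K1 and K2 are assumed to be plain Python functions on strings:

```python
def convolve(K1, K2, x, y):
    # Sum K1 on the prefixes times K2 on the suffixes, over all splits
    return sum(K1(x[:i], y[:j]) * K2(x[i:], y[j:])
               for i in range(len(x) + 1)
               for j in range(len(y) + 1))

# Example: convolving two constant kernels just counts the splits
print(convolve(lambda a, b: 1, lambda a, b: 1, "AB", "CD"))  # 9 = 3 x 3
```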

33 3 basic kernels
For the unaligned parts: $K_0(x, y) = 1$.
For aligned residues: $K_a^{(\beta)}(x, y) = 0$ if $|x| \ne 1$ or $|y| \ne 1$, and $\exp(\beta s(x, y))$ otherwise.
For gaps: $K_g^{(\beta)}(x, y) = \exp\left[\beta\left(g(|x|) + g(|y|)\right)\right]$

34 Combining the kernels Detecting local alignments of exactly n residues:
$$K_{(n)}^{(\beta)}(x, y) = K_0 \star \left(K_a^{(\beta)} \star K_g^{(\beta)}\right)^{(n-1)} \star K_a^{(\beta)} \star K_0.$$
Considering all possible local alignments:
$$K_{LA}^{(\beta)} = \sum_{i=0}^{\infty} K_{(i)}^{(\beta)}.$$

35 Properties
$$K_{LA}^{(\beta)}(x, y) = \sum_{\pi \in \Pi(x, y)} \exp\left(\beta s(x, y, \pi)\right),$$
$$\lim_{\beta \to +\infty} \frac{1}{\beta} \ln K_{LA}^{(\beta)}(x, y) = SW(x, y).$$
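A quick numerical check of this limit: the log-sum-exp of the alignment scores tends to their maximum as β grows (the scores below are hypothetical stand-ins for the $s(x, y, \pi)$):

```python
import numpy as np

scores = np.array([2.0, 3.5, 5.0])  # scores s(x, y, pi) of candidate alignments
for beta in [0.5, 5.0, 50.0]:
    soft_max = np.log(np.exp(beta * scores).sum()) / beta
    print(beta, soft_max)  # tends to max(scores) = 5.0 = SW(x, y)
```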

36 Kernel computation [Figure: dynamic-programming state diagram over states B, X, X2, M, Y, Y2, E used to compute $K_{LA}^{(\beta)}$]

37 Application: remote homology detection [Figure: sequence-similarity scale from unrelated proteins, through the twilight zone, to close homologs] Remote homologs have the same structure/function but their sequences have diverged; remote homology cannot be found by direct sequence similarity.

38 SCOP database [Figure: SCOP hierarchy Fold > Superfamily > Family; remote homologs share a superfamily, close homologs a family]

39 A benchmark experiment Can we predict the superfamily of a domain if we have not seen any member of its family before? During learning: remove a family and learn the difference between its superfamily and the rest. Then use the model to test each domain of the removed family.

40 SCOP superfamily recognition benchmark [Figure: number of families achieving a given ROC50 score (0 to 1), comparing SVM-LA, SVM-pairwise, SVM-Mismatch and SVM-Fisher]

41 Part 3 Detecting pathway activity from microarray data (with M. Kanehisa, ECCB 2003)

42 Genes encode proteins which can catalyse chemical reactions [Figure: nicotinamide mononucleotide adenylyltransferase with bound NAD+]

43 Chemical reactions are often part of pathways (from http://www.genome.ad.jp/kegg/pathway)

44 Microarray technology monitors mRNA quantity (from Spellman et al., 1998)

45 Comparing gene expression and pathway databases: are there correlations? Detect active pathways? Denoise expression data? Denoise the pathway database? Find new pathways?

46 A useful first step [Figure: the same eight genes g1-g8 shown both as expression profiles and as nodes of the gene network]

47 Using microarray only [Figure: first principal component PC1 of the expression profiles of g1-g8] PCA finds the directions (profiles) explaining the largest amount of variation among the expression profiles.

48 PCA formulation Let $f_v(i)$ be the projection of the i-th profile onto v. The amount of variation captured by $f_v$ is:
$$h_1(v) = \sum_{i=1}^N f_v(i)^2$$
PCA finds an orthonormal basis by solving successively: $\max_v h_1(v)$

49 Issues with PCA PCA is useful if there is a small number of strong signals. In concrete applications, we observe a noisy superposition of many events. Using prior knowledge of metabolic networks can help denoise the information detected by PCA.

50 The metabolic gene network Link two genes when they can catalyze two successive reactions [Figure: a fragment of glycolysis (glucose, glucose 6P, fructose 6P, fructose 1,6P2, glucose 1P) and the resulting network over GAL10, HKA1/HKA2, GLK1, PGM1/PGM2, PGT1, PFK1/PFK2, FBP1, FBA1]

51 Mapping $f_v$ to the metabolic gene network [Figure: the entries of $f_v$ (+0.8, +0.5, +0.4, +0.2, +0.1, -0.4, -0.7, -0.8) attached to the genes g1-g8 on the network] Does it look interesting or not?

52 Important hypothesis If v is related to a metabolic activity, then $f_v$ should vary smoothly on the graph [Figure: a smooth vs a rugged assignment of values to the network nodes]

53 Graph Laplacian $L = D - A$. Example (nodes 1-5 with edges 1-3, 2-3, 3-4, 4-5):
$$L = \begin{pmatrix} 1 & 0 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 & 0 \\ -1 & -1 & 3 & -1 & 0 \\ 0 & 0 & -1 & 2 & -1 \\ 0 & 0 & 0 & -1 & 1 \end{pmatrix}$$

54 Smoothness quantification $h_2(f) = \frac{f^\top \exp(-\beta L)\, f}{f^\top f}$ is large when f is smooth [Figure: a smooth assignment with h(f) = 34.2 vs a rugged one with h(f) = 2.5]
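A self-contained sketch of $h_2$ using SciPy's matrix exponential, on the 5-node example graph of the previous slide (β = 1 and the test profiles are assumptions for illustration):

```python
import numpy as np
from scipy.linalg import expm

# Laplacian L = D - A for the example graph with edges 1-3, 2-3, 3-4, 4-5
A = np.zeros((5, 5))
for i, j in [(0, 2), (1, 2), (2, 3), (3, 4)]:  # zero-indexed nodes
    A[i, j] = A[j, i] = 1
L = np.diag(A.sum(axis=1)) - A

def h2(f, L, beta=1.0):
    # f^T exp(-beta L) f / (f^T f): large when f is smooth on the graph
    return f @ expm(-beta * L) @ f / (f @ f)

f_smooth = np.array([1.0, 1.0, 0.8, 0.5, 0.2])   # varies slowly over edges
f_rugged = np.array([1.0, -1.0, 1.0, -1.0, 1.0])  # oscillates between neighbours
print(h2(f_smooth, L), h2(f_rugged, L))  # the smooth profile scores higher
```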

55 Motivation For a candidate profile v: $h_1(f_v)$ is large when v captures a lot of natural variation among the profiles; $h_2(f_v)$ is large when $f_v$ is smooth on the graph. Try to maximize both terms at the same time.

56 Problem reformulation Find a function $f_v$ and a function $f_2$ such that $h_1(f_v)$ is large, $h_2(f_2)$ is large, and $\mathrm{corr}(f_v, f_2)$ is large, by solving:
$$\max_{(f_v, f_2)} \ \mathrm{corr}(f_v, f_2) \times \frac{h_1(f_v)}{h_1(f_v) + \delta} \times \frac{h_2(f_2)}{h_2(f_2) + \delta}$$

57 Solving the problem This formulation is equivalent to a generalized form of CCA (kernel-CCA, Bach and Jordan, 2002), which is solved by the following generalized eigenvector problem:
$$\begin{pmatrix} 0 & K_1 K_2 \\ K_2 K_1 & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \rho \begin{pmatrix} K_1^2 + \delta K_1 & 0 \\ 0 & K_2^2 + \delta K_2 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$
where $[K_1]_{i,j} = e_i \cdot e_j$ and $K_2 = \exp(-L)$. Then $f_v = K_1 \alpha$ and $f_2 = K_2 \beta$.
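A sketch of this regularized kernel-CCA solve with SciPy's generalized symmetric eigensolver; the Gram matrices, graph, and δ below are hypothetical toy stand-ins, not the paper's data:

```python
import numpy as np
from scipy.linalg import eigh, expm

rng = np.random.default_rng(0)
N, delta = 8, 0.1

# Toy stand-ins: K1 linear kernel on expression profiles, K2 diffusion kernel
E = rng.normal(size=(N, 5))
K1 = E @ E.T
A = np.triu((rng.random((N, N)) < 0.3).astype(float), 1)
A = A + A.T                                  # random symmetric adjacency
K2 = expm(-(np.diag(A.sum(axis=1)) - A))     # K2 = exp(-L)

# Assemble the generalized eigenvalue problem of the slide
Z = np.zeros((N, N))
lhs = np.block([[Z, K1 @ K2], [K2 @ K1, Z]])
rhs = np.block([[K1 @ K1 + delta * K1, Z], [Z, K2 @ K2 + delta * K2]])
rho, xi = eigh(lhs, rhs + 1e-9 * np.eye(2 * N))  # jitter keeps rhs definite
alpha, beta = xi[:N, -1], xi[N:, -1]             # top generalized eigenvector
f_v, f_2 = K1 @ alpha, K2 @ beta                 # the two correlated profiles
```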

58 The kernel point of view [Figure: the gene network mapped through a diffusion kernel and the expression profiles of g1-g8 through a linear kernel; kernel CCA then looks for correlated directions in the two feature spaces]

59 Data Gene network: two genes are linked if they catalyze successive reactions in the KEGG database (669 yeast genes). Expression profiles: 18 time-series measurements for the 6,000 genes of yeast, during two cell cycles.

60 First pattern of expression [Figure: expression level vs time]

61 Related metabolic pathways The 50 genes with the highest $s_2 - s_1$ belong to: Oxidative phosphorylation (10 genes), Citrate cycle (7), Purine metabolism (6), Glycerolipid metabolism (6), Sulfur metabolism (5), Selenoamino acid metabolism (4), etc.

62-64 Related genes [figure slides]

65 Opposite pattern [Figure: expression level vs time]

66 Related genes RNA polymerase (11 genes), Pyrimidine metabolism (10), Aminoacyl-tRNA biosynthesis (7), Urea cycle and metabolism of amino groups (3), Oxidative phosphorylation (3), ATP synthesis (3), etc.

67-69 Related genes [figure slides]

70 Second pattern [Figure: expression level vs time]

71 Extensions Can be used to extract features from expression profiles (preprint 2002). Can be generalized to more than 2 datasets and to other kernels. Can be used to extract clusters of genes (e.g., operon detection; ISMB 03 with Y. Yamanishi, A. Nakaya and M. Kanehisa).

72 Conclusion

73 Conclusion Kernels offer a versatile framework to represent biological data. SVM and kernel methods work well on real-life problems, in particular in high dimension and with noise. Encouraging results on real-world applications. Many opportunities in developing kernels for particular applications.