Statistical learning theory, Support vector machines, and Bioinformatics

1 Statistical learning theory, Support vector machines, and Bioinformatics. Jean-Philippe.Vert@mines.org, Ecole des Mines de Paris, Computational Biology group. ENS Paris, November 25, 2003.

2 Overview 1. Statistical learning theory 2. Support vector machines 3. Computational biology 4. Short example: virtual screening for drug design

3 Part 1: Statistical learning theory

4-6 The pattern recognition problem. Learn a discrimination rule from labelled examples. Use it to predict the class of new points.

7 Pattern recognition applications: Multimedia (OCR, speech recognition, spam filter, ...), Finance, Marketing, Bioinformatics, medical diagnosis, drug design, ...

8 Formalism. Object: $x \in X$. Label: $y \in Y$ (e.g., $Y = \{-1, 1\}$). Training set: $S = (z_1, \ldots, z_N)$ with $z_i = (x_i, y_i) \in X \times Y$. A classifier is any $f : X \to Y$. A learning algorithm is: a set of classifiers $F \subset Y^X$ together with a learning procedure $S \in (X \times Y)^N \mapsto f \in F$.

9 Example: linear discrimination. Objects: $X = \mathbb{R}^d$, $Y = \{-1, 1\}$. Classifiers: $F = \{ f_{w,b} : (w, b) \in \mathbb{R}^d \times \mathbb{R} \}$, where $f_{w,b}(x) = +1$ if $w \cdot x + b > 0$ and $-1$ if $w \cdot x + b \le 0$. Algorithms: linear perceptron, Fisher discriminant, ...

10 Questions How to analyse/understand learning algorithms? How to design good learning algorithms?

11 Useful hypothesis: probabilistic setting. Learning is only possible if the future is related to the past. Mathematically: $X \times Y$ is a measurable set endowed with a probability measure $P$. Past observations: $Z_1, \ldots, Z_N$ are $N$ independent and identically distributed (according to $P$) random variables. Future observations: a random variable $Z_{N+1}$ also distributed according to $P$. The future is related to the past by $P$.

12 How good is a classifier? Let $\ell : Y \times Y \to \mathbb{R}$ be a loss function (e.g., the 0/1 loss). The risk of a classifier $f : X \to Y$ is the average loss: $R(f) = \mathbb{E}_{(X,Y) \sim P}[\ell(f(X), Y)]$. Ideal goal: for a given (unknown) $P$, find $f^* = \arg\inf_f R(f)$.

13 How to learn a good classifier? $P$ is unknown, so we must learn a classifier $f$ from the training set $S$. For any classifier $f : X \to Y$, we can compute the empirical risk: $R_N(f) = \frac{1}{N} \sum_{i=1}^N \ell(f(x_i), y_i)$. Obviously, $R(f) = \mathbb{E}_{S \sim P}[R_N(f)]$ for any $f \in F$.
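
A minimal numerical illustration of the empirical risk with the 0/1 loss (the data and the thresholding classifier below are made up for illustration):

```python
import numpy as np

# Toy training set: 1-D inputs and labels in {-1, +1} (made up for illustration).
x = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([-1, -1, 1, 1, -1])

def f(x):
    """A hypothetical classifier: the sign of x (threshold at 0)."""
    return np.where(x > 0, 1, -1)

def empirical_risk(f, x, y):
    """R_N(f) = (1/N) * sum of 0/1 losses over the training set."""
    return np.mean(f(x) != y)

print(empirical_risk(f, x, y))  # 0.2: one of the five points is misclassified
```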

14 Empirical risk minimization. For a given class $F \subset Y^X$, choose the $f$ that minimizes the empirical risk: $\hat f_N = \arg\min_{f \in F} R_N(f)$. The best choice, if $P$ were known, would be $f_0 = \arg\min_{f \in F} R(f)$. Central question: is $R(\hat f_N)$ close to $R(f_0)$?

15 Consistency issue. An algorithm is called consistent iff $\lim_{N \to \infty} R(\hat f_N) = R(f_0)$. $R(\hat f_N)$ is random, so we need a notion of convergence for random variables, such as convergence in probability: $\lim_{N \to \infty} P\{ R(\hat f_N) - R(f_0) > \epsilon \} = 0$ for all $\epsilon > 0$.

16 Consistency and the law of large numbers. Classical law of large numbers: for any $f \in F$, $\lim_{N \to \infty} P\{ |R_N(f) - R(f)| > \epsilon \} = 0$ for all $\epsilon > 0$. Moreover, $0 \le R(\hat f_N) - R(f_0) \le [R(\hat f_N) - R_N(\hat f_N)] + [R_N(f_0) - R(f_0)]$. The second term converges to 0 by the classical LLN applied to $f_0$. What about the first one?

17 Uniform law of large numbers. The classical LLN is not enough to ensure that $R(\hat f_N) - R_N(\hat f_N) \to 0$ as $N \to \infty$. Theorem: the ERM principle is consistent if and only if the following uniform law of large numbers holds: $\lim_{N \to \infty} P\{ \sup_{f \in F} (R(f) - R_N(f)) > \epsilon \} = 0$ for all $\epsilon > 0$.

18 Vapnik-Chervonenkis entropy and consistency. For any $N$, $S$ and $F$, let $G_F(N, S) = \mathrm{card}\{ (f(x_1), \ldots, f(x_N)) : f \in F \}$. Obviously, $G_F(N, S) \le 2^N$. Theorem: the ULLN holds if and only if $\lim_{N \to \infty} \mathbb{E}_{S \sim P}[\ln G_F(N, S)] / N = 0$.

19 Distribution-independent bound. Theorem (Vapnik-Chervonenkis): for any $\delta > 0$, the following holds with $P$-probability at least $1 - \delta$: $\forall f \in F$, $R(f) \le R_N(f) + \sqrt{\frac{\ln \sup_S G_F(N, S) + \ln(1/\delta)}{8N}}$. This is valid for any $P$! A sufficient condition for a distribution-independent fast ULLN is $\lim_{N \to \infty} \ln \sup_S G_F(N, S) / N = 0$.

20 VC dimension. The VC dimension of $F$ is the largest number $N$ of points that can be shattered by $F$, i.e., such that there exists a set $S$ with $G_F(N, S) = 2^N$. Example: for hyperplanes in $\mathbb{R}^d$, the VC dimension is $d + 1$.
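
A small sanity check of the hyperplane example, using an LP feasibility test for linear separability (all names below are ad hoc; this is only an illustrative sketch):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Can the labeling y in {-1,+1} be realized by a hyperplane?
    Feasibility of y_i * (w.x_i + b) >= 1 is checked with a linear program."""
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # variables are (w, b)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.success

def shattered(X):
    """True if every labeling of the points in X is linearly separable."""
    return all(linearly_separable(X, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=len(X)))

three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four_points = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # XOR layout
print(shattered(three_points))  # True: 3 points in general position are shattered
print(shattered(four_points))   # False: consistent with VC dimension d + 1 = 3 in R^2
```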

21 Sauer's lemma. Let $F$ be a class with finite VC dimension $h$. Then $\ln \sup_S G_F(N, S) = N \ln 2$ if $N \le h$, and $\ln \sup_S G_F(N, S) \le h \ln \frac{eN}{h}$ if $N \ge h$.

22 VC dimension and learning. Finiteness of the VC dimension is a necessary and sufficient condition for uniform convergence independent of $P$. If $F$ has finite VC dimension $h$, then the following holds with probability at least $1 - \delta$: $\forall f \in F$, $R(f) \le R_N(f) + \sqrt{\frac{h \ln \frac{2eN}{h} + \ln \frac{4}{\delta}}{8N}}$.
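
To get a feel for the bound, one can plug in numbers; the choices $h = 11$ (hyperplanes in $\mathbb{R}^{10}$) and $\delta = 0.05$ below are arbitrary:

```python
import numpy as np

def vc_confidence_term(h, N, delta):
    """The confidence term sqrt((h*ln(2eN/h) + ln(4/delta)) / (8N)) from the slide."""
    return np.sqrt((h * np.log(2 * np.e * N / h) + np.log(4 / delta)) / (8 * N))

for N in (100, 1_000, 10_000, 100_000):
    print(N, round(vc_confidence_term(h=11, N=N, delta=0.05), 3))
# The gap between empirical and true risk shrinks roughly like sqrt(h * ln(N) / N).
```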

23 Part 2: Support vector machines

24 Structural risk minimization. Define a nested family of function classes $F_1 \subset F_2 \subset \ldots \subset Y^X$ with increasing VC dimensions $h_1 \le h_2 \le \ldots$ In each class, the ERM algorithm finds a classifier $\hat f_i$ that satisfies: $R(\hat f_i) \le \inf_{f \in F_i} R_N(f) + \sqrt{\frac{h_i \ln \frac{2eN}{h_i} + \ln \frac{4}{\delta}}{8N}}$.

25 Structural risk minimization (2). SRM principle: choose the $\hat f = \hat f_i$ that minimizes the upper bound. The validity of this principle can also be justified mathematically.

26 Curse of dimension? Remember that the VC dimension of the class of hyperplanes in $\mathbb{R}^d$ is $d + 1$. Does this mean one cannot learn in high dimension?

27 Large margin hyperplanes. For a given set of points $S$ in a ball of radius $R$, consider only the hyperplanes $F_\gamma$ that correctly separate the points with margin at least $\gamma$. Then $VC(F_\gamma) \le R^2 / \gamma^2$, independent of the dimension!

28 SRM on hyperplanes. Intuitively, select a hyperplane with small empirical risk and large margin. Support vector machines implement this principle.

29-33 Linear SVM (figure: separating hyperplane $H$ with margin $m$ and slack errors $e_1, \ldots, e_5$). Find the hyperplane that solves $\min_{H, m} \left\{ \frac{1}{m^2} + C \sum_i e_i \right\}$.
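
A minimal sketch of this trade-off with scikit-learn on synthetic data; the parameter C below plays the role of the slack penalty in the objective above:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two noisy Gaussian clouds with labels in {-1, +1} (synthetic data for illustration).
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(+2, 1, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_[0])   # geometric margin of the solution
    print(f"C={C:>6}: margin={margin:.3f}, support vectors={len(clf.support_)}")
```

Small C tolerates more slack and tends to give a larger margin; large C approaches the hard-margin solution.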

34 Dual formulation. The classification of a new point $x$ is the sign of $f(x) = \sum_i \alpha_i y_i \, x \cdot x_i + b$, where $\alpha$ solves: $\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j$ subject to $0 \le \alpha_i \le C$ for $i = 1, \ldots, n$ and $\sum_{i=1}^n \alpha_i y_i = 0$.
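
The dual expansion can be checked against a fitted model; in scikit-learn, `dual_coef_` stores the products $\alpha_i y_i$ for the support vectors and `intercept_` stores $b$ (synthetic data again, for illustration only):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(30, 2)), rng.normal(+2, 1, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]        # alpha_i * y_i for each support vector x_i
sv = clf.support_vectors_
b = clf.intercept_[0]

x_new = np.array([0.5, -0.3])                  # an arbitrary test point
f_manual = np.sum(alpha_y * (sv @ x_new)) + b  # sum_i alpha_i y_i <x, x_i> + b
print(np.isclose(f_manual, clf.decision_function([x_new])[0]))  # True
```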

35 Sometimes linear classifiers are not interesting.

36 Solution: a non-linear mapping $\Phi$ to a feature space.

37 Example. Let $\Phi(\vec x) = (x_1^2, x_2^2)$, $w = (1, 1)$ and $b = -R^2$. Then the decision function is $f(\vec x) = x_1^2 + x_2^2 - R^2 = w \cdot \Phi(\vec x) + b$: a circle of radius $R$ in the input space becomes a linear separator in the feature space.

38 Kernel (simple but important). For a given mapping $\Phi$ from the space of objects $X$ to some feature space, the kernel of two objects $x$ and $x'$ is the inner product of their images in the feature space: $\forall x, x' \in X$, $K(x, x') = \Phi(x) \cdot \Phi(x')$. Example: if $\Phi(\vec x) = (x_1^2, x_2^2)$, then $K(\vec x, \vec x') = \Phi(\vec x) \cdot \Phi(\vec x') = x_1^2 x_1'^2 + x_2^2 x_2'^2$.
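
A two-line numerical check of this example (values chosen arbitrarily):

```python
import numpy as np

def phi(x):
    """Explicit feature map Phi(x) = (x1^2, x2^2) from the slide."""
    return np.array([x[0] ** 2, x[1] ** 2])

def k(x, xp):
    """The same inner product computed directly from the inputs."""
    return x[0] ** 2 * xp[0] ** 2 + x[1] ** 2 * xp[1] ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(xp), k(x, xp))   # both print 13.0
```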

39 Training an SVM in the feature space. Replace each $x \cdot x'$ in the SVM algorithm by $K(x, x')$. The dual problem is to maximize $L(\alpha) = \sum_{i=1}^N \alpha_i - \frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j)$ under the constraints $0 \le \alpha_i \le C$ for $i = 1, \ldots, N$ and $\sum_{i=1}^N \alpha_i y_i = 0$.

40 Predicting with an SVM in the feature space. The decision function becomes: $f(x) = w \cdot \Phi(x) + b = \sum_{i=1}^N \alpha_i y_i K(x_i, x) + b$.

41 The kernel trick. The explicit computation of $\Phi(x)$ is not necessary: the kernel $K(x, x')$ is enough, so SVMs work implicitly in the feature space. It is sometimes possible to easily compute kernels that correspond to complex, high-dimensional feature spaces.
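
In practice this means the learner only ever needs the Gram matrix $K(x_i, x_j)$, never $\Phi(x)$. A sketch using scikit-learn's precomputed-kernel interface on synthetic data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0, 1, -1)   # a circular decision boundary

def poly_kernel(A, B, d=2):
    """Gram matrix of the polynomial kernel (x.x' + 1)^d between rows of A and B."""
    return (A @ B.T + 1.0) ** d

K_train = poly_kernel(X, X)
clf = SVC(kernel="precomputed", C=1.0).fit(K_train, y)

X_new = rng.normal(size=(5, 2))
K_new = poly_kernel(X_new, X)   # kernel values between new points and training points
print(clf.predict(K_new))
```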

42 Kernel example. For any vector $\vec x = (x_1, x_2)$, consider the mapping $\Phi(\vec x) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2, \sqrt{2}\,x_1, \sqrt{2}\,x_2, 1)$. The associated kernel is $K(\vec x, \vec x') = \Phi(\vec x) \cdot \Phi(\vec x') = (x_1 x_1' + x_2 x_2' + 1)^2 = (\vec x \cdot \vec x' + 1)^2$.

43 Classical kernels for vectors. Polynomial: $K(x, x') = (x \cdot x' + 1)^d$. Gaussian radial basis function: $K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$. Sigmoid: $K(x, x') = \tanh(\kappa \, x \cdot x' + \theta)$.
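
The same three kernels written as plain functions (parameter values are placeholders):

```python
import numpy as np

def polynomial(x, xp, d=3):
    return (np.dot(x, xp) + 1.0) ** d

def gaussian_rbf(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def sigmoid(x, xp, kappa=1.0, theta=0.0):
    # Note: the sigmoid "kernel" is not positive semi-definite for every (kappa, theta).
    return np.tanh(kappa * np.dot(x, xp) + theta)

x, xp = np.array([1.0, 0.5]), np.array([-0.5, 2.0])
print(polynomial(x, xp), gaussian_rbf(x, xp), sigmoid(x, xp))
```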

44 Example: classification with a Gaussian kernel. $f(\vec x) = \sum_{i=1}^N \alpha_i \exp\left(-\frac{\|\vec x - \vec x_i\|^2}{2\sigma^2}\right)$.
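
In scikit-learn terms, this corresponds to SVC with an RBF kernel and gamma $= 1/(2\sigma^2)$; a short sketch on synthetic, non-linearly separable data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)   # circular classes

sigma = 0.5
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=1.0).fit(X, y)
print(clf.score(X, y), len(clf.support_))   # training accuracy, number of support vectors
```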

45 Kernels. For any set $X$, a function $K : X \times X \to \mathbb{R}$ is a kernel iff: it is symmetric, $K(x, y) = K(y, x)$; and it is positive semi-definite, $\sum_{i,j} a_i a_j K(x_i, x_j) \ge 0$ for all $a_i \in \mathbb{R}$ and $x_i \in X$.
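
A quick way to test these two conditions numerically on a Gram matrix (the data and the second similarity function are arbitrary examples):

```python
import numpy as np

def looks_like_a_kernel_matrix(K, tol=1e-10):
    """Check symmetry and (numerical) positive semi-definiteness of a Gram matrix."""
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol)
    return symmetric and psd

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))

K_linear = X @ X.T                         # linear kernel: always PSD
K_tanh = np.tanh(2.0 * (X @ X.T) - 1.0)    # sigmoid similarity: not guaranteed to be PSD
print(looks_like_a_kernel_matrix(K_linear), looks_like_a_kernel_matrix(K_tanh))
```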

46 Kernel properties. The set of kernels is a convex cone, closed under the topology of pointwise convergence, and closed under the Schur product $K_1(x, y) K_2(x, y)$. Bochner's theorem: in $\mathbb{R}^d$, $K(x, y) = \varphi(x - y)$ is a kernel iff $\varphi$ has a non-negative Fourier transform (with generalizations to more general groups and semi-groups).

47 Part 3: Computational biology

48 Overview. 1990-95: DNA chips. 2003: completion of the Human Genome Project. Gene sequences, structures, expression, localization, phenotypes, SNPs, ... many, many data. Pharmacy, environment, food, ... many applications.

49 Data examples 2D and 3D structure of prion

50 Data examples (From Spellman et al., 1998)

51 Data examples

52 Pattern recognition examples: medical diagnosis from gene expression, functional genomics, structural genomics, drug design, ... SVMs are being applied to all of these.

53 Part 4: Virtual screening of small molecules

54 The problem. Objects = chemical compounds (formula, structure, ...) (figure: molecular graph with C, N and O atoms). Problem = predict their druggability, pharmacokinetics, activity on a target, etc.

55 Classical approaches. Use molecular descriptors to represent the compounds as vectors. Select a limited number of relevant descriptors. Use linear regression, neural networks, nearest neighbours, etc.

56 SVM approach. We need a kernel $K(c_1, c_2)$ between compounds. One solution: an inner product between vectors. Alternative solution: define a kernel directly using graph comparison tools.

57-59 Example: graph kernel (Kashima et al., 2003). (figure: the molecular graph of a compound; random paths are extracted from it, giving label sequences such as C-O-C-C-C-O and N-C-C-C-O-O.)

60 Example: graph kernel (Kashima et al., 2003). Let $H_1$ be a random path of a compound $c_1$ and $H_2$ a random path of a compound $c_2$. The following is a valid kernel: $K(c_1, c_2) = \mathrm{Prob}(H_1 = H_2)$.
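
A toy Monte Carlo sketch of this idea (Kashima et al. compute the probability in closed form with dynamic programming; here it is only estimated by sampling, and the two "molecules" are invented label graphs):

```python
import random

# Toy labeled graphs standing in for molecules (adjacency lists + atom labels).
mol1 = {"labels": ["C", "N", "C", "O"],
        "edges": {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}}
mol2 = {"labels": ["C", "C", "O", "O"],
        "edges": {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}}

def random_path(mol, length, rng):
    """Label sequence of a uniform random walk of fixed length."""
    node = rng.randrange(len(mol["labels"]))
    seq = [mol["labels"][node]]
    for _ in range(length - 1):
        node = rng.choice(mol["edges"][node])
        seq.append(mol["labels"][node])
    return tuple(seq)

def mc_graph_kernel(m1, m2, length=3, n_samples=20000, seed=0):
    """Monte Carlo estimate of K(c1, c2) = Prob(H1 = H2)."""
    rng = random.Random(seed)
    hits = sum(random_path(m1, length, rng) == random_path(m2, length, rng)
               for _ in range(n_samples))
    return hits / n_samples

print(mc_graph_kernel(mol1, mol2))   # similarity between the two toy molecules
print(mc_graph_kernel(mol1, mol1))   # self-similarity is larger for these graphs
```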

61 Remarks. The feature space is infinite-dimensional (all possible paths), yet the kernel can be computed efficiently. Two compounds are compared in terms of their common substructures. What about kernels for the 3D structure?

62 Conclusion

63 Conclusion. There have been important developments in statistical learning theory recently, involving several fields: probability/statistics, functional analysis, optimization, harmonic analysis on semigroups, differential geometry, ... They result in useful, state-of-the-art algorithms in many fields. Computational biology directly benefits from these developments... but it's only the beginning.