A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab

Outline General ideas about supervised learning (not specific to the biological domain) Training, generalization, overfitting Small bit of theory Cancer classification, gene signatures Lab preview: Nevins paper (Bild et al., Nature 2006, "Oncogenic pathways") as a concrete example SVMs in some (mathematical) detail

What is machine learning? Statistics with more than 20 variables Intersection of computer science and statistics Provisional definition: [R. Schapire] Machine learning studies how to automatically learn to make predictions based on past observations

Classification problems Classification: learn to classify examples into a given set of categories ("classes") Example of supervised learning (labeled training examples, i.e. known class labels) [Figure: example (image, label) training pairs, e.g. (·, 5), (·, 7), (·, 2)]

ML vs. Traditional Statistics L. Breiman, "Statistical Modeling: The Two Cultures", Statistical Science, 2001 Data modeling culture (generative models): assume a probabilistic model of known form with not too many parameters (<50) Fit the model to data Interpret the model and parameters, make predictions afterwards

ML vs. Traditional Statistics Algorithmic modeling culture (predictive models): learn a prediction function from inputs to outputs, possibly with many parameters (e.g. 10^2 to 10^6) Design an algorithm to find a good prediction function Primary goal: accurate predictions on new data, i.e. avoid overfitting, good generalization Interpret afterwards; finding truth is not the central goal (but is there some truth in an accurate prediction rule?) "Never solve a more difficult problem than you need to" [V. Vapnik]

Example: Generative model [Figure, Y. Freund: two class-conditional Gaussian densities (mean1, var1 and mean2, var2) plotted as probability vs. voice pitch]

Example: Prediction function [Figure, Y. Freund: number of mistakes plotted against voice pitch]

Poorly behaved training data [Figure, Y. Freund: overlapping class distributions (mean1, mean2) and number of mistakes, plotted against voice pitch]

Conditions for accurate learning Example: predict good vs. bad [R. Schapire]

An example classifier Decision tree

Another possible classifier Perfectly classifies training data, makes mistakes on test set Intuitively too complex

Yet another classifier Fails to fit the training data Overly simple

Complexity vs. accuracy Classifiers must be expressive enough to capture true patterns in the training data, but if too complex they can overfit (learn noise or spurious patterns) Problem: can't tell the best classifier from training error alone Controlling overfitting is the central problem of ML

Conditions for accurate learning To learn an accurate classifier, need: enough training examples, good performance on the training set, and control over complexity (Occam's razor) Measure complexity by: minimum description length (number of bits needed to encode the rule), number of parameters, VC dimension

Cancer classification Training data: expression data from different tumor types; few examples, high-dimensional feature space Goals: (accurately predict tumor type) Learn a gene signature = a smaller set of genes whose expression pattern discriminates between classes Feature selection problem

Oncogenic pathways [Nevins lab, Nature 2006] Training data: human cell cultures where a specific oncogenic pathway has been activated (Myc, Ras, E2F3, etc.) vs. control cells Prediction score gives the probability/confidence that the pathway is activated in a sample Test data: mouse models for the pathways, human cancer cell lines

Pathway signatures

Prediction in mouse models Rank tumors from mouse models using trained pathway vs control classifiers

Prediction scores as features Oncogenic pathway prediction scores used to represent tumors for clustering Pathway scores on cell lines correlate with response to inhibitors

Support vector machines SVMs are a family of algorithms for learning a linear classification rule from labeled training data {(x_1, y_1), ..., (x_m, y_m)}, y_i = +1 or -1 Well-motivated by learning theory Various properties of the SVM solution help avoid overfitting, even in very high dimensional feature spaces

Vector space preliminaries Inner product of two vectors: <w, x> = Σ_g w_g x_g Hyperplane with normal vector w and bias b: <w, x> - b = 0 [Figure: a hyperplane <w, x> - b = 0 with its normal vector w]

Linear classification rules SVMs consider only linear classifiers: f_{w,b}(x) = <w, x> - b Leads to linear prediction rules: h_{w,b}(x) = sign(f_{w,b}(x)) Decision boundary is a hyperplane Prediction score f_{w,b}(x) is interpreted as confidence in the prediction [Figure: hyperplane with normal vector w separating the half-spaces <w, x> - b > 0 and <w, x> - b < 0]
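
To make the linear rule concrete, here is a minimal NumPy sketch (not from the lecture; w, b, and x are made-up toy values) of the prediction score and the sign rule above.

```python
import numpy as np

def f(w, x, b):
    """Prediction score f_{w,b}(x) = <w, x> - b."""
    return np.dot(w, x) - b

def h(w, x, b):
    """Linear prediction rule h_{w,b}(x) = sign(f_{w,b}(x))."""
    return np.sign(f(w, x, b))

# Toy example in 2 dimensions
w = np.array([1.0, -2.0])   # normal vector of the hyperplane
b = 0.5                     # bias
x = np.array([3.0, 1.0])    # a test example

print(f(w, x, b))  # score; its magnitude acts as confidence
print(h(w, x, b))  # predicted class label (sign of the score)
```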

Support vector machines Assume linearly separable training data Margin of an example = distance to the separating hyperplane Margin of the training set = minimum margin over examples Choose the (unique) hyperplane that maximizes the margin Prediction score for a test example: f(x) ~ signed distance of x to the hyperplane

Geometric margin Consider training data S and a particular linear classifier f_{w,b} If ||w|| = 1, then the geometric margin of the training data for f_{w,b} is γ_S = min_i y_i (<w, x_i> - b) [Figure: separating hyperplane with normal vector w, bias b, and margin γ_S]
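
A small sketch of the geometric margin computation under the same conventions, on a toy separable data set of my own; rescaling by the norm of w enforces the ||w|| = 1 assumption from the slide.

```python
import numpy as np

def geometric_margin(X, y, w, b):
    """Geometric margin of training set S: min_i y_i (<w, x_i> - b), with w rescaled to unit norm."""
    w = np.asarray(w, dtype=float)
    scores = (X @ w - b) / np.linalg.norm(w)   # signed distances to the hyperplane
    return np.min(y * scores)

# Toy separable data in 2D
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])
print(geometric_margin(X, y, w=[1.0, 1.0], b=0.0))
```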

Maximal margin classifier Hard margin SVM: given training data S, find the linear classifier f_{w,b} with maximal geometric margin γ_S Solve an optimization problem to find the w and b that give the maximal margin solution

Hard margin SVMs Equivalently, enforce a functional margin >= 1 for every training vector, and minimize ||w|| Primal problem: minimize ½ <w, w> subject to y_i (<w, x_i> - b) >= 1 for all training vectors x_i [Figure: margin hyperplanes f_{w,b}(x) = +1 and f_{w,b}(x) = -1 on either side of the decision boundary]
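
The primal problem can be written almost verbatim with a generic QP solver. A minimal sketch using cvxpy (an assumed dependency, not something the lecture uses), on toy separable data:

```python
import numpy as np
import cvxpy as cp

def hard_margin_svm(X, y):
    """Solve: minimize 1/2 <w, w> subject to y_i (<w, x_i> - b) >= 1 for all i."""
    m, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w - b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# Toy separable data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print(w, b)
```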

Non-separable case If the training data is not linearly separable, can: penalize each example by the amount it violates the margin ("soft margin SVM") Map examples to a higher-dimensional space where the data is separable Or use a combination of the above two solutions

Soft margin SVMs Introduce a slack variable ξ_i to represent the margin violation for training vector x_i Now the constraint becomes: y_i (<w, x_i> - b) >= 1 - ξ_i [Figure: a training vector x_i inside the margin, with slack ξ_i]

Soft margin SVMs Primal optimization problem becomes: minimize ½ <w, w> + C Σ_i ξ_i ("1-norm", LIBSVM) or ½ <w, w> + C Σ_i ξ_i² ("2-norm", SVM-light) subject to y_i (<w, x_i> - b) >= 1 - ξ_i, ξ_i >= 0 C: trade-off parameter
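
In practice the 1-norm soft margin problem is usually handed to a solver such as LIBSVM; scikit-learn's SVC wraps it, so a minimal sketch on synthetic toy data (my own, purely illustrative) looks like this, with C as the trade-off parameter:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 5) + 1.0, rng.randn(20, 5) - 1.0])  # two noisy classes
y = np.array([1] * 20 + [-1] * 20)

# Linear 1-norm soft margin SVM; C trades margin width against margin violations
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.decision_function(X[:3]))   # prediction scores f_{w,b}(x)
print(clf.predict(X[:3]))             # signs of the scores
```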

Regularization viewpoint Trade-off optimization problem (1-norm soft margin): minimize ||w||² + C Σ_i max(0, 1 - y_i f_{w,b}(x_i)) max(0, 1 - y f(x)): hinge loss, penalty for margin violation ||w||²: "regularization term"; intuitively, prevents overfitting by constraining w
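
The regularization form can be evaluated directly: a short NumPy-only sketch of the objective ||w||² + C Σ hinge, writing the hinge loss as max(0, 1 - y f(x)) and using made-up toy values.

```python
import numpy as np

def hinge_loss(y, scores):
    """Hinge loss max(0, 1 - y * f(x)); zero for examples beyond the margin."""
    return np.maximum(0.0, 1.0 - y * scores)

def svm_objective(w, b, X, y, C):
    """||w||^2 + C * sum of hinge losses (1-norm soft margin, regularization form)."""
    scores = X @ w - b
    return np.dot(w, w) + C * hinge_loss(y, scores).sum()

# Toy check
X = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1, 1, -1, -1])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))
```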

Properties of SVM solution Introduce a dual variable ("weight") α_i for each constraint, i.e. for each training example Solve the dual optimization problem to find the α_i Convex quadratic problem: unique solution, good algorithms w = Σ_i α_i y_i x_i Normal vector is a linear combination of support vectors, i.e. training vectors with α_i > 0

Support vectors If x_i has margin > 1, α_i = 0 1-norm SVM: two kinds of support vectors If x_i has margin = 1, 0 < α_i < C If x_i has margin < 1, α_i = C [Figure: support vectors on the margin (0 < α_i < C) and margin violators (α_i = C)]
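
With scikit-learn's SVC the dual solution is exposed: dual_coef_ stores α_i y_i for the support vectors, so the normal vector can be rebuilt as w = Σ_i α_i y_i x_i exactly as on the slide. A sketch on synthetic toy data (details are my own, not from the lecture):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 5) + 1.0, rng.randn(20, 5) - 1.0])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only
w = clf.dual_coef_ @ clf.support_vectors_   # w = sum_i alpha_i y_i x_i
print(np.allclose(w, clf.coef_))            # matches the fitted normal vector
print(clf.support_)                         # indices of training vectors with alpha_i > 0

# Bounded support vectors (margin violators) have alpha_i == C
alphas = np.abs(clf.dual_coef_).ravel()
print(np.sum(np.isclose(alphas, clf.C)), "support vectors at the C bound")
```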

Feature selection How to extract a cancer signature? Simplest feature selection: filter on the training data E.g. apply a t-test or Fisher's criterion to find genes that discriminate between the classes Train the SVM on the reduced feature set Usually better to use the results of training to select features
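
A minimal sketch of this filter-then-train recipe on synthetic expression-like data (the t-test via scipy and the SVM via scikit-learn are my choices, not prescribed by the lecture):

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 1000)                        # 40 samples x 1000 "genes"
y = np.array([1] * 20 + [-1] * 20)
X[y == 1, :20] += 1.0                          # a small block of genes carries real signal

# Filter on the training data: per-gene two-sample t-test between the classes
t_stats, _ = ttest_ind(X[y == 1], X[y == -1], axis=0)
top_genes = np.argsort(-np.abs(t_stats))[:50]  # keep the 50 most discriminative genes

# Train the SVM on the reduced feature set
clf = SVC(kernel="linear", C=1.0).fit(X[:, top_genes], y)
```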

Ranking features Normal vector w = Σ_i α_i y_i x_i gives the direction in which prediction scores change Rank features by |w_g| to get the most significant components Recursive feature elimination (RFE): iteratively throw out the bottom half of genes ranked by |w_g|, then retrain the SVM on the remaining genes Induces a ranking on all genes
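
scikit-learn's RFE implements essentially this loop when given a linear SVM; with step=0.5 it drops the bottom half of the remaining genes at each iteration. A sketch on the same kind of synthetic data (parameter values are illustrative):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(40, 1000)               # 40 samples x 1000 "genes"
y = np.array([1] * 20 + [-1] * 20)
X[y == 1, :20] += 1.0                 # a small block of informative genes

# Iteratively retrain a linear SVM and discard the lowest-ranked half of the remaining genes
selector = RFE(estimator=SVC(kernel="linear", C=1.0),
               n_features_to_select=50,
               step=0.5)
selector.fit(X, y)

print(selector.support_.sum())    # number of genes kept (50)
print(selector.ranking_[:10])     # induced ranking over all genes (1 = kept)
```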

Kernel trick Idea: map to a higher-dimensional feature space Only need kernel values K(x_1, x_2) = <Φ(x_1), Φ(x_2)> to solve the dual optimization problem [Figure: the map Φ from input space to feature space]

Examples of kernels Large margin non-linear decision boundaries Not needed with expression data Degree 2 polynomial: K(x, z) = (<x, z> + C)² Radial basis: K(x, z) = exp(-||x - z||² / σ²)
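
A sketch of the two kernels as plain Gram-matrix functions (the values of σ and C here are illustrative hyperparameters, not from the lecture); scikit-learn's SVC accepts such a callable via its kernel argument, so either could be plugged into the soft margin SVM above.

```python
import numpy as np

def poly2_kernel(X, Z, c=1.0):
    """Degree-2 polynomial kernel: K(x, z) = (<x, z> + c)^2, evaluated for all pairs."""
    return (X @ Z.T + c) ** 2

def rbf_kernel(X, Z, sigma=1.0):
    """Radial basis kernel: K(x, z) = exp(-||x - z||^2 / sigma^2), evaluated for all pairs."""
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Z ** 2, axis=1)[None, :]
                - 2.0 * X @ Z.T)
    return np.exp(-sq_dists / sigma ** 2)

# Usage with scikit-learn (a callable kernel receives two data matrices and must
# return the Gram matrix), e.g.:
#   from sklearn.svm import SVC
#   clf = SVC(kernel=lambda A, B: rbf_kernel(A, B, sigma=2.0), C=1.0)
```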

Discussion issues for paper How well-defined is a cancer signature? How stable is feature selection on a small data set? Empirical validation of the gene set, number of genes? Which analyses are purely training-data results, and which show prediction performance? Significance of prediction performance? Traditional ML does not assert significance via a p-value but by comparison against other methods Can compare to a baseline method, e.g. single oncogene expression level

Be careful! Potti et al., Nature Medicine 2006: similar analysis to predict response to chemotherapy, based on NCI-60 cell line data Coombes et al., Nature Medicine 2007: bioinformatics forensics, unable to reproduce the results Mislabeling of samples (+ vs -) Off-by-one indexing error, wrong genes in the signature No separation of training and test data for feature reduction ("metagene"), not strictly inductive learning Summary: poor computational practices and (probably) overfitting lead to erroneous results