A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data

Yoonkyung Lee, Department of Statistics, The Ohio State University
http://www.stat.ohio-state.edu/~yklee
May 13, 2005

Microarray Data
Relative amounts of mRNAs for tens of thousands of genes (e.g. Affymetrix Human U133A GeneChips: 22K genes). Gene expression profiles serve as fingerprints at the molecular level, with potential for cancer diagnosis or prognosis.
Examples: Golub et al. (1999), Science: acute leukemia classification. Perou et al. (2000), Nature: breast cancer subclass identification. van 't Veer et al. (2002), Nature: breast cancer prognosis.

Objectives and related issues
Accurate diagnosis or prognosis. Interpretability. Subset selection or dimension reduction. A systematic approach to gene (variable) selection. Assessment of the variability in analysis results.

Small Round Blue Cell Tumors of Childhood
Khan et al. (2001), Nature Medicine. Tumor types: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing family of tumors (EWS). Number of genes: 2308.

Class distribution of the data set:
               EWS   BL(NHL)   NB   RMS   Total
Training set    23       8     12    20     63
Test set         6       3      6     5     20
Total           29      11     18    25     83

Classification problem
A training data set $\{(x_i, y_i),\ i = 1, \dots, n\}$, where $y_i \in \{1, \dots, k\}$. Learn a functional relationship $f$ between $x = (x_1, \dots, x_p)$ and $y$ from the training data, which can be generalized to novel cases.

Measures of marginal association for gene selection
Pick genes with the largest marginal association. Two-class case: two-sample t-test statistic, correlation, etc. Multi-class case: Dudoit et al. (2000). For gene $l$, the ratio of the between-class sum of squares to the within-class sum of squares is
$$\frac{BSS(l)}{WSS(l)} = \frac{\sum_{i=1}^{n}\sum_{j=1}^{k} I(y_i = j)\,(\bar{x}_l^{(j)} - \bar{x}_l)^2}{\sum_{i=1}^{n}\sum_{j=1}^{k} I(y_i = j)\,(x_{il} - \bar{x}_l^{(j)})^2}.$$
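
As a rough illustration (not part of the talk), the screening statistic above can be computed per gene with a few lines of NumPy; the function name and arguments are hypothetical.

```python
import numpy as np

def bss_wss_ratio(X, y):
    """Between-class to within-class sum-of-squares ratio for each gene.

    X: (n_samples, n_genes) expression matrix; y: integer class labels.
    A minimal sketch of the marginal screening statistic above.
    """
    overall_mean = X.mean(axis=0)               # \bar{x}_l for each gene l
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for j in np.unique(y):
        Xj = X[y == j]
        class_mean = Xj.mean(axis=0)            # \bar{x}_l^{(j)}
        bss += Xj.shape[0] * (class_mean - overall_mean) ** 2
        wss += ((Xj - class_mean) ** 2).sum(axis=0)
    return bss / wss

# Genes would then be ranked by this ratio, largest first:
# top_genes = np.argsort(bss_wss_ratio(X, y))[::-1][:d]
```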

Figure: Observed ratios of between-class SS to within-class SS (gene order on a log10 scale) and the 95th percentiles of the corresponding ratios for expression levels with randomly permuted class labels.

Figure: Pairwise distance matrices for the training data (classes EWS, NHL, NB, RMS) as the number of genes included changes, and test error rates (TE) of the MSVM with Gaussian kernel: d = 1; d = 5, TE = 5-10%; d = 100 (20~500), TE = 0%; d = all, TE = 0-15%.

Support Vector Machines
Vapnik (1995), http://www.kernel-machines.org. Here $y_i \in \{-1, 1\}$. Find $f(x) = b + h(x)$ with $h \in \mathcal{H}_K$ minimizing
$$\frac{1}{n}\sum_{i=1}^{n}(1 - y_i f(x_i))_+ + \lambda \|h\|_{\mathcal{H}_K}^2.$$
Then $\hat{f}(x) = \hat{b} + \sum_{i=1}^{n} \hat{c}_i K(x_i, x)$, where $K$ is a bivariate positive definite function called a reproducing kernel. Classification rule: $\phi(x) = \mathrm{sign}[f(x)]$.
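
To make the objective concrete, here is a minimal NumPy sketch (not from the talk) that evaluates the regularized hinge-loss criterion for a kernel expansion in the representer form; the function names and the choice of a Gaussian kernel are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Gaussian reproducing kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def svm_objective(c, b, K, y, lam):
    """(1/n) sum_i (1 - y_i f(x_i))_+ + lam * ||h||^2_{H_K}.

    Uses the representer form f(x_i) = b + (K c)_i, so ||h||^2_{H_K} = c' K c.
    y takes values in {-1, +1}.
    """
    f = b + K @ c
    hinge = np.maximum(0.0, 1.0 - y * f)   # (1 - y_i f(x_i))_+
    return hinge.mean() + lam * (c @ K @ c)
```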

Hinge loss
Figure: $(1 - yf(x))_+$ is an upper bound on the misclassification loss $I(y \neq \phi(x)) = [-yf(x)]_* \le (1 - yf(x))_+$, where $[t]_* = I(t \ge 0)$ and $(t)_+ = \max\{t, 0\}$.

Figure: Linear support vector machine in the separable case. Red and blue circles indicate positive and negative examples, respectively. The solid line is the separating hyperplane $w'x + b = 0$ with the maximum margin $2/\|w\|$; the lines $w'x + b = \pm 1$ mark the margin boundaries.
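
A one-line check of the stated margin, using the standard canonical scaling (the slide does not spell this out): the distance from a point $x_0$ to the hyperplane $w'x + b = 0$ is $|w'x_0 + b|/\|w\|$, and with the scaling $w'x + b = \pm 1$ at the closest points of the two classes, the gap between the two margin hyperplanes is
$$\frac{(+1) - (-1)}{\|w\|} = \frac{2}{\|w\|},$$
which is the quantity the separable-case SVM maximizes.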

Characteristics of Support Vector Machines
Competitive classification accuracy. Flexibility via implicit embedding through the kernel. Handles high-dimensional data. A black box unless the embedding is explicit.

Variable (feature) selection
Best subset selection. Nonnegative garrote [Breiman, Technometrics (1995)]. Least Absolute Shrinkage and Selection Operator [Tibshirani, JRSS (1996)]. COmponent Selection and Smoothing Operator [Lin & Zhang, Technical Report (2003)]. Structural modelling with sparse kernels [Gunn & Kandola, Machine Learning (2002)].

Functional ANOVA decomposition
Wahba (1990). Function:
$$f(x) = b + \sum_{\alpha=1}^{p} f_\alpha(x_\alpha) + \sum_{\alpha<\beta} f_{\alpha\beta}(x_\alpha, x_\beta) + \cdots$$
Functional space: $f \in \mathcal{H} = \bigotimes_{\alpha=1}^{p}\left(\{1\} \oplus \mathcal{H}_\alpha\right)$, with
$$\mathcal{H} = \{1\} \oplus \sum_{\alpha=1}^{p} \mathcal{H}_\alpha \oplus \sum_{\alpha<\beta}\left(\mathcal{H}_\alpha \otimes \mathcal{H}_\beta\right) \oplus \cdots$$
Reproducing kernel (r.k.):
$$K(x, x') = 1 + \sum_{\alpha=1}^{p} K_\alpha(x, x') + \sum_{\alpha<\beta} K_{\alpha\beta}(x, x') + \cdots$$
Modification of the r.k. by rescaling parameters $\theta \ge 0$:
$$K_\theta(x, x') = 1 + \sum_{\alpha=1}^{p} \theta_\alpha K_\alpha(x, x') + \sum_{\alpha<\beta} \theta_{\alpha\beta} K_{\alpha\beta}(x, x') + \cdots$$
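
A minimal sketch (not from the talk) of how the rescaled kernel $K_\theta$ might be assembled, keeping only the main-effect terms and using per-coordinate Gaussian kernels as an assumed choice; the function names are hypothetical.

```python
import numpy as np

def main_effect_kernels(X1, X2, sigma=1.0):
    """One kernel per covariate: K_alpha compares only coordinate alpha."""
    return [np.exp(-(X1[:, [a]] - X2[:, [a]].T) ** 2 / (2.0 * sigma ** 2))
            for a in range(X1.shape[1])]

def rescaled_kernel(kernels, theta):
    """K_theta(x, x') = 1 + sum_alpha theta_alpha K_alpha(x, x'), theta >= 0.

    Setting theta_alpha = 0 removes covariate alpha from the fitted function,
    which is what makes the rescaling parameters a device for variable selection.
    """
    K = np.ones_like(kernels[0])
    for K_a, th in zip(kernels, theta):
        K = K + th * K_a
    return K
```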

$\ell_1$ penalty on $\theta$
Truncating $\mathcal{H}$ to $\mathcal{F} = \{1\} \oplus \sum_{\nu=1}^{d} \mathcal{F}_\nu$, find $f(x) \in \mathcal{F}$ minimizing
$$\frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \sum_{\nu} \theta_\nu^{-1} \|P^\nu f\|^2.$$
Then $\hat{f}(x) = \hat{b} + \sum_{i=1}^{n} \hat{c}_i \sum_{\nu=1}^{d} \theta_\nu K_\nu(x_i, x)$. For sparsity, minimize
$$\frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \sum_{\nu} \theta_\nu^{-1} \|P^\nu f\|^2 + \lambda_\theta \sum_{\nu} \theta_\nu$$
subject to $\theta_\nu \ge 0$ for all $\nu$.

Structured MSVM with ANOVA decomposition
Lee, Lin & Wahba, JASA (2004). Find $f = (f_1, \dots, f_k) = (b_1 + h_1(x), \dots, b_k + h_k(x))$ with the sum-to-zero constraint minimizing
$$\frac{1}{n}\sum_{i=1}^{n} L(y_i) \cdot (f(x_i) - y_i)_+ + \frac{\lambda}{2} \sum_{j=1}^{k} \sum_{\nu=1}^{d} \theta_\nu^{-1} \|P^\nu h_j\|^2 + \lambda_\theta \sum_{\nu=1}^{d} \theta_\nu$$
subject to $\theta_\nu \ge 0$ for $\nu = 1, \dots, d$. By the representer theorem, $\hat{f}_j(x) = \hat{b}_j + \sum_{i=1}^{n} \hat{c}_i^{j} \sum_{\nu=1}^{d} \theta_\nu K_\nu(x_i, x)$.
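
For concreteness, a small sketch (not from the talk) of the multicategory data-fit term above, using the class coding of Lee, Lin & Wahba (2004); the function name and the assumption of equal misclassification costs are mine.

```python
import numpy as np

def msvm_data_fit(F, y, k):
    """(1/n) sum_i L(y_i) . (f(x_i) - y_i)_+ for the multicategory SVM.

    F: (n, k) matrix of decision values f(x_i); rows should sum to zero.
    y: integer labels in {0, ..., k-1}.
    Coding: y_i has 1 in the true-class coordinate and -1/(k-1) elsewhere;
    L(y_i) is 1 off the true class and 0 on it, so the true-class coordinate
    is never penalized.
    """
    n = F.shape[0]
    Y = np.full((n, k), -1.0 / (k - 1))
    Y[np.arange(n), y] = 1.0
    L = np.ones((n, k))
    L[np.arange(n), y] = 0.0
    return (L * np.maximum(0.0, F - Y)).sum() / n
```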

Updating algorithm
Denoting the objective function by $\Phi(\theta, b, C)$: initialize $\theta^{(0)} = (1, \dots, 1)^t$ and $(b^{(0)}, C^{(0)}) = \arg\min \Phi(\theta^{(0)}, b, C)$. At the $m$-th step ($m = 1, 2, \dots$):
$\theta$-step: find $\theta^{(m)}$ minimizing $\Phi(\theta, b^{(m-1)}, C^{(m-1)})$ with $(b, C)$ fixed.
c-step: find $(b^{(m)}, C^{(m)})$ minimizing $\Phi(\theta^{(m)}, b, C)$ with $\theta$ fixed.
A one-step update can be used in practice.
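
A minimal sketch (not from the talk) of this alternating scheme, with the two step solvers assumed to be supplied externally (in the SMSVM each step is a standard convex program); all names are hypothetical.

```python
def fit_smsvm(solve_cb, solve_theta, d, n_steps=1):
    """Alternating minimization of Phi(theta, b, C).

    solve_cb(theta) -> (b, C): minimizes Phi over (b, C) with theta fixed.
    solve_theta(b, C) -> theta: minimizes Phi over theta >= 0 with (b, C) fixed.
    n_steps=1 corresponds to the one-step update used in practice.
    """
    theta = [1.0] * d                 # theta^(0) = (1, ..., 1)
    b, C = solve_cb(theta)            # (b^(0), C^(0))
    for _ in range(n_steps):
        theta = solve_theta(b, C)     # theta-step
        b, C = solve_cb(theta)        # c-step
    return theta, b, C
```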

A toy example: visualization of the θ-step
Figure: Left, the scatter plot of two covariates ($x_1$, $x_2$) with class labels distinguished by color. Right, the feasible region of $(\theta_1, \theta_2)$ in gray and a level plot of the θ-step objective function.

The trajectory of θ as a function of $\lambda_\theta$
Figure: The trajectories of $\theta_1$, $\theta_2$ (marked +), and $\theta_{12}$ against $\log_2(\lambda_\theta)$ for the two-way interaction spline kernel, with $\lambda_\theta$ tuned by GCKL. The overlap of the markers indicates that the trajectories of $\theta_2$ and $\theta_{12}$ are almost indistinguishable.

A synthetic miniature data set
It consists of 100 genes from Khan et al. (63 training and 20 test cases). The F-ratio for each gene is computed using the training cases only. The top 20 genes are treated as variables truly associated with the class; the bottom 80 genes, with class labels randomly jumbled, serve as irrelevant variables. 100 replicates are generated by bootstrapping samples from this miniature data set, keeping the class proportions the same as in the original data.
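
A small sketch (not from the talk) of one way to draw such class-proportion-preserving bootstrap replicates; the function name and the within-class resampling scheme are assumptions.

```python
import numpy as np

def stratified_bootstrap(X, y, n_replicates=100, seed=0):
    """Bootstrap samples that keep the class proportions of the original data.

    Resamples with replacement within each class so every replicate has the
    same class counts as (X, y).
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_replicates):
        idx = np.concatenate([
            rng.choice(np.where(y == c)[0], size=(y == c).sum(), replace=True)
            for c in np.unique(y)
        ])
        yield X[idx], y[idx]
```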

The proportion of gene inclusion (%)
Figure: The proportion of inclusion (%) of each gene in the final classifiers over 100 runs, by variable id. The dotted line separates informative variables from noninformative ones. 10-fold CV was used for tuning.

Summary of the synthetic miniature data analysis with the structured MSVM
13 informative genes were selected in more than 95% of the runs. 74 noninformative genes were picked in fewer than 5% of the runs. Gene selection decreased the test error rate over the 20 test cases by 0.0095 on average (from 0.0555 to 0.0460), with standard error 0.00413.

The original data with 2308 genes
Figure: The proportion of selection of each gene in one-step updated SMSVMs over 100 bootstrap samples. Genes are ordered by their marginal rank in the original sample.

Figure: The cumulative number of genes selected less often than, or as frequently as, a given proportion of inclusion in the 100 runs.

Summary of the full data analysis
The empirical distribution of the number of genes included in one-step updates had its middle 50% between 212 and 228, with a median of 221. 67 genes were consistently selected more than 95% of the time, while about 2000 genes were selected less than 20% of the time. Gene selection reduced the test error rate by 0.0230 on average (from 0.0455 to 0.0225), with standard error 0.00484. It also reduced the variance of the test error rates.

Concluding remarks
Integrate feature selection with learning of classification rules. Enhance interpretability without compromising prediction accuracy. Characterize the solution path of the SMSVM for effective computation and tuning, in the spirit of Efron et al., Least Angle Regression (2004) and Hastie et al., SVM solution path (2004). c-step: the entire spectrum of solutions, from the simplest majority rule to a complete overfit of the data. θ-step: the entire spectrum of solutions, from the constant model to the full model with all the variables.

The following papers are available from www.stat.ohio-state.edu/~yklee:
Lee, Y., Kim, Y., Lee, S., and Koo, J.-Y. (2004). Structured Multicategory Support Vector Machine with ANOVA Decomposition. Technical Report No. 743, The Ohio State University.
Lee, Y. and Cui, Z. (2005). Characterizing the Solution Path of Multicategory Support Vector Machines. Technical Report No. 754, The Ohio State University.