Structured Statistical Learning with Support Vector Machine for Feature Selection and Prediction

Structured Statistical Learning with Support Vector Machine for Feature Selection and Prediction
Yoonkyung Lee
Department of Statistics, The Ohio State University
http://www.stat.ohio-state.edu/~yklee

Predictive learning = multivariate function estimation. Given a training data set $\{(x_i, y_i),\ i = 1, \ldots, n\}$, learn the functional relationship $f$ between $x = (x_1, \ldots, x_p)$ and $y$ from the training data so that it generalizes to novel cases. Examples include regression (continuous $y \in \mathbb{R}$) and classification (categorical $y \in \{1, \ldots, k\}$).

Goodness of a learning method: accurate prediction with respect to a given loss $L(y, f(x))$; flexibility (nonparametric and data-adaptive); interpretability (e.g., subset selection); computational ease for large $p$ (high-dimensional input) and large $n$ (large sample).

Support Vector Machine. Vapnik (1995), http://www.kernel-machines.org. With $y_i \in \{-1, 1\}$, find $f(x) = b + h(x)$ with $h \in \mathcal{H}_K$ minimizing
$$\frac{1}{n} \sum_{i=1}^n \big(1 - y_i f(x_i)\big)_+ + \lambda \|h\|^2_{\mathcal{H}_K}.$$
Then $\hat{f}(x) = \hat{b} + \sum_{i=1}^n \hat{c}_i K(x_i, x)$, where $K$ is a bivariate positive definite function called a reproducing kernel. Classification rule: $\phi(x) = \mathrm{sign}[f(x)]$.
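As an illustration of this objective (not the speaker's implementation), the following minimal numpy sketch fits a kernel SVM by subgradient descent on the representer coefficients; the Gaussian kernel, the step size, and the toy two-cloud data are assumptions made only for the example.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # Gaussian reproducing kernel K(x, x') = exp(-gamma * ||x - x'||^2)
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_svm(K, y, lam=0.1, lr=0.01, n_iter=2000):
    # Subgradient descent on (1/n) sum_i (1 - y_i f(x_i))_+ + lam * c' K c,
    # where f(x_i) = b + (K c)_i and ||h||^2_{H_K} = c' K c.
    n = K.shape[0]
    c, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        f = K @ c + b
        viol = (1 - y * f) > 0            # points inside the margin drive the subgradient
        g_f = -(y * viol) / n             # subgradient of the mean hinge loss w.r.t. f
        c -= lr * (K @ g_f + 2 * lam * (K @ c))
        b -= lr * g_f.sum()
    return c, b

# toy usage: two Gaussian clouds labeled -1 / +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (30, 2)), rng.normal(1.0, 1.0, (30, 2))])
y = np.r_[np.full(30, -1.0), np.full(30, 1.0)]
K = rbf_kernel(X, X)
c_hat, b_hat = fit_svm(K, y)
print("training accuracy:", np.mean(np.sign(K @ c_hat + b_hat) == y))
```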

Hinge loss. Figure: $(1 - yf(x))_+$ as a function of $t = yf(x)$; it is an upper bound of the misclassification loss $I(y \neq \phi(x)) = [-yf(x)]_* \le (1 - yf(x))_+$, where $[t]_* = I(t \ge 0)$ and $(t)_+ = \max\{t, 0\}$.
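A tiny, purely illustrative numerical check of the stated bound: the hinge loss dominates the 0-1 loss at every value of the functional margin $t = yf(x)$.

```python
import numpy as np

t = np.linspace(-2, 2, 401)            # t = y f(x), the functional margin
zero_one = (t <= 0).astype(float)      # I(y != phi(x)) = [-t]_* = I(t <= 0)
hinge = np.maximum(1 - t, 0)           # (1 - t)_+
assert np.all(hinge >= zero_one)       # the hinge loss upper-bounds the 0-1 loss everywhere
```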

Feature selection. Linear SVM with $\ell_1$ penalty [Bradley & Mangasarian (1998)]; recursive feature elimination [Guyon et al. (2002)]; rescaling parameters [Chapelle et al. (2002)]; the Least Absolute Shrinkage and Selection Operator (LASSO) [Tibshirani (1996)]; the COmponent Selection and Smoothing Operator (COSSO) [Lin & Zhang (2003)]; structural modelling with sparse kernels [Gunn & Kandola (2002)].

Strategy for feature selection: a structured representation of $f$; a sparse solution approach with an $\ell_1$ penalty; a unified treatment of the nonlinear and multiclass cases; no expensive additional computation; systematic elaboration of $f$ with features.

Functional ANOVA decomposition [Wahba (1990)].
Function: $f(x) = b + \sum_{\alpha=1}^p f_\alpha(x_\alpha) + \sum_{\alpha<\beta} f_{\alpha\beta}(x_\alpha, x_\beta) + \cdots$
Function space: $f \in \mathcal{H} = \bigotimes_{\alpha=1}^p \big(\{1\} \oplus \mathcal{H}_\alpha\big) = \{1\} \oplus \sum_{\alpha=1}^p \mathcal{H}_\alpha \oplus \sum_{\alpha<\beta} \big(\mathcal{H}_\alpha \otimes \mathcal{H}_\beta\big) \oplus \cdots$
Reproducing kernel (r.k.): $K(x, x') = 1 + \sum_{\alpha=1}^p K_\alpha(x, x') + \sum_{\alpha<\beta} K_{\alpha\beta}(x, x') + \cdots$
Modification of the r.k. by rescaling parameters $\theta \ge 0$: $K_\theta(x, x') = 1 + \sum_{\alpha=1}^p \theta_\alpha K_\alpha(x, x') + \sum_{\alpha<\beta} \theta_{\alpha\beta} K_{\alpha\beta}(x, x') + \cdots$
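A hedged sketch of how such a rescaled ANOVA kernel might be assembled, assuming Gaussian main-effect kernels and using the fact that the r.k. of a tensor product of RKHSs is the elementwise product of the component r.k.'s; the function names and the dictionary keyed by variable subsets are illustrative choices, not the paper's code.

```python
import numpy as np
from itertools import combinations

def main_effect_kernels(X, gamma=1.0):
    # One Gaussian kernel matrix per coordinate: K_alpha(x, x') = exp(-gamma (x_alpha - x'_alpha)^2)
    return {a: np.exp(-gamma * (X[:, [a]] - X[:, [a]].T) ** 2) for a in range(X.shape[1])}

def anova_kernel(X, theta, gamma=1.0):
    # K_theta = 1 + sum_a theta[(a,)] K_a + sum_{a<b} theta[(a,b)] (K_a * K_b);
    # the two-way interaction kernel is the elementwise product of main-effect kernels.
    Ks = main_effect_kernels(X, gamma)
    p = X.shape[1]
    K = np.ones((len(X), len(X)))                     # the constant component {1}
    for a in range(p):
        K += theta.get((a,), 0.0) * Ks[a]             # main effects
    for a, b in combinations(range(p), 2):
        K += theta.get((a, b), 0.0) * Ks[a] * Ks[b]   # two-way interactions
    return K

# usage: three variables; keep main effects of x0, x1, down-weight x2, switch on the (x0, x1) interaction
X = np.random.default_rng(2).normal(size=(5, 3))
theta = {(0,): 1.0, (1,): 1.0, (2,): 0.1, (0, 1): 1.0}
print(anova_kernel(X, theta))   # a 5 x 5 positive semidefinite matrix
```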

$\ell_1$ penalty on $\theta$. Truncating $\mathcal{H}$ to $\mathcal{F} = \{1\} \oplus \sum_{\nu=1}^d \mathcal{F}_\nu$, find $f(x) \in \mathcal{F}$ minimizing
$$\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda \sum_\nu \theta_\nu^{-1} \|P^\nu f\|^2.$$
Then $\hat{f}(x) = \hat{b} + \sum_{i=1}^n \hat{c}_i \Big[ \sum_{\nu=1}^d \theta_\nu K_\nu(x_i, x) \Big]$.
For sparsity, minimize
$$\frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda \sum_\nu \theta_\nu^{-1} \|P^\nu f\|^2 + \lambda_\theta \sum_\nu \theta_\nu \quad \text{subject to } \theta_\nu \ge 0 \ \forall \nu.$$

Related to kernel learning. Micchelli and Pontil (2005), Learning the kernel function via regularization, to appear in JMLR. $\mathcal{K} = \{K_\nu, \nu \in \mathcal{N}\}$: a compact and convex set of kernels. A variational problem for the optimal kernel configuration:
$$\min_{K \in \mathcal{K}} \min_{f \in \mathcal{H}_K} \left( \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)) + \lambda J(f) \right).$$

Structured MSVM with ANOVA decomposition [Lee, Lin & Wahba, JASA (2004)]. Find $f = (f^1, \ldots, f^k) = (b_1 + h_1(x), \ldots, b_k + h_k(x))$ with the sum-to-zero constraint, minimizing
$$\frac{1}{n} \sum_{i=1}^n L(y_i) \cdot \big(f(x_i) - y_i\big)_+ + \frac{\lambda}{2} \sum_{j=1}^k \sum_{\nu=1}^d \theta_\nu^{-1} \|P^\nu h_j\|^2 + \lambda_\theta \sum_{\nu=1}^d \theta_\nu \quad \text{subject to } \theta_\nu \ge 0 \text{ for } \nu = 1, \ldots, d.$$
$y = (y^1, \ldots, y^k)$: class code with $y^j = 1$ if $y = j$ and $-1/(k-1)$ elsewhere; $L(y)$: misclassification cost. By the representer theorem, $\hat{f}^j(x) = \hat{b}_j + \sum_{i=1}^n \hat{c}^j_i \Big[ \sum_{\nu=1}^d \theta_\nu K_\nu(x_i, x) \Big]$.

Updating Algorithm. Letting $C = (\{b_j\}, \{c^j_i\})$ and denoting the objective function by $\Phi(\theta, C)$: initialize $\theta^{(0)} = (1, \ldots, 1)^t$ and $C^{(0)} = \arg\min_C \Phi(\theta^{(0)}, C)$. At the $m$-th iteration ($m = 1, 2, \ldots$): ($\theta$-step) find $\theta^{(m)}$ minimizing $\Phi(\theta, C^{(m-1)})$ with $C$ fixed; ($c$-step) find $C^{(m)}$ minimizing $\Phi(\theta^{(m)}, C)$ with $\theta$ fixed. A one-step update can be used in practice.
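A minimal numpy sketch of this alternation for a binary, Gaussian-kernel version of the problem; the actual SMSVM uses the multicategory hinge loss and exact convex solvers in each step, so the per-variable kernels, step sizes, and projected-subgradient updates here are assumptions made for illustration only.

```python
import numpy as np

def per_variable_kernels(X):
    # One Gaussian kernel matrix per input variable: K_nu(x, x') = exp(-(x_nu - x'_nu)^2)
    return np.stack([np.exp(-(X[:, [v]] - X[:, [v]].T) ** 2) for v in range(X.shape[1])])

def step(Ks, theta, c, b, y, lam, lam_theta, lr, update):
    # One projected-subgradient step on either (c, b) or theta, the other block held fixed.
    # With c fixed, the penalty sum_nu theta_nu^{-1} ||P^nu f||^2 reduces to c' K_theta c.
    n = len(y)
    K = np.tensordot(theta, Ks, axes=1)          # K_theta = sum_nu theta_nu K_nu
    f = K @ c + b
    g_f = -(y * ((1 - y * f) > 0)) / n           # subgradient of the mean hinge loss w.r.t. f
    if update == "c":                            # c-step: theta fixed
        c = c - lr * (K @ g_f + 2 * lam * (K @ c))
        b = b - lr * g_f.sum()
    else:                                        # theta-step: c fixed, theta kept nonnegative
        g_theta = np.array([(Kv @ c) @ g_f + lam * c @ Kv @ c + lam_theta for Kv in Ks])
        theta = np.maximum(theta - lr * g_theta, 0.0)
    return theta, c, b

# usage: variable 0 carries the class signal, variable 1 is pure noise
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 2))
y = np.sign(X[:, 0])
Ks = per_variable_kernels(X)
theta, c, b = np.ones(2), np.zeros(80), 0.0
for _ in range(400):
    theta, c, b = step(Ks, theta, c, b, y, 0.05, 0.05, 0.05, "c")
    theta, c, b = step(Ks, theta, c, b, y, 0.05, 0.05, 0.005, "theta")
print("theta after alternating updates:", theta)  # the noise component typically shrinks toward 0
```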

Two-way regularization. The $c$-step solutions range from the simplest majority rule to a complete overfit to the data as $\lambda$ decreases. The $\theta$-step solutions range from the constant model to the full model with all the variables as $\lambda_\theta$ decreases. Is there a computational shortcut for obtaining the entire regularization path? E.g., Least Angle Regression [Efron et al. (2004)] and the SVM solution path [Hastie et al. (2004)].

$c$-step regularization path: an extension of the binary SVM solution path [Hastie et al. (2004)]. By the Karush-Kuhn-Tucker (KKT) complementarity conditions, the MSVM solution at $\lambda$ satisfies, for each $i, j$,
$$\alpha^j_i \big(f^j_i - y^j_i - \xi^j_i\big) = 0, \qquad \big(L^j_{\mathrm{cat}(i)} - \alpha^j_i\big)\, \xi^j_i = 0, \qquad 0 \le \alpha^j_i \le L^j_{\mathrm{cat}(i)}, \qquad \xi^j_i \ge 0,$$
where $f^j_i = \hat{f}^j_\lambda(x_i)$ and $\mathrm{cat}(i)$ is the category of $y_i$, so that $\big(L^1_{\mathrm{cat}(i)}, \ldots, L^k_{\mathrm{cat}(i)}\big) = L(y_i)$.

Figure: MSVM component loss $(f^j - y^j)_+$ where $y^j = -1/(k-1)$.
Elbow set: $E = \{(i, j) \mid f^j_i - y^j_i = 0,\ \xi^j_i = 0,\ 0 \le \alpha^j_i \le L^j_{\mathrm{cat}(i)}\}$;
Upper set: $U = \{(i, j) \mid f^j_i - y^j_i > 0,\ \xi^j_i > 0,\ \alpha^j_i = L^j_{\mathrm{cat}(i)}\}$;
Lower set: $L = \{(i, j) \mid f^j_i - y^j_i < 0,\ \xi^j_i = 0,\ \alpha^j_i = 0\}$.

Characterization of the entire solution path. Keep track of the events that change the elbow set. $\lambda_0 > \lambda_1 > \lambda_2 > \cdots$: a decreasing sequence of breakpoints of $\lambda$ at which the elbow set $E$ changes. Piecewise linearity of the solution: the coefficient path of the MSVM is linear in $1/\lambda$ on each interval $(\lambda_{l+1}, \lambda_l)$. Construct the path sequentially by solving a system of linear equations at each breakpoint.

Figure: The entire paths of $\hat{f}^1_\lambda(x_i)$, $\hat{f}^2_\lambda(x_i)$, and $\hat{f}^3_\lambda(x_i)$ as functions of $\log(\lambda)$ for an outlying instance $x_i$ from class 3. The circles correspond to the $\lambda$ with the minimum test error rate.

Figure: The size of the elbow set $E^j_l$ for each of the three classes as a function of $\log(\lambda)$.

Small Round Blue Cell Tumors of Childhood [Khan et al. (2001), Nature Medicine]. Tumor types: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL), and the Ewing family of tumors (EWS). Number of genes: 2308.
Class distribution of the data set:
  Data set       EWS   BL (NHL)   NB   RMS   Total
  Training set    23       8      12    20     63
  Test set         6       3       6     5     20
  Total           29      11      18    25     83

A synthetic miniature data set. It consists of 100 genes from Khan et al. (63 training and 20 test cases). The F-ratio of each gene is computed from the training cases only. The top 20 genes serve as variables truly associated with the class; the bottom 80 genes, with the class labels randomly jumbled, serve as irrelevant variables. 100 replicates are generated by bootstrapping samples from this miniature data set, keeping the class proportions the same as in the original data.
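A sketch of the two preprocessing steps described above, the per-gene F-ratio ranking and the class-stratified bootstrap, using stand-in data with hypothetical shapes; the function names are illustrative, not the analysis code used for the results that follow.

```python
import numpy as np

def f_ratio(X, y):
    # One-way ANOVA F-ratio per column (gene): between-class over within-class mean square
    classes = np.unique(y)
    grand = X.mean(axis=0)
    between = sum((y == c).sum() * (X[y == c].mean(axis=0) - grand) ** 2
                  for c in classes) / (len(classes) - 1)
    within = sum(((X[y == c] - X[y == c].mean(axis=0)) ** 2).sum(axis=0)
                 for c in classes) / (len(y) - len(classes))
    return between / within

def stratified_bootstrap(y, rng):
    # Resample indices within each class so that class proportions stay exactly the same
    idx = [rng.choice(np.where(y == c)[0], size=(y == c).sum(), replace=True)
           for c in np.unique(y)]
    return np.concatenate(idx)

# usage with stand-in data (hypothetical shapes: 63 training cases, 100 genes, 4 classes)
rng = np.random.default_rng(3)
X = rng.normal(size=(63, 100))
y = rng.integers(0, 4, size=63)
top20 = np.argsort(f_ratio(X, y))[::-1][:20]   # rank genes by F-ratio, keep the top 20
boot_idx = stratified_bootstrap(y, rng)        # indices for one of the 100 bootstrap replicates
X_boot, y_boot = X[boot_idx], y[boot_idx]
```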

The proportion of gene inclusion (%). Figure: The proportion of inclusion (%) of each gene in the final classifiers over 100 runs, plotted against variable id. The dotted line delimits informative variables from noninformative ones. 10-fold CV was used for tuning.

The original data with 2308 genes. Figure: The proportion of selection of each gene in one-step updated SMSVMs over 100 bootstrap samples, plotted against gene rank. Genes are presented in the order of their marginal rank in the original sample.

Figure: The cumulative number of genes selected less often than or as frequently as a given proportion of inclusion in 100 runs.

Summary of the full data analysis. The middle 50% of the empirical distribution of the number of genes included in one-step updates fell between 212 and 228, with a median of 221. 67 genes were consistently selected more than 95% of the time, and about 2000 genes were selected less than 20% of the time. Gene selection reduced the test error rate by 0.0230 on average (from 0.0455 to 0.0225), with a standard error of 0.00484. It also reduced the variance of the test error rates.

Concluding remarks. Integrate feature selection with the SVM using an $\ell_1$-type penalty in the general case. Enhance interpretation without compromising prediction accuracy. Construct the entire solution path of the $c$-step regularization via the optimality conditions. Further streamline the $c$-step fitting process by early stopping and basis thinning. Characterize the solution path of the $\theta$-step for effective computation and tuning.

The following papers are available from www.stat.ohio-state.edu/~yklee:
Structured Multicategory Support Vector Machine with ANOVA Decomposition, Lee, Y., Kim, Y., Lee, S., and Koo, J.-Y., Technical Report No. 743, The Ohio State University, 2004.
Characterizing the Solution Path of Multicategory Support Vector Machines, Lee, Y. and Cui, Z., Technical Report No. 754, The Ohio State University, 2005.