Multicategory Vertex Discriminant Analysis for High-Dimensional Data


Multicategory Vertex Discriminant Analysis for High-Dimensional Data
Tong Tong Wu, Department of Epidemiology and Biostatistics, University of Maryland, College Park
October 8, 2010
Joint work with Prof. Kenneth Lange at UCLA; to appear in AOAS.

Outline
1 Introduction
2 Vertex Discriminant Analysis (VDA)
  - Category Indicator: Equidistant Points in R^(k-1)
  - ε-insensitive Loss Function for VDA
  - Cyclic Coordinate Descent for VDA_L
  - Euclidean Penalty for Grouped Effects (VDA_E, VDA_LE)
3 Fisher Consistency of ε-insensitive Loss
4 Numerical Examples
5 Discussion

Introduction

Overview of Vertex Discriminant Analysis (VDA)
- A new method of supervised learning
- Linear discrimination among the vertices of a regular simplex in Euclidean space, with each vertex representing a different category
- Minimization of ε-insensitive residuals plus penalties on the coefficients of the linear predictors
- Classification and variable selection performed simultaneously
- Minimization by a primal MM algorithm or a coordinate descent algorithm
- Fisher consistency of the ε-insensitive loss
- Statistical accuracy and computational speed

Motivation: Cancer Subtype Classification
- sheer scale of cancer data sets
- prevalence of multicategory problems
- excess of predictors over cases
- exceptional speed and memory capacity of modern computers

Review of Discriminant Analysis
- Purpose: categorize objects based on a fixed number of observed features x ∈ R^p
- Observations: category membership indicator y and feature vector x ∈ R^p
- Discriminant rule: divide R^p into disjoint regions corresponding to the different categories
- Supervised learning: begin with a set of fully categorized cases (training data) and build discriminant rules from them
- Given a loss function L(y, x), minimize either the expected loss E[L(Y, X)] = E{E[L(Y, X) | X]} or the average conditional loss (1/n) Σ_{i=1}^n L(y_i, x_i) with a penalty term

Multicategory Problems in SVM
- Solving a series of binary problems:
  - One-versus-rest (OVR): k binary classifications, but poor performance when no dominating class exists (Lee et al. 2004)
  - Pairwise comparisons: k(k-1)/2 comparisons, a violation of the criterion of parsimony (Kressel 1999)
- Considering all classes simultaneously (Bredensteiner and Bennett 1999; Crammer and Singer 2001; Guermeur 2002; Lee et al. 2004; Liu et al. 2005, 2006; Liu 2007; Vapnik 1998; Weston and Watkins 1999; Zhang 2004b; Zou et al. 2006)
- Closely related work: L1MSVM (Wang and Shen 2007) and L2MSVM (Lee et al. 2004)

Vertex Discriminant Analysis (VDA)

Notation
- n: number of observations
- p: dimension of feature space
- k: number of categories

Questions for Multicategory Discriminant Analysis
- How to choose category indicators?
- How to choose a loss function?
- How to minimize the loss function?

Equidistant Points in R^(k-1)
Question: How to choose class indicators?
Proposition: It is possible to choose k equidistant points in R^(k-1), but not k + 1 equidistant points, under the Euclidean norm.
The points occur at the vertices of a regular simplex:
- 2 classes: -1, 1 (line)
- 3 classes: the vertices of an equilateral triangle circumscribed by the unit circle (plane)
- k classes: the vertices v_1, ..., v_k of a regular simplex in R^(k-1)

[Figure: plot of the indicator vertices for 3 classes, the vertices of an equilateral triangle in the plane.]
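These vertex indicators are easy to construct numerically. The sketch below is one standard construction, not necessarily the parameterization used in the paper: center the canonical basis of R^k, rotate it into its (k-1)-dimensional span, and rescale the vertices to unit length.

```python
import numpy as np

def simplex_vertices(k):
    """Return a k x (k-1) array whose rows are k equidistant unit vectors in R^(k-1).

    One standard construction (center the canonical basis of R^k and rotate it into
    its (k-1)-dimensional span); any rotation of a regular simplex serves equally well."""
    E = np.eye(k)                                   # canonical basis e_1, ..., e_k in R^k
    C = E - E.mean(axis=0)                          # center so the vertices sum to zero
    ones = np.ones((k, 1)) / np.sqrt(k)             # unit vector along the all-ones direction
    Q, _ = np.linalg.qr(np.hstack([ones, np.eye(k)[:, : k - 1]]))
    B = Q[:, 1:k]                                   # orthonormal basis of the complement of 1
    V = C @ B                                       # coordinates of the centered vertices in R^(k-1)
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # rescale every vertex to unit length
    return V

V = simplex_vertices(3)
# pairwise distances are all sqrt(2k/(k-1)), i.e. sqrt(3) for k = 3
print(np.round([np.linalg.norm(V[i] - V[j]) for i in range(3) for j in range(i + 1, 3)], 4))
```

For k = 3 the printed pairwise distances are all √3 ≈ 1.732, matching the inter-vertex distance √(2k/(k-1)) of a regular simplex inscribed in the unit sphere.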

Ridge-Penalized ε-insensitive Loss Function for VDA_R
- ε-insensitive Euclidean loss: f(z) = ||z||_ε = max{ ||z||_2 - ε, 0 }
- Linear classifier y = Ax + b to maintain parsimony
- Penalties on the slopes a_jl imposed to avoid overfitting
- Classification proceeds by minimizing the objective function
  R(A, b) = (1/n) Σ_{i=1}^n f(y_i - A x_i - b) + λ_R Σ_{j=1}^{k-1} Σ_{l=1}^p a_jl^2
          = (1/n) Σ_{i=1}^n f(y_i - A x_i - b) + λ_R Σ_{l=1}^p ||a_l||^2,
  where y_i is the vertex assignment for case i, a_j^t is the jth row of the (k-1) x p matrix A of regression coefficients, b is a (k-1)-vector of intercepts, and a_l is the lth column of A.
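To make the pieces concrete, here is a minimal sketch (not the authors' code) of the ε-insensitive Euclidean loss and the ridge-penalized VDA_R objective as written above:

```python
import numpy as np

def eps_insensitive_loss(Z, eps):
    """epsilon-insensitive Euclidean loss f(z) = max(||z||_2 - eps, 0), applied to each row of Z."""
    return np.maximum(np.linalg.norm(Z, axis=1) - eps, 0.0)

def vda_r_objective(A, b, X, Y, eps, lam_r):
    """Ridge-penalized VDA objective (1/n) sum_i f(y_i - A x_i - b) + lam_r * sum_{j,l} a_{jl}^2.
    X is n x p, Y is n x (k-1) with rows equal to the assigned vertices,
    A is (k-1) x p and b has length k-1."""
    residuals = Y - X @ A.T - b            # n x (k-1) residual vectors y_i - A x_i - b
    return eps_insensitive_loss(residuals, eps).mean() + lam_r * np.sum(A ** 2)
```

A natural choice for eps, used later in the slides, is half the distance between simplex vertices, ε = √(2k/(k-1))/2.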

Multivariate Regression
Prediction function y_i = A x_i + b for the ith observation, written out componentwise:
  (v_{y_i,1}, ..., v_{y_i,k-1})^t = A (x_i1, ..., x_ip)^t + (b_1, ..., b_{k-1})^t,
where A = (a_jl) is the (k-1) x p matrix of slopes.
Stacking the n observations row by row gives the linear system
  (y_1, ..., y_n)^t = (x_1, ..., x_n)^t A^t + 1 b^t,
in which every row of the response matrix equals x_i^t A^t plus the intercept row b^t.
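Once A and b are estimated, a new case is assigned to the class whose vertex lies closest to its fitted value (as in the Fisher-consistency discussion later in the slides). A minimal sketch of that prediction step, assuming a vertex matrix V like the one constructed earlier:

```python
import numpy as np

def classify(X, A, b, V):
    """Assign each row of X to the class whose simplex vertex is nearest to the
    fitted value A x + b.  X: n x p, A: (k-1) x p, b: (k-1,), V: k x (k-1) vertex matrix."""
    Yhat = X @ A.T + b                                           # n x (k-1) fitted vectors
    d2 = ((Yhat[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # n x k squared distances to vertices
    return d2.argmin(axis=1)                                     # class labels 0, ..., k-1
```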

A Toy Example for VDA
- A training sample over k = 3 classes, each observation carrying a single normally distributed predictor with unit variance and mean μ = -4 for class 1, μ = 0 for class 2, μ = 4 for class 3
- Compare:
  1. least squares with class indicators v_j equated to e_j in R^3 (indicator regression)
  2. least squares with class indicators v_j equated to the vertices of an equilateral triangle
  3. ε-insensitive loss with the triangular vertices and ε = 0.6
  4. ε-insensitive loss with the triangular vertices and ε = √(2k/(k-1))/2 = 0.866

[Figure: four panels plotting distance to the class representation as a function of x, for squared Euclidean loss with the indicators e_j, squared Euclidean loss with the triangle vertices, ε-insensitive loss with ε = 0.6, and ε-insensitive loss with ε = 0.866.]

Modified ε-insensitive Loss
To obtain a differentiable loss, the kink of f at ||v|| = ε is smoothed over the band (ε - δ, ε + δ):
  g(v) = ||v|| - ε                            if ||v|| ≥ ε + δ
  g(v) = a polynomial interpolant in ||v||    if ε - δ < ||v|| < ε + δ
  g(v) = 0                                    if ||v|| ≤ ε - δ
[Figure: the ε-insensitive loss f(x) and its smoothed modification g(x).]
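As one concrete possibility (an assumption, not necessarily the paper's exact polynomial), the quadratic (||v|| - ε + δ)^2 / (4δ) interpolates smoothly: it matches the value and slope of the two outer branches at both ends of the band.

```python
import numpy as np

def smoothed_eps_loss(Z, eps, delta):
    """Smoothed eps-insensitive Euclidean loss applied to each row of Z:
    0 inside eps - delta, ||z|| - eps outside eps + delta, and a quadratic
    interpolant in between.  The quadratic (||z|| - eps + delta)^2 / (4 delta)
    is an illustrative choice, not necessarily the paper's exact polynomial."""
    r = np.linalg.norm(Z, axis=1)
    mid = (r - eps + delta) ** 2 / (4.0 * delta)
    return np.where(r >= eps + delta, r - eps,
                    np.where(r <= eps - delta, 0.0, mid))
```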

VDA_L
- Minimize the objective function
  R(A, b) = (1/n) Σ_{i=1}^n g(y_i - A x_i - b) + λ_L Σ_{j=1}^{k-1} Σ_{l=1}^p |a_jl|
- Although R(A, b) is non-differentiable, it possesses forward and backward directional derivatives along each coordinate direction
- Related work: L1MSVM (Wang and Shen 2007)

Cyclic Coordinate Descent
The forward and backward directional derivatives along e_jl are
  d_{e_jl} R(A, b) = lim_{τ ↓ 0} [R(θ + τ e_jl) - R(θ)] / τ = (1/n) Σ_{i=1}^n ∂g(r_i)/∂a_jl + { λ_L if a_jl ≥ 0;  -λ_L if a_jl < 0 }
and
  d_{-e_jl} R(A, b) = lim_{τ ↓ 0} [R(θ - τ e_jl) - R(θ)] / τ = -(1/n) Σ_{i=1}^n ∂g(r_i)/∂a_jl + { -λ_L if a_jl > 0;  λ_L if a_jl ≤ 0 }
- If both d_{e_jl} R(A, b) and d_{-e_jl} R(A, b) are nonnegative, skip the update of a_jl
- If either directional derivative is negative, solve for the minimum in the corresponding direction

Newton's Updates
If r_i^m denotes the ith residual at iteration m, the coordinate updates are
  a_jl^{m+1} = a_jl^m - [ (1/n) Σ_{i=1}^n ∂g(r_i^m)/∂a_jl + { λ_L if a_jl^m ≥ 0;  -λ_L if a_jl^m < 0 } ] / [ (1/n) Σ_{i=1}^n ∂²g(r_i^m)/∂a_jl² ]
and
  b_j^{m+1} = b_j^m - [ (1/n) Σ_{i=1}^n ∂g(r_i^m)/∂b_j ] / [ (1/n) Σ_{i=1}^n ∂²g(r_i^m)/∂b_j² ]
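A rough, self-contained sketch of this scheme appears below. It is not the authors' implementation: the derivatives of the smooth part are approximated by finite differences, and each accepted coordinate move is a damped Newton-style step with simple backtracking, but the skip test follows the directional-derivative rule above.

```python
import numpy as np

def cyclic_coordinate_descent(X, Y, eps, delta, lam, sweeps=20, tau=1e-4):
    """Sketch of cyclic coordinate descent for the lasso-penalized VDA objective
    R(A, b) = (1/n) sum_i g(y_i - A x_i - b) + lam * sum_{j,l} |a_{jl}|.
    Illustrative only: derivatives of the smooth part are finite differences and
    the update is a damped Newton-style step with backtracking."""
    n, p = X.shape
    k1 = Y.shape[1]
    A, b = np.zeros((k1, p)), np.zeros(k1)

    def g_bar(A_, b_):
        # (1/n) sum_i g(r_i) with the smoothed eps-insensitive loss (see earlier sketch)
        r = np.linalg.norm(Y - X @ A_.T - b_, axis=1)
        mid = (r - eps + delta) ** 2 / (4.0 * delta)
        return np.where(r >= eps + delta, r - eps,
                        np.where(r <= eps - delta, 0.0, mid)).mean()

    for _ in range(sweeps):
        for j in range(k1):
            for l in range(p):
                def smooth_of(v):                  # smooth part as a function of a_{jl} alone
                    A_ = A.copy(); A_[j, l] = v
                    return g_bar(A_, b)
                v0, s0 = A[j, l], smooth_of(A[j, l])
                grad = (smooth_of(v0 + tau) - smooth_of(v0 - tau)) / (2 * tau)
                curv = max((smooth_of(v0 + tau) + smooth_of(v0 - tau) - 2 * s0) / tau ** 2, 1e-8)
                d_fwd = grad + (lam if v0 >= 0 else -lam)    # forward directional derivative
                d_bwd = -grad + (-lam if v0 > 0 else lam)    # backward directional derivative
                if d_fwd >= 0 and d_bwd >= 0:
                    continue                                  # both nonnegative: skip a_{jl}
                direction, slope = (1.0, d_fwd) if d_fwd < 0 else (-1.0, d_bwd)
                step = -slope / curv                          # Newton-style step length (> 0)
                obj0 = s0 + lam * abs(v0)
                while (smooth_of(v0 + direction * step) + lam * abs(v0 + direction * step) > obj0
                       and step > 1e-12):
                    step *= 0.5                               # backtrack to guarantee descent
                A[j, l] = v0 + direction * step
        # the unpenalized intercepts b_j can be updated by the analogous one-dimensional
        # Newton step without the lasso term (omitted here for brevity)
    return A, b
```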

Euclidean Penalty for Grouped Effects
- Selection of groups of variables rather than individual variables ("all-in-or-all-out")
- The Euclidean norm λ_E ||a_l|| is an ideal group penalty since it couples parameters and preserves convexity (Wu and Lange 2008)
- In multicategory classification, the slopes of a single predictor across the dimensions of R^(k-1) form a natural group
- In VDA_E, we minimize the objective function
  R(A, b) = (1/n) Σ_{i=1}^n g(y_i - A x_i - b) + λ_E Σ_{l=1}^p ||a_l||
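A minimal sketch of the group penalty on the columns of A (one group per predictor); because a column norm vanishes only when every slope of that predictor is zero, the penalty selects predictors in the all-in-or-all-out fashion described above.

```python
import numpy as np

def euclidean_group_penalty(A, lam_e):
    """Group penalty lam_e * sum_l ||a_l||_2, where a_l is the l-th column of A,
    i.e. the slopes of predictor l across the k-1 response dimensions."""
    return lam_e * np.linalg.norm(A, axis=0).sum()

A = np.array([[1.0, 0.0, 2.0],
              [3.0, 0.0, 4.0]])           # the second predictor carries no slopes at all
print(euclidean_group_penalty(A, 0.5))    # only the two nonzero columns contribute
```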

VDA_LE
In VDA_LE, we minimize the objective function
  R(A, b) = (1/n) Σ_{i=1}^n g(y_i - A x_i - b) + λ_L Σ_{j=1}^{k-1} Σ_{l=1}^p |a_jl| + λ_E Σ_{l=1}^p ||a_l||
- λ_E = 0 recovers VDA_L
- λ_L = 0 recovers VDA_E
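Putting the pieces together, a compact sketch of the combined objective (again using the illustrative quadratic interpolant for g noted earlier, an assumption rather than the paper's exact formula):

```python
import numpy as np

def vda_le_objective(A, b, X, Y, eps, delta, lam_l, lam_e):
    """VDA_LE objective: smoothed eps-insensitive loss plus a lasso penalty on the
    individual slopes and a Euclidean penalty on the columns of A.
    lam_e = 0 gives VDA_L; lam_l = 0 gives VDA_E."""
    r = np.linalg.norm(Y - X @ A.T - b, axis=1)                  # residual lengths ||y_i - A x_i - b||
    mid = (r - eps + delta) ** 2 / (4.0 * delta)                 # illustrative interpolant
    g = np.where(r >= eps + delta, r - eps, np.where(r <= eps - delta, 0.0, mid))
    return g.mean() + lam_l * np.abs(A).sum() + lam_e * np.linalg.norm(A, axis=0).sum()
```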

Fisher Consistency of ε-insensitive Loss

Fisher Consistency of ε-insensitive Loss
Proposition: If a minimizer f(x) of E[ ||Y - f(X)||_ε | X = x ] with ε = √(2k/(k-1))/2 lies closest to vertex v_l, then p_l(x) = max_j p_j(x). Either f(x) occurs exterior to all of the ε-insensitive balls or on the boundary of the ball surrounding v_l. The assigned vertex v_l is unique if the p_j(x) are distinct.
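This value of ε is exactly half the distance between any two simplex vertices, so the ε-insensitive balls around distinct vertices just touch. A one-line check, using the fact that the unit-length vertices of a regular simplex centered at the origin sum to zero:

```latex
\[
0=\Bigl\|\sum_{j=1}^{k} v_j\Bigr\|^{2}=k+k(k-1)\langle v_i,v_j\rangle
\;\Longrightarrow\;
\langle v_i,v_j\rangle=-\frac{1}{k-1},
\qquad
\|v_i-v_j\|^{2}=2-2\langle v_i,v_j\rangle=\frac{2k}{k-1}.
\]
```

For k = 3 this gives ε = √3/2 ≈ 0.866, the value used in the toy example.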

[Figure: panels illustrating the location of the minimizer f(x) relative to the three ε-insensitive balls for several class-probability vectors p, including the balanced case p = (1/3, 1/3, 1/3) and slightly perturbed alternatives.]

Numerical Examples

Simulation Example
- An example in Wang and Shen (2007)
- k = 3, n = 60, and p = 10, 20, 40 (overdetermined) or 80, 160 (underdetermined)
- The x_ij are i.i.d. N(0, 1) for j > 2, while the first two predictors have class-specific means (a_1, a_2)
- The 60 training cases are spread evenly across the classes, with a large independent set of testing cases
- Goal: compare the three modified VDA methods with L1MSVM (Wang and Shen 2007) and L2MSVM (Lee et al. 2004)

[Table: for each p (10, 20, 40, 80, 160), the Bayes error and, for VDA_LE, VDA_L, and VDA_E, the testing error (with standard error), the average number of selected variables, and the running time, alongside the testing errors of L1MSVM and L2MSVM.]

Stability Selection (Meinshausen and Buehlmann 2009) for p = 160
[Figure: two panels plotted against λ_L, showing the selection probabilities of the individual variables and the average number of selected variables, together with the expected number of falsely selected variables for thresholds π = 0.6, 0.7, 0.8, and 0.9.]
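For readers unfamiliar with stability selection, the sketch below shows the generic subsampling scheme; fit_and_select is a hypothetical placeholder for any sparse fitting routine, for example a VDA_L fit returning the indices of predictors with nonzero slopes.

```python
import numpy as np

def stability_selection(X, y, lambdas, fit_and_select, n_subsamples=100, pi_thr=0.8, seed=None):
    """Generic stability-selection wrapper (Meinshausen and Buehlmann): refit on random
    half-samples over a grid of penalties and keep the variables whose selection
    frequency exceeds pi_thr for some lambda.  fit_and_select(X, y, lam) is a
    placeholder returning the indices of the selected variables."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros((len(lambdas), p))              # selection frequency per (lambda, variable)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)   # random half-sample
        for a, lam in enumerate(lambdas):
            selected = fit_and_select(X[idx], y[idx], lam)
            freq[a, selected] += 1.0 / n_subsamples
    stable = np.where(freq.max(axis=0) >= pi_thr)[0]      # stable set of variables
    return stable, freq
```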

Real Data Examples: Underdetermined Problems
Average cross-validated testing errors (%) over 50 random partitions for six benchmark cancer data sets, with (k, n, p) equal to Leukemia (3, 72, 3571), Colon (2, 62, 2000), Prostate (2, 102, 6033), Lymphoma (3, 62, 4026), SRBCT (4, 63, 2308), and Brain (5, 42, 5597).
[Table: error rates of VDA_LE, VDA_L, and VDA_E on each data set, compared with BagBoost, Boosting, RanFor, SVM, PAM, DLDA, and KNN.]

Discussion

Summary
- VDA and its various modifications compare well with competing methods
- Virtues of VDA: parsimony, robustness, speed, and symmetry
- Four VDA methods:
  - VDA_R: robustness and symmetry, but falling behind in parsimony and speed; highly recommended for problems with a handful of predictors
  - VDA_LE: best performance on high-dimensional problems, though sacrificing a little symmetry for extra parsimony
  - VDA_E: robustness, speed, and symmetry
  - VDA_L: puts too high a premium on parsimony at the expense of symmetry

Future Work
- Euclidean penalties for grouped effects (Wu and Lange 2008)
- Redesigning the class vertices when they are not symmetrically distributed
- Nonlinear classifiers
- Extension to multi-task learning
- More theoretical studies
- Further increases in computing speed through parallel computing