Multicategory Vertex Discriminant Analysis for High-Dimensional Data
Tong Tong Wu, Department of Epidemiology and Biostatistics, University of Maryland, College Park
October 8, 2010
Joint work with Prof. Kenneth Lange at UCLA; to appear in the Annals of Applied Statistics (AOAS)
Outline
1. Introduction
2. Vertex Discriminant Analysis (VDA)
   - Category Indicator: Equidistant Points in R^(k-1)
   - ε-insensitive Loss Function for VDA
   - Cyclic Coordinate Descent for VDA_L
   - Euclidean Penalty for Grouped Effects (VDA_E, VDA_LE)
3. Fisher Consistency of the ε-insensitive Loss
4. Numerical Examples
5. Discussion
Introduction
Overview of Vertex Discriminant Analysis (VDA)
- A new method of supervised learning
- Linear discrimination among the vertices of a regular simplex in Euclidean space, each vertex representing a different category
- Minimization of ε-insensitive residuals plus penalties on the coefficients of the linear predictors
- Classification and variable selection performed simultaneously
- Minimization by a primal MM algorithm or a coordinate descent algorithm
- Fisher consistency of the ε-insensitive loss
- Statistical accuracy and computational speed
Motivation: cancer subtype classification
- Sheer scale of cancer data sets
- Prevalence of multicategory problems
- Excess of predictors over cases
- Exceptional speed and memory capacity of modern computers
Review of Discriminant Analysis
- Purpose: categorize objects based on a fixed number of observed features x ∈ R^p
- Observations: category membership indicator y and feature vector x ∈ R^p
- Discriminant rule: divide R^p into disjoint regions corresponding to different categories
- Supervised learning:
  - Begin with a set of fully categorized cases (training data)
  - Build discriminant rules using the training data
  - Given a loss function L(y, x), minimize either the expected loss E[L(Y, X)] = E{E[L(Y, X) | X]} or the average conditional loss (1/n) Σ_{i=1}^n L(y_i, x_i) with a penalty term
Multicategory Problems in SVM
- Solving a series of binary problems:
  - One-versus-rest (OVR): k binary classifications, but poor performance when no dominating class exists (Lee et al. 2004)
  - Pairwise comparisons: k(k-1)/2 comparisons, a violation of the criterion of parsimony (Kressel 1999)
- Considering all classes simultaneously (Bredensteiner and Bennett 1999; Crammer and Singer 2001; Guermeur 2002; Lee et al. 2004; Liu et al. 2005, 2006; Liu 2007; Vapnik 1998; Weston and Watkins 1999; Zhang 2004b; Zou et al. 2006)
- Closely related work: L1MSVM (Wang and Shen 2007) and L2MSVM (Lee et al. 2004)
Vertex Discriminant Analysis (VDA)
Notation
- n: number of observations
- p: dimension of feature space
- k: number of categories
Questions for Multicategory Discriminant Analysis
- How to choose category indicators?
- How to choose a loss function?
- How to minimize the loss function?
Equidistant Points in R^(k-1)
Question: How to choose class indicators?
Proposition: It is possible to choose k equidistant points in R^(k-1), but not k + 1 equidistant points, under the Euclidean norm.
The points occur at the vertices of a regular simplex:
- 2 classes: -1 and 1 (a line)
- 3 classes: the vertices of an equilateral triangle circumscribed by the unit circle (a plane)
- k classes: the vertices v_1, ..., v_k of a regular simplex in R^(k-1)
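The vertex construction can be made concrete. The following Python sketch (the function name and parameterization are my own; the talk may use a different but equivalent construction) returns k unit-length vertices in R^(k-1) whose pairwise distances all equal √(2k/(k-1)).

```python
import numpy as np

def simplex_vertices(k):
    """Return k unit-length, equidistant class vertices as rows of a k x (k-1) array.

    Any two vertices are separated by sqrt(2k/(k-1)); this is one explicit
    construction of a regular simplex in R^(k-1), offered only as a sketch.
    """
    if k == 2:
        return np.array([[1.0], [-1.0]])           # two classes: -1 and 1 on a line
    V = np.empty((k, k - 1))
    V[0] = 1.0 / np.sqrt(k - 1)                    # first vertex: a constant vector
    c = -(1.0 + np.sqrt(k)) / (k - 1) ** 1.5       # shared component of the other vertices
    d = np.sqrt(k / (k - 1.0))                     # extra component on one coordinate axis
    for j in range(1, k):
        V[j] = c
        V[j, j - 1] += d
    return V

V = simplex_vertices(3)                            # equilateral triangle on the unit circle
print(np.round(V, 3))
print([round(float(np.linalg.norm(V[i] - V[j])), 3)
       for i in range(3) for j in range(i + 1, 3)])   # all pairwise distances = sqrt(3)
```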
Plot of Indicator Vertices for 3 Classes
[Figure: the three class vertices plotted in the plane.]
Ridge Penalized ε-insensitive Loss Function for VDA_R
- ε-insensitive Euclidean loss f(z) = ||z||_ε = max{||z|| - ε, 0}
- Linear classifier y = Ax + b to maintain parsimony
- Penalties on the slopes a_{jl} imposed to avoid overfitting
- Classification proceeds by minimizing the objective function

  R(A, b) = (1/n) Σ_{i=1}^n f(y_i - A x_i - b) + λ_R Σ_{j=1}^{k-1} Σ_{l=1}^p a_{jl}^2
          = (1/n) Σ_{i=1}^n f(y_i - A x_i - b) + λ_R Σ_{l=1}^p ||a_l||^2

  where y_i is the vertex assignment for case i, a_j^t is the jth row and a_l the lth column of the (k-1) x p matrix A of regression coefficients, and b is a (k-1)-vector of intercepts
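To make the objective concrete, here is a minimal Python sketch of the ε-insensitive Euclidean loss and the ridge-penalized criterion above. The function names and the default choice ε = ½√(2k/(k-1)) are illustrative assumptions.

```python
import numpy as np

def eps_insensitive(Z, eps):
    """epsilon-insensitive Euclidean loss f(z) = max(||z|| - eps, 0), applied row-wise."""
    return np.maximum(np.linalg.norm(Z, axis=-1) - eps, 0.0)

def vda_r_objective(A, b, X, Y, lam_r, eps=None):
    """Evaluate the ridge-penalized VDA_R objective R(A, b) sketched above.

    X : (n, p) features;  Y : (n, k-1) vertex assignments y_i;
    A : (k-1, p) slopes;  b : (k-1,) intercepts;  lam_r : ridge tuning constant.
    """
    k = Y.shape[1] + 1
    if eps is None:
        eps = 0.5 * np.sqrt(2.0 * k / (k - 1))     # radius used in the Fisher consistency result
    resid = Y - X @ A.T - b                        # residuals y_i - A x_i - b, one per row
    return eps_insensitive(resid, eps).mean() + lam_r * np.sum(A ** 2)
```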
Multivariate Regression
Prediction function y_i = A x_i + b for the ith observation:

  y_i = (v_{y_i,1}, ..., v_{y_i,k-1})^t = A x_i + b,  where A = (a_{jl}) is (k-1) x p and b = (b_1, ..., b_{k-1})^t

Linear system for all n observations:

  [y_1^t; ... ; y_n^t] = [x_1^t; ... ; x_n^t] A^t + 1_n b^t
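Once A and b are estimated, classification simply maps each prediction A x_i + b to the nearest class vertex (the same geometric rule that appears in the Fisher consistency slide below). A small sketch, with assumed function names and the vertices stored as rows of a k x (k-1) matrix:

```python
import numpy as np

def classify(X, A, b, vertices):
    """Assign each case to the class whose vertex is nearest to its linear prediction.

    X : (n, p) features;  A : (k-1, p) slopes;  b : (k-1,) intercepts;
    vertices : (k, k-1) class vertices, e.g. the output of simplex_vertices(k) above.
    """
    Y_hat = X @ A.T + b                                        # predictions A x_i + b
    d2 = ((Y_hat[:, None, :] - vertices[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
    return d2.argmin(axis=1)                                   # class labels 0, ..., k-1
```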
A Toy Example for VDA
- Training observations over k = 3 classes, each attached a normally distributed predictor with unit variance and mean μ = -4 for class 1, 0 for class 2, and 4 for class 3
- Compare:
  1. least squares with class indicators v_j equated to e_j in R^3 (indicator regression)
  2. least squares with class indicators v_j equated to the vertices of an equilateral triangle
  3. ε-insensitive loss with the triangular vertices and ε = 0.6
  4. ε-insensitive loss with the triangular vertices and ε = (1/2)√(2k/(k-1)) = 0.866
[Figure: fitted distance curves versus x for the four methods — squared Euclidean loss with e_j, squared Euclidean loss with the triangle vertices, ε-insensitive loss with ε = 0.6, and ε-insensitive loss with ε = 0.866.]
Modified ε-insensitive Loss

  g(v) = ||v|| - ε                                          if ||v|| ≥ ε + δ
       = (||v|| - ε + δ)^3 (3δ - ||v|| + ε) / (16 δ^3)      if ||v|| ∈ (ε - δ, ε + δ)
       = 0                                                  if ||v|| ≤ ε - δ

[Figure: the ε-insensitive loss f(x) and its smoothed version g(x).]
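A short sketch of the smoothed loss as written above (the vectorized implementation and its name are mine). The cubic middle piece matches ||v|| - ε in value, slope, and curvature at ||v|| = ε + δ and vanishes to second order at ||v|| = ε - δ, which is what permits the Newton-type updates on the following slides.

```python
import numpy as np

def smoothed_eps_loss(V, eps, delta):
    """Differentiable surrogate g(v) for the epsilon-insensitive loss, applied row-wise."""
    u = np.linalg.norm(np.atleast_2d(V), axis=-1)              # ||v|| for each row
    g = np.where(u >= eps + delta, u - eps, 0.0)               # outer and zero branches
    mid = (u > eps - delta) & (u < eps + delta)                # transition zone
    g[mid] = (u[mid] - eps + delta) ** 3 * (3 * delta - u[mid] + eps) / (16 * delta ** 3)
    return g
```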
VDA_L
- Minimize the objective function

  R(A, b) = (1/n) Σ_{i=1}^n g(y_i - A x_i - b) + λ_L Σ_{j=1}^{k-1} Σ_{l=1}^p |a_{jl}|

- Although R(A, b) is non-differentiable, it possesses forward and backward directional derivatives along each coordinate direction
- Related work: L1MSVM (Wang and Shen 2007)
Cyclic Coordinate Descent
Forward and backward directional derivatives along e_jl are

  d_{e_jl} R(A, b) = lim_{τ↓0} [R(θ + τ e_jl) - R(θ)] / τ
                   = (1/n) Σ_{i=1}^n ∂g(r_i)/∂a_{jl} + { λ_L  if a_{jl} ≥ 0;  -λ_L  if a_{jl} < 0 }

and

  d_{-e_jl} R(A, b) = lim_{τ↓0} [R(θ - τ e_jl) - R(θ)] / τ
                    = -(1/n) Σ_{i=1}^n ∂g(r_i)/∂a_{jl} + { -λ_L  if a_{jl} > 0;  λ_L  if a_{jl} ≤ 0 }

- If both d_{e_jl} R(A, b) and d_{-e_jl} R(A, b) are nonnegative, skip the update
- If either directional derivative is negative, solve for the minimum in the corresponding direction
Newton's Updates
If r_i^m is the value of the ith residual at iteration m, then

  a_{jl}^{m+1} = a_{jl}^m - [ (1/n) Σ_{i=1}^n ∂g(r_i^m)/∂a_{jl} + { λ_L if a_{jl}^m ≥ 0; -λ_L if a_{jl}^m < 0 } ] / [ (1/n) Σ_{i=1}^n ∂^2 g(r_i^m)/∂a_{jl}^2 ]

and

  b_j^{m+1} = b_j^m - [ (1/n) Σ_{i=1}^n ∂g(r_i^m)/∂b_j ] / [ (1/n) Σ_{i=1}^n ∂^2 g(r_i^m)/∂b_j^2 ]
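The skip-or-update rule can be sketched for a single slope a_jl. For brevity this sketch approximates the derivatives of the smooth part (1/n) Σ_i g(r_i) by finite differences; the algorithm itself uses the analytic partial derivatives of g, and the function name and tolerances here are illustrative.

```python
import numpy as np

def update_slope(avg_loss, A, j, l, lam_l, h=1e-5):
    """One cyclic coordinate descent step on slope a_jl for the lasso-penalized objective.

    avg_loss : callable A -> (1/n) sum_i g(r_i), with b and the data held fixed.
    """
    Ap, Am = A.copy(), A.copy()
    Ap[j, l] += h
    Am[j, l] -= h
    g0, gp, gm = avg_loss(A), avg_loss(Ap), avg_loss(Am)
    d1 = (gp - gm) / (2 * h)                        # first derivative of the smooth part
    d2 = (gp - 2 * g0 + gm) / h ** 2                # second derivative of the smooth part
    a = A[j, l]
    forward = d1 + (lam_l if a >= 0 else -lam_l)    # directional derivative along +e_jl
    backward = -d1 + (-lam_l if a > 0 else lam_l)   # directional derivative along -e_jl
    if forward >= 0 and backward >= 0:
        return A                                    # both directions go uphill: skip a_jl
    A_new = A.copy()                                # otherwise take Newton's update
    A_new[j, l] = a - forward / max(d2, 1e-10)
    return A_new
```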
Euclidean Penalty for Grouped Effects
- Selection of groups of variables rather than individual variables ("all-in-or-all-out")
- The Euclidean norm penalty λ_E ||a_l|| is an ideal group penalty since it couples parameters and preserves convexity (Wu and Lange 2008)
- In multicategory classification, the slopes of a single predictor for the different dimensions of R^(k-1) form a natural group
- In VDA_E, we minimize the objective function

  R(A, b) = (1/n) Σ_{i=1}^n g(y_i - A x_i - b) + λ_E Σ_{l=1}^p ||a_l||
VDA_LE
In VDA_LE, we minimize the objective function

  R(A, b) = (1/n) Σ_{i=1}^n g(y_i - A x_i - b) + λ_L Σ_{j=1}^{k-1} Σ_{l=1}^p |a_{jl}| + λ_E Σ_{l=1}^p ||a_l||

- λ_E = 0 recovers VDA_L
- λ_L = 0 recovers VDA_E
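Finally, a two-line sketch of the combined penalty in the VDA_LE objective (function name assumed). Column a_l of A collects the k-1 slopes of predictor l, so the Euclidean term penalizes, and can zero out, a predictor as a group; setting λ_E = 0 or λ_L = 0 recovers the penalties of VDA_L and VDA_E.

```python
import numpy as np

def vda_le_penalty(A, lam_l, lam_e):
    """Lasso plus Euclidean (group) penalty on the (k-1) x p slope matrix A."""
    return lam_l * np.sum(np.abs(A)) + lam_e * np.sum(np.linalg.norm(A, axis=0))
```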
Fisher Consistency of ε-insensitive Loss
Fisher Consistency of ε-insensitive Loss
Proposition: If a minimizer f(x) of E[ ||Y - f(X)||_ε | X = x ] with ε = (1/2)√(2k/(k-1)) lies closest to vertex v_l, then p_l(x) = max_j p_j(x). Either f(x) occurs exterior to all of the ε-insensitive balls or on the boundary of the ball surrounding v_l. The assigned vertex v_l is unique if the p_j(x) are distinct.
[Figure: location of the minimizer f(x) relative to the ε-insensitive balls for several class-probability vectors p, including the uniform vector p = (1/3, 1/3, 1/3), two non-uniform vectors, and a perturbed vector indexed by t = 0.05.]
Numerical Examples
Simulation Example
- An example from Wang and Shen (2007)
- k = 3, n = 60, and p = 10, 20, 40 (overdetermined), 80, 160 (underdetermined)
- The x_ij are i.i.d. N(0, 1) for j > 2, while the first two predictors have class-specific means (a_1, a_2), one pattern per class
- The 60 training cases are spread evenly across the classes; performance is evaluated on a large set of independent testing cases
- The three modified VDA methods are compared with L1MSVM (Wang and Shen 2007) and L2MSVM (Lee et al. 2004)
[Table: simulation results for p = 10, 20, 40, 80, 160 — the Bayes error; the testing error, number of selected variables, and computing time for each of VDA_LE, VDA_L, and VDA_E; and the testing errors of L1MSVM and L2MSVM.]
Stability Selection (Meinshausen and Bühlmann 2009) for p = 160
[Figure: two panels versus λ_L — the selection probabilities of the individual predictors, and the average number of selected variables together with the expected number of falsely selected variables for thresholds π = 0.6, 0.7, 0.8, 0.9.]
Real Data Examples: Underdetermined Problems
Average 3-fold cross-validated testing errors (%) over 50 random partitions for six benchmark cancer data sets, each described by (k, n, p): Leukemia (3, 72, 3571), Colon (2, 62, 2000), Prostate (2, 102, 6033), Lymphoma (3, 62, 4026), SRBCT (4, 63, 2308), and Brain (5, 42, 5597).
[Table: testing errors of VDA_LE, VDA_L, and VDA_E (average numbers of selected genes in parentheses) compared with BagBoost, Boosting, RanFor, SVM, PAM, DLDA, and KNN on the six data sets.]
Discussion
Summary
- VDA and its various modifications are competitive with existing methods
- Virtues of VDA: parsimony, robustness, speed, and symmetry
- Four VDA methods:
  - VDA_R: robustness and symmetry, but falling behind in parsimony and speed; highly recommended for problems with a handful of predictors
  - VDA_LE: best performance on high-dimensional problems, though sacrificing a little symmetry for extra parsimony
  - VDA_E: robustness, speed, and symmetry
  - VDA_L: puts too high a premium on parsimony at the expense of symmetry
Future Work
- Euclidean penalties for grouped effects (Wu and Lange 2008)
- Redesigning the class vertices if the categories are not symmetrically distributed
- Nonlinear classifiers
- Extension to multi-task learning
- More theoretical studies
- Further increasing computing speed through parallel computing