Flexible High-Dimensional Classification Machines and Their Asymptotic Properties


Journal of Machine Learning Research 16 (2015) Submitted 1/14; Revised 9/14; Published 8/15

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties

Xingye Qiao
Department of Mathematical Sciences
Binghamton University, State University of New York
Binghamton, NY, USA

Lingsong Zhang
Department of Statistics
Purdue University
West Lafayette, IN 47907, USA

Editor: Massimiliano Pontil

Abstract

Classification is an important topic in statistics and machine learning with great potential in many real applications. In this paper, we investigate two popular large-margin classification methods, Support Vector Machine (SVM) and Distance Weighted Discrimination (DWD), under two contexts: the high-dimensional, low-sample size data and the imbalanced data. A unified family of classification machines, the FLexible Assortment MachinE (FLAME), is proposed, within which DWD and SVM are special cases. The FLAME family helps to identify the similarities and differences between SVM and DWD. It is well known that many classifiers overfit the data in the high-dimensional setting, while others are sensitive to the imbalanced data, that is, the class with a larger sample size overly influences the classifier and pushes the decision boundary towards the minority class. SVM is resistant to the imbalanced data issue, but it overfits high-dimensional data sets by showing the undesired data-piling phenomenon. The DWD method was proposed to improve SVM in the high-dimensional setting, but its decision boundary is sensitive to the imbalance ratio of the sample sizes. Our FLAME family helps to understand an intrinsic connection between SVM and DWD, and provides a trade-off between sensitivity to the imbalanced data and overfitting the high-dimensional data. Several asymptotic properties of the FLAME classifiers are studied. Simulations and real data applications are investigated to illustrate the theoretical findings.

Keywords: classification, Fisher consistency, high-dimensional low-sample size asymptotics, imbalanced data, support vector machine

1. Introduction

Classification refers to predicting the class label, y ∈ C, of a data object based on its covariates, x ∈ X. Here C is the space of class labels, and X is the space of the covariates. Usually we consider X ⊆ R^d, where d is the number of variables, or the dimension. See Duda et al. (2001) and Hastie et al. (2009) for a comprehensive introduction to many popular classification methods. When C = {+1, −1}, this is an important class of classification problems, called binary classification.

The classification rule for a binary classifier usually has the form φ(x) = sign{f(x)}, where f(x) is called the discriminant function. Linear classifiers are the most important and the most commonly used classifiers, as they are often easy to interpret, in addition to having reasonable classification performance. We focus on linear classifiers in this article. In the above formula, linear classifiers correspond to f(x; ω, β) = xᵀω + β. The sample space is divided into halves by the separating hyperplane, also known as the classification boundary, defined by {x : f(x) ≡ xᵀω + β = 0}. Note that the coefficient vector ω ∈ R^d defines the normal vector, and hence the orientation, of the classification boundary, and the intercept term β ∈ R defines the location of the classification boundary.
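For concreteness, a minimal sketch of this decision rule (our illustration, not code from the paper):

```python
import numpy as np

def predict(X, omega, beta):
    """Linear classification rule phi(x) = sign(x^T omega + beta).

    X is an (n, d) matrix of covariates; omega (length d) sets the
    orientation of the boundary and beta sets its location."""
    return np.sign(X @ omega + beta)
```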

In this paper, two popular classification methods, Support Vector Machine (SVM; Cortes and Vapnik, 1995; Vapnik, 1998; Cristianini and Shawe-Taylor, 2000) and Distance Weighted Discrimination (DWD; Marron et al., 2007), are investigated under two important contexts: the High-Dimensional, Low-Sample Size (HDLSS) data and the imbalanced data. Both methods are large-margin classifiers (Smola et al., 2000), which seek separating hyperplanes that maximize certain notions of gap (that is, distances) between the two classes. The investigation of the performance of SVM and DWD motivates the notion of a unified family of classifiers, the FLexible Assortment MachinE (FLAME), which connects the two classifiers and helps to understand their connections and differences.

There is a large literature in statistics and machine learning on large-margin classifiers. For example, Wahba (1999) studied kernel SVM in Reproducing Kernel Hilbert Spaces. Lin (2004) introduced and proved Fisher consistency for SVM. Bartlett et al. (2006) quantified the excess risk of a loss function in a learning problem, including the case of large-margin classification. On the methodology level, Shen et al. (2003) invented ψ-learning; Wu and Liu (2007) introduced robust SVM. Recently, Liu, Zhang, and Wu (2011) studied a unified class of classifiers which connects hard classification and soft classification (probability estimation). It is worth mentioning that the FLAME family is not proposed as a better classification method to replace SVM or DWD. Instead, it is proposed as a unified machine, which is very helpful to investigate the trade-off between generalization errors and overfitting. A single parameter will be used to control the trade-off, of which DWD and SVM sit on the two ends.

1.1 Motivation: Pros and Cons of SVM and DWD

SVM is a very popular classifier in statistics and machine learning. It has been shown to have Fisher consistency, that is, when the sample size goes to infinity, its decision rule converges to the Bayes rule (Lin, 2004). SVM has several nice properties: 1) Its dual formulation is relatively easy to implement (through Quadratic Programming). 2) SVM is robust to the model specification, which makes it very popular in various real applications. However, when applied to HDLSS data, it has been observed that a large portion of the data (usually the support vectors, to be properly defined later) lie on two hyperplanes parallel to the SVM classification boundary. This is known as the data-piling phenomenon (Marron et al., 2007; Ahn and Marron, 2010). Data-piling of SVM indicates a type of overfitting. Other overfitting phenomena of SVM under the HDLSS context include:

1. The angle between the SVM direction and the Bayes rule direction is usually large.

2. The variability of the sampling distribution of the SVM direction ω is very large (Zhang and Lin, 2013). Moreover, because the separating hyperplane is decided only by the support vectors, the SVM direction tends to be unstable, in the sense that small turbulence or measurement error to the support vectors can lead to a big change of the estimated direction.

3. In some cases, the out-of-sample classification performance may not be optimal, due to the suboptimal direction of the estimated SVM discrimination direction.

DWD is a recently developed classifier to improve SVM in the HDLSS setting. It uses a different notion of gap from SVM. While SVM maximizes the smallest distance between classes, DWD maximizes a special average distance (the harmonic mean) between classes. It has been shown in many earlier simulations that DWD largely overcomes the overfitting (data-piling) issue, and it usually gives a better discrimination direction. On the other hand, the intercept term β of the DWD method is sensitive to the sample size ratio between the two classes, that is, to the imbalanced data (Qiao et al., 2010). Note that, even though a good discriminant direction ω is more important in revealing the structure of the data, the classification/prediction performance heavily depends on the intercept β, more than on the direction ω. As shown in Qiao et al. (2010), usually the β term of the SVM classifier is not sensitive to the sample size ratio, while the β term of the DWD method will become too large (or too small) if the sample size of the positive class (or negative class) is very large.

In summary, both methods have pros and cons. SVM has a greater stochastic variability and usually overfits the data by showing data-piling phenomena, but is less sensitive to the imbalanced data issue. DWD usually overcomes the overfitting/data-piling issue and has a smaller sampling variability, but is very sensitive to the imbalanced data. Driven by their similarity, we propose a unified class of classifiers, FLAME, in which the above two classifiers are special cases. FLAME provides a framework to study the connections and differences between SVM and DWD. Each FLAME classifier has a parameter θ which is used to control the performance balance between overfitting the HDLSS data and the sensitivity to the imbalanced data. It turns out that the DWD method is FLAME with θ = 0, and that the SVM method corresponds to FLAME with θ = 1. The optimal θ depends on the trade-off among several factors: stochastic variability, overfitting, and resistance against the imbalanced data. In this paper, we also propose an approach to select θ, where the resulting FLAME has the potential to achieve a balanced performance between the SVM and DWD methods.

1.2 Outline

The rest of the paper is organized as follows. Section 2 provides toy examples and highlights the strengths and drawbacks of SVM and DWD on classifying the HDLSS and imbalanced data. We develop the FLAME method in Section 3, which is motivated by the investigation of the loss functions of SVM and DWD. Section 4 provides suggestions on choosing the parameters. Three types of asymptotic results for the FLAME classifier are studied in Section 5. Section 6 demonstrates its properties using simulation experiments. A real application study is conducted in Section 7. Some concluding remarks and discussions are made in Section 8.

Technical proofs of theorems and propositions are included in Online Appendix 1.

2. Comparison of SVM and DWD

In this section, we use several toy examples to illustrate the strengths and drawbacks of SVM and DWD under two contexts: HDLSS data and imbalanced data.

2.1 Overfitting HDLSS Data

We use several simulated examples to compare SVM and DWD. The results show that the stochastic variability of the SVM direction is usually larger than that of the DWD method, and SVM directions deviate farther away from the Bayes rule directions. In addition, the newly proposed FLAME machine (see details in Section 3) is also included in the comparison, and it turns out that FLAME with an intermediate θ sits between the above two methods.

Figure 1 shows the comparison results between SVM, DWD and FLAME (with tuning parameter θ = 1/2). We simulated 10 samples with the same underlying distribution. Each simulated data set contains 12 variables and two classes, with 120 observations in each class. The two classes have mean difference on only the first three dimensions, and the within-class covariances are diagonal, that is, the variables are independent. For each simulated data set, we plot the first three components of the resulting discriminant directions from SVM, DWD and FLAME (after normalizing the 3D vectors to have unit L2 norms), as shown in Figure 1. It clearly shows that the DWD directions (the blue down-pointing triangles) are the closest ones to the true Bayes rule direction (the cyan diamond marker) among the three approaches. In addition, the DWD directions have the smallest variation (that is, they are more stable) over different samples. The SVM directions (the red up-pointing triangles) are farthest from the true Bayes rule direction and have a larger variation than the other two methods.

To highlight the direction variabilities of the three methods, we introduce a novel measure for the variation (instability) of the discriminant directions: the trace of the sample covariance of the resulting direction vectors over the 10 replications, which we name the dispersion. The dispersion for the DWD method (0.0031) is much smaller than that of the SVM method (0.0453), as highlighted in the figure as well. The new FLAME classifiers usually have a performance between DWD and SVM. Figure 1 shows the results of a specific FLAME (θ = 0.5, the magenta squares), which are better than SVM but worse than DWD.

Besides the advantage in terms of the stochastic variability and the deviation from the true direction, DWD outperforms SVM in terms of stability in the presence of small perturbations. In Figure 2, we use a two-dimensional example to illustrate this phenomenon. We simulated a perfectly separable 2-dimensional data set. The theoretical Bayes rule decision boundary is shown as the thick black line. The dashed red line and the dash-dotted blue line are the SVM and the DWD classification boundaries, respectively, before the perturbation. We then move one observation in the positive group slightly (from the solid triangle to the solid diamond as shown in the figure). This perturbation leads to a visible change of the SVM direction (shown as the dotted red line), but a smaller change for DWD (shown as the solid blue line). Note that all four hyperplanes are capable of classifying this training data set perfectly. But this may not be true for an out-of-sample test set. This example shows the instability of SVM.

[Figure 1 appears here. Its legend lists the true direction, the DWD direction vectors, the FLAME (θ = 0.5) direction vectors, and the SVM direction vectors, together with the dispersion of each machine.]

Figure 1: The true population mean difference direction vector (the cyan dashed line and diamond marker; equivalent to the Bayes rule direction), the DWD directions (blue down-pointing triangles), the FLAME directions with θ = 0.5 (magenta squares), and the SVM directions (red up-pointing triangles) for 10 realizations of simulated data. Each direction vector has norm 1 and thus is depicted as a point on the 3D unit sphere. On average, all machines have their discriminant direction vectors scattering around the true direction. The DWD directions are the closest to the true direction and have the smallest variation. The SVM directions have the largest variation and are farthest from the true direction. The variation of the intermediate FLAME direction vectors is between the two machines above. The variation (dispersion) of a machine is also measured by the trace of the sample covariance calculated from the 10 resulting direction vectors for the 10 simulations.

2.2 Sensitivity to Imbalanced Data

In the last subsection, we have shown that DWD outperforms SVM in estimating the discrimination direction, that is, DWD directions are closer to the Bayes rule discrimination directions and usually have a smaller variability. However, it was found that the location of the DWD classification boundary, which is characterized by the intercept β, is sensitive to the sample size ratio between the two classes (Qiao et al., 2010).

[Figure 2 appears here. Its legend lists the positive and negative classes, the observation before and after the movement, the movement direction, the true boundary, and the SVM and DWD boundaries for the original data and after the movement.]

Figure 2: A 2D example shows that the unstable SVM boundary has changed due to a small turbulence of a support vector (the solid red triangle and diamond) while the DWD boundary remains almost still.

Usually, a good discriminant direction ω helps to reveal the profiling difference between two classes of populations. But the classification/prediction performance heavily depends on the location coefficient β. We define the imbalance factor m ≥ 1 as the sample size ratio between the majority class and the minority class. It turns out that β in the SVM classifier is not sensitive to m. However, the β term of the DWD method is very sensitive to m. We also notice that, as a consequence, the DWD separating hyperplane will be pushed toward the minority class when the ratio m is close to infinity, that is, DWD classifiers intend to ignore the minority class.

In this section, we use another toy example to better illustrate the impact of the imbalanced data on both the estimated β and the classification performance. Figure 3 uses a one-dimensional example, so that estimating ω is not needed. This also corresponds to a multivariate data set where ω is estimated correctly first, after which the data set is projected onto ω to form the one-dimensional data. In this plot, the x-coordinates of the red dots and the blue dots are the values of the data, while the y-coordinates are random jitters for better visualization. The red and blue curves are the kernel density estimates for the two classes.

In the top subplot of Figure 3, where m = 1 (that is, the balanced data), both the DWD (blue lines) and SVM (red lines) boundaries are close to the Bayes rule boundary (black solid line), which sits at 0. In the bottom subplot, the sample size of the red class is tripled, which corresponds to m = 3. Note that the SVM boundary moves a little towards the minority (blue) class, but is still fairly close to the true boundary. The DWD boundary, however, is pushed towards the minority. Although this does not impose immediate problems for the training data set, the DWD classifier will suffer a great loss of classification performance when it is applied to an out-of-sample data set. It can be shown that when m goes to infinity, the DWD classification boundary will tend to negative infinity, which totally ignores the minority group (see our Theorem 4).

[Figure 3 appears here. Its legend lists the true boundary and the SVM and DWD boundaries for the balanced and the imbalanced data.]

Figure 3: A 1D example shows that the DWD boundary is pushed towards the minority class (blue) when the majority class (red) has tripled its sample size.

However, SVM will not suffer from the imbalanced data issue. One reason is that SVM only needs a small fraction of the data (called support vectors) to estimate both ω and β, which mitigates the imbalanced data issue naturally.

Imbalanced data issues have been investigated in both statistics and machine learning. See an extensive survey in Chawla et al. (2004). Recently, Owen (2007) studied the asymptotic behavior of infinitely imbalanced binary logistic regression. In addition, Qiao and Liu (2009) and Qiao et al. (2010) proposed to use adaptive weighting approaches to overcome the imbalanced data issue.

In summary, the performance of DWD and SVM differs in the following ways: 1) The SVM direction usually has a larger variation and deviates farther from the Bayes rule direction than the DWD direction, which are indicators of overfitting HDLSS data. 2) The SVM intercept is not sensitive to the imbalanced data, but the DWD intercept is. These observations have motivated us to investigate their similarities and differences. In the next section, a new family of classifiers will be proposed, which unifies the above two classifiers.

3. FLAME Family

In this section, we introduce FLAME, a family of classifiers, through a thorough investigation of the loss functions of SVM and DWD in Section 3.1. The formulation and implementation of the FLAME classifiers are given in Section 3.2.

3.1 SVM and DWD Loss Functions

The key factors that drive the very distinct performances of the SVM and the DWD methods are their associated loss functions (see Figure 4).

[Figure 4 appears here, plotting the loss against the functional margin u = yf(x) for the DWD (FLAME: θ = 0), FLAME (θ = 0.5), and SVM (FLAME: θ = 1) losses, with C = 1.]

Figure 4: FLAME loss functions for three θ values: θ = 0 (equivalent to the DWD loss), θ = 0.5, and θ = 1 (equivalent to the SVM Hinge loss). The parameter C is set to be 1.

Figure 4 displays the loss functions of SVM, DWD and FLAME with some specific tuning parameters. SVM uses the Hinge loss function, H(u) = (1 − u)_+ (the red dashed curve in Figure 4), where u corresponds to the functional margin u ≡ yf(x). Note that the functional margin u can be viewed as the distance of the vector x from the separating hyperplane (defined by {x : f(x) = 0}). When u > 0 and is large, the data vector is correctly classified and is far from the separating hyperplane; when u < 0, the data vector is wrongly classified. Note that when u > 1, the corresponding Hinge loss equals zero. Thus, only those observations with u ≤ 1 contribute to the estimation of ω and β. These observations are called support vectors. Hence, SVM is insensitive to the observations that are far away from the decision boundary, which is the reason that it is less sensitive to the imbalanced data issue.

However, the influence by only the support vectors makes the SVM solution subject to overfitting (data-piling). This can be explained as follows: in the optimization process of SVM, the functional margins of the vectors are pushed towards a region with small loss, that is, functional margins u are encouraged to be large. But once a vector is pushed to the point where u = 1, the optimization mechanism lacks further incentive to continue pushing it towards a larger functional margin, as the Hinge loss cannot be further reduced for this vector. Therefore many data vectors pile up along the hyperplane corresponding to u = 1. Data-piling is bad for generalization because a small turbulence to the support vectors could lead to a big difference in the estimated discriminant direction vector (recall the examples in Section 2.1).

The DWD method corresponds to a different DWD loss function,

    V(u) = { 2√C − Cu,   if u ≤ 1/√C,
           { 1/u,        otherwise.    (1)

Here C is a pre-defined constant. Figure 4 shows the DWD loss function with C = 1. It is clear that the DWD loss function is very similar to the SVM loss function when u is small (both are linearly decreasing with respect to u). The major difference is that the DWD loss is always positive. This property makes the DWD method behave in a very different way than SVM. As there is always an incentive to make the functional margin larger (and the loss smaller), the DWD loss function kills data-piling, and mitigates the overfitting issue for HDLSS data.

On the other hand, the DWD loss function makes the DWD method very sensitive to the imbalanced data issue. This is because, now that each observation has some influence, the larger class will have a larger influence. The decision boundary of the DWD method tends to ignore the smaller class, because sacrificing the smaller class (the boundary being closer to the smaller class and farther from the larger class) can lead to a dramatic reduction of the loss from the larger class, which ultimately leads to a minimized overall loss.

3.2 FLAME

We propose to borrow strengths from both methods to simultaneously deal with both the imbalanced data and the overfitting (data-piling) issues. We first highlight the connections between the DWD loss and a modified version of the Hinge loss (of SVM). Then we modify the DWD loss so that samples far from the classification boundary will have zero loss.

Let f(x) = xᵀω + β. The formulation of SVM can be rewritten (see details in Appendix A) in the form of

    argmin_{ω,β} Σ_i H̃(y_i f(x_i)),  s.t. ‖ω‖² ≤ 1,

where the modified Hinge loss function H̃ is defined as

    H̃(u) = { √C − Cu,   if u ≤ 1/√C,
            { 0,         otherwise.    (2)

Comparing the DWD loss (1) and this modified Hinge loss (2), one can easily see their connections: for u ≤ 1/√C, the DWD loss is greater than the Hinge loss of SVM by an exact constant √C, and for u > 1/√C, the DWD loss is 1/u while the SVM Hinge loss equals 0. Clearly the modified Hinge loss (2) is the result of soft-thresholding the DWD loss at √C. In other words, SVM can be seen as a special case of DWD where the losses of those vectors with u = y_i f(x_i) > 1/√C are shrunken to zero.

To allow different levels of soft-thresholding, we propose to use a new loss function which (soft-)thresholds the DWD loss function by the constant θ√C, where 0 ≤ θ ≤ 1, that is, a fraction of √C. The new loss function is

    L(u) = [V(u) − θ√C]_+ = { (2 − θ)√C − Cu,   if u ≤ 1/√C,
                            { 1/u − θ√C,        if 1/√C ≤ u < 1/(θ√C),
                            { 0,                if u ≥ 1/(θ√C),    (3)

that is, we reduce the DWD loss by a constant and truncate it at 0.
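These losses are easy to evaluate directly. Below is a minimal numerical sketch (ours, based on the reconstructed equations (1) and (3) above; the function names are ours):

```python
import numpy as np

def dwd_loss(u, C=1.0):
    """DWD loss V(u), equation (1): 2*sqrt(C) - C*u for u <= 1/sqrt(C), else 1/u."""
    u = np.asarray(u, dtype=float)
    rc = np.sqrt(C)
    # Guard the 1/u branch so that u <= 1/sqrt(C) never triggers a division warning.
    safe_u = np.maximum(u, 1.0 / rc)
    return np.where(u <= 1.0 / rc, 2.0 * rc - C * u, 1.0 / safe_u)

def flame_loss(u, C=1.0, theta=0.5):
    """FLAME loss L(u) = [V(u) - theta*sqrt(C)]_+, equation (3).

    theta = 0 recovers the DWD loss; theta = 1 (with C = 1) recovers
    the SVM Hinge loss (1 - u)_+."""
    return np.maximum(dwd_loss(u, C) - theta * np.sqrt(C), 0.0)

# Sanity check: with C = 1, FLAME at theta = 1 matches the Hinge loss.
u = np.linspace(-1.0, 3.0, 9)
assert np.allclose(flame_loss(u, C=1.0, theta=1.0), np.maximum(1.0 - u, 0.0))
```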

The magenta solid curve in Figure 4 is the FLAME loss when C = 1 and θ = 0.5. This simple but useful modification unifies the DWD and SVM methods. When θ = 1, the new loss function (when C = 1) reduces to the SVM Hinge loss function, while when θ = 0, it remains the DWD loss.

Note that L(u) = 0 for u > 1/(θ√C). Thus, those data vectors with very large functional margins will still have zero loss. For the DWD loss (corresponding to θ = 0), note that 1/(θ√C) = ∞. Thus no data vector can have zero loss. For the SVM loss, all the data vectors with u > 1/(θ√C) = 1/√C will have zero loss. Training a FLAME classifier with 0 < θ < 1 can be interpreted as sampling a portion of the data which are farther from the boundary than 1/(θ√C) and assigning zero loss to them. Alternatively, it can be viewed as sampling the data that are closer to the boundary than 1/(θ√C) and assigning positive loss to them. Note that the larger θ is, the fewer data are sampled to have positive loss. As one can flexibly choose θ, the new classification method with this new loss function is called the FLexible Assortment MachinE (FLAME).

FLAME can be implemented by a Second-Order Cone Programming algorithm (Toh et al., 1999; Tütüncü et al., 2003). Let θ ∈ [0, 1] be the FLAME parameter. The proposed method minimizes

    min_{ω,β,ξ} Σ_{i=1}^n (1/r_i + Cξ_i − θ√C)_+.

A slack variable ϕ_i ≥ 0 can be introduced to absorb the (·)_+ function. The optimization of FLAME can be written as

    min_{ω,β,ξ,ϕ} Σ_i ϕ_i,
    s.t. (1/r_i + Cξ_i − θ√C) − ϕ_i ≤ 0, ϕ_i ≥ 0,
         r_i = y_i(x_iᵀω + β) + ξ_i, r_i ≥ 0 and ξ_i ≥ 0,
         ‖ω‖² ≤ 1.

A MATLAB routine has been implemented and is available at the authors' personal websites. See Online Appendix 1 for more details on the implementation.
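The authors' implementation is in MATLAB. For illustration only, here is a hedged sketch of the same convex program written with the cvxpy modeling package (our code, not the authors'; the name fit_flame is ours):

```python
import cvxpy as cp
import numpy as np

def fit_flame(X, y, C=1.0, theta=0.5):
    """Solve the FLAME program of Section 3.2 (a sketch, not the authors' code).

    X: (n, d) data matrix; y: length-n labels in {+1, -1}.
    Returns the direction omega and the intercept beta."""
    n, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    xi = cp.Variable(n, nonneg=True)     # slacks xi_i >= 0
    phi = cp.Variable(n, nonneg=True)    # phi_i absorbs the (.)_+ function
    r = cp.multiply(y, X @ w + b) + xi   # r_i = y_i (x_i^T w + b) + xi_i
    constraints = [
        # 1/r_i + C*xi_i - theta*sqrt(C) <= phi_i; inv_pos enforces r_i > 0.
        cp.inv_pos(r) + C * xi - theta * np.sqrt(C) <= phi,
        cp.sum_squares(w) <= 1,          # ||omega||^2 <= 1
    ]
    prob = cp.Problem(cp.Minimize(cp.sum(phi)), constraints)
    prob.solve()
    return w.value, b.value
```

Under this sketch, fit_flame(X, y, theta=0.0) should give a DWD-type fit and fit_flame(X, y, theta=1.0) an SVM-type fit, matching the two ends of the family.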

4. Choice of Parameters

There are two tuning parameters in the FLAME model: one is C, inherited from the DWD loss, which controls the amount of allowance for misclassification; the other is the FLAME parameter θ, which controls the level of soft-thresholding. Similar to the discussion in DWD (Marron et al., 2007), the classification performance of FLAME is insensitive to different values of C. In addition, it can be shown that for any C, FLAME is Fisher consistent, by applying the general results in Lin (2004). Thus, the default value for C as proposed in Marron et al. (2007) will be used in FLAME.

As the properties and the performance of FLAME depend on the choice of the θ parameter, it is important to select the right amount of thresholding. In this section, we introduce a way of choosing the second parameter θ, which is motivated by a theoretical consideration and is heuristically meaningful as well.

Having observed that the DWD discrimination direction is usually closer to the Bayes rule direction, but its location term β is sensitive to the imbalanced data issue, we propose the following data-driven approach to select an appropriate θ. Without loss of generality, we assume that the negative class is the majority class with sample size n_− and the positive class is the minority class with sample size n_+. We point out that the main reason that DWD is sensitive to the imbalanced data issue is that it uses all vectors in the majority class to build up a classifier. A heuristic strategy to correct this would be to force the optimization to use the same number of vectors from both classes (so as to mimic a balanced data set) to build up a classifier: we first apply DWD to the data set, and calculate the distances of all data in the majority (negative) class to the current DWD classification boundary; we then train FLAME with a carefully chosen parameter θ which assigns positive loss to the n_+ (< n_−) data vectors in the majority (negative) class that are closest to the classification boundary. As a consequence, each class will have exactly n_+ vectors which have positive loss. In other words, while keeping the least imbalance (because we have the same numbers of vectors from both classes that have influence over the optimization), we obtain a model with the least possible overfitting (because 2n_+ vectors have influence, instead of only the limited support vectors as in SVM).

In practice, since the new FLAME classification boundary using the θ value chosen above may be different from the initial DWD classification boundary, the n_+ closest points to the FLAME classification boundary may not be the same n_+ closest points to the DWD boundary. This means that it is not guaranteed that exactly n_+ points from the majority class will have positive loss. However, one can expect that a reasonable approximation can be achieved. Moreover, an iterative scheme for finding θ is introduced as follows in order to minimize such discrepancy. For simplicity, we let (x_i, y_i) with index i be an observation from the positive/minority class and (x_j, y_j) with index j be an observation from the negative/majority class.

Algorithm 1 (Adaptive parameter)
1. Initiate θ_0 = 0.
2. For k = 0, 1, ...,
   (a) Solve the FLAME solutions ω(θ_k) and β(θ_k) given parameter θ_k.
   (b) Let θ_{k+1} = max(θ_k, {g_{(n_+)}(θ_k)√C}^{−1}), where g_j(θ_k) is the functional margin u_j ≡ y_j(x_jᵀω(θ_k) + β(θ_k)) of the jth vector in the negative/majority class, and g_{(l)}(θ_k) is the lth order statistic of these functional margins.
3. When θ_k = θ_{k−1}, the iteration stops.

The goal of this algorithm is to make g_{(n_+)}(θ_k) the greatest functional margin among all the data vectors that have positive loss in the negative/majority class. To achieve this, we calibrate θ by aligning g_{(n_+)}(θ_k) with the turning point u = 1/(θ√C) in the definition of the FLAME loss (3), that is, g_{(n_+)}(θ_k) = 1/(θ√C), or equivalently θ = {g_{(n_+)}(θ_k)√C}^{−1}.

We define the equivalent sample objective function of FLAME for the iterative algorithm above,

    s(ω, β, θ) = (1/(n_+ + n_−)) [ Σ_{i=1}^{n_+} L(x_iᵀω + β, θ) + Σ_{j=1}^{n_−} L(−(x_jᵀω + β), θ) ] + (λ/2)‖ω‖².

Then the convergence of this algorithm is shown in Theorem 1. The proofs of all the theorems and propositions in this article are included in Online Appendix 1.
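Before stating the convergence result, here is a sketch of Algorithm 1 in code (ours; it assumes the fit_flame sketch above, and the clipping of θ to [0, 1] and the tie handling are our assumptions):

```python
import numpy as np

def adaptive_theta(X, y, C=1.0, max_iter=20, tol=1e-6):
    """Algorithm 1 (adaptive parameter): a sketch assuming fit_flame above.

    Chooses theta so that roughly n_+ vectors of the negative/majority
    class keep positive loss, mimicking a balanced data set."""
    n_plus = int(np.sum(y == +1))        # minority class size
    theta = 0.0                          # step 1: theta_0 = 0 (the DWD end)
    for _ in range(max_iter):
        w, b = fit_flame(X, y, C=C, theta=theta)      # step 2(a)
        g = -(X[y == -1] @ w + b)        # margins u_j = y_j f(x_j) with y_j = -1
        g_np = np.sort(g)[n_plus - 1]    # n_+th order statistic g_(n_+)
        if g_np <= 0:                    # no valid turning point; keep theta (our guard)
            break
        # Step 2(b): theta_{k+1} = max(theta_k, 1/(g_(n_+) * sqrt(C))), kept in [0, 1].
        theta_new = min(max(theta, 1.0 / (g_np * np.sqrt(C))), 1.0)
        if abs(theta_new - theta) < tol: # step 3: theta has stabilized
            return theta_new
        theta = theta_new
    return theta
```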

Theorem 1 In Algorithm 1, s(ω_k, β_k, θ_k) is non-increasing in k. As a consequence, Algorithm 1 converges to a stationary point s(ω*, β*, θ*), where s(ω_k, β_k, θ_k) → s(ω*, β*, θ*). Moreover, Algorithm 1 terminates finitely.

Ideally, one would hope to get an optimal parameter θ* which satisfies θ* = {g_{(n_+)}(θ*)√C}^{−1}. In practice, the θ selected by Algorithm 1 will approximate θ* very well. In addition, we notice that a one-step iteration usually gives decent results for simulation examples and some real examples.

5. Theoretical Properties

In this section, several important theoretical properties of the FLAME classifier are investigated. We first prove the Fisher consistency (Lin, 2004) of FLAME in Section 5.1. As one focus of this paper is imbalanced data classification, the asymptotic properties of FLAME under the extremely imbalanced data setting are studied in Section 5.2. Lastly, a novel HDLSS asymptotics where n is fixed and d → ∞, the other focus of this article, is studied in Section 5.3.

5.1 Fisher Consistency and Large Sample Asymptotics

Fisher consistency is a very basic property for a classifier. That a classifier is Fisher consistent implies that the minimizer of the conditional risk of the classifier given observation x has the same sign as the Bayes rule, argmax_{k∈{+1,−1}} P(Y = k | X = x). It has been shown that both SVM and DWD are Fisher consistent (Lin, 2004; Qiao et al., 2010). The following proposition states that the FLAME classifier is Fisher consistent too.

Proposition 2 Let f* be the global minimizer of the expected loss E[L(Y f(X), θ)], where L(·) is the loss function for the FLAME classifier, given parameters C and θ. Then sign(f*(x)) = sign(P(Y = +1 | X = x) − 1/2).

Fisher consistency is also known as classification-calibration, notably by Bartlett et al. (2006). With this weakest possible condition on the loss function, they extended the results of Zhang (2004) and showed that there is a nontrivial upper bound on the excess risk. Moreover, they were able to derive faster rates of convergence in some low noise settings. In particular, for a classification-calibrated loss function L(·), there exists a function ψ : [−1, 1] → [0, ∞) so that

    ψ(R_{0−1}(f) − R*_{0−1}) ≤ R_L(f) − R*_L,   or
    c (R_{0−1}(f) − R*_{0−1})^α ψ( (R_{0−1}(f) − R*_{0−1})^{1−α} / (2c) ) ≤ R_L(f) − R*_L

for some constant c > 0 with a certain low noise parameter α, where R_{0−1}(f) and R_L(f) are the risks of the prediction function f with respect to the 0-1 loss and the loss function L respectively, and R*_{0−1} and R*_L are the corresponding Bayes risk and optimal L-risk respectively. The techniques in Zhang (2004) and Bartlett et al. (2006) can be directly applied to the FLAME classifier. The form of the ψ transform above, which establishes the relation between the two excess risks, applied to the current article, is given by Proposition 3.

Proposition 3 The ψ-transform of the FLAME loss function with parameters C and θ is

    ψ(γ) = (2 − θ)√C − H((1 + γ)/2),

where

    H(η) = { √C min(η, 1 − η)(2 + 1/θ − θ),            if η < θ²/(1 + θ²) or η > 1/(1 + θ²),
           { √C [2 min(η, 1 − η) − θ + 2√(η(1 − η))],   otherwise.    (4)

These results provide bounds for the excess risk R_{0−1}(f) − R*_{0−1} in terms of the excess L-risk R_L(f) − R*_L. Combined with a bound on the excess L-risk, they can give us a bound on the excess risk. Recent works on SVM have focused on fast rates of convergence. Vito et al. (2005) studied classification problems as inverse problems; Steinwart and Scovel (2007) studied the convergence properties of the standard SVM with Gaussian kernels; Blanchard et al. (2008) used a method called localization. See also Chen et al. (2004) for another relevant work on the q-soft margin SVM.

5.2 Asymptotics under the Imbalanced Setting

In this subsection, we investigate the asymptotic performance of SVM, DWD and FLAME. The asymptotic setting we focus on is when the minority sample size n_+ is fixed and the majority sample size n_− → ∞, which is similar to the setting in Owen (2007). We will show that DWD is sensitive to the imbalanced data, while FLAME with proper choices of the parameter θ, and SVM, are not. Let x̄_+ be the sample mean of the positive/minority class.

Theorem 4 shows that in the imbalanced data setting, when the size of the negative/majority class grows while that of the positive/minority class is fixed, the intercept term for DWD tends to negative infinity, in the order of √m. Therefore, DWD will classify all the observations to the negative/majority class, that is, the minority class will be 100% misclassified.

Theorem 4 Let n_+ be fixed. Assume that the conditional distribution of the negative majority class F(x) surrounds x̄_+ by the definition given in Owen (2007), and that γ is a constant satisfying inf_{‖ω‖=1} F{(x − x̄_+)ᵀω > 0} ≥ γ > 0. Then the DWD intercept β satisfies

    β < −√(γCm) − x̄_+ᵀω = −√(γC n_−/n_+) − x̄_+ᵀω.

In Section 4, we have introduced an iterative approach to select the parameter θ. Theorem 5 shows that with the optimal parameter θ* found by Algorithm 1, the discriminant direction of FLAME is in the same direction as the vector that joins the sample mean of the positive class and the tilted population mean of the negative class. Moreover, in contrast to DWD, the intercept term of FLAME in this case is finite.

Theorem 5 Suppose that n_− ≫ n_+, and ω* and β* are the FLAME solutions trained with the parameter θ* that satisfies θ* = {g_{(n_+)}(θ*)√C}^{−1}. Then ω* and β* satisfy

    ω* = (√C/((1 + m)λ)) [ x̄_+ − ( ∫ x (xᵀω* + β*)^{−2} dF(x | E) ) / ( ∫ (xᵀω* + β*)^{−2} dF(x | E) ) ],    (5)

where E is the event that Y(Xᵀω* + β*) ≤ 1/(θ*√C), with (X, Y) a random sample from the negative/majority class, and

    ∫ (xᵀω* + β*)^{−2} dF(x | E) = (n_+ − n_o)√C, where 0 < n_o ≤ n_+.

Note that event E is Y(Xᵀω* + β*) ≤ 1/(θ*√C), which implies that the second term in (5) focuses on the data vectors in the negative class with positive loss, since their functional margins are less than 1/(θ*√C). Recall that this is precisely the interpretation of FLAME (see Section 3.2), namely, to sample a subset of the majority class to have positive loss, so as to make the problem less imbalanced.

Remark: As a consequence of Theorem 5, when m = n_−/n_+ → ∞, we have ω* → 0. Since the right-hand side of the last equation above is positive and finite, β* does not diverge. In addition, since P(E) → 1 with probability converging to 1, −β* < 1/(θ*√C).

The following theorem shows the performance of SVM under the imbalanced data context, which completes our comparisons between SVM, DWD and FLAME.

Theorem 6 Suppose that n_− ≫ n_+. The solutions ω and β to SVM satisfy

    ω = (1/((1 + m)λ)) { x̄_+ − ∫ x dF(x | G) },

where G is the event that 1 − Y(Xᵀω + β) > 0, with (X, Y) a random sample from the negative/majority class, and P(G) = P(1 + Xᵀω + β ≥ 0) = 1/m.

Remark: The last statement in Theorem 6 means that, with probability converging to 1, β ≤ −1. However, note this is the only restriction that the SVM solution has for the intercept term (recall that the counterpart in DWD is β < −√(γCm) − x̄_+ᵀω).
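To see Theorems 4 and 5 side by side numerically, one can refit the earlier sketches at growing imbalance factors and watch the intercept (this demo is ours, built on the hypothetical fit_flame and adaptive_theta sketches; the drift predicted by Theorem 4 should appear as m grows):

```python
# Illustration of Theorems 4-6 (ours): as the majority class grows, the
# DWD (theta = 0) intercept drifts off on the order of -sqrt(m), while
# FLAME with the adaptive theta of Algorithm 1 keeps it bounded.
import numpy as np

rng = np.random.default_rng(1)
d, n_plus = 10, 20
for m in (1, 4, 16, 64):
    n_minus = m * n_plus
    X = np.vstack([rng.normal(+0.8, 1.0, (n_plus, d)),
                   rng.normal(-0.8, 1.0, (n_minus, d))])
    y = np.concatenate([np.ones(n_plus), -np.ones(n_minus)])
    _, beta_dwd = fit_flame(X, y, theta=0.0)
    theta_star = adaptive_theta(X, y)
    _, beta_flame = fit_flame(X, y, theta=theta_star)
    print(f"m={m:3d}  DWD beta={beta_dwd:+.3f}  FLAME beta={beta_flame:+.3f}")
```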

5.3 High-Dimensional, Low-Sample Size Asymptotics

HDLSS data are emerging in many areas of scientific research. The HDLSS asymptotics is a recently developed theoretical framework. Hall et al. (2005) gave a geometric representation for the HDLSS data, which can be used to study these new "n fixed, d → ∞" asymptotic properties of binary classifiers such as SVM and DWD. Ahn et al. (2007) weakened the conditions under which the representation holds. Qiao et al. (2010) improved the conditions and applied this representation to investigate the performance of the weighted DWD classifier. Bolivar-Cime and Marron (2013) compared several binary classification methods in the HDLSS setting under the same theoretical framework. The same geometric representation can be used to analyze FLAME. See a summary of some previous HDLSS results in Online Appendix 1.

We develop the HDLSS asymptotic properties of the FLAME family by providing conditions in Theorem 7 under which the FLAME classifiers always correctly classify HDLSS data. We first introduce the notations and give some regularity assumptions, then state the main theorem.

Let k ∈ {+1, −1} be the class index. For the kth class and given a fixed n_k, consider a sequence of random data matrices X^k_1, X^k_2, ..., X^k_d, ..., indexed by the number of rows d, where each column of X^k_d is a random observation vector from R^d and each row represents a variable. Assume that each column of X^k_d comes from a multivariate distribution with dimension d and with covariance matrix Σ^k_d, independently. Let λ^k_{1,d}, ..., λ^k_{d,d} be the eigenvalues of the covariance, and (σ̄^k_d)² = (1/d) Σ_{i=1}^d λ^k_{i,d} the average eigenvalue. The eigenvalue decomposition of Σ^k_d is Σ^k_d = V^k_d Λ^k_d (V^k_d)ᵀ. We may define the square root of Σ^k_d as (Σ^k_d)^{1/2} = V^k_d (Λ^k_d)^{1/2}, and the inverse square root (Σ^k_d)^{−1/2} = (Λ^k_d)^{−1/2} (V^k_d)ᵀ. With minimal abuse of notation, let E(X^k_d) denote the expectation of the columns of X^k_d. Lastly, the n_k × n_k dual sample covariance matrix is denoted by S^k_{D,d} = (1/d){X^k_d − E(X^k_d)}ᵀ{X^k_d − E(X^k_d)}.

Assumption 1 There are five components:

(i) Each column of X^k_d has mean E(X^k_d), and the covariance matrix Σ^k_d of its distribution is positive definite.

(ii) The entries of Z^k_d ≡ (Σ^k_d)^{−1/2}{X^k_d − E(X^k_d)} = (Λ^k_d)^{−1/2}(V^k_d)ᵀ{X^k_d − E(X^k_d)} are independent.

(iii) The fourth moment of each entry of each column is uniformly bounded by M > 0, and the Wishart representation holds for each dual sample covariance matrix S^k_{D,d} associated with X^k_d, that is,

    S^k_{D,d} = (1/d) (Z^k_d)ᵀ(Λ^k_d)^{1/2}(V^k_d)ᵀ V^k_d (Λ^k_d)^{1/2} Z^k_d = (1/d) Σ_{i=1}^d λ^k_{i,d} W^k_{i,d},

where W^k_{i,d} ≡ (Z^k_{i,d})ᵀ Z^k_{i,d} and Z^k_{i,d} is the ith row of Z^k_d defined above. It is called the Wishart representation because if X^k_d is Gaussian, then each W^k_{i,d} follows the Wishart distribution W_{n_k}(1, I_{n_k}) independently.

(iv) The eigenvalues of Σ^k_d are sufficiently diffuse, in the sense that

    ε_k ≡ Σ_{i=1}^d (λ^k_{i,d})² / ( Σ_{i=1}^d λ^k_{i,d} )² → 0, as d → ∞.    (6)

(v) The sum of the eigenvalues of Σ^k_d is of the same order as d, in the sense that (σ̄^k_d)² = O(1) and 1/(σ̄^k_d)² = O(1).

Assumption 2 The distance between the two population expectations satisfies

    (1/d) ‖E(X^{(+1)}_d) − E(X^{(−1)}_d)‖² → µ², as d → ∞.

Moreover, there exist constants σ² and τ² such that (σ̄^{(+1)}_d)² → σ² and (σ̄^{(−1)}_d)² → τ². Let ν² ≡ µ² + σ²/n_+ + τ²/n_−.

The following theorem gives the sure classification conditions for FLAME, which include SVM and DWD as special cases.

Theorem 7 Without loss of generality, assume that n_+ ≤ n_−. The situation of n_+ > n_− is similar and omitted. If any one of the following three conditions is satisfied,
1. for θ ∈ [0, (1 + m^{−1})/(ν√C)), µ² > (n_−/n_+)^{1/2} σ²/n_+ − τ²/n_− > 0;
2. for θ ∈ [(1 + m^{−1})/(ν√C), 2/(ν√C)), µ² > T − τ²/n_− > 0, where T := (1/(2θ√C) + √(1/(4θ²C) + σ²/n_+))² − σ²/n_+;
3. for θ ∈ [2/(ν√C), 1], µ² > σ²/n_+ − τ²/n_− > 0,
then for a new data point x_0^+ from the positive class (+1),

    P(x_0^+ is correctly classified by FLAME) → 1, as d → ∞.

Otherwise, the probability above → 0.

If any one of the following three conditions is satisfied,
1. for θ ∈ [0, (1 + m^{−1})/(ν√C)), (n_−/n_+)^{1/2} σ²/n_+ − τ²/n_− > 0;
2. for θ ∈ [(1 + m^{−1})/(ν√C), 2/(ν√C)), T − τ²/n_− > 0;
3. for θ ∈ [2/(ν√C), 1], σ²/n_+ − τ²/n_− > 0,
then for any µ > 0, for a new data point x_0^− from the negative class (−1),

    P(x_0^− is correctly classified by FLAME) → 1, as d → ∞.

Remark: Theorem 7 has two parts. The first part gives the conditions under which FLAME correctly classifies a new data point from the positive class, and the second part is for the negative class. Each part lists three conditions based on three disjoint intervals of the parameter θ. Note the first and third intervals of each part generalize results which were shown to hold only for DWD and SVM before (c.f. Theorems 1 and 2 in Hall et al., 2005). In particular, it shows that all the FLAME classifiers with θ falling into the first interval behave like DWD asymptotically. Similarly, all the FLAME classifiers with θ falling into the third interval behave like SVM asymptotically. This partially explains the shape of the within-group error curve that we will show in Figure 6 (see also Figures A.2 and A.3 in Online Appendix 1), which we will discuss in the next section. In the first part, the condition for the other FLAMEs (with θ in the second interval) is weaker than that for the DWD-like FLAMEs (in the first interval), but stronger than that for the SVM-like FLAMEs (in the third interval). This means that it is easier to classify a new data point from the positive/minority class by SVM, than by an intermediate FLAME, which in turn is easier than by DWD. Note that when n_+ ≤ n_−, the hyperplane for FLAME is in general closer to the positive class. In terms of classifying data points from the negative class, the order of the difficulties among DWD, FLAME and SVM reverses.
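As a quick numerical reading of Theorem 7, the following sketch (ours, built on the reconstruction above, so the exact constants should be treated with care) evaluates the sure-classification condition for a new positive-class point:

```python
import numpy as np

def sure_classification_positive(theta, mu2, sigma2, tau2, n_plus, n_minus, C=1.0):
    """Evaluate the Theorem 7 condition for a new positive-class point
    as d -> infinity. Returns True if the reconstructed condition holds."""
    m = n_minus / n_plus
    nu = np.sqrt(mu2 + sigma2 / n_plus + tau2 / n_minus)
    rc = np.sqrt(C)
    if theta < (1 + 1 / m) / (nu * rc):        # first interval: DWD-like
        bound = np.sqrt(n_minus / n_plus) * sigma2 / n_plus
    elif theta < 2 / (nu * rc):                # second interval: intermediate
        bound = (1 / (2 * theta * rc)
                 + np.sqrt(1 / (4 * theta**2 * C) + sigma2 / n_plus))**2 \
                - sigma2 / n_plus
    else:                                      # third interval: SVM-like
        bound = sigma2 / n_plus
    return mu2 > bound - tau2 / n_minus > 0
```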

6. Simulations

FLAME is not only a unified representation of DWD and SVM, but also introduces a new family of classifiers which has the potential of avoiding both the overfitting issue for HDLSS data and the sensitivity to the imbalanced data issue. In this section, we use simulations to show the performance of FLAME at various parameter levels.

6.1 Measures of Performance

Before we introduce our simulation examples, we first introduce the performance measures used in this paper. Note that the Bayes rule classifier can be viewed as the gold standard classifier. In our simulation settings, we assume that data are generated from two multivariate normal distributions with different mean vectors µ_+ and µ_− and the same covariance matrix Σ. This setting leads to the following Bayes rule, sign(xᵀω_B + β_B), where

    ω_B = Σ^{−1}(µ_+ − µ_−) and β_B = −(1/2)(µ_+ + µ_−)ᵀω_B.    (7)

Five performance measures are evaluated in this paper:

1. The mean within-class error (MWE) for an out-of-sample test set, which is defined as

    MWE = (1/(2n_+)) Σ_{i=1}^{n_+} 1(Ŷ_i^+ ≠ Y_i^+) + (1/(2n_−)) Σ_{j=1}^{n_−} 1(Ŷ_j^− ≠ Y_j^−).

2. The deviation of the estimated intercept β from the Bayes rule intercept β_B: |β − β_B|.

3. Dispersion: a measure of the stochastic variability of the estimated discrimination direction vector ω. The dispersion measure was introduced in Section 2.1, as the trace of the sample covariance of the resulting discriminant direction vectors:

    dispersion = trace{Var([ω_r]_{r=1:R})},

where R is the number of repeated runs.

4. The angle between the estimated discrimination direction ω and the Bayes rule direction ω_B: ∠(ω, ω_B).

5. RankComp(ω, ω_B): In general, for two direction vectors ω and ω′, RankComp is defined as the proportion of the pairs of variables, among all d(d − 1)/2 pairs, whose relative importances (in terms of their absolute values) given by the two directions are different, that is,

    RankComp(ω, ω′) ≡ (1/(d(d − 1)/2)) Σ_{1≤i<j≤d} 1{(|ω_i| − |ω_j|)(|ω′_i| − |ω′_j|) < 0},

where ω_i and ω′_i are the ith components of the vectors ω and ω′ respectively. The RankComp measure can be viewed as a discretized analog of the angle between two vectors, and it provides more insight into the ranking of variables that a direction vector may suggest. We report the RankComp between the estimated direction ω and the Bayes rule direction ω_B to measure their closeness.

We will investigate these measures for different dimensions d and different imbalance factors m.
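The direction-based measures are one-liners in code. A sketch (ours) matching the definitions above, including the Bayes direction of (7):

```python
import numpy as np

def bayes_rule(mu_plus, mu_minus, Sigma):
    """Bayes rule (7) for two Gaussians with a common covariance."""
    w = np.linalg.solve(Sigma, mu_plus - mu_minus)
    b = -0.5 * (mu_plus + mu_minus) @ w
    return w, b

def angle_deg(w1, w2):
    """Angle between two direction vectors, in degrees."""
    c = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def dispersion(directions):
    """Trace of the sample covariance of R direction vectors (an R x d array)."""
    return np.trace(np.cov(np.asarray(directions), rowvar=False))

def rank_comp(w1, w2):
    """Proportion of the d(d-1)/2 variable pairs whose importance order
    (by absolute coefficient) disagrees between the two directions."""
    a, b = np.abs(w1), np.abs(w2)
    i, j = np.triu_indices(len(a), k=1)
    return np.mean((a[i] - a[j]) * (b[i] - b[j]) < 0)
```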

18 Qiao an Zhang ispersion m = 1 θ=0 θ=0.25 θ=0.5 θ=0.75 θ=1 ispersion m = 4 ispersion m = m = 1 15 m = 4 15 m = 9 angle from true irection 10 5 angle from true irection 10 5 angle from true irection Figure 5: The ispersions (top row) an the angles between the FLAME irection an the Bayes irection (bottom row) for 50 runs of simulations, where the imbalance factors m are 1, 4 an 9 (the left, center an right panels), in the increasing imension setting ( = 100, 400, 700, 1000; shown on the x-axes). The FLAME machines have θ = 0, 0.25, 0.5, 0.75, 1 which are epicte using ifferent curve styles (the first an the last cases correspon to DWD an SVM, respectively.) Note that with θ an the imension increase, both the ispersion an the eviation from the Bayes irection increase. The emergence of the imbalance ata (the increase of m) oes not much eteriorate the FLAME irections except for large. ifferent values of θ, an compare them with DWD an SVM uner various simulation settings. Figure 5 shows the comparison results uner the same simulation setting with various combinations of (, m) s. In this simulation setting, ata are from multivariate normal istributions with ientity covariance matrix MV N (µ ±, I ), where = 100, 400, 700 an We let µ 0 = c(, 1, 2,, 1) T where c > 0 is a constant which scales µ 0 to have norm 2.7. Then we let µ + = µ 0 an µ = µ 0. The imbalance factor varies among 1, 4 an 9 while the total sample size is 240. For each experiment, we repeat the simulation

The Bayes rule is calculated according to (7). It is obvious that when the dimension increases, both the dispersion and the angle increase. They are indicators of overfitting HDLSS data. When the imbalance factor m increases, the two measures increase as well, although not as much as when the dimension increases. More importantly, it shows that when θ decreases (from 1 to 0, or equivalently, as FLAME changes from SVM to DWD), the dispersion and the angle both decrease, which is promising because it shows that FLAME improves SVM on the overfitting issue.

6.3 Effects of Tuning Parameters with Covariance

We also investigate the effect of different covariance structures, since an independence structure among variables, as in the last subsection, is less common in real applications. We investigate three covariance structures: independent, interchangeable and block-interchangeable covariance. Data are generated from two multivariate normal distributions MVN_300(µ_±, Σ) with d = 300. We first let µ_1 = (75, 74, 73, ..., 1, 0, 0, ..., 0)ᵀ, then scale it by multiplying by a constant c such that the Mahalanobis distance between µ_+ = cµ_1 and µ_− = −cµ_1 equals 5.4, that is, √((µ_+ − µ_−)ᵀΣ^{−1}(µ_+ − µ_−)) = 5.4. Note that this represents a reasonable signal-to-noise ratio.

We consider the FLAME machines with different parameter θ from a grid of 11 values (0, 0.1, 0.2, ..., 1), and apply them to nine simulated examples (three different imbalance factors (m = 2, 3, 4) × three covariance structures). For the independent structure example, Σ = I_300. For the interchangeable structure example, Σ_ii = 1 and Σ_ij = 0.8 for i ≠ j. For the block-interchangeable structure example, we let Σ be a block diagonal matrix with five diagonal blocks, the sizes of which are 150, 100, 25, 15 and 10, and each block is an interchangeable covariance matrix with diagonal entries 1 and off-diagonal entries 0.8.

Figure 6 shows the results of the interchangeable structure example. Since the results under different covariance structures are similar, those for the other two covariance structures are reported in Online Appendix 1 to save space (Figure A.2 for the independent structure, and Figure A.3 for the block-interchangeable covariance). In each plot, we include the within-group error (top-left), the absolute value of the difference between the estimated intercept and the Bayes intercept |β − β_B| (top-middle), the angle between the estimated direction and the Bayes direction ∠(ω, ω_B) (bottom-left), the RankComp between the estimated direction and the Bayes direction (bottom-middle) and the dispersion of the estimated directions (bottom-right).

We can see in Figure 6 (and Figures A.2 and A.3 in Online Appendix 1) that when we increase θ from 0 to 1, that is, when FLAME moves from the DWD end to the SVM end, the within-group error decreases. This is mostly due to the fact that the intercept term β comes closer to the Bayes rule intercept β_B. On the other hand, the estimated direction deviates from the true direction (larger angle), gives the wrong rank of the variables (larger RankComp), and is more unstable (larger dispersion). Similar phenomena hold for the other two covariance structures, with one exception in the block-interchangeable setting (Figure A.3 in Online Appendix 1), where the RankComp first decreases and then increases.

In the entire FLAME family, DWD represents one extreme which provides better estimation of the direction, is closer to the Bayes direction, provides the right order for all


More information

Sparse Reconstruction of Systems of Ordinary Differential Equations

Sparse Reconstruction of Systems of Ordinary Differential Equations Sparse Reconstruction of Systems of Orinary Differential Equations Manuel Mai a, Mark D. Shattuck b,c, Corey S. O Hern c,a,,e, a Department of Physics, Yale University, New Haven, Connecticut 06520, USA

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,

More information

Parameter estimation: A new approach to weighting a priori information

Parameter estimation: A new approach to weighting a priori information Parameter estimation: A new approach to weighting a priori information J.L. Mea Department of Mathematics, Boise State University, Boise, ID 83725-555 E-mail: jmea@boisestate.eu Abstract. We propose a

More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information

PDE Notes, Lecture #11

PDE Notes, Lecture #11 PDE Notes, Lecture # from Professor Jalal Shatah s Lectures Febuary 9th, 2009 Sobolev Spaces Recall that for u L loc we can efine the weak erivative Du by Du, φ := udφ φ C0 If v L loc such that Du, φ =

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems Construction of the Electronic Raial Wave Functions an Probability Distributions of Hyrogen-like Systems Thomas S. Kuntzleman, Department of Chemistry Spring Arbor University, Spring Arbor MI 498 tkuntzle@arbor.eu

More information

Lower Bounds for the Smoothed Number of Pareto optimal Solutions

Lower Bounds for the Smoothed Number of Pareto optimal Solutions Lower Bouns for the Smoothe Number of Pareto optimal Solutions Tobias Brunsch an Heiko Röglin Department of Computer Science, University of Bonn, Germany brunsch@cs.uni-bonn.e, heiko@roeglin.org Abstract.

More information

Classification Methods with Reject Option Based on Convex Risk Minimization

Classification Methods with Reject Option Based on Convex Risk Minimization Journal of Machine Learning Research 11 (010) 111-130 Submitte /09; Revise 11/09; Publishe 1/10 Classification Methos with Reject Option Base on Convex Risk Minimization Ming Yuan H. Milton Stewart School

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

u!i = a T u = 0. Then S satisfies

u!i = a T u = 0. Then S satisfies Deterministic Conitions for Subspace Ientifiability from Incomplete Sampling Daniel L Pimentel-Alarcón, Nigel Boston, Robert D Nowak University of Wisconsin-Maison Abstract Consier an r-imensional subspace

More information

Tractability results for weighted Banach spaces of smooth functions

Tractability results for weighted Banach spaces of smooth functions Tractability results for weighte Banach spaces of smooth functions Markus Weimar Mathematisches Institut, Universität Jena Ernst-Abbe-Platz 2, 07740 Jena, Germany email: markus.weimar@uni-jena.e March

More information

arxiv: v2 [math.pr] 27 Nov 2018

arxiv: v2 [math.pr] 27 Nov 2018 Range an spee of rotor wals on trees arxiv:15.57v [math.pr] 7 Nov 1 Wilfrie Huss an Ecaterina Sava-Huss November, 1 Abstract We prove a law of large numbers for the range of rotor wals with ranom initial

More information

Robustness and Perturbations of Minimal Bases

Robustness and Perturbations of Minimal Bases Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important

More information

Robust Low Rank Kernel Embeddings of Multivariate Distributions

Robust Low Rank Kernel Embeddings of Multivariate Distributions Robust Low Rank Kernel Embeings of Multivariate Distributions Le Song, Bo Dai College of Computing, Georgia Institute of Technology lsong@cc.gatech.eu, boai@gatech.eu Abstract Kernel embeing of istributions

More information

Necessary and Sufficient Conditions for Sketched Subspace Clustering

Necessary and Sufficient Conditions for Sketched Subspace Clustering Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This

More information

Analyzing Tensor Power Method Dynamics in Overcomplete Regime

Analyzing Tensor Power Method Dynamics in Overcomplete Regime Journal of Machine Learning Research 18 (2017) 1-40 Submitte 9/15; Revise 11/16; Publishe 4/17 Analyzing Tensor Power Metho Dynamics in Overcomplete Regime Animashree Ananumar Department of Electrical

More information

Schrödinger s equation.

Schrödinger s equation. Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of

More information

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University

More information

Semiclassical analysis of long-wavelength multiphoton processes: The Rydberg atom

Semiclassical analysis of long-wavelength multiphoton processes: The Rydberg atom PHYSICAL REVIEW A 69, 063409 (2004) Semiclassical analysis of long-wavelength multiphoton processes: The Ryberg atom Luz V. Vela-Arevalo* an Ronal F. Fox Center for Nonlinear Sciences an School of Physics,

More information

Lyapunov Functions. V. J. Venkataramanan and Xiaojun Lin. Center for Wireless Systems and Applications. School of Electrical and Computer Engineering,

Lyapunov Functions. V. J. Venkataramanan and Xiaojun Lin. Center for Wireless Systems and Applications. School of Electrical and Computer Engineering, On the Queue-Overflow Probability of Wireless Systems : A New Approach Combining Large Deviations with Lyapunov Functions V. J. Venkataramanan an Xiaojun Lin Center for Wireless Systems an Applications

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

THE EFFICIENCIES OF THE SPATIAL MEDIAN AND SPATIAL SIGN COVARIANCE MATRIX FOR ELLIPTICALLY SYMMETRIC DISTRIBUTIONS

THE EFFICIENCIES OF THE SPATIAL MEDIAN AND SPATIAL SIGN COVARIANCE MATRIX FOR ELLIPTICALLY SYMMETRIC DISTRIBUTIONS THE EFFICIENCIES OF THE SPATIAL MEDIAN AND SPATIAL SIGN COVARIANCE MATRIX FOR ELLIPTICALLY SYMMETRIC DISTRIBUTIONS BY ANDREW F. MAGYAR A issertation submitte to the Grauate School New Brunswick Rutgers,

More information

On conditional moments of high-dimensional random vectors given lower-dimensional projections

On conditional moments of high-dimensional random vectors given lower-dimensional projections Submitte to the Bernoulli arxiv:1405.2183v2 [math.st] 6 Sep 2016 On conitional moments of high-imensional ranom vectors given lower-imensional projections LUKAS STEINBERGER an HANNES LEEB Department of

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Keywors: multi-view learning, clustering, canonical correlation analysis Abstract Clustering ata in high-imensions is believe to be a har problem in general. A number of efficient clustering algorithms

More information

PHYS 414 Problem Set 2: Turtles all the way down

PHYS 414 Problem Set 2: Turtles all the way down PHYS 414 Problem Set 2: Turtles all the way own This problem set explores the common structure of ynamical theories in statistical physics as you pass from one length an time scale to another. Brownian

More information

ANALYSIS OF A GENERAL FAMILY OF REGULARIZED NAVIER-STOKES AND MHD MODELS

ANALYSIS OF A GENERAL FAMILY OF REGULARIZED NAVIER-STOKES AND MHD MODELS ANALYSIS OF A GENERAL FAMILY OF REGULARIZED NAVIER-STOKES AND MHD MODELS MICHAEL HOLST, EVELYN LUNASIN, AND GANTUMUR TSOGTGEREL ABSTRACT. We consier a general family of regularize Navier-Stokes an Magnetohyroynamics

More information

Acute sets in Euclidean spaces

Acute sets in Euclidean spaces Acute sets in Eucliean spaces Viktor Harangi April, 011 Abstract A finite set H in R is calle an acute set if any angle etermine by three points of H is acute. We examine the maximal carinality α() of

More information

Chapter 4. Electrostatics of Macroscopic Media

Chapter 4. Electrostatics of Macroscopic Media Chapter 4. Electrostatics of Macroscopic Meia 4.1 Multipole Expansion Approximate potentials at large istances 3 x' x' (x') x x' x x Fig 4.1 We consier the potential in the far-fiel region (see Fig. 4.1

More information

On combinatorial approaches to compressed sensing

On combinatorial approaches to compressed sensing On combinatorial approaches to compresse sensing Abolreza Abolhosseini Moghaam an Hayer Raha Department of Electrical an Computer Engineering, Michigan State University, East Lansing, MI, U.S. Emails:{abolhos,raha}@msu.eu

More information

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences. S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial

More information

Stable and compact finite difference schemes

Stable and compact finite difference schemes Center for Turbulence Research Annual Research Briefs 2006 2 Stable an compact finite ifference schemes By K. Mattsson, M. Svär AND M. Shoeybi. Motivation an objectives Compact secon erivatives have long

More information

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments Problem F U L W D g m 3 2 s 2 0 0 0 0 2 kg 0 0 0 0 0 0 Table : Dimension matrix TMA 495 Matematisk moellering Exam Tuesay December 6, 2008 09:00 3:00 Problems an solution with aitional comments The necessary

More information

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz

A note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz A note on asymptotic formulae for one-imensional network flow problems Carlos F. Daganzo an Karen R. Smilowitz (to appear in Annals of Operations Research) Abstract This note evelops asymptotic formulae

More information

Quantum Mechanics in Three Dimensions

Quantum Mechanics in Three Dimensions Physics 342 Lecture 20 Quantum Mechanics in Three Dimensions Lecture 20 Physics 342 Quantum Mechanics I Monay, March 24th, 2008 We begin our spherical solutions with the simplest possible case zero potential.

More information

All s Well That Ends Well: Supplementary Proofs

All s Well That Ends Well: Supplementary Proofs All s Well That Ens Well: Guarantee Resolution of Simultaneous Rigi Boy Impact 1:1 All s Well That Ens Well: Supplementary Proofs This ocument complements the paper All s Well That Ens Well: Guarantee

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21 Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting

More information

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES MATHEMATICS OF COMPUTATION Volume 69, Number 231, Pages 1117 1130 S 0025-5718(00)01120-0 Article electronically publishe on February 17, 2000 EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION

More information

Range and speed of rotor walks on trees

Range and speed of rotor walks on trees Range an spee of rotor wals on trees Wilfrie Huss an Ecaterina Sava-Huss May 15, 1 Abstract We prove a law of large numbers for the range of rotor wals with ranom initial configuration on regular trees

More information

arxiv: v4 [math.pr] 27 Jul 2016

arxiv: v4 [math.pr] 27 Jul 2016 The Asymptotic Distribution of the Determinant of a Ranom Correlation Matrix arxiv:309768v4 mathpr] 7 Jul 06 AM Hanea a, & GF Nane b a Centre of xcellence for Biosecurity Risk Analysis, University of Melbourne,

More information

CUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, and Tony Wu

CUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, and Tony Wu CUSTOMER REVIEW FEATURE EXTRACTION Heng Ren, Jingye Wang, an Tony Wu Abstract Popular proucts often have thousans of reviews that contain far too much information for customers to igest. Our goal for the

More information

Introduction to Machine Learning

Introduction to Machine Learning How o you estimate p(y x)? Outline Contents Introuction to Machine Learning Logistic Regression Varun Chanola April 9, 207 Generative vs. Discriminative Classifiers 2 Logistic Regression 2 3 Logistic Regression

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

Math 1271 Solutions for Fall 2005 Final Exam

Math 1271 Solutions for Fall 2005 Final Exam Math 7 Solutions for Fall 5 Final Eam ) Since the equation + y = e y cannot be rearrange algebraically in orer to write y as an eplicit function of, we must instea ifferentiate this relation implicitly

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

Generalization of the persistent random walk to dimensions greater than 1

Generalization of the persistent random walk to dimensions greater than 1 PHYSICAL REVIEW E VOLUME 58, NUMBER 6 DECEMBER 1998 Generalization of the persistent ranom walk to imensions greater than 1 Marián Boguñá, Josep M. Porrà, an Jaume Masoliver Departament e Física Fonamental,

More information

Qubit channels that achieve capacity with two states

Qubit channels that achieve capacity with two states Qubit channels that achieve capacity with two states Dominic W. Berry Department of Physics, The University of Queenslan, Brisbane, Queenslan 4072, Australia Receive 22 December 2004; publishe 22 March

More information

On the Surprising Behavior of Distance Metrics in High Dimensional Space

On the Surprising Behavior of Distance Metrics in High Dimensional Space On the Surprising Behavior of Distance Metrics in High Dimensional Space Charu C. Aggarwal, Alexaner Hinneburg 2, an Daniel A. Keim 2 IBM T. J. Watson Research Center Yortown Heights, NY 0598, USA. charu@watson.ibm.com

More information

NOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy,

NOTES ON EULER-BOOLE SUMMATION (1) f (l 1) (n) f (l 1) (m) + ( 1)k 1 k! B k (y) f (k) (y) dy, NOTES ON EULER-BOOLE SUMMATION JONATHAN M BORWEIN, NEIL J CALKIN, AND DANTE MANNA Abstract We stuy a connection between Euler-MacLaurin Summation an Boole Summation suggeste in an AMM note from 196, which

More information

Capacity Analysis of MIMO Systems with Unknown Channel State Information

Capacity Analysis of MIMO Systems with Unknown Channel State Information Capacity Analysis of MIMO Systems with Unknown Channel State Information Jun Zheng an Bhaskar D. Rao Dept. of Electrical an Computer Engineering University of California at San Diego e-mail: juzheng@ucs.eu,

More information

JUST THE MATHS UNIT NUMBER DIFFERENTIATION 2 (Rates of change) A.J.Hobson

JUST THE MATHS UNIT NUMBER DIFFERENTIATION 2 (Rates of change) A.J.Hobson JUST THE MATHS UNIT NUMBER 10.2 DIFFERENTIATION 2 (Rates of change) by A.J.Hobson 10.2.1 Introuction 10.2.2 Average rates of change 10.2.3 Instantaneous rates of change 10.2.4 Derivatives 10.2.5 Exercises

More information

arxiv:hep-th/ v1 3 Feb 1993

arxiv:hep-th/ v1 3 Feb 1993 NBI-HE-9-89 PAR LPTHE 9-49 FTUAM 9-44 November 99 Matrix moel calculations beyon the spherical limit arxiv:hep-th/93004v 3 Feb 993 J. Ambjørn The Niels Bohr Institute Blegamsvej 7, DK-00 Copenhagen Ø,

More information

The effect of nonvertical shear on turbulence in a stably stratified medium

The effect of nonvertical shear on turbulence in a stably stratified medium The effect of nonvertical shear on turbulence in a stably stratifie meium Frank G. Jacobitz an Sutanu Sarkar Citation: Physics of Fluis (1994-present) 10, 1158 (1998); oi: 10.1063/1.869640 View online:

More information

Logarithmic spurious regressions

Logarithmic spurious regressions Logarithmic spurious regressions Robert M. e Jong Michigan State University February 5, 22 Abstract Spurious regressions, i.e. regressions in which an integrate process is regresse on another integrate

More information

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,

More information

IERCU. Institute of Economic Research, Chuo University 50th Anniversary Special Issues. Discussion Paper No.210

IERCU. Institute of Economic Research, Chuo University 50th Anniversary Special Issues. Discussion Paper No.210 IERCU Institute of Economic Research, Chuo University 50th Anniversary Special Issues Discussion Paper No.210 Discrete an Continuous Dynamics in Nonlinear Monopolies Akio Matsumoto Chuo University Ferenc

More information

Optimized Schwarz Methods with the Yin-Yang Grid for Shallow Water Equations

Optimized Schwarz Methods with the Yin-Yang Grid for Shallow Water Equations Optimize Schwarz Methos with the Yin-Yang Gri for Shallow Water Equations Abessama Qaouri Recherche en prévision numérique, Atmospheric Science an Technology Directorate, Environment Canaa, Dorval, Québec,

More information

Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis

Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis Chuang Wang, Yonina C. Elar, Fellow, IEEE an Yue M. Lu, Senior Member, IEEE Abstract We present a high-imensional analysis

More information

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys

Homework 2 Solutions EM, Mixture Models, PCA, Dualitys Homewor Solutions EM, Mixture Moels, PCA, Dualitys CMU 0-75: Machine Learning Fall 05 http://www.cs.cmu.eu/~bapoczos/classes/ml075_05fall/ OUT: Oct 5, 05 DUE: Oct 9, 05, 0:0 AM An EM algorithm for a Mixture

More information

Lagrangian and Hamiltonian Mechanics

Lagrangian and Hamiltonian Mechanics Lagrangian an Hamiltonian Mechanics.G. Simpson, Ph.. epartment of Physical Sciences an Engineering Prince George s Community College ecember 5, 007 Introuction In this course we have been stuying classical

More information

CHAPTER 1 : DIFFERENTIABLE MANIFOLDS. 1.1 The definition of a differentiable manifold

CHAPTER 1 : DIFFERENTIABLE MANIFOLDS. 1.1 The definition of a differentiable manifold CHAPTER 1 : DIFFERENTIABLE MANIFOLDS 1.1 The efinition of a ifferentiable manifol Let M be a topological space. This means that we have a family Ω of open sets efine on M. These satisfy (1), M Ω (2) the

More information

WESD - Weighted Spectral Distance for Measuring Shape Dissimilarity

WESD - Weighted Spectral Distance for Measuring Shape Dissimilarity 1 WESD - Weighte Spectral Distance for Measuring Shape Dissimilarity Ener Konukoglu, Ben Glocker, Antonio Criminisi an Kilian M. Pohl Abstract This article presents a new istance for measuring shape issimilarity

More information

WUCHEN LI AND STANLEY OSHER

WUCHEN LI AND STANLEY OSHER CONSTRAINED DYNAMICAL OPTIMAL TRANSPORT AND ITS LAGRANGIAN FORMULATION WUCHEN LI AND STANLEY OSHER Abstract. We propose ynamical optimal transport (OT) problems constraine in a parameterize probability

More information

Optimization of Geometries by Energy Minimization

Optimization of Geometries by Energy Minimization Optimization of Geometries by Energy Minimization by Tracy P. Hamilton Department of Chemistry University of Alabama at Birmingham Birmingham, AL 3594-140 hamilton@uab.eu Copyright Tracy P. Hamilton, 1997.

More information