arxiv: v2 [stat.ml] 19 Oct 2016

Size: px

Start display at page:

Download "arxiv: v2 [stat.ml] 19 Oct 2016"

Aleesha Norris
6 years ago
Views:

1 Sparse Quadratic Discriminant Anaysis and Community Bayes arxiv: v2 [stat.ml] 19 Oct 2016 Ya Le Department of Statistics Stanford University Abstract Trevor Hastie Department of Statistics Stanford University We deveop a cass of rues spanning the range between quadratic discriminant anaysis and naive Bayes, through a path of sparse graphica modes. A group asso penaty is used to introduce shrinkage and encourage a simiar pattern of sparsity across precision matrices. It gives sparse estimates of interactions and produces interpretabe modes. Inspired by the connected-components structure of the estimated precision matrices, we propose the community Bayes mode, which partitions features into severa conditiona independent communities and spits the cassification probem into separate smaer ones. The community Bayes idea is quite genera and can be appied to non-gaussian data and ikeihood-based cassifiers. 1 Introduction In the generic cassification probem, the outcome of interest G fas into K unordered casses, which for convenience we denote by {1, 2,..., K}. Our goa is to buid a rue for predicting the cass abe of an item based on p measurements of features X R p. The training set consists of the cass abes and features for n items. This is an important practica probem with appications in many fieds. Quadratic discriminant anaysis (QDA) can be derived as the maximum ikeihood method for norma popuations with different means and different covariance matrices. It is a favored too when there are some strong interactions and inear boundaries cannot separate the casses. However, QDA is poory posed when the sampe size n k is not consideraby arger than p for any cass k, and ceary i-posed if n k p. Therefore it encourages us to empoy a reguarization method [17, 24]. Athough there aready exist a number of proposas to reguarize inear discriminant anaysis (LDA) [1, 4, 6, 8, 9, 11, 23, 26], few methods have been deveoped to reguarize QDA. Friedman [7] suggests appying a ridge penaty to within-cass covariance matrices, and Price et a. [19] propose to add a ridge penaty and a ridge fusion penaty to the og-ikeihood. 1

2 For both methods, no eements of the resuting within-cass precision matrices wi be zero, eading to dense interaction terms in the discriminant functions and difficuties in interpretation. Assuming sparse conditions on both unknown means and covariance matrices, Li and Shao [13] propose to construct mean and covariance estimators by threshoding. Sun and Zhao [21] assume bock-diagona structure for covariance matrices and estimate each bock by 1 reguarization. The bocks are assumed to be of the same size, and are formed based on two sampe t statistics. However, both methods can ony appy to two-cass cassification. Another approach is to assume conditiona independence of the features (naive Bayes). Naive Bayes cassifiers often outperform far more sophisticated aternatives when the dimension p of the feature space is high, but the conditiona independence assumption may be too rigid in the presence of strong interactions. In this paper we deveop a cass of rues spanning the range between QDA and naive Bayes using a group asso penaty [27]. A group asso penaty is appied to the (i, j)th eement across a K within-cass precision matrices, which forces the zeros in the K estimated precision matrices to occur in the same paces. This shared pattern of sparsity resuts in a sparse estimate of interaction terms, making the cassifier easy to interpret. We refer to this cassification method as Sparse Quadratic Discriminant Anaysis (SQDA). The connected components in the resuting precision matrices correspond exacty to those obtained from simpe threshoding rues on the within-cass sampe covariance matrices [5]. This suggests us to partition features into severa independent communities before estimating the parameters [3, 10, 18]. Therefore, we propose the community Bayes mode, a simpe idea to identify and make use of conditiona independent communities of features. Furthermore, it is a genera approach appicabe to non-gaussian data and any ikeihoodbased cassifiers. Specificay, the community Bayes works by partitioning features into severa communities, soving separate cassification probems for each community, and combining the community-wise resuts into one fina prediction. We show that this approach can improve the accuracy and interpretabiity of the corresponding cassifier. The paper is organized as foows. In section 2 we discuss sparse quadratic discriminant anaysis and the sampe covariance threshoding rues. In section 3 we discuss the community Bayes idea. Simuated and rea data exampes appear in Section 4, concuding remarks in Section 5 and the proofs are gathered in the Appendix. 2 Sparse Quadratic Discriminant Anaysis Suppose we have training data (x i, g i ) R p K, i = 1, 2,..., n, where x i s are observations with measurements on a set of p features and g i s are cass abes. We assume that the n k observations within the kth cass are identicay distributed as N(µ k, Σ k ), and the n = k n k observations are independent. We 2

3 et π k denote the prior for cass k. Then the quadratic discriminant function is δ k (x) = 1 2 og det Σ 1 k 1 2 (x µ k) T Σ 1 k (x µ k) + og π k. (1) Let Θ (k) = Σ 1 k denote the true precision matrix for cass k, and θ ij = (θ (1) ij,..., θ(k) ij ) denote the vector of the (i, j)th eement across a K precision matrices. We propose to estimate µ, π, Θ by maximizing the penaized og ikeihood K n k 2 og det Θ(k) 1 (x i µ k ) T Θ (k) (x i µ k ) + n k og π k λ θ ij k=1 g i=k (2) The ast term is a group asso penaty, appied to the (i, j)th eement across a K precision matrices, which forces a simiar pattern of sparsity across a the precision matrices. Hence, we refer to this cassification method as Sparse Quadratic Discriminant Anaysis (SQDA). Let Θ = (Θ (1),..., Θ (K) ). Soving (2) gives us the estimates ˆµ k = 1 x i (3) n k ˆπ k = n k n g i=k ˆΘ =argmax Θ (k) 0 K k=1 i j (4) ) n k (og det Θ (k) tr(s (k) Θ (k) ) λ θ ij 2, (5) i j where S (k) = 1 n k g i=k (x i ˆµ k )(x i ˆµ k ) T is the sampe covariance matrix for cass k. This mode has severa important features: 1. λ is a nonnegative tuning parameter controing the simutaneous shrinkage of a precision matrices toward diagona matrices. The vaue λ = 0 gives rise to QDA, whereas λ = yieds the naive Bayes cassifier. Vaues between these imits represent degrees of reguarization ess severe than naive Bayes. Since it is often the case that even sma amounts of reguarization can argey eiminate quite drastic instabiity, smaer vaues of λ (smaer than ) have the potentia of superior performance when some features have strong partia correations. 2. The group asso penaty forces ˆΘ (1),..., ˆΘ (K) to share the ocations of the nonzero eements, eading to a sparse estimate of interactions. Therefore, our method produces more interpretabe resuts. 3. The optimization probem (5) can be separated into independent subprobems of the same form by simpe screening rues on S (1),..., S (K). This eads to a potentiay massive reduction in computationa compexity. 3

4 The convex optimization probem (5) can be quicky soved by R package JGL [5]. Not for the purpose of cassification as here, Danaher et a. [5] propose to jointy estimate mutipe reated Gaussian graphica modes by maximizing the group graphica asso max Θ (k) 0 K ) n k (og detθ (k) tr(s (k) Θ (k) ) λ 1 θ ij 1 λ 2 θ ij 2. k=1 (6) They use the additiona asso penaty to further encourage sparsity within (θ (1) ij,..., θ(k) ij ). However, for the purpose of cassification, our goa is to identify interaction terms that is, we are interested in whether (θ (1) ij,..., θ(k) ij ) is a zero vector instead of whether each eement θ (k) ij is zero. Therefore, we ony use the group asso penaty to estimate precision matrices. We iustrate our method with a handwritten digit recognition exampe. Here we focus on the sub-task of distinguishing handwritten 3s and 8s. We use the same data as Le Cun et a. [12], who normaized binary images for size and orientation, resuting in 8-bit, gray scae images. There are 658 threes and 542 eights in our training set, and 166 test sampes for each. Figure 1 shows a random seection of 3s and 8s. i j i j Figure 1: Exampes of digitized handwritten 3s and 8s. Each image is a 8 bit, gray scae version of the origina binary image. Because of their spatia arrangement the features are highy correated and some kind of smoothing or fitering aways heps. We fitered the data by repacing each non-overapping 2 2 pixe bock with its average. This reduces the dimension of the feature space from 256 to 64. We compare our method with QDA, naive Bayes and a variant of RDA which we refer to as diagona reguarized discriminant anaysis (DRDA). The estimated covariance matrix of DRDA for cass k is ˆΣ (k) (λ) = (1 λ)s (k) + λdiag(s (k) ), with λ [0, 1]. (7) The tuning parameters are seected by 5-fod cross vaidation. Tabe 1 shows the miscassification errors of each method and the corresponding standardized tuning parameters s = P ( ˆΘ(λ))/P ( ˆΘ(0)), where P (Θ) = Σ i j θ (1) ij,..., θ(k) ij 2. The resuts show that SQDA outperforms DRDA, QDA and naive Bayes. In 4

5 Figure 2, we see that the test error of SQDA decreases dramaticay as the mode deviates from naive Bayes and achieves its minimum at a sma vaue of s, whie the test error of DRDA keeps decreasing as the mode ranges from naive Bayes to QDA. This is expected since at sma vaues of s, SQDA ony incudes important interaction terms and shrinks other noisy terms to 0, but DRDA has a interactions in the mode once deviating from naive Bayes. To better dispay the estimated precision matrices, we standardize the estimates to have unit diagona (the standardized precision matrix is equa to the partia correation matrix up to the sign of off-diagona entries). Figure 3 dispays the standardized sampe precision matrices, and the standardized SQDA and DRDA estimates. From Figure 3, we can see that SQDA ony incudes interactions within a diagona band. Tabe 1: Digit cassification resuts of 3s and 8s. Tuning parameters are seected by 5-fod cross vaidation. SQDA DRDA QDA Naive Bayes Test error Training error CV error Standardized tuning parameter Exact covariance threshoding into connected components Mazumder & Hastie [16] and Witten et a. [25] estabish a connection between the graphica asso and connected components. Specificay, the connected components in the estimated precision matrix correspond exacty to those obtained from threshoding the entries of the sampe covariance matrix at λ. For the soution to (5), we have simiar resuts. By simpe threshoding rues on the sampe covariance matrices S (1),..., S (K), the optimization probem (5) can be separated into severa optimization sub-probems of the same form, which eads to huge speed improvements. A more genera resut for the soution to (6) can be found in [5]. Suppose ˆΘ = ( ˆΘ (1),..., ˆΘ (K) ) is the soution to (5) with reguarization parameter λ. Define E (λ) ij = { 1 if (ˆθ (1) (K) ij,..., ˆθ ij ) 0, i j; 0 otherwise. This defines a symmetric graph G (λ) = (V, E (λ) ), namey the estimated concentration graph defined on the nodes V = {1,..., p} with edges E (λ). Suppose it (8) 5

6 Figure 2: The 5-fod cross-vaidation errors (bue) and the test errors (red) of SQDA and DRDA on 3s and 8s. Figure 3: Heat maps of the sampe precision matrices and the estimated precision matrices of SQDA and DRDA. Estimates are standardized to have unit diagona. The first ine corresponds to the precision matrix of 3s and the second ine corresponds to that of 8s. admits a decomposition into κ(λ) connected components G (λ) = κ(λ) =1 G(λ), (9) where G (λ) = (ˆV (λ), E (λ) ) are the components of the graph G (λ). Define S ij = (n 1 S (1) ij,..., n KS (K) ij ) 2.We can aso perform a threshoding 6

7 on the entries of S and obtain a graph edge skeeton defined by { E (λ) 1 if ij = S ij > λ, i j; 0 otherwise. (10) The symmetric matrix E (λ) defines a symmetric graph G (λ) = (V, E (λ) ), which is referred to as the threshoded sampe covariance graph. G (λ) aso admits a decomposition into connected components where G (λ) G (λ) = k(λ) =1 G(λ), (11) = (V (λ), E (λ) ) are the components of the graph G (λ). Theorem 1. For any λ > 0, the components of the estimated concentration graph G (λ) induce exacty the same vertex-partition as that of the threshoded sampe covariance graph G (λ). Formay, κ(λ) = k(λ) and there exists a permutation π on {1,..., k(λ)} such that ˆV (λ) i = V (λ) π(i), i = 1,..., k(λ). (12) Furthermore, given λ > λ > 0, the vertex-partition induced by the components of G (λ) are nested within that induced by the components of G (λ ). That is, κ(λ) κ(λ (λ) ) and the vertex-partition { ˆV } 1 κ(λ) forms a finer resoution of { ˆV (λ ) } 1 κ(λ ). Proof. The proof of this theorem appears in the Appendix. Theorem 1 aows us to quicky check the connected components of the estimated concentration graph by simpe screening rues on S. Notice that for each cass k, the edge set E (k,λ) defined by E (k,λ) ij = 1 {ˆθ(k) ij 0} is nested in E (λ). Therefore, the features can be reordered in such a way that each ˆΘ (k) is bock diagona ˆΘ (k) = ˆΘ (k) ˆΘ(k) ˆΘ(k) k(λ), (13) where the different components represent bocks of indices given by V (λ), = 1,..., k(λ). Then one can simpy sove the optimization probem (5) on the features within each bock separatey, making probem (5) feasibe for certain vaues of λ athough it may be impossibe to operate on the p p variabes Θ (1),..., Θ (K) on a singe machine. 7

8 3 Community Bayes The estimated precision matrices of SQDA are often bock diagona under suitabe ordering of the features, which impies that the features can be partitioned into severa communities and these communities are mutuay independent within each cass. In this section, we generaize this idea to non-gaussian data and other cassification modes, and refer to it as the community Bayes mode. A reated work in the regression setting is [10]. The authors propose to spit the asso probem into smaer ones by estimating the connected components of the sampe covariance matrix. Their approach invoves ony one covariance matrix and works specificay for the asso. 3.1 Main idea We et X R p denote the feature vector and G {1,..., K} denote the cass variabe. Suppose the feature set V = {1,..., p} admits a partition V = L =1 V such that X V1,..., X VL are mutuay independent conditiona on G = k for k = 1,..., K, where X V is a subset of X containing the features in community V. Then the posterior probabiity has the form og p(g = k X) = og p(x G = k) + og p(g = k) og p(x) (14) L = og p(x V G = k) + og p(g = k) og p(x) (15) = =1 L =1 og p(g = k X V ) + (1 L) og p(g = k) + C(X) (16) where C(X) = L =1 og p(x V ) og p(x) ony depends on X and serves as a normaization term. The equation (16) impies an interesting resut: we can fit the cassification mode on each community separatey, and combine the resutant posteriors into the posterior to the origina probem by simpy adjusting the intercept and normaizing it. This resut has three important consequences: 1. The goba probem competey separates into L smaer tractabe subprobems of the same form, making it possibe to sove an otherwise infeasibe arge-scae probem. Moreover, the moduar structure ends it naturay to parae computation. That is, one can sove these sub-probems independenty on separate machines. 2. This idea is quite genera and can be appied to any ikeihood-based cassifiers, incuding discriminant anaysis, mutinomia ogistic regression, generaized additive modes, cassification trees, etc. 3. When using a cassification mode with interaction terms, (16) doesn t invove interactions across different communities. Therefore, it has fewer degrees of freedom and thus smaer variance than the goba probem. 8

9 To find conditionay independent communities, we use the Spearman s rho, a robust nonparametric rank-based statistics, to directy estimate the unknown correation matrices, and then appy average, singe or compete inkage custering to the estimated correation matrices to get the estimated communities. In next section, we wi show that this procedure consistenty identify conditionay independent communities when the data are from a nonparanorma famiy. The community Bayes agorithm is summarized beow. Agorithm 1 The community Bayes agorithm 1: Compute ˆR (k), the estimated correation matrix based on the Spearman s rho for each cass k. 2: Perform average, singe or compete inkage custering with simiarity matrix R, where R ij = (n 1 ˆR(1) ij,..., n K ˆR (K) ij ) 2, and cut the dendrogram at eve τ to produce vertex-partition V = L =1 ˆV. 3: For each community = 1,..., L, estimate og p(g = k X ˆV ) using a cassification method of choice. 4: Pick τ and any other tuning parameters by cross-vaidation. 3.2 Community estimation In this section we address the key part of the community Bayes mode: how to find conditionay independent communities. We propose to derive the communities by appying average, singe or compete inkage custering to the correation matrices, which are estimated based on nonparametric rank-based statistics. In the case where data are from a nonparanorma famiy, we prove that given knowedge of L, this procedure consistenty identify conditionay independent communities. We assume X = (X 1,..., X p ) T G = k foows a nonparanorma distribution NP N(µ (k), Σ (k), f (k) ) [15] and Σ (k) is nonsinguar. That is, there exists a set of univariate stricty increasing transformations f (k) = {f (k) j such that } p j=1 Z (k) := f (k) (X) G = k N(µ (k), Σ (k) ), (17) where f (k) (X) = (f (k) 1 (X 1 ),..., f p (k) (X p )) T. Notice that the transformation functions f (k) can be different for different casses. To make the mode identifiabe, f (k) preserves the popuation mean and standard deviations: E(X j G = k) = E(f (k) j (X j ) G = k) = µ (k) j, Var(X j G = k) = Var(f (k) j (X j ) G = k) = σ (k) 2 j. Liu et a.[15] prove that the nonparanorma distribution is a Gaussian copua when the transformation functions are monotone and differentiabe. Let R (k) denote the correation matrix of the Gaussian distribution Z (k), and define { 1 if (R (1) ij E ij =,..., R(K) ij ) 0, i j; (18) 0 otherwise. 9

10 Suppose the graph G = (V, E) admits a decomposition into L connected components G = L =1 G. Then the vertex-partition V = L =1 V induced by this decomposition gives us exacty the conditionay independent communities we need. This is because G and G are disconnected Z (k) V Z (k) V, k X V X V G = k, k. (19) Let (x i, g i ) R p K, i = 1,..., n be the training data where x i = (x i1,..., x ip ) T. We estimate the correation matrices using Nonparanorma SKEPTIC [14], which expoits the Spearman s rho and Kenda s tau to directy estimate the unknown correation matrices. Since the estimated correation matrices based on the Spearman s rho and Kenda s tau have simiar theoretica performance, we ony adopt the ones based on the Spearman s rho here. In specific, et ˆρ (k) ij be the Spearman s rho between features i and j based on n k sampes in cass k, i.e., {x i g i = k, i = 1,..., n}. Then the estimated correation matrix for cass k is { ˆR (k) 2 sin( π ij = 6 ˆρ(k) ij ) i j; (20) 1 i = j. Notice that the graph defined by R (k) has the same vertex-partition as that defined by (R (k) ) 1. Inspired by the exact threshoding resut of SQDA, we define R ij = (n ˆR(1) 1 ij,..., n (K) K ˆR ij ) 2 and perform exact threshoding on the entries of R at a certain eve τ, where τ is estimated by cross-vaidation. The resutant vertex-partition yieds an estimate of the conditionay independent communities. Furthermore, there is an interesting connection to hierarchica custering. Specificay, the vertex-partition induced by threshoding matrix R corresponds to the subtrees from when we appy singe inkage aggomerative custering to R and then cut the dendrogram at eve τ [22]. Singe inkage custering tends to produce traiing custers in which individua features are merged one at a time. However, Theorem 2 shows that given knowedge of the true number of communities L, appication of singe, average or compete inkage aggomerative custering on R consistenty estimates the vertex-partition of G. Theorem 2. Assume that G has L connected components and min k n k 21 og p + 2. Define R 0 ij = (n 1R (1) ij,..., n KR (K) ij ) 2 and et min R 0 ij 16πK n k og p, for k = 1,..., K. (21) i,j V ;=1,...,L Then the estimated vertex-partition V = L =1 ˆV resuting from performing SLC, ALC, or CLC with simiarity matrix R satisfies P ( : ˆV V ) K p 2. Proof. The proof of this theorem appears in the Appendix. A sufficient condition for (21) is min i,j V ;=1,...,L (R(1) ij,..., R(K) ij 10 ) 2 16πK og p n max n max n min, (22)

11 where n min = min k n k and n max = max k n k. Therefore, Theorem 2 estabishes the consistency of identification of conditionay independent communities by performing hierarchica custering using SLC, ALC or CLC, provided that n max = Ω(og p), n max /n min = O(1) as n k, p, and provided that no within-community eement of R (k) is too sma in absoute vaue for a k. 4 Exampes 4.1 Sparse quadratic discriminant anaysis In this section we study the performance of SQDA in severa simuated exampes and a rea data exampe. The resuts show that SQDA achieves a ower miscassification error and better interpretabiity compared to naive bayes, QDA and a variant of RDA. Since RDA proposed by Friedman [7] has two tuning parameters, resuting in an unfair comparison, we use a version of RDA that shrinks ˆΣ (k) towards its diagona: ˆΣ (k) (λ) = (1 λ)s (k) + λdiag(s (k) ), with λ [0, 1] (23) Note that λ = 0 corresponds to QDA and λ = 1 corresponds to the naive Bayes cassifier. For any λ < 1, the estimated covariance matrices are dense. We refer to this version of RDA as diagona reguarized discriminant anaysis (DRDA) in the rest of this paper. We report the test miscassification errors and its corresponding standardized tuning parameters of these four cassifiers. Let P (Θ) = i j (θ(1) ij,..., θ(k) ij ) 2. The standardized tuning parameter is defined as s = P ( ˆΘ(λ))/P ( ˆΘ(0)), with s = 0 corresponding to the naive Bayes cassifier and s = 1 corresponding to QDA Simuated exampes The data generated in each experiment consists of a training set, a vaidation set to tune the parameters, and a test set to evauate the performance of our chosen mode. Foowing the notation of [28], we denote././. the number of observations in the training, vaidation and test sets respectivey. For every data set, the tuning parameter minimizing the vaidation miscassification error is chosen to compute the test miscassification error. Each experiment was repicated 50 times. In a cases the popuation cass conditiona distributions were norma, the number of casses was K = 2, and the prior probabiity of each cass was taken to be equa. The mean for cass k was taken to have Eucidean norm tr(σ (k) )/p and the two means were orthogona to each other. We consider three different modes for precision matrices and in a cases the K precision matrices are from the same mode: Mode 1. Fu mode: θ (k) ij Mode 2. Decreasing mode: θ (k) ij = 1 if i = j and θ (k) ij = ρ i j k. 11 = ρ k otherwise.

12 Mode 3. Bock diagona mode with bock size q: θ (k) ij = 1 if i = j, θ (k) ij = ρ k if i j and i, j q and θ (k) ij = 0 otherwise, where 0 < q < p. Tabe 2 and 3, summarizing the resuts for each situation, present the average test miscassification errors over the 50 repications. The quantities in parentheses are the standard deviations of the respective quantities over the 50 repications. Exampe 1: Dense interaction terms We study the performance of SQDA on Mode 1 and Mode 2 with p = 8, ρ 1 = 0, and ρ 2 = 0.8. We generated 50/50/10000 observations for each cass. Tabe 2 summarizes the resuts. In this exampe, both situations have fu interaction terms and shoud favor DRDA. The resuts show that SQDA and DRDA have simiar performance, and both of them give ower miscassification errors and smaer standard deviations than QDA or naive Bayes. The mode-seection procedure behaves quite reasonaby, choosing arge vaues of the standardized tuning parameter s for both SQDA and DRDA. Tabe 2: Miscassification errors and seected tuning parameters for simuated exampe 1. The vaues are averages over 50 repications, with the standard errors in parentheses. Miscassification error SQDA DRDA QDA Naive Bayes Average standardized tuning parameter SQDA DRDA Fu 0.129(0.021) 0.129(0.020) 0.238(0.028) 0.179(0.026) 0.828(0.258) 0.835(0.302) Decreasing 0.103(0.011) 0.104(0.013) 0.170(0.031) 0.153(0.032) 0.817(0.237) 0.781(0.314) Exampe 2: Sparse interaction terms We study the performance of SQDA on Mode 3 with ρ 1 = 0, ρ 2 = 0.8 and q = 4. We performed experiments for (p, n) = (8, 50), (20, 200), (40, 800), (100, 1500), and for each p we generated n/n/10000 observations for each cass. Tabe 3 summarizes the resuts. In this exampe, a situations have sparse interaction terms and shoud favor SQDA. Moreover, the sparsity eve increases as p increases. As conjectured, SQDA strongy dominates with ower miscassification errors at a dimensionaities. The standardized tuning parameter vaues for DRDA are uniformy arger 12

13 Tabe 3: Miscassification errors and seected tuning parameters for simuated exampe 2. The vaues are averages over 50 repications, with the standard errors in parentheses. Miscassification error SQDA DRDA QDA Naive Bayes Average standardized tuning parameter SQDA DRDA p = 8, n = (0.010) 0.088(0.013) 0.092(0.011) 0.107(0.014) 0.529(0.309) 0.688(0.314) p = 20, n = (0.007) 0.109(0.009) 0.114(0.006) 0.121(0.004) 0.190(0.046) 0.293(0.345) p = 40, n = (0.006) 0.108(0.007) 0.127(0.002) 0.114(0.004) 0.109(0.015) 0.194(0.230) p = 100, n = (0.007) 0.188(0.009) 0.230(0.025) 0.215(0.035) 0.024(0.004) 0.073(0.059) than those for SQDA. This is expected because in order to capture the same amount of interactions DRDA needs to incude more noises than SQDA Rea data exampe: vowe recognition data This data consists of training and test data with 10 predictors and 11 casses. We obtained the data from the benchmark coection maintained by Scott Fahman at Carnegie Meon University. The data was contributed by Anthony Robinson [20]. The casses correspond to 11 vowe sounds, each contained in 11 different words. Here are the words, preceded by the symbos that represent them: Vowe Word Vowe Word Vowe Word i heed a: hard U hood I hid Y hud u: who d E head O hod 3: heard A had C: hoard The word was uttered once by each of the 15 speakers. Four mae and four femae speakers were used to train the modes, and the other four mae and three femae speakers were used for testing the performance. This paragraph is technica and describes how the anaog speech signas were transformed into a 10-dimensiona feature vector. The speech signas were owpass fitered at 4.7kHz and then digitized to 12 bits with a 10kHz samping rate. Twefth-order inear predictive anaysis was carried out on six 512-sampe Hamming windowed segments from the steady part of the vowe. The refection coefficients were used to cacuate 10 og-area parameters, giving a 10-dimensiona input space. Each speaker thus yieded six frames of speech from 11 vowes. This gave 528 frames from the eight speakers used to train the modes and 462 frames from the seven speakers used to test the modes. 13

14 We impement our method on a subset of data, which consists of four vowes "Y", "O", "U" and "u:". These four vowes are seected because their sampe precision matrices are approximatey sparse, and SQDA usuay performs we on such dataset. Figure 4 dispays the standardized sampe precision matrices, and the standardized SQDA and DRDA estimates. In addition to DRDA, QDA and naive Bayes, we aso compare our method with three other cassifiers, namey the Support Vector Machine (SVM), k-nearest neighborhood (knn) and Random Forest (RF), in terms of miscassification rate. The SVM, knn, and Random Forest were impemented by the R packages of "e1071", "cass" and "randomforest" with defaut settings, respectivey. The resuts of these cassification procedures are shown in Tabe 4. Figure 4: Heat maps of the sampe precision matrices and the estimated precision matrices of SQDA and DRDA on vowe data. Estimates are standardized to have unit diagona. 14

15 Tabe 4: Vowe speech cassification resuts. Tuning parameters are seected by 5-fod cross vaidation. SQDA DRDA QDA Naive Bayes SVM knn Random Forest Test error CV error The SQDA performs significanty better than the other six cassifiers. Compared with the second winner DRDA, the SQDA has a reative gain of (19.7% %)/19.7% = 12.7%. 4.2 Community Bayes In this section we study the performance of the community Bayes mode using ogistic regression as an exampe. We refer to the cassifier that combines the community Bayes agorithm with ogistic regression as community ogistic regression. We compare the performance of ogistic regression (LR) and community ogistic regression (CLR) in severa simuated exampes and a rea data exampe, and the resuts show that CLR has better accuracy and smaer variance in predictions Simuated exampes The data generated in each experiment consists of a training set, a vaidation set to tune the parameters, and a test set to evauate the performance of our chosen mode. Each experiment was repicated 50 times. In a cases the distributions of Z (k) were norma, the number of casses was K = 3, and the prior probabiity of each cass was taken to be equa. The mean for cass k was taken to have Eucidean norm tr(σ (k) )/(p/2) and the three means were orthogona to each other. We consider three modes for covariance matrices and in a cases the K covariance matrices are the same: Mode 1. Fu mode: Σ ij = 1 if i = j and Σ ij = ρ otherwise. Mode 2. Decreasing mode: Σ ij = ρ i j. Mode 3. Bock diagona mode with q bocks: Σ = diag(σ 1,..., Σ q ). Σ i is of size p q p q or ( p q + 1) ( p q + 1) and is from Mode 1. For simpicity, the transformation functions for a dimensions were the same f (k) 1 =... = f p (k) = f (k). Define g (k) = (f (k) ) 1. In addition to the identity transformation, two different transformations g (k) were empoyed as in [15]: the symmetric power transformation and the Gaussian CDF transformation. See [15] for the definitions. We compare the performance of LR and CLR on the above modes with p = 16, ρ = 0.5 and q = 2, 4, 8. We generated 20/20/10000 observations for 15

16 each cass and use average inkage custering to estimate the communities. Two exampes were studied: Exampe 1: The transformations g (k) for a three casses are the identity transformation. The popuation cass conditiona distributions are norma with different means and common covariance matrix. The ogits are inear in this exampe. Exampe 2: The transformations g (k) for cass 1,2 and 3 are the identity transformation, the symmetric power transformation with α = 3, and the Gaussian CDF transformation with µ g0 = 0 and σ g0 = 1 respectivey. α, µ g0 and σ g0 are defined in [15]. The power and CDF transformations map a univariate norma distribution into a highy skewed and a bi-moda distribution respectivey. The ogits are noninear in this exampe. Tabe 5 summarizes the test miscassification errors of these two methods and the estimated numbers of communities. CLR performs we in a cases with ower test errors and smaer standard errors than LR, incuding the ones where the covariance matrices don t have a bock structure. The community Bayes agorithm introduces a more significant improvement when the cass conditiona distributions are not norma. Tabe 5: Miscassification errors and estimated numbers of communities for the two simuated exampes. The vaues are averages over 50 repications, with the standard errors in parentheses. Exampe 1 CLR test error LR test error Number of communities Exampe 2 CLR test error LR test error Number of communites Fu Decreasing Bock q = 2 q = 4 q = (0.041) 0.274(0.040) 5.220(4.292) 0.221(0.049) 0.255(0.061) 7.480(4.253) 0.221(0.032) 0.286(0.036) 6.040(4.262) 0.227(0.040) 0.282(0.062) 7.320(3.977) 0.227(0.033) 0.271(0.039) 4.700(3.882) 0.208(0.041) 0.263(0.062) 6.940(4.533) 0.203(0.035) 0.278(0.041) 5.320(3.395) 0.203(0.036) 0.266(0.063) 6.220(3.627) 0.262(0.030) 0.325(0.036) 7.780(3.935) 0.239(0.031) 0.309(0.057) 9.160(3.782) Rea data exampe: emai spam The data for this exampe consists of information from 4601 emai messages, in a study to screen emai for spam. The true outcome emai or spam is avaiabe, aong with the reative frequencies of 57 of the most commony occurring words and punctuation marks in the emai message. Since most of the spam predictors 16

17 have a very ong-taied distribution, we og-transformed each variabe (actuay og(x + 0.1)) before fitting the modes. We compare CLR with LR and four other commony used cassifiers, namey SVM, knn, Random Forest and Boosting. Same as in section 4.1.2, the SVM, knn, Random Forest and Boosting were impemented by the R packages of "e1071", "cass", "randomforest" and "ada" with defaut settings, respectivey. The range of the number of communities for CLR were fixed to be between 1 and 20, and the optima number of communities were seected by 5-fod cross vaidation. We randomy chose 1000 sampes from this data, and spit them into equay sized training and test sets. We repeated the random samping and spitting 20 times. The mean miscassification percentage of each method is isted in Tabe 6 with standard error in parenthesis. Tabe 6: Emai spam cassification resuts. The vaues are test miscassification errors averaged over 20 repications, with standard errors in parentheses. CLR LR SVM knn Random Forest Boosting 0.068(0.019) 0.087(0.026) 0.070(0.011) 0.106(0.017) 0.071(0.010) 0.075(0.012) The mean test error rate for LR is 8.7%. By comparison, CLR has a mean test error rate of 6.8%, yieding a 21.8% improvement. Figure 5 shows the test error and the cross-vaidation error of CLR over the range of the number of communities for one repication. It corresponds to LR when the number of communities is 1. The estimated number of communities in this repication is 6, and these communities are isted beow. Most features in community 1 are negativey correated with spam, whie most features in community 2 are positivey correated. hp, hp, george, ab, abs, tenet, technoogy, direct, origina, pm, cs, re, edu, conference, 650, 857, 415, 85, 1999, ch;, ch(, ch[, CAPAVE, CAPMAX, CAPTOT. over, remove, internet, order, free, money, credit, business, emai, mai, receive, wi, peope, report, address, addresses, make, a, our, you, your, 000, ch!, ch$. font, ch#. data, project. parts, meeting, tabe. 3d. 5 Concusions In this paper we have proposed sparse quadratic discriminant anaysis, a cassifier ranging between QDA and naive Bayes through a path of sparse graphica 17

18 Figure 5: The 5-fod cross-vaidation errors (bue) and the test errors (red) of CLR on the spam data. modes. By aowing interaction terms into the mode, this cassifier reaxes the strict and often unreasonabe assumption of naive Bayes. Moreover, the resuting estimates of interactions are sparse and easier to interpret compared to other existing cassifiers that reguarize QDA. Motivated by the connection between the estimated precision matrices and singe inkage custering with the sampe covariance matrices, we present the community Bayes mode, a simpe procedure appicabe to non-gaussian data and any ikeihood-based cassifiers. By expoiting the bock-diagona structure of the estimated correation matrices, we reduce the probem into severa subprobems that can be soved independenty. Simuated and rea data exampes show that the community Bayes mode can improve accuracy and reduce variance in predictions. A Proofs A.1 Proof of Theorem 1 Proof. The standard KKT conditions [2] of the optimization probem (5) give: ( ( ) ) 1 n ˆΘ(k) k + S (k) + λγ (k) = 0, for k = 1,..., K (24) 18

19 where Γ (k) is a p p matrix with diagona eements being zero: 0 γ (k) 12 γ (k) 1p Γ (k) = Denote γ ij = (γ (1) ij,..., γ(k) ˆθ ij 2 for i j, and ( Let Ŵ(k) = ( ˆΘ(k)) 1. Then ( n 1 (Ŵ(1) n 1 (Ŵ(1) ij S (1) ij γ (k) 21 0 γ (k) 2p γ (k) p1 ij ) and ˆθ ij = (ˆθ (1) ij γ (k) p2 0 { B2 (1) if θ ij = 0; θ ij 2 = otherwise. ),..., n K(Ŵ(K) ij θ ij θ ij 2 ) S (K) ij ). (K),..., ˆθ ij ). The subgradient γ ij = λ ˆθ ij, if i j and ˆθ ij (25) 0; ˆθ ij 2 ij S (1) ij ),..., nk(ŵ(k) ij S (K) 2 ij )) λ, if i j and ˆθ ij = 0; (26) (Ŵ(1) ) ( ) ii,..., Ŵ(K) ii = S (1) ii,..., S(K) ii. (27) There exists an ordering of the vertices {1,..., p} such that the edge-matrix of the threshoded covariance graph is bock-diagona. For notationa convenience, we wi assume that the matrix is aready in this order: E (λ) 1 0 E (λ) =....., (28) 0 E (λ) k(λ) where E (λ) represents the bock of indices given by V (λ), = 1,..., k(λ). We wi construct Ŵ(1),..., Ŵ(K) having the same structure as E (λ) which is a soution to the optimization probem (5). Note that if Ŵ(k) is bock diagona then so is its inverse. Let Ŵ(k) and its inverse ˆΘ (k) be given by: Ŵ (k) = Ŵ (k) Ŵ (k) k(λ), ˆΘ(k) = , for a k. 0 ˆΘ(k) k(λ) (29) ˆΘ (k) 19

20 Define Ŵ(k) or equivaenty ˆΘ = argmin Θ (k) 0 K k=1 ˆΘ (k) = n k ( ogdet Θ (k) (Ŵ(k) ) 1 via the foowing sub-probems ) tr(s (k) Θ (k) ) + λ i,j V (λ) i j θ ij 2 (30) is a sub-bock of S (k) for = 1,..., k(λ), where ˆΘ (1) (K) = ( ˆΘ,..., ˆΘ ), and S (k) with row/coumn indices from V (λ) V (λ) (1) (K). Thus for every, ( ˆΘ,..., ˆΘ ) satisfies the KKT conditions corresponding to the th bock of the p p dimensiona probem. By construction of the threshoded sampe covariance graph, if i V (λ) and j V (λ) with, then S (k) ij λ. Hence, the choice ˆΘ ij = Ŵ(k) ij = 0, k = 1,..., K satisfies the KKT conditions (26) for a the off-diagona entries in the bock-matrix (29). Hence ( ˆΘ (1),..., ˆΘ (K) ) soves the optimization probem (5). (λ) This shows that the vertex-partition { ˆV } 1 κ(λ) obtained from the estimated precision graph is a finer resoution of that obtained from the threshoded covariance graph, i.e., for every {1,..., k(λ)}, there is a {1,..., κ(λ)} such (λ) that ˆV V (λ). In particuar k(λ) κ(λ). Conversey, if ( ˆΘ (1),..., ˆΘ (K) ) admits the decomposition as in the statement (λ) (λ) of the theorem, then it foows from (26) that: for i ˆV and j ˆV with, we have S ij 2 λ since Ŵ(k) ij = 0 for k = 1,..., K. Thus the vertexpartition induced by the connected components of the threshoded covariance graph is a finer resoution of that induced by the connected components of the estimated precision graph. In particuar this impies that k(λ) κ(λ). Combining the above two we concude k(λ) = κ(λ) and aso the equaity (12). Since the abeing of the connected components is not unique, there is a permutation π in the theorem. Note that the connected components of the threshoded sampe covariance graph G (λ) are nested within the connected components of G (λ ). In particuar, the vertex-partition of the threshoded sampe covariance graph at λ is aso nested within that of the threshoded sampe covariance graph at λ. We have aready proved that the vertex-partition induced by the connected components of the estimated precision graph and the threshoded sampe covariance graph are equa. Using this resut, we concude that the vertex-partition induced by the components of the estimated precision graph at λ, given by { } 1 κ(λ) is contained inside the vertex-partition induced by the components of the estimated precision graph at λ, given by { ˆV (λ ) } 1 κ(λ ). ˆV (λ) 20

21 A.2 Proof of Theorem 2 Proof. Let a = min i,j V ;=1,...,L R 0 ij. a 2 } {ˆV = V, } [22]. Then P (, ˆV ) V k=1 First note that {max ij R ij R 0 ij < ( P max R ij R 0 ij a ) ij 2 ( K ) P max n ˆR(k) k ij ij R(k) ij a 2 k=1 K ( ) P max n ˆR(k) k ij ij R(k) ij a 2K K ( P max ij k=1 ˆR (k) ij R (k) ij a ) 2n k K The second inequaity is because a a2 m b b2 m a 1 b a m b m. Since a 16πK n k og p, we have P (max ˆR(k) ij ij R (k) ij a 2n k K ) 1 p by 2 Theorem 4.1 of [14]. Therefore P (, ˆV V ) K p. 2 References [1] Peter J Bicke and Eizaveta Levina. Some theory for fisher s inear discriminant function, naive bayes, and some aternatives when there are many more variabes than observations. Bernoui, pages , [2] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, [3] Peter Bühmann, Phiipp Rütimann, Sara van de Geer, and Cun-Hui Zhang. Correated variabes in regression: custering and sparse estimation. Journa of Statistica Panning and Inference, 143(11): , [4] Line Cemmensen, Trevor Hastie, Daniea Witten, and Bjarne Ersbø. Sparse discriminant anaysis. Technometrics, 53(4), [5] Patrick Danaher, Pei Wang, and Daniea M Witten. The joint graphica asso for inverse covariance estimation across mutipe casses. Journa of the Roya Statistica Society: Series B (Statistica Methodoogy), [6] Sandrine Dudoit, Jane Fridyand, and Terence P Speed. Comparison of discrimination methods for the cassification of tumors using gene expression data. Journa of the American statistica association, 97(457):77 87,

22 [7] Jerome H Friedman. Reguarized discriminant anaysis. Journa of the American statistica association, 84(405): , [8] Yaqian Guo, Trevor Hastie, and Robert Tibshirani. Reguarized inear discriminant anaysis and its appication in microarrays. Biostatistics, 8(1):86 100, [9] Trevor Hastie, Andreas Buja, and Robert Tibshirani. Penaized discriminant anaysis. The Annas of Statistics, pages , [10] Nadine Hussami and Robert Tibshirani. A component asso. arxiv preprint arxiv: , [11] WJ Krzanowski, Phiip Jonathan, WV McCarthy, and MR Thomas. Discriminant anaysis with singuar covariance matrices: methods and appications to spectroscopic data. Appied statistics, pages , [12] B Boser Le Cun, John S Denker, D Henderson, Richard E Howard, W Hubbard, and Lawrence D Jacke. Handwritten digit recognition with a backpropagation network. In Advances in neura information processing systems. Citeseer, [13] Quefeng Li and Jun Shao. Sparse quadratic discriminant anaysis for high dimensiona data. Statist. Sinica, [14] Han Liu, Fang Han, Ming Yuan, John Lafferty, and Larry Wasserman. The nonparanorma skeptic. arxiv preprint arxiv: , [15] Han Liu, John Lafferty, and Larry Wasserman. The nonparanorma: Semiparametric estimation of high dimensiona undirected graphs. The Journa of Machine Learning Research, 10: , [16] Rahu Mazumder and Trevor Hastie. Exact covariance threshoding into connected components for arge-scae graphica asso. The Journa of Machine Learning Research, 13: , [17] Finbarr O Suivan. A statistica perspective on i-posed inverse probems. Statistica science, pages , [18] Mee Young Park, Trevor Hastie, and Robert Tibshirani. Averaged gene expressions for regression. Biostatistics, 8(2): , [19] Bradey S Price, Chares J Geyer, and Adam J Rothman. Ridge fusion in statistica earning. arxiv preprint arxiv: , [20] Anthony John Robinson. Dynamic error propagation networks. PhD thesis, University of Cambridge, [21] Jiehuan Sun and Hongyu Zhao. The appication of sparse estimation of covariance matrix to quadratic discriminant anaysis. BMC bioinformatics, 16(1):48,

23 [22] Kean Ming Tan, Daniea Witten, and Ai Shojaie. The custer graphica asso for improved estimation of gaussian graphica modes. arxiv preprint arxiv: , [23] Robert Tibshirani, Trevor Hastie, Baasubramanian Narasimhan, and Gibert Chu. Diagnosis of mutipe cancer types by shrunken centroids of gene expression. Proceedings of the Nationa Academy of Sciences, 99(10): , [24] DM Titterington. Common structure of smoothing techniques in statistics. Internationa Statistica Review/Revue Internationae de Statistique, pages , [25] Daniea M Witten, Jerome H Friedman, and Noah Simon. New insights and faster computations for the graphica asso. Journa of Computationa and Graphica Statistics, 20(4): , [26] Ping Xu, Guy N Brock, and Rudoph S Parrish. Modified inear discriminant anaysis approaches for cassification of high-dimensiona microarray data. Computationa Statistics & Data Anaysis, 53(5): , [27] Ming Yuan and Yi Lin. Mode seection and estimation in regression with grouped variabes. Journa of the Roya Statistica Society: Series B (Statistica Methodoogy), 68(1):49 67, [28] Hui Zou and Trevor Hastie. Reguarization and variabe seection via the eastic net. Journa of the Roya Statistica Society: Series B (Statistica Methodoogy), 67(2): ,

FRST Multivariate Statistics. Multivariate Discriminant Analysis (MDA)

FRST Multivariate Statistics. Multivariate Discriminant Analysis (MDA) 1 FRST 531 -- Mutivariate Statistics Mutivariate Discriminant Anaysis (MDA) Purpose: 1. To predict which group (Y) an observation beongs to based on the characteristics of p predictor (X) variabes, using