Multilevel Functional Clustering Analysis


Nicoleta Serban (Corresponding Author)
H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology
nserban@isye.gatech.edu

Huijing Jiang
Business Analytics & Mathematical Sciences, IBM T.J. Watson Research Center
huijiang@us.ibm.com

Abstract: In this paper, we investigate clustering methods for multilevel functional data, which consist of repeated random functions observed for a large number of units (e.g. subjects) at multiple subunits (e.g. proteins); that is, multiple random functions are observed for each unit. To describe the within- and between-unit variability induced by the hierarchical structure of the data, we take a multilevel functional principal components analysis (MFPCA) approach. We develop and compare a hard clustering method based on the scores derived from the MFPCA and a soft clustering method built on an MFPCA decomposition. In a simulation study, we assess the estimation accuracy of the cluster membership and cluster patterns under a series of settings: small vs. moderate numbers of time points, various noise levels, and varying numbers of repeated measurements or subunits per unit. We demonstrate the applicability of the clustering analysis to a real data set consisting of expression profiles from genes of immune cells. Common and unique response patterns are identified by clustering the expression profiles using our multilevel clustering analysis.

Keywords: cluster analysis, hard clustering, functional ANOVA, microarray analysis, multilevel functional data, multilevel principal component analysis, model-based clustering, soft clustering.

1 Introduction

Due to an increasing number of applications involving the analysis of a large number of observed random functions, exploratory tools such as unsupervised or supervised clustering play an important role in uncovering prevalent patterns among the observed random functions. Specific applications include gene expression profiling from microarray studies (Hastie et al., 2000; Bar-Joseph et al., 2002; Wakefield et al., 2002; Serban and Wasserman, 2005; Booth et al., 2008), clustering subjects by their spinal bone mineral density (James and Sugar, 2003), and summarizing the market value trends of manufacturing companies (Serban, 2009).

Functional clustering methods divide into hard and soft (model-based) methods. Hard clustering partitions the set of random functions into non-overlapping subsets according to a similarity measure (e.g. correlation). In soft clustering, the underlying assumption is that the observed random functions are realizations from a mixture process where the mixture weights are the cluster probabilities. The cluster membership is not fixed, as in hard clustering, but random, following a multinomial distribution; therefore, the soft clustering model is equivalent to a mixture of densities model. Examples of hard clustering methods are given by Hastie et al. (2000), Bar-Joseph et al. (2002), Serban and Wasserman (2005), and Chiou and Li (2007). Examples of soft clustering methods are given by James and Sugar (2003), Fraley and Raftery (2002), Wakefield et al. (2002), and Booth, Casella and Hobert (2008).

In the existing literature, clustering algorithms are developed for and applied to data consisting of one random function or vector for each unit to be clustered: one-level data. In this paper, we introduce clustering methods for multilevel functional data; that is, we cluster $X_i(t)$, $i = 1, \dots, I$, where $X_i(t)$ is a multidimensional random function. For simplicity, we focus on two-level data, $X_{ij}(t)$ for $j = 1, \dots, J$ and $i = 1, \dots, I$, where $j$ indexes subunits; that is, for each unit (e.g. subject, product or gene), we observe random functions for $J$ subunits. The underlying model is the functional ANOVA model

$X_{ij}(t) = \alpha(t) + \beta_j(t) + Y_i(t) + W_{ij}(t) + \varepsilon_{ij}(t)$   (1)

where $\alpha(t)$ and $\beta_j(t)$ for $j = 1, \dots, J$ are fixed functional means specifying the global trend and the subunit-specific functional trends, respectively. For simplicity, we assume $\alpha(t) = 0$ and $\beta_j(t) = 0$; when they are non-zero, we can estimate them using standard nonparametric methods. Under this framework, we pose two clustering problems:

Clustering by similarity of unit-specific means (at level 1): two units $i_1$ and $i_2$ are in the same cluster if their unit-specific means $Y_{i_1}(t)$ and $Y_{i_2}(t)$ are similar in shape.

Clustering by similarity of within-unit deviations (at level 2): two units $i_1$ and $i_2$ are in the same cluster if their corresponding deviations from the unit-specific means, $\{W_{i_1 j}\}_{j=1,\dots,J}$ and $\{W_{i_2 j}\}_{j=1,\dots,J}$, are dynamically similar, that is, they move together over time.

The first clustering problem identifies groups of units which behave similarly on average across the $J$ subunits, and it can be viewed as an extension of existing functional clustering approaches. Following this extension, this clustering problem could simply be carried out by estimating the unit-specific means $Y_i(t)$ using nonparametric methods and clustering the smoothed means using functional clustering algorithms. In this paper, we call this method the level-1 naive clustering approach. A second modeling alternative is to decompose the functional ANOVA model following the multilevel functional principal component analysis (MFPCA) introduced by Di et al. (2009) and Di and Crainiceanu (2010) and to cluster the level-1 estimated scores using common clustering methods such as k-means, k-medians, hierarchical clustering and others (Hastie et al., 2009). We call this method the level-1 hard clustering approach. The third approach is a soft clustering model using an MFPCA decomposition. We call this method the level-1 soft clustering approach.

The second clustering problem is more unique in its definition. Assuming $Y_i(t) = 0$, each unit features repeated random functions $W_{ij}(t)$, $j = 1, \dots, J$, which are dissimilar within the unit. For example, one could observe protein expression profiles (subunits) for a number of subjects (units) in response to an experimental drug.

The focus may be on clustering subjects responding differently to the drug treatment, whether they are drug-resistant or not, where the response is recorded only for a small number of established proteins. If the proteins respond differently to the experimental drug, $W_{ij}(t)$, $j = 1, \dots, J$, will be dissimilar within each subject, and therefore clustering by similarity in $Y_i(t)$ does not provide the clustering of interest. On the other hand, we expect some grouping of the subjects by similarity of their protein expression profiles, which is a multidimensional measure of whether they are resistant to the drug, for example. Clustering at level 2 can therefore be used to identify a grouping of units or subjects where the similarity is not a measure between two univariate random functions, as in all existing clustering methods, but between two multivariate random functions, $\{W_{i_1 j}\}_{j=1,\dots,J}$ and $\{W_{i_2 j}\}_{j=1,\dots,J}$. Level-2 clustering thus assumes that the random functions within each unit are dissimilar up to an overall mean $Y_i(t)$, and it commonly applies when the subunits are non-homogeneous (different proteins in this example, or different bacteria in the case study of this paper).

Level-2 clustering could be reduced to estimating the correlation between two samples of random functions and clustering based on the correlation estimates. For example, one may apply the dynamical correlation analysis introduced by Dubin and Müller (2005) to $\{W_{i_1 j}\}_{j=1,\dots,J}$ and $\{W_{i_2 j}\}_{j=1,\dots,J}$ to obtain a correlation value $\rho_{i_1,i_2}$ for each pair of units $(i_1, i_2)$ and further apply a distance-based clustering to the correlation matrix $\{\rho_{i_1,i_2}\}_{i_1=1,\dots,I;\; i_2=1,\dots,I}$. However, this approach assumes a large $J$ and a large number of time points, an assumption that does not hold in many applications. Instead, we can apply the MFPCA approach to the multilevel data and cluster the level-2 estimated scores. We call this the level-2 hard clustering approach. An alternative approach is a soft clustering model using an MFPCA decomposition. We call this method the level-2 soft clustering approach.

In this paper, we discuss advantages and disadvantages of these clustering approaches and validate their performance within a simulation study.

We point out here that one underlying advantage of the soft clustering approach is that it provides a natural framework for inference on the number of clusters, imputed cluster memberships and cluster means, and it allows incorporating information about the dependence between functions at various levels. However, a drawback is that it is computationally intensive because the estimation of the clustering model components is based on an Expectation-Maximization algorithm.

The rest of the paper is organized as follows. In Section 2, we review the functional ANOVA model and its decomposition using the MFPCA approach. We continue in Section 3 with the description of a series of hard clustering approaches and in Section 4 with the presentation of the soft clustering method. An important aspect of unsupervised clustering is that the number of clusters is unknown; under the soft clustering model, we discuss a selection method for the number of clusters in Section 5. We assess the performance of the clustering approaches discussed in this paper within a simulation study in Section 6 and within a case study in Section 7. Some technical details are deferred to the Appendix.

2 Multilevel Functional Model

Let $\{X_{ij}(t), j = 1, \dots, J\}$ be a group of random functions observed over a continuous variable $t \in T$ ($T$ is the functional domain) for the $i$th experimental unit, with $i = 1, \dots, I$ ($I$ is the number of units). Generally, the number of units $I$ is large ($I \gg 100$) whereas the number of subunits per unit, $J$, is small ($J$ between 2 and 5). Under the functional ANOVA model in (1) with unknown functional effects, we employ the nonparametric decomposition

$X_{ij}(t) = \sum_{s=1}^{N_1} \xi_{i,s}\, \phi^{(1)}_s(t) + \sum_{r=1}^{N_2} \zeta_{ij,r}\, \phi^{(2)}_r(t) + \varepsilon_{ij}(t)$   (2)

where $\{\xi_{i,s}\}_{s=1,\dots,N_1}$ and $\{\zeta_{ij,r}\}_{r=1,\dots,N_2,\, j=1,\dots,J}$ are the level-1 and level-2 unconditional scores for the $i$th unit. In this paper, we use the term unconditional in contrast to the term conditional, which refers to conditioning on the cluster membership variable in the clustering model. We assume the following:

A.1 $E(\xi_{i,s}) = 0$, $\mathrm{Var}(\xi_{i,s}) = \tau^{(1)}_s$ for any unit $i$, and $E(\xi_{i,s_1} \xi_{i,s_2}) = 0$ for $s_1 \neq s_2$.

A.2 $\{\phi^{(1)}_s(t), s = 1, 2, \dots\}$ is an orthogonal basis in $L^2(T)$.

A.3 $E(\zeta_{ij,r}) = 0$, $\mathrm{Var}(\zeta_{ij,r}) = \tau^{(2)}_{j,r}$, and $E(\zeta_{ij,r_1} \zeta_{ij,r_2}) = 0$ for any unit $i$ and any subunit $j$, for $r_1 \neq r_2$.

A.4 $\{\phi^{(2)}_r(t), r = 1, 2, \dots\}$ is an orthogonal basis in $L^2(T)$.

A.5 $\{\xi_{i,s}, s = 1, 2, \dots\}$ are uncorrelated with $\{\zeta_{ij,r}, r = 1, 2, \dots\}$.

There are various approaches to estimating the functional ANOVA model. Recent methods are by Bugli and Lambert (2006), who assume that the bases of functions in A.2 and A.4 are fixed and estimate the scores using penalized splines; Di et al. (2009) and Di and Crainiceanu (2010), who base their estimation procedure on functional principal component analysis; and Kaufman and Sain (2010), who pursue a fully Bayesian approach. An advantage of employing the MFPCA approach is its computational efficiency: the bases of functions are functional principal components, which reduce the functional space to a lower-dimensional space than when the bases of functions are fixed. Moreover, it applies to both densely observed and sparse data. To this end, our clustering model is based on the MFPCA decomposition.

Remark: Assumption A.3 of our clustering model is less restrictive than in the MFPCA of Di et al. (2009) and Di and Crainiceanu (2010). Specifically, in the existing works, MFPCA assumes that $\mathrm{Var}(\zeta_{ij,r}) = \tau^{(2)}_r$; that is, the variances are the same for all subunits. However, as we will discuss in Section 4, the soft clustering model is subject to the more general assumption A.3 when the cluster means vary with the subunit index $j$.
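As an illustration of the decomposition in (2), the following sketch simulates two-level functional data with independent level-1 and level-2 scores satisfying assumptions A.1-A.5. The eigenfunctions and variance choices are assumptions made for illustration (they mirror the simulation settings of Section 6, with the last polynomial reconstructed as a shifted Legendre polynomial); this is not the authors' code.

```python
# A minimal sketch (not the authors' implementation) of simulating data from
# decomposition (2): X_ij(t) = sum_s xi_{i,s} phi1_s(t) + sum_r zeta_{ij,r} phi2_r(t) + noise.
import numpy as np

rng = np.random.default_rng(0)
I, J, m = 100, 4, 10                 # units, subunits per unit, time points per curve
N1, N2 = 4, 4                        # number of level-1 and level-2 components
t = np.linspace(0.0, 1.0, m)

# Orthonormal bases (assumed here): Fourier functions at level 1 and
# shifted-Legendre-type polynomials at level 2, as in Section 6.
phi1 = np.vstack([np.sqrt(2) * np.sin(2 * np.pi * t),
                  np.sqrt(2) * np.cos(2 * np.pi * t),
                  np.sqrt(2) * np.sin(4 * np.pi * t),
                  np.sqrt(2) * np.cos(4 * np.pi * t)])                    # (N1, m)
phi2 = np.vstack([np.ones(m),
                  np.sqrt(3) * (2 * t - 1),
                  np.sqrt(5) * (6 * t**2 - 6 * t + 1),
                  np.sqrt(7) * (20 * t**3 - 30 * t**2 + 12 * t - 1)])     # (N2, m)

tau1 = 0.9 ** np.arange(N1)                     # level-1 score variances (A.1), assumed decaying
tau2 = 2.0 ** (-2.0 * np.arange(1, N2 + 1))     # level-2 score variances (A.3), assumed decaying
sigma = 2.0                                     # noise standard deviation (value used in Section 6)

xi = rng.normal(0.0, np.sqrt(tau1), size=(I, N1))        # unconditional level-1 scores
zeta = rng.normal(0.0, np.sqrt(tau2), size=(I, J, N2))   # unconditional level-2 scores
eps = rng.normal(0.0, sigma, size=(I, J, m))             # measurement noise

X = (xi @ phi1)[:, None, :] + np.einsum("ijr,rm->ijm", zeta, phi2) + eps  # (I, J, m)
```

In practice the scores and eigenfunctions would be estimated from the data by MFPCA rather than specified; the simulated array X simply stands in for observed curves in the sketches that follow.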

3 Hard Clustering

3.1 Level-1 Clustering

In this section, we describe two approaches to clustering by similarity of unit-specific means; both are hard clustering methods. Generally, in hard clustering, the underlying assumption is that the set of units to be clustered, $\mathcal{I} = \{1, 2, \dots, I\}$, is divided into a partition of $K$ subsets $\{C_1, \dots, C_K\}$ with $C_{k_1} \cap C_{k_2} = \emptyset$ for any $k_1 \neq k_2$. Two units are in the same cluster if they are similar according to a similarity measure. When the objective is to cluster random functions by shape regardless of scale, the similarity measure is often the correlation between two functions.

One common approach to clustering functional data is to first project the random functions into a finite-dimensional space using nonparametric decompositions, and then cluster based on similarity of the transform coefficients. James and Sugar (2003) dubbed this approach filtering. Clustering functions by shape using the correlation measure in the functional domain is equivalent to clustering the transform coefficients using the Euclidean distance in the transform domain (Serban and Wasserman, 2005).

For multilevel functional data, a naive clustering approach is to first decompose the random functions using an orthogonal basis of functions $\{\psi_1(t), \psi_2(t), \dots\}$:

$X_{ij}(t) = \sum_{p=1}^{\infty} \theta_{p,ij}\, \psi_p(t) = \Psi(t)\, \theta_{ij}$

where $\theta_{ij} = (\theta_{1,ij}, \theta_{2,ij}, \dots)$ is the vector of coefficients of the random functions observed for unit $i$ in the transform domain. The selection of the basis of functions depends on the smoothness of the underlying regression functions and the irregularity of the design points at which the random functions are observed. Since we observe the random functions at a finite number of time points, we need to truncate the summation in the decomposition above; that is, we estimate up to $P_i < \infty$ coefficients, where $P_i$ controls the smoothness of the estimated unit-specific mean $Y_i(t)$, and therefore its selection will impact the accuracy of the estimated cluster memberships. Bugli and Lambert (2006) proposed using a large $P_i = P$ to reduce the modeling bias but penalizing the influence of the coefficients (penalized smoothing splines). Further, we cluster the estimated mean coefficients $\hat\theta_i = \frac{1}{J} \sum_{j=1}^{J} \hat\theta_{ij}$ using common clustering approaches for multivariate data.

For densely observed random functions, this approach will perform reasonably well since the coefficients $\theta_{ij}$ are accurately estimated: $\hat\theta_{ij}$ are asymptotically unbiased and consistent. On the other hand, under a sparse design (i.e. when each random function is observed at a small number of design points), the coefficients $\theta_{ij}$ are inaccurately estimated, which in turn results in inaccurate cluster membership estimation. To overcome this difficulty, one approach is to employ an estimation method which allows borrowing strength across subunits to improve the accuracy of the estimated coefficients for individual units. Consequently, our proposed algorithm for clustering at level 1 is:

1. Apply MFPCA to impute the scores at level 1, $\hat\xi_{i,s}$; and

2. Apply a multivariate clustering algorithm to the estimated scores $\hat\xi_{i,s}$, where the similarity measure is the Euclidean distance, $d(i_1, i_2) = \|\hat\xi_{i_1} - \hat\xi_{i_2}\|_2$ for $i_1, i_2 \in \mathcal{I}$.

This algorithm is equivalent to clustering the unit-specific means $Y_i(t)$ by shape regardless of scale, or, more precisely, clustering by correlation in the functional space. By borrowing strength across subunits, the cluster membership is more accurately estimated than with the naive approach, as supported by our simulation study (see Section 6).

3.2 Level-2 Clustering

Clustering by similarity of within-unit deviations requires defining a similarity measure between the groups of random functions $\{W_{i_1 j}\}_{j=1,\dots,J}$ and $\{W_{i_2 j}\}_{j=1,\dots,J}$. For large $J$ and a densely sampled time domain, one such measure is the dynamical correlation for multivariate longitudinal data of Dubin and Müller (2005). However, it is rarely the case that a large number of subunits $J$ per unit is available over a large number of time points. Because of this limitation, we propose the following hard clustering approach:

1. Apply MFPCA to impute the scores at level 2, $\hat\zeta_{ij,r}$; and

2. Apply a multivariate clustering algorithm to the estimated coefficients $\hat\zeta_{ij,r}$, where the similarity measure is the averaged $L_2$ norm $d(i_1, i_2) = \sum_{j=1}^{J} \|\hat\zeta_{i_1 j} - \hat\zeta_{i_2 j}\|_2$.
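A compact sketch of the two score-based hard clustering algorithms above, assuming the MFPCA score estimates are already available (here, arrays shaped like the simulated xi and zeta from the Section 2 sketch stand in for the estimates). The use of scikit-learn's k-means is an illustrative choice, not prescribed by the paper.

```python
# Hard clustering on MFPCA scores (a sketch; any multivariate clustering algorithm applies).
import numpy as np
from sklearn.cluster import KMeans

def level1_hard_clustering(xi_hat, K, seed=0):
    """Cluster units by their estimated level-1 scores, shape (I, N1), with Euclidean k-means."""
    return KMeans(n_clusters=K, n_init=20, random_state=seed).fit_predict(xi_hat)

def level2_hard_clustering(zeta_hat, K, seed=0):
    """Cluster units by their estimated level-2 scores, shape (I, J, N2).
    Stacking the J score vectors per unit gives a squared Euclidean distance
    sum_j ||zeta_{i1 j} - zeta_{i2 j}||^2, a close relative of the averaged L2
    norm used as the similarity measure in Section 3.2."""
    I, J, N2 = zeta_hat.shape
    return KMeans(n_clusters=K, n_init=20,
                  random_state=seed).fit_predict(zeta_hat.reshape(I, J * N2))

# Example usage with the simulated scores from the Section 2 sketch:
# labels_l1 = level1_hard_clustering(xi, K=2)
# labels_l2 = level2_hard_clustering(zeta, K=2)
```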

4 Soft Clustering

In this section, we introduce a soft clustering approach which allows borrowing strength across random functions within the same cluster and within the same unit (MFPCA). In soft clustering, the underlying assumption is that the complete data are bivariate variables $(X_i, Z_i)$ for $i = 1, \dots, I$, where $X_i$ are unit-specific realizations from a multivariate distribution and the cluster membership $Z_i$ is a latent variable (Fraley and Raftery, 2002). A common estimation method for soft clustering is the Expectation-Maximization algorithm: at the Expectation step, we impute or predict the cluster memberships $Z = (Z_1, \dots, Z_I)$ along with estimating the cluster weights $\pi_1, \dots, \pi_K$, and at the Maximization step, we estimate the parameters specifying the conditional distribution of $X_i \mid Z_i$, $i = 1, \dots, I$. Therefore, we need to specify the conditional distribution of $X_i \mid Z_i$, $i = 1, \dots, I$, and the distribution of the latent variable $Z_i$, which in turn specify the distribution of the complete data. The cluster membership of the $i$th unit, $Z_i$, follows a multinomial distribution with proportion parameters $\pi_1, \dots, \pi_K$, where $K$ is the number of clusters. $X_i \mid Z_i = k$, $i = 1, \dots, I$, are commonly assumed conditionally independent, following a distribution with cluster mean $\mu_k(t)$ and covariance function $\Sigma_k(t, t')$.

Using a similar framework for clustering multilevel data, the complete data are $(X_{ij}, Z^{(1)}_i, Z^{(2)}_i)$ for $i = 1, \dots, I$ and $j = 1, \dots, J$, where $Z^{(1)}_i$ and $Z^{(2)}_i$ are latent variables specifying the cluster membership at level 1 and at level 2, respectively. We assume:

The cluster membership $Z^{(1)}_i$ of the $i$th unit has a multinomial distribution with proportion parameters $\pi^{(1)}_1, \dots, \pi^{(1)}_{C_1}$, where $C_1$ is the number of clusters at level 1.

The cluster membership $Z^{(2)}_i$ of the $i$th unit has a multinomial distribution with proportion parameters $\pi^{(2)}_1, \dots, \pi^{(2)}_{C_2}$, where $C_2$ is the number of clusters at level 2.

Level-1 Clustering. For clustering at level 1, we assume $C_2 = 1$ but $C_1 > 1$. Therefore, the joint data are $(X_{ij}, Z^{(1)}_i)$.

However, to model the distribution of the joint data, we need to specify the conditional distribution of $X_i \mid Z^{(1)}_i$. Following the model in (1), the conditional distribution is

$X_{ij}(t) \mid (Z^{(1)}_i = k) = \sum_{s=1}^{N_1} \nu_{i,s,k}\, \phi^{(1)}_s(t) + \sum_{r=1}^{N_2} \zeta_{ij,r}\, \phi^{(2)}_r(t) + \varepsilon_{ij}(t)$   (3)

with

$\nu_{i,k} = (\nu_{i,1,k}, \dots, \nu_{i,N_1,k}) \sim N(\mu_k, \Lambda^{(1)}_k)$, $\quad \zeta_{ij} = (\zeta_{ij,1}, \dots, \zeta_{ij,N_2}) \sim N(0, \Lambda^{(2)}_j)$,

where $\mu_k = (\mu_{1,k}, \dots, \mu_{N_1,k})$ and $\Lambda^{(1)}_k$ is an $N_1 \times N_1$ diagonal matrix with diagonal elements $\lambda^{(1)}_k = (\lambda^{(1)}_{1,k}, \dots, \lambda^{(1)}_{N_1,k})$. Under this conditional model, the conditional scores $\nu_{i,s,k} = (\xi_{i,s} \mid Z^{(1)}_i = k)$ for $k = 1, \dots, C_1$ are assumed independent with conditional mean $\mu_{s,k}$ and conditional variance $\lambda^{(1)}_{s,k}$. For this model, $\xi_{i,s}$ for $i = 1, \dots, I$ and $s = 1, \dots, N_1$ are the unconditional scores at level 1, with a distribution following assumption A.1. The scores at level 2 are unconditional of the clustering latent variable $Z^{(1)}$, and therefore their distribution follows assumption A.3. From the conditional and unconditional models, we derive

$0 = E(\xi_{i,s}) = E\big(E(\xi_{i,s} \mid Z^{(1)}_i)\big) = \sum_{k=1}^{C_1} \pi^{(1)}_k E(\nu_{i,s,k}) = \sum_{k=1}^{C_1} \pi^{(1)}_k \mu_{s,k}$   (4)

$\tau^{(1)}_s = V(\xi_{i,s}) = \sum_{k=1}^{C_1} \pi^{(1)}_k \big(\lambda^{(1)}_{s,k} + \mu^2_{s,k}\big) - \Big(\sum_{k=1}^{C_1} \pi^{(1)}_k \mu_{s,k}\Big)^2 = \sum_{k=1}^{C_1} \pi^{(1)}_k \big(\lambda^{(1)}_{s,k} + \mu^2_{s,k}\big)$.   (5)

It follows that the clustering model at level 1 (Model 1) is

$X_{ij}(t) = \sum_{s=1}^{N_1} \xi_{i,s}\, \phi^{(1)}_s(t) + \sum_{r=1}^{N_2} \zeta_{ij,r}\, \phi^{(2)}_r(t) + \varepsilon_{ij}(t)$
$\xi_{i,s} \mid (Z^{(1)}_i = k) \sim N(\mu_{s,k}, \lambda^{(1)}_{s,k})$
$Z^{(1)}_i \sim \mathrm{Multinomial}(1; \pi^{(1)}_1, \dots, \pi^{(1)}_{C_1})$
$\zeta_{ij,r} \sim N(0, \lambda^{(2)}_{j,r})$, independent of $\xi_{i,s,k}$ and $Z^{(1)}_i$   (6)

subject to the constraint $\sum_{k=1}^{C_1} \pi^{(1)}_k \mu_{s,k} = 0$ implied by (4). We note that the relationship between conditional and unconditional variances in equation (5) does not impose a constraint. Under this clustering setup, the $k$th cluster mean is

$E(X_{ij}(t) \mid Z^{(1)}_i = k) = E(Y_i(t) \mid Z^{(1)}_i = k) = \sum_{s=1}^{N_1} \mu_{s,k}\, \phi^{(1)}_s(t)$.   (7)

Level-2 Clustering. For clustering at level 2, we assume $C_1 = 1$ but $C_2 > 1$. Therefore, the joint data are $(X_i, Z^{(2)}_i)$ and the conditional distribution of $X_i \mid Z^{(2)}_i$ is

$X_{ij}(t) \mid (Z^{(2)}_i = k) = \sum_{s=1}^{N_1} \xi_{i,s}\, \phi^{(1)}_s(t) + \sum_{r=1}^{N_2} \delta_{ij,r,k}\, \phi^{(2)}_r(t) + \varepsilon_{ij}(t)$   (8)

with

$\xi_i = (\xi_{i,1}, \dots, \xi_{i,N_1}) \sim N(0, \Lambda^{(1)})$, $\quad \delta_{ij,k} = (\delta_{ij,1,k}, \dots, \delta_{ij,N_2,k}) \sim N(\eta_{jk}, \Lambda^{(2)}_{j,k})$,

where $\eta_{jk} = (\eta_{j,1,k}, \dots, \eta_{j,N_2,k})$ and $\Lambda^{(2)}_{jk}$ is an $N_2 \times N_2$ diagonal matrix with diagonal elements $\lambda^{(2)}_{jk} = (\lambda^{(2)}_{j,1,k}, \dots, \lambda^{(2)}_{j,N_2,k})$. Under this conditional model, the conditional scores at level 2, $\delta_{ij,r,k} = (\zeta_{ij,r} \mid Z^{(2)}_i = k)$, are assumed independent with conditional mean $\eta_{j,r,k}$ and conditional variance $\lambda^{(2)}_{j,r,k}$ for $k = 1, \dots, C_2$. For this model, the $\zeta_{ij,r}$ are the unconditional scores in the unconditional model (2), assumed independent with mean zero ($E(\zeta_{ij,r}) = 0$) and constant variance across units ($V(\zeta_{ij,r}) = \tau^{(2)}_{j,r}$), as provided in assumption A.3. From the conditional and unconditional models, we derive

$0 = E(\zeta_{ij,r}) = E\big(E(\zeta_{ij,r} \mid Z^{(2)}_i)\big) = \sum_{k=1}^{C_2} \pi^{(2)}_k E(\delta_{ij,r,k}) = \sum_{k=1}^{C_2} \pi^{(2)}_k \eta_{j,r,k}$   (9)

$\tau^{(2)}_{j,r} = V(\zeta_{ij,r}) = \sum_{k=1}^{C_2} \pi^{(2)}_k \big(\lambda^{(2)}_{j,r,k} + \eta^2_{j,r,k}\big) - \Big(\sum_{k=1}^{C_2} \pi^{(2)}_k \eta_{j,r,k}\Big)^2 = \sum_{k=1}^{C_2} \pi^{(2)}_k \big(\lambda^{(2)}_{j,r,k} + \eta^2_{j,r,k}\big)$   (10)

Similar to the clustering model at level 1, the clustering model at level 2 (Model 2) is

$X_{ij}(t) = \sum_{s=1}^{N_1} \xi_{i,s}\, \phi^{(1)}_s(t) + \sum_{r=1}^{N_2} \zeta_{ij,r}\, \phi^{(2)}_r(t) + \varepsilon_{ij}(t)$
$\zeta_{ij,r} \mid (Z^{(2)}_i = k) \sim N(\eta_{j,r,k}, \lambda^{(2)}_{j,r,k})$
$Z^{(2)}_i \sim \mathrm{Multinomial}(1; \pi^{(2)}_1, \dots, \pi^{(2)}_{C_2})$
$\xi_{i,s} \sim N(0, \tau^{(1)}_s)$, independent of $\zeta_{ij,r,k}$ and $Z^{(2)}_i$   (11)

subject to the constraint $\sum_{k=1}^{C_2} \pi^{(2)}_k \eta_{j,r,k} = 0$ implied by (9). On the other hand, the relationship between unconditional and conditional variances in equation (10) requires that the unconditional variances differ across subunits when $\eta_{jk}$ varies with $j$, leading to assumption A.3 in Section 2. However, MFPCA as introduced by Di et al. (2009) does not allow the eigenvalues at level 2 to vary across subunits. Because of this, the estimated level-2 scores will provide lower clustering accuracy when the number of repeated subunits, $J$, is large; this observation is supported by our simulation study. Under this clustering setup, the $k$th cluster trend for the $j$th condition is

$E(X_{ij}(t) \mid Z^{(2)}_i = k) = E(W_{ij}(t) \mid Z^{(2)}_i = k) = \sum_{r=1}^{N_2} \eta_{j,r,k}\, \phi^{(2)}_r(t)$.   (12)

The formulation and estimation of the joint level-1 and level-2 clustering model with $C_1 > 1$ and $C_2 > 1$ is provided in the Supplemental Material. The estimation method is an iterative likelihood-based algorithm.
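The estimation details for Models 1 and 2 are given in the Supplemental Material. As a rough illustration of the EM idea behind the level-1 soft clustering, the sketch below fits a mixture with cluster-specific means and diagonal covariances directly to already-estimated level-1 scores. This is a simplification (the paper's algorithm works with the full data and the MFPCA decomposition), and all function and variable names are illustrative, not the authors' implementation.

```python
# Simplified EM for a diagonal-Gaussian mixture on level-1 scores (illustration only).
import numpy as np
from scipy.stats import norm

def em_soft_clustering(scores, K, n_iter=200, tol=1e-6, seed=0):
    """scores: (I, N1) array of level-1 score estimates; returns weights, means,
    variances, soft memberships (responsibilities) and the final log-likelihood."""
    rng = np.random.default_rng(seed)
    I, N1 = scores.shape
    resp = rng.dirichlet(np.ones(K), size=I)                 # random initial responsibilities
    loglik_old = -np.inf
    for _ in range(n_iter):
        # M-step: cluster weights pi_k, means mu_{s,k} and variances lambda_{s,k}.
        Nk = resp.sum(axis=0)
        pi = Nk / I
        mu = (resp.T @ scores) / Nk[:, None]
        var = np.maximum((resp.T @ scores**2) / Nk[:, None] - mu**2, 1e-6)
        # E-step: responsibilities from the component log-densities.
        logdens = np.empty((I, K))
        for k in range(K):
            logdens[:, k] = np.log(pi[k]) + norm.logpdf(scores, mu[k], np.sqrt(var[k])).sum(axis=1)
        shift = logdens.max(axis=1, keepdims=True)
        loglik = (shift[:, 0] + np.log(np.exp(logdens - shift).sum(axis=1))).sum()
        resp = np.exp(logdens - shift)
        resp /= resp.sum(axis=1, keepdims=True)
        if abs(loglik - loglik_old) < tol:
            break
        loglik_old = loglik
    return pi, mu, var, resp, loglik

# Soft memberships: resp[i, k] approximates P(Z_i = k | data); argmax gives a hard assignment.
```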

5 Model Selection

The clustering models described in the previous section depend on a series of parameters which are assumed fixed: $C_1$, $C_2$, $N_1$ and $N_2$. We identify two model selection problems: (1) selecting the number of eigenfunctions which explain a large percentage of the variability between units (selecting $N_1$) and within units (selecting $N_2$); and (2) selecting the number of clusters at level 1 (selecting $C_1$) and/or the number of clusters at level 2 (selecting $C_2$).

We can select $N_1$ and $N_2$ using the unconditional MFPCA model. Di et al. (2009) and Di and Crainiceanu (2010) discuss various alternative methods for selecting the number of basis functions, and we follow their direction. For identifying the number of clusters, we focus on likelihood-based approaches. Common model selection criteria, such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), have been employed for estimating the number of clusters (Fraley and Raftery, 2002). Both criteria select the number of clusters by minimizing $-2 \log L(\hat\Psi) + 2P(C_1, C_2)$, where $\log L(\hat\Psi)$ is the log-likelihood of the observed data, which measures the lack of fit. In our multilevel clustering model,

$\log L(\Psi) = \sum_{i=1}^{I} \sum_{k=1}^{C_1} \sum_{k'=1}^{C_2} \pi^{(1)}_k \pi^{(2)}_{k'} \log f\big(x_i; \mu_k, \eta_{k'}, \Lambda^{(1)}_k, \Lambda^{(2)}_{k'}, \sigma^2\big)$.

The second term, $2P(C_1, C_2)$, is the penalty term that measures the complexity of the model. For AIC, $2P(C_1, C_2) = 2d$, and for BIC, $2P(C_1, C_2) = (\log IJm)\, d$, where $d = 2C_1K_1 + 2C_2K_2 - K_1 - K_2 + C_1 + C_2 - 1$ is the number of parameters. The number of parameters in each setting is:

Level-1 ($C_2 = 1$), unequal variances: $d = 2N_1 C_1 + 2JN_2 + \dots$

Level-1 ($C_2 = 1$), equal variances: $d = N_1(C_1 + 1) + 2JN_2 + \dots$

Level-2 ($C_1 = 1$), unequal variances: $d = 2N_1 + 2JN_2 C_2 + \dots$

Level-2 ($C_1 = 1$), equal variances: $d = 2N_1 + JN_2(C_2 + 1) + \dots$

Many authors (for example, Koehler and Murphree, 1988) have observed that models selected using AIC tend to overfit, as AIC prefers larger models. In the soft clustering context, this translates into overestimation of the number of clusters (Soromenho, 1933; Celeux and Soromenho, 1996). Alternatively, the likelihood correction using BIC selects more parsimonious models. Consequently, the BIC selection criterion has often been used in soft clustering (Fraley and Raftery, 1998). Leroux (1992) has shown that, asymptotically, BIC does not underestimate the true number of components.
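A hedged sketch of the AIC/BIC comparison applied to the simplified score-level mixture from the previous sketch. The parameter count d used below (cluster means, variances and mixture weights of the score mixture) is an illustrative approximation and does not reproduce the exact counts listed above.

```python
# Selecting the number of clusters by AIC/BIC for the simplified score-level mixture.
import numpy as np

def select_number_of_clusters(scores, candidates=(1, 2, 3, 4, 5)):
    I, N1 = scores.shape
    results = {}
    for K in candidates:
        *_, loglik = em_soft_clustering(scores, K)            # from the previous sketch
        d = K * N1 + K * N1 + (K - 1)                         # means + variances + weights
        results[K] = {"AIC": -2.0 * loglik + 2.0 * d,
                      "BIC": -2.0 * loglik + np.log(I) * d}
    best = min(results, key=lambda K: results[K]["BIC"])      # BIC-selected number of clusters
    return best, results
```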

6 Simulation Studies

The primary objective of this simulation study is to assess the estimation accuracy of the cluster membership and cluster means under various comparative settings:

1. Varying sparsity in the sampling design;
2. Varying number of subunits J;
3. Varying noise level; and
4. Naive vs. hard vs. soft clustering.

6.1 Level-1 Clustering

We generate samples of functions from the joint model $(X_i, Z^{(1)}_i)$ described in Section 4. Specifically, we generate $Z^{(1)}_i$, the cluster membership, from a multinomial distribution with cluster weights $\pi^{(1)}_1, \dots, \pi^{(1)}_{C_1}$ fixed across all simulations. For simplicity, we choose $C_1 = 2$ with $\pi^{(1)}_1 = 1/3$ and $\pi^{(1)}_2 = 2/3$. The generated data consist of $I = 100$ units. We vary the maximum number of observations or time points per random function, $m = 4, 6, 10, 15$, and the number of subunits per unit, $J = 3, 4, 5$. The conditional variances at level 1 are generated according to two different settings:

Equal conditional variances across clusters: $\lambda_{s,k} = 0.9^{s-1}$ for $k = 1, \dots, C_1$; and

Varying conditional variances across clusters: $\lambda_{s,k} = 2^{2(k-s)-1}$.

The unconditional variances at level 2 are $\tau_{j,r} = \frac{j+1}{2}\, 2^{-2r}$. The conditional means at level 1 are $\mu_1 = (3, 2, 1, 0)$ and $\mu_2 = (-1.5, -1, -0.5, 0)$, selected such that $\sum_{k=1}^{C_1} \pi^{(1)}_k \mu_{s,k} = 0$. The eigenfunctions are

$\Phi^{(1)}(t) = \big(\sqrt{2}\sin(2\pi t), \sqrt{2}\cos(2\pi t), \sqrt{2}\sin(4\pi t), \sqrt{2}\cos(4\pi t)\big)$
$\Phi^{(2)}(t) = \big(1, \sqrt{3}(2t-1), \sqrt{5}(6t^2-6t+1), \sqrt{7}(20t^3-30t^2+12t-1)\big)$.

The number of eigenfunctions at level 1 is $N_1 = 4$ and at level 2 is $N_2 = 4$. The noise level for the simulation in this paper is $\sigma = 2$. We investigate the estimation accuracy of the cluster membership and cluster means for other values of $\sigma$ in the Supplemental Material.
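A short sketch of the level-1 cluster structure just described, drawing memberships and conditional scores under the equal-conditional-variance setting (one of the two settings above). The notation follows Section 4, and the generated scores can be fed to the hard and soft clustering sketches of Sections 3 and 4; this is an illustration, not the authors' simulation code.

```python
# Generating cluster memberships and conditional level-1 scores for the Section 6.1 design
# (equal conditional variances; illustration only).
import numpy as np

rng = np.random.default_rng(1)
I, N1, C1 = 100, 4, 2
pi1 = np.array([1.0 / 3.0, 2.0 / 3.0])                        # cluster weights
mu = np.array([[3.0, 2.0, 1.0, 0.0],
               [-1.5, -1.0, -0.5, 0.0]])                      # conditional means; sum_k pi_k mu_{s,k} = 0
lam = np.tile(0.9 ** np.arange(N1), (C1, 1))                  # equal conditional variances 0.9**(s-1)

Z = rng.choice(C1, size=I, p=pi1)                             # true level-1 memberships
xi = rng.normal(mu[Z], np.sqrt(lam[Z]))                       # conditional level-1 scores, shape (I, N1)
```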

In our simulation example, because we have the true cluster membership, we can assess the accuracy of the clustering prediction for the methods introduced in this paper using a clustering/classification error. We measure the clustering error using the Rand index (Rand, 1971), here computed as the fraction of all misclustered pairs of functions. Let $C = \{f_1, \dots, f_S\}$ denote the set of true functions, $\hat C = \{\hat f_1, \dots, \hat f_S\}$ denote the set of estimated functions, and $T$ and $\hat T$ denote the true and estimated clustering maps, respectively. The Rand index is

$R(C, \hat C) = \binom{S}{2}^{-1} \sum_{r<s} I\big(T(f_r, f_s) \neq \hat T(f_r, f_s)\big)$.

Therefore, the Rand index is low when there are only a few misclustered functions. In order to evaluate the accuracy of the estimated cluster means, we report the relative root mean square error calculated as

$\mathrm{RMSE} = \sqrt{\frac{1}{C_1} \sum_{k=1}^{C_1} \frac{\int_T \big(\mu_k(t) - \hat\mu_k(t)\big)^2\, dt}{\int_T \mu_k^2(t)\, dt}}$.

We report the estimation accuracy of the cluster membership and the cluster means for the naive clustering approach, where the basis of functions is the radial spline basis, for the hard clustering approach discussed in Section 3, and for the soft clustering approach discussed in Section 4. We do not report accuracy results for the naive clustering algorithm when m = 4 because of computational instability. The values reported for the Rand index and the root mean square errors are averages over 100 simulations. We also investigated the use of the Gap statistic (Tibshirani et al., 2001) for identifying the number of clusters for hard clustering, and the use of the AIC and BIC model selection criteria for identifying the number of clusters for soft clustering (see the Supplemental Material for additional figures).
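Minimal helpers (assumptions about the exact implementation, with estimated clusters assumed already matched to the true ones) for the two accuracy measures just defined: the pairwise misclustering rate and the relative root mean square error of the cluster means evaluated on a time grid.

```python
# Accuracy measures used in the simulation study (sketch).
import numpy as np
from itertools import combinations

def pairwise_misclustering(true_labels, est_labels):
    """Fraction of pairs clustered together in one partition but apart in the other."""
    true_labels, est_labels = np.asarray(true_labels), np.asarray(est_labels)
    n, mismatches = len(true_labels), 0
    for r, s in combinations(range(n), 2):
        mismatches += (true_labels[r] == true_labels[s]) != (est_labels[r] == est_labels[s])
    return mismatches / (n * (n - 1) / 2)

def relative_rmse(true_means, est_means):
    """true_means, est_means: (K, m) arrays of cluster means on a common time grid."""
    ratio = ((true_means - est_means) ** 2).sum(axis=1) / (true_means ** 2).sum(axis=1)
    return np.sqrt(ratio.mean())
```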

Based on the results reported in Figure 1 and the additional simulation results in the Supplemental Material, we summarize the estimation accuracy results as follows:

- The naive clustering fails under the very sparse sampling design, m = 4, and it is much less accurate than the alternative methods in most settings.
- There is a significant improvement in the estimation accuracy of both the cluster membership and the cluster patterns when comparing the hard to the naive clustering.
- For equal conditional variances, the hard and soft clustering methods perform similarly, whereas for varying conditional variances, a more realistic setting, the soft clustering approach performs significantly better uniformly over all settings.
- As J increases and under equal conditional variances, the clustering estimation accuracy improves for all three methods; however, under the unequal conditional variances setting, the clustering estimation accuracy improves consistently over all settings only for the soft clustering approach.
- As m increases, the clustering estimation accuracy does not improve significantly for the soft clustering approach, but it improves for the naive clustering method.
- As the noise level increases, the gap in accuracy between soft clustering and the other methods increases, with much better performance for the soft clustering method at high noise levels.
- The Gap statistic for hard clustering accurately identifies the correct number of clusters, $C_1 = 2$, under the assumption of equal conditional variances but not under the assumption of non-equal conditional variances.
- BIC outperforms AIC in correctly identifying the number of clusters under the assumption of non-equal conditional variances. BIC identifies the correct number of clusters in most of the 100 simulations. As J increases and as the maximum number of time points, m, increases, the accuracy of AIC diminishes.

6.2 Level-2 Clustering

To assess the clustering performance of our soft clustering method at level 2, we simulate $C_2 = 2$ clusters with $\pi^{(2)}_1 = 1/3$ and $\pi^{(2)}_2 = 2/3$. The true eigenfunctions are the same as in the previous section, and the unconditional variances at level 1 are $\lambda_s = 0.9^{s-1}$. The conditional means at level 2, $\eta_{j,k}$, are selected such that $\sum_{k=1}^{C_2} \pi^{(2)}_k \eta_{j,k} = 0$.

Since in our simulations we compare the estimation accuracy for $J = 3, 4, 5$, the conditional means for cluster 1 are $\eta_{1,1} = \eta_{2,1} = \eta_{3,1} = (4, 3, 2, 1)$ and $\eta_{4,1} = \eta_{5,1} = (-4, -3, -2, -1)$, and the means for cluster 2 are $\eta_{1,2} = \eta_{2,2} = \eta_{3,2} = (-2, -1.5, -1, -0.5)$ and $\eta_{4,2} = \eta_{5,2} = (2, 1.5, 1, 0.5)$. The conditional variances at level 2 are $\lambda_{j,r,k} = a_{kj}\, 2^{-2(r-1)}$, where $a_{kj}$ is a scaling constant randomly generated from Unif(0.5, 1.5) (varying across clusters and across replicates within each unit).

Figure 2 provides the accuracy of the cluster membership measured by the Rand index and the accuracy of the cluster patterns measured by the mean square error for the simulation setting above. We do not show the results for equal level-2 conditional variances, as this is not a realistic assumption because of the constraint given by (10). We summarize the estimation accuracy results as follows:

- The estimation accuracy of the cluster membership and cluster means improves significantly for the soft clustering approach as compared to the hard clustering method. One possible reason for this significant improvement is that the hard clustering approach assumes equal unconditional variances of the scores, whereas the soft clustering model does not (assumption A.3). Moreover, the soft clustering approach updates the clustering provided by the hard clustering by maximizing a goodness-of-fit function, the likelihood.
- An increase in m does not improve the accuracy of the cluster membership estimated using the hard or soft clustering approach.
- The accuracy of the cluster membership estimated using the soft clustering model also increases as J increases.
- Similarly to the level-1 clustering, as the noise level σ increases, there is only a slight decrease in accuracy, more pronounced for the cluster mean estimates and for the naive clustering method.
- The Gap statistic for selecting the number of clusters for the hard clustering approach performs poorly within all settings.

- Similarly to level-1 clustering, BIC outperforms AIC in selecting the number of clusters for the soft clustering method, although the gap between the two methods is smaller for level-2 clustering than for level-1 clustering. Generally, as m increases, the accuracy of both methods diminishes.

7 Case Study

Innate immunity is an antimicrobial host defense in most multicellular organisms. This immune system involves a series of cells (macrophages, dendritic cells and others), which in turn activate a pathway of genes. Existing microarray studies of cells infected with various pathogens identified hundreds of differentially expressed genes which could potentially be responsible in the expression pathway. These studies can be divided along several lines, e.g., cell types, bacteria types (Gram-negative and Gram-positive pathogens) and host species (human and mouse). Specific bacteria types are known to trigger very different immune responses (Nau et al., 2002). Immune response microarray experiments consisting of 29 datasets were retrieved and organized from various supporting websites (Lu et al., 2010). In this case study, we selected only six experiments conducted on human macrophage cells infected by different bacteria types. The data consist of $I$ experimental units or genes, each observed for $J = 6$ bacteria types. The gene expression profiles are observed at m = 8 time points, specifically $t_1 = 0$, $t_2 = 30$, $t_3 = 60$, $t_4 = 120$, $t_5 = 240$, $t_6 = 360$, $t_7 = 720$, $t_8 = 1440$ (in minutes). The data are therefore observed at two levels: the unit-specific level, where the genes correspond to units in our model description, and the subunit level, consisting of expression profiles for the J = 6 subunits or bacteria. We therefore apply the multilevel clustering methods to identify underlying common responses to different bacteria (level-1 clustering) and to summarize the variability within responses to different bacteria (level-2 clustering). We note here that we first estimate α(t) and β_j(t) as a means of normalization and remove the estimated means from the data.

Then we apply the MFPCA method to obtain the functional principal components and the scores for the within- and between-unit covariances. The number of selected components is $N_1 = 4$ and $N_2 = 2$.

An important aspect in the analysis of gene expression profiles is that only a few genes are significantly expressed, whereas the rest have approximately constant expression profiles. Most genes are so-called housekeeping genes which are not expressed (more or less independently of the stimulus), and therefore they have constant trends. Serban and Wasserman (2005) point out the challenge of clustering a large number of curves or random functions when most of them are approximately constant. They suggest employing a preliminary filtering step that removes a large percentage of the constant curves, followed by clustering of only those curves which were not removed from the complete set. In this analysis, we therefore apply the level-1 clustering algorithms to the 278 differentially expressed genes identified by Lu et al. (2010) and not to the complete set of genes (i.e. I = 278). Similarly, we apply the level-2 clustering algorithms to 292 genes that show significant within-variation in the response to different bacteria, as identified by Lu et al. (2010). They correspond to genes that display higher variability across bacteria responses. For details on the list of differentially expressed genes we refer to lyongu/pub/immune/immune.html. We use different sets of genes for level-1 clustering and level-2 clustering because the groupings have different meanings. Clustering genes at level 1 means clustering by their average expression across all bacteria, and therefore genes have to be differentially expressed on average across all bacteria. Clustering genes at level 2 means clustering by their bacteria-specific behavior, and therefore genes are clustered based on their unique responses to different bacteria.

We investigated the selection of the number of clusters using the Gap statistic (Tibshirani et al., 2001) for the hard clustering approach and the AIC selection criterion for the soft clustering method. We also visually assessed the cluster means when deciding on the number of clusters. In this paper, we discuss the results for $C_1 = 3$ and $C_2 = 3$. We provide additional results and discussions in the Supplemental Material.

Figures 3 and 4 display the 5% and 95% quantiles of the observed curves for the genes within each cluster, along with the estimated cluster means. For hard clustering, the cluster means are estimated by averaging the estimated $Y_i(t)$ within each cluster. For soft clustering, the cluster means are estimated using the estimation method for the level-1 clustering model discussed in the Supplemental Material. Therefore, when using the soft clustering method, not only the cluster membership is updated but also the cluster means.

In the clustering analysis at level 1, we identify the common responses of human macrophage genes to different bacteria. Similarly to Lu et al. (2010), we cluster the expression responses into three categories: a constant/unchanged pattern corresponding to inactivated genes, an up pattern corresponding to induced genes, and a down pattern corresponding to suppressed genes. Figure 4 displays the clustering patterns obtained using our soft clustering method. It suggests that, out of the 278 genes identified as differentially expressed by Lu et al., only 62 genes have non-constant expression profiles. Moreover, the pattern of cluster 1 indicates that 17 genes are suppressed at early time points and then slowly stabilize, whereas the 45 genes in cluster 3 are first induced by the bacteria and then stabilize. Interestingly, the average response time is around 280 minutes after treatment; that is, the responsible genes in human macrophage cells respond within 4-5 hours. Comparing the two clustering methods (Figures 3 and 4), we find that the clustering derived from simply applying hard clustering to the MFPCA scores assigns most of the curves to cluster 1, whereas clusters 2 and 3 are very similar in trend (in fact, cluster 3 consists of only 2 genes). Therefore, hard clustering does not pick up the three expression patterns described above and assigns most of the genes to one cluster.

In the clustering analysis at level 2, we summarize the within-bacteria variability in the response of human macrophage genes to different bacteria. Figures 5 and 6 display the level-2 cluster means for $C_2 = 3$. Each of the three subplots of Figures 5 and 6 contains six cluster patterns, which correspond to the summarized immune responses to the J = 6 bacteria.

The results of the soft clustering method shown in Figure 6 imply that the 10 genes in cluster 1 are induced by three bacteria (two Gram-negative and one Gram-positive) and suppressed by two bacteria (one Gram-negative and one Gram-positive), while the 32 genes in cluster 3 have opposite responses to these six bacteria. The remaining 254 genes show similar activating patterns for all six bacteria types. While the soft clustering method identifies genes with varying within-unit expression means (clusters 1 and 3), level-2 hard clustering does not capture much variation in the immune responses to different bacteria. For instance, the 45 genes in cluster 2 have similar upward response patterns to the six bacteria, as shown in Figure 5(b). Comparing the soft and hard clusterings for various numbers of clusters, we find that the hard clustering algorithm is sensitive to outlying patterns in the sense that it tends to estimate clusters consisting of one unit. For example, the clustering in Figure 5 is for K = 4, where the fourth cluster consists of one unit or gene (not shown as a separate cluster). Similarly, when we set K = 5, two of the clusters at level 2 consist of one gene only.

Last, we compare the level-1 and level-2 clusterings estimated using the hard and soft clustering algorithms using the Rand index, R(H, S), where H is the cluster membership from the hard clustering method and S is the cluster membership from the soft clustering method. We find that the mismatch in cluster membership between the two clusterings measured by R(H, S) increases as the number of clusters increases, and it is in the range of 31%-40% for level-1 clustering and 25%-41% for level-2 clustering, which suggests that the soft clustering approach updates the cluster membership significantly.
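The agreement measure R(H, S) can be computed with the pairwise misclustering helper from the simulation-study sketch; the inputs below are placeholders for the hard and soft memberships estimated on the gene expression data, not quantities defined by the paper.

```python
# Sketch: quantifying how much the soft clustering updates the hard clustering.
# hard_labels: memberships from the score-based hard clustering (Section 3);
# resp: soft-clustering responsibilities (Section 4); both assumed already computed.
import numpy as np

def hard_vs_soft_mismatch(hard_labels, resp):
    soft_labels = np.asarray(resp).argmax(axis=1)          # modal soft memberships
    return pairwise_misclustering(hard_labels, soft_labels)
```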

8 Discussion

In this paper, we introduce a means for clustering multilevel functional data; the clustering algorithm identifies groups of functions which are similar in their overall behavior across repeated measurements and/or similar in their within-unit trends. The underlying clustering (hard or soft) begins with the specification of a model using functional principal component analysis and either clusters the resulting estimated scores using common hard clustering methods or updates the estimated scores assuming a clustering model. The estimation procedure for the latter approach is iterative and therefore more computationally expensive; however, in contrast to a purely algorithmic approach, it allows inference on the model parameters such as the number of clusters, imputed cluster memberships and cluster means.

From our simulation studies, we find that clustering by similarity of unit-specific means at level 1 using either of the two approaches provides similar results as long as there is not a significant difference in within-cluster variability across clusters. Therefore, the extra computational cost incurred by updating the scores using the soft clustering approach is offset by an improvement in the estimation accuracy when the variability between functions assigned to the same cluster (i.e. the conditional variances) varies greatly from one cluster to another. Because it may often be difficult to evaluate how the within-cluster variability varies across clusters, since the clustering is unknown, we suggest proceeding with soft clustering if the number of units $I$ to be clustered is not large, as the additional computational cost is not great. On the other hand, for large $I$ we recommend either a more computationally efficient implementation of the soft clustering method or simply the application of the hard clustering approach, with the understanding of its shortcomings, including lower estimation accuracy.

Clustering by similarity of within-unit deviations at level 2 is more difficult, as it pools information across multiple functions simultaneously. The hard clustering approach using the estimated scores from MFPCA provides inaccurate clustering. On the other hand, after updating the scores and the cluster membership using the soft clustering approach, the accuracy of the estimated cluster membership and cluster means improves significantly over the hard clustering algorithm. We therefore recommend using the soft clustering approach over the hard clustering method under any setting, small or large noise level, lower or higher sparsity in the sampling design.

Last, our case study clearly shows that soft clustering outperforms hard clustering.

In contrast to soft clustering, hard clustering at level 1 is sensitive to outlying patterns in the sense that it tends to estimate clusters consisting of one unit, and it does not identify the primary gene expression trends. In addition, hard clustering at level 2 does not capture the patterns in the between-unit variability in the clusters it provides.

Acknowledgement

The authors are thankful to Ciprian Crainiceanu for providing useful insights about the research in this paper and to Chong-Zhi Di for providing the software for sparse MFPCA. The authors thank the referees and associate editor for helpful comments.

References

[1] Z. Bar-Joseph, G. Gerber, D.K. Gifford, T.S. Jaakkola (2002), A new approach to analyzing gene expression time series data, Proceedings of the 6th Annual International Conference on RECOMB.

[2] J.G. Booth, G. Casella, J.P. Hobert (2008), Clustering Using Objective Functions and Stochastic Search, Journal of the Royal Statistical Society, B, 70(1).

[3] C. Bugli and P. Lambert (2006), Functional ANOVA with random functional effects: an application to event-related potentials modelling for electroencephalograms analysis, Statistics in Medicine, 25.

[4] H. Cardot (2007), Conditional functional principal components analysis, Scandinavian Journal of Statistics, 34.

[5] J.M. Chiou and P.L. Li (2007), Functional clustering and identifying substructures of longitudinal data, Journal of the Royal Statistical Society, Series B, 69.

[6] C.Z. Di, C.M. Crainiceanu, B.S. Caffo and N.M. Punjabi (2009), Multilevel Functional Principal Component Analysis, Annals of Applied Statistics, 3(1).

[7] C.Z. Di and C.M. Crainiceanu (2010), Multilevel Sparse Functional Principal Component Analysis, Johns Hopkins University, Dept. of Biostatistics, Working Papers.

[8] C. Fraley and A.E. Raftery (2002), Model-Based Clustering, Discriminant Analysis, and Density Estimation, Journal of the American Statistical Association, 97.

[9] T. Hastie, R. Tibshirani, M. Eisen, A. Alizadeh, R. Levy, L. Staudt, W. Chan, D. Botstein, P. Brown (2000), Gene shaving as a method for identifying distinct sets of genes with similar expression patterns, Genome Biology, 1(2).

[10] T. Hastie, R. Tibshirani and J. Friedman (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer.

[11] G.M. James and C.A. Sugar (2003), Clustering for sparsely sampled functional data, Journal of the American Statistical Association, 98.

[12] G.M. James, T. Hastie and C. Sugar (2000), Principal Component Models for Sparse Functional Data, Biometrika, 87.

[13] C. Kaufman and S.R. Sain (2010), Bayesian Functional ANOVA Modeling Using Gaussian Process Prior Distributions, Bayesian Analysis, 5(1).

[14] Y. Lu, R. Rosenfeld, G.J. Nau and Z. Bar-Joseph (2010), Cross Species Expression Analysis of Innate Immune Response, Journal of Computational Biology, 17(3).

[15] G.J. Nau, J.F.L. Richmond, A. Schlesinger, E.G. Jennings, E.S. Lander and R.A. Young (2002), Human macrophage activation programs induced by bacterial pathogens, Proceedings of the National Academy of Sciences USA, 99.

[16] J.O. Ramsay and B.W. Silverman (2002), Applied Functional Data Analysis, Springer, New York.

[17] W.M. Rand (1971), Objective Criteria for the Evaluation of Clustering Methods, Journal of the American Statistical Association, 66.

[18] N. Serban (2008), Clustering in the Presence of Heteroscedastic Errors, Journal of Nonparametric Statistics, 20(7).

[19] N. Serban (2009), Clustering Confidence Sets, Journal of Statistical Planning and Inference, 139.

[20] N. Serban and L. Wasserman (2005), CATS: Cluster Analysis by Transformation and Smoothing, Journal of the American Statistical Association, 100.

[21] C. Sugar and G. James (2003), Finding the Number of Clusters in a Data Set: An Information Theoretic Approach, Journal of the American Statistical Association, 98.

[22] R. Tibshirani, G. Walther and T. Hastie (2001), Estimating the number of clusters in a dataset via the gap statistic, Journal of the Royal Statistical Society, B, 63.

[23] F. Vaida and S. Blanchard (2005), Conditional Akaike information for mixed-effects models, Biometrika, 92(2).

[24] F. Yao, H.G. Müller and J.L. Wang (2005), Functional data analysis for sparse longitudinal data, Journal of the American Statistical Association, 100.

Figure 1: Level-1 clustering: comparing naive, hard and soft clustering for J = 3, 4, 5 and for equal vs. non-equal level-2 eigenvalues. The estimation accuracy is evaluated for the cluster membership (Rand index) and for the cluster means (MSE). Panels: (a) Rand index, equal variance; (b) Rand index, non-equal variance; (c) MSE, equal variance; (d) MSE, non-equal variance. The horizontal axis is the maximum number of time points.


More information

Grouping of correlated feature vectors using treelets

Grouping of correlated feature vectors using treelets Grouping of correlated feature vectors using treelets Jing Xiang Department of Machine Learning Carnegie Mellon University Pittsburgh, PA 15213 jingx@cs.cmu.edu Abstract In many applications, features

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Likelihood-Based Methods

Likelihood-Based Methods Likelihood-Based Methods Handbook of Spatial Statistics, Chapter 4 Susheela Singh September 22, 2016 OVERVIEW INTRODUCTION MAXIMUM LIKELIHOOD ESTIMATION (ML) RESTRICTED MAXIMUM LIKELIHOOD ESTIMATION (REML)

More information

Sparse inverse covariance estimation with the lasso

Sparse inverse covariance estimation with the lasso Sparse inverse covariance estimation with the lasso Jerome Friedman Trevor Hastie and Robert Tibshirani November 8, 2007 Abstract We consider the problem of estimating sparse graphs by a lasso penalty

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate

More information

FUNCTIONAL DATA ANALYSIS FOR VOLATILITY PROCESS

FUNCTIONAL DATA ANALYSIS FOR VOLATILITY PROCESS FUNCTIONAL DATA ANALYSIS FOR VOLATILITY PROCESS Rituparna Sen Monday, July 31 10:45am-12:30pm Classroom 228 St-C5 Financial Models Joint work with Hans-Georg Müller and Ulrich Stadtmüller 1. INTRODUCTION

More information

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics

STA414/2104. Lecture 11: Gaussian Processes. Department of Statistics STA414/2104 Lecture 11: Gaussian Processes Department of Statistics www.utstat.utoronto.ca Delivered by Mark Ebden with thanks to Russ Salakhutdinov Outline Gaussian Processes Exam review Course evaluations

More information

Graphical Model Selection

Graphical Model Selection May 6, 2013 Trevor Hastie, Stanford Statistics 1 Graphical Model Selection Trevor Hastie Stanford University joint work with Jerome Friedman, Rob Tibshirani, Rahul Mazumder and Jason Lee May 6, 2013 Trevor

More information

REGRESSING LONGITUDINAL RESPONSE TRAJECTORIES ON A COVARIATE

REGRESSING LONGITUDINAL RESPONSE TRAJECTORIES ON A COVARIATE REGRESSING LONGITUDINAL RESPONSE TRAJECTORIES ON A COVARIATE Hans-Georg Müller 1 and Fang Yao 2 1 Department of Statistics, UC Davis, One Shields Ave., Davis, CA 95616 E-mail: mueller@wald.ucdavis.edu

More information

Chapter 17: Undirected Graphical Models

Chapter 17: Undirected Graphical Models Chapter 17: Undirected Graphical Models The Elements of Statistical Learning Biaobin Jiang Department of Biological Sciences Purdue University bjiang@purdue.edu October 30, 2014 Biaobin Jiang (Purdue)

More information

FUNCTIONAL DATA ANALYSIS

FUNCTIONAL DATA ANALYSIS FUNCTIONAL DATA ANALYSIS Hans-Georg Müller Department of Statistics University of California, Davis One Shields Ave., Davis, CA 95616, USA. e-mail: mueller@wald.ucdavis.edu KEY WORDS: Autocovariance Operator,

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

New Global Optimization Algorithms for Model-Based Clustering

New Global Optimization Algorithms for Model-Based Clustering New Global Optimization Algorithms for Model-Based Clustering Jeffrey W. Heath Department of Mathematics University of Maryland, College Park, MD 7, jheath@math.umd.edu Michael C. Fu Robert H. Smith School

More information

MS-C1620 Statistical inference

MS-C1620 Statistical inference MS-C1620 Statistical inference 10 Linear regression III Joni Virta Department of Mathematics and Systems Analysis School of Science Aalto University Academic year 2018 2019 Period III - IV 1 / 32 Contents

More information

Estimating subgroup specific treatment effects via concave fusion

Estimating subgroup specific treatment effects via concave fusion Estimating subgroup specific treatment effects via concave fusion Jian Huang University of Iowa April 6, 2016 Outline 1 Motivation and the problem 2 The proposed model and approach Concave pairwise fusion

More information

A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables

A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables Qi Tang (Joint work with Kam-Wah Tsui and Sijian Wang) Department of Statistics University of Wisconsin-Madison Feb. 8,

More information

Classification via kernel regression based on univariate product density estimators

Classification via kernel regression based on univariate product density estimators Classification via kernel regression based on univariate product density estimators Bezza Hafidi 1, Abdelkarim Merbouha 2, and Abdallah Mkhadri 1 1 Department of Mathematics, Cadi Ayyad University, BP

More information

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices

Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,

More information

Model selection criteria in Classification contexts. Gilles Celeux INRIA Futurs (orsay)

Model selection criteria in Classification contexts. Gilles Celeux INRIA Futurs (orsay) Model selection criteria in Classification contexts Gilles Celeux INRIA Futurs (orsay) Cluster analysis Exploratory data analysis tools which aim is to find clusters in a large set of data (many observations

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne

More information

Functional Latent Feature Models. With Single-Index Interaction

Functional Latent Feature Models. With Single-Index Interaction Generalized With Single-Index Interaction Department of Statistics Center for Statistical Bioinformatics Institute for Applied Mathematics and Computational Science Texas A&M University Naisyin Wang and

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004

On prediction and density estimation Peter McCullagh University of Chicago December 2004 On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating

More information

A Bayesian Criterion for Clustering Stability

A Bayesian Criterion for Clustering Stability A Bayesian Criterion for Clustering Stability B. Clarke 1 1 Dept of Medicine, CCS, DEPH University of Miami Joint with H. Koepke, Stat. Dept., U Washington 26 June 2012 ISBA Kyoto Outline 1 Assessing Stability

More information

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă mmp@stat.washington.edu Reading: Murphy: BIC, AIC 8.4.2 (pp 255), SRM 6.5 (pp 204) Hastie, Tibshirani

More information

Univariate shrinkage in the Cox model for high dimensional data

Univariate shrinkage in the Cox model for high dimensional data Univariate shrinkage in the Cox model for high dimensional data Robert Tibshirani January 6, 2009 Abstract We propose a method for prediction in Cox s proportional model, when the number of features (regressors)

More information

An Adaptive LASSO-Penalized BIC

An Adaptive LASSO-Penalized BIC An Adaptive LASSO-Penalized BIC Sakyajit Bhattacharya and Paul D. McNicholas arxiv:1406.1332v1 [stat.me] 5 Jun 2014 Dept. of Mathematics and Statistics, University of uelph, Canada. Abstract Mixture models

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall

Machine Learning. Gaussian Mixture Models. Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall Machine Learning Gaussian Mixture Models Zhiyao Duan & Bryan Pardo, Machine Learning: EECS 349 Fall 2012 1 The Generative Model POV We think of the data as being generated from some process. We assume

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

Determining the number of components in mixture models for hierarchical data

Determining the number of components in mixture models for hierarchical data Determining the number of components in mixture models for hierarchical data Olga Lukočienė 1 and Jeroen K. Vermunt 2 1 Department of Methodology and Statistics, Tilburg University, P.O. Box 90153, 5000

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

Some properties of Likelihood Ratio Tests in Linear Mixed Models

Some properties of Likelihood Ratio Tests in Linear Mixed Models Some properties of Likelihood Ratio Tests in Linear Mixed Models Ciprian M. Crainiceanu David Ruppert Timothy J. Vogelsang September 19, 2003 Abstract We calculate the finite sample probability mass-at-zero

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

Leverage Sparse Information in Predictive Modeling

Leverage Sparse Information in Predictive Modeling Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

Tutorial on Approximate Bayesian Computation

Tutorial on Approximate Bayesian Computation Tutorial on Approximate Bayesian Computation Michael Gutmann https://sites.google.com/site/michaelgutmann University of Helsinki Aalto University Helsinki Institute for Information Technology 16 May 2016

More information

Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach"

Kneib, Fahrmeir: Supplement to Structured additive regression for categorical space-time data: A mixed model approach Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach" Sonderforschungsbereich 386, Paper 43 (25) Online unter: http://epub.ub.uni-muenchen.de/

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu Disclaimer: A large

More information

An Unbiased C p Criterion for Multivariate Ridge Regression

An Unbiased C p Criterion for Multivariate Ridge Regression An Unbiased C p Criterion for Multivariate Ridge Regression (Last Modified: March 7, 2008) Hirokazu Yanagihara 1 and Kenichi Satoh 2 1 Department of Mathematics, Graduate School of Science, Hiroshima University

More information

Mixture models for analysing transcriptome and ChIP-chip data

Mixture models for analysing transcriptome and ChIP-chip data Mixture models for analysing transcriptome and ChIP-chip data Marie-Laure Martin-Magniette French National Institute for agricultural research (INRA) Unit of Applied Mathematics and Informatics at AgroParisTech,

More information

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications

Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Diversity Regularization of Latent Variable Models: Theory, Algorithm and Applications Pengtao Xie, Machine Learning Department, Carnegie Mellon University 1. Background Latent Variable Models (LVMs) are

More information

Some Curiosities Arising in Objective Bayesian Analysis

Some Curiosities Arising in Objective Bayesian Analysis . Some Curiosities Arising in Objective Bayesian Analysis Jim Berger Duke University Statistical and Applied Mathematical Institute Yale University May 15, 2009 1 Three vignettes related to John s work

More information

Integrated Likelihood Estimation in Semiparametric Regression Models. Thomas A. Severini Department of Statistics Northwestern University

Integrated Likelihood Estimation in Semiparametric Regression Models. Thomas A. Severini Department of Statistics Northwestern University Integrated Likelihood Estimation in Semiparametric Regression Models Thomas A. Severini Department of Statistics Northwestern University Joint work with Heping He, University of York Introduction Let Y

More information

F & B Approaches to a simple model

F & B Approaches to a simple model A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics Spring 215 http://www.astro.cornell.edu/~cordes/a6523 Lecture 11 Applications: Model comparison Challenges in large-scale surveys

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information

Adaptive Piecewise Polynomial Estimation via Trend Filtering

Adaptive Piecewise Polynomial Estimation via Trend Filtering Adaptive Piecewise Polynomial Estimation via Trend Filtering Liubo Li, ShanShan Tu The Ohio State University li.2201@osu.edu, tu.162@osu.edu October 1, 2015 Liubo Li, ShanShan Tu (OSU) Trend Filtering

More information

Introduction to Graphical Models

Introduction to Graphical Models Introduction to Graphical Models The 15 th Winter School of Statistical Physics POSCO International Center & POSTECH, Pohang 2018. 1. 9 (Tue.) Yung-Kyun Noh GENERALIZATION FOR PREDICTION 2 Probabilistic

More information

Bayesian non-parametric model to longitudinally predict churn

Bayesian non-parametric model to longitudinally predict churn Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Multilevel Cross-dependent Binary Longitudinal Data

Multilevel Cross-dependent Binary Longitudinal Data Multilevel Cross-dependent Binary Longitudinal Data Nicoleta Serban 1 H. Milton Stewart School of Industrial Systems and Engineering Georgia Institute of Technology nserban@isye.gatech.edu Ana-Maria Staicu

More information

Inversion Base Height. Daggot Pressure Gradient Visibility (miles)

Inversion Base Height. Daggot Pressure Gradient Visibility (miles) Stanford University June 2, 1998 Bayesian Backtting: 1 Bayesian Backtting Trevor Hastie Stanford University Rob Tibshirani University of Toronto Email: trevor@stat.stanford.edu Ftp: stat.stanford.edu:

More information

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77 Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical

More information

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

Linear models and their mathematical foundations: Simple linear regression

Linear models and their mathematical foundations: Simple linear regression Linear models and their mathematical foundations: Simple linear regression Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/21 Introduction

More information

Lecture 5: November 19, Minimizing the maximum intracluster distance

Lecture 5: November 19, Minimizing the maximum intracluster distance Analysis of DNA Chips and Gene Networks Spring Semester, 2009 Lecture 5: November 19, 2009 Lecturer: Ron Shamir Scribe: Renana Meller 5.1 Minimizing the maximum intracluster distance 5.1.1 Introduction

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 02-01-2018 Biomedical data are usually high-dimensional Number of samples (n) is relatively small whereas number of features (p) can be large Sometimes p>>n Problems

More information

Sparseness and Functional Data Analysis

Sparseness and Functional Data Analysis Sparseness and Functional Data Analysis Gareth James Marshall School of Business University of Southern California, Los Angeles, California gareth@usc.edu Abstract In this chapter we examine two different

More information

Model Based Clustering of Count Processes Data

Model Based Clustering of Count Processes Data Model Based Clustering of Count Processes Data Tin Lok James Ng, Brendan Murphy Insight Centre for Data Analytics School of Mathematics and Statistics May 15, 2017 Tin Lok James Ng, Brendan Murphy (Insight)

More information

Approaches for Multiple Disease Mapping: MCAR and SANOVA

Approaches for Multiple Disease Mapping: MCAR and SANOVA Approaches for Multiple Disease Mapping: MCAR and SANOVA Dipankar Bandyopadhyay Division of Biostatistics, University of Minnesota SPH April 22, 2015 1 Adapted from Sudipto Banerjee s notes SANOVA vs MCAR

More information

Dimension Reduction Methods

Dimension Reduction Methods Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.

More information

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables

A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables A New Combined Approach for Inference in High-Dimensional Regression Models with Correlated Variables Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider the problem of

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction

Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Degenerate Expectation-Maximization Algorithm for Local Dimension Reduction Xiaodong Lin 1 and Yu Zhu 2 1 Statistical and Applied Mathematical Science Institute, RTP, NC, 27709 USA University of Cincinnati,

More information

Gaussian Mixture Models with Component Means Constrained in Pre-selected Subspaces

Gaussian Mixture Models with Component Means Constrained in Pre-selected Subspaces Gaussian Mixture Models with Component Means Constrained in Pre-selected Subspaces Mu Qiao and Jia Li Abstract We investigate a Gaussian mixture model (GMM) with component means constrained in a pre-selected

More information

Linear Models for Regression. Sargur Srihari

Linear Models for Regression. Sargur Srihari Linear Models for Regression Sargur srihari@cedar.buffalo.edu 1 Topics in Linear Regression What is regression? Polynomial Curve Fitting with Scalar input Linear Basis Function Models Maximum Likelihood

More information

Probabilistic Fisher Discriminant Analysis

Probabilistic Fisher Discriminant Analysis Probabilistic Fisher Discriminant Analysis Charles Bouveyron 1 and Camille Brunet 2 1- University Paris 1 Panthéon-Sorbonne Laboratoire SAMM, EA 4543 90 rue de Tolbiac 75013 PARIS - FRANCE 2- University

More information

Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions

Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions JKAU: Sci., Vol. 21 No. 2, pp: 197-212 (2009 A.D. / 1430 A.H.); DOI: 10.4197 / Sci. 21-2.2 Bootstrap Simulation Procedure Applied to the Selection of the Multiple Linear Regressions Ali Hussein Al-Marshadi

More information

Choosing a model in a Classification purpose. Guillaume Bouchard, Gilles Celeux

Choosing a model in a Classification purpose. Guillaume Bouchard, Gilles Celeux Choosing a model in a Classification purpose Guillaume Bouchard, Gilles Celeux Abstract: We advocate the usefulness of taking into account the modelling purpose when selecting a model. Two situations are

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information