BMI/STAT 768 : Lecture 13 Sparse Models in Images


Moo K. Chung mkchung@wisc.edu March 25

1 Why sparse models are needed?

If we are interested in quantifying the measurements in every voxel of an image simultaneously, the standard procedure is to set up a multivariate general linear model (MGLM), which generalizes the widely used univariate GLM by incorporating vector-valued responses and explanatory variables [1, 12, 33, 34, 30, 6]. Hotelling's T^2 statistic is a special case of MGLM and has mainly been used for inference on surface shapes and deformations [31, 19, 4, 13, 7].

Let J_{n×p} = (J_ij) be the measurement matrix, where J_ij is the measurement for subject i at voxel position j. The subscripts denote the dimension of the matrix. We can think of J_ij as a Jacobian determinant, a fractional anisotropy value or an fMRI activation. Assume there are in total n subjects and p voxels of interest. The measurement vector at the j-th voxel is denoted as x_j = (J_1j, ..., J_nj)'. The measurement vector for the i-th subject is denoted as y_i = (J_i1, ..., J_ip). The y_i are assumed to be identically and independently distributed over subjects. Note that J = (x_1, ..., x_p) = (y_1', ..., y_n')'. We may assume the covariance matrix of y_i to be V(y_1) = ... = V(y_n) = Σ_{p×p} = (σ_kl). With these notations, we now set up the following MGLM over all subjects and across different voxel positions:

J_{n×p} = X_{n×k} B_{k×p} + Z_{n×q} G_{q×p} + U_{n×p} Σ^{1/2}_{p×p},   (1)

where X is the matrix of contrasted explanatory variables and B is the matrix of unknown coefficients to be estimated. Nuisance covariates of no interest are in the matrix Z and the corresponding coefficients are in the matrix G. The components of the Gaussian random matrix U are independently distributed with zero mean and unit variance.

The symmetric matrix Σ^{1/2} is the square root of the covariance matrix and accounts for the spatial dependency across different voxels. In MGLM (1), we are interested in testing the null hypothesis H_0: B = 0. The parameter matrices in the model are estimated via the least squares method. The resulting multivariate test statistics are the Lawley-Hotelling trace and Roy's maximum root. When there is only one voxel, i.e. p = 1, these multivariate test statistics collapse to Hotelling's T^2 statistic [34, 6]. Note that MGLM (1) is equivalent to assuming that y_i follows a multivariate normal distribution with some mean µ and covariance Σ, i.e. y_i ~ N(µ, Σ). Then, neglecting constant terms, the log-likelihood function L of the y_i is given by

L(µ, Σ) = log det Σ^{-1} − (1/n) ∑_{i=1}^n (y_i − µ) Σ^{-1} (y_i − µ)'.

By maximizing the log-likelihood, the MLEs of µ and Σ are given by

µ̂ = ȳ = (1/n) ∑_{i=1}^n y_i,   Σ̂ = (1/n) ∑_{i=1}^n (y_i − ȳ)'(y_i − ȳ).   (2)

For notational convenience, we can center the measurements y_i by subtracting the group mean over subjects, i.e. y_i ← y_i − ȳ. Then the MLE (2) can be written in the more compact form

Σ̂ = (1/n) J'_{p×n} J_{n×p}.   (3)

However, there is a serious defect with MGLM (1) and its MLE (3): the estimated covariance matrix Σ̂ is positive definite only for n ≥ p, since J'J becomes rank deficient for n < p. In most imaging studies there are more voxels than subjects, i.e. n < p. When Σ̂ is singular, we cannot properly invert it, and the inverse is the precision matrix often needed in partial-correlation-based network analyses [22]. This is the main reason MGLM was rarely employed over the whole brain region and researchers still mostly use univariate approaches in imaging studies.
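The rank deficiency is easy to see numerically. Below is a minimal sketch, assuming a hypothetical measurement matrix J filled with random noise, showing that the compact MLE (3) is singular whenever there are fewer subjects than voxels, so its inverse, the precision matrix, does not exist.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                        # hypothetical study: more voxels than subjects (n < p)
J = rng.standard_normal((n, p))      # stand-in measurement matrix J_{n x p}
J = J - J.mean(axis=0)               # center each voxel over subjects

Sigma = J.T @ J / n                  # compact MLE (3): a p x p matrix
print(np.linalg.matrix_rank(Sigma))  # at most n - 1 = 9, far below p = 50
print(np.linalg.cond(Sigma))         # enormous condition number: Sigma is singular
```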

1.1 Why sparse network?

The majority of functional and structural connectivity studies in brain imaging follow a standard analysis framework [14, 15, 10, 35]. From 3D whole-brain images, p regions of interest (ROI) are identified and serve as the nodes of the brain network. Measurements at the ROIs are then correlated in a pairwise fashion to produce the connectivity matrix of size p × p. The connectivity matrix is thresholded to produce the adjacency matrix of zeros and ones that defines the links between nodes. The binarized adjacency matrix is then used to construct the brain network, various graph complexity measures such as degree, clustering coefficient, entropy, path length, hub centrality and modularity are defined on the graph, and statistical inference is performed on these measures. For a large number of nodes, simple thresholding of correlations produces a large number of links, which makes interpretation difficult: if individual voxels are used as nodes, for example, a p-node graph can have up to p(p − 1)/2 links. For this reason we use a sparse data recovery framework to obtain a far smaller number of significant links.
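For concreteness, here is a minimal sketch of the standard thresholding pipeline just described, using simulated ROI data; the data, the 0.5 threshold and the choice of degree as the graph measure are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.standard_normal((20, 8))        # hypothetical ROI measurements: 20 subjects x 8 ROIs
C = np.corrcoef(Y, rowvar=False)        # 8 x 8 connectivity (correlation) matrix

A = (np.abs(C) > 0.5).astype(int)       # threshold the correlations -> binary adjacency matrix
np.fill_diagonal(A, 0)                  # no self-loops
degree = A.sum(axis=0)                  # a simple graph measure defined on the network
print(degree)
```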

2 Graphical-LASSO

To remedy the small-n large-p problem, the likelihood is regularized with an L1-norm penalty. If we center the measurements y_i, then µ = 0, so the log-likelihood can be written as

L(Σ) = log det Σ^{-1} − (1/n) ∑_{i=1}^n y_i Σ^{-1} y_i'
     = log det Σ^{-1} − tr(Σ^{-1} S),

where S = (1/n) ∑_{i=1}^n y_i' y_i is the sample covariance matrix. We used the fact that the trace of a scalar is the scalar itself and that tr(AB) = tr(BA) for matrices A and B. We write the likelihood as a function of Σ^{-1} simply to emphasize that we are trying to estimate the inverse covariance matrix. To avoid the small-n large-p problem, we penalize the log-likelihood with the L1-norm penalty:

L(Σ) = log det Σ^{-1} − tr(Σ^{-1} S) − λ ||Σ^{-1}||_1,   (4)

where || · ||_1 is the sum of the absolute values of the elements. The penalized log-likelihood is maximized over the space of all symmetric positive definite matrices.

(4) is a convex problem and is usually solved using the graphical-lasso (GLASSO) algorithm [3, 2, 11, 18, 25]. The tuning parameter λ > 0 controls the sparsity of the off-diagonal elements of the inverse covariance matrix: by increasing λ, the estimated inverse covariance matrix becomes more sparse. GLASSO is, however, a fairly time-consuming algorithm [11, 18]; solving GLASSO for 548 nodes, for instance, takes about 6 minutes on a desktop computer.
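As an illustration of how λ controls sparsity, the sketch below uses scikit-learn's GraphicalLasso, one of several GLASSO implementations, on simulated data; its alpha argument plays the role of λ in (4), and the data and λ values are arbitrary.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(2)
Y = rng.standard_normal((50, 20))               # hypothetical data: 50 subjects, 20 nodes
Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)        # standardize each node

for lam in [0.05, 0.2, 0.5]:                    # increasing penalty lambda
    gl = GraphicalLasso(alpha=lam).fit(Y)       # alpha corresponds to lambda in (4)
    K = gl.precision_                           # estimated inverse covariance matrix
    off = np.count_nonzero(np.triu(K, k=1))     # surviving off-diagonal entries
    print(f"lambda={lam}: {off} nonzero off-diagonal entries")
```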

If Σ̂_i(λ) is the estimated sparse covariance for group i at a given sparsity parameter λ, we are usually interested in testing the equivalence of covariance matrices between the two groups at fixed λ, i.e. H_0: Σ_1(λ) = Σ_2(λ).

2.1 Filtration in graphical-lasso

The solution to graphical-lasso has a peculiar topological structure. Let Σ̂^{-1}(λ) = (σ̂^{ij}(λ)) be the inverse covariance estimated from graphical-lasso. Let A(λ) = (a_ij(λ)) be the corresponding adjacency matrix given by

a_ij(λ) = 1 if σ̂^{ij}(λ) ≠ 0, and 0 otherwise.   (6)

The adjacency matrix A induces a graph G(λ) consisting of κ(λ) partitioned subgraphs

G(λ) = ⋃_{l=1}^{κ(λ)} G_l(λ) with G_l = {V_l(λ), A_l(λ)},

where V_l and A_l are the node and edge sets of subgraph G_l. Let S = (s_ij) be the sample covariance matrix and let B(λ) = (b_ij(λ)) be the adjacency matrix defined by

b_ij(λ) = 1 if |s_ij| > λ, and 0 otherwise.   (7)

Figure 1: Left: Adjacency matrices obtained through graphical-lasso with increasing λ values. The persistent homological structure is self-evident. Right: The adjacency matrices are clustered into a block diagonal matrix D by permutation.

The adjacency matrix B similarly induces a graph with τ(λ) disjoint subgraphs

H(λ) = ⋃_{l=1}^{τ(λ)} H_l(λ) with H_l = {W_l(λ), B_l(λ)},

where W_l and B_l are the node and edge sets of subgraph H_l. The partitioned graphs are then partially nested, in the sense that the node sets exhibit persistency.

Theorem 1 For any λ > 0, the adjacency matrices (6) and (7) induce the identical vertex partition, so that κ(λ) = τ(λ) and V_l(λ) = W_l(λ). Further, the node sets V_l and W_l form filtrations over the sparsity parameter:

V_l(λ_1) ⊃ V_l(λ_2) ⊃ V_l(λ_3) ⊃ ...   (8)
W_l(λ_1) ⊃ W_l(λ_2) ⊃ W_l(λ_3) ⊃ ...   (9)

for λ_1 ≤ λ_2 ≤ λ_3 ≤ ....

From (7), it is trivial to see that the filtration holds for W_l. The filtration for V_l is proved in [18]. The equivalence of the node sets V_l = W_l is proved in [25]. Note that the edge sets may not form a filtration. Constructing the filtration on the node sets V_l in (8) is very time consuming since we have to solve a sequence of graphical-lasso problems: for 548 nodes and 547 different filtration values, for instance, the whole filtration takes more than 54 hours on a desktop [5]. In Figure 1, we randomly simulated the data matrix X_{5×10} from the standard normal distribution. The sample covariance matrix is then fed into graphical-lasso at different filtration values. To identify the structure better, we transformed the adjacency matrix A by a permutation P such that D = PAP^{-1} is a block diagonal matrix. Theoretically only the partitioned node sets are expected to exhibit nestedness, but in this example the edge sets are nested as well.
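A small sketch of this experiment is given below: it simulates X_{5×10}, runs graphical-lasso at increasing λ, converts the precision matrices into adjacency matrices via (6), and reports the connected components, whose node partitions should nest as λ grows. It uses scikit-learn's graphical_lasso and SciPy's connected_components; the λ grid is arbitrary and fairly large penalties are used so that the tiny sample converges.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 10))                  # data matrix X_{5 x 10} as in Figure 1
S = np.corrcoef(X, rowvar=False)                  # 10 x 10 sample correlation matrix

for lam in [0.3, 0.5, 0.7, 0.9]:                  # increasing sparsity parameter
    _, K = graphical_lasso(S, alpha=lam)          # estimated inverse covariance
    A = (np.abs(K) > 1e-8).astype(int)            # adjacency matrix (6)
    np.fill_diagonal(A, 0)
    n_comp, labels = connected_components(A, directed=False)
    print(lam, n_comp, labels)                    # node partitions nest as lambda grows
```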

3 Sparse correlation network

The problem with graphical-lasso, or any similar L1-norm optimization, is that it becomes computationally expensive as the number of nodes p increases, so it is not really practical for large-scale brain networks. For large-scale brain networks, we simply recommend thresholding correlations. Here is the mathematical justification.

3.1 Correlations

Consider the measurement vector x_j on node j. If we center and rescale the measurement x_j such that ||x_j||^2 = x_j' x_j = 1, the sample correlation between nodes i and j is given by x_i' x_j. Since the data are normalized, the sample covariance matrix reduces to the sample correlation matrix. Consider the following linear regression between nodes j and k (k ≠ j):

x_j = γ_jk x_k + ε_j.   (10)

We are basically correlating the data at node j to the data at node k, and in this particular case γ_jk plays the role of the usual Pearson correlation.

The least squares estimation (LSE) of γ_jk is then given by

γ̂_jk = x_j' x_k,   (11)

which is the sample correlation: for normalized and centered data, the estimated regression coefficient is exactly the sample correlation. It can be shown that (11) minimizes the sum of least squares over all nodes,

∑_{j=1}^p ∑_{k≠j} ||x_j − γ_jk x_k||^2.   (12)

Note that we do not really care about correlating x_j with itself, since that correlation is trivially γ_jj = 1.

3.2 Sparse correlations

Let Γ = (γ_jk) be the correlation matrix. The sparse penalized version of (12) is given by

F(Γ) = (1/2) ∑_{j=1}^p ∑_{k≠j} ||x_j − γ_jk x_k||^2 + λ ∑_{j=1}^p ∑_{k≠j} |γ_jk|.   (13)

The sparse correlation is given by minimizing F(Γ).

When λ = 0, the sparse correlation is simply the sample correlation, i.e. γ̂_jk = x_j' x_k. As λ increases, the estimated correlation matrix Γ̂(λ) shrinks toward zero and becomes more sparse. This is a separable compressed sensing or LASSO-type problem. However, there is no need to optimize (13) numerically using coordinate descent or the active-set algorithms often used in compressed sensing [27, 11]: the minimization of (13) can be done analytically by the soft-thresholding method proposed below, exploiting the topological structure of the problem. This sparse regression is not orthogonal, i.e. x_i' x_j ≠ δ_ij (the Kronecker delta), so the existing soft-thresholding result for LASSO [32] is not directly applicable.

Theorem 2 For λ ≥ 0, the solution of the separable LASSO problem

γ̂_jk(λ) = arg min_{γ_jk} (1/2) ∑_{j=1}^p ∑_{k≠j} ||x_j − γ_jk x_k||^2 + λ ∑_{j=1}^p ∑_{k≠j} |γ_jk|

is given by the soft-thresholding

γ̂_jk(λ) = x_j' x_k − λ   if x_j' x_k > λ,
γ̂_jk(λ) = 0               if |x_j' x_k| ≤ λ,   (14)
γ̂_jk(λ) = x_j' x_k + λ   if x_j' x_k < −λ.

Proof. Write (13) as

F(Γ) = (1/2) ∑_{j=1}^p ∑_{k≠j} f(γ_jk),   (15)

where f(γ_jk) = ||x_j − γ_jk x_k||^2 + 2λ|γ_jk|. Since each f(γ_jk) is nonnegative and convex, F(Γ) is minimized if each component f(γ_jk) achieves its minimum, so we only need to minimize each component separately. This is what differentiates our sparse correlation formulation from standard compressed sensing, which cannot be optimized in such a component-wise fashion. f(γ_jk) can be rewritten as

f(γ_jk) = ||x_j||^2 − 2γ_jk x_j' x_k + γ_jk^2 ||x_k||^2 + 2λ|γ_jk|
        = (γ_jk − x_j' x_k)^2 + 2λ|γ_jk| + 1 − (x_j' x_k)^2,

where we used the fact that x_j' x_j = x_k' x_k = 1. For λ = 0, the minimum of f(γ_jk) is achieved at γ_jk = x_j' x_k, which is the usual LSE. For λ > 0, since f(γ_jk) is piecewise quadratic in γ_jk, the minimum is achieved when

∂f/∂γ_jk = 2γ_jk − 2 x_j' x_k ± 2λ = 0.   (16)

The sign in front of λ depends on the sign of γ_jk. Thus the sparse correlation γ̂_jk is given by soft-thresholding x_j' x_k:

γ̂_jk(λ) = x_j' x_k − λ   if x_j' x_k > λ,
γ̂_jk(λ) = 0               if |x_j' x_k| ≤ λ,   (17)
γ̂_jk(λ) = x_j' x_k + λ   if x_j' x_k < −λ.

The estimated sparse correlation (17) basically shrinks every sample correlation whose magnitude exceeds λ toward zero by the amount λ and sets the rest exactly to zero. Due to this simple expression, there is no need to optimize (13) numerically, as is often done in compressed sensing or LASSO [27, 11]. However, Theorem 2 is only applicable to separable cases; for nonseparable cases, numerical optimization is still needed.

Different choices of the sparsity parameter λ produce different solutions of the sparse model A(λ). Instead of analyzing each model separately, we can analyze the whole collection of sparse solutions over many different values of λ. This avoids the problem of having to identify a single optimal sparsity parameter, which may not be optimal in practice. The question is then how to use the collection of A(λ) in a coherent mathematical fashion. This can be addressed using persistent homology [9, 20, 21].
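The soft-thresholding estimator (17) is a one-line computation. Below is a minimal sketch of it as a NumPy function; the simulated input data and the choice λ = 0.3 are arbitrary.

```python
import numpy as np

def sparse_correlation(X, lam):
    """Sparse correlation matrix via the soft-thresholding formula (17)."""
    X = X - X.mean(axis=0)                        # center each node
    X = X / np.linalg.norm(X, axis=0)             # rescale so that x_j' x_j = 1
    R = X.T @ X                                   # sample correlations x_j' x_k
    G = np.sign(R) * np.maximum(np.abs(R) - lam, 0.0)   # soft-threshold by lambda
    np.fill_diagonal(G, 1.0)                      # gamma_jj is trivially 1
    return G

rng = np.random.default_rng(4)
print(np.round(sparse_correlation(rng.standard_normal((20, 6)), lam=0.3), 2))
```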

3.3 Filtration in sparse correlations

Using the sparse solution (17), we can construct a filtration. We basically build a graph G using the sparse correlations. Let γ̂_jk(λ) be the sparse correlation estimate and let A(λ) = (a_jk(λ)) be the adjacency matrix defined as

a_jk(λ) = 1 if γ̂_jk(λ) ≠ 0, and 0 otherwise.

This is equivalent to the adjacency matrix B(λ) = (b_jk(λ)) defined as

b_jk(λ) = 1 if |x_j' x_k| > λ, and 0 otherwise.   (18)

The adjacency matrix B is simply obtained by thresholding the sample correlations. The adjacency matrices A and B then induce an identical graph G(λ) consisting of κ(λ) partitioned subgraphs

G(λ) = ⋃_{l=1}^{κ(λ)} G_l(λ) with G_l = {V_l(λ), E_l(λ)},

where V_l and E_l are the node and edge sets respectively. Note that G_l ∩ G_m = ∅ for any l ≠ m, and no two nodes in different partitions are connected.

Figure 2: Jacobian determinants of the deformation field are measured at 548 nodes along the white matter boundary [5]. The β_0-number (number of connected components) of the filtrations on the sample correlations and covariances shows a huge group separation between normal controls and post-institutionalized (PI) children.

The node and edge sets of the whole graph are denoted as V(λ) = ⋃_{l=1}^κ V_l and E(λ) = ⋃_{l=1}^κ E_l respectively. Then we have the following theorem.

Theorem 3 The graph induced by the sparse correlation forms a filtration:

G(λ_1) ⊃ G(λ_2) ⊃ G(λ_3) ⊃ ...   (19)

for λ_1 ≤ λ_2 ≤ λ_3 ≤ .... Equivalently, the node and edge sets form filtrations as well:

V(λ_1) ⊃ V(λ_2) ⊃ V(λ_3) ⊃ ...,   E(λ_1) ⊃ E(λ_2) ⊃ E(λ_3) ⊃ ....

The proof follows easily from the definition of the adjacency matrix (18).
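Because the filtration is obtained by simple thresholding (18), quantities such as the β_0-number shown in Figure 2 can be computed directly from the sample correlations. The sketch below counts the connected components of the thresholded correlation graph over a grid of λ values using SciPy; the simulated data and the λ grid are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def beta0_curve(X, lambdas):
    """beta_0 (number of connected components) of the graph B(lambda) in (18)."""
    X = X - X.mean(axis=0)
    X = X / np.linalg.norm(X, axis=0)
    R = np.abs(X.T @ X)                          # |sample correlations|
    np.fill_diagonal(R, 0.0)
    return [connected_components((R > lam).astype(int), directed=False)[0]
            for lam in lambdas]

rng = np.random.default_rng(5)
X = rng.standard_normal((30, 15))                # hypothetical group data: 30 subjects, 15 nodes
print(beta0_curve(X, np.linspace(0.0, 1.0, 11))) # beta_0 is non-decreasing in lambda
```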

4 Partial correlation network

Let p be the number of nodes in the network. In most applications, the number of nodes is larger than the number of observations n, which gives an underdetermined system. Consider the measurement vector at the j-th node, x_j = (x_1j, ..., x_nj)', consisting of n measurements. The vectors x_j are assumed to be distributed with mean zero and covariance Σ = (σ_ij). The correlation γ_ij between two nodes i and j is given by

γ_ij = σ_ij / (σ_ii σ_jj)^{1/2}.

By thresholding the correlation, we can establish a link between two nodes. However, there is a problem with this simplistic approach: it fails to explicitly factor out the confounding effect of the other nodes. To remedy this, partial correlations can be used to factor out the dependency on the other nodes [16, 24, 17, 18, 27]. If we denote the inverse covariance matrix as Σ^{-1} = (σ^{ij}), the partial correlation between nodes i and j, factoring out the effect of all other nodes, is given by

ρ_ij = σ^{ij} / (σ^{ii} σ^{jj})^{1/2}.   (20)

Equivalently, we can compute the partial correlation via a linear model as follows. Consider a linear model correlating the measurement at node j to all other nodes:

x_j = ∑_{k≠j} β_jk x_k + ε_j.   (21)

The parameters β_jk are estimated by minimizing the sum of squared residuals of (21),

L(β) = ∑_{j=1}^p ||x_j − ∑_{k≠j} β_jk x_k||^2,   (22)

in a least squares fashion. If we denote the least squares estimators by β̂_jk, the residuals are given by

r_j = x_j − ∑_{k≠j} β̂_jk x_k.   (23)

The partial correlation is then obtained by computing the correlation between the residuals [16, 23, 27]: ρ_ij = corr(r_i, r_j).
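A minimal sketch of this residual-based computation is given below, assuming simulated data with n > p so that the node-wise least squares problems in (21)-(23) are well posed; the node pair (0, 1) is arbitrary.

```python
import numpy as np

def residual(X, j):
    """Residual r_j from regressing node j on all other nodes, as in (21)-(23)."""
    others = [k for k in range(X.shape[1]) if k != j]
    Z = X[:, others]
    beta = np.linalg.lstsq(Z, X[:, j], rcond=None)[0]   # least squares estimates
    return X[:, j] - Z @ beta

rng = np.random.default_rng(6)
X = rng.standard_normal((100, 6))                # hypothetical data: n = 100 > p = 6
X = X - X.mean(axis=0)
rho_01 = np.corrcoef(residual(X, 0), residual(X, 1))[0, 1]
print(rho_01)                                    # partial correlation between nodes 0 and 1
```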

4.1 Sparse partial correlations

There is a serious problem with the least squares estimation framework discussed in the previous section. Since n < p, this is a significantly underdetermined system. This is also related to the covariance matrix Σ being singular, so we cannot simply invert the covariance matrix. For this, we need sparse network modeling. The minimization of (22) is exactly given by solving the normal equation

x_j = ∑_{k≠j} β_jk x_k,   (24)

which can be turned into the standard linear form y = Aβ [22].

Note that (24) can be written as

x_j = [x_1, ..., x_{j−1}, 0, x_{j+1}, ..., x_p] (β_j1, β_j2, ..., β_jp)' = X_j β_j,

where 0_{n×1} is a column vector of all zero entries. Stacking the p node-wise equations then gives

(x_1', x_2', ..., x_p')' = diag(X_1, X_2, ..., X_p) (β_1', β_2', ..., β_p')',   (25)

i.e. y_{np×1} = A_{np×p^2} β_{p^2×1}, where A is the block diagonal matrix with blocks X_1, ..., X_p and 0_{n×p} is a matrix of all zero entries. We regularize (25) by incorporating the l1 LASSO penalty J [32, 27, 22]:

J = ∑_{i,j} |β_ij|.

The sparse estimate of β_ij is then given by minimizing L + λJ. Since there is dependency between y and A, (25) is not exactly a standard compressed sensing problem [27, 22]. Intuitively, sparsity makes the linear equation (24) less underdetermined: the larger the value of λ, the sparser the underlying topological structure becomes.
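Because A in (25) is block diagonal, minimizing L + λJ decouples into p node-wise LASSO problems. The sketch below solves each block separately with scikit-learn's Lasso, ignoring any symmetry constraint that a joint formulation such as [27] would impose; the simulated data and λ are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, p = 30, 60                                    # underdetermined: fewer subjects than nodes
X = rng.standard_normal((n, p))
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)

lam = 0.1
B = np.zeros((p, p))                             # sparse coefficients beta_jk
for j in range(p):
    Xj = X.copy()
    Xj[:, j] = 0.0                               # design X_j: column j zeroed out, as in (24)
    B[j] = Lasso(alpha=lam, fit_intercept=False).fit(Xj, X[:, j]).coef_
print(np.count_nonzero(B), "nonzero beta_jk at lambda =", lam)
```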

Since ρ_ij = β_ij (σ^{ii}/σ^{jj})^{1/2}, the sparsity of β_ij directly corresponds to the sparsity of ρ_ij, which is the strength of the link between nodes i and j [27, 22]. Once the sparse partial correlation matrix ρ̂ is obtained, we can simply link nodes i and j if ρ̂_ij ≠ 0 and assign the weight ρ̂_ij to the edge. In this way we obtain a weighted graph.

4.2 Limitations

The sparse partial correlation framework, however, has a serious computational bottleneck. For n measurements over p nodes, we have to solve a linear system with an extremely large A matrix of size np × p^2, so the complexity of the problem grows on the order of p^3. Consequently, for a large number of nodes, the problem quickly becomes intractable on a small computer; for 1 million nodes, for example, there are on the order of 1 trillion possible pairwise relationships between nodes. One practical solution is to modify (21) so that the measurement at node i is represented more sparsely over some index set S_i,

x_i = ∑_{j∈S_i} β_ij x_j + ε_i,

making the problem substantially smaller. An alternative approach is to follow the homotopy path, which adds network links one at a time with a very limited increase in computational complexity, so there is no need to recompute β̂ from scratch [8, 28, 26]. The trajectory of the optimal LASSO solution β̂ follows a piecewise linear path as λ changes; by tracing this path, we can substantially reduce the computational burden of re-estimating β̂ when λ changes.
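The sketch below illustrates the homotopy idea on a single node-wise regression using scikit-learn's lars_path, which returns the breakpoints of the piecewise linear LASSO path in one call; the simulated design and response are arbitrary.

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(8)
n, p = 40, 20
Z = rng.standard_normal((n, p))                  # hypothetical design (one block X_j of (25))
y = Z[:, :3] @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.standard_normal(n)

# One call traces the entire piecewise linear LASSO path: coefs[:, k] is the
# solution at the k-th breakpoint alphas[k], so links enter the model one by one.
alphas, _, coefs = lars_path(Z, y, method="lasso")
print(np.count_nonzero(coefs, axis=0))           # model size grows along the path
```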

References

[1] T.W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, 2nd edition.

[2] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9.

[3] O. Banerjee, L. El Ghaoui, A. d'Aspremont, and G. Natsoulis. Convex optimization techniques for fitting sparse Gaussian graphical models. In Proceedings of the 23rd International Conference on Machine Learning, page 96.

[4] J. Cao and K.J. Worsley. The detection of local shape changes via the geometry of Hotelling's T^2 fields. Annals of Statistics, 27, 1999.

[5] M.K. Chung, J.L. Hanson, J. Ye, R.J. Davidson, and S.D. Pollak. Persistent homology in sparse regression and its application to brain morphometry. IEEE Transactions on Medical Imaging, 34.

[6] M.K. Chung, K.J. Worsley, M.N. Brendon, K.M. Dalton, and R.J. Davidson. General multivariate linear modeling of surface shapes using SurfStat. NeuroImage, 53.

[7] M.K. Chung, K.J. Worsley, T. Paus, D.L. Cherif, C. Collins, J. Giedd, J.L. Rapoport, and A.C. Evans. A unified statistical approach to deformation-based morphometry. NeuroImage, 14.

[8] D.L. Donoho and Y. Tsaig. Fast solution of l1-norm minimization problems when the solution may be sparse. Citeseer.

[9] H. Edelsbrunner and J. Harer. Persistent homology - a survey. Contemporary Mathematics, 453.

[10] A. Fornito, A. Zalesky, and E.T. Bullmore. Network scaling effects in graph analytic studies of human resting-state fMRI data. Frontiers in Systems Neuroscience, 4:1-16, 2010.

[11] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9:432.

[12] K.J. Friston, A.P. Holmes, K.J. Worsley, J.-P. Poline, C.D. Frith, and R.S.J. Frackowiak. Statistical parametric maps in functional imaging: A general linear approach. Human Brain Mapping, 2.

[13] C. Gaser, H.-P. Volz, S. Kiebel, S. Riehemann, and H. Sauer. Detecting structural changes in whole brain based on nonlinear deformations - application to schizophrenia research. NeuroImage, 10.

[14] G. Gong, Y. He, L. Concha, C. Lebel, D.W. Gross, A.C. Evans, and C. Beaulieu. Mapping anatomical connectivity patterns of human cerebral cortex using in vivo diffusion tensor imaging tractography. Cerebral Cortex, 19.

[15] P. Hagmann, M. Kurant, X. Gigandet, P. Thiran, V.J. Wedeen, R. Meuli, and J.P. Thiran. Mapping human whole-brain structural networks with diffusion MRI. PLoS One, 2(7):e597.

[16] Y. He, Z.J. Chen, and A.C. Evans. Small-world anatomical networks in the human brain revealed by cortical thickness from MRI. Cerebral Cortex, 17.

[17] S. Huang, J. Li, L. Sun, J. Liu, T. Wu, K. Chen, A. Fleisher, E. Reiman, and J. Ye. Learning brain connectivity of Alzheimer's disease from neuroimaging data. In Advances in Neural Information Processing Systems.

[18] S. Huang, J. Li, L. Sun, J. Ye, A. Fleisher, T. Wu, K. Chen, and E. Reiman. Learning brain connectivity of Alzheimer's disease by sparse inverse covariance estimation. NeuroImage, 50.

[19] S.C. Joshi. Large Deformation Diffeomorphisms and Gaussian Random Fields for Statistical Characterization of Brain Sub-Manifolds. PhD thesis, Washington University, St. Louis.

[20] H. Lee, M.K. Chung, H. Kang, B.-N. Kim, and D.S. Lee. Computing the shape of brain networks using graph filtration and Gromov-Hausdorff metric. MICCAI, Lecture Notes in Computer Science, 6892.

[21] H. Lee, H. Kang, M.K. Chung, B.-N. Kim, and D.S. Lee. Persistent brain network homology from the perspective of dendrogram. IEEE Transactions on Medical Imaging, 31, 2012.

[22] H. Lee, D.S. Lee, H. Kang, B.-N. Kim, and M.K. Chung. Sparse brain network recovery under compressed sensing. IEEE Transactions on Medical Imaging, 30.

[23] J.P. Lerch, K. Worsley, W.P. Shaw, D.K. Greenstein, R.K. Lenroot, J. Giedd, and A.C. Evans. Mapping anatomical correlations across cerebral cortex (MACACC) using cortical thickness from MRI. NeuroImage, 31.

[24] G. Marrelec, A. Krainik, H. Duffau, M. Pélégrini-Issac, S. Lehéricy, J. Doyon, and H. Benali. Partial correlation for functional brain interactivity investigation in functional MRI. NeuroImage, 32.

[25] R. Mazumder and T. Hastie. Exact covariance thresholding into connected components for large-scale graphical LASSO. The Journal of Machine Learning Research, 13.

[26] M.R. Osborne, B. Presnell, and B.A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20.

[27] J. Peng, P. Wang, N. Zhou, and J. Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104.

[28] M.D. Plumbley. Geometry and homotopy for l1 sparse representations. Proceedings of SPARS, 5.

[29] J. Schäfer and K. Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4:32.

[30] J.E. Taylor and K.J. Worsley. Random fields of multivariate test statistics, with applications to shape analysis. Annals of Statistics, 36:1-27.

[31] P.M. Thompson, D. MacDonald, M.S. Mega, C.J. Holmes, A.C. Evans, and A.W. Toga. Detection and mapping of abnormal brain structure with a probabilistic atlas of cortical surfaces. Journal of Computer Assisted Tomography, 21.

[32] R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B (Methodological), 58, 1996.

[33] K.J. Worsley, S. Marrett, P. Neelin, A.C. Vandal, K.J. Friston, and A.C. Evans. A unified statistical approach for determining significant signals in images of cerebral activation. Human Brain Mapping, 4:58-73.

[34] K.J. Worsley, J.E. Taylor, F. Tomaiuolo, and J. Lerch. Unified univariate and multivariate random field theory. NeuroImage, 23 (Supplement).

[35] A. Zalesky, A. Fornito, I.H. Harding, L. Cocchi, M. Yücel, C. Pantelis, and E.T. Bullmore. Whole-brain anatomical networks: Does the choice of nodes matter? NeuroImage, 50, 2010.
