An Imputation-Consistency Algorithm for Biomedical Complex Data Analysis


1 An Imputation-Consistency Algorithm for Biomedical Complex Data Analysis Faming Liang Purdue University January 11, 2018

2 Outline
- Introduction to biomedical complex data
- An IC algorithm for high-dimensional missing data problems: the missing data problem; the IC algorithm; theoretical development for the IC algorithm; numerical examples
- Extension to the Blockwise Consistency Algorithm
- Discussion

3 Biomedical Complex Data
Motivation: During the past two decades, the dramatic improvement in data collection and acquisition technologies has enabled scientists to collect vast amounts of health-related data in biomedical studies. Here are some examples:
- Multi-omics data: SNPs, copy number variants, mutation, methylation, RNA-seq
- Biomedical image data: cancer pathological images, brain images
- Mobile health data: wearable and/or ambient sensors
- Electronic health records
If analyzed properly, these data can help us improve contemporary healthcare services, from diagnosis to prevention to personalized treatment, and can also provide insights into reducing healthcare costs.

4 Biomedical Complex Data
Biomedical complex data are often characterized by some mixture of:
- missing data
- heterogeneity
- high dimensionality
- small sample size
- high variety
- high volume
- high velocity
How to analyze these data has posed many challenges to existing statistical methods!

5 Missing Data Problem
Missing data appear ubiquitously in both low- and high-dimensional problems. For low-dimensional data, the EM algorithm can be used. For high-dimensional data, some problem-specific algorithms have been developed, but a general algorithm is still lacking.
Example: in some microarray data, missing values can appear in over 90% of genes (Ouyang et al., 2004).

6 EM Algorithm (Dempster et al., 1977)
E-step: Calculate the expected value of the log-likelihood function with respect to the predictive distribution of the missing data given the current estimate $\theta^{(t)}$, i.e.,
$$Q(\theta \mid \theta^{(t)}) = \int \log f(X_{\mathrm{obs}}, x_{\mathrm{mis}} \mid \theta)\, h(x_{\mathrm{mis}} \mid \theta^{(t)}, X_{\mathrm{obs}})\, dx_{\mathrm{mis}}.$$
M-step: Find a value of $\theta$ that maximizes the quantity $Q(\theta \mid \theta^{(t)})$, i.e., set $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$.
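As a toy illustration of these two steps (not from the talk), the sketch below runs EM for the mean and covariance of a bivariate normal whose second coordinate is missing completely at random for some rows.

```python
import numpy as np

def em_bivariate_normal(X, n_iter=50):
    """EM for the mean/covariance of a bivariate normal when the second
    coordinate is missing (NaN) for some rows."""
    X = X.copy()
    miss = np.isnan(X[:, 1])
    mu = np.nanmean(X, axis=0)                    # crude starting values
    Sigma = np.cov(X[~miss].T)
    for _ in range(n_iter):
        # E-step: E[x2 | x1] and the conditional variance under the current (mu, Sigma)
        cond_var = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]
        X[miss, 1] = mu[1] + Sigma[0, 1] / Sigma[0, 0] * (X[miss, 0] - mu[0])
        # M-step: maximize the expected complete-data log-likelihood
        mu = X.mean(axis=0)
        Sigma = (X - mu).T @ (X - mu) / len(X)
        Sigma[1, 1] += miss.mean() * cond_var     # conditional variance of the imputed entries
    return mu, Sigma

rng = np.random.default_rng(0)
Z = rng.multivariate_normal([1.0, -1.0], [[1.0, 0.6], [0.6, 2.0]], size=500)
Z[rng.random(500) < 0.3, 1] = np.nan              # 30% of the second coordinate missing
print(em_bivariate_normal(Z))
```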

7 Variants of EM Algorithm
- Stochastic EM algorithm (Celeux and Diebolt, 1985): the E-step is replaced by an imputation step.
- Monte Carlo EM algorithm (Wei and Tanner, 1990): the E-step is replaced by Monte Carlo integration.
- ECM algorithm (Meng and Rubin, 1993): the M-step is replaced by a number of computationally simpler conditional maximization steps.
- ECME (Liu and Rubin, 1994; He and Liu, 2012), PX-EM (Liu et al., 1998).

8 High-Dimensional Missing Data Problems
The existing algorithms are usually problem-specific:
- Bayesian principal component analysis (BPCA) (Oba et al., 2003)
- matrix completion (Cai et al., 2010): large incomplete matrices
- MissGLasso (Stadler and Buhlmann, 2012): Gaussian graphical models
- MissPALasso (Stadler et al., 2014)

9 Imputation-Consistency (IC) Algorithm
I-step: Draw $\tilde{X}_{\mathrm{mis}}$ from the predictive distribution $h(\tilde{x}_{\mathrm{mis}} \mid X_{\mathrm{obs}}, \theta_n^{(t)})$ given $X_{\mathrm{obs}}$ and the current estimate $\theta_n^{(t)}$.
C-step: Based on the pseudo-complete data $\tilde{X} = (X_{\mathrm{obs}}, \tilde{X}_{\mathrm{mis}})$, find an updated estimate $\theta_n^{(t+1)}$ which forms a consistent estimate of
$$\theta_*^{(t)} = \arg\max_{\theta} E_{\theta_n^{(t)}} \log f_{\theta}(\tilde{x}), \qquad (1)$$
where $E_{\theta_n^{(t)}} \log f_{\theta}(\tilde{x}) = \int\!\!\int \log\big(f(x_{\mathrm{obs}}, \tilde{x}_{\mathrm{mis}} \mid \theta)\big)\, f(x_{\mathrm{obs}} \mid \theta^*)\, h(\tilde{x}_{\mathrm{mis}} \mid x_{\mathrm{obs}}, \theta_n^{(t)})\, dx_{\mathrm{obs}}\, d\tilde{x}_{\mathrm{mis}}$, $\theta^*$ denotes the true value of the parameters, and $f(x_{\mathrm{obs}} \mid \theta^*)$ denotes the marginal density function of $x_{\mathrm{obs}}$.

10 Imputation-Consistency (IC) Algorithm
To find a consistent estimate of $\theta_*^{(t)}$, we suggest a regularization approach: estimate $\theta_*^{(t)}$ by maximizing a penalized likelihood function,
$$\theta_n^{(t+1)} = \arg\max_{\theta} \big[ \log f(X_{\mathrm{obs}}, \tilde{X}_{\mathrm{mis}} \mid \theta) - \lambda P(\theta) \big], \qquad (2)$$
where $P(\theta)$ denotes the penalty function of $\theta$, $\lambda$ is an appropriately tuned regularization parameter, and $\tilde{X}_{\mathrm{mis}}$ denotes the imputed data based on the current estimate $\theta_n^{(t)}$. Here the regularization should be understood in a general sense; it also includes Bayesian and variable screening methods.
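Putting the I-step and the regularized C-step together gives a short generic loop. The sketch below is schematic and not from the talk: `impute` and `fit_penalized` are user-supplied stand-ins for the predictive draw $h(\tilde{x}_{\mathrm{mis}} \mid X_{\mathrm{obs}}, \theta_n^{(t)})$ and the penalized-likelihood estimator in (2).

```python
import numpy as np

def ic_algorithm(x_obs, miss_mask, impute, fit_penalized, theta0, n_iter=30):
    """Schematic Imputation-Consistency loop.
    impute(x_obs, miss_mask, theta): draws the missing entries from
        h(x_mis | x_obs, theta) and returns a pseudo-complete data set (I-step).
    fit_penalized(x_complete): returns a regularized, consistent estimate of
        theta from the pseudo-complete data (C-step)."""
    theta, path = theta0, []
    for _ in range(n_iter):
        x_complete = impute(x_obs, miss_mask, theta)   # I-step
        theta = fit_penalized(x_complete)              # C-step
        path.append(theta)
    # The whole path is returned: averaging over the late iterations smooths
    # the randomness introduced by the imputation step.
    return path
```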

11 Convergence of the IC Algorithm
The rationale underlying the algorithm can be intuitively explained as follows: the consistency step finds the minimizer of the Kullback-Leibler divergence from $f(x_{\mathrm{obs}}, \tilde{x}_{\mathrm{mis}} \mid \theta)$ to the joint density $f(x_{\mathrm{obs}} \mid \theta^*)\, h(\tilde{x}_{\mathrm{mis}} \mid x_{\mathrm{obs}}, \theta_n^{(t)})$. Hence, each consistency step provides a momentum to drive the current estimate $\theta_n^{(t)}$ toward $\theta^*$, and convergence will eventually happen as $n \to \infty$. For the empirical version (i.e., with a finite value of $n$), $\theta_n^{(t)}$ will jump around $\theta^*$ after convergence due to the randomness in imputation.

12 Convergence of the IC Algorithm
Let $\tilde{x} = (x_{\mathrm{obs}}, \tilde{x}_{\mathrm{mis}})$ and define
$$G_n(\theta \mid \theta_n^{(t)}) = E_{\theta_n^{(t)}} \log f_\theta(\tilde{x}) = \int \log\big(f_\theta(\tilde{x})\big)\, f(x_{\mathrm{obs}} \mid \theta^*)\, h(\tilde{x}_{\mathrm{mis}} \mid x_{\mathrm{obs}}, \theta_n^{(t)})\, d\tilde{x},$$
$$\hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) = \frac{1}{n} \sum_{i=1}^{n} \log f(x_i^{\mathrm{obs}}, \tilde{x}_i^{\mathrm{mis}} \mid \theta),$$
$$\tilde{G}_n(\theta \mid \theta_n^{(t)}) = \frac{1}{n} \sum_{i=1}^{n} \int \log f(x_i^{\mathrm{obs}}, \tilde{x}_{\mathrm{mis}} \mid \theta)\, h(\tilde{x}_{\mathrm{mis}} \mid x_i^{\mathrm{obs}}, \theta_n^{(t)})\, d\tilde{x}_{\mathrm{mis}}.$$
Let $\theta_n^{(t+1)} = \arg\max_{\theta \in \Theta_n} \big\{ \hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) - \lambda_n P(\theta) \big\}$, and $\theta_*^{(t)} = \arg\max_{\theta \in \Theta_n} G_n(\theta \mid \theta_n^{(t)})$. Our goal is to show $\theta_n^{(t+1)} \to_p \theta_*^{(t)}$ as $n \to \infty$.

13 Convergence of the IC Algorithm
$$\hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) - G_n(\theta \mid \theta_n^{(t)}) = \big\{\hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) - \tilde{G}_n(\theta \mid \theta_n^{(t)})\big\} + \big\{\tilde{G}_n(\theta \mid \theta_n^{(t)}) - G_n(\theta \mid \theta_n^{(t)})\big\}.$$
Lemma 1 [ULLN] Assume conditions A1-A3 and A6 hold. Then
$$\sup_{\theta_n^{(t)} \in \Theta_n}\, \sup_{\theta \in \Theta_n} \big| \tilde{G}_n(\theta \mid \theta_n^{(t)}) - G_n(\theta \mid \theta_n^{(t)}) \big| \to_p 0.$$
Theorem 1 Assume conditions A1-A8 hold. For any $T$ such that $\log T = o_p(n)$, let $\Theta_n^T$ be an arbitrary subset of $\Theta_n$ with $T$ elements (replicates allowed). Then,
(i) $\sup_{\theta_n^{(t)} \in \Theta_n^T}\, \sup_{\theta \in \Theta_n} \big| \hat{G}_n(\theta \mid \tilde{x}, \theta_n^{(t)}) - G_n(\theta \mid \theta_n^{(t)}) \big| \to_p 0$;
(ii) $\sup_{\theta_n^{(t)} \in \Theta_n^T} \big\| \theta_n^{(t+1)} - \theta_*^{(t)} \big\| \to_p 0$.

14 Convergence of the IC Algorithm: Conditions
(A1) $\log f_\theta(\tilde{x})$ is a continuous function of $\theta$ for each $\tilde{x} \in \tilde{\mathcal{X}}$ and a measurable function of $\tilde{x}$ for each $\theta$.
(A2) $\Theta_n$ is compact.
(A3) There exists a function $m_n(\tilde{x})$ such that $\sup_{\theta \in \Theta_n,\, \tilde{x} \in \tilde{\mathcal{X}}} |\log f_\theta(\tilde{x})| \le m_n(\tilde{x})$.
(A4) $P(\theta)/n \to 0$ as $n \to \infty$, where $P(\theta)$ is the penalty function or the log-prior density function.
(A5) $G_n(\theta \mid \theta_n^{(t)})$ has a unique maximum at $\theta_*^{(t)}$ for all $\theta_n^{(t)} \in \Theta_n$.

15 Convergence of the IC Algorithm: Conditions
(A6) [Conditions for the Glivenko-Cantelli theorem]
(a) Assume that there exists $m_n^*(x_{\mathrm{obs}})$ such that $0 \le m_n(x_{\mathrm{obs}}, \theta_n^{(t)}) \le m_n^*(x_{\mathrm{obs}})$ for all $\theta_n^{(t)}$, $E[m_n^*(x_{\mathrm{obs}})] < \infty$, and $\sup_{n \in \mathbb{Z}^+} E[m_n^*(x_{\mathrm{obs}})\, 1(m_n^*(x_{\mathrm{obs}}) \ge \zeta)] \to 0$ as $\zeta \to \infty$. In addition, $\sup_{n \ge 1}\, \sup_{\tilde{x} \in \tilde{\mathcal{X}},\, \theta \in \Theta_n} \int m_n(\tilde{x})\, 1(m_n(\tilde{x}) > \zeta)\, h(\tilde{x}_{\mathrm{mis}} \mid x, \theta)\, d\tilde{x}_{\mathrm{mis}} \to 0$ as $\zeta \to \infty$.
(b) Define $\mathcal{F}_n = \big\{ \int \log f(x_i^{\mathrm{obs}}, \tilde{x}_{\mathrm{mis}} \mid \theta)\, h(\tilde{x}_{\mathrm{mis}} \mid x_{\mathrm{obs}}, \theta_n^{(t)})\, d\tilde{x}_{\mathrm{mis}} : \theta, \theta_n^{(t)} \in \Theta_n \big\}$ and $\mathcal{G}_{n,M} = \{ q\, 1(m_n^*(x_{\mathrm{obs}}) \le M) : q \in \mathcal{F}_n \}$. Suppose that for any fixed $M$ and $\epsilon$, $\log N(\epsilon, \mathcal{G}_{n,M}, L_1(P_n)) = o_p(n)$, where $P_n$ is the empirical measure of $x_{\mathrm{obs}}$, $L_1(P_n)$ denotes the $L_1$ space of the empirical measure, and $N(\epsilon, \mathcal{G}_{n,M}, L_1(P_n))$ denotes the minimum number of balls $\{g : \|g - q\| \le \epsilon\}$ of radius $\epsilon$ needed to cover the set $\mathcal{G}_{n,M}$.

16 Convergence of the IC Algorithm
Define $B_r(\theta) = \{\theta' : \|\theta' - \theta\|_2 < r\}$,
$$r_n(\eta \mid \theta_n^{(t)}) = \inf\Big\{ r : G_n(\theta_*^{(t)} \mid \theta_n^{(t)}) - \sup_{\theta \in \Theta_n \setminus B_r(\theta_*^{(t)})} G_n(\theta \mid \theta_n^{(t)}) > \eta \Big\},$$
and $r_n(\eta) = \sup_{\theta_n^{(t)} \in \Theta_n} r_n(\eta \mid \theta_n^{(t)})$.
(A7) $r(\eta) = \sup_{n \ge 1} r_n(\eta) \to 0$ as $\eta \to 0$.
(A8) [Bounds on tails of the imputed data] For any $\theta_n^{(t)} \in \Theta_n$ and $x_{\mathrm{obs}} \in \mathcal{X}_{\mathrm{obs}}$, the random variable $\tilde{x}_{\mathrm{mis}} \sim h(\cdot \mid x_{\mathrm{obs}}, \theta_n^{(t)})$ satisfies:
(a) $\log(f(x_{\mathrm{obs}}, \tilde{x}_{\mathrm{mis}} \mid \theta)) \in [-M, M]$ for some generic constant $M > 0$;
(b) $\mathrm{var}_{\theta_n^{(t)}}\big(\log(f(x_{\mathrm{obs}}, \tilde{x}_{\mathrm{mis}} \mid \theta))\big) \le \sigma^2$ for some generic constant $\sigma^2 > 0$.

17 Convergence of the IC Algorithm
The IC algorithm generates two interleaved Markov chains:
$$\theta_n^{(t)} \xrightarrow{\text{sampling}} \tilde{X}_{t+1}^{\mathrm{mis}} \xrightarrow{\text{optimization}} \theta_n^{(t+1)} \xrightarrow{\text{sampling}} \tilde{X}_{t+2}^{\mathrm{mis}} \xrightarrow{\text{optimization}} \cdots$$
It can be shown that the Markov chain $\{\theta_n^{(t)}\}$ is almost surely ergodic for sufficiently large $n$, i.e.,
Theorem 2. If A1-A8 hold, then $\{\theta_n^{(t)}\}$ is almost surely ergodic for sufficiently large $n$.

18 Convergence of the IC Algorithm
Define the mapping $M(\theta) = \arg\max_{\vartheta} E_{\theta} \log f_{\vartheta}(\tilde{x})$.
(A9) (Contraction) The mapping $M(\theta)$ is differentiable. Let $\lambda_n(\theta)$ be the largest singular value of $\partial M(\theta)/\partial \theta$. There exists a number $\lambda < 1$ such that $\lambda_n(\theta) \le \lambda$ for all $\theta \in \Theta_n$, for sufficiently large $n$ and almost every $x_{\mathrm{obs}}$-sequence.
Theorem 3. If A1-A9 hold, then for sufficiently large $n$, sufficiently large $t$, and almost every $x_{\mathrm{obs}}$-sequence, we have $\|\theta_n^{(t)} - \theta^*\| = o_p(1)$. Furthermore, the sample average of the Markov chain also forms a consistent estimate of $\theta^*$, i.e., $\big\| \frac{1}{T} \sum_{t=1}^{T} \theta_n^{(t)} - \theta^* \big\| = o_p(1)$ as $n \to \infty$ and $T \to \infty$.

19 Imputation-Conditional Consistency (ICC) Algorithm
I-step. Draw $Z$ from the conditional distribution $h(z \mid Y, \theta_n^{(t,1)}, \ldots, \theta_n^{(t,k)})$ given $Y$ and the current estimate $(\theta_n^{(t,1)}, \ldots, \theta_n^{(t,k)})$.
CC-step. Based on the pseudo-complete data $\tilde{X} = (Y, Z)$, do the following steps:
(1) Conditional on $(\theta_n^{(t,2)}, \ldots, \theta_n^{(t,k)})$, find $\theta_n^{(t+1,1)}$ which forms a consistent estimate of
$$\theta_*^{(t,1)} = \arg\max_{\theta^{(1)}} E_{\theta_n^{(t,1)}, \ldots, \theta_n^{(t,k)}} \log f(\tilde{x} \mid \theta^{(1)}, \theta_n^{(t,2)}, \ldots, \theta_n^{(t,k)}),$$
where the expectation is with respect to the joint density function of $\tilde{x} = (y, z)$ and the subscript of $E$ gives the current estimate of $\theta$.
...
(k) Conditional on $(\theta_n^{(t+1,1)}, \ldots, \theta_n^{(t+1,k-1)})$, find $\theta_n^{(t+1,k)}$ which forms a consistent estimate of
$$\theta_*^{(t,k)} = \arg\max_{\theta^{(k)}} E_{\theta_n^{(t+1,1)}, \ldots, \theta_n^{(t+1,k-1)}, \theta_n^{(t,k)}} \log f(\tilde{x} \mid \theta_n^{(t+1,1)}, \ldots, \theta_n^{(t+1,k-1)}, \theta^{(k)}),$$
where the expectation is with respect to the joint density function of $\tilde{x} = (y, z)$ and the subscript of $E$ gives the current estimate of $\theta$.

20 Convergence of the ICC Algorithm
(A9') Let $M_i$ denote the mapping of the $i$th part of the CC-step, i.e.,
$$\theta_*^{(t,i)} = M_i(\theta_n^{(t+1,1)}, \ldots, \theta_n^{(t+1,i-1)}, \theta_n^{(t,i)}, \ldots, \theta_n^{(t,k)}).$$
Let $M = M_k \circ M_{k-1} \circ \cdots \circ M_1$ denote the joint mapping of $M_1, \ldots, M_k$. Let $\lambda_n(\theta)$ denote the largest singular value of $\partial M(\theta)/\partial \theta$. There exists a number $\lambda < 1$ such that $\lambda_n(\theta) \le \lambda$ for all $\theta \in \Theta_n$, all sufficiently large $n$, and almost every $x_{\mathrm{obs}}$-sequence.

21 Convergence of the ICC Algorithm
Theorem 4. If A1-A8 and A9' hold, then for sufficiently large $n$, sufficiently large $t$, and almost every $x_{\mathrm{obs}}$-sequence, $\|\theta_n^{(t)} - \theta^*\| = o_p(1)$. Furthermore, the sample average of the Markov chain also forms a consistent estimate of $\theta^*$, i.e., $\big\| \frac{1}{T} \sum_{t=1}^{T} \theta_n^{(t)} - \theta^* \big\| = o_p(1)$ as $n \to \infty$ and $T \to \infty$.

22 Gaussian Graphical Models
Algorithms for complete data:
- Graphical Lasso (Yuan and Lin, 2007; Friedman et al., 2008)
- nodewise regression (Meinshausen and Buhlmann, 2006)
- ψ-learning algorithm (Liang et al., 2015)

23 ψ-learning algorithm
1. Correlation screening, which determines the conditioning set for each pair of variables for calculating the partial correlation coefficient.
2. Calculation of ψ-partial correlation coefficients based on the reduced conditioning sets. The ψ-partial correlation coefficient is equivalent to the partial correlation coefficient for learning the GGM structure in the sense that ψ_ij = 0 if and only if ρ_ij = 0.
3. ψ-partial correlation screening, which determines the structure of the network.
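A rough sketch of the idea, not the published ψ-learning algorithm: for each pair (i, j) the partial correlation is computed given only a small conditioning set obtained by correlation screening (here simply the k variables most correlated with i or j), rather than all the remaining p − 2 variables.

```python
import numpy as np

def psi_scores(X, k=5):
    """Illustrative psi-partial-correlation scores: partial correlation of each
    pair (i, j) given a reduced conditioning set found by correlation screening.
    The screening rule used here (top-k by |r_im| + |r_jm|) is a simplification."""
    p = X.shape[1]
    R = np.corrcoef(X, rowvar=False)
    psi = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            # correlation screening: small conditioning set S for the pair (i, j)
            order = np.argsort(-(np.abs(R[i]) + np.abs(R[j])))
            S = [m for m in order if m not in (i, j)][:k]
            idx = [i, j] + S
            Omega = np.linalg.inv(np.corrcoef(X[:, idx], rowvar=False))
            # partial correlation of (i, j) given S from the local precision matrix
            psi[i, j] = psi[j, i] = -Omega[0, 1] / np.sqrt(Omega[0, 0] * Omega[1, 1])
    return psi
```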

24 Equivalent Measure
Figure: Illustrative plot for the calculation of ψ-partial correlation coefficients, where the solid and dotted edges indicate direct and indirect associations, respectively.
It reduces a high-dimensional problem (calculation of the ρ_ij's) to a low-dimensional problem (calculation of the ψ_ij's).

25 IC Algorithm for Gaussian Graphical Models
(Initialization) Replace each missing entry by the median of the corresponding variable, and then iterate between the C- and I-steps.
(C-step) Apply the ψ-learning algorithm to learn the structure of the Gaussian graphical network.
(I-step) Impute the missing values based on the network structure learned in the C-step.
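A minimal sketch of this loop, with the graphical lasso (sklearn's GraphicalLassoCV) standing in for the ψ-learning algorithm in the C-step; the I-step draws each missing entry from its conditional normal distribution given the observed entries of the same sample under the current estimate.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

def ic_ggm(X, n_iter=20, seed=0):
    """IC loop for a Gaussian graphical model with missing (NaN) entries.
    C-step: graphical lasso as a stand-in for psi-learning.
    I-step: draw x_mis | x_obs from the fitted multivariate normal."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    miss = np.isnan(X)
    X[miss] = np.take(np.nanmedian(X, axis=0), np.where(miss)[1])   # median initialization
    for _ in range(n_iter):
        model = GraphicalLassoCV().fit(X)                            # C-step (stand-in)
        mu, Omega = X.mean(axis=0), model.precision_
        for i in np.where(miss.any(axis=1))[0]:                      # I-step
            m, o = miss[i], ~miss[i]
            cov_m = np.linalg.inv(Omega[np.ix_(m, m)])               # conditional covariance
            mean_m = mu[m] - cov_m @ Omega[np.ix_(m, o)] @ (X[i, o] - mu[o])
            X[i, m] = rng.multivariate_normal(mean_m, cov_m)
    return model.precision_, X
```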

26 Simulated Example
The simulated example is an autoregressive process of order two with the concentration matrix given by
$$C_{i,j} = \begin{cases} 0.5, & \text{if } |j - i| = 1,\; i = 2, \ldots, (p-1), \\ 0.25, & \text{if } |j - i| = 2,\; i = 3, \ldots, (p-2), \\ 1, & \text{if } i = j,\; i = 1, \ldots, p, \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$
n = 200; p = 100, 200, 300, 400.
Missing rate 10%: randomly delete 10% of the observations.
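The concentration matrix in (3) and the missing-data mechanism are easy to reproduce; the sketch below generates one such dataset (assuming entries are deleted completely at random, as suggested by the slide).

```python
import numpy as np

def ar2_precision(p):
    """Concentration matrix (3): 1 on the diagonal, 0.5 on the first
    off-diagonals, 0.25 on the second off-diagonals."""
    C = np.eye(p)
    idx = np.arange(p)
    C[idx[:-1], idx[1:]] = C[idx[1:], idx[:-1]] = 0.5
    C[idx[:-2], idx[2:]] = C[idx[2:], idx[:-2]] = 0.25
    return C

def simulate(n=200, p=100, miss_rate=0.10, seed=1):
    """Draw n samples from N(0, C^{-1}) and delete 10% of the entries at random."""
    rng = np.random.default_rng(seed)
    Sigma = np.linalg.inv(ar2_precision(p))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    X[rng.random((n, p)) < miss_rate] = np.nan
    return X
```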

27 Simulated Example
[Two panels: (a) p = 100 and (b) p = 400; precision versus recall curves for misglasso, Median, BPCA, IC-Last, IC-Ave, and True.]
Figure: Precision-recall curves for the GGM with missing data: IC-Ave is obtained from the ψ-scores averaged over the last 20 iterations, IC-Last is obtained from the ψ-score generated in the last iteration, True is obtained from the ψ-score calculated using the complete data, Median is obtained from the ψ-score calculated with the missing entries replaced by the median expression value of the corresponding gene, BPCA is obtained from the ψ-score calculated with the missing entries replaced by the BPCA estimate, and misglasso refers to the misglasso algorithm.

28 Yeast Cell Expression Data
Gasch et al. (2000) explored the genomic expression patterns of the yeast Saccharomyces cerevisiae responding to diverse environmental changes. The dataset contains 173 samples and 6152 genes, with a missing rate of 3.01%. We work on the top 1000 genes with the largest variation across samples.

29 Yeast Cell Expression Data
Figure: (a) Integrated network obtained by the IC algorithm for the yeast data. (b) Log-log plot of the degree distribution of the integrated network (log degree versus log probability).

30 High-Dimensional Variable Selection
$$Y = (1, X)\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I_n),$$
where some elements of $X$ are missing at random, and each row of $X$ follows a multivariate normal distribution $N(0, \Sigma)$. The parameters of the model include $\theta_1 = (\beta, \sigma^2)$ and $\theta_2 = \Sigma^{-1}$. The ICC algorithm is applicable.

31 High-Dimensional Variable Selection
Regularization methods for complete data:
- Lasso (Tibshirani, 1996): L1 penalty.
- Elastic net (Zou and Hastie, 2005): a linear combination of L1 and L2 penalties.
- SCAD (Fan and Li, 2001), MCP (Zhang, 2009): concave penalties.
- Extended BIC (Chen and Chen, 2008): L0 penalty.
- rLasso (Song and Liang, 2015): reciprocal L1 penalty.

32 ICC Algorithm for High-Dimensional Variable Selection
(Initialization) Replace each missing entry of X by the median of the corresponding column, and then iterate between the CC- and I-steps.
(CC-step) (i) Apply the MCP algorithm to estimate the regression coefficients; (ii) estimate σ² conditional on the estimate of β; and (iii) apply the ψ-learning algorithm to learn the structure of the Gaussian graphical network.
(I-step) Impute the missing values according to the conditional distributions based on the regression model and the network structure learned in the CC-step.
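A simplified sketch of one implementation of this loop, with LassoCV standing in for MCP in step (i) and GraphicalLassoCV standing in for ψ-learning in step (iii). For brevity, the I-step below draws each missing covariate from its conditional normal given the observed covariates only, whereas the algorithm on the slide also conditions on the response through the regression model.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV
from sklearn.linear_model import LassoCV

def icc_regression(y, X, n_iter=20, seed=0):
    """Sketch of the ICC loop for sparse regression with missing (NaN) covariates."""
    rng = np.random.default_rng(seed)
    X = X.copy()
    miss = np.isnan(X)
    X[miss] = np.take(np.nanmedian(X, axis=0), np.where(miss)[1])   # median initialization
    for _ in range(n_iter):
        reg = LassoCV(cv=5).fit(X, y)                 # CC-step (i): beta (Lasso as MCP stand-in)
        sigma2 = np.mean((y - reg.predict(X)) ** 2)   # CC-step (ii): sigma^2 given beta
        gg = GraphicalLassoCV().fit(X)                # CC-step (iii): precision matrix of X
        mu, Omega = X.mean(axis=0), gg.precision_
        for i in np.where(miss.any(axis=1))[0]:       # I-step (simplified: ignores y)
            m, o = miss[i], ~miss[i]
            cov_m = np.linalg.inv(Omega[np.ix_(m, m)])
            mean_m = mu[m] - cov_m @ Omega[np.ix_(m, o)] @ (X[i, o] - mu[o])
            X[i, m] = rng.multivariate_normal(mean_m, cov_m)
    return reg.coef_, sigma2, gg.precision_
```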

33 Simulated Example
The datasets were simulated with n = 100 and p = 200 and 500. The covariates X were generated in two settings: (i) the covariates are mutually independent, where x_i ~ N(0, 2I_n) for i = 1, ..., n; and (ii) the covariates are generated according to the concentration matrix (3). For both settings, we set (β_0, β_1, ..., β_5) = (1, 2, 1.5, 2.5, 5), β_6 = ... = β_p = 0, and the random error term ϵ ~ N(0, 2I_n). Under each setting of (n, p), we simulated 10 datasets. For each dataset, we considered two missing rates, randomly deleting 5% and 10% of the observations of X as missing values.

34 Simulated Example
Table: Comparison of the ICC algorithm with the Median and BPCA methods for high-dimensional variable selection with independent covariates. True denotes the results obtained by the MCP method from the complete data.

  p    MR    metric   BPCA           Median         ICC            True
 200   5%    err²β    0.257(0.267)   0.262(0.261)   0.042(0.041)   0.046(0.048)
             fsr      0.119(0.143)   0.082(0.092)   0(0)           0(0)
             nsr      0(0)           0(0)           0(0)           0(0)
       10%   err²β    0.903(0.396)   0.856(0.421)   0.065(0.087)   0.046(0.048)
             fsr      0.310(0.159)   0.308(0.178)   0(0)           0(0)
             nsr      0(0)           0(0)           0(0)           0(0)
 500   5%    err²β    0.339(0.214)   0.350(0.206)   0.029(0.034)   0.027(0.023)
             fsr      0.249(0.225)   0.266(0.237)   0(0)           0(0)
             nsr      0(0)           0(0)           0(0)           0(0)
       10%   err²β    1.532(1.071)   1.354(0.895)   0.044(0.022)   0.027(0.023)
             fsr      0.470(0.265)   0.420(0.255)   0(0)           0(0)
             nsr      0.033(0.070)   0.017(0.053)   0(0)           0(0)

35 Simulated Example
Table: Comparison of the ICC algorithm with the Median and BPCA methods for high-dimensional variable selection with dependent covariates. True denotes the results obtained by the MCP method from the complete data.

  p    MR    metric   BPCA           Median         ICC            True
 200   5%    err²β    0.580(0.413)   0.548(0.140)   0.118(0.097)   0.071(0.050)
             fsr      0.262(0.204)   0.263(0.200)   0(0)           0(0)
             nsr      0.017(0.052)   0.017(0.052)   0(0)           0(0)
       10%   err²β    1.604(0.666)   1.575(0.974)   0.424(0.461)   0.071(0.050)
             fsr      0.247(0.229)   0.273(0.238)   0(0)           0(0)
             nsr      0.100(0.086)   0.083(0.088)   0.033(0.070)   0(0)
 500   5%    err²β    0.669(0.366)   0.717(0.358)   0.172(0.195)   0.096(0.083)
             fsr      0.262(0.202)   0.289(0.236)   0(0)           0(0)
             nsr      0.017(0.053)   0.017(0.053)   0(0)           0(0)
       10%   err²β    2.752(2.306)   2.896(2.601)   0.578(0.587)   0.096(0.083)
             fsr      0.297(0.230)   0.327(0.224)   0(0)           0(0)
             nsr      0.133(0.070)   0.133(0.070)   0.050(0.081)   0(0)

36 A Real Data Example
The eye dataset contains 120 samples and 200 variables. We set the missing rate at 5% and ran 10 times with different missing entries. Each run consists of 30 iterations, and the results from the last 10 iterations are averaged.
Table: Estimation errors of β̂ (with respect to β_c) produced by ICC, Median and BPCA for the Bardet-Biedl syndrome example, where err²β is calculated by averaging ||β̂ − β_c||² over the 10 incomplete datasets, and s.d. represents the standard deviation of err²β.

 Method   BPCA   Median   ICC
 err²β
 s.d.

37 A Real Data Example: Model Selection
Complete Data: v.153, v.180, v.185, v.87, v.200
ICC: v.153, v.185, v.180, v.87, v.200
Median: v.153, v.185, v.62, v.200, v.54
BPCA: v.153, v.87, v.185, v.62, v.200

38 Mixture High-Dimensional Regression
This model mimics the personalized medicine problem and addresses the variety (or heterogeneity) issue of big data:
$$y_i = \begin{cases} \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon, & 1 \le i \le 200, \\ \beta_0 + \beta_1 x_1 + \beta_{102} x_{102} + \beta_{103} x_{103} + \epsilon, & 201 \le i \le 400, \\ \beta_0 + \beta_1 x_1 + \beta_{202} x_{202} + \beta_{203} x_{203} + \epsilon, & 401 \le i \le 600, \\ \beta_0 + \beta_1 x_1 + \beta_{302} x_{302} + \beta_{303} x_{303} + \epsilon, & 601 \le i \le 800, \\ \beta_0 + \beta_1 x_1 + \beta_{402} x_{402} + \beta_{403} x_{403} + \epsilon, & 801 \le i \le 1000. \end{cases}$$
Results comparison (for p = 500 and n_1 = n_2 = ... = n_5 = 200):
- SIS-MCP: 38 variables are selected
- SIS-SCAD: 40 variables are selected
- SIS-Lasso: 47 variables are selected
- IC: 5 clusters are identified and exactly the true model is selected.
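For concreteness, the snippet below generates data of the form shown above; the coefficient values and the noise scale are illustrative choices, not those used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per, p = 200, 500
X = rng.standard_normal((5 * n_per, p))
# 0-indexed supports: x_1 is shared across clusters, the other two covariates are cluster-specific
supports = [[0, 1, 2], [0, 101, 102], [0, 201, 202], [0, 301, 302], [0, 401, 402]]
beta = np.array([2.0, -1.5, 3.0])          # illustrative cluster-specific coefficients
y = np.empty(5 * n_per)
for c, idx in enumerate(supports):
    rows = slice(c * n_per, (c + 1) * n_per)
    y[rows] = 1.0 + X[rows][:, idx] @ beta + rng.standard_normal(n_per)
```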

39 Mixture High-Dimensional Regression
Figure: Cluster dendrogram (hclust, "average" linkage) of the clusters identified by the ICC algorithm for the mixture high-dimensional regression example.

40 Cancer Cell Line Encyclopedia (CCLE) Data
The CCLE dataset contains compound screening data performed on large panels of molecularly characterized cancer cell lines. The gene expression (genome-wide), copy number profiling, and mutation data have been summarized into gene-level features. The CCLE panel is composed of 41,814 genomic features and 24 compounds on 504 cell lines (411 cell lines contain all measurement types).
We tested the proposed method on the compound Lapatinib, an orally active drug for breast cancer and other solid tumors. For this compound, the dataset contains 491 cell lines. The AUC (area under the response curve) was used as the response variable, and the gene expression data were used as predictors. For the purpose of illustration, we used only the top 500 genes, where the genes were ranked according to their marginal Henze-Zirkler scores with respect to the response variable (Xue and Liang, 2016).

41 CCLE Data: Lapatinib
Figure: (a)-(c) Scatter plots of the fitted response versus the original observations for M = 1, 2, and 3, respectively; (d) cluster dendrogram (hclust, "average" linkage) for M = 3.

42 CCLE Data: Lapatinib
- M = 1: corr(y, Ŷ) = 0.49, average-BIC value = (Xue and Liang, 2017); the gene ERBB2 was selected, which is the known predictive gene of Lapatinib (Penzcalto et al., 2013).
- M = 2: corr(y, Ŷ) = 0.85, average-BIC value =
- M = 3: corr(y, Ŷ) = 0.93, average-BIC value = 316.37; cluster 1 (214 cell lines) selected 39 genes, cluster 2 (166 cell lines) selected 32 genes, and cluster 3 (111 cell lines) selected only the gene ERBB2.
The algorithm is efficient: a total of 6.5 minutes (CPU time) on a 3.5 GHz computer for all three models. For both M = 2 and M = 3, each was run for 200 iterations.

43 Biomarker Discovery
Biomarker identification from high-throughput omics data has been one of the major focuses in cancer research. Yet despite intense effort over the past two decades, the number of biomarkers approved by the FDA each year for clinical use remains in the single digits. An important factor contributing to this failure is the lack of appropriate statistical methods for analyzing such heterogeneous, high-dimensional, small-sample-sized data. ICC provides a promising tool for biomarker discovery under heterogeneity.

44 Extension: Blockwise Consistency (BwC) Algorithm
1. There exists a constant $K \ge k$ such that every index $s \in \{1, 2, \ldots, k\}$ is chosen at least once between the $r$th iteration and the $(r + K - 1)$th iteration, for all $r$.
2. For the chosen index $s$, find an estimator $\hat{\theta}_t^{(s)}$ of $\theta^{(s)}$ which asymptotically maximizes the objective function
$$W(\theta_t^{(s)}) = E_\theta \log f(X \mid \hat{\theta}_{t-1}^{(1)}, \ldots, \hat{\theta}_{t-1}^{(s-1)}, \theta_t^{(s)}, \hat{\theta}_{t-1}^{(s+1)}, \ldots, \hat{\theta}_{t-1}^{(k)}) \qquad (4)$$
based on the samples $X_1, \ldots, X_n$, where $t$ indexes iterations, and the $\hat{\theta}_{t-1}^{(j)}$'s (for $j \ne s$) denote the current estimates and are treated as constants at iteration $t$. Let $\hat{\theta}_t^{(s)}$ denote the estimator of $\theta^{(s)}$ found at iteration $t$, and set
$$\hat{\theta}_t^{(j)} = \begin{cases} \hat{\theta}_t^{(s)}, & j = s, \\ \hat{\theta}_{t-1}^{(j)}, & j \ne s. \end{cases}$$

45 Extension: Blockwise Consistency (BwC) Algorithm
Let $\tilde{\theta}_t^{(s)} = \arg\max W(\theta_t^{(s)})$, and let $\hat{\theta}_t^{(s)}$ denote a regularized estimator of $\tilde{\theta}_t^{(s)}$. $\{\tilde{\theta}_t\}$ forms a path of a coordinate descent (or iterated conditional modes) algorithm. Under appropriate conditions, it can be shown that $\hat{\theta}_t$ converges to $\tilde{\theta}_t$ uniformly with probability going to 1. Consequently, the two sequences will converge to the same limit.

46 An Illustrative Example
We consider a variable selection example with n = 100 and p = 5000:
$$y_i = \theta_0 + \sum_{j=1}^{p} x_{ij} \theta_j + \epsilon_i, \quad i = 1, 2, \ldots, n, \qquad (5)$$
where the $\epsilon_i$'s are iid normal random errors with mean 0 and variance 1. The true values of the $\theta_j$'s are $\theta_j = 1$ for $j = 1, 2, \ldots, 10$ and 0 otherwise. The predictors $x_1, \ldots, x_p$ are given by
$$x_1 = z_1 + e, \quad x_2 = z_2 + e, \quad \ldots, \quad x_p = z_p + e, \qquad (6)$$
where $e, z_1, \ldots, z_p$ are iid normal random vectors drawn from $N(0, I_n)$. Ten datasets are independently generated.

47 An Illustrative Example
The true variables are repositioned so that they occupy the indices {1, 2, 1001, 1002, 2001, 2002, 3001, 3002, 4001, 4002}.
1. Split the predictors into 5 blocks: {x_1, ..., x_1000}, {x_1001, ..., x_2000}, {x_2001, ..., x_3000}, {x_3001, ..., x_4000}, and {x_4001, ..., x_5000}.
2. Conduct variable selection using MCP for each block independently, and combine the selected predictors to get an initial estimate of θ.
3. Conduct blockwise conditional variable selection using MCP for 25 sweeps. Here a sweep refers to a cycle of updates over all blocks.
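A sketch of this procedure under the data-generating model (5)-(6), with sklearn's LassoCV standing in for MCP; each block is updated against the partial residuals of the other blocks, which is the "blockwise conditional" selection of step 3.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, n_blocks = 100, 5000, 5
true_idx = np.array([0, 1, 1000, 1001, 2000, 2001, 3000, 3001, 4000, 4001])  # 0-indexed

# Data generation following (5)-(6): the shared vector e induces strong correlation.
e = rng.standard_normal((n, 1))
X = rng.standard_normal((n, p)) + e
theta = np.zeros(p); theta[true_idx] = 1.0
y = X @ theta + rng.standard_normal(n)

# Blockwise conditional selection, LassoCV as a stand-in for MCP.
blocks = np.array_split(np.arange(p), n_blocks)
theta_hat = np.zeros(p)
for sweep in range(5):
    for b in blocks:
        r = y - X @ theta_hat + X[:, b] @ theta_hat[b]   # partial residual for block b
        theta_hat[b] = LassoCV(cv=5).fit(X[:, b], r).coef_
print(np.flatnonzero(theta_hat))                          # selected variables
```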

48 BwC Results I
Figure: Convergence path of BwC for one simulated dataset with n = 100 and p = 5000, plotted against the number of sweeps, where fsr, nsr, and sqrt(sse) denote the false selection rate, negative selection rate, and parameter estimation error ||θ̂ − θ||, respectively.

49 BwC Results II
Table: Comparison of BwC, Lasso, SCAD and MCP for the simulated example with p = 5000.

               Lasso        SCAD         MCP          BwC
 ŝ_avg         21(0.0)      19.9(1.10)   20.3(0.70)   12.8(0.36)
 ŝ_s,avg       3.7(0.79)    5.4(0.93)    5.2(0.95)    10(0)
 ||θ̂−θ||_avg   3.31(0.24)   2.96(0.41)   3.01(0.39)   (0.05)
 fsr
 nsr

50 BwC for eQTL Analysis
The eQTL (expression quantitative trait loci) analysis can be formulated as a multivariate regression analysis,
$$Y = XB + E,$$
where Y represents the expression levels of q genes, X represents p single nucleotide variants (SNVs), and E denotes the random error matrix. The goal is to identify both the cis-eQTLs and trans-eQTLs: the former refers to SNVs that regulate the expression of their own genes, and the latter to SNVs that regulate the expression of genes they do not belong to.
Let $\epsilon_{(i)}$ denote the ith row of E. In general, it is assumed that $\epsilon_{(i)}^T$ follows a multivariate normal distribution $N(0, \Sigma)$, while $\epsilon_{(1)}^T, \ldots, \epsilon_{(n)}^T$ are mutually independent. We are interested in jointly estimating the regression coefficient matrix B and the precision matrix $\Omega = \Sigma^{-1}$, in particular when q and/or p are greater than n.
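A two-block sketch in the spirit of this formulation: B and Ω are updated in turn, with LassoCV standing in for the penalized B-update (the per-column penalty weighting by Ω_jj is dropped for brevity) and GraphicalLassoCV standing in for the penalized precision-matrix update.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.covariance import GraphicalLassoCV

def bwc_multivariate_regression(Y, X, n_sweeps=5):
    """Two-block BwC sketch for Y = X B + E: block 1 is the coefficient matrix B,
    block 2 is Omega = Sigma^{-1}. Each column of B is refitted against a working
    response that accounts for the residual correlation through Omega."""
    q = Y.shape[1]
    B = np.zeros((X.shape[1], q))
    Omega = np.eye(q)
    for _ in range(n_sweeps):
        R = Y - X @ B                                    # current residual matrix
        for j in range(q):                               # block 1: update B column by column
            # working response for column j under the Omega-weighted Gaussian loss
            y_tilde = Y[:, j] + (R @ Omega[:, j] - R[:, j] * Omega[j, j]) / Omega[j, j]
            B[:, j] = LassoCV(cv=5).fit(X, y_tilde).coef_
            R[:, j] = Y[:, j] - X @ B[:, j]
        Omega = GraphicalLassoCV().fit(Y - X @ B).precision_   # block 2: precision matrix
    return B, Omega
```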

51 BwC for eQTL Analysis
Table: Results comparison for precision matrix estimation (with n = 200, q = 100 and p = 3000, 5000 and 10000): the results are averages over 10 datasets, where ŝ_avg denotes the number of connections selected by the method, and ŝ_s,avg denotes the number of true connections selected by the method. The true number of connections is 197.

    p     method   ŝ_avg          ŝ_s,avg       fsr            nsr
  3000    BwC      (2.28)         (1.75)        0.05 (0.008)   0.18 (0.009)
          AMRCE    70.7 (5.92)    1.1 (0.33)    0.99 (0.004)   0.99 (0.003)
  5000    BwC      (3.51)         (2.64)        0.05 (0.006)   0.18 (0.013)
          AMRCE    (10.14)        0.65 (0.20)   0.99 (0.003)   0.99 (0.002)
 10000    BwC      (1.57)         (1.54)        0.06 (0.006)   0.21 (0.008)
          AMRCE    (13.58)        1.55 (0.52)   0.98 (0.006)   0.98 (0.005)

- BwC: treat B and Ω as two blocks; B can be blocked further.
- AMRCE (approximate multivariate regression with covariance estimation) by Rothman et al. (2010).
- A Bayesian method (Bhadra and Mallick, 2013).

52 BwC for eQTL Analysis
Figure: Histograms of the non-zero elements of B̂ obtained by BwC (left panels) and AMRCE (right panels) for three datasets with p = 3000 (upper panels), p = 5000 (middle panels), and p = 10000 (lower panels), respectively.

53 Discussion: I
- Consistency is a useful concept, which can lead to many efficient (approximation) algorithms for high-dimensional and/or big data computing.
- The variance of the ICC samples reflects the information loss due to the missing data.

54 Discussion: II
- The BwC method decomposes a high-dimensional parameter estimation problem into a series of lower-dimensional parameter estimation problems, which often have much simpler structures than the original high-dimensional problem and thus can be easily solved. (Example: hierarchical models)
- The BwC algorithm provides a potential solution to the problem of parameter estimation for the complex models that are often encountered in big data analysis.
- Under the framework provided by BwC, a variety of methods, such as Bayesian and frequentist methods, can be jointly used to achieve a consistent estimator for the original high-dimensional complex model. This is very important for big data problems, for which a complex model is often needed!

55 Acknowledgments NSF grant DMS NIH R01GM NIH R01GM126089
