Integrated Analysis of Genomics Data


Elizabeth Jennings

July 3, 2012

Abstract

In this project, we integrate data from several genomic platforms in a model that incorporates the biological relationships between platforms to more precisely identify the molecular mechanisms that affect the clinical outcome of cancer. We present the model and describe the estimation procedure. We then summarize simulation results that validate our choice of parameterization and assess the performance of our method. Finally, we apply our method to a Glioblastoma Multiforme (GBM) data set and discuss the results.

1 Introduction

The Central Dogma of Biology summarizes the steps involved in the expression of a gene at the molecular level: DNA is transcribed to messenger RNA (mRNA), which is then translated to a protein, which carries out an action. Alterations and interferences can also occur at the DNA and/or mRNA levels and affect the ultimate expression. In this project we consider the platforms of methylation (which occurs at the DNA level and typically results in silencing of the gene), copy number (an attribute at the DNA level that affects mRNA expression), and mRNA expression (which affects protein expression). Other platforms can be incorporated in a straightforward manner, and we plan to add microRNA (miRNA) data (which can mute mRNA or affect mRNA expression directly) in the near future.

The process above describes the expression of a single gene, but the mechanism of cancer is believed to involve multiple genes. Research has found that genes interact and are related through certain pathways, and for this project we focus on genes from a single pathway that is believed to affect GBM processes [1]. Our goal is to integrate data from several genomic platforms in a model that incorporates the biological relationships between platforms, identifying not only which genes have a significant effect on survival, but also which platform(s) of those genes is (are) modulating the effect.

2 Model

We employ a two-step, hierarchical model. The first component can be considered the mechanistic model, and the second can be considered the clinical model. This hierarchical setup has recently been introduced as an integrative Bayesian analysis of genomics data (iBAG) model [4].

2.1 iBAG Model

The mechanistic model is the first component, and it models the effect of methylation and copy number on gene expression. The clinical model subsequently models the effects of these pieces on the clinical outcome. The model can be expressed as:

    mRNA_i = M_i + CN_i + O_i,   where i = 1, ..., max(p_j); j = 1, ..., k,    (1)

    Y = M β_1 + CN β_2 + O β_3 + ε,    (2)

with definitions as follows. Let n = number of patients, k = number of platforms, and p_j = number of genes with data from platform j.

- mRNA_i is the level of gene expression for gene i and has dimension (n × 1).
- M_i is the part of gene i expression that is due to methylation and has dimension (n × 1); M has dimension (n × p_1).
- CN_i is the part of gene i expression that is due to copy number and has dimension (n × 1); CN has dimension (n × p_2).
- O_i is the remaining ("other") part of gene i expression, i.e., the part due to neither methylation nor copy number, and has dimension (n × 1); O has dimension (n × p_3).
- Y is the clinical outcome (survival in days from diagnosis) and has dimension (n × 1).
- β_i is the effect of platform i on the clinical outcome and has dimension (p_i × 1).
- ε is the error term and has dimension (n × 1).
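To make the two-level structure of (1) and (2) concrete, the following is a minimal R sketch that simulates toy data with exactly this decomposition. The dimensions, generating distributions, and effect sizes below are illustrative assumptions, not settings taken from the report.

```r
## Toy data with the iBAG structure: mechanistic model (1), then clinical model (2).
## All generating choices (standard normal pieces, normal effects) are illustrative.
set.seed(1)
n <- 100; p <- 5                       # patients and genes (illustrative sizes)
M  <- matrix(rnorm(n * p), n, p)       # methylation-driven part of expression
CN <- matrix(rnorm(n * p), n, p)       # copy-number-driven part of expression
O  <- matrix(rnorm(n * p), n, p)       # remaining ("other") part of expression
mRNA <- M + CN + O                     # mechanistic model: mRNA_i = M_i + CN_i + O_i

beta1 <- rnorm(p); beta2 <- rnorm(p); beta3 <- rnorm(p)  # platform-specific effects
sigma <- 1
Y <- M %*% beta1 + CN %*% beta2 + O %*% beta3 + rnorm(n, 0, sigma)  # clinical model
```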

2.2 Estimation

To estimate M_i, CN_i, and O_i in the mechanistic model, we first carry out two Principal Component Analyses (PCAs) for gene i: one on the methylation data for gene i and one on the copy number data for gene i. Then we perform a least squares regression of mRNA_i on the methylation and copy number PC scores that account for 90% of the variation. We use the fitted pieces and the residuals from this regression to estimate the vectors M_i, CN_i, and O_i. This process is repeated for each gene to yield the matrices M, CN, and O (a short code sketch of this step is given below).

Since we believe there is sparsity in the parameters of the clinical model, we estimate them by implementing the Bayesian Lasso. (This approach is also beneficial because it allows us to estimate the variances of the parameters.) We represent the clinical model with the following hierarchy (based on the suggestion of Park and Casella [3]):

    n = number of samples; k = number of platforms; p_i = number of predictors for platform i;
    p = Σ_{i=1}^{k} p_i = total number of predictors in the clinical model;

    Y = M β_1 + CN β_2 + O β_3 + ε = Xβ + ε, where Y is mean-centered and ε ~ Normal(0_n, σ² I_n);
    thus Y ~ Normal(Xβ, σ² I_n);
    β ~ Normal(0_p, σ² D_τ), where D_τ = diag(τ²_{1,1}, ..., τ²_{1,p_1}, ..., τ²_{k,1}, ..., τ²_{k,p_k});
    τ²_{i,j} ~ Negative Exponential with mean 2/λ_i², i.e., density (λ_i²/2) exp(−λ_i² τ²_{i,j}/2);
    σ² ~ InverseGamma(a, b), i.e., density b^a (σ²)^{−(a+1)} exp(−b/σ²)/Γ(a);
    λ_i² ~ Gamma(r, δ), i.e., density δ^r (λ_i²)^{r−1} exp(−δ λ_i²)/Γ(r).
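Before turning to the complete conditionals, here is a minimal R sketch of the per-gene PCA and least-squares decomposition described at the start of this subsection. It is an assumed implementation rather than the code used for the report: the function and variable names are ours, and folding the intercept into the "other" piece is one reasonable choice among several.

```r
## Decompose the expression of a single gene i into methylation-, copy-number-,
## and "other"-attributable pieces via PCA plus least squares (assumed sketch).
decompose_gene <- function(mrna_i, meth_i, cn_i, var_prop = 0.90) {
  # keep the leading PC scores explaining at least var_prop of the variation
  keep_scores <- function(x) {
    pc <- prcomp(x, center = TRUE, scale. = TRUE)
    m  <- which(cumsum(pc$sdev^2) / sum(pc$sdev^2) >= var_prop)[1]
    pc$x[, seq_len(m), drop = FALSE]
  }
  Zm <- keep_scores(meth_i)               # methylation PC scores for gene i
  Zc <- keep_scores(cn_i)                 # copy number PC scores for gene i
  fit <- lm(mrna_i ~ Zm + Zc)             # least squares regression of expression
  cf  <- coef(fit)
  M_i  <- drop(Zm %*% cf[grep("^Zm", names(cf))])    # methylation piece
  CN_i <- drop(Zc %*% cf[grep("^Zc", names(cf))])    # copy number piece
  O_i  <- drop(residuals(fit) + cf["(Intercept)"])   # remaining ("other") piece
  list(M = M_i, CN = CN_i, O = O_i)
}
```

Repeating this over all genes and binding the resulting vectors column-wise gives the matrices M, CN, and O used to form X in the clinical model.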

From this representation, we derive the complete conditionals (see the Appendix for details). Note that we will be using the parameterization involving the precision (γ² = 1/τ²) as opposed to the parameterization with τ²:

    β | rest ~ Normal{ (XᵀX + D_τ⁻¹)⁻¹ XᵀY, σ² (XᵀX + D_τ⁻¹)⁻¹ }

    σ² | rest ~ Inv.Gamma( a* = a + (n + p)/2, b* = b + {(Y − Xβ)ᵀ(Y − Xβ) + βᵀ D_τ⁻¹ β}/2 )

    λ_i² | rest ~ Gamma( a* = p_i + r, b* = δ + Σ_{j=1}^{p_i} τ²_{i,j}/2 )

    τ²_{i,j} | rest ~ Gen.Inv.Gaussian( a = λ_i², b = β²_{i,j}/σ², p = 1/2 ),
        where V ~ Gen.Inv.Gaussian(a, b, p) has density (a/b)^{p/2} v^{p−1} exp{−(a v + b/v)/2} / {2 K_p(√(ab))},
        and K_p(·) is a modified Bessel function of the second kind.

    γ²_{i,j} | rest = (1/τ²_{i,j}) | rest ~ Inv.Gaussian( ν = (σ² λ_i²/β²_{i,j})^{1/2}, λ = λ_i² ),
        where X ~ Inv.Gaussian(ν, λ) has density {λ/(2π)}^{1/2} x^{−3/2} exp{−λ(x − ν)²/(2 ν² x)}.

Since the complete conditionals are all in closed form, we can use a Gibbs sampler (with block updates for β and γ²) for estimation. The initial values and hyperparameters are chosen as follows:

    The initial β is the estimate from the frequentist Lasso with a single shrinkage parameter.
    The initial σ² is the MLE of σ².
    Each initial λ_i² is the square of the penalty parameter chosen by cross-validation for the frequentist Lasso. (So the initial λ_i²'s are all equal.)
    The initial γ²_{i,j}'s are all set to 1.
    The hyperparameters for σ² are a = b = 0.001, so as to be uninformative.
    The hyperparameters for each λ_i² are r = 1 and δ = 0.1, which results in a posterior that is relatively flat with high posterior probability near the MLE [2].

We also consider three alternative parameterization options that provide equivalent models but different estimation properties; we discuss those in the Simulation section.
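As an illustration of how these updates fit together, the sketch below implements one sweep of the Gibbs sampler under the γ² (precision) parameterization. It is an assumed implementation, not the code used for the report: the function and argument names are ours, `plat` is a hypothetical vector mapping each predictor to its platform, and the `statmod` and `MASS` packages are assumed for the inverse Gaussian and multivariate normal draws.

```r
## One Gibbs sweep for the Bayesian Lasso clinical model (assumed sketch).
library(statmod)   # rinvgauss(n, mean, shape)
library(MASS)      # mvrnorm(n, mu, Sigma)

gibbs_sweep <- function(Y, X, beta, sigma2, lambda2, gamma2, plat,
                        a = 0.001, b = 0.001, r = 1, delta = 0.1) {
  n <- nrow(X); p <- ncol(X)
  # beta | rest ~ Normal((X'X + D_tau^-1)^-1 X'Y, sigma2 (X'X + D_tau^-1)^-1),
  # where D_tau^-1 = diag(gamma2) since gamma2 = 1/tau2
  A    <- crossprod(X) + diag(gamma2, nrow = p)
  Ainv <- chol2inv(chol(A))
  beta <- as.vector(mvrnorm(1, drop(Ainv %*% crossprod(X, Y)), sigma2 * Ainv))
  # sigma2 | rest ~ Inv.Gamma(a + (n + p)/2, b + {RSS + beta' D_tau^-1 beta}/2)
  rss    <- sum((Y - X %*% beta)^2)
  sigma2 <- 1 / rgamma(1, shape = a + (n + p) / 2,
                          rate  = b + (rss + sum(gamma2 * beta^2)) / 2)
  # gamma2_{i,j} | rest ~ Inv.Gaussian(nu = sqrt(sigma2 * lambda2_i / beta_{i,j}^2), lambda2_i)
  gamma2 <- rinvgauss(p, mean = sqrt(sigma2 * lambda2[plat] / beta^2),
                         shape = lambda2[plat])
  # lambda2_i | rest ~ Gamma(p_i + r, delta + sum_j tau2_{i,j}/2), with tau2 = 1/gamma2
  for (i in seq_along(lambda2)) {
    idx <- which(plat == i)
    lambda2[i] <- rgamma(1, shape = length(idx) + r,
                            rate  = delta + sum(1 / gamma2[idx]) / 2)
  }
  list(beta = beta, sigma2 = sigma2, lambda2 = lambda2, gamma2 = gamma2)
}
```

Iterating this sweep, discarding a burn-in, and retaining the remaining draws gives posterior samples of the kind summarized in the Simulation and Results sections.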

3 Simulations

We use simulations to compare four parameterization options for the clinical model. After choosing the best parameterization, we then further assess its estimation properties.

3.1 Choosing a parameterization

Park and Casella argue for σ² to appear in the β prior because it results in a unimodal full posterior [3], but we were concerned that this might also inflate the correlation between MCMC samples, so we investigate the option of giving β a Normal(0_p, D_τ) prior. In each of those two cases (with and without σ² in the β prior), we also compare the τ² versus γ² parameterizations. We were interested in this aspect because we were somewhat suspicious about the accuracy of the R commands used to sample from the Inverse Gaussian and Generalized Inverse Gaussian distributions; some of the parameter values of the complete conditional distributions arising in the MCMC samples were extreme, and we wanted to make sure that the variation seen in the output of the R commands was just a result of this, and not an error in the sampling mechanism.

We adjusted the Gibbs sampler to reflect the differences in the complete conditional distributions for each parameterization. Then, for each different choice of β, we simulated a training data set with n = 100, k = 1, p_1 = 90, σ² = 1, each X entry drawn from Normal(0, 1), and Y ~ Normal(Xβ, σ² I_n), as well as a test data set with the same settings except n = 400. The results from the simulations that most closely reflect what we anticipate in the data (i.e., sparsity, and number of predictors close to number of samples) are shown in the tables below. The MSE ratio is the MSE from least squares divided by the MSE from our method.

Summary for β = (0, 3, 0, 0, 3, 0, 3, 0, 0) with each entry repeated 10 times:

Summary for β = 90 values sampled from Laplace(λ = 1):

Since the estimates of σ² seem to be very inaccurate when we leave σ² out of the β prior, we choose to include it. Since the results do not appear to be impacted by the choice of γ² versus τ², we choose the parameterization with γ², because the code using the Inverse Gaussian distribution runs more than ten times faster than the code using the Generalized Inverse Gaussian distribution. The high correlation between σ² and λ² is somewhat concerning, and we address this in the next subsection.
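For reference, the MSE ratio reported above can be computed as in the following short sketch (our notation; it assumes the MSE in question is prediction error on the data set at hand, with `beta_ls` the least squares estimate and `beta_post` the posterior mean from our method).

```r
## MSE ratio: least squares MSE divided by the MSE from our method,
## evaluated on either the training or the test data (assumed sketch).
mse_ratio <- function(Y, X, beta_ls, beta_post) {
  mse <- function(b) mean((Y - X %*% b)^2)
  mse(beta_ls) / mse(beta_post)   # values > 1 favor our method
}
```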

3.2 Assessing the Gibbs Sampler

We ran many simulations to assess the sampler. We tried different combinations of p_i (number of genes per platform) and k (number of platforms), resulting in cases that ranged from p << n to p > n, focusing on the case where p is close to n, since that is what we anticipate seeing in real data sets. (Recall that p = Σ_{i=1}^{k} p_i = number of predictors in the clinical model.) We also investigated the results when the true β values come from a mixture of 0 and ±3 (as in the Casella simulation [2]), as well as true β values sampled from a Laplace distribution with various λ values (to increase sparsity). We also simulated data using different σ² values. After running 10,000 iterations of the Gibbs sampler, using 500 as burn-in, we examined results such as trace plots, posterior means, credible intervals for β, shrinkage plots, correlation between σ² and the λ_i²'s, and MSE efficiency for each of these scenarios. The training and test data were simulated in the same manner as in the previous simulations, with n = 100 for the training data and n = 400 for the test data, and other settings (σ², β, etc.) specific to the particular simulation.

We generally saw excellent mixing of the β_{i,j}'s, except for some distinct autocorrelation when we tried updating the β_{i,j}'s in a few blocks as opposed to a single block. The shrinkage plots showed what we wanted: the effects close to zero were shrunk even closer to 0, and the larger effects had minimal shrinkage. The trace plots of σ² and the λ_i²'s looked good for n = 100 and p = 30, but we started seeing autocorrelation with p = 60, and even stronger autocorrelation when p was increased to 90. The estimates still seemed reasonable, but we wanted to investigate further to ensure that we were covering the entire parameter space. In investigating this issue, we realized that in the cases where the trace plots showed high autocorrelation, there was also high correlation between σ² and λ_i²; this makes sense because, with our parameterization, the scale parameter of the Laplace prior for β_i is σ/λ_i. So we checked the trace plots of σ/λ_i and found that the ratio showed no distinct autocorrelation even when p = 90. We also ran simulations with the same data (simulated with k = 1) but different starting values for λ_1² to ensure that the posterior means and convergence were not highly dependent on the initial value; with a true λ_1² of 9, we used initial values of 0.1, 1, 10, 30, and 100 and obtained almost identical results from each. At this point we were satisfied that the sampler was performing well.
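The diagnostic just described can be scripted along the following lines (an assumed sketch; `sigma2_draws` and `lambda2_draws` are hypothetical objects holding the retained draws of σ² and of the λ_i², one column per platform).

```r
## Trace and autocorrelation of the ratio sigma/lambda_i (the Laplace prior scale),
## which mixes well even when sigma2 and lambda2_i are individually sticky.
check_ratio_mixing <- function(sigma2_draws, lambda2_draws, i = 1) {
  ratio <- sqrt(sigma2_draws) / sqrt(lambda2_draws[, i])
  op <- par(mfrow = c(1, 2)); on.exit(par(op))
  plot(ratio, type = "l", xlab = "iteration", ylab = paste0("sigma/lambda_", i))
  acf(ratio, main = paste0("ACF of sigma/lambda_", i))
  invisible(ratio)
}
```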

Results and plots from two relevant simulations are shown below. In each, we set n = 100, σ² = 1, and λ_i = 3 (i = 1, ..., k), and sampled from Laplace(λ_i) to set β.

Simulation 1: k = 3, p_1 = p_2 = p_3 = 30 (3 shrinkage parameters)

Posterior means: σ̂² = …, λ̂_1² = …, λ̂_2² = …, λ̂_3² = …; σ̂/λ̂_1 = …, σ̂/λ̂_2 = 0.581, σ̂/λ̂_3 = …. MSE efficiency: on training data = …; on test data = ….

Simulation 2: k = 1, p_1 = 90 (1 shrinkage parameter)

Posterior means: σ̂² = 0.761, λ̂_1² = …, σ̂/λ̂_1 = …. MSE efficiency: on training data = …; on test data = ….

The patterns discussed above can be seen in the plots from the simulations. We observe that the ratio σ/λ_i seems to be estimated more accurately than the individual parameters. We also note that the MSE efficiency (the MSE from least squares divided by the MSE from our method) is < 1 on the training data but > 1 on the test data. This is consistent with the idea that, with so much true sparsity, the least squares estimates overfit the training data, while our method provides estimates that are more applicable to the population. Overall, we are pleased with the sampler and believe that it is mixing well and converging to the true values. We are confident in moving forward and applying it to a real data set.

4 Analysis of the Data

Glioblastoma Multiforme (GBM) is one of the most common and most malignant brain tumors. The data used in this project are GBM data from The Cancer Genome Atlas (TCGA). Among other things, the data set contains information on mRNA expression, methylation, copy number, and survival for 33 patients. We are using the data corresponding to 49 genes in a single signaling pathway.

4.1 Description

The bioanalyst extracted the relevant data from a much larger set that included information for many more genes. Four patients had multiple samples on at least one platform; the bioanalyst believes the repeats were done to ensure the data are consistent. Under the assumption that this is the rationale (and not that there was some problem with the first sample), we decided to use the average of these repeats. We then reformatted the extracted data into several structures:

1. OurSurvival (33 × 1), containing days of survival after diagnosis for each patient (with no missing data).
2. OurMRNA (33 × 49), containing mRNA expression levels for each gene (columns) for each patient (rows) (with no missing data).
3. OurMeth (33 × 176), containing data on the methylation markers (columns) for each patient (rows). There can be multiple markers per gene (the number varies from gene to gene); the columns are ordered by gene: all markers for gene 1, then all markers for gene 2, etc. There are 40 missing values (< 0.1% of the entries). There is one gene with no methylation data, so we set the corresponding coefficient to 0 later in the analysis.

4. OurCopyNumber (33 × 54), containing copy number data (columns) for each patient (rows). Again, there are multiple values per gene (ranging from 1 to 43), and the columns are ordered by gene. There are 676 missing values (5.5% of the entries).

After imputing the missing values (see the following subsection), we perform two PCAs for each gene: one on the associated methylation markers and one on the associated copy number locations, each time keeping enough PC scores to account for at least 90% of the variation. Then (still for the single gene) we regress the mRNA expression on the PC scores by least squares, and we use the predicted pieces and the residuals from this regression to estimate M_i, CN_i, and O_i in the mechanistic model. After repeating this for each gene, we put all the M_i, CN_i, and O_i vectors into a matrix X (33 × 147). Each row of X corresponds to a patient, and the columns consist of 49 columns for the M_i's, then 49 for the CN_i's, and then 49 for the O_i's. One gene has no methylation data, so we remove that column from the X matrix, which essentially sets that effect to 0. Any effect that may be due to methylation for that gene would then be captured by the O predictor in the clinical model.

Since we are analyzing survival data, we choose to use a log-normal model, and thus we mean-center log(OurSurvival) to obtain our response vector. After standardizing the columns of X, we have our final X matrix and response vector. We run the Gibbs sampler with k = 3, so there is a separate shrinkage parameter for methylation, copy number, and other.

4.2 Missing Data

Since the percentage of missing data is so low, we choose to impute using the following algorithm for both the methylation data and the copy number data (a code sketch follows the list):

(1) For each marker, replace any NAs with the mean of the other patients. Call the resulting matrix Temp.
(2) Use Temp to calculate a correlation matrix between markers.
(3) For each marker with missing value(s), regress it on the 3 markers with which it is most highly positively correlated (using the Temp matrix for the predictors to avoid further complications from missing data).
(4) Substitute the predicted value for the missing value in the original matrix.
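A sketch of this imputation algorithm in R is given below. It is our own rendering of steps (1)–(4), not the code used for the report; in particular, the handling of ties and of markers whose top correlates also have missing entries is a simplifying assumption.

```r
## Regression-based single imputation following steps (1)-(4) above (assumed sketch).
## 'dat' is a patients x markers matrix with NAs for the missing values.
impute_markers <- function(dat, n_top = 3) {
  # (1) mean-impute each marker to get a complete working copy, Temp
  Temp <- apply(dat, 2, function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x })
  # (2) correlation matrix between markers, computed from Temp
  R <- cor(Temp)
  out <- dat
  for (j in which(colSums(is.na(dat)) > 0)) {
    # (3) regress marker j on its n_top most positively correlated markers,
    #     fitting on the patients where marker j is observed and using Temp
    #     for the predictors
    top <- setdiff(order(R[, j], decreasing = TRUE), j)[seq_len(n_top)]
    obs <- !is.na(dat[, j])
    fit <- lm(dat[obs, j] ~ Temp[obs, top, drop = FALSE])
    pred <- cbind(1, Temp[, top, drop = FALSE]) %*% coef(fit)
    # (4) substitute predicted values for the missing entries of marker j
    out[!obs, j] <- pred[!obs]
  }
  out
}
```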

4.3 Results

Plots and output summarizing the results of the above steps are shown below.

First, we see that there does not appear to be an issue of autocorrelation in the trace plots of σ² or the λ_i²'s. The β_{i,j}'s also seem to be mixing well. The posterior mean of σ² is 0.445, and the posterior means of λ_1², λ_2², and λ_3² are 6.6, 83.0, and 71, respectively. There are six 95% credible intervals that do not contain 0; they correspond to the effects of M_33 (which has an estimated effect less than 0) and of CN_40, CN_45, O_15, O_20, and one further O term (which have estimated effects greater than 1). Genes 33, 40, 45, 15, and 20 are GRB2, CCND1, MDM2, SRC, and PDGFRB, respectively; the remaining significant O effect corresponds to ERBB2.
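The significant effects above are read off the posterior samples as in the following sketch (our notation; `beta_draws` is a hypothetical iterations × predictors matrix of retained draws of β, with columns named by platform and gene).

```r
## Flag effects whose 95% credible interval excludes zero (assumed sketch).
significant_effects <- function(beta_draws, level = 0.95) {
  alpha <- 1 - level
  ci <- apply(beta_draws, 2, quantile, probs = c(alpha / 2, 1 - alpha / 2))
  keep <- ci[1, ] > 0 | ci[2, ] < 0       # interval entirely above or below zero
  data.frame(effect = colnames(beta_draws)[keep],
             lower  = ci[1, keep],
             upper  = ci[2, keep],
             mean   = colMeans(beta_draws[, keep, drop = FALSE]))
}
```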

5 Conclusions & Future Work

We have identified six genes that appear to have a significant effect on survival, and we have also identified the mechanism of each effect. However, the shrinkage plot from the GBM analysis shows that the effects with the largest least squares estimates are shrunk much more than we would like; it appears that the single parameter in the β prior is not sufficient to capture both the mass around zero and the mass in the tails. Our next step is to try a two-parameter prior for β, such as the NEG (Normal-Exponential-Gamma) or NG (Normal-Gamma) prior.

There are several more things we are planning to investigate:

- We will check diagnostic plots to ensure the validity of the model.
- Instead of doing the mRNA regression first and then the Bayesian Lasso, we may consider connecting the two steps in a fully unified Bayesian framework.
- We could incorporate the functional aspect of the copy number data, using chromosome location information, and do functional PCA instead of standard PCA.
- We may consider other methods of handling the missing values, such as multi-step imputation or a variant of PCA designed to handle missing data.
- We plan to include miRNA as another platform once we receive those data. The holdup is the biological question of which miRNAs to associate with which genes; the bioanalyst is trying to obtain association scores that we could use to make this determination.
- We may look into incorporating additional platforms, such as proteomic data.
- Eventually, our goal is to incorporate multiple gene pathways into one model.

References

[1] Memorial Sloan-Kettering Cancer Center. Pathway analysis of genetic alterations in glioblastoma (TCGA).

[2] Minjung Kyung, Jeff Gill, Malay Ghosh, and George Casella. Penalized regression, standard errors, and Bayesian lassos. Bayesian Analysis, 5(2):369–412, 2010.

[3] Trevor Park and George Casella. The Bayesian Lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.

[4] Wenting Wang, Veera Baladandayuthapani, Jeffrey S. Morris, Bradley M. Broom, Ganiraju C. Manyam, and Kim-Anh Do. Integrative Bayesian analysis of high-dimensional multi-platform genomics data. Submitted to Bioinformatics, May 2012.

Appendix

Derivations of Complete Conditional Distributions:

Inverse Gamma: V ~ IG(a, b) has density f_V(v) = b^a v^{−(a+1)} exp(−b/v)/Γ(a).

Generalized Inverse Gaussian: V ~ GIG(a, b, p) has density (a/b)^{p/2} v^{p−1} exp{−(a v + b/v)/2}/{2 K_p(√(ab))}, where K_p(·) is a modified Bessel function of the second kind.

Inverse Gaussian: X ~ InvGauss(ν, λ) has density f_X(x) = {λ/(2π)}^{1/2} x^{−3/2} exp{−λ(x − ν)²/(2 ν² x)}.

Parameterization of the model:

    k = number of predictor sets; p_i = number of predictors for set i;
    p = Σ_{i=1}^{k} p_i = total number of predictors; n = number of samples;

    Y = Xβ + ε, where Y is mean-centered, β = (β_{1,1}, ..., β_{1,p_1}, ..., β_{k,1}, ..., β_{k,p_k}), and ε ~ Normal(0_n, σ² I_n);
    thus Y ~ Normal(Xβ, σ² I_n);
    β ~ Normal(0_p, σ² D_τ), where D_τ = diag(τ²_{1,1}, ..., τ²_{1,p_1}, ..., τ²_{k,1}, ..., τ²_{k,p_k});
    τ²_{i,j} ~ Negative Exponential with mean 2/λ_i², i.e., density (λ_i²/2) exp(−λ_i² τ²_{i,j}/2);
    σ² ~ InverseGamma(a, b), i.e., density b^a (σ²)^{−(a+1)} exp(−b/σ²)/Γ(a);
    λ_i² ~ Gamma(r, δ), i.e., density δ^r (λ_i²)^{r−1} exp(−δ λ_i²)/Γ(r).

The joint likelihood of (Y, X, β, σ², τ², λ²) is proportional to

    (σ²)^{−n/2} exp{−(Y − Xβ)ᵀ(Y − Xβ)/(2σ²)}
    × (σ²)^{−p/2} [Π_{i=1}^{k} Π_{j=1}^{p_i} (τ²_{i,j})^{−1/2}] exp{−βᵀ D_τ⁻¹ β/(2σ²)}
    × (σ²)^{−(a+1)} exp(−b/σ²)
    × Π_{i=1}^{k} (λ_i²)^{r−1} exp(−δ λ_i²)
    × Π_{i=1}^{k} Π_{j=1}^{p_i} (λ_i²/2) exp(−λ_i² τ²_{i,j}/2).

The distribution of β given the data and (τ², σ², λ²). This should be normal.

    [β | rest] ∝ exp{−(Y − Xβ)ᵀ(Y − Xβ)/(2σ²)} exp{−βᵀ D_τ⁻¹ β/(2σ²)}
              ∝ exp{Yᵀ Xβ/σ² − βᵀ (XᵀX + D_τ⁻¹) β/(2σ²)}

Now we can apply the shortcut that when a density is ∝ exp(−βᵀ Σ⁻¹ β/2 + Cβ), then β ~ Normal(Σ Cᵀ, Σ), to find:

    β | rest ~ Normal{ (XᵀX + D_τ⁻¹)⁻¹ XᵀY, σ² (XᵀX + D_τ⁻¹)⁻¹ }.

The distribution of τ² given the data and (β, σ², λ²). This should be generalized inverse Gaussian.

    [τ² | rest] ∝ ( Π_{i=1}^{k} Π_{j=1}^{p_i} (τ²_{i,j})^{−1/2} ) exp{−βᵀ D_τ⁻¹ β/(2σ²)} × Π_{i=1}^{k} Π_{j=1}^{p_i} exp(−λ_i² τ²_{i,j}/2)
               = Π_{i=1}^{k} Π_{j=1}^{p_i} [ (τ²_{i,j})^{−1/2} exp{−β²_{i,j}/(2σ² τ²_{i,j})} exp(−λ_i² τ²_{i,j}/2) ]
               = g(τ²_{1,1}) g(τ²_{1,2}) ⋯ g(τ²_{k,p_k}),

where g(τ²_{i,j}) = (τ²_{i,j})^{−1/2} exp[−{λ_i² τ²_{i,j} + β²_{i,j}/(σ² τ²_{i,j})}/2].

So we can see that the τ²_{i,j}'s are conditionally independent, with

    τ²_{i,j} | rest ~ GIG(a = λ_i², b = β²_{i,j}/σ², p = 1/2).

Also, defining the precision γ²_{i,j} = 1/τ²_{i,j} and applying the change-of-variable formula, we obtain:

    [γ² | rest] ∝ Π_{i=1}^{k} Π_{j=1}^{p_i} (γ²_{i,j})⁻² g(1/γ²_{i,j}) = h(γ²_{1,1}) ⋯ h(γ²_{k,p_k}),

where

    h(γ²_{i,j}) = (γ²_{i,j})^{−3/2} exp{−β²_{i,j} γ²_{i,j}/(2σ²) − λ_i²/(2γ²_{i,j})}
                ∝ (γ²_{i,j})^{−3/2} exp{−β²_{i,j} γ²_{i,j} λ_i²/(2σ² λ_i²) + λ_i² (σ² λ_i²/β²_{i,j})^{−1/2} − λ_i²/(2γ²_{i,j})}
                = (γ²_{i,j})^{−3/2} exp[−λ_i² {γ²_{i,j} − (σ² λ_i²/β²_{i,j})^{1/2}}² / {2 (σ² λ_i²/β²_{i,j}) γ²_{i,j}}].

So we can see that the γ²_{i,j}'s are conditionally independent, with

    γ²_{i,j} | rest ~ InvGaussian( ν = (σ² λ_i²/β²_{i,j})^{1/2}, λ = λ_i² ).
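Given the earlier concern about the accuracy of the R routines for these distributions, the equivalence just derived can be checked numerically along the following lines. This is our own sanity-check sketch, assuming the GIGrvg and statmod packages; the values of a and b are arbitrary illustrations.

```r
## Monte Carlo check that 1/tau^2 drawn from the GIG full conditional matches gamma^2
## drawn from the inverse Gaussian full conditional (assumed sketch, not from the report).
library(GIGrvg)    # rgig(n, lambda, chi, psi): density ~ x^(lambda-1) exp(-(chi/x + psi*x)/2)
library(statmod)   # rinvgauss(n, mean, shape)

set.seed(1)
a <- 4; b <- 0.25                                        # a = lambda_i^2, b = beta_ij^2/sigma^2
tau2   <- rgig(1e5, lambda = 1/2, chi = b, psi = a)      # tau^2 | rest ~ GIG(a, b, p = 1/2)
gamma2 <- rinvgauss(1e5, mean = sqrt(a / b), shape = a)  # gamma^2 | rest ~ InvGauss(nu, lambda_i^2)
qqplot(1 / tau2, gamma2,
       xlab = "1/tau^2 from GIG draws", ylab = "gamma^2 from inverse Gaussian draws")
abline(0, 1)
```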

The distribution of σ² given the data and (τ², β, λ²). This should be inverse Gamma.

    [σ² | rest] ∝ (σ²)^{−n/2} exp{−(Y − Xβ)ᵀ(Y − Xβ)/(2σ²)} × (σ²)^{−p/2} exp{−βᵀ D_τ⁻¹ β/(2σ²)} × (σ²)^{−(a+1)} exp(−b/σ²)
                = (σ²)^{−{(n+p)/2 + a + 1}} exp[−{(Y − Xβ)ᵀ(Y − Xβ) + βᵀ D_τ⁻¹ β + 2b}/(2σ²)].

Thus, we see that

    σ² | rest ~ InvGamma( a* = a + (n + p)/2, b* = b + {(Y − Xβ)ᵀ(Y − Xβ) + βᵀ D_τ⁻¹ β}/2 ).

The distribution of λ² given the data and (τ², σ², β). This should be Gamma.

    [λ² | rest] ∝ Π_{i=1}^{k} Π_{j=1}^{p_i} (λ_i²/2) exp(−λ_i² τ²_{i,j}/2) × Π_{i=1}^{k} (λ_i²)^{r−1} exp(−δ λ_i²)
                ∝ Π_{i=1}^{k} (λ_i²)^{p_i + r − 1} exp{−λ_i² (δ + Σ_{j=1}^{p_i} τ²_{i,j}/2)}.

So we can see that the λ_i²'s are conditionally independent, with

    λ_i² | rest ~ Gamma( a* = p_i + r, b* = δ + Σ_{j=1}^{p_i} τ²_{i,j}/2 ).
