Regularized non-negative matrix factorization for latent component discovery in heterogeneous methylomes


N. Vedeneev¹, P. Lutsik², M. Slawski³, J. Walter¹, M. Hein¹
¹Saarland University, Germany; ²DKFZ, Germany; ³George Mason University, USA

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Problem Setting: DNA methylation is one of the most important and most studied marks in modern epigenetics. It plays a central role in numerous biological processes, such as stem cell differentiation, embryonic development and diseases like cancer [1]. Each (sub-)cell type has a characteristic methylation profile. For some tissues, e.g. blood, individual cell types can be isolated and their methylation profiles measured directly. For other tissues, e.g. brain, such cell separation is difficult, and in many cases only a methylation measurement of the full tissue, which is a mixture of cell types, is available. Reliable and accurate identification of the cell type-specific methylation patterns and their corresponding mixture proportions is of great benefit for any subsequent biological analysis. This can be seen as a blind deconvolution problem, which we tackle using a problem-adapted form of non-negative matrix factorization.

In mammalian genomes, including the human genome, the DNA methylation mark occurs predominantly in the context of CpG dinucleotide sequence motifs (CpGs). While state-of-the-art sequencing-based methods make it possible to obtain the complete DNA methylation profile, cost- and labour-optimized approaches, such as the Infinium 450k and EPIC microarrays, cover a small yet representative subset of CpGs in the human genome (approximately $m = 4.5 \times 10^5$ and $m = 8.5 \times 10^5$, respectively). For each CpG $i$ of the $m$ tagged ones and each subject $j = 1, \dots, n$, Infinium microarrays output two intensity measurements $M_{ij}$ and $U_{ij}$, proportional to the number of DNA molecules in which this CpG is methylated and unmethylated, respectively. We define $N_{ij} = M_{ij} + U_{ij}$ to be the total microarray intensity at each CpG. The data matrix $D \in \mathbb{R}^{m \times n}$ contains the ratios $D_{ij} = M_{ij}/N_{ij}$. We denote by $r$ the total number of latent components (typically associated with cell types). In practice, as a rule, $m \gg n > r$.

Regularized NMF: Let $T \in \mathbb{R}^{m \times r}_+$ be a matrix representing $r$ latent components, or pure methylation profiles, and $A \in \mathbb{R}^{r \times n}_+$ a matrix of their proportions, or mixtures. Our experiments [6] suggest that a linear mixture model is plausible, i.e. $D \approx TA$, with the columns of $T$ affinely independent. As methylation is binary (on or off) at the single-cell level, in previous work [7] we modeled $T$ as a binary matrix in $\{0,1\}^{m \times r}$. While we have shown [7] that such a constraint implies uniqueness of the factorization beyond conditions such as separability [3], it turns out that this constraint is too rigid: even cells of the same cell type can have different methylation at a small fraction of the sites and thus have to be seen as a statistical mixture. In Figure 1 we plot the histogram of measured methylation values of an isolated cell type. While most of the measurements are concentrated close to zero and one, there is a certain fraction of intermediate DNA methylation.
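As an aside, the measurement model just described is easy to simulate. The following minimal sketch (not from the MeDeCom code base; the dimensions, distributions, and noise level are illustrative assumptions) generates Infinium-style data $D = M/(M+U)$ from a near-binary $T$ and simplex-constrained $A$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 1000, 8, 3          # CpGs, samples, latent components (illustrative)

# Near-binary latent profiles T in [0,1]^{m x r}: most entries close to 0 or 1.
T = rng.beta(0.3, 0.3, size=(m, r))

# Mixture proportions A in R_+^{r x n} with columns on the probability simplex.
A = rng.dirichlet(np.ones(r), size=n).T        # shape (r, n), columns sum to 1

# Total intensities N_ij vary over orders of magnitude across CpGs.
N = rng.integers(200, 20000, size=(m, 1)) * np.ones((1, n))

# Methylated counts M_ij ~ Binomial(N_ij, (TA)_ij); the data is D_ij = M_ij / N_ij.
P = np.clip(T @ A, 0.0, 1.0)
M = rng.binomial(N.astype(int), P)
D = M / N

print(D.shape, D.min(), D.max())   # (1000, 8), values in [0, 1]
```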

These observations motivated us, in the current work [6], where the method is called MeDeCom, to relax the constraint to $T \in [0,1]^{m \times r}$ and to solve the following regularized NMF problem:

$$
\begin{aligned}
\min_{T \in \mathbb{R}^{m \times r},\ A \in \mathbb{R}^{r \times n}} \quad & \frac{1}{2}\, \| D - TA \|_F^2 \;+\; \lambda \sum_{i=1}^{m} \sum_{j=1}^{r} T_{ij}\,(1 - T_{ij}) \\
\text{subject to} \quad & T \in [0,1]^{m \times r}, \quad A \in \mathbb{R}^{r \times n}_+, \quad \sum_{i=1}^{r} A_{ij} = 1, \quad j = 1, \dots, n.
\end{aligned}
\tag{1}
$$

The constraint on $A$ is added so that one can interpret $A_{ij}$ as the proportion of latent component $i$ in sample $j$. As we illustrate below, the factorization without the regularizer is highly non-unique, since in this application we are far away from the separability condition. However, we know from the properties of single cells that most methylation sites of a given cell type should be close to 0 or 1, and the non-convex regularization term enforces this property on the entries of $T$. As Figure 2 shows, the regularizer is the key to more accurate estimation of both the matrix $T$ and the proportions $A$, as it biases the estimated latent components $T$ towards 0 or 1. Note that the factorization of the data (blue dots) is highly non-unique, as there exist infinitely many solutions with zero fit; the strong prior resolves this non-uniqueness. This example, although artificial, demonstrates a biologically relevant extreme hard case, where the proportions of the cell types vary only very little across samples and the methylation profiles of the identified cell types in $T$ are highly correlated, or, geometrically speaking, where the observations are compactly concentrated inside the simplex spanned by the columns of $T$ and far away from some or all of its vertices and facets (which is exactly the opposite of the separability condition [3]). In instances like the one in Figure 2 we outperform [5], which is essentially based on standard NMF, by a large margin.

Figure 1: Histogram of $T$ values.

Figure 2: Hard toy case: observations are deep inside the simplex and away from its boundaries; $m = 2$, $r = 3$.

Contribution: In this workshop submission we build upon [6] and address two issues: we study families of regularizers which enforce the entries of $T$ to be close to zero or one, and we examine how the highly varying number of cells measured at each site should influence the loss used in the NMF model. We solve the optimization problem (1) (and the variants discussed below) via alternating optimization over $T$ and $A$, where we use DCA [8] for the non-convex problem in $T$.
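As a rough illustration of this alternating scheme (a minimal sketch, not the MeDeCom implementation; the step size and the use of a single projected-gradient step per surrogate are simplifying assumptions), note that the penalty $\lambda \sum_{ij} T_{ij}(1 - T_{ij})$ is concave in $T$. A DCA step therefore replaces it by its tangent at the current iterate, leaving a convex box-constrained problem; the $A$-update is a gradient step on the fit projected column-wise onto the simplex:

```python
import numpy as np

def project_simplex_columns(A):
    """Euclidean projection of each column of A onto the probability simplex."""
    r, n = A.shape
    U = -np.sort(-A, axis=0)                       # columns sorted descending
    css = np.cumsum(U, axis=0) - 1.0
    ks = np.arange(1, r + 1).reshape(-1, 1)
    cond = U - css / ks > 0
    rho = np.count_nonzero(cond, axis=0)           # last index with positive gap
    tau = css[rho - 1, np.arange(n)] / rho
    return np.maximum(A - tau, 0.0)

def alternating_step(D, T, A, lam, lr=1e-3):
    """One sweep: DCA-linearized projected-gradient update of T, then of A."""
    # DCA: the concave penalty lam * sum T(1-T) is linearized at the current T,
    # contributing the constant gradient lam * (1 - 2T) to the convex surrogate.
    G_T = (T @ A - D) @ A.T + lam * (1.0 - 2.0 * T)
    T = np.clip(T - lr * G_T, 0.0, 1.0)            # box constraint T in [0,1]
    # A-update: gradient step on the fit, then projection onto the simplex.
    G_A = T.T @ (T @ A - D)
    A = project_simplex_columns(A - lr * G_A)
    return T, A
```

In the actual method the convex subproblems would be solved more carefully; the sketch only shows where the DCA linearization of $T(1-T)$ enters.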

Regularizer: In a number of experiments we solved problem (1) and compared different regularizers:

1. Regularizers enforcing a bias towards $\{0,1\}$, represented by a function family $S_{\alpha,\beta}$ of concave two-piece cubic splines with a junction at $1/2$, such that $S_{\alpha,\beta} = \{\, s : [0,1] \to \mathbb{R} \mid s(0) = s(1) = 0,\ s(1/2) = 1/4,\ s'(1/2) = 0,\ s'(0) = \alpha,\ s'(1) = \beta \,\}$, where $\alpha, \beta$ are user-defined parameters controlling the slopes at 0 and 1. The constraint $s(1/2) = 1/4$ makes regularizers from $S_{\alpha,\beta}$ comparable with the quadratic penalty in (1), which also attains the value $1/4$ at $1/2$. It is worth noting that once $\alpha$ and $\beta$ are specified, any regularizer $s \in S_{\alpha,\beta}$ is uniquely defined (four parameters per cubic piece, with $2+2$ equalities delivered by the values and derivatives at the endpoints); see the sketch after this list.

2. Regularizers derived from a prior based on the distribution of the methylation values in isolated cell types. Here we use the known correspondence between MAP estimation under a prior and a corresponding regularizer, and fit a parametric model to the measured distribution. In our experiments the prior distribution was first fit using kernel methods; the values of the negative log of the density estimate were then further fit with polynomials of different degrees, and the resulting functions were used as regularizers.

We observed that the sensitivity of the regularization parameter, selected by cross-validation, is influenced by the choice of the regularizer. In the experiments, however, we concentrate on the effect of the loss, as it turns out that different regularizers yield similar results provided that the grid of λ-values is fine enough.
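Since the four endpoint conditions per piece determine each cubic exactly, a spline in $S_{\alpha,\beta}$ can be recovered by solving two small linear systems. The following sketch (an illustration of the stated constraints, not code from the paper) computes and evaluates both cubic pieces for given $\alpha$ and $\beta$:

```python
import numpy as np

def spline_pieces(alpha, beta):
    """Coefficients (a0..a3) of the two cubic pieces of s in S_{alpha,beta}."""
    def cubic_coeffs(x0, v0, d0, x1, v1, d1):
        # Solve for p(t) = a0 + a1*t + a2*t^2 + a3*t^3 with
        # p(x0)=v0, p'(x0)=d0, p(x1)=v1, p'(x1)=d1.
        M = np.array([
            [1, x0, x0**2,   x0**3],
            [0, 1,  2*x0, 3*x0**2],
            [1, x1, x1**2,   x1**3],
            [0, 1,  2*x1, 3*x1**2],
        ], dtype=float)
        return np.linalg.solve(M, np.array([v0, d0, v1, d1], dtype=float))

    left  = cubic_coeffs(0.0, 0.0, alpha, 0.5, 0.25, 0.0)  # s(0)=0,  s'(0)=alpha
    right = cubic_coeffs(0.5, 0.25, 0.0, 1.0, 0.0, beta)   # s(1)=0,  s'(1)=beta
    return left, right

def s_eval(t, left, right):
    """Evaluate the two-piece spline at t in [0,1]."""
    c = np.where(t < 0.5, 0, 1)
    coeffs = np.stack([left, right])[c]
    powers = np.stack([np.ones_like(t), t, t**2, t**3], axis=-1)
    return np.sum(coeffs * powers, axis=-1)
```

As a sanity check, for $\alpha = 1$, $\beta = -1$ both pieces reduce to $t - t^2$, recovering exactly the quadratic penalty $T(1-T)$ from (1).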
Modified Loss for NMF: The squared loss [SL] in (1) corresponds to a Gaussian noise model. However, as the data matrix consists of ratios of discrete measurements, this noise model might not be appropriate. In particular, the squared loss does not take into account that the number of measured cells $N_{ij}$ for each site $i$ and sample $j$ varies by orders of magnitude (a consequence of the measurement technique of the Infinium HumanMethylation450 BeadChip), and it also ignores the discrete nature of the measurements. Recall that $M_{ij}$ is the number of methylated cells in $N_{ij}$ measurements and that each measurement is performed for each cell with the same method. We therefore interpret $D_{ij}$ as the sample mean of $N_{ij}$ independent Bernoulli trials, with sample variance estimate $\hat\sigma_{ij}^2 = D_{ij}(1 - D_{ij})/N_{ij}$. Since $N_{ij}$ is relatively large, $\hat\sigma_{ij}^2$ is a good estimate of the true variance $\sigma_{ij}^2$. Furthermore, for large $N_{ij}$ the distribution of the sample mean $D_{ij}$ is well approximated by a Gaussian distribution, and we thus arrive at the noise model $D_{ij} \sim \mathcal{N}\big(\sum_{k=1}^{r} T_{ik} A_{kj},\ \sigma_{ij}^2\big)$, $i = 1, \dots, m$, $j = 1, \dots, n$. The new loss (the negative log-likelihood of this noise distribution) is a weighted squared loss [WSL] given by $\|\Lambda \odot (D - TA)\|_F^2$, where $\Lambda_{ij} = \hat\sigma_{ij}^{-1}$ and $\odot$ is the Hadamard product. To avoid the numerical problems caused by very small variance estimates, we truncate $\Lambda$ at the 0.95 quantile of the distribution of its values. Below we discuss the effect of this adapted loss compared to the standard squared loss.
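The weights admit a direct transcription (a sketch under the assumptions above; the truncation follows the text, while the small-variance guard is an added illustrative safeguard):

```python
import numpy as np

def wsl_weights(D, N, q=0.95):
    """Per-entry weights Lambda_ij = 1/sigma_hat_ij, truncated at the q-quantile."""
    var = D * (1.0 - D) / N              # Bernoulli sample-mean variance estimate
    var = np.maximum(var, 1e-12)         # guard against exact 0/1 ratios
    lam = 1.0 / np.sqrt(var)             # Lambda_ij = sigma_hat_ij^{-1}
    return np.minimum(lam, np.quantile(lam, q))   # cap the very large weights

def weighted_squared_loss(D, T, A, lam):
    """WSL fit term ||Lambda ⊙ (D - TA)||_F^2."""
    R = lam * (D - T @ A)                # Hadamard product with the residual
    return np.sum(R * R)
```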

Real data experiment: We use the well-known annotated data from the study [4], in which brain cell nuclei from a single brain sample were separated for the neuron-specific marker NeuN using fluorescence-activated cell sorting (FACS), and the obtained NeuN+ (neuronal) and NeuN− (non-neuronal) fractions were subsequently mixed in NeuN+ proportions ranging from 0.0 to 1.0 in steps of 0.1. The mixtures were then profiled using the Infinium 450k array. We use this dataset to uncover the source NeuN+ and NeuN− methylomes and their mixing proportions. In order to make the problem harder and more realistic, we only use the five mixtures corresponding to proportions 0.3 to 0.7. While the mixture ground truth is known, the reference profiles are not available, and thus we use $T_{ref}$, the average of reference profiles from 30 different patients; our estimated error in the $T$ matrix is therefore only an approximation.

The Infinium 450k microarray includes slightly more than 450,000 primer extension assays of two different design types, known as type I and type II probes. Methylation levels from the two probe types were shown to have different properties [2]. In order to exclude this bias from the data, the analysis is performed on the homogeneous subset of type I probes (comprising approximately one third of the 450k array). CpGs with probes that overlap annotated SNP positions along the complete probe sequence (dbSNP132 entries with MAF > 0.05, as defined in the RnBeads hg19 annotation) were also discarded, to eliminate the confounding genetic variability. This reduced the data to the raw type I matrix $D$. Furthermore, in a refined setting several additional layers of quality filtering were applied: each methylation value was required to be supported by at least 5 Infinium beads, and CpGs with extreme intensities (below the 0.1 and above the 0.95 quantile of all intensity values) were removed as potentially erroneous; this yields the filtered matrix $D'$. Both matrices $D$ and $D'$ are factorized in order to assess the robustness of the different loss functions with respect to noise.

We use a leave-one-out cross-validation scheme across samples to determine the optimal regularization parameter. The set of candidate λ-values is $\{0\} \cup \{\alpha \cdot 10^{k} \mid \alpha \in \{1, 2.5, 5, 7.5\},\ k \in \{-5, -4, -3, -2\}\}$. For the weighted squared loss, λ was further scaled by the median of the distribution of the $\Lambda$ values, in order to make the ranges of the employed regularization parameters comparable.
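The λ-grid and the leave-one-out scheme can be written down directly; in the sketch below, `fit_medecom` and `predict_error` are hypothetical stand-ins for a solver of problem (1) and a held-out scoring function, not real functions from the method's code:

```python
import numpy as np
from itertools import product

# Candidate grid {0} ∪ {alpha * 10^k : alpha in {1, 2.5, 5, 7.5}, k in {-5,...,-2}}.
lambdas = [0.0] + [a * 10.0**k for a, k in product([1, 2.5, 5, 7.5], range(-5, -1))]

def loo_cv_error(D, r, lam, fit_medecom, predict_error):
    """Mean held-out error of a given lambda under leave-one-out CV across samples."""
    n = D.shape[1]
    errs = []
    for j in range(n):
        train = np.delete(D, j, axis=1)          # drop sample j
        T = fit_medecom(train, r, lam)           # hypothetical solver for (1)
        errs.append(predict_error(T, D[:, j]))   # e.g. best simplex fit of column j
    return float(np.mean(errs))

# best_lam = min(lambdas, key=lambda lam: loo_cv_error(D, r, lam, fit, err))
```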

Figure 3 shows how the estimate $\hat A$ of the true mixture matrix $A$ is affected by the choice of the loss; Table 1 provides a quantitative evaluation.

Figure 3: $A$ (ground truth) versus the estimates $\hat A$ returned by the different variants of the regularized algorithm after matching with $T_{ref}$ (NeuN+ proportion per sample; legend: ground truth; SL for raw type I, λ = 1e−3; WSL for raw type I; SL for filtered type I, λ = 5e−4; WSL for filtered type I). The NeuN+ proportions of all $\hat A$ matrices are shown in the ascending order of those of $A$.

As we can see, SL underestimates the NeuN+ proportions for all samples on the noisy raw type I data, whereas WSL is not seriously affected by the noise and provides a reasonable estimate of the proportions. The reason for the failure of SL in this case is that the cross-validation routine selects the wrong parameter, which might be due to a noisier error measure. Both SL and WSL produce very good estimates on the filtered data (mean errors of 2.5% and 1.3%, respectively), where WSL again slightly outperforms SL. This experiment thus shows that integrating the information about the total intensity level and the per-CpG variance leads both to more robustness against noise and to a more accurate estimation of the proportions.

We also compared the $\hat T$ estimates against $T_{ref}$. However, as we do not have ground-truth information on $T$ but only the average of the reference profiles $T_{ref}$, this comparison should be seen rather as a sanity check. The results are summarized in Table 1. On average the estimation of the matrix $T$ is good in all cases, and, as expected, both WSL and SL estimate $T$ better on the filtered data. The large maximal absolute errors can be explained by the fact that $T_{ref}$ is not true ground truth and is thus correct only on average, not at the level of single CpG sites.

Table 1: Comparison of the estimates $(\hat T, \hat A)$ against $A$ and $T_{ref}$ on the raw and filtered type I datasets: for each of SL and WSL on the raw and filtered type I data, the mean absolute errors $\frac{1}{rn}\|\hat A - A\|_1$ and $\frac{1}{rm}\|\hat T - T_{ref}\|_1$ are reported together with the corresponding maximal absolute errors.

Conclusions: The Infinium microarray is an efficient tool for measuring DNA methylation at genome scale, yet the data it produces are contaminated by measurement noise and by biases of various nature and origin. We addressed these issues by assessing the importance of biologically motivated regularizers and of adequate noise modeling. In future work we will further validate the good results of WSL on the neural data [4] using additional datasets.

References

[1] C. Bock. Analysing and interpreting DNA methylation data. Nature Reviews Genetics, 13:705–719, 2012.
[2] S. Dedeurwaerder, M. Defrance, M. Bizet, E. Calonne, G. Bontempi, and F. Fuks. A comprehensive overview of Infinium HumanMethylation450 data processing. Briefings in Bioinformatics, 15(6):929–941, 2014.
[3] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In NIPS, 2003.
[4] J. Guintivano, M. J. Aryee, and Z. A. Kaminsky. A cell epigenotype specific model for the correction of brain cellular heterogeneity bias and its application to age, brain region and major depression. Epigenetics, 8(3), 2013.
[5] E. A. Houseman, M. L. Kile, D. C. Christiani, T. A. Ince, K. T. Kelsey, and C. J. Marsit. Reference-free deconvolution of DNA methylation data and mediation by cell composition effects. BMC Bioinformatics, 17(259), 2016.
[6] P. Lutsik, M. Slawski, G. Gasparoni, M. Hein, and J. Walter. MeDeCom discovers and quantifies latent components of heterogeneous methylomes. Submitted.
[7] M. Slawski, M. Hein, and P. Lutsik. Matrix factorization with binary components. In NIPS, 2013.
[8] P. D. Tao and L. T. H. An. Difference of convex functions optimization algorithms (DCA) for globally minimizing nonconvex quadratic forms on Euclidean balls and spheres. Operations Research Letters, 19(5):207–216, 1996.
