
Visualization methods and support vector machines as tools for determining markers in models of heterogeneous populations: Proapoptotic signaling as a case study

J. Hasenauer (a), J. Heinrich (b), M. Doszczak (c), P. Scheurich (c), D. Weiskopf (b), F. Allgöwer (a)

Stuttgart, June 2011

(a) Institute of Systems Theory and Automatic Control, University of Stuttgart, Pfaffenwaldring 9, Stuttgart, Germany, {jan.hasenauer,frank.allgower}@ist.uni-stuttgart.de
(b) Visualization Research Center, University of Stuttgart, Allmandring 19, Stuttgart, Germany, {julian.heinrich,daniel.weiskopf}@visus.uni-stuttgart.de
(c) Institute of Cell Biology and Immunology, University of Stuttgart, Allmandring 31, Stuttgart, Germany, {peter.scheurich,malgorzata.doszczak}@izi.uni-stuttgart.de

This work has been presented at the 8th Workshop for Computational Systems Biology (WCSB 2011), 6-8 July.

Please cite this article as: J. Hasenauer, J. Heinrich, M. Doszczak, P. Scheurich, D. Weiskopf, and F. Allgöwer. Visualization methods and support vector machines as tools for determining markers in models of heterogeneous populations: Proapoptotic signaling as a case study. In Proc. of 8th Workshop for Computational Systems Biology (WCSB 2011), Zürich, Switzerland, pages 61-64.

Abstract

In recent years, cell population models have become very common, as they allow for the study of population heterogeneity. Unfortunately, the complexity of population models so far has prevented the development of analysis tools permitting an in-depth study of the source of heterogeneity, like genetics and epigenetics. In this paper we propose an explorative analysis combining parallel-coordinate plots and nonlinear support vector machines to determine the main sources of cell-to-cell variability within decision processes in heterogeneous cell populations. The approach is applied to analyze proapoptotic signal transduction in cancer cell populations and to determine decision markers.
Keywords: visual analytics, parallel coordinates, support vector machines, cell population

Postprint Series Issue No.
Stuttgart Research Centre for Simulation Technology (SRC SimTech), SimTech Cluster of Excellence, Pfaffenwaldring 7a, Stuttgart, publications@simtech.uni-stuttgart.de

1 INTRODUCTION

Models of intracellular signaling pathways are becoming increasingly complex. Most of the commonly used quantitative models have tens, sometimes hundreds, of states and parameters. Due to this complexity, understanding the models and the important elements of the pathway is challenging. This is a problem particularly in situations where, apart from single-cell dynamics, cell-to-cell variability has to be considered as well [1,2]. In such cases, the complexity of the model often prevents the application of classical analysis tools for dynamical systems. One of these cases is the model-driven identification of markers for decision processes in heterogeneous cell populations.

In this work, we address the question: "Which parameters cause the heterogeneity of the population's response?" We propose the application of two methods widely used in data analysis: parallel-coordinate plots and nonlinear support vector machines (SVMs). The former method provides an easy tool to obtain qualitative understanding, whereas the latter allows for assessing the performance of decision marker combinations quantitatively. Good decision markers are parameters that allow for a good prediction of the decision outcome of an individual cell.

The paper is structured as follows: In Section 2, the considered system class and problem are described in mathematical terms. The general idea is discussed in Section 3, where the application of parallel-coordinate plots and SVMs is outlined. Section 4 applies the proposed method to a model of the caspase cascade. The results are summarized in Section 5.

2 PROBLEM DESCRIPTION

In this paper, heterogeneous cell populations are studied. The population dynamics are described using a cell ensemble model [1,2], which is a collection of $N$ cells,

$$\Sigma_{\mathrm{pop}} = \left\{ \Sigma(\theta^{(i)}) \mid i \in \{1, \ldots, N\},\ \theta^{(i)} \sim \Theta \right\}. \qquad (1)$$

The signaling process in each individual cell in $\Sigma_{\mathrm{pop}}$ is described by ordinary differential equations,

$$\Sigma(\theta^{(i)}):\quad \dot{x}^{(i)} = f(x^{(i)}, \theta^{(i)}), \quad x^{(i)}(0) = x_0(\theta^{(i)}), \qquad (2)$$

with state vector $x^{(i)}(t) \in \mathbb{R}^n_+$ and parameter vector $\theta^{(i)} \in \mathbb{R}^q_+$. The index $i$ specifies individual cells within the population. The vector field $f : \mathbb{R}^n_+ \times \mathbb{R}^q_+ \to \mathbb{R}^n$ describing the cell dynamics is locally Lipschitz, and the mapping $x_0 : \mathbb{R}^q_+ \to \mathbb{R}^n_+$ is continuously differentiable. The parameters $\theta^{(i)}$ may be kinetic constants, e.g. synthesis, degradation, or reaction rates.

Heterogeneity within the cell ensemble is modeled by parameter values $\theta^{(i)}$ that differ among individual cells. The density of the parameters $\theta^{(i)}$ is given by a probability density function $\Theta : \mathbb{R}^q_+ \to \mathbb{R}_+$. Thus, the probability of obtaining $\theta^{(i)} \in [\theta, \theta + d\theta]$ is $\Theta(\theta)\,d\theta$. This modeling approach is employed in several publications (e.g. [1,2]) and has proven useful for studying short-term dynamic processes, e.g. cellular apoptosis.

In the following, discrete decision processes are considered. To this end, the functional $d : l_1 \to \{-1, +1\}$ is introduced, which maps the single-cell trajectory $x^{(i)}(\cdot)$ to a discrete decision $\delta^{(i)} \in \{-1, +1\}$,

$$\delta^{(i)} = d(x^{(i)}(\cdot)). \qquad (3)$$

This functional is used to evaluate the outcome of the decision process in each cell.

Example: To exemplify the decision functional, we consider the process of cell death. There are two possible decisions a cell can make: either it stays alive, $\delta^{(i)} = +1$, or it dies, $\delta^{(i)} = -1$. In many cases, the cell is assumed to die if a certain protein concentration $x_j$ exceeds a threshold $x_{j,\mathrm{th}}$ [2,3]. This yields the decision functional

$$d(x^{(i)}(\cdot)) := \begin{cases} +1 & \text{if } \max_t x_j^{(i)}(t) \leq x_{j,\mathrm{th}}, \\ -1 & \text{otherwise.} \end{cases} \qquad (4)$$

Note that the response $x^{(i)}(\cdot)$ of a cell implicitly depends on the cell's parameters $\theta^{(i)}$. Furthermore, the parameters are the only difference between two cells. Thus, the decision of a single cell can be viewed as a function of the parameters, $\delta^{(i)} = \delta^{(i)}(\theta^{(i)})$.
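The ensemble model (1)-(2) and the threshold decision functional (4) can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration only: the one-state vector field `f`, the constant initial condition, the threshold value, the log-normal parameter density, and the fixed-step Runge-Kutta integrator. It is not the model or the numerical scheme used in the paper.

```python
import math
import random


def lognormal_params(mean, cv):
    """(mu, sigma) of log(theta) for a log-normal with given mean and coefficient of variation."""
    sigma2 = math.log(1.0 + cv * cv)
    return math.log(mean) - 0.5 * sigma2, math.sqrt(sigma2)


def f(x, theta):
    """Hypothetical toy vector field: production theta[0], degradation theta[1] * x."""
    return [theta[0] - theta[1] * x[0]]


def rk4_step(x, theta, dt):
    """One classical Runge-Kutta step for x' = f(x, theta)."""
    k1 = f(x, theta)
    k2 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k1)], theta)
    k3 = f([xi + 0.5 * dt * ki for xi, ki in zip(x, k2)], theta)
    k4 = f([xi + dt * ki for xi, ki in zip(x, k3)], theta)
    return [xi + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def simulate(theta, x0, t_end=10.0, dt=0.02):
    """Trajectory x(0), x(dt), ..., x(t_end) of a single cell Sigma(theta)."""
    traj = [list(x0)]
    for _ in range(int(round(t_end / dt))):
        traj.append(rk4_step(traj[-1], theta, dt))
    return traj


def decision(traj, j=0, x_th=1.0):
    """Decision functional (4): +1 (alive) if x_j never exceeds x_th, else -1 (dead)."""
    return +1 if max(x[j] for x in traj) <= x_th else -1


def sample_decisions(n_cells, seed=0):
    """Draw theta for each cell, simulate it, and return [(theta, delta), ...]."""
    rng = random.Random(seed)
    mu, sigma = lognormal_params(mean=1.0, cv=0.4)  # assumed parameter density
    sample = []
    for _ in range(n_cells):
        theta = [rng.lognormvariate(mu, sigma), rng.lognormvariate(mu, sigma)]
        sample.append((theta, decision(simulate(theta, x0=[0.0]))))
    return sample


S = sample_decisions(100)
print(sum(1 for _, delta in S if delta == +1), "of", len(S), "toy cells stay alive")
```

Because the cells differ only in `theta`, the list `S` makes the dependence of the decision on the parameters explicit as data, which is the viewpoint taken in the remainder of the paper.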
To understand the heterogeneity within the population response, it is necessary to understand how the decision depends on these parameters. In particular, the question arises which parameters

$$\theta_m := [\theta_{m_1}, \ldots, \theta_{m_r}]^T, \quad \forall j: m_j \in \{1, \ldots, q\}, \qquad (5)$$

cause and explain most of the heterogeneity. If such parameters $\theta_m$ can be determined, they can be used (i) to predict the outcome of the decision process and (ii) as markers to distinguish between individual cells with different responses.

3 METHODS

3.1 Basic idea

In this contribution, two methods are combined to determine decision markers, $\theta_m$, for models of heterogeneous cell populations. The proposed methods are well known, but they are almost exclusively applied to study high-dimensional sets of measurement data. To exploit the methods in the context of population model analysis, the cell ensemble is first simulated for $N \gg 1$, yielding many pairs of parameters and trajectories,

$$\left( \theta^{(i)}, x^{(i)}(\cdot) \right), \quad i = 1, \ldots, N. \qquad (6)$$

Given these pairs, a sample of cell decisions is constructed,

$$S = \left\{ (\theta^{(1)}, \delta^{(1)}), \ldots, (\theta^{(N)}, \delta^{(N)}) \right\}, \quad \text{with } \delta^{(i)} = d(x^{(i)}(\cdot)).$$

This sample contains much information about the dependency of the decision $\delta$ on the parameters $\theta$. In the following, the sample $S$ is considered as a dataset, and the dependency $\delta = \delta(\theta)$ is analyzed using visualization techniques and SVMs. The integration of interactive visualization with automated methods (such as SVMs) closely follows the visual analytics paradigm [4] for gaining insight from complex data. In this paper, parallel coordinates [5] are employed to visually select the most promising dimensions, which are then used to train an SVM.

3.2 Parallel coordinates for marker selection

Parallel coordinates [5] are constructed by placing axes in parallel within a 2-D Cartesian embedding. An N-dimensional data point is then represented by a polyline intersecting the axes at the respective values. While parallel coordinates are widely used to identify patterns or trends in high-dimensional data, they suffer greatly from overplotting if many lines have to be drawn.
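The polyline construction described above can be sketched as follows. The sketch also accumulates all polylines into a grid of counts, a crude stand-in for additive blending that shows how overplotted lines can be turned into a density. The function names, the per-axis min-max scaling, and the raster resolution are assumptions for illustration; this is not the continuous parallel coordinates method of [6].

```python
def normalize_columns(data):
    """Scale every dimension (column) of the data linearly to [0, 1]."""
    lows = [min(col) for col in zip(*data)]
    highs = [max(col) for col in zip(*data)]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.5
             for v, lo, hi in zip(row, lows, highs)] for row in data]


def polyline(row):
    """Vertices (axis index, height on that axis) of one normalized data point."""
    return [(k, v) for k, v in enumerate(row)]


def density_raster(rows, height=50, cols_per_gap=20):
    """Accumulate all polylines additively into a height x width grid of counts."""
    n_axes = len(rows[0])
    width = (n_axes - 1) * cols_per_gap + 1
    grid = [[0] * width for _ in range(height)]
    for row in rows:
        for col in range(width):
            k, c = divmod(col, cols_per_gap)
            if k == n_axes - 1:
                # The last raster column lies exactly on the last axis.
                y = row[k]
            else:
                # Linear interpolation between axis k and axis k + 1.
                t = c / cols_per_gap
                y = (1.0 - t) * row[k] + t * row[k + 1]
            grid[min(int(y * height), height - 1)][col] += 1
    return grid


# Hypothetical three-parameter data set with four points.
data = [[0.0, 10.0, 5.0], [1.0, 20.0, 4.0], [2.0, 30.0, 3.0], [1.0, 20.0, 4.0]]
rows = normalize_columns(data)
print(polyline(rows[0]))
grid = density_raster(rows, height=10, cols_per_gap=4)
print(max(max(line) for line in grid), "polylines overlap in the densest cell")
```

Building one such raster per class and coloring them differently gives a class-aware density image in the spirit of the plots used in this paper.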
Instead of rendering opaque lines, continuous parallel coordinates [6] estimate the (line) density of the resulting image from the sample $S$. In this work, a pointwise density estimate is obtained using additive alpha-blending. To indicate the class membership of a sample member in parallel coordinates, we use a different color for each class. Combining density estimate and colors, good markers $\theta_m$ can be determined visually. They correspond to coordinate axes on which the different colors (classes) are well separated.

3.3 SVMs to quantify marker properties

Given a qualitative understanding of the importance of the parameters and a selection of potential markers $\theta_m$, a quantitative assessment of the classification power of $\theta_m$ is necessary. To obtain this, the sample $S$ is used to learn a nonlinear SVM. This is a two-step process illustrated in Figure 1. First, a mapping $\Phi : \mathbb{R}^r \to \mathbb{R}^{\tilde{r}}$ is constructed that transforms the input space into a feature space of higher dimension ($\tilde{r} > r$). Second, a linear separation of the data is performed in the feature space [7]. To this end, the optimization problem

$$\min_{w, b, \xi}\ \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \delta^{(i)} \left( w^T \Phi(\theta_m^{(i)}) + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, N, \qquad (7)$$

is solved, in which $w$ and $b$ denote the normal vector of the separating hyperplane and its offset, respectively. The objective function combines a misclassification penalty, $\sum_{i=1}^{N} \xi_i$, and a margin maximization, $\frac{1}{2} w^T w$. For a detailed introduction to SVMs, we refer to [7]. An application to the study of dynamical systems can be found in [8].

[Figure 1: three panels connected by Step 1 and Step 2: input space (axes $\theta_1$, $\theta_2$), feature space and feature space (axes $\Phi_1(\theta)$, $\Phi_2(\theta)$).]
Fig. 1 Visualization of the SVM approach for separating cells with $\delta^{(i)} = +1$ and $\delta^{(i)} = -1$. Left: Distributed data in the input space. Middle: Sample transformed into the feature space, which allows for better separation. Right: Separation result for the separating hyperplane with normal vector $w$. As a perfect separation is in general not possible, misclassifications exist.

[Figure 2: reaction scheme with the species C8, C8a, C3, C3a, IAP, and CARP, the input "stimulus (e.g. TNF or TRAIL)", the output "cell death", and reactions labeled by their rate constants.]
Fig. 2 Illustration of the proapoptotic signaling pathway [3]. Normal arrows refer to conversion reactions, dashed arrows indicate enzymatic activity, and thick arrows illustrate inputs and outputs of the system.

Given the solution of (7), the percentage of true positive classifications, $TP_m$, and of false positive classifications, $FP_m$, can be evaluated. This is done on a second sample rather than on the training sample $S$, so that overfitting does not distort the assessment. $TP_m$ and $FP_m$ provide information about the predictability of the outcome for $\theta^{(i)}$ using solely $\theta_m^{(i)}$. Thus, the marker quality can be assessed via $TP_m$ and $FP_m$. If a low-dimensional $\theta_m$ exists that provides $TP_m \approx 1$ and $FP_m \approx 0$, the parameters $\theta_m$ dominate the decision process and are good markers. To quantify this effect, the classification performance can be analyzed in ROC space. For details we refer to [9].

Summing up, parallel-coordinate plots allow for an intuitive visual assessment of the dependency of the decision on the parameters and for the selection of potential markers. A quantitative evaluation of the marker quality is possible using SVMs.
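The evaluation of a trained classifier in ROC space can be sketched as follows. The sketch assumes that predictions on a held-out test sample are already available, treats the class +1 (alive) as the positive class, and uses the Euclidean distance to the ideal point (FP = 0, TP = 1) as a scalar score. The function names, the choice of positive class, the distance measure, and the example labels are assumptions for illustration.

```python
import math


def classification_rates(y_true, y_pred, positive=+1):
    """Return (TP rate, FP rate) of predictions y_pred against true labels y_true."""
    pos = [p for t, p in zip(y_true, y_pred) if t == positive]
    neg = [p for t, p in zip(y_true, y_pred) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must occur in the test sample")
    tp = sum(1 for p in pos if p == positive) / len(pos)
    fp = sum(1 for p in neg if p == positive) / len(neg)
    return tp, fp


def roc_distance(tp, fp):
    """Euclidean distance in ROC space to the ideal classifier (FP = 0, TP = 1)."""
    return math.hypot(fp, 1.0 - tp)


# Hypothetical test labels and predictions for two candidate marker sets.
y_true = [+1, +1, +1, +1, -1, -1, -1, -1]
predictions = {
    "markers A": [+1, +1, +1, -1, +1, -1, -1, -1],
    "markers B": [+1, +1, -1, -1, +1, +1, -1, -1],
}
scores = {name: roc_distance(*classification_rates(y_true, y_pred))
          for name, y_pred in predictions.items()}
best = min(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Ranking candidate marker sets by such a score only requires one trained classifier per candidate, so the number of candidates passed on from the visual selection step directly determines the computational cost.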
By combining both methods, the combinatorial explosion related to checking all possible marker combinations using SVMs can be avoided, which greatly reduces the computational complexity.

4 ANALYSIS OF PROAPOPTOTIC SIGNALING

To illustrate what insight can be gained using the proposed methods, proapoptotic signaling is analyzed. Proapoptotic signaling is involved in the process of apoptosis (programmed cell death). Apoptosis is an important physiological process that removes infected, malfunctioning, or no longer needed cells from a multicellular organism. The apoptotic signaling pathways converge at the caspase cascade. In [3], a mathematical model for the signal transduction in a single cell has been proposed. This model is also studied in this paper and is depicted in Figure 2. For details about the model we refer to [3].

[Figure 3: parallel-coordinate density plot with the axes C8a(0), $k_8$, $k_9$, $k_{10}$, $k_{12}$; vertical axis: $\log(\theta_j)$ relative to $E[\log(\theta_j)]$.]
Fig. 3 Parallel-coordinate density plot in which each polyline represents the parameters of a single cell, $\theta^{(i)}$. The color of a polyline encodes whether the cell survived or died. After estimating the line density with additive alpha-blending, a logarithmic colormap is applied. The coordinates $k_8$ and $k_{10}$ show the best separation of colors and hence correspond to potential markers.

[Figure 4: three panels. (a) axes: IAP synthesis and C3 synthesis, $k_{10}$. (b) axes: init. active C8, C8a(0), and CARP synthesis, $k_{12}$. (c) ROC space, axes: false positive and true positive; legend entries: the individual markers C8a(0), $k_8$, $k_9$, $k_{10}$, $k_{12}$, all pairs of these, and the triplet $k_8$, $k_{10}$, and $k_{12}$.]
(a) Classification employing C3 synthesis, $k_{10}$, and IAP synthesis, $k_8$, as markers. For the classification of cell survival: TP = 0.77, FP = 0.11.
(b) Classification employing the initial amount of C8a, C8a(0), and C8 synthesis, $k_8$, as markers. For the classification of cell survival: TP = 0.68, FP = [missing].
(c) Classification performance for different marker combinations $m$ in ROC space. The performance of all individual markers, all marker pairs, and the best marker triplet is shown. Note that an optimal classifier would be in the upper left corner.
Fig. 4 Illustration of the achieved classification (prediction) performance using different marker combinations. Plots (a) and (b) show the prediction (alive or dead) for two marker combinations together with a test sample (alive or dead). Plot (c) depicts the classification performance of different marker combinations in ROC space.

The process of apoptosis induction is known to be heterogeneous [2,3]. Therefore, the single-cell model is extended by introducing differences in parameters:
1) The amounts of caspase 8 (C8), caspase 3 (C3), caspase 8- and 10-associated RING protein (CARP), and inhibitor of apoptosis protein (IAP) are known to differ among cells. To account for this, the synthesis rates ($k_8$, $k_9$, $k_{10}$, and $k_{12}$) in individual cells are assumed to be different. The distribution of $k_8$, $k_9$, $k_{10}$, and $k_{12}$ within the population is modeled as a log-normal distribution, with mean as published in [3] and a coefficient of variation of 0.4 (own unpublished data). The initial conditions of C8, C3, CARP, and IAP are set to their steady-state values.
2) Similar to [3], the activation of the caspase cascade is modeled by a non-zero initial condition of active caspase 8, C8a(0).
In the population, C8a(0) is log-normally distributed with a mean of 4,000 molecules per cell and a coefficient of variation of 0.4. The variation of C8a(0) accounts for variability upstream of the caspase cascade.

The binding affinities and kinetic rates are the same for all cells. The precise values can be found in [3].

Given this heterogeneous population, we analyze which cells undergo apoptosis during the first 12 hours. As an indicator for this, the amount of active caspase 3 (C3a) is used. If more than 5,000 copies of active caspase 3 are present in a cell, this cell is assumed to undergo apoptosis, which yields a decision functional similar to (4). Hence, the question we address is which low-dimensional subsets of the parameters, $\theta = [\mathrm{C8a}(0), k_8, k_9, k_{10}, k_{12}]^T$, are good markers for cell death and survival, respectively.

Parallel coordinates: In Figure 3, a sample $S$ with 100,000 members is visualized in parallel coordinates. The second and fourth parameters ($\theta_m = [k_8, k_{10}]^T$) show a good separation between the classes (orange = dead, blue = alive). Most of the surviving cells have high values of $k_8$ and low values of $k_{10}$, which corresponds to a high IAP expression and a low C3 expression, respectively. Although the other parameters also influence the process, their influence seems to be minor.

SVM: Given the results from the visual analysis, we select $\theta_m = [k_8, k_{10}]^T$ and compute the classification quality using SVMs (with Gaussian kernels). As visible in Figure 4(a), we obtain a good separation (TP = 0.77, FP = 0.11). For comparison, all other combinations of two markers are evaluated and depicted in Figure 4(c). The marker $\theta_m = [k_8, k_{10}]^T$ outperforms all other pairs with respect to the distance to the optimal classifier in ROC space. Some other combinations result in more than 50 % false positive classifications (see e.g. Figure 4(b)). As expected, extending the marker vector, e.g. by adding $k_{12}$, results in further improvement.

This case study shows that parallel-coordinate plots are a suitable tool for easily determining markers. The predictive power of the markers can then be quantified using SVMs. In this example, the markers found agree well with those found in the literature. In particular, the important role of IAP is outlined in several publications. This study suggests that the amount of available C3 is even more important than expected.

5 CONCLUSION

In this paper, a first explorative approach has been presented to determine markers for decision processes in heterogeneous populations. It has been shown that methods used for data analysis can also be employed to gain insight into complex models. In particular, the potential of parallel-coordinate plots and SVMs has been illustrated.

The proposed visual analytics approach has been applied to a cell population model for tumor necrosis factor induced proapoptotic signaling. The markers found are the same as those mentioned in the literature. This provides an additional and so far missing validation of the model and demonstrates the usefulness of our approach.

6 ACKNOWLEDGMENTS

The authors acknowledge financial support from the German Research Foundation within the Cluster of Excellence in Simulation Technology (EXC 310/1) at the University of Stuttgart, from the German Federal Ministry of Education and Research (BMBF) within the FORSYS-Partner program (grant nr A and D), and from the Center Systems Biology at the University of Stuttgart.

References

1. J. Hasenauer, S. Waldherr, N. Radde, M. Doszczak, P. Scheurich, and F. Allgöwer, A maximum likelihood estimator for parameter distributions in heterogeneous cell populations, Procedia Computer Science, vol. 1, no. 1.
2. S. Spencer, S. Gaudet, J. Albeck, J. Burke, and P. Sorger, Non-genetic origins of cell-to-cell variability in TRAIL-induced apoptosis, Nature, vol. 459, no. 7245.
3. T. Eissing, H. Conzelmann, E. Gilles, F. Allgöwer, E. Bullinger, and P. Scheurich, Bistability analyses of a caspase activation model for receptor-induced apoptosis, Journal of Biological Chemistry, vol. 279, no. 35.
4. J. J. Thomas and K. A. Cook, A visual analytics agenda, IEEE Computer Graphics and Applications, vol. 26, no. 1, pp. 10-13.
5. A. Inselberg and B. Dimsdale, Parallel coordinates: a tool for visualizing multi-dimensional geometry, in Proc. of IEEE Visualization, 1990.
6. J. Heinrich and D. Weiskopf, Continuous parallel coordinates, IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 6.
7. O. Ivanciuc, Applications of support vector machines in chemistry, in Reviews in Computational Chemistry, vol. 23, Wiley-VCH, Weinheim.
8. J. Hasenauer, C. Breindl, S. Waldherr, and F. Allgöwer, Approximative classification of regions in parameter spaces of nonlinear ODEs yielding different qualitative behavior, in Proc. IEEE Conference on Decision and Control (CDC 2010), Atlanta, USA, 2010.
9. M. Zweig and G. Campbell, Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine, Clinical Chemistry, vol. 39, no. 8, 1993.
