Non-linear dimensionality reduction analysis of the apoptosis signalling network. College of Science and Engineering School of Informatics

Non-linear dimensionality reduction analysis of the apoptosis signalling network Sergii Ivakhno Thesis College of Science and Engineering School of Informatics Masters of Science August 2007 The University of Edinburgh

Acknowledgments I would like to express my deepest gratitude to my supervisor, Dr. Douglas Armstrong, for his support, guidance, and encouragement throughout my graduate research. My thanks also go to Dr. Douglas Lauffenburger at the Massachusetts Institute of Technology, Biological Engineering Division, where I have completed a sizable portion of this thesis work. My parents, Sergii and Irina, receive my deepest gratitude and love for their dedication and the many years of support during my current and previous studies that provided the foundation for this work. I also thank Dr. John Albeck and Dr. Kevin Janes for useful discussions and comments on the thesis manuscript. Finally, I would like to acknowledge the support of several funding bodies that enabled my four-month research visit to MIT: European Molecular Biology Organization (EMBO) short term-fellowship and Scottish International Education Trust (SIET) travel grant. Erasmus Mundus fellowship from European Commission supported my whole period of study and research in the EuMI master program. ii

Abstract Systems wide modelling and analysis of signalling networks is essential for understanding complex cellular behaviours, such as the biphasic responses to different combinations of cytokines and growth factors. For example, tumour necrosis factor (TNF) can act as a proapoptotic or prosurvival factor depending on its concentration, the current state of signalling network and the presence of other cytokines. To understand combinatorial regulation in such systems, new computational approaches are required that can take into account non-linear interactions in signalling networks and provide tools for clustering, visualization and predictive modelling. I extended and applied an unsupervised nonlinear dimensionality reduction approach, Isomap, to find clusters of similar treatment conditions of the apoptosis signalling network in human epithelial cancer cells treated with different combinations of TNF, epidermal growth factor (EGF) and insulin. For the analysis of the apoptosis signalling network I used the Cytokine compendium dataset where activity and concentration of 19 intracellular signalling molecules were measured to characterise apoptotic response to TNF, EGF and insulin. By projecting the original 19-dimensional iii

space of intracellular signals into a low-dimensional space, Isomap was able to reconstruct clusters corresponding to different cytokine treatments that were identified with graph-based clustering. In comparison, Principal Component Analysis (PCA) and Partial Least Squares - Discriminant analysis (PLS-DA) were unable to find biologically meaningful clusters. It is showed that by using Isomap components for supervised classification with k-nearest neighbour (k-nn) and quadratic discriminant analysis (QDA), apoptosis intensity can be predicted for different combinations of TNF, EGF and insulin. Prediction accuracy was highest when early activation time points in the apoptosis signalling network were used to predict apoptosis rates at later time points. To summarize, in this thesis I developed and applied extended Isomap approach for the analysis of cell signalling networks. Potential biological applications of this method include characterization, visualization and clustering of different treatment conditions (i.e. low and high doses of TNF) in terms of changes in intracellular signalling they induce. The material presented in this thesis has been also published in the following two journal articles and presented at the international conference. Ivakhno S, Armstrong JD. Non-linear dimensionality reduction of signalling networks. BMC Syst Biol. 2007 Jun 8;1:27 [full content is given in the appendix]. Ivakhno S.S. From functional genomics to systems biology. FEBS J. 2007 May;274(10):2439-48. Epub 2007 Apr 19. Ivakhno S.S., Lauffenburger D. A., Armstrong J. D. Non-linear dimensionality reduction of the apoptosis signalling network activated by TNF, EGF and Insulin. Poster at the 3rd EMBL Biennial Symposium: From Functional Genomics to Systems Biology 2006, Heidelberg, Germany iv

Declaration I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified. v

Contents Acknowledgments ii Abstract iii Declaration v Chapter 1 Introduction and the structure of the thesis 1 Chapter 2 Literature Survey 5 2.1 Types of biological networks................... 5 2.2 Machine learning methods used for the analysis of biological networks.............................. 8 2.2.1 Gene networks...................... 9 2.2.2 Enhanced gene networks and regulatory networks... 10 2.2.3 Signalling networks: Biological perspective...... 13 2.2.4 Signalling networks: Machine learning approaches... 14 2.3 Systems Biology methodology used for the analysis of biological networks............................. 19 vi

2.4 Dimensionality reduction techniques............... 21 Chapter 3 Dimensionality reduction analysis of apoptosis signalling network 24 3.1 Apoptosis signalling network and the Cytokine data compendium 24 3.2 Main aims of the thesis research................. 28 3.3 Cytokine compendium dataset.................. 30 3.4 Data representation and transformation............. 31 3.5 PCA and PLS-DA could not find low dimensional embedding of apoptosis signalling networks................. 32 Chapter 4 Feature selection and supervised analysis of Cytokine compendium 38 4.1 Features of cytokine clusters found by Isomap......... 38 4.2 Interpretation of components recovered by Isomap....... 41 4.3 Supervised comparison of PCA and Isomap........... 43 Chapter 5 Discussion and conclusions 47 5.1 Extended Isomap can be used for visualization, predictive and descriptive modelling of apoptosis signalling network..... 47 5.2 Factors contributing to Isomap algorithm performance.... 49 5.3 Conclusions............................ 53 Chapter 6 Methods and Algorithms 54 vii

6.1 Cytokine compendium data normalisation and transformation 54 6.2 Extended Isomap approach................... 55 6.2.1 Original Isomap and determination of k nearest neighbours 55 6.2.2 Graph-based clustering in the Isomap components... 56 6.2.3 Interpretation of Isomap projections by neural networks 57 6.2.4 Choice of class labels................... 59 6.2.5 Classifiers comparison.................. 60 6.2.6 PLS Discriminant Analysis................ 61 6.2.7 Choice of Principal components (PC) using cross validation........................... 61 6.2.8 Guide to the software................... 62 Bibliography 63 Appendices 69 Appendix A 70 Appendix B Article 74 viii

Chapter 1 Introduction and the structure of the thesis It is now recognized that the impact of biology in the 21st century will be similar in scope to contributions of physics in the 20th century: from the creation of artificial bacterial systems to produce alternative sources of energy to cheap sequencing and ready interpretation of each person s genome. Such dramatic progress in molecular biosciences in the past 50 years (after discovery of DNA structure) is attributable to technological advances such as creation of new methods for sequencing genomes and other methods for studying cells at the molecular level. The advancement of genomic/post-genomic era and generation of large datasets lead to the convergence of Biology with Information Technologies, Computer Science, Statistics and mathematical modelling and emergence of new disciplines of Bioinformatics and Systems Biology. The execution of body functions comes from coordinated actions of thousands of 1

networks of genes and proteins ever varied and changing in response to the environmental stimuli. Changes in such networks can lead to devastating diseases such as cancer, where through mutations and other mechanisms the normal functions of the part of the network changes and lead to the abnormal cell growth and malignant phenotype. Almost all diseases can be viewed as malfunctioning of biological networks, but our organisms have evolved multiple mechanisms to combat such abnormal changes, i.e. immune system. As a result of advancements in large scale high-throughput experiments, large number of data sets is available in proteomics, genomics and transcriptomics for data analysis and interpretation. By means of statistical genomics individual variation in susceptibility to common diseases can be investigated, whereas Systems Biology modelling allows designing better drugs for cancer treatment [1, 2]. Construction and understanding of biological networks represent the first steps toward modelling cellular activities in time and space. Network biology provides a simple conceptual representation of biological data in terms of interactions between various biological entities and has a nice property of representing complex biological relations indiscernible in large scale data sets in a lucid interpretable way. In this framework, the grand challenge of network biology is to combine different biological data to create predictive model of cellular behaviour. This leads to the realm of statistics, machine learning and data modelling. Machine learning approaches play a significant role in network biology by allowing quick automatic reconstruction and of biological networks 2

from high-throughput datasets and follow-up data analysis. Broadly, the topic of my thesis is the application of machine learning methods to the analysis of biological networks. In particular, the thesis are devoted to the application and modification of a non-linear dimensionality reduction techniques to data on the response of the cell s several biological networks (TNF and cytokine signalling) to the treatment with combination of various compounds with the aim to understand how cells respond to them in terms of signalling network s state rather than observable phenotype of the cell. This might have a practical application particularly for development of cancer drugs that target various proteins in signalling networks that become abnormal. The thesis is structured as follows. In Chapter 1 I start with the description of various biological networks and current literature survey of Networks Biology, which places my work in the broad context of current research directions in the field. This is followed by closer description of signalling networks and a particular network of TNF signalling that was investigated in this work. Chapter 2 begins with the description of the dataset and the pre-processing and normalisation steps. This is followed by general description of non-linear dimensionality reduction approach that was used in this work. The main aim of the thesis was to investigate if nonlinear dimensionality reductions methods can give greater biological incites and understanding of biological processes, than simpler linear dimensionality reduction methods such as Principal Component Analysis (PCA). Consequently, Chapter 3 includes more detailed il- 3

lustration of results and evaluates modified Isomap algorithm in terms of its biological discoveries and already established biological knowledge play the role of the benchmark. While Chapter 3 deals more with biological interpretations, Chapter 4 places them in the broader context of application of dimensionality reduction techniques to biological networks and discusses their usefulness in general. The major theme of this theses is gaining new biological knowledge through the use of advanced machine learning methods. Therefore, the technical details of algorithms and general design of machine learning experiment (statistical analysis, cross-validation), are summarized at the end of the thesis in Chapter 5 to preserve the flow of biologically centred analysis through introduction, results and discussions (Chapters 1-4). Various subsections in the thesis are introduced to make the navigation easier. 4

Chapter 2 Literature Survey 2.1 Types of biological networks There exist a great variety of biological networks that can capture different levels of cellular complexity. These networks differ from each other with respect to the type of high-throughput technology used to generate data (i.e. DNA microarrays, yeast-two hybrid system or LC-MS/MS-based proteomics), approaches to the network reconstruction (automatic network generation or manually created), additional type of information that networks can provide (dynamic flow of information through edges as in ODE models of signalling networks). Below different types of biological networks are listed and their aforementioned properties are briefly discussed. Protein-protein interaction networks. These networks represent either direct or indirect (a part of protein complex) physical interactions be- 5

tween proteins. In most cases protein interaction networks are static and show only small subset of the true biological interactions, although by integrating various data sources it was recently possible to reconstruct dynamics of protein complex formation during cell cycle [3]. Proteinprotein interaction networks can be created directly from the results of high throughput experiments, manually from literature curation or using computational approaches such as text mining. Genetic networks. In genetic networks protein interaction are defined indirectly at the gene level when mutation in one gene affects the phenotypic expressivity of other genes. Genetic networks can be constructed from synthetic lethal screens, where the simultaneous mutations in two genes lead to a lethal phenotype. As Protein-protein interaction networks, genetic networks usually represent static interactions. However, information flow can be readily traced through the network when epistasis analysis is used, in which case temporal order of protein interactions can be established [4]. Genetic networks can be reconstructed manually (for small networks) or using clustering and machine learning algorithms. Gene (expression) networks. Gene networks use DNA microarray data in combination of other techniques to find general relations at the gene level. The basic idea in gene network reconstruction is that of coexpression - if two genes show similar expression profiles, they are supposed to follow the same regulatory regime and are therefore connected. In case of a more elaborate perturbation experiments (i.e. RNAi) direct casual 6

interactions between genes can be inferred. Gene networks can be both dynamic and static and are usually reconstructed using various machine learning approaches. Transcription (regulatory) networks. These networks explicitly represent gene regulation at the transcriptional level by using information on DNA-protein interactions. Transcription networks are always directed and can be cyclic if two transcription factors regulate expression of each other. They can be reconstructed either manually from the literature or automatically by using high-throughput technologies for detecting regulatory proteins that bind to promoter regions (Chip technology) [5]. Signalling network. Signalling networks represent information flow in signal transduction pathways. In case of signalling networks the influences between regulatory molecules (such as protein kinases or phospholipids) are usually directed and dynamical so that information flow can be determined in temporal manner. Most of the signalling networks have been reconstructed by direct analysis of literature [6], although recently developed high-throughput techniques allow automatic determination of signalling network structure using machine learning techniques. Metabolic networks. Metabolic networks represent biochemical reactions pathways within the cell, where the nodes designate enzymes and/or metabolites and edges represent metabolic reactions. Metabolic networks are usually reconstructed manually from databases such as KEGG or 7

biological literature. These networks are directed and show dynamical events of metabolite conversion in biochemical pathway. Developmental networks. Developmental networks represent complex and hierarchical level of gene regulatory circuits during development [7]. They are highly similar to transcription networks in that most regulatory influences are captured at the DNA-protein interactions level [8]. However, these networks have an additional complexity of explicitly representing spatiotemporal regulation. Since many levels of developmental processes (i.e. special and temporal) are represented created manually. 2.2 Machine learning methods used for the analysis of biological networks It is not the purpose of this review to provide comprehensive literature analysis of machine learning methods used for the generation and analysis of biological networks. Rather, my aim in the next section is to provide brief overview of computational analysis approaches to gene and signalling networks. Although the main work of this thesis is devoted to the non linear dimensionality reduction analysis of signalling networks, the material on the gene networks is introduced first since due to the technological ease of generating gene expression data, the field of computational analysis of biological networks is much more established and provides already developed tools that can easily be adapted for the analysis of signalling networks. 8

2.2.1 Gene networks Reverse engineering (automatic reconstruction) of gene networks has become an active area of research since the pioneering work of Friedman et al [9] on reconstructing network of the Saccharomyces cerevisiae cell cycle using Bayesian networks approach. Many different computational techniques have been applied to the problem of reconstruction of gene networks, such as differential equations, Boolean networks and probabilistic graphical modelling approaches. Excellent early reviews on this topic can be found in De Jong [10]. The simplest way to reconstruct gene network is by utilizing simple coexpression network approach based on similarity measures such as correlation. In this case two genes become linked in the network by an edge if they have a similar expression profiles. Correlation networks are easy to interpret and can be accurately estimated even if the number of genes is much larger than the number of samples. One of the most successful applications of Correlation networks is found in Stuart et al. [11]; they built a graph from coexpression microarray data across multiple organisms (humans, flies, worms and yeast) and found that many coexpression relationships are conserved over evolution. The largest limitation of coexpression networks is that they can not distinguish between direct and indirect dependencies, for instance they would not be able to test if the regulation between two genes is mediated by some other third gene. To find such dependencies first order conditional independence models must be applied, in which genes can be regulated through one or few intermediates. Recently Basso et al. [12] built such model based on conditional mutual information as 9

a coexpression measure. The resulting method, called ARACNe, was successfully applied to reverse engineering of gene network from expression profiles of human B cells related to germinal centres. The authors used a wide range of microarray data sources resulting in 336 distinct B cell phenotypes that allow them to achieve high dynamic range of gene expression values and therefore build correlation network with high statistical confidence. Furthermore, by considering in details subnetwork around protooncogene MYC they were able to experimentally verify several proposed gene interactions, showing that new results can be obtained by judicious data integration schemes in combination with appropriate machine learning algorithms. 2.2.2 Enhanced gene networks and regulatory networks The accuracy of the reconstructed gene network can be improved by integrating microarray data with other sources, which can be achieved by appropriate modification of Bayesian networks or the use of other probabilistic graphical models. Such data integration can also lead to the increase in the accuracy of reconstructed protein-protein interaction networks and transcriptional networks. Below I consider data integration with respect to gene and transcriptional networks with protein-protein interaction networks reviewed in later sections. The common type of additional information that can be incorporated into gene networks is the sequence data for promoter elements. This integration is based on assumption that coexpressed genes should share common 10

regulatory regions in their promoter sequence, which can be considered as a promoter module of set active motifs. First approach to data integration based on probabilistic relational models (PRM) was proposed by Segal et al [13]. PRM differ from Bayesian networks in two major respects. First, these graphical models incorporate the notion of objects and relations between them, allowing one to explicitly integrate into the network different data entities: gene expression values, transcription factor binding sites sequences and modular composition of promoter regions. Second, they impose constraints on the kind of interactions that are possible between different objects, for instance gene expression values only depend on the composition of promoter regions. The model itself works as follows: first genes are partitioned into clusters by applying probabilistic clustering algorithm, followed by identification of common motifs in the promoter regions of coclustered genes using discriminative model for motif detection [14]. Then genes are reassigned to the clusters based on the number of motifs in such a way that two genes that share similar motifs in their promoter regions tend to get into the same cluster. Since PRM are the fully probabilistic models, Expectation Maximisation algorithm can be applied to find a set of motifs that regulate particular gene. Therefore, the major goal of applying PMR is to improve detection of motifs and motif clusters in promoter regions rather than to learn structure of gene networks. The probabilistic model gave a two fold improvement for motif detection over standard clustering approaches. Unfortunately, the identity of transcription factors that bind to promoter regions remain unknown with this approach, 11

although can be incorporating additional CHIP-chip data it becomes possible to infer the structure of transcription networks. The approach proposed by Segal has been extended in a number of different ways. Beer and Beer et. al used Bayesian networks to model probabilistic dependence of motif modules on gene expression values and incorporated additional constraints on the distribution of motifs in the promoter region [15]. They have defined a common promoter module for a set of gene such that motifs in the module are assembled by the set of AND, NOT, OR rules and made additional constraints on motif length, orientation, and relative position to the transcription start site (i.e. promoter module combined through the AND rule that incorporates particular motifs at a certain distance from the start site). It should be noted that gene network reconstructed by this approach is necessarily directed and represents influences of promoter module on the set of genes that it regulates. Additional constraints have significantly improved detection of promoter modules and Beer and Tavasoi reported that with this approach gene expression profiles can be reproduced with 73 percent accuracy. Furthermore, by conducting similar analysis for Caenorhabditis elegans they showed that their method of assembling genes into clusters based on the organisation of promoter module can be successfully applied to the detection of regulatory rules in higher eukaryotes. Apart from Bayesian networks approach, a number of regression trees methods have also been applied for motifs detection and reconstruction of regulatory (transcription) networks [16-18]. In the model proposed by Phuong et al for learning regulatory network of yeast cell cycle 12

data, motifs act as predictor variables and are assigned to the internal nodes of decision trees whereas leaf nodes designate gene expression values. During the process of regression tree learning the promoter module is assigned to each gene dynamically based on the tree splits for motifs. 2.2.3 Signalling networks: Biological perspective Having considered current methodology for reconstruction of gene networks I will now concentrate specifically of the application of machine learning methods to signalling networks, starting with the description of signalling networks themselves from biological perspective. Briefly, signalling networks execute the signal transduction process, which refers to any process by which cells converts one kind of signal or stimulus into another, most often involving ordered sequences of biochemical reactions inside the cell, that are carried out by enzymes and linked through second messengers resulting in what is thought of as a second messenger pathway. In many signal transduction processes, the number of proteins and other molecules participating in these events increases as the process emanates from the initial stimulus, resulting in a signal cascade and often results in a relatively small stimulus eliciting a large response. As the signalling networks are censoring and executional devices of the cell, their investigation merits particular attention. Multicellular organisms use a broad repertoire of extracellular signalling molecules for communication between and within cells. The specific binding of an extracellular signalling molecule (ligand) to its receptor on a target cell triggers a 13

specific response. This consists of a series of mutually activating or inhibitory molecular events, called a signal transduction pathway (or signalling pathway). Growth factors are important signalling molecules comprising a large group of secreted proteins. Figure 1 shows a graphical depiction of the general signalling pathway and below I briefly described the series of events comprising signal transduction that is depicted in the figure. (1). Specific growth factor binds with high specificity to a cell surface receptor protein (2). This activates intracellular signal transduction proteins (3) and initiates a cascade of activations of responsive proteins (often by phosphorylation) that act as second messengers (4). Hormones are small signalling molecules (5) that arrive via the bloodstream. They enter the cell either by diffusion or by binding to a cell surface receptor (6). Some hormones bind to an intranuclear receptor (7). Activated transcription factors (8) together with cofactors initiate transcription (9). Prior to transcription, an elaborate system of DNA damage recognition and repair mechanisms (10) checks DNA integrity (cell cycle control, 11). Cell division, as represented in the figure (or other biological process) proceeds if faults in DNA structure have been repaired; if not, the cell is sacrificed by apoptosis (cell death, 12). 2.2.4 Signalling networks: Machine learning approaches A number of machine learning approaches were applied for modelling biological networks. They can be broadly classified into the following categories: Structure-based learning (i.e. Bayesian networks) 14

Figure 1: Schematic outline of the signalling pathway. Various steps of the signal transduction process are explained in the main text. Taken from reference [22] 15

Predictive modelling (regression and classification algorithms) Dimensionality reduction and unsupervised learning (including feature selection) In this and the next few subsections I will briefly describe each approach in turn before focusing on the main theme of the thesis. Both probabilistic graphical modelling and other machine learning algorithms were used for the Structure-based analysis of this type of signalling networks. The differences between reverse engineering approaches for gene vs. signalling networks is attributed to both biological and technological factors. As has been accurately pointed out by Gaudet et al [19] higher complexity of protein chemistry in comparison to DNA and RNA chemistry makes it impossible to develop a single high-throughput technique for studying signalling pathways. As a consequence a combination of various specific assays is required to delineate information flow in signal transduction, which makes data acquisition necessarily slow. Therefore, to date the reconstruction of signalling networks served a rather different purpose than in the case of gene networks - whereas in gene networks the main goal is the discovery of new gene relationships and network structure, in signalling networks the objective is not only accurate reconstruction of network structure, but also the prediction of network dynamic response to various perturbations (regression problem). Another noteworthy difference is the size of reconstructed networks, due to data scarcity for signalling network their size reported in literature has not exceeded 30 proteins. 16

Both Bayesian networks and Causal networks have been used for reconstruction of signal transduction pathways. Using Bayesian networks approach signalling network inference Causal networks (Bayesian network with directed edges allowed) were used to automatically reconstruct a map of human primary T cell signalling based on muliparameter flow cytometry data [20]. Human primary naive CD4+ T cells were exposed to series of stimulatory cues and inhibitory interventions (i.e. G06976 inhibitor of PKC isozymes) followed by flow cytometry measurements of 11 phosphorylated proteins and phospholipids. Unexpectedly, excellent performance has been reported - of the 17 arcs in the predicted network, 15 were expected, all 17 were either expected or reported already in the literature and only 3 were missed. Reconstructed network also dismissed indirect connections that were already explained by other network arcs, thereby producing robust network structure. The authors reported three features that distinguish this dataset from the majority of currently available biological datasets and attributed to the excellent performance of Causal networks approach. First, simultaneous measurement of multiple protein states in individual cells eliminated population-averaging effects that could have otherwise obscured interesting correlations. Second, because the measurements were made on single cells, thousands of data points were collected in each experiment. This feature constitutes a tremendous asset for Bayesian network modelling, because the large number of observations allowed for accurate assessment of underlying probabilistic relationships, and therefore extraction of complex relationships from noisy data. Third, interventional as- 17

says with protein inhibitors/stimulators made it possible to perform causal inference and reconstruct directed graph, which would have been impossible in other experimental settings. Brent et all compared this use of inhibitors in delineating signalling networks structure to epistatsis analysis of biochemical and vesicle transport systems in classical genetics [21]. It is noteworthy, that removing through simulation or additional experiments any of these features significantly decreased the accuracy of reconstructed network; with population averaging (simulated Western Blotting) unexpectedly showing the worst performance. Finally, using RNAi authors confirmed several poorly established regulatory phosphorylations reconstructed by the model. Unfortunately, muliparameter flow cytometry at present can not detect temporal phosphorylation patterns, resulting in few missed feedback loops that could have potentially been found by DBN technique. Bayesian networks approach allow reconstructing signalling networks structure and performing general regression analysis, however without being true regression algorithm they lack many futures available in statistical regression models, such as selection of most predictive protein signals in the network (feature selection problem). As was mentioned previously, the goal in signalling networks analysis is often to find proteins/molecular signals that best determine network response to external cues. An example of such approach is an analysis of combinatorial interactions of several cytokine/growth factors that determine apoptosis/survival cellular response, or similar analysis of other signalling systems exhibiting biphasic activity. To address such 18

problem new experimental approach must be developed to effectively sample activities of molecular signals in the networks. It should offer capability to perform treatment with varying combinations of cytokines/growth factors without loosing interpretability, measure activity of different modules within signalling network and provide quantitative information about the output signals (i.e. apoptosis). Furthermore, the technique should have robust normalisation routines and be amenable to statistical analysis and/or mechanistic modelling. All these requirements have recently been addressed by the study of Gaudet et al, whose main dataset was used for algorithms development and testing in this thesis. 2.3 Systems Biology methodology used for the analysis of biological networks Before proceeding with description of the approach taken by Gaudet et al (Chapter 3), the general principles of regression - based machine learning applications to signalling networks should be outlined in greater depth. The typical approach for studying cellular signalling in such experiments is based on the notion of cues - signals - responses paradigm and involves several steps in experimental design and computational modelling (Figure 2A) [19]. First, one or several signal transduction cascades are chosen depending on the questions that are addressed. For example, the apoptosis signalling network can be selected to study different rates of apoptosis in cancer cells. 19

Since even a small number of signalling cascades may include hundreds of proteins and other signalling molecules, this step also involves selection of the smaller subset of signalling proteins, signals, which are believed to be the most relevant for the regulation of a signalling network (based on the background biological knowledge and availability of appropriate high-throughput technology). As a final step in this design phase, the choice for specific perturbations - cues - is made to induce changes in the information flow through the signalling network and a number of specific cellular responses are assayed to analyze output of the network. For example, in the apoptosis signalling networks, different combinations of cytokines and growth factors can act as cues and assays measuring apoptosis intensity can act as responses. In the second step the activity and concentration of signalling molecules as well as corresponding cellular responses are measured experimentally across different cues/treatment conditions. Possible experimental approaches include western blotting, high-throughput kinase activity assays and protein microarrays [22]. Assays for quantification of cellular responses vary depending on the specific application and may include measurements of cell migration, overall cell integrity or secretion of specific ligands. The third and final step involves data analysis that addresses issues of building predictive and descriptive models of signalling networks. For instance, principal component analysis has been used to find how different cues and treatment conditions are positioned in the low-dimensional subspace of intracellular signals [23]. Alternatively, information on the activity and amount of 20

intracellular signalling molecules was used to build regression model for prediction of cellular responses and selection of the most informative for classification subset of signals (feature selection) [24]. It should be noted that three steps of the systems biology methodology outlined above can be extended in various ways, for example the raise in activity of signalling molecules in response to different cues can be measured along many time points. 2.4 Dimensionality reduction techniques The main purpose of dimensionality reduction techniques in machine learning and statistics is to reduce the number of variables and dimensions with the aim to extract new information (feature extraction), improve interpretability of the data and enhance exploratory data analysis. Dimensionality reduction can be Figure 2 (following page): Features of the Cytokine Compendium and the cues - signals - responses paradigm. A. Cues - signals - responses paradigm for the design and execution of systems biology experiments for the analysis of signalling networks. Cells are exposed to perturbations (cues) and molecular signalling molecules and cellular responses are assayed, followed by application of multivariate statistical and machine learning data analysis techniques. B. Schematic representation of the apoptosis signalling network induced by TNF, EGF and insulin. On the diagram arrows indicate the type of interaction: activation (green), inhibition (red) and slow process (blue). The measured proteins (molecular signals) are highlighted in yellow. Red circles, triangles, and rectangles indicate kinase assay, antibody array, and western blotting measurements, respectively. The decomposition of the network into different pathways is shown by different colours (reproduced with modifications with permission from [5]). C. Distinction between one-to-one and many-to-one mapping functions between signals and responses. In the former case cellular response is measured for each time point of a particular treatment condition (cue), in the later - one cellular response is measured after several time points of the treatment condition. 21

A JNK B Cytokines Growth Factors Adhesion molecules AKT p38 ERK NF-kB Cell migration Apoptosis Proliferation Cues Signals Responses Caspase Activation Pathway Bid Bax CytoC Apaf Casp. 9 Mitochondrial Pathway Bcl-2 Bcl-xl SubG1 DNA content Bad TRADD FADD C P Casp. 8 P Casp. 3 Caspase-3 & cytokeratin cleavage TNF TRAF2 RIP T N F R 1 NFkB Pathway claps TRADD MAP3K IKK lkb NFkB NFkB genes XIAP PS exposure Grb 2 Sos p38 MK2 HSP27 L- FLIP p38 pathway Membrane permeabilization S- FLIP RAF S MEK ERK Rsk EGF GAP Shc ERK pathway AP-1 AP-1 genes P ase E G F R PI3K Grb2 Sos Ras Y S S Akt JNK1 Rac MEKK1 JNKK P ase S FKHR AFX Ins JNK pathway IR Shc Grb2 Sos PTEN Y S IRS1 PI3K Ptdlns PDK1 S6K mtor Rheb TSC2 TSC1 Forkhead genes Akt pathway Kinase assay Immunoblot Antibody array C: cleaved P: pro S: phospho-s Y: phospho-y Activation inactivation Slow C time time responses cues responses cues signals signals Figure 1 one-to-one mapping many-to-one mapping 22

used on its own or in combination with other methods. Broadly, dimensionality reduction algorithms are subdivided into linear and non-linear dimensionality reduction. Linear methods assume that data come from some linear subspace, whereas non-linear methods allow presence of non-linear manifold. The classical techniques for linear dimensionality reduction, such as principal component analysis (PCA) and multidimensional scaling (MDS), are simple to implement, efficiently computable, and guaranteed to discover the true structure of data lying on or near a linear subspace of the high-dimensional input space (13). PCA finds a low-dimensional embedding of the data points that best preserves their variance as measured in the high-dimensional input space. Classical MDS finds an embedding that preserves the interpoint distances, equivalent to PCA when those distances are Euclidean. However, many data sets contain essential nonlinear structures that are invisible to PCA and MDS, for instance rotated face images of the same individual with intrinsic 3-dimensional degree of freedom [26]. A number of non-linear dimensionality reduction methods, such as Isomap [26], have been developed to address these challenges (detailed description of the Isomap approach to dimensionality reduction is provided in the methods section of the thesis). 23

Chapter 3 Dimensionality reduction analysis of apoptosis signalling network 3.1 Apoptosis signalling network and the Cytokine data compendium In the light of the general introduction to the signalling networks and systems biology approach of studying them described above, I will now discuss specific signalling network that was analyzed in this thesis. In this study I investigated the relationship between cues, signals and responses using apoptosis signalling network in human adenocarcinoma cells. To analyze the apoptosis network I considered a previously published protein signalling dataset known as the 24

Cytokine compendium first described by Gaudet et al [19], for which quantitative western blotting, high-throughput protein kinase assays and protein microarrays were used to investigate the combinatorial effect of tumour necrosis factor (TNF), epidermal growth factor (EGF) and insulin on apoptosis of human adenocarcinoma cells. Following is the brief description of experimental design and the major questions addressed in that study. To investigate how TNF-related signalling networks coordinate cellular responses, they have established multi-input system where HT-29 colon cancer cells were stimulated with combination of TNF, EGF, and insulin [19]. Since the components of intracellular protein network downstream of these cytokine inputs are understood in reasonable detail, they chose 19 key molecular signals - receptor, kinase, caspase, and adaptor proteins within signalling networks for activity measurements, thereby addressing the issue of sufficient signalling nodes coverage. Multiplex kinase assays, antibody microarrays and quantitative immunoblots have been used to permit reliable identification of protein activity [22]. With nine distinct pair wise combinations of TNF, EGF, and insulin they sampled each molecular signal in triplicate at 13 time points between 0 and 24 hours to compile 7980 distinct molecular signals from the shared intracellular network. The whole dataset was called Compendium of Signals and Responses Triggered by Prodeath and Prosurvival Cytokines (hereafter referred to as the compendium).to test whether apoptosis could be connected to the measured signals in the network, cell-death phenotype was assayed for each combinatorial cytokine stimulus using four distinct apoptosis assays. Together, 25

these output measurements constituted an apoptotic signature that characterized early (phosphatidylserine exposure), middle (caspase substrate cleavage and membrane permeability), and late (nuclear fragmentation) responses of apoptosis. With this cytokine compendium Gaudet et al investigated how individual molecular signals in combination could predict apoptotic response. This can be viewed as a typical multivariate regression problem where independent/predictor variables (molecular signals) regress against dependent variables (apoptotic outputs). Feature selection methods can be applied in this framework to determine which subset of original features in combination can best predict apoptotic responses. Since it was unclear which features of dynamically sampled molecular signal could be responsible for apoptotic outcome (i.e. kinase activity at various time points or just maximal activity), Janes et al defined an additional panel of time dependent signalling metrics such as peak of kinase activity or total area under the temporal profile (i.e. creation of new variables). In total they obtained 660 independent variables/molecular signals and 12 dependent variables/apoptotic outputs. Having temporal measurements resulted in high redundancy of the dataset and requirement to apply feature selection techniques. The statistical approach that they chose was partial least-squares (PLS) regression [23]. It differs from multivariate regression in that later uses subset selection approach for pooling predictive variables that depends on the order in which they are presented (greedy selection). On the other hand, PLS regression is more suitable for the cases where large number of correlated variables should be combined into small set of linear 26

combinations (dimensions). In this respect it is similar to PCA, but PLS regression also seeks directions that have high correlation with response (and not merely high variance). By applying PLS regression Janes et al were able to project original data set (660 dimensions) into the 3-dimensional/ principal components set of feature combinations. Resulting apoptosis model was able to predict output of four apoptosis metrics with 94 percent on cross validation dataset. Further analysis of this model gave interesting discoveries and observations [24]. For instance, the model was able to predict apoptosis outputs to new growth factor treatments that were not available in the original dataset with 90 percent accuracy. Furthermore, fist two principal components defining the optimal two-dimensional slice through the signalling data represented opposing apoptosis and survival signalling axes (apoptosis basis set). The first principal component oriented toward stress and apoptotic pathways whereas second appeared to constitute a global survival signal. These two-dimensional signalling axes for instance revealed that the same molecule (i.e. IKK) can convey either pro- or antiapoptotic messages depending on the timing and mechanism of activation. It is noteworthy, that these derived metrics were essential for constructing predictive PLS regression models. PLS regression treats each time point independently and derived metrics were therefore the primary means by which time dependences were encoded [25]. However, it may not always be possible to apply a fully-supervised dimensionality reduction approach to study how changes in the activity of signalling network influence cellular responses. For instance, Janes et al [24] 27

used measurements of network activity across multiple early time points to predict apoptosis outcome at later time points, whereas in some applications it may be desirable to relate signals to responses at identical time points [24]. In the context of the variable and conflicting cellular responses to the various TNF, EGF, and insulin treatment conditions noted above it would be useful to understand how different treatments are positioned within the resulting space of intracellular molecular signals. 3.2 Main aims of the thesis research In this thesis I assessed applicability of unsupervised dimensionality reduction techniques for the analysis of apoptosis signalling networks. The key aim was to infer connections between external cues, intracellular molecular signals and corresponding cellular responses. More specifically, by using Cytokine signalling data compendium I aimed to answer the following questions: 1. Can different treatment combinations of ligands be positioned into separate clusters of cues in the low-dimensional signalling space? 2. What are characteristics of these clusters in the low-dimensional space (unsupervised learning)? 3. Can low-dimensional embedding be intuitively explained in terms of original dimensions of molecular signals? 4. Can the low-dimensional representation of signalling networks be used 28

for predictive (supervised) modelling of cellular responses, i.e. apoptosis intensity? 5. Are there any differences in performance between linear and nonlinear dimensionality reduction techniques in the case of both supervised and unsupervised learning (i.e. PCA vs. Isomap)? To address these questions I applied a non-linear dimensionality reduction approach Isomap [26]. Isomap and a similar technique, local linear embedding (LLE) [26, 27] have already been successfully applied as dimensionality reduction approaches for gene networks [28-30] and many other problems in cognitive sciences and computer vision. These algorithms have been found superior to PCA and Multidimensional Scaling (MDS) in finding low-dimensional submanifolds in many cases. For instance, a modified LLE algorithm Local Context Finder (LCF) enabled successful reconstruction of a low-dimensional representation of the pathogen induced gene network in Arabidopsis [31]. To apply Isomap specifically in the context of signalling networks I extended it with graph-based clustering (questions 2) and neural networks (questions 3) (Figure 3). Finally, I used the low-dimensional embedding found by Isomap to build a classifier for prediction of apoptosis intensity (question 5). In my application Isomap generates a low-dimensional projection of signalling networks represented by the activity of molecular signals, where groups of different cues/treatment conditions could be easily identified and visualized (henceforce I refer to such projection as low-dimensional embedding of the signalling network). Consequently, the main contribution of this thesis is analysis of 29

Extended Isomap I. 1. 2. 3. 4. Construct k-nn graph and find optimal value for k Estimate geodesic distances using Djikstra's algorithm Estimate Euclidian distance matrix Isomap Construct Isomap d- dimensional embedding (MDS) II. Estimate Euclidian distances for Isomap lowdimensional map 1. 2. 3. Construct k-nn graph from Euclidian distance matrix (sparcification) Graph-based clustering with Isomap embedding Partition k-nn graph into separate clusters (multilevel k-way partitioning for irregular graphs) III. 1. 2. 3. Build ensemble of neural networks (NN) and perform regression from original dimensions into Isomap projections Construct test and training set with the split ¼ and use parametric bootstrap to select five best NNs Use sensitivity analysis for feature selection, choose five top features Interpretation of Isomap projections with neural networks Figure 3: Schematic representation of the extended Isomap approach. Extended Isomap approach involves three main steps: Isomap algorithm to construct a low dimensional embedding of the apoptosis signalling networks (I), graph-based clustering to find clusters in the Isomap space (II) and neural networks ensemble to find meaningful interpretation of the Isomap projection (III) signalling networks in the new unsupervised learning context using nonlinear dimensionality reduction approaches. 3.3 Cytokine compendium dataset The main details of the Cytokine compendium dataset [19] that are relevant for the present study are as follows. HT-29 epithelial cancer cells were treated with 10 combinations of saturating or subsaturating concentrations of TNF, EGF and insulin (0, 0.2, 5, 100 ng/ml TNF and 0, 1, 100 ng/ml EGF or 0, Figure 2 1, 5, 500 ng/ml insulin respectively), which collectively represent all the cues used in the study. 19 molecular signals were chosen to characterize changes in 30