STATISTICAL APPROACHES FOR MATCHING THE COMPONENTS OF COMPLEX MICROBIAL COMMUNITIES

Size: px

Start display at page:

Download "STATISTICAL APPROACHES FOR MATCHING THE COMPONENTS OF COMPLEX MICROBIAL COMMUNITIES"

Vincent McCarthy
5 years ago
Views:

1 STATISTICAL APPROACHES FOR MATCHING THE COMPONENTS OF COMPLEX MICROBIAL COMMUNITIES by Chongci Tang Submitted in partia fufiment of the requirements for the degree of Master of Science at Dahousie University Haifax, Nova Scotia May 2016 c Copyright by Chongci Tang, 2016

2 Tabe of Contents List of Tabes iv List of Figures v Abstract List of Abbreviations and Symbos Used vi vii Acknowedgements viii Chapter 1 Introduction Background BiomeNet The chaenge of appying BiomeNet to rea data Structure of the thesis Chapter 2 Methods for matching the metaboic components of compex microbia communities LASSO Regression Review of the method An exampe of LASSO fitting NNLS Regression Review of the method An exampe of NNLS fitting A method based on minimizing JSD The method An exampe of JSD minimization method Summary Chapter 3 Resuts for matching the metaboic components of compex microbia communities RSS and JSD of a the matches Quaity of the matches ii

3 3.3 Heatmaps of the estimated coefficient matrix Chapter 4 Discussion and suggestions for future work Bibiography iii

4 List of Tabes 1.1 Mathematica definition of the pate diagram The first 30 steps of the LASSO path on Y The updated parameters β for MJSD method on Y ˆβj, RSS, JSD of three methods on Y The number of the significant matches among a 100 matches. 37 iv

5 List of Figures 1.1 Pate diagram for the BiomeNet mode The Change of coefficients through first 15 steps of LASSO path on Y LASSO coefficients of the seected best LASSO mode on Y Change of RSS for different threshoding modes of Y The coefficients of the NNLS mode of Y 1 after threshoding JSD of Y 1 to a 200 X variabes Coefficients of the variabes resuted from the MJSD method on Y The positive reactions of a the 11 X variabes Some y s and ŷ s of three methods on Y RSS and JSD of a 100 matches for three methods Some y s and ŷ s of three methods on Y Some y s and ŷ s of three methods on Y RSS and JSD permutation histogram of Y 49 for the three methods RSS and JSD permutation histogram of Y 30 for the three methods The number of positive coefficients of a 100 matches for three methods Heatmap of the ordered coefficients for the MJSD v

6 Abstract A hierarchica Bayesian mode caed BiomeNet can be used to identify the functiona units of metaboic reactions, subnetworks and community-eve metaboic networks. The framework modes metaboic structures by assuming each sampe consists of many tighty connected subnetworks, which in turn are comprised of different reactions. When appying the method the number of subnetworks L must be pre-specified When L is set arger in BiomeNet, the inferred structures of the subnetwoks are expected to come out in a more trivia form. Three methods, LASSO, NNLS and a new method, MJSD, are appied to match a subnetwork in one anaysis (say, when L=100) with severa subnetworks when L is increased (say, L=200). RSS and JSD are appied as matching criteria to conduct mutipe tests to judge the significance of the matches. From the resuts, I am abe to identify those predominant subnetworks and give a reasonabe conjecture that those predominant subnetworks aways come out as unbroken bocks for any arger L vaues. vi

7 List of Abbreviations and Symbos Used Symbos and Abbr. X Y Y j X j y i β JSD RSS NNLS LASSO LDA S-R LAR df MCMC PCA IBD KEGG MJSD FDR B-H procedure Description 2824 by 200 data matrix 2824 by 100 data matrix j th coumn in Y j th coumn in X i th eement in Y coefficients vector Jensen Shannon divergence Residua sum of squares Non-negative east square east absoute shrinkage and seection operator Latent Dirichet Aocation substrate-product east ange regression degrees of freedom Markov chain Monte Caro Principe Component Anaysis Infammatory bowe disease Kyoto Encycopedia of Genes and Genomes the new method based on minimizing JSD Fase Discovery Rate Benjamin-Hochberg procedure vii

8 Acknowedgements I woud ike to express my huge gratitude to my supervisor, Dr. Hong,Gu and cosupervisor Dr. Joseph Bieawski, for their guidance and encouragement for the two years. They spent a ot of time on instructing me and revise my thesis even though they are reay busy. Without their support this thesis woud not have been possibe. I woud aso ike to thank my committee members, Dr. Toby Kenney and Dr. Bruce Smith for their kind comments and advice on my thesis. I woud ike to give my thanks to my friends and everyone ese who heped me. My thanks aso goes to my professors and cassmates in Dahousie University. viii

9 Chapter 1 Introduction 1.1 Background From protists to humans, a pants and animas ive in cose association with microbia organisms. These microbes are beieved to pay a critica roe in a wide variety of settings, from gobay significant nutrient cycing to infuencing human physioogy [16]. The ecoogica community of commensa, symbiotic and pathogenic microorganisms that iteray share the human body space have a profound effect on human heath [11]. For this reason, in human microbiomics, research is focused on the reationship of the microbia communities to both heath and disease status. In ate 1990s, researchers found that the microbiome in the gut payed a roe in the deveopment and maintenance of the human immune system, and the human microbiome is now impicated in a variety of autoimmune diseases ike diabetes [10] [22]. A poor mix of microbes in the gut may aso aggravate common conditions such as obesity [20]. The consensus opinion is that athough the mammaian immune system seems to be designed to contro microorganisms, it is in fact controed by microorganisms [12]. Most microbes in most environments are not cutivatabe, so studies of microbia communities must be made based on DNA samped directy from the environment, athough RNA, protein and metaboite based studies may aso be performed [6]. There are two ways to use environmenta DNA to study microbia communities. One is targeted ampicon studies (e.g., 16S SSU rdna), which focuses on a known marker gene and is primariy informative about taxonomy [21]. The other that appeared more recenty is shotgun metagenomics, which can be used to study the functiona potentia of the community [21]. Shotgun sequencing of tota environmenta DNA proceeds by randomy cutting the tota DNA within a sampe, and sequencing the many short fragments that are produced [14]. The resuting coection of DNA sequences represents the genomes of a the microbes in the sampe, and is referred to as the metagenome of the samped microbiota. The function of some of those 1

10 2 sequences can be inferred via homoogy to reference sequence with known functions (e.g., through the KEGG database) [14]. However, ony a fraction of the gene sequences can be assigned a function; the function of a very arge number of sequences, wi remain unknown [14]. Of particuar interest are those sequences that encode a known enzyme, as they can be used to infer the community-eve metaboic potentia of the microbiota. The compete set of metaboic and physica processes of a ce, caed a metaboic network, determines the physioogica and biochemica properties of the bacterium. The fundamenta units of a metaboic network are its chemica reactions. The essentia components of a reaction are the substrates (chemica compounds), the enzyme (a protein encoded by DNA) that wi act upon the substrates, and the products that are produced by the action of the enzyme on its substrates (other chemica compounds). It is important to note that the products of one enzyme-mediated reaction can serve as the substrate for a different enzyme-mediated reaction. Such chemica reactions are organized into subnetworks (functiona modues), which are themseves organized into the higher-eve structures that bioogists refer to as the metaboic network. In the fied of microbiomics, the notion of the metaboic network is extended to incude the fu metaboic capacity, and the fu set of interactions, carried out by a community of microbes. In order to infer differentia usage of metaboic networks among ecoogicay divergent microbia communities, a Bayesian modeing approach caed BiomeNet was deveoped by Shafiei et a. [16]. The modeing framework differs from previous approaches [e.g., PCA and its variants] in attempting to capture the hierarchica nature of rea metaboic networks [22]. Specificay, sets of reactions (reated by expoiting a shared poo of substrates and products) are modeed as a metaboic subnetwork, and over-apping mixtures of such metaboic subnetworks are used to mode the fu metaboic network [16]. When such networks are inferred from the mode (as opposed to being defined soey according to biochemica expertise) they are caed metabosystems [16]. Metabosystems are intended to represent hypothetica metaboic phenotypes, and sampes are permitted to consist of overapping mixtures of metabosystems.

11 3 Currenty, the most widey used method to anayze community metaboic networks, and metagenomic variation in genera, is Principe Component Anaysis (PCA) and it variants [16]. BiomeNet is fundamentay different from such methods, in providing a framework to take advantage of dependencies between reactions encoded by metaboic networks without any transformation and reduction. This means that resuts obtained under BiomeNet offer bioogists a direct functiona interpretation. Thus, BiomeNet contributes a vauabe addition to PCA [16]. BiomeNet has been appied to a variety of datasets, incuding marine water sampes, mammaian gut microbiomes and the gut microbiomes of infammatory bowe disease (IBD) patients [16]. The case of IBD iustrates why a direct functiona interpretation of the resuts is critica to microbiome researchers. IBD is a human disease characterized by chronic immune dysreguation in the gut. Using BiomeNet, Shafiei et a [16] found that IBD patients differed from heathy individuas according to the mixture of metabosystems found in their gut. Interestingy, the metabosystem having a greater prevaence among IBD patients was characterized by metaboic reactions associated with (i) cose association with the human gut epitheium, (ii) resistance to dietary intervention, and (iii) interference with uptake of antioxidants connected to IBD [16]. IBD is a serious disease; it can cause severey disruptive pain, require surgery, and even cause increased mortaity. Even more pressing is that it is on the increase wordwide, with Canada having among the highest rates. Thus, it is important that BiomeNet produces resuts that have a direct reationship to microbia metaboic capacity, and the posterior mixture weights have a straightforward interpretation. Inference under BiomeNet wi be focus of this thesis, and the framework is described in more detai in the next section. 1.2 BiomeNet BiomeNet is a hierarchica mixed-membership Bayesian mode simiar to Latent Dirichet Aocation (LDA). LDA empoys a mixture of Dirichet priors to faciitate custering, or cassification. Standard LDA, however, cannot be used to resove the underying structure of a metaboic network in terms of easiy interpretabe parts such as metaboic pathways or subnetworks. In the metagenomic setting, the observed data (produced by shotgun sequencing) is the abundance of enzyme-encoding sequence

12 4 reads in each sampe; but, for input into BiomeNet the data must be converted to counts of substrate-product pairs. To do this, where possibe, each enzyme-encoding sequence is assigned to its enzyme within the EC database. The EC database provides information about the biochemica reaction that the enzyme catayzes, incuding a of the substrates and products for that reaction. Each reaction is decomposed into a possibe substrates and products (this eads to a mini hyper-graph for each reaction), and a pairs of compounds are assigned the abundance of the associated enzymeencoding sequence within the sampe. As mammaian gut metagenomes can encode thousands of unique enzymes, the input data for BiomeNet is rich in information reevant to metaboic interactions. BiomeNet is used to mode metaboic interactions at the community-eve. The mode assumes that each microbiome sampe is a mixture of K metabosystems, where K is assumed to be known and fixed. However, the contribution of the K metabosystems is permitted to differ between sampes. Each metabosystem is itsef viewed as a mixture of a fixed number (L) of metaboic subnetworks. Hence, metabosystems differ according to their particuar mixture of subnetworks, with the subnetworks viewed as a mixture of metaboic reactions. Because reactions within a subnetwork are inked through shared chemica compounds, the subnetworks are actuay modeed as a subset of compounds that are converted to another subset of compounds. This is why each enzymatic reaction must be decomposed into substrate-product pairs (S-R pairs), with each subnetwork having its own substrates and products. The ower eves of the hierarchy (e.g., reactions) can aways contribute to any of the higher eves (e.g., to different subnetworks, metabosystem and sampes) to different degrees [16]. The set of S-R pairs for each reaction in a sampe is the ony observabe variabe, with other variabes (the subnetwork (Y ) and metabosystem assignments (Z) of the reactions) being atent variabes. In order to capture the dependence among the many variabes more concisey, a pate diagram is shown in Figure 1.1 with the mathematica definitions given in Tabe 1.1. The outer pate represents the tota data, the midde pate represents sampes, and the inner pate represents reactions within a sampe. The reative contribution of each metabosystem to the n th microbiome sampe is modeed via the variabe θ n, which is a probabiity vector of K vaues that sum to one. Typicay, a

13 5 Figure 1.1: Pate diagram for the BiomeNet mode. This mode specifies a generative process; couping between substrate-product pairs is enforced by conditioning their generation on a singe subnetwork membership [16]. Note: Adapted from BiomeNet, A bayesian mode for inference of metabic divergence among mibrobia communities by Mahdi Shafei, Katherine A Dunn. PLoS Comput Bio, 10(11):e , sparse symmetric Dirichet priori is paced on θ. θ Dirichet(α θ ) The unique mixture of L subnetworks for each metabosystem is modeed with a probabiity vector, ϕ k, of mixing probabiities that sum to one. Thus, given K metabosystems, a K L matrix caed ϕ is used to represent the reative contribution of the th subnetwork to the k th metabosystem. Typicay, a sparse symmetric Dirichet prior is paced on ϕ. ϕ k Dirichet(α ϕ )

14 6 Because reactions are inked through shared chemica compounds, each subnetwork has its own substrate (S) and product (R) groups. Note that the products of one reaction can serve as the substrate of another reaction (i.e., the intermediary compounds in a metaboic pathway), and the membership of compounds to the S and R groups must be soft (probabiistic) rather than discrete. Thus, for L subnetworks, there wi be L substrate and products groups. Each subnetwork has a vector of C compounds for its own substrate and products, denoted by δ and γ respectivey, summing to one. For L subnetworks, there wi be an L C matrix, δ, for substrate compounds and another L C matrix, γ, for product compounds. A vaue in row and coumn c of one of these matrices gives the contribution of compound c to a subnetwork (as either a substrate or product, depending on the matrix). As above, a sparse symmetric Dirichet prior is paced on δ and γ. δ Dirichet(α δ ) γ Dirichet(α γ ) Variabe Type Meaning N integer number of microbiome sampe(e.g 38) K integer number of metabosystems (e.g 3) L integer number of subnetworks (e.g 100) C integer number of compounds (e.g 2713) θ probabiity prior distribution of metabosystems in a sampe ϕ probabiity prior distribution of subnetworks in metabosystems δ probabiity prior distribution of substrate compounds in subnetworks γ probabiity prior distribution of product compounds in subnetworks α probabiity concentration parameter of the Dirichet distribution n integer index of a sampe i integer index of a reaction j integer index of a substrate-product pairs Z ni vector metabosystem assignment for the i th reaction in the n th sampe Y ni vector subnetwork assignment for the i th reaction in the n th sampe S nij vector substrate compounds assigned for the J in S-R pairs R nij vector product compounds assigned for the J in S-R pairs Tabe 1.1: Mathematica definition of the pate diagram

15 7 Whie the prior probabiities are in the Dirichet distributions, the posterior distributions are in a mutinomia distribution. The reative contribution of each metabosystem and subnetwork to the sampe n is modeed as Z ni θ n Muti(θ n ) Y ni Z ni Muti(ϕ zni ) Where Z ni and Y ni denote the metabosystem and subnetwork assignments for reaction i in sampe n. Inference under BiomeNet invoves samping the posterior distribution of the subnetwork and metabosystem assignments for each reaction in the data and integrating out a other atent variabes. Specificay, coapsed Gibb samping is used to sampe from the conditiona distribution P (Z, Y R, S, α θ, α ϕ, α δ, α γ ), with the atent variabe θ, ϕ, δ, γ integrated out. The subnetwork and metabosystem assignments are samped for a singe reaction in a singe sampe, P (Z, Y S, R), given the set of subnetwork and metabosystem assignment for a other reactions in a other sampes except ony reaction i in sampe n (Y ni and Z ni respectivey). Each iteration of Gibb samping not ony provides a joint posterior distribution samping point on P (Z, Y S, R), but aso provide a samping point from a margina distribution P (Z ni, Y ni S, R). Based on the posterior distribution of metabosystems and subnetworks assignments, P (Z S, R) and P (Y S, R), the posterior distribution of θ and ϕ, and their means, can be directy inferred if sufficient iterations of Gibbs samping have been carried out [16]. 1.3 The chaenge of appying BiomeNet to rea data The strength of BiomeNet is that it provides a method to earn the community-eve structure of the data in terms of subnetworks and metabosystems, rather than according to curated pathways, or some other bioogy-centered definitions. Mode-derived structures hep researchers to investigate the atent metaboic structure of any microbia community, and to understand microbia community ecoogy, without having to assume that metaboic pathways have been previousy defined and annotated in a way that is suitabe to the communities under study. However, BiomeNet does have imitations, and further deveopment of the anaytica framework is warranted. The number of metabosystems K in each sampe, and the number of subnetworks L

16 8 must be pre-specified [16]. Without strong bioogica criteria of seecting the vaues of K and L, the user of BiomeNet must either (i) try different number of K and L and compare the discrepancy to choose a reasonabe vaues, or (ii) choose arbitrariy high vaues for K and L and set the concentration parameter of the Dirichet priors cose to zero as a means of pushing BiomeNet to characterize metabosystems by a reativey few predominant subnetworks and reactions. The atter is a strategy for minimizing variance and maximizing interpretabiity in the face of high vaues for K and L. The probem with both strategies is that there is no objective means of assessing which, if any, subnetworks within a metabosystem, or reactions within a subnetwork, warrant predominant status and further bioogica interpretation. A metagenomic dataset comprised of 38 mammaian gut microbiomes [16] provides both a good iustration of the chaenges, as we as a good test case of future method deveopment. This dataset is rich in metaboic information, having 2824 unique reactions invoving 2713 compounds. In the origina study, Shafiei et a [16], set K= 3 metabosystems, to match the number of dietary niches (carnivore, omnivore and herbivore) represented by the samped mammas. However, the authors found that they coud ony separate the carnivores from the herbivores, and concuded that the there was strong signa for 2 metabosystems, and ony weak signa for a third. Shafiei et a. [16] had no a priori basis to seect a good vaue for L, so they experimented with different numbers of subnetworks (i.e., L=50, 100, 150, 200) and assessed the robustness of the reaction composition of the metabosystems to L. [14]. Athough their resuts suggested that a good vaue of L was ikey somewhere between 100 and 150, they provided no means of objectivey assessing which were the most important subnetworks. When L is arger than needed, there wi be redundant subnetworks in BiomeNet. The redundant subnetworks wi carry very sma weight for many of the metabosystems and their reactions wi have ony a trivia impact in composition of each metabosystem. Even when the concentration parameter of the Dirichet priors are cose to zero, and BiomeNet has been encouraged to use reativey few predominant subnetworks and reactions, it remains uncear how to decide at what point the mixture weight of a given metabosystem shoud be considered trivia. Thus the motivation of my research is to address this chaenge.

17 9 I empoyed the mamma dataset described above to deveop and investigate severa aternative methods for discriminating the so-caed predominant subnetworks from those making ony trivia contributions to a metabosystem. The ogic that underpins my approach is that the informative subnetworks shoud make consistent contributions to each metabosystem across aternative vaues of L (as ong as L is not too sma), and that matching the components (i.e., reactions) of subnetworks can be used to identify them. However, comparison of resuts across aternative vaues of L is far from straightforward. First, subnetworks are unikey to be abeed in the same way. For exampe, subnetwork 10 under L=100 might correspond to subnetwork 33 under L=150. Second, the signa for a subnetwork in one anaysis (say, when L=100) coud be spit among severa subnetworks when the vaue for L is increased (say, when L = 200). Thus, I investigated mode-based methods for matching subnetworks from one case as inear combinations of subnetworks derived from another case. This strategy overcomes the probem of potentiay compex reationships between the components of subnetworks derived from different anayses, and it has no requirements about abeing of subnetworks among the different anayses. To achieve this, I summarize the information about the components of each subnetwork in a N L reaction matrix. Here, N is the number of unique reaction in the dataset, which is 2824 for the 38 mammaian gut metagenomes. Each coumn in the N L matrix represents a reaction s posterior mixing probabiity to the corresponding subnetwork [16] (detaied desciption of the reaction matrix in Chapter 2). In the thesis, I first evauate two cassica methods, LASSO (east absoute shrinkage and seection operator) and NNLS (non-negative east squares) regression, for matching resuts structured as described above. I then describe a new method simiar to forward seection for matching the components of subnetworks. The effectiveness of these three methods are compared, and I show that the new method is more suitabe in the case of the 38 mammaian gut metagenomes. 1.4 Structure of the thesis This thesis is comprised of four chapters. The next two chapters (2 and 3) are devoted to method deveopment and appication to rea data, In Chapter 2, I introduce two cassica methods (LASSO and NNLS) and the new method, and describe how they

18 10 can be appied to the probem of matching the metaboic components of compex microbia communities. As an exampe, I match one subnetwork from the anaysis of L=100 to the subnetworks from that of L=200 using the three methods. In Chapter 3, these three methods are appied to a the 100 subnetworks in the L = 100 run of BiomeNet to match the subnetworks in the L= 200 run. I use RSS and JSD as criteria to rank the good matches and give exampes of a good and a bad match according to those criteria. I then discriminate those we matched subnetworks by using permutation test and present the estimated coefficient matrices to find the features of the good matches. Chapter 4 concudes with the strengths and weaknesses of a three methods and provides directions for future research.

19 Chapter 2 Methods for matching the metaboic components of compex microbia communities As introduced in Chapter 1, the strength of BiomeNet is that it provides a method to earn the community-eve structure of the data in terms of subnetworks and metabosystems. One way to find if BiomeNet output is consistent with the information on the true community-eve structure is to match the subnetworks to find whether the informative subnetworks aways can be identified across aternative vaues of L when K is fixed. Here K is fixed at 3 as this was the vaue empoyed in the origina study of Shafiei.et.a [16]. If BiomeNet works we, the signa for a subnetwork in one anaysis (say, when L=100) coud be either the same or spit among severa subnetworks when the vaue for L is increased (say, L=200). Because subnetworks are unikey to be abeed in the same way, some method is needed to identify the reationships between subnetworks derived from different appications of BiomeNet to the same data. The reationships coud be either a one to one correspondence, or a subnetwork from one case being spit into severa subnetworks in another case. Each subnetwork is represented by a coumn of the posterior mixing probabiities in the N L reaction matrix, where N=2824 is the number of unique reactions for the 38 mammaian gut metagenomic data and L is the number of subnetworks. Denote the N L 1 reaction matrix from one run of BiomeNet as Y, and the N L 2 reaction matrix from different run of BiomeNet as X, the matching probem that one subnetwork in Y corresponds to one or more subnetworks in X can be represented mathematicay as foows: Y = Xβ = β 1 X 1 + β 2 X β L2 X L2 Thus the matching probem naturay can be formuated as a regression without intercept, in which case minimization of the sum of squared residuas can be used as the criterion to achieve the best fit. Matching subnetworks from one case as 11

20 12 inear combinations of another case shoud perform we because the expectation is that a arge subnetwork in L=100 wi be spit into severa smaer (and argey nonoverapping) subnetworks when L=200. Thus the inear regression mode is chosen to represent the reationship between subnetworks. Usuay, the east square inear regression coefficient vector is not sparse. However, in this appication one subnetwork is expected to be matched by either one or ony a few subnetworks, i.e. the consistency of the subnetwork outputs from BiomeNet is associated to the sparsity and the quaity of the matches. So, a reguarized version of the east squares soution may be preferabe to satisfy the sparsity of β. Naturay, LASSO regression is a good candidate method since it can resut in a good regression mode with a smaer subset of predictors for LASSO fitting. A modified LARS agorithm [11] can be used to return a fu LASSO path, and a sparse soution can be achieved with a proper mode seection criterion. If a subnetwork in Y is spit into severa subnetworks in X, then ideay the regression coefficients shoud be non-negative which means β 0. LASSO regression ony satisfies the requirement of sparsity of β, but not the non-negativity of β. Thus another good candidate method is the NNLS regression which assures the non-negativity of the coefficients. The NNLS soution is not ony non-negative, it can aso be sparse to some extent, and further sparsity can be achieved by a proper threshoding agorithm. Furthermore, since each coumn in matrices X and Y represents the posterior mixing probabiity over a reactions, the sum of a eements in a coumn shoud be 1. If one coumn can be represented by a inear combination of severa other coumns, i.e., Y =Xβ, thus 1 = 1 T Y = 1 T Xβ = 1 T β. This impies another constraint on β. Both LASSO and NNLS don t necessariy output the soution that stricty satisfy this third constraint for β. As a measure of the quaity of matching of two mutinomia probabiity distributions, residua sum of squares is not the most proper criterion. A natura criterion for this purpose is the Jensen-Shannon Divergence. Thus a new method is deveoped by directy optimizing the Jensen-Shannon divergence between the target vector Y and a sparse non-negative inear combination of X such that the coefficients sum to 1. In the rest of this chapter, I first review the methods of LASSO and NNLS and

21 13 then describe our newy deveoped method which is more suitabe for this appication. Foowing the review of each method, I show an exampe of its performance on matching a subnetwork from L = 100, denoted as Y 1, with subnetworks from L=200 obtained from the mammaian metagenomics data by using BiomeNet. 2.1 LASSO Regression Review of the method Given a set of input measurements X 1, X 2,..., X L and an outcome measurement Y, the LASSO [4] regression fits a inear mode Y = β 1 X 1 + β 2 X β L X L = Xβ with a constraint on β, the L 1 norm of the parameter vector, not greater than a given vaue. The criterion is: argmin β subject to N (y i i=1 L x ij β j ) 2 j=1 L β j s j=1 The bound s is a tuning parameter [2]. Of course when s is arge enough, the constraint has no effect, and the soution of LASSO regression is just the same as the east square inear regression. The probem can be equivaenty represented as an unconstrained minimization probem with the L 1 norm penaty λ β added as foowing: argmin 1 2 N L L (y i x ij β j ) 2 + λ β j i=1 j=1 j=1 An aternative reguarized version of east squares is Ridge regression [4], which uses the constraint that β 2, the L 2 -norm of the parameter vector, is no greater than a given vaue. Here, LASSO regression is preferred to Ridge regression, since with the penaty λ increasing, a parameters of Ridge regression are shrunk towards 0, but not exacty equa to 0; whie in LASSO more parameters are shrunk to zero which

22 14 resuts in a sparse soution [2]. Since our purpose is to find severa subnetworks that exhibit the strongest inear reation with the target subnetwork, sparsity is desirabe. Because the LASSO oss function is not quadratic, the optimization probem can not be soved using genera convex optimization methods. Here the entire LASSO path is cacuated by a modified form of the LARS (east ange regression) agorithm [2]. The R package ars is used to sove the LASSO fitting. By defaut in the package, each variabe is standardized to have unit L 2 -norm. In this appication, the variabes are mutinomia probabiities that sum to 1; so, it s not proper to standardize the variabes. The soution path is piecewise inear there are a finite number of the points at which the regression projection changes its direction [2]. The sparsity of the soution is controed by the reguarization parameter λ, which in genera is chosen by a cross-vaidation procedure on training data. In our appication, it is not proper to perform the cross-vaidation, because both responses and predictors are mutinomia distribution probabiities. The sparsity of these vectors makes the variance of the cross-vaidation resuts very arge and it is not proper to ignore the property that the sum of these vectors are a 1 s. An aternative method to cross-vaidation for choosing the tuning parameter is to use Maow s Cp as the mode seection criterion [2]. A smaer Cp vaue indicates a better mode. The Cp criterion is defined as. where ŷ i(k) is the predicted vaue of the ith observation y i from the mode with k regressors. size. SSE k = Cp = SSE k MSE p n + 2k n (y i ŷ i(k) ) 2 is the sum of squares for the mode with k predictors. i=1 MSE p is the mean square error on the compete set of p regressors. n is the sampe The compexity of the modes in the LASSO path can be represented by their degrees of freedom. The number of degrees of freedom for a inear mode is defined as the true dimension of the inear subspace in which the predictors of the mode ie. However defining the number of predictors in the LASSO mode is ony an approximation because each dimension shoudn t be counted as a fu degree of freedom due

23 15 to the shrinkage appied [2]. The approximation works due to a usefu concusion in [19] that when fitting a inear mode via LASSO stopping at some number of steps k<p, the df of the modified LAR procedure at any stage, is approximatey equa to the number of predictors in the mode. Thus if the seected mode incudes k non-zero coefficients, I define the degrees of freedom (d.f.) as k An exampe of LASSO fitting As an exampe, I use one subnetwork from the L = 100 case, denoted as Y 1,to iustrate the LASSO fitting. Note that the same Y 1 variabe is aso used as iustration exampe for the NNLS and our newy deveoped method. β β β β β β β β β β β β β β β β β Figure 2.1: Profies of LASSO coefficients. A vertica ine is drawn at df=11, the vaue chosen by smaest Cp criterion. The profies are piece-wise inear, and so are computed ony at the first 15 steps of LASSO path. β j is the coefficients of the predictor X j. With the increase of df, the predictor is added into the mode one by one. In this particuar LASSO regression, there are 153 steps in the whoe LASSO path

24 which means that 153 variabes are added into the mode at the end of LASSO. Df is used to denote the number of variabes that are aready seected into the mode at each step. df index Cp RSS λ Tabe 2.1: The first 30 steps of the LASSO path on Y 1. Df denotes degrees of freedom of the mode. Index indicates which predictor is incuded in the mode at each step. 16 Tabe 2.1 shows the first 30 steps of the LASSO path. From Tabe 2.1, the smaest Cp vaue corresponds to df=11, which means 11 variabes are seected by this mode. The LASSO path is dispayed in Figure 2.1, where a vertica ine is drawn at df=11

25 17 whichisthebest modeseectedby Maow scpcriterion. Thecoefficientsoftheseected modearepotedinfigure2.2. Fromthepot, X 19 rankshighestin matchingy andisfoowedbyx 107 andx 51. Meanwhie,X 2, X 149 andx 124 payatriviaroein matchingy sincethecorespondingcoefficients arereativeysma. TheCoeficientsofthe Modeseected β j^ Indexofβ j^ Figure2.2: ProfiesofLASSOcoefficientsoftheseectedbest modeony 1. Among the200estimatedcoefficients,11variabesareseectedinto mode,awithpositive coefficients. Theindexnumbersofthepredictorsare markednexttoeachpoint.

26 NNLS Regression Review of the method Given X 1, X 2,..., X L in X N L matrix as the predictor variabes and a coumn vector Y as the response variabe, the criterion of NNLS regression is [18]: argmin β Y Xβ 2 subject to β j 0, j = 1,..., L. The NNLS optimization probem is a quadratic programming probem [18], argmin Y Xβ 2 = argmin(β T X T Xβ Y T Xβ β T X T Y + Y T Y ) β 0 β 0 = argmin(β T X T Xβ 2Y T Xβ + Y T Y ) β 0 = argmin( 1 β 0 2 βt X T Xβ Y T Xβ) There are many simiar agorithms to sove the NNLS optimization probem, of which the first widey used agorithm is an active set method pubished by Lawson and Hanson in their 1974 book [18]. The R package nns is used to sove the NNLS fitting in this study. The NNLS regression coud resut in many positive coefficients in the mode, which is not convenient for seecting the most important predictors. In our exampe most of the estimated coefficients are extremey sma and can be negected without much increase in RSS. Therefore, threshoding these sma positive coefficients has itte infuence on the precision of the matching. Here, hard-threshoding is used in mode seection of NNLS. The hard-threshoding works by setting the threshod imit t equa to a sequence of increasing numbers. To cacuate the corresponding RSS, a the coefficients that are ess than or equa to the threshod t are set to zero. Hence I obtain a sequence of RSS corresponding to the sequence of threshods. Our threshoding criterion is to choose the maximum threshod such that the difference in RSS before and after threshoding is ess than 10 5.

27 Anexampeof NNLSfitting Asanexampe,IagainuseY 1 toiustratethennlsfiting. Amongthe200predictorvariabes NNLSanaysisresutedin71positivecoefficientswiththerestofthe coefficientsazerocoefficients. Figure2.3showsthechangeinthevauesofRSSversusthreshod. Thethreshod tissetfrom0to0.05 withanincrementof Theverticaineisthe mode indicatingthatthethreshodis Thecoefficientssmaerthan0.009arethreshodedtozero. Beforeappyingforanythreshoding,theRSSis ,andafter threshodingat0.009the RSSis with11nonzerocoefficientseft. The differencebeforeandafterthreshodingisno morethan10 5. Givensuchavery smadifferenceinrssitisreasonabetoignoretheosofprecision. ThechangeofRSSfordiferentthreshoding modes RSS threshod Figure2.3: ChangeofRSSwithincrementofthreshodinNNLSregresionofY 1. ThevauesofRSSarepotedversesthevauesofthethreshods. Theverticaineis the modechosenwhenthreshodis0.009.

28 20 coeficients TheCoeficientsofthe Modeseeted Indexofβ j^ Figure2.4: ThecoefficientsoftheNNLS modeofy 1 afterthreshoding. Amongthe 200estimatedcoefficients,11variabesremaininthe mode. Theindexnumbersof nonzeroβ j are markednexttothepoints. Thebest modeafterthreshodingispotedinfigure2.4. Fromthepot,X 19 rankshighestin matchingy andisfoowedbyx 107 andx 51. Meanwhie,X 2,X 149 andx 124 payatriviaroeinmatchingysincethecorespondingcoefficientsarevery sma. TheresutsareverysimiartotheLASSOprofie(bycomparingFigure2.4 andfigure2.2). 2.3 A methodbasedon minimizingjsd Differentfrom LASSOand NNLSregresion,ourproposed methodusesjensen- Shannon Divergence(JSD)insteadof RSSastheoptimizationcriterion. JSDisa wideyusedmethodofmeasuringthesimiaritybetweentwoprobabiitydistributions, anditiswideyappiedinbioinformaticsandgenomecomparisons[9]. TheJensen-ShannonDivergencebetweentwomutinomiaprobabiityvectorsx= (x 1,,x n )andy=(y 1,,y n )isdefinedas[9]:

29 21 JSD(x y) = 1 2 ( n i=1 x i og x i h i + n i=1 y i og y i h i ) (2.1) where h = x+y 2 is the mean vector of x and y. From the definition, it is cear that JSD is a modified version of Kuback-Leiber divergence. The direct appication of Kuback-Leiber divergence is not possibe to measure two mutinomia distributions with many zero probabiity categories. Often the ogarithm in the JSD definition uses 2 as its base, in which case JSD has the property: 0 JSD 1. Thus when the ogarithm uses e as its base, it has the property: 0 JSD n(2). In this thesis, e is used as the ogarithm base. Jensen-Shannon Distance is defined as the square root of Jensen-Shannon Divergence [7]. As a distance measure, Jensen- Shannon Distance satisfies a three properties required of a distance [9] The method Given a set of mutinomia probabiity vectors X 1, X 2,..., X p in the matrix X and a target mutinomia probabiity vector Y seected from the matrix Y, the aim is to find a inear combination of severa X variabes which has smaest JSD with the variabe Y. i.e. Ŷ = β 1 X 1 + β 2 X β L X L L with the coefficients satisfying β j = 1, β j [0, 1]. This guarantees that Ŷ is a j=1 mutinomia probabiity vector too. The objective function for finding the β is min β JSD(Y Ŷ ) Directy optimizing this function is chaenging. At first severa optimization agorithms such as gradient descent and Newton s methods with constraints [13] were tried to sove the probem directy. But the Jacobian or Hessian is unavaiabe anayticay and too expensive to compute numericay at every iteration. Then I considered an aternative method to Newton s methods (i.e., Quasi-Newton methods) without

30 requiring the Jacobian in order to search for zeros [13] [3]. However, Quasi-Newton methods often find a oca maximum and minimum of a function and the soution is highy dependent on the initia vaues [3]. Since different sets of initia vaues can resut in different oca maximum and minimum, the need to set the 200 initia vaues became a big probem. The output of 200 coefficients are not sparse at a if the input numbers are not sparse. The chaenge posed by directy optimizing this function eads us to consider other methods. A heuristic searching strategy which is simiar in nature to the forward variabe seection procedure is proposed. First, I cacuate the JSD between Y and each X j variabe, JSD(Y X 1 ), JSD(Y X 2 ),..., JSD(Y X L ), which are then ordered in an increasing order. Thus, the X variabes are ordered as X [1],..., X [L], so that the variabes in X which are more simiar to Y wi be considered first. I first assign the best match of Y as X [1] and cacuate the JSD, then I consider whether X [2] shoud be added into the mode. If the inear combination of X [1] and X [2] ead to a smaer JSD with Y variabe, then X [2] is added into the mode, otherwise the procedure is stopped. Adding X variabes according to this procedure ensures that the inear combination satisfy the constraints on the coefficients whie the JSD to Y is decreased at each step. The optimization criterion at first step is foowing: min β JSD(Y (β 1 X [1] + β 2 X [2] )) subject to β 1 + β 2 = 1, β 1 [0, 1], β 2 [0, 1] The R package named Genera-purpose Optimization is used to sove the optimization probem with constraint. The procedure stops when adding another variabe wi no onger reduce the JSD (i.e., the coefficient for the newy added variabe is zero). The detais of this forward variabe seection procedure are summarized in the foowing agorithm: Note that MJSD might not find the optimum resut for a the coefficients. This coud happen when the seected subnetworks overap each other. 22 However, when one subnetwork is spit into amost non-overapping severa smaer subnetworks (as we expect wi be a common case), the MJSD method wi provide neary optimum resuts. In the next section I give an exampe showing that the seected X variabes are non-overapping subnetworks.

31 23 Agorithm 1: A forward seection procedure to sequentiay minimize JSD 1. Initiaization (a) Cacuate the JSD between Y to each X variabe, order X variabes according to their JSD from smaest to argest. (b) Set an set P =, set R = [X [1],..., X [p] ]. Move X [1] from set R to set P, set Ŷ = X [1], β = At the ith step (i=2,... p): (a) Move X [i] variabe from set R to set P: cacuate β that minimize JSD(Y β Ŷ + (1 β )X [i] ). (b) If 1 β > 0, update Ŷ = β Ŷ + (1 β )X [i]. Go back to step 2(a). Ese if 1 β = 0, the recursive procedure stops, output Ŷ An exampe of JSD minimization method I again use Y 1 as an exampe to iustrate the method s abiity in finding the best matching set of variabes in X. The JSD of each of the 200 X variabes with Y 1 are shown in Figure 2.5. Here the smaest one is JSD(Y 1 X 19 ), the second smaest one is JSD(Y 1 X 73 ). In Figure 2.5, most of the 200 X variabes have JSD near the upper bound of n(2). The X variabe with smaest JSD vaues wi be seected by the mode at first. When β converges to 1, it shows that the program adds a the 11 smaest X j variabes to the inear combination resuting in the fina optima JSD as Tabe 2.2 shows the updated parameters of β and JSD at each step. X j is the variabe added into the mode in each step. At the beginning the initia β is equa to 1, and then it becomes a vaue between 0 and 1 with more variabes seected into the mode. The β stops at as shown in Tabe 2.2. β is 1 at the 12 step.

32 24 X j β JSD(Y Ŷ) 1 X X X X X X X X X X X Tabe2.2: Theupdatedparametersβ for MJSD methodony 1 JSD JSDofY 1 toax j indexofx j Figure2.5: ProfieofJSDofY 1 toa200x variabes.jsdvauesarepotedverses theindexof200x variabes. TheindexnumberofsomeX variabesareaso marked nexttothepoints. ˆβ j arepotedinfigure2.6. ˆβ j standsforthecoefficientofx j thathasbeen seectedintothe modeafterthewhoeprocedurestops. ˆβ j =1. X 19 stiranks

33 25 highestin matchingy,buttheweightcacuatedbythis methodisevenargerthan LASSOand NNLS methods. Thesamesetofvariabesareseectedbyathree methods,whichincude X 19,X 51,X 73,X 107,X 1,X 72,X 2,X 124,X 149,X 124,X 164,howevertheorderthatthesevariabesenterthe mode,andthecoefficients,aredifferent. CoeficientsfromMJSDmethodonY 1 Coeficients indexof200x j variabe Figure2.6: Coefficientsofthevariabesresutedfromthe MJSD method. Amongthe 200estimatedcoefficients,11X variabesareseectedinto modewithcoresponding positivecoefficients. Theindexnumbersofthepredictorsare markednexttothe points. Next,Iinvestigatedtheoverapinthereactioncompositionofthe11subnetworks withpositivecoefficientsthatwereaddedintothe modeby MJSD.Figure2.7shows athenonzeroreactionsinthe11seectedxvariabes. Forthispot,reactionswere reorderedasfoows. First,thenonzeroreactionsofX 19 wereorderedfromargest tosmaest. Then,anyremainingnonzeroreactionsofX 51 wereorderedfromargest tosmaestandaddedtothesequencederivedfromx 19. Thisproceswasrepeated untiathenonzeroreactionsofathe11x variabes werereordered. ThedistributionofreactionsinFigure2.7showsthattheseectedsubnetworksareargey

34 26 non-overappingatthereactioneve(i.e.,theyarecomprisedofdifferentreactions). Thusthethe MJSD methodshoudprovideverysimiarresutstothosethatwoud beobtainedbydirectyoptimizingtheinearcombinationofthose11x variabes. Thepositivereactionsofathe11Xvariabes positivereactions X 19 X 51 X 73 X 107 X 1 X 72 X 2 X 124 X 151 X 149 X indexoftheorderedpositivereactions Figure2.7: Thepositivereactionsofathe11X variabes. Theegendindicatesthe indexofthe11x variabesandtheircorespondingpointtypes. 2.4 Summary LASSOregresionresutssatisfytherequirementofsparsityofˆβ,throughaproper modeseectioncriteriontoseectthebest modefromafulassopath.inour exampe,thebest modeareadysatisfiesthenon-negativityrequirementofthecoefficients. Thenon-negativitywasnaturaysatisfiedin93%ofathe100subnetwork

35 27 matchings when using Cp-statistics for mode seection. For the remaining 7% of subnetworks, some very sma negative coefficients resut from LASSO fitting among their matching X variabes. By changing these sma negative coefficients to zero, the RSS is not much changed. NNLS regression resuts naturay satisfy the non-negativity constraints of the coefficients. The NNLS soution is not ony non-negative, but aso sparse to some extent. However, there are sti too many very sma positive coefficients in the NNLS regression resuts. Hence, further sparsity has to be achieved through an appropriate threshoding agorithm without much increase in the fina RSS. The new method, named as MJSD, uses a forward variabe seection procedure to simpify the optimization probem. It is based on directy minimizing JSD to obtain the positive and sparse coefficients that sum to 1, without the mode seection step in LASSO or the hard-threshoding in NNLS. Note that since LASSO and NNLS don t necessariy output a soution satisfying ˆβ1=1, the resuting Ŷ from LASSO and NNLS are not probabiity vectors. In order to compare the resuts, I isted a the fina coefficients from these three methods in Tabe 2.3. The three methods have seected the same subset of variabes in X to match Y 1, but the coefficients are different. The LASSO and NNLS resuts are more simiar in this case. To compare the fitting of these three methods, both RSS and JSD on Y 1 are used as the criteria. RSS for the three methods can be directy N cacuated from RSS = (y i ŷ i ) 2. However, before cacuating the JSD for LASSO i=1 and NNLS matches, I need to normaize the vector Ŷ to sum to 1. The normaized N vector is denoted as Ŷ = [ŷ 1,..., ŷ N ]T, where ŷ i=ŷ i / ŷ i. Then the JSD(Y Ŷ ) is used as criterion for LASSO and NNLS. JSD and RSS for the three methods on Y 1 are aso shown in Tabe 2.3. When using JSD as the criterion, the matching of MJSD is better than LASSO and NNLS, and NNLS is a itte better than LASSO. However, when using RSS as criterion, the matching of LASSO and NNLS are better than MJSD, and NNLS is sti a itte better than LASSO. This outcome is reasonabe because LASSO and NNLS use RSS as optimization criterion, whie MJSD use JSD as optimization criterion. i=1

A Brief Introduction to Markov Chains and Hidden Markov Models

A Brief Introduction to Markov Chains and Hidden Markov Modes Aen B MacKenzie Notes for December 1, 3, &8, 2015 Discrete-Time Markov Chains You may reca that when we first introduced random processes,