Functional Redundancy and Expression Divergence among Gene Duplicates in Yeast

by Zineng Yuan

A thesis submitted in conformity with the requirements for the degree of Master of Science
Department of Molecular Genetics
University of Toronto

Copyright by Zineng Yuan, 2010

Abstract

Zineng Yuan, Master of Science, Department of Molecular Genetics, University of Toronto, 2010

My research focused on the functional redundancy and expression divergence of gene duplicates, with the aim of addressing currently unsolved problems. We employed a method based on GO terms to measure functional overlap between paralogs and established that functional similarity between duplicate genes is the key determinant of their backup capacity. We then investigated expression divergence. Recent studies suggest that only a small proportion of expression variation can be explained by transcriptional variation between paralogs. Here, the contribution of diverged TF regulation was re-examined, and differential promoter chromatin status was also found to be an important contributor to expression divergence. To examine the role of gene duplication in greater detail, a case study was performed on the yeast chaperone system, which includes many gene duplicates. Taken together, this study sheds light on the roles of redundancy and divergence in the long-term retention of gene duplicates.

Acknowledgments

I would like to first express my sincere gratitude to my supervisor, Professor Zhaolei Zhang, for his invaluable guidance and strong support throughout my M.Sc. program at the University of Toronto. I also thank my committee members, Dr. John Parkinson and Dr. Walid A. Houry. I have benefited tremendously from their outstanding vision, technical insight and practical sensibility. I deeply appreciate their strict training and precious advice in all aspects of my academic development.

I thank all my colleagues, Jingjing Li, Xiao Li, Dong Dong, Lee Zamparo and Renqiang Min, for their valuable discussions on research, and for the precious suggestions and friendship that supported me throughout the past two years. I wish them all the best in pursuing their dreams in the future. I would like to thank my friends Colin and Vestraea, who are funny guys, for their company and support. I appreciate all the comfort they gave me when I felt horrible and all the joy they have brought to me. I feel so blessed to have them on my side to unconditionally support me in my efforts towards my goals.

Finally, I wish to send my gratitude to my parents. Their unreserved love and support have been the most important power sustaining my perseverance throughout the past years. Without their care and encouragement from a distance, I would not have been so focused in pursuing my study. This thesis is dedicated to them.

Abstract
Acknowledgments
List of Tables
List of Figures
List of Abbreviations
Chapter 1 Introduction
1.1 Overview of gene duplication
1.1.1 Background and overview
1.2 Whole Genome Duplication (WGD) and Small Scale Duplication (SSD)
1.2.1 Two scales of gene duplications
1.2.2 Intrinsic difference between WGD and SSD genes
1.2.3 Different evolutionary constraints on WGD and SSD genes
1.3 Evolutionary fates of gene duplicates
1.3.1 Nonfunctionalization (Pseudogenization)
1.3.2 Subfunctionalization
1.3.3 Neofunctionalization
1.3.4 Expression divergence contributing to functional divergence between paralogs
1.4 Genetic redundancy by gene duplication
1.5 S. cerevisiae as a model organism to study gene duplicates
1.6 Thesis rationale
Chapter 2 Functional Redundancy and Expression Divergence in Gene Duplicates
2.1 Functional overlap in gene duplicates
2.1.1 Introduction and motivations
2.1.2 Data and methods
2.1.3 Prevalent and strong genetic backup between duplicate paralogs
2.1.4 Partial functional overlap as a key determinant of backup capacity between paralogs
2.1.5 Molecular basis of genetic buffering between duplicated genes
2.2 The contribution of cis-elements to expression divergence between duplicated genes
2.2.1 Introduction and motivations
2.2.2 Data and methods
2.2.3 Transcription factors divergence explains expression divergence between paralogs
2.2.4 Divergence in promoter chromatin structure between paralogs
2.2.5 Better explanation of expression divergence between paralogs by the composite divergence of TF regulation and promoter chromatin status
2.3 Gene duplicates in the chaperone system
2.3.1 Introduction and motivations
2.3.2 Functional divergence and redundancy in the chaperone system
2.3.3 Functional divergence leading to functional specificity
Chapter 3 Conclusion
3.1 Summary of research findings
3.2 Long-term retention of gene duplicates
3.3 Future work
Reference

List of Tables

Table 2-1 Summary of chaperones in budding yeast S. cerevisiae
Appendix Table 1 Gene duplicates in the chaperone system

List of Figures

Figure 1-1 Models of divergence in duplication paralogs
Figure 1-2 Functional dispersal of a duplicated gene
Figure 2-1 An example of calculating GO-div between gene A and B
Figure 2-2 Aggregating genetic interactions among gene duplicates
Figure 2-3 Genetic buffering between duplicates resulting from functional redundancy
Figure 2-4 Prediction of backup capacity between paralogs
Figure 2-5 Comparison of the co-cluster in both backup and non-backup pairs
Figure 2-6 A schematic figure representing the neural network applied in this study
Figure 2-7 Comparison of age distribution in this study and yeast genome background
Figure 2-8 Comparison of open in ancient and recent pairs
Figure 2-9 Divergence in nuc-sequences, open and transcription factors
Figure 2-10 Chromatin divergence accounting for expression divergence
Figure 2-11 Correlation comparison between expression divergence and TF divergence
Figure 2-12 Distinct functional enrichment of negative interactors for CPR6/CPR7
Figure 2-13 Functional study of CPR7 linking its role to transportation

List of Abbreviations

open            promoter chromatin status
Arabidopsis     Arabidopsis thaliana
BP              biological process
ChIP            chromatin immunoprecipitation
DDC             duplication-degeneration-complementation
DivTF           TF divergence
GCA             growth curve analysis
GO              gene ontology
GO-div          gene ontology divergence
KDE             kernel density estimation
Ka              non-synonymous substitution rate
Ks              synonymous substitution rate
K. waltii       Kluyveromyces waltii (also known as Lachancea waltii)
MIPS            Munich information center for protein sequences database
MSE             mean square error
PCC             Pearson correlation coefficient
PNDR            promoter nucleosome-depleted region
RSA             random spore analysis
S. cerevisiae   Saccharomyces cerevisiae
SGA             synthetic genetic array
SGD             Saccharomyces Genome Database
SSD             small-scale duplication
SVM             support vector machine
TF              transcription factor
WGD             whole-genome duplication

Chapter 1 Introduction

1.1 Overview of gene duplication

1.1.1 Background and overview

Gene duplication has long been considered a major driver of genetic novelty. Forty years ago, in his classic work Evolution by Gene Duplication, Susumu Ohno hypothesized that "natural selection merely modified, while redundancy created" (Ohno, 1970). Gene duplication is critical to the development of novel cellular functionality because one copy is free to evolve novel functions while the other retains the ancestral function (Lynch and Conery, 2000a). In recent decades, extensive studies have been carried out to clarify the evolutionary fate of duplicated genes.

In this introduction, I review previous work on gene duplication. I first introduce the two classes of gene duplication in yeast, namely whole-genome duplication (WGD) and small-scale duplication (SSD), and highlight their major differences. I then focus on the evolutionary fates of gene duplicates. Finally, I outline the thesis rationale and underscore its potential contribution to addressing currently unsolved problems.

1.2 Whole Genome Duplication (WGD) and Small Scale Duplication (SSD)

1.2.1 Two scales of gene duplications

Gene duplication can occur on two distinct scales. One is Whole Genome Duplication (WGD), which doubles the chromosomes derived from a single species (autopolyploidy) or from different species (allopolyploidy) (Chen, 2007). The other is Small Scale Duplication (SSD), which occurs locally and involves a single gene or a group of genes within a chromosomal segment.

Shortly after the completion of the genome sequencing of S. cerevisiae, Wolfe and colleagues proposed that this species had undergone WGD, based on the existence of large, non-overlapping duplicated blocks in the genome sequence (Wolfe and Shields, 1997). Wolfe and colleagues further provided criteria for establishing the yeast WGD, including conserved gene order in non-overlapping, paired chromosomal blocks with an approximately 2:1 orthologous relationship with an outgroup species (Skrabanek and Wolfe, 1998). Kellis and colleagues provided the strongest evidence confirming the genome-scale duplication (Kellis et al, 2004). They compared the whole genome sequence of S. cerevisiae with that of a related species, K. waltii, and found that a large number of aligned blocks exist in two copies in S. cerevisiae. In addition to budding yeast, a number of other eukaryotes were also found to have undergone genome-wide duplications. For example, the Arabidopsis genome has duplicated three times (Arabidopsis Genome Initiative, 2000), and the ancestor of extant vertebrates underwent two rounds of WGD (Putnam et al, 2008).

Given the prevalence of WGD events in many species, researchers have studied their contribution to the genetic, functional, and phenotypic diversity of the host organisms. WGD increases the number of regulators, which could facilitate the evolution of a more complex regulatory system (Freeling and Thomas, 2006). Maere and colleagues pointed out that WGD could explain more than 90% of the increase in regulatory genes in Arabidopsis (Maere et al, 2005). In vertebrates, WGD is found to be a contributor to the expansion of the homeobox (HOX) genes, insulin receptors, and nuclear receptors (Maere et al, 2005). It was also pointed out that WGD could increase species diversity, since the loss of different sister genes after duplication events in separated populations might give rise to reproductive isolation, yielding new species (Lynch and Force, 2000b).

In comparison with WGD, Small Scale Duplication (SSD) occurs locally and involves individual genes within a chromosomal segment. It results from unequal crossing over during meiosis, in which chromosomes exchange segments of nucleotide sequence unequally. Consequently, genes in the exchanged regions of the daughter cells vary in zygosity.

Since WGD and SSD genes have different origins, they might be subject to distinct evolutionary constraints (DeLuna et al, 2008; Guan et al, 2007; Hakes et al, 2007). It is therefore worthwhile to first explore the differences between WGD and SSD genes. In the following section I elaborate on the differences between these two evolutionary events.

1.2.2 Intrinsic difference between WGD and SSD genes

Several groups have compared WGD and SSD genes (Davis and Petrov, 2005; Guan et al, 2007). Davis et al found that WGD and SSD genes are enriched for different functional categories (Davis et al, 2005). Guan et al. found that WGD gene pairs have higher functional similarity than SSD gene pairs after applying a Bayesian data integration method to quantify functional associations (Guan et al, 2007). Later, similar conclusions were drawn by comparing shared physical interactions and by gauging Gene Ontology (GO) annotations (Hakes et al, 2007). Guan et al. also found that WGD pairs show more divergence in both regulatory regions and expression patterns (Guan et al, 2007).

In addition to these differences, the half-life of WGD and SSD genes, i.e., the time required for the number of duplicates to be reduced to half of its initial value, is also distinct. The half-life of SSD-derived genes was estimated to be about 4 million years, while the estimated half-life of WGD-derived genes is about 33 million years in S. cerevisiae (Lynch et al, 2000a). Taken together, these observations suggest that WGD and SSD genes are subject to different evolutionary pressures.

The above conclusions about the difference between WGD and SSD genes also reconciled some previously inconsistent results, in which conclusions regarding duplicated genes were drawn without distinguishing between these two categories (Guan et al, 2007). For example, studying a small dataset of WGD pairs, Wagner pointed out that there is no correlation between the fitness of single-gene knockouts and the sequence similarity of duplicated pairs (Wagner, 2000). However, a positive correlation was found by Gu et al on a large gene pool including both SSD and WGD pairs (Gu et al, 2003). Similarly, Wagner found no coupling between sequence divergence and expression divergence, while two other groups found a positive correlation using both WGD and SSD pairs (Gu et al, 2005; Gu et al, 2002; Zhang et al, 2004). These discrepancies were resolved by treating WGD and SSD genes separately (Guan et al, 2007).
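The half-life estimates quoted earlier in this section can be related to a loss rate with a short calculation. This is only a sketch: it assumes that duplicates are lost at a constant rate, so that their number decays exponentially in time. The rate symbol \lambda and the exponential form are assumptions introduced here for illustration, not quantities reported in the cited work.

```latex
N(t) = N_0 e^{-\lambda t}, \qquad
N(t_{1/2}) = \tfrac{1}{2} N_0 \;\Longrightarrow\; t_{1/2} = \frac{\ln 2}{\lambda}

\lambda_{\mathrm{SSD}} \approx \frac{\ln 2}{4\ \mathrm{Myr}} \approx 0.17\ \mathrm{Myr}^{-1}, \qquad
\lambda_{\mathrm{WGD}} \approx \frac{\ln 2}{33\ \mathrm{Myr}} \approx 0.021\ \mathrm{Myr}^{-1}
```

Under this assumption, SSD-derived duplicates would be lost roughly eight times faster than WGD-derived duplicates (33/4 ≈ 8.25).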

1.2.3 Different evolutionary constraints on WGD and SSD genes

Since WGD and SSD genes are under different constraints, the mechanisms underlying their intrinsic differences have been examined. Dosage effects, especially dosage balance, are thought to play an important role in determining the subsequent selective pressure on gene duplicates (Papp et al, 2003a). Consider a protein complex comprising two protein subunits, A and B: excessive abundance of either member will often break the equilibrium and lower fitness. For example, extra copies of A might compete with other AB-binding regulatory subunits, thereby interfering with AB's normal function. Alternatively, extra A might form non-functional homodimers rather than functional AB heterodimers. Therefore, the stoichiometry between components of protein complexes or pathways must be maintained to avoid potential dosage disruption (Papp et al, 2003a).

This idea could largely explain the different evolutionary pressures on WGD and SSD genes. The entire protein complex is duplicated in WGD but not in SSD. WGD genes are therefore more likely to be retained in protein complexes, as a synchronous increase in dosage brings minimal harm to the overall dosage equilibrium. In contrast, an SSD-derived gene product may disrupt the stoichiometry between components when it is involved in a protein complex. Strong purifying selection will then rapidly eliminate this extra copy (Papp et al, 2003a). The dosage balance hypothesis also explains the observation that in the ribosome, where dosage balance is of significant importance, SSD genes are rarely retained (Papp et al, 2003a).

In addition to dosage balance, an alternative explanation for the long-term retention of WGD genes is that duplication of an entire protein complex has a greater chance of conferring immediate benefits. If a certain complex is dosage sensitive and the increased dosage is beneficial, selection will operate on its members immediately after duplication, and the whole complex might be retained (Aury et al, 2006). Taken together, dosage effect is a crucial factor leading to distinct evolutionary constraints on WGD genes and SSD genes.

1.3 Evolutionary fates of gene duplicates

Different evolutionary models have been proposed to explain the evolutionary fates of duplicated genes. In neofunctionalization, one copy of a duplicated gene evolves novel functions, while the other copy retains the progenitor's function (Kellis et al, 2004; Lynch et al, 2000a; Wolfe and Li, 2003). Nevertheless, not all paralogs can gain novel functions, since the acquisition of beneficial mutations is uncommon (Takuno and Innan, 2009). An alternative model, subfunctionalization, argues that gene duplicates evolve merely by partitioning their ancestor's function rather than by creating functional novelty. This partition model has been observed in multifunctional genes, where daughter copies divide the progenitor's functions immediately after duplication (Force et al, 1999). Here, I categorize and summarize the well-established models regarding the evolutionary fate of gene duplicates.

1.3.1 Nonfunctionalization (Pseudogenization)

Nonfunctionalization, in which one of the two copies becomes silenced, is the most likely fate of duplicates. For example, in S. cerevisiae, only one copy has been retained for ~90% of WGD pairs since the ancient WGD event (Kellis et al, 2004). At the time of duplication, the two copies are completely identical. This state is not stable because the two paralogs are functionally redundant, which allows the accumulation of degenerative (loss-of-function) mutations. In many cases, one of the two copies finally becomes a pseudogene. By exploring the genomic data of several eukaryotes, Lynch and colleagues proposed and confirmed the rapid loss of gene duplicates and also found that the number of remaining pairs can be fitted by the survivorship function in (1.1):

$$N_S = N_0 e^{-dS} \qquad (1.1)$$

Here N_S is the number of duplicates observed at divergence level S, and N_0 and d are constants, which are fitted by linear regression of the log-transformed data (Lynch et al, 2000a).
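The fitting procedure described for the survivorship function (1.1) can be sketched as follows. This is a minimal illustration, assuming ordinary least squares on ln(N_S) against S, since ln N_S = ln N_0 - dS is linear in S. The function name `fit_survivorship` and the input values are hypothetical and are not data or code from the cited study.

```python
import math


def fit_survivorship(s_values, counts):
    """Fit N_S = N_0 * exp(-d * S) by least squares on ln(N_S).

    Returns the pair (N_0, d).
    """
    if len(s_values) != len(counts) or len(s_values) < 2:
        raise ValueError("need at least two (S, N_S) points of equal length")
    if any(c <= 0 for c in counts):
        raise ValueError("counts must be positive to take logarithms")

    log_counts = [math.log(c) for c in counts]
    n = len(s_values)
    mean_s = sum(s_values) / n
    mean_y = sum(log_counts) / n

    # Closed-form simple linear regression: ln(N_S) = intercept + slope * S
    sxx = sum((s - mean_s) ** 2 for s in s_values)
    if sxx == 0:
        raise ValueError("S values must not all be identical")
    sxy = sum((s - mean_s) * (y - mean_y)
              for s, y in zip(s_values, log_counts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_s

    # Back-transform: N_0 = exp(intercept), d = -slope
    return math.exp(intercept), -slope


# Synthetic, noise-free illustration (hypothetical numbers, not thesis data).
s_bins = [0.0, 0.1, 0.2, 0.3, 0.4]
counts = [1000.0 * math.exp(-7.0 * s) for s in s_bins]
n0, d = fit_survivorship(s_bins, counts)
print(f"N_0 = {n0:.1f}, d = {d:.2f}")
```

With real counts the points would scatter around the fitted line, and fitting on the log scale weights each divergence bin by relative rather than absolute error.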
1.3.2 Subfunctionalization

Though the nonfunctionalization model could explain the fate of most gene duplicates, researchers still observed many extant gene duplicates, especially those following the ancient polyploidy event. Several models were introduced (Figure 1-1) to explain the long-term preservation of gene duplicates. In the subfunctionalization model, gene duplicates are preserved through a process in which the ancestor's multiple functions are divided between its daughter copies. Specifically, Force et al. introduced the duplication-degeneration-complementation (DDC) model (Force et al, 1999). A gene accumulates degenerative mutations that are compensated by its paralogous copy. As a result, through subfunctionalization, paralogous copies together fulfill the function of their ancestor by undertaking complementary roles.

1.3.3 Neofunctionalization

Though the DDC model could explain the retention of duplicate genes, recent studies suggested that this model is inadequate to explain all preserved gene duplicates (He and Zhang, 2005; Li et al, 2005). For example, the DDC model makes two predictions when applied to cis-regulatory motifs: (1) the total number of cis-element motifs in duplicated pairs should decrease with time due to degeneration; (2) genes with more paralogs should tend to have fewer regulatory motifs, because the motifs are dispersed over multiple rounds of duplication. Surprisingly, Papp and colleagues found that, in contrast to these predictions, the total number of motifs in duplicated pairs remains constant across evolutionary time (calibrated by Ks), and genes with numerous paralogs do not have a particularly low number of regulatory motifs (Papp et al, 2003b). Therefore, in order to maintain the number of motifs, new motifs must emerge in gene duplicates. Such a process requires the acquisition of beneficial mutations conferring new functions (neofunctionalization) (Papp et al, 2003b).

After examining genome-wide protein-protein interaction data in budding yeast and comparing the interaction partners between paralogs, He and Zhang revealed that the DDC model is inadequate to explain the constant number of shared partners over time and confirmed the existence of neofunctionalization. As a result, they proposed a new theory, predicting that a large number of gene duplicates have passed through rapid subfunctionalization followed by prolonged and substantial neofunctionalization (He et al, 2005). In a recent study, Hittinger and Carroll provided experimental evidence in support of this hypothesis (Hittinger and Carroll, 2007). The bi-functional ancestral gene GAL1, which did not experience duplication, is still present in some yeast species. However, in Saccharomyces cerevisiae, the ancestral function of GAL1 was split and is carried out by GAL1 and GAL3 as galactokinase and co-inducer respectively, indicating subfunctionalization. Furthermore, adaptive evolution was observed in one of these sister paralogs, GAL1, indicating neofunctionalization (Hittinger et al, 2007).

1.3.4 Expression divergence contributing to functional divergence between paralogs

Extant gene duplicates diverge in function through either subfunctionalization or neofunctionalization, and the underlying mechanisms have been studied. Ohno proposed that expression divergence is an important step in the functional divergence between paralogs (Ohno, 1970). It is known that expression divergence following gene duplication can result in expression specialization across tissues or developmental processes, which is a sign of evolving adaptive functions (Huminiecki and Wolfe, 2004). For example, in humans and apes, GLUD2, a glutamate dehydrogenase, shows strong evidence of adaptive evolution and differs from the ancestral form GLUD1 (Plaitakis et al, 2003). This process seems to have been initiated by changes in its expression pattern. Consequently, GLUD2 has specific changes in allosteric sensitivity and seems better adapted to its new location, neurons (Plaitakis et al, 2003). Moreover, Wolfe and colleagues found that recent lineage-specific duplicates increase expression divergence between human and mouse in orthologous tissues (Huminiecki et al, 2004). They also found that a specialized expression pattern is a general trend stemming from gene duplication, leading to functional specificity (Huminiecki et al, 2004).

Figure 1-1 Evolutionary models following a gene duplication event

Evolutionary models following a gene duplication event are illustrated by instances of random mutations in cis-regulatory motifs. The small coloured boxes represent functional regulatory elements, while the white boxes denote non-functional elements. The large black boxes denote the transcribed regions. In the first two steps after duplication, one of the copies harbours a null mutation in its regulatory region. On the left, one copy acquires null mutations in each element and eventually becomes pseudogenized (denoted by the white boxes). The central part depicts the neofunctionalization model, in which one copy acquires a beneficial new motif. The right shows the subfunctionalization model, in which both copies function complementarily to perform the ancestral functions.

1.4 Genetic redundancy by gene duplication

Extant gene duplicates have to diverge through subfunctionalization or neofunctionalization because complete redundancy is not favoured by evolution (Kitano, 2004). However, functional overlap does exist between extant paralogs (Musso et al, 2007; Wagner, 2000). It is therefore intriguing to investigate the contribution of these redundant copies to the host organism's fitness. Wagner first introduced the idea that widespread gene duplication events could be responsible for robustness against genetic perturbations, but he was not able to offer convincing evidence due to the limited data available at the time (Wagner, 2000).

The idea of backup by gene duplicates has been supported by a number of observations of backup circuits in real networks (Kafri et al, 2005), in which one gene changes its expression profile to compensate for its lost paralog. Such expression reprogramming has been verified in the two isoenzymes Acs1 and Acs2. Despite their dissimilar expression under normal conditions, Acs1 achieves an Acs2-like response to glucose upon the deletion of Acs2 (Van den Berg, 1996). A similar case involves NHP6A and NHP6B, in which deletion of NHP6A gives rise to a three-fold increase in NHP6B synthesis (Kolodrubetz et al, 2001). Recently, using high-throughput flow cytometry, DeLuna et al performed a genome-scale study of paralog responsiveness (DeLuna et al, 2010). By comparing the protein abundance of wild-type strains with that of paralog-knockout strains, they found that paralog responsiveness is need-based and only appears when the gene function is required (DeLuna et al, 2010).

In addition, an aggravating interaction in a Synthetic Genetic Array (SGA) experiment (Tong et al, 2001) indicates genetic backup. A clear prediction of genetic backup is therefore that buffering genes should exhibit strong aggravating interactions with their paralogs. Recent studies reported a high prevalence of genetic buffering in a subset of yeast gene duplicates (Ihmels et al, 2007; Musso et al, 2008), which confirms that gene duplicates do contribute to genetic robustness. Moreover, maintaining these duplicates could confer robustness under other conditions. Duplicated pairs show condition-dependent aggravating interactions or responsiveness, which differ considerably across conditions (DeLuna et al, 2010; Musso et al, 2008).

However, it is worth mentioning that the maintenance of redundant paralogs may have explanations other than conferring robustness. It might simply reflect dosage effects, that is, the need to keep gene dosage in balance (Papp et al, 2003a).

1.5 S. cerevisiae as a model organism to study gene duplicates

Several reasons make the budding yeast S. cerevisiae a valuable model organism for scientific research. First, studying budding yeast can provide meaningful insights into our own genome, as this tiny organism contains orthologs of many human genes. Second, budding yeast can be manipulated easily owing to the ease of cell culturing and its compact genome.

In the past few years, many large-scale experiments have been undertaken in S. cerevisiae. These experiments covered almost every aspect of functional categorization. For example, S. cerevisiae was the first sequenced eukaryote (Goffeau et al, 1996), and extensive microarray experiments have been performed under multiple conditions to obtain mRNA expression profiles (Hughes et al, 2000; Spellman et al, 1998). The fitness contribution of each gene in budding yeast was determined through a comprehensive analysis of gene-deletion phenotypes (Giaever et al, 2002). A global analysis of protein localization in budding yeast is also available from a large-scale fluorescence labelling study (Huh et al, 2003). S. cerevisiae was the first eukaryote to be studied in large-scale protein interaction screens by Yeast-2-Hybrid (Y2H) (Gavin et al, 2002) and by Tandem Affinity Purification followed by Mass Spectrometry (TAP-MS) (Krogan et al, 2006). The high-throughput Synthetic Genetic Array (SGA) experiment was also developed to detect synthetic lethal or synthetic sick interactions between gene pairs, which are informative about their function (Costanzo et al, 2010; Tong et al, 2001). Gene regulation through the binding of transcription factors (TFs) to regulatory elements was studied by ChIP-chip experiments (Harbison et al, 2004; Lee et al, 2002). Moreover, ChIP-chip and ChIP-Seq experiments on nucleosome occupancy and nucleosome dynamics have been carried out (Lee et al, 2007; Shivaswamy et al, 2008), since the chromatin structure of the promoter also regulates gene expression by undergoing a remodelling process prior to TF binding (Jiang and Pugh, 2009). In this process, nucleosomes are removed to leave regulatory regions physically accessible to regulatory factors (Jiang et al, 2009).

These high-throughput experiments, together with studies in traditional biochemistry, have resulted in comprehensive functional annotations. Most of them have been collected and categorized in the publicly available Saccharomyces Genome Database (SGD). Several datasets are of specific relevance to my research.

1.6 Thesis rationale

Long-term retention of paralogs is the key issue in the study of gene duplication. Given sufficient evolutionary time, the states of extant paralogs ought to be stabilized. Such states might embrace both functional redundancy and functional divergence (see Figure 1-2). Functional redundancy contributes to genetic robustness, whereas functional differentiation produces genetic novelty and complexity. The study of functional redundancy and divergence between extant paralogs is the premise for understanding the long-term preservation of gene duplicates. However, there are still many questions without definitive answers: (1) What are the underlying determinants of genetic buffering between gene duplicates? (2) Though it has long been postulated that expression evolution is an important step in the functional differentiation between paralogs, what are the underlying determinants of expression divergence? This study addresses the above questions on the basis of a large body of extant gene duplicates.

In this study, redundancy resulting from gene duplication is investigated using a comprehensive dataset that includes both WGD- and SSD-derived genes. We established that functional similarity between duplicate genes, measured by Gene Ontology (GO) terms, is a key determinant of their backup capacity and is highly predictive of it. This study next investigated the mechanisms that lead to expression divergence.
Here, transcription factor (TF) divergence is re-evaluated using a more comprehensive dataset than in previous studies, and we demonstrate that differential TF regulation plays a more important role in the expression divergence of paralogs than previously appreciated. Moreover, the role of chromatin structure in determining expression evolution between paralogs is clarified and highlighted.

In the last part of this thesis, gene duplicates are studied in the chaperone system, which is an essential quality-control system in S. cerevisiae with many gene duplicates. Such a case study in a familiar biological system serves to examine in depth the functional association between duplicated pairs, how extensive functional dispersal is, and what role it has played in the long-term retention of gene duplicates. We note that duplicates in the chaperone system are not merely redundant; instead, they are divergent in their functions, and such divergence might lead to their preservation.

Figure 1-2 Functional dispersal of a duplicated gene

The schematic figure shows functional divergence and redundancy over long spans of evolutionary time. Functions are symbolized by the areas of rectangles. F overlap is the functional overlap between the two genes, while F x, F x represent the functional divergence of the two genes respectively. The total function F total is the sum of F x, F x and F overlap. At the time of gene duplication, the two copies are functionally identical. After a long evolutionary time, paralogs may diverge while retaining partial functional overlap, as indicated in dark red.

Chapter 2 Functional Redundancy and Expression Divergence in Gene Duplicates 2.1 Functional overlap in gene duplicates 2.1.1 Introductions and Motivations It has been long hypothesized that a duplicated copy provided by gene duplication could buffer perturbations on its progenitor copy (Wagner, 2000). However, controversy remains. On one hand, duplicated genes do show markedly elevated dispensability than singleton genes through the single deletion profile study, which has been speculated to result from mutual compensation between paralogs (Gu et al, 2003); on the other hand, He and Zhang proposed that less important genes are more likely to duplicate through the comparison between different species (He and Zhang, 2006). Therefore, the observed elevated dispensability of duplicates by Gu et al might merely result from the intrinsically higher duplicability of these less important genes rather than from the compensation between paralogs. Therefore, in order to reconcile the controversy, a systematic interrogation of genetic interaction data is an effective way to determine the extent to which yeast paralogs could buffer each other. Based on recent studies, high prevalence of mutual genetic buffering by duplicates was consistently observed on a small subset of yeast paralogs, suggesting that paralogous copies do serve to backup each other (Dean et al, 2008; DeLuna et al, 2008; Ihmels et al, 2007; Musso et al, 2008). However, when determining the characteristic of buffering paralogs, researchers found little functional similarity between paralogs, leading to the hypothesis buffering without redundancy (Ihmels et al, 2007). Similarly, Musso wrote Epistatic paralog pairs could not generally be shown to have more shared functional overlap (as gauged by physical interactions) than comparable non-epistatic paralogs and functional compensation can not necessarily be predicted based on the conservation of duplicated pairs; direct assay of function is required (Musso, 2010). 
Using the largest SGA dataset, with a total of ~5.4 million gene pairs screened (Costanzo et al, 2010), we examined the genetic buffering between duplicate pairs.

2.1.2 Data and methods
Compiling gene duplicates
To examine duplicated genes, we employed the dataset from Guan and colleagues (Guan et al, 2007). In this dataset, gene pairs with sequence similarity of at least 20% were identified as paralogs on the basis of reciprocal best matches. WGD pairs were further identified following Kellis et al. (Kellis et al, 2004). We excluded ribosome-related proteins from our analysis because they tend to display disproportionately high levels of conservation (Papp et al, 2003a). Of 374 WGD pairs and 483 SSD pairs, only 266 WGD pairs and 228 SSD pairs were present in the dataset from Costanzo et al. (Costanzo et al, 2010). The scoring scheme for the SGA experiments is described in the original paper. A significant negative interaction (ε<0 and p<0.05) between a gene pair is defined as genetic buffering.
Compiling protein complexes
Protein complexes were curated by merging annotations from the Saccharomyces Genome Database (SGD), the Gene Ontology (GO) and the Munich Information Center for Protein Sequences (MIPS).
Synonymous and non-synonymous substitution rates per nucleotide
The nonsynonymous (Ka) and synonymous (Ks) substitution rates are useful and straightforward metrics for measuring sequence variation between homologs. A nonsynonymous substitution is a nucleotide substitution that changes the encoded amino acid, whereas a synonymous substitution does not cause an amino acid replacement. The coding sequences of the above pairs were obtained from the Ensembl database, and Ka and Ks were calculated between duplicated gene pairs using the PAML package (Yang, 1997).
Measurement of functional associations
To measure functional association, we used the method of Guo et al (Guo et al, 2006), which adopts the concept of information content. The Biological Process (BP) terms of GO form the corpus, and each gene is annotated within this corpus.
Functional similarity is defined by semantic similarity as follows. Suppose there are two genes G1 and G2, where G1 is annotated with a set of terms M and G2 with a set of terms N. The semantic similarity between any two terms m and n, where m ∈ M and n ∈ N, is given by Equation (2.1) (Guo et al, 2006):

$$T(m,n) = \frac{2\,\ln\left(\min_{x \in S(m,n)} \{p(x)\}\right)}{\ln p(m) + \ln p(n)} \tag{2.1}$$

where S(m,n) is the set of parent terms shared by m and n, and p(x) is the frequency of occurrence of term x or any of its child terms. The numerator captures the information content of the most specific parent term(s) shared by m and n, and the denominator is a normalization constant that scales the score between zero and one. Thus, if two terms are both specific (in the bottom layers of the GO tree) and their common ancestor term is also very specific, the two terms receive a high score T, indicating high semantic similarity. For a pair of paralogs, the score was calculated for all possible combinations of their GO terms, and the maximal score, that of the best-matched pair of GO terms, was taken as the functional similarity (see an example in Figure 2-1). As a result, as long as two genes share some very specific functions, they are assigned a high score regardless of their divergence in other functions. In other words, this method can capture partial functional overlap, which is highly relevant to this study.
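The calculation in Equation (2.1) and the best-match rule can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this study: the term names, ancestor sets and frequencies are hypothetical toy values, and counting a term as its own ancestor (so that T(m,m)=1) is an assumption made here.

```python
import math

def term_similarity(m, n, ancestors, freq):
    """Equation (2.1): semantic similarity T(m, n) between two GO terms.

    ancestors[t] is the set of parent terms of t; freq[t] is p(t), the
    frequency of t or any of its child terms. Assumption: a term counts
    as its own ancestor, so that T(m, m) = 1.
    """
    shared = (ancestors[m] | {m}) & (ancestors[n] | {n})
    denominator = math.log(freq[m]) + math.log(freq[n])
    if not shared or denominator == 0.0:
        return 0.0  # no shared parent, or both terms carry no information
    most_specific = min(freq[x] for x in shared)
    return 2.0 * math.log(most_specific) / denominator

def go_div(terms_a, terms_b, ancestors, freq):
    """GO-div: one minus the best score over all combinations of GO terms."""
    best = max(term_similarity(m, n, ancestors, freq)
               for m in terms_a for n in terms_b)
    return 1.0 - best

# Hypothetical toy GO tree (names and frequencies are made up).
ancestors = {
    "root": set(),
    "metabolism": {"root"},
    "folding": {"root"},
    "chaperone_folding": {"folding", "root"},
    "de_novo_folding": {"folding", "root"},
}
freq = {"root": 1.0, "metabolism": 0.5, "folding": 0.05,
        "chaperone_folding": 0.01, "de_novo_folding": 0.02}

gene_a = {"chaperone_folding", "metabolism"}
gene_b = {"de_novo_folding"}
print(round(go_div(gene_a, gene_b, ancestors, freq), 3))
```

In this toy case the two genes differ in one annotation ("metabolism") but share a specific parent term ("folding"), so GO-div stays low, which is the partial-overlap behaviour described above.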

Figure 2-1 An example of calculating GO-div between genes A and B. (A) A table showing all combinations of GO terms for the gene pair A and B. For each combination of GO terms, we first calculate the functional similarity using Equation (2.1) and then assign one minus the best score as GO-div. (B) An illustration of how Equation (2.1) works. For GO terms m and n in the GO tree, the red node indicates the most specific common ancestral term.

2.1.3 Prevalent and strong genetic backup between duplicate paralogs
Among the assayed duplicate pairs, we found that 39.5% (105/266) of the WGD paralogs have significant aggravating interactions, compared with 18.4% (42/228) of the SSD paralogs. The percentage of backup pairs for WGD is comparable to what was previously reported (~35%) (Musso et al, 2008). We designed two control sets to examine whether duplicate pairs have excessive backup capacity. First, random gene pairs were chosen among the pairs assayed for genetic interactions, and we found that only 7% of them have aggravating interactions (Figure 2-2). Second, we took all the duplicated genes and randomly grouped them into pairs, and found that only 6.6% of these random pairs have aggravating interactions; this ruled out the possibility that duplicate genes intrinsically have more aggravating genetic interactions. Thus, the analysis established that duplicates indeed have excessive backup capacity. We also studied the backup strength between paralogs. Compared with both control sets, the interaction between duplicate pairs is much stronger, with average scores of -0.4154 and -0.3275 for WGD and SSD, respectively, in sharp contrast to -0.07 and -0.069 for the two random controls (P=8.54×10^-36 for WGD, P=1.87×10^-6 for SSD and P=0.06 between WGD and SSD, Wilcoxon rank-sum test). Notably, these findings agree with what was previously reported (Dean et al, 2008; DeLuna et al, 2008; Ihmels et al, 2007; Musso et al, 2008). Taken together, our analysis established that strong genetic buffering capacity is prevalent between both WGD and SSD paralogs, which provides enhanced genetic robustness in yeast cells.

Figure 2-2 Aggravating genetic interactions among gene duplicates. Left: percentage of aggravating interactions among SSD pairs, WGD pairs and random simulations. For each simulation, 1000 pairs were randomly chosen as a group and the percentage of random pairs with a negative genetic interaction was determined for this group. This was repeated 1000 times, and the distribution of the percentages was estimated from the 1000 randomized controls. Right: buffering strength between duplicates is stronger than between randomly paired genes. The x-axis represents the strength of the aggravating interaction and the y-axis denotes the cumulative density from zero to one. WGD pairs have much stronger backup strength than SSD pairs.
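The randomized controls described in the caption above can be sketched as follows. This is a minimal illustration under stated assumptions: the gene names and the `toy_negative` rule are made up for the example, whereas in the actual analysis the negative-interaction calls come from the SGA scores (ε<0 and p<0.05).

```python
import random

def negative_fraction(pairs, is_negative):
    """Fraction of gene pairs that show a negative (aggravating) interaction."""
    return sum(1 for pair in pairs if is_negative(pair)) / len(pairs)

def random_pair_control(genes, is_negative, n_pairs=1000, n_rounds=1000, seed=0):
    """Repeat n_rounds times: draw n_pairs random gene pairs from the pool
    and record the fraction of pairs with a negative interaction."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_rounds):
        pairs = [tuple(rng.sample(genes, 2)) for _ in range(n_pairs)]
        fractions.append(negative_fraction(pairs, is_negative))
    return fractions

# Hypothetical toy inputs: 50 made-up genes and an arbitrary interaction rule.
genes = [f"g{i}" for i in range(50)]

def toy_negative(pair):
    return (int(pair[0][1:]) + int(pair[1][1:])) % 10 == 0

fractions = random_pair_control(genes, toy_negative, n_pairs=100, n_rounds=20)
print(min(fractions), max(fractions))
```

Passing the pool of all assayed genes gives the first control, and passing only the duplicated genes gives the second control, in which duplicates are randomly regrouped into pairs.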

2.1.4 Partial functional overlap as a key determinant of backup capacity between paralogs
Intuitively, the ability of paralogous genes to back each other up should be correlated with their functional similarity. However, based on small datasets, previous work suggested that functional redundancy between buffering duplicates is minimal and no greater than that between paralogs without backup capacity (Ihmels et al, 2007; Musso, 2010). We noted that in these early studies, functional similarity between paralogs was calculated indirectly from divergence in expression profiles, protein interactions, or genetic interaction profiles (Ihmels et al, 2007; Musso et al, 2008). However, paralogs might buffer each other as long as they retain minimal functional overlap (see Figure 1-2). Interaction profiles are therefore likely to be an inappropriate measure of partial functional overlap. To uncover partial functional overlap, we employed GO-div to gauge functional overlap between paralogs directly from their respective GO annotations (Guo et al, 2006). Conceptually, GO-div measures the semantic similarity between the sets of GO annotations associated with a pair of genes (Guo et al, 2006). GO-div is calculated from the similarity between the best-matched GO terms of the two paralogs (see methods in section 2.1.2). A higher GO-div indicates less functional overlap between paralogs, whereas a lower GO-div indicates that the paralogs share at least some very specific functions even though they have diverged in other functions. To increase the reliability of our analysis, electronic annotations (evidence code IEA) were removed because they are assigned computationally without manual examination. Complementary to GO-div, the non-synonymous substitution rate per site (Ka) (Li, 1997) was also calculated between paralogs to indicate their coding sequence evolution.
Among the 494 duplicate pairs, we found compelling evidence against the previous statement that backup between paralogs does not require functional redundancy (Ihmels et al, 2007). We found that substantial functional overlap between paralogs (for both WGD and SSD duplicates) is a key determinant of their genetic backup capability. First, as revealed by Figure 2-3 (A, B), duplicate pairs (either WGD or SSD) are more likely to buffer each other if they have less diverged functions; this trend holds whether functional divergence is estimated by the direct measure (GO-div) or by the indirect one (Ka). Second, for the buffering pairs in both WGD and SSD, buffering strength between the paralogs is significantly correlated with their functional divergence scored by GO-div (Figure 2-3 C and D), with Pearson's R=0.34, P=3.1×10^-4 for WGD pairs and R=0.37, P=0.01 for SSD pairs. The correlation is also significant when Ka is used to approximate functional divergence between paralogs, with Pearson's R=0.41, P=1.5×10^-5 for WGD pairs and R=0.33, P=0.03 for SSD pairs. Expression divergence between duplicates is significantly correlated with their buffering strength for SSD paralogs (R=0.33, P=0.03), but not for WGD pairs, consistent with previous work showing little difference in expression divergence between backup and non-backup WGD pairs (Musso et al, 2008). Since buffering pairs tend to be more functionally similar, it is intriguing to ask whether the buffering potential of any paralog pair can be accurately predicted from its functional similarity. To test the predictability of backup capacity between paralogs, we pooled the WGD and SSD duplicates and labeled the 147 backup pairs and the remaining non-backup pairs as positive and negative samples, respectively. We characterized each pair with a feature vector, each element being a metric scoring functional divergence: Ka, sequence identity, expression divergence and GO-div. A support vector machine (SVM) was subsequently used to classify these paralogs as either backup pairs or non-backup pairs. With 3-fold cross-validation, as demonstrated in Figure 2-4, functional similarities were found sufficient to distinguish the backup pairs from the non-backup pairs, with AUC=0.74±0.05. Such high predictive power further strengthens our argument that backup between paralogs stems from their functional redundancy.
It is also important to note that GO-div, which scores the specificity of the best-matched functions between paralogs, is a strong indicator of backup capacity between paralogs: when this feature was removed and prediction was based on Ka, sequence identity and expression divergence alone, the AUC was substantially reduced to 0.67±0.04. Taken together, the tight coupling between buffering strength and functional overlap, and the predictive power of functional overlap, demonstrate that the compensatory effect between paralogs is indeed maintained by their functional overlap and that less diverged pairs tend to have stronger buffering strength. Note also that WGD and SSD paralogs have different origins and functional propensities (Davis et al, 2005; Guan et al, 2007). Therefore, the consistent observations for these two classes of duplicates suggest that the above conclusion is not biased towards particular functional categories.

Figure 2-3 Genetic buffering between gene duplicates resulting from functional redundancy. A and B show that functionally similar genes are more likely to back each other up for WGD (A) and SSD (B) paralogs, respectively, where functional similarity was measured by the overlap of GO annotations (GO-div) and coding sequence divergence (Ka). C and D show that buffering strength between paralogs is on average proportional to their functional similarity for WGD (C) and SSD (D) paralogs, respectively.

Figure 2-4 Prediction of backup capacity between paralogs. The receiver operating characteristic (ROC) curve for the prediction of backup capacity between paralogs based on functional similarities. The ROC curve (Marzban, 2004) is a two-dimensional measure of classification performance based on the true positive rate (TPR) and the false positive rate (FPR). The area under the ROC curve (AUC) is a scalar for assessing classification performance; a higher AUC indicates better overall performance. The blue diagonal line represents a random control with no classification capability. This curve, together with the AUC score, is from one random realization of the 3-fold cross-validation.
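To make the ROC and AUC quantities in the caption concrete, the following sketch computes both from classifier scores. It is an illustration only: the SVM itself is not reproduced, the scores and labels are made-up toy values, and the AUC is obtained through its rank interpretation (the probability that a positive pair scores higher than a negative pair, with ties counted as one half).

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping the decision threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores; label 1 = backup pair, 0 = non-backup pair.
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
print(auc(toy_scores, toy_labels))
print(roc_points(toy_scores, toy_labels))
```

An AUC of 0.5 corresponds to the diagonal random control, and an AUC of 1.0 to a perfect separation of backup from non-backup pairs.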

2.1.5 Molecular basis of genetic buffering between duplicated genes
In the above analysis, we saw that an appreciable proportion of paralogs, especially WGD sister paralogs, still buffer each other after a long evolutionary time. The next step is to determine the molecular mechanisms by which backup achieves long-term retention. Since a number of gene duplicates have lost their mutual buffering, we considered only the ancient WGD and SSD duplicates with backup capacity and regarded their backup as stabilized. The WGD paralogs were derived from a single WGD event that occurred ~100 million years ago, and we considered this time sufficiently long for the sequence and regulation of the paralogs to diverge and become fixed. We contrasted the 105 WGD duplicates with retained backup capacity against the 161 WGD pairs that had lost their mutual compensation. For SSD paralogs, we considered only those pairs with Ks greater than 2. In the end, we were able to compare 32 ancient SSD backup pairs (Ks>2) with the 163 non-backup pairs within the same age range (Ks>2). We first compared the sequence divergence of the non-buffering paralog pairs with that of the buffering paralog pairs. The buffering paralogs have significantly lower sequence divergence (~20% lower, p<10^-3, Wilcoxon rank-sum test). Moreover, for both WGD and SSD, we found that the buffering pairs share some characteristics beyond the sequence level. With a total of 392 literature-curated protein complexes examined, both WGD and SSD buffering pairs are more likely to reside in the same protein complex, with a percentage of ~18% for the buffering pairs compared with only ~5-8% for the non-buffering pairs (Figure 2-5). Taken together, these results reveal an elevated propensity for co-complex membership among backup duplicate pairs.
It is also worth mentioning that although both WGD and SSD paralogs can have buffering capacity, Figure 2-6 reveals a substantial difference in their rates of functional divergence. WGD pairs include far more buffering pairs than SSD paralogs and maintain much stronger buffering strength. We reasoned that this might result from the different evolutionary modes of WGD and SSD paralogs (Davis et al, 2005). It is known that dosage balance plays an important role in WGD retention (Davis et al, 2005; Papp et al, 2003b); thus WGD paralogs are expected to be under stronger functional constraints (see Figure 2-6), which reduce the rate of functional divergence (for example, lower sequence divergence than SSD pairs, as shown in Figure 2-6).
Figure 2-5 Comparison of co-clustering in backup and non-backup pairs. Both WGD and SSD buffering pairs are more likely to be associated in the same complex (indicated by the asterisks; chi-square test, p<0.05). This result suggests that co-clustering in the same complex imposes more functional constraints.

Figure 2-6 Functional divergence measured by Ka and GO-div in WGD and SSD. WGD (A) and SSD (B) buffering paralogs have reduced sequence divergence and more specific overlapping GO annotations. WGD buffering pairs have more conserved sequences than SSD buffering pairs.

2.2 The contribution of cis-elements to expression divergence between duplicated genes
2.2.1 Introduction and motivations
The above analysis confirmed extensive backup by duplicated genes and revealed a coupling between functional overlap and genetic buffering for duplicated pairs. Despite the observed prevalence of mutual compensation between paralogs, a majority of the paralogs (>80% of SSD pairs and >60% of WGD pairs) have lost their mutual buffering, suggesting that they have retained only minimal functional overlap and have diverged in function. Expression divergence has long been regarded as an important factor leading to functional divergence (Ohno, 1970). Examination of the expression patterns of duplicated genes in budding yeast has shown that expression divergence scales with evolutionary time at a rapid rate, suggesting that alterations in transcription are critical for the functional dispersal of paralogs and, subsequently, their long-term retention (Gu et al, 2005; Gu et al, 2002). However, our understanding of the mechanisms underlying expression divergence remains incomplete, as recent studies could explain only a very small proportion of expression variation. It was reported that only 2-3% (Zhang et al, 2004) or 8% (Leach et al, 2007) of the expression variation between paralogs could be explained by examining regulatory motifs recognized by TFs. We noted that these studies were based on small datasets, which may bias the results. Here, we examined a more comprehensive dataset of regulatory interactions. In addition to differential TF regulation, it is likely that other cis-influences on gene expression could further explain the weak correlation observed between TF binding and expression divergence for paralogs. Specifically, prior to TF binding, the chromatin structure of promoters has to undergo a remodelling process, whereby nucleosomes are removed to leave regulatory regions physically accessible to regulatory factors (Jiang et al, 2009).
As a large fraction of nucleosome occupancy is encoded by the flanking cis-elements (Field et al, 2009; Field et al, 2008; Kaplan et al, 2009), deviation in the cis-elements could affect nucleosome positioning and drive divergence in gene expression without observable differences in TF binding sites (Tirosh et al, 2008). Here, to further understand expression evolution between paralogs and to test the above hypothesis, we explored the chromatin structure of the promoter regions of paralogs.
2.2.2 Data and Methods
Compiling gene duplicates, transcription regulation and expression data
To examine the above hypothesis comprehensively, we applied a compendium of regulatory and gene expression data with much greater coverage than that used in previous studies. The known regulatory interactions in budding yeast were collected from two genome-wide ChIP-chip studies (Harbison et al, 2004; Lee et al, 2002), and small-scale biochemical experiments were also curated from recent literature (Balaji et al, 2006; Yu and Gerstein, 2006). This dataset covers 4,684 yeast genes, among which 298 are TFs mediating 15,451 regulatory interactions. Expression data came from three large microarray datasets covering a total of 549 physiological conditions (Gasch et al, 2000; Hughes et al, 2000; Spellman et al, 1998). Of all the gene pairs from Guan and colleagues (Guan et al, 2007) (see methods in section 2.1.2), 606 pairs were available in which both genes had annotated regulatory interactions and corresponding expression data.
Measurement of TF divergence, expression divergence and nucleosome occupancy divergence
TF regulatory divergence is the fraction of diverged TFs for a duplicated pair, calculated as one minus the fraction of TFs shared between the paralogs. The fraction of shared TFs is measured by the Jaccard index (Equation (2.2)), which for two samples is defined as the size of their intersection divided by the size of their union:

$$J(A,B) = \frac{|TF_A \cap TF_B|}{|TF_A \cup TF_B|} \tag{2.2}$$

Here TF_A and TF_B are the sets of TFs regulating paralogs A and B, the numerator is the number of shared TFs and the denominator is the total number of TFs regulating either paralog. One minus this fraction is defined as the TF regulatory divergence.
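As a small worked example of the TF regulatory divergence defined above, the sketch below computes one minus the Jaccard index for two sets of regulators. The TF labels are placeholders rather than real regulatory annotations, and raising an error for pairs without any annotated TF is a choice made for this sketch (the 606 pairs analysed here all have annotated regulatory interactions).

```python
def tf_divergence(tfs_a, tfs_b):
    """TF regulatory divergence: one minus the Jaccard index of the two TF sets."""
    a, b = set(tfs_a), set(tfs_b)
    union = a | b
    if not union:
        raise ValueError("at least one paralog must have an annotated TF")
    return 1.0 - len(a & b) / len(union)

# Placeholder TF labels for a hypothetical paralog pair.
paralog_a_tfs = {"TF1", "TF2", "TF3"}
paralog_b_tfs = {"TF2", "TF3", "TF4"}
print(tf_divergence(paralog_a_tfs, paralog_b_tfs))  # 2 shared of 4 total
```

A value of 0 means the two paralogs are regulated by exactly the same TFs, and a value of 1 means they share none.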

Expression divergence is quantified as one minus the Pearson's correlation coefficient of expression between sister paralogs across all 549 physiological conditions. In Equation (2.3), r is the Pearson's correlation coefficient of genes X and Y across a total of N conditions (N=549 here):

$$r = \frac{\sum_{i=1}^{N}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{N}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{N}(Y_i-\bar{Y})^2}} \tag{2.3}$$

where X_i and Y_i are the expression levels of X and Y under condition i. One minus r is the expression divergence.
Nucleosome occupancy divergence is calculated from promoter nucleosome-depleted region (PNDR) scores, which Field and colleagues devised to quantify the openness (or the lack of nucleosome presence) of the promoter region of each yeast gene (Field et al, 2009). In essence, this score represents the lowest average nucleosome occupancy across any 100bp region within the nuc-sequence, with a higher score indicating a more closed promoter. In this framework, nucleosome occupancy at each base within the nuc-sequence was calculated by Field et al. using a probabilistic model that takes information from flanking sequences into account, and the calculation was highly predictive of the experimentally assayed nucleosome organization (Field et al, 2008). The PNDR scores for each gene within the set of 606 duplicate pairs were combined and further normalized using kernel density estimation (KDE) of the cumulative distribution function (with a Gaussian window) across the entire genome of 5,778 genes. In this way, the scaled PNDR score represents the estimated fraction of genes having scores lower than that of a given gene. This normalization procedure neither changed the PNDR score rankings across the genome background, nor did it distort the original score distribution. The divergence in open promoter status (denoted by Δopen) between sister paralogs was then defined as the absolute difference of their scaled PNDR scores.
Construction of neural network
We explored the coupling between TF-regulatory divergence and chromatin status divergence (Δopen) in determining the corresponding expression divergence.
Since TF regulation and chromatin status do not necessarily combine linearly in determining expression divergence, we sought a regression method capable of modeling potentially non-linear relationships. A neural network is a non-linear statistical data modeling tool that is well known for finding non-linear patterns in data (Bishop, 1995). Therefore, we employed a neural network here. A neural network comprises a set of highly interconnected processing elements. The learning results rely heavily on the quality of the initial input data, the architecture of the connecting units and the efficacy of the input-output function. One of the most widely used structures is the back-propagation (BP) network, which is known for its well-designed multi-layer pattern. The BP neural network applied here comprises one hidden layer of 4 neurons with a hyperbolic tangent sigmoid transfer function and one output layer of 1 neuron with a linear transfer function. The loss function is the mean square error (MSE) (Figure 2-7). In the training procedure, the 606 duplicate pairs were randomly partitioned into training, validation and test sets 100 times, and the regression from TF divergence and Δopen to expression divergence was learned each time. The network was fitted on the training set (60%, 364 of 606 pairs), while the validation set (20%, 121 of 606 pairs) was used to prevent the network from over-fitting the data. The learned composite divergence derived from TF regulation and chromatin status was then independently and blindly tested on the test set (20%, 121 of 606 pairs).
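The architecture and data partition described above can be sketched in plain Python as follows. This is a structural illustration only, under stated assumptions: the weights are randomly initialised rather than trained, the training algorithm is not reproduced, and the input values for the example pair are hypothetical.

```python
import math
import random

def split_indices(n_pairs, seed):
    """Randomly partition pair indices into 60% training, 20% validation, 20% test."""
    indices = list(range(n_pairs))
    random.Random(seed).shuffle(indices)
    n_train = round(0.6 * n_pairs)
    n_val = round(0.2 * n_pairs)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

def init_network(n_inputs=2, n_hidden=4, seed=0):
    """Untrained 2-4-1 network with random weights in [-1, 1] (assumption)."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(net, x):
    """Hidden layer with tanh transfer function, then one linear output neuron."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["W1"], net["b1"])]
    return sum(w * h for w, h in zip(net["W2"], hidden)) + net["b2"]

def mean_squared_error(predictions, targets):
    """MSE loss between predicted and observed expression divergence."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

train_idx, val_idx, test_idx = split_indices(606, seed=1)
net = init_network()
# Inputs: [TF divergence, delta-open] of one hypothetical duplicate pair.
prediction = forward(net, [0.5, 0.2])
print(len(train_idx), len(val_idx), len(test_idx), prediction)
```

Repeating `split_indices` with 100 different seeds would give the 100 random partitions; fitting the weights on each training set is the step omitted here.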

Figure 2-7 A schematic representation of the neural network applied in this study. The input layer takes the TF divergence and the promoter chromatin structure divergence. The hidden layer consists of 4 neurons with a hyperbolic tangent sigmoid transfer function. The output layer contains 1 neuron with a linear transfer function and yields a value corresponding to the expression divergence learned from the divergence of both TF regulation and chromatin status.
2.2.3 Transcription factor divergence explains expression divergence between paralogs
Using the collected expression and TF-regulatory datasets, we observed a more significant correlation between expression divergence and TF regulatory divergence for paralogs (Pearson's R=0.367, P<10^-41 and Spearman's ρ=0.372, P<10^-41). This result suggests a stronger association between these two variables than previously reported (e.g., R=0.15 or 0.27; (Zhang et al, 2004) and (Leach et al, 2007), respectively). However, even for all duplicates, the proportion of expression divergence explained by TF divergence was still very low (~13.5%, i.e., the square of Pearson's R, 0.367^2). We thus investigated whether the divergence in promoter chromatin structure between paralogs could help explain the pattern of expression divergence.
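The quantities used in this section, expression divergence as one minus Pearson's r and the proportion of variance explained as the square of R, can be illustrated with a short sketch. The expression profiles below are made-up toy vectors over five conditions rather than the 549 conditions analysed here.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def expression_divergence(x, y):
    """One minus Pearson's r across conditions."""
    return 1.0 - pearson_r(x, y)

def variance_explained(r):
    """Proportion of variance explained by a linear relationship with correlation r."""
    return r * r

# Hypothetical expression profiles of two sister paralogs.
gene_x = [0.1, 0.5, 0.9, 1.2, 0.3]
gene_y = [0.2, 0.4, 1.0, 0.9, 0.5]
print(round(expression_divergence(gene_x, gene_y), 3))
print(round(variance_explained(0.367), 3))  # the ~13.5% quoted above
```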

2.2.4 Divergence in promoter chromatin structure between paralogs
In yeast, the accessibility of the promoter of a gene is usually determined by nucleosome occupancy over the 200bp region upstream of its translation start site (Field et al, 2009; Lee et al, 2007; Shivaswamy et al, 2008), and these upstream sequences are referred to as nuc-sequences. The normalized PNDR score described in the methods was applied to assess the divergence in chromatin structure, and nucleosome occupancy between paralogs was then examined. Reliable (i.e. non-saturated) Ks values could be obtained for 147 paralog pairs (Ks≤2) from the calculated synonymous (Ks) and non-synonymous (Ka) substitution rates per nucleotide between sister paralogs. While not a complete set, these 147 pairs do not show any noticeable bias when compared with the complete paralog set and were thus considered a representative set. First, the relationship between divergence in promoter nucleosome status and Ks was explored. Divergence in promoter nucleosome status between sister paralogs was significantly correlated with Ks (Pearson's R=0.24, P=3.3×10^-3 and Spearman's ρ=0.28, P=6.9×10^-4). Moreover, as Ks is a good indication of time (Gu et al, 2002; Li, 1997), we investigated whether chromatin status scales with divergence time. The differential promoter chromatin status and Ks of 101 SSD pairs (Ks<2) were significantly correlated (Spearman's ρ=0.39, P=5.74×10^-4), which implies that chromatin structure is identical at the time of gene duplication and gradually diverges afterwards, so that divergence in promoter chromatin structure increases with duplicate age. Next, we compared the recent gene duplicates with ancient ones.
In all, 79 relatively ancient duplicates (1<Ks≤2) and 31 very recent duplicates (Ks≤0.1) were compared, and the ancient duplicates were much more highly diverged in promoter openness (Figure 2-8; median Δopen=0.02 for the recent pairs versus Δopen=0.23 for the ancient duplicates, P=6.4×10^-6, Wilcoxon rank-sum test). Nonetheless, these ancient duplicates were still more similar in promoter openness than unrelated pairs of genes sampled from the genome (median Δopen=0.30 for 1,000 randomly paired genes, P=0.02, Wilcoxon rank-sum test). Notably, among these ancient pairs, there is an excess of sister paralogs with little diverged promoter status, with Δopen≤0.05 (P<1×10^-4, chi-square test), which suggests the presence of selection for maintaining similar promoter chromatin structures even between distant paralogs.
Figure 2-8 Comparison of Δopen in ancient and recent pairs. Δopen in ancient and recent pairs is compared using the cumulative density function. The x-axis denotes the normalized Δopen and the y-axis denotes the cumulative density. The Wilcoxon rank-sum test indicates a significant difference between the two samples (P=2.42×10^-5), with recent pairs more similar in chromatin structure.
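The Δopen values compared in Figure 2-8 are differences of PNDR scores that were scaled with a Gaussian-kernel estimate of the cumulative distribution function (section 2.2.2). A minimal sketch of such a scaling is given below. The bandwidth value and the genome-wide scores are hypothetical, and the exact kernel settings used in this study are not specified here, so this should be read as one possible realisation rather than the actual procedure.

```python
import math

def gaussian_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def scaled_pndr(score, genome_scores, bandwidth):
    """Kernel estimate of the fraction of genes whose PNDR score is below `score`."""
    total = sum(gaussian_cdf((score - s) / bandwidth) for s in genome_scores)
    return total / len(genome_scores)

def delta_open(score_a, score_b, genome_scores, bandwidth):
    """Absolute difference of the scaled PNDR scores of two sister paralogs."""
    return abs(scaled_pndr(score_a, genome_scores, bandwidth)
               - scaled_pndr(score_b, genome_scores, bandwidth))

# Hypothetical genome-wide PNDR scores and bandwidth (assumptions).
toy_genome = [0.10, 0.20, 0.25, 0.40, 0.60, 0.80]
toy_bandwidth = 0.05
print(round(delta_open(0.22, 0.65, toy_genome, toy_bandwidth), 3))
```

Because the Gaussian CDF is increasing, the scaling preserves the ranking of the raw scores, in line with the property noted in the methods.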

We then investigated whether diverged nucleosome occupancy could be explained by sequence conservation of the nuc-sequences (i.e. the promoter regions where nucleosomes reside). The Kimura distance is a well-established and widely used measure of DNA difference based on the number of nucleotide substitutions, which takes the difference between transitional and transversional substitutions into account (Kimura, 1980). The Kimura distance between all the paralogous nuc-sequences was calculated, and this sequence divergence is henceforth termed Knuc. Not surprisingly, for the 31 recent duplicate pairs (Ks≤0.1), Knuc is highly correlated with Δopen (Pearson's R=0.83, P=1.05×10^-8 and Spearman's ρ=0.90, P=5.82×10^-12). The significant correlation remains observable for pairs with intermediate divergence time (0.1≤Ks<1); however, for the 79 ancient duplicate pairs (1<Ks≤2), the correlation is substantially reduced (Pearson's R=0.18, P=0.11 and Spearman's ρ=0.20, P=0.08; Figure 2-9). This observation indicates that most of the recent duplicate pairs have relatively little divergence in nuc-sequences and chromatin structure (Figure 2-9; left panel). For ancient pairs (Figure 2-9; middle panel), however, although the nuc-sequences have substantially diverged, an appreciable proportion of sister paralogs show little difference in promoter chromatin status. We further highlighted this discordance by comparing the distributions of conservation values for these ancient duplicates. As shown in Figure 2-9 (right panel), the nuc-sequences are clearly more divergent than the promoter openness between paralogs for the ancient pairs. On the other hand, examination of TF regulation for these ancient duplicates shows that TF regulation has completely diverged for most ancient duplicate pairs, apparently consistent with the highly degenerate nature of TF binding sites.
Therefore, even though ancient duplicates share few common transcription factors and have highly diverged nuc-sequences, their chromatin status is still fairly conserved. It is likely that, despite the divergence in molecular function, these sister paralogs still maintain the chromatin states of their promoters and are still regulated in the same manner, because they may remain in the same broad functional categories.
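For reference, the Kimura two-parameter distance underlying Knuc can be sketched as below, using the standard formula d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences. The sketch assumes the two nuc-sequences are already aligned to equal length and skips gapped or ambiguous sites; the example sequences are made up, and the alignment procedure actually used is not reproduced here.

```python
import math

PURINES = {"A", "G"}

def kimura_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only sites where both sequences carry an unambiguous base.
    sites = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    if not sites:
        raise ValueError("no comparable sites")
    # Transition: purine<->purine or pyrimidine<->pyrimidine change.
    transitions = sum(1 for a, b in sites
                      if a != b and (a in PURINES) == (b in PURINES))
    # Transversion: purine<->pyrimidine change.
    transversions = sum(1 for a, b in sites
                        if a != b and (a in PURINES) != (b in PURINES))
    p = transitions / len(sites)
    q = transversions / len(sites)
    # math.log raises ValueError when divergence is saturated.
    return -0.5 * math.log(1.0 - 2.0 * p - q) - 0.25 * math.log(1.0 - 2.0 * q)

# Made-up aligned sequences differing by one transition (A -> G).
print(round(kimura_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```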

Figure 2-9 Divergence in nuc-sequences, promoter chromatin status (Δopen) and transcription factors between sister paralogs. Knuc and Δopen were normalized between 0 and 1 by dividing by their respective maximums. The left panel shows the comparison for recent duplicates (Ks < 0.1), with each row representing a duplicate pair. The middle panel shows the same comparison for ancient duplicates (1 < Ks < 2). A histogram of Δopen, Knuc and TF divergence (DivTF) between sister paralogs is shown in the right panel. These results show conserved promoter chromatin structure between ancient paralogs.

2.2.5 Better explanation of expression divergence between paralogs by the composite divergence of TF regulation and promoter chromatin status
The divergence of nucleosome occupancy between duplicated genes scales with time. The next step is to investigate the potential involvement of chromatin divergence in expression evolution between paralogs. It is reasonable to postulate that diverged chromatin structure plays an important role in differential expression dynamics. Using the same 606 duplicated pairs analyzed in section 2.2.3, we found that expression divergence between sister paralogs is significantly correlated with Δopen (Pearson's R=0.21, P=2.29×10^-7 and Spearman's ρ=0.19, P=3.84×10^-6). Although this correlation is weaker than the comparable correlation between expression divergence and divergence in TF regulation (R=0.367, see section 2.2.3), it is likely that the observed expression divergence can be better accounted for when information from both TF regulation and chromatin status is combined. As the two play different roles in regulating expression, it is worthwhile to test this idea with an appropriate model. As the coupling of divergence in TF regulation and differential chromatin status (Δopen) between duplicates is not necessarily linear, a neural network using Levenberg-Marquardt back-propagation (BP) was trained to learn the relationship between TF regulation and chromatin status as described above (see methods in section 2.2.2). Through 100 simulations with randomly chosen training, validation and test sets, we derived an empirical distribution of the estimated correlation between the composite divergence and expression divergence. The composite divergence derived from divergence in TF regulation and chromatin status was found to be significantly correlated with expression divergence.
As shown in Figure 2-10, one random realization from the 100 simulations indicated that ~22% of expression divergence could be explained by the composite divergence of TF regulation and promoter structure, with Pearson's R=0.47, P=5.7×10^-8 and Spearman's ρ=0.44, P=4.7×10^-7. The distribution of Pearson's correlation for the composite divergence across the 100 simulations is also shown in Figure 2-10B (red bars), in contrast to the weaker correlation for TFs alone (blue bars), which was derived from a matched control with the same sampling protocol. We recognize that these results might not explain expression divergence between paralogs completely; some other potential mechanisms are therefore discussed below (see Future Work in section (3.3)).

To demonstrate that the superiority of the combined TF and nucleosome occupancy approach in determining expression divergence is not a by-product of applying non-linear regression to one metric but not the other, TF divergence was regressed against expression divergence using the same neural network, instead of directly computing their Pearson Correlation Coefficient (PCC). After the same 100 blind tests, we again found that the correlation of TF divergence after neural network mapping (median R=0.38) was significantly lower than that of the composite divergence (P=6×10^-5, Wilcoxon rank-sum test, see Figure 2-11). Next, to ensure that the observed superiority of the combined metric was indeed due to the differential chromatin structure of sister paralogs, we matched the TF divergence for each pair with a randomly selected open value and computed the correlation of the randomized composite divergence with expression divergence through 100 blind tests. The randomized correlation is substantially lower than that of the real composite divergence (P=6×10^-6, Wilcoxon rank-sum test, see Figure 2-11), establishing the role of chromatin structure in expression divergence between paralogs.

Since WGD and SSD pairs are of different origins, we performed additional experiments to ensure that the above results are not the result of sampling bias towards a particular group. If our experiment were biased solely towards WGD or SSD genes, we would not expect to see the correlation when the regression model is trained on one dataset (WGD or SSD) and tested on the other (SSD or WGD).
However, even when pairs of one origin were used for training, the correlation tested on the set of the other origin was still highly significant, with R=0.35, P=1×10^-11 when training on WGD and testing on SSD, and R=0.34, P=4×10^-8 when training on SSD and testing on WGD. In addition, in our analysis the test sets were randomly chosen from all the duplicate pairs 100 times, and the regression results were estimated from the distribution of the 100 random samplings. This protocol minimizes the sampling bias towards a particular group (WGD or SSD). Therefore, this conclusion applies to duplicate pairs of different origins. Taken together, these results demonstrate that expression divergence between paralogs can be better explained by the combination of TF regulation and promoter chromatin structure than by TF alone.

Figure 2-10 Chromatin divergence accounting for expression divergence. (A) The correlation between expression divergence and the composite divergence in TF regulation and chromatin structure. Data are from one of the 100 simulations in a blind test. The reference line Y=X (thin blue line) represents perfect linear correlation. (B) Histogram of correlations between expression divergence and TF divergence (blue bars) or the composite (TF + chromatin) divergence (red bars), derived from 100 randomizations. A low P-value (Wilcoxon rank-sum test) indicates that the composite divergence is significantly more correlated with expression divergence than TF divergence alone.
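The comparison of two distributions of correlations by a Wilcoxon rank-sum test can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the large-sample normal approximation without tie or continuity corrections, so its P-values may differ slightly from those of a statistics package, and the two samples of correlations are synthetic draws, not the values behind Figure 2-10B.

```python
import random
from math import sqrt, erfc

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def rank_sum_test(a, b):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p); z > 0 means values in `a` tend to be larger than in `b`.
    """
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2            # expected rank sum under H0
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # its standard deviation under H0
    z = (w - mean) / sd
    p = erfc(abs(z) / sqrt(2))               # two-sided tail probability
    return z, p

# Synthetic correlations from 100 hypothetical splits per model.
rng = random.Random(1)
composite_r = [rng.gauss(0.45, 0.05) for _ in range(100)]
tf_only_r = [rng.gauss(0.38, 0.05) for _ in range(100)]

z, p = rank_sum_test(composite_r, tf_only_r)
print(f"z = {z:.2f}, two-sided P = {p:.2e}")
```

A rank-based test is a natural choice here because correlation coefficients from repeated splits are bounded and need not be normally distributed.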

Figure 2-11 Comparison of correlation between expression divergence and TF divergence. The cumulative density of the correlation between expression divergence and TF divergence alone (after neural network mapping; green curve), the randomized composite divergence derived from TF divergence and shuffled open (blue curve), and the real composite divergence derived from paired TF and open (red curve). A lower curve indicates a higher correlation with expression divergence.
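Two ideas behind Figure 2-11 can be made concrete with a small sketch: why a lower cumulative curve corresponds to higher correlations, and how a shuffled control breaks the pairing between TF divergence and open. The correlation values are hypothetical, and drawing the random open values as a permutation (rather than with replacement) is an assumption.

```python
import random

def ecdf(sample, x):
    """Empirical cumulative density: fraction of sample values <= x."""
    return sum(v <= x for v in sample) / len(sample)

def shuffle_open(tf_div, open_div, seed=0):
    """Pair each TF divergence with an open value from a random permutation.

    The TF values keep their order; only the pairing with open is destroyed.
    """
    shuffled = list(open_div)
    random.Random(seed).shuffle(shuffled)
    return list(zip(tf_div, shuffled))

# Hypothetical correlations from a handful of blind tests.
tf_only = [0.30, 0.35, 0.38, 0.40, 0.42]
composite = [0.42, 0.45, 0.47, 0.50, 0.52]

for x in (0.35, 0.41, 0.47):
    # At every threshold the composite curve lies at or below the TF-only
    # curve, because fewer of its (larger) correlations fall under x.
    print(x, ecdf(tf_only, x), ecdf(composite, x))

control = shuffle_open([0.1, 0.5, 0.9], [0.2, 0.4, 0.6])
print(control)
```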

2.3 Gene duplicates in the chaperone system

2.3.1 Introduction and motivations

The chaperone system is an essential quality-control system that is conserved across the three domains of life. Chaperones assist in protein folding and engage in multiple biological processes, such as protein assembly, intracellular protein transport and protein degradation (Hartl and Hayer-Hartl, 2009). Specifically, when proteins tend to aggregate because of unfavourable environmental conditions, chaperones maintain the equilibrium of the system by refolding or degrading the misfolded proteins. Recently, the importance of some chaperones has been further highlighted. For example, HSP90 was found to act as a phenotypic buffer that can counter unfavourable genetic mutations (Rutherford and Lindquist, 1998). HSP90 also contributes to the evolution of new phenotypes (Cowen and Lindquist, 2005). In this study, we collected 75 known chaperones of S. cerevisiae from the literature. Some of them are regarded as cofactors/co-chaperones, which function together with chaperones. These cofactors either regulate chaperones' activities or transfer substrates to chaperones (Caplan et al, 2003; Gong et al, 2009). These chaperones are divided into different families according to their signature domain structure or known functions (see Table 2-1). Here, a more focused study of gene duplication was performed in the yeast chaperone system. Such a case study in a familiar biological system serves to examine in depth the functional overlap between paralogs and the role that functional dispersal has played in the long-term preservation of gene duplicates. Moreover, a close examination of this system from the perspective of evolution could also shed new light on chaperone functions. Therefore, studying gene duplicates in this system is of particular interest.

Table 2-1 Summary of chaperones in budding yeast S. cerevisiae

Family | Total Number | Standard Name
Hsp70 | 14 | ECM10, KAR2, LHS1, SSA1, SSA2, SSA3, SSA4, SSB1, SSB2, SSC1, SSE1, SSE2, SSQ1, SSZ1
Hsp40 | 22 | APJ1, CAJ1, CWC23, DJP1, ERJ5, JAC1, JEM1, JID1, JJJ1, JJJ2, JJJ3, MDJ1, MDJ2, HLJ1, PAM18, SCJ1, SEC63, SIS1, SWA2, XDJ1, YDJ1, ZUO1
Small heat shock proteins | 7 | HSP12, HSP26, HSP31, HSP32, HSP33, HSP42, SNO4
CCT/TRiC complex | 8 | CCT2, CCT3, CCT4, CCT5, CCT6, CCT7, CCT8, TCP1
Prefoldin/GimC | 6 | GIM3, GIM4, GIM5, PAC10, PFD1, YKE2
AAA+ family | 3 | HSP78, HSP104, MCX1
HSP90 | 2 | HSC82, HSP82
HSP60 | 1 | HSP60
HSP90 cofactor | 11 | AHA1, CDC37, CNS1, CPR6, CPR7, HCH1, PIH1, PPT1, SBA1, STI1, TAH1
HSP60 cofactor | 1 | HSP10

The above table lists the name and family of S. cerevisiae chaperones. Hsp40s and Hsp70s function together as the Hsp70 system, and they play crucial roles in protein folding, protein translocation and the heat shock response in different compartments. The Hsp90 family has multiple members, with HSP82 and HSC82 at its centre. Hsp90s can buffer the effects of genetic mutations, and they are involved in signal transduction, chromatin remodelling and transport processes. Prefoldin (PFD) helps to transfer protein substrates to CCT/chaperonin. The CCT complex is the chaperonin system in the cytosol, while Hsp60 and Hsp10 work together as the chaperonin system in mitochondria. The AAA+ family and small heat shock proteins are involved in multiple processes including protein disaggregation and protein degradation.

2.3.2 Functional divergence and redundancy in the chaperone system

It is known that functional redundancy exists between many paralogs in the yeast chaperone system, such as HSP82/HSC82, SSA1/SSA2 and SSB1/SSB2 (Gautschi et al, 2002; Gong et al, 2009; Matsumoto et al, 2006). We performed a more comprehensive study of the gene duplicates using the same dataset from Guan et al (Guan et al, 2007) as discussed in section (2.1.2). In total, 17 gene duplicates were found (see Appendix Table 1). We next examined the genetic buffering between these duplicates. Of the ten pairs that have genetic interaction scores, six have lost their mutual buffering, indicating functional divergence. For the remaining four pairs, which show aggravating genetic interactions, it is intriguing to investigate the extent of functional overlap between the sister paralogs. We found that three of these four buffering pairs belong to the cytoplasmic HSP70 family. Therefore, we next studied the gene duplicates of the cytoplasmic HSP70 family. We found that the four cytoplasmic HSP70 pairs appear highly redundant: they share high sequence similarity (>75%) and maintain strong aggravating interactions (SSA1/SSA2, -0.76; SSB1/SSB2, -0.95; SSE1/SSE2, -0.36; the score for SSA3/SSA4 is unavailable). We next asked whether these pairs are merely redundant. Several lines of evidence indicate that they are in fact functionally divergent. Previous studies observed substrate specificity and different deletion phenotypes in both the SSA1/SSA2 and SSA3/SSA4 pairs, indicating functional divergence (Kabani and Martineau, 2008). Evolutionary analysis also found sites under strong selection in one copy of each of the SSA1/SSA2 and SSB1/SSB2 pairs (Takuno et al, 2009). In our analysis, we found that each gene has its own genetic interactors, which suggests functional disparity.
This is because, if a gene were functionally identical to its paralog, we would not expect to detect its genetic interactions with other genes, owing to compensation from its paralogous copy. Thus, for sister paralogs, genetic interactions only develop on the subset of diverged functions (Figure 1-2). For example, for the SSE1/SSE2 pair, SSE1 has ~350 negative interactors while SSE2 has only ~50. Similarly, SSA1/SSA2 and SSB1/SSB2 have their own genetic interactors. In addition, we found that these pairs are under different transcriptional regulation (the expression-pattern PCC for both SSE1/SSE2 and SSB1/SSB2 is less than 0.10), indicating their different usage under varying conditions. Taken together, these results lend support to functional divergence in the cytoplasmic HSP70 family, suggesting that duplicates in the chaperone system are not merely redundant.

2.3.3 Functional divergence leading to functional specificity

We next analyzed cases in which gene duplicates could promote substantial functional diversity given long spans of evolutionary time. We found completely diverged functions between the CPR7/CPR6 pair, which serve as cofactors in the HSP90 machinery. Previously, CPR6 and CPR7 were known as peptidyl-prolyl cis-trans isomerases that catalyze the cis-trans isomerization of peptide bonds N-terminal to proline residues, and they were considered functionally identical (Mayr et al, 2000). However, the substantial sequence divergence between this pair (Ka=0.64) and the potentially long evolutionary time (Ks=4.06) motivated us to study their functional divergence. First, we found that this paralogous pair has lost genetic buffering (ε>0), indicating minimal functional overlap. Second, we examined their genetic interaction profiles to further explore their functions, as genetic interaction profiles have proven useful in the study of gene function (Collins et al, 2007; Fiedler et al, 2009). CPR6 and CPR7 have different interaction partners, suggesting functional differentiation. The negative hits of CPR7 are enriched in the transport/vesicle-mediated transport and vesicle organization categories, while CPR6's negative hits are enriched in cell-cycle related categories (Figure 2-12).
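The sign convention used for genetic buffering in this chapter can be made explicit with a short sketch. It assumes the common multiplicative definition of the genetic interaction score, ε = W_ab - W_a × W_b, which is not restated in this section; the function names and the fitness values in the last example are hypothetical, and the three pair scores are the ones quoted in section 2.3.2, read as negative values.

```python
def epistasis(w_a, w_b, w_ab):
    """Deviation of double-mutant fitness from the multiplicative expectation.

    Assumption: epsilon = W_ab - W_a * W_b, with fitness relative to wild type.
    """
    return w_ab - w_a * w_b

def classify(score):
    """Interpret the sign of a genetic interaction score between paralogs."""
    if score < 0:
        return "aggravating (buffering retained)"
    if score > 0:
        return "buffering lost"
    return "neutral"

# Scores quoted in section 2.3.2 for the cytoplasmic HSP70 pairs.
quoted_scores = {"SSA1/SSA2": -0.76, "SSB1/SSB2": -0.95, "SSE1/SSE2": -0.36}
for pair, score in quoted_scores.items():
    print(pair, score, classify(score))

# Hypothetical fitness values: the double mutant is sicker than expected
# from the two single mutants, so the score is negative.
eps = epistasis(w_a=0.9, w_b=0.8, w_ab=0.5)
print(round(eps, 2), classify(eps))
```

Under this convention, a pair in which each copy compensates for loss of the other shows a double-mutant fitness well below the multiplicative expectation, which is why retained buffering appears as a negative score.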
A detailed analysis was then performed to further clarify their distinct functions through a direct comparison of their genetic interaction profiles. We found a strong correlation between CPR7 and cog2-1 (PCC=0.43, p<1e-20) (Figure 2-13 A). CPR7 also has strong positive correlations with other COG complex components (PCC~0.2, p<1e-8). The high correlations suggest functional similarity between CPR7 and the COG complex. Moreover, we drew an integrated interaction map of CPR7 and its interactors in the Golgi and vesicle transportation functional categories. These interactors were grouped according to their functional subgroups (Figure 2-13 B). Such interactions with many genes in the transport pathway strongly indicate CPR7's functional involvement in it. In contrast, CPR6 has no positive correlations with COG complex components (PCC= -0.04 to -0.06) and has no such genetic or physical interactors in this category. Instead, the physical and genetic interaction data strongly suggest a function for CPR6 in spindle dynamics. CPR6's genetic interaction profile is similar to that of CIN8 (PCC~0.2, p<1e-8), a kinesin motor protein involved in mitotic spindle assembly and chromosome segregation (Geiser et al, 1997; Gerson-Gurwitz et al, 2009). In addition, CPR6 has genetic interaction profiles similar to those of several other genes, including CLB4, KIP3, STU2, STU1 and KAR9, which are all relevant to spindle dynamics. Specifically, CPR6 has negative interactions with multiple components of the microtubule motor, including PAC11, DYN3, DYN1, NIP100, KAR3 and NUM1 (Lee et al, 2005), indicating its functional overlap with this microtubule motor. Moreover, CPR6 has physical interactions with CIN8 and SLI15 (Figure 2-14).

Figure 2-12 Distinct functional enrichment of negative interactors for CPR6/CPR7. The y-axis represents significant GO-slim BP categories, and the x-axis represents the fold enrichment. The enrichment is calculated using the hypergeometric distribution. GO terms with a p-value smaller than 0.01 are presented here.
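An enrichment calculation of the kind summarized in this figure can be sketched as follows. The counts are hypothetical and the function names are illustrative; only the use of the hypergeometric distribution and the 0.01 cut-off are taken from the caption.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K belong to the category."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def fold_enrichment(k, N, K, n):
    """Observed fraction of hits in the category divided by the background fraction."""
    return (k * N) / (n * K)

# Hypothetical counts: 1000 background genes, 50 annotated to a GO-slim
# category, 100 negative interactors, 15 of which carry the annotation.
N, K, n, k = 1000, 50, 100, 15

p_value = hypergeom_sf(k, N, K, n)
fold = fold_enrichment(k, N, K, n)
print(f"fold enrichment = {fold:.1f}, P = {p_value:.2e}")
print("reported" if p_value < 0.01 else "not reported")
```

The upper tail is used because enrichment asks how likely it is to see at least the observed number of annotated genes among the interactors by chance.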

Figure 2-13 Functional study of CPR7 linking its role in transportation. (A) Genetic interaction profiles of CPR7 and cog2-1; the PCC is 0.43, p<1e-20. (B) An integrated physical and genetic interaction map reveals many of CPR7's targets in the Golgi trafficking and transportation categories. These interactors were grouped according to their functional subgroups.
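Comparing two genetic interaction profiles by PCC, as in panel (A), amounts to correlating two vectors of interaction scores measured against the same set of genes. The sketch below shows one way to do this when some scores are missing; restricting the calculation to positions measured in both profiles is an assumption about the handling of missing data, and the scores themselves are hypothetical.

```python
from math import sqrt

def profile_pcc(profile_a, profile_b):
    """Pearson correlation over positions where both profiles have a score.

    Missing measurements are represented by None and skipped pairwise.
    """
    pairs = [(a, b) for a, b in zip(profile_a, profile_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three shared measurements")
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    return cov / sqrt(var_a * var_b)

# Hypothetical interaction scores of two query genes against six array genes.
query_1 = [-0.4, 0.1, None, -0.8, 0.0, 0.2]
query_2 = [-0.3, 0.0, -0.5, -0.6, None, 0.1]

pcc = profile_pcc(query_1, query_2)
print(f"profile PCC over shared genes: {pcc:.2f}")
```

Two genes with similar profiles tend to be sick in combination with the same partners, which is why a high profile PCC is read as evidence of related function.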