Genetics: Published Articles Ahead of Print, published on February 23, 2012 as 10.1534/genetics.111.137893 Note February 3, 2011 The abundance of deleterious polymorphisms in humans Sankar Subramanian Griffith School of Environment, Griffith University, 170 Kessels Road, Nathan Qld 4111, Australia 1 Copyright 2012.
Running head: Temporal pattern of deleterious polymorphisms Keywords: human evolution, deleterious mutations, disease-associated mutations, genome-wide association and population genetics theory Address for correspondence: Dr. Sankar Subramanian Environmental Futures Centre School of Environment Griffith University 170 Kessels Road Nathan QLD 4111 Australia Phone: +61-7-3735 7495 Fax: +61-7-3735 7459 E-mail: s.subramanian@griffith.edu.au 2
ABSTRACT Here I show a gradual decline in the proportion of deleterious nonsynonymous SNPs (nsnps) from tip to root of the human population tree. This study reveals that up to 48% of nsnps specific to a single genome are deleterious in nature, which underscores the abundance of deleterious polymorphisms in humans. 3
The neutral theory of molecular evolution predicts that amino acid replacement polymorphisms are neutral and largely governed by mutation and genetic drift (KIMURA and OHTA 1971). However, Ohta (1973) suggested that some amino acid polymorphisms could be slightly deleterious and that they are modulated by purifying selection and drift. According to her prediction, rare amino acid polymorphisms are, on average slightly deleterious and the number of these variants in a population is generally more than that expected under the strict neutral hypothesis (OHTA 1973). Later, Kimura (1983) proposed that deleterious mutations can contribute significantly to diversity, but at high frequencies they are selected against and prevented from becoming fixed. These predictions were examined in a number of studies using RFLP (EXCOFFIER 1990; MERRIWETHER et al. 1991; WHITTAM et al. 1986) or DNA sequence data (TAJIMA 1989), which indeed showed an excess in the number of rare variants than that expected using Watterson s test (WATTERSON 1978) and/or Tajima s test (TAJIMA 1989). Using a different approach, studies on mitochondrial genes compared the diversity and divergence at synonymous and nonsynonymous sites and showed that the ratio of intra-specific diversities at nonsynonymous- to synonymous positions ( =dn/ds) was significantly higher than the ratio of inter-specific divergence at these sites (BALLARD and KREITMAN 1994; NACHMAN et al. 1994; RAND et al. 1994). This suggests an excess of nonsynonymous polymorphisms (nsnps) in populations compared to fixed replacement mutations. The excess fraction of nsnps could be enriched with rare alleles, which are potentially deleterious and natural selection eliminates them from the population over time. This pattern appears to be universal as it was observed by a number of studies on mitochondrial genes of human (HASEGAWA et al. 1998; NACHMAN et al. 1996; RAND and KANN 1996), mouse (NACHMAN et al. 1994; RAND and KANN 1996), fruit fly 4
(BALLARD and KREITMAN 1994; RAND et al. 1994) as well as on nuclear genes of bacteria (ROCHA et al. 2006) and viruses (HOLMES 2003). Using a number of nuclear loci, Hughes et.al (2003; 2005) showed a reduced diversity at nonsynonymous sites of human genes suggesting purifying selection, which in turn, implies that slightly deleterious nsnps are prevalent in human populations. Large-scale genome-wide studies on mitochondrial (KIVISILD et al. 2006; RUIZ- PESINI et al. 2004; SUBRAMANIAN 2009) and nuclear genes (BUSTAMANTE et al. 2005; CARGILL et al. 1999; WONG et al. 2003) also suggested the presence of low-frequency deleterious amino acid SNPs in humans. Using allele frequency spectrum of synonymous and nonsynonymous sites, several studies estimated the strength of selection on amino acid mutations and thus determined the proportion of deleterious amino acid mutations in humans (BOYKO et al. 2008; EYRE-WALKER et al. 2006). However, it is important to estimate the fraction of deleterious nsnps at different depths of the human population tree. This is because a much higher fraction of deleterious nsnps is expected if individuals from the same population (or between closely related populations) are compared, as opposed to comparisons involving genomes from distantly related populations. For instance, we anticipate a relatively high fraction of deleterious SNPs among the polymorphisms shared between two Europeans than those shared between a European and an Asian or a European and an African. This is evident from a previous study that showed a significantly less fraction of deleterious nsnps that are shared between African-Americans and European-Americans than those present in only one of the population (LOHMUELLER et al. 2008). This is because the fraction of deleterious SNPs is expected to be inversely proportional to the time of separation between populations as they are removed by purifying selection over time. Hence I examined 5
the temporal pattern of deleterious nsnps at all internal and terminal branches of the human population tree using complete genomes belonging to four distinct human populations. To quantify the proportion of deleterious polymorphisms, two independent methods were used. Polymorphism data from ten complete human genomes belonging to Europeans, Asians, Africans and a Khoisan (an ancient African lineage) were obtained from different databanks (see, supporting information for materials and methods). Synonymous (ssnps) and nonsynonymous SNPs (nsnps) were grouped based on the pattern of sharing between the genomes following the known phylogenetic relationship between the human populations (TISHKOFF et al. 2009). If a SNP is shared between a Khoisan and any other genome, it was considered to be the oldest (see branch A in figure 1). If a SNP is shared between a European and an African and not shared by Khoisan, it is considered to be common to Africans and non-africans (branch B). Similarly, if a SNP is shared between a European and an Asian but not shared by an African or Khoisan, it is considered to be ancestral to Eurasians (branch C). Likewise, if a SNP is shared only between any two Europeans and not shared with any other genome, it was considered to be ancestral to Europeans (branch D). Finally, if a SNP is present in only one genome and not shared with any other genome, it is considered to be specific to that individual (all terminal branches such as E). These grouping assumed that convergent or parallel mutations are rare. Typically, nsnps are expected to be more abundant at terminal branches than internal branches of the tree as a fraction of them are removed over time. In contrast, ssnps are expected to be present at roughly equal proportions at all nodes of the tree. Therefore, the ratio of nsnps (A) to ssnps (S) will normalize the time elapsed and therefore this ratio could be used to estimate the 6
excess fraction of nsnps in each branch at various depths of the tree. Table 1 shows that A/S ratios are highest among unshared genome specific SNPs (branches E, G, I and J) and lowest among those SNPs that are shared with Khoisan genome (branch A). I then estimated the ratio of diversity at nonsynonymous to synonymous sites ( ) for each branch. This measure is much clearer, as it is on a scale of 0-1. Figure 1 shows a steady decline in from the tip to the root of the tree with the estimate for the terminal branches (0.34-0.48) being up to twofold higher than that of the ancestral branch (0.26). The estimate of was smallest for the inter-species humanchimp comparison (0.25), which is significantly smaller than the s estimated using intra-species comparisons (P < 0.001, using a Z test) (Table 1). These results are in conformity with the theoretical predictions suggesting that intra-species is expected to be higher than interspecies when Nes < 0 (KIMURA 1983; KRYAZHIMSKIY and PLOTKIN 2008; PETERSON and MASEL 2009). McDonald and Kreitman (MCDONALD and KREITMAN 1991) showed that under neutrality the estimated for human populations ( H ) is expected to be equal to the HC estimated for the interspecific (human-chimp) comparison i.e. H HC. It is clear from table 1 that the estimated for the human population is always higher than that estimated for the human-chimp comparison i.e. H HC. This is due to the presence of a fraction of deleterious nsnps ( ) in the populations. To subtract the fraction of deleterious SNPs the equation could be written as H H HC. This equation could be simplified to estimate the fraction of deleterious nonsynonymous SNPs ( ) as HC / H ). The measure is the fraction of deleterious nsnps that are segregating in the population. I estimated for each branch of the tree and as 7
expected, this measure systematically declined from the tip to the root of the tree (Figure 1). This is also clear from the negative relationship between and the relative age of the SNPs (Figure 2). Importantly, deleterious proportion is up to 7.9 times higher among the nsnps specific to individual genomes (0.48 - branch E) than those shared with the ancient Khoisan genome (0.06 - branch A). To obtain an independent estimate of the fraction of deleterious nsnps, I used PolyPhen-2, which predicts the possible impact of an amino acid substitution on the structure and function of a human protein using physicochemical and comparative information (ADZHUBEI et al. 2010). Based on severity, this software predicts an amino acid change as either benign or damaging. The proportion of nsnps damaging to the protein ( ) was then estimated using nsnps specific to each branch of the tree. It is interesting to note that the pattern of on the human population tree is almost identical to that of (Table 1 and Figure 1&2). Among the oldest nsnps that are shared with the Khoisan, only 10% were predicted to be damaging to protein structure/function (branch A) while this fraction was 33% for the genome-specific nsnps. Although the declining patterns of from tip to root are very clear in African as well as non- African lineages, the estimates of non-african branches (0.29-0.48) (Figure 2A&B) are higher than those of African ones (0.21-0.38) (Figure 2C). This could be due to the effect of population bottlenecks that presumably had occurred in European and Asian populations (LI and DURBIN 2011; MARTH et al. 2004). During this period some of the deleterious SNPs might have drifted to higher frequencies, which are expected to decrease during population expansion [see,(hughes 8
et al. 2005)]. Since the African population has had an uninterrupted (or mildly bottlenecked) population history (LI and DURBIN 2011), such an effect is expected to be minimal. The results show that the deleterious proportion of nsnps is can be up to 48% of the amino acid variants specific to a single genome. However, this is an underestimate as this analysis was based only on a small dataset of 10 genomes. Adding more genomes to this analysis will redefine the pattern of sharing of SNPs as a number of nsnps that are currently identified as lineage or genome specific will likely be shared among diverse populations. However a large proportion of these shared nsnps are expected to be neutral as deleterious SNPs are unlikely to perpetuate in a population for a long time. Hence, by adding more genomes, a large proportion of neutral nsnps are likely to be shifted to deeper internal branches of the human population tree. Although this will lead to a reduction in the number of nsnps (observed in this study) in the terminal branches of the population tree, a much higher (than reported here) proportion of them are likely to be deleterious in nature. Furthermore, the fraction of deleterious SNPs estimated here is based on the assumption that adaptive evolution in hominids is rare. If this assumption is not valid then the fractions of deleterious SNPs reported here are underestimates. This is because adaptive evolution results in higher rate of nonsynonymous substitution compared to nonsynonymous diversity. Hence by excluding the adaptive fraction while estimating will lead to lower inter-species (than that reported here), which will result in higher estimates. A previous study using two populations (African-Americans and European-Americans) showed that the deleterious fraction of population specific nsnps was 52-77% higher compared to that 9
shared between the two populations (LOHMUELLER et al. 2008). However using a much deeper human population tree this study showed that the deleterious fraction of genome-specific nsnps is 3.3 7.9 times higher compared to that shared between all humans. Although the temporal pattern reported in this study is based on nonsynonymous coding SNPs, an identical pattern is expected for all noncoding SNPs such as those present in promoters, silencers, enhancers, splicejunctions and untranslated regions (UTRs) that are under selective constraint. The findings of this study suggest that SNPs that are exclusively present in a single genome or closely related genomes are more likely to be associated with a disease as they are under strong purifying selection (HUGHES et al. 2003). Hence genome-wide association studies could focus on these SNPs, particularly when the disease causing variants have large phenotypic effects (CIRULLI and GOLDSTEIN 2010). ACKNOWLEDGEMENTS The author is grateful to David Lambert and thanks Leon Huynen, Robert Friedman, Michael Nachman and two anonymous reviewers for valuable comments. FOOTNOTES Supporting information is available online at: http://www.genetics.org/.. Polyphen data is available at http://www.mediafire.com/?5jila01rh49dmgm. LITERATURE CITED ADZHUBEI, I. A., S. SCHMIDT, L. PESHKIN, V. E. RAMENSKY, A. GERASIMOVA et al., 2010 A method and server for predicting damaging missense mutations. Nat Methods 7: 248-249. BALLARD, J. W., and M. KREITMAN, 1994 Unraveling selection in the mitochondrial genome of Drosophila. Genetics 138: 757-772. 10
BOYKO, A. R., S. H. WILLIAMSON, A. R. INDAP, J. D. DEGENHARDT, R. D. HERNANDEZ et al., 2008 Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genet 4: e1000083. BUSTAMANTE, C. D., A. FLEDEL-ALON, S. WILLIAMSON, R. NIELSEN, M. T. HUBISZ et al., 2005 Natural selection on protein-coding genes in the human genome. Nature 437: 1153-1157. CARGILL, M., D. ALTSHULER, J. IRELAND, P. SKLAR, K. ARDLIE et al., 1999 Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet 22: 231-238. CIRULLI, E. T., and D. B. GOLDSTEIN, 2010 Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat Rev Genet 11: 415-425. EXCOFFIER, L., 1990 Evolution of human mitochondrial DNA: evidence for departure from a pure neutral model of populations at equilibrium. J Mol Evol 30: 125-139. EYRE-WALKER, A., M. WOOLFIT and T. PHELPS, 2006 The distribution of fitness effects of new deleterious amino acid mutations in humans. Genetics 173: 891-900. HASEGAWA, M., Y. CAO and Z. YANG, 1998 Preponderance of slightly deleterious polymorphism in mitochondrial DNA: nonsynonymous/synonymous rate ratio is much higher within species than between species. Mol Biol Evol 15: 1499-1505. HOLMES, E. C., 2003 Patterns of intra- and interhost nonsynonymous variation reveal strong purifying selection in dengue virus. J Virol 77: 11296-11298. HUGHES, A. L., B. PACKER, R. WELCH, A. W. BERGEN, S. J. CHANOCK et al., 2003 Widespread purifying selection at polymorphic sites in human protein-coding loci. Proc Natl Acad Sci U S A 100: 15754-15757. 11
HUGHES, A. L., B. PACKER, R. WELCH, A. W. BERGEN, S. J. CHANOCK et al., 2005 Effects of natural selection on interpopulation divergence at polymorphic sites in human proteincoding Loci. Genetics 170: 1181-1187. KIMURA, M., 1983 The neutral theory of molecular evolution. Cambridge University press, Cambridge. KIMURA, M., and T. OHTA, 1971 Protein polymorphism as a phase of molecular evolution. Nature 229: 467-469. KIVISILD, T., P. SHEN, D. P. WALL, B. DO, R. SUNG et al., 2006 The role of selection in the evolution of human mitochondrial genomes. Genetics 172: 373-387. KRYAZHIMSKIY, S., and J. B. PLOTKIN, 2008 The population genetics of dn/ds. PLoS Genet 4: e1000304. LI, H., and R. DURBIN, 2011 Inference of human population history from individual wholegenome sequences. Nature 475: 493-496. LOHMUELLER, K. E., A. R. INDAP, S. SCHMIDT, A. R. BOYKO, R. D. HERNANDEZ et al., 2008 Proportionally more deleterious genetic variation in European than in African populations. Nature 451: 994-997. MARTH, G. T., E. CZABARKA, J. MURVAI and S. T. SHERRY, 2004 The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics 166: 351-372. MCDONALD, J. H., and M. KREITMAN, 1991 Adaptive protein evolution at the Adh locus in Drosophila. Nature 351: 652-654. MERRIWETHER, D. A., A. G. CLARK, S. W. BALLINGER, T. G. SCHURR, H. SOODYALL et al., 1991 The structure of human mitochondrial DNA variation. J Mol Evol 33: 543-555. 12
NACHMAN, M. W., S. N. BOYER and C. F. AQUADRO, 1994 Nonneutral evolution at the mitochondrial NADH dehydrogenase subunit 3 gene in mice. Proc Natl Acad Sci U S A 91: 6364-6368. NACHMAN, M. W., W. M. BROWN, M. STONEKING and C. F. AQUADRO, 1996 Nonneutral mitochondrial DNA variation in humans and chimpanzees. Genetics 142: 953-963. OHTA, T., 1973 Slightly deleterious mutant substitutions in evolution. Nature 246: 96-98. PETERSON, G. I., and J. MASEL, 2009 Quantitative prediction of molecular clock and ka/ks at short timescales. Mol Biol Evol 26: 2595-2603. RAND, D. M., M. DORFSMAN and L. M. KANN, 1994 Neutral and non-neutral evolution of Drosophila mitochondrial DNA. Genetics 138: 741-756. RAND, D. M., and L. M. KANN, 1996 Excess amino acid polymorphism in mitochondrial DNA: contrasts among genes from Drosophila, mice, and humans. Mol Biol Evol 13: 735-748. ROCHA, E. P., J. M. SMITH, L. D. HURST, M. T. HOLDEN, J. E. COOPER et al., 2006 Comparisons of dn/ds are time dependent for closely related bacterial genomes. J Theor Biol 239: 226-235. RUIZ-PESINI, E., D. MISHMAR, M. BRANDON, V. PROCACCIO and D. C. WALLACE, 2004 Effects of purifying and adaptive selection on regional variation in human mtdna. Science 303: 223-226. SUBRAMANIAN, S., 2009 Temporal Trails of Natural Selection in Human Mitogenomes. Molecular Biology and Evolution 26: 715-717. TAJIMA, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585-595. 13
TISHKOFF, S. A., F. A. REED, F. R. FRIEDLAENDER, C. EHRET, A. RANCIARO et al., 2009 The genetic structure and history of Africans and African Americans. Science 324: 1035-1044. WATTERSON, G. A., 1978 The homozygosity test of neutrality. Genetics 88: 405-417. WHITTAM, T. S., A. G. CLARK, M. STONEKING, R. L. CANN and A. C. WILSON, 1986 Allelic variation in human mitochondrial genes based on patterns of restriction site polymorphism. Proc Natl Acad Sci U S A 83: 9611-9615. WONG, G. K., Z. YANG, D. A. PASSEY, M. KIBUKAWA, M. PADDOCK et al., 2003 A population threshold for functional polymorphisms. Genome Res 13: 1873-1879. 14
Table 1. Genome-wide human SNPs and parameter estimates Branch on the tree Nonsynoymous SNPs (A) Synonymous SNPs (S) A/S = dn/ds Fraction of deleterious nsnps ( Damaging nsnps Benign nsnps Fraction of damaging nsnps ( ) HC * 62373 92246 0.68 0.25 (0.001) - - - - A 6494 9019 0.72 0.26 (0.004) 0.06 (0.017) 503 4598 0.10 (0.004) B 3028 3723 0.81 0.30 (0.007) 0.17 (0.025) 382 1982 0.16 (0.008) C 2003 2181 0.92 0.33 (0.010) 0.26 (0.032) 329 1221 0.21 (0.010) D 1210 1232 0.98 0.36 (0.014) 0.31 (0.043) 217 758 0.22 (0.013) E 5172 3956 1.31 0.48 (0.010) 0.48 (0.024) 1364 2782 0.33 (0.007) F 293 309 0.95 0.35 (0.028) 0.29 (0.085) 55 186 0.23 (0.027) G 2333 1843 1.27 0.46 (0.014) 0.47 (0.034) 576 1327 0.30 (0.011) H 411 478 0.86 0.31 (0.021) 0.21 (0.069) 77 261 0.23 (0.023) I 4013 4284 0.94 0.34 (0.007) 0.28 (0.023) 878 2382 0.27 (0.008) J 5565 5140 1.08 0.39 (0.008) 0.38 (0.021) 1535 3118 0.33 (0.007) * - Human-Chimpanzee comparison 15
FIGURE LEGENDS Figure 1. Population phylogeny of human genomes used in this study. At each branch, estimates of the following are given in parenthesis: i) Ratio of nonsynonymous to synonymous SNPs per site ( ) ii) Fraction of deleterious nonsynonymous SNPs ( ) iii) Proportion of damaging nonsynonymous SNPs ( ). *Estimates for terminal branches are the average of five, two and two corresponding genomes of Europeans, Asians and Africans respectively. Materials and methods are provided in the supporting information. Figure 2. The proportion of deleterious nonsynonymous polymorphisms ( ) and the fraction of damaging amino acid polymorphisms ( ) estimated for each branch of the human population tree shown in Figure 1. The negative relationship between deleterious nsnps and relative time is shown for the A) European lineage B) Asian lineage C) African lineage. Error bars indicate the standard error of the mean. 16
17
18