Jungreis et al. Supplementary Materials


Supplemental Text S1: What we mean by functional

To understand what we mean when we refer to functional readthrough candidates, it is helpful to review the nature of the signal detected by PhyloCSF. PhyloCSF gets most of its information from substitutions, so regions with very high nucleotide-level conservation have very little signal and tend to get scores near 0, intermediate between the positive scores of most coding regions and the negative scores of most noncoding regions. For a region to get a high PhyloCSF score it needs to have many substitutions, and these substitutions need to be ones that make up a higher proportion of the substitutions typical of coding regions than of those typical of noncoding regions, such as synonymous substitutions and substitutions between amino acids with similar biochemical properties. Unlike high amino acid conservation, which could be a side effect of selective constraint on the DNA sequence unrelated to translation, a high PhyloCSF score can only occur when selection has allowed some variation, but has preferred variation that preserves the biochemical properties of a translation product. Furthermore, selective pressure favoring the act of translation rather than its product, say because of some regulatory effect of translation, would have a different signature. Thus, PhyloCSF detects the signature of a fitness advantage of the translation product itself.

Functional Translational Readthrough has been defined as translational readthrough that leads to functions different from the parent protein (Schueren and Thoms 2016). While PhyloCSF detects the evolutionary signature of a peptide extension that serves some function, it alone does not show that this function is different from that of the parent protein. However, the fact that leaky stop codon contexts have been conserved for most of our readthrough candidates provides evidence that the extended protein is functionally different from the parent; indeed, if the two were functionally identical then why would there be selective pressure to preserve readthrough and to keep the extension functional?

While these evolutionary signatures provide evidence for Functional Translational Readthrough in evolutionary time, only experimental validation can determine if readthrough of any particular transcript has continued to the present (though SNV data from the Anopheles gambiae 1000 genomes project provide evidence for this in the aggregate). Furthermore, while we have evidence that the extended proteins do have functions distinct from the parent, we do not know what their functions are. For these reasons we have referred to them as readthrough candidates.

Supplemental Text S2: Estimate for number of orthologous readthrough pairs due to convergent evolution

We estimated the number of pairs of orthologous readthrough stop codons, descended from a non-readthrough ancestral stop codon, that have become readthrough independently in the two clades through convergent evolution as roughly the product of the number, N, of non-readthrough ancestral stop codons that have descendant stop codons in the two clades for which we can detect orthology, times the probability, P, that in both clades a non-readthrough ancestral stop codon will have developed readthrough detectable by our process.

We estimate N to be around 7188 by taking the number of stop codons in A. gambiae in genes that have a Diptera-level OrthoDB ortholog in D. melanogaster, 8562, times the fraction of A. gambiae readthrough stop codons having an orthologous gene but not found by orthology, that also have an end-orthologous transcript in D. melanogaster, 0.853, minus the number of orthologous readthrough pairs, 115, an estimate for the number to be excluded due to being readthrough in the ancestor. This somewhat overestimates N, because it estimates the number of pairs that are end-orthologous rather than the number satisfying the stricter condition of being stop-orthologous, and because when we subtract orthologous readthrough pairs we ignore the fact that some of our ambiguous readthrough could actually be readthrough. Although it subtracts an estimate for the number of ancestral readthrough stop codons using orthologous readthrough pairs, which treats convergent evolution cases as ancestral, the result is little changed if we do not make this subtraction.

To detect readthrough in both species, we need to detect it in one or the other species without using orthology, but can then use orthology to detect it in the other. We can estimate the probability that a stop codon that was not readthrough in the ancestor has developed readthrough in one species, and that we can detect this readthrough without orthology, by counting the number of non-ancient readthrough candidates in that species (all of which were found without using orthology) and dividing by the total number of stop codons in that species minus the number of ancient readthrough candidates. We can estimate the probability that the orthologous stop codon is readthrough in the other species and that we will be able to detect it, possibly using orthology, using our estimate for the number of readthrough stop codons in each species for which the PhyloCSF-Ψ Emp score of the second ORF is positive, since that is roughly the criterion we used to determine if a stop codon is readthrough once we know it is stop-orthologous to a readthrough stop codon in the other species, minus the number of ancient readthrough regions having positive PhyloCSF-Ψ Emp, divided by the total number of stop codons in that species minus the number of ancient readthrough candidates.

Carrying out these estimates we find that the probability that a non-readthrough ancestral stop codon will have become readthrough in both clades is approximately […]. However, while this calculation accounts for the fact that detection of readthrough in one species helps us detect readthrough in the other, it does not take into account that some genes, such as longer ones, could have a greater tendency to become readthrough, in both species, than others. Accounting for the distributions of first ORF lengths of readthrough candidates and of all other genes increases the probability that a non-readthrough ancestral stop codon will have developed readthrough in both clades by about 30%, which gives us an estimate for P of […].

Combining these estimates of N and P, we estimate that the number of pairs of orthologous readthrough stop codons that have become readthrough independently from a non-readthrough ancestral stop codon through convergent evolution in the two clades is around 7188 * […] = […]. A caveat is that there could be unknown confounders other than first ORF length that make an ancestral gene more likely to develop readthrough, though such confounders would have to be considerably more influential than first ORF length to change the result much in comparison to the total number of ancient readthrough regions.
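The arithmetic behind the estimate of N, and the way N and P are combined, can be sketched as follows. This is a minimal illustration rather than the authors' code: the function and parameter names are hypothetical, only the three inputs to N (8562, 0.853, and 115) come from the text, and P is left as a parameter because no value for it is assumed here.

```python
def estimate_n(stops_with_ortholog, end_orthologous_fraction, orthologous_rt_pairs):
    """Approximate number of non-readthrough ancestral stop codons with
    detectable orthologous descendants in both clades.

    All parameter names are hypothetical labels for quantities in the text.
    """
    return stops_with_ortholog * end_orthologous_fraction - orthologous_rt_pairs


def expected_convergent_pairs(n, p):
    """Expected number of convergent readthrough pairs, taken as the product N * P."""
    return n * p


# Inputs quoted in the text for N.
n_with_subtraction = estimate_n(8562, 0.853, 115)
# Same estimate without subtracting the orthologous readthrough pairs.
n_without_subtraction = estimate_n(8562, 0.853, 0)

print(round(n_with_subtraction))     # 7188
print(round(n_without_subtraction))  # 7303
```

Dropping the subtraction changes N by under 2%, consistent with the remark that the result is little changed without it.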

Supplemental Text S3: Z curve test provides an underestimate for the number of readthrough regions

There are several reasons that the result of our Z curve test is likely to be an underestimate for the actual number of functional readthrough regions.

First, the test only counts readthrough regions that are at least 10 codons long and have a Z curve score greater than 0. As discussed in the main text, using our PhyloCSF test we found evidence that more than half of functional readthrough regions are less than 10 codons long. Furthermore, only about half of the readthrough regions of our readthrough candidates have Z curve score > 0 (Supplemental Figure S5C,D). Since the readthrough regions of our readthrough candidates are the ones having the highest coding potential as measured by PhyloCSF, we would expect them also to have higher Z curve scores than most of the other readthrough regions, so probably fewer than half of functional readthrough regions have Z curve score > 0. Combining these results, we find that the estimate provided by our test is probably less than 25% of the actual number of annotated stop codons that have functional readthrough.

Second, our test only looks at annotated stop codons, which could be considerably fewer than the number of actual stop codons in species having incomplete annotations.

Third, there are several factors that could inflate the counts of second ORFs having positive Z curve scores in frames 1 and 2 but not frame 0, which would decrease any frame-0 excess and thus our estimate. When an annotated stop codon lies within a coding region in another frame, the second ORF of that stop codon in the other frame is likely to have a high Z curve score, which would inflate the count in frame 1 or 2; we exclude any stop codons that are within annotated coding regions in another frame, but stop codons within unannotated coding regions in other frames would bias our readthrough estimate downward. Also, while our estimate corrects for recent nonsense substitutions and sequencing errors that read a sense codon as a stop codon, which would inflate the count in frame 0, we do not correct for recent indels and sequencing indels, which would inflate the counts in frames 1 and 2 and bias our readthrough estimate downward.
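The 25% figure in the first point follows from multiplying two fractions that are each at most about one half. The sketch below makes that step explicit. It is an illustration only: the function names are hypothetical, it assumes the two filters (length of at least 10 codons, Z curve score > 0) can be treated as roughly independent, and the count passed to the second function is a made-up number, not a result from the paper.

```python
def max_detectable_fraction(frac_at_least_10_codons, frac_positive_z):
    """Upper bound on the fraction of functional readthrough regions the test can count,
    assuming the two filters act roughly independently (an assumption of this sketch)."""
    return frac_at_least_10_codons * frac_positive_z


def implied_minimum_actual(test_estimate, detectable_fraction):
    """If the test sees at most `detectable_fraction` of the true total,
    the true total is at least test_estimate / detectable_fraction."""
    return test_estimate / detectable_fraction


# Fewer than half are long enough, and fewer than half of those score above 0.
bound = max_detectable_fraction(0.5, 0.5)
print(bound)  # 0.25

# Hypothetical test estimate of 100 regions, used only to show the scaling.
print(implied_minimum_actual(100, bound))  # 400.0
```

Because both input fractions are themselves upper bounds, the product is an upper bound too, which is why the test estimate is described as probably less than 25% of the actual number.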

Figure S1 (bar chart). Series: Readthrough, A.gam.; Readthrough, D.mel.; Non-readthrough, A.gam.; Non-readthrough, D.mel. Axis: Fraction of stop codons with specified context.

Supplemental Figure S1. Stop codon contexts in A. gambiae and D. melanogaster. Usage of stop codon context (stop codon and subsequent base) sorted in order of decreasing frequency among the 12,058 non-readthrough, nonmitochondrial, annotated A. gambiae stop codons (dark blue), with less frequent stop codons (top, e.g., T-) experimentally associated with translational leakage in other species, and most frequent (bottom, e.g., T-) associated with efficient termination. Error bars show standard error of mean. Context frequencies for A. gambiae readthrough candidates (red) are almost the opposite of those of non-readthrough transcripts, suggesting a preference for leaky context, with 36% using T- and almost none using T-. Frequencies shown are for our unbiased-by-stop-codon subset of readthrough candidates and exclude double-stop readthrough candidates. The frequencies for both readthrough and non-readthrough stop codons are similar to those for D. melanogaster (pink and light blue, respectively).
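As an illustration of the quantities plotted in this figure, the following sketch extracts a stop codon context (the stop codon plus the next base) and tabulates context frequencies with error bars. It is not the authors' code. The function names and the toy inputs are hypothetical, and the error bar is assumed to be the standard error of a proportion, sqrt(p(1-p)/n), which the caption does not spell out.

```python
from collections import Counter
from math import sqrt


def stop_context(sequence, stop_index):
    """Return the stop codon at stop_index plus the following base, e.g. 'TGA-C'."""
    codon = sequence[stop_index:stop_index + 3]
    next_base = sequence[stop_index + 3]
    return f"{codon}-{next_base}"


def context_frequencies(contexts):
    """Map each context to (fraction, standard error of that fraction).

    Assumes the standard error of a proportion, sqrt(p * (1 - p) / n).
    """
    n = len(contexts)
    if n == 0:
        return {}
    result = {}
    for context, count in Counter(contexts).items():
        p = count / n
        result[context] = (p, sqrt(p * (1 - p) / n))
    return result


# Toy data, made up for illustration only.
toy_contexts = ["TGA-C", "TGA-C", "TGA-C", "TAA-A"]
freqs = context_frequencies(toy_contexts)
for context, (fraction, se) in sorted(freqs.items()):
    print(context, round(fraction, 3), round(se, 3))
```

Sorting the resulting dictionary by fraction within the non-readthrough set would give the ordering described in the caption.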

Figure S2 panels: (A) Protein Length (Amino Acids); (B) Number of CDS exons; (C) Mean length of an intron (NT); (D) Number of CDS exons of RT-NonRT pairs; (E) Mean length of an intron of RT-NonRT pairs. Groups in A-C: A.gam. RT, A.gam. NonRT, D.mel. RT, D.mel. NonRT. Scatter plot axes in D and E: Readthrough Candidate versus Nonreadthrough Ortholog. Box plot groups in D and E: RT paired with NonRT, NonRT paired with RT, All NonRT that have an ortholog.

Supplemental Figure S2. Readthrough genes are longer than non-readthrough. (A-C) Box plots for lengths of unextended proteins (A), number of exons in the coding portion of the transcript (B), and average length of an intron in each transcript having at least one intron (C), for A. gambiae readthrough candidates (red), A. gambiae non-readthrough transcripts (green), D. melanogaster readthrough candidates (orange), and D. melanogaster non-readthrough transcripts (gray). By all three measures, readthrough candidates are significantly longer than non-readthrough transcripts, in both species. (D,E) Upper panels are scatter plots showing number of coding exons (D) and average intron length (E) for each A. gambiae readthrough candidate orthologous to a non-readthrough D. melanogaster transcript versus the corresponding value for its non-readthrough ortholog (black dots) and similarly for D. melanogaster readthrough candidates orthologous to non-readthrough A. gambiae transcripts (red dots). Lower panels are box plots showing number of coding exons (D) and average intron length (E) of those readthrough candidates in either species that are orthologous to a non-readthrough transcript in the other (red), the corresponding lengths of the paired non-readthrough transcripts (green), and the lengths of all non-readthrough transcripts in genes that have orthologs in the other species (orange). Unlike the similar comparison for protein lengths shown in Figure 3E, there is no clear relationship among the number of exons or average intron length of readthrough candidates, their non-readthrough orthologs, and other non-readthrough transcripts.

Figure S3 annotations: "Homologous stem loops"; "These parts appear similar". Structure labels include AGAP006059-R[…] and FBtr0089796.

Supplemental Figure S3. RNA structures for readthrough-readthrough pairs. Conserved, stable RNA structures predicted by RNAz in the 100-nt regions downstream from (and including) candidate readthrough stop codons for the four stop-orthologous pairs of readthrough candidates in A. gambiae (left) and D. melanogaster (right) for which a structure was predicted in both. There is clear homology between stem loops near the 5' ends of the second pair of structures (red ovals). Other than that, we see no obvious similarity between the predicted structures in the two species, offering the possibility that it is the presence of a stable structure that is functional rather than particular features of that structure.

Figure S4 (bar chart). Title: Readthrough structures and stop codons. Groups: A.gam. Structure, A.gam. No Structure, D.mel. Structure, D.mel. No Structure. Vertical axis: 0% to 80%.

Supplemental Figure S4. Stop codon usage in candidates having predicted structures. Frequencies of usage of T (blue), T (red), and T (green) first stop codons among A. gambiae readthrough candidates having predicted structures (first group), A. gambiae readthrough candidates lacking predicted structures (second group), D. melanogaster readthrough candidates having predicted structures (third group), and D. melanogaster readthrough candidates lacking predicted structures (fourth group), with error bars showing standard error of mean. Although most readthrough candidates in both species use T, readthrough candidates having structures in D. melanogaster show a preference for T, prompting speculation that a leaky stop codon context might not be necessary for readthrough in the presence of an RNA structure. However, in A. gambiae there is only a small and not statistically significant depletion for T stop codons among readthrough candidates having predicted structures.

Figure S5 panels: (A) A.gam. Readthrough Candidates and (B) D.mel. Readthrough Candidates, with series Readthrough Regions, 1st ORF 3' ends, and 3rd ORFs; axis: PhyloCSF Per Codon. (C) A.gam. Readthrough Candidates and (D) D.mel. Readthrough Candidates, with series 2nd ORFs, 1st ORF 3' ends, and 3rd ORFs; axis: Z Curve score. (E) RT-RT pair 2nd ORF codons, D.mel. versus A.gam.

Supplemental Figure S5. Readthrough regions under less constraint than other coding regions. (A-D) Distribution of coding potential as measured by PhyloCSF score per codon (A and B) or Z curve score (C and D) of readthrough regions (red), same-sized coding regions at the 3' end of the first ORF of readthrough candidates (black), and noncoding third ORFs of readthrough candidates (cyan), for A. gambiae (A and C) and D. melanogaster (B and D). Double readthrough candidates and readthrough candidates whose third ORF is 0-length or contains degenerate bases have been excluded. In all four cases, coding potential of readthrough regions tends to be intermediate between that of noncoding regions and other coding regions, suggesting that they have been under some purifying selection at the amino acid level within each clade, but less so than other coding regions. (E) Log-scale scatter plot showing the lengths in codons of the readthrough regions in A. gambiae and D. melanogaster for each pair of stop-orthologous readthrough candidates. Although for many pairs the lengths are similar in the two species (circles along diagonal line), there are also many pairs for which the lengths are quite different, suggesting that in many cases readthrough regions have remained functional despite large changes in length.

Supplemental Figure S6. Peroxisomal targeting signals. Alignments of readthrough regions of A. gambiae gene AGAP[…] among 21 Anopheles genomes (A), its D. melanogaster ortholog, CG1969, among 20 Drosophila genomes (B), and D. melanogaster gene Tsp86D (C). A predicted peroxisomal targeting signal in the final 12 amino acids of the extension of AGAP[…] (yellow highlighting) has been conserved across all 21 mosquitoes and 20 flies, despite the presence of several radical amino acid substitutions within the Anopheles lineage (red highlighting) and two amino acid insertions or deletions between the two clades (red circles). The predicted peroxisomal targeting signal in Tsp86D is conserved as far as D. kikkawai but not in D. ananassae or beyond.

Supplemental Figure S7. Readthrough candidates identified using other features of coding regions. Alignment including readthrough region of A. gambiae readthrough candidate AGAP[…]-R[…]. Although the PhyloCSF+Stop score of this readthrough region, 10.9, is below our threshold of 17.0, it was included among our list of readthrough candidates because it exhibits other features characteristic of coding regions that are not accounted for by PhyloCSF+Stop. In particular, the reading frame is preserved in all species despite the presence of many indels (gray) but degrades soon after the second stop codon (orange) and there are several synonymous substitutions in the second stop codon. (All insertions relative to A. gambiae before the second stop codon are frame preserving but are not shown in order to make the image more compact.) We included 40 second ORFs having PhyloCSF+Stop between 5.0 and 17.0 in our list of A. gambiae readthrough candidates based on a subjective assessment of these and other features characteristic of protein coding regions.
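The two score cutoffs mentioned in this caption suggest a simple triage of second ORFs, sketched below. This is a hypothetical illustration, not the authors' pipeline: the function name, the category labels, the treatment of scores below 5.0, and the choice of inclusive lower boundaries are all assumptions. Only the values 17.0, 5.0, and 10.9 come from the text, and the manual step in the middle band was a subjective assessment that code cannot reproduce.

```python
SCORE_THRESHOLD = 17.0  # threshold quoted in the caption
REVIEW_FLOOR = 5.0      # lower end of the manually assessed range


def triage_second_orf(score):
    """Assign a second ORF to a hypothetical category by its score.

    Boundary handling (>=) is an assumption of this sketch.
    """
    if score >= SCORE_THRESHOLD:
        return "meets score threshold"
    if score >= REVIEW_FLOOR:
        return "manual assessment of other coding features"
    return "below manually assessed range"


print(triage_second_orf(10.9))  # manual assessment of other coding features
```

The example score of 10.9 falls in the middle band, matching the candidate shown in this figure.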

Supplemental Figure S8. Readthrough identified using orthology. Alignment of the readthrough region of D. melanogaster readthrough candidate FBtr[…] (upper panel) identified using orthology to A. gambiae readthrough candidate AGAP[…]-R[…] (lower panel), with many matching amino acids both within the readthrough region and before the first stop codon (yellow highlighting). FBtr[…] had not been previously identified as readthrough, but satisfied our criteria for readthrough given orthology to a readthrough candidate in the other clade. In order to facilitate cross-clade comparisons, we identified 51 D. melanogaster and 21 A. gambiae readthrough candidates in this way.

Figure S9 panels: (A) pairing diagram with legend Ancient Paired, Ancient Unpaired, Double-stop RT, Orthologs, Paralogs, for A.gam. and D.mel. (B) UCSC Genome Browser view, D. melanogaster Apr. 2006 (BDGP R5/dm3), chr3R:16,505,520-16,535,237 (29,718 bp). (C, D) Alignments.

Supplemental Figure S9. Ancient readthrough pairing.

(A) Pairing of ancient readthrough candidates in A. gambiae and D. melanogaster. We defined a readthrough candidate to be ancient if it is stop-orthologous to a readthrough candidate in the other species at the Diptera level and is not a double-stop readthrough candidate. In most cases (109) there is a one-to-one pairing of orthologous ancient readthrough candidates (green boxes). However, in two cases there are two paralogous A. gambiae readthrough candidates orthologous to the same D. melanogaster candidate, so we excluded one of the paralogs (white boxes) from analyses that required a one-to-one pairing. There is one case of four homologous readthrough candidates; however, we were able to determine that these split into two orthologous pairs (panel B). There are three D. melanogaster readthrough candidates orthologous to double-stop readthrough candidates in A. gambiae (blue rectangles), so these were excluded from analyses that required a one-to-one pairing of non-double-stop readthrough candidates (panel C). Finally, there is one pair of orthologous double-stop readthrough candidates, which were excluded from most of our analyses (panel D).

(B) Four-way homologous readthrough candidates. Two A. gambiae readthrough candidates and two D. melanogaster readthrough candidates, all homologous at the Diptera level, displayed using the UCSC genome browser. In each species, the first ORFs of the two readthrough candidate transcripts are four-exon alternative splice variants of a single gene, with the first three exons shared. The downstream fourth exons in the two species are more similar to each other than either is to the alternative exons, and the same is true of the upstream fourth exons, suggesting that these two genes are descended from a gene in the common ancestor having a similar configuration, and that the alternative final exons were formed by an earlier duplication of a final exon containing a readthrough stop codon.

(C) Double-stop readthrough orthologous to single-stop readthrough. Alignments including the readthrough regions of D. melanogaster readthrough candidate FBtr[…] and A. gambiae double-stop readthrough candidate AGAP[…]-RB. It is possible that the second T stop codon in the double first stop codon of AGAP[…]-RB is related by a single nucleotide substitution to the TAT tyrosine codon immediately 3' of the first stop codon in FBtr[…]. We note that although there is very little cross-clade similarity between the readthrough regions at the amino acid level, within each of the two clades there are several alignment columns immediately after the first stop codon and near the end of the readthrough region that have no synonymous substitutions, suggesting that there might be some overlapping constraint at the nucleotide level in addition to any constraint on the amino acid sequences.

(D) Orthologous double-stop readthrough candidates. Alignments including the readthrough regions of orthologous double-stop readthrough candidates FBtr[…] and AGAP[…]-R[…]. The common double-T stop codon suggests that these might be descended from a double-stop readthrough transcript in the common ancestor.
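Whether a sense codon and a stop codon can be related by a single nucleotide substitution, as suggested in panel C, can be checked directly against the standard genetic code. The sketch below does this for the two tyrosine codons (TAT and TAC). The function names are hypothetical and the code only illustrates the reasoning; it says nothing about which substitution actually occurred in these lineages.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")  # stop codons of the standard genetic code


def hamming_distance(codon_a, codon_b):
    """Number of positions at which two equal-length codons differ."""
    return sum(a != b for a, b in zip(codon_a, codon_b))


def stops_one_substitution_away(codon):
    """Stop codons reachable from `codon` by exactly one nucleotide substitution."""
    return [stop for stop in STOP_CODONS if hamming_distance(codon, stop) == 1]


# Tyrosine is encoded by TAT and TAC in the standard genetic code.
for tyrosine_codon in ("TAT", "TAC"):
    print(tyrosine_codon, stops_one_substitution_away(tyrosine_codon))
# TAT ['TAA', 'TAG']
# TAC ['TAA', 'TAG']
```

Both tyrosine codons are one third-position substitution away from TAA and TAG, and two substitutions away from TGA.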

[Figure S10, panels A and B: multi-species Anopheles nucleotide alignments of the readthrough regions, with the A. gambiae amino acid translation shown above each alignment.]

Supplemental Figure S10. Single-double readthrough. Alignments showing triple readthrough for P R (A) and P R (B), in which a single readthrough is followed by a double-stop readthrough in some species. In P R, the second T stop codon in the A. gambiae double stop codon is aligned to a likely-ancestral TGC cysteine codon in several species, though the presence of indels makes the history of the particular codon uncertain. In P R the second T stop codon in the double stop codon appears to be ancestral, and has become a TT leucine codon in A. darlingi.

[Figure S11, panel A: bar chart of the percent of SNVs that are synonymous in each frame (Frame0, Frame1, Frame2) for three region types: Before 1st Stop, In RT Region, After 2nd Stop. Panel B: cumulative distributions of derived allele frequency for non-synonymous and synonymous SNVs in the same three region types.]

Supplemental Figure S11. Polymorphism evidence supports recent protein-coding selection.

(A) Single nucleotide variants (SNVs) in A. gambiae show a strong bias toward synonymous codon changes in readthrough regions (middle) and same-sized coding regions 5′ of readthrough stop codons (left), but not in same-sized noncoding regions 3′ of second stop codons of readthrough candidates (right), providing evidence that readthrough regions are under purifying selection at the amino acid level within the A. gambiae population. For each type of region we show the fraction of SNVs that would be synonymous if translated in each of three frames, with frame 0 matching the translated frame of the coding region of the gene. Error bars show the Standard Error of the Mean (SEM).

(B) Cumulative distributions of derived allele frequencies for SNVs that would be synonymous (red) or non-synonymous (black) if translated in the frame of the coding region of the gene, for the same three sets of regions. Derived allele frequencies are lower for non-synonymous SNVs than for synonymous ones in both coding and readthrough regions, indicating that non-synonymous SNVs are under greater purifying selection, whereas in noncoding regions there is no significant difference. This provides further evidence that purifying selection at the amino acid level in readthrough regions has continued in the A. gambiae population.
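The per-frame synonymous fraction described in panel A can be illustrated with a minimal sketch. It assumes the standard genetic code, 0-based SNV positions on the coding strand, and a binomial standard error for the error bars; the function names, the toy sequence, and the choice of standard error are illustrative assumptions, not the authors' pipeline.

```python
from itertools import product
from math import sqrt

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def is_synonymous(seq, pos, alt, frame):
    """Return True if substituting `alt` at `pos` keeps the amino acid
    unchanged when codons start at positions frame, frame+3, ...
    Returns None when the codon is not fully contained in `seq`."""
    start = pos - ((pos - frame) % 3)
    if start < 0 or start + 3 > len(seq):
        return None
    ref_codon = seq[start:start + 3]
    offset = pos - start
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    return CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]


def synonymous_fraction_by_frame(seq, snvs):
    """Map each frame (0, 1, 2) to (fraction synonymous, standard error).
    The binomial standard error sqrt(p(1-p)/n) is an assumption here."""
    result = {}
    for frame in range(3):
        calls = [is_synonymous(seq, pos, alt, frame) for pos, alt in snvs]
        calls = [c for c in calls if c is not None]
        n = len(calls)
        p = sum(calls) / n if n else float("nan")
        se = sqrt(p * (1 - p) / n) if n else float("nan")
        result[frame] = (p, se)
    return result


# Toy example (hypothetical data): two SNVs in a 9-nt sequence.
fractions = synonymous_fraction_by_frame("ATGGCTTAA", [(5, "C"), (3, "A")])
```

In a region under amino acid level selection, the frame 0 fraction would be expected to stand out from frames 1 and 2, which is the pattern the caption reports for coding and readthrough regions.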

[Figure S12, panels A-D: A. gambiae regions of 5 (A), 10 (B), 20 (C), and 40 (D) codons. Each panel plots density against raw PhyloCSF score for the PsiEmp and Psi normal coding and noncoding distributions; panels C and D also show the empirical coding and noncoding distributions. Annotations: PhyloCSF scores corresponding to PhyloCSF-ΨEmp = 17 and PhyloCSF-Ψ = 17; the normal approximation overestimates density in the right tail of the noncoding distribution; small sample size makes the tail of the empirical distribution unreliable for regions >> 10 codons.]

[Figure S12, panels E-H: D. melanogaster regions of 5 (E), 10 (F), 20 (G), and 40 (H) codons. Each panel plots density against raw PhyloCSF score for the PsiEmp and Psi normal coding and noncoding distributions; panels G and H also show the empirical coding and noncoding distributions.]

Supplemental Figure S12. Empirical score distributions allow PhyloCSF-ΨEmp to achieve higher sensitivity than PhyloCSF-Ψ while maintaining high specificity. Coding (black) and noncoding (red) PhyloCSF score distributions used to define PhyloCSF-ΨEmp for A. gambiae (A-D) and D. melanogaster (E-H), and normal approximations used to define PhyloCSF-Ψ (magenta and orange dashed lines), for regions of lengths 5, 10, 20, and 40 codons. For lengths less than or equal to 10, the distributions used to define PhyloCSF-ΨEmp were calculated directly from training regions of that length, whereas for greater lengths PhyloCSF-ΨEmp was defined by scaling the distributions for regions of length 10. For lengths greater than 10 we show the actual distributions of coding (green) and noncoding (cyan) training regions of those lengths for reference, even though they were not used in calculating PhyloCSF-ΨEmp. The bump in the right tail of the cyan curve for A. gambiae 40-codon regions (D) is presumably a result of sampling error due to the small number of noncoding training regions of that length; such sampling errors are the reason that we scaled the length-10 distribution rather than interpolating densities through scores of actual regions of greater lengths.

PhyloCSF scores for regions of each length for which PhyloCSF-ΨEmp and PhyloCSF-Ψ are equal to our score threshold of 17 (solid and dashed vertical blue lines, respectively) are the scores that are approximately 50 times more likely to occur in coding regions than noncoding regions. Because the normal approximations overestimate the densities in the right tail of the noncoding score distributions, PhyloCSF-Ψ overestimates the PhyloCSF score needed to achieve this high specificity; by using the more accurate empirical densities, PhyloCSF-ΨEmp allows us to detect coding regions having lower PhyloCSF scores, achieving greater sensitivity while maintaining this high specificity.
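The link between the threshold of 17 and the factor of roughly 50 can be checked with a short sketch. It assumes that Ψ is a log-likelihood ratio expressed in decibans, that is, 10·log10 of the coding density over the noncoding density at a given raw score; that assumption is consistent with the numbers in the caption, since 10^(17/10) is about 50. The normal parameters below are made-up illustrative values, not the fitted values from the paper.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def psi_normal(score, coding, noncoding):
    """Log-likelihood ratio in decibans under normal approximations.
    `coding` and `noncoding` are (mean, sd) pairs (hypothetical values).
    The deciban scaling is an assumption inferred from the caption."""
    ratio = normal_pdf(score, *coding) / normal_pdf(score, *noncoding)
    return 10.0 * math.log10(ratio)


def likelihood_ratio(psi):
    """Convert a deciban score back to a coding/noncoding density ratio."""
    return 10.0 ** (psi / 10.0)


# A threshold of 17 decibans corresponds to a ratio of about 50.
ratio_at_threshold = likelihood_ratio(17.0)

# Illustrative parameters only: coding N(10, 5), noncoding N(-10, 5).
example_psi = psi_normal(5.0, (10.0, 5.0), (-10.0, 5.0))
```

An empirical variant would replace `normal_pdf` with densities estimated from training regions. The caption's point is that the normal noncoding tail is too heavy, so the normal version demands a higher raw score to reach the same ratio.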

Supplemental Figure S13. Score distribution scale factors used for definition of PhyloCSF-ΨEmp. Mean PhyloCSF scores of coding (black circles) and noncoding (red circles) training regions of each length used to define PhyloCSF-ΨEmp for A. gambiae (A) and D. melanogaster (B), and intervals one standard deviation above and below (black and red vertical lines). For lengths greater than 10 codons, PhyloCSF-ΨEmp was defined by scaling the distributions for regions of length 10 codons using the means and standard deviations of the scores of training regions of each length up to 60 codons. For regions longer than 60 codons, the means and standard deviations were extrapolated (green and cyan curves) by linear regression on the means and on the logs of the standard deviations of regions of length 30 to 60 codons.

[Figure S13, two panels titled "PhyloCSF PsiEmp scale factors" for A.gam and D.mel. X axis: region length (codons). Y axis: raw PhyloCSF score. Legend: mean + stdev for coding training regions; mean + stdev for noncoding training regions; extrapolation for longer coding regions; extrapolation for longer noncoding regions.]
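The scaling and extrapolation described in this caption can be sketched as follows. This is one plausible reading, not the authors' implementation. It assumes the length-10 density is mapped to another length by a location-scale transform using the per-length mean and standard deviation, and that the regressions are ordinary least-squares fits against region length. All function names and numbers are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate_stats(lengths, means, sds, target_length):
    """Extrapolate (mean, sd) to a longer region: linear fit on the means
    and on the logs of the standard deviations (assumed fit vs. length)."""
    m_slope, m_icpt = fit_line(lengths, means)
    s_slope, s_icpt = fit_line(lengths, [math.log(s) for s in sds])
    mean = m_slope * target_length + m_icpt
    sd = math.exp(s_slope * target_length + s_icpt)
    return mean, sd


def scaled_density(score, length_stats, base_density, base_stats):
    """Density at `score` for a region whose (mean, sd) is `length_stats`,
    obtained by rescaling the length-10 density `base_density`, whose
    (mean, sd) is `base_stats`. The location-scale form is an assumption."""
    mean_len, sd_len = length_stats
    mean_base, sd_base = base_stats
    z = (score - mean_len) / sd_len
    # The Jacobian factor sd_base / sd_len keeps the density normalized.
    return base_density(mean_base + z * sd_base) * sd_base / sd_len


# Hypothetical training statistics for lengths 30 to 60 codons.
lengths = [30, 40, 50, 60]
means = [61.0, 81.0, 101.0, 121.0]
sds = [math.exp(0.3), math.exp(0.4), math.exp(0.5), math.exp(0.6)]
mean_90, sd_90 = extrapolate_stats(lengths, means, sds, 90)
```

Regressing on the log of the standard deviation keeps the extrapolated value positive at any length, which a direct linear fit on the standard deviation would not guarantee.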
