SUPPLEMENTARY INFORMATION
doi:10.1038/nature11510

Supplementary Table 1. Indel Index Removal

Gene      Number of   Number of   Percentage of sequences    Number of       Number of
          starting    final       removed based on the       non-redundant   unique
          sequences   sequences   Indel Contribution Index   sequences       species
ATP6      22545       20433       9.36%                      5359            3021
ATP8      14810       13724       7.33%                      1638            1244
COX1      24557       19403       21.05%                     6327            4450
COX2      35381       24009       32.14%                     6357            4204
COX3      16171       15300       5.38%                      2822            2191
CYTB      54628       53066       2.86%                      20976           7654
ND1       16256       14590       10.24%                     2881            2056
ND2       38346       30853       19.54%                     11547           5963
ND3       19799       18296       7.59%                      3653            2852
ND4       16550       14926       9.81%                      2979            2041
ND4L      16595       15355       7.47%                      2102            1785
ND5       12250       10221       16.56%                     1849            949
ND6       13011       12471       4.15%                      1345            1015
eef1a1    3433        2653        22.72%                     1880            1743
H3.2      11098       8171        26.30%                     1315            1228
RuBisCO   22656       19441       14.19%                     16322           13912

WWW.NATURE.COM/NATURE

Supplementary Table 2. Consistency of alignment length and comparison of amino acid usage across three different multiple sequence alignment methods.

Gene     Number of sites       Protein sequence  Clustal   MAFFT    KM-Coffee  KM-Coffee vs   KM-Coffee  Clustal Omega
         occupied by amino     length in human   Omega                         Clustal Omega  vs MAFFT   vs MAFFT
         acids in more than    and thale cress
         half of species in a  for RuBisCO
         KM-Coffee alignment
ATP6     227                   226               9.260     9.246    9.386      99.05          98.75      99.1
ATP8     55                    68                11.665    11.413   11.612     91.95          93.3       92.5
COX1     501                   513               6.361     6.248    6.334      99.95          99.9       100.0
COX2     227                   227               9.104     9.060    9.118      99.7           99.6       99.8
COX3     261                   261               7.175     7.128    7.145      99.7           99.6       99.7
CYTB     379                   380               10.697    10.647   10.683     99.65          99.6       99.7
ND1      323                   318               8.407     8.391    8.413      99.1           98.85      98.9
ND2      346                   347               10.694    10.616   10.659     99.1           99.1       99.4
ND3      116                   115               10.221    10.126   10.238     97.8           98.3       98.4
ND4      460                   459               8.935     8.905    8.995      96.7           96.45      97.1
ND4L     98                    98                10.364    10.223   10.236     96.35          96.9       94.1
ND5      611                   603               7.071     7.006    7.105      96.55          94.75      95.9
ND6      173                   174               8.813     8.937    8.970      87.4           90.5       87.9
eef1a1   364                   462               3.047     3.045    3.045      100            100        100.0
H3.2     109                   136               3.583     3.581    3.594      99.7           99.7       100.0
RuBisCO  453                   479               8.661     8.601    8.651      99.95          99.85      99.9

Supplementary Table 3. Fraction of species with data on the density of non-fixed states.

Gene      Number of   Number of species      Fraction of species
          species     with 2 or more         with 2 or more
                      sequences              sequences
ATP6      3021        2069                   0.68
ATP8      1244        865                    0.70
COX1      4450        1809                   0.41
COX2      4204        2044                   0.49
COX3      2191        1582                   0.72
CYTB      7954        4864                   0.61
ND1       2056        1148                   0.56
ND2       5963        2596                   0.44
ND3       2852        1811                   0.63
ND4       2041        1427                   0.70
ND4L      1785        1710                   0.96
ND5       949         678                    0.71
ND6       1015        706                    0.70
eef1a1    1743        207                    0.12
H3.2      1228        727                    0.59
RuBisCO   13912       2558                   0.18

Supplementary Table 4. Estimating amino acid usage without the contribution of non-fixed states through the elimination of rare amino acid states.

Gene      Amino acid usage   Expected dn/ds from (u-1)/19
ATP6      6.92               0.31
ATP8      9.07               0.42
COX1      3.85               0.15
COX2      6.00               0.26
COX3      6.21               0.27
CYTB      5.78               0.25
ND1       5.72               0.25
ND2       7.32               0.33
ND3       6.17               0.27
ND4       8.34               0.39
ND4L      7.06               0.32
ND5       5.70               0.25
ND6       8.03               0.37
eef1a1    2.25               0.07
H3.2      1.90               0.05
RuBisCO   2.74               0.09
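
The last column of Supplementary Table 4 follows from the amino acid usage u through the expression (u-1)/19 given in the column header. As a minimal sketch, the arithmetic can be reproduced as follows; the function name and the dictionary are illustrative choices, and only a few of the tabulated usage values are copied in.

```python
def expected_dnds(u):
    """Expected dn/ds from amino acid usage u, using (u - 1) / 19
    as stated in the header of Supplementary Table 4."""
    return (u - 1.0) / 19.0

# Amino acid usage values copied from Supplementary Table 4 (subset).
usage = {
    "ATP6": 6.92,
    "COX1": 3.85,
    "ND4": 8.34,
    "H3.2": 1.90,
    "RuBisCO": 2.74,
}

# Round to two decimals, the precision used in the table.
expected = {gene: round(expected_dnds(u), 2) for gene, u in usage.items()}

for gene, value in expected.items():
    print(f"{gene:8s} {value:.2f}")
```

The rounded results agree with the tabulated values for these genes (for example, 0.31 for ATP6 and 0.05 for H3.2).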

Supplementary Table 5. Estimating amino acid usage without the contribution of rare non-fixed states.

Gene      Number of species   Number of   Fraction of species   Average amino acid    Average amino acid
          with 3 or more      species     with 3 or more        usage excluding rare  usage from 1000
          sequences                       sequences             non-fixed states      replicates of single
                                                                                      sequences
ATP6      228                 3021        0.075                 5.0                   5.2
ATP8      51                  1244        0.041                 5.2                   5.3
COX1      259                 4450        0.058                 3.0                   3.2
COX2      346                 4204        0.082                 5.0                   5.4
COX3      55                  2191        0.025                 3.5                   3.6
CYTB      1585                7954        0.199                 5.5                   6.5
ND1       73                  2056        0.036                 4.7                   4.8
ND2       641                 5963        0.107                 7.0                   7.3
ND3       123                 2852        0.043                 5.4                   5.6
ND4       112                 2041        0.055                 4.8                   4.9
ND4L      52                  1785        0.029                 4.8                   4.9
ND5       53                  949         0.056                 3.6                   3.6
ND6       31                  1015        0.031                 4.1                   4.2
eef1a1    21                  1743        0.012                 1.1                   1.2
H3.2      12                  1228        0.001                 0.9                   1.0
RuBisCO   353                 13912       0.025                 2.2                   2.5

Supplementary Table 6. Estimating average dn/ds in different genes.

Gene      Number of         Number of   Average observed   Standard deviation of
          nonoverlapping    species     dn/ds              the observed dn/ds
          clusters
ATP6      245               1300        0.056              0.048
ATP8      100               781         0.224              0.158
COX1      326               1123        0.015              0.022
COX2      330               1214        0.025              0.024
COX3      173               622         0.036              0.031
CYTB      798               3992        0.039              0.029
ND1       177               569         0.040              0.032
ND2       623               3210        0.067              0.033
ND3       242               989         0.069              0.047
ND4       135               510         0.045              0.027
ND4L      135               441         0.076              0.078
ND5       97                370         0.057              0.028
ND6       104               406         0.073              0.068
eef1a1    94                1343        0.020              0.014
H3.2      73                670         0.037              0.065
RuBisCO   151               13546       0.072              0.067

[Figure: x-axis, amino acid usage (0 to 20); y-axis, frequency (0 to 0.08)]

Supplementary Figure 1. Frequency distribution of amino acid usage across sites in all genes in our dataset.

[Figure: x-axis, the number of times an amino acid state is observed at a site (0 to 20); y-axis, frequency (0 to 0.25)]

Supplementary Figure 2. Frequency distribution of the number of times an amino acid state is observed across sites in all genes in our dataset.

[Figure: schematic phylogeny with a time axis; depths labelled t_0, t_1, t_2, t_3, t_4]

Supplementary Figure 3. A simulated phylogeny with regularly spaced speciation events. The total time on the phylogeny between different depths is indicated by t_n; for example, the total evolutionary time on the phylogeny since the last speciation event is t_0. In this example t_0 is twofold larger than t_1, and t_n is twofold larger than t_(n+1). If the rate of amino acid substitution is constant along the phylogeny, then the number of substitutions that happened within the timeframe of t_0 is also twofold higher than the number of substitutions that occurred within the timeframe of t_1. Therefore, given a multiple alignment of orthologues from the species represented on this tree, the number of amino acid states found only once is expected to be twofold larger than the number of states found twice. Thus, given a realistic phylogeny of many species, without an overwhelming bias of shorter branches close to the terminal areas of the phylogeny, the frequency distribution of amino acid states is expected to be an exponentially declining function closely resembling the relationship reported in Supplementary Figure 2.
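
The halving argument in this caption can be checked with a short calculation. The sketch below is only an illustration under stated assumptions: a perfectly balanced binary tree, equal spacing between speciation events, a constant substitution rate, and no repeated substitutions at the same site. The function name and its parameters are hypothetical and do not come from the source.

```python
def expected_state_counts(n_levels, interval=1.0, rate=1.0):
    """Expected number of new amino acid states shared by 2**n species.

    Assumes a balanced binary tree. At depth n (n = 0 is the terminal
    layer) there are 2**(n_levels - 1 - n) branches of length `interval`,
    and a substitution on such a branch is inherited by 2**n species.
    """
    counts = {}
    for n in range(n_levels):
        branches = 2 ** (n_levels - 1 - n)
        total_time = branches * interval     # plays the role of t_n
        counts[2 ** n] = rate * total_time   # expected substitutions within t_n
    return counts

# Five depths, matching t_0 ... t_4 in the figure.
counts = expected_state_counts(n_levels=5)
for copies, expected in counts.items():
    print(f"states seen in {copies:2d} species: {expected}")
```

Under these assumptions each step deeper into the tree halves the total branch length, so the expected count halves each time the number of species sharing a state doubles.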

[Supplementary Figure 4, panels A-N]

[Supplementary Figure 4, panels O-P; surviving panel label: H3.2]

Supplementary Figure 4. The relationship between amino acid usage, u, and the number of sequences included in the multiple alignment. From the multiple alignment we sampled a single sequence without replacement and placed it into a new alignment, calculating u in the new alignment at every step until we ran out of sequences. The procedure was repeated 100 times.
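
The resampling procedure described in this caption can be sketched as below. This is not the authors' code. It assumes that u is the mean number of distinct amino acids per alignment column with gap characters ignored, and that the replicates are averaged position by position; neither detail is stated here. The function names and the toy alignment are hypothetical.

```python
import random

def usage_curve(alignment, rng):
    """Move sequences one at a time, in random order, into a new alignment
    and record u after each addition.

    Assumption: u is the mean number of distinct amino acids per column,
    with gap characters ('-') ignored.
    """
    order = rng.sample(alignment, len(alignment))  # sampling without replacement
    observed = [set() for _ in range(len(alignment[0]))]
    curve = []
    for seq in order:
        for i, aa in enumerate(seq):
            if aa != "-":
                observed[i].add(aa)
        curve.append(sum(len(s) for s in observed) / len(observed))
    return curve

def mean_usage_curve(alignment, n_replicates=100, seed=0):
    """Average the curve over replicates (the averaging step is an assumption)."""
    rng = random.Random(seed)
    curves = [usage_curve(alignment, rng) for _ in range(n_replicates)]
    return [sum(c[k] for c in curves) / n_replicates
            for k in range(len(alignment))]

# Toy alignment of four sequences and four sites, for illustration only.
toy = ["MKVL", "MKIL", "MRVL", "MKVF"]
curve = mean_usage_curve(toy)
print(curve)
```

The first point of the curve is 1 whenever the sampled sequence has no gaps, and the last point equals u for the full alignment regardless of sampling order.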