Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons

Leming Zhou and Liliana Florea

Supplementary Materials

1 Methods

1.1 Cluster-based seed design

1. Determine Homologous Genes. We perform all-against-all two-way BLAST [1] comparisons between the full-length mRNA sequence sets of these species [2] and determine homologous gene pairs based on the alignment coverage (C > 0.6) and E-value (E < 10^-6) of the comparison.

2. Generate Pairwise Sequence Alignments. For each comparison, we randomly select 500 homologous pairs and align them with the program BLASTZ [3]. The resulting pairwise alignments are then used to train Markov models of comparisons [4].

3. Calculate KLD Distances and Cluster the Comparisons. We calculate KLD distances between Markov models and generate a distance profile for each comparison. (A profile is the vector of KLD distances of one comparison with all comparisons.) Profiles are then clustered using hierarchical clustering with the Pearson correlation coefficient, as implemented in R (http://r-project.org), to create groups of similar comparisons.

4. Determine and Evaluate Optimal Seeds. We optimize seeds for each cluster and for each comparison in the clusters, and validate them by comparing their performance within their group and in the other groups. We use the methods in [4] to calculate the sensitivity of a seed, combined with a hill-climbing procedure to estimate optimal seeds [5].
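Step 3 can be illustrated with a small sketch. The KLD values below are made-up toy numbers for four hypothetical comparisons A to D, not results from this study. The distance 1 - r and the average-linkage rule are assumptions: the text only states that hierarchical clustering with the Pearson correlation coefficient was run in R. The helper names pearson and cluster_profiles are illustrative.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def cluster_profiles(profiles, n_clusters):
    """Agglomerative clustering with distance 1 - r (average linkage assumed)."""
    names = list(profiles)
    dist = {
        (a, b): 1.0 - pearson(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist[p] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)      # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


# Toy profiles: row X holds the KLD distances of comparison X to A, B, C, D.
toy_kld = {
    "A": [0.0, 0.1, 2.0, 2.2],
    "B": [0.1, 0.0, 1.9, 2.1],
    "C": [2.0, 1.9, 0.0, 0.2],
    "D": [2.2, 2.1, 0.2, 0.0],
}
clusters = cluster_profiles(toy_kld, 2)
print(clusters)
```

Comparisons whose profiles rise and fall together end up in the same group, even if their absolute KLD values differ, which is the property a correlation-based distance is chosen for.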

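1.2 Efficient calculation of the KLD distance for two Markov models

Let $M_1$ and $M_2$ be two Markov models of order $k = 3$, characterized by the probability distributions $P$ and $Q$ on the space of alignment words $X = \{0, 1, x\}^L$. To compute $\mathrm{KLD}(P, Q)$, we divide the contributions from words $w$ into groups based on the last 3 digits in the word, and calculate the values recursively for increasing word lengths $|w| = m \le L$:

\[
\mathrm{KLD}(P, Q) = \sum_{a_1 a_2 a_3 \in \{0,1,x\}^3} \mathrm{KLD}^{(L)}(P, Q; a_1 a_2 a_3) \tag{1}
\]

Let $\lambda \in \{0, 1, x\}^{m-3}$, $b \in \{0, 1, x\}$, for $m \ge 3$. Then

\[
\begin{aligned}
\mathrm{KLD}^{(m+1)}(P, Q; a_1 a_2 a_3)
&= \sum_{\lambda, b} p(\lambda b a_1 a_2 a_3) \log \frac{p(\lambda b a_1 a_2 a_3)}{q(\lambda b a_1 a_2 a_3)} \\
&= \sum_{\lambda, b} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \left( \log \frac{p(\lambda b a_1 a_2)}{q(\lambda b a_1 a_2)} + \log \frac{\mu_P(a_3 \mid b a_1 a_2)}{\mu_Q(a_3 \mid b a_1 a_2)} \right) \\
&= \sum_{\lambda, b} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \log \frac{p(\lambda b a_1 a_2)}{q(\lambda b a_1 a_2)}
 + \sum_{\lambda, b} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \log \frac{\mu_P(a_3 \mid b a_1 a_2)}{\mu_Q(a_3 \mid b a_1 a_2)} \\
&= \sum_{b} \left[ \mu_P(a_3 \mid b a_1 a_2) \sum_{\lambda} p(\lambda b a_1 a_2) \log \frac{p(\lambda b a_1 a_2)}{q(\lambda b a_1 a_2)} \right]
 + \sum_{b} K(a_3; b a_1 a_2) \sum_{\lambda} p(\lambda b a_1 a_2) \\
&= \sum_{b} \mu_P(a_3 \mid b a_1 a_2)\, \mathrm{KLD}^{(m)}(P, Q; b a_1 a_2)
 + \sum_{b} K(a_3; b a_1 a_2)\, P^{(m)}(b a_1 a_2)
\end{aligned} \tag{2}
\]

where

\[
K(a_3; b a_1 a_2) = \mu_P(a_3 \mid b a_1 a_2) \log \frac{\mu_P(a_3 \mid b a_1 a_2)}{\mu_Q(a_3 \mid b a_1 a_2)} \tag{3}
\]

is a constant, and

\[
P^{(m)}(b a_1 a_2) = \sum_{\lambda} p(\lambda b a_1 a_2) \tag{4}
\]

Note that $P^{(m)}(b a_1 a_2)$ can be calculated a priori with the recurrences:

\[
\begin{aligned}
P^{(m+1)}(a_1 a_2 a_3)
&= \sum_{\lambda, b} p(\lambda b a_1 a_2 a_3) \\
&= \sum_{\lambda, b} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \\
&= \sum_{b} \mu_P(a_3 \mid b a_1 a_2) \left( \sum_{\lambda} p(\lambda b a_1 a_2) \right) \\
&= \sum_{b} \mu_P(a_3 \mid b a_1 a_2)\, P^{(m)}(b a_1 a_2)
\end{aligned} \tag{5}
\]

For $L = 64$, these recurrences will generate $64 \times 3^3$ intermediate values, which are later used in the calculation of $\mathrm{KLD}^{(m)}$. Hence, the KLD distance can be computed efficiently in $O(L)$ time.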
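The recurrences for KLD^(m) and P^(m) can be sketched in a few lines of Python and checked against a brute-force sum over all words for a small L. This is a minimal sketch under stated assumptions, not the authors' implementation: each model is represented as a pair (init, mu), where init is an assumed distribution over the first three symbols (the base case m = 3, which the text does not spell out) and mu[b a1 a2][a3] is the transition probability. The random parameters and the function names random_model, kld_recursive and kld_brute_force are illustrative only.

```python
import itertools
import math
import random

ALPHABET = "01x"
CONTEXTS = ["".join(t) for t in itertools.product(ALPHABET, repeat=3)]


def random_model(rng):
    """Hypothetical order-3 model with strictly positive random parameters."""
    weights = [rng.random() + 0.1 for _ in CONTEXTS]
    total = sum(weights)
    init = {c: w / total for c, w in zip(CONTEXTS, weights)}
    mu = {}
    for c in CONTEXTS:
        row = [rng.random() + 0.1 for _ in ALPHABET]
        s = sum(row)
        mu[c] = {a: r / s for a, r in zip(ALPHABET, row)}
    return init, mu


def kld_recursive(model_p, model_q, length):
    """KLD(P, Q) over words of the given length, using the two recurrences."""
    init_p, mu_p = model_p
    init_q, mu_q = model_q
    # Assumed base case m = 3: words are exactly the 27 three-symbol contexts.
    prob = dict(init_p)
    kld = {c: init_p[c] * math.log(init_p[c] / init_q[c]) for c in CONTEXTS}
    for _ in range(length - 3):
        new_prob = {c: 0.0 for c in CONTEXTS}
        new_kld = {c: 0.0 for c in CONTEXTS}
        for ctx in CONTEXTS:              # ctx = b a1 a2
            for a3 in ALPHABET:
                nxt = ctx[1:] + a3        # a1 a2 a3
                m_p = mu_p[ctx][a3]
                k = m_p * math.log(m_p / mu_q[ctx][a3])   # constant K
                new_kld[nxt] += m_p * kld[ctx] + k * prob[ctx]
                new_prob[nxt] += m_p * prob[ctx]
        prob, kld = new_prob, new_kld
    return sum(kld.values())


def word_prob(model, word):
    """Probability of one word under an (init, mu) model."""
    init, mu = model
    p = init[word[:3]]
    for i in range(3, len(word)):
        p *= mu[word[i - 3:i]][word[i]]
    return p


def kld_brute_force(model_p, model_q, length):
    """Direct sum over all 3**length words; feasible only for small lengths."""
    total = 0.0
    for t in itertools.product(ALPHABET, repeat=length):
        w = "".join(t)
        pw = word_prob(model_p, w)
        total += pw * math.log(pw / word_prob(model_q, w))
    return total


rng = random.Random(0)
model_p, model_q = random_model(rng), random_model(rng)
print(kld_recursive(model_p, model_q, 7), kld_brute_force(model_p, model_q, 7))
print(kld_recursive(model_p, model_q, 64))
```

Each step of the loop touches 27 contexts times 3 symbols, so the cost grows linearly in L, whereas the brute-force sum enumerates 3^L words.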

2 Supplementary Tables

Table S1: Seeds optimized with the hill-climbing algorithm for four clusters.

Cluster  n_1  n_0  n_x  W   S_n       Optimal seed
L_1      10   10   2    11  0.999915  11x111110x101100000000
L_1      11   9    2    12  0.999749  11x11011111x1100000000
L_1      11   7    4    13  0.999387  11x110x1011x11x1100000
L_1      12   6    4    14  0.998603  11x1100x011011x11x1100
L_1      13   5    4    15  0.997180  11x11001011x11x11x1100
L_1      14   4    4    16  0.994985  11x110x1011x11111x1100
L_2      9    9    4    11  0.984559  11x1100x0x1011x1100000
L_2      10   8    4    12  0.967898  110x101100x0x1011x1100
L_2      11   7    4    13  0.950523  11x1100x0x1011x1101100
L_2      12   6    4    14  0.922839  11x11011x1100x011x1100
L_2      13   5    4    15  0.888319  1011x1100x011x11011x11
L_2      13   3    6    16  0.845684  1x11011x110xx011x11x11
L_3      10   10   2    11  0.916878  x10110000x101101101100
L_3      9    7    6    12  0.853739  10110xx00x0x1011xx1011
L_3      13   9    0    13  0.825306  1011011000011011011011
L_3      13   7    2    14  0.758105  101101101100x011011x11
L_3      14   6    2    15  0.685740  1011x11011001011011x11
L_3      14   4    4    16  0.602507  1011x11011xx1011011x11
L_4      10   10   2    11  0.812331  x10110000x101101101100
L_4      10   8    4    12  0.725855  10110x100x0x10110x1011
L_4      13   9    0    13  0.677509  1011011000011011011011
L_4      14   8    0    14  0.589546  1011011011001011011011
L_4      14   6    2    15  0.493761  1011x11011001011011x11
L_4      15   5    2    16  0.408166  1011x11011011011011x11
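The n_1, n_0, n_x and W columns of Table S1 can be recomputed from the seed strings. The sketch below counts the three symbols and uses W = n_1 + n_x / 2, a relation read off the table rows rather than stated in this supplement, so it should be treated as an assumption. The function name seed_composition is illustrative.

```python
def seed_composition(seed):
    """Return (n_1, n_0, n_x, W) for a seed over the symbols 1, 0 and x."""
    n_1 = seed.count("1")
    n_0 = seed.count("0")
    n_x = seed.count("x")
    if n_1 + n_0 + n_x != len(seed):
        raise ValueError("seed contains symbols other than 1, 0 and x")
    # Assumed weight rule, consistent with the rows of Table S1.
    weight = n_1 + n_x / 2
    return n_1, n_0, n_x, weight


# Two seeds copied from Table S1 (cluster L_1, W = 11 and cluster L_3, W = 12).
for seed in ("11x111110x101100000000", "10110xx00x0x1011xx1011"):
    print(seed, seed_composition(seed))
```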

Table S2: Average seed sensitivities when applying seeds optimized for each of the four clusters L_1, L_2, L_3, L_4 to each of the four clusters L_1, L_2, L_3, and L_4. Only results for seeds with weight 11 and 12 are shown. L_{1,O}, L_{2,O}, L_{3,O}, L_{4,O} are the seeds optimized with hill-climbing for clusters L_1, L_2, L_3, L_4, respectively.

Cluster  W   L_{1,O}  L_{2,O}  L_{3,O}  L_{4,O}
L_1      11  0.9996   0.9996   0.9995   0.9995
L_1      12  0.9989   0.9989   0.9987   0.9987
L_2      11  0.9746   0.9758   0.9736   0.9728
L_2      12  0.9508   0.9525   0.9494   0.9485
L_3      11  0.8495   0.8656   0.8707   0.8704
L_3      12  0.7671   0.7866   0.7926   0.7923
L_4      11  0.6790   0.7128   0.7259   0.7273
L_4      12  0.5682   0.6021   0.6168   0.6177

Table S3: Average seed sensitivities when applying optimal seeds obtained from clusters L_1, L_2, L_3, L_4 on the comparisons in cluster L_2.

Comparison     W   L_1       L_2       L_3       L_4
human.mouse    11  0.990007  0.991129  0.990382  0.990099
human.mouse    12  0.977016  0.978889  0.977764  0.977382
human.rat      11  0.989001  0.990320  0.989518  0.989211
human.rat      12  0.975015  0.977156  0.975931  0.975512
human.cow      11  0.995041  0.995272  0.994641  0.994413
human.cow      12  0.987990  0.988396  0.987354  0.987044
human.dog      11  0.993861  0.994194  0.993400  0.993134
human.dog      12  0.985533  0.986036  0.984773  0.984401
chimp.mouse    11  0.977199  0.978856  0.977005  0.976370
chimp.mouse    12  0.954564  0.956988  0.954438  0.953667
chimp.rat      11  0.974885  0.976662  0.974638  0.973946
chimp.rat      12  0.950804  0.953457  0.950650  0.949849
chimp.cow      11  0.978726  0.979210  0.976890  0.976128
chimp.cow      12  0.958448  0.958901  0.955660  0.954789
chimp.dog      11  0.966509  0.966284  0.963142  0.961981
chimp.dog      12  0.939685  0.939603  0.935185  0.933967
macaque.mouse  11  0.954598  0.957642  0.954435  0.953297
macaque.mouse  12  0.918272  0.922035  0.917908  0.916685
macaque.rat    11  0.947439  0.950677  0.947242  0.946005
macaque.rat    12  0.907642  0.911506  0.907144  0.905842
macaque.cow    11  0.964470  0.965480  0.962198  0.961092
macaque.cow    12  0.934989  0.935939  0.931590  0.930354
macaque.dog    11  0.970691  0.970929  0.967699  0.966559
macaque.dog    12  0.945759  0.945838  0.941483  0.940334
mouse.cow      11  0.985923  0.987575  0.986820  0.986468
mouse.cow      12  0.968894  0.971558  0.970417  0.969980
mouse.dog      11  0.976326  0.978580  0.977104  0.976572
mouse.dog      12  0.952359  0.955700  0.953659  0.952978
rat.cow        11  0.978807  0.980776  0.979610  0.979117
rat.cow        12  0.956675  0.959823  0.958126  0.957556
rat.dog        11  0.964705  0.967622  0.965481  0.964743
rat.dog        12  0.933586  0.937681  0.934863  0.934011
cow.dog        11  0.987664  0.988014  0.986763  0.986323
cow.dog        12  0.973879  0.974468  0.972577  0.972045

Table S4: Average seed sensitivities when applying optimal seeds obtained from individual comparisons in clusters L_1, L_2, L_3, L_4 to clusters L_1, L_2, L_3, L_4.

W   mouse.rat  cow.dog   chimp.chicken  frog.fugu

Apply seeds optimized with hill-climbing on cluster L_1.
11  0.999603   0.999624  0.999535       0.999514
12  0.998867   0.998886  0.998748       0.998687

Apply seeds optimized with hill-climbing on cluster L_2.
11  0.975104   0.975701  0.973295       0.972550
12  0.951742   0.952515  0.949633       0.948207

Apply seeds optimized with hill-climbing on cluster L_3.
11  0.856632   0.865617  0.870608       0.870199
12  0.773639   0.783920  0.792412       0.791913

Apply seeds optimized with hill-climbing on cluster L_4.
11  0.696037   0.712277  0.726508       0.727207
12  0.578936   0.597052  0.616496       0.617461

Table S5: Performance of seeds optimized for the four clusters when incorporated into sim4cc (Zhou and Florea, in prep.; http://dna.cs.gwu.edu). S_n and S_p are the sensitivity and specificity at the nucleotide level. The Intron column shows the fraction of accurately detected introns, as a measure of the splice junction detection accuracy. All seeds have weight W = 12.

         Human-Mouse            Human-Zebrafish
Cluster  S_n    S_p    Intron   S_n    S_p    Intron
L_1      0.930  0.958  0.924    0.687  0.959  0.528
L_2      0.936  0.957  0.926    0.744  0.966  0.591
L_3      0.934  0.953  0.923    0.756  0.964  0.604
L_4      0.933  0.956  0.924    0.761  0.964  0.605

References

[1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215(3), 403-410.

[2] Pruitt, K.D., Tatusova, T. and Maglott, D.R. (2007) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35 (Database issue), D61-65.

[3] Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D. and Miller, W. (2003) Human-mouse alignments with BLASTZ. Genome Res., 13(1), 103-107.

[4] Zhou, L. and Florea, L. (2007) Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment. J. Comput. Biol., 14(2), 113-130.

[5] Buhler, J., Keich, U. and Sun, Y. (2003) Designing seeds for similarity search in genomic DNA. In Proc. Seventh Annual Intl. Conference on Computational Molecular Biology (RECOMB 2003), 67-75.