Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons


Leming Zhou and Liliana Florea

Supplementary Materials

1 Methods

1.1 Cluster-based seed design

1. Determine Homologous Genes. We perform all-against-all two-way blast [1] comparisons between the full-length mRNA sequence sets of these species [2] and determine homologous gene pairs based on the alignment coverage ($C > 0.6$) and E-value ($E < 10^{-6}$) of the comparison.

2. Generate Pairwise Sequence Alignments. For each comparison, we randomly select 500 homologous pairs and align them with the program blastz [3]. The resulting pairwise alignments are then used to train Markov models of comparisons [4].

3. Calculate KLD Distances and Cluster the Comparisons. We calculate KLD distances between Markov models and generate a distance profile for each comparison. (A profile is the vector of KLD distances of one comparison with all comparisons.) Profiles are then clustered using hierarchical clustering with the Pearson correlation coefficient, as implemented in the R package, to create groups of similar comparisons.

4. Determine and Evaluate Optimal Seeds. We optimize seeds for each cluster and for each comparison in the clusters, and validate them by comparing their performance within their group and in the other groups. We use the methods in [4] to calculate the sensitivity of a seed, combined with a hill-climbing procedure to estimate optimal seeds [5].
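As an illustration of step 3, the following minimal Python sketch clusters distance profiles using 1 - r (Pearson correlation) as the dissimilarity. It is not the R procedure used in the study: the average-linkage rule, the function names `pearson` and `cluster_profiles`, and the toy profile values are all assumptions made for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length vectors (assumes non-zero variance)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def cluster_profiles(profiles, n_clusters):
    """Agglomerative clustering of KLD distance profiles.

    profiles: dict mapping a comparison name to its vector of KLD distances.
    Dissimilarity is 1 - Pearson r; average linkage is an assumption,
    since the source does not state the linkage rule.
    """
    names = list(profiles)
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }

    def linkage(c1, c2):
        # Average pairwise dissimilarity between members of two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        # Merge the closest pair; i < j, so deleting j leaves i valid.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy profiles (hypothetical values, not data from the study).
toy_profiles = {
    "cmp_a": [0.0, 0.1, 0.9, 1.0],
    "cmp_b": [0.1, 0.0, 0.8, 0.9],
    "cmp_c": [0.9, 0.8, 0.0, 0.1],
    "cmp_d": [1.0, 0.9, 0.1, 0.0],
}
print(cluster_profiles(toy_profiles, n_clusters=2))
```

Correlating whole profiles, rather than comparing single KLD values, groups comparisons that relate to all other comparisons in a similar way.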

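1.2 Efficient calculation of the KLD distance for two Markov models

Let $M_1$ and $M_2$ be two Markov models of order $k = 3$, characterized by the probability distributions $P$ and $Q$ on the space of alignment words $X = \{0, 1, x\}^L$. To compute $\mathrm{KLD}(P, Q)$, we divide the contributions from words $w$ into groups based on the last 3 digits in the word, and calculate the values recursively for increasing word lengths $|w| = m \le L$:

$$
\mathrm{KLD}(P, Q) = \sum_{a_1 a_2 a_3 \in \{0,1,x\}^3} \mathrm{KLD}^{(L)}(P, Q; a_1 a_2 a_3) \tag{1}
$$

Let $\lambda \in \{0, 1, x\}^{m-3}$, $b \in \{0, 1, x\}$, for $m \ge 3$. Then

$$
\begin{aligned}
\mathrm{KLD}^{(m+1)}(P, Q; a_1 a_2 a_3)
&= \sum_{b} \sum_{\lambda} p(\lambda b a_1 a_2 a_3) \log \frac{p(\lambda b a_1 a_2 a_3)}{q(\lambda b a_1 a_2 a_3)} \\
&= \sum_{b} \sum_{\lambda} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \left( \log \frac{p(\lambda b a_1 a_2)}{q(\lambda b a_1 a_2)} + \log \frac{\mu_P(a_3 \mid b a_1 a_2)}{\mu_Q(a_3 \mid b a_1 a_2)} \right) \\
&= \sum_{b} \sum_{\lambda} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \log \frac{p(\lambda b a_1 a_2)}{q(\lambda b a_1 a_2)}
 + \sum_{b} \sum_{\lambda} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \log \frac{\mu_P(a_3 \mid b a_1 a_2)}{\mu_Q(a_3 \mid b a_1 a_2)} \\
&= \sum_{b} \left[ \mu_P(a_3 \mid b a_1 a_2) \sum_{\lambda} p(\lambda b a_1 a_2) \log \frac{p(\lambda b a_1 a_2)}{q(\lambda b a_1 a_2)} \right]
 + \sum_{b} K(a_3; b a_1 a_2) \sum_{\lambda} p(\lambda b a_1 a_2) \\
&= \sum_{b} \mu_P(a_3 \mid b a_1 a_2)\, \mathrm{KLD}^{(m)}(P, Q; b a_1 a_2)
 + \sum_{b} K(a_3; b a_1 a_2)\, P^{(m)}(b a_1 a_2)
\end{aligned}
\tag{2}
$$

where

$$
K(a_3; b a_1 a_2) = \mu_P(a_3 \mid b a_1 a_2) \log \frac{\mu_P(a_3 \mid b a_1 a_2)}{\mu_Q(a_3 \mid b a_1 a_2)} \tag{3}
$$

is a constant, and

$$
P^{(m)}(b a_1 a_2) = \sum_{\lambda} p(\lambda b a_1 a_2) \tag{4}
$$

Note that $P^{(m)}(b a_1 a_2)$ can be calculated a priori with the recurrences:

$$
\begin{aligned}
P^{(m+1)}(a_1 a_2 a_3)
&= \sum_{\lambda b} p(\lambda b a_1 a_2 a_3) \\
&= \sum_{\lambda} \sum_{b} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \\
&= \sum_{b} \mu_P(a_3 \mid b a_1 a_2) \left( \sum_{\lambda} p(\lambda b a_1 a_2) \right) \\
&= \sum_{b} \mu_P(a_3 \mid b a_1 a_2)\, P^{(m)}(b a_1 a_2)
\end{aligned}
\tag{5}
$$

For $L = 64$, these recurrences will generate intermediate values, which are later used in the calculation of $\mathrm{KLD}^{(m)}$. Hence, the KLD distance can be computed efficiently in $O(L)$ time.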
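The recurrences above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. The model representation (an initial distribution over 3-letter words plus a table of transition probabilities), the names `kld_markov` and `iid_model`, and the toy symbol weights are assumptions introduced here. All probabilities are assumed strictly positive so that every logarithm is defined.

```python
from itertools import product
from math import log

ALPHABET = "01x"
CONTEXTS = ["".join(t) for t in product(ALPHABET, repeat=3)]


def kld_markov(init_p, trans_p, init_q, trans_q, length):
    """KLD(P, Q) over words of the given length for two order-3 Markov models.

    init_*:  dict, 3-letter word -> probability of starting with that word
    trans_*: dict, (context, symbol) -> mu(symbol | context), context = b a1 a2
    """
    # Base case m = 3: lambda is empty, so the groups are the initial words.
    prob = dict(init_p)                                   # P^(3)
    kld = {w: init_p[w] * log(init_p[w] / init_q[w]) for w in CONTEXTS}

    for _ in range(length - 3):
        new_prob = dict.fromkeys(CONTEXTS, 0.0)
        new_kld = dict.fromkeys(CONTEXTS, 0.0)
        for ctx in CONTEXTS:                              # ctx = b a1 a2
            for a3 in ALPHABET:
                mu_p = trans_p[ctx, a3]
                mu_q = trans_q[ctx, a3]
                k = mu_p * log(mu_p / mu_q)               # K(a3; b a1 a2)
                nxt = ctx[1:] + a3                        # a1 a2 a3
                new_prob[nxt] += mu_p * prob[ctx]         # recurrence for P^(m+1)
                new_kld[nxt] += mu_p * kld[ctx] + k * prob[ctx]
        prob, kld = new_prob, new_kld

    # Sum the contributions of all 27 ending contexts.
    return sum(kld.values())


def iid_model(weights):
    """Hypothetical helper: an i.i.d. symbol model written as an order-3 chain."""
    init = {w: weights[w[0]] * weights[w[1]] * weights[w[2]] for w in CONTEXTS}
    trans = {(ctx, a): weights[a] for ctx in CONTEXTS for a in ALPHABET}
    return init, trans


# Toy symbol weights (hypothetical, not trained on any alignment).
p_weights = {"0": 0.2, "1": 0.7, "x": 0.1}
q_weights = {"0": 0.3, "1": 0.5, "x": 0.2}
init_p, trans_p = iid_model(p_weights)
init_q, trans_q = iid_model(q_weights)
print(kld_markov(init_p, trans_p, init_q, trans_q, length=64))
```

Each step updates 27 contexts with 3 predecessors each, so the cost per unit of word length is constant, which matches the $O(L)$ bound. For the i.i.d. toy models, the result should equal 64 times the single-symbol divergence, which gives a simple sanity check.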

2 Supplementary Tables

Table S1: Seeds optimized with the hill-climbing algorithm for four clusters.

Cluster  n1  n0  nx  W  Sn  Optimal Seeds
L x111110x
L x x
L x110x1011x11x
L x1100x011011x11x1100
L x x11x11x1100
L x110x1011x11111x1100
L x1100x0x1011x
L x101100x0x1011x1100
L x1100x0x1011x
L x11011x1100x011x1100
L x1100x011x11011x11
L x11011x110xx011x11x11
L x x
L xx00x0x1011xx1011
L
L x011011x11
L x x11
L x11011xx x11
L x x
L x100x0x10110x1011
L
L
L x x11
L x x11

Table S2: Average seed sensitivities when applying seeds optimized for each of the four clusters L1, L2, L3, L4 to the four clusters L1, L2, L3, and L4. Only results for seeds of weight 11 and 12 are shown. L1,O, L2,O, L3,O, L4,O are the seeds optimized with hill-climbing for clusters L1, L2, L3, L4, respectively.

Cluster  W  L1,O  L2,O  L3,O  L4,O

Table S3: Average seed sensitivities when applying optimal seeds obtained from clusters L1, L2, L3, L4 to the comparisons in cluster L2.

Comparisons  W  L1  L2  L3  L4
human.mouse
human.mouse
human.rat
human.rat
human.cow
human.cow
human.dog
human.dog
chimp.mouse
chimp.mouse
chimp.rat
chimp.rat
chimp.cow
chimp.cow
chimp.dog
chimp.dog
macaque.mouse
macaque.mouse
macaque.rat
macaque.rat
macaque.cow
macaque.cow
macaque.dog
macaque.dog
mouse.cow
mouse.cow
mouse.dog
mouse.dog
rat.cow
rat.cow
rat.dog
rat.dog
cow.dog
cow.dog

Table S4: Average seed sensitivities when applying optimal seeds obtained from individual comparisons in clusters L1, L2, L3, L4 to clusters L1, L2, L3, L4.

W  mouse.rat  cow.dog  chimp.chicken  frog.fugu
Apply seeds optimized with hill-climbing on cluster L
Apply seeds optimized with hill-climbing on cluster L
Apply seeds optimized with hill-climbing on cluster L
Apply seeds optimized with hill-climbing on cluster L

Table S5: Performance of seeds optimized for the four clusters when incorporated into sim4cc (Zhou and Florea, in prep.). Sn and Sp are the sensitivity and specificity at the nucleotide level. The Intron column shows the percentage of accurately detected introns, as a measure of the splice junction detection accuracy.

W = 12
Cluster  Human-Mouse: Sn  Sp  Intron  Human-Zebrafish: Sn  Sp  Intron

References

[1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215(3),
[2] Pruitt, K.D., Tatusova, T. and Maglott, D.R. (2007) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35 (Database issue), D
[3] Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D. and Miller, W. (2003) Human-mouse alignments with BLASTZ. Genome Res., 13(1),
[4] Zhou, L. and Florea, L. (2007) Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment. J. Comput. Biol., 14(2),
[5] Buhler, J., Keich, U. and Sun, Y. (2003) Designing seeds for similarity search in genomic DNA. In Proc. Seventh Annual Intln. Conference on Computational Molecular Biology, RECOMB 2003,
