Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons


Leming Zhou and Liliana Florea

Supplementary Materials

1 Methods

1.1 Cluster-based seed design

1. Determine Homologous Genes. We perform all-against-all two-way BLAST [1] comparisons between the full-length mRNA sequence sets of these species [2] and determine homologous gene pairs based on the alignment coverage ($C > 0.6$) and E-value ($E < 10^{-6}$) of the comparison.

2. Generate Pairwise Sequence Alignments. For each comparison, we randomly select 500 homologous pairs and align them with the program BLASTZ [3]. The resulting pairwise alignments are then used to train Markov models of comparisons [4].

3. Calculate KLD Distances and Cluster the Comparisons. We calculate KLD distances between Markov models and generate a distance profile for each comparison. (A profile is the vector of KLD distances of one comparison with all comparisons.) Profiles are then clustered using hierarchical clustering with the Pearson correlation coefficient, as implemented in the R package (http://r-project.org), to create groups of similar comparisons.

4. Determine and Evaluate Optimal Seeds. We optimize seeds for each cluster and for each comparison in the clusters, and validate them by comparing their performance within their group and in the other groups. We use the methods in [4] to calculate the sensitivity of a seed, combined with a hill-climbing procedure to estimate optimal seeds [5].
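Step 3 can be illustrated with a minimal sketch, which is not the R procedure used in the study. It assumes average linkage and a dissimilarity of 1 minus the Pearson correlation between profiles; neither choice is specified above. The comparison names (`cmpA` to `cmpD`) and the KLD values are hypothetical toy data, and `pearson` and `cluster_profiles` are illustrative helper names.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient between two equally long profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def cluster_profiles(profiles, n_clusters):
    """Agglomerative clustering of KLD distance profiles.

    profiles: dict mapping a comparison name to its vector of KLD distances.
    Dissimilarity is 1 - Pearson correlation; average linkage is an assumption.
    """
    names = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names}
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average pairwise dissimilarity between the members of two clusters.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy profiles: each row holds the KLD distances of one
# comparison to all four comparisons (made-up numbers, not study data).
toy_profiles = {
    "cmpA": [0.0, 0.1, 2.0, 2.2],
    "cmpB": [0.1, 0.0, 1.9, 2.1],
    "cmpC": [2.0, 1.9, 0.0, 0.2],
    "cmpD": [2.2, 2.1, 0.2, 0.0],
}
groups = cluster_profiles(toy_profiles, n_clusters=2)
print(groups)
```

Correlating whole profiles, rather than comparing single KLD values, groups comparisons that relate to all other comparisons in a similar way.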

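1.2 Efficient calculation of the KLD distance for two Markov models

Let $M_1$ and $M_2$ be two Markov models of order $k = 3$, characterized by the probability distributions $P$ and $Q$ on the space of alignment words $X = \{0, 1, x\}^L$. Write $p$ and $q$ for the word probabilities under $P$ and $Q$, and $\mu_P$ and $\mu_Q$ for the corresponding transition probabilities. To compute $\mathrm{KLD}(P, Q)$, we group the contributions from words $w$ by the last 3 digits of the word, and calculate the values recursively for increasing word lengths $|w| = m \le L$:

\begin{align}
\mathrm{KLD}(P, Q) = \sum_{a_1 a_2 a_3 \in \{0,1,x\}^3} \mathrm{KLD}^{(L)}(P, Q;\, a_1 a_2 a_3) \tag{1}
\end{align}

Let $\lambda \in \{0, 1, x\}^{m-3}$ and $b \in \{0, 1, x\}$, for $m \ge 3$. Then

\begin{align}
\mathrm{KLD}^{(m+1)}(P, Q;\, a_1 a_2 a_3)
&= \sum_{b} \sum_{\lambda} p(\lambda b a_1 a_2 a_3) \log \frac{p(\lambda b a_1 a_2 a_3)}{q(\lambda b a_1 a_2 a_3)} \tag{2} \\
&= \sum_{b} \sum_{\lambda} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \left( \log \frac{p(\lambda b a_1 a_2)}{q(\lambda b a_1 a_2)} + \log \frac{\mu_P(a_3 \mid b a_1 a_2)}{\mu_Q(a_3 \mid b a_1 a_2)} \right) \notag \\
&= \sum_{b} \sum_{\lambda} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \log \frac{p(\lambda b a_1 a_2)}{q(\lambda b a_1 a_2)} + \sum_{b} \sum_{\lambda} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \log \frac{\mu_P(a_3 \mid b a_1 a_2)}{\mu_Q(a_3 \mid b a_1 a_2)} \notag \\
&= \sum_{b} \left[ \mu_P(a_3 \mid b a_1 a_2) \sum_{\lambda} p(\lambda b a_1 a_2) \log \frac{p(\lambda b a_1 a_2)}{q(\lambda b a_1 a_2)} \right] + \sum_{b} K(a_3;\, b a_1 a_2) \sum_{\lambda} p(\lambda b a_1 a_2) \notag \\
&= \sum_{b} \mu_P(a_3 \mid b a_1 a_2)\, \mathrm{KLD}^{(m)}(P, Q;\, b a_1 a_2) + \sum_{b} K(a_3;\, b a_1 a_2)\, P^{(m)}(b a_1 a_2) \tag{3}
\end{align}

where

\begin{align}
K(a_3;\, b a_1 a_2) = \mu_P(a_3 \mid b a_1 a_2) \log \frac{\mu_P(a_3 \mid b a_1 a_2)}{\mu_Q(a_3 \mid b a_1 a_2)} \notag
\end{align}

is a constant, and

\begin{align}
P^{(m)}(b a_1 a_2) = \sum_{\lambda} p(\lambda b a_1 a_2) \tag{4}
\end{align}

Note that $P^{(m)}(b a_1 a_2)$ can be calculated a priori with the recurrences:

\begin{align}
P^{(m+1)}(a_1 a_2 a_3)
&= \sum_{\lambda b} p(\lambda b a_1 a_2 a_3) \tag{5} \\
&= \sum_{\lambda} \sum_{b} p(\lambda b a_1 a_2)\, \mu_P(a_3 \mid b a_1 a_2) \notag \\
&= \sum_{b} \mu_P(a_3 \mid b a_1 a_2) \left( \sum_{\lambda} p(\lambda b a_1 a_2) \right) \notag \\
&= \sum_{b} \mu_P(a_3 \mid b a_1 a_2)\, P^{(m)}(b a_1 a_2) \notag
\end{align}

For $L = 64$, these recurrences generate $64 \times 3^3$ intermediate values, which are later used in the calculation of $\mathrm{KLD}^{(m)}$. Hence, the KLD distance can be computed efficiently in $O(L)$ time.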
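The recurrences of Section 1.2 can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation. In particular, the base case is not given above: the sketch assumes each model supplies an initial distribution over 3-letter words, so that $P^{(3)}$ and $\mathrm{KLD}^{(3)}$ can be read off directly. The helpers `random_model`, `word_prob` and `kld_brute_force` are hypothetical and exist only to cross-check the recursion against direct enumeration on short words.

```python
import random
from itertools import product
from math import log

ALPHABET = "01x"
CONTEXTS = ["".join(t) for t in product(ALPHABET, repeat=3)]


def random_model(seed):
    """Hypothetical helper: random order-3 model with strictly positive probabilities.

    Returns (init, trans): init[c] is the probability of the first three
    symbols c; trans[c][a] is the probability of symbol a after context c.
    """
    rng = random.Random(seed)
    raw = [rng.random() + 0.1 for _ in CONTEXTS]
    total = sum(raw)
    init = {c: r / total for c, r in zip(CONTEXTS, raw)}
    trans = {}
    for c in CONTEXTS:
        row = [rng.random() + 0.1 for _ in ALPHABET]
        s = sum(row)
        trans[c] = {a: r / s for a, r in zip(ALPHABET, row)}
    return init, trans


def kld_recursive(model_p, model_q, length):
    """KLD(P, Q) over words of the given length, using recurrences (3) and (5)."""
    if length < 3:
        raise ValueError("length must be at least the model order (3)")
    (init_p, trans_p), (init_q, trans_q) = model_p, model_q
    # Assumed base case m = 3: contributions of the 3-letter words themselves.
    mass = dict(init_p)                                   # P^(3)
    kld = {c: init_p[c] * log(init_p[c] / init_q[c]) for c in CONTEXTS}
    for _ in range(3, length):
        new_mass = dict.fromkeys(CONTEXTS, 0.0)
        new_kld = dict.fromkeys(CONTEXTS, 0.0)
        for prev in CONTEXTS:                             # prev = b a1 a2
            for a3 in ALPHABET:
                mu_p = trans_p[prev][a3]
                k = mu_p * log(mu_p / trans_q[prev][a3])  # K(a3; b a1 a2)
                nxt = prev[1:] + a3                       # a1 a2 a3
                new_kld[nxt] += mu_p * kld[prev] + k * mass[prev]
                new_mass[nxt] += mu_p * mass[prev]
        mass, kld = new_mass, new_kld
    return sum(kld.values())                              # equation (1)


def word_prob(model, word):
    """Probability of a full word under an order-3 model."""
    init, trans = model
    prob = init[word[:3]]
    for i in range(3, len(word)):
        prob *= trans[word[i - 3:i]][word[i]]
    return prob


def kld_brute_force(model_p, model_q, length):
    """Direct enumeration over all 3^length words; only feasible for short words."""
    total = 0.0
    for letters in product(ALPHABET, repeat=length):
        word = "".join(letters)
        p = word_prob(model_p, word)
        total += p * log(p / word_prob(model_q, word))
    return total


P, Q = random_model(1), random_model(2)
print(kld_recursive(P, Q, 6), kld_brute_force(P, Q, 6))
print(kld_recursive(P, Q, 64))
```

Each step touches 27 contexts times 3 next symbols, so the cost grows linearly in the word length, whereas the enumeration grows as $3^L$.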

2 Supplementary Tables

Table S1: Seeds optimized with the hill-climbing algorithm for four clusters.

Cluster  n_1  n_0  n_x  W   S_n       Optimal seed
L_1      10   10   2    11  0.999915  11x111110x101100000000
L_1      11   9    2    12  0.999749  11x11011111x1100000000
L_1      11   7    4    13  0.999387  11x110x1011x11x1100000
L_1      12   6    4    14  0.998603  11x1100x011011x11x1100
L_1      13   5    4    15  0.997180  11x11001011x11x11x1100
L_1      14   4    4    16  0.994985  11x110x1011x11111x1100
L_2      9    9    4    11  0.984559  11x1100x0x1011x1100000
L_2      10   8    4    12  0.967898  110x101100x0x1011x1100
L_2      11   7    4    13  0.950523  11x1100x0x1011x1101100
L_2      12   6    4    14  0.922839  11x11011x1100x011x1100
L_2      13   5    4    15  0.888319  1011x1100x011x11011x11
L_2      13   3    6    16  0.845684  1x11011x110xx011x11x11
L_3      10   10   2    11  0.916878  x10110000x101101101100
L_3      9    7    6    12  0.853739  10110xx00x0x1011xx1011
L_3      13   9    0    13  0.825306  1011011000011011011011
L_3      13   7    2    14  0.758105  101101101100x011011x11
L_3      14   6    2    15  0.685740  1011x11011001011011x11
L_3      14   4    4    16  0.602507  1011x11011xx1011011x11
L_4      10   10   2    11  0.812331  x10110000x101101101100
L_4      10   8    4    12  0.725855  10110x100x0x10110x1011
L_4      13   9    0    13  0.677509  1011011000011011011011
L_4      14   8    0    14  0.589546  1011011011001011011011
L_4      14   6    2    15  0.493761  1011x11011001011011x11
L_4      15   5    2    16  0.408166  1011x11011011011011x11
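The columns of Table S1 can be cross-checked with a short sketch. It rests on two readings that the table itself does not state: that n_1, n_0 and n_x count the `1`, `0` and `x` positions of the seed, and that the weight is W = n_1 + n_x / 2. Every row of the table is consistent with both, but they remain assumptions here. The function names are illustrative.

```python
def seed_composition(seed):
    """Count the '1', '0' and 'x' positions of a seed string."""
    n1, n0, nx = seed.count("1"), seed.count("0"), seed.count("x")
    if n1 + n0 + nx != len(seed):
        raise ValueError("seed may only contain the symbols 1, 0 and x")
    return n1, n0, nx


def seed_weight(seed):
    """Weight under the assumed convention W = n_1 + n_x / 2."""
    n1, _, nx = seed_composition(seed)
    return n1 + nx / 2


# A few rows copied from Table S1: (seed, n_1, n_0, n_x, W).
rows = [
    ("11x111110x101100000000", 10, 10, 2, 11),
    ("11x11011111x1100000000", 11, 9, 2, 12),
    ("1x11011x110xx011x11x11", 13, 3, 6, 16),
    ("1011011000011011011011", 13, 9, 0, 13),
]
for seed, n1, n0, nx, weight in rows:
    print(seed, seed_composition(seed) == (n1, n0, nx), seed_weight(seed) == weight)
```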

Table S2: Average of seed sensitivities when applying seeds optimized for each of the four clusters $L_1$, $L_2$, $L_3$, $L_4$ to the four clusters $L_1$, $L_2$, $L_3$, and $L_4$. Only results for seeds with weight 11 and 12 are shown. $L_{1,O}$, $L_{2,O}$, $L_{3,O}$, $L_{4,O}$ are the seeds optimized with hill-climbing for clusters $L_1$, $L_2$, $L_3$, $L_4$, respectively.

Cluster  W   L_{1,O}  L_{2,O}  L_{3,O}  L_{4,O}
L_1      11  0.9996   0.9996   0.9995   0.9995
L_1      12  0.9989   0.9989   0.9987   0.9987
L_2      11  0.9746   0.9758   0.9736   0.9728
L_2      12  0.9508   0.9525   0.9494   0.9485
L_3      11  0.8495   0.8656   0.8707   0.8704
L_3      12  0.7671   0.7866   0.7926   0.7923
L_4      11  0.6790   0.7128   0.7259   0.7273
L_4      12  0.5682   0.6021   0.6168   0.6177
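One way to read Table S2 is to ask, for each row, whether the seed optimized for that cluster reaches the row maximum. The sketch below does this with the values copied from the table; the variable and function names are illustrative, and ties at the reported precision count as reaching the maximum.

```python
# Rows of Table S2: (target cluster index, W) -> sensitivities of the seeds
# L_{1,O}, L_{2,O}, L_{3,O}, L_{4,O}, in that order.
TABLE_S2 = {
    (1, 11): [0.9996, 0.9996, 0.9995, 0.9995],
    (1, 12): [0.9989, 0.9989, 0.9987, 0.9987],
    (2, 11): [0.9746, 0.9758, 0.9736, 0.9728],
    (2, 12): [0.9508, 0.9525, 0.9494, 0.9485],
    (3, 11): [0.8495, 0.8656, 0.8707, 0.8704],
    (3, 12): [0.7671, 0.7866, 0.7926, 0.7923],
    (4, 11): [0.6790, 0.7128, 0.7259, 0.7273],
    (4, 12): [0.5682, 0.6021, 0.6168, 0.6177],
}


def own_seed_is_best(table):
    """For each row, report whether the cluster's own seed attains the row maximum."""
    return {key: values[key[0] - 1] == max(values) for key, values in table.items()}


def spread(table):
    """Difference between the best and worst seed in each row."""
    return {key: round(max(values) - min(values), 4) for key, values in table.items()}


result = own_seed_is_best(TABLE_S2)
gaps = spread(TABLE_S2)
for key in TABLE_S2:
    print(key, result[key], gaps[key])
```

In these rows the gap between the best and worst seed widens from cluster $L_1$ to cluster $L_4$, as the sensitivities themselves fall.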

Table S3: Average of seed sensitivities when applying optimal seeds obtained from clusters $L_1$, $L_2$, $L_3$, $L_4$ on the comparisons in cluster $L_2$.

Comparison     W   L_1       L_2       L_3       L_4
human.mouse    11  0.990007  0.991129  0.990382  0.990099
human.mouse    12  0.977016  0.978889  0.977764  0.977382
human.rat      11  0.989001  0.990320  0.989518  0.989211
human.rat      12  0.975015  0.977156  0.975931  0.975512
human.cow      11  0.995041  0.995272  0.994641  0.994413
human.cow      12  0.987990  0.988396  0.987354  0.987044
human.dog      11  0.993861  0.994194  0.993400  0.993134
human.dog      12  0.985533  0.986036  0.984773  0.984401
chimp.mouse    11  0.977199  0.978856  0.977005  0.976370
chimp.mouse    12  0.954564  0.956988  0.954438  0.953667
chimp.rat      11  0.974885  0.976662  0.974638  0.973946
chimp.rat      12  0.950804  0.953457  0.950650  0.949849
chimp.cow      11  0.978726  0.979210  0.976890  0.976128
chimp.cow      12  0.958448  0.958901  0.955660  0.954789
chimp.dog      11  0.966509  0.966284  0.963142  0.961981
chimp.dog      12  0.939685  0.939603  0.935185  0.933967
macaque.mouse  11  0.954598  0.957642  0.954435  0.953297
macaque.mouse  12  0.918272  0.922035  0.917908  0.916685
macaque.rat    11  0.947439  0.950677  0.947242  0.946005
macaque.rat    12  0.907642  0.911506  0.907144  0.905842
macaque.cow    11  0.964470  0.965480  0.962198  0.961092
macaque.cow    12  0.934989  0.935939  0.931590  0.930354
macaque.dog    11  0.970691  0.970929  0.967699  0.966559
macaque.dog    12  0.945759  0.945838  0.941483  0.940334
mouse.cow      11  0.985923  0.987575  0.986820  0.986468
mouse.cow      12  0.968894  0.971558  0.970417  0.969980
mouse.dog      11  0.976326  0.978580  0.977104  0.976572
mouse.dog      12  0.952359  0.955700  0.953659  0.952978
rat.cow        11  0.978807  0.980776  0.979610  0.979117
rat.cow        12  0.956675  0.959823  0.958126  0.957556
rat.dog        11  0.964705  0.967622  0.965481  0.964743
rat.dog        12  0.933586  0.937681  0.934863  0.934011
cow.dog        11  0.987664  0.988014  0.986763  0.986323
cow.dog        12  0.973879  0.974468  0.972577  0.972045

Table S4: Average of seed sensitivities when applying optimal seeds obtained from individual comparisons in clusters $L_1$, $L_2$, $L_3$, $L_4$ to clusters $L_1$, $L_2$, $L_3$, $L_4$.

W   mouse.rat  cow.dog   chimp.chicken  frog.fugu

Apply seeds optimized with hill-climbing on cluster L_1
11  0.999603   0.999624  0.999535       0.999514
12  0.998867   0.998886  0.998748       0.998687

Apply seeds optimized with hill-climbing on cluster L_2
11  0.975104   0.975701  0.973295       0.972550
12  0.951742   0.952515  0.949633       0.948207

Apply seeds optimized with hill-climbing on cluster L_3
11  0.856632   0.865617  0.870608       0.870199
12  0.773639   0.783920  0.792412       0.791913

Apply seeds optimized with hill-climbing on cluster L_4
11  0.696037   0.712277  0.726508       0.727207
12  0.578936   0.597052  0.616496       0.617461

Table S5: Performance of seeds optimized for the four clusters when incorporated into sim4cc (Zhou and Florea, in prep.; http://dna.cs.gwu.edu). $S_n$ and $S_p$ are the sensitivity and specificity at the nucleotide level. The Intron column shows the percentage of accurately detected introns, as a measure of the splice junction detection accuracy.

Cluster   Human-Mouse            Human-Zebrafish
(W = 12)  S_n    S_p    Intron   S_n    S_p    Intron
L_1       0.930  0.958  0.924    0.687  0.959  0.528
L_2       0.936  0.957  0.926    0.744  0.966  0.591
L_3       0.934  0.953  0.923    0.756  0.964  0.604
L_4       0.933  0.956  0.924    0.761  0.964  0.605

References

[1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215(3), 403-410.

[2] Pruitt, K.D., Tatusova, T. and Maglott, D.R. (2007) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35 (Database issue), D61-65.

[3] Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D. and Miller, W. (2003) Human-mouse alignments with BLASTZ. Genome Res., 13(1), 103-107.

[4] Zhou, L. and Florea, L. (2007) Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment. J. Comput. Biol., 14(2), 113-130.

[5] Buhler, J., Keich, U. and Sun, Y. (2003) Designing seeds for similarity search in genomic DNA. In Proc. Seventh Annual Intl. Conference on Computational Molecular Biology (RECOMB 2003), 67-75.