MB330 - January, 2006

Sequence-comparison methods
Gerard Kleywegt
Uppsala University

Outline
- Why compare sequences?
- Dotplots
- Pairwise sequence alignments
- Multiple sequence alignments
- Profile methods
- Hidden Markov Models (HMMs)
    Separate lectures by Patrik Johansson
- Motif- and family-based methods
    Separate lecture by Marian Novotny

Why compare sequences?
Buzzzzzzzz!
- Sequence comparison is the bread and butter of bioinformatics - WHY???
    Sequence-to-database
    Sequence-to-sequence
- Discuss in groups of 3 for 3 minutes
- Write down ~3 things that you think protein sequence comparisons could be used for

The class of 2006!
Sequence-to-database
- Identification of protein
- Clues about function
- Find related sequences
- Clues about domain structure
- Verify hypothetical proteins
- Clues about structural similarities
- Find sequence motifs (active site, ...)

The class of 2006!
Sequence-to-sequence
- Investigate evolutionary history and relationships
- Analyse differences between species and between individuals (e.g., disease-causing mutations)
- Structure modelling
- Clues about secondary structure
- Sequence motifs (active site, ...)

Sequence-database comparison
- Find related sequences
    Homology
      Descended from a common ancestor (T/F!!)
      Occurrence in other organisms (orthologs; speciation)
      Occurrence in same organism (paralogs; gene duplication)
    Convergent evolution
      Independently evolved same function
    Shared motif(s)
    Shared domains
    Chance similarities
- Find clues about function
- Find clues about structure

Sequence-sequence comparison
- Alignment of (possibly) homologous sequences
    Measure similarity, cluster
    Determine residue-residue correspondences
    Find patterns of conservation and variability
      Functionally important sites
      Structurally important sites
    Infer evolutionary relationships, phylogeny
    Structure prediction
      Secondary structure prediction
      Homology modelling
    Function prediction (caution!)

Sequence identity
- Sequence identity (%SI) = 100% * (Nr of identical residues in pairwise alignment) / (Length of the shortest sequence)
- Ex: ...
    %SI = 100% * 6 / min(9,10) = 67%

Sequence identity/homology
- Homology and level of sequence identity (or similarity) are two fundamentally different concepts!
- Can homology be inferred/rejected based on the level of sequence identity?

Sequence identity/homology
- Sequence identity of non-homologous proteins (Rost, 1999)

Sequence identity/homology
- Sequence identity of homologous proteins (Rost, 1999)
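
The percentage sequence identity defined above can be sketched in a few lines of Python. This is only an illustration: the function name `percent_identity`, the gap character "-", and the two toy aligned sequences are assumptions made here, not taken from the lecture (the original example alignment is not available). The toy alignment was chosen to reproduce the numbers on the slide: 6 identities, sequence lengths 9 and 10.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage identity of two already-aligned sequences.

    100 * (number of identical aligned residues)
        / (length of the shorter sequence, gaps not counted)
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Count columns where both rows carry the same residue (gaps never count).
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )

    # Sequence lengths are measured without gap characters.
    shortest = min(
        len(aligned_a.replace(gap, "")),
        len(aligned_b.replace(gap, "")),
    )
    if shortest == 0:
        raise ValueError("sequences must contain at least one residue")

    return 100.0 * identical / shortest


if __name__ == "__main__":
    # Hypothetical toy alignment: 9 residues + 1 gap versus 10 residues.
    a = "ACDEFGHIK-"
    b = "ACDEFGWWWY"
    print(f"%SI = {percent_identity(a, b):.0f}%")
```

Note that other denominators are in use elsewhere (alignment length, mean length), so the value depends on the definition; the sketch follows the one given above.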

Sequence identity/homology
- Two proteins of 100 or more residues with %SI >35% are likely to be homologous
- However, homologous proteins may well have %SI <35%
    Twilight Zone (Doolittle)
- %SI <20%
    Midnight Zone (Rost)
- Average %SI ~8.5% for remote homologs
- Average %SI ~5.6% for random sequences

Structure conservation
- Homologous proteins will have similar structures
- Structure better conserved than sequence
- Non-homologous proteins may have similar structures
  (Chothia & Lesk, 1986)

Dotplots
- Dotplot: simple overview of the similarities of two words/sequences
    Gives clues about alignment too
- Calculation:
    Matrix
    Columns = residues of sequence 1
    Rows = residues of sequence 2 (or 1)
    Simplest form: put dots in the matrix where the row and column residues are identical
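
The simplest form of the calculation described above can be sketched as follows. The function names `dotplot` and `render` and the example words are assumptions made for this illustration; columns follow sequence 1 and rows follow sequence 2, as on the slide.

```python
def dotplot(seq1, seq2):
    """Simplest dotplot: element [row][col] is True where
    residue `row` of seq2 equals residue `col` of seq1."""
    return [[a == b for a in seq1] for b in seq2]


def render(matrix, seq1, seq2):
    """Draw the matrix as text: '*' for a dot, '.' for an empty element."""
    lines = ["  " + seq1]  # header row: residues of sequence 1 (columns)
    for residue, row in zip(seq2, matrix):
        lines.append(residue + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example words, not the ones used in the lecture.
    s1, s2 = "SEQUENCE", "SEQUEL"
    print(render(dotplot(s1, s2), s1, s2))
```

Calling `dotplot(s, s)` gives a self-dotplot: the main diagonal is always filled, and for a palindrome the anti-diagonal is filled as well.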

Dotplot example
[Figure: dotplot of two example words; the letters on the axes are not recoverable]

Self-dotplot
- Why compare a sequence to itself?
- Internal symmetry
    Translational = domain duplication
    Inversion
      DNA recognition sites for transcriptional regulators and restriction enzymes
      Ex: EcoRI: ... / ...
    Low-complexity regions
      Ex: Alu repeat

Dotplot of a palindrome?

Dotplot of a palindrome!

Low-complexity region?

Low-complexity region!

Domain duplication?

Domain duplication!
[Figure: axis labels are domain names; only the letters D and F survive]

Shared domains?

Shared domains!
[Figure: axis labels are domain names; only "Domain D" and "Domain F" survive]

Dotplots with window
- Usually:
    Define a window size
    Count number of identical residues within the window
    If the count is at least a certain threshold, put a dot in the matrix element
    Ex: window 3 (-1,0,+1), minimum of 2 identities
    Ex: window 15 (-7,-6,...,+7), minimum of 6 identities

Dotplots with window
Window 3, threshold 2????
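
The windowed calculation described above might be sketched like this. The function name `windowed_dotplot` is an assumption, as are two design choices the slides do not spell out: a dot is placed when the count is greater than or equal to the threshold (matching "minimum of 2 identities"), and window positions that run off either end of a sequence simply count as non-identical.

```python
def windowed_dotplot(seq1, seq2, window=3, threshold=2):
    """Dotplot with a sliding window centred on each matrix element.

    Element [j][i] gets a dot when at least `threshold` of the `window`
    diagonal pairs (seq1[i+k], seq2[j+k]), k = -half..+half, are identical.
    Assumption: pairs that fall outside either sequence count as mismatches.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window size must be a positive odd number")

    half = window // 2
    n1, n2 = len(seq1), len(seq2)
    matrix = []
    for j in range(n2):          # rows: residues of sequence 2
        row = []
        for i in range(n1):      # columns: residues of sequence 1
            count = 0
            for k in range(-half, half + 1):
                a, b = i + k, j + k
                if 0 <= a < n1 and 0 <= b < n2 and seq1[a] == seq2[b]:
                    count += 1
            row.append(count >= threshold)
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    # Hypothetical toy sequences; window 3, threshold 2 as in the exercise.
    s1, s2 = "ABCDEABC", "ABCXEABC"
    for residue, row in zip(s2, windowed_dotplot(s1, s2, 3, 2)):
        print(residue, "".join("*" if hit else "." for hit in row))
```

With window 1 and threshold 1 this reduces to the simplest dotplot. Larger windows suppress isolated chance identities, and they can also place a dot on a mismatching element when its neighbours along the diagonal match.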

Dotplots with window
Window 3, threshold 2!

Dotplot examples
- Human lactalbumin:
    123 residues, sequence from PDB entry 1B9O
    Calcium-binding protein involved in lactose biosynthesis
- Hen egg-white lysozyme:
    129 residues, sequence from PDB entry 2DS
    Enzyme that breaks down bacterial cell walls
- Homologous; %SI ~36% (structure-based sequence alignment)
- Note: plots now from lower-left to upper-right corner

Window 1, threshold 1

Window 3, threshold 2

Window 11, threshold 5

Summary
- Dotplots are an excellent means of assessing the (self-)similarity of sequences
    Easy to calculate
    Easy to interpret
    Compare every residue in one sequence to every residue in the other sequence
    Provide an indication of how the sequences should be aligned
    Detect similarities that are easily missed by global pairwise alignment (e.g., shuffled domain order, internal symmetry)

A different kind of dotplot
- Dotplots can be used to compare any strings
- Ex: a manual chapter in Dutch, French, German, Italian, Spanish, and Swedish (one million 4-grams)

Sequencing!
- For the next lecture I need two random DNA sequences
- Each of you pick one of the four nucleotides A, C, G, or T
- We'll generate a pojk (boy) sequence and a tjej (girl) sequence

Sequencing!
- The class of 2006 generated the following random sequences:
- Tjej: ...
    Note: contains low-complexity palindrome (...) and a repeat of the domain (...)
- Pojk: ...
    Note: contains low-complexity region (...) and a palindrome-in-a-palindrome (...)
- The following dotplots were calculated with window size 3 and threshold 2

Tjej-tjej dotplot

Pojk-pojk dotplot

Tjej-pojk dotplot