Sequence-comparison methods
Gerard Kleywegt
Uppsala University

Outline
- Why compare sequences?
- Dotplots
- Pairwise sequence alignments
- Multiple sequence alignments
- Profile methods

Buzzzzzzzz
Why compare sequences?
- Sequence comparison is the bread and butter of bioinformatics - WHY?
  - Sequence-to-database
  - Sequence-to-sequence
- Discuss in groups of 2-3 for ~3 minutes
- Write down ~3 things that you think protein sequence comparisons could be used for!

MB330 - The class of 2008
Sequence-to-database
- Find related genes in different species
- Patenting (check novelty of sequence)
- Clues about function
- Clues about structure
- Identification of the protein

MB330 - The class of 2008
Sequence-to-sequence
- Find small sequence variations
- Study mutation rates
- Phylogenetic analysis, evolutionary relationships
- Protein structure prediction
- Finding sequence motifs (active site, ...)
B351 - The class of 2007
- Function prediction
- Structure prediction
- Evolutionary history, phylogeny, ancestry, classification
- Find homologous proteins
- Identify unknown proteins
- Find similarities and differences (mutations) between proteins, species
- Find conserved/consensus sequences
- Domain structure

MB330 - The class of 2007
- Find related proteins (homology)
- Clues about function
- Evolutionary history, phylogeny
- Clues about structure
- Disease-related variants
- Identify unknown proteins
- Find similarities and differences between proteins, species
- Identify possible active-site residues

MB330 - The class of 2006
Sequence-to-database
- Identification of protein
- Clues about function
- Find related sequences
- Clues about domain structure
- Verify hypothetical proteins
- Clues about structural similarities
- Find sequence motifs (active site, ...)

MB330 - The class of 2006
Sequence-to-sequence
- Investigate evolutionary history and relationships
- Analyse differences between species and between individuals (e.g., disease-causing mutations)
- Structure modelling
- Clues about secondary structure
- Sequence motifs (active site, ...)

Sequence-database comparison
- Find related sequences
  - Homology
    - Descended from a common ancestor (T/F!)
    - Occurrence in other organisms (orthologs; speciation)
    - Occurrence in same organism (paralogs; gene duplication)
  - Convergent evolution
    - Independently evolved same function
  - Shared motif(s)
  - Shared domains
  - Chance similarities
- Find clues about structure
- Find clues about function

Sequence-sequence comparison
- Alignment of (possibly) homologous sequences
  - Determine residue-residue correspondences
  - Measure similarity, cluster
  - Infer evolutionary relationships, phylogeny
- Find patterns of conservation and variability
  - Functionally important sites
  - Structurally important sites
  - Sites important for specificity
- Structure prediction
  - Secondary structure prediction
  - Homology modelling
- Function prediction (caution!)
Sequence identity
- Sequence identity (%S) = 100% * (Nr of identical residues in pairwise alignment) / (Length of the shortest sequence)
- Other definitions exist
- Ex: [example alignment: 6 identical residues, sequence lengths 9 and 10]
  - %S = ? 55%, 60%, 67%, or 75%
  - %S = 100% * 6 / min(9,10) = 67%

Sequence identity/homology
- Homology and level of sequence identity (or similarity) are two fundamentally different concepts!
  - Incorrect usage: "These two proteins are 28% homologous"
- Can homology be inferred/rejected based on the level of sequence identity?

Sequence identity/homology
[figure: Sequence identity of non-homologous proteins (Rost, 1999)]

Sequence identity/homology
[figure: Sequence identity of homologous proteins (Rost, 1999)]

Sequence identity/homology
- Two proteins of 100 or more residues with %S >35% are likely to be homologous
- However, homologous proteins may well have %S <35%
  - Twilight Zone (Doolittle)
  - %S <20%: Midnight Zone (Rost)
- Average %S ~8.5% for remote homologues
- Average %S ~5.6% for random sequences

Structure conservation
- Homologous proteins will have similar structures
  - Structure better conserved than sequence! (Chothia & Lesk, 1986)
- Proteins with similar structure and function likely to be homologous
  - Could also be analogous (similar due to convergent evolution)
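The sequence identity definition given earlier in this section (100% * identical residues / length of the shortest sequence) can be sketched in a few lines of Python. This is only an illustration: the function name `percent_identity`, the use of `-` as the gap character, and the two example sequences are assumptions, not taken from the slides.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent sequence identity of two already-aligned sequences.

    Uses the definition from the slides: identical aligned residues
    divided by the length of the shortest (ungapped) sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    # Count columns where both sequences carry the same residue (gaps never count).
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )

    # Sequence lengths are measured without gap characters.
    len_a = sum(1 for x in aligned_a if x != gap)
    len_b = sum(1 for y in aligned_b if y != gap)
    shortest = min(len_a, len_b)
    if shortest == 0:
        raise ValueError("sequences must contain at least one residue")

    return 100.0 * identical / shortest


if __name__ == "__main__":
    # Hypothetical alignment: 6 identities, sequence lengths 9 and 10.
    print(round(percent_identity("ACDEFGHIK-", "ACDEFGWWWW")))
```

Because other definitions exist (for example, dividing by the alignment length), the denominator is the part to change when comparing numbers from different tools.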
Homology - current thinking
- Statistically significant sequence and structural similarity
  - Strongly imply common ancestry (i.e., homology)
- Statistically significant sequence or structural similarity
  - Weakly implies common ancestry (homology)
  - Could result from convergent evolution (analogy)
- Functional similarity
  - Supports a common ancestry hypothesis, but is not sufficient to prove it
  - Functional dissimilarity does not disprove common ancestry (e.g., lactalbumin vs. lysozyme)

Homology - why bother?
- Science: (probable) homology must be established before you can
  - Conclude that the structures will be similar
  - Suspect that the functions may be related
  - Do phylogenetic analysis
  - Draw any meaningful conclusions from a (multiple) sequence alignment
- Practical: If you plan to design a drug against a bacterial or parasitic enzyme you want to know about any human orthologs of that enzyme!

Dotplots
- Dotplot: simple overview of the similarities of two words/sequences
  - Gives clues about alignment too
- Calculation:
  - Matrix
  - Columns = residues of sequence 1
  - Rows = residues of sequence 2 (or 1)
  - Simplest form: put dots in the matrix where the row and column residues are identical

Dotplot example
[figure: dotplot matrix of two short words]
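The simplest form of the dotplot calculation described above (columns = sequence 1, rows = sequence 2, a dot wherever the two residues are identical) could look roughly like this in Python. The function names `dotplot` and `render` and the example words are illustrative assumptions.

```python
def dotplot(seq1: str, seq2: str) -> list[list[bool]]:
    """Simplest dotplot: matrix[row][col] is True when the residues match.

    Columns correspond to residues of seq1, rows to residues of seq2.
    """
    return [[a == b for a in seq1] for b in seq2]


def render(matrix: list[list[bool]], seq1: str, seq2: str) -> str:
    """Plain-text rendering: '*' marks a dot, '.' an empty matrix element."""
    lines = ["  " + " ".join(seq1)]
    for residue, row in zip(seq2, matrix):
        cells = " ".join("*" if hit else "." for hit in row)
        lines.append(residue + " " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example words; any two strings can be compared.
    first, second = "BANANA", "ANANAS"
    print(render(dotplot(first, second), first, second))
```

Passing the same string for both arguments gives a self-dotplot, with the main diagonal fully dotted.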
Self-dotplot
Why compare a sequence to itself?
- Internal symmetry
  - Translational = domain duplication
  - Inversion: recognition sites for transcriptional regulators and restriction enzymes
    - Ex: EcoRI
- Low-complexity regions
  - Ex: Alu repeat

Dotplot of a palindrome
[figure: self-dotplot of a palindrome]

Low-complexity region
[figure: self-dotplot of a sequence with a low-complexity region]

Domain duplication
[figure: self-dotplot of a sequence with duplicated domains; axes labelled Domain A, Domain B]
Domain duplication!
[figure: self-dotplot showing the domain duplication]

Shared domains
[figure: dotplot of two sequences built from shared domains, labelled Domain A to Domain F]

Shared domains!
[figure: dotplot showing the shared domains]

Dotplots
- Usually:
  - Define a window size
  - Count number of identical residues within the window
  - If the count reaches a certain threshold, put a dot in the matrix element
- Ex: window 3 (-1,0,+1), minimum of 2 identities
- Ex: window 15 (-7,-6,...,+7), minimum of 6 identities

Dotplots with window
[figure: worked example, Window 3, Threshold 2]
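The windowed procedure above (define a window, count identities inside it, put a dot when the count reaches the threshold) can be sketched as follows. Two things here are assumptions rather than statements from the slides: the window is taken along the diagonal and centred on the matrix element, and window positions that fall outside either sequence are simply ignored. The name `windowed_dotplot` is illustrative.

```python
def windowed_dotplot(
    seq1: str, seq2: str, window: int = 3, threshold: int = 2
) -> list[list[bool]]:
    """Dotplot with a sliding window along the diagonal.

    matrix[row][col] is True when at least `threshold` of the `window`
    diagonal residue pairs centred on (row, col) are identical.
    Columns correspond to seq1, rows to seq2.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")

    half = window // 2  # window 3 -> offsets -1, 0, +1
    n_cols, n_rows = len(seq1), len(seq2)
    matrix = [[False] * n_cols for _ in range(n_rows)]

    for row in range(n_rows):
        for col in range(n_cols):
            count = 0
            for k in range(-half, half + 1):
                r, c = row + k, col + k
                # Assumption: offsets outside either sequence are skipped.
                if 0 <= r < n_rows and 0 <= c < n_cols and seq2[r] == seq1[c]:
                    count += 1
            matrix[row][col] = count >= threshold
    return matrix
```

With `window=1, threshold=1` this reduces to the simplest dotplot; larger windows suppress isolated chance matches and leave the diagonal runs that indicate real similarity.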
Dotplots with window
[figure: worked example, Window 3, Threshold 2]

Dotplot examples
- Human lactalbumin:
  - Calcium-binding protein involved in lactose biosynthesis
  - 123 Residues, sequence from PDB entry 1B9O
- Hen egg-white lysozyme:
  - Enzyme that breaks down bacterial cell walls
  - 129 Residues, sequence from PDB entry 2S
- Homologous; %S ~36% (structure-based sequence alignment)
- Note: plots now from lower-left to upper-right corner

Dotplot examples
[figures: HEW lysozyme vs. human lactalbumin; Window 1, threshold 1; Window 3, threshold 2]
[figure: HEW lysozyme vs. human lactalbumin; Window 11, threshold 5]

Summary
- Dotplots are an excellent means of assessing the (self-)similarity of sequences
  - Easy to calculate
  - Easy to interpret
  - Compare every residue in one sequence to every residue in the other sequence
  - Provide an indication of how the sequences should be aligned
  - Detect similarities that are easily missed by global pairwise alignment (e.g., shuffled domain order, internal symmetry)

A different kind of dotplot
- Dotplots can be used to compare any strings
  - Ex: a manual chapter in Dutch, French, German, Italian, Spanish, and Swedish (one million 4-grams)
  - Also: academic fraud
[figure: dotplot with axes labelled Du, Fr, Ge, It, Sp, Sw]

Sequencing!
- For the next lecture I needed two random sequences
- I asked the MB330 students of 2006 to each pick one of the four nucleotides: A, C, G, or T
- This yielded a random(ish) Pojk sequence (boys) and Tjej sequence (girls)

Sequencing!
- Tjej
  - Note: contains a low-complexity palindrome and a repeat of the (palindromic) domain
- Pojk
  - Note: contains a low-complexity region and a palindrome-in-a-palindrome
- The following dotplots were calculated with window size 3 and threshold 2

Tjej-Tjej dotplot
[figure: self-dotplot of the Tjej sequence]
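Pojk-Pojk dotplot
[figure: self-dotplot of the Pojk sequence]

Tjej-Pojk dotplot
[figure: dotplot of the Tjej sequence against the Pojk sequence]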