The universe of biological sequence analysis
Word/pattern recognition - identification of restriction enzyme cleavage sites
[Figure: PstI cleavage site]
Sequence alignment methods

The universe of biological sequence analysis - prediction of exon structure
[Figure: gene with Exon 1 and Exon 2, each shown with its amino acid translation]
Pairwise alignment
Why sequence alignments?
Prediction of function
Protein family analysis
Comparative genomics
Phylogeny / evolutionary history
Genome sequencing: assembly, alignment to reference genome

Prediction of function
[Figure: the sequence to be investigated is compared with a database of sequences, one of which is a sequence with known function]
We have a new sequence. Is it similar to a previously known sequence? We can test by alignment whether it is similar to a sequence with known function. If it is, we can assign a possible function to our new sequence.

Protein family analysis
Comparative genomics - reveals biologically significant regions of the genome
Pairwise alignment - dotplot
[Figure: dot plots of two sequences, followed by two example alignments with per-position scores; the scores sum to 25 for the first alignment and to -2 for the second]
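A dot plot like the one on these slides can be sketched in a few lines. This is a minimal illustration, assuming an exact-match rule on words of a fixed length; the function names `dotplot` and `render` and the example sequences are made up for this sketch, and it is not how dottup is implemented.

```python
def dotplot(seq_a, seq_b, word=1):
    """Return the set of (i, j) where seq_a[i:i+word] equals seq_b[j:j+word]."""
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in range(len(seq_b) - word + 1):
            if seq_a[i:i + word] == seq_b[j:j + word]:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, residue in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example sequences; the second has one inserted T.
    a, b = "GATTACA", "GATTTACA"
    print(render(a, b, dotplot(a, b, word=2)))
```

Similar regions show up as diagonal runs of dots. A longer word length removes isolated chance matches, at the cost of missing short similarities.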
More sophisticated scoring of protein sequence alignments
Each amino acid change has a characteristic probability, which is captured in a substitution matrix.
[Figure: substitution matrix and an example alignment of five residue pairs, scored 4 + 0 + 4 + 9 + 2 = 19]

Local and global alignments
[Figure: panels A and B comparing a local alignment with a global alignment]

Frequently used methods in sequence analysis that are based on sequence alignment
BLAST - searches in databases for sequence similarity
ClustalW - multiple alignment of sequences
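The substitution-matrix scoring above (pair scores summed over the alignment) can be sketched as follows. The matrix is a small hand-entered subset of BLOSUM62, and the two sequences are chosen only so that the pair scores reproduce the total of 19. They are an assumption for illustration, not the residues shown on the slide.

```python
# Hand-entered subset of BLOSUM62: only the pairs needed for the example.
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("A", "T"): 0,
    ("L", "L"): 4,
    ("C", "C"): 9,
    ("D", "E"): 2,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up a substitution score; the matrix is symmetric."""
    key = (a, b) if (a, b) in matrix else (b, a)
    return matrix[key]


def alignment_score(row_a, row_b, matrix=BLOSUM62_SUBSET):
    """Sum the substitution scores over an ungapped alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(row_a, row_b))


if __name__ == "__main__":
    # Hypothetical residues: 4 + 0 + 4 + 9 + 2
    print(alignment_score("ATLCE", "AALCD"))  # 19
```

A full implementation would load the complete 20 x 20 matrix and add gap penalties.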
BLAST
Searching databases for sequence similarity - traditional alignment methods are too slow
BLAST - Basic Local Alignment Search Tool
FASTA, 1988
BLAST, 1990
[Photos: William Pearson, David Lipman, Stephen Altschul]
A query sequence (DNA or protein) is tested against all sequences in a database (DNA or protein), i.e. the query is aligned to all the database sequences. The final output is a list of the best matching database sequences.

Searching databases for sequence similarity - shortcuts of BLAST
Improvement of speed as compared to the local alignment algorithm:
Initial search is for word hits.
Word hits are then extended in either direction.
[Figure: a "word hit" between the sequences M K I Q L K R Y and M K L Q L K R Y]

BLAST output
BLASTP 2.2.9 [May-01-2004]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= lcl SRP54_MOUSE (P14576) Signal recognition particle 54 kDa protein (SRP54) (504 letters)
Database: swissprot 197,228 sequences; 71,501,181 total letters
Searching...done

Sequences producing significant alignments:                           Score (bits)   E value
SRP54_MOUSE (P14576) Signal recognition particle 54 kDa protein...    959    0.0
SRP54_PONPY (Q5R4R6) Signal recognition particle 54 kDa protein...    958    0.0
SRP54_MACFA (Q4R965) Signal recognition particle 54 kDa protein...    958    0.0
SRP54_HUMAN (P61011) Signal recognition particle 54 kDa protein...    958    0.0
SRP54_CANFA (P61010) Signal recognition particle 54 kDa protein...    958    0.0
SRP54_RAT (Q6YB5) Signal recognition particle 54 kDa protein (S...    957    0.0
SRP54_GEOCY (Q8MZJ6) Signal recognition particle 54 kDa protein...    794    0.0
SR542_LYCES (P49972) Signal recognition particle 54 kDa protein...    565    e-161
SR543_ARATH (P49967) Signal recognition particle 54 kDa protein...    560    e-159
SR542_HORVU (P49969) Signal recognition particle 54 kDa protein...    558    e-158
...
SRPR_MOUSE (Q9DB7) Signal recognition particle receptor alpha s...    99     3e-20
SRPR_HUMAN (P08240) Signal recognition particle receptor alpha s...   99     3e-20
SRPR_YEAST (P32916) Signal recognition particle receptor alpha s...   98     7e-20
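The word-hit shortcut described above (find short shared words, then extend each hit in both directions) can be sketched as below. This is a simplified illustration under stated assumptions: words must match exactly, scoring is +1/-1 instead of a substitution matrix, and extension stops with a simple drop-off rule. The function names and the parameters `w` and `x_drop` are hypothetical, and real BLAST differs in several respects, for example in how it selects protein word hits.

```python
def find_word_hits(query, subject, w=3):
    """Return (q_pos, s_pos) for every exact w-letter word shared by both."""
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, i, j, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a word hit without gaps in both directions.

    Extension stops once the running score falls more than x_drop below the
    best score seen. Returns (score, q_start, q_end, s_start), q_end exclusive.
    """
    def walk(qi, sj, step):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            running += match if query[qi] == subject[sj] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = walk(i + w, j + w, +1)
    left_score, left_len = walk(i - 1, j - 1, -1)
    score = w * match + left_score + right_score
    return score, i - left_len, i + w + right_len, j - left_len


if __name__ == "__main__":
    # The two strings are taken from the slide figure purely as example input.
    q, s = "MKIQLKRY", "MKLQLKRY"
    for qi, sj in find_word_hits(q, s):
        print((qi, sj), extend_hit(q, s, qi, sj))
```

Only sequences that contain at least one word hit need to be examined further, which is where the speed-up over a full local alignment of every database sequence comes from.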
BLAST output, cont.
sp Q9I3P8.1 FLHF_PSEE RecName: Full=Flagellar biosynthesis prot...     57   3e-07
sp Q44758.1 FLHF_BORBU RecName: Full=Flagellar biosynthesis prot...    55   2e-06
sp Q01960.1 FLHF_BSU RecName: Full=Flagellar biosynthesis prot...      53   4e-06
sp O28980.1 Y1289_RFU RecName: Full=Uncharacterized protein AF...      39   0.064
sp B9LK1.1 YS_HLSY RecName: Full=Adenylyl-sulfate kinase; Al...        38   0.21
sp Q12U80.1 RDB_MEBU RecName: Full=DNA repair and recombinatio...      37   0.29
sp 5D014.1 D_PELS RecName: Full=Acetyl-coenzyme A carboxyla...         35   0.93
sp Q0356.1 RSM_LB RecName: Full=Ribosomal RNA small subunit...         35   1.2
sp Q1I2K4.1 YS_PSEE4 RecName: Full=Adenylyl-sulfate kinase; Al...      35   1.6
sp Q38V22.1 RSM_LSS RecName: Full=Ribosomal RNA small subunit...       34   1.8
sp 1U3X8.1 YS_MRV RecName: Full=Adenylyl-sulfate kinase; Al...         34   2.3
sp 6D42.1 YS_KLEP7 RecName: Full=Adenylyl-sulfate kinase; Al...        34   2.9
sp P63890.2 YS_SLI RecName: Full=Adenylyl-sulfate kinase; Al...        34   2.9
...

Expect value (E)
Parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E value, or the closer it is to 0, the more "significant" the match is.
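The link between bit score, search-space size and E value can be sketched with the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. The function name below is hypothetical, and the sketch uses the raw lengths from the output header. BLAST itself applies length corrections, so its reported values differ somewhat.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # SRPR_MOUSE hit from the output above: 99 bits, query of 504 letters,
    # database of 71,501,181 letters. BLAST reported 3e-20; the uncorrected
    # estimate lands within the same order of magnitude.
    print(f"{expect_value(99, 504, 71_501_181):.1e}")

    # Each additional bit of score halves the expected number of chance hits.
    print(expect_value(40, 504, 71_501_181) / expect_value(41, 504, 71_501_181))
```

The same score is therefore less significant in a larger database, because E grows in proportion to the database length.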
High Scoring Pair (HSP)
[Alignment: Query residues 1-300 against Sbjct residues 1-300; the two sequences are near-identical, with a single conservative substitution ("+") in the 181-240 block]

>SRPR_MOUSE (Q9DB7) Signal recognition particle receptor alpha subunit (SR-alpha) (Docking protein alpha) (DP-alpha)
Length = 636
Score = 99.0 bits (245), Expect = 3e-20
Identities = 68/313 (21%), Positives = 143/313 (45%), Gaps = 31/313 (9%)
[Alignment: Query residues 14-173 against Sbjct residues 322-501, with gaps in both sequences]
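The Identities and Gaps figures in an HSP header are simple counts over the aligned rows. The sketch below is an illustration with hypothetical function names and a made-up five-column alignment. The percentages in the header above (21%, 45%, 9%) are consistent with integer truncation, which is what `percent` assumes. Positives would additionally need a substitution matrix and are left out.

```python
def alignment_stats(query_row, sbjct_row, gap_char="-"):
    """Count identities and gap columns in two aligned rows of equal length."""
    if len(query_row) != len(sbjct_row):
        raise ValueError("aligned rows must have equal length")
    identities = sum(
        q == s and q != gap_char for q, s in zip(query_row, sbjct_row)
    )
    gaps = sum(gap_char in (q, s) for q, s in zip(query_row, sbjct_row))
    return identities, gaps, len(query_row)


def percent(count, length):
    """Percentage truncated to an integer (an assumption, see lead-in)."""
    return 100 * count // length


if __name__ == "__main__":
    ident, gaps, length = alignment_stats("MK-LQ", "MKALE")
    print(f"Identities = {ident}/{length} ({percent(ident, length)}%), "
          f"Gaps = {gaps}/{length} ({percent(gaps, length)}%)")
```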
BLAST output revealing orthologs and paralogs
[Figure: the BLASTP hit list for the query SRP54_MOUSE shown earlier (SRP54 hits with scores 959 to 558, SRPR hits with scores 99 to 98), with brackets labelled "orthologs" and "paralogs"]

The two kinds of protein evolutionary relationship
Genes or proteins are homologous if they are related by divergence from a common ancestor.
Orthology: sequences that diverged after a speciation event. Orthologous genes often have the same function in different species.
Paralogy: sequences that diverged after a gene duplication event. Paralogous genes perform different but related functions within one organism.
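[Figure: orthologs and paralogs. Gene X in an ancestral organism. Orthologs panel: speciation yields organism A and organism B, each carrying a copy of X. Paralogs panel: gene duplication yields two copies of X. Gene labels in the figure: X1, X2, Xa, Xb.]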
Example of orthology / paralogy relationships
[Figure]

The different variants of BLAST
Program    Query      Database
blastp     Protein    Protein
blastn     DNA        DNA
tblastn    Protein    DNA
blastx     DNA        Protein
tblastx    DNA        DNA
Cited 31998 times since 1990!

BLAT

Alignment software specialized for next-generation sequencing technology
BWA
Bowtie
SOAP2
Align reads to a reference genome
[Figure: reads aligned to a reference genome]
Further improvement of computational efficiency - BLAT (http://genome.ucsc.edu/cgi-bin/hgblat?command=start)

ClustalW
Cited 34,646 times!
Construction of a tree based on pairwise alignments.
Progressive alignment guided by the tree.
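The first ClustalW step above, building a tree from pairwise comparisons, can be sketched as follows. Two simplifications are assumptions of this sketch: distances are plain fractions of differing positions between equal-length sequences (instead of scores from pairwise alignments), and the tree is built by UPGMA average-linkage clustering as a simple stand-in for ClustalW's own tree-building step. All names are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(names, dist):
    """Build a guide tree as nested tuples by average-linkage clustering.

    dist maps frozenset({name_a, name_b}) to the distance between the two.
    """
    clusters = {name: (name, 1) for name in names}  # key -> (subtree, size)
    d = dict(dist)
    while len(clusters) > 1:
        # Join the closest pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = a + "+" + b
        # Distance to the new cluster: size-weighted average of its parts.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b), size_a + size_b)
    (tree, _), = clusters.values()
    return tree


if __name__ == "__main__":
    seqs = {"s1": "AAAA", "s2": "AAAT", "s3": "TTTT"}  # made-up sequences
    dist = {
        frozenset((x, y)): p_distance(seqs[x], seqs[y])
        for x, y in combinations(seqs, 2)
    }
    print(upgma(list(seqs), dist))  # ('s3', ('s1', 's2'))
```

In the progressive step, the sequences joined first in the tree (here s1 and s2) would be aligned first, and the remaining sequences added in the order the tree dictates.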
Introduction to the practical
EMBOSS programs in this practical
sixpack
plotorf
dottup - dotplot analysis
water - Smith-Waterman local alignment
needle - Needleman-Wunsch global alignment
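The difference between the two alignment programs listed above can be seen from their dynamic-programming recurrences. The sketch below computes only the best score, with a simple +1/-1 scoring and a linear gap penalty. These parameters and the function names are assumptions for illustration and are not the defaults of needle or water, which use substitution matrices and separate gap open and extend penalties.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score: the best-matching pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    # Made-up sequences sharing the core GATTACA with unrelated flanks.
    x, y = "TTTGATTACATTT", "CCCGATTACACCC"
    print("global:", needleman_wunsch_score(x, y))
    print("local: ", smith_waterman_score(x, y))
```

The local score ignores the unrelated flanks, while the global score is pulled down by them. Making the `gap` penalty more or less negative changes how readily the alignment opens gaps.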
Translation of a nucleotide sequence using sixpack
[Output: six-frame translation of nucleotides 1-180, with forward frames F1-F3 shown above the sequence ruler and reverse frames F4-F6 below it]

Plotorf to show open reading frames (in this case an ORF is defined as starting with an AUG codon)
[Figure: open reading frames plotted along the sequence, with these annotated]
Unnamed protein 416-1522
Ribosomal protein S16 1771-2019
tRNA methyltransferase 2617-3384
Ribosomal protein L19 3426-3773
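What sixpack and plotorf do can be sketched as follows: translate all six reading frames, and report ORFs that start with ATG (AUG in the mRNA) and run to the first in-frame stop codon. This is a minimal illustration with the standard genetic code. The function names and the example sequence are hypothetical, the ORF search covers the forward strand only, and the numbering of the reverse frames F4-F6 is a simple convention that may not match sixpack's.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks a stop.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
STOPS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the translations of frames F1-F3 (forward) and F4-F6 (reverse)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for k in range(3):
        frames[f"F{k + 1}"] = translate(dna[k:])
        frames[f"F{k + 4}"] = translate(rc[k:])
    return frames


def find_orfs(dna, min_codons=1):
    """Forward-strand ORFs as 1-based (start, end), from ATG through the stop."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    seq = "CCATGAAATAGCC"  # made-up example sequence
    for name, protein in six_frames(seq).items():
        print(name, protein)
    print(find_orfs(seq))  # [(3, 11)]
```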
[Figure: Gag and the Gag-pol fusion (5%)]

Global alignment of mRNA sequence to genomic DNA sequence - effect of gap parameters
[Figure: genomic DNA aligned with the mature, spliced mRNA]
Dot plot analysis (dottup) reveals repeats
[Figure: dot plot]

Introduction to the "Exercises with biological sequences - examining HIV genes and proteins" - biological questions addressed with BLAST and ClustalX

BLAST - search databases for sequence similarity
Identifying homologous proteins. Are there non-viral homologues to any HIV proteins?
Are we able to identify a relationship between human HIV and the monkey SIV?

ClustalX - multiple sequence alignment
Identifying amino acids involved in drug resistance.
What is the relationship between HIV and monkey SIV? Using a multiple alignment to compute a phylogenetic tree.