William R. Pearson faculty.virginia.edu/wrpearson
|
|
- Chastity Carpenter
- 6 years ago
- Views:
Transcription
1 Protein Sequence Comparison and Protein Evolution (What BLAST does/why BLAST works William R. Pearson faculty.virginia.edu/wrpearson 1
2 Effective Similarity Searching in Practice 1. Always search protein databases (possibly with translated DNA) 2. Use E()-values, not percent identity, to infer homology E() < is significant in a single search 3. Search smaller (comprehensive) databases 4. Change the scoring matrix for: short sequences (exons, reads) short evolutionary distances (mammals, vertebrates, a- proteobacteria) high identity (>50% alignments) to reduce over-extension 5. All methods (pairwise, HMM, PSSM) miss homologs, and find homologs the other methods miss 2
3 Homology Fundamentals Homologous sequences are unexpectedly similar (excess similarity) excess compared to??? random similarity (similarity by chance, E()-value) Non-significant similarity IS NOT evidence for nonhomology significance depends on perspective (divergence on tree) significant similarity to a protein of different structure shows non-homology Homology at the (entire) sequence level is different from homology at the residue level sequence homology is inferred from statistics residue homology REQUIRES sequence homology 3
4 Establishing homology from statistically significant similarity Why BLAST works For most proteins, homologs are easily found over long evolutionary distances (500 My 2 By) using standard approaches (BLAST, FASTA) Difficult for distant relationships or very short domains Most default search parameters are optimized for distant relationships and work well Not every aligned residue is homologous but with significant similarity, there is an homologous domain 4
5 This talk is not about: Alignment Alignment quality may be more sensitive to parameter choice Multiple sequences for biologically accurate alignments Inferring Protein Function Homology (common ancestry) implies common structure (guaranteed), not necessarily common function But >40% global identity usually have similar functions Homologs have different functions Non-homologs have similar (or identical) functions The best sequences for building trees Protein sequences are clearly best for establishing homology, but DNA sequences may be better for resolving recent divergence 5
6 Homologues share a common ancestor vertebrates/ arthopods 18,000 6,530 4,289 plants/animals time (bilions of years) human horse fish insect worm wheat yeast E. coli prokaryotes/eukaryotes -3.0 self-replicating systems -4.0 chemical evolution 6
7 When do we infer homology? Homology <=> structural similarity? sequence similarity Bovine trypsin (5ptp) Structure: E()< ; RMSD 0.0 A Sequence: E()< % 223/223 S. griseus trypsin (1sgt) E()< RMSD 1.6 A E()< %; 226/223 S. griseus protease A (2sga) E()< 10-4 ; RMSD 2.6 A E()< %; 199/181 7
8 When can we infer non-homology? Non-homologous proteins have different structures Bovine trypsin (5ptp) Structure: E()<10-23 RMSD 0.0 A Sequence: E()< % 223/223 Subtilisin (1sbt) E() >100 E()<280; 25% 159/275 Cytochrome c4 (1etp) E() > 100 E()<5.5; 23% 171/190 8
9 Homology and EXCESS similarity Why are proteins (sequences or structures) similar alternative hypotheses? 1. common ancestry homology 2. convergence from independent origins due to functional constraints We infer homology from excess similarity if no excess, similarity could happen by chance (independently) in a similarity search, there is always a most similar sequence To recognize excess similarity we need to know what random similarity looks like 9
10 Query: atp6_human.aa ATP synthase a chain aa Library: PIR1 Annotated (rel. 66) residues in sequences z-sc obs E() < := := one = represents 23 library sequences := := :* :* :===* :======= * :=============== * :========================= * :=======================================* :================================================* :=====================================================*== :======================================================*===== :====================================================*==== :===============================================*== :=========================================*==== :===================================*== :=========================== * :===================== * :===================* :===============* :============* :=========* :=======* :======* :====*= :===* :==* :==* :=* :=* :=* :* :* inset = represents 1 library sequences :* :* :========*========= :* :======*== :* :====*== :* :===* :* :==*========== :* :=*=== :* :=* :* :*==== :* :*=== :* :*= :* :*==== :* :*===== := *== := *= > :== *============================== 10
11 Inferring Homology from Statistical Significance Real UNRELATED sequences have similarity scores that are indistinguishable from RANDOM sequences If a similarity is NOT RANDOM, then it must be NOT UNRELATED Therefore, NOT RANDOM (statistically significant) similarity must reflect RELATED sequences 11
12 Query: atp6_human.aa ATP synthase a chain aa Library: residues in sequences The best scores are: ( len) s-w bits E(13351) %_id %_sim alen sp P00846 ATP6_HUMAN ATP synthase a chain (AT ( 226) e sp P00847 ATP6_BOVIN ATP synthase a chain (AT ( 226) e sp P00848 ATP6_MOUSE ATP synthase a chain (AT ( 226) e sp P00849 ATP6_XENLA ATP synthase a chain (AT ( 226) e sp P00851 ATP6_DROYA ATP synthase a chain (AT ( 224) e sp P00854 ATP6_YEAST ATP synthase a chain pre ( 259) e sp P00852 ATP6_EMENI ATP synthase a chain pre ( 256) e sp P14862 ATP6_COCHE ATP synthase a chain (AT ( 257) e sp P68526 ATP6_TRITI ATP synthase a chain (AT ( 386) e sp P05499 ATP6_TOBAC ATP synthase a chain (AT ( 395) e sp P07925 ATP6_MAIZE ATP synthase a chain (AT ( 291) e sp P0AB98 ATP6_ECOLI ATP synthase a chain (AT ( 271) e sp P0C2Y5 ATPI_ORYSA Chloroplast ATP synth (A ( 247) sp P06452 ATPI_PEA Chloroplast ATP synthase a ( 247) sp P27178 ATP6_SYNY3 ATP synthase a chain (AT ( 276) sp P06451 ATPI_SPIOL Chloroplast ATP synthase ( 247) sp P08444 ATP6_SYNP6 ATP synthase a chain (AT ( 261) sp P69371 ATPI_ATRBE Chloroplast ATP synthase ( 247) sp P06289 ATPI_MARPO Chloroplast ATP synthase ( 248) sp P30391 ATPI_EUGGR Chloroplast ATP synthase ( 251) sp P19568 TLCA_RICPR ADP,ATP carrier protein ( 498) sp P24966 CYB_TAYTA Cytochrome b ( 379) sp P03892 NU2M_BOVIN NADH-ubiquinone oxidored ( 347) sp P68092 CYB_STEAT Cytochrome b ( 379) sp P03891 NU2M_HUMAN NADH-ubiquinone oxidored ( 347) sp P00156 CYB_HUMAN Cytochrome b ( 380) sp P15993 AROP_ECOLI Aromatic amino acid tr ( 457) sp P24965 CYB_TRANA Cytochrome b ( 379) sp P29631 CYB_POMTE Cytochrome b ( 308) sp P24953 CYB_CAPHI Cytochrome b ( 379)
13 >sp P00846 ATP6_HUMAN ATP synthase subunit a; F-ATPase protein 6 >sp P0AB98 ATP6_ECOLI ATP synthase subunit a; F-ATPase subunit 6 Length=271 Score = 47.9 bits (178), Expect = 3e-06 Identities = 55/199 (27%), Positives = 113/199 (56%), Gaps = 37/199 (18%) Query 8 SFIAPTILGLPAAVLIILFPPLLIPTSKYLINNRLITTQQWLIKLTSKQMMTMHNTKGRT 67 S +LGL ++++LF T + +I M++ K + Sbjct 45 SMFFSVVLGL---LFLVLFRSVAKKATSG-VPGKFQTAIELVIGFVNGSVKDMYHGKSKL 100 Query 68 WSLMLVSLIIFIATTNLLGLLP HSF TPTTQLSMNLAMAIPLWAG NL+ LLP H + P+ +++ L+MA+ ++ Sbjct 101 IAPLALTIFVWVFLMNLMDLLPIDLLPYIAEHVLGLPALRVVPSADVNVTLSMALGVF Query 112 TVIMGFRSKIKNALAHFLPQGTPTPL-----IPMLVIIETISLLIQPMALAVRLTANITA F S + F + T P+ IP+ +I+E +SLL +P++L +RL N+ A Sbjct 159 -ILILFYSIKMKGIGGFTKELTLQPFNHWAFIPVNLILEGVSLLSKPVSLGLRLFGNMYA 217 Query 167 GHLLMHLIGSATLAMSTINLPSTLIIFTILILLTILEIAVALIQAYVFTLLVSLYL 222 G L+ LI S L IF ILI+ +QA++F +L +YL Sbjct 218 GELIFILIAGLLPWWSQWILNVPWAIFHILIIT LQAFIFMVLTIVYL
14 The PAM250 matrix Cys 12 Ser 0 2 Thr Pro Ala Gly Asn Asp Glu Gln His Arg Lys Met Ile Leu Val Phe Tyr Trp C S T P A G N D E Q H R K M I L V F Y W 14
15 Where do scoring matrices come from? # λs = log% $ frequency of replacement in homologs q ij p i p j & ( ' frequency of alignment by chance Scoring matrices can be designed for different evolutionary distances (less=shallow; more=deep) Deep matrices allow more substitution Pam40 A R N D E I L A 8 R N D E I L Pam250 A R N D E I L A 2 R -2 6 N D E I L
16 Query: atp6_human.aa ATP synthase a chain aa Library: residues in sequences The best scores are: ( len) s-w bits E(13351) %_id %_sim alen sp P00846 ATP6_HUMAN ATP synthase a chain (AT ( 226) e sp P00847 ATP6_BOVIN ATP synthase a chain (AT ( 226) e sp P00848 ATP6_MOUSE ATP synthase a chain (AT ( 226) e sp P00849 ATP6_XENLA ATP synthase a chain (AT ( 226) e sp P00851 ATP6_DROYA ATP synthase a chain (AT ( 224) e sp P00854 ATP6_YEAST ATP synthase a chain pre ( 259) e sp P00852 ATP6_EMENI ATP synthase a chain pre ( 256) e sp P14862 ATP6_COCHE ATP synthase a chain (AT ( 257) e sp P68526 ATP6_TRITI ATP synthase a chain (AT ( 386) e sp P05499 ATP6_TOBAC ATP synthase a chain (AT ( 395) e sp P07925 ATP6_MAIZE ATP synthase a chain (AT ( 291) e sp P0AB98 ATP6_ECOLI ATP synthase a chain (AT ( 271) e sp P0C2Y5 ATPI_ORYSA Chloroplast ATP synth (A ( 247) sp P06452 ATPI_PEA Chloroplast ATP synthase a ( 247) sp P27178 ATP6_SYNY3 ATP synthase a chain (AT ( 276) sp P06451 ATPI_SPIOL Chloroplast ATP synthase ( 247) sp P08444 ATP6_SYNP6 ATP synthase a chain (AT ( 261) sp P69371 ATPI_ATRBE Chloroplast ATP synthase ( 247) sp P06289 ATPI_MARPO Chloroplast ATP synthase ( 248) sp P30391 ATPI_EUGGR Chloroplast ATP synthase ( 251) sp P19568 TLCA_RICPR ADP,ATP carrier protein ( 498) sp P24966 CYB_TAYTA Cytochrome b ( 379) sp P03892 NU2M_BOVIN NADH-ubiquinone oxidored ( 347) sp P68092 CYB_STEAT Cytochrome b ( 379) sp P03891 NU2M_HUMAN NADH-ubiquinone oxidored ( 347) sp P00156 CYB_HUMAN Cytochrome b ( 380) sp P15993 AROP_ECOLI Aromatic amino acid tr ( 457) sp P24965 CYB_TRANA Cytochrome b ( 379) sp P29631 CYB_POMTE Cytochrome b ( 308) sp P24953 CYB_CAPHI Cytochrome b ( 379)
17 Query: atp6_ecoli.aa ATP synthase a aa Library: residues in sequences The best scores are: ( len) s-w bits E(13351) %_id %_sim alen sp P0AB98 ATP6_ECOLI ATP synthase a chain (AT ( 271) e sp P06451 ATPI_SPIOL Chloroplast ATP synthase ( 247) e sp P69371 ATPI_ATRBE Chloroplast ATP synthase ( 247) e sp P08444 ATP6_SYNP6 ATP synthase a chain (AT ( 261) e sp P06452 ATPI_PEA Chloroplast ATP synthase a ( 247) e sp P30391 ATPI_EUGGR Chloroplast ATP synthase ( 251) e sp P0C2Y5 ATPI_ORYSA Chloroplast ATP synthase ( 247) e sp P27178 ATP6_SYNY3 ATP synthase a chain (AT ( 276) e sp P06289 ATPI_MARPO Chloroplast ATP synthase ( 248) e sp P07925 ATP6_MAIZE ATP synthase a chain (AT ( 291) e sp P68526 ATP6_TRITI ATP synthase a chain (AT ( 386) e sp P00854 ATP6_YEAST ATP synthase a chain pre ( 259) e sp P05499 ATP6_TOBAC ATP synthase a chain (AT ( 395) e sp P00846 ATP6_HUMAN ATP synthase a chain (AT ( 226) e sp P00852 ATP6_EMENI ATP synthase a chain pre ( 256) e sp P00849 ATP6_XENLA ATP synthase a chain (AT ( 226) e sp P00847 ATP6_BOVIN ATP synthase a chain (AT ( 226) e sp P14862 ATP6_COCHE ATP synthase a chain (AT ( 257) e sp P00848 ATP6_MOUSE ATP synthase a chain (AT ( 226) e sp P00851 ATP6_DROYA ATP synthase a chain (AT ( 224) sp P24962 CYB_STELO Cytochrome b ( 379) sp P09716 US17_HCMVA Hypothetical protein HVL ( 293) sp P68092 CYB_STEAT Cytochrome b ( 379) sp P24960 CYB_ODOHE Cytochrome b ( 379) sp P03887 NU1M_BOVIN NADH-ubiquinone oxidored ( 318) sp P24992 CYB_ANTAM Cytochrome b ( 379)
18 Homology is Transitive (on domains) vs human : E. coli Human mito : 10-6 Bovine mito : 10-6 Mouse mito : 10-5 Frog mito : 10-6 Dros. mito : Yeast mito : 10-8 Cochliobolus mito : 10-5 Aspergillus mito : /10-6 Corn mito : 10-9 Wheat mito : 10-8 Rice chloro : Tobacco chloro : Spinach chloro : Pea chloro : March. chloro : Cyanobacteria : Synechocystis : Euglena chloro : E. coli 10-6 : vs human : E. coli 18
19 Homology and EXCESS similarity The importance of perspective We use the E()-value to infer homology based on excess similarity BETWEEN the QUERY and the SUBJECT sequences Two homologous sequences may not share excess similarity in a BLAST search with one or the other as query, but share significant similarity to a third sequence (or to a PSSM or HMM) If two sequences share significant similarity to the same sequence in the same region, we can infer homology. As always, non-excess similarity does not imply non-homology 19
20 Homology and Domains Histone acetyltransferase KAT2B The best scores are: s-w bits E(454402) %_id %_sim alen KAT2B_HUMAN Histone acetyltransferase KAT2B ( 832) KAT2A_HUMAN Histone acetyltransferase KAT2A ( 837) GCN5_SCHPO Histone acetyltransferase gcn5 ( 454) e GCN5_YEAST Histone acetyltransferase GCN5 ( 439) e GCN5_ORYSJ Histone acetyltransferase GCN5 ( 511) e GCN5_ARATH Histone acetyltransferase GCN5; ( 568) e BPTF_HUMAN Nucleosome-remodeling factor sub (3046) e NU301_DROME Nucleosome-remodeling factor su (2669) e CECR2_HUMAN Cat eye syndrome critical regio (1484) e BRD4_HUMAN Bromodomain-containing protein 4 (1362) e BRD4_MOUSE Bromodomain-containing protein 4 (1400) e BAZ2A_HUMAN Bromodomain adjacent to zinc fi (1905) e BAZ2A_XENLA Bromodomain adjacent to zinc fi (1698) e FSH_DROME Homeotic protein female sterile; (2038) e BAZ2A_MOUSE Bromodomain adjacent to zinc fi (1889) e BRDT_MACFA Bromodomain testis-specific prot ( 947) e BRD3_HUMAN Bromodomain-containing protein 3 ( 726) e
21 KATB_HUMAN KAT2B_HUMAN KAT2B_HUMAN KAT2A_HUMAN KAT2B_HUMAN GCN5_YEAST KAT2B_HUMAN GCN5_ARATH Homology and Domains Histone acetyltransferase KAT2B PCAF_N C.Acetyltrans Bromodomain PCAF_N C.Acetyltrans Bromodomain C.Acetyltrans Bromodomain C.Acetyltrans Bromodomain E()< E()< e e KAT2B_HUMAN BPTF_HUMAN e KAT2B_HUMAN BRD4_MOUSE 500 BromodomainBromodomain e KAT2B_HUMAN BRD4_MOUSE 500 BromodomainBromodomain e
22 A quick diversion LALIGN statistically significant internal alignments # lalign36 -s BP62 -m B -w 80 mchu.aa mchu.aa LALIGN finds non-overlapping local alignments Please cite: X. Huang and W. Miller (1991) Adv. Appl. Math. 12: >sp P62158 CALM_HUMAN Calmodulin; CaM Length=149 Score = 51.4 bits (168), Expect = 7e-12 Identities = 35/73 (47%), Positives = 51/73 (69%), Gaps = 3/73 (4%) Query 1 MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARK 76 M D +EE+I +EAF +FDKDG+G I+ EL VM +LG+ T+ E+ +MI E D DG+G +++ EF+ MM K Sbjct 77 MKDTDSEEEI---REAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK 149 Score = 41.5 bits (132), Expect = 7e-09 Identities = 36/97 (37%), Positives = 55/97 (56%), Gaps = 8/97 (8%) Query 11 AEFKEAFSLFDKDGDGTITTKELGTVM-RSLGQNPTEAELQDMINEVDADGNGTIDFPEF---LTMMARKMKDTDSEEEI 86 AE ++ + D DG+GTI E T+M R + +E E+++ D DGNG I E +T + K+ D + +E I Sbjct 47 AELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMI 126 Query 87 REAFRVFDKDGNGYISAAELRHVMT 111 REA D DG+G ++ E +MT Sbjct 127 REA----DIDGDGQVNYEEFVQMMT 147 Score = 18.5 bits (48), Expect = 0.06 Identities = 13/33 (39%), Positives = 23/33 (69%), Gaps = 5/33 (15%) Query 1 MADQLTEEQIAEF-KEAFSLFDKDGDGTITTKELGTVM LT+E++ E +EA D DGDG + +E +M Sbjct 113 LGEKLTDEEVDEMIREA----DIDGDGQVNYEEFVQMM 146 No identity alignment 22
23 LALIGN Identifying mobile domains: mobile (duplicated) domains in local alignments sp P CALM_HUMAN Calmodulin; CaM EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand score: 168; 51.8 bits; E(1) < 5.8e-12 score: 132; 41.8 bits; E(1) < 5.9e-09 score: 48; 18.5 bits; E(1) < EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-handEF-hand EF-hand EF-hand EF-hand EF-hand EF-handEF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand sp P CALM_HUMAN Calmodulin; CaM aa E(): < <0.01 <1 <1e+02 >1e+02
24 Similarity searches for homology Homologous sequences are unexpectedly similar (excess similarity) excess compared to??? random similarity (similarity by chance, E()-value) In a similarity search, excess similarity reflects the perspective of the query sequence different queries can reveal excess similarity homology in the aligned region Non-significant similarity IS NOT evidence for non-homology significant similarity to a protein of different structure shows non-homology 24
25 Computer Demo fasta.bioch.virginia.edu/mol_evol/ tinyurl.com/fasta-demo/ Work in PAIRS Answer questions with partner Goals: trust the statistics what are the positive / negative controls? look at the alignments does non-alignment imply non-homology? use [pgm] link 25
26 Effective similarity searching Use protein/translated DNA comparisons Modern sequence similarity searching is highly efficient, sensitive, and reliable homologs are homologs similarity statistics are accurate databases are large most queries will find a significant match Improving similarity searches protein (translated DNA) similarity searches smaller databases appropriate scoring matrices for short reads/assemblies appropriate alignment boundaries Extracting more information from annotations homologous over extension scoring sub-alignments to identify homologous domains All methods (pairwise, HMM, PSSM) miss homologs all methods find genuine homologs the other methods miss 26
27 DNA vs protein sequence comparison The best scores are: DNA tfastx3 prot. E(188,018) E(187,524) E(331,956) DMGST D.melanogaster GST e e e-109 MDGST1 M.domestica GST-1 gene 2e e e-76 LUCGLTR Lucilia cuprina GST 1.5e e e-73 MDGST2A M.domesticus GST-2 mrna 9.3e e e-62 MDNF1 M.domestica nf1 gene e e e-62 MDNF6 M.domestica nf6 gene e e e-62 MDNF7 M.domestica nf7 gene e e e-62 AGGST15 A.gambiae GST mrna 3.1e e e-61 CVU87958 Culicoides GST 1.8e e e-58 AGG3GST11 A.gambiae GST1-1 mrna 1.5e e e-43 BMO6502 Bombyx mori GST mrna 1.1e e e-40 AGSUGST12 A.gambiae GST1-1 gene 2.3e e e-37 MOTGLUSTRA Manduca sexta GST 5.7e e e-25 RLGSTARGN R.legominosarum gsta e e-10 HUMGSTT2A H. sapiens GSTT e e-09 HSGSTT1 H.sapiens GSTT1 mrna e e-10 ECAE E. coli hypothet. prot. 4.7e e-09 MYMDCMA Methyl. dichlorometh. DH 1.1e e-07 BCU19883 Burkholderia maleylacetate red. 1.2e e-08 NFU43126 Naegleria fowleri GST 3.2e SP505GST Sphingomonas paucim 1.8e EN1838 H. sapiens maleylaceto. iso. 2.1e e-06 HSU86529 Human GSTZ1 3.0e e-06 SYCCPNC Synechocystis GST 1.2e e-06 HSEF1GMR H.sapiens EF1g mrna 9.0e
28 Why smaller databases produce more sensitive searches statistics S = λs raw - ln K m n S bit = (λs raw - ln K)/ln(2) P(S >x) = 1 - exp(-e -x ) P(S bit > x) = 1 -exp(-mn2 -x ) E(S >x D) = P D Z(s) λs bit P(B bits) = m n 2 -B P(40 bits)= 1.5x10-7 E(40 D=4000) = 6x10-4 E(40 D=12E6) =
29 What is a bit score (I)? 1. Scoring matrices (PAM250, BLOSUM62, VTML40) contain log-odds scores: s i,j (bits) = log 2 (q i,j /p i p j ) (q i,j freq. in homologs / p i p j freq. by chance) s i,j (bits) = 2 -> a residue is 2 2 =4-times more likely to occur by homology compared with chance (at one residue) s i,j (bits) = -1 -> a residue is 2-1 = 1/2 as likely to occur by homology compared with chance (at one residue) 2. An alignment score is the maximum sum of s i,j bit scores across the aligned residues. A 40-bit score is 2 40 more likely to occur by homology than by chance. 3. How often should a score occur by chance? In a 400 * 400 alignment, there are ~160,000 places where the alignment could start by chance, so we expect a score of 40 bits would occur: P(S bit > x) = 1 -exp(-mn2 -x ) ~ mn2 -x 400 x 400 x 2-40 = 160,000 / 2 40 ( ) = 1.5 x 10-7 times Thus, the probability of a 40 bit score in ONE alignment is ~
30 What is a bit score (I)? 4. But we did not ONE alignment, we did 4,000, 40,000, 500,000, or 20 million alignments when we searched the database: E(S bit D) = p(40 bits) x database size E(40 4,000) = 10-7 x 4,000 = 4 x 10-4 (significant) E(40 40,000) = 10-7 x 4 x 10 4 = 4 x 10-3 (not significant) E(40 500,000) =10-7 x 5 x 10 5 = 0.05 (not significant) E(40 20 million) = 10-7 x 2.0 x 10 7 = 2.0 (not significant) Bonferroni correction for multiple tests each alignment is a test Not significant does not mean not-homologous 30
31 E()-values are conservative frequentist estimates that similarity occurred by chance S = λs raw - ln K m n S bit = (λs raw - ln K)/ln(2) P(S >x) = 1 - exp(-e -x ) P(S bit > x) = 1 -exp(-mn2 -x ) Z(s) λs bit Bonferroni correction: E(S >x D) = P D(# of tests) With modern sequence databases (thousands of complete proteomes), E()<10-10 is routine for sequences >25% identical, after correcting for 10,000,000 sequences (tests) 31
32 How many bits do I need? E() = p() x database size E(40 4,000) = 10-7 x 4,000 = 4 x 10-4 (significant) E(40 40,000) = 10-7 x 4 x 10 4 = 4 x 10-3 (not significant) E(40 500,000) = 10-7 x 5 x 10 5 = 0.05 (not significant) To get E() ~ 10-3, how many bits do I need? p = m n 2 bits bits = log2(p/(m n)) = log2(e()/(database_size m n)) genome (10,000) p ~ 10-3 /10 4 = 10-7 /160,000 = 40 bits SwissProt (500,000) p ~ 10-3 /10 6 = 10-9 /160,000 = 47 bits Uniprot/NR (10 7 ) p ~ 10-3 /10 7 = /160,000 = 50 bits very significant E()<10-50 significant E()<10-6 significant E()<10-3 not significant 32
33 What database to search? Search the smallest comprehensive database likely to contain your protein vertebrates human proteins (40,000) fungi S. cerevisiae (6,000) bacteria E. coli, gram positive, etc. (<100,000) QFO (Quest for Orthologs proteome set) Search a richly annotated protein set (SwissProt, >500,000) Always search NR (> 50 million) LAST Never Search GenBank (DNA) 33
34 Scoring matrices Scoring matrices can set the evolutionary lookback time for a search Lower PAM (PAM10/MDM10 PAM60) for closer (10% 50% identity) Higher BLOSUM for higher conservation (BLOSUM50 distant, BLOSUM80 conserved) Shallow scoring matrices for short domains/short queries (metagenomics) Matrices have bits/position (score/position), 40 aa at 0.45 bits/position (BLOSUM62) means 18 bit ave. score (50 bits significant) Deep scoring matrices allow alignments to continue, possibly outside the homologous region 34
35 Where do scoring matrices come from? Pam40 A R N D E I L A 8 R N D E I L Pam250 A R N D E I L A 2 R -2 6 N D E I L λs i, j = log b ( q i, j p i p j ) q ij : replacement frequency at PAM40, 250 q R:N ( 40) = p R = q R:N (250) = p N = λ 2 S ij = lg 2 (q ij /p i p j ) λ e S ij = ln(q ij /p i p j ) p R p N = λ 2 S R:N( 40) = lg 2 ( / )= λ 2 = 1/3; S R:N( 40) = /l 2 = -7 λ S R:N(250) = lg2 ( / )= 0 35
36 PAM matrices and alignment length BLOSUM80 BLOSUM62 BLOSUM50 Short domains require shallow scoring matrices 36
37 Empirical matrix performance (median results from random alignments) Matrix target % ident bits/position aln len (50 bits) VT160-12/ BLOSUM50-10/ BLOSUM62* -11/ VT120-11/ VT80-11/ PAM70* -10/ PAM30* -9/ VT40-12/ VT20-15/ VT10 /16/ HMMs can be very "deep" 37
38 Scoring matrices affect alignment boundaries (homologous over-extension) BLOSUM62-11/-1 VTML80-10/-1 SH3_domain SH3_domain SH3_domain 500 SH3_domain sp Q SRC8_HUMAN Src substrate cort sp Q SRC8_HUMAN Src substrate cort sp Q SRC8_HUMAN Src substrate cortactin; Am sp Q SRC8_HUMAN Src substrate cortactin; Am E(): < <0.01 <1 <1e+02 >1e+02 E(): < <0.01 <1 <1e+02 >1e+02 38
39 Scoring Matrices - Summary PAM and BLOSUM matrices greatly improve the sensitivity of protein sequence comparison low identity with significant similarity PAM matrices have an evolutionary model - lower number, less divergence lower=closer; higher=more distant BLOSUM matrices are sampled from conserved regions at different average identity higher=more conservation Short alignments (domains, exons, reads) require shallow (higher information content) matrices Shallow matrices set maximum look-back time 39
40 Effective similarity searching Modern sequence similarity searching is highly efficient, sensitive, and reliable homologs are homologs similarity statistics are accurate databases are large most queries will find a significant match Improving similarity searches smaller databases appropriate scoring matrices for short reads/assemblies appropriate alignment boundaries Extracting more information from annotations homologous over extension scoring sub-alignments to identify homologous domains All methods (pairwise, HMM, PSSM) miss homologs all methods find genuine homologs the other methods miss 40
41 Overextension into random sequence PF PF > pf E6SGT6 E6SGT6_THEM7 Heavy metal translocating P-type ATPase EC= Length=888 Score = 299 bits (766), Expect = 1e-90, Method: Compositional matrix adjust. Identities = 170/341 (50%), Positives = 224/341 (66%), Gaps = 19/341 (6%) 113 Query 84 FLFVNVFAALFNYWPTEGKILMFGKLEKVLITLILLGKTLEAVAKGRTSEAIKKLMGLKA 143 +L+ V A +P+ +F + V++ L+ LG LE A+GRTSEAIKKL+GL+A Sbjct 312 WLYSTVAVAFPQIFPSMALAEVFYDVTAVVVALVNLGLALELRARGRTSEAIKKLIGLQA Query 144 KRARVIRGGRELDIPVEAVLAGDLVVVRPGEKIPVDGVVEEGASAVDESMLTGESLPVDK ARV+R G E+DIPVE VL GD+VVVRPGEKIPVDGVV EG S+VDESM+TGES+PV+ Sbjct 372 RTARVVRDGTEVDIPVEEVLVGDIVVVRPGEKIPVDGVVIEGTSSVDESMITGESIPVEM 431 Query 204 QPGDTVIGATLNKQGSFKFRATKVGRDTALAQIISVVEEAQGSKAPIQRLADTISGYFVP 263 +PGD VIGAT+N+ GSF+FRATKVG+DTAL+QII +V++AQGSKAPIQR+ D +S YFVP Sbjct 432 KPGDEVIGATINQTGSFRFRATKVGKDTALSQIIRLVQDAQGSKAPIQRIVDRVSHYFVP 491 Query 264 VVVSLAVITFFVWYFAVAPENFTRALLNFTAVLVIACPCALGLATPTSIMVGTGKGAEKG 323 V+ LA++ VWY + AL+ F L+IACPCALGLATPTS+ VG GKGAE+G Sbjct 492 AVLILAIVAAVVWYVFGPEPAYIYALIVFVTTLIIACPCALGLATPTSLTVGIGKGAEQG Query 324 ILFKGGEHLENAG GGAHTEGAENKAELLKTRATGISILVTLGLTAKGRDRS 374 IL + G+ L+ A G T+G +++ ATG + L LTA Sbjct 552 ILIRSGDALQMASRLDVIVLDKTGTITKGKPELTDVVA--ATGFDEDLILRLTA Query 375 TVAFQKNTGFKLKIPIGQAQLQREVAASESIVISAYPIVGV 415 A ++ + L I + L R +A E+ +A P GV Sbjct AIERKSEHPLATAIVEGALARGLALPEADGFAAIPGHGV
42 Can homologous proteins have different structures? Roessler C G et al. PNAS (2008)105:2343 BLOSUM80 40% id VTML40 70% id Region: 12-37:14-34 : score=50; bits=26.0; Qval=41 : DOMAIN_N: alpha Region: 42-60:39-57 : score=26; bits=14.5; Qval= 7 : DOMAIN_N: beta Smith-Waterman score: 82; E(1) < 2e-9; 49.0% identity (61.2% similar) in 49 aa overlap BD1 MNAIDIAINKLGSVSALAASLGVRQSAISNWRARGRVPAERCIDIERVTNGAVICRELRPDVFGASPAGHRPEASNAAA :. :::::.::: :::::. : : : :.:.: : :.:: 2PIJ XKKIPLSKYLEEHGTQSALAAALGVNQSAISQ-----MVRAGRSIEITLYEDGRVEANEIRPIPARPKRTAA [ ] [ ] Region: 12-29:14-31 : score=82; bits=42.9; LPr=9.2 : DOMAIN_N: alpha Smith-Waterman score: 82; E(1) < 6.5e % identity (88.9% similar) in 18 aa overlap BD1 MNAIDIAINKLGSVSALAASLGVRQSAISNWRARGRVPAERCIDIERVTNGAVICRELRPDVFGASPAGHRPEASNAAA :. :::::.::: ::::: 2PIJ XKKIPLSKYLEEHGTQSALAAALGVNQSAISQMVRAGRSIEITLYEDGRVEANEIRPIPARPKRTAA [ ]
43 Scoring matrices affect alignment boundaries (homologous over-extension) BLOSUM62-11/-1 BLOSUM62-11/-1 sp Q SRC8_HUMAN Src substrate cort SH3_domain SH3_domain 32-42: : Id=0.455; Q= 0.0 : NODOM : : : Id=0.158; Q= 0.0 : in : : Id=0.622; Q=37.4 : in : : Id=0.757; Q=50.2 : in : : Id=0.811; Q=61.0 : in : : Id=0.568; Q=35.3 : in : : Id=0.649; Q=41.5 : in : : Id=0.565; Q= 8.9 : in : : Id=0.165; Q= 0.0 : NODOM : : Id=0.200; Q= 0.0 : SH : : Id=0.657; Q=102.2 : in : : Id=0.757; Q=138.0 : in : : Id=0.811; Q=164.6 : in : : Id=0.568; Q= 91.9 : in : : Id=0.649; Q=112.4 : in sp Q SRC8_HUMAN Src substrate cortactin; Am : : Id=0.565; Q= 36.7 : in E(): < <0.01 <1 <1e+02 >1e+02 VTML80-10/-1 43
44 Overextension into random sequence PF PF > pf E6SGT6 E6SGT6_THEM7 Heavy metal translocating P-type ATPase EC= Length=888 Score = 299 bits (766), Expect = 1e-90, Method: Compositional matrix adjust. Identities = 170/341 (50%), Positives = 224/341 (66%), Gaps = 19/341 (6%) 113 Query 84 FLFVNVFAALFNYWPTEGKILMFGKLEKVLITLILLGKTLEAVAKGRTSEAIKKLMGLKA 143 +L+ V A +P+ +F + V++ L+ LG LE A+GRTSEAIKKL+GL+A Sbjct 312 WLYSTVAVAFPQIFPSMALAEVFYDVTAVVVALVNLGLALELRARGRTSEAIKKLIGLQA Query 144 KRARVIRGGRELDIPVEAVLAGDLVVVRPGEKIPVDGVVEEGASAVDESMLTGESLPVDK ARV+R G E+DIPVE VL GD+VVVRPGEKIPVDGVV EG S+VDESM+TGES+PV+ Sbjct 372 RTARVVRDGTEVDIPVEEVLVGDIVVVRPGEKIPVDGVVIEGTSSVDESMITGESIPVEM 431 Query 204 QPGDTVIGATLNKQGSFKFRATKVGRDTALAQIISVVEEAQGSKAPIQRLADTISGYFVP 263 +PGD VIGAT+N+ GSF+FRATKVG+DTAL+QII +V++AQGSKAPIQR+ D +S YFVP Sbjct 432 KPGDEVIGATINQTGSFRFRATKVGKDTALSQIIRLVQDAQGSKAPIQRIVDRVSHYFVP 491 Query 264 VVVSLAVITFFVWYFAVAPENFTRALLNFTAVLVIACPCALGLATPTSIMVGTGKGAEKG 323 V+ LA++ VWY + AL+ F L+IACPCALGLATPTS+ VG GKGAE+G Sbjct 492 AVLILAIVAAVVWYVFGPEPAYIYALIVFVTTLIIACPCALGLATPTSLTVGIGKGAEQG Query 324 ILFKGGEHLENAG GGAHTEGAENKAELLKTRATGISILVTLGLTAKGRDRS 374 IL + G+ L+ A G T+G +++ ATG + L LTA Sbjct 552 ILIRSGDALQMASRLDVIVLDKTGTITKGKPELTDVVA--ATGFDEDLILRLTA Query 375 TVAFQKNTGFKLKIPIGQAQLQREVAASESIVISAYPIVGV 415 A ++ + L I + L R +A E+ +A P GV Sbjct AIERKSEHPLATAIVEGALARGLALPEADGFAAIPGHGV
45 Sub-alignment scoring detects overextension >>sp E6SGT6 E6SGT6_THEM7 Heavy metal translocating P-type ATPase EC= (888 aa) qregion: : : score=15; bits=12.3; Id=0.219; Q=0.0 : Shuffle qregion: : : score=736; bits=232.8; Id=0.641; Q=644.7 : PF00122 qregion: : : score=14; bits=12.0; Id=0.236; Q=0.0 : Shuffle Region: : : score=11; bits=11.1; Id=0.194; Q=0.0 : NODOM :0 Region: : : score=736; bits=232.8; Id=0.641; Q=644.7 : PF00122 Pfam Region: : : score=16; bits=12.6; Id=0.244; Q=0.0 : PF00702 Pfam s-w opt: 632 Z-score: bits: E(274545): 3.7e-51 Smith-Waterman score: 765; 49.7% identity (73.3% similar) in 344 aa overlap (81-415: ) B0TE74 E6SGT6 Shuffle PF00122 Shuffle PF04945 PF00403 PF00122 PF ][ 120 B0TE74 LPIVPGTMALGVQSDGKDETLLVALEVPVDRERVGPAIVDAFRFLFVNVFAALFNYWPTEGKILMFGKLEKVLITLILLG :.:..:.:...:...:. :...:. :: E6SGT6 ILGLLTLPVMLWSGSHFFNGMWQGLKHRQANMHTLISIGIAAAWLYSTVAVAFPQIFPSMALAEVFYDvtavvvalvnlg ][ ][ B0TE74 APENFTRALLNFTAVLVIACPCALGLATPTSIMVGTGKGAEKGILFKGGEHLENAGG GAHTEGAENKAELL. ::. :...:.::::::::::::::. :: :::::.:::...:. :. :. :. :.:.... E6SGT6 PEPAYIYALIVFVTTLIIACPCALGLATPTSLTVGIGKGAEQGILIRSGDALQMASRLDVIVLDKTGTITKGKPELTDVV ] [ B0TE74 KTRATGISILVTLGLTAKGRDRSTVAFQKNTGFKLKIPIGQAQLQREVAASESIVISAYPIVGVVVDSLVTTAFLAVEEI :::... : ::: :... : :.. : :.: :...: : :: E6SGT6 --AATGFDEDLILRLTA AIERKSEHPLATAIVEGALARGLALPEADGFAAIPGHGVEAQVEGHHVLVGNERL
46 Scoring domains highlights over extension >>sp SRC8_HUMAN Src substrate cortactin; (550 aa) >>sp SRC8_CHICK Src substrate p85; Cort (563 aa) 84.7% id (1-550:11-563) E(454402): 1.2e-159 >>sp SRC8_HUMAN Src substrate cortactin (550 aa) >>sp HCLS1_MOUSE Hematopoiet ln cell-sp (486 aa) 44.1% id (1-548:1-485) E(454402): 4.1e : Id=0.873; Q=281.4 : NODOM : Id=1.000; Q=133.2 : in : Id=0.946; Q=121.0 : in : Id=0.973; Q=127.1 : in : Id=0.973; Q=128.3 : in : Id=0.973; Q=137.5 : in : Id=0.892; Q=117.3 : in : Id=0.957; Q= 69.6 : in : Id=0.632; Q=386.6 : NODOM : Id=0.966; Q=226.3 : SH3 1-79: 1-78 Id=0.671; Q=213.0 : NODOM : Id=0.757; Q= 97.9 : in : Id=0.703; Q= 94.8 : in : Id=0.703; Q= 97.3 : in : Id=0.826; Q= 60.5 : in : Id=0.179; Q= 0.0 : NODOM : : Id=0.719; Q=173.2 : SH3 Q = -10 log(p) Q>30.0 -> p <
47 Homology, non-homology, and over-extension Sequences that share statistically significant sequence similarity are homologous (simplest explanation) But not all regions of the alignment contribute uniformly to the score lower identity/q-value because of non-homology (overextension)? lower identity/q-value because more distant relationship (domains have different ages)? Test by searching with isolated region can the distant domain (?) find closer (significant) homologs? Similar (homology) or distinct (non-homology) structure is the gold standard Multiple sequence alignment can obscure over-extension if the alignment is over-extended, part of the alignment is NOT homologous 47
48 Improving sensitivity with protein/domain family models PSI-BLAST - method 1. do BLAST search 2. use query-based implied multiple sequence alignment to build Position Specific Scoring Matrix (PSSM) 3. repeat steps 1 and 2 with PSSM, for 5 10 iterations PSI-BLAST results: 1. Typically 2X as sensitive as single sequence methods 2. Over-extension can cause PSSM contamination 48
49 Improving sensitivity with protein/domain family models HMMER3 jackhmmer method 1. do HMMER (Hidden Markov Model, HMM) search with single sequence 2. use query-hmm-based implied multiple sequence alignment to more accurate HMM 3. repeat steps 1 and 2 with HMM HMMER3 results: 1. Less over-extension because of probabilistic alignment 2. Used to construct Pfam domain database Many protein families are too diverse for one HMM, Pfam divides families into multiple HMMs and groups in Clans 3. Clearly homologous sequences are still missed 49
50 Missing homology beyond the HMM model >>tr Q8LNM4 Q8LNM4_ORYSJ Eukaryotic aspartyl protease family protein vs >>tr Q2QSI0 Q2QSI0_ORYSJ Glycosyl hydrolase family 9 protein, expressed OS=O (694 aa) qregion: : : score=508; bits=240.8; LPr=67.0 : Aspartyl protease s-w opt: 508 Z-score: bits: E(1): 9.6e-68 Smith-Waterman score: 508; 62.5% identity (79.2% similar) in 144 aa overlap Q8LNM4 TDACKSIPTSNCSSNMCTYEGTINSKLGGHTLGIVATDTFAIGTATASLGFGCVVASGIDTMGGPSGLIGLGRAPSSLVS ::: :.: ::. :. : : : : :::::.::: :.: ::::::: :::: : ::..::::.: :::. Q2QSI0 LCESISNDIHNCSGNVCMYEASTNA---GDTGGKVGTDTFAVGTAKANLAFGCVVASNIDTMDGSSGIVGLGRTPWSLVT Q8LNM4 QMNITKFSYCLTPHDSGKNSRLLLGSSAKLAGGGNSTTTPFVKTSPGDDMSQYYPIQLDGIKAGDAAIALPPSGNTVLVQ :.. :::::.:::.:::. :.:::.:::::::...:::: : :.:.: ::.::..::::: : ::::: Q2QSI0 QTGVAAFSYCLAPHDAGKNNALFLGSTAKLAGGGKTASTPFVNIS-GNDLSNYYKVQLEVLKAGDAMIPLPPSGVLWDNY Q8LNM4 Q2QSI0 Asp 50
51 Pfam misses/mis-aligns proteins distant from the model For diverse families, a single model can find, and miss, closely related homologs Even if homologs are found, alignments may be short her pesec cmvhh2 ratodor humrsc chkgpcr gppaf musp2u chkp2y humfmlf ratg10d ratang dogrdc1 humil8 bovlcr1 cmvhh3 ratrbs11 humssr1 musdelto humc5a ratbk2 humthr ratlh bovop ratrta hummrg hummas humedg1 ratcgpcr ratpot humacth hummsh musep3 humtxa2 musep2 ratcckadogad1 dogcckb humd2 huma2a hama1a hamb2 hum5ht1a ratd1 bovh1 ratnpyy1 ratnk1 flynk flynpy musgir ratntr mustrh musgrp musgnrh ratvia boveta humm1 51
52 Homology Fundamentals Homologous sequences are unexpectedly similar (excess similarity) excess compared to??? random similarity (similarity by chance, E()-value) Non-significant similarity IS NOT evidence for nonhomology significant similarity to a protein of different structure shows non-homology Homology at the (entire) sequence level is different from homology at the residue level sequence homology is inferred from statistics residue homology REQUIRES sequence homology 52
53 Effective Similarity Searching in Practice 1. Always search protein databases (possibly with translated DNA) 2. Use E()-values, not percent identity, to infer homology E() < is significant in a single search 3. Search smaller (comprehensive) databases 4. Change the scoring matrix for: short sequences (exons, reads) short evolutionary distances (mammals, vertebrates, a- proteobacteria) high identity (>50% alignments) to reduce over-extension 5. All methods (pairwise, HMM, PSSM) miss homologs, and find homologs the other methods miss 53
William R. Pearson faculty.virginia.edu/wrpearson
Protein Sequence Comparison and Protein Evolution (What BLAST does/why BLAST works William R. Pearson faculty.virginia.edu/wrpearson pearson@virginia.edu 1 Effective Similarity Searching in Practice 1.
More informationSimilarity searching summary (2)
Similarity searching / sequence alignment summary Biol4230 Thurs, February 22, 2016 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 What have we covered? Homology excess similiarity but no excess similarity
More informationPractical search strategies
Computational and Comparative Genomics Similarity Searching II Practical search strategies Bill Pearson wrp@virginia.edu 1 Protein Evolution and Sequence Similarity Similarity Searching I What is Homology
More informationSequence Homology and Analysis. Understanding How FASTA and BLAST work to optimize your sequence similarity searches.
Sequence Homology and Analysis Understanding How FASTA and BLAST work to optimize your sequence similarity searches. Brandi Cantarel, PhD BICF 04/27/2016 Take Home Messages 1. Homologous sequences share
More informationProgramming for Biology Protein Evolution / Similarity Searching" What BLAST Does / Why BLAST works"
Programming for Biology Protein Evolution / Similarity Searching" What BLAST Does / Why BLAST works" Bill Pearson" wrp@virginia.edu" 1 Sequence Similarity - Conclusions" Homologous sequences share a common
More informationSequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013
Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation
More informationWeek 10: Homology Modelling (II) - HHpred
Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative
More informationProtein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.
Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein
More informationSimilarity or Identity? When are molecules similar?
Similarity or Identity? When are molecules similar? Mapping Identity A -> A T -> T G -> G C -> C or Leu -> Leu Pro -> Pro Arg -> Arg Phe -> Phe etc If we map similarity using identity, how similar are
More informationAlgorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment
Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot
More informationCISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)
CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST
More informationMassachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution
Massachusetts Institute of Technology 6.877 Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution 1. Rates of amino acid replacement The initial motivation for the neutral
More informationCSE 549: Computational Biology. Substitution Matrices
CSE 9: Computational Biology Substitution Matrices How should we score alignments So far, we ve looked at arbitrary schemes for scoring mutations. How can we assign scores in a more meaningful way? Are
More informationBioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment
Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value
More informationSequence Analysis and Databases 2: Sequences and Multiple Alignments
1 Sequence Analysis and Databases 2: Sequences and Multiple Alignments Jose María González-Izarzugaza Martínez CNIO Spanish National Cancer Research Centre (jmgonzalez@cnio.es) 2 Sequence Comparisons:
More informationChapter 5. Proteomics and the analysis of protein sequence Ⅱ
Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and
More informationLarge-Scale Genomic Surveys
Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction
More informationSequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University
Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of
More informationAn Introduction to Sequence Similarity ( Homology ) Searching
An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,
More informationAlignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)
Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in
More informationIntroduction to protein alignments
Introduction to protein alignments Comparative Analysis of Proteins Experimental evidence from one or more proteins can be used to infer function of related protein(s). Gene A Gene X Protein A compare
More informationPairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )
Pairwise sequence alignments Vassilios Ioannidis (From Volker Flegel ) Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs Importance
More informationLecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models
Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1 Local sequence alignments Local sequence alignments are necessary
More informationSecondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure
Bioch/BIMS 503 Lecture 2 Structure and Function of Proteins August 28, 2008 Robert Nakamoto rkn3c@virginia.edu 2-0279 Secondary Structure Φ Ψ angles determine protein structure Φ Ψ angles are restricted
More informationSequence Database Search Techniques I: Blast and PatternHunter tools
Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered
More informationBioinformatics for Biologists
Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Bioinformatics Definitions The use of computational
More informationIntroduction to Bioinformatics
Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression
More informationBLAST: Target frequencies and information content Dannie Durand
Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences
More informationMotifs, Profiles and Domains. Michael Tress Protein Design Group Centro Nacional de Biotecnología, CSIC
Motifs, Profiles and Domains Michael Tress Protein Design Group Centro Nacional de Biotecnología, CSIC Comparing Two Proteins Sequence Alignment Determining the pattern of evolution and identifying conserved
More informationSequence Based Bioinformatics
Structural and Functional Analysis of Inosine Monophosphate Dehydrogenase using Sequence-Based Bioinformatics Barry Sexton 1,2 and Troy Wymore 3 1 Bioengineering and Bioinformatics Summer Institute, Department
More informationStatistics I: Homology, Statistical Significance, and Multiple Tests. Statistics and excess similarity Correcting for multiple tests I
Statistics I: Homology, Statistical Significance, and Multiple Tests homology and statistical significance how do we measure significance? what is the probability of an alignment score? given two sequences
More informationSequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment
Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of
More informationIntroduction to the Ribosome Overview of protein synthesis on the ribosome Prof. Anders Liljas
Introduction to the Ribosome Molecular Biophysics Lund University 1 A B C D E F G H I J Genome Protein aa1 aa2 aa3 aa4 aa5 aa6 aa7 aa10 aa9 aa8 aa11 aa12 aa13 a a 14 How is a polypeptide synthesized? 2
More informationChristian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel
Christian Sigrist General Definition on Conserved Regions Conserved regions in proteins can be classified into 5 different groups: Domains: specific combination of secondary structures organized into a
More informationProperties of amino acids in proteins
Properties of amino acids in proteins one of the primary roles of DNA (but not the only one!) is to code for proteins A typical bacterium builds thousands types of proteins, all from ~20 amino acids repeated
More informationIn-Depth Assessment of Local Sequence Alignment
2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.
More informationPairwise sequence alignments
Pairwise sequence alignments Volker Flegel VI, October 2003 Page 1 Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs VI, October
More informationProtein function prediction based on sequence analysis
Performing sequence searches Post-Blast analysis, Using profiles and pattern-matching Protein function prediction based on sequence analysis Slides from a lecture on MOL204 - Applied Bioinformatics 18-Oct-2005
More informationTutorial 4 Substitution matrices and PSI-BLAST
Tutorial 4 Substitution matrices and PSI-BLAST 1 Agenda Substitution Matrices PAM - Point Accepted Mutations BLOSUM - Blocks Substitution Matrix PSI-BLAST Cool story of the day: Why should we care about
More informationSara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)
Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline
More informationCONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018
CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of
More informationComputational Biology
Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,
More informationTHEORY. Based on sequence Length According to the length of sequence being compared it is of following two types
Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between
More informationModule: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment
Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand
More informationBLAST. Varieties of BLAST
BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database
More informationHMMs and biological sequence analysis
HMMs and biological sequence analysis Hidden Markov Model A Markov chain is a sequence of random variables X 1, X 2, X 3,... That has the property that the value of the current state depends only on the
More informationHomology and Information Gathering and Domain Annotation for Proteins
Homology and Information Gathering and Domain Annotation for Proteins Outline Homology Information Gathering for Proteins Domain Annotation for Proteins Examples and exercises The concept of homology The
More informationSequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.
Sequence Analysis, '18 -- lecture 9 Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene. How can I represent thousands of homolog sequences in a compact
More informationSingle alignment: Substitution Matrix. 16 march 2017
Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block
More informationHomology Modeling. Roberto Lins EPFL - summer semester 2005
Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,
More informationSequences, Structures, and Gene Regulatory Networks
Sequences, Structures, and Gene Regulatory Networks Learning Outcomes After this class, you will Understand gene expression and protein structure in more detail Appreciate why biologists like to align
More information114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009
114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009 9 Protein tertiary structure Sources for this chapter, which are all recommended reading: D.W. Mount. Bioinformatics: Sequences and Genome
More informationBiochemistry 324 Bioinformatics. Pairwise sequence alignment
Biochemistry 324 Bioinformatics Pairwise sequence alignment How do we compare genes/proteins? When we have sequenced a genome, we try and identify the function of unknown genes by finding a similar gene
More informationGenomics and bioinformatics summary. Finding genes -- computer searches
Genomics and bioinformatics summary 1. Gene finding: computer searches, cdnas, ESTs, 2. Microarrays 3. Use BLAST to find homologous sequences 4. Multiple sequence alignments (MSAs) 5. Trees quantify sequence
More informationAlgorithms in Bioinformatics
Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology
More informationAlignment & BLAST. By: Hadi Mozafari KUMS
Alignment & BLAST By: Hadi Mozafari KUMS SIMILARITY - ALIGNMENT Comparison of primary DNA or protein sequences to other primary or secondary sequences Expecting that the function of the similar sequence
More informationSequence analysis and comparison
The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species
More informationExploring Evolution & Bioinformatics
Chapter 6 Exploring Evolution & Bioinformatics Jane Goodall The human sequence (red) differs from the chimpanzee sequence (blue) in only one amino acid in a protein chain of 153 residues for myoglobin
More informationStructure and evolution of the spliceosomal peptidyl-prolyl cistrans isomerase Cwc27
Acta Cryst. (2014). D70, doi:10.1107/s1399004714021695 Supporting information Volume 70 (2014) Supporting information for article: Structure and evolution of the spliceosomal peptidyl-prolyl cistrans isomerase
More informationAoife McLysaght Dept. of Genetics Trinity College Dublin
Aoife McLysaght Dept. of Genetics Trinity College Dublin Evolution of genome arrangement Evolution of genome content. Evolution of genome arrangement Gene order changes Inversions, translocations Evolution
More informationSequence alignment methods. Pairwise alignment. The universe of biological sequence analysis
he universe of biological sequence analysis Word/pattern recognition- Identification of restriction enzyme cleavage sites Sequence alignment methods PstI he universe of biological sequence analysis - prediction
More informationBioinformatics: Investigating Molecular/Biochemical Evidence for Evolution
Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution Background How does an evolutionary biologist decide how closely related two different species are? The simplest way is to compare
More informationPractical considerations of working with sequencing data
Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!
More informationSequence Analysis '17 -- lecture 7
Sequence Analysis '17 -- lecture 7 Significance E-values How significant is that? Please give me a number for......how likely the data would not have been the result of chance,......as opposed to......a
More informationBiology Tutorial. Aarti Balasubramani Anusha Bharadwaj Massa Shoura Stefan Giovan
Biology Tutorial Aarti Balasubramani Anusha Bharadwaj Massa Shoura Stefan Giovan Viruses A T4 bacteriophage injecting DNA into a cell. Influenza A virus Electron micrograph of HIV. Cone-shaped cores are
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 07: profile Hidden Markov Model http://bibiserv.techfak.uni-bielefeld.de/sadr2/databasesearch/hmmer/profilehmm.gif Slides adapted from Dr. Shaojie Zhang
More informationBioinformatics Exercises
Bioinformatics Exercises AP Biology Teachers Workshop Susan Cates, Ph.D. Evolution of Species Phylogenetic Trees show the relatedness of organisms Common Ancestor (Root of the tree) 1 Rooted vs. Unrooted
More informationMultiple sequence alignment
Multiple sequence alignment Multiple sequence alignment: today s goals to define what a multiple sequence alignment is and how it is generated; to describe profile HMMs to introduce databases of multiple
More informationBioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre
Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement
More informationSequence Alignment Techniques and Their Uses
Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this
More informationStatistical Distributions of Optimal Global Alignment Scores of Random Protein Sequences
BMC Bioinformatics This Provisional PDF corresponds to the article as it appeared upon acceptance. The fully-formatted PDF version will become available shortly after the date of publication, from the
More informationPairwise & Multiple sequence alignments
Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived
More informationCISC 636 Computational Biology & Bioinformatics (Fall 2016)
CISC 636 Computational Biology & Bioinformatics (Fall 2016) Predicting Protein-Protein Interactions CISC636, F16, Lec22, Liao 1 Background Proteins do not function as isolated entities. Protein-Protein
More informationStatistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences
Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic
More information7.012 Problem Set 1. i) What are two main differences between prokaryotic cells and eukaryotic cells?
ame 7.01 Problem Set 1 Section Question 1 a) What are the four major types of biological molecules discussed in lecture? Give one important function of each type of biological molecule in the cell? b)
More informationBLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010
BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for
More informationOrthology Part I concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona
Orthology Part I concepts and implications Toni Gabaldón Centre for Genomic Regulation (CRG), Barcelona Toni Gabaldón Contact: tgabaldon@crg.es Group website: http://gabaldonlab.crg.es Science blog: http://treevolution.blogspot.com
More informationQuantifying sequence similarity
Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity
More informationTools and Algorithms in Bioinformatics
Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology
More informationLecture 15: Realities of Genome Assembly Protein Sequencing
Lecture 15: Realities of Genome Assembly Protein Sequencing Study Chapter 8.10-8.15 1 Euler s Theorems A graph is balanced if for every vertex the number of incoming edges equals to the number of outgoing
More informationHomology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB
Homology Modeling (Comparative Structure Modeling) Aims of Structural Genomics High-throughput 3D structure determination and analysis To determine or predict the 3D structures of all the proteins encoded
More information2 Genome evolution: gene fusion versus gene fission
2 Genome evolution: gene fusion versus gene fission Berend Snel, Peer Bork and Martijn A. Huynen Trends in Genetics 16 (2000) 9-11 13 Chapter 2 Introduction With the advent of complete genome sequencing,
More informationLocal Alignment Statistics
Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison
More informationLecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)
Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from
More informationbioinformatics 1 -- lecture 7
bioinformatics 1 -- lecture 7 Probability and conditional probability Random sequences and significance (real sequences are not random) Erdos & Renyi: theoretical basis for the significance of an alignment
More informationRamachandran Plot. 4ysz Phi (degrees) Plot statistics
B Ramachandran Plot ~b b 135 b ~b ~l l Psi (degrees) 5-5 a A ~a L - -135 SER HIS (F) 59 (G) SER (B) ~b b LYS ASP ASP 315 13 13 (A) (F) (B) LYS ALA ALA 315 173 (E) 173 (E)(A) ~p p ~b - -135 - -5 5 135 (degrees)
More informationSnork Synthesis Lab Lab Directions
Snork Synthesis Lab Lab Directions This activity, modified from the original at The Biology Corner, will help you practice your understanding of protein synthesis. Submit your lab answers according to
More informationTiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1
Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with
More informationScoring Matrices. Shifra Ben-Dor Irit Orr
Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison
More informationHomology. and. Information Gathering and Domain Annotation for Proteins
Homology and Information Gathering and Domain Annotation for Proteins Outline WHAT IS HOMOLOGY? HOW TO GATHER KNOWN PROTEIN INFORMATION? HOW TO ANNOTATE PROTEIN DOMAINS? EXAMPLES AND EXERCISES Homology
More informationProteins: Characteristics and Properties of Amino Acids
SBI4U:Biochemistry Macromolecules Eachaminoacidhasatleastoneamineandoneacidfunctionalgroupasthe nameimplies.thedifferentpropertiesresultfromvariationsinthestructuresof differentrgroups.thergroupisoftenreferredtoastheaminoacidsidechain.
More informationExercise 5. Sequence Profiles & BLAST
Exercise 5 Sequence Profiles & BLAST 1 Substitution Matrix (BLOSUM62) Likelihood to substitute one amino acid with another Figure taken from https://en.wikipedia.org/wiki/blosum 2 Substitution Matrix (BLOSUM62)
More informationStructural Alignment of Proteins
Goal Align protein structures Structural Alignment of Proteins 1 2 3 4 5 6 7 8 9 10 11 12 13 14 PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE
More information8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011
8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011 2 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance and alignment 4. The number
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2014 1 HMM Lecture Notes Dannie Durand and Rose Hoberman November 6th Introduction In the last few lectures, we have focused on three problems related
More informationSequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5
Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5 Why Look at More Than One Sequence? 1. Multiple Sequence Alignment shows patterns of conservation 2. What and how many
More informationHeteropolymer. Mostly in regular secondary structure
Heteropolymer - + + - Mostly in regular secondary structure 1 2 3 4 C >N trace how you go around the helix C >N C2 >N6 C1 >N5 What s the pattern? Ci>Ni+? 5 6 move around not quite 120 "#$%&'!()*(+2!3/'!4#5'!1/,#64!#6!,6!
More informationBasic Local Alignment Search Tool
Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses
More informationEarly History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species
Schedule Bioinformatics and Computational Biology: History and Biological Background (JH) 0.0 he Parsimony criterion GKN.0 Stochastic Models of Sequence Evolution GKN 7.0 he Likelihood criterion GKN 0.0
More informationMultiple Sequence Alignments
Multiple Sequence Alignments...... Elements of Bioinformatics Spring, 2003 Tom Carter http://astarte.csustan.edu/ tom/ March, 2003 1 Sequence Alignments Often, we would like to make direct comparisons
More information