William R. Pearson faculty.virginia.edu/wrpearson

Size: px

Start display at page:

Download "William R. Pearson faculty.virginia.edu/wrpearson"

Chastity Carpenter
6 years ago
Views:

1 Protein Sequence Comparison and Protein Evolution (What BLAST does/why BLAST works William R. Pearson faculty.virginia.edu/wrpearson 1

2 Effective Similarity Searching in Practice 1. Always search protein databases (possibly with translated DNA) 2. Use E()-values, not percent identity, to infer homology E() < is significant in a single search 3. Search smaller (comprehensive) databases 4. Change the scoring matrix for: short sequences (exons, reads) short evolutionary distances (mammals, vertebrates, a- proteobacteria) high identity (>50% alignments) to reduce over-extension 5. All methods (pairwise, HMM, PSSM) miss homologs, and find homologs the other methods miss 2

3 Homology Fundamentals Homologous sequences are unexpectedly similar (excess similarity) excess compared to??? random similarity (similarity by chance, E()-value) Non-significant similarity IS NOT evidence for nonhomology significance depends on perspective (divergence on tree) significant similarity to a protein of different structure shows non-homology Homology at the (entire) sequence level is different from homology at the residue level sequence homology is inferred from statistics residue homology REQUIRES sequence homology 3

4 Establishing homology from statistically significant similarity Why BLAST works For most proteins, homologs are easily found over long evolutionary distances (500 My 2 By) using standard approaches (BLAST, FASTA) Difficult for distant relationships or very short domains Most default search parameters are optimized for distant relationships and work well Not every aligned residue is homologous but with significant similarity, there is an homologous domain 4

5 This talk is not about: Alignment Alignment quality may be more sensitive to parameter choice Multiple sequences for biologically accurate alignments Inferring Protein Function Homology (common ancestry) implies common structure (guaranteed), not necessarily common function But >40% global identity usually have similar functions Homologs have different functions Non-homologs have similar (or identical) functions The best sequences for building trees Protein sequences are clearly best for establishing homology, but DNA sequences may be better for resolving recent divergence 5

6 Homologues share a common ancestor vertebrates/ arthopods 18,000 6,530 4,289 plants/animals time (bilions of years) human horse fish insect worm wheat yeast E. coli prokaryotes/eukaryotes -3.0 self-replicating systems -4.0 chemical evolution 6

0 A Sequence: E()< 10-84 100% 223/223 S.

7 When do we infer homology? Homology <=> structural similarity? sequence similarity Bovine trypsin (5ptp) Structure: E()< ; RMSD 0.0 A Sequence: E()< % 223/223 S. griseus trypsin (1sgt) E()< RMSD 1.6 A E()< %; 226/223 S. griseus protease A (2sga) E()< 10-4 ; RMSD 2.6 A E()< %; 199/181 7

8 When can we infer non-homology? Non-homologous proteins have different structures Bovine trypsin (5ptp) Structure: E()<10-23 RMSD 0.0 A Sequence: E()< % 223/223 Subtilisin (1sbt) E() >100 E()<280; 25% 159/275 Cytochrome c4 (1etp) E() > 100 E()<5.5; 23% 171/190 8

9 Homology and EXCESS similarity Why are proteins (sequences or structures) similar alternative hypotheses? 1. common ancestry homology 2. convergence from independent origins due to functional constraints We infer homology from excess similarity if no excess, similarity could happen by chance (independently) in a similarity search, there is always a most similar sequence To recognize excess similarity we need to know what random similarity looks like 9

10 Query: atp6_human.aa ATP synthase a chain aa Library: PIR1 Annotated (rel. 66) residues in sequences z-sc obs E() < := := one = represents 23 library sequences := := :* :* :===* :======= * :=============== * :========================= * :=======================================* :================================================* :=====================================================*== :======================================================*===== :====================================================*==== :===============================================*== :=========================================*==== :===================================*== :=========================== * :===================== * :===================* :===============* :============* :=========* :=======* :======* :====*= :===* :==* :==* :=* :=* :=* :* :* inset = represents 1 library sequences :* :* :========*========= :* :======*== :* :====*== :* :===* :* :==*========== :* :=*=== :* :=* :* :*==== :* :*=== :* :*= :* :*==== :* :*===== := *== := *= > :== *============================== 10

11 Inferring Homology from Statistical Significance Real UNRELATED sequences have similarity scores that are indistinguishable from RANDOM sequences If a similarity is NOT RANDOM, then it must be NOT UNRELATED Therefore, NOT RANDOM (statistically significant) similarity must reflect RELATED sequences 11

12 Query: atp6_human.aa ATP synthase a chain aa Library: residues in sequences The best scores are: ( len) s-w bits E(13351) %_id %_sim alen sp P00846 ATP6_HUMAN ATP synthase a chain (AT ( 226) e sp P00847 ATP6_BOVIN ATP synthase a chain (AT ( 226) e sp P00848 ATP6_MOUSE ATP synthase a chain (AT ( 226) e sp P00849 ATP6_XENLA ATP synthase a chain (AT ( 226) e sp P00851 ATP6_DROYA ATP synthase a chain (AT ( 224) e sp P00854 ATP6_YEAST ATP synthase a chain pre ( 259) e sp P00852 ATP6_EMENI ATP synthase a chain pre ( 256) e sp P14862 ATP6_COCHE ATP synthase a chain (AT ( 257) e sp P68526 ATP6_TRITI ATP synthase a chain (AT ( 386) e sp P05499 ATP6_TOBAC ATP synthase a chain (AT ( 395) e sp P07925 ATP6_MAIZE ATP synthase a chain (AT ( 291) e sp P0AB98 ATP6_ECOLI ATP synthase a chain (AT ( 271) e sp P0C2Y5 ATPI_ORYSA Chloroplast ATP synth (A ( 247) sp P06452 ATPI_PEA Chloroplast ATP synthase a ( 247) sp P27178 ATP6_SYNY3 ATP synthase a chain (AT ( 276) sp P06451 ATPI_SPIOL Chloroplast ATP synthase ( 247) sp P08444 ATP6_SYNP6 ATP synthase a chain (AT ( 261) sp P69371 ATPI_ATRBE Chloroplast ATP synthase ( 247) sp P06289 ATPI_MARPO Chloroplast ATP synthase ( 248) sp P30391 ATPI_EUGGR Chloroplast ATP synthase ( 251) sp P19568 TLCA_RICPR ADP,ATP carrier protein ( 498) sp P24966 CYB_TAYTA Cytochrome b ( 379) sp P03892 NU2M_BOVIN NADH-ubiquinone oxidored ( 347) sp P68092 CYB_STEAT Cytochrome b ( 379) sp P03891 NU2M_HUMAN NADH-ubiquinone oxidored ( 347) sp P00156 CYB_HUMAN Cytochrome b ( 380) sp P15993 AROP_ECOLI Aromatic amino acid tr ( 457) sp P24965 CYB_TRANA Cytochrome b ( 379) sp P29631 CYB_POMTE Cytochrome b ( 308) sp P24953 CYB_CAPHI Cytochrome b ( 379)

13 >sp P00846 ATP6_HUMAN ATP synthase subunit a; F-ATPase protein 6 >sp P0AB98 ATP6_ECOLI ATP synthase subunit a; F-ATPase subunit 6 Length=271 Score = 47.9 bits (178), Expect = 3e-06 Identities = 55/199 (27%), Positives = 113/199 (56%), Gaps = 37/199 (18%) Query 8 SFIAPTILGLPAAVLIILFPPLLIPTSKYLINNRLITTQQWLIKLTSKQMMTMHNTKGRT 67 S +LGL ++++LF T + +I M++ K + Sbjct 45 SMFFSVVLGL---LFLVLFRSVAKKATSG-VPGKFQTAIELVIGFVNGSVKDMYHGKSKL 100 Query 68 WSLMLVSLIIFIATTNLLGLLP HSF TPTTQLSMNLAMAIPLWAG NL+ LLP H + P+ +++ L+MA+ ++ Sbjct 101 IAPLALTIFVWVFLMNLMDLLPIDLLPYIAEHVLGLPALRVVPSADVNVTLSMALGVF Query 112 TVIMGFRSKIKNALAHFLPQGTPTPL-----IPMLVIIETISLLIQPMALAVRLTANITA F S + F + T P+ IP+ +I+E +SLL +P++L +RL N+ A Sbjct 159 -ILILFYSIKMKGIGGFTKELTLQPFNHWAFIPVNLILEGVSLLSKPVSLGLRLFGNMYA 217 Query 167 GHLLMHLIGSATLAMSTINLPSTLIIFTILILLTILEIAVALIQAYVFTLLVSLYL 222 G L+ LI S L IF ILI+ +QA++F +L +YL Sbjct 218 GELIFILIAGLLPWWSQWILNVPWAIFHILIIT LQAFIFMVLTIVYL

14 The PAM250 matrix Cys 12 Ser 0 2 Thr Pro Ala Gly Asn Asp Glu Gln His Arg Lys Met Ile Leu Val Phe Tyr Trp C S T P A G N D E Q H R K M I L V F Y W 14

15 Where do scoring matrices come from? # λs = log% $ frequency of replacement in homologs q ij p i p j & ( ' frequency of alignment by chance Scoring matrices can be designed for different evolutionary distances (less=shallow; more=deep) Deep matrices allow more substitution Pam40 A R N D E I L A 8 R N D E I L Pam250 A R N D E I L A 2 R -2 6 N D E I L

16 Query: atp6_human.aa ATP synthase a chain aa Library: residues in sequences The best scores are: ( len) s-w bits E(13351) %_id %_sim alen sp P00846 ATP6_HUMAN ATP synthase a chain (AT ( 226) e sp P00847 ATP6_BOVIN ATP synthase a chain (AT ( 226) e sp P00848 ATP6_MOUSE ATP synthase a chain (AT ( 226) e sp P00849 ATP6_XENLA ATP synthase a chain (AT ( 226) e sp P00851 ATP6_DROYA ATP synthase a chain (AT ( 224) e sp P00854 ATP6_YEAST ATP synthase a chain pre ( 259) e sp P00852 ATP6_EMENI ATP synthase a chain pre ( 256) e sp P14862 ATP6_COCHE ATP synthase a chain (AT ( 257) e sp P68526 ATP6_TRITI ATP synthase a chain (AT ( 386) e sp P05499 ATP6_TOBAC ATP synthase a chain (AT ( 395) e sp P07925 ATP6_MAIZE ATP synthase a chain (AT ( 291) e sp P0AB98 ATP6_ECOLI ATP synthase a chain (AT ( 271) e sp P0C2Y5 ATPI_ORYSA Chloroplast ATP synth (A ( 247) sp P06452 ATPI_PEA Chloroplast ATP synthase a ( 247) sp P27178 ATP6_SYNY3 ATP synthase a chain (AT ( 276) sp P06451 ATPI_SPIOL Chloroplast ATP synthase ( 247) sp P08444 ATP6_SYNP6 ATP synthase a chain (AT ( 261) sp P69371 ATPI_ATRBE Chloroplast ATP synthase ( 247) sp P06289 ATPI_MARPO Chloroplast ATP synthase ( 248) sp P30391 ATPI_EUGGR Chloroplast ATP synthase ( 251) sp P19568 TLCA_RICPR ADP,ATP carrier protein ( 498) sp P24966 CYB_TAYTA Cytochrome b ( 379) sp P03892 NU2M_BOVIN NADH-ubiquinone oxidored ( 347) sp P68092 CYB_STEAT Cytochrome b ( 379) sp P03891 NU2M_HUMAN NADH-ubiquinone oxidored ( 347) sp P00156 CYB_HUMAN Cytochrome b ( 380) sp P15993 AROP_ECOLI Aromatic amino acid tr ( 457) sp P24965 CYB_TRANA Cytochrome b ( 379) sp P29631 CYB_POMTE Cytochrome b ( 308) sp P24953 CYB_CAPHI Cytochrome b ( 379)

17 Query: atp6_ecoli.aa ATP synthase a aa Library: residues in sequences The best scores are: ( len) s-w bits E(13351) %_id %_sim alen sp P0AB98 ATP6_ECOLI ATP synthase a chain (AT ( 271) e sp P06451 ATPI_SPIOL Chloroplast ATP synthase ( 247) e sp P69371 ATPI_ATRBE Chloroplast ATP synthase ( 247) e sp P08444 ATP6_SYNP6 ATP synthase a chain (AT ( 261) e sp P06452 ATPI_PEA Chloroplast ATP synthase a ( 247) e sp P30391 ATPI_EUGGR Chloroplast ATP synthase ( 251) e sp P0C2Y5 ATPI_ORYSA Chloroplast ATP synthase ( 247) e sp P27178 ATP6_SYNY3 ATP synthase a chain (AT ( 276) e sp P06289 ATPI_MARPO Chloroplast ATP synthase ( 248) e sp P07925 ATP6_MAIZE ATP synthase a chain (AT ( 291) e sp P68526 ATP6_TRITI ATP synthase a chain (AT ( 386) e sp P00854 ATP6_YEAST ATP synthase a chain pre ( 259) e sp P05499 ATP6_TOBAC ATP synthase a chain (AT ( 395) e sp P00846 ATP6_HUMAN ATP synthase a chain (AT ( 226) e sp P00852 ATP6_EMENI ATP synthase a chain pre ( 256) e sp P00849 ATP6_XENLA ATP synthase a chain (AT ( 226) e sp P00847 ATP6_BOVIN ATP synthase a chain (AT ( 226) e sp P14862 ATP6_COCHE ATP synthase a chain (AT ( 257) e sp P00848 ATP6_MOUSE ATP synthase a chain (AT ( 226) e sp P00851 ATP6_DROYA ATP synthase a chain (AT ( 224) sp P24962 CYB_STELO Cytochrome b ( 379) sp P09716 US17_HCMVA Hypothetical protein HVL ( 293) sp P68092 CYB_STEAT Cytochrome b ( 379) sp P24960 CYB_ODOHE Cytochrome b ( 379) sp P03887 NU1M_BOVIN NADH-ubiquinone oxidored ( 318) sp P24992 CYB_ANTAM Cytochrome b ( 379)

18 Homology is Transitive (on domains) vs human : E. coli Human mito : 10-6 Bovine mito : 10-6 Mouse mito : 10-5 Frog mito : 10-6 Dros. mito : Yeast mito : 10-8 Cochliobolus mito : 10-5 Aspergillus mito : /10-6 Corn mito : 10-9 Wheat mito : 10-8 Rice chloro : Tobacco chloro : Spinach chloro : Pea chloro : March. chloro : Cyanobacteria : Synechocystis : Euglena chloro : E. coli 10-6 : vs human : E. coli 18

19 Homology and EXCESS similarity The importance of perspective We use the E()-value to infer homology based on excess similarity BETWEEN the QUERY and the SUBJECT sequences Two homologous sequences may not share excess similarity in a BLAST search with one or the other as query, but share significant similarity to a third sequence (or to a PSSM or HMM) If two sequences share significant similarity to the same sequence in the same region, we can infer homology. As always, non-excess similarity does not imply non-homology 19

20 Homology and Domains Histone acetyltransferase KAT2B The best scores are: s-w bits E(454402) %_id %_sim alen KAT2B_HUMAN Histone acetyltransferase KAT2B ( 832) KAT2A_HUMAN Histone acetyltransferase KAT2A ( 837) GCN5_SCHPO Histone acetyltransferase gcn5 ( 454) e GCN5_YEAST Histone acetyltransferase GCN5 ( 439) e GCN5_ORYSJ Histone acetyltransferase GCN5 ( 511) e GCN5_ARATH Histone acetyltransferase GCN5; ( 568) e BPTF_HUMAN Nucleosome-remodeling factor sub (3046) e NU301_DROME Nucleosome-remodeling factor su (2669) e CECR2_HUMAN Cat eye syndrome critical regio (1484) e BRD4_HUMAN Bromodomain-containing protein 4 (1362) e BRD4_MOUSE Bromodomain-containing protein 4 (1400) e BAZ2A_HUMAN Bromodomain adjacent to zinc fi (1905) e BAZ2A_XENLA Bromodomain adjacent to zinc fi (1698) e FSH_DROME Homeotic protein female sterile; (2038) e BAZ2A_MOUSE Bromodomain adjacent to zinc fi (1889) e BRDT_MACFA Bromodomain testis-specific prot ( 947) e BRD3_HUMAN Bromodomain-containing protein 3 ( 726) e

21 KATB_HUMAN KAT2B_HUMAN KAT2B_HUMAN KAT2A_HUMAN KAT2B_HUMAN GCN5_YEAST KAT2B_HUMAN GCN5_ARATH Homology and Domains Histone acetyltransferase KAT2B PCAF_N C.Acetyltrans Bromodomain PCAF_N C.Acetyltrans Bromodomain C.Acetyltrans Bromodomain C.Acetyltrans Bromodomain E()< E()< e e KAT2B_HUMAN BPTF_HUMAN e KAT2B_HUMAN BRD4_MOUSE 500 BromodomainBromodomain e KAT2B_HUMAN BRD4_MOUSE 500 BromodomainBromodomain e

22 A quick diversion LALIGN statistically significant internal alignments # lalign36 -s BP62 -m B -w 80 mchu.aa mchu.aa LALIGN finds non-overlapping local alignments Please cite: X. Huang and W. Miller (1991) Adv. Appl. Math. 12: >sp P62158 CALM_HUMAN Calmodulin; CaM Length=149 Score = 51.4 bits (168), Expect = 7e-12 Identities = 35/73 (47%), Positives = 51/73 (69%), Gaps = 3/73 (4%) Query 1 MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFLTMMARK 76 M D +EE+I +EAF +FDKDG+G I+ EL VM +LG+ T+ E+ +MI E D DG+G +++ EF+ MM K Sbjct 77 MKDTDSEEEI---REAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK 149 Score = 41.5 bits (132), Expect = 7e-09 Identities = 36/97 (37%), Positives = 55/97 (56%), Gaps = 8/97 (8%) Query 11 AEFKEAFSLFDKDGDGTITTKELGTVM-RSLGQNPTEAELQDMINEVDADGNGTIDFPEF---LTMMARKMKDTDSEEEI 86 AE ++ + D DG+GTI E T+M R + +E E+++ D DGNG I E +T + K+ D + +E I Sbjct 47 AELQDMINEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMI 126 Query 87 REAFRVFDKDGNGYISAAELRHVMT 111 REA D DG+G ++ E +MT Sbjct 127 REA----DIDGDGQVNYEEFVQMMT 147 Score = 18.5 bits (48), Expect = 0.06 Identities = 13/33 (39%), Positives = 23/33 (69%), Gaps = 5/33 (15%) Query 1 MADQLTEEQIAEF-KEAFSLFDKDGDGTITTKELGTVM LT+E++ E +EA D DGDG + +E +M Sbjct 113 LGEKLTDEEVDEMIREA----DIDGDGQVNYEEFVQMM 146 No identity alignment 22

23 LALIGN Identifying mobile domains: mobile (duplicated) domains in local alignments sp P CALM_HUMAN Calmodulin; CaM EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand score: 168; 51.8 bits; E(1) < 5.8e-12 score: 132; 41.8 bits; E(1) < 5.9e-09 score: 48; 18.5 bits; E(1) < EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-handEF-hand EF-hand EF-hand EF-hand EF-hand EF-handEF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand EF-hand sp P CALM_HUMAN Calmodulin; CaM aa E(): < <0.01 <1 <1e+02 >1e+02

24 Similarity searches for homology Homologous sequences are unexpectedly similar (excess similarity) excess compared to??? random similarity (similarity by chance, E()-value) In a similarity search, excess similarity reflects the perspective of the query sequence different queries can reveal excess similarity homology in the aligned region Non-significant similarity IS NOT evidence for non-homology significant similarity to a protein of different structure shows non-homology 24

25 Computer Demo fasta.bioch.virginia.edu/mol_evol/ tinyurl.com/fasta-demo/ Work in PAIRS Answer questions with partner Goals: trust the statistics what are the positive / negative controls? look at the alignments does non-alignment imply non-homology? use [pgm] link 25

26 Effective similarity searching Use protein/translated DNA comparisons Modern sequence similarity searching is highly efficient, sensitive, and reliable homologs are homologs similarity statistics are accurate databases are large most queries will find a significant match Improving similarity searches protein (translated DNA) similarity searches smaller databases appropriate scoring matrices for short reads/assemblies appropriate alignment boundaries Extracting more information from annotations homologous over extension scoring sub-alignments to identify homologous domains All methods (pairwise, HMM, PSSM) miss homologs all methods find genuine homologs the other methods miss 26

27 DNA vs protein sequence comparison The best scores are: DNA tfastx3 prot. E(188,018) E(187,524) E(331,956) DMGST D.melanogaster GST e e e-109 MDGST1 M.domestica GST-1 gene 2e e e-76 LUCGLTR Lucilia cuprina GST 1.5e e e-73 MDGST2A M.domesticus GST-2 mrna 9.3e e e-62 MDNF1 M.domestica nf1 gene e e e-62 MDNF6 M.domestica nf6 gene e e e-62 MDNF7 M.domestica nf7 gene e e e-62 AGGST15 A.gambiae GST mrna 3.1e e e-61 CVU87958 Culicoides GST 1.8e e e-58 AGG3GST11 A.gambiae GST1-1 mrna 1.5e e e-43 BMO6502 Bombyx mori GST mrna 1.1e e e-40 AGSUGST12 A.gambiae GST1-1 gene 2.3e e e-37 MOTGLUSTRA Manduca sexta GST 5.7e e e-25 RLGSTARGN R.legominosarum gsta e e-10 HUMGSTT2A H. sapiens GSTT e e-09 HSGSTT1 H.sapiens GSTT1 mrna e e-10 ECAE E. coli hypothet. prot. 4.7e e-09 MYMDCMA Methyl. dichlorometh. DH 1.1e e-07 BCU19883 Burkholderia maleylacetate red. 1.2e e-08 NFU43126 Naegleria fowleri GST 3.2e SP505GST Sphingomonas paucim 1.8e EN1838 H. sapiens maleylaceto. iso. 2.1e e-06 HSU86529 Human GSTZ1 3.0e e-06 SYCCPNC Synechocystis GST 1.2e e-06 HSEF1GMR H.sapiens EF1g mrna 9.0e

28 Why smaller databases produce more sensitive searches statistics S = λs raw - ln K m n S bit = (λs raw - ln K)/ln(2) P(S >x) = 1 - exp(-e -x ) P(S bit > x) = 1 -exp(-mn2 -x ) E(S >x D) = P D Z(s) λs bit P(B bits) = m n 2 -B P(40 bits)= 1.5x10-7 E(40 D=4000) = 6x10-4 E(40 D=12E6) =

29 What is a bit score (I)? 1. Scoring matrices (PAM250, BLOSUM62, VTML40) contain log-odds scores: s i,j (bits) = log 2 (q i,j /p i p j ) (q i,j freq. in homologs / p i p j freq. by chance) s i,j (bits) = 2 -> a residue is 2 2 =4-times more likely to occur by homology compared with chance (at one residue) s i,j (bits) = -1 -> a residue is 2-1 = 1/2 as likely to occur by homology compared with chance (at one residue) 2. An alignment score is the maximum sum of s i,j bit scores across the aligned residues. A 40-bit score is 2 40 more likely to occur by homology than by chance. 3. How often should a score occur by chance? In a 400 * 400 alignment, there are ~160,000 places where the alignment could start by chance, so we expect a score of 40 bits would occur: P(S bit > x) = 1 -exp(-mn2 -x ) ~ mn2 -x 400 x 400 x 2-40 = 160,000 / 2 40 ( ) = 1.5 x 10-7 times Thus, the probability of a 40 bit score in ONE alignment is ~

30 What is a bit score (I)? 4. But we did not ONE alignment, we did 4,000, 40,000, 500,000, or 20 million alignments when we searched the database: E(S bit D) = p(40 bits) x database size E(40 4,000) = 10-7 x 4,000 = 4 x 10-4 (significant) E(40 40,000) = 10-7 x 4 x 10 4 = 4 x 10-3 (not significant) E(40 500,000) =10-7 x 5 x 10 5 = 0.05 (not significant) E(40 20 million) = 10-7 x 2.0 x 10 7 = 2.0 (not significant) Bonferroni correction for multiple tests each alignment is a test Not significant does not mean not-homologous 30

31 E()-values are conservative frequentist estimates that similarity occurred by chance S = λs raw - ln K m n S bit = (λs raw - ln K)/ln(2) P(S >x) = 1 - exp(-e -x ) P(S bit > x) = 1 -exp(-mn2 -x ) Z(s) λs bit Bonferroni correction: E(S >x D) = P D(# of tests) With modern sequence databases (thousands of complete proteomes), E()<10-10 is routine for sequences >25% identical, after correcting for 10,000,000 sequences (tests) 31

How many bits do I need? E() = p() x database size E(40 4,000) = 10-7 x 4,000 = 4 x 10-4 (significant) E(40 40,000) = 10-7 x 4 x 10 4 = 4 x 10-3 (not significant) E(40 500,000) = 10-7 x 5 x 10 5 = 0.

32 How many bits do I need? E() = p() x database size E(40 4,000) = 10-7 x 4,000 = 4 x 10-4 (significant) E(40 40,000) = 10-7 x 4 x 10 4 = 4 x 10-3 (not significant) E(40 500,000) = 10-7 x 5 x 10 5 = 0.05 (not significant) To get E() ~ 10-3, how many bits do I need? p = m n 2 bits bits = log2(p/(m n)) = log2(e()/(database_size m n)) genome (10,000) p ~ 10-3 /10 4 = 10-7 /160,000 = 40 bits SwissProt (500,000) p ~ 10-3 /10 6 = 10-9 /160,000 = 47 bits Uniprot/NR (10 7 ) p ~ 10-3 /10 7 = /160,000 = 50 bits very significant E()<10-50 significant E()<10-6 significant E()<10-3 not significant 32

33 What database to search? Search the smallest comprehensive database likely to contain your protein vertebrates human proteins (40,000) fungi S. cerevisiae (6,000) bacteria E. coli, gram positive, etc. (<100,000) QFO (Quest for Orthologs proteome set) Search a richly annotated protein set (SwissProt, >500,000) Always search NR (> 50 million) LAST Never Search GenBank (DNA) 33

34 Scoring matrices Scoring matrices can set the evolutionary lookback time for a search Lower PAM (PAM10/MDM10 PAM60) for closer (10% 50% identity) Higher BLOSUM for higher conservation (BLOSUM50 distant, BLOSUM80 conserved) Shallow scoring matrices for short domains/short queries (metagenomics) Matrices have bits/position (score/position), 40 aa at 0.45 bits/position (BLOSUM62) means 18 bit ave. score (50 bits significant) Deep scoring matrices allow alignments to continue, possibly outside the homologous region 34

35 Where do scoring matrices come from? Pam40 A R N D E I L A 8 R N D E I L Pam250 A R N D E I L A 2 R -2 6 N D E I L λs i, j = log b ( q i, j p i p j ) q ij : replacement frequency at PAM40, 250 q R:N ( 40) = p R = q R:N (250) = p N = λ 2 S ij = lg 2 (q ij /p i p j ) λ e S ij = ln(q ij /p i p j ) p R p N = λ 2 S R:N( 40) = lg 2 ( / )= λ 2 = 1/3; S R:N( 40) = /l 2 = -7 λ S R:N(250) = lg2 ( / )= 0 35

36 PAM matrices and alignment length BLOSUM80 BLOSUM62 BLOSUM50 Short domains require shallow scoring matrices 36

37 Empirical matrix performance (median results from random alignments) Matrix target % ident bits/position aln len (50 bits) VT160-12/ BLOSUM50-10/ BLOSUM62* -11/ VT120-11/ VT80-11/ PAM70* -10/ PAM30* -9/ VT40-12/ VT20-15/ VT10 /16/ HMMs can be very "deep" 37

38 Scoring matrices affect alignment boundaries (homologous over-extension) BLOSUM62-11/-1 VTML80-10/-1 SH3_domain SH3_domain SH3_domain 500 SH3_domain sp Q SRC8_HUMAN Src substrate cort sp Q SRC8_HUMAN Src substrate cort sp Q SRC8_HUMAN Src substrate cortactin; Am sp Q SRC8_HUMAN Src substrate cortactin; Am E(): < <0.01 <1 <1e+02 >1e+02 E(): < <0.01 <1 <1e+02 >1e+02 38

39 Scoring Matrices - Summary PAM and BLOSUM matrices greatly improve the sensitivity of protein sequence comparison low identity with significant similarity PAM matrices have an evolutionary model - lower number, less divergence lower=closer; higher=more distant BLOSUM matrices are sampled from conserved regions at different average identity higher=more conservation Short alignments (domains, exons, reads) require shallow (higher information content) matrices Shallow matrices set maximum look-back time 39

40 Effective similarity searching Modern sequence similarity searching is highly efficient, sensitive, and reliable homologs are homologs similarity statistics are accurate databases are large most queries will find a significant match Improving similarity searches smaller databases appropriate scoring matrices for short reads/assemblies appropriate alignment boundaries Extracting more information from annotations homologous over extension scoring sub-alignments to identify homologous domains All methods (pairwise, HMM, PSSM) miss homologs all methods find genuine homologs the other methods miss 40

41 Overextension into random sequence PF PF > pf E6SGT6 E6SGT6_THEM7 Heavy metal translocating P-type ATPase EC= Length=888 Score = 299 bits (766), Expect = 1e-90, Method: Compositional matrix adjust. Identities = 170/341 (50%), Positives = 224/341 (66%), Gaps = 19/341 (6%) 113 Query 84 FLFVNVFAALFNYWPTEGKILMFGKLEKVLITLILLGKTLEAVAKGRTSEAIKKLMGLKA 143 +L+ V A +P+ +F + V++ L+ LG LE A+GRTSEAIKKL+GL+A Sbjct 312 WLYSTVAVAFPQIFPSMALAEVFYDVTAVVVALVNLGLALELRARGRTSEAIKKLIGLQA Query 144 KRARVIRGGRELDIPVEAVLAGDLVVVRPGEKIPVDGVVEEGASAVDESMLTGESLPVDK ARV+R G E+DIPVE VL GD+VVVRPGEKIPVDGVV EG S+VDESM+TGES+PV+ Sbjct 372 RTARVVRDGTEVDIPVEEVLVGDIVVVRPGEKIPVDGVVIEGTSSVDESMITGESIPVEM 431 Query 204 QPGDTVIGATLNKQGSFKFRATKVGRDTALAQIISVVEEAQGSKAPIQRLADTISGYFVP 263 +PGD VIGAT+N+ GSF+FRATKVG+DTAL+QII +V++AQGSKAPIQR+ D +S YFVP Sbjct 432 KPGDEVIGATINQTGSFRFRATKVGKDTALSQIIRLVQDAQGSKAPIQRIVDRVSHYFVP 491 Query 264 VVVSLAVITFFVWYFAVAPENFTRALLNFTAVLVIACPCALGLATPTSIMVGTGKGAEKG 323 V+ LA++ VWY + AL+ F L+IACPCALGLATPTS+ VG GKGAE+G Sbjct 492 AVLILAIVAAVVWYVFGPEPAYIYALIVFVTTLIIACPCALGLATPTSLTVGIGKGAEQG Query 324 ILFKGGEHLENAG GGAHTEGAENKAELLKTRATGISILVTLGLTAKGRDRS 374 IL + G+ L+ A G T+G +++ ATG + L LTA Sbjct 552 ILIRSGDALQMASRLDVIVLDKTGTITKGKPELTDVVA--ATGFDEDLILRLTA Query 375 TVAFQKNTGFKLKIPIGQAQLQREVAASESIVISAYPIVGV 415 A ++ + L I + L R +A E+ +A P GV Sbjct AIERKSEHPLATAIVEGALARGLALPEADGFAAIPGHGV

Can homologous proteins have different structures? Roessler C G et al. PNAS (2008)105:2343 BLOSUM80 40% id VTML40 70% id Region: 12-37:14-34 : score=50; bits=26.

42 Can homologous proteins have different structures? Roessler C G et al. PNAS (2008)105:2343 BLOSUM80 40% id VTML40 70% id Region: 12-37:14-34 : score=50; bits=26.0; Qval=41 : DOMAIN_N: alpha Region: 42-60:39-57 : score=26; bits=14.5; Qval= 7 : DOMAIN_N: beta Smith-Waterman score: 82; E(1) < 2e-9; 49.0% identity (61.2% similar) in 49 aa overlap BD1 MNAIDIAINKLGSVSALAASLGVRQSAISNWRARGRVPAERCIDIERVTNGAVICRELRPDVFGASPAGHRPEASNAAA :. :::::.::: :::::. : : : :.:.: : :.:: 2PIJ XKKIPLSKYLEEHGTQSALAAALGVNQSAISQ-----MVRAGRSIEITLYEDGRVEANEIRPIPARPKRTAA [ ] [ ] Region: 12-29:14-31 : score=82; bits=42.9; LPr=9.2 : DOMAIN_N: alpha Smith-Waterman score: 82; E(1) < 6.5e % identity (88.9% similar) in 18 aa overlap BD1 MNAIDIAINKLGSVSALAASLGVRQSAISNWRARGRVPAERCIDIERVTNGAVICRELRPDVFGASPAGHRPEASNAAA :. :::::.::: ::::: 2PIJ XKKIPLSKYLEEHGTQSALAAALGVNQSAISQMVRAGRSIEITLYEDGRVEANEIRPIPARPKRTAA [ ]

Scoring matrices affect alignment boundaries (homologous over-extension) BLOSUM62-11/-1 BLOSUM62-11/-1 sp Q14247.

158; Q= 0.0 : in 80-116:117-153 : Id=0.622; Q=37.4 : in 117-153:154-190 : Id=0.757; Q=50.2 : in 154-190:191-227 : Id=0.811; Q=61.0 : in 191-227:228-264 : Id=0.

200; Q= 0.0 : SH3 82-116:119-153 : Id=0.657; Q=102.2 : in 117-153:154-190 : Id=0.757; Q=138.0 : in 154-190:191-227 : Id=0.811; Q=164.6 : in 191-227:228-264 : Id=0.

43 Scoring matrices affect alignment boundaries (homologous over-extension) BLOSUM62-11/-1 BLOSUM62-11/-1 sp Q SRC8_HUMAN Src substrate cort SH3_domain SH3_domain 32-42: : Id=0.455; Q= 0.0 : NODOM : : : Id=0.158; Q= 0.0 : in : : Id=0.622; Q=37.4 : in : : Id=0.757; Q=50.2 : in : : Id=0.811; Q=61.0 : in : : Id=0.568; Q=35.3 : in : : Id=0.649; Q=41.5 : in : : Id=0.565; Q= 8.9 : in : : Id=0.165; Q= 0.0 : NODOM : : Id=0.200; Q= 0.0 : SH : : Id=0.657; Q=102.2 : in : : Id=0.757; Q=138.0 : in : : Id=0.811; Q=164.6 : in : : Id=0.568; Q= 91.9 : in : : Id=0.649; Q=112.4 : in sp Q SRC8_HUMAN Src substrate cortactin; Am : : Id=0.565; Q= 36.7 : in E(): < <0.01 <1 <1e+02 >1e+02 VTML80-10/-1 43

44 Overextension into random sequence PF PF > pf E6SGT6 E6SGT6_THEM7 Heavy metal translocating P-type ATPase EC= Length=888 Score = 299 bits (766), Expect = 1e-90, Method: Compositional matrix adjust. Identities = 170/341 (50%), Positives = 224/341 (66%), Gaps = 19/341 (6%) 113 Query 84 FLFVNVFAALFNYWPTEGKILMFGKLEKVLITLILLGKTLEAVAKGRTSEAIKKLMGLKA 143 +L+ V A +P+ +F + V++ L+ LG LE A+GRTSEAIKKL+GL+A Sbjct 312 WLYSTVAVAFPQIFPSMALAEVFYDVTAVVVALVNLGLALELRARGRTSEAIKKLIGLQA Query 144 KRARVIRGGRELDIPVEAVLAGDLVVVRPGEKIPVDGVVEEGASAVDESMLTGESLPVDK ARV+R G E+DIPVE VL GD+VVVRPGEKIPVDGVV EG S+VDESM+TGES+PV+ Sbjct 372 RTARVVRDGTEVDIPVEEVLVGDIVVVRPGEKIPVDGVVIEGTSSVDESMITGESIPVEM 431 Query 204 QPGDTVIGATLNKQGSFKFRATKVGRDTALAQIISVVEEAQGSKAPIQRLADTISGYFVP 263 +PGD VIGAT+N+ GSF+FRATKVG+DTAL+QII +V++AQGSKAPIQR+ D +S YFVP Sbjct 432 KPGDEVIGATINQTGSFRFRATKVGKDTALSQIIRLVQDAQGSKAPIQRIVDRVSHYFVP 491 Query 264 VVVSLAVITFFVWYFAVAPENFTRALLNFTAVLVIACPCALGLATPTSIMVGTGKGAEKG 323 V+ LA++ VWY + AL+ F L+IACPCALGLATPTS+ VG GKGAE+G Sbjct 492 AVLILAIVAAVVWYVFGPEPAYIYALIVFVTTLIIACPCALGLATPTSLTVGIGKGAEQG Query 324 ILFKGGEHLENAG GGAHTEGAENKAELLKTRATGISILVTLGLTAKGRDRS 374 IL + G+ L+ A G T+G +++ ATG + L LTA Sbjct 552 ILIRSGDALQMASRLDVIVLDKTGTITKGKPELTDVVA--ATGFDEDLILRLTA Query 375 TVAFQKNTGFKLKIPIGQAQLQREVAASESIVISAYPIVGV 415 A ++ + L I + L R +A E+ +A P GV Sbjct AIERKSEHPLATAIVEGALARGLALPEADGFAAIPGHGV

45 Sub-alignment scoring detects overextension >>sp E6SGT6 E6SGT6_THEM7 Heavy metal translocating P-type ATPase EC= (888 aa) qregion: : : score=15; bits=12.3; Id=0.219; Q=0.0 : Shuffle qregion: : : score=736; bits=232.8; Id=0.641; Q=644.7 : PF00122 qregion: : : score=14; bits=12.0; Id=0.236; Q=0.0 : Shuffle Region: : : score=11; bits=11.1; Id=0.194; Q=0.0 : NODOM :0 Region: : : score=736; bits=232.8; Id=0.641; Q=644.7 : PF00122 Pfam Region: : : score=16; bits=12.6; Id=0.244; Q=0.0 : PF00702 Pfam s-w opt: 632 Z-score: bits: E(274545): 3.7e-51 Smith-Waterman score: 765; 49.7% identity (73.3% similar) in 344 aa overlap (81-415: ) B0TE74 E6SGT6 Shuffle PF00122 Shuffle PF04945 PF00403 PF00122 PF ][ 120 B0TE74 LPIVPGTMALGVQSDGKDETLLVALEVPVDRERVGPAIVDAFRFLFVNVFAALFNYWPTEGKILMFGKLEKVLITLILLG :.:..:.:...:...:. :...:. :: E6SGT6 ILGLLTLPVMLWSGSHFFNGMWQGLKHRQANMHTLISIGIAAAWLYSTVAVAFPQIFPSMALAEVFYDvtavvvalvnlg ][ ][ B0TE74 APENFTRALLNFTAVLVIACPCALGLATPTSIMVGTGKGAEKGILFKGGEHLENAGG GAHTEGAENKAELL. ::. :...:.::::::::::::::. :: :::::.:::...:. :. :. :. :.:.... E6SGT6 PEPAYIYALIVFVTTLIIACPCALGLATPTSLTVGIGKGAEQGILIRSGDALQMASRLDVIVLDKTGTITKGKPELTDVV ] [ B0TE74 KTRATGISILVTLGLTAKGRDRSTVAFQKNTGFKLKIPIGQAQLQREVAASESIVISAYPIVGVVVDSLVTTAFLAVEEI :::... : ::: :... : :.. : :.: :...: : :: E6SGT6 --AATGFDEDLILRLTA AIERKSEHPLATAIVEGALARGLALPEADGFAAIPGHGVEAQVEGHHVLVGNERL

Scoring domains highlights over extension >>sp SRC8_HUMAN Src substrate cortactin; (550 aa) >>sp SRC8_CHICK Src substrate p85; Cort (563 aa) 84.7% id (1-550:11-563) E(454402): 1.

46 Scoring domains highlights over extension >>sp SRC8_HUMAN Src substrate cortactin; (550 aa) >>sp SRC8_CHICK Src substrate p85; Cort (563 aa) 84.7% id (1-550:11-563) E(454402): 1.2e-159 >>sp SRC8_HUMAN Src substrate cortactin (550 aa) >>sp HCLS1_MOUSE Hematopoiet ln cell-sp (486 aa) 44.1% id (1-548:1-485) E(454402): 4.1e : Id=0.873; Q=281.4 : NODOM : Id=1.000; Q=133.2 : in : Id=0.946; Q=121.0 : in : Id=0.973; Q=127.1 : in : Id=0.973; Q=128.3 : in : Id=0.973; Q=137.5 : in : Id=0.892; Q=117.3 : in : Id=0.957; Q= 69.6 : in : Id=0.632; Q=386.6 : NODOM : Id=0.966; Q=226.3 : SH3 1-79: 1-78 Id=0.671; Q=213.0 : NODOM : Id=0.757; Q= 97.9 : in : Id=0.703; Q= 94.8 : in : Id=0.703; Q= 97.3 : in : Id=0.826; Q= 60.5 : in : Id=0.179; Q= 0.0 : NODOM : : Id=0.719; Q=173.2 : SH3 Q = -10 log(p) Q>30.0 -> p <

47 Homology, non-homology, and over-extension Sequences that share statistically significant sequence similarity are homologous (simplest explanation) But not all regions of the alignment contribute uniformly to the score lower identity/q-value because of non-homology (overextension)? lower identity/q-value because more distant relationship (domains have different ages)? Test by searching with isolated region can the distant domain (?) find closer (significant) homologs? Similar (homology) or distinct (non-homology) structure is the gold standard Multiple sequence alignment can obscure over-extension if the alignment is over-extended, part of the alignment is NOT homologous 47

48 Improving sensitivity with protein/domain family models PSI-BLAST - method 1. do BLAST search 2. use query-based implied multiple sequence alignment to build Position Specific Scoring Matrix (PSSM) 3. repeat steps 1 and 2 with PSSM, for 5 10 iterations PSI-BLAST results: 1. Typically 2X as sensitive as single sequence methods 2. Over-extension can cause PSSM contamination 48

49 Improving sensitivity with protein/domain family models HMMER3 jackhmmer method 1. do HMMER (Hidden Markov Model, HMM) search with single sequence 2. use query-hmm-based implied multiple sequence alignment to more accurate HMM 3. repeat steps 1 and 2 with HMM HMMER3 results: 1. Less over-extension because of probabilistic alignment 2. Used to construct Pfam domain database Many protein families are too diverse for one HMM, Pfam divides families into multiple HMMs and groups in Clans 3. Clearly homologous sequences are still missed 49

Missing homology beyond the HMM model >>tr Q8LNM4 Q8LNM4_ORYSJ Eukaryotic aspartyl protease family protein vs >>tr Q2QSI0 Q2QSI0_ORYSJ Glycosyl hydrolase family 9 protein, expressed OS=O (694 aa)

50 Missing homology beyond the HMM model >>tr Q8LNM4 Q8LNM4_ORYSJ Eukaryotic aspartyl protease family protein vs >>tr Q2QSI0 Q2QSI0_ORYSJ Glycosyl hydrolase family 9 protein, expressed OS=O (694 aa) qregion: : : score=508; bits=240.8; LPr=67.0 : Aspartyl protease s-w opt: 508 Z-score: bits: E(1): 9.6e-68 Smith-Waterman score: 508; 62.5% identity (79.2% similar) in 144 aa overlap Q8LNM4 TDACKSIPTSNCSSNMCTYEGTINSKLGGHTLGIVATDTFAIGTATASLGFGCVVASGIDTMGGPSGLIGLGRAPSSLVS ::: :.: ::. :. : : : : :::::.::: :.: ::::::: :::: : ::..::::.: :::. Q2QSI0 LCESISNDIHNCSGNVCMYEASTNA---GDTGGKVGTDTFAVGTAKANLAFGCVVASNIDTMDGSSGIVGLGRTPWSLVT Q8LNM4 QMNITKFSYCLTPHDSGKNSRLLLGSSAKLAGGGNSTTTPFVKTSPGDDMSQYYPIQLDGIKAGDAAIALPPSGNTVLVQ :.. :::::.:::.:::. :.:::.:::::::...:::: : :.:.: ::.::..::::: : ::::: Q2QSI0 QTGVAAFSYCLAPHDAGKNNALFLGSTAKLAGGGKTASTPFVNIS-GNDLSNYYKVQLEVLKAGDAMIPLPPSGVLWDNY Q8LNM4 Q2QSI0 Asp 50

51 Pfam misses/mis-aligns proteins distant from the model For diverse families, a single model can find, and miss, closely related homologs Even if homologs are found, alignments may be short her pesec cmvhh2 ratodor humrsc chkgpcr gppaf musp2u chkp2y humfmlf ratg10d ratang dogrdc1 humil8 bovlcr1 cmvhh3 ratrbs11 humssr1 musdelto humc5a ratbk2 humthr ratlh bovop ratrta hummrg hummas humedg1 ratcgpcr ratpot humacth hummsh musep3 humtxa2 musep2 ratcckadogad1 dogcckb humd2 huma2a hama1a hamb2 hum5ht1a ratd1 bovh1 ratnpyy1 ratnk1 flynk flynpy musgir ratntr mustrh musgrp musgnrh ratvia boveta humm1 51

52 Homology Fundamentals Homologous sequences are unexpectedly similar (excess similarity) excess compared to??? random similarity (similarity by chance, E()-value) Non-significant similarity IS NOT evidence for nonhomology significant similarity to a protein of different structure shows non-homology Homology at the (entire) sequence level is different from homology at the residue level sequence homology is inferred from statistics residue homology REQUIRES sequence homology 52

53 Effective Similarity Searching in Practice 1. Always search protein databases (possibly with translated DNA) 2. Use E()-values, not percent identity, to infer homology E() < is significant in a single search 3. Search smaller (comprehensive) databases 4. Change the scoring matrix for: short sequences (exons, reads) short evolutionary distances (mammals, vertebrates, a- proteobacteria) high identity (>50% alignments) to reduce over-extension 5. All methods (pairwise, HMM, PSSM) miss homologs, and find homologs the other methods miss 53

William R. Pearson faculty.virginia.edu/wrpearson

Protein Sequence Comparison and Protein Evolution (What BLAST does/why BLAST works William R. Pearson faculty.virginia.edu/wrpearson pearson@virginia.edu 1 Effective Similarity Searching in Practice 1.