Protein function prediction based on sequence analysis

Performing sequence searches, post-BLAST analysis, using profiles and pattern matching
  Slides from a lecture on MOL204 - Applied Bioinformatics, 18-Oct-2005
  Rein Aasland, Department of Molecular Biology, University of Bergen
  Please do not distribute without the author's consent!
  MOL204 Applied Bioinformatics, Lecture 8

Sequence vs structure
  Chothia & Lesk (1986) EMBO J. 5:823-826

Sequence vs structure & function
  Devos et al. (2000) Proteins: Structure, Function, and Genetics 41:98-107

Complete genome sequences give new meaning to database searches
  Sequence similarity searches are one of our most powerful means for protein function prediction and genome annotation.

Sequence similarities can be used as a basis for evaluating hypotheses of homology
  Sequences are HOMOLOGOUS if they have a common ancestor.
  Homologous genes/proteins diverge during evolution.
  Similar sequences are ANALOGOUS if they do not have a common ancestor.
  Analogous sequences can CONVERGE during evolution and become more similar.

Sequence similarities can be used as a basis for evaluating hypotheses of homology
  Two genes are HOMOLOGS if they have a common ancestor.
  Two genes in different species are ORTHOLOGS if they have evolved from a common ancestor by speciation.
  Two genes in one species are PARALOGS if they have evolved from a common ancestor by duplication.
  NOTE 1: genes are either homologous or not!
  NOTE 2: two genes can be considered partially homologous if they share one homologous domain.

The homologous sequence space
  A protein (super)family contains orthologs in different species and paralogs in one species.
  (figure axis: sequence similarity)
  It is not always possible to distinguish between orthologs, paralogs and merely distant homologs.

Reasons for performing Database Searches
  Find a particular sequence - very close homologues: trivial database searches.
  Reveal clues to function - identify functional modules:
  FIRST: use SMART, Pfam, CDD and InterPro to search for known globular modules.
  THEN: use Blast database searches to search for distant relatives, which may reveal additional (unknown) globular domains.
  It is often difficult to distinguish true positives (TP) from false positives (FP).

Sequence Alignments and Database Searches
  A hit in a database search, even if apparently significant, may be a false positive; it is a hypothesis of homology!
  Questions relating to database searches:
  What is the architecture of my protein?
  Does it contain globular domains belonging to families with known function?
  Is the sequence similarity strong enough to allow a precise prediction of function?
  How can I trust the similarities I find?
  In many cases, only structural comparison can prove homology.
Different types of protein
  Globular
  Trans-membrane
  Random coil
  Coiled coil
  These require different types of bioinformatical analysis.

Globular Domains (topic from lecture 4)
  Secondary structure elements: helices, strands, loops
  Structural motifs: primary organisation of secondary structure elements
  Folds: basic structural elements, one or more motifs
  Domains: elaborated folds - as found in real proteins

Globular domains have hydrophobic cores

Conserved motifs often correspond to core secondary structure elements
  (figure labels: b14, α8)

Search for known domains
  Database searches by comparison to databases of multiple alignments of domains:
  SMART (EMBL, Heidelberg)
  Pfam (St. Louis, Stockholm, Cambridge, Jouy)
  InterPro (EBI, Cambridge)
  CDD (NCBI, US)

Scoring and statistical significance
  Scoring matrices, gap penalties, E- and P-values
Different methods
  Blast, Fasta, Smith-Waterman, Psi-Blast
Check for repeats
  Dot plots (and Pfam, Smart etc.)
Interpretation
  Visual inspection, reciprocal searches
Multiple alignment
  Clustal_X, T-Coffee, Muscle

Database Searches: Scoring matrices
  PAM250 Dayhoff matrix

       A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  A    2
  R   -2   6
  N    0   0   2
  D    0  -1   2   4
  C   -2  -4  -4  -5   4
  Q    0   1   1   2  -5   4
  E    0  -1   1   3  -5   2   4
  G    1  -3   0   1  -3  -1   0   5
  H   -1   2   2   1  -3   3   1  -2   6
  I   -1  -2  -2  -2  -2  -2  -2  -3  -2   5
  L   -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
  K   -1   3   1   0  -5   1   0  -2   0  -2  -3   5
  M   -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
  F   -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
  P    1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
  S    1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
  T    1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
  W   -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
  Y   -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
  V    0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

  Dayhoff matrices were built in 1978 and are based on 71 groups of sequences (at least 85% identity),
  assuming an evolutionary model where
  - every mutation is independent
  - the molecular clock is constant
  PAM := Percent Accepted Mutations

Database Searches: Other scoring matrices
  BLOSUM series (Henikoff, 1992)
  BLOSUM62 (built from alignment blocks clustered at 62% identity) is one of the most commonly used matrices.
  Gonnet series (Gonnet, 1992)
  Similar to PAM matrices. Often superior, but less frequently used.
  Each matrix requires optimised gap penalties.
  Advice: try searches with several matrices and different gap penalties.
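The interplay between a scoring matrix and gap penalties can be sketched as follows. This is a minimal illustration, not the scoring code of any particular search program: it scores two sequences that are already aligned, using only the F, W and Y entries of the PAM250 table above. The gap penalties (-10 to open, -1 to extend) are arbitrary values chosen for the example, and a run of gap columns is treated as one gap regardless of which sequence carries it.

```python
# Subset of the PAM250 values from the slide (residues F, W and Y only).
PAM250_SUBSET = {
    ("F", "F"): 9, ("W", "W"): 17, ("Y", "Y"): 10,
    ("F", "Y"): 7, ("F", "W"): 0, ("W", "Y"): 0,
}


def pair_score(a, b, matrix=PAM250_SUBSET):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(seq1, seq2, gap_open=-10, gap_extend=-1):
    """Score two pre-aligned sequences ('-' marks a gap position).

    The first column of a gap run costs gap_open, each further column
    costs gap_extend. Both penalties are illustrative assumptions.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += pair_score(a, b)
            in_gap = False
    return score


print(alignment_score("FWY", "YWF"))      # 7 + 17 + 7
print(alignment_score("FW--Y", "FWYYY"))  # 9 + 17 - 10 - 1 + 10
```

Changing the matrix without re-tuning `gap_open` and `gap_extend` shifts the balance between mismatches and gaps, which is why each matrix needs its own optimised penalties.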

Database Searches: Heuristic methods
  FASTA
  Uses word search in look-up tables followed by Smith-Waterman alignment of best hits.
  BLAST
  Uses word search in look-up tables and gapped extension, followed by Smith-Waterman alignment of best hits.
  More powerful implementation - and fast server at NCBI.
  Both methods are fast and sensitive, but do not formally guarantee the best alignments.

Database Searches: Statistical significance
  E-value (for the score of a match)
  The number of matches with at least this score that can be expected with the same query in a database of random sequences of the same size.
  P-value (for the score of a match)
  The probability that a match with at least this score will appear with the same query in a database of random sequences of the same size.
  The current version of Blast uses E-values.

  Tentative recommendations:
  E-value range             Interpretation
  Smaller than 1e-100       Exact matches (same gene, same species)
  Between 1e-50 and 1e-100  Nearly identical genes
  Between 1e-10 and 1e-50   Interesting, closely related sequences
  Between 1 and 1e-5        CAN be real homologues
  Greater than 1            Most likely not relevant

Database Searches: Blastp (Blast at NCBI)
  Paste in your sequence here.
  Limit search by species.
  Filter removes low complexity regions (GGGSGGGS...).
  Increase E value for higher sensitivity and shorter sequences.
  Use the smallest database needed for your purpose.
  Try different matrices and gap costs.

Database Searches: Blastp
  Increase list size for large protein families.
  Check here if you want PSI-BLAST.
  Limit hits to an E-value range for large families.

Database Searches: Blastp
  (screenshot labels: sequence length, result from CDD, request ID)
  Press the format button to continue - but you may choose to alter settings!
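The relation between the E-value and P-value defined above, and the tentative interpretation table, can be sketched in a few lines. The conversion P = 1 - exp(-E) assumes that the number of chance hits follows a Poisson distribution, which is the usual assumption in BLAST statistics. The function names are hypothetical, and the range between 1e-10 and 1e-5 is reported as uncovered because the slide's table does not mention it.

```python
import math


def p_value_from_e_value(e_value):
    """P = 1 - exp(-E), assuming Poisson-distributed chance hits."""
    # expm1 keeps precision for very small E-values.
    return -math.expm1(-e_value)


def interpret_e_value(e_value):
    """Map an E-value onto the tentative recommendations from the slide."""
    if e_value < 1e-100:
        return "exact match (same gene, same species)"
    if e_value <= 1e-50:
        return "nearly identical gene"
    if e_value <= 1e-10:
        return "interesting, closely related sequence"
    if e_value < 1e-5:
        return "range not covered by the slide's table"
    if e_value <= 1:
        return "can be a real homologue"
    return "most likely not relevant"


for e in (1e-153, 1e-53, 2e-4, 2.4):
    print(e, p_value_from_e_value(e), interpret_e_value(e))
```

For small E-values the two measures are practically identical; they only diverge as E approaches and exceeds 1, where P saturates at 1.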

Database Searches: Blastp
  Choose style of output and type of alignment.
  Check to format for PSI-Blast.
  If needed, restrict to a range of E-values.

Other Database Search Sites
  The Blast family:
  Blastn   DNA-DNA
  Blastp   Protein-Protein
  Blastx   DNA-Protein; good for new cDNAs!
  Tblastn  Protein-DNA; if gene is not predicted
  Tblastx  6-frame x 6-frame
  ExPASy Blast at Swiss Bioinformatics Centre (SIB): quick because less used; UniProt!
  WU-Blast2 at EBI and at Washington U.: easier to try different options
  FASTA3 at EBI: the current best FASTA implementation
  BIC - the Bioccelerator: Smith-Waterman on special chip
  ParAlign (in Oslo!): novel fast heuristic method

Other Database Search Sites
  http://www.expasy.org/tools/blast/
  Uses UNIPROT

Choice of databases
  SwissProt  153 871 entries    The best quality and best annotated database - but also rather incomplete
  TREMBL     1 333 971 entries  Translation of EMBL DNA database; ~equivalent to GenPept
  REFSEQ     50.850 entries     A reference database: one entry per object. The ULTIMATE database in theory!
  PDB        17.1811 entries

Quality of databases
  Sequencing errors
  Gene prediction errors: very common for large complete genomes; c.f. NURF P301, TOUTATIS
  Annotation errors: primary databases are not corrected unless authors agree; c.f. FSH
  Redundancy: even the non-redundant databases are significantly redundant
  Organism-specific databases
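The Blast family listed above differs only in what kind of query is compared with what kind of database, so choosing a program can be written as a simple lookup. The function and argument names below are hypothetical; the mapping itself follows the slide.

```python
# (query type, database type) -> program, as listed on the slide.
BLAST_PROGRAMS = {
    ("dna", "dna"): "blastn",
    ("protein", "protein"): "blastp",
    ("dna", "protein"): "blastx",    # query translated in six frames
    ("protein", "dna"): "tblastn",   # database translated in six frames
}


def choose_blast_program(query_type, database_type, translate_both=False):
    """Pick a Blast program for the given query and database types.

    translate_both=True selects tblastx, which compares the six-frame
    translations of a DNA query with those of a DNA database.
    """
    if translate_both:
        if (query_type, database_type) != ("dna", "dna"):
            raise ValueError("tblastx needs a DNA query and a DNA database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, database_type)]


print(choose_blast_program("dna", "protein"))   # new cDNA against proteins
print(choose_blast_program("protein", "dna"))   # protein against unannotated DNA
```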

A case: Search for yeast SET domains
  E-values obtained with different search settings:

              Blast     Blast Blosum62  Blast Blosum45    Bic       Bic
              Default   no filter       no filter (>gap)  Default   (>gap)
              1e-155    1e-155          1e-126            7e-171    3e-163
              2e-14     2e-14           2e-13             2e-15     4e-15
              0.56      23              0.78              0.51      0.02
  Total hits  7         8               8                 41        34

  Further values on the slide (column assignment not recoverable): 0.001, 0.021, 0.0013, 0.001, 41, 26.3, 26, 6.7, 17.5, 15.6

A case: Search for yeast SET domains
  Always perform reciprocal searches.
  E-values shown on the slide: 2e-4, 2.4, 1e-53, 1e-153 (4 hits)

A case: Search for yeast SET domains
  Check that alignments are sensible.

>gi|6322293|ref|NP_012367.1| transcription factor containing a SET domain; Set2p
Length = 733
Score = 27.3 bits (59), Expect = 2.4
Identities = 39/152 (25%), Positives = 64/152 (41%), Gaps = 32/152 (21%)

Query: 37  CSN--WESSRSADIEVRKSSNERDFGVFAADSCVKGELIQEYLGKIDFQKNYQTDPNNDY 94
           C N  ++  + A I + K+ + + +GV A       + I EY G++  +  ++ D   DY
Sbjct: 109 CQNQRFQKKQYAPIAIFKTKH-KGYGVRAEQDIEANQFIYEYKGEVIEEMEFR-DRLIDY 166

Query: 95  RLMGTTKPKVLFHPHWPL-----YIDSRETGGLTRYIRRSCEPNVELVTVRPLDEKPRGD 149
                   +   H ++ +     +ID+   G L R+   SC PN  +
Sbjct: 167 ------DQRHFKHFYFMMLQNGEFIDATIKGSLARFCNHSCSPNAYV------------- 207

Query: 150 NDCRVKFVLR----AIRDIRKGEEISVEWQWD 177
           N   VK  LR    A R I KGEEI+ ++  D
Sbjct: 208 NKWVVKDKLRMGIFAQRKILKGEEITFDYNVD 239

A case: Search for yeast SET domains
  Check that alignments are sensible.
  A false hit (E=38)

>gi|6323244|ref|NP_013316.1| Cdc123p
Length = 360
Score = 23.5 bits (49), Expect = 38
Identities = 23/97 (23%), Positives = 42/97 (42%), Gaps = 10/97 (10%)

Query: 10  KAITISEYKDKYVKMFIDNHYDDDWVVCSNWESSRSADIEVRKSSNERDFGVFAADSCVK 69
           K+I +     K+++     + + D +     E+SRS   E    + + D+  +  D
Sbjct: 37  KSIVLKSLPKKFIQ-----YLEQDGIKLPQEENSRSVYTEEIIRNEDNDYSDWEDDEDTA 91

Query: 70  GELIQEYLGKIDFQKNYQ--TDPNNDYRLMGTTKPKV 104
            E +QE    IDF + +Q   D  N+   +G   PK+
Sbjct: 92  TEFVQEVEPLIDFPELHQKLKDALNE---LGAVAPKL 125
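When checking whether an alignment is sensible, the summary figures that BLAST prints (identities and gaps over the alignment length) can be recomputed directly from the aligned strings. The sketch below does this for the last block of the Set2p alignment shown above; the function name is hypothetical. Counting "Positives" would additionally need the full scoring matrix, so it is left out.

```python
def alignment_stats(query, subject):
    """Count identities and gap columns in a pairwise alignment block."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    length = len(query)
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / length,
    }


# Last block of the Set2p alignment (query 150-177, subject 208-239).
stats = alignment_stats(
    "NDCRVKFVLR----AIRDIRKGEEISVEWQWD",
    "NKWVVKDKLRMGIFAQRKILKGEEITFDYNVD",
)
print(stats)  # length 32, 14 identities, 4 gap columns
```

A short, well-conserved block like this one can look convincing on its own, which is why the statistics for the whole alignment (25% identity, 21% gaps) and a reciprocal search remain necessary.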

Database Searches: Using low sequence complexity filter (SEG) [p254]
  Many proteins contain extensive regions with a low sequence complexity.
  Example: C-ter third of FSH_DROME

  HLMQPAGPQQ QQQQQQQQPF GHQQQQQQQQ QQQQQQQQQH MDYVTELLSK GAENVGGMNG
  NHLLNFNLDM AAAYQQKHPQ QQQQQAHNNG FNVADFGMAG FDGLNMTAAS FLDLEPSLQQ
  QQMQQMQLQQ QHHQQQQQQT HQQQQQHQQQ HHQQQQQQLT QQQLQQQQQQ QQQQQHLQQQ
  QHQQQHHQAA NKLLIIPKPI ESMMPSPPDK QQLQQHQKVL PPQQSPSDMK LHPNAAAAAA
  VASAQAKLVQ TFKANEQNLK NASSWSSLAS ANSPQSHTSS SSSSSKAKPA MDSFQQFRNK
  AKERDRLKLL EAAEKEKKNQ KEAAEKEQQR KHHKSSSSSL TSAAVAQAAA IAAATAAAAV
  TLGAAAAAAL ASSASNPSGG SSSGGAGSTS QQAITGDRDR DRDRERERER SGSGGGQSGN
  GNNSSNSANS NGPGSAGSGG SGGGGGSGPA SAGGPNSGGG GTANSNSGGG GGGGGPALLN
  AGSNSNSGVG SGGAASSNSN SSVGGIVGSG GPGSNSQGSS GGGGGGPASG GGMGSGAIDY
  GQQVAVLTQV AANAQAQHVA AAVAAQAILA ASPLGAMESG RKSVHDAQPQ ISRVEDIKAS

  You can use Blast2sequences to see what is filtered out.

Using multiple alignments as basis for similarity searches: PSI-Blast

PSI-BLAST
  Position-specific iterated BLAST
  A very sensitive method for finding distant homologues.
  Hits after a BLAST search are selected for alignment and subsequent profile search.
  Conserved positions are emphasised.
  The procedure can be repeated until convergence is achieved, i.e. no new matches.

PSI-BLAST
  All 6 yeast SET domains are easily identified after 2 rounds.

PSI-BLAST: Similarity between SET domains and Methyltransferases
  The similarity between SET domains and a group of plant Methyltransferases was found after 9 rounds of PSI-BLAST (Rea et al. 2000).
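The idea behind the low-complexity filter discussed above can be illustrated with a sliding-window Shannon entropy. This is a simplified stand-in and not the SEG algorithm itself: the window length of 12 and the entropy threshold of 2.2 bits are assumptions chosen for the example, and the function names are hypothetical. Residues in any window whose entropy falls below the threshold are replaced by X.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues in low-entropy windows with 'X'.

    window and threshold are illustrative assumptions, not SEG's
    actual procedure.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# First 20 residues of the FSH_DROME example, and a more complex stretch.
print(mask_low_complexity("HLMQPAGPQQQQQQQQQQPF"))
print(mask_low_complexity("MDYVTELLSKGA"))
```

Masking the poly-Q runs before a search prevents them from producing high-scoring but biologically meaningless matches to other glutamine-rich proteins.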

SET domains ARE similar to Rubisco MTase

PSI-BLAST
  PSI-BLAST involves the manual inclusion of new candidate relatives.
  A powerful program that must be used with great care!!!

Using multiple alignments as basis for similarity searches: Profiles and HMMs
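How a multiple alignment becomes a search tool can be sketched with a very small position-specific scoring matrix. This is a simplified illustration of the profile idea, not PSI-BLAST's or an HMM package's actual procedure: it uses a uniform background frequency, a flat pseudocount of 1 per residue, and no sequence weighting, all of which are assumptions made for brevity. The three-sequence alignment is a made-up toy example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column.

    Assumes a uniform background (1/20 per residue) and ignores gaps.
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_against_profile(profile, segment):
    """Sum the column scores for an ungapped segment of profile length."""
    return sum(column[res] for column, res in zip(profile, segment))


toy_alignment = ["GEEI", "GEEL", "GDEI"]  # hypothetical aligned motif
profile = build_profile(toy_alignment)
print(score_against_profile(profile, "GEEI"))  # matches conserved columns
print(score_against_profile(profile, "WWWW"))  # unrelated segment
```

Because fully conserved columns receive the highest scores, a profile emphasises exactly the positions that define the family. This is also why a single wrongly included sequence can distort later iterations.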

Domain families in SMART
  Bork et al., EMBL, Heidelberg
  Collection of carefully aligned domains (665)

Domain families in Pfam
  Eddy, Bateman, Sonnhammer et al., St. Louis, Cambridge, Stockholm
  Collection of carefully aligned domains (7503 = Pfam-A) + the automatically aligned ProDom families (= Pfam-B)
  ~73% of sequences contain at least one hit with Pfam A (or B).
  Implementations at St. Louis and the Sanger Centre are different, and both have nice features!

Domain families in CDD
  NCBI team
  Pfam + SMART + NCBI + COG (11088)
  Search by Reverse Position Specific BLAST
  Automatically done in Blastp

Domain families and more in InterPro
  Apweiler et al., EBI
  Domains, motifs, families, superfamilies (11007 entries, 2573 domains, 8166 families)
  Well integrated with many other databases, and well mapped onto GO (Gene Ontology).
  Problem: collection of data from various sources of varying quality, i.e. check where data comes from!

Dot plots to reveal repeats

Different types of protein
  Globular
  Trans-membrane
  Random coil
  Coiled coil
  These require different types of bioinformatical analysis.

Prediction of TM regions
  http://www.ch.embnet.org/software/tmpred_form.html
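The dot plot mentioned above for revealing repeats is easy to sketch: compare every window of a sequence with every window of the same sequence and mark the identical ones. The window length of 3 and the toy sequence are assumptions for the example; real dot plot programs usually score windows with a substitution matrix rather than demanding exact identity.

```python
def dot_matrix(seq_a, seq_b, window=3):
    """Return rows of '*' and '.' marking identical windows.

    Row i, column j is '*' when seq_a[i:i+window] == seq_b[j:j+window].
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


# Hypothetical sequence with the motif GSG repeated at both ends.
toy = "GSGKKGSG"
for row in dot_matrix(toy, toy):
    print(row)
```

In a self-comparison the main diagonal is always filled; any marks off the diagonal indicate an internal repeat, here the second copy of GSG.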

Prediction of TM regions
  TMpred (ISREC, Geneva)
  Human EGF Receptor

Prediction of TM regions
  TMpred (ISREC, Geneva)
  Human Rhodopsin (helices labelled A B C D E F G)

Prediction of coiled-coils
  http://www.ch.embnet.org/software/coils_form.html

Prediction of coiled-coils
  Coils (ISREC, Geneva)
  Human EEA1

GlobPlot: a method for predicting structure and unstructure
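The principle behind transmembrane prediction can be illustrated with a classical sliding-window hydropathy profile. Note that this is a swapped-in technique: it uses the Kyte-Doolittle hydropathy scale rather than the method implemented in TMpred. The window of 19 residues and the cutoff of 1.6 are commonly quoted values for this scale and should be treated as assumptions, and the test sequence is artificial.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every window along the sequence."""
    return [
        sum(KYTE_DOOLITTLE[aa] for aa in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions (0-based) of windows hydrophobic enough to span a membrane."""
    profile = hydropathy_profile(seq, window)
    return [i for i, value in enumerate(profile) if value >= threshold]


# Artificial sequence: 19 leucines flanked by lysines.
toy = "K" * 5 + "L" * 19 + "K" * 5
print(candidate_tm_windows(toy))
print(candidate_tm_windows("K" * 29))  # purely charged, no candidates
```

A protein such as rhodopsin would show seven separate hydrophobic peaks in such a profile, corresponding to helices A to G.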