Profiles. Evolutionarily related sequences (orthologs and paralogs) are often identified with local alignment programs like BLAST, FASTA, SSEARCH.


1 Profiles
Tore Samuelsson Nov 9

Background

Evolutionarily related sequences (orthologs and paralogs) are often identified with local alignment programs like BLAST, FASTA and SSEARCH. However, these methods are not always sufficient. In many cases the amino acid sequences of related proteins have diverged significantly, although the fold of the proteins is preserved.

2 Amino acid sequences may change rapidly during evolution although 3D structure is preserved

Sequences with decreasing percent identity to species A:

Species A  ATGGCAAAACTTGAAAAACTGAATCAAGCAGGCCTGATGGTCGCTGGT
           M A K L E K L N Q A G L M V A G
Species B  ATGGCTAGGTTGGAGAAGAUAAACCAAGCTGGGATAATAGTTGCAGGA
           M V R L E K I N Q A G L L V A G
Species C  M V R I Q K I N E K G A L L A G
Species D  Q V R I Q K I Y E K G A L L A A   ("twilight zone")
Species E  Q V R I Q K I Y E K T A L L F A   ("midnight zone")

In a BLAST search evolutionarily related proteins may have very poor E-values

Sequences producing significant alignments:

SRC_HUMAN (P9) Proto-oncogene tyrosine-protein kinase Src (E...
YES_HUMAN (P9) Proto-oncogene tyrosine-protein kinase Yes (E...
FYN_HUMAN (P) Proto-oncogene tyrosine-protein kinase Fyn (E...
FGR_HUMAN (P99) Proto-oncogene tyrosine-protein kinase FGR (E...
HCK_HUMAN (P8) Tyrosine-protein kinase HCK (EC...) (p...
LCK_HUMAN (P9) Proto-oncogene tyrosine-protein kinase LCK (E...
LYN_HUMAN (P98) Tyrosine-protein kinase Lyn (EC...)
BLK_HUMAN (P) Tyrosine-protein kinase BLK (EC...) (B...
FRK_HUMAN (P8) Tyrosine-protein kinase FRK (EC...) (N...
SHC_HUMAN (Q99) SHC transforming protein (SH domain prote...
SHA_HUMAN (Q9NP) SH domain protein A (T cell-specific adap...
CHIN_HUMAN (P88) N-chimaerin (NC) (N-chimerin) (Alpha chimeri...
APS_HUMAN (O9) SH and PH domain-containing adapter protein...
CISH_HUMAN (Q9NSE) Cytokine-inducible SH-containing protein (C...
SOCS_HUMAN (O) Suppressor of cytokine signaling (SOCS-)
CHIO_HUMAN (P) Beta-chimaerin (Beta-chimerin) (Rho-GTPase-a...
LIPL_HUMAN (P88) Lipoprotein lipase precursor (EC...) (...
SOCS_HUMAN (O9) Cytokine inducible SH-containing protein...
ATPF_HUMAN (Q8NM) ATP synthase mitochondrial F complex assem...
TENS_HUMAN (Q9HBL) Tensin
FBWA_HUMAN (Q9Y9) F-box/WD-repeat protein A (F-box and WD-re...
STAT_HUMAN (P) Signal transducer and activator of transcri...
SOCS_HUMAN (O) Suppressor of cytokine signaling (SOCS-)...

Profile-based searches are more sensitive in identifying remote sequence similarity.

3 Profiles are generated from multiple alignments, and they are one of many applications of multiple alignments:

* Profiles
* Identify conserved motifs - patterns (PROSITE)
* Phylogenetic studies
* Prediction of protein secondary structure

Multiple alignments are generated by methods like ClustalW and T-Coffee.

Terminology

Profile. A matrix where the numbers reflect the probabilities of characters appearing in a certain position in a multiple alignment.

PSSM. "Position-specific scoring matrix". More or less synonymous with 'profile', but sometimes a 'profile' refers to a matrix where gaps are also taken into account. Sometimes also called a weight matrix or frequency matrix.

4 Position-Specific Scoring Matrix (PSSM)

Multiple alignment of 5' splice site sequences:

 1  GAGGTAAAC
 2  TCCGTAAGT
 3  CAGGTTGGA
 4  ACAGTCAGT
 5  TAGGTCATT
 6  TAGGTACTG
 7  ATGGTAACT
 8  CAGGTATAC
 9  TGTGTGAGT
10  AAGGTAAGT

Calculate the absolute frequency of each nucleotide at each position:

      1   2   3   4   5   6   7   8   9
A     3   6   1   0   0   6   7   2   1
C     2   2   1   0   0   2   1   1   2
G     1   1   7  10   0   1   1   5   1
T     4   1   1   0  10   1   1   2   6
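The counting step can be sketched in a few lines of Python. This is a minimal illustration using the ten splice site sequences from the slide; the function name `count_matrix` is a hypothetical helper, not part of any package mentioned in the slides.

```python
from collections import Counter

# The ten aligned 5' splice site sequences from the slide (9 columns each).
ALIGNMENT = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]
ALPHABET = "ACGT"


def count_matrix(seqs, alphabet=ALPHABET):
    """Return {base: [count in column 1, count in column 2, ...]}."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    # zip(*seqs) walks through the alignment column by column.
    columns = [Counter(column) for column in zip(*seqs)]
    return {base: [col[base] for col in columns] for base in alphabet}


counts = count_matrix(ALIGNMENT)
for base in ALPHABET:
    print(base, " ".join(f"{n:2d}" for n in counts[base]))
```

Each row of the printed matrix corresponds to one nucleotide and each column to one alignment position, so every column sums to the number of sequences.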

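5 Calculate the relative frequency of each nucleotide at each position:

      1    2    3    4    5    6    7    8    9
A    0.3  0.6  0.1  0.0  0.0  0.6  0.7  0.2  0.1
C    0.2  0.2  0.1  0.0  0.0  0.2  0.1  0.1  0.2
G    0.1  0.1  0.7  1.0  0.0  0.1  0.1  0.5  0.1
T    0.4  0.1  0.1  0.0  1.0  0.1  0.1  0.2  0.6

What is the probability of finding CAGGTTGGA? The product of the frequency of each nucleotide at each position:

0.2 * 0.6 * 0.7 * 1 * 1 * 0.1 * 0.1 * 0.5 * 0.1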
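The product of per-position frequencies can be checked with a short sketch. It assumes the ten splice site sequences shown on the previous slide; `frequency_matrix` and `sequence_probability` are hypothetical helper names chosen for this illustration.

```python
from math import prod

# The ten aligned 5' splice site sequences from the slides.
ALIGNMENT = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]


def frequency_matrix(seqs, alphabet="ACGT"):
    """Relative frequency of each base in each alignment column."""
    n = len(seqs)
    return {
        base: [column.count(base) / n for column in zip(*seqs)]
        for base in alphabet
    }


def sequence_probability(seq, freqs):
    """Multiply the frequency of the observed base at every position."""
    return prod(freqs[base][i] for i, base in enumerate(seq))


freqs = frequency_matrix(ALIGNMENT)
p = sequence_probability("CAGGTTGGA", freqs)
print(f"P(CAGGTTGGA) = {p:.2e}")
```

Because the frequencies are simply multiplied, a single position with frequency 0 makes the whole product 0, which is the problem that pseudocounts address.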

6 Compute the log odds ratios $\log(M_{ij} / P_i)$

$M_{ij}$ = probability of nucleotide i at position j
$P_i$ = background probability of nucleotide i

For this example we assume $P_i = 0.25$.

[Table: log-odds PSSM, rows A, C, G, T, columns 1-9]

Scoring with a PSSM

We want to analyze the sequence GTAGTAGAAGGTAAGTGTCCGTAG with the profile. We examine a window the size of the profile:

GTAGTAGAAGGTAAGTGTCCGTAG

7 PSSM: find the score for each window of GTAGTAGAAGGTAAGTGTCCGTAG

[Figure: three windows of the sequence highlighted, each with its score]

Pseudocounts

[Figure: PSSM matrices, rows A, C, G, T, columns 1-9]
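Window scoring with a log-odds PSSM can be sketched as follows. Several things here are assumptions: the natural logarithm is used because the slides do not state the base, the background probability is taken as 0.25 for every nucleotide, no pseudocounts are applied (so a base never seen in a column scores minus infinity), and the helper names `log_odds_matrix` and `scan` are hypothetical.

```python
import math

# The ten aligned 5' splice site sequences from the slides.
ALIGNMENT = [
    "GAGGTAAAC", "TCCGTAAGT", "CAGGTTGGA", "ACAGTCAGT", "TAGGTCATT",
    "TAGGTACTG", "ATGGTAACT", "CAGGTATAC", "TGTGTGAGT", "AAGGTAAGT",
]
BACKGROUND = 0.25  # assumed uniform background probability


def log_odds_matrix(seqs, background=BACKGROUND):
    """One dict per column mapping base -> log(frequency / background)."""
    n = len(seqs)
    matrix = []
    for column in zip(*seqs):
        scores = {}
        for base in "ACGT":
            freq = column.count(base) / n
            # Without pseudocounts an unseen base gets a score of -infinity.
            scores[base] = math.log(freq / background) if freq > 0 else float("-inf")
        matrix.append(scores)
    return matrix


def scan(sequence, matrix):
    """Score every window that has the same width as the profile."""
    width = len(matrix)
    results = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        results.append((start, window, score))
    return results


pssm = log_odds_matrix(ALIGNMENT)
hits = scan("GTAGTAGAAGGTAAGTGTCCGTAG", pssm)
best_start, best_window, best_score = max(hits, key=lambda hit: hit[2])
print(best_start, best_window, round(best_score, 2))
```

Under these assumptions the best-scoring window is AAGGTAAGT (starting at 0-based index 7). Most other windows score minus infinity because columns 4 and 5 of the alignment contain only G and T.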

8 Position-Specific Scoring Matrix (PSSM): Pseudocounts

With a very large number of sequences in the multiple alignment, the observed frequency of an amino acid/nt is expected to be approximately equal to the actual probability of finding that amino acid/nt. However, in most cases the number of sequences is limited, so that for some amino acids/nts the observed frequency = 0 whereas the actual probability should be > 0. For this reason fake counts, pseudocounts, are added to avoid zero probabilities. For instance, one simple solution is to add 1 to all counts.

More sophisticated:

$$q_{u,a} = \frac{n_{u,a} + \beta \, p_a}{N_{seq} + \beta}$$

where
$q_{u,a}$ = estimated probability of residue type a occurring in column u
$p_a$ = frequency of occurrence of residue type a (based on the composition of proteins/DNA)
$n_{u,a}$ = count of residues a in column u
$N_{seq}$ = total number of sequences
$\beta$ = scaling parameter
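A minimal sketch of the background-weighted pseudocount estimate, written as q = (n + beta * p) / (N + beta). The symbol `beta` for the scaling parameter, its value of 1.0, the uniform background and the function name are all assumptions made for this illustration. The example column is column 4 of the splice site alignment, where only G is observed.

```python
def pseudocount_probabilities(counts, background, beta=1.0):
    """Estimate residue probabilities for one column with pseudocounts.

    counts     -- observed counts per residue in the column
    background -- background frequency per residue (should sum to 1)
    beta       -- scaling parameter (total weight of the pseudocounts)
    """
    # Assumes the column has no gaps, so the counts add up to N_seq.
    n_seq = sum(counts.values())
    return {
        residue: (counts.get(residue, 0) + beta * background[residue]) / (n_seq + beta)
        for residue in background
    }


column4 = {"A": 0, "C": 0, "G": 10, "T": 0}
uniform = {base: 0.25 for base in "ACGT"}

q = pseudocount_probabilities(column4, uniform, beta=1.0)
for base, value in q.items():
    print(base, round(value, 4))
```

The estimates still sum to 1, but no residue has probability 0 any more, so the log-odds scores stay finite. A larger `beta` pulls the estimates further towards the background frequencies.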

9 Representing a profile as a sequence logo

Amount of uncertainty in column u:

$$H_u = -\sum_a f_{u,a} \log_2 f_{u,a}$$

where $H_u$ is the uncertainty at position u, a is one of the four bases or, in the case of proteins, one of the 20 amino acids, and $f_{u,a}$ is the frequency of base (amino acid) a in column u.

Total information at position u is represented by the decrease in uncertainty:

$$I_u = \log_2 20 - H_u \quad \text{(proteins)}$$
$$I_u = \log_2 4 - H_u \quad \text{(DNA)}$$

where $I_u$ is the amount of information present at column u, and $\log_2 20$ (or $\log_2 4$) is the maximum uncertainty at any given position. The entire set of $I_u$ values forms a curve that represents the importance of the various positions. The height of this curve is the height of the logo at that position. The size of each base/amino acid printed in a logo is determined by multiplying its frequency by the total information at that position:

Height of base/amino acid a at position u = $f_{u,a} I_u$

The bases/amino acids are then stacked on top of each other in increasing order of their frequencies and plotted.

Sequence logo example. Consider the simple amino acid multiple sequence alignment:

       1  2  3
Seq1   A  A  A
Seq2   A  A  S
Seq3   A  G  T
Seq4   A  G  G

We use $H_u = -\sum_a f_{u,a} \log_2 f_{u,a}$ for each of the columns of the multiple alignment:

$H_1 = -(1 \cdot \log_2 1) = 0$
$H_2 = -((0.5 \cdot \log_2 0.5) + (0.5 \cdot \log_2 0.5)) = 1$
$H_3 = -(4 \cdot 0.25 \cdot \log_2 0.25) = 2$

Total height of columns: $I_u = \log_2 20 - H_u$

$I_1 = 4.32 - 0$
$I_2 = 4.32 - 1$
$I_3 = 4.32 - 2$

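10 Sequence logo example, cont.

Height of A at position 1 = $f_A \cdot I_1 = 1 \cdot I_1$
Height of A at position 2 = $f_A \cdot I_2 = 0.5 \cdot I_2$
Height of G at position 2 = $f_G \cdot I_2 = 0.5 \cdot I_2$
Height of A at position 3 = $f_A \cdot I_3 = 0.25 \cdot I_3$
etc.

$I_1 = 4.32$
$I_2 = 3.32$
$I_3 = 2.32$

[Figure: sequence logo of the example alignment]

This is the sequence logo obtained if the alignment above is used and "small sample correction" is deselected.

Sequence logo: 5' splice site example

[Figure: sequence logo of the 5' splice site alignment]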
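The logo arithmetic for the small four-sequence example can be reproduced with a short sketch. It assumes a protein alphabet of 20 residues and no small sample correction; `column_heights` is a hypothetical helper name.

```python
import math
from collections import Counter

# The example alignment: Seq1..Seq4, three columns.
ALIGNMENT = ["AAA", "AAS", "AGT", "AGG"]


def column_heights(seqs, alphabet_size=20):
    """Return (uncertainty H_u, information I_u, letter heights) per column."""
    n = len(seqs)
    result = []
    for column in zip(*seqs):
        freqs = {res: count / n for res, count in Counter(column).items()}
        # H_u = -sum_a f_ua * log2(f_ua); residues with f = 0 contribute nothing.
        uncertainty = -sum(f * math.log2(f) for f in freqs.values())
        # I_u = log2(alphabet size) - H_u, with no small sample correction.
        information = math.log2(alphabet_size) - uncertainty
        heights = {res: f * information for res, f in freqs.items()}
        result.append((uncertainty, information, heights))
    return result


logo = column_heights(ALIGNMENT)
for position, (h, info, heights) in enumerate(logo, start=1):
    letters = ", ".join(f"{res}={height:.2f}" for res, height in sorted(heights.items()))
    print(f"column {position}: H={h:.2f} I={info:.2f} heights: {letters}")
```

For a DNA alignment the same function could be called with `alphabet_size=4`, which changes the maximum information per column from about 4.32 bits to 2 bits.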

11 Sequence logo: translation start site in bacteria

[Figure: sequence logo of the translation start site in bacteria]

Methods that take into account position-specific information from multiple alignments

1997   PSI-BLAST (Altschul et al.)
1990s  Profile HMMs (S. Eddy) => HMMER software

12 Principle of PSI-BLAST

[Figure: flow chart. Query sequence -> "normal" BLAST search -> database hits -> use hits above the E-value cutoff -> PSSM (columns A, C, D, ..., Y) -> BLAST search with PSSM -> database hits -> use hits above cutoff -> iterate]

PSI-BLAST

PSI-BLAST is an important tool to identify remote protein similarity. It proceeds by way of the following steps:

(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program.

(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query.

(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm can be used for this directly.

(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale, and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments remain applicable to profile alignments.

(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.

Profile-alignment statistics allow PSI-BLAST to proceed as a natural extension of BLAST; the results produced in iterative search steps are comparable to those produced from the first pass.

Advantage: Unlike most profile-based search methods, PSI-BLAST runs as one program, starting with a single protein sequence, and the intermediate steps of multiple alignment and profile construction are invisible to the user.

13 PSI-BLAST - Constructing the profile

Pairwise alignments from PSI-BLAST:

Query  MKDRNLGEK
Sbjct  MKD-NLAEK

Query  MKD-RNLGEK
Sbjct  MKEARNLAEK

Query-anchored multiple alignment:

Query  MKD-RNLGEK
Sbjct  MKD--NLAEK
Sbjct  MKEARNLAEK

The column in which the query has a gap (the inserted A in the second subject sequence) is disregarded, so the profile has the same length as the query.
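The query-anchored construction can be sketched as a simple column filter: every pairwise alignment is projected onto the ungapped query by dropping the columns where the query has a gap. This is only an illustration of the idea with the two pairwise alignments from the slide, not PSI-BLAST's actual implementation; `anchor_to_query` is a hypothetical helper name.

```python
def anchor_to_query(query_aln, sbjct_aln, gap="-"):
    """Keep only the subject residues aligned to a query residue.

    Columns where the query has a gap (insertions in the subject)
    are dropped; subject gaps against query residues are kept.
    """
    if len(query_aln) != len(sbjct_aln):
        raise ValueError("aligned sequences must have equal length")
    return "".join(s for q, s in zip(query_aln, sbjct_aln) if q != gap)


# Pairwise alignments (query row, subject row) from the slide.
pairwise = [
    ("MKDRNLGEK", "MKD-NLAEK"),
    ("MKD-RNLGEK", "MKEARNLAEK"),
]

query = pairwise[0][0].replace("-", "")
rows = [query] + [anchor_to_query(q, s) for q, s in pairwise]
for label, row in zip(["Query", "Sbjct", "Sbjct"], rows):
    print(label, row)
```

All rows end up with the length of the query, so a count or frequency matrix can be computed column by column from the result.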

14 PSI-BLAST tutorial


16 "This analysis illustrates not only how the search for sequence relatives can reveal the function of a protein, but also how similarity searching serves to unify formerly disparate members of a database"

Yeast:  Pop  Pop  Pop  Rpp  Rpr  Pop  Pop  Pop8  Pop
Man:    hpop  Rpp9  hpop  Rpp  Rpp  Rpp  Rpp8  Rpp  Rpp


17 Outcome of PSI-BLAST is dependent on the query sequence: only some Pop homologues, when used as query, identify Rpp

Results from round

Sequences producing significant alignments:
Sequences used in model and found again:

POP_Pichia_stipitis 8_pichia_stipitis_FM.aa.fasta unnamed p...
ref XP_9. PREDICTED: similar to ribonuclease P kda subu...
ref XP_98. PREDICTED: similar to ribonuclease P (predicte...
ref NP_. ribonuclease P kda subunit [Homo sapiens] >gi...
ref XP_. PREDICTED: similar to RPP protein [Pan troglo...
dbj BAA98. unnamed protein product [Homo sapiens]
POP_Pichia_guilliermondii supercont_ Minus (of...
ref XP_. PREDICTED: ribonuclease P (predicted) [Rattus...
gb AAH8. Ribonuclease P kda subunit [Mus musculus] >gi...
ref XP_8. PREDICTED: similar to ribonuclease P kda subu...
gb AAH8. MGC8 protein [Xenopus laevis]
ref XP_898. PREDICTED: similar to RPP protein [Gallus gal...

Profiles: Example with HTH (helix turn helix) motif

TPRQKVAIIY DVGVSTLYKR FP
IPRKQVAIIY DVAVSTLYKK FP
HPRQQLAIIF GIGVSTLYRY FP
GSKTKLAQAA GIRLASLYSW KG
TTFKQIALES GLSTGTISSF IN
IPYQEFAKLI GKSTGAVRRM ID
VTLQQFAELE GVSERTAYRW TT
FTYNQYAQMM NISRENAYGV LA
LGASHISKTM NIARSTYVKV IN
TGATEIAHQL SIARSTVYKI LE
ISISAIAREF NTTRQTILRV KA
GNISALADAE NISRKIITRC IN
MVLADIAQAV EMHESTISRV TT
LVLHDIAEAV GMHESTISRV TT
LNLRIVADAI KMHESTVSRV TS
MTRGDIGNYL GLTVETISRL LG
LSLSALSRQF GYAPTTLANA LE
MSLAELGRSN GLSSSTLKNA LD
FDIASVAQHV CLSPSRLSHL FR
LRIDEVARHV CLSPSRLAHL FR
VTLEALADQV AMSPFHLHRL FK
VLYPDIAKKF NTTASRVERA IR

18 Result of scoring with HTH profile

>lcl AADR_RHOPA (Q98) Transcriptional activatory protein aadr (Anaerobic aromatic degradation regulator)
>lcl AGLR_RHIME (Q9ZR) HTH-type transcriptional regulator aglr
>lcl ANR_PSEAE (P9) Transcriptional activator protein anr
>lcl ARAC_ERWCH (P) Arabinose operon regulatory protein
>lcl ASCG_ECOLI (P) HTH-type transcriptional regulator ascg (Cryptic asc operon repressor)
>lcl CCPA_BACME (P88) Glucose-resistance amylase regulator (Catabolite control protein)
>lcl CCPA_BACSU (P) Catabolite control protein A (Glucose-resistance amylase regulator)
>lcl CCPA_STRMU (O9) Probable catabolite control protein A
>lcl CCPB_BACSU (P) Catabolite control protein B
>lcl CENPB_CRIGR (P8988) Major centromere autoantigen B (Centromere protein B) (CENP-B)
>lcl CENPB_HUMAN (P99) Major centromere autoantigen B (Centromere protein B) (CENP-B)
>lcl CENPB_MOUSE (P9) Major centromere autoantigen B (Centromere protein B) (CENP-B)
>lcl DEGA_BACSU (P9) HTH-type transcriptional regulator dega (Degradation activator)
>lcl DEOR_BACSU (P9) Deoxyribonucleoside regulator
>lcl EBGR_ECOLI (P8) HTH-type transcriptional regulator ebgr (Ebg operon repressor)
>lcl ENDR_PAEPO (P8) Probable HTH-type transcriptional regulator endr
>lcl ETRA_SHEON (P8) Electron transport regulator A
>lcl FECI_ECOLI (P8) Probable RNA polymerase sigma factor feci
>lcl FLP_LACCA (P98) Probable transcriptional regulator flp
>lcl FNRA_PSEST (P) Transcriptional activator protein fnra
>lcl FNRN_RHILV (P9) Probable transcriptional activator (ORF-)
>lcl FNR_ACTAC (Q9EXQ) Anaerobic regulatory protein
>lcl FNR_ECO (PA9E) Fumarate and nitrate reduction regulatory protein
>lcl FNR_ECOL (PA9E) Fumarate and nitrate reduction regulatory protein
>lcl FNR_ECOLI (PA9E) Fumarate and nitrate reduction regulatory protein
>lcl FNR_HAEIN (P99) Anaerobic regulatory protein
>lcl FNR_KLEOX (Q9AQ) Fumarate nitrate reduction regulatory protein
>lcl FNR_PASMU (Q9CMY) Anaerobic regulatory protein

Methods that take into account position-specific information from multiple alignments

1997   PSI-BLAST (Altschul et al.)
1990s  Profile HMMs (S. Eddy) => HMMER software

19 Profile HMMs

HMMER software package

hmmbuild   Build a model from a multiple sequence alignment.
hmmpfam    Search an HMM database for matches to a query sequence.
hmmsearch  Search a sequence database for matches to a single profile HMM.

Pfam database - an attempt to completely and accurately classify protein families and domains

"All science is either physics or stamp collecting" (Ernest Rutherford)

20 Profile HMMs: Pfam database

Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam HMM represents a protein family or domain. By searching a protein sequence against the Pfam library of HMMs you can find out its domain architecture. Pfam may also be used to analyse proteomes and domain architectures.

Two categories of families:

Pfam-A. Pfam-A families are manually curated HMM-based families, built using an alignment of a small number of representative sequences (the 'seed' alignment). A threshold is manually set for each HMM, and this determines the minimum score a sequence must attain to belong to the family. The HMMs are searched against the UniProt database, and all sequences that score above the cut-off value for a particular family are included in the family's full alignment. Pfam-A matches are very unlikely to be false matches.

Pfam-B. To complement the Pfam-A families, Pfam-B families are automatically generated using the PRODOM database. Pfam-B families are formed by taking alignments of sequence segments from PRODOM and removing any Pfam-A residues from them. (PRODOM is a database of protein domain sequence families constructed using PSI-BLAST analysis of protein sequences as well as information from the SCOP database.)

All families in Pfam are non-overlapping, such that no amino acid belongs to more than one family/domain.

Two HMMs for each Pfam entry. For each Pfam entry two HMMs are built, one to represent full-length matches (ls model) and one to represent fragment matches (fs model).

21 Complexity of Pfam

… Pfam-A families
… protein sequences in UniProt analyzed
=> on average ~… Pfam-A domains/protein
a total of … different architectures

22 Databases related to Pfam

PRODOM    www.toulouse.inra.fr/prodom.html
CDD       www.ncbi.nlm.nih.gov/structure/cdd/cdd.shtml
SMART     smart.embl-heidelberg.de/
INTERPRO  www.ebi.ac.

INTERPRO combines information from: Pfam, Prints, SMART, Prosite, PRODOM, CDD

23 SMART

24 Databases like InterPro have aided considerably in the annotation of the human genome

25 Exercises

Compare pairwise alignment methods (BLAST, FASTA, SSEARCH) to profile methods (PSI-BLAST, HMMER).

Protein domain studied: the SH2 domain, originally found in the oncoproteins Src and Fps. SH2 domains are found in many proteins taking part in signal transduction pathways. The function of SH2 domains is to specifically recognize the phosphorylated state of tyrosine residues, thereby allowing SH2 domain-containing proteins to localize to tyrosine-phosphorylated sites.

26 SH2 domain

Step 1: Extract the SH2 domain from the human SRC protein

% extractseq src_human.fa

Step 2: Pairwise alignment methods

1) BLAST: see the PSI-BLAST step below
2) FASTA
% fasta [input_file] [database] > result
3) SSEARCH
% ssearch [input_file] [database] > result

Step 3: Profile-based methods

PSI-BLAST
% blastpgp -i [input_file] -d [database] -j [iterations] -o output_file
(1st round: normal BLAST search)

HMMER
% hmmsearch shfs.hmm [database] > result_file
where shfs.hmm is the HMM profile for the SH2 domain

27 HIT proteins / uridylyltransferases

The Histidine Triad (HIT) motif, His-phi-His-phi-His-phi-phi (phi, a hydrophobic amino acid), was identified as being highly conserved in a variety of species. Proteins in the HIT superfamily are conserved as nucleotide-binding proteins, and are structurally related to a family of enzymes that includes GalT, a uridylyltransferase. This relationship was first revealed by structural analysis, but may also be detected using PSI-BLAST.

Relationship of ATP- and NAD-dependent DNA ligases

ATP-dependent DNA ligases: Eukarya, Archaea
NAD-dependent DNA ligases: Eubacteria

Previously these enzymes were believed to be evolutionarily unrelated, but PSI-BLAST provides evidence that they are related.
