Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se
Outline Protein features motifs patterns profiles signals 2
Protein Principles
Proteins reflect millions of years of evolution.
Most proteins belong to large evolutionary families.
3D structure is better conserved than sequence during evolution.
Similarities between sequences or between structures may reveal information about shared biological functions of a protein family.
How can we determine the function of an uncharacterized protein sequence? MGENDPPAVEAPFSFRSLFGLDDLKISPVAPDADAVAA QILSLLPLKFFPIIVIGIIALILALAIGLGIHFDCSGK YRCRSSFKCIELIARCDGVSDCKDGEDEYRCVRVGGQN AVLQVFTAASWKTMCSDDWKGHYANVACAQLGFPSYVS SDNLRVSSLEGQFREEFVSIDHLLPDDKVTALHHSVYV REGCASGHVVTLQCTACGHRRGYSSRIVGGNMSLLSQW PWQASLQFQGYHLCGGSVITPLWIITAAHCVYDLYLPK SWTIQVGLVSLLDNPAPSHLVEKIVYHSKYKPKRLGND IALMKLAGPLTFNEMIQPVCLPNSEENFPDGKVCWTSG WGATEDGAGDASPVLNHAAVPLISNKICNHRDVYGGII SPSMLCAGYLTGGVDSCQGDSGGPLVCQERRLWKLVGA TSFGIGCAEVNKPGVYTRVTSFLDWIHEQMERDLKT 4
Paradigm Similar sequence - similar structure - similar function 5
Multiple Sequence Alignment 6
Definitions
Motif: a conserved region of a protein or DNA sequence. Motifs often contain important features.
Pattern: a qualitative description of a motif, based on a regular expression-like syntax.
Profile: a quantitative description of a motif, using a weight-matrix syntax.
Hidden Markov Model: a quantitative description based on states, with a different weight matrix per state.
Conserved motifs/patterns/profiles/domains Consensus methods multiple sequence alignments -> consensus sequence increase sensitivity and efficiency reduced evolutionary noise align unknown to library of consensus sequences Reduces to machine learning 8
Function from sequence MGENDPPAVEAPFSFRSLFGLDD LKISPVAPDADAVAAQILSLLPL KFFPIIVIGIIALILALAIGLGI HFDCSGKYRCRSSFKCIELIARC DGVSDCKDGEDEYRCVRVGGQNA VLQVFTAASWKTMCSDDWKGHYA NVACAQLGFPSYVSSDNLRVSSL EGQFREEFVSIDHLLPDDKVTAL HHSVYVREGCASGHVVTLQCTAC GHRRGYSSRIVGGNMSLLSQWPW QASLQFQGYHLCGGSVITPLWII TAAHCVYDLYLPKSWTIQVGLVS LLDNPAPSHLVEKIVYHSKYKPK RLGNDIALMKLAGPLTFNEMIQP VCLPNSEENFPDGKVCWTSGWGA TEDGAGDASPVLNHAAVPLISNK ICNHRDVYGGIISPSMLCAGYLT GGVDSCQGDSGGPLVCQERRLWK LVGATSFGIGCAEVNKPGVYTRV TSFLDWIHEQMERDLKT Sequence similarity (homology) Conserved domains profiles Hidden Markov Models Motifs / Fingerprints Functional sites 9
Sequence Similarity Global or Local Similarity Search BLAST, PSI-BLAST Alignments that cover most of the sequence Sequence divergence -> function divergence? 10
Conserved domains
If no homologs are found, look for conserved domains.
Domains are structurally and functionally distinct units, and also units of evolution.
"Independently folding structural unit" is a common definition of a protein domain, but it very much falls into the "I know it when I see it" class of definition.
Motifs/Fingerprints Single motif regular expression Prosite Single motif permissive expression emotif Multiple motif methods PRINTS BLOCKS 12
Prosite
Prosite helps determine the function of an uncharacterized protein and the known protein family to which it belongs.
A pattern describes a group of amino acids that constitutes a usually short but characteristic motif within a protein sequence.
For example, the pattern [AC] - x - V - x(4) - {ED}. is interpreted as: [Ala or Cys] - any - Val - any-any-any-any - {any but Glu or Asp}.
Prosite Syntax
Amino acids are written in the standard one-letter code.
`x' : any amino acid.
`[ ]' : residues allowed at the position.
`{ }' : residues forbidden at the position.
`( )' : repetition of a pattern element, given in parentheses: x(n) for an exact count or x(n,m) for a range.
`-' : separates each pattern element.
`<' : indicates an N-terminal restriction of the pattern.
`>' : indicates a C-terminal restriction of the pattern.
`.' : the period ends the pattern.
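As a rough illustration of this syntax, the sketch below translates a Prosite pattern into a Python regular expression. The function name prosite_to_regex is hypothetical, and the translation is an assumption-laden simplification: it covers only x, [ ], { }, repetition counts, and the < and > terminal restrictions, and ignores rarer forms such as > inside square brackets.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic Prosite pattern into a Python regex string (sketch)."""
    # Drop whitespace and the terminating period.
    pattern = "".join(pattern.split()).rstrip(".")
    parts = []
    for element in pattern.split("-"):
        if not element:
            continue
        prefix = suffix = ""
        if element.startswith("<"):   # N-terminal restriction
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # C-terminal restriction
            suffix, element = "$", element[:-1]
        # Split "x(2,4)" into the core "x" and the repetition bounds 2 and 4.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core.lower() == "x":
            core = "."                          # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # forbidden residues
        # "[...]" and single letters are already valid regex syntax.
        if low is None:
            repeat = ""
        elif high is None:
            repeat = f"{{{low}}}"
        else:
            repeat = f"{{{low},{high}}}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


regex = prosite_to_regex("[AC] - x - V - x(4) - {ED}.")
print(regex)
```

The resulting string can be passed to re.search to test a sequence; a dedicated parser would be preferable for the full Prosite grammar.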
Prosite Patterns
Consensus sequences and patterns are regular expressions that can be used like fingerprints.
E.g. the PROSITE pattern PS00001 (N-glycosylation): N-{P}-[ST]-{P}
MGENDPPAVEAPFSFRSLFGLDDLKISPVAPDADAVAAQILSLLPLKFFPIIVIGIIALIL
ALAIGLGIHFDCSGKYRCRSSFKCIELIARCDGVSDCKDGEDEYRCVRVGGQNAVLQVFTA
ASWKTMCSDDWKGHYANVACAQLGFPSYVSSDNLRVSSLEGQFREEFVSIDHLLPDDKVTA
LHHSVYVREGCASGHVVTLQCTACGHRRGYSSRIVGGNMSLLSQWPWQASLQFQGYHLCGG
SVITPLWIITAAHCVYDLYLPKSWTIQVGLVSLLDNPAPSHLVEKIVYHSKYKPKRLGNDI
ALMKLAGPLTFNEMIQPVCLPNSEENFPDGKVCWTSGWGATEDGAGDASPVLNHAAVPLIS
NKICNHRDVYGGIISPSMLCAGYLTGGVDSCQGDSGGPLVCQERRLWKLVGATSFGIGCAE
VNKPGVYTRVTSFLDWIHEQMERDLKT
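Scanning a sequence for the N-glycosylation pattern N-{P}-[ST]-{P} might look like the minimal sketch below. The regular expression is a hand translation of the pattern, the function name find_n_glyc_sites is hypothetical, and a lookahead is used so that overlapping candidate sites are all reported. A match only marks a candidate site; it says nothing about whether the site is glycosylated in vivo.

```python
import re

# N-{P}-[ST]-{P}: Asn, anything but Pro, Ser or Thr, anything but Pro.
# The lookahead (?=...) lets overlapping matches be found.
N_GLYC = re.compile(r"(?=(N[^P][ST][^P]))")


def find_n_glyc_sites(sequence):
    """Return (1-based position, matched residues) for each candidate site."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYC.finditer(sequence)]


# Toy sequence (not from the slides): one valid site, one blocked by Pro.
print(find_n_glyc_sites("AANGSAANPTA"))
```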
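How to predict the number of false positives?
N(random) = M * p(pattern)
M: number of amino acid residues in the whole database
p(pattern): product of the probabilities of the individual pattern elements, assuming positions are independent
p(x) = 1.0
p([ags]) = f(a) + f(g) + f(s)
p({p}) = 1.0 - f(p)
f(a): background frequency of amino acid a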
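The estimate above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, the uniform background frequencies (0.05 per residue) and the database size of one million residues are assumptions chosen to keep the arithmetic easy to follow, and real databases have non-uniform composition.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: uniform background frequencies; replace with database values.
UNIFORM = {aa: 1.0 / 20 for aa in AMINO_ACIDS}


def element_probability(element, freqs):
    """Probability that a random residue matches one pattern element."""
    if element == "x":
        return 1.0
    if element.startswith("["):          # allowed residues
        return sum(freqs[aa] for aa in element[1:-1])
    if element.startswith("{"):          # forbidden residues
        return 1.0 - sum(freqs[aa] for aa in element[1:-1])
    return freqs[element]                # a single fixed residue


def expected_random_matches(elements, freqs, db_residues):
    """N(random) = M * p(pattern), treating positions as independent."""
    p_pattern = 1.0
    for element in elements:
        p_pattern *= element_probability(element, freqs)
    return db_residues * p_pattern


# N-{P}-[ST]-{P} in a hypothetical database of 1,000,000 residues.
n_random = expected_random_matches(["N", "{P}", "[ST]", "{P}"], UNIFORM, 1_000_000)
print(round(n_random, 1))
```

A short pattern like this one is expected to match thousands of times by chance, which is why such hits need additional evidence.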
Prosite Patterns
Advantages
  Relatively straightforward and fast
  Intuitive to read and understand
  Databases with large numbers of patterns are available
Disadvantages
  Patterns are a qualitative description and lose information about the relative frequency of each residue at each position, e.g. [GAV] versus 0.6 G, 0.28 A, and 0.12 V
  Complex motifs can be difficult to write as regular expressions
  Cannot represent subtle sequence motifs
Permissive Patterns
Prosite patterns are sometimes too strict: one mismatch is enough to fail.
emotif generalizes the elements of the regular expression:
[G] -> [G]
[MV] -> [ILMV]
[QNS] -> [x]
Profile
A profile is a position-dependent scoring matrix that gives a quantitative description of a sequence motif.
For protein sequences, the scoring matrix has N rows and 20+ columns, N being the length of the profile (number of amino acid positions).
The first 20 columns of each row specify the probability of finding, at that position in the sequence, each of the 20 amino acids.
The columns after the first 20 contain penalties for insertions/deletions at that position in the target sequence.
Profile matrix
Rows: sequence profile position, k. Columns: amino acid j and gap penalties.
$$M_{kj} = \log\left(\frac{p_{kj}}{p_j}\right)$$
$p_{kj}$ = probability of amino acid j at position k in the profile
$p_j$ = background probability of amino acid j in sequence
Calculating the profile matrix
Use the frequency of each amino acid at each sequence position.
Build upon an empirically determined matrix of amino acid substitutions, e.g. BLOSUM or PAM (to be able to handle unseen amino acids and still give unequal weight to amino acids with different biochemical characteristics).
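A minimal sketch of the log-odds calculation, under stated assumptions: the alignment is ungapped, the background is uniform, and unseen amino acids are handled with a simple background-weighted pseudocount rather than the BLOSUM/PAM-based weighting mentioned above. The names build_profile and score_segment are hypothetical, and gap penalty columns are left out.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, background=None, pseudocount=1.0):
    """Return one {amino acid: log(p_kj / p_j)} dict per alignment column."""
    if background is None:
        # Assumption: uniform background; use database frequencies in practice.
        background = {aa: 1.0 / 20 for aa in AMINO_ACIDS}
    n_seqs = len(alignment)
    profile = []
    for k in range(len(alignment[0])):
        counts = Counter(seq[k] for seq in alignment)
        row = {}
        for aa in AMINO_ACIDS:
            # Pseudocount spread according to the background, so p_kj sums to 1.
            p_kj = (counts[aa] + pseudocount * background[aa]) / (n_seqs + pseudocount)
            row[aa] = math.log(p_kj / background[aa])
        profile.append(row)
    return profile


def score_segment(profile, segment):
    """Sum the position-specific log-odds scores for an ungapped segment."""
    return sum(row[aa] for row, aa in zip(profile, segment))


# Toy alignment (not from the slides).
profile = build_profile(["GAV", "GAV", "GSV", "GAI"])
print(score_segment(profile, "GAV") > score_segment(profile, "WWW"))
```

Positive scores mark residues that are enriched at a position relative to background; negative scores mark residues that are depleted.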
Visualizing a profile: sequence logo 22
Pfam
The Pfam database contains information about protein domains and families.
For each entry, a protein sequence alignment and a Hidden Markov Model (HMM) are stored.
These HMMs can be used to search sequence databases with the HMMER package written by Sean Eddy.
74% of protein sequences have at least one match to Pfam. This number is called the sequence coverage.
Hidden Markov Models More advanced probabilistic method: Different states, with different probabilities of each amino acid in the different states Transition probabilities between states 24
HMM, an example: 5' splice site recognition
The HMM invokes three states, one for each of the three labels we might assign to a nucleotide: E (exon), 5 (5'SS) and I (intron).
Each state has its own emission probabilities (shown above the states), which model the base composition of exons, introns and the consensus G at the 5'SS.
Each state also has transition probabilities (arrows), the probabilities of moving from this state to a new state.
The transition probabilities describe the linear order in which we expect the states to occur: one or more Es, one 5, one or more Is.
Eddy, Nature Biotech 2004
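To make the three-state model concrete, the sketch below finds the most probable state path for a DNA sequence with the Viterbi algorithm. All numbers are illustrative assumptions chosen to match the description above (uniform exon composition, AT-rich intron, a strong preference for G at the 5'SS); they should not be read as fitted parameters. The function name viterbi and the table names are hypothetical.

```python
import math

STATES = ["E", "5", "I"]
# Assumed, illustrative parameters for the toy model.
START = {"E": 1.0, "5": 0.0, "I": 0.0}
TRANS = {
    "E": {"E": 0.9, "5": 0.1, "I": 0.0},
    "5": {"E": 0.0, "5": 0.0, "I": 1.0},
    "I": {"E": 0.0, "5": 0.0, "I": 0.9},
}
END = {"E": 0.0, "5": 0.0, "I": 0.1}
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return (best state path, its log probability) for a DNA sequence."""
    # score[i][s]: best log probability of any path ending in state s at i.
    score = [{s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}]
    back = []
    for base in seq[1:]:
        prev, row, ptr = score[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + log(TRANS[r][s]))
            row[s] = prev[best] + log(TRANS[best][s]) + log(EMIT[s][base])
            ptr[s] = best
        score.append(row)
        back.append(ptr)
    last = max(STATES, key=lambda s: score[-1][s] + log(END[s]))
    best_logp = score[-1][last] + log(END[last])
    # Trace the back-pointers from the final state to the first position.
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), best_logp


path, logp = viterbi("CCGTT")
print(path)
```

In this toy input the only G is the only position where the 5 state can emit with non-zero probability, so the path is forced; longer sequences with several candidate Gs are where the probabilities matter.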
InterPro
Unites many resources for protein characterization: Prosite, Pfam, SMART
Linked sites: PANTHER, TIGRFAMs, Gene3D
Prediction of signal peptides A signal peptide is a short (3-60 amino acids long) peptide chain that directs the post-translational transport of a protein. Signal peptides may also be called targeting signals, signal sequences, transit peptides, or localization signals. 27
Prediction of cleavage site and localization SignalP predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. TargetP 1.1 predicts the subcellular location of eukaryotic proteins. The location assignment is based on the predicted presence of any of the N-terminal presequences: chloroplast transit peptide (ctp), mitochondrial targeting peptide (mtp) or secretory pathway signal peptide (SP). The method incorporates a prediction based on a combination of several artificial neural networks and HMMs. 28
Never trust a server blindly Always do control experiments: Positive controls: submit sequences for which you know the right answer. Negative controls: random or shuffled sequences. 29
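A shuffled negative control can be generated as in the sketch below. Shuffling preserves amino acid composition while destroying the residue order, so any motif or signal a server reports for the shuffled sequence is likely a false positive. The function name shuffled_control and the fixed seed are illustrative choices, not part of any particular server's workflow.

```python
import random


def shuffled_control(sequence, seed=0):
    """Return a shuffled copy of the sequence with identical composition."""
    residues = list(sequence)
    # A seeded generator makes the control reproducible between runs.
    random.Random(seed).shuffle(residues)
    return "".join(residues)


query = "MGENDPPAVEAPFSFRSLFGLDD"
control = shuffled_control(query)
print(control)
```

Submitting several controls generated with different seeds gives a rough feel for how often a method reports hits by chance.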