Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se

Outline
Protein features: motifs, patterns, profiles, signals

Protein Principles
Proteins reflect millions of years of evolution.
Most proteins belong to large evolutionary families.
3D structure is better conserved than sequence during evolution.
Similarities between sequences or between structures may reveal information about the shared biological functions of a protein family.

How can we determine the function of an uncharacterized protein sequence?
MGENDPPAVEAPFSFRSLFGLDDLKISPVAPDADAVAA
QILSLLPLKFFPIIVIGIIALILALAIGLGIHFDCSGK
YRCRSSFKCIELIARCDGVSDCKDGEDEYRCVRVGGQN
AVLQVFTAASWKTMCSDDWKGHYANVACAQLGFPSYVS
SDNLRVSSLEGQFREEFVSIDHLLPDDKVTALHHSVYV
REGCASGHVVTLQCTACGHRRGYSSRIVGGNMSLLSQW
PWQASLQFQGYHLCGGSVITPLWIITAAHCVYDLYLPK
SWTIQVGLVSLLDNPAPSHLVEKIVYHSKYKPKRLGND
IALMKLAGPLTFNEMIQPVCLPNSEENFPDGKVCWTSG
WGATEDGAGDASPVLNHAAVPLISNKICNHRDVYGGII
SPSMLCAGYLTGGVDSCQGDSGGPLVCQERRLWKLVGA
TSFGIGCAEVNKPGVYTRVTSFLDWIHEQMERDLKT

Paradigm
Similar sequence -> similar structure -> similar function

Multiple Sequence Alignment

Definitions
Motif: a conserved region of a protein or DNA sequence. Motifs often contain important features.
Pattern: a qualitative description of a motif based on a regular-expression-like syntax.
Profile: a quantitative description of a motif using a weight-matrix syntax.
Hidden Markov Model: a quantitative description in terms of states, with a different weight matrix per state.

Conserved motifs/patterns/profiles/domains
Consensus methods: multiple sequence alignments -> consensus sequence
Increased sensitivity and efficiency
Reduced evolutionary noise
Align the unknown sequence to a library of consensus sequences
Reduces to machine learning

Function from sequence
MGENDPPAVEAPFSFRSLFGLDD
LKISPVAPDADAVAAQILSLLPL
KFFPIIVIGIIALILALAIGLGI
HFDCSGKYRCRSSFKCIELIARC
DGVSDCKDGEDEYRCVRVGGQNA
VLQVFTAASWKTMCSDDWKGHYA
NVACAQLGFPSYVSSDNLRVSSL
EGQFREEFVSIDHLLPDDKVTAL
HHSVYVREGCASGHVVTLQCTAC
GHRRGYSSRIVGGNMSLLSQWPW
QASLQFQGYHLCGGSVITPLWII
TAAHCVYDLYLPKSWTIQVGLVS
LLDNPAPSHLVEKIVYHSKYKPK
RLGNDIALMKLAGPLTFNEMIQP
VCLPNSEENFPDGKVCWTSGWGA
TEDGAGDASPVLNHAAVPLISNK
ICNHRDVYGGIISPSMLCAGYLT
GGVDSCQGDSGGPLVCQERRLWK
LVGATSFGIGCAEVNKPGVYTRV
TSFLDWIHEQMERDLKT
Sequence similarity (homology)
Conserved domains: profiles, Hidden Markov Models
Motifs / Fingerprints
Functional sites

Sequence Similarity
Global or local similarity search: BLAST, PSI-BLAST
Alignments that cover most of the sequence
Sequence divergence -> function divergence?

Conserved domains
If no homologs are found
Domains are structurally and functionally distinct units, and units of evolution.
"Independently folding structural unit" is a common definition of a protein domain, but it very much falls into the "I know it when I see it" class of definition.

Motifs/Fingerprints
Single motif, regular expression: Prosite
Single motif, permissive expression: emotif
Multiple motif methods: PRINTS, BLOCKS

Prosite
Prosite helps determine the function of an uncharacterized protein and the known family of proteins to which it belongs.
A pattern describes a group of amino acids that constitutes a usually short but characteristic motif within a protein sequence.
For example, the pattern [AC]-x-V-x(4)-{ED}. is interpreted as: [Ala or Cys] - any - Val - any-any-any-any - {any but Glu or Asp}.

Prosite Syntax
The standard one-letter code is used for amino acids.
`x' : any amino acid.
`[ ]' : residues allowed at the position.
`{ }' : residues forbidden at the position.
`( )' : repetition of a pattern element is indicated in parentheses; x(n) or x(n,m) gives the number or range of repetitions.
`-' : separates pattern elements.
`<' : indicates an N-terminal restriction of the pattern.
`>' : indicates a C-terminal restriction of the pattern.
`.' : a period ends the pattern.
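The syntax rules above map almost one-to-one onto ordinary regular expressions. The following is a minimal sketch of that translation in Python, assuming only the elements listed on this slide; the function name prosite_to_regex is hypothetical and is not part of any Prosite tool.

```python
import re

# One pattern element: [..], {..} or a single letter, optionally followed by (n) or (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression (sketch)."""
    pattern = re.sub(r"\s+", "", pattern).rstrip(".")  # drop spaces and the final period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):   # N-terminal restriction
            prefix, element = "^", element[1:]
        if element.endswith(">"):     # C-terminal restriction
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: " + element)
        core, low, high = match.groups()
        if core in ("x", "X"):        # any amino acid
            regex = "."
        elif core.startswith("{"):    # forbidden residues
            regex = "[^" + core[1:-1] + "]"
        else:                         # allowed residues or a single residue
            regex = core
        if low:                       # repetition count or range
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + regex + suffix)
    return "".join(parts)

print(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))
```

The resulting string can be passed to re.search to scan a protein sequence. Elements outside the list on this slide are rejected with an error rather than guessed at.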

Prosite Patterns
Consensus sequences and patterns are regular expressions that can be used like fingerprints.
E.g. the PROSITE pattern N-{P}-[ST]-{P} (PS00001: N-glycosylation)
MGENDPPAVEAPFSFRSLFGLDDLKISPVAPDADAVAAQILSLLPLKFFPIIVIGIIALIL
ALAIGLGIHFDCSGKYRCRSSFKCIELIARCDGVSDCKDGEDEYRCVRVGGQNAVLQVFTA
ASWKTMCSDDWKGHYANVACAQLGFPSYVSSDNLRVSSLEGQFREEFVSIDHLLPDDKVTA
LHHSVYVREGCASGHVVTLQCTACGHRRGYSSRIVGGNMSLLSQWPWQASLQFQGYHLCGG
SVITPLWIITAAHCVYDLYLPKSWTIQVGLVSLLDNPAPSHLVEKIVYHSKYKPKRLGNDI
ALMKLAGPLTFNEMIQPVCLPNSEENFPDGKVCWTSGWGATEDGAGDASPVLNHAAVPLIS
NKICNHRDVYGGIISPSMLCAGYLTGGVDSCQGDSGGPLVCQERRLWKLVGATSFGIGCAE
VNKPGVYTRVTSFLDWIHEQMERDLKT
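Scanning a sequence with the N-glycosylation pattern can be sketched with Python's re module. The regular expression is a hand translation of N-{P}-[ST]-{P}. The function name is hypothetical, and the lookahead is a design choice of this sketch so that overlapping sites are also reported.

```python
import re

# N-{P}-[ST]-{P} written as a regular expression; the lookahead allows overlapping hits.
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")

def find_n_glycosylation_sites(sequence):
    """Return (1-based start position, matched residues) for every candidate site."""
    return [(m.start() + 1, m.group(1)) for m in N_GLYCOSYLATION.finditer(sequence)]

# A short fragment taken from the example sequence on this slide.
fragment = "RIVGGNMSLLSQW"
print(find_n_glycosylation_sites(fragment))  # [(6, 'NMSL')]
```

A hit only means that the sequence matches the pattern; whether the site is actually glycosylated is a separate biological question.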

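How to predict the number of false positives?
N(random) = M * p(pattern)
M: number of amino acid residues in the whole database
p(pattern): the product of the probabilities of the individual pattern elements, where f(a) is the background frequency of amino acid a:
p(x) = 1.0
p([AGS]) = f(A) + f(G) + f(S)
p({P}) = 1.0 - f(P)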
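As a worked illustration of N(random) = M * p(pattern), the sketch below multiplies the per-element probabilities for the N-glycosylation pattern. The uniform background frequencies (1/20 per residue) and the database size of one million residues are assumptions chosen for easy arithmetic, and all names in the code are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption for illustration only: every amino acid has background frequency 1/20.
UNIFORM = {a: 1 / 20 for a in AMINO_ACIDS}

def element_probability(kind, residues, freq):
    """Probability that a random residue satisfies one pattern element."""
    if kind == "any":                      # x
        return 1.0
    total = sum(freq[r] for r in residues)
    if kind == "allow":                    # [..]
        return total
    if kind == "forbid":                   # {..}
        return 1.0 - total
    raise ValueError("unknown element kind: " + kind)

def expected_random_hits(elements, database_size, freq=UNIFORM):
    """N(random) = M * p(pattern), with p(pattern) the product over elements."""
    p = 1.0
    for kind, residues in elements:
        p *= element_probability(kind, residues, freq)
    return database_size * p

# N-{P}-[ST]-{P}
N_GLYC = [("allow", "N"), ("forbid", "P"), ("allow", "ST"), ("forbid", "P")]
print(expected_random_hits(N_GLYC, 1_000_000))
```

Under these assumptions p(pattern) = 0.05 * 0.95 * 0.10 * 0.95, so roughly 4,500 chance matches are expected per million residues. This is why such a short pattern yields many false positives.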

Prosite Patterns
Advantages
Relatively straightforward and fast
Intuitive to read and understand
Databases with a large number of patterns are available
Disadvantages
Patterns are a qualitative description and lose information about the relative frequency of each residue at each position, e.g. [GAV] versus 0.6 G, 0.28 A, and 0.12 V
It can be difficult to write complex motifs using regular expressions
Cannot represent subtle sequence motifs

Permissive Patterns
Prosite patterns are sometimes too strict: one mismatch is enough to fail.
emotif generalizes the elements of the regular expression:
[G] -> [G]
[MV] -> [ILMV]
[QNS] -> [x]

Profile
A profile is a position-dependent scoring matrix that gives a quantitative description of a sequence motif.
For protein sequences, the scoring matrix has N rows and 20+ columns, N being the length of the profile (number of amino acids).
The first 20 columns of each row specify the probability of finding, at that position in the sequence, each of the 20 amino acids.
The columns after the first 20 contain penalties for insertions/deletions at that position in the target sequence.

Profile matrix
Rows: sequence profile position, k. Columns: amino acid j, and gap penalties.
$M_{kj} = \log\left(\frac{p_{kj}}{p_j}\right)$
$p_{kj}$ = probability of amino acid j at position k in the profile
$p_j$ = background probability of amino acid j in sequence

Calculating the profile matrix
Use the frequency of each amino acid at each sequence position.
Built upon an empirically determined matrix of amino acid substitutions, e.g. BLOSUM or PAM (to be able to handle unseen amino acids and still give unequal weight to amino acids with different biochemical characteristics).
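A minimal sketch of the frequency-based part of this calculation follows, using log-odds scores of the form log(p_kj / p_j). Note the substitution: instead of a BLOSUM- or PAM-based weighting, it uses a flat pseudocount to handle unseen amino acids, which is a simplification. The uniform background, the toy alignment, and the function names are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, background=None, pseudocount=1.0):
    """Return one {amino acid: log-odds score} row per alignment column.

    Simplification: a flat pseudocount stands in for substitution-matrix weighting.
    """
    if background is None:
        background = {a: 1 / 20 for a in AMINO_ACIDS}  # assumed uniform background
    profile = []
    for k in range(len(alignment[0])):
        # Count residues in column k, ignoring gaps and unknown symbols.
        counts = Counter(seq[k] for seq in alignment if seq[k] in background)
        total = sum(counts.values()) + pseudocount * len(background)
        row = {
            a: math.log(((counts[a] + pseudocount) / total) / background[a])
            for a in AMINO_ACIDS
        }
        profile.append(row)
    return profile

def score_segment(profile, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(row[a] for row, a in zip(profile, segment))

toy_alignment = ["GAV", "GAV", "GCV", "AAV"]  # hypothetical three-column alignment
profile = build_profile(toy_alignment)
print(score_segment(profile, "GAV"), score_segment(profile, "WWW"))
```

Residues that are frequent in a column receive positive scores and unseen residues receive negative ones, so a segment resembling the alignment scores higher than an unrelated one. Gap penalty columns are left out of this sketch.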

Visualizing a profile: sequence logo

Pfam
The Pfam database contains information about protein domains and families. For each entry, a protein sequence alignment and a Hidden Markov Model (HMM) are stored. These HMMs can be used to search sequence databases with the HMMER package written by Sean Eddy.
74% of protein sequences have at least one match to Pfam. This number is called the sequence coverage.

Hidden Markov Models
A more advanced probabilistic method:
Different states, with different probabilities of each amino acid in the different states
Transition probabilities between states

HMM, an example: 5' splice site recognition
The HMM invokes three states, one for each of the three labels we might assign to a nucleotide: E (exon), 5 (5'SS) and I (intron). Each state has its own emission probabilities (shown above the states), which model the base composition of exons, introns and the consensus G at the 5'SS. Each state also has transition probabilities (arrows), the probabilities of moving from this state to a new state. The transition probabilities describe the linear order in which we expect the states to occur: one or more Es, one 5, one or more Is.
Eddy, Nature Biotech 2004
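The three-state model described above can be decoded with the Viterbi algorithm, which finds the most probable state path for a sequence. The sketch below is an illustration only. Because the figure is not reproduced in this transcript, the emission and transition values are the toy numbers commonly quoted for this example and should be treated as assumptions, as should the example sequence and all function names.

```python
import math

STATES = ["E", "5", "I"]

# Assumed toy parameters (not taken from this transcript).
EMIT = {
    "E": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # exon: uniform composition
    "5": {"A": 0.05, "C": 0.0, "G": 0.95, "T": 0.0},    # 5'SS: almost always G
    "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},      # intron: A/T rich
}
START = {"E": 1.0}
TRANS = {"E": {"E": 0.9, "5": 0.1}, "5": {"I": 1.0}, "I": {"I": 0.9}}
END = {"I": 0.1}

def log(p):
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(seq):
    """Return the most probable state path and its log probability."""
    score = {s: log(START.get(s, 0)) + log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for x in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for reaching s at this position.
            prev = max(STATES, key=lambda r: score[r] + log(TRANS[r].get(s, 0)))
            new_score[s] = score[prev] + log(TRANS[prev].get(s, 0)) + log(EMIT[s][x])
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    last = max(STATES, key=lambda s: score[s] + log(END.get(s, 0)))
    best = score[last] + log(END.get(last, 0))
    path = [last]
    for pointers in reversed(back):  # trace the back-pointers to recover the path
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), best

path, log_p = viterbi("CTTCATGTGAAAGCAGACGTAAGTCA")
print(path, round(log_p, 2))
```

With these assumed parameters the best path places the single 5 state on the G at position 19, with exon states before it and intron states after it.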

InterPro
Unites many resources for protein characterization: Prosite, Pfam, SMART
Linked sites: Panther, Tigrfam, Gene3D

Prediction of signal peptides
A signal peptide is a short (3-60 amino acids long) peptide chain that directs the post-translational transport of a protein. Signal peptides may also be called targeting signals, signal sequences, transit peptides, or localization signals.

Prediction of cleavage site and localization
SignalP predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes.
TargetP 1.1 predicts the subcellular location of eukaryotic proteins. The location assignment is based on the predicted presence of any of the N-terminal presequences: chloroplast transit peptide (cTP), mitochondrial targeting peptide (mTP) or secretory pathway signal peptide (SP).
The method incorporates a prediction based on a combination of several artificial neural networks and HMMs.

Never trust a server blindly
Always do control experiments:
Positive controls: submit sequences for which you know the right answer.
Negative controls: random or shuffled sequences.
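One simple way to produce a negative control is to shuffle the query sequence, which destroys motifs while preserving length and amino acid composition. A minimal sketch, assuming Python's standard random module; the function name and the seed parameter are hypothetical conveniences for reproducibility.

```python
import random

def shuffled_control(sequence, seed=None):
    """Return a random permutation of the residues in `sequence`."""
    rng = random.Random(seed)   # a fixed seed makes the control reproducible
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

query = "RIVGGNMSLLSQWPWQASLQ"  # short fragment used only as an example
control = shuffled_control(query, seed=1)
print(control)
```

If a prediction server reports the same confident hit for the shuffled control as for the real sequence, the hit is probably driven by composition and not by a genuine motif.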
