Protein bioinformatics. Åsa Björklund CMB/LICR

Protein bioinformatics. Åsa Björklund, CMB/LICR, asa.bjorklund@licr.ki.se

In this lecture: protein structures and 3D structure prediction; protein domains; HMMs; protein networks; protein function annotation / prediction.

[Figure: from gene to protein complex. Gene (ATGATCATGGTTACAGGT), mRNA (AUGAUCAUGGUUACAGGU), polypeptide (MAHRKYLI), folding, 3D protein, protein complex. Other labels: mutations, selection, PTM, localization, degradation.] Structure is more conserved than sequence!

This lecture: protein structures and structure prediction; protein domains; motifs, profiles and HMMs; some servers; protein networks; function prediction.

Protein sequence databases. UniProtKB: Swiss-Prot (quality), PIR-PSD, TrEMBL (quantity); UniMES (metagenomics samples). Entrez Protein (NCBI): coding regions from GenBank and Swiss-Prot, PIR, PDB, etc.; RefSeq. Non-redundant databases: UniRef100, UniRef90 etc.; NCBI nr.

Protein structure databases (Gutmanas et al. 2013). 1. Soft X-ray tomogram of a fission yeast cell. 2. Electron tomogram of ribosomes in the cytosol. 3. Cryo-EM reconstruction of the 80S ribosome from yeast. 4. Crystal structure of the 50S ribosomal subunit (PDB entry 3uzk). 5. Crystal structure revealing how tmRNA and the small protein SmpB enable the kirromycin-stalled 70S ribosome to proceed with translation (PDB entries 4abr and 4abs).

PDB: Protein Data Bank, http://www.rcsb.org/pdb. Main database of three-dimensional protein structures. Also contains structures of other macromolecules (DNA, RNA, carbohydrates). PDB entry format resembles EMBL. Currently (May 2013) 90,424 entries.

Viewing protein structures. Viewers: PyMOL, Jmol, RasMol. Download the PDB file; open it in the viewer; select, color, rotate.
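Before opening a file in a viewer it can help to see what a PDB entry actually stores. The sketch below reads the fixed-column ATOM record (atom name, residue, chain, residue number, x/y/z) and measures a distance between two atoms. The helper names and the two atoms are made up for illustration; the column positions follow the standard PDB ATOM layout.

```python
import math


def make_atom_line(serial, name, res, chain, resseq, x, y, z):
    """Build a PDB-style ATOM line (fixed columns) for a toy example."""
    return (
        f"{'ATOM':<6}{serial:>5} {name:<4} {res:>3} {chain}"
        f"{resseq:>4}    {x:8.3f}{y:8.3f}{z:8.3f}"
    )


def parse_atom_line(line):
    """Extract a few fields from an ATOM line by column position."""
    return {
        "name": line[12:16].strip(),    # atom name, columns 13-16
        "res": line[17:20].strip(),     # residue name, columns 18-20
        "chain": line[21],              # chain identifier, column 22
        "resseq": int(line[22:26]),     # residue number, columns 23-26
        "xyz": (
            float(line[30:38]),         # x, columns 31-38
            float(line[38:46]),         # y, columns 39-46
            float(line[46:54]),         # z, columns 47-54
        ),
    }


# Two invented C-alpha atoms, not taken from any real entry.
line_a = make_atom_line(1, " CA ", "MET", "A", 1, 0.0, 0.0, 0.0)
line_b = make_atom_line(2, " CA ", "ALA", "A", 2, 3.0, 4.0, 0.0)

atom_a = parse_atom_line(line_a)
atom_b = parse_atom_line(line_b)
distance = math.dist(atom_a["xyz"], atom_b["xyz"])
print(atom_a["res"], atom_b["res"], distance)
```

For real work a dedicated parser is a better choice than slicing columns by hand, since real files also contain HETATM records, alternate locations and multiple models.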

Structure prediction. De novo folding: molecular dynamics; based on energy minimization, depends on starting structure; works well for small peptides, not feasible for very large proteins; Folding@home. Homology modelling: template from homologous structure(s); refinement based on energy minimization; SWISS-MODEL, 3D-Jigsaw. Consensus methods: combine predictions from several servers; Pcons.net.

Docking. Predict interactions between proteins, or between protein and ligand. Most proteins change conformation upon binding. Ligand docking is commonly used in drug design. HADDOCK, PatchDock, ClusPro.

Protein domains

Protein domains. Defined as an independent folding unit or an independently evolving unit. Each domain has a characteristic structure and/or function conserved in evolution. Domains are often combined to create multi-domain proteins. Often grouped into families and superfamilies. Known domains in ~80% of all proteins, covering ~58% of the residues.

Protein domains

Protein Structure Classification Databases (ordered from quality to quantity). SCOP: all manual. CATH: semi-automatic. FSSP: all automatic. ENTREZ: all automatic.

Structural Classification Of Proteins (SCOP), http://scop.mrc-lmb.cam.ac.uk/scop/. Attempt to classify all proteins in PDB according to structural and evolutionary relationships. Hierarchical classification system. Classification based on human expertise. Superfamily contains HMMs for all SCOP families (http://supfam.cs.bris.ac.uk/).

SCOP hierarchy: Class > Fold > Superfamily > Family. Class: main secondary structure elements, e.g. all-alpha, all-beta, alpha/beta.

SCOP, Fold: arrangements of secondary structure elements, e.g. globin-like, prion-like, alpha-beta knot.

SCOP, Superfamily: low sequence similarity but conserved structure and/or function.

SCOP, Family: significant sequence similarity and similar structure and function.

Domains based on sequence conservation. Pfam (pfam.sanger.ac.uk): Pfam-A, manually curated; Pfam-B, automated clustering; Pfam clans group families into superfamilies. SMART (smart.embl-heidelberg.de): manually curated, specializes in signalling, extracellular and chromatin-associated proteins. ProDom (prodom.prabi.fr): automated clustering.

Other motifs: DNA/RNA/protein binding motifs; transmembrane helices (TMH); signal peptides; post-translational modification (PTM) signals; secondary structure; disordered regions.

Hidden Markov Models (HMMs). Different states, with different probabilities of each symbol (aa or nt) at each state. Transition probabilities between states. Insert states. Silent states.
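To make "states, emission probabilities and transition probabilities" concrete, here is a minimal sketch of the forward algorithm on a toy two-state HMM. The states ("helix", "loop"), the two-letter alphabet (H = hydrophobic, P = polar) and all probabilities are invented for illustration; insert and silent states are left out.

```python
# Toy HMM: every number below is an assumption chosen for illustration.
STATES = ("helix", "loop")
START = {"helix": 0.5, "loop": 0.5}
TRANS = {
    "helix": {"helix": 0.9, "loop": 0.1},
    "loop": {"helix": 0.2, "loop": 0.8},
}
EMIT = {
    "helix": {"H": 0.8, "P": 0.2},
    "loop": {"H": 0.3, "P": 0.7},
}


def forward_probability(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Probability of a non-empty observed sequence, summed over all state paths."""
    # Initialisation: start in each state and emit the first symbol.
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    # Recursion: move to each state from any previous state, then emit.
    for symbol in obs[1:]:
        f = {
            s: emit[s][symbol] * sum(f[r] * trans[r][s] for r in states)
            for s in states
        }
    # Termination: any final state is allowed.
    return sum(f.values())


print(forward_probability("HHHPP"))
```

A profile HMM as used for protein domains has one match state per alignment column plus insert and delete states, but the same recursion applies.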

HMM models: key concepts - No magic involved: just an extension of the profile - Enables modelling of deletions and insertions - Very useful for protein domains, HMMs for many different domain databases such as SCOP, Pfam etc. are available for download or web-based searches - Common programs to build HMMs from MSAs and scan sequence databases for matches are HMMER and SAM.

HMMER. Program package by Sean Eddy (hmmer.janelia.org). Create HMMs from a Multiple Sequence Alignment (MSA). Run searches with HMMs, e.g. Pfam. Requires a few unix commands. Has a webserver for homology searches.
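The "few unix commands" are essentially hmmbuild (MSA to HMM) and hmmscan or hmmsearch (HMM against sequences). A hedged sketch of how they could be driven from Python follows. All file names are hypothetical, the wrapper functions are not part of HMMER, and the commands are only run if the programs are installed. The HMM database passed to hmmscan is assumed to have been prepared with hmmpress.

```python
import shutil
import subprocess


def hmmbuild_command(hmm_out, msa_file):
    """Command line to build a profile HMM from a multiple sequence alignment."""
    return ["hmmbuild", hmm_out, msa_file]


def hmmscan_command(hmm_db, query_fasta, tblout, evalue=1e-5):
    """Command line to scan query sequences against an HMM database."""
    return ["hmmscan", "--tblout", tblout, "-E", str(evalue), hmm_db, query_fasta]


def run_if_available(cmd):
    """Run the command only when the program is on PATH; otherwise return None."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# Hypothetical file names, for illustration only.
build = hmmbuild_command("my_domain.hmm", "my_domain_alignment.sto")
scan = hmmscan_command("Pfam-A.hmm", "query.fasta", "hits.tbl")
print(" ".join(build))
print(" ".join(scan))
```

Call `run_if_available(scan)` to execute the search once the input files exist.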

Membrane protein topology prediction. Predict transmembrane helices (TMH) based on hydrophobicity profile. A TMH is normally about 20 aa. Reentrant regions can create mispredictions. The positive-inside rule guides the direction. Often includes homology information. Most methods use HMMs for prediction.
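The hydrophobicity profile itself is easy to compute. The sketch below averages the Kyte-Doolittle hydropathy scale over a sliding window and reports windows above a cutoff. The window of 19 and cutoff of 1.6 are conventional choices assumed here, and the test sequence is synthetic. This is the simplest possible approach, not what HMM-based topology predictors do.

```python
# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KD[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions (0-based) of windows whose mean hydropathy passes the cutoff."""
    profile = hydropathy_profile(seq, window)
    return [i for i, value in enumerate(profile) if value >= threshold]


# Synthetic sequence: a 20-residue hydrophobic stretch flanked by lysines.
toy_seq = "K" * 20 + "L" * 20 + "K" * 20
hits = candidate_tm_windows(toy_seq)
print(hits[0], hits[-1])
```

The positive-inside rule and reentrant regions mentioned on the slide are not modelled here; a hydropathy peak only says where a helix might be, not which way it points.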

Membrane protein topology prediction: http://topcons.cbr.su.se/

Prediction of signal peptides. A signal peptide is a short (3-60 amino acids long) peptide chain that directs the post-translational transport of a protein. Signal peptides may also be called targeting signals, signal sequences, transit peptides, or localization signals.

Prediction of cleavage site and localization. SignalP predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes and eukaryotes. TargetP predicts the subcellular location of eukaryotic proteins. The location assignment is based on the predicted presence of any of the N-terminal presequences: chloroplast transit peptide (cTP), mitochondrial targeting peptide (mTP) or secretory pathway signal peptide (SP). The methods combine predictions from several artificial neural networks and HMMs.

InterPro. Integrates domain/motif predictions from several databases. ProDom: sequence clusters built from UniProtKB using PSI-BLAST. PROSITE patterns: simple regular expressions. PROSITE and HAMAP profiles: sequence matrices. PRINTS fingerprints: unweighted Position Specific Sequence Matrices (PSSMs). PANTHER, PIRSF, Pfam, SMART, TIGRFAMs, Gene3D and SUPERFAMILY: hidden Markov models (HMMs). TMHMM, SignalP.
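Since PROSITE patterns are simple regular expressions, they can be translated almost mechanically. The sketch below converts the basic PROSITE syntax (x, [..], {..}, (n) or (n,m), < and >) into a Python regex. The function name is an assumption, and rarer constructs such as ">" inside square brackets are not handled. The example is the well-known N-glycosylation site pattern N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("(", "{").replace(")", "}")   # (2,4) -> {2,4}
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


glyco = prosite_to_regex("N-{P}-[ST]-{P}")
print(glyco)

# Invented test sequences.
for seq in ("AANGSA", "AANPSA"):
    match = re.search(glyco, seq)
    print(seq, match.start() if match else None)
```

Short patterns like this one match by chance very often, which is one reason InterPro also carries profiles and HMMs.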

Protein interaction networks. Yeast PPI (http://www.bordalierinstitute.com/). Connectivity/degree = number of interaction partners. Hubs = highly connected proteins. Scale-free topology: most genes have low connectivity, few have high. Mainly yeast two-hybrid and tandem affinity purifications.
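Degree and hubs are straightforward to compute from a list of interactions. A minimal sketch follows. The protein names and edges are invented, and the hub cutoff is an arbitrary assumption, since there is no universal threshold.

```python
from collections import defaultdict


def degrees(edges):
    """Number of distinct interaction partners per protein (self-interactions ignored)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return {protein: len(partners) for protein, partners in neighbours.items()}


def hubs(edges, min_degree):
    """Proteins with at least min_degree partners, sorted by name."""
    return sorted(p for p, d in degrees(edges).items() if d >= min_degree)


# Hypothetical interaction list.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
print(degrees(toy_edges))
print(hubs(toy_edges, min_degree=3))
```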

Protein networks. Protein-protein interactions (PPI): IntAct (www.ebi.ac.uk/intact), DIP (dip.doe-mbi.ucla.edu/). Pathways: KEGG (www.genome.jp/kegg/), Biocarta (www.biocarta.com), Reactome (www.reactome.org/). Regulatory networks.

Predicting the function of a gene. Expression pattern: from RNA-seq or microarrays; same expression pattern -> involved in the same pathway or has similar function. Homology: interactions are often conserved between species.
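"Same expression pattern" is usually quantified with a correlation between expression profiles. A minimal sketch using the Pearson correlation follows. The gene names and expression values are invented, and in practice the choice of normalisation and of a cutoff matters a great deal.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (sd_x * sd_y)


# Hypothetical expression values over four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.1, 3.9, 6.2, 8.0]   # rises with gene_a
gene_c = [8.0, 6.0, 4.0, 2.0]   # falls as gene_a rises

print(pearson(gene_a, gene_b))
print(pearson(gene_a, gene_c))
```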

Predicting the function of a gene. Phylogenetic profiles: same conservation pattern -> involved in same pathways.
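A phylogenetic profile can be written as a string of 1s and 0s, one position per genome, marking presence or absence of a homolog. The sketch below compares two profiles with the Jaccard index. The profiles are invented, and Jaccard is only one of several similarity measures that could be used here.

```python
def jaccard(profile_a, profile_b):
    """Shared presences divided by genomes where at least one gene is present."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genomes")
    both = sum(a == "1" and b == "1" for a, b in zip(profile_a, profile_b))
    either = sum(a == "1" or b == "1" for a, b in zip(profile_a, profile_b))
    return both / either if either else 0.0


# Hypothetical presence/absence across six genomes.
gene_x = "110101"
gene_y = "110101"   # identical pattern to gene_x
gene_z = "001010"   # complementary pattern

print(jaccard(gene_x, gene_y))
print(jaccard(gene_x, gene_z))
```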

Predicting the function of a gene. Gene fusions (Rosetta stone theory): fusion of two proteins that interact can be convenient since their expression can be co-regulated.

Predicting the function of a gene. Genomic context: adjacent genes may be regulated together (operons in bacteria).

Predicting the function of a gene. Automated literature mining: genes that are often found in the same abstracts are more likely to interact or have related functions. Genetic interaction: genes with similar knock-down phenotypes or rescuing phenotypes are likely to have similar functions.

STRING, http://string.embl.de. A database that combines different predictions of functional links. Includes experimental databases such as DIP, BIND, KEGG, Biocarta etc. Bayesian statistics with weighting of the different data sources and validation against known interactions.
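One simple way to combine several lines of evidence into a single confidence is to treat each score as an independent probability that the link is real, so that the link is missed only if every source misses it. The sketch below shows that idea only. It is an assumption-laden illustration, not STRING's actual scoring procedure, which also involves calibration of each source against known interactions. The channel names and scores are invented.

```python
def combine_scores(scores):
    """Combine independent per-source probabilities: 1 - product(1 - s_i)."""
    missed_by_all = 1.0
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must be probabilities between 0 and 1")
        missed_by_all *= 1.0 - s
    return 1.0 - missed_by_all


# Hypothetical evidence for one protein pair.
evidence = {"coexpression": 0.5, "gene_fusion": 0.5, "text_mining": 0.2}
print(combine_scores(evidence.values()))
```

Two moderate scores reinforce each other (0.5 and 0.5 give 0.75), which matches the intuition that agreement between independent methods is worth more than any single prediction.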

STRING, http://string.embl.de

Gene ontology (GO). Functional classification of genes/proteins. All different databases and annotators use different definitions of the same function => creates a problem in bioinformatics. The GO Consortium gives standardized annotations to genes and proteins with relationships between terms.

Gene ontology (GO). Divided into three categories (examples for cytochrome c): molecular function (electron transporter activity), cellular component (mitochondrial matrix), biological process (oxidative phosphorylation). Uses Directed Acyclic Graphs (DAGs). Evidence codes, e.g. TAS (Traceable author statement) or IEP (Inferred from expression pattern).
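The point of a DAG, as opposed to a tree, is that a term can have more than one parent, and an annotation to a term implies annotation to all of its ancestors. A minimal sketch of collecting those ancestors follows. The term labels and their relationships are placeholders invented for illustration, not real GO terms.

```python
def ancestors(term, parents):
    """All terms reachable by following parent links upward from term."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical ontology fragment: "specific_term" has two parents.
toy_parents = {
    "specific_term": ["broader_term_a", "broader_term_b"],
    "broader_term_a": ["root"],
    "broader_term_b": ["root"],
    "root": [],
}
print(sorted(ancestors("specific_term", toy_parents)))
```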

Never trust a server blindly! Always do control experiments. Positive controls: submit sequences for which you know the right answer. Negative controls: random or shuffled sequences. Try several different methods and use the consensus.
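A shuffled negative control is easy to generate: it keeps the amino acid composition of the query but destroys the order. A minimal sketch follows. The function name and fixed seed are assumptions, and the example peptide is the toy one from the opening figure.

```python
import random


def shuffled_control(seq, seed=0):
    """Return the sequence with its residues in random order (same composition)."""
    rng = random.Random(seed)   # fixed seed so the control is reproducible
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


query = "MAHRKYLI"
control = shuffled_control(query)
print(query, control)
```

If a server reports the same confident hit for the shuffled control as for the real sequence, the hit is probably driven by composition rather than by a real motif or domain.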