Prediction of protein function from sequence analysis

Prediction of protein function from sequence analysis Rita Casadio BIOCOMPUTING GROUP University of Bologna, Italy

The omic era
Genome Sequencing Projects:
- Archaea: 74 species; in progress: 52
- Bacteria: 973 species; in progress: 2266 species
- Eukaryotic: complete: 23; draft assembly: 318; in progress: 359
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
Update: January 2010

The Databases of Biological Sequences and Structures
- GenBank: 108,431,692 sequences; 106,533,156,756 nucleotides (35,5 HGE!)
- NR(*): 10,381,779 sequences; 3,542,056,219 residues
- SwissProt: 514,212 sequences; 180,900,945 residues
- PDB: 60,654 structures; membrane proteins <2%
(*) CDS translations+pdb+swissprot+pir+prf
Update: January 2009

>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

From Genotype to Phenotype
Genes in DNA...
>protein kinase
acctgttgatggcgacagggactgtatgctgatct
atgctgatgcatgcatgctgactactgatgtgggg
gctattgacttgatgtctatc...
... (about 30,000 in the human genome) code for proteins...
... with different effects depending on variability: over 20 million single mutations are known in genes
... proteins correspond to functions...
From 5000 to 10000 proteins per tissue
Proteins interact in metabolic pathways when they are expressed

STRING 8: a global view on proteins and their functional interactions in 630 organisms. Jensen et al., 2009, Nucleic Acids Research, Vol 37.
The Human Interactome in STRING: 22,937 proteins and 1,482,533 interactions
http://string.embl.de

One problem of the omic era: protein functional annotation

The Protein Data Bank
http://www.rcsb.org/pdb/home/home.do
Number of proteins with known structure: 57529

SCOP: Structural Classification of Proteins
Domains are hierarchically classified:
- class
- fold: proteins with secondary structures in the same arrangement with the same topological connections
- superfamily: structures and functional features suggest a common evolutionary origin
- family: proteins with identities ≥30%, or with identities <30% but with similar structures and functions

From the Protein Sequence to the Structure and Function space Lesk A., 2004

From the Protein Sequence to the Structure space
[Figure: sequence identity (%) axis from 100% down to 0%, against the PDB]
- 100% to 30%: sequence comparison
- 30% to 0%: fold recognition; machine-learning aided alignment; threading
- 0%, new folds: ab initio and de novo modelling
- Machine-learning prediction of structural features

From the Protein Sequence to the Structure and Function space
What is protein function?

What is a function? For enzymes: function can be defined on the basis of the catalysed molecular reaction. e.g. aspartic aminotransferase (AST)

In biochemistry, a transaminase or an aminotransferase is an enzyme that catalyzes a type of reaction between an amino acid and an α-keto acid. Specifically, this reaction (transamination) involves removing the amino group from the amino acid, leaving behind an α-keto acid, and transferring it to the reactant α-keto acid, converting it into an amino acid. The enzymes are important in the production of various amino acids, and measuring the concentrations of various transaminases in the blood is important in diagnosing and tracking many diseases. Transaminases require the coenzyme pyridoxal-phosphate, which is converted into pyridoxamine in the first phase of the reaction, when an amino acid is converted into a keto acid. Enzyme-bound pyridoxamine in turn reacts with pyruvate, oxaloacetate, or alpha-ketoglutarate, giving alanine, aspartic acid, or glutamic acid, respectively. The presence of elevated transaminases can be an indicator of liver damage.

Enzyme Commission (E.C.) classification A hierarchical classification for enzymes

EC 2.6 Transferring nitrogenous groups
EC 2.6.1 Transaminases
EC 2.6.1.1 Aspartate transaminase
Other name(s): glutamic-oxaloacetic transaminase; glutamic-aspartic transaminase; transaminase A; AAT; AspT; 2-oxoglutarate-glutamate aminotransferase; aspartate α-ketoglutarate transaminase; aspartate aminotransferase; aspartate-2-oxoglutarate transaminase; aspartic acid aminotransferase; aspartic aminotransferase; aspartyl aminotransferase; AST; glutamate-oxalacetate aminotransferase; glutamate-oxalate transaminase; glutamic-aspartic aminotransferase; glutamic-oxalacetic transaminase; glutamic oxalic transaminase; GOT (enzyme); L-aspartate transaminase; L-aspartate-α-ketoglutarate transaminase; L-aspartate-2-ketoglutarate aminotransferase; L-aspartate-2-oxoglutarate aminotransferase; L-aspartate-2-oxoglutarate-transaminase; L-aspartic aminotransferase; oxaloacetate-aspartate aminotransferase; oxaloacetate transferase; aspartate:2-oxoglutarate aminotransferase; glutamate oxaloacetate transaminase
Systematic name: L-aspartate:2-oxoglutarate aminotransferase
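The hierarchy in the EC number above can be made explicit with a small sketch. The field names (enzyme_class, subclass, sub_subclass, serial) and the helper parse_ec are assumptions chosen for illustration, not part of any official tool; the only data used is the EC 2.6.1.1 entry from the slide.

```python
from typing import List, NamedTuple


class ECNumber(NamedTuple):
    """Four-level EC number; field names are illustrative assumptions."""
    enzyme_class: int
    subclass: int
    sub_subclass: int
    serial: int

    def lineage(self) -> List[str]:
        # Return every level of the hierarchy, from the broadest to the full number.
        parts = [str(p) for p in self]
        return ["EC " + ".".join(parts[:i]) for i in range(1, 5)]


def parse_ec(text: str) -> ECNumber:
    """Parse a fully specified EC number such as 'EC 2.6.1.1'."""
    digits = text.replace("EC", "").strip().split(".")
    if len(digits) != 4:
        raise ValueError(f"expected four levels, got {text!r}")
    return ECNumber(*(int(d) for d in digits))


ast_ec = parse_ec("EC 2.6.1.1")
print(ast_ec.lineage())
# ['EC 2', 'EC 2.6', 'EC 2.6.1', 'EC 2.6.1.1']
```

Two enzymes that share the first three levels, for example, catalyse the same type of reaction and differ only in the serial number.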

Problems:
- Isoforms, e.g. how to differentiate the function of the cytoplasmic aspartate aminotransferase from that of the mitochondrial isoform?
- Non-enzymatic proteins

GO function vocabulary: http://www.geneontology.org/
The Ontologies:
- Cellular component
- Biological process
- Molecular function

Gene Ontology classification: the human cytoplasmic aspartate transaminase
- GO:0004069
- GO:0005829
- GO:0006533
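As a rough sketch of how the three GO identifiers for this protein spread across the three ontologies, the snippet below groups them by ontology. The assignment of each identifier to an ontology is an assumption made from memory of the GO vocabulary and should be checked against http://www.geneontology.org/; the function name group_by_ontology is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (GO identifier, ontology) pairs for the human cytoplasmic aspartate transaminase.
# The ontology labels are assumptions to be verified against the GO site.
ANNOTATIONS: List[Tuple[str, str]] = [
    ("GO:0004069", "molecular_function"),
    ("GO:0005829", "cellular_component"),
    ("GO:0006533", "biological_process"),
]


def group_by_ontology(annotations: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect GO identifiers under the ontology they belong to."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for go_id, ontology in annotations:
        grouped[ontology].append(go_id)
    return dict(grouped)


for ontology, go_ids in group_by_ontology(ANNOTATIONS).items():
    print(ontology, go_ids)
```

Unlike a single EC number, this kind of annotation can record where a protein acts, which is one way to tell the cytoplasmic isoform from the mitochondrial one.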

One BIG problem of the omic era: protein functional annotation

Functional annotation in silico by homology search

ADH1_SULSO  ----------MRAVRLVEIGKP--LSLQEIGVPKPKGPQVLIKVEAAGVCHSDVHMRQGRFGNLRIVE
ADH_CLOBE   ----------MKGFAMLGINKLG---WIEKERPVAGSYDAIVRPLAVSPCTSDIHTVFEGA-------
ADH_THEBR   ----------MKGFAMLSIGKVG---WIEKEKPAPGPFDAIVRPLAVAPCTSDIHTVFEGA-------
ADH1_SOLTU  MSTTVGQVIRCKAAVAWEAGKP--LVMEEVDVAPPQKMEVRLKILYTSLCHTDVYFWEAKG-------
ADH2_LYCES  MSTTVGQVIRCKAAVAWEAGKP--LVMEEVDVAPPQKMEVRLKILYTSLCHTDVYFWEAKG-------
ADH1_ASPFL  ----MSIPEMQWAQVAEQKGGP--LIYKQIPVPKPGPDEILVKVRYSGVCHTDLHALKGDW-------

Sequence comparison is performed with alignment programs
Sequence identity 40 %
Similar structure and function (??)

Methods for similarity searches:
- BLAST, Psi-BLAST (http://www.ncbi.nlm.nih.gov/blast/): sequence
  Altschul et al., (1990) J Mol Biol 215:403-410
  Altschul et al., (1997) Nucleic Acids Res. 25:3389-3402
- Pfam (http://pfam.wustl.edu/hmmsearch.shtml): sequence/structure
  Bateman et al., (2000) Nucleic Acids Research 28:263-266
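To make "sequence identity" concrete, here is a minimal sketch that computes percent identity between two rows of an existing alignment. It assumes one common convention, counting only columns where both rows carry a residue; other conventions (dividing by alignment length or by the shorter sequence) give different values. The function name percent_identity is hypothetical, and the two rows are copied from the ADH alignment on this slide.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither aligned row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Keep only columns in which both sequences have a residue.
    aligned = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not aligned:
        return 0.0
    matches = sum(a == b for a, b in aligned)
    return 100.0 * matches / len(aligned)


# Two rows taken from the alignment on the slide.
ADH_CLOBE = "----------MKGFAMLGINKLG---WIEKERPVAGSYDAIVRPLAVSPCTSDIHTVFEGA-------"
ADH_THEBR = "----------MKGFAMLSIGKVG---WIEKEKPAPGPFDAIVRPLAVAPCTSDIHTVFEGA-------"

print(round(percent_identity(ADH_CLOBE, ADH_THEBR), 1))
```

This only scores an alignment that already exists; producing the alignment itself is the job of programs such as BLAST.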

Transfer by inheritance: Function annotation transfer from sequence through homology

http://www.uniprot.org/

PDB The annotation process at UniProt

Open problems of inheritance through homology
- Not all UniProt files are GO annotated
- The optimal threshold value of sequence identity for function transfer is not known
- Proteins contain multiple domains
- Proteins can share common domains and not necessarily the same function
- In proteins, different combinations of shared domains lead to different biological roles
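The threshold problem listed above can be illustrated with a small sketch of annotation transfer. Everything here is an assumption for illustration: the hit records are invented, the function transfer_annotations is hypothetical, and this is not the procedure used at UniProt. It only shows that the set of transferred GO terms depends directly on a cut-off whose optimal value is not known.

```python
from typing import Dict, Iterable, List


def transfer_annotations(hits: Iterable[Dict], min_identity: float) -> List[str]:
    """Collect GO terms from every hit at or above the identity cut-off."""
    transferred = set()
    for hit in hits:
        if hit["identity"] >= min_identity:
            transferred.update(hit["go_terms"])
    return sorted(transferred)


# Invented similarity-search hits for an unannotated query (illustrative only).
hits = [
    {"id": "hypothetical_hit_1", "identity": 62.0,
     "go_terms": {"GO:0004069", "GO:0005829"}},
    {"id": "hypothetical_hit_2", "identity": 34.0,
     "go_terms": {"GO:0006533"}},
]

# The same hits give three different annotations under three cut-offs.
for threshold in (30.0, 40.0, 70.0):
    print(threshold, transfer_annotations(hits, threshold))
```

A single identity value also ignores domain architecture, so two hits above the cut-off that share only one domain with the query would still contribute their terms in this sketch.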