Some Problems from Enzyme Families

Greg Butler
Department of Computer Science
Concordia University, Montreal
www.cs.concordia.ca/~faculty/gregb
gregb@cs.concordia.ca

Abstract
I will discuss some problems from bioinformatics and related areas that we encounter in applying knowledge of families of enzymes during our search of fungal genomes (actually ESTs) for more effective enzymes. These include multiple sequence alignments, scoping the boundaries between families and subfamilies, constructing classifiers for family membership, and predicting enzymatic activity of new sequences/enzymes.

Aims of Talk
  introduce some problems in bioinformatics
  All are open research problems!
  give some pointers to solutions

Outline
  Enzymes and Enzyme Families
  Problem: Determine Properties of a New Enzyme
    SubProblem: Multiple Sequence Alignment
    SubProblem: Splitting Families and Subfamilies
    SubProblem: Building Classifiers
    SubProblem: Predicting Enzymatic Activity
  Some Literature References

What is an Enzyme?
  An enzyme is a protein that catalyses a reaction.
  Enzymes are very specific.
  Enzymes are very efficient catalysts.

Enzyme Families
Aim: To classify and organize enzymes.

Some Example Classification Schemes
  EC (Enzyme Commission) numbers
    To consider the classification and nomenclature of enzymes and coenzymes, their units of activity and standard methods of assay, together with the symbols used in the description of enzyme kinetics.
  GO (Gene Ontology)
    three classifications of gene products:
      molecular function
      biological process
      cellular component
  CATH: Class, Architecture, Topology, Homology
    There is no objective definition.
    A family is clearly related by sequence similarity.
    A superfamily is composed of families whose sequence relationship isn't clear, but which are believed on structural and functional grounds to be homologous.
    A fold is a group of superfamilies that share a common structural topology but are not necessarily homologous.
  InterPro
    combination of many classification schemes

Gene Ontology Entry

InterPro

The Fungal Genomics Project

Multiple Sequence Alignment (MSA)
Problem: Given a set of protein sequences and an objective function, determine the optimal alignment of the sequences.
Why? Amino acid sequence determines protein structure, which in turn determines enzyme function.
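To make "objective function" concrete, here is a minimal sketch of one common choice, the sum-of-pairs score. The talk does not name a specific objective function, so this is an illustration only. The scoring values (match, mismatch, gap) are toy assumptions, not a real substitution matrix such as BLOSUM, and the example alignment is invented.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned characters (toy scheme, assumed values)."""
    if a == "-" and b == "-":
        return 0  # a gap aligned with a gap contributes nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Add up pair_score over every column and every pair of rows."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have equal length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


# Invented three-sequence alignment, one string per row.
alignment = ["MKV-L", "MKVAL", "MRV-L"]
print(sum_of_pairs(alignment))
```

An MSA program would search over possible gap placements for the alignment that maximizes such a score. The function above only evaluates a given alignment.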

MSA Issues
Multiple sequence alignment is a complicated task
  choice of the sequences
  choice of an objective function
  the optimization of the objective function
Issues
  math vs biology (optimal MSA not necessarily good MSA for biologist)
  outliers affect results
  divergence can affect choice of parameters/algorithms
  multi-domain sequences are problems
  many sequences, long sequences costly
Ideal
  align closely related sequences
  trim so only one domain present
  feed in lots of constraints, e.g. structural information ...

Approaches to MSA
Progressive
  sequences are added one by one to the multiple alignment according to a precomputed order
Iterative
  iteratively modify a sub-optimal solution
Stochastic iterative
  randomly modify
  result is either kept or discarded dependent on an acceptance function
  convergence via more stringent acceptance function
Consistency-based
  given a set of independent observations, the most consistent are often closer to the truth
  optimal MSA is one that agrees the most with all the possible optimal pair-wise alignments
Constraint-based
  use prior information as constraints on the alignment
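The stochastic iterative idea (random modification, an acceptance function, and a progressively more stringent acceptance rule) can be sketched as follows. This is a generic Metropolis-style rule with geometric cooling, chosen here as an assumption; the talk does not say which acceptance function any particular aligner uses. The names accept and stochastic_search are hypothetical, and the "alignment" is abstracted to any state with a score and a random modification.

```python
import math
import random


def accept(old_score, new_score, temperature, rng=random):
    """Always keep improvements; keep a worse result with a probability
    that shrinks as the temperature drops (more stringent acceptance)."""
    if new_score >= old_score:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp((new_score - old_score) / temperature)


def stochastic_search(start, score, modify, steps=1000,
                      t_start=2.0, t_end=0.01, seed=0):
    """Randomly modify a state, keeping or discarding each result."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        fraction = step / max(steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** fraction
        candidate = modify(current, rng)
        candidate_score = score(candidate)
        if accept(current_score, candidate_score, temperature, rng):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


# Toy stand-in for an alignment: an integer whose best value is 3.
best, best_score = stochastic_search(
    start=0,
    score=lambda x: -(x - 3) ** 2,
    modify=lambda x, rng: x + rng.choice([-1, 1]),
    steps=200,
)
print(best, best_score)
```

In a real aligner the state would be an alignment, modify would shift or insert gaps, and score would be an objective function over the alignment.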

Splitting Families into Subfamilies
Problem: Given the sequences for a family of enzymes, determine how to delineate cohesive subfamilies.
Why?: more homologous means
  easier to study
  easier to build better alignments
  easier to build better classifiers
Subproblem: remove outliers from the set of sequences
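One simple way to picture the problem is to link sequences whose pairwise identity exceeds a threshold and treat the connected groups as subfamilies, with unlinked sequences flagged as outliers. This is a minimal sketch under stated assumptions: it is single-linkage clustering on percent identity, not the tree-based methods used by PANTHER, PipeAlign or Secator; the 0.6 threshold is arbitrary; and the sequences are invented and assumed to be already aligned.

```python
def identity(a, b):
    """Fraction of aligned columns (ignoring gap-gap columns) that match."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def split_into_subfamilies(aligned, threshold=0.6):
    """Single-linkage grouping: link any two sequences at or above threshold."""
    names = list(aligned)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if identity(aligned[a], aligned[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    clusters = sorted(groups.values(), key=lambda g: (-len(g), g))
    subfamilies = [g for g in clusters if len(g) > 1]
    outliers = [g[0] for g in clusters if len(g) == 1]
    return subfamilies, outliers


# Invented, pre-aligned toy family.
family = {
    "seqA": "MKVLAAGG",
    "seqB": "MKVLAAGA",
    "seqC": "MRVLSAGG",
    "seqD": "TTPWYYHH",
    "seqE": "TTPWYYHQ",
    "seqF": "CCCCCCCC",
}
subfamilies, outliers = split_into_subfamilies(family)
print(subfamilies)
print(outliers)
```

Choosing the threshold is exactly the open question on the slide: too low merges functionally distinct subfamilies, too high fragments the family.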

Building Classifiers for Enzyme Families
Problem: Given the sequences for a family of enzymes, determine how to decide membership in the family.

In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residue types which is variously known as a pattern, motif, signature, or fingerprint.

A profile or weight matrix is a table of position-specific amino acid weights and gap costs.
A domain is a conserved protein region: an independently folding structural unit.
A fingerprint is a group of conserved motifs used to characterise a protein family.
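The weight-matrix idea can be sketched in a few lines: count residues per aligned column, convert the counts to log-odds weights, and slide the matrix along a new sequence. This is a simplified illustration under several assumptions: a uniform background frequency, a pseudocount of 1, and no gap costs (which a full profile, as defined above, would include). The motif sequences and the query are invented, and build_profile, score_window and best_hit are hypothetical names.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Return one {residue: log-odds weight} table per aligned column."""
    length = len(aligned[0])
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for i in range(length):
        # Gaps and unknown characters are skipped when counting.
        column = [seq[i] for seq in aligned if seq[i] in AMINO_ACIDS]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        weights = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            weights[aa] = math.log2(freq / background)
        profile.append(weights)
    return profile


def score_window(profile, window):
    """Sum the position-specific weights for an ungapped window."""
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, window))


def best_hit(profile, sequence):
    """Return (score, start) of the best-scoring window, or None."""
    width = len(profile)
    hits = [
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(hits) if hits else None


# Invented motif instances and query sequence.
motif = ["GDSL", "GDSL", "GDSI", "GESL"]
profile = build_profile(motif)
print(best_hit(profile, "MKTAGDSLVV"))
```

Deciding family membership then reduces to choosing a score cutoff. Profile HMMs extend the same idea with position-specific insert and delete states.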

Predicting Enzyme Activity
Problem: Given the sequences for a family of enzymes, with (quantitative) information about their enzymatic activity, and given a new sequence in the family, predict the (quantitative) enzymatic activity of the new protein.
Why?: quantitative aspect of enzyme function
Subproblem: understand known enzymes in (sub)family
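The talk leaves this problem open, so the following is only a naive baseline to show the shape of the input and output, not a proposed solution. It assumes the sequences are already aligned to equal length and predicts the new enzyme's activity as an identity-weighted average over its k most similar known enzymes. The function names, the activity values, and the sequences are all invented for illustration.

```python
def identity(a, b):
    """Fraction of positions with identical residues in two pre-aligned,
    equal-length sequences (gaps never count as matches)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        return 0.0
    return sum(x == y and x != "-" for x, y in zip(a, b)) / len(a)


def predict_activity(new_seq, known, k=2):
    """known is a list of (aligned_sequence, measured_activity) pairs."""
    ranked = sorted(
        known, key=lambda item: identity(new_seq, item[0]), reverse=True
    )[:k]
    weights = [identity(new_seq, seq) for seq, _ in ranked]
    if sum(weights) == 0:
        # Nothing similar at all: fall back to a plain average.
        return sum(activity for _, activity in ranked) / len(ranked)
    weighted = sum(w * activity for w, (_, activity) in zip(weights, ranked))
    return weighted / sum(weights)


# Invented activities in arbitrary units.
known = [
    ("MKVLAAGG", 10.0),
    ("MKVLAAGA", 12.0),
    ("TTPWYYHH", 100.0),
]
print(predict_activity("MKVLAAGG", known))
```

A baseline like this ignores which positions matter for catalysis, which is why understanding the known enzymes in the (sub)family is listed as a subproblem.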

Measuring Enzyme Kinetic Activity

Panther System from Celera
The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster.

PipeAlign System from Strasbourg

Some Solutions
Multiple Sequence Alignment
  many
  ClustalW most widely used
  POA seems best compromise of speed vs quality
Splitting a Family into Subfamilies
  Panther
  PipeAlign
Classifiers of an Enzyme Family
  many, but HMMER is most widely used
Predicting Kinetic Activity
  ???

Acknowledgements

References
L. Duret and S. Abdeddaim, Multiple alignments for structural, functional, or phylogenetic analyses of homologous sequences. In Bioinformatics: Sequence, Structure and Databanks, edited by D. Higgins and W. Taylor, Oxford University Press, 2000.
C. Notredame, Recent progresses in multiple sequence alignment: a survey, Pharmacogenomics 3(1) (2002) 131-144.
J.D. Thompson, F. Plewniak, O. Poch, A comprehensive comparison of multiple sequence alignment programs, Nucleic Acids Research 27(13) (1999) 2682-2690.
T. Lassmann and E.L.L. Sonnhammer, Quality assessment of multiple alignment programs, FEBS Letters 529 (2002) 126-130.
P.D. Thomas et al., PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification, Nucleic Acids Research 31 (2003) 334-341.
F. Plewniak et al., PipeAlign: a new toolkit for protein family analysis, Nucleic Acids Research 31(13) (2003) 3829-3832.
N. Wicker, G.R. Perrin, J.C. Thierry and O. Poch, Secator: a program for inferring protein subfamilies from phylogenetic trees, Mol. Biol. Evol., 2001, 8:1435-1441.