Some Problems from Enzyme Families

Greg Butler
Department of Computer Science
Concordia University, Montreal
www.cs.concordia.ca/~faculty/gregb
gregb@cs.concordia.ca

Abstract

I will discuss some problems from bioinformatics and related areas that we encounter when applying knowledge of enzyme families in our search of fungal genomes (actually ESTs) for more effective enzymes. These include computing multiple sequence alignments, delineating the boundaries between families and subfamilies, constructing classifiers for family membership, and predicting the enzymatic activity of new sequences/enzymes.

Aims of Talk
- introduce some problems in bioinformatics
- all are open research problems!
- give some pointers to solutions

Outline
- Enzymes and Enzyme Families
- Problem: Determine Properties of a New Enzyme
  - SubProblem: Multiple Sequence Alignment
  - SubProblem: Splitting Families and Subfamilies
  - SubProblem: Building Classifiers
  - SubProblem: Predicting Enzymatic Activity
- Some Literature References

What is an Enzyme?
An enzyme is a protein that catalyses a reaction.

Enzymes are very specific.
Enzymes are very efficient catalysts.

Enzyme Families
Aim: To classify and organize enzymes.

Some Example Classification Schemes
- EC (Enzyme Commission) numbers
  "To consider the classification and nomenclature of enzymes and coenzymes, their units of activity and standard methods of assay, together with the symbols used in the description of enzyme kinetics."
- GO (Gene Ontology): three classifications of gene products
  - molecular function
  - biological process
  - cellular component
- CATH: Class, Architecture, Topology, Homology
  There is no objective definition.
  - a family is clearly related by sequence similarity,
  - a superfamily is composed of families whose sequence relationship isn't clear, but which are believed on structural and functional grounds to be homologous, and
  - a fold is a group of superfamilies that share a common structural topology but are not necessarily homologous.
- InterPro: combination of many classification schemes

Gene Ontology Entry

InterPro

The Fungal Genomics Project

Multiple Sequence Alignment (MSA)
Problem: Given a set of protein sequences and an objective function, determine the optimal alignment of the sequences.
Why? The amino acid sequence determines the protein structure, which in turn determines the enzyme function.
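To make "objective function" concrete, here is a minimal sketch of one common choice, the sum-of-pairs score. The talk does not say which objective is intended, so treat this as an illustration only. The function names and the match/mismatch/gap weights are assumptions chosen for the example; real tools use substitution matrices and affine gap penalties.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols (illustrative weights, not a real matrix)."""
    if a == "-" and b == "-":
        return 0          # two gaps facing each other carry no information
    if a == "-" or b == "-":
        return gap        # residue aligned against a gap
    return match if a == b else mismatch


def sum_of_pairs(alignment, **weights):
    """Sum pair_score over every pair of rows in every column of the alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b, **weights)
    return total


# Three toy sequences, already aligned (hypothetical data).
alignment = ["MKV-L", "MRVAL", "MKVAL"]
print(sum_of_pairs(alignment))
```

Under this scheme the MSA problem becomes: among all ways of inserting gaps, find the arrangement that maximises `sum_of_pairs`.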

MSA Issues
Multiple sequence alignment is a complicated task:
- choice of the sequences
- choice of an objective function
- the optimization of the objective function

Issues
- math vs biology (an optimal MSA is not necessarily a good MSA for a biologist)
- outliers affect results
- divergence can affect the choice of parameters/algorithms
- multi-domain sequences are a problem
- many sequences and long sequences are costly

Ideal
- align closely related sequences
- trim so only one domain is present
- feed in lots of constraints, e.g., structural information...

Approaches to MSA
- Progressive
  - sequences are added one by one to the multiple alignment according to a precomputed order
- Iterative
  - iteratively modify a sub-optimal solution
- Stochastic iterative
  - randomly modify the current solution
  - the result is either kept or discarded, depending on an acceptance function
  - convergence via a progressively more stringent acceptance function
- Consistency-based
  - given a set of independent observations, the most consistent are often closer to the truth
  - the optimal MSA is the one that agrees the most with all the possible optimal pair-wise alignments
- Constraint-based
  - use prior information as constraints on the alignment
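The stochastic iterative idea (random modification, an acceptance function, and a more stringent acceptance function over time) can be sketched generically. The code below assumes a simulated-annealing style acceptance rule, which is one possible choice and not necessarily what any particular MSA program uses. The names `accept` and `stochastic_search`, the cooling schedule, and the toy problem are all hypothetical.

```python
import math
import random


def accept(old_score, new_score, temperature, rng=random):
    """Keep improvements always; keep a worse result with a probability
    that shrinks as the temperature falls (assumed Metropolis-style rule)."""
    if new_score >= old_score:
        return True
    if temperature <= 0:
        return False
    return rng.random() < math.exp((new_score - old_score) / temperature)


def stochastic_search(start, score, mutate, steps=1000,
                      t_start=1.0, t_end=0.01, seed=0):
    """Randomly modify a solution, keeping or discarding each modification."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling: the acceptance function becomes more stringent.
        frac = step / max(steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** frac
        candidate = mutate(current, rng)
        candidate_score = score(candidate)
        if accept(current_score, candidate_score, temperature, rng):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


# Toy stand-in for an alignment: an integer whose score peaks at 3.
toy_score = lambda x: -(x - 3) ** 2
toy_mutate = lambda x, rng: x + rng.choice([-1, 1])
print(stochastic_search(10, toy_score, toy_mutate))
```

For a real MSA, `start` would be an initial alignment, `mutate` would shift or insert gaps, and `score` would be the chosen objective function.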

Splitting Families into Subfamilies
Problem: Given the sequences for a family of enzymes, determine how to delineate cohesive subfamilies.
Why?: more homologous sequences are
- easier to study
- easier to build better alignments for
- easier to build better classifiers for
Subproblem: remove outliers from the set of sequences
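As a rough illustration of the problem statement, the sketch below groups pre-aligned sequences by single-linkage clustering on percent identity and reports singleton groups last, where they can be inspected as outlier candidates. This is a deliberately naive baseline under stated assumptions, not the method of any system named in this talk. The 0.7 threshold, the function names, and the toy sequences are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b):
    """Fraction of identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def split_into_subfamilies(aligned, threshold=0.7):
    """Single-linkage clustering: link two sequences when identity >= threshold."""
    names = list(aligned)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if percent_identity(aligned[a], aligned[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    # Largest groups first; singletons at the end are outlier candidates.
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical aligned family members.
family = {
    "a1": "MKVALGH",
    "a2": "MKVALGY",
    "b1": "MRTPWGH",
    "b2": "MRTPWCH",
    "x": "QQ-QQQQ",
}
print(split_into_subfamilies(family))
```

Single linkage is sensitive to the threshold and can chain unrelated groups together, which is part of why delineating subfamilies remains difficult.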

Building Classifiers for Enzyme Families
Problem: Given the sequences for a family of enzymes, determine how to decide membership in the family.

In some cases the sequence of an unknown protein is too distantly related to any protein of known structure for the resemblance to be detected by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residue types, variously known as a pattern, motif, signature, or fingerprint.

- A profile or weight matrix is a table of position-specific amino acid weights and gap costs.
- A domain is a conserved protein region: an independently folding structural unit.
- A fingerprint is a group of conserved motifs used to characterise a protein family.
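The "profile or weight matrix" definition can be illustrated with a small position-specific scoring matrix built from an ungapped motif alignment. This is a simplified sketch: it assumes a uniform background distribution and a flat pseudocount, and it omits the gap costs that a full profile would carry. The function names and the toy motif are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Return one dict of log-odds weights per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum the position-specific weights of a segment as long as the profile."""
    if len(segment) != len(profile):
        raise ValueError("segment length must equal profile length")
    # Unknown symbols contribute a neutral weight of 0.
    return sum(col.get(aa, 0.0) for col, aa in zip(profile, segment))


def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (score, offset) of the best window."""
    width = len(profile)
    hits = [(score(profile, sequence[i:i + width]), i)
            for i in range(len(sequence) - width + 1)]
    return max(hits) if hits else None


# Hypothetical four-residue motif observed in four family members.
motif = ["GDSL", "GDSL", "GDSV", "GDTL"]
profile = build_profile(motif)
print(best_hit(profile, "MKAGDSLPP"))
```

A membership classifier could then accept a new sequence when its best window scores above a cutoff calibrated on known members and non-members.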

Predicting Enzyme Activity
Problem: Given the sequences for a family of enzymes, with (quantitative) information about their enzymatic activity, and given a new sequence in the family, predict the (quantitative) enzymatic activity of the new protein.
Why?: quantitative aspect of enzyme function
Subproblem: understand known enzymes in the (sub)family

Measuring Enzyme Kinetic Activity

Panther System from Celera
The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster.


PipeAlign System from Strasbourg

Some Solutions
- Multiple Sequence Alignment
  - many
  - ClustalW is the most widely used
  - POA seems the best compromise of speed vs quality
- Splitting a Family into Subfamilies
  - Panther
  - PipeAlign
- Classifiers of an Enzyme Family
  - many, but HMMer is the most widely used
- Predicting Kinetic Activity
  - ???

Acknowledgements

References

L. Duret and S. Abdeddaim, Multiple alignments for structural, functional, or phylogenetic analyses of homologous sequences. In Bioinformatics: Sequence, Structure and Databanks, edited by D. Higgins and W. Taylor, Oxford University Press, 2000.

C. Notredame, Recent progresses in multiple sequence alignment: a survey, Pharmacogenomics 3(1) (2002) 131-144.

J.D. Thompson, F. Plewniak, O. Poch, A comprehensive comparison of multiple sequence alignment programs, Nucleic Acids Research 27(13) (1999) 2682-2690.

T. Lassmann and E.L.L. Sonnhammer, Quality assessment of multiple alignment programs, FEBS Letters 529 (2002) 126-130.

P.D. Thomas et al., PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification, Nucleic Acids Research 31 (2003) 334-341.

F. Plewniak et al., PipeAlign: a new toolkit for protein family analysis, Nucleic Acids Research 31(13) (2003) 3829-3832.

N. Wicker, G.R. Perrin, J.C. Thierry and O. Poch, Secator: a program for inferring protein subfamilies from phylogenetic trees, Mol. Biol. Evol., 2001, 8:1435-1441.