A bioinformatics approach to the structural and functional analysis of the glycogen phosphorylase protein family

Jieming Shen (1,2) and Hugh B. Nicholas, Jr. (3)
1 Bioengineering and Bioinformatics Summer Institute, Department of Computational Biology, University of Pittsburgh, Pittsburgh, PA 15261
2 Department of Chemistry and Chemical Biology, Rutgers University, Piscataway, NJ 08854
3 Biomedical Initiative Group, Pittsburgh Supercomputing Center, Pittsburgh, PA 15213

Objectives
- Build a phylogenetic profile for the glycogen phosphorylase protein family from existing sequences
  - Identify orthologues and paralogues; orthologues have the same biochemical/physiological role
  - Proteins involved in the same pathway usually have similar phylogenetic profiles
  - Aids in identifying the pathways and physiological processes in which an uncharacterized protein appears
- Conduct a validation study of the ProbCons alignment algorithm
  - Prepare multiple sequence alignments in parallel using ProbCons and a more established alignment method (ClustalW)
  - Compare results
http://www.iscb.org/ismb2004/posters/thomas.meinelatmolgen.mpg.de_323.html

Background Information
- Glycogen phosphorylase plays a major role in carbohydrate metabolism by catalyzing the breakdown of glycogen into glucose subunits
- Acts on linear chains of glycogen
- Glycogen is shortened by one residue
- Dimer of 2 identical subunits
http://wine1.sb.fsu.edu/bch4053/lecture27/lecture27.htm

Methodology
Overview
- Sequence retrieval
- Pattern identification and matching
- Multiple sequence alignment
- Principal component analysis
- Phylogenetic analysis
- Group entropy analysis
- 3D visualization
Workflow (flowchart boxes, in the order transcribed)
- Sequence Retrieval (iProClass)
- Pattern Identification and Matching (MEME)
- GeneDoc
- Multiple Sequence Alignment (ClustalW, ProbCons)
- Refined Multiple Sequence Alignment
- Principal Component Analysis (SeqSpace)
- Identified Subgroups
- Group Entropy Analysis (GEnt)
- Distinctive Residues
- Phylogenetic Profile
- Bootstrap Phylogenetic Analysis (PHYLIP suite)

Sequence Retrieval
- iProClass database
- Obtained 355 glycogen phosphorylase sequences of roughly 800 or more residues each
- Eliminated duplicates and near duplicates, leaving 282 sequences
- Final dataset included sequences from the most thoroughly sequenced genomes: honeybee, chicken, sea squirt, cow, dog, Drosophila, Japanese puffer fish, human, mosquito, mouse, and C. elegans
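The slides do not say how duplicates and near duplicates were detected. As an illustration only, the sketch below keeps the first sequence seen and drops any later sequence whose similarity to an already kept one reaches a cutoff. The function name, the 0.98 cutoff, and the use of Python's `difflib.SequenceMatcher` ratio as the similarity measure are all assumptions, not the authors' procedure.

```python
from difflib import SequenceMatcher


def remove_near_duplicates(seqs, threshold=0.98):
    """Return a dict of representative sequences.

    seqs: dict mapping sequence name -> amino acid string.
    threshold: hypothetical similarity cutoff (0..1); pairs at or above it
    are treated as duplicates and only the first-seen member is kept.
    """
    kept = {}
    for name, seq in seqs.items():
        is_duplicate = any(
            SequenceMatcher(None, seq, other, autojunk=False).ratio() >= threshold
            for other in kept.values()
        )
        if not is_duplicate:
            kept[name] = seq
    return kept
```

The comparison is quadratic in the number of sequences, which is tolerable for a few hundred proteins but would need a clustering tool for larger sets.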

Pattern Identification and Matching
- MEME (Multiple EM for Motif Elicitation)
  - Discovers motifs (highly conserved sequence patterns) in a group of related protein sequences
  - Sorts through the sequences and finds and reports motif alignments regardless of their placement along the protein
- Obtained 20 highly conserved motifs
- Conserved residues and motifs are essential to protein structure and function

Multiple Sequence Alignment
- ClustalW: progressive method; ~1 hour
- ProbCons: consistency-based method; ~10 hours
- T-COFFEE, another consistency-based method, was attempted but did not complete the MSA within a reasonable amount of time (2 weeks); this may have been due to the large size of the dataset and the long sequence length
- GeneDoc program used to highlight MEME patterns in the MSAs

Multiple Sequence Alignment GeneDoc: ProbCons conserved residues

Multiple Sequence Alignment GeneDoc: ClustalW conserved residues

Multiple Sequence Alignment GeneDoc: ProbCons with highlighted MEME motifs

Multiple Sequence Alignment GeneDoc: ClustalW with highlighted MEME motifs

Refined MSA
- Incorporated information from both the ClustalW and ProbCons outputs
- Refined the alignment by hand using the MEME motifs as a guide

Principal Component Analysis
- SeqSpace
- One method to identify groups of subfamilies
- Top-down analysis
- Takes distance measures among a set of sequences and converts them into a self-consistent set of coordinates in some arbitrary number of dimensions
- Coordinates are plotted and examined for clusters of sequences
- Clustered protein sequences share a similar pattern of substitutions, presumed to reflect some common biochemical/physiological property or function
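The step "distance measures converted into a self-consistent set of coordinates" can be illustrated with classical multidimensional scaling (principal coordinates). This is a generic textbook technique chosen because it matches the description on the slide; it is not SeqSpace's code, and the function name and parameters are hypothetical.

```python
import math


def classical_mds(dist, dims=2, iters=200):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centred Gram matrix: B = -1/2 * J * D^2 * J
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration for the current leading eigenvector of B.
        v = [math.sin(i + 1.0 + d) for i in range(n)]  # deterministic start
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigval = sum(v[i] * bv[i] for i in range(n))  # Rayleigh quotient
        if eigval <= 1e-12:
            break  # no further axis with positive variance was found
        for i in range(n):
            coords[i][d] = v[i] * math.sqrt(eigval)
        # Deflate B so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords
```

Plotting pairs of the returned coordinate columns and looking for clusters corresponds to the "plotted and examined" step. For real data a linear algebra library would be a better choice than hand-written power iteration.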

SeqSpace: Dimensions 2 x 3
[Figure: SeqSpace plot of dimension 2 against dimension 3; labeled clusters: various bacteria and archaea, Metazoa]

Bootstrap Phylogenetic Analysis with PHYLIP suite
- Input: ClustalW alignment with gap columns eliminated
- Gblocks: eliminates poorly aligned positions and divergent regions of the protein alignment
- Seqboot: generates multiple data sets (1000) that are resampled versions of the input data set, randomly sampling columns with replacement
- Protdist: analyzes the multiple data sets and computes a distance measure for protein sequences, in this case using maximum likelihood estimates based on the Dayhoff PAM matrix
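The two alignment manipulations named on this slide, removing gap columns and resampling columns with replacement, are simple enough to sketch directly. The code below is an illustration of the idea, not PHYLIP's Seqboot or Gblocks; the function names are hypothetical and the alignment is assumed to be a list of equal-length strings with `-` as the gap character.

```python
import random


def drop_gap_columns(alignment, gap="-"):
    """Remove every alignment column that contains at least one gap."""
    columns = [col for col in zip(*alignment) if gap not in col]
    if not columns:
        return ["" for _ in alignment]
    return ["".join(row) for row in zip(*columns)]


def bootstrap_replicate(alignment, rng):
    """Build one pseudo-replicate by sampling columns with replacement."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates=1000, seed=0):
    """Generate n_replicates resampled alignments (1000 on the slide)."""
    rng = random.Random(seed)
    return [bootstrap_replicate(alignment, rng) for _ in range(n_replicates)]
```

Each replicate has the same number of sequences and columns as the input; only the mix of columns changes, which is what lets the downstream distance and tree steps measure how stable each grouping is.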

Bootstrap Phylogenetic Analysis with PHYLIP suite
- Neighbor: constructs a tree by successive clustering of lineages, setting branch lengths as the lineages join
- Consense: computes consensus trees by the majority-rule method
- Phylogenetic consensus tree viewed in ATV viewer and TreeView
- Groups of subfamilies compared to those obtained from SeqSpace for further refinement
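The majority-rule idea behind the consensus step can be sketched as counting how often each grouping of taxa appears across the bootstrap trees and keeping those seen in more than half of them. The sketch below is a simplification and not Consense itself: it assumes rooted trees written as nested Python tuples of leaf names (a hypothetical representation), whereas PHYLIP works from tree files and handles unrooted trees.

```python
from collections import Counter


def clades(tree):
    """Return the set of non-trivial clades (as frozensets of leaf names)."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)  # the clade of all leaves is in every tree
    return found


def majority_rule_clades(trees, cutoff=0.5):
    """Map each clade found in more than `cutoff` of the trees to its support."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree))
    n = len(trees)
    return {clade: k / n for clade, k in counts.items() if k / n > cutoff}
```

The support values returned here play the role of bootstrap percentages on the branches of the consensus tree.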

TreeView
- Unrooted tree for the Metazoa subfamily, recalculated from a distinct subset of the overall tree
- Reveals enzyme isozyme groups for brain, muscle, and liver tissue within the vertebrates
http://science.howstuffworks.com/brain.htm
http://digilander.libero.it/bodymindcare/kapil/more medi.htm
http://en.wikipedia.org/wiki/biceps_brachii

GEnt
- Group Entropy Analysis
- Identifies distinctive features of the mutually exclusive subsets determined by phylogenetic analysis, through cross entropy analysis of the columns in the MSA
- Within a single column, contrast:
  - Amino acid composition within the defined group of sequences
  - Amino acid composition in all sequences outside of the group
- Alignment positions where the residue composition inside the group is very different from that outside the group are expected to indicate positions associated with distinctive properties of the sequences of that subgroup

Group Entropy Equations
$p_i$ = foreground residue frequency
$q_i$ = background residue frequency
$$\text{Group Entropy Distance} = \sum_i (p_i - q_i)\,\log_2\!\left(\frac{p_i}{q_i}\right)$$
$$\text{Family Entropy Distance} = \sum_i p_i\,\log_2\!\left(\frac{p_i}{q_i}\right)$$
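A minimal sketch of the per-column contrast described for GEnt, under stated assumptions: the score is taken to be a symmetrized relative entropy in bits between the residue frequencies inside the group and those outside it, and a pseudocount of 1 per amino acid is added so that no frequency is zero. The pseudocount, the function names, and the exact form of the score are assumptions for illustration and may differ from what the GEnt program computes.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def residue_freqs(residues, pseudocount=1.0):
    """Smoothed amino acid frequencies; gaps and unknown symbols are ignored."""
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def group_entropy(column, in_group):
    """Contrast one alignment column inside vs. outside a group.

    column: string of residues, one per sequence.
    in_group: list of booleans, True where the sequence belongs to the group.
    """
    inside = [r for r, g in zip(column, in_group) if g]
    outside = [r for r, g in zip(column, in_group) if not g]
    p = residue_freqs(inside)   # foreground frequencies
    q = residue_freqs(outside)  # background frequencies
    return sum((p[a] - q[a]) * math.log2(p[a] / q[a]) for a in AMINO_ACIDS)
```

A column that is uniformly one residue in the group and a different residue elsewhere scores high, while a column with the same composition in both sets scores zero.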

GEnt Results for Metazoa Subfamily
[Figure: plot of Family Entropy against Group Entropy. Labeled regions: Important to family; Important to subfamily; Forbidden; Not enough information. Labeled positions: 251t-y, 322y-w, 1005c-c, 360k-f, 479f-f, 243y-y]

GEnt Results for Fungi Subfamily
[Figure: plot of Family Entropy against Group Entropy. Labeled region: Important to subfamily. Labeled positions: 543q-m, 1181c-m, 322w-w, 1141k-l, 1304d-a, 249d-q, 1223l-t, 492w-w]

GEnt Distinctive subfamily residues highlighted

Visualization
- 3D structures for the human liver isoform and yeast obtained from the Protein Data Bank
- Residues of interest highlighted in VMD

Visualization: Human Liver Isoform GEnt: distinctive residues UniProt: binding/association sites

Visualization: Human Liver Isoform Globally conserved MEME motifs

Visualization: Yeast GEnt: distinctive residues UniProt: binding sites

Visualization: Yeast Globally conserved MEME motifs

Conclusions
- GEnt
  - In metazoans: identified a residue (155) known to be important in allosteric control
  - Other predicted residues have a high probability of reflecting important adaptations of the enzyme to its environment
  - Determined a possible evolutionary path of glycogen phosphorylase
- MSA methodology
  - ProbCons and T-COFFEE did not perform well due to the large number of sequences and the long sequence length of the dataset
  - In this case, the ClustalW alignment method gave the best results

Future Work
- Experimentally test GEnt-predicted residues (site-directed mutagenesis)
- Additional computational work with quantum mechanics and molecular mechanics to identify the roles of conserved residues, motifs, and GEnt-predicted residues

Acknowledgements Hugh B. Nicholas, Jr. Alexander J. Ropelewski Ricardo G. Mendez Pittsburgh Supercomputing Center Bioengineering & Bioinformatics Summer Institute
