FUNCTION ANNOTATION PRELIMINARY RESULTS


FACTION I
Kai Yuan, Kalyani Patankar, Kiera Berger, Camila Medrano, Hubert Pan, Junke Wang, Yanxi Chen, Ajay Ramakrishnan, Mrunal Dehankar

OVERVIEW
- Introduction
- Previous Pipeline
- Test Data
- Tools and Results
- New Pipeline
- References

INTRODUCTION
What do we have?
- Gene coordinates for 24 Salmonella enterica serovar Heidelberg isolates from the 2013 outbreak
What do we want to do?
- Attach biological information to those genes

PREVIOUS PIPELINE
[Flowchart]
- Input: GFF/FASTA from genome assembly
- Categories: coding regions, non-coding regions, others
- Approaches: ab initio, homology-based
- Tools shown: Blast2GO, RAST, Phobius, TMHMM, LipoP, SignalP, InterProScan, JAMp/JAMg, VFDB, KOBAS, Infernal-Rfam, Piler-CR, CRT, DOOR2
- Outputs are compared/combined into a final output
- Automate pipeline

TEST DATA
Reference sequences:
- NC_017623.1
- NZ_CP019176.1
- NZ_CP005995.1

TOOLS
Coding regions:
- Lipoproteins
- Transmembrane proteins
- Signal peptides
- Gene Ontology
Non-coding regions:
- CRISPR
Other:
- Operons
- Virulence factors
- Pathways

CODING REGIONS
- Lipoproteins: LipoP
- Signal peptides: SignalP, Phobius, LipoP
- Transmembrane proteins: Phobius, LipoP, InterProScan, TMHMM

LIPOP
- Predicts the presence of lipoproteins, signal peptides and transmembrane helices in an amino acid sequence
- Uses a Hidden Markov Model
- Command: LipoP -short Inputfile > Outputfile
Results:

SIGNALP
- Predicts the presence and location of signal peptide cleavage sites in amino acid sequences
- Uses a Hidden Markov Model
- Command: signalp -t gram- -f short Input.faa > Outputfile
Results:

PHOBIUS
- Predicts transmembrane topology and signal peptides from the amino acid sequence
- Uses a Hidden Markov Model
- Command: phobius -short Inputfile > Outputfile
Results:

SIGNAL PEPTIDES
[Figure panels: NC_017623.1, NZ_CP019176.1, NZ_CP005995.1]
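SignalP, Phobius and LipoP all report signal peptides, so their calls for the same protein can disagree. One possible way to compare or combine them is a simple majority vote over protein IDs. This is a minimal sketch and not the method used in the pipeline: the function name, the vote threshold and the example protein IDs are all hypothetical.

```python
from collections import Counter


def consensus_calls(predictions, min_votes=2):
    """Return protein IDs flagged by at least `min_votes` tools.

    `predictions` maps a tool name to the set of protein IDs it flagged.
    """
    votes = Counter()
    for ids in predictions.values():
        votes.update(set(ids))  # each tool votes at most once per protein
    return {pid for pid, n in votes.items() if n >= min_votes}


# Hypothetical protein IDs, for illustration only.
example = {
    "SignalP": {"prot_1", "prot_2", "prot_3"},
    "Phobius": {"prot_2", "prot_3", "prot_4"},
    "LipoP": {"prot_3", "prot_5"},
}
print(sorted(consensus_calls(example)))  # ['prot_2', 'prot_3']
```

Raising `min_votes` to 3 keeps only the proteins on which all three tools agree.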

TMHMM
- Predicts transmembrane helices
- Uses a Hidden Markov Model
Results:

TRANSMEMBRANE HELICES
[Figure panels: NC_017623.1, NZ_CP019176.1, NZ_CP005995.1]
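To count transmembrane proteins per reference sequence, the TMHMM output has to be parsed. The sketch below assumes TMHMM 2.0 was run with its short output format, which prints one tab-separated line per protein with key=value fields such as PredHel (the number of predicted helices). The slides do not state which format was used, so treat that as an assumption. The sample lines and protein IDs are made up.

```python
def parse_tmhmm_short(lines):
    """Map protein ID -> predicted helix count from TMHMM short-format lines."""
    helices = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        protein_id = fields[0]
        # Remaining fields are key=value pairs, e.g. PredHel=2.
        info = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
        helices[protein_id] = int(info["PredHel"])
    return helices


def transmembrane_ids(helices, min_helices=1):
    """Return IDs of proteins with at least `min_helices` predicted helices."""
    return {pid for pid, n in helices.items() if n >= min_helices}


# Made-up sample lines in the assumed short format.
sample = [
    "prot_A\tlen=150\tExpAA=44.10\tFirst60=22.05\tPredHel=2\tTopology=i7-29o44-66i",
    "prot_B\tlen=120\tExpAA=0.02\tFirst60=0.00\tPredHel=0\tTopology=o",
]
counts = parse_tmhmm_short(sample)
print(counts)
print(transmembrane_ids(counts))
```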

VERIFICATION
- Pulled protein name and ID information from the reference sequence GenBank files
- Labeled the entries that are transmembrane proteins
- Currently using pattern matching on the protein name (ideally we would look up the information using the protein IDs)
- Compared the labels to the tool predictions (*.gbk against *.output)
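The verification step can be sketched as follows: label proteins by matching keywords in the annotated protein name, then count agreements and disagreements with a tool's predictions. The keyword pattern, function names and example records are hypothetical. The slides do not list the patterns that were actually used, and name matching is only a rough proxy for the true annotation.

```python
import re

# Hypothetical keyword pattern; the real patterns are not given in the slides.
TM_PATTERN = re.compile(r"transmembrane|permease|transporter", re.IGNORECASE)


def label_by_name(names):
    """Return IDs whose annotated protein name matches the keyword pattern."""
    return {pid for pid, name in names.items() if TM_PATTERN.search(name)}


def compare(labeled, predicted, all_ids):
    """Count true/false positives and negatives of `predicted` vs `labeled`."""
    all_ids = set(all_ids)
    return {
        "tp": len(labeled & predicted),
        "fp": len(predicted - labeled),
        "fn": len(labeled - predicted),
        "tn": len(all_ids - labeled - predicted),
    }


# Made-up records standing in for names pulled from a GenBank file.
names = {
    "p1": "putative transmembrane protein",
    "p2": "DNA polymerase III subunit beta",
    "p3": "ABC transporter permease",
    "p4": "hypothetical protein",
}
labeled = label_by_name(names)
predicted = {"p1", "p4"}  # made-up tool output
print(compare(labeled, predicted, names))
```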

INTERPROSCAN
- Input: amino acid sequences
- >30 hours
- Reduce functionality: KOBAS?

NON-CODING REGIONS
- CRISPR: Piler-CR, CRT

PILER-CR
- Specifically designed for identification and classification of CRISPR repeats
- Installation path: /data/home/kpatankar7/piler_cr/pilercr1.06
- Command used: ./pilercr -in <fasta file> -out <output file>
Results:

CRT
- Installation path: /data/home/kpatankar7/crt_crispr
- Command used: java -cp CRT1.2-CLI.jar crt <inputfile> <outputfile>
Results:

PILER-CR VS CRT
- Piler-CR gives more exact matches (TP) than CRT when the predicted CRISPR arrays are compared against CRISPRdb.
- Piler-CR has higher precision than CRT (precision = number of instances correctly identified divided by all instances retrieved).
- Sensitivity of Piler-CR may approach 100% with default parameters.
- Piler-CR is currently the only program that detects insertions and/or deletions in repeats.

Reference sequence   PilerCR   CRT   NCBI annotation pipeline
NC_017623.1          2         2     2
NZ_CP019176.1        3         2     3
NZ_CP005995.1        2         3     2
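The precision definition and the comparison against the NCBI annotation pipeline can be worked through in a few lines. The per-sequence values below are copied from the comparison above. The function names are hypothetical, and the numbers passed to `precision` are placeholders, not measured results.

```python
def precision(true_positives, retrieved):
    """Precision = correctly identified instances / all instances retrieved."""
    if retrieved == 0:
        raise ValueError("precision is undefined when nothing was retrieved")
    return true_positives / retrieved


# Per-sequence values reported for each tool and for the NCBI pipeline.
counts = {
    "NC_017623.1": {"PilerCR": 2, "CRT": 2, "NCBI": 2},
    "NZ_CP019176.1": {"PilerCR": 3, "CRT": 2, "NCBI": 3},
    "NZ_CP005995.1": {"PilerCR": 2, "CRT": 3, "NCBI": 2},
}


def agreement_with_ncbi(counts, tool):
    """List the reference sequences where `tool` matches the NCBI value."""
    return [acc for acc, row in counts.items() if row[tool] == row["NCBI"]]


print(agreement_with_ncbi(counts, "PilerCR"))
print(agreement_with_ncbi(counts, "CRT"))
print(precision(8, 10))  # placeholder numbers, not results from this study
```

On these three reference sequences Piler-CR matches the NCBI pipeline every time, while CRT matches only for NC_017623.1.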

OTHER
- Operons: DOOR2
- Virulence factors: Virulence Factor Database (VFDB)
- Pathways: InterProScan, KOBAS

DOOR2
[Figures: available strains in DOOR2; operon table]

VFDB
- Database of virulence factors present in bacteria
- No command-line tool
- BLAST against the VFDB database
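Since VFDB has no command-line tool, virulence factor hits come from a BLAST search whose output then needs filtering. The sketch below assumes BLAST+ tabular output (`-outfmt 6`) with its default 12 columns and keeps the best-scoring hit per query above identity and e-value cutoffs. The cutoffs, function name and sample lines are hypothetical; the slides do not state which thresholds were used.

```python
def best_hits(lines, min_identity=80.0, max_evalue=1e-10):
    """Return {query: subject} for the top bit-score hit passing both cutoffs.

    Expects default BLAST+ tabular columns: qseqid sseqid pident length
    mismatch gapopen qstart qend sstart send evalue bitscore.
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        identity = float(cols[2])
        evalue = float(cols[10])
        bitscore = float(cols[11])
        if identity < min_identity or evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Made-up hits, for illustration only.
sample_hits = [
    "gene_1\tVF_0001\t95.2\t300\t14\t0\t1\t300\t1\t300\t1e-150\t520",
    "gene_1\tVF_0002\t88.0\t290\t35\t0\t5\t294\t3\t292\t1e-120\t430",
    "gene_2\tVF_0003\t45.0\t120\t66\t2\t1\t120\t10\t129\t1e-12\t60",
]
print(best_hits(sample_hits))  # {'gene_1': 'VF_0001'}
```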

KOBAS
- Predicts pathways based on sequence similarity
- Documentation for command-line installation and use is conflicting and limited
- Searching against KO with FASTA files is known to be time consuming
- Strategy to increase speed: BLAST the protein sequences against a merged database of Salmonella Heidelberg strains from the KEGG catalog, then run the KOBAS search against KO with that output
Sample of output from the web tool:
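The first half of the speed-up strategy, building a protein database from the merged Salmonella Heidelberg sequences and searching it with BLAST, might look like the sketch below. It only assembles the BLAST+ command lines (`makeblastdb` and `blastp`) and does not run them. All file names, the e-value cutoff and the thread count are hypothetical. The KOBAS step is left out because its command-line usage is not documented in the slides.

```python
import shlex


def blast_commands(merged_fasta, query_faa, db_name, out_tsv, threads=4):
    """Build the makeblastdb and blastp argument lists for the pre-filter step."""
    make_db = [
        "makeblastdb", "-in", merged_fasta,
        "-dbtype", "prot", "-out", db_name,
    ]
    search = [
        "blastp", "-query", query_faa, "-db", db_name,
        "-outfmt", "6",            # tabular output
        "-evalue", "1e-5",         # hypothetical cutoff
        "-num_threads", str(threads),
        "-out", out_tsv,
    ]
    return make_db, search


# Hypothetical file names, for illustration only.
make_db, search = blast_commands(
    "heidelberg_kegg.faa", "isolate.faa", "heidelberg_kegg", "isolate_vs_kegg.tsv"
)
for cmd in (make_db, search):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once BLAST+ is installed and the input files exist.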

NEW PIPELINE

HOMEWORK
- Homework is up on the wiki under Exercises.
- You have one week to do it.

REFERENCES
- Lihong Chen, Dandan Zheng, Bo Liu, Jian Yang, Qi Jin. VFDB 2016: hierarchical and refined dataset for big data analysis 10 years on. Nucleic Acids Res 2016; 44 (D1): D694-D697. doi: 10.1093/nar/gkv1239
- Chen, Lihong et al. VFDB: A Reference Database for Bacterial Virulence Factors. Nucleic Acids Research 33.Database Issue (2005): D325-D328. PMC. Web. 7 Mar. 2017.
- Jian Yang, Lihong Chen, Lilian Sun, Jun Yu, Qi Jin. VFDB 2008 release: an enhanced web-based resource for comparative pathogenomics. Nucleic Acids Res 2008; 36 (suppl_1): D539-D542. doi: 10.1093/nar/gkm951
- Chen, Lihong et al. VFDB 2012 Update: Toward the Genetic Diversity and Molecular Evolution of Bacterial Virulence Factors. Nucleic Acids Research 40.Database issue (2012): D641-D645. PMC. Web. 7 Mar. 2017.
- Juncker, Agnieszka S. et al. Prediction of Lipoprotein Signal Peptides in Gram-Negative Bacteria. Protein Science: A Publication of the Protein Society 12.8 (2003): 1652-1662. Print.
- Charles Bland, Teresa L Ramsey, Fareedah Sabree. CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 2007; 8: 209.
- Robert C Edgar. PILER-CR: Fast and accurate identification of CRISPR repeats. BMC Bioinformatics 2007; 8: 18.
- Nikki Shariat et al. CRISPR-MVLST subtyping of Salmonella enterica subsp. enterica serovars Typhimurium and Heidelberg and application in identifying outbreak isolates. BMC Microbiology 2013; 13: 254. doi: 10.1186/1471-2180-13-254

- Xie, C. et al. KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucleic Acids Research 39, W316-W322 (2011).
- Wu, J., Mao, X., Cai, T., Luo, J., Wei, L. KOBAS server: a web-based platform for automated annotation and pathway identification. Nucleic Acids Res 34, W720-W724 (2006).
- Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000; 28: 27-30.
- Caspi R., Billington R., Ferrer L., Foerster H., Fulcher C.A., Keseler I.M., Kothari A., Krummenacker M., Latendresse M., Mueller L.A., Ong Q., Paley S., Subhraveti P., Weaver D.S., Karp P.D. The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Research 44(1): D471-80 (2015).
- Lukas Käll, Anders Krogh and Erik L. L. Sonnhammer. A Combined Transmembrane Topology and Signal Peptide Prediction Method. Journal of Molecular Biology, 338(5): 1027-1036, May 2004.
- Reynolds, Sheila M. et al. Transmembrane Topology and Signal Peptide Prediction Using Dynamic Bayesian Networks. PLOS Computational Biology 4.11 (2008): e1000213. PLoS Journals. Web.
- Remmert, Michael et al. HHblits: Lightning-Fast Iterative Protein Sequence Searching by HMM-HMM Alignment. Nature Methods 9.2 (2012): 173-175. www.nature.com. Web.