FUNCTION ANNOTATION PRELIMINARY RESULTS

FACTION I
KAI YUAN, KALYANI PATANKAR, KIERA BERGER, CAMILA MEDRANO, HUBERT PAN, JUNKE WANG, YANXI CHEN, AJAY RAMAKRISHNAN, MRUNAL DEHANKAR

OVERVIEW
Introduction
Previous Pipeline
Test Data
Tools and Results
New Pipeline
References

INTRODUCTION
What do we have? Gene coordinates for 24 Salmonella enterica serovar Heidelberg isolates from the 2013 outbreak
What do we want to do? Attach biological information to those genes

PREVIOUS PIPELINE
[Flowchart: a GFF/FASTA genome assembly is split into coding regions, non-coding regions, and others, which are annotated by ab initio and homology-based tools in an automated pipeline. Tools shown: Blast2GO, RAST, Phobius, TMHMM, LipoP, SignalP, InterProScan, JAMp/JAMg, VFDB, KOBAS, Infernal-Rfam, Piler-CR, CRT, DOOR2. The outputs are compared/combined into a final output.]

TEST DATA
Reference sequences:
NC_017623.1
NZ_CP019176.1
NZ_CP005995.1

TOOLS
Coding Regions: Lipoproteins, Transmembrane proteins, Signal Peptides, Gene Ontology
Non-coding Regions: CRISPR
Other: Operons, Virulence Factors, Pathways

CODING REGIONS
Lipoproteins: LipoP
Signal Peptide: SignalP, Phobius, LipoP
Transmembrane proteins: Phobius, LipoP, InterProScan, TMHMM

LIPOP
Predicts the presence of lipoproteins, signal peptides, and transmembrane helices in a sequence of amino acids
Uses a hidden Markov model
Command: LipoP -short Inputfile > Outputfile
Results:

SIGNALP
Predicts the presence and location of signal peptide cleavage sites in amino acid sequences
Uses a hidden Markov model
Command: signalp -t gram- -f short Input.faa > Outputfile
Results:
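A minimal sketch of how the short-format output could be read downstream. It assumes the SignalP 4.x short layout, in which comment lines start with `#` and the tenth whitespace-separated column holds the Y/N call. The function name `parse_signalp_short` and the sample rows are hypothetical and are not taken from the results.

```python
def parse_signalp_short(text):
    """Return {sequence_id: True/False} for signal peptide calls.

    Assumed layout (SignalP 4.x, '-f short'): '#' comment lines, then rows of
    name, Cmax, pos, Ymax, pos, Smax, pos, Smean, D, ?, Dmaxcut, networks.
    """
    calls = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        if len(fields) < 10:
            continue  # row does not match the assumed layout
        calls[fields[0]] = fields[9] == "Y"
    return calls


# Made-up rows, for illustration only.
sample = """\
# SignalP-4.1 gram- predictions
# name  Cmax  pos  Ymax  pos  Smax  pos  Smean  D  ?  Dmaxcut  Networks-used
protA  0.671  22  0.789  22  0.986  4  0.926  0.863  Y  0.570  SignalP-noTM
protB  0.110  30  0.105  11  0.130  1  0.098  0.101  N  0.570  SignalP-noTM
"""
calls = parse_signalp_short(sample)
print(calls)
```

If the installed SignalP version writes a different column order, the index used for the Y/N call would need to change.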

PHOBIUS
Predicts transmembrane topology and signal peptides from the amino acid sequence
Uses a hidden Markov model
Command: phobius -short Inputfile > outputfile
Results:
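A hedged sketch of reading the Phobius short output. It assumes one header line followed by rows of four whitespace-separated fields: sequence ID, number of predicted transmembrane helices, a signal peptide flag (`Y` when present), and a topology string. The function name `parse_phobius_short` and the sample rows are hypothetical.

```python
def parse_phobius_short(text):
    """Return {sequence_id: {...}} from Phobius short-format text.

    Assumed row layout: ID, TM helix count, SP flag ('Y' if present), topology.
    """
    records = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly four fields with a numeric TM count;
        # this also skips the header line and blank lines.
        if len(fields) != 4 or not fields[1].isdigit():
            continue
        seq_id, tm_count, sp_flag, topology = fields
        records[seq_id] = {
            "tm_helices": int(tm_count),
            "signal_peptide": sp_flag == "Y",
            "topology": topology,
        }
    return records


# Made-up rows, for illustration only.
sample = """\
SEQENCE ID                     TM SP PREDICTION
protA                           0  Y n8-19c24/25o
protB                           2  0 i12-34o49-70i
"""
records = parse_phobius_short(sample)
print(records["protA"]["signal_peptide"], records["protB"]["tm_helices"])
```

Because Phobius reports both features in one row, the same parsed records could feed the signal peptide and the transmembrane comparisons.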

SIGNAL PEPTIDES
NC_017623.1
NZ_CP019176.1
NZ_CP005995.1
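One way to compare the signal peptide calls from SignalP, Phobius, and LipoP for a reference sequence is to treat each tool's predicted protein IDs as a set. The sketch below is an assumption about how such a comparison could be done, not the procedure behind the results; the helper `consensus` and all IDs are made up.

```python
from collections import Counter


def consensus(predictions, min_votes=2):
    """Return the IDs flagged by at least `min_votes` tools.

    `predictions` maps a tool name to the set of protein IDs it flagged.
    """
    votes = Counter()
    for ids in predictions.values():
        votes.update(ids)  # one vote per tool per protein ID
    return {seq_id for seq_id, n in votes.items() if n >= min_votes}


# Made-up predictions, for illustration only.
predictions = {
    "SignalP": {"p1", "p2", "p3"},
    "Phobius": {"p2", "p3", "p4"},
    "LipoP": {"p3", "p5"},
}
all_three = set.intersection(*predictions.values())
majority = consensus(predictions, min_votes=2)
print(sorted(all_three))
print(sorted(majority))
```

A majority vote is only one possible combination rule; requiring agreement of all tools trades sensitivity for fewer false positives.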

TMHMM
Predicts transmembrane helices
Operates through a hidden Markov model
Results:

TRANSMEMBRANE HELICES
NC_017623.1
NZ_CP019176.1
NZ_CP005995.1

Verification
Pulled protein name and ID information from the reference sequence GenBank files
Labeled the ones that are transmembrane proteins
Currently using pattern matching on the protein name (ideally we would look up the information using the protein IDs)
Compared results to tool predictions
*.gbk? *.output
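The verification steps above can be sketched as follows. This is an illustration under stated assumptions: the name patterns in `TM_PATTERN` are hypothetical (the slides do not list the patterns actually used), the protein IDs and product names are made up, and the GenBank parsing step is replaced by a plain dictionary.

```python
import re

# Hypothetical patterns; the ones actually used are not given in the slides.
TM_PATTERN = re.compile(
    r"transmembrane|membrane protein|permease|transporter", re.IGNORECASE
)


def label_by_name(products):
    """Return the protein IDs whose product name matches TM_PATTERN."""
    return {pid for pid, name in products.items() if TM_PATTERN.search(name)}


def compare(labeled, predicted):
    """Split IDs into agreement and the two kinds of disagreement."""
    return {
        "both": labeled & predicted,
        "name_only": labeled - predicted,
        "tool_only": predicted - labeled,
    }


# Made-up protein IDs and product names, standing in for GenBank records.
products = {
    "prot1": "ABC transporter permease",
    "prot2": "DNA polymerase III subunit beta",
    "prot3": "inner membrane protein",
    "prot4": "hypothetical protein",
}
predicted = {"prot1", "prot4"}  # made-up tool predictions
labeled = label_by_name(products)
result = compare(labeled, predicted)
print(result)
```

Name matching will miss proteins annotated as hypothetical, which is one reason a lookup by protein ID would be preferable.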

INTERPROSCAN
Running time: >30 hours
Input: amino acid sequences
Reduce functionality: KOBAS?

NON-CODING REGIONS
CRISPR: Piler-CR, CRT

PILER-CR
Specifically designed for identification and classification of CRISPR repeats
Installation path: /data/home/kpatankar7/piler_cr/pilercr1.06
Command used: ./pilercr -in <fasta file> -out <output file>
Results:

CRT
Installation path: /data/home/kpatankar7/crt_crispr
Command used: java -cp CRT1.2-CLI.jar crt <inputfile> <outputfile>
Results:

PilerCR vs CRT
Piler-CR gives more exact matches (TP) than CRT when the predicted CRISPR arrays are compared against CRISPRdb.
Higher precision than CRT (precision = number of instances correctly identified divided by all instances retrieved).
Sensitivity of Piler-CR may approach 100% with default parameters.
PILER-CR is currently the only program that detects insertions and/or deletions in repeats.

Reference sequence   PilerCR   CRT   NCBI annotation pipeline
NC_017623.1          2         2     2
NZ_CP019176.1        3         2     3
NZ_CP005995.1        2         3     2
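The precision definition above, together with the standard definition of sensitivity (true positives divided by true positives plus false negatives), can be written as a short sketch. The counts passed in below are made up for illustration and are not results from this comparison.

```python
def precision(tp, fp):
    """Correctly identified instances divided by all instances retrieved."""
    retrieved = tp + fp
    return tp / retrieved if retrieved else 0.0


def sensitivity(tp, fn):
    """Correctly identified instances divided by all true instances."""
    actual = tp + fn
    return tp / actual if actual else 0.0


# Made-up counts, for illustration only.
p = precision(tp=8, fp=2)
s = sensitivity(tp=8, fn=0)
print(p, s)
```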

OTHER
Operon: DOOR2
Virulence Factor: Virulence Factor Database (VFDB)
Pathways: InterProScan, KOBAS

DOOR2

DOOR2
Available strains in DOOR2

DOOR2
Operon table

VFDB
Database of virulence factors present in bacteria
No command-line tool
BLAST against the VFDB database
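Since VFDB is searched with BLAST, one possible workflow is to build a protein database from the downloaded VFDB FASTA with `makeblastdb -dbtype prot`, run `blastp` with `-outfmt 6`, and then keep the best hit per query. The sketch below covers only the last step. The identity and e-value cutoffs, the function name `best_hits`, and the sample hit lines (including the subject IDs) are assumptions for illustration.

```python
import csv
import io

# The 12 default columns of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, min_identity=80.0, max_evalue=1e-5):
    """Return {query_id: row}, keeping the highest-bitscore hit per query.

    The cutoffs are hypothetical defaults, not values from the slides.
    """
    best = {}
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        if float(row["pident"]) < min_identity:
            continue
        if float(row["evalue"]) > max_evalue:
            continue
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Made-up hit lines, for illustration only.
sample = (
    "prot1\tVFG000001\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-100\t380\n"
    "prot1\tVFG000002\t85.0\t180\t27\t0\t1\t180\t5\t184\t1e-60\t250\n"
    "prot2\tVFG000003\t40.0\t150\t90\t2\t1\t150\t1\t150\t1e-10\t60\n"
)
hits = best_hits(io.StringIO(sample))
print({query: hit["sseqid"] for query, hit in hits.items()})
```

In practice the handle would be the opened BLAST output file rather than an in-memory string.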

KOBAS
Predicts pathways based on sequence similarity
Conflicting/limited documentation for command-line installation and use
Searching against KO using FASTA files is known to be time consuming
Strategy to increase speed: BLAST protein sequences against a merged database of Salmonella Heidelberg strains from the KEGG catalog -> run KOBAS search against KO with the output
Sample of output from web tool

NEW PIPELINE

Homework
Homework is up on the wiki under Exercises
You have one week to do it

REFERENCES
Lihong Chen, Dandan Zheng, Bo Liu, Jian Yang, Qi Jin. VFDB 2016: hierarchical and refined dataset for big data analysis 10 years on. Nucleic Acids Res 2016; 44 (D1): D694-D697. doi: 10.1093/nar/gkv1239
Chen, Lihong et al. VFDB: A Reference Database for Bacterial Virulence Factors. Nucleic Acids Research 33.Database Issue (2005): D325-D328. PMC. Web. 7 Mar. 2017.
Jian Yang, Lihong Chen, Lilian Sun, Jun Yu, Qi Jin. VFDB 2008 release: an enhanced web-based resource for comparative pathogenomics. Nucleic Acids Res 2008; 36 (suppl_1): D539-D542. doi: 10.1093/nar/gkm951
Chen, Lihong et al. VFDB 2012 Update: Toward the Genetic Diversity and Molecular Evolution of Bacterial Virulence Factors. Nucleic Acids Research 40.Database issue (2012): D641-D645. PMC. Web. 7 Mar. 2017.
Juncker, Agnieszka S. et al. Prediction of Lipoprotein Signal Peptides in Gram-Negative Bacteria. Protein Science: A Publication of the Protein Society 12.8 (2003): 1652-1662. Print.
Charles Bland, Teresa L Ramsey, Fareedah Sabree. CRISPR Recognition Tool (CRT): a tool for automatic detection of clustered regularly interspaced palindromic repeats. BMC Bioinformatics 2007; 8: 209.
Robert C Edgar. PILER-CR: Fast and accurate identification of CRISPR repeats. BMC Bioinformatics 2007; 8: 18.
Nikki Shariat et al. CRISPR-MVLST subtyping of Salmonella enterica subsp. enterica serovars Typhimurium and Heidelberg and application in identifying outbreak isolates. BMC Microbiology 2013; 13: 254. doi: 10.1186/1471-2180-13-254

REFERENCES
Xie, C. et al. KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucleic Acids Research 39, W316-W322 (2011).
Wu, J., Mao, X., Cai, T., Luo, J., Wei, L. KOBAS server: a web-based platform for automated annotation and pathway identification. Nucleic Acids Res 34, W720-W724 (2006).
Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000; 28: 27-30.
Caspi R., Billington R., Ferrer L., Foerster H., Fulcher C.A., Keseler I.M., Kothari A., Krummenacker M., Latendresse M., Mueller L.A., Ong Q., Paley S., Subhraveti P., Weaver D.S., Karp P.D. The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Research 44(1): D471-80 (2015).
Lukas Käll, Anders Krogh and Erik L. L. Sonnhammer. A Combined Transmembrane Topology and Signal Peptide Prediction Method. Journal of Molecular Biology, 338(5): 1027-1036, May 2004.
Reynolds, Sheila M. et al. Transmembrane Topology and Signal Peptide Prediction Using Dynamic Bayesian Networks. PLOS Computational Biology 4.11 (2008): e1000213. PLoS Journals. Web.
Remmert, Michael et al. HHblits: Lightning-Fast Iterative Protein Sequence Searching by HMM-HMM Alignment. Nature Methods 9.2 (2012): 173-175. www.nature.com. Web.