A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling

Ari Ugarte, Riccardo Vicedomini, Juliana Silva Bernardes, Alessandra Carbone
9 September 2018
Laboratory of Computational and Quantitative Biology (LCQB), Sorbonne University

Introduction

Metagenomic analysis workflow - Functional profiling

Identifying protein domains in metagenomic (MG) and metatranscriptomic (MT) data makes it possible to study which main functions are expressed in a specific environment.

Protein domains

Size of individual structural domains:
- from 36 aa to 692 aa
- most of them have < 200 residues
- average of 100 residues

Domain identification in short MG/MT reads

Very short fragments (150-300 bp)

Two possible approaches:
- Assembly-based (e.g., HMM-GRASPx)
- Direct read annotation (e.g., MetaCLADE, UProC)

How can we represent a domain?

Multiple sequence alignment (MSA) of homologous domain sequences

[Figure: sequence logos of the aligned domain sequences, generated with WebLogo 3.6.0]

Probabilistic representations:
- position-specific scoring matrices (PSSMs)
- profile hidden Markov models (pHMMs)
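To make the idea of a PSSM concrete, here is a minimal Python sketch that turns a toy ungapped alignment into log-odds scores and scores a fragment against it. It is an illustration only, not how PSI-BLAST or CLADE build their models: the uniform background, the pseudocount value, the toy alignment and the function names `build_pssm` and `score` are all assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2-odds score.

    Assumptions of this sketch: ungapped alignment, uniform background
    frequencies, and a simple additive pseudocount.
    """
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in msa if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, fragment):
    """Sum the column scores of a fragment aligned to the first columns."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, fragment))


# Hypothetical toy alignment of four homologous fragments.
toy_msa = ["MAKLK", "MSKLR", "MAKIK", "MGRLK"]
pssm = build_pssm(toy_msa)
print(score(pssm, "MAKLK"), score(pssm, "WWWWW"))
```

A fragment resembling the alignment receives a higher score than an unrelated one, which is the property a domain search relies on.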

The Pfam database

- A large collection of protein domain families
- Each family is represented by two MSAs (Full and Seed) and a profile HMM

CLADE (CLoser sequences for Annotations Directed by Evolution)

MetaCLADE

MetaCLADE - Main features

- Extends CLADE to handle MG/MT data
- Puts all domain hits in a two-dimensional space
- Uses two-dimensional thresholds (defined with a naive Bayes classifier) to assign confidence values to predictions
- Provides data visualization of functional annotation

MetaCLADE - General overview

[Figure: pipeline overview. Pre-computation: for each domain D_i, the CLADE model library provides CCMs and a global-consensus model; sets of positive and negative sequences are used to identify domain-specific separating parameters and to pre-compute a domain-dependent probability space. Annotation: predicted CDS/ORFs in MG/MT reads (input sequence, e.g. MAKLKVANDKA...) are scanned with the conservation profiles for hit identification, followed by domain prediction.]

Domain prediction from the domain hits on the input sequence:
1. Removal of overlapping hits
2. Selection of hits with prob ≥ 0.9
3. Selection of hits with best bit-score and % identity

MetaCLADE - Model construction (inherited from CLADE)

Several models are built to represent each known Pfam family D_i (i = 1, ..., M):
- Global-consensus model: the profile HMM pHMM_i associated with the Seed_i alignment
- Clade-centered models (CCMs): for each sequence S_i^1, ..., S_i^{n_i} taken from the Full_i alignment (n_i ≤ 350), PSI-BLAST is run against NR to build a PSSM, giving CCM_i^1, ..., CCM_i^{n_i}

The CLADE library gathers the global-consensus models (pHMM_1, ..., pHMM_M) and the clade-centered models (CCM_1^1, ..., CCM_M^{n_M}).

Pfam27: 14 831 domains, 2 389 235 CCMs (PSSMs) and 14 831 pHMMs

MetaCLADE - General idea

[Figure: candidate domain hits (D_2, D_6, D_8, D_3) placed on an input sequence]

MetaCLADE - Training set

The set of positive sequences for a domain D_i is based on suffixes, prefixes and random fragments from Seed_i.

The set of negative sequences for a domain D_i is based on three different methods:
1. 2-mer shuffling
2. sequence reversal
3. Markov model (probabilities based on 4-mers of Seed_i)
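The three ways of producing negative sequences can be sketched as follows. This is a hedged illustration, not the MetaCLADE code. It assumes that "2-mer shuffling" means permuting consecutive non-overlapping 2-mers (a composition-preserving shuffle would be another reading) and that the Markov model is an order-3 chain whose transition counts come from the 4-mers of the seed sequences. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def shuffle_2mers(seq, rng):
    """Cut the sequence into consecutive 2-mers and permute them (assumed reading)."""
    chunks = [seq[i:i + 2] for i in range(0, len(seq), 2)]
    rng.shuffle(chunks)
    return "".join(chunks)


def reverse_sequence(seq):
    """Return the sequence read backwards."""
    return seq[::-1]


def train_markov(seed_seqs, k=4):
    """Count, for every (k-1)-mer context, which residue follows it."""
    transitions = defaultdict(Counter)
    for seq in seed_seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            transitions[kmer[:-1]][kmer[-1]] += 1
    if not transitions:
        raise ValueError("seed sequences are shorter than k")
    return transitions


def sample_markov(transitions, length, rng):
    """Generate a sequence by walking the chain learned from the k-mers."""
    contexts = sorted(transitions)
    order = len(contexts[0])
    out = list(rng.choice(contexts))
    while len(out) < length:
        options = transitions.get("".join(out[-order:]))
        if not options:
            # Unseen context: restart from a randomly chosen observed context.
            out.extend(rng.choice(contexts))
            continue
        residues = sorted(options)
        weights = [options[r] for r in residues]
        out.append(rng.choices(residues, weights=weights)[0])
    return "".join(out[:length])


rng = random.Random(0)
seed = "MAKLKVANDKA"  # toy stand-in for a seed sequence
model = train_markov([seed + seed])
print(shuffle_2mers(seed, rng), reverse_sequence(seed), sample_markov(model, 20, rng))
```

All three generators keep the residue alphabet of the seed while destroying the positional signal that a domain model is supposed to recognize.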

MetaCLADE - Training set

Negative sequences are generated until the ratio (number of negative sequences) / (number of positive sequences) reaches 1/2.

A two-dimensional space is generated by considering the bit score and the mean bit score of domain hits.

MetaCLADE - Hit classification

A discrete version of a naive Bayes classifier is used to partition the hit space into regions, each with an associated probability.
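A discrete naive Bayes classifier over the two-dimensional hit space could look like the sketch below. It is only one possible reading of the slide: the equal-width binning, the Laplace smoothing and the class name `DiscreteNaiveBayes2D` are assumptions, and the training points are invented (bit score, mean bit score) pairs.

```python
from collections import Counter


def bin_index(value, lo, hi, n_bins):
    """Map a value to an equal-width bin, clamping to the valid range."""
    if hi <= lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)


class DiscreteNaiveBayes2D:
    """Naive Bayes on two binned features (assumed: bit score, mean bit score)."""

    def __init__(self, n_bins=10, alpha=1.0):
        self.n_bins = n_bins
        self.alpha = alpha  # Laplace smoothing constant (assumption)

    def fit(self, positives, negatives):
        points = positives + negatives
        self.ranges = [(min(p[d] for p in points), max(p[d] for p in points))
                       for d in (0, 1)]
        self.sizes = {"pos": len(positives), "neg": len(negatives)}
        self.counts = {
            label: [Counter(self._bin(h[d], d) for h in hits) for d in (0, 1)]
            for label, hits in (("pos", positives), ("neg", negatives))
        }
        return self

    def _bin(self, value, d):
        lo, hi = self.ranges[d]
        return bin_index(value, lo, hi, self.n_bins)

    def _likelihood(self, label, hit):
        # Features are treated as independent given the class (the "naive" step).
        like = 1.0
        for d in (0, 1):
            count = self.counts[label][d][self._bin(hit[d], d)]
            like *= (count + self.alpha) / (self.sizes[label] + self.alpha * self.n_bins)
        return like

    def prob_positive(self, hit):
        total = self.sizes["pos"] + self.sizes["neg"]
        pos = self.sizes["pos"] / total * self._likelihood("pos", hit)
        neg = self.sizes["neg"] / total * self._likelihood("neg", hit)
        return pos / (pos + neg)


# Invented training hits; twice as many positives as negatives.
positives = [(60, 0.9), (55, 0.8), (70, 1.0), (65, 0.85)]
negatives = [(10, 0.1), (15, 0.2)]
clf = DiscreteNaiveBayes2D(n_bins=4).fit(positives, negatives)
print(clf.prob_positive((62, 0.88)), clf.prob_positive((12, 0.15)))
```

Because the features are binned, every cell of the grid carries a single probability, which is what partitions the hit space into regions.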

MetaCLADE - Hit filtering and domain prediction

1. Removal of overlapping hits
2. Selection of hits with prob ≥ 0.9
3. Selection of hits with best bit-score and % identity

[Figure: the domain hits remaining on the input sequence after each step; the final prediction in the example is D_2, D_8.]
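The three filtering steps can be sketched in Python as below. The slide does not define how overlaps are resolved, so this sketch makes explicit assumptions: step 1 keeps the best-scoring hit among overlapping hits of the same domain, step 2 keeps hits with probability of at least 0.9, and step 3 resolves the remaining overlaps between different domains by bit score and then % identity. The `Hit` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    domain: str
    start: int        # inclusive position on the input sequence
    end: int          # inclusive position on the input sequence
    bit_score: float
    identity: float   # fraction of identical residues
    prob: float       # probability assigned by the classifier


def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end


def keep_best_non_overlapping(hits, same_domain_only):
    """Greedy selection: best hits first, drop any hit clashing with a kept one."""
    ranked = sorted(hits, key=lambda h: (h.bit_score, h.identity), reverse=True)
    kept = []
    for hit in ranked:
        clash = any(
            overlaps(hit, k) and (not same_domain_only or hit.domain == k.domain)
            for k in kept
        )
        if not clash:
            kept.append(hit)
    return kept


def filter_hits(hits, min_prob=0.9):
    step1 = keep_best_non_overlapping(hits, same_domain_only=True)    # assumed meaning of step 1
    step2 = [h for h in step1 if h.prob >= min_prob]                  # step 2
    step3 = keep_best_non_overlapping(step2, same_domain_only=False)  # assumed meaning of step 3
    return sorted(step3, key=lambda h: h.start)


# Invented hits loosely mirroring the D_2 / D_6 / D_8 / D_3 example.
hits = [
    Hit("D2", 1, 50, 80.0, 0.6, 0.95),
    Hit("D2", 10, 60, 60.0, 0.5, 0.92),
    Hit("D6", 40, 90, 70.0, 0.5, 0.50),
    Hit("D8", 55, 120, 90.0, 0.7, 0.97),
    Hit("D3", 100, 150, 50.0, 0.4, 0.93),
]
print([h.domain for h in filter_hits(hits)])
```

A greedy pass is used here for brevity; other strategies for resolving overlaps would fit the same three-step outline.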

Results

MetaCLADE on metatranscriptomics

Marine eukaryotic phytoplankton metatranscriptomes: 1.5M high-quality cDNA sequences, average length of 242 bp

MetaCLADE - Functional Annotation

MetaCLADE - MetaCLADE vs HMMER (Ion transport)

MetaCLADE - Higher resolution

MetaCLADE - Comparison with other methods

Guerrero Negro Hypersaline Microbial Mats (GNHM): 100/200 bp

Tool             TP       FP      FN       TPR (%)  PPV (%)  F-score (%)

100 bp
UProC            336 302  20 258  448 249  42.9     94.3     58.9
MetaCLADE        323 009  12 145  461 542  41.2     96.4     57.7
HMMGRASP         328 729  37 224  455 822  41.9     89.8     57.1
MetaCLADE+UProC  405 734  20 145  378 817  51.7     95.3     67.0
UProC+MetaCLADE  406 370  25 965  378 181  51.8     94.0     66.8

200 bp
UProC            264 787  19 060  220 368  54.6     93.3     68.9
MetaCLADE        347 936  18 138  137 219  71.7     95.0     81.7
HMMGRASP         290 155  37 189  195 000  59.8     88.6     71.4
MetaCLADE+UProC  363 667  21 479  121 488  75.0     94.4     83.6
UProC+MetaCLADE  364 641  28 444  120 514  75.2     92.8     83.0

[Figure: precision-recall curve; x-axis: Recall (0.000 to 0.760), y-axis: Precision (0.92 to 1.00). UProC (AUC=0.529), MetaCLADE (AUC=0.699), MetaCLADE+UProC (AUC=0.727)]
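The TPR, PPV and F-score columns follow from the TP, FP and FN counts using the standard definitions: TPR = TP / (TP + FN), PPV = TP / (TP + FP), and the F-score is the harmonic mean of the two. The short Python sketch below recomputes two rows of the table; the function name `classification_metrics` is an assumption of this example.

```python
def classification_metrics(tp, fp, fn):
    """Return TPR (recall), PPV (precision) and F-score as percentages."""
    tpr = tp / (tp + fn)
    ppv = tp / (tp + fp)
    f_score = 2 * ppv * tpr / (ppv + tpr)
    return {"TPR": 100 * tpr, "PPV": 100 * ppv, "F-score": 100 * f_score}


# Counts copied from the table above: (TP, FP, FN).
rows = {
    "UProC, 100 bp": (336302, 20258, 448249),
    "MetaCLADE, 200 bp": (347936, 18138, 137219),
}

for name, (tp, fp, fn) in rows.items():
    metrics = classification_metrics(tp, fp, fn)
    print(name, {key: round(value, 1) for key, value in metrics.items()})
```

Rounded to one decimal, the recomputed values agree with the corresponding table rows.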

Future improvements

- More domains and new models for an improved annotation
- Constructing a library of conserved small motifs
- Annotation of longer sequences
- Reduction of the number of redundant models
- New criteria to filter overlapping hits

Conclusions

- Learning about the functional activity of the community and its sub-communities is a crucial step towards understanding species interactions and large-scale environmental impact
- Functional annotation methods need to be as precise as possible in identifying remote homology
- MetaCLADE allows for the discovery of patterns in highly divergent sequences
- Unknown sequences will grow in number, hence probabilistic models are expected to play a major role in the annotation of sequences spanning unrepresented sequence spaces

Thank you for your attention!

Acknowledgments

Ari Ugarte, Juliana Silva Bernardes, Alessandra Carbone

References

A. Ugarte, R. Vicedomini, J.S. Bernardes, A. Carbone, A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling, Microbiome, 2018, 6:149.

J.S. Bernardes, C. Vaquero, G. Zaverucha, A. Carbone, Improvement in protein domain identification is reached by breaking consensus, with the agreement of many profiles and domain co-occurrence, PLoS Computational Biology, 2016, 12(7):e1005038.