A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling

1 A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling
Ari Ugarte, Riccardo Vicedomini, Juliana Silva Bernardes, Alessandra Carbone
9 September 2018
Laboratory of Computational and Quantitative Biology (LCQB), Sorbonne University

2 Introduction

3 Metagenomic analysis workflow: functional profiling
Identifying protein domains in metagenomic (MG) and metatranscriptomic (MT) data makes it possible to study which functions are mainly expressed in a specific environment.

4 Protein domains
Size of individual structural domains:
- from 36 aa to 692 aa
- most have fewer than 200 residues
- average of 100 residues

5 Domain identification in short MG/MT reads
Very short fragments (150-300 bp)
Two possible approaches:
- Assembly-based (e.g., HMM-GRASPx)
- Direct read annotation (e.g., MetaCLADE, UProC)

6 How can we represent a domain?
Multiple sequence alignment (MSA) of homologous domain sequences
[Figure: WebLogo sequence logos of the aligned domain sequences]
Probabilistic representations:
- position-specific scoring matrices (PSSMs)
- profile hidden Markov models (pHMMs)
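To make the PSSM representation concrete, here is a minimal Python sketch that turns a toy MSA into log-odds column scores. It is an illustration only, not the way Pfam or CLADE models are built: the uniform background frequency, the pseudocount, the function names `build_pssm` and `score`, and the toy alignment are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    pssm = []
    for col in range(length):
        # Gap characters and other non-standard symbols are ignored.
        counts = Counter(row[col] for row in msa if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, fragment):
    """Sum the column scores of an ungapped fragment aligned at column 0."""
    return sum(column[aa] for column, aa in zip(pssm, fragment))


toy_msa = ["MAKL", "MARL", "MSKL", "MAKI"]  # hypothetical alignment
pssm = build_pssm(toy_msa)
```

A fragment that resembles the alignment, such as `score(pssm, "MAKL")`, scores higher than an unrelated one, such as `score(pssm, "WWWW")`. A profile HMM extends this idea with explicit insertion and deletion states.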

7 The Pfam database
- A large collection of protein domain families
- Each family is represented by two MSAs (Full and Seed) and a profile HMM

8 CLADE (CLoser sequences for Annotations Directed by Evolution)

9 MetaCLADE

10 MetaCLADE - Main features
- Extends CLADE to handle MG/MT data
- Puts all domain hits in a two-dimensional space
- Uses two-dimensional thresholds (defined with a naive Bayes classifier) to assign confidence values to predictions
- Provides data visualization of functional annotation

11 MetaCLADE - General overview
Input: predicted CDS/ORFs in MG/MT reads (e.g., input sequence MAKLKVANDKA...)
CLADE model library: for each domain D_i (i = 1..N), the CCMs and the global-consensus model
Pre-computation of the domain-dependent probability space: sets of positive and negative sequences, then identification of domain-specific separating parameters
Hit identification: conservation profiles give domain hits on the input sequence
Domain prediction:
1. Removal of overlapping hits
2. Selection of hits passing the probability threshold
3. Selection of hits with best bit-score and % identity

12-16 MetaCLADE - Model construction
Inherited from CLADE: several models are built in order to represent each known Pfam family.
For each Pfam domain D_i (i = 1..M):
- Seed_i gives the profile HMM pHMM_i (global consensus model)
- Full_i gives the sequences S_i^1 ... S_i^{n_i}, with n_i ≤ 350
- Each sequence S_i^j is run through PSI-BLAST against NR to build a clade-centered model CCM_i^j
CLADE library:
- Global consensus models: pHMM_1 ... pHMM_M
- Clade-centered models: CCM_1^1 ... CCM_M^{n_M}
Pfam27: domains, CCMs (PSSMs) and pHMMs

17 MetaCLADE - General idea
[Figure: domain hits from several models (D2, D3, D6, D8) placed along an input sequence]

18 MetaCLADE - Training set
The set of positive sequences for a domain D_i is based on suffixes, prefixes and random fragments from Seed_i.
The set of negative sequences for a domain D_i is based on three different methods:
1. 2-mer shuffling
2. sequence reversal
3. Markov model (probabilities based on 4-mers of Seed_i)
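The training-set construction on this slide can be sketched in Python as below. This is a rough illustration under stated assumptions, not MetaCLADE's code: the fragment length, the number of random fragments, the reading of "2-mer shuffling" as a permutation of non-overlapping 2-mer blocks, the order-3 Markov chain estimated from 4-mers, and every function name are hypothetical choices.

```python
import random
from collections import defaultdict


def positive_fragments(seq, frag_len, n_random, rng):
    """Prefix, suffix and random fragments of one seed sequence."""
    if len(seq) <= frag_len:
        return [seq]
    frags = [seq[:frag_len], seq[-frag_len:]]
    for _ in range(n_random):
        start = rng.randrange(len(seq) - frag_len + 1)
        frags.append(seq[start:start + frag_len])
    return frags


def shuffle_2mers(seq, rng):
    """Assumed reading of '2-mer shuffling': permute non-overlapping 2-mer blocks."""
    blocks = [seq[i:i + 2] for i in range(0, len(seq), 2)]
    rng.shuffle(blocks)
    return "".join(blocks)


def reverse(seq):
    """Negative sequence obtained by reversing the residue order."""
    return seq[::-1]


def markov_sample(seeds, length, rng, k=4):
    """Sample from an order-(k-1) Markov chain estimated from the k-mers of the seeds."""
    transitions = defaultdict(list)
    for s in seeds:
        for i in range(len(s) - k + 1):
            transitions[s[i:i + k - 1]].append(s[i + k - 1])
    contexts = list(transitions)
    if not contexts:
        raise ValueError("seed sequences are shorter than k")
    out = rng.choice(contexts)
    while len(out) < length:
        nxt = transitions.get(out[-(k - 1):])
        # Unseen context: restart from a randomly chosen observed context.
        out += rng.choice(nxt) if nxt else rng.choice(contexts)
    return out[:length]


rng = random.Random(0)
seeds = ["MAKLKVANDKAWERTYHG", "MSKLRVANDRAWEKTYHA"]  # hypothetical seed sequences
positives = [f for s in seeds for f in positive_fragments(s, 10, 2, rng)]
negatives = [
    shuffle_2mers(positives[0], rng),
    reverse(positives[0]),
    markov_sample(seeds, 10, rng),
]
```

Each negative generator keeps the length of the positive fragments while removing the order of residues that carries the domain signal.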

19 MetaCLADE - Training set
- Negative sequences are generated until the ratio of negative to positive sequences reaches 1/2
- A two-dimensional space is generated from the bit score and the mean bit score of domain hits

20 MetaCLADE - Hit classification
A discrete version of a naive Bayes classifier is used to partition the hit space into regions, each with an associated probability.
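One way to picture a discrete naive Bayes classifier over the two hit features (bit score and mean bit score) is the Python sketch below. It is an assumption-laden illustration, not MetaCLADE's implementation: the bin edges, the Laplace smoothing parameter `alpha`, the class name and the toy training hits are all hypothetical.

```python
from collections import Counter


class DiscreteNaiveBayes2D:
    """Naive Bayes over two binned features: bit score and mean bit score."""

    def __init__(self, bit_edges, mean_edges, alpha=1.0):
        self.edges = (bit_edges, mean_edges)  # assumed, sorted bin boundaries
        self.alpha = alpha                    # Laplace smoothing (assumption)

    @staticmethod
    def _bin(value, edges):
        # Bin index = number of edges that are less than or equal to the value.
        return sum(value >= e for e in edges)

    def fit(self, hits, labels):
        """hits: (bit_score, mean_bit_score) pairs; labels: True for positives."""
        self.class_counts = Counter(labels)
        self.feature_counts = {
            (c, f): Counter() for c in (True, False) for f in (0, 1)
        }
        for hit, label in zip(hits, labels):
            for f in (0, 1):
                self.feature_counts[(label, f)][self._bin(hit[f], self.edges[f])] += 1
        return self

    def prob_positive(self, hit):
        """Posterior probability that a hit falls in a 'positive' region."""
        total = sum(self.class_counts.values())
        joint = {}
        for c in (True, False):
            p = self.class_counts[c] / total
            for f in (0, 1):
                n_bins = len(self.edges[f]) + 1
                count = self.feature_counts[(c, f)][self._bin(hit[f], self.edges[f])]
                p *= (count + self.alpha) / (self.class_counts[c] + self.alpha * n_bins)
            joint[c] = p
        return joint[True] / (joint[True] + joint[False])


# Hypothetical training hits: (bit score, mean bit score)
hits = [(60, 0.9), (55, 0.8), (48, 0.7), (12, 0.2), (15, 0.1), (20, 0.3)]
labels = [True, True, True, False, False, False]
model = DiscreteNaiveBayes2D([20, 40], [0.4, 0.6]).fit(hits, labels)
```

Because both features are binned, every cell of the two-dimensional grid gets a single probability, which is what lets the hit space be partitioned into regions.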

21 MetaCLADE - Hit filtering and domain prediction
1. Removal of overlapping hits
2. Selection of hits passing the probability threshold 0.9
3. Selection of hits with best bit-score and % identity
Final prediction
[Figure: example hits (D2, D3, D6, D8) on an input sequence, narrowed step by step to the final prediction]
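The three filtering steps can be sketched in Python as follows. The slide does not say how overlaps are resolved, so this sketch makes explicit assumptions: step 1 removes overlapping hits of the same domain and keeps the best-scoring one, step 2 uses `>=` against the 0.9 threshold, and step 3 resolves the remaining overlaps greedily by bit score and then % identity. The `Hit` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    domain: str
    start: int        # inclusive
    end: int          # exclusive
    bit_score: float
    identity: float   # percent identity
    prob: float       # probability assigned by the hit classifier


def overlaps(a, b):
    """True if the two half-open intervals share at least one position."""
    return a.start < b.end and b.start < a.end


def greedy_non_overlapping(hits, key, same_domain_only=False):
    """Keep hits in decreasing key order, skipping those that clash with a kept hit."""
    kept = []
    for h in sorted(hits, key=key, reverse=True):
        clash = any(
            overlaps(h, k) and (not same_domain_only or h.domain == k.domain)
            for k in kept
        )
        if not clash:
            kept.append(h)
    return kept


def filter_hits(hits, min_prob=0.9):
    # Step 1 (assumed): among overlapping hits of the same domain, keep the best score.
    step1 = greedy_non_overlapping(hits, key=lambda h: h.bit_score, same_domain_only=True)
    # Step 2: keep hits passing the probability threshold (>= is an assumption).
    step2 = [h for h in step1 if h.prob >= min_prob]
    # Step 3 (assumed): resolve remaining overlaps by bit score, then % identity.
    step3 = greedy_non_overlapping(step2, key=lambda h: (h.bit_score, h.identity))
    return sorted(step3, key=lambda h: h.start)


hits = [  # hypothetical hits on one input sequence
    Hit("D2", 0, 50, 40.0, 35.0, 0.95),
    Hit("D2", 10, 60, 30.0, 30.0, 0.97),
    Hit("D8", 40, 90, 55.0, 42.0, 0.99),
    Hit("D6", 100, 150, 25.0, 28.0, 0.50),
    Hit("D3", 160, 200, 33.0, 31.0, 0.92),
]
prediction = filter_hits(hits)
```

In this toy input the second D2 hit is dropped in step 1, the D6 hit in step 2, and the first D2 hit loses to the overlapping D8 hit in step 3.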

22 Results

23 MetaCLADE on metatranscriptomics
- Marine eukaryotic phytoplankton metatranscriptomes
- 1.5M high-quality cDNA sequences, average length of 242 bp

24 MetaCLADE - Functional annotation

25 MetaCLADE - MetaCLADE vs HMMER (ion transport)

26 MetaCLADE - Higher resolution

27 MetaCLADE - Comparison with other methods
Dataset: Guerrero Negro Hypersaline Microbial Mats (GNHM), 100 bp and 200 bp reads
[Table: TP, FP, FN, TPR, PPV and F-score for UProC, MetaCLADE, HMM-GRASPx, MetaCLADE+UProC and UProC+MetaCLADE at 100 bp and 200 bp]
[Figure: precision-recall curve. UProC (AUC = 0.529), MetaCLADE (AUC = 0.699), MetaCLADE+UProC (AUC = 0.727)]
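For reference, the table columns relate to each other as in the short Python sketch below. It assumes that the F-score is the usual F1 score (the harmonic mean of PPV and TPR), which the slide does not state, and the counts in the example are made up, not values from the comparison.

```python
def classification_metrics(tp, fp, fn):
    """TPR (recall), PPV (precision) and F-score from raw counts."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    # Assumption: F-score means F1, the harmonic mean of PPV and TPR.
    f_score = 2 * ppv * tpr / (ppv + tpr) if ppv + tpr else 0.0
    return {"TPR": tpr, "PPV": ppv, "F-score": f_score}


metrics = classification_metrics(tp=80, fp=20, fn=20)  # hypothetical counts
```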

28 Future improvements
- More domains and new models for an improved annotation
- Constructing a library of conserved small motifs
- Annotation of longer sequences
- Reduction of the number of redundant models
- New criteria to filter overlapping hits

29 Conclusions
- Learning about the functional activity of the community and its sub-communities is a crucial step in understanding species interactions and large-scale environmental impact
- Functional annotation methods need to be as precise as possible in identifying remote homology
- MetaCLADE allows for the discovery of patterns in highly divergent sequences
- The number of unknown sequences will keep growing, hence probabilistic models are expected to play a major role in the annotation of sequences spanning unrepresented sequence spaces

30 Thank you for your attention!
Acknowledgments: Ari Ugarte, Juliana Silva Bernardes, Alessandra Carbone
References:
- A. Ugarte, R. Vicedomini, J.S. Bernardes, A. Carbone, A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling, Microbiome, 2018, 6:149.
- J.S. Bernardes, C. Vaquero, G. Zaverucha, A. Carbone, Improvement in protein domain identification is reached by breaking consensus, with the agreement of many profiles and domain co-occurrence, PLoS Computational Biology, (7):e
