A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling

Ari Ugarte, Riccardo Vicedomini, Juliana Silva Bernardes, Alessandra Carbone
9 September 2018
Laboratory of Computational and Quantitative Biology (LCQB), Sorbonne University

Introduction

Metagenomic analysis workflow
Functional profiling: identifying protein domains in metagenomic (MG) and metatranscriptomic (MT) data makes it possible to determine which functions are mainly expressed in a specific environment.

Protein domains
Size of individual structural domains:
- from 36 aa to 692 aa
- most have fewer than 200 residues
- about 100 residues on average

Domain identification in short MG/MT reads
Reads are very short fragments (150-300 bp). Two possible approaches:
- Assembly-based (e.g., HMM-GRASPx)
- Direct read annotation (e.g., MetaCLADE, UProC)
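To make the length constraint concrete, here is a minimal back-of-the-envelope sketch. The read lengths and domain sizes are taken from the slides; the function name is illustrative, and the calculation assumes a read translated in a single frame with no untranslated bases.

```python
# How much of a protein domain can a single short read cover at most?
# Read lengths (150-300 bp) and domain sizes (36, 100, 692 aa) come from the slides.

def max_residues(read_bp: int) -> int:
    """Longest peptide (in aa) encoded by a read of read_bp nucleotides in one frame."""
    return read_bp // 3

for read_bp in (150, 300):
    aa = max_residues(read_bp)
    for domain_aa in (36, 100, 692):
        coverage = min(1.0, aa / domain_aa)
        print(f"{read_bp} bp read -> {aa} aa, "
              f"covers at most {coverage:.0%} of a {domain_aa}-aa domain")
```

Under these assumptions a 150 bp read encodes at most 50 residues, so an average-sized domain can only be seen as a fragment.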

How can we represent a domain?
- Multiple sequence alignment (MSA) of homologous domain sequences
- Probabilistic representations:
  - position-specific scoring matrices (PSSMs)
  - profile hidden Markov models (pHMMs)
[Figure: sequence logos of a domain alignment, generated with WebLogo 3.6.0]
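As an illustration of the PSSM idea, the sketch below turns a toy alignment into per-column log-odds scores. It is a simplified teaching example, not the procedure used by PSI-BLAST or CLADE: the uniform background, the flat pseudocount, the ungapped scoring and the toy alignment are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(msa, pseudocount=1.0):
    """Return one dict per alignment column with log2-odds scores.

    Assumptions: uniform background frequencies and a flat pseudocount.
    """
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*msa):
        # Gap characters and non-standard symbols are ignored.
        counts = Counter(c for c in column if c in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, fragment):
    """Sum the column scores of an ungapped fragment aligned at column 0."""
    return sum(col[aa] for col, aa in zip(pssm, fragment))

toy_msa = ["MAKLK", "MAKIK", "MSKLR", "MAKLK"]  # invented toy alignment
pssm = build_pssm(toy_msa)
print(score(pssm, "MAKLK") > score(pssm, "WWWWW"))
```

A fragment resembling the alignment scores higher than an unrelated one, which is the property a profile-based search relies on.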

The Pfam database
- A large collection of protein domain families
- Each family is represented by two MSAs (Full and Seed) and a profile HMM

CLADE (CLoser sequences for Annotations Directed by Evolution)

MetaCLADE

MetaCLADE - Main features
- Extends CLADE to handle MG/MT data
- Places all domain hits in a two-dimensional space
- Uses two-dimensional thresholds (defined with a naive Bayes classifier) to assign confidence values to predictions
- Provides data visualization of the functional annotation

MetaCLADE - General overview
[Figure: pipeline diagram]
- Input: predicted CDS/ORFs in MG/MT reads
- Pre-computation: for each domain D_i in the CLADE model library (CCMs and global-consensus model), sets of positive and negative sequences are used to identify domain-specific separating parameters and a domain-dependent probability space
- Hit identification: the conservation profiles produce domain hits on the input sequence
- Domain prediction:
  1. Removal of overlapping hits
  2. Selection of hits passing the probability threshold of 0.9
  3. Selection of hits with best bit-score and % identity

MetaCLADE - Model construction
Several models are built to represent each known Pfam family (inherited from CLADE). For each Pfam domain D_i (i = 1, ..., M):
- Global-consensus model: the profile HMM pHMM_i associated with Seed_i
- Clade-centered models (CCMs): sequences S_i^1, ..., S_i^{n_i} (n_i ≤ 350) are taken from Full_i, and each S_i^j is run through PSI-BLAST against NR to build the model CCM_i^j
The CLADE library collects the global-consensus models (pHMM_1, ..., pHMM_M) and the clade-centered models (CCM_1^1, ..., CCM_M^{n_M}).
Pfam27: 14,831 domains; 2,389,235 CCMs (PSSMs) and 14,831 pHMMs
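The library size quoted for Pfam27 can be put in perspective with a short calculation. The three counts come from the slide; the variable names are illustrative.

```python
# Size of the model library, using the Pfam27 counts quoted on the slide.
n_domains = 14_831
n_ccms = 2_389_235   # clade-centered models (PSSMs)
n_phmms = 14_831     # one global-consensus profile HMM per domain

total_models = n_ccms + n_phmms
avg_ccms_per_domain = n_ccms / n_domains

print(total_models)                   # 2404066
print(round(avg_ccms_per_domain, 1))  # 161.1
```

Each domain is therefore represented by roughly 160 clade-centered models on average, in addition to its single consensus model.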

MetaCLADE - General idea
[Figure: overlapping candidate domain hits laid out along an input sequence]

MetaCLADE - Training set
The set of positive sequences for a domain D_i is built from suffixes, prefixes and random fragments of the sequences in Seed_i.
The set of negative sequences for a domain D_i is built with three different methods:
1. 2-mer shuffling
2. sequence reversal
3. Markov model (probabilities based on the 4-mers of Seed_i)
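The generation of training sequences can be sketched as follows. This is an illustrative reading of the slide, not the MetaCLADE implementation: all function names are hypothetical, "2-mer shuffling" is interpreted here as shuffling non-overlapping 2-residue blocks, the Markov model is taken to be an order-3 chain estimated from 4-mer counts, and the restart on unseen contexts is an added assumption.

```python
import random
from collections import Counter, defaultdict

def positive_fragments(seq, length, rng):
    """Return a prefix, a suffix and one random fragment of the given length."""
    if not 1 <= length <= len(seq):
        raise ValueError("length must be between 1 and len(seq)")
    start = rng.randint(0, len(seq) - length)
    return [seq[:length], seq[-length:], seq[start:start + length]]

def reversed_negative(seq):
    """Negative example obtained by reversing the sequence."""
    return seq[::-1]

def block_shuffled_negative(seq, rng, k=2):
    """Shuffle non-overlapping k-residue blocks (assumed reading of '2-mer shuffling')."""
    blocks = [seq[i:i + k] for i in range(0, len(seq), k)]
    rng.shuffle(blocks)
    return "".join(blocks)

def markov_negative(seeds, length, rng, k=4):
    """Sample from an order-(k-1) Markov chain whose transitions are k-mer counts."""
    transitions = defaultdict(Counter)
    for s in seeds:
        for i in range(len(s) - k + 1):
            transitions[s[i:i + k - 1]][s[i + k - 1]] += 1
    if not transitions:
        raise ValueError("seed sequences are shorter than k")
    contexts = sorted(transitions)
    out = rng.choice(contexts)
    while len(out) < length:
        nxt = transitions.get(out[-(k - 1):])
        if nxt is None:
            # Unseen context: restart from a random observed one (assumption).
            out += rng.choice(contexts)
        else:
            residues, weights = zip(*sorted(nxt.items()))
            out += rng.choices(residues, weights=weights)[0]
    return out[:length]

rng = random.Random(0)
seed_seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # invented toy sequence
print(positive_fragments(seed_seq, 10, rng))
print(reversed_negative(seed_seq))
print(block_shuffled_negative(seed_seq, rng))
print(markov_negative([seed_seq], 30, rng))
```

Reversal and block shuffling preserve the residue composition of the original sequence, while the Markov sampler preserves only its local 4-mer statistics.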

MetaCLADE - Training set
- Negative sequences are generated until the ratio of negative to positive sequences reaches 1/2
- A two-dimensional space is generated by considering the bit score and the mean bit score of domain hits
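A hit can be mapped to a point of this two-dimensional space as sketched below. The slide does not define the mean bit score; the sketch assumes it is the bit score divided by the length of the hit. The `Hit` record and its field names are hypothetical.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    domain: str
    start: int        # 0-based, inclusive
    end: int          # exclusive
    bit_score: float

def to_point(hit: Hit) -> tuple:
    """Map a hit to (bit score, mean bit score).

    Assumption: mean bit score = bit score / hit length.
    """
    length = hit.end - hit.start
    if length <= 0:
        raise ValueError("hit must span at least one residue")
    return (hit.bit_score, hit.bit_score / length)

print(to_point(Hit("D2", 0, 50, 40.0)))  # (40.0, 0.8)
```

Under this assumption the second coordinate separates short, dense matches from long, diluted ones that reach the same total score.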

MetaCLADE - Hit classification
A discrete version of a naive Bayes classifier is used to partition the hit space into regions, each with an associated probability
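A discrete naive Bayes classifier over two binned features can be sketched in a few lines. This is a generic illustration of the technique, not MetaCLADE's code: the class name, the bin edges, the Laplace smoothing and the toy training points are all assumptions.

```python
import math
from collections import Counter

def bin_index(value, edges):
    """Index of the bin containing value; edges are sorted interior cut points."""
    return sum(value >= e for e in edges)

class DiscreteNaiveBayes2D:
    """Naive Bayes over two independently binned features with Laplace smoothing."""

    def __init__(self, x_edges, y_edges, alpha=1.0):
        self.edges = (x_edges, y_edges)
        self.alpha = alpha

    def fit(self, points, labels):
        self.class_counts = Counter(labels)
        self.bin_counts = {c: (Counter(), Counter()) for c in self.class_counts}
        for point, c in zip(points, labels):
            for dim, value in enumerate(point):
                self.bin_counts[c][dim][bin_index(value, self.edges[dim])] += 1
        return self

    def _log_joint(self, point, c):
        n = self.class_counts[c]
        lp = math.log(n / sum(self.class_counts.values()))
        for dim, value in enumerate(point):
            n_bins = len(self.edges[dim]) + 1
            count = self.bin_counts[c][dim][bin_index(value, self.edges[dim])]
            lp += math.log((count + self.alpha) / (n + self.alpha * n_bins))
        return lp

    def prob_positive(self, point, positive=1):
        """Posterior probability of the positive class for the region containing point."""
        logs = {c: self._log_joint(point, c) for c in self.class_counts}
        m = max(logs.values())
        z = sum(math.exp(v - m) for v in logs.values())
        return math.exp(logs[positive] - m) / z

# Toy training data: (bit score, mean bit score); 1 = positive, 0 = negative.
points = [(60, 0.9), (55, 0.8), (50, 0.85), (12, 0.2), (15, 0.25), (10, 0.1)]
labels = [1, 1, 1, 0, 0, 0]
model = DiscreteNaiveBayes2D(x_edges=[20, 40], y_edges=[0.3, 0.6]).fit(points, labels)
print(round(model.prob_positive((58, 0.9)), 3))
print(round(model.prob_positive((11, 0.15)), 3))
```

Because the features are binned, every point falling in the same cell of the grid receives the same posterior, which is what turns the hit space into regions with an associated probability.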

MetaCLADE - Hit filtering and domain prediction
1. Removal of overlapping hits
2. Selection of hits passing the probability threshold of 0.9
3. Selection of hits with best bit-score and % identity
[Figure: the candidate hits remaining after each step, down to the final prediction]
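The three filtering steps can be sketched as below. The slide gives only the step names, so several details are assumptions: step 1 is read as removing overlapping hits of the same domain, step 2 uses an inclusive threshold (prob >= 0.9), step 3 resolves the remaining overlaps greedily by bit score and then % identity, and all names and toy values are hypothetical.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    domain: str
    start: int       # 0-based, inclusive
    end: int         # exclusive
    bit_score: float
    identity: float  # fraction of identical positions
    prob: float      # probability assigned by the classifier

def overlaps(a, b):
    return a.start < b.end and b.start < a.end

def greedy_non_overlapping(hits, same_domain_only):
    """Keep hits in decreasing (bit score, identity) order, skipping clashes."""
    kept = []
    for h in sorted(hits, key=lambda h: (h.bit_score, h.identity), reverse=True):
        clash = any(
            overlaps(h, k) and (not same_domain_only or h.domain == k.domain)
            for k in kept
        )
        if not clash:
            kept.append(h)
    return kept

def predict_domains(hits, min_prob=0.9):
    step1 = greedy_non_overlapping(hits, same_domain_only=True)   # assumed reading of step 1
    step2 = [h for h in step1 if h.prob >= min_prob]              # inclusive threshold assumed
    step3 = greedy_non_overlapping(step2, same_domain_only=False)
    return sorted(step3, key=lambda h: h.start)

toy_hits = [
    Hit("D2", 0, 40, 50.0, 0.60, 0.97),
    Hit("D2", 5, 45, 42.0, 0.55, 0.95),   # weaker overlapping hit of the same domain
    Hit("D6", 10, 50, 30.0, 0.40, 0.95),  # overlaps D2 with a lower score
    Hit("D8", 60, 100, 35.0, 0.50, 0.93),
    Hit("D3", 62, 98, 33.0, 0.45, 0.70),  # below the probability threshold
]
print([h.domain for h in predict_domains(toy_hits)])  # ['D2', 'D8']
```

A greedy selection is the simplest way to obtain a non-overlapping architecture; other strategies, such as choosing the set of compatible hits with maximal total score, would fit the same three-step outline.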

Results

MetaCLADE on metatranscriptomics
Marine eukaryotic phytoplankton metatranscriptomes: 1.5M high-quality cDNA sequences, average length of 242 bp

MetaCLADE - Functional Annotation

MetaCLADE - MetaCLADE vs HMMER (Ion transport)

MetaCLADE - Higher resolution

MetaCLADE - Comparison with other methods
Guerrero Negro Hypersaline Microbial Mats (GNHM), 100 bp and 200 bp reads

Read length  Tool             TP       FP      FN       TPR (%)  PPV (%)  F-score (%)
100 bp       UProC            336,302  20,258  448,249  42.9     94.3     58.9
100 bp       MetaCLADE        323,009  12,145  461,542  41.2     96.4     57.7
100 bp       HMMGRASP         328,729  37,224  455,822  41.9     89.8     57.1
100 bp       MetaCLADE+UProC  405,734  20,145  378,817  51.7     95.3     67.0
100 bp       UProC+MetaCLADE  406,370  25,965  378,181  51.8     94.0     66.8
200 bp       UProC            264,787  19,060  220,368  54.6     93.3     68.9
200 bp       MetaCLADE        347,936  18,138  137,219  71.7     95.0     81.7
200 bp       HMMGRASP         290,155  37,189  195,000  59.8     88.6     71.4
200 bp       MetaCLADE+UProC  363,667  21,479  121,488  75.0     94.4     83.6
200 bp       UProC+MetaCLADE  364,641  28,444  120,514  75.2     92.8     83.0

[Figure: precision-recall curve (precision axis from 0.92 to 1.00, recall axis from 0 to 0.76). UProC (AUC=0.529), MetaCLADE (AUC=0.699), MetaCLADE+UProC (AUC=0.727)]
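The TPR, PPV and F-score columns follow from the TP, FP and FN counts by the standard definitions, as the short check below illustrates. The function name is illustrative; the counts are copied from the table.

```python
def metrics(tp: int, fp: int, fn: int) -> tuple:
    """Return (TPR, PPV, F-score) as percentages rounded to one decimal."""
    tpr = tp / (tp + fn)                # recall / true positive rate
    ppv = tp / (tp + fp)                # precision / positive predictive value
    f = 2 * ppv * tpr / (ppv + tpr)     # harmonic mean of precision and recall
    return tuple(round(100 * v, 1) for v in (tpr, ppv, f))

print(metrics(336_302, 20_258, 448_249))  # UProC, 100 bp     -> (42.9, 94.3, 58.9)
print(metrics(347_936, 18_138, 137_219))  # MetaCLADE, 200 bp -> (71.7, 95.0, 81.7)
```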

Future improvements
- More domains and new models for an improved annotation
- Construction of a library of conserved small motifs
- Annotation of longer sequences
- Reduction of the number of redundant models
- New criteria to filter overlapping hits

Conclusions
- Learning about the functional activity of a community and its sub-communities is a crucial step towards understanding species interactions and large-scale environmental impact
- Functional annotation methods need to be as precise as possible in identifying remote homology
- MetaCLADE allows for the discovery of patterns in highly divergent sequences
- The number of unknown sequences will keep growing, so probabilistic models are expected to play a major role in annotating sequences that span unrepresented regions of sequence space

Thank you for your attention!

Acknowledgments
Ari Ugarte, Juliana Silva Bernardes, Alessandra Carbone

References
- A. Ugarte, R. Vicedomini, J.S. Bernardes, A. Carbone, A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling, Microbiome, 2018, 6:149.
- J.S. Bernardes, C. Vaquero, G. Zaverucha, A. Carbone, Improvement in protein domain identification is reached by breaking consensus, with the agreement of many profiles and domain co-occurrence, PLoS Computational Biology, 2016, 12(7):e1005038.