TrAnsFuSE refines the search for protein function: oxidoreductases†‡


Integrative Biology
PAPER
DOI: /c2ib00131d

Arye Harel,*ab Paul Falkowski a and Yana Bromberg*b

Received 10th October 2011, Accepted 25th February 2012

Non-equilibrium catalysis of electron transfer reactions (i.e. redox) regulates the flux of key elements found in biological macromolecules. The enzymes responsible, oxidoreductases, contain specific transition metals in poorly sequence-conserved domains. These domains evolved ~2.4 billion years ago in microbes and spread across the tree of life. We lack understanding of how oxidoreductases evolved; divergence of sequences makes identification difficult. We developed a method to recognise the various versions of these enzyme domains in unannotated sequence space. Often, homology is used to transfer function annotations from experimentally resolved domains to unannotated sequences. Unreliability of inferring homology below 30% sequence identity limits single-sequence based searches. Misaligned functional sites may compromise annotation transfer from even very similar sequences. Combining profile-based searches with knowledge of functional sites could improve domain detection accuracy. Here we present an approach that enhances the search for redox domains using catalytic site annotations. From the scientific literature, we validated annotations of 104 InterPro domains indicated as using transition metals in redox reactions. These domains mediate electron transfer in 20% of oxidoreductases, primarily employing iron, copper and molybdenum. We used the experimentally identified catalytic residues in these domains to validate sequence alignment-based protein function annotations. Our method, TrAnsFuSE, is 11% and 14% more accurate than PSI-BLAST and InterPro, respectively.
Moreover, it is robust for use with other functional residues: we attain higher accuracy at comparable coverage using metal binding, in addition to catalytic, sites. TrAnsFuSE can be used to focus the study of the vast amounts of unannotated sequencing data from meta-/genome projects.

a Environmental Biophysics and Molecular Ecology Program, Institute of Marine and Coastal Science, Rutgers, The State University of New Jersey, 71 Dudley Road, New Brunswick, NJ 08901, USA. harel@marine.rutgers.edu; Fax: ; Tel: x412
b Department of Biochemistry and Microbiology, Rutgers, The State University of New Jersey, Lipman Hall 218, New Brunswick, NJ 08901, USA. yanab@rci.rutgers.edu; Fax: ; Tel: x203
† Published as part of an ibiology themed issue entitled Computational Integrative Biology, Guest Editor: Prof. Jan Baumbach.
‡ Electronic supplementary information (ESI) available: Tables S1 and S2. See DOI: /c2ib00131d

Insight, innovation, integration

Oxidoreductases regulate key processes of life like photosynthesis and respiration. The phylogeny of these ancient and far diverged domains is hard to reconstruct. Identifying new sequences for these domains elucidates their evolutionary paths. Here, we built TrAnsFuSE, a novel approach that integrates sequence alignments with knowledge of functional sites to find new metal-redox oxidoreductase domains. Fewer sequences are identified in total, but the found set is virtually guaranteed to be correct. TrAnsFuSE can filter unannotated data from meta-/genome sequencing projects for high quality oxidoreductase matches. Moreover, the methodology can be easily adapted to fit other functional families.

Introduction

In nature, H, C, N, O, S, and P serve as the core elements used for the synthesis of biological macromolecules. 1 These elements are often found in molecules with oxidation states that make them biologically inaccessible. Redox reactions, mediated by oxidoreductases (Enzyme Commission 2 class 1; EC1), evolved to facilitate the flux of the first five of these elements. 3 Oxidoreductases catalyse the transfer of electrons using transition metals and other prosthetic groups. These enzymes are found in all kingdoms of life, but their catalytic domains evolved in microbes before the Great Oxidation Event (GOE) over 2.4 billion years ago. 4 Over this time, the sequences containing these domains diverged, making their phylogeny difficult to reconstruct.

Searching for specific functionality, such as redox, in a set of unannotated sequences is generally based on homology to experimentally annotated genes/proteins. It is generally impossible to infer the function of one protein from another below 30% sequence identity. 5 In fact, some studies claim that the threshold is even higher (60%). 6 Even function annotation transfer from very similar sequences may be compromised when aligned functional residues differ.

Profile searches, e.g. using Multiple Sequence Alignments (MSAs) 7 or Hidden Markov Models (HMMs), 8 are generally better at function transfer 9 as they describe the full set of functionally similar sequences. Profiles are built by aligning multiple homologous sequences and are meant to capture family-specific information, including functionally and structurally important residues. Using profiles helps identify distant homologues that are not recognized from alignments to single sequences. One example of a profile repository and search engine is InterPro, a widely used database of protein domains, 10 integrating profiles from 12 different resources. Even profile searches, however, can be improved by focusing more on the function-specific residues than on others. 11 Unfortunately, building a profile that highlights the functional residues and increases the probability of finding sequences containing these residues is not a trivial task. 12

In one effort 11 to improve function transfer by homology, catalytic residue annotations in the Catalytic Site Atlas (CSA) 13 were used to search for functionally related enzymes. In this study, function annotation transfer was only allowed when a homologue found by PSI-BLAST also conserved all catalytic residues.
This approach, icsa (Inpharmatica CSA) filtering, is reported to have achieved 87% accuracy in finding enzymes with the same EC numbers at the level of the third digit.

Here we show that searching for transition metal-binding redox proteins can similarly be improved. We introduce a novel tool, TrAnsFuSE (Transfer of Annotations of Function using Sequence Elements; flow chart in Fig. 1A), which can use function-defining sequence elements such as catalytic or metal binding sites to aid function transfer by homology. TrAnsFuSE accuracy of function transfer using catalytic sites, an approach similar to icsa, is 94% (11% and 14% better than PSI-BLAST and InterPro, respectively). Filtering for metal binding residues, instead of catalytic sites, attains 71% accuracy (17% better than InterPro alone for the same set of sequences). Finally, the accuracy of function transfer attained by extending the concept of site-based filtering to the combination of metal and catalytic sites is 97%; i.e. 14%, 12%, and 3% higher than that of InterPro, PSI-BLAST, and catalytic site-based TrAnsFuSE, respectively. With this approach we pick up nearly 55 thousand high-accuracy novel oxidoreductases from TrEMBL 14 and annotate as redox-related roughly proteins in otherwise unannotated Global Ocean Sampling (GOS) expedition 15 data. The TrAnsFuSE methodology can thus potentially be used to improve function annotation for all of the unannotated sequence space.

Results and discussion

Iron, copper and molybdenum are the primary mediators of electron transfer

We extracted from InterPro 104 transition metal-utilizing redox-related domains (methods; Table S1, ESI‡). At least half of these domains are validated for containing metal binding sites, using annotations from InterPro component databases (e.g. PFam, 8b TIGRFAMs, 16 SUPERFAMILY 17). There are hits for these domains in 7201 Swiss-Prot 18 entries (20% of all annotated EC1s). A small number of hits are ambiguous with regard to the metal they bind.
(1) ~4.5% of the InterPro hits are cambialistic; i.e. each sequence can utilize varied transition metals to catalyse redox reactions.
(2) ~1.5% are not explicitly defined by InterPro; i.e. the profile picks up different metal binding sequences.
(3) Vanadium, tungsten, manganese and nickel binding sequences are uncommon (1%).
(4) The majority (79.5%) of the hit domains solely use iron for electron transfer. Iron is present as: iron-sulfur clusters (56%; 45% 4Fe-4S and 11% 2Fe-2S; Fig. 2B), heme (31%), and iron (12%).
(5) The rest rely on copper (8.5%) and molybdenum (4.5%; Fig. 2A).

Distribution of transition metal domains is in line with evolutionary evidence

The 104 transition metal-redox domains preferentially use iron and molybdenum. If the distribution of these domains in Swiss-Prot is representative of the entire universe of proteins, it may be an indication of the age 4 and importance 19 of these domains. It has been previously shown that Fe and Mo were preferentially selected early in biological life. 4 The appearance of many of the protein structures binding these metals dates to before the GOE. Higher abundance of sequences containing iron-sulfur binding domains, over heme and iron binding ones, is also in line with the order of their evolution; i.e. structures that bind Fe in mixed Fe-S clusters are thought to have evolved before those that bind Fe through porphyrin rings or direct amino acid bonds. 4

We also found that sequences containing domains using copper for electron transfer are highly abundant in the Swiss-Prot database. In contrast to iron, copper became soluble only later in Earth's history, after the transition to a generally oxic and oxidizing ocean (~ billion years ago). 4 The abundance of copper may be a result of its incorporation into the cytochrome oxidase in the aerobic respiration complex. Thirty percent of all copper-based domain sequences encode a cytochrome c oxidase subunit, further supporting this hypothesis.
Over a third of sequence-defined transition-metal redox domains contain all experimentally annotated catalytic sites

The 104 redox domains can be subdivided into six domain classes based on the presence of icsa-defined catalytic sites 13 within domain boundaries in the scanned EC1 Swiss-Prot sequences (methods; Table 1):
(1) 28 domains contain all catalytic sites in all sequences;
(2) 7 domains contain all catalytic sites in at least one representative sequence;
(3) 5 domains have at least one sequence containing at least one complete catalytic site;
(4) 10 domains have at least one sequence containing at least one partial catalytic site;
(5) 16 domains contain no catalytic sites within domain boundaries; and
(6) 38 domains have no annotation in icsa.
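
The six classes above can be read as a decision procedure. The following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical record per domain hit (domain boundaries plus a list of catalytic sites, each given as residue positions); the names `DomainHit`, `site_status` and `domain_class` are illustrative, and evaluating class 1 over annotated hits only is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DomainHit:
    """One sequence hit by a domain (hypothetical record layout)."""
    start: int  # first residue of the domain in the sequence (1-based, inclusive)
    end: int    # last residue of the domain (inclusive)
    # Each catalytic site is a list of residue positions; None means no annotation.
    sites: Optional[List[List[int]]]


def site_status(site, start, end):
    """Return 'complete', 'partial' or 'absent' for one site within [start, end]."""
    inside = [start <= pos <= end for pos in site]
    if all(inside):
        return "complete"
    return "partial" if any(inside) else "absent"


def domain_class(hits):
    """Assign a domain to classes 1-6 from its sequence hits."""
    annotated = [h for h in hits if h.sites]
    if not annotated:
        return 6  # no catalytic site annotation at all
    # Status of every site, per annotated hit.
    per_hit = [[site_status(s, h.start, h.end) for s in h.sites] for h in annotated]
    fully_covered = [all(st == "complete" for st in sts) for sts in per_hit]
    if all(fully_covered):
        return 1  # all sites in all (annotated) sequences
    if any(fully_covered):
        return 2  # all sites in at least one representative sequence
    flat = [st for sts in per_hit for st in sts]
    if "complete" in flat:
        return 3  # at least one complete site somewhere
    if "partial" in flat:
        return 4  # at least one partial site somewhere
    return 5      # annotated sites exist, but none fall inside the domain


# Toy example: one hit carries both sites, the other only part of one.
hits = [DomainHit(10, 100, [[20, 30]]), DomainHit(10, 100, [[20, 300]])]
print(domain_class(hits))  # 2
```

A single-residue site can only be complete or absent in this scheme, which matches the note in Table 1 that partial sites are impossible when a site is a single residue.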

Fig. 1 TrAnsFuSE flow diagram of searching in Swiss-Prot and GOS for transition metal oxidoreductases. The TrAnsFuSE search of Swiss-Prot is highly accurate and can be used to search the GOS database of unannotated sequences. Panel (A) shows the performance of TrAnsFuSE (and the numbers of true and false positives) in using GS (cgs, mgs, and mcgs) sequences to search all of Swiss-Prot. All TrAnsFuSE steps are illustrated: first, PSI-BLAST followed by filtering for catalytic, metal or metal and catalytic sites. We show for comparison the results of scanning Swiss-Prot for InterPro GS domains (cgs, mgs, and mcgs). Note that each 3-box node on the tree represents the three possible data sets corresponding to searching with catalytic (cgs), metal (mgs), and metal and catalytic (mcgs) sites. Panel (B) similarly displays the InterPro and TrAnsFuSE search results for the GOS database. Note that we only report the results based on catalytic sites (i.e. using cgs).

Fig. 2 Distribution of domains utilizing transition metals for redox, in Swiss-Prot EC1 oxidoreductases. The majority of metal-redox domains found in Swiss-Prot EC1s utilize iron and copper for redox. Panel (A) shows the percentage of metal-redox domains utilizing: iron (white), copper (light grey), ambiguous metal (copper/zinc, iron/molybdenum/vanadium, iron/copper, iron/manganese; dark grey), molybdenum (black), and vanadium/tungsten/nickel (too few of each to indicate separately; horizontal lines). Ambiguous metal denotes cambialistic domains or domains that are not explicitly defined by InterPro to bind one specific metal. Iron-sulfur clusters are the preferred way of iron utilization in redox. Panel (B) shows the percentage of iron-binding domains using iron-sulfur clusters (white), heme (grey), and iron (black) ligands.

Our analysis reveals 35 domains in 135 Swiss-Prot EC1s (domain classes 1 and 2) with at least one representative sequence that contains all known catalytic sites within InterPro domain boundaries. Note that one Swiss-Prot entry may contain more than one InterPro domain. This set of domains (Table 2) and their representative sequences (Table S2, ESI‡) is referred to as catalytic site-based Gold Standard (cgs) domains and sequences. The cgs domains are found in 13% of Swiss-Prot EC1s. Between all of them, these proteins use all of the transition metals except nickel and tungsten. In this domain subset the distribution of metal prosthetic groups is similar to that in the full oxidoreductase set; i.e. the majority of cgs domains utilize iron (73%; 40% iron-sulfur clusters, 30% heme, and 3% iron), copper (6%), and molybdenum (5%).

Identifying catalytic sites in sequence is not simple

Catalytic sites mapped onto a monomer. In the simplest case, the entire 3D functional unit of an enzyme is a monomer. For example, Formate dehydrogenase H (UniProt ID P07658; PDB 1aa6) contains a Molybdopterin oxidoreductase domain (InterPro ID IPR006655). This domain contains all four formate dehydrogenase catalytic residues making up the single catalytic site reported in the CSA. In this case, all sites are easily mapped onto the corresponding UniProt sequence.

Catalytic sites mapped onto a homo-polymer. In some cases, the functional protein unit is a polymer of identical chains.
For example, Superoxide reductase (SOR; P82385, 1do6) contains the Desulfoferrodoxin, ferrous iron-binding domain (IPR002742). SOR forms a homo-tetramer, with each subunit adopting a fold that coordinates a non-heme iron centre. 20 The desulfoferrodoxin domain contains the two SOR catalytic residues annotated by CSA. Since the catalytic site is identical for each of the components of the homo-tetramer, the sequence mapping only contains two catalytic residues.

Table 1 Domain classes. Categorization of domains that use transition metals for redox based on the presence of functional sites: icsa-defined catalytic sites, metal binding sites, or both
Columns (in order): Domain class | Domain description | Catalytic sites: domain count (%), sequence count a | Metal-binding sites: domain count (%), sequence count a | Both metal binding and catalytic sites: domain count (%), sequence count a
1 | All functional sites in all sequences | 28 (27%) (33%) (22%) 81
2 | All functional sites in at least one representative sequence | 7 (7%) (15%) 48 9 (9%) 47
3 | At least one sequence containing at least one complete functional site | 5 (5%) (26%) (26%) 91
4 | At least one sequence containing at least one partial functional site | 10 (10%) 53 0 b
5 | No functional sites within domain boundaries | 16 (15.5%) 0 21 (20%) 0 6 (6%) 0
6 | No annotation for functional sites | 38 (36.5%) 0 6 (5.7%) 0 39 (38%) 0
a Sequences with functional sites in domain boundaries.
b Partial sites are impossible when a site is a single residue.

Table 2 cgs oxidoreductase domains (gold standard catalytic site-based domains) a
# | InterPro ID | Domain description | Metal | Ligand
1 | IPR | Peptidyl-glycine alpha-amidating monooxygenase | Copper | Cu
2 | IPR | Cytochrome c oxidase subunit II C-terminal | Copper | Cu
3 | IPR | Di-copper centre-containing | Copper | Cu
4 | IPR | Copper amine oxidase, C-terminal | Copper | Cu
5 | IPR | Aromatic-ring-hydroxylating dioxygenase, alpha subunit | Iron | 2Fe-2S
6 | IPR | Rieske iron-sulfur protein, C-terminal | Iron | 2Fe-2S
7 | IPR | Rieske [2Fe-2S] iron-sulfur domain | Iron | 2Fe-2S
8 | IPR | Fe 4S ferredoxin, iron-sulfur binding domain | Iron | 4Fe-4S
9 | IPR | Nitrogenase iron protein, subunit NifH/Protochlorophyllide reductase, subunit ChlL | Iron | 4Fe-4S
10 | IPR | Iron hydrogenase, large subunit, C-terminal | Iron | 4Fe-4S
11 | IPR | Light-independent protochlorophyllide reductase, iron-sulfur ATP-binding protein | Iron | 4Fe-4S
12 | IPR | Nitrogenase iron protein NifH | Iron | 4Fe-4S
13 | IPR | NADH:ubiquinone oxidoreductase-like, 20 kDa subunit | Iron | 4Fe-4S
14 | IPR | Intradiol ring-cleavage dioxygenase, C-terminal | Iron | Fe
15 | IPR | Isopenicillin N synthase | Iron | Fe
16 | IPR | Desulfoferrodoxin, ferrous iron-binding domain | Iron | Fe
17 | IPR | Taurine catabolism dioxygenase TauD/TfdA | Iron | Fe
18 | IPR | Extradiol ring-cleavage dioxygenase, class III enzyme, subunit B | Iron | Fe
19 | IPR | Heme peroxidase, plant/fungal/bacterial | Iron | Heme
20 | IPR | Plant ascorbate peroxidase | Iron | Heme
21 | IPR | Cytochrome P450, B-class | Iron | Heme
22 | IPR | Cytochrome P450, E-class, group I | Iron | Heme
23 | IPR | Cytochrome P450, E-class, group IV | Iron | Heme
24 | IPR | Nitric oxide synthase, oxygenase domain | Iron | Heme
25 | IPR | Heme peroxidase | Iron | Heme
26 | IPR | Nitric oxide synthase, oxygenase subunit | Iron | Heme
27 | IPR | Heme peroxidase, animal, subgroup | Iron | Heme
28 | IPR | Cytochrome c oxidase, subunit I | Iron, copper | Heme a3-CuB
29 | IPR | Cytochrome c oxidase, subunit I bacterial type | Iron, copper | Heme a3-CuB
30 | IPR | Nitrogenase component 1, conserved site | Iron, vanadium, molybdenum | FeFe, MoFe, VFe
31 | IPR | Nitrogenase/oxidoreductase, component 1 | Iron, vanadium, molybdenum | FeFe, MoFe, VFe
32 | IPR | Germin, manganese binding site | Manganese | Mn
33 | IPR | Molybdopterin oxidoreductase, prokaryotic, conserved site | Molybdenum | Mo
34 | IPR | Aldehyde oxidase/xanthine dehydrogenase, molybdopterin binding | Molybdenum | Mo
35 | IPR | Superoxide dismutase, copper/zinc, binding site | Copper, zinc | Fe, Zn
a Domains in this list that have all catalytic sites defined in the icsa 11 map within the domain boundaries.

Table 3 Mapping the nitrogenase complex (hetero-polymer) chains to Swiss-Prot entries
UniProt ID P00459
  UniProt name (gene name): Nitrogenase iron protein 1 (nifH)
  InterPro domains: IPR Nitrogenase iron protein NifH; IPR Nitrogenase iron protein, subunit NifH/protochlorophyllide reductase, subunit ChlL
  Description: Component 2: homodimer, iron-sulfur protein
  Catalytic site residues a: site 0: K11, D130; site 1: K16, K42; site 2: K11, K16, K42, D130
  PDB chains a: site 0: E, F, G, H; site 1: E, F, G, H; site 2: E, F, G, H
UniProt ID P07328
  UniProt name (gene name): Nitrogenase Mo-Fe protein α-chain (nifD)
  InterPro domains: IPR Nitrogenase/oxidoreductase, component 1
  Description: Component 1: hetero-tetramer, Mo-Fe containing protein made of two pairs of α and β subunits
  Catalytic site residues a: site 0: C154, L158; site 1: C62, A65, R96, H195
  PDB chains a: site 0: A, C; site 1: A, C
UniProt ID P07329
  UniProt name (gene name): Nitrogenase Mo-Fe protein β-chain (nifK)
  InterPro domains: IPR Nitrogenase/oxidoreductase, component 1; IPR Nitrogenase component 1, conserved site
  Description: Component 1: (same as above)
  Catalytic site residues a: site 0: C153, V157
  PDB chains a: site 0: D, B
a Catalytic sites and PDB chains of 1n2c are extracted from the Catalytic Site Atlas (CSA)

Catalytic sites mapped onto a hetero-polymer. Finally, catalytic function is often performed by a complex of different chains. For instance, the nitrogenase complex (PDB 1n2c 21; Table 3) is made of three sequence-distinct chains: NifH, NifD, and NifK. These sequences contain the corresponding four domains: two for NifH and two for NifD and NifK. Each chain contributes different catalytic sites to the complex (Table 3). There is also a site made by the combination of all three chains in the complex. The catalytic residues participate in a variety of functions, such as the electron transfer pathway between the P-cluster (8Fe-7S) and the FeMo-cofactor of the Mo-Fe protein. 21,22 In the case of the nitrogenase complex, and in all cases of a single catalytic site made by different chains of a hetero-polymer, mapping all catalytic residues to a single Swiss-Prot sequence is impossible. However, the catalytic residues of one chain that are involved in a joint-chain catalytic site are, for our purposes, annotated as separate catalytic sites.
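
The per-chain splitting of a joint site can be pictured with a short Python sketch. It is illustrative only: the chain-to-entry mapping follows Table 3, but the function name `split_joint_site` and the example joint site (residues from Table 3 grouped together for demonstration) are assumptions, not data from the CSA.

```python
from collections import defaultdict

# PDB 1n2c chain -> UniProt entry, as listed in Table 3.
CHAIN_TO_UNIPROT = {
    "A": "P07328", "C": "P07328",
    "B": "P07329", "D": "P07329",
    "E": "P00459", "F": "P00459", "G": "P00459", "H": "P00459",
}


def split_joint_site(residues):
    """Group the (chain, residue) pairs of one joint site by UniProt entry.

    Each entry then carries its own residue list, i.e. a separate site
    that can be mapped onto a single sequence.
    """
    per_entry = defaultdict(list)
    for chain, residue in residues:
        entry = CHAIN_TO_UNIPROT[chain]
        # Identical chains repeat the same residues; keep each one once.
        if residue not in per_entry[entry]:
            per_entry[entry].append(residue)
    return dict(per_entry)


# Hypothetical joint site, built from Table 3 residues for illustration only.
joint_site = [("A", "C154"), ("A", "L158"), ("B", "C153"),
              ("B", "V157"), ("C", "C154")]
print(split_joint_site(joint_site))
# {'P07328': ['C154', 'L158'], 'P07329': ['C153', 'V157']}
```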
The complete reaction mediated by a given EC1 may require other processes in addition to electron transfer, including, but not limited to, protonation, 23 substrate gating, 20 and guiding of the substrate into the active site. 24 The presence of all necessary catalytic residues is therefore imperative for proper oxidoreductase function. This observation is the motivation for our preferential selection of only those cgs sequences containing all known catalytic sites within domain boundaries.

By excluding sequences that contain catalytic sites that reside outside the domain boundaries (or are lacking some sites altogether), we define a compact functional subunit that serves as a reliable basis for further profile-building and evolutionary studies. Additionally, this choice allows us to measure method performance in the simplest and most conservative fashion. Future studies should aim to integrate partial catalytic sites into the cgs set, as is now done for joint hetero-polymer sites. Manual curation to enlarge the arsenal of cgs sequences could also be used to retrieve more active domains.

Using catalytic site annotations to refine searching for oxidoreductase functionality improves accuracy

A total of 4614 Swiss-Prot EC1 and 1106 non-EC1 sequences contain cgs domains. Assuming that all EC1s in this set are correctly assigned oxidoreductase activity, these counts correspond to ~80% accuracy (InterPro accuracy; methods, eqn (1); Fig. 3A, white column, ingsdomains) of annotating protein function using cgs domains. Alignment of the cgs sequences to all of these InterPro hits using PSI-BLAST (methods; no restrictions for sequence identity, seq. id. > 0%) results in 3558 EC1s and 496 non-EC1s, for a total of 88% accuracy (PSI-BLAST accuracy, Fig. 3A, dots column, ingsdomains). Filtering the aligned sequences for those containing all catalytic sites (methods) results in 2410 EC1s and 146 non-EC1s (94% TrAnsFuSE accuracy, Fig. 3A, grey column, ingsdomains). To recap, TrAnsFuSE is 14% more accurate than InterPro and 6% more accurate than PSI-BLAST for sequences containing cgs domains.

True positives. The 2410 cgs domain-containing EC1 entries that conserve all catalytic sites are considered to be true positives (TP; methods). An example of such an entry is the Formate dehydrogenase subunit alpha (P06131) found by alignment to the cgs sequence Formate dehydrogenase H (described above).
This alignment (sequence identity = 41%) shows that all catalytic sites of the latter are also present in the former, indicating similar function. Filtering for catalytic sites also enables validating PSI-BLAST results with low sequence identities. Alignments of the cgs sequences to EC1s with cgs domains, at sequence identities below 30% but filtered for catalytic sites, attain the same 94% accuracy as better (higher seq. id.) alignments. For example, Cytochrome P450 4d1 (O16805) was found by Cytochrome P450 19A1 (P11511) with a sequence identity of 28.5% via conservation of all four catalytic residues.

False positives. There are 1106 non-EC1s (including 42 enzymes from different EC classes) containing at least one of the 35 cgs domains. Assuming that oxidoreductase activity is fully described by EC1s, these represent erroneous annotations on behalf of InterPro. 45% of these sequences were also found by alignment to cgs sequences, an error of PSI-BLAST mediated function annotation transfer. However, most of these lack catalytic residues, leaving only 146 (13%) false positives (FP; methods) that are retained by our method.

True negatives. The full set of Swiss-Prot proteins that do not contain cgs domains constitutes the very large, but trivial, set of true negative (TN) annotations. Note that while this set may contain transition metal-binding oxidoreductases, we have no explicit way of finding these. As Swiss-Prot entries are manually curated, it is reasonable to assume that there is a negligible amount of errors in the classification of the enzyme's primary activity, i.e. redox (EC first digit). Therefore, non-EC1 enzymes missing the annotated catalytic sites may be safely assumed to not have redox activity in spite of any InterPro hits. Such is the case for Probable intron-encoded endonuclease ai3 (A9RAH6), which contains the cgs domain Cytochrome c oxidase (IPR000883) and was also found via alignment to one of the cgs sequences (Q5SJ79).
However, the alignment did not conserve the catalytic residues, so the sequence match was considered invalid. The Swiss-Prot manual annotation of this sequence, 'truncated non-functional cytochrome oxidase 1', supports the assumption that this protein is inactive as a cytochrome component. In line with the assumption of a negligible amount of erroneous assignments of enzyme primary activity, non-enzyme non-EC1s are even less likely to be misannotated. For example, the Sporulation initiation inhibitor protein soj (P37522), containing the Nitrogenase_NifH/reductase_ChlL (IPR000392) domain, aligned to the cgs sequence NifH (described above). The alignment was correctly discarded since it was missing the catalytic residues.

False negatives. EC1 entries containing cgs domains that lack proper alignment of the catalytic residues are considered false negatives (FNs) for the purposes of computing 1-based coverage (methods, eqn (2)). In quite a few cases these are manifestations of a deficiency of our method. However, in some examples this is actually a problem of lack of data, which can be assuaged by expanding the cgs repertoire as discussed above. For example, our method incorrectly invalidates some sequences due to a lack of a true corresponding cgs sequence. Such is the case with NAD(P)H-quinone oxidoreductase (EC ; A0A389) containing a cgs ferredoxin domain (IPR017896). This sequence does not align to anything in the cgs sequence list, so it is never picked up by TrAnsFuSE. In other cases, the FNs should actually be TNs. Specifically, cgs sequence catalytic site annotation might be inaccurate for experimental or homology-based function transfer reasons. For example, for redox domain-containing sequence fragments (incomplete sequences) our prediction of lack of redox activity is actually correct. Note that there is no automated way to estimate the extent of all misassignments.
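
The accuracy figures quoted for sequences containing cgs domains can be reproduced from the EC1 and non-EC1 counts. The Python sketch below assumes that eqn (1) has the form TP / (TP + FP) and eqn (2) the form TP / (TP + FN); the methods section is not reproduced here, so both forms are assumptions, although the first is consistent with the reported percentages.

```python
def accuracy(tp, fp):
    """Fraction of retained hits that are EC1s (assumed form of eqn (1))."""
    return tp / (tp + fp)


def coverage(tp, fn):
    """Fraction of reference EC1s that are retained (assumed form of eqn (2))."""
    return tp / (tp + fn)


# (EC1 hits, non-EC1 hits) among sequences containing cgs domains.
counts = {
    "InterPro": (4614, 1106),
    "PSI-BLAST": (3558, 496),
    "TrAnsFuSE": (2410, 146),
}

for method, (ec1, non_ec1) in counts.items():
    print(f"{method}: {accuracy(ec1, non_ec1):.1%}")
# InterPro: 80.7%
# PSI-BLAST: 87.8%
# TrAnsFuSE: 94.3%
```

The ~80%, 88% and 94% figures in the text correspond to these values after rounding.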
Note that many of the InterPro domain profiles are generated using Swiss-Prot sequences, among others. Since our cgs sequences are specifically selected to contain these domains, there is some circular logic in using the cgs sequences to search Swiss-Prot. However, two concepts are key: (1) if there is bias in InterPro to do better for Swiss-Prot, InterPro scan results should be overestimated in comparison to our method (this is not the case, as illustrated by the accuracy of TrAnsFuSE) and (2) we expect to use our method in the future to annotate proteins from new sequencing projects that were not used in the making of the profiles.

Fig. 3 Comparing the performance of TrAnsFuSE to PSI-BLAST and InterPro. Filtering for the presence of metal-binding sites, catalytic sites, and a combination of catalytic sites and metal-binding sites results in improved accuracy when compared with InterPro and PSI-BLAST. Panels A-C show the accuracy of searches based on (A) catalytic residues, (B) metal-binding residues, and (C) metal and catalytic residues in all Swiss-Prot sequences (All), Swiss-Prot sequences that contain GS domains (ingsdomains), and Swiss-Prot sequences that do not contain GS domains (outgsdomains). Accuracy standard deviation is shown for InterPro (white bars), PSI-BLAST using the GS sequences as queries (dots bars), and TrAnsFuSE using functional sites (grey bars). The latter consistently outperforms both InterPro and PSI-BLAST for all presented sets. Panels D-F show accuracy (line), coverage over the first 3 EC digits (3-based coverage; outer white column) and coverage over the first EC digit only (1-based coverage; inner grey columns) of InterPro, PSI-BLAST, and TrAnsFuSE searches in all of Swiss-Prot using (D) cgs, (E) mgs, and (F) mcgs sites.

Additional profiles needed to find all EC1s

There are over 7000 EC1 proteins with GO and Swiss-Prot sequence annotations indicative of binding transition metals. Of these sequences, nearly a fifth do not contain any of the 104 curated metal-redox domains and only two of these proteins can be found by alignment to cgs sequences. Our inability to annotate these sequences as metal-binding oxidoreductases is explained by two reasons: (1) the set of cgs sequences is incomplete, covering only ~30% of all oxidoreductase domains, while (2) the full hand-curated domain set misses other relevant domains. It is also plausible, however, that some of these missed proteins do not map to any known metal-binding redox domains. For instance, the Alpha-ketoglutarate-dependent dioxygenase FTO (Q9C0B1) does not contain any of our 104 domains nor does it align with any of the cgs sequences. Yet there is sufficient evidence that it may oxidize iron. 25 We believe that many oxidoreductases not found by homology-based searches can be annotated with additional oxidoreductase profiles.

Until our domain set is expanded, however, we cannot expect to find all annotated EC1s. Thus, to estimate coverage (sensitivity) we would do injustice to our method by simply taking as reference all 35 thousand EC1s in Swiss-Prot (methods, 1-based coverage, Fig. 3D-F, inner grey columns). Since we were using a limited set of domains to describe all of oxidoreductase activity, we estimated the possible hit-space for 3-based coverage with a collection of Swiss-Prot sequences with the first 3 digits of the EC number identical to our query sequences (methods, Fig. 3D-F, outer white columns).

TrAnsFuSE is accurate and enables extending searching for function beyond InterPro-defined sequences

cgs sequences are constrained by our definition to contain cgs domains. However, using cgs sequences as queries for alignments does not guarantee the presence of cgs domains in hit sequences. A PSI-BLAST of cgs sequences against all of Swiss-Prot (seq. id. 40%) results in 83% EC1/non-EC1 accuracy (Fig. 1A and 3A, dots column, All). Filtering these hits for sequences containing all catalytic sites (applying TrAnsFuSE) attains 94% accuracy (Fig.
1A and 3A, grey column, All ) and 53% 3-based coverage (Fig. 3D, outer white column). Even for similar sequences, with over 40% sequence identity, TrAnsFuSE with catalytic sites (Fig. 4, filled circles) was more accurate than PSI-BLAST alone (Fig.4,emptycircles). The majority of the catalytic-site filtered EC1s were found both by scanning for cgs domains and by alignments to cgs sequences. However, some of these were accessible only by alignment with sequences without the cgs domains (Fig. 3A, grey columns, outgsdomains ). For example, the Cytochrome P450 3A8 (P33268), which doesn t contain any of the cgs domains, was found by alignment to the cgs sequence Putative cytochrome P (Q59990). Thus, additionally TrAnsFuSE-ing sequences not containing cgs domains added 9% (207 sequences) to the coverage of EC1 data with an insignificant drop in accuracy. To recap, for the set of all Swiss-Prot sequences (with and without cgs domains) TrAnsFuSE with catalytic sites was more accurate than InterPro or PSI-BLAST (TrAnsFuSE 94%, InterPro 80%, PSI-BLAST 82%) at a low cost to coverage ( 53% TrAnsFuSE, B67% for InterPro and PSI-BLAST). TrAnsFuSE is a robust approach for filtering by other functional residues: metal-binding sites In order to test the robustness of our method in annotating sequences based on other functional sites we used metal binding site annotations instead of catalytic sites (methods). Fig. 4 Comparison of the performance of TrAnsFuSE search with varied functional sites at different thresholds of sequence similarity. TrAnsFuSE performs well at low sequence similarity thresholds. Filled shapes show the accuracy of TrAnsFuSE-ing through all Swiss-Prot sequences at 0 60% sequence identity, with catalytic sites (circles), metal-binding sites (squares), and both metal-binding and catalytic sites (triangles). 
Empty shapes show the PSI-BLAST accuracy for searching all of Swiss-Prot at the same (0–60%) thresholds of similarity for sequences aligning to cgs (circles), mgs (squares), and mcgs (triangles) sequences.

Based on the presence of metal binding residues, domains are divided into previously described domain classes (Table 1). Our analysis reveals 50 domains in 239 Swiss-Prot EC1s (domain classes 1 and 2, metal binding, mgs, domains and sequences) with at least one representative sequence that contains all known metal binding sites within InterPro domain boundaries. Roughly equal numbers of EC1s and non-EC1s contain the mgs domains, giving 53% accuracy (Fig. 1A and 3B, white columns in 'ingsdomains'). Of these, 5233 EC1s and 2101 non-EC1s align to the mgs sequences, for a total of 71% PSI-BLAST accuracy. TrAnsFuSE-ing with metal binding sites is just as accurate (Fig. 3B, grey columns, 'ingsdomains'). A PSI-BLAST of mgs sequences against all of Swiss-Prot results in 70% accuracy and TrAnsFuSE-ing with metal binding sites is just 1% more accurate (Fig. 1A and 3B, grey columns, 'All'; 47% '3-based' coverage, Fig. 3E, outer white column). The InterPro accuracy of the mgs domains (53%) is lower than the accuracy of cgs domains (80%) or even the baseline accuracy for all 104 transition metal-binding domains (57%). The higher number of non-EC1s containing mgs domains results in lower accuracies of PSI-BLAST (71% mgs vs. 88% cgs) and of subsequent TrAnsFuSE-ing (71% mgs vs. 94% cgs). Note that both these results are still ~17% more

accurate than InterPro alone (Fig. 3B, grey vs. white bars). The difference in the accuracies is partially due to the different sets of domains and sequences. This difference demonstrates that searching for a specific function requires being very specific in the description of gold standard sequences, functional sites, and corresponding domains. The similarity between metal-binding site TrAnsFuSE performance and mgs sequence-based PSI-BLAST performance is likely a result of the lower specificity of metal binding sites for oxidoreductases. Metal binding sites are less function-definitive than catalytic sites (i.e. metal binding may occur for many reasons) and are generally defined on a per-residue basis (i.e. the full site is not defined; rather, all participating residues are defined one by one). To achieve greater specificity, we combined the two annotations, catalytic and metal binding sites, into one.

TrAnsFuSE is a robust approach for filtering by other functional residues: [catalytic and metal binding]-sites

Based on the presence of metal binding and catalytic residues, domains are divided into previously described domain classes (Table 1). Our analysis reveals 32 domains in 120 Swiss-Prot EC1s belonging to domain classes 1 and 2 (metal binding and catalytic site based, mcgs, domains and sequences). InterPro is 83% accurate in identifying EC1s and non-EC1s from mcgs domains (Fig. 1A). Of these, 3426 EC1s and 335 non-EC1s align to the mcgs sequences, for a total of 91% PSI-BLAST accuracy. TrAnsFuSE-ing with both metal binding and catalytic sites is 97% accurate (2200 EC1s and 67 non-EC1s; Fig. 3C, grey columns, 'ingsdomains'). A PSI-BLAST of mcgs sequences against all of Swiss-Prot results in 85% accuracy and TrAnsFuSE-ing with both metal binding and catalytic sites is 12% more accurate (Fig. 1A and 3C, grey columns, 'All'; 51% '3-based' coverage, Fig. 3F, outer white column).
Thus, TrAnsFuSE-ing Swiss-Prot sequences with both metal and catalytic sites was more accurate but less sensitive than InterPro or PSI-BLAST (accuracy improvement/coverage loss of 14%/17% relative to InterPro and 12%/16% relative to PSI-BLAST). This trade-off between gains in accuracy and losses in coverage is advantageous when searching for function signals in large unannotated databases such as the Global Ocean Sampling (GOS) expedition. 15 TrAnsFuSE-ing all Swiss-Prot sequences with both metal and catalytic sites achieves much higher accuracy than using catalytic or metal sites alone. This gain in accuracy does not come with a loss of '3-based' coverage across the functional site types (Fig. 3D–F, outer white columns). This high gain in accuracy is a result of selecting a set of highly relevant and complete functional annotations to define transition-metal oxidoreductase activity. Thus, if relevant functional sites are defined, TrAnsFuSE can arguably search sequence databases for any given function.

TrAnsFuSE achieves high accuracy at low sequence similarities

TrAnsFuSE performs better than PSI-BLAST in identifying EC1s at all levels of sequence identity (Fig. 4), but most visibly so at low thresholds. Aligning the mcgs sequences against all Swiss-Prot EC1s and TrAnsFuSE-ing the results with both catalytic and metal-binding sites achieved high accuracy (~97%) even at seq. id. ≤30% (filled triangles; Fig. 4). These values are 12% higher than the accuracy achieved by PSI-BLAST (empty triangles; Fig. 4) and 14% higher than InterPro (83%). Similar trends were visible for TrAnsFuSE-ing with metal-binding sites and with catalytic sites, with metal binding-site filtering performing worse than catalytic site filtering, and worse than mcgs TrAnsFuSE-ing. The ability to look in previously ignored sequence similarity regions will enable finding diverged proteins of similar function.
Use of TrAnsFuSE decreases the experimental load necessary to analyse the vast amounts of new data

Sequences from the genome and metagenome sequencing projects are stored in public resources such as RefSeq 27 and TrEMBL. 14 These databases, among many other features, provide the user with automatic annotations of sequence functionality. Unfortunately, the accuracy and coverage of these annotations leave much to be desired. We compared our method with the UniProt automatic annotations for oxidoreductases found in TrEMBL. There was ~40% agreement in EC1 assignment between TrEMBL annotations and TrAnsFuSE-ing TrEMBL sequences with cgs catalytic sites. Although we do not have the means to measure which approach is more accurate, we found that if we exclude non-enzyme TrEMBL entries (entries with no EC number annotations) there is a 99% agreement between the two approaches. Thus, the main difference between UniProt automatic annotations and TrAnsFuSE comes from the set of entries not annotated as enzymes by TrEMBL. Considering the size of TrEMBL this is indeed a very large set. While much of this set is likely non-enzyme, there are also unidentified enzymes. PSI-BLASTing the cgs sequences against all of TrEMBL (18M sequences) results in 635K hits, of which 256K do not have TrEMBL-assigned EC numbers. TrAnsFuSE-ing all of TrEMBL with the catalytic sites produces 88K hits, of which 55K have no TrEMBL-assigned EC number. If all the TrAnsFuSE false positives (5K based on 94% accuracy) are in the set of TrEMBL non-ECs we expect to bring in at least 50K new EC1s not picked up by TrEMBL. The ability to annotate ~20% (50K of 256K TrEMBL non-ECs) more sequences clearly demonstrates the value of TrAnsFuSE.

Genome and metagenome sequencing projects, such as the Global Ocean Sampling (GOS) expedition, 15 produce vast amounts of sequence data. We have analysed 6 million (6M) GOS entries of predicted proteins (Fig. 1B) found in the UniMES database.
26 Fifty thousand (50K) of these sequences contain the 35 cgs domains. Applying the previously reported InterPro performance (80% accuracy) to this set suggests that it holds about 40K EC1s and 10K non-EC1s. PSI-BLAST of cgs sequences against UniMES results in 74K hits, while TrAnsFuSE-ing with catalytic sites produces 13K sequences. Assuming that TrAnsFuSE is 94% accurate, we estimate ~700 false positive results in this set. The 93% drop in the number of estimated false positives (~700 instead of 10K for InterPro) clearly demonstrates the value of TrAnsFuSE in decreasing the experimental load. While the strength of this approach is in finding hits with a high reliability, note that attaining a broader overview of EC1 functionalities requires using additional InterPro domains.
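The estimates in this section follow from simple arithmetic on hit counts and accuracies, which the short Python sketch below reproduces. The function names are illustrative and not part of TrAnsFuSE, and the inputs are the rounded counts quoted in the text, so the results agree with the reported ~700 false positives and 93% drop only up to rounding.

```python
def expected_false_positives(n_hits: int, accuracy_pct: float) -> float:
    """Expected number of false positives among n_hits at a given accuracy (%)."""
    return n_hits * (100.0 - accuracy_pct) / 100.0


def percent_drop(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# GOS/UniMES, using the rounded counts quoted in the text.
interpro_fp = expected_false_positives(50_000, 80.0)   # sequences with cgs domains
transfuse_fp = expected_false_positives(13_000, 94.0)  # TrAnsFuSE hits
drop = percent_drop(interpro_fp, transfuse_fp)

# TrEMBL: hits without an EC number, minus the worst case in which
# every expected false positive falls in that set.
trembl_fp = expected_false_positives(88_000, 94.0)
new_ec1_lower_bound = 55_000 - trembl_fp

print(f"GOS false positives: InterPro {interpro_fp:.0f}, TrAnsFuSE {transfuse_fp:.0f}")
print(f"Drop in expected false positives: {drop:.1f}%")
print(f"TrEMBL: at least {new_ec1_lower_bound:.0f} new EC1 candidates")
```

The lower bound for TrEMBL is deliberately pessimistic: it assumes that all expected false positives are among the entries lacking an EC number.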

The methodology presented here, while as yet less extensively developed than resources like InterPro, is more accurate, and as such, requires less experimental verification for its predictions. Arguably, when dealing with millions of sequences this is an advantage that cannot be overstated.

Conclusions

We developed a protein sequence-scanning method (TrAnsFuSE) that improves the search for proteins containing transition metal-utilizing redox domains (EC1s). We have used a set of manually curated, transition metal binding, redox-specific domains as a basis of our study. First, we demonstrated that the distribution of these domains is in line with trends in their evolution. Then, we used a set of gold standard sequences, with annotated catalytic sites found within redox domain borders, to query all of Swiss-Prot. Only the sequences that conserved all catalytic sites of the query were assigned oxidoreductase functionality. This methodology achieved 94% accuracy in finding EC1s. Moreover, we demonstrated the robustness of this approach for generic functional sites by using annotations for metal binding residues instead of catalytic sites and by using the combination of both metal binding and catalytic sites. The concepts described here may improve homology-based function annotation transfer using sites such as binding hot-spots and, potentially, predictions of functional residues instead of experimental annotations. Our method will decrease the experimental load necessary for analysis of the functions of unknown sequences coming from the sea of data in (meta-)genome sequencing projects like the Global Ocean Sampling (GOS) Expedition.

Experimental

Protein entries

Reviewed protein entries (Swiss-Prot) were downloaded in January 2011 with their amino acid sequences, EC (Enzyme Commission) numbers, Gene Ontology 28 terms (GO), and the associated PDB 29 identifiers.
Non-reviewed protein entries (TrEMBL) with their amino acid sequences and EC numbers were downloaded in December. GOS protein entries with their amino acid sequences and InterPro 10 annotations were extracted from the UniProt Metagenomic and Environmental Sequences database (UniMES; downloaded November 2011).

Domain curation

Transition metal-utilizing redox domains in Swiss-Prot and TrEMBL entries were extracted from InterPro. 10,30 For Swiss-Prot, only entries whose descriptions or associated GO terms contained transition metal symbols (e.g. Fe, Cu, etc.), names (e.g. iron, copper, etc.), or names of structures containing a transition metal (e.g. ferredoxin) were retained. This set was manually validated, checking for transition metal utilization in redox reactions. Validation was based on the InterPro summaries, component database annotations (e.g. PFam, 8b TIGRFAMs, 16 SUPERFAMILY 17 ), and literature citations for a given InterPro entry. Note that for some of the sequences, the Swiss-Prot annotation for metal binding is not in agreement with the InterPro domain annotation; i.e. the InterPro profile annotates cambialistic sequences but the Swiss-Prot annotation verifies only one metal. An InterPro signature may outline more than one activity (other than redox) as reflected in descriptions of its component signatures. To obtain the domains definitively relevant to our study, we extracted via manual curation entries containing solely signatures using transition metals to perform redox. Note that we refer to all of these entries as domains, although different terms may have been used in InterPro (e.g. conserved site). Since each InterPro domain is an integration of several component domains from different sources, in this work we refer to maximal boundaries encompassing all domains as domain boundaries.

icsa to Swiss-Prot mapping

For every Swiss-Prot entry assigned to the EC class (EC1; oxidoreductase) we extracted all corresponding PDB IDs and chains.
We retained only PDB entries that had associated SIFTS (Structure Integration with Function, Taxonomy and Sequences; downloaded July 2011) 31 and icsa (Inpharmatica CSA) 11 records. SIFTS was used to map all residues of a given Swiss-Prot entry to the corresponding residues of the associated icsa-reported PDB entry (Table S2, ESI‡). The icsa database contains both literature-derived and homology-based entries. We only retained icsa literature-based entries whose catalytic sites correctly mapped from the corresponding PDB entries to the Swiss-Prot sequence. Those literature-based icsa entries that could not be precisely mapped were discarded together with any of their homology-based derivatives. Of the remaining homology-based entries, those that could not be fully mapped to their corresponding Swiss-Prot sequences were excluded. Identical catalytic sites from different icsa PDB entries were unified according to their positions on the Swiss-Prot sequence.

Catalytic site-based gold standard (cgs) sequences

Only Swiss-Prot EC1 sequences mapping all residues of all catalytic sites (as described above) within the annotated InterPro domain boundaries (domain classes 1 and 2, Table 1) were used for further analysis. These sequences and their corresponding 35 InterPro domains are referred to as the catalytic site-based Gold Standard (cgs) sequences and cgs domains, respectively. Similarly, we defined metal-based GS sequences (mgs) as EC1 sequences which contain all transition metal-binding residues within domain boundaries (domain classes 1 and 2, Table 1). These sequences and their corresponding 50 InterPro domains are referred to as mgs sequences and mgs domains. In addition, we defined [metal and catalytic site]-based GS sequences (mcgs) as EC1 sequences which contain all transition metal-binding and catalytic residues within domain boundaries (domain classes 1 and 2, Table 1). These sequences and their corresponding 32 domains are referred to as mcgs sequences and mcgs domains.
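The gold-standard selection above amounts to a containment test of site residues against domain boundaries. The Python sketch below illustrates this under stated assumptions: the function names and the (start, end) tuple representation are hypothetical, boundaries are taken as inclusive 1-based sequence positions, and a residue is accepted if it falls inside any qualifying domain (the text does not say whether all residues of a site must share one domain).

```python
from typing import Iterable, List, Tuple

Boundary = Tuple[int, int]  # inclusive (start, end) positions on the sequence


def maximal_boundary(component_hits: Iterable[Boundary]) -> Boundary:
    """Maximal boundary encompassing all component-signature matches of one domain."""
    hits = list(component_hits)
    return (min(start for start, _ in hits), max(end for _, end in hits))


def is_gold_standard(site_positions: List[int], domains: List[Boundary]) -> bool:
    """True if every annotated site residue lies within some domain boundary."""
    if not site_positions:
        return False  # a sequence without annotated sites cannot qualify
    return all(
        any(start <= pos <= end for start, end in domains)
        for pos in site_positions
    )


# Hypothetical example: three component matches merged into one domain boundary.
domain = maximal_boundary([(12, 80), (20, 95), (15, 60)])
print(domain, is_gold_standard([30, 77], [domain]), is_gold_standard([30, 120], [domain]))
```

The same test would apply to cgs, mgs and mcgs sets; only the list of site positions (catalytic, metal-binding, or both) changes.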
Filtering alignments for proteins with identical catalytic sites

The cgs sequences were used to search for other proteins with similar functions in an icsa-like approach: each cgs sequence was used as a query to retrieve similar sequences from Swiss-Prot using four PSI-BLAST iterations

(evalue 10^-3, inclusion_ethresh 10^-10, max_target_seqs 10^8, num_alignment 10^8, num_descriptions 10^8, num_iterations 4, comp_based_stats 0). The decision to include in the PSSM only the hits with an E-value <10^-10 was made to obtain a more focused PSSM for each domain. The specified E-values are accepted in the field as a baseline for finding functional similarity of proteins. 9 The number of PSI-BLAST iterations was the same as in icsa filtering, 11 shown to reduce the number of false positive results. 32 Finally, to achieve maximum sensitivity we set an extremely high upper bound (10^8) on the number of sequences used for building the PSSM and on the number of alignments returned. Each homologue was validated against its query sequence. We checked the known catalytic sites for conservation of residue type as described below.

1. Conservation of a catalytic residue is defined as: (1) in cases where the residue side chain is the functional group, only an exact match is acceptable, and (2) in cases where the main chain is the functional group, any residue in the same chemical group is allowed to match the query residue. Residue groups are: (1) negatively charged: D, E; (2) positively charged: R, K, H; (3) hydroxylic polar: S, T; (4) amidic polar: N, Q; (5) short hydrophobic chain: A, V, I, L, and (6) aromatic: F, Y, W, C. Note that G, P, and M residues may only match themselves.

2. In cases where the residues were not found to be conserved with PSI-BLAST, the alignment of the query sequence to the subject was repeated using first ClustalW 33 (in full multiple alignment mode) and, failing that, Smith–Waterman 34 (from the EMBOSS package using default parameters).

3. An EC1 assignment was only made when a homologue conserved all catalytic residues as described in step 1.

Filtering alignments for proteins with other sites

Metal binding sites were obtained from Swiss-Prot annotations for all EC1 sequences. Only experimentally validated sites were used.
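The conservation rule of step 1 in the preceding subsection can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the site tuple layout, and the dictionary of aligned residues are assumptions, and the realignment fallback of step 2 is not modelled. The residue groups are copied from the text as given.

```python
from typing import Dict, List, Tuple

# Residue groups as listed in step 1; G, P and M belong to no group
# and can therefore only match themselves.
RESIDUE_GROUPS = [
    set("DE"),    # negatively charged
    set("RKH"),   # positively charged
    set("ST"),    # hydroxylic polar
    set("NQ"),    # amidic polar
    set("AVIL"),  # short hydrophobic chain
    set("FYWC"),  # aromatic (grouping as given in the text)
]


def residue_conserved(query_res: str, subject_res: str, side_chain_functional: bool) -> bool:
    """Apply the step 1 rule to one aligned residue pair."""
    if query_res == subject_res:
        return True
    if side_chain_functional:
        return False  # side-chain functional group: exact match only
    # Main-chain functional group: any residue of the same chemical group matches.
    return any(query_res in g and subject_res in g for g in RESIDUE_GROUPS)


def conserves_all_sites(sites: List[Tuple[int, str, bool]], aligned: Dict[int, str]) -> bool:
    """sites: (query position, query residue, side_chain_functional) per catalytic residue.
    aligned: query position -> subject residue; a missing position counts as a gap."""
    return bool(sites) and all(
        residue_conserved(res, aligned.get(pos, "-"), side_chain)
        for pos, res, side_chain in sites
    )
```

For example, `conserves_all_sites([(10, "H", True), (42, "S", False)], {10: "H", 42: "T"})` accepts the hit, because the side-chain residue matches exactly and S/T share a group, whereas replacing H by K at position 10 would reject it.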
Metal binding site-based GS (mgs) sequences were extracted as described above for catalytic sites. mgs sequence-based searching was performed similarly to cgs-based search with minor differences: (1) each metal binding residue was considered a site and (2) since there are no annotations for the residue active group, filtering followed the conservative scheme of exact match (i.e. a metal binding residue on the query sequence could only align with itself in the subject). By combining the annotation of catalytic sites and metal binding residues we created the [metal and catalytic site]-based gold standard (mcgs) domains and sequences. Searches using mcgs sequences were performed by integrating the filtering approaches for cgs and mgs, previously described.

Calculating accuracy of scanning

Accuracy was calculated as:

\[ \text{accuracy} = 100 \cdot \frac{TP}{TP + FP} \tag{1} \]

where True Positives (TPs) are method hits to experimentally annotated EC1 proteins and False Positives (FPs) are method hits annotated as non-EC1s. The standard deviation of the accuracy was calculated by 100-fold bootstrapping: taking 100 subsets of 50% of randomly selected queries with their hits.

Calculating coverage of scanning

True coverage could not be computed for this study due to the absence of experimental estimates of the numbers of true transition metal-using oxidoreductase domains among EC1s. We define coverage over all EC1s as '1-based' coverage (anchored to the first digit of the EC number) and coverage over the first 3 EC digits as '3-based' coverage:

\[ \text{coverage} = 100 \cdot \frac{TP}{TP + FN} \tag{2} \]

In '1-based' coverage, True Positives (TPs) are method hits that are annotated as EC1s and False Negatives (FNs) are all other EC1s in Swiss-Prot. Note that this is the same definition of TPs as is used for all accuracy calculations (eqn (1)). Clearly, since we were using a limited set of domains to describe all of oxidoreductase activity, we did not expect to hit all possible EC1s.
To achieve a better estimate of the possible hit-space for '3-based' coverage we collected all Swiss-Prot sequences with the first 3 digits of the EC numbers identical to our query sequences (i.e. GS sequences). In this case, TPs are similarly re-computed to be '3-based'.

Abbreviations

TrAnsFuSE: Transfer of Annotations of Function using Sequence Elements
EC: Enzyme Commission
GOE: Great Oxidation Event
MSA: Multiple Sequence Alignment
HMM: Hidden Markov Model
CSA: Catalytic Site Atlas
icsa: Inpharmatica CSA
UniProt: Universal Protein Resource
PDB: Protein Data Bank
PSI-BLAST: Position-Specific Iterative Basic Local Alignment Search Tool
GO: Gene Ontology
PSSM: Position Specific Scoring Matrix
SIFTS: Structure Integration with Function, Taxonomy and Sequences
GS: Gold Standard
cgs: catalytic site based GS domains and sequences
mgs: metal binding site based GS domains and sequences
mcgs: metal and catalytic binding site based GS domains and sequences
EMBOSS: European Molecular Biology Open Software Suite
TrEMBL: Translation of European Molecular Biology Laboratory nucleotide sequences
GOS: Global Ocean Sampling expedition
UniMES: UniProt Metagenomic and Environmental Sequences database
