TrAnsFuSE refines the search for protein function: oxidoreductases


Integrative Biology
PAPER
Cite this: DOI: /c2ib00131d

TrAnsFuSE refines the search for protein function: oxidoreductases†‡

Arye Harel,* ab Paul Falkowski a and Yana Bromberg* b

Received 10th October 2011, Accepted 25th February 2012

Non-equilibrium catalysis of electron transfer reactions (i.e. redox) regulates the flux of key elements found in biological macromolecules. The enzymes responsible, oxidoreductases, contain specific transition metals in poorly sequence-conserved domains. These domains evolved ~2.4 billion years ago in microbes and spread across the tree of life. We lack understanding of how oxidoreductases evolved; divergence of sequences makes identification difficult. We developed a method to recognise the various versions of these enzyme domains in unannotated sequence space. Often, homology is used to transfer function annotations from experimentally resolved domains to unannotated sequences. Unreliability of inferring homology below 30% sequence identity limits single-sequence based searches. Misaligned functional sites may compromise annotation transfer from even very similar sequences. Combining profile-based searches with knowledge of functional sites could improve domain detection accuracy. Here we present an approach that enhances the search for redox domains using catalytic site annotations. From the scientific literature, we validated annotations of 104 InterPro domains indicated as using transition metals in redox reactions. These domains mediate electron transfer in 20% of oxidoreductases, primarily employing iron, copper and molybdenum. We used the experimentally identified catalytic residues in these domains to validate sequence alignment-based protein function annotations. Our method, TrAnsFuSE, is 11% and 14% more accurate than PSI-BLAST and InterPro, respectively.
Moreover, it is robust for use with other functional residues: we attain higher accuracy at comparable coverage using metal binding, in addition to catalytic, sites. TrAnsFuSE can be used to focus the study of the vast amounts of unannotated sequencing data from meta-/genome projects.

a Environmental Biophysics and Molecular Ecology Program, Institute of Marine and Coastal Science, Rutgers the State University of New Jersey, 71 Dudley Road, New Brunswick, NJ 08901, USA. harel@marine.rutgers.edu; Fax: ; Tel: x412
b Department of Biochemistry and Microbiology, Rutgers the State University of New Jersey, Lipman Hall 218, New Brunswick, NJ 08901, USA. yanab@rci.rutgers.edu; Fax: ; Tel: x203
† Published as part of an iBiology themed issue entitled "Computational Integrative Biology". Guest Editor: Prof. Jan Baumbach.
‡ Electronic supplementary information (ESI) available: Tables S1 and S2. See DOI: /c2ib00131d

Insight, innovation, integration

Oxidoreductases regulate key processes of life like photosynthesis and respiration. The phylogeny of these ancient and far diverged domains is hard to reconstruct. Identifying new sequences for these domains elucidates their evolutionary paths. Here, we built TrAnsFuSE, a novel approach that integrates sequence alignments with knowledge of functional sites to find new metal-redox oxidoreductase domains. Fewer sequences are identified in total, but the found set is virtually guaranteed to be correct. TrAnsFuSE can filter unannotated data from meta-/genome sequencing projects for high quality oxidoreductase matches. Moreover, the methodology can be easily adapted to fit other functional families.

Introduction

In nature, H, C, N, O, S, and P serve as the core elements used for the synthesis of biological macromolecules. 1 These elements are often found in molecules with oxidation states that make them biologically inaccessible. Redox reactions, mediated by oxidoreductases (Enzyme Commission 2 class 1; EC1), evolved to facilitate the flux of the first five of these elements. 3 Oxidoreductases catalyse the transfer of electrons using transition metals and other prosthetic groups. These enzymes are found in all kingdoms of life, but their catalytic
domains evolved in microbes before the Great Oxidation Event (GOE) over 2.4 billion years ago. 4 Over this time, the sequences containing these domains diverged, making their phylogeny difficult to reconstruct.

Searching for specific functionality, such as redox, in a set of unannotated sequences is generally based on homology to experimentally annotated genes/proteins. It is generally impossible to infer the function of one protein from another below 30% sequence identity. 5 In fact, some studies claim that the threshold is even higher (60%). 6 Even function annotation transfer from very similar sequences may be compromised when aligned functional residues differ. Profile searches, e.g. using Multiple Sequence Alignments (MSAs) 7 or Hidden Markov Models (HMMs), 8 are generally better at function transfer 9 as they describe the full set of functionally similar sequences. Profiles are built by aligning multiple homologous sequences and are meant to capture family-specific information, including functionally and structurally important residues. Using profiles helps identify distant homologues that are not recognized from alignments to single sequences. One example of a profile repository and search engine is InterPro, a widely used database of protein domains, 10 integrating profiles from 12 different resources. Even profile searches, however, can be improved by focusing more on the function-specific residues than on others. 11 Unfortunately, building a profile that highlights the functional residues and increases the probability of finding sequences containing these residues is not a trivial task. 12

In one effort 11 to improve function transfer by homology, catalytic residue annotations in the Catalytic Site Atlas (CSA) 13 were used to search for functionally related enzymes. In that study, function annotation transfer was only allowed when a homologue found by PSI-BLAST also conserved all catalytic residues.
This approach, iCSA (Inpharmatica CSA) filtering, is reported to have achieved 87% accuracy in finding enzymes with the same EC numbers at the level of the third digit. Here we show that searching for transition metal-binding redox proteins can similarly be improved. We introduce a novel tool, TrAnsFuSE (Transfer of Annotations of Function using Sequence Elements; flow chart in Fig. 1A), which can use function-defining sequence elements such as catalytic or metal binding sites to aid function transfer by homology. TrAnsFuSE accuracy of function transfer using catalytic sites, an approach similar to iCSA, is 94% (11% and 14% better than PSI-BLAST and InterPro, respectively). Filtering for metal binding residues, instead of catalytic sites, attains 71% accuracy (17% better than InterPro alone for the same set of sequences). Finally, the accuracy of function transfer attained by extending the concept of site-based filtering to the combination of metal and catalytic sites is 97%; i.e. 14%, 12%, and 3% higher than that of InterPro, PSI-BLAST, and catalytic site-based TrAnsFuSE, respectively. With this approach we pick up nearly 55 thousand high-accuracy novel oxidoreductases from TrEMBL 14 and annotate as redox-related roughly proteins in otherwise unannotated Global Ocean Sampling (GOS) expedition 15 data. The TrAnsFuSE methodology can thus potentially be used to improve function annotation for all of the unannotated sequence space.

Results and discussion

Iron, copper and molybdenum are the primary mediators of electron transfer

We extracted from InterPro 104 transition metal-utilizing redox-related domains (methods; Table S1, ESI‡). At least half of these domains are validated for containing metal binding sites, using annotations from InterPro component databases (e.g. Pfam, 8b TIGRFAMs, 16 SUPERFAMILY 17). There are hits for these domains in 7201 Swiss-Prot 18 entries (20% of all annotated EC1s). A small number of hits are ambiguous with regard to the metal they bind.
(1) ~4.5% of the InterPro hits are cambialistic; i.e. each sequence can utilize varied transition metals to catalyse redox reactions.
(2) ~1.5% are not explicitly defined by InterPro; i.e. the profile picks up different metal binding sequences.
(3) Vanadium, tungsten, manganese and nickel binding sequences are uncommon (1%).
(4) The majority (79.5%) of the hit domains solely use iron for electron transfer. Iron is present as: iron-sulfur clusters (56%; 45% 4Fe-4S and 11% 2Fe-2S; Fig. 2B), heme (31%), and iron (12%).
(5) The rest rely on copper (8.5%) and molybdenum (4.5%; Fig. 2A).

Distribution of transition metal domains is in line with evolutionary evidence

The 104 transition metal-redox domains preferentially use iron and molybdenum. If the distribution of these domains in Swiss-Prot is representative of the entire universe of proteins, it may be an indication of the age 4 and importance 19 of these domains. It has been previously shown that Fe and Mo were preferentially selected early in biological life. 4 The appearance of many of the protein structures binding these metals dates to before the GOE. Higher abundance of sequences containing iron-sulfur binding domains, over heme and iron binding ones, is also in line with the order of their evolution; i.e. structures that bind Fe in mixed Fe-S clusters are thought to have evolved before those that bind Fe through porphyrin rings or direct amino acid bonds. 4 We also found that sequences containing domains using copper for electron transfer are highly abundant in the Swiss-Prot database. In contrast to iron, copper became soluble only later in Earth's history, after the transition to a generally oxic and oxidizing ocean (~ billion years ago). 4 The abundance of copper may be a result of its incorporation into the cytochrome oxidase in the aerobic respiration complex. Thirty percent of all copper-based domain sequences encode a cytochrome c oxidase subunit, further supporting this hypothesis.
Over a third of sequence-defined transition-metal redox domains contain all experimentally annotated catalytic sites

The 104 redox domains can be subdivided into six domain classes based on the presence of iCSA-defined catalytic sites 13 within domain boundaries in the scanned EC1 Swiss-Prot sequences (methods; Table 1):
(1) 28 domains contain all catalytic sites in all sequences;
(2) 7 domains contain all catalytic sites in at least one representative sequence;
(3) 5 domains have at least one sequence containing at least one complete catalytic site;
(4) 10 domains have at least one sequence containing at least one partial catalytic site;
(5) 16 domains contain no catalytic sites within domain boundaries; and
(6) 38 domains have no annotation in iCSA.
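The six-way split above can be pictured as a simple decision cascade. The following Python sketch is an illustration only, not the authors' implementation: the `SequenceAnnotation` record, the helper names, and the reading of "complete" versus "partial" (all versus some residues of a site falling inside the domain span) are assumptions made for this example, and every site is assumed to list at least one residue.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class SequenceAnnotation:
    """Hypothetical record: one scanned sequence carrying the domain."""
    domain_span: Tuple[int, int]  # inclusive (start, end) of the domain hit
    sites: List[Set[int]]         # residue positions per site; [] = no annotation


def site_status(site: Set[int], span: Tuple[int, int]) -> str:
    """Return 'complete', 'partial' or 'absent' for one site versus the domain span."""
    start, end = span
    inside = sum(1 for pos in site if start <= pos <= end)
    if inside == len(site):
        return "complete"
    return "partial" if inside > 0 else "absent"


def domain_class(sequences: List[SequenceAnnotation]) -> int:
    """Assign a domain to classes 1-6 following the order of the list above."""
    annotated = [s for s in sequences if s.sites]
    if not annotated:
        return 6  # no functional-site annotation at all
    status = [[site_status(site, s.domain_span) for site in s.sites] for s in annotated]
    if all(all(x == "complete" for x in seq) for seq in status):
        return 1  # all sites in all sequences
    if any(all(x == "complete" for x in seq) for seq in status):
        return 2  # all sites in at least one representative sequence
    if any("complete" in seq for seq in status):
        return 3  # at least one complete site somewhere
    if any("partial" in seq for seq in status):
        return 4  # at least one partial site somewhere
    return 5      # sites exist but none falls inside the domain


# Toy data (positions are invented for illustration only).
example = [
    SequenceAnnotation((10, 100), [{20, 30}]),
    SequenceAnnotation((10, 100), [{20, 300}]),
]
print(domain_class(example))
```

In this toy case the first sequence holds its single site entirely inside the domain while the second does not, so the domain falls into class 2.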

Fig. 1 TrAnsFuSE flow diagram of searching in Swiss-Prot and GOS for transition metal oxidoreductases. The TrAnsFuSE search of Swiss-Prot is highly accurate and can be used to search the GOS database of unannotated sequences. Panel (A) shows the performance of TrAnsFuSE (and the numbers of true and false positives) in using GS (cgs, mgs, and mcgs) sequences to search all of Swiss-Prot. All TrAnsFuSE steps are illustrated: first, PSI-BLAST followed by filtering for catalytic, metal or metal and catalytic sites. We show for comparison the results of scanning Swiss-Prot for InterPro GS domains (cgs, mgs, and mcgs). Note that each 3-box node on the tree represents the three possible data sets corresponding to searching with catalytic (cgs), metal (mgs), and metal and catalytic (mcgs) sites. Panel (B) similarly displays the InterPro and TrAnsFuSE search results for the GOS database. Note that we only report the results based on catalytic sites (i.e. using cgs).

Fig. 2 Distribution of domains utilizing transition metals for redox, in Swiss-Prot EC1 oxidoreductases. The majority of metal-redox domains found in Swiss-Prot EC1s utilize iron and copper for redox. Panel (A) shows the percentage of metal-redox domains utilizing: iron (white), copper (light grey), ambiguous metal (copper/zinc, iron/molybdenum/vanadium, iron/copper, iron/manganese; dark grey), molybdenum (black), and vanadium/tungsten/nickel (too few of each to indicate separately; horizontal lines). Ambiguous metal denotes cambialistic domains or domains that are not explicitly defined by InterPro to bind one specific metal. Iron-sulfur clusters are the preferred way of iron utilization in redox. Panel (B) shows the percentage of iron-binding domains using iron-sulfur clusters (white), heme (grey), and iron (black) ligands.

Our analysis reveals 35 domains in 135 Swiss-Prot EC1s (domain classes 1 and 2) with at least one representative sequence that contains all known catalytic sites within InterPro domain boundaries. Note that one Swiss-Prot entry may contain more than one InterPro domain. This set of domains (Table 2) and their representative sequences (Table S2, ESI‡) is referred to as catalytic site-based Gold Standard (cgs) domains and sequences. The cgs domains are found in 13% of Swiss-Prot EC1s. Between all of them, these proteins use all of the transition metals except nickel and tungsten. In this domain subset the distribution of metal prosthetic groups is similar to that in the full oxidoreductase set; i.e. the majority of cgs domains utilize iron (73%; 40% iron-sulfur clusters, 30% heme, and 3% iron), copper (6%), and molybdenum (5%).

Identifying catalytic sites in sequence is not simple

Catalytic sites mapped onto a monomer. In the simplest case, the entire 3D functional unit of an enzyme is a monomer. For example, Formate dehydrogenase H (UniProt ID P07658; PDB 1aa6) contains a Molybdopterin oxidoreductase domain (InterPro ID IPR006655). This domain contains all four formate dehydrogenase catalytic residues making up the single catalytic site reported in the CSA. In this case, all sites are easily mapped onto the corresponding UniProt sequence.

Catalytic sites mapped onto a homo-polymer. In some cases, the functional protein unit is a polymer of identical chains.
For example, Superoxide reductase (SOR; P82385, 1do6) contains Desulfoferrodoxin, ferrous iron-binding domain (IPR002742). SOR forms a homo-tetramer, with each subunit adopting a fold that coordinates a non-heme iron centre. 20 The desulfoferrodoxin domain contains the two SOR catalytic residues annotated by CSA. Since the catalytic site is identical for each of the components of the homo-tetramer, the sequence mapping only contains two catalytic residues.

Catalytic sites mapped onto a hetero-polymer. Finally, catalytic function is often performed by a complex of different chains. For instance, the nitrogenase complex (PDB 1n2c 21; Table 3) is made of three sequence-distinct chains: NifH, NifD, and NifK. These sequences contain the corresponding four domains: two for NifH and two for NifD and NifK. Each chain contributes different catalytic sites to the complex (Table 3). There is also a site made by the combination of all three chains in the complex. The catalytic residues participate in a variety of functions, such as the electron transfer pathway between the P-cluster (8Fe-7S) and the FeMo-cofactor of the Mo-Fe protein. 21,22 In the case of the nitrogenase complex, and in all cases of a single catalytic site made by different chains of a hetero-polymer, mapping all catalytic residues to a single Swiss-Prot sequence is impossible. However, the catalytic residues of one chain that are involved in a joint-chain catalytic site are, for our purposes, annotated as separate catalytic sites.

Table 1 Domain classes. Categorization of domains that use transition metals for redox based on the presence of functional sites: iCSA-defined catalytic sites, metal binding sites, or both
Columns: Domain class | Domain description | Catalytic sites: domain count (%), sequence count a | Metal-binding sites: domain count (%), sequence count a | Both metal binding and catalytic sites: domain count (%), sequence count a
(Several cell values did not survive extraction; the values below are listed in their original order.)
1 | All functional sites in all sequences | 28 (27%) (33%) (22%) 81
2 | All functional sites in at least one representative sequence | 7 (7%) (15%) 48 9 (9%) 47
3 | At least one sequence containing at least one complete functional site | 5 (5%) (26%) (26%) 91
4 | At least one sequence containing at least one partial functional site | 10 (10%) 53 0 b
5 | No functional sites within domain boundaries | 16 (15.5%) 0 21 (20%) 0 6 (6%) 0
6 | No annotation for functional sites | 38 (36.5%) 0 6 (5.7%) 0 39 (38%) 0
a Sequences with functional sites in domain boundaries.
b Partial sites are impossible when a site is a single residue.

Table 2 cgs oxidoreductase domains (gold standard catalytic site-based domains) a
# | InterPro ID | Domain description | Metal | Ligand
1 | IPR | Peptidyl-glycine alpha-amidating monooxygenase | Copper | Cu
2 | IPR | Cytochrome c oxidase subunit II C-terminal | Copper | Cu
3 | IPR | Di-copper centre-containing | Copper | Cu
4 | IPR | Copper amine oxidase, C-terminal | Copper | Cu
5 | IPR | Aromatic-ring-hydroxylating dioxygenase, alpha subunit | Iron | 2Fe-2S
6 | IPR | Rieske iron-sulfur protein, C-terminal | Iron | 2Fe-2S
7 | IPR | Rieske [2Fe-2S] iron-sulfur domain | Iron | 2Fe-2S
8 | IPR | 4Fe-4S ferredoxin, iron-sulfur binding domain | Iron | 4Fe-4S
9 | IPR | Nitrogenase iron protein, subunit NifH/Protochlorophyllide reductase, subunit ChlL | Iron | 4Fe-4S
10 | IPR | Iron hydrogenase, large subunit, C-terminal | Iron | 4Fe-4S
11 | IPR | Light-independent protochlorophyllide reductase, iron-sulfur ATP-binding protein | Iron | 4Fe-4S
12 | IPR | Nitrogenase iron protein NifH | Iron | 4Fe-4S
13 | IPR | NADH:ubiquinone oxidoreductase-like, 20 kDa subunit | Iron | 4Fe-4S
14 | IPR | Intradiol ring-cleavage dioxygenase, C-terminal | Iron | Fe
15 | IPR | Isopenicillin N synthase | Iron | Fe
16 | IPR | Desulfoferrodoxin, ferrous iron-binding domain | Iron | Fe
17 | IPR | Taurine catabolism dioxygenase TauD/TfdA | Iron | Fe
18 | IPR | Extradiol ring-cleavage dioxygenase, class III enzyme, subunit B | Iron | Fe
19 | IPR | Heme peroxidase, plant/fungal/bacterial | Iron | Heme
20 | IPR | Plant ascorbate peroxidase | Iron | Heme
21 | IPR | Cytochrome P450, B-class | Iron | Heme
22 | IPR | Cytochrome P450, E-class, group I | Iron | Heme
23 | IPR | Cytochrome P450, E-class, group IV | Iron | Heme
24 | IPR | Nitric oxide synthase, oxygenase domain | Iron | Heme
25 | IPR | Heme peroxidase | Iron | Heme
26 | IPR | Nitric oxide synthase, oxygenase subunit | Iron | Heme
27 | IPR | Heme peroxidase, animal, subgroup | Iron | Heme
28 | IPR | Cytochrome c oxidase, subunit I | Iron, copper | Heme a3-CuB
29 | IPR | Cytochrome c oxidase, subunit I bacterial type | Iron, copper | Heme a3-CuB
30 | IPR | Nitrogenase component 1, conserved site | Iron, vanadium, molybdenum | FeFe, MoFe, VFe
31 | IPR | Nitrogenase/oxidoreductase, component 1 | Iron, vanadium, molybdenum | FeFe, MoFe, VFe
32 | IPR | Germin, manganese binding site | Manganese | Mn
33 | IPR | Molybdopterin oxidoreductase, prokaryotic, conserved site | Molybdenum | Mo
34 | IPR | Aldehyde oxidase/xanthine dehydrogenase, molybdopterin binding | Molybdenum | Mo
35 | IPR | Superoxide dismutase, copper/zinc, binding site | Copper, zinc | Fe, Zn
a For domains in this list, all catalytic sites defined in the iCSA 11 map within the domain boundaries.

Table 3 Mapping the nitrogenase complex (hetero-polymer) chains to Swiss-Prot entries
P00459, Nitrogenase iron protein 1 (nifH)
  InterPro domains: IPR Nitrogenase iron protein NifH; IPR Nitrogenase iron protein, subunit NifH/protochlorophyllide reductase, subunit ChlL
  Description: Component 2: homodimer, iron-sulfur protein
  Catalytic site residues a: 0: K11, D130; 1: K16, K42; 2: K11, K16, K42, D130
  PDB chains a: 0: E, F, G, H; 1: E, F, G, H; 2: E, F, G, H
P07328, Nitrogenase Mo-Fe protein α-chain (nifD)
  InterPro domains: IPR Nitrogenase/oxidoreductase, component 1
  Description: Component 1: hetero-tetramer, Mo-Fe containing protein made of two pairs of α and β subunits
  Catalytic site residues a: 0: C154, L158; 1: C62, A65, R96, H195
  PDB chains a: 0: A, C; 1: A, C
P07329, Nitrogenase Mo-Fe protein β-chain (nifK)
  InterPro domains: IPR Nitrogenase/oxidoreductase, component 1; IPR Nitrogenase component 1, conserved site
  Description: Component 1: (same as above)
  Catalytic site residues a: 0: C153, V157
  PDB chains a: 0: D, B
a Catalytic sites and PDB chains of 1n2c are extracted from the Catalytic Site Atlas (CSA).
The complete reaction mediated by a given EC1 may require other processes in addition to electron transfer, including, but not limited to, protonation, 23 substrate gating, 20 and guiding of the substrate into the active site. 24 The presence of all necessary catalytic residues is therefore imperative for proper oxidoreductase function. This observation is the motivation for our preferential selection of only those cgs sequences containing all known catalytic sites within domain boundaries.

By excluding sequences that contain catalytic sites that reside outside the domain boundaries (or are lacking some sites altogether), we define a compact functional subunit that serves as a reliable basis for further profile-building and evolutionary studies. Additionally, this choice allows us to measure method performance in the simplest and most conservative fashion. Future studies should aim to integrate into the cgs set partial catalytic sites, as is now done for joint hetero-polymer sites. Manual curation to enlarge the arsenal of cgs sequences could also be used to retrieve more active domains.

Using catalytic site annotations to refine searching for oxidoreductase functionality improves accuracy

4614 Swiss-Prot EC1 and 1106 non-EC1 sequences contain cgs domains. Assuming that all EC1s in this set are correctly assigned oxidoreductase activity, these counts correspond to ~80% accuracy (InterPro accuracy; methods, eqn (1); Fig. 3A, white column, "inGSdomains") of annotating protein function using cgs domains. Alignment of the cgs sequences to all of these InterPro hits using PSI-BLAST (methods; no restrictions for sequence identity, seq. id. > 0%) results in 3558 EC1s and 496 non-EC1s, for a total of 88% accuracy (PSI-BLAST accuracy, Fig. 3A, dots column, "inGSdomains"). Filtering the aligned sequences for those containing all catalytic sites (methods) results in 2410 EC1s and 146 non-EC1s (94% TrAnsFuSE accuracy, Fig. 3A, grey column, "inGSdomains"). To recap, TrAnsFuSE is 14% more accurate than InterPro and 6% more accurate than PSI-BLAST for sequences containing cgs domains.

True positives. The 2410 cgs domain-containing EC1 entries that conserve all catalytic sites are considered to be true positives (TP; methods). An example of such an entry is the Formate dehydrogenase subunit alpha (P06131) found by alignment to the cgs sequence Formate dehydrogenase H (described above).
This alignment (sequence identity = 41%) shows that all catalytic sites of the latter are also present in the former, indicating similar function. Filtering for catalytic sites also enables validating PSI-BLAST results with low sequence identities. Alignments of the cgs sequences to EC1s with cgs domains, at sequence identities below 30% but filtered for catalytic sites, attain the same 94% accuracy as better (higher seq. id.) alignments. For example, Cytochrome P450 4d1 (O16805) was found by Cytochrome P450 19A1 (P11511) with a sequence identity of 28.5% via conservation of all four catalytic residues.

False positives. There are 1106 non-EC1s (including 42 enzymes from different EC classes) containing at least one of the 35 cgs domains. Assuming that oxidoreductase activity is fully described by EC1s, these represent erroneous annotations on behalf of InterPro. 45% of these sequences were also found by alignment to cgs sequences, an error of PSI-BLAST-mediated function annotation transfer. However, most of these lack catalytic residues, leaving only 146 (13%) false positives (FP; methods) that are retained by our method.

True negatives. The full set of Swiss-Prot proteins that do not contain cgs domains constitutes the very large, but trivial, set of true negative (TN) annotations. Note that while this set may contain transition metal-binding oxidoreductases, we have no explicit way of finding these. As Swiss-Prot entries are manually curated, it is reasonable to assume that there is a negligible number of errors in the classification of the enzyme's primary activity, i.e. redox (EC first digit). Therefore, non-EC1 enzymes missing the annotated catalytic sites may be safely assumed to not have redox activity in spite of any InterPro hits. Such is the case for Probable intron-encoded endonuclease aI3 (A9RAH6), which contains the cgs domain Cytochrome c oxidase (IPR000883) and was also found via alignment to one of the cgs sequences (Q5SJ79).
However, the alignment did not conserve the catalytic residues, so the sequence match was considered invalid. The Swiss-Prot manual annotation of this sequence, "truncated non-functional cytochrome oxidase 1", supports the assumption that this protein is inactive as a cytochrome component. In line with the assumption of a negligible number of erroneous assignments of enzyme primary activity, non-enzyme non-EC1s are even less likely to be misannotated. For example, the Sporulation initiation inhibitor protein soj (P37522), containing the Nitrogenase_NifH/reductase_ChlL (IPR000392) domain, aligned to the cgs sequence NifH (described above). The alignment was correctly discarded since it was missing the catalytic residues.

False negatives. EC1 entries containing cgs domains that lack proper alignment of the catalytic residues are considered false negatives (FNs) for the purposes of computing "1-based coverage" (methods, eqn (2)). In quite a few cases these are manifestations of a deficiency of our method. However, in some examples this is actually a problem of lack of data, which can be assuaged by expanding the cgs repertoire as discussed above. For example, our method incorrectly invalidates some sequences due to the lack of a true corresponding cgs sequence. Such is the case with NAD(P)H-quinone oxidoreductase (EC ; A0A389) containing a cgs ferredoxin domain (IPR017896). This sequence does not align to anything in the cgs sequence list, so it is never picked up by TrAnsFuSE. In other cases, the FNs should actually be TNs. Specifically, cgs sequence catalytic site annotation might be inaccurate for experimental or homology-based function transfer reasons. For example, for redox domain-containing sequence fragments (incomplete sequences) our prediction of lack of redox activity is actually correct. Note that there is no automated way to estimate the extent of all misassignments.
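The accuracy figures quoted for sequences containing cgs domains can be checked directly from the EC1 and non-EC1 counts given in the text. The sketch below assumes that accuracy is the fraction of hits that are EC1s, i.e. TP / (TP + FP); eqn (1) itself is in the methods and is not reproduced in this section, so that form is an assumption, although it matches the reported percentages. The function and variable names are illustrative.

```python
def accuracy(ec1_hits: int, non_ec1_hits: int) -> float:
    """Fraction of hits that are EC1s; assumed form of eqn (1): TP / (TP + FP)."""
    return ec1_hits / (ec1_hits + non_ec1_hits)


# (EC1, non-EC1) counts quoted in the text for sequences containing cgs domains.
steps = {
    "InterPro": (4614, 1106),
    "PSI-BLAST": (3558, 496),
    "TrAnsFuSE": (2410, 146),
}

for name, (tp, fp) in steps.items():
    print(f"{name}: {accuracy(tp, fp):.1%}")

# Share of the 1106 InterPro non-EC1 hits that survive the catalytic-site filter.
retained_fp = 146 / 1106
print(f"false positives retained: {retained_fp:.0%}")
```

Rounded, these values correspond to the ~80%, 88% and 94% accuracies and the 13% of false positives retained that are reported above.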
Fig. 3 Comparing the performance of TrAnsFuSE to PSI-BLAST and InterPro. Filtering for the presence of metal-binding sites, catalytic sites, and a combination of catalytic sites and metal-binding sites results in improved accuracy when compared with InterPro and PSI-BLAST. Panels A-C show the accuracy of searches based on (A) catalytic residues, (B) metal-binding residues, and (C) metal and catalytic residues in all Swiss-Prot sequences (All), Swiss-Prot sequences that contain GS domains (inGSdomains), and Swiss-Prot sequences that do not contain GS domains (outGSdomains). Accuracy ± standard deviation is shown for InterPro (white bars), PSI-BLAST using the GS sequences as queries (dots bars), and TrAnsFuSE using functional sites (grey bars). The latter consistently outperforms both InterPro and PSI-BLAST for all presented sets. Panels D-F show accuracy (line), coverage over the first 3 EC digits ("3-based coverage"; outer white column) and coverage over the first EC digit only ("1-based coverage"; inner grey columns) of InterPro, PSI-BLAST, and TrAnsFuSE searches in all of Swiss-Prot using (D) cgs, (E) mgs, and (F) mcgs sites.

Note that many of the InterPro domain profiles are generated using Swiss-Prot sequences, among others. Since our cgs sequences are specifically selected to contain these domains, there is some circular logic in using the cgs sequences to search Swiss-Prot. However, two concepts are key: (1) if there is bias in InterPro to do better for Swiss-Prot, InterPro scan results should be overestimated in comparison to our method (this is not the case, as illustrated by the accuracy of TrAnsFuSE) and (2) we expect to use our method in the future to annotate proteins from new sequencing projects that were not used in the making of the profiles.

Additional profiles needed to find all EC1s

There are over 7000 EC1 proteins with GO and Swiss-Prot sequence annotations indicative of binding transition metals. Of these sequences, nearly a fifth do not contain any of the 104 curated metal-redox domains and only two of these
proteins can be found by alignment to cgs sequences. Our inability to annotate these sequences as metal-binding oxidoreductases is explained by two reasons: (1) the set of cgs sequences is incomplete, covering only ~30% of all oxidoreductase domains, and (2) the full hand-curated domain set misses other relevant domains. It is also plausible, however, that some of these missed proteins do not map to any known metal-binding redox domains. For instance, the Alpha-ketoglutarate-dependent dioxygenase FTO (Q9C0B1) does not contain any of our 104 domains, nor does it align with any of the cgs sequences. Yet there is sufficient evidence that it may oxidize iron. 25 We believe that many oxidoreductases not found by homology-based searches can be annotated with additional oxidoreductase profiles. Until our domain set is expanded, however, we cannot expect to find all annotated EC1s. Thus, to estimate coverage (sensitivity) we would do injustice to our method by simply taking as reference all 35 thousand EC1s in Swiss-Prot (methods, "1-based coverage", Fig. 3D-F, inner grey columns). Since we were using a limited set of domains to describe all of oxidoreductase activity, we estimated the possible hit-space for "3-based coverage" with a collection of Swiss-Prot sequences with the first 3 digits of the EC number identical to our query sequences (methods, Fig. 3D-F, outer white columns).

TrAnsFuSE is accurate and enables extending searching for function beyond InterPro-defined sequences

cgs sequences are constrained by our definition to contain cgs domains. However, using cgs sequences as queries for alignments does not guarantee the presence of cgs domains in hit sequences. A PSI-BLAST of cgs sequences against all of Swiss-Prot (seq. id. > 0%) results in 83% EC1/non-EC1 accuracy (Fig. 1A and 3A, dots column, "All"). Filtering these hits for sequences containing all catalytic sites (applying TrAnsFuSE) attains 94% accuracy (Fig.
1A and 3A, grey column, "All") and 53% "3-based coverage" (Fig. 3D, outer white column). Even for similar sequences, with over 40% sequence identity, TrAnsFuSE with catalytic sites (Fig. 4, filled circles) was more accurate than PSI-BLAST alone (Fig. 4, empty circles). The majority of the catalytic-site filtered EC1s were found both by scanning for cgs domains and by alignments to cgs sequences. However, some of these were accessible only by alignment with sequences without the cgs domains (Fig. 3A, grey columns, "outGSdomains"). For example, the Cytochrome P450 3A8 (P33268), which does not contain any of the cgs domains, was found by alignment to the cgs sequence Putative cytochrome P (Q59990). Thus, additionally TrAnsFuSE-ing sequences not containing cgs domains added 9% (207 sequences) to the coverage of EC1 data with an insignificant drop in accuracy. To recap, for the set of all Swiss-Prot sequences (with and without cgs domains) TrAnsFuSE with catalytic sites was more accurate than InterPro or PSI-BLAST (TrAnsFuSE 94%, InterPro 80%, PSI-BLAST 82%) at a low cost to coverage (53% TrAnsFuSE, ~67% for InterPro and PSI-BLAST).

TrAnsFuSE is a robust approach for filtering by other functional residues: metal-binding sites

In order to test the robustness of our method in annotating sequences based on other functional sites, we used metal binding site annotations instead of catalytic sites (methods).

Fig. 4 Comparison of the performance of TrAnsFuSE search with varied functional sites at different thresholds of sequence similarity. TrAnsFuSE performs well at low sequence similarity thresholds. Filled shapes show the accuracy of TrAnsFuSE-ing through all Swiss-Prot sequences at 0-60% sequence identity, with catalytic sites (circles), metal-binding sites (squares), and both metal-binding and catalytic sites (triangles).
Empty shapes show the PSI-BLAST accuracy for searching all of Swiss-Prot at the same (0–60%) similarity thresholds for sequences aligning to cgs (circles), mgs (squares), and mcgs (triangles) sequences.

Based on the presence of metal binding residues, domains are divided into the previously described domain classes (Table 1). Our analysis reveals 50 domains in 239 Swiss-Prot EC1s (domain classes 1 and 2; metal binding, mgs, domains and sequences) with at least one representative sequence that contains all known metal binding sites within InterPro domain boundaries. Roughly equal numbers of EC1s and non-EC1s contain the mgs domains, giving 53% accuracy (Fig. 1A and 3B, white columns in 'ingsdomains'). Of these, 5233 EC1s and 2101 non-EC1s align to the mgs sequences, for a total of 71% PSI-BLAST accuracy. TrAnsFuSE-ing with metal binding sites is just as accurate (Fig. 3B, grey columns, 'ingsdomains'). A PSI-BLAST of mgs sequences against all of Swiss-Prot results in 70% accuracy, and TrAnsFuSE-ing with metal binding sites is just 1% more accurate (Fig. 1A and 3B, grey columns, 'All'; 47% '3-based' coverage, Fig. 3E, outer white column).

The InterPro accuracy of the mgs domains (53%) is lower than the accuracy of the cgs domains (80%) or even the baseline accuracy for all 104 transition metal-binding domains (57%). The higher number of non-EC1s containing mgs domains results in lower accuracies of PSI-BLAST (71% mgs vs. 88% cgs) and of subsequent TrAnsFuSE-ing (71% mgs vs. 94% cgs). Note that both these results are still ~17% more

accurate than InterPro alone (Fig. 3B, grey vs. white bars). The difference in the accuracies is partially due to the different sets of domains and sequences. This difference demonstrates that searching for a specific function requires being very specific in the description of gold standard sequences, functional sites, and corresponding domains. The similar performance of metal-binding site TrAnsFuSE and mgs sequence-based PSI-BLAST is likely a result of the lower specificity of metal binding sites for oxidoreductases. Metal binding sites are less definitive of function than catalytic sites (i.e. metal binding may occur for many reasons) and are generally defined on a per-residue basis (i.e. the full site is not defined; rather, all participating residues are defined one by one). To achieve greater specificity, we combined the two annotations, catalytic and metal binding sites, into one.

TrAnsFuSE is a robust approach for filtering by other functional residues: [catalytic and metal binding]-sites

Based on the presence of metal-binding and catalytic residues, domains are divided into the previously described domain classes (Table 1). Our analysis reveals 32 domains in 120 Swiss-Prot EC1s belonging to domain classes 1 and 2 (metal binding and catalytic site based, mcgs, domains and sequences). InterPro is 83% accurate in identifying EC1s and non-EC1s from mcgs domains (Fig. 1A). Of these, 3426 EC1s and 335 non-EC1s align to the mcgs sequences, for a total of 91% PSI-BLAST accuracy. TrAnsFuSE-ing with both metal binding and catalytic sites is 97% accurate (2200 EC1s and 67 non-EC1s; Fig. 3C, grey columns, 'ingsdomains'). A PSI-BLAST of mcgs sequences against all of Swiss-Prot results in 85% accuracy, and TrAnsFuSE-ing with both metal binding and catalytic sites is 12% more accurate (Fig. 1A and 3C, grey columns, 'All'; 51% '3-based' coverage, Fig. 3F, outer white column).
Thus, TrAnsFuSE-ing Swiss-Prot sequences with both metal and catalytic sites was more accurate but less sensitive than InterPro or PSI-BLAST (accuracy improvement/coverage loss of 14%/17% relative to InterPro and 12%/16% relative to PSI-BLAST). This trade-off between gains in accuracy and losses in coverage is advantageous when searching for function signals in large unannotated databases such as that of the Global Ocean Sampling (GOS) expedition [15].

TrAnsFuSE-ing all Swiss-Prot sequences with both metal and catalytic sites achieves much higher accuracy than using catalytic or metal sites alone. This gain in accuracy is not accompanied by a loss of '3-based' coverage across the functional site types (Fig. 3D–F, outer white columns). The high gain in accuracy is a result of selecting a set of highly relevant and complete functional annotations to define transition-metal oxidoreductase activity. Thus, if relevant functional sites are defined, TrAnsFuSE can arguably search sequence databases for any given function.

TrAnsFuSE achieves high accuracy at low sequence similarities

TrAnsFuSE performs better than PSI-BLAST in identifying EC1s at all levels of sequence identity (Fig. 4), but most visibly so at low thresholds. Aligning the mcgs sequences against all Swiss-Prot EC1s and TrAnsFuSE-ing the results with both catalytic and metal-binding sites achieved high accuracy (~97%) even at seq. id. ≤30% (filled triangles; Fig. 4). These values are 12% higher than the accuracy achieved by PSI-BLAST (empty triangles; Fig. 4) and 14% higher than InterPro (83%). Similar trends were visible for TrAnsFuSE-ing with metal-binding sites or with catalytic sites alone; metal binding-site filtering performed worse than catalytic-site filtering, and worse than mcgs TrAnsFuSE-ing. The ability to look in previously ignored sequence similarity regions will enable finding diverged proteins of similar function.
Use of TrAnsFuSE decreases the experimental load necessary to analyse the vast amounts of new data

Sequences from genome and metagenome sequencing projects are stored in public resources such as RefSeq [27] and TrEMBL [14]. These databases, among many other features, provide the user with automatic annotations of sequence functionality. Unfortunately, the accuracy and coverage of these annotations leave much to be desired. We compared our method with the UniProt automatic annotations for oxidoreductases found in TrEMBL. There was ~40% agreement in EC1 assignment between TrEMBL annotations and TrAnsFuSE-ing TrEMBL sequences with cgs catalytic sites. Although we do not have the means to measure which approach is more accurate, we found that if we exclude non-enzyme TrEMBL entries (entries with no EC number annotations) there is 99% agreement between the two approaches. Thus, the main difference between UniProt automatic annotations and TrAnsFuSE comes from the set of entries not annotated as enzymes by TrEMBL. Considering the size of TrEMBL, this is a very large set. While much of this set is likely non-enzyme, it also contains unidentified enzymes. PSI-BLASTing the cgs sequences against all of TrEMBL (18M sequences) results in 635K hits, of which 256K do not have TrEMBL-assigned EC numbers. TrAnsFuSE-ing all of TrEMBL with the catalytic sites produces 88K hits, of which 55K have no TrEMBL-assigned EC number. Even if all the TrAnsFuSE false positives (5K, based on 94% accuracy) are in the set of TrEMBL non-ECs, we expect to bring in at least 50K new EC1s not picked up by TrEMBL. The ability to annotate ~20% (50K of 256K TrEMBL non-ECs) more sequences clearly demonstrates the value of TrAnsFuSE.

Genome and metagenome sequencing projects, such as the Global Ocean Sampling (GOS) expedition [15], produce vast amounts of sequence data. We have analysed 6 million (6M) GOS entries of predicted proteins (Fig. 1B) found in the UniMES database [26]. Fifty thousand (50K) of these sequences contain the 35 cgs domains. Given the previously reported InterPro performance (80% accuracy), we estimate that these comprise 40K EC1s and 10K non-EC1s. PSI-BLAST of cgs sequences against UniMES results in 74K hits, while TrAnsFuSE-ing with catalytic sites produces 13K sequences. Assuming that TrAnsFuSE is 94% accurate, we estimate ~700 false positive results in this set. The 93% drop in the number of estimated false positives (~700 instead of 10K for InterPro) clearly demonstrates the value of TrAnsFuSE in decreasing the experimental load. While the strength of this approach is in finding hits with high reliability, note that attaining a broader overview of EC1 functionalities requires using additional InterPro domains.
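The estimates in this section follow from simple arithmetic on the quoted hit counts and accuracies. The sketch below reproduces them in Python using only the rounded figures given in the text (in thousands of sequences); the function and variable names are ours and purely illustrative.

```python
# All counts are in thousands of sequences (K), rounded as quoted in the text.
TRANSFUSE_ACCURACY = 0.94  # catalytic-site TrAnsFuSE accuracy on Swiss-Prot
INTERPRO_ACCURACY = 0.80   # InterPro accuracy for the cgs domains


def expected_false_positives(hits_k: float, accuracy: float) -> float:
    """Expected false positives (in K) among hits_k hits at the given accuracy."""
    return hits_k * (1.0 - accuracy)


# TrEMBL: 88K TrAnsFuSE hits, 55K of them without an EC number;
# 256K PSI-BLAST hits without an EC number.
trembl_fp_k = expected_false_positives(88, TRANSFUSE_ACCURACY)
# Worst case: every false positive falls among the hits lacking an EC number.
new_ec1_k = 55 - trembl_fp_k
share_of_unannotated_hits = new_ec1_k / 256

# GOS (UniMES): 50K sequences with cgs domains, 13K TrAnsFuSE hits.
interpro_fp_k = expected_false_positives(50, INTERPRO_ACCURACY)
transfuse_fp_k = expected_false_positives(13, TRANSFUSE_ACCURACY)
fp_reduction = 1.0 - transfuse_fp_k / interpro_fp_k

print(f"TrEMBL false positives: ~{trembl_fp_k:.1f}K")
print(f"New EC1s (lower bound): ~{new_ec1_k:.0f}K "
      f"({share_of_unannotated_hits:.0%} of unannotated PSI-BLAST hits)")
print(f"GOS false positives: InterPro ~{interpro_fp_k:.0f}K, "
      f"TrAnsFuSE ~{transfuse_fp_k:.2f}K ({fp_reduction:.0%} fewer)")
```

With these rounded inputs the sketch gives roughly 0.8K GOS false positives and a reduction of about 92%, close to the ~700 and 93% quoted above; the small gap presumably reflects rounding of the hit counts in the text.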

The methodology presented here, while as yet less extensively developed than resources like InterPro, is more accurate and, as such, requires less experimental verification of its predictions. Arguably, when dealing with millions of sequences this is an advantage that cannot be overstated.

Conclusions

We developed a protein sequence-scanning method (TrAnsFuSE) that improves the search for proteins containing transition metal-utilizing redox domains (EC1s). We used a set of manually curated, transition metal binding, redox-specific domains as the basis of our study. First, we demonstrated that the distribution of these domains is in line with trends in their evolution. Then, we used a set of gold standard sequences, with annotated catalytic sites found within redox domain borders, to query all of Swiss-Prot. Only the sequences that conserved all catalytic sites of the query were assigned oxidoreductase functionality. This methodology achieved 94% accuracy in finding EC1s. Moreover, we demonstrated the robustness of this approach for generic functional sites by using annotations for metal binding residues instead of catalytic sites, and by using the combination of both metal binding and catalytic sites. The concepts described here may improve homology-based function annotation transfer using sites such as binding hot-spots and, potentially, predictions of functional residues instead of experimental annotations. Our method will decrease the experimental load necessary for analysis of the functions of unknown sequences coming from the sea of data in (meta-)genome sequencing projects like the Global Ocean Sampling (GOS) Expedition.

Experimental

Protein entries

Reviewed protein entries (Swiss-Prot) were downloaded in January 2011 with their amino acid sequences, EC (Enzyme Commission) numbers, Gene Ontology [28] (GO) terms, and the associated PDB [29] identifiers.
Non-reviewed protein entries (TrEMBL) with their amino acid sequences and EC numbers were downloaded in December. GOS protein entries with their amino acid sequences and InterPro [10] annotations were extracted from the UniProt Metagenomic and Environmental Sequences database (UniMES; downloaded November 2011).

Domain curation

Transition metal-utilizing redox domains in Swiss-Prot and TrEMBL entries were extracted from InterPro [10,30]. For Swiss-Prot, only entries whose descriptions or associated GO terms contained transition metal symbols (e.g. Fe, Cu, etc.), names (e.g. iron, copper, etc.), or names of structures containing a transition metal (e.g. ferredoxin) were retained. This set was manually validated by checking for transition metal utilization in redox reactions. Validation was based on the InterPro summaries, component database annotations (e.g. Pfam [8b], TIGRFAMs [16], SUPERFAMILY [17]), and literature citations for a given InterPro entry. Note that for some of the sequences the Swiss-Prot annotation for metal binding is not in agreement with the InterPro domain annotation; i.e. the InterPro profile annotates cambialistic sequences but the Swiss-Prot annotation verifies only one metal. An InterPro signature may outline more than one activity (other than redox), as reflected in the descriptions of its component signatures. To obtain the domains definitively relevant to our study, we extracted via manual curation the entries containing solely signatures that use transition metals to perform redox. Note that we refer to all of these entries as domains, although different terms may have been used in InterPro (e.g. conserved site). Since each InterPro domain is an integration of several component domains from different sources, in this work we refer to the maximal boundaries encompassing all component domains as domain boundaries.

icsa to Swiss-Prot mapping

For every Swiss-Prot entry assigned to EC class 1 (EC1; oxidoreductase) we extracted all corresponding PDB IDs and chains.
We retained only PDB entries that had associated SIFTS (Structure Integration with Function, Taxonomy and Sequences; downloaded July 2011) [31] and icsa (Inpharmatica CSA) [11] records. SIFTS was used to map all residues of a given Swiss-Prot entry to the corresponding residues of the associated icsa-reported PDB entry (Table S2, ESI‡). The icsa database contains both literature-derived and homology-based entries. We only retained icsa literature-based entries whose catalytic sites correctly mapped from the corresponding PDB entries to the Swiss-Prot sequence. Literature-based icsa entries that could not be precisely mapped were discarded together with any of their homology-based derivatives. Of the remaining homology-based entries, those that could not be fully mapped to their corresponding Swiss-Prot sequences were excluded. Identical catalytic sites from different icsa PDB entries were unified according to their positions on the Swiss-Prot sequence.

Catalytic site-based gold standard (cgs) sequences

Only Swiss-Prot EC1 sequences mapping all residues of all catalytic sites (as described above) within the annotated InterPro domain boundaries (domain classes 1 and 2, Table 1) were used for further analysis. These sequences and their corresponding 35 InterPro domains are referred to as the catalytic site-based Gold Standard (cgs) sequences and cgs domains, respectively. Similarly, we defined metal-based GS sequences (mgs) as EC1 sequences that contain all transition metal-binding residues within domain boundaries (domain classes 1 and 2, Table 1). These sequences and their corresponding 50 InterPro domains are referred to as mgs sequences and mgs domains. In addition, we defined [metal and catalytic site]-based GS sequences (mcgs) as EC1 sequences that contain all transition metal-binding and catalytic residues within domain boundaries (domain classes 1 and 2, Table 1). These sequences and their corresponding 32 domains are referred to as mcgs sequences and mcgs domains.
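The gold-standard criterion above (all annotated site residues falling within the maximal domain boundary) can be illustrated with a minimal Python sketch. It assumes that sites are given as 1-based positions on the Swiss-Prot sequence, that each component signature of an InterPro entry contributes one (start, end) interval, and that all residues must fall within a single domain; the function names and this single-domain reading are our assumptions, not a description of the original implementation.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def maximal_boundary(component_boundaries: Iterable[Interval]) -> Interval:
    """Return the smallest interval enclosing all component signatures."""
    intervals = list(component_boundaries)
    if not intervals:
        raise ValueError("at least one component boundary is required")
    return min(s for s, _ in intervals), max(e for _, e in intervals)


def is_gold_standard(site_positions: Iterable[int],
                     component_boundaries: Iterable[Interval]) -> bool:
    """True if every annotated site residue lies within the domain boundary."""
    positions = list(site_positions)
    if not positions:
        return False  # a sequence without annotated sites cannot qualify
    start, end = maximal_boundary(component_boundaries)
    return all(start <= p <= end for p in positions)


# Hypothetical example: two component signatures spanning 40-180 and 60-210.
components = [(40, 180), (60, 210)]
print(is_gold_standard([57, 102, 195], components))  # all sites inside 40-210
print(is_gold_standard([12, 102], components))       # residue 12 lies outside
```

The same check applies unchanged to catalytic residues (cgs), metal-binding residues (mgs), or their union (mcgs); only the list of site positions differs.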
Filtering alignments for proteins with identical catalytic sites

The cgs sequences were used to search for other proteins with similar functions in an icsa-like approach: each cgs sequence was used as a query to retrieve similar sequences from Swiss-Prot using four PSI-BLAST iterations

(evalue 10⁻³, inclusion_ethresh 10⁻¹⁰, max_target_seqs 10⁸, num_alignments 10⁸, num_descriptions 10⁸, num_iterations 4, comp_based_stats 0). Only hits with an E-value < 10⁻¹⁰ were included in the PSSM, in order to obtain a more focused PSSM for each domain. The specified E-values are accepted in the field as a baseline for finding functional similarity of proteins [9]. The number of PSI-BLAST iterations was the same as in icsa filtering [11], which was shown to reduce the number of false positive results [32]. Finally, to achieve maximum sensitivity we set an extremely high upper bound (10⁸) on the number of sequences used for building the PSSM and on the number of alignments returned. Each homologue was validated against its query sequence. We checked the known catalytic sites for conservation of residue type as described below.

1. Conservation of a catalytic residue is defined as follows: (1) in cases where the residue side chain is the functional group, only an exact match is acceptable, and (2) in cases where the main chain is the functional group, any residue in the same chemical group is allowed to match the query residue. Residue groups are: (1) negatively charged: D, E; (2) positively charged: R, K, H; (3) hydroxylic polar: S, T; (4) amidic polar: N, Q; (5) short hydrophobic chain: A, V, I, L; and (6) aromatic: F, Y, W, C. Note that G, P, and M residues may only match themselves.
2. In cases where the residues were not found to be conserved with PSI-BLAST, the alignment of the query sequence to the subject was repeated using first ClustalW [33] (in full multiple alignment mode) and, failing that, Smith–Waterman [34] (from the EMBOSS package using default parameters).
3. An EC1 assignment was only made when a homologue conserved all catalytic residues as described in step 1.

Filtering alignments for proteins with other sites

Metal binding sites were obtained from Swiss-Prot annotations for all EC1 sequences. Only experimentally validated sites were used.
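The residue-conservation rule of step 1 in the catalytic-site filtering above can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the residue groups are copied from the text, while the function names, the "side_chain"/"main_chain" labels, and the representation of an alignment as two gapped strings of equal length are our assumptions.

```python
# Chemical groups as listed in the text; G, P and M belong to no group
# and can therefore only match themselves.
RESIDUE_GROUPS = [set("DE"), set("RKH"), set("ST"), set("NQ"),
                  set("AVIL"), set("FYWC")]


def residue_conserved(query: str, subject: str, functional_part: str) -> bool:
    """Apply the per-residue conservation rule to one aligned residue pair."""
    if functional_part == "side_chain":
        return query == subject  # exact match only
    if functional_part == "main_chain":
        if query == subject:
            return True
        return any(query in g and subject in g for g in RESIDUE_GROUPS)
    raise ValueError(f"unknown functional part: {functional_part!r}")


def conserves_all_sites(aligned_query: str, aligned_subject: str,
                        sites, query_start: int = 1) -> bool:
    """True if the subject conserves every site residue of the query.

    sites: iterable of (query_position, functional_part), with 1-based
    positions on the ungapped query; query_start is the query position
    of the first aligned residue.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    # Map ungapped query positions to alignment columns.
    column_of = {}
    position = query_start - 1
    for column, residue in enumerate(aligned_query):
        if residue != "-":
            position += 1
            column_of[position] = column
    for site_position, functional_part in sites:
        column = column_of.get(site_position)
        if column is None:
            return False  # site not covered by the alignment
        subject_residue = aligned_subject[column]
        if subject_residue == "-":
            return False  # site aligned to a gap
        if not residue_conserved(aligned_query[column], subject_residue,
                                 functional_part):
            return False
    return True


# Hypothetical alignment: query "ACDE" (one gap inserted) against a subject.
print(conserves_all_sites("AC-DE", "ACGDD", [(3, "side_chain")]))  # D vs D
print(conserves_all_sites("AC-DE", "ACGDD", [(4, "side_chain")]))  # E vs D
print(conserves_all_sites("AC-DE", "ACGDD", [(4, "main_chain")]))  # same group
```

Under the exact-match scheme used for metal-binding residues, every site would simply be labelled "side_chain" in this sketch.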
Metal binding site-based GS (mgs) sequences were extracted as described above for catalytic sites. mgs sequence-based searching was performed similarly to the cgs-based search, with minor differences: (1) each metal binding residue was considered a site and (2) since there are no annotations for the residue active group, filtering followed the conservative scheme of exact match (i.e. a metal binding residue on the query sequence could only align with the same residue type in the subject). By combining the annotations of catalytic sites and metal binding residues we created the [metal and catalytic site]-based gold standard (mcgs) domains and sequences. Searches using mcgs sequences were performed by integrating the filtering approaches for cgs and mgs described above.

Calculating accuracy of scanning

Accuracy was calculated as:

\[ \mathrm{accuracy} = 100 \cdot \frac{TP}{TP + FP} \tag{1} \]

where True Positives (TPs) are method hits to experimentally annotated EC1 proteins and False Positives (FPs) are method hits annotated as non-EC1s. The standard deviation of the accuracy was calculated by 100-fold bootstrapping: taking 100 subsets of 50% of randomly selected queries with their hits.

Calculating coverage of scanning

True coverage could not be computed for this study due to the absence of experimental estimates of the numbers of true transition metal-using oxidoreductase domains among EC1s. We define coverage over all EC1s as '1-based' coverage (anchored to the first digit of the EC number) and coverage over the first 3 EC digits as '3-based' coverage:

\[ \mathrm{coverage} = 100 \cdot \frac{TP}{TP + FN} \tag{2} \]

In '1-based' coverage, True Positives (TPs) are method hits that are annotated as EC1s and False Negatives (FNs) are all other EC1s in Swiss-Prot. Note that this is the same definition of TPs as is used for all accuracy calculations (eqn (1)). Clearly, since we were using a limited set of domains to describe all of oxidoreductase activity, we did not expect to hit all possible EC1s.
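As a hedged illustration of eqn (1), the coverage definition, and the bootstrap estimate of the standard deviation, the following Python sketch may help. The data layout (a dictionary from query identifier to a list of booleans marking whether each hit is an annotated EC1), the function names, sampling queries without replacement within each subset, and the fixed random seed are all our assumptions rather than details given in the text.

```python
import random
import statistics
from typing import Dict, List


def accuracy(tp: int, fp: int) -> float:
    """Eqn (1): percentage of hits that are annotated EC1s."""
    return 100.0 * tp / (tp + fp)


def coverage(tp: int, fn: int) -> float:
    """Coverage: percentage of reference EC1s that were hit."""
    return 100.0 * tp / (tp + fn)


def bootstrap_accuracy_sd(hits_by_query: Dict[str, List[bool]],
                          n_rounds: int = 100, fraction: float = 0.5,
                          seed: int = 0) -> float:
    """Standard deviation of accuracy over random subsets of the queries.

    Each round draws `fraction` of the queries (without replacement) and
    pools their hits; True marks a hit annotated as EC1.
    """
    rng = random.Random(seed)
    queries = sorted(hits_by_query)
    subset_size = max(1, int(len(queries) * fraction))
    values = []
    for _ in range(n_rounds):
        subset = rng.sample(queries, subset_size)
        labels = [is_ec1 for q in subset for is_ec1 in hits_by_query[q]]
        if labels:  # skip subsets whose queries have no hits
            tp = sum(labels)
            values.append(accuracy(tp, len(labels) - tp))
    if len(values) < 2:
        raise ValueError("not enough non-empty subsets to estimate a deviation")
    return statistics.stdev(values)


# Counts quoted in the Results for mcgs filtering within the GS domains.
print(f"{accuracy(2200, 67):.1f}%")   # TrAnsFuSE: 2200 EC1s, 67 non-EC1s
print(f"{accuracy(3426, 335):.1f}%")  # PSI-BLAST: 3426 EC1s, 335 non-EC1s
```

Applied to the counts quoted in the Results, the accuracy function gives approximately 97% and 91%, in line with the reported values.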
To achieve a better estimate of the possible hit-space, for '3-based' coverage we collected all Swiss-Prot sequences with the first 3 digits of the EC numbers identical to our query sequences (i.e. GS sequences). In this case, TPs are similarly re-computed to be '3-based'.

Abbreviations

TrAnsFuSE: Transfer of Annotations of Function using Sequence Elements
EC: Enzyme Commission
GOE: Great Oxidation Event
MSA: Multiple Sequence Alignment
HMM: Hidden Markov Model
CSA: Catalytic Site Atlas
icsa: Inpharmatica CSA
UniProt: Universal Protein Resource
PDB: Protein Data Bank
PSI-BLAST: Position-Specific Iterative Basic Local Alignment Search Tool
GO: Gene Ontology
PSSM: Position Specific Scoring Matrix
SIFTS: Structure Integration with Function, Taxonomy and Sequences
GS: Gold Standard
cgs: catalytic site based GS domains and sequences
mgs: metal binding site based GS domains and sequences
mcgs: metal and catalytic binding site based GS domains and sequences
EMBOSS: European Molecular Biology Open Software Suite
TrEMBL: Translation of European Molecular Biology Laboratory nucleotide sequences
GOS: Global Ocean Sampling expedition
UniMES: UniProt Metagenomic and Environmental Sequences database


More information

BIOCHEMISTRY. František Vácha. JKU, Linz.

BIOCHEMISTRY. František Vácha. JKU, Linz. BIOCHEMISTRY František Vácha http://www.prf.jcu.cz/~vacha/ JKU, Linz Recommended reading: D.L. Nelson, M.M. Cox Lehninger Principles of Biochemistry D.J. Voet, J.G. Voet, C.W. Pratt Principles of Biochemistry

More information

Computational approaches for functional genomics

Computational approaches for functional genomics Computational approaches for functional genomics Kalin Vetsigian October 31, 2001 The rapidly increasing number of completely sequenced genomes have stimulated the development of new methods for finding

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task.

METHODS FOR DETERMINING PHYLOGENY. In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task. Chapter 12 (Strikberger) Molecular Phylogenies and Evolution METHODS FOR DETERMINING PHYLOGENY In Chapter 11, we discovered that classifying organisms into groups was, and still is, a difficult task. Modern

More information

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor Biological Networks:,, and via Relative Description Length By: Tamir Tuller & Benny Chor Presented by: Noga Grebla Content of the presentation Presenting the goals of the research Reviewing basic terms

More information

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

CISC 636 Computational Biology & Bioinformatics (Fall 2016) CISC 636 Computational Biology & Bioinformatics (Fall 2016) Predicting Protein-Protein Interactions CISC636, F16, Lec22, Liao 1 Background Proteins do not function as isolated entities. Protein-Protein

More information

Prediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines

Prediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines Article Prediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines Yun-Fei Wang, Huan Chen, and Yan-Hong Zhou* Hubei Bioinformatics and Molecular Imaging Key Laboratory,

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

Photosynthesis 1. Light Reactions and Photosynthetic Phosphorylation. Lecture 31. Key Concepts. Overview of photosynthesis and carbon fixation

Photosynthesis 1. Light Reactions and Photosynthetic Phosphorylation. Lecture 31. Key Concepts. Overview of photosynthesis and carbon fixation Photosynthesis 1 Light Reactions and Photosynthetic Phosphorylation Lecture 31 Key Concepts Overview of photosynthesis and carbon fixation Chlorophyll molecules convert light energy to redox energy The

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space

Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space Published online February 15, 26 166 18 Nucleic Acids Research, 26, Vol. 34, No. 3 doi:1.193/nar/gkj494 Comprehensive genome analysis of 23 genomes provides structural genomics with new insights into protein

More information

SABIO-RK Integration and Curation of Reaction Kinetics Data Ulrike Wittig

SABIO-RK Integration and Curation of Reaction Kinetics Data  Ulrike Wittig SABIO-RK Integration and Curation of Reaction Kinetics Data http://sabio.villa-bosch.de/sabiork Ulrike Wittig Overview Introduction /Motivation Database content /User interface Data integration Curation

More information

Introduction to Bioinformatics Online Course: IBT

Introduction to Bioinformatics Online Course: IBT Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

More information

Similarity searching summary (2)

Similarity searching summary (2) Similarity searching / sequence alignment summary Biol4230 Thurs, February 22, 2016 Bill Pearson wrp@virginia.edu 4-2818 Pinn 6-057 What have we covered? Homology excess similiarity but no excess similarity

More information

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Course Name: Structural Bioinformatics Course Description: Instructor: This course introduces fundamental concepts and methods for structural

More information

Gene Ontology and overrepresentation analysis

Gene Ontology and overrepresentation analysis Gene Ontology and overrepresentation analysis Kjell Petersen J Express Microarray analysis course Oslo December 2009 Presentation adapted from Endre Anderssen and Vidar Beisvåg NMC Trondheim Overview How

More information

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison 10-810: Advanced Algorithms and Models for Computational Biology microrna and Whole Genome Comparison Central Dogma: 90s Transcription factors DNA transcription mrna translation Proteins Central Dogma:

More information

ADVANCED PLACEMENT BIOLOGY

ADVANCED PLACEMENT BIOLOGY ADVANCED PLACEMENT BIOLOGY Description Advanced Placement Biology is designed to be the equivalent of a two-semester college introductory course for Biology majors. The course meets seven periods per week

More information

Peter L Warren, Pamela Y Shadforth ICI Technology, Wilton, Middlesbrough, U.K.

Peter L Warren, Pamela Y Shadforth ICI Technology, Wilton, Middlesbrough, U.K. 783 SCOPE AND LIMITATIONS XRF ANALYSIS FOR SEMI-QUANTITATIVE Introduction Peter L Warren, Pamela Y Shadforth ICI Technology, Wilton, Middlesbrough, U.K. Historically x-ray fluorescence spectrometry has

More information

M.Sc. Project Introduction Nitrogen-fixing Enzymes

M.Sc. Project Introduction Nitrogen-fixing Enzymes M.Sc. Project Introduction Nitrogen-fixing Enzymes M.Sc. Candidate: Egill Skulason Supervisor: Hannes Jonsson Co-supervisor: Magnus Mar Kristjansson Raunvisindastofnun Haskola Islands Efnafraedistofa vklubbur

More information

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013 EBI web resources II: Ensembl and InterPro Yanbin Yin Spring 2013 1 Outline Intro to genome annotation Protein family/domain databases InterPro, Pfam, Superfamily etc. Genome browser Ensembl Hands on Practice

More information

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster.

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster. NCBI BLAST Services DELTA-BLAST BLAST (http://blast.ncbi.nlm.nih.gov/), Basic Local Alignment Search tool, is a suite of programs for finding similarities between biological sequences. DELTA-BLAST is a

More information

TCA Cycle. Voet Biochemistry 3e John Wiley & Sons, Inc.

TCA Cycle. Voet Biochemistry 3e John Wiley & Sons, Inc. TCA Cycle Voet Biochemistry 3e Voet Biochemistry 3e The Electron Transport System (ETS) and Oxidative Phosphorylation (OxPhos) We have seen that glycolysis, the linking step, and TCA generate a large number

More information

Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution

Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution Background How does an evolutionary biologist decide how closely related two different species are? The simplest way is to compare

More information

Lecture 2. The Blast2GO annotation framework

Lecture 2. The Blast2GO annotation framework Lecture 2 The Blast2GO annotation framework Annotation steps Modulation of annotation intensity Export/Import Functions Sequence Selection Additional Tools Functional assignment Annotation Transference

More information

Hands-On Nine The PAX6 Gene and Protein

Hands-On Nine The PAX6 Gene and Protein Hands-On Nine The PAX6 Gene and Protein Main Purpose of Hands-On Activity: Using bioinformatics tools to examine the sequences, homology, and disease relevance of the Pax6: a master gene of eye formation.

More information

Ch. 9 Multiple Sequence Alignment (MSA)

Ch. 9 Multiple Sequence Alignment (MSA) Ch. 9 Multiple Sequence Alignment (MSA) - gather seqs. to make MSA - doing MSA with ClustalW - doing MSA with Tcoffee - comparing seqs. that cannot align Introduction - from pairwise alignment to MSA -

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Truncated Profile Hidden Markov Models

Truncated Profile Hidden Markov Models Boise State University ScholarWorks Electrical and Computer Engineering Faculty Publications and Presentations Department of Electrical and Computer Engineering 11-1-2005 Truncated Profile Hidden Markov

More information

Identifying Interaction Hot Spots with SuperStar

Identifying Interaction Hot Spots with SuperStar Identifying Interaction Hot Spots with SuperStar Version 1.0 November 2017 Table of Contents Identifying Interaction Hot Spots with SuperStar... 2 Case Study... 3 Introduction... 3 Generate SuperStar Maps

More information

Measuring quaternary structure similarity using global versus local measures.

Measuring quaternary structure similarity using global versus local measures. Supplementary Figure 1 Measuring quaternary structure similarity using global versus local measures. (a) Structural similarity of two protein complexes can be inferred from a global superposition, which

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species PGA: A Program for Genome Annotation by Comparative Analysis of Maximum Likelihood Phylogenies of Genes and Species Paulo Bandiera-Paiva 1 and Marcelo R.S. Briones 2 1 Departmento de Informática em Saúde

More information

Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing

Bioinformatics. Proteins II. - Pattern, Profile, & Structure Database Searching. Robert Latek, Ph.D. Bioinformatics, Biocomputing Bioinformatics Proteins II. - Pattern, Profile, & Structure Database Searching Robert Latek, Ph.D. Bioinformatics, Biocomputing WIBR Bioinformatics Course, Whitehead Institute, 2002 1 Proteins I.-III.

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Large-Scale Genomic Surveys

Large-Scale Genomic Surveys Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

More information

Building 3D models of proteins

Building 3D models of proteins Building 3D models of proteins Why make a structural model for your protein? The structure can provide clues to the function through structural similarity with other proteins With a structure it is easier

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

PHOTOSYNTHESIS. Light Reaction Calvin Cycle

PHOTOSYNTHESIS. Light Reaction Calvin Cycle PHOTOSYNTHESIS Light Reaction Calvin Cycle Photosynthesis Purpose: use energy from light to convert inorganic compounds into organic fuels that have stored potential energy in their carbon bonds Carbon

More information

Supplementary text for the section Interactions conserved across species: can one select the conserved interactions?

Supplementary text for the section Interactions conserved across species: can one select the conserved interactions? 1 Supporting Information: What Evidence is There for the Homology of Protein-Protein Interactions? Anna C. F. Lewis, Nick S. Jones, Mason A. Porter, Charlotte M. Deane Supplementary text for the section

More information

A Protein Ontology from Large-scale Textmining?

A Protein Ontology from Large-scale Textmining? A Protein Ontology from Large-scale Textmining? Protege-Workshop Manchester, 07-07-2003 Kai Kumpf, Juliane Fluck and Martin Hofmann Instructive mistakes: a narrative Aim: Protein ontology that supports

More information

objective functions...

objective functions... objective functions... COFFEE (Notredame et al. 1998) measures column by column similarity between pairwise and multiple sequence alignments assumes that the pairwise alignments are optimal assumes a set

More information

Sequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.

Sequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene. Sequence Analysis, '18 -- lecture 9 Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene. How can I represent thousands of homolog sequences in a compact

More information

Unit 3: Cell Energy Guided Notes

Unit 3: Cell Energy Guided Notes Enzymes Unit 3: Cell Energy Guided Notes 1 We get energy from the food we eat by breaking apart the chemical bonds where food is stored. energy is in the bonds, energy is the energy we use to do things.

More information

Examples of Protein Modeling. Protein Modeling. Primary Structure. Protein Structure Description. Protein Sequence Sources. Importing Sequences to MOE

Examples of Protein Modeling. Protein Modeling. Primary Structure. Protein Structure Description. Protein Sequence Sources. Importing Sequences to MOE Examples of Protein Modeling Protein Modeling Visualization Examination of an experimental structure to gain insight about a research question Dynamics To examine the dynamics of protein structures To

More information

Protein Structure: Data Bases and Classification Ingo Ruczinski

Protein Structure: Data Bases and Classification Ingo Ruczinski Protein Structure: Data Bases and Classification Ingo Ruczinski Department of Biostatistics, Johns Hopkins University Reference Bourne and Weissig Structural Bioinformatics Wiley, 2003 More References

More information

Overview Multiple Sequence Alignment

Overview Multiple Sequence Alignment Overview Multiple Sequence Alignment Inge Jonassen Bioinformatics group Dept. of Informatics, UoB Inge.Jonassen@ii.uib.no Definition/examples Use of alignments The alignment problem scoring alignments

More information

V14 extreme pathways

V14 extreme pathways V14 extreme pathways A torch is directed at an open door and shines into a dark room... What area is lighted? Instead of marking all lighted points individually, it would be sufficient to characterize

More information

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein Gene Ontology and Functional Enrichment Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein The parsimony principle: A quick review Find the tree that requires the fewest

More information

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics

Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Using Ensembles of Hidden Markov Models for Grand Challenges in Bioinformatics Tandy Warnow Founder Professor of Engineering The University of Illinois at Urbana-Champaign http://tandy.cs.illinois.edu

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments

More information

-max_target_seqs: maximum number of targets to report

-max_target_seqs: maximum number of targets to report Review of exercise 1 tblastn -num_threads 2 -db contig -query DH10B.fasta -out blastout.xls -evalue 1e-10 -outfmt "6 qseqid sseqid qstart qend sstart send length nident pident evalue" Other options: -max_target_seqs:

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Session-Based Queueing Systems

Session-Based Queueing Systems Session-Based Queueing Systems Modelling, Simulation, and Approximation Jeroen Horters Supervisor VU: Sandjai Bhulai Executive Summary Companies often offer services that require multiple steps on the

More information

Heterotrophs: Organisms that depend on an external source of organic compounds

Heterotrophs: Organisms that depend on an external source of organic compounds Heterotrophs: Organisms that depend on an external source of organic compounds Autotrophs: Organisms capable of surviving on CO2 as their principle carbon source. 2 types: chemoautotrophs and photoautotrophs

More information

METABOLIC PATHWAY PREDICTION/ALIGNMENT

METABOLIC PATHWAY PREDICTION/ALIGNMENT COMPUTATIONAL SYSTEMIC BIOLOGY METABOLIC PATHWAY PREDICTION/ALIGNMENT Hofestaedt R*, Chen M Bioinformatics / Medical Informatics, Technische Fakultaet, Universitaet Bielefeld Postfach 10 01 31, D-33501

More information