The CATH Dictionary of Homologous Superfamilies (DHS): a consensus approach for identifying distant structural homologues

Protein Engineering vol.13 no.3 pp , 2000

J.E.Bray 1,2, A.E.Todd 1, F.M.G.Pearl 1, J.M.Thornton 1,3 and C.A.Orengo 1

1 Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
3 Department of Crystallography, Birkbeck College, Malet Street, London WC1E 7HX, UK
2 To whom correspondence should be addressed. james@biochem.ucl.ac.uk

A consensus approach has been developed for identifying distant structural homologues. This is based on the CATH Dictionary of Homologous Superfamilies (DHS), a database of validated multiple structural alignments annotated with consensus functional information for evolutionary protein superfamilies (URL: bsm/dhs). Multiple structural alignments have been generated for 362 well-populated superfamilies in the CATH structural domain database and annotated with secondary structure, physicochemical properties, functional sequence patterns and protein-ligand interaction data. Consensus functional information for each superfamily includes descriptions and keywords extracted from SWISS-PROT and the ENZYME database. The Dictionary provides a powerful resource to validate, examine and visualize key structural and functional features of each homologous superfamily. The value of the DHS for assessing functional variability and identifying distant evolutionary relationships is illustrated using the pyridoxal-5'-phosphate (PLP) binding aspartate aminotransferase superfamily. The DHS also provides a tool for examining sequence-structure relationships for proteins within each fold group.
Keywords: CATH database/functional annotation/homologous superfamily/protein domains/structural alignments

Introduction

The protein structure database (PDB), jointly managed by the Research Collaboratory for Structural Bioinformatics (RCSB) and the European Bioinformatics Institute (EBI), contains over 9000 protein chains consisting of over domains determined by X-ray crystallography or NMR (Abola et al., 1987). With 200 new structures being deposited each month and structural genomic projects gathering momentum (Pennisi, 1998), it becomes increasingly important that the structural data are organized and annotated in a biologically meaningful and useful way. Protein structure classification databases, CATH (Orengo et al., 1997), SCOP (Murzin et al., 1995), Dali Domain Dictionary (Holm and Sander, 1999), 3Dee (Siddiqui and Barton, 1997), DDBASE (Sowdhamini et al., 1996) and ENTREZ/MMDB (Hogue et al., 1996; Marchler-Bauer et al., 1999) are now well established and provide frameworks for ordering the known protein universe. These databases cluster proteins into fold groups or evolutionary families using manual methods (Murzin et al., 1995) or automatic structure comparison methods, i.e. SSAP (Taylor and Orengo, 1989), DALI (Holm and Sander, 1993), STAMP (Russell and Barton, 1992), DIAL (Sowdhamini and Blundell, 1995) and VAST (Gibrat et al., 1997). There are no consensus definitions for fold similarity and groups apply different criteria for assigning proteins to fold groups (reviewed in Orengo, 1994, and Brown et al., 1996).

More recently, databases have emerged that present structural alignments for selected protein families. HOMSTRAD (Mizuguchi et al., 1998a) contains 372 alignments for homologous families originally developed by Overington et al. (1990). CAMPASS (Sowdhamini et al., 1998) is a database of 52 structurally aligned protein superfamilies derived from DDBASE.
In both cases the alignments are annotated with structural features using JOY (Mizuguchi et al., 1998b).

CATH and SCOP are currently the largest manually validated hierarchical classifications of protein domain structures (Hubbard et al., 1999; Orengo et al., 1999). Both of these databases group proteins into homologous families and superfamilies. These levels are interesting to many biologists as they cluster proteins that have descended from a common ancestral gene and whose core structural and functional features have often been conserved by evolution (Chothia, 1984; Overington et al., 1990). Identifying the homologous superfamily of a protein of interest can be an important step in determining the biological role of the protein.

Proteins are called analogues when they share a common fold but there is no definitive evidence that they are related by evolution. This has been observed to occur in some highly populated fold groups described as superfolds (Orengo et al., 1994) or frequently occurring domains (FODs, Brenner et al., 1997). They are particularly common in nature, possibly owing to the inherent thermodynamic stability of the fold and prevalence of common recurring structural motifs (Salem et al., 1999).

Homologous relationships can be identified relatively easily if the divergent proteins retain high sequence identities of 30% or more (Chothia and Lesk, 1986; Sander and Schneider, 1991; Rost, 1999). The presence of a functional sequence motif (PROSITE, Hofmann et al., 1999) or set of motifs (PRINTS, Attwood et al., 1999) can be used to detect more distant sequence relationships. More recently, sequence searching methods that use profile-based approaches or intermediate sequences, e.g. PSI-BLAST (Altschul et al., 1997), hidden Markov models (SAM-T98, Hughey and Krogh, 1996), ISS (Park et al., 1997) and MISS (Salamov et al., 1999), have also been shown to detect more distant homologues than pairwise sequence techniques (Park et al., 1998; Salamov et al., 1999). Sequence databases of homologous protein families and their associated multiple alignments have been established using these methods, e.g. PROSITE, PRINTS, PFAM (Bateman et al., 1999), SMART (Ponting et al., 1999) and PRODOM (Corpet et al., 1999).

Oxford University Press

More distant evolutionary relationships (below 20% sequence identity) are difficult to elucidate without a combination of structural and functional evidence to prove homology (see Murzin, 1996, 1998 for reviews). SCOP uses the literature to manually identify unusual structural features (e.g. β-bulges) and key conserved residues involved in structure stabilization, substrate or co-factor binding or catalysis (Murzin et al., 1995; Brenner et al., 1996). In the CATH database, high structural similarity indicated by the SSAP structure comparison algorithm is used to infer homologous proteins that are then validated by checking the literature and available functional information. While CATH and SCOP contain similar classifications for close homologues, the difficulty in identifying distant relationships means that they differ for the more distant homologues (Hadley and Jones, 1999). Other approaches for identifying homologous proteins are based on deriving core structures for protein families (Schmidt et al., 1997; Matsuo and Bryant, 1999; Orengo, 1999) or comparing functional descriptions, e.g. SWISS-PROT keywords (Holm and Sander, 1997).

Table I. The number of branches at each hierarchical level in the CATH database (release 1.5)

CATH level               Number
Class                    4
Architecture             32
Topology family          632
Homologous superfamily   959
Sequence family          1570
Near identical family    2807
Identical family         4567

Fig. 1. Flowchart of the methods and data required for creating and maintaining the Dictionary of Homologous Superfamilies (DHS).

Fig. 2. Structural features of aspartate aminotransferase (PDB entry 2cst). (A) Cartoon representation of the large (light grey) and small (dark grey) domains of the enzyme complexed with pyridoxal-5'-phosphate (PLP, space-fill representation). The figure was generated using MOLSCRIPT (Kraulis, 1991) and RASTER3D (Merritt and Bacon, 1997). (B) The LIGPLOT diagram (Wallace et al., 1995) shows the interactions between the aspartate aminotransferase and the PLP co-factor. PLP is covalently bound to lysine 258 in the large domain. Hydrogen bonds are indicated with dotted lines and hydrophobic contacts are shown with spoked arcs. (C) Schematic TOPS diagram (Westhead et al., 1998) for the large domain (CATH entry 2cstA2). Strands are shown as triangles and helices as circles (3₁₀ helices are not shown). Lines drawn over a symbol are connections to the top of a secondary structure, otherwise connection is to the base.

Manual validation of very distant evolutionary homologues using functional data can be a very time-consuming step. However, because there is currently no consistent format for most functional data, a purely automated approach can sometimes fail to detect significant relationships and can lead to incorrect homologue assignment. The possibility of functional inheritance errors in sequence databases (Bork and Koonin, 1998) highlights the need for a similarly cautious approach to homologue classification for structural databases.

For this reason we have developed the CATH Dictionary of Homologous Superfamilies (DHS), a Web-based resource that provides multiple structural alignments for each superfamily containing more than one non-identical representative (362 families) and facilitates the recognition of distant homologues. Alignments and structural profiles for each superfamily have been generated using the CORA suite of programs (Orengo, 1999). The alignments permit the identification of consensus superfamily-specific properties (e.g. conserved protein-ligand interactions). These consensus features can be used to verify the results of sequence (e.g. PSI-BLAST) or structure (e.g. SSAP, CORA structural profiles) comparison methods that have inferred a distant homologous relationship to a protein in a CATH superfamily.
We illustrate the value of the DHS as a diagnostic tool using the pyridoxal-5'-phosphate (PLP) binding aspartate aminotransferase superfamily, which contains diverse sequences and a broad range of functions. The CORA multiple alignments can easily be downloaded from the DHS Web pages and will be updated with each CATH database release.

Materials and methods

Protein data set: overview of CATH families

The CATH database (release 1.5, October 1998) provides a hierarchical classification of protein domain structures in the PDB (Orengo et al., 1997). The major classification levels are protein class (C), architecture (A), topology (T), homologous superfamily (H) and sequence family (S) (see Table I). To reduce the level of redundancy in the PDB, the proteins in CATH with 95% sequence identity are clustered into 2807 near-identical protein families (N-level). The CATH-95 dataset includes one representative structure from each N-level family and is used for all structural comparisons in the construction of the DHS. Sequence families contain close evolutionary relatives that are identified using sequence comparison methods (Needleman and Wunsch, 1970) and conservative sequence identity cut-offs (≥35%). There are 959 homologous superfamilies, in which more distantly related structures are grouped according to structural and functional similarities that suggest evolution from a common ancestor. PSI-BLAST (Altschul et al., 1997) is now used for detecting remote homologues in the CATH classification procedure using stringent E-value cut-offs (0.0005) and scanning against a translated, non-redundant, complexity-masked GenBank database (Benson et al., 1999). More distant homologues are identified using the SSAP structure comparison algorithm (Taylor and Orengo, 1989; Orengo et al., 1992) and CORA structural profiles for each homologous superfamily (Orengo, 1999). The DHS presents those 362 superfamilies that contain more than one CATH-95 structure.
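The selection just described (one representative per near-identical family, then only those superfamilies with more than one representative) can be sketched as follows. This is a minimal illustration, not the CATH pipeline: the `Domain` record, its field names and the example labels are all hypothetical, and the choice of the first domain seen as the representative is an assumption.

```python
from collections import defaultdict
from typing import NamedTuple


class Domain(NamedTuple):
    name: str         # hypothetical domain identifier
    superfamily: str  # hypothetical H-level label
    n_family: str     # hypothetical near-identical (N-level) family label


def representatives(domains):
    """Keep one domain per N-level family (assumption: the first one seen)."""
    reps = {}
    for d in domains:
        reps.setdefault(d.n_family, d)
    return list(reps.values())


def well_populated_superfamilies(domains):
    """Map each superfamily with more than one representative to its members."""
    members = defaultdict(list)
    for rep in representatives(domains):
        members[rep.superfamily].append(rep.name)
    return {sf: names for sf, names in members.items() if len(names) > 1}


# Made-up example data.
domains = [
    Domain("dom1", "sfA", "n1"),
    Domain("dom2", "sfA", "n1"),  # near-identical to dom1, so not a representative
    Domain("dom3", "sfA", "n2"),
    Domain("dom4", "sfB", "n3"),  # singleton superfamily, excluded
]
print(well_populated_superfamilies(domains))
```

In this toy input only `sfA` survives, because `sfB` has a single representative.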
Figure 1 summarizes the steps required to generate the DHS.

Generation of data for the DHS

Generation of structure comparison data using SSAP. In order to provide data on sequence and structure relationships within each fold group in CATH, SSAP structural comparisons were performed for each pair of CATH-95 domains within each fold ( pairwise comparisons). Protein pairs within the same fold and same superfamily are defined as homologues whereas those within the same fold but in different superfamilies are analogues. Fold level comparisons provide a complete dataset for analysing analogues and homologues and checking for any incorrect classifications. SSAP returns a normalized score between 0 and 100 for each pairwise comparison. Scores above 70 accompanied by significant residue overlap (≥60%) indicate proteins with the same fold or topology (T). Higher scores (≥80) suggest a homologous relationship between two proteins. Sequence identity in this study is defined as the number of identical residues (after structural alignment by SSAP) divided by the number of residues in the smallest protein domain. SSAP score and sequence identity matrices are available for each superfamily in the DHS Web pages.

Table II. The four sequence families in the PLP-dependent aspartate aminotransferase homologous superfamily (CATH code )

Columns: Protein family enzyme description; Sequence family number; CATH domains; Enzyme classification
Aspartate aminotransferases: ars02, 2cstA2, 7aatA2, 1ajsB2; All
β-eliminating lyases: tplA ax4A
ω-aminotransferases/pyruvate decarboxylases: dkb
Ornithine decarboxylases: ordA

Table III. The number of homologous superfamilies, sequence families and representative protein domains (95% sequence identity) for four of the most populated fold groups in the CATH database

Columns: Superfold name; CATH code; No. of non-identical protein representatives; No. of sequence families; No. of homologous superfamilies
Rows: Globins; Immunoglobulins; TIM barrel; Rossmann

Automatic validation of structural relatives (DHS-VALID). The DHS-VALID program is used to check automatically all the pairwise sequence and structure comparison data generated for each fold group and homologous superfamily in CATH. Outlying proteins that have low structural similarity scores and low structural overlap percentages against the majority of the relatives are identified. These can then be checked against the known functional information and ligand interaction data and if necessary placed into a newly created homologous family. Similarly, high SSAP scores can identify potentially homologous proteins that are currently classified in different superfamilies.

Generation of multiple structural alignments using CORA. Multiple structural alignments were generated using the CORA program (Orengo, 1999) for each of the 362 homologous superfamilies with more than one CATH-95 structure. CORA (Conserved Residue Attributes) is a suite of programs for automatically multiply aligning and analysing protein structural families. CORA uses the pairwise structural comparison data from SSAP to determine the initial set of proteins to be aligned and then identifies conserved characteristics and expresses them as a 3D structural profile for each family. As the profiles encapsulate the critical core of the fold and functional sites, which in the case of homologous proteins have been conserved throughout evolution, they are more sensitive at identifying distant structural homologues than comparing against a single structure.

Fig. 3. Sequence identity and SSAP structural comparison relationships between representatives from the sequence families in the aspartate aminotransferase homologous superfamily (2cstA2, 1ax4A2, 2dkb02, 1ordA2).

Annotation of structural alignments. The residues in the multiple structural alignments are annotated by colour in several different ways using the program DHS-PLOT. At the simplest level, plots are colour coded according to secondary structure regions (as defined by DSSP; Kabsch and Sander, 1983), sequence identity and amino acid type (Taylor, 1997). A shaded score bar beneath each alignment indicates the CORA structural conservation score, which measures the conservation of the structural environment (dark grey/black are highly conserved regions). PROSITE sequence patterns (Hofmann et al., 1999) are also included in the multiple alignment data for each homologous superfamily. Only structurally significant PROSITE patterns (from release 1.4) are used, as identified for all known PDB structures (Kasuya and Thornton, 1999). Ligand interaction data is derived from the GROW algorithm (Milburn et al., 1998). Importantly, the structural alignment plots help to identify the consensus PROSITE patterns and ligand interaction positions that exist in the structurally conserved regions for a given protein superfamily. PROSITE patterns, ligand interactions and domain boundary representations are also brought together in DOMPLOT diagrams (Todd et al., 1999a), which are available for each homologous superfamily. The 3D structural superpositions can be viewed interactively in a RASMOL viewer (Sayle and Milner-White, 1995) using a customized RASMOL script (ROMLAS; R.A.Laskowski, unpublished computer program) that colours the 3D structures to complement the shading of the 2D sequence plots.

Methods for the automatic extraction of functional information. Protein information as given in the PDB header, including the protein name, species, crystallographic or NMR information, was extracted directly from PDBsum web pages (Laskowski et al., 1997; URL: pdbsum).
The files from the ENZYME database (Bairoch, 1999, release 23.0, July 1998) and the SWISS-PROT database (Bairoch and Apweiler, 1999, release 35.0) were downloaded from the ExPASy website. SWISS-PROT entries give links to structures in the PDB but do not specify individual PDB chains. A cross-reference table of PDB chain to SWISS-PROT entry was derived from searching PDB chain sequences against SWISS-PROT and using the highest identity matches from SWISS-PROT (A.C.R.Martin, personal communication). The EC numbers for PDB chains were automatically extracted from the SWISS-PROT files using this cross-reference table (Martin et al., 1998). The summary information from the ENZYME file (description and reaction entries) and the SWISS-PROT file (comments and keyword entries) was extracted using Perl scripts.

Compiling and updating the DHS. DHS-WEB is a suite of programs and scripts that are used to combine the multiple alignments with the functional data, consensus sequence patterns and information about sequence/structure relationships. The output is a DHS Web page of functional summary tables in HTML for each homologous superfamily. The programs can be run automatically to regenerate the DHS web pages and links to other web sites for each new release of the CATH database as further relatives are added to the homologous superfamilies. The CATH classification data is stored in a Postgres relational database and the DHS data is in flat file format. CATH and the DHS are currently being transferred into an Oracle relational database that will be updated initially on a 6-monthly cycle and subsequently on a quarterly cycle. Once the Oracle database is set up, CATH and DHS mirror sites will be established in three locations (USA, Canada and Cambridge, UK). DHS alignment files will be made available from the CATH ftp site (URL: ftp://ftp.biochem.ucl.ac.uk/pub/cathdata).

Results

The CATH Dictionary of Homologous Superfamilies (DHS)

The CATH DHS is a resource providing multiple structural alignments in the 362 well-populated CATH superfamilies and facilities for viewing them. These alignments establish which residues are structurally equivalent across a whole superfamily and, since they are annotated with consensus sequence patterns and functional information, the DHS provides a valuable diagnostic tool for detecting remote homologous relationships based on the identification of consensus superfamily-specific features. Conserved active site residues, ligand interactions or sequence motifs often provide the functional signature necessary to prove an evolutionary link. The DHS resource summarizes these properties for each superfamily on a single Web page. It is linked to the CATH structure classification server, allowing the biologist to scan the CATH database with a newly determined structure and then access the DHS pages for each match to assess the potential evolutionary relationship.
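One way to picture a "consensus superfamily-specific feature" is as an alignment column that carries the same annotation in all (or most) members. The sketch below is an assumption-laden illustration rather than the DHS implementation: the function name, the `min_fraction` parameter and the input layout are hypothetical, and every alignment position in the example is invented except position 308, which the paper gives as the conserved PLP-binding lysine.

```python
def consensus_positions(contacts_by_domain, min_fraction=1.0):
    """Return alignment columns annotated in at least min_fraction of the domains.

    contacts_by_domain maps a domain name to the alignment positions at which
    that domain contacts a ligand (hypothetical input layout).
    """
    n = len(contacts_by_domain)
    if n == 0:
        return []
    counts = {}
    for positions in contacts_by_domain.values():
        for pos in set(positions):  # count each column once per domain
            counts[pos] = counts.get(pos, 0) + 1
    return sorted(p for p, c in counts.items() if c / n >= min_fraction)


# Invented ligand-contact columns for the four sequence family representatives.
contacts = {
    "2cstA2": [120, 150, 308],
    "1ax4A2": [121, 150, 308],
    "2dkb02": [150, 308, 340],
    "1ordA2": [150, 307, 308],
}
print(consensus_positions(contacts))        # columns shared by every domain
print(consensus_positions(contacts, 0.25))  # columns seen in at least one of four
```

Lowering `min_fraction` trades strictness for coverage, which may be useful when one member of a superfamily was solved without its ligand.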
Using the DHS as a diagnostic tool for identifying remote homologues in the PLP-dependent aspartate aminotransferase superfamily (large domain)

Pyridoxal-5'-phosphate (PLP, a vitamin B6 derivative) is a versatile cofactor, able to catalyse many different reactions involved in nitrogen metabolism in all organisms (Jansonius, 1998). The PLP-dependent aspartate aminotransferase superfamily is presented here to demonstrate the use of the DHS as a resource for homologue classification within the CATH database. Recent studies of PLP-dependent enzymes have shown that there are five distinct PLP-binding domain folds (Denessiouk et al., 1999). All the enzymes have PLP bound to an active site lysine, forming an internal aldimine. Once the amino acid substrate reacts with the cofactor, any one of three remaining bonds around the Cα may be cleaved, enabling a broad range of reactions including transamination, racemization and decarboxylation. Reaction specificity is due to interactions with the groups surrounding the Cα of the substrate that favour a particular bond cleavage (Martell, 1982).

Fig. 4. Screen snapshots from the DHS Web page for the aspartate aminotransferase superfamily. Information for the eight representative domains is shown in tables for (A) protein descriptions, (B) enzyme classification numbers, (C) PROSITE patterns, (D) SWISS-PROT descriptions and (E) ligand data.

In the PLP-dependent aspartate aminotransferase superfamily, all the enzymes have two distinct domains that have different topologies (Figure 2A) and so are classified in different superfamilies in the CATH domain database. The PLP binds covalently to a conserved lysine residue (in the large domain) at the bottom of the interdomain active site cleft (Figure 2B). The large domain has three-layer sandwich architecture with a seven-stranded mixed β-sheet forming the domain core that is surrounded by helices on both sides (CATH code ). The central β-sheet is mostly parallel with one anti-parallel edge strand. This is reminiscent of the Rossmann fold, which has the same three-layer sandwich architecture but has a six-stranded parallel β-sheet at the domain core. Figure 2C shows the TOPS diagram (Westhead et al., 1998) to represent schematically the topology of the large domain. The small domain has an α-β plait topology and can be found in DHS Web pages for CATH code .

Fig. 5. CORA multiple structural alignment for the aspartate aminotransferase superfamily. (A) The sequence alignment for the region surrounding the totally conserved lysine residue (alignment position 308) that binds PLP is shown in three different colour schemes: secondary structure, PROSITE patterns and ligand interactions. (B) The DOMPLOT diagram (Todd et al., 1999a) for the sequence family representatives in this superfamily (2cstA2, 1ax4A2, 2dkb02, 1ordA2). Coloured vertical lines within the boxes highlight residues involved in ligand interactions; coloured horizontal lines below the boxes denote PROSITE motifs. Consensus ligand interactions are shown as small black lines at the bottom of the plot. Specific residue information can be obtained by cross-reference with the DHS ligand interaction table (not shown). (C) The 3D structures can be superposed and viewed in RASMOL (Sayle and Milner-White, 1995) as seen here for the sequence family representatives. The coloured regions correspond to the ligand interacting residues.

Fig. 6. Sequence-structure plot for proteins within the aspartate aminotransferase superfamily. Pairwise comparisons are shown as black triangles. The majority of proteins in this family are distant evolutionary relatives and have sequence identities below 15%.

The multiple structural alignments and functional tables shown here are for the large domain superfamily (CATH code ) that contains four sequence families and eight representative domains (Table II). There is 11% sequence identity between relatives of these different sequence families (Figure 3). Pairwise SSAP scores of indicate that the proteins have similar folds but are in the twilight zone of structural similarity regarding their evolutionary relationship. These families would not be merged automatically to form a homologous superfamily as the sequence identity and structural similarity are not sufficiently high and a number of different functions are observed across the set. Instead, the DHS functional tables and annotated structural alignments can be consulted to search for any possible evolutionary link. Figure 4A shows the DHS summary table containing the protein information as given in the PDB header, which includes the protein name, species and crystallographic or NMR information. There are also links from the DHS table to PDBsum pages, individual MOLSCRIPT pictures (Kraulis, 1991) and DOMPLOT diagrams. Enzyme Classification (EC) numbers [Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC IUBMB), 1992] are also listed (Figure 4B) and it can be seen that these are spread over two primary classes, transferases (class 2) and lyases (class 4), indicating the functional diversity in this superfamily.
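The idea of EC numbers differing "at some level" of the four-level EC hierarchy can be made concrete with a short sketch. The function name is hypothetical. The two EC numbers used, 2.6.1.1 (aspartate transaminase) and 4.1.1.17 (ornithine decarboxylase), are standard assignments chosen for illustration and are not copied from the DHS tables.

```python
def ec_divergence_level(ec_a, ec_b):
    """Return the 1-based EC level at which two EC numbers first differ.

    Returns None when all compared levels are identical.
    """
    pairs = zip(ec_a.split("."), ec_b.split("."))
    for level, (a, b) in enumerate(pairs, start=1):
        if a != b:
            return level
    return None


# A transferase (class 2) against a lyase (class 4): they differ at the primary class.
print(ec_divergence_level("2.6.1.1", "4.1.1.17"))
# Identical EC numbers never diverge.
print(ec_divergence_level("2.6.1.1", "2.6.1.1"))
```

A superfamily whose members diverge at level 1 spans more than one primary enzyme class, which is the kind of diversity noted for this superfamily.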
A recent analysis has shown that this type of diversity is observed in 9% (17) of the 190 enzyme homologous superfamilies in CATH with multiple members in the PDB. Almost half (91/190) of the superfamilies have different EC numbers at some level in the EC hierarchy (Todd et al., 1999b). However, it can be seen from the EC table that even though the EC numbers are different, the enzyme reactions all contain an L-amino acid (or similar) as either the substrate or the product. The links go to the ENZYME Web site where more detailed information can be found (see Materials and methods section for all Web site locations).

Another DHS table gives information on sequence patterns extracted from PROSITE. Figure 4C shows that each sequence family has a different PROSITE pattern and name; however, the multiple structural alignment shows that the patterns overlap around the totally conserved lysine residue that binds the PLP (Figure 5A). Further inspection of this consensus feature reveals that the different patterns are all PLP binding motifs, whilst the SWISS-PROT summary table (Figure 4D) allows the user to identify the pyridoxal phosphate (PLP) keywords in all the protein entries. PLP is also listed in the ligand information table (Figure 4E). As the summary tables have links to the original databases, more detailed information can be quickly accessed. DOMPLOT diagrams of the structural alignment are also used to visualize protein-ligand interactions and show that there is a high degree of conservation in the positioning of the PLP contacting residues (Figure 5B). There are four conserved ligand-interacting regions for the four sequence family representatives (2cstA2, 1ax4A2, 2dkb02, 1ordA2) that all bind PLP (or similar). Ligand binding positions can also be viewed on the multiple superposition (Figure 5C).
Considering that these proteins all belong to the same fold group, the PLP binding data, consensus PROSITE motifs and L-amino acid substrates provide the explicit functional evidence to support the structural data that these are evolutionary relatives and should therefore be classified in the same CATH homologous superfamily.

Using the DHS data to analyse sequence and structure relationships in CATH

The pairwise sequence and structural similarity data held in the DHS for each fold can be used to analyse structural relationships within different CATH fold groups and superfamilies. For example, Figure 6 shows a sequence-structure plot for the PLP aspartate aminotransferase family, illustrating that the current CATH family contains diverse relatives (below 15% sequence identity) as well as close homologues containing high structural similarity (SSAP score ≥80) and significant sequence identity (≥25%). The DHS contains a sequence-structure plot for each homologous superfamily.

Currently, 30% of homologous superfamilies in CATH belong to superfold groups (Orengo et al., 1994). Amongst the most highly populated of these folds are the globins, immunoglobulins, TIM barrels and the Rossmann fold, which contain between four and 72 different homologous superfamilies (see Table III). The pairwise comparison data in the DHS has been used to explore the structural relationships and sequence identity variation (based on structural alignment) within these superfold groups. Interestingly, we find that sequence identity distributions for homologous proteins show a dependence on the type of fold group studied (Figure 7A-D) whilst the distribution of analogous sequence identities is found to peak between 6 and 8% independent of the fold group (Figure 7E-H). This is in broad agreement with a previous analysis by Russell et al. (1997).
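A distribution of this kind can be obtained by binning the pairwise sequence identities and locating the most populated bin. The following is a minimal sketch under stated assumptions: the 2% bin width, the function names and the identity values are all invented for illustration (the values were chosen so that the modal bin falls at 6 to 8%), and none of them come from the DHS data.

```python
from collections import Counter


def identity_histogram(identities, bin_width=2.0):
    """Count percentage identities in fixed-width bins keyed by their lower edge."""
    counts = Counter()
    for pid in identities:
        lower_edge = int(pid // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


def modal_bin(identities, bin_width=2.0):
    """Return (lower, upper) edges of the most populated bin."""
    hist = identity_histogram(identities, bin_width)
    lower = max(hist, key=hist.get)
    return lower, lower + bin_width


# Invented pairwise identities (%) for analogous pairs within one fold group.
analogue_identities = [5.0, 6.5, 7.2, 7.9, 9.1, 12.0]
print(identity_histogram(analogue_identities))
print(modal_bin(analogue_identities))
```

Running the same two functions separately on homologous and analogous pairs of a fold would give the two curves that are compared panel by panel in a plot such as Figure 7.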

Fig. 7. Plots to show the sequence identity distributions for pairs of homologous and analogous proteins in the globin fold (A, E), TIM barrel fold (B, F), Rossmann fold (C, G) and the immunoglobulin fold (D, H). The plots for the analogous proteins are underneath the respective homologous distribution.

Fig. 8. Pie chart to show the result of assigning 2646 recently deposited protein domains to homologous superfamilies in the CATH database using sequence comparison methods, combined sequence-structure criteria and the DHS.

In the globin superfold (Figure 7A) there is a broad homologue distribution between 8 and 30% sequence identity that is distinct from the analogue distribution found at lower identities. This superfold contains only four homologous superfamilies, the largest of which are the oxygen-binding globins (43 out of 54 CATH-95 representatives) that have a high degree of structural conservation. The relatively high sequence identity distribution in this superfold may be caused by the structural constraints of providing the correct environment for the large haem group to function specifically as a reversible oxygen-binding co-factor.

By contrast, in the homologue distributions for both TIM barrel folds (Figure 7B) and Rossmann folds (Figure 7C) there are large peaks at 6% sequence identity, much lower than for the globin fold. These homologous distributions have considerable overlap with their respective analogous distributions. The shift to lower sequence identities relative to the globin distributions may be due to the considerable functional diversity of TIM barrel and Rossmann fold proteins. If just the single domain proteins are considered, there are 46 enzyme functions in the TIM barrel fold and 37 different functions in the Rossmann fold (A.E.Todd, unpublished data).
This functional diversity may be the consequence of a stable structural framework that tolerates extensive changes in the sequence, so that very little sequence similarity and no common sequence motifs remain between proteins having different functions. In these folds, analogous pairs may be very distantly related homologues. However, this is very difficult to prove without intermediate sequences or further evolutionary evidence, such as the proteins being part of a common metabolic pathway.

The homologous distribution of the immunoglobulin superfold unusually exhibits two large peaks below 30% sequence identity (Figure 7D). There are 30 homologous superfamilies in the immunoglobulin fold with a total of CATH-95 representative domains. The fold population is strongly biased towards the largest superfamily (CATH code ) that contains 220 representatives and includes the antibody protein domains. One possible explanation for the peaks is that the homologous domains of the antibody proteins have arisen through gene duplication but have diverged to perform different functions within the antibody molecule. The variable domains from the light and heavy chains (VL and VH) are required for binding to the antigen, while the constant domains (CL and CH) have a structural role. The peak around 23% arises from identities between two functionally similar domains in the same antibody molecule (i.e. VL-VH or CL-CH). The lower peak around 10% sequence identity is caused by the identities from dissimilar domains (i.e. VL-CL, VH-CH, VL-CH and VH-CL). The peaks above 30% are due to comparing the same domains (e.g. VL-VL or VH-VH) from different antibody molecules in the PDB.

These observations suggest that once functional constraints have been removed, sequences can diverge much further while retaining the same fold. As the function diverges within a homologous family (e.g. in paralogous proteins), the degree of sequence and structural similarity comes to resemble that of analogous proteins, and it becomes difficult to distinguish between analogous proteins and very distant homologues. When structural similarity is also considered, analogous protein pairs generally have lower SSAP scores than homologous pairs and less than 25% sequence identity (see, for example, Figure 7D for the immunoglobulin fold). A full analysis of all the pairwise relationships in the CATH-95 dataset showed that no analogues exhibit the combined criteria of SSAP ≥ 80, sequence identity ≥ 25% and residue overlap ≥ 70%.
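The combined criteria can be sketched as a simple three-way threshold test. This is only an illustration, not the CATH implementation: the class and function names are hypothetical, and the cutoffs are read here as inclusive lower bounds, which is an assumption because the comparison symbols did not survive in the transcribed text.

```python
from dataclasses import dataclass

# Cutoff values quoted in the text; treating them as inclusive lower
# bounds is an assumption made for this sketch.
SSAP_MIN = 80.0          # SSAP structural similarity score
SEQ_ID_MIN = 25.0        # percent sequence identity
OVERLAP_MIN = 70.0       # percent residue overlap


@dataclass(frozen=True)
class PairComparison:
    """Hypothetical record holding the three measures for one domain pair."""
    ssap_score: float
    sequence_identity: float
    residue_overlap: float


def meets_combined_criteria(pair: PairComparison) -> bool:
    """Return True only if all three cutoffs are satisfied together."""
    return (
        pair.ssap_score >= SSAP_MIN
        and pair.sequence_identity >= SEQ_ID_MIN
        and pair.residue_overlap >= OVERLAP_MIN
    )


if __name__ == "__main__":
    candidate = PairComparison(ssap_score=85.0, sequence_identity=30.0, residue_overlap=90.0)
    print(meets_combined_criteria(candidate))
```

A pair that fails any one of the three tests is not assigned by this rule and would be left for the other methods described in the text, such as inspection with the DHS.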
As manual validation of homologues can be slow, even with the benefit of the DHS, these empirical cutoffs can be used to assign homologous proteins automatically to CATH superfamilies. In a recent dataset of 2646 new domain structures (1879 chains), 64% could be classified as homologues using cautious sequence identity cut-offs (≥ 35%), and a further 9.8% could be assigned using stringent E-value cutoffs in PSI-BLAST (see Materials and methods). The combined sequence-structure criteria identified a further 2.9% of homologous pairs. Of the remaining domains, 4.6% were found to be distant homologues using the DHS Web pages and assigned to existing CATH superfamilies, 6.9% were assigned to new superfamilies within an existing fold group because there was insufficient evidence to prove homology, and 11.8% were identified as novel folds (see Figure 8). These automatic homologue assignment criteria, in combination with the DHS, should enable the CATH classification to keep pace with the rapid increase in structure determination expected from structural genomic initiatives.

Discussion

The Dictionary of Homologous Superfamilies (DHS) provides a compendium of structural and functional information focused at the level of evolutionary relationships. It has emerged from the growing need to validate the existing CATH hierarchical classification rigorously by using functional information in addition to the existing sequence and structure comparison tools. The DHS resource provides multiple structural alignments for each superfamily together with derived consensus functional information, e.g. sequence motifs and ligand interactions. The alignments can be downloaded from the Web and will be updated with each release of the CATH database. Automatic and manual checking using the DHS functional tables and alignments has ensured that CATH superfamilies are more biologically coherent.
This is increasingly important as structure databases are used as a resource for training and benchmarking fold recognition algorithms (e.g. GenTHREADER, Jones, 1999), testing sequence comparison methods (Brenner et al., 1998; Park et al., 1998; Salamov et al., 1999) and genome analysis (Salamov et al., 1998).

One of the main purposes of studying the 3D structure of a protein is to gain clearer insights into its specific function and biological role. However, for many protein families there are not yet any structural data, e.g. only ~1000 structural families are known compared with ~ sequence families. For these cases information on biological role can sometimes be gleaned by considering the functions of related protein sequences. This technique of functional inheritance is now routinely applied when analysing and annotating novel genome sequences. Structural genomic initiatives have now been established (Pennisi, 1998) to determine structural representatives for each of the sequence families, making the prospect of functional inheritance through structural similarity increasingly feasible. This highlights the growing importance of incorporating functional annotations for structures and related genomic sequences into the DHS.

The CATH Protein Family Database (CATH-PFDB) of genomic sequences has recently been established to complement the CATH structural database (Pearl et al., 2000). The CATH-PFDB contains over clear homologues from the sequence databases (e.g. translated GenBank) that have been identified using PSI-BLAST to be closely related to structural domains in CATH. These sequences and their associated functional annotations (e.g. from SWISS-PROT) will be incorporated into the DHS to increase significantly the functional information available for each structural superfamily. A PSI-BLAST server (URL: ac.uk/bsm/psi-cath) has been developed to allow the user to scan the CATH database with a new protein sequence.
Hits to CATH structures will be linked to the DHS to allow consideration of functional variation within the potential superfamily. This should help to improve the accuracy of functional inheritance for sequences assigned to CATH superfamilies, particularly in cases of greater functional diversity, where any properties should be inherited more cautiously.

Acknowledgements

We acknowledge the support of the BBSRC and MRC. James Bray is in receipt of a BBSRC special studentship. Annabel Todd is supported by a BBSRC Case studentship in collaboration with Oxford Molecular Limited. Frances Pearl is supported by a BBSRC grant. Christine Orengo is supported by the Medical Research Council. We acknowledge support from the BBSRC for computing facilities. We thank Dr Roman Laskowski and Dr Andrew Martin for the use of their computer programs, Ian Sillitoe and William Valdar for helpful discussions and David Lee for the PSI-BLAST server.

References

Abola,E.E., Bernstein,F.C., Bryant,S.H., Koetzle,T.F. and Weng,J. (1987) In Allen,F.H., Bergerhoff,G. and Sievers,R. (eds), Crystallographic Databases: Information Content, Software Systems, Scientific Applications. Commission of the International Union of Crystallography, Bonn, pp.
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Nucleic Acids Res., 25.
Attwood,T.K., Flower,D.R., Lewis,A.P., Mabey,J.E., Morgan,S.R., Scordis,P., Selley,J.N. and Wright,W. (1999) Nucleic Acids Res., 27.
Bairoch,A. (1999) Nucleic Acids Res., 27.
Bairoch,A. and Apweiler,R. (1999) Nucleic Acids Res., 27.
Bateman,A., Birney,E., Durbin,R., Eddy,S.R., Finn,R.D. and Sonnhammer,E.L.L. (1999) Nucleic Acids Res., 27.
Benson,D.A., Boguski,M.S., Lipman,D.J., Ostell,J., Ouellette,B.F.F., Rapp,B.A. and Wheeler,D.L. (1999) Nucleic Acids Res., 27.
Bork,P. and Koonin,E.V. (1998) Nature Genet., 18.
Brenner,S.E., Chothia,C., Hubbard,T.J.P. and Murzin,A.G. (1996) Methods Enzymol., 266.
Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1997) Curr. Opin. Struct. Biol., 7.
Brenner,S.E., Chothia,C. and Hubbard,T.J.P. (1998) Proc. Natl Acad. Sci. USA, 95.
Brown,N.P., Orengo,C.A. and Taylor,W.R. (1996) Comput. Chem., 20.
Chothia,C. (1984) Annu. Rev. Biochem., 53.
Chothia,C. and Lesk,A.M. (1986) EMBO J., 5.
Corpet,F., Gouzy,J. and Kahn,D. (1999) Nucleic Acids Res., 27.
Denessiouk,K.A., Denesyuk,A.I., Lehtonen,J.V., Korpela,T. and Johnson,M.S. (1999) Proteins: Struct. Funct. Genet., 35.
Gibrat,J.F., Madej,T., Spouge,J.L. and Bryant,S.H. (1997) Biophys. J., 72, 298.
Hadley,C. and Jones,D.T. (1999) Structure, 7.
Hofmann,K., Bucher,P., Falquet,L. and Bairoch,A. (1999) Nucleic Acids Res., 27.
Holm,L. and Sander,C. (1993) J. Mol. Biol., 233.
Holm,L. and Sander,C. (1997) In Gaasterland,T., Karp,P., Karplus,K., Ouzounis,C., Sander,C. and Valencia,A. (eds), Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA, pp.
Holm,L. and Sander,C. (1999) Nucleic Acids Res., 27.
Hogue,C.W.V., Ohkawa,H. and Bryant,S.H. (1996) Trends Biochem. Sci., 21.
Hubbard,T.J.P. and Blundell,T.L. (1987) Protein Engng, 1.
Hubbard,T.J.P., Ailey,B., Brenner,S.E., Murzin,A.G. and Chothia,C. (1999) Nucleic Acids Res., 27.
Hughey,R. and Krogh,A. (1996) Comput. Appl. Biosci., 12.
Jansonius,J.N. (1998) Curr. Opin. Struct. Biol., 8.
Jones,D.T. (1999) J. Mol. Biol., 287.
Kabsch,W. and Sander,C. (1983) Biopolymers, 22.
Kasuya,A. and Thornton,J.M. (1999) J. Mol. Biol., 286.
Kraulis,P.J. (1991) J. Appl. Crystallogr., 24.
Laskowski,R.A., Hutchinson,E.G., Michie,A.D., Wallace,A.C., Jones,M.L. and Thornton,J.M. (1997) Trends Biochem. Sci., 22.
Marchler-Bauer,A., Addess,K.J., Chappey,C., Geer,L., Madej,T., Matsuo,Y., Wang,Y. and Bryant,S. (1999) Nucleic Acids Res., 27.
Martell,A.E. (1982) Adv. Enzymol. Relat. Areas Mol. Biol., 53.
Martin,A.C., Orengo,C.A., Hutchinson,E.G., Jones,S., Karmirantzou,M., Laskowski,R.A., Mitchell,J.B., Taroni,C. and Thornton,J.M. (1998) Structure, 6.
Matsuo,Y. and Bryant,S.H. (1999) Proteins: Struct. Funct. Genet., 35.
Merritt,E.A. and Bacon,D.J. (1997) Methods Enzymol., 277.
Milburn,D., Laskowski,R.A. and Thornton,J.M. (1998) Protein Engng, 11.
Mizuguchi,K., Deane,C.M., Blundell,T.L. and Overington,J.P. (1998a) Protein Sci., 7.
Mizuguchi,K., Deane,C.M., Blundell,T.L., Johnson,M.S. and Overington,J.P. (1998b) Bioinformatics, 14.
Murzin,A.G. (1996) Curr. Opin. Struct. Biol., 6.
Murzin,A.G. (1998) Curr. Opin. Struct. Biol., 8.
Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) J. Mol. Biol., 247.
Needleman,S.B. and Wunsch,C.D. (1970) J. Mol. Biol., 48.
Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) (1992) Enzyme Nomenclature. Academic Press, New York.
Orengo,C.A. (1994) Curr. Opin. Struct. Biol., 4.
Orengo,C.A. (1999) Protein Sci., 8.
Orengo,C.A., Brown,N.P. and Taylor,W.R. (1992) Proteins, 14.
Orengo,C.A., Jones,D.T. and Thornton,J.M. (1994) Nature, 372.
Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and Thornton,J.M. (1997) Structure, 5.
Orengo,C.A., Pearl,F.M.G., Bray,J.E., Todd,A.E., Martin,A.C., Lo Conte,L. and Thornton,J.M. (1999) Nucleic Acids Res., 27.
Overington,J.P., Johnson,M.S., Sali,A. and Blundell,T.L. (1990) Proc. R. Soc. London, Ser B, 241.
Park,J., Teichmann,S.A., Hubbard,T. and Chothia,C. (1997) J. Mol. Biol., 273.
Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) J. Mol. Biol., 284.
Pearl,F.M.G., Lee,D., Bray,J.E., Sillitoe,I., Todd,A.E., Harrison,A.P., Thornton,J.M. and Orengo,C.A. (2000) Nucleic Acids Res., 28.
Pennisi,E. (1998) Science, 279.
Ponting,C.P., Schultz,J., Milpetz,F. and Bork,P. (1999) Nucleic Acids Res., 27.
Rost,B. (1999) Protein Engng, 12.
Russell,R.B. and Barton,G.J. (1992) Proteins: Struct. Funct. Genet., 20.
Russell,R.B., Saqi,M.A.S., Sayle,R.A., Bates,P.A. and Sternberg,M.J.E. (1997) J. Mol. Biol., 269.
Salamov,A.A., Suwa,M., Orengo,C.A. and Swindells,M.B. (1998) Protein Sci., 8.
Salamov,A.A., Suwa,M., Orengo,C.A. and Swindells,M.B. (1999) Protein Engng, 12.
Salem,G.M., Hutchinson,E.G., Orengo,C.A. and Thornton,J.M. (1999) J. Mol. Biol., 287.
Sander,C. and Schneider,R. (1991) Proteins: Struct. Funct. Genet., 9.
Sayle,R.A. and Milner-White,E.J. (1995) Trends Biochem. Sci., 20.
Schmidt,R., Gerstein,M. and Altman,R.B. (1997) Protein Sci., 6.
Siddiqui,A.S. and Barton,G.J. (1997) help_intro.html.
Sowdhamini,R. and Blundell,T.L. (1995) Protein Sci., 4.
Sowdhamini,R., Rufino,S.D. and Blundell,T.L. (1996) Folding Des., 1.
Sowdhamini,R., Burke,D.F., Huang,F., Mizuguchi,K., Nagarajaram,H.A., Srinivasan,N., Steward,R.E. and Blundell,T.L. (1998) Structure, 6.
Taylor,W.R. (1997) Protein Engng, 10.
Taylor,W.R. and Orengo,C.A. (1989) J. Mol. Biol., 208.
Todd,A.E., Orengo,C.A. and Thornton,J.M. (1999a) Protein Engng, 12.
Todd,A.E., Orengo,C.A. and Thornton,J.M. (1999b) Curr. Opin. Chem. Biol., 3.
Wallace,A.C., Laskowski,R.A. and Thornton,J.M. (1995) Protein Engng, 8.
Westhead,D.R., Hatton,D.C. and Thornton,J.M. (1998) Trends Biochem. Sci., 23.

Received August 8, 1999; revised December 14, 1999; accepted January 6,


Amino Acid Structures from Klug & Cummings. 10/7/2003 CAP/CGS 5991: Lecture 7 1 Amino Acid Structures from Klug & Cummings 10/7/2003 CAP/CGS 5991: Lecture 7 1 Amino Acid Structures from Klug & Cummings 10/7/2003 CAP/CGS 5991: Lecture 7 2 Amino Acid Structures from Klug & Cummings

More information

The ups and downs of protein topology; rapid comparison of protein structure

The ups and downs of protein topology; rapid comparison of protein structure Protein Engineering vol.13 no.12 pp.829 837, 2000 The ups and downs of protein topology; rapid comparison of protein structure Andrew C.R.Martin search, but in the worst-case scenario, it could still be

More information

Methodologies for target selection in structural genomics

Methodologies for target selection in structural genomics Progress in Biophysics & Molecular Biology 73 (2000) 297 320 Review Methodologies for target selection in structural genomics Michal Linial a, *, Golan Yona b a Department of Biological Chemistry, Institute

More information

Chapter 2 Structures. 2.1 Introduction Storing Protein Structures The PDB File Format

Chapter 2 Structures. 2.1 Introduction Storing Protein Structures The PDB File Format Chapter 2 Structures 2.1 Introduction The three-dimensional (3D) structure of a protein contains a lot of information on its function, and can be used for devising ways of modifying it (propose mutants,

More information

CAP 5510 Lecture 3 Protein Structures

CAP 5510 Lecture 3 Protein Structures CAP 5510 Lecture 3 Protein Structures Su-Shing Chen Bioinformatics CISE 8/19/2005 Su-Shing Chen, CISE 1 Protein Conformation 8/19/2005 Su-Shing Chen, CISE 2 Protein Conformational Structures Hydrophobicity

More information

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids Science in China Series C: Life Sciences 2007 Science in China Press Springer-Verlag Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

More information

SCOP. all-β class. all-α class, 3 different folds. T4 endonuclease V. 4-helical cytokines. Globin-like

SCOP. all-β class. all-α class, 3 different folds. T4 endonuclease V. 4-helical cytokines. Globin-like SCOP all-β class 4-helical cytokines T4 endonuclease V all-α class, 3 different folds Globin-like TIM-barrel fold α/β class Profilin-like fold α+β class http://scop.mrc-lmb.cam.ac.uk/scop CATH Class, Architecture,

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

Visualization of Macromolecular Structures

Visualization of Macromolecular Structures Visualization of Macromolecular Structures Present by: Qihang Li orig. author: O Donoghue, et al. Structural biology is rapidly accumulating a wealth of detailed information. Over 60,000 high-resolution

More information

Some Problems from Enzyme Families

Some Problems from Enzyme Families Some Problems from Enzyme Families Greg Butler Department of Computer Science Concordia University, Montreal www.cs.concordia.ca/~faculty/gregb gregb@cs.concordia.ca Abstract I will discuss some problems

More information

Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space

Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space Published online February 15, 26 166 18 Nucleic Acids Research, 26, Vol. 34, No. 3 doi:1.193/nar/gkj494 Comprehensive genome analysis of 23 genomes provides structural genomics with new insights into protein

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

FoldMiner: Structural motif discovery using an improved superposition algorithm

FoldMiner: Structural motif discovery using an improved superposition algorithm FoldMiner: Structural motif discovery using an improved superposition algorithm JESSICA SHAPIRO 1 AND DOUGLAS BRUTLAG 1,2 1 Biophysics Program and 2 Department of Biochemistry, Stanford University, Stanford,

More information

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling. Chapter 4 Part I Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

More information

Do Aligned Sequences Share the Same Fold?

Do Aligned Sequences Share the Same Fold? J. Mol. Biol. (1997) 273, 355±368 Do Aligned Sequences Share the Same Fold? Ruben A. Abagyan* and Serge Batalov The Skirball Institute of Biomolecular Medicine Biochemistry Department NYU Medical Center

More information

Measuring quaternary structure similarity using global versus local measures.

Measuring quaternary structure similarity using global versus local measures. Supplementary Figure 1 Measuring quaternary structure similarity using global versus local measures. (a) Structural similarity of two protein complexes can be inferred from a global superposition, which

More information

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel Christian Sigrist General Definition on Conserved Regions Conserved regions in proteins can be classified into 5 different groups: Domains: specific combination of secondary structures organized into a

More information

Bioinformatics. Macromolecular structure

Bioinformatics. Macromolecular structure Bioinformatics Macromolecular structure Contents Determination of protein structure Structure databases Secondary structure elements (SSE) Tertiary structure Structure analysis Structure alignment Domain

More information

From Sequence to Function (I): - Protein Profiling - Case Studies in Structural & Functional Genomics

From Sequence to Function (I): - Protein Profiling - Case Studies in Structural & Functional Genomics BCHS 6229 Protein Structure and Function Lecture 6 (Oct 27, 2011) From Sequence to Function (I): - Protein Profiling - Case Studies in Structural & Functional Genomics 1 From Sequence to Function in the

More information

Large-Scale Genomic Surveys

Large-Scale Genomic Surveys Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

More information

Getting To Know Your Protein

Getting To Know Your Protein Getting To Know Your Protein Comparative Protein Analysis: Part III. Protein Structure Prediction and Comparison Robert Latek, PhD Sr. Bioinformatics Scientist Whitehead Institute for Biomedical Research

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5 Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5 Why Look at More Than One Sequence? 1. Multiple Sequence Alignment shows patterns of conservation 2. What and how many

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

The molecular functions of a protein can be inferred from

The molecular functions of a protein can be inferred from Global mapping of the protein structure space and application in structure-based inference of protein function Jingtong Hou*, Se-Ran Jun, Chao Zhang, and Sung-Hou Kim* Department of Chemistry and *Graduate

More information

Efficient Remote Homology Detection with Secondary Structure

Efficient Remote Homology Detection with Secondary Structure Efficient Remote Homology Detection with Secondary Structure 2 Yuna Hou 1, Wynne Hsu 1, Mong Li Lee 1, and Christopher Bystroff 2 1 School of Computing,National University of Singapore,Singapore 117543

More information

Examples of Protein Modeling. Protein Modeling. Primary Structure. Protein Structure Description. Protein Sequence Sources. Importing Sequences to MOE

Examples of Protein Modeling. Protein Modeling. Primary Structure. Protein Structure Description. Protein Sequence Sources. Importing Sequences to MOE Examples of Protein Modeling Protein Modeling Visualization Examination of an experimental structure to gain insight about a research question Dynamics To examine the dynamics of protein structures To

More information

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program)

Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Syllabus of BIOINF 528 (2017 Fall, Bioinformatics Program) Course Name: Structural Bioinformatics Course Description: Instructor: This course introduces fundamental concepts and methods for structural

More information

Functionally-Valid Unsupervised Compression of the Protein Space

Functionally-Valid Unsupervised Compression of the Protein Space Functionally-Valid Unsupervised Compression of the Protein Space Noam Kaplan 1, Moriah Friedlich 2, Menachem Fromer 2 and Michal Linial 1* 1 Department of Biological Chemistry, Institute of Life Sciences,

More information

Comparing Protein Structures. Why?

Comparing Protein Structures. Why? 7.91 Amy Keating Comparing Protein Structures Why? detect evolutionary relationships identify recurring motifs detect structure/function relationships predict function assess predicted structures classify

More information

Study of Mining Protein Structural Properties and its Application

Study of Mining Protein Structural Properties and its Application Study of Mining Protein Structural Properties and its Application A Dissertation Proposal Presented to the Department of Computer Science and Information Engineering College of Electrical Engineering and

More information

PSI-BLAST pseudocounts and the minimum description length principle

PSI-BLAST pseudocounts and the minimum description length principle Published online 16 December 2008 Nucleic Acids Research, 2009, Vol. 37, No. 3 815 824 doi:10.1093/nar/gkn981 PSI-BLAST pseudocounts and the minimum description length principle Stephen F. Altschul*, E.

More information

BIOINFORMATICS ORIGINAL PAPER

BIOINFORMATICS ORIGINAL PAPER BIOINFORMATICS ORIGINAL PAPER Vol. 21 no. 7 2005, pages 1010 1019 doi:10.1093/bioinformatics/bti128 Structural bioinformatics Non-sequential structure-based alignments reveal topology-independent core

More information

ALL LECTURES IN SB Introduction

ALL LECTURES IN SB Introduction 1. Introduction 2. Molecular Architecture I 3. Molecular Architecture II 4. Molecular Simulation I 5. Molecular Simulation II 6. Bioinformatics I 7. Bioinformatics II 8. Prediction I 9. Prediction II ALL

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

Understanding Sequence, Structure and Function Relationships and the Resulting Redundancy

Understanding Sequence, Structure and Function Relationships and the Resulting Redundancy Understanding Sequence, Structure and Function Relationships and the Resulting Redundancy many slides by Philip E. Bourne Department of Pharmacology, UCSD Agenda Understand the relationship between sequence,

More information

PSIC: profile extraction from sequence alignments with position-specific counts of independent observations

PSIC: profile extraction from sequence alignments with position-specific counts of independent observations Protein Engineering vol.12 no.5 pp.387 394, 1999 PSIC: profile extraction from sequence alignments with position-specific counts of independent observations Shamil R.Sunyaev 1,2,3, Frank Eisenhaber 1,2,4,

More information

Variable-Length Protein Sequence Motif Extraction Using Hierarchically-Clustered Hidden Markov Models

Variable-Length Protein Sequence Motif Extraction Using Hierarchically-Clustered Hidden Markov Models 2013 12th International Conference on Machine Learning and Applications Variable-Length Protein Sequence Motif Extraction Using Hierarchically-Clustered Hidden Markov Models Cody Hudson, Bernard Chen Department

More information

BIRKBECK COLLEGE (University of London)

BIRKBECK COLLEGE (University of London) BIRKBECK COLLEGE (University of London) SCHOOL OF BIOLOGICAL SCIENCES M.Sc. EXAMINATION FOR INTERNAL STUDENTS ON: Postgraduate Certificate in Principles of Protein Structure MSc Structural Molecular Biology

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

METABOLIC PATHWAY PREDICTION/ALIGNMENT

METABOLIC PATHWAY PREDICTION/ALIGNMENT COMPUTATIONAL SYSTEMIC BIOLOGY METABOLIC PATHWAY PREDICTION/ALIGNMENT Hofestaedt R*, Chen M Bioinformatics / Medical Informatics, Technische Fakultaet, Universitaet Bielefeld Postfach 10 01 31, D-33501

More information

Secondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure

Secondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure Bioch/BIMS 503 Lecture 2 Structure and Function of Proteins August 28, 2008 Robert Nakamoto rkn3c@virginia.edu 2-0279 Secondary Structure Φ Ψ angles determine protein structure Φ Ψ angles are restricted

More information

Introduction to Bioinformatics Online Course: IBT

Introduction to Bioinformatics Online Course: IBT Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple

More information

Subfamily HMMS in Functional Genomics. D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander

Subfamily HMMS in Functional Genomics. D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander Subfamily HMMS in Functional Genomics D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander Pacific Symposium on Biocomputing 10:322-333(2005) SUBFAMILY HMMS IN FUNCTIONAL GENOMICS DUNCAN

More information

Supporting Online Material for

Supporting Online Material for www.sciencemag.org/cgi/content/full/309/5742/1868/dc1 Supporting Online Material for Toward High-Resolution de Novo Structure Prediction for Small Proteins Philip Bradley, Kira M. S. Misura, David Baker*

More information