Families of membranous proteins can be characterized by the amino acid composition of their transmembrane domains


Tali Sadka 1 and Michal Linial 1,2,*

1 School of Computer Science and Engineering and 2 Dept of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem 91904, Israel. 2 Present address: Dept of Computer Science and Engineering, University of Washington, Seattle, WA, USA. * Corresponding author

ABSTRACT

In eukaryotes, membranous proteins account for 20-30% of the proteome. Most of these proteins contain one or more transmembrane (TM) domains, short segments that traverse the lipid bilayer membrane. Various properties of the TM domains, such as their number, their topology and their arrangement within the membrane, are closely related to the protein's cellular functions. Properties of the TM domains also determine the cellular targeting and localization of these proteins. It is not known, however, whether the information encoded by TM domains suffices for classifying proteins into their functional families. This is the question we address here. We introduce an algorithm that creates a profile of each functional family of membranous proteins based only on the amino-acid composition of their TM domains. This is complemented by a classifier for each such family that determines whether or not a given protein belongs to it. We find that in most instances TM domains contain enough information to allow accurate discrimination, with ~80% sensitivity and ~90% specificity, among unrelated polytopic functional families with the same number of TM domains.

1. INTRODUCTION

Integral membrane proteins participate in countless cell activities. They play a crucial role in signaling pathways, cell adhesion, intercellular communication and more. It is estimated that as many as 20-30% of all genes in multi-cellular organisms encode membranous proteins (Wallin and von Heijne, 1998).
Due to their critical roles in the cell's biological activities, these proteins are of immense importance for the pharmaceutical industry (Bockaert et al., 2004). Despite their great significance, our understanding of membranous proteins leaves much to be desired. For example, the structure of proteins with transmembrane (TM) domains is still largely an enigma, and at present only several dozen membranous proteins (e.g., K+ channel, bacteriorhodopsin) have been structurally solved.

In the present paper we deal with membranous proteins whose TM domains are short α-helices. These proteins comprise the major part of all integral membrane proteins across phyla (Liu et al., 2002). A relatively small group of proteins that cross the bilayer membrane via a cluster of β-sheets (e.g., porin) will not be considered (Levitt, 1990).

TM domains are known to have a characteristically hydrophobic amino acid (aa) distribution. Propensity scales of aa's (Engelman et al., 1986; Kyte and Doolittle, 1982; Pilpel et al., 1999; Tusnady and Simon, 1998) developed for TM domains yield a more accurate sequence alignment of proteins rich in TM domains as well as a good prediction of TM domain topology (Muller et al., 2001; Ng et al., 2000). Numerous databases based on experimental and computational analyses for protein localization (e.g., PSORT, DAS) focus mostly on bacterial proteomes. However, in recent years, several programs such as Protein Prediction, TMHMM, TargetP, SOSUI, TMPpred and TMTOP were developed. These rely on aa propensities in TM domains and can successfully predict whether a protein is membranous or soluble and determine its number of TM domains with reasonable accuracy (Chen and Rost, 2002; Rost et al., 1996). However, these programs do not fare so well in assigning in-out membrane topology (Ott and Lingappa, 2002).

In some studied instances, a protein's functionality is tightly linked to the features of one or more of its TM domains. For example, in voltage-dependent ion channels, a specific TM domain serves as a voltage sensor and largely determines the gating properties of the channel. Similarly, the specificity and selectivity of ion channels are governed by minor sequence differences in the TM domains that line the channel pore. In numerous reports a single aa change was shown to convert a channel from anion to cation permeability or to reverse the ion selectivity (Favre et al., 1996). This suggests that various channels may be distinguished by their most detailed aa information (Miloshevsky and Jordan, 2004).

On the other hand, TM domains may, at times, serve only as a means of crossing a membrane from one side of a compartment to the other or to define membrane preference localization (Lin et al., 1998). In such instances, the aa that cross the lipid bilayer are subject to certain biophysical constraints (i.e., overall hydrophobicity, minimal length), but the protein's functionality is entirely independent of the detailed information within the TM domain.

Here we ask which of these competing hypotheses prevails. We also investigate the extent to which the information encoded by the TM segments can be applied to automatically classify membranous proteins into their functional families. Concretely, we ask whether a coherent functional family can be defined based solely on the limited information in the TM segments. This is to be done with no recourse to other sequence-based (signature, motif) or functional (binding site, catalytic site) information. We find that for most protein families, the information encoded in the TM domains suffices for classifying functional families. Furthermore, in some instances, a partition of the family into its subfamilies is achieved solely by specific properties of the TM domains.
Specifically, we suggest that the aa composition of the TM domains alone suffices to determine the identity of the family and may be used as an additional source for protein family refinement.

2. METHODS

Most functional families have a typical number of TM domains. For example, all G-protein coupled receptors have exactly 7 TM domains. Though this typical number need not be so strict in other functional families, the vast majority of proteins in a functional family possess the same number of TM domains. For simplicity, we concentrate on functional families that conform to this rule. Of course, the number of TM domains alone cannot define the functional family, as many transporters, channels, pumps and exchangers have exactly 12 TM domains.

Data set

Our data set comes from the ProtoNet ver 3.0 database (Kaplan et al., 2005). ProtoNet includes 114,000 proteins from SwissProt. To obtain a classification of proteins into functional families we use PANDORA ver 2.1, a tool for detecting subsets of proteins that share unique biological properties defined by keywords, in this case function (i.e., channel, receptor, transporter) (Kaplan et al., 2003). Out of the protein sets matching a keyword retrieved by PANDORA, we used almost exclusively the definition by InterPro families (ver 5.2), supplemented with keyword annotations from SwissProt (ver 40.21). The broader set of membranous proteins is collected by the GO cellular localization category, which marks 24,000 proteins as membrane. We set the typical number of TM domains in a family to be the number of TM domains annotated for most of the proteins in the group. As a supplementary database we used GPCRDB ver 8.0, a source for G-protein coupled receptors (GPCR) and their partition into subfamilies (Horn et al., 2003). GPCRDB is manually annotated based on literature search and expert view.
For a full list of protein sets and sources, see the supplement.

Classification Algorithm

In this section we describe how we classify a protein to its functional family by (a) defining the representation of proteins, and (b) describing the classification algorithm. Fig. 1 shows a flowchart of the main steps (marked 1-5). Step 1 is composed of (i)-(iv).

(i) Location annotation of a single TM domain: A reliable annotation of the TM domain locations (start and end aa positions) is mostly missing. The exceptions are the TM domains defined by SwissProt in the FT feature field. For consistency, we applied TMHMM ver 2.0 (Krogh et al., 2001) to predict the TM domains in the entire set of proteins under consideration. TMHMM is based on an HMM for predicting TM domain locations in un-annotated proteins. Based on performance tests, this method is expected to reach a very high level of accuracy (Krogh et al., 2001).

(ii) Representation of information contained in a given TM domain: We represent each TM domain in a single protein by its aa composition alone. Thus, corresponding to each TM domain is a 20-dimensional vector, the i-th coordinate of which equals

$$c_i = \frac{\#\,\text{appearances of the } i\text{-th aa in the TM domain}}{\#\,\text{aa in the TM domain}}$$

(iii) Representation of a single TM domain in a functional family: Our underlying statistical assumption is that the aa composition of a particular TM domain in a specific protein from a given functional family is sampled from a multivariate normal distribution. This distribution is characteristic of that TM domain in that family. Accordingly, we construct a 20-dimensional Gaussian to represent each TM domain in a functional family. Each of the 20 dimensions represents a single aa. A Gaussian is uniquely specified by its mean and its covariance matrix. These are set using maximum-likelihood estimates. The mean is the average of the aa composition vectors of that specific TM domain over all proteins considered members of that family. The covariance matrix is the covariance of all those examples. Parameter estimation was carried out using Matlab version 6.5.

(iv) Representation of dependencies among TM domains in a functional family of proteins: Since all proteins in a functional family share the same typical number, say X, of TM domains, each family is now represented by X Gaussians. Given an input protein Y (Fig 1, step 2) with X annotated or predicted TM domains, we derive a score for each of the X TM domains in Y, namely the value of the probability density function of the Gaussian for that TM domain, evaluated at the 20-dimensional vector representing the aa composition of that TM domain in Y.
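The representation in steps (ii) to (iv) up to this point can be illustrated with a minimal pure-Python sketch. It is not the authors' Matlab code. The function names, the alphabetical ordering of the 20 amino acids, the small `ridge` term added to the covariance diagonal (so that the matrix is invertible even with few examples) and the use of the log-density instead of the raw density are all assumptions made for illustration.

```python
import math

# Assumed ordering of the 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(tm_segment):
    """20-dimensional aa composition of one TM segment (step ii)."""
    if not tm_segment:
        raise ValueError("empty TM segment")
    # Non-standard letters count toward the length but toward no coordinate.
    return [tm_segment.count(aa) / len(tm_segment) for aa in AMINO_ACIDS]

def fit_gaussian(vectors, ridge=1e-6):
    """Maximum-likelihood mean and covariance (step iii).

    `ridge` is an illustrative regularizer, not part of the original method.
    """
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    cov = [[sum((v[i] - mean[i]) * (v[j] - mean[j]) for v in vectors) / n
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += ridge
    return mean, cov

def cholesky(a):
    """Lower-triangular L with L * L^T = a, for a positive-definite matrix."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def log_density(x, mean, cov):
    """Log of the multivariate normal density at x."""
    d = len(x)
    low = cholesky(cov)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    z = []  # forward substitution: solve low * z = diff
    for i in range(d):
        s = sum(low[i][k] * z[k] for k in range(i))
        z.append((diff[i] - s) / low[i][i])
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + sum(v * v for v in z))

def tm_scores(segments, family_gaussians):
    """One score per TM domain of a protein (start of step iv)."""
    return [log_density(composition(seg), mean, cov)
            for seg, (mean, cov) in zip(segments, family_gaussians)]

# Toy family with a single TM domain (made-up segments, not real data).
family = [fit_gaussian([composition(s) for s in ("LLLLAAAA", "LLLAAAAA", "LLLLLAAA")])]
member_score = tm_scores(["LLLLAAAA"], family)
outlier_score = tm_scores(["GGGGGGGG"], family)
```

In this toy example the leucine/alanine segment scores far higher under the family Gaussian than the all-glycine segment, which is the behaviour the per-domain score is meant to capture.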
We now associate an X-dimensional vector with protein Y, the i-th coordinate of which is the score assigned to protein Y by the i-th Gaussian. This vector encodes the dependencies between TM domains. The number of TM domains in the proteins we consider is one of 1, 4, 5, 6, 7, 8, 10, 12. Other values of this parameter correspond to families that are too small to be statistically significant.

Classification for one family: Our procedure to decide whether or not a given protein belongs to a certain family is based on SVM (Byvatov and Schneider, 2003). The SVM implementation we use is called mysvm. Recall that a protein with X TM domains is encoded by the X-dimensional vector of its scores on each of the family's Gaussians. It is possible to separate proteins that belong to the family from those that do not if the relevant vectors in X-dimensional space are geometrically separated. We train our SVM classifier with the vectors representing proteins in the family as positive examples and those from other families as negative examples. This SVM can now be applied to new examples (Fig 1, step 2). The resulting score represents the example's distance from the separator. A large positive classification score means we are confident that the protein belongs to the family, and a negative score that is large in absolute value indicates high confidence that it is outside the family.

Classifying a protein to one of several families: For each family of proteins with X TM domains, we construct X Gaussians and an SVM classifier. On an input protein Y with X TM domains we proceed as follows: (i) For each family, compute the vector of scores for protein Y according to the family's Gaussians, and find the SVM classification for that family. (ii) Prediction: protein Y belongs to the family with the highest-scoring SVM classification.

Size limitation of available data: Maximum-likelihood estimates used for building multidimensional Gaussians require many examples.
Protein families in high-quality databases (as in the case of SwissProt) are frequently too small to provide the essential amounts of data. We excluded families with a limited number of proteins (below 50). In a few specific instances we also included sets of somewhat smaller size. From these families we removed proteins that are marked as fragments and those for which no clear annotation is provided. Following these filters, all families with fewer than 15 proteins each were excluded. For the remaining proteins (~3100 proteins in 29 annotated functional families) we constructed the Gaussians as described above. In cases where we were still unable to construct the Gaussians from the protein family's data due to a sparsity of examples (or strong dependencies between examples) we used PCA for dimension reduction (using Matlab version 6.5). This way we created Gaussians whose dimension was less than 20.

Setting the parameters: For SVM classification we used a polynomial kernel of degree 4, with cross-validation of the data divided into 7 groups. For the PCA, we used only the principal components explaining at least 3% of the variation in the data.

Tests and measuring the performance: In order to test our algorithm we divided our protein families into groups according to their number of TM domains. Each group contained all families with an identical number of TM domains. We used a leave-one-out cross-validation test. In each iteration a single protein is left out (out of one family in an X TM domain group) while all families' Gaussians and classifiers are built without that protein. (Note that the classifier for each family is built with family members as positive examples and members of other families in the X TM domains group as negative examples.) Each family produces an SVM classification for the left-out protein, and the protein is predicted to belong to the family that produced the most positive classification (Fig 1). This procedure is repeated until all proteins from all families in the group have been left out, and for each one a family prediction is retrieved.

To score our success we used the sensitivity and specificity measures. Sensitivity is defined as the fraction of the positive examples we capture, $\frac{tp}{tp+fn}$, and specificity as the fraction of the negative examples we capture, $\frac{tn}{tn+fp}$. Note that this definition of specificity deviates from the more widely accepted definition, the fraction of those marked as positive that are truly positive, $\frac{tp}{tp+fp}$. This deviation was required in our tests since the size of families within an X TM domains group varies significantly.
As a consequence, if one family, say A, is much larger than another family, say B, then always predicting A will still result in a high specificity value under the tp/(tp+fp) definition (since fp << tp), though it is clear that a prediction of always A is anything but specific.

3. RESULTS

We have tested 12 groups of protein families (each group contains families with a defined number of TM domains), with a total of 29 different families that include ~3100 annotated proteins. The proteins were collected from a wide phylogenetic range and thus represent the maximal variability of each family. The TM domain groups contain all the families that are present in the SwissProt database following our filtration procedure (see Methods). A few additional groups with some bitopic protein families were composed. For consistency, we used TMHMM for determining the TM domains and their putative length. Fig. 2 shows the size distribution of TM domains in our dataset. As seen, most TM domain lengths are within a narrow range of 21 to 25 aa. In contrast, the size of the groups in terms of the number of proteins they include varies significantly. The 7, 12 and single TM domain groups are notably larger than the other groups (for a complete list see supplemental material).

Fig 1: A 5-step flowchart of the classification algorithm based on information of the TM domains. The representation of proteins for our algorithm (step 1) is composed of the following components: (i) Location annotation of TM domains (ii) Representation of information contained in a given TM domain (iii) Representation of a TM domain in a functional family of proteins (iv) Representation of dependencies between TM domains in a functional family of proteins. For details see Methods.
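The sensitivity and specificity measures defined in Methods, and the reason for preferring tn/(tn+fp) when family sizes are unbalanced, can be illustrated with a short sketch. The function name `family_metrics` and the 90/10 toy group are hypothetical and are not taken from the study's data.

```python
def family_metrics(true_labels, predicted_labels, family):
    """One-vs-rest sensitivity, specificity and tp/(tp+fp) for one family."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == family and p == family for t, p in pairs)
    fn = sum(t == family and p != family for t, p in pairs)
    fp = sum(t != family and p == family for t, p in pairs)
    tn = sum(t != family and p != family for t, p in pairs)
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,  # tp/(tp+fn)
        "specificity": tn / (tn + fp) if tn + fp else nan,  # tn/(tn+fp), used here
        "precision": tp / (tp + fp) if tp + fp else nan,    # tp/(tp+fp), alternative
    }

# Hypothetical group: family A has 90 proteins, family B has 10,
# and a degenerate predictor always answers "A".
truth = ["A"] * 90 + ["B"] * 10
always_a = ["A"] * 100
metrics_a = family_metrics(truth, always_a, "A")
```

For this degenerate predictor, family A gets a sensitivity of 1.0 and a tp/(tp+fp) value of 0.9, but a tn/(tn+fp) value of 0.0, so only the definition adopted here exposes that the prediction is not specific.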

Fig. 2: Number of TM domains in the database (# TMDs in DB) as a function of their length (in aa) according to the TMHMM prediction program.

Table 1: Average levels of sensitivity and specificity (unweighted) for the tested groups. Each group is composed of protein families with the same number of TMD (TM domains). Columns: group; family name; protein number; length; sensitivity tp/(tp+fn) (%); specificity tn/(tn+fp) (%). Family names, in the order listed: bcct transp, sodium NT symp, aa permease, sodium H exch, facilit glu transp (12 TMD); sulfate transp, cl channel; H transp pump, zinc transp; kv channel, abc2 transp; connexin, scamp, synaptophysin. The numeric entries of the table are not preserved in this transcription. For a full list of groups and the family annotations, see supplemental material.

Table 1 summarizes the results for approximately 1500 proteins falling into 14 different families that contain 4, 6, 8, 10 and 12 TM domains. Each group contains proteins with a defined number of TM domains. The results express the success in separating the different families within a group using a leave-one-out prediction (see Methods). The results show that most families are separable solely on information describing the aa composition of their TM domains. Prediction success varies between groups. Most groups show average sensitivity levels above 80% and average specificity levels above 90%. While some groups are composed of families with highly variable sensitivity and specificity rates, others are quite robust, in the sense that all families within the group attained more or less the same success rate. In most cases, failure in family distinction is associated with a relatively small number of proteins in that family.

To test whether the high separability achieved merely reflects TM groups that are composed of functionally remote families (i.e., transporters and channels, Table 1), we extended the analysis to groups that are composed of closely related protein families.
We concentrated on the largest group of membranous proteins in most eukaryotes, the G-protein coupled receptors (GPCR, with 7 TM domains). We included 1456 proteins of families A (1102 pr), B (182 pr), C (100 pr), and Frizzled (72 pr), as defined by GPCRDB V8.0. Families D and E were excluded from the analysis since they were annotated as having too few members (see Methods). As can be seen in Fig. 3, all families in the 7 TM domain group show extremely good results in terms of both sensitivity and specificity, and are therefore easily separable.

Fig. 3: Sensitivity and specificity scores for each family in the GPCRs group (gpcr A, gpcr frizzled, gpcr C, gpcr B). The unweighted average sensitivity and specificity values are 85.44% and 96.71%, respectively. Light grey: sensitivity score; dark grey: specificity score.

We tested whether the algorithm presented here is capable of separating proteins whose family characterization is not readily evident. We tested the performance of our method in separating related families of photosystem II. The reaction center of photosystem II is composed of closely related proteins having 5 TM domains, as resolved by the 3D structure of the photosystem complex (from Rhodobacter). The complex is involved in the transformation of light energy into chemical energy. In photosystem II of eukaryotic chloroplasts two related proteins, the D1 (psba) and D2 (psbd) proteins, are known. All four proteins D1, D2, L and M probably evolved from a common ancestor. All four related families are considered by Pfam as one family with a unified HMM (Photo_RC, PF00124). Inspecting the HMM by the logo representation (Fig. 4) reveals that sequence conservation is detected throughout the 320 aa that are included in the family HMM. The TM domains (marked 1-5) cover about a third of the family HMM but they capture only part of the information (note the information-rich loop between TM3 and TM4). Based on TigrFam, InterPro (IPR000484) suggested a natural partition of the ~950 UniProt proteins into 4 subfamilies of D1, D2, M and L. Our algorithm successfully separated those 4 families. Results are shown in Table 2.

Fig. 4: PSBD_CHLRE, photosynthetic reaction centre protein (logo of Pfam PF00124). An HMM consensus logo for Pfam Photo_RC (945 proteins). The positions of the TM domains, marked 1-5, are included in the 320 aa long HMM of the family. The 3 red dots mark the key functional aa that are essential for the energy transfer reaction. Note that the high level of aa conservation extends throughout the entire sequence.

Table 2: Sensitivity and specificity scores for each family of the photosynthesis group with 5 TM domains (families d1q, d2q, M and L; columns: group, # proteins, sensitivity, specificity). The numeric entries of the table are not preserved in this transcription. The unweighted average sensitivity and specificity values are 70.96% and 93.4%, respectively.

We set up a few tests to better understand the results for the 29 families that were combined into 12 TM domain groups (5 groups for bitopic protein families and groups for 4, 5, 6, 7, 8, 10 and 12 TM domains).
Inspecting all the results indicated that there is no correlation between the success of the algorithm in separating functional groups and the number of proteins in a group or the number of families in the group (Tables 1 and 2, and supplemental material). One might assume that the algorithm's performance is better correlated with the number of TM domains that defines the group. An even more reasonable assumption is that a correlation exists between the algorithm's success and the fraction of the protein length that is occupied by TM domains (defined as coverage). The intuition is that a better separation capacity is expected for proteins in which the TM domains occupy a substantial part of the sequence (i.e., proteins with a high number of TM domains or very short connecting loops), and the opposite for low-coverage groups. Fig. 5 does not support such assumptions. There is no apparent correlation between the level of success and either the number of TM domains (top) or the coverage (bottom) in any of the protein family groups. The only evident tendency is that for bitopic proteins (a single TM domain) the variability in prediction is very large, suggesting that the composition of families included in each of the tested groups dominates the prediction (not shown).

A critical test was included to check whether the information that the TM domains provide for separating the families within a group can already be retrieved directly from the raw aa sequence data. We compared the results of the Gaussian-based method (Fig 1) to a simple clustering algorithm. The input to the clustering algorithm is the vectors of aa composition of each TM domain, concatenated in the order of their appearance. In this direct method the vectors are not transformed into Gaussian scores and no supervised learning algorithm is used.
We applied k-means clustering, which creates a partition into k groups that minimizes the sum, over all clusters, of the within-cluster sums of point-to-center distances. The distance was measured using the l1 norm, and several iterations were performed to avoid dependence on starting conditions. The number of iterations was 50. In cases where the data was insufficient (the 5 TM domains photosynthesis group, Fig 3) we used only 10 iterations.

Fig. 5: Correlation between the number of TM domains (top) and the average coverage (bottom) in a family and the sensitivity score achieved.

Fig. 6: A comparison of sensitivity rates achieved by the Gaussian-based aa composition algorithm (dark gray) and the k-means clustering method with distance measured in the l1 norm (light gray). The tested groups (12 TMD, 10 TMD, 8 TMD, 7 TMD, 6 TMD, Photosynthesis, 4 TMD, 1 TMD, Synaptic, Syx_MHC, Syx_synaptotagmin, Rtk2_3_5) are as shown in Tables 1-2 and Fig. 2, with some additional combinations of bitopic families. Synaptic: proteins including VAMP, synaptotagmin and syntaxin; Photosynthesis: 4 families of 5 TM domains participating in reaction center II; Syx_synaptotagmin: syntaxin and synaptotagmin; Rtk: receptor tyrosine kinases of classes 2, 3 and 5; Syx_MHC: syntaxin and MHC class I histocompatibility antigen.

Fig. 6 shows the results from the clustering method relative to the results obtained by the Gaussian-based method. We extended the analysis to bitopic proteins, which are the most abundant membranous proteins in eukaryotes (Liu et al., 2002). To this end, we combined several (overlapping) families of bitopic proteins into groups of 2 up to 7 families (see supplementary material for details). Our method outperforms this naïve method, in terms of sensitivity rates, in all tested cases (~4000 proteins, 12 groups comprising 39 family combinations). In some instances, such as for GPCR, the improvement of our algorithm is very significant (from 40% to over 80% for the sensitivity measure), yet on average the improvement is more moderate.
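The clustering baseline described above can be sketched in pure Python as follows. This is an illustration, not the authors' implementation. It assumes that cluster centers are updated with the coordinate-wise median (the update that minimizes the sum of l1 distances within a cluster), that the "iterations" correspond to random restarts, and that initial centers are drawn from the data points. The names `concatenate` and `kmeans_l1` are hypothetical.

```python
import random
from statistics import median

def concatenate(per_domain_vectors):
    """Join the composition vectors of a protein's TM domains, in order."""
    return [value for vector in per_domain_vectors for value in vector]

def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans_l1(points, k, restarts=50, max_iter=100, seed=0):
    """k-means with l1 distance; returns the best labels and their total cost."""
    rng = random.Random(seed)
    best_labels, best_cost = None, float("inf")
    for _ in range(restarts):
        # Start each restart from k distinct data points chosen at random.
        centers = [list(p) for p in rng.sample(points, k)]
        labels = None
        for _ in range(max_iter):
            new_labels = [
                min(range(k), key=lambda c: l1_distance(p, centers[c]))
                for p in points
            ]
            if new_labels == labels:  # assignments stable: converged
                break
            labels = new_labels
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                if members:  # keep the old center if a cluster empties
                    centers[c] = [median(column) for column in zip(*members)]
        cost = sum(l1_distance(p, centers[lab]) for p, lab in zip(points, labels))
        if cost < best_cost:
            best_labels, best_cost = labels, cost
    return best_labels, best_cost

# Toy 2-dimensional points standing in for concatenated composition vectors.
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels, toy_cost = kmeans_l1(toy_points, k=2)
```

Because the method is unsupervised, the resulting clusters still have to be matched to the annotated families before sensitivity and specificity can be computed; that matching step is not shown here.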
The results from a similar test performed for the specificity rates of both algorithms indicate a consistently superior distinguishing capacity for our Gaussian-based algorithm relative to the direct clustering method.

In summary, our results show that, in addition to the dominating biophysical constraints for an α-helix to cross the bilayer membrane, the aa composition seems to provide a sufficiently strong signal to define functional families in all instances tested, with the exception of some combinations of bitopic proteins (Fig 6).

4. DISCUSSION

We have introduced here an algorithm that separates, with satisfactory accuracy, functional families containing the same number of TM domains, based solely on the aa composition of their TM domains. Using a leave-one-out cross-validation test, we reached sensitivity rates of 75-85% and specificity rates of 85-95% for most groups tested. Such high success rates strongly support the hypothesis that TM domains are far from being simple hydrophobic linkers. Variable separability is recorded for groups of bitopic proteins (Fig. 5). We attribute the latter finding to the numerous families in which the single TM domain serves as a means to anchor the protein to the membrane, while the actual function of the protein is associated with either the cytoplasmic or the extracellular domain. This may be the case for most adhesion proteins, integrins and more.

To test whether the family annotation is already based on the information encoded in the TM domains, we inspected the annotation source as defined by InterPro. For most of our tested families, the InterPro annotation is based on a family-based Pfam HMM (Fig. 4). Recording the signatures for all membranous families (29 annotated families) presented in this study showed that for ~50% of the families, the TM domains are included in the family HMM, which is often much larger. In another 15% only a few of the TM domains are included, while in the rest (35%) of the families (i.e., receptor tyrosine kinases, synaptotagmin) the HMM and the TM domains do not overlap.

We found that there is a very low correlation between the sensitivity rates achieved and the number of TM domains, the coverage, the number of proteins in the group and the number of families in the group. The observation that some discrimination between families can already be achieved by a simple clustering method based on raw data (Fig. 6) implies that the separability between families is an intrinsic quality of the data and not a result of data manipulation caused by our algorithm.

The maximal improvement of our algorithm is seen for the group of 7 TM domains (Fig 6). Note that this group comprises about half of all proteins tested in this study. Indeed, the failure of routine alignment methods to functionally classify the GPCR superfamily and other 7 TM domain proteins has been reported. Adding features such as the TM topology, phylogenetic distances and aa biochemical properties improved the classification (Inoue et al., 2004; Lapinsh et al., 2002). The large size of this group and the diversity of the proteins within it make the distinction of functional subfamilies well suited to SVM-based approaches (this study and Bhasin and Raghava, 2004).
There are a few limitations and pitfalls to the approach presented:

The family size: Genome-wide analysis based on Pfam reveals ~230 Pfam-A polytopic families (Liu et al., 2002), most of them very small. Therefore, methods of parameter estimation that require many examples are prone to severe overfitting once applied to a relatively small set of examples.

Source for functional families: Proteins that are marked as unknown or putative were excluded from our analysis. This rule was applied to maintain a strict level of objectivity. However, increasing annotation coverage and reducing false annotations will allow our method to be applied to additional protein families.

Membranous protein coverage: Recall that we disregard membranous proteins other than α-helical TM proteins. Most prediction methods (e.g., TMHMM) do not apply to non-α-helical TM domains (such as β-barrels). Consequently, annotation for such proteins is very unreliable.

Assessment of the statistical confidence: Our leave-one-out cross-validation test showed promising results. However, to ensure statistical confidence, we are currently repeating the tests (on the present dataset and on a larger set that includes UniProt-90 knowledge) using a leave-10%-out scheme. We expect that a possible bias due to sequence similarity (i.e., aa composition) will be minimized.

Accuracy in TM prediction: The accuracy of most TM prediction programs, including TMHMM, which was applied throughout, is often less satisfactory than reported (see discussion in Kernytsky and Rost, 2003).

In summary, we conclude that the information encoded by the aa composition of TM domains in polytopic proteins is valuable and sufficient in most cases to define the family relationship of a protein. In the few instances (Fig 5, 6) where our algorithm performed poorly, we suggest that the TM domains serve as localization anchors rather than being fundamental components of the family function.
The effect on protein function of replacing an authentic TM domain by a lipid anchor has been monitored for certain protein families (Grote et al., 2000; Kemble et al., 1993). In more general terms, considering proteins according to their broad definition, such as membranous, nuclear etc., is useful for extracting additional non-alignment-based information (such as TM aa composition, this study). The results could support classification and characterization efforts, especially for newly available but poorly annotated genomes.

5. ACKNOWLEDGMENTS

I would like to thank Nati Linial for his helpful advice, Kobi Cramer for his help with machine learning approaches, and the ProtoNet research group at the Hebrew University of Jerusalem for DB support and stimulating discussions.

6. REFERENCES

Bhasin, M., and Raghava, G. P. (2004). GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Nucleic Acids Res 32, W
Bockaert, J., Dumuis, A., Fagni, L., and Marin, P. (2004). GPCR-GIP networks: a first step in the discovery of new therapeutic drugs? Curr Opin Drug Discov Devel 7,
Byvatov, E., and Schneider, G. (2003). Support vector machine applications in bioinformatics. Appl Bioinformatics 2,
Chen, C. P., and Rost, B. (2002). State-of-the-art in membrane protein prediction. Appl Bioinformatics 1,
Engelman, D. M., Steitz, T. A., and Goldman, A. (1986). Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Biophys Chem 15,
Favre, I., Moczydlowski, E., and Schild, L. (1996). On the structural basis for ionic selectivity among Na+, K+, and Ca2+ in the voltage-gated sodium channel. Biophys J 71,
Grote, E., Baba, M., Ohsumi, Y., and Novick, P. J. (2000). Geranylgeranylated SNAREs are dominant inhibitors of membrane fusion. J Cell Biol 151,
Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohen, F. E., and Vriend, G. (2003). GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res 31,
Inoue, Y., Ikeda, M., and Shimizu, T. (2004). Proteome-wide classification and identification of mammalian-type GPCRs by binary topology pattern. Comput Biol Chem 28,
Kaplan, N., Sasson, O., Inbar, U., Friedlich, M., Fromer, M., Fleischer, H., Portugaly, E., Linial, N., and Linial, M. (2005). ProtoNet 4.0: a hierarchical classification of one million protein sequences. Nucleic Acids Res 33 Database Issue, D
Kaplan, N., Vaaknin, A., and Linial, M. (2003). PANDORA: keyword-based analysis of protein sets by integration of annotation sources. Nucleic Acids Res 31,
Kemble, G. W., Henis, Y. I., and White, J. M. (1993). GPI- and transmembrane-anchored influenza hemagglutinin differ in structure and receptor binding activity. J Cell Biol 122,
Kernytsky, A., and Rost, B. (2003). Static benchmarking of membrane helix predictions. Nucleic Acids Res 31,
Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305,
Kyte, J., and Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J Mol Biol 157,
Lapinsh, M., Gutcaits, A., Prusis, P., Post, C., Lundstedt, T., and Wikberg, J. E. (2002). Classification of G-protein coupled receptors by alignment-independent extraction of principal chemical properties of primary amino acid sequences. Protein Sci 11,
Levitt, D. (1990). Gramicidin, VDAC, porin and perforin channels. Curr Opin Cell Biol 2,
Lin, S., Naim, H. Y., Rodriguez, A. C., and Roth, M. G. (1998). Mutations in the middle of the transmembrane domain reverse the polarity of transport of the influenza virus hemagglutinin in MDCK epithelial cells. J Cell Biol 142,
Liu, Y., Engelman, D. M., and Gerstein, M. (2002). Genomic analysis of membrane protein families: abundance and conserved motifs. Genome Biol 3, research0054.
Miloshevsky, G. V., and Jordan, P. C. (2004). Permeation in ion channels: the interplay of structure and theory. Trends Neurosci 27,
Muller, T., Rahmann, S., and Rehmsmeier, M. (2001). Non-symmetric score matrices and the detection of homologous transmembrane proteins. Bioinformatics 17 Suppl 1, S
Ng, P. C., Henikoff, J. G., and Henikoff, S. (2000). PHAT: a transmembrane-specific substitution matrix. Predicted hydrophobic and transmembrane. Bioinformatics 16,
Ott, C. M., and Lingappa, V. R. (2002). Integral membrane protein biosynthesis: why topology is hard to predict. J Cell Sci 115,
Pilpel, Y., Ben-Tal, N., and Lancet, D. (1999). kprot: a knowledge-based scale for the propensity of residue orientation in transmembrane segments. Application to membrane protein structure prediction. J Mol Biol 294,
Rost, B., Casadio, R., and Fariselli, P. (1996). Refining neural network predictions for helical transmembrane proteins by dynamic programming. Proc Int Conf Intell Syst Mol Biol 4,
Tusnady, G. E., and Simon, I. (1998). Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol 283,
Wallin, E., and von Heijne, G. (1998). Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 7,

Received on January 15, 2005; accepted on March 27, 2005

BIOINFORMATICS Vol. 21 Suppl. 1 2005, pages i378-i386, doi:10.1093/bioinformatics/bti1035

More information

Prediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines

Prediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines Article Prediction and Classif ication of Human G-protein Coupled Receptors Based on Support Vector Machines Yun-Fei Wang, Huan Chen, and Yan-Hong Zhou* Hubei Bioinformatics and Molecular Imaging Key Laboratory,

More information

A Machine Text-Inspired Machine Learning Approach for Identification of Transmembrane Helix Boundaries

A Machine Text-Inspired Machine Learning Approach for Identification of Transmembrane Helix Boundaries A Machine Text-Inspired Machine Learning Approach for Identification of Transmembrane Helix Boundaries Betty Yee Man Cheng 1, Jaime G. Carbonell 1, and Judith Klein-Seetharaman 1, 2 1 Language Technologies

More information

Secondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure

Secondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure Bioch/BIMS 503 Lecture 2 Structure and Function of Proteins August 28, 2008 Robert Nakamoto rkn3c@virginia.edu 2-0279 Secondary Structure Φ Ψ angles determine protein structure Φ Ψ angles are restricted

More information

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder HMM applications Applications of HMMs Gene finding Pairwise alignment (pair HMMs) Characterizing protein families (profile HMMs) Predicting membrane proteins, and membrane protein topology Gene finding

More information

Supporting online material

Supporting online material Supporting online material Materials and Methods Target proteins All predicted ORFs in the E. coli genome (1) were downloaded from the Colibri data base (2) (http://genolist.pasteur.fr/colibri/). 737 proteins

More information

ProtoNet 4.0: A hierarchical classification of one million protein sequences

ProtoNet 4.0: A hierarchical classification of one million protein sequences ProtoNet 4.0: A hierarchical classification of one million protein sequences Noam Kaplan 1*, Ori Sasson 2, Uri Inbar 2, Moriah Friedlich 2, Menachem Fromer 2, Hillel Fleischer 2, Elon Portugaly 2, Nathan

More information

Analysis of N-terminal Acetylation data with Kernel-Based Clustering

Analysis of N-terminal Acetylation data with Kernel-Based Clustering Analysis of N-terminal Acetylation data with Kernel-Based Clustering Ying Liu Department of Computational Biology, School of Medicine University of Pittsburgh yil43@pitt.edu 1 Introduction N-terminal acetylation

More information

Today. Last time. Secondary structure Transmembrane proteins. Domains Hidden Markov Models. Structure prediction. Secondary structure

Today. Last time. Secondary structure Transmembrane proteins. Domains Hidden Markov Models. Structure prediction. Secondary structure Last time Today Domains Hidden Markov Models Structure prediction NAD-specific glutamate dehydrogenase Hard Easy >P24295 DHE2_CLOSY MSKYVDRVIAEVEKKYADEPEFVQTVEEVL SSLGPVVDAHPEYEEVALLERMVIPERVIE FRVPWEDDNGKVHVNTGYRVQFNGAIGPYK

More information

Intro Secondary structure Transmembrane proteins Function End. Last time. Domains Hidden Markov Models

Intro Secondary structure Transmembrane proteins Function End. Last time. Domains Hidden Markov Models Last time Domains Hidden Markov Models Today Secondary structure Transmembrane proteins Structure prediction NAD-specific glutamate dehydrogenase Hard Easy >P24295 DHE2_CLOSY MSKYVDRVIAEVEKKYADEPEFVQTVEEVL

More information

Public Database 의이용 (1) - SignalP (version 4.1)

Public Database 의이용 (1) - SignalP (version 4.1) Public Database 의이용 (1) - SignalP (version 4.1) 2015. 8. KIST 이철주 Secretion pathway prediction ProteinCenter (Proxeon Bioinformatics, Odense, Denmark; http://www.cbs.dtu.dk/services) SignalP (version 4.1)

More information

PROTEIN SUBCELLULAR LOCALIZATION PREDICTION BASED ON COMPARTMENT-SPECIFIC BIOLOGICAL FEATURES

PROTEIN SUBCELLULAR LOCALIZATION PREDICTION BASED ON COMPARTMENT-SPECIFIC BIOLOGICAL FEATURES 3251 PROTEIN SUBCELLULAR LOCALIZATION PREDICTION BASED ON COMPARTMENT-SPECIFIC BIOLOGICAL FEATURES Chia-Yu Su 1,2, Allan Lo 1,3, Hua-Sheng Chiu 4, Ting-Yi Sung 4, Wen-Lian Hsu 4,* 1 Bioinformatics Program,

More information

Model Accuracy Measures

Model Accuracy Measures Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses

More information

Functionally-Valid Unsupervised Compression of the Protein Space

Functionally-Valid Unsupervised Compression of the Protein Space Functionally-Valid Unsupervised Compression of the Protein Space Noam Kaplan 1, Moriah Friedlich 2, Menachem Fromer 2 and Michal Linial 1* 1 Department of Biological Chemistry, Institute of Life Sciences,

More information

SUB-CELLULAR LOCALIZATION PREDICTION USING MACHINE LEARNING APPROACH

SUB-CELLULAR LOCALIZATION PREDICTION USING MACHINE LEARNING APPROACH SUB-CELLULAR LOCALIZATION PREDICTION USING MACHINE LEARNING APPROACH Ashutosh Kumar Singh 1, S S Sahu 2, Ankita Mishra 3 1,2,3 Birla Institute of Technology, Mesra, Ranchi Email: 1 ashutosh.4kumar.4singh@gmail.com,

More information

SUPPLEMENTARY MATERIALS

SUPPLEMENTARY MATERIALS SUPPLEMENTARY MATERIALS Enhanced Recognition of Transmembrane Protein Domains with Prediction-based Structural Profiles Baoqiang Cao, Aleksey Porollo, Rafal Adamczak, Mark Jarrell and Jaroslaw Meller Contact:

More information

1-D Predictions. Prediction of local features: Secondary structure & surface exposure

1-D Predictions. Prediction of local features: Secondary structure & surface exposure 1-D Predictions Prediction of local features: Secondary structure & surface exposure 1 Learning Objectives After today s session you should be able to: Explain the meaning and usage of the following local

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

CISC 636 Computational Biology & Bioinformatics (Fall 2016) CISC 636 Computational Biology & Bioinformatics (Fall 2016) Predicting Protein-Protein Interactions CISC636, F16, Lec22, Liao 1 Background Proteins do not function as isolated entities. Protein-Protein

More information

CAP 5510 Lecture 3 Protein Structures

CAP 5510 Lecture 3 Protein Structures CAP 5510 Lecture 3 Protein Structures Su-Shing Chen Bioinformatics CISE 8/19/2005 Su-Shing Chen, CISE 1 Protein Conformation 8/19/2005 Su-Shing Chen, CISE 2 Protein Conformational Structures Hydrophobicity

More information

A model for the evaluation of domain based classification of GPCR

A model for the evaluation of domain based classification of GPCR 4(4): 138-142 (2009) 138 A model for the evaluation of domain based classification of GPCR Tannu Kumari *, Bhaskar Pant, Kamalraj Raj Pardasani Department of Mathematics, MANIT, Bhopal - 462051, India;

More information

Topology Prediction of Helical Transmembrane Proteins: How Far Have We Reached?

Topology Prediction of Helical Transmembrane Proteins: How Far Have We Reached? 550 Current Protein and Peptide Science, 2010, 11, 550-561 Topology Prediction of Helical Transmembrane Proteins: How Far Have We Reached? Gábor E. Tusnády and István Simon* Institute of Enzymology, BRC,

More information

Review. Membrane proteins. Membrane transport

Review. Membrane proteins. Membrane transport Quiz 1 For problem set 11 Q1, you need the equation for the average lateral distance transversed (s) of a molecule in the membrane with respect to the diffusion constant (D) and time (t). s = (4 D t) 1/2

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2014 1 HMM Lecture Notes Dannie Durand and Rose Hoberman November 6th Introduction In the last few lectures, we have focused on three problems related

More information

K-means-based Feature Learning for Protein Sequence Classification

K-means-based Feature Learning for Protein Sequence Classification K-means-based Feature Learning for Protein Sequence Classification Paul Melman and Usman W. Roshan Department of Computer Science, NJIT Newark, NJ, 07102, USA pm462@njit.edu, usman.w.roshan@njit.edu Abstract

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

Chemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller

Chemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller Chemogenomic: Approaches to Rational Drug Design Jonas Skjødt Møller Chemogenomic Chemistry Biology Chemical biology Medical chemistry Chemical genetics Chemoinformatics Bioinformatics Chemoproteomics

More information

TMSEG Michael Bernhofer, Jonas Reeb pp1_tmseg

TMSEG Michael Bernhofer, Jonas Reeb pp1_tmseg title: short title: TMSEG Michael Bernhofer, Jonas Reeb pp1_tmseg lecture: Protein Prediction 1 (for Computational Biology) Protein structure TUM summer semester 09.06.2016 1 Last time 2 3 Yet another

More information

Reliability Measures for Membrane Protein Topology Prediction Algorithms

Reliability Measures for Membrane Protein Topology Prediction Algorithms doi:10.1016/s0022-2836(03)00182-7 J. Mol. Biol. (2003) 327, 735 744 Reliability Measures for Membrane Protein Topology Prediction Algorithms Karin Melén 1, Anders Krogh 2 and Gunnar von Heijne 1 * 1 Department

More information

A novel method for predicting transmembrane segments in proteins based on a statistical analysis of the SwissProt database: the PRED-TMR algorithm

A novel method for predicting transmembrane segments in proteins based on a statistical analysis of the SwissProt database: the PRED-TMR algorithm Protein Engineering vol.12 no.5 pp.381 385, 1999 COMMUNICATION A novel method for predicting transmembrane segments in proteins based on a statistical analysis of the SwissProt database: the PRED-TMR algorithm

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

BIOINFORMATICS. Enhanced Recognition of Protein Transmembrane Domains with Prediction-based Structural Profiles

BIOINFORMATICS. Enhanced Recognition of Protein Transmembrane Domains with Prediction-based Structural Profiles BIOINFORMATICS Vol.? no.? 200? Pages 1 1 Enhanced Recognition of Protein Transmembrane Domains with Prediction-based Structural Profiles Baoqiang Cao 2, Aleksey Porollo 1, Rafal Adamczak 1, Mark Jarrell

More information

Molecular Cell Biology 5068 In Class Exam 2 November 8, 2016

Molecular Cell Biology 5068 In Class Exam 2 November 8, 2016 Molecular Cell Biology 5068 In Class Exam 2 November 8, 2016 Exam Number: Please print your name: Instructions: Please write only on these pages, in the spaces allotted and not on the back. Write your

More information

Motif Extraction and Protein Classification

Motif Extraction and Protein Classification Motif Extraction and Protein Classification Vered Kunik 1 Zach Solan 2 Shimon Edelman 3 Eytan Ruppin 1 David Horn 2 1 School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel {kunikver,ruppin}@tau.ac.il

More information

Structure Prediction of Membrane Proteins. Introduction. Secondary Structure Prediction and Transmembrane Segments Topology Prediction

Structure Prediction of Membrane Proteins. Introduction. Secondary Structure Prediction and Transmembrane Segments Topology Prediction Review Structure Prediction of Membrane Proteins Chunlong Zhou 1, Yao Zheng 2, and Yan Zhou 1 * 1 Hangzhou Genomics Institute/James D. Watson Institute of Genome Sciences, Zhejiang University/Key Laboratory

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

A SURPRISING CLARIFICATION OF THE MECHANISM OF ION-CHANNEL VOLTAGE- GATING

A SURPRISING CLARIFICATION OF THE MECHANISM OF ION-CHANNEL VOLTAGE- GATING A SURPRISING CLARIFICATION OF THE MECHANISM OF ION-CHANNEL VOLTAGE- GATING AR. PL. Ashok Palaniappan * An intense controversy has surrounded the mechanism of voltage-gating in ion channels. We interpreted

More information

Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space

Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space Published online February 15, 26 166 18 Nucleic Acids Research, 26, Vol. 34, No. 3 doi:1.193/nar/gkj494 Comprehensive genome analysis of 23 genomes provides structural genomics with new insights into protein

More information

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan

CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Giri Narasimhan CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs15.html Describing & Modeling Patterns

More information

Supporting Text 1. Comparison of GRoSS sequence alignment to HMM-HMM and GPCRDB

Supporting Text 1. Comparison of GRoSS sequence alignment to HMM-HMM and GPCRDB Structure-Based Sequence Alignment of the Transmembrane Domains of All Human GPCRs: Phylogenetic, Structural and Functional Implications, Cvicek et al. Supporting Text 1 Here we compare the GRoSS alignment

More information

A New Similarity Measure among Protein Sequences

A New Similarity Measure among Protein Sequences A New Similarity Measure among Protein Sequences Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu * Institute of Information Science Academia Sinica, Taipei 115, Taiwan Abstract Protein sequence

More information

Potassium channel gating and structure!

Potassium channel gating and structure! Reading: Potassium channel gating and structure Hille (3rd ed.) chapts 10, 13, 17 Doyle et al. The Structure of the Potassium Channel: Molecular Basis of K1 Conduction and Selectivity. Science 280:70-77

More information

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor Biological Networks:,, and via Relative Description Length By: Tamir Tuller & Benny Chor Presented by: Noga Grebla Content of the presentation Presenting the goals of the research Reviewing basic terms

More information

Improved membrane protein topology prediction by domain assignments

Improved membrane protein topology prediction by domain assignments Improved membrane protein topology prediction by domain assignments ANDREAS BERNSEL AND GUNNAR VON HEIJNE Department of Biochemistry and Biophysics, Stockholm University, SE-106 91 Stockholm, Sweden Stockholm

More information

CSCE555 Bioinformatics. Protein Function Annotation

CSCE555 Bioinformatics. Protein Function Annotation CSCE555 Bioinformatics Protein Function Annotation Why we need to do function annotation? Fig from: Network-based prediction of protein function. Molecular Systems Biology 3:88. 2007 What s function? The

More information

Computational methods for predicting protein-protein interactions

Computational methods for predicting protein-protein interactions Computational methods for predicting protein-protein interactions Tomi Peltola T-61.6070 Special course in bioinformatics I 3.4.2008 Outline Biological background Protein-protein interactions Computational

More information

Plan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics.

Plan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics. Plan Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics. Exercise: Example and exercise with herg potassium channel: Use of

More information

Predicting Protein Functions and Domain Interactions from Protein Interactions

Predicting Protein Functions and Domain Interactions from Protein Interactions Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION doi:10.1038/nature17991 Supplementary Discussion Structural comparison with E. coli EmrE The DMT superfamily includes a wide variety of transporters with 4-10 TM segments 1. Since the subfamilies of the

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

Fishing with (Proto)Net a principled approach to protein target selection

Fishing with (Proto)Net a principled approach to protein target selection Comparative and Functional Genomics Comp Funct Genom 2003; 4: 542 548. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.328 Conference Review Fishing with (Proto)Net

More information

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

More information

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach

Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach Identification of Representative Protein Sequence and Secondary Structure Prediction Using SVM Approach Prof. Dr. M. A. Mottalib, Md. Rahat Hossain Department of Computer Science and Information Technology

More information

Protein structure alignments

Protein structure alignments Protein structure alignments Proteins that fold in the same way, i.e. have the same fold are often homologs. Structure evolves slower than sequence Sequence is less conserved than structure If BLAST gives

More information

Genome Annotation Project Presentation

Genome Annotation Project Presentation Halogeometricum borinquense Genome Annotation Project Presentation Loci Hbor_05620 & Hbor_05470 Presented by: Mohammad Reza Najaf Tomaraei Hbor_05620 Basic Information DNA Coordinates: 527,512 528,261

More information

Subfamily HMMS in Functional Genomics. D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander

Subfamily HMMS in Functional Genomics. D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander Subfamily HMMS in Functional Genomics D. Brown, N. Krishnamurthy, J.M. Dale, W. Christopher, and K. Sjölander Pacific Symposium on Biocomputing 10:322-333(2005) SUBFAMILY HMMS IN FUNCTIONAL GENOMICS DUNCAN

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species

PGA: A Program for Genome Annotation by Comparative Analysis of. Maximum Likelihood Phylogenies of Genes and Species PGA: A Program for Genome Annotation by Comparative Analysis of Maximum Likelihood Phylogenies of Genes and Species Paulo Bandiera-Paiva 1 and Marcelo R.S. Briones 2 1 Departmento de Informática em Saúde

More information

1. Statement of the Problem

1. Statement of the Problem Recognizing Patterns in Protein Sequences Using Iteration-Performing Calculations in Genetic Programming John R. Koza Computer Science Department Stanford University Stanford, CA 94305-2140 USA Koza@CS.Stanford.Edu

More information

Building a Homology Model of the Transmembrane Domain of the Human Glycine α-1 Receptor

Building a Homology Model of the Transmembrane Domain of the Human Glycine α-1 Receptor Building a Homology Model of the Transmembrane Domain of the Human Glycine α-1 Receptor Presented by Stephanie Lee Research Mentor: Dr. Rob Coalson Glycine Alpha 1 Receptor (GlyRa1) Member of the superfamily

More information

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel

Christian Sigrist. November 14 Protein Bioinformatics: Sequence-Structure-Function 2018 Basel Christian Sigrist General Definition on Conserved Regions Conserved regions in proteins can be classified into 5 different groups: Domains: specific combination of secondary structures organized into a

More information

Supplementary Information

Supplementary Information Supplementary Information Supplementary Figure 1. Schematic pipeline for single-cell genome assembly, cleaning and annotation. a. The assembly process was optimized to account for multiple cells putatively

More information

The Potassium Ion Channel: Rahmat Muhammad

The Potassium Ion Channel: Rahmat Muhammad The Potassium Ion Channel: 1952-1998 1998 Rahmat Muhammad Ions: Cell volume regulation Electrical impulse formation (e.g. sodium, potassium) Lipid membrane: the dielectric barrier Pro: compartmentalization

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

EBI web resources II: Ensembl and InterPro

EBI web resources II: Ensembl and InterPro EBI web resources II: Ensembl and InterPro Yanbin Yin http://www.ebi.ac.uk/training/online/course/ 1 Homework 3 Go to http://www.ebi.ac.uk/interpro/training.htmland finish the second online training course

More information

-max_target_seqs: maximum number of targets to report

-max_target_seqs: maximum number of targets to report Review of exercise 1 tblastn -num_threads 2 -db contig -query DH10B.fasta -out blastout.xls -evalue 1e-10 -outfmt "6 qseqid sseqid qstart qend sstart send length nident pident evalue" Other options: -max_target_seqs:

More information

Membrane Protein Channels

Membrane Protein Channels Membrane Protein Channels Potassium ions queuing up in the potassium channel Pumps: 1000 s -1 Channels: 1000000 s -1 Pumps & Channels The lipid bilayer of biological membranes is intrinsically impermeable

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Enhanced membrane protein topology prediction using a hierarchical classification method and a new scoring function

Enhanced membrane protein topology prediction using a hierarchical classification method and a new scoring function Enhanced membrane protein topology prediction using a hierarchical classification method and a new scoring function Allan Lo 1, 2, Hua-Sheng Chiu 3, Ting-Yi Sung 3, Ping-Chiang Lyu 2, and Wen-Lian Hsu

More information

Introduction to Evolutionary Concepts

Introduction to Evolutionary Concepts Introduction to Evolutionary Concepts and VMD/MultiSeq - Part I Zaida (Zan) Luthey-Schulten Dept. Chemistry, Beckman Institute, Biophysics, Institute of Genomics Biology, & Physics NIH Workshop 2009 VMD/MultiSeq

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab

More information

Bioinformatics. Dept. of Computational Biology & Bioinformatics
