Families of membranous proteins can be characterized by the amino acid composition of their transmembrane domains

Tali Sadka 1 and Michal Linial 1,2,*

1 School of Computer Science and Engineering and 2 Dept of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem 91904, Israel. 2 Present address: Dept of Computer Science and Engineering, University of Washington, Seattle, WA, USA. * Corresponding author

ABSTRACT

In eukaryotes, membranous proteins account for 20-30% of the proteome. Most of these proteins contain one or more transmembrane (TM) domains. These are short segments that traverse the lipid bilayer membrane. Various properties of the TM domains, such as their number, their topology and their arrangement within the membrane, are closely related to the protein's cellular functions. Properties of the TM domains also determine the cellular targeting and localization of these proteins. It is not known, however, whether the information encoded by TM domains suffices for the purpose of classifying proteins into their functional families. This is the question we address here. We introduce an algorithm that creates a profile of each functional family of membranous proteins based only on the amino-acid composition of their TM domains. This is complemented by a classifier program for each such family (to determine whether a given protein belongs to it or not). We find that in most instances TM domains contain enough information to allow an accurate discrimination, with ~80% sensitivity and ~90% specificity, among unrelated polytopic functional families with the same number of TM domains.

1. INTRODUCTION

Integral membrane proteins participate in countless cell activities. They play a crucial role in signaling pathways, cell adhesion, intercellular communication and more. It is estimated that as many as 20-30% of all genes in multi-cellular organisms encode membranous proteins (Wallin and von Heijne, 1998).
Due to their critical roles in the cell's biological activities, these proteins are of immense importance for the pharmaceutical industry (Bockaert et al., 2004). Despite their great significance, our understanding of membranous proteins leaves much to be desired. For example, the structure of proteins with transmembrane (TM) domains remains largely an enigma, and at present only several dozen membranous proteins (e.g., K+ channel, bacteriorhodopsin) have been structurally solved.

In the present paper we deal with membranous proteins whose TM domains are short α-helices. These proteins comprise the major part of all integral membrane proteins across phyla (Liu et al., 2002). A relatively small group of proteins that cross the bilayer membrane via a cluster of β-sheets (e.g., porin) will not be considered (Levitt, 1990).

TM domains are known to have a characteristically hydrophobic amino acid (aa) distribution. Propensity scales of aa's (Engelman et al., 1986; Kyte and Doolittle, 1982; Pilpel et al., 1999; Tusnady and Simon, 1998) developed for TM domains yield a more accurate sequence alignment of proteins rich in TM domains as well as a good prediction of TM domain topology (Muller et al., 2001; Ng et al., 2000). Numerous databases based on experimental and computational analyses for protein localization (e.g., PSORT, DAS) focus mostly on bacterial proteomes. However, in recent years several programs such as Protein Prediction, TMHMM, TargetP, SOSUI, TMPpred and TMTOP were developed (collections of such tools are available online). These rely on aa propensities in TM domains and can successfully predict whether a protein is membranous or soluble and determine its number of TM domains with reasonable accuracy (Chen and Rost, 2002; Rost et al., 1996). However, these programs do not fare so well in assigning in-out membrane topology (Ott and Lingappa, 2002).

In some studied instances, a protein's functionality is tightly linked to the features of one or more of its TM domains. For example, in voltage-dependent ion channels, a specific TM domain serves as a voltage sensor and largely determines the gating properties of the channel. Similarly, specificity and selectivity of ion channels are governed by minor sequence differences in the TM domains that line the channel pore. In numerous reports a single aa change was shown to convert a channel from anion to cation permeability or to reverse the ion selectivity (Favre et al., 1996). This suggests that various channels may be distinguished by their most detailed aa information (Miloshevsky and Jordan, 2004). On the other hand, TM domains may, at times, serve only as a means of crossing a membrane from one side of a compartment to the other or to define membrane preference localization (Lin et al., 1998). In such instances, the aa that cross the lipid bilayer are subject to certain biophysical constraints (e.g., overall hydrophobicity, minimal length), but the protein's functionality is entirely independent of the detailed information within the TM domain.

Here we ask which of these competing hypotheses prevails. We also investigate the extent to which the information encoded by the TM segments can be applied to automatically classify membranous proteins into their functional families. Concretely, we ask whether a coherent functional family can be defined based solely on such limited information on the TM segments. This is to be done with no recourse to other sequence-based (signature, motif) or functional (binding site, catalytic site) information. We find that for most protein families, the information encoded in the TM domains suffices for the purpose of classifying functional families. Furthermore, in some instances, a partition of the family into its subfamilies is achieved solely by specific properties of the TM domains.
Specifically, we suggest that the aa composition of the TM domains alone suffices to determine the identity of the family and may be used as an additional source for protein family refinement.

2. METHODS

Most functional families have a typical number of TM domains. For example, all G-protein coupled receptors have exactly 7 TM domains. Though this typical number need not be so strict in other functional families, the vast majority of proteins in a functional family possess the same number of TM domains. For simplicity, we concentrate on functional families that conform to this rule. Of course, the number of TM domains alone cannot define the functional family, as many transporters, channels, pumps and exchangers have exactly 12 TM domains.

Data set

Our data set comes from the ProtoNet ver 3.0 database (Kaplan et al., 2005). ProtoNet includes 114,000 proteins from SwissProt. To obtain a classification of proteins into functional families we use PANDORA ver 2.1, a tool for detecting subsets of proteins that share unique biological properties defined by keywords (Kaplan et al., 2003). In this case the keywords describe function (e.g., channel, receptor, transporter). Out of the protein sets matching a keyword retrieved by PANDORA, we used almost exclusively the definition by InterPro families (ver 5.2), supplemented with keyword annotations from SwissProt (ver 40.21). The broader set of membranous proteins is collected by the GO cellular localization category, which marks 24,000 proteins as membrane. We set the typical number of TM domains in a family to be the number of TM domains annotated for most of the proteins in the group. As a supplementary database we used GPCRDB ver 8.0. This database is a source for G-protein coupled receptors (GPCR) and their partition into subfamilies (Horn et al., 2003). GPCRDB is manually annotated based on literature search and expert view.
For a full list of protein sets and sources, see the supplement.

Classification Algorithm

Below we discuss how we classify a protein to its functional family by (a) defining the representation of proteins, and (b) describing the classification algorithm. Fig. 1 shows a flowchart of the main steps (marked 1-5). Step 1 is composed of parts i-iv.

(i) Location annotation of a single TM domain: A reliable annotation of the TM domain locations (start and end aa positions) is mostly missing. The exceptions are the TM domains defined by SwissProt in the FT feature field. For consistency, we applied TMHMM ver 2.0 (Krogh et al., 2001) for predicting the TM domains in the entire set of proteins under consideration. TMHMM is based on an HMM for the prediction of TM domain locations in un-annotated proteins. Based on performance tests, this method is expected to reach a very high level of accuracy (Krogh et al., 2001).

(ii) Representation of information contained in a given TM domain: We represent each TM domain in a single protein by its aa composition alone. Thus, corresponding to each TM domain is a 20-dimensional vector, the i-th coordinate of which equals:

$$\frac{\#\ \text{appearances of the } i\text{-th aa in the TM domain}}{\#\ \text{of aa in the TM domain}}$$

(iii) Representation of a single TM domain in a functional family: Our underlying statistical assumption is this: the aa composition of a particular TM domain in a specific protein from a given functional family is sampled from a multivariate normal distribution. This distribution is characteristic of that TM domain in that family. Accordingly, we construct a 20-dimensional Gaussian to represent each TM domain in a functional family. Each of the 20 dimensions represents a single aa. A Gaussian is uniquely specified by its mean and its covariance matrix. These are set using maximum-likelihood estimates. The mean is computed from the aa compositions of that specific TM domain in each of the proteins considered members of that family. The covariance matrix is the covariance of all those examples. Parameter estimation was achieved using Matlab version 6.5.

(iv) Representation of dependencies among TM domains in a functional family of proteins: Since all proteins in a functional family share the same typical number, say X, of TM domains, each family is now represented by X Gaussians. Given an input protein Y (Fig. 1, step 2) with X annotated or predicted TM domains, we derive a score for each of the X TM domains in Y, namely the value of the probability density function at the 20-dimensional vector representing the aa composition of that protein in that TM domain, according to the Gaussian for that TM domain.
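Parts (ii)-(iv) can be illustrated with a minimal sketch. It is not the authors' Matlab code and it makes several simplifying assumptions: the Gaussian is restricted to a diagonal covariance (the paper uses a full covariance matrix), variances are floored at an arbitrary small value to avoid division by zero, and log-densities are returned instead of raw density values. All function names and the toy sequences are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def aa_composition(tm_segment):
    """20-dimensional composition vector of one TM segment (part ii)."""
    seg = tm_segment.upper()
    if not seg:
        raise ValueError("empty TM segment")
    # Non-standard letters are counted in the length but have no coordinate.
    return [seg.count(aa) / len(seg) for aa in AMINO_ACIDS]

def fit_diagonal_gaussian(vectors, var_floor=1e-4):
    """Maximum-likelihood mean and per-coordinate variance (assumption:
    diagonal covariance with a variance floor, unlike the full model)."""
    m, dim = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / m for j in range(dim)]
    var = [max(sum((v[j] - mean[j]) ** 2 for v in vectors) / m, var_floor)
           for j in range(dim)]
    return mean, var

def log_density(x, mean, var):
    """Log of the diagonal-Gaussian density evaluated at x."""
    return sum(-0.5 * (math.log(2 * math.pi * s) + (xi - mu) ** 2 / s)
               for xi, mu, s in zip(x, mean, var))

def score_protein(tm_segments, family_profile):
    """One score per TM domain of the protein (part iv)."""
    return [log_density(aa_composition(seg), mean, var)
            for seg, (mean, var) in zip(tm_segments, family_profile)]

# Toy usage: a "family" with a single TM domain and three members.
family_tm1 = ["LLLAVVLLGA", "LLIAVVLLGA", "LLLAVILLGA"]
profile = [fit_diagonal_gaussian([aa_composition(s) for s in family_tm1])]
inside = score_protein(["LLLAVVLLGA"], profile)[0]
outside = score_protein(["KKDDEERRKK"], profile)[0]
```

In this toy case the hydrophobic segment scores far higher than the charged one. With real data, a full covariance matrix would capture correlations between residues, at the cost of needing many more examples per family.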
We now associate an X-dimensional vector with protein Y, the i-th coordinate of which is the score given to protein Y by the i-th Gaussian. This vector encodes the dependencies between TM domains. The number of TM domains in the proteins we consider is one of 1, 4, 5, 6, 7, 8, 10, 12. Other values of this parameter correspond to families that are too small to be statistically significant.

Classification for one family: Our procedure to decide whether or not a given protein belongs to a certain family is based on SVM (Byvatov and Schneider, 2003). The implementation of SVM we use is called mysvm. Recall that a protein with X TM domains is encoded by the X-dimensional vector of its scores on each of the family's Gaussians. It is possible to separate proteins that belong to the family from those that do not, if the relevant vectors in X-dimensional space are geometrically separated. We train our SVM classifier with the vectors representing proteins in the family as positive examples and those from other families as negative examples. This SVM can now be applied to new examples (Fig. 1, step 2). The resulting score represents the example's distance from the separator. A large positive classification score means we are confident that the protein belongs to the family, and a large (in absolute value) negative score indicates high confidence in its being outside the family.

Classifying a protein to one of several families: For each family of proteins with X TM domains, we construct X Gaussians and an SVM classifier. On an input protein Y with X TM domains we proceed as follows: (i) For each family, compute the vector of scores for protein Y according to the family's Gaussians, and find the SVM classification for that family. (ii) Prediction: protein Y belongs to the family with the highest-scoring SVM classification.

Size limitation of available data: Maximum-likelihood estimates used for building multidimensional Gaussians require many examples.
Protein families in high-quality databases (as in the case of SwissProt) are frequently too small to provide the essential amounts of data. We have excluded families with a limited number of proteins (below 50). In a few specific instances we also included sets of somewhat smaller size. From these families we removed proteins that are marked as fragments and those for which no clear annotation is provided. Following these filters, all families with fewer than 15 proteins each were excluded. For the remaining proteins (about 3100, comprising 29 annotated functional families) we constructed the Gaussians as described above. In cases where we were still unable to construct the Gaussians from the protein family's data due to a sparsity of examples (or strong dependencies between examples) we used PCA for dimension reduction (using Matlab version 6.5). This way we created Gaussians whose dimension was less than 20.

Setting the parameters: For SVM classification we used a polynomial kernel of degree 4, with cross-validation of the data divided into 7 groups. For the PCA, we used only the principal components explaining at least 3% of the variation in the data.

Tests and measuring the performance: In order to test our algorithm we divided our protein families into groups according to their number of TM domains. Each group contained all families with an identical number of TM domains. We used a leave-one-out cross-validation test. In each iteration a single protein is left out (out of one family in an X TM domain group) while all families' Gaussians and classifiers are built without that protein. (Note that the classifier for each family is built with family members as positive examples and members of other families in the X TM domains group as negative examples.) Each family produces an SVM classification for the left-out protein, and the protein is predicted to belong to the family that produced the most positive classification (Fig. 1). This procedure repeats itself until all proteins from all families in the group have been left out, and for each one a family prediction is retrieved.

For scoring our success we used the sensitivity and specificity measures. Sensitivity is defined as how many of the positive examples we capture:

$$\text{sensitivity} = \frac{tp}{tp + fn}$$

and specificity is defined as how many of the negative examples we capture:

$$\text{specificity} = \frac{tn}{tn + fp}$$

Note that this definition of specificity deviates from the more widely accepted definition, being how many of those marked as true are truly positive:

$$\frac{tp}{tp + fp}$$

This deviation was required in our tests since the size of families within an X TM domains group varies significantly.
As a consequence, if one family, say family A, is much larger than another family, say B, then always predicting A will still result in a high value under the tp/(tp+fp) definition (since fp << tp), though it is clear that a prediction of always A is anything but specific.

3. RESULTS

We have tested 12 groups of protein families (each group contains families with a defined number of TM domains), with a total of 29 different families that include ~3100 annotated proteins. The proteins were collected from a wide taxonomic range and thus represent the maximal variability of each family. The TM domain groups contain all the families that are present in the SwissProt database following our filtration procedure (see Methods). A few additional groups with some bitopic protein families were composed. For consistency, we used TMHMM for determining the TM domains and their putative length. Fig. 2 shows the size distribution of TM domains in our dataset. As seen, most TM domain lengths are within a narrow range of 21 to 25 aa. In contrast, the size of the groups in terms of the number of proteins included varies significantly. The 7, 12 and single TM domain groups are notably larger than the other groups (for a complete list see supplemental material).

Fig. 1: A 5-step flowchart of the classification algorithm based on information of the TM domains. The representation of proteins for our algorithm (step 1) is composed of the following components: (i) Location annotation of TM domains (ii) Representation of information contained in a given TM domain (iii) Representation of a TM domain in a functional family of proteins (iv) Representation of dependencies between TM domains in a functional family of proteins. For details see Methods.
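The sensitivity and specificity measures defined in Methods, and the reason for preferring tn/(tn+fp) over tp/(tp+fp) when family sizes are unbalanced, can be illustrated with a small sketch. The function name `per_family_scores` and the family sizes (900 and 100) are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def per_family_scores(true_labels, predicted_labels):
    """Per-family sensitivity tp/(tp+fn), specificity tn/(tn+fp),
    and the alternative measure tp/(tp+fp)."""
    nan = float("nan")
    scores = {}
    for fam in sorted(set(true_labels)):
        tp = fn = fp = tn = 0
        for truth, pred in zip(true_labels, predicted_labels):
            if truth == fam and pred == fam:
                tp += 1
            elif truth == fam:
                fn += 1
            elif pred == fam:
                fp += 1
            else:
                tn += 1
        scores[fam] = {
            "sensitivity": tp / (tp + fn) if tp + fn else nan,
            "specificity": tn / (tn + fp) if tn + fp else nan,
            "tp/(tp+fp)": tp / (tp + fp) if tp + fp else nan,
        }
    return scores

# A large family A and a small family B, with a predictor that always says "A".
true_labels = ["A"] * 900 + ["B"] * 100
always_a = ["A"] * 1000
scores = per_family_scores(true_labels, always_a)
```

For family A this degenerate predictor reaches tp/(tp+fp) = 0.9 even though it discriminates nothing, whereas tn/(tn+fp) is 0, which is the behavior the chosen definition is meant to expose.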

Fig. 2: Number of TM domains as a function of their length (in aa) according to the TMHMM prediction program. (Vertical axis: # TMDs in DB.)

Table 1: Average levels of sensitivity and specificity (unweighted) for the tested groups. Each group is composed of protein families with the same number of TMD (TM domains). For a full list of groups and the family annotations, see supplemental material.
Columns: group; family name; protein number; length; sensitivity tp/(tp+fn) (%); specificity tn/(tn+fp) (%). The numeric entries did not survive in this copy.
12 TMD: bcct transp; sodium NT symp; aa permease; sodium H exch; facilit glu transp
TMD: sulfate transp; cl channel
TMD: H transp pump; zinc transp
TMD: kv channel; abc2 transp
TMD: connexin; scamp; synaptophysin

Table 1 summarizes the results for approximately 1500 proteins falling into 14 different families that contain 4, 6, 8, 10 and 12 TM domains. Each group contains proteins with a defined number of TM domains. The results express the success in separating the different families within a group using a leave-one-out prediction (see Methods). The results show that most families are separable solely on information describing the aa composition of their TM domains. Prediction success varies between groups. Most groups show average sensitivity levels above 80% and average specificity levels above 90%. While some groups are composed of families with highly variable sensitivity and specificity rates, others are quite robust, in the sense that all families within the group attained more or less the same success rate. In most cases, failure in family distinction is associated with a relatively small number of proteins in that family.

To test whether the high separability achieved merely reflects TM groups that are composed of functionally remote families (e.g., transporters and channels, Table 1), we extended the analysis to groups that are composed of closely related protein families.
We concentrated on the largest group of membranous proteins in most eukaryotes, the G-protein coupled receptors (GPCR, with 7 TM domains). We included 1456 proteins of families A (1102 proteins), B (182 proteins), C (100 proteins), and Frizzled (72 proteins), as defined by GPCRDB ver 8.0. Families D and E were excluded from the analysis since they were annotated as having too few members (see Methods). As can be seen in Fig. 3, all families in the 7 TM domain group show extremely good results in terms of both sensitivity and specificity, and therefore are easily separable.

Fig. 3: Sensitivity and specificity scores for each family in the GPCR group (gpcr A, gpcr frizzled, gpcr C, gpcr B). The sensitivity and specificity average (unweighted) values are 85.44% and 96.71%, respectively. Light grey - sensitivity score; dark grey - specificity score.

We tested whether the algorithm presented here is capable of separating proteins whose family characterization is not readily evident. We tested the performance of our method in separating related families of photosystem II. The reaction center of photosystem II is composed of closely related proteins having 5 TM domains, as resolved by the 3D structure of the photosystem complex (from Rhodobacter). The complex is involved in the transformation of light energy into chemical energy. In photosystem II of eukaryotic chloroplasts two related proteins, the D1 (psba) and D2 (psbd) proteins, are known. All four proteins D1, D2, L and M probably evolved from a common ancestor. All four related families are considered by Pfam as one family with a unified HMM (Photo_RC, PF00124). Inspecting the HMM by the logo representation (Fig. 4) reveals that sequence conservation is detected throughout the 320 aa that are included in the family HMM. The TM domains (marked 1-5) cover about a third of the family HMM but they capture only part of the information (note the information-rich loop between TM3 and TM4). Based on TigrFam, InterPro (IPR000484) suggested a natural partition of the ~950 UniProt proteins into 4 subfamilies of D1, D2, M and L. Our algorithm successfully separated those 4 families. Results are shown in Table 2.

Fig. 4: An HMM consensus logo for Pfam Photo_RC (945 proteins). (Panel title: PSBD_CHLRE, Photosynthetic reaction centre protein; logo of Pfam PF00124.) The positions of the TM domains, marked 1-5, are included in the 320 aa length HMM of the family. The 3 red dots mark the key functional aa that are essential for the energy transfer reaction. Note that a high level of aa conservation extends throughout the entire sequence.

Table 2: Sensitivity and specificity scores for each family of the photosynthesis group with 5 TM domains (families d1q, d2q, M and L; columns: group, # proteins, sensitivity, specificity; the numeric entries did not survive in this copy). The sensitivity and specificity average (unweighted) values are 70.96% and 93.4%, respectively.

We set up a few tests to better understand the results for the 29 families that were combined into 12 TM domain groups (5 groups for bitopic protein families and groups for 4, 5, 6, 7, 8, 10 and 12 TM domains).
Inspecting all the results indicated that there is no correlation between the success of the algorithm to separate functional groups and the number of proteins in a group or the number of families in the group (Table 1,2, and supplemental material). One might assume that the algorithm performance is better correlated to the number of TM domains that defines the group. An even more rational assumption is that a correlation exists between the algorithm success and the fraction of the protein length that is occupied by TM domains (defined as coverage). The intuition is that a better separation capacity is expected for proteins for which the TM domains occupy a substantial part of their sequence (i.e., proteins with high number of TM domains or very short connecting loops) and the opposite is true for low coverage groups. Fig. 5 does not support such assumptions. There is no apparent correlation between the level of success and the number of TM domains (top) and the coverage (bottom) in all protein family groups. The only tendency that is evident is that for bitopic proteins (a single TM domain) the variability in prediction is very large, suggesting that the composition of families that are included in each of the tested group dominates the prediction (not shown). A critical test was included to check whether the information that the TM domain provides in separating the families within the group can be already retrieved directly from the raw data of the aa sequence. We compared the results from the Gaussian based method (Fig 1) to a simple clustering algorithm. The input to the clustering algorithm is the vectors of aa composition in each TM domain concatenated to each other by the order of their appearance. In this direct method the vectors are not transformed into Gaussian scores and no supervised learning algorithm is used. 
We applied k-means clustering, which creates a partition into k groups that minimizes the sum, over all clusters, of the within-cluster sums of point-to-center distances. The distance was measured using the l1 norm, and several iterations were performed to avoid dependencies on starting conditions. The iteration number was 50. In cases where the data was insufficient (in 5 TM domains for the photosynthesis group, Fig 3) we have used only 10 iterations.

Fig. 5: Correlation between the number of TM domains (top) and the average coverage (bottom) in a family and the sensitivity score achieved.

Fig. 6: A comparison of sensitivity rates achieved by the Gaussian-based aa composition algorithm (dark gray) and the k-means clustering method with distance between clusters measured in the l1 norm (light gray). Groups shown: 12 TMD, 10 TMD, 8 TMD, 7 TMD, 6 TMD, Photosynthesis, 4 TMD, 1 TMD, Synaptic, Syx_MHC, Syx_synaptotagmin, Rtk2_3_5. The tested groups are as shown in Tables 1-2, Fig. 2 and some additional combinations of bitopic families. Synaptic: synaptic proteins including VAMP, synaptotagmin and syntaxin; Photosynthesis: 4 families of 5 TM domains participating in reaction center II; Syx_synaptotagmin: syntaxin and synaptotagmin; Rtk: receptor tyrosine kinases of classes 2, 3 and 5; Syx_MHC: syntaxin and MHC class I histocompatibility antigen.

Fig. 6 shows the results from the clustering method relative to the results obtained by the Gaussian-based method. We extended the analysis to bitopic proteins, which are the most abundant membranous proteins in eukaryotes (Liu et al., 2002). To this end, we combined several (overlapping) families of bitopic proteins into groups of 2 up to 7 families (see supplementary material for details). Our method outperforms this naïve method, in terms of sensitivity rates, in all tested cases (~4000 proteins, 12 groups comprised of 39 family combinations). In some instances, such as for GPCR, the improvement of our algorithm is very significant (from 40% to over 80% for the sensitivity measure), yet on average the improvement is more moderate.
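The direct clustering baseline can be sketched as follows. This is a minimal illustration rather than the procedure actually run for the paper: it assumes that, under the l1 distance, each cluster center is updated to the component-wise median of its members (the choice that minimizes the summed l1 distance), and that the repeated "iterations" are random restarts of which the lowest-cost partition is kept. The function names, the seed and the toy points are hypothetical; real inputs would be the concatenated aa composition vectors of all TM domains of each protein.

```python
import random
from statistics import median

def l1(a, b):
    """l1 (city-block) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def kmeans_l1(points, k, n_restarts=50, max_iter=100, seed=0):
    """k-means-style clustering under the l1 norm with random restarts.
    Returns (labels, total within-cluster l1 cost) of the best restart."""
    rng = random.Random(seed)
    best_cost, best_labels = float("inf"), None
    for _restart in range(n_restarts):
        centers = [list(p) for p in rng.sample(points, k)]
        labels = None
        for _step in range(max_iter):
            # Assign every point to its nearest center.
            new_labels = [min(range(k), key=lambda c: l1(p, centers[c]))
                          for p in points]
            if new_labels == labels:
                break  # converged
            labels = new_labels
            # Move each center to the component-wise median of its members.
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                if members:  # keep the old center if a cluster is empty
                    centers[c] = [median(col) for col in zip(*members)]
        cost = sum(l1(p, centers[lab]) for p, lab in zip(points, labels))
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels, best_cost

# Toy usage: two well-separated groups of 2-dimensional points.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, cost = kmeans_l1(points, k=2, n_restarts=5)
```

Because the clustering is unsupervised, the cluster indices carry no family identity; comparing against annotated families requires matching each cluster to the family that dominates it.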
The results from a similar test that was performed for the specificity rates of both algorithms indicate a consistently superior distinguishing capacity for our Gaussian-based algorithm relative to the direct clustering method.

In summary, our results show that, in addition to the dominating biophysical constraints for an α-helix to cross the bilayer membrane, the aa composition seems to provide a sufficiently strong signal to define functional families in all instances tested, with the exception of some combinations of bitopic proteins (Fig. 6).

4. DISCUSSION

We have introduced here an algorithm with a satisfactory accuracy for separating functional families containing the same number of TM domains, based solely on the aa composition of their TM domains. Using a leave-one-out cross-validation test, we have reached sensitivity rates of 75-85% and specificity rates of 85-95% for most groups tested. Such high success rates strongly support the hypothesis according to which TM domains are far from being simple hydrophobic linkers. Variable separability is recorded for groups of bitopic proteins (Fig. 5). We attribute the latter finding to the numerous families in which the single TM domain serves as a means to anchor the protein to the membrane while the actual function of the protein is associated with either the cytoplasmic or the extracellular domain. This may be the case for most adhesion proteins, integrins and more.

To test whether the annotation provided for a family is already based on the information encoded in the TM domains, we inspected the annotation source as defined by InterPro. For most of our tested families, InterPro annotation is based on a family-based Pfam HMM (Fig. 4). Recording the signatures for all membranous families (29 annotated families) presented in this study showed that for ~50% of the families, the TM domains are included in the family HMM, which is often much larger. In another 15% only a few of the TM domains are included, while in the rest (35%) of the families (e.g., receptor tyrosine kinases, synaptotagmin) the HMM and the TM domains do not overlap.

We found that there is a very low correlation between the sensitivity rates achieved and the number of TM domains, coverage, number of proteins in the group and number of families in the group. The observation that some discrimination between families can already be achieved by a simple clustering method that is based on raw data (Fig. 6) implies that the separability between families is an intrinsic quality of the data and not a result of data manipulation caused by our algorithm. Maximal improvement of our algorithm is seen for the group of 7 TM domains (Fig. 6). Note that this group accounts for about half of all proteins tested in this study. Indeed, the failure of routine alignment methods to functionally classify the GPCR superfamily and other 7 TM domain proteins has been reported. Adding features such as the TM topology, phylogenetic distances and aa biochemical properties improved the classification (Inoue et al., 2004; Lapinsh et al., 2002). The large size of this group and the diversity of the proteins within it make the task of distinguishing functional subfamilies well suited to SVM-based approaches (this study and Bhasin and Raghava, 2004).
There are a few limitations and pitfalls to the approach presented:

The family size: Genome-wide analysis based on Pfam reveals ~230 Pfam-A polytopic families (Liu et al., 2002), most of which are very small. Therefore, methods of parameter estimation that require many examples, once applied to a relatively small set of examples, are prone to severe overfitting.

Source for functional families: Proteins that are marked as unknown or putative were excluded from our analysis. This rule was applied to maintain a strict level of objectivity. However, increasing annotation coverage and reducing false annotations will allow our method to be applied to additional protein families.

Membranous protein coverage: Recall that we disregard membranous proteins other than α-helical TM proteins. Most prediction methods (e.g., TMHMM) do not apply to non-α-helical TM domains (such as β-barrels). Consequently, annotation for such proteins is very unreliable.

Assessment of the statistical confidence: Our leave-one-out cross-validation test showed promising results; however, to ensure statistical confidence, we are currently repeating the tests (on the present dataset and on a larger set that includes UniProt-90 knowledge) with a leave-10%-out scheme. We expect that a possible bias due to sequence similarity (i.e., aa composition) will be minimized.

Accuracy in TM prediction: The accuracy of most TM prediction programs, including TMHMM, which was applied throughout, is often less satisfactory than reported (see discussion in Kernytsky and Rost, 2003).

In summary, we conclude that the information encoded by the aa composition of TM domains in polytopic proteins is valuable and sufficient in most cases to define the family relationship of a protein. In the few instances (Fig. 5, 6) in which our algorithm performed poorly, we suggest that the TM domains serve as localization anchors rather than being fundamental components of the family function.
The effect on protein function of replacing an authentic TM domain by a lipid anchor has been monitored for certain protein families (Grote et al., 2000; Kemble et al., 1993). In more general terms, considering proteins according to their broad definition, such as membranous, nuclear, etc., is useful for extracting additional non-alignment-based information (such as TM aa composition, this study). The results could support classification and characterization efforts, especially for newly available but poorly annotated genomes.

5. ACKNOWLEDGMENTS

I would like to thank Nati Linial for his helpful advice, Kobi Cramer for his help with machine learning approaches, and the ProtoNet research group at the Hebrew University of Jerusalem for DB support and stimulating discussions.

6. REFERENCES

Bhasin, M., and Raghava, G. P. (2004). GPCRpred: an SVM-based method for prediction of families and subfamilies of G-protein coupled receptors. Nucleic Acids Res 32, W
Bockaert, J., Dumuis, A., Fagni, L., and Marin, P. (2004). GPCR-GIP networks: a first step in the discovery of new therapeutic drugs? Curr Opin Drug Discov Devel 7,
Byvatov, E., and Schneider, G. (2003). Support vector machine applications in bioinformatics. Appl Bioinformatics 2,
Chen, C. P., and Rost, B. (2002). State-of-the-art in membrane protein prediction. Appl Bioinformatics 1,
Engelman, D. M., Steitz, T. A., and Goldman, A. (1986). Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Biophys Chem 15,
Favre, I., Moczydlowski, E., and Schild, L. (1996). On the structural basis for ionic selectivity among Na+, K+, and Ca2+ in the voltage-gated sodium channel. Biophys J 71,
Grote, E., Baba, M., Ohsumi, Y., and Novick, P. J. (2000). Geranylgeranylated SNAREs are dominant inhibitors of membrane fusion. J Cell Biol 151,
Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohen, F. E., and Vriend, G. (2003). GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res 31,
Inoue, Y., Ikeda, M., and Shimizu, T. (2004). Proteome-wide classification and identification of mammalian-type GPCRs by binary topology pattern. Comput Biol Chem 28,
Kaplan, N., Sasson, O., Inbar, U., Friedlich, M., Fromer, M., Fleischer, H., Portugaly, E., Linial, N., and Linial, M. (2005). ProtoNet 4.0: a hierarchical classification of one million protein sequences. Nucleic Acids Res 33 Database Issue, D
Kaplan, N., Vaaknin, A., and Linial, M. (2003). PANDORA: keyword-based analysis of protein sets by integration of annotation sources. Nucleic Acids Res 31,
Kemble, G. W., Henis, Y. I., and White, J. M. (1993). GPI- and transmembrane-anchored influenza hemagglutinin differ in structure and receptor binding activity. J Cell Biol 122,
Kernytsky, A., and Rost, B. (2003). Static benchmarking of membrane helix predictions. Nucleic Acids Res 31,
Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305,
Kyte, J., and Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J Mol Biol 157,
Lapinsh, M., Gutcaits, A., Prusis, P., Post, C., Lundstedt, T., and Wikberg, J. E. (2002). Classification of G-protein coupled receptors by alignment-independent extraction of principal chemical properties of primary amino acid sequences. Protein Sci 11,
Levitt, D. (1990). Gramicidin, VDAC, porin and perforin channels. Curr Opin Cell Biol 2,
Lin, S., Naim, H. Y., Rodriguez, A. C., and Roth, M. G. (1998). Mutations in the middle of the transmembrane domain reverse the polarity of transport of the influenza virus hemagglutinin in MDCK epithelial cells. J Cell Biol 142,
Liu, Y., Engelman, D. M., and Gerstein, M. (2002). Genomic analysis of membrane protein families: abundance and conserved motifs. Genome Biol 3, research0054.
Miloshevsky, G. V., and Jordan, P. C. (2004). Permeation in ion channels: the interplay of structure and theory. Trends Neurosci 27,
Muller, T., Rahmann, S., and Rehmsmeier, M. (2001). Non-symmetric score matrices and the detection of homologous transmembrane proteins. Bioinformatics 17 Suppl 1, S
Ng, P. C., Henikoff, J. G., and Henikoff, S. (2000). PHAT: a transmembrane-specific substitution matrix. Predicted hydrophobic and transmembrane. Bioinformatics 16,
Ott, C. M., and Lingappa, V. R. (2002). Integral membrane protein biosynthesis: why topology is hard to predict. J Cell Sci 115,
Pilpel, Y., Ben-Tal, N., and Lancet, D. (1999). kprot: a knowledge-based scale for the propensity of residue orientation in transmembrane segments. Application to membrane protein structure prediction. J Mol Biol 294,
Rost, B., Casadio, R., and Fariselli, P. (1996). Refining neural network predictions for helical transmembrane proteins by dynamic programming. Proc Int Conf Intell Syst Mol Biol 4,
Tusnady, G. E., and Simon, I. (1998). Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol 283,
Wallin, E., and von Heijne, G. (1998). Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci 7,

Received on January 15, 2005; accepted on March 27, 2005

BIOINFORMATICS Vol. 21 Suppl. 1 2005, pages i378-i386, doi:10.1093/bioinformatics/bti1035