Chapter 2: Spial


Thesis overview figure: Chapters 1 to 6, covering Introduction, Spial, Quorum sensing, Chemogenomics, Descriptor relationships, and Conclusions and perspectives, arranged by level (atomic, pathway, proteome, cellular).

Contents of chapter 2

Summary of chapter 2
Introduction
Underlying assumptions
On the shoulders of giants
Scorecons
Sequence logos
ConSurf
The evolutionary trace method
Pairwise HMM logos
Two Sample Logo
The Spial algorithm
Examples of use
Example 1: The dimerisation interface of STAT5a
Example 2: Differences in co-ordination of retinal between vertebrate and cephalopod rhodopsin
Conclusions
References

Summary of chapter 2

Spial (Specificity in alignments) is a tool for the comparative analysis of two alignments of related sequences that differ in their annotation, such as two receptor subtypes. It highlights functionally important residues that are specific to one but not both of the two alignments and visualises this information in three complementary ways: by colour-coding alignment positions, by sequence logos, or by colour-coding the residues of a protein structure provided by the user. Identifying such alignment-specific residues can help to find residues that are involved in the alignment-specific interaction with a small molecule, other proteins, or nucleic acids. Alternatively, Spial may be used to identify residues that are the target of post-translational modifications in one of the two alignments but not in the other.

Parts of this chapter will appear in the following peer-reviewed article: Wuster, A., G. F. Schertler and M. Madan Babu (2009). "Spial: Analysis of subtype-specific features in multiple sequence alignments of proteins." Bioinformatics (in review).

I could not have created Spial without the help of Gebhard Schertler, MRC Laboratory of Molecular Biology. Spial can be accessed on:


Introduction

Identifying protein residues that are associated with specific functions is a recurring task in molecular biology. To assist this, I have developed Spial, a web-based tool that allows the comparative analysis of two related protein subtypes. Spial differs from similar tools by allowing the identification of residues that are specific to one of the two subtypes but not the other. For example, when comparing the alignments of two related receptor subtypes that bind two different small molecules, Spial allows the identification of those residues that are specific to the binding of each of those ligands. For this, Spial takes two related sequence alignments as input (figure 2-1) and assigns each residue to one of eight possible types, depending on whether it is specific to the first alignment, the second alignment, the consensus, or any combination of those (figure 2-2).

Figure 2-1. Screenshot of the Spial submission page, as displayed in the Firefox browser. The user provides two alignments (A and B) by either entering them into the text boxes or uploading them from her hard disk. Several parameters, such as the consensus threshold or the specificity threshold, can be adjusted. If a protein structure is available, the user can upload it from either the Protein Data Bank (PDB) or her hard disk, so that Spial's results can be mapped to it. The Spial submission page can be accessed on

Underlying assumptions

Proteins can be grouped into protein families, whose members all have a common ancestor. The three-dimensional structures of proteins within a family tend to be more similar to each other than their amino acid sequences. Even if amino acid sequence identity falls to 25%, the carbon backbone of the protein tends to follow a common fold within 2 Å (Alberts, Johnson et al. 2002). Because proteins within a family tend to have similar amino acid sequences, they can be aligned. A multiple sequence alignment is a way of writing out related amino acid sequences below each other so that the amino acid in one protein is positioned directly above or below the corresponding amino acid in the other protein. By using such a multiple sequence alignment, it is possible to distinguish between conserved and variable positions (columns) in the alignment. Conserved positions are those where the same amino acid can be found in most sequences, whilst variable positions are those where the amino acid differs between most sequences.

It is frequently assumed that conclusions about the relative functional importance of residues can be drawn from whether they are conserved or variable. The reasoning is that the more important a residue is, the more deleterious a change to it is likely to be. Deleterious mutations are more likely to be eliminated by natural selection. Therefore, functionally important residues will tend to be more conserved. A residue may be conserved because it is important for maintaining the protein's structure, because it is part of the active site, or because it is involved in the binding of a small molecule, of nucleic acids, or of another protein.

On the shoulders of giants

Spial is not the first computational tool that uses sequence conservation to draw conclusions about functionally important sites. In the following, I introduce the most important precursors and inspirations for Spial (table 2-1) and discuss how they differ from Spial.

Name | Reference | Description
Scorecons | (Valdar 2002) | Scorecons calculates the degree of amino acid variability in each column of a multiple sequence alignment.
Weblogo | (Crooks, Hon et al. 2004) | The user uploads a multiple sequence alignment and Weblogo returns a sequence logo (Schneider and Stephens 1990).
ConSurf | (Armon, Graur et al. 2001) | The user uploads a multiple sequence alignment and a protein structure. ConSurf computes how conserved each position in the alignment is and colours it accordingly in the protein structure.
Evolutionary Trace Analysis | (Lichtarge, Bourne et al. 1996; Mihalek, Res et al. 2006) | The aim of Evolutionary Trace Analysis is to identify evolutionarily important residues in a multiple sequence alignment. After defining subgroups in a phylogenetic tree computed from the alignment, Evolutionary Trace Analysis defines those residues as trace residues that are conserved within subgroups but not necessarily between subgroups.
Pairwise HMM logos | (Schuster-Böckler and Bateman 2005) | Given two multiple sequence alignments, this tool generates a logo for each and returns them in such a way that the positions in each alignment correspond.
Two Sample Logo | (Vacic, Iakoucheva et al. 2006) | With Two Sample Logo, the user can visualise the differences between two multiple sequence alignments. Two Sample Logo calculates how significantly the amino acids differ between the two alignments in each position.

Table 2-1. Computational tools that have inspired Spial or with functionality that is comparable to Spial.

Scorecons

Scorecons (Valdar 2002) is a tool that takes a multiple sequence alignment as input and returns the conservation of each position in the alignment in text format. The user can select the measure of conservation, the default being a method based on the variability of each alignment position.

Sequence logos

Sequence logos (Schneider and Stephens 1990) display the conservation of each position in an alignment in an intuitive way. The amino acids that occur at each position are drawn so that their relative heights correspond to their prevalence at this position. Weblogo (Crooks, Hon et al. 2004) is a tool for generating such sequence logos, and Spial makes use of it.
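To make the idea of letter heights concrete, the following Python sketch computes the heights for a single alignment column using the information-content convention of Schneider and Stephens: the total stack height is log2(20) minus the column entropy, and each letter takes a share proportional to its frequency. This is only an illustration and not the code used by Weblogo or Spial. The function name `logo_column_heights` is hypothetical, gaps are simply ignored, and the small-sample correction used in published logos is omitted; all of these are assumptions.

```python
import math
from collections import Counter


def logo_column_heights(column, alphabet_size=20):
    """Return {residue: height in bits} for one alignment column.

    Assumptions of this sketch: gap symbols ("-") are ignored and no
    small-sample correction is applied.
    """
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return {}  # a column consisting only of gaps has no letters to draw
    counts = Counter(residues)
    total = len(residues)
    freqs = {aa: n / total for aa, n in counts.items()}
    # Shannon entropy of the column in bits.
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    # Information content: maximum possible entropy minus observed entropy.
    information = math.log2(alphabet_size) - entropy
    # Each letter receives a share of the stack proportional to its frequency.
    return {aa: f * information for aa, f in freqs.items()}


# A fully conserved column yields one tall letter; a mixed column yields
# a shorter stack in which the more frequent residue is taller.
print(logo_column_heights("VVVVV"))
print(logo_column_heights("DDDEE"))
```

A fully conserved column reaches the maximum stack height of log2(20) bits, whereas a column in which all twenty amino acids are equally frequent has a height of zero.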

ConSurf

ConSurf (Armon, Graur et al. 2001; Glaser, Pupko et al. 2003; Landau, Mayrose et al. 2005) is similar to Scorecons in that it takes a multiple sequence alignment as input and supplies information about the conservation of each position in the alignment as output. However, ConSurf also takes a protein structure as input, and it reports the conservation of each alignment position by colour-coding the corresponding amino acid in the protein structure. This makes it possible to identify conserved surface patches whose amino acids are located at different positions of the alignment but come together in the structure due to protein folding.

The evolutionary trace method

The evolutionary trace method (Lichtarge, Bourne et al. 1996) can be applied using the Evolutionary trace report_maker (Mihalek, Res et al. 2006). This tool takes a protein structure as input and then automatically identifies associated sequence data. Based on this, it creates a report on the amount of selection pressure over evolutionary time on each residue in the structure. The Evolutionary trace report_maker does this by first constructing a phylogenetic tree from a multiple sequence alignment of the sequence data. For each node in the tree, a consensus sequence is then computed, and the variability of the consensus sequences is mapped onto the protein structure.

Pairwise HMM logos

Pairwise hidden Markov model (HMM) logos are a tool for visualising and comparing the logos of two protein family subgroups in an intuitive way. Unlike Spial, the tool automatically decides which positions in the two subgroups correspond. One advantage of the web server provided by the authors is that the user does not have to upload her own alignments, because the alignments available in the protein family database Pfam (Finn, Mistry et al. 2006) can be used instead.
Two Sample Logo

Of all the tools reviewed here, Two Sample Logo (Vacic, Iakoucheva et al. 2006) is the one that is most similar to Spial. Like Spial, it takes two alignments of related protein family subgroups as input. As output, it supplies two logos on top of each other. Residues that are specific to the first alignment appear in the top logo, residues that are specific to the second alignment appear in the bottom logo, and consensus residues that are common to both alignments appear in a line between the bottom and the top logo.

A functionality that is offered by Two Sample Logo but not by Spial is the computation of a p-value for each position in the input alignments using a t-test or a binomial test. Both tests evaluate the null hypothesis that both input alignments were generated by the same distribution. Spial does not calculate a p-value because I would argue that p-values are meaningless in this context. The reason is phylogenetic non-independence (Felsenstein 1985): different branches of a phylogeny cannot be considered independent data points, as they are related by descent. In the case of Two Sample Logo, two sequences in the alignments provided by the user may have the same amino acid in the same position simply because they are closely related. Because a t-test requires that all observations are independent, it is not appropriate to apply it in this case.

The Spial algorithm

For the input alignments, both the FASTA and SELEX alignment formats are accepted. The sequences in the two input alignments must originate from two protein subtypes with related sequences. The alignments have to be of the same length, and the positions in both alignments have to correspond. One way to produce the two input alignments is to align all sequences using an alignment program such as Muscle (Edgar 2004), and then to split the resulting alignment into two separate input alignment files. In the following, I refer to these two alignments as alignment A and alignment B, respectively. Additionally, Spial accepts a protein structure in Protein Data Bank (PDB) format as input.

For each position in the two input alignments A and B, Spial decides whether the residue is consensus or not. In order for an amino acid to be consensus, it has to be present above a certain threshold proportion in both alignments.
This threshold can be specified by the user. In figure 2-2A, a consensus threshold value of 0.35 was used. Next, Spial decides whether there are amino acids that are characteristic of one of the two alignments, but not of the consensus. For this, a non-consensus amino acid has to be present above a certain threshold proportion in one of the alignments. Again, this threshold can be specified by the user. In figure 2-2A, a specificity threshold value of 0.35 was used. As long as the sum of the consensus threshold and the specificity threshold is lower than one, a position can be specific to the consensus and specific to one of the alignments at the same time. Therefore, there are eight possible combinations of specificity for alignment A, alignment B, or the consensus (figure 2-2B). I refer to these eight combinations as types. Each position in an alignment can also be one of three possible categories, which indicate whether the position is specific to the consensus (C), specific to one or both of the input alignments but not the consensus (S), or not specific at all (0). The one-letter codes that specify the types and categories of each residue are located in two rows below the Spial output alignment (figure 2-2A).

Spial's output consists of coloured alignments as described above, of sequence logos (Schneider and Stephens 1990; Crooks, Hon et al. 2004), and of coloured protein structures (figure 2-2A). The logos produced by Spial appear similar to those produced by the program Two Sample Logo (Vacic, Iakoucheva et al. 2006), which treats one alignment as the background and then computes whether there are residues that are enriched or depleted in the other alignment. Spial logos differ from this by visualising how frequent a residue is in either alignment, or, if it is a consensus residue, how frequent it is in the consensus.

In the output protein structures, the default colouring scheme differs from that used in the alignments. The colour of each protein residue reflects whether it is specific to alignment A (red), specific to alignment B (green), specific to both (yellow), or specific to neither (black) (figure 2-3A).
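The assignment of types and categories described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not Spial's actual implementation. The function names are hypothetical; gaps are not counted as residues but do count towards the number of sequences; "above a threshold" is read as greater than or equal to; and the type number is computed as 4 x consensus + 2 x A + 1 x B, an encoding chosen so that type 3 means "specific for alignment A and B separately". The label "0" for unspecific positions is likewise an assumption.

```python
from collections import Counter

GAP = "-"


def column_frequencies(alignment, col):
    """Fraction of sequences carrying each residue in one column (gaps skipped)."""
    counts = Counter(seq[col] for seq in alignment if seq[col] != GAP)
    return {aa: n / len(alignment) for aa, n in counts.items()}


def classify_column(freq_a, freq_b, consensus_threshold=0.35,
                    specificity_threshold=0.35):
    """Return (type, category) for one column; encoding is assumed, see lead-in."""
    # Consensus residues reach the consensus threshold in BOTH alignments.
    consensus = {aa for aa, f in freq_a.items()
                 if f >= consensus_threshold
                 and freq_b.get(aa, 0.0) >= consensus_threshold}
    # Specific residues are non-consensus residues frequent in ONE alignment.
    spec_a = {aa for aa, f in freq_a.items()
              if aa not in consensus and f >= specificity_threshold}
    spec_b = {aa for aa, f in freq_b.items()
              if aa not in consensus and f >= specificity_threshold}
    type_code = 4 * bool(consensus) + 2 * bool(spec_a) + 1 * bool(spec_b)
    if consensus:
        category = "C"
    elif spec_a or spec_b:
        category = "S"
    else:
        category = "0"  # placeholder label for "not specific at all" (assumed)
    return type_code, category


def classify_alignments(aln_a, aln_b, **thresholds):
    """Classify every column of two alignments of equal length."""
    if len({len(seq) for seq in aln_a + aln_b}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    return [classify_column(column_frequencies(aln_a, i),
                            column_frequencies(aln_b, i), **thresholds)
            for i in range(len(aln_a[0]))]


# The toy alignments shown in figure 2-2A.
alignment_a = ["AD-RVAT-SH", "ADYKV-S-SH", "ADY-VVS-SS", "AEYGVIS-SS", "VEYHVMT-SS"]
alignment_b = ["AEWTLMTPSM", "AEWTILTPSM", "AEWTMITPPS", "AEWTGFTPPS"]
result = classify_alignments(alignment_a, alignment_b)
print([t for t, _ in result])            # [4, 6, 3, 1, 2, 0, 6, 1, 5, 7]
print("".join(c for _, c in result))     # CCSSS0CSCC
```

With both thresholds at 0.35, the second column illustrates how a position can be specific to the consensus and to one alignment at once: E reaches the threshold in both alignments (consensus), while D is frequent in alignment A only.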
The proteins coloured in this way can be viewed either directly in the browser, if a Chime plug-in is installed, or by loading the structure into the PyMOL structure viewer and then running a script that is provided by Spial. Another option offered by Spial is the colouring of residues according to residue type as defined above.
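As a rough illustration of what a colouring script for a structure viewer could look like, the Python sketch below turns per-residue specificities into PyMOL `color` commands using the red, green, yellow and black scheme described above. It is not the script that Spial provides. The function names and the input dictionary (structure residue number mapped to a pair of flags for alignment A and alignment B) are hypothetical, and the mapping from alignment columns to structure residue numbers is assumed to have been done already. The sketch also uses four discrete colours only, whereas Spial's colouring reflects residue frequencies.

```python
def specificity_colour(specific_a, specific_b):
    """Map the two specificity flags to a colour name of the scheme above."""
    if specific_a and specific_b:
        return "yellow"
    if specific_a:
        return "red"
    if specific_b:
        return "green"
    return "black"


def pymol_colour_script(residue_specificity, chain=None):
    """Build PyMOL commands from {residue number: (specific_a, specific_b)}.

    The input format is an assumption of this sketch.
    """
    by_colour = {}
    for resi, (spec_a, spec_b) in sorted(residue_specificity.items()):
        colour = specificity_colour(spec_a, spec_b)
        by_colour.setdefault(colour, []).append(str(resi))
    lines = []
    for colour in ("black", "red", "green", "yellow"):
        if colour in by_colour:
            # PyMOL accepts residue lists of the form "resi 10+14".
            selection = "resi " + "+".join(by_colour[colour])
            if chain:
                selection = f"chain {chain} and {selection}"
            lines.append(f"color {colour}, {selection}")
    return "\n".join(lines)


example = {10: (True, False), 11: (False, True), 12: (True, True),
           13: (False, False), 14: (True, False)}
print(pymol_colour_script(example))
```

The resulting text can be saved to a file and run from within PyMOL after the structure has been loaded.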


Figure 2-2, panel A. Input: two alignments and an optional PDB structure.

#alignment A
seq1.1 AD-RVAT-SH
seq1.2 ADYKV-S-SH
seq1.3 ADY-VVS-SS
seq1.4 AEYGVIS-SS
seq1.5 VEYHVMT-SS
#alignment B
seq2.1 AEWTLMTPSM
seq2.2 AEWTILTPSM
seq2.3 AEWTMITPPS
seq2.4 AEWTGFTPPS

Output: the Spial alignment (alignments A and B with colour-coded positions and the type and category rows below), the logo (rows for alignment A specific residues, consensus residues, and alignment B specific residues), and optionally the structure with alignment specificities colour-coded.

Figure 2-2, panel B. Types and categories.

type | category | specific for C | specific for A | specific for B | explanation
0 | 0 | 0 | 0 | 0 | no specificity
1 | S | 0 | 0 | 1 | specific for alignment B, but not for alignment A
2 | S | 0 | 1 | 0 | specific for alignment A, but not for alignment B
3 | S | 0 | 1 | 1 | specific for alignment A and B separately
4 | C | 1 | 0 | 0 | specific for the consensus only
5 | C | 1 | 0 | 1 | specific for the consensus and alignment B
6 | C | 1 | 1 | 0 | specific for the consensus and alignment A
7 | C | 1 | 1 | 1 | specific for the consensus and alignments A and B

Figure 2-2. Spial input and output. (A) Spial takes two alignments (A and B) and optionally a PDB protein structure as input. The output consists of an alignment, a logo, and optionally the structure with colour-coded residues. In the coloured alignment, each position or column is coloured according to whether the position is specific to alignment A, alignment B, the consensus, or a combination of the three. Rows of numbers and letters below the alignment give further information on the specificity of each position. The logo consists of three rows that show which residues frequently occur at each position in alignment A (top row), the consensus (centre row), or alignment B (bottom row). The colours of the amino acids in the logo correspond to the properties of the amino acid. The residues of the structure are coloured according to how frequent they are in alignment A or in alignment B. (B) Each position in an alignment output by Spial can be one of eight possible types. The type of each position is determined by whether it is specific to alignment A, alignment B, the consensus, or any combination of those. Each position in an alignment can also be one of three possible categories, which indicate whether the position is specific to the consensus (C), specific to one or both of the input alignments but not the consensus (S), or not specific at all (0).

Examples of use

Spial is a versatile tool with a number of potential applications. Scenarios in which Spial may be useful include:

- Of a number of homologous proteins, some bind a certain ligand or drug whilst others do not. Spial can assist in identifying surface patches that are specific to the proteins that bind the ligand.
- A protein has homologues in two different evolutionary lineages. Spial can assist in identifying residues that are specific to either lineage, and those that are conserved in both.
- Of a number of paralogues in a genome, some have a specific function whilst others do not. Spial can assist in identifying the residues that are specific to the proteins that have the function of interest.
- Spial can assist in identifying residues that undergo post-translational modifications by running an alignment of sequences that are commonly modified against an alignment of related sequences that are not.
- Some single nucleotide polymorphisms (SNPs) may be specific to certain subpopulations within a species. Spial can assist in exploring the specificity of SNPs within subpopulations and mapping them to protein structures.

In the following, I describe two examples of the usage of Spial. A Spial comparison of STAT5a and STAT4 allows the identification of residues that are integral to the dimerisation interface of STAT5a, and a Spial comparison of cephalopod and vertebrate rhodopsins allows the identification of differences in the way the retinal moiety is co-ordinated.

Example 1: The dimerisation interface of STAT5a

I have used Spial to compare two of the seven known families of the signal transducer and activator of transcription (STAT) proteins (Aaronson and Horvath 2002; Rawlings, Rosler et al. 2004). I compared an alignment of STAT5a orthologues with an alignment of STAT4 orthologues from different animal species. Both alignments were obtained from the Ensembl genome database. I then subjected the two alignments to Spial analysis and mapped the specificities I obtained onto a crystallographic structure of the unphosphorylated core STAT5a (Neculai, Neculai et al. 2005). Unphosphorylated STAT5a dimerises in a way that is different to the dimerisation mode of STAT4 via Src-homology 2 (SH2) domains.
By concentrating on the residues involved in intermolecular contacts between the STAT5a dimers, I was able to show that they tend to be highly conserved within STAT5a orthologues, but not conserved in STAT4 orthologues (figure 2-3B). Because residues that are located at the interface of protein-protein interactions tend to be conserved (Mintseris and Weng 2005), and because in this case most of the interface residues are of type 3 (specific for the STAT5a and the STAT4 alignment separately; see figure 2-3B), the interaction between the subunits of the STAT5a dimer may be specific. For the Spial output for this example, see http://tinyurl.com/m6e86d.

Figure 2-3, panel A key: the axes are the frequency in alignment A and the frequency in alignment B; the regions are labelled specific to alignment A, consensus residue, and specific to alignment B.

Figure 2-3. How Spial maps the specificities of residues to protein structures. (A) The colour scheme used by Spial to indicate whether a residue is specific to alignment A, to alignment B, to both, or to neither. Residues that are specific to alignment A only are in red, those that are specific to alignment B only are in green, and those that are specific to both are in yellow. Residues with a low frequency in both alignments are coloured in black. (B) Cartoon representation of the interface of the STAT5a dimer with the interface running from the lower left to the upper right. Residues involved in the protein-protein interaction are represented as sticks. Hydrogen bonds are represented as yellow dashed lines. The colour scheme for the interface residues is as above, where alignment A is STAT5a and alignment B is STAT4. (C) Residues in a 4 Å vicinity of retinal (in blue). The colour scheme is as above, where alignment A is cephalopod rhodopsins and alignment B is vertebrate rhodopsins.

Example 2: Differences in co-ordination of retinal between vertebrate and cephalopod rhodopsin

Opsins are a family of seven-helix membrane receptors that activate G proteins in a light-dependent manner via the photoisomerisation of a chromophore moiety

embedded in the protein. Vertebrate and cephalopod rhodopsins are two subgroups of opsins. Though related, they differ in their molecular properties and function (Terakita 2005). Whilst vertebrate rhodopsin activates the cyclic GMP signalling pathway, invertebrate rhodopsin activates the inositol-1,4,5-trisphosphate signalling pathway via a Gq-type G protein (Murakami and Kouyama 2008). It is not clear what causes the functional difference between these two rhodopsin families. I used alignments of cephalopod and vertebrate rhodopsins as published on the GPCR database (Horn, Bettler et al. 2003) as input for Spial and mapped the results to the structure of squid rhodopsin (Murakami and Kouyama 2008). The result shows that some, but not all, of the residues that co-ordinate retinal are conserved between vertebrate and cephalopod rhodopsin. For example, Lys 35, which covalently binds retinal, is conserved in both vertebrate and cephalopod rhodopsin. Other hydrophobic residues in the retinal binding pocket, including Phe 2 and Phe 88, are specific to cephalopod rhodopsin and are not conserved in vertebrate rhodopsin (figure 2-3C). Phe 25, although it is part of the binding pocket in the squid rhodopsin structure used here, is generally conserved in vertebrates but not in cephalopods. For the Spial output for this example, see

Conclusions

The above examples illustrate how Spial can be used as a tool for the identification and visualisation of information about the specificity of protein residues, and how this information can be used to understand protein function, protein-small molecule interactions, and protein-protein interactions.

References

Aaronson, D. S. and C. M. Horvath (2002). "A road map for those who don't know JAK-STAT." Science 296(5573).
Alberts, B., A. Johnson, et al. (2002). Molecular Biology of the Cell. New York, Garland Science.
Armon, A., D. Graur, et al. (2001). "ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information." J Mol Biol 307(1).
Crooks, G. E., G. Hon, et al. (2004). "WebLogo: a sequence logo generator." Genome Res 14(6).
Edgar, R. C. (2004). "MUSCLE: multiple sequence alignment with high accuracy and high throughput." Nucleic Acids Res 32(5).
Felsenstein, J. (1985). "Phylogenies and the comparative method." Am Nat 125(1): 1-15.
Finn, R. D., J. Mistry, et al. (2006). "Pfam: clans, web tools and services." Nucleic Acids Res 34(Database issue).
Glaser, F., T. Pupko, et al. (2003). "ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information." Bioinformatics 19(1).
Horn, F., E. Bettler, et al. (2003). "GPCRDB information system for G protein-coupled receptors." Nucleic Acids Res 31(1).
Landau, M., I. Mayrose, et al. (2005). "ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures." Nucleic Acids Res 33(Web Server issue).
Lichtarge, O., H. R. Bourne, et al. (1996). "An evolutionary trace method defines binding surfaces common to protein families." J Mol Biol 257(2).
Mihalek, I., I. Res, et al. (2006). "Evolutionary trace report_maker: a new type of service for comparative analysis of proteins." Bioinformatics 22(13).
Mintseris, J. and Z. Weng (2005). "Structure, function, and evolution of transient and obligate protein-protein interactions." Proc Natl Acad Sci U S A 102(31).
Murakami, M. and T. Kouyama (2008). "Crystal structure of squid rhodopsin." Nature 453(7193).
Neculai, D., A. M. Neculai, et al. (2005). "Structure of the unphosphorylated STAT5a dimer." J Biol Chem 280(49).
Rawlings, J. S., K. M. Rosler, et al. (2004). "The JAK/STAT signaling pathway." J Cell Sci 117(Pt 8).
Schneider, T. D. and R. M. Stephens (1990). "Sequence logos: a new way to display consensus sequences." Nucleic Acids Res 18(20).
Schuster-Böckler, B. and A. Bateman (2005). "Visualizing profile-profile alignment: pairwise HMM logos." Bioinformatics 21(12).
Terakita, A. (2005). "The opsins." Genome Biol 6(3): 213.
Vacic, V., L. M. Iakoucheva, et al. (2006). "Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments." Bioinformatics 22(12).
Valdar, W. S. (2002). "Scoring residue conservation." Proteins 48(2).
