Comparison and Analysis of Heat Shock Proteins in Organisms of the Kingdom Viridiplantae

Emily Germain, Rensselaer Polytechnic Institute
Mentor: Dr. Hugh Nicholas, Biomedical Initiative, Pittsburgh Supercomputing Center

Introduction

A central concept of biochemistry is that the amino acid sequence of a protein determines its three-dimensional structure, and that this structure in turn determines the protein's biochemical activity and function. Proteins with related structures or functions are considered to be part of the same protein family. Over time, mutations introduced into the genome have led to proteins with similar activity and structure but differing sequences. As species diverge, the differences become more pronounced, but the proteins retain certain sequence elements that are critical to proper function. These changes can be mapped to produce an evolutionary tree that provides clues about how species are related and how long ago they diverged. We will use this information to learn which regions of the proteins are most highly conserved and to form hypotheses about which sequence elements are essential to the overall structure or function.

Heat shock proteins (HSPs) are found in every living cell and have a broad range of functions. HSPs are part of the cell's response to stresses such as extremes of temperature, deprivation of oxygen or glucose, or exposure to toxins. They normally make up about two percent of a cell's soluble protein, but in a stressed cell they can account for twenty percent. They can be found in both the cytoplasm and the nucleus, depending on the specific type of HSP and the condition of the cell. HSPs help proteins denatured by stress to refold into their proper shape, and most can also chaperone the folding of newly made
proteins. They transport other proteins between compartments within the cell and possibly function in the immune response by presenting abnormal peptides to molecules that move them to the cell surface. The presence of HSPs outside the cell is also a strong signal to the immune system that necrosis is taking place. The full range of functions of HSPs is unknown, but evidence shows that they are also important in the embryonic stages of Drosophila.

The protein that these analyses focus on is heat shock protein 23 (HSP 23), a member of the alpha-crystallin-related small heat shock protein family. HSP 23 is present in small concentrations in a Drosophila embryo before gastrulation, but is 100 times more abundant in the first fifteen minutes after the start of ventral furrow formation (Gong et al., 2004). The reasons for this are still unknown, making this an interesting protein for study and analysis.

Methods

About 190 sequences for heat shock proteins of organisms of the Kingdom Viridiplantae will be extracted from the iProClass database. These sequences will be aligned using two different programs, T-Coffee (Notredame et al., 2000) and MEME (Bailey et al., 1994).

T-Coffee creates a global multiple sequence alignment, which attempts to line up similar regions across the full length of every sequence in the set. The relationships implied by the alignment are summarized in a tree, which will be used to group the most closely related sequences to aid visualization. Patterns will be identified, mapped, and organized to display visually the variations that exist between the different protein sequences. The alignment reveals regions that are highly conserved and therefore likely to be critical to the protein's structure or function. Regions with higher degrees of variation are likely to be less essential to the protein and more tolerant of mutations.
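As a rough illustration of how conservation could be scored column by column in an alignment, the sketch below computes the Shannon entropy of each column, with low entropy indicating strong conservation. This is an assumed scoring scheme chosen for illustration, not a calculation performed by T-Coffee; the function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one column; gaps ('-') are ignored."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def conservation_profile(alignment):
    """Return one entropy value per column of an alignment of equal-length strings."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy([seq[i] for seq in alignment]) for i in range(length)]


# Toy, made-up aligned fragments (not real heat shock protein sequences).
toy_alignment = ["GVLTV", "GVLSV", "GALTV", "GV-KV"]
profile = conservation_profile(toy_alignment)

for position, entropy in enumerate(profile, start=1):
    print(f"column {position}: {entropy:.3f} bits")
```

Columns with an entropy of zero contain a single residue type in every sequence, which is the pattern expected for positions that cannot tolerate mutation.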
MEME will be run using the Zero or One Per Sequence (ZOOPS) model to identify twenty motifs. The MEME program scans the sequences for patterns regardless of their placement along the protein. The patterns identified are position-independent conserved sequence elements that help in judging the accuracy of the T-Coffee results. These patterns will be used to refine manually the results of the global alignment. Both programs were chosen because they are widely used, well established, and known for good performance in their respective tasks.

Using programs from the PHYLIP suite (Felsenstein, 2004), a bootstrap analysis will be performed to separate the proteins with similar biochemical activities into distinct subfamilies and to quantify how closely related the members of a subfamily are. A SeqSpace analysis (Casari et al., 1995) will determine which columns of the alignment have the most and least similar sequence variations, to confirm the groupings and identify which residues contribute to the characteristics of a particular subfamily. A phylogenetic tree will be constructed to visualize these relationships. A cross-entropy analysis will be performed using the GEnt program to identify which residues are unique to a particular subset of the family and contribute to its specific properties.

After the highly conserved sequences have been identified, three-dimensional graphical models will be constructed using RasMol and a general representative of each protein subfamily. Important features and residues that define the subfamily will be highlighted, and the models will be used to form hypotheses about which areas make up the active site or bind to other molecules, which parts are critical to maintaining a functional structure, and how the protein performs its functions.
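Returning to the cross-entropy step of the methods, the sketch below shows one way such a comparison could be computed: the relative entropy (Kullback-Leibler divergence) between the residue distribution of a subfamily and that of the remaining sequences, column by column. The exact formula used by GEnt is not stated here, so the pseudocount smoothing, the function names, and the toy sequences are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(column, pseudocount=1.0):
    """Smoothed residue frequencies for one column; gaps and unknown symbols are ignored."""
    counts = Counter(r for r in column if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def relative_entropy(subfamily_column, background_column):
    """Kullback-Leibler divergence (bits) of the subfamily from the background."""
    p = column_frequencies(subfamily_column)
    q = column_frequencies(background_column)
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def column_scores(subfamily, background):
    """Score every column of two alignments that share the same column layout."""
    length = len(subfamily[0])
    return [
        relative_entropy([s[i] for s in subfamily], [s[i] for s in background])
        for i in range(length)
    ]


# Toy, made-up two-column alignments (not real heat shock protein data).
subfamily = ["GK", "GK", "GK"]
background = ["GE", "GD", "GE"]
scores = column_scores(subfamily, background)

for position, score in enumerate(scores, start=1):
    print(f"column {position}: {score:.4f} bits")
```

A column scoring near zero has the same residue distribution in both groups, while a high score flags a position whose residues distinguish the subfamily from the rest of the family.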
Expected Results and Interpretation

The map of aligned sequences produced by this research will aid in identifying possible roles of conserved and critical residues. It will identify which regions must be maintained for the protein to perform the functions that define the heat shock family of proteins. Distinct groups of more closely related sequences will suggest which proteins arose as duplicates of others in the same organisms. Regions highly conserved in one group and less so in others will provide clues to the function of those regions when compared with the experimentally observed differences in activity among the proteins in those groups. A group of sequences that conserves a specific sequence element absent from the others may yield information on which residues bind a substrate specific to proteins from that subgroup.

The three-dimensional model of a heat shock protein, labeled with the highly conserved regions, can give insight into how the protein works and where it binds substrates. It can be used to model the effects of sequence mutations and how they lead to the development of new or more refined functions. The results will be a set of hypotheses about which residues are important features of the molecule, and these can be used as a starting point for experimental verification.

The set of sequences from the Kingdom Viridiplantae to be analyzed will be joined with similar data collected from heat shock proteins of animals and fungi. The larger body of collective information will be used to gain further insight into the differences in HSP 23 across a wider range of organisms. The information collected will lead to a better understanding of heat shock proteins and the role they play in cell function.
References

Bailey, T. L., Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. (1994): 28-36. AAAI Press, Menlo Park, California.

Casari, G., Sander, C., Valencia, A. A method to predict functional residues in proteins. Nature Structural Biology. 2 (1995): 171-178.

Felsenstein, J. PHYLIP: Phylogeny Inference Package. Department of Genome Sciences, University of Washington. 2004. http://evolution.genetics.washington.edu/phylip/doc/main.html

Gong, L., Puri, M., et al. Drosophila ventral furrow morphogenesis: a proteomic analysis. Development. 131 (2004): 643-656.

Notredame, C., Higgins, D., Heringa, J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302 (2000): 205-217.