FlexSADRA: Flexible Structural Alignment using a Dimensionality Reduction Approach

Shirley Hui and Forbes J. Burkowski
University of Waterloo, 200 University Avenue W., Waterloo, Canada

ABSTRACT

A frequently studied problem in Structural Biology is determining the degree of similarity between two protein structures. The most common solution is to perform a three dimensional structural alignment of the two structures. Rigid structural alignment algorithms have been developed to accomplish this, but they treat the protein molecules as immutable structures. Since protein structures can bend and flex, algorithms dealing with rigid structures do not yield accurate results. In an attempt to improve similarity studies, flexible structural alignment algorithms have been developed. The challenge for these algorithms is that protein structures are represented using thousands of atomic coordinate variables. This results in a great computational burden due to the large number of degrees of freedom required to account for the flexibility. Past research in dimensionality reduction has shown that a linear technique called Principal Component Analysis (PCA) is well suited to reducing high dimensional data. This paper introduces a new flexible structural alignment algorithm called FlexSADRA, which uses PCA to perform flexible structural alignments. Test results show that FlexSADRA determines better alignments than rigid structural alignment algorithms. Unlike existing rigid and flexible algorithms, FlexSADRA addresses the problem in a significantly lower dimensional problem space and assesses not only the structural fit but also the structural feasibility of the final alignment.

Keywords: flexible protein structural alignment, dimensionality reduction, principal component analysis

1. INTRODUCTION

Proteins are molecules made up of a string of amino acids folded into three dimensional structures that range from simple to complex. The structure of a protein enables it to fulfill its biological function. This is evident in the case of the calcium binding molecules Calmodulin and Calbindin. These molecules are dumbbell-shaped, with the end lobes attached by a flexible linker region [1]. Although the amino acid sequence similarity between these proteins is quite low, they are similar in structure. It is the shape and flexibility of these molecules that allow them to wrap around calcium ions to fulfill their biological functions. Therefore, determining the structural similarity between two proteins has become an important research area in Structural Biology. In this paper, if two proteins have very different amino acid sequences, we regard them as completely different proteins and do not attempt to measure their similarity. The more interesting situation is characterized by two amino acid sequences that share some sequence similarity, perhaps due to evolutionary descent from a common ancestor.

A protein structure may be represented by a set of atomic coordinates describing the protein conformation in three dimensional space. The coordinates are the x, y, z positions of each atom in the protein, concatenated into a long n dimensional vector called a protein conformation vector. We call the set of all protein conformation vectors representing a sampling of the protein's flexibility motion a protein conformational vector set. If there are m vectors in the set, the flexibility motion may be represented by an n × m matrix.

A rigid structural alignment may be determined between two proteins by aligning their conformation vector representations. This is usually accomplished by finding a rotation and translation of one structure onto the other that minimizes the distance between the two.
The root mean squared deviation (RMSD) is used to calculate this distance and measure the degree of similarity between the two structures. A small RMSD value means that the two proteins are structurally very similar.

Rigid alignment algorithms assume that proteins are static structures. It is often the case that a better structural alignment may be achieved if protein flexibility is considered, for example in situations where one protein is hinge bent with respect to the other. Existing flexible structural alignment algorithms model protein flexibility using the atomic coordinates of one of the protein structures and treat the other protein as a rigid structure. Finding an alignment that minimizes the RMSD involves searching all possible conformations in the protein conformational vector set, that is, searching over all the different positions of the protein's atomic coordinates. Even small proteins may consist of several thousand atoms, so the computational burden required to model protein flexibility in structural alignment algorithms quickly becomes enormous.

This paper introduces a novel algorithm called FlexSADRA (Flexible Structural Alignment using a Dimensionality Reduction Approach). FlexSADRA uses a dimensionality reduction approach to perform flexible structural alignments using significantly fewer degrees of freedom than existing alignment algorithms. The algorithm assesses not only the structural fit but also the structural feasibility of the final alignment.

Further author information: Send correspondence to Shirley Hui. E-mail: shirleyhui@alumni.uwaterloo.ca

2. BACKGROUND

It is often the case that high dimensional data is represented using many more degrees of freedom than is actually necessary. The goal of dimensionality reduction techniques is to find a mapping for the data from its high dimensional space to a lower dimensional space with minimal information loss.

A popular linear dimensionality reduction technique is Principal Component Analysis (PCA) [2, 3]. This technique is commonly used since it is fast and straightforward to implement. It has been applied to many different areas of research, including face recognition systems and protein flexibility modeling [1, 4]. Dimensionality reduction using PCA is achieved by transforming the original variables describing the data to a set of new variables called principal components.
The first principal component accounts for the most variation in the data, while successive components account for progressively less. Dimensionality reduction is achieved by keeping only the d components that explain the largest amount of variance in the original data.

In algorithms for face recognition applications, there is typically a training set of faces that is used to find a set of principal components that describe a lower dimensional feature space called a face space [1, 5]. Each principal component represents a characteristic feature from the faces in the training set. Therefore, a face can be composed of a combination of the principal components in the right proportions. If a new face enters the system, it is projected onto the face space to determine a lower dimensional representation of the new face using the calculated principal components. The new face can then be determined to be similar to an existing face or not a face at all.

Principal Component Analysis has also been applied to model protein flexibility motion [4, 6]. The principal components describe the protein's flexibility and are combined in different proportions to reconstruct the original conformations with varying degrees of accuracy. Only the first few principal components, those that retain a chosen amount of the original flexibility, are used to model the flexibility. The result is an approximate description of the protein flexibility motion using only a few degrees of freedom.

Our dimensionality reduction approach to the flexible structural similarity problem is similar to the approach used in face recognition systems and protein flexibility modeling. Instead of faces, we use the protein conformations representing the protein's flexibility to determine a lower dimensional space. This space describes the original protein's flexibility using only a few principal components rather than thousands of atomic coordinates. A rigid protein structure is projected onto this lower dimensional space. The projection can be mapped back to high dimensional space to determine a novel flexed protein structure that is as close to the rigid protein structure as possible.

3. DATA

The protein structures used in the FlexSADRA algorithm experiments are obtained from the Protein Data Bank (PDB) [7]. Only the backbone atomic coordinates are used; side chains are omitted. Protein conformational data can be generated through experimental methods such as X-Ray Crystallography or NMR Spectroscopy; however, these methods are time-consuming and expensive, and they do not generate many structures. As a result, a molecular dynamics (MD) simulation software application called NAMD (Not Another Molecular Dynamics) [8, 9] is used in FlexSADRA to generate the conformational data sets for the flexible protein. MD simulations generate protein conformations based on first-principles calculations, and although they are less accurate, they are faster and more affordable than experimental methods.
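To make the data layout concrete, the following is a minimal sketch of how a set of backbone conformations might be arranged into the n × m protein conformational vector set defined in the Introduction. The function name `build_conformation_matrix` and the in-memory layout `(m, atoms, 3)` are assumptions made for illustration; the paper does not specify how the NAMD output is stored or parsed, and the random numbers stand in for real coordinates.

```python
import numpy as np


def build_conformation_matrix(frames):
    """Stack m conformations into an n x m matrix, one column per conformation.

    frames: array-like of shape (m, atoms, 3) holding the x, y, z backbone
    coordinates of each conformation (a hypothetical in-memory layout).
    """
    frames = np.asarray(frames, dtype=float)
    if frames.ndim != 3 or frames.shape[2] != 3:
        raise ValueError("expected an array of shape (m, atoms, 3)")
    m = frames.shape[0]
    # Flatten each conformation to (x1, y1, z1, x2, y2, z2, ...), then
    # transpose so rows are coordinates and columns are conformations.
    return frames.reshape(m, -1).T


# Toy stand-in for an MD trajectory: 5 conformations of a 4-atom backbone.
rng = np.random.default_rng(0)
toy_frames = rng.normal(size=(5, 4, 3))
X = build_conformation_matrix(toy_frames)
print(X.shape)  # (12, 5): n = 3 * 4 coordinates, m = 5 conformations
```

Storing one conformation per column matches the n × m convention used for X in the rest of the paper, so each column is a protein conformation vector.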

4. METHOD

Principal components can be computed in various ways, but the matrix decomposition known as Singular Value Decomposition (SVD) is commonly used. If the protein conformational vector set X is an n × m matrix, where n is the number of atomic coordinates and m is the number of conformations, then X has a singular value decomposition as follows:

    $$X = U E V^{T} \qquad (1)$$

The matrix $U$ is an n × n column-orthogonal matrix, $V$ is an m × m column-orthogonal matrix, and $E$ is an n × m matrix whose off-diagonal entries are all 0. The diagonal entries of $E$ are the singular values $\sigma_1, \sigma_2, \ldots, \sigma_n$ of $X$, where $\sigma_1 > \sigma_2 > \cdots > \sigma_n$. Since there will be many small singular values, the original matrix may be approximated with good accuracy using only d columns of $U$ and $V$ and only d singular values as follows:

    $$X_d = U_d E_d V_d^{T} \qquad (2)$$

A lower dimensional representation $y$ of a high dimensional protein structure $x$ is determined by using the following projection operation:

    $$y = U_d^{T} x \qquad (3)$$

The original data may be approximately reconstructed using the following operation:

    $$\hat{x} = (U_d^{T})^{+} y \qquad (4)$$

where $(U_d^{T})^{+}$ is known as the pseudoinverse of $U_d^{T}$ [3]. The pseudoinverse of a matrix $U$ is calculated by $U^{+} = (U^{T}U)^{-1}U^{T}$.

The matrix $U_d$ is a transformation matrix that maps points that are close to each other in high dimensions to points that are close to each other in lower dimensions. If $U_d$ is determined from the flexible protein structure data, it defines the protein's flexibility in a lower dimensional space.

For the flexible protein structure similarity problem, we wish to determine the degree to which two protein structures are similar. Instead of working with high dimensional atomic coordinates, a flexed protein structure that is very close to the rigid protein can be determined more simply by projecting the rigid protein into the lower dimensional space of the flexible protein. When this lower dimensional structure is transformed back to high dimensions, it yields a flexed protein conformation that is as close to the rigid protein as possible.

5. FLEXSADRA ALGORITHM

The inputs to the algorithm are the rigid and flexible protein structures. The flexible protein is represented by a protein conformational vector set, while the rigid structure is represented by a protein conformation vector of atomic coordinates. Assessment of the flexible alignments is based on two criteria: how closely the two structures are aligned and how feasible the flexed structure is. The closeness of fit is determined by calculating the RMSD value. Since the alignment relies on the flexible protein structure flexing in a specific way, the feasibility of fit is determined by calculating the conformational energy of the flexed structure.

5.1. Algorithm

Given:
    $x_r$: vector of n atomic coordinates representing the rigid protein structure
    $X$: n × m protein conformational vector set consisting of the atomic coordinates describing the flexible protein structure
    $d$: target dimensionality
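The operations in equations 1 to 4, applied to the inputs $x_r$, $X$ and $d$ listed above, can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `flex_project` and `rmsd` are hypothetical, the data is synthetic, the thin SVD is used in place of the full one, and no mean-centering is applied because equations 1 to 4 do not mention it (a practical PCA would usually subtract the mean conformation first). The RMSD here is computed between already superposed coordinate vectors.

```python
import numpy as np


def flex_project(X, x_r, d):
    """Project x_r onto the first d left singular vectors of X and map it back.

    X: (n, m) conformation matrix, x_r: (n,) rigid structure, d: target dimensionality.
    Returns (U_d, y, x_f).
    """
    # Equation 1, thin form: only the first min(n, m) columns of U are computed.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_d = U[:, :d]                      # first d principal directions
    y = U_d.T @ x_r                     # equation 3: low dimensional coordinates
    x_f = np.linalg.pinv(U_d.T) @ y     # equation 4: back to n dimensions
    return U_d, y, x_f


def rmsd(a, b):
    """Root mean squared deviation between two coordinate vectors (x, y, z per atom)."""
    diff = (np.asarray(a, dtype=float) - np.asarray(b, dtype=float)).reshape(-1, 3)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


# Synthetic example: conformations that vary along only two directions.
rng = np.random.default_rng(1)
n, m, d = 12, 8, 2
basis = rng.normal(size=(n, d))
X = basis @ rng.normal(size=(d, m))       # rank-2 conformational vector set
x_r = basis @ np.array([0.3, -1.2])       # rigid structure inside that subspace

U_d, y, x_f = flex_project(X, x_r, d)
print(y.shape)          # (2,)
print(rmsd(x_f, x_r))   # numerically zero, since x_r lies in the span of U_d
```

Because the columns of $U_d$ are orthonormal, the pseudoinverse of $U_d^{T}$ is $U_d$ itself, so equation 4 amounts to $\hat{x} = U_d y$. The sketch calls `np.linalg.pinv` only to mirror the notation of equation 4.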

Step One: Reduce the dimensionality of $X$ to determine a lower dimensional space represented by $U_d$. Determine $U_d$ by applying PCA to $X$ using SVD to obtain d principal components, which are the columns of the matrix $U_d$.

Step Two: Find $y$, the representation of $x_r$ in the lower dimensional space, according to equation 3.

Step Three: Transform $y$ back to high dimensional space according to equation 4 to obtain a novel flexed conformation $x_f$ that is as close as possible to $x_r$.

Step Four: Assess the quality of the alignment by calculating the RMSD between $x_f$ and $x_r$ and calculating the conformational energy score for $x_f$.

6. CONSIDERATIONS

Since the focus of this algorithm is protein flexibility, any overall structural rotational or translational movements are removed from the data set following the procedure used in the past [4, 10]. As a result, the data depicts only the flexing degrees of freedom of the protein. This is accomplished by finding the mean structure in the data set and then rigidly aligning each structure to the mean structure [11].

An initial alignment must also be performed in order to determine equivalence between residues in the two structures to be aligned. Numerous algorithms exist to align protein sequences, and obtaining an initial alignment is a common starting point in many alignment algorithms [12]. The FlexSADRA algorithm only flexes the sections of the protein molecule corresponding to aligned residues between the flexible and rigid proteins.

In some cases, the flexed structure is a projection that is far away from the other points in the lower dimensional flexibility space. These conformations usually have high energy values corresponding to very unlikely conformations. In situations like these, a new flexed conformation must be found with a better energy value.
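The removal of overall rotation and translation described earlier in this section (rigidly aligning each conformation to the mean structure) can be sketched as follows. The superposition uses the SVD-based form of the Kabsch method [11]; the function names `kabsch_align` and `remove_rigid_motion` are hypothetical, and a single alignment pass against the mean is an assumption, since the paper does not say whether the procedure is iterated.

```python
import numpy as np


def kabsch_align(P, Q):
    """Rigidly superpose P onto Q (both of shape (N, 3)) and return the moved copy of P."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    q_mean = Q.mean(axis=0)
    Pc = P - P.mean(axis=0)             # remove translation
    Qc = Q - q_mean
    H = Pc.T @ Qc                       # 3 x 3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    return Pc @ R.T + q_mean


def remove_rigid_motion(frames):
    """Align every conformation in frames (m, N, 3) to the mean structure."""
    frames = np.asarray(frames, dtype=float)
    mean_structure = frames.mean(axis=0)
    return np.array([kabsch_align(f, mean_structure) for f in frames])


def rot_z(angle):
    """Rotation matrix about the z axis, used only to build the toy data."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Toy data: the same 6-atom shape, rotated and translated three different ways.
rng = np.random.default_rng(2)
base = rng.normal(size=(6, 3))
frames = np.array([base @ rot_z(a).T + t
                   for a, t in [(0.0, [0, 0, 0]), (0.4, [1, 2, 3]), (-0.3, [-2, 0, 1])]])
aligned = remove_rigid_motion(frames)
# After alignment the three copies coincide, because they differ only by rigid motion.
print(np.abs(aligned[0] - aligned[1]).max())
```

With real conformations, the differences that remain after this step are the internal flexing motions that PCA is meant to capture.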
An energy minimization performed by a molecular dynamics simulation may be used to find a more energetically favourable conformation that is as close to the original structure as possible.

Finally, a requirement of our algorithm is that the structures being flexed must be of the same length. This is because the projection step of the algorithm can only be performed on objects with the same number of degrees of freedom. Therefore, only the aligned parts of the structures, as determined by the initial alignment, can be flexed. If the structures have different lengths, the final flexed structure may have gap sections where the atomic coordinates are missing. This is not desirable, since the location of atoms in the gap areas must be known in order for the potential energy to be calculated. In addition, a complete flexed structure is more practical for further analysis than fragments of aligned sections. This problem does not limit the approach to structures of the same length, however, since the missing coordinates may be estimated using a variety of methods [13, 14].

7. RESULTS

The FlexSADRA algorithm was applied to five different pairs of proteins, and the results are fully documented in previous work [14]. Due to page restrictions, only the results for two pairs of molecules are discussed in this paper. The FlexSADRA alignments were compared to alignments obtained by a rigid structural alignment [11] and to the results from another flexible structural alignment algorithm called FlexProt [15].

Homeodomain Protein / Paired Domain Protein (6PAX-1PDN)

The 6PAX protein is a member of the homeodomain family of proteins. These proteins are transcription regulators that bind to specific DNA sequences of other genes to regulate their expression and to induce cell development and differentiation. The paired box 1PDN molecule is homologous to the 6PAX homeodomain proteins. Both proteins consist of two ends made up of alpha helices connected by a flexible linker region, allowing them to exhibit a hinge-like motion. A flexible 6PAX molecule was aligned with a rigid 1PDN molecule.

Apo Calbindin / EF Hand Domain Protein (1CDN-1H8B)

Calbindin is a calcium binding protein that is involved in the uptake, transportation and absorption of calcium in the body. This protein belongs to a super family known as the EF hand super family. Calbindin has two EF hand domains and a short linker region. The flexibility of this molecule is exhibited by small shear movements in the helices and loops. A flexible Calbindin molecule was structurally aligned with a rigid molecule from the EF hand super family called alpha-actinin.

The results of comparing FlexSADRA to a rigid alignment algorithm and the FlexProt algorithm are summarized in the table below:

Test           DOF (% Retained)          RMSD (Å)                       Energy
(Flex-Rigid)   FlexProt   FlexSADRA      Rigid   FlexProt   FlexSADRA   Rigid   FlexSADRA (Min)
6PAX-1PDN      399        232 (85)       9.70    1.21       1.66        43.24   41.61
1CDN-1H8B      225        71 (65)        4.11    2.96       1.78        42.26   -

Table 1. Summary of results for FlexSADRA vs Rigid and FlexProt protein structural alignments. DOF (% Retained): number of degrees of freedom and the amount of original data retained by using those degrees of freedom. RMSD: root mean squared deviation in Angstroms. Energy: the conformational energy for the rigid structure and the minimized FlexSADRA flexed structure.

Figure 1. FlexSADRA flexible structural alignment of 1PDN and 6PAX. Left: Rigid Alignment. Middle: FlexSADRA Alignment. Right: Minimized Alignment. 1PDN (red), 6PAX (blue).

Figure 2. FlexSADRA flexible structural alignment of 1CDN and 1H8B. Left: Rigid Alignment. Right: FlexSADRA Alignment. 1H8B (red), 1CDN (blue).

8. DISCUSSION

The results of the tests indicate that the FlexSADRA algorithm is able to perform flexible structural alignment in a reduced dimensionality problem space. In general, the amount of reduction in the dimensionality of the data was 50 to 70%. In many cases, the number of dimensions decreased from a few thousand to only a few hundred.

The flexible structural alignments result in a significant improvement in RMSD values over the rigid alignments. The improvement is more than 7 Å for 6PAX and 1PDN (9.70 Å to 1.66 Å) and more than 2 Å for 1CDN and 1H8B (4.11 Å to 1.78 Å). Without the added flexibility, the structures would seem to be very dissimilar. After allowing flexibility, however, the RMSD decreases to less than 2 Å in both cases, indicating that the molecules are indeed structurally very similar.

The energy value for the flexed 1PDN protein prior to energy minimization was high with respect to its original energy value. This would indicate that the flexed 1PDN structure does not represent a likely flexed conformation.
In order to determine if this was the case, a short energy minimization simulation was run on the flexed structure. After minimization, the minimized structure was compared with the pre-minimized flexed structure, and the RMSD between them is about 0.7 Å. This indicates that the pre-minimized and minimized structures are almost identical. The regions responsible for the high energy may be a result of the information that is lost when using dimensionality reduction techniques. Another factor could be that missing coordinate values in the alignment must be estimated, and poorly estimated coordinate values will increase the energy value. It can also be the case that the flexed structure has a low energy value, as in the case of 1CDN and 1H8B. Since that structure already had an acceptable energy value, it was not minimized further.

Overall, the RMSD values obtained using FlexSADRA and FlexProt are similar; however, the algorithms are very different. The biggest difference is that FlexProt performs the structural alignment using the original high dimensional atomic coordinates. The algorithm is not trivial, and the alignment consists of a series of steps each involving a large number of computations. In addition, the output of FlexProt is a set of disconnected rigid fragments of the flexed protein. It is not clear if this structure corresponds to a structurally feasible conformation. An energy value analysis should be performed on the flexed structure; however, it is not possible to do so with most energy calculators, since they require connected structures as input. Also, the atomic coordinates for the hinge areas are not provided. In many cases, hinge-like or connecting areas are flexible loops and are still important parts of the molecule. Another problem is that molecules do not move as disconnected rigid fragments. As a result, FlexProt only provides a disjoint, partial view of the flexed protein.

The FlexSADRA algorithm provides a complete picture of the flexed protein. Flexed areas are not modeled as isolated regions joining rigid fragments; instead, the flexing applies to the molecule as a whole. Moreover, FlexSADRA is a data driven algorithm and flexes proteins according to the given data. This is in contrast with the FlexProt algorithm, which does not use any experimental or simulated flexibility data to determine its flexible alignments.

9. CONCLUSION

While a few researchers have developed algorithms to address the flexible structural alignment problem, none have tried to address it using a dimensionality reduction approach. In this paper, an algorithm called FlexSADRA has been introduced for the flexible structural alignment of three dimensional protein structures. This algorithm uses a dimensionality reduction approach to model protein flexibility motion and to perform the structural alignment. It has been tested on protein molecules that have previously been structurally aligned or studied in the literature. The results of the tests show that flexible structural alignment can be performed in a reduced dimensionality problem space. The results are better than those obtained from rigid structural alignments and comparable to the results from another flexible structural alignment algorithm called FlexProt. However, in contrast to these algorithms, the FlexSADRA algorithm is simpler, involves significantly fewer degrees of freedom, and only deals with feasible structures.

REFERENCES

1. M. Turk and A. Pentland, Eigenfaces for recognition, J. Cog. Neu., vol. 3, pp. 71-86, 1991.
2. H. Hotelling, Analysis of a complex of statistical variables into principal components, J. Educ. Psych., vol. 24, pp. 417-441, 1933.
3. I. Jolliffe, Principal Component Analysis. New York: Springer, 2nd ed., 2002.
4. M. Teodoro, G. P. Jr., and L. Kavraki, A dimensionality reduction approach to modeling protein flexibility, in Proc. ACM Int. Conf. on Computational Biology (RECOMB), pp. 299-308, 2002.
5. A. Goldstein, L. Harmon, and A. Lesk, Identification of human faces, Proc. IEEE, vol. 9, pp. 748-760, 1971.
6. S. Hui and M. Shakeel, An investigative approach into dimensionality reduction techniques for protein flexibility modeling. ISMB 2004 Poster Presentation, http://www.iscb.org/ismb2004/, 2004.
7. H. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. Shindyalov, and P. Bourne, The protein data bank, Nucleic Acids Research, vol. 28, pp. 235-242, 2000.
8. L. Kalé, R. Skeel, M. Bhandarkar, R. Brunner, A. Gursoy, J. P. N. Krawetz, A. Shinozaki, K. Varadarajan, and K. Schulten, NAMD - not another molecular dynamics. http://www.ks.uiuc.edu/research/namd/, 1999.
9. J. M. Haile, Molecular Dynamics Simulation: Elementary Methods. John Wiley & Sons, Inc., 1992.
10. A. Amadei, A. B. M. Linssen, and H. J. C. Berendsen, Essential dynamics of proteins, Proteins: Structure, Function, and Genetics, vol. 17, pp. 412-425, 1993.
11. W. Kabsch, A solution for the best rotation to relate two sets of vectors, Acta Cryst., vol. A32, p. 922, 1976.
12. W. Wriggers and K. Schulten, Protein domain movements: Detection of rigid domains and visualization of effective rotations in comparisons of atomic coordinates, Proteins: Structure, Function, and Genetics, vol. 29, pp. 1-14, 1997.
13. R. Little and D. Rubin, Statistical Analysis with Missing Data. John Wiley & Sons, Inc., 1987.
14. S. Hui, FlexSADRA: Flexible structural alignment using a dimensionality reduction approach, Master's thesis, University of Waterloo, School of Computer Science, Faculty of Mathematics, Sept 2005.
15. M. Shatsky, H. Wolfson, and R. Nussinov, Flexible protein alignment and hinge detection, Proteins: Structure, Function, and Genetics, vol. 48, pp. 242-256, 2002.