A Theoretical Inference of Protein Schemes from Amino Acid Sequences

Angel Villahoz-Baleta
angel_villahozbaleta@student.uml.edu

ABSTRACT
Proteins are based on three-dimensional dispositions generated from amino acid sequences. The disposition of a new protein must be fully or almost fully stable for it to have biological meaning. Proteins can be viewed as the building blocks of a living being. Computing such molecular stability is a well-known problem that artificial intelligence (AI) has yet to solve completely for the biology community. The theoretical inference proposed in this paper tries to find potentially stable protein schemes using two algorithmic tools from AI: uninformed and informed search.

Author Keywords
artificial intelligence, amino acid, hydrophobic, hydrophilic, informed search, molecular stability, neutral, protein, protein structure, uninformed search, scheme, sequence, water affinity.

INTRODUCTION
The chemical challenge of molecular stability in proteins is frequently covered in scientific publications. Many of these publications show the importance of AI as a valuable aid to research. The theoretical inference in this paper also uses AI, but it makes AI and chemical properties depend on each other as an alternative line of research.

Several chemical properties influence the molecular stability of proteins, but the most important would be water affinity, since every known biological process takes place in an aqueous environment. About 60% of the adult human body is composed of water, and our biological processes occur there.

Each amino acid comes from a set of 22 different amino acids. They are classified into three classes according to their water affinity: hydrophobic, hydrophilic, and neutral. The water-affinity subdivision of the 22 amino acids, together with their short abbreviations, is shown in Table 1.
Hydrophobic amino acids have an aversion to water molecules, so they tend to lie inside the protein. Hydrophilic amino acids are attracted to water molecules, so they tend to lie in the outer layer of the protein. Neutral amino acids have neither an adverse nor a favorable reaction to water molecules, so they can be placed anywhere in the protein.

Amino Acid Name (Short Abbreviation)
Alanine (A)
Arginine (R)
Asparagine (N)
Aspartic Acid (D)
Cysteine (C)
Glutamic Acid (E)
Glutamine (Q)
Glycine (G)
Histidine (H)
Hydroxyproline (O)
Isoleucine (I)
Leucine (L)
Lysine (K)
Methionine (M)
Phenylalanine (F)
Proline (P)
Pyroglutamatic (U)
Serine (S)
Threonine (T)
Tryptophan (W)
Tyrosine (Y)
Valine (V)
Water Affinity Neutral Neutral

Table 1. The water-affinity subdivision of the 22 amino acids.

A 1D sequence can leave its one-dimensionality by folding onto itself. This folding mechanism generates 2D schemes and 3D structures. 2D schemes are the intermediate step between 1D sequences and 3D structures. The 3D structures are the most accurate representation of proteins, but they are probably also among the most expensive mathematical objects to compute. The 2D schemes are less accurate but demand less computational effort. So there is a trade-off between accuracy and computational power. Yet this trade-off can be improved with two AI algorithms: uninformed search and informed search. Uninformed search exploits the brute force offered by current software to form every possible 2D scheme. Informed search refines the results of the uninformed search using the chemical property of water affinity. Finally, a successful 2D scheme can serve as the starting point for the algorithms that compute 3D structures, with a higher probability of success and a lower cost.

PROJECT DESCRIPTION
A queue contains the amino acid sequence, with the first amino acid preferably at the head of the queue rather than the tail. Each amino acid taken from the queue is classified according to its water affinity. The first two amino acids always have a unique 2D disposition, since there is only one way to connect them to each other. This initial disposition is the start state. Let s_i be a state defined as a succession of planar coordinates, each one being the position of an amino acid. Then the start state can be written as s_0 = {(0, 0), (1, 0)} (see Figure 1). In this example, given for an arbitrary amino acid sequence, the first amino acid is hydrophilic (white) and the second is hydrophobic (black).
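The queue and the start state described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_start_state` and the use of `collections.deque` are hypothetical choices, not taken from the project's code.

```python
from collections import deque


def make_start_state(sequence):
    """Build s_0 and the queue of amino acids still to be placed.

    Hypothetical helper: the name and return shape are assumptions.
    """
    if len(sequence) < 2:
        raise ValueError("at least two amino acids are needed")
    queue = deque(sequence)   # head of the queue = first amino acid
    queue.popleft()           # first amino acid goes to (0, 0)
    queue.popleft()           # second amino acid goes to (1, 0)
    start_state = [(0, 0), (1, 0)]
    return start_state, queue


state, remaining = make_start_state("LCLYT")
print(state)             # [(0, 0), (1, 0)]
print(list(remaining))   # ['L', 'Y', 'T']
```

A state is kept as a list of coordinates in sequence order, so the i-th coordinate always belongs to the i-th amino acid of the input string.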
A priori, two amino acids are the minimum needed to begin, but a sequence needs about fifteen amino acids to have a minimal biological meaning, as in viruses, the simplest living beings, for example. After the first two amino acids are connected in an imaginary plane mimicking an aqueous environment, the next amino acid taken from the queue is placed on each of the 3 possible sides of the last amino acid, following an uninformed search strategy (USS). The three new states are written as s_1 = {(0, 0), (1, 0), (1, 1)}, s_2 = {(0, 0), (1, 0), (2, 0)}, and s_3 = {(0, 0), (1, 0), (1, -1)} (see Figure 2).

The USS always tries every possible disposition on these 3 sides; the fourth side is excluded because it holds the connection between the last amino acid and the next-to-last one. Yet one or more sides are often unavailable to the next amino acid because amino acids placed earlier in the USS already occupy them. So each new amino acid would have 1 to 3 free sides to try, and the branching factor of the USS is always between 1 and 3. The size of the search space of the USS is therefore at most 3^(m - 2) planar dispositions, where m is the number of amino acids. For example, a sequence of 5 amino acids has at most 3^(5 - 2) = 3^3 = 27 states.

Unfortunately, since the proteins most studied by biologists have about 150 amino acids, the uninformed search space easily becomes too large. So an alternative search strategy is considered: the informed search strategy (ISS). The main difference between the two strategies is that the ISS uses a chemical property, water affinity, as an information rule that refines the search so that fewer sides are considered. Hydrophobic amino acids are the target of this information rule.
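The USS expansion and the 3^(m - 2) bound can be checked with a small enumeration. The sketch below is not the project's implementation; the names `free_sides` and `uss_states` are hypothetical, and states are assumed to be lists of lattice coordinates starting from s_0.

```python
def free_sides(state):
    """Return the unoccupied lattice sides of the last amino acid."""
    x, y = state[-1]
    occupied = set(state)
    candidates = [(x, y + 1), (x + 1, y), (x, y - 1), (x - 1, y)]
    return [p for p in candidates if p not in occupied]


def uss_states(m):
    """Enumerate every 2D scheme of m amino acids reachable from s_0."""
    frontier = [[(0, 0), (1, 0)]]
    for _ in range(m - 2):
        # Try every free side of every state found so far (brute force).
        frontier = [state + [p] for state in frontier for p in free_sides(state)]
    return frontier


print(uss_states(3))
# [[(0, 0), (1, 0), (1, 1)], [(0, 0), (1, 0), (2, 0)], [(0, 0), (1, 0), (1, -1)]]
print([len(uss_states(m)) for m in (3, 4, 5)])  # [3, 9, 25]
```

For m = 5 the enumeration yields 25 states rather than 27, because two of the candidate placements fall on sides that are already occupied; this is consistent with 3^(m - 2) being an upper bound.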
The information rule of water affinity for any amino acid is defined by the following points:

- If a hydrophilic or neutral amino acid comes from the queue, the ISS follows the rules of the USS regarding the sides.
- If a hydrophobic amino acid comes from the queue, it is placed only on the free side(s) nearest to the last hydrophobic amino acid placed in the 2D scheme.

Figure 1. The start state. Hydrophilic amino acids are white circles, hydrophobic amino acids are black circles, and neutral amino acids are gray circles.

Figure 2. The three possible states, s_1, s_2, and s_3, based on the USS with m = 3 after the start state.
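The information rule can be sketched as a successor function. The code below is a minimal illustration, not the project's code: the name `iss_successors` is hypothetical, and the affinity string with the letters 'H' (hydrophobic), 'P' (hydrophilic), and 'N' (neutral) is an assumed encoding, since the paper classifies amino acids through Table 1 instead.

```python
import math


def free_sides(state):
    """Return the unoccupied lattice sides of the last amino acid."""
    x, y = state[-1]
    occupied = set(state)
    candidates = [(x, y + 1), (x + 1, y), (x, y - 1), (x - 1, y)]
    return [p for p in candidates if p not in occupied]


def iss_successors(state, affinities):
    """Expand one state for the next amino acid of the sequence.

    affinities holds one assumed letter per amino acid of the whole
    sequence: 'H' hydrophobic, 'P' hydrophilic, 'N' neutral.
    """
    sides = free_sides(state)
    placed_h = [pos for pos, a in zip(state, affinities) if a == "H"]
    if affinities[len(state)] != "H" or not placed_h:
        # Hydrophilic or neutral (or no hydrophobic placed yet): same as USS.
        return [state + [p] for p in sides]
    target = placed_h[-1]  # last hydrophobic amino acid already placed
    dist = {p: math.dist(p, target) for p in sides}
    best = min(dist.values(), default=0.0)
    # Keep only the side(s) at minimum Euclidean distance from the target.
    return [state + [p] for p in sides if math.isclose(dist[p], best)]


print(iss_successors([(0, 0), (1, 0)], "HPH"))
# [[(0, 0), (1, 0), (1, 1)], [(0, 0), (1, 0), (1, -1)]]
```

With the assumed affinities "HPH", the placement at (2, 0) is discarded because it lies at distance 2 from the hydrophobic amino acid at (0, 0), while (1, 1) and (1, -1) lie at distance √2. When the amino acid placed immediately before is itself hydrophobic, every free side is at distance 1 and this literal reading of the rule prunes nothing.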

The new states generated by the ISS after the start state are written as s_1 = {(0, 0), (1, 0), (1, 1)} and s_2 = {(0, 0), (1, 0), (1, -1)} (see Figure 3). Notice that the potential state {(0, 0), (1, 0), (2, 0)} is dismissed because the distance between its two hydrophobic amino acids is greater than in s_1 and s_2.

Figure 3. The two possible states, s_1 and s_2, based on the ISS after the start state.

Evidently, if a given sequence has more hydrophobic amino acids than hydrophilic or neutral ones, the ISS becomes more powerful than the USS. According to Table 1, there is a ratio of 9:12 between hydrophobic and hydrophilic amino acids, which lets the ISS work with far fewer sides than the USS.

The size of the search space of the ISS would not be greater than 3^m + n planar dispositions, where m is the number of non-hydrophobic amino acids in the subsequence starting from the third amino acid. The second term, n, is a quasi-linear function of the total number of planar dispositions generated by each hydrophobic amino acid when the minimum Euclidean distance is applied. The example in Figure 3 would have n = 1 + 1 = 2 states.

Each state generated by the USS or the ISS is qualified as stable or unstable. The main condition of molecular stability, which defines the goal state for any amino acid sequence, is summarized in the following rule:

Each hydrophobic amino acid must be isolated from the water molecules, that is, each of its sides must be occupied by a hydrophilic or neutral amino acid and never by a water molecule.

An amino acid sequence in the queue is abstracted as a data set consisting of a string of uppercase letters. Each uppercase letter is the short abbreviation of an amino acid (see Table 1). The data set can hold as many amino acids as the computational resources can bear, but it is very rare for even a giant protein to have more than 400 amino acids. Only the data set of a single amino acid sequence is accepted as input data for both USS and ISS, although the project can work with two or more data sets in batch mode.

Biologists use a standard format known as the FASTA format to store and exchange data sets. There are two FASTA formats, one for nucleic acids and one for amino acids; this project uses the latter. An example of the FASTA format is given in Figure 4. The FASTA format requires the contents to be readable as plain text, whether on written or electronic media. The first line must hold the information about the protein, such as its name, its NCBI identifier, etc. The amino acid sequence is broken into short lines of 70 alphanumeric characters, which are then written to the file. Each line can be seen as a subsequence of the sequence, split only for the sake of human readability. Only data sets that follow the FASTA format are accepted by the project.

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Figure 4. An example of the FASTA format.

ANALYSIS OF RESULTS
The results were produced by running the project on several electronic files containing biological information organized in accordance with the FASTA format. The execution of the project pauses only when a stable 2D scheme is detected and displayed. During the tests there was always a pause whenever a stable 2D scheme was generated and detected (see Figure 5 for an example). The remaining 2D schemes were unstable, so the project dismissed them and continued its execution.
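Reading a FASTA data set of the kind used in these tests can be sketched as follows. This is a minimal parser written for illustration, not the project's loader; the name `parse_fasta` is hypothetical, and the sample below is a shortened version of the Figure 4 record.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are joined back into one 1D sequence.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = "\n".join([
    ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]",
    "LCLYTHIGRN",
    "IYYGSYLYSE",
])
for name, sequence in parse_fasta(sample):
    print(name, len(sequence))
```

Returning a list of records rather than a single sequence matches the batch mode mentioned in the project description, where one file or several files may carry more than one data set.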
All executions of the project were visual, and each node expansion of both USS and ISS was visually checked at one-second intervals. Eventually a one-dimensional pattern was discovered. If the 1D sequence was too fragmented, that is, if the alternating subsequences of the same water affinity were too short, no stable 2D scheme was generated.

Figure 5. A stable 2D scheme detected as the goal state.
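The goal test behind these pauses, the stability rule stated in the project description, can be sketched as a check on a finished 2D scheme. The code is an illustration under assumptions: the name `is_stable`, the 'H'/'P'/'N' affinity letters, and the `strict` switch are all hypothetical. By default any occupied side counts as shielding a hydrophobic amino acid from water; with `strict=True` the sides must hold hydrophilic or neutral amino acids, which is the literal wording of the rule.

```python
def is_stable(state, affinities, strict=False):
    """Check that no hydrophobic amino acid has a side exposed to water.

    state: lattice coordinates in sequence order.
    affinities: assumed letters, 'H' hydrophobic, 'P' hydrophilic, 'N' neutral.
    """
    occupant = dict(zip(state, affinities))
    for (x, y), kind in occupant.items():
        if kind != "H":
            continue
        for side in [(x, y + 1), (x + 1, y), (x, y - 1), (x - 1, y)]:
            neighbour = occupant.get(side)
            if neighbour is None:
                return False  # empty side, i.e. a water molecule
            if strict and neighbour == "H":
                return False  # literal rule: only hydrophilic or neutral sides
    return True


# Hand-made example: one hydrophobic amino acid at (0, 0) wrapped by the chain.
folded = [(0, 1), (-1, 1), (-1, 0), (0, 0), (1, 0), (1, -1), (0, -1)]
print(is_stable(folded, "PPPHPPP"))                    # True
print(is_stable([(0, 0), (1, 0), (2, 0)], "PHP"))      # False
```

The folded example uses coordinates chosen by hand rather than a state produced by the search, so it does not start from s_0.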

If the longest subsequence of hydrophilic and neutral amino acids was 4 times longer (4 being the number of sides) than the longest subsequence of hydrophobic amino acids, the probability of reaching a goal state was very high. Moreover, the successful ordering of these subsequences was the hydrophobic subsequence first, followed by the other one.

The metrics of both USS and ISS assign a cost of one to the creation of each node. The accumulated cost of all previously created nodes is carried into the creation of the next node at each recursive call of both USS and ISS. After several metric tests with the same electronic files as before, the node counts at the final states detected as goal states showed that the performance difference between USS and ISS was not very marked for short sequences, as expected, but became more and more marked for longer amino acid sequences.

DISCUSSION
The inference proposed here would be improved if more chemical properties were combined with water affinity in the algorithmic engine of the ISS. The branching factor would then be one and, rarely, two, thanks to the refinement contributed by the new chemical properties. A few properties would be studied and chosen according to their chemical influence on molecular stability for the next version of the ISS. Candidate chemical properties include the pH factor, polarity, and aromaticity.

Another interesting point of discussion is whether an amino acid sequence should be processed as a queue or as a stack. The input queue here is processed starting from its head. But the 1D sequence offers another one-dimensional alternative, starting from its tail, so the amino acid sequence could also be stored as a stack.
The inclusion of new chemical properties and the treatment of the amino acid sequence as a bidirectional data structure would allow a hypothetical new version of the ISS to restrict the explosive growth in the number of states generated when working with 3D structures instead of 2D schemes. In that case the start state would be s_0 = {(0, 0, 0)}. The first algorithmic step of the ISS would generate 6 states: s_1 = {(0, 0, 0), (1, 0, 0)}, s_2 = {(0, 0, 0), (0, 1, 0)}, s_3 = {(0, 0, 0), (0, 0, 1)}, s_4 = {(0, 0, 0), (-1, 0, 0)}, s_5 = {(0, 0, 0), (0, -1, 0)}, and s_6 = {(0, 0, 0), (0, 0, -1)}. The next states would then be processed according to the formula of the minimum Euclidean distance:

d = ((x_m - x_n)^2 + (y_m - y_n)^2 + (z_m - z_n)^2)^(1/2)

where (x_m, y_m, z_m) and (x_n, y_n, z_n) are the positions of the new and the last hydrophobic amino acids.

It is known that Python, the programming language used by the project, has a well-defined and standard GUI, Tkinter, for 2D schemes but, unfortunately, no standard GUI for 3D structures. Perhaps PyMOL could be integrated as a potential 3D GUI in the next version of the ISS for 3D structures.

CONCLUSION
The theoretical inference proposed here demonstrates that it is not necessary to generate all the possible final states, given their computational expense. Instead it is possible to arrive only at the subset of final states with the highest probability of being goal states, at a lower computational cost. The use of AI as an algorithmic tool in informed searches, together with chemical properties used as refinement factors, opens new (and less computationally costly) research paths for biologists to discover new proteins unknown in Nature and beneficial for Mankind.

ACKNOWLEDGMENT
The work described in this paper was conducted as part of a Fall 2012 Artificial Intelligence course, taught in the Computer Science department of the University of Massachusetts Lowell by Prof. Fred Martin.