A Theoretical Inference of Protein Schemes from Amino Acid Sequences


Angel Villahoz-Baleta
angel_villahozbaleta@student.uml.edu

ABSTRACT
Proteins are three-dimensional dispositions generated from amino acid sequences. The disposition of a new protein must be fully or almost fully stable for it to have biological meaning. Proteins can be viewed as the building blocks of living beings. Finding such molecular stability is a famous computational problem, and artificial intelligence (AI) has yet to solve it completely for the biology community. The theoretical inference proposed in this paper looks for potentially stable protein schemes using two algorithmic tools from AI: uninformed search and informed search.

Author Keywords
artificial intelligence, amino acid, hydrophobic, hydrophilic, informed search, molecular stability, neutral, protein, protein structure, uninformed search, scheme, sequence, water affinity.

INTRODUCTION
The chemical challenge of molecular stability in proteins is frequently covered in scientific publications, many of which show the value of AI as an aid to research. The theoretical inference in this paper also uses AI, but it makes AI and chemical properties depend on each other as an alternative line of research. Several chemical properties influence the molecular stability of proteins, but the most important is arguably water affinity, since every known biological process takes place in an aqueous environment. About 60% of the adult human body is water, and our biological processes occur there. Each amino acid in a sequence is drawn from a set of 22 different amino acids. They are classified into three classes, hydrophobic, hydrophilic, and neutral, according to their water affinity. Table 1 shows the water-affinity subdivision of the 22 amino acids together with their short abbreviations.
Hydrophobic amino acids have an aversion to water molecules, so they tend to sit inside the protein. Hydrophilic amino acids are attracted to water molecules, so they tend to sit in the outer layer of the protein. Neutral amino acids react neither adversely nor favorably to water molecules, so they can be placed anywhere in the protein.

Table 1. The water-affinity subdivision of the 22 amino acids.
Amino Acid Name (Short Abbreviation): Alanine (A), Arginine (R), Asparagine (N), Aspartic Acid (D), Cysteine (C), Glutamic Acid (E), Glutamine (Q), Glycine (G), Histidine (H), Hydroxyproline (O), Isoleucine (I), Leucine (L), Lysine (K), Methionine (M), Phenylalanine (F), Proline (P), Pyroglutamic Acid (U), Serine (S), Threonine (T), Tryptophan (W), Tyrosine (Y), Valine (V)
Water Affinity: Neutral, Neutral

A 1D sequence can leave its one-dimensionality by folding onto itself. This folding mechanism generates 2D schemes and 3D structures, with the 2D schemes as the intermediate step between 1D sequences and 3D structures. The 3D structures are the most accurate representation of proteins, but they are probably also among the most expensive mathematical objects to compute. The 2D schemes are less accurate but demand less computational effort, so there is a trade-off between accuracy and computational cost. This trade-off can be improved with two AI algorithms: uninformed search and informed search. Uninformed search uses brute force to form every possible 2D scheme. Informed search refines the results of the uninformed search using the chemical property of water affinity. Finally, a successful 2D scheme can serve as the starting point for algorithms that compute 3D structures, with a higher probability of success and at a lower cost.

PROJECT DESCRIPTION
A queue contains the amino acid sequence, with the first amino acid at the head of the queue rather than the tail. Each amino acid taken from the queue is classified according to its water affinity. The first two amino acids always have a unique 2D disposition, since there is only one way to connect them to each other. This initial disposition is the start state. Let $s_i$ be a state defined as a succession of planar coordinates, each one the position of an amino acid. The start state can then be written as $s_0 = \{(0, 0), (1, 0)\}$ (see Figure 1). In this example, drawn from an arbitrary amino acid sequence, the first amino acid is hydrophilic (white) and the second is hydrophobic (black).
A priori, two amino acids are the minimum needed to begin, but a sequence needs about fifteen amino acids to have minimal biological meaning, as in viruses, the simplest living beings. Once the first two amino acids are connected in an imaginary plane mimicking an aqueous environment, the next amino acid from the queue is placed on each of the 3 possible sides of the last amino acid, following an uninformed search strategy (USS). The three new states are $s_1 = \{(0, 0), (1, 0), (1, 1)\}$, $s_2 = \{(0, 0), (1, 0), (2, 0)\}$, and $s_3 = \{(0, 0), (1, 0), (1, -1)\}$ (see Figure 2). The USS always tries every possible disposition on these 3 sides; the fourth side is excluded because it holds the connection between the last amino acid and the next-to-last one. However, one or more sides are often unavailable at a given moment of the search because amino acids placed earlier already occupy them. Each next amino acid therefore has 1 to 3 free sides to try, and the branching factor of the USS is always between 1 and 3. The size of the USS search space is at most $3^{m-2}$ planar dispositions, where $m$ is the number of amino acids. For example, a sequence of 5 amino acids has at most $3^{5-2} = 3^3 = 27$ states. Unfortunately, since the proteins most studied by biologists have about 150 amino acids, the uninformed search space easily becomes too large. An alternative search strategy is therefore considered: the informed search strategy (ISS). The main difference between the two strategies is that the ISS uses a chemical property, water affinity, as an information rule that refines the search by leaving fewer sides to consider. Hydrophobic amino acids are the target of this information rule.
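The USS expansion described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's code: the names `SIDES`, `free_sides`, and `uss_states` are hypothetical, and states are assumed to be tuples of integer lattice coordinates. A disposition whose last amino acid has no free side is simply dropped.

```python
from typing import List, Tuple

Coord = Tuple[int, int]
State = Tuple[Coord, ...]

# The four sides of a cell on the square lattice (assumed representation).
SIDES = ((1, 0), (0, 1), (-1, 0), (0, -1))

# Start state s_0: the only way to connect the first two amino acids.
START_STATE: State = ((0, 0), (1, 0))


def free_sides(state: State) -> List[Coord]:
    """Return the unoccupied cells adjacent to the last amino acid placed."""
    x, y = state[-1]
    occupied = set(state)
    return [(x + dx, y + dy) for dx, dy in SIDES
            if (x + dx, y + dy) not in occupied]


def uss_states(m: int) -> List[State]:
    """Enumerate every planar disposition of m amino acids (m >= 2)."""
    frontier = [START_STATE]
    for _ in range(m - 2):
        # Place the next amino acid on every free side of every state.
        frontier = [s + (cell,) for s in frontier for cell in free_sides(s)]
    return frontier


if __name__ == "__main__":
    for m in (3, 4, 5):
        print(m, len(uss_states(m)), "<=", 3 ** (m - 2))
```

For m = 3 the sketch produces exactly the three states listed above. For m = 5 it produces 25 states rather than 27, because two of the 27 candidate placements would land on the cell already occupied by the first amino acid, which is why the bound is an upper limit.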
The information rule of water affinity for any amino acid is defined by the following points:
- If a hydrophilic or neutral amino acid comes from the queue, the ISS follows the rules of the USS regarding the sides.
- If a hydrophobic amino acid comes from the queue, it is placed on the free side(s) nearest to the last hydrophobic amino acid placed in the 2D scheme.

Figure 1. The start state. Hydrophilic amino acids are white circles, hydrophobic amino acids black circles, and neutral amino acids gray circles.
Figure 2. The three possible states, $s_1$, $s_2$, and $s_3$, based on the USS with $m = 3$ after the start state.
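The information rule can be sketched as a filter on the free sides. The code below is a hedged illustration under stated assumptions: the function name `iss_candidates` and the class letters `'H'` (hydrophobic), `'P'` (hydrophilic), and `'N'` (neutral) are hypothetical, and squared distances are compared instead of Euclidean distances, which selects the same nearest cells while staying in integer arithmetic. The usage example assumes a three-residue sequence whose first and third residues are hydrophobic, so that the free sides lie at different distances from the last hydrophobic residue.

```python
from typing import List, Sequence, Tuple

Coord = Tuple[int, int]

# The four sides of a cell on the square lattice (assumed representation).
SIDES = ((1, 0), (0, 1), (-1, 0), (0, -1))


def iss_candidates(state: Sequence[Coord], classes: str,
                   next_class: str) -> List[Coord]:
    """Cells where the next amino acid may go under the information rule.

    classes[i] is the water-affinity class of the amino acid at state[i]:
    'H' hydrophobic, 'P' hydrophilic, 'N' neutral (hypothetical encoding).
    """
    x, y = state[-1]
    occupied = set(state)
    free = [(x + dx, y + dy) for dx, dy in SIDES
            if (x + dx, y + dy) not in occupied]

    placed_h = [pos for pos, c in zip(state, classes) if c == "H"]
    # Hydrophilic and neutral residues follow the USS rules unchanged.
    if next_class != "H" or not placed_h or not free:
        return free

    hx, hy = placed_h[-1]  # last hydrophobic amino acid placed so far
    sq_dist = [(cx - hx) ** 2 + (cy - hy) ** 2 for cx, cy in free]
    nearest = min(sq_dist)
    return [cell for cell, d in zip(free, sq_dist) if d == nearest]


if __name__ == "__main__":
    start = [(0, 0), (1, 0)]
    print(iss_candidates(start, "HP", "H"))  # hydrophobic: nearest sides only
    print(iss_candidates(start, "HP", "P"))  # hydrophilic: all free sides
```

Under this reading, a hydrophobic residue that directly follows another hydrophobic residue keeps all of its free sides, since every free side is then at distance 1 from its predecessor.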

The new states generated by the ISS after the start state are $s_1 = \{(0, 0), (1, 0), (1, 1)\}$ and $s_2 = \{(0, 0), (1, 0), (1, -1)\}$ (see Figure 3). Note that the potential state $\{(0, 0), (1, 0), (2, 0)\}$ is dismissed because the distance between its two hydrophobic amino acids is greater than in $s_1$ and $s_2$.

Figure 3. The two possible states, $s_1$ and $s_2$, based on the ISS after the start state.

Evidently, if a given sequence contains more hydrophobic amino acids than hydrophilic or neutral ones, the ISS becomes more powerful than the USS. According to Table 1, the ratio of hydrophobic to hydrophilic amino acids is 9:12, so the ISS works with far fewer sides than the USS.

The size of the search space based on the ISS is no greater than $3^m + n$ planar dispositions, where $m$ is the number of non-hydrophobic amino acids in the subsequence starting from the third amino acid. The second term, $n$, is a quasi-linear function of the total number of planar dispositions generated by each hydrophobic amino acid under the minimum Euclidean distance. The example in Figure 3 has $n = 1 + 1 = 2$ states.

Each state generated by the USS or the ISS is classified as stable or unstable. The main condition of molecular stability, the goal state for any amino acid sequence, is summarized in the following rule: each hydrophobic amino acid must be isolated from the water molecules; that is, each of its sides must be occupied by a hydrophilic or neutral amino acid and never by a water molecule.

An amino acid sequence in the queue is abstracted as a data set consisting of a string of uppercase letters. Each letter is the short abbreviation of an amino acid (see Table 1). The data set can hold as many amino acids as the computational resources can handle, although it is very rare for even a giant protein to have more than 400 amino acids. Only a data set containing an amino acid sequence is accepted as input data for both the USS and the ISS, but the project can work with two or more data sets in batch mode.

Biologists use a standard format known as FASTA to store and exchange data sets. There are two FASTA formats, one for nucleic acids and one for amino acids; this project uses the latter. Figure 4 shows an example. The FASTA format requires the contents of any written or electronic medium to be readable as plain text. The first line must carry information about the protein, such as its name and its NCBI identifier. The amino acid sequence is broken into short lines of 70 alphanumeric characters, which are then placed in the file. Each line can be seen as a subsequence of the sequence, split only for easier human reading. The project accepts only data sets that follow the FASTA format.

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
Figure 4. An example of the FASTA format.

ANALYSIS OF RESULTS
The results were produced by running the project on several electronic files containing biological information organized according to the FASTA format. Execution pauses only when a stable 2D scheme is detected and displayed. During the tests, there was always a pause whenever a stable 2D scheme was generated and detected (see Figure 5 for an example). The remaining 2D schemes were unstable, so the project dismissed them and continued its execution.
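The two ingredients of such a run, reading a FASTA file and testing a state for stability, can be sketched as follows. Both functions are hypothetical illustrations rather than the project's code. In particular, `is_stable` rests on an assumed reading of the stability rule: a state counts as stable when no side of any hydrophobic amino acid is an empty (water) cell. The class letters `'H'`, `'P'`, and `'N'` are likewise an assumed encoding.

```python
from typing import Dict, Iterable, Sequence, Tuple

Coord = Tuple[int, int]

# The four sides of a cell on the square lattice (assumed representation).
SIDES = ((1, 0), (0, 1), (-1, 0), (0, -1))


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header line (without '>') to its re-joined sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are only a display split; join them back.
            records[header] += line
    return records


def is_stable(state: Sequence[Coord], classes: str) -> bool:
    """True if no hydrophobic residue has an empty (water) neighbouring cell.

    classes[i] is 'H', 'P', or 'N' for the amino acid at state[i].
    """
    occupied = set(state)
    for (x, y), c in zip(state, classes):
        if c != "H":
            continue
        if any((x + dx, y + dy) not in occupied for dx, dy in SIDES):
            return False
    return True


if __name__ == "__main__":
    # A 3x3 spiral whose last residue, the only hydrophobic one, is buried.
    spiral = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2),
              (1, 2), (0, 2), (0, 1), (1, 1)]
    print(is_stable(spiral, "PPPPPPPPH"))
    print(is_stable([(0, 0), (1, 0)], "PH"))
```

A file opened with `open(path)` can be passed directly to `read_fasta`, since a text file object iterates over its lines.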
All executions of the project were visual, and each node expansion of both the USS and the ISS was checked visually at one-second intervals. Eventually a one-dimensional pattern was discovered: if the 1D sequence was too fragmented, that is, if the alternating subsequences of a single water affinity were too short, no stable 2D scheme was generated.

Figure 5. A stable 2D scheme detected as the goal state.
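How fragmented a sequence is can be measured by the lengths of its runs of equal water affinity. The sketch below is illustrative only: it assumes the sequence has already been translated into a string of class letters (`'H'` hydrophobic, `'P'` hydrophilic, `'N'` neutral, a hypothetical encoding), and the names `affinity_runs` and `longest_run` are not from the project. No threshold for "too short" is implied.

```python
from itertools import groupby
from typing import List, Tuple


def affinity_runs(classes: str) -> List[Tuple[str, int]]:
    """Collapse a class string into (class, run length) pairs, in order."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(classes)]


def longest_run(classes: str, wanted: str) -> int:
    """Length of the longest run of the given class, or 0 if it is absent."""
    return max((n for key, n in affinity_runs(classes) if key == wanted),
               default=0)


if __name__ == "__main__":
    example = "HHHPPNH"  # hypothetical class string
    print(affinity_runs(example))
    print(longest_run(example, "H"))
```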

If the longest subsequence of hydrophilic and neutral amino acids was 4 times (the number of sides) longer than the longest subsequence of hydrophobic amino acids, the probability of reaching a goal state was very high. In addition, the successful ordering of these subsequences by water affinity was the hydrophobic subsequence first, followed by the other subsequence. The metrics of both the USS and the ISS assign a cost of one to the creation of a node. The accumulated cost of all previously created nodes is carried into the creation of each new node at every recursive call of the USS and the ISS. Further metric tests on the same electronic files showed, from the node counts at the final states detected as goal states, that the performance difference between the USS and the ISS was not very marked for short sequences, as expected, but became increasingly marked for longer amino acid sequences.

DISCUSSION
The inference proposed here would improve if more chemical properties were combined with water affinity in the algorithmic engine of the ISS. The branching factor would then be one and, rarely, two, thanks to the refinement provided by the new chemical properties. For the next version of the ISS, a few properties would be studied and chosen according to their chemical influence on molecular stability. Candidate chemical properties include pH, polarity, and aromaticity. Another interesting point of discussion is whether an amino acid sequence should be processed as a queue or as a stack. The input queue here is processed starting from its head, but the 1D sequence could alternatively be processed starting from its tail, so the amino acid sequence would also be stored as a stack.
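One way to support both processing orders with a single container is a double-ended queue. The sketch below uses Python's `collections.deque`; the helper `next_residue` and its `from_head` parameter are hypothetical names chosen for illustration, not part of the project.

```python
from collections import deque
from typing import Deque


def next_residue(sequence: Deque[str], from_head: bool = True) -> str:
    """Remove and return the next amino acid from the chosen end."""
    # popleft() gives queue behaviour (head first); pop() takes from the tail.
    return sequence.popleft() if from_head else sequence.pop()


if __name__ == "__main__":
    residues = deque("LCLYTH")
    print(next_residue(residues))                   # taken from the head
    print(next_residue(residues, from_head=False))  # taken from the tail
    print("".join(residues))                        # what remains
```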
The inclusion of new chemical properties and the treatment of the amino acid sequence as a bidirectional data structure would allow a hypothetical new version of the ISS to restrain the explosive growth in the number of states when working with 3D structures instead of 2D schemes. The start state would then be $s_0 = \{(0, 0, 0)\}$. The first algorithmic step of the ISS would generate 6 states: $s_1 = \{(0, 0, 0), (1, 0, 0)\}$, $s_2 = \{(0, 0, 0), (0, 1, 0)\}$, $s_3 = \{(0, 0, 0), (0, 0, 1)\}$, $s_4 = \{(0, 0, 0), (-1, 0, 0)\}$, $s_5 = \{(0, 0, 0), (0, -1, 0)\}$, and $s_6 = \{(0, 0, 0), (0, 0, -1)\}$. The next states would then be processed according to the formula of the minimum Euclidean distance:

$$d = \left((x_m - x_n)^2 + (y_m - y_n)^2 + (z_m - z_n)^2\right)^{1/2}$$

where $(x_m, y_m, z_m)$ and $(x_n, y_n, z_n)$ are the positions of the new and the last hydrophobic amino acids, respectively. Python, the programming language used by the project, has a well-defined standard GUI toolkit, Tkinter, for 2D schemes but, unfortunately, no standard GUI for 3D structures. PyMOL might be integrated as a 3D GUI in the next version of the ISS for 3D structures.

CONCLUSION
The theoretical inference proposed here shows that it is not necessary to generate all the possible final states, given their computational expense. Instead, it is possible to reach only the subset of final states with the highest probability of being goal states, at a lower computational cost. The use of AI as an algorithmic tool in informed searches, together with chemical properties used as refinement factors, opens new and less computationally costly research paths for biologists to discover proteins unknown in Nature and beneficial for Mankind.

ACKNOWLEDGMENT
The work described in this paper was conducted as part of a Fall 2012 Artificial Intelligence course, taught in the Computer Science department of the University of Massachusetts Lowell by Prof. Fred Martin.

REFERENCES
1. Bui, T.N. and Sundarraj, G. An Efficient Genetic Algorithm for Predicting Protein Tertiary Structures in the 2D HP Model. in GECCO '05: Proceedings of the 2005 conference on Genetic and evolutionary computation, (Washington, DC, 2005), ACM Press (2005), 385-392.
2. FASTA Format. http://www.ncbi.nlm.nih.gov/blast/blastcgihelp.shtml
3. Hart, W.E. and Newman, A. Protein Structure Prediction with Lattice Models. in Aluru, S. ed. Handbook of Computational Molecular Biology, Chapman & Hall CRC Computer and Information Science Series, 2006, 30-1-30-24.
4. ModelViewController. http://wiki.wxpython.org/modelviewcontroller/.
5. Mount, D.W. Bioinformatics: Sequence and Genome Analysis. 2nd ed. Cold Spring Harbor Laboratory Press, 2004.
6. Newman, A. A New Algorithm for Protein Folding in the HP Model. in SODA '02: Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, (San Francisco, CA, 2002), SIAM Press (2002), 876-884.
7. Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach. 3rd ed. Prentice Hall, 2010.
8. Pevsner, J. Bioinformatics and Functional Genomics. 2nd ed. Wiley-Blackwell, 2009.
9. Proteinogenic amino acid. http://en.wikipedia.org/wiki/proteinogenic_amino_acid.
10. Python Programming Language. http://www.python.org/.
11. Python Programming/Object-oriented programming. http://en.wikibooks.org/wiki/python_programming/Object-oriented_programming.
12. Tkinter. http://wiki.python.org/moin/tkinter/