A modular Fibonacci sequence in proteins

Similar documents
Using an Artificial Regulatory Network to Investigate Neural Computation

Genetic Code, Attributive Mappings and Stochastic Matrices

Objective: You will be able to justify the claim that organisms share many conserved core processes and features.

Aoife McLysaght Dept. of Genetics Trinity College Dublin

Genetic code on the dyadic plane

Reducing Redundancy of Codons through Total Graph

In previous lecture. Shannon s information measure x. Intuitive notion: H = number of required yes/no questions.

The Genetic Code Degeneracy and the Amino Acids Chemical Composition are Connected

Lecture IV A. Shannon s theory of noisy channels and molecular codes

A p-adic Model of DNA Sequence and Genetic Code 1

THE GENETIC CODE INVARIANCE: WHEN EULER AND FIBONACCI MEET

Edinburgh Research Explorer

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

UNIVERSITY OF CAMBRIDGE INTERNATIONAL EXAMINATIONS General Certifi cate of Education Advanced Subsidiary Level and Advanced Level

Natural Selection. Nothing in Biology makes sense, except in the light of evolution. T. Dobzhansky

CHEMISTRY 9701/42 Paper 4 Structured Questions May/June hours Candidates answer on the Question Paper. Additional Materials: Data Booklet

Biology 155 Practice FINAL EXAM

The degeneracy of the genetic code and Hadamard matrices. Sergey V. Petoukhov

A Minimum Principle in Codon-Anticodon Interaction

NSCI Basic Properties of Life and The Biochemistry of Life on Earth

A Mathematical Model of the Genetic Code, the Origin of Protein Coding, and the Ribosome as a Dynamical Molecular Machine

PROTEIN SYNTHESIS INTRO

The genetic code, 8-dimensional hypercomplex numbers and dyadic shifts. Sergey V. Petoukhov

Mathematics of Bioinformatics ---Theory, Practice, and Applications (Part II)

Natural Selection. Nothing in Biology makes sense, except in the light of evolution. T. Dobzhansky

Analysis of Codon Usage Bias of Delta 6 Fatty Acid Elongase Gene in Pyramimonas cordata isolate CS-140

Foundations of biomaterials: Models of protein solvation

Lect. 19. Natural Selection I. 4 April 2017 EEB 2245, C. Simon

Lesson Overview. Ribosomes and Protein Synthesis 13.2

Ribosome kinetics and aa-trna competition determine rate and fidelity of peptide synthesis

Three-Dimensional Algebraic Models of the trna Code and 12 Graphs for Representing the Amino Acids

ATTRIBUTIVE CONCEPTION OF GENETIC CODE, ITS BI-PERIODIC TABLES AND PROBLEM OF UNIFICATION BASES OF BIOLOGICAL LANGUAGES *

2013 Japan Student Services Origanization

Videos. Bozeman, transcription and translation: Crashcourse: Transcription and Translation -

CODING A LIFE FULL OF ERRORS

Abstract Following Petoukhov and his collaborators we use two length n zero-one sequences, α and β,

Crystal Basis Model of the Genetic Code: Structure and Consequences

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

In Silico Modelling and Analysis of Ribosome Kinetics and aa-trna Competition

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

In Silico Modelling and Analysis of Ribosome Kinetics and aa-trna Competition

Practical Bioinformatics

Recent Evidence for Evolution of the Genetic Code

1. Contains the sugar ribose instead of deoxyribose. 2. Single-stranded instead of double stranded. 3. Contains uracil in place of thymine.

DO NOT OPEN THE EXAMINATION PAPER UNTIL YOU ARE TOLD BY THE SUPERVISOR TO BEGIN

Lecture 14 - Cells. Astronomy Winter Lecture 14 Cells: The Building Blocks of Life

ومن أحياها Translation 1. Translation 1. DONE BY :Maen Faoury

Molecular Genetics Principles of Gene Expression: Translation

Journal of Biometrics & Biostatistics

Slide 1 / 54. Gene Expression in Eukaryotic cells

C CH 3 N C COOH. Write the structural formulas of all of the dipeptides that they could form with each other.

Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis

Advanced topics in bioinformatics

Protein Secondary Structure Prediction

Lecture 15: Realities of Genome Assembly Protein Sequencing

Degeneracy. Two types of degeneracy:

NIH Public Access Author Manuscript J Theor Biol. Author manuscript; available in PMC 2009 April 21.

Proteins: Characteristics and Properties of Amino Acids

Translation. Genetic code

Gene Finding Using Rt-pcr Tests

High throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence beyond 950 nm

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

RNA & PROTEIN SYNTHESIS. Making Proteins Using Directions From DNA

Transcription Attenuation

THE GENETIC CODE AND EVOLUTION

Get started on your Cornell notes right away

Laith AL-Mustafa. Protein synthesis. Nabil Bashir 10\28\ First

An algebraic hypothesis about the primeval genetic code

Sequence comparison: Score matrices

PROTEIN STRUCTURE AMINO ACIDS H R. Zwitterion (dipolar ion) CO 2 H. PEPTIDES Formal reactions showing formation of peptide bond by dehydration:

From DNA to protein, i.e. the central dogma

Supplementary Information for

Fundamental mathematical structures applied to physics and biology. Peter Rowlands and Vanessa Hill

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Translation. A ribosome, mrna, and trna.

Translation and the Genetic Code

AP Biology Spring Review

SUPPORTING INFORMATION FOR. SEquence-Enabled Reassembly of β-lactamase (SEER-LAC): a Sensitive Method for the Detection of Double-Stranded DNA

Protein structure. Protein structure. Amino acid residue. Cell communication channel. Bioinformatics Methods

Part 4 The Select Command and Boolean Operators

From Gene to Protein

2005 Fall Workshop on Information Theory and Communications. Bioinformatics. from a Perspective of Communication Science

Genome and language two scripts of heredity

C h a p t e r 2 A n a l y s i s o f s o m e S e q u e n c e... methods use different attributes related to mis sense mutations such as

Characterization of Pathogenic Genes through Condensed Matrix Method, Case Study through Bacterial Zeta Toxin

SUPPLEMENTARY DATA - 1 -

The Select Command and Boolean Operators

Six Fractal Codes of Biological Life: Perspectives in Astrobiology and Emergence of Binary Logics

Range of Certified Values in Reference Materials. Range of Expanded Uncertainties as Disseminated. NMI Service

Lecture 5. How DNA governs protein synthesis. Primary goal: How does sequence of A,G,T, and C specify the sequence of amino acids in a protein?

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Protein Struktur. Biologen und Chemiker dürfen mit Handys spielen (leise) go home, go to sleep. wake up at slide 39

Crick s early Hypothesis Revisited

arxiv: v2 [physics.bio-ph] 8 Mar 2018

Translation - Prokaryotes

Molecular Evolution and Phylogenetic Analysis

Periodic Table. 8/3/2006 MEDC 501 Fall

Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective

Transcription:

A modular Fibonacci sequence in proteins P. Dominy 1 and G. Rosen 2 1 Hagerty Library, Drexel University, Philadelphia, PA 19104, USA 2 Department of Physics, Drexel University, Philadelphia, PA 19104, USA Abstract. Protein-fragment seqlets typically feature about 10 amino acid residue positions that are fixed to within conservative substitutions but usually separated by a number of prescribed gaps with arbitrary residue content. By quantifying a general amino acid residue sequence in terms of the associated codon number sequence, we have found a precise modular Fibonacci sequence in a continuous gap-free 10-residue seqlet with either 3 or 4 conservative amino acid substitutions. This modular Fibonacci sequence is genuinely biophysical, for it occurs nine times in the SWISS-Prot/TrEMBL database of natural proteins. Introduction. In a model-theoretic context of long-range correlations in DNA, Roche et al. [5] have studied Fibonacci polygc quasiperiodic sequences based on the inflation rule G GC and C G, which gives successively G, GC, GCG, GCGGC, GCGGCGCG,... for sequences of length 1, 2, 3, 5, 8,... respectively. It appears that such Fibonacci sequences can be used as prototypes for strongly correlated DNA segments as long as ~160 nm (i.e., ~450 bp). More than fifty years ago, Weyl [9] showed that the symmetry in biological phenotypes can be quantified by basic group theory, as shown for example in Fig. 1. That Fibonacci and other symmetries may exist in the sequences of DNA s, RNA s and certain proteins is suggested by the recent findings of motif pattern sequences the so-called seqlets [3, 4] that occur Present address for correspondence: 415 Charles Lane, Wynnewood, PA 19096, USA. 1

repeatedly in proteins. Typically featuring about 10 amino acid residue positions that are fixed to within conservative substitutions but usually separated by a number of prescribed gaps with arbitrary residue content, the existence of protein-fragment seqlets poses an intriguing question: Do certain seqlets have a sequence with a symmetry precisely quantifiable by group theory? This question is answered affirmatively in the present communication. We have found a precise modular Fibonacci sequence in a continuous gap-free 10-residue seqlet with either 3 or 4 conservative amino acid substitutions. Unambiguous statistical significance is attached to this modular Fibonacci seqlet, for it occurs nine times in the SWISS-Prot/TrEMBL database [1,2] of natural proteins. As in many cases of phenotypic symmetry [9], cyclic group theory here again provides the medium for the analysis. Nucleotide and codon representative cyclic groups. We start with a purely physical C 4 (cyclic group of order 4) representation for the nucleotides. As shown in Table 1, the number of oxygen and nitrogen atoms n o and n N along with the molecular ring numbers r engender the complex number z = ( ( representation of i) n O 1) n N r for each nucleotide. Constituting a C 4, the four complex numbers have the form z = exp(i nπ / 2) where [6,7] n(u) = 0, n(c) = 1, n(a) = 2 and n(g) = 3. Note that under the complementarity transpositions U A, C G, we have z z for all z. Next, we make the biologically appropriate foliation of the C 4 cyclic group for the nucleotides into a C 64 (cyclic group of order 64) representation for the codons. Let 5' (N N N + ) 3' denote a generic codon with the successive triplet of nucleotides N, N, N +. Then the codon s C 64 representative is defined by 2

Z = ( z ) 1/4 z ( z + ) 1/16 = exp (in π / 32) (1) in which z, z, z + are the C 4 representatives of the respective nucleotides and the principalbranch roots are taken by the fractional exponents on z and z + in the middle member of (1). The codon number N introduced by implicit definition in the final member of (1) is therefore given by [6,7] N = 4 n + 16n + n + (2) Thus, for example, with 5' (N N N + ) 3' = AUG for Methionine, we have N = 4n(A) + 16n(U) + n(g) = 11. Observe that N runs from 0 to 63 along with the codons in the natural progression of the universal genetic code (see Table 2). Finally, the code itself assigns one of the 20 amino acids or the stop signal to each codon, and thus concomitantly to its codon number N. Hence, a value for N implies an amino acid or the stop signal (the latter, in the cases N = 34, 35 or 50). Moreover, a sequence of Z s of the form (1), or equivalently a sequence of N values, implies a specific polypeptide which may or may not be a continuous gap-free seqlet. Modular Fibonacci sequences. Now let Z k 1, Z k, Z k+1 denote the C 64 representatives of three successive codons in an mrna. Suppose that through the formation mechanism of the mrna the representative Z k+1 depends on its two predecessors Z k, Z k 1 C 64 C 64 C 64 in a symmetrical-function manner over a range of integer k values, thereby representing a grouptheoretic direct-product surjection of the form (k) x (k 1) (k +1). The only uniformly consistent way that this can happen is for the C 64 representatives to satisfy Z k+1 = Z k Z k 1 for all k in the range, a relation equivalent to the modular Fibonacci sequence condition 3

N k+1 = ( N k + N k 1 ) (mod 64) (3) Here mod 64 instructs one to subtract 64 from the sum if it is greater than 63, for Z in (1) is unchanged by N (N 64) and N ranges from 0 to 63 by definition in (2). A sequence of N k s (to which there is an associated polypeptide) follows from (3) if one prescribes N 1, N 2 and the range of k. As a prime example, consider the biological sequence that features N 1 = 11 for M (see Table 2), the start signal for a protein, and N 2 = 33, which is M s partner under nucleotide complementarity: N 1 = 11 AUG UAC N 2 = 33. Then (3) employed recursively produces the sequence 11 33 44 13 57 6 63 5 4 9 13 22 35 Note that we have terminated the sequence at k = 13, because N 13 = 35 signals stop. By using Table 2 the sequence translates into the polypeptide M Y D V S L G L L I V P (stop) which finally yields the modular Fibonacci seqlet Y D V S L G L L I V (4) In (4) we have deleted the M (interpreted as a start signal) and the P (for terminal deletion symmetry). The standard equivalence classes for amino acid conservative substitutions are {F, Y}, {L, I, M, V}, {S, T}, {P}, {A, G}, {H}, {Q, N}, {K. R}, {D. E}, {C}, [W}, as permitted by the SWISS-Prot/TrEMBL BLAST program [1,2]. With (4) as the query, the SWISS- Prot/TrEMBL database (containing 137,811 non-redundant sequences with 50,431,180 total letters when accessed) identified the following nine protein segments as biological equivalents to (4): >gi 3122441 sp 021077 NU1M_MYXGL NADH-ubiquinone oxidoreductase chain 1 Length = 318 Identities = 7/10 (70%), Positives = 10/10 (100%) 4

Y+V+LGL+IV Subject: 144 YEVTLGLMIV 153 >gi 266658 sp P29920 NQO8_PARDE NADH-quinone oxidoreductase chain 8 (NADH dehydrogenase: NDH-1, chain 8) Length = 345 Identities = 7/10 (70%), Positives = 10/10 (100%) Y+VSLGL+I+ Subject: 157 YEVSLGLIII 166 >gi 1171804 sp P42032 NUOH_RHOCA NADH-quinone oxidoreductase chain H (NADH dehydrogenase: NDH-1, chain H) Length = 345 Identities = 7/10 (70%), Positives = 10/10 (100%) Y+VS+GL+IV Subject: 157 YEVSMGLIIV 166 >gi 6647674 sp Q9ZCF7 NUOH_RICPR NADH-quinone oxidoreductase chain H (NADH dehydrogenase: NDH-1, chain H) Length = 339 Identities = 6/10 (60%), Positives = 10/10 (100%) Y+VS+GL+I+ Subject: 157 YEVSMGLVII 166 >gi 1352538 sp P48899 NUIM_CYACA NADH- ubiquinone oxidoreductase chain 1 Length = 344 Identities = 6/10 (60%), Positives = 10/10 (100%) Y+VS+GL+I+ Subject: 165 YEVSIGLIII 174 >gi 14194971 sp 079546 NUIM_DINSE NADH- ubiquinone oxidoreductase chain 1 Length = 321 Identities = 6/10 (60%), Positives = 10/10 (100%) Y+V+LGL+I+ Subject: 148 YEVTLGLIII 157 >gi 2499233 sp Q96182 NUIM_POLOR NADH- ubiquinone oxidoreductase chain 1 Length = 319 Identities = 6/10 (60%), Positives = 10/10 (100%) Y+V+LGL+I+ Subject: 146 YEVTLGLIII 155 >gi 462756 sp P26845 NUIM_MARPO NADH-ubiquinone oxidoreductase chain 1 Length = 328 Identities = 6/10 (60%), Positives = 10/10 (100%) Y+VS+GLL++ 5

Subject: 150 YEVSIGLLLI 159 >gi 3122451 sp Q37556 NUIM_METSE NADH-ubiquinone oxidoreductase chain 1 Length = 334 Identities = 6/10 (60%), Positives = 10/10 (100%) Y+VS+GL+I+ Subject: 157 YEVSIGLIII 166 Statistical significance. Observe that (3) prescribes the 9 amino acids that follow Y in (4), with N 1 = 11 and N 2 = 33. For an incidentally random sequence, the probability of a single appearance of a biological equivalent of (4) with Y in the first position and j conservative substitutes in the subsequent nine positions is given approximately by φ j = (9!/(9 j)! j!) (.05 ) 10 j (.11 ) j (5.04 10 7 ), in which (.05) is the average probability for any one of the 20 amino acids, (.11) is the average probability for an amino acid or one of its biological equivalents, and (5.04 10 7 ) is the approximate number of total letters, i.e., chances for a hit in the database; for j = 3 and 4 we therefore have φ 3 = 4.40 10 3 and φ 4 =1.45 10 2. Thus, even with suitable latitude given to evolutionary correlations between the SWISS-Prot/TrEMBL protein fragments cited above, it follows that the incidental random-sequence composite probability for three j = 3 and six j = 4 conformances is small enough to preclude an incidental origin for the accord with virtual certainty. Therefore, the modular Fibonacci seqlet shown in (4) can be deemed to be genuinely biophysical. A possible dynamical origin for modular Fibonacci seqlets. In conclusion, it should C 4 C 64 be emphasized that the nucleotide group, the codon group, and hence a modular Fibonacci sequence in proteins have a precise biophysical underpinning in terms of nucleotide and codon aerobic-atom content and molecular ring structure, as expressed by and n, n and Ο N 6

C 4 C 64 r. The z s of and the Z s of may perhaps also have a quantum physics origin as wavefunction prefactors that result from the nucleotide and codon production reactions. Consonant with their aerobic-atom content and molecular ring structure, the z s may have originated dynamically as relative prefactors in the nucleotide groundstate wavefunctions; if so, the phase angles in the z, n π / 2, may be interpreted as residual Berry phase angles [8] from the nucleotide production process. Furthermore, if the Z k s are residual Berry phase factors from the codon production process, then the relation Z k+1 = Z k Z k 1 and its equivalent (3) minimize an energy of the form ε Re( Z k+1 k * Z k Z k 1 ), where the constant ε is the magnitude of an intercodon binding energy increment in the mrna groundstate. Irrespective of the underlying mechanism, however, the biophysical genuineness of the modular Fibonacci sequence is clearly indicated statistically. Acknowledgements. P.D. is grateful for the helpful communications received from Elisabeth Gasteiger on the BLAST program procedures. G. R. wishes to thank Robert Kuniewicz and Neville Kallenbach for their suggestions regarding the SWISS-Prot/TrEMBL database. The photograph of the Nautilus was kindly taken by Kevin Aires (courtesy of K. R. Design). 7

REFERENCES [1] ALTSCHUL S. F. et al., Nucleic Acids Res. 25 (1997) 3389; [2] BAIROCH A. et al., Nucleic Acids Res. 26 (1998) 38; interactive program at (http://www.ncbi.nih.gov./blast/blast.cgi). [3] HUYNH T., et al., Nucleic Acids Res. 31 (2003) 3645. (http://us.expasy.org/tools/blast/). [4] RIGOUTSOS I., et al., Proteins: Structure, Function and Genetics 37 (1999) 1 [5] ROCHE S. et al., Phys. Rev. Lett., 91 (2003) 228101. [6] ROSEN G., Bull. Math. Biol. 53 (1991) 845. [7] ROSEN G., Phys. Lett. A 253 (1999) 354. [8] SHAPERE A. and WILCZEK, F. eds. Geometric Phases in Physics (World Scientific Press, Singapore, 1989). [9] WEYL H., Symmetry (Princeton U. Press, 1952) in particular pp. 68 73. MCS: 92D20, 62P10 8

Figure 1 Half-shell of a pearly Nautilus in which the edge of the coil is a near-perfect logarithmic spiral [5], a curve such that the rates of rotation and dilatation are proportional [hence yielding θ ln (r / r o )]. The self-similar air chambers increase in diameter according to a fractionallymuted Fibonacci relation [viz., d k = (const) (f k ) 1/8 for the k th chamber, where f k = 1, 2, 3, 5, 8, 13, 21,... is the primary Fibonacci series]. 9

Table 1 The nucleotide C 4 representatives z = (i) n O ( 1) n N r. Here nο, n, and r are the N number of oxygen atoms, nitrogen atoms, and rings in the nucleotide molecule, the quantities that determine its z value. n r z n N Ο U, uracil 2 2 1 1 pyrimidines T, thymine 2 2 1 1 C, cytosine 1 3 1 i purines A, adenine 0 5 2 1 G, guanine 1 5 2 i 10

Table 2 The universal genetic code annotated with codon numbers. The code associates 61 mrna codons ordered triplets of the nucleotides U, C, A, G with the 20 lifeform amino acids. In particular, AUG for M also serves as a start signal for a protein, while UAA, UAG and UGA signal stop by not being associated with any amino acid. The codon number N [given by Eq. (2)] appears to the left of each respective codon, and the associated amino acid along with its letter symbol appears to the right of its codon correspondents. 0 UUU Phenylalanine 16 UCU 32 UAU Tyrosine 48 UGU Cysteine 1 UUC F 17 UCC Serine 33 UAC Y 49 UGC C 2 UUA 18 UCA S 34 UAA (stop) 50 UGA (stop) 3 UUG 19 UCG 35 UAG (stop) 51 UGG Tryptophan W 4 CUU Leucine 20 CCU 36 CAU Histidine 52 CGU 5 CUC L 21 CCC Proline 37 CAC H 53 CGC Arginine 6 CUA 22 CCA P 38 CAA Glutamine 54 CGA R 7 CUG 23 CCG 39 CAG Q 55 CGG 8 AUU 24 ACU 40 AAU Asparagine 56 AGU Serine 9 AUC Isoleucine 25 ACC Threonine 41 AAC N 57 AGC S 10 AUA I 26 ACA T 42 AAA Lysine 58 AGA Arginine 11 AUG Methionine M 27 ACG 43 AAG K 59 AGG R (start) 12 GUU 28 GCU 44 GAU Aspartic acid 60 GGU 13 GUC Valine 29 GCC Alanine 45 GAC D 61 GGC Glycine 14 GUA V 30 GCA A 46 GAA Glutamic acid 62 GGA G 15 GUG 31 GCG 47 GAG E 63 GGG 11