Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective


Jacobs University Bremen

Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective

Semester Project II
By: Dawit Nigatu
Supervisor: Prof. Dr. Werner Henkel
Transmission Systems Group (TrSyS)
School of Engineering and Science
October 2013

Abstract

This research contains two separate parts. In the first part, we used the classical multidimensional scaling (CMD) technique to scale down a 64-dimensional empirical codon mutation (ECM) matrix and a 20-dimensional chemical distance matrix to two dimensions (2-D). The 2-D plots of the ECM matrix show that most mutations occur between codons encoding the same amino acid, i.e., a change from one such codon to another does not change the amino acid produced. Furthermore, most of the highly probable inter-amino-acid mutations do not result in a dramatic change of chemical properties. However, comparing the 2-D plots of the ECM and chemical distance matrices reveals some inconsistencies, in which codons close to each other in mutation distance differ significantly in chemical properties. Such mutations could have severe effects, and the results therefore suggest that some protection mechanism is needed to counteract them. In addition, the arrangement of the amino acids is very much in line with the so-called Taylor classification. In the second part of the research, we investigated the relationship between Shannon and Boltzmann entropies using the complete genome sequence of the bacterium E. coli. There are positions at which parallel and antiparallel relationships exist. Around the terminus, the two entropies show opposite trends, with high Shannon and low Boltzmann entropy, meaning that the sequence is more random and at the same time less stable. In general, the Boltzmann entropy decreases as we move along the genome from the origin to the terminus.
Furthermore, in cooperation with a molecular biology colleague, we compared the entropies with the number of different types of functional genes (anabolic, catabolic, aerobic, and anaerobic) located at the same positions. We observed a strong similarity between the distribution of anabolic genes and the two entropies.

Contents

Abstract i
List of Figures iii

1 Introduction 1
  1.1 Basic Theoretical Background 1
    1.1.1 DNA 1
    1.1.2 The Central Dogma 1
  1.2 Organization of the Report 4

2 Dimension Reduction of Evolutionary and Chemical Distance Matrices 5
  2.1 Evolutionary Substitution and Chemical Distance Matrices 5
  2.2 Classical Multidimensional Scaling 6
  2.3 Result and Discussion 8

3 Relation Between Boltzmann and Shannon Entropy 11
  3.1 Introduction 11
  3.2 Boltzmann Entropy and Distribution 12
    3.2.1 Laws of Thermodynamics 12
      3.2.1.1 First Law of Thermodynamics 12
      3.2.1.2 Second Law of Thermodynamics 13
    3.2.2 Ideal Gas Law 13
    3.2.3 Entropy of a Gas 13
      3.2.3.1 Macroscopic View 13
      3.2.3.2 Microscopic View: Boltzmann Entropy 14
    3.2.4 Boltzmann Distribution 16
    3.2.5 Gibbs Entropy Formula 18
    3.2.6 Entropy of an Ideal Gas 19
  3.3 Entropy of the E. coli Genome 19
  3.4 Result and Discussion 20

4 Conclusions 26

A Additional Plots 27

Bibliography 29

List of Figures

1.1 The structure of DNA 2
1.2 Central dogma of molecular biochemistry with enzymes 3
1.3 Codon-amino acid encoding chart 3
2.1 2-D plot of the mutation distance matrix 9
2.2 2-D plot of the chemical distance matrix 9
2.3 Taylor classification of amino acids 10
2.4 3-D plot of the mutation distance matrix 10
3.1 Adiabatic expansion of a gas at constant temperature 14
3.2 Boltzmann and Shannon entropies of the E. coli genome, 2-bp blocks 20
3.3 Boltzmann and Shannon entropies of the E. coli genome, 3-bp blocks 21
3.4 Number of anabolic genes with Boltzmann and Shannon entropies 22
3.5 Number of anabolic genes with the difference of Boltzmann and Shannon entropies 22
3.6 Number of catabolic genes with Boltzmann and Shannon entropies 23
3.7 Number of catabolic genes with the difference of Boltzmann and Shannon entropies 23
3.8 Number of aerobic genes with Boltzmann and Shannon entropies 24
3.9 Number of aerobic genes with the difference of Boltzmann and Shannon entropies 24
3.10 Number of anaerobic genes with Boltzmann and Shannon entropies 25
3.11 Number of anaerobic genes with the difference of Boltzmann and Shannon entropies 25
A.1 Boltzmann and Shannon entropies of the E. coli genome, 4-bp blocks 27
A.2 Boltzmann and Shannon entropies of the E. coli genome, 5-bp blocks 28
A.3 Boltzmann and Shannon entropies of the E. coli genome, 6-bp blocks 28

Chapter 1

Introduction

1.1 Basic Theoretical Background

1.1.1 DNA

Deoxyribonucleic acid (DNA) is a double-stranded structure found in all cells, containing the genetic information of the living organism. It consists of building blocks called nucleotides. A nucleotide is made of a sugar-phosphate backbone and one of four nitrogenous bases attached to the sugar. These bases are called adenine, thymine, cytosine, and guanine (A, T, C, G). The nucleotides are linked together into chains, giving the DNA its double-helix structure. A figure showing the structure of the DNA is presented in Fig. 1.1. The two strands are complementary to each other: according to the Watson-Crick pairing rule, A is always paired with T and G is always paired with C [2]. This means that if we know the sequence of nucleotides on one strand, the sequence of the complementary strand is known right away. The base pairs are held together by hydrogen bonds. GC pairs have three hydrogen bonds, whereas AT pairs have two. The additional hydrogen bond makes GC pairs more stable than AT pairs.

1.1.2 The Central Dogma

Francis Crick [3] stated that the flow of biological information is from DNA towards proteins and called this principle the central dogma of molecular biology (Fig. 1.2). The sequence of bases in a segment of DNA, called a gene, carries the directions for building proteins that have special functions in the cell. Protein synthesis consists of two steps, transcription and translation. The RNA (ribonucleic acid) polymerase enzyme

Figure 1.1: The structure of DNA [1]. DNA molecules are found inside the cell's nucleus, tightly packed into chromosomes. Scientists use the term "double helix" to describe DNA's winding, two-stranded chemical structure. Alternating sugar and phosphate groups form the helix's two parallel strands, which run in opposite directions. Nitrogen bases on the two strands chemically pair together to form the interior "rungs" of the helix: the base adenine (A) always pairs with thymine (T), while guanine (G) always pairs with cytosine (C).

unwinds the DNA molecule and the transcription process begins. In transcription, the gene sequence is copied into messenger RNA (mRNA) using the template strand of the DNA. Messenger RNA is a single-stranded molecule similar to DNA, except that the base thymine (T) is replaced by uracil (U). In the translation phase, the ribosome translates the sequence of the mRNA molecule into amino acids, reading the sequence in groups of three bases (codons). There are 20 naturally occurring amino acids. The chart in Fig. 1.3 shows the codon-to-amino-acid translation. The process starts when the smaller ribosomal subunit attaches to the translation initiation site, usually AUG. Then, a transfer RNA (tRNA) binds to the mRNA. The tRNA contains an anticodon complementary to the mRNA codon to which it binds, and the corresponding amino acid is attached to it. Next, the large ribosomal subunit binds to create the P-site (peptidyl site) and the A-site (aminoacyl site). The first tRNA occupies the P-site and the second tRNA enters the A-site. After that, the tRNA at the P-site transfers the amino acid it carries to the second tRNA at the A-site and exits. Finally, the ribosome moves along the mRNA and the next tRNA enters. This

process will continue until a stop codon (UAG, UAA, or UGA) signals the end of the mRNA molecule. Lastly, the amino acids are connected by peptide bonds and the chain folds in a certain way to create a protein.

Figure 1.2: Central dogma of molecular biochemistry with enzymes [4].

Figure 1.3: Codon-amino acid encoding chart [5].
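The translation steps above can be sketched in code. The following is a toy illustration (not part of the report's methods) with a deliberately partial codon table; a complete table would list all 64 codons shown in Fig. 1.3:

```python
# Minimal sketch of ribosomal translation: read an mRNA in codons
# (triplets) from the start codon AUG until a stop codon.
# NOTE: this codon table is deliberately partial, for illustration only.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "GCU": "A", "GCC": "A", "GCA": "A", "GCG": "A",  # alanine
    "UGG": "W",                                      # tryptophan
    "UAA": "*", "UAG": "*", "UGA": "*",              # stop codons
}

def translate(mrna: str) -> str:
    """Translate an mRNA string into a one-letter amino acid string."""
    start = mrna.find("AUG")          # small ribosomal subunit finds AUG
    if start < 0:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "?")  # '?' = codon not in table
        if aa == "*":                  # stop codon ends translation
            break
        peptide.append(aa)
    return "".join(peptide)

print(translate("GGAUGGCUUGGUAAGC"))  # -> MAW
```

The reading frame is fixed by the first AUG, mirroring how the small ribosomal subunit locates the initiation site before codon-by-codon elongation begins.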

1.2 Organization of the Report

In Chapter 2, we first present the different types of evolutionary substitution matrices and the chemical distance matrix, followed by the mathematics behind classical multidimensional scaling. Then, the results of the dimension reduction are presented and discussed. In Chapter 3, derivations of the Boltzmann entropy and the Boltzmann distribution are given. Thereafter, the Shannon and Boltzmann entropies of the E. coli genome are computed, presented, and discussed. Finally, the conclusions are presented in Chapter 4.

Chapter 2

Dimension Reduction of Evolutionary and Chemical Distance Matrices

2.1 Evolutionary Substitution and Chemical Distance Matrices

There are several substitution matrices describing the mutational replacement of one amino acid by another inside protein sequences. The first of these is the point accepted mutation (PAM) matrix, which is obtained by counting the number of replacements and computing the mutation probabilities from a database of aligned protein sequences [6]. However, if the protein sequences lie on distant parts of the phylogenetic tree, the PAM matrix is not efficient. The BLOSUM matrix (Block Substitution Matrix), which uses blocks of aligned protein segments, overcomes this shortcoming [7]. A third model based on amino acid substitutions is the WAG substitution matrix [8]. The WAG matrix utilizes a large database of aligned proteins from different families and uses a maximum-likelihood technique to derive the substitution scores. The evolutionary models mentioned so far are based on amino acid substitutions. There are also models which describe codon-to-codon substitutions. One of them is the 64 × 64 empirical codon mutation (ECM) matrix proposed by Schneider et al. [9]. For developing the ECM matrix, 8.3 million aligned codons from five vertebrates were used to tally the number of substitutions and derive the mutational probabilities. Since transitions to stop codons are not considered, the matrix contains a block-diagonal

3 × 3 block for the three stop codons, separated from the 61 × 61 matrix of the remaining codons. The ECM matrix provides an extra edge by giving the transitions between codons encoding the same amino acid, in addition to transitions leading to different ones. Hence, we have used this matrix for the rest of our work.

Grantham's chemical distance matrix takes into account three chemical properties (composition, polarity, and molecular volume) which have a strong correlation with substitution frequencies. The matrix provides a means to quantify the difference between amino acids: the distance between two amino acids is computed by treating the three chemical properties as axes of a Euclidean space. We would like to compare how these chemical properties relate to the mutation probabilities. Since the matrices have 64 and 20 dimensions, we apply a dimension reduction technique to bring them down to 2 or 3 dimensions, for easy comparison and to see if some kind of clustering appears. More importantly, we would like to see the severity of mutational changes, which is visible in the chemical properties. For reducing the dimensions of the 64 × 64 ECM and 20 × 20 chemical distance matrices, we used a technique called classical multidimensional scaling (CMD), presented in the following section.

2.2 Classical Multidimensional Scaling

In this section, the mathematics behind the CMD technique is described, following [10]. Assume that we have observed an n × n Euclidean distance matrix D = [d_{ij}] derived from a raw n × p data matrix X. With CMD, the aim is to recover the original configuration of n points in p dimensions from the distance matrix. However, since distances are invariant to translation, rotation, and reflection, the original data cannot be fully retrieved. Define an n × n matrix B such that

    B = X X^T.    (2.1)

The elements of B are given by

    b_{ij} = \sum_{k=1}^{p} x_{ik} x_{jk}.    (2.2)

Similarly, since D is a distance matrix, the squared Euclidean distances can be written as

    d_{ij}^2 = \sum_{k=1}^{p} (x_{ik} - x_{jk})^2
             = \sum_{k=1}^{p} x_{ik}^2 + \sum_{k=1}^{p} x_{jk}^2 - 2 \sum_{k=1}^{p} x_{ik} x_{jk}
             = b_{ii} + b_{jj} - 2 b_{ij}.    (2.3)

At this point, if we can rewrite the b_{ij} in terms of the d_{ij}, then X can be derived from B. However, unless a location constraint is introduced, a unique solution for B in terms of D cannot be found. Commonly, the centroid of the columns of X is placed at the origin, i.e.,

    \sum_{i=1}^{n} x_{ik} = 0, \quad \forall k.    (2.4)

This constraint also implies that the sum of the terms in any row of B is zero. Let T be the trace of B and observe that

    \sum_{i=1}^{n} d_{ij}^2 = T + n b_{jj},    (2.5)
    \sum_{j=1}^{n} d_{ij}^2 = n b_{ii} + T,    (2.6)
    \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^2 = 2 n T.    (2.7)

Solving for b_{ij},

    b_{ij} = -\frac{1}{2} \left( d_{ij}^2 - \frac{1}{n} \sum_{j=1}^{n} d_{ij}^2 - \frac{1}{n} \sum_{i=1}^{n} d_{ij}^2 + \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij}^2 \right).    (2.8)

Applying a singular value decomposition (SVD) to the symmetric matrix B,

    B = V \Lambda V^T = V \Lambda^{1/2} \Lambda^{1/2} V^T.    (2.9)

Using only the 2 (or 3) largest eigenvalues, λ_1 and λ_2, and the corresponding eigenvectors u_1 and u_2, we obtain

    X = V_1 \Lambda_1^{1/2},    (2.10)

where \Lambda_1 = diag(λ_1, λ_2) and V_1 = [u_1 u_2].
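The double-centering step of Eq. (2.8) can be checked numerically. The sketch below is illustrative only (the toy points are arbitrary): it recovers B = X_c X_c^T from the squared distances alone; a complete CMD would then eigendecompose B as in Eqs. (2.9)-(2.10).

```python
# Numerical check of the double-centering step (Eq. 2.8): recover
# B = Xc Xc^T from the squared Euclidean distance matrix alone,
# where Xc is the column-centered data. Pure Python, toy data only.
X = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
n = len(X)

# Center the columns (Eq. 2.4) so that B is uniquely determined.
means = [sum(p[k] for p in X) / n for k in range(2)]
Xc = [[p[k] - means[k] for k in range(2)] for p in X]

# Squared distance matrix computed from the raw points.
D2 = [[sum((a - b) ** 2 for a, b in zip(X[i], X[j])) for j in range(n)]
      for i in range(n)]

row = [sum(D2[i]) / n for i in range(n)]            # row means
grand = sum(row) / n                                # grand mean
B = [[-0.5 * (D2[i][j] - row[i] - row[j] + grand)   # Eq. (2.8)
      for j in range(n)] for i in range(n)]

# B should equal Xc Xc^T (Eq. 2.1).
B_direct = [[sum(Xc[i][k] * Xc[j][k] for k in range(2)) for j in range(n)]
            for i in range(n)]
err = max(abs(B[i][j] - B_direct[i][j]) for i in range(n) for j in range(n))
print(round(err, 12))  # -> 0.0
```

Because D is symmetric, the row and column means in Eq. (2.8) coincide, so a single vector of row means suffices.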

2.3 Result and Discussion

To apply the CMD method, we need to convert the mutation probabilities in the ECM matrix into some form of Euclidean distance measure. To do so, we assumed a Gaussian model and computed codon-based distances from the pairwise error probability expression

    P_{ij} = \frac{1}{2} \operatorname{erfc}\left( \frac{D_{ij}}{2\sigma} \right),    (2.11)

where σ is the standard deviation, assumed constant for all mutation distances. The two-dimensional (2-D) plots of the mutation and chemical distance matrices are shown in Figures 2.1 and 2.2, respectively. The codons encoding the same amino acid are bundled together. Also, the clusters of amino acids are mostly consistent with the Taylor classification shown in Fig. 2.3, which classifies amino acids based on their physicochemical properties [11]. From these observations we can deduce that most mutational changes will not lead to a significant change of chemical properties. However, there are also some inconsistencies, where small mutation distances come together with large chemical distances and vice versa. The results can thus also serve as a reference for applying some sort of protection to mutations that are both highly probable and chemically severe. The inconsistencies are listed below.

Large chemical distance but small mutation distance:
- C with all others
- G with E
- S with {P, T, A}
- {D, N} with E
- {D, N} with G
- {Q, H} with {W, Y}
- K with N

Small chemical distance but large mutation distance:
- {W, Y} with {F, L, M, I, V}
- {P, T, A} with {Q, H, R}
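Converting a mutation probability into a distance requires inverting Eq. (2.11). The standard library provides erfc but no inverse, so one illustrative option (not necessarily the report's implementation) is a bisection; σ = 1 here is an arbitrary placeholder value:

```python
import math

# Sketch of inverting Eq. (2.11): given a pairwise error probability
# P_ij = 0.5 * erfc(D_ij / (2 * sigma)), recover the distance D_ij by
# bisection, exploiting that erfc is monotonically decreasing.
# Illustrative only; sigma = 1.0 is arbitrary, not a value from the report.
def prob_to_distance(p: float, sigma: float = 1.0) -> float:
    lo, hi = 0.0, 100.0 * sigma           # erfc(50) is far below any p of interest
    for _ in range(200):                  # bisect to high precision
        mid = 0.5 * (lo + hi)
        if 0.5 * math.erfc(mid / (2.0 * sigma)) > p:
            lo = mid                      # probability still too high -> larger D
        else:
            hi = mid
    return 0.5 * (lo + hi)

d = prob_to_distance(0.01)
# Round trip: the recovered distance reproduces the input probability.
print(round(0.5 * math.erfc(d / 2.0), 6))  # -> 0.01
```

The function is valid for probabilities in (0, 0.5), since Eq. (2.11) gives P = 0.5 at zero distance and decreases towards zero as the distance grows.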

Chapter 2. Multidimensional Scaling (MDS) 9 80 60 40 20 CGC CGA AGG AGA CGT CGG AAG R AAA K CAC CAT CAG CAA Q H TGC TGG TGT W c 0 ATA CCG ATC AAT N P AAC CCT ATT 20 CCCCCA GTG GAG E AGC GAA GTA GTC TCC AGT TCT GAC GAT S V TCG GGG TCA GTT 40 ACG D GGA GGC ACA T ACT GGT ACC G GCC GCG GCT GCA 60 A 60 40 20 0 20 40 60 80 TAC TAT Y M ATG Figure 2.1: 2-D plot of the mutation distance matrix. F CTC TTC CTA TTT CTG CTT TTA TTG I L 120 100 C 80 60 40 20 0 20 S N G Q P T A V M Y L I F W 40 D E H R 60 K 100 80 60 40 20 0 20 40 60 80 100 Figure 2.2: 2-D plot of the chemical distance matrix.

Figure 2.3: Taylor classification of amino acids [12].

The CMD method works best if the eigenvalues used for reconstruction are very large compared to the unused ones. However, in our case the eigenvalues do not decay quickly, and hence the error of the 2-D representation is significant, with a root mean squared error of around half the mean distance. For this reason, we will try to improve the performance by applying a better dimension reduction and clustering method in future work. The 3-D plot of the ECM mutation matrix is shown in Fig. 2.4.

Figure 2.4: 3-D plot of the mutation distance matrix.

Chapter 3

Relation Between Boltzmann and Shannon Entropy

3.1 Introduction

DNA is a double sequence of nucleotides based on a 4-letter alphabet, adenine, thymine, cytosine, and guanine (A, T, C, G), in which the second sequence is complementary to the first. For such a sequence, the Shannon entropy gives an average measure of the information obtained from the distribution of the symbols (words) of the source. In addition, the order in which these four letters are arranged in the DNA is a major factor determining the stability of the DNA structure [13]. Hence, it is of interest to look into the information contained in the sequence of nucleotides along with the stability that comes with it. The Shannon block entropy for a block size of N symbols is given by

    H_N = -\sum_i P_i^{(N)} \log P_i^{(N)},    (3.1)

where P_i^{(N)} is the probability (relative frequency) of observing the i-th word of block size N. The entropy is maximal when all words occur with equal probability, and it is zero when one of the words occurs with probability one. In statistical mechanics and thermodynamics, the Boltzmann-Gibbs entropy has a form very similar to the Shannon entropy measure given in Eq. (3.1). However, it is scaled by the Boltzmann constant k, which gives this entropy a unit

of kcal/Kelvin, and the natural logarithm is used:

    S = -k \sum_i P_i^{(N)} \ln P_i^{(N)}.    (3.2)

Our aim is to apply the two entropy measures to the complete genome of Escherichia coli (E. coli) and to see how the entropies develop across the genome. Furthermore, we would like to compare them and figure out whether there is some relation between the two.

3.2 Boltzmann Entropy and Distribution

3.2.1 Laws of Thermodynamics

In this section, two laws of thermodynamics are presented, following [14].

3.2.1.1 First Law of Thermodynamics

For a system undergoing a process, the change in energy is equal to the heat added to the system minus the work done by the system; in other words, the energy of the universe is conserved. The change in internal energy of the system, dE, is given by

    dE = dQ - dW,    (3.3)

where dQ is the heat transferred into or out of the system and dW is the work done by or on the system. If the work is mechanical work done by an expanding or contracting gas, dW can be shown to be P dV, and the equation becomes

    dE = dQ - P dV.    (3.4)

The negative sign follows from the sign convention for work. The above equation is only valid if the pressure is constant throughout the process. Under such conditions, the heat transferred is called enthalpy (H), and the first law of thermodynamics can be written as

    dE = dH - P dV.    (3.5)

3.2.1.2 Second Law of Thermodynamics

The second law is about entropy, a quantity which describes the microscopic state of a system in equilibrium. If a thermally isolated system undergoes a change of state, its entropy always increases, i.e.,

    \Delta S \geq 0.    (3.6)

However, if the system is not thermally isolated and the change of state happens in a quasi-static fashion in which a heat dQ is absorbed, then

    dS = \frac{dQ}{T},    (3.7)

where T is the absolute temperature. Entropy has units of Joule/Kelvin or cal/Kelvin, and it is a state variable, i.e., it is independent of the path between the initial and final states.

3.2.2 Ideal Gas Law

The state of a gas is determined by its pressure (P), volume (V), and temperature (T) [14]. The ideal gas law is commonly stated as

    P V = n R T,    (3.8)

where n is the number of moles of gas and R is the universal gas constant (8.314 J/(K mol)). The ideal gas law can also be formulated as

    P V = N k T,    (3.9)

where N is the number of molecules in the gas and k is the Boltzmann constant.

3.2.3 Entropy of a Gas

3.2.3.1 Macroscopic View

Consider an isothermal, adiabatic expansion, i.e., one occurring at constant temperature and without exchange of heat between the system and its environment. Since the process is adiabatic and isothermal, dE = 0 and dT = 0 [15]. Using the laws of

thermodynamics (Eqs. (3.4) and (3.7)) and the ideal gas law (Eq. (3.9)),

    dQ = dE + P dV,    (3.10)
    T dS = \frac{N k T}{V} dV,    (3.11)
    dS = \frac{N k}{V} dV.    (3.12)

Integrating from the initial state to the final one, we obtain

    \Delta S = N k \ln \frac{V_2}{V_1}.    (3.13)

In this specific case the volume is doubled, V_2 = 2 V_1, so

    \Delta S = N k \ln 2.    (3.14)

Figure 3.1: Adiabatic expansion of a gas at constant temperature [15].

3.2.3.2 Microscopic View: Boltzmann Entropy

It was Boltzmann who first gave thermodynamic entropy a meaning in relation to the number of arrangements of the molecules, Ω [15]. In the above process, if we initially assume the number of molecules to be N and the number of arrangements of the molecules (the number of possible microscopic states) to be Ω, the final system will have 2^N Ω possible arrangements (each molecule can be either on the left or on the right). Let S_1 and S_2 be the entropies of the first and second subsystems, with Ω_1 and Ω_2 arrangements, respectively. The following proof is taken from [16]. The entropy of the combined system is

    S = S_1 + S_2,    (3.15)

and the number of arrangements Ω of the combined system is

    \Omega = \Omega_1 \Omega_2.    (3.16)
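As a worked instance of Eq. (3.14), assuming one mole of gas and standard values of the constants (an arbitrary illustration, not tied to the report's data):

```python
import math

# Worked instance of Eq. (3.14): entropy change when one mole of an
# ideal gas doubles its volume at constant temperature.
# For N = N_A molecules, dS = N k ln 2 = R ln 2.
R = 8.314                       # universal gas constant, J/(K mol)
N_A = 6.02214e23                # Avogadro's number, molecules per mole
k = R / N_A                     # Boltzmann constant, J/K

N = N_A                         # one mole of molecules
dS = N * k * math.log(2)        # Eq. (3.14)
print(round(dS, 3))             # -> 5.763 (J/K, i.e., R ln 2)
```

The macroscopic result is thus a few joules per kelvin per mole, even though each individual molecule contributes only k ln 2.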

Boltzmann postulated the entropy to be a function of Ω,

    S = f(\Omega).    (3.17)

Therefore, S_1 = f(\Omega_1), S_2 = f(\Omega_2), and

    f(\Omega_1 \Omega_2) = f(\Omega_1) + f(\Omega_2).    (3.18)

Differentiating with respect to Ω_1 leads to

    \frac{\partial f(\Omega_1 \Omega_2)}{\partial \Omega_1} = \frac{\partial f(\Omega_1 \Omega_2)}{\partial (\Omega_1 \Omega_2)} \Omega_2 = \frac{\partial f(\Omega_1)}{\partial \Omega_1},    (3.19)

and hence, writing Ω = Ω_1 Ω_2,

    \frac{\partial f(\Omega)}{\partial \Omega} \Omega_2 = \frac{\partial f(\Omega_1)}{\partial \Omega_1}.    (3.20)

Again differentiating with respect to Ω_2 yields

    \frac{\partial f(\Omega_1 \Omega_2)}{\partial \Omega_2} = \frac{\partial f(\Omega_1 \Omega_2)}{\partial (\Omega_1 \Omega_2)} \Omega_1 = \frac{\partial f(\Omega_2)}{\partial \Omega_2},    (3.21)

and hence

    \frac{\partial f(\Omega)}{\partial \Omega} \Omega_1 = \frac{\partial f(\Omega_2)}{\partial \Omega_2}.    (3.22)

Equating the two expressions for \partial f(\Omega) / \partial \Omega,

    \frac{1}{\Omega_2} \frac{\partial f(\Omega_1)}{\partial \Omega_1} = \frac{1}{\Omega_1} \frac{\partial f(\Omega_2)}{\partial \Omega_2},    (3.23)

    \Omega_1 \frac{\partial f(\Omega_1)}{\partial \Omega_1} = \Omega_2 \frac{\partial f(\Omega_2)}{\partial \Omega_2} = C,    (3.24)

where C is a constant by separation of variables (the left side depends only on Ω_1, the right side only on Ω_2). Integrating,

    S_1 = f(\Omega_1) = C \ln \Omega_1 + const,    (3.25)
    S_2 = f(\Omega_2) = C \ln \Omega_2 + const,    (3.26)
    S = C \ln \Omega_1 + C \ln \Omega_2 + const,    (3.27)
    S = S_1 + S_2.    (3.28)

Hence, with const = 0, we obtain

    S = C \ln \Omega.    (3.29)

The value of the constant C can be obtained by applying the postulate to the expansion of the gas depicted in Fig. 3.1:

    \Delta S = S_2 - S_1,    (3.30)
    \Delta S = C \ln(2^N \Omega) - C \ln \Omega,    (3.31)

    \Delta S = C N \ln 2.    (3.32)

Comparing with Eq. (3.14), we obtain C = k. The Boltzmann entropy therefore becomes

    S = k \ln \Omega.    (3.33)

3.2.4 Boltzmann Distribution

Consider an isolated system in which the energy E, the volume V, and the number of molecules N are fixed. The N molecules are arranged such that n_1 of them are in the first energy state (ε_1), n_2 in the second (ε_2), n_3 in the third, ..., and n_i in the i-th energy state (ε_i). The number of possible arrangements is

    \Omega = \binom{N}{n_1} \binom{N - n_1}{n_2} \binom{N - n_1 - n_2}{n_3} \cdots = \frac{N!}{\prod_i n_i!}.    (3.34)

When the system under consideration reaches equilibrium, the molecules will disperse and the number of possible arrangements will be maximal [16]. To find the most probable configuration of the molecules, we have to maximize Ω for fixed N and E (the reference for this section is [16]):

    maximize_{n_i} \Omega subject to \sum_i n_i = N, \sum_i n_i \epsilon_i = E.    (3.35)

Reformulating the constraints in terms of the probabilities P_i = n_i / N, the constraint \sum_i n_i = N can be replaced by \sum_i P_i = 1, and \sum_i n_i \epsilon_i = E by \sum_i P_i \epsilon_i = \bar{E}. Instead of Ω we can equivalently maximize ln Ω, and the problem becomes

    maximize_{P_i} \ln \Omega subject to \sum_i P_i = 1, \sum_i P_i \epsilon_i = \bar{E}.    (3.36)

For large N, Stirling's approximation gives

    \ln N! \approx N \ln N - N.    (3.37)
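The quality of Stirling's approximation (3.37) can be checked numerically; a small sketch:

```python
import math

# Numerical check of Stirling's approximation (Eq. 3.37):
# ln N! ~ N ln N - N, with the relative error shrinking as N grows.
# math.lgamma(N + 1) gives ln N! (up to floating-point precision).
errs = []
for N in (10, 100, 1000):
    exact = math.lgamma(N + 1)          # ln N!
    approx = N * math.log(N) - N        # Stirling's approximation
    errs.append(abs(exact - approx) / exact)
    print(N, round(errs[-1], 5))
```

Already at N = 1000 the relative error is below 0.1 percent, which is why the approximation is safe for molecule counts of order 10^23.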

Applying the approximation to $\ln \Omega$,
$$\ln \Omega = \ln N! - \sum_i \ln n_i! \tag{3.38}$$
$$\approx N \ln N - N - \sum_i n_i \ln n_i + \sum_i n_i \tag{3.39}$$
$$= -N \sum_i P_i \ln P_i. \tag{3.40}$$
Omitting the factor $N$, which has no effect on the maximization, and applying the method of Lagrange multipliers, Eq. (3.40) leads to the Lagrangian
$$L = -\sum_i P_i \ln P_i - \alpha \sum_i P_i - \beta \sum_i P_i \epsilon_i. \tag{3.41}$$
Setting $\partial L / \partial P_j = 0$,
$$-\ln P_j - 1 - \alpha - \beta \epsilon_j = 0, \qquad P_j = e^{-\alpha} e^{-\beta \epsilon_j}, \tag{3.42}$$
where the constant $-1$ has been absorbed into $\alpha$. Substituting into the constraints, with $Z = e^{\alpha}$,
$$\sum_j P_j = 1 \;\Rightarrow\; \frac{1}{Z} \sum_j e^{-\beta \epsilon_j} = 1, \qquad Z(\beta) = \sum_j e^{-\beta \epsilon_j}, \tag{3.43}$$
so that $P_j = \frac{1}{Z} e^{-\beta \epsilon_j}$. Therefore,
$$P_j = \frac{e^{-\beta \epsilon_j}}{\sum_j e^{-\beta \epsilon_j}}. \tag{3.44}$$
The constant $\beta$ can be shown to equal $\frac{1}{kT}$. One way to see this is to compare the average energy obtained from the Boltzmann distribution, $\frac{3}{2\beta}$, with the average kinetic energy of a molecule at equilibrium, $\frac{3kT}{2}$. Another way to derive $\beta = \frac{1}{kT}$ is as follows [17]. From the definition of temperature, we have
$$\frac{1}{T} = \left( \frac{\partial S}{\partial E} \right)_{V,N}. \tag{3.45}$$

Using Boltzmann's entropy definition, $S = k \ln \Omega$, and substituting Eq. (3.42) into Eq. (3.40),
$$S = k \ln \Omega = -kN \sum_i P_i \ln\!\left(e^{-\alpha} e^{-\beta \epsilon_i}\right) \tag{3.46}$$
$$= -kN \sum_i P_i \left(-\alpha - \beta \epsilon_i\right) \tag{3.47}$$
$$= kN\alpha \sum_i P_i + kN\beta \sum_i P_i \epsilon_i \tag{3.48}$$
$$= kN\alpha + kN\beta \frac{E}{N} \tag{3.49}$$
$$= kN\alpha + k\beta E, \tag{3.50}$$
$$\frac{1}{T} = \left( \frac{\partial S}{\partial E} \right)_{V,N} = k\beta \tag{3.51}$$
$$\Rightarrow \beta = \frac{1}{kT}. \tag{3.52}$$
Therefore, the Boltzmann distribution relating the energy and temperature to the microscopic properties is given by
$$P_j = \frac{e^{-\epsilon_j / kT}}{\sum_i e^{-\epsilon_i / kT}}. \tag{3.53}$$

3.2.5 Gibbs Entropy Formula

In the Boltzmann definition of entropy, at a fixed energy, all states resulting in an energy $E$ are assumed to be equally likely [15]. If the states of the thermodynamic system are not equally probable, Gibbs's definition of entropy applies, given by
$$S = -k \sum_i P_i \ln P_i, \tag{3.54}$$
where the sum runs over all microstates and $P_i$ is the probability that the system is in the $i$th microstate [18]. This definition, like Boltzmann's, is a fundamental postulate which explains the experimental facts accurately [18]. To see that this definition is the more general one, consider a system having $\Omega$ microstates. If all microstates are equally probable, i.e., $P_i = \frac{1}{\Omega}$, Eq. (3.54) results in
$$S = -k \sum_{i=1}^{\Omega} \frac{1}{\Omega} \ln \frac{1}{\Omega} = k \ln \Omega, \tag{3.55}$$

which is the Boltzmann definition of entropy.

3.2.6 Entropy of an Ideal Gas

From the first law of thermodynamics (given in Eq. (3.4)) we have
$$dQ = dE + p\,dV. \tag{3.56}$$
For any gas, the change in internal energy $dE$ depends on the change in temperature; thus $dE = n C_v\,dT$ for $n$ moles of gas, where $C_v$ is the molar specific heat¹ at constant volume. Using the ideal gas law, $p = nRT/V$,
$$dQ = n C_v\,dT + nRT \frac{dV}{V}, \tag{3.57}$$
$$\frac{dQ}{T} = n C_v \frac{dT}{T} + nR \frac{dV}{V}. \tag{3.58}$$
Integrating both sides leads to
$$S = n C_v \ln T + nR \ln V + \text{constant}. \tag{3.59}$$
Depending on the experimental conditions of the system, the change in entropy will differ [14]:
- If the process occurs at constant temperature, $\Delta S = nR \ln \frac{V_2}{V_1}$;
- if the process occurs at constant volume, $\Delta S = n C_v \ln \frac{T_2}{T_1}$; and
- if the process occurs at constant pressure, $\Delta S = n C_p \ln \frac{T_2}{T_1}$.

3.3 Entropy of the E. coli Genome

We have used the 4,639,221 base pair (bp) sequence of the E. coli K-12 strain. First, the data is rearranged to start at the origin of replication. Then, the entropy of chunks of the DNA sequence is computed for different block sizes (2 bp up to 6 bp) in non-overlapping windows of 100 kbp. For calculating the Boltzmann entropy, the stacking energies of base pairs obtained from [13] are used. All neighboring base pairs are considered; that is, if the nucleotide sequence is AGCT, the energies of AG, GC, and CT are taken into account.

¹ The specific heat is the amount of heat per unit amount of substance required to raise its temperature by one kelvin.
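To make the stacking-energy bookkeeping concrete, here is a minimal Python sketch. The energy values and function name are illustrative placeholders (the thesis uses the nearest-neighbor parameters of SantaLucia [13]), and since the exact construction is not spelled out in the text, this shows one plausible reading: form a Boltzmann distribution over the stacking energies of all neighboring base pairs in a sequence and take its Shannon-form entropy.

```python
import math

# Hypothetical nearest-neighbor stacking energies (kcal/mol). The thesis
# uses the values of SantaLucia [13]; these numbers are illustrative
# placeholders for the ten unique dinucleotide stacks.
STACK_ENERGY = {
    "AA": -1.00, "AT": -0.88, "TA": -0.58, "CA": -1.45,
    "GT": -1.44, "CT": -1.28, "GA": -1.30, "CG": -2.17,
    "GC": -2.24, "GG": -2.01,
}
# A stack and its reverse complement share the same stacking energy.
COMP = {"A": "T", "T": "A", "C": "G", "G": "C"}
for pair in list(STACK_ENERGY):
    rc = COMP[pair[1]] + COMP[pair[0]]
    STACK_ENERGY.setdefault(rc, STACK_ENERGY[pair])

KT = 0.0019872 * 310  # Boltzmann constant in kcal/(mol K) times T = 310 K

def boltzmann_entropy(seq):
    """Entropy (bits) of the Boltzmann distribution over the stacking
    energies of all neighboring base pairs in `seq` (AGCT -> AG, GC, CT)."""
    energies = [STACK_ENERGY[seq[i:i + 2]] for i in range(len(seq) - 1)]
    weights = [math.exp(-e / KT) for e in energies]
    z = sum(weights)  # partition function
    return -sum((w / z) * math.log2(w / z) for w in weights)
```

When all stacks have equal energy the distribution is uniform, so for example `boltzmann_entropy("AAAA")` (three identical AA stacks) gives $\log_2 3$.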

We have assumed that all nearest-neighbor pairs in the window are independent, and we postulate discrete states in which the probabilities of having the corresponding stacking energy are drawn from the Boltzmann distribution. We are aware that the Boltzmann distribution gives the most probable distribution of energy for states having a random distribution of energies (e.g., an ideal gas), which is not the case here; nevertheless, we use it to obtain a representation of stability (energy) in an expression that follows the structure of an entropy. The Boltzmann distribution for a state having a stacking energy $E_i$ at an absolute temperature $T$ is
$$P_i = \frac{e^{-E_i / kT}}{\sum_i e^{-E_i / kT}}. \tag{3.60}$$

3.4 Results and Discussion

The result for a block size of 2 bp and a window size of 100 kbp is shown in Fig. 3.2. The Boltzmann entropy is scaled down to the range of the Shannon entropy to ease visual comparison. Although we have not yet found a single general interpretation relating the two entropies, we can see opposite trends at some positions (e.g., windows 16 to 25) and parallel tendencies at others (e.g., windows 40 to 46). The plots for 3 bp, 4 bp, 5 bp, and 6 bp blocks are similar, showing that the entropies are more or less invariant under a change of block size. Hence, from now on, results with a block size of 3 bp will be plotted; the plots for 4 bp, 5 bp, and 6 bp are presented in Appendix A.

Figure 3.2: Boltzmann and Shannon entropies of the E. coli genome, 2 bp blocks (window size 100 kbp; x-axis: window number, y-axis: entropy).
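The Shannon side of the comparison is standard empirical block entropy. A minimal sketch, assuming overlapping k-mers within each 100 kbp window (the text does not state whether the blocks overlap, and the function names are ours):

```python
import math
from collections import Counter

def shannon_block_entropy(seq, block=2):
    """Empirical Shannon entropy (bits) of the `block`-mers in `seq`."""
    counts = Counter(seq[i:i + block] for i in range(len(seq) - block + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def windowed_entropies(genome, window=100_000, block=2):
    """Block entropy of consecutive non-overlapping windows, one value per
    window, as plotted against window number in Fig. 3.2."""
    return [shannon_block_entropy(genome[i:i + window], block)
            for i in range(0, len(genome) - window + 1, window)]
```

A uniform sequence gives zero entropy, while a window in which all blocks occur equally often attains the maximum $2 \cdot \text{block}$ bits.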

Figure 3.3: Boltzmann and Shannon entropies of the E. coli genome, 3 bp blocks (window size 100 kbp; x-axis: window number, y-axis: entropy).

Once the results for the Shannon and Boltzmann entropies were obtained, we discussed them with molecular biology colleagues. As a result, we decided to examine how the entropies relate to the numbers of genes in four functional classes, namely anabolic, catabolic, aerobic, and anaerobic. Our colleagues also provided us with data on the distribution of these gene classes along the genome. We used a 500 kbp sliding window, starting with the origin of replication as the center of the first window, and slid it 4 kbp at a time across the complete genome. The number of genes of the corresponding functional class is then plotted along with the Shannon and Boltzmann entropies, or their difference. The results are presented in Figures 3.4 to 3.11. Interestingly, from Fig. 3.4 we observe that the shapes of the Boltzmann entropy curve and the number of anabolic genes are strongly related, which suggests that the stability depends on the number of anabolic genes present. Likewise, the distribution of the aerobic genes shows a pattern similar to the difference of the entropies, as shown in Fig. 3.9. All in all, even if there is no straightforward relationship between some of the curves, there seems to be a hidden structure which we should analyze further together with our molecular genetics colleagues.
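The 500 kbp / 4 kbp sliding-window gene counting described above can be sketched as follows. The function name is ours, and we assume the window wraps around the circular chromosome, since the first window is centered on the origin:

```python
def sliding_gene_counts(gene_positions, genome_len, window=500_000, step=4_000):
    """Count genes of one functional class in each sliding window across a
    circular genome. `gene_positions` holds chromosomal coordinates (e.g.
    gene start sites); the first window is centered on position 0 (oriC)."""
    half = window // 2
    counts = []
    for center in range(0, genome_len, step):
        lo, hi = center - half, center + half
        # A position matches directly or after shifting by one full
        # genome length in either direction (circular wrap-around).
        n = sum(1 for p in gene_positions
                if lo <= p < hi
                or lo <= p - genome_len < hi
                or lo <= p + genome_len < hi)
        counts.append(n)
    return counts
```

Plotting each class's counts against the window centers, together with the entropies of the same windows, reproduces the layout of Figures 3.4 to 3.11.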

Figure 3.4: Number of anabolic genes with Boltzmann and Shannon entropies (500 kbp sliding window, 3 bp blocks; x-axis: chromosomal position with oriC and Ter marked; y-axes: entropy and number of genes).

Figure 3.5: Number of anabolic genes with the difference of the Boltzmann and Shannon entropies (500 kbp sliding window, 3 bp blocks; x-axis: chromosomal position with oriC and Ter marked).

Figure 3.6: Number of catabolic genes with Boltzmann and Shannon entropies (500 kbp sliding window, 3 bp blocks; x-axis: chromosomal position with oriC and Ter marked; y-axes: entropy and number of genes).

Figure 3.7: Number of catabolic genes with the difference of the Boltzmann and Shannon entropies (500 kbp sliding window, 3 bp blocks; x-axis: chromosomal position with oriC and Ter marked).

Figure 3.8: Number of aerobic genes with Boltzmann and Shannon entropies (500 kbp sliding window, 3 bp blocks; x-axis: chromosomal position with oriC and Ter marked; y-axes: entropy and number of genes).

Figure 3.9: Number of aerobic genes with the difference of the Boltzmann and Shannon entropies (500 kbp sliding window, 3 bp blocks; x-axis: chromosomal position with oriC and Ter marked).

Figure 3.10: Number of anaerobic genes with Boltzmann and Shannon entropies (500 kbp sliding window, 3 bp blocks; x-axis: chromosomal position with oriC and Ter marked; y-axes: entropy and number of genes).

Figure 3.11: Number of anaerobic genes with the difference of the Boltzmann and Shannon entropies (500 kbp sliding window, 3 bp blocks; x-axis: chromosomal position with oriC and Ter marked).

Chapter 4

Conclusions

A comparison between the chemical properties of amino acids and the mutation probabilities of codons was carried out using the classical multidimensional scaling method. The results showed that most of the highly probable mutations do not lead to a dramatic change in chemical properties. However, some inconsistencies were also observed; thus, further study of the severity of these mutations and of possible protection mechanisms that counteract their effects is required. In addition, the error introduced by representing 64-dimensional data in two dimensions is significant, owing to the slow decay of the eigenvalues of the data. Therefore, a dimension reduction and clustering method with better performance could be applied in future work.

Our second task was to look into the relationship between the Shannon and Boltzmann entropies. We have seen that, even though we have not yet found suitable interpretations, at some positions they follow the same pattern, while at others they tend to move in opposite directions. We further investigated how both entropies relate to the functional classes of genes located at the same positions in the genome. We found interesting correlations, especially with the distribution of anabolic genes.

Appendix A

Additional Plots

Figure A.1: Boltzmann and Shannon entropies of the E. coli genome, 4 bp blocks (window size 100 kbp; x-axis: window number, y-axis: entropy).

Figure A.2: Boltzmann and Shannon entropies of the E. coli genome, 5 bp blocks (window size 100 kbp; x-axis: window number, y-axis: entropy).

Figure A.3: Boltzmann and Shannon entropies of the E. coli genome, 6 bp blocks (window size 100 kbp; x-axis: window number, y-axis: entropy).

Bibliography

[1] Deoxyribonucleic acid (DNA). [Online]. Available: http://www.genome.gov/25520880
[2] J. D. Watson, F. H. Crick et al., "Molecular structure of nucleic acids," Nature, vol. 171, no. 4356, pp. 737-738, 1953.
[3] F. H. Crick, "On protein synthesis," in Symposia of the Society for Experimental Biology, vol. 12, 1958, p. 138.
[4] Central dogma of molecular biochemistry with enzymes. [Online]. Available: http://en.wikipedia.org/wiki/File:Central_Dogma_of_Molecular_Biochemistry_with_Enzymes.jpg
[5] More non-random DNA wonders. [Online]. Available: http://iaincarstairs.wordpress.com/2011/12/26/more-non-random-dna-wonders/
[6] M. Dayhoff, R. Schwartz, and B. Orcutt, "A model of evolutionary change in proteins," in Atlas of Protein Sequence and Structure, M. O. Dayhoff, Ed., vol. 5, p. 345, 1978.
[7] S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proceedings of the National Academy of Sciences, vol. 89, no. 22, pp. 10915-10919, 1992.
[8] S. Whelan and N. Goldman, "A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach," Molecular Biology and Evolution, vol. 18, no. 5, pp. 691-699, 2001.
[9] A. Schneider, G. M. Cannarozzi, and G. H. Gonnet, "Empirical codon substitution matrix," BMC Bioinformatics, vol. 6, no. 1, p. 134, 2005.
[10] S. W. Cheng, Multidimensional scaling (MDS). [Online]. Available: http://www.stat.nthu.edu.tw/~swcheng/teaching/stat5191/lecture/06_MDS.pdf
[11] W. R. Taylor, "The classification of amino acid conservation," Journal of Theoretical Biology, vol. 119, no. 2, pp. 205-218, 1986.

[12] Amino acids Venn diagram. [Online]. Available: http://commons.wikimedia.org/wiki/File:Amino_Acids_Venn_Diagram.png
[13] J. SantaLucia, "A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics," Proceedings of the National Academy of Sciences, vol. 95, no. 4, pp. 1460-1465, 1998.
[14] F. Reif, Fundamentals of Statistical and Thermal Physics, international student ed. McGraw-Hill, 1985.
[15] W. Allison, Lecture notes on statistical physics. [Online]. Available: http://www-sp.phy.cam.ac.uk/~wa14/camonly/statistical/lecture2.pdf
[16] A. Huan, Course notes on statistical mechanics. [Online]. Available: http://www.spms.ntu.edu.sg/pap/courseware/statmech.pdf
[17] J. Saunders, Classical and statistical thermodynamics. [Online]. Available: http://personal.rhul.ac.uk/uhap/027/ph2610/ph2610_files/sect2.pdf
[18] M. Evans, Statistical physics section 1: Information theory approach to statistical mechanics. [Online]. Available: http://www2.ph.ed.ac.uk/~mevans/sp/sp1.pdf