
Information Tutorial

Dr. Michael D. Rice
Emeritus Professor of Computer Science
Wesleyan University
Middletown, CT

Version 1.0, October 2005 (revised June 2018)

1. Introduction

The notion of entropy was used by Shannon in his pioneering work [Sha] on communication theory. Since that time, the fields of information theory and coding theory have been extensively developed. In most sources, the basic ideas are introduced as background for presenting Shannon's fundamental theorems. More recently, researchers have applied the notions of entropy and information in many new contexts including machine learning [Qui], molecular biology [SSGE], [MBHSWF], [SS], statistics [Kul], and combinatorial optimization and simulation [RK].

The aim of this short tutorial is to provide a general introduction to the ideas of uncertainty and information, including the notions of relative entropy and individual information. Most of the material can be found in one of the references listed in Section 12, but I've reorganized the material using a consistent notation. Section 11 discusses some of the contributions found in the references. Much of the discussion in the tutorial centers on sequences that are meaningful in biological contexts, but for our purposes the sequences simply represent convenient examples to illustrate the general ideas. In particular, we do not present biological applications of the ideas. However, this type of application is found in many of the references including [BTS], [CK], [MBHSWF], [SSGE], [SS], and [WR].

The document is divided into the following sections. You can access an individual section by using the first character hyperlink.

List of Sections
02. Notation
03. Uncertainty and Information
04. Sequence Alignments
05. Information Profiles
06. Individual Information
07. Generalized Individual Information
08. Relative Entropy
09. Mutual Information
10. Sampling Corrections
11. Notes
12. References

For convenience, the definitions and facts found in the tutorial are summarized below with first character hyperlinks to the actual text.

Definitions
01. Maximum number of bits of uncertainty
02. Bits of uncertainty in a symbol
03. Entropy of a sequence
04. Information content of a sequence
05. Information profile of a sequence alignment
06. Cumulative information of a sequence alignment
07. Bits of information in a symbol
08. Individual information score of a sequence
09. Generalized individual information score of a sequence
10. Relative entropy of a sequence
11. Relative entropy score of a sequence
12. Joint entropy of a pair of sequences
13. Mutual information of a pair of sequences

Facts
01. Upper bound on entropy
02. Average of individual information scores is cumulative information
03. Relative entropy is non-negative and unbounded
04. Mutual information is difference of joint entropy and entropies of positions

2. Notation

For our purposes, an alphabet Σ is simply a non-empty finite set of symbols, and N = |Σ|, where |X| denotes the cardinality of a set X. Familiar alphabets include the ASCII characters, the binary symbols B = {0, 1}, the DNA symbols D = {A, C, G, T}, the RNA symbols R = {A, C, G, U}, and the symbols that represent amino acids A = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.

For each alphabet Σ, Σ* denotes all finite sequences (strings) consisting of symbols in Σ. For each s ∈ Σ*, |s| denotes the number of elements in s (the length of s). For each non-negative integer n, Σ^n denotes all sequences (strings) of length n consisting of symbols in Σ. If |s| = n, then for each 1 ≤ k ≤ n, sk denotes the kth element of s, or in array notation sk = s[k].

Here are some examples that illustrate the definitions. For the alphabet D, N = 4, and for the alphabet A, N = 20. The set B^8 denotes all binary strings of length 8. The set D^n denotes all DNA sequences of length n. The set A* denotes all finite sequences of amino acids, including the empty sequence that contains no symbols. If s = AUG, then |s| = 3. If s = 1010, then s2 = s4 = 0 and s1 = s3 = 1.

Given s ∈ Σ^n and a symbol α ∈ Σ, the frequency of occurrence of α is defined as

fα = |{k : sk = α}| / n.

For example, if s ∈ B^5 contains two 0's and three 1's, then f0 = 2/5 and f1 = 3/5. Similarly, if s = AGGUGA ∈ R^6, then fA = 1/3, fC = 0, fG = 1/2, and fU = 1/6.

3. Uncertainty and Information

Given a sequence s ∈ Σ*, we can quantify its uncertainty or randomness by using an idea from information theory called entropy. This idea is based on computing the number of bits of uncertainty per symbol in the sequence. Alternately, this can be interpreted as the number of binary (yes or no) decisions needed to determine the occurrence of a particular symbol. The idea is illustrated by the following example.

Suppose s is a DNA sequence containing at most four different types of symbols. Each symbol can be classified as either an A or G (purine) or a C or T (pyrimidine) by making exactly one decision or, equivalently, by using one bit (e.g. 0 = purine, 1 = pyrimidine). Since each group contains two symbols, determining the specific symbol also requires exactly one additional decision or bit. Thus a maximum of 2 bits per symbol is needed to classify it. On the other hand, if we know that only purines occur in the sequence s, then at most 1 bit per symbol is needed to classify it. Finally, if s contains exactly one type of symbol, such as an A, there is no uncertainty, so 0 bits per symbol are required.

As another example, in a protein sequence, at most 20 different types of amino acids may occur. To distinguish an individual amino acid (symbol) requires at most five decisions. For example, the first decision (bit) distinguishes 10 of the symbols from the other 10, while the second decision distinguishes 5 of the remaining symbols from the other 5. An additional decision classifies the remaining 5 symbols into two groups, one of size 3 and the other of size 2, and so on, until one symbol is selected. This is analogous to the construction of a decision tree structure.

Definition 1 (def) For each s ∈ Σ*, the maximum number of bits of uncertainty per symbol is log2(N), where N = |Σ|.

If s ∈ Σ* contains N distinct symbols, each occurring with frequency f = 1/N, then the maximum number of bits per symbol can be rewritten as

log2(N) = log2(1/f) = −log2(f).

This formula leads to the following definition:

Definition 2 (def) For each s ∈ Σ*, the symbol α contributes −log2(fα) bits of uncertainty, where fα denotes its frequency of occurrence in s and fα > 0.

For example, if s is a DNA sequence and fA = fG = 1/2, then each of the symbols A or G contributes −log2(1/2) = 1 bit of uncertainty. We don't assign uncertainty values to C or T since fC = fT = 0.
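These per-symbol quantities are easy to compute directly. The following short Python sketch (my own illustration, not code from the tutorial; the function names are arbitrary) tabulates the frequencies fα of a sequence and the −log2(fα) bits of uncertainty contributed by each symbol that occurs.

    from collections import Counter
    from math import log2

    def frequencies(s):
        """Map each symbol of s to its frequency of occurrence f_alpha."""
        n = len(s)
        return {sym: count / n for sym, count in Counter(s).items()}

    def symbol_uncertainty(s):
        """-log2(f_alpha) for each symbol that occurs in s (Definition 2)."""
        return {sym: -log2(f) for sym, f in frequencies(s).items()}

    # Example from Section 2: s = AGGUGA gives fA = 1/3, fG = 1/2, fU = 1/6.
    print(frequencies("AGGUGA"))
    print(symbol_uncertainty("AGGUGA"))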

As a second example, if s is an RNA sequence with fA = 1/4, fC = 1/8, and fG = fU = 5/16, then A contributes −log2(1/4) = 2 bits of uncertainty and C contributes −log2(1/8) = 3 bits of uncertainty. Also, G and U each contribute −log2(5/16) ≈ 1.68 bits of uncertainty. These values make intuitive sense. For example, C is the least frequently occurring symbol, so it has the highest uncertainty, and G and U are the most frequently occurring symbols, so each one has the lowest uncertainty.

The uncertainties associated with each symbol can be combined based on their frequencies of occurrence to provide an overall uncertainty value for a sequence.

Definition 3 (def) For each s ∈ Σ*, the number of bits of uncertainty is

H(s) = −∑{ fα·log2(fα) : α ∈ Σ and fα > 0 }.

Formally, the value H(s) is called the entropy of the sequence s. Based on the definition, this value can also be referred to as the expected uncertainty. For example, if every symbol in Σ is equally likely, then each fα = 1/N and H(s) = log2(N).

As another example, if s is an RNA sequence with fA = 3/4, fC = 1/8, and fG = fU = 1/16, then

H(s) = −3/4·log2(3/4) − 1/8·log2(1/8) − 1/16·log2(1/16) − 1/16·log2(1/16)
     = −3/4·(log2(3) − 2) − 1/8·(−3) − 1/16·(−4) − 1/16·(−4)
     = −3/4·log2(3) + 19/8 ≈ 1.19.

Intuitively, it is reasonable that the uncertainty H(s) is significantly less than 2 since, on average, 3 out of every 4 symbols in s is an A.

Fact 1 (fact) For each s ∈ Σ*, H(s) ≤ log2(N), and equality holds precisely when each fα = 1/N.

Based on Fact 1, the maximum uncertainty for a sequence containing at most N different types of symbols is log2(N). This is all that we can state in general. It represents the uncertainty before any specifics about the sequence are known. On the other hand, after we have knowledge about the sequence, the uncertainty may decrease. For example, if s is a DNA sequence with fG = fT = 1/4 and fA = 1/2, then H(s) = 1.5 in contrast to the maximum entropy of log2(4) = 2. This decrease in uncertainty of 0.5 bits per symbol represents the information in the sequence s.

In its simplest formulation, information can be thought of as the difference of uncertainty values measured in the before and after states. Informally, this principle can be phrased as: information is the reduction in uncertainty based on additional knowledge. For now, we assume that the before state corresponds to a situation where the uncertainty has the maximum possible value.

Definition 4 (def) For each s ∈ Σ*, the information content of s is

info(s) = log2(N) − H(s).

For example, if s contains exactly the amino acid V, then H(s) = 0, so info(s) = log2(20) ≈ 4.32 bits per symbol. In the example before Fact 1, H(s) ≈ 1.19, so info(s) = 2 − H(s) ≈ 0.81 bits per symbol.

4. Sequence Alignments

In addition to the applications of entropy in communication theory, the notion of information and its generalization, relative entropy (discussed in Section 8), have been used to study a variety of problems in molecular biology. In particular, one of the main applications has been in the area of sequence alignments. In this situation, a collection of sequences is aligned by position based on a common property such as a fixed starting position or a special pattern shared by all the sequences. The original document illustrates the idea with an alignment of five binary sequences sharing the pattern 01010. [The alignment is not reproduced in this copy.]

In a biological context, sequence alignments are used to investigate common patterns across different organisms and to deduce consensus sequences. This term denotes a special DNA, RNA, or protein sequence that occurs in essentially the same position with respect to a biological element (such as a gene or chromosome) in the same organism or different organisms. For example, the promoter region of a bacterial gene typically contains a sequence of the form TATAAT located approximately ten positions before the transcription start site. As another example, the telomeric regions in yeast chromosomes contain many repeats of the sequence pattern TG2-3(TG)1-3, denoting any sequence that starts with a T followed by 2 or 3 G's, followed by 1, 2, or 3 occurrences of TG.

Typically, consensus sequences are discovered by aligning sequences from different sources and looking for a common pattern. For example, the following set of sequences is a six amino acid portion of an alignment of proteins from different organisms:

s1  L S P A D K
s2  L S A A D K
s3  L S E G E W
s4  L S A A E K
s5  L T E S Q A

A visual inspection reveals an approximate consensus sequence of the form

L S [A E] A [D E] K

where [ ] denotes a choice of the two symbols. We can estimate the quality of this consensus sequence by computing the information content of each column sequence. For this purpose, we use the following notation. Given a set of sequences s1, s2, ..., sn belonging to Σ^m, let sij denote the jth member of sequence si, 1 ≤ j ≤ m. Let c1, c2, ..., cm denote the sequences representing the columns of the alignment, and let fαj denote the frequency of occurrence of symbol α at position j.

In the preceding example, fL1 = 1, so info(c1) = log2(20) ≈ 4.32, the maximum possible information, which reflects perfect conservation at position 1. Also, fS2 = 0.8 and fT2 = 0.2, so H(c2) = −0.8·log2(0.8) − 0.2·log2(0.2) = log2(5) − 1.6 ≈ 0.72. Hence info(c2) = log2(20) − H(c2) = 3.6, a value indicating a high degree of conservation at position 2. On the other hand, since fD5 = fE5 = 0.4 and fQ5 = 0.2, H(c5) = log2(5) − 0.8 ≈ 1.52, so info(c5) = 2.8, reflecting the poorer quality of the alignment at position 5.

In general, the values of the information content can be used on any sequence alignment to assess the degree of conservation in the columns. These values are particularly useful for large sets of sequences where it is difficult to visually inspect each column.

5. Information Profiles

As we saw in the preceding section, given a set of sequences s1, s2, ..., sn belonging to Σ^m, we can compute the information content info(cj) for each position 1 ≤ j ≤ m. It is useful to be able to refer collectively to these values.

Definition 5 (def) The vector <info(c1), info(c2), ..., info(cm)> is called the information profile of the sequence alignment.

The graph of the information profile provides a convenient summary of the degree of conservation in a sequence alignment. The following example illustrates the idea; it lists the frequencies of occurrence of A, C, G, and T at positions -4, -3, -2, -1, +4 in an alignment of 76 DNA sequences. (As usual, we assume that the start site pattern ATG is found at positions +1, +2, and +3 in the sequences.) [The table of frequencies, with rows A, C, G, T and columns for the five positions, is not reproduced in this copy.] Based on these frequencies, the information profile is <0.30, 0.56, 0.40, 0.18, 0.22>.
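As a check on these calculations, here is a short Python sketch (my own illustration, not the author's code) that computes the information profile of an alignment directly from Definitions 3 through 5; applied to the five-sequence amino acid alignment above, it reproduces info(c1) ≈ 4.32, info(c2) = 3.6, and info(c5) = 2.8.

    from collections import Counter
    from math import log2

    def entropy(column):
        """H(c) = -sum of f*log2(f) over the symbols occurring in the column (Definition 3)."""
        n = len(column)
        return -sum((c / n) * log2(c / n) for c in Counter(column).values())

    def information_profile(seqs, alphabet_size):
        """info(cj) = log2(N) - H(cj) for each column cj of the alignment (Definitions 4 and 5)."""
        return [log2(alphabet_size) - entropy(col) for col in zip(*seqs)]

    alignment = ["LSPADK", "LSAADK", "LSEGEW", "LSAAEK", "LTESQA"]
    print([round(x, 2) for x in information_profile(alignment, alphabet_size=20)])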

The profile is displayed in a figure in the original document (information in bits plotted against nucleotide position). [The figure is not reproduced in this copy.] The maximum value info(c-3) = 0.56 reflects one main feature of the alignment, namely that at position -3, the bases A and G occur in 90% of the sequences.

Given an alignment of sequences, it is frequently useful to also tabulate the collective information found in a profile.

Definition 6 (def) Given an information profile <info(c1), info(c2), ..., info(cm)>, the cumulative information is

info(c1 ... cm) = ∑{ info(cj) : 1 ≤ j ≤ m }.

For instance, in the above example, the cumulative information is info(c-4 ... c-1, c+4) = 1.66.

6. Individual Information

The ideas introduced in Section 5 give an information-theoretic perspective of a sequence alignment. However, they do not assign a value to an individual sequence that measures its conformity to the consensus of the alignment. Therefore, they do not provide a means for assessing a sequence that may be of interest but does not belong to the original collection. These shortcomings can be corrected by using the idea of the uncertainty of a symbol introduced in Definition 2. We adopt the same perspective that was used previously to define information. Namely, without specific knowledge about the occurrence of a symbol α, there are log2(N) bits of uncertainty, but once the frequency of occurrence fα > 0 is known, there are −log2(fα) bits of uncertainty. The difference of the two values represents the information with respect to α.

Definition 7 (def) For each s ∈ Σ*, a symbol α with fα > 0 contributes

log2(fα/f) = log2(N) + log2(fα)

bits of information, where f = 1/N.

Suppose s1, s2, ..., sn is a set of sequences in Σ^m. For each α and 1 ≤ j ≤ m, define the weight

w(α, j) = log2(fαj/f),

where fαj denotes the frequency of occurrence of symbol α at position j. Based on Definition 7, this value is the number of bits of information contributed by symbol α in the sequence cj. Using the set of values {w(α, j)}, we can assign a score to each sequence in the following manner:

Definition 8 (def) The individual information score for the sequence si is

infoscore(si) = ∑{ w(sij, j) : 1 ≤ j ≤ m }.

In other words, the score is the sum of the bits contributed by the symbols found at the various positions in the sequence. For example, consider the following alignment of ten RNA sequences.

s1   CAUGGGAGAG
s2   CAUGCGAGAG
s3   CAGAGUUAAA
s4   CACGGCGGUA
s5   CACCCAGAAG
s6   AAACCUGAAG
s7   AGAUCUAAAG
s8   AGAUCUAAAG
s9   CACGGUACAG
s10  CACGCGGGAG

The information profile for the alignment is <1.12, 1.28, 0.15, 0.24, 1.03, 0.32, 0.64, 0.64, 1.53, 1.28> and the cumulative information is 8.22. The set of weight values w(α, j) is shown below for each α = A, C, G, U and 1 ≤ j ≤ 10. [The weight table is not reproduced in this copy.]
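Since the weight table itself did not survive in this copy, the following Python sketch (my own reconstruction based on Definitions 7 and 8, not the author's program) computes the weights w(α, j) = log2(N·fαj) and the individual information scores for the ten RNA sequences above.

    from collections import Counter
    from math import log2

    def weight_matrix(seqs, alphabet):
        """w(alpha, j) = log2(N * f_alpha_j); None marks weights that are undefined (f = 0)."""
        n, N = len(seqs), len(alphabet)
        matrix = []
        for j in range(len(seqs[0])):
            counts = Counter(s[j] for s in seqs)
            matrix.append({a: (log2(N * counts[a] / n) if counts[a] > 0 else None)
                           for a in alphabet})
        return matrix

    def infoscore(seq, w):
        """Individual information score: the sum of the weights of the symbols of seq (Definition 8)."""
        return sum(w[j][sym] for j, sym in enumerate(seq))

    seqs = ["CAUGGGAGAG", "CAUGCGAGAG", "CAGAGUUAAA", "CACGGCGGUA", "CACCCAGAAG",
            "AAACCUGAAG", "AGAUCUAAAG", "AGAUCUAAAG", "CACGGUACAG", "CACGCGGGAG"]
    w = weight_matrix(seqs, "ACGU")
    scores = [(s, round(infoscore(s, w), 2)) for s in seqs]
    print(sorted(scores, key=lambda item: -item[1]))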

Based on this weight matrix, a consensus sequence CA GCUAAAG (with no consensus symbol at position 3) can be determined by choosing the symbol α in each position j where w(α, j) ≥ 1. The individual information scores for each sequence, listed in order of highest score, are shown in a table in the original document. [The table is not reproduced in this copy.]

An examination of the highest scoring sequence s10 = CACGCGGGAG shows that it matches the consensus CA GCUAAAG in six positions. However, the lowest scoring sequence s3 = CAGAGUUAAA also matches the consensus in five positions. This illustrates how individual information scores can provide a more discriminating measure than simply counting the number of matches to a consensus sequence.

Notice that the average of the individual information scores is exactly the value of the cumulative information, 8.22. In fact, this observation holds in general.

Fact 2 (fact) For any set of sequences s1, s2, ..., sn in Σ^m,

info(c1 ... cm) = ∑{ infoscore(si) : 1 ≤ i ≤ n } / n.

Based on Definition 8, each individual information score can be written in the form

infoscore(si) = log2(N^m) + log2( ∏{ fαj : α = sij, 1 ≤ j ≤ m } ).

This expression shows that each score can also be interpreted as a difference of uncertainty values. For example, if we are working with DNA sequences, then N = 4, so

infoscore(si) = −log2((1/4)^m) − (−log2(pi)) = 2m − (−log2(pi)).

Here 4^-m is the probability of occurrence of a random DNA sequence of length m, and pi = ∏ fαj (taking α = sij at each position j) is the probability that the sequence si occurs (assuming independence at each position).
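Continuing the previous sketch (it assumes seqs, w, infoscore, and information_profile from the earlier code are still in scope), the following lines give a quick numerical check of Fact 2 and of the probability interpretation above.

    from math import log2, prod

    n, m, N = len(seqs), len(seqs[0]), 4

    # Fact 2: the average individual information score equals the cumulative information.
    average_score = sum(infoscore(s, w) for s in seqs) / n
    cumulative = sum(information_profile(seqs, N))
    print(round(average_score, 2), round(cumulative, 2))    # the two values agree

    # Probability interpretation: infoscore(si) = m*log2(N) - (-log2(pi)), where pi is the
    # product of the column frequencies of the symbols of si.
    def column_frequency(symbol, j):
        return sum(1 for s in seqs if s[j] == symbol) / n

    first = seqs[0]
    p1 = prod(column_frequency(sym, j) for j, sym in enumerate(first))
    print(round(infoscore(first, w), 2), round(m * log2(N) + log2(p1), 2))    # equal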

The following figure in the original document (number of occurrences plotted against the individual information score, rounded to the nearest integer) shows the distribution of individual information scores for the family of DNA sequences discussed after Definition 5. [The figure is not reproduced in this copy.] The actual scores range from a minimum of −5.26 up to a maximum bounded by 2m = 10 bits, with an average value of 1.66. Therefore, by Fact 2, 1.66 is the value of the cumulative information.

The distribution resembles a truncated normal distribution. This truncation will always occur for individual information scores of DNA sequences because a given score can't exceed the value 2m.

7. Generalized Individual Information

Suppose s1, s2, ..., sn is a set of sequences in Σ^m. Using Definition 8, we can assign an individual information score to any sequence s of length m as long as each symbol at position j in s matches one of the symbols sij. However, in the example following the definition, we can't assign a score to either GAUGGGAGAG or UAUGGGAGAG since for every sequence si, si1 = A or si1 = C. This restriction hampers the general applicability of the scoring technique.

One solution is to guarantee that w(α, j) is defined for each symbol α and position j regardless of the nature of the sequences s1, s2, ..., sn used to define the weights. This can be done by using frequency pseudocounts instead of frequency counts. The idea is to assume that a 1/N fraction of each symbol is always present at any position in a sequence alignment. Formally, for each α and 1 ≤ j ≤ m, define

f*αj = ( |{i : sij = α}| + 1/N ) / (n + 1).

In particular, if symbol α is not found at position j, then f*αj = 1/(N·(n + 1)). Since the denominator is n + 1, the equation ∑{ f*αj : α ∈ Σ } = 1 holds for each j, so we still maintain a frequency distribution at each position. Using the new set of frequencies, define a generalized set of weight values by

w*(α, j) = log2(N) + log2(f*αj).
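A minimal Python sketch of the pseudocount construction (my own illustration of the two formulas above; it reuses seqs from the earlier sketch):

    from math import log2

    def generalized_weight_matrix(seqs, alphabet):
        """w*(alpha, j) = log2(N) + log2(f*_alpha_j), defined for every symbol and position."""
        n, N = len(seqs), len(alphabet)
        matrix = []
        for j in range(len(seqs[0])):
            row = {}
            for a in alphabet:
                count = sum(1 for s in seqs if s[j] == a)
                f_star = (count + 1 / N) / (n + 1)      # frequency pseudocount
                row[a] = log2(N) + log2(f_star)
            matrix.append(row)
        return matrix

    w_star = generalized_weight_matrix(seqs, "ACGU")
    # A symbol never observed at a position gets weight 2 + log2(1/44), about -3.46 here (N = 4, n = 10).
    print(round(w_star[0]["G"], 2))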

Definition 9 (def) The generalized individual information score for a sequence u ∈ Σ^m is

infoscore*(u) = ∑{ w*(uj, j) : 1 ≤ j ≤ m }.

As an illustration, the original document shows the set of generalized weight values w*(α, j) for the example following Definition 8. [The table is not reproduced in this copy.] Comparing the two sets of weight values, notice that the values of the existing positive (negative) weights have slightly decreased (increased). Also, the new weight corresponding to a previously undefined value is 2 + log2(1/44) ≈ −3.46.

The following table shows the scores of the original family of sequences with respect to the two different sets of weight values; notice that the relative ordering of the scores is the same in both cases. [The table of infoscore and infoscore* values is not reproduced in this copy.]

The following table shows the scores for two sequences t1 and t2 that are not present in the original data set.

sequence              infoscore*
t1 = GAUGGGAGAG       4.53
t2 = CACUGUACAG       8.01

The sequence t1 differs from s1 only in position 1 (G instead of C), but G has a weight of −3.46 while C has a weight of 1.40, so there is a significant difference in scores for the two sequences

(4.53 vs. 9.39). The sequence t2 also differs from s9 in only position 4 (U instead of G), but the difference in the respective weights is only 1.23, so the scores are fairly close (8.01 vs. 9.24).

8. Relative Entropy

The definitions of information and individual information in the preceding sections are based on the assumption that the before state reflects a random sequence. Therefore, the uncertainty in this state is assigned the value log2(N). Based on Definition 4, the information content of a sequence s in Σ* is

info(s) = log2(N) − H(s)
        = log2(N) − (−∑{ fα·log2(fα) : fα > 0 })
        = ∑{ fα·(log2(N) + log2(fα)) : fα > 0 }
        = ∑{ fα·log2(fα/bα) : fα > 0 },

where bα = 1/N represents the background frequency of occurrence of symbol α. This is equivalent to the assumption of no a priori knowledge. The preceding expression suggests another way to incorporate background information. Suppose <bα> is a fixed background frequency distribution for the set of symbols with each bα > 0 and ∑ bα = 1.

Definition 10 (def) For each s ∈ Σ*, the relative entropy of s with respect to <bα> is

infob(s) = ∑{ fα·log2(fα/bα) : fα > 0 }.

The relative entropy is also called the cross-entropy or the Kullback-Leibler divergence of s from the background <bα> and is frequently denoted by D(f || b).

Suppose s1, s2, ..., sn is a set of sequences in Σ^m. By analogy with Definitions 5 and 6, we can define a relative entropy profile

<infob(c1), infob(c2), ..., infob(cm)>

and a cumulative relative entropy

infob(c1 ... cm) = ∑{ infob(cj) : 1 ≤ j ≤ m }.

We can also generalize the notion of individual information by incorporating background information. For each α and 1 ≤ j ≤ m, define a modified weight value using the equation

wb(α, j) = log2(fαj/bα).

Definition 11 (def) The individual relative entropy score for the sequence si is

infoscoreb(si) = ∑{ wb(sij, j) : 1 ≤ j ≤ m }.
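A small Python sketch of Definitions 10 and 11 (my own illustration; the helper names are arbitrary):

    from collections import Counter
    from math import log2

    def relative_entropy(s, background):
        """infob(s) = sum of f*log2(f/b) over the symbols occurring in s (Definition 10)."""
        n = len(s)
        return sum((c / n) * log2((c / n) / background[sym])
                   for sym, c in Counter(s).items())

    def relative_entropy_profile(seqs, background):
        """Column-by-column relative entropy of an alignment."""
        return [relative_entropy("".join(col), background) for col in zip(*seqs)]

    def infoscore_b(seq, seqs, background):
        """Individual relative entropy score: sum of log2(f_alpha_j / b_alpha) (Definition 11)."""
        n = len(seqs)
        return sum(log2((sum(1 for s in seqs if s[j] == sym) / n) / background[sym])
                   for j, sym in enumerate(seq))

    # With a uniform background the relative entropy reduces to the information content.
    print(round(relative_entropy("AGGUGA", {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}), 2))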

As we did in Definition 9, by using a set of weight values {wb*(α, j)} based on frequency pseudocounts, we can also define a generalized individual relative entropy function infoscoreb* that can be applied to any sequence u ∈ Σ^m.

The following examples compare the use of information and relative entropy. Let Σ = {0, 1, 2} and suppose that the frequency occurrences for a sequence s ∈ Σ* are f0 = 0.8 and f1 = f2 = 0.1. Assume that the background distribution is b0 = 0.5, b1 = 0.3, and b2 = 0.2. Based on Definition 3, H(s) ≈ 0.92. Since N = 3, by Definition 4, info(s) = log2(3) − H(s) ≈ 0.66. On the other hand, by Definition 10, infob(s) = 0.28, which is a considerably smaller value. The reason is that the assumption of the higher background frequency for symbol 0 (0.5) diminishes the effect of the larger value of f0.

At the other extreme, if the background distribution is b0 = 0.1 and b1 = b2 = 0.4, then the effect of the large value f0 is magnified. In this case, the relative entropy infob(s) = 2.17, which exceeds the maximum number of bits of uncertainty for a three-symbol alphabet (log2(3) ≈ 1.585). Therefore, we can't interpret the value 2.17 as a number of decisions or bits with respect to that alphabet. This illustrates a drawback of using the relative entropy measure.

Fact 3 (fact) For each s ∈ Σ*, the relative entropy infob(s) ≥ 0, but there is no upper bound on the value of infob(s) that is independent of the background <bα>.

For the following alignment of five RNA sequences, assume that the background frequencies are bA = 0.4, bC = 0.1, and bG = bU = 0.25.

s1  CACU
s2  GACU
s3  GACG
s4  GCCA
s5  GCCA

The information (relative entropy) profile is <1.28, 1.03, 2.00, 0.48> (<1.54, 1.15, 3.32, 0.21>) and the cumulative information (relative entropy) is 4.79 (6.22). The explanations for the different values are based on the background assumptions. First, the assumption of very scarce C's (bC = 0.1) increases their importance when they are observed because of the expression log2(fCj/bC). Therefore, the C in position 1 of s1 increases the relative entropy to 1.54 from the information value of 1.28. The same idea explains the increase in value in position 3 (2.00 to 3.32). On the other hand, the assumption of abundant A's (bA = 0.4) decreases their importance when they are observed. Therefore, in position 4 there is a decrease in values (0.48 to 0.21) because of the presence of the A's. Finally, there is little change in the values at position 2 (1.03 to 1.15) because of the counteracting effects of the A's and C's.
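The profile comparison above can be reproduced with the information_profile and relative_entropy_profile helpers sketched earlier (again my own code, not the author's); under the stated background it returns the same rounded values.

    seqs5 = ["CACU", "GACU", "GACG", "GCCA", "GCCA"]
    background = {"A": 0.4, "C": 0.1, "G": 0.25, "U": 0.25}

    info_profile = information_profile(seqs5, alphabet_size=4)
    rel_profile = relative_entropy_profile(seqs5, background)

    print([round(x, 2) for x in info_profile])    # 1.28, 1.03, 2.00, 0.48
    print([round(x, 2) for x in rel_profile])     # 1.54, 1.15, 3.32, 0.21
    print(round(sum(info_profile), 2), round(sum(rel_profile), 2))    # 4.79 and 6.22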

The following individual information and relative entropy scores can also be explained in a similar manner. [The table of infoscore and infoscoreb values for the five sequences CACU, GACU, GACG, GCCA, GCCA is not reproduced in this copy.]

9. Mutual Information

Frequently, it is useful to assess the degree of dependence between a pair of sequences. In conventional statistics, the sequences often consist of numerical values and this assessment involves the use of a covariance matrix. There is an analogous idea that can be used for sequences of symbols that leads to the notion of mutual information.

Let s and t be two sequences of length n. For each pair of symbols α, β ∈ Σ, define the joint frequency of occurrence

fαβ = |{i : si = α and ti = β}| / n.

The following definition is the analogue of Definition 3.

Definition 12 (def) For each pair s, t ∈ Σ^n, the joint entropy of s and t is

H(s, t) = −∑{ fαβ·log2(fαβ) : α, β ∈ Σ and fαβ > 0 }.

For example, for the following pair of binary sequences s and t [not reproduced in this copy], the joint frequencies are f01 = 0.5, f10 = 0.2, and f11 = 0.3, so

H(s, t) = −f01·log2(f01) − f10·log2(f10) − f11·log2(f11)
        = −0.5·log2(0.5) − 0.2·log2(0.2) − 0.3·log2(0.3) ≈ 1.49.

The joint entropy can be interpreted as the number of bits of uncertainty per symbol, where the maximum number of bits of uncertainty per symbol is log2(number of distinct symbol pairs). In the preceding example, there are 4 distinct ordered pairs of symbols, so H(s, t) ≤ 2 always holds, with equality precisely when fαβ = 1/4 for each α, β (Fact 1).

As a second example, consider the set of sequences shown below.

s1   CUCCCUUIGCAUGGGAG
s2   CUCGCUUIGCAUGCGAG
s3   UAACUUUGUCAGAGUUA

s4   CCGCCCUUUCACGGCGG
s5   UCUGGCUUUCACCCAGA
s6   UCAGGCUGAAAACCUGA
s7   UUAGACUGAAGAUCUAA
s8   UUAGACUGAAGAUCUAA
s9   GUACCCUGCCACGGUAC
s10  CUCGCCUGCCACGCGGG

The joint entropy of the column sequences c4 and c14 is

H(c4, c14) = −fCG·log2(fCG) − fGC·log2(fGC) = 0.97,

since fCG = 0.4 and fGC = 0.6 (using positions 4 and 14). Similarly, H(c9, c10) ≈ 1.97. The larger joint entropy at positions 9 and 10 reflects the greater uncertainty involved in observing the four pairs of symbols AA, CC, GC, and UC than in observing the two pairs of symbols CG and GC.

Based on Definition 4, there is a natural way to use the joint entropy to define the joint information content between pairs of sequences s and t:

info(s, t) = log2(N^2) − H(s, t) = 2·log2(N) − H(s, t).

(Since the pair-wise sequence (s, t) has values in the product Σ × Σ, the maximum uncertainty is log2(N^2) = 2·log2(N).) However, there is another measure called mutual information that also takes into account the individual frequencies of occurrence in each sequence. Moreover, this measure, in some sense, quantifies the extent to which the two sequences are dependent.

Definition 13 (def) For each pair s, s′ ∈ Σ^n, the mutual information of s and s′ is

M(s, s′) = ∑{ fαβ·log2[ fαβ / (fα·f′β) ] : fαβ > 0 }.

The ratio fαβ/(fα·f′β) measures the degree of dependence with respect to the symbols α and β, where fα (resp. f′β) denotes the frequency of occurrence of α (resp. β) in the sequence s (resp. s′). If the positions are completely independent for this pair of symbols, then fαβ = fα·f′β, so the ratio is 1 and the term does not contribute to the mutual information.

The mutual information for the pair of binary sequences discussed after Definition 12 is

M(s, s′) = f01·log2(f01/(f0·f′1)) + f10·log2(f10/(f1·f′0)) + f11·log2(f11/(f1·f′1))
         = 0.5·log2(0.5/((0.5)(0.8))) + 0.2·log2(0.2/((0.5)(0.2))) + 0.3·log2(0.3/((0.5)(0.8)))
         = 0.5·log2(1.25) + 0.2·log2(2) + 0.3·log2(0.75) ≈ 0.24.

Notice that this value is significantly smaller than the information content info(s, s′) ≈ 0.51, which does not measure dependence.
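A small Python sketch of Definitions 12 and 13 (my own illustration). The two binary strings below are hypothetical stand-ins chosen to have the joint frequencies f01 = 0.5, f10 = 0.2, f11 = 0.3 used in the example; the original sequences are not reproduced in this copy.

    from collections import Counter
    from math import log2

    def joint_entropy(s, t):
        """H(s, t) over the ordered pairs (si, ti) (Definition 12)."""
        n = len(s)
        return -sum((c / n) * log2(c / n) for c in Counter(zip(s, t)).values())

    def mutual_information(s, t):
        """M(s, t) = sum of f_ab * log2(f_ab / (f_a * f_b)) (Definition 13)."""
        n = len(s)
        fs, ft, fst = Counter(s), Counter(t), Counter(zip(s, t))
        return sum((c / n) * log2((c / n) / ((fs[a] / n) * (ft[b] / n)))
                   for (a, b), c in fst.items())

    s = "0000011111"    # hypothetical
    t = "1111100111"    # hypothetical
    print(round(joint_entropy(s, t), 2))        # about 1.49
    print(round(mutual_information(s, t), 2))   # about 0.24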

The following table in the original document lists the values of the mutual information M(cj, ck) for several pairs of column sequences in the preceding example of RNA sequences. [The table is not reproduced in this copy.] The highest value of the mutual information is found at positions 3 and 15 (1.68). This value indicates a high degree of dependence between the two positions. Notice that when a symbol changes in column 3, the symbol in column 15 also changes. This is shown by the distribution of pairs: CG, GC, AU, and UA. On the other hand, the mutual information at positions 5 and 10 is relatively low (0.68), indicating a high degree of independence: the pairs GC, GA, and AA are found at the two positions, as well as the pairs CC and UC. Therefore, in many cases, a symbol change in column 5 is not reflected by a symbol change in column 10.

In general, one can establish the following result that relates the entropy and mutual information definitions.

Fact 4 (fact) For each pair s, t ∈ Σ^n,
(a) M(s, t) = H(s) + H(t) − H(s, t).
(b) 0 ≤ M(s, t) ≤ min{H(s), H(t)}.

The first equation says that we can interpret the mutual information of two sequences s and t in a manner similar to our interpretation of information. Namely, if we assume that the before state represents the assumption that s and t are independent, then H(s) + H(t) represents the uncertainty. In the after state, we have knowledge of the dependencies, so H(s, t) represents the new uncertainty. Hence the (mutual) information is H(s) + H(t) − H(s, t).

The second inequality says that the mutual information is always bounded by the smaller entropy of the two sequences. Therefore, in the preceding example the inequality M(cj, ck) ≤ 2 always holds. This shows that the mutual information M(c3, c15) = 1.68 is a reasonably large value. On the other hand, M(s, t) ≈ 0.24 for the pair of sequences listed after Definition 12. Since we are dealing with a two-symbol alphabet, it follows from (b) that in general M(s, t) ≤ 1, so the relative value of the mutual information is small.

10. Sampling Corrections

Suppose s1, s2, ..., sn is a set of sequences belonging to Σ^m. In general, a sampling bias is introduced by using the frequencies of occurrence of symbols to compute the information profiles

([SSGE], Appendix). The extent of this bias at a position 1 ≤ j ≤ m depends on the number of sequences n and, to some extent, on the frequencies of the symbols at position j. An unbiased estimate for the information content can be calculated by first assuming a multinomial distribution of the symbols at a given position and then computing the expected value of an associated random variable.

Assume that the sample space is

S = { (mα) : mα ≥ 0 and ∑{ mα : α ∈ Σ } = n },

where each mα represents the number of occurrences of symbol α. For each position j, the probability Pj of occurrence of an event e = (mα) in S is defined by

Pj(e) = c(e)·∏{ fαj^mα : α ∈ Σ }, where c(e) = n! / ∏{ mα! : α ∈ Σ }

and {fαj} denotes the observed frequencies of the symbols at position j. Define a random variable Ij on the probability space (S, Pj) by the expression

Ij(e) = log2(N) − H(e), where H(e) = −∑{ (mα/n)·log2(mα/n) : mα > 0 }

denotes the value of the entropy for a sequence with symbol counts e = (mα). The expected value of Ij provides an unbiased estimate of the information content at position j. This value is defined by the expression

E(Ij) = ∑{ Pj(e)·Ij(e) : e ∈ S } = log2(N) − ∑{ Pj(e)·H(e) : e ∈ S }.

It is established in [Bas] that E(Ij) → info(cj) as n → +∞ and, furthermore, that the difference between the observed information and the expected value of Ij is equal to a correction factor plus a scaling factor:

info(cj) − E(Ij) = r(N, n) + O(1/n^2), where r(N, n) = (N − 1)/(ln(4)·n).

Numerical estimates for small numbers of symbols (N ≤ 5) show that for a given sample size n, info(cj) − E(Ij) has approximately the same value independent of the distribution {fαj}. In addition, for n ≥ 100, the value info(cj) − E(Ij) is closely approximated by r(N, n). The following tables illustrate these ideas. The first table shows that info(cj) − E(Ij) has approximately the same value for N = 3 using the following values for {fαj}: (a) <1/3, 1/3, 1/3>, (b) <1/5, 3/10, 1/2>, (c) <1/10, 1/5, 7/10>. [The table of values for several sample sizes n is not reproduced in this copy.]
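For small sample sizes, E(Ij) can be evaluated exactly by enumerating all count vectors (mα). The Python sketch below (my own illustration of the formulas above, not the author's program) does this for N = 4 with each fαj = 1/4 and compares the size of the sampling bias with the approximation r(N, n).

    from math import factorial, log, log2

    def count_vectors(n, N):
        """All vectors (m1, ..., mN) of non-negative integers summing to n."""
        if N == 1:
            yield (n,)
            return
        for m in range(n + 1):
            for rest in count_vectors(n - m, N - 1):
                yield (m,) + rest

    def expected_information(freqs, n):
        """E(Ij): expected plug-in information under a multinomial model with parameters freqs."""
        N = len(freqs)
        fact_n = factorial(n)
        total = 0.0
        for e in count_vectors(n, N):
            coeff = fact_n
            for m in e:
                coeff //= factorial(m)
            prob = float(coeff)
            for f, m in zip(freqs, e):
                prob *= f ** m
            H = -sum((m / n) * log2(m / n) for m in e if m > 0)
            total += prob * (log2(N) - H)
        return total

    def r(N, n):
        """First-order sampling correction (N - 1)/(ln(4) * n) quoted from [Bas]."""
        return (N - 1) / (log(4) * n)

    # With each f_alpha_j = 1/4 the information content is 0, so E(Ij) itself is the size
    # of the sampling bias |info(cj) - E(Ij)|; it is close to r(4, n) for moderate n.
    uniform = [0.25, 0.25, 0.25, 0.25]
    for n in (10, 25, 50, 100):
        print(n, round(expected_information(uniform, n), 4), round(r(4, n), 4))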

The next table shows how info(cj) − E(Ij) is approximated by r(N, n) for a variety of sample sizes, where N = 4 and each fαj = 1/4. [The table with columns n, info(cj) − E(Ij), r(N, n), and error is not reproduced in this copy.]

The final table shows that for a given sample size n, the sampling correction info(cj) − E(Ij) increases as the number of symbols N increases. The values are calculated based on the assumption that each fαj = 1/N. [The table with columns n and several alphabet sizes N is not reproduced in this copy.]

The variance of the random variable Ij is given by the expression

Var(Ij) = ∑{ Pj(e)·H(e)^2 : e ∈ S } − ( ∑{ Pj(e)·H(e) : e ∈ S } )^2.

For n ≤ 100, this expression can be directly evaluated. The following table shows sample values of the variance for N = 4 under the assumption that each fαj = 1/4. [The table with columns n, info(cj) − E(Ij), and Var(Ij) is not reproduced in this copy.]

Numerical estimates for values N ≤ 5 indicate that for n > 100, the variance Var(Ij) is closely approximated by the following expression given in [Bas]:

Vare(Ij) = ( ∑{ fαj·(log2(fαj))^2 } − H(fj)^2 ) / n,

where H(fj) denotes the entropy of the frequency distribution at position j. Based on the preceding work, it follows that the cumulative information is estimated by the expression

E( ∑{ Ij : 1 ≤ j ≤ m } ) = ∑{ E(Ij) : 1 ≤ j ≤ m }.

Assuming that the random variables {Ij} are independent, the variance at the m positions is estimated by the expression

Var( ∑{ Ij : 1 ≤ j ≤ m } ) = ∑{ Var(Ij) : 1 ≤ j ≤ m }.

In the example following Definition 5, the information profile was <0.30, 0.56, 0.40, 0.18, 0.22> and the cumulative information was 1.66. These values are based on the biased information content at each position. In this example, N = 4, so the sampling correction shown in the third table indicates that the unbiased values of the information content are approximately 0.03 smaller. Hence the unbiased information profile is <0.27, 0.53, 0.37, 0.15, 0.19> and the unbiased cumulative information is 1.51.

Similarly, in the example after Definition 8, only 10 RNA sequences were used in the alignment, so the unbiased value of each information content value is significantly smaller than the biased value (the correction is approximately 0.25). Hence, for example, the unbiased information content at position 2 is 1.03 vs. 1.28. Also, in some cases, the unbiased information values are negative (positions 3 and 4). The unbiased cumulative information is also significantly smaller (5.72 vs. 8.22).

In principle, we can also compute sampling corrections associated with relative entropy values that depend not only on the frequencies of occurrence {fαj} at a given position but also on the background frequencies {bα}. One approach for calculating these corrections uses the methods described above with each Ij replaced by a random variable Ij* that is defined by

Ij*(e) = ∑{ (mα/n)·log2((mα/n)/bα) : mα > 0 }.

However, I haven't carried out the calculations of E(Ij*) or Var(Ij*). Also, in this case we don't know a closed-form approximation of the sampling corrections for large sample sizes.

11. Notes

The reference [Sha] is the classic paper that introduced entropy as the foundation of communication theory. The initial chapters in [Ash] and [CT] provide thorough introductions to information theory including the concepts of entropy, relative entropy, and mutual information. The historical notes found in [CT] also provide a useful perspective. Chapters 6 and 7 in [App] give an elegant basic introduction to information theory. Chapters 1-3 in [Mac] present an introduction to information theory and discuss its applications to statistical inference. Chapter 7 in [Ham] provides a concise introduction to entropy and the maximum entropy principle. The

tutorial [Sch]1, although written for molecular biologists, also provides a general introduction to the notions of entropy and information. The reference [DEKM] discusses the general use of probabilistic models in analyzing biological problems. In particular, chapter 11 provides a useful summary of entropy, relative entropy, and mutual information. Reference [CK] is the pioneering paper on using mutual information to predict RNA secondary structure. The lecture notes [Tom] provide brief but very useful discussions of applications of information profiles (chapter 7) and relative entropy (chapters 8 and 9).

The reference [KL] is the pioneering paper on the relationships between information theory and statistics. Chapter 12 in [CT] also provides an introduction to this subject. The paper [HS] presents a useful discussion of statistical issues that arise in sequence alignments including the calculation of P-values. The papers [Bas] and [Mil] investigate several statistical issues related to entropy. The appendix to [SSGE] discusses the calculation of sampling corrections for entropy values and connections with the work in [Bas]. The references [BTS], [LB], [MBHSWF], [Sch]2, [SS], and [WR] use the ideas of information content and individual information in various biological contexts to analyze binding sites and splice sites.

12. References

[App] D. Applebaum, Probability and Information: An Integrated Approach, Cambridge University Press.
[Ash] R. B. Ash, Information Theory, Dover Publications, New York.
[Bas] G. P. Basharin, On a Statistical Estimate for the Entropy of a Sequence of Independent Random Variables, Theory Probability Appl. 4.
[BTS] C. B. Burge, T. Tuschl, and P. A. Sharp, Splicing of precursors to mRNAs by the spliceosomes, in The RNA World (editors R. F. Gesteland, T. R. Cech, and J. F. Atkins), Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.
[CK] D. K. Y. Chiu and T. Kolodziejczak, Inferring consensus structure from nucleic acid sequences, CABIOS 7, 3.
[CT] T. Cover and J. Thomas, Elements of Information Theory, John Wiley & Sons, New York.
[DEKM] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press.
[Ham] R. W. Hamming, The Art of Probability for Scientists and Engineers, Westview Press.
[HS] G. Z. Hertz and G. D. Stormo, Identifying DNA and protein patterns with statistically significant alignments of multiple sequences, Bioinformatics 15.
[Kul] S. Kullback, Information Theory and Statistics, Wiley, New York.
[KL] S. Kullback and R. A. Leibler, On information and sufficiency, Ann. Math. Stat. 22, 79-86.

[LB] L. P. Lim and C. B. Burge, A computational analysis of sequence features involved in recognition of short introns, Proc. Natl. Acad. Sci. 98.
[Mac] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press.
[Mil] G. A. Miller, Information Theory in Psychology, Free Press, Glencoe, Illinois.
[MBHSWF] S. M. Mount, C. Burks, G. Hertz, G. D. Stormo, O. White, and C. Fields, Splicing signals in Drosophila: intron size, information content, and consensus sequences, Nucleic Acids Res. 20.
[Qui] J. R. Quinlan, Induction of decision trees, Machine Learning 1.
[RK] R. Y. Rubinstein and D. P. Kroese, The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning, Springer-Verlag.
[SSGE] T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht, Information content of binding sites on nucleotide sequences, J. Mol. Biol. 188. [Appendix: T. D. Schneider, J. S. Haemer, and G. D. Stormo, Calculation of sampling uncertainty and variance]
[Sch]1 T. D. Schneider, Information Theory Primer with Appendix on Logarithms, Version 2.54.
[Sch]2 T. D. Schneider, Information content of individual genetic sequences, J. Theor. Biol. 189.
[SW] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana.
[Sha] C. E. Shannon, A Mathematical Theory of Communication, The Bell System Technical Journal, Vol. 27, July and October, 1948.
[SS] R. M. Stephens and T. D. Schneider, Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites, J. Mol. Biol. 228.
[Tom] M. Tompa, Lecture Notes on Computational Biology, CSE 527, Winter.
[WR] M. Weir and M. Rice, Ordered partitioning reveals extended splice-site consensus information, Genome Research 14, 67-78.


An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

A General Method for Combining Predictors Tested on Protein Secondary Structure Prediction

A General Method for Combining Predictors Tested on Protein Secondary Structure Prediction A General Method for Combining Predictors Tested on Protein Secondary Structure Prediction Jakob V. Hansen Department of Computer Science, University of Aarhus Ny Munkegade, Bldg. 540, DK-8000 Aarhus C,

More information

A statistical mechanical interpretation of algorithmic information theory

A statistical mechanical interpretation of algorithmic information theory A statistical mechanical interpretation of algorithmic information theory Kohtaro Tadaki Research and Development Initiative, Chuo University 1 13 27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan. E-mail: tadaki@kc.chuo-u.ac.jp

More information

Information in Biology

Information in Biology Information in Biology CRI - Centre de Recherches Interdisciplinaires, Paris May 2012 Information processing is an essential part of Life. Thinking about it in quantitative terms may is useful. 1 Living

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell Hidden Markov Models in computational biology Ron Elber Computer Science Cornell 1 Or: how to fish homolog sequences from a database Many sequences in database RPOBESEQ Partitioned data base 2 An accessible

More information

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). 1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Data Mining in Bioinformatics HMM

Data Mining in Bioinformatics HMM Data Mining in Bioinformatics HMM Microarray Problem: Major Objective n Major Objective: Discover a comprehensive theory of life s organization at the molecular level 2 1 Data Mining in Bioinformatics

More information

Inferring Models of cis-regulatory Modules using Information Theory

Inferring Models of cis-regulatory Modules using Information Theory Inferring Models of cis-regulatory Modules using Information Theory BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 28 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material,

More information

A Gentle Introduction to Gradient Boosting. Cheng Li College of Computer and Information Science Northeastern University

A Gentle Introduction to Gradient Boosting. Cheng Li College of Computer and Information Science Northeastern University A Gentle Introduction to Gradient Boosting Cheng Li chengli@ccs.neu.edu College of Computer and Information Science Northeastern University Gradient Boosting a powerful machine learning algorithm it can

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Sequences and Information

Sequences and Information Sequences and Information Rahul Siddharthan The Institute of Mathematical Sciences, Chennai, India http://www.imsc.res.in/ rsidd/ Facets 16, 04/07/2016 This box says something By looking at the symbols

More information

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * The contents are adapted from Dr. Jean Gao at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Primer on Probability Random

More information

Information Theory, Statistics, and Decision Trees

Information Theory, Statistics, and Decision Trees Information Theory, Statistics, and Decision Trees Léon Bottou COS 424 4/6/2010 Summary 1. Basic information theory. 2. Decision trees. 3. Information theory and statistics. Léon Bottou 2/31 COS 424 4/6/2010

More information

Information Theory Primer With an Appendix on Logarithms

Information Theory Primer With an Appendix on Logarithms Information Theory Primer With an Appendix on Logarithms PDF version: http://alum.mit.edu/www/toms/papers/primer/primer.pdf web versions: http://alum.mit.edu/www/toms/paper/primer/ Thomas D. Schneider

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Algorithms in Computational Biology (236522) spring 2008 Lecture #1

Algorithms in Computational Biology (236522) spring 2008 Lecture #1 Algorithms in Computational Biology (236522) spring 2008 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: 15:30-16:30/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office hours:??

More information

Hidden Markov Models for biological sequence analysis

Hidden Markov Models for biological sequence analysis Hidden Markov Models for biological sequence analysis Master in Bioinformatics UPF 2017-2018 http://comprna.upf.edu/courses/master_agb/ Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA

More information

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes Discovering Most Classificatory Patterns for Very Expressive Pattern Classes Masayuki Takeda 1,2, Shunsuke Inenaga 1,2, Hideo Bannai 3, Ayumi Shinohara 1,2, and Setsuo Arikawa 1 1 Department of Informatics,

More information

UNIT I INFORMATION THEORY. I k log 2

UNIT I INFORMATION THEORY. I k log 2 UNIT I INFORMATION THEORY Claude Shannon 1916-2001 Creator of Information Theory, lays the foundation for implementing logic in digital circuits as part of his Masters Thesis! (1939) and published a paper

More information

Information & Correlation

Information & Correlation Information & Correlation Jilles Vreeken 11 June 2014 (TADA) Questions of the day What is information? How can we measure correlation? and what do talking drums have to do with this? Bits and Pieces What

More information

Inferring Models of cis-regulatory Modules using Information Theory

Inferring Models of cis-regulatory Modules using Information Theory Inferring Models of cis-regulatory Modules using Information Theory BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 26 Anthony Gitter gitter@biostat.wisc.edu Overview Biological question What is causing

More information

Entropy as a measure of surprise

Entropy as a measure of surprise Entropy as a measure of surprise Lecture 5: Sam Roweis September 26, 25 What does information do? It removes uncertainty. Information Conveyed = Uncertainty Removed = Surprise Yielded. How should we quantify

More information

Information Theory. Mark van Rossum. January 24, School of Informatics, University of Edinburgh 1 / 35

Information Theory. Mark van Rossum. January 24, School of Informatics, University of Edinburgh 1 / 35 1 / 35 Information Theory Mark van Rossum School of Informatics, University of Edinburgh January 24, 2018 0 Version: January 24, 2018 Why information theory 2 / 35 Understanding the neural code. Encoding

More information

Received: 20 December 2011; in revised form: 4 February 2012 / Accepted: 7 February 2012 / Published: 2 March 2012

Received: 20 December 2011; in revised form: 4 February 2012 / Accepted: 7 February 2012 / Published: 2 March 2012 Entropy 2012, 14, 480-490; doi:10.3390/e14030480 Article OPEN ACCESS entropy ISSN 1099-4300 www.mdpi.com/journal/entropy Interval Entropy and Informative Distance Fakhroddin Misagh 1, * and Gholamhossein

More information

Markov Chains and Hidden Markov Models. COMP 571 Luay Nakhleh, Rice University

Markov Chains and Hidden Markov Models. COMP 571 Luay Nakhleh, Rice University Markov Chains and Hidden Markov Models COMP 571 Luay Nakhleh, Rice University Markov Chains and Hidden Markov Models Modeling the statistical properties of biological sequences and distinguishing regions

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

Inferring Protein-Signaling Networks

Inferring Protein-Signaling Networks Inferring Protein-Signaling Networks Lectures 14 Nov 14, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 1

More information

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information. L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission

More information

Theoretical distribution of PSSM scores

Theoretical distribution of PSSM scores Regulatory Sequence Analysis Theoretical distribution of PSSM scores Jacques van Helden Jacques.van-Helden@univ-amu.fr Aix-Marseille Université, France Technological Advances for Genomics and Clinics (TAGC,

More information

Graph Alignment and Biological Networks

Graph Alignment and Biological Networks Graph Alignment and Biological Networks Johannes Berg http://www.uni-koeln.de/ berg Institute for Theoretical Physics University of Cologne Germany p.1/12 Networks in molecular biology New large-scale

More information

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that

More information

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy Haykin_ch05_pp3.fm Page 207 Monday, November 26, 202 2:44 PM CHAPTER 5 Information Theory 5. Introduction As mentioned in Chapter and reiterated along the way, the purpose of a communication system is

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Modeling Motifs Collecting Data (Measuring and Modeling Specificity of Protein-DNA Interactions)

Modeling Motifs Collecting Data (Measuring and Modeling Specificity of Protein-DNA Interactions) Modeling Motifs Collecting Data (Measuring and Modeling Specificity of Protein-DNA Interactions) Computational Genomics Course Cold Spring Harbor Labs Oct 31, 2016 Gary D. Stormo Department of Genetics

More information

EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS

EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS August 0 Vol 4 No 005-0 JATIT & LLS All rights reserved ISSN: 99-8645 wwwjatitorg E-ISSN: 87-95 EVOLUTIONAY DISTANCE MODEL BASED ON DIFFEENTIAL EUATION AND MAKOV OCESS XIAOFENG WANG College of Mathematical

More information

How much non-coding DNA do eukaryotes require?

How much non-coding DNA do eukaryotes require? How much non-coding DNA do eukaryotes require? Andrei Zinovyev UMR U900 Computational Systems Biology of Cancer Institute Curie/INSERM/Ecole de Mine Paritech Dr. Sebastian Ahnert Dr. Thomas Fink Bioinformatics

More information

Supporting Information

Supporting Information Supporting Information Weghorn and Lässig 10.1073/pnas.1210887110 SI Text Null Distributions of Nucleosome Affinity and of Regulatory Site Content. Our inference of selection is based on a comparison of

More information