
Information Tutorial

Dr. Michael D. Rice
Emeritus Professor of Computer Science
Wesleyan University
Middletown, CT

Version 1.0, October 2005 (revised June 2018)

1. Introduction

The notion of entropy was used by Shannon in his pioneering work [Sha] on communication theory. Since that time, the fields of information theory and coding theory have been extensively developed. In most sources, the basic ideas are introduced as background for presenting Shannon's fundamental theorems. More recently, researchers have applied the notions of entropy and information in many new contexts including machine learning [Qui], molecular biology [SSGE], [MBHSWF], [SS], statistics [Kul], and combinatorial optimization and simulation [RK].

The aim of this short tutorial is to provide a general introduction to the ideas of uncertainty and information, including the notions of relative entropy and individual information. Most of the material can be found in one of the references listed in Section 12, but I've reorganized the material using a consistent notation. Section 11 discusses some of the contributions found in the references. Much of the discussion in the tutorial centers on sequences that are meaningful in biological contexts, but for our purposes the sequences simply represent convenient examples to illustrate the general ideas. In particular, we do not present biological applications of the ideas. However, this type of application is found in many of the references including [BTS], [CK], [MBHSWF], [SSGE], [SS], and [WR].

The document is divided into the following sections. You can access an individual section by using the first character hyperlink.

List of Sections
02. Notation
03. Uncertainty and Information
04. Sequence Alignments
05. Information Profiles
06. Individual Information
07. Generalized Individual Information
08. Relative Entropy
09. Mutual Information
10. Sampling Corrections
11. Notes
12. References

For convenience, the definitions and facts found in the tutorial are summarized below with first character hyperlinks to the actual text.

Definitions
01. Maximum number of bits of uncertainty
02. Bits of uncertainty in a symbol
03. Entropy of a sequence
04. Information content of a sequence
05. Information profile of a sequence alignment
06. Cumulative information of a sequence alignment
07. Bits of information in a symbol
08. Individual information score of a sequence
09. Generalized individual information score of a sequence
10. Relative entropy of a sequence
11. Relative entropy score of a sequence
12. Joint entropy of a pair of sequences
13. Mutual information of a pair of sequences

Facts
01. Upper bound on entropy
02. Average of individual information scores is cumulative information
03. Relative entropy is non-negative and unbounded
04. Mutual information is difference of joint entropy and entropies of positions

2. Notation

For our purposes, an alphabet Σ is simply a non-empty finite set of symbols, and N = |Σ|, where |X| denotes the cardinality of a set X. Familiar alphabets include the ASCII characters, the binary symbols B = {0, 1}, the DNA symbols D = {A, C, G, T}, the RNA symbols R = {A, C, G, U}, and the symbols that represent amino acids A = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}.

For each alphabet Σ, Σ* denotes all finite sequences (strings) consisting of symbols in Σ. For each s ∈ Σ*, |s| denotes the number of elements in s (the length of s). For each non-negative integer n, Σ^n denotes all sequences (strings) of length n consisting of symbols in Σ. If |s| = n, then for each 1 ≤ k ≤ n, sk denotes the kth element of s, or in array notation sk = s[k].

Here are some examples that illustrate the definitions. For the alphabet D, N = 4, and for the alphabet A, N = 20. The set B^8 denotes all binary strings of length 8. The set D^n denotes all DNA sequences of length n. The set A* denotes all finite sequences of amino acids, including the empty sequence that contains no symbols. If s = AUG, then |s| = 3. If s = 1010, then s2 = s4 = 0 and s1 = s3 = 1.

Given s ∈ Σ^n and a symbol α ∈ Σ, the frequency of occurrence of α is defined as

fα = |{k : sk = α}| / n.

For example, if s ∈ B^5 contains two 0's and three 1's, then f0 = 2/5 and f1 = 3/5. Similarly, if s = AGGUGA ∈ R^6, then fA = 1/3, fC = 0, fG = 1/2, and fU = 1/6.

3. Uncertainty and Information

Given a sequence s ∈ Σ*, we can quantify its uncertainty or randomness by using an idea from information theory called entropy. This idea is based on computing the number of bits of uncertainty per symbol in the sequence. Alternately, this can be interpreted as the number of binary (yes or no) decisions needed to determine the occurrence of a particular symbol. The idea is illustrated by the following example.

Suppose s is a DNA sequence containing at most four different types of symbols. Each symbol can be classified as either an A or G (purine) or a C or T (pyrimidine) by making exactly one decision or, equivalently, by using one bit (e.g. 0 = purine, 1 = pyrimidine). Since each group contains two symbols, determining the specific symbol also requires exactly one additional decision or bit. Thus a maximum of 2 bits per symbol is needed to classify it. On the other hand, if we know that only purines occur in the sequence s, then at most 1 bit per symbol is needed to classify it. Finally, if s contains exactly one type of symbol, such as an A, there is no uncertainty, so 0 bits per symbol are required.

As another example, in a protein sequence, at most 20 different types of amino acids may occur. To distinguish an individual amino acid (symbol) requires at most five decisions. For example, the first decision (bit) distinguishes 10 of the symbols from the other 10, while the second decision distinguishes 5 of the remaining symbols from the other 5. An additional decision classifies the remaining 5 symbols into two groups, one of size 3 and the other of size 2, and so on, until one symbol is selected. This is analogous to the construction of a decision tree structure.

Definition 1 (def) For each s ∈ Σ*, the maximum number of bits of uncertainty per symbol is log2(N), where N = |Σ|.

If s ∈ Σ* contains N distinct symbols, each occurring with frequency f = 1/N, then the maximum number of bits per symbol can be rewritten as

log2(N) = log2(1/f) = −log2(f).

This formula leads to the following definition:

Definition 2 (def) For each s ∈ Σ*, the symbol α contributes −log2(fα) bits of uncertainty, where fα denotes its frequency of occurrence in s and fα > 0.

For example, if s is a DNA sequence and fA = fG = 1/2, then each of the symbols A or G contributes −log2(1/2) = 1 bit of uncertainty. We don't assign uncertainty values to C or T since fC = fT = 0.
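These per-symbol quantities are easy to compute directly. The following short Python sketch (my own illustration, not code from the tutorial; the function names are arbitrary) tabulates the frequencies fα of a sequence and the −log2(fα) bits of uncertainty contributed by each symbol that occurs.

    from collections import Counter
    from math import log2

    def frequencies(s):
        """Map each symbol of s to its frequency of occurrence f_alpha."""
        n = len(s)
        return {sym: count / n for sym, count in Counter(s).items()}

    def symbol_uncertainty(s):
        """-log2(f_alpha) for each symbol that occurs in s (Definition 2)."""
        return {sym: -log2(f) for sym, f in frequencies(s).items()}

    # Example from Section 2: s = AGGUGA gives fA = 1/3, fG = 1/2, fU = 1/6.
    print(frequencies("AGGUGA"))
    print(symbol_uncertainty("AGGUGA"))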

As a second example, if s is an RNA sequence with fA = 1/4, fC = 1/8, and fG = fU = 5/16, then A contributes −log2(1/4) = 2 bits of uncertainty and C contributes −log2(1/8) = 3 bits of uncertainty. Also, G and U each contribute −log2(5/16) ≈ 1.68 bits of uncertainty. These values make intuitive sense. For example, C is the least frequently occurring symbol, so it has the highest uncertainty, and G and U are the most frequently occurring symbols, so each one has the lowest uncertainty.

The uncertainties associated with each symbol can be combined based on their frequencies of occurrence to provide an overall uncertainty value for a sequence.

Definition 3 (def) For each s ∈ Σ*, the number of bits of uncertainty is

H(s) = −∑{ fα·log2(fα) : α ∈ Σ and fα > 0 }.

Formally, the value H(s) is called the entropy of the sequence s. Based on the definition, this value can also be referred to as the expected uncertainty. For example, if every symbol in Σ is equally likely, then each fα = 1/N and H(s) = log2(N).

As another example, if s is an RNA sequence with fA = 3/4, fC = 1/8, and fG = fU = 1/16, then

H(s) = −3/4·log2(3/4) − 1/8·log2(1/8) − 1/16·log2(1/16) − 1/16·log2(1/16)
     = −3/4·(log2(3) − 2) − 1/8·(−3) − 1/16·(−4) − 1/16·(−4)
     = −3/4·log2(3) + 19/8 ≈ 1.19.

Intuitively, it is reasonable that the uncertainty H(s) is significantly less than 2 since, on average, 3 out of every 4 symbols in s is an A.

Fact 1 (fact) For each s ∈ Σ*, H(s) ≤ log2(N), and equality holds precisely when each fα = 1/N.

Based on Fact 1, the maximum uncertainty for a sequence containing at most N different types of symbols is log2(N). This is all that we can state in general. It represents the uncertainty before any specifics about the sequence are known. On the other hand, after we have knowledge about the sequence, the uncertainty may decrease. For example, if s is a DNA sequence with fG = fT = 1/4 and fA = 1/2, then H(s) = 1.5 in contrast to the maximum entropy of log2(4) = 2. This decrease in uncertainty of 0.5 bits per symbol represents the information in the sequence s.

In its simplest formulation, information can be thought of as the difference of uncertainty values measured in the before and after states. Informally, this principle can be phrased as: information is the reduction in uncertainty based on additional knowledge. For now, we assume that the before state corresponds to a situation where the uncertainty has the maximum possible value.

Definition 4 (def) For each s ∈ Σ*, the information content of s is

info(s) = log2(N) − H(s).

For example, if s contains exactly the amino acid V, then H(s) = 0, so info(s) = log2(20) ≈ 4.32 bits per symbol. In the example before Fact 1, H(s) ≈ 1.19, so info(s) = 2 − H(s) ≈ 0.81 bits per symbol.

4. Sequence Alignments

In addition to the applications of entropy in communication theory, the notion of information and its generalization, relative entropy (discussed in Section 8), have been used to study a variety of problems in molecular biology. In particular, one of the main applications has been in the area of sequence alignments. In this situation, a collection of sequences is aligned by position based on a common property such as a fixed starting position or a special pattern shared by all the sequences. The original document illustrates the idea with an alignment of five binary sequences sharing the pattern 01010. [The alignment is not reproduced in this copy.]

In a biological context, sequence alignments are used to investigate common patterns across different organisms and to deduce consensus sequences. This term denotes a special DNA, RNA, or protein sequence that occurs in essentially the same position with respect to a biological element (such as a gene or chromosome) in the same organism or different organisms. For example, the promoter region of a bacterial gene typically contains a sequence of the form TATAAT located approximately ten positions before the transcription start site. As another example, the telomeric regions in yeast chromosomes contain many repeats of the sequence pattern TG2-3(TG)1-3, denoting any sequence that starts with a T followed by 2 or 3 G's, followed by 1, 2, or 3 occurrences of TG.

Typically, consensus sequences are discovered by aligning sequences from different sources and looking for a common pattern. For example, the following set of sequences is a six amino acid portion of an alignment of proteins from different organisms:

s1  L S P A D K
s2  L S A A D K
s3  L S E G E W
s4  L S A A E K
s5  L T E S Q A

A visual inspection reveals an approximate consensus sequence of the form

L S [A E] A [D E] K

where [ ] denotes a choice of the two symbols. We can estimate the quality of this consensus sequence by computing the information content of each column sequence. For this purpose, we use the following notation. Given a set of sequences s1, s2, ..., sn belonging to Σ^m, let sij denote the jth member of sequence si, 1 ≤ j ≤ m. Let c1, c2, ..., cm denote the sequences representing the columns of the alignment, and let fαj denote the frequency of occurrence of symbol α at position j.

In the preceding example, fL1 = 1, so info(c1) = log2(20) ≈ 4.32, the maximum possible information, which reflects perfect conservation at position 1. Also, fS2 = 0.8 and fT2 = 0.2, so H(c2) = −0.8·log2(0.8) − 0.2·log2(0.2) = log2(5) − 1.6 ≈ 0.72. Hence info(c2) = log2(20) − H(c2) = 3.6, a value indicating a high degree of conservation at position 2. On the other hand, since fD5 = fE5 = 0.4 and fQ5 = 0.2, H(c5) = log2(5) − 0.8 ≈ 1.52, so info(c5) = 2.8, reflecting the poorer quality of the alignment at position 5.

In general, the values of the information content can be used on any sequence alignment to assess the degree of conservation in the columns. These values are particularly useful for large sets of sequences where it is difficult to visually inspect each column.

5. Information Profiles

As we saw in the preceding section, given a set of sequences s1, s2, ..., sn belonging to Σ^m, we can compute the information content info(cj) for each position 1 ≤ j ≤ m. It is useful to be able to refer collectively to these values.

Definition 5 (def) The vector <info(c1), info(c2), ..., info(cm)> is called the information profile of the sequence alignment.

The graph of the information profile provides a convenient summary of the degree of conservation in a sequence alignment. The following example illustrates the idea; it lists the frequencies of occurrence of A, C, G, and T at positions -4, -3, -2, -1, +4 in an alignment of 76 DNA sequences. (As usual, we assume that the start site pattern ATG is found at positions +1, +2, and +3 in the sequences.) [The table of frequencies, with rows A, C, G, T and columns for the five positions, is not reproduced in this copy.] Based on these frequencies, the information profile is <0.30, 0.56, 0.40, 0.18, 0.22>.
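As a check on these calculations, here is a short Python sketch (my own illustration, not the author's code) that computes the information profile of an alignment directly from Definitions 3 through 5; applied to the five-sequence amino acid alignment above, it reproduces info(c1) ≈ 4.32, info(c2) = 3.6, and info(c5) = 2.8.

    from collections import Counter
    from math import log2

    def entropy(column):
        """H(c) = -sum of f*log2(f) over the symbols occurring in the column (Definition 3)."""
        n = len(column)
        return -sum((c / n) * log2(c / n) for c in Counter(column).values())

    def information_profile(seqs, alphabet_size):
        """info(cj) = log2(N) - H(cj) for each column cj of the alignment (Definitions 4 and 5)."""
        return [log2(alphabet_size) - entropy(col) for col in zip(*seqs)]

    alignment = ["LSPADK", "LSAADK", "LSEGEW", "LSAAEK", "LTESQA"]
    print([round(x, 2) for x in information_profile(alignment, alphabet_size=20)])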

The profile is displayed in a figure in the original document (information in bits plotted against nucleotide position). [The figure is not reproduced in this copy.] The maximum value info(c-3) = 0.56 reflects one main feature of the alignment, namely that at position -3, the bases A and G occur in 90% of the sequences.

Given an alignment of sequences, it is frequently useful to also tabulate the collective information found in a profile.

Definition 6 (def) Given an information profile <info(c1), info(c2), ..., info(cm)>, the cumulative information is

info(c1 ... cm) = ∑{ info(cj) : 1 ≤ j ≤ m }.

For instance, in the above example, the cumulative information is info(c-4 ... c-1, c+4) = 1.66.

6. Individual Information

The ideas introduced in Section 5 give an information-theoretic perspective of a sequence alignment. However, they do not assign a value to an individual sequence that measures its conformity to the consensus of the alignment. Therefore, they do not provide a means for assessing a sequence that may be of interest but does not belong to the original collection. These shortcomings can be corrected by using the idea of the uncertainty of a symbol introduced in Definition 2. We adopt the same perspective that was used previously to define information. Namely, without specific knowledge about the occurrence of a symbol α, there are log2(N) bits of uncertainty, but once the frequency of occurrence fα > 0 is known, there are −log2(fα) bits of uncertainty. The difference of the two values represents the information with respect to α.

Definition 7 (def) For each s ∈ Σ*, a symbol α with fα > 0 contributes

log2(fα/f) = log2(N) + log2(fα)

bits of information, where f = 1/N.

Suppose s1, s2, ..., sn is a set of sequences in Σ^m. For each α and 1 ≤ j ≤ m, define the weight

w(α, j) = log2(fαj/f),

where fαj denotes the frequency of occurrence of symbol α at position j. Based on Definition 7, this value is the number of bits of information contributed by symbol α in the sequence cj. Using the set of values {w(α, j)}, we can assign a score to each sequence in the following manner:

Definition 8 (def) The individual information score for the sequence si is

infoscore(si) = ∑{ w(sij, j) : 1 ≤ j ≤ m }.

In other words, the score is the sum of the bits contributed by the symbols found at the various positions in the sequence. For example, consider the following alignment of ten RNA sequences.

s1   CAUGGGAGAG
s2   CAUGCGAGAG
s3   CAGAGUUAAA
s4   CACGGCGGUA
s5   CACCCAGAAG
s6   AAACCUGAAG
s7   AGAUCUAAAG
s8   AGAUCUAAAG
s9   CACGGUACAG
s10  CACGCGGGAG

The information profile for the alignment is <1.12, 1.28, 0.15, 0.24, 1.03, 0.32, 0.64, 0.64, 1.53, 1.28> and the cumulative information is 8.22. The set of weight values w(α, j) is shown below for each α = A, C, G, U and 1 ≤ j ≤ 10. [The weight table is not reproduced in this copy.]
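Since the weight table itself did not survive in this copy, the following Python sketch (my own reconstruction based on Definitions 7 and 8, not the author's program) computes the weights w(α, j) = log2(N·fαj) and the individual information scores for the ten RNA sequences above.

    from collections import Counter
    from math import log2

    def weight_matrix(seqs, alphabet):
        """w(alpha, j) = log2(N * f_alpha_j); None marks weights that are undefined (f = 0)."""
        n, N = len(seqs), len(alphabet)
        matrix = []
        for j in range(len(seqs[0])):
            counts = Counter(s[j] for s in seqs)
            matrix.append({a: (log2(N * counts[a] / n) if counts[a] > 0 else None)
                           for a in alphabet})
        return matrix

    def infoscore(seq, w):
        """Individual information score: the sum of the weights of the symbols of seq (Definition 8)."""
        return sum(w[j][sym] for j, sym in enumerate(seq))

    seqs = ["CAUGGGAGAG", "CAUGCGAGAG", "CAGAGUUAAA", "CACGGCGGUA", "CACCCAGAAG",
            "AAACCUGAAG", "AGAUCUAAAG", "AGAUCUAAAG", "CACGGUACAG", "CACGCGGGAG"]
    w = weight_matrix(seqs, "ACGU")
    scores = [(s, round(infoscore(s, w), 2)) for s in seqs]
    print(sorted(scores, key=lambda item: -item[1]))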

Based on this weight matrix, a consensus sequence CA GCUAAAG (with no consensus symbol at position 3) can be determined by choosing the symbol α in each position j where w(α, j) ≥ 1. The individual information scores for each sequence, listed in order of highest score, are shown in a table in the original document. [The table is not reproduced in this copy.]

An examination of the highest scoring sequence s10 = CACGCGGGAG shows that it matches the consensus CA GCUAAAG in six positions. However, the lowest scoring sequence s3 = CAGAGUUAAA also matches the consensus in five positions. This illustrates how individual information scores can provide a more discriminating measure than simply counting the number of matches to a consensus sequence.

Notice that the average of the individual information scores is exactly the value of the cumulative information, 8.22. In fact, this observation holds in general.

Fact 2 (fact) For any set of sequences s1, s2, ..., sn in Σ^m,

info(c1 ... cm) = ∑{ infoscore(si) : 1 ≤ i ≤ n } / n.

Based on Definition 8, each individual information score can be written in the form

infoscore(si) = log2(N^m) + log2( ∏{ fαj : α = sij, 1 ≤ j ≤ m } ).

This expression shows that each score can also be interpreted as a difference of uncertainty values. For example, if we are working with DNA sequences, then N = 4, so

infoscore(si) = −log2((1/4)^m) − (−log2(pi)) = 2m − (−log2(pi)).

Here 4^-m is the probability of occurrence of a random DNA sequence of length m, and pi = ∏ fαj (taking α = sij at each position j) is the probability that the sequence si occurs (assuming independence at each position).
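Continuing the previous sketch (it assumes seqs, w, infoscore, and information_profile from the earlier code are still in scope), the following lines give a quick numerical check of Fact 2 and of the probability interpretation above.

    from math import log2, prod

    n, m, N = len(seqs), len(seqs[0]), 4

    # Fact 2: the average individual information score equals the cumulative information.
    average_score = sum(infoscore(s, w) for s in seqs) / n
    cumulative = sum(information_profile(seqs, N))
    print(round(average_score, 2), round(cumulative, 2))    # the two values agree

    # Probability interpretation: infoscore(si) = m*log2(N) - (-log2(pi)), where pi is the
    # product of the column frequencies of the symbols of si.
    def column_frequency(symbol, j):
        return sum(1 for s in seqs if s[j] == symbol) / n

    first = seqs[0]
    p1 = prod(column_frequency(sym, j) for j, sym in enumerate(first))
    print(round(infoscore(first, w), 2), round(m * log2(N) + log2(p1), 2))    # equal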

The following figure in the original document (number of occurrences plotted against the individual information score, rounded to the nearest integer) shows the distribution of individual information scores for the family of DNA sequences discussed after Definition 5. [The figure is not reproduced in this copy.] The actual scores range from a minimum of −5.26 up to a maximum bounded by 2m = 10 bits, with an average value of 1.66. Therefore, by Fact 2, 1.66 is the value of the cumulative information.

The distribution resembles a truncated normal distribution. This truncation will always occur for individual information scores of DNA sequences because a given score can't exceed the value 2m.

7. Generalized Individual Information

Suppose s1, s2, ..., sn is a set of sequences in Σ^m. Using Definition 8, we can assign an individual information score to any sequence s of length m as long as each symbol at position j in s matches one of the symbols sij. However, in the example following the definition, we can't assign a score to either GAUGGGAGAG or UAUGGGAGAG since for every sequence si, si1 = A or si1 = C. This restriction hampers the general applicability of the scoring technique.

One solution is to guarantee that w(α, j) is defined for each symbol α and position j regardless of the nature of the sequences s1, s2, ..., sn used to define the weights. This can be done by using frequency pseudocounts instead of frequency counts. The idea is to assume that a 1/N fraction of each symbol is always present at any position in a sequence alignment. Formally, for each α and 1 ≤ j ≤ m, define

f*αj = ( |{i : sij = α}| + 1/N ) / (n + 1).

In particular, if symbol α is not found at position j, then f*αj = 1/(N·(n + 1)). Since the denominator is n + 1, the equation ∑{ f*αj : α ∈ Σ } = 1 holds for each j, so we still maintain a frequency distribution at each position. Using the new set of frequencies, define a generalized set of weight values by

w*(α, j) = log2(N) + log2(f*αj).
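A minimal Python sketch of the pseudocount construction (my own illustration of the two formulas above; it reuses seqs from the earlier sketch):

    from math import log2

    def generalized_weight_matrix(seqs, alphabet):
        """w*(alpha, j) = log2(N) + log2(f*_alpha_j), defined for every symbol and position."""
        n, N = len(seqs), len(alphabet)
        matrix = []
        for j in range(len(seqs[0])):
            row = {}
            for a in alphabet:
                count = sum(1 for s in seqs if s[j] == a)
                f_star = (count + 1 / N) / (n + 1)      # frequency pseudocount
                row[a] = log2(N) + log2(f_star)
            matrix.append(row)
        return matrix

    w_star = generalized_weight_matrix(seqs, "ACGU")
    # A symbol never observed at a position gets weight 2 + log2(1/44), about -3.46 here (N = 4, n = 10).
    print(round(w_star[0]["G"], 2))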

Definition 9 (def) The generalized individual information score for a sequence u ∈ Σ^m is

infoscore*(u) = ∑{ w*(uj, j) : 1 ≤ j ≤ m }.

As an illustration, the original document shows the set of generalized weight values w*(α, j) for the example following Definition 8. [The table is not reproduced in this copy.] Comparing the two sets of weight values, notice that the values of the existing positive (negative) weights have slightly decreased (increased). Also, the new weight corresponding to a previously undefined value is 2 + log2(1/44) ≈ −3.46.

The following table shows the scores of the original family of sequences with respect to the two different sets of weight values; notice that the relative ordering of the scores is the same in both cases. [The table of infoscore and infoscore* values is not reproduced in this copy.]

The following table shows the scores for two sequences t1 and t2 that are not present in the original data set.

sequence              infoscore*
t1 = GAUGGGAGAG       4.53
t2 = CACUGUACAG       8.01

The sequence t1 differs from s1 only in position 1 (G instead of C), but G has a weight of −3.46 while C has a weight of 1.40, so there is a significant difference in scores for the two sequences

(4.53 vs. 9.39). The sequence t2 also differs from s9 in only position 4 (U instead of G), but the difference in the respective weights is only 1.23, so the scores are fairly close (8.01 vs. 9.24).

8. Relative Entropy

The definitions of information and individual information in the preceding sections are based on the assumption that the before state reflects a random sequence. Therefore, the uncertainty in this state is assigned the value log2(N). Based on Definition 4, the information content of a sequence s in Σ* is

info(s) = log2(N) − H(s)
        = log2(N) − (−∑{ fα·log2(fα) : fα > 0 })
        = ∑{ fα·(log2(N) + log2(fα)) : fα > 0 }
        = ∑{ fα·log2(fα/bα) : fα > 0 },

where bα = 1/N represents the background frequency of occurrence of symbol α. This is equivalent to the assumption of no a priori knowledge. The preceding expression suggests another way to incorporate background information. Suppose <bα> is a fixed background frequency distribution for the set of symbols with each bα > 0 and ∑ bα = 1.

Definition 10 (def) For each s ∈ Σ*, the relative entropy of s with respect to <bα> is

infob(s) = ∑{ fα·log2(fα/bα) : fα > 0 }.

The relative entropy is also called the cross-entropy or the Kullback-Leibler divergence of s from the background <bα> and is frequently denoted by D(f || b).

Suppose s1, s2, ..., sn is a set of sequences in Σ^m. By analogy with Definitions 5 and 6, we can define a relative entropy profile

<infob(c1), infob(c2), ..., infob(cm)>

and a cumulative relative entropy

infob(c1 ... cm) = ∑{ infob(cj) : 1 ≤ j ≤ m }.

We can also generalize the notion of individual information by incorporating background information. For each α and 1 ≤ j ≤ m, define a modified weight value using the equation

wb(α, j) = log2(fαj/bα).

Definition 11 (def) The individual relative entropy score for the sequence si is

infoscoreb(si) = ∑{ wb(sij, j) : 1 ≤ j ≤ m }.
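A small Python sketch of Definitions 10 and 11 (my own illustration; the helper names are arbitrary):

    from collections import Counter
    from math import log2

    def relative_entropy(s, background):
        """infob(s) = sum of f*log2(f/b) over the symbols occurring in s (Definition 10)."""
        n = len(s)
        return sum((c / n) * log2((c / n) / background[sym])
                   for sym, c in Counter(s).items())

    def relative_entropy_profile(seqs, background):
        """Column-by-column relative entropy of an alignment."""
        return [relative_entropy("".join(col), background) for col in zip(*seqs)]

    def infoscore_b(seq, seqs, background):
        """Individual relative entropy score: sum of log2(f_alpha_j / b_alpha) (Definition 11)."""
        n = len(seqs)
        return sum(log2((sum(1 for s in seqs if s[j] == sym) / n) / background[sym])
                   for j, sym in enumerate(seq))

    # With a uniform background the relative entropy reduces to the information content.
    print(round(relative_entropy("AGGUGA", {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}), 2))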

As we did in Definition 9, by using a set of weight values {wb*(α, j)} based on frequency pseudocounts, we can also define a generalized individual relative entropy function infoscoreb* that can be applied to any sequence u ∈ Σ^m.

The following examples compare the use of information and relative entropy. Let Σ = {0, 1, 2} and suppose that the frequency occurrences for a sequence s ∈ Σ* are f0 = 0.8 and f1 = f2 = 0.1. Assume that the background distribution is b0 = 0.5, b1 = 0.3, and b2 = 0.2. Based on Definition 3, H(s) ≈ 0.92. Since N = 3, by Definition 4, info(s) = log2(3) − H(s) ≈ 0.66. On the other hand, by Definition 10, infob(s) = 0.28, which is a considerably smaller value. The reason is that the assumption of the higher background frequency for symbol 0 (0.5) diminishes the effect of the larger value of f0.

At the other extreme, if the background distribution is b0 = 0.1 and b1 = b2 = 0.4, then the effect of the large value f0 is magnified. In this case, the relative entropy infob(s) = 2.17, which exceeds the maximum number of bits of uncertainty for a three-symbol alphabet (log2(3) ≈ 1.585). Therefore, we can't interpret the value 2.17 as a number of decisions or bits with respect to that alphabet. This illustrates a drawback of using the relative entropy measure.

Fact 3 (fact) For each s ∈ Σ*, the relative entropy infob(s) ≥ 0, but there is no upper bound on the value of infob(s) that is independent of the background <bα>.

For the following alignment of five RNA sequences, assume that the background frequencies are bA = 0.4, bC = 0.1, and bG = bU = 0.25.

s1  CACU
s2  GACU
s3  GACG
s4  GCCA
s5  GCCA

The information (relative entropy) profile is <1.28, 1.03, 2.00, 0.48> (<1.54, 1.15, 3.32, 0.21>) and the cumulative information (relative entropy) is 4.79 (6.22). The explanations for the different values are based on the background assumptions. First, the assumption of very scarce C's (bC = 0.1) increases their importance when they are observed because of the expression log2(fCj/bC). Therefore, the C in position 1 of s1 increases the relative entropy to 1.54 from the information value of 1.28. The same idea explains the increase in value in position 3 (2.00 to 3.32). On the other hand, the assumption of abundant A's (bA = 0.4) decreases their importance when they are observed. Therefore, in position 4 there is a decrease in values (0.48 to 0.21) because of the presence of the A's. Finally, there is little change in the values at position 2 (1.03 to 1.15) because of the counteracting effects of the A's and C's.
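The profile comparison above can be reproduced with the information_profile and relative_entropy_profile helpers sketched earlier (again my own code, not the author's); under the stated background it returns the same rounded values.

    seqs5 = ["CACU", "GACU", "GACG", "GCCA", "GCCA"]
    background = {"A": 0.4, "C": 0.1, "G": 0.25, "U": 0.25}

    info_profile = information_profile(seqs5, alphabet_size=4)
    rel_profile = relative_entropy_profile(seqs5, background)

    print([round(x, 2) for x in info_profile])    # 1.28, 1.03, 2.00, 0.48
    print([round(x, 2) for x in rel_profile])     # 1.54, 1.15, 3.32, 0.21
    print(round(sum(info_profile), 2), round(sum(rel_profile), 2))    # 4.79 and 6.22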

The following individual information and relative entropy scores can also be explained in a similar manner. [The table of infoscore and infoscoreb values for the five sequences CACU, GACU, GACG, GCCA, GCCA is not reproduced in this copy.]

9. Mutual Information

Frequently, it is useful to assess the degree of dependence between a pair of sequences. In conventional statistics, the sequences often consist of numerical values and this assessment involves the use of a covariance matrix. There is an analogous idea that can be used for sequences of symbols that leads to the notion of mutual information.

Let s and t be two sequences of length n. For each pair of symbols α, β ∈ Σ, define the joint frequency of occurrence

fαβ = |{i : si = α and ti = β}| / n.

The following definition is the analogue of Definition 3.

Definition 12 (def) For each pair s, t ∈ Σ^n, the joint entropy of s and t is

H(s, t) = −∑{ fαβ·log2(fαβ) : α, β ∈ Σ and fαβ > 0 }.

For example, for the following pair of binary sequences s and t [not reproduced in this copy], the joint frequencies are f01 = 0.5, f10 = 0.2, and f11 = 0.3, so

H(s, t) = −f01·log2(f01) − f10·log2(f10) − f11·log2(f11)
        = −0.5·log2(0.5) − 0.2·log2(0.2) − 0.3·log2(0.3) ≈ 1.49.

The joint entropy can be interpreted as the number of bits of uncertainty per symbol, where the maximum number of bits of uncertainty per symbol is log2(number of distinct symbol pairs). In the preceding example, there are 4 distinct ordered pairs of symbols, so H(s, t) ≤ 2 always holds, with equality precisely when fαβ = 1/4 for each α, β (Fact 1).

As a second example, consider the set of sequences shown below.

s1   CUCCCUUIGCAUGGGAG
s2   CUCGCUUIGCAUGCGAG
s3   UAACUUUGUCAGAGUUA

s4   CCGCCCUUUCACGGCGG
s5   UCUGGCUUUCACCCAGA
s6   UCAGGCUGAAAACCUGA
s7   UUAGACUGAAGAUCUAA
s8   UUAGACUGAAGAUCUAA
s9   GUACCCUGCCACGGUAC
s10  CUCGCCUGCCACGCGGG

The joint entropy of the column sequences c4 and c14 is

H(c4, c14) = −fCG·log2(fCG) − fGC·log2(fGC) = 0.97,

since fCG = 0.4 and fGC = 0.6 (using positions 4 and 14). Similarly, H(c9, c10) ≈ 1.97. The larger joint entropy at positions 9 and 10 reflects the greater uncertainty involved in observing the four pairs of symbols AA, CC, GC, and UC than in observing the two pairs of symbols CG and GC.

Based on Definition 4, there is a natural way to use the joint entropy to define the joint information content between pairs of sequences s and t:

info(s, t) = log2(N^2) − H(s, t) = 2·log2(N) − H(s, t).

(Since the pair-wise sequence (s, t) has values in the product Σ × Σ, the maximum uncertainty is log2(N^2) = 2·log2(N).) However, there is another measure called mutual information that also takes into account the individual frequencies of occurrence in each sequence. Moreover, this measure, in some sense, quantifies the extent to which the two sequences are dependent.

Definition 13 (def) For each pair s, s′ ∈ Σ^n, the mutual information of s and s′ is

M(s, s′) = ∑{ fαβ·log2[ fαβ / (fα·f′β) ] : fαβ > 0 }.

The ratio fαβ/(fα·f′β) measures the degree of dependence with respect to the symbols α and β, where fα (resp. f′β) denotes the frequency of occurrence of α (resp. β) in the sequence s (resp. s′). If the positions are completely independent for this pair of symbols, then fαβ = fα·f′β, so the ratio is 1 and the term does not contribute to the mutual information.

The mutual information for the pair of binary sequences discussed after Definition 12 is

M(s, s′) = f01·log2(f01/(f0·f′1)) + f10·log2(f10/(f1·f′0)) + f11·log2(f11/(f1·f′1))
         = 0.5·log2(0.5/((0.5)(0.8))) + 0.2·log2(0.2/((0.5)(0.2))) + 0.3·log2(0.3/((0.5)(0.8)))
         = 0.5·log2(1.25) + 0.2·log2(2) + 0.3·log2(0.75) ≈ 0.24.

Notice that this value is significantly smaller than the information content info(s, s′) ≈ 0.51, which does not measure dependence.
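A small Python sketch of Definitions 12 and 13 (my own illustration). The two binary strings below are hypothetical stand-ins chosen to have the joint frequencies f01 = 0.5, f10 = 0.2, f11 = 0.3 used in the example; the original sequences are not reproduced in this copy.

    from collections import Counter
    from math import log2

    def joint_entropy(s, t):
        """H(s, t) over the ordered pairs (si, ti) (Definition 12)."""
        n = len(s)
        return -sum((c / n) * log2(c / n) for c in Counter(zip(s, t)).values())

    def mutual_information(s, t):
        """M(s, t) = sum of f_ab * log2(f_ab / (f_a * f_b)) (Definition 13)."""
        n = len(s)
        fs, ft, fst = Counter(s), Counter(t), Counter(zip(s, t))
        return sum((c / n) * log2((c / n) / ((fs[a] / n) * (ft[b] / n)))
                   for (a, b), c in fst.items())

    s = "0000011111"    # hypothetical
    t = "1111100111"    # hypothetical
    print(round(joint_entropy(s, t), 2))        # about 1.49
    print(round(mutual_information(s, t), 2))   # about 0.24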

The following table in the original document lists the values of the mutual information M(cj, ck) for several pairs of column sequences in the preceding example of RNA sequences. [The table is not reproduced in this copy.] The highest value of the mutual information is found at positions 3 and 15 (1.68). This value indicates a high degree of dependence between the two positions. Notice that when a symbol changes in column 3, the symbol in column 15 also changes. This is shown by the distribution of pairs: CG, GC, AU, and UA. On the other hand, the mutual information at positions 5 and 10 is relatively low (0.68), indicating a high degree of independence: the pairs GC, GA, and AA are found at the two positions, as well as the pairs CC and UC. Therefore, in many cases, a symbol change in column 5 is not reflected by a symbol change in column 10.

In general, one can establish the following result that relates the entropy and mutual information definitions.

Fact 4 (fact) For each pair s, t ∈ Σ^n,
(a) M(s, t) = H(s) + H(t) − H(s, t).
(b) 0 ≤ M(s, t) ≤ min{H(s), H(t)}.

The first equation says that we can interpret the mutual information of two sequences s and t in a manner similar to our interpretation of information. Namely, if we assume that the before state represents the assumption that s and t are independent, then H(s) + H(t) represents the uncertainty. In the after state, we have knowledge of the dependencies, so H(s, t) represents the new uncertainty. Hence the (mutual) information is H(s) + H(t) − H(s, t).

The second inequality says that the mutual information is always bounded by the smaller entropy of the two sequences. Therefore, in the preceding example the inequality M(cj, ck) ≤ 2 always holds. This shows that the mutual information M(c3, c15) = 1.68 is a reasonably large value. On the other hand, M(s, t) ≈ 0.24 for the pair of sequences listed after Definition 12. Since we are dealing with a two-symbol alphabet, it follows from (b) that in general M(s, t) ≤ 1, so the relative value of the mutual information is small.

10. Sampling Corrections

Suppose s1, s2, ..., sn is a set of sequences belonging to Σ^m. In general, a sampling bias is introduced by using the frequencies of occurrence of symbols to compute the information profiles

([SSGE], Appendix). The extent of this bias at a position 1 ≤ j ≤ m depends on the number of sequences n and, to some extent, on the frequencies of the symbols at position j. An unbiased estimate for the information content can be calculated by first assuming a multinomial distribution of the symbols at a given position and then computing the expected value of an associated random variable.

Assume that the sample space is

S = { (mα) : mα ≥ 0 and ∑{ mα : α ∈ Σ } = n },

where each mα represents the number of occurrences of symbol α. For each position j, the probability Pj of occurrence of an event e = (mα) in S is defined by

Pj(e) = c(e)·∏{ fαj^mα : α ∈ Σ }, where c(e) = n! / ∏{ mα! : α ∈ Σ }

and {fαj} denotes the observed frequencies of the symbols at position j. Define a random variable Ij on the probability space (S, Pj) by the expression

Ij(e) = log2(N) − H(e), where H(e) = −∑{ (mα/n)·log2(mα/n) : mα > 0 }

denotes the value of the entropy for a sequence with symbol counts e = (mα). The expected value of Ij provides an unbiased estimate of the information content at position j. This value is defined by the expression

E(Ij) = ∑{ Pj(e)·Ij(e) : e ∈ S } = log2(N) − ∑{ Pj(e)·H(e) : e ∈ S }.

It is established in [Bas] that E(Ij) → info(cj) as n → +∞ and, furthermore, that the difference between the observed information and the expected value of Ij is equal to a correction factor plus a scaling factor:

info(cj) − E(Ij) = r(N, n) + O(1/n^2), where r(N, n) = (N − 1)/(ln(4)·n).

Numerical estimates for small numbers of symbols (N ≤ 5) show that for a given sample size n, info(cj) − E(Ij) has approximately the same value independent of the distribution {fαj}. In addition, for n ≥ 100, the value info(cj) − E(Ij) is closely approximated by r(N, n). The following tables illustrate these ideas. The first table shows that info(cj) − E(Ij) has approximately the same value for N = 3 using the following values for {fαj}: (a) <1/3, 1/3, 1/3>, (b) <1/5, 3/10, 1/2>, (c) <1/10, 1/5, 7/10>. [The table of values for several sample sizes n is not reproduced in this copy.]
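For small sample sizes, E(Ij) can be evaluated exactly by enumerating all count vectors (mα). The Python sketch below (my own illustration of the formulas above, not the author's program) does this for N = 4 with each fαj = 1/4 and compares the size of the sampling bias with the approximation r(N, n).

    from math import factorial, log, log2

    def count_vectors(n, N):
        """All vectors (m1, ..., mN) of non-negative integers summing to n."""
        if N == 1:
            yield (n,)
            return
        for m in range(n + 1):
            for rest in count_vectors(n - m, N - 1):
                yield (m,) + rest

    def expected_information(freqs, n):
        """E(Ij): expected plug-in information under a multinomial model with parameters freqs."""
        N = len(freqs)
        fact_n = factorial(n)
        total = 0.0
        for e in count_vectors(n, N):
            coeff = fact_n
            for m in e:
                coeff //= factorial(m)
            prob = float(coeff)
            for f, m in zip(freqs, e):
                prob *= f ** m
            H = -sum((m / n) * log2(m / n) for m in e if m > 0)
            total += prob * (log2(N) - H)
        return total

    def r(N, n):
        """First-order sampling correction (N - 1)/(ln(4) * n) quoted from [Bas]."""
        return (N - 1) / (log(4) * n)

    # With each f_alpha_j = 1/4 the information content is 0, so E(Ij) itself is the size
    # of the sampling bias |info(cj) - E(Ij)|; it is close to r(4, n) for moderate n.
    uniform = [0.25, 0.25, 0.25, 0.25]
    for n in (10, 25, 50, 100):
        print(n, round(expected_information(uniform, n), 4), round(r(4, n), 4))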

The next table shows how info(cj) − E(Ij) is approximated by r(N, n) for a variety of sample sizes, where N = 4 and each fαj = 1/4. [The table with columns n, info(cj) − E(Ij), r(N, n), and error is not reproduced in this copy.]

The final table shows that for a given sample size n, the sampling correction info(cj) − E(Ij) increases as the number of symbols N increases. The values are calculated based on the assumption that each fαj = 1/N. [The table with columns n and several alphabet sizes N is not reproduced in this copy.]

The variance of the random variable Ij is given by the expression

Var(Ij) = ∑{ Pj(e)·H(e)^2 : e ∈ S } − ( ∑{ Pj(e)·H(e) : e ∈ S } )^2.

For n ≤ 100, this expression can be directly evaluated. The following table shows sample values of the variance for N = 4 under the assumption that each fαj = 1/4. [The table with columns n, info(cj) − E(Ij), and Var(Ij) is not reproduced in this copy.]

Numerical estimates for values N ≤ 5 indicate that for n > 100, the variance Var(Ij) is closely approximated by the following expression given in [Bas]:

Vare(Ij) = ( ∑{ fαj·(log2(fαj))^2 } − H(fj)^2 ) / n,

where H(fj) denotes the entropy of the frequency distribution at position j. Based on the preceding work, it follows that the cumulative information is estimated by the expression

E( ∑{ Ij : 1 ≤ j ≤ m } ) = ∑{ E(Ij) : 1 ≤ j ≤ m }.

Assuming that the random variables {Ij} are independent, the variance at the m positions is estimated by the expression

Var( ∑{ Ij : 1 ≤ j ≤ m } ) = ∑{ Var(Ij) : 1 ≤ j ≤ m }.

In the example following Definition 5, the information profile was <0.30, 0.56, 0.40, 0.18, 0.22> and the cumulative information was 1.66. These values are based on the biased information content at each position. In this example, N = 4, so the sampling correction shown in the third table indicates that the unbiased values of the information content are approximately 0.03 smaller. Hence the unbiased information profile is <0.27, 0.53, 0.37, 0.15, 0.19> and the unbiased cumulative information is 1.51.

Similarly, in the example after Definition 8, only 10 RNA sequences were used in the alignment, so the unbiased value of each information content value is significantly smaller than the biased value (the correction is approximately 0.25). Hence, for example, the unbiased information content at position 2 is 1.03 vs. 1.28. Also, in some cases, the unbiased information values are negative (positions 3 and 4). The unbiased cumulative information is also significantly smaller (5.72 vs. 8.22).

In principle, we can also compute sampling corrections associated with relative entropy values that depend not only on the frequencies of occurrence {fαj} at a given position but also on the background frequencies {bα}. One approach for calculating these corrections uses the methods described above with each Ij replaced by a random variable Ij* that is defined by

Ij*(e) = ∑{ (mα/n)·log2((mα/n)/bα) : mα > 0 }.

However, I haven't carried out the calculations of E(Ij*) or Var(Ij*). Also, in this case we don't know a closed-form approximation of the sampling corrections for large sample sizes.

11. Notes

The reference [Sha] is the classic paper that introduced entropy as the foundation of communication theory. The initial chapters in [Ash] and [CT] provide thorough introductions to information theory including the concepts of entropy, relative entropy, and mutual information. The historical notes found in [CT] also provide a useful perspective. Chapters 6 and 7 in [App] give an elegant basic introduction to information theory. Chapters 1-3 in [Mac] present an introduction to information theory and discuss its applications to statistical inference. Chapter 7 in [Ham] provides a concise introduction to entropy and the maximum entropy principle. The

tutorial [Sch]1, although written for molecular biologists, also provides a general introduction to the notions of entropy and information. The reference [DEKM] discusses the general use of probabilistic models in analyzing biological problems. In particular, chapter 11 provides a useful summary of entropy, relative entropy, and mutual information. Reference [CK] is the pioneering paper on using mutual information to predict RNA secondary structure. The lecture notes [Tom] provide brief but very useful discussions of applications of information profiles (chapter 7) and relative entropy (chapters 8 and 9).

The reference [KL] is the pioneering paper on the relationships between information theory and statistics. Chapter 12 in [CT] also provides an introduction to this subject. The paper [HS] presents a useful discussion of statistical issues that arise in sequence alignments including the calculation of P-values. The papers [Bas] and [Mil] investigate several statistical issues related to entropy. The appendix to [SSGE] discusses the calculation of sampling corrections for entropy values and connections with the work in [Bas]. The references [BTS], [LB], [MBHSWF], [Sch]2, [SS], and [WR] use the ideas of information content and individual information in various biological contexts to analyze binding sites and splice sites.

12. References

[App] D. Applebaum, Probability and Information: An Integrated Approach, Cambridge University Press.
[Ash] R. B. Ash, Information Theory, Dover Publications, New York.
[Bas] G. P. Basharin, On a Statistical Estimate for the Entropy of a Sequence of Independent Random Variables, Theory Probability Appl. 4.
[BTS] C. B. Burge, T. Tuschl, and P. A. Sharp, Splicing of precursors to mRNAs by the spliceosomes, in The RNA World (editors R. F. Gesteland, T. R. Cech, and J. F. Atkins), Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.
[CK] D. K. Y. Chiu and T. Kolodziejczak, Inferring consensus structure from nucleic acid sequences, CABIOS 7, 3.
[CT] T. Cover and J. Thomas, Elements of Information Theory, John Wiley & Sons, New York.
[DEKM] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press.
[Ham] R. W. Hamming, The Art of Probability for Scientists and Engineers, Westview Press.
[HS] G. Z. Hertz and G. D. Stormo, Identifying DNA and protein patterns with statistically significant alignments of multiple sequences, Bioinformatics 15.
[Kul] S. Kullback, Information Theory and Statistics, Wiley, New York.
[KL] S. Kullback and R. A. Leibler, On information and sufficiency, Ann. Math. Stat. 22, 79-86.

[LB] L. P. Lim and C. B. Burge, A computational analysis of sequence features involved in recognition of short introns, Proc. Natl. Acad. Sci. 98.
[Mac] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press.
[Mil] G. A. Miller, Information Theory in Psychology, Free Press, Glencoe, Illinois.
[MBHSWF] S. M. Mount, C. Burks, G. Hertz, G. D. Stormo, O. White, and C. Fields, Splicing signals in Drosophila: intron size, information content, and consensus sequences, Nucleic Acids Res. 20.
[Qui] J. R. Quinlan, Induction of decision trees, Machine Learning 1.
[RK] R. Y. Rubinstein and D. P. Kroese, The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning, Springer-Verlag.
[SSGE] T. D. Schneider, G. D. Stormo, L. Gold, and A. Ehrenfeucht, Information content of binding sites on nucleotide sequences, J. Mol. Biol. 188. [Appendix: T. D. Schneider, J. S. Haemer, and G. D. Stormo, Calculation of sampling uncertainty and variance]
[Sch]1 T. D. Schneider, Information Theory Primer with Appendix on Logarithms, Version 2.54.
[Sch]2 T. D. Schneider, Information content of individual genetic sequences, J. Theor. Biol. 189.
[SW] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana.
[Sha] C. E. Shannon, A Mathematical Theory of Communication, The Bell System Technical Journal, Vol. 27, July and October, 1948.
[SS] R. M. Stephens and T. D. Schneider, Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites, J. Mol. Biol. 228.
[Tom] M. Tompa, Lecture Notes on Computational Biology, CSE 527, Winter.
[WR] M. Weir and M. Rice, Ordered partitioning reveals extended splice-site consensus information, Genome Research 14, 67-78.


An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

A General Method for Combining Predictors Tested on Protein Secondary Structure Prediction

A General Method for Combining Predictors Tested on Protein Secondary Structure Prediction A General Method for Combining Predictors Tested on Protein Secondary Structure Prediction Jakob V. Hansen Department of Computer Science, University of Aarhus Ny Munkegade, Bldg. 540, DK-8000 Aarhus C,

More information

A statistical mechanical interpretation of algorithmic information theory

A statistical mechanical interpretation of algorithmic information theory A statistical mechanical interpretation of algorithmic information theory Kohtaro Tadaki Research and Development Initiative, Chuo University 1 13 27 Kasuga, Bunkyo-ku, Tokyo 112-8551, Japan. E-mail: tadaki@kc.chuo-u.ac.jp

More information

Information in Biology

Information in Biology Information in Biology CRI - Centre de Recherches Interdisciplinaires, Paris May 2012 Information processing is an essential part of Life. Thinking about it in quantitative terms may is useful. 1 Living

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell

Hidden Markov Models in computational biology. Ron Elber Computer Science Cornell Hidden Markov Models in computational biology Ron Elber Computer Science Cornell 1 Or: how to fish homolog sequences from a database Many sequences in database RPOBESEQ Partitioned data base 2 An accessible

More information

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). 1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Data Mining in Bioinformatics HMM

Data Mining in Bioinformatics HMM Data Mining in Bioinformatics HMM Microarray Problem: Major Objective n Major Objective: Discover a comprehensive theory of life s organization at the molecular level 2 1 Data Mining in Bioinformatics

More information

Inferring Models of cis-regulatory Modules using Information Theory

Inferring Models of cis-regulatory Modules using Information Theory Inferring Models of cis-regulatory Modules using Information Theory BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 28 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material,

More information

A Gentle Introduction to Gradient Boosting. Cheng Li College of Computer and Information Science Northeastern University

A Gentle Introduction to Gradient Boosting. Cheng Li College of Computer and Information Science Northeastern University A Gentle Introduction to Gradient Boosting Cheng Li chengli@ccs.neu.edu College of Computer and Information Science Northeastern University Gradient Boosting a powerful machine learning algorithm it can

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Sequences and Information

Sequences and Information Sequences and Information Rahul Siddharthan The Institute of Mathematical Sciences, Chennai, India http://www.imsc.res.in/ rsidd/ Facets 16, 04/07/2016 This box says something By looking at the symbols

More information

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS

CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS CS 4491/CS 7990 SPECIAL TOPICS IN BIOINFORMATICS * The contents are adapted from Dr. Jean Gao at UT Arlington Mingon Kang, Ph.D. Computer Science, Kennesaw State University Primer on Probability Random

More information

Information Theory, Statistics, and Decision Trees

Information Theory, Statistics, and Decision Trees Information Theory, Statistics, and Decision Trees Léon Bottou COS 424 4/6/2010 Summary 1. Basic information theory. 2. Decision trees. 3. Information theory and statistics. Léon Bottou 2/31 COS 424 4/6/2010

More information

Information Theory Primer With an Appendix on Logarithms

Information Theory Primer With an Appendix on Logarithms Information Theory Primer With an Appendix on Logarithms PDF version: http://alum.mit.edu/www/toms/papers/primer/primer.pdf web versions: http://alum.mit.edu/www/toms/paper/primer/ Thomas D. Schneider

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Algorithms in Computational Biology (236522) spring 2008 Lecture #1

Algorithms in Computational Biology (236522) spring 2008 Lecture #1 Algorithms in Computational Biology (236522) spring 2008 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: 15:30-16:30/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office hours:??

More information

Hidden Markov Models for biological sequence analysis

Hidden Markov Models for biological sequence analysis Hidden Markov Models for biological sequence analysis Master in Bioinformatics UPF 2017-2018 http://comprna.upf.edu/courses/master_agb/ Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA

More information

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes Discovering Most Classificatory Patterns for Very Expressive Pattern Classes Masayuki Takeda 1,2, Shunsuke Inenaga 1,2, Hideo Bannai 3, Ayumi Shinohara 1,2, and Setsuo Arikawa 1 1 Department of Informatics,

More information

UNIT I INFORMATION THEORY. I k log 2

UNIT I INFORMATION THEORY. I k log 2 UNIT I INFORMATION THEORY Claude Shannon 1916-2001 Creator of Information Theory, lays the foundation for implementing logic in digital circuits as part of his Masters Thesis! (1939) and published a paper

More information

Information & Correlation

Information & Correlation Information & Correlation Jilles Vreeken 11 June 2014 (TADA) Questions of the day What is information? How can we measure correlation? and what do talking drums have to do with this? Bits and Pieces What

More information

Inferring Models of cis-regulatory Modules using Information Theory

Inferring Models of cis-regulatory Modules using Information Theory Inferring Models of cis-regulatory Modules using Information Theory BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 26 Anthony Gitter gitter@biostat.wisc.edu Overview Biological question What is causing

More information

Entropy as a measure of surprise

Entropy as a measure of surprise Entropy as a measure of surprise Lecture 5: Sam Roweis September 26, 25 What does information do? It removes uncertainty. Information Conveyed = Uncertainty Removed = Surprise Yielded. How should we quantify

More information

Information Theory. Mark van Rossum. January 24, School of Informatics, University of Edinburgh 1 / 35

Information Theory. Mark van Rossum. January 24, School of Informatics, University of Edinburgh 1 / 35 1 / 35 Information Theory Mark van Rossum School of Informatics, University of Edinburgh January 24, 2018 0 Version: January 24, 2018 Why information theory 2 / 35 Understanding the neural code. Encoding

More information

Received: 20 December 2011; in revised form: 4 February 2012 / Accepted: 7 February 2012 / Published: 2 March 2012

Received: 20 December 2011; in revised form: 4 February 2012 / Accepted: 7 February 2012 / Published: 2 March 2012 Entropy 2012, 14, 480-490; doi:10.3390/e14030480 Article OPEN ACCESS entropy ISSN 1099-4300 www.mdpi.com/journal/entropy Interval Entropy and Informative Distance Fakhroddin Misagh 1, * and Gholamhossein

More information

Markov Chains and Hidden Markov Models. COMP 571 Luay Nakhleh, Rice University

Markov Chains and Hidden Markov Models. COMP 571 Luay Nakhleh, Rice University Markov Chains and Hidden Markov Models COMP 571 Luay Nakhleh, Rice University Markov Chains and Hidden Markov Models Modeling the statistical properties of biological sequences and distinguishing regions

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

Inferring Protein-Signaling Networks

Inferring Protein-Signaling Networks Inferring Protein-Signaling Networks Lectures 14 Nov 14, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday 12:00-1:20 Johnson Hall (JHN) 022 1

More information

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler STATC141 Spring 2005 The materials are from Pairise Sequence Alignment by Robert Giegerich and David Wheeler Lecture 6, 02/08/05 The analysis of multiple DNA or protein sequences (I) Sequence similarity

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information. L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission

More information

Theoretical distribution of PSSM scores

Theoretical distribution of PSSM scores Regulatory Sequence Analysis Theoretical distribution of PSSM scores Jacques van Helden Jacques.van-Helden@univ-amu.fr Aix-Marseille Université, France Technological Advances for Genomics and Clinics (TAGC,

More information

Graph Alignment and Biological Networks

Graph Alignment and Biological Networks Graph Alignment and Biological Networks Johannes Berg http://www.uni-koeln.de/ berg Institute for Theoretical Physics University of Cologne Germany p.1/12 Networks in molecular biology New large-scale

More information

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that

More information

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy

Information Theory CHAPTER. 5.1 Introduction. 5.2 Entropy Haykin_ch05_pp3.fm Page 207 Monday, November 26, 202 2:44 PM CHAPTER 5 Information Theory 5. Introduction As mentioned in Chapter and reiterated along the way, the purpose of a communication system is

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Modeling Motifs Collecting Data (Measuring and Modeling Specificity of Protein-DNA Interactions)

Modeling Motifs Collecting Data (Measuring and Modeling Specificity of Protein-DNA Interactions) Modeling Motifs Collecting Data (Measuring and Modeling Specificity of Protein-DNA Interactions) Computational Genomics Course Cold Spring Harbor Labs Oct 31, 2016 Gary D. Stormo Department of Genetics

More information

EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS

EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS August 0 Vol 4 No 005-0 JATIT & LLS All rights reserved ISSN: 99-8645 wwwjatitorg E-ISSN: 87-95 EVOLUTIONAY DISTANCE MODEL BASED ON DIFFEENTIAL EUATION AND MAKOV OCESS XIAOFENG WANG College of Mathematical

More information

How much non-coding DNA do eukaryotes require?

How much non-coding DNA do eukaryotes require? How much non-coding DNA do eukaryotes require? Andrei Zinovyev UMR U900 Computational Systems Biology of Cancer Institute Curie/INSERM/Ecole de Mine Paritech Dr. Sebastian Ahnert Dr. Thomas Fink Bioinformatics

More information

Supporting Information

Supporting Information Supporting Information Weghorn and Lässig 10.1073/pnas.1210887110 SI Text Null Distributions of Nucleosome Affinity and of Regulatory Site Content. Our inference of selection is based on a comparison of

More information