1.5 MM and EM Algorithms


The MM algorithm [1] is an iterative algorithm that can be used to minimize or maximize a function. We focus on using it to maximize the log likelihood $\ell(Y; \theta)$ of observed data $Y$ over model parameters $\theta$. Given a current guess $\theta^{(m)}$ for the parameters, the MM algorithm prescribes a minorizing function $h(\theta \mid \theta^{(m)})$ such that

$$h(\theta \mid \theta^{(m)}) \le \ell(Y; \theta) \quad \text{for all } \theta, \qquad h(\theta^{(m)} \mid \theta^{(m)}) = \ell(Y; \theta^{(m)}).$$

Fig. 1.3 gives an example of a log likelihood with an accompanying minorizing function. The minorizing function must be chosen so that the log likelihood dominates it everywhere except at $\theta^{(m)}$, where the two are equal. If $h(\theta \mid \theta^{(m)})$ has derivatives everywhere, then the tangents are also equal at the point of contact, $h'(\theta^{(m)} \mid \theta^{(m)}) = \ell'(Y; \theta^{(m)})$, and it is easy to see that choosing $\theta^{(m+1)}$ to maximize $h(\theta \mid \theta^{(m)})$ must also increase $\ell(Y; \theta)$. The challenge is to find a minorizing function $h(\theta \mid \theta^{(m)})$ that is easy to maximize.

The EM algorithm is a special case of the MM algorithm that capitalizes on the concept of hidden or missing data to construct a minorizing function. The EM applies when one can imagine unobserved data $Z$ that, when added to the observed data $Y$ to produce complete data $X = (Y, Z)$, yields a far simpler log likelihood function $\ell_C(X; \theta)$, called the complete log likelihood. A simpler function is one that is easier to maximize. The unobserved data $Z$ can be truly missing, in the sense that we forgot or were unable to measure them, or hypothetically missing, in the sense that we can only measure them in our imagination (examples of the latter appear below). In fact, your imagination plays an important role in deciding when and how to apply the EM to problems. The EM algorithm consists of two steps, iterated repeatedly until convergence.

E step. Fill in the missing data $Z$ by computing the expected missing data $E[Z \mid Y]$ given the current parameter estimates.

M step. Given the expected complete data $X = (Y, E[Z \mid Y])$, estimate all parameters by maximizing the (easier) expected complete log likelihood $E[\ell_C(X; \theta)]$.

The MM and EM algorithms have the very desirable property that they are guaranteed to improve, or at least not decrease, the log likelihood with each iteration. However, they may converge to a local maximum when the log likelihood has multiple modes. Consistent results from many random starting points give us some confidence that we have found the global maximum.
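The two steps translate directly into a generic iteration. The following Python sketch is not from the text; `e_step`, `m_step`, and `log_likelihood` are hypothetical callables that a particular model must supply, and the loop merely illustrates the control flow and the ascent property.

```python
import numpy as np

def em(theta0, e_step, m_step, log_likelihood, tol=1e-8, max_iter=1000):
    """Generic EM/MM loop: alternate E and M steps until the observed
    log likelihood stops increasing appreciably.

    e_step(theta)         -> expected missing data given the current theta
    m_step(expected)      -> new theta maximizing the expected complete
                             log likelihood
    log_likelihood(theta) -> observed-data log likelihood l(Y; theta)
    """
    theta = theta0
    ll_old = log_likelihood(theta)
    for _ in range(max_iter):
        expected = e_step(theta)          # E step: fill in E[Z | Y]
        theta = m_step(expected)          # M step: maximize E[l_C(X; theta)]
        ll_new = log_likelihood(theta)
        assert ll_new >= ll_old - 1e-10   # ascent property (up to rounding)
        if ll_new - ll_old < tol:
            break
        ll_old = ll_new
    return theta
```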

Figure 1.3: MM algorithm minorizing function $h(\theta \mid \theta^{(m)})$ of the observed log likelihood $\ell(Y; \theta)$ for genotype data (see Problem 1). [The figure plots log density against $\theta$ and marks $\theta^{(m)}$, $\theta^{(m+1)}$, and $\hat\theta$.]

We now illustrate the EM algorithm in a particular example.

1.5.1 Motif Finding

A common problem in sequence analysis is to find a motif that is common to a set of sequences. A typical example is the task of identifying a specific protein binding site in a set of unaligned DNA fragments (e.g., data derived from ChIP-chip or ChIP-seq experiments). We shall specify a general probabilistic model for the sequences and then show how best to parameterize the model.

Let there be $N$ sequences $S_1, S_2, \ldots, S_N$ of lengths $L_1, L_2, \ldots, L_N$, consisting of letters drawn from an alphabet $\mathcal{A}$ ($\{A, C, G, T\}$ for DNA). We assume the motif is of length exactly $w$ and may or may not occur in any one of the sequences. In each position $i = 0, 1, \ldots, w-1$ of the motif, let letter $j$ be observed with probability $p_{ij}$.

If the motif occurs, then we allow for the possibility that the sequence upstream (to the left) of it and downstream (to the right) of it have different average composition. Precisely, we assume that in the left context, letters are drawn independently with probability distribution $p_{Lj}$, $j \in \mathcal{A}$. Similarly, there is a right-context probability distribution $p_{Rj}$, $j \in \mathcal{A}$. To completely specify the model, let the probability that sequence $S_n$ contains the motif be $p_s$. Let $\theta = \{p_{Lj}, p_{ij}, p_{Rj}, p_s\}$ denote a particular choice of parameters. Given these parameters, the likelihood of sequence $S_n$ is $g(S_n; \theta)$, and if the sequences are independent, then the likelihood of the entire set of sequences is simply $g(S_1, \ldots, S_N; \theta) = \prod_n g(S_n; \theta)$. Our goal is to find a parameter estimate $\hat\theta$ that maximizes the data likelihood $g(S_1, \ldots, S_N; \theta)$; in other words, we are looking for the best possible model fit to the observed sequences.

For simplicity of notation, we consider a single sequence $S_n$. The general case of $N$ sequences follows immediately and only requires replacing sums over $k$ by double sums over $n$ and $k$.

The missing data that would make this maximization problem easy are the positions of the motif in the sequences. With this information, the likelihood of the single sequence is easy to compute. Letting $K_n = k$ denote the start of the motif in sequence $S_n$, the conditional likelihood is

$$g(S_n \mid K_n = k; \theta) = \Bigl(\prod_j p_{Lj}^{\,n_{Lnkj}}\Bigr) \Bigl(\prod_{i=0}^{w-1} \prod_j p_{ij}^{\,I(S_{n,k+i} = j)}\Bigr) \Bigl(\prod_j p_{Rj}^{\,n_{Rnkj}}\Bigr), \qquad (1.13)$$

where $n_{Lnkj}$ is the number of occurrences of letter $j$ in positions $1$ through $k - 1$, $n_{Rnkj}$ is the number of occurrences of letter $j$ in positions $k + w$ through $L_n$, and $I(S_{ni} = j)$ is an indicator function that equals 1 if position $i$ of sequence $S_n$ is occupied by letter $j$ and 0 otherwise. It is trivial to maximize this likelihood: indeed,

$$\hat p_{Lj} = \frac{n_{Lnkj}}{\sum_l n_{Lnkl}}, \qquad \hat p_{ij} = I(S_{n,k+i} = j), \qquad \hat p_{Rj} = \frac{n_{Rnkj}}{\sum_l n_{Rnkl}}.$$

(Clearly, the MLE $\hat p_{ij}$ only becomes interesting after we have observed several sequences $S_n$.)

Unfortunately, life is not so easy: we do not know $K_n$. When the location of the motif is unknown, we must sum over all possibilities,

$$g(S_n; \theta) = \sum_k g(S_n, k; \theta) = \sum_k g(S_n \mid k; \theta)\, g_n(k; p_s),$$

where $g_n(k; p_s)$ is the model (a priori) probability that the motif occurs at position $k$ in $S_n$. If the sequences vary in length $L_n$, then $g_n(k; p_s)$ depends on $L_n$, hence the subscript $n$.
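To make Eq. (1.13) concrete, here is a minimal sketch of the conditional log likelihood for one sequence. It is not from the text: it assumes the sequence is stored as a 0-based integer array (letters encoded $0, \ldots, |\mathcal{A}|-1$), with `pL`, `P`, and `pR` as hypothetical array names for $p_{Lj}$, $p_{ij}$, and $p_{Rj}$, and `k` given as a 0-based start position.

```python
import numpy as np

def log_g_given_k(seq, k, pL, P, pR):
    """Log of Eq. (1.13): likelihood of one sequence given that the motif
    starts at 0-based position k.

    seq    : 1-D integer array of letters encoded 0..|A|-1
    pL, pR : left- and right-context letter probabilities p_Lj, p_Rj
    P      : (w, |A|) matrix of motif letter probabilities p_ij
    """
    w = P.shape[0]
    left, motif, right = seq[:k], seq[k:k + w], seq[k + w:]
    return (np.sum(np.log(pL[left]))                  # left-context term
            + np.sum(np.log(P[np.arange(w), motif]))  # position-specific motif term
            + np.sum(np.log(pR[right])))              # right-context term
```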

For unaligned sequences and no prior information about where the motif might be, we may specify the uniform probabilities

$$g_n(k; p_s) = \begin{cases} \dfrac{p_s}{L_n - w + 1} & k = 1, 2, \ldots, L_n - w + 1, \\[6pt] \dfrac{1 - p_s}{2} & k = 0 \text{ or } k = L_n - w + 2. \end{cases} \qquad (1.14)$$

Here, $k = 0$ represents the case that the sequence is all right context, and $k = L_n - w + 2$ represents the case that the sequence is all left context.

You can think about it or trust us: $\ln g(S_n; \theta)$ is not easy to maximize. The EM algorithm tells us to instead maximize

$$E[\ln g(S_n, K_n; \theta)] = \sum_{k=0}^{L_n - w + 2} g(k \mid S_n; \theta^{(m)}) \ln g(S_n, k; \theta),$$

where the expectation is taken with respect to $g(K_n \mid S_n; \theta^{(m)})$, the probability mass function of the hidden data $K_n$ given the observed data $S_n$ and the current estimate of the parameters $\theta^{(m)}$. This density is available through Bayes' rule as

$$g(K_n \mid S_n; \theta^{(m)}) = \frac{g(S_n \mid K_n; \theta^{(m)})\, g_n(K_n; p_s^{(m)})}{\sum_k g(S_n \mid k; \theta^{(m)})\, g_n(k; p_s^{(m)})},$$

where $g(S_n \mid K_n; \theta^{(m)})$ is given by Eq. (1.13) and $g_n(K_n; p_s^{(m)})$ is given by Eq. (1.14).
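In code, the E step simply evaluates this posterior for every admissible $k$. The sketch below is hypothetical, not from the text: it uses the same integer encoding as the previous sketch, lays out the returned vector with entry 0 for the all-right-context case ($k = 0$), entries $1, \ldots, L_n - w + 1$ for motif starts, and the last entry for the all-left-context case ($k = L_n - w + 2$), and it normalizes in log space for numerical stability.

```python
import numpy as np

def posterior_over_k(seq, pL, P, pR, ps):
    """E step: posterior g(k | S_n; theta) over the motif start positions."""
    L, w = len(seq), P.shape[0]
    n_starts = L - w + 1
    logp = np.empty(n_starts + 2)
    # No-motif cases: the whole sequence is right (k = 0) or left context.
    logp[0] = np.log((1 - ps) / 2) + np.sum(np.log(pR[seq]))
    logp[-1] = np.log((1 - ps) / 2) + np.sum(np.log(pL[seq]))
    # Motif at 1-based start k, i.e. 0-based offset k - 1.
    for k in range(1, n_starts + 1):
        left, motif, right = seq[:k - 1], seq[k - 1:k - 1 + w], seq[k - 1 + w:]
        logp[k] = (np.log(ps / n_starts)
                   + np.sum(np.log(pL[left]))
                   + np.sum(np.log(P[np.arange(w), motif]))
                   + np.sum(np.log(pR[right])))
    logp -= logp.max()            # stabilize before exponentiating
    post = np.exp(logp)
    return post / post.sum()      # Bayes' rule normalization
```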

To show that maximizing this expected complete log likelihood works, we now prove the ascent property of the EM algorithm in this case, where the hidden data are discrete. We start by converting the expected complete log likelihood into a minorizing function by adding and subtracting exactly what we need so that the minorizing function equals the log likelihood at $\theta = \theta^{(m)}$:

$$h(\theta \mid \theta^{(m)}) = E[\ln g(S_n, K_n; \theta)] + \ln g(S_n; \theta^{(m)}) - E[\ln g(S_n, K_n; \theta^{(m)})]. \qquad (1.15)$$

We need to show that this minorizing function actually minorizes the log likelihood for all $\theta$:

$$
\begin{aligned}
h(\theta \mid \theta^{(m)}) &= \ln g(S_n; \theta^{(m)}) + E\left[\ln\left(\frac{g(S_n, K_n; \theta)}{g(S_n, K_n; \theta^{(m)})}\right)\right] \\
&= \ln g(S_n; \theta^{(m)}) + E\left[\ln\left(\frac{g(S_n; \theta)}{g(S_n; \theta^{(m)})}\,\frac{g(K_n \mid S_n; \theta)}{g(K_n \mid S_n; \theta^{(m)})}\right)\right] \\
&= \ln g(S_n; \theta) + E\left[\ln\left(\frac{g(K_n \mid S_n; \theta)}{g(K_n \mid S_n; \theta^{(m)})}\right)\right].
\end{aligned}
$$

Above, we used $g(S_n, K_n; \theta) = g(S_n; \theta)\, g(K_n \mid S_n; \theta)$ and recognized that $g(S_n; \theta)$ is constant with respect to the expectation. Finally, we obtain

$$h(\theta \mid \theta^{(m)}) - \ln g(S_n; \theta) = E\left[\ln\left(\frac{g(K_n \mid S_n; \theta)}{g(K_n \mid S_n; \theta^{(m)})}\right)\right] \le 0,$$

because $\theta^{(m)}$ maximizes $E[\ln g(K_n \mid S_n; \theta)]$, viewed as a function of $\theta$. To see the last claim, consider maximizing $\sum_i w_i \ln x_i$ over the $x_i$ given the constraint that $\sum_i x_i = C$ for some constant $C$. Using Lagrange multipliers, one can see that the maximum occurs at $x_k = C\, w_k / \sum_i w_i$. In our case, $w_k = g(k \mid S_n; \theta^{(m)})$ and $x_k = g(k \mid S_n; \theta)$ both sum to 1, so the maximum occurs at $x_k = g(k \mid S_n; \theta^{(m)})$, that is, at $\theta = \theta^{(m)}$.

We are now ready to maximize the minorizing function of Eq. (1.15) to generate the new estimate $\theta^{(m+1)}$, which we now know will at least not decrease the observed log likelihood. As already claimed, maximizing $h(\theta \mid \theta^{(m)})$ is equivalent to maximizing the expected complete log likelihood, since the other terms are constant in $\theta$. The expected complete log likelihood is

$$
\begin{aligned}
&\sum_k g(k \mid S_n; \theta^{(m)}) \Bigl[\sum_j n_{Lnkj} \log p_{Lj} + \sum_{i=0}^{w-1} \sum_j I(S_{n,k+i} = j) \log p_{ij} + \sum_j n_{Rnkj} \log p_{Rj} + \log g_n(k; p_s)\Bigr] \\
&\quad= \sum_j \Bigl[\sum_k n_{Lnkj}\, g(k \mid S_n; \theta^{(m)})\Bigr] \log p_{Lj}
 + \sum_{i=0}^{w-1} \sum_j \Bigl[\sum_k I(S_{n,k+i} = j)\, g(k \mid S_n; \theta^{(m)})\Bigr] \log p_{ij} \\
&\qquad+ \sum_j \Bigl[\sum_k n_{Rnkj}\, g(k \mid S_n; \theta^{(m)})\Bigr] \log p_{Rj}
 + \sum_k g(k \mid S_n; \theta^{(m)}) \log g_n(k; p_s),
\end{aligned}
$$

which consists of a collection of sums of the form $\sum_i w_i \ln x_i$ with $\sum_i x_i = 1$. As before, maximization is achieved by

$$p_{Lj}^{(m+1)} = \frac{\sum_k g(k \mid S_n; \theta^{(m)})\, n_{Lnkj}}{\sum_l \sum_k g(k \mid S_n; \theta^{(m)})\, n_{Lnkl}}, \qquad
p_{ij}^{(m+1)} = \frac{\sum_k g(k \mid S_n; \theta^{(m)})\, I(S_{n,k+i} = j)}{\sum_l \sum_k g(k \mid S_n; \theta^{(m)})\, I(S_{n,k+i} = l)},$$

$$p_{Rj}^{(m+1)} = \frac{\sum_k g(k \mid S_n; \theta^{(m)})\, n_{Rnkj}}{\sum_l \sum_k g(k \mid S_n; \theta^{(m)})\, n_{Rnkl}}, \quad\text{and}\quad
p_s^{(m+1)} = \frac{\sum_{k=1}^{L_n - w + 1} g(k \mid S_n; \theta^{(m)})}{\sum_{k=1}^{L_n - w + 1} g(k \mid S_n; \theta^{(m)}) + g(0 \mid S_n; \theta^{(m)}) + g(L_n - w + 2 \mid S_n; \theta^{(m)})}.$$
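In code, the M step turns these posterior-weighted counts into normalized frequencies. The sketch below is again hypothetical and handles a single sequence, consuming the posterior vector laid out as in the previous sketch; tiny pseudocounts are added only to avoid dividing by zero when a context happens to be empty.

```python
import numpy as np

def m_step(seq, post, w, n_letters=4):
    """M step for one sequence: posterior-weighted counts -> new parameters.

    post : posterior over k from the E step; post[0] and post[-1] are the
           all-right-context and all-left-context cases, post[1:-1] the
           L - w + 1 motif start positions.
    """
    n_starts = len(seq) - w + 1
    wL = np.zeros(n_letters)
    wM = np.zeros((w, n_letters))
    wR = np.zeros(n_letters)
    # No-motif cases contribute the whole sequence to a single context.
    for b in seq:
        wR[b] += post[0]
        wL[b] += post[-1]
    # Motif cases: left context, motif window, right context, each weighted
    # by the posterior probability of that start position.
    for k in range(1, n_starts + 1):
        g, start = post[k], k - 1
        for b in seq[:start]:
            wL[b] += g
        for i in range(w):
            wM[i, seq[start + i]] += g
        for b in seq[start + w:]:
            wR[b] += g
    eps = 1e-12                              # guards empty contexts
    pL = (wL + eps) / (wL + eps).sum()
    pR = (wR + eps) / (wR + eps).sum()
    P = (wM + eps) / (wM + eps).sum(axis=1, keepdims=True)
    ps = post[1:-1].sum()                    # posterior mass on "motif present"
    return pL, P, pR, ps
```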

Problems

1. One of the simplest illustrations of the EM algorithm comes from genetics. Suppose an observable phenotype is controlled by a single locus with only two possible alleles, $A$ and $a$, with $A$ dominant to $a$. Because of the dominance of $A$, genotypes $AA$ and $Aa$ share the same phenotype 1, which is distinct from the phenotype 2 of genotype $aa$. Suppose we can only observe the phenotype, and we observe $n_1 = 39$ individuals of phenotype 1 and $n_2 = 11$ of phenotype 2. At the genotype level, there are $n_{AA}$ individuals of type $AA$ and $n_{Aa}$ of type $Aa$ such that $n_{AA} + n_{Aa} = n_1$. The split of phenotype 1 into these two genotypes is the hidden information. We directly observe $n_{aa} = n_2$ individuals with the last genotype. Under Hardy-Weinberg equilibrium, the probabilities of the genotypes in terms of the allele frequency $p_A$ are

$$p_{AA} = p_A^2, \qquad p_{Aa} = 2 p_A (1 - p_A), \qquad p_{aa} = (1 - p_A)^2.$$

Use the given data to estimate the maximum likelihood value of $p_A$. Notice that it is possible to maximize the observed log likelihood

$$\log g(n_1, n_2; p_A) = n_1 \log\bigl[p_A^2 + 2 p_A (1 - p_A)\bigr] + n_2 \log\bigl[(1 - p_A)^2\bigr]$$

directly in this simple case; you can use this fact to check your answer. Fig. 1.3 visually demonstrates one iteration of the EM algorithm for this data set starting from the current estimate $\theta^{(m)}$ marked in the figure.
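One possible EM iteration for this problem, sketched below and not the text's solution, fills in the expected genotype counts $n_{AA}$ and $n_{Aa}$ in the E step and re-estimates $p_A$ by gene counting in the M step; the observed log likelihood above is used only as a check.

```python
import numpy as np

def em_allele_freq(n1=39, n2=11, p=0.5, n_iter=50):
    """EM sketch for the dominant/recessive allele-frequency problem.

    Hidden data: the split of the n1 phenotype-1 individuals into
    n_AA and n_Aa genotypes.
    """
    for _ in range(n_iter):
        # E step: expected genotype counts given the current p_A.
        pAA, pAa = p ** 2, 2 * p * (1 - p)
        nAA = n1 * pAA / (pAA + pAa)
        nAa = n1 - nAA
        # M step: gene counting; each individual carries two alleles.
        p = (2 * nAA + nAa) / (2 * (n1 + n2))
    return p

p_hat = em_allele_freq()
# Check against the observed log likelihood on a grid.
grid = np.linspace(0.001, 0.999, 9999)
ll = 39 * np.log(grid**2 + 2 * grid * (1 - grid)) + 11 * np.log((1 - grid)**2)
print(p_hat, grid[np.argmax(ll)])
```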

2. The TAL effectors are proteins found in Xanthomonas pathogens that infect plants. Each TAL effector consists of an N-terminal domain, a variable number of 34-amino-acid repeats, and a C-terminal domain. Residues 12 and 13 of each 34-amino-acid repeat, which together we will call the diresidue, are thought to interact directly with a nucleotide in the binding site. So, if a TAL effector has 17 repeats, then it will bind a site of 17 contiguous nucleotides. Suppose the $i$th TAL effector has $L_i$ repeats, with diresidue sequence $w_i = (w_{i1}, \ldots, w_{iL_i})$, where $w_{ij}$ represents one diresidue. Let $x_i = (x_{i1}, \ldots, x_{iN_i})$ be the $N_i \approx 1{,}000$ nucleotides immediately upstream of the translation start site of a gene known to be targeted by the $i$th TAL effector. We will assume that a targeted gene has a TAL effector binding site in this sequence. Let $Y_i \in \{L_i, \ldots, N_i\}$ indicate the position of the binding site relative to the translation start site at position 0. This is hidden information that we do not know. Without prior information, we assume $s_{im} = \frac{1}{N_i - L_i + 1}$ is the probability that TAL effector $i$ binds position $m$. Assume the orientation of binding is known, so that the first diresidue binds distal to the translation start site. Let $p(a, b)$ be the probability that diresidue $a$ binds nucleotide $b$, and let $q_b$ be the probability that an unbound upstream site is nucleotide $b$. Develop and fit an EM algorithm for this model using the data in Listings 1.1 and 1.2 (one possible formulation is sketched after the listing descriptions).

HG NI NS NG N HD NN IG

Listing 1.1: These are the diresidue sequences for 10 TAL effectors. The 8 distinct diresidues (listed on the last line) map to the numbers $1, 2, \ldots, 8$ in the order given. The first line indicates the number of TAL effectors and the number of distinct diresidues. The second line lists the number of repeats in each TAL effector. Lines 3 through 13 give the TAL diresidue sequences.


Listing 1.2: The nucleotides upstream of each known target gene. The first line gives the number of genes. Each gene is known to be bound by the corresponding TAL effector of Listing 1.1. The second line gives the length of each upstream region; most are around 1,000 nucleotides. The next 10 lines give the nucleotide sequences, encoded as A=0, C=1, G=2, T=3.
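For Problem 2, the hidden variable is the binding position, so the EM has the same shape as the motif finder above: an E step that computes a posterior over candidate binding windows and an M step that turns posterior-weighted counts into $p(a, b)$ and $q_b$. The sketch below is one possible formulation, not the text's solution; the argument names `tal` and `ups` are assumptions about how the data of Listings 1.1 and 1.2 might be held in memory (lists of integer arrays), and each binding site is treated as a contiguous window with the first diresidue at the distal end.

```python
import numpy as np

def tal_em(tal, ups, n_dires=8, n_nt=4, n_iter=100, seed=0):
    """EM sketch for the TAL effector binding model.

    tal : list of integer arrays, diresidue sequences (values 0..n_dires-1)
    ups : list of integer arrays, upstream nucleotide sequences (values 0..3),
          written 5' to 3', ending at the translation start site
    Returns p[a, b] = P(diresidue a binds nucleotide b) and background q[b].
    """
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(np.ones(n_nt), size=n_dires)   # random start breaks symmetry
    q = np.full(n_nt, 1.0 / n_nt)
    for _ in range(n_iter):
        wp = np.zeros((n_dires, n_nt))               # posterior-weighted bound counts
        wq = np.zeros(n_nt)                          # posterior-weighted unbound counts
        for w, x in zip(tal, ups):
            L, N = len(w), len(x)
            # E step: posterior over window starts under the uniform prior s_im.
            logpost = np.array([
                np.log(p[w, x[m:m + L]]).sum()
                + np.log(q[np.r_[x[:m], x[m + L:]]]).sum()
                for m in range(N - L + 1)
            ])
            post = np.exp(logpost - logpost.max())
            post /= post.sum()
            # Accumulate expected counts for the M step.
            for m, g in enumerate(post):
                np.add.at(wp, (w, x[m:m + L]), g)
                np.add.at(wq, np.r_[x[:m], x[m + L:]], g)
        # M step: normalize expected counts (small pseudocount for stability).
        p = (wp + 1e-6) / (wp + 1e-6).sum(axis=1, keepdims=True)
        q = (wq + 1e-6) / (wq + 1e-6).sum()
    return p, q
```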
