Motif representation using position weight matrix


1 Motif representation using position weight matrix

Xiaohui Xie, University of California, Irvine

2 Position weight matrix

Position weight matrix representation of a motif with width $w$:

$$\theta = \begin{pmatrix} \theta_{11} & \theta_{12} & \theta_{13} & \theta_{14} \\ \theta_{21} & \theta_{22} & \theta_{23} & \theta_{24} \\ \vdots & \vdots & \vdots & \vdots \\ \theta_{w1} & \theta_{w2} & \theta_{w3} & \theta_{w4} \end{pmatrix} \qquad (1)$$

where each row represents one position of the motif and is normalized:

$$\sum_{j=1}^{4} \theta_{ij} = 1 \qquad (2)$$

for all $i = 1, 2, \dots, w$.
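As a concrete illustration, a width-$w$ weight matrix can be stored as a $w \times 4$ array whose rows sum to one. The sketch below (Python/NumPy, with purely illustrative counts and a hypothetical variable name `counts`, not taken from the slides) builds such a matrix from raw letter counts:

```python
import numpy as np

# Hypothetical counts of A, C, G, T at each of w = 3 motif positions.
counts = np.array([[8, 1, 1, 0],
                   [0, 9, 0, 1],
                   [2, 0, 7, 1]], dtype=float)

# Normalize each row so that sum_j theta[i, j] = 1, as in equation (2).
theta = counts / counts.sum(axis=1, keepdims=True)
print(theta)
print(theta.sum(axis=1))   # every entry should be 1.0
```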

3 Likelihood

Given the position weight matrix $\theta$, the probability of generating a sequence $S = (S_1, S_2, \dots, S_w)$ from $\theta$ is

$$P(S \mid \theta) = \prod_{i=1}^{w} P(S_i \mid \theta_i) \qquad (3)$$

$$= \prod_{i=1}^{w} \theta_{i,S_i} \qquad (4)$$

For convenience, we have converted $S$ from a string over $\{A, C, G, T\}$ to a string over $\{1, 2, 3, 4\}$.
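A minimal sketch of equation (4), assuming letters are encoded 0 through 3 (0-based, rather than the 1-based indexing used on the slide) and reusing illustrative $\theta$ values; the function name `sequence_likelihood` is hypothetical:

```python
import numpy as np

def sequence_likelihood(theta, seq):
    """P(S | theta) = prod_i theta[i, S_i] for a length-w sequence."""
    positions = np.arange(len(seq))
    return np.prod(theta[positions, seq])

# Example: w = 3, sequence "ACG" encoded as 0, 1, 2.
theta = np.array([[0.8, 0.1, 0.1, 0.0],
                  [0.0, 0.9, 0.0, 0.1],
                  [0.2, 0.0, 0.7, 0.1]])
seq = np.array([0, 1, 2])
print(sequence_likelihood(theta, seq))   # 0.8 * 0.9 * 0.7 = 0.504
```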

4 Likelihood

Suppose we observe not just one, but a set of sequences $S_1, S_2, \dots, S_n$, each of which contains exactly $w$ letters. Assume each of them is generated independently from the model $\theta$. Then the likelihood of observing these $n$ sequences is

$$P(S_1, S_2, \dots, S_n \mid \theta) = \prod_{k=1}^{n} P(S_k \mid \theta) \qquad (5)$$

$$= \prod_{k=1}^{n} \prod_{i=1}^{w} \theta_{i,S_{ki}} = \prod_{i=1}^{w} \prod_{j=1}^{4} \theta_{ij}^{c_{ij}} \qquad (6)$$

where $c_{ij}$ is the number of letter $j$ at position $i$ (note that $\sum_{j=1}^{4} c_{ij} = n$ for all $i$).

5 Parameter estimation

Now suppose we do not know $\theta$. How can we estimate it from the observed sequence data $S_1, S_2, \dots, S_n$?

One solution: calculate the likelihood of observing the provided $n$ sequences for different values of $\theta$,

$$L(\theta) = P(S_1, S_2, \dots, S_n \mid \theta) = \prod_{k=1}^{n} \prod_{i=1}^{w} \theta_{i,S_{ki}} \qquad (7)$$

and pick the value with the largest likelihood, that is, find the $\theta$ that achieves

$$\max_\theta P(S_1, S_2, \dots, S_n \mid \theta) \qquad (8)$$

6 Maximum likelihood estimation

The maximum likelihood estimate of $\theta$ is

$$\hat{\theta}_{ML} = \arg\max_\theta \log L(\theta) = \arg\max_\theta \sum_{i=1}^{w} \sum_{j=1}^{4} c_{ij} \log \theta_{ij} \qquad (9)$$

subject to

$$\sum_{j=1}^{4} \theta_{ij} = 1, \quad i = 1, \dots, w.$$

7 Optimization with equality constraints

Construct a Lagrangian function taking the equality constraints into account:

$$g(\theta) = \log L(\theta) + \sum_{i=1}^{w} \lambda_i \Big(1 - \sum_{j=1}^{4} \theta_{ij}\Big) \qquad (10)$$

Solve the unconstrained optimization problem

$$\hat{\theta} = \arg\max_\theta g(\theta) = \arg\max_\theta \left[ \sum_{i=1}^{w} \sum_{j=1}^{4} c_{ij} \log \theta_{ij} + \sum_{i=1}^{w} \lambda_i \Big(1 - \sum_{j=1}^{4} \theta_{ij}\Big) \right] \qquad (11)$$

8 Optimization with equality constraints

Take the derivatives of $g(\theta)$ with respect to $\theta_{ij}$ and the Lagrange multipliers $\lambda_i$ and set them to zero:

$$\frac{\partial g(\theta)}{\partial \theta_{ij}} = 0 \qquad (12)$$

$$\frac{\partial g(\theta)}{\partial \lambda_i} = 0 \qquad (13)$$

which leads to

$$\hat{\theta}_{ij} = \frac{c_{ij}}{n} \qquad (14)$$

which is simply the frequency of each letter at each position ($c_{ij}$ is the number of letter $j$ at position $i$).
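The closed-form solution (14) is just the per-position letter frequency. A minimal sketch, assuming the $n$ sequences are stored as an $n \times w$ integer array with letters encoded 0 through 3 (the function name `ml_estimate` and the example data are illustrative):

```python
import numpy as np

def ml_estimate(sequences, alphabet_size=4):
    """theta_hat[i, j] = c_ij / n, the frequency of letter j at position i (equation (14))."""
    n, w = sequences.shape
    counts = np.zeros((w, alphabet_size))
    for i in range(w):
        counts[i] = np.bincount(sequences[:, i], minlength=alphabet_size)
    return counts / n

# Example: n = 4 sequences of width w = 3.
S = np.array([[0, 1, 2],
              [0, 1, 2],
              [0, 3, 2],
              [2, 1, 2]])
print(ml_estimate(S))
```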

9 Bayes Theorem

$$P(\theta \mid S) = \frac{P(S \mid \theta)\, P(\theta)}{P(S)} \qquad (15)$$

Each term in Bayes' theorem has a conventional name:

1. $P(S \mid \theta)$: the conditional probability of $S$ given $\theta$, also called the likelihood.
2. $P(\theta)$: the prior probability or marginal probability of $\theta$.
3. $P(\theta \mid S)$: the conditional probability of $\theta$ given $S$, also called the posterior probability of $\theta$.
4. $P(S)$: the marginal probability of $S$, which acts as a normalizing constant.

10 Maximum a posteriori (MAP) estimation

The MAP (or posterior mode) estimate of $\theta$ is

$$\hat{\theta}_{MAP}(S) = \arg\max_\theta P(\theta \mid S_1, S_2, \dots, S_n) \qquad (16)$$

$$= \arg\max_\theta \left[ \log L(\theta) + \log P(\theta) \right] \qquad (17)$$

Assume $P(\theta) = \prod_{i=1}^{w} P(\theta_i)$ (independence of $\theta_i$ at different positions $i$). Model $P(\theta_i)$ with a Dirichlet distribution:

$$(\theta_{i1}, \theta_{i2}, \theta_{i3}, \theta_{i4}) \sim \mathrm{Dir}(\alpha_1, \alpha_2, \alpha_3, \alpha_4). \qquad (18)$$

11 Dirichlet Distribution

Probability density function of the Dirichlet distribution $\mathrm{Dir}(\alpha)$ of order $K \ge 2$:

$$p(x_1, \dots, x_K; \alpha_1, \dots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \qquad (19)$$

for all $x_1, \dots, x_K > 0$ with $\sum_{i=1}^{K} x_i = 1$. The density is zero outside this open $(K-1)$-dimensional simplex. $\alpha = (\alpha_1, \dots, \alpha_K)$ are parameters with $\alpha_i > 0$ for all $i$. The normalizing constant $B(\alpha)$ is the multinomial beta function:

$$B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\big(\sum_{i=1}^{K} \alpha_i\big)} \qquad (20)$$

12 Gamma function

The Gamma function for positive real $z$:

$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t} \, dt \qquad (21)$$

It satisfies the recurrence

$$\Gamma(z + 1) = z\,\Gamma(z) \qquad (22)$$

and if $n$ is a positive integer,

$$\Gamma(n + 1) = n! \qquad (23)$$
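As a quick numerical check of (23), using only Python's standard library (the loop range is illustrative):

```python
import math

for n in range(1, 6):
    # Gamma(n + 1) equals n! for positive integers n.
    print(n, math.gamma(n + 1), math.factorial(n))
```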

13 Properties of Dirichlet distribution

Dirichlet distribution:

$$p(x_1, \dots, x_K; \alpha) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1} \qquad (24)$$

Define $\alpha_0 = \sum_{i=1}^{K} \alpha_i$. Expectation:

$$E[X_i] = \frac{\alpha_i}{\alpha_0} \qquad (25)$$

Variance:

$$\mathrm{Var}[X_i] = \frac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)} \qquad (26)$$

Covariance (for $i \neq j$):

$$\mathrm{Cov}[X_i, X_j] = \frac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)} \qquad (27)$$
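A small sketch (NumPy, with illustrative $\alpha$ values and sample size) comparing the analytic mean and variance in (25) and (26) with Monte Carlo estimates:

```python
import numpy as np

alpha = np.array([2.0, 1.0, 1.0, 0.5])   # illustrative parameters
alpha0 = alpha.sum()

# Draw many samples from Dir(alpha) and compare empirical and analytic moments.
samples = np.random.default_rng(0).dirichlet(alpha, size=200_000)

print("empirical mean:", samples.mean(axis=0))
print("analytic mean :", alpha / alpha0)
print("empirical var :", samples.var(axis=0))
print("analytic var  :", alpha * (alpha0 - alpha) / (alpha0**2 * (alpha0 + 1)))
```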

14 Posterior Distribution

Conditional probability:

$$P(S_1, S_2, \dots, S_n \mid \theta) = \prod_{i=1}^{w} \prod_{j=1}^{4} \theta_{ij}^{c_{ij}}$$

Prior probability:

$$p(\theta_{i1}, \dots, \theta_{i4}; \alpha) = \frac{1}{B(\alpha)} \prod_{j=1}^{4} \theta_{ij}^{\alpha_j - 1}$$

Posterior probability:

$$P(\theta_i \mid S_1, \dots, S_n) = \mathrm{Dir}(c_{i1} + \alpha_1,\, c_{i2} + \alpha_2,\, c_{i3} + \alpha_3,\, c_{i4} + \alpha_4)$$

Maximum a posteriori estimate:

$$\theta^{MAP}_{ij} = \frac{c_{ij} + \alpha_j - 1}{n + \alpha_0 - 4} \qquad (28)$$

where $\alpha_0 \equiv \sum_j \alpha_j$.
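The MAP estimate (28) only changes the ML estimate by adding pseudocounts. A sketch reusing the same assumed 0-to-3 encoding and counting loop as the ML example (function name and $\alpha$ values are illustrative; $\alpha_j \ge 1$ keeps the smoothed counts non-negative):

```python
import numpy as np

def map_estimate(sequences, alpha, alphabet_size=4):
    """theta_MAP[i, j] = (c_ij + alpha_j - 1) / (n + alpha_0 - 4), per equation (28)."""
    n, w = sequences.shape
    counts = np.zeros((w, alphabet_size))
    for i in range(w):
        counts[i] = np.bincount(sequences[:, i], minlength=alphabet_size)
    alpha = np.asarray(alpha, dtype=float)
    return (counts + alpha - 1.0) / (n + alpha.sum() - alphabet_size)

S = np.array([[0, 1, 2],
              [0, 1, 2],
              [0, 3, 2],
              [2, 1, 2]])
print(map_estimate(S, alpha=[2.0, 2.0, 2.0, 2.0]))   # alpha_j = 2 adds one pseudocount per letter
```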

15 Mixture of sequences

Suppose we have a more difficult situation: among the set of $n$ given sequences $S_1, S_2, \dots, S_n$, only a subset of them are generated by the weight matrix model $\theta$. How do we identify $\theta$ in this case?

Let us first define the "non-motif" (also called background) sequences. Suppose they are generated from a single distribution

$$p^0 = (p^0_A, p^0_C, p^0_G, p^0_T) = (p^0_1, p^0_2, p^0_3, p^0_4) \qquad (29)$$

16 Likelihood for mixture of sequences

Now the problem is that we do not know which sequences are generated from the motif model ($\theta$) and which are generated from the background model ($\theta_0$). Suppose we are provided with such label information:

$$z_i = \begin{cases} 1 & \text{if } S_i \text{ is generated by } \theta \\ 0 & \text{if } S_i \text{ is generated by } \theta_0 \end{cases} \qquad (30)$$

for all $i = 1, 2, \dots, n$. Then the likelihood of observing the $n$ sequences is

$$P(S_1, S_2, \dots, S_n \mid z, \theta, \theta_0) = \prod_{i=1}^{n} \left[ z_i P(S_i \mid \theta) + (1 - z_i) P(S_i \mid \theta_0) \right]$$

17 Maximum Likelihood

Find the joint probability of the sequences and the labels:

$$P(S, z \mid \theta, \theta_0) = P(S \mid z, \theta, \theta_0) P(z) = \prod_{i=1}^{n} P(z_i) \left[ z_i P(S_i \mid \theta) + (1 - z_i) P(S_i \mid \theta_0) \right]$$

where $z \equiv (z_1, \dots, z_n)$ and $P(z) = \prod_i P(z_i)$.

Marginalize over the labels to derive the likelihood:

$$L(\theta) = P(S \mid \theta, \theta_0) = \prod_{i=1}^{n} \left[ P(z_i = 1) P(S_i \mid \theta) + P(z_i = 0) P(S_i \mid \theta_0) \right]$$

Maximum likelihood estimate: $\hat{\theta}_{ML} = \arg\max_\theta \log L(\theta)$


19 Lower bound on L(θ)

The log-likelihood function is

$$\log L(\theta) = \sum_{i=1}^{n} \log \left[ P(z_i = 1) P(S_i \mid z_i = 1) + P(z_i = 0) P(S_i \mid z_i = 0) \right]$$

where $P(S_i \mid z_i = 1) = P(S_i \mid \theta)$ and $P(S_i \mid z_i = 0) = P(S_i \mid \theta_0)$.

Jensen's inequality:

$$\log(q_1 x + q_2 y) \ge q_1 \log(x) + q_2 \log(y)$$

for all $q_1, q_2 \ge 0$ with $q_1 + q_2 = 1$.
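A quick numerical illustration of Jensen's inequality for the concave log, with illustrative values of $q_1, q_2, x, y$ (not from the slides):

```python
import math

q1, q2 = 0.3, 0.7
x, y = 2.0, 5.0
print(math.log(q1 * x + q2 * y))             # log of the mixture
print(q1 * math.log(x) + q2 * math.log(y))   # mixture of the logs, never larger
```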

20 EM-algorithm

Lower bound on $\log L(\theta)$:

$$\log L(\theta) \ge \sum_{i=1}^{N} \left\{ q_i \log \frac{P(z_i = 1) P(S_i \mid z_i = 1)}{q_i} + (1 - q_i) \log \frac{P(z_i = 0) P(S_i \mid z_i = 0)}{1 - q_i} \right\} \equiv \sum_{i=1}^{N} \phi(q_i, \theta)$$

Expectation-Maximization: alternate between two steps.

E-step:
$$\hat{q}_i = \arg\max_{q_i} \phi(q_i, \hat{\theta})$$

M-step:
$$\hat{\theta} = \arg\max_\theta \sum_{i=1}^{N} \phi(\hat{q}_i, \theta)$$

21 E-Step

Auxiliary function:

$$\phi(q_i, \theta) = q_i \log \frac{P(z_i = 1) P(S_i \mid z_i = 1)}{q_i} + (1 - q_i) \log \frac{P(z_i = 0) P(S_i \mid z_i = 0)}{1 - q_i}$$

E-step:

$$\hat{q}_i = \arg\max_{q_i} \phi(q_i, \hat{\theta})$$

which leads to

$$\hat{q}_i = \frac{P(z_i = 1) P(S_i \mid z_i = 1)}{P(z_i = 1) P(S_i \mid z_i = 1) + P(z_i = 0) P(S_i \mid z_i = 0)} = P(z_i = 1 \mid S_i)$$

22 M-Step

Auxiliary function:

$$\phi(q_i, \theta) = q_i \log \frac{P(z_i = 1) P(S_i \mid z_i = 1)}{q_i} + (1 - q_i) \log \frac{P(z_i = 0) P(S_i \mid z_i = 0)}{1 - q_i}$$

M-step:

$$\hat{\theta} = \arg\max_\theta \sum_{i=1}^{N} \phi(\hat{q}_i, \theta) = \arg\max_\theta \sum_{i=1}^{N} \left[ \hat{q}_i \log P(S_i \mid \theta) + (1 - \hat{q}_i) \log P(S_i \mid \theta_0) \right]$$

which leads to

$$\hat{\theta}_{ij} = \frac{\sum_{k=1}^{N} \hat{q}_k \, I(S_{ki} = j)}{\sum_{k=1}^{N} \hat{q}_k}$$

where $I(a)$ is an indicator function: $I(a) = 1$ if $a$ is true and $0$ otherwise.

23 Summary of EM-algorithm

Initialize the parameters $\theta$. Repeat until convergence:

E-step: estimate the expected values of the labels, given the current parameter estimate,
$$\hat{q}_i = P(z_i = 1 \mid S_i)$$

M-step: re-estimate the parameters, given the expected estimates of the labels,
$$\hat{\theta}_{ij} = \frac{\sum_{k=1}^{N} \hat{q}_k \, I(S_{ki} = j)}{\sum_{k=1}^{N} \hat{q}_k}$$

The procedure is guaranteed to converge to a local maximum or saddle point solution.
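A compact sketch of the full EM loop from slides 20 to 23, assuming sequences encoded 0 through 3, a fixed background distribution `theta0`, and a fixed mixing weight `q` = P(z_i = 1) (both treated as known here, as on the slides); the function name `em_motif` and the initialization scheme are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def em_motif(S, theta0, q=0.5, n_iter=100, seed=0):
    """EM for the two-component mixture: motif PWM theta vs. background theta0."""
    n, w = S.shape
    rng = np.random.default_rng(seed)
    # Random row-normalized initialization of the PWM.
    theta = rng.random((w, 4))
    theta /= theta.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # Per-sequence likelihoods under the motif and background models.
        p_motif = np.prod(theta[np.arange(w), S], axis=1)   # prod_i theta[i, S_ki]
        p_bg    = np.prod(theta0[S], axis=1)                # prod_i theta0[S_ki]

        # E-step: q_hat_k = P(z_k = 1 | S_k).
        q_hat = q * p_motif / (q * p_motif + (1 - q) * p_bg)

        # M-step: theta_ij = sum_k q_hat_k I(S_ki = j) / sum_k q_hat_k.
        for i in range(w):
            for j in range(4):
                theta[i, j] = q_hat[S[:, i] == j].sum()
        theta /= q_hat.sum()

    return theta, q_hat

# Illustrative usage with a uniform background:
# theta_hat, q_hat = em_motif(S, theta0=np.full(4, 0.25), q=0.5)
```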

24 What about the MAP estimate?

Consider a Dirichlet prior distribution on each $\theta_i$:

$$\theta_i \sim \mathrm{Dir}(\alpha), \quad i = 1, \dots, w$$

Initialize the parameters $\theta$. Repeat until convergence:

E-step: estimate the expected values of the labels, given the current parameter estimate,
$$\hat{q}_i = P(z_i = 1 \mid S_i)$$

M-step: re-estimate the parameters, given the expected estimates of the labels,
$$\hat{\theta}_{ij} = \frac{\sum_{k=1}^{N} \hat{q}_k \, I(S_{ki} = j) + \alpha_j - 1}{\sum_{k=1}^{N} \hat{q}_k + \alpha_0 - 4}$$
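Relative to the EM sketch above, only the M-step changes: the weighted counts are smoothed by the Dirichlet pseudocounts. A sketch with the same assumed encoding and hypothetical names ($\alpha_j \ge 1$ keeps the smoothed counts non-negative):

```python
import numpy as np

def m_step_map(S, q_hat, alpha):
    """M-step with a Dir(alpha) prior: smoothed counts divided by (sum_k q_hat_k + alpha_0 - 4)."""
    n, w = S.shape
    alpha = np.asarray(alpha, dtype=float)
    theta = np.zeros((w, 4))
    for i in range(w):
        for j in range(4):
            theta[i, j] = q_hat[S[:, i] == j].sum() + alpha[j] - 1.0
    theta /= q_hat.sum() + alpha.sum() - 4.0
    return theta
```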

25 Different methods for parameter estimation

So far, we have introduced two methods: ML and MAP.

Maximum likelihood (ML):
$$\hat{\theta}_{ML} = \arg\max_\theta P(S \mid \theta)$$

Maximum a posteriori (MAP):
$$\hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid S)$$

Bayes estimator:
$$\hat{\theta} = \arg\min_{\hat{\theta}} E\left[ (\hat{\theta}(S) - \theta)^2 \right]$$
that is, the estimator minimizing the MSE (mean squared error), which is the posterior mean:
$$\hat{\theta} = E[\theta \mid S] = \int \theta \, P(\theta \mid S) \, d\theta$$
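For the Dirichlet posterior of slide 14, the Bayes (posterior mean) estimator also has a simple closed form, $E[\theta_{ij} \mid S] = (c_{ij} + \alpha_j)/(n + \alpha_0)$, obtained by applying the Dirichlet mean from slide 13 to the posterior. A sketch with the same assumed encoding and a hypothetical function name:

```python
import numpy as np

def bayes_estimate(sequences, alpha, alphabet_size=4):
    """Posterior mean under a Dir(alpha) prior: (c_ij + alpha_j) / (n + alpha_0)."""
    n, w = sequences.shape
    counts = np.zeros((w, alphabet_size))
    for i in range(w):
        counts[i] = np.bincount(sequences[:, i], minlength=alphabet_size)
    alpha = np.asarray(alpha, dtype=float)
    return (counts + alpha) / (n + alpha.sum())
```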

26 Joint distribution

Joint distribution of the labels and sequences:

$$P(S, z \mid \theta, \theta_0) = P(S \mid z, \theta, \theta_0) P(z) = \prod_{i=1}^{n} P(z_i) \left[ z_i P(S_i \mid \theta) + (1 - z_i) P(S_i \mid \theta_0) \right]$$

Joint distribution of $S$, $z$, $\theta$ and $\theta_0$:

$$P(S, z, \theta, \theta_0) = \prod_{i=1}^{n} P(z_i) \left[ z_i P(S_i \mid \theta) + (1 - z_i) P(S_i \mid \theta_0) \right] P(\theta) P(\theta_0)$$

Find the joint distribution of $S$ and $z$:

$$P(S, z) = \int_\theta \int_{\theta_0} \prod_{i=1}^{n} P(z_i) \left[ z_i P(S_i \mid \theta) + (1 - z_i) P(S_i \mid \theta_0) \right] P(\theta) P(\theta_0) \, d\theta \, d\theta_0$$

27 Posterior distribution of labels

Posterior distribution of $z$:

$$P(z \mid S) = \frac{1}{P(S)} \int_\theta \int_{\theta_0} \prod_{i=1}^{n} P(z_i) \left[ z_i P(S_i \mid \theta) + (1 - z_i) P(S_i \mid \theta_0) \right] P(\theta) P(\theta_0) \, d\theta \, d\theta_0$$

$$\propto q^m (1 - q)^{n - m} \prod_{i=1}^{w} \int \prod_{j=1}^{4} \theta_{ij}^{n_{ij}} P(\theta_i) \, d\theta_i \;\cdot\; \frac{B(n_0 + \alpha_0)}{B(\alpha_0)} \;\propto\; q^m (1 - q)^{n - m} \prod_{i=1}^{w} \frac{B(n_i + \alpha)}{B(\alpha)} \;\frac{B(n_0 + \alpha_0)}{B(\alpha_0)}$$

where $m = \sum_{i=1}^{n} z_i$, $P(z_i = 1) = q$, $P(\theta_i) = \mathrm{Dir}(\alpha)$ and $P(\theta_0) = \mathrm{Dir}(\alpha_0)$ are Dirichlet priors, $n_{ij} \equiv \sum_{k=1}^{n} z_k I(S_{ki} = j)$ is the number of letter $j$ at position $i$ among the sequences with label 1, $n_i \equiv (n_{i1}, \dots, n_{i4})$, $n_{0,j} \equiv \sum_{k=1}^{n} (1 - z_k) \sum_{i=1}^{w} I(S_{ki} = j)$, and $n_0 \equiv (n_{0,1}, \dots, n_{0,4})$.

28 Sampling

Posterior distribution of $z$:

$$P(z \mid S) \propto q^m (1 - q)^{n - m} \prod_{i=1}^{w} \frac{B(n_i + \alpha)}{B(\alpha)} \;\frac{B(n_0 + \alpha_0)}{B(\alpha_0)}$$

Posterior distribution of $z_k$ conditioned on all other labels $z_{-k} \equiv \{z_i : i = 1, \dots, n,\ i \neq k\}$:

$$P(z_k = 1 \mid z_{-k}, S) \propto q \prod_{i=1}^{w} \frac{B(n_{-k,i} + \delta(S_{ki}) + \alpha)}{B(n_{-k,i} + \alpha)}$$

where $n_{-k,ij} \equiv \sum_{l=1, l \neq k}^{n} z_l I(S_{li} = j)$ is the number of letter $j$ at position $i$ among all sequences with label 1, excluding the $k$-th sequence; $n_{-k,i} \equiv (n_{-k,i1}, \dots, n_{-k,i4})$; and $\delta(l) = (b_1, \dots, b_4)$ with $b_j = 1$ for $j = l$ and $0$ otherwise.

29 Posterior distribution

Posterior distribution of $z_k$ conditioned on $z_{-k}$:

$$P(z_k = 1 \mid z_{-k}, S) \propto q \prod_{i=1}^{w} \frac{n_{-k,i S_{ki}} + \alpha_{S_{ki}} - 1}{\sum_j \left[ n_{-k,ij} + \alpha_j - 1 \right]} = q \prod_{i=1}^{w} \theta_{i S_{ki}}$$

Note that $\theta_{i S_{ki}}$ is the same as the MAP estimate of the frequency weight matrix using all sequences with label 1, excluding the $k$-th sequence.

Similarly,

$$P(z_k = 0 \mid z_{-k}, S) \propto (1 - q) \prod_{i=1}^{w} \frac{n_{-k,0 S_{ki}} + \alpha_{0,S_{ki}} - 1}{\sum_j \left[ n_{-k,0j} + \alpha_{0,j} - 1 \right]} = (1 - q) \prod_{i=1}^{w} \theta_{0,S_{ki}}$$

where $\theta_{0,S_{ki}}$ is the same as the MAP estimate of the background distribution.

30 Gibbs sampling

Posterior probabilities:

$$P(z_k = 1 \mid z_{-k}, S) \propto q \prod_{i=1}^{w} \theta_{i S_{ki}}$$

$$P(z_k = 0 \mid z_{-k}, S) \propto (1 - q) \prod_{i=1}^{w} \theta_{0,S_{ki}}$$

Gibbs sampling:

$$P(z_k = 1 \mid z_{-k}, S) = \frac{q \prod_{i=1}^{w} \theta_{i S_{ki}}}{q \prod_{i=1}^{w} \theta_{i S_{ki}} + (1 - q) \prod_{i=1}^{w} \theta_{0,S_{ki}}}$$

31 Gibbs sampling

Initialize the labels $z$: assign the value of $z_i$ randomly according to $P(z_i = 1) = q$ for all $i = 1, \dots, n$.

Repeat until convergence:
  For $i = 1$ to $n$:
    Update the $\theta$ matrix using the MAP estimate (excluding the $i$-th sequence).
    Sample the value of $z_i$.
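A minimal sketch of the collapsed Gibbs sampler of slides 28 to 31, assuming sequences encoded 0 through 3, a symmetric Dirichlet prior $\mathrm{Dir}(\alpha, \dots, \alpha)$ on each motif position (a simplifying assumption), and a background distribution `theta0` that is kept fixed rather than re-estimated, to stay close to the outline above; the function name `gibbs_motif`, the sweep count, and the default values are illustrative:

```python
import numpy as np

def gibbs_motif(S, theta0, alpha=2.0, q=0.5, n_sweeps=200, seed=0):
    """Gibbs sampling of the labels z; theta is recomputed as the MAP estimate
    from the currently labeled sequences, excluding the one being updated."""
    n, w = S.shape
    rng = np.random.default_rng(seed)
    z = (rng.random(n) < q).astype(int)      # initialize labels with P(z_i = 1) = q

    for _ in range(n_sweeps):
        for k in range(n):
            keep = (z == 1)
            keep[k] = False                  # exclude the k-th sequence
            m = keep.sum()

            # MAP estimate of the PWM from the remaining label-1 sequences (slide 24 form).
            counts = np.zeros((w, 4))
            for i in range(w):
                counts[i] = np.bincount(S[keep, i], minlength=4)
            theta = (counts + alpha - 1.0) / (m + 4 * alpha - 4.0)

            # Unnormalized probabilities of z_k = 1 and z_k = 0 for sequence k.
            p1 = q * np.prod(theta[np.arange(w), S[k]])
            p0 = (1 - q) * np.prod(theta0[S[k]])

            # Sample z_k from its conditional posterior (slide 30).
            z[k] = int(rng.random() < p1 / (p1 + p0))
    return z
```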
