Basic math for biology
Lei Li, Florida State University
Feb 6, 2002
The EM algorithm: setup
Parametric models: {P_θ}. Data: full data (Y, X); partial data Y. Missing data: X.
Likelihood and maximum likelihood estimate:
log P_θ(Y) = log P_θ(Y, X) - log P_θ(X | Y),
i.e., log L(Y; θ) = log L(Y, X; θ) - log P(X | Y; θ).
Maximum likelihood estimate: θ̂ maximizes log L(Y; θ). Maximizing log L(Y; θ) directly is usually hard; however, the MLE based on the full data usually has a closed form.
The EM algorithm: key idea
Take the conditional expectation given Y = y at the parameter θ' and let
Q(θ; θ') = E_θ'[log L(Y, X; θ) | Y = y],
H(θ; θ') = E_θ'[log P(X | Y; θ) | Y = y],
so that log L(Y; θ) = Q(θ; θ') - H(θ; θ').
E-M algorithm: iterate between the following two steps from an initial value θ'.
E-step: calculate Q(θ; θ') for the current value of θ';
M-step: maximize Q(θ; θ') with respect to θ.
The EM algorithm: the magic
Conditional expectation: can be calculated in cases such as the exponential family.
Don't worry about H(θ; θ')! By Jensen's inequality (the same argument behind Shannon's first theorem), H(θ; θ') ≤ H(θ'; θ'), so the partial likelihood always goes up!
Convergence: in general only to a local maximum.
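As a concrete illustration, here is a minimal sketch of the E-M iteration for a two-component Gaussian mixture with known unit variances; the model and all starting values are illustrative assumptions, not taken from the slides.

```python
# E-M for a two-component Gaussian mixture with known unit variances.
# Missing data X: the component label of each observation.
import numpy as np

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])

mu1, mu2, w = -1.0, 1.0, 0.5              # initial value of theta'
for _ in range(50):
    # E-step: posterior probability that each y_i came from component 1
    d1 = w * np.exp(-0.5 * (y - mu1) ** 2)
    d2 = (1 - w) * np.exp(-0.5 * (y - mu2) ** 2)
    g = d1 / (d1 + d2)
    # M-step: maximize Q(theta; theta'), i.e. weighted full-data MLEs
    mu1 = np.sum(g * y) / np.sum(g)
    mu2 = np.sum((1 - g) * y) / np.sum(1 - g)
    w = np.mean(g)

print(mu1, mu2, w)                        # approaches -2, 3, 0.5
```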
Bayesian inference
Parametric models: {p(data | θ)}. A prior distribution of θ: π(θ); θ is treated as a random variable.
Posterior distribution of θ:
p(θ | data) = p(data | θ) π(θ) / ∫ p(data | θ) π(θ) dθ.
MAP solution: the θ that maximizes the posterior.
Posterior mean: the expected value of θ w.r.t. the posterior distribution.
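A minimal numerical sketch of both posterior summaries, assuming a binomial likelihood with a Beta(2, 2) prior evaluated on a grid; the model, the prior, and the data are illustrative choices.

```python
# Posterior, MAP, and posterior mean for a binomial model on a grid.
import numpy as np

n, k = 20, 14                          # data: k successes in n trials
theta = np.linspace(0.001, 0.999, 999)
dt = theta[1] - theta[0]
prior = theta * (1 - theta)            # Beta(2, 2) up to a constant
lik = theta ** k * (1 - theta) ** (n - k)
post = lik * prior
post /= post.sum() * dt                # normalize by the integral over theta

theta_map = theta[np.argmax(post)]     # MAP solution
theta_mean = np.sum(theta * post) * dt # posterior mean
print(theta_map, theta_mean)
```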
Gibbs sampling
Conditional distributions of (X_1, X_2) known: p(X_1 = x_1 | X_2 = x_2), p(X_2 = x_2 | X_1 = x_1).
Gibbs sampling scheme:
1. Start with an initial value x_2^(0);
2. Generate x_1^(n+1) according to p(x_1 | X_2 = x_2^(n));
3. Generate x_2^(n+1) according to p(x_2 | X_1 = x_1^(n+1)) and go back to Step 2.
Joint distribution of (X_1, X_2): approximated by the draws (x_1^(n), x_2^(n)), n = 1, 2, … (see the sketch below).
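A minimal sketch of the scheme for a bivariate normal with correlation ρ, where both conditionals are known in closed form; the target distribution is an illustrative choice.

```python
# Gibbs sampling for a standard bivariate normal with correlation rho:
# X1 | X2 = x2 ~ N(rho * x2, 1 - rho^2), and symmetrically for X2.
import numpy as np

rng = np.random.default_rng(1)
rho, N = 0.8, 10000
s = np.sqrt(1 - rho ** 2)

x1, x2 = 0.0, 0.0                     # Step 1: initial value
samples = np.empty((N, 2))
for n in range(N):
    x1 = rng.normal(rho * x2, s)      # Step 2: draw from p(x1 | X2 = x2)
    x2 = rng.normal(rho * x1, s)      # Step 3: draw from p(x2 | X1 = x1)
    samples[n] = x1, x2

print(np.corrcoef(samples.T))         # empirical correlation approaches rho
```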
Bayesian treatment of the missing data problem
Data: full data (Y, X); partial data Y. Parametric models: {p(Y, X | θ)}.
What do we need? p(X, θ | Y), and hence p(θ | Y).
We can apply Gibbs sampling if we know the following two conditionals (see the sketch below):
p(X | θ, Y),
p(θ | X, Y).
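A minimal data-augmentation sketch under assumed conjugate choices: observations are N(θ, 1), the prior is θ ~ N(0, 100), and ten data points are missing; none of these specifics come from the slides.

```python
# Data augmentation: alternate draws from p(X | theta, Y) and
# p(theta | X, Y) for a normal mean with missing observations.
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(1.5, 1.0, size=30)     # observed part Y of the data
n_miss, tau2 = 10, 100.0              # missing count, prior variance

theta, draws = 0.0, []
for _ in range(5000):
    x = rng.normal(theta, 1.0, size=n_miss)   # draw X given theta, Y
    full = np.concatenate([y, x])
    # conjugate normal posterior for theta given the full data (Y, X)
    var = 1.0 / (len(full) + 1.0 / tau2)
    theta = rng.normal(var * full.sum(), np.sqrt(var))
    draws.append(theta)

print(np.mean(draws[500:]))           # approximates E(theta | Y) after burn-in
```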
Markov Chain
Markov property: the future and the past are independent given the present.
Markov models: why do we need them? Time dependence and stochastic processes.
Markov property: simple, but general enough.
Characterized by the transition matrix {p_ij}.
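A minimal simulation from a transition matrix; the two-state chain below is an illustrative choice.

```python
# Simulating a Markov chain: the next state depends only on the present.
import numpy as np

rng = np.random.default_rng(3)
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])            # p_ij = P(X_{t+1} = j | X_t = i)

x, path = 0, [0]
for _ in range(100000):
    x = rng.choice(2, p=P[x])
    path.append(x)

print(np.bincount(path) / len(path))  # long-run fractions near (0.75, 0.25)
```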
Structure and notation
Hidden process (Markov chain): {X_t} takes values in n states s_i, with transition probability matrix {p_ij = P(X_{t+1} = s_j | X_t = s_i)}.
Observation: each hidden state X_t emits a random variable O_t taking values in m letters v_k, with emission probabilities {e_jk = P(O_t = v_k | X_t = s_j)}.
Parameters λ: initial distribution of hidden states θ = (θ_1, …, θ_n), {p_ij}, {e_jk}.
Topology of the hidden Markov chain: represents our a priori knowledge.
The time scale of the observation process is not necessarily 1-D, e.g., the alignment of two sequences.
The three basic problems in HMM
Likelihood: what is the probability of a sequence of observations? The forward-backward algorithm.
Parameter estimation: what are the maximum likelihood estimates of the parameters? The EM algorithm.
Decoding: what is the most likely sequence of states that produced a given sequence of observations? Viterbi decoding, marginal decoding.
The forward algorithm
Let α_t(i) = P(o_1 o_2 ⋯ o_t, X_t = s_i; λ).
1. Initialization: α_1(i) = θ_i e_{i,o_1};
2. Induction: α_{t+1}(j) = [Σ_{i=1}^n α_t(i) p_ij] e_{j,o_{t+1}};
3. Termination: P(O | λ) = Σ_{i=1}^n α_T(i).
Complexity: n(n + 1)(T - 1) + n multiplications and n(n - 1)(T - 1) additions.
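A minimal sketch of the recursion; the two-state, two-letter HMM below is an illustrative choice, reused in the HMM sketches that follow.

```python
# Forward algorithm: alpha[t, i] = P(o_1 ... o_t, X_t = s_i; lambda).
import numpy as np

theta = np.array([0.6, 0.4])             # initial state distribution
P = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition probabilities p_ij
E = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities e_jk
o = [0, 1, 1, 0]                         # observed letter indices

alpha = np.empty((len(o), 2))
alpha[0] = theta * E[:, o[0]]                        # initialization
for t in range(len(o) - 1):
    alpha[t + 1] = (alpha[t] @ P) * E[:, o[t + 1]]   # induction
print(alpha[-1].sum())                               # termination: P(O | lambda)
```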
The backward algorithm
Let β_t(i) = P(o_{t+1} o_{t+2} ⋯ o_T | X_t = s_i; λ).
1. Initialization: β_T(i) = 1;
2. Induction: β_t(i) = Σ_{j=1}^n p_ij e_{j,o_{t+1}} β_{t+1}(j);
3. Termination: P(O | λ) = Σ_{i=1}^n θ_i e_{i,o_1} β_1(i).
Complexity: on the order of n²T computations.
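The same illustrative HMM run through the backward recursion; the termination step reproduces the forward likelihood.

```python
# Backward algorithm: beta[t, i] = P(o_{t+1} ... o_T | X_t = s_i; lambda).
import numpy as np

theta = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.4, 0.6]])
E = np.array([[0.9, 0.1], [0.2, 0.8]])
o = [0, 1, 1, 0]

T = len(o)
beta = np.ones((T, 2))                             # initialization: beta_T(i) = 1
for t in range(T - 2, -1, -1):
    beta[t] = P @ (E[:, o[t + 1]] * beta[t + 1])   # induction
print((theta * E[:, o[0]] * beta[0]).sum())        # termination: P(O | lambda)
```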
Marginal decoding
Let γ_t(i) = P(X_t = s_i | O = o; λ). Then
γ_t(i) = α_t(i) β_t(i) / P(O = o | λ) = α_t(i) β_t(i) / Σ_{i=1}^n α_t(i) β_t(i).
The state that maximizes this marginal posterior probability, at each t, gives the solution of marginal decoding.
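A minimal sketch combining the two recursions above to get γ_t(i), again on the same illustrative HMM.

```python
# Marginal decoding: gamma[t, i] = P(X_t = s_i | O = o; lambda).
import numpy as np

theta = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.4, 0.6]])
E = np.array([[0.9, 0.1], [0.2, 0.8]])
o = [0, 1, 1, 0]
T = len(o)

alpha = np.empty((T, 2)); beta = np.ones((T, 2))
alpha[0] = theta * E[:, o[0]]
for t in range(T - 1):
    alpha[t + 1] = (alpha[t] @ P) * E[:, o[t + 1]]
for t in range(T - 2, -1, -1):
    beta[t] = P @ (E[:, o[t + 1]] * beta[t + 1])

gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)   # rows now sum to 1
print(gamma.argmax(axis=1))                 # marginally most likely state at each t
```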
The Viterbi decoding
Goal: find x* = argmax_x P(X = x, O = o | λ).
Soul: an optimal path on a directed acyclic graph (DAG).
Intermediate variables in the recursion: let η_t(i) be the probability of the most probable path ending in state s_i, namely
η_t(i) = max_{x_1, …, x_{t-1}} P(X_1 = x_1, …, X_{t-1} = x_{t-1}, X_t = s_i, o_1 o_2 ⋯ o_t | λ).
Keep track of the argument that maximizes the above quantity: ψ_t(i).
The Viterbi decoding: recursion
1. Initialization: η_1(i) = θ_i e_{i,o_1}, ψ_1(i) = 0;
2. Induction:
η_t(j) = max_{1≤i≤n} [η_{t-1}(i) p_ij] e_{j,o_t},
ψ_t(j) = argmax_{1≤i≤n} [η_{t-1}(i) p_ij];
3. Termination: P(x*, O) = max_{1≤i≤n} η_T(i), x*_T = argmax_{1≤i≤n} η_T(i);
4. Traceback: x*_t = ψ_{t+1}(x*_{t+1}).
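A minimal sketch of the recursion and traceback on the same illustrative HMM, done in log space; working with logs is one common guard against the underflow discussed later, and is our choice here.

```python
# Viterbi decoding in log space: eta and psi as in the recursion above.
import numpy as np

theta = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.4, 0.6]])
E = np.array([[0.9, 0.1], [0.2, 0.8]])
o = [0, 1, 1, 0]
T = len(o)

lp, le = np.log(P), np.log(E)
eta = np.empty((T, 2)); psi = np.zeros((T, 2), dtype=int)
eta[0] = np.log(theta) + le[:, o[0]]          # initialization
for t in range(1, T):
    cand = eta[t - 1][:, None] + lp           # cand[i, j] = eta_{t-1}(i) + log p_ij
    psi[t] = cand.argmax(axis=0)              # best predecessor of state j
    eta[t] = cand.max(axis=0) + le[:, o[t]]   # induction

x = [int(eta[-1].argmax())]                   # termination: x*_T
for t in range(T - 1, 0, -1):                 # traceback
    x.append(psi[t][x[-1]])
print(x[::-1])                                # most probable state path x*
```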
The EM algorithm in HMM
Missing data in HMM: the hidden states X.
Conditional expectations:
γ_t(i) = P(X_t = s_i | O = o; λ),
ξ_t(i, j) = P(X_t = s_i, X_{t+1} = s_j | O = o; λ),
where
ξ_t(i, j) = α_t(i) p_ij e_{j,o_{t+1}} β_{t+1}(j) / Σ_{i=1}^n Σ_{j=1}^n α_t(i) p_ij e_{j,o_{t+1}} β_{t+1}(j).
MLE of the full data:
p̂_ij = Σ_{t=1}^{T-1} ξ_t(i, j) / Σ_{t=1}^{T-1} γ_t(i),
ê_jk = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j).
Computation: beware of underflow (see the scaled sketch below).
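A minimal sketch of one re-estimation step on the same illustrative HMM, scaling α and β at each t, which is one standard remedy for the underflow noted above.

```python
# One Baum-Welch (EM) step: scaled forward-backward, then re-estimation.
import numpy as np

theta = np.array([0.6, 0.4])
P = np.array([[0.7, 0.3], [0.4, 0.6]])
E = np.array([[0.9, 0.1], [0.2, 0.8]])
o = [0, 1, 1, 0, 0, 1]
T, n = len(o), 2

alpha = np.empty((T, n)); beta = np.ones((T, n)); c = np.empty(T)
alpha[0] = theta * E[:, o[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
for t in range(T - 1):
    a = (alpha[t] @ P) * E[:, o[t + 1]]
    c[t + 1] = a.sum(); alpha[t + 1] = a / c[t + 1]          # scaled forward
for t in range(T - 2, -1, -1):
    beta[t] = P @ (E[:, o[t + 1]] * beta[t + 1]) / c[t + 1]  # scaled backward

gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)                    # gamma_t(i)
xi = alpha[:-1, :, None] * P * (E[:, o[1:]].T * beta[1:])[:, None, :]
xi /= xi.sum(axis=(1, 2), keepdims=True)                     # xi_t(i, j)

P_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]     # p_ij hat
E_new = np.stack([gamma[np.array(o) == k].sum(axis=0)
                  for k in range(2)], axis=1)                # e_jk hat numerator
E_new /= gamma.sum(axis=0)[:, None]
print(P_new); print(E_new)                                   # rows sum to 1
```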