CS1820 Notes
hgupta1, kjline, smechery
April 3-April 5

April 3 Notes

1 Minichiello-Durbin Algorithm

Input: a set of sequences.
Output: a plausible Ancestral Recombination Graph (ARG).
Note: the optimal ARG is the one with the minimum number of recombinations.

2 Definitions

0 and 1 are alleles; $*$ is an undefined allele.
Sequences involved: haplotype sequences of length $m$ over the alphabet $\{0, 1, *\}$. Let $C$ be a sequence; $C[i]$ is the $i$th symbol, $1 \le i \le m$.
Compatibility: $C_1[i] \approx C_2[i]$ iff $C_1[i] = C_2[i]$ or $C_1[i] = *$ or $C_2[i] = *$.
$T$ is the index of the time step, and $S_T$ is the sample of sequences at time step $T$.

Operations:

1. Coalesce
Rule: If there are two sequences $C_1$ and $C_2$ in $S_T$ such that $C_1[i] \approx C_2[i]$ for all $i$, then $C_1$ and $C_2$ coalesce into an ancestor sequence $C$.
Transition: $S_{T+1} = (S_T \setminus \{C_1, C_2\}) \cup \{C\}$, where $C[i] = C_1[i]$ when $C_1[i] \ne *$, and $C[i] = C_2[i]$ otherwise.
2. Mutation
Rule: If there exists a sequence $C_1$ in $S_T$ and a marker $i$ where, for every $C_2$ in $(S_T \setminus \{C_1\})$, we have $C_2[i] \ne C_1[i]$ or $C_2[i] = *$, then we can remove the derived allele ($C_1[i]$) from the population.
Transition: $S_{T+1} = (S_T \setminus \{C_1\}) \cup \{C'\}$, where $C'[i] \ne C_1[i]$ (the ancestral allele) and $C'[j] = C_1[j]$ for all $j \ne i$.

3. Recombination
Rule: When the rules of mutation and coalescence do not apply, we must apply a recombination or a pair of recombinations. Denote the recombination break point as $(\alpha, \beta)$, meaning that it occurs between markers $\alpha$ and $\beta$. Picking a shared tract $\{C_1, C_2\}[\alpha, \beta]$ from those available in $S_T$, we aim to make one recombination parent of $C_1$ and one recombination parent of $C_2$ satisfy the rule of coalescence. To do this, we must put a break point at $(\alpha - 1, \alpha)$ if $\alpha \ne 1$, and a break point at $(\beta, \beta + 1)$ if $\beta \ne m$.
Transition: From the tract $\{C_1, C_2\}[\alpha, \beta]$, pick (1) a valid breakpoint $(\alpha', \beta')$, where either $(\alpha', \beta') = (\alpha - 1, \alpha)$ or $(\alpha', \beta') = (\beta, \beta + 1)$, and (2) a recombinant sequence $C_R$, where either $R = 1$ or $R = 2$. Then $S_{T+1} = (S_T \setminus \{C_R\}) \cup \{C'_1, C'_2\}$, where $C'_1[i] = C_R[i]$ for all $i \le \alpha'$ and $C'_1[i] = *$ otherwise, and $C'_2[i] = C_R[i]$ for all $i \ge \beta'$ and $C'_2[i] = *$ otherwise. If both $(\alpha - 1, \alpha)$ and $(\beta, \beta + 1)$ are valid breakpoints, we must put the second recombination (taking us to $S_{T+2}$) on an appropriate ancestor of $C'_1$ or $C'_2$.

3 The Algorithm

The goal of the algorithm is to find a single ancestral sequence for a sample of sequences. A sketch of the rule checks appears after this list.

1. The algorithm starts at $T = 1$.
2. At each time step $T$, apply the Coalesce, Mutation, and Recombination rules to $S_T$.
3. Stop when $S_T$ contains one sequence.
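To make the rule checks concrete, here is a minimal Python sketch of the compatibility test and the Coalesce and Mutation operations. It assumes sequences are strings over {'0', '1', '*'}, with '*' as the undefined allele and '0' as the ancestral allele; the function names are illustrative, not from the algorithm itself.

    # Sketch of the Coalesce and Mutation rule checks; '*' marks an
    # undefined allele and '0' is assumed to be the ancestral allele.

    def compatible(c1: str, c2: str) -> bool:
        """C1[i] ~ C2[i] for all i: equal, or at least one side undefined."""
        return all(a == b or a == '*' or b == '*' for a, b in zip(c1, c2))

    def coalesce(c1: str, c2: str) -> str:
        """Merge two compatible sequences into an ancestor sequence C."""
        return ''.join(a if a != '*' else b for a, b in zip(c1, c2))

    def try_mutation(sample: list[str]) -> list[str] | None:
        """If some sequence carries a derived allele at marker i that no
        other sequence shares, revert marker i to the ancestral allele."""
        for idx, c1 in enumerate(sample):
            for i, allele in enumerate(c1):
                if allele == '1' and all(
                    other[i] != '1'
                    for k, other in enumerate(sample) if k != idx
                ):
                    reverted = c1[:i] + '0' + c1[i + 1:]
                    return sample[:idx] + [reverted] + sample[idx + 1:]
        return None

For example, compatible('1*0', '110') is True, so the two sequences could coalesce into '110'.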
April 5 Notes

Chapter 4: Hidden Markov Models - The Learning Problem and Algorithm

Three Fundamental Problems

A Hidden Markov Model has the following inputs:
The observation sequence: $\Theta = \theta_1 \theta_2 \cdots \theta_T$
The model: $\lambda = (A, B, \pi)$

Problem 1: the Evaluation problem, or Model Scoring problem.
Given: $\Theta$, $\lambda$. Compute: $P(\Theta \mid \lambda)$, the probability of the observation sequence given the model.

Problem 2: the Decoding problem, or Uncovering the Hidden Part.
Given: $\Theta$, $\lambda$. Compute: the sequence of states $q_1, q_2, \ldots, q_T$ that optimally explains the observed sequence. The Viterbi Algorithm (maximum likelihood) is used for this problem.

Problem 3: the Learning problem, or Training problem.
Given: $\Theta$. Compute: the parameters of a model $\lambda = (A, B, \pi)$ that maximize $P(\Theta \mid \lambda)$.

Definition of a Hidden Markov Model

An HMM has 5 elements: $N, M, A, B, \pi$.
1. $N$ = the number of states $s = s_1, s_2, \ldots, s_N$.
2. $M$ = the number of distinct observation symbols per state.
3. $A$ = the state transition probability distribution, $A = \{a_{ij}\}$, $a_{ij} = P[q_{t+1} = s_j \mid q_t = s_i]$, $1 \le i, j \le N$.
4. $B$ = the observation symbol probability distribution, $B = \{b_j(k)\}$, $b_j(k) = P[v_k \text{ at time } t \mid q_t = s_j]$, $1 \le j \le N$, $1 \le k \le M$.
5. $\pi$ = the initial state distribution, $\pi = \{\pi_i\}$, $\pi_i = P[q_1 = s_i]$, $1 \le i \le N$.

We need a number of variables (a sketch of the forward and backward recursions follows this list):
The forward variable: $\alpha_t(i) = P(\theta_1 \theta_2 \cdots \theta_t, q_t = s_i \mid \lambda)$. Note: $\theta_1 \theta_2 \cdots \theta_t$ is a prefix of the observation sequence.
The backward variable: $\beta_t(i) = P(\theta_{t+1} \theta_{t+2} \cdots \theta_T \mid q_t = s_i, \lambda)$. $\theta_{t+1} \theta_{t+2} \cdots \theta_T$ is a suffix of the observation sequence.
Delta: $\delta_t(i) = \max_{q_1, \ldots, q_{t-1}} P[q_1 q_2 \cdots q_t = s_i, \theta_1 \theta_2 \cdots \theta_t \mid \lambda]$.
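As a concrete reference, here is a minimal numpy sketch of the forward and backward recursions, assuming $A$ is an $N \times N$ transition matrix, $B$ an $N \times M$ emission matrix, pi a length-$N$ initial distribution, and obs a sequence of symbol indices in 0..M-1; the function names are illustrative.

    import numpy as np

    def forward(A, B, pi, obs):
        """alpha[t, i] = P(theta_1..theta_t, q_t = s_i | lambda)."""
        T, N = len(obs), len(pi)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                      # initialization
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # induction step
        return alpha

    def backward(A, B, obs):
        """beta[t, i] = P(theta_{t+1}..theta_T | q_t = s_i, lambda)."""
        T, N = len(obs), A.shape[0]
        beta = np.zeros((T, N))
        beta[T - 1] = 1.0                                 # base case
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        return beta

$P(\Theta \mid \lambda)$ from Problem 1 is then alpha[-1].sum(); in practice both recursions are scaled at each step to avoid numerical underflow.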
Solution to Problem 3:
We want to construct the parameters of the model $\lambda = (A, B, \pi)$ to maximize the probability of observing the sequence $\Theta$ under $\lambda$.
There is no analytical exact solution (unlike for problems 1 and 2).
We are going to construct a $\bar\lambda = (\bar A, \bar B, \bar\pi)$ that is a local maximum of $P(\Theta \mid \bar\lambda)$.

An iterative algorithm:
$\zeta_t(i, j)$ = the probability of being in state $s_i$ at time $t$ and transitioning to state $s_j$ at time $t + 1$:
$\zeta_t(i, j) = P(q_t = s_i, q_{t+1} = s_j \mid \Theta, \lambda)$
$\zeta_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(\theta_{t+1})\, \beta_{t+1}(j)}{P(\Theta \mid \lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(\theta_{t+1})\, \beta_{t+1}(j)}{\sum_{i'=1}^{N} \sum_{j'=1}^{N} \alpha_t(i')\, a_{i'j'}\, b_{j'}(\theta_{t+1})\, \beta_{t+1}(j')}$
The denominator, $P(\Theta \mid \lambda)$, is the normalizing factor.

Gamma: $\gamma_t(i)$ = the probability of being in state $s_i$ at time $t$ given the observation sequence: $\gamma_t(i) = \sum_{j=1}^{N} \zeta_t(i, j)$.
The expected number of times that state $s_i$ is visited: $\sum_{t=1}^{T-1} \gamma_t(i)$.
The expected number of transitions from $s_i$ to $s_j$: $\sum_{t=1}^{T-1} \zeta_t(i, j)$.

Suppose that we have a model $\lambda = (A, B, \pi)$. Construct a new model $\bar\lambda = (\bar A, \bar B, \bar\pi)$ (a sketch of one re-estimation step follows below):
$\bar\pi_i$ = the expected frequency (number of times) in state $s_i$ at $t = 1$ = $\gamma_1(i)$.
$\bar a_{ij} = \frac{\text{expected number of transitions from } s_i \text{ to } s_j}{\text{expected number of transitions from } s_i} = \frac{\sum_{t=1}^{T-1} \zeta_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$
$\bar b_j(k) = \frac{\text{expected number of times in state } s_j \text{ observing symbol } v_k}{\text{expected number of times in state } s_j} = \frac{\sum_{t=1,\, \theta_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$
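Here is a minimal sketch of one Baum-Welch re-estimation step implementing the $\zeta$, $\gamma$, and re-estimation formulas above, reusing the forward and backward functions from the earlier sketch; it assumes a single observation sequence.

    def baum_welch_step(A, B, pi, obs):
        """One re-estimation step: compute zeta and gamma, return
        (A_bar, B_bar, pi_bar) from the expected-count formulas."""
        T, N = len(obs), len(pi)
        alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
        prob = alpha[-1].sum()                 # P(Theta | lambda)

        # zeta[t, i, j] = P(q_t = s_i, q_{t+1} = s_j | Theta, lambda)
        zeta = np.zeros((T - 1, N, N))
        for t in range(T - 1):
            zeta[t] = (alpha[t][:, None] * A
                       * B[:, obs[t + 1]] * beta[t + 1]) / prob

        # gamma[t, i] = P(q_t = s_i | Theta, lambda)
        gamma = alpha * beta / prob

        pi_bar = gamma[0]                      # expected frequency in s_i at t = 1
        A_bar = zeta.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B_bar = np.zeros_like(B)
        obs_arr = np.array(obs)
        for k in range(B.shape[1]):
            B_bar[:, k] = gamma[obs_arr == k].sum(axis=0) / gamma.sum(axis=0)
        return A_bar, B_bar, pi_bar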
The re-estimated model is $\bar\lambda = (\bar A, \bar B, \bar\pi)$. Repeating the re-estimation with $\bar\lambda$ in place of $\lambda$ does not decrease $P(\Theta \mid \lambda)$, and the iteration converges to a local maximum.
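A usage sketch of that iteration, reusing the functions above; the tolerance and iteration cap are illustrative choices, not from the notes.

    def train(A, B, pi, obs, tol=1e-6, max_iters=100):
        """Repeat re-estimation until P(Theta | lambda) stops improving."""
        prev = -np.inf
        for _ in range(max_iters):
            A, B, pi = baum_welch_step(A, B, pi, obs)
            prob = forward(A, B, pi, obs)[-1].sum()  # score the new model
            if prob - prev < tol:
                break
            prev = prob
        return A, B, pi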
Figure 1: An example of a full ARG
Figure 2: First part of a partial recombination
Figure 3: Second part of a partial recombination