BMI/CS 776 Lecture 4 Colin Dewey 2007.02.01
Outline
- Common nucleotide substitution models
- Directed graphical models
- Ancestral sequence inference
Poisson process + continuous Markov process

X_{t_0}, X_{t_1}, \ldots, X_{t_n}: the number of mutation events by time t (a Poisson process with rate \lambda); Y_{t_0}, Y_{t_1}, \ldots, Y_{t_n}: the character at time t, which changes by one application of the jump matrix R per event.

P[X_{t_{i+1}} = x_{i+1} \mid X_{t_i} = x_i] = e^{-\lambda(t_{i+1}-t_i)} \frac{(\lambda(t_{i+1}-t_i))^{x_{i+1}-x_i}}{(x_{i+1}-x_i)!}

P[Y_{t_{i+1}} = b \mid Y_{t_i} = a, X_{t_{i+1}} = x_{i+1}, X_{t_i} = x_i] = (R^{x_{i+1}-x_i})_{ab}
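As a sanity check on the Poisson event-count formula above, a minimal Python sketch (the function name is illustrative) that evaluates the increment probabilities and confirms they sum to one:

```python
import math

def p_count_increment(k, lam, dt):
    """P[X_{t+dt} - X_t = k]: probability of exactly k mutation events
    in an interval of length dt under a Poisson process with rate lam."""
    mu = lam * dt
    return math.exp(-mu) * mu**k / math.factorial(k)

# The probabilities over all possible event counts should sum to 1
total = sum(p_count_increment(k, lam=2.0, dt=0.5) for k in range(100))
```

With lam=2.0 and dt=0.5, the probability of zero events is e^{-1}, as the formula predicts.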
Jukes-Cantor Model

\Pi = \frac{1}{4}\begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\qquad
R = \frac{1}{4}\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix}
\qquad
Q = R - I = \frac{1}{4}\begin{pmatrix} -3 & 1 & 1 & 1 \\ 1 & -3 & 1 & 1 \\ 1 & 1 & -3 & 1 \\ 1 & 1 & 1 & -3 \end{pmatrix}

P(t) = \frac{1}{4}\begin{pmatrix}
1 + 3e^{-\lambda t} & 1 - e^{-\lambda t} & 1 - e^{-\lambda t} & 1 - e^{-\lambda t} \\
1 - e^{-\lambda t} & 1 + 3e^{-\lambda t} & 1 - e^{-\lambda t} & 1 - e^{-\lambda t} \\
1 - e^{-\lambda t} & 1 - e^{-\lambda t} & 1 + 3e^{-\lambda t} & 1 - e^{-\lambda t} \\
1 - e^{-\lambda t} & 1 - e^{-\lambda t} & 1 - e^{-\lambda t} & 1 + 3e^{-\lambda t}
\end{pmatrix}

The Jukes-Cantor model satisfies two important properties:
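The closed form for P(t) can be checked against the matrix exponential e^{Qt}; a small numpy sketch (parameter values are illustrative):

```python
import numpy as np

lam, t = 1.0, 0.3

# Jukes-Cantor: each mutation event picks the new base uniformly
# from {A, C, G, T}, so R is the constant matrix with entries 1/4.
R = np.full((4, 4), 0.25)
Q = lam * (R - np.eye(4))            # instantaneous rate matrix

# Closed form from the slide: diagonal (1 + 3 e^{-lam t}) / 4,
# off-diagonal (1 - e^{-lam t}) / 4.
e = np.exp(-lam * t)
P_closed = np.full((4, 4), (1 - e) / 4) + np.eye(4) * e

# P(t) = exp(Q t), computed by eigendecomposition (Q is symmetric here)
w, V = np.linalg.eigh(Q * t)
P_expm = V @ np.diag(np.exp(w)) @ V.T
```

Both routes give the same matrix, whose rows sum to one as a transition matrix must.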
Properties of Jukes-Cantor
- ergodic: \lim_{t \to \infty} P(t) = J\Pi, where J is the all-ones matrix; i.e. \lim_{t \to \infty} P_{ab}(t) = \pi_b, regardless of the starting character
- time reversible: \Pi Q = Q^T \Pi, i.e. detailed balance \pi_a q_{ab} = \pi_b q_{ba}
Transitions vs. Transversions
- Transition: purine <-> purine or pyrimidine <-> pyrimidine
- Transversion: purine <-> pyrimidine
- Purines: A, G; Pyrimidines: C, T
HKY85

With rows and columns ordered A, C, G, T:

R = \begin{pmatrix}
1 - \pi_G\alpha - \pi_Y\beta & \pi_C\beta & \pi_G\alpha & \pi_T\beta \\
\pi_A\beta & 1 - \pi_T\alpha - \pi_R\beta & \pi_G\beta & \pi_T\alpha \\
\pi_A\alpha & \pi_C\beta & 1 - \pi_A\alpha - \pi_Y\beta & \pi_T\beta \\
\pi_A\beta & \pi_C\alpha & \pi_G\beta & 1 - \pi_C\alpha - \pi_R\beta
\end{pmatrix}
\qquad
Q = R - I

Here \alpha, \beta \ge 0 and 0 \le \alpha + 2\beta < 1, with \pi_A + \pi_C + \pi_G + \pi_T = 1, \pi_R = \pi_A + \pi_G, and \pi_Y = \pi_C + \pi_T. The parameter \alpha scales transitions and \beta transversions.
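Assuming the HKY85 jump matrix above (rows and columns ordered A, C, G, T), it can be built and its claimed properties checked numerically; a sketch with made-up parameter values:

```python
import numpy as np

def hky85_R(pi, alpha, beta):
    """Single-event transition matrix R for HKY85 (order A, C, G, T).

    pi: base frequencies (pi_A, pi_C, pi_G, pi_T); alpha scales
    transitions (A<->G, C<->T), beta scales transversions.
    Assumes alpha, beta >= 0 and alpha + 2*beta < 1, so the diagonal
    entries stay positive.
    """
    pa, pc, pg, pt = pi
    pR, pY = pa + pg, pc + pt            # purine / pyrimidine frequencies
    return np.array([
        [1 - pg*alpha - pY*beta, pc*beta,                pg*alpha,               pt*beta],
        [pa*beta,                1 - pt*alpha - pR*beta, pg*beta,                pt*alpha],
        [pa*alpha,               pc*beta,                1 - pa*alpha - pY*beta, pt*beta],
        [pa*beta,                pc*alpha,               pg*beta,                1 - pc*alpha - pR*beta],
    ])

pi = np.array([0.3, 0.2, 0.2, 0.3])      # illustrative base frequencies
R = hky85_R(pi, alpha=0.4, beta=0.1)
Q = R - np.eye(4)
```

Numerically, rows of R sum to one, pi is stationary (pi Q = 0), and the model is time reversible (Pi Q is symmetric).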
Branch lengths

What do the lengths of tree edges mean? The expected number of mutations per site. Time (t) and rate (\lambda) are not identifiable; only their product \lambda t is.

P(\text{mutation event}) = 1 - \sum_x \pi_x R_{xx} = 1 - \operatorname{trace}(\Pi R) = -\operatorname{trace}(\Pi Q)

Expected number of mutations = \lambda t \, (-\operatorname{trace}(\Pi Q))

Typically \lambda is scaled so that \lambda \, (-\operatorname{trace}(\Pi Q)) = 1, so that a branch of length t corresponds to t expected mutations per site.
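For Jukes-Cantor these quantities are easy to compute explicitly; a small numpy sketch:

```python
import numpy as np

# Jukes-Cantor: Pi is the (diagonal) stationary distribution, R the
# single-event jump matrix, Q = R - I the unscaled rate matrix.
Pi = np.diag([0.25] * 4)
R = np.full((4, 4), 0.25)
Q = R - np.eye(4)

# Probability that a mutation event actually changes the base:
p_mut = 1 - np.trace(Pi @ R)         # = -trace(Pi @ Q) = 3/4 for JC

# Scale lam so that lam * (-trace(Pi @ Q)) = 1: one expected base
# change per site per unit of branch length.
lam = 1 / (-np.trace(Pi @ Q))        # = 4/3 for Jukes-Cantor
```

The 3/4 reflects that a quarter of Jukes-Cantor "events" replace a base with itself and are invisible.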
Estimating branch length

Given an empirical estimate \hat{P}(t), the branch length can be estimated with the logdet transform:

\operatorname{logdet}(P(t)) = \log(\det(P(t))) = \lambda t \operatorname{trace}(Q)

so \widehat{\lambda t} = \operatorname{logdet}(\hat{P}(t)) / \operatorname{trace}(Q).
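A minimal sketch of the logdet estimator, using the Jukes-Cantor closed form (where trace(Q) = -3) to generate a "known" P(t) and then recovering \lambda t from it:

```python
import numpy as np

trace_Q = -3.0                       # trace of Q = R - I for Jukes-Cantor

def jc_P(lam_t):
    """Closed-form Jukes-Cantor P(t) as a function of lam * t."""
    e = np.exp(-lam_t)
    return np.full((4, 4), (1 - e) / 4) + np.eye(4) * e

def logdet_branch_length(P, trace_Q):
    """Recover lam * t from an (empirical) P(t) using
    log det P(t) = lam * t * trace(Q)."""
    sign, logdet = np.linalg.slogdet(P)
    return logdet / trace_Q

lam_t_hat = logdet_branch_length(jc_P(0.7), trace_Q)
```

Feeding in the exact P(t) for lam*t = 0.7 returns 0.7; with an empirical P(t) estimated from aligned sequences, the same formula gives a branch-length estimate.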
Directed Graphical Models
- Same thing as Bayesian networks
- Directed graph G = (V, E)
- Vertices correspond to random variables
- Missing edges represent independence statements
Example DGM

[figure: a directed graph over vertices X1-X6]
Tree model

data = (x_1, x_2, x_3, x_4, x_5) (observed leaf characters); model = (T, P(t), \Pi)

[figure: rooted tree with root X9; edges t_{97}, t_{98} to internal vertices X7, X8; X7 has leaf children X1, X2 (edges t_{71}, t_{72}); X8 has leaf child X3 (edge t_{83}) and internal child X6 (edge t_{86}); X6 has leaf children X4, X5 (edges t_{64}, t_{65})]

P[x_1, x_2, \ldots, x_9] = \pi_{x_9} \prod_{(x_i, x_j) \in E} P_{x_i x_j}(t_{ij})

L(\text{data} \mid \text{model}) = \sum_{x_6, x_7, x_8, x_9} P(x_1, x_2, \ldots, x_9)
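On a minimal two-leaf version of such a tree, the marginalization over unobserved vertices can be done by direct enumeration; a sketch (branch lengths and observed characters are made up for illustration):

```python
import numpy as np

def jc_P(lam_t):
    """Jukes-Cantor transition matrix used for the example branches."""
    e = np.exp(-lam_t)
    return np.full((4, 4), (1 - e) / 4) + np.eye(4) * e

# Two-leaf tree: root x3 with leaf children x1 and x2.
pi = np.full(4, 0.25)                # root distribution
P31, P32 = jc_P(0.4), jc_P(0.6)      # per-edge transition matrices
x1, x2 = 0, 2                        # observed leaf characters (A, G)

# L(data | model): sum out the unobserved root character x3
lik = sum(pi[x3] * P31[x3, x1] * P32[x3, x2] for x3 in range(4))
```

For larger trees this enumeration is exponential in the number of interior vertices, which is what the pruning algorithm on the next slide avoids.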
Likelihood computation

L_k(a): likelihood of the subtree rooted at k, given x_k = a

for each vertex k in a topological ordering (children before parents):
    if k is a leaf:
        set L_k(a) = 1 if x_k = a, and 0 otherwise
    else:
        let i and j be the child vertices of k
        set L_k(a) = \sum_{b,c} P_{ab}(t_{ki}) L_i(b) P_{ac}(t_{kj}) L_j(c)

return L(\text{data} \mid \text{model}) = \sum_a \pi_a L_r(a), where r is the root vertex

(Felsenstein's algorithm; an instance of the sum-product algorithm)
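A compact recursive implementation of the pruning algorithm above; the tree encoding (dicts for children, leaf observations, and per-edge matrices) is just one convenient choice:

```python
import numpy as np

def jc_P(lam_t):
    """Jukes-Cantor transition matrix for a branch of length lam_t."""
    e = np.exp(-lam_t)
    return np.full((4, 4), (1 - e) / 4) + np.eye(4) * e

def felsenstein(root, children, leaf_obs, P, pi):
    """Felsenstein's pruning (sum-product) algorithm.

    children[k] = (i, j) for each internal vertex k;
    leaf_obs[k] = observed character index at leaf k;
    P[(k, i)] = transition matrix on edge k -> i;
    pi = root (stationary) distribution.
    Returns P(leaf data | model).
    """
    def L(k):
        if k in leaf_obs:                # leaf: indicator on observed char
            v = np.zeros(len(pi))
            v[leaf_obs[k]] = 1.0
            return v
        i, j = children[k]
        # L_k(a) = sum_{b,c} P_ab(t_ki) L_i(b) P_ac(t_kj) L_j(c),
        # which factors into a product of two matrix-vector products
        return (P[(k, i)] @ L(i)) * (P[(k, j)] @ L(j))
    return float(pi @ L(root))

# Tiny example: root 'r' with two leaf children observing A (0) and G (2)
pi = np.full(4, 0.25)
P = {('r', 'x'): jc_P(0.5), ('r', 'y'): jc_P(0.5)}
lik = felsenstein('r', {'r': ('x', 'y')}, {'x': 0, 'y': 2}, P, pi)
```

On this two-leaf example the result matches the brute-force sum over the root character.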
Marginal reconstruction
- most likely character at the root, given the leaf characters
- compute L_r(a) using the likelihood algorithm
- return \arg\max_a \pi_a L_r(a)
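A tiny sketch of marginal reconstruction at the root of a two-leaf tree, assuming Jukes-Cantor branches (indices 0-3 stand for A, C, G, T):

```python
import numpy as np

def jc_P(lam_t):
    """Jukes-Cantor transition matrix for a branch of length lam_t."""
    e = np.exp(-lam_t)
    return np.full((4, 4), (1 - e) / 4) + np.eye(4) * e

# Root with two leaf children both observing A (index 0); short
# branches, so the root should be reconstructed as A as well.
pi = np.full(4, 0.25)
Px, Py = jc_P(0.2), jc_P(0.2)

# With leaf indicators on A, L_r(a) = P_aA(t_x) * P_aA(t_y)
L_r = Px[:, 0] * Py[:, 0]
root_char = int(np.argmax(pi * L_r))
```

The unnormalized values pi_a L_r(a) can also be normalized to give the full posterior over the root character.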
Joint reconstruction
- character at the root for the most likely assignment to all interior vertices, given the leaf characters

V_k(a): likelihood of the most likely interior vertex assignments for the subtree rooted at k, given x_k = a

for each vertex k in a topological ordering (children before parents):
    if k is a leaf:
        set V_k(a) = 1 if x_k = a, and 0 otherwise
    else:
        let i and j be the child vertices of k
        set V_k(a) = \max_{b,c} P_{ab}(t_{ki}) V_i(b) P_{ac}(t_{kj}) V_j(c)

return \arg\max_a \pi_a V_r(a), where r is the root vertex
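The max-product recursion can be implemented by replacing the sums in Felsenstein's algorithm with maximizations; a sketch on a two-leaf example (a full joint reconstruction would additionally keep traceback pointers, which this sketch omits):

```python
import numpy as np

def jc_P(lam_t):
    """Jukes-Cantor transition matrix for a branch of length lam_t."""
    e = np.exp(-lam_t)
    return np.full((4, 4), (1 - e) / 4) + np.eye(4) * e

def joint_root(root, children, leaf_obs, P, pi):
    """Max-product analogue of Felsenstein's algorithm: V_k(a) is the
    likelihood of the best assignment to the subtree below k given
    x_k = a. Returns the root character of the jointly most likely
    labelling of the interior vertices."""
    def V(k):
        if k in leaf_obs:                # leaf: indicator on observed char
            v = np.zeros(len(pi))
            v[leaf_obs[k]] = 1.0
            return v
        i, j = children[k]
        Vi, Vj = V(i), V(j)
        # max_{b,c} P_ab V_i(b) P_ac V_j(c) factors into two per-child
        # maximizations, computed here as row-wise maxima
        return (P[(k, i)] * Vi).max(axis=1) * (P[(k, j)] * Vj).max(axis=1)
    return int(np.argmax(pi * V(root)))

# Root 'r' with two leaf children both observing A (index 0)
pi = np.full(4, 0.25)
P = {('r', 'x'): jc_P(0.2), ('r', 'y'): jc_P(0.2)}
best_root = joint_root('r', {'r': ('x', 'y')}, {'x': 0, 'y': 0}, P, pi)
```

Note that the joint and marginal reconstructions can disagree on larger trees, since the best single labelling need not match the per-vertex posterior maxima.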