Maxmum lkelhood Fredrk Ronqust September 28, 2005 Introducton Now that we have explored a number of evolutonary models, rangng from smple to complex, let us examne how we can use them n statstcal nference. The standard approach s to use maxmum lkelhood, whch wll be covered n ths lecture. Assume that we have some data D and a model M of the process that generated the data. The model has some parameters θ, the specfc value of whch we do not know but wsh to estmate. If the model s properly constructed, we wll be able to calculate the probablty of t generatng the observed data gven a specfc set of parameter values, P (D θ, M). Often, the condtonng on the model s suppressed n the notaton, n whch case the probablty would smply be wrtten as P (D θ). Ths probablty s often referred to as the lkelhood of the parameter values. In maxmum lkelhood nference, we smply estmate the unknowns n θ by fndng the values wth the maxmum lkelhood or, more precsely, the hghest probablty of generatng the observed data. 2 One-parameter models The maxmzaton process s typcally straght-forward when there s only one parameter n the model. Say, for nstance, that we are nterested n estmatng the probablty of obtanng heads when tossng a con. A reasonable model s that each toss s dentcal and ndependent and has a heads probablty of p. Hence, the probablty of tals would be p, and the probablty of a partcular outcome would follow a bnomal dstrbuton. For nstance, the probablty of the
Computatonal Evolutonary Bology sequence HHTTHTHHTTT would be L = P (D p) = pp( p)( p)p( p)pp( p)( p)( p) = p 5 ( p) 6 As you can see, t s suffcent to know the number of heads and tals to calculate the probablty. If we graph the lkelhood we can smply fnd ts maxmum, whch s at p = 5/ = 0.454545... (Fg. ); the estmate of p s often denoted ˆp. We can also calculate ˆp analytcally by takng the dervatve of L wth respect to p: dl dp = 5p4 ( p) 6 6p 5 ( p) 5 and fndng where ths dervatve s zero. That s easly done by factorng out p 4 and ( p) 5 and concentratng on the remanng expresson 5( p) 6p, whch wll gve us the only relevant root p = 5/. Fgure : Lkelhood curve for the con tossng example. Note that the lkelhood does not ntegrate to ; t s not a probablty dstrbuton. The maxmum lkelhood estmate for p s found by locatng the peak of the curve. It s often easer to maxmze the logarthm of the lkelhood than the lkelhood tself. In ths case, 2
we would get: ln L = 5 ln p + 6 ln( p) Computatonal Evolutonary Bology and the dervatve d(ln L) dp If we set the dervatve to zero, we would get = 5 p 6 p 5 p 6 p = 0 5( p) 6p = 0 5 p = 0 p = 5 whch, not surprsngly, s the same estmate of p. Generally n ths type of model, the maxmum lkelhood estmate turns out to be the fracton of tosses that are heads, an ntutvely appealng estmate of the probablty of obtanng heads. Mult-parameter models When the model has more than one parameter t becomes much more dffcult to maxmze lkelhood. A mult-parameter model typcally has one parameter or a small set of parameters that we are nterested n and a number of other parameters that are ntroduced n the model only to make the latter more realstc. In lkelhood nference we often refer to the former parameter(s) as structural parameter(s) and the latter as nusance parameters. Assume that we have a parameter vector θ = {θ, θ 2 }, and that we are nterested n nferrng the value of the frst parameter (θ, the structural parameter) but not the second (θ 2, the nusance parameter). We can take two dfferent approaches. One s to calculate the profle lkelhood L p (θ ) = max θ 2 L(θ, θ 2 ) whch smply fnds the lkelhood of each value of θ by maxmzng over the possble values of θ 2. The alternatve approach s to use ntegrated lkelhood, whch s the same thng as margnal lkelhood, although the latter term s more commonly used n Bayesan nference. Wth ths method,
Computatonal Evolutonary Bology the lkelhood for the structural parameter s obtaned by summng or ntegratng out the nusance parameter: L (θ ) = L(θ, θ 2 ) θ 2 for a dscrete nusance parameter and L (θ ) = L(θ, θ 2 ) dθ 2 θ 2 for a contnuous nusance parameter. It s easy to understand the dfference between the two approaches by referrng to a smple lkelhood landscape for two parameters (Fg. 2). The base of the landscape s determned by the two parameters n the model and the heght s the lkelhood for each combnaton of parameter values. The profle lkelhood method would obtan the lkelhood curve for the parameter of nterest by fndng the profle of the landscape n the relevant plane, that s, ts maxmum heght for each value of the structural parameter. The lkelhood curve obtaned wth the ntegrated lkelhood method would nstead measure the average heght of the landscape for each value of the structural parameter. Generally speakng, there are good reasons to mnmze the number of parameters that we are tryng to maxmze durng a maxmum lkelhood analyss. Therefore, ntegraton or summaton s often favored where t s possble. An nterestng property of profle lkelhoods s that they are scale-ndependent. That s, regardless of the parametrzaton we use for the nusance parameter(s), the maxmum of the profle lkelhood for the structural parameter wll be the same. However, ths s not the case for ntegrated lkelhoods. The result of ntegraton wll be dependent on how much weght we put on dfferent parts of the nusance parameter space. Hard-core lkelhoodsts wll only use ntegrated lkelhoods when there s one parametrzaton that s so obvous that people forget there are other possbltes, and they rarely pont out that there s an assumed pror. In prncple, however, ntegrated lkelhoods always requre the specfcaton of prors. When there s an obvous parametrzaton, we are mplctly assumng a unform pror on the chosen parameter space. 4 Consstency and effcency Generally speakng, lkelhood methods have the propertes of consstency and effcency. Consstency means that the maxmum lkelhood estmate converges to the correct value as data accumu- 4
Computatonal Evolutonary Bology Fgure 2: A lkelhood landscape for a model wth two parameters. We are nterested n the lkelhood curve for the parameter along the x axs. It can be determned usng ether the profle lkelhood or the ntegrated lkelhood method. lates. Effcency means that the varance of the estmate around the true parameter value s small and decreases rapdly wth ncreasng amounts of data. However, there are stuatons when lkelhood s msbehaved. A typcal stuaton s when the number of parameters that we are tryng to estmate grows wth the addton of new data; f the amount of data per parameter we are estmatng s not growng, then there wll always be some resdual varance of the estmate that wll never dsappear. Other nstances occur when the model s ncorrect. For nstance, phylogenetc nference under over-smplfed models s well known to suffer from the problem of long branch attracton. In such cases we wll tend to nfer ancestral states ncorrectly, especally across long branches, leadng to a tendency to place these ncorrectly n the tree. The phenomenon s more pronounced for parsmony methods but does affect lkelhood methods as well. 5
Computatonal Evolutonary Bology 5 Dervng transton probabltes To calculate the lkelhood of a phylogenetc model ncludng a tree and a substtuton model, we need frst to derve the transton probabltes for our substtuton model. We have seen before that, f we have the nstantaneous rate matrx Q, we can obtan the transton probabltes P (t) over some tme perod t by exponentatng the Q matrx: P = e Qt As we have dscussed earler, we can typcally not estmate the absolute rate of change but only the amount of change (v), whch s the product of rate and tme. Therefore, we typcally measure the branch length n phylogenetc trees n terms of the amount of change. In partcular, we want a branch length unt to correspond to one expected change per ste at statonarty. Ths requres a very specfc scalng of the Q matrx, namely that π q = For llustraton, we can scale the rate matrx for the Jukes Cantor model. The unscaled verson would be Q = or somethng smlar. If we scale t wth the factor µ we get µ µ µ µ µ µ µ µ Q = µ µ µ µ µ µ µ µ and fnally we solve for the value of the scalng factor by usng the scalng condton above. Specfcally, snce the statonary state frequency of all nucleotdes s the same (/4) n the Jukes Cantor 6
Computatonal Evolutonary Bology model, we get 4 π ( µ) = =0 4 µ 4 = µ = µ = If we now nsert ths n the Q matrx, we get the scaled verson Q = The prncple s the same for more complex rate matrces even though the analytcal expresson for the scaler wll be more complcated. However, t s easy enough to devse an algorthm to determne ts value numercally. To exponentate the scaled Q matrx, we would usually fnd ts egenvalues and egenvectors, decomposng t nto the product: Q = SΛS where S s a matrx whose columns are the rght egenvectors of Q and Λ s a dagonal matrx contanng the egenvalues (λ ). The exponental of Q can now be found by smply exponentatng each of the elements n Λ separately. For smple nstantaneous rate matrces, the egenvalues and egenvectors are known n closed form so that we can express the elements of the transton probablty matrx exactly. For nstance, the transton probablty matrx for the Jukes Cantor model s 4 + 4 e 4v/ 4 4 e 4v/ 4 4 e 4v/ 4 4 e 4v/ 4 P = 4 e 4v/ 4 + 4 e 4v/ 4 4 e 4v/ 4 4 e 4v/ 4 4 e 4v/ 4 4 e 4v/ 4 + 4 e 4v/ 4 4 e 4v/ 4 4 e 4v/ 4 4 e 4v/ 4 4 e 4v/ 4 + 4 e 4v/ The transton probablty matrx s known n closed form for many of the other smple substtuton models but not for the more complex ones. In the latter cases t s common to use routnes 7
Computatonal Evolutonary Bology for numercally determnng the egenvalues and egenvectors and then usng those n dervng the transton probabltes. However, the egenvalue decomposton methods can occasonally be numercally naccurate, n whch case Golub and van Loan (996) recommend usng the Padé approxmaton. 6 Calculatng lkelhoods on trees Calculatng lkelhoods on trees s relatvely straghtforward. We assume for now that the tree and the branch lengths are gven, as well as the values of all other parameters n our phylogenetc model. Typcally, we are nterested n calculatng the total probablty of the tree by summng over all possble ancestral state assgnments. The summaton over ancestral states s an example of the use of ntegrated lkelhood to elmnate a nusance parameter; ths partcular parameter s mportant to elmnate because f we tred to estmate the ancestral states (maxmze over the possble assgnments), we would run nto the problem wth the growng number of parameters that was mentoned above. The calculaton proceeds from the tp down to the root of the tree n a downpass that s very smlar to the Sankoff downpass, except that we replace addton wth multplcaton and maxmzaton wth summaton. The ntalzaton s also slghtly dfferent. At each node n the tree we calculate the condtonal probablty of the tree gven each of the possble state assgnments at that node. As you probably recognze, ths s an example of a dynamc programmng algorthm. We descrbe the algorthm here for a four-state DNA character, and we assume that the P matrx has been calculated already for all branches n the tree. We assgn to each node p a set G p = {g (p) A, g(p) C, g(p) G, g(p) T } contanng the condtonal probablty of the subtree rooted at p gven each of the possble state assgnments at that node. We use another smlar set H p, whch wll gve the condtonal probablty of assgnng each state to the ancestral end of the branch havng p as ts descendant. The elements of H p wll be derved from G p and the elements of P p, the transton probabltes for the branch endng n the node p. Intalzaton s performed by gong through all tps and settng g (p) = f the state has been observed at tp p and g (p) = 0 otherwse. As usual, the downpass algorthm s formulated for a node p and ts two descendant nodes q and r (Algorthm ). At the root node ρ, we obtan the total probablty of the tree by summng over all possble states, weghtng each assgnment by ts statonary state frequency. Thus L = π g (ρ) 8
Computatonal Evolutonary Bology Once we have the lkelhood for each ste we obtan the total lkelhood for the entre data matrx by smply multplyng over all characters: L = P (D θ) = L The entre procedure s llustrated n Fgure. Algorthm Lkelhood downpass algorthm for all do h (q) 0 for all j do h (q) end for end for h (q) for all do h (r) 0 for all j do h (r) end for end for for all do g (p) end for h (r) h (q) h (r) + p (q) j g(q) j + p (r) j g(r) j 6. Study Questons. What s the dfference between a profle lkelhood and an ntegrated lkelhood? 2. Why does ntegrated lkelhoods requre a pror, at least mplctly?. Are profle lkelhoods scale ndependent? Are ntegrated lkelhoods scale ndependent? 4. Why do we need to scale nstantaneous rate matrces? How s t done? 5. Formulate the forward algorthm of an HMM by modfyng the downpass algorthm for calculatng tree probabltes. Tp: assume that there s only one descendant of each node n the tree and set G p = H q. 9
Computatonal Evolutonary Bology Fgure : A worked example for the calculaton of the probablty of a tree. We start wth transton probablty matrces for each of the branches n the tree (a) and then assgn condtonal probabltes to the tps of the tree (b). For each node n the tree n a postorder traversal, we then calculate the condtonal probabltes of the dfferent ancestral state assgnments at the bottom end of each of the descendant branches. We then multply these costs to get the costs for the node of nterest (d). Fnally, at the root node, we fnd the total probablty of the tree by calculatng the weghted sum over all possble state assgnments at the root, usng the statonary state frequences as the weghts (e). 0