Maximum likelihood. Fredrik Ronquist. September 28, 2005

1 Introduction

Now that we have explored a number of evolutionary models, ranging from simple to complex, let us examine how we can use them in statistical inference. The standard approach is maximum likelihood, which will be covered in this lecture.

Assume that we have some data D and a model M of the process that generated the data. The model has some parameters θ, the specific values of which we do not know but wish to estimate. If the model is properly constructed, we will be able to calculate the probability of it generating the observed data given a specific set of parameter values, P(D | θ, M). Often, the conditioning on the model is suppressed in the notation, in which case the probability is simply written P(D | θ). This probability is often referred to as the likelihood of the parameter values. In maximum likelihood inference, we estimate the unknowns in θ by finding the values with the maximum likelihood or, more precisely, the highest probability of generating the observed data.

2 One-parameter models

The maximization is typically straightforward when there is only one parameter in the model. Say, for instance, that we are interested in estimating the probability of obtaining heads when tossing a coin. A reasonable model is that each toss is identical and independent, with a heads probability of p. Hence, the probability of tails is 1 - p, and the probability of a particular outcome follows a binomial distribution.

For instance, the probability of the sequence HHTTHTHHTTT would be

L = P(D | p) = p p (1-p)(1-p) p (1-p) p p (1-p)(1-p)(1-p) = p^5 (1-p)^6

As you can see, it is sufficient to know the number of heads and tails to calculate the probability. If we graph the likelihood, we can simply find its maximum, which is at p = 5/11 = 0.454545... (Fig. 1); the estimate of p is often denoted p̂. We can also find p̂ analytically by taking the derivative of L with respect to p,

dL/dp = 5 p^4 (1-p)^6 - 6 p^5 (1-p)^5

and finding where this derivative is zero. That is easily done by factoring out p^4 and (1-p)^5 and concentrating on the remaining expression 5(1-p) - 6p, which gives us the only relevant root, p = 5/11.

Figure 1: Likelihood curve for the coin-tossing example. Note that the likelihood does not integrate to 1; it is not a probability distribution. The maximum likelihood estimate of p is found by locating the peak of the curve.
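As a quick numerical illustration of Figure 1 (this snippet is mine, not part of the original notes), we can tabulate L(p) = p^5 (1-p)^6 on a grid and locate its peak:

```python
import numpy as np

# Tabulate the coin-tossing likelihood L(p) = p^5 * (1 - p)^6 on a fine grid
# and locate its peak; this reproduces the maximum shown in Figure 1.
p = np.linspace(0.0, 1.0, 100001)
L = p**5 * (1.0 - p)**6

p_hat = p[np.argmax(L)]
print(p_hat)        # approximately 0.4545
print(5.0 / 11.0)   # the analytical value, 5/11
```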

It is often easier to maximize the logarithm of the likelihood than the likelihood itself. In this case, we would get

ln L = 5 ln p + 6 ln(1 - p)

and the derivative

d(ln L)/dp = 5/p - 6/(1 - p)

If we set the derivative to zero, we get

5/p - 6/(1 - p) = 0
5(1 - p) - 6p = 0
5 - 11p = 0
p = 5/11

which, not surprisingly, is the same estimate of p. Generally, in this type of model, the maximum likelihood estimate turns out to be the fraction of tosses that are heads, an intuitively appealing estimate of the probability of obtaining heads.
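The same estimate can be recovered by numerical optimization of the log-likelihood. A minimal sketch, assuming SciPy is available; the function name, bounds, and the head/tail counts are taken from the example above rather than from the notes:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(p, heads=5, tails=6):
    # ln L = heads * ln(p) + tails * ln(1 - p); we minimize its negative
    return -(heads * np.log(p) + tails * np.log(1.0 - p))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-9, 1.0 - 1e-9), method="bounded")
print(res.x)                 # approximately 0.4545
print(5.0 / (5.0 + 6.0))     # the heads fraction, 5/11
```

Here the analytical solution (the heads fraction) makes the optimizer unnecessary, but the same pattern carries over to models where no closed form exists.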

3 Multi-parameter models

When the model has more than one parameter, it becomes much more difficult to maximize the likelihood. A multi-parameter model typically has one parameter, or a small set of parameters, that we are interested in, and a number of other parameters that are introduced only to make the model more realistic. In likelihood inference we often refer to the former as structural parameter(s) and the latter as nuisance parameters.

Assume that we have a parameter vector θ = {θ_1, θ_2}, and that we are interested in inferring the value of the first parameter (θ_1, the structural parameter) but not the second (θ_2, the nuisance parameter). We can take two different approaches. One is to calculate the profile likelihood

L_P(θ_1) = max_{θ_2} L(θ_1, θ_2)

which simply finds the likelihood of each value of θ_1 by maximizing over the possible values of θ_2. The alternative approach is to use the integrated likelihood, which is the same thing as the marginal likelihood, although the latter term is more commonly used in Bayesian inference. With this method, the likelihood for the structural parameter is obtained by summing or integrating out the nuisance parameter:

L_I(θ_1) = Σ_{θ_2} L(θ_1, θ_2)

for a discrete nuisance parameter, and

L_I(θ_1) = ∫ L(θ_1, θ_2) dθ_2

for a continuous nuisance parameter.

It is easy to understand the difference between the two approaches by referring to a simple likelihood landscape for two parameters (Fig. 2). The base of the landscape is determined by the two parameters in the model and the height is the likelihood for each combination of parameter values. The profile likelihood method obtains the likelihood curve for the parameter of interest by finding the profile of the landscape in the relevant plane, that is, its maximum height for each value of the structural parameter. The likelihood curve obtained with the integrated likelihood method instead measures the average height of the landscape for each value of the structural parameter.

Generally speaking, there are good reasons to minimize the number of parameters that we are trying to maximize over during a maximum likelihood analysis. Therefore, integration or summation is often favored where it is possible.

An interesting property of profile likelihoods is that they are scale-independent. That is, regardless of the parametrization we use for the nuisance parameter(s), the maximum of the profile likelihood for the structural parameter will be the same. This is not the case for integrated likelihoods: the result of integration depends on how much weight we put on different parts of the nuisance parameter space. Hard-core likelihoodists will only use integrated likelihoods when there is one parametrization that is so obvious that people forget there are other possibilities, and they rarely point out that there is an assumed prior. In principle, however, integrated likelihoods always require the specification of priors. When there is an obvious parametrization, we are implicitly assuming a uniform prior on the chosen parameter space.
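To make the distinction concrete, here is a toy sketch of my own (not from the notes): normal data with the mean as the structural parameter and the standard deviation as the nuisance parameter. The likelihood landscape is tabulated on a grid and then collapsed onto the structural parameter either by maximizing (profile) or by summing with equal weights, i.e. an implicit uniform prior (integrated):

```python
import numpy as np

# Simulated data from a normal model; theta1 = mean (structural),
# theta2 = standard deviation (nuisance).
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=30)

theta1 = np.linspace(0.0, 4.0, 201)   # structural parameter grid
theta2 = np.linspace(0.5, 4.0, 201)   # nuisance parameter grid

def log_likelihood(mean, sd):
    return np.sum(-0.5 * np.log(2 * np.pi * sd**2) - (x - mean) ** 2 / (2 * sd**2))

# The likelihood "landscape" of Fig. 2, rescaled to avoid numerical underflow
logL = np.array([[log_likelihood(m, s) for s in theta2] for m in theta1])
L = np.exp(logL - logL.max())

profile = L.max(axis=1)                               # profile likelihood L_P(theta1)
integrated = L.sum(axis=1) * (theta2[1] - theta2[0])  # integrated, uniform prior on theta2

print(theta1[np.argmax(profile)], theta1[np.argmax(integrated)])
```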

Figure 2: A likelihood landscape for a model with two parameters. We are interested in the likelihood curve for the parameter along the x axis. It can be determined using either the profile likelihood or the integrated likelihood method.

4 Consistency and efficiency

Generally speaking, likelihood methods have the properties of consistency and efficiency. Consistency means that the maximum likelihood estimate converges to the correct value as data accumulate. Efficiency means that the variance of the estimate around the true parameter value is small and decreases rapidly with increasing amounts of data.

However, there are situations in which likelihood misbehaves. A typical situation is when the number of parameters that we are trying to estimate grows with the addition of new data; if the amount of data per estimated parameter is not growing, there will always be some residual variance of the estimate that never disappears. Other instances occur when the model is incorrect. For instance, phylogenetic inference under over-simplified models is well known to suffer from the problem of long-branch attraction. In such cases we tend to infer ancestral states incorrectly, especially across long branches, leading to a tendency to place those branches incorrectly in the tree. The phenomenon is more pronounced for parsimony methods but affects likelihood methods as well.
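A small simulation can illustrate both properties for the coin-tossing model. This sketch is my own (with an arbitrary true value of p); it draws replicate datasets of increasing size and reports the mean and variance of the maximum likelihood estimates:

```python
import numpy as np

# Consistency and efficiency in the coin-tossing model: the ML estimate is the
# heads fraction; its mean approaches the true p and its variance shrinks as
# the number of tosses grows.
rng = np.random.default_rng(42)
p_true = 0.3

for n in (10, 100, 1000, 10000):
    estimates = rng.binomial(n, p_true, size=2000) / n   # 2000 replicate datasets
    print(n, round(estimates.mean(), 4), round(estimates.var(), 6))
```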

5 Deriving transition probabilities

To calculate the likelihood of a phylogenetic model, including a tree and a substitution model, we first need to derive the transition probabilities for our substitution model. We have seen before that, if we have the instantaneous rate matrix Q, we can obtain the transition probabilities P(t) over some time period t by exponentiating the Q matrix:

P(t) = e^(Qt)

As we have discussed earlier, we can typically not estimate the absolute rate of change but only the amount of change (v), which is the product of rate and time. Therefore, we typically measure branch lengths in phylogenetic trees in terms of the amount of change. In particular, we want a branch-length unit to correspond to one expected change per site at stationarity. This requires a very specific scaling of the Q matrix, namely that

-Σ_i π_i q_ii = 1

For illustration, we can scale the rate matrix for the Jukes-Cantor model. The unscaled version would be

    | -3   1   1   1 |
Q = |  1  -3   1   1 |
    |  1   1  -3   1 |
    |  1   1   1  -3 |

or something similar. If we scale it with the factor µ, we get

    | -3µ   µ    µ    µ  |
Q = |  µ   -3µ   µ    µ  |
    |  µ    µ   -3µ   µ  |
    |  µ    µ    µ   -3µ |

and finally we solve for the value of the scaling factor by using the scaling condition above.

Specifically, since the stationary state frequency of all nucleotides is the same (1/4) in the Jukes-Cantor model, we get

-Σ_{i=1}^{4} π_i (-3µ) = 1
       4 · (1/4) · 3µ = 1
                   3µ = 1
                    µ = 1/3

If we now insert this in the Q matrix, we get the scaled version

    | -1    1/3   1/3   1/3 |
Q = | 1/3   -1    1/3   1/3 |
    | 1/3   1/3   -1    1/3 |
    | 1/3   1/3   1/3   -1  |

The principle is the same for more complex rate matrices, even though the analytical expression for the scaler will be more complicated. However, it is easy enough to devise an algorithm to determine its value numerically.

To exponentiate the scaled Q matrix, we would usually find its eigenvalues and eigenvectors, decomposing it into the product

Q = S Λ S^(-1)

where S is a matrix whose columns are the right eigenvectors of Q and Λ is a diagonal matrix containing the eigenvalues (λ_i). The exponential of Q can now be found by simply exponentiating each of the elements in Λ separately, so that P(v) = S e^(Λv) S^(-1).

For simple instantaneous rate matrices, the eigenvalues and eigenvectors are known in closed form, so that we can express the elements of the transition probability matrix exactly. For instance, the transition probability matrix for the Jukes-Cantor model is

    | 1/4 + (3/4)e^(-4v/3)   1/4 - (1/4)e^(-4v/3)   1/4 - (1/4)e^(-4v/3)   1/4 - (1/4)e^(-4v/3) |
P = | 1/4 - (1/4)e^(-4v/3)   1/4 + (3/4)e^(-4v/3)   1/4 - (1/4)e^(-4v/3)   1/4 - (1/4)e^(-4v/3) |
    | 1/4 - (1/4)e^(-4v/3)   1/4 - (1/4)e^(-4v/3)   1/4 + (3/4)e^(-4v/3)   1/4 - (1/4)e^(-4v/3) |
    | 1/4 - (1/4)e^(-4v/3)   1/4 - (1/4)e^(-4v/3)   1/4 - (1/4)e^(-4v/3)   1/4 + (3/4)e^(-4v/3) |

The transition probability matrix is known in closed form for many of the other simple substitution models, but not for the more complex ones.
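The steps above are easy to carry out numerically for an arbitrary rate matrix. The sketch below is my own code (names and the test branch length are assumptions, not from the notes); it scales a rate matrix so that -Σ_i π_i q_ii = 1, exponentiates it by eigendecomposition, and checks the result against the closed-form Jukes-Cantor expression:

```python
import numpy as np

# Unscaled Jukes-Cantor rate matrix and its stationary frequencies
Q_unscaled = np.array([
    [-3.0,  1.0,  1.0,  1.0],
    [ 1.0, -3.0,  1.0,  1.0],
    [ 1.0,  1.0, -3.0,  1.0],
    [ 1.0,  1.0,  1.0, -3.0],
])
pi = np.array([0.25, 0.25, 0.25, 0.25])

# Scaling factor from the condition -sum_i pi_i q_ii = 1 (mu = 1/3 for Jukes-Cantor)
mu = -1.0 / np.sum(pi * np.diag(Q_unscaled))
Q = mu * Q_unscaled

def transition_probs(Q, v):
    # P(v) = S exp(Lambda v) S^(-1)
    lam, S = np.linalg.eig(Q)
    return S @ np.diag(np.exp(lam * v)) @ np.linalg.inv(S)

v = 0.1
P = transition_probs(Q, v)

# Compare with the closed-form Jukes-Cantor transition probabilities
same = 0.25 + 0.75 * np.exp(-4.0 * v / 3.0)
diff = 0.25 - 0.25 * np.exp(-4.0 * v / 3.0)
print(np.allclose(P, np.where(np.eye(4, dtype=bool), same, diff)))   # True
```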

For these more complex models, it is common to use routines for numerically determining the eigenvalues and eigenvectors and then to use those in deriving the transition probabilities. However, the eigenvalue decomposition methods can occasionally be numerically inaccurate, in which case Golub and van Loan (1996) recommend using the Padé approximation.

6 Calculating likelihoods on trees

Calculating likelihoods on trees is relatively straightforward. We assume for now that the tree and the branch lengths are given, as well as the values of all other parameters in our phylogenetic model. Typically, we are interested in calculating the total probability of the tree by summing over all possible ancestral state assignments. The summation over ancestral states is an example of the use of integrated likelihood to eliminate a nuisance parameter; this particular parameter is important to eliminate because, if we tried to estimate the ancestral states (maximize over the possible assignments), we would run into the problem with the growing number of parameters that was mentioned above.

The calculation proceeds from the tips down to the root of the tree in a downpass that is very similar to the Sankoff downpass, except that we replace addition with multiplication and maximization with summation. The initialization is also slightly different. At each node in the tree we calculate the conditional probability of the tree given each of the possible state assignments at that node. As you probably recognize, this is an example of a dynamic programming algorithm.

We describe the algorithm here for a four-state DNA character, and we assume that the P matrix has already been calculated for all branches in the tree. We assign to each node p a set G_p = {g_A^(p), g_C^(p), g_G^(p), g_T^(p)} containing the conditional probability of the subtree rooted at p given each of the possible state assignments at that node. We use another similar set, H_p, which gives the conditional probability of assigning each state to the ancestral end of the branch having p as its descendant. The elements of H_p are derived from G_p and the elements of P_p, the transition probabilities for the branch ending in node p. Initialization is performed by going through all tips and setting g_i^(p) = 1 if state i has been observed at tip p and g_i^(p) = 0 otherwise. As usual, the downpass algorithm is formulated for a node p and its two descendant nodes q and r (Algorithm 1).

Algorithm 1 Likelihood downpass algorithm
for all i do
    h_i^(q) ← 0
    for all j do
        h_i^(q) ← h_i^(q) + p_ij^(q) g_j^(q)
    end for
end for
for all i do
    h_i^(r) ← 0
    for all j do
        h_i^(r) ← h_i^(r) + p_ij^(r) g_j^(r)
    end for
end for
for all i do
    g_i^(p) ← h_i^(q) h_i^(r)
end for

At the root node ρ, we obtain the total probability of the tree by summing over all possible states, weighting each assignment by its stationary state frequency. Thus

L = Σ_i π_i g_i^(ρ)
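For concreteness, here is a minimal Python sketch of the downpass for a single site on a rooted binary tree. The tree encoding, helper names, and the three-taxon example are my own; only the recursion of Algorithm 1 and the weighted sum at the root come from the notes:

```python
import numpy as np

STATES = "ACGT"

def P_jc(v):
    # Jukes-Cantor transition probabilities for a branch of length v
    same = 0.25 + 0.75 * np.exp(-4.0 * v / 3.0)
    diff = 0.25 - 0.25 * np.exp(-4.0 * v / 3.0)
    return np.where(np.eye(4, dtype=bool), same, diff)

def tip_conditionals(state):
    # g_i = 1 for the observed state, 0 otherwise
    g = np.zeros(4)
    g[STATES.index(state)] = 1.0
    return g

def downpass(node, tips):
    # Returns G_p for the subtree rooted at 'node'. A tip is a taxon name;
    # an internal node is a pair of (child, branch length) tuples.
    if isinstance(node, str):
        return tip_conditionals(tips[node])
    (left, v_left), (right, v_right) = node
    h_left = P_jc(v_left) @ downpass(left, tips)      # h_i = sum_j p_ij g_j
    h_right = P_jc(v_right) @ downpass(right, tips)
    return h_left * h_right                           # g_i^(p) = h_i^(q) h_i^(r)

# Three-taxon example: ((taxon1:0.1, taxon2:0.2):0.05, taxon3:0.3)
tips = {"taxon1": "A", "taxon2": "A", "taxon3": "C"}
root = (((("taxon1", 0.1), ("taxon2", 0.2)), 0.05), ("taxon3", 0.3))

pi = np.full(4, 0.25)                                 # stationary state frequencies
site_likelihood = np.sum(pi * downpass(root, tips))   # L = sum_i pi_i g_i^(rho)
print(site_likelihood)
```

Calling downpass once per alignment column gives the per-site likelihoods L_i used below.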

Once we have the likelihood for each site, we obtain the total likelihood for the entire data matrix by simply multiplying over all characters:

L = P(D | θ) = Π_i L_i

The entire procedure is illustrated in Figure 3.

6.1 Study Questions

1. What is the difference between a profile likelihood and an integrated likelihood?

2. Why do integrated likelihoods require a prior, at least implicitly?

3. Are profile likelihoods scale independent? Are integrated likelihoods scale independent?

4. Why do we need to scale instantaneous rate matrices? How is it done?

5. Formulate the forward algorithm of an HMM by modifying the downpass algorithm for calculating tree probabilities. Tip: assume that there is only one descendant of each node in the tree and set G_p = H_q.

Figure 3: A worked example of the calculation of the probability of a tree. We start with transition probability matrices for each of the branches in the tree (a) and then assign conditional probabilities to the tips of the tree (b). For each node in the tree, in a postorder traversal, we then calculate the conditional probabilities of the different ancestral state assignments at the bottom end of each of the descendant branches (c). We then multiply these to get the conditional probabilities for the node of interest (d). Finally, at the root node, we find the total probability of the tree by calculating the weighted sum over all possible state assignments at the root, using the stationary state frequencies as the weights (e).