Chapter 4

The Expectation-Maximisation Algorithm

4.1 The EM algorithm - a method for maximising the likelihood

Let us suppose that we observe Y = {Y_i}_{i=1}^{n}. The joint density of Y is f(Y; θ_0), where θ_0 is an unknown parameter. Our objective is to estimate θ_0. The log-likelihood of Y is

L_n(Y; θ) = log f(Y; θ).

Observe that we have not specified that {Y_i} are iid random variables. This is because the procedure we describe below is extremely general and the observations need be neither independent nor identically distributed (indeed a very interesting extension of this procedure is to time series with missing data, first proposed in Shumway and Stoffer (1982) and Engle and Watson (1982)). Our objective is to estimate θ_0 in the situation where either evaluating the log-likelihood L_n or maximising L_n is difficult. Hence an alternative means of maximising L_n is required. Often there may exist unobserved data U = {U_i}_{i=1}^{m}, for which the likelihood of (Y, U) can be easily evaluated. It is through these unobserved data that we find an alternative method for maximising L_n.

Example 4.1.1 Let us suppose that {T_i}_{i=1}^{n+m} are iid survival times with density f(x; θ_0). Some of these times are censored and we observe {Y_i}_{i=1}^{n+m}, where Y_i = min(T_i, c).

To simplify notation we will suppose that Y_i = T_i for 1 ≤ i ≤ n; hence the survival time is observed for 1 ≤ i ≤ n, but Y_i = c for n+1 ≤ i ≤ n+m. Using the standard results for censored data, the log-likelihood of Y is

L_n(Y; θ) = Σ_{i=1}^{n} log f(Y_i; θ) + Σ_{i=n+1}^{n+m} log F(Y_i; θ),

where F(y; θ) = P(T_i > y; θ) denotes the survival function. The observations {Y_i}_{i=n+1}^{n+m} can be treated as if they were missing. Define the complete observations U = {T_i}_{i=n+1}^{n+m}; hence U contains the unobserved survival times. Then the log-likelihood of (Y, U) is

L_n(Y, U; θ) = Σ_{i=1}^{n+m} log f(T_i; θ).

Usually it is a lot easier to maximise L_n(Y, U) than L_n(Y).

We now formally describe the EM-algorithm. As mentioned in the discussion above, it is easier to deal with the joint likelihood of (Y, U) than with the likelihood of Y itself, hence let us consider this likelihood in detail. Let us suppose that the joint log-likelihood of (Y, U) is L_n(Y, U; θ) = log f(Y, U; θ). This likelihood is often called the complete likelihood; we will assume that, if U were known, this likelihood would be easy to obtain and differentiate. We will also assume that the density f(u|Y; θ) is known and easy to evaluate. By using Bayes theorem it is straightforward to show that

log f(Y, U; θ) = log f(Y; θ) + log f(U|Y; θ),   (4.1)

that is, L_n(Y, U; θ) = L_n(Y; θ) + log f(U|Y; θ). Of course, in reality log f(Y, U; θ) is unknown, because U is unobserved. However, let us consider the expected value of log f(Y, U; θ) given what we observe, Y. That is

Q(θ_0, θ) = E[ log f(Y, U; θ) | Y, θ_0 ] = ∫ log f(Y, u; θ) f(u|Y, θ_0) du,   (4.2)

where f(u|Y, θ_0) is the conditional density of U given Y and the unknown parameter θ_0. Hence if f(u|Y, θ_0) were known, then Q(θ_0, θ) could be evaluated.

Remark 4.1.1 It is worth noting that Q(θ_0, θ) = E[ log f(Y, U; θ) | Y, θ_0 ] can be viewed as the best predictor of the complete likelihood (involving both the observed and unobserved data (Y, U)) given what is observed, Y. We recall that the conditional expectation is the best predictor of U in terms of mean squared error, that is, it is the function of Y which minimises the mean squared error: E(U|Y) = arg min_g E(U - g(Y))^2.
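
To make (4.2) concrete, here is a small numerical sketch for Example 4.1.1, assuming (purely for illustration) that f is the exponential density f(t; θ) = θ exp(-θt), so that the conditional density of a censored survival time given T > c under θ_0 is θ_0 exp(-θ_0(u - c)) for u > c. The function names log_f and Q below are illustrative, not part of any library.

import numpy as np
from scipy.integrate import quad

# Illustrative assumption: exponential density f(t; theta) = theta * exp(-theta * t).
def log_f(t, theta):
    return np.log(theta) - theta * t

def Q(theta0, theta, y_obs, m, c):
    # f(u | Y, theta0) for a censored time: theta0 * exp(-theta0 * (u - c)), u > c
    cond_dens = lambda u: theta0 * np.exp(-theta0 * (u - c))
    # expected complete log-likelihood contribution of one censored observation
    e_cens, _ = quad(lambda u: log_f(u, theta) * cond_dens(u), c, np.inf)
    return np.sum(log_f(y_obs, theta)) + m * e_cens

rng = np.random.default_rng(0)
y_obs = rng.exponential(scale=2.0, size=50)   # uncensored observations
print(Q(theta0=0.4, theta=0.6, y_obs=y_obs, m=20, c=3.0))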

The EM algorithm is based on iterating Q(·) in such a way that at each step we obtain a θ which gives a larger value of Q(·) (and, as we show later, a larger L_n(Y; θ)). We describe the EM-algorithm below.

The EM-algorithm:

(i) Define an initial value θ_1 ∈ Θ. Let θ* = θ_1.

(ii) The expectation step (the (k+1)-th step): for a fixed θ*, evaluate

Q(θ*, θ) = E[ log f(Y, U; θ) | Y, θ* ] = ∫ log f(Y, u; θ) f(u|Y, θ*) du

for all θ ∈ Θ.

(iii) The maximisation step: evaluate θ_{k+1} = arg max_{θ∈Θ} Q(θ*, θ). We note that the maximisation can be done by finding the solution of

E[ ∂ log f(Y, U; θ)/∂θ | Y, θ* ] = 0.

(iv) If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Else set θ* = θ_{k+1} and go back and repeat steps (ii) and (iii) again.

We use θ̂_n as an estimator of θ_0. To understand why this iteration is connected to maximising L_n(Y; θ) and, under certain conditions, gives a good estimator of θ_0 (in the sense that θ̂_n is close to the parameter which maximises L_n), let us return to (4.1). Taking the expectation of log f(Y, U; θ), conditioned on Y, we have

Q(θ*, θ) = E[ log f(Y, U; θ) | Y, θ* ] = log f(Y; θ) + E[ log f(U|Y; θ) | Y, θ* ]

(observe that this is like (4.2), but the distribution used in the expectation is f(u|Y, θ*) instead of f(u|Y, θ_0)). Define

D(θ*, θ) = E[ log f(U|Y; θ) | Y, θ* ] = ∫ log f(u|Y; θ) f(u|Y, θ*) du.

Hence we have

Q(θ*, θ) = L_n(θ) + D(θ*, θ).   (4.3)

Now we recall that at the (k+1)-th step of the EM-algorithm, θ_{k+1} maximises Q(θ_k, θ) over all θ ∈ Θ, hence Q(θ_k, θ_{k+1}) ≥ Q(θ_k, θ_k). In the lemma below we show that L_n(θ_{k+1}) ≥ L_n(θ_k); hence at each iteration of the EM-algorithm we are obtaining a θ_{k+1} which increases the likelihood over the previous iteration.
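
In code, steps (i)-(iv) amount to a simple fixed-point iteration. The sketch below is generic and purely illustrative: e_step and m_step are placeholder functions that, for a concrete model, would compute the conditional expectations defining Q(θ*, ·) and its maximiser; θ is assumed to be stored as a numerical array.

import numpy as np

def em(theta_init, e_step, m_step, tol=1e-8, max_iter=500):
    theta = np.asarray(theta_init, dtype=float)                 # step (i): initial value
    for _ in range(max_iter):
        expectations = e_step(theta)                            # step (ii): expectation step
        theta_new = np.asarray(m_step(expectations), dtype=float)  # step (iii): maximisation step
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new                                    # step (iv): convergence reached
        theta = theta_new
    return theta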

Lemma 4.1.1 We have L_n(θ_{k+1}) ≥ L_n(θ_k). Moreover, under certain conditions θ_k converges to the maximum likelihood estimator arg max_θ L_n(Y; θ) (we do not prove this part of the result here).

PROOF. From (4.3) it is clear that

Q(θ_k, θ_{k+1}) - Q(θ_k, θ_k) = [ L_n(θ_{k+1}) - L_n(θ_k) ] + [ D(θ_k, θ_{k+1}) - D(θ_k, θ_k) ].   (4.4)

We will now show that D(θ_k, θ_{k+1}) - D(θ_k, θ_k) ≤ 0; the result follows from this. We observe that

D(θ_k, θ_{k+1}) - D(θ_k, θ_k) = ∫ log [ f(u|Y; θ_{k+1}) / f(u|Y; θ_k) ] f(u|Y, θ_k) du.

Now by using Jensen's inequality (which we have used several times previously) we have

D(θ_k, θ_{k+1}) - D(θ_k, θ_k) ≤ log ∫ [ f(u|Y; θ_{k+1}) / f(u|Y; θ_k) ] f(u|Y, θ_k) du = log ∫ f(u|Y; θ_{k+1}) du = 0.

Therefore D(θ_k, θ_{k+1}) - D(θ_k, θ_k) ≤ 0. Hence by (4.4) we have

L_n(θ_{k+1}) - L_n(θ_k) ≥ Q(θ_k, θ_{k+1}) - Q(θ_k, θ_k) ≥ 0,

and we obtain the desired result, L_n(θ_{k+1}) ≥ L_n(θ_k).

Remark 4.1.2 (The Fisher information) The Fisher information of the observed likelihood L_n(Y; θ) is

I_n(θ_0) = E[ -∂² log f(Y; θ)/∂θ² ] |_{θ=θ_0}.

As before, I_n(θ_0)^{-1} gives the asymptotic variance of the limiting distribution of θ̂_n. To understand how much is lost by not having a complete set of observations, we now rewrite the Fisher information in terms of the complete data and the missing data. By using (4.1), I_n(θ_0) can be rewritten as

I_n(θ_0) = E[ -∂² log f(Y, U; θ)/∂θ² ] |_{θ=θ_0} - E[ -∂² log f(U|Y; θ)/∂θ² ] |_{θ=θ_0} = I_n^{(C)}(θ_0) - I_n^{(M)}(θ_0).

In the case that θ is univariate, since I_n^{(M)}(θ_0) ≥ 0 it is clear that I_n(θ_0) ≤ I_n^{(C)}(θ_0). Hence, as one would expect, the complete data set (Y, U) contains more information about the unknown parameter than Y alone.
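
As a quick check of this decomposition, consider Example 4.1.1 with (for illustration only) the exponential density f(t; θ) = θ exp(-θt) and censoring at c. Each complete-data term log f(T_i; θ) = log θ - θ T_i has second derivative -1/θ², so I_n^{(C)}(θ_0) = (n+m)/θ_0². For a censored observation, f(u|Y_i; θ) = θ exp(-θ(u - c)) for u > c, whose logarithm also has second derivative -1/θ², so I_n^{(M)}(θ_0) = m/θ_0². Hence I_n(θ_0) = (n+m)/θ_0² - m/θ_0² = n/θ_0², which agrees with differentiating the observed log-likelihood directly: the censored terms log F(c; θ) = -θc are linear in θ and contribute no information.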

If U is fully determined by Y, then it can be shown that I_n^{(M)}(θ_0) = 0, and no information has been lost. From a practical point of view, one is interested in how many iterations of the EM-algorithm are required to obtain an estimator sufficiently close to the MLE. Let

J_n^{(C)}(θ_0) = E[ -∂² log f(Y, U; θ)/∂θ² |_{θ=θ_0} | Y, θ_0 ]   and   J_n^{(M)}(θ_0) = E[ -∂² log f(U|Y; θ)/∂θ² |_{θ=θ_0} | Y, θ_0 ].

By differentiating (4.1) twice with respect to the parameter θ we have

J_n(θ_0) = -∂² log f(Y; θ)/∂θ² |_{θ=θ_0} = J_n^{(C)}(θ_0) - J_n^{(M)}(θ_0).

Now it can be shown that the rate of convergence of the algorithm depends on the ratio J_n^{(C)}(θ_0)^{-1} J_n^{(M)}(θ_0). The closer the largest eigenvalue of J_n^{(C)}(θ_0)^{-1} J_n^{(M)}(θ_0) is to one, the slower the rate of convergence, and a larger number of iterations is required. On the other hand, if the largest eigenvalue of J_n^{(C)}(θ_0)^{-1} J_n^{(M)}(θ_0) is close to zero, then the rate of convergence is fast (a small number of iterations is needed for convergence to the MLE).

4.1.1 Censored data

Let us return to the example at the start of this section and construct the EM-algorithm for censored data. We recall that the log-likelihoods for the censored data and the complete data are

L_n(Y; θ) = Σ_{i=1}^{n} log f(Y_i; θ) + Σ_{i=n+1}^{n+m} log F(Y_i; θ)

and

L_n(Y, U; θ) = Σ_{i=1}^{n} log f(Y_i; θ) + Σ_{i=n+1}^{n+m} log f(T_i; θ).

To implement the EM-algorithm we need to evaluate the expectation step Q(θ*, θ). It is easy to see that

Q(θ*, θ) = E[ L_n(Y, U; θ) | Y, θ* ] = Σ_{i=1}^{n} log f(Y_i; θ) + Σ_{i=n+1}^{n+m} E[ log f(T_i; θ) | Y_i, θ* ].

To obtain E[ log f(T_i; θ) | Y_i, θ* ] (for n+1 ≤ i ≤ n+m) we note that

E[ log f(T_i; θ) | Y_i, θ* ] = E[ log f(T_i; θ) | T_i > c, θ* ] = (1/F(c; θ*)) ∫_c^∞ log f(u; θ) f(u; θ*) du.

Therefore we have

Q(θ*, θ) = Σ_{i=1}^{n} log f(Y_i; θ) + (m / F(c; θ*)) ∫_c^∞ log f(u; θ) f(u; θ*) du.

We also note that the derivative of Q(θ*, θ) with respect to θ is

∂Q(θ*, θ)/∂θ = Σ_{i=1}^{n} (1/f(Y_i; θ)) ∂f(Y_i; θ)/∂θ + (m / F(c; θ*)) ∫_c^∞ (1/f(u; θ)) (∂f(u; θ)/∂θ) f(u; θ*) du.

Hence for this example the EM-algorithm is:

(i) Define an initial value θ_1 ∈ Θ. Let θ* = θ_1.

(ii) The expectation step: for a fixed θ*, evaluate

∂Q(θ*, θ)/∂θ = Σ_{i=1}^{n} (1/f(Y_i; θ)) ∂f(Y_i; θ)/∂θ + (m / F(c; θ*)) ∫_c^∞ (1/f(u; θ)) (∂f(u; θ)/∂θ) f(u; θ*) du.

(iii) The maximisation step: solve ∂Q(θ*, θ)/∂θ = 0; that is, let θ_{k+1} be such that ∂Q(θ*, θ)/∂θ |_{θ=θ_{k+1}} = 0.

(iv) If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Else set θ* = θ_{k+1} and go back and repeat steps (ii) and (iii) again.
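
For a concrete implementation, suppose again (as an illustration, not part of the notes' general setup) that f(t; θ) = θ exp(-θt) is exponential. Then the expectation step has a closed form, E[T_i | T_i > c, θ*] = c + 1/θ*, and the maximisation step reduces to a single division. The sketch below is a minimal version of the algorithm above under that assumption; the function name is ours.

import numpy as np

def em_censored_exponential(y_obs, m, c, theta=1.0, tol=1e-10, max_iter=1000):
    n = len(y_obs)
    for _ in range(max_iter):
        # E-step: expected (unobserved) survival time of each censored observation
        expected_censored = c + 1.0 / theta
        # M-step: maximise the expected complete log-likelihood
        theta_new = (n + m) / (np.sum(y_obs) + m * expected_censored)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

rng = np.random.default_rng(1)
t = rng.exponential(scale=2.0, size=200)      # true theta_0 = 0.5
c = 3.0
y_obs, m = t[t <= c], int(np.sum(t > c))      # observed times and number censored
print(em_censored_exponential(y_obs, m, c))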

4.1.2 Mixture distributions

We now consider a useful application of the EM-algorithm: the estimation of parameters in mixture distributions. Let us suppose that {Y_i}_{i=1}^{n} are iid random variables with density

f(y; θ) = p f_1(y; θ_1) + (1 - p) f_2(y; θ_2),

where θ = (p, θ_1, θ_2) are unknown parameters. For the purpose of identifiability we will suppose that θ_1 ≠ θ_2, p ≠ 1 and p ≠ 0. The log-likelihood of {Y_i} is

L_n(Y; θ) = Σ_{i=1}^{n} log [ p f_1(Y_i; θ_1) + (1 - p) f_2(Y_i; θ_2) ].   (4.5)

Now maximising the above can be extremely difficult. As an illustration consider the example below.

Example 4.1.2 Let us suppose that f_1(y; θ_1) and f_2(y; θ_2) are normal densities; then the log-likelihood is

L_n(Y; θ) = Σ_{i=1}^{n} log [ p (1/√(2πσ_1²)) exp(-(Y_i - µ_1)²/(2σ_1²)) + (1 - p) (1/√(2πσ_2²)) exp(-(Y_i - µ_2)²/(2σ_2²)) ].

We observe that this is extremely difficult to maximise. On the other hand, if the Y_i were simply normally distributed then the log-likelihood would be the extremely simple

L_n(Y; θ) = -(1/2) Σ_{i=1}^{n} [ log σ² + (1/σ²)(Y_i - µ)² ] + constant.   (4.6)

In other words, the simplicity of maximising the log-likelihood of the exponential family of distributions (see Section 3) is lost for mixtures of distributions. We now use the EM-algorithm as an indirect but simple method of maximising (4.5).

In this example it is not clear which observations are missing. However, let us consider one possible interpretation of the mixture distribution. Let us define the random variables δ_i and Y_i, where δ_i ∈ {1, 2},

P(δ_i = 1) = p and P(δ_i = 2) = 1 - p,

and the density of Y_i given δ_i is

f(y | δ_i = 1) = f_1(y; θ_1) and f(y | δ_i = 2) = f_2(y; θ_2).

Therefore it is clear from the above that the density of Y_i is f(y; θ) = p f_1(y; θ_1) + (1 - p) f_2(y; θ_2). Hence one interpretation of the mixture model is that there is a hidden, unobserved random variable which determines the state or distribution of Y_i. A simple example is that Y_i is the height of an individual and δ_i is the gender; however, δ_i is unobserved and only the height is observed. Often a mixture distribution has a physical interpretation, similar to the height example, but sometimes it can simply be used to parametrically model a wide class of densities.

Based on the discussion above, U = {δ_i} can be treated as the missing observations. The likelihood of (Y_i, U_i) is

[ p_1 f_1(Y_i; θ_1) ]^{I(δ_i = 1)} [ p_2 f_2(Y_i; θ_2) ]^{I(δ_i = 2)} = p_{δ_i} f_{δ_i}(Y_i; θ_{δ_i}),

where we set p_1 = p and p_2 = 1 - p. Therefore the log-likelihood of {(Y_i, δ_i)} is

L_n(Y, U; θ) = Σ_{i=1}^{n} [ log p_{δ_i} + log f_{δ_i}(Y_i; θ_{δ_i}) ].

We now need to evaluate

Q(θ*, θ) = E[ L_n(Y, U; θ) | Y, θ* ] = Σ_{i=1}^{n} ( E[ log p_{δ_i} | Y_i, θ* ] + E[ log f_{δ_i}(Y_i; θ_{δ_i}) | Y_i, θ* ] ).

We see that the above expectation is taken with respect to the distribution of δ_i conditioned on Y_i and the parameter θ*. By using conditioning arguments it is easy to see that

P(δ_i = 1 | Y_i = y, θ*) = P(δ_i = 1, Y_i = y; θ*) / P(Y_i = y; θ*) = p* f_1(y; θ_1*) / [ p* f_1(y; θ_1*) + (1 - p*) f_2(y; θ_2*) ] =: w_1(θ*; y),

P(δ_i = 2 | Y_i = y, θ*) = (1 - p*) f_2(y; θ_2*) / [ p* f_1(y; θ_1*) + (1 - p*) f_2(y; θ_2*) ] =: w_2(θ*; y) = 1 - w_1(θ*; y).

Therefore

Q(θ*, θ) = Σ_{i=1}^{n} [ log p + log f_1(Y_i; θ_1) ] w_1(θ*; Y_i) + Σ_{i=1}^{n} [ log(1 - p) + log f_2(Y_i; θ_2) ] w_2(θ*; Y_i).

Now maximising the above with respect to p, θ_1 and θ_2 will in general be much easier than maximising L_n(Y; θ). For this example the EM algorithm is:

(i) Define an initial value θ* ∈ Θ.

(ii) The expectation step: for a fixed θ*, evaluate

Q(θ*, θ) = Σ_{i=1}^{n} [ log p + log f_1(Y_i; θ_1) ] w_1(θ*; Y_i) + Σ_{i=1}^{n} [ log(1 - p) + log f_2(Y_i; θ_2) ] w_2(θ*; Y_i).

(iii) The maximisation step: evaluate θ_{k+1} = arg max_{θ∈Θ} Q(θ*, θ) by differentiating Q(θ*, θ) with respect to θ and equating to zero. Since the parameters p and θ_1, θ_2 appear in separate subfunctions, they can be maximised separately.

(iv) If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Else set θ* = θ_{k+1} and go back and repeat steps (ii) and (iii) again.

Exercise: Derive the EM algorithm in the case that f_1 and f_2 are normal densities.

It is straightforward to see that the arguments above can be generalised to the case that the density of Y_i is a mixture of r different densities. However, we observe that the selection of r can be quite ad hoc. There are methods for choosing r; these include the reversible jump MCMC methods proposed by Peter Green.
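
The exercise above has well-known closed-form updates; the following is a minimal sketch for two normal components (the parameter names p, mu1, mu2, s1, s2 are illustrative), with the E-step computing the weights w_1(θ*; Y_i) and the M-step maximising Q(θ*, θ) in closed form.

import numpy as np
from scipy.stats import norm

def em_normal_mixture(y, p=0.5, mu1=-1.0, mu2=1.0, s1=1.0, s2=1.0, n_iter=200):
    for _ in range(n_iter):
        # E-step: w_1(theta*; Y_i) = P(delta_i = 1 | Y_i, theta*)
        a = p * norm.pdf(y, mu1, s1)
        b = (1 - p) * norm.pdf(y, mu2, s2)
        w = a / (a + b)
        # M-step: closed-form maximisers of Q(theta*, theta)
        p = w.mean()
        mu1 = np.sum(w * y) / np.sum(w)
        mu2 = np.sum((1 - w) * y) / np.sum(1 - w)
        s1 = np.sqrt(np.sum(w * (y - mu1) ** 2) / np.sum(w))
        s2 = np.sqrt(np.sum((1 - w) * (y - mu2) ** 2) / np.sum(1 - w))
    return p, mu1, mu2, s1, s2

rng = np.random.default_rng(2)
y = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1.5, 700)])
print(em_normal_mixture(y))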

Example 4.1.3 Question: Suppose that the regressors x_t are believed to influence the response variable Y_t. The distribution of Y_t is

P(Y_t = y) = p λ_{t,1}^y exp(-λ_{t,1}) / y! + (1 - p) λ_{t,2}^y exp(-λ_{t,2}) / y!,

where λ_{t,1} = exp(β_1' x_t) and λ_{t,2} = exp(β_2' x_t).

(i) State the minimum conditions on the parameters for the above model to be identifiable.

(ii) Carefully explain (giving details of Q(θ*, θ) and the EM stages) how the EM-algorithm can be used to obtain estimators of β_1, β_2 and p.

(iii) Derive the derivative of Q(θ*, θ), and explain how the derivative may be useful in the maximisation stage of the EM-algorithm.

(iv) Given an initial value, will the EM-algorithm always find the maximum of the likelihood? Explain how one can check whether the parameter which maximises the EM-algorithm also maximises the likelihood.

Solution

(i) 0 < p < 1 and β_1 ≠ β_2 (these are the minimum assumptions; there could be more, which are hard to account for given the regressors x_t).

(ii) We first observe that P(Y_t = y) is a mixture of two Poisson distributions, where each has the canonical link function. Define the unobserved variables {U_t}, which are iid with P(U_t = 1) = p and P(U_t = 2) = (1 - p), and

P(Y_t = y | U_t = 1) = λ_{t,1}^y exp(-λ_{t,1}) / y!   and   P(Y_t = y | U_t = 2) = λ_{t,2}^y exp(-λ_{t,2}) / y!.

Therefore we have

log f(Y_t, U_t; θ) = Y_t β_{U_t}' x_t - exp(β_{U_t}' x_t) - log Y_t! + log p_{U_t},

where θ = (β_1, β_2, p) and p_1 = p, p_2 = 1 - p. Thus E( log f(Y_t, U_t; θ) | Y_t, θ* ) is

E( log f(Y_t, U_t; θ) | Y_t, θ* ) = [ Y_t β_1' x_t - exp(β_1' x_t) - log Y_t! + log p ] π(θ*|Y_t) + [ Y_t β_2' x_t - exp(β_2' x_t) - log Y_t! + log(1 - p) ] (1 - π(θ*|Y_t)),

where P(U_t = 1 | Y_t, θ*) is evaluated as

P(U_t = 1 | Y_t, θ*) = π(θ*|Y_t) = p* f_1(Y_t; θ*) / [ p* f_1(Y_t; θ*) + (1 - p*) f_2(Y_t; θ*) ].

Here f_1 and f_2 denote the two Poisson components evaluated at the current parameter values,

f_1(Y_t; θ*) = exp(β_1*' x_t Y_t) exp(-exp(β_1*' x_t)) / Y_t!   and   f_2(Y_t; θ*) = exp(β_2*' x_t Y_t) exp(-exp(β_2*' x_t)) / Y_t!.

Thus Q(θ*, θ) is

Q(θ*, θ) = Σ_{t=1}^{T} { [ Y_t β_1' x_t - exp(β_1' x_t) - log Y_t! + log p ] π(θ*|Y_t) + [ Y_t β_2' x_t - exp(β_2' x_t) - log Y_t! + log(1 - p) ] (1 - π(θ*|Y_t)) }.

Using the above, the EM algorithm is the following:

(a) Start with an initial value, which is an estimator of β_1, β_2 and p; denote this as θ*.

(b) For every θ, evaluate Q(θ*, θ).

(c) Evaluate arg max_θ Q(θ*, θ). Denote the maximum as θ* and return to step (b).

(d) Keep iterating until successive maximums are sufficiently close.

(iii) The derivative of Q(θ*, θ) is

∂Q(θ*, θ)/∂β_1 = Σ_{t=1}^{T} [ Y_t - exp(β_1' x_t) ] x_t π(θ*|Y_t),

∂Q(θ*, θ)/∂β_2 = Σ_{t=1}^{T} [ Y_t - exp(β_2' x_t) ] x_t (1 - π(θ*|Y_t)),

∂Q(θ*, θ)/∂p = Σ_{t=1}^{T} [ π(θ*|Y_t)/p - (1 - π(θ*|Y_t))/(1 - p) ].

Thus maximisation of Q(θ*, θ) can be achieved by solving the above equations using iterative weighted least squares.

(iv) Depending on the initial value, the EM-algorithm may only locate a local maximum. To check whether we have found the global maximum, we can start the EM-algorithm with several different initial values and check where they converge.
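
As a numerical counterpart to parts (ii) and (iii), the sketch below computes the E-step weights π(θ*|Y_t) and carries out the M-step for β_1 and β_2 with a few Newton (weighted-least-squares type) steps on the weighted Poisson log-likelihood; p has the closed-form update. All names are illustrative; x is a T × d regressor matrix.

import numpy as np
from scipy.stats import poisson

def em_poisson_mixture(y, x, beta1, beta2, p=0.5, n_iter=100):
    def newton_step(beta, w):
        lam = np.exp(x @ beta)
        grad = x.T @ (w * (y - lam))              # derivative of the weighted Q with respect to beta
        hess = x.T @ (x * (w * lam)[:, None])     # minus the second derivative
        return beta + np.linalg.solve(hess, grad)

    for _ in range(n_iter):
        # E-step: pi(theta* | Y_t) = P(U_t = 1 | Y_t, theta*)
        a = p * poisson.pmf(y, np.exp(x @ beta1))
        b = (1 - p) * poisson.pmf(y, np.exp(x @ beta2))
        w = a / (a + b)
        # M-step: p in closed form; beta_1, beta_2 via a few Newton steps
        p = w.mean()
        for _ in range(5):
            beta1 = newton_step(beta1, w)
            beta2 = newton_step(beta2, 1 - w)
    return beta1, beta2, p

rng = np.random.default_rng(3)
T = 500
x = np.column_stack([np.ones(T), rng.normal(size=T)])
u = rng.random(T) < 0.3
b_true = np.where(u[:, None], [0.2, 1.0], [1.5, -0.5])
y = rng.poisson(np.exp(np.sum(x * b_true, axis=1)))
print(em_poisson_mixture(y, x, beta1=np.zeros(2), beta2=np.ones(2)))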

Example 4.1.4 Question (2): Let us suppose that F_1(t) and F_2(t) are two survival functions. Let x denote a univariate regressor. [25]

(i) Show that

F(t; x) = p F_1(t)^{exp(β_1 x)} + (1 - p) F_2(t)^{exp(β_2 x)}

is a valid survival function and obtain the corresponding density function.

(ii) Suppose that the T_i are survival times and x_i is a univariate regressor which exerts an influence on T_i. Let Y_i = min(T_i, c), where c is a common censoring time. The {T_i} are independent random variables with survival function

F(t; x_i) = p F_1(t)^{exp(β_1 x_i)} + (1 - p) F_2(t)^{exp(β_2 x_i)},

where both F_1 and F_2 are known, but p, β_1 and β_2 are unknown. State the censored likelihood and show that the EM-algorithm, together with iterative least squares in the maximisation step, can be used to maximise this likelihood (sufficient details need to be given so that your algorithm can be easily coded).

Solution

(i) Since F_1 and F_2 are monotonically decreasing positive functions with F_1(0) = F_2(0) = 1 and F_1(∞) = F_2(∞) = 0, it immediately follows that the same holds for F(t; x) = p F_1(t)^{e^{β_1 x}} + (1 - p) F_2(t)^{e^{β_2 x}}; thus F(t; x) is a survival function. Using dF_1(t)/dt = -f_1(t),

-∂F(t; x)/∂t = p e^{β_1 x} f_1(t) F_1(t)^{e^{β_1 x} - 1} + (1 - p) e^{β_2 x} f_2(t) F_2(t)^{e^{β_2 x} - 1},

so the corresponding density is

f(t; x) = p e^{β_1 x} f_1(t) F_1(t)^{e^{β_1 x} - 1} + (1 - p) e^{β_2 x} f_2(t) F_2(t)^{e^{β_2 x} - 1}.

(ii) The censored log likelihood is

L_T(β_1, β_2, p) = Σ_{i=1}^{T} [ δ_i log f(Y_i; β_1, β_2, p) + (1 - δ_i) log F(Y_i; β_1, β_2, p) ],

where δ_i is the censoring indicator, δ_i = 1 if Y_i = T_i (uncensored) and δ_i = 0 otherwise. Clearly, directly maximising the above is extremely difficult; thus we look for an alternative method via the EM algorithm. Define the unobserved variable I_i ∈ {1, 2}, with P(I_i = 1) = p = p_1 and P(I_i = 2) = (1 - p) = p_2. Then the log of the joint density of (Y_i, δ_i, I_i) is

δ_i [ log p_{I_i} + β_{I_i} x_i + log f_{I_i}(Y_i) + (e^{β_{I_i} x_i} - 1) log F_{I_i}(Y_i) ] + (1 - δ_i) [ log p_{I_i} + e^{β_{I_i} x_i} log F_{I_i}(Y_i) ].

Thus the complete log likelihood is

L_T(Y, δ, I; β_1, β_2, p) = Σ_{i=1}^{T} { δ_i [ log p_{I_i} + β_{I_i} x_i + log f_{I_i}(Y_i) + (e^{β_{I_i} x_i} - 1) log F_{I_i}(Y_i) ] + (1 - δ_i) [ log p_{I_i} + e^{β_{I_i} x_i} log F_{I_i}(Y_i) ] }.

Now we need to calculate P(I_i = 1 | Y_i, δ_i). We have

ω_{i,1} := P(I_i = 1 | Y_i, δ_i = 1; p^α, β_1^α, β_2^α) = p^α e^{β_1^α x_i} f_1(Y_i) F_1(Y_i)^{e^{β_1^α x_i} - 1} / [ p^α e^{β_1^α x_i} f_1(Y_i) F_1(Y_i)^{e^{β_1^α x_i} - 1} + (1 - p^α) e^{β_2^α x_i} f_2(Y_i) F_2(Y_i)^{e^{β_2^α x_i} - 1} ],

ω_{i,0} := P(I_i = 1 | Y_i, δ_i = 0; p^α, β_1^α, β_2^α) = p^α F_1(Y_i)^{e^{β_1^α x_i}} / [ p^α F_1(Y_i)^{e^{β_1^α x_i}} + (1 - p^α) F_2(Y_i)^{e^{β_2^α x_i}} ],

where θ^α = (p^α, β_1^α, β_2^α) denotes the current value of the parameters. Therefore the complete likelihood conditioned on what we observe is

Q(θ, θ^α) = Σ_{i=1}^{T} { δ_i ω_{i,1} [ log p + β_1 x_i + log f_1(Y_i) + (e^{β_1 x_i} - 1) log F_1(Y_i) ] + (1 - δ_i) ω_{i,0} [ log p + e^{β_1 x_i} log F_1(Y_i) ] }
+ Σ_{i=1}^{T} { δ_i (1 - ω_{i,1}) [ log(1 - p) + β_2 x_i + log f_2(Y_i) + (e^{β_2 x_i} - 1) log F_2(Y_i) ] + (1 - δ_i)(1 - ω_{i,0}) [ log(1 - p) + e^{β_2 x_i} log F_2(Y_i) ] }.

The conditional likelihood above looks unwieldy. However, the parameter estimates can be separated. First, differentiating with respect to p gives

∂Q/∂p = Σ_{i=1}^{T} δ_i ω_{i,1}/p + Σ_{i=1}^{T} (1 - δ_i) ω_{i,0}/p - Σ_{i=1}^{T} δ_i (1 - ω_{i,1})/(1 - p) - Σ_{i=1}^{T} (1 - δ_i)(1 - ω_{i,0})/(1 - p).

Equating the above to zero we have the estimator p̂ = a/(a + b), where

a = Σ_{i=1}^{T} δ_i ω_{i,1} + Σ_{i=1}^{T} (1 - δ_i) ω_{i,0},   b = Σ_{i=1}^{T} δ_i (1 - ω_{i,1}) + Σ_{i=1}^{T} (1 - δ_i)(1 - ω_{i,0}).

Now we consider the estimates of β_1 and β_2 at the jth iteration step.

Differentiating Q with respect to β_1 and β_2 gives

∂Q/∂β_1 = Σ_{i=1}^{T} { δ_i ω_{i,1} [ 1 + e^{β_1 x_i} log F_1(Y_i) ] + (1 - δ_i) ω_{i,0} e^{β_1 x_i} log F_1(Y_i) } x_i = 0,

∂Q/∂β_2 = Σ_{i=1}^{T} { δ_i (1 - ω_{i,1}) [ 1 + e^{β_2 x_i} log F_2(Y_i) ] + (1 - δ_i)(1 - ω_{i,0}) e^{β_2 x_i} log F_2(Y_i) } x_i = 0,

with second derivatives

∂²Q/∂β_1² = Σ_{i=1}^{T} { δ_i ω_{i,1} e^{β_1 x_i} log F_1(Y_i) + (1 - δ_i) ω_{i,0} e^{β_1 x_i} log F_1(Y_i) } x_i²,

∂²Q/∂β_2² = Σ_{i=1}^{T} { δ_i (1 - ω_{i,1}) e^{β_2 x_i} log F_2(Y_i) + (1 - δ_i)(1 - ω_{i,0}) e^{β_2 x_i} log F_2(Y_i) } x_i².

Thus to estimate (β_1, β_2) at the jth inner iteration we use Newton-Raphson; since the Hessian is diagonal in (β_1, β_2), the updates decouple:

β_1^{(j)} = β_1^{(j-1)} - ( ∂²Q/∂β_1² )^{-1} ∂Q/∂β_1 |_{β_1 = β_1^{(j-1)}},

and similarly for β_2^{(j)}. Now we can rewrite ( ∂²Q/∂β_1² )^{-1} ∂Q/∂β_1 |_{β_1 = β_1^{(j-1)}} as (X' W_1^{(j-1)} X)^{-1} X' S_1^{(j-1)}, where

X = (x_1, x_2, ..., x_T)',   W_1^{(j-1)} = diag[ w_1^{(j-1)}, ..., w_T^{(j-1)} ],   S_1^{(j-1)} = ( S_{1,1}^{(j-1)}, ..., S_{T,1}^{(j-1)} )',

with

w_i^{(j-1)} = δ_i ω_{i,1} e^{β_1^{(j-1)} x_i} log F_1(Y_i) + (1 - δ_i) ω_{i,0} e^{β_1^{(j-1)} x_i} log F_1(Y_i),

S_{i,1}^{(j-1)} = δ_i ω_{i,1} [ 1 + e^{β_1^{(j-1)} x_i} log F_1(Y_i) ] + (1 - δ_i) ω_{i,0} e^{β_1^{(j-1)} x_i} log F_1(Y_i).

Thus altogether in the EM-algorithm we have:

Start with initial values β_1^{0}, β_2^{0}, p^{0}.

Step 1: Set (β_1^α, β_2^α, p^α) = (β_{1,r}, β_{2,r}, p_r), the current EM iterate, and evaluate the weights ω_{i,1} and ω_{i,0} (these probabilities/weights stay the same throughout the inner iterative least squares).

Step 2: Maximise Q(θ, θ^α). Evaluate p_{r+1} = a_r/(a_r + b_r), where a_r and b_r are the quantities a and b defined previously, evaluated at the current weights. Then iterate

β_1^{(j)} = β_1^{(j-1)} - (X' W_1^{(j-1)} X)^{-1} X' S_1^{(j-1)},

and the same for β_2^{(j)},

β_2^{(j)} = β_2^{(j-1)} - (X' W_2^{(j-1)} X)^{-1} X' S_2^{(j-1)},

until convergence.

Step 3: Let (β_{1,r+1}, β_{2,r+1}, p_{r+1}) be the limit of the iterative least squares; go back to Step 1 and repeat until convergence.

Example 4.1.5 Question: Let us suppose that X and Z are independent positive random variables with densities f_X and f_Z respectively.

(a)(i) Derive the density function of 1/X.

(ii) Show that the density of XZ is

∫ (1/x) f_Z(y/x) f_X(x) dx   (4.7)

(or, equivalently, ∫ c f_Z(cy) f_{1/X}(c) dc, where f_{1/X} denotes the density of 1/X).

(b) Consider the linear regression model

Y_i = α' x_i + σ_i ε_i,

where ε_i follows a standard normal distribution (mean zero and variance one) and σ_i² follows a Gamma distribution with density

f(σ²; λ) = σ^{2(κ-1)} λ^κ exp(-λσ²) / Γ(κ),   σ² ≥ 0,

with κ > 0. Let us suppose that α and λ are unknown parameters but κ is a known parameter.

(i) Give an expression for the log-likelihood of Y and explain why it is difficult to compute the maximum likelihood estimate.

(ii) As an alternative to directly maximising the likelihood, the EM algorithm can be used instead. Derive the EM-algorithm for this case. In your derivation explain which quantities will have to be evaluated numerically.

Solution

(a)(i) P(1/X ≤ c) = P(X ≥ 1/c) = 1 - F_X(1/c) (F_X is the distribution function of X). Therefore the density of 1/X is f_{1/X}(c) = (1/c²) f_X(1/c).

(ii) We first note that P(XZ ≤ y | X = x) = P(Z ≤ y/x). Therefore the density of XZ given X = x is f_{XZ|X}(y|x) = (1/x) f_Z(y/x). Using this we obtain the density of XZ:

f_{XZ}(y) = ∫ f_{XZ|X}(y|x) f_X(x) dx = ∫ (1/x) f_Z(y/x) f_X(x) dx.   (4.8)

Or, equivalently, we can condition on 1/X to obtain

f_{XZ}(y) = ∫ c f_Z(cy) f_{1/X}(c) dc = ∫ c f_Z(cy) (1/c²) f_X(1/c) dc.

Note that with the change of variables c = 1/x we can show that both integrals are equivalent.

(b)(i) We recall that Y_i = α' x_i + σ_i ε_i. Therefore the log-likelihood of Y is

L_n(α, λ) = Σ_{i=1}^{n} log f_{σε}(Y_i - α' x_i) = Σ_{i=1}^{n} log ∫ (1/x) f_σ( (Y_i - α' x_i)/x ; λ) f_ε(x) dx,

where we use (4.8) to obtain the density f_{σε}; f_σ(·; λ) is the density of a square-root Gamma random variable and f_ε is the density of a standard normal. It is clear that it is either very hard or impossible to obtain an explicit expression for f_{σε}.

(ii) Let U = (σ_1², ..., σ_n²) denote the unobserved variances and Y = (Y_1, ..., Y_n) the observed data. The complete log-likelihood of (Y, U) is

L_n(Y, U; α, λ) = Σ_{i=1}^{n} { -(1/2) [ log σ_i² + σ_i^{-2} (Y_i - α' x_i)² ] + (κ - 1) log σ_i² + κ log λ - λσ_i² }

up to an additive constant. Of course, the above cannot be evaluated, since U is unobserved. Instead we evaluate the conditional expectation of the above with respect to what is observed.

Thus the likelihood conditioned on Y and the current parameters (α*, λ*) is

Q(α, λ) = E[ L_n(Y, U; α, λ) | Y, α*, λ* ]
= Σ_{i=1}^{n} { -(1/2) E[ log σ_i² | · ] - (1/2) E[ σ_i^{-2} | · ] (Y_i - α' x_i)² + (κ - 1) E[ log σ_i² | · ] + κ log λ - λ E[ σ_i² | · ] },

where E[ · | · ] is shorthand for E[ · | σ_i²ε_i² = (Y_i - α*' x_i)², λ* ]. We note that the above is true because conditioning on Y_i and α* means that σ_i²ε_i² = (Y_i - α*' x_i)² is observed. Thus by evaluating Q at each stage we can implement the EM algorithm:

(i) Define an initial value θ_1 = (α_1, λ_1) ∈ Θ. Let θ* = θ_1.

(ii) The expectation step (the (k+1)-th step): for a fixed θ*, evaluate Q(θ*, θ) = E[ log f(Y, U; θ) | Y, θ* ] = ∫ log f(Y, u; θ) f(u|Y, θ*) du for all θ ∈ Θ.

(iii) The maximisation step: evaluate θ_{k+1} = arg max_{θ∈Θ} Q(θ*, θ). We note that the maximisation can be done by finding the solution of E[ ∂ log f(Y, U; θ)/∂θ | Y, θ* ] = 0.

(iv) If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Else set θ* = θ_{k+1} and go back and repeat steps (ii) and (iii) again.

The useful feature of this EM-algorithm is that if the weights

E[ σ_i^{-2} | σ_i²ε_i² = (Y_i - α*' x_i)², λ* ]   and   E[ σ_i² | σ_i²ε_i² = (Y_i - α*' x_i)², λ* ]

are known, then we do not need to numerically maximise Q(α, λ) at each stage.

This is because the derivative of Q(α, λ) leads to an explicit solution for α and λ:

∂Q(α, λ)/∂α = Σ_{i=1}^{n} E[ σ_i^{-2} | σ_i²ε_i² = (Y_i - α*' x_i)², λ* ] (Y_i - α' x_i) x_i = 0,

∂Q(α, λ)/∂λ = Σ_{i=1}^{n} [ κ/λ - E( σ_i² | σ_i²ε_i² = (Y_i - α*' x_i)², λ* ) ] = 0.

It is straightforward to see that the above can easily be solved for α and λ. Of course, we need to evaluate the weights

E[ σ_i^{-2} | σ_i²ε_i² = (Y_i - α*' x_i)², λ* ]   and   E( σ_i² | σ_i²ε_i² = (Y_i - α*' x_i)², λ* ).

This is done numerically, by noting that for a general g(·) the conditional expectation is

E( g(σ²) | σ²ε² = y ) = ∫ g(σ²) f_{σ²|σ²ε²}(σ² | y) dσ².

To obtain the density f_{σ²|σ²ε²} we note that

P(σ² < s | σ²ε² = y) = P(y/ε² < s) = P(ε² > y/s) = 1 - P(ε² ≤ y/s).

Hence the density of σ² given σ²ε² = y is f_{σ²|σ²ε²}(s|y) = (y/s²) f_{ε²}(y/s), where f_{ε²} is the density of a chi-squared distribution with one degree of freedom. Hence

E( g(σ²) | σ²ε² = y ) = ∫ g(s) (y/s²) f_{ε²}(y/s) ds.

Using the above we can numerically evaluate the conditional expectations and thus Q(α, λ). We keep iterating until we get convergence.

4.1.3 Hidden Markov Models

Finally, we consider applications of the EM-algorithm to parameter estimation in Hidden Markov Models (HMMs). This is a model where the EM-algorithm pretty much surpasses any other likelihood maximisation methodology; it is worth mentioning that the EM-algorithm in this setting is often called the Baum-Welch algorithm. Hidden Markov models are a generalisation of mixture distributions; however, unlike mixture distributions, it is difficult to derive an explicit expression for the likelihood of a Hidden Markov Model. HMMs are a general class of models which are widely used in several applications (including speech recognition), and can easily be generalised to the Bayesian set-up. A nice description of them can be found on Wikipedia.

In this section we will only briefly cover how the EM-algorithm can be used for HMMs. We do not attempt to address any of the issues surrounding how the maximisation is done; interested readers should refer to the extensive literature on the subject.

The general HMM is described as follows. Let us suppose that we observe {Y_t}, where the random variables Y_t satisfy the Markov property P(Y_t | Y_{t-1}, Y_{t-2}, ...) = P(Y_t | Y_{t-1}). In addition to {Y_t} there exist hidden, unobserved discrete random variables {U_t}, where {U_t} satisfies the Markov property P(U_t | U_{t-1}, U_{t-2}, ...) = P(U_t | U_{t-1}) and drives the dependence in {Y_t}. In other words, P(Y_t | U_t, Y_{t-1}, U_{t-1}, ...) = P(Y_t | U_t). To summarise, the HMM is described by the following properties:

(i) We observe {Y_t} (which can be either continuous or discrete random variables) but do not observe the hidden discrete random variables {U_t}.

(ii) Both {Y_t} and {U_t} are time-homogeneous Markov random variables, that is, P(Y_t | Y_{t-1}, Y_{t-2}, ...) = P(Y_t | Y_{t-1}) and P(U_t | U_{t-1}, U_{t-2}, ...) = P(U_t | U_{t-1}). The distributions P(Y_t), P(Y_t | Y_{t-1}), P(U_t) and P(U_t | U_{t-1}) do not depend on t.

(iii) The dependence between the {Y_t} is driven by {U_t}, that is, P(Y_t | U_t, Y_{t-1}, U_{t-1}, ...) = P(Y_t | U_t).

There are several examples of HMMs, but to have a clear interpretation of them, in this section we shall only consider one classical example of an HMM. Let us suppose that the hidden random variable U_t can take N possible values {1, ..., N} and let p_i = P(U_t = i) and p_{ij} = P(U_t = i | U_{t-1} = j). Moreover, let us suppose that the Y_t are continuous random variables where (Y_t | U_t = i) ~ N(µ_i, σ_i²) and the conditional random variables Y_t | U_t and Y_τ | U_τ are independent of each other. Our objective is to estimate the parameters θ = {p_i, p_{ij}, µ_i, σ_i²} given {Y_t}. Let f_i(·; θ) denote the density of the normal distribution N(µ_i, σ_i²).

Remark 4.1.3 (HMMs and mixture models) Mixture models (described in the section above) are a particular example of an HMM. In that case the unobserved variables {U_t} are iid, with p_i = P(U_t = i | U_{t-1} = j) = P(U_t = i) for all i and j.

Let us denote the log-likelihood of {Y_t} as L_T(Y; θ) (this is the observed likelihood). It is clear that constructing an explicit expression for L_T is difficult, thus maximising the likelihood is near impossible. In the remark below we derive the observed likelihood.

Remark 4.1.4 The likelihood of Y = (Y_1, ..., Y_T) is

L_T(Y; θ) = f(Y_T | Y_{T-1}, Y_{T-2}, ...; θ) ⋯ f(Y_2 | Y_1; θ) f(Y_1; θ) = f(Y_T | Y_{T-1}; θ) ⋯ f(Y_2 | Y_1; θ) f(Y_1; θ).

Thus the log-likelihood is

L_T(Y; θ) = Σ_{t=2}^{T} log f(Y_t | Y_{t-1}; θ) + log f(Y_1; θ).

The distribution f(Y_1; θ) is simply the mixture distribution

f(Y_1; θ) = p_1 f_1(Y_1; θ_1) + ... + p_N f_N(Y_1; θ_N),

where p_i = P(U_t = i). The conditional f(Y_t | Y_{t-1}) is more tricky. We start with

f(Y_t | Y_{t-1}; θ) = f(Y_t, Y_{t-1}; θ) / f(Y_{t-1}; θ).

An expression for f(Y_{t-1}; θ) is given above. To evaluate f(Y_t, Y_{t-1}; θ) we condition on U_t, U_{t-1} to give (using the Markov and conditional independence properties)

f(Y_t, Y_{t-1}; θ) = Σ_i Σ_j f(Y_t, Y_{t-1} | U_t = i, U_{t-1} = j) P(U_t = i, U_{t-1} = j)
= Σ_i Σ_j f(Y_t | U_t = i) f(Y_{t-1} | U_{t-1} = j) P(U_t = i | U_{t-1} = j) P(U_{t-1} = j)
= Σ_i Σ_j f_i(Y_t; θ_i) f_j(Y_{t-1}; θ_j) p_{ij} p_j.

Thus we have

f(Y_t | Y_{t-1}; θ) = Σ_i Σ_j f_i(Y_t; θ_i) f_j(Y_{t-1}; θ_j) p_{ij} p_j / f(Y_{t-1}; θ).

We substitute the above into L_T(Y; θ) to give the expression

L_T(Y; θ) = Σ_{t=2}^{T} log [ Σ_i Σ_j f_i(Y_t; θ_i) f_j(Y_{t-1}; θ_j) p_{ij} p_j / f(Y_{t-1}; θ) ] + log [ Σ_{i=1}^{N} p_i f_i(Y_1; θ_i) ].

Now try to maximise this! Instead we seek an indirect method for maximising the likelihood. By using the EM algorithm we can maximise a likelihood which is a lot easier to evaluate. Let us suppose that we observe {Y_t, U_t}. Since

P(Y | U) = P(Y_T | Y_{T-1}, ..., Y_1, U) P(Y_{T-1} | Y_{T-2}, ..., Y_1, U) ⋯ P(Y_1 | U) = Π_{t=1}^{T} P(Y_t | U_t),

and the distribution of Y_t | U_t is N(µ_{U_t}, σ_{U_t}²), the complete likelihood of {Y_t, U_t} is

Π_{t=1}^{T} f(Y_t | U_t; θ) × p_{U_1} × Π_{t=2}^{T} p_{U_t, U_{t-1}}.

Thus the log-likelihood of the complete observations {Y_t, U_t} is

L_T(Y, U; θ) = Σ_{t=1}^{T} log f(Y_t | U_t; θ) + Σ_{t=2}^{T} log p_{U_t, U_{t-1}} + log p_{U_1}.

Of course, we do not observe the complete likelihood, but the above can be used in order to define the function Q(θ*, θ) which is maximised in the EM-algorithm. It is worth mentioning that, given the transition probabilities of a discrete Markov chain (that is, {p_{ij}}_{ij}), one can obtain the marginal probabilities {p_i}. Thus it is not necessary to estimate the marginal probabilities {p_i} (note that the exclusion of p_{U_1} in the log-likelihood above gives the conditional complete log-likelihood).

We recall that to maximise the observed likelihood L_T(Y; θ) using the EM algorithm involves evaluating Q(θ*, θ), where

Q(θ*, θ) = E[ Σ_{t=1}^{T} log f(Y_t | U_t; θ) + Σ_{t=2}^{T} log p_{U_t, U_{t-1}} + log p_{U_1} | Y, θ* ]
= Σ_U [ Σ_{t=1}^{T} log f(Y_t | U_t; θ) + Σ_{t=2}^{T} log p_{U_t, U_{t-1}} + log p_{U_1} ] P(U | Y, θ*)
= Σ_{t=1}^{T} Σ_{u=1}^{N} [ log f(Y_t | U_t = u; θ) ] P(U_t = u | Y, θ*) + Σ_{t=2}^{T} Σ_{u,v=1}^{N} [ log p_{uv} ] P(U_t = u, U_{t-1} = v | Y, θ*) + Σ_{u=1}^{N} [ log p_u ] P(U_1 = u | Y, θ*),

and Σ_U denotes the sum over all combinations of U = (U_1, ..., U_T). Since P(U_t = u | Y, θ*) = P(U_t = u, Y | θ*)/P(Y | θ*) and P(U_t = u, U_{t-1} = v | Y, θ*) = P(U_t = u, U_{t-1} = v, Y | θ*)/P(Y | θ*), and P(Y | θ*) is common to all terms and independent of θ, we can define

Q̃(θ*, θ) = Σ_{t=1}^{T} Σ_{u=1}^{N} [ log f(Y_t | U_t = u; θ) ] P(U_t = u, Y | θ*) + Σ_{t=2}^{T} Σ_{u,v=1}^{N} [ log p_{uv} ] P(U_t = u, U_{t-1} = v, Y | θ*) + Σ_{u=1}^{N} [ log p_u ] P(U_1 = u, Y | θ*),   (4.9)

where Q̃(θ*, θ) = P(Y | θ*) Q(θ*, θ), so the maximum of Q̃(θ*, θ) with respect to θ is the same as the maximum of Q(θ*, θ). Thus the quantity Q̃(θ*, θ) is evaluated and maximised with respect to θ. For a given θ* and Y, the conditional probabilities P(U_t = u | Y, θ*) and P(U_t = u, U_{t-1} = v | Y, θ*) can be evaluated through a series of iterative steps (the forward-backward recursions). For this example the EM algorithm is:

(i) Define an initial value θ_1 ∈ Θ. Let θ* = θ_1.

(ii) The expectation step: for a fixed θ*, evaluate P(U_t = u | Y, θ*), P(U_t = u, U_{t-1} = v | Y, θ*) and Q̃(θ*, θ) (defined in (4.9)).

(iii) The maximisation step: evaluate θ_{k+1} = arg max_{θ∈Θ} Q̃(θ*, θ) by differentiating Q̃(θ*, θ) with respect to θ and equating to zero.

(iv) If θ_k and θ_{k+1} are sufficiently close to each other, stop the algorithm and set θ̂_n = θ_{k+1}. Else set θ* = θ_{k+1} and go back and repeat steps (ii) and (iii) again.
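
For completeness, here is a compact sketch of the Baum-Welch algorithm for the Gaussian-emission HMM above. The conditional probabilities P(U_t | Y, θ*) and P(U_t, U_{t-1} | Y, θ*) required in step (ii) are computed with the standard scaled forward-backward recursions. Note that the transition matrix below is stored as A[i, j] = P(U_t = j | U_{t-1} = i) (the transpose of the p_{ij} convention used in the text), and all function and variable names are ours.

import numpy as np
from scipy.stats import norm

def baum_welch(y, N, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    T = len(y)
    A = np.full((N, N), 1.0 / N)        # A[i, j] = P(U_t = j | U_{t-1} = i)
    pi = np.full(N, 1.0 / N)            # initial state distribution
    mu = rng.choice(y, N)               # crude starting values for the emission means
    sig = np.full(N, np.std(y))

    for _ in range(n_iter):
        B = norm.pdf(y[:, None], mu[None, :], sig[None, :])   # B[t, i] = f_i(Y_t)

        # forward recursion (with scaling to avoid underflow)
        alpha = np.zeros((T, N)); c = np.zeros(T)
        alpha[0] = pi * B[0]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[t]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]

        # backward recursion (same scaling constants)
        beta = np.zeros((T, N)); beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]

        # E-step: gamma[t, i] = P(U_t = i | Y), xi[t, i, j] = P(U_t = i, U_{t+1} = j | Y)
        gamma = alpha * beta
        xi = alpha[:-1, :, None] * A[None, :, :] * (B[1:] * beta[1:])[:, None, :] / c[1:, None, None]

        # M-step: maximise Q-tilde with respect to (pi, A, mu, sigma)
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        w = gamma.sum(axis=0)
        mu = gamma.T @ y / w
        sig = np.sqrt((gamma * (y[:, None] - mu[None, :]) ** 2).sum(axis=0) / w)

    return pi, A, mu, sig

# small synthetic check: a two-state chain with well-separated Gaussian emissions
rng = np.random.default_rng(4)
A0 = np.array([[0.9, 0.1], [0.2, 0.8]])
u = np.zeros(500, dtype=int)
for t in range(1, 500):
    u[t] = rng.choice(2, p=A0[u[t - 1]])
y = rng.normal(np.array([0.0, 3.0])[u], 1.0)
print(baum_welch(y, N=2))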