Computing the maximum likelihood estimates: concentrated likelihood, EM-algorithm. Dmitry Pavlyuk


Computing the maximum likelihood estimates: concentrated likelihood, EM-algorithm. Dmitry Pavlyuk. The Mathematical Seminar, Transport and Telecommunication Institute, Riga, 13.05.2016

Presentation outline
1. Basics of MLE
2. Pseudo-likelihood
3. Finite Mixture Models
4. The Expectation-Maximization algorithm
5. Numerical Example

1. Basics of MLE

The estimation problem. Let X = (X^(1), X^(2), …, X^(d)) be a multivariate (d-variate) random variable with known multivariate p.d.f. f(x, θ) with K unknown parameters θ = (θ_1, θ_2, …, θ_K), θ ∈ Θ. The problem is to estimate the parameters θ on the basis of a sample of size n from X:

x = (x_1, x_2, …, x_n), where x_i = (x_i^(1), x_i^(2), …, x_i^(d))

Maximum likelihood estimator. The likelihood function L(θ | x) represents the probability of observing the sample x given parameters θ. In the case of independent observations in the sample:

L(θ | x) = ∏_{i=1}^n f(x_i, θ)

The maximum likelihood estimator is (R. Fisher, 1912+):

θ̂_mle = argmax_{θ ∈ Θ} L(θ | x),

if a maximum exists.

Maximum likelihood estimator. For computational purposes the log-likelihood function is introduced:

l(θ | x) = ln L(θ | x) = ln ∏_{i=1}^n f(x_i, θ) = Σ_{i=1}^n ln f(x_i, θ)

Good limiting statistical properties of θ̂_mle:
- Consistency
- Asymptotic efficiency
- Asymptotic normality

Maximum likelihood estimator. First-order conditions (FOC):

l(θ | x) = Σ_{i=1}^n ln f(x_i, θ) → max_θ

∂l(θ | x) / ∂θ_k = 0 for all k = 1, …, K

Not all log-likelihood functions have analytical derivatives!

MLE example: multivariate normal. For example, for the multivariate normal variable X ~ MVN(μ, Σ), θ_mv = (μ, Σ):

f_mv(x, θ_mv) = φ(x, μ, Σ) = 1 / √((2π)^d det Σ) · exp(−½ (x − μ)^T Σ^{-1} (x − μ)) = (2π)^{−d/2} (det Σ)^{−1/2} exp(−½ (x − μ)^T Σ^{-1} (x − μ))

MLE example: multivariate normal. The log-likelihood function:

l_mv(μ, Σ | x) = Σ_{i=1}^n ln φ(x_i, μ, Σ) = −(nd/2) ln 2π − (n/2) ln det Σ − ½ Σ_{i=1}^n (x_i − μ)^T Σ^{-1} (x_i − μ)

FOC:

∂l_mv(μ, Σ | x) / ∂μ = 0
∂l_mv(μ, Σ | x) / ∂Σ = 0

MLE example: multivariate normal. Matrix calculus (for symmetric A):

∂(b^T A b) / ∂b = 2 b^T A
∂ ln det A / ∂A = A^{-1}

MLE example: multivariate normal.

∂l_mv(μ, Σ | x) / ∂μ = ∂/∂μ [−(nd/2) ln 2π − (n/2) ln det Σ − ½ Σ_{i=1}^n (x_i − μ)^T Σ^{-1} (x_i − μ)] = Σ_{i=1}^n (x_i − μ)^T Σ^{-1}

Setting this to zero we obtain the pleasant result:

μ̂ = (1/n) Σ_{i=1}^n x_i = x̄

MLE example: multivariate normal.

∂l_mv(μ, Σ | x) / ∂Σ = −(n/2) Σ^{-1} + ½ Σ_{i=1}^n Σ^{-1} (x_i − μ)(x_i − μ)^T Σ^{-1}

Setting this to zero we obtain the result:

Σ̂ = (1/n) Σ_{i=1}^n (x_i − μ)(x_i − μ)^T
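These closed-form estimators are easy to check numerically. A minimal sketch in Python/NumPy (the seminar's own script is in R; this translation and the parameter values in it are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Known parameters of a bivariate normal (d = 2), chosen for illustration
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.5],
                       [0.5, 1.0]])
x = rng.multivariate_normal(mu_true, Sigma_true, size=5000)

# MLE from the formulas above: sample mean and the 1/n scatter matrix
mu_hat = x.mean(axis=0)
Sigma_hat = (x - mu_hat).T @ (x - mu_hat) / len(x)

print(mu_hat)     # close to mu_true
print(Sigma_hat)  # close to Sigma_true
```

With n = 5000 both estimates land within sampling error of the true values, as the consistency property promises.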

2. Pseudo-likelihood

Pseudo-likelihood. There are a number of suggestions for modifying the likelihood function to extract the evidence in the sample concerning a parameter of interest θ_A when θ = (θ_A, θ_B). The sample vector x is also transformed into 2 parts: x → s = (s_A, s_B). Such modifications are generally known as pseudo-likelihood functions:
- Conditional likelihood
- Marginal likelihood
- Concentrated (profile) likelihood

Marginal likelihood. Marginal likelihood function:

f(X, θ) = f_s(s_A, s_B, θ_A, θ_B) = f_marginal,A(s_A | θ_A) · f_marginal,B(s_B | s_A, θ_A, θ_B)

Maximum likelihood estimates for θ_A are obtained by maximizing the marginal density f_marginal,A(s_A | θ_A).

Problems:
- Ignores some of the data
- Requires analytical forms of the functions

Conditional likelihood. Conditional likelihood function:

f(X, θ) = f_s(s_A, s_B, θ_A, θ_B) = f_conditional,A(s_A | s_B, θ_A) · f_conditional,B(s_B | θ_A, θ_B)

Maximum likelihood estimates for θ_A are obtained by maximizing the conditional density f_conditional,A(s_A | s_B, θ_A).

Problems:
- Ignores some of the data variability
- Requires analytical forms of the functions

Concentrated likelihood. Concentrated likelihood function:

f(X, θ) = f(X, θ_A, θ_B) → f_concentrated(X, θ_A) = f(X, θ_A, θ̂_B(θ_A))

Maximum likelihood estimates for θ_A are obtained by maximizing the concentrated likelihood f_concentrated.

Problems:
- Can be severely biased
- Requires θ̂_B(θ_A)

Concentrated likelihood.

l(θ_A, θ_B | x) → max over (θ_A, θ_B)

Taking ∂l(θ_A, θ_B | x) / ∂θ_B analytically and solving

∂l(θ_A, θ_B | x) / ∂θ_B = 0

we obtain θ̂_B(θ_A) and move to the concentrated (profile) likelihood.

Concentrated likelihood. For the multivariate normal example:

Σ̂(μ) = (1/n) Σ_{i=1}^n (x_i − μ)(x_i − μ)^T

The concentrated likelihood:

l_mv,concentrated(μ, Σ̂(μ) | x) = −(nd/2) ln 2π − (n/2) ln det[(1/n) Σ_{i=1}^n (x_i − μ)(x_i − μ)^T] − ½ Σ_{i=1}^n (x_i − μ)^T [(1/n) Σ_{k=1}^n (x_k − μ)(x_k − μ)^T]^{-1} (x_i − μ)

Concentrated likelihood. The quadratic term collapses: by the trace trick it equals tr(Σ̂(μ)^{-1} · n Σ̂(μ)) = nd, so

l_mv,concentrated(μ, Σ̂(μ) | x) = −(n/2) [d ln 2π + ln det((1/n) Σ_{i=1}^n (x_i − μ)(x_i − μ)^T) + d]

μ̂ = argmin_μ ln det Σ_{i=1}^n (x_i − μ)(x_i − μ)^T

This result is quite famous in econometrics!
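The argmin characterisation can be checked numerically: minimising ln det of the scatter matrix over μ recovers the sample mean. A small illustrative sketch using scipy.optimize (all data and names here are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=500)

def profile_objective(mu):
    # ln det of the scatter matrix sum_i (x_i - mu)(x_i - mu)^T
    d = x - mu
    return np.linalg.slogdet(d.T @ d)[1]

res = minimize(profile_objective, x0=np.zeros(2), method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-12})
print(res.x)           # minimiser of the concentrated objective
print(x.mean(axis=0))  # coincides with the sample mean
```

The two printed vectors agree to optimizer tolerance, which is exactly the concentrated-likelihood result above.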

3. Finite Mixture Models

Gaussian mixture model. Suppose we have a mixture of M multivariate random variables (for example, normal):

X_m ~ MVN(μ_m, Σ_m), m = 1, …, M,

with probability π_m for every class.

θ_gmm = (μ_1, …, μ_M, Σ_1, Σ_2, …, Σ_M, π_1, …, π_M)

McLachlan G., Peel D. (2000) Finite Mixture Models, Wiley Series in Probability and Statistics, John Wiley & Sons, New York.

Gaussian mixture model. [Plots of example mixture densities for d = 1 and d = 2.]

Gaussian mixture model. Applications:

Medical applications:
- Schlattmann P. (2009) Medical Applications of Finite Mixture Models, Statistics for Biology and Health, Springer.

Financial applications:
- Brigo, D.; Mercurio, F. (2002). Lognormal-mixture dynamics and calibration to market volatility smiles.
- Alexander, C. (2004). "Normal mixture diffusion with uncertain volatility: Modelling short- and long-term smile effects".

Image, speech, text recognition:
- Stylianou, Y. et al. (2005). GMM-Based Multimodal Biometric Verification.
- Reynolds, D., Rose, R. (1995). Robust text-independent speaker identification using Gaussian mixture speaker models.
- Permuter, H.; Francos, J.; Jermyn, I.H. (2003). Gaussian mixture models of texture and colour for image database retrieval.

Gaussian mixture model. Following the law of total probability, the likelihood function is:

L_gmm(θ_gmm | x) = ∏_{i=1}^n Σ_{m=1}^M π_m φ(x_i, μ_m, Σ_m)

l_gmm(θ_gmm | x) = ln L_gmm(θ_gmm | x) = Σ_{i=1}^n ln Σ_{m=1}^M π_m φ(x_i, μ_m, Σ_m) = Σ_{i=1}^n ln[π_1 φ(x_i, μ_1, Σ_1) + … + π_M φ(x_i, μ_M, Σ_M)]

The logarithm of a sum prevents analytical derivatives!
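Although the log-of-sum blocks closed-form derivatives, the mixture log-likelihood itself is easy to evaluate numerically. A minimal illustrative sketch (the parameter values are the two-class example DGP used later in the talk; the function name is ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglik(x, pis, mus, Sigmas):
    # l = sum_i ln sum_m pi_m * phi(x_i; mu_m, Sigma_m)
    dens = sum(p * multivariate_normal(m, S).pdf(x)
               for p, m, S in zip(pis, mus, Sigmas))
    return np.log(dens).sum()

pis = [0.8, 0.2]
mus = [np.array([1.0, 1.0]), np.array([3.0, 4.0])]
Sigmas = [np.array([[2.0, 0.0], [0.0, 2.0]]),
          np.array([[2.0, 0.7], [0.7, 1.0]])]

x = np.array([[1.0, 1.0], [3.0, 4.0]])
print(gmm_loglik(x, pis, mus, Sigmas))
```

In practice the inner sum is usually computed via log-sum-exp for numerical stability; the plain form above mirrors the formula on the slide.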

4. The Expectation-Maximization algorithm

EM-algorithm. The expectation-maximization (EM) algorithm is a general method for finding maximum likelihood estimates when there are missing values or latent variables. In the mixture model context, the missing data is represented by a set of observations of a discrete random variable Z that indicates which mixture component generated observation i:

z_im = 1 if observation i belongs to class m, 0 otherwise.

EM-algorithm: GMM.

z_im = 1 if observation i belongs to class m, 0 otherwise.

If Z = {z_im} is given, the log-likelihood

l_gmm(θ_gmm | x) = Σ_{i=1}^n ln[π_1 φ(x_i, μ_1, Σ_1) + … + π_M φ(x_i, μ_M, Σ_M)]

is transformed into the complete-data log-likelihood:

l_gmm,complete(θ_gmm | x, Z) = Σ_{i=1}^n Σ_{m=1}^M z_im ln[π_m φ(x_i, μ_m, Σ_m)] = Σ_{i=1}^n Σ_{m=1}^M z_im [ln π_m + ln φ(x_i, μ_m, Σ_m)]

EM-algorithm. The EM iteration includes:
- an expectation (E) step, which creates a function for the expectation of the log-likelihood evaluated using the current estimate of the parameters, and
- a maximization (M) step, which computes parameters maximizing the expected log-likelihood found on the E step.

These parameter estimates are then used to determine the distribution of the latent variables in the next E step.

EM-algorithm: GMM. Assume initial values θ_gmm^(0) and move to maximization of the expectation of the log-likelihood function:

E_Z[l_gmm,complete(θ_gmm | x, Z)] = Σ_{i=1}^n Σ_{m=1}^M E_Z[z_im | x_i, θ_gmm^(0)] (ln π_m + ln φ(x_i, μ_m, Σ_m))

EM-algorithm: E-step.

E_Z[z_im | x_i, θ_gmm^(0)] = τ_m(x_i, θ_gmm^(0)) = 0 · P(z_im = 0 | x_i, θ_gmm^(0)) + 1 · P(z_im = 1 | x_i, θ_gmm^(0)) = P(z_im = 1 | x_i, θ_gmm^(0)) = f(x_i | θ_gmm^(0), z_im = 1) P(z_im = 1 | θ_gmm^(0)) / f(x_i | θ_gmm^(0)) = π_m^(0) φ(x_i, μ_m^(0), Σ_m^(0)) / Σ_{m'=1}^M π_{m'}^(0) φ(x_i, μ_{m'}^(0), Σ_{m'}^(0))

These values are called class responsibilities.
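In code the responsibilities reduce to one vectorised normalisation per observation. A minimal illustrative sketch (the function name and test points are ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, pis, mus, Sigmas):
    """Responsibilities tau_m(x_i): rows are observations, columns are classes."""
    # Unnormalised weights pi_m * phi(x_i; mu_m, Sigma_m)
    w = np.column_stack([p * multivariate_normal(m, S).pdf(x)
                         for p, m, S in zip(pis, mus, Sigmas)])
    return w / w.sum(axis=1, keepdims=True)

x = np.array([[1.0, 1.0], [3.0, 4.0], [2.0, 2.0]])
tau = e_step(x, [0.8, 0.2],
             [np.array([1.0, 1.0]), np.array([3.0, 4.0])],
             [np.eye(2), np.eye(2)])
print(tau)  # each row sums to one
```

A point sitting on a class mean gets nearly all of that class's responsibility, as expected.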

EM-algorithm: M-step.

E_Z[l_gmm,complete(θ_gmm | x, Z)] = Σ_{i=1}^n Σ_{m=1}^M τ_m(x_i, θ_gmm^(0)) [ln π_m + ln φ(x_i, μ_m, Σ_m)]

FOC:

∂E_Z[l_gmm,complete(θ_gmm | x, Z)] / ∂μ_m = 0,
∂E_Z[l_gmm,complete(θ_gmm | x, Z)] / ∂Σ_m = 0,
∂E_Z[l_gmm,complete(θ_gmm | x, Z)] / ∂π_m = 0

EM-algorithm: M-step.

∂E_Z[l_complete(θ_gmm | x, Z)] / ∂μ_m = Σ_{i=1}^n τ_m(x_i, θ_gmm^(0)) ∂ln φ(x_i, μ_m, Σ_m) / ∂μ_m = Σ_{i=1}^n τ_m(x_i, θ_gmm^(0)) (x_i − μ_m)^T Σ_m^{-1} = 0

μ̂_m^(1) = Σ_{i=1}^n τ_m(x_i, θ_gmm^(0)) x_i / Σ_{i=1}^n τ_m(x_i, θ_gmm^(0))

EM-algorithm: M-step.

∂E_Z[l_complete(θ_gmm | x, Z)] / ∂Σ_m = Σ_{i=1}^n τ_m(x_i, θ_gmm^(0)) [−½ Σ_m^{-1} + ½ Σ_m^{-1} (x_i − μ_m)(x_i − μ_m)^T Σ_m^{-1}] = 0

Σ̂_m^(1) = Σ_{i=1}^n τ_m(x_i, θ_gmm^(0)) (x_i − μ̂_m^(1))(x_i − μ̂_m^(1))^T / Σ_{i=1}^n τ_m(x_i, θ_gmm^(0))

EM-algorithm: M-step. For π_m the constraint Σ_{m=1}^M π_m = 1 is handled with a Lagrange multiplier λ:

∂/∂π_m [E_Z[l_complete(θ_gmm | x, Z)] + λ (Σ_{m=1}^M π_m − 1)] = (1/π_m) Σ_{i=1}^n τ_m(x_i, θ_gmm^(0)) + λ = 0

π̂_m^(1) = (1/n) Σ_{i=1}^n τ_m(x_i, θ_gmm^(0))
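Collected together, the three M-step updates are just responsibility-weighted averages. A minimal illustrative sketch, assuming the responsibilities are given as an n × M array (function name and data are ours):

```python
import numpy as np

def m_step(x, tau):
    """Closed-form M-step updates for a Gaussian mixture from responsibilities tau."""
    n, M = tau.shape
    nk = tau.sum(axis=0)               # effective number of points per class
    pis = nk / n                       # pi_m^(1)
    mus = [tau[:, m] @ x / nk[m] for m in range(M)]   # mu_m^(1)
    Sigmas = []
    for m in range(M):
        d = x - mus[m]
        Sigmas.append((tau[:, m] * d.T) @ d / nk[m])  # Sigma_m^(1)
    return pis, mus, Sigmas

# Sanity check: with hard 0/1 responsibilities the updates reduce
# to the per-class single-normal MLE formulas.
x = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]])
tau = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
pis, mus, Sigmas = m_step(x, tau)
print(pis, mus[0])
```

With hard responsibilities, class 1 gets π̂ = 2/3 and its mean is the average of the first two points.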

EM-algorithm.
1. Initialisation: choose initial values θ^(0), calculate the likelihood l(θ^(0) | x) and set s = 0.
2. E-step: compute the expectations of the latent variables, E_Z[z_im | x_i, θ^(s)].
3. M-step: compute the new estimates θ^(s+1).
4. Convergence check: compute the new likelihood, and if l(θ^(s+1) | x) − l(θ^(s) | x) > precision, then return to step 2.
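Steps 1-4 fit in a short, self-contained routine. A Python sketch (the seminar's actual script is in R; this version and its names are illustrative, and θ^(0) is passed in as arguments):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, pis, mus, Sigmas, precision=1e-6, max_iter=200):
    """EM for a Gaussian mixture: E-step, M-step, convergence check."""
    ll_old = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities tau (n x M)
        w = np.column_stack([p * multivariate_normal(m, S).pdf(x)
                             for p, m, S in zip(pis, mus, Sigmas)])
        ll = np.log(w.sum(axis=1)).sum()      # current log-likelihood
        if ll - ll_old <= precision:          # convergence check (step 4)
            break
        ll_old = ll
        tau = w / w.sum(axis=1, keepdims=True)
        # M-step: closed-form updates derived above
        nk = tau.sum(axis=0)
        pis = nk / len(x)
        mus = [tau[:, m] @ x / nk[m] for m in range(len(nk))]
        Sigmas = [((tau[:, m] * (x - mus[m]).T) @ (x - mus[m])) / nk[m]
                  for m in range(len(nk))]
    return pis, mus, Sigmas, ll
```

By construction each pass through the loop does not decrease the log-likelihood, so the stopping rule on l(θ^(s+1) | x) − l(θ^(s) | x) is well defined.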

EM-algorithm. Dempster, Laird, and Rubin (1977) show that the likelihood function l(θ^(s) | x) is not decreased after an EM iteration; that is, for s = 0, 1, 2, …:

l(θ^(s+1) | x) ≥ l(θ^(s) | x)

See the proof in: McLachlan G.J., Krishnan T. (1997) The EM Algorithm and Extensions, Wiley. 304 p.

5. Numerical Example

Numerical example. DGP: d = 2

Class | π   | μ      | Σ
1     | 0.8 | (1, 1) | [[2, 0], [0, 2]]
2     | 0.2 | (3, 4) | [[2, 0.7], [0.7, 1]]

Implemented with R; the script is available on the seminar web page.
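Drawing a sample from this DGP takes only a few lines. A sketch in Python rather than the R script from the seminar page (the seed and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2016)
n = 1000

# Class labels: class 1 with probability 0.8, class 2 with probability 0.2
z = rng.random(n) < 0.8
comp1 = rng.multivariate_normal([1, 1], [[2, 0], [0, 2]], n)
comp2 = rng.multivariate_normal([3, 4], [[2, 0.7], [0.7, 1]], n)
x = np.where(z[:, None], comp1, comp2)
print(x.shape, z.mean())  # sample of 1000 bivariate points, class-1 share near 0.8
```

The resulting sample is what the EM iterations below are fitted to.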

Numerical example. Sample size n = 1000.

Iteration | Log-likelihood
0  | 0.7573044
1  | 12.96707
2  | 26.22704
3  | 29.21921
20 | 30.37445

[Plots of the fitted mixture at each iteration.]

Numerical example. Real values vs. estimates after 1000 observations:

Class | π   | μ      | Σ                    | π̂_em | μ̂_em           | Σ̂_em
1     | 0.8 | (1, 1) | [[2, 0], [0, 2]]     | 0.873 | (1.033, 1.125) | [[1.994, 0.113], [0.113, 2.406]]
2     | 0.2 | (3, 4) | [[2, 0.7], [0.7, 1]] | 0.127 | (3.592, 4.376) | [[1.203, 0.522], [0.522, 0.880]]

Problems with EM:
- Local maxima: partially solved with careful (repetitive) initial values selection
- Slow convergence (in some cases)
- Meta-algorithm: should be adapted for every specific problem
- Singularities and over-fitting

After EM. Next step: Variational Bayes:
- treat all parameters θ as missing variables
- iterate over the components of the missing variables (including θ) and recalculate their expectations

Recommended literature:
- McLachlan G., Krishnan T. (2008) The EM Algorithm and Extensions, Wiley Series in Probability and Statistics, 2nd Edition, 400 p.
- McLachlan G., Peel D. (2000) Finite Mixture Models, Wiley Series in Probability and Statistics, John Wiley & Sons, New York.
- Gelman A., Carlin J., Stern H., Dunson D., Vehtari A., Rubin D. Bayesian Data Analysis, Third Edition (Chapman & Hall/CRC Texts in Statistical Science). http://www.stat.columbia.edu/~gelman/book/

Thank you for your attention! Questions are very appreciated. Contacts: email: Dmitry.Pavlyuk@tsi.lv, phone: +37129958338