CSE 527, Additional notes on MLE & EM

Based on earlier notes by C. Grant & M. Narasimhan

Introduction

Last lecture we began an examination of model-based clustering. This lecture covers the technical background leading to the Expectation Maximization (EM) algorithm.

Do gene expression data fit a Gaussian model? The central limit theorem implies that the sum of a large number of independent, identically distributed random variables can be well approximated by a Normal distribution. While it is far from clear that expression data are a sum of independent variables, using the Normal distribution seems to work in practice. Besides, having a weak model is better than having no model at all.

Probability Basics

A random variable can be continuous or discrete (or both). A discrete random variable corresponds to a probability distribution on a discrete sample space, such as the roll of a die. A continuous random variable corresponds to a probability distribution on a continuous sample space such as the real line. Shown in the table below are two examples of probability distributions, the first representing a roll of an unbiased die and the second representing a Normal distribution.

                   Discrete                                   Continuous
    Sample space   {1, 2, ..., 6}                             the real line R
    Distribution   p_1, ..., p_6 >= 0, Σ_{i=1}^6 p_i = 1      f(x) >= 0, ∫ f(x) dx = 1
    Example        p_1 = p_2 = ... = p_6 = 1/6                f(x) = (1/(√(2π) σ)) e^{-(x-μ)²/(2σ²)}

[Figure: discrete probability distribution (bar chart, uniform over 1-6).]

[Figure: continuous probability distribution (Normal density curve).]

Parameter Estimation

Many distributions are parametrized. Typically, we have data x_1, x_2, ..., x_n that is sampled from a parametric distribution f(x | θ). Often, the goal is to estimate the parameter θ. The mean μ and variance σ² are often used as such parameters. Estimates of these quantities derived from the sampled data are often called the sample statistics, while the (true) parameters based on the entire sample space are called the population statistics. The following table illustrates these two concepts.

                          Discrete                        Continuous
    Population mean       μ = Σ_i i p_i                   μ = ∫ x f(x) dx
    Population variance   σ² = Σ_i (i - μ)² p_i           σ² = ∫ (x - μ)² f(x) dx
    Sample mean           x̄ = (Σ_{i=1}^n x_i)/n                  (same in both cases)
    Sample variance       s̄² = (Σ_{i=1}^n (x_i - x̄)²)/n          (same in both cases)

While the sample statistics can be used as estimates of these parameters, this is often not the preferred way of estimating these quantities. For example, the sample variance s̄² = (Σ_{i=1}^n (x_i - x̄)²)/n is a biased estimate of the true variance because it underestimates that quantity (an unbiased estimate of the variance is given by s² = (Σ_{i=1}^n (x_i - x̄)²)/(n - 1)).

Maximum Likelihood Estimation is one of many parameter estimation techniques (note that the MLE is not guaranteed to be unbiased either). Assuming the data are independent, the likelihood of the data x_1, x_2, ..., x_n given the parameter θ is

    L(x_1, x_2, ..., x_n | θ) = Π_{i=1}^n f(x_i | θ)

where f is the probability density function of the presumed distribution (which of course depends on θ). Note that the x_i are known constants, not variables; they are the values we observed. On the other hand, θ is unknown. We treat the likelihood L as a function of θ and ask what value of θ maximizes it. The typical approach is to solve

    ∂/∂θ L(x_1, x_2, ..., x_n | θ) = 0

Since the likelihood function is always positive (and we may assume it to be strictly positive), the log likelihood

    ln L(x_1, x_2, ..., x_n | θ) = ln Π_{i=1}^n f(x_i | θ) = Σ_{i=1}^n ln f(x_i | θ)

is well defined, and by the monotonicity of the logarithm, the log likelihood is maximized exactly when the likelihood is maximized. Hence we can instead solve

    ∂/∂θ ln L(x_1, x_2, ..., x_n | θ) = 0
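
When no closed form is available, the same log-likelihood can be maximized numerically. The following is a minimal sketch, not from the notes, assuming NumPy and SciPy are available and using an Exponential(rate) model as a stand-in example; the data and rate value are hypothetical.

    # Minimal numerical MLE sketch (illustration only): for Exponential(rate) data
    # the MLE of the rate is 1/mean(x); we recover it by maximizing ln L numerically.
    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1 / 2.5, size=1000)   # hypothetical data, true rate = 2.5

    def neg_log_likelihood(rate):
        # ln L = sum_i ln f(x_i | rate), with f(x | rate) = rate * exp(-rate * x)
        return -np.sum(np.log(rate) - rate * x)

    result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100), method="bounded")
    print("numerical MLE:      ", result.x)          # close to 2.5
    print("closed form 1/mean: ", 1 / x.mean())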

Note that in general, these conditions are satisfied by maxima, minima, and stationary points of the log-likelihood function. (A "stationary point" is a temporary flat spot on a curve that otherwise tends upward or downward.) Further, if θ is restricted to some bounded range, the maximum might occur at the boundary, which does not satisfy this condition. Therefore, we need to check the boundaries separately. Here is an example which illustrates this procedure.

Example 1. Let x_1, x_2, ..., x_n be coin flips, and let θ be the probability of getting heads. Suppose we observe n_0 tails and n_1 heads (n_0 + n_1 = n). Then the likelihood function is given by

    L(x_1, x_2, ..., x_n | θ) = (1 - θ)^{n_0} θ^{n_1}

Hence the log-likelihood function is

    ln L(x_1, x_2, ..., x_n | θ) = n_0 ln(1 - θ) + n_1 ln θ

To find a value of θ that maximizes this function, we solve

    ∂/∂θ ln L(x_1, x_2, ..., x_n | θ) = -n_0/(1 - θ) + n_1/θ = 0

This yields

    n_1 (1 - θ) - n_0 θ = 0
    n_1 = (n_0 + n_1) θ
    θ = n_1/(n_0 + n_1) = n_1/n

(The sign of the second derivative can then be checked to guarantee that this is a maximum, not a minimum. Likewise, you can easily verify that the maximum is not attained at the boundaries of the parameter space, i.e. at θ = 0 or θ = 1.) This estimate for the parameter of the distribution matches our intuition.

Example 2. Suppose x_i ~ N(μ, σ²), with σ² = 1 and μ unknown. Then
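
A quick numeric check of Example 1, offered only as an illustration (NumPy and the counts below are my own, not from the notes): the grid maximizer of the coin-flip log likelihood should sit at n_1/n.

    # Verify that the log likelihood n0*ln(1-θ) + n1*ln(θ) peaks at θ = n1/n.
    import numpy as np

    n0, n1 = 38, 62                            # hypothetical tail/head counts
    theta = np.linspace(0.001, 0.999, 9999)    # grid over the open interval (0, 1)
    log_lik = n0 * np.log(1 - theta) + n1 * np.log(theta)

    print("grid argmax:       ", theta[np.argmax(log_lik)])   # ~0.62
    print("closed form n1/n:  ", n1 / (n0 + n1))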

    L(x_1, x_2, ..., x_n | θ) = Π_{i=1}^n (1/√(2π)) e^{-(x_i - θ)²/2}

    ln L(x_1, x_2, ..., x_n | θ) = Σ_{i=1}^n ( -(1/2) ln 2π - (x_i - θ)²/2 )

    ∂/∂θ ln L(x_1, x_2, ..., x_n | θ) = Σ_{i=1}^n (x_i - θ) = Σ_{i=1}^n x_i - nθ = 0

So the value of θ that maximizes the likelihood is

    θ = (Σ_{i=1}^n x_i)/n

Again matching our intuition: the sample mean is the maximum likelihood estimator (MLE) for the population mean.

Example 3. Suppose x_i ~ N(μ, σ²), with both σ² and μ unknown. Then

    L(x_1, ..., x_n | θ_1, θ_2) = Π_{i=1}^n (1/√(2π θ_2)) e^{-(x_i - θ_1)²/(2 θ_2)}

    ln L(x_1, ..., x_n | θ_1, θ_2) = Σ_{i=1}^n ( -(1/2) ln 2π θ_2 - (x_i - θ_1)²/(2 θ_2) )

    ∂/∂θ_1 ln L = Σ_{i=1}^n (x_i - θ_1)/θ_2 = 0   ⟹   θ_1 = (Σ_{i=1}^n x_i)/n

    ∂/∂θ_2 ln L = Σ_{i=1}^n ( -2π/(2·2π θ_2) + (x_i - θ_1)²/(2 θ_2²) )
                = Σ_{i=1}^n ( -1/(2 θ_2) + (x_i - θ_1)²/(2 θ_2²) ) = 0   ⟹   θ_2 = (Σ_{i=1}^n (x_i - θ_1)²)/n

The MLE for the population variance is the sample variance. This is a biased estimator: it systematically underestimates the population variance, but is nonetheless the MLE. The MLE doesn't promise an unbiased estimator, but it is a reasonable approach.

Expectation Maximization

The MLE approach works well when we have relatively simple parametrized distributions. However, in more complicated situations we may not be able to solve for the ML estimate because the complexity of the likelihood function precludes both analytical and numerical optimization. The EM algorithm can be thought of as an algorithm that provides a tractable approximation to the ML estimate.
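
The bias of the MLE variance can be seen in a small simulation. This is an illustration only (NumPy, sample size, and variance chosen by me): the divide-by-n estimator comes in low on average, by roughly the factor (n - 1)/n, while the divide-by-(n - 1) estimator does not.

    # Simulate many samples of size n and compare the two variance estimators.
    import numpy as np

    rng = np.random.default_rng(1)
    true_var, n, trials = 4.0, 10, 200_000

    samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))
    mle_var = samples.var(axis=1, ddof=0)        # divide by n      (the MLE)
    unbiased_var = samples.var(axis=1, ddof=1)   # divide by n - 1  (unbiased)

    print("mean of MLE estimates:      ", mle_var.mean())        # ~ 4.0 * (n-1)/n = 3.6
    print("mean of unbiased estimates: ", unbiased_var.mean())   # ~ 4.0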

Consider the following example. We have data corresponding to heights of individuals, as shown in the figures below. Is this distribution likely to be Normal, as shown below?

[Figure: observed data points with a single Normal density curve.]

Or is there some hidden variable, like gender, so the distribution should be more like this:

[Figure: the same data points with a mixture of two Normal density curves.]

The clustering problem is essentially a parameter estimation problem: try to find whether there are hidden parameters that cause the data to fall into two distributions f_1(x), f_2(x). These distributions depend on some parameter θ: f_1(x, θ), f_2(x, θ), and there are also mixing parameters τ_1 and τ_2, with τ_1 + τ_2 = 1, which describe the probability of sampling from each group. Can we estimate the parameters for this more complex model? Let's suppose that the two groups are normal but with different, unknown, parameters. The likelihood is now given by

    L(x_1, x_2, ..., x_n | τ_1, τ_2, μ_1, μ_2, σ_1², σ_2²) = Π_{i=1}^n Σ_{j=1}^2 τ_j f_j(x_i | θ_j)
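
For concreteness, here is a short sketch (not from the notes; SciPy, the height values, and the parameter values are all hypothetical) that evaluates the log of this mixture likelihood. Notice that the sum over components sits inside the logarithm, which is exactly the product-of-sums structure that makes direct maximization awkward.

    # Evaluate ln L for a two-component Gaussian mixture at given parameters.
    import numpy as np
    from scipy.stats import norm

    x = np.array([150.0, 155.0, 162.0, 170.0, 175.0, 181.0])   # hypothetical heights

    def mixture_log_likelihood(x, tau, mu, sigma):
        # per-point density: sum_j tau_j * f_j(x_i | mu_j, sigma_j)
        per_point = sum(t * norm.pdf(x, m, s) for t, m, s in zip(tau, mu, sigma))
        return np.sum(np.log(per_point))

    print(mixture_log_likelihood(x, tau=[0.5, 0.5], mu=[160.0, 178.0], sigma=[6.0, 6.0]))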

If we try to work with this in our existing framework it becomes messy and algebraically intractable, due to the product-of-sums form, and remains so even if we take the log of the likelihood. This leads us to introduce the Expectation Maximization (EM) algorithm as a heuristic for finding the MLE. It is particularly useful for problems containing a hidden variable. It uses a hill-climbing strategy to find a local maximum of the likelihood.

Introduce new variables

    z_ij = 1 if x_i was sampled from distribution j, and 0 otherwise.

These variables are introduced for mathematical convenience. They let us avoid a sum over j in the expression for the likelihood. The full data table becomes

    x_1   z_11   z_12
    x_2   z_21   z_22
    ...
    x_n   z_n1   z_n2

If the z were known, estimating τ_1, τ_2 would be easy, and estimation of the other parameters would again be easy. If we knew the parameters, estimation of the z would be easy. The EM algorithm iterates over these alternatives. It can be proved that the likelihood is monotonically increasing, and so will converge to a (local) maximum.

[There is a polynomial time algorithm for estimating Gaussian mixtures under the assumption that the components are "well-separated," but the method is not used much in practice. I don't know whether the complexity of the general problem is known; plausibly it's NP-hard. So, the EM algorithm is probably the method of choice.]

Expectation step

Assume fixed values for τ_j and θ_j. Let A be the event that x_i is drawn from the distribution f_1, let B be the event that x_i is drawn from f_2, and let D be the event that x_i is observed. We want P(A | D), but it is easier to find P(D | A). We use Bayes' rule:

    P(A | D) = P(D | A) P(A) / P(D)

    P(D) = P(D | A) P(A) + P(D | B) P(B) = τ_1 P(D | A) + τ_2 P(D | B) = τ_1 f_1(x_i | θ_1) + τ_2 f_2(x_i | θ_2)
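
A minimal sketch of this computation (illustration only; SciPy and the data/parameter values are my own assumptions): for each x_i, the expectation step returns E(z_ij) = τ_j f_j(x_i | θ_j) divided by the total mixture density at x_i, exactly the Bayes' rule ratio above.

    # Expectation step for the two-component Gaussian mixture.
    import numpy as np
    from scipy.stats import norm

    def e_step(x, tau, mu, sigma):
        # weighted component densities, shape (n, 2)
        weighted = np.column_stack([t * norm.pdf(x, m, s) for t, m, s in zip(tau, mu, sigma)])
        return weighted / weighted.sum(axis=1, keepdims=True)   # E(z_ij); rows sum to 1

    x = np.array([150.0, 155.0, 162.0, 170.0, 175.0, 181.0])    # hypothetical heights
    print(e_step(x, tau=[0.5, 0.5], mu=[160.0, 178.0], sigma=[6.0, 6.0]))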

P(A | D) is the expected value of z_i1 given θ_1 and θ_2. This is the expectation step of the EM algorithm.

To be concrete, consider a sample of n points taken from a mixture of two Gaussian distributions with unknown parameters and unknown mixing coefficients. The EM algorithm will give estimates of the parameters that raise the likelihood of the data. An easy heuristic to apply is

    If E(z_i1) >= 1/2, set z_i1 = 1
    If E(z_i1) < 1/2, set z_i1 = 0

This gives rise to the so-called Classification EM algorithm (we classify each observation as coming from exactly one of the component distributions). The k-means clustering algorithm is an example. In this case, the maximization step is just like the simple Maximum Likelihood Estimation examples considered above. The more general M-step (below) accounts for the inherent uncertainty in these classifications, appropriately weighting the contributions of each observation to the parameter estimates for each mixture component.

Maximization step

The expression for the likelihood is

    L(x_1, z_11, z_12, x_2, z_21, z_22, ... | θ, τ)

The x_i are known. If the z_ij were known, finding the MLE of θ, τ would be easy, but we don't know them. Instead we maximize the expected log likelihood of the visible data, E(ln L(x_1, ..., x_n | θ, τ)), where the expectation is taken over the distribution of the hidden variables z_ij. Assuming σ_1 = σ_2 = σ and τ_1 = τ_2 = τ = 1/2:

    L(x, z | θ, τ) = τ^n Π_{i=1}^n (1/√(2πσ²)) e^{-(1/(2σ²)) Σ_j z_ij (x_i - μ_j)²}

so

    E(ln L(x, z | θ, τ)) = E( Σ_{i=1}^n [ ln(1/2) - (1/2) ln 2πσ² - (1/(2σ²)) Σ_j z_ij (x_i - μ_j)² ] )
                         = Σ_{i=1}^n [ ln(1/2) - (1/2) ln 2πσ² - (1/(2σ²)) Σ_j E(z_ij) (x_i - μ_j)² ]
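
A minimal sketch of the maximization step under these simplifying assumptions (shared σ, τ fixed at 1/2); this is my own illustration, not the notes' code, and `resp` stands for the E(z_ij) matrix produced by an expectation step. Setting the derivatives of the expected log likelihood to zero gives each μ_j as the E(z_ij)-weighted average of the x_i, and σ² as the correspondingly weighted average squared deviation.

    # Maximization step: weighted means and a shared variance.
    import numpy as np

    def m_step(x, resp):
        # resp[i, j] = E(z_ij) from the expectation step, shape (n, 2)
        mu = resp.T @ x / resp.sum(axis=0)                        # weighted means
        sigma2 = np.sum(resp * (x[:, None] - mu) ** 2) / len(x)   # shared variance
        return mu, np.sqrt(sigma2)

Combined with the e_step sketch above, alternating these two updates is the whole EM loop, shown after the next paragraph.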

The last step above depends on the important fact that expectation is linear: if c and d are constants and X and Y are random variables, then E(cX + dY) = c E(X) + d E(Y). We calculated E(z_ij) in the previous step. We can now solve for the μ_j that maximize the expectation by the methods given earlier: set derivatives to zero, etc. With a little more algebra you will see that the MLE for μ_j is the weighted average of the x_i's, where the weights are the E(z_ij)'s, which makes sense intuitively: if a given point x_i has a high probability of having been sampled from distribution 1, then it will contribute strongly to our estimate of μ_1 and weakly to our estimate of μ_2.

It can be shown that this procedure increases the likelihood at every iteration, hence is guaranteed to converge to a local maximum. Unfortunately, it is not guaranteed to be the global maximum, but empirically it works well in many situations.
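
Putting the pieces together, here is a toy EM loop that reuses the e_step and m_step sketches above. It is an illustration only: the data are simulated, the initial guesses are arbitrary, and τ is held fixed at 1/2 throughout, matching the simplifying assumptions of the M-step derivation.

    # Toy EM loop for the two-component, shared-sigma model.
    import numpy as np

    rng = np.random.default_rng(2)
    x = np.concatenate([rng.normal(160.0, 6.0, 300), rng.normal(178.0, 6.0, 300)])

    mu, sigma = np.array([155.0, 185.0]), 10.0      # rough initial guesses
    for _ in range(50):
        resp = e_step(x, tau=[0.5, 0.5], mu=mu, sigma=[sigma, sigma])   # E-step
        mu, sigma = m_step(x, resp)                                     # M-step

    print("estimated means:", mu)       # should approach roughly 160 and 178
    print("estimated sigma:", sigma)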