ECE531 Lecture 10b: Maximum Likelihood Estimation


ECE531 Lecture 10b: Maximum Likelihood Estimation
D. Richard Brown III
Worcester Polytechnic Institute
05-Apr-2011

Introduction

So far, we have three techniques for finding good estimators when the parameter is non-random:

1. The Rao-Blackwell-Lehmann-Scheffe (RBLS) technique for finding MVU estimators.
2. "Guess and check": use your intuition to guess a good estimator and then compare its variance to the information inequality or CRLB.
3. BLUE.

The first two techniques can fail or be difficult to use in some scenarios, e.g.

- we can't find a complete sufficient statistic,
- we can't compute the conditional expectation in the RBLS theorem,
- we can't invert the Fisher information matrix, ...

and BLUE may not be good enough. Today we will learn another popular technique that can often help us find good estimators: the maximum likelihood criterion.

The Maximum Likelihood Criterion

Suppose for a moment that our parameter \Theta is random (as in Bayesian estimation) and we have a prior \pi_\Theta(\theta) on the unknown parameter. The maximum a posteriori (MAP) estimator can be written as

\hat{\theta}_{MAP}(y) = \arg\max_{\theta \in \Lambda} p_{\Theta|Y}(\theta \mid y) \overset{\text{Bayes}}{=} \arg\max_{\theta \in \Lambda} p_{Y|\Theta}(y \mid \theta)\,\pi_\Theta(\theta).

If you were unsure about this prior, you might assume a least favorable prior. Intuitively, what sort of prior would give the least information about the parameter? We could assume the prior is uniformly distributed over \Lambda, i.e. \pi_\Theta(\theta) \equiv \pi > 0 for all \theta \in \Lambda. Then

\hat{\theta}_{MAP}(y) = \arg\max_{\theta \in \Lambda} p_Y(y;\theta)\,\pi = \arg\max_{\theta \in \Lambda} p_Y(y;\theta) =: \hat{\theta}_{ML}(y)

finds the value of \theta that makes the observation Y = y most likely. \hat{\theta}_{ML}(y) is called the maximum likelihood (ML) estimator.

Problem 1: If \Lambda is not a bounded set, e.g. \Lambda = \mathbb{R}, then \pi = 0. A uniform distribution on \mathbb{R} (or any unbounded \Lambda) doesn't really make sense.

Problem 2: Assuming a uniform prior is not the same as assuming the parameter is non-random, which is the setting we are actually in here.

The Maximum Likelihood Criterion

Even though our development of the ML estimator is questionable, the criterion of finding the value of \theta that makes the observation Y = y most likely (assuming all parameter values are equally likely) is still interesting.

Note that, since ln is strictly monotonically increasing,

\hat{\theta}_{ML}(y) = \arg\max_{\theta \in \Lambda} p_Y(y;\theta) = \arg\max_{\theta \in \Lambda} \ln p_Y(y;\theta).

Assuming that \ln p_Y(y;\theta) is differentiable, we can state a necessary (but not sufficient) condition for the ML estimator:

\frac{\partial}{\partial\theta} \ln p_Y(y;\theta) \Big|_{\theta = \hat{\theta}_{ML}(y)} = 0,

where the partial derivative is taken to mean the gradient \nabla_\theta for multiparameter estimation problems. This expression is known as the likelihood equation.
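As a small numerical aside (not from the slides), the Python sketch below evaluates both p_Y(y;\theta) and \ln p_Y(y;\theta) on a grid of candidate \theta values for a toy i.i.d. N(\theta, 1) data set and confirms that the two maximizers coincide with the sample mean. The data values, grid, and unit-variance model are purely illustrative assumptions.

```python
import numpy as np

# Illustrative data set: n observations assumed to follow N(theta, 1)
y = np.array([0.8, 1.3, 0.4, 1.1, 0.9])

# Coarse grid of candidate parameter values standing in for Lambda
theta_grid = np.linspace(-2.0, 3.0, 5001)

def likelihood(theta):
    """p_Y(y; theta) for i.i.d. N(theta, 1) observations (vectorized over theta)."""
    return np.prod(np.exp(-0.5 * (y[:, None] - theta) ** 2) / np.sqrt(2 * np.pi), axis=0)

def log_likelihood(theta):
    """ln p_Y(y; theta) for the same model."""
    return np.sum(-0.5 * (y[:, None] - theta) ** 2 - 0.5 * np.log(2 * np.pi), axis=0)

theta_from_p   = theta_grid[np.argmax(likelihood(theta_grid))]
theta_from_lnp = theta_grid[np.argmax(log_likelihood(theta_grid))]

# All three values agree (up to the grid spacing): maximizing p or ln p is the same problem
print(theta_from_p, theta_from_lnp, y.mean())
```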

The Likelihood Equation's Relationship with the CRLB

Recall from last week's lecture that we can attain the CRLB if and only if p_Y(y;\theta) is of the form

p_Y(y;\theta) = h(y)\,\exp\left\{ \int_a^\theta I(t)\,[\hat{\theta}(y) - t]\,dt \right\}

for all y \in \mathcal{Y}, in which case

\ln p_Y(y;\theta) = \ln h(y) + \int_a^\theta I(t)\,[\hat{\theta}(y) - t]\,dt.

When this condition holds, the likelihood equation becomes

\frac{\partial}{\partial\theta} \ln p_Y(y;\theta) \Big|_{\theta = \hat{\theta}_{ML}(y)} = I(\theta)\,[\hat{\theta}(y) - \theta] \Big|_{\theta = \hat{\theta}_{ML}(y)} = 0,

which, as long as I(\theta) > 0, has the unique solution \hat{\theta}_{ML}(y) = \hat{\theta}(y).

What does this mean? If \hat{\theta}(y) attains the CRLB, it must be a solution to the likelihood equation. The converse is not always true, however.

Some Initial Properties of Maximum Likelihood Estimators

- If \hat{\theta}(y) attains the CRLB, it must be a solution to the likelihood equation. In this case, \hat{\theta}_{ML}(y) = \hat{\theta}_{MVU}(y).
- Solutions to the likelihood equation may not achieve the CRLB. In this case, it may be possible to find other unbiased estimators with lower variance than the ML estimator.

Example: Estimating a Constant in White Gaussian Noise

Suppose we have random observations given by

Y_k = \theta + W_k, \quad k = 0,\dots,n-1,

where W_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\sigma^2). The unknown parameter \theta is non-random and can take on any value on the real line (we have no prior pdf).

Let's set up the likelihood equation \frac{\partial}{\partial\theta}\ln p_Y(y;\theta)\big|_{\theta=\hat{\theta}_{ML}(y)} = 0. We can write

\frac{\partial}{\partial\theta} \ln\left[ \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{\sum_{k=0}^{n-1}(y_k - \theta)^2}{2\sigma^2} \right\} \right] \Bigg|_{\theta=\hat{\theta}_{ML}} = \frac{1}{\sigma^2} \sum_{k=0}^{n-1} (y_k - \hat{\theta}_{ML}) = 0,

hence

\hat{\theta}_{ML}(y) = \frac{1}{n} \sum_{k=0}^{n-1} y_k = \bar{y}.

This solution is unique. You can also easily confirm that it maximizes \ln p_Y(y;\theta).
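The following sketch (an illustration, not part of the original slides) checks the closed-form result above: for simulated data Y_k = \theta + W_k, the sample mean should agree with a brute-force numerical maximization of the log-likelihood. The true \theta, \sigma, sample size, and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, sigma, n = 2.0, 1.0, 50                  # illustrative values
y = theta_true + sigma * rng.standard_normal(n)

def neg_log_likelihood(theta):
    """-ln p_Y(y; theta) for Y_k = theta + W_k with W_k ~ N(0, sigma^2)."""
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum((y - theta)**2) / (2 * sigma**2)

# Closed-form solution of the likelihood equation: the sample mean
theta_ml = y.mean()

# Brute-force check: the sample mean should (nearly) minimize the negative log-likelihood
grid = np.linspace(theta_ml - 1.0, theta_ml + 1.0, 2001)
theta_numeric = grid[np.argmin([neg_log_likelihood(t) for t in grid])]

print(theta_ml, theta_numeric)                       # agree up to the grid spacing
```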

Example: Estimating the Parameter of an Exponential Distribution

Suppose we have random observations Y_k \overset{\text{i.i.d.}}{\sim} \theta e^{-\theta y_k} for k = 0,\dots,n-1 and y_k \geq 0. The unknown parameter \theta > 0 and we have no prior pdf.

Let's set up the likelihood equation \frac{\partial}{\partial\theta}\ln p_Y(y;\theta)\big|_{\theta=\hat{\theta}_{ML}(y)} = 0. Since

p_Y(y;\theta) = \prod_k p_{Y_k}(y_k;\theta) = \theta^n \exp\Big\{ -\theta \sum_k y_k \Big\},

we can write

\frac{\partial}{\partial\theta} \ln\left[ \theta^n \exp\{-n\theta\bar{y}\} \right] \Big|_{\theta=\hat{\theta}_{ML}} = \frac{\partial}{\partial\theta}\left( n\ln\theta - n\theta\bar{y} \right) \Big|_{\theta=\hat{\theta}_{ML}} = \frac{n}{\hat{\theta}_{ML}(y)} - n\bar{y} = 0,

which has the unique solution \hat{\theta}_{ML}(y) = 1/\bar{y}. You can easily confirm that this solution is a maximum of p_Y(y;\theta) since \frac{\partial^2}{\partial\theta^2}\ln p_Y(y;\theta) = -\frac{n}{\theta^2} < 0.
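Here is a similar hedged sketch for the exponential example: it simulates i.i.d. exponential observations with rate \theta, computes the closed-form ML estimate 1/\bar{y}, and compares it against a grid search over the log-likelihood n\ln\theta - \theta\sum_k y_k. Note that NumPy's exponential sampler is parameterized by the mean (scale = 1/\theta); the particular rate, sample size, and grid are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, n = 3.0, 200                             # illustrative rate and sample size
y = rng.exponential(scale=1.0 / theta_true, size=n)  # NumPy uses scale = 1/theta (the mean)

def log_likelihood(theta):
    """ln p_Y(y; theta) = n ln(theta) - theta * sum(y) for i.i.d. exponential observations."""
    return n * np.log(theta) - theta * y.sum()

theta_ml = 1.0 / y.mean()                            # closed-form solution of the likelihood equation

grid = np.linspace(0.1, 10.0, 9901)                  # grid search over positive rates
theta_numeric = grid[np.argmax(log_likelihood(grid))]

print(theta_ml, theta_numeric)
```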

Example: Estimating the Parameter of an Exponential Distribution (continued)

Assuming n \geq 2, the mean of the ML estimator can be computed as

E_\theta\{\hat{\theta}_{ML}(Y)\} = \frac{n}{n-1}\,\theta.

Note that \hat{\theta}_{ML}(y) is biased for finite n, but it is asymptotically unbiased in the sense that \lim_{n\to\infty} E_\theta\{\hat{\theta}_{ML}(Y)\} = \theta.

Assuming n \geq 3, the variance of the ML estimator can be computed as

\mathrm{var}_\theta\{\hat{\theta}_{ML}(Y)\} = \frac{n^2\theta^2}{(n-1)^2(n-2)} > \frac{\theta^2}{n} = I^{-1}(\theta),

where I(\theta) is the Fisher information for this estimation problem. Since \frac{\partial}{\partial\theta}\ln p_Y(y;\theta) is not of the form k(\theta)\,[\hat{\theta}_{ML}(y) - f(\theta)], we know that the information inequality cannot be attained here. Nevertheless, \hat{\theta}_{ML}(y) is asymptotically efficient in the sense that \mathrm{var}_\theta\{\hat{\theta}_{ML}(Y)\} / I^{-1}(\theta) \to 1 as n \to \infty.
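A quick Monte Carlo check of these finite-sample and asymptotic claims (again an illustration, with arbitrary \theta, trial count, and sample sizes): the empirical bias of 1/\bar{Y} should track \theta/(n-1), and the ratio of its empirical variance to the CRLB \theta^2/n should approach 1 as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, trials = 2.0, 20_000                          # illustrative rate and Monte Carlo size

for n in (5, 20, 100, 500):
    y = rng.exponential(scale=1.0 / theta, size=(trials, n))
    theta_ml = 1.0 / y.mean(axis=1)                  # ML estimate for each trial
    bias = theta_ml.mean() - theta                   # predicted: theta / (n - 1)
    var_over_crlb = theta_ml.var() / (theta**2 / n)  # predicted: n^3 / ((n-1)^2 (n-2)) -> 1
    print(n, bias, var_over_crlb)
```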

Example: Estimating the Parameter of an Exponential Distribution (continued)

It can be shown, using the RBLS theorem and our theorem about complete sufficient statistics for exponential density families, that

\hat{\theta}_{MVU}(y) = \left( \frac{1}{n-1} \sum_{k=0}^{n-1} y_k \right)^{-1} = \frac{n-1}{n}\,\hat{\theta}_{ML}(y).

Assuming n \geq 3, the variance of the MVU estimator is then

\mathrm{var}_\theta\{\hat{\theta}_{MVU}(Y)\} = \frac{\theta^2}{n-2} > \frac{\theta^2}{n} = I^{-1}(\theta).

Which is better, \hat{\theta}_{ML}(y) or \hat{\theta}_{MVU}(y)?

Which is Better, MVU or ML?

To answer this question, you have to return to what got us here: the squared error cost function.

E_\theta\{ (\hat{\theta}_{ML}(Y) - \theta)^2 \} = \mathrm{var}_\theta\{\hat{\theta}_{ML}(Y)\} + \left( \frac{\theta}{n-1} \right)^2 = \frac{n+2}{n-1} \cdot \frac{\theta^2}{n-2} > \frac{\theta^2}{n-2} = \mathrm{var}_\theta\{\hat{\theta}_{MVU}(Y)\}.

The MVU estimator is preferable to the ML estimator for finite n. Asymptotically, however, their squared error performance is equivalent.
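The sketch below compares the empirical MSE of the ML and MVU estimators for the exponential example against the closed-form expressions above. The specific rate \theta = 2, n = 10, and number of trials are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, trials = 2.0, 10, 200_000                  # illustrative values

y = rng.exponential(scale=1.0 / theta, size=(trials, n))
theta_ml  = 1.0 / y.mean(axis=1)                     # ML estimate: n / sum(y)
theta_mvu = (n - 1) / y.sum(axis=1)                  # MVU estimate: ((n-1)/n) * ML estimate

print("MSE(ML): ", np.mean((theta_ml  - theta)**2),
      "predicted:", (n + 2) / (n - 1) * theta**2 / (n - 2))
print("MSE(MVU):", np.mean((theta_mvu - theta)**2),
      "predicted:", theta**2 / (n - 2))
```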

Example: Estimating the Mean and Variance of WGN

Suppose we have random observations Y_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu,\sigma^2) for k = 0,\dots,n-1. The unknown vector parameter is \theta = [\mu, \sigma^2]^\top, where \mu \in \mathbb{R} and \sigma^2 > 0. The joint density is given by

p_Y(y;\theta) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{\sum_{k=0}^{n-1}(y_k - \mu)^2}{2\sigma^2} \right\}.

How should we approach this joint maximization problem? In general, we would compute the gradient \nabla_\theta \ln p_Y(y;\theta)\big|_{\theta=\hat{\theta}_{ML}(y)}, set the result equal to zero, and then try to solve the simultaneous equations (also using the Hessian to check that we did indeed find a maximum)...

Or we could recognize that finding the value of \mu that maximizes p_Y(y;\theta) does not depend on \sigma^2.

Example: Estimating the Mean and Variance of WGN

We already know \hat{\mu}_{ML}(y) for estimating \mu: \hat{\mu}_{ML}(y) = \bar{y} (same as the MVU estimator). So now we just need to solve

\hat{\sigma}^2_{ML} = \arg\max_{a>0}\; \ln\left[ \frac{1}{(2\pi a)^{n/2}} \exp\left\{ -\frac{\sum_{k=0}^{n-1}(y_k - \bar{y})^2}{2a} \right\} \right].

Skipping the details (standard calculus; see Example IV.D.2 in the Poor textbook), it can be shown that

\hat{\sigma}^2_{ML} = \frac{1}{n} \sum_{k=0}^{n-1} (y_k - \bar{y})^2.
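A minimal sketch of the joint ML computation for this example, assuming simulated Gaussian data with arbitrary true mean, variance, and sample size: the ML variance estimate uses the 1/n normalization (NumPy's ddof=0), in contrast with the familiar 1/(n-1) unbiased estimator.

```python
import numpy as np

rng = np.random.default_rng(4)
mu_true, var_true, n = 1.0, 4.0, 1000                # illustrative values
y = rng.normal(mu_true, np.sqrt(var_true), size=n)

mu_ml  = y.mean()                                    # ML estimate of the mean
var_ml = np.mean((y - mu_ml)**2)                     # ML estimate of the variance (1/n normalization)

print(mu_ml, var_ml)
print(np.var(y, ddof=0), np.var(y, ddof=1))          # ML (1/n) vs unbiased (1/(n-1)) normalization
```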

Example: Estimating the Mean and Variance of WGN

Is \hat{\sigma}^2_{ML} biased in this case?

E_\theta\{\hat{\sigma}^2_{ML}(Y)\} = \frac{1}{n} \sum_{k=0}^{n-1} E_\theta\left[ (Y_k - \mu)^2 - 2(Y_k - \mu)(\bar{Y} - \mu) + (\bar{Y} - \mu)^2 \right] = \frac{1}{n} \sum_{k=0}^{n-1} \left[ \sigma^2 - \frac{2\sigma^2}{n} + \frac{\sigma^2}{n} \right] = \frac{n-1}{n}\,\sigma^2.

Yes, \hat{\sigma}^2_{ML} is biased, but it is asymptotically unbiased. The steps in the previous analysis can be followed to show that \hat{\sigma}^2_{ML} is unbiased if \mu is known. The unknown mean, even though we have an unbiased efficient estimator of it, causes \hat{\sigma}^2_{ML}(y) to be biased here.
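The following Monte Carlo sketch (illustrative values throughout) checks both claims: with the mean unknown, the average of \hat{\sigma}^2_{ML} should be close to (n-1)\sigma^2/n, while plugging in the known mean removes the bias.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, var, n, trials = 0.0, 4.0, 10, 200_000           # small n makes the bias clearly visible

y = rng.normal(mu, np.sqrt(var), size=(trials, n))

var_ml_unknown_mean = np.mean((y - y.mean(axis=1, keepdims=True))**2, axis=1)
var_ml_known_mean   = np.mean((y - mu)**2, axis=1)

print("mean unknown:", var_ml_unknown_mean.mean(), "predicted:", (n - 1) / n * var)
print("mean known:  ", var_ml_known_mean.mean(),   "predicted:", var)
```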

Example: Estimating the Mean and Variance of WGN

It can also be shown that

\mathrm{var}\{\hat{\sigma}^2_{ML}(Y)\} = \frac{2(n-1)\sigma^4}{n^2} < \frac{2\sigma^4}{n-1} = \mathrm{var}\{\hat{\sigma}^2_{MVU}(Y)\}.

Which is better, \hat{\sigma}^2_{ML}(y) or \hat{\sigma}^2_{MVU}(y)? To answer this, let's compute the mean squared error (MSE) of the ML estimator for \sigma^2:

E_\theta\{ (\hat{\sigma}^2_{ML}(Y) - \sigma^2)^2 \} = \mathrm{var}_\theta\{\hat{\sigma}^2_{ML}(Y)\} + \left( E_\theta\{\hat{\sigma}^2_{ML}(Y)\} - \sigma^2 \right)^2 = \frac{2(n-1)\sigma^4}{n^2} + \left( \frac{n-1}{n}\sigma^2 - \sigma^2 \right)^2 = \frac{(2n-1)\sigma^4}{n^2} < \frac{2\sigma^4}{n-1} = E_\theta\{ (\hat{\sigma}^2_{MVU}(Y) - \sigma^2)^2 \}.

Hence the ML estimator has uniformly lower mean squared error than the MVU estimator. The increase in MSE due to the bias is more than offset by the decreased variance of the ML estimator.
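To see the MSE comparison numerically, the sketch below estimates the MSE of the 1/n (ML) and 1/(n-1) (MVU) variance estimators by simulation and compares them to the closed-form predictions. The mean, variance, n, and trial count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
var, n, trials = 4.0, 10, 200_000                    # illustrative values; mu is unknown to both estimators

y = rng.normal(1.0, np.sqrt(var), size=(trials, n))
var_ml  = np.var(y, axis=1, ddof=0)                  # 1/n normalization (ML)
var_mvu = np.var(y, axis=1, ddof=1)                  # 1/(n-1) normalization (MVU)

print("MSE(ML): ", np.mean((var_ml  - var)**2), "predicted:", (2*n - 1) * var**2 / n**2)
print("MSE(MVU):", np.mean((var_mvu - var)**2), "predicted:", 2 * var**2 / (n - 1))
```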

More Properties of ML Estimators

From our examples, we can say that:

- Maximum likelihood estimators may be biased.
- A biased ML estimator may, in some cases, outperform an MVU estimator in terms of overall mean squared error.
- It seems that ML estimators are often asymptotically unbiased:

  \lim_{n\to\infty} E_\theta\{\hat{\theta}_{ML}(Y)\} = \theta.

  Is this always true?
- It seems that ML estimators are often asymptotically efficient:

  \mathrm{var}_\theta\{\hat{\theta}_{ML}(Y)\} / I^{-1}(\theta) \to 1 \text{ as } n \to \infty.

  Is this always true?

Consistency of ML Estimators for i.i.d. Observations

Suppose that we have i.i.d. observations with marginal distribution Y_k \overset{\text{i.i.d.}}{\sim} p_Z(z;\theta) and define

\psi(z;\theta') := \frac{\partial}{\partial\theta} \ln p_Z(z;\theta) \Big|_{\theta=\theta'}

J(\theta';\theta) := E_\theta\{\psi(Y_k;\theta')\} = \int_{\mathcal{Z}} \psi(z;\theta')\,p_Z(z;\theta)\,dz,

where \mathcal{Z} is the support of the pdf of a single observation p_Z(z;\theta).

Theorem. If all of the following (sufficient but not necessary) conditions hold:

1. J(\theta';\theta) is a continuous function of \theta' and has a unique root \theta' = \theta, at which point J changes sign;
2. \psi(Y_k;\theta') is a continuous function of \theta' (almost surely); and
3. for each n, \frac{1}{n}\sum_{k=0}^{n-1}\psi(Y_k;\theta') has a unique root \hat{\theta}_n (almost surely);

then \hat{\theta}_n \overset{\text{i.p.}}{\to} \theta, where i.p. means convergence in probability.

Consistency of ML Estimators for i.i.d. Observations

Convergence in probability means that

\lim_{n\to\infty} \mathrm{Prob}_\theta\left[ |\hat{\theta}_n(Y) - \theta| > \epsilon \right] = 0 \quad \text{for all } \epsilon > 0.

Example: Estimating a constant in white Gaussian noise: Y_k = \theta + W_k for k = 0,\dots,n-1 with W_k \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\sigma^2). For finite n, we know the estimator \hat{\theta}_n(y) = \bar{y} \sim \mathcal{N}(\theta,\sigma^2/n). Hence

\mathrm{Prob}_\theta\left[ |\hat{\theta}_n(Y) - \theta| > \epsilon \right] = 2Q\left( \frac{\sqrt{n}\,\epsilon}{\sigma} \right)

and

\lim_{n\to\infty} 2Q\left( \frac{\sqrt{n}\,\epsilon}{\sigma} \right) = 0

for any \epsilon > 0. So this estimator (ML = MVU here) is consistent.
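A small numeric illustration of this consistency argument: evaluating 2Q(\sqrt{n}\,\epsilon/\sigma) for increasing n (using Q(x) = \tfrac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})) shows the probability of an \epsilon-sized error shrinking toward zero. The noise level and tolerance below are assumptions.

```python
import math

def Q(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2))

sigma, eps = 1.0, 0.1                                # illustrative noise level and error tolerance
for n in (10, 100, 1_000, 10_000, 100_000):
    # Probability that the sample mean misses theta by more than eps
    print(n, 2 * Q(math.sqrt(n) * eps / sigma))
```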

Consistency of ML Estimators: Intuition

Setting the likelihood equation equal to zero for i.i.d. observations:

\frac{\partial}{\partial\theta} \ln p_Y(y;\theta) \Big|_{\theta=\theta'} = \sum_{k=0}^{n-1} \psi(y_k;\theta') = 0. \qquad (1)

The weak law of large numbers tells us that

\frac{1}{n} \sum_{k=0}^{n-1} \psi(Y_k;\theta') \overset{\text{i.p.}}{\to} E_\theta\{\psi(Y_1;\theta')\} = J(\theta';\theta).

Hence, solving (1) in the limit as n \to \infty is equivalent to finding \theta' such that J(\theta';\theta) = 0. Under our usual regularity conditions, it is easy to show that J(\theta';\theta) will always have a root at \theta' = \theta:

J(\theta';\theta) \Big|_{\theta'=\theta} = \int_{\mathcal{Z}} \frac{\partial}{\partial\theta} \ln p_Z(z;\theta)\,p_Z(z;\theta)\,dz = \int_{\mathcal{Z}} \frac{\frac{\partial}{\partial\theta} p_Z(z;\theta)}{p_Z(z;\theta)}\,p_Z(z;\theta)\,dz = \frac{\partial}{\partial\theta} \int_{\mathcal{Z}} p_Z(z;\theta)\,dz = 0.
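To connect this with the earlier exponential example, the sketch below estimates J(\theta';\theta) = E_\theta[\psi(Y;\theta')] by Monte Carlo for the exponential score \psi(y;\theta') = 1/\theta' - y: the estimate is positive for \theta' < \theta, negative for \theta' > \theta, and approximately zero at \theta' = \theta, exactly the sign-change behavior the theorem asks for. The rate and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
theta = 2.0                                          # illustrative true rate
y = rng.exponential(scale=1.0 / theta, size=1_000_000)

def psi(y, theta_prime):
    """Score of one exponential observation: d/dtheta ln(theta e^{-theta y}) evaluated at theta_prime."""
    return 1.0 / theta_prime - y

# Monte Carlo estimate of J(theta', theta) = E_theta[psi(Y; theta')] = 1/theta' - 1/theta
for theta_prime in (1.0, 1.5, 2.0, 2.5, 3.0):
    print(theta_prime, psi(y, theta_prime).mean())   # changes sign at theta' = theta = 2.0
```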

Asymptotic Unbiasedness of ML Estimators

An estimator is asymptotically unbiased if

\lim_{n\to\infty} E_\theta\{\hat{\theta}_{ML,n}(Y)\} = \theta.

Asymptotic unbiasedness requires convergence in mean. We've already seen several examples of ML estimators that are asymptotically unbiased.

Consistency, loosely speaking, says that E_\theta\{ \lim_{n\to\infty} \hat{\theta}_{ML,n}(Y) \} = \theta; this is a statement about convergence in probability. Does consistency imply asymptotic unbiasedness? Only when the limit and expectation can be exchanged. In most cases of practical significance, this exchange is valid. If you want to know more precisely the conditions under which this exchange is valid, you need to learn about dominated convergence (a course in analysis will cover this).

Asymptotic Normality of ML Estimators

Main idea: for i.i.d. observations, each distributed as p_Z(z;\theta), and under regularity conditions similar to those you've seen before,

\sqrt{n}\,(\hat{\theta}_{ML,n}(Y) - \theta) \overset{d}{\to} \mathcal{N}\!\left( 0, \frac{1}{i(\theta)} \right),

where \overset{d}{\to} means convergence in distribution and

i(\theta) := E_\theta\left\{ \left[ \frac{\partial}{\partial\theta} \ln p_Z(Z;\theta) \right]^2 \right\}

is the Fisher information of a single observation Y_k about the parameter \theta. See Proposition IV.D.2 in the Poor textbook for the details.
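A simulation sketch of this result for the exponential example, where i(\theta) = 1/\theta^2 so the limiting variance of \sqrt{n}\,(\hat{\theta}_{ML,n} - \theta) is \theta^2. All numerical values are illustrative assumptions, and the residual finite-n bias of order \sqrt{n}\,\theta/(n-1) is visible in the sample mean.

```python
import numpy as np

rng = np.random.default_rng(8)
theta, n, trials = 2.0, 500, 20_000                  # illustrative values; n is fairly large

y = rng.exponential(scale=1.0 / theta, size=(trials, n))
z = np.sqrt(n) * (1.0 / y.mean(axis=1) - theta)      # sqrt(n) * (ML estimate - theta) for each trial

# For this model i(theta) = 1/theta^2, so the limiting distribution is N(0, theta^2)
print("sample mean of z:", z.mean())                 # small (finite-n bias is sqrt(n)*theta/(n-1) here)
print("sample var of z: ", z.var(), "vs 1/i(theta) =", theta**2)
```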

Asymptotic Normality of ML Estimators

The asymptotic normality of ML estimators is very important. For reasons similar to why convergence in probability is not sufficient to imply asymptotic unbiasedness, convergence in distribution is also not sufficient to imply that

E_\theta[\sqrt{n}\,(\hat{\theta}_n(Y) - \theta)] \to 0 \quad \text{(asymptotic unbiasedness)}

or

\mathrm{var}_\theta[\sqrt{n}\,(\hat{\theta}_n(Y) - \theta)] \to \frac{1}{i(\theta)} \quad \text{(asymptotic efficiency)}.

Nevertheless, as our examples showed, ML estimators are often both asymptotically unbiased and asymptotically efficient. When an estimator is asymptotically unbiased and asymptotically efficient, it is asymptotically MVU.

Conclusions

Even though we started out with some questionable assumptions, we found that ML estimators can have nice properties:

- If an estimator achieves the CRLB, then it must be a solution to the likelihood equation. Note that the converse is not always true.
- For i.i.d. observations, ML estimators are guaranteed to be consistent (under regularity conditions).
- For i.i.d. observations, ML estimators are asymptotically Gaussian (under regularity conditions).
- For i.i.d. observations, ML estimators are usually asymptotically unbiased and asymptotically efficient.
- Unlike MVU estimators, ML estimators may be biased.

It is customary in many problems to assume n is sufficiently large that the asymptotic properties hold, at least approximately.

See Kay I, Chapter 7, for lots of good examples plus:

- Transformed parameters
- Numerical methods
- Vector parameters