Maximum Likelihood Estimation

Guy Lebanon
February 19, 2011

Maximum likelihood estimation is the most popular general-purpose method for estimating a distribution from a finite sample. It was proposed by Fisher about 100 years ago and has been widely used since.

Definition 1. Let X^(1), ..., X^(n) be sampled iid (independent and identically distributed, i.e., the samples are independent draws from the same distribution) from a distribution with a parameter θ that lies in a set Θ. The maximum likelihood estimator (MLE) is the θ ∈ Θ that maximizes the likelihood function

L(θ) = L(θ | X^(1), ..., X^(n)) = p_θ(X^(1), ..., X^(n)) = ∏_{i=1}^n p_θ(X^(i)),

where p_θ is the density function if X is continuous and the mass function if X is discrete. The MLE is denoted θ̂, or θ̂_n if we wish to emphasize the sample size. Above, we suppress the dependency of L on X^(1), ..., X^(n) to emphasize that we treat the likelihood as a function of θ. Note that both X^(i) and θ may be scalars or vectors (not necessarily of the same dimension), and that L may be discrete or continuous in X, in θ, in both, or in neither.

1. Strictly monotonically increasing functions g preserve order, in the sense that the maximizer of L(θ) is also the maximizer of g(L(θ)). As a consequence, we can find the MLE by maximizing log L(θ) rather than the likelihood itself, which is helpful since the logarithm turns the multiplicative likelihood into a sum (sums are easier to differentiate than products). A common notation for the log-likelihood is l(θ) = log L(θ).

2. Additive and multiplicative terms in l(θ) that are not functions of θ may be ignored, since dropping them does not change the maximizer.

3. If L(θ) is differentiable in θ, we can try to find the MLE by solving dl(θ)/dθ = 0 for scalar θ, or the system of equations ∇l(θ) = 0, i.e., ∂l(θ)/∂θ_j = 0 for j = 1, ..., d, for vector θ. The solutions obtained this way are only guaranteed to be critical points (maxima, minima, inflection or saddle points) of the log-likelihood. To prove that a solution is a maximum we need to show, in the scalar case, that d²l(θ)/dθ² < 0, or, if θ is a vector, that the Hessian matrix H(θ) defined by [H(θ)]_{ij} = ∂²l(θ)/∂θ_i∂θ_j is negative definite at the solution of ∇l(θ) = 0. (A negative definite matrix H is one that satisfies v'Hv < 0 for all nonzero vectors v. An equivalent condition for symmetric matrices such as H, and one that is often easier to verify, is that all eigenvalues of H are negative.)

4. If the above method does not work (we cannot solve ∇l(θ) = 0 analytically), we can find the MLE by iteratively following the direction of the gradient: initialize θ randomly and iterate θ ← θ + α∇l(θ), where α is a sufficiently small step size, until convergence, e.g., until ‖∇l(θ)‖ ≤ ε or ‖θ^(t+1) − θ^(t)‖ < ε. Since the gradient vector points in the direction of steepest ascent, this brings us to a (local) maximum point. (A code sketch of this recipe appears after the list of potential problems below.)

5. The MLE is invariant in the sense that, for every one-to-one function h, if θ̂ denotes the MLE of θ then the MLE of h(θ) is h(θ̂). The key property is that one-to-one functions have an inverse: if we use the parametrization h(θ) rather than θ, the likelihood function is L ∘ h⁻¹ rather than L. This can be shown by noting that for one-to-one η = h(θ) we have p_θ(X) = p_{h⁻¹(η)}(X), and thus the likelihood function of η is K(η) = L(h⁻¹(η)). We conclude that if θ̂ is the MLE of θ then K(h(θ̂)) = L(h⁻¹(h(θ̂))) = L(θ̂) ≥ L(θ) = L(h⁻¹(h(θ))) = K(h(θ)) for all θ ∈ Θ, and therefore h(θ̂) is the MLE of h(θ).

Potential problems are:

- l and L are not differentiable, so we can neither solve ∇l(θ) = 0 nor follow the gradient to convergence. Solution: if θ is a low-dimensional vector, use a grid search; otherwise use non-smooth optimization techniques.
- l(θ) has multiple local maxima. Solution: restart the gradient-ascent procedure from several different starting positions and pick the solution with the highest value among the resulting maximizers.
- There are multiple global maxima. Solution: pick one, or pick several and report all of them as potential estimates.
- l(θ) has no maximum point, in which case the iterative algorithm may never converge. Solution: use regularization to penalize large values of θ, i.e., maximize l(θ) − λ‖θ‖², or use early stopping.
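The gradient-ascent recipe in remark 4 can be written in a few lines of R. The sketch below is illustrative rather than part of the original notes: it uses the derivative of the Bernoulli log-likelihood, which Example 1 below derives, and the step size alpha, tolerance eps, and iteration cap are arbitrary choices.

    set.seed(0)
    n <- 3; samples <- rbinom(n, 1, 0.5)              # same simulation setup as Example 1 below

    grad.loglik <- function(theta) {                  # dl/dtheta for the Bernoulli model (Example 1)
      sum(samples) / theta - (n - sum(samples)) / (1 - theta)
    }

    theta <- 0.5                                      # initialization
    alpha <- 0.01; eps <- 1e-8                        # illustrative step size and tolerance
    for (iter in 1:10000) {
      g <- grad.loglik(theta)
      if (abs(g) < eps) break                         # stop when the gradient is (nearly) zero
      theta <- theta + alpha * g                      # step in the direction of steepest ascent
    }
    c(theta, mean(samples))                           # the iterate agrees with the closed-form MLE

In practice one would also guard against the pathologies listed above, e.g., by restarting from several initial values and keeping the best maximizer.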

Example 1: Let X^(1), ..., X^(n) ~ Ber(θ). (The Bernoulli distribution means X = 1 with probability θ and X = 0 with probability 1 − θ, with zero probability for all other values; it is identical to the binomial distribution with N = 1.) The likelihood is

L(θ) = ∏_{i=1}^n θ^{X^(i)} (1 − θ)^{1 − X^(i)} = θ^{Σ_i X^(i)} (1 − θ)^{n − Σ_i X^(i)}

and the log-likelihood is

l(θ) = (Σ_i X^(i)) log θ + (n − Σ_i X^(i)) log(1 − θ).

Setting the log-likelihood derivative to zero yields

0 = Σ_i X^(i)/θ − (n − Σ_i X^(i))/(1 − θ),

or 0 = (1 − θ) Σ_i X^(i) − (n − Σ_i X^(i)) θ = Σ_i X^(i) − nθ, which implies θ̂ = (1/n) Σ_i X^(i). Note that the probability p_θ and the log-likelihood are neither continuous nor smooth in X, but both are continuous and smooth in θ. For example, the graph below shows the log-likelihood of a sample from a Bernoulli distribution with θ = 0.5 and n = 3; it is maximized at θ̂ equal to the empirical average, in accordance with the analytical solution above.

    > library(ggplot2)
    > theme_set(theme_bw(base_size = 8)); set.seed(0);  # for a reproducible experiment
    > n = 3; theta = 0.5; samples = rbinom(n, 1, theta); samples
    [1] 1 0 0
    > D = data.frame(theta = seq(0.01, 1, length = 100));
    > D$loglikelihood = sum(samples) * log(D$theta) + (n - sum(samples)) * log(1 - D$theta);
    > mle = which.max(D$loglikelihood);
    > p = ggplot(D, aes(theta, loglikelihood)) + geom_line()
    > print(qplot(theta, loglikelihood, geom = 'line', data = D,
    +   main = 'MLE (red solid) vs. True Parameter (blue dashed), n=3') +
    +   geom_vline(aes(xintercept = theta[mle]), color = I('red'), size = 1.5) +
    +   geom_vline(aes(xintercept = 0.5), color = I('blue'), lty = 2, size = 1.5))

[Figure: the Bernoulli log-likelihood as a function of θ for n = 3, titled "MLE (red solid) vs. True Parameter (blue dashed), n=3"; the MLE is marked by a red solid vertical line and the true parameter θ = 0.5 by a blue dashed vertical line.]

Contrast the graph above with the one below, which corresponds to n = 20, to see the improvement of θ̂_20 over θ̂_3.

[Figure: the same plot for n = 20, titled "MLE (red solid) vs. True Parameter (blue dashed), n=20".]
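The invariance property (remark 5) can be verified numerically on this model. The following sketch is not part of the original notes and uses illustrative names (loglik.eta, eta.hat) and a fresh simulated sample: it maximizes the likelihood in the odds parametrization η = h(θ) = θ/(1 − θ) with base R's one-dimensional optimizer optimize(), and compares the result with h(θ̂) = θ̂/(1 − θ̂).

    set.seed(0)
    n <- 30; theta <- 0.5
    samples <- rbinom(n, 1, theta)
    theta.hat <- mean(samples)                    # closed-form MLE of theta

    loglik.eta <- function(eta) {                 # log-likelihood in the odds parametrization
      th <- eta / (1 + eta)                       # h^{-1}(eta) maps the odds back to theta
      sum(samples) * log(th) + (n - sum(samples)) * log(1 - th)
    }
    eta.hat <- optimize(loglik.eta, interval = c(1e-6, 100), maximum = TRUE)$maximum
    c(eta.hat, theta.hat / (1 - theta.hat))       # the two values agree up to numerical error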

Example 2: Let X^(1), ..., X^(n) ~ N(µ, σ²), θ = (µ, σ²). The log-likelihood is

l(θ) = log [ (2πσ²)^{−n/2} ∏_{i=1}^n e^{−(X^(i) − µ)²/(2σ²)} ] = c − (n/2) log σ² − Σ_{i=1}^n (X^(i) − µ)²/(2σ²),

where c is an inconsequential additive constant. Setting the partial derivative with respect to µ to zero gives

∂l(θ)/∂µ = Σ_{i=1}^n (X^(i) − µ)/σ² = 0,   which implies   µ̂ = X̄ := (1/n) Σ_{i=1}^n X^(i).

Substituting µ̂ = X̄ into the equation obtained by setting the partial derivative with respect to σ² to zero,

0 = ∂l(θ)/∂σ² = −n/(2σ²) + Σ_{i=1}^n (X^(i) − µ)²/(2σ⁴),

or, after multiplying by 2σ⁴, 0 = −nσ² + Σ_{i=1}^n (X^(i) − X̄)², which implies σ̂² = (1/n) Σ_{i=1}^n (X^(i) − X̄)². By property 5 above (invariance), the MLE of the standard deviation is σ̂ = ( (1/n) Σ_{i=1}^n (X^(i) − X̄)² )^{1/2}.

In this case θ is a two-dimensional vector, so the likelihood function forms a surface in three dimensions. It is convenient to visualize it using a heat-map or contour plot, which shows the value of l(θ) with colors or with equal-value contours. For example, compare the two graphs below, which show the likelihood as a function of θ = (µ, σ²) for n = 5 and n = 50. The MLE is clearly more accurate in the latter case.

    > n = 5; samples = rnorm(n, 0, 1);  # 5 samples from N(0,1)
    > mu = seq(-1, 1, length = 40); v = seq(0.01, 2, length = 40);
    > D = expand.grid(mu = mu, sigma.square = v);  # all combinations of the two parameters
    > nloglik = function(theta, samples) {  # log-likelihood, up to an additive constant
    +   l = -log(theta[, 2]) * n / 2;  # works for multiple theta values arranged in a data frame
    +   for (s in samples) l = l - (s - theta[, 1])^2 / (2 * theta[, 2]);
    +   return(l);
    + }
    > D$loglikelihood = nloglik(D, samples);
    > p = ggplot(D, aes(mu, sigma.square, z = exp(loglikelihood))) + stat_contour(size = 0.2);
    > mle = which.max(D$loglikelihood);
    > print(p + geom_point(aes(mu[mle], sigma.square[mle]), color = I('red'), size = 3) +
    +   geom_point(aes(0, 1), color = I('blue'), size = 3, shape = 22) + opts(title =
    +   'Likelihood contours, true parameter (blue square) vs. MLE (red circle), n=5'))

[Figure: "Likelihood contours, true parameter (blue square) vs. MLE (red circle), n=5"; contours of the likelihood over (µ, σ²), with the true parameter (0, 1) marked by a blue square and the MLE by a red circle.]
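The closed-form expressions derived above can also be evaluated directly in R. The following quick check is not part of the original notes (the sample here is freshly simulated); note that σ̂² uses denominator n, i.e., it equals (n − 1)/n times R's var(), which uses denominator n − 1.

    set.seed(1)                                   # illustrative seed
    n <- 50; samples <- rnorm(n, 0, 1)
    mu.hat     <- mean(samples)                   # MLE of mu
    sigma2.hat <- mean((samples - mu.hat)^2)      # MLE of sigma^2 (denominator n)
    c(mu.hat, sigma2.hat)
    c(mu.hat, (n - 1) / n * var(samples))         # identical to the previous line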

    > n = 50; samples = rnorm(n, 0, 1);  # 50 samples from N(0,1)
    > mu = seq(-1, 1, length = 40); v = seq(0.01, 2, length = 40);
    > D = expand.grid(mu = mu, sigma.square = v);
    > D$loglikelihood = nloglik(D, samples);
    > p = ggplot(D, aes(mu, sigma.square, z = exp(loglikelihood))) + stat_contour(size = 0.2);
    > mle = which.max(D$loglikelihood);
    > print(p + geom_point(aes(mu[mle], sigma.square[mle]), color = I('red'), size = 3) +
    +   geom_point(aes(0, 1), color = I('blue'), size = 3, shape = 22) + opts(title =
    +   'Likelihood contours, true parameter (blue square) vs. MLE (red circle), n=50'))

[Figure: "Likelihood contours, true parameter (blue square) vs. MLE (red circle), n=50"; the same contour plot for n = 50, where the MLE (red circle) lies much closer to the true parameter (blue square).]

R can also be used to find the MLE with iterative numerical optimization. This is very useful when setting the log-likelihood gradient to zero does not yield a closed-form solution. Below is an example of using R's optim() function to numerically optimize the Gaussian likelihood (even though we don't really need to in this case, since there is a closed-form solution). Here optim() is given three arguments: an initial parameter value from which to start the iterative optimization, the objective function to minimize (minus the log-likelihood in our case), and additional data to pass on to the objective function (the samples). Comparing the MLE obtained by the iterative numerical algorithm with the grid-search MLE obtained in the previous example, we see that the former is more accurate than the latter. As the grid resolution is increased, the difference converges to 0.

    > # numerical MLE for the Gaussian parameters using the 50 samples from the previous example
    > nnloglik = function(theta, samples) {  # negative log-likelihood (optim minimizes)
    +   l = -log(theta[2]) * n / 2;  # here theta is a single (mu, sigma^2) vector
    +   for (s in samples) l = l - (s - theta[1])^2 / (2 * theta[2]);
    +   return(-l);
    + }
    > iterative.mle = optim(c(0.5, 0.5), nnloglik, samples = samples);
    > iterative.mle$par  # MLE from the iterative numerical optimizer
    [1] 0.004178593 1.065951133
    > c(D$mu[mle], D$sigma.square[mle])  # MLE from the grid search (previous example)
    [1] 0.02564103 1.08153846
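A small variation on the call above, not in the original notes, supplies optim() with the analytic gradient of the negative log-likelihood (which follows from the derivation in Example 2) and uses the L-BFGS-B method so that a box constraint keeps σ² away from zero; the names nnloglik2 and nngrad and the bound 1e-6 are illustrative choices.

    set.seed(1)                                   # illustrative seed, not from the notes
    samples <- rnorm(50, 0, 1)

    nnloglik2 <- function(theta, samples) {       # negative log-likelihood, constants dropped
      mu <- theta[1]; v <- theta[2]
      length(samples) / 2 * log(v) + sum((samples - mu)^2) / (2 * v)
    }
    nngrad <- function(theta, samples) {          # its gradient with respect to (mu, sigma^2)
      mu <- theta[1]; v <- theta[2]
      c(-sum(samples - mu) / v,
        length(samples) / (2 * v) - sum((samples - mu)^2) / (2 * v^2))
    }
    fit <- optim(c(0.5, 0.5), nnloglik2, gr = nngrad, samples = samples,
                 method = "L-BFGS-B", lower = c(-Inf, 1e-6))
    fit$par                                       # compare with mean(samples) and mean((samples - mean(samples))^2)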

Example 3: X^(1), ..., X^(n) ~ U[0, θ]. In this case p_θ(X^(i)) = θ^{−1} if X^(i) ∈ [0, θ] and 0 otherwise. The likelihood is L(θ) = θ^{−n} if 0 ≤ X^(1), ..., X^(n) ≤ θ, and 0 otherwise. We need to exercise care in this situation, since the definition of the likelihood branches into two cases depending on the value of the parameter θ. To treat a single case rather than two, we write the likelihood as

L(θ) = θ^{−n} 1{0 ≤ X^(1), ..., X^(n) ≤ θ},

where 1{A} is the indicator function, which equals 1 if A is true and 0 otherwise. We cannot proceed as before, since the likelihood is not a differentiable function of θ (and neither is the log-likelihood). We therefore do not take derivatives and simply examine the function L(θ): it is zero for θ < max(X^(1), ..., X^(n)) and non-zero for θ ≥ max(X^(1), ..., X^(n)), in which case it is monotonically decreasing in θ. It follows that θ̂ = max(X^(1), ..., X^(n)). The graph below plots the likelihood function for θ = 1, overlaid on the samples (n = 3), with the MLE and the true parameter indicated by vertical lines.

    > D = data.frame(theta = seq(0, 2, length = 100));
    > n = 3; samples = runif(n, 0, 1); samples;  # 3 samples from U[0, theta] with theta = 1
    [1] 0.72540527 0.48614910 0.06380247
    > D$likelihood[D$theta < max(samples)] = 0;
    > D$likelihood[D$theta >= max(samples)] = D$theta[D$theta >= max(samples)]^(-n);
    > p = ggplot(D, aes(theta, likelihood)) + geom_line()
    > print(p + opts(title = 'likelihood, MLE (red solid) vs. true parameter (blue dashed), n=3') +
    +   geom_vline(aes(xintercept = max(samples)), color = I('red'), size = 1.5) +
    +   geom_vline(aes(xintercept = 1), color = I('blue'), lty = 2, size = 1.5) +
    +   geom_point(aes(x = samples, y = 0, size = 3)))

[Figure: "likelihood, MLE (red solid) vs. true parameter (blue dashed), n=3"; the likelihood of U[0, θ] plotted against θ, with the three samples shown as points on the horizontal axis, the MLE max(X^(i)) as a red solid vertical line, and the true parameter θ = 1 as a blue dashed vertical line.]

A variation of this situation has X^(1), ..., X^(n) ~ U(0, θ) (as before, but this time the interval is open rather than closed). We start as before, but now the likelihood at θ = max(X^(1), ..., X^(n)) is zero, while it increases as θ approaches max(X^(1), ..., X^(n)) from the right. That is, there is no MLE: for any specific value θ̂ we can always come up with another value that results in a higher likelihood, so no value of θ maximizes the likelihood.

Another variation has X^(1), ..., X^(n) ~ U[θ, θ + 1]. In this case the likelihood is L(θ) = 1{θ ≤ X^(1), ..., X^(n) ≤ θ + 1}. The likelihood is therefore either zero or one; since it equals one for many values of θ, there are multiple maximizers, or MLEs (every θ̂ for which θ̂ ≤ X^(1), ..., X^(n) ≤ θ̂ + 1, i.e., every θ̂ in the interval [max_i X^(i) − 1, min_i X^(i)]), rather than a unique one.
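The set of maximizers in the last variation can be made explicit in a couple of lines of R. This is an illustrative sketch, not part of the original notes; the seed and the true value θ = 0.3 are arbitrary.

    set.seed(2)                                   # illustrative seed
    theta <- 0.3
    samples <- runif(5, theta, theta + 1)         # n = 5 draws from U[theta, theta + 1]
    # L(theta.hat) = 1 exactly when theta.hat <= min(samples) and theta.hat + 1 >= max(samples);
    # every value in the interval below is therefore an MLE.
    c(max(samples) - 1, min(samples))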

Theoretical Properties

The MLE may be motivated on practical grounds as the most popular estimation technique in statistics. It also has some attractive theoretical properties. Below is a brief, informal description of these properties.

Consistency. The MLE θ̂_n is a function of X^(1), ..., X^(n) and is therefore a random variable; it is sometimes more accurate and sometimes less accurate, depending on the samples. It is generally true, however, that as n increases the random variable θ̂_n converges to the true parameter value with probability 1:

P( lim_{n→∞} θ̂_n = θ ) = 1.

The main condition for this is identifiability, defined as θ ≠ θ′ ⇒ p_θ ≠ p_{θ′} (that is, two distinct parameter values give two distinct probability functions).

Asymptotic Variance. For large n, the random variable θ̂_n is approximately normally distributed, N(θ, I^{−1}(θ)/n), where I(θ) is the Fisher information: its expectation equals the true parameter and its variance decays linearly with n.

Asymptotic Efficiency. Among unbiased estimators, the MLE has the smallest asymptotic variance; asymptotically it attains the Cramér–Rao lower bound I^{−1}(θ)/n.
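The asymptotic variance statement can be illustrated by simulation. For the Bernoulli model of Example 1 the Fisher information is I(θ) = 1/(θ(1 − θ)), so the approximation says that θ̂_n is roughly N(θ, θ(1 − θ)/n). The sketch below is not part of the original notes; the sample size, number of replications, and seed are arbitrary choices.

    set.seed(0)
    theta <- 0.5; n <- 200; reps <- 5000
    theta.hat <- replicate(reps, mean(rbinom(n, 1, theta)))  # the MLE in each replication
    c(mean(theta.hat), sd(theta.hat))                        # compare with theta and the line below
    sqrt(theta * (1 - theta) / n)                            # asymptotic standard deviation sqrt(I^{-1}(theta)/n)
    # hist(theta.hat, breaks = 40) should look approximately normal, centered near theta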