BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation


1 BIO5312 Biostatistics Lecture 13: Maximum Likelihood Estimation
Yujin Chung
November 29th, 2016, Fall 2016
Yujin Chung Lec13: MLE Fall /24

2 Previous
Parametric tests
- Mean comparisons (normality assumption): t-test, F-test
- Regression analysis, tests for coefficients: t-test, F-test
- Goodness-of-fit test: Chi-square test

3 Today's lecture
A general approach when we assume the underlying parametric distribution of the observed data.
- Likelihood function
- Maximum likelihood estimation
- Model comparisons: LRT and AIC

4 Likelihood function
Let $X_1, \dots, X_n$ be an iid random sample from $f_\theta(x)$, where $\theta$ is a vector of parameters.
The joint probability density of the random sample is
$$f_\theta(x_1, \dots, x_n) = f_\theta(x_1) \times \cdots \times f_\theta(x_n)$$
- a function of the random sample $x_1, \dots, x_n$, given $\theta$
A likelihood function is
$$L(\theta) \equiv f_\theta(x_1, \dots, x_n) = f_\theta(x_1) \times \cdots \times f_\theta(x_n)$$
- a function of $\theta$, given the random sample $x_1, \dots, x_n$
- not a probability density of $\theta$

5 Likelihood function: example
Let $X_1, \dots, X_n$ denote a random sample from a Bernoulli distribution with parameter $p$:
$$p(x) = p^{x} (1-p)^{1-x}, \quad x = 0, 1$$
The joint probability of $X_1 = x_1, \dots, X_n = x_n$ is
$$p^{x_1} (1-p)^{1-x_1} \cdots p^{x_n} (1-p)^{1-x_n} = p^{\sum_{i=1}^n x_i} (1-p)^{n - \sum_{i=1}^n x_i}$$
The likelihood function of $p$ is
$$L(p) = p^{\sum_{i=1}^n x_i} (1-p)^{n - \sum_{i=1}^n x_i}.$$
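As a minimal sketch of the Bernoulli likelihood above, the following Python snippet evaluates $L(p)$ at a few candidate values of $p$. The sample and the function name `bernoulli_likelihood` are illustrative assumptions, not part of the lecture.

```python
def bernoulli_likelihood(p, xs):
    """L(p) = p^(sum of x_i) * (1 - p)^(n - sum of x_i) for a 0/1 sample xs."""
    successes = sum(xs)
    n = len(xs)
    return p ** successes * (1 - p) ** (n - successes)

# Hypothetical data: 7 successes in 10 trials (not from the lecture).
sample = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

# The data are fixed; the likelihood is read as a function of p.
for p in (0.3, 0.5, 0.7, 0.9):
    print(f"p = {p:.1f}  L(p) = {bernoulli_likelihood(p, sample):.6f}")
```

Among the values tried, the likelihood is largest at p = 0.7, which is the sample proportion of successes.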

6 Maximum likelihood estimation
The method of maximum likelihood, or maximum likelihood approach, is a method to estimate the parameters of the underlying distribution for a random sample.
The estimate from this approach is the parameter value that maximizes the likelihood function $L(\theta)$ or, equivalently, the log-likelihood function $l(\theta) = \log L(\theta)$.
Such an estimate is called a maximum likelihood estimate (MLE):
$$\hat\theta = \arg\max_\theta L(\theta) = \arg\max_\theta l(\theta)$$
[Figure: the likelihood $L(p)$ plotted against $p$, with the MLE marked at the maximum.]

7 How to find MLE: analytical solution
1. Define the likelihood function, $L(\theta)$.
2. Take the logarithm of the likelihood function, $l(\theta) = \log L(\theta)$.
3. Take the derivative of the log-likelihood function with respect to the parameter, $l'(\theta) = \frac{d}{d\theta} l(\theta)$.
4. Set the derivative equal to zero ($l'(\theta) = 0$) and solve for the parameter to find $\hat\theta$.
5. Confirm that $\hat\theta$ is in fact a maximum by checking that the second derivative of $l(\theta)$ evaluated at $\hat\theta$ is negative. Verify that the global maximum has been found.

8 Analytical solution of MLE: example
Let $X_1, \dots, X_n$ denote a random sample from a Bernoulli distribution with parameter $p$. What is the MLE of $p$?
1. Likelihood function: $L(p) = p^{\sum_{i=1}^n x_i} (1-p)^{n - \sum_{i=1}^n x_i}$
2. The log-likelihood is $l(p) = \left( \sum_{i=1}^n x_i \right) \log p + \left( n - \sum_{i=1}^n x_i \right) \log(1-p)$.
3. $l'(p) = \frac{d\,l(p)}{dp} = \frac{\sum_{i=1}^n x_i}{p} - \frac{n - \sum_{i=1}^n x_i}{1-p}$
4. Let $l'(p) = 0$ and solve for $p$; the solution is the MLE $\hat p$:
$$\frac{\sum_{i=1}^n x_i}{\hat p} - \frac{n - \sum_{i=1}^n x_i}{1 - \hat p} = 0 \quad \text{implies} \quad \hat p = \bar x.$$
5. $l''(p) = -\frac{\sum_{i=1}^n x_i}{p^2} - \frac{n - \sum_{i=1}^n x_i}{(1-p)^2} < 0$.
Note: The MLE is the same as the point estimate for $p$ we studied in Lec5.
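The closed-form result $\hat p = \bar x$ can be checked numerically. The sketch below maximizes the Bernoulli log-likelihood over a fine grid of $p$ values and compares the maximizer with the sample mean. The data, the grid resolution, and the helper names are assumptions made for illustration only.

```python
import math

def log_likelihood(p, xs):
    """l(p) = (sum x_i) log p + (n - sum x_i) log(1 - p), for 0 < p < 1."""
    successes, n = sum(xs), len(xs)
    return successes * math.log(p) + (n - successes) * math.log(1 - p)

def grid_mle(xs, steps=9999):
    """Return the grid point in (0, 1) with the largest log-likelihood."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]  # 0.0001, ..., 0.9999
    return max(grid, key=lambda p: log_likelihood(p, xs))

# Hypothetical data: 7 successes in 10 trials (not from the lecture).
sample = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

p_hat_closed_form = sum(sample) / len(sample)  # the sample mean
p_hat_grid = grid_mle(sample)
print(p_hat_closed_form, p_hat_grid)
```

A grid search is used here only because it is easy to read; it is not an efficient optimizer.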

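9 MLE: Normal
Let $X_1, \dots, X_n$ denote a random sample from $N(\mu, \sigma^2)$. What are the MLEs of $\mu$ and $\sigma^2$?
1. Likelihood function:
$$L(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)$$
2. The log-likelihood is
$$l(\mu, \sigma^2) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log \sigma^2 - \frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2}.$$
3. $\frac{\partial l}{\partial \mu} = \frac{\sum_{i=1}^n (x_i - \mu)}{\sigma^2}, \quad \frac{\partial l}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^4}$
4. Let $\partial l / \partial \mu = 0$ and $\partial l / \partial \sigma^2 = 0$ and solve for $\mu$ and $\sigma^2$:
$$\hat\mu = \bar X \quad \text{and} \quad \hat\sigma^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n}$$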
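A short sketch of the normal MLEs above, using only the Python standard library. The data values are made up for illustration, and the function names are hypothetical. The snippet also shows that $\hat\sigma^2$ divides by $n$, so it differs from the usual sample variance, which divides by $n - 1$.

```python
import math

def normal_mle(xs):
    """Return (mu_hat, sigma2_hat), the MLEs for a N(mu, sigma^2) sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # divides by n, not n - 1
    return mu_hat, sigma2_hat

def normal_log_likelihood(mu, sigma2, xs):
    """l(mu, sigma^2) = -(n/2) log(2 pi) - (n/2) log(sigma^2) - SS / (2 sigma^2)."""
    n = len(xs)
    sum_sq = sum((x - mu) ** 2 for x in xs)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - sum_sq / (2 * sigma2))

# Hypothetical measurements (not from the lecture).
data = [4.1, 5.3, 3.8, 6.0, 5.2, 4.6]

mu_hat, sigma2_hat = normal_mle(data)
sample_variance = sigma2_hat * len(data) / (len(data) - 1)  # unbiased version
max_loglik = normal_log_likelihood(mu_hat, sigma2_hat, data)
print(mu_hat, sigma2_hat, sample_variance, max_loglik)
```

Moving either parameter away from its MLE lowers the log-likelihood, which can be confirmed by evaluating `normal_log_likelihood` at nearby values.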

10 Why MLE?
- The MLE is invariant under one-to-one transformations. E.g., the MLE of $p^2$ is $(\bar X)^2$.
- Asymptotic normality:
$$\sqrt{n} (\hat\theta - \theta) \to N(0, I_1(\theta)^{-1}) \quad \text{as } n \to \infty$$
That is, for large $n$, the asymptotic expectation and variance of the MLE $\hat\theta$ are $E(\hat\theta) \approx \theta$ and $Var(\hat\theta) \approx 1 / I_n(\theta)$, respectively.
- Fisher information: $I_n(\theta) = -E(l''(\theta)) = n I_1(\theta)$

11 Asymptotic Normality: Example
Consider a random sample, $X_1, \dots, X_n$, from a Bernoulli distribution with parameter $p$. The MLE of $p$ is $\hat p = \bar X$.
$$l''(p) = -\frac{\sum_{i=1}^n x_i}{p^2} - \frac{n - \sum_{i=1}^n x_i}{(1-p)^2}$$
Fisher information:
$$I_n(p) = -E[l''(p)] = \frac{np}{p^2} + \frac{n - np}{(1-p)^2} = \frac{n}{p(1-p)}$$
The asymptotic variance of $\hat p$ is $I_n(p)^{-1} = p(1-p)/n$.
For large $n$, $\hat p \approx N(p, p(1-p)/n)$.
cf) By the Central Limit Theorem, $\bar X \approx N(p, p(1-p)/n)$ for large $n$.
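The asymptotic variance $p(1-p)/n$ can be illustrated with a small simulation. This is a sketch under assumed settings ($p = 0.3$, $n = 200$, 2000 replicates, a fixed seed), none of which come from the lecture: it draws repeated Bernoulli samples, computes $\hat p$ for each, and compares the empirical mean and variance of $\hat p$ with $p$ and $1 / I_n(p)$.

```python
import random
import statistics

def simulate_p_hat(p, n, reps, seed=0):
    """Draw `reps` Bernoulli(p) samples of size n and return the MLE of each."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    estimates = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        estimates.append(successes / n)  # MLE is the sample proportion
    return estimates

# Assumed simulation settings (illustrative only).
p_true, n, reps = 0.3, 200, 2000

estimates = simulate_p_hat(p_true, n, reps)
empirical_mean = statistics.mean(estimates)
empirical_var = statistics.pvariance(estimates)
asymptotic_var = p_true * (1 - p_true) / n  # 1 / I_n(p)

print("mean of p_hat:", empirical_mean, "target:", p_true)
print("variance of p_hat:", empirical_var, "asymptotic:", asymptotic_var)
```

The two variances should be close but not identical, since the empirical value is itself subject to simulation error.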

12 Tests based on Maximum likelihoods
- Likelihood ratio test (LRT)
- AIC

13 Likelihood Ratio Test (LRT)
Let $X_1, \dots, X_n$ be i.i.d. random variables with a distribution $f_\theta(x)$, where $\theta = (\theta_1, \dots, \theta_p)$. We want to test
$$H_0: \theta \in \Theta_0 = \{ \theta \mid \theta_{q+1} = c_1, \dots, \theta_p = c_{p-q} \} \quad \text{vs.} \quad H_1: \theta \in \Theta_1 \ (= \Theta_0^c).$$
Examples
- $H_0: \mu = c, \sigma^2 > 0$ vs. $H_1: \mu \neq c, \sigma^2 > 0$
- $H_0: E(Y) = \beta_0 + \beta_1 X_1$ vs. $H_1: E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$. That is, $H_0: \beta_2 = 0$ vs. $H_1: \beta_2 \neq 0$.
Nested models
Two models are nested if one of them is a particular case of the other: the simpler model can be obtained by setting some coefficients of the more complex model to particular values.
Non-nested models: $E(Y) = \beta_0 + \beta_1 X_1$ vs. $E(Y) = \beta_0 + \beta_2 X_2$

14 Likelihood Ratio Test (LRT)
Idea: compare the maximum likelihoods under the null and alternative models,
$$\max_{\theta \in \Theta_0} L(\theta) \quad \text{and} \quad \max_{\theta \in \Theta_0 \cup \Theta_1} L(\theta).$$
Let
$$\Lambda = \frac{\max_{\theta \in \Theta_0} L(\theta)}{\max_{\theta \in \Theta_0 \cup \Theta_1} L(\theta)}.$$
- If $H_0$ is true, $\Lambda$ is close to 1, and $-2 \log \Lambda$ is small (close to 0).
- If $H_1$ is true, $\Lambda$ is small (close to 0), and $-2 \log \Lambda$ is large.
Test statistic: $-2 \log \Lambda$ (approximately $\chi^2_{p-q}$ under $H_0$ for large $n$; d.f.: the difference in the number of parameters)
p-value: $\Pr(\chi^2_{p-q} > -2 \log \Lambda)$
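To make the LRT recipe concrete, here is a minimal sketch for a Bernoulli sample testing $H_0: p = p_0$ against an unrestricted $p$, so the difference in the number of free parameters is 1. The data (7 successes in 10 trials) and $p_0 = 0.5$ are assumptions for illustration. For one degree of freedom the chi-square tail probability can be written with the standard library as $\Pr(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$, which avoids any external dependency.

```python
import math

def bernoulli_log_likelihood(p, successes, n):
    """l(p) = s log p + (n - s) log(1 - p), treating 0 * log(0) as 0."""
    def term(count, prob):
        return 0.0 if count == 0 else count * math.log(prob)
    return term(successes, p) + term(n - successes, 1 - p)

def lrt_bernoulli(successes, n, p0):
    """Return (-2 log Lambda, p-value) for H0: p = p0, with 0 < p0 < 1."""
    p_hat = successes / n  # MLE under the unrestricted model
    stat = 2.0 * (bernoulli_log_likelihood(p_hat, successes, n)
                  - bernoulli_log_likelihood(p0, successes, n))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    p_value = math.erfc(math.sqrt(stat / 2.0))  # Pr(chi-square with 1 d.f. > stat)
    return stat, p_value

# Hypothetical data: 7 successes in 10 trials, testing p0 = 0.5.
stat, p_value = lrt_bernoulli(successes=7, n=10, p0=0.5)
print(stat, p_value)
```

The chi-square approximation is a large-sample result, so with a sample this small the p-value should be read as a rough guide only.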

15 Regression analysis: infant blood pressure
Table 11.9 (Lecture 10): The systolic blood pressure ($Y$), birthweight ($X_1$), and age ($X_2$) for 16 infants.
Regression model: $y_i = \alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i} + e_i$, where the $e_i$ are i.i.d. $N(0, \sigma^2)$, for $i = 1, \dots, 16$.
Least-squares estimates:
$$(\tilde\alpha, \tilde\beta_1, \tilde\beta_2) = \arg\min \sum_{i=1}^{16} \left[ y_i - (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}) \right]^2$$
[R output: coefficient estimates for (Intercept), Birthweight, and Age; numeric values not recoverable.]
Whatever the distribution of the error terms (provided they have mean zero, constant variance, and are uncorrelated), the least-squares estimators are unbiased and have minimum variance among all linear unbiased estimators.
The estimate of $\sigma^2$: $S^2 = MSE$ (d.f. $n - 3$), an unbiased estimator.
[R output: Df, Sum Sq, and Mean Sq for Residuals; numeric values not recoverable.]

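16 MLE: regression analysis
Regression model: $y_i = \alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i} + e_i$, where the $e_i$ are i.i.d. $N(0, \sigma^2)$, for $i = 1, \dots, 16$.
Likelihood:
$$L(\alpha, \beta_1, \beta_2, \sigma^2) = \prod_{i=1}^{16} \phi\left( y_i - (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i}); 0, \sigma^2 \right),$$
where $\phi(x; \mu, \sigma^2)$ is the probability density function of $N(\mu, \sigma^2)$.
MLE:
$$(\hat\alpha, \hat\beta_1, \hat\beta_2, \hat\sigma^2) = \arg\max L(\alpha, \beta_1, \beta_2, \sigma^2)$$
[R output: estimates of alpha, beta1, beta2, and sigma; numeric values not recoverable.]
$\hat\sigma^2 = \frac{n-3}{n} MSE$, which is biased. Here $\hat\sigma^2 = 13 s^2 / 16$ (numeric value not recoverable).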
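The relation between the MLE and least squares on this slide can be spelled out in a short derivation. This is a sketch that uses only the model stated above; the shorthand $SSE$ for the residual sum of squares is notation introduced here for convenience.

```latex
% Log-likelihood of the normal regression model, with
% SSE(\alpha,\beta_1,\beta_2) = \sum_{i=1}^{n} [y_i - (\alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i})]^2
\begin{align*}
l(\alpha, \beta_1, \beta_2, \sigma^2)
  &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\sigma^2
     - \frac{SSE(\alpha, \beta_1, \beta_2)}{2\sigma^2} \\
% For any fixed \sigma^2 > 0, maximizing l over the coefficients is the same
% as minimizing SSE, so the MLEs of the coefficients are the least-squares estimates.
\frac{\partial l}{\partial \sigma^2}
  &= -\frac{n}{2\sigma^2} + \frac{SSE}{2\sigma^4} = 0
  \quad\Longrightarrow\quad
  \hat\sigma^2 = \frac{SSE(\hat\alpha, \hat\beta_1, \hat\beta_2)}{n} \\
% With three regression coefficients, MSE = SSE / (n - 3), hence
\hat\sigma^2 &= \frac{n-3}{n}\, MSE
  \qquad (n = 16:\ \hat\sigma^2 = \tfrac{13}{16}\, MSE)
\end{align*}
```

Because $E(MSE) = \sigma^2$, the factor $(n-3)/n$ shows directly that $\hat\sigma^2$ underestimates $\sigma^2$ on average, with a bias that vanishes as $n$ grows.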

17 LRT: regression analysis
Model comparisons (R output for each fit; numeric values not recoverable)
- fit0: Y ~ 1. Reported: alpha, sigma, loglik
- fit1: Y ~ $\beta_1 X_1$ (Birthweight). Reported: alpha, beta1, sigma, loglik
- fit2: Y ~ $\beta_2 X_2$ (Age). Reported: alpha, beta1, sigma, loglik
- fit3: Y ~ $\beta_1 X_1 + \beta_2 X_2$ (full model). Reported: alpha, beta1, beta2, sigma, loglik

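18 LRT: regression analysis
[Table with columns Models, $-2 \log \Lambda$, df, and p-value; most entries are not recoverable. Surviving rows, in order: fit0 vs. fit (p-value e-45); fit0 vs. fit; fit0 vs. fit (p-value e-50); fit1 vs. fit (p-value e-8); fit2 vs. fit (p-value e-50).]
Note: fit1 vs. fit2 is a non-nested model comparison.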

19 AIC: the Akaike criterion
Model fit, as measured by the maximized likelihood, never gets worse as model complexity increases. We would like to strike a good balance between model fit and model simplicity.
AIC combines a measure of model fit with a measure of model complexity: the smaller, the better.
For a given data set and a given model,
$$AIC = -2 \log L + 2p,$$
where $L$ is the maximum likelihood of the data using the model, and $p$ is the number of parameters in the model.
- $-2 \log L$ is a function of the prediction error: the smaller, the better. It measures how well the model fits the data.
- $2p$ penalizes complex models: the smaller, the better.
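The AIC formula is simple enough to sketch directly. In the snippet below the model names, maximized log-likelihoods, and parameter counts are invented for illustration and are not the lecture's fits; only the formula $AIC = -2 \log L + 2p$ comes from the slide.

```python
def aic(max_log_likelihood, n_params):
    """AIC = -2 log L + 2p, where log L is the maximized log-likelihood."""
    return -2.0 * max_log_likelihood + 2 * n_params

# Hypothetical (maximized log-likelihood, number of parameters) per model.
hypothetical_fits = {
    "small": (-52.0, 2),
    "medium": (-45.5, 3),
    "large": (-45.1, 4),
}

aic_table = {name: aic(loglik, p) for name, (loglik, p) in hypothetical_fits.items()}
best_model = min(aic_table, key=aic_table.get)  # smallest AIC is preferred

for name, value in aic_table.items():
    print(name, value)
print("preferred:", best_model)
```

In this made-up example the largest model has the highest likelihood, but its small gain in fit does not offset the penalty for the extra parameter.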

20 AIC: model comparisons
Consider a number of candidate models. They need not be nested.
Calculate their AIC values. Choose the model(s) with the smallest AIC.
Theoretically, AIC aims to estimate the prediction accuracy of the model for new data sets, up to a constant.
The absolute value of AIC is meaningless. The relative AIC values, between models, are meaningful.
[Table: Model and AIC for the four fits; numeric values not recoverable.]

21 AIC: stepwise selection
Often there are too many candidate models to compute the AIC of every one. In that case we can use stepwise selection:
- start with some model, simple or complex
- do a forward step as well as a backward step based on AIC
- repeat until no predictor should be added and no predictor should be removed.
In R: stepAIC() (see R session 10)
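The stepwise idea can be sketched generically. The function below is an illustrative Python outline, not a port of R's `stepAIC()`: it takes a hypothetical callable `aic_of` that returns the AIC of the model containing a given set of predictors, and at each step moves to the best single addition or removal if that lowers the AIC. The AIC values in the lookup table are invented for the example.

```python
def stepwise_aic(candidates, aic_of, start=()):
    """Greedy forward/backward search over predictor sets, minimizing AIC."""
    current = frozenset(start)
    current_aic = aic_of(current)
    while True:
        # Forward moves add one unused predictor; backward moves drop one.
        neighbours = [current | {v} for v in candidates if v not in current]
        neighbours += [current - {v} for v in current]
        if not neighbours:
            break
        best = min(neighbours, key=aic_of)
        best_aic = aic_of(best)
        if best_aic >= current_aic:
            break  # no single addition or removal improves the AIC
        current, current_aic = best, best_aic
    return current, current_aic

# Hypothetical AIC values for every subset of two predictors.
hypothetical_aic = {
    frozenset(): 120.0,
    frozenset({"birthweight"}): 118.0,
    frozenset({"age"}): 95.0,
    frozenset({"birthweight", "age"}): 80.0,
}

selected, selected_aic = stepwise_aic(
    candidates=["birthweight", "age"],
    aic_of=lambda predictors: hypothetical_aic[frozenset(predictors)],
)
print(sorted(selected), selected_aic)
```

The search always terminates because the AIC strictly decreases at every accepted step, but like any greedy method it can stop at a model that is not the overall best.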

22 Model comparison with LRT and AIC
Works for any distribution:
- Compute the likelihood function
- Find the MLE under each model
- Compute the maximum likelihood of each model
- Perform LRT or compare AIC values
  - nested models: LRT
  - non-nested models: AIC

23 Summary
Maximum likelihood approach
- Maximum likelihood estimation maximizes the likelihood
- Asymptotic normality
- LRT
- AIC

24 Next week
Optimization approach: how to find the MLE numerically?
- Derivative-based approach
- Derivative-free approach
- R package
