Estimation
STAT 151 Class 6, September 24, 2018
Pandemic data

Treatment outcome, X, from n = 100 patients in a pandemic: 1 = recovered, 0 = not recovered.

1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 0 1 1 0 0 1 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 0 0 0 0 1 1 1 0 1 0 0 0 1 1 1 0 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 1

A probability model for treatment outcome:

Outcome               Probability
1 (recovers)          p
0 (does not recover)  1 - p

How can we estimate p and 1 - p?
Possible solutions

Some assumptions:
- P(success) = p, 0 < p < 1, is the same for every trial, so we can combine all 100 patients to evaluate the drug's efficacy
- The outcomes of the trials are independent of one another, to simplify calculations

A few possible models:
[Figure: three bar charts of P(X) against X = 0, 1, for the Bernoulli models p = 0.5, p = 0.6, and p = 0.3]
Maximum likelihood estimation (MLE)

Key ideas:
(a) The best model for the observed data is the best model for the population
(b) The best model is the most likely explanation of the observed data
(c) (a) and (b) lead to a method called maximum likelihood estimation

Some notation and terminology:
- We draw an independent and identically distributed (iid) sample X_1, X_2, ..., X_n to estimate p
- Each observation X_i is an observation from a probability model, Bernoulli(p)
- We write the PDF of X as f(X | θ), for both discrete and continuous variables, where θ is a generic symbol for the parameter(s)
- For any quantity Q, we use ˆQ to denote its estimate (estimator)
MLE (2)

Our data consist of (X_1, ..., X_100) = (1, 1, 1, 0, 0, ..., 0, 1): 60 1s and 40 0s.

The probability (likelihood) that X_1 = 1 is p
The likelihood that X_2 = 1 is p
The likelihood that X_3 = 1 is p
The likelihood that X_4 = 0 is 1 - p, etc.

The likelihood that (X_1, ..., X_100) = (1, 1, 1, 0, 0, ..., 0, 1) is

L(p | X_1, ..., X_100) = L(p | X_1) × L(p | X_2) × ... × L(p | X_99) × L(p | X_100)
                       = p × p × p × (1 - p) × ... × (1 - p) × p
                       = p^60 (1 - p)^40

L(p | X_1, ..., X_100) ≡ L(p) is called a likelihood function, and it is a function of p. L(p) can be considered the likelihood of the observed data for a particular value of p. The maximum likelihood estimate (MLE) of p is the value of p that gives the highest likelihood for the observed data.
Finding MLE: method 1

Tabulate L(p) = p^60 (1 - p)^40 over a grid of values of p:

p     L(p)
0     0
0.1   1.4 × 10^-62
0.2   1.5 × 10^-46
0.3   2.7 × 10^-38
0.4   1.8 × 10^-33
0.5   7.9 × 10^-31
0.6   5.9 × 10^-30
0.7   6.2 × 10^-31
0.8   1.7 × 10^-34
0.9   1.8 × 10^-43
1.0   0

The likelihood peaks at p = 0.6.
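The grid search above takes only a few lines of code. This is an illustrative sketch in Python (the likelihood and grid follow the slide; the variable names are mine):

```python
# Method 1: evaluate L(p) = p^60 (1 - p)^40 over a grid and pick the peak
def likelihood(p):
    return p**60 * (1 - p)**40

grid = [i / 100 for i in range(101)]   # p = 0.00, 0.01, ..., 1.00
p_hat = max(grid, key=likelihood)      # grid point with the highest likelihood
print(p_hat)                           # 0.6, matching the table's peak
```

A finer grid gives a more precise answer, but as the later slides show, calculus gives the exact maximiser directly.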
MLE [Ronald Aylmer (R.A.) Fisher, 1890-1962]

For iid X_1, ..., X_n with PDF f(x | θ), the likelihood of θ is:

L(θ) = L(θ | X_1, ..., X_n) = L(θ | X_1) × ... × L(θ | X_n)
     = f(X_1 | θ) × ... × f(X_n | θ) = Π_{i=1}^n f(X_i | θ)

L(θ) is the likelihood of observing X_1, ..., X_n for a particular θ. The MLE of θ is the value ˆθ that gives the highest likelihood for the data, among all possible values of θ.

The MLE is usually obtained by maximizing log L(θ) ≡ l(θ). Since the logarithmic function is a monotone function of its argument, maximizing the likelihood or the log-likelihood yields the same ˆθ.

When possible, it is best to draw a figure of L(θ) or l(θ).
Likelihood vs. log-likelihood: finding MLE, method 2

[Figure: the likelihood L(p) = p^60 (1 - p)^40 (left) and the log-likelihood l(p) = 60 log(p) + 40 log(1 - p) (right), each plotted against p on (0, 1); both curves peak at the same p]
Finding MLE: method 3

For X_1, ..., X_n iid Bernoulli(p), f(X_i | p) = p^{X_i} (1 - p)^{1 - X_i}, so with n = 100:

L(p) = Π_{i=1}^n p^{X_i} (1 - p)^{1 - X_i} = p^60 (1 - p)^40

l(p) = log L(p) = log Π_{i=1}^n p^{X_i} (1 - p)^{1 - X_i} = Σ_{i=1}^n [X_i log p + (1 - X_i) log(1 - p)]

The MLE ˆp is the value that maximises the log-likelihood, so it solves dl(p)/dp = 0 at p = ˆp:

Σ_{i=1}^n [X_i/ˆp + (1 - X_i)/(1 - ˆp) × (-1)] = 0
(1 - ˆp) Σ X_i - ˆp Σ (1 - X_i) = 0
Σ X_i - ˆp Σ X_i - ˆp n + ˆp Σ X_i = 0
ˆp = Σ X_i / n = X̄ = 60/100 = 0.6
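The closed form ˆp = X̄ is just the sample mean, so computing it from data is immediate. A minimal sketch (reconstructing the sample from the slide's counts, 60 ones and 40 zeros, rather than retyping the exact sequence):

```python
# Closed-form Bernoulli MLE: p_hat is the sample mean
outcomes = [1] * 60 + [0] * 40          # 60 recoveries, 40 non-recoveries
p_hat = sum(outcomes) / len(outcomes)   # X-bar
print(p_hat)                            # 0.6, matching the calculus derivation
```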
Financial crises data

[Timeline of crises, 1980-2020: '82 Mexican; '84 S&L; '87 Black Monday; '91 Commercial RE; '97-'98 Asian/LTCM; '00 Dotcom; '07 Subprime; '12 Euro?]

Data: counts per decade are 3, 3, 2, 1.
- X = # of crises per unit time, e.g., a decade
- possible values for X: 0, 1, 2, ...
- assume crises occur (i) independently and (ii) at a constant rate
- a (probability) model for the # of random events over time is Poisson(λ), where λ > 0 is the rate of crises per unit time

How can we use the data (X_1, X_2, X_3, X_4) = (3, 3, 2, 1) to learn about λ? Which Poisson model is best for the data, and what is the best λ?
Financial crises data (2)

Original data (X_1, X_2, X_3, X_4) = (3, 3, 2, 1); let's ignore X_4 for now. The # of crises (X) in n = 3 decades: (X_1, X_2, X_3) = (3, 3, 2).

The likelihood that the first observation is 3 is (λ^3/3!) e^{-λ}
The likelihood that the second observation is 3 is (λ^3/3!) e^{-λ}
The likelihood that the third observation is 2 is (λ^2/2!) e^{-λ}

The likelihood of (X_1, X_2, X_3) = (3, 3, 2) for a particular λ is

L(λ) = (λ^3/3!) e^{-λ} × (λ^3/3!) e^{-λ} × (λ^2/2!) e^{-λ} = λ^{3+3+2}/(3! 3! 2!) e^{-3λ}

Which value of λ makes the observed data most probable?
Financial crises data (3)

λ      L(λ) = λ^{3+3+2}/(3! 3! 2!) e^{-3λ}
2.50   0.011721
2.55   0.011821
2.60   0.011884
2.65   0.011912
2.70   0.011907
2.75   0.011869
2.80   0.011799
2.85   0.011701

On this grid the likelihood peaks at λ = 2.65.
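The table can be reproduced with the same grid-search idea as before. A sketch in Python (the formula and grid are the slide's; the names are mine):

```python
from math import exp, factorial

# Poisson likelihood from the slide: L(lambda) = lambda^(3+3+2) e^(-3 lambda) / (3! 3! 2!)
def likelihood(lam):
    return lam**8 * exp(-3 * lam) / (factorial(3) * factorial(3) * factorial(2))

grid = [2.50 + 0.05 * i for i in range(8)]   # the table's grid: 2.50, 2.55, ..., 2.85
lam_hat = max(grid, key=likelihood)          # grid point with the highest likelihood
print(round(lam_hat, 2))                     # 2.65, the table's peak
```

The true maximiser (derived on a later slide) is 8/3 ≈ 2.67, which falls between the grid points 2.65 and 2.70, consistent with the near-tie in the table.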
Financial crises data (4)

[Figure: the likelihood L(λ) = λ^{3+3+2} e^{-3λ}/(3! 3! 2!) (left) and the log-likelihood l(λ) = (3+3+2) log(λ) - 3λ - log(3! 3! 2!) (right), each plotted against λ on (0, 10); both curves peak at the same λ]
Financial crises data (5)

L(λ) = Π_{i=1}^3 f(X_i | λ) = Π_{i=1}^3 (λ^{X_i}/X_i!) e^{-λ}

l(λ) = Σ_{i=1}^3 {X_i log(λ) - λ - log(X_i!)} = log(λ) Σ_{i=1}^3 X_i - 3λ - Σ_{i=1}^3 log(X_i!)

Let ˆλ be the MLE of λ; then ˆλ is determined as follows:

d l(ˆλ)/dλ = (1/ˆλ) Σ_{i=1}^3 X_i - 3 = 0  ⟹  ˆλ = Σ_{i=1}^3 X_i / 3 = X̄ = 8/3
Invariance property: Financial crises data (6)

The MLE of λ, the average # of crises in a decade, is ˆλ = X̄ = 8/3. Other characteristics of X might be of interest:

(a) Average time between crises, E(T) = 1/λ (recall the link to Exp(λ)):
    Ê(T) = 1/ˆλ = 1/(8/3) = 0.375 decades, or 3.75 years

(b) Probability of no crises in the next decade, P(X = 0):
    P(X = 0) = (λ^0/0!) e^{-λ} = e^{-λ}, so P̂(X = 0) = e^{-ˆλ} = e^{-8/3} ≈ 0.07

If ˆθ is the MLE of θ, then for any function g(θ), the MLE of g(θ) is g(ˆθ). This is called the invariance property of the MLE.
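The invariance property means both derived quantities are computed by plugging ˆλ straight into g. A short sketch of the two calculations above (variable names are mine):

```python
from math import exp

lam_hat = 8 / 3                  # MLE of the crisis rate per decade

# Invariance: the MLE of g(lambda) is g(lam_hat)
mean_gap = 1 / lam_hat           # (a) estimated mean time between crises, in decades
p_no_crisis = exp(-lam_hat)      # (b) estimated P(X = 0) in the next decade

print(round(mean_gap, 3))        # 0.375 decades, i.e. 3.75 years
print(round(p_no_crisis, 2))     # 0.07
```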
Special cases: Financial crises data (7)

Original data (X_1, X_2, X_3, X_4) = (3, 3, 2, 1); X_4 is an observation from 2010-2017 = 0.7 decade, so X_4 is called censored. Assuming censoring is random, X_4 ~ Poisson(0.7λ).

The likelihood that the first observation is 3 is (λ^3/3!) e^{-λ}
The likelihood that the second observation is 3 is (λ^3/3!) e^{-λ}
The likelihood that the third observation is 2 is (λ^2/2!) e^{-λ}
The likelihood that the fourth observation is 1 is ((0.7λ)^1/1!) e^{-0.7λ}

The likelihood that (X_1, X_2, X_3, X_4) = (3, 3, 2, 1) is

L(λ) = (λ^3/3!) e^{-λ} × (λ^3/3!) e^{-λ} × (λ^2/2!) e^{-λ} × ((0.7λ)^1/1!) e^{-0.7λ}
     = 0.7 λ^{3+3+2+1} e^{-3.7λ}/(3! 3! 2! 1!)

which is the new likelihood.
Financial crises data (8)

L(λ) = 0.7 λ^{3+3+2+1} e^{-3.7λ}/(3! 3! 2! 1!)

l(λ) = log L(λ) = log(0.7) + 9 log(λ) - 3.7λ - log(3! 3! 2! 1!)

Let ˆλ be the MLE of λ; then ˆλ is determined as follows:

d l(ˆλ)/dλ = 9/ˆλ - 3.7 = 0  ⟹  ˆλ = 9/3.7 = (total # of events)/(total observation time)
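The "total events over total time" form generalises naturally: each observation window contributes its count and its length. A minimal sketch (list names are mine):

```python
# Censored-data Poisson MLE: total # of events / total observed time
events = [3, 3, 2, 1]            # crises in each observation window
lengths = [1.0, 1.0, 1.0, 0.7]   # window lengths in decades; the last window is censored
lam_hat = sum(events) / sum(lengths)
print(round(lam_hat, 3))         # 9/3.7, about 2.432 crises per decade
```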
Pandemic data (2)

The MLE suggests estimating p using ˆp = X̄ = Σ_{i=1}^n X_i / n.

ˆp = X̄ is called an estimator because it can be applied to any sample X_1, ..., X_n. Our sample (X_1, ..., X_100) = (1, 1, 1, 0, 0, ..., 0, 1) gives

ˆp = (1 + 1 + 1 + 0 + 0 + ... + 0 + 1)/100 = 60/100

so our estimate of p is 0.6. An estimate is the value of an estimator applied to a particular sample.
Estimate vs. estimator

Using the sample (X_1, ..., X_100) = (1, 1, 1, 0, 0, ..., 0, 1), our estimate of p is ˆp = X̄ = 0.6.

Our estimate comes from a sample, so it has a sampling error:

sampling error = estimate - parameter = 0.6 - p

which is unknown and not estimable, since p is unknown. For one sample, ˆp - p = ?. Instead, we study the performance of the estimator that produces our estimate, averaged over many samples:

E(ˆp - p)         = average sampling error                     = bias
E[{ˆp - E(ˆp)}^2] = differences in estimates between samples   = variance
E(ˆp - p)^2       = average distance of estimates to p         = MSE
Bias: average sampling error

For an estimator ˆθ of θ, the bias is the average sampling error using ˆθ over different samples of size n:

bias(ˆθ) = E(ˆθ - θ)

An estimator is unbiased if bias(ˆθ) = 0; otherwise, it is biased. A biased estimator systematically overestimates or underestimates θ.

Some estimators may be biased when the sample size n is small, but bias(ˆθ) → 0 for large values of n. Those estimators are called consistent estimators. In practice, it is often sufficient to look for a consistent rather than an unbiased estimator.

[Figure: estimates ˆθ from different samples, centred on θ (unbiased) vs. centred away from θ (biased)]
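The bias definition can be checked by simulation: draw many samples, compute ˆp for each, and compare the average of the estimates to p. An illustrative sketch (not from the slides; the true p and sample size mirror the pandemic example):

```python
import random

# Monte Carlo check that p_hat = X-bar is (approximately) unbiased for Bernoulli(p):
# the average of p_hat over many samples of size n should be close to p
random.seed(0)
p, n, reps = 0.6, 100, 20000
estimates = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]
avg = sum(estimates) / reps
print(f"average of p_hat over {reps} samples: {avg:.4f}")  # close to p = 0.6
```

The remaining gap between the average and p is Monte Carlo noise; it shrinks as reps grows.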
Variance: does the estimate vary much with the sample?

The variance measures how much ˆθ varies when estimating the same θ using different samples of size n:

var(ˆθ) = E[{ˆθ - E(ˆθ)}^2]

Recall that var(ˆθ) is the sampling variation. A large var(ˆθ) suggests the estimator's estimate of the (same) unknown θ varies a lot with the sample chosen, so an estimator with a large variance is bad.

[Figure: estimates from different samples tightly clustered around E(ˆθ) (small variance) vs. widely spread (large variance)]

Note that the reference point is E(ˆθ), not θ, so an estimator with a small variance does not guarantee that its estimate will be close to the unknown θ.
Mean squared error (MSE): is our estimate close to the unknown θ?

The mean squared error (MSE) measures, on average, the (squared) distance between the estimate and θ:

MSE(ˆθ) = {bias(ˆθ)}^2 + var(ˆθ)

For an unbiased estimator, bias(ˆθ) = 0 and MSE(ˆθ) = var(ˆθ), for all n. For a consistent estimator, bias(ˆθ) → 0 and MSE(ˆθ) → var(ˆθ) for large n. For consistent or unbiased estimators, variance is therefore the best measure of performance. We illustrate the concept using unbiased estimators, so MSE(ˆθ) = var(ˆθ).

[Figure: estimates from different samples close to θ (small MSE) vs. far from θ (large MSE)]

An estimator with a lower MSE has a higher chance of producing an estimate close to θ.
Bias vs. Variance: Financial crises data (9)

Two estimators of λ, both MLEs; which is better?
(a) ˆλ = (X_1 + X_2 + X_3)/3
(b) ˆλ = (X_1 + X_2 + X_3 + X_4)/3.7

(1) Recall that for Y ~ Poisson(µ), E(Y) = var(Y) = µ.
(2) X_1 + X_2 + X_3 + X_4 = # of events in (3 + t) decades ~ Poisson((3 + t)λ), t > 0:

E[(X_1 + X_2 + X_3 + X_4)/(3 + t)] = E(X_1 + X_2 + X_3 + X_4)/(3 + t) = (3 + t)λ/(3 + t) = λ  ⟹  bias = 0

var[(X_1 + X_2 + X_3 + X_4)/(3 + t)] = var(X_1 + X_2 + X_3 + X_4)/(3 + t)^2 = (3 + t)λ/(3 + t)^2 = λ/(3 + t), t > 0

(3) The variance decreases with a larger t, so (b) is better.
(4) Using a larger sample is better (c.f. Class 7).
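The variance comparison λ/3 vs. λ/3.7 can also be seen by simulation. An illustrative sketch (not from the slides; the true λ is set to the estimate 8/3, and the Poisson sampler is a standard textbook method):

```python
import math
import random

# Monte Carlo comparison of the two unbiased estimators of lambda:
# (a) uses the 3 full decades only; (b) also uses the censored 0.7-decade window.
def poisson_draw(mu, rng):
    # Knuth's multiplicative method; adequate here because mu is small
    limit, k, prod = math.exp(-mu), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

rng = random.Random(1)
lam, reps = 8 / 3, 20000
est_a, est_b = [], []
for _ in range(reps):
    x123 = sum(poisson_draw(lam, rng) for _ in range(3))   # 3 full decades
    x4 = poisson_draw(0.7 * lam, rng)                      # censored window
    est_a.append(x123 / 3)
    est_b.append((x123 + x4) / 3.7)

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Both estimators centre on lambda, but (b) should show the smaller variance,
# matching lambda/3 > lambda/3.7
print(variance(est_a), variance(est_b))
```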
Summary

- Consistent or unbiased estimators are desirable.
- Among consistent or unbiased estimators, the estimator with the smallest variance is efficient.
- Under most circumstances, as the sample size increases (asymptotically), if ˆθ is the MLE and ˆθ* is any other unbiased estimator of θ, then

  var(ˆθ) ≤ var(ˆθ*)

  so ˆθ is at least as good as ˆθ*; hence the MLE is efficient.
- Invariance: if we are interested in estimating any function of θ, say g(θ), the following also holds:

  var[g(ˆθ)] ≤ var[g(ˆθ*)]

  so g(ˆθ) is at least as good as g(ˆθ*).