Methods of Estimation
MIT 18.655
Dr. Kempthorne
Spring 2016
Outline
1 Methods of Estimation I
Minimum Contrast Estimates

Data $X \in \mathcal{X}$, $X \sim P \in \mathcal{P} = \{P_\theta,\ \theta \in \Theta\}$.

Problem: find a function $\hat\theta(X)$ which is close to $\theta$.

Consider $\rho : \mathcal{X} \times \Theta \to \mathbb{R}$ and define
$$D(\theta_0, \theta) = E_{\theta_0}\,\rho(X, \theta)$$
to measure the discrepancy between $\theta$ and the true value $\theta_0$.

As a discrepancy measure, $D$ makes sense if the value of $\theta$ minimizing $D(\theta_0, \cdot)$ is $\theta = \theta_0$.

If $P_{\theta_0}$ were true and we knew $D(\theta_0, \theta)$, we could obtain $\theta_0$ as the minimizer. Instead of observing $D(\theta_0, \theta)$, we observe $\rho(X, \theta)$.

$\rho(\cdot, \cdot)$ is a contrast function, and the minimizer $\hat\theta(X)$ of $\rho(X, \cdot)$ is a minimum-contrast estimate.
The definition extends to Euclidean $\Theta \subset \mathbb{R}^d$, with $\theta_0$ an interior point of $\Theta$.

Smooth mapping: $\theta \mapsto D(\theta_0, \theta)$. Then $\theta = \theta_0$ solves
$$\nabla_\theta D(\theta_0, \theta) = 0, \quad \text{where } \nabla_\theta = \left(\frac{\partial}{\partial\theta_1}, \ldots, \frac{\partial}{\partial\theta_d}\right)^T.$$

Substitute $\rho(x, \theta)$ for $D(\theta_0, \theta)$ and solve $\nabla_\theta\,\rho(x, \theta) = 0$ at $\theta = \hat\theta$.
Estimating Equations: $\Psi : \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}^d$, where $\Psi = (\psi_1, \ldots, \psi_d)^T$.

For every $\theta_0 \in \Theta$, the expectation of $\Psi$ under $P_{\theta_0}$,
$$V(\theta_0, \theta) = E_{\theta_0}[\Psi(X, \theta)],$$
satisfies $V(\theta_0, \theta) = 0$ with unique solution $\theta = \theta_0$.
Example 2.1.1: Least Squares. $\mu(z) = g(\beta, z)$, $\beta \in \mathbb{R}^d$.
$x = \{(z_i, Y_i) : 1 \le i \le n\}$, where $Y_1, \ldots, Y_n$ are independent.

Define $\rho(x, \beta) = \|Y - \mu\|^2 = \sum_{i=1}^n [Y_i - g(\beta, z_i)]^2$.

Consider $Y_i = \mu(z_i) + E_i$, where $\mu(z_i) = g(\beta, z_i)$ and the $E_i$ are iid $N(0, \sigma_0^2)$. Then $\beta$ parametrizes the model and we can write:
$$D(\beta_0, \beta) = E_{\beta_0}\,\rho(X, \beta) = n\sigma_0^2 + \sum_{i=1}^n [g(\beta_0, z_i) - g(\beta, z_i)]^2.$$
This is minimized by $\beta = \beta_0$, and uniquely so iff $\beta$ is identifiable.
The least-squares estimate $\hat\beta$ minimizes $\rho(x, \beta)$.

Conditions guaranteeing the existence of $\hat\beta$:
Continuity of $g(\cdot, z_i)$.
The minimum of $\rho(x, \cdot)$ is attained on a compact set of $\beta$, e.g., when $|g(\beta, z_i)| \to \infty$ as $|\beta| \to \infty$.

If $g(\beta, z_i)$ is differentiable in $\beta$, then $\hat\beta$ satisfies the Normal Equations, obtained by taking partial derivatives of
$$\rho(x, \beta) = \|Y - \mu\|^2 = \sum_{i=1}^n [Y_i - g(\beta, z_i)]^2$$
and solving $\dfrac{\partial \rho(x, \beta)}{\partial \beta_j} = 0$.
$$\rho(x, \beta) = \|Y - \mu\|^2 = \sum_{i=1}^n [Y_i - g(\beta, z_i)]^2$$
Solve: $\dfrac{\partial \rho(x, \beta)}{\partial \beta_j} = 0$
$$\sum_{i=1}^n 2[Y_i - g(\beta, z_i)]\,(-1)\,\frac{\partial g(\beta, z_i)}{\partial \beta_j} = 0$$
$$\sum_{i=1}^n Y_i \frac{\partial g(\beta, z_i)}{\partial \beta_j} - \sum_{i=1}^n g(\beta, z_i)\,\frac{\partial g(\beta, z_i)}{\partial \beta_j} = 0$$
Linear case: $g(\beta, z_i) = \sum_{j=1}^d z_{ij}\beta_j = z_i^T\beta$. Then $\dfrac{\partial \rho(x, \beta)}{\partial \beta_j} = 0$ gives
$$\sum_{i=1}^n Y_i \frac{\partial g(\beta, z_i)}{\partial \beta_j} - \sum_{i=1}^n g(\beta, z_i)\,\frac{\partial g(\beta, z_i)}{\partial \beta_j} = 0$$
$$\sum_{i=1}^n z_{ij} Y_i - \sum_{i=1}^n z_{ij}\,(z_i^T\beta) = 0$$
$$\sum_{i=1}^n z_{ij} Y_i - \sum_{i=1}^n z_{ij} \sum_{k=1}^d z_{ik}\beta_k = 0, \quad j = 1, \ldots, d$$
$$Z_D^T Y - Z_D^T Z_D\,\beta = 0$$
where $Z_D$ is the $(n \times d)$ design matrix with $(i, j)$ element $z_{ij}$.
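As an illustration of the linear case (not part of the original slides), the sketch below solves the normal equations $Z_D^T Z_D\,\beta = Z_D^T Y$ numerically on simulated data; the data, sample size, and true $\beta$ are assumptions made for the example.

```python
import numpy as np

# Minimal sketch (assumed simulated data): solve the normal equations
# Z_D^T Z_D beta = Z_D^T Y for the linear model Y_i = z_i^T beta + E_i.
rng = np.random.default_rng(0)
n, d = 100, 3
Z_D = rng.normal(size=(n, d))            # design matrix with rows z_i^T
beta_true = np.array([1.0, -2.0, 0.5])
Y = Z_D @ beta_true + rng.normal(scale=0.3, size=n)

# Direct solution of the normal equations
beta_hat = np.linalg.solve(Z_D.T @ Z_D, Z_D.T @ Y)

# Equivalent and numerically preferable: a least-squares solver
beta_hat_lstsq, *_ = np.linalg.lstsq(Z_D, Y, rcond=None)

print(beta_hat, beta_hat_lstsq)
```

Both calls return the same $\hat\beta$ when $Z_D$ has full column rank; the least-squares solver is simply the more stable way to perform the same computation.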
Note: Least Squares exemplifies minimum contrast and estimating equation methodology. Distribution assumptions are not necessary to motivate the estimate as a mathematical approximation.
Method of Moments

$X_1, \ldots, X_n$ iid as $X \sim P_\theta$, $\theta \in \mathbb{R}^d$.

$\mu_1(\theta), \mu_2(\theta), \ldots, \mu_d(\theta)$: $\mu_j(\theta) = \mu_j = E[X^j \mid \theta]$, the $j$th moment of $X$.

Sample moments: $\hat\mu_j = \frac{1}{n}\sum_{i=1}^n X_i^j$, $j = 1, \ldots, d$.

Method of Moments: solve for $\theta$ in the system of equations
$$\mu_1(\theta) = \hat\mu_1, \quad \mu_2(\theta) = \hat\mu_2, \quad \ldots, \quad \mu_d(\theta) = \hat\mu_d.$$
Note:
$\theta$ must be identifiable.
Existence of $\mu_j$: $\lim_{n\to\infty} \hat\mu_j = \mu_j$ with $|\mu_j| < \infty$.
If $q(\theta) = h(\mu_1, \ldots, \mu_d)$, then the Method-of-Moments Estimate of $q(\theta)$ is $\widehat{q(\theta)} = h(\hat\mu_1, \ldots, \hat\mu_d)$.
The MOM estimate of $\theta$ may not be unique! (See Problem 2.1.11.)
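A minimal sketch of the method on simulated data, assuming a Gamma model with shape $\alpha$ and rate $\lambda$ (the model choice and true parameter values are assumptions for illustration): matching the first two moments $\mu_1 = \alpha/\lambda$ and $\mu_2 = \alpha(\alpha+1)/\lambda^2$ to the sample moments gives closed-form MOM estimates.

```python
import numpy as np

# Minimal sketch (assumed Gamma(alpha, lambda) model, rate parametrization):
# match mu_1 = alpha/lam and mu_2 = alpha(alpha+1)/lam^2 to the sample moments.
rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0 / 3.0, size=5000)   # true alpha = 2, lam = 3

m1 = np.mean(x)          # first sample moment
m2 = np.mean(x ** 2)     # second sample moment

var_hat = m2 - m1 ** 2   # implied variance alpha / lam^2
alpha_hat = m1 ** 2 / var_hat
lam_hat = m1 / var_hat

print(alpha_hat, lam_hat)
```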
Plug-In and Extension Principles

Frequency Plug-In. Multinomial sample: $X_1, \ldots, X_n$ taking $K$ values $v_1, \ldots, v_K$ with
$$P(X_i = v_j) = p_j, \quad j = 1, \ldots, K.$$
Plug-in estimates: $\hat p_j = N_j / n$, where $N_j = \#\{i : X_i = v_j\}$.
Apply to any function $q(p_1, \ldots, p_K)$: $\hat q = q(\hat p_1, \ldots, \hat p_K)$.

Equivalent to substituting the empirical distribution function
$$\hat P(t) = \frac{1}{n}\sum_{i=1}^n 1\{x_i \le t\}$$
for the true distribution function $P_\theta(t) = P(X \le t \mid \theta)$ underlying an iid sample.
$\hat P$ is an estimate of $P$, and $\nu(\hat P)$ is an estimate of $\nu(P)$.
Example: $\alpha$th population quantile, $0 < \alpha < 1$:
$$\nu_\alpha(P) = \tfrac{1}{2}\left[F^{-1}(\alpha) + F_U^{-1}(\alpha)\right],$$
where $F^{-1}(\alpha) = \inf\{x : F(x) \ge \alpha\}$ and $F_U^{-1}(\alpha) = \sup\{x : F(x) \le \alpha\}$.
The plug-in estimate is $\hat\nu_\alpha = \nu_\alpha(\hat P) = \tfrac{1}{2}[\hat F^{-1}(\alpha) + \hat F_U^{-1}(\alpha)]$.

Example: Method-of-Moments estimate of the $j$th moment:
$$\nu(P) = \mu_j = E(X^j), \qquad \hat\nu = \nu(\hat P) = \hat\mu_j = \frac{1}{n}\sum_{i=1}^n x_i^j.$$

Extension Principle
Objective: estimate $q(\theta)$, a function of $\theta$.
Assume $q(\theta) = h(p_1(\theta), \ldots, p_K(\theta))$, where $h(\cdot)$ is continuous.
The extension principle estimates $q(\theta)$ with $\widehat{q(\theta)} = h(\hat p_1, \ldots, \hat p_K)$.
$h(\cdot)$ may not be unique: which $h(\cdot)$ is optimal?
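The plug-in quantile can be computed directly from the order statistics. The sketch below is a minimal illustration on simulated data, using the conventions $\hat F^{-1}(\alpha) = x_{(\lceil n\alpha\rceil)}$ and $\hat F_U^{-1}(\alpha) = x_{(\lfloor n\alpha\rfloor + 1)}$ implied by the empirical CDF; the sample and the value of $\alpha$ are assumptions for the example.

```python
import numpy as np

# Minimal sketch (assumed simulated sample): plug-in estimate of the alpha-th
# quantile, nu_alpha(P-hat) = 0.5 * (Fhat^{-1}(alpha) + Fhat_U^{-1}(alpha)).
rng = np.random.default_rng(2)
x = np.sort(rng.normal(size=201))            # order statistics x_(1) <= ... <= x_(n)
n, alpha = x.size, 0.25

k_low = int(np.ceil(n * alpha))              # Fhat^{-1}(alpha)   = x_(ceil(n*alpha))
k_up = min(int(np.floor(n * alpha)) + 1, n)  # Fhat_U^{-1}(alpha) = x_(floor(n*alpha)+1)

quantile_hat = 0.5 * (x[k_low - 1] + x[k_up - 1])
print(quantile_hat)
```

When $n\alpha$ is not an integer the two indices coincide; when it is an integer, the estimate averages the two adjacent order statistics.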
Notes on Method-of-Moments / Frequency Plug-In Estimates
Easy to compute.
Valuable as initial estimates in iterative algorithms.
Consistent estimates (close to the true parameter in large samples).
The best frequency plug-in estimates are maximum-likelihood estimates.
In some cases MOM estimators are foolish (see Example 2.1.7).
Least Squares

General Model: Only $Y$ Random
$X = \{(z_i, Y_i) : 1 \le i \le n\}$, where $Y_1, \ldots, Y_n$ are independent and $z_1, \ldots, z_n \in \mathbb{R}^d$ are fixed, non-random.
For cases $i = 1, \ldots, n$:
$$Y_i = \mu(z_i) + E_i, \quad \text{where } \mu(z) = g(\beta, z),\ \beta \in \mathbb{R}^d,$$
and the $E_i$ are independent with $E[E_i] = 0$.
The least-squares contrast function is $\rho(X, \beta) = \|Y - \mu\|^2 = \sum_{i=1}^n [Y_i - g(\beta, z_i)]^2$.
$\beta$ parametrizes the model and we can write the discrepancy function $D(\beta_0, \beta) = E_{\beta_0}\,\rho(X, \beta)$.
Least Squares: Only $Y$ Random

Contrast function: $\rho(X, \beta) = \|Y - \mu\|^2 = \sum_{i=1}^n [Y_i - g(\beta, z_i)]^2$.
Discrepancy function:
$$D(\beta_0, \beta) = E_{\beta_0}\,\rho(X, \beta) = \sum_{i=1}^n \mathrm{Var}(E_i) + \sum_{i=1}^n [g(\beta_0, z_i) - g(\beta, z_i)]^2.$$
The model is semiparametric with unknown parameter $\beta$ and unknown (joint) distribution $P_E$ of $E = (E_1, \ldots, E_n)$.

Gauss-Markov Assumptions
Assume that the distribution of $E$ satisfies:
$E(E_i) = 0$
$\mathrm{Var}(E_i) = \sigma^2$
$\mathrm{Cov}(E_i, E_j) = 0$ for $i \ne j$
General Model: $(Y, Z)$ Both Random
$(Y_1, Z_1), \ldots, (Y_n, Z_n)$ are i.i.d. as $X = (Y, Z) \sim P$.
Define $\mu(z) = E[Y \mid Z = z] = g(\beta, z)$, where $g(\cdot, \cdot)$ is a known function and $\beta \in \mathbb{R}^d$ is an unknown parameter.
Given $Z_i = z_i$, define $E_i = Y_i - \mu(z_i)$ for $i = 1, \ldots, n$.
Conditioning on the $z_i$ we can write
$$Y_i = g(\beta, z_i) + E_i, \quad i = 1, 2, \ldots, n,$$
where $E = (E_1, \ldots, E_n)$ has (joint) distribution $P_E$.
The least-squares estimate $\hat\beta$ is the plug-in estimate $\beta(\hat P)$, where $\hat P$ is the empirical distribution of the sample $\{(Z_i, Y_i), i = 1, \ldots, n\}$.
The function $g(\beta, z)$ can be linear or nonlinear in $\beta$ and $z$. Closed-form solutions exist for $\hat\beta$ when $g$ is linear in $\beta$.
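When $g$ is nonlinear in $\beta$ there is generally no closed form, and $\hat\beta$ is found by numerical minimization of the contrast. The sketch below assumes the illustrative model $g(\beta, z) = \beta_1 e^{\beta_2 z}$ and simulated data, and uses a generic optimizer rather than any method prescribed in the notes.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch (assumed model g(beta, z) = beta_1 * exp(beta_2 * z)):
# the least-squares estimate minimizes rho(x, beta) = sum_i (Y_i - g(beta, z_i))^2.
rng = np.random.default_rng(3)
z = np.linspace(0.0, 2.0, 50)
beta_true = np.array([1.5, -0.7])
Y = beta_true[0] * np.exp(beta_true[1] * z) + rng.normal(scale=0.05, size=z.size)

def rho(beta):
    # least-squares contrast as a function of beta
    residuals = Y - beta[0] * np.exp(beta[1] * z)
    return np.sum(residuals ** 2)

result = minimize(rho, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
beta_hat = result.x
print(beta_hat)
```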
Gauss-Markov Theorem: Assumptions

Data
$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
\quad\text{and}\quad
\mathbf{X} = \begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\
\vdots  &         &        & \vdots  \\
x_{n,1} & x_{n,2} & \cdots & x_{n,p}
\end{pmatrix}$$
follow a linear model satisfying the Gauss-Markov Assumptions if:
$\mathbf{y}$ is an observation of a random vector $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)^T$ with
$E(\mathbf{Y} \mid \mathbf{X}, \beta) = \mathbf{X}\beta$, where $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the $p$-vector of regression parameters.
$\mathrm{Cov}(\mathbf{Y} \mid \mathbf{X}, \beta) = \sigma^2 I_n$, for some $\sigma^2 > 0$. I.e., the random variables generating the observations are uncorrelated and have constant variance $\sigma^2$ (conditional on $\mathbf{X}$ and $\beta$).
For known constants $c_1, c_2, \ldots, c_p, c_{p+1}$, consider the problem of estimating
$$\theta = c_1\beta_1 + c_2\beta_2 + \cdots + c_p\beta_p + c_{p+1}.$$
Under the Gauss-Markov assumptions, the estimator
$$\hat\theta = c_1\hat\beta_1 + c_2\hat\beta_2 + \cdots + c_p\hat\beta_p + c_{p+1},$$
where $\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_p$ are the least-squares estimates, is
1) an unbiased estimator of $\theta$, and
2) a linear estimator of $\theta$, that is, $\hat\theta = \sum_{i=1}^n b_i y_i$ (up to the constant $c_{p+1}$), for constants $b_i$ that are known given $\mathbf{X}$.

Theorem: Under the Gauss-Markov Assumptions, the estimator $\hat\theta$ has the smallest (Best) variance among all Linear Unbiased Estimators of $\theta$, i.e., $\hat\theta$ is BLUE.
Gauss-Markov Theorem: Proof

Proof: Without loss of generality, assume $c_{p+1} = 0$ and define $\mathbf{c} = (c_1, c_2, \ldots, c_p)^T$.
The least-squares estimate of $\theta = \mathbf{c}^T\beta$ is
$$\hat\theta = \mathbf{c}^T\hat\beta = \mathbf{c}^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \equiv \mathbf{d}^T\mathbf{y},$$
a linear estimate in $\mathbf{y}$ with coefficients $\mathbf{d} = (d_1, d_2, \ldots, d_n)^T = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{c}$.
Consider an alternative linear estimate of $\theta$: $\theta^* = \mathbf{b}^T\mathbf{y}$ with fixed coefficients $\mathbf{b} = (b_1, \ldots, b_n)^T$.
Define $\mathbf{f} = \mathbf{b} - \mathbf{d}$ and note that $\theta^* = \mathbf{b}^T\mathbf{y} = (\mathbf{d} + \mathbf{f})^T\mathbf{y} = \hat\theta + \mathbf{f}^T\mathbf{y}$.
If $\theta^*$ is unbiased, then because $\hat\theta$ is unbiased,
$$0 = E(\mathbf{f}^T\mathbf{y}) = \mathbf{f}^T E(\mathbf{y}) = \mathbf{f}^T(\mathbf{X}\beta) \quad \text{for all } \beta \in \mathbb{R}^p$$
$\Rightarrow$ $\mathbf{f}$ is orthogonal to the column space of $\mathbf{X}$
$\Rightarrow$ $\mathbf{f}$ is orthogonal to $\mathbf{d} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{c}$.
If $\theta^*$ is unbiased, the orthogonality of $\mathbf{f}$ to $\mathbf{d}$ implies
$$\begin{aligned}
\mathrm{Var}(\theta^*) &= \mathrm{Var}(\mathbf{b}^T\mathbf{y}) = \mathrm{Var}(\mathbf{d}^T\mathbf{y} + \mathbf{f}^T\mathbf{y}) \\
&= \mathrm{Var}(\mathbf{d}^T\mathbf{y}) + \mathrm{Var}(\mathbf{f}^T\mathbf{y}) + 2\,\mathrm{Cov}(\mathbf{d}^T\mathbf{y}, \mathbf{f}^T\mathbf{y}) \\
&= \mathrm{Var}(\hat\theta) + \mathrm{Var}(\mathbf{f}^T\mathbf{y}) + 2\,\mathbf{d}^T\mathrm{Cov}(\mathbf{y})\,\mathbf{f} \\
&= \mathrm{Var}(\hat\theta) + \mathrm{Var}(\mathbf{f}^T\mathbf{y}) + 2\,\mathbf{d}^T(\sigma^2 I_n)\,\mathbf{f} \\
&= \mathrm{Var}(\hat\theta) + \mathrm{Var}(\mathbf{f}^T\mathbf{y}) + 2\sigma^2\,\mathbf{d}^T\mathbf{f} \\
&= \mathrm{Var}(\hat\theta) + \mathrm{Var}(\mathbf{f}^T\mathbf{y}) + 2\sigma^2 \cdot 0 \\
&\ge \mathrm{Var}(\hat\theta).
\end{aligned}$$
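A quick numerical check of the argument (not part of the original proof): the sketch below builds $\mathbf{d} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{c}$, constructs an $\mathbf{f}$ orthogonal to the column space of $\mathbf{X}$, and verifies that $\mathbf{d}^T\mathbf{f} \approx 0$ and $\mathrm{Var}(\hat\theta) \le \mathrm{Var}(\theta^*)$. The design matrix, $\mathbf{c}$, and $\sigma^2$ are assumptions made for the example.

```python
import numpy as np

# Minimal sketch (assumed simulated design): compare Var(theta_hat) = sigma^2 d'd
# with Var(theta_star) = sigma^2 b'b for b = d + f, where f is orthogonal to col(X).
rng = np.random.default_rng(4)
n, p = 30, 3
X = rng.normal(size=(n, p))
c = np.array([1.0, -1.0, 2.0])
sigma2 = 1.0

d = X @ np.linalg.solve(X.T @ X, c)             # coefficients of the LS estimator of c^T beta

# Build f orthogonal to the column space of X (so d + f stays unbiased):
v = rng.normal(size=n)
f = v - X @ np.linalg.solve(X.T @ X, X.T @ v)   # residual of v after projecting onto col(X)

b = d + f
var_hat = sigma2 * d @ d                        # Var(theta_hat)
var_star = sigma2 * b @ b                       # Var(theta_star)
print(d @ f)                                    # ~ 0: f is orthogonal to d
print(var_hat <= var_star + 1e-12)              # True: theta_hat has the smaller variance
```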
Generalized Least Squares Estimates

Generalize the Gauss-Markov assumptions for the linear regression model to
$$\mathbf{Y} = \mathbf{X}\beta + \mathbf{E},$$
where the random $n$-vector $\mathbf{E}$ satisfies $E[\mathbf{E}] = \mathbf{0}_n$ and $E[\mathbf{E}\mathbf{E}^T] = \sigma^2\Sigma$.
$\sigma^2$ is an unknown scale parameter.
$\Sigma$ is a known $(n \times n)$ positive definite matrix specifying the relative variances and correlations of the component observations.
Transform the data $(\mathbf{Y}, \mathbf{X})$ to $\mathbf{Y}^* = \Sigma^{-1/2}\mathbf{Y}$ and $\mathbf{X}^* = \Sigma^{-1/2}\mathbf{X}$; the model becomes
$$\mathbf{Y}^* = \mathbf{X}^*\beta + \mathbf{E}^*, \quad \text{where } E[\mathbf{E}^*] = \mathbf{0}_n \text{ and } E[\mathbf{E}^*(\mathbf{E}^*)^T] = \sigma^2 I_n.$$
By the Gauss-Markov Theorem, the BLUE ("GLS") estimate of $\beta$ is
$$\hat\beta = [(\mathbf{X}^*)^T(\mathbf{X}^*)]^{-1}(\mathbf{X}^*)^T\mathbf{Y}^* = [\mathbf{X}^T\Sigma^{-1}\mathbf{X}]^{-1}\mathbf{X}^T\Sigma^{-1}\mathbf{Y}.$$
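A minimal sketch of the GLS computation on simulated data, assuming an AR(1)-style known correlation matrix $\Sigma$ (an assumption for the example); a Cholesky factor $L$ with $\Sigma = LL^T$ is used in place of $\Sigma^{-1/2}$, which yields the same estimate $[\mathbf{X}^T\Sigma^{-1}\mathbf{X}]^{-1}\mathbf{X}^T\Sigma^{-1}\mathbf{Y}$.

```python
import numpy as np

# Minimal sketch (assumed AR(1)-style covariance): GLS estimate via the
# transformation Y* = L^{-1} Y, X* = L^{-1} X, where Sigma = L L^T.
rng = np.random.default_rng(5)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([2.0, -1.0])

rho = 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # known correlation structure
L = np.linalg.cholesky(Sigma)                  # Sigma = L L^T
E = L @ rng.normal(size=n)                     # errors with covariance Sigma
Y = X @ beta_true + E

# Transformed data satisfy the ordinary Gauss-Markov assumptions
X_star = np.linalg.solve(L, X)
Y_star = np.linalg.solve(L, Y)
beta_gls = np.linalg.solve(X_star.T @ X_star, X_star.T @ Y_star)

# Equivalent direct formula [X' Sigma^{-1} X]^{-1} X' Sigma^{-1} Y
Sigma_inv = np.linalg.inv(Sigma)
beta_direct = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ Y)
print(beta_gls, beta_direct)
```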
Maximum Likelihood Estimation

$X \sim P_\theta$, $\theta \in \Theta$, with density or pmf function $p(x \mid \theta)$.
Given an observation $X = x$, define the likelihood function
$$L_x(\theta) = p(x \mid \theta), \quad \text{a mapping } \Theta \to \mathbb{R}.$$
$\hat\theta_{ML} = \hat\theta_{ML}(x)$, the Maximum-Likelihood Estimate of $\theta$, is the value making $L_x(\cdot)$ a maximum:
$$\hat\theta \text{ is the MLE if } L_x(\hat\theta) = \max_{\theta\in\Theta} L_x(\theta).$$
The MLE $\hat\theta_{ML}(x)$ identifies the distribution making $x$ most likely.
The MLE coincides with the mode of the Posterior Distribution if the Prior Distribution on $\Theta$ is uniform:
$$\pi(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta) \propto p(x \mid \theta).$$
Examples

Example 2.2.4: Normal Distribution with Known Variance.

Example 2.2.5: Size of a Population.
$X_1, \ldots, X_n$ are iid $U\{1, 2, \ldots, \theta\}$, with $\theta \in \{1, 2, \ldots\}$. For $x = (x_1, \ldots, x_n)$,
$$L_x(\theta) = \prod_{i=1}^n \theta^{-1}\,1(1 \le x_i \le \theta) = \theta^{-n}\,1(\max(x_1, \ldots, x_n) \le \theta)
= \begin{cases} 0, & \theta = 0, 1, \ldots, \max(x_i) - 1 \\ \theta^{-n}, & \theta \ge \max(x_i). \end{cases}$$
Since $\theta^{-n}$ is decreasing in $\theta$, the likelihood is maximized at the smallest admissible value: $\hat\theta_{ML} = \max(x_1, \ldots, x_n)$.
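A short sketch of Example 2.2.5 on simulated data (the true $\theta$ and sample size are assumptions for illustration): the likelihood is $\theta^{-n}$ on $\theta \ge \max x_i$ and decreasing in $\theta$, so the MLE is the sample maximum.

```python
import numpy as np

# Minimal sketch of Example 2.2.5 (assumed simulated data): for X_i iid
# Uniform{1, ..., theta}, the likelihood theta^{-n} 1(theta >= max x_i)
# is decreasing in theta, so the MLE is the sample maximum.
rng = np.random.default_rng(6)
theta_true = 37
x = rng.integers(low=1, high=theta_true + 1, size=20)   # high is exclusive

theta_hat_ml = x.max()
print(theta_hat_ml)
```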
Maximum Likelihood as a Minimum Contrast Method

Define $l_x(\theta) = -\log L_x(\theta) = -\log p(x \mid \theta)$.
Because $-\log(\cdot)$ is monotone decreasing, $\hat\theta_{ML}(x)$ minimizes $l_x(\theta)$.
For an iid sample $X = (X_1, \ldots, X_n)$ with densities $p(x_i \mid \theta)$,
$$l_X(\theta) = -\log p(x_1, \ldots, x_n \mid \theta) = -\log\left[\prod_{i=1}^n p(x_i \mid \theta)\right] = -\sum_{i=1}^n \log p(x_i \mid \theta).$$
As a minimum contrast function, $\rho(x, \theta) = l_X(\theta)$ yields the MLE $\hat\theta_{ML}(x)$.
The discrepancy function corresponding to the contrast function $\rho(x, \theta)$ is
$$D(\theta_0, \theta) = E[\rho(X, \theta) \mid \theta_0] = -E[\log p(X \mid \theta) \mid \theta_0].$$
Suppose that $\theta = \theta_0$ uniquely minimizes $D(\theta_0, \cdot)$. Then
$$D(\theta_0, \theta) - D(\theta_0, \theta_0) = -E[\log p(X \mid \theta) \mid \theta_0] - \left(-E[\log p(X \mid \theta_0) \mid \theta_0]\right)
= -E\!\left[\log\frac{p(X \mid \theta)}{p(X \mid \theta_0)} \,\Big|\, \theta_0\right] > 0, \quad \text{unless } \theta = \theta_0.$$
This difference is the Kullback-Leibler Information Divergence between the distributions $P_{\theta_0}$ and $P_\theta$:
$$K(P_{\theta_0}, P_\theta) = -E\!\left[\log\frac{p(X \mid \theta)}{p(X \mid \theta_0)} \,\Big|\, \theta_0\right] = E\!\left[\log\frac{p(X \mid \theta_0)}{p(X \mid \theta)} \,\Big|\, \theta_0\right].$$

Lemma 2.2.1 (Shannon, 1948). The mutual entropy $K(P_{\theta_0}, P_\theta)$ is always well defined and $K(P_{\theta_0}, P_\theta) \ge 0$. Equality holds if and only if $\{x : p(x \mid \theta) = p(x \mid \theta_0)\}$ has probability 1 under both $P_{\theta_0}$ and $P_\theta$.
Proof: apply Jensen's Inequality (B.9.3).
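As a concrete instance of the divergence (not in the original slides), assume two normal distributions with the same known variance $\sigma^2$ and means $\theta_0$ and $\theta$; the divergence then has a simple closed form consistent with Lemma 2.2.1.

```latex
% Worked example (assumed setup): K between N(theta_0, sigma^2) and N(theta, sigma^2).
\[
K(P_{\theta_0}, P_\theta)
  = E_{\theta_0}\!\left[\log \frac{p(X\mid\theta_0)}{p(X\mid\theta)}\right]
  = E_{\theta_0}\!\left[\frac{(X-\theta)^2 - (X-\theta_0)^2}{2\sigma^2}\right]
  = \frac{(\theta_0-\theta)^2}{2\sigma^2} \;\ge\; 0,
\]
% with equality iff theta = theta_0, matching Lemma 2.2.1.
```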
Likelihood Equations

Suppose:
$X \sim P_\theta$, with $\theta \in \Theta$, an open parameter space;
the log-likelihood $l_X(\theta)$ is differentiable in $\theta$;
$\hat\theta_{ML}(x)$ exists.
Then $\hat\theta_{ML}(x)$ must satisfy the Likelihood Equation(s)
$$\nabla_\theta\, l_X(\theta) = 0.$$

Important Cases: for independent $X_i$ with densities/pmfs $p_i(x_i \mid \theta)$, the likelihood equations are
$$\sum_{i=1}^n \nabla_\theta \log p_i(x_i \mid \theta) = 0.$$
NOTE: $p_i(\cdot \mid \theta)$ may vary with $i$.
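A minimal sketch of solving a likelihood equation numerically, assuming an Exponential($\lambda$) model in the rate parametrization (an assumption for the example): the score $n/\lambda - \sum_i x_i$ is set to zero by a root finder and compared with the closed-form solution $\hat\lambda = 1/\bar x$.

```python
import numpy as np
from scipy.optimize import brentq

# Minimal sketch (assumed Exponential(lambda) model, rate parametrization):
# solve the likelihood equation d/d lambda [ n log(lambda) - lambda * sum(x) ] = 0.
rng = np.random.default_rng(7)
lam_true = 2.5
x = rng.exponential(scale=1.0 / lam_true, size=400)
n, s = x.size, x.sum()

def score(lam):
    # derivative of the log-likelihood with respect to lambda
    return n / lam - s

lam_hat_numeric = brentq(score, 1e-6, 100.0)    # root of the score on a bracketing interval
lam_hat_closed = 1.0 / x.mean()                 # closed-form MLE
print(lam_hat_numeric, lam_hat_closed)
```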
Examples
Hardy-Weinberg Proportions (Example 2.2.6)
Queues: Poisson Process Models (Exponential Arrival Times and Poisson Counts) (Example 2.2.7)
Multinomial Trials (Example 2.2.8)
Normal Regression Models (Example 2.2.9)
MIT OpenCourseWare http://ocw.mit.edu 18.655 Mathematical Statistics Spring 2016 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.