Methods of Estimation
MIT 18.655, Dr. Kempthorne, Spring 2016
Outline

1. Methods of Estimation I
Minimum Contrast Estimates

Setup: $X \sim P$, $P \in \mathcal{P} = \{P_\theta, \theta \in \Theta\}$.

Problem: finding a function $\hat{\theta}(x)$ which is close to $\theta$.

Consider $\rho: \mathcal{X} \times \Theta \to \mathbb{R}$ and define
$$D(\theta_0, \theta) = E_{\theta_0}\, \rho(X, \theta)$$
to measure the discrepancy between $\theta$ and the true value $\theta_0$.

As a discrepancy measure, $D$ makes sense if the value of $\theta$ minimizing the function is $\theta = \theta_0$. If $P_{\theta_0}$ were true and we knew $D(\theta_0, \theta)$, we could obtain $\theta_0$ as the minimizer. Instead of observing $D(\theta_0, \theta)$, we observe $\rho(x, \theta)$:

- $\rho(\cdot, \cdot)$ is a contrast function.
- $\hat{\theta}(X)$, the minimizer of $\rho(X, \cdot)$, is a minimum-contrast estimate.
The definition extends to Euclidean $\Theta \subset \mathbb{R}^d$, with $\theta_0$ an interior point of $\Theta$. Assume the mapping $\theta \mapsto D(\theta_0, \theta)$ is smooth. Then $\theta = \theta_0$ solves
$$\nabla_\theta\, D(\theta_0, \theta) = 0, \quad \text{where } \nabla_\theta = \left(\frac{\partial}{\partial \theta_1}, \ldots, \frac{\partial}{\partial \theta_d}\right)^T.$$
Substitute $\rho(x, \theta)$ for $D(\theta_0, \theta)$ and solve $\nabla_\theta\, \rho(x, \theta) = 0$ at $\theta = \hat{\theta}$.
Estimating Equations: $\Psi: \mathcal{X} \times \mathbb{R}^d \to \mathbb{R}^d$, where $\Psi = (\psi_1, \ldots, \psi_d)^T$, such that for every $\theta_0 \in \Theta$ the expectation under $P_{\theta_0}$,
$$V(\theta_0, \theta) = E_{\theta_0}[\Psi(X, \theta)],$$
satisfies $V(\theta_0, \theta) = 0$ with unique solution $\theta = \theta_0$.
Example: Least Squares. $\mu(z) = g(\beta, z)$, $\beta \in \mathbb{R}^d$; $x = \{(z_i, Y_i) : 1 \le i \le n\}$, where $Y_1, \ldots, Y_n$ are independent. Define
$$\rho(x, \beta) = \|Y - \mu\|^2 = \sum_{i=1}^n [Y_i - g(\beta, z_i)]^2.$$
Consider $Y_i = \mu(z_i) + \epsilon_i$, where $\mu(z_i) = g(\beta, z_i)$ and the $\epsilon_i$ are iid $N(0, \sigma_0^2)$. Then $\beta$ parametrizes the model and we can write
$$D(\beta_0, \beta) = E_{\beta_0}\, \rho(X, \beta) = n\sigma_0^2 + \sum_{i=1}^n [g(\beta_0, z_i) - g(\beta, z_i)]^2.$$
This is minimized by $\beta = \beta_0$, and uniquely so iff $\beta$ is identifiable.
The least-squares estimate $\hat{\beta}$ minimizes $\rho(x, \beta)$. Conditions that guarantee the existence of $\hat{\beta}$:

- Continuity of $g(\cdot, z_i)$.
- The minimum of $\rho(x, \cdot)$ is attained on a compact subset of $\{\beta\}$, e.g., when $\lim_{|\beta| \to \infty} g(\beta, z_i) = \infty$.

If $g(\beta, z_i)$ is differentiable in $\beta$, then $\hat{\beta}$ satisfies the Normal Equations, obtained by taking partial derivatives of $\rho(x, \beta) = \|Y - \mu\|^2 = \sum_{i=1}^n [Y_i - g(\beta, z_i)]^2$ and solving
$$\frac{\partial \rho(x, \beta)}{\partial \beta_j} = 0, \quad j = 1, \ldots, d.$$
Starting from $\rho(x, \beta) = \|Y - \mu\|^2 = \sum_{i=1}^n [Y_i - g(\beta, z_i)]^2$, solve $\partial \rho(x, \beta)/\partial \beta_j = 0$:
$$\sum_{i=1}^n 2\,[Y_i - g(\beta, z_i)] \left(-\frac{\partial g(\beta, z_i)}{\partial \beta_j}\right) = 0$$
$$\Longrightarrow \quad \sum_{i=1}^n \frac{\partial g(\beta, z_i)}{\partial \beta_j}\, Y_i - \sum_{i=1}^n \frac{\partial g(\beta, z_i)}{\partial \beta_j}\, g(\beta, z_i) = 0.$$
Linear case: $g(\beta, z_i) = \sum_{j=1}^d z_{ij} \beta_j = z_i^T \beta$, so $\partial g(\beta, z_i)/\partial \beta_j = z_{ij}$ and the normal equations become
$$\sum_{i=1}^n z_{ij}\, Y_i - \sum_{i=1}^n z_{ij}\, (z_i^T \beta) = 0,$$
i.e.,
$$\sum_{i=1}^n z_{ij}\, Y_i - \sum_{k=1}^d \left(\sum_{i=1}^n z_{ij}\, z_{ik}\right) \beta_k = 0, \quad j = 1, \ldots, d,$$
or, in matrix form,
$$Z_D^T Y - Z_D^T Z_D\, \beta = 0,$$
where $Z_D$ is the $(n \times d)$ design matrix with $(i, j)$ element $z_{ij}$.
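As an illustration, here is a minimal numerical sketch of solving the normal equations $Z_D^T Z_D\, \beta = Z_D^T Y$ (assuming NumPy; the simulated design and coefficients are illustrative, not from the lecture):

```python
import numpy as np

# Simulate an illustrative linear model Y = Z_D beta + noise.
rng = np.random.default_rng(0)
n, d = 100, 3
Z = rng.normal(size=(n, d))            # design matrix Z_D (n x d)
beta_true = np.array([1.0, -2.0, 0.5])
Y = Z @ beta_true + rng.normal(scale=0.3, size=n)

# Normal equations: Z_D^T Z_D beta = Z_D^T Y. Solving the linear system
# directly is preferable to forming the explicit inverse.
beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)
print(beta_hat)                        # should be close to beta_true
```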
Note: Least squares exemplifies both the minimum-contrast and the estimating-equation methodology. Distributional assumptions are not necessary to motivate the estimate as a mathematical approximation.
Method of Moments

$X_1, \ldots, X_n$ iid $X \sim P_\theta$, $\theta \in \mathbb{R}^d$, with moments $\mu_1(\theta), \mu_2(\theta), \ldots, \mu_d(\theta)$:
$$\mu_j(\theta) = \mu_j = E[X^j \mid \theta], \quad \text{the } j\text{th moment of } X.$$
Sample moments:
$$\hat{\mu}_j = \frac{1}{n} \sum_{i=1}^n X_i^j, \quad j = 1, \ldots, d.$$
Method of Moments: solve for $\theta$ in the system of equations
$$\mu_1(\theta) = \hat{\mu}_1, \quad \mu_2(\theta) = \hat{\mu}_2, \quad \ldots, \quad \mu_d(\theta) = \hat{\mu}_d.$$
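A small sketch of the method in code (a standard textbook case, assuming NumPy): for $X \sim N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$, the population moments are $\mu_1 = \mu$ and $\mu_2 = \mu^2 + \sigma^2$, so matching sample moments gives $\hat{\mu} = \hat{\mu}_1$ and $\hat{\sigma}^2 = \hat{\mu}_2 - \hat{\mu}_1^2$:

```python
import numpy as np

# Method-of-moments sketch for X ~ N(mu, sigma^2); sample is illustrative.
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=1000)

m1 = np.mean(x)         # first sample moment
m2 = np.mean(x**2)      # second sample moment
mu_hat = m1             # from mu_1(theta) = m1
sigma2_hat = m2 - m1**2 # from mu_2(theta) = m2
print(mu_hat, sigma2_hat)
```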
Notes:

- $\theta$ must be identifiable.
- Existence of the moments $\mu_j$: $\lim_{n \to \infty} \hat{\mu}_j = \mu_j$ with $|\mu_j| < \infty$.
- If $q(\theta) = h(\mu_1, \ldots, \mu_d)$, then the Method-of-Moments estimate of $q(\theta)$ is $\widehat{q(\theta)} = h(\hat{\mu}_1, \ldots, \hat{\mu}_d)$.
- The MOM estimate of $\theta$ may not be unique! (See Problem )
Plug-In and Extension Principles

Frequency Plug-In. Multinomial sample: $X_1, \ldots, X_n$ with $K$ values $v_1, \ldots, v_K$ and
$$P(X_i = v_j) = p_j, \quad j = 1, \ldots, K.$$
Plug-in estimates: $\hat{p}_j = N_j / n$, where $N_j = \#\{i : X_i = v_j\}$. Apply to any function $q(p_1, \ldots, p_K)$:
$$\hat{q} = q(\hat{p}_1, \ldots, \hat{p}_K).$$
This is equivalent to substituting the empirical distribution function
$$\hat{P}(t) = \frac{1}{n} \sum_{i=1}^n 1\{x_i \le t\}$$
for the true distribution function $P_\theta(t) = P(X \le t \mid \theta)$ underlying the iid sample. $\hat{P}$ is an estimate of $P$, and $\nu(\hat{P})$ is an estimate of $\nu(P)$.
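A frequency plug-in sketch (assuming NumPy; the distribution and the function $q$ below are illustrative, not from the lecture):

```python
import numpy as np

# Multinomial sample on values v_1..v_K; estimate q(p) by q(p_hat),
# where p_hat_j = N_j / n.
values = np.array([1, 2, 3, 4])                      # v_1, ..., v_K
rng = np.random.default_rng(2)
x = rng.choice(values, size=500, p=[0.1, 0.2, 0.3, 0.4])

counts = np.array([(x == v).sum() for v in values])  # N_j
p_hat = counts / len(x)                              # plug-in estimates

def q(p):
    return p[0] + p[2]   # e.g., q(p) = P(X is odd) = p_1 + p_3

print(q(p_hat))
```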
Example: $\alpha$th population quantile, $0 < \alpha < 1$:
$$\nu_\alpha(P) = \tfrac{1}{2}\left[F^{-1}(\alpha) + F_U^{-1}(\alpha)\right],$$
where
$$F^{-1}(\alpha) = \inf\{x : F(x) \ge \alpha\} \quad \text{and} \quad F_U^{-1}(\alpha) = \sup\{x : F(x) \le \alpha\}.$$
The plug-in estimate is $\hat{\nu}_\alpha = \nu_\alpha(\hat{P}) = \tfrac{1}{2}[\hat{F}^{-1}(\alpha) + \hat{F}_U^{-1}(\alpha)]$; see the sketch below.

Example: Method-of-Moments estimate of the $j$th moment: $\nu(P) = \mu_j = E(X^j)$, with plug-in estimate
$$\hat{\nu}(P) = \hat{\mu}_j = \nu(\hat{P}) = \frac{1}{n} \sum_{i=1}^n x_i^j.$$

Extension Principle. Objective: estimate $q(\theta)$, a function of $\theta$. Assume $q(\theta) = h(p_1(\theta), \ldots, p_K(\theta))$, where $h(\cdot)$ is continuous. The extension principle estimates $q(\theta)$ with
$$\widehat{q(\theta)} = h(\hat{p}_1, \ldots, \hat{p}_K).$$
$h(\cdot)$ may not be unique: which $h(\cdot)$ is optimal?
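Here is a sketch of the plug-in quantile applied to the empirical CDF (assuming NumPy; the handling of the sup-inverse when $n\alpha$ is an integer follows the $\inf$/$\sup$ definitions above, for samples with distinct values):

```python
import numpy as np

def plugin_quantile(x, alpha):
    """0.5 * (F_hat^{-1}(alpha) + F_hat_U^{-1}(alpha)) for the empirical CDF."""
    xs = np.sort(x)
    n = len(xs)
    k = int(np.ceil(n * alpha))      # smallest k with k/n >= alpha
    f_inv = xs[k - 1]                # inf{x : F_hat(x) >= alpha}
    # sup{x : F_hat(x) <= alpha} is one order statistic higher when n*alpha
    # is an integer (F_hat equals alpha on a whole interval), else the same.
    if np.isclose(n * alpha, round(n * alpha)) and k < n:
        f_inv_u = xs[k]
    else:
        f_inv_u = xs[k - 1]
    return 0.5 * (f_inv + f_inv_u)

rng = np.random.default_rng(3)
print(plugin_quantile(rng.normal(size=101), 0.5))  # sample median
```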
Notes on Method-of-Moments / Frequency Plug-In Estimates

- Easy to compute.
- Valuable as initial estimates in iterative algorithms.
- Consistent estimates (close to the true parameter in large samples).
- Best frequency plug-in estimates are maximum-likelihood estimates.
- In some cases, MOM estimators are foolish (see Example 2.1.7).
Least Squares

General Model: Only Y Random. $X = \{(z_i, Y_i) : 1 \le i \le n\}$, where $Y_1, \ldots, Y_n$ are independent and $z_1, \ldots, z_n \in \mathbb{R}^d$ are fixed, non-random. For cases $i = 1, \ldots, n$,
$$Y_i = \mu(z_i) + \epsilon_i, \quad \text{where } \mu(z) = g(\beta, z),\ \beta \in \mathbb{R}^d,$$
and the $\epsilon_i$ are independent with $E[\epsilon_i] = 0$. The least-squares contrast function is
$$\rho(X, \beta) = \|Y - \mu\|^2 = \sum_{i=1}^n [Y_i - g(\beta, z_i)]^2;$$
$\beta$ parametrizes the model, and we can write the discrepancy function $D(\beta_0, \beta) = E_{\beta_0}\, \rho(X, \beta)$.
Least Squares: Only Y Random

Contrast function:
$$\rho(X, \beta) = \|Y - \mu\|^2 = \sum_{i=1}^n [Y_i - g(\beta, z_i)]^2.$$
Discrepancy function:
$$D(\beta_0, \beta) = E_{\beta_0}\, \rho(X, \beta) = \sum_{i=1}^n \mathrm{Var}(\epsilon_i) + \sum_{i=1}^n [g(\beta_0, z_i) - g(\beta, z_i)]^2.$$
The model is semiparametric, with unknown parameter $\beta$ and unknown (joint) distribution $P_e$ of $E = (\epsilon_1, \ldots, \epsilon_n)$.

Gauss-Markov Assumptions. Assume that the distribution of $E$ satisfies:

- $E(\epsilon_i) = 0$
- $\mathrm{Var}(\epsilon_i) = \sigma^2$
- $\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \ne j$
General Model: (Y, Z) Both Random

$(Y_1, Z_1), \ldots, (Y_n, Z_n)$ are i.i.d. as $X = (Y, Z) \sim P$. Define
$$\mu(z) = E[Y \mid Z = z] = g(\beta, z),$$
where $g(\cdot, \cdot)$ is a known function and $\beta \in \mathbb{R}^d$ is an unknown parameter. Given $Z_i = z_i$, define $\epsilon_i = Y_i - \mu(z_i)$ for $i = 1, \ldots, n$. Conditioning on the $z_i$ we can write
$$Y_i = g(\beta, z_i) + \epsilon_i, \quad i = 1, 2, \ldots, n,$$
where $E = (\epsilon_1, \ldots, \epsilon_n)$ has (joint) distribution $P_e$.

The least-squares estimate $\hat{\beta}$ is the plug-in estimate $\beta(\hat{P})$, where $\hat{P}$ is the empirical distribution of the sample $\{(Z_i, Y_i), i = 1, \ldots, n\}$. The function $g(\beta, z)$ can be linear or nonlinear in $\beta$ and $z$; closed-form solutions for $\hat{\beta}$ exist when $g$ is linear in $\beta$.
Gauss-Markov Theorem: Assumptions

Data
$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \quad \text{and} \quad X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,p} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,p} \\ \vdots & & & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,p} \end{pmatrix}$$
follow a linear model satisfying the Gauss-Markov assumptions if $y$ is an observation of the random vector $Y = (Y_1, Y_2, \ldots, Y_n)^T$ and

- $E(Y \mid X, \beta) = X\beta$, where $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the $p$-vector of regression parameters;
- $\mathrm{Cov}(Y \mid X, \beta) = \sigma^2 I_n$, for some $\sigma^2 > 0$.

I.e., the random variables generating the observations are uncorrelated and have constant variance $\sigma^2$ (conditional on $X$ and $\beta$).
For known constants $c_1, c_2, \ldots, c_p, c_{p+1}$, consider the problem of estimating
$$\theta = c_1 \beta_1 + c_2 \beta_2 + \cdots + c_p \beta_p + c_{p+1}.$$
Under the Gauss-Markov assumptions, the estimator
$$\hat{\theta} = c_1 \hat{\beta}_1 + c_2 \hat{\beta}_2 + \cdots + c_p \hat{\beta}_p + c_{p+1},$$
where $\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_p$ are the least-squares estimates, is

1) an Unbiased Estimator of $\theta$;
2) a Linear Estimator of $\theta$, that is, $\hat{\theta} = \sum_{i=1}^n b_i y_i$ for some constants $b_i$ that are known given $X$.

Theorem: Under the Gauss-Markov assumptions, the estimator $\hat{\theta}$ has the smallest ("Best") variance among all Linear Unbiased Estimators of $\theta$; i.e., $\hat{\theta}$ is BLUE.
Gauss-Markov Theorem: Proof

Proof: Without loss of generality, assume $c_{p+1} = 0$ and define $c = (c_1, c_2, \ldots, c_p)^T$. The least-squares estimate of $\theta = c^T \beta$ is
$$\hat{\theta} = c^T \hat{\beta} = c^T (X^T X)^{-1} X^T y \equiv d^T y,$$
a linear estimate in $y$ with coefficients $d = (d_1, d_2, \ldots, d_n)^T$. Consider an alternative linear estimate of $\theta$: $\theta^* = b^T y$ with fixed coefficients $b = (b_1, \ldots, b_n)^T$. Define $f = b - d$ and note that
$$\theta^* = b^T y = (d + f)^T y = \hat{\theta} + f^T y.$$
If $\theta^*$ is unbiased, then because $\hat{\theta}$ is unbiased,
$$0 = E(f^T y) = f^T E(y) = f^T (X\beta) \quad \text{for all } \beta \in \mathbb{R}^p$$
$$\Longrightarrow f \text{ is orthogonal to the column space of } X$$
$$\Longrightarrow f \text{ is orthogonal to } d = X (X^T X)^{-1} c.$$
If $\theta^*$ is unbiased, then the orthogonality of $f$ to $d$ implies
$$\begin{aligned}
\mathrm{Var}(\theta^*) &= \mathrm{Var}(b^T y) = \mathrm{Var}(d^T y + f^T y) \\
&= \mathrm{Var}(d^T y) + \mathrm{Var}(f^T y) + 2\,\mathrm{Cov}(d^T y, f^T y) \\
&= \mathrm{Var}(\hat{\theta}) + \mathrm{Var}(f^T y) + 2\, d^T \mathrm{Cov}(y)\, f \\
&= \mathrm{Var}(\hat{\theta}) + \mathrm{Var}(f^T y) + 2\, d^T (\sigma^2 I_n) f \\
&= \mathrm{Var}(\hat{\theta}) + \mathrm{Var}(f^T y) + 2\sigma^2\, d^T f \\
&= \mathrm{Var}(\hat{\theta}) + \mathrm{Var}(f^T y) + 2\sigma^2 \cdot 0 \\
&\ge \mathrm{Var}(\hat{\theta}).
\end{aligned}$$
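The proof's decomposition can be checked by simulation (a sketch assuming NumPy; the design, $c$, and the perturbation $f$ are illustrative): a competing linear unbiased estimator $b^T y = (d + f)^T y$, with $f$ orthogonal to the column space of $X$, shows a strictly larger sampling variance:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([1.0, 2.0])
c = np.array([0.0, 1.0])                # theta = c^T beta = beta_2

d = X @ np.linalg.solve(X.T @ X, c)     # least-squares coefficients d
f = rng.normal(size=n)
f -= X @ np.linalg.solve(X.T @ X, X.T @ f)  # project f out of col(X)
b = d + f                               # alternative linear unbiased estimator

est_ls, est_alt = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(size=n)   # Gauss-Markov errors, sigma^2 = 1
    est_ls.append(d @ y)
    est_alt.append(b @ y)
print(np.var(est_ls), np.var(est_alt))  # Var(d^T y) < Var(b^T y)
```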
Generalized Least Squares (GLS) Estimates

Consider generalizing the Gauss-Markov assumptions for the linear regression model to
$$Y = X\beta + E,$$
where the random $n$-vector $E$ satisfies $E[E] = 0_n$ and $E[E E^T] = \sigma^2 \Sigma$:

- $\sigma^2$ is an unknown scale parameter;
- $\Sigma$ is a known $(n \times n)$ positive definite matrix specifying the relative variances and correlations of the component observations.

Transform the data $(Y, X)$ to $Y^* = \Sigma^{-1/2} Y$ and $X^* = \Sigma^{-1/2} X$, and the model becomes
$$Y^* = X^* \beta + E^*, \quad \text{where } E[E^*] = 0_n \text{ and } E[E^* (E^*)^T] = \sigma^2 I_n.$$
By the Gauss-Markov Theorem, the BLUE ("GLS" estimate) of $\beta$ is
$$\hat{\beta} = [(X^*)^T (X^*)]^{-1} (X^*)^T Y^* = [X^T \Sigma^{-1} X]^{-1} X^T \Sigma^{-1} Y.$$
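A numerical sketch of the GLS formula (assuming NumPy; the AR(1)-type $\Sigma$ and the coefficients are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, -1.0])

# Illustrative known correlation matrix: Sigma_{ij} = 0.7^{|i-j|}.
Sigma = 0.7 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(Sigma)           # Sigma = L L^T
Y = X @ beta + L @ rng.normal(size=n)   # errors with covariance Sigma

# GLS: beta_hat = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} Y.
Sigma_inv = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ Y)
print(beta_gls)                         # close to beta
```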
Maximum Likelihood Estimation

$X \sim P_\theta$, $\theta \in \Theta$, with density or pmf function $p(x \mid \theta)$. Given an observation $X = x$, define the likelihood function
$$L_x(\theta) = p(x \mid \theta), \quad \text{a mapping } \Theta \to \mathbb{R}.$$
$\hat{\theta}_{ML} = \hat{\theta}_{ML}(x)$, the Maximum-Likelihood Estimate of $\theta$, is the value making $L_x(\cdot)$ a maximum: $\hat{\theta}$ is the MLE if
$$L_x(\hat{\theta}) = \max_{\theta \in \Theta} L_x(\theta).$$

- The MLE $\hat{\theta}_{ML}(x)$ identifies the distribution making $x$ most likely.
- The MLE coincides with the mode of the posterior distribution if the prior distribution on $\Theta$ is uniform: $\pi(\theta \mid x) \propto p(x \mid \theta)\, \pi(\theta) \propto p(x \mid \theta)$.
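In practice the maximization is often done numerically. A minimal sketch (assuming NumPy and SciPy; the normal model and simulated data are illustrative) minimizes the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
x = rng.normal(loc=1.0, scale=2.0, size=200)  # illustrative sample

def neg_log_lik(theta):
    # -l_x(mu, sigma) up to an additive constant; sigma on the log scale
    # keeps the optimization unconstrained.
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum(((x - mu) / sigma) ** 2) + len(x) * log_sigma

res = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # close to (x.mean(), x.std())
```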
Examples

Example 2.2.4: Normal distribution with known variance.

Example 2.2.5: Size of a population. $X_1, \ldots, X_n$ are iid $U\{1, 2, \ldots, \theta\}$, with $\theta \in \{1, 2, \ldots\}$. For $x = (x_1, \ldots, x_n)$,
$$L_x(\theta) = \prod_{i=1}^n \theta^{-1}\, 1(1 \le x_i \le \theta) = \theta^{-n}\, 1(\max(x_1, \ldots, x_n) \le \theta)
= \begin{cases} 0, & \text{if } \theta = 0, 1, \ldots, \max(x_i) - 1 \\ \theta^{-n}, & \text{if } \theta \ge \max(x_i), \end{cases}$$
and since $\theta^{-n}$ is decreasing in $\theta$, the MLE is $\hat{\theta}_{ML} = \max(x_1, \ldots, x_n)$.
MLE as a Minimum Contrast Method

Define $\ell_x(\theta) = \log L_x(\theta) = \log p(x \mid \theta)$. Because $-\log(\cdot)$ is monotone decreasing, $\hat{\theta}_{ML}(x)$ minimizes $-\ell_x(\theta)$. For an iid sample $X = (X_1, \ldots, X_n)$ with densities $p(x_i \mid \theta)$,
$$\ell_X(\theta) = \log p(x_1, \ldots, x_n \mid \theta) = \log \prod_{i=1}^n p(x_i \mid \theta) = \sum_{i=1}^n \log p(x_i \mid \theta).$$
As a minimum contrast function, $\rho(X, \theta) = -\ell_X(\theta)$ yields the MLE $\hat{\theta}_{ML}(x)$. The discrepancy function corresponding to this contrast function is
$$D(\theta_0, \theta) = E[\rho(X, \theta) \mid \theta_0] = E[-\log p(X \mid \theta) \mid \theta_0].$$
Suppose that $\theta = \theta_0$ uniquely minimizes $D(\theta_0, \cdot)$. Then
$$D(\theta_0, \theta) - D(\theta_0, \theta_0) = E[-\log p(X \mid \theta) \mid \theta_0] - E[-\log p(X \mid \theta_0) \mid \theta_0] = E\!\left[\log \frac{p(X \mid \theta_0)}{p(X \mid \theta)} \,\Big|\, \theta_0\right] > 0, \quad \text{unless } \theta = \theta_0.$$
This difference is the Kullback-Leibler Information Divergence between the distributions $P_{\theta_0}$ and $P_\theta$:
$$K(P_{\theta_0}, P_\theta) = E\!\left[\log \frac{p(X \mid \theta_0)}{p(X \mid \theta)} \,\Big|\, \theta_0\right].$$

Lemma (Shannon, 1948). The divergence $K(P_{\theta_0}, P_\theta)$ is always well defined and $K(P_{\theta_0}, P_\theta) \ge 0$. Equality holds if and only if $\{x : p(x \mid \theta) = p(x \mid \theta_0)\}$ has probability 1 under both $P_{\theta_0}$ and $P_\theta$.

Proof: apply Jensen's inequality (B.9.3).
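A quick numerical check of the lemma for discrete distributions (a sketch assuming NumPy; the probability vectors are illustrative):

```python
import numpy as np

# K(P0, P) = E_{P0}[ log( p0(X) / p(X) ) ] for discrete distributions on a
# common support; nonnegative, and zero iff the distributions coincide.
def kl(p0, p):
    p0, p = np.asarray(p0, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(p0 * np.log(p0 / p)))

p0 = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
print(kl(p0, p))   # strictly positive
print(kl(p0, p0))  # exactly 0.0
```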
Likelihood Equations

Suppose: $X \sim P_\theta$ with $\theta \in \Theta$, an open parameter space; the log-likelihood $\ell_X(\theta)$ is differentiable in $\theta$; and $\hat{\theta}_{ML}(x)$ exists. Then $\hat{\theta}_{ML}(x)$ must satisfy the likelihood equation(s)
$$\nabla_\theta\, \ell_X(\theta) = 0.$$
Important cases: for independent $X_i$ with densities/pmfs $p_i(x_i \mid \theta)$,
$$\nabla_\theta\, \ell_X(\theta) = \sum_{i=1}^n \nabla_\theta \log p_i(x_i \mid \theta) = 0.$$
NOTE: $p_i(\cdot \mid \theta)$ may vary with $i$.
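For example (a standard computation, consistent with the Poisson counts of Example 2.2.7 below): if $X_1, \ldots, X_n$ are iid Poisson($\lambda$), then
$$\ell_X(\lambda) = \sum_{i=1}^n \left[x_i \log \lambda - \lambda - \log(x_i!)\right],$$
and the likelihood equation
$$\frac{\partial \ell_X(\lambda)}{\partial \lambda} = \frac{\sum_{i=1}^n x_i}{\lambda} - n = 0$$
gives $\hat{\lambda}_{ML} = \bar{x}$.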
Examples

- Hardy-Weinberg proportions (Example 2.2.6)
- Queues: Poisson process models (exponential arrival times and Poisson counts) (Example 2.2.7)
- Multinomial trials (Example 2.2.8)
- Normal regression models (Example 2.2.9)
MIT OpenCourseWare. Mathematical Statistics, Spring 2016. For information about citing these materials or our Terms of Use, visit: