Likelihood based Statistical Inference. Dottorato in Economia e Finanza Dipartimento di Scienze Economiche Univ. di Verona


Likelihood based Statistical Inference
Dottorato in Economia e Finanza, Dipartimento di Scienze Economiche, Univ. di Verona
L. Pace, A. Salvan, N. Sartori
Udine, April 2008

Lecture 2. Likelihood: observed quantities, exact properties, inference
2.1 Likelihood: observed quantities
2.2 Likelihood: exact properties
2.3 Likelihood: inference (no nuisance parameters)
2.4 Reparameterizations
2.5 A two-parameter model
2.6 Example: linear model with normal errors

Detailed References for 2.1 Likelihood: observed quantities
*PS96/PS97; *PS01.
Azzalini, A. (1996). Statistical Inference Based on the Likelihood, Chapman and Hall, London, §§ 2.1, 2.2, 3.2.
Azzalini, A. (2001). Inferenza Statistica: una Presentazione basata sul Concetto di Verosimiglianza, Springer, Milano, §§ 2.1, 2.2, 3.2.
Barndorff-Nielsen, O.E., Cox, D.R. (1994). Inference and Asymptotics, Chapman and Hall, London, § 2.2.
Severini, T.A. (2000). Likelihood Methods in Statistics, Oxford University Press, Oxford, Ch. 3; see also § 3.8 for further bibliographic references.

Detailed References for 2.2 Likelihood: exact properties
*PS01: Ch. 5; *PS96/PS97.
Azzalini, A. (1996). Statistical Inference Based on the Likelihood, Chapman and Hall, London, §§ 2.1, 2.2, 3.2.
Azzalini, A. (2001). Inferenza Statistica: una Presentazione basata sul Concetto di Verosimiglianza, Springer, Milano, §§ 2.1, 2.2, 3.2.
Barndorff-Nielsen, O.E., Cox, D.R. (1994). Inference and Asymptotics, Chapman and Hall, London, § 2.2.
Severini, T.A. (2000). Likelihood Methods in Statistics, Oxford University Press, Oxford, Ch. 3.

Detailed References for 2.3 Likelihood: inference (no nuisance parameters)
*PS01: § 3.6; *PS96: § 3.5 (introd.); *PS97: § 3.4 (introd.).
Azzalini, A. (1996). Statistical Inference Based on the Likelihood, Chapman and Hall, London.
Azzalini, A. (2001). Inferenza Statistica: una Presentazione basata sul Concetto di Verosimiglianza, Springer, Milano.
Barndorff-Nielsen, O.E., Cox, D.R. (1994). Inference and Asymptotics, Chapman and Hall, London, § 3.2.
Severini, T.A. (2000). Likelihood Methods in Statistics, Oxford University Press, Oxford, § 3.7.

Detailed References for 2.4 Reparameterizations
*PS96.
Azzalini, A. (1996). Statistical Inference Based on the Likelihood, Chapman and Hall, London.
Azzalini, A. (2001). Inferenza Statistica: una Presentazione basata sul Concetto di Verosimiglianza, Springer, Milano.
Barndorff-Nielsen, O.E. (1988). Parametric Statistical Models and Likelihood, Lecture Notes in Statistics 50, Springer, Berlin, Ch. 3.
Barndorff-Nielsen, O.E., Cox, D.R. (1994). Inference and Asymptotics, Chapman and Hall, London.

2.1 Likelihood: observed quantities
The likelihood function is
L = L(θ) = L(θ; y) = c(y) p_Y(y; θ),   (1)
where c(y) > 0 is an arbitrary proportionality constant.
Interpretation: on the basis of the data y, the parameter value θ′ ∈ Θ is more credible than θ″ ∈ Θ as an index of the data-generating probability model if L(θ′) > L(θ″); the empirical support received by θ′ ∈ Θ from y is to be compared with that received by θ″ ∈ Θ through the ratio L(θ′)/L(θ″).
L(θ) is not a density on Θ!
Two likelihood functions are called equivalent if they agree up to a positive multiplicative constant (which can depend on y but not on θ).

Likelihood: observed quantities
In many cases it is more convenient to consider the log-likelihood function
l = l(θ) = l(θ; y) = log L,   (2)
where we set l(θ) = −∞ if L(θ) = 0 (the log-likelihood is defined up to an additive constant, which depends only on y).
For independent observations,
l(θ) = Σ_{i=1}^n log p_{Y_i}(y_i; θ).   (3)
A value of θ that maximizes L(θ; y) over Θ, that is, a value ˆθ such that L(ˆθ) = sup_{θ∈Θ} L(θ), is called a maximum likelihood estimate (m.l.e.) of θ.
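As a minimal numerical illustration (my own sketch, not from the slides: the data vector and the choice of an exponential model with rate θ are made up for the example), the m.l.e. can be computed by maximizing l(θ) numerically:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical i.i.d. sample, assumed exponential with rate theta,
# so that l(theta) = n*log(theta) - theta*sum(y).
y = np.array([0.8, 1.3, 0.4, 2.1, 0.9, 1.7])
n = len(y)

def negloglik(theta):
    # We minimize -l(theta) to maximize the log-likelihood.
    return -(n * np.log(theta) - theta * y.sum())

res = minimize_scalar(negloglik, bounds=(1e-6, 50.0), method="bounded")
print("numerical m.l.e.:", res.x)        # should match the analytic value
print("analytic  m.l.e.:", n / y.sum())  # theta-hat = n / sum(y_i)
```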

Likelihood: observed quantities
A parametric model is regular if Θ is an open subset of IR^p and the function l(θ) is continuously differentiable on Θ up to the third order. A major condition for a model to be regular is that all its probability distributions have the same support. Typically, exponential families are regular models.
Log-likelihood derivatives. In a regular parametric model,
l_r = l_r(θ; y) = ∂l(θ)/∂θ_r,
l_rs = l_rs(θ; y) = ∂²l(θ)/(∂θ_r ∂θ_s),
l_rst = l_rst(θ; y) = ∂³l(θ)/(∂θ_r ∂θ_s ∂θ_t).

Likelihood: observed quantities
The score function is the vector l_* = l_*(θ; y) = (l_1, ..., l_p). In the following, unless otherwise stated, it will be assumed that the m.l.e. is unique and that it is the unique solution of the likelihood equation
l_*(θ) = 0.   (4)
Sufficient conditions exist for specific classes of parametric models (Ch.s 5-7 of PS97).
The observed information matrix is the p × p matrix with (r, s) element −l_rs,
j = j(θ) = [−l_rs];
here and below the matrix with elements a_rs is indicated by [a_rs].

Likelihood: observed quantities
The elements of the inverse of a matrix [a_rs] will be denoted by upper indices, that is, [a_rs]^{−1} = [a^rs]; in particular, j^{−1} = [j^rs] and i^{−1} = [i^rs], when these inverses exist. Quantities like [l_rst] are called arrays.
Sometimes the notations ˆl = l(ˆθ), ĵ = j(ˆθ), î = i(ˆθ), etc., are used to denote likelihood quantities evaluated at ˆθ. Slightly stretching the terminology, if θ = (τ, ζ) and ˆθ = (ˆτ, ˆζ), we say that ˆτ is the maximum likelihood estimate of τ.
In regular models, ˆθ and the partial derivatives of l(θ) with respect to the components of θ give us information about the behaviour of the log-likelihood function. With p = 1,
l(θ) = l(ˆθ) + (θ − ˆθ) l′(ˆθ) + (1/2)(θ − ˆθ)² l″(ˆθ) + ... = l(ˆθ) − (1/2)(θ − ˆθ)² j(ˆθ) + ...,
since l′(ˆθ) = 0 and l″(ˆθ) = −j(ˆθ).

Likelihood: observed quantities
Hence, the relative log-likelihood l(θ) − l(ˆθ) may be expanded as
l(θ) − l(ˆθ) = −(1/2)(θ − ˆθ)² j(ˆθ) + ...
Thus, the larger j(ˆθ) is, the more rapidly values of θ different from ˆθ lose empirical support; in other words, the larger j(ˆθ) is, the more the likelihood is concentrated around ˆθ. For this reason, j(ˆθ) is a measure of the information about θ contained in the data.
When p > 1, j(ˆθ) is positive definite. Expansion:
l(θ) − l(ˆθ) = −(θ − ˆθ)ᵀ j(ˆθ)(θ − ˆθ)/2 + ...,
so l(θ) − l(ˆθ) can be approximated around ˆθ by a negative definite quadratic form. To sum up, in a regular model the log-likelihood function has approximately parabolic behaviour, at least in the region with the largest likelihood values.
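A small sketch of the parabolic approximation at work (again my own illustration with made-up exponential data, where ˆθ = n/Σ_i y_i and j(ˆθ) = n/ˆθ²):

```python
import numpy as np

# Made-up exponential sample; l(theta) = n*log(theta) - theta*sum(y)
y = np.array([0.8, 1.3, 0.4, 2.1, 0.9, 1.7])
n, s = len(y), y.sum()

theta_hat = n / s            # m.l.e.
j_hat = n / theta_hat**2     # observed information at the m.l.e.

def rel_loglik(theta):
    # Relative log-likelihood l(theta) - l(theta_hat)
    return (n * np.log(theta) - theta * s) - (n * np.log(theta_hat) - theta_hat * s)

for theta in [0.5 * theta_hat, 0.9 * theta_hat, 1.1 * theta_hat, 1.5 * theta_hat]:
    quad = -0.5 * (theta - theta_hat) ** 2 * j_hat   # parabolic approximation
    print(f"theta={theta:.3f}  exact={rel_loglik(theta):+.4f}  quadratic={quad:+.4f}")
```

The two columns agree closely near ˆθ and drift apart farther away, as the expansion suggests.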

Likelihood: observed quantities
The practice of examining the log-likelihood function is to be recommended, and it is facilitated by the graphical options of many popular statistical packages.
For an approach to inference based entirely on the likelihood function, see Edwards (1972) and Royall (1997, Statistical Evidence, Chapman and Hall, London).

2.2 Likelihood: exact properties
Behaviour under one-to-one transformations
Let F = {p_Y(y; θ), θ ∈ Θ ⊆ IR^p} be a parametric statistical model for the data y.
Invariance under one-to-one data transformations. Let t = g(y) be a one-to-one function of y defined on Y with range T, a subset of the same Euclidean space, and let us denote by F_T = {p_T(t; θ), θ ∈ Θ ⊆ IR^p} the statistical model for T = g(Y). The information about θ contained in y is equivalent to that contained in t: only the measurement scale of the data changes.
The likelihood is invariant under one-to-one transformations of the data. Indeed, the likelihood for θ based on the statistical model F_T and on the data t = g(y) is equivalent to that based on F and on the data y:
L_T(θ; t(y)) = c(y) L_Y(θ; y).

Likelihood: exact properties
Invariance under reparameterizations. Let ψ = ψ(θ) be a reparameterization of F (hereafter, a one-to-one smooth function from Θ ⊆ IR^p to Ψ ⊆ IR^p, infinitely differentiable together with its inverse). In the new parameterization,
F = {p_Y(y; ψ), ψ ∈ Ψ ⊆ IR^p},   with Ψ = {ψ ∈ IR^p : ψ = ψ(θ), θ ∈ Θ}.
The statistical model, as well as the inference problem, is left unchanged. Hence, inferential conclusions should not depend on the parameterization.

Likelihood: exact properties
The likelihood function is indeed invariant under reparameterizations of the model: the likelihood function L_Ψ(ψ) in the new parameterization and the original likelihood L_Θ(θ) are related by
L_Ψ(ψ) = L_Θ(θ(ψ)),   (5)
so that
l_Ψ(ψ) = l_Θ(θ(ψ)),   (6)
and ˆψ = ψ(ˆθ). This means that the m.l.e. is equivariant under reparameterizations. In general, likelihood inference about ψ is simply the translation into the new parameter space of the conclusions about θ.
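A quick numerical check of equivariance (my own sketch; the exponential model and the data are made up): maximizing in the rate parameterization θ and in the mean parameterization ψ = 1/θ gives ˆψ = 1/ˆθ.

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([0.8, 1.3, 0.4, 2.1, 0.9, 1.7])  # made-up data
n, s = len(y), y.sum()

# Log-likelihoods in the two parameterizations (as negatives, for minimization)
nll_rate = lambda th: -(n * np.log(th) - th * s)        # theta = rate
nll_mean = lambda ps: -(-n * np.log(ps) - s / ps)       # psi = 1/theta = mean

th_hat = minimize_scalar(nll_rate, bounds=(1e-6, 50), method="bounded").x
ps_hat = minimize_scalar(nll_mean, bounds=(1e-6, 50), method="bounded").x
print(th_hat, ps_hat, 1 / th_hat)   # ps_hat equals 1/th_hat (equivariance)
```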

Likelihood: exact properties
Sampling properties
Repeated sampling principle (r.s.p.): study of the probability distribution of the random variable l(θ) = l(θ; Y), or of related quantities, for θ fixed and as y varies in the sample space Y according to a density p_Y(y; θ*) in F, where θ* ∈ Θ is a parameter value not necessarily equal to θ. The distribution is called null if θ* = θ and non-null if θ* ≠ θ. Analogously, we refer to null moments (evaluated with respect to a null distribution) and to non-null moments (evaluated with respect to a non-null distribution). Symbols such as E_θ(·), Var_θ(·) are used to indicate expectation, variance (or covariance matrix), etc., calculated with reference to p_Y(y; θ).

Likelihood: exact properties
Wald inequality. A key result in proving consistency of the m.l.e. (under suitable conditions):
E_θ(l(θ; Y)) > E_θ(l(θ′; Y)),   θ′ ≠ θ.   (7)

Likelihood: exact properties
To prove (7), note first that, because of parameter identifiability,
Pr_θ( p_Y(Y; θ′)/p_Y(Y; θ) = 1 ) < 1,
that is, the r.v. p_Y(Y; θ′)/p_Y(Y; θ) is non-degenerate for any pair of values θ and θ′ in Θ with θ ≠ θ′. Moreover, log(·) is a strictly concave function, so that, by Jensen's inequality,
E_θ( log[ p_Y(Y; θ′)/p_Y(Y; θ) ] ) < log{ E_θ( p_Y(Y; θ′)/p_Y(Y; θ) ) } = 0,   (8)
taking into account that
E_θ( p_Y(Y; θ′)/p_Y(Y; θ) ) = ∫_Y [ p_Y(y; θ′)/p_Y(y; θ) ] p_Y(y; θ) dy = 1.
(Notation for the continuous case has been used without loss of generality.) Inequality (7) follows from (8).
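As a sanity check of (7) (my own sketch, not from the slides; the exponential model, the true rate θ = 1 and the competing value θ′ = 2 are chosen arbitrarily), both sides can be approximated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, theta1 = 1.0, 2.0          # true value theta and a competing value theta'
n, reps = 10, 20000

def loglik(th, y):
    # Exponential log-likelihood: l(th) = n*log(th) - th*sum(y)
    return len(y) * np.log(th) - th * y.sum()

samples = rng.exponential(scale=1 / theta, size=(reps, n))
lhs = np.mean([loglik(theta, y) for y in samples])    # approx E_theta l(theta; Y)
rhs = np.mean([loglik(theta1, y) for y in samples])   # approx E_theta l(theta'; Y)
print(lhs, rhs, lhs > rhs)  # Wald inequality: lhs should exceed rhs
```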

Likelihood: exact properties
Expectation of the score function. For a regular model, assuming that the conditions permitting the interchange of differentiation and integration are satisfied,
E_θ(l_*(θ)) = E_θ(l_*(θ; Y)) = 0,   θ ∈ Θ.   (9)
Thus, in agreement with the property that log-likelihood functions associated with data from p_Y(y; θ_0) have, on average, a maximum at θ_0, the components of the score function evaluated at θ_0 are, on average, equal to zero. Property (9) could also be phrased as: the likelihood equation l_*(θ) = 0 is an unbiased estimating equation.

Likelihood: exact properties
In general, an estimating equation is an equation in θ of the form q(θ; y) = 0. Its solution gives an estimate ˆθ_q of θ. The estimating equation q(θ; y) = 0 is called unbiased if E_θ(q(θ; Y)) = 0 for all θ ∈ Θ. Unbiasedness of an estimating equation is a key property in showing consistency of the corresponding estimator (see e.g. § 6.2 in PS01; Serfling, 1980, § 7.2.1; van der Vaart, 1998, § 5.2).

Likelihood: exact properties
Expected information. We call expected information, or Fisher information, the quantity
i(θ) = E_θ(j(θ)) = E_θ( [ −∂²l(θ)/(∂θ_r ∂θ_s) ] ),
the (null) expectation of the observed information. It is a p × p matrix, i(θ) = [i_rs(θ)]. Under random sampling of size n, i(θ) is proportional to n: indeed, i(θ) = n i_1(θ), where i_1(θ) is the expected information for one observation.
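As a standard worked example (not on the slide): for an i.i.d. exponential sample with rate θ, l(θ) = n log θ − θ Σ_i y_i, so j(θ) = −l″(θ) = n/θ² is non-random, and hence i(θ) = E_θ(j(θ)) = n/θ² = n i_1(θ) with i_1(θ) = 1/θ², illustrating the proportionality to n.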

Likelihood: exact properties
Covariance matrix of the score function. Under regularity conditions (see e.g. Azzalini, 1996, Section 3.2.4), the information identity
E_θ( l_*(θ) l_*(θ)ᵀ ) = i(θ),   θ ∈ Θ,   (10)
holds. In other words, the expected information matrix is the null second moment of the score and, as such, is a non-negative definite matrix. Equivalently, the Hessian matrix of the log-likelihood is, on average, non-positive definite at the true parameter value.

2.3 Likelihood: inference (no nuisance parameters)
From a likelihood function L(θ) it is natural:
i) to use as a point estimate of θ the value ˆθ = ˆθ(y) having maximum likelihood;
ii) to consider confidence regions of the form ˆΘ(y) = {θ ∈ Θ : L(θ) ≥ c L(ˆθ)}, with 0 < c < 1, made up of parameter values having large likelihood;

Likelihood: inference
iii) to test a hypothesis about θ, H_0 : θ ∈ Θ_0 ⊂ Θ, by comparing the maximum of L(θ) for θ ∈ Θ_0 with L(ˆθ), the maximum of L(θ) for θ ∈ Θ. Let us denote by L(ˆθ_0) the maximum of L(θ) for θ ∈ Θ_0. This leads to the test statistic
L(ˆθ)/L(ˆθ_0),   (11)
with large values significant. This statistic can be equivalently expressed using the log-likelihood as l(ˆθ) − l(ˆθ_0).

Likelihood: inference
Likelihood ratio tests and confidence regions
Likelihood ratio test. Suppose we want to test the simple hypothesis H_0 : θ = θ_0 against the alternative H_1 : θ ≠ θ_0. Twice the logarithm of (11) is
W(θ_0) = 2{ l(ˆθ) − l(θ_0) },
also called the log-likelihood ratio statistic. Large values of W(θ_0) are critical against H_0.

Likelihood: inference
Sometimes W(θ_0) is a monotonically increasing function of a statistic t(y) having a known distribution under H_0. It is then easy to compute the observed significance level (p-value) associated with the observed y_obs as
α_obs = Pr_{θ_0}( W(θ_0) ≥ W(θ_0)_obs ) = Pr_{θ_0}( t(Y) ≥ t(y_obs) ).
Usually, however, one has to resort to an approximation for the null distribution of W(θ_0). Under regularity conditions, the asymptotic null distribution of W(θ_0) is chi-squared with p degrees of freedom, where p is the number of scalar components of θ. Thus, under such conditions, W(θ) is an asymptotically pivotal quantity.
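For instance (a minimal sketch of the chi-squared approximation, with a made-up observed value of the statistic):

```python
from scipy.stats import chi2

W_obs = 6.07   # hypothetical observed value of W(theta_0)
p = 2          # number of scalar components of theta
alpha_obs = chi2.sf(W_obs, df=p)   # survival function = upper-tail probability
print(f"approximate p-value: {alpha_obs:.3f}")
```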

Likelihood: inference
Confidence regions
Confidence regions with nominal level 1 − α (level = null coverage probability) are
ˆΘ(y) = {θ ∈ Θ : W(θ) < χ²_{p;1−α}}.
The region ˆΘ(y) can also be written as
ˆΘ(y) = {θ ∈ Θ : l(θ) > l(ˆθ) − (1/2) χ²_{p;1−α}}.
With p = 1 or 2, the graphical representation of likelihood confidence regions is rather simple, starting from the log-likelihood plot. Alternatively, the region ˆΘ(y) may be read from the plot of W(θ).
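A grid-based sketch of such a region for a scalar parameter (my own code; the exponential model and data are made up for the example):

```python
import numpy as np
from scipy.stats import chi2

y = np.array([0.8, 1.3, 0.4, 2.1, 0.9, 1.7])  # made-up data
n, s = len(y), y.sum()
theta_hat = n / s

loglik = lambda th: n * np.log(th) - th * s
W = lambda th: 2 * (loglik(theta_hat) - loglik(th))

grid = np.linspace(0.01, 5, 5000)
inside = grid[W(grid) < chi2.ppf(0.95, df=1)]   # grid points kept in the region
print(f"95% likelihood interval: ({inside.min():.3f}, {inside.max():.3f})")
```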

Likelihood: inference
One-sided version of W(θ)
When θ is a scalar, one might be interested in testing H_0 : θ = θ_0 against the one-sided alternative H_1^> : θ > θ_0 or H_1^< : θ < θ_0. The natural test statistic based on the likelihood is the signed root of W(θ_0),
r(θ_0) = sgn(ˆθ − θ_0) √W(θ_0),
where sgn(·) is the sign function (sgn(x) = +1 if x > 0, sgn(x) = −1 if x < 0, sgn(x) = 0 if x = 0) and the non-negative square root of W(θ_0) is taken.
The test based on r(θ_0) rejects H_0 for large (small) values when the alternative is H_1^> (H_1^<). For some models, r(θ_0) is a monotonically increasing function of a statistic t(y) having a known distribution under H_0. Usually, however, under regularity conditions, the approximation r(θ_0) ∼ N(0, 1) is used for the null distribution.
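A one-sided test via r(θ_0) and its N(0, 1) approximation, again for the made-up exponential example (my own sketch; the null value θ_0 = 0.5 is arbitrary):

```python
import numpy as np
from scipy.stats import norm

y = np.array([0.8, 1.3, 0.4, 2.1, 0.9, 1.7])  # made-up data
n, s = len(y), y.sum()
theta_hat = n / s
theta0 = 0.5                                   # H0: theta = 0.5 vs H1: theta > 0.5

loglik = lambda th: n * np.log(th) - th * s
W = 2 * (loglik(theta_hat) - loglik(theta0))
r = np.sign(theta_hat - theta0) * np.sqrt(W)   # signed root of W
print(f"r = {r:.3f}, approx one-sided p-value = {norm.sf(r):.3f}")
```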

Likelihood: inference
The test based on r(θ_0) with a two-sided critical region symmetric around zero is equivalent to the test based on W(θ_0). The corresponding confidence region is
ˆΘ(y) = {θ ∈ Θ : −z_{1−α/2} < r(θ) < z_{1−α/2}}.
For graphical illustrations of the construction of confidence regions, see e.g. Example 3.9 in PS01.

Likelihood: inference
Asymptotically equivalent versions of W(θ) and r(θ)
The statistic
W(θ_0) = 2{ l(ˆθ) − l(θ_0) }
is also called the Wilks test statistic, after S. Wilks, who obtained its chi-squared asymptotic null distribution in 1938. W(θ_0) is a likelihood-based measure of the distance between ˆθ and θ_0. Owing to the parabolic approximation of the relative log-likelihood, the asymptotic null distribution of W(θ_0) coincides with the asymptotic null distribution of
W_e(θ_0) = (ˆθ − θ_0)ᵀ i(θ_0) (ˆθ − θ_0).
The test statistic W_e(θ_0) is called the Wald test statistic. It measures the distance between ˆθ and θ_0 directly on Θ, using the metric defined by the expected information.

Likelihood: inference
A third asymptotically equivalent statistic is the score test statistic, also called Rao's test statistic. It is given by
W_u(θ_0) = l_*(θ_0)ᵀ i(θ_0)^{−1} l_*(θ_0).
It measures the distance between l_*(θ_0) and E_{θ_0}(l_*(θ_0)) = 0, using the metric defined by the inverse of the expected information. When p = 1, it is possible to define one-sided versions of W_e(θ_0) and of W_u(θ_0), analogous to r(θ_0). For instance,
r_e(θ_0) = (ˆθ − θ_0) √i(θ_0).
The asymptotic null distribution of both W_e(θ_0) and W_u(θ_0) is χ²_p. The asymptotic null distribution of the one-sided versions (p = 1) is N(0, 1). The same asymptotic results hold if the expected information is replaced by i(ˆθ) or j(ˆθ).
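The three statistics side by side for the same made-up exponential example (my own sketch; here l_*(θ) = n/θ − Σ_i y_i and i(θ) = n/θ²):

```python
import numpy as np

y = np.array([0.8, 1.3, 0.4, 2.1, 0.9, 1.7])   # made-up data
n, s = len(y), y.sum()
theta_hat, theta0 = n / s, 0.5

loglik = lambda th: n * np.log(th) - th * s
score  = lambda th: n / th - s                  # score function l_*(theta)
info   = lambda th: n / th**2                   # expected information i(theta)

W  = 2 * (loglik(theta_hat) - loglik(theta0))   # likelihood ratio (Wilks)
We = (theta_hat - theta0) ** 2 * info(theta0)   # Wald
Wu = score(theta0) ** 2 / info(theta0)          # score (Rao)
print(f"W = {W:.3f}, We = {We:.3f}, Wu = {Wu:.3f}")  # all approx chi2(1) under H0
```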

Likelihood: inference
Comparisons
The Wald test has a rather simple expression and interpretation, and it is widely used in applications. However, it does depend on the parameterization of the statistical model (see the following section). On the other hand, W_u and W do not depend on the parameterization. Compared with the other two versions, W respects more closely the shape of the log-likelihood, especially when this is far from parabolic. Confidence regions based on W are necessarily included in the parameter space; this is not always true for confidence regions based on the other two versions. W_u is simple to compute and is related to (locally) optimal tests (cf. PS97).

2.4 Reparameterizations
Likelihood and log-likelihood do not depend on the parameterization of F. Let ψ = ψ(θ) be a one-to-one smooth function from Θ ⊆ IR^p to Ψ ⊆ IR^p, infinitely differentiable together with its inverse. Then ψ defines an alternative parameterization of the model. Since θ and ψ(θ) identify the same element of F, we have
L_Ψ(ψ) = L_Θ(θ(ψ))   (12)
and
l_Ψ(ψ) = l_Θ(θ(ψ));   (13)
the likelihood is an intrinsic function, i.e. it does not depend on the coordinate system expressed by the parameterization of F.

Reparameterizations
An important consequence for the m.l.e. is equivariance under reparameterization: if ψ is an alternative parameterization of F, we will have ˆψ = ψ(ˆθ) and ˆθ = θ(ˆψ). On the other hand, the log-likelihood derivatives l_r, l_rs, ..., and their expectations, depend on the parameterization. Let us denote by ψ^a, ψ^b, ... (a, b = 1, ..., p) the generic components of ψ, while θ^r, θ^s, ... denote the generic components of θ. By the differentiation rule for composite functions (chain rule),
l̄_a = Σ_{r=1}^p l_r θ^r_a,   (14)
where
l̄_a = ∂l_Ψ(ψ)/∂ψ^a,   (15)
θ^r_a = ∂θ^r(ψ)/∂ψ^a, and l_r = l_r(θ(ψ)).

Reparameterizations
Letting
l̄_ab = ∂²l_Ψ(ψ)/(∂ψ^a ∂ψ^b),   l̄_abc = ∂³l_Ψ(ψ)/(∂ψ^a ∂ψ^b ∂ψ^c),   (16)
etc., and re-applying the chain rule, we have
l̄_ab = Σ_{r,s=1}^p l_rs θ^r_a θ^s_b + Σ_{r=1}^p l_r θ^r_ab,   (17)
where l_rs = l_rs(θ(ψ)) and θ^r_ab = ∂²θ^r(ψ)/(∂ψ^a ∂ψ^b).

Reparameterizations
Transformation rules for observed and expected information:
j̄_ab = Σ_{r,s=1}^p j_rs θ^r_a θ^s_b − Σ_{r=1}^p l_r θ^r_ab,   (18)
ī_ab = Σ_{r,s=1}^p i_rs θ^r_a θ^s_b.   (19)
Perhaps (14) and (19) look more familiar in matrix notation:
l_*Ψ(ψ) = [θ^r_a]ᵀ l_*Θ(θ(ψ))   and   i_Ψ(ψ) = [θ^r_a]ᵀ i_Θ(θ(ψ)) [θ^r_a],   (20)
where [θ^r_a] is the p × p matrix with (r, a) element θ^r_a.

2.5 A two-parameter model
Let
y = (0.446, 0.604, 2.137, 0.737, 0.996, 1.152, 1.124, 0.137, 0.982, 1.196, 0.841, 0.636, 0.459, 0.947, 0.037, 1.307, 0.858, 0.164, 0.444, 0.623)
be a random sample of size n = 20 from a Weibull W(γ, λ) distribution, with p.d.f. for one observation
p_{Y_i}(y_i; γ, λ) = λ γ y_i^{γ−1} e^{−λ y_i^γ},   γ, λ > 0, y_i > 0.
Likelihood equations:
n/γ + Σ_i log y_i − λ Σ_i y_i^γ log y_i = 0
n/λ − Σ_i y_i^γ = 0

A two-parameter model
The solution (obtained numerically) is (ˆγ, ˆλ) = (1.642, 1.24); the observed information matrix ĵ is evaluated at this point. The log-likelihood ratio statistic is
W(γ, λ) = 2{ n log(ˆλ/λ) + n log(ˆγ/γ) + (ˆγ − γ) Σ_i log y_i − ˆλ Σ_i y_i^ˆγ + λ Σ_i y_i^γ }.
Parabolic approximation (Wald test):
W(γ, λ) ≈ W_e(γ, λ) = (θ − ˆθ)ᵀ ĵ (θ − ˆθ),   θ = (γ, λ).
For H_0 : γ = 1, λ = 1.2 against H_1 : not H_0, we get W(1, 1.2) = 6.07 with α_obs = 0.048, and W_e(1, 1.2) = 4.94 with α_obs = 0.084 (χ²_{2;0.95} = 5.99).
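The numbers above can be checked numerically. A sketch (my own code, using scipy, not part of the slides) that maximizes the Weibull log-likelihood for the data of Section 2.5 and evaluates W at (γ, λ) = (1, 1.2):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

# Data from Section 2.5
y = np.array([0.446, 0.604, 2.137, 0.737, 0.996, 1.152, 1.124, 0.137,
              0.982, 1.196, 0.841, 0.636, 0.459, 0.947, 0.037, 1.307,
              0.858, 0.164, 0.444, 0.623])
n = len(y)

def loglik(g, lam):
    # Weibull log-likelihood: n log(lam) + n log(g) + (g-1) sum(log y) - lam sum(y^g)
    return (n * np.log(lam) + n * np.log(g)
            + (g - 1) * np.log(y).sum() - lam * (y ** g).sum())

# Maximize over (log gamma, log lambda) to keep both parameters positive
res = minimize(lambda p: -loglik(np.exp(p[0]), np.exp(p[1])),
               x0=[0.0, 0.0], method="Nelder-Mead")
g_hat, l_hat = np.exp(res.x)

W = 2 * (loglik(g_hat, l_hat) - loglik(1.0, 1.2))  # H0: gamma = 1, lambda = 1.2
print(f"m.l.e. ({g_hat:.3f}, {l_hat:.3f});  W(1, 1.2) = {W:.2f}, "
      f"p-value = {chi2.sf(W, df=2):.3f}")  # slides report (1.642, 1.24) and 0.048
```

Computing W_e as well would additionally require the observed information ĵ at the m.l.e., e.g. by numerical differentiation of the log-likelihood.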

A two-parameter model
[Figure (slides 40-41): contour plot of confidence regions for (γ, λ) based on W(γ, λ), for levels 1 − α ∈ {0.25, 0.50, 0.75, 0.90, 0.95, 0.99}; λ on the vertical axis, γ on the horizontal axis.]

2.6 Example: linear model with normal errors
Let y = (y_1, ..., y_n) be a realization of Y = (Y_1, ..., Y_n), with Y_i independent, i = 1, ..., n. Let us assume that Y_i ∼ N(µ_i, σ²), where the unknown µ_i has to be related to covariates whose values are measured on the same units on which the values y_i are measured. The data pertaining to the i-th unit are therefore (y_i, x_i1, ..., x_ip). The simplest way to relate µ_i to (x_i1, ..., x_ip) is through the linear model
µ_i = x_i1 β_1 + ... + x_ip β_p,
where β = (β_1, ..., β_p) ∈ IR^p.

2.6 Example: linear model with normal errors
The parametric statistical model thus defined is known as the multiple regression model with normal errors, in brief the normal linear regression model. It is customary to express this model also in the form
Y = Xβ + ε,
where Y is the n × 1 random vector generating y, X is the n × p matrix containing the fixed (non-stochastic) values of the covariates, and ε ∼ N_n(0, σ² I_n) is a vector of i.i.d. Gaussian errors. Even more succinctly, y is modeled as an observation of Y ∼ N_n(Xβ, σ² I_n). The parameter space of the model is Θ = IR^p × (0, +∞) and the likelihood function is
L(β, σ²) = (σ²)^{−n/2} exp{ −(1/(2σ²)) (y − Xβ)ᵀ(y − Xβ) }.

2.6 Example: linear model with normal errors
The log-likelihood function is therefore
l(β, σ²) = −(n/2) log(σ²) − (1/(2σ²)) (y − Xβ)ᵀ(y − Xβ).
For fixed σ², the (log-)likelihood is maximized by the value of β that makes the distance between y and Xβ as small as possible; this value of β does not depend on σ². Geometrically, we need to find the orthogonal projection of y onto the vector space spanned by the columns of X. Orthogonality between the residual y − Xˆβ and the columns of X means, in terms of scalar products, Xᵀ(y − Xˆβ) = 0. Therefore Xᵀy = XᵀX ˆβ. When the columns of X are linearly independent, i.e. X has rank p < n, one has
ˆβ = (XᵀX)^{−1} Xᵀ y.

2.6 Example: linear model with normal errors
A simple maximization of l(ˆβ, σ²) gives
ˆσ² = (1/n) (y − Xˆβ)ᵀ(y − Xˆβ).
Inference on the linear model with normal errors is greatly simplified by the fact that the exact sampling distribution of the maximum likelihood estimators is easily described:
ˆβ ∼ N_p(β, σ² (XᵀX)^{−1}),
ˆσ² ∼ (σ²/n) χ²_{n−p},
and ˆβ and ˆσ² are independent.
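A compact sketch of these formulas (my own code, with simulated data; the design matrix and the true parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design with intercept
beta_true, sigma = np.array([1.0, -2.0, 0.5]), 0.8
y = X @ beta_true + rng.normal(scale=sigma, size=n)

# m.l.e. via the normal equations: X'X beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / n        # note divisor n (m.l.e.), not n - p
print(beta_hat, sigma2_hat)
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse (XᵀX)^{−1}, which is numerically preferable.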
