Chapter 3: Likelihood function and inference

1 Chapter 3: Likelihood function and inference
Outline: the likelihood / information and curvature / sufficiency and ancillarity / maximum likelihood estimation / non-regular models / EM algorithm

2 The likelihood
Given a (usually parametric) family of distributions F = {F_θ, θ ∈ Θ} with densities f_θ [wrt a fixed measure ν], the density of the iid sample x_1, ..., x_n is ∏_{i=1}^n f_θ(x_i)
Note: in the special case where ν is a counting measure, ∏_{i=1}^n f_θ(x_i) is the probability of observing the sample x_1, ..., x_n among all possible realisations of X_1, ..., X_n

4 The likelihood
Definition (likelihood function): the likelihood function associated with a sample x_1, ..., x_n is the function
L : Θ → R_+ , θ ↦ ∏_{i=1}^n f_θ(x_i)
It has the same formula as the density but a different space of variation

6 Example: density function versus likelihood function
Take the case of a Poisson density [against the counting measure]
f(x; θ) = (θ^x / x!) e^{−θ} I_N(x), which varies in N as a function of x [figure: density for θ = 3]
versus
L(θ; x) = (θ^x / x!) e^{−θ}, which varies in R_+ as a function of θ [figure: likelihood for x = 3]
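
A minimal R sketch of this distinction; the values θ = 3 and x = 3 are simply the ones used in the figures:

# Poisson density in x for fixed theta = 3
x <- 0:10
dens <- dpois(x, lambda = 3)
# Poisson likelihood in theta for fixed observation x = 3
theta <- seq(0.1, 10, by = 0.1)
lik <- dpois(3, lambda = theta)
# same formula, different argument: dens sums to 1 over x,
# while lik need not integrate to 1 over theta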

8 Example: density function versus likelihood function
Take the case of a Normal N(0, θ) density [against the Lebesgue measure]
f(x; θ) = (2πθ)^{−1/2} e^{−x²/2θ} I_R(x), which varies in R as a function of x [figure: density for θ = 2]
versus
L(θ; x) = (2πθ)^{−1/2} e^{−x²/2θ}, which varies in R_+ as a function of θ [figure: likelihood for x = 2]

10 Example: density function versus likelihood function
Take the case of a Normal N(0, 1/θ) density [against the Lebesgue measure]
f(x; θ) = √(θ/2π) e^{−x²θ/2} I_R(x), which varies in R as a function of x [figure: density for θ = 1/2]
versus
L(θ; x) = √(θ/2π) e^{−x²θ/2}, which varies in R_+ as a function of θ [figure: likelihood for x = 1/2]

12 Example: Hardy-Weinberg equilibrium
Population genetics: genotypes of biallelic genes AA, Aa, and aa, with sample frequencies n_AA, n_Aa and n_aa, follow a multinomial model M(n; p_AA, p_Aa, p_aa) related to the population proportion p_A of A alleles:
p_AA = p_A², p_Aa = 2 p_A (1 − p_A), p_aa = (1 − p_A)²
Likelihood: L(p_A | n_AA, n_Aa, n_aa) ∝ p_A^{2 n_AA} [2 p_A (1 − p_A)]^{n_Aa} (1 − p_A)^{2 n_aa}
[Boos & Stefanski, 2013]
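
A short R sketch of maximising this likelihood; the genotype counts below are made-up illustration values, not data from the slides:

# Hardy-Weinberg MLE of p_A (hypothetical counts)
nAA <- 50; nAa <- 35; naa <- 15
n <- nAA + nAa + naa
loglik <- function(p) 2*nAA*log(p) + nAa*log(2*p*(1 - p)) + 2*naa*log(1 - p)
optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum
# closed form by allele counting gives the same answer
(2*nAA + nAa) / (2*n)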

13 Mixed distributions and their likelihood
Special case when a random variable X may take specific values a_1, ..., a_k and a continuum of values A
Example: rainfall at a given spot on a given day may be zero with positive probability p_0 [it did not rain!], or an arbitrary number between 0 and 100 [capacity of the measurement container], or 100 with positive probability p_100 [container full]

14 Mixed distributions and their likelihood
Special case when a random variable X may take specific values a_1, ..., a_k and a continuum of values A
Example: Tobit model, where y* ∼ N(Xᵀβ, σ²) but only y = y* I{y* ≥ 0} is observed

15 Mixed distributions and their likelihood
Density of X against the composition of two measures, counting and Lebesgue:
f_X(a) = P_θ(X = a) if a ∈ {a_1, ..., a_k}, and f_X(a) = f(a|θ) otherwise
Resulting likelihood:
L(θ | x_1, ..., x_n) = ∏_{j=1}^k P_θ(X = a_j)^{n_j} × ∏_{x_i ∉ {a_1,...,a_k}} f(x_i|θ)
where n_j = # observations equal to a_j
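
A sketch of such a mixed likelihood in R, assuming for illustration a "zero with probability p_0, otherwise exponential" model (the exponential continuous part is an assumption, not from the slides):

# mixed log-likelihood: point mass at 0 plus an Exp(lambda) continuous part
loglik <- function(p0, lambda, x) {
  n0 <- sum(x == 0)                 # observations hitting the atom
  xc <- x[x > 0]                    # observations from the continuous part
  n0*log(p0) + length(xc)*log(1 - p0) + sum(dexp(xc, rate = lambda, log = TRUE))
}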

16 Enters Fisher, Ronald Fisher!
Fisher's intuition in the 1920s:
the likelihood function contains the relevant information about the parameter θ
the higher the likelihood, the more likely the parameter
the curvature of the likelihood determines the precision of the estimation

17 Concentration of likelihood mode around true parameter
[Figures: likelihood functions for x_1, ..., x_n ∼ P(3) as n increases, n ranging from about 40 to 240]
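
An R sketch reproducing the spirit of these figures (sample sizes chosen to match the range shown):

# Poisson(3) likelihoods concentrate as n grows
set.seed(1)
theta <- seq(1, 6, by = 0.01)
x <- rpois(240, lambda = 3)
plot(NULL, xlim = range(theta), ylim = c(0, 1), xlab = "theta", ylab = "scaled likelihood")
for (n in c(40, 80, 160, 240)) {
  ll <- sapply(theta, function(t) sum(dpois(x[1:n], t, log = TRUE)))
  lines(theta, exp(ll - max(ll)))   # rescaled to maximum 1 for comparability
}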

19 Concentration of likelihood mode around true parameter
[Figures: likelihood functions for x_1, ..., x_n ∼ N(0, 1), as n increases and as the sample varies]

22 Why concentration takes place
Consider x_1, ..., x_n iid F. Then
log ∏_{i=1}^n f(x_i|θ) = ∑_{i=1}^n log f(x_i|θ)
and by the LLN
(1/n) ∑_{i=1}^n log f(x_i|θ) → ∫_X log f(x|θ) dF(x)
Lemma: maximising the likelihood is asymptotically equivalent to minimising the Kullback-Leibler divergence
∫_X log { f(x) / f(x|θ) } dF(x)
Hence the MLE targets the member of the family closest to the true distribution

24 Score function
Score function defined by
∇ log L(θ|x) = ( ∂/∂θ_1 L(θ|x), ..., ∂/∂θ_p L(θ|x) ) / L(θ|x)
Gradient (slope) of the log-likelihood function at the point θ
Lemma: when X ∼ F_θ, E_θ[∇ log L(θ|X)] = 0
Reason: ∫_X ∇ log L(θ|x) dF_θ(x) = ∫_X ∇L(θ|x) dx = ∇ ∫_X dF_θ(x) = 0
Connected with the concentration theorem: the gradient is null on average at the true value of the parameter
Warning: not defined for non-differentiable likelihoods, e.g. when the support depends on θ
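
A quick Monte Carlo sanity check of the lemma, using the Poisson model from the earlier example, for which the score is ∂/∂θ log f(x|θ) = x/θ − 1:

# E_theta[score] should be close to 0 at the true theta
set.seed(2)
theta <- 3
x <- rpois(1e5, theta)
score <- x/theta - 1        # Poisson score for a single observation
mean(score)                 # ~ 0
var(score)                  # ~ 1/theta, the Fisher information (see next slides)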

29 Fisher's information matrix
Another notion attributed to Fisher [more likely due to Edgeworth]
Information: the covariance matrix of the score vector
I(θ) = E_θ[ ∇ log f(X|θ) {∇ log f(X|θ)}ᵀ ]
Often called Fisher information
Measures the curvature of the likelihood surface, which translates as information brought by the data
Sometimes denoted I_X to stress the dependence on the distribution of X

31 Fisher's information matrix
Second derivative of the log-likelihood as well:
Lemma: if L(θ|x) is twice differentiable [as a function of θ],
I(θ) = −E_θ[ ∇∇ᵀ log f(X|θ) ]
Hence I_ij(θ) = −E_θ[ ∂²/∂θ_i∂θ_j log f(X|θ) ]

32 Illustrations
Binomial B(n, p) distribution:
f(x|p) = C(n, x) p^x (1 − p)^{n−x}
∂/∂p log f(x|p) = x/p − (n − x)/(1 − p)
∂²/∂p² log f(x|p) = −x/p² − (n − x)/(1 − p)²
Hence I(p) = np/p² + (n − np)/(1 − p)² = n / p(1 − p)
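
A numerical check of I(p) = n/p(1 − p) via the variance of the score; the values of n and p below are arbitrary:

# variance of the Binomial score vs the theoretical Fisher information
set.seed(3)
n <- 20; p <- 0.3
x <- rbinom(1e5, size = n, prob = p)
score <- x/p - (n - x)/(1 - p)
var(score)          # Monte Carlo estimate
n / (p*(1 - p))     # theoretical value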

33 Illustrations
Multinomial M(n; p_1, ..., p_k) distribution:
f(x|p) = C(n; x_1, ..., x_k) p_1^{x_1} ··· p_k^{x_k}
∂/∂p_i log f(x|p) = x_i/p_i − x_k/p_k
∂²/∂p_i∂p_j log f(x|p) = −x_k/p_k² (i ≠ j)
∂²/∂p_i² log f(x|p) = −x_i/p_i² − x_k/p_k²
Hence, over the free parameters (p_1, ..., p_{k−1}),
I(p) = n [ diag(1/p_1, ..., 1/p_{k−1}) + (1/p_k) 11ᵀ ]
i.e. I(p)_{ii} = n(1/p_i + 1/p_k) and I(p)_{ij} = n/p_k for i ≠ j

34 Illustrations
Multinomial M(n; p_1, ..., p_k) distribution (continued): the inverse information matrix is
I(p)^{−1} = (1/n) [ diag(p_1, ..., p_{k−1}) − p pᵀ ]
i.e. (I(p)^{−1})_{ii} = p_i(1 − p_i)/n and (I(p)^{−1})_{ij} = −p_i p_j /n for i ≠ j

35 Illustrations
Normal N(µ, σ²) distribution:
f(x|θ) = (2π)^{−1/2} σ^{−1} exp{ −(x − µ)²/2σ² }
∂/∂µ log f(x|θ) = (x − µ)/σ²
∂/∂σ log f(x|θ) = −1/σ + (x − µ)²/σ³
∂²/∂µ² log f(x|θ) = −1/σ²
∂²/∂µ∂σ log f(x|θ) = −2(x − µ)/σ³
∂²/∂σ² log f(x|θ) = 1/σ² − 3(x − µ)²/σ⁴
Hence I(θ) = (1/σ²) diag(1, 2)

36 Properties
Additive features translating as accumulation of information:
if X and Y are independent, I_X(θ) + I_Y(θ) = I_(X,Y)(θ), and I_{X_1,...,X_n}(θ) = n I_{X_1}(θ)
if X = T(Y) and Y = S(X), I_X(θ) = I_Y(θ)
if X = T(Y), I_X(θ) ≤ I_Y(θ)
If η = Ψ(θ) is a bijective transform, change of parameterisation:
I(θ) = {∂η/∂θ}ᵀ I(η) {∂η/∂θ}
"In information geometry, this is seen as a change of coordinates on a Riemannian manifold, and the intrinsic properties of curvature are unchanged under different parametrizations. In general, the Fisher information matrix provides a Riemannian metric (more precisely, the Fisher-Rao metric)." [Wikipedia]

39 Approximations
Back to the Kullback-Leibler divergence
D(θ*, θ) = ∫_X f(x|θ*) log { f(x|θ*)/f(x|θ) } dx
Using a second-order Taylor expansion
log f(x|θ) = log f(x|θ*) + (θ − θ*)ᵀ ∇ log f(x|θ*) + (1/2)(θ − θ*)ᵀ ∇∇ᵀ log f(x|θ*)(θ − θ*) + o(‖θ − θ*‖²)
approximation of the divergence:
D(θ*, θ) ≈ (1/2)(θ − θ*)ᵀ I(θ*)(θ − θ*)
[Exercise: show this is exact in the normal case]

42 First CLT
Central limit law of the score vector: given X_1, ..., X_n i.i.d. f(x|θ),
(1/√n) ∇ log L(θ|X_1, ..., X_n) ≈ N(0, I_{X_1}(θ)) [at the true θ]
Notation: I_1(θ) stands for I_{X_1}(θ) and indicates the information associated with a single observation

44 Sufficiency
What if a transform S(X_1, ..., X_n) of the sample contains all the information, i.e.
I_{(X_1,...,X_n)}(θ) = I_{S(X_1,...,X_n)}(θ) uniformly in θ?
In this case S(·) is called a sufficient statistic [because it is sufficient to know the value of S(x_1, ..., x_n) to get complete information]
[A statistic is an arbitrary transform of the data X_1, ..., X_n]

47 Sufficiency (bis)
Alternative definition: if (X_1, ..., X_n) ∼ f(x_1, ..., x_n|θ) and if T = S(X_1, ..., X_n) is such that the distribution of (X_1, ..., X_n) conditional on T does not depend on θ, then S(·) is a sufficient statistic
Factorisation theorem: S(·) is a sufficient statistic if and only if
f(x_1, ..., x_n|θ) = g(S(x_1, ..., x_n)|θ) h(x_1, ..., x_n)
Another notion due to Fisher

50 Illustrations
Uniform U(0, θ) distribution:
L(θ|x_1, ..., x_n) = θ^{−n} ∏_{i=1}^n I_(0,θ)(x_i) = θ^{−n} I_{θ > max_i x_i}
Hence S(X_1, ..., X_n) = max_i X_i = X_(n) is sufficient

51 Illustrations
Bernoulli B(p) distribution:
L(p|x_1, ..., x_n) = ∏_{i=1}^n p^{x_i} (1 − p)^{1−x_i} = {p/(1 − p)}^{∑_i x_i} (1 − p)^n
Hence S(X_1, ..., X_n) = X̄_n is sufficient

52 Illustrations
Normal N(µ, σ²) distribution:
L(µ, σ|x_1, ..., x_n) = ∏_{i=1}^n (2πσ²)^{−1/2} exp{ −(x_i − µ)²/2σ² }
 = (2πσ²)^{−n/2} exp{ −(1/2σ²) ∑_{i=1}^n (x_i − x̄_n + x̄_n − µ)² }
 = (2πσ²)^{−n/2} exp{ −(1/2σ²) ∑_{i=1}^n (x_i − x̄_n)² − (n/2σ²)(x̄_n − µ)² }
Hence S(X_1, ..., X_n) = ( X̄_n , ∑_{i=1}^n (X_i − X̄_n)² ) is sufficient

53 Sufficiency and exponential families
Both previous examples belong to exponential families
f(x|θ) = h(x) exp{ T(θ)ᵀ S(x) − τ(θ) }
Generic property of exponential families:
f(x_1, ..., x_n|θ) = { ∏_{i=1}^n h(x_i) } exp{ T(θ)ᵀ ∑_{i=1}^n S(x_i) − nτ(θ) }
Lemma: for an exponential family with summary statistic S(·), the statistic S(X_1, ..., X_n) = ∑_{i=1}^n S(X_i) is sufficient

55 Sufficiency as a rare feature
Nice property reducing the data to a low-dimensional transform, but...
How frequent is it within the collection of probability distributions? Very rare, as essentially restricted to exponential families [Pitman-Koopman-Darmois theorem], with the exception of parameter-dependent families like U(0, θ)

58 Pitman-Koopman-Darmois characterisation
If X_1, ..., X_n are iid random variables from a density f(·|θ) whose support does not depend on θ and verifying the property that there exists an integer n_0 such that, for n ≥ n_0, there is a sufficient statistic S(X_1, ..., X_n) with fixed [in n] dimension, then f(·|θ) belongs to an exponential family
[Factorisation theorem]
Note: Darmois published this result in 1935 [in French] and Koopman and Pitman in 1936 [in English], but Darmois is generally omitted from the theorem... Fisher proved it for one-dimensional sufficient statistics in 1934

60 Minimal sufficiency
Multiplicity of sufficient statistics, e.g., S'(x) = (S(x), U(x)) remains sufficient when S(·) is sufficient
Search for a most concentrated summary:
Minimal sufficiency: a sufficient statistic S(·) is minimal sufficient if it is a function of any other sufficient statistic
Lemma: for a minimal exponential family representation
f(x|θ) = h(x) exp{ T(θ)ᵀ S(x) − τ(θ) }
S(X_1) + ··· + S(X_n) is minimal sufficient

62 Ancillarity
Opposite of sufficiency:
Ancillarity: when X_1, ..., X_n are iid random variables from a density f(·|θ), a statistic A(·) is ancillary if A(X_1, ..., X_n) has a distribution that does not depend on θ
Useless?! Not necessarily, as conditioning upon A(X_1, ..., X_n) leads to more precision and efficiency: use of F_θ(x_1, ..., x_n | A(x_1, ..., x_n)) instead of F_θ(x_1, ..., x_n)
Notion of maximal ancillary statistic

65 Illustrations
1. If X_1, ..., X_n iid U(0, θ), A(X_1, ..., X_n) = (X_1, ..., X_n)/X_(n) is ancillary
2. If X_1, ..., X_n iid N(µ, σ²), A(X_1, ..., X_n) = (X_1 − X̄_n, ..., X_n − X̄_n) / √( ∑_{i=1}^n (X_i − X̄_n)² ) is ancillary
3. If X_1, ..., X_n iid f(x|θ), rank(X_1, ..., X_n) is ancillary
> x = rnorm(10)
> rank(x)
[see, e.g., rank tests]

66 Basu's theorem
Completeness: when X_1, ..., X_n are iid random variables from a density f(·|θ), a statistic A(·) is complete if the only function Ψ such that E_θ[Ψ(A(X_1, ..., X_n))] = 0 for all θ's is the null function
Basu's theorem: let X = (X_1, ..., X_n) be a random sample from f(·|θ) where θ ∈ Θ. If V is an ancillary statistic, and T is complete and sufficient for θ, then T and V are independent with respect to f(·|θ) for all θ ∈ Θ.
[Basu, 1955]

68 Some examples
Example 1: if X = (X_1, ..., X_n) is a random sample from the Normal distribution N(µ, σ²) with σ known, X̄_n = (1/n) ∑_{i=1}^n X_i is sufficient and complete, while (X_1 − X̄_n, ..., X_n − X̄_n) is ancillary, hence independent from X̄_n.
Counter-example 2: let N be an integer-valued random variable with known pdf (π_1, π_2, ...), and let S | N = n ∼ B(n, p) with unknown p. Then (N, S) is minimal sufficient and N is ancillary.

70 More counterexamples
Counter-example 3: if X = (X_1, ..., X_n) is a random sample from the double exponential distribution f(x|θ) = (1/2) exp{ −|x − θ| }, the order statistics (X_(1), ..., X_(n)) are minimal sufficient but not complete, since X_(n) − X_(1) is ancillary and with fixed expectation.
Counter-example 4: if X is a random variable from the Uniform U(θ, θ + 1) distribution, X and [X] are independent, but while X is complete and sufficient, [X] is not ancillary.

72 Last counterexample
Let X be distributed over a finite set of values with probabilities (as functions of p, with q = 1 − p) of the form α p²q, ᾱ pq², p³/2, q³/2, γ pq, γ̄ pq and their mirror images, where α + ᾱ = γ + γ̄ = 2/3 are known. Then
T = X is minimal sufficient
V = I(X > 0) is ancillary
if α ≠ ᾱ, T and V are not independent
T is complete for two-valued functions
[Lehmann, 1981]

73 Point estimation, estimators and estimates
When given a parametric family f(·|θ) and a sample supposedly drawn from this family, (X_1, ..., X_n) iid f(x|θ):
1. an estimator of θ is a statistic T(X_1, ..., X_n) or θ̂_n providing a [reasonable] substitute for the unknown value θ
2. an estimate of θ is the value of the estimator for a given [realised] sample, T(x_1, ..., x_n)
Example: for a Normal N(µ, σ²) sample X_1, ..., X_n, T(X_1, ..., X_n) = µ̂_n = X̄_n is an estimator of µ, and its numerical value computed from x_1, ..., x_n is an estimate

75 Maximum likelihood principle
Given the concentration property of the likelihood function, a reasonable choice of estimator is its mode:
MLE: a maximum likelihood estimator (MLE) θ̂_n satisfies
L(θ̂_n | X_1, ..., X_n) ≥ L(θ | X_1, ..., X_n) for all θ ∈ Θ
Under regularity of L(·|X_1, ..., X_n), the MLE is also a solution of the likelihood equations
∇ log L(θ̂_n | X_1, ..., X_n) = 0
Warning: θ̂_n is not the most likely value of θ but makes the observation (x_1, ..., x_n) most likely...

78 Maximum likelihood invariance
Principle independent of the parameterisation: if ξ = h(θ) is a one-to-one transform of θ, then ξ̂_n^MLE = h(θ̂_n^MLE)
[estimator of transform = transform of estimator]
By extension, if ξ = h(θ) is any transform of θ, then ξ̂_n^MLE = h(θ̂_n^MLE)
Alternative of profile likelihoods, distinguishing between parameters of interest and nuisance parameters

80 Unicity of maximum likelihood estimate
Depending on the regularity of L(·|x_1, ..., x_n), there may be
1. an a.s. unique MLE θ̂_n^MLE: case of x_1, ..., x_n ∼ N(µ, 1)
2. several or an infinity of MLEs [or of solutions to the likelihood equations]: case of x_1, ..., x_n ∼ N(µ_1 + µ_2, 1) [and mixtures of normals]
3. no MLE at all: case of x_1, ..., x_n ∼ N(µ_i, τ²), where the likelihood is unbounded [L → +∞ as τ → 0 with µ_i = x_i]

83 Unicity of maximum likelihood estimate
Consequence of standard differential calculus results on ℓ(θ) = log L(θ|x_1, ..., x_n):
Lemma: if Θ is connected and open, and if ℓ(·) is twice differentiable with
lim_{θ → ∂Θ} ℓ(θ) < +∞
and if H(θ) = −∇∇ᵀ ℓ(θ) is positive definite at all solutions of the likelihood equations, then ℓ(·) has a unique global maximum
Limited appeal because excluding local maxima

85 Unicity of MLE for exponential families
Lemma: if f(·|θ) is a minimal exponential family
f(x|θ) = h(x) exp{ T(θ)ᵀ S(x) − τ(θ) }
with T(·) one-to-one and twice differentiable over Θ, if Θ is open, and if there is at least one solution to the likelihood equations, then it is the unique MLE
The likelihood equation is equivalent to S(x) = E_θ[S(X)]

87 Illustrations
Uniform U(0, θ): the likelihood L(θ|x_1, ..., x_n) = θ^{−n} I_{θ > max_i x_i} is not differentiable at X_(n), but
θ̂_n^MLE = X_(n)
[Super-efficient estimator]

88 Illustrations
Bernoulli B(p): the likelihood L(p|x_1, ..., x_n) = {p/(1 − p)}^{∑_i x_i} (1 − p)^n is differentiable over (0, 1), and
p̂_n^MLE = X̄_n

89 Illustrations
Normal N(µ, σ²): the likelihood
L(µ, σ|x_1, ..., x_n) ∝ σ^{−n} exp{ −(1/2σ²) ∑_{i=1}^n (x_i − x̄_n)² − (n/2σ²)(x̄_n − µ)² }
is differentiable, with
(µ̂_n^MLE, σ̂²_n^MLE) = ( X̄_n , (1/n) ∑_{i=1}^n (X_i − X̄_n)² )

90 The fundamental theorem of Statistics
Fundamental theorem: under appropriate conditions, if (X_1, ..., X_n) iid f(x|θ) and if θ̂_n is a solution of ∇ log f(X_1, ..., X_n|θ) = 0, then
√n (θ̂_n − θ) → N_p(0, I(θ)^{−1}) in distribution
Equivalent of the CLT for estimation purposes
I(θ) can be replaced with I(θ̂_n) or even Î(θ̂_n) = −(1/n) ∑_i ∇∇ᵀ log f(x_i|θ̂_n)

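A Monte Carlo check of this asymptotic normality; the Exponential(rate = θ) model is chosen purely for illustration (θ̂_n = 1/x̄_n and I(θ) = 1/θ², so I(θ)^{−1} = θ²):

# sqrt(n)(theta_hat - theta) should be approximately N(0, theta^2)
set.seed(4)
theta <- 2; n <- 500
z <- replicate(5000, { x <- rexp(n, rate = theta); sqrt(n) * (1/mean(x) - theta) })
c(mean(z), var(z), theta^2)   # mean ~ 0, variance ~ theta^2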

92 Assumptions
θ is identifiable
the support of f(·|θ) is constant in θ
ℓ(θ) is thrice differentiable [the killer]
there exists g(x) integrable against f(·|θ) in a neighbourhood of the true parameter such that
| ∂³ log f(x|θ) / ∂θ_i ∂θ_j ∂θ_k | ≤ g(x)
the following identity stands [mostly superfluous]:
I(θ) = E_θ[ ∇ log f(X|θ) {∇ log f(X|θ)}ᵀ ] = −E_θ[ ∇∇ᵀ log f(X|θ) ]
θ̂_n converges in probability to θ [similarly superfluous]
[Boos & Stefanski, 2014, p.286; Lehmann & Casella, 1998]

93 Inefficient MLEs
Example of the MLE of η = ‖θ‖² when x ∼ N_p(θ, I_p): η̂^MLE = ‖x‖²
Then E_η[‖x‖²] = η + p diverges away from η with p
Note: consistent and efficient behaviour when considering the MLE of η based on Z = ‖X‖² ∼ χ²_p(η)
[Robert, 2001]

96 Inconsistent MLEs
Take X_1, ..., X_n iid f_θ(x) with, for θ ∈ [0, 1],
f_θ(x) = (1 − θ) (1/δ(θ)) f_0( (x − θ)/δ(θ) ) + θ f_1(x)
where f_1(x) = (1/2) I_[−1,1](x), f_0(x) = (1 − |x|) I_[−1,1](x), and δ(θ) = (1 − θ) exp{ −(1 − θ)^{−4} + 1 }
Then for any θ, θ̂_n^MLE → 1 a.s.
[Ferguson, 1982; John Wellner's slides, ca. 2005]

97 Inconsistent MLEs
Consider X_ij, i = 1, ..., n, j = 1, 2, with X_ij ∼ N(µ_i, σ²). Then
µ̂_i^MLE = (X_i1 + X_i2)/2, σ̂²^MLE = (1/4n) ∑_{i=1}^n (X_i1 − X_i2)²
Therefore σ̂²^MLE → σ²/2 a.s.
[Neyman & Scott, 1948]
Note: working solely with X_i1 − X_i2 ∼ N(0, 2σ²) produces a consistent MLE
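
A quick simulation of the Neyman-Scott phenomenon; the nuisance means below are drawn arbitrarily for illustration:

# MLE of sigma^2 converges to sigma^2 / 2, not sigma^2
set.seed(5)
n <- 1e5; sigma <- 1
mu <- rnorm(n, 0, 3)                 # arbitrary nuisance means mu_i
x1 <- rnorm(n, mu, sigma); x2 <- rnorm(n, mu, sigma)
sum((x1 - x2)^2) / (4*n)             # ~ sigma^2 / 2 = 0.5
mean((x1 - x2)^2) / 2                # consistent estimate from the differences, ~ 1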

99 Likelihood optimisation
Practical optimisation of the likelihood function
θ* = arg max_θ L(θ|x) = arg max_θ ∏_{i=1}^n g(x_i|θ), assuming X = (X_1, ..., X_n) iid g(x|θ)
analytical resolution feasible for exponential families, where the likelihood equations reduce to
∇T(θ) ∑_{i=1}^n S(x_i) = n ∇τ(θ)
use of standard numerical techniques like Newton-Raphson,
θ^(t+1) = θ^(t) + I_obs(X, θ^(t))^{−1} ∇ℓ(θ^(t))
with ℓ(·) the log-likelihood and I_obs the observed information matrix

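A minimal Newton-Raphson sketch in R; the Gamma(shape = a, rate = 1) model is an illustrative choice not taken from the slides:

# Newton-Raphson for the shape a of a Gamma(a, rate = 1) sample
set.seed(6)
x <- rgamma(200, shape = 2.5, rate = 1)
n <- length(x); s <- sum(log(x))
a <- mean(x)                               # starting value (method of moments)
for (t in 1:20) {
  score <- s - n*digamma(a)                # d/da of the log-likelihood
  obs_info <- n*trigamma(a)                # observed (= expected) information here
  a <- a + score/obs_info                  # Newton-Raphson step
}
a                                          # MLE of the shape parameter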

102 EM algorithm
Cases where g is too complex for the above to work
Special case when g is a marginal:
g(x|θ) = ∫_Z f(x, z|θ) dz
Z called latent or missing variable

103 Illustrations
censored data: X = min(X*, a), X* ∼ N(θ, 1)
mixture model: X ∼ .3 N_1(µ_0, 1) + .7 N_1(µ_1, 1)
disequilibrium model: X = min(X*, Y*), X* ∼ f_1(x|θ), Y* ∼ f_2(x|θ)

104 Completion
EM algorithm based on completing the data x with z, such that (X, Z) ∼ f(x, z|θ)
Z missing data vector and pair (X, Z) complete data vector
Conditional density of Z given x:
k(z|θ, x) = f(x, z|θ) / g(x|θ)

106 Likelihood decomposition
Likelihood associated with the complete data (x, z): L^c(θ|x, z) = f(x, z|θ)
and likelihood for the observed data L(θ|x), such that
log L(θ|x) = E[log L^c(θ|x, Z) | θ_0, x] − E[log k(Z|θ, x) | θ_0, x]   (1)
for any θ_0, with integration operated against the conditional distribution of Z given the observables (and parameters), k(z|θ_0, x)

107 [A tale of] two θ's
There are two θ's! In (1), θ_0 is a fixed (and arbitrary) value driving the integration, while θ is both free (and variable)
Maximising the observed likelihood L(θ|x) is equivalent to maximising the r.h.s. term in (1),
E[log L^c(θ|x, Z) | θ_0, x] − E[log k(Z|θ, x) | θ_0, x]

109 Intuition for EM
Instead of maximising wrt θ the r.h.s. term in (1), maximise only
E[log L^c(θ|x, Z) | θ_0, x]
Maximisation of the complete log-likelihood is impossible since z is unknown, hence it is substituted by maximisation of the expected complete log-likelihood, with the expectation depending on the term θ_0

111 Expectation-Maximisation
Expectation of the complete log-likelihood denoted
Q(θ|θ_0, x) = E[log L^c(θ|x, Z) | θ_0, x]
to stress the dependence on θ_0 and on the sample x
Principle: EM derives a sequence of estimators θ̂^(j), j = 1, 2, ..., through iteration of Expectation and Maximisation steps:
Q(θ̂^(j) | θ̂^(j−1), x) = max_θ Q(θ | θ̂^(j−1), x)

113 EM Algorithm
Iterate (in m):
1. (step E) Compute Q(θ|θ̂^(m), x) = E[log L^c(θ|x, Z) | θ̂^(m), x]
2. (step M) Maximise Q(θ|θ̂^(m), x) in θ and set θ̂^(m+1) = arg max_θ Q(θ|θ̂^(m), x)
until a fixed point [of Q] is found
[Dempster, Laird, & Rubin, 1977]

114 Justification
Observed likelihood increases at every EM step:
L(θ̂^(m+1)|x) ≥ L(θ̂^(m)|x)
[Exercise: use Jensen and (1)]

115 Censored data
Normal N(θ, 1) sample right-censored at a:
L(θ|x) = (2π)^{−m/2} exp{ −(1/2) ∑_{i=1}^m (x_i − θ)² } [1 − Φ(a − θ)]^{n−m}
Associated complete log-likelihood:
log L^c(θ|x, z) ∝ −(1/2) ∑_{i=1}^m (x_i − θ)² − (1/2) ∑_{i=m+1}^n (z_i − θ)²
where the z_i's are the censored observations, with density
k(z|θ, x) = exp{−(z − θ)²/2} / { √(2π) [1 − Φ(a − θ)] } = ϕ(z − θ) / [1 − Φ(a − θ)], a < z

117 Censored data (2)
At the j-th EM iteration
Q(θ|θ̂^(j), x) ∝ −(1/2) ∑_{i=1}^m (x_i − θ)² − (1/2) E[ ∑_{i=m+1}^n (Z_i − θ)² | θ̂^(j), x ]
 = −(1/2) ∑_{i=1}^m (x_i − θ)² − (1/2) ∑_{i=m+1}^n ∫_a^∞ (z_i − θ)² k(z_i|θ̂^(j), x) dz_i

118 Censored data (3)
Differentiating in θ,
n θ̂^(j+1) = m x̄ + (n − m) E[Z|θ̂^(j)]
with
E[Z|θ̂^(j)] = ∫_a^∞ z k(z|θ̂^(j), x) dz = θ̂^(j) + ϕ(a − θ̂^(j)) / [1 − Φ(a − θ̂^(j))]
Hence, the EM sequence is provided by
θ̂^(j+1) = (m/n) x̄ + ((n − m)/n) [ θ̂^(j) + ϕ(a − θ̂^(j)) / (1 − Φ(a − θ̂^(j))) ]
which converges to the likelihood maximum θ̂

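A short R sketch of this EM recursion; the sample, true θ and censoring point a are made up for illustration:

# EM for a right-censored N(theta, 1) sample
set.seed(7)
theta_true <- 1; a <- 1.5; n <- 200
y <- rnorm(n, theta_true, 1)
x <- y[y <= a]; m <- length(x)          # observed values; n - m are censored at a
theta <- mean(x)                        # starting value
for (j in 1:100) {
  ez <- theta + dnorm(a - theta)/(1 - pnorm(a - theta))   # E[Z | Z > a, current theta]
  theta <- (m*mean(x) + (n - m)*ez)/n                     # EM update
}
theta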

120 Mixtures
Mixture of two normal distributions with unknown means,
.3 N_1(µ_0, 1) + .7 N_1(µ_1, 1),
sample X_1, ..., X_n and parameter θ = (µ_0, µ_1)
Missing data: Z_i ∈ {0, 1}, indicator of the component associated with X_i,
X_i | z_i ∼ N(µ_{z_i}, 1), Z_i ∼ B(.7)
Complete log-likelihood:
log L^c(θ|x, z) ∝ −(1/2) ∑_{i=1}^n z_i (x_i − µ_1)² − (1/2) ∑_{i=1}^n (1 − z_i)(x_i − µ_0)²
 = −(1/2) n_1 (µ̂_1 − µ_1)² − (1/2)(n − n_1)(µ̂_0 − µ_0)² [up to a term free of θ]
with n_1 = ∑_{i=1}^n z_i, n_1 µ̂_1 = ∑_{i=1}^n z_i x_i, (n − n_1) µ̂_0 = ∑_{i=1}^n (1 − z_i) x_i

122 Mixtures (2)
At the j-th EM iteration
Q(θ|θ̂^(j), x) = −(1/2) E[ n_1(µ̂_1 − µ_1)² + (n − n_1)(µ̂_0 − µ_0)² | θ̂^(j), x ]
Differentiating in θ:
µ_1^(j+1) = E[ n_1 µ̂_1 | θ̂^(j), x ] / E[ n_1 | θ̂^(j), x ],  µ_0^(j+1) = E[ (n − n_1) µ̂_0 | θ̂^(j), x ] / E[ (n − n_1) | θ̂^(j), x ]

123 Mixtures (3)
Hence θ̂^(j+1) is given by
µ_1^(j+1) = ∑_{i=1}^n E[Z_i|θ̂^(j), x_i] x_i / ∑_{i=1}^n E[Z_i|θ̂^(j), x_i]
µ_0^(j+1) = ∑_{i=1}^n E[(1 − Z_i)|θ̂^(j), x_i] x_i / ∑_{i=1}^n E[(1 − Z_i)|θ̂^(j), x_i]
Conclusion: step (E) in EM replaces the missing data Z_i with their conditional expectation, given x (an expectation that depends on θ̂^(m))
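
A compact R sketch of this EM scheme for the .3/.7 mixture; the simulated sample and starting values are made up for illustration:

# EM for .3 N(mu0, 1) + .7 N(mu1, 1) with unknown means
set.seed(8)
x <- c(rnorm(300, 0, 1), rnorm(700, 2.5, 1))    # hypothetical sample, mu0 = 0, mu1 = 2.5
mu0 <- -1; mu1 <- 1                             # starting values
for (j in 1:200) {
  t1 <- .7*dnorm(x, mu1) / (.3*dnorm(x, mu0) + .7*dnorm(x, mu1))  # E[Z_i | x_i, current theta]
  mu1 <- sum(t1*x)/sum(t1)                      # M-step updates
  mu0 <- sum((1 - t1)*x)/sum(1 - t1)
}
c(mu0, mu1)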

124 Mixtures (3)
[Figure: EM iterations in the (µ_0, µ_1) plane for several starting values]

125 Properties
The EM algorithm is such that
it converges to a local maximum or saddle-point
it depends on the initial condition θ^(0)
it requires several initial values when the likelihood is multimodal


Chapter 3: Unbiased Estimation Lecture 22: UMVUE and the method of using a sufficient and complete statistic Chapter 3: Unbiased Estimation Lecture 22: UMVUE and the method of using a sufficient and complete statistic Unbiased estimation Unbiased or asymptotically unbiased estimation plays an important role in

More information

Information geometry of mirror descent

Information geometry of mirror descent Information geometry of mirror descent Geometric Science of Information Anthea Monod Department of Statistical Science Duke University Information Initiative at Duke G. Raskutti (UW Madison) and S. Mukherjee

More information

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology Singapore University of Design and Technology Lecture 9: Hypothesis testing, uniformly most powerful tests. The Neyman-Pearson framework Let P be the family of distributions of concern. The Neyman-Pearson

More information

Fundamentals of Statistics

Fundamentals of Statistics Chapter 2 Fundamentals of Statistics This chapter discusses some fundamental concepts of mathematical statistics. These concepts are essential for the material in later chapters. 2.1 Populations, Samples,

More information

Likelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield Texas A&M University

Likelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield Texas A&M University Likelihood, MLE & EM for Gaussian Mixture Clustering Nick Duffield Texas A&M University Probability vs. Likelihood Probability: predict unknown outcomes based on known parameters: P(x q) Likelihood: estimate

More information

Inference in non-linear time series

Inference in non-linear time series Intro LS MLE Other Erik Lindström Centre for Mathematical Sciences Lund University LU/LTH & DTU Intro LS MLE Other General Properties Popular estimatiors Overview Introduction General Properties Estimators

More information

Lecture 3 September 1

Lecture 3 September 1 STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have

More information

Stat 5102 Lecture Slides Deck 3. Charles J. Geyer School of Statistics University of Minnesota

Stat 5102 Lecture Slides Deck 3. Charles J. Geyer School of Statistics University of Minnesota Stat 5102 Lecture Slides Deck 3 Charles J. Geyer School of Statistics University of Minnesota 1 Likelihood Inference We have learned one very general method of estimation: method of moments. the Now we

More information

6 Classic Theory of Point Estimation

6 Classic Theory of Point Estimation 6 Classic Theory of Point Estimation Point estimation is usually a starting point for more elaborate inference, such as construction of confidence intervals. Centering a confidence interval at a point

More information

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions.

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions. Large Sample Theory Study approximate behaviour of ˆθ by studying the function U. Notice U is sum of independent random variables. Theorem: If Y 1, Y 2,... are iid with mean µ then Yi n µ Called law of

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Jonathan Marchini Department of Statistics University of Oxford MT 2013 Jonathan Marchini (University of Oxford) BS2a MT 2013 1 / 27 Course arrangements Lectures M.2

More information

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium November 12, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

More information

Week 3: The EM algorithm

Week 3: The EM algorithm Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent

More information

Master s Written Examination - Solution

Master s Written Examination - Solution Master s Written Examination - Solution Spring 204 Problem Stat 40 Suppose X and X 2 have the joint pdf f X,X 2 (x, x 2 ) = 2e (x +x 2 ), 0 < x < x 2

More information

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random STA 216: GENERALIZED LINEAR MODELS Lecture 1. Review and Introduction Much of statistics is based on the assumption that random variables are continuous & normally distributed. Normal linear regression

More information

ST5215: Advanced Statistical Theory

ST5215: Advanced Statistical Theory Department of Statistics & Applied Probability Wednesday, October 19, 2011 Lecture 17: UMVUE and the first method of derivation Estimable parameters Let ϑ be a parameter in the family P. If there exists

More information

Estimation of Dynamic Regression Models

Estimation of Dynamic Regression Models University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.

More information

Bayesian Inference for directional data through ABC and homogeneous proper scoring rules

Bayesian Inference for directional data through ABC and homogeneous proper scoring rules Bayesian Inference for directional data through ABC and homogeneous proper scoring rules Monica Musio* Dept. of Mathematics and Computer Science, University of Cagliari, Italy - email: mmusio@unica.it

More information

Lecture 1: Introduction

Lecture 1: Introduction Principles of Statistics Part II - Michaelmas 208 Lecturer: Quentin Berthet Lecture : Introduction This course is concerned with presenting some of the mathematical principles of statistical theory. One

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation University of Pavia Maximum Likelihood Estimation Eduardo Rossi Likelihood function Choosing parameter values that make what one has observed more likely to occur than any other parameter values do. Assumption(Distribution)

More information

Probabilistic Graphical Models for Image Analysis - Lecture 4

Probabilistic Graphical Models for Image Analysis - Lecture 4 Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.

More information

Gaussian vectors and central limit theorem

Gaussian vectors and central limit theorem Gaussian vectors and central limit theorem Samy Tindel Purdue University Probability Theory 2 - MA 539 Samy T. Gaussian vectors & CLT Probability Theory 1 / 86 Outline 1 Real Gaussian random variables

More information

COM336: Neural Computing

COM336: Neural Computing COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Statistical Theory MT 2007 Problems 4: Solution sketches

Statistical Theory MT 2007 Problems 4: Solution sketches Statistical Theory MT 007 Problems 4: Solution sketches 1. Consider a 1-parameter exponential family model with density f(x θ) = f(x)g(θ)exp{cφ(θ)h(x)}, x X. Suppose that the prior distribution has the

More information

Probability reminders

Probability reminders CS246 Winter 204 Mining Massive Data Sets Probability reminders Sammy El Ghazzal selghazz@stanfordedu Disclaimer These notes may contain typos, mistakes or confusing points Please contact the author so

More information

Quantitative Biology II Lecture 4: Variational Methods

Quantitative Biology II Lecture 4: Variational Methods 10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate

More information

STA216: Generalized Linear Models. Lecture 1. Review and Introduction

STA216: Generalized Linear Models. Lecture 1. Review and Introduction STA216: Generalized Linear Models Lecture 1. Review and Introduction Let y 1,..., y n denote n independent observations on a response Treat y i as a realization of a random variable Y i In the general

More information

General Bayesian Inference I

General Bayesian Inference I General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for

More information

2014/2015 Smester II ST5224 Final Exam Solution

2014/2015 Smester II ST5224 Final Exam Solution 014/015 Smester II ST54 Final Exam Solution 1 Suppose that (X 1,, X n ) is a random sample from a distribution with probability density function f(x; θ) = e (x θ) I [θ, ) (x) (i) Show that the family of

More information

Mathematics Ph.D. Qualifying Examination Stat Probability, January 2018

Mathematics Ph.D. Qualifying Examination Stat Probability, January 2018 Mathematics Ph.D. Qualifying Examination Stat 52800 Probability, January 2018 NOTE: Answers all questions completely. Justify every step. Time allowed: 3 hours. 1. Let X 1,..., X n be a random sample from

More information

Basic math for biology

Basic math for biology Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood

More information

Chapter 4. Theory of Tests. 4.1 Introduction

Chapter 4. Theory of Tests. 4.1 Introduction Chapter 4 Theory of Tests 4.1 Introduction Parametric model: (X, B X, P θ ), P θ P = {P θ θ Θ} where Θ = H 0 +H 1 X = K +A : K: critical region = rejection region / A: acceptance region A decision rule

More information

Maximum Likelihood Large Sample Theory

Maximum Likelihood Large Sample Theory Maximum Likelihood Large Sample Theory MIT 18.443 Dr. Kempthorne Spring 2015 1 Outline 1 Large Sample Theory of Maximum Likelihood Estimates 2 Asymptotic Results: Overview Asymptotic Framework Data Model

More information

Modern Methods of Statistical Learning sf2935 Auxiliary material: Exponential Family of Distributions Timo Koski. Second Quarter 2016

Modern Methods of Statistical Learning sf2935 Auxiliary material: Exponential Family of Distributions Timo Koski. Second Quarter 2016 Auxiliary material: Exponential Family of Distributions Timo Koski Second Quarter 2016 Exponential Families The family of distributions with densities (w.r.t. to a σ-finite measure µ) on X defined by R(θ)

More information

Information in a Two-Stage Adaptive Optimal Design

Information in a Two-Stage Adaptive Optimal Design Information in a Two-Stage Adaptive Optimal Design Department of Statistics, University of Missouri Designed Experiments: Recent Advances in Methods and Applications DEMA 2011 Isaac Newton Institute for

More information

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes Neyman-Pearson paradigm. Suppose that a researcher is interested in whether the new drug works. The process of determining whether the outcome of the experiment points to yes or no is called hypothesis

More information