Chapter 3: Likelihood function and inference

1 Chapter 3: Likelihood function and inference
Outline: the likelihood / information and curvature / sufficiency and ancillarity / maximum likelihood estimation / non-regular models / EM algorithm

2 The likelihood
Given a (usually parametric) family of distributions F = {F_θ, θ ∈ Θ} with densities f_θ [wrt a fixed measure ν], the density of the iid sample x_1, ..., x_n is ∏_{i=1}^n f_θ(x_i)
Note: in the special case where ν is a counting measure, ∏_{i=1}^n f_θ(x_i) is the probability of observing the sample x_1, ..., x_n among all possible realisations of X_1, ..., X_n

4 The likelihood
Definition (likelihood function): the likelihood function associated with a sample x_1, ..., x_n is the function
L : Θ → R_+ , θ ↦ ∏_{i=1}^n f_θ(x_i)
It has the same formula as the density but a different space of variation

6 Example: density function versus likelihood function
Take the case of a Poisson density [against the counting measure]
f(x; θ) = (θ^x / x!) e^{−θ} I_N(x), which varies in N as a function of x [figure: density for θ = 3]
versus
L(θ; x) = (θ^x / x!) e^{−θ}, which varies in R_+ as a function of θ [figure: likelihood for x = 3]
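
A minimal R sketch of this distinction; the values θ = 3 and x = 3 are simply the ones used in the figures:

# Poisson density in x for fixed theta = 3
x <- 0:10
dens <- dpois(x, lambda = 3)
# Poisson likelihood in theta for fixed observation x = 3
theta <- seq(0.1, 10, by = 0.1)
lik <- dpois(3, lambda = theta)
# same formula, different argument: dens sums to 1 over x,
# while lik need not integrate to 1 over theta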

8 Example: density function versus likelihood function
Take the case of a Normal N(0, θ) density [against the Lebesgue measure]
f(x; θ) = (2πθ)^{−1/2} e^{−x²/2θ} I_R(x), which varies in R as a function of x [figure: density for θ = 2]
versus
L(θ; x) = (2πθ)^{−1/2} e^{−x²/2θ}, which varies in R_+ as a function of θ [figure: likelihood for x = 2]

10 Example: density function versus likelihood function
Take the case of a Normal N(0, 1/θ) density [against the Lebesgue measure]
f(x; θ) = √(θ/2π) e^{−x²θ/2} I_R(x), which varies in R as a function of x [figure: density for θ = 1/2]
versus
L(θ; x) = √(θ/2π) e^{−x²θ/2}, which varies in R_+ as a function of θ [figure: likelihood for x = 1/2]

12 Example: Hardy-Weinberg equilibrium
Population genetics: genotypes of biallelic genes AA, Aa, and aa, with sample frequencies n_AA, n_Aa and n_aa, follow a multinomial model M(n; p_AA, p_Aa, p_aa) related to the population proportion p_A of A alleles:
p_AA = p_A², p_Aa = 2 p_A (1 − p_A), p_aa = (1 − p_A)²
Likelihood: L(p_A | n_AA, n_Aa, n_aa) ∝ p_A^{2 n_AA} [2 p_A (1 − p_A)]^{n_Aa} (1 − p_A)^{2 n_aa}
[Boos & Stefanski, 2013]
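
A short R sketch of maximising this likelihood; the genotype counts below are made-up illustration values, not data from the slides:

# Hardy-Weinberg MLE of p_A (hypothetical counts)
nAA <- 50; nAa <- 35; naa <- 15
n <- nAA + nAa + naa
loglik <- function(p) 2*nAA*log(p) + nAa*log(2*p*(1 - p)) + 2*naa*log(1 - p)
optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum
# closed form by allele counting gives the same answer
(2*nAA + nAa) / (2*n)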

13 Mixed distributions and their likelihood
Special case when a random variable X may take specific values a_1, ..., a_k and a continuum of values A
Example: rainfall at a given spot on a given day may be zero with positive probability p_0 [it did not rain!], or an arbitrary number between 0 and 100 [capacity of the measurement container], or 100 with positive probability p_100 [container full]

14 Mixed distributions and their likelihood
Special case when a random variable X may take specific values a_1, ..., a_k and a continuum of values A
Example: Tobit model, where y* ∼ N(Xᵀβ, σ²) but only y = y* I{y* ≥ 0} is observed

15 Mixed distributions and their likelihood
Density of X against the composition of two measures, counting and Lebesgue:
f_X(a) = P_θ(X = a) if a ∈ {a_1, ..., a_k}, and f_X(a) = f(a|θ) otherwise
Resulting likelihood:
L(θ | x_1, ..., x_n) = ∏_{j=1}^k P_θ(X = a_j)^{n_j} × ∏_{x_i ∉ {a_1,...,a_k}} f(x_i|θ)
where n_j = # observations equal to a_j
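
A sketch of such a mixed likelihood in R, assuming for illustration a "zero with probability p_0, otherwise exponential" model (the exponential continuous part is an assumption, not from the slides):

# mixed log-likelihood: point mass at 0 plus an Exp(lambda) continuous part
loglik <- function(p0, lambda, x) {
  n0 <- sum(x == 0)                 # observations hitting the atom
  xc <- x[x > 0]                    # observations from the continuous part
  n0*log(p0) + length(xc)*log(1 - p0) + sum(dexp(xc, rate = lambda, log = TRUE))
}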

16 Enters Fisher, Ronald Fisher!
Fisher's intuition in the 1920s:
the likelihood function contains the relevant information about the parameter θ
the higher the likelihood, the more likely the parameter
the curvature of the likelihood determines the precision of the estimation

17 Concentration of likelihood mode around true parameter
[Figures: likelihood functions for x_1, ..., x_n ∼ P(3) as n increases, n ranging from about 40 to 240]
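
An R sketch reproducing the spirit of these figures (sample sizes chosen to match the range shown):

# Poisson(3) likelihoods concentrate as n grows
set.seed(1)
theta <- seq(1, 6, by = 0.01)
x <- rpois(240, lambda = 3)
plot(NULL, xlim = range(theta), ylim = c(0, 1), xlab = "theta", ylab = "scaled likelihood")
for (n in c(40, 80, 160, 240)) {
  ll <- sapply(theta, function(t) sum(dpois(x[1:n], t, log = TRUE)))
  lines(theta, exp(ll - max(ll)))   # rescaled to maximum 1 for comparability
}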

19 Concentration of likelihood mode around true parameter
[Figures: likelihood functions for x_1, ..., x_n ∼ N(0, 1), as n increases and as the sample varies]

22 Why concentration takes place
Consider x_1, ..., x_n iid F. Then
log ∏_{i=1}^n f(x_i|θ) = ∑_{i=1}^n log f(x_i|θ)
and by the LLN
(1/n) ∑_{i=1}^n log f(x_i|θ) → ∫_X log f(x|θ) dF(x)
Lemma: maximising the likelihood is asymptotically equivalent to minimising the Kullback-Leibler divergence
∫_X log { f(x) / f(x|θ) } dF(x)
Hence the MLE targets the member of the family closest to the true distribution

24 Score function
Score function defined by
∇ log L(θ|x) = ( ∂/∂θ_1 L(θ|x), ..., ∂/∂θ_p L(θ|x) ) / L(θ|x)
Gradient (slope) of the log-likelihood function at the point θ
Lemma: when X ∼ F_θ, E_θ[∇ log L(θ|X)] = 0
Reason: ∫_X ∇ log L(θ|x) dF_θ(x) = ∫_X ∇L(θ|x) dx = ∇ ∫_X dF_θ(x) = 0
Connected with the concentration theorem: the gradient is null on average at the true value of the parameter
Warning: not defined for non-differentiable likelihoods, e.g. when the support depends on θ
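
A quick Monte Carlo sanity check of the lemma, using the Poisson model from the earlier example, for which the score is ∂/∂θ log f(x|θ) = x/θ − 1:

# E_theta[score] should be close to 0 at the true theta
set.seed(2)
theta <- 3
x <- rpois(1e5, theta)
score <- x/theta - 1        # Poisson score for a single observation
mean(score)                 # ~ 0
var(score)                  # ~ 1/theta, the Fisher information (see next slides)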

29 Fisher's information matrix
Another notion attributed to Fisher [more likely due to Edgeworth]
Information: the covariance matrix of the score vector
I(θ) = E_θ[ ∇ log f(X|θ) {∇ log f(X|θ)}ᵀ ]
Often called Fisher information
Measures the curvature of the likelihood surface, which translates as information brought by the data
Sometimes denoted I_X to stress the dependence on the distribution of X

31 Fisher's information matrix
Second derivative of the log-likelihood as well:
Lemma: if L(θ|x) is twice differentiable [as a function of θ],
I(θ) = −E_θ[ ∇∇ᵀ log f(X|θ) ]
Hence I_ij(θ) = −E_θ[ ∂²/∂θ_i∂θ_j log f(X|θ) ]

32 Illustrations
Binomial B(n, p) distribution:
f(x|p) = C(n, x) p^x (1 − p)^{n−x}
∂/∂p log f(x|p) = x/p − (n − x)/(1 − p)
∂²/∂p² log f(x|p) = −x/p² − (n − x)/(1 − p)²
Hence I(p) = np/p² + (n − np)/(1 − p)² = n / p(1 − p)
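
A numerical check of I(p) = n/p(1 − p) via the variance of the score; the values of n and p below are arbitrary:

# variance of the Binomial score vs the theoretical Fisher information
set.seed(3)
n <- 20; p <- 0.3
x <- rbinom(1e5, size = n, prob = p)
score <- x/p - (n - x)/(1 - p)
var(score)          # Monte Carlo estimate
n / (p*(1 - p))     # theoretical value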

33 Illustrations
Multinomial M(n; p_1, ..., p_k) distribution:
f(x|p) = C(n; x_1, ..., x_k) p_1^{x_1} ··· p_k^{x_k}
∂/∂p_i log f(x|p) = x_i/p_i − x_k/p_k
∂²/∂p_i∂p_j log f(x|p) = −x_k/p_k² (i ≠ j)
∂²/∂p_i² log f(x|p) = −x_i/p_i² − x_k/p_k²
Hence, over the free parameters (p_1, ..., p_{k−1}),
I(p) = n [ diag(1/p_1, ..., 1/p_{k−1}) + (1/p_k) 11ᵀ ]
i.e. I(p)_{ii} = n(1/p_i + 1/p_k) and I(p)_{ij} = n/p_k for i ≠ j

34 Illustrations
Multinomial M(n; p_1, ..., p_k) distribution (continued): the inverse information matrix is
I(p)^{−1} = (1/n) [ diag(p_1, ..., p_{k−1}) − p pᵀ ]
i.e. (I(p)^{−1})_{ii} = p_i(1 − p_i)/n and (I(p)^{−1})_{ij} = −p_i p_j /n for i ≠ j

35 Illustrations
Normal N(µ, σ²) distribution:
f(x|θ) = (2π)^{−1/2} σ^{−1} exp{ −(x − µ)²/2σ² }
∂/∂µ log f(x|θ) = (x − µ)/σ²
∂/∂σ log f(x|θ) = −1/σ + (x − µ)²/σ³
∂²/∂µ² log f(x|θ) = −1/σ²
∂²/∂µ∂σ log f(x|θ) = −2(x − µ)/σ³
∂²/∂σ² log f(x|θ) = 1/σ² − 3(x − µ)²/σ⁴
Hence I(θ) = (1/σ²) diag(1, 2)

36 Properties
Additive features translating as accumulation of information:
if X and Y are independent, I_X(θ) + I_Y(θ) = I_(X,Y)(θ), and I_{X_1,...,X_n}(θ) = n I_{X_1}(θ)
if X = T(Y) and Y = S(X), I_X(θ) = I_Y(θ)
if X = T(Y), I_X(θ) ≤ I_Y(θ)
If η = Ψ(θ) is a bijective transform, change of parameterisation:
I(θ) = {∂η/∂θ}ᵀ I(η) {∂η/∂θ}
"In information geometry, this is seen as a change of coordinates on a Riemannian manifold, and the intrinsic properties of curvature are unchanged under different parametrizations. In general, the Fisher information matrix provides a Riemannian metric (more precisely, the Fisher-Rao metric)." [Wikipedia]

39 Approximations
Back to the Kullback-Leibler divergence
D(θ*, θ) = ∫_X f(x|θ*) log { f(x|θ*)/f(x|θ) } dx
Using a second-order Taylor expansion
log f(x|θ) = log f(x|θ*) + (θ − θ*)ᵀ ∇ log f(x|θ*) + (1/2)(θ − θ*)ᵀ ∇∇ᵀ log f(x|θ*)(θ − θ*) + o(‖θ − θ*‖²)
approximation of the divergence:
D(θ*, θ) ≈ (1/2)(θ − θ*)ᵀ I(θ*)(θ − θ*)
[Exercise: show this is exact in the normal case]

42 First CLT
Central limit law of the score vector: given X_1, ..., X_n i.i.d. f(x|θ),
(1/√n) ∇ log L(θ|X_1, ..., X_n) ≈ N(0, I_{X_1}(θ)) [at the true θ]
Notation: I_1(θ) stands for I_{X_1}(θ) and indicates the information associated with a single observation

44 Sufficiency
What if a transform S(X_1, ..., X_n) of the sample contains all the information, i.e.
I_{(X_1,...,X_n)}(θ) = I_{S(X_1,...,X_n)}(θ) uniformly in θ?
In this case S(·) is called a sufficient statistic [because it is sufficient to know the value of S(x_1, ..., x_n) to get complete information]
[A statistic is an arbitrary transform of the data X_1, ..., X_n]

47 Sufficiency (bis)
Alternative definition: if (X_1, ..., X_n) ∼ f(x_1, ..., x_n|θ) and if T = S(X_1, ..., X_n) is such that the distribution of (X_1, ..., X_n) conditional on T does not depend on θ, then S(·) is a sufficient statistic
Factorisation theorem: S(·) is a sufficient statistic if and only if
f(x_1, ..., x_n|θ) = g(S(x_1, ..., x_n)|θ) h(x_1, ..., x_n)
Another notion due to Fisher

50 Illustrations
Uniform U(0, θ) distribution:
L(θ|x_1, ..., x_n) = θ^{−n} ∏_{i=1}^n I_(0,θ)(x_i) = θ^{−n} I_{θ > max_i x_i}
Hence S(X_1, ..., X_n) = max_i X_i = X_(n) is sufficient

51 Illustrations
Bernoulli B(p) distribution:
L(p|x_1, ..., x_n) = ∏_{i=1}^n p^{x_i} (1 − p)^{1−x_i} = {p/(1 − p)}^{∑_i x_i} (1 − p)^n
Hence S(X_1, ..., X_n) = X̄_n is sufficient

52 Illustrations
Normal N(µ, σ²) distribution:
L(µ, σ|x_1, ..., x_n) = ∏_{i=1}^n (2πσ²)^{−1/2} exp{ −(x_i − µ)²/2σ² }
 = (2πσ²)^{−n/2} exp{ −(1/2σ²) ∑_{i=1}^n (x_i − x̄_n + x̄_n − µ)² }
 = (2πσ²)^{−n/2} exp{ −(1/2σ²) ∑_{i=1}^n (x_i − x̄_n)² − (n/2σ²)(x̄_n − µ)² }
Hence S(X_1, ..., X_n) = ( X̄_n , ∑_{i=1}^n (X_i − X̄_n)² ) is sufficient

53 Sufficiency and exponential families
Both previous examples belong to exponential families
f(x|θ) = h(x) exp{ T(θ)ᵀ S(x) − τ(θ) }
Generic property of exponential families:
f(x_1, ..., x_n|θ) = { ∏_{i=1}^n h(x_i) } exp{ T(θ)ᵀ ∑_{i=1}^n S(x_i) − nτ(θ) }
Lemma: for an exponential family with summary statistic S(·), the statistic S(X_1, ..., X_n) = ∑_{i=1}^n S(X_i) is sufficient

55 Sufficiency as a rare feature
Nice property reducing the data to a low-dimensional transform, but...
How frequent is it within the collection of probability distributions? Very rare, as essentially restricted to exponential families [Pitman-Koopman-Darmois theorem], with the exception of parameter-dependent families like U(0, θ)

58 Pitman-Koopman-Darmois characterisation
If X_1, ..., X_n are iid random variables from a density f(·|θ) whose support does not depend on θ and verifying the property that there exists an integer n_0 such that, for n ≥ n_0, there is a sufficient statistic S(X_1, ..., X_n) with fixed [in n] dimension, then f(·|θ) belongs to an exponential family
[Factorisation theorem]
Note: Darmois published this result in 1935 [in French] and Koopman and Pitman in 1936 [in English], but Darmois is generally omitted from the theorem... Fisher proved it for one-dimensional sufficient statistics in 1934

60 Minimal sufficiency
Multiplicity of sufficient statistics, e.g., S'(x) = (S(x), U(x)) remains sufficient when S(·) is sufficient
Search for a most concentrated summary:
Minimal sufficiency: a sufficient statistic S(·) is minimal sufficient if it is a function of any other sufficient statistic
Lemma: for a minimal exponential family representation
f(x|θ) = h(x) exp{ T(θ)ᵀ S(x) − τ(θ) }
S(X_1) + ··· + S(X_n) is minimal sufficient

62 Ancillarity
Opposite of sufficiency:
Ancillarity: when X_1, ..., X_n are iid random variables from a density f(·|θ), a statistic A(·) is ancillary if A(X_1, ..., X_n) has a distribution that does not depend on θ
Useless?! Not necessarily, as conditioning upon A(X_1, ..., X_n) leads to more precision and efficiency: use of F_θ(x_1, ..., x_n | A(x_1, ..., x_n)) instead of F_θ(x_1, ..., x_n)
Notion of maximal ancillary statistic

65 Illustrations
1. If X_1, ..., X_n iid U(0, θ), A(X_1, ..., X_n) = (X_1, ..., X_n)/X_(n) is ancillary
2. If X_1, ..., X_n iid N(µ, σ²), A(X_1, ..., X_n) = (X_1 − X̄_n, ..., X_n − X̄_n) / √( ∑_{i=1}^n (X_i − X̄_n)² ) is ancillary
3. If X_1, ..., X_n iid f(x|θ), rank(X_1, ..., X_n) is ancillary
> x = rnorm(10)
> rank(x)
[see, e.g., rank tests]

66 Basu's theorem
Completeness: when X_1, ..., X_n are iid random variables from a density f(·|θ), a statistic A(·) is complete if the only function Ψ such that E_θ[Ψ(A(X_1, ..., X_n))] = 0 for all θ's is the null function
Basu's theorem: let X = (X_1, ..., X_n) be a random sample from f(·|θ) where θ ∈ Θ. If V is an ancillary statistic, and T is complete and sufficient for θ, then T and V are independent with respect to f(·|θ) for all θ ∈ Θ.
[Basu, 1955]

68 Some examples
Example 1: if X = (X_1, ..., X_n) is a random sample from the Normal distribution N(µ, σ²) with σ known, X̄_n = (1/n) ∑_{i=1}^n X_i is sufficient and complete, while (X_1 − X̄_n, ..., X_n − X̄_n) is ancillary, hence independent from X̄_n.
Counter-example 2: let N be an integer-valued random variable with known pdf (π_1, π_2, ...), and let S | N = n ∼ B(n, p) with unknown p. Then (N, S) is minimal sufficient and N is ancillary.

70 More counterexamples
Counter-example 3: if X = (X_1, ..., X_n) is a random sample from the double exponential distribution f(x|θ) = (1/2) exp{ −|x − θ| }, the order statistics (X_(1), ..., X_(n)) are minimal sufficient but not complete, since X_(n) − X_(1) is ancillary and with fixed expectation.
Counter-example 4: if X is a random variable from the Uniform U(θ, θ + 1) distribution, X and [X] are independent, but while X is complete and sufficient, [X] is not ancillary.

72 Last counterexample
Let X be distributed over a finite set of values with probabilities (as functions of p, with q = 1 − p) of the form α p²q, ᾱ pq², p³/2, q³/2, γ pq, γ̄ pq and their mirror images, where α + ᾱ = γ + γ̄ = 2/3 are known. Then
T = X is minimal sufficient
V = I(X > 0) is ancillary
if α ≠ ᾱ, T and V are not independent
T is complete for two-valued functions
[Lehmann, 1981]

73 Point estimation, estimators and estimates
When given a parametric family f(·|θ) and a sample supposedly drawn from this family, (X_1, ..., X_n) iid f(x|θ):
1. an estimator of θ is a statistic T(X_1, ..., X_n) or θ̂_n providing a [reasonable] substitute for the unknown value θ
2. an estimate of θ is the value of the estimator for a given [realised] sample, T(x_1, ..., x_n)
Example: for a Normal N(µ, σ²) sample X_1, ..., X_n, T(X_1, ..., X_n) = µ̂_n = X̄_n is an estimator of µ, and its numerical value computed from x_1, ..., x_n is an estimate

75 Maximum likelihood principle
Given the concentration property of the likelihood function, a reasonable choice of estimator is its mode:
MLE: a maximum likelihood estimator (MLE) θ̂_n satisfies
L(θ̂_n | X_1, ..., X_n) ≥ L(θ | X_1, ..., X_n) for all θ ∈ Θ
Under regularity of L(·|X_1, ..., X_n), the MLE is also a solution of the likelihood equations
∇ log L(θ̂_n | X_1, ..., X_n) = 0
Warning: θ̂_n is not the most likely value of θ but makes the observation (x_1, ..., x_n) most likely...

78 Maximum likelihood invariance
Principle independent of the parameterisation: if ξ = h(θ) is a one-to-one transform of θ, then ξ̂_n^MLE = h(θ̂_n^MLE)
[estimator of transform = transform of estimator]
By extension, if ξ = h(θ) is any transform of θ, then ξ̂_n^MLE = h(θ̂_n^MLE)
Alternative of profile likelihoods, distinguishing between parameters of interest and nuisance parameters

80 Unicity of maximum likelihood estimate
Depending on the regularity of L(·|x_1, ..., x_n), there may be
1. an a.s. unique MLE θ̂_n^MLE: case of x_1, ..., x_n ∼ N(µ, 1)
2. several or an infinity of MLEs [or of solutions to the likelihood equations]: case of x_1, ..., x_n ∼ N(µ_1 + µ_2, 1) [and mixtures of normals]
3. no MLE at all: case of x_1, ..., x_n ∼ N(µ_i, τ²), where the likelihood is unbounded [L → +∞ as τ → 0 with µ_i = x_i]

83 Unicity of maximum likelihood estimate
Consequence of standard differential calculus results on ℓ(θ) = log L(θ|x_1, ..., x_n):
Lemma: if Θ is connected and open, and if ℓ(·) is twice differentiable with
lim_{θ → ∂Θ} ℓ(θ) < +∞
and if H(θ) = −∇∇ᵀ ℓ(θ) is positive definite at all solutions of the likelihood equations, then ℓ(·) has a unique global maximum
Limited appeal because excluding local maxima

85 Unicity of MLE for exponential families
Lemma: if f(·|θ) is a minimal exponential family
f(x|θ) = h(x) exp{ T(θ)ᵀ S(x) − τ(θ) }
with T(·) one-to-one and twice differentiable over Θ, if Θ is open, and if there is at least one solution to the likelihood equations, then it is the unique MLE
The likelihood equation is equivalent to S(x) = E_θ[S(X)]

87 Illustrations
Uniform U(0, θ): the likelihood L(θ|x_1, ..., x_n) = θ^{−n} I_{θ > max_i x_i} is not differentiable at X_(n), but
θ̂_n^MLE = X_(n)
[Super-efficient estimator]

88 Illustrations
Bernoulli B(p): the likelihood L(p|x_1, ..., x_n) = {p/(1 − p)}^{∑_i x_i} (1 − p)^n is differentiable over (0, 1), and
p̂_n^MLE = X̄_n

89 Illustrations
Normal N(µ, σ²): the likelihood
L(µ, σ|x_1, ..., x_n) ∝ σ^{−n} exp{ −(1/2σ²) ∑_{i=1}^n (x_i − x̄_n)² − (n/2σ²)(x̄_n − µ)² }
is differentiable, with
(µ̂_n^MLE, σ̂²_n^MLE) = ( X̄_n , (1/n) ∑_{i=1}^n (X_i − X̄_n)² )

90 The fundamental theorem of Statistics
Fundamental theorem: under appropriate conditions, if (X_1, ..., X_n) iid f(x|θ) and if θ̂_n is a solution of ∇ log f(X_1, ..., X_n|θ) = 0, then
√n (θ̂_n − θ) → N_p(0, I(θ)^{−1}) in distribution
Equivalent of the CLT for estimation purposes
I(θ) can be replaced with I(θ̂_n) or even Î(θ̂_n) = −(1/n) ∑_i ∇∇ᵀ log f(x_i|θ̂_n)

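A Monte Carlo check of this asymptotic normality; the Exponential(rate = θ) model is chosen purely for illustration (θ̂_n = 1/x̄_n and I(θ) = 1/θ², so I(θ)^{−1} = θ²):

# sqrt(n)(theta_hat - theta) should be approximately N(0, theta^2)
set.seed(4)
theta <- 2; n <- 500
z <- replicate(5000, { x <- rexp(n, rate = theta); sqrt(n) * (1/mean(x) - theta) })
c(mean(z), var(z), theta^2)   # mean ~ 0, variance ~ theta^2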

92 Assumptions
θ is identifiable
the support of f(·|θ) is constant in θ
ℓ(θ) is thrice differentiable [the killer]
there exists g(x) integrable against f(·|θ) in a neighbourhood of the true parameter such that
| ∂³ log f(x|θ) / ∂θ_i ∂θ_j ∂θ_k | ≤ g(x)
the following identity stands [mostly superfluous]:
I(θ) = E_θ[ ∇ log f(X|θ) {∇ log f(X|θ)}ᵀ ] = −E_θ[ ∇∇ᵀ log f(X|θ) ]
θ̂_n converges in probability to θ [similarly superfluous]
[Boos & Stefanski, 2014, p.286; Lehmann & Casella, 1998]

93 Inefficient MLEs
Example of the MLE of η = ‖θ‖² when x ∼ N_p(θ, I_p): η̂^MLE = ‖x‖²
Then E_η[‖x‖²] = η + p diverges away from η with p
Note: consistent and efficient behaviour when considering the MLE of η based on Z = ‖X‖² ∼ χ²_p(η)
[Robert, 2001]

96 Inconsistent MLEs
Take X_1, ..., X_n iid f_θ(x) with, for θ ∈ [0, 1],
f_θ(x) = (1 − θ) (1/δ(θ)) f_0( (x − θ)/δ(θ) ) + θ f_1(x)
where f_1(x) = (1/2) I_[−1,1](x), f_0(x) = (1 − |x|) I_[−1,1](x), and δ(θ) = (1 − θ) exp{ −(1 − θ)^{−4} + 1 }
Then for any θ, θ̂_n^MLE → 1 a.s.
[Ferguson, 1982; John Wellner's slides, ca. 2005]

97 Inconsistent MLEs
Consider X_ij, i = 1, ..., n, j = 1, 2, with X_ij ∼ N(µ_i, σ²). Then
µ̂_i^MLE = (X_i1 + X_i2)/2, σ̂²^MLE = (1/4n) ∑_{i=1}^n (X_i1 − X_i2)²
Therefore σ̂²^MLE → σ²/2 a.s.
[Neyman & Scott, 1948]
Note: working solely with X_i1 − X_i2 ∼ N(0, 2σ²) produces a consistent MLE
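
A quick simulation of the Neyman-Scott phenomenon; the nuisance means below are drawn arbitrarily for illustration:

# MLE of sigma^2 converges to sigma^2 / 2, not sigma^2
set.seed(5)
n <- 1e5; sigma <- 1
mu <- rnorm(n, 0, 3)                 # arbitrary nuisance means mu_i
x1 <- rnorm(n, mu, sigma); x2 <- rnorm(n, mu, sigma)
sum((x1 - x2)^2) / (4*n)             # ~ sigma^2 / 2 = 0.5
mean((x1 - x2)^2) / 2                # consistent estimate from the differences, ~ 1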

99 Likelihood optimisation
Practical optimisation of the likelihood function
θ* = arg max_θ L(θ|x) = arg max_θ ∏_{i=1}^n g(x_i|θ), assuming X = (X_1, ..., X_n) iid g(x|θ)
analytical resolution feasible for exponential families, where the likelihood equations reduce to
∇T(θ) ∑_{i=1}^n S(x_i) = n ∇τ(θ)
use of standard numerical techniques like Newton-Raphson,
θ^(t+1) = θ^(t) + I_obs(X, θ^(t))^{−1} ∇ℓ(θ^(t))
with ℓ(·) the log-likelihood and I_obs the observed information matrix

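A minimal Newton-Raphson sketch in R; the Gamma(shape = a, rate = 1) model is an illustrative choice not taken from the slides:

# Newton-Raphson for the shape a of a Gamma(a, rate = 1) sample
set.seed(6)
x <- rgamma(200, shape = 2.5, rate = 1)
n <- length(x); s <- sum(log(x))
a <- mean(x)                               # starting value (method of moments)
for (t in 1:20) {
  score <- s - n*digamma(a)                # d/da of the log-likelihood
  obs_info <- n*trigamma(a)                # observed (= expected) information here
  a <- a + score/obs_info                  # Newton-Raphson step
}
a                                          # MLE of the shape parameter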

102 EM algorithm
Cases where g is too complex for the above to work
Special case when g is a marginal:
g(x|θ) = ∫_Z f(x, z|θ) dz
Z called latent or missing variable

103 Illustrations
censored data: X = min(X*, a), X* ∼ N(θ, 1)
mixture model: X ∼ .3 N_1(µ_0, 1) + .7 N_1(µ_1, 1)
disequilibrium model: X = min(X*, Y*), X* ∼ f_1(x|θ), Y* ∼ f_2(x|θ)

104 Completion
EM algorithm based on completing the data x with z, such that (X, Z) ∼ f(x, z|θ)
Z missing data vector and pair (X, Z) complete data vector
Conditional density of Z given x:
k(z|θ, x) = f(x, z|θ) / g(x|θ)

106 Likelihood decomposition
Likelihood associated with the complete data (x, z): L^c(θ|x, z) = f(x, z|θ)
and likelihood for the observed data L(θ|x), such that
log L(θ|x) = E[log L^c(θ|x, Z) | θ_0, x] − E[log k(Z|θ, x) | θ_0, x]   (1)
for any θ_0, with integration operated against the conditional distribution of Z given the observables (and parameters), k(z|θ_0, x)

107 [A tale of] two θ's
There are two θ's! In (1), θ_0 is a fixed (and arbitrary) value driving the integration, while θ is both free (and variable)
Maximising the observed likelihood L(θ|x) is equivalent to maximising the r.h.s. term in (1),
E[log L^c(θ|x, Z) | θ_0, x] − E[log k(Z|θ, x) | θ_0, x]

109 Intuition for EM
Instead of maximising wrt θ the r.h.s. term in (1), maximise only
E[log L^c(θ|x, Z) | θ_0, x]
Maximisation of the complete log-likelihood is impossible since z is unknown, hence it is substituted by maximisation of the expected complete log-likelihood, with the expectation depending on the term θ_0

111 Expectation-Maximisation
Expectation of the complete log-likelihood denoted
Q(θ|θ_0, x) = E[log L^c(θ|x, Z) | θ_0, x]
to stress the dependence on θ_0 and on the sample x
Principle: EM derives a sequence of estimators θ̂^(j), j = 1, 2, ..., through iteration of Expectation and Maximisation steps:
Q(θ̂^(j) | θ̂^(j−1), x) = max_θ Q(θ | θ̂^(j−1), x)

113 EM Algorithm
Iterate (in m):
1. (step E) Compute Q(θ|θ̂^(m), x) = E[log L^c(θ|x, Z) | θ̂^(m), x]
2. (step M) Maximise Q(θ|θ̂^(m), x) in θ and set θ̂^(m+1) = arg max_θ Q(θ|θ̂^(m), x)
until a fixed point [of Q] is found
[Dempster, Laird, & Rubin, 1977]

114 Justification
Observed likelihood increases at every EM step:
L(θ̂^(m+1)|x) ≥ L(θ̂^(m)|x)
[Exercise: use Jensen and (1)]

115 Censored data
Normal N(θ, 1) sample right-censored at a:
L(θ|x) = (2π)^{−m/2} exp{ −(1/2) ∑_{i=1}^m (x_i − θ)² } [1 − Φ(a − θ)]^{n−m}
Associated complete log-likelihood:
log L^c(θ|x, z) ∝ −(1/2) ∑_{i=1}^m (x_i − θ)² − (1/2) ∑_{i=m+1}^n (z_i − θ)²
where the z_i's are the censored observations, with density
k(z|θ, x) = exp{−(z − θ)²/2} / { √(2π) [1 − Φ(a − θ)] } = ϕ(z − θ) / [1 − Φ(a − θ)], a < z

117 Censored data (2)
At the j-th EM iteration
Q(θ|θ̂^(j), x) ∝ −(1/2) ∑_{i=1}^m (x_i − θ)² − (1/2) E[ ∑_{i=m+1}^n (Z_i − θ)² | θ̂^(j), x ]
 = −(1/2) ∑_{i=1}^m (x_i − θ)² − (1/2) ∑_{i=m+1}^n ∫_a^∞ (z_i − θ)² k(z_i|θ̂^(j), x) dz_i

118 Censored data (3)
Differentiating in θ,
n θ̂^(j+1) = m x̄ + (n − m) E[Z|θ̂^(j)]
with
E[Z|θ̂^(j)] = ∫_a^∞ z k(z|θ̂^(j), x) dz = θ̂^(j) + ϕ(a − θ̂^(j)) / [1 − Φ(a − θ̂^(j))]
Hence, the EM sequence is provided by
θ̂^(j+1) = (m/n) x̄ + ((n − m)/n) [ θ̂^(j) + ϕ(a − θ̂^(j)) / (1 − Φ(a − θ̂^(j))) ]
which converges to the likelihood maximum θ̂

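A short R sketch of this EM recursion; the sample, true θ and censoring point a are made up for illustration:

# EM for a right-censored N(theta, 1) sample
set.seed(7)
theta_true <- 1; a <- 1.5; n <- 200
y <- rnorm(n, theta_true, 1)
x <- y[y <= a]; m <- length(x)          # observed values; n - m are censored at a
theta <- mean(x)                        # starting value
for (j in 1:100) {
  ez <- theta + dnorm(a - theta)/(1 - pnorm(a - theta))   # E[Z | Z > a, current theta]
  theta <- (m*mean(x) + (n - m)*ez)/n                     # EM update
}
theta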

120 Mixtures
Mixture of two normal distributions with unknown means,
.3 N_1(µ_0, 1) + .7 N_1(µ_1, 1),
sample X_1, ..., X_n and parameter θ = (µ_0, µ_1)
Missing data: Z_i ∈ {0, 1}, indicator of the component associated with X_i,
X_i | z_i ∼ N(µ_{z_i}, 1), Z_i ∼ B(.7)
Complete log-likelihood:
log L^c(θ|x, z) ∝ −(1/2) ∑_{i=1}^n z_i (x_i − µ_1)² − (1/2) ∑_{i=1}^n (1 − z_i)(x_i − µ_0)²
 = −(1/2) n_1 (µ̂_1 − µ_1)² − (1/2)(n − n_1)(µ̂_0 − µ_0)² [up to a term free of θ]
with n_1 = ∑_{i=1}^n z_i, n_1 µ̂_1 = ∑_{i=1}^n z_i x_i, (n − n_1) µ̂_0 = ∑_{i=1}^n (1 − z_i) x_i

122 Mixtures (2)
At the j-th EM iteration
Q(θ|θ̂^(j), x) = −(1/2) E[ n_1(µ̂_1 − µ_1)² + (n − n_1)(µ̂_0 − µ_0)² | θ̂^(j), x ]
Differentiating in θ:
µ_1^(j+1) = E[ n_1 µ̂_1 | θ̂^(j), x ] / E[ n_1 | θ̂^(j), x ],  µ_0^(j+1) = E[ (n − n_1) µ̂_0 | θ̂^(j), x ] / E[ (n − n_1) | θ̂^(j), x ]

123 Mixtures (3)
Hence θ̂^(j+1) is given by
µ_1^(j+1) = ∑_{i=1}^n E[Z_i|θ̂^(j), x_i] x_i / ∑_{i=1}^n E[Z_i|θ̂^(j), x_i]
µ_0^(j+1) = ∑_{i=1}^n E[(1 − Z_i)|θ̂^(j), x_i] x_i / ∑_{i=1}^n E[(1 − Z_i)|θ̂^(j), x_i]
Conclusion: step (E) in EM replaces the missing data Z_i with their conditional expectation, given x (an expectation that depends on θ̂^(m))
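
A compact R sketch of this EM scheme for the .3/.7 mixture; the simulated sample and starting values are made up for illustration:

# EM for .3 N(mu0, 1) + .7 N(mu1, 1) with unknown means
set.seed(8)
x <- c(rnorm(300, 0, 1), rnorm(700, 2.5, 1))    # hypothetical sample, mu0 = 0, mu1 = 2.5
mu0 <- -1; mu1 <- 1                             # starting values
for (j in 1:200) {
  t1 <- .7*dnorm(x, mu1) / (.3*dnorm(x, mu0) + .7*dnorm(x, mu1))  # E[Z_i | x_i, current theta]
  mu1 <- sum(t1*x)/sum(t1)                      # M-step updates
  mu0 <- sum((1 - t1)*x)/sum(1 - t1)
}
c(mu0, mu1)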

124 Mixtures (3)
[Figure: EM iterations in the (µ_0, µ_1) plane for several starting values]

125 Properties
The EM algorithm is such that
it converges to a local maximum or saddle-point
it depends on the initial condition θ^(0)
it requires several initial values when the likelihood is multimodal


Chapter 3: Unbiased Estimation Lecture 22: UMVUE and the method of using a sufficient and complete statistic Chapter 3: Unbiased Estimation Lecture 22: UMVUE and the method of using a sufficient and complete statistic Unbiased estimation Unbiased or asymptotically unbiased estimation plays an important role in

More information

Information geometry of mirror descent

Information geometry of mirror descent Information geometry of mirror descent Geometric Science of Information Anthea Monod Department of Statistical Science Duke University Information Initiative at Duke G. Raskutti (UW Madison) and S. Mukherjee

More information

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology Singapore University of Design and Technology Lecture 9: Hypothesis testing, uniformly most powerful tests. The Neyman-Pearson framework Let P be the family of distributions of concern. The Neyman-Pearson

More information

Fundamentals of Statistics

Fundamentals of Statistics Chapter 2 Fundamentals of Statistics This chapter discusses some fundamental concepts of mathematical statistics. These concepts are essential for the material in later chapters. 2.1 Populations, Samples,

More information

Likelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield Texas A&M University

Likelihood, MLE & EM for Gaussian Mixture Clustering. Nick Duffield Texas A&M University Likelihood, MLE & EM for Gaussian Mixture Clustering Nick Duffield Texas A&M University Probability vs. Likelihood Probability: predict unknown outcomes based on known parameters: P(x q) Likelihood: estimate

More information

Inference in non-linear time series

Inference in non-linear time series Intro LS MLE Other Erik Lindström Centre for Mathematical Sciences Lund University LU/LTH & DTU Intro LS MLE Other General Properties Popular estimatiors Overview Introduction General Properties Estimators

More information

Lecture 3 September 1

Lecture 3 September 1 STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have

More information

Stat 5102 Lecture Slides Deck 3. Charles J. Geyer School of Statistics University of Minnesota

Stat 5102 Lecture Slides Deck 3. Charles J. Geyer School of Statistics University of Minnesota Stat 5102 Lecture Slides Deck 3 Charles J. Geyer School of Statistics University of Minnesota 1 Likelihood Inference We have learned one very general method of estimation: method of moments. the Now we

More information

6 Classic Theory of Point Estimation

6 Classic Theory of Point Estimation 6 Classic Theory of Point Estimation Point estimation is usually a starting point for more elaborate inference, such as construction of confidence intervals. Centering a confidence interval at a point

More information

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions.

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions. Large Sample Theory Study approximate behaviour of ˆθ by studying the function U. Notice U is sum of independent random variables. Theorem: If Y 1, Y 2,... are iid with mean µ then Yi n µ Called law of

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Jonathan Marchini Department of Statistics University of Oxford MT 2013 Jonathan Marchini (University of Oxford) BS2a MT 2013 1 / 27 Course arrangements Lectures M.2

More information

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium November 12, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

More information

Week 3: The EM algorithm

Week 3: The EM algorithm Week 3: The EM algorithm Maneesh Sahani maneesh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Term 1, Autumn 2005 Mixtures of Gaussians Data: Y = {y 1... y N } Latent

More information

Master s Written Examination - Solution

Master s Written Examination - Solution Master s Written Examination - Solution Spring 204 Problem Stat 40 Suppose X and X 2 have the joint pdf f X,X 2 (x, x 2 ) = 2e (x +x 2 ), 0 < x < x 2

More information

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random

STA 216: GENERALIZED LINEAR MODELS. Lecture 1. Review and Introduction. Much of statistics is based on the assumption that random STA 216: GENERALIZED LINEAR MODELS Lecture 1. Review and Introduction Much of statistics is based on the assumption that random variables are continuous & normally distributed. Normal linear regression

More information

ST5215: Advanced Statistical Theory

ST5215: Advanced Statistical Theory Department of Statistics & Applied Probability Wednesday, October 19, 2011 Lecture 17: UMVUE and the first method of derivation Estimable parameters Let ϑ be a parameter in the family P. If there exists

More information

Estimation of Dynamic Regression Models

Estimation of Dynamic Regression Models University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.

More information

Bayesian Inference for directional data through ABC and homogeneous proper scoring rules

Bayesian Inference for directional data through ABC and homogeneous proper scoring rules Bayesian Inference for directional data through ABC and homogeneous proper scoring rules Monica Musio* Dept. of Mathematics and Computer Science, University of Cagliari, Italy - email: mmusio@unica.it

More information

Lecture 1: Introduction

Lecture 1: Introduction Principles of Statistics Part II - Michaelmas 208 Lecturer: Quentin Berthet Lecture : Introduction This course is concerned with presenting some of the mathematical principles of statistical theory. One

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation University of Pavia Maximum Likelihood Estimation Eduardo Rossi Likelihood function Choosing parameter values that make what one has observed more likely to occur than any other parameter values do. Assumption(Distribution)

More information

Probabilistic Graphical Models for Image Analysis - Lecture 4

Probabilistic Graphical Models for Image Analysis - Lecture 4 Probabilistic Graphical Models for Image Analysis - Lecture 4 Stefan Bauer 12 October 2018 Max Planck ETH Center for Learning Systems Overview 1. Repetition 2. α-divergence 3. Variational Inference 4.

More information

Gaussian vectors and central limit theorem

Gaussian vectors and central limit theorem Gaussian vectors and central limit theorem Samy Tindel Purdue University Probability Theory 2 - MA 539 Samy T. Gaussian vectors & CLT Probability Theory 1 / 86 Outline 1 Real Gaussian random variables

More information

COM336: Neural Computing

COM336: Neural Computing COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Statistical Theory MT 2007 Problems 4: Solution sketches

Statistical Theory MT 2007 Problems 4: Solution sketches Statistical Theory MT 007 Problems 4: Solution sketches 1. Consider a 1-parameter exponential family model with density f(x θ) = f(x)g(θ)exp{cφ(θ)h(x)}, x X. Suppose that the prior distribution has the

More information

Probability reminders

Probability reminders CS246 Winter 204 Mining Massive Data Sets Probability reminders Sammy El Ghazzal selghazz@stanfordedu Disclaimer These notes may contain typos, mistakes or confusing points Please contact the author so

More information

Quantitative Biology II Lecture 4: Variational Methods

Quantitative Biology II Lecture 4: Variational Methods 10 th March 2015 Quantitative Biology II Lecture 4: Variational Methods Gurinder Singh Mickey Atwal Center for Quantitative Biology Cold Spring Harbor Laboratory Image credit: Mike West Summary Approximate

More information

STA216: Generalized Linear Models. Lecture 1. Review and Introduction

STA216: Generalized Linear Models. Lecture 1. Review and Introduction STA216: Generalized Linear Models Lecture 1. Review and Introduction Let y 1,..., y n denote n independent observations on a response Treat y i as a realization of a random variable Y i In the general

More information

General Bayesian Inference I

General Bayesian Inference I General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for

More information

2014/2015 Smester II ST5224 Final Exam Solution

2014/2015 Smester II ST5224 Final Exam Solution 014/015 Smester II ST54 Final Exam Solution 1 Suppose that (X 1,, X n ) is a random sample from a distribution with probability density function f(x; θ) = e (x θ) I [θ, ) (x) (i) Show that the family of

More information

Mathematics Ph.D. Qualifying Examination Stat Probability, January 2018

Mathematics Ph.D. Qualifying Examination Stat Probability, January 2018 Mathematics Ph.D. Qualifying Examination Stat 52800 Probability, January 2018 NOTE: Answers all questions completely. Justify every step. Time allowed: 3 hours. 1. Let X 1,..., X n be a random sample from

More information

Basic math for biology

Basic math for biology Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood

More information

Chapter 4. Theory of Tests. 4.1 Introduction

Chapter 4. Theory of Tests. 4.1 Introduction Chapter 4 Theory of Tests 4.1 Introduction Parametric model: (X, B X, P θ ), P θ P = {P θ θ Θ} where Θ = H 0 +H 1 X = K +A : K: critical region = rejection region / A: acceptance region A decision rule

More information

Maximum Likelihood Large Sample Theory

Maximum Likelihood Large Sample Theory Maximum Likelihood Large Sample Theory MIT 18.443 Dr. Kempthorne Spring 2015 1 Outline 1 Large Sample Theory of Maximum Likelihood Estimates 2 Asymptotic Results: Overview Asymptotic Framework Data Model

More information

Modern Methods of Statistical Learning sf2935 Auxiliary material: Exponential Family of Distributions Timo Koski. Second Quarter 2016

Modern Methods of Statistical Learning sf2935 Auxiliary material: Exponential Family of Distributions Timo Koski. Second Quarter 2016 Auxiliary material: Exponential Family of Distributions Timo Koski Second Quarter 2016 Exponential Families The family of distributions with densities (w.r.t. to a σ-finite measure µ) on X defined by R(θ)

More information

Information in a Two-Stage Adaptive Optimal Design

Information in a Two-Stage Adaptive Optimal Design Information in a Two-Stage Adaptive Optimal Design Department of Statistics, University of Missouri Designed Experiments: Recent Advances in Methods and Applications DEMA 2011 Isaac Newton Institute for

More information

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes Neyman-Pearson paradigm. Suppose that a researcher is interested in whether the new drug works. The process of determining whether the outcome of the experiment points to yes or no is called hypothesis

More information