Nancy Reid SS 6002A Office Hours by appointment

Size: px

Start display at page:

Download "Nancy Reid SS 6002A Office Hours by appointment"

Jennifer Wilson
5 years ago
Views:

Nancy Reid SS 6002A reid@utstat.utoronto.

1 Nancy Reid SS 6002A Office Hours by appointment Light touch assessment One or two problems assigned weekly graded during Reading Week

2 Various types of likelihood 1. likelihood, marginal likelihood, conditional likelihood, profile likelihood, adjusted profile likelihood 2. quasi-likelihood, composite likelihood 3. semi-parametric likelihood, partial likelihood 4. empirical likelihood, penalized likelihood 5. bootstrap likelihood, h-likelihood, weighted likelihood, pseudo-likelihood, local likelihood, sieve likelihood, simulated likelihood STA 4508: Topics in Likelihood Inference January 7, /50

3 Why so many? Principle: The probability model and the choice of [parameter] serve to translate a subject-matter question into a mathematical and statistical one Cox, 2006, p.3 likelihood function is proportional to the probability model inference based on the likelihood function is widely accepted provides more than point estimate or test of point hypothesis models needed for applications are more and more complex need some analogues to the likelihood function for these complex settings STA 4508: Topics in Likelihood Inference January 7, /50

4 The likelihood function Parametric model: f (y; θ), Likelihood function y Y, θ Θ R d L(θ; y) = f (y; θ), or L(θ; y) = c(y)f (y; θ), or L(θ; y) f (y; θ) typically, y = (y 1,..., y n ) x 1,..., x n i = 1,..., n f (y; θ) or f (y x; θ) is joint density under independence L(θ; y) f (y i x i ; θ) log-likelihood l(θ; y) = log L(θ; y) = log f (y i x i ; θ) θ could have dimension d > n (e.g. genetics), or d n, or θ could have infinite dimension e.g. regular model d < n and d fixed as n increases STA 4508: Topics in Likelihood Inference January 7, /50

5 Examples y i N(µ, σ 2 ): E(y i ) = x T i β: L(θ; y) = L(θ; y) = n i=1 n i=1 σ n exp{ 1 2σ 2 (y i µ) 2 } σ n exp{ 1 2σ 2 (y i x T i β) 2 } E(y i ) = m(x i ), m(x) = Σ J j=1 φ jb j (x): L(θ; y) = n i=1 σ n exp{ 1 2σ 2 (y i Σ J j=1 φ jb j (x i )) 2 } STA 4508: Topics in Likelihood Inference January 7, /50

6 ... examples y i = µ + ρ(y i 1 µ) + ɛ i, ɛ i N(0, σ 2 ): n L(θ; y) = f (y i y i 1 ; θ)f 0 (y 0 ; θ) i=1 y 1,..., y n i.i.d. observations from a U(0, θ) distribution: n L(θ; y) = θ n, i=1 0 < y (1) < < y (n) < θ STA 4508: Topics in Likelihood Inference January 7, /50

7 ... examples y 1,..., y n are the times of jumps of a non-homogeneous Poisson process with rate function λ( ): l{λ( ); y} = n i=1 τ log{λ(y i )} λ(u)du, 0 0 < y 1 < < y n < τ Davison, 6.5 multinomial: y i = (y i1,..., y ik ), l(θ; y) = n i=1 c=1 y ic = 1, y ic = 0, c c k y ic log(p c ) negative cross-entropy Hastie et al., Ch. 7 p = p(x, β), as above STA 4508: Topics in Likelihood Inference January 7, /50

9 Data: times of failure of a spring under stress 225, 171, 198, 189, 189, 135, 162, 135, 117, 162

10 Non-computable likelihoods Ising model: f (y; θ) = exp( (i,j) E 1 θ ij y i y j ) Z (θ) y i = ±1; binary property of a node i in a graph with n nodes θ ij measures strength of interaction between nodes i and j E is the set of edges between nodes partition function Z (θ) = y exp( (i,j) E θ ijy i y j ) Davison 6.2 Ravikumar et al. (2010). High-dimensional Ising model selection... Ann. Statist. p.1287 STA 4508: Topics in Likelihood Inference January 7, /50

11 ... complicated likelihoods example: clustered binary data latent variable: z ir = x ir i + ɛ ir, b i N(0, σb 2), ɛ ir N(0, 1) r = 1,..., n i : observations in a cluster/family/school... i = 1,..., n clusters random effect b i introduces correlation between observations in a cluster observations: y ir = 1 if z ir > 0, else 0 Pr(y ir = 1 b i ) = Φ(x ir β + b i) = p i Φ(z) = z 1 2π e x 2 /2 dx likelihood θ = (β, σ b ) L(θ; y) = n i=1 log ni r=1 p i y ir (1 p i ) 1 y ir φ(b i, σ 2 b )db i more general: z ir = x ir β + w ir b i + ɛ ir Renard et al. (2004) STA 4508: Topics in Likelihood Inference January 7, /50

12 ... complicated likelihoods M/G/1 queue: exponential arrival times, general service times, 1 server Shestopaloff & Neal, 2013 observations y i : times between departures from the queue unobserved variables V i arrival time of customer i model: V 1 Exp(θ 3 ) V i V i 1 V i 1 + Exp(θ 3 ) Yi X i 1, V i Uniform{θ 1 + max(0, V i X i 1 ), θ 2 + max(0, V i X i 1 )} X i = i j=1 Y j n n L(θ; y) = f (v 1 θ) f (v i v i 1, θ) f (y i v i, x i 1, θ)dv 1 dv n i=1 i=1 Fearnhead & Prangle, 2012 STA 4508: Topics in Likelihood Inference January 7, /50

13 Widely used STA 4508: Topics in Likelihood Inference January 7, /50

14 ... widely used STA 4508: Topics in Likelihood Inference January 7, /50

15 ... widely used STA 4508: Topics in Likelihood Inference January 7, /50

16 ... widely used STA 4508: Topics in Likelihood Inference January 7, /50

17 History STA 4508: Topics in Likelihood Inference January 7, /50

18 History STA 4508: Topics in Likelihood Inference January 7, /50

19 History STA 4508: Topics in Likelihood Inference January 7, /50

20 Why likelihood? makes probability modelling central emphasizes the inverse problem of reasoning from y 0 to θ or f ( ) suggested by Fisher as a measure of plausibility Royall, 1994 L(ˆθ)/L(θ) (1, 3) very plausible; L(ˆθ)/L(θ) (3, 10) implausible; L(ˆθ)/L(θ) (10, ) very implausible converts a prior probability π(θ) to a posterior π(θ y) via Bayes formula provides a conventional set of summary quantities for inference based on properties of the postulated model STA 4508: Topics in Likelihood Inference January 7, /50

21 ... why likelihood? likelihood function depends on data only through sufficient statistics likelihood map is sufficient Fraser & Naderi, 2006 gives exact inference in transformation models likelihood function as pivotal Hinkley, 1980 provides summary statistics with known limiting distribution leading to approximate pivotal functions, based on normal distribution likelihood function + sample space derivative gives better approximate inference STA 4508: Topics in Likelihood Inference January 7, /50

22 Likelihood inference direct use of likelihood function note that only relative values are well-defined define relative likelihood RL(θ) = L(θ) sup θ L(θ ) = L(θ) L(ˆθ) STA 4508: Topics in Likelihood Inference January 7, /50

23 ... likelihood inference combine with a probability density for θ π(θ y) = f (y; θ)π(θ) f (y; θ)π(θ)dθ inference for θ via probability statements from π(θ y) e.g., Probability (θ > 0 y) = 0.23, etc. any other use of likelihood function for inference relies on derived quantities and their distribution under the model the Likelihood Principle states two experiments with proportional likelihood functions lead to the same inference about the same parameter C& H, 1974, p.39 (strong likelihood) STA 4508: Topics in Likelihood Inference January 7, /50

24 Derived quantities; f (y; θ) observed likelihood L(θ; y) = c(y)f (y; θ) log-likelihood l(θ; y) = log L(θ; y) = log f (y; θ) + a(y) score U(θ) = l(θ; y)/ θ observed information j(θ) = 2 l(θ; y)/ θ θ T expected information i(θ) = E θ U(θ)U(θ) T called i 1 (θ) in CH STA 4508: Topics in Likelihood Inference January 7, /50

25 ... derived quantities; f (y; θ) observed likelihood L(θ; y) n i=1 f (y i; θ) log-likelihood l(θ; y) = n i=1 log f (y; θ)+a(y) score U(θ) = l(θ; y)/ θ = O p ( ) maximum likelihood estimate ˆθ = ˆθ(y) = arg sup θ l(θ; y) Fisher information j(ˆθ) = 2 l(ˆθ; y)/ θ θ T expected information i(θ) = E θ U(θ)U(θ) T = O( ) STA 4508: Topics in Likelihood Inference January 7, /50

26 Bartlett identities STA 4508: Topics in Likelihood Inference January 7, /50

27 Limiting distributions U(θ) = n i=1 U i(θ) E{U(θ)} = var{u(θ)} = U(θ)/ n L N{0, i 1 (θ)} STA 4508: Topics in Likelihood Inference January 7, /50

28 ... limiting distributions U(θ)/ n L N{0, i 1 (θ)} U(ˆθ) = 0 = U(θ) + (ˆθ θ)u (θ) + R n (ˆθ θ) = {U(θ)/i(θ)}{1 + o p (1)} n(ˆθ θ) L N{0, i 1 1 (θ)} STA 4508: Topics in Likelihood Inference January 7, /50

29 ... limiting distributions n(ˆθ) L N{θ, i 1 1 (θ)} l(θ) = l(ˆθ) + (θ ˆθ)l (ˆθ) (θ ˆθ) 2 l (ˆθ) + R n 2{l(ˆθ) l(θ)} = (ˆθ θ) 2 i(θ){1 + o p (1)} 2{l(ˆθ) l(θ)} L χ 2 d STA 4508: Topics in Likelihood Inference January 7, /50

30 Inference from limiting distributions ˆθ. N d {θ, j 1 (ˆθ)} j(ˆθ) = l (ˆθ; y) θ is estimated to be 21.5 (95% CI ) ˆθ ± 2ˆσ w(θ) = 2{l(ˆθ) l(θ)}. χ 2 d likelihood based CI for θ with confidence level 95% is (18.6, 23.0) log likelihood function log likelihood w θ θ θ STA 4508: Topics in Likelihood Inference January 7, /50 θ

31 ... inference from limiting distributions pivotal quantities and p-value functions; θ scalar r u (θ) = U(θ)j 1/2 (ˆθ). N(0, 1) Pr{U(θ)j 1/2 (ˆθ) u(θ)j 1/2 (ˆθ)}. = Φ{u(θ)j 1/2 (ˆθ)} under sampling from the model f (y; θ) = f (y 1,..., y n ; θ) p u (θ) = Φ{u(θ)j 1/2 (ˆθ)} p-value function (of θ, for fixed data) shorthand = Φ{r u (θ)}, and = Φ{r e (θ)}, = Φ{r(θ)} are all p-value functions for θ, based on limiting dist ns STA 4508: Topics in Likelihood Inference January 7, /50

33 Example f (y i ; θ) = θe y i θ, i = 1,..., n l(θ) = l (θ) = l (θ) = r u (θ) = n( 1 θȳ 1) r e (θ) = n(1 ȳθ) r(θ) = STA 4508: Topics in Likelihood Inference January 7, /50

34 Example f (y i ; θ) = θ y i e θ /y i! l(θ) = l (θ) = l (θ) = r e (θ) = (s nθ)/ s Pr(S s) 1 Pr(S s) upper and lower p-value functions: Pr(S < s), Pr(S s) mid p-value function: Pr(S < sr) + 0.5Pr(S = s) STA 4508: Topics in Likelihood Inference January 7, /50

36 Aside for inference re θ, given y, plot p(θ) vs θ for p-value for H 0 : θ = θ 0, compute p(θ 0 ) for checking whether, e.g. Φ{r e (θ)} is a good approximation, compare p(θ) = Φ{re (θ)} to p exact (θ), as a function of θ, fixed y or compare p(θ 0 ) to p exact (θ 0 ) as a function of y if p exact (θ) not available, simulate STA 4508: Topics in Likelihood Inference January 7, /50

37 Nuisance parameters θ = (ψ, λ) = (ψ 1,..., ψ q, λ 1,..., λ d q ) ( ) Uψ (θ) U(θ) =, U U λ (θ) λ (ψ, ˆλ ψ ) = 0 ( ) ( ) iψψ i i(θ) = ψλ jψψ j j(θ) = ψλ i λψ i λλ ( i i 1 (θ) = ψψ i ψλ ) i λψ i λλ j λψ j λλ ( j j 1 (θ) = ψψ j ψλ ). j λψ i ψψ (θ) = {i ψψ (θ) i ψλ (θ)i 1 λλ (θ)i λψ(θ)} 1, l P (ψ) = l(ψ, ˆλ ψ ), j P (ψ) = l P (ψ) j λλ STA 4508: Topics in Likelihood Inference January 7, /50

38 Inference from limiting distributions, nuisance parameters w u (ψ) = U ψ (ψ, ˆλ ψ ) T {i ψψ (ψ, ˆλ ψ )}U ψ (ψ, ˆλ ψ ) w e (ψ) = ( ˆψ ψ){i ψψ ( ˆψ, ˆλ)} 1 ( ˆψ ψ) w(ψ) = 2{l( ˆψ, ˆλ) l(ψ, ˆλ ψ )} = 2{l P ( ˆψ) l P (ψ)}.. χ 2 q χ 2 q. χ 2 q; Approximate Pivots r u (ψ) = l P (ψ)j P( ˆψ) 1/2. N(0, 1), r e (ψ) = ( ˆψ ψ)j P ( ˆψ) 1/2. N(0, 1), r(ψ) = sign( ˆψ ψ)2{l P ( ˆψ) l P (ψ)}. N(0, 1) STA 4508: Topics in Likelihood Inference January 7, /50

40 Properties of likelihood functions and likelihood inference the likelihood depends only on the minimal sufficient statistic recall: L(θ; y) = m 1 (s; θ)m 2 (y) s is minimal sufficient equivalently L(θ; y) depends only on s L(θ 0 ; y) the likelihood map is sufficient Fraser & Naderi, 2006; Barndorff-Nielsen, et al, 1976 i.e y L 0 ( ; y), or y L( ; y) normed STA 4508: Topics in Likelihood Inference January 7, /50

41 ... properties maximum likelihood estimates are equivariant: ĥ(θ) = h(ˆθ) for one-to-one h( ) question: which of w e, w u, w are invariant under reparametrization of the full parameter: ϕ(θ)? question: which of r e, r u, r are invariant under interest-respecting reparameterizations (ψ, λ) {ψ, η(ψ, λ)}? consistency of maximum likelihood estimate? equivalence of maximum likelihood estimate and root of score equation? observed vs. expected information STA 4508: Topics in Likelihood Inference January 7, /50

42 Asymptotics for Bayesian inference exp{l(θ; y)}π(θ) π(θ y) = exp{l(θ; y)}π(θ)dθ expand numerator and denominator about ˆθ, assuming l (ˆθ) = 0 π(θ y). = N{ˆθ, j 1 (ˆθ)} expand denominator only about ˆθ result π(θ y). = 1 (2π) d/2 j(ˆθ) +1/2 exp{l(θ; y) l(ˆθ; y)} π(θ) π(ˆθ) STA 4508: Topics in Likelihood Inference January 7, /50

43 Posterior is asymptotically normal π(θ y). N{ˆθ, j 1 (ˆθ)} θ R, y = (y 1,..., y n ) careful statement STA 4508: Topics in Likelihood Inference January 7, /50

44 ... posterior is asymptotically normal π(θ y). N{ˆθ, j 1 (ˆθ)} θ R, y = (y 1,..., y n ) equivalently l π (θ) = STA 4508: Topics in Likelihood Inference January 7, /50

45 ... posterior is asymptotically normal In fact, If π(θ) > 0 and π (θ) is continuous in a neighbourhood of θ 0, there exist constants D and n y s.t. F n (ξ) Φ(ξ) < Dn 1/2, for all n > n y, on an almost-sure set with respect to f (y; θ 0 ), where y = (y 1,..., y n ) is a sample from f (y; θ 0 ), and θ 0 is an observation from the prior density π(θ). F n (ξ) = Pr{(θ ˆθ)j 1/2 (ˆθ) ξ y} Johnson (1970); Datta & Mukerjee (2004) STA 4508: Topics in Likelihood Inference January 7, /50

46 Laplace approximation π(θ y). = 1 (2π) 1/2 j(ˆθ) +1/2 exp{l(θ; y) l(ˆθ; y)} π(θ) π(ˆθ) π(θ y) = π(θ y) = 1 (2π) 1/2 j(ˆθ) +1/2 exp{l(θ; y) l(ˆθ; y)} π(θ) π(ˆθ) {1+O p(n 1 )} y = (y 1,..., y n ), θ R 1 1 (2π) 1/2 j π(ˆθ π ) +1/2 exp{l π (θ; y) l π (ˆθ π ; y)}{1+o p (n 1 )} STA 4508: Topics in Likelihood Inference January 7, /50

47 Posterior cdf θ π(ϑ y)dϑ. = θ 1 (2π) 1/2 el(ϑ;y) l( ˆϑ;y) 1/2 π(ϑ) j( ˆϑ) π( ˆϑ) dϑ STA 4508: Topics in Likelihood Inference January 7, /50

48 Posterior cdf θ π(ϑ y)dϑ. = θ 1 (2π) 1/2 el(ϑ;y) l( ˆϑ;y) 1/2 π(ϑ) j( ˆϑ) π( ˆϑ) dϑ SM, 11.3 STA 4508: Topics in Likelihood Inference January 7, /50

50 BDR, Ch.3, Cauchy with flat prior

Nancy Reid SS 6002A Office Hours by appointment

Nancy Reid SS 6002A Office Hours by appointment Nancy Reid SS 6002A reid@utstat.utoronto.ca Office Hours by appointment Problems assigned weekly, due the following week http://www.utstat.toronto.edu/reid/4508s16.html Various types of likelihood 1. likelihood,