Nancy Reid SS 6002A Office Hours by appointment
2 Nancy Reid, SS 6002A, office hours by appointment. Problems assigned weekly, due the following week.
3 Various types of likelihood
1. likelihood, marginal likelihood, conditional likelihood, profile likelihood, adjusted profile likelihood
2. quasi-likelihood, composite likelihood
3. semi-parametric likelihood, partial likelihood
4. empirical likelihood, penalized likelihood
5. bootstrap likelihood, h-likelihood, weighted likelihood, pseudo-likelihood, local likelihood, sieve likelihood
6. simulated likelihood, indirect inference, inference for p > n
4 Why so many?
Principle: "The probability model and the choice of [parameter] serve to translate a subject-matter question into a mathematical and statistical one." (Cox, 2006, p. 3)
- the likelihood function is proportional to the probability model
- inference based on the likelihood function is widely accepted
- it provides more than a point estimate or a test of a point hypothesis
- the models needed for applications are more and more complex
- we need analogues of the likelihood function for these complex settings
5 The likelihood function
Parametric model: f(y; θ), y ∈ Y, θ ∈ Θ ⊆ R^d
Likelihood function: L(θ; y) = f(y; θ), or L(θ; y) = c(y) f(y; θ), or L(θ; y) ∝ f(y; θ)
Typically y = (y_1, ..., y_n) with covariates x_1, ..., x_n, i = 1, ..., n, and f(y; θ) or f(y | x; θ) is a joint density.
Under independence: L(θ; y) ∝ ∏_{i=1}^n f(y_i | x_i; θ)
Log-likelihood: ℓ(θ; y) = log L(θ; y) = Σ_{i=1}^n log f(y_i | x_i; θ)
θ could have dimension d > n (e.g. genetics), or d increasing with n, or θ could be infinite-dimensional; a regular model has d < n with d fixed as n increases.
6 Examples
y_i ∼ N(µ, σ²):  L(θ; y) = ∏_{i=1}^n σ^{−1} exp{−(y_i − µ)²/(2σ²)} = σ^{−n} exp{−Σ_{i=1}^n (y_i − µ)²/(2σ²)}
E(y_i) = x_i^T β:  L(θ; y) = σ^{−n} exp{−Σ_{i=1}^n (y_i − x_i^T β)²/(2σ²)}
E(y_i) = m(x_i), with m(x) = Σ_{j=1}^J φ_j B_j(x):  L(θ; y) = σ^{−n} exp{−Σ_{i=1}^n (y_i − Σ_{j=1}^J φ_j B_j(x_i))²/(2σ²)}
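As a concrete sketch of the regression example (not from the slides; the data are simulated for illustration), the log-likelihood ℓ(β, σ; y) can be coded directly, dropping constants that do not involve the parameters:

```python
import numpy as np

def normal_loglik(beta, sigma, y, X):
    """l(beta, sigma; y) for y_i ~ N(x_i^T beta, sigma^2), constants dropped:
    -n log(sigma) - sum((y_i - x_i^T beta)^2) / (2 sigma^2)."""
    resid = y - X @ beta
    return -len(y) * np.log(sigma) - 0.5 * np.sum(resid**2) / sigma**2

# simulated data for illustration
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=50)
print(normal_loglik(np.array([1.0, 2.0]), 0.5, y, X))
```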
7 ... examples
autoregression: y_i = µ + ρ(y_{i−1} − µ) + ε_i, with ε_i ∼ N(0, σ²):
L(θ; y) = ∏_{i=1}^n f(y_i | y_{i−1}; θ) · f₀(y₀; θ)
y_1, ..., y_n i.i.d. observations from a U(0, θ) distribution:
L(θ; y) = θ^{−n},  0 < y_(1) < ⋯ < y_(n) < θ
8 ... examples
y_1, ..., y_n are the times of jumps of a non-homogeneous Poisson process with rate function λ(·):
ℓ{λ(·); y} = Σ_{i=1}^n log λ(y_i) − ∫₀^τ λ(u) du,  0 < y_1 < ⋯ < y_n < τ   (Davison, 6.5)
multinomial: y_i = (y_{i1}, ..., y_{ik}), with y_{ic} = 1 for the observed class c and y_{ic′} = 0 for c′ ≠ c:
ℓ(θ; y) = Σ_{i=1}^n Σ_{c=1}^k y_{ic} log p_{ic}, the negative cross-entropy (Hastie et al., Ch. 7); p_{ic} = p(x_{ic}, θ), as above
10 Data: times of failure of a spring under stress
225, 171, 198, 189, 189, 135, 162, 135, 117, 162
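The slides do not record which model was fitted to these data; as a minimal sketch, assuming an exponential model f(y; θ) = θe^{−θy} (so that θ̂ = 1/ȳ), the relative likelihood L(θ)/L(θ̂) defined on slide 21 can be plotted as follows:

```python
import numpy as np
import matplotlib.pyplot as plt

y = np.array([225, 171, 198, 189, 189, 135, 162, 135, 117, 162])
n, ybar = len(y), y.mean()
theta_hat = 1 / ybar                          # MLE under the assumed exponential model

def loglik(theta):                            # l(theta) = n log(theta) - theta * sum(y)
    return n * np.log(theta) - theta * y.sum()

theta = np.linspace(0.002, 0.012, 400)
RL = np.exp(loglik(theta) - loglik(theta_hat))   # relative likelihood L(theta)/L(theta_hat)
plt.plot(theta, RL)
plt.xlabel("theta"); plt.ylabel("RL(theta)")
plt.show()
```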
11 Complicated likelihoods: example, clustered binary data
latent variable: z_{ir} = x_{ir}^T β + b_i + ε_{ir}, with b_i ∼ N(0, σ_b²) and ε_{ir} ∼ N(0, 1)
r = 1, ..., n_i indexes observations in a cluster/family/school, ...; i = 1, ..., n indexes clusters
the random effect b_i introduces correlation between observations in a cluster
observations: y_{ir} = 1 if z_{ir} > 0, else y_{ir} = 0
Pr(y_{ir} = 1 | b_i) = Φ(x_{ir}^T β + b_i) = p_{ir},  where Φ(z) = ∫_{−∞}^z (2π)^{−1/2} e^{−x²/2} dx
likelihood, with θ = (β, σ_b):
L(θ; y) = ∏_{i=1}^n ∫ ∏_{r=1}^{n_i} p_{ir}^{y_{ir}} (1 − p_{ir})^{1−y_{ir}} φ(b_i; σ_b²) db_i
more generally: z_{ir} = x_{ir}^T β + w_{ir}^T b_i + ε_{ir}   (Renard et al., 2004)
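The cluster-level integral has no closed form, but it is one-dimensional, so quadrature works. A minimal sketch for a single cluster using Gauss-Hermite quadrature (made-up data; probit link as on the slide; the substitution b = √2 σ_b t puts the integral in Gauss-Hermite form):

```python
import numpy as np
from scipy.stats import norm

def cluster_lik(beta, sigma_b, y, X, n_nodes=30):
    """Approximate the integral over b_i of
    prod_r p_ir^y_ir (1 - p_ir)^(1 - y_ir) * phi(b_i; sigma_b^2) db_i
    by Gauss-Hermite quadrature, substituting b = sqrt(2)*sigma_b*t."""
    t, w = np.polynomial.hermite.hermgauss(n_nodes)
    b = np.sqrt(2) * sigma_b * t                    # quadrature nodes on the b scale
    eta = X @ beta                                  # (n_i,) linear predictor
    p = norm.cdf(eta[:, None] + b[None, :])         # (n_i, n_nodes): Pr(y_ir = 1 | b)
    contrib = np.prod(np.where(y[:, None] == 1, p, 1 - p), axis=0)
    return np.sum(w * contrib) / np.sqrt(np.pi)

# made-up cluster: 4 binary responses, intercept-only design
y = np.array([1, 0, 1, 1])
X = np.ones((4, 1))
print(cluster_lik(np.array([0.3]), sigma_b=1.0, y=y, X=X))
```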
12 ... complicated likelihoods
generalized linear geostatistical models (Diggle & Ribeiro, 2007):
E{Y(s) | u(s)} = g{x(s)^T β + u(s)},  s ∈ S ⊆ R^d, d ≥ 2
the random intercept u is a realization of a stationary Gaussian random field with mean 0 and covariance cov{u(s), u(s′)} = σ² ρ(s − s′; α)
n observed locations: y = (y_1, ..., y_n) with y_i = y(s_i)
likelihood function:
L(θ; y) = ∫_{R^n} ∏_{i=1}^n f(y_i | u_i; θ) · f(u; θ) du_1 ⋯ du_n,  where u ∼ MVN(0, Σ)
no factorization into lower-dimensional integrals, as there is for independent observations in the usual GLMMs
13 Non-computable likelihoods
Ising model: f(y; θ) = exp{Σ_{(i,j)∈E} θ_{ij} y_i y_j} / Z(θ)
y_i = ±1 is a binary property of node i in a graph with n nodes
θ_{ij} measures the strength of interaction between nodes i and j; E is the set of edges between nodes
partition function: Z(θ) = Σ_y exp{Σ_{(i,j)∈E} θ_{ij} y_i y_j}, a sum over all 2^n configurations
Davison 6.2; Ravikumar et al. (2010), High-dimensional Ising model selection..., Ann. Statist., p. 1287
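The difficulty is entirely in Z(θ): it sums over all 2^n spin configurations. For a tiny graph the sum can be done by brute force, which also makes clear why it is hopeless for large n. A sketch with a made-up 4-node chain:

```python
import itertools
import numpy as np

edges = [(0, 1), (1, 2), (2, 3)]          # a 4-node chain; made-up example
theta = {e: 0.5 for e in edges}           # interaction strengths

def energy(y):                            # sum over edges of theta_ij * y_i * y_j
    return sum(theta[(i, j)] * y[i] * y[j] for (i, j) in edges)

# partition function: sum over all 2^n spin configurations y_i = +/- 1
Z = sum(np.exp(energy(y)) for y in itertools.product([-1, 1], repeat=4))
print(Z)    # feasible for n = 4; the sum has 2^n terms, hopeless for large n
```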
14 ... non-computable likelihoods
restricted Boltzmann machine:
f(v, h; η) = (1/Z(η)) exp(α^T v + β^T h + v^T Ω h),  η = (α, β, Ω)
(Mu Zhu, U Waterloo)
15 ... restricted Boltzmann machine
f(v, h; η) = (1/Z(η)) exp(α^T v + β^T h + v^T Ω h),  η = (α, β, Ω)
with a single binary top node h, the model for h given v is a logistic regression:
log{Pr(h = 1 | v)/Pr(h = 0 | v)} = α + v^T ω
with several binary top nodes, the model for h_t, given the other top nodes and v, is also a logistic regression, with odds ratio depending only on v
stack these in layers, with the top nodes of one layer becoming the bottom nodes of the next; a numerical check of the single-node claim follows below
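The single-top-node claim is easy to verify numerically, since Z(η) cancels from the conditional odds. A sketch with made-up parameters (writing b for the top-node bias):

```python
import numpy as np

rng = np.random.default_rng(6)
m = 4                                   # made-up: 4 binary bottom nodes, 1 binary top node
alpha = rng.normal(size=m)              # bottom-node biases
b = 0.7                                 # top-node bias
omega = rng.normal(size=m)              # interaction weights

def unnorm(v, h):                       # exp(alpha^T v + b h + v^T omega h); Z cancels below
    return np.exp(alpha @ v + b * h + (v @ omega) * h)

v = np.array([1, 0, 1, 1])
log_odds = np.log(unnorm(v, 1) / unnorm(v, 0))   # conditional log-odds of h = 1 given v
print(log_odds, b + v @ omega)                    # identical: logistic regression in v
```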
16-18 History
[three slides of figures; not transcribed]
19 Why likelihood?
- makes probability modelling central
- emphasizes the inverse problem: reasoning from y⁰ back to θ or f(·)
- suggested by Fisher as a measure of plausibility (Royall, 1994):
  L(θ̂)/L(θ) ∈ (1, 3) very plausible; L(θ̂)/L(θ) ∈ (3, 10) implausible; L(θ̂)/L(θ) ∈ (10, ∞) very implausible
- converts a prior probability π(θ) to a posterior π(θ | y) via Bayes' formula
- provides a conventional set of summary quantities for inference, based on properties of the postulated model
20 ... why likelihood?
- the likelihood function depends on the data only through the sufficient statistics; the likelihood map is sufficient (Fraser & Naderi, 2006)
- gives exact inference in transformation models: the likelihood function as pivotal (Hinkley, 1980)
- provides summary statistics with known limiting distributions, leading to approximate pivotal functions based on the normal distribution
- the likelihood function plus a sample-space derivative gives better approximate inference
21 Likelihood inference
direct use of the likelihood function; note that only relative values are well defined
define the relative likelihood: RL(θ) = L(θ) / sup_{θ′} L(θ′) = L(θ)/L(θ̂)
22 ... likelihood inference
combine with a probability density for θ:
π(θ | y) = f(y; θ) π(θ) / ∫ f(y; θ) π(θ) dθ
inference for θ via probability statements from π(θ | y), e.g. Pr(θ > 0 | y) = 0.23, etc.
any other use of the likelihood function for inference relies on derived quantities and their distribution under the model
the (strong) Likelihood Principle states that two experiments with proportional likelihood functions lead to the same inference about the same parameter (Cox & Hinkley, 1974, p. 39)
23 Derived quantities; f(y; θ)
observed likelihood:   L(θ; y) = c(y) f(y; θ)
log-likelihood:        ℓ(θ; y) = log L(θ; y) = log f(y; θ) + a(y)
score:                 U(θ) = ∂ℓ(θ; y)/∂θ
observed information:  j(θ) = −∂²ℓ(θ; y)/∂θ ∂θ^T
expected information:  i(θ) = E_θ{U(θ) U(θ)^T}   (called i₁(θ) in CH)
24 ... derived quantities; f(y; θ) = ∏ f(y_i; θ)
observed likelihood:   L(θ; y) ∝ ∏_{i=1}^n f(y_i; θ)
log-likelihood:        ℓ(θ; y) = Σ_{i=1}^n log f(y_i; θ) + a(y)
score:                 U(θ) = ∂ℓ(θ; y)/∂θ = O_p(n^{1/2})
maximum likelihood estimate:  θ̂ = θ̂(y) = arg sup_θ ℓ(θ; y)
Fisher information:    j(θ̂) = −∂²ℓ(θ̂; y)/∂θ ∂θ^T
expected information:  i(θ) = E_θ{U(θ) U(θ)^T} = O(n)
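All of these quantities can be checked numerically from the log-likelihood alone; a minimal sketch for the exponential model of slide 32 (simulated data), using finite differences for the score and observed information:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
y = rng.exponential(scale=2.0, size=100)       # simulated data, true theta = 0.5

def negloglik(theta):                           # -l(theta) for f(y; theta) = theta*exp(-theta*y)
    return -(len(y) * np.log(theta) - theta * y.sum())

theta_hat = minimize_scalar(negloglik, bounds=(1e-6, 10), method="bounded").x

eps = 1e-5
score = -(negloglik(theta_hat + eps) - negloglik(theta_hat - eps)) / (2 * eps)
obs_info = (negloglik(theta_hat + eps) - 2 * negloglik(theta_hat)
            + negloglik(theta_hat - eps)) / eps**2          # j = -l'' = (-l)''
print(theta_hat, 1 / y.mean())                  # MLE agrees with the closed form 1/ybar
print(score, obs_info, len(y) / theta_hat**2)   # U(theta_hat) ~ 0; j(theta_hat) = n/theta_hat^2
```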
25 Bartlett identities
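The transcription shows only the slide title. For reference (standard results, not transcribed from the slide), the first two Bartlett identities follow by differentiating ∫ f(y; θ) dy = 1 under regularity conditions:

```latex
% First Bartlett identity: the score has mean zero.
E_\theta\{U(\theta)\} = 0
% Second Bartlett identity: the variance of the score equals minus the
% expected second derivative of the log-likelihood, i.e. the expected information.
E_\theta\{U(\theta)U(\theta)^{\top}\}
  = -E_\theta\left\{\frac{\partial^2 \ell(\theta;y)}{\partial\theta\,\partial\theta^{\top}}\right\}
  = i(\theta)
```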
26 Limiting distributions
U(θ) = Σ_{i=1}^n U_i(θ)
E{U(θ)} = 0,  var{U(θ)} = i(θ)   [by the Bartlett identities above]
U(θ)/√n →L N{0, i₁(θ)}  (convergence in law), where i₁(θ) = i(θ)/n is the expected information per observation
27 ... limiting distributions
U(θ)/√n →L N{0, i₁(θ)}
U(θ̂) = 0 = U(θ) + (θ̂ − θ) U′(θ) + R_n
(θ̂ − θ) = {U(θ)/i(θ)}{1 + o_p(1)}
√n(θ̂ − θ) →L N{0, i₁^{−1}(θ)}
28 ... limiting distributions
√n(θ̂ − θ) →L N{0, i₁^{−1}(θ)}
ℓ(θ) = ℓ(θ̂) + (θ − θ̂) ℓ′(θ̂) + ½ (θ − θ̂)² ℓ″(θ̂) + R_n
2{ℓ(θ̂) − ℓ(θ)} = (θ̂ − θ)² i(θ){1 + o_p(1)}
2{ℓ(θ̂) − ℓ(θ)} →L χ²_d
29 Inference from limiting distributions
θ̂ is approximately N_d{θ, j^{−1}(θ̂)},  with j(θ̂) = −ℓ″(θ̂; y)
  e.g. "θ is estimated to be 21.5", with 95% confidence interval θ̂ ± 2σ̂
w(θ) = 2{ℓ(θ̂) − ℓ(θ)} is approximately χ²_d
  e.g. the likelihood-based confidence interval for θ with confidence level 95% is (18.6, 23.0)
[figure: log-likelihood function, showing w(θ) and the two kinds of interval]
a sketch of both interval computations follows below
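A sketch of both interval computations for the spring data of slide 10, under the same assumed exponential model (the Wald interval is θ̂ ± 2 j(θ̂)^{−1/2}; the likelihood-based interval collects the θ with w(θ) below the χ²₁ 0.95 quantile):

```python
import numpy as np
from scipy.stats import chi2
from scipy.optimize import brentq

y = np.array([225, 171, 198, 189, 189, 135, 162, 135, 117, 162])  # spring data
n, theta_hat = len(y), 1 / y.mean()

def w(theta):                       # likelihood-ratio statistic 2{l(theta_hat) - l(theta)}
    return 2 * n * (theta * y.mean() - 1 - np.log(theta * y.mean()))

se = theta_hat / np.sqrt(n)         # j(theta_hat)^{-1/2} = theta_hat / sqrt(n) for this model
wald = (theta_hat - 2 * se, theta_hat + 2 * se)

c = chi2.ppf(0.95, df=1)            # w(theta) <= c defines the likelihood-based interval
lo = brentq(lambda t: w(t) - c, 1e-6, theta_hat)
hi = brentq(lambda t: w(t) - c, theta_hat, 10 * theta_hat)
print(wald, (lo, hi))
```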
30 ... inference from limiting distributions
pivotal quantities and p-value functions; θ scalar:
r_u(θ) = U(θ) j^{−1/2}(θ̂), approximately N(0, 1)
Pr{U(θ) j^{−1/2}(θ̂) ≤ u(θ) j^{−1/2}(θ̂)} ≈ Φ{u(θ) j^{−1/2}(θ̂)}, under sampling from the model f(y; θ) = f(y_1, ..., y_n; θ)
p_u(θ) = Φ{u(θ) j^{−1/2}(θ̂)} is a p-value function (of θ, for fixed data)
shorthand: Φ{r_u(θ)}, Φ{r_e(θ)}, and Φ{r(θ)} are all p-value functions for θ, based on limiting distributions
32 Example: f(y_i; θ) = θ e^{−y_i θ}, i = 1, ..., n
ℓ(θ) = …,  ℓ′(θ) = …,  ℓ″(θ) = …   [exercise]
r_u(θ) = √n {1/(θ ȳ) − 1}
r_e(θ) = √n (1 − ȳ θ)
r(θ) = …
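A sketch evaluating the three pivots for this model; the closed forms in the comments (ℓ(θ) = n log θ − θ Σ y_i, so that 2{ℓ(θ̂) − ℓ(θ)} = 2n{θȳ − 1 − log(θȳ)}) are derived here, not transcribed from the slide:

```python
import numpy as np

def pivots(theta, y):
    """r_u, r_e, r for the exponential model f(y; theta) = theta*exp(-theta*y).
    Derivation (not on the slide): l(theta) = n log(theta) - theta*sum(y), theta_hat = 1/ybar."""
    n, ybar = len(y), np.mean(y)
    r_u = np.sqrt(n) * (1 / (theta * ybar) - 1)          # score pivot, as on the slide
    r_e = np.sqrt(n) * (1 - ybar * theta)                # Wald pivot, as on the slide
    w = 2 * n * (theta * ybar - 1 - np.log(theta * ybar))
    r = np.sign(1 / ybar - theta) * np.sqrt(w)           # likelihood root
    return r_u, r_e, r

y = np.random.default_rng(2).exponential(scale=2.0, size=20)
print(pivots(0.5, y))    # all three approximately N(0, 1) when theta = 0.5 is the true value
```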
33 Example: f(y_i; θ) = θ^{y_i} e^{−θ}/y_i!
ℓ(θ) = …,  ℓ′(θ) = …,  ℓ″(θ) = …   [exercise]
r_e(θ) = (s − nθ)/√s,  where s = Σ y_i and S = Σ Y_i ∼ Poisson(nθ)
exact tail probabilities Pr(S ≤ s) and Pr(S ≥ s) from the discrete distribution of S
upper and lower p-value functions: Pr(S < s), Pr(S ≤ s)
mid p-value function: Pr(S < s) + 0.5 Pr(S = s)
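A sketch of the lower, mid, and upper p-value functions using the exact Poisson distribution of S (n and s are made up for illustration):

```python
from scipy.stats import poisson

def p_values(theta, s, n):
    """Exact p-value functions for the Poisson model, based on S = sum(y_i) ~ Poisson(n*theta)."""
    mu = n * theta
    lower = poisson.cdf(s - 1, mu)                 # Pr(S < s)
    upper = poisson.cdf(s, mu)                     # Pr(S <= s)
    mid = lower + 0.5 * poisson.pmf(s, mu)         # mid p-value
    return lower, mid, upper

print(p_values(theta=1.2, s=15, n=10))
```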
35 Aside
- for inference about θ, given y: plot p(θ) versus θ
- for a p-value for H₀: θ = θ₀: compute p(θ₀)
- for checking whether, e.g., Φ{r_e(θ)} is a good approximation: compare p(θ) = Φ{r_e(θ)} to p_exact(θ) as a function of θ for fixed y, or compare p(θ₀) to p_exact(θ₀) as a function of y
- if p_exact(θ) is not available, simulate; see the sketch below
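The simulation check in the last bullet takes a few lines: fix θ₀, draw repeated samples, compute p(θ₀) = Φ{r_e(θ₀)} each time, and compare the resulting distribution to Uniform(0, 1). A sketch for the exponential model of slide 32:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
theta0, n, B = 0.5, 10, 5000
p = np.empty(B)
for b in range(B):
    y = rng.exponential(scale=1 / theta0, size=n)
    r_e = np.sqrt(n) * (1 - y.mean() * theta0)     # Wald pivot from slide 32
    p[b] = norm.cdf(r_e)

# if Phi{r_e} were exact, p would be Uniform(0, 1); compare a few quantiles
print(np.quantile(p, [0.05, 0.5, 0.95]))
```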
36 Nuisance parameters
θ = (ψ, λ) = (ψ_1, ..., ψ_q, λ_1, ..., λ_{d−q})
U(θ) = (U_ψ(θ)^T, U_λ(θ)^T)^T;  the constrained estimate λ̂_ψ solves U_λ(ψ, λ̂_ψ) = 0
partitioned information matrices:
i(θ) = [ i_ψψ  i_ψλ ; i_λψ  i_λλ ],   j(θ) = [ j_ψψ  j_ψλ ; j_λψ  j_λλ ]
i^{−1}(θ) = [ i^{ψψ}  i^{ψλ} ; i^{λψ}  i^{λλ} ],   j^{−1}(θ) = [ j^{ψψ}  j^{ψλ} ; j^{λψ}  j^{λλ} ]
i^{ψψ}(θ) = {i_ψψ(θ) − i_ψλ(θ) i_λλ^{−1}(θ) i_λψ(θ)}^{−1}
profile log-likelihood: ℓ_P(ψ) = ℓ(ψ, λ̂_ψ);  profile information: j_P(ψ) = −ℓ_P″(ψ)
(a worked profile-likelihood sketch follows below)
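A worked sketch for a case where λ̂_ψ is available in closed form: y_i ∼ N(ψ, λ) with interest parameter ψ = µ and nuisance parameter λ = σ², so that λ̂_ψ = n^{−1} Σ(y_i − ψ)² (simulated data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
y = rng.normal(loc=2.0, scale=1.5, size=30)
n = len(y)

def l_profile(psi):
    """l_P(psi) = l(psi, lambda_hat_psi) for the N(psi, lambda) model;
    lambda_hat_psi = mean((y - psi)^2), so l_P(psi) = -(n/2) log(lambda_hat_psi) + const."""
    lam_hat = np.mean((y - psi) ** 2)
    return -0.5 * n * np.log(lam_hat)

psi = np.linspace(y.mean() - 2, y.mean() + 2, 200)
lp = np.array([l_profile(p) for p in psi])
plt.plot(psi, lp - lp.max())
plt.xlabel("psi"); plt.ylabel("l_P(psi) - l_P(psi_hat)")
plt.show()
```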
37 Inference from limiting distributions, nuisance parameters
each of the following is approximately χ²_q:
w_u(ψ) = U_ψ(ψ, λ̂_ψ)^T {i^{ψψ}(ψ, λ̂_ψ)} U_ψ(ψ, λ̂_ψ)
w_e(ψ) = (ψ̂ − ψ)^T {i^{ψψ}(ψ̂, λ̂)}^{−1} (ψ̂ − ψ)
w(ψ) = 2{ℓ(ψ̂, λ̂) − ℓ(ψ, λ̂_ψ)} = 2{ℓ_P(ψ̂) − ℓ_P(ψ)}
Approximate pivots (ψ scalar), each approximately N(0, 1):
r_u(ψ) = ℓ_P′(ψ) j_P(ψ̂)^{−1/2}
r_e(ψ) = (ψ̂ − ψ) j_P(ψ̂)^{1/2}
r(ψ) = sign(ψ̂ − ψ) [2{ℓ_P(ψ̂) − ℓ_P(ψ)}]^{1/2}
39 Properties of likelihood functions and likelihood inference
- the likelihood depends only on the minimal sufficient statistic
  recall: L(θ; y) = m_1(s; θ) m_2(y) ⟺ s is minimal sufficient; equivalently, L(θ; y)/L(θ₀; y) depends only on s
- the likelihood map is sufficient (Fraser & Naderi, 2006; Barndorff-Nielsen et al., 1976), i.e. y → L₀(·; y), or y → L(·; y) normed
40 ... properties
- maximum likelihood estimates are equivariant: ĥ(θ) = h(θ̂) for one-to-one h(·)
- question: which of w_e, w_u, w are invariant under reparametrizations of the full parameter, ϕ(θ)?
- question: which of r_e, r_u, r are invariant under interest-respecting reparametrizations (ψ, λ) → {ψ, η(ψ, λ)}?
- consistency of the maximum likelihood estimate?
- equivalence of the maximum likelihood estimate and a root of the score equation?
- observed vs. expected information
41 Asymptotics for Bayesian inference
π(θ | y) = exp{ℓ(θ; y)} π(θ) / ∫ exp{ℓ(θ; y)} π(θ) dθ
expand the numerator and denominator about θ̂, assuming ℓ′(θ̂) = 0:
π(θ | y) ≈ N{θ̂, j^{−1}(θ̂)}
expand the denominator only about θ̂; result:
π(θ | y) ≈ (2π)^{−d/2} |j(θ̂)|^{1/2} exp{ℓ(θ; y) − ℓ(θ̂; y)} π(θ)/π(θ̂)
(a numerical comparison follows below)
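A sketch comparing the second approximation with an exact posterior in a case where one is available: exponential data with a flat prior, where π(θ | y) is Gamma(n + 1, rate Σ y_i) and the prior ratio π(θ)/π(θ̂) is 1:

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(5)
y = rng.exponential(scale=2.0, size=25)
n, s = len(y), y.sum()
theta_hat = n / s                                  # MLE; flat prior, so l_pi = l

def laplace_post(theta):
    """(2 pi)^{-1/2} |j(theta_hat)|^{1/2} exp{l(theta) - l(theta_hat)}; prior ratio = 1."""
    l = lambda t: n * np.log(t) - t * s
    j_hat = n / theta_hat**2                       # observed information at the MLE
    return np.sqrt(j_hat / (2 * np.pi)) * np.exp(l(theta) - l(theta_hat))

theta = np.linspace(0.2, 1.0, 5)
exact = gamma.pdf(theta, a=n + 1, scale=1 / s)     # exact posterior: Gamma(n+1, rate s)
print(np.round(exact, 4))
print(np.round(laplace_post(theta), 4))            # close, and improves with n
```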
42 Posterior is asymptotically normal
π(θ | y) ≈ N{θ̂, j^{−1}(θ̂)},  θ ∈ R, y = (y_1, ..., y_n)
a careful statement is given on slide 44
43 ... posterior is asymptotically normal
π(θ | y) ≈ N{θ̂, j^{−1}(θ̂)},  θ ∈ R, y = (y_1, ..., y_n)
equivalently, ℓ_π(θ) = …   [exercise]
44 ... posterior is asymptotically normal
In fact: if π(θ) > 0 and π′(θ) is continuous in a neighbourhood of θ₀, there exist constants D and n_y such that
|F_n(ξ) − Φ(ξ)| < D n^{−1/2}  for all n > n_y,
on an almost-sure set with respect to f(y; θ₀), where y = (y_1, ..., y_n) is a sample from f(y; θ₀), θ₀ is an observation from the prior density π(θ), and
F_n(ξ) = Pr{(θ − θ̂) j^{1/2}(θ̂) ≤ ξ | y}
Johnson (1970); Datta & Mukerjee (2004)
45 Laplace approximation   (y = (y_1, ..., y_n), θ ∈ R)
π(θ | y) ≈ (2π)^{−1/2} |j(θ̂)|^{1/2} exp{ℓ(θ; y) − ℓ(θ̂; y)} π(θ)/π(θ̂)
more precisely,
π(θ | y) = (2π)^{−1/2} |j(θ̂)|^{1/2} exp{ℓ(θ; y) − ℓ(θ̂; y)} {π(θ)/π(θ̂)} {1 + O_p(n^{−1})}
π(θ | y) = (2π)^{−1/2} |j_π(θ̂_π)|^{1/2} exp{ℓ_π(θ; y) − ℓ_π(θ̂_π; y)} {1 + O_p(n^{−1})},
where ℓ_π(θ; y) = ℓ(θ; y) + log π(θ), θ̂_π maximizes ℓ_π, and j_π = −ℓ_π″
46 Posterior cdf
∫_{−∞}^θ π(ϑ | y) dϑ ≈ ∫_{−∞}^θ (2π)^{−1/2} e^{ℓ(ϑ; y) − ℓ(θ̂; y)} |j(θ̂)|^{1/2} π(ϑ)/π(θ̂) dϑ

47 ... posterior cdf
the same Laplace-type approximation to the posterior cdf; see SM, 11.3
48 BDR, Ch.3, Cauchy with flat prior