Approximate Likelihoods

Nancy Reid
July 28, 2015

Why likelihood?
- makes probability modelling central: l(θ; y) = log f(y; θ)
- emphasizes the inverse problem of reasoning from y to θ
- converts a prior probability to a posterior: π(θ) → π(θ | y)
- provides a conventional set of summary quantities: maximum likelihood estimator, score function, ...
- these define approximate pivotal quantities, based on the normal distribution
- basis for comparison of models, using AIC or BIC

Example 1: GLMM
- GLM: f(y_ij | u_i) = exp{y_ij η_ij − b(η_ij) + c(y_ij)}
- linear predictor: η_ij = x_ij^T β + z_ij^T u_i,  j = 1, ..., n_i;  i = 1, ..., m
- random effects: u_i ~ N_k(0, Σ)
- log-likelihood:
  l(β, Σ) = Σ_{i=1}^m ( y_i^T X_i β − (1/2) log|Σ| + log ∫_{R^k} exp{y_i^T Z_i u_i − 1_i^T b(X_i β + Z_i u_i) − (1/2) u_i^T Σ^{−1} u_i} du_i )

Ormerod & Wand, 2012
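To make the integral concrete, here is a minimal sketch (illustrative, not Ormerod & Wand's code) that evaluates this log-likelihood in the simplest case, a Poisson random-intercept model (k = 1, canonical link, b(η) = exp(η)), computing the one-dimensional integral over u_i by Gauss-Hermite quadrature; the function name and arguments are this sketch's own.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss  # probabilists' Hermite
from scipy.special import logsumexp

def glmm_loglik(beta, sigma2, y, X, n_quad=30):
    """y, X: lists of per-cluster response vectors and design matrices."""
    nodes, weights = hermegauss(n_quad)
    log_w = np.log(weights) - 0.5 * np.log(2 * np.pi)  # weights for N(0, 1)
    ll = 0.0
    for yi, Xi in zip(y, X):
        # eta_ij at each quadrature node, with u_i = sqrt(sigma2) * z
        eta = (Xi @ beta)[:, None] + np.sqrt(sigma2) * nodes[None, :]
        log_f = yi @ eta - np.exp(eta).sum(axis=0)  # log f(y_i | u) per node
        ll += logsumexp(log_f + log_w)              # log of the u_i integral
    return ll  # up to the additive constant -sum_ij log(y_ij!)
```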

Example 2: Poisson AR
- Poisson: f(y_t | α_t; θ) = exp(y_t log μ_t − μ_t)/y_t!,  log μ_t = β + α_t
- autoregression: α_t = φ α_{t−1} + ε_t,  ε_t ~ N(0, σ²),  |φ| < 1,  θ = (β, φ, σ²)
- likelihood:
  L(θ; y_1, ..., y_n) = ∫ { Π_{t=1}^n f(y_t | α_t; θ) } f(α; θ) dα
- L_approx(θ; y) via Laplace, with some refinements

Davis & Yau, 2011
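A minimal simulation sketch of this model, useful for the composite-likelihood example later; the parameter values are illustrative, not taken from Davis & Yau (2011).

```python
import numpy as np

def simulate_poisson_ar(n, beta, phi, sigma2, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    alpha = np.empty(n)
    # start the latent AR(1) in its stationary distribution N(0, sigma2/(1-phi^2))
    alpha[0] = rng.normal(0.0, np.sqrt(sigma2 / (1 - phi**2)))
    for t in range(1, n):
        alpha[t] = phi * alpha[t - 1] + rng.normal(0.0, np.sqrt(sigma2))
    mu = np.exp(beta + alpha)   # log mu_t = beta + alpha_t
    y = rng.poisson(mu)         # y_t | alpha_t ~ Poisson(mu_t)
    return y, alpha

y, alpha = simulate_poisson_ar(500, beta=1.0, phi=0.8, sigma2=0.25)
```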

Some proposed solutions
- simplify the likelihood
  - composite likelihood
  - variational approximation
  - Laplace approximation to integrals
- change the mode of inference
  - quasi-likelihood
  - indirect inference
- simulate
  - approximate Bayesian computation
  - MCMC

Composite likelihood
- also called pseudo-likelihood
- reduce high-dimensional dependencies by ignoring them
- for example, replace f(y_i1, ..., y_ik; θ) by
  - pairwise marginals: Π_{j<j′} f_2(y_ij, y_ij′; θ), or
  - conditionals: Π_j f_c(y_ij | y_{N(ij)}; θ)
- composite likelihood function:
  CL(θ; y) ∝ Π_{i=1}^n Π_{j<j′} f_2(y_ij, y_ij′; θ)
- composite ML estimates are consistent, asymptotically normal, but not fully efficient

Besag, 1975; Lindsay, 1988

- weighting the components can increase the efficiency of the score equation (Wednesday session)
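A minimal sketch of the pairwise construction above, for n iid vectors y_i in R^k from an exchangeable multivariate normal with mean μ and equicorrelation ρ, so that every bivariate margin is N_2((μ, μ), [[1, ρ], [ρ, 1]]); the toy model and all names are this sketch's own.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.optimize import minimize

def pairwise_loglik(params, Y):
    mu, rho = params
    cov = [[1.0, rho], [rho, 1.0]]
    ll = 0.0
    k = Y.shape[1]
    for j in range(k):
        for jp in range(j + 1, k):   # all pairs j < j'
            ll += multivariate_normal.logpdf(Y[:, [j, jp]],
                                             mean=[mu, mu], cov=cov).sum()
    return ll

rng = np.random.default_rng(1)
k, rho_true = 5, 0.4
S = rho_true * np.ones((k, k)) + (1 - rho_true) * np.eye(k)
Y = rng.multivariate_normal(np.zeros(k), S, size=200)
fit = minimize(lambda p: -pairwise_loglik(p, Y), x0=[0.0, 0.1],
               bounds=[(None, None), (-0.9, 0.95)])
print(fit.x)   # composite ML estimate of (mu, rho)
```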

Example: AR Poisson (Davis & Yau, 2011)
- likelihood:
  L(θ; y_1, ..., y_n) = ∫ { Π_{t=1}^n f(y_t | α_t; θ) } f(α; θ) dα
- composite likelihood from consecutive pairs:
  CL(θ; y_1, ..., y_n) = Π_{t=1}^{n−1} ∫∫ f(y_t | α_t; θ) f(y_{t+1} | α_{t+1}; θ) f(α_t, α_{t+1}; θ) dα_t dα_{t+1}
- time-series asymptotic regime: a single vector y of increasing length
- the composite ML estimator is still consistent and asymptotically normal, with estimable asymptotic variance
- efficient, relative to a Laplace-type approximation
- surprises: AR(1), fully efficient; MA(1), poor; ARFIMA(0, d, 0), OK
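A minimal sketch of this consecutive-pairs composite likelihood, assuming the latent AR(1) is stationary so each pair (α_t, α_{t+1}) is bivariate normal; the bivariate integrals use tensor-product Gauss-Hermite quadrature. Illustrative, not Davis & Yau's implementation.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss
from scipy.special import logsumexp

def pairwise_cl(theta, y, n_quad=20):
    beta, phi, sigma2 = theta
    v = sigma2 / (1 - phi**2)                       # Var(alpha_t)
    Gamma = v * np.array([[1.0, phi], [phi, 1.0]])  # Cov of (alpha_t, alpha_{t+1})
    L = np.linalg.cholesky(Gamma)
    z, w = hermegauss(n_quad)
    Z = np.array([(a, b) for a in z for b in z])    # N(0, I_2) nodes
    log_w = np.log(np.outer(w, w).ravel()) - np.log(2 * np.pi)
    A = Z @ L.T                                     # (alpha_t, alpha_{t+1}) nodes
    cl = 0.0
    for t in range(len(y) - 1):
        eta = beta + A                              # log mu at each node
        log_f = (y[t] * eta[:, 0] - np.exp(eta[:, 0])
                 + y[t + 1] * eta[:, 1] - np.exp(eta[:, 1]))
        cl += logsumexp(log_f + log_w)
    return cl  # up to the constant -sum_t {log y_t! + log y_{t+1}!}

# e.g. pairwise_cl((1.0, 0.8, 0.25), y), with y from the simulation sketch above
```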

Variational methods (Titterington, 2006; Ormerod & Wand, 2010)
- in a Bayesian context, want f(θ | y); use an approximation q(θ) (dependence of q on y suppressed)
- choose q(θ) to be simple to calculate and close to the posterior
- simple to calculate: q(θ) = Π_j q_j(θ_j), or a simple parametric family
- close to the posterior: minimize the Kullback-Leibler divergence between the true posterior and the approximation q
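A minimal sketch of the "close to the posterior" step: for a bivariate normal "posterior" N(0, Σ), the KL-optimal mean-field approximation q(θ) = q_1(θ_1) q_2(θ_2) has normal factors whose precisions are the diagonal entries of Σ^{−1}, which illustrates the well-known variance underestimation of mean-field approximations. The toy target is an assumption of this sketch.

```python
import numpy as np

rho = 0.8
Sigma = np.array([[1.0, rho], [rho, 1.0]])   # "posterior" covariance
Lambda = np.linalg.inv(Sigma)                # posterior precision
q_var = 1.0 / np.diag(Lambda)                # optimal mean-field variances
print(q_var)   # [0.36 0.36] versus the true marginal variances [1. 1.]
```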

... variational methods (Titterington, 2006; Ormerod & Wand, 2010)
- example, GLMM:
  l(β, Σ; y) = log ∫ f(y | u; β) f(u; Σ) du
  = Σ_{i=1}^m ( y_i^T X_i β − (1/2) log|Σ| + log ∫_{R^k} exp{y_i^T Z_i u_i − 1_i^T b(X_i β + Z_i u_i) − (1/2) u_i^T Σ^{−1} u_i} du_i )
- high-dimensional integral
- variational solution, for some choice of q(u): the lower bound
  l(β, Σ; y) ≥ ∫ q(u) log{f(y, u; β, Σ)/q(u)} du
- simple choice of q: N(μ, Λ), with variational parameters μ, Λ

Example: GLMM, variational approximation (Ormerod & Wand, 2012, JCGS)
- lower bound:
  l(β, Σ) ≥ l(β, Σ, μ, Λ) = Σ_{i=1}^m ( y_i^T X_i β − (1/2) log|Σ| )
  + Σ_{i=1}^m E_{u~N(μ_i, Λ_i)}( y_i^T Z_i u − 1_i^T b(X_i β + Z_i u) − (1/2) u^T Σ^{−1} u − log φ_{Λ_i}(u − μ_i) )
- the variational estimate simplifies to k one-dimensional integrals:
  (β̂, Σ̂, μ̂, Λ̂) = arg max_{β, Σ, μ, Λ} l(β, Σ, μ, Λ)
- inference for β, Σ? consistency? asymptotic normality? (Hall, Ormerod & Wand, 2011; Hall et al., 2011)
- emphasis on algorithms and model selection, e.g. Tan & Nott, 2013, 2014
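A minimal sketch (illustrative, not Ormerod & Wand's implementation) of this lower bound for the Poisson random-intercept GLMM (k = 1): with q(u_i) = N(μ_i, λ_i), the Poisson case gives E_q exp(η) = exp(Xβ + μ_i + λ_i/2), so the bound is available in closed form with no quadrature at all. All names and the parameterization are this sketch's own.

```python
import numpy as np
from scipy.optimize import minimize

def neg_lower_bound(params, y, X, m):
    p = X[0].shape[1]
    beta = params[:p]
    s2 = np.exp(params[p])             # Sigma = s2 (scalar variance)
    mu = params[p + 1 : p + 1 + m]     # variational means mu_i
    lam = np.exp(params[p + 1 + m :])  # variational variances lam_i
    lb = 0.0
    for i, (yi, Xi) in enumerate(zip(y, X)):
        eta = Xi @ beta + mu[i]
        lb += (yi @ eta - np.exp(eta + 0.5 * lam[i]).sum()  # E_q log f(y_i | u)
               - 0.5 * (mu[i] ** 2 + lam[i]) / s2           # E_q log f(u; s2)
               - 0.5 * np.log(s2)
               + 0.5 * np.log(lam[i]) + 0.5)                # entropy of q(u_i)
    return -lb  # up to constants; maximize over (beta, log s2, mu, log lam)

# usage sketch, with y and X lists of m cluster responses / design matrices:
# fit = minimize(neg_lower_bound, np.zeros(p + 1 + 2 * m), args=(y, X, m))
```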

Links to composite likelihood?
- VL: approximate L(θ; y) by a simpler function of θ, e.g. Π_j q_j(θ)
- CL: approximate f(y; θ) by a simpler function of y, e.g. Π_j f(y_j; θ)
- S. Robin, 2012: Some links between variational approximation and composite likelihoods?
- Zhang & Schneider, 2012: A composite likelihood view for multi-label classification, JMLR V22
- Grosse, 2015: Scaling up natural gradient by sparsely factorizing the inverse Fisher matrix, ICML

http://carlit.toulouse.inra.fr/aigm/pub/reunion nov2012/mstga-1211-robin.pdf

Some proposed solutions
- simplify the likelihood
  - composite likelihood
  - variational approximation
  - Laplace approximation to integrals
- change the mode of inference
  - quasi-likelihood
  - indirect inference
- simulate
  - approximate Bayesian computation
  - MCMC

Indirect inference
- composite likelihood estimators are consistent (under conditions ...) because
  log CL(θ; y) = Σ_{i=1}^n Σ_{j<j′} log f(y_ij, y_ij′; θ)
  has a derivative w.r.t. θ with expected value 0
- what happens if an estimating equation g(y; θ) is biased?
  g(y_1, ..., y_n; θ̃_n) = 0,  θ̃_n → θ*,  E_θ{g(Y; θ*)} = 0  ⟹  θ* = k(θ)
- if k is invertible, θ = k^{−1}(θ*), giving the new estimator θ̂_n = k^{−1}(θ̃_n)
- k(·) is a bridge function, connecting the wrong value of θ to the right one

Yi & Reid, 2010; Jiang & Turnbull, 2004
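A minimal sketch of a bridge function, using a measurement-error regression as the toy problem (illustrative, not the example in Yi & Reid, 2010): the naive slope estimator solves a biased estimating equation and converges to k(θ) = θ s2x/(s2x + s2u), the classical attenuation; with s2u known, inverting k restores consistency.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, s2x, s2u, n = 2.0, 1.0, 0.5, 100_000
x = rng.normal(0.0, np.sqrt(s2x), n)
w = x + rng.normal(0.0, np.sqrt(s2u), n)     # error-prone covariate
y = theta * x + rng.normal(0.0, 1.0, n)

theta_tilde = (w @ y) / (w @ w)              # solves g = w'(y - w b) = 0
theta_hat = theta_tilde * (s2x + s2u) / s2x  # invert the bridge function k
print(theta_tilde, theta_hat)                # ~1.33 (wrong) vs ~2.0 (right)
```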

... indirect inference (Smith, 2008)
- model of interest: y_t = G_t(y_{t−1}, x_t, ε_t; θ),  θ ∈ R^d
- the likelihood is not computable, but we can simulate from the model
- simple (wrong) model: y_t ~ f*(y_t | y_{t−1}, x_t; θ*),  θ* ∈ R^p
- find the MLE in the simple model: θ̂* = θ̂*(y_1, ..., y_n), say
- use simulated samples from the model of interest to find the best θ: the best θ gives data that reproduce θ̂*

Shalizi, 2013

... indirect inference (Smith, 2008)
- simulate samples y_t^m, m = 1, ..., M, at some value θ
- compute θ̂*(θ) from the simulated data:
  θ̂*(θ) = arg max_{θ*} Σ_m Σ_t log f*(y_t^m | y_{t−1}^m, x_t; θ*)
- choose θ so that θ̂*(θ) is as close as possible to θ̂*
- if p = d, simply invert the bridge function; if p > d, e.g.
  arg min_θ {θ̂*(θ) − θ̂*}^T W {θ̂*(θ) − θ̂*}
- the estimates of θ are consistent and asymptotically normal, but not efficient
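A minimal sketch of this recipe (illustrative, not Smith's implementation): the model of interest is an MA(1), the auxiliary "simple model" is an AR(1) fitted by conditional least squares, and θ is chosen so that the auxiliary estimate from simulated data matches the one from the observed data.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ar1_fit(y):                          # auxiliary estimate (conditional LS)
    return (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])

def simulate_ma1(theta, n, rng):         # y_t = e_t + theta * e_{t-1}
    e = rng.standard_normal(n + 1)
    return e[1:] + theta * e[:-1]

rng = np.random.default_rng(3)
y_obs = simulate_ma1(0.6, 2_000, rng)    # pretend these are the data
ahat_obs = ar1_fit(y_obs)                # auxiliary MLE on observed data

def distance(theta, M=20, n=2_000):
    rng_sim = np.random.default_rng(42)  # common random numbers across theta
    sims = [ar1_fit(simulate_ma1(theta, n, rng_sim)) for _ in range(M)]
    return (np.mean(sims) - ahat_obs) ** 2

res = minimize_scalar(distance, bounds=(-0.95, 0.95), method="bounded")
print(res.x)   # indirect-inference estimate of theta, near 0.6
```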

... non-convex optimization (Grosse, 2015)
- efficient algorithms: Sham Kakade, "Non-convex approaches to learning representations"
- latent variable models (e.g. mixture models, HMMs) are typically optimized with EM, which can get stuck in local optima
- sometimes the model can be fit in closed form using moment matching: consistent, but not statistically optimal
- the solution often corresponds to a matrix or tensor factorization

Approximate Bayesian computation (Marin et al., 2010)
- simulate θ′ from π(θ)
- simulate data z from f(·; θ′)
- if z = y, then θ′ is an observation from the posterior π(· | y)
- in practice: accept if s(z) = s(y) for some set of statistics s(·)
- or if ρ{s(z), s(y)} < ε for some distance function ρ(·) (Fearnhead & Prangle, 2011)
- many variations, using different MCMC methods to select candidate values θ′
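A minimal sketch of the basic ABC rejection sampler, for a normal-mean toy problem; the prior, the summary statistic s(y) = ȳ, and the tolerance ε are illustrative choices of this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(1.0, 1.0, size=50)        # observed data
s_obs = y.mean()                         # summary statistic s(y)
eps, draws = 0.05, []
while len(draws) < 200:
    theta = rng.normal(0.0, 5.0)         # theta' ~ prior pi(theta)
    z = rng.normal(theta, 1.0, size=50)  # z ~ f(. ; theta')
    if abs(z.mean() - s_obs) < eps:      # rho{s(z), s(y)} < eps
        draws.append(theta)
print(np.mean(draws), np.std(draws))     # approximate posterior mean and sd
```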

ABC and indirect inference (Cox & Kartsonaki, 2012)
- both methods need a set of parameter values from which to simulate: θ′ or θ
- both methods need a set of auxiliary functions of the data: s(y) or θ̂*(y)
- in indirect inference, θ̂* is the bridge to the parameters of real interest, θ
- C & K use orthogonal designs based on Hadamard matrices to choose θ, and calculate summary statistics focussed on individual components of θ

Some proposed solutions
- simplify the likelihood
  - composite likelihood
  - variational approximation
  - Laplace approximation to integrals
- change the mode of inference
  - quasi-likelihood
  - indirect inference
- simulate
  - approximate Bayesian computation
  - MCMC

Laplace approximation
- l(θ; y) = log ∫ f(y | u; β) g(u; Σ) du = log ∫ exp{Q(u, y, θ)} du, say, with θ = (β, Σ)
- l_Lap(θ; y) = Q(ũ, y, θ) − (1/2) log |Q″(ũ, y, θ)| + c,
  using a Taylor series expansion of Q(·, y, θ) about its maximizer ũ
- simplification of the Laplace approximation leads to PQL:
  l_PQL(θ, u; y) = log f(y | u; β) − (1/2) u^T Σ^{−1} u,
  to be jointly maximized over u and θ, i.e. β and the parameters in Σ (Breslow & Clayton, 1993)
- PQL can be viewed as linearizing E(y) and then using results for linear mixed models (Molenberghs & Verbeke, 2006)
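A minimal sketch of the Laplace step for one cluster of the Poisson random-intercept GLMM used earlier (k = 1): maximize Q(u) = log f(y_i | u; β) + log φ(u; 0, s2) over u, then expand to second order. The function name and model choice are this sketch's assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def laplace_cluster_loglik(beta, s2, yi, Xi):
    eta0 = Xi @ beta
    def negQ(u):                                     # -(Q up to constants)
        eta = eta0 + u
        return -(yi @ eta - np.exp(eta).sum() - 0.5 * u**2 / s2)
    u_tilde = minimize_scalar(negQ).x                # mode u~ of the integrand
    Q2 = -(np.exp(eta0 + u_tilde).sum() + 1.0 / s2)  # Q''(u~), analytic here
    # log of the integral of exp{Q(u)} du, by Laplace:
    #   Q(u~) + (1/2) log(2 pi) - (1/2) log(-Q''(u~))
    return (-negQ(u_tilde) - 0.5 * np.log(2 * np.pi * s2)  # phi's constant
            + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-Q2))
    # up to the additive constant -sum_j log(y_ij!)
```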

Extensions of Laplace approximations
- expansions valid with p = o(n^{1/3}) (Shun & McCullagh, 1995)
- expansions for mixed linear models to higher order (Raudenbush et al., 2000)
- use REML for variance parameters (Nelder & Lee, 1996)
- integrated nested Laplace approximation (Rue et al., 2009):
  model f(y_i | θ_i); prior π(θ | ϑ) for parameters and hyper-parameters
  posterior: π(θ, ϑ | y) ∝ π(θ | ϑ) π(ϑ) Π f(y_i | θ_i)
  marginal posterior: π(θ_i | y) = ∫ π(θ_i | ϑ, y) π(ϑ | y) dϑ, with each factor computed by a Laplace approximation

Quasi-likelihood
- simplify the model: E(y_i; θ) = μ_i(θ),  Var(y_i; θ) = φ ν_i(θ)
- consistent with generalized linear models; example: over-dispersed Poisson responses
- PQL uses this construction, but with random effects (Molenberghs & Verbeke, Ch. 14)
- why does it work? the score equations are the same as for a real likelihood, hence unbiased, and the derivative of the score function is equal to its variance function; special to GLMs
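A minimal sketch of quasi-likelihood for over-dispersed Poisson regression: only the working moments E(y) = μ = exp(Xβ) and Var(y) = φμ are assumed. The quasi-score X^T(y − μ) = 0 coincides with the Poisson score, so β̂ is the Poisson MLE; φ is estimated from Pearson residuals and inflates the standard errors. The data-generating mechanism below is an illustrative choice.

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(5)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
mu_true = np.exp(X @ np.array([0.5, 0.3]))
y = rng.poisson(mu_true * rng.gamma(0.25, 4.0, n))  # extra-Poisson variation

quasi_score = lambda b: X.T @ (y - np.exp(X @ b))   # unbiased estimating eqn
beta_hat = root(quasi_score, x0=np.zeros(2)).x
mu_hat = np.exp(X @ beta_hat)
phi_hat = np.sum((y - mu_hat) ** 2 / mu_hat) / (n - 2)      # Pearson estimate
cov = phi_hat * np.linalg.inv(X.T @ (mu_hat[:, None] * X))  # ~ Var(beta_hat)
print(beta_hat, phi_hat, np.sqrt(np.diag(cov)))
```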

Some proposed solutions
- simplify the likelihood
  - composite likelihood
  - variational approximation
  - Laplace approximation to integrals
- change the mode of inference
  - quasi-likelihood
  - indirect inference
- simulate
  - approximate Bayesian computation
  - MCMC

References