Approximate Normality, Newton-Raphson, & Multivariate Delta Method


Timothy Hanson
Department of Statistics, University of South Carolina
Stat 740: Statistical Computing

Statistical models come in all shapes and sizes!

STAT 704/705: linear models with normal errors, logistic & Poisson regression
STAT 520/720: time series models, e.g. ARIMA
STAT 771: longitudinal models with fixed and random effects
STAT 770: logistic regression, generalized linear mixed models, r × k tables
STAT 530/730: multivariate models
STAT 521/721: stochastic processes
et cetera, et cetera, et cetera

Examples from your textbook

Censored data (Ex 1.1, p. 2): Z_i = min{X_i, ω}, where X_1, ..., X_n iid f(x | α, β) = α β x^(α−1) exp(−β x^α)

Finite mixture (Ex 1.2 & 1.10, pp. 3 & 11): X_1, ..., X_n iid Σ_{j=1}^k p_j N(µ_j, σ_j²)

Beta data: X_1, ..., X_n iid f(x | α, β) = [Γ(α+β)/(Γ(α)Γ(β))] x^(α−1) (1−x)^(β−1)

Logistic regression (Ex 1.13, p. 15): Y_i ind Bern( exp(β'x_i) / (1 + exp(β'x_i)) )

More: MA(q), normal, gamma, Student t, et cetera

Likelihood

The likelihood is simply the joint distribution of the data, viewed as a function of the model parameters:
L(θ | x) = L(θ_1, ..., θ_k | x_1, ..., x_n) = f(x_1, ..., x_n | θ_1, ..., θ_k).
If the data are independent then L(θ | x) = Π_{i=1}^n f_i(x_i | θ), where f_i(· | θ) is the marginal pdf/pmf of X_i.

Beta data: L(α, β | x) = Π_{i=1}^n [Γ(α+β)/(Γ(α)Γ(β))] x_i^(α−1) (1−x_i)^(β−1)

Logistic regression data: L(β | y) = Π_{i=1}^n ( exp(β'x_i)/(1+exp(β'x_i)) )^{y_i} ( 1/(1+exp(β'x_i)) )^{1−y_i}
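As a concrete illustration (my own sketch, not from the slides), the beta-data log-likelihood is a few lines of R; the data here are simulated and purely hypothetical:

## Beta-data log-likelihood (hypothetical simulated data).
set.seed(1)
x <- rbeta(50, shape1 = 2, shape2 = 5)

loglik_beta <- function(theta, x) {
  alpha <- theta[1]; beta <- theta[2]
  if (alpha <= 0 || beta <= 0) return(-Inf)   # density requires positive parameters
  sum(dbeta(x, shape1 = alpha, shape2 = beta, log = TRUE))
}

loglik_beta(c(2, 5), x)                       # log-likelihood at the true values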

Two main inferential approaches

Section 1.2, maximum likelihood: use the likelihood as-is for information on θ.
Section 1.3, Bayesian: include additional information on θ through a prior.
Method of moments (MOM) and generalized method of moments (GMOM) are simple, direct methods for estimating model parameters that match population moments to sample moments; sometimes easier than MLE, e.g. beta data, gamma data.
Your text introduces the Bayesian approach in Chapter 1; we will first consider large-sample approximations.

First derivative vector and second derivative matrix

The gradient vector (or score vector) of the log-likelihood is the vector of partial derivatives
∇ log L(θ | x) = ( ∂ log L(θ | x)/∂θ_1, ..., ∂ log L(θ | x)/∂θ_k )',
and the Hessian, i.e. the matrix of second partial derivatives, is the k × k matrix ∇² log L(θ | x) with (i, j) entry ∂² log L(θ | x)/∂θ_i ∂θ_j.

Maximum likelihood & asymptotic normality

ˆθ = argmax_{θ∈Θ} L(θ | x) finds the θ that makes the data x as likely as possible.
Under regularity conditions, ˆθ ≈ N_k(θ, V_θ), where V_θ = [−E{∇² log L(θ | X)}]^{−1}.
The normal approximation is historically extremely important; it is the crux of countless approaches and papers.
The regularity conditions have to do with differentiability of L(θ | x) with respect to θ, the support of x_i not depending on θ, invertibility of the Fisher information matrix, and certain expectations being finite. Most models we'll consider satisfy the necessary conditions.

How to maximize the likelihood?

The problem is to maximize L(θ | x), or equivalently log L(θ | x). Any extremum ˆθ satisfies ∇ log L(θ | x) = 0; in one dimension this is ∂L(θ | x)/∂θ = 0, a global max if ∂² log L(θ | x)/∂θ² < 0 for all θ ∈ Θ.
Only in simple problems can we solve this directly. There are various iterative approaches: steepest descent, conjugate gradient, Newton-Raphson, etc.
Iterative approaches work fine for a moderate number of parameters k and/or sample size n, but can break down when the problem gets too big.

Classic examples with closed-form MLEs

X_1, ..., X_n iid Bern(π), equivalently Y = Σ_{i=1}^n X_i ∼ Bin(n, π)
X_i ind Pois(λ t_i), where t_i is an offset or exposure time
X_1, ..., X_n iid N(µ, σ²)
X_1, ..., X_n iid N_p(µ, Σ)

About your textbook

The main focus is simulation-based approaches to obtaining inference; these are more broadly useful than traditional deterministic methods, especially for large data sets and/or complicated models.
We will also discuss non-simulation-based approaches in a bit more detail than your book; they are still used a lot and quite important. See Gentle (2009) for a complete treatment.
Your text has lots of technical points, pathological examples, convergence theory, alternative methods (e.g. profile likelihood), etc. We may briefly touch on these but omit details. Please read your book!
The main focus of the course is to gain experience and develop intuition on numerical methods for obtaining inference, i.e. to build your toolbox.

Deterministic versus stochastic numerical methods

Deterministic methods use a prescribed algorithm to arrive at a solution; given the same starting values, they arrive at the same answer each time.
Stochastic methods introduce randomness into an algorithm; different runs typically produce (slightly) different solutions.
We will first examine a deterministic approach to maximizing the likelihood and obtain inference based on approximate normality.

First derivative vector and second derivative matrix

The gradient vector (or score vector) of the log-likelihood is the vector of partial derivatives
∇ log L(θ | x) = ( ∂ log L(θ | x)/∂θ_1, ..., ∂ log L(θ | x)/∂θ_k )',
and the Hessian, i.e. the matrix of second partial derivatives, is the k × k matrix ∇² log L(θ | x) with (i, j) entry ∂² log L(θ | x)/∂θ_i ∂θ_j.

Maximum likelihood

The likelihood (or score) equations are ∇ log L(θ | x) = 0. The MLE ˆθ satisfies these; however, a solution to these equations need not be the MLE.
ˆθ is a local max if ∇ log L(ˆθ | x) = 0 and ∇² log L(ˆθ | x) is negative semidefinite (has non-positive eigenvalues).
The expected Fisher information is I_n(θ) = E{[∇ log L(θ | X)][∇ log L(θ | X)]'} = −E{∇² log L(θ | X)}.
The observed Fisher information is this quantity without the expectation: J_n(θ) = −∇² log L(θ | x).

Maximum likelihood

In large samples, under regularity conditions,
ˆθ ≈ N_k(θ, I_n(θ)^{−1}) and ˆθ ≈ N_k(θ, J_n(θ)^{−1}).
Just replace the unknown θ by the MLE ˆθ in applications and the approximations still hold:
ˆθ ≈ N_k(θ, I_n(ˆθ)^{−1}) and ˆθ ≈ N_k(θ, J_n(ˆθ)^{−1}).
The observed Fisher information J_n(ˆθ) is easier to compute and recommended by Efron and Hinkley (1978) when using the normal approximation above.
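To make the recipe concrete, here is a minimal R sketch (my own, not from the slides) that maximizes the beta log-likelihood from the earlier sketch with optim; since optim minimizes the negative log-likelihood, the Hessian it returns at the optimum is the observed information J_n(ˆθ):

## MLE and observed-information standard errors via optim
## (assumes x and loglik_beta() from the earlier beta-data sketch).
negloglik_beta <- function(theta, x) -loglik_beta(theta, x)

fit <- optim(c(1, 1), negloglik_beta, x = x, method = "L-BFGS-B",
             lower = c(1e-6, 1e-6), hessian = TRUE)
theta_hat <- fit$par            # MLE (alpha_hat, beta_hat)
J <- fit$hessian                # observed information J_n(theta_hat)
V <- solve(J)                   # approximate covariance J_n(theta_hat)^{-1}
se <- sqrt(diag(V))             # approximate standard errors
cbind(estimate = theta_hat, se = se)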

Multivariate delta method

Once the MLE ˆθ and its approximate covariance J_n(ˆθ)^{−1} are obtained, we may be interested in functions of θ, e.g. the vector g(θ) = (g_1(θ), ..., g_m(θ))'.
The multivariate delta method gives
g(ˆθ) ≈ N_m( g(θ), [∇g(ˆθ)] J_n(ˆθ)^{−1} [∇g(ˆθ)]' ).
This stems from approximate normality and the first-order Taylor approximation
g(ˆθ) ≈ g(θ) + [∇g(θ)](ˆθ − θ).

Multivariate delta method

As before, ∇g(θ) is the m × k Jacobian matrix with (i, j) entry ∂g_i(θ)/∂θ_j:
∇g(θ) = [ ∂g_i(θ)/∂θ_j ], i = 1, ..., m, j = 1, ..., k.
For univariate functions g(θ): R^k → R,
∇g(θ) = ( ∂g(θ)/∂θ_1, ∂g(θ)/∂θ_2, ..., ∂g(θ)/∂θ_k ).
Example: g(µ, σ) = µ/σ, the signal-to-noise ratio.
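As an illustration (my own sketch, not from the slides), the delta method for the signal-to-noise ratio g(µ, σ) = µ/σ from iid normal data takes only a few lines of R; the data are simulated and hypothetical:

## Delta-method interval for the signal-to-noise ratio mu/sigma.
set.seed(2)
y <- rnorm(100, mean = 3, sd = 2)               # hypothetical data

negloglik_norm <- function(theta, y) {          # theta = (mu, sigma)
  -sum(dnorm(y, mean = theta[1], sd = theta[2], log = TRUE))
}
fit <- optim(c(mean(y), sd(y)), negloglik_norm, y = y,
             method = "L-BFGS-B", lower = c(-Inf, 1e-6), hessian = TRUE)

theta_hat <- fit$par
V <- solve(fit$hessian)                          # J_n(theta_hat)^{-1}
g_hat <- theta_hat[1] / theta_hat[2]             # g(theta_hat) = mu/sigma
grad_g <- c(1 / theta_hat[2],                    # dg/dmu
            -theta_hat[1] / theta_hat[2]^2)      # dg/dsigma
se_g <- drop(sqrt(t(grad_g) %*% V %*% grad_g))   # delta-method standard error
c(estimate = g_hat, lower = g_hat - 1.96 * se_g, upper = g_hat + 1.96 * se_g)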

Multivariate delta method: testing

We have g(ˆθ) ≈ N_m( g(θ), [∇g(ˆθ)] J_n(ˆθ)^{−1} [∇g(ˆθ)]' ).
Let g_0 ∈ R^m be a fixed, known vector. An approximate Wald test of H_0: g(θ) = g_0 has test statistic
W = (g(ˆθ) − g_0)' {[∇g(ˆθ)] J_n(ˆθ)^{−1} [∇g(ˆθ)]'}^{−1} (g(ˆθ) − g_0).
In large samples, W ≈ χ²_m under H_0.
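Continuing the signal-to-noise sketch above (again my own hypothetical example), the Wald statistic for H_0: µ/σ = 1 is a one-liner:

## Wald test of H0: mu/sigma = 1, using g_hat, grad_g, V from the sketch above.
g0 <- 1
W  <- (g_hat - g0) %*% solve(t(grad_g) %*% V %*% grad_g) %*% (g_hat - g0)
pval <- pchisq(drop(W), df = 1, lower.tail = FALSE)   # compare to chi^2 with m = 1 df
c(W = drop(W), p.value = pval)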

Simple confidence intervals and tests

Along the diagonal of J_n(ˆθ)^{−1} are the estimated variances of ˆθ_1, ..., ˆθ_k.
An approximate 95% confidence interval for θ_j is ˆθ_j ± 1.96 sqrt( [J_n(ˆθ)^{−1}]_jj ).
We can test H_0: θ_j = t with the test statistic z = (ˆθ_j − t) / sqrt( [J_n(ˆθ)^{−1}]_jj ).
For any function g(θ) we can obtain a CI and test in a similar manner using the result on the previous slide; for example, an approximate 95% CI is
g(ˆθ) ± 1.96 sqrt( [∇g(ˆθ)] J_n(ˆθ)^{−1} [∇g(ˆθ)]' ).

Univariate Newton-Raphson

In general, computing the MLE, i.e. the solution to ∇ log L(θ | x) = 0, is non-trivial except in very simple models. An iterative method for finding the MLE is Newton-Raphson.
We want to find a zero of some univariate function g(·), i.e. an x such that g(x) = 0. A first-order Taylor approximation gives
g(x_1) ≈ g(x_0) + g'(x_0)(x_1 − x_0).
Setting this to zero and solving gives x_1 = x_0 − g(x_0)/g'(x_0); the iterative version is
x_{j+1} = x_j − g(x_j)/g'(x_j).
I will show on the board how this works in practice.
For univariate θ, solve (∂/∂θ) log L(θ | x) = 0 via
θ_{j+1} = θ_j − [∂ log L(θ_j | x)/∂θ] / [∂² log L(θ_j | x)/∂θ²].
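A generic univariate Newton-Raphson root-finder takes only a few lines of R; this is my own minimal sketch, and the function and argument names are hypothetical:

## Generic univariate Newton-Raphson: find x with g(x) = 0.
newton <- function(g, gprime, x0, tol = 1e-8, maxit = 100) {
  x <- x0
  for (j in seq_len(maxit)) {
    step <- g(x) / gprime(x)        # Newton step g(x_j) / g'(x_j)
    x <- x - step
    if (abs(step) < tol) break      # stop when the update is tiny
  }
  x
}

## Example: the square root of 2 as the zero of g(x) = x^2 - 2.
newton(function(x) x^2 - 2, function(x) 2 * x, x0 = 1)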

R example

Bernoulli data X_1, ..., X_n iid Bern(π), which is the same as Y = Σ_{i=1}^n X_i ∼ Bin(n, π).
MLE ˆπ directly.
MLE via Newton-Raphson.
Starting values? Multiple local maxima?
Example 1.9 (p. 10); Example 1.17 (p. 19).
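One way the Bernoulli example might be coded (my own sketch with simulated data): the closed-form MLE is ˆπ = y/n, and Newton-Raphson applied to the score ∂ log L/∂π = y/π − (n − y)/(1 − π) converges to the same answer.

## Bernoulli/binomial MLE: closed form vs. Newton-Raphson (simulated data).
set.seed(3)
n <- 50
y <- sum(rbinom(n, size = 1, prob = 0.3))   # hypothetical Bernoulli data

y / n                                       # closed-form MLE pi_hat

score <- function(p) y / p - (n - y) / (1 - p)        # d/dp log L
hess  <- function(p) -y / p^2 - (n - y) / (1 - p)^2   # d^2/dp^2 log L

p <- 0.5                                    # starting value
for (j in 1:20) p <- p - score(p) / hess(p) # Newton-Raphson updates
p                                           # agrees with y/n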

Multivariate Newton-Raphson

The multivariate version applied to ∇ log L(θ | x) = 0 is
θ_{j+1} = θ_j − [∇² log L(θ_j | x)]^{−1} [∇ log L(θ_j | x)].
Taking inverses is a terribly inefficient way to solve a linear system of equations; instead one solves
[∇² log L(θ_j | x)] θ_{j+1} = [∇² log L(θ_j | x)] θ_j − ∇ log L(θ_j | x)
for θ_{j+1} through either a direct decomposition (Cholesky) or an iterative (conjugate gradient) method.
Stop when ||θ_{j+1} − θ_j|| < ε for some norm and tolerance ε, e.g.
||θ||_1 = Σ_{i=1}^k |θ_i|, ||θ||_2 = sqrt( Σ_{i=1}^k θ_i² ), or ||θ||_∞ = max_{i=1,...,k} |θ_i|.
This is an absolute criterion; one can also use a relative one.
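Here is a minimal multivariate Newton-Raphson sketch for the logistic regression likelihood (my own code with simulated data; the result can be checked against glm):

## Multivariate Newton-Raphson for logistic regression (simulated data).
set.seed(4)
n <- 200
X <- cbind(1, rnorm(n))                       # design matrix with intercept
beta_true <- c(-0.5, 1)
y <- rbinom(n, 1, plogis(drop(X %*% beta_true)))

beta <- c(0, 0)                               # starting values
for (j in 1:25) {
  p <- plogis(drop(X %*% beta))               # fitted probabilities
  score <- crossprod(X, y - p)                # gradient of the log-likelihood
  H <- -crossprod(X, X * (p * (1 - p)))       # Hessian of the log-likelihood
  step <- solve(H, score)                     # solve H %*% step = score (no explicit inverse)
  beta <- beta - drop(step)
  if (max(abs(step)) < 1e-10) break           # absolute convergence criterion
}
beta
coef(glm(y ~ X[, 2], family = binomial))      # agrees with glm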

Notes on Newton-Raphson

Computing the Hessian can be computationally demanding; finite-difference approximations can reduce this burden.
Steepest descent only uses the gradient ∇ log L(θ | x), ignoring how the gradient is changing (the Hessian); it may be easier to program.
Quasi-Newton methods use approximations that satisfy the "secant condition"; an approximation to the Hessian is built up as the algorithm progresses using low-rank updates. A popular version is Broyden-Fletcher-Goldfarb-Shanno (BFGS).
Details are beyond the scope of this course (see Sec. 6.2 in Gentle, 2009).

Useful R functions

optim in R is a powerful optimization function that performs Nelder-Mead (only function values log L(θ | x) are used!), conjugate gradient, BFGS (including constrained), and simulated annealing. The ucminf package also has the stable function ucminf for optimization and uses the same syntax as optim. The Hessian at the maximum is provided for most approaches.
Maple and Mathematica (http://www.wolframalpha.com/calculators/derivativecalculator/) can symbolically differentiate functions; so can R! deriv is built-in and Deriv is in the Deriv package.
jacobian and hessian give numerical estimates from the numDeriv package: the first gives the gradient, the second the matrix of second partial derivatives, both evaluated at a vector.
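A small sketch of these tools (my own example, reusing x and loglik_beta from the earlier beta-data sketch): numerical derivatives with numDeriv and symbolic differentiation with the built-in deriv.

## Numerical gradient and Hessian of the beta log-likelihood at (2, 5).
library(numDeriv)
f <- function(theta) loglik_beta(theta, x)   # fix the data in a closure
numDeriv::grad(f, c(2, 5))                   # score vector at (2, 5)
numDeriv::hessian(f, c(2, 5))                # matrix of second partials at (2, 5)

## Built-in symbolic differentiation of a simple expression.
d <- deriv(~ z^2 * exp(-b * z), c("z", "b"))
z <- 1; b <- 2
eval(d)                                      # value plus a "gradient" attribute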

Censored data likelihood

Consider right-censored data. Event times are independent of censoring times:
T_1, ..., T_n iid f(t),  C_1, ..., C_n iid g(t).
We observe Y_i = min{T_i, C_i} and δ_i = I{T_i ≤ C_i}; δ_i = 1 if the ith observation is uncensored.

Censored data likelihood

It is not obvious, but the joint distribution of {(Y_i, δ_i)}_{i=1}^n is
Π_{i=1}^n {f(y_i)[1 − G(y_i)]}^{δ_i} {g(y_i)[1 − F(y_i)]}^{1−δ_i}.
If we have time later we will derive this formally.
For now, note that if f(·) and g(·) do not share any parameters, and f(·) has parameters θ, then
L(θ | y, δ) ∝ Π_{i=1}^n f(y_i | θ)^{δ_i} [1 − F(y_i | θ)]^{1−δ_i}.
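A censored-data log-likelihood of this form is easy to code directly; here is a minimal sketch for gamma event times (my own simulated data, not the textbook's censored gamma example):

## Censored gamma log-likelihood: delta_i = 1 means uncensored.
set.seed(5)
n <- 100
t_event  <- rgamma(n, shape = 2, rate = 0.5)    # hypothetical event times
t_censor <- rexp(n, rate = 0.2)                 # hypothetical censoring times
y     <- pmin(t_event, t_censor)
delta <- as.numeric(t_event <= t_censor)

loglik_cgamma <- function(theta, y, delta) {
  a <- theta[1]; b <- theta[2]                  # shape and rate
  if (a <= 0 || b <= 0) return(-Inf)
  sum(delta * dgamma(y, a, rate = b, log = TRUE) +
      (1 - delta) * pgamma(y, a, rate = b, lower.tail = FALSE, log.p = TRUE))
}

fit <- optim(c(1, 1), function(th) -loglik_cgamma(th, y, delta),
             method = "L-BFGS-B", lower = c(1e-6, 1e-6), hessian = TRUE)
fit$par                                         # MLE of (shape, rate)
sqrt(diag(solve(fit$hessian)))                  # approximate standard errors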

Multivariate Newton-Raphson & delta method examples

Censored gamma data (p. 24)
Logistic regression data (p. 15)
Note: Tim has used optim many, many times for fitting models and/or obtaining starting values and initial covariance matrices for random-walk Markov chain Monte Carlo (later). It is usually not necessary to program your own iterative optimization method, but it might come up.

Crude starting values for logistic regression

Let's find a first-order Taylor approximation to the logistic function. Let g(x) = e^x/(1 + e^x); then g'(x) = e^x/(1 + e^x)², and g(x) ≈ g(0) + g'(0)(x − 0), i.e.
e^x/(1 + e^x) ≈ 1/2 + (1/4)x, so e^{x_i'θ}/(1 + e^{x_i'θ}) ≈ 1/2 + (1/4) x_i'θ.
Therefore we can fit the usual least squares Ê(Y_i) = ˆθ_1 + ˆθ_2 t_i and take as an initial guess θ_1 = 4(ˆθ_1 − 1/2), θ_2 = 4ˆθ_2.
Taylor's theorem is very useful!
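In code this trick is one call to lm; a minimal sketch with simulated data (my own, not the course data):

## Crude starting values for logistic regression via least squares (simulated data).
set.seed(6)
t_i <- runif(100, -2, 2)                       # hypothetical covariate
y   <- rbinom(100, 1, plogis(-0.5 + 1.2 * t_i))
ls_fit <- lm(y ~ t_i)                          # ordinary least squares on the 0/1 response
theta_start <- c(4 * (coef(ls_fit)[1] - 1/2),  # theta_1 = 4 * (theta_hat_1 - 1/2)
                 4 * coef(ls_fit)[2])          # theta_2 = 4 * theta_hat_2
theta_start
coef(glm(y ~ t_i, family = binomial))          # compare to the actual MLE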

The Bayesian approach

The Bayesian approach (1) considers θ to be random, and (2) augments the model with additional information about θ called the prior π(θ).
Bayes' rule gives the posterior distribution
π(θ | x) = f(x_1, ..., x_n | θ) π(θ) / f(x_1, ..., x_n).
Here,
f(x_1, ..., x_n) = ∫_Θ f(x_1, ..., x_n | θ) π(θ) dθ,
where f(x_1, ..., x_n | θ) = L(θ | x), is the marginal density of x, integrating over the prior through the model.

Normalizing constant of the posterior density, f(x)

f(x) = f(x_1, ..., x_n) is often practically impossible to compute, but thankfully most information for θ is carried by the shape of the density with respect to θ:
π(θ | x) ∝ f(x_1, ..., x_n | θ) π(θ), where f(x | θ) = L(θ | x).
f(x) is an ingredient in a Bayes factor for comparing two models.
Instead of using the likelihood for inference, we use the posterior density, which is proportional to the likelihood times the prior:
π(θ | x) ∝ L(θ | x) π(θ).
Note that if π(θ) is constant then we obtain the likelihood back!
There is more similar than different between Bayesian and likelihood-based inference: both specify a probability model and use L(θ | x); Bayes just adds π(θ).

Posterior inference

A common estimate of θ is the posterior mean
θ̄ = ∫_Θ θ π(θ | x) dθ.
This is the Bayes estimate with respect to squared error loss (p. 13).
For θ = (θ_1, ..., θ_k), the estimate of θ_j with respect to absolute loss is the posterior median θ̃_j, satisfying
∫_{−∞}^{θ̃_j} π(θ_j | x) dθ_j = 0.5.
A 95% credible interval (L, U) satisfies P(L ≤ θ_j ≤ U | x) = 0.95:
∫_L^U π(θ_j | x) dθ_j = 0.95.

Functions of parameters θ

We are often interested in functions of the model parameters, g(θ), in which case the posterior mean is
ḡ(θ) = ∫_Θ g(θ) π(θ | x) dθ,
and the posterior median satisfies... something difficult to even write down! We need to derive the distribution of g(θ) from π(θ | x).
Note that the distribution of g(θ) can be derived if θ | x is multivariate normal, or approximately multivariate normal via the multivariate delta method.

Bayesian inference was at a bottleneck for years

Parameter estimates, as well as parameter intervals, are impossible to compute directly except for very simple problems! We need ways to approximate integrals and quantiles to move forward.
One broadly useful approach is Markov chain Monte Carlo, which requires simulating random variables from different distributions, depending on the model and approach.
Simulation is also necessary to determine the frequentist operating characteristics (MSE, bias, coverage probability) of estimators under known conditions; many statistical papers have a simulations section!
But first we'll consider a Bayesian normal approximation.

Gradient and Hessian of the log posterior

log π(θ | x) = log{L(θ | x) π(θ)} − log f(x).
The last term is not a function of θ, so
∇ log π(θ | x) = ∇ log{L(θ | x) π(θ)}, with jth element ∂ log{L(θ | x) π(θ)}/∂θ_j = ∂ log L(θ | x)/∂θ_j + ∂ log π(θ)/∂θ_j,
and
∇² log π(θ | x) = ∇² log{L(θ | x) π(θ)}, the k × k matrix with (i, j) entry ∂² log{L(θ | x) π(θ)}/∂θ_i ∂θ_j.

Gradient and Hessian of the log posterior

∇ log π(θ | x) = ∇ log L(θ | x) + ∇ log π(θ), and
∇² log π(θ | x) = ∇² log L(θ | x) + ∇² log π(θ).
Note that the first part is a function of the data x and n; the second is not.

Normal approximation to the posterior

The normal approximation for MLEs works almost exactly the same for posterior distributions!
Let ˆθ = argmax_{θ∈Θ} π(θ | x) be the posterior mode (found via Newton-Raphson or by other means). Then
θ | x ≈ N_k( ˆθ, [−∇² log π(ˆθ | x)]^{−1} ).
Here ˆθ is the posterior mode, but it is also the (approximate) posterior mean and (componentwise) gives the posterior medians.

Poisson example

Let X_i ind Pois(λ t_i), where t_i is exposure time. Take the prior λ ∼ Γ(α, β), where α and β are given. Then
L(λ | x) = Π_{i=1}^n e^{−λ t_i} (λ t_i)^{x_i} / x_i! ∝ e^{−λ Σ_{i=1}^n t_i} λ^{Σ_{i=1}^n x_i},
and π(λ) ∝ λ^{α−1} e^{−βλ}, leading to
λ | x ∼ Γ( α + Σ_{i=1}^n x_i, β + Σ_{i=1}^n t_i ).
Here, using a conjugate prior, we have an exact, closed-form solution. Exact Bayesian solutions are very, very rare; however, this allows us to compare the exact posterior density to the normal approximation for real data.
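The comparison is easy to carry out; here is a minimal R sketch with hypothetical counts, exposure times, and a hypothetical Γ(2, 10) prior (not Garnett's data, which come next). The log posterior is (a − 1) log λ − bλ up to a constant, so the mode is (a − 1)/b and the curvature there gives an approximate variance of (a − 1)/b².

## Exact Gamma posterior vs. normal approximation at the posterior mode.
counts <- c(2, 0, 1, 3, 1)       # hypothetical counts
days   <- c(8, 5, 10, 12, 6)     # hypothetical exposure times (days)
alpha <- 2; beta <- 10           # hypothetical Gamma(shape, rate) prior

a <- alpha + sum(counts)         # posterior shape
b <- beta + sum(days)            # posterior rate

lambda_mode <- (a - 1) / b                       # posterior mode
sd_approx   <- sqrt((a - 1) / b^2)               # from -d^2/dlambda^2 of the log posterior

curve(dgamma(x, a, rate = b), from = 0, to = 0.5,
      xlab = expression(lambda), ylab = "density")           # exact posterior
curve(dnorm(x, lambda_mode, sd_approx), add = TRUE, lty = 2) # normal approximation
legend("topright", c("exact", "normal approx."), lty = 1:2)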

Ache monkey hunting

Garnett McMillan spent a year with the Ache tribe in Paraguay. While living with them he ate the same foods they did, including armadillo, honey, toucans, anteaters, snakes, lizards, giant grubs, and capuchin monkeys (including their brains). Garnett says "Lizard eggs were a real treat."
Garnett would go on extended hunting trips in the rain forest with the hunters; we consider data from n = 11 Ache hunters aged 50-59 years followed over several jungle hunting treks. X_i is the number of capuchin monkeys killed over t_i days.
We'll assume X_i ind Pois(λ t_i), i = 1, ..., 11.
Garnett's prior for the average number of monkeys killed over 10 days is 1, with a 95% interval of 0.25 to 4.
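One way (my own sketch, not necessarily what was done in class) to turn this elicitation into a Γ(α, β) prior for the daily rate λ is to match the prior median of 10λ to 1 and its 97.5th percentile to 4, solving numerically:

## Match a Gamma(alpha, beta) prior for lambda to the elicited summaries.
elicit <- function(par) {
  alpha <- exp(par[1]); beta <- exp(par[2])         # keep both parameters positive
  med   <- 10 * qgamma(0.50,  alpha, rate = beta)   # prior median of 10*lambda
  upper <- 10 * qgamma(0.975, alpha, rate = beta)   # prior 97.5th percentile of 10*lambda
  (med - 1)^2 + (upper - 4)^2                       # squared mismatch to the targets 1 and 4
}
sol <- optim(c(0, 0), elicit)                       # Nelder-Mead on the log scale
exp(sol$par)                                        # candidate (alpha, beta)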

Normal approximation to the Challenger data

Recall the logistic regression likelihood
L(β | y) = Π_{i=1}^n ( exp(β'x_i)/(1+exp(β'x_i)) )^{y_i} ( 1/(1+exp(β'x_i)) )^{1−y_i}.
Hanson, Branscum, and Johnson (2014) develop the prior
β ∼ N_2( (0, 0)', [ 168.75  −2.402 ; −2.402  0.03453 ] )
for the Challenger data.

What's next?

Normal approximations are fairly easy to compute and wildly useful! Newton-Raphson is an approach to finding an MLE or posterior mode; the normal approximation also makes use of the observed Fisher information matrix. The multivariate delta method allows us to obtain inference for functions g(θ).
However, normal approximations break down with large numbers of parameters in θ and/or complex models. They are also too crude for small sample sizes and do not work for some models.
We need an all-purpose, widely applicable approach to obtaining posterior inference for Bayesian models; our main tool will be Markov chain Monte Carlo (MCMC).
To understand MCMC we need to first discuss how to generate random variables, then consider Monte Carlo approximations to means and quantiles. These are our next topics.