Stat 451 Lecture Notes Numerical Integration


Stat 451 Lecture Notes 03: Numerical Integration
Ryan Martin, UIC (www.math.uic.edu/~rgmartin)
Based on Chapter 5 in Givens & Hoeting, and Chapters 4 & 18 of Lange.
Updated: February 11, 2016.

Outline
1. Introduction
2. Newton-Cotes quadrature
3. Gaussian quadrature
4. Laplace approximation
5. Conclusion

Motivation
While many statistics problems rely on optimization, there are also some that require numerical integration. Bayesian statistics is almost exclusively integration:
- the data admit a likelihood function L(θ);
- θ is unknown, so assign it a weight (prior) function π(θ);
- combine prior and data using Bayes's formula
$$\pi(\theta \mid x) = \frac{L(\theta)\,\pi(\theta)}{\int_\Theta L(\theta')\,\pi(\theta')\,d\theta'}.$$
We need to compute probabilities and expectations, i.e., integrals! Some non-Bayesian problems also involve integration, e.g., random- or mixed-effects models. There are other approaches besides Bayesian and frequentist...

Intuition
There are a number of classical numerical integration techniques, simple and powerful. Think back to calculus class, where the integral was defined:
- approximate the function by a constant on small intervals;
- compute the areas of the rectangles and sum them up;
- the integral is defined as the limit of this sum as the mesh goes to 0.
Numerical integration, or quadrature, is based on this definition and refinements thereof. Basic principle: approximate the function on a small interval by a nice one that you know how to integrate. (This is essentially the same principle that motivated the various methods we discussed for optimization!) It works well for one- or two-dimensional integrals; for higher-dimensional integrals, other tools are needed.

Notation
Suppose that f(x) is a function that we'd like to integrate over an interval [a, b]. Take n relatively large and set h = (b − a)/n. Let x_i = a + ih, i = 0, 1, ..., n, so that x_0 = a and x_n = b. Key point: if f(x) is nice, then it can be approximated by a simple function on the small interval [x_i, x_{i+1}]. A general strategy is to approximate the integral by
$$\int_a^b f(x)\,dx = \sum_{i=0}^{n-1} \int_{x_i}^{x_{i+1}} f(x)\,dx \approx \sum_{i=0}^{n-1} \sum_{j=0}^{m} A_{ij}\, f(x_{ij}),$$
for appropriately chosen A_ij's and m.

Polynomial approximation
Consider the following sequence of polynomials:
$$p_{ij}(x) = \prod_{k \ne j} \frac{x - x_{ik}}{x_{ij} - x_{ik}}, \qquad j = 0, \ldots, m.$$
Then
$$p_i(x) = \sum_{j=0}^{m} p_{ij}(x)\, f(x_{ij})$$
is an mth-degree polynomial that interpolates f(x) at the nodes x_{i0}, ..., x_{im}. Furthermore,
$$\int_{x_i}^{x_{i+1}} f(x)\,dx \approx \int_{x_i}^{x_{i+1}} p_i(x)\,dx = \sum_{j=0}^{m} \underbrace{\Bigl[\int_{x_i}^{x_{i+1}} p_{ij}(x)\,dx\Bigr]}_{A_{ij}}\, f(x_{ij}).$$
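
As a concrete check of this construction, the weights A_ij can equivalently be found by requiring the rule to integrate 1, x, ..., x^m exactly. Below is a minimal R sketch (my own illustration, not the course code), on the interval rescaled to [0, 1]; on [x_i, x_{i+1}] the weights are these values multiplied by h.

# Newton-Cotes weights for m + 1 equally spaced nodes on [0, 1], found by
# requiring exactness for the monomials 1, x, ..., x^m (assumes m >= 1)
nc_weights <- function(m) {
  nodes <- seq(0, 1, length.out = m + 1)
  V <- outer(0:m, nodes, function(k, x) x^k)   # V[k+1, j] = nodes[j]^k
  moments <- 1 / (1:(m + 1))                   # integral of x^k over [0, 1]
  solve(V, moments)
}

nc_weights(1)   # 0.5, 0.5       -- trapezoid rule
nc_weights(2)   # 1/6, 4/6, 1/6  -- Simpson's rule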

Riemann rule: m = 0
Approximate f(x) on [x_i, x_{i+1}] by a constant. Here x_{i0} = x_i and p_{i0}(x) ≡ 1, so
$$\int_a^b f(x)\,dx \approx \sum_{i=0}^{n-1} f(x_i)(x_{i+1} - x_i) = h \sum_{i=0}^{n-1} f(x_i).$$
Features of Riemann's rule:
- Very easy to program: only need f(x_0), ..., f(x_n).
- Can be slow to converge, i.e., lots of x_i's may be needed to get a good approximation.
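
A minimal R sketch of the Riemann rule (the function name and test integrand are my own, not from the course code):

# Riemann rule: integrate f over [a, b] using n subintervals and the
# value of f at the left endpoint of each (f should be vectorized)
riemann <- function(f, a, b, n) {
  h <- (b - a) / n
  x <- a + (0:(n - 1)) * h   # left endpoints x_0, ..., x_{n-1}
  h * sum(f(x))
}

riemann(exp, 0, 1, n = 1e4)   # about 1.71820; exact value is exp(1) - 1 = 1.71828...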

Trapezoid rule: m = 1
Approximate f(x) on [x_i, x_{i+1}] by a linear function. In this case:
- x_{i0} = x_i and x_{i1} = x_{i+1};
- A_{i0} = A_{i1} = (x_{i+1} − x_i)/2 = h/2.
Therefore,
$$\int_a^b f(x)\,dx \approx \frac{h}{2} \sum_{i=0}^{n-1} \bigl\{ f(x_i) + f(x_{i+1}) \bigr\}.$$
Still only requires function evaluations at the x_i's. More accurate than Riemann because the linear approximation is more flexible than a constant. Can derive bounds on the approximation error...
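
A matching R sketch of the trapezoid rule (again my own names, not the course code):

# Trapezoid rule on [a, b] with n subintervals: interior points get
# weight h, the two endpoints get weight h/2
trapezoid <- function(f, a, b, n) {
  h <- (b - a) / n
  x <- a + (0:n) * h            # grid x_0, ..., x_n
  fx <- f(x)
  h * (sum(fx) - (fx[1] + fx[n + 1]) / 2)
}

trapezoid(exp, 0, 1, n = 100)   # about 1.718296; error is O(h^2)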

Trapezoid rule (cont.)
A general tool which we can use to study the precision of the trapezoid rule is the Euler-Maclaurin formula. Suppose that g is twice differentiable; then
$$\sum_{t=0}^{n} g(t) \approx \int_0^n g(t)\,dt + \tfrac{1}{2}\{g(0) + g(n)\} + C_1\, g'(t)\Big|_0^n,$$
where
$$|\mathrm{LHS} - \mathrm{RHS}| \le C_2 \int_0^n |g''(t)|\,dt.$$
How does this help? The trapezoid rule is
$$T(h) := h\bigl\{\tfrac{1}{2} g(0) + g(1) + \cdots + g(n-1) + \tfrac{1}{2} g(n)\bigr\}, \quad \text{where } g(t) = f(a + ht).$$

Trapezoid rule (cont.)
Apply Euler-Maclaurin to T(h):
$$T(h) = h \sum_{t=0}^{n} g(t) - \frac{h}{2}\{g(0) + g(n)\} \approx h \int_0^n g(t)\,dt + h\, C_1\, g'(t)\Big|_0^n = \int_a^b f(x)\,dx + h\, C_1 \{h f'(b) - h f'(a)\}.$$
Therefore,
$$T(h) - \int_a^b f(x)\,dx = O(h^2), \qquad h \to 0.$$

Trapezoid rule (cont.)
Can the trapezoid error O(h^2) be improved? Our derivation above is not quite precise; the next smallest term in the expansion is O(h^4). Romberg recognized that a manipulation of T(h) will cancel the O(h^2) term, leaving only the O(h^4) term! Romberg's rule is
$$\frac{4\,T(h/2) - T(h)}{3} = \int_a^b f(x)\,dx + O(h^4), \qquad h \to 0.$$
Can be iterated to improve further; see Sec. 5.2 in G&H.
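
A one-step Romberg sketch built on the trapezoid function above (my own illustration):

# One Romberg step: combine T(h) and T(h/2) to cancel the O(h^2) error term
romberg_step <- function(f, a, b, n) {
  T_h  <- trapezoid(f, a, b, n)       # step size h = (b - a)/n
  T_h2 <- trapezoid(f, a, b, 2 * n)   # step size h/2
  (4 * T_h2 - T_h) / 3
}

romberg_step(exp, 0, 1, n = 10)   # error around 6e-8, versus about 1.4e-3 for trapezoid alone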

Simpson rule: m = 2
Approximate f(x) on [x_i, x_{i+1}] by a quadratic function. Similar arguments as above give the x_ij's and A_ij's. Simpson's rule approximation is
$$\int_a^b f(x)\,dx \approx \frac{h}{6} \sum_{i=0}^{n-1} \Bigl\{ f(x_i) + 4 f\Bigl(\frac{x_i + x_{i+1}}{2}\Bigr) + f(x_{i+1}) \Bigr\}.$$
More accurate than the trapezoid rule: the error is O(n^{-4}). If n is taken to be even, then the formula simplifies a bit; see Equation (5.20) in G&H and my R code.
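
An R sketch of Simpson's rule in exactly the form above, with each subinterval using its endpoints and midpoint (my own code, not the G&H version referenced on the slide):

# Simpson's rule on [a, b] with n subintervals; f should be vectorized
simpson <- function(f, a, b, n) {
  h <- (b - a) / n
  x <- a + (0:n) * h
  left  <- x[-(n + 1)]          # x_0, ..., x_{n-1}
  right <- x[-1]                # x_1, ..., x_n
  (h / 6) * sum(f(left) + 4 * f((left + right) / 2) + f(right))
}

simpson(exp, 0, 1, n = 10)   # error is about 6e-8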

Remarks
- This approach works for generic m, and the approximation improves as m increases.
- It can be extended to functions of more than one variable, but the details get complicated very quickly.
- In R, integrate does one-dimensional integration.
- Numerical methods and the corresponding software work very well, but care is still needed; see Section 5.4 in G&H.
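
For reference, a quick example of calling R's built-in integrate (the integrand is just an illustration):

# Adaptive quadrature in base R; returns an estimate plus an error bound
out <- integrate(function(x) exp(-x^2), lower = 0, upper = Inf)
out$value       # about 0.8862269, i.e., sqrt(pi)/2
out$abs.error   # estimated absolute error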

Example: Bayesian analysis of binomial
Suppose X ~ Bin(n, θ) with n known and θ unknown. The prior for θ is the so-called semicircle distribution with density
$$\pi(\theta) = \frac{8}{\pi}\bigl\{\tfrac{1}{4} - (\theta - \tfrac{1}{2})^2\bigr\}^{1/2}, \qquad \theta \in [0, 1].$$
The posterior density is then
$$\pi_x(\theta) = \frac{\theta^x (1-\theta)^{n-x} \bigl\{\tfrac{1}{4} - (\theta - \tfrac{1}{2})^2\bigr\}^{1/2}}{\int_0^1 u^x (1-u)^{n-x} \bigl\{\tfrac{1}{4} - (u - \tfrac{1}{2})^2\bigr\}^{1/2}\,du}.$$
Calculating the Bayes estimate of θ, the posterior mean, requires a numerical integration.
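
A sketch of this computation with integrate; the data values n = 20 and x = 14 are made up for illustration:

# Posterior mean of theta under the semicircle prior, by numerical integration
n <- 20; x <- 14
prior  <- function(th) (8 / pi) * sqrt(pmax(0.25 - (th - 0.5)^2, 0))
unnorm <- function(th) th^x * (1 - th)^(n - x) * prior(th)   # unnormalized posterior

num   <- integrate(function(t) t * unnorm(t), 0, 1)$value
denom <- integrate(unnorm, 0, 1)$value
num / denom   # posterior mean; compare with the MLE x/n = 0.7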

Example: mixture densities
Mixture distributions are very common, flexible models, useful for density estimation and heavy-tailed modeling. The general mixture model looks like
$$p(y) = \int k(y \mid x)\, f(x)\,dx,$$
where
- the kernel k(y | x) is a pdf (or pmf) in y for each x;
- f(x) is a pdf (or pmf).
Easy to check that p(y) is a pdf (or pmf, depending on k). Evaluation of p(y) requires an integration for each y.
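
As an illustration (my own, not from the notes), here is a normal scale mixture evaluated by quadrature; with a Gamma(a, a) mixing density on the precision, the marginal p(y) is a Student-t density with 2a degrees of freedom:

# p(y) = integral of N(y | 0, 1/x) * Gamma(x | shape = a, rate = a) dx
p_y <- function(y, a = 2) {
  sapply(y, function(yy) {
    integrate(function(x) dnorm(yy, 0, 1 / sqrt(x)) * dgamma(x, a, a), 0, Inf)$value
  })
}

p_y(c(0, 1, 3))          # heavier tails than the N(0, 1) kernel alone
dt(c(0, 1, 3), df = 4)   # matches the t_4 density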

Example 5.1 in G&H
Generalized linear mixed model:
$$Y_{ij} \overset{\text{ind}}{\sim} \mathrm{Pois}(\lambda_{ij}), \qquad \lambda_{ij} = e^{\gamma_i} e^{\beta_0 + \beta_1 j}, \qquad i = 1, \ldots, n, \; j = 1, \ldots, J,$$
where γ_1, ..., γ_n are iid N(0, σ_γ^2). Model parameters are (β_0, β_1, σ_γ^2). The marginal likelihood for θ = (β_0, β_1, σ_γ^2) is
$$L(\theta) = \prod_{i=1}^{n} \int \prod_{j=1}^{J} \mathrm{Pois}\bigl(Y_{ij} \mid e^{\gamma_i} e^{\beta_0 + \beta_1 j}\bigr)\, N(\gamma_i \mid 0, \sigma_\gamma^2)\,d\gamma_i.$$
Goal is to maximize over θ...

Example 5.1 in G&H (cont.)
Taking the log, we get
$$\ell(\theta) = \sum_{i=1}^{n} \log \underbrace{\int \prod_{j=1}^{J} \mathrm{Pois}\bigl(Y_{ij} \mid e^{\gamma_i} e^{\beta_0 + \beta_1 j}\bigr)\, N(\gamma_i \mid 0, \sigma_\gamma^2)\,d\gamma_i}_{L_i(\theta)}.$$
G&H consider evaluating
$$\frac{\partial}{\partial \beta_1} L_1(\theta) = \int \Bigl[\sum_{j=1}^{J} j\bigl(y_{1j} - e^{\gamma_1} e^{\beta_0 + \beta_1 j}\bigr)\Bigr] \Bigl[\prod_{j=1}^{J} \mathrm{Pois}\bigl(Y_{1j} \mid e^{\gamma_1} e^{\beta_0 + \beta_1 j}\bigr)\Bigr] N(\gamma_1 \mid 0, \sigma_\gamma^2)\,d\gamma_1,$$
or
$$L_1(\theta) = \int \Bigl[\prod_{j=1}^{J} \mathrm{Pois}\bigl(Y_{1j} \mid e^{\gamma_1} e^{\beta_0 + \beta_1 j}\bigr)\Bigr] N(\gamma_1 \mid 0, \sigma_\gamma^2)\,d\gamma_1.$$
Reproduce Tables 5.2-5.4 using R codes.
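
A sketch of evaluating L_1(θ) by one-dimensional quadrature; the response vector and parameter values below are made up, not the data of G&H Example 5.1:

# L_1(theta) for the Poisson GLMM: integrate the conditional likelihood of
# subject 1 against the N(0, sigma^2) random-effect density
y1 <- c(2, 4, 3, 5, 6); J <- length(y1)
L1 <- function(beta0, beta1, sg) {
  integrand <- function(g) sapply(g, function(gi) {
    lambda <- exp(gi + beta0 + beta1 * (1:J))
    prod(dpois(y1, lambda)) * dnorm(gi, 0, sg)
  })
  integrate(integrand, -Inf, Inf)$value
}

L1(beta0 = 0.5, beta1 = 0.1, sg = 1)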

Very brief summary
Gaussian quadrature is an alternative to Newton-Cotes methods. It is useful primarily in problems where integration is with respect to a non-uniform measure, e.g., an expectation. The basic idea is that the measure identifies a sequence of orthogonal polynomials. Approximations of f via these polynomials turn out to be better than Newton-Cotes approximations, at least as far as integration is concerned. The book gives minimal details, and we won't get into it here.
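
For a flavor of how the measure determines the rule, here is a self-contained R sketch of Gauss-Hermite quadrature (weight function e^{-x^2}) via the Golub-Welsch eigenvalue construction; this is my own illustration and goes beyond what the notes cover:

# Gauss-Hermite nodes and weights from the Jacobi matrix of the orthogonal
# polynomials for the weight exp(-x^2); assumes n >= 2
gauss_hermite <- function(n) {
  b <- sqrt((1:(n - 1)) / 2)                 # off-diagonal recurrence coefficients
  J <- matrix(0, n, n)
  J[cbind(1:(n - 1), 2:n)] <- b
  J[cbind(2:n, 1:(n - 1))] <- b
  e <- eigen(J, symmetric = TRUE)
  list(nodes = e$values, weights = sqrt(pi) * e$vectors[1, ]^2)
}

# E(theta^2) for theta ~ N(0, 1), via the substitution theta = sqrt(2) x
gh <- gauss_hermite(10)
sum(gh$weights * (sqrt(2) * gh$nodes)^2) / sqrt(pi)   # equals 1, up to rounding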

Setup
The Laplace approximation is a tool that allows us to approximate certain integrals based on optimization! The type of integrals to be considered are
$$J_n := \int_a^b f(x)\, e^{n g(x)}\,dx, \qquad n \to \infty,$$
where
- the endpoints a < b can be finite or infinite;
- f and g are sufficiently nice functions;
- g has a unique maximizer x̂ = arg max g(x) in the interior of (a, b).
Claim: when n is large, the major contribution to the integral comes from a neighborhood of x̂, the maximizer of g. (For a proof of this claim, see Section 4.7 in Lange.)

Formula
Assuming the claim, it suffices to restrict the range of integration to a small neighborhood around x̂, where
$$g(x) \approx g(\hat{x}) + \underbrace{\dot{g}(\hat{x})(x - \hat{x})}_{=0} + \tfrac{1}{2}\,\ddot{g}(\hat{x})(x - \hat{x})^2.$$
Plug this into the integral:
$$J_n \approx \int_{\text{nbhd}} f(x)\, e^{n\{g(\hat{x}) + \frac{1}{2} \ddot{g}(\hat{x})(x - \hat{x})^2\}}\,dx = e^{n g(\hat{x})} \int_{\text{nbhd}} f(x)\, e^{-\frac{1}{2}[-n \ddot{g}(\hat{x})](x - \hat{x})^2}\,dx.$$

Formula (cont.)
From the previous slide:
$$J_n \approx e^{n g(\hat{x})} \int_{\text{nbhd}} f(x)\, e^{-\frac{1}{2}[-n \ddot{g}(\hat{x})](x - \hat{x})^2}\,dx.$$
Two observations:
- since x̂ is a maximizer, $\ddot{g}(\hat{x}) < 0$;
- on a small nbhd, f(x) ≈ f(x̂).
Therefore,
$$J_n \approx f(\hat{x})\, e^{n g(\hat{x})} \int_{\text{nbhd}} e^{-\frac{1}{2}[-n \ddot{g}(\hat{x})](x - \hat{x})^2}\,dx \approx (2\pi)^{1/2}\, f(\hat{x})\, e^{n g(\hat{x})}\, \{-n \ddot{g}(\hat{x})\}^{-1/2}.$$
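
The final formula is easy to code directly; a minimal R sketch (my own names), checked against quadrature on a toy integral where the approximation is exact:

# Laplace approximation to J_n = integral of f(x) exp(n g(x)) dx, given the
# maximizer xhat of g and the second derivative g2 = g''(xhat) < 0
laplace_approx <- function(f, g, xhat, g2, n) {
  sqrt(2 * pi) * f(xhat) * exp(n * g(xhat)) / sqrt(-n * g2)
}

# Toy check: f(x) = 1, g(x) = x - x^2/2, so xhat = 1 and g''(xhat) = -1
f <- function(x) 1
g <- function(x) x - x^2 / 2
laplace_approx(f, g, xhat = 1, g2 = -1, n = 50)
integrate(function(x) f(x) * exp(50 * g(x)), -Inf, Inf)$value   # agrees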

Example: Stirling's formula
Stirling's formula is a useful approximation of factorials. Start by writing the factorial as a gamma function:
$$n! = \Gamma(n + 1) = \int_0^\infty z^n e^{-z}\,dz.$$
Make a change of variable x = z/n to get
$$n! = n^{n+1} \int_0^\infty e^{n g(x)}\,dx, \qquad g(x) = \log x - x.$$
g(x) has maximizer x̂ = 1 in the interior of (0, ∞). For large n, the Laplace approximation gives
$$n! \approx n^{n+1}\, (2\pi)^{1/2}\, e^{n g(1)}\, \{-n \ddot{g}(1)\}^{-1/2} = (2\pi)^{1/2}\, n^{n+1/2}\, e^{-n}.$$
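
A quick numerical check of the approximation (my own illustration), done on the log scale to avoid overflow:

# Compare log(n!) with the log of Stirling's approximation
stirling_log <- function(n) 0.5 * log(2 * pi) + (n + 0.5) * log(n) - n
n <- c(5, 20, 100)
cbind(n, exact = lgamma(n + 1), stirling = stirling_log(n))
# the relative error in n! itself behaves like 1/(12 n)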

Example: Bayesian posterior expectations
Recall the Bayesian ingredients:
- L(θ) is the likelihood based on n iid samples;
- π(θ) is a prior density.
Then a posterior expectation looks like
$$E\{h(\theta) \mid \text{data}\} = \frac{\int h(\theta)\, L(\theta)\, \pi(\theta)\,d\theta}{\int L(\theta)\, \pi(\theta)\,d\theta}.$$
When n is large, applying Laplace to both the numerator and denominator gives
$$E\{h(\theta) \mid \text{data}\} \approx h(\hat{\theta}),$$
where θ̂ is the MLE. So the previous binomial example, where the posterior mean came out close to the MLE, was not a coincidence...
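
To see this numerically, a sketch reusing the semicircle-prior binomial setup from earlier, with made-up data whose sample proportion is held at 0.7 as n grows:

# Quadrature-based posterior mean versus the MLE x/n as n increases
post_mean <- function(n, x) {
  prior  <- function(th) (8 / pi) * sqrt(pmax(0.25 - (th - 0.5)^2, 0))
  loglik <- function(th) dbinom(x, n, th, log = TRUE)
  c0 <- loglik(x / n)                              # rescale to avoid underflow
  unnorm <- function(th) exp(loglik(th) - c0) * prior(th)
  integrate(function(t) t * unnorm(t), 0, 1)$value / integrate(unnorm, 0, 1)$value
}

sapply(c(10, 100, 1000), function(n) post_mean(n, round(0.7 * n)))   # tends to 0.7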

Remarks
It can be shown that the error in the Laplace approximation is O(n^{-1}); this can be improved with some extra care. The basic principle of the Laplace approximation is that, locally, the integrals look like Gaussian integrals. This principle extends to integrals over more than one dimension, and the multivariate version is the most useful. There is also a version of the Laplace approximation for the case when the maximizer of g is on the boundary. Then the principle is to make the integral look like exponential or gamma integrals. Details of this version can be found in Sec. 4.6 of Lange.

Remarks
Quadrature methods are very powerful. In principle, these methods can be developed for integrals of any dimension, but they only work well in 1-2 dimensions. Curse of dimensionality: if the dimension is large, then one needs far too many grid points to get good approximations. The Laplace approximation can work in high dimensions, but only for certain kinds of integrals; fortunately, the stat-related integrals are often of this form. For higher dimensions, Monte Carlo methods are preferred:
- they are generally very easy to do;
- the approximation accuracy is independent of dimension.
We will talk in detail later about Monte Carlo.