Stat 451 Lecture Notes Numerical Integration


Stat 451 Lecture Notes 03: Numerical Integration
Ryan Martin, UIC (www.math.uic.edu/~rgmartin)
Based on Chapter 5 in Givens & Hoeting, and Chapters 4 & 18 of Lange.
Updated: February 11, 2016.

Outline
1. Introduction
2. Newton-Cotes quadrature
3. Gaussian quadrature
4. Laplace approximation
5. Conclusion

Motivation
While many statistics problems rely on optimization, there are also some that require numerical integration. Bayesian statistics is almost exclusively integration:
- the data admit a likelihood function L(θ);
- θ is unknown, so assign it a weight (prior) function π(θ);
- combine prior and data using Bayes's formula
$$\pi(\theta \mid x) = \frac{L(\theta)\,\pi(\theta)}{\int_\Theta L(\theta')\,\pi(\theta')\,d\theta'}.$$
We need to compute probabilities and expectations, i.e., integrals! Some non-Bayesian problems also involve integration, e.g., random- or mixed-effects models. There are other approaches besides Bayesian and frequentist...

Intuition
There are a number of classical numerical integration techniques, simple and powerful. Think back to calculus class, where the integral was defined:
- approximate the function by a constant on small intervals;
- compute the areas of the rectangles and sum them up;
- the integral is defined as the limit of this sum as the mesh goes to 0.
Numerical integration, or quadrature, is based on this definition and refinements thereof. Basic principle: approximate the function on a small interval by a nice one that you know how to integrate. (This is essentially the same principle that motivated the various methods we discussed for optimization!) It works well for one- or two-dimensional integrals; for higher-dimensional integrals, other tools are needed.

Notation
Suppose that f(x) is a function that we'd like to integrate over an interval [a, b]. Take n relatively large and set h = (b − a)/n. Let x_i = a + ih, i = 0, 1, ..., n, so that x_0 = a and x_n = b. Key point: if f(x) is nice, then it can be approximated by a simple function on the small interval [x_i, x_{i+1}]. A general strategy is to approximate the integral by
$$\int_a^b f(x)\,dx = \sum_{i=0}^{n-1} \int_{x_i}^{x_{i+1}} f(x)\,dx \approx \sum_{i=0}^{n-1} \sum_{j=0}^{m} A_{ij}\, f(x_{ij}),$$
for appropriately chosen A_ij's and m.

Polynomial approximation
Consider the following sequence of polynomials:
$$p_{ij}(x) = \prod_{k \ne j} \frac{x - x_{ik}}{x_{ij} - x_{ik}}, \qquad j = 0, \ldots, m.$$
Then
$$p_i(x) = \sum_{j=0}^{m} p_{ij}(x)\, f(x_{ij})$$
is an mth-degree polynomial that interpolates f(x) at the nodes x_{i0}, ..., x_{im}. Furthermore,
$$\int_{x_i}^{x_{i+1}} f(x)\,dx \approx \int_{x_i}^{x_{i+1}} p_i(x)\,dx = \sum_{j=0}^{m} \underbrace{\Bigl[\int_{x_i}^{x_{i+1}} p_{ij}(x)\,dx\Bigr]}_{A_{ij}}\, f(x_{ij}).$$
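
As a concrete check of this construction, the weights A_ij can equivalently be found by requiring the rule to integrate 1, x, ..., x^m exactly. Below is a minimal R sketch (my own illustration, not the course code), on the interval rescaled to [0, 1]; on [x_i, x_{i+1}] the weights are these values multiplied by h.

# Newton-Cotes weights for m + 1 equally spaced nodes on [0, 1], found by
# requiring exactness for the monomials 1, x, ..., x^m (assumes m >= 1)
nc_weights <- function(m) {
  nodes <- seq(0, 1, length.out = m + 1)
  V <- outer(0:m, nodes, function(k, x) x^k)   # V[k+1, j] = nodes[j]^k
  moments <- 1 / (1:(m + 1))                   # integral of x^k over [0, 1]
  solve(V, moments)
}

nc_weights(1)   # 0.5, 0.5       -- trapezoid rule
nc_weights(2)   # 1/6, 4/6, 1/6  -- Simpson's rule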

Riemann rule: m = 0
Approximate f(x) on [x_i, x_{i+1}] by a constant. Here x_{i0} = x_i and p_{i0}(x) ≡ 1, so
$$\int_a^b f(x)\,dx \approx \sum_{i=0}^{n-1} f(x_i)(x_{i+1} - x_i) = h \sum_{i=0}^{n-1} f(x_i).$$
Features of Riemann's rule:
- Very easy to program: only need f(x_0), ..., f(x_n).
- Can be slow to converge, i.e., lots of x_i's may be needed to get a good approximation.
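
A minimal R sketch of the Riemann rule (the function name and test integrand are my own, not from the course code):

# Riemann rule: integrate f over [a, b] using n subintervals and the
# value of f at the left endpoint of each (f should be vectorized)
riemann <- function(f, a, b, n) {
  h <- (b - a) / n
  x <- a + (0:(n - 1)) * h   # left endpoints x_0, ..., x_{n-1}
  h * sum(f(x))
}

riemann(exp, 0, 1, n = 1e4)   # about 1.71820; exact value is exp(1) - 1 = 1.71828...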

Trapezoid rule: m = 1
Approximate f(x) on [x_i, x_{i+1}] by a linear function. In this case:
- x_{i0} = x_i and x_{i1} = x_{i+1};
- A_{i0} = A_{i1} = (x_{i+1} − x_i)/2 = h/2.
Therefore,
$$\int_a^b f(x)\,dx \approx \frac{h}{2} \sum_{i=0}^{n-1} \bigl\{ f(x_i) + f(x_{i+1}) \bigr\}.$$
Still only requires function evaluations at the x_i's. More accurate than Riemann because the linear approximation is more flexible than a constant. Can derive bounds on the approximation error...
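
A matching R sketch of the trapezoid rule (again my own names, not the course code):

# Trapezoid rule on [a, b] with n subintervals: interior points get
# weight h, the two endpoints get weight h/2
trapezoid <- function(f, a, b, n) {
  h <- (b - a) / n
  x <- a + (0:n) * h            # grid x_0, ..., x_n
  fx <- f(x)
  h * (sum(fx) - (fx[1] + fx[n + 1]) / 2)
}

trapezoid(exp, 0, 1, n = 100)   # about 1.718296; error is O(h^2)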

Trapezoid rule (cont.)
A general tool which we can use to study the precision of the trapezoid rule is the Euler-Maclaurin formula. Suppose that g is twice differentiable; then
$$\sum_{t=0}^{n} g(t) \approx \int_0^n g(t)\,dt + \tfrac{1}{2}\{g(0) + g(n)\} + C_1\, g'(t)\Big|_0^n,$$
where
$$|\mathrm{LHS} - \mathrm{RHS}| \le C_2 \int_0^n |g''(t)|\,dt.$$
How does this help? The trapezoid rule is
$$T(h) := h\bigl\{\tfrac{1}{2} g(0) + g(1) + \cdots + g(n-1) + \tfrac{1}{2} g(n)\bigr\}, \quad \text{where } g(t) = f(a + ht).$$

Trapezoid rule (cont.)
Apply Euler-Maclaurin to T(h):
$$T(h) = h \sum_{t=0}^{n} g(t) - \frac{h}{2}\{g(0) + g(n)\} \approx h \int_0^n g(t)\,dt + h\, C_1\, g'(t)\Big|_0^n = \int_a^b f(x)\,dx + h\, C_1 \{h f'(b) - h f'(a)\}.$$
Therefore,
$$T(h) - \int_a^b f(x)\,dx = O(h^2), \qquad h \to 0.$$

Trapezoid rule (cont.)
Can the trapezoid error O(h^2) be improved? Our derivation above is not quite precise; the next smallest term in the expansion is O(h^4). Romberg recognized that a manipulation of T(h) will cancel the O(h^2) term, leaving only the O(h^4) term! Romberg's rule is
$$\frac{4\,T(h/2) - T(h)}{3} = \int_a^b f(x)\,dx + O(h^4), \qquad h \to 0.$$
Can be iterated to improve further; see Sec. 5.2 in G&H.
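
A one-step Romberg sketch built on the trapezoid function above (my own illustration):

# One Romberg step: combine T(h) and T(h/2) to cancel the O(h^2) error term
romberg_step <- function(f, a, b, n) {
  T_h  <- trapezoid(f, a, b, n)       # step size h = (b - a)/n
  T_h2 <- trapezoid(f, a, b, 2 * n)   # step size h/2
  (4 * T_h2 - T_h) / 3
}

romberg_step(exp, 0, 1, n = 10)   # error around 6e-8, versus about 1.4e-3 for trapezoid alone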

Simpson rule: m = 2
Approximate f(x) on [x_i, x_{i+1}] by a quadratic function. Similar arguments as above give the x_ij's and A_ij's. Simpson's rule approximation is
$$\int_a^b f(x)\,dx \approx \frac{h}{6} \sum_{i=0}^{n-1} \Bigl\{ f(x_i) + 4 f\Bigl(\frac{x_i + x_{i+1}}{2}\Bigr) + f(x_{i+1}) \Bigr\}.$$
More accurate than the trapezoid rule: the error is O(n^{-4}). If n is taken to be even, then the formula simplifies a bit; see Equation (5.20) in G&H and my R code.
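
An R sketch of Simpson's rule in exactly the form above, with each subinterval using its endpoints and midpoint (my own code, not the G&H version referenced on the slide):

# Simpson's rule on [a, b] with n subintervals; f should be vectorized
simpson <- function(f, a, b, n) {
  h <- (b - a) / n
  x <- a + (0:n) * h
  left  <- x[-(n + 1)]          # x_0, ..., x_{n-1}
  right <- x[-1]                # x_1, ..., x_n
  (h / 6) * sum(f(left) + 4 * f((left + right) / 2) + f(right))
}

simpson(exp, 0, 1, n = 10)   # error is about 6e-8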

Remarks
- This approach works for generic m, and the approximation improves as m increases.
- It can be extended to functions of more than one variable, but the details get complicated very quickly.
- In R, integrate does one-dimensional integration.
- Numerical methods and the corresponding software work very well, but care is still needed; see Section 5.4 in G&H.
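
For reference, a quick example of calling R's built-in integrate (the integrand is just an illustration):

# Adaptive quadrature in base R; returns an estimate plus an error bound
out <- integrate(function(x) exp(-x^2), lower = 0, upper = Inf)
out$value       # about 0.8862269, i.e., sqrt(pi)/2
out$abs.error   # estimated absolute error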

Example: Bayesian analysis of binomial
Suppose X ~ Bin(n, θ) with n known and θ unknown. The prior for θ is the so-called semicircle distribution with density
$$\pi(\theta) = \frac{8}{\pi}\bigl\{\tfrac{1}{4} - (\theta - \tfrac{1}{2})^2\bigr\}^{1/2}, \qquad \theta \in [0, 1].$$
The posterior density is then
$$\pi_x(\theta) = \frac{\theta^x (1-\theta)^{n-x} \bigl\{\tfrac{1}{4} - (\theta - \tfrac{1}{2})^2\bigr\}^{1/2}}{\int_0^1 u^x (1-u)^{n-x} \bigl\{\tfrac{1}{4} - (u - \tfrac{1}{2})^2\bigr\}^{1/2}\,du}.$$
Calculating the Bayes estimate of θ, the posterior mean, requires a numerical integration.
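
A sketch of this computation with integrate; the data values n = 20 and x = 14 are made up for illustration:

# Posterior mean of theta under the semicircle prior, by numerical integration
n <- 20; x <- 14
prior  <- function(th) (8 / pi) * sqrt(pmax(0.25 - (th - 0.5)^2, 0))
unnorm <- function(th) th^x * (1 - th)^(n - x) * prior(th)   # unnormalized posterior

num   <- integrate(function(t) t * unnorm(t), 0, 1)$value
denom <- integrate(unnorm, 0, 1)$value
num / denom   # posterior mean; compare with the MLE x/n = 0.7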

Example: mixture densities
Mixture distributions are very common, flexible models, useful for density estimation and heavy-tailed modeling. The general mixture model looks like
$$p(y) = \int k(y \mid x)\, f(x)\,dx,$$
where
- the kernel k(y | x) is a pdf (or pmf) in y for each x;
- f(x) is a pdf (or pmf).
Easy to check that p(y) is a pdf (or pmf, depending on k). Evaluation of p(y) requires an integration for each y.
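
As an illustration (my own, not from the notes), here is a normal scale mixture evaluated by quadrature; with a Gamma(a, a) mixing density on the precision, the marginal p(y) is a Student-t density with 2a degrees of freedom:

# p(y) = integral of N(y | 0, 1/x) * Gamma(x | shape = a, rate = a) dx
p_y <- function(y, a = 2) {
  sapply(y, function(yy) {
    integrate(function(x) dnorm(yy, 0, 1 / sqrt(x)) * dgamma(x, a, a), 0, Inf)$value
  })
}

p_y(c(0, 1, 3))          # heavier tails than the N(0, 1) kernel alone
dt(c(0, 1, 3), df = 4)   # matches the t_4 density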

Example 5.1 in G&H
Generalized linear mixed model:
$$Y_{ij} \overset{\text{ind}}{\sim} \mathrm{Pois}(\lambda_{ij}), \qquad \lambda_{ij} = e^{\gamma_i} e^{\beta_0 + \beta_1 j}, \qquad i = 1, \ldots, n, \; j = 1, \ldots, J,$$
where γ_1, ..., γ_n are iid N(0, σ_γ^2). Model parameters are (β_0, β_1, σ_γ^2). The marginal likelihood for θ = (β_0, β_1, σ_γ^2) is
$$L(\theta) = \prod_{i=1}^{n} \int \prod_{j=1}^{J} \mathrm{Pois}\bigl(Y_{ij} \mid e^{\gamma_i} e^{\beta_0 + \beta_1 j}\bigr)\, N(\gamma_i \mid 0, \sigma_\gamma^2)\,d\gamma_i.$$
Goal is to maximize over θ...

Example 5.1 in G&H (cont.)
Taking the log, we get
$$\ell(\theta) = \sum_{i=1}^{n} \log \underbrace{\int \prod_{j=1}^{J} \mathrm{Pois}\bigl(Y_{ij} \mid e^{\gamma_i} e^{\beta_0 + \beta_1 j}\bigr)\, N(\gamma_i \mid 0, \sigma_\gamma^2)\,d\gamma_i}_{L_i(\theta)}.$$
G&H consider evaluating
$$\frac{\partial}{\partial \beta_1} L_1(\theta) = \int \Bigl[\sum_{j=1}^{J} j\bigl(y_{1j} - e^{\gamma_1} e^{\beta_0 + \beta_1 j}\bigr)\Bigr] \Bigl[\prod_{j=1}^{J} \mathrm{Pois}\bigl(Y_{1j} \mid e^{\gamma_1} e^{\beta_0 + \beta_1 j}\bigr)\Bigr] N(\gamma_1 \mid 0, \sigma_\gamma^2)\,d\gamma_1,$$
or
$$L_1(\theta) = \int \Bigl[\prod_{j=1}^{J} \mathrm{Pois}\bigl(Y_{1j} \mid e^{\gamma_1} e^{\beta_0 + \beta_1 j}\bigr)\Bigr] N(\gamma_1 \mid 0, \sigma_\gamma^2)\,d\gamma_1.$$
Reproduce Tables 5.2-5.4 using R codes.
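
A sketch of evaluating L_1(θ) by one-dimensional quadrature; the response vector and parameter values below are made up, not the data of G&H Example 5.1:

# L_1(theta) for the Poisson GLMM: integrate the conditional likelihood of
# subject 1 against the N(0, sigma^2) random-effect density
y1 <- c(2, 4, 3, 5, 6); J <- length(y1)
L1 <- function(beta0, beta1, sg) {
  integrand <- function(g) sapply(g, function(gi) {
    lambda <- exp(gi + beta0 + beta1 * (1:J))
    prod(dpois(y1, lambda)) * dnorm(gi, 0, sg)
  })
  integrate(integrand, -Inf, Inf)$value
}

L1(beta0 = 0.5, beta1 = 0.1, sg = 1)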

Very brief summary
Gaussian quadrature is an alternative to Newton-Cotes methods. It is useful primarily in problems where integration is with respect to a non-uniform measure, e.g., an expectation. The basic idea is that the measure identifies a sequence of orthogonal polynomials. Approximations of f via these polynomials turn out to be better than Newton-Cotes approximations, at least as far as integration is concerned. The book gives minimal details, and we won't get into it here.
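
For a flavor of how the measure determines the rule, here is a self-contained R sketch of Gauss-Hermite quadrature (weight function e^{-x^2}) via the Golub-Welsch eigenvalue construction; this is my own illustration and goes beyond what the notes cover:

# Gauss-Hermite nodes and weights from the Jacobi matrix of the orthogonal
# polynomials for the weight exp(-x^2); assumes n >= 2
gauss_hermite <- function(n) {
  b <- sqrt((1:(n - 1)) / 2)                 # off-diagonal recurrence coefficients
  J <- matrix(0, n, n)
  J[cbind(1:(n - 1), 2:n)] <- b
  J[cbind(2:n, 1:(n - 1))] <- b
  e <- eigen(J, symmetric = TRUE)
  list(nodes = e$values, weights = sqrt(pi) * e$vectors[1, ]^2)
}

# E(theta^2) for theta ~ N(0, 1), via the substitution theta = sqrt(2) x
gh <- gauss_hermite(10)
sum(gh$weights * (sqrt(2) * gh$nodes)^2) / sqrt(pi)   # equals 1, up to rounding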

Setup
The Laplace approximation is a tool that allows us to approximate certain integrals based on optimization! The type of integrals to be considered are
$$J_n := \int_a^b f(x)\, e^{n g(x)}\,dx, \qquad n \to \infty,$$
where
- the endpoints a < b can be finite or infinite;
- f and g are sufficiently nice functions;
- g has a unique maximizer x̂ = arg max g(x) in the interior of (a, b).
Claim: when n is large, the major contribution to the integral comes from a neighborhood of x̂, the maximizer of g. (For a proof of this claim, see Section 4.7 in Lange.)

Formula
Assuming the claim, it suffices to restrict the range of integration to a small neighborhood around x̂, where
$$g(x) \approx g(\hat{x}) + \underbrace{\dot{g}(\hat{x})(x - \hat{x})}_{=0} + \tfrac{1}{2}\,\ddot{g}(\hat{x})(x - \hat{x})^2.$$
Plug this into the integral:
$$J_n \approx \int_{\text{nbhd}} f(x)\, e^{n\{g(\hat{x}) + \frac{1}{2} \ddot{g}(\hat{x})(x - \hat{x})^2\}}\,dx = e^{n g(\hat{x})} \int_{\text{nbhd}} f(x)\, e^{-\frac{1}{2}[-n \ddot{g}(\hat{x})](x - \hat{x})^2}\,dx.$$

Formula (cont.)
From the previous slide:
$$J_n \approx e^{n g(\hat{x})} \int_{\text{nbhd}} f(x)\, e^{-\frac{1}{2}[-n \ddot{g}(\hat{x})](x - \hat{x})^2}\,dx.$$
Two observations:
- since x̂ is a maximizer, $\ddot{g}(\hat{x}) < 0$;
- on a small nbhd, f(x) ≈ f(x̂).
Therefore,
$$J_n \approx f(\hat{x})\, e^{n g(\hat{x})} \int_{\text{nbhd}} e^{-\frac{1}{2}[-n \ddot{g}(\hat{x})](x - \hat{x})^2}\,dx \approx (2\pi)^{1/2}\, f(\hat{x})\, e^{n g(\hat{x})}\, \{-n \ddot{g}(\hat{x})\}^{-1/2}.$$
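
The final formula is easy to code directly; a minimal R sketch (my own names), checked against quadrature on a toy integral where the approximation is exact:

# Laplace approximation to J_n = integral of f(x) exp(n g(x)) dx, given the
# maximizer xhat of g and the second derivative g2 = g''(xhat) < 0
laplace_approx <- function(f, g, xhat, g2, n) {
  sqrt(2 * pi) * f(xhat) * exp(n * g(xhat)) / sqrt(-n * g2)
}

# Toy check: f(x) = 1, g(x) = x - x^2/2, so xhat = 1 and g''(xhat) = -1
f <- function(x) 1
g <- function(x) x - x^2 / 2
laplace_approx(f, g, xhat = 1, g2 = -1, n = 50)
integrate(function(x) f(x) * exp(50 * g(x)), -Inf, Inf)$value   # agrees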

Example: Stirling's formula
Stirling's formula is a useful approximation of factorials. Start by writing the factorial as a gamma function:
$$n! = \Gamma(n + 1) = \int_0^\infty z^n e^{-z}\,dz.$$
Make a change of variable x = z/n to get
$$n! = n^{n+1} \int_0^\infty e^{n g(x)}\,dx, \qquad g(x) = \log x - x.$$
g(x) has maximizer x̂ = 1 in the interior of (0, ∞). For large n, the Laplace approximation gives
$$n! \approx n^{n+1}\, (2\pi)^{1/2}\, e^{n g(1)}\, \{-n \ddot{g}(1)\}^{-1/2} = (2\pi)^{1/2}\, n^{n+1/2}\, e^{-n}.$$
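
A quick numerical check of the approximation (my own illustration), done on the log scale to avoid overflow:

# Compare log(n!) with the log of Stirling's approximation
stirling_log <- function(n) 0.5 * log(2 * pi) + (n + 0.5) * log(n) - n
n <- c(5, 20, 100)
cbind(n, exact = lgamma(n + 1), stirling = stirling_log(n))
# the relative error in n! itself behaves like 1/(12 n)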

Example: Bayesian posterior expectations
Recall the Bayesian ingredients:
- L(θ) is the likelihood based on n iid samples;
- π(θ) is a prior density.
Then a posterior expectation looks like
$$E\{h(\theta) \mid \text{data}\} = \frac{\int h(\theta)\, L(\theta)\, \pi(\theta)\,d\theta}{\int L(\theta)\, \pi(\theta)\,d\theta}.$$
When n is large, applying Laplace to both the numerator and denominator gives
$$E\{h(\theta) \mid \text{data}\} \approx h(\hat{\theta}),$$
where θ̂ is the MLE. So the previous binomial example, where the posterior mean came out close to the MLE, was not a coincidence...
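
To see this numerically, a sketch reusing the semicircle-prior binomial setup from earlier, with made-up data whose sample proportion is held at 0.7 as n grows:

# Quadrature-based posterior mean versus the MLE x/n as n increases
post_mean <- function(n, x) {
  prior  <- function(th) (8 / pi) * sqrt(pmax(0.25 - (th - 0.5)^2, 0))
  loglik <- function(th) dbinom(x, n, th, log = TRUE)
  c0 <- loglik(x / n)                              # rescale to avoid underflow
  unnorm <- function(th) exp(loglik(th) - c0) * prior(th)
  integrate(function(t) t * unnorm(t), 0, 1)$value / integrate(unnorm, 0, 1)$value
}

sapply(c(10, 100, 1000), function(n) post_mean(n, round(0.7 * n)))   # tends to 0.7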

Remarks
It can be shown that the error in the Laplace approximation is O(n^{-1}); this can be improved with some extra care. The basic principle of the Laplace approximation is that, locally, the integrals look like Gaussian integrals. This principle extends to integrals over more than one dimension, and the multivariate version is the most useful. There is also a version of the Laplace approximation for the case when the maximizer of g is on the boundary. Then the principle is to make the integral look like exponential or gamma integrals. Details of this version can be found in Sec. 4.6 of Lange.

Remarks
Quadrature methods are very powerful. In principle, these methods can be developed for integrals of any dimension, but they only work well in 1-2 dimensions. Curse of dimensionality: if the dimension is large, then one needs far too many grid points to get good approximations. The Laplace approximation can work in high dimensions, but only for certain kinds of integrals; fortunately, the stat-related integrals are often of this form. For higher dimensions, Monte Carlo methods are preferred:
- they are generally very easy to do;
- the approximation accuracy is independent of dimension.
We will talk in detail later about Monte Carlo.