Fundamental Issues in Bayesian Functional Data Analysis. Dennis D. Cox, Rice University


Introduction. Question: What are functional data? Answer: Data that are functions of a continuous variable. Say we observe $Y_i(t)$, $t \in [a, b]$, where $Y_1, Y_2, \ldots, Y_n$ are i.i.d. $N(\mu, V)$: $\mu(t) = E[Y(t)]$, $V(t, s) = \mathrm{Cov}[Y(t), Y(s)]$. Question: Do we ever really observe functional data? Here are some examples of functional data:

[Figure: a grid of example spectra, each panel plotting Intensity against Emission Wavelength (nm).]

Introduction (cont.) Question: But you don't really observe continuous functions, do you? Answer: Look closely at the data...

[Figure: a zoomed-in view of one spectrum near 455-470 nm, Intensity against Emission Wavelength (nm), showing the individual observed points.]

Introduction (cont.) OK, so it is really a bunch of dots connected by line segments. That is, we really have the data $Y_i(t)$ for $t$ on a grid: $t \in \{395, 396, \ldots, 660\}$. But people doing functional data analysis like to pretend they are observing whole functions. Is it just a way of sounding erudite? Functional Data Analysis, not for the heathen and unclean. Some books on the subject: Functional Data Analysis and Applied Functional Data Analysis by Ramsay and Silverman; Nonparametric Functional Data Analysis: Theory and Practice by Ferraty and Vieu.

Functional Data (cont.): Working with functional data requires some idealization. E.g., the data are actually multivariate; they are stored as either of (G) $(Y_i(t_1), \ldots, Y_i(t_m))$, vectors of values on a grid, or (C) $(\eta_{i1}, \ldots, \eta_{im})$, where $Y_i(t) = \sum_{j=1}^{m} \eta_{ij} b_j(t)$ is a basis function expansion (e.g., B-splines). Note that the order of approximation $m$ is rather arbitrary. Treating functional data as simply multivariate doesn't make use of the additional structure implied by being a smooth function.
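
To make the two storage formats concrete, here is a small Python sketch. The grid sizes and names are illustrative, and simple piecewise-linear "hat" functions stand in for the B-splines mentioned above:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 201)            # fine evaluation grid
grid = np.linspace(0.0, 1.0, 21)          # (G): observation grid with m = 21 points
knots = np.linspace(0.0, 1.0, 8)          # (C): m = 8 basis functions (hat functions here)

def hat_basis(x, knots):
    """Piecewise-linear 'hat' basis functions b_j(x), one peaked at each knot."""
    B = np.zeros((len(x), len(knots)))
    for j in range(len(knots)):
        nodal = np.zeros(len(knots))
        nodal[j] = 1.0
        B[:, j] = np.interp(x, knots, nodal)
    return B

y = np.sin(2.0 * np.pi * t)               # a stand-in for one observed curve Y_i(t)

y_grid = np.interp(grid, t, y)            # (G) vector of values on a grid

B = hat_basis(t, knots)                   # (C) least-squares coefficients eta_ij in
eta, *_ = np.linalg.lstsq(B, y, rcond=None)   #     Y_i(t) ~ sum_j eta_ij b_j(t)
print(y_grid.shape, eta.shape, np.max(np.abs(B @ eta - y)))
```

Either representation is finite dimensional, and the approximation error shrinks as m grows, which is exactly the regime the next slides are concerned with.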

Functional Data (cont.): Methods for Functional Data Analysis (FDA) should satisfy the Grid Refinement Invariance Principle (GRIP): as the order of approximation becomes more exact (i.e., $m \to \infty$), the method should approach the appropriate limiting analogue for true functional (infinite-dimensional) observations. Thus the statistical procedure will not be strongly dependent on the finite-dimensional approximation. Two general ways to mind the GRIP: (i) Direct: devise a method for true functional data, then find a finite-dimensional approximation ("projection"). (ii) Indirect: devise a method for the finite-dimensional data, then see if it has a limit as $m \to \infty$.

See Lee & Cox, "Pointwise Testing with Functional Data Using the Westfall-Young Randomization Method," Biometrika (2008), for a frequentist nonparametric approach to some testing problems with functional data.

Bayesian Functional Data Analysis: Why Bayesian? After all, Bayesian methods have a high information requirement, i.e., a likelihood and a prior. In principle, statistical inference problems are not conceptually as difficult for Bayesians. Of course, there is the problem of computing the posterior, even approximately (will MCMC be the downfall of statistics?). And priors have consequences, so there are lots of opportunities for investigation into those consequences.

A Bayesian problem: develop priors for Bayesian functional data analysis. Again assume the data are realizations of a Gaussian process; say we observe $Y_i(t)$, $t \in [a, b]$, where $Y_1, Y_2, \ldots, Y_n$ are i.i.d. $N(\mu, V)$: $\mu(t) = E[Y(t)]$, $V(t, s) = \mathrm{Cov}[Y(t), Y(s)]$. Denote the discretized data by $Y_i^{(m)} = \mathbf{Y}_i = (Y_i(t_1), \ldots, Y_i(t_m))$ and the corresponding mean vector and covariance matrix $\boldsymbol{\mu}$ and $\mathbf{V}$, where $\mathbf{V}_{ij} = V(t_i, t_j)$. Prior distribution for $\mu$: $\mu \mid V, k \sim N(0, kV)$. But V????? What priors can we construct for covariance functions?
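
A minimal sketch of this discretized setup, assuming the Brownian-motion covariance for V and arbitrary values of m, n, and k (none of these particular choices come from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 5, 1.0                                  # grid size, sample size, prior scale
t = np.linspace(0.0, 1.0, m)
V = np.minimum.outer(t, t) + 1e-8 * np.eye(m)          # V(s,t) = min(s,t) (Brownian motion)

mu = rng.multivariate_normal(np.zeros(m), k * V)       # prior draw: mu | V, k ~ N(0, kV)
Y = rng.multivariate_normal(mu, V, size=n)             # discretized curves Y_i^(m), i = 1..n
print(Y.shape)                                         # (n, m)
```

Here V was simply plugged in; the rest of the talk is about putting a prior on V itself.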

Requisite properties of covariance functions: Symmetry: $V(s, t) = V(t, s)$. Positive definiteness: for any choice of $k$ and distinct $s_1, \ldots, s_k$ in the domain, the matrix given by $V_{ij} = V(s_i, s_j)$ is positive definite. It is difficult to achieve this latter requirement.
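
A quick numerical way to probe both properties for a candidate covariance function is to evaluate it on a grid and inspect the resulting matrix; a sketch (the kernels below are just examples):

```python
import numpy as np

def check_covariance(kernel, grid, tol=1e-10):
    """Evaluate V_ij = kernel(s_i, s_j) on a grid; test symmetry and numerical PSD-ness."""
    V = kernel(grid[:, None], grid[None, :])
    symmetric = np.allclose(V, V.T)
    smallest_eig = np.linalg.eigvalsh((V + V.T) / 2.0).min()
    return symmetric, smallest_eig >= -tol

grid = np.linspace(0.0, 1.0, 50)
ou  = lambda s, t: np.exp(-np.abs(s - t))   # Ornstein-Uhlenbeck kernel: symmetric and PSD
bad = lambda s, t: s + t                    # symmetric but NOT positive definite
print(check_covariance(ou, grid))           # (True, True)
print(check_covariance(bad, grid))          # (True, False)
```

Of course, a grid check only rules kernels out; guaranteeing positive definiteness for every grid is the hard part the slide refers to.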

Requirements on Covariance Priors: Our first requirement in constructing a prior for covariance functions is that we mind the GRIP. One may wish to use the conjugate inverse Wishart prior: $V^{-1} \sim \mathrm{Wishart}(d_m, W_m)$ for some $m \times m$ matrix $W_m$, where, e.g., $W_m$ is obtained by discretizing a standard covariance function. Under what conditions (if any) on $m$ and $d_m$ will this converge to a probability measure on the space of covariance operators? This would be an indirect approach to satisfying the GRIP. More on this later.

Requirements on Covariance Priors (cont.): An easier way to satisfy the GRIP requirement is to construct a prior on the space of covariance functions and then project it down to the finite-dimensional approximation, for example using grid values, $V_{ij} = V(t_i, t_j)$; i.e., the direct approach. We (joint work with Hongxiao Zhu of MDACC) did come up with something that works, sort of.

A proposed approach that does work (sort of): Suppose $Z_1, Z_2, \ldots$ are i.i.d. realizations of a Gaussian random process (mean $0$, covariance function $B(s, t)$). Consider $V(s, t) = \sum_i w_i Z_i(s) Z_i(t)$, where $w_1, w_2, \ldots$ are nonnegative constants satisfying $\sum_i w_i < \infty$. One can show that this gives a random covariance function, and that its distribution fills out the space of covariance functions. Can we compute with it?

A proposed approach that sort of works (cont.): Thus, if we can compute with this proposed prior, we will have satisfied the three requirements: a valid prior on covariance functions, one that fills out the space of covariance functions, and one that is useful in practice. Assuming we use values on a grid for the finite-dimensional representation, let $\mathbf{Z}_i = (Z_i(t_1), \ldots, Z_i(t_m))$. Then $\mathbf{V} = \sum_i w_i \mathbf{Z}_i \mathbf{Z}_i^T$. How to compute with this? One idea is to write out the characteristic function and use Fourier inversion; that works well for weighted sums of $\chi^2$ distributions (Fortran code available from StatLib).
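
A minimal sketch of one draw from this prior on a grid, taking $B(s,t)$ to be the Ornstein-Uhlenbeck kernel that appears later in the talk and using geometric weights $w_i$; both choices are illustrative, not prescribed by the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
m, J = 100, 50                                   # grid size and series truncation
t = np.linspace(0.0, 1.0, m)

B = np.exp(-np.abs(t[:, None] - t[None, :]))     # B(s,t) = exp(-|s-t|): covariance of the Z_i
L = np.linalg.cholesky(B + 1e-10 * np.eye(m))

w = 0.5 ** np.arange(1, J + 1)                   # nonnegative weights with finite sum
Z = L @ rng.standard_normal((m, J))              # columns Z_i are i.i.d. N(0, B) on the grid

V = (Z * w) @ Z.T                                # V = sum_i w_i Z_i Z_i^T: a random covariance
print(np.allclose(V, V.T), np.linalg.eigvalsh(V).min() >= -1e-10)
```

By construction every draw is symmetric and nonnegative definite, which is what made the series construction attractive in the first place.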

A proposed approach that sort of works (cont.): Another approach: use the $Z_i$ directly. We will further approximate $V$ by truncating the series: $V^{(m,j)} = \sum_{i=1}^{j} w_i \mathbf{Z}_i \mathbf{Z}_i^T$. We devised a Metropolis-Hastings algorithm to sample the $Z_i$. One can use rank-1 QR updating to do fairly efficient computing (update each $Z_i$ one at a time).
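
The following is a heavily simplified sketch of such a sampler, not the authors' Matlab implementation: it assumes $\mu = 0$ and $k = 1$ (the talk integrates these out), uses a plain Gaussian random-walk proposal, and rebuilds V from scratch rather than using rank-1 QR updates:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

def log_post(Z, w, Y, B, jitter=1e-6):
    """Unnormalized log posterior of the Z_i (columns of Z), assuming mu = 0 and k = 1.
    Prior: Z_i i.i.d. N(0, B); likelihood: curves Y_i i.i.d. N(0, V), V = sum_i w_i Z_i Z_i^T."""
    m = Z.shape[0]
    V = (Z * w) @ Z.T + jitter * np.eye(m)
    loglik = multivariate_normal(np.zeros(m), V, allow_singular=True).logpdf(Y).sum()
    logprior = multivariate_normal(np.zeros(m), B).logpdf(Z.T).sum()
    return loglik + logprior

def mh_sweep(Z, w, Y, B, step=0.05):
    """One Metropolis-Hastings sweep: Gaussian random-walk update of each Z_i in turn."""
    lp = log_post(Z, w, Y, B)
    for i in range(Z.shape[1]):
        Z_prop = Z.copy()
        Z_prop[:, i] = Z[:, i] + step * rng.standard_normal(Z.shape[0])
        lp_prop = log_post(Z_prop, w, Y, B)
        if np.log(rng.uniform()) < lp_prop - lp:    # symmetric proposal: Metropolis ratio
            Z, lp = Z_prop, lp_prop
    return Z
```

In practice one would initialize Z from the prior and iterate mh_sweep many times while monitoring mixing, which, as the talk reports later, turns out to be the weak point of this approach.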

A proposed approach that sort of works (cont.): There are a couple of minor modifications: 1. We include an additional scale parameter $k$ in $V(s, t) = k \sum_i w_i Z_i(s) Z_i(t)$, where $k$ has an independent inverse-Gamma prior. 2. We integrate out $\mu$ and $k$, and use the marginal unnormalized posterior $f(Z \mid Y_1, \ldots, Y_n)$ in a Metropolis-Hastings MCMC algorithm. The algorithm has been implemented in Matlab.

Some results with simulated data: Generated data from Brownian motion (easy to do!), with n = 5 and various values of m and j.
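
Simulating Brownian motion on a grid really is easy; a sketch (the grid size and number of curves here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 5                                 # grid size and number of curves
t = np.linspace(0.0, 1.0, m)
dt = t[1] - t[0]

inc = rng.standard_normal((n, m)) * np.sqrt(dt)
inc[:, 0] = 0.0                               # paths start at Y_i(0) = 0
Y = np.cumsum(inc, axis=1)                    # Brownian motion: Cov[Y(s), Y(t)] = min(s, t)

V_sample = np.cov(Y, rowvar=False)            # the sample covariance estimate on the grid
print(Y.shape, V_sample.shape)
```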

[Figure: simulated Brownian motion sample paths (panel title "Brownian Motion").]

First, the true covariance function for Brownian motion.

[Figure: surface plot titled "True Covariance Function" (vertical axis Sigma, horizontal axes Tm).]

The covariance function used to generate the $Z_i$ is the Ornstein-Uhlenbeck correlation: $B(s, t) = \exp[-\alpha |s - t|]$ with $\alpha = 1$. This process goes by a number of other names (the Gauss-Markov process, continuous-time autoregression of order 1, etc.).

[Figure: the Ornstein-Uhlenbeck correlation $\exp[-\alpha|s-t|]$ plotted against the lag $s - t$.]

The Bayesian posterior mean estimate with m = 1, j = 2.

[Figure: surface plot titled "Bayes Estimated Covariance Function" (vertical axis Sigma est, horizontal axes Tm).]

The sample covariance estimate with m = 1.

[Figure: surface plot titled "Sample Estimate of Covariance Function" (vertical axis Sigma sample, horizontal axes Tm).]

Now the Bayes posterior mean estimate with m = 3, j = 6.

[Figure: surface plot titled "Bayes Estimated Covariance Function" (vertical axis Sigma est, horizontal axes Tm).]

The sample covariance estimate with m = 3.

[Figure: surface plot titled "Sample Estimate of Covariance Function" (vertical axis Sigma sample, horizontal axes Tm).]

Some results with simulated data: Mean squared error results (averaged over the grid points):

m    j    MSE Bayes    MSE Sample
1    2    .17          .26
3    6    .65          .54

Problems with the proposed approach that sort of works (cont.): The problem is way over-parameterized in terms of the $Z_j$, $1 \le j \le J$, where $J \le m$. Computations are very time intensive, and the MCMC seems not to mix well: it seems to converge to different values depending on the start. Is this caused by complex non-identifiability in the model? The posterior mode is a complicated manifold in a very high-dimensional space.

Another approach (work in progress): It would be very nice if we could construct a conjugate prior like the inverse Wishart in finite dimensions. This seems problematic. The main difficulty is that the inverse of a covariance operator (obtained from a covariance function) is not bounded. For example, let $Y(t)$ be Brownian motion considered as taking values in $L^2[0, 1]$. Then $v(s, t) = \mathrm{Cov}(Y(t), Y(s)) = \min\{s, t\}$. The operator $V$ is defined by $Vf(s) = \int_0^1 v(s, t) f(t)\,dt$. Compute $V^{-1} g$ by solving (for $f$) the integral equation $g(s) = \int_0^1 v(s, t) f(t)\,dt$.

Inverse Wishart (cont.): With a little calculus,
$$g(s) = \int_0^1 \min(s, t) f(t)\,dt = \int_0^s t f(t)\,dt + s \int_s^1 f(t)\,dt.$$
We see $g$ is absolutely continuous and $g(0) = 0$. Differentiating,
$$g'(s) = s f(s) - s f(s) + \int_s^1 f(t)\,dt = \int_s^1 f(t)\,dt.$$
We see $g'$ is absolutely continuous and $g'(1) = 0$. Differentiating again,
$$g''(s) = -f(s).$$
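
A numerical check of this calculus, on an (illustrative) midpoint grid: applying the discretized V and then the discrete second-derivative operator recovers $-f$ at interior grid points, up to floating-point error, since the discretized $\min(s,t)$ kernel is the inverse of the discrete second-difference operator with these boundary conditions.

```python
import numpy as np

m = 400
t = (np.arange(m) + 0.5) / m                  # midpoint grid on (0, 1)
h = 1.0 / m
V = np.minimum.outer(t, t) * h                # quadrature version of Vf(s) = integral of min(s,t) f(t) dt

f = np.cos(3.0 * np.pi * t)                   # an arbitrary smooth test function
g = V @ f                                     # g = Vf

g2 = (g[2:] - 2.0 * g[1:-1] + g[:-2]) / h**2  # discrete second derivative g''
print(np.max(np.abs(g2 + f[1:-1])))           # ~ 0: recovers g'' = -f at interior points
```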

Inverse Wishart (cont.): Thus, in the Brownian motion case, $V$ is invertible at $g$ iff $g$ is absolutely continuous and satisfies the two boundary conditions. Thus, $V$ is certainly not invertible on all of $L^2[0, 1]$. We can understand the problem in general by using the spectral representation: $V = \sum_i \lambda_i \phi_i \otimes \phi_i$. Thus $Vx = \sum_i \lambda_i \langle x, \phi_i \rangle \phi_i$. Then, if $V^{-1} x$ exists, it is given by $V^{-1} x = \sum_i \lambda_i^{-1} \langle x, \phi_i \rangle \phi_i$. This converges in $H$ iff $\sum_i \lambda_i^{-2} \langle x, \phi_i \rangle^2 < \infty$, which is a pretty strict condition on $x$ since $\sum_i \lambda_i < \infty$.
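
A short sketch of the spectral picture for the discretized Brownian-motion covariance (the comparison values $1/((i - 1/2)^2 \pi^2)$ are the known eigenvalues of this particular operator; the grid size is illustrative):

```python
import numpy as np

m = 500
t = (np.arange(m) + 0.5) / m
V = np.minimum.outer(t, t) / m                # discretized Brownian-motion covariance operator

lam = np.linalg.eigvalsh(V)[::-1]             # eigenvalues, largest first
print(lam[:4])                                # ~ 1/((i - 1/2)^2 pi^2), i = 1, 2, 3, 4
print(1.0 / ((np.arange(1, 5) - 0.5) ** 2 * np.pi ** 2))
print(lam.sum())                              # trace ~ integral of min(s, s) ds = 1/2, so sum_i lambda_i < inf
# For V^{-1} x to exist we need sum_i lambda_i^{-2} <x, phi_i>^2 < inf, i.e. the
# coefficients of x must decay much faster than the lambda_i themselves.
```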

Inverse Wishart (cont.): So, even though it looks like it is going to be very difficult to make it work, is there some way to do so? Instead of trying to guess a prior for which an inverse Wishart will be a good finite-dimensional approximant, let's try another approach. Let's see if we can choose $d_m$ so that as $m \to \infty$, $\mathrm{InverseWishart}(d_m, B_m)$ converges (in some sense). It is very difficult working with the inverse Wishart: no m.g.f., and the ch.f. is unknown.

Inverse Wishart (cont.): In order to obtain our results, we define sampling and interpolation operators: $f_m = (f(t_1), \ldots, f(t_m))$, and $I f_m$ = the linear interpolant of $f_m$. Here, $(t_1, \ldots, t_m)$ is a regular grid. Note that $f \mapsto f_m$ is an operator from continuous functions to $m$-dimensional space, and $I$ goes the other way. Define an analogous sampling operator for functions of two variables: $B_m$ is an $m \times m$ matrix with $(i, j)$ entry equal to $B(t_i, t_j)$.

Inverse Wishart (cont.): Moment results: suppose $V_m \sim \mathrm{InverseWishart}(d_m, s_m B_m)$ and $f_m$ is obtained by sampling a continuous function $f$. Then, as long as $d_m/m \to a > 1$,
$$E[I V_m f_m] \to Bf/(a - 1) \quad \text{as } d_m, m \to \infty,$$
where $Bf(s) = \int B(s, t) f(t)\,dt$.

Inverse Wishart (cont.): Second moment results: suppose $V_m \sim \mathrm{InverseWishart}(d_m, s_m B_m)$ and $f_m$ and $g_m$ are obtained by sampling continuous functions $f$ and $g$. Again, as long as $d_m/m \to a > 1$,
$$E[I V_m f_m g_m^T V_m] \to Bf \, Bg/(a - 1)^2.$$
Thus, in some sense, we can get first and second moments to converge if we have $d_m/m$ converging (e.g., take $d_m = 2m$).
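
A rough Monte Carlo sanity check of the first-moment limit as stated above, using scipy's inverse Wishart with scale matrix $B_m$ (i.e., taking $s_m = 1$), $d_m = 2m$ (so $a = 2$), and the OU kernel for $B$; all of these choices are assumptions made for illustration, not taken from the talk:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(4)
m = 60
d_m = 2 * m                                     # so a = d_m / m = 2
t = (np.arange(m) + 0.5) / m
B = np.exp(-np.abs(t[:, None] - t[None, :]))    # B(s,t) = exp(-|s-t|) (the OU kernel)
f_m = np.sin(2.0 * np.pi * t)                   # sampled values of a continuous f

draws = invwishart.rvs(df=d_m, scale=B, size=2000, random_state=rng)   # V_m draws
E_Vf = (draws @ f_m).mean(axis=0)               # Monte Carlo estimate of E[V_m f_m]

Bf = (B @ f_m) / m                              # quadrature for Bf(s) = integral of B(s,t) f(t) dt
a = d_m / m
print(np.max(np.abs(E_Vf - Bf / (a - 1.0))))    # small, and shrinks as m (with d_m = 2m) grows
```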

The Bayesian posterior mean estimate under the inverse Wishart prior with m = 5, d_m = 1, obtained by Monte Carlo.

[Figure: surface plot titled "Posterior mean of the covariance using Inverse Wishart prior".]

Further research: The main interesting problem in the direct approach: find ways to approximate the prior using mixtures of inverse Wisharts. For the indirect approach: a nearly complete proof of weak convergence in the space of S-operators, but using a basis function expansion rather than grid evaluations. Must check the properties of this limiting measure.

The End