TTU, October 26, 2012

Karhunen-Loève Expansion and Optimal Low-Rank Model for Spatial Processes

Hao Zhang
Department of Statistics
Department of Forestry and Natural Resources
Purdue University
zhanghao@purdue.edu
Outline
- Motivations for studying the KL expansion in spatial statistics
- Numerical algorithms
- Some examples and results
- Applications to multivariate spatial processes
Spatial data
Y(s_i), s_i ∈ D ⊂ R^2 or R^3.
- Correlated
- Massive: satellite data, remotely sensed data, sensor networks, multi-source data
[Figure: scatter plot of sampling locations in the (X1, X2) plane.]
Spatial process
View the spatial data as a realization of an underlying spatial process: data = mean + error.
- Mean function: deterministic, a linear function of some covariates (e.g., latitude and longitude).
- Error process: a second-order process, often assumed to be stationary. The spatial correlation is attributable to unobserved latent variables; it accounts for the spatial variation unexplained by the mean.
Here I assume the mean is a constant (known or unknown).
Covariogram and Kriging
Covariogram: C(s_1, s_2) = Cov(Y(s_1), Y(s_2)).
- Stationary if C(s_1, s_2) = C(0, s_2 − s_1).
- Isotropic if C(s_1, s_2) = C(‖s_1 − s_2‖).
Kriging = best linear prediction: given Y(s_1), …, Y(s_n),

Ŷ(s_0) = E Y(s_0) + Σ_{i=1}^n α_i (Y(s_i) − E Y(s_i))
       = E Y(s_0) + Cov(Y(s_0), Y) [Var(Y)]^{−1} (Y − E Y).
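As a minimal sketch of the kriging predictor above, assuming a known constant mean and a hypothetical isotropic exponential covariogram (both are illustrative choices, not the talk's specific model):

```python
import numpy as np

def exp_cov(s1, s2, sigma2=1.0, phi=0.3):
    # hypothetical isotropic exponential covariogram C(h) = sigma2 * exp(-|h|/phi)
    h = np.linalg.norm(np.asarray(s1, float) - np.asarray(s2, float))
    return sigma2 * np.exp(-h / phi)

def simple_krige(s0, sites, y, mu, cov=exp_cov):
    # Yhat(s0) = mu + Cov(Y(s0), Y) [Var(Y)]^{-1} (Y - mu)
    n = len(sites)
    V = np.array([[cov(sites[i], sites[j]) for j in range(n)] for i in range(n)])
    c0 = np.array([cov(s0, si) for si in sites])
    return mu + c0 @ np.linalg.solve(V, np.asarray(y, float) - mu)
```

With no nugget effect, kriging interpolates exactly: predicting at an observed site returns the observed value.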
Large covariance matrix
- n is in the hundreds of thousands or millions.
- Data are correlated.
- The inverse of V_{n×n} is needed in full MLE or Bayesian inference, rendering the traditional methods impractical:

L_n(θ) = −(n/2) log(2π) − (1/2) log|V_n(θ)| − (1/2) (Y_n − Xβ)′ V_n^{−1}(θ) (Y_n − Xβ).

- The inverse matrix is also needed for prediction (kriging).
Large storage and memory
- For an n × n matrix, double-precision storage takes 8n² bytes. When n = 10^4, the storage is 800 MB! When n = 10^6, it is 8000 GB!
- Computing V_{n×n}^{−1} costs O(n³) unless V_{n×n} has a special structure.
Therefore, for massive spatial data, traditional geostatistical methods may not be applicable, and innovative methods are needed.
Low-rank models

Y(s) = µ + a(s)′ Z + ε(s),   (1)

where a(s) = (a_1(s), …, a_m(s))′ is a function of s that may depend on some parameters θ, Z = (Z(u_1), …, Z(u_m))′ is a vector of latent variables at pre-chosen locations u_i, and ε(s) is white noise with variance τ², independent of Z.
- Gaussian process convolution: a_j(s) = exp(−‖s − u_j‖²/φ)/√(2π), with the Z(u_i) i.i.d.
- Fixed rank kriging: a_j(s) = b(‖s − u_j‖²/φ) for some known basis function b, and Z = (Z(u_1), …, Z(u_m))′ is a realization of a latent process at the locations u_j, j = 1, …, m.
- Predictive process: a(s) = Var(Z)^{−1} Cov(Z, Z(s)), where Z(s) is a stationary Gaussian process with mean 0 and some parametric covariogram.
Covariance matrix of low rank
Observing Y = (Y(s_i), i = 1, …, n)′ from the low-rank model, write Y = µ1 + AZ + ε. The covariance matrix of Y is

V = A V_z A′ + τ² I_n,

where A is n × m and V_z = Var(Z).
- Woodbury formula: V^{−1} = τ^{−2} ( I_n − A (τ² V_z^{−1} + A′A)^{−1} A′ ).
- Memory requirement is 8(n + nm + m²) bytes, linear in n.
- Sylvester's determinant theorem: |V| = τ^{2(n−m)} |V_z| |τ² V_z^{−1} + A′A|.
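Both identities are easy to check numerically; the sketch below (with an arbitrary simulated A and V_z) compares them against the direct O(n³) computations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, tau2 = 200, 10, 0.5
A = rng.standard_normal((n, m))
L = rng.standard_normal((m, m))
Vz = L @ L.T + m * np.eye(m)                 # an SPD Var(Z)

V = A @ Vz @ A.T + tau2 * np.eye(n)          # n x n, direct form

# Woodbury: only the m x m matrix tau^2 Vz^{-1} + A'A is inverted
M = tau2 * np.linalg.inv(Vz) + A.T @ A
V_inv = (np.eye(n) - A @ np.linalg.solve(M, A.T)) / tau2

# Sylvester: log|V| = (n - m) log(tau^2) + log|Vz| + log|tau^2 Vz^{-1} + A'A|
logdet_lowrank = ((n - m) * np.log(tau2) + np.linalg.slogdet(Vz)[1]
                 + np.linalg.slogdet(M)[1])
```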
Optimal low-rank models
Given a random field Y(s), s ∈ D, consider the best rank-m model, i.e., minimize

∫_D E( Y(s) − Σ_{k=1}^m a_k(s) Z_k )² ds

over all functions a_k(s) and random variables Z_k.
Solution: a_k(s) = √λ_k f_k(s), where λ_k and f_k(s), k = 1, …, m, are the m largest eigenvalues and the corresponding eigenfunctions in the Karhunen-Loève expansion of the random field.
K-L expansion

Y(s) = Σ_{k=1}^∞ √λ_k f_k(s) Z_k,  s ∈ D,

where λ_1 ≥ λ_2 ≥ … ≥ 0, the Z_k are uncorrelated random variables with unit variance, and ∫_D f_j(s) f_k(s) ds = δ_jk. Hence

C(s, x) = Σ_{k=1}^∞ λ_k f_k(s) f_k(x)

and

∫_D C(s, x) f_k(x) dx = λ_k f_k(s).
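The identity C(s, x) = Σ λ_k f_k(s) f_k(x) can be illustrated with a case whose eigenpairs are known in closed form: Brownian motion on [0, 1], with C(s, x) = min(s, x):

```python
import numpy as np

# Brownian motion on [0,1]: C(s,x) = min(s,x), with known KL eigenpairs
#   lambda_k = 1/((k - 1/2) pi)^2,  f_k(t) = sqrt(2) sin((k - 1/2) pi t)
t = np.linspace(0.01, 0.99, 50)
K = 2000                                     # truncation level
k = np.arange(1, K + 1)
lam = 1.0 / ((k - 0.5) * np.pi) ** 2
F = np.sqrt(2.0) * np.sin(np.outer(t, (k - 0.5) * np.pi))  # F[i, j] = f_{j+1}(t_i)

C_trunc = (F * lam) @ F.T                    # sum_k lambda_k f_k(s) f_k(x)
C_true = np.minimum.outer(t, t)
max_err = np.max(np.abs(C_trunc - C_true))   # small truncation error
```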
Truncated KL
For any functions a_k(s) and zero-mean random variables Z_k, with D bounded,

∫_D E{ Y(s) − Σ_{k=1}^m √λ_k f_k(s) Z_k }² ds ≤ ∫_D E{ Y(s) − Σ_{k=1}^m a_k(s) Z_k }² ds.
Computation of eigenpairs
Given a covariance function C(s, x), s, x ∈ D ⊂ R^d, find λ > 0 and a function f such that

∫_D C(s, x) f(x) dx = λ f(s).

Approximation:
- Choose a basis {φ_i, i = 1, …, m} ⊂ L²(D).
- Approximate an eigenfunction by f(s) = Σ_{i=1}^m d_i φ_i(s), chosen so that ∫_D C(s, x) f(x) dx − λ f(s) is orthogonal to each basis function φ_j.
Galerkin approximation
Finding λ and d: for j = 1, …, m,

∫_D ( Σ_{i=1}^m d_i ∫_D C(s, x) φ_i(x) dx − λ Σ_{i=1}^m d_i φ_i(s) ) φ_j(s) ds = 0,

i.e.,

G d = λ B d,   (2)

where G is m × m with (i, j) element ∫_D ∫_D φ_i(s) C(s, x) φ_j(x) ds dx, B is m × m with (i, j) element ∫_D φ_i(s) φ_j(s) ds, and d = (d_1, …, d_m)′.
- Solve the generalized eigenvalue problem (2) to get λ and d.
- The m² elements of G are each 4-dimensional integrals.
(Ghanem and Spanos, 1991; Schwab and Todor, 2005; Phoon et al., 2002; Huang et al., 2001)
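A small worked instance of the generalized eigenvalue problem (2), using the Brownian-motion kernel C(s, x) = min(s, x) on D = [0, 1] (whose largest eigenvalue is 4/π²) and a monomial basis φ_i(s) = s^{i−1}, so that G and B are available in closed form:

```python
import numpy as np
from scipy.linalg import eigh

# Monomial Galerkin basis phi_i(s) = s^(i-1) on D = [0, 1].
m = 5
# B_ij = int_0^1 s^(i+j-2) ds  (a Hilbert matrix)
B = np.array([[1.0 / (i + j + 1) for j in range(m)] for i in range(m)])

def G_entry(a, b):
    # int_0^1 int_0^1 s^a x^b min(s, x) ds dx, in closed form
    return (1.0 / ((b + 2) * (a + b + 3))
            + (1.0 / (a + 2) - 1.0 / (a + b + 3)) / (b + 1))

G = np.array([[G_entry(i, j) for j in range(m)] for i in range(m)])

# Generalized eigenvalue problem G d = lambda B d (eigenvalues ascending)
lam, d = eigh(G, B)
# lam[-1] approximates the true largest eigenvalue 4/pi^2
```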
Gaussian Quadrature Method
Gauss-Legendre quadrature:

∫_{−1}^1 g(x) dx ≈ Σ_{i=1}^m r_i g(x_i).

- The approximation is exact for polynomials of degree 2m − 1 or less.
- The nodes x_i are the roots of the Legendre polynomial of order m, and the weights r_i are also computed from the Legendre polynomial.
- Extends to higher dimensions by tensor products:

∫_{−1}^1 ∫_{−1}^1 g(y_1, y_2) dy_1 dy_2 ≈ Σ_{i=1}^m Σ_{j=1}^m r_i r_j g(x_i, x_j).
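A quick check of the rule with NumPy's built-in Gauss-Legendre nodes and weights:

```python
import numpy as np

m = 5
x, r = np.polynomial.legendre.leggauss(m)   # nodes and weights on [-1, 1]

# exact for polynomials of degree <= 2m - 1 = 9:
q8 = np.sum(r * x ** 8)                     # int_{-1}^{1} t^8 dt = 2/9

# tensor-product rule in two dimensions:
q2d = sum(r[i] * r[j] * x[i] ** 2 * x[j] ** 4
          for i in range(m) for j in range(m))
# exact value: (2/3) * (2/5) = 4/15
```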
Example of Nodes
[Figure: the tensor-product Gauss-Legendre nodes on [−1, 1]² for m = 20.]
Gaussian Quadrature Method for KL
Given C(s, x), s, x ∈ [−1, 1]^d (d = 2), find λ and f such that

∫_{[−1,1]^d} C(s, x) f(x) dx = λ f(s).

Let u_i, i = 1, …, m, be the m roots of the Legendre polynomial, and let u_i, i = 1, …, m², denote the pairs (u_j, u_k), j, k = 1, …, m. Find f = (f(u_1), …, f(u_{m²}))′ by solving

Σ_{j=1}^{m²} w_j C(u_i, u_j) f(u_j) = λ f(u_i),  i = 1, …, m²,

where w_j = r_k r_l (the product of the quadrature weights) if u_j = (u_k, u_l).
Gaussian Quadrature Method for KL (continued)
Write V = (C(u_i, u_j)) and W = diag(w_1, …, w_{m²}). Then

V W f = λ f,

a generalized eigenvalue problem. Hence λ is an eigenvalue of the symmetric matrix W^{1/2} V W^{1/2}, and W^{1/2} f is the corresponding eigenvector.
Polynomial Interpolation
Having obtained f, we have the eigenfunction values f(u_i) at the nodes. The eigenfunction is then approximated on all of D by a polynomial through Lagrange interpolation.
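A sketch of the interpolation step, using SciPy's barycentric form of Lagrange interpolation at the Gauss-Legendre nodes (a known smooth test function stands in for a computed eigenfunction):

```python
import numpy as np
from scipy.interpolate import BarycentricInterpolator

# Values "obtained at the nodes", as f(u_i) would be
m = 15
x, _ = np.polynomial.legendre.leggauss(m)
f_nodes = np.sin(np.pi * x / 2.0)

p = BarycentricInterpolator(x, f_nodes)      # degree m-1 Lagrange interpolant
t = np.linspace(-1.0, 1.0, 101)
max_err = np.max(np.abs(p(t) - np.sin(np.pi * t / 2.0)))
```

For a smooth function, interpolation at the Gauss-Legendre nodes converges very fast; here `max_err` is negligible.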
Example 1
[Figure: six panels (a)-(f) comparing the true covariance function against the 5% quantile, median, and 95% quantile of its low-rank approximation, plotted against distance. Top row: Gaussian quadrature; bottom row: Galerkin projection.]
Example 2
[Figure: same layout as Example 1 for a second covariance function. Top row: Gaussian quadrature; bottom row: Galerkin projection.]
Simulation study
- Objective: compare the predictive performance of the different low-rank approximations.
- True model: a Gaussian process Y(s), s ∈ [0, 1]², with mean 0 and exponential covariance function exp(−h/0.3).
- We use three low-rank models to approximate the true model: KL by Galerkin projection (GPKL), KL by Gaussian quadrature (GQKL), and the predictive process (Banerjee et al., 2008).
- Simulate from the true model on an 81 × 81 lattice, but use each of the low-rank models to estimate the model parameters.
Predictive Performance
- MSE: (1/n) Σ_{i=1}^n (Y(s_i) − Ŷ(s_i))².
- Logarithmic score (LogS): with Z_i = (Y(s_i) − Ŷ(s_i))/σ̂(s_i),

LogS = (1/n) Σ_{i=1}^n ( (1/2) log(2π σ̂(s_i)²) + (1/2) Z_i² ).

- Continuous ranked probability score:

crps(F, x) = ∫ (F(y) − 1{y ≥ x})² dy,  CRPS = (1/n) Σ_{i=1}^n crps(F_i, Y(s_i)).

(Gneiting et al., 2006, 2007)
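Assuming Gaussian predictive distributions N(µ_i, σ_i²), as kriging produces, all three scores can be computed in closed form; the Gaussian closed form of the CRPS is from the scoring-rule literature cited above:

```python
import numpy as np
from scipy.stats import norm

def predictive_scores(y, mu, sigma):
    """MSE, LogS and CRPS for Gaussian predictive distributions N(mu, sigma^2)."""
    y, mu, sigma = (np.asarray(v, float) for v in (y, mu, sigma))
    z = (y - mu) / sigma
    mse = np.mean((y - mu) ** 2)
    logs = np.mean(0.5 * np.log(2 * np.pi * sigma ** 2) + 0.5 * z ** 2)
    # closed form of crps(F, x) when F is Gaussian
    crps = np.mean(sigma * (z * (2 * norm.cdf(z) - 1)
                            + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi)))
    return mse, logs, crps
```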
150 eigen terms
[Figure: (a) MSE scores, (b) CRPS scores, (c) LogS scores over 50 simulations for GQKL, PP, and GPKL using 150 eigen terms.]
300 eigen terms
[Figure: (a) MSE scores, (b) CRPS scores, (c) LogS scores over 50 simulations for GQKL, PP, and GPKL using 300 eigen terms.]
600 eigen terms
[Figure: (a) MSE scores, (b) CRPS scores, (c) LogS scores over 50 simulations for GQKL, PP, and GPKL using 600 eigen terms.]
Multivariate covariogram
Two processes Y_i(s), i = 1, 2.
- Marginal or direct covariogram: C_ii(s, x) = Cov(Y_i(s), Y_i(x)).
- Cross covariogram: C_ij(s, x) = Cov(Y_i(s), Y_j(x)), i ≠ j.
- The matrix-valued function C(s, x) = (C_ij(s, x)) must satisfy: for any s_1, …, s_n and any vectors α_i ∈ R^p,

Σ_i Σ_j α_i′ C(s_i, s_j) α_j ≥ 0.

It is not trivial to specify a cross covariogram.
Some existing models
- Proportional model: C_ij(s, x) = σ_ij ρ(s, x).
- Linear model of coregionalization: a sum of proportional models.
- Convolution models: Gaspari and Cohn (1999), Gelfand et al. (2004).
- Asymmetric models: Apanasovich and Genton (2009), Li and Zhang (2010).
- Multivariate Matérn model: Gneiting et al. (2010).
Limitations of existing models
- Estimation of parameters becomes more complex; constrained maximization is required.
- They do not allow arbitrary marginal covariogram models.
- There are cases where the multivariate covariogram model does not improve over the univariate model in predictive performance.
Valid cross covariograms for given marginals
The problem: given two marginal covariograms C_ii(s, x), s, x ∈ R^d, i = 1, 2, find the class of all valid cross covariograms.
A theoretical solution: apply the Karhunen-Loève expansion to the marginals to write

C_ii(s, x) = Σ_{k=1}^∞ λ_ik f_ik(s) f_ik(x),  s, x ∈ D,

where, for each i, the λ_ik form a decreasing sequence of positive numbers, ∫_D f_ik(s)² ds = 1, and D is a compact subset of R^d. The class of all possible valid cross covariograms is

C_ij(s, x) = Σ_{k=1}^∞ Σ_{l=1}^∞ r_kl √(λ_ik λ_jl) f_ik(s) f_jl(x),

where r_kl ∈ [−1, 1] are such that the matrix (r_kl)_{k,l=1}^n is non-negative definite for any n. Hence all cross covariograms are given by such an array {r_kl}.
Non-parametric cross covariogram
Consider a bivariate process. Write the KL expansion for each process:

Y_i(s) = Σ_{k=1}^∞ √λ_ik f_ik(s) Z_ik,  s ∈ D,

where Z_ik, k = 1, 2, …, are white noise (i = 1, 2). In the special case where each Z_1j and Z_2k are either uncorrelated or perfectly correlated, we get the non-parametric cross covariogram

C_12(s, x) = Σ_{(j,k): Cov(Z_1j, Z_2k) = 1} √(λ_1j λ_2k) f_1j(s) f_2k(x).
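A sketch of assembling C_12 on a grid from the marginal eigenpairs and a chosen set of perfectly correlated pairs (the function name and interface are illustrative):

```python
import numpy as np

def cross_cov(lam1, F1, lam2, F2, pairs):
    """C_12 on a grid: lam_i are marginal eigenvalues, F_i[:, k] holds the
    k-th eigenfunction at the grid points, and `pairs` lists the (j, k)
    with Cov(Z_1j, Z_2k) = 1 (all other cross-correlations are 0)."""
    C12 = np.zeros((F1.shape[0], F2.shape[0]))
    for j, k in pairs:
        C12 += np.sqrt(lam1[j] * lam2[k]) * np.outer(F1[:, j], F2[:, k])
    return C12
```

If the two marginals share the same eigenpairs and every pair (k, k) is perfectly correlated, C_12 reduces to the common marginal covariance.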
Two practical steps
For the non-parametric cross covariogram to work in practice, we need:
- an efficient algorithm to compute the eigenpairs (λ_k, f_k(s)) for any given covariogram;
- a way to find the pairs (j, k) such that Cov(Z_1j, Z_2k) = 1.
Inferences
- Estimate the parameters in each marginal model using only the marginal data.
- Since the cross covariogram involves no new parameters, we only need to identify the perfectly correlated pairs Z_1j and Z_2k. This is equivalent to specifying the correlation matrix Corr(Z_1, Z_2) for Z_i = (Z_i1, …, Z_im)′, which has at most m ones with the other elements being 0.
- Start with Corr(Z_1, Z_2) = 0. Replace one 0 element with 1 to optimize some objective function (e.g., likelihood or a prediction score). Fixing the non-zero elements, replace one more 0 element with 1 to optimize the objective function. Iterate.
The non-parametric cross covariogram leads to a much simplified estimation procedure.
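The iteration described above is a greedy forward selection; a minimal sketch, with the objective `score` supplied by the user and the at-most-one-1-per-row-and-column constraint enforced via the `used` sets:

```python
def greedy_pairs(score, m, max_pairs):
    """Greedy forward selection of perfectly correlated pairs (j, k).
    `score(pairs)` is a user-supplied objective to maximize (e.g. joint
    likelihood or a prediction score); each Z_1j and each Z_2k may appear
    in at most one pair, matching the at-most-m-ones constraint."""
    pairs, used1, used2 = [], set(), set()
    best = score(pairs)
    for _ in range(max_pairs):
        cand = [(j, k) for j in range(m) for k in range(m)
                if j not in used1 and k not in used2]
        if not cand:
            break
        s, p = max((score(pairs + [q]), q) for q in cand)
        if s <= best:                 # no further improvement: stop
            break
        best = s
        pairs.append(p)
        used1.add(p[0])
        used2.add(p[1])
    return pairs
```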
An example
Wang and Zhang (2012): let W_i(s), s ∈ [0, 1]², i = 1, 2, be two latent processes with mean 0 and a bivariate exponential covariogram, i.e.,

Cov(W_i(s + h), W_j(s)) = σ_ij exp(−‖h‖/φ_ij).

We observe the processes

Y_i(s) = W_i(s) + τ_i ε_i(s)

at the 1225 equally spaced locations {(i/35, j/35), i, j = 1, …, 35}. The covariogram parameters are chosen to be φ_11 = 0.2, φ_22 = 0.3, φ_12 = 0.2, σ_12 = 5.4, σ_11 = σ_22 = 9, and τ_1² = τ_2² = 1.
Model fitting
Fit two models: the true model and the truncated KL expansion with m = 600 terms.
- For the true model, we estimate the parameters by maximizing the joint likelihood.
- For the truncated KL model, we estimate the marginal parameters by maximizing the marginal likelihoods and identify the perfectly correlated pairs by maximizing the joint likelihood.
Predictive performance
Predictive scores: MSE and logarithmic score, for i = 1, 2:

MSE^(i) = (1/n) Σ_{k=1}^n ( Y_i(s_k) − Ŷ_i(s_k) )²,

LogS^(i) = (1/n) Σ_{k=1}^n ( (1/2) log(2π σ̂_ik²) + (1/2) z_ik² ),

where Ŷ_i(s_k) denotes the cokriging prediction of Y_i(s_k) using the data at all locations except s_k, σ̂_ik the corresponding cokriging standard deviation, and z_ik = (Y_i(s_k) − Ŷ_i(s_k))/σ̂_ik.
Among 100 simulations, the non-parametric cross covariogram model beats the fitted bivariate parametric model in every simulation according to both predictive scores.