Hierarchical matrix techniques for maximum likelihood covariance estimation
1 Hierarchical matrix techniques for maximum likelihood covariance estimation. Alexander Litvinenko, Extreme Computing Research Center and Uncertainty Quantification Center, KAUST (joint work with M. Genton, Y. Sun and D. Keyes)
2 The structure of the talk. 1. Motivation. 2. Hierarchical matrices [Hackbusch]. 3. Matérn covariance function. 4. Uncertain parameters of the covariance function: 4.1 uncertain covariance length; 4.2 uncertain smoothness parameter. 5. Identification of these parameters by maximizing the log-likelihood.
3 Motivation, problem 1. Task: predict temperature, velocity and salinity, and estimate the parameters of the covariance. Grid: 50Mi locations on 50 levels, (X*Y*Z) + X*Y = 500*500*… = 50Mi. High-resolution time-dependent data for the Red Sea: zonal velocity and temperature.
4 Motivation, problem 2. Task: predict moisture, compute the covariance, estimate its parameters. 2D grid: 2.5Mi locations with 2.1Mi observations and 278K missing values. Figure: soil moisture over latitude and longitude. High-resolution daily soil-moisture data at the top layer of the Mississippi basin, U.S.A. (Chaney et al., in review). Important for agriculture and defense; moisture is very heterogeneous.
5 Motivation, estimation of uncertain parameters. Figure: box-plots of the estimated covariance length l (domain [0, 1]^2) vs. different H-matrix ranks k = {3, 7, …}. Which H-matrix rank is sufficient for identification of the parameters of a particular type of covariance matrix?
6 Motivation for H-matrices. A general dense matrix requires O(n^2) storage and up to O(n^3) time (e.g. for factorization) — too expensive for large n. If the covariance matrix is structured (diagonal, Toeplitz, circulant) then we can apply, e.g., the FFT with O(n log n) complexity — but what if it is not?
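To make the structured case above concrete: a circulant matrix is diagonalized by the DFT, so a matrix-vector product costs O(n log n). A minimal sketch (the function name is ours, not from the talk):

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix C = circ(c) by x in O(n log n) via FFT.

    c is the first column of C; C @ x is the circular convolution of c and x,
    whose DFT is the elementwise product fft(c) * fft(x).
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Compare against the dense O(n^2) product on a small example.
n = 8
c = np.exp(-np.arange(n) / 3.0)          # first column of the circulant matrix
C = np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])
x = np.random.default_rng(0).standard_normal(n)
assert np.allclose(C @ x, circulant_matvec(c, x))
```

The same trick underlies FFT-based covariance arithmetic on regular grids; H-matrices are the fallback when no such algebraic structure is available.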
7 Hierarchical (H)-matrices. Introduction to the hierarchical (H)-matrix technique.
8 Examples of H-matrix approximations. Figure: three examples of H-matrix approximations; (left, middle) C ∈ R^{n×n}, n = 2^10, of the discretised covariance function cov(x, y) = e^{−r}, l_1 = 0.15, l_2 = 0.2, x, y ∈ [0, 1]^2; (right) soil moisture from the example above with … dofs. The biggest dense (dark) blocks are in R^{32×32}; max. rank k = 4 on the left, k = … in the middle, and k = … on the right.
9 Matérn covariance functions. C_θ(r) = (2σ^2 / Γ(ν)) (r / (2l))^ν K_ν(r / l), θ = (σ^2, ν, l), where K_ν is the modified Bessel function of the second kind.
10 Examples of Matérn covariance matrices. C_{ν=3/2}(r) = (1 + √3 r/l) exp(−√3 r/l) (1); C_{ν=5/2}(r) = (1 + √5 r/l + 5r^2/(3l^2)) exp(−√5 r/l) (2). ν = 1/2 gives the exponential covariance function C_{ν=1/2}(r) = exp(−r); ν → ∞ gives the Gaussian covariance function C_{ν=∞}(r) = exp(−r^2).
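The half-integer cases (1)-(2) have closed forms and can be evaluated directly. A minimal sketch using the √(2ν)·r/l scaling of (1)-(2) (the function name is ours):

```python
import numpy as np

def matern(r, nu, ell=1.0, sigma2=1.0):
    """Matérn covariance for half-integer smoothness nu in {0.5, 1.5, 2.5},
    using the closed forms with s = sqrt(2*nu) * r / ell."""
    r = np.asarray(r, dtype=float)
    s = np.sqrt(2.0 * nu) * r / ell
    if nu == 0.5:
        poly = 1.0                       # exponential covariance
    elif nu == 1.5:
        poly = 1.0 + s                   # form (1)
    elif nu == 2.5:
        poly = 1.0 + s + s**2 / 3.0      # form (2): s^2/3 = 5 r^2 / (3 l^2)
    else:
        raise ValueError("closed form only for nu in {0.5, 1.5, 2.5}")
    return sigma2 * poly * np.exp(-s)

r = np.linspace(0.0, 2.0, 21)
# nu = 1/2 reduces to the exponential covariance exp(-r / ell)
assert np.allclose(matern(r, 0.5, ell=0.7), np.exp(-r / 0.7))
```

Larger ν makes the field smoother; in the limit the kernel approaches the Gaussian covariance, as stated on the slide.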
11 Identifying uncertain parameters.
12 Identifying uncertain parameters. Given: a vector of measurements z = (z_1, ..., z_n)^T with covariance matrix C(θ) = C(σ^2, ν, l), where C_θ(r) = (2σ^2 / Γ(ν)) (r / (2l))^ν K_ν(r / l), θ = (σ^2, ν, l). To identify: the uncertain parameters (σ^2, ν, l). Plan: maximize the log-likelihood function L(θ) = −(1/2) (N log 2π + log det{C(θ)} + z^T C(θ)^{−1} z). On each iteration i we have a new matrix C(θ_i).
13 Other works. 1. S. Ambikasaran et al., Fast direct methods for Gaussian processes and the analysis of NASA Kepler mission data, arXiv (2014). 2. S. Ambikasaran, J. Y. Li, P. K. Kitanidis and E. Darve, Large-scale stochastic linear inversion using hierarchical matrices, Computational Geosciences. 3. J. Ballani and D. Kressner, Sparse inverse covariance estimation with hierarchical matrices (2015). 4. M. Bebendorf, Why approximate LU decompositions of finite element discretizations of elliptic operators can be computed with almost linear complexity (2007). 5. S. Börm and J. Garcke, Approximating Gaussian processes with H2-matrices. 6. J. E. Castrillon, M. G. Genton and R. Yokota, Multi-level restricted maximum likelihood covariance estimation and kriging for large non-gridded spatial datasets (2015). 7. J. Doelz, H. Harbrecht and C. Schwab, Covariance regularity and H-matrix approximation for rough random fields, ETH Zürich. 8. H. Harbrecht et al., Efficient approximation of random fields for numerical applications, Numerical Linear Algebra with Applications (2015). 9. C.-J. Hsieh et al., BIG & QUIC: sparse inverse covariance estimation for a million variables. 10. J. Quinonero-Candela et al., A unifying view of sparse approximate Gaussian process regression (2005). 11. A. Saibaba, S. Ambikasaran, J. Yue Li, P. Kitanidis and E. Darve, Application of hierarchical matrices to linear inverse problems in geostatistics, Oil & Gas Science and Technology (2012).
14 Convergence of the optimization method.
15 Details of the identification. To maximize the log-likelihood function we use Brent's method [Brent 73], which combines the bisection method, the secant method and inverse quadratic interpolation. 1. Approximate C(θ) by C_H(θ, k). 2. Compute the H-Cholesky factorization C_H(θ, k) = L L^T. 3. z^T C^{−1} z = z^T (L L^T)^{−1} z = v^T v, where v solves L(θ, k) v(θ) = z. 4. Let λ_i be the diagonal elements of the H-Cholesky factor L; then log det{C} = log det{L L^T} = log ∏_{i=1}^n λ_i^2 = 2 Σ_{i=1}^n log λ_i, so L(θ, k) = −(N/2) log(2π) − Σ_{i=1}^N log L_ii(θ, k) − (1/2) v(θ)^T v(θ). (3)
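The four steps above can be sketched with a dense Cholesky factorization standing in for the H-Cholesky (same formulas, without the hierarchical compression; the function name is ours):

```python
import numpy as np

def neg_log_likelihood(C, z):
    """Negative log-likelihood via Cholesky C = L L^T, a dense stand-in for
    H-Cholesky: log det C = 2 * sum_i log(L_ii), and
    z^T C^{-1} z = v^T v where v solves L v = z."""
    L = np.linalg.cholesky(C)            # lower-triangular Cholesky factor
    v = np.linalg.solve(L, z)            # forward solve of the triangular system
    n = len(z)
    return (0.5 * n * np.log(2 * np.pi)
            + np.sum(np.log(np.diag(L)))  # = 0.5 * log det C
            + 0.5 * (v @ v))              # = 0.5 * z^T C^{-1} z

# Check against the direct formula with slogdet and an explicit solve.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
C = A @ A.T + 6 * np.eye(6)              # SPD covariance
z = rng.standard_normal(6)
direct = 0.5 * (6 * np.log(2 * np.pi)
                + np.linalg.slogdet(C)[1]
                + z @ np.linalg.solve(C, z))
assert np.isclose(neg_log_likelihood(C, z), direct)
```

In the talk the dense Cholesky is replaced by the H-Cholesky of C_H(θ, k), which brings the cost per likelihood evaluation down from cubic to almost linear in n.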
16 Shape of the log-likelihood(θ). Figure: log(det(C)), z^T C^{−1} z and the log-likelihood plotted against the parameter θ (truth θ* = 12). The minimum of the negative log-likelihood (black) is at l ≈ 12 (σ^2 and ν are fixed).
17 Convergence of H-matrix approximations. Figure: relative error (spectral and Frobenius norms) vs. H-matrix rank k, for ν = 1 (left) and ν = 0.5 (right) and different covariance lengths l = {0.1, 0.2, 0.5}.
18 Convergence of H-matrix approximations. Figure: relative error (spectral norm) vs. H-matrix rank k, for ν = {1.5, 1, 0.5} with l = 0.1 (left) and l = 0.5 (right).
19 What will change? Approximate C by C_H. 1. How do the eigenvalues of C and C_H differ? 2. How does det(C) differ from det(C_H)? [Below] 3. How does L differ from L_H? [Mario Bebendorf et al.] 4. How does C^{−1} differ from (C_H)^{−1}? [Mario Bebendorf et al.] 5. How does L(θ, k) differ from L(θ)? [Below] 6. What is the optimal H-matrix rank? [Below] 7. How does θ_H differ from θ? [Below] For theory and estimates of rank and accuracy, see the works of Bebendorf, Grasedyck, Le Borne, Hackbusch, ...
20 Remark. For a small H-matrix rank k the H-Cholesky factorization of C_H breaks down when eigenvalues of C come very close to zero. One remedy is to increase the rank k; in our example with n = 65^2 we increased k from 7 to …. Alternatively, to avoid this instability we can regularize, C_H^m = C_H + δ^2 I. If λ_i are the eigenvalues of C_H, then the eigenvalues of C_H^m are λ_i + δ^2, and log det(C_H^m) = log ∏_{i=1}^n (λ_i + δ^2) = Σ_{i=1}^n log(λ_i + δ^2). (4)
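Identity (4) is easy to verify on a deliberately singular dense matrix, where the unregularized log-determinant would be −∞ (a small synthetic sketch, not the talk's H-matrix setting):

```python
import numpy as np

# Shifting C_H to C_H + delta^2 * I moves every eigenvalue lambda_i to
# lambda_i + delta^2, so the log-determinant in (4) follows from the
# shifted spectrum even when C_H itself is singular.
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
C = A @ A.T                              # rank-3 PSD matrix: det(C) = 0
delta2 = 1e-2
lam = np.linalg.eigvalsh(C)              # eigenvalues (two are numerically ~0)
logdet_reg = np.sum(np.log(lam + delta2))
assert np.isclose(np.linalg.slogdet(C + delta2 * np.eye(5))[1], logdet_reg)
```

The same shift keeps the (H-)Cholesky factorization well defined, at the price of a small bias δ^2 on the diagonal.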
21 Error analysis. Theorem (existence of the H-matrix inverse, [Bebendorf 11], [Ballani, Kressner 14]): under certain conditions an H-matrix inverse exists with ‖C_H^{−1} − C^{−1}‖ ≤ ε ‖C^{−1}‖ (5), and theoretical estimates for the rank k_inv of C_H^{−1} are given. Theorem (error in log det): let E := C − C_H, so that (C_H)^{−1} E = (C_H)^{−1} C − I, and assume the spectral radius satisfies ρ((C_H)^{−1} E) = ρ((C_H)^{−1} C − I) ≤ ε. (6) Then |log det(C) − log det(C_H)| ≤ −p log(1 − ε). Proof: see [Ballani, Kressner 14], [Ipsen 05].
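The log-det bound of the second theorem can be checked numerically on a small dense example, with a synthetic rank-1 perturbation standing in for the H-matrix approximation error E (this is an illustration of the inequality, not of an actual H-matrix compression):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 8))
C = A @ A.T + 8 * np.eye(8)              # SPD "exact" covariance
u = rng.standard_normal(8)
E = 1e-3 * np.outer(u, u)                # symmetric perturbation, E := C - C_H
CH = C - E                               # "approximate" covariance, still SPD

# eps bounds the spectral radius of (C_H)^{-1} E = (C_H)^{-1} C - I, as in (6).
eps = np.max(np.abs(np.linalg.eigvals(np.linalg.solve(CH, C) - np.eye(8))))
gap = abs(np.linalg.slogdet(C)[1] - np.linalg.slogdet(CH)[1])
bound = -8 * np.log(1 - eps)             # -p*log(1-eps) with p = n = 8
assert eps < 1 and gap <= bound
```

Since |log(1 + μ)| ≤ −log(1 − ε) for every eigenvalue μ of (C_H)^{−1}C − I with |μ| ≤ ε < 1, summing over the n eigenvalues gives exactly the stated bound.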
22 How sensitive is the log-likelihood to the H-matrix rank? It is hardly sensitive at all: the H-matrix approximation changes the function L(θ, k) and the estimate of θ only very slightly. Table: comparison of three likelihood functions θ ↦ L(θ), computed with different H-matrix ranks — exact, H-rank 7, H-rank 20. Exponential covariance function with covariance length l = 0.…, domain G = [0, 1]^2.
23 How sensitive is the log-likelihood to the H-matrix rank? Figure: three negative log-likelihood functions vs. θ — exact, and computed with H-matrix ranks 7 and 17. One can see that even with rank 7 one achieves very accurate results.
24 Do we need all measurements? Figure: boxplots of the estimated covariance length vs. the number of measurements n = {1000, ..., 32000}, moisture data. The mean and median are obtained by averaging 100 simulations.
25 Decreasing error bars with increasing number of measurements. Figure: error bars (mean ± st. dev.) computed for different n. The error bars shrink as the number of measurements/dimension grows, n = {17^2, 33^2, 65^2}. The mean and median are obtained by averaging 200 simulations.
26 H-matrix approximation is robust w.r.t. the parameter ν. Figure: dependence of the H-matrix approximation error on ν — relative error ‖C − C_H‖_2 / ‖C_H‖_2 vs. the smoothness parameter ν. H-matrix rank k = 8, n = 16641, Matérn covariance matrix.
27 H-matrix approximation is robust w.r.t. the covariance length l. Figure: dependence of the H-matrix approximation error on l — relative error ‖C − C_H‖_2 / ‖C_H‖_2 vs. the covariance length l. H-matrix rank k = 8, n = 16641, Matérn covariance matrix.
28 log-determinant of C. Figure: H-matrix approximation of log det(C) (left) and of the negative log-likelihood L (right) vs. the covariance length; σ^2 = 1, ν = 0.5. The red line uses rank k = 5 everywhere; the blue line uses k = 3 on [0.01, 0.3], k = 4 on [0.3, 0.6] and k = 5 on [0.6, 1]. The rank k = 3 is sufficient to approximate C, but insufficient to approximate C^{−1} on the whole interval [0.01, 1]. The first numerical instability appears at l ≈ 0.3; to avoid it the rank k is increased by 1, until the second instability appears at l ≈ 0.6.
29 H-matrix approximation of z^T C^{−1} z and of the log-likelihood. Figure: H-matrix approximation of z^T C^{−1} z and of the log-likelihood L vs. the parameter ν; σ^2 = 1, and k increases from 5 to 12 after each jump. The red line uses rank k = 5 everywhere; the blue line uses k = 5 on [0.1, 1.42], k = 6 on [1.42, 1.57], k = 7 on [1.57, …], k = 8 on […, 2.24] and k = {9, 10, 12} on [2.24, 3.14]. The rank k has to be increased to approximate C^{−1}. The first numerical instability appears at ν ≈ 1.42; to avoid it, the rank k is increased by 1 until the next instability appears, and so on.
30 Time profiling, C language.
31 Parallel implementation with HLIBpro (R. Kriemann). We used HLIBpro to set up the exponential covariance matrix (covariance length 1), to compute its Cholesky factorization and to compute its inverse, with adaptive-rank arithmetic at per-block accuracies ε = 10^{−4} and ε = 10^{−8} for C_H. Number of processing cores: 40. We took the moisture data (see above) with N points. Table: for each N — the compression rate (%), and the time (sec.) and size (MB) to set up C_H, to compute the H-Cholesky factorization L L^T (accuracy ε_1) and to compute the inverse (accuracy ε_2). Here ε_1 := ‖I − (L L^T)^{−1} C‖_2, where L and C are H-matrices and I is the identity matrix; ε_2 := ‖I − B C‖_2, where B is an H-matrix approximation of C^{−1}.
32 Take into account the gradient. ∂L(θ)/∂θ_i = (1/2) tr(C^{−1} ∂C/∂θ_i) − (1/2) z^T C^{−1} (∂C/∂θ_i) C^{−1} z. (7) For an exponential random field we have ∂C/∂l = ∂/∂l exp(−‖x − y‖_2 / l) = (‖x − y‖_2 / l^2) exp(−‖x − y‖_2 / l). (8) Writing ∂C(θ_i)/∂θ_i =: C_2, this reads ∂L/∂θ_i = (1/2) tr(C^{−1} C_2) − (1/2) z^T C^{−1} C_2 C^{−1} z.
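Formulas (7)-(8) can be sanity-checked densely against a finite difference; note that with these signs (7) is the gradient of the negative log-likelihood. A minimal sketch (the helper names are ours):

```python
import numpy as np

def exp_cov(X, ell):
    """Exponential covariance C_ij = exp(-||x_i - x_j||_2 / ell)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.exp(-D / ell)

def neg_ll(C, z):
    """Negative log-likelihood 0.5 * (n log 2pi + log det C + z^T C^{-1} z)."""
    return 0.5 * (len(z) * np.log(2 * np.pi)
                  + np.linalg.slogdet(C)[1]
                  + z @ np.linalg.solve(C, z))

def neg_ll_grad_ell(X, ell, z):
    """Gradient (7): 0.5*tr(C^{-1} dC) - 0.5*z^T C^{-1} dC C^{-1} z,
    with dC/d ell from (8): (||x - y||/ell^2) * exp(-||x - y||/ell)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    C = np.exp(-D / ell)
    dC = (D / ell**2) * C                # elementwise derivative of C w.r.t. ell
    w = np.linalg.solve(C, z)            # C^{-1} z
    return 0.5 * np.trace(np.linalg.solve(C, dC)) - 0.5 * w @ dC @ w

# Verify the analytic gradient against a central finite difference.
X = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
z = np.sin(np.arange(9.0))               # fixed synthetic data
ell, h = 1.0, 1e-6
fd = (neg_ll(exp_cov(X, ell + h), z) - neg_ll(exp_cov(X, ell - h), z)) / (2 * h)
assert np.isclose(neg_ll_grad_ell(X, ell, z), fd, rtol=1e-4)
```

In the H-matrix setting the trace term is the expensive part, since it involves products and solves with ∂C/∂θ_i; this is what a gradient-based version of the optimizer would have to approximate.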
33 Conclusion. Covariance matrices can be approximated in the H-matrix format. Hypothesis: the H-matrix approximation is robust w.r.t. ν and l. The influence of the H-matrix approximation error on the estimated parameters is small. H-matrices extend the class of covariance functions we can work with and allow non-regular discretizations of the covariance function on large spatial grids. With the maximization algorithm we are able to identify both parameters: the covariance length l and the smoothness ν.
34 Future plans. Parallel H-Cholesky for very large covariance matrices on non-regular grids. Preconditioning for the log-likelihood to decrease cond(C). Domain decomposition for large domains, with an H-matrix in each sub-domain. Apply H-matrices for: 1. the kriging estimate ŝ := C_sy C_yy^{−1} y; 2. the estimation of the variance σ̂, the diagonal of the conditional covariance matrix C_ss|y = diag(C_ss − C_sy C_yy^{−1} C_ys); 3. geostatistical optimal design, φ_A := n^{−1} trace(C_ss|y), φ_C := c^T (C_ss − C_sy C_yy^{−1} C_ys) c. Implement a gradient-based version. Compare with the Bayesian update (H. Matthies, H. Najm, K. Law, A. Stuart et al.).
35 Literature. 1. B. N. Khoromskij, A. Litvinenko, H. G. Matthies, Application of hierarchical matrices for computing the Karhunen-Loève expansion, Computing 84 (1-2), 2009. 2. B. V. Rosić, A. Kučerová, J. Sýkora, O. Pajonk, A. Litvinenko, H. G. Matthies, Parameter identification in a probabilistic setting, Engineering Structures 50, 2013. 3. V. Berikov, A. Litvinenko, Methods for statistical data analysis with decision trees, Sobolev Institute of Mathematics, Novosibirsk. 4. H. G. Matthies, A. Litvinenko, O. Pajonk, B. V. Rosić, E. Zander, Parametric and uncertainty computations with tensor product representations, Uncertainty in Scientific Computing. 5. B. N. Khoromskij, A. Litvinenko, Data sparse computation of the Karhunen-Loève expansion, AIP Conference Proceedings 1048 (1), 2008. 6. W. Nowak, A. Litvinenko, Kriging and spatial design accelerated by orders of magnitude: combining low-rank covariance approximations with FFT-techniques, Mathematical Geosciences 45 (4), 2013.
36 Acknowledgement. 1. Lars Grasedyck (RWTH Aachen) and Steffen Börm (Uni Kiel) for HLIB. 2. Ronald Kriemann (MPI Leipzig) for HLIBpro. 3. KAUST Research Computing group, KAUST Supercomputing Lab (KSL).
37 Matérn fields (Whittle, 1963). Taken from D. Simpson (see also Finn Lindgren, Håvard Rue, David Bolin, ...). Theorem: the covariance function of a Matérn field, c(x, y) = (1 / (Γ(ν + d/2) (4π)^{d/2} κ^{2ν} 2^{ν−1})) (κ ‖x − y‖)^ν K_ν(κ ‖x − y‖), (9) is the Green's function of the differential operator (κ^2 − Δ)^{ν+d/2}. (10)
More informationPrincipal Component Analysis CS498
Principal Component Analysis CS498 Today s lecture Adaptive Feature Extraction Principal Component Analysis How, why, when, which A dual goal Find a good representation The features part Reduce redundancy
More informationHands-on Matrix Algebra Using R
Preface vii 1. R Preliminaries 1 1.1 Matrix Defined, Deeper Understanding Using Software.. 1 1.2 Introduction, Why R?.................... 2 1.3 Obtaining R.......................... 4 1.4 Reference Manuals
More informationPlatzhalter für Bild, Bild auf Titelfolie hinter das Logo einsetzen
Platzhalter für Bild, Bild auf Titelfolie hinter das Logo einsetzen PARAMetric UNCertainties, Budapest STOCHASTIC PROCESSES AND FIELDS Noémi Friedman Institut für Wissenschaftliches Rechnen, wire@tu-bs.de
More informationMultilevel accelerated quadrature for elliptic PDEs with random diffusion. Helmut Harbrecht Mathematisches Institut Universität Basel Switzerland
Multilevel accelerated quadrature for elliptic PDEs with random diffusion Mathematisches Institut Universität Basel Switzerland Overview Computation of the Karhunen-Loéve expansion Elliptic PDE with uniformly
More informationFrequentist-Bayesian Model Comparisons: A Simple Example
Frequentist-Bayesian Model Comparisons: A Simple Example Consider data that consist of a signal y with additive noise: Data vector (N elements): D = y + n The additive noise n has zero mean and diagonal
More informationHierarchical Modelling for Univariate Spatial Data
Spatial omain Hierarchical Modelling for Univariate Spatial ata Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A.
More informationApproximating the Covariance Matrix with Low-rank Perturbations
Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu
More informationStatistics & Data Sciences: First Year Prelim Exam May 2018
Statistics & Data Sciences: First Year Prelim Exam May 2018 Instructions: 1. Do not turn this page until instructed to do so. 2. Start each new question on a new sheet of paper. 3. This is a closed book
More informationNon-linear least squares
Non-linear least squares Concept of non-linear least squares We have extensively studied linear least squares or linear regression. We see that there is a unique regression line that can be determined
More informationCirculant embedding of approximate covariances for inference from Gaussian data on large lattices
Circulant embedding of approximate covariances for inference from Gaussian data on large lattices Joseph Guinness and Montserrat Fuentes Abstract Recently proposed computationally efficient Markov chain
More informationNumerical Methods in Matrix Computations
Ake Bjorck Numerical Methods in Matrix Computations Springer Contents 1 Direct Methods for Linear Systems 1 1.1 Elements of Matrix Theory 1 1.1.1 Matrix Algebra 2 1.1.2 Vector Spaces 6 1.1.3 Submatrices
More informationDouglas Nychka, Soutir Bandyopadhyay, Dorit Hammerling, Finn Lindgren, and Stephan Sain. October 10, 2012
A multi-resolution Gaussian process model for the analysis of large spatial data sets. Douglas Nychka, Soutir Bandyopadhyay, Dorit Hammerling, Finn Lindgren, and Stephan Sain October 10, 2012 Abstract
More informationCross-covariance Functions for Tangent Vector Fields on the Sphere
Cross-covariance Functions for Tangent Vector Fields on the Sphere Minjie Fan 1 Tomoko Matsuo 2 1 Department of Statistics University of California, Davis 2 Cooperative Institute for Research in Environmental
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,
More informationLecture 5. Gaussian Models - Part 1. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. November 29, 2016
Lecture 5 Gaussian Models - Part 1 Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza November 29, 2016 Luigi Freda ( La Sapienza University) Lecture 5 November 29, 2016 1 / 42 Outline 1 Basics
More informationGaussian processes for spatial modelling in environmental health: parameterizing for flexibility vs. computational efficiency
Gaussian processes for spatial modelling in environmental health: parameterizing for flexibility vs. computational efficiency Chris Paciorek March 11, 2005 Department of Biostatistics Harvard School of
More informationUncertainty Quantification in Discrete Fracture Network Models
Uncertainty Quantification in Discrete Fracture Network Models Claudio Canuto Dipartimento di Scienze Matematiche, Politecnico di Torino claudio.canuto@polito.it Joint work with Stefano Berrone, Sandra
More informationGaussian Process Regression
Gaussian Process Regression 4F1 Pattern Recognition, 21 Carl Edward Rasmussen Department of Engineering, University of Cambridge November 11th - 16th, 21 Rasmussen (Engineering, Cambridge) Gaussian Process
More informationBayesian and Maximum Likelihood Estimation for Gaussian Processes on an Incomplete Lattice
Bayesian and Maximum Likelihood Estimation for Gaussian Processes on an Incomplete Lattice Jonathan R. Stroud, Michael L. Stein and Shaun Lysen Georgetown University, University of Chicago, and Google,
More informationOne Picture and a Thousand Words Using Matrix Approximtions October 2017 Oak Ridge National Lab Dianne P. O Leary c 2017
One Picture and a Thousand Words Using Matrix Approximtions October 2017 Oak Ridge National Lab Dianne P. O Leary c 2017 1 One Picture and a Thousand Words Using Matrix Approximations Dianne P. O Leary
More informationNonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University
Nonstationary spatial process modeling Part II Paul D. Sampson --- Catherine Calder Univ of Washington --- Ohio State University this presentation derived from that presented at the Pan-American Advanced
More informationModel Selection for Geostatistical Models
Model Selection for Geostatistical Models Richard A. Davis Colorado State University http://www.stat.colostate.edu/~rdavis/lectures Joint work with: Jennifer A. Hoeting, Colorado State University Andrew
More informationMultiple-step Time Series Forecasting with Sparse Gaussian Processes
Multiple-step Time Series Forecasting with Sparse Gaussian Processes Perry Groot ab Peter Lucas a Paul van den Bosch b a Radboud University, Model-Based Systems Development, Heyendaalseweg 135, 6525 AJ
More informationTests for separability in nonparametric covariance operators of random surfaces
Tests for separability in nonparametric covariance operators of random surfaces Shahin Tavakoli (joint with John Aston and Davide Pigoli) April 19, 2016 Analysis of Multidimensional Functional Data Shahin
More informationSVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning
SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are
More informationMini-project in scientific computing
Mini-project in scientific computing Eran Treister Computer Science Department, Ben-Gurion University of the Negev, Israel. March 7, 2018 1 / 30 Scientific computing Involves the solution of large computational
More informationA short introduction to INLA and R-INLA
A short introduction to INLA and R-INLA Integrated Nested Laplace Approximation Thomas Opitz, BioSP, INRA Avignon Workshop: Theory and practice of INLA and SPDE November 7, 2018 2/21 Plan for this talk
More informationChapter 17: Undirected Graphical Models
Chapter 17: Undirected Graphical Models The Elements of Statistical Learning Biaobin Jiang Department of Biological Sciences Purdue University bjiang@purdue.edu October 30, 2014 Biaobin Jiang (Purdue)
More informationSchwarz Preconditioner for the Stochastic Finite Element Method
Schwarz Preconditioner for the Stochastic Finite Element Method Waad Subber 1 and Sébastien Loisel 2 Preprint submitted to DD22 conference 1 Introduction The intrusive polynomial chaos approach for uncertainty
More informationExponential Family and Maximum Likelihood, Gaussian Mixture Models and the EM Algorithm. by Korbinian Schwinger
Exponential Family and Maximum Likelihood, Gaussian Mixture Models and the EM Algorithm by Korbinian Schwinger Overview Exponential Family Maximum Likelihood The EM Algorithm Gaussian Mixture Models Exponential
More informationAssociation studies and regression
Association studies and regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Association studies and regression 1 / 104 Administration
More information