Gaussian Processes


1 Schedule
17 Jan: Gaussian processes (Jo Eidsvik)
24 Jan: Hands-on project on Gaussian processes (team effort, work in groups)
31 Jan: Latent Gaussian models and INLA (Jo Eidsvik)
7 Feb: Hands-on project on INLA (team effort, work in groups)
12-13 Feb: Template Model Builder (guest lecturer Hans J Skaug)
To be decided...

Motivation: Large spatial (and spatio-temporal) datasets. Gaussian processes are very commonly used in practice.

Motivation: Other contexts
- Genetic data
- Functional data
- Response-surface modeling and optimization
Gaussian processes are extremely common as a building block in several machine learning applications.

Preliminaries: Univariate Gaussian distribution

Y ~ N(µ, σ²), with pdf

p(y) = (2πσ²)^{-1/2} exp( −(y − µ)² / (2σ²) ), y ∈ R.

Standardization: Z = (Y − µ)/σ, Y = µ + σZ, where Z is standard normal with mean 0 and variance 1.

Figure: pdf p(y) of the univariate Gaussian for y in (−3, 3).

Preliminaries: Multivariate Gaussian distribution

Size n × 1 vector Y = (Y_1, ..., Y_n)' with density

p(y) = (2π)^{-n/2} |Σ|^{-1/2} exp( −(1/2) (y − µ)' Σ^{-1} (y − µ) ), y ∈ R^n.

Mean: E(Y) = µ = (µ_1, ..., µ_n)'. The variance-covariance matrix Σ is the n × n matrix with entries

Σ_{i,i} = σ_i² = Var(Y_i), Σ_{i,j} = Cov(Y_i, Y_j), and Corr(Y_i, Y_j) = Σ_{i,j} / (σ_i σ_j).

Preliminaries: Illustrations

Figure: Samples from a bivariate Gaussian (n = 2) with correlation 0.9 (left) and with independent components (right).

Preliminaries: Joint distribution for blocks

Y_A = (Y_{A,1}, ..., Y_{A,n_A})' and Y_B = (Y_{B,1}, ..., Y_{B,n_B})' are jointly Gaussian with mean and covariance

µ = (µ_A, µ_B), Σ = [ Σ_A, Σ_{A,B} ; Σ_{B,A}, Σ_B ].

Preliminaries: Conditioning

E(Y_A | Y_B) = µ_A + Σ_{A,B} Σ_B^{-1} (Y_B − µ_B),
Var(Y_A | Y_B) = Σ_A − Σ_{A,B} Σ_B^{-1} Σ_{B,A}.

The conditional mean is linear in the conditioning variable (the data). The conditional variance does not depend on the data values.

Preliminaries: Illustration

Figure: Conditional pdf for Y_1 when Y_2 = −1 or Y_2 = 1.

Preliminaries: Transformation

Z = L^{-1}(Y − µ), Y = µ + LZ, Σ = LL'.

Z = (Z_1, ..., Z_n)' are independent standard normal, each with mean 0 and variance 1.

Preliminaries: Cholesky factorization

Σ = LL', where L is the lower triangular matrix

L = [ L_{1,1} 0 ... 0 ; L_{2,1} L_{2,2} ... 0 ; ... ; L_{n,1} L_{n,2} ... L_{n,n} ].

Preliminaries: Cholesky - example

Σ = [ 1 0.9 ; 0.9 1 ], L = [ 1 0 ; 0.9 0.44 ].

Consider sampling from the joint density p(y_1, y_2) = p(y_1) p(y_2 | y_1):
A sample from p(y_1) is constructed by Y_1 = µ_1 + L_{1,1} Z_1.
A sample from p(y_2 | y_1) is constructed by

Y_2 = µ_2 + L_{2,1} Z_1 + L_{2,2} Z_2 = µ_2 + L_{2,1} (Y_1 − µ_1) / L_{1,1} + L_{2,2} Z_2.
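As an illustration, here is a minimal Python sketch (not part of the slides) of Cholesky-based sampling for this 2-dimensional example; the names mu and Sigma are chosen for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.0, 0.0])            # mean vector
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])       # covariance matrix

L = np.linalg.cholesky(Sigma)        # lower triangular, Sigma = L L'

# Draw m samples: Y = mu + L Z with Z standard normal
m = 5000
Z = rng.standard_normal((2, m))
Y = mu[:, None] + L @ Z

print(np.cov(Y))                     # should be close to Sigma
```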

Processes: Gaussian process

For any set of time locations t_1, ..., t_n, the vector (Y(t_1), ..., Y(t_n)) is jointly Gaussian, with mean µ(t_i), i = 1, ..., n, and n × n covariance matrix Σ with entries Σ_{i,j}.

Processes: Covariance function

The covariance tends to decay with distance: Σ_{i,j} = γ(|t_i − t_j|), for some covariance function γ. Examples:

γ_exp(h) = σ² exp(−φh)             (exponential)
γ_mat(h) = σ² (1 + φh) exp(−φh)    (Matérn type)
γ_gauss(h) = σ² exp(−φh²)          (Gaussian)
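A small Python sketch (an illustration, not from the slides) of these three covariance functions and of building the covariance matrix over a set of time points; the parameter names sigma2 and phi and their default values are my own choices.

```python
import numpy as np

def gamma_exp(h, sigma2=1.0, phi=0.1):
    return sigma2 * np.exp(-phi * h)

def gamma_matern(h, sigma2=1.0, phi=0.1):
    return sigma2 * (1.0 + phi * h) * np.exp(-phi * h)

def gamma_gauss(h, sigma2=1.0, phi=0.01):
    return sigma2 * np.exp(-phi * h**2)

# Covariance matrix for time points t_1, ..., t_n
t = np.linspace(0.0, 100.0, 101)
H = np.abs(t[:, None] - t[None, :])   # pairwise distances |t_i - t_j|
Sigma = gamma_matern(H)

# One zero-mean sample path of the Gaussian process, via Cholesky
rng = np.random.default_rng(0)
jitter = 1e-10 * np.eye(len(t))       # small jitter for numerical stability
L = np.linalg.cholesky(Sigma + jitter)
y = L @ rng.standard_normal(len(t))
```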

Processes: Illustration of covariance functions

Figure: Correlation as a function of distance |t − s| (0 to 50) for the exponential, Matérn-type and Gaussian correlation functions.

Processes: Illustration of samples of Gaussian processes

Figure: Sample paths x(t) for t in (0, 100) with exponential and Matérn-type correlation.

Processes: Markov property

The exponential correlation function gives a Markov process. For any t > s > u,

p(y(t) | y(s), y(u)) = p(y(t) | y(s)).

(Proof by the trivariate distribution and the conditioning formulas
E(Y_A | Y_B) = µ_A + Σ_{A,B} Σ_B^{-1} (Y_B − µ_B),
Var(Y_A | Y_B) = Σ_A − Σ_{A,B} Σ_B^{-1} Σ_{B,A}.)

Processes: Conditional formula

E(Y_A | Y_B) = µ_A + Σ_{A,B} Σ_B^{-1} (Y_B − µ_B),
Var(Y_A | Y_B) = Σ_A − Σ_{A,B} Σ_B^{-1} Σ_{B,A}.

The expectation is linear in the data. The variance depends only on the data locations, not on the data values. The expectation is close to the conditioning variables near the data locations and goes to µ_A far from the data. The variance is small near the data locations and goes to Σ_A far from the data. Data at locations close to each other do not count as two independent observations.
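A minimal Python sketch (illustration only; the observation times and values are made up) of conditioning a Gaussian process on a few observations, using the formulas above with the exponential covariance:

```python
import numpy as np

def gamma_exp(h, sigma2=1.0, phi=0.1):
    return sigma2 * np.exp(-phi * h)

t_obs = np.array([10.0, 40.0, 70.0])      # data locations (hypothetical)
y_obs = np.array([1.2, -0.5, 0.8])        # observed values (hypothetical)
t_new = np.linspace(0.0, 100.0, 201)      # prediction locations

Sigma_B  = gamma_exp(np.abs(t_obs[:, None] - t_obs[None, :]))
Sigma_AB = gamma_exp(np.abs(t_new[:, None] - t_obs[None, :]))
Sigma_A  = gamma_exp(np.abs(t_new[:, None] - t_new[None, :]))

# Zero prior mean: E(Y_A | Y_B) and Var(Y_A | Y_B)
cond_mean = Sigma_AB @ np.linalg.solve(Sigma_B, y_obs)
cond_var  = Sigma_A - Sigma_AB @ np.linalg.solve(Sigma_B, Sigma_AB.T)
cond_sd   = np.sqrt(np.diag(cond_var))
```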

Processes: Illustration of conditioning in Gaussian processes

Figure: Two panels showing a Gaussian process x(t) over t in (0, 100), illustrating conditioning on observed data.

Processes: Application of Gaussian processes - function optimization

Several applications involve very time-demanding or costly experiments. Which configuration of inputs gives the highest output? The goal is to find the optimal input without too many trials/tests. Approach: fit a GP to the function, based on the evaluation points and results so far. This allows fast assessment of which evaluation points to choose next!

Processes: Production quality - current prediction

Goal: Find the best temperature input, to give optimal production. Which temperature should be evaluated next? The experiment is costly, so we want to do few evaluations.

Processes: Expected improvement

Maximum so far: Y* = max(Y).

EI_n(s) = E( max(Y(s) − Y*, 0) | Y )
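Under the Gaussian predictive distribution with mean m(s) and standard deviation sd(s), the expected improvement has a closed form. Here is a small Python sketch of it (my own illustration, not code from the course):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(m, sd, y_best):
    """EI for maximization: E(max(Y(s) - y_best, 0)) when
    Y(s) | data ~ N(m, sd^2)."""
    m, sd = np.asarray(m, float), np.asarray(sd, float)
    z = (m - y_best) / sd
    return (m - y_best) * norm.cdf(z) + sd * norm.pdf(z)

# Next evaluation point: the candidate s maximizing
# expected_improvement(m(s), sd(s), y_best) over a grid of candidates.
```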

Processes: Sequential uncertainty reduction

Perform a test at t = 40; the result is Y(40) = 50.

Processes: Sequential uncertainty reduction and optimization

Maximum so far: Y* = max(Y, Y_{n+1}).

EI_{n+1}(s) = E( max(Y(s) − Y*, 0) | Y, Y_{n+1} )

Processes: Project on function optimization using EI next week

Analytical solutions to parts of the computational challenge.

Gaussian Markov random fields: Precision matrix Q

Σ^{-1} = Q = [ Q_A, Q_{A,B} ; Q_{B,A}, Q_B ].

Q holds the conditional variance structure.

Gaussian Markov random fields: Interpretation of precision

Q_A^{-1} = Var(Y_A | Y_B),
E(Y_A | Y_B) = µ_A − Q_A^{-1} Q_{A,B} (Y_B − µ_B).

(Proof by QΣ = I, or by writing out the quadratic form and using p(y_A | y_B) ∝ p(y_A, y_B).)
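A quick numerical check in Python (illustration only) that the precision-block formulas agree with the covariance-based conditioning formulas:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)             # a random covariance matrix
Q = np.linalg.inv(Sigma)                    # precision matrix

idx_A, idx_B = [0, 1], [2, 3, 4]            # split into blocks A and B
S_A  = Sigma[np.ix_(idx_A, idx_A)]
S_AB = Sigma[np.ix_(idx_A, idx_B)]
S_B  = Sigma[np.ix_(idx_B, idx_B)]
Q_A  = Q[np.ix_(idx_A, idx_A)]
Q_AB = Q[np.ix_(idx_A, idx_B)]

# Conditional variance: Sigma-based vs Q-based
cond_var_cov  = S_A - S_AB @ np.linalg.solve(S_B, S_AB.T)
cond_var_prec = np.linalg.inv(Q_A)
print(np.allclose(cond_var_cov, cond_var_prec))                        # True

# Conditional-mean weights: Sigma_{A,B} Sigma_B^{-1} = -Q_A^{-1} Q_{A,B}
print(np.allclose(S_AB @ np.linalg.inv(S_B), -np.linalg.solve(Q_A, Q_AB)))  # True
```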

Gaussian Markov random fields: Sparse precision matrix Q

For graphs the precision matrix is sparse: Q_{i,j} = 0 if nodes i and j are not neighbors, i.e. Y_i and Y_j are conditionally independent given the rest. For the exponential covariance function on a regular grid in time, Q_{i,i±2} = 0.

Gaussian Markov random fields: Sparse precision matrix Q

This sparseness means that several techniques from numerical analysis (sparse linear algebra) can be used, for example solving Qb = a quickly for b.

Gaussian Markov random fields: Cholesky factorization of Q

A common method for sampling and evaluation: Q = L_Q L_Q', where L_Q is the lower triangular matrix

L_Q = [ L_{Q,1,1} 0 ... 0 ; L_{Q,2,1} L_{Q,2,2} ... 0 ; ... ; L_{Q,n,1} L_{Q,n,2} ... L_{Q,n,n} ].

The Cholesky factor is often sparse, but not as sparse as Q, because it holds only partial conditional structure, according to an ordering. Sparsity is maintained for the exponential covariance function in the time dimension (Markov).

Gaussian Markov random fields: Sampling and evaluation using L_Q

Q = L_Q L_Q'.

Sampling: solve L_Q' Y = Z. (Previously, for the covariance, we had Y = LZ.)

Log-determinant: log|Q| = 2 log|L_Q| = 2 Σ_i log L_{Q,i,i}.
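A small Python sketch of sampling and log-determinant evaluation from the precision matrix (illustration only; dense linear algebra is used for brevity, whereas in practice one would use a sparse Cholesky routine):

```python
import numpy as np
from scipy.linalg import solve_triangular

def sample_gmrf(Q, mu, rng):
    """Draw one sample Y ~ N(mu, Q^{-1}) via the Cholesky factor of Q."""
    L_Q = np.linalg.cholesky(Q)                   # Q = L_Q L_Q'
    z = rng.standard_normal(Q.shape[0])
    # Solve L_Q' (Y - mu) = Z  =>  Var(Y) = L_Q^{-T} L_Q^{-1} = Q^{-1}
    return mu + solve_triangular(L_Q.T, z, lower=False)

def logdet_from_Q(Q):
    L_Q = np.linalg.cholesky(Q)
    return 2.0 * np.sum(np.log(np.diag(L_Q)))     # log|Q| = 2 sum_i log L_{Q,ii}
```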

Gaussian Markov random fields: GMRFs for spatial applications

A Markovian model can be constructed for a spatial Gaussian process (Lindgren et al., 2011). The spatial process is viewed as the solution of a stochastic partial differential equation, and the solution is embedded in a triangulated graph over the spatial domain. More later (31 Jan).

Regression Model Regression for spatial datasets Applications with trends and residual spatial variation.

Regression Model: Spatial Gaussian regression model

Model: Y(s) = X(s)β + w(s) + ε(s).

1. Y(s): response variable at location / location-time position s.
2. β: regression effects; X(s): covariates at s.
3. w(s): structured (space-time correlated) Gaussian process.
4. ε(s): unstructured (independent) Gaussian measurement noise.

Regression Model: Gaussian model

Model: Y(s) = X(s)β + w(s) + ε(s). Data at n locations: Y = (Y(s_1), ..., Y(s_n))'. The main goals are parameter estimation and prediction.

Regression Model: Gaussian model

Log-likelihood for parameter estimation (up to a constant):

l(Y; β, θ) = −(1/2) log|C| − (1/2) (Y − Xβ)' C^{-1} (Y − Xβ),

C = C(θ) = Σ + τ² I_n, with Var(w) = Σ and Var(ε(s_i)) = τ² for all i. θ includes the parameters of the covariance model.
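A hedged Python sketch of evaluating this log-likelihood for given parameters (illustration only; the exponential-plus-nugget covariance and the parameter names are my choices, not necessarily those used in the course):

```python
import numpy as np
from scipy.spatial.distance import cdist

def log_likelihood(Y, X, beta, sigma2, phi, tau2, coords):
    """l(Y; beta, theta) = -0.5 log|C| - 0.5 (Y - X beta)' C^{-1} (Y - X beta),
    with C = sigma2 * exp(-phi * h) + tau2 * I (exponential covariance + nugget)."""
    H = cdist(coords, coords)                     # pairwise distances
    C = sigma2 * np.exp(-phi * H) + tau2 * np.eye(len(Y))
    L = np.linalg.cholesky(C)                     # C = L L'
    r = Y - X @ beta
    alpha = np.linalg.solve(L, r)                 # L^{-1} r
    logdet = 2.0 * np.sum(np.log(np.diag(L)))     # log|C|
    return -0.5 * logdet - 0.5 * alpha @ alpha    # quadratic form = ||L^{-1} r||^2
```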

Method: Maximum likelihood

MLE: (θ̂, β̂) = argmax_{θ,β} l(Y; β, θ).

Method: Analytical derivatives

Formulas for matrix derivatives, with Q(θ) = C^{-1}:

β̂ = [X' Q X]^{-1} X' Q Y,   Z = Y − X β̂,

d log|C| / dθ_r = trace( Q dC/dθ_r ),

d(Z' Q Z) / dθ_r = − Z' Q (dC/dθ_r) Q Z.

Method: Score and Hessian for θ

dl/dθ_r = −(1/2) trace( Q dC/dθ_r ) + (1/2) Z' Q (dC/dθ_r) Q Z,

E( d²l / (dθ_r dθ_s) ) = −(1/2) trace( Q (dC/dθ_s) Q (dC/dθ_r) ).

Method: Updates for each iteration

Q = Q(θ_p),
β̂_p = [X' Q X]^{-1} X' Q Y,
θ̂_{p+1} = θ̂_p − E( d²l(Y; β̂_p, θ̂_p) / dθ² )^{-1} dl(Y; β̂_p, θ̂_p) / dθ.

The iterative scheme usually starts from a preliminary guess, obtained via summary statistics.
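A schematic Python sketch of one such Fisher-scoring update (illustration only; build_C and dC_dtheta are assumed to be user-supplied functions returning the covariance matrix and the list of derivative matrices dC/dθ_r for the chosen parameterization):

```python
import numpy as np

def fisher_scoring_step(Y, X, theta, build_C, dC_dtheta):
    """One update theta -> theta + G^{-1} * score, where G = -E(d^2 l / d theta^2)
    is the expected information; also returns the GLS estimate of beta."""
    C = build_C(theta)
    Q = np.linalg.inv(C)                              # Q = C^{-1}
    beta = np.linalg.solve(X.T @ Q @ X, X.T @ Q @ Y)  # beta_hat = [X'QX]^{-1} X'QY
    Z = Y - X @ beta
    dCs = dC_dtheta(theta)
    score = np.array([-0.5 * np.trace(Q @ dC) + 0.5 * Z @ Q @ dC @ Q @ Z
                      for dC in dCs])
    G = np.array([[0.5 * np.trace(Q @ dCs[s] @ Q @ dCs[r])
                   for s in range(len(dCs))] for r in range(len(dCs))])
    theta_new = theta + np.linalg.solve(G, score)
    return theta_new, beta
```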

Method: Illustration of maximization

Exponential covariance with nugget effect. θ = (θ_1, θ_2, θ_3): log precision, logistic range, log nugget precision.

Method: Asymptotic properties

θ̂ ≈ N(θ, G^{-1}), where G = G(θ̂) = −E( d²l / dθ² ).

Prediction: Prediction from the joint Gaussian formulation

Prediction: Ŷ_0 = E(Y_0 | Y) = X_0 β̂ + C_{0,·} C^{-1} (Y − X β̂),
where C_{0,·} is the 1 × n vector of cross-covariances between the prediction site s_0 and the data sites.

Prediction variance: Var(Y_0 | Y) = C_0 − C_{0,·} C^{-1} C_{0,·}'.
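A Python sketch of this prediction step (illustration only, assuming the exponential-plus-nugget covariance from the earlier likelihood sketch; function and parameter names are hypothetical):

```python
import numpy as np
from scipy.spatial.distance import cdist

def predict(Y, X, coords, X0, coords0, beta_hat, sigma2, phi, tau2):
    """Prediction Y0_hat = X0 beta_hat + C_{0,.} C^{-1} (Y - X beta_hat)
    and its variance, for prediction sites coords0."""
    H = cdist(coords, coords)
    C = sigma2 * np.exp(-phi * H) + tau2 * np.eye(len(Y))
    H0 = cdist(coords0, coords)
    C0dot = sigma2 * np.exp(-phi * H0)               # cross-covariances to data
    resid = Y - X @ beta_hat
    mean = X0 @ beta_hat + C0dot @ np.linalg.solve(C, resid)
    C00 = sigma2 + tau2                              # prior variance at a site (incl. nugget)
    CinvC0 = np.linalg.solve(C, C0dot.T)             # C^{-1} C_{0,.}'
    var = C00 - np.sum(C0dot * CinvC0.T, axis=1)     # diag(C_{0,.} C^{-1} C_{0,.}')
    return mean, var
```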

Numerical examples: Synthetic data

Consider the unit square. Create a grid of 25² = 625 locations. Use 49 data locations, either randomly assigned or along the center line (two designs). Covariance:

C(h) = τ² I(h = 0) + σ² (1 + φh) exp(−φh), h = |s_i − s_j|.

θ includes transformations of σ, τ and φ.

Numerical examples: Likelihood optimization

True parameters: β = (2, 3, 1), θ = (0.25, 9, 0.0025).

Random design:
β̂ = [2 (0.486), 3.43 (0.552), 0.812 (0.538)]
θ̂ = [0.298 (0.118), 7.89 (1.98), 0.00563 (0.00679)]

Center design:
β̂ = [2.06 (0.576), 3.4 (0.733), 0.353 (0.733)]
θ̂ = [0.255 (0.141), 7.19 (1.97), 0.00283 (0.00128)]

Numerical examples Predictions

Numerical examples Project on spatial Gaussian regression model next week

Numerical examples: Mining example

Data giving oxide grade information at n = 1600 locations.

Figure: Map of the data locations (easting vs. northing, 500 m scale bar) with the prediction line between points A and B.

Use spatial statistics to predict oxide grade. Questions: Will mining be profitable? Should one gather more data before making the mining decision?

Numerical examples: Mining example

Model: Y(s) = X(s)β + w(s) + ε(s). Data at n borehole locations are used to fit the model parameters by MLE. Starting values come from the variogram (related to the sample covariance).

Figure: Empirical variograms (in-strike, perpendicular, and omnidirectional) as a function of distance (0-300 m).

Numerical examples: Mining example - current prediction

Figure: Two panels along the A-B cross-section (altitude (m) against line distance (m)), with color scales of roughly 0-5 and 0.3-0.6.

Numerical examples: Mining example

Develop or not? The decision is made using expected profits.

Value based on current data: PV = max(p_y, 0), p_y = r' µ_y − k' 1.

Will more data influence the mining decision? Expected value with more data:

PoV = ∫ max(p_{y,y_n}, 0) π(y_n | y) dy_n,   p_{y,y_n} = r' µ_{y,y_n} − k' 1.

Is more data valuable? Compute the value of information (VOI): VOI = PoV − PV. If the VOI exceeds the cost of the data, it is worthwhile to gather this information. (Computations are partly analytical, similar to EI.)
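To make the PoV integral concrete, here is a rough Monte Carlo sketch in Python (my own illustration under simplifying assumptions, not the partly analytical computation referred to above): hypothetical future data y_n are simulated from the current predictive distribution, with F a hypothetical design matrix linking y_n to the grades and tau_n2 its measurement-noise variance; the grade predictions are updated by the conditioning formula and the profit is averaged.

```python
import numpy as np

def posterior_value(mu_y, Sigma_y, F, tau_n2, r, k, n_sim=1000, rng=None):
    """Monte Carlo estimate of PoV = E[ max(r' mu_{y,y_n} - k' 1, 0) ],
    assuming future data y_n = F y + noise with noise variance tau_n2 * I."""
    rng = rng or np.random.default_rng(0)
    C_n = F @ Sigma_y @ F.T + tau_n2 * np.eye(F.shape[0])  # predictive cov of y_n
    K = Sigma_y @ F.T @ np.linalg.inv(C_n)                  # conditioning weights
    vals = []
    for _ in range(n_sim):
        y_n = rng.multivariate_normal(F @ mu_y, C_n)        # hypothetical data
        mu_post = mu_y + K @ (y_n - F @ mu_y)               # updated grade mean
        vals.append(max(r @ mu_post - k.sum(), 0.0))
    return np.mean(vals)

# VOI = posterior_value(...) - max(r @ mu_y - k.sum(), 0.0)
```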

Numerical examples: Mining example

Use this to compare two data types in planned boreholes:
XRF: scanning in a lab (expensive, accurate).
XMET: handheld device for grade scanning at the mining site (inexpensive, inaccurate).

Numerical examples: Mining example

Figure: Preferred data-gathering decision (XMET, XRF, or nothing) as a function of the cost of XMET data and the cost of XRF data, both in units of $10^5.

Numerical examples: Schedule

17 Jan: Gaussian processes
24 Jan: Hands-on project on Gaussian processes
31 Jan: Latent Gaussian models and INLA
7 Feb: Hands-on project on INLA
12-13 Feb: Template Model Builder (guest lecturer Hans J Skaug)

Numerical examples: Schedule

The project (24 Jan) will include:
Expected improvement for function maximization.
Parameter estimation in a spatial regression model (forest example).