Gaussian Processes


1 Schedule
17 Jan: Gaussian processes (Jo Eidsvik)
24 Jan: Hands-on project on Gaussian processes (team effort, work in groups)
31 Jan: Latent Gaussian models and INLA (Jo Eidsvik)
7 Feb: Hands-on project on INLA (team effort, work in groups)
12-13 Feb: Template Model Builder (guest lecturer Hans J Skaug)
To be decided...

Motivation: Large spatial (and spatio-temporal) datasets. Gaussian processes are very commonly used in practice.

Motivation: Other contexts
- Genetic data
- Functional data
- Response-surface modeling and optimization
Gaussian processes are extremely common as a building block in several machine learning applications.

Preliminaries: Univariate Gaussian distribution

Y ~ N(µ, σ²), with pdf

p(y) = (2πσ²)^{-1/2} exp( −(y − µ)² / (2σ²) ), y ∈ R.

Standardization: Z = (Y − µ)/σ, Y = µ + σZ, where Z is standard normal with mean 0 and variance 1.

Figure: pdf p(y) of the univariate Gaussian for y in (−3, 3).

Preliminaries: Multivariate Gaussian distribution

Size n × 1 vector Y = (Y_1, ..., Y_n)' with density

p(y) = (2π)^{-n/2} |Σ|^{-1/2} exp( −(1/2) (y − µ)' Σ^{-1} (y − µ) ), y ∈ R^n.

Mean: E(Y) = µ = (µ_1, ..., µ_n)'. The variance-covariance matrix Σ is the n × n matrix with entries

Σ_{i,i} = σ_i² = Var(Y_i), Σ_{i,j} = Cov(Y_i, Y_j), and Corr(Y_i, Y_j) = Σ_{i,j} / (σ_i σ_j).

Preliminaries: Illustrations

Figure: Samples from a bivariate Gaussian (n = 2) with correlation 0.9 (left) and with independent components (right).

Preliminaries: Joint distribution for blocks

Y_A = (Y_{A,1}, ..., Y_{A,n_A})' and Y_B = (Y_{B,1}, ..., Y_{B,n_B})' are jointly Gaussian with mean and covariance

µ = (µ_A, µ_B), Σ = [ Σ_A, Σ_{A,B} ; Σ_{B,A}, Σ_B ].

Preliminaries: Conditioning

E(Y_A | Y_B) = µ_A + Σ_{A,B} Σ_B^{-1} (Y_B − µ_B),
Var(Y_A | Y_B) = Σ_A − Σ_{A,B} Σ_B^{-1} Σ_{B,A}.

The conditional mean is linear in the conditioning variable (the data). The conditional variance does not depend on the data values.

Preliminaries: Illustration

Figure: Conditional pdf for Y_1 when Y_2 = −1 or Y_2 = 1.

Preliminaries: Transformation

Z = L^{-1}(Y − µ), Y = µ + LZ, Σ = LL'.

Z = (Z_1, ..., Z_n)' are independent standard normal, each with mean 0 and variance 1.

Preliminaries: Cholesky factorization

Σ = LL', where L is the lower triangular matrix

L = [ L_{1,1} 0 ... 0 ; L_{2,1} L_{2,2} ... 0 ; ... ; L_{n,1} L_{n,2} ... L_{n,n} ].

Preliminaries: Cholesky - example

Σ = [ 1 0.9 ; 0.9 1 ], L = [ 1 0 ; 0.9 0.44 ].

Consider sampling from the joint density p(y_1, y_2) = p(y_1) p(y_2 | y_1):
A sample from p(y_1) is constructed by Y_1 = µ_1 + L_{1,1} Z_1.
A sample from p(y_2 | y_1) is constructed by

Y_2 = µ_2 + L_{2,1} Z_1 + L_{2,2} Z_2 = µ_2 + L_{2,1} (Y_1 − µ_1) / L_{1,1} + L_{2,2} Z_2.
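As an illustration, here is a minimal Python sketch (not part of the slides) of Cholesky-based sampling for this 2-dimensional example; the names mu and Sigma are chosen for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = np.array([0.0, 0.0])            # mean vector
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])       # covariance matrix

L = np.linalg.cholesky(Sigma)        # lower triangular, Sigma = L L'

# Draw m samples: Y = mu + L Z with Z standard normal
m = 5000
Z = rng.standard_normal((2, m))
Y = mu[:, None] + L @ Z

print(np.cov(Y))                     # should be close to Sigma
```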

Processes: Gaussian process

For any set of time locations t_1, ..., t_n, the vector (Y(t_1), ..., Y(t_n)) is jointly Gaussian, with mean µ(t_i), i = 1, ..., n, and n × n covariance matrix Σ with entries Σ_{i,j}.

Processes: Covariance function

The covariance tends to decay with distance: Σ_{i,j} = γ(|t_i − t_j|), for some covariance function γ. Examples:

γ_exp(h) = σ² exp(−φh)             (exponential)
γ_mat(h) = σ² (1 + φh) exp(−φh)    (Matérn type)
γ_gauss(h) = σ² exp(−φh²)          (Gaussian)
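A small Python sketch (an illustration, not from the slides) of these three covariance functions and of building the covariance matrix over a set of time points; the parameter names sigma2 and phi and their default values are my own choices.

```python
import numpy as np

def gamma_exp(h, sigma2=1.0, phi=0.1):
    return sigma2 * np.exp(-phi * h)

def gamma_matern(h, sigma2=1.0, phi=0.1):
    return sigma2 * (1.0 + phi * h) * np.exp(-phi * h)

def gamma_gauss(h, sigma2=1.0, phi=0.01):
    return sigma2 * np.exp(-phi * h**2)

# Covariance matrix for time points t_1, ..., t_n
t = np.linspace(0.0, 100.0, 101)
H = np.abs(t[:, None] - t[None, :])   # pairwise distances |t_i - t_j|
Sigma = gamma_matern(H)

# One zero-mean sample path of the Gaussian process, via Cholesky
rng = np.random.default_rng(0)
jitter = 1e-10 * np.eye(len(t))       # small jitter for numerical stability
L = np.linalg.cholesky(Sigma + jitter)
y = L @ rng.standard_normal(len(t))
```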

Processes: Illustration of covariance functions

Figure: Correlation as a function of distance |t − s| (0 to 50) for the exponential, Matérn-type and Gaussian correlation functions.

Processes: Illustration of samples of Gaussian processes

Figure: Sample paths x(t) for t in (0, 100) with exponential and Matérn-type correlation.

Processes: Markov property

The exponential correlation function gives a Markov process. For any t > s > u,

p(y(t) | y(s), y(u)) = p(y(t) | y(s)).

(Proof by the trivariate distribution and the conditioning formulas
E(Y_A | Y_B) = µ_A + Σ_{A,B} Σ_B^{-1} (Y_B − µ_B),
Var(Y_A | Y_B) = Σ_A − Σ_{A,B} Σ_B^{-1} Σ_{B,A}.)

Processes: Conditional formula

E(Y_A | Y_B) = µ_A + Σ_{A,B} Σ_B^{-1} (Y_B − µ_B),
Var(Y_A | Y_B) = Σ_A − Σ_{A,B} Σ_B^{-1} Σ_{B,A}.

The expectation is linear in the data. The variance depends only on the data locations, not on the data values. The expectation is close to the conditioning variables near the data locations and goes to µ_A far from the data. The variance is small near the data locations and goes to Σ_A far from the data. Data at locations close to each other do not count as two independent observations.
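A minimal Python sketch (illustration only; the observation times and values are made up) of conditioning a Gaussian process on a few observations, using the formulas above with the exponential covariance:

```python
import numpy as np

def gamma_exp(h, sigma2=1.0, phi=0.1):
    return sigma2 * np.exp(-phi * h)

t_obs = np.array([10.0, 40.0, 70.0])      # data locations (hypothetical)
y_obs = np.array([1.2, -0.5, 0.8])        # observed values (hypothetical)
t_new = np.linspace(0.0, 100.0, 201)      # prediction locations

Sigma_B  = gamma_exp(np.abs(t_obs[:, None] - t_obs[None, :]))
Sigma_AB = gamma_exp(np.abs(t_new[:, None] - t_obs[None, :]))
Sigma_A  = gamma_exp(np.abs(t_new[:, None] - t_new[None, :]))

# Zero prior mean: E(Y_A | Y_B) and Var(Y_A | Y_B)
cond_mean = Sigma_AB @ np.linalg.solve(Sigma_B, y_obs)
cond_var  = Sigma_A - Sigma_AB @ np.linalg.solve(Sigma_B, Sigma_AB.T)
cond_sd   = np.sqrt(np.diag(cond_var))
```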

Processes: Illustration of conditioning in Gaussian processes

Figure: Two panels showing a Gaussian process x(t) over t in (0, 100), illustrating conditioning on observed data.

Processes: Application of Gaussian processes - function optimization

Several applications involve very time-demanding or costly experiments. Which configuration of inputs gives the highest output? The goal is to find the optimal input without too many trials/tests. Approach: fit a GP to the function, based on the evaluation points and results so far. This allows fast assessment of which evaluation points to choose next!

Processes: Production quality - current prediction

Goal: Find the best temperature input, to give optimal production. Which temperature should be evaluated next? The experiment is costly, so we want to do few evaluations.

Processes: Expected improvement

Maximum so far: Y* = max(Y).

EI_n(s) = E( max(Y(s) − Y*, 0) | Y )
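Under the Gaussian predictive distribution with mean m(s) and standard deviation sd(s), the expected improvement has a closed form. Here is a small Python sketch of it (my own illustration, not code from the course):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(m, sd, y_best):
    """EI for maximization: E(max(Y(s) - y_best, 0)) when
    Y(s) | data ~ N(m, sd^2)."""
    m, sd = np.asarray(m, float), np.asarray(sd, float)
    z = (m - y_best) / sd
    return (m - y_best) * norm.cdf(z) + sd * norm.pdf(z)

# Next evaluation point: the candidate s maximizing
# expected_improvement(m(s), sd(s), y_best) over a grid of candidates.
```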

Processes: Sequential uncertainty reduction

Perform a test at t = 40; the result is Y(40) = 50.

Processes: Sequential uncertainty reduction and optimization

Maximum so far: Y* = max(Y, Y_{n+1}).

EI_{n+1}(s) = E( max(Y(s) − Y*, 0) | Y, Y_{n+1} )

Processes: Project on function optimization using EI next week

Analytical solutions to parts of the computational challenge.

Gaussian Markov random fields: Precision matrix Q

Σ^{-1} = Q = [ Q_A, Q_{A,B} ; Q_{B,A}, Q_B ].

Q holds the conditional variance structure.

Gaussian Markov random fields: Interpretation of precision

Q_A^{-1} = Var(Y_A | Y_B),
E(Y_A | Y_B) = µ_A − Q_A^{-1} Q_{A,B} (Y_B − µ_B).

(Proof by QΣ = I, or by writing out the quadratic form and using p(y_A | y_B) ∝ p(y_A, y_B).)
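A quick numerical check in Python (illustration only) that the precision-block formulas agree with the covariance-based conditioning formulas:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)             # a random covariance matrix
Q = np.linalg.inv(Sigma)                    # precision matrix

idx_A, idx_B = [0, 1], [2, 3, 4]            # split into blocks A and B
S_A  = Sigma[np.ix_(idx_A, idx_A)]
S_AB = Sigma[np.ix_(idx_A, idx_B)]
S_B  = Sigma[np.ix_(idx_B, idx_B)]
Q_A  = Q[np.ix_(idx_A, idx_A)]
Q_AB = Q[np.ix_(idx_A, idx_B)]

# Conditional variance: Sigma-based vs Q-based
cond_var_cov  = S_A - S_AB @ np.linalg.solve(S_B, S_AB.T)
cond_var_prec = np.linalg.inv(Q_A)
print(np.allclose(cond_var_cov, cond_var_prec))                        # True

# Conditional-mean weights: Sigma_{A,B} Sigma_B^{-1} = -Q_A^{-1} Q_{A,B}
print(np.allclose(S_AB @ np.linalg.inv(S_B), -np.linalg.solve(Q_A, Q_AB)))  # True
```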

Gaussian Markov random fields: Sparse precision matrix Q

For graphs the precision matrix is sparse: Q_{i,j} = 0 if nodes i and j are not neighbors, i.e. Y_i and Y_j are conditionally independent given the rest. For the exponential covariance function on a regular grid in time, Q_{i,i±2} = 0.

Gaussian Markov random fields: Sparse precision matrix Q

This sparseness means that several techniques from numerical analysis (sparse linear algebra) can be used, for example solving Qb = a quickly for b.

Gaussian Markov random fields: Cholesky factorization of Q

A common method for sampling and evaluation: Q = L_Q L_Q', where L_Q is the lower triangular matrix

L_Q = [ L_{Q,1,1} 0 ... 0 ; L_{Q,2,1} L_{Q,2,2} ... 0 ; ... ; L_{Q,n,1} L_{Q,n,2} ... L_{Q,n,n} ].

The Cholesky factor is often sparse, but not as sparse as Q, because it holds only partial conditional structure, according to an ordering. Sparsity is maintained for the exponential covariance function in the time dimension (Markov).

Gaussian Markov random fields: Sampling and evaluation using L_Q

Q = L_Q L_Q'.

Sampling: solve L_Q' Y = Z. (Previously, for the covariance, we had Y = LZ.)

Log-determinant: log|Q| = 2 log|L_Q| = 2 Σ_i log L_{Q,i,i}.
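A small Python sketch of sampling and log-determinant evaluation from the precision matrix (illustration only; dense linear algebra is used for brevity, whereas in practice one would use a sparse Cholesky routine):

```python
import numpy as np
from scipy.linalg import solve_triangular

def sample_gmrf(Q, mu, rng):
    """Draw one sample Y ~ N(mu, Q^{-1}) via the Cholesky factor of Q."""
    L_Q = np.linalg.cholesky(Q)                   # Q = L_Q L_Q'
    z = rng.standard_normal(Q.shape[0])
    # Solve L_Q' (Y - mu) = Z  =>  Var(Y) = L_Q^{-T} L_Q^{-1} = Q^{-1}
    return mu + solve_triangular(L_Q.T, z, lower=False)

def logdet_from_Q(Q):
    L_Q = np.linalg.cholesky(Q)
    return 2.0 * np.sum(np.log(np.diag(L_Q)))     # log|Q| = 2 sum_i log L_{Q,ii}
```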

Gaussian Markov random fields: GMRFs for spatial applications

A Markovian model can be constructed for a spatial Gaussian process (Lindgren et al., 2011). The spatial process is viewed as the solution of a stochastic partial differential equation, and the solution is embedded in a triangulated graph over the spatial domain. More later (31 Jan).

Regression Model Regression for spatial datasets Applications with trends and residual spatial variation.

Regression Model: Spatial Gaussian regression model

Model: Y(s) = X(s)β + w(s) + ε(s).

1. Y(s): response variable at location / location-time position s.
2. β: regression effects; X(s): covariates at s.
3. w(s): structured (space-time correlated) Gaussian process.
4. ε(s): unstructured (independent) Gaussian measurement noise.

Regression Model: Gaussian model

Model: Y(s) = X(s)β + w(s) + ε(s). Data at n locations: Y = (Y(s_1), ..., Y(s_n))'. The main goals are parameter estimation and prediction.

Regression Model: Gaussian model

Log-likelihood for parameter estimation (up to a constant):

l(Y; β, θ) = −(1/2) log|C| − (1/2) (Y − Xβ)' C^{-1} (Y − Xβ),

C = C(θ) = Σ + τ² I_n, with Var(w) = Σ and Var(ε(s_i)) = τ² for all i. θ includes the parameters of the covariance model.
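A hedged Python sketch of evaluating this log-likelihood for given parameters (illustration only; the exponential-plus-nugget covariance and the parameter names are my choices, not necessarily those used in the course):

```python
import numpy as np
from scipy.spatial.distance import cdist

def log_likelihood(Y, X, beta, sigma2, phi, tau2, coords):
    """l(Y; beta, theta) = -0.5 log|C| - 0.5 (Y - X beta)' C^{-1} (Y - X beta),
    with C = sigma2 * exp(-phi * h) + tau2 * I (exponential covariance + nugget)."""
    H = cdist(coords, coords)                     # pairwise distances
    C = sigma2 * np.exp(-phi * H) + tau2 * np.eye(len(Y))
    L = np.linalg.cholesky(C)                     # C = L L'
    r = Y - X @ beta
    alpha = np.linalg.solve(L, r)                 # L^{-1} r
    logdet = 2.0 * np.sum(np.log(np.diag(L)))     # log|C|
    return -0.5 * logdet - 0.5 * alpha @ alpha    # quadratic form = ||L^{-1} r||^2
```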

Method: Maximum likelihood

MLE: (θ̂, β̂) = argmax_{θ,β} l(Y; β, θ).

Method: Analytical derivatives

Formulas for matrix derivatives, with Q(θ) = C^{-1}:

β̂ = [X' Q X]^{-1} X' Q Y,   Z = Y − X β̂,

d log|C| / dθ_r = trace( Q dC/dθ_r ),

d(Z' Q Z) / dθ_r = − Z' Q (dC/dθ_r) Q Z.

Method: Score and Hessian for θ

dl/dθ_r = −(1/2) trace( Q dC/dθ_r ) + (1/2) Z' Q (dC/dθ_r) Q Z,

E( d²l / (dθ_r dθ_s) ) = −(1/2) trace( Q (dC/dθ_s) Q (dC/dθ_r) ).

Method: Updates for each iteration

Q = Q(θ_p),
β̂_p = [X' Q X]^{-1} X' Q Y,
θ̂_{p+1} = θ̂_p − E( d²l(Y; β̂_p, θ̂_p) / dθ² )^{-1} dl(Y; β̂_p, θ̂_p) / dθ.

The iterative scheme usually starts from a preliminary guess, obtained via summary statistics.
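A schematic Python sketch of one such Fisher-scoring update (illustration only; build_C and dC_dtheta are assumed to be user-supplied functions returning the covariance matrix and the list of derivative matrices dC/dθ_r for the chosen parameterization):

```python
import numpy as np

def fisher_scoring_step(Y, X, theta, build_C, dC_dtheta):
    """One update theta -> theta + G^{-1} * score, where G = -E(d^2 l / d theta^2)
    is the expected information; also returns the GLS estimate of beta."""
    C = build_C(theta)
    Q = np.linalg.inv(C)                              # Q = C^{-1}
    beta = np.linalg.solve(X.T @ Q @ X, X.T @ Q @ Y)  # beta_hat = [X'QX]^{-1} X'QY
    Z = Y - X @ beta
    dCs = dC_dtheta(theta)
    score = np.array([-0.5 * np.trace(Q @ dC) + 0.5 * Z @ Q @ dC @ Q @ Z
                      for dC in dCs])
    G = np.array([[0.5 * np.trace(Q @ dCs[s] @ Q @ dCs[r])
                   for s in range(len(dCs))] for r in range(len(dCs))])
    theta_new = theta + np.linalg.solve(G, score)
    return theta_new, beta
```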

Method: Illustration of maximization

Exponential covariance with nugget effect. θ = (θ_1, θ_2, θ_3): log precision, logistic range, log nugget precision.

Method: Asymptotic properties

θ̂ ≈ N(θ, G^{-1}), where G = G(θ̂) = −E( d²l / dθ² ).

Prediction: Prediction from the joint Gaussian formulation

Prediction: Ŷ_0 = E(Y_0 | Y) = X_0 β̂ + C_{0,·} C^{-1} (Y − X β̂),
where C_{0,·} is the 1 × n vector of cross-covariances between the prediction site s_0 and the data sites.

Prediction variance: Var(Y_0 | Y) = C_0 − C_{0,·} C^{-1} C_{0,·}'.
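A Python sketch of this prediction step (illustration only, assuming the exponential-plus-nugget covariance from the earlier likelihood sketch; function and parameter names are hypothetical):

```python
import numpy as np
from scipy.spatial.distance import cdist

def predict(Y, X, coords, X0, coords0, beta_hat, sigma2, phi, tau2):
    """Prediction Y0_hat = X0 beta_hat + C_{0,.} C^{-1} (Y - X beta_hat)
    and its variance, for prediction sites coords0."""
    H = cdist(coords, coords)
    C = sigma2 * np.exp(-phi * H) + tau2 * np.eye(len(Y))
    H0 = cdist(coords0, coords)
    C0dot = sigma2 * np.exp(-phi * H0)               # cross-covariances to data
    resid = Y - X @ beta_hat
    mean = X0 @ beta_hat + C0dot @ np.linalg.solve(C, resid)
    C00 = sigma2 + tau2                              # prior variance at a site (incl. nugget)
    CinvC0 = np.linalg.solve(C, C0dot.T)             # C^{-1} C_{0,.}'
    var = C00 - np.sum(C0dot * CinvC0.T, axis=1)     # diag(C_{0,.} C^{-1} C_{0,.}')
    return mean, var
```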

Numerical examples: Synthetic data

Consider the unit square. Create a grid of 25² = 625 locations. Use 49 data locations, either randomly assigned or along the center line (two designs). Covariance:

C(h) = τ² I(h = 0) + σ² (1 + φh) exp(−φh), h = |s_i − s_j|.

θ includes transformations of σ, τ and φ.

Numerical examples: Likelihood optimization

True parameters: β = (2, 3, 1), θ = (0.25, 9, 0.0025).

Random design:
β̂ = [2 (0.486), 3.43 (0.552), 0.812 (0.538)]
θ̂ = [0.298 (0.118), 7.89 (1.98), 0.00563 (0.00679)]

Center design:
β̂ = [2.06 (0.576), 3.4 (0.733), 0.353 (0.733)]
θ̂ = [0.255 (0.141), 7.19 (1.97), 0.00283 (0.00128)]

Numerical examples Predictions

Numerical examples Project on spatial Gaussian regression model next week

Numerical examples: Mining example

Data giving oxide grade information at n = 1600 locations.

Figure: Map of the data locations (easting vs. northing, 500 m scale bar) with the prediction line between points A and B.

Use spatial statistics to predict oxide grade. Questions: Will mining be profitable? Should one gather more data before making the mining decision?

Numerical examples: Mining example

Model: Y(s) = X(s)β + w(s) + ε(s). Data at n borehole locations are used to fit the model parameters by MLE. Starting values come from the variogram (related to the sample covariance).

Figure: Empirical variograms (in-strike, perpendicular, and omnidirectional) as a function of distance (0-300 m).

Numerical examples: Mining example - current prediction

Figure: Two panels along the A-B cross-section (altitude (m) against line distance (m)), with color scales of roughly 0-5 and 0.3-0.6.

Numerical examples: Mining example

Develop or not? The decision is made using expected profits.

Value based on current data: PV = max(p_y, 0), p_y = r' µ_y − k' 1.

Will more data influence the mining decision? Expected value with more data:

PoV = ∫ max(p_{y,y_n}, 0) π(y_n | y) dy_n,   p_{y,y_n} = r' µ_{y,y_n} − k' 1.

Is more data valuable? Compute the value of information (VOI): VOI = PoV − PV. If the VOI exceeds the cost of the data, it is worthwhile to gather this information. (Computations are partly analytical, similar to EI.)
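To make the PoV integral concrete, here is a rough Monte Carlo sketch in Python (my own illustration under simplifying assumptions, not the partly analytical computation referred to above): hypothetical future data y_n are simulated from the current predictive distribution, with F a hypothetical design matrix linking y_n to the grades and tau_n2 its measurement-noise variance; the grade predictions are updated by the conditioning formula and the profit is averaged.

```python
import numpy as np

def posterior_value(mu_y, Sigma_y, F, tau_n2, r, k, n_sim=1000, rng=None):
    """Monte Carlo estimate of PoV = E[ max(r' mu_{y,y_n} - k' 1, 0) ],
    assuming future data y_n = F y + noise with noise variance tau_n2 * I."""
    rng = rng or np.random.default_rng(0)
    C_n = F @ Sigma_y @ F.T + tau_n2 * np.eye(F.shape[0])  # predictive cov of y_n
    K = Sigma_y @ F.T @ np.linalg.inv(C_n)                  # conditioning weights
    vals = []
    for _ in range(n_sim):
        y_n = rng.multivariate_normal(F @ mu_y, C_n)        # hypothetical data
        mu_post = mu_y + K @ (y_n - F @ mu_y)               # updated grade mean
        vals.append(max(r @ mu_post - k.sum(), 0.0))
    return np.mean(vals)

# VOI = posterior_value(...) - max(r @ mu_y - k.sum(), 0.0)
```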

Numerical examples: Mining example

Use this to compare two data types in planned boreholes:
XRF: scanning in a lab (expensive, accurate).
XMET: handheld device for grade scanning at the mining site (inexpensive, inaccurate).

Numerical examples: Mining example

Figure: Preferred data-gathering decision (XMET, XRF, or nothing) as a function of the cost of XMET data and the cost of XRF data, both in units of $10^5.

Numerical examples: Schedule

17 Jan: Gaussian processes
24 Jan: Hands-on project on Gaussian processes
31 Jan: Latent Gaussian models and INLA
7 Feb: Hands-on project on INLA
12-13 Feb: Template Model Builder (guest lecturer Hans J Skaug)

Numerical examples: Schedule

The project (24 Jan) will include:
Expected improvement for function maximization.
Parameter estimation in a spatial regression model (forest example).