Gaussian process for nonstationary time series prediction


Computational Statistics & Data Analysis 47 (2004) 705-712
www.elsevier.com/locate/csda
doi:10.1016/j.csda.2004.02.006

Sofiane Brahim-Belhouari, Amine Bermak
EEE Department, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
Corresponding author. Tel.: +852-2358-8844; fax: +852-2358-1485. E-mail address: eebelhou@ust.hk (S. Brahim-Belhouari).

Received 27 June 2003; received in revised form 12 February 2004; accepted 13 February 2004

Abstract

In this paper, the problem of time series prediction is studied. A Bayesian procedure based on Gaussian process models using a nonstationary covariance function is proposed. Experiments demonstrate the effectiveness of the approach, with excellent prediction accuracy and good tracking. The conceptual simplicity and good performance of Gaussian process models should make them very attractive for a wide range of problems. (c) 2004 Elsevier B.V. All rights reserved.

Keywords: Bayesian learning; Gaussian processes; Prediction theory; Time series

1. Introduction

A new method for regression was inspired by Neal's work on Bayesian learning for neural networks (Neal, 1996). It is an attractive method for modelling noisy data, based on priors over functions using Gaussian Processes (GP). It was shown that many Bayesian regression models based on neural networks converge to Gaussian processes in the limit of an infinite network (Neal, 1996). Gaussian process models have been applied to the modelling of noise-free (Neal, 1997) and noisy data (Williams and Rasmussen, 1996), as well as to classification problems (MacKay, 1997). In our previous work, a Gaussian process approach to the forecasting problem was proposed and compared with a radial basis function neural network (Brahim-Belhouari and Vesin, 2001). Bayesian learning has shown an excellent prediction capability for stationary problems. However, most real-life time series are nonstationary, and hence improved models dealing with nonstationarity are required.

In this paper, we present a prediction approach based on Gaussian processes which has been successfully applied to nonstationary time series. In addition, the use of Gaussian processes with different covariance functions for real temporal patterns is investigated. The proposed approach is shown to outperform a Radial Basis Function (RBF) neural network. The advantage of the Gaussian process formulation is that the combination of the prior and noise models is carried out exactly using matrix operations. We also present a maximum likelihood approach used to estimate the hyperparameters of the covariance function. These hyperparameters control the form of the Gaussian process.

The paper is organized as follows: Section 2 presents a general formulation of the forecasting problem. Section 3 discusses the application of GP models, with a nonstationary covariance function, to prediction problems. Section 4 shows the feasibility of the approach for a real nonstationary time series. Section 5 presents some concluding remarks.

2. Prediction problem

Time series are encountered in science as well as in real life. Most common time series are the result of unknown or incompletely understood systems. A time series x(k) is defined as a function x of an independent variable k, generated from an unknown system. Its main characteristic is that its evolution cannot be described exactly.

The observation of past values of a phenomenon in order to anticipate its future behavior represents the essence of forecasting. A typical approach is to predict by constructing a prediction model which takes into account previous outcomes of the phenomenon. We can take a set of d such values x_{k-d+1}, ..., x_k to be the model input and use the next value x_{k+1} as the target. In a parametric approach to forecasting we express the predictor in terms of a nonlinear function y(x; \theta) parameterized by parameters \theta. It implements a nonlinear mapping from the input vector x = [x_{k-d+1}, ..., x_k] to the real value:

t_i = y(x_i; \theta) + \epsilon_i, \quad i = 1, \ldots, n,   (1)

where \epsilon is a noise term corrupting the data points. Time series processing is an important application area of neural networks. In fact, y can be given by a specified network. The output of the radial basis function (RBF) network is computed as a linear superposition (Bishop, 1995):

y(x; \theta) = \sum_{j=1}^{M} w_j g_j(x),   (2)

where w_j denotes the weights of the output layer. The Gaussian basis functions g_j are defined as

g_j(x) = \exp\left\{ -\frac{\|x - \mu_j\|^2}{2\sigma_j^2} \right\},   (3)

where \mu_j and \sigma_j^2 denote the means and variances. Thus we define the parameters as \theta = [w_j, \mu_j, \sigma_j] (j = 1, ..., M), which can be estimated using a special-purpose training algorithm.
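To make the construction of predictive patterns concrete, the following minimal sketch (Python with NumPy, which the paper itself does not use) windows a one-dimensional series into the d-input/one-target pairs of Eq. (1). The function name make_patterns and its interface are illustrative choices, not part of the original work.

import numpy as np

def make_patterns(series, d):
    """Window a 1-D time series into (input, target) pairs.

    Each input is the d most recent values [x_{k-d+1}, ..., x_k];
    the target is the next value x_{k+1}, as in Eq. (1).
    """
    series = np.asarray(series, dtype=float)
    X = np.array([series[k - d + 1:k + 1] for k in range(d - 1, len(series) - 1)])
    t = series[d:]
    return X, t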

In nonparametric methods, predictions are obtained without representing the unknown system as an explicit parameterized function. A new method for regression was inspired by Neal's work (1996) on Bayesian learning for neural networks. It is an attractive method for modelling noisy data, based on priors over functions using Gaussian Processes.

3. Gaussian process models

The Bayesian analysis of forecasting models is difficult because a simple prior over parameters implies a complex prior distribution over functions. Rather than expressing our prior knowledge in terms of a prior for the parameters, we can instead integrate over the parameters to obtain a prior distribution for the model outputs in any set of cases. The prediction operation is most easily carried out if all the distributions are Gaussian. Fortunately, Gaussian processes are flexible enough to represent a wide variety of interesting model structures, many of which would have a large number of parameters if formulated in a more classical fashion.

A Gaussian process is a collection of random variables, any finite set of which has a joint Gaussian distribution (MacKay, 1997). For a finite collection of inputs x = [x^{(1)}, ..., x^{(n)}], we consider a set of random variables y = [y^{(1)}, ..., y^{(n)}] to represent the corresponding function values. A Gaussian process is used to define the joint distribution between the y's:

P(y \mid x) \propto \exp\left( -\frac{1}{2} y^\top \Sigma^{-1} y \right),   (4)

where the matrix \Sigma is given by the covariance function C:

\Sigma_{pq} = \mathrm{cov}(y^{(p)}, y^{(q)}) = C(x^{(p)}, x^{(q)}),   (5)

and cov stands for the covariance operator.

3.1. Predicting with Gaussian processes

The goal of Bayesian forecasting is to compute the distribution P(y^{(n+1)} \mid D, x^{(n+1)}) of the output y^{(n+1)} given a test input x^{(n+1)} and a set of n training points D = {(x^{(i)}, t^{(i)}), i = 1, ..., n}. Using Bayes' rule, we obtain the posterior distribution for the (n+1) Gaussian process outputs. By conditioning on the observed targets in the training set, the predictive distribution is Gaussian (Williams and Rasmussen, 1996):

P(y^{(n+1)} \mid D, x^{(n+1)}) \sim N\left( \mu_{y^{(n+1)}}, \sigma^2_{y^{(n+1)}} \right),   (6)

where the mean and variance are given by

\mu_{y^{(n+1)}} = a^\top Q^{-1} t,
\sigma^2_{y^{(n+1)}} = C(x^{(n+1)}, x^{(n+1)}) - a^\top Q^{-1} a,

with

Q_{pq} = C(x^{(p)}, x^{(q)}) + r^2 \delta_{pq},
a_p = C(x^{(n+1)}, x^{(p)}), \quad p = 1, \ldots, n,

and r^2 is the unknown variance of the Gaussian noise. In contrast to classical methods, we obtain not only a point prediction but a predictive distribution. This advantage can be used to obtain prediction intervals that describe the degree of belief in the predictions.
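The predictive distribution in Eq. (6) reduces to a few dense linear-algebra operations. The sketch below is an illustrative implementation under the assumptions above (a given covariance function C and noise variance r^2); the function gp_predict and its argument names are ours. For clarity it builds and solves the full n x n system for a single test input, which is adequate for small n.

import numpy as np

def gp_predict(X_train, t_train, x_new, cov_fn, r2):
    """Predictive mean and variance of the GP at one test input (Eq. (6)).

    X_train : (n, d) training inputs;  t_train : (n,) training targets;
    x_new   : (d,) test input;         cov_fn(x, x') -> scalar covariance C;
    r2      : noise variance added to the diagonal of Q.
    """
    n = len(X_train)
    Q = np.array([[cov_fn(X_train[p], X_train[q]) for q in range(n)]
                  for p in range(n)]) + r2 * np.eye(n)
    a = np.array([cov_fn(x_new, X_train[p]) for p in range(n)])
    mean = a @ np.linalg.solve(Q, t_train)                  # a^T Q^{-1} t
    var = cov_fn(x_new, x_new) - a @ np.linalg.solve(Q, a)  # C(x*, x*) - a^T Q^{-1} a
    return mean, var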

3.2. Training a Gaussian process

There are many possible choices of prior covariance functions. From a modelling point of view, the objective is to specify prior covariances which contain our prior beliefs about the structure of the function we are modelling. Formally, we are required to specify a function which will generate a nonnegative definite covariance matrix for any set of input points. The Gaussian process procedure can handle interesting models by simply using a covariance function with an exponential term (Williams and Rasmussen, 1996):

C_{\exp}(x^{(p)}, x^{(q)}) = w_0 \exp\left\{ -\frac{1}{2} \sum_{l=1}^{d} w_l \left( x_l^{(p)} - x_l^{(q)} \right)^2 \right\}.   (7)

This function expresses the idea that cases with nearby inputs will have highly correlated outputs. The simplest nonstationary covariance function, i.e. one for which C(x^{(p)}, x^{(q)}) \neq C(x^{(p)} - x^{(q)}), is the one corresponding to a linear trend:

C_{\mathrm{ns}}(x^{(p)}, x^{(q)}) = v_0 + v_1 \sum_{l=1}^{d} x_l^{(p)} x_l^{(q)}.   (8)

Addition and multiplication of simple covariance functions are useful in constructing a variety of covariance functions. It was found that the sum of both covariance functions (Eqs. (7) and (8)) works well.

Let us assume that a form of covariance function has been chosen, but that it depends on undetermined hyperparameters \theta = (v_0, v_1, w_0, w_1, \ldots, w_d, r^2). We would like to learn these hyperparameters from the training data. In a maximum likelihood framework, we adjust the hyperparameters so as to maximize the log likelihood of the hyperparameters:

\log P(D \mid \theta) = \log P(t^{(1)}, \ldots, t^{(n)} \mid x^{(1)}, \ldots, x^{(n)}, \theta)
                      = -\frac{1}{2} \log \det Q - \frac{1}{2} t^\top Q^{-1} t - \frac{n}{2} \log 2\pi.   (9)

It is possible to express analytically the partial derivatives of the log likelihood, which can form the basis of an efficient learning scheme. These derivatives are (MacKay, 1997)

\frac{\partial}{\partial \theta_i} \log P(D \mid \theta) = -\frac{1}{2} \mathrm{tr}\left( Q^{-1} \frac{\partial Q}{\partial \theta_i} \right) + \frac{1}{2} t^\top Q^{-1} \frac{\partial Q}{\partial \theta_i} Q^{-1} t.   (10)

We initialize the hyperparameters to random values (in a reasonable range) and then use an iterative method, for example conjugate gradients, to search for optimal values of the hyperparameters. We found that this approach is sometimes susceptible to local minima. To overcome this drawback, we randomly selected a number of starting positions within the hyperparameter space.
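A hedged sketch of this training procedure follows: the summed covariance of Eqs. (7) and (8), the negative of the log likelihood in Eq. (9), and a conjugate gradient search restarted from several random starting points. Hyperparameters are kept on a log scale so they stay positive; this reparameterization, the function names, and the use of SciPy's finite-difference gradient (instead of the analytic derivatives of Eq. (10)) are our simplifications, not choices stated in the paper.

import numpy as np
from scipy.optimize import minimize

def cov_sum(xp, xq, log_theta):
    """Sum of the exponential (Eq. (7)) and linear-trend (Eq. (8)) covariances.

    log_theta holds (log v0, log v1, log w0, log w1, ..., log wd).
    """
    d = len(xp)
    v0, v1, w0 = np.exp(log_theta[0]), np.exp(log_theta[1]), np.exp(log_theta[2])
    w = np.exp(log_theta[3:3 + d])
    c_exp = w0 * np.exp(-0.5 * np.sum(w * (xp - xq) ** 2))
    c_ns = v0 + v1 * np.dot(xp, xq)
    return c_exp + c_ns

def neg_log_likelihood(params, X, t):
    """Negative of Eq. (9); params = (log_theta..., log r2)."""
    n = len(X)
    log_theta, r2 = params[:-1], np.exp(params[-1])
    Q = np.array([[cov_sum(X[p], X[q], log_theta) for q in range(n)]
                  for p in range(n)]) + r2 * np.eye(n)
    _, logdet = np.linalg.slogdet(Q)
    alpha = np.linalg.solve(Q, t)                    # Q^{-1} t
    return 0.5 * logdet + 0.5 * t @ alpha + 0.5 * n * np.log(2 * np.pi)

def fit_hyperparameters(X, t, n_restarts=5, seed=0):
    """Conjugate gradient search from several random starts (Section 3.2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    best = None
    for _ in range(n_restarts):
        x0 = rng.uniform(-1.0, 1.0, size=d + 4)      # random start in a reasonable range
        res = minimize(neg_log_likelihood, x0, args=(X, t), method='CG')
        if best is None or res.fun < best.fun:
            best = res
    return best.x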

4. Experimental results

In order to test the performance of GP models using different covariance functions and to compare them with an RBF network, we used a respiration signal as a real nonstationary time series. The data set contains 725 samples representing a short recording of the respiratory rhythm via a thoracic belt. We obtained 400 patterns for training and 325 for testing the candidate models. Predictive patterns were generated by windowing 6 inputs and one output. We conducted experiments for GP models using the exponential covariance function (Eq. (7)) and a nonstationary covariance function given by the sum of Eqs. (7) and (8). The RBF network uses 30 centers chosen according to the validation set. The hyperparameters were adapted to the training data using a conjugate gradient search algorithm.

Results of prediction for both GP structures are given in Figs. 1 and 2. The time window is divided into a training part (time ≤ 400) and a test part (time > 400). As can be seen from Fig. 1, predictive patterns using a GP model with an exponential covariance function match the real signal perfectly during the training period, but deviate significantly during the test period.

Fig. 1. Respiration signal (dashed line) and the mean of the predictive distribution using a GP model with an exponential covariance function (solid line).
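For illustration only, the sketches above can be combined into an end-to-end run that mirrors the setup just described (6 inputs per pattern, a training window followed by a test window, mean square test error as the score). The respiration recording is not available here, so a synthetic nonstationary signal stands in for it, and a smaller training window is used to keep the run manageable; the numbers produced do not correspond to Table 1.

# Assumes make_patterns, cov_sum, neg_log_likelihood, fit_hyperparameters
# and gp_predict from the sketches above are already defined.
import numpy as np

rng = np.random.default_rng(1)
k = np.arange(725)
# Synthetic stand-in for the respiration signal: trend + oscillation + noise.
signal = 65.0 + 0.01 * k + 5.0 * np.sin(2 * np.pi * k / 40.0) + rng.normal(0.0, 0.5, k.size)

X, t = make_patterns(signal, d=6)
X_tr, t_tr = X[:100], t[:100]          # small window for speed (the paper uses 400)
X_te, t_te = X[100:150], t[100:150]

params = fit_hyperparameters(X_tr, t_tr, n_restarts=2)
log_theta, r2 = params[:-1], np.exp(params[-1])
cov_fn = lambda xp, xq: cov_sum(xp, xq, log_theta)

preds = np.array([gp_predict(X_tr, t_tr, x, cov_fn, r2)[0] for x in X_te])
print("mean square test error:", np.mean((preds - t_te) ** 2))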

Fig. 2. Respiration signal (dashed line) and the mean of the predictive distribution using a GP model with a nonstationary covariance function (solid line).

Table 1
Prediction performance expressed in terms of mean square test error

Method    Mean square test error
GP_ns     3.83
GP_exp    6.37
RBF       7.18

In Fig. 2, the GP model using a nonstationary covariance function shows good tracking performance. The prediction performance is quantified by the mean square test error and is shown in Table 1. We can see from this table that the GP model using a nonstationary covariance function performs better than the one using the exponential covariance function and the RBF network.

The prediction performance would be greatly enhanced if as many of the signal's local trends as possible fall within the training window. Consequently, a large training window is required to estimate reliable statistics about the underlying process. However, using a vast number of data patterns n is time consuming, as the inversion of an n x n matrix (Q) is needed. In fact, finding the predictive mean (Eq. (6)) and evaluating the gradient of the log likelihood (Eq. (10)) require the evaluation of Q^{-1}.

Fig. 3. Training time of the GP model as a function of the number of training patterns.

Fig. 3 shows the training times (on a SUN ULTRA SPARC 10) for different widths of the training window. The training time increases dramatically with the number of training patterns (for n = 400, the time required to perform the prediction was about 156 s). As a consequence, new approximation methods are needed to deal with this computational cost when the number of data patterns becomes large.

5. Concluding remarks

In this paper we proposed a forecasting method based on Gaussian process models. We have shown that reasonable prediction and tracking performance can be achieved in the case of nonstationary time series. In addition, Gaussian process models are simple, practical and powerful Bayesian tools for data analysis. One drawback of our prediction method based on Gaussian processes is the computational cost associated with the inversion of an n x n matrix. The cost of direct methods of inversion may become prohibitive when the number of data points n is greater than 1000 (MacKay, 1997). Further research is needed to deal with this issue. It is of interest to investigate the possibility of improving the performance further by using a more complicated parameterized covariance function. A multi-model forecasting approach with GP predictors could also be proposed as an on-line learning technique.

Acknowledgements

The work described in this paper was supported in part by a Direct Allocation Grant (project No. DAG02/03.EG05) and a PDF Grant from HKUST.

References

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

Brahim-Belhouari, S., Vesin, J.-M., 2001. Bayesian learning using Gaussian process for time series prediction. In: Proceedings of the 11th IEEE Statistical Signal Processing Workshop, pp. 433-436.

MacKay, D.J.C., 1997. Gaussian Processes: A Replacement for Supervised Neural Networks? Lecture notes for a tutorial at NIPS 1997, UK. http://www.inference.phy.cam.ac.uk/mackay/bayesgp.html

Neal, R.M., 1996. Bayesian Learning for Neural Networks. Springer, New York.

Neal, R.M., 1997. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report No. 9702, University of Toronto.

Williams, C.K.I., Rasmussen, C.E., 1996. Gaussian processes for regression. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (Eds.), Advances in Neural Information Processing Systems 8. MIT Press, Cambridge, MA, pp. 111-116. http://www.cs.toronto.edu/care/gp.html