
A Bayesian Approach to Combined Forecasting

Maurits D. Out and Walter A. Kosters
Leiden Institute of Advanced Computer Science, Universiteit Leiden
P.O. Box 9512, 2300 RA Leiden, The Netherlands
Email: {mout,kosters}@liacs.nl

Abstract. Suitable neural networks may act as experts for time series prediction. The naive prediction is used in a Bayesian manner as prior to steer the weighted combination of these experts.

1. Introduction

Predicting the near future using information from the past is a challenging task. In this paper we try to do so by combining neural networks and techniques from Bayesian statistics (cf. [1] and [6]) in order to generate a prediction for an observable y_{t+1}, given the time series y_1, y_2, ..., y_t.

We start from the following observations. First, it is hard to beat the naive prediction y_t in the setting above. Secondly, it is a complicated matter to combine expert forecasts into better ones. We feel that a joint effort might generate better results, especially if neural networks support the process. For this purpose we use so-called FIR networks (see [4]), where the usage of previous activations during the training stage makes them suited for time series analysis.

In this paper we use a Bayesian model for time series analysis. The prior is based upon the naive prediction. We then develop neural network techniques to provide the necessary expert results. These ingredients are combined into a joint forecast. Here we extend methods like those in [2]. Experiments show the potential of this approach. We conclude with suggestions for further research. Among other things we discuss different possibilities for the expert forecasts, all of them based upon neural network techniques.

2. Construction of a Bayesian Model

In this paper we consider a process {y_t : t = 1, 2, ...} where y_t is a one-dimensional continuous random variable at time t. Suppose we are at time t and we are given the task of predicting y_{t+1} given the sequence y_1, y_2, ..., y_t. We consider the following prior probability distribution of y_{t+1}:

    $(y_{t+1} \mid y_t) \sim N(y_t, \sigma^2).$          (1)
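For concreteness, the prior (1) can be evaluated numerically. The following minimal sketch (Python/NumPy; the values of y_t and σ² are purely illustrative and would in practice come from the observed series) simply spells out the Gaussian density centred on the naive prediction.

    import numpy as np

    def prior_density(y_next, y_t, sigma2):
        """Density of the prior (1): y_{t+1} | y_t ~ N(y_t, sigma^2)."""
        return np.exp(-(y_next - y_t) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

    # Illustrative values only: the naive prediction y_t and a prior variance.
    print(prior_density(0.51, y_t=0.48, sigma2=0.01))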

This means that we assume that y_{t+1} has a Gaussian distribution with mean y_t and variance σ². Suppose that at the same time we have N expert forecasts for y_{t+1} available, contained in the column vector (T denoting transpose)

    $\tilde{o}_t = (o^1_t, \ldots, o^N_t)^T.$          (2)

We assume that each forecast o^i_t (i = 1, 2, ..., N) is an unbiased estimator of y_{t+1}. This means that we can describe the relationship between õ_t and y_{t+1} by

    $\tilde{o}_t = y_{t+1}\,\tilde{e} + \tilde{v},$          (3)

where ẽ = (1, 1, ..., 1)^T is the vector of dimension N consisting of ones and ṽ = (v^1, v^2, ..., v^N)^T is the vector of random errors, assumed to have a multivariate Gaussian distribution with mean zero (the N-dimensional zero vector) and an N × N variance-covariance matrix Σ. This means that the distribution of õ_t given y_{t+1} (referred to as the likelihood of õ_t) can be described by

    $(\tilde{o}_t \mid y_{t+1}) \sim N(y_{t+1}\,\tilde{e}, \Sigma).$          (4)

Starting from these observations we can now use Bayes' rule to derive the probability distribution of y_{t+1} given õ_t and y_t, which is called the posterior distribution of y_{t+1}:

    $p(y_{t+1} \mid \tilde{o}_t, y_t) = \frac{p(y_{t+1} \mid y_t)\, p(\tilde{o}_t \mid y_{t+1}, y_t)}{p(\tilde{o}_t \mid y_t)},$          (5)

where p(·) denotes the probability density function. In [1] it is shown that

    $(y_{t+1} \mid \tilde{o}_t, y_t) \sim N(\mu_t, \xi^2),$          (6)

where

    $\mu_t = W y_t + (1 - W)\, m_t, \qquad \xi^2 = W \sigma^2,$          (7)

    $W = 1 / \{\sigma^2\, \tilde{e}^T \Sigma^{-1} \tilde{e} + 1\},$          (8)

    $m_t = \sum_{i=1}^{N} w_i\, o^i_t, \qquad (w_1, w_2, \ldots, w_N) = \tilde{e}^T \Sigma^{-1} / \tilde{e}^T \Sigma^{-1} \tilde{e}.$          (9)

The mean μ_t in (7) is called the Bayesian estimator. It is easy to see that it is unbiased (i.e., the expectation of μ_t equals y_{t+1}). Another well-known result is that it is optimal in the sense that it is an estimator with minimal variance. We will use the Bayesian estimator as our forecast for the unknown y_{t+1}.

Note that the combination is carried out at two levels. First, a weighted average of the expert forecasts, based on the variance-covariance matrix, is computed in (9). This results in the estimator m_t, which is then combined with y_t (the naive prediction) in (7), using both the variance-covariance matrix Σ and the variance σ² of the prior distribution, in order to get the mean of the posterior distribution.
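As an illustration of how (6)-(9) combine the naive prediction with the expert forecasts, the following sketch (Python/NumPy; the function and variable names are ours, and σ² and Σ are taken as given here, anticipating the estimators discussed next) computes the posterior mean and variance.

    import numpy as np

    def bayesian_combination(o_t, y_t, sigma2, Sigma):
        """Posterior mean mu_t (eq. (7)) and variance xi^2 = W * sigma^2 for the
        combination of N expert forecasts o_t with the naive prediction y_t."""
        N = len(o_t)
        e = np.ones(N)
        Sigma_inv = np.linalg.inv(Sigma)

        eSe = e @ Sigma_inv @ e              # e^T Sigma^{-1} e
        W = 1.0 / (sigma2 * eSe + 1.0)       # eq. (8)
        w = (e @ Sigma_inv) / eSe            # expert weights, eq. (9)
        m_t = w @ o_t                        # weighted expert average, eq. (9)

        mu_t = W * y_t + (1.0 - W) * m_t     # Bayesian estimator, eq. (7)
        return mu_t, W * sigma2

    # Illustrative usage with three hypothetical experts:
    o_t = np.array([0.52, 0.49, 0.55])
    Sigma = np.diag([0.02, 0.03, 0.05])
    mu_t, xi2 = bayesian_combination(o_t, y_t=0.48, sigma2=0.01, Sigma=Sigma)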

A problem, however, lies in the fact that in most situations Σ and σ² are unknown. One way to overcome this problem is to estimate these quantities using the known observations y_1, y_2, ..., y_t. We can for example estimate σ² by maximizing the likelihood L:

    $L(\sigma^2) = \prod_{k=2}^{t} p(y_k \mid y_{k-1}),$          (10)

which gives us the following estimator:

    $\hat{\sigma}^2 = \frac{1}{t-1} \sum_{k=2}^{t} (y_k - y_{k-1})^2.$          (11)

The variance-covariance matrix can also be estimated from y_1, y_2, ..., y_t, which yields the N × N matrix $\hat{\Sigma}$ with

    $(\hat{\Sigma})_{ij} = \frac{1}{t-2} \sum_{k=2}^{t} (v^i_k - v^i_{avg})(v^j_k - v^j_{avg}) \qquad (i, j = 1, 2, \ldots, N),$          (12)

where $v^i_k = o^i_k - y_{k+1}$ and $v^i_{avg} = \frac{1}{t-1} \sum_{k=2}^{t} v^i_k$ denote the error of forecast i at time k and its average error, respectively.

In [2] several techniques that combine neural network forecasts are discussed. We mention bumping, i.e., taking the network with the best performance, and bagging, i.e., taking the unweighted mean of the networks.
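The estimators (11) and (12) are straightforward to compute. The sketch below (again Python/NumPy, with our own naming and array layout) assumes each past expert forecast is paired with a target that has since been observed, so the errors v^i_k can be formed for every term in the sums.

    import numpy as np

    def estimate_sigma2(y):
        """Maximum-likelihood estimate of the prior variance, eq. (11)."""
        d = np.diff(y)                           # y_k - y_{k-1} for k = 2..t
        return np.sum(d ** 2) / (len(y) - 1)

    def estimate_Sigma(O, y):
        """Sample variance-covariance matrix of the expert errors, eq. (12).

        O has shape (t-1, N): row j (0-based) holds the N expert forecasts of
        y[j+1], so each forecast is paired with an already observed target.
        The divisor t-2 matches the normalization in (12)."""
        targets = y[1:]                          # the realized values y_2..y_t
        V = O - targets[:, None]                 # error matrix, entries v^i_k
        V_centered = V - V.mean(axis=0)          # subtract the average errors
        return (V_centered.T @ V_centered) / (len(targets) - 1)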

3. Neural Networks as Expert Forecasts

In the previous section we discussed a Bayesian framework in which several expert forecasts were used to derive the posterior probability distribution of y_{t+1}. In our model we let feed-forward networks provide such forecasts. Their ability to discover non-linear relationships between an input space and a corresponding target space by means of examples makes them suitable models to generate predictions in time series.

The main idea is to let each network generate a prediction of y_{t+1} when it is presented with an input pattern containing the sequence (y_{t-M}, y_{t-M+1}, ..., y_t) for a certain M ≥ 0. Such input patterns are also referred to as time windows. We choose standard feed-forward networks containing a single hidden layer, where the units have sigmoid activation functions. This layer is fully connected to the single neuron in the output layer, which has the identity as its activation function. The output of this neuron will be the actual prediction of the network. As mentioned before, we let the input consist of a time window containing the current observation and the M previous observations. This means that the output of the network is a function of the observation at the current time and of past observations. This construction may be viewed as a special instance of a FIR network (see [4]).

In the literature several strategies have been proposed for training and testing a feed-forward network for time series prediction. For our experiments we adopt the following scheme. Suppose we are at time t and we want to generate a prediction of the unknown y_{t+1}. We then compose a training set consisting of the S preceding time windows with corresponding targets. We train the network during a number of cycles on this training set by means of teacher-forcing adaptation, i.e., we do not feed the actual output of the network back as input, but take the real targets instead. We use the standard back-propagation algorithm to minimize the sum of squares error of the network on the training set:

    $E_t = \frac{1}{2} \sum_{s=1}^{S} (o_{t-s+1} - y_{t-s+1})^2,$          (13)

where o_{t-s+1} denotes the output of the network when presented with the time window (y_{t-s-M}, y_{t-s-M+1}, ..., y_{t-s}).
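To make this scheme concrete, the sketch below builds the S preceding time windows and trains an ensemble of small networks on them. It uses scikit-learn's MLPRegressor as a stand-in for the back-propagation networks described above (one hidden layer of sigmoid units, identity output, squared-error training as in (13)); the function names, the learning-rate setting and the random seeding are our own choices, and the parameter values match those used in the experiments of the next section.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    M, S, N_EXPERTS, CYCLES = 10, 20, 30, 50     # settings used in Section 4

    def time_window(y, end):
        """The time window (y_{end-M}, ..., y_end), with 0-based indexing."""
        return y[end - M : end + 1]

    def expert_forecasts(y, t):
        """Train N_EXPERTS networks on the S preceding windows and let each
        one predict y_{t+1} (teacher forcing: real observations as targets)."""
        X = np.array([time_window(y, t - s) for s in range(1, S + 1)])
        targets = np.array([y[t - s + 1] for s in range(1, S + 1)])
        x_now = time_window(y, t).reshape(1, -1)

        forecasts = []
        for seed in range(N_EXPERTS):
            net = MLPRegressor(hidden_layer_sizes=(5,), activation="logistic",
                               solver="sgd", learning_rate_init=0.1,
                               max_iter=CYCLES, random_state=seed)
            net.fit(X, targets)                  # squared-error training, cf. (13)
            forecasts.append(net.predict(x_now)[0])
        return np.array(forecasts)               # the vector of expert forecasts o_t

The resulting forecast vector can then be fed, together with the estimates of σ² and Σ, into the Bayesian combination of Section 2.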

4. Experiments and Results

In this section we present and discuss the results of experiments performed for two datasets. Three methods of constructing an ensemble (bumping, bagging and the Bayesian model described in Section 2) and the naive prediction were implemented. For both datasets we constructed N = 30 feed-forward networks, each having one hidden layer consisting of 5 units. At each point of time they were trained for 50 cycles on the S = 20 preceding time windows with M = 10. After training we let each technique make a prediction of the following element in the time series, of which the error was computed. For each dataset the experiment was repeated 20 times and afterwards the average and the standard deviation of the root-mean-square error were computed. The following two datasets were used:

The Mackey-Glass dataset: This is a dataset consisting of the first 1000 points of the series generated by the Mackey-Glass delay-differential equation with delay parameter 30, as described in [3].

Sunspots dataset: This dataset contains 280 yearly sunspot numbers, as described in [5].

              Mackey-Glass              Sunspots
              average     std. dev.     average     std. dev.
  bumping     0.047       0.00056       0.397       0.03192
  bagging     0.052       0.00059       0.335       0.00638
  naive       0.380       -             0.382       -
  Bayes       0.038       0.00077       0.323       0.00623

Table 1: The average and standard deviation of the root-mean-square errors over 20 runs for the two datasets.

Figure 1: Graphs showing the performance of the Bayesian estimator (plotted against the target observations) on the Mackey-Glass series for t = 900 to 1000 and on the Sunspots series for the years 1870 to 1980.

In Table 1 the average root-mean-square errors yielded by the different techniques over 20 runs are presented. For bumping, bagging and the Bayesian model the standard deviation is also included. Note that the naive prediction has the same performance over the runs, so its standard deviation was omitted. For both datasets we see that the Bayesian model provides us with the lowest error. The improvement is significant when compared to the other techniques. To illustrate the performance of the Bayesian model, the time series and the function computed by Bayes' estimator for the last 100 observations of both datasets are shown in Figure 1.

5. Conclusions and Further Research

We may conclude that the approach provides promising results. The Bayesian combination of naive prediction and neural network experts is fruitful, and adheres to the intuition concerning expert forecasting.

For further research we are interested in the following. Contrary to other approaches, the experts involved are neural networks. During the training stage it might be helpful to anticipate the later use of the outputs, and to ensure different behaviour for the networks. For instance, they might be trained in different directions in order to generate independent predictions. Variations on the methods from [2] need further study. We are also examining other underlying probability distributions, such as the generalized Laplace distribution.

References

[1] G. Anandalingam and L. Chen, Linear Combination of Forecasts: A General Bayesian Model, Journal of Forecasting 8 (1989), 199-214.

[2] T. Heskes, Balancing Between Bagging and Bumping, Advances in Neural Information Processing Systems 9 (1997), 466-472.

[3] M.C. Mackey and L. Glass, Oscillation and Chaos in Physiological Control Systems, Science 197 (1977), 287.

[4] E.A. Wan, Time Series Prediction Using a Connectionist Network with Internal Delay Lines, in A.S. Weigend and N.A. Gershenfeld (editors), Time Series Prediction: Forecasting the Future and Understanding the Past, SFI Studies in the Sciences of Complexity, Addison-Wesley, 1994.

[5] A.S. Weigend, B.A. Huberman and D.E. Rumelhart, Predicting Sunspots and Exchange Rates with Connectionist Networks, pp. 395-432 in M. Casdagli and S. Eubank (editors), Nonlinear Modeling and Forecasting, Addison-Wesley, 1992.

[6] M.C. van Wezel, W.A. Kosters and J.N. Kok, Maximum Likelihood Weights for a Linear Ensemble of Regression, Proceedings ICONIP 1998, Japan, 498-501.