Learning Gaussian Process Models from Uncertain Data

Patrick Dallaire, Camille Besse, and Brahim Chaib-draa
DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada
{dallaire,besse,chaib}@damas.ift.ulaval.ca
http://www.damas.ift.ulaval.ca

Abstract. The traditional formulation of supervised learning generally assumes that only the output data are uncertain. However, this assumption may be too strong for some learning tasks. This paper investigates the use of a Gaussian Process prior to infer consistent models from uncertain data. By assuming a Gaussian distribution with known variances over the inputs and a Gaussian covariance function, it is possible to marginalize out the input uncertainty and keep an analytical posterior distribution over functions. We demonstrate the properties of the method on a synthetic problem and on a more realistic one, which consists in learning the dynamics of the well-known cart-pole system, and we compare its performance to that of a classic Gaussian Process. A large improvement in mean squared error is obtained, and the regression results remain consistent.

Key words: Gaussian Processes, Noisy Inputs, Dynamical Systems

1 Introduction

As soon as a regression has to be performed with a statistical model on noisy inputs, the quality of the estimated model may suffer if no attention is paid to the uncertainty of the training set. This degradation can occur in two ways: through training directly on noisy inputs, and through the extra output noise induced by the noise on the inputs. Statisticians have already investigated this problem in several ways: total least squares [1] changes the cost of the regression problem to encourage the regressor to minimize both the error due to noise on the outputs and the error due to noise on the inputs; the error-in-variables model [2] deals directly with noisy inputs by creating correlated virtual variables that consequently have correlated noises. Recent work in machine learning has also addressed this problem, either by attempting to learn the entire input distribution [3], by integrating over chosen noisy points using a distribution estimated during training [4], or by de-noising the inputs by accounting for the noise while training the model [5].

In this paper, we investigate an approach, pioneered by Girard [6], in which rather than merely predicting from noisy inputs, we learn from these inputs by marginalizing out the input uncertainty while keeping an analytical posterior distribution over functions. This approach achieves two goals. First, it shows that we are able to learn and make

predictions from noisy inputs. Second, the method is applied to the well-known problem of balancing a pole on a cart, where the task is to learn the 5-dimensional nonlinear dynamics of the system. Results show that taking the uncertainty of the inputs into account makes the regression consistent and drastically reduces the mean squared error.

This paper is structured as follows. First, we formalize the problem of learning with noisy inputs and introduce some notation for Gaussian Processes and the regression model. In Section 3, we present experimental results on a difficult artificial problem and on a more realistic one. Section 4 discusses the results and concludes the paper.

2 Preliminaries

A Gaussian Process (GP) is a stochastic process used in machine learning to describe a distribution directly in function space. It provides a probabilistic approach to the learning task and has the appealing property of giving uncertainty estimates along with its predictions. The interested reader is referred to [7] for more information on GPs.

2.1 Gaussian Process regression

Under a GP prior, the joint distribution of a finite set of observations given their inputs is assumed to be multivariate Gaussian. Thus, a GP is fully specified by its mean and covariance functions. Assume that a set of training data D = {x_i, y_i}_{i=1}^N is available, where x_i ∈ R^D and y_i is a scalar observation such that

$y_i = f(x_i) + \epsilon_i$    (1)

where ε_i is white Gaussian noise. For convenience, we use the notation X = [x_1, ..., x_N] for the inputs and y = [y_1, ..., y_N] for the outputs. Under the GP prior with zero mean function, the joint distribution of the training set is y | X ~ N(0, K), where K is the covariance matrix whose entries K_ij are given by the covariance function C(x_i, x_j). This multivariate Gaussian distribution over the training observations can be used to compute the posterior distribution over functions. Predictions are then made using the posterior mean and its associated measure of uncertainty, given by the posterior covariance. For a test input x_*, the posterior distribution is f_* | x_*, X, y ~ N(µ(x_*), σ^2(x_*)), with mean and variance functions given by

$\mu(x_*) = k_*^\top K^{-1} y$    (2)
$\sigma^2(x_*) = C(x_*, x_*) - k_*^\top K^{-1} k_*$    (3)

where k_* is the N × 1 vector of covariances between x_* and the training inputs X. Although many covariance functions can be used to define a GP prior, for the remainder of this paper we use the squared exponential, one of the most widely used kernel functions. The chosen kernel function

$C(x_i, x_j) = \sigma_f^2 \exp\!\left(-\tfrac{1}{2}(x_i - x_j)^\top W^{-1}(x_i - x_j)\right) + \sigma_\epsilon^2 \delta_{ij}$    (4)

is parameterized by a vector of hyperparameters θ = [W, σ_f^2, σ_ε^2], where W is the diagonal matrix of characteristic length-scales, which allows a different covariance measure for each input dimension, σ_f^2 is the signal variance and σ_ε^2 is the noise variance. Varying these hyperparameters influences the interpretation of the training data by modifying the shapes of the functions allowed by the GP prior.
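To make these regression equations concrete, here is a minimal Python sketch (our illustration, not the authors' code; helper names such as `se_kernel` and `gp_posterior`, as well as the toy hyperparameter values, are assumptions) of the squared exponential covariance of Eq. (4) and the posterior mean and variance of Eqs. (2)-(3):

```python
import numpy as np

def se_kernel(Xa, Xb, length_scales, signal_var):
    # Squared exponential (ARD) covariance of Eq. (4), without the noise term:
    # sigma_f^2 * exp(-0.5 (x_i - x_j)^T W^{-1} (x_i - x_j)), with W = diag(length_scales).
    diff = Xa[:, None, :] - Xb[None, :, :]                # shape (Na, Nb, D)
    sq_dist = np.sum(diff ** 2 / length_scales, axis=-1)
    return signal_var * np.exp(-0.5 * sq_dist)

def gp_posterior(X, y, X_star, length_scales, signal_var, noise_var):
    # Posterior mean (Eq. 2) and variance (Eq. 3) of the latent function at X_star.
    K = se_kernel(X, X, length_scales, signal_var) + noise_var * np.eye(len(X))
    k_star = se_kernel(X, X_star, length_scales, signal_var)   # shape (N, M)
    mean = k_star.T @ np.linalg.solve(K, y)
    var = signal_var - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mean, var

# Example usage on toy data (assumed hyperparameter values).
X = np.random.uniform(-3, 3, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * np.random.randn(20)
mu, var = gp_posterior(X, y, np.linspace(-3, 3, 50)[:, None],
                       length_scales=np.array([1.0]), signal_var=1.0, noise_var=0.01)
```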

It may be difficult to fix the hyperparameters of a kernel function a priori and expect them to fit the observed data correctly. A common way to estimate the hyperparameters is to maximize the log likelihood of the observations y [7]. Since the joint distribution of the observations is multivariate Gaussian, the function to maximize is

$\log p(y \mid X, \theta) = -\tfrac{1}{2} y^\top K^{-1} y - \tfrac{1}{2} \log |K| - \tfrac{N}{2} \log 2\pi.$    (5)

The maximization can be done using conjugate gradient methods to find an acceptable local maximum.

2.2 Learning with uncertain inputs

As stated in the introduction, the assumption that only the outputs are noisy is not sufficient for some learning tasks. Consider the case where the inputs are uncertain and each input value comes with a variance estimate. It has been shown by Girard [6] that, for normally distributed inputs and a squared exponential kernel function, integrating over the input distribution can be done analytically. Suppose the inputs are a set of Gaussian distributions rather than a set of point estimates: the true input value x_i is not observable, but we have access to its distribution N(u_i, Σ_i). Accounting for these input distributions is done by solving

$C_n = \int\!\!\int C(x_i, x_j)\, p(x_i)\, p(x_j)\, dx_i\, dx_j$    (6)

where p(x_i) = N(u_i, Σ_i) and p(x_j) = N(u_j, Σ_j). Since (6) only involves integrations over products of Gaussians, the resulting kernel function can be computed exactly as

$C_n((u_i, \Sigma_i), (u_j, \Sigma_j)) = \frac{\sigma_f^2}{\left| I + W^{-1}(\Sigma_i + \Sigma_j) \right|^{1/2}} \exp\!\left(-\tfrac{1}{2}(u_i - u_j)^\top (W + \Sigma_i + \Sigma_j)^{-1}(u_i - u_j)\right) + \sigma_\epsilon^2 \delta_{ij}$    (7)

which is again a squared exponential. (The noise term is not part of the integration, since it models an independent noise process; it therefore remains unchanged in the new kernel.) It is easy to see that this new kernel function is a generalization of (4), recovered by letting the covariance matrices of both inputs tend to zero. Hence, it is possible to learn from a combination of noise-free and uncertain inputs. Theoretically, learning from uncertain data is as difficult as in the noise-free case, although it might require more data. The posterior distribution over functions is found with the same equations, using the new covariance function.
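A short Python sketch of the expected kernel in Eq. (7) follows (our illustration under the paper's assumptions, not the authors' code; the name `uncertain_se_kernel` is ours). It shows how the input covariances Σ_i and Σ_j enter both the exponential and the normalizing determinant, and how Eq. (4) is recovered when they vanish:

```python
import numpy as np

def uncertain_se_kernel(u_i, S_i, u_j, S_j, length_scales, signal_var,
                        noise_var=0.0, same_point=False):
    # Expected squared exponential covariance of Eq. (7) for Gaussian inputs
    # x_i ~ N(u_i, S_i) and x_j ~ N(u_j, S_j), with W = diag(length_scales).
    W = np.diag(length_scales)
    S = S_i + S_j
    diff = u_i - u_j
    quad = diff @ np.linalg.solve(W + S, diff)               # (u_i-u_j)^T (W+S_i+S_j)^{-1} (u_i-u_j)
    det = np.linalg.det(np.eye(len(u_i)) + np.linalg.solve(W, S))
    k = signal_var / np.sqrt(det) * np.exp(-0.5 * quad)
    if same_point:                                           # delta_ij term, only when i == j
        k += noise_var
    return k

# With S_i = S_j = 0 the determinant factor is 1 and the expression reduces to
# the noise-free squared exponential of Eq. (4).
u = np.array([0.3, -1.2])
print(uncertain_se_kernel(u, np.zeros((2, 2)), u, np.zeros((2, 2)),
                          length_scales=np.ones(2), signal_var=1.0))
```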

The hyperparameters can be learned from the log likelihood as well, but it is now riddled with many local maxima. Standard conjugate gradient methods quickly lead to a local maximum that may not explain the data properly. An improper local maximum that often occurs is to interpret the observations as highly noisy: the matrix W then tends to have large values on its diagonal, meaning that most dimensions are considered irrelevant, and the value of σ_ε^2 is overestimated so as to transfer the input error to the output dimension. A solution to this difficulty is to find a maximum a posteriori (MAP) estimate of the hyperparameters. Placing a prior over the hyperparameters then acts as a regularization term that prevents improper local maxima. In the experiments, we chose a prior from the exponential family in order to obtain a simpler log posterior function to maximize.

3 Experiments

In our experiments, we compare the performance of the Gaussian Process that uses the input uncertainty (the uncertain-input GP) with the standard Gaussian Process (the standard GP), which uses only the point estimates of the inputs. We first evaluate the behavior of each method on a one-dimensional synthetic problem and then compare their performance on a harder problem, which consists in learning the nonlinear dynamics of a cart-pole system.

3.1 Synthetic Problem: Sincsig

In order to easily visualize the behavior of both GP priors, we chose a one-dimensional function for the first learning example. The function is composed of a sinc and a sigmoid function,

$y = \begin{cases} \operatorname{sinc}(x) & \text{if } x \leq 0 \\ 0.5\,[1 + \exp(-10x - 5)]^{-1} + 0.5 & \text{otherwise} \end{cases}$    (8)

and we refer to it as the Sincsig function. The evaluation was conducted on randomly drawn training sets of different sizes. We uniformly sampled N inputs in [−10, 10], which are the noise-free inputs {x_i}_{i=1}^N. The observation set was then constructed by sampling each output according to y_i ~ N(sincsig(x_i), σ_y^2). The uncertain inputs were obtained by sampling the noise variance σ_{x_i}^2 to be applied to each input: for each noise-free x_i, we sampled the noisy input according to u_i ~ N(x_i, σ_{x_i}^2). It is easy to see that x_i | u_i, σ_{x_i}^2 ~ N(u_i, σ_{x_i}^2), and therefore we have a complete training set defined as D = {(u_i, σ_{x_i}^2), y_i}_{i=1}^N.

Figure 1 shows a typical example of a training data set (crosses), the real function to be regressed (solid line) and the result of the regression (thin line) for the uncertain-input GP (top) and the classic GP (bottom). The error bars indicate that the standard GP is not consistent with the data, since it does not take the noise variance on the inputs into account.

The first experiment was conducted with an output noise standard deviation σ_y = 0.1 and training sets of different sizes. The input noise standard deviations σ_{x_i} were sampled uniformly in [0.5, 2.5]. We chose these standard deviations so that the noise on the inputs can be explained, during the optimization process, by artificially adding some independent noise on the outputs. All comparisons between the uncertain-input GP and the standard GP were done by training both on the same random data sets. (Note that the standard Gaussian Process regression does not use the variances of the inputs.)
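The construction of the synthetic training set described above can be sketched as follows (a minimal illustration, not the authors' code; the helper `make_training_set`, the random seed, and the branch condition x ≤ 0 of Eq. (8) as written here are assumptions; `np.sinc` is rescaled so that sinc(x) = sin(x)/x):

```python
import numpy as np

rng = np.random.default_rng(0)

def sincsig(x):
    # The Sincsig test function of Eq. (8).
    return np.where(x <= 0,
                    np.sinc(x / np.pi),                        # sin(x)/x
                    0.5 / (1.0 + np.exp(-10.0 * x - 5.0)) + 0.5)

def make_training_set(N, sigma_y=0.1, noise_range=(0.5, 2.5)):
    x = rng.uniform(-10.0, 10.0, size=N)                       # noise-free inputs
    y = rng.normal(sincsig(x), sigma_y)                        # noisy outputs
    sigma_x = rng.uniform(*noise_range, size=N)                # per-input noise std
    u = rng.normal(x, sigma_x)                                 # observed noisy inputs
    # Training set: Gaussian input distributions N(u_i, sigma_x_i^2) with outputs y_i.
    return u, sigma_x ** 2, y

u, input_var, y = make_training_set(100)
```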

Fig. 1. The Sincsig function with the uncertain-input GP and standard GP regressions.

Fig. 2. Results on the Sincsig problem: (a) mean squared error, σ_y = 0.1; (b) mean squared error, σ_y ∈ (0.5, 2.5).

Figure 2(a) shows the averaged mean squared error over 25 randomly chosen training sets for different values of N. The results show that when very few data are available, both processes explain the observations with a large amount of output noise. As expected, when the size of the data set increases, the standard GP optimizes its hyperparameters so as to explain the noisy inputs by very noisy outputs, while the uncertain-input GP correctly attributes the noise to the inputs and selects the less noisy interpretation so as to minimize the mean squared error.

In the second experiment, in order to emphasize the impact of noisy inputs, we assumed that the Gaussian Processes know the noise variance on the observations. The noise hyperparameter σ_ε^2 is therefore set to zero, since the processes know exactly which noise matrix to add when computing the covariance matrix. For each output, the standard deviation σ_{y_i} is uniformly sampled in [0.2, 0.5]. Figure 2(b) shows the performance of the uncertain-input GP and the standard GP. Not allowing the processes to explain noisy data through the independent noise term has two effects: first, it prevents the standard GP from explaining noisy inputs by noisy outputs when only few data are available; second, it forces the uncertain-input GP to use the information on the input variance whatever the size of the data set. Let us now look at the results on a realistic nonlinear dynamical system.

3.2 The Cart-Pole Problem

We now consider the harder problem of learning the cart-pole dynamics. Figure 3 depicts the system whose dynamics we try to learn. The state is defined by the position of the cart (φ), its velocity (φ̇), the pole's angle (α) and its angular velocity (α̇). There is also a control input used to apply lateral forces on the cart.

Fig. 3. The cart-pole balancing problem

Following the equations in [9] governing the dynamics, we used Euler's method to update the system's state:

$\ddot{\alpha} = \frac{g \sin\alpha + \cos\alpha \left( \dfrac{-F - m_p l \dot{\alpha}^2 \sin\alpha}{m_c + m_p} \right)}{l \left( \dfrac{4}{3} - \dfrac{m_p \cos^2\alpha}{m_c + m_p} \right)}$

$\ddot{\varphi} = \frac{F + m_p l \left( \dot{\alpha}^2 \sin\alpha - \ddot{\alpha} \cos\alpha \right)}{m_c + m_p}$

where g is the gravitational acceleration, F the force associated with the action, l the half-length of the pole, m_p the mass of the pole and m_c the mass of the cart.
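The Euler update used to generate state transitions can be sketched as follows (our illustration based on the dynamics equations above, not the authors' simulator; the physical constants and the time step are assumed values):

```python
import numpy as np

G = 9.81        # gravitational acceleration (assumed)
M_C = 1.0       # cart mass (assumed)
M_P = 0.1       # pole mass (assumed)
L = 0.5         # pole half-length (assumed)
DT = 0.02       # Euler integration step (assumed)

def cart_pole_step(state, F, dt=DT):
    # state = (phi, phi_dot, alpha, alpha_dot); F is the lateral force on the cart.
    phi, phi_dot, alpha, alpha_dot = state
    total_mass = M_C + M_P
    # Pole angular acceleration (first equation above).
    alpha_ddot = ((G * np.sin(alpha)
                   + np.cos(alpha) * (-F - M_P * L * alpha_dot**2 * np.sin(alpha)) / total_mass)
                  / (L * (4.0 / 3.0 - M_P * np.cos(alpha)**2 / total_mass)))
    # Cart acceleration (second equation above).
    phi_ddot = (F + M_P * L * (alpha_dot**2 * np.sin(alpha)
                               - alpha_ddot * np.cos(alpha))) / total_mass
    # Euler update of the 4-dimensional state.
    return (phi + dt * phi_dot,
            phi_dot + dt * phi_ddot,
            alpha + dt * alpha_dot,
            alpha_dot + dt * alpha_ddot)
```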

For this problem, the training sets were sampled exactly as in the Sincsig case. State-action pairs were uniformly sampled on their respective domains. The outputs were obtained from the true dynamical system and then perturbed with sampled noise that is assumed known. Since the output variances are also known, the training set can be seen as a set of Gaussian input distributions that map to Gaussian output distributions. Therefore, one might use a sequence of Gaussian belief states as a training set in order to learn a partially observable dynamical system. Following this idea, there is no reason for the output distributions to have a significantly smaller variance than the input distributions.

In this experiment, the input and output noise standard deviations were uniformly sampled in [0.5, 2.5] for each dimension. Each output dimension was treated independently, with a separate Gaussian Process prior for each of them. Figure 4 shows the averaged mean squared error over 25 randomly chosen training sets for different values of N, for each dimension.

Fig. 4. Mean squared error results on the cart-pole problem: (a) position, (b) velocity, (c) pole angle, (d) angular velocity.

3.3 Learning the kernel hyperparameters

As stated at the end of Section 2.2, it is possible to learn the hyperparameters given a training set. Since conjugate gradient methods performed poorly for the optimization of the log likelihood in the uncertain-input case, we preferred stochastic optimization methods for this task. In all experiments, we thus maximized the log posterior instead of the log likelihood. A gamma Γ(2, 1) prior distribution has been placed over each characteristic length-scale term in W, and a normal N(0, 1) prior distribution has been placed over the signal standard deviation σ_f. In contrast to previous work on the subject [6, 10], which uses isotropic hyperparameters in the kernel function, we applied automatic relevance determination, which considerably improves performance without increasing the complexity of the kernel.
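The MAP objective can be sketched as the log likelihood of Eq. (5), evaluated with the kernel of Eq. (7), plus the log priors just described (a minimal illustration under those assumptions, not the authors' code; the function name `log_posterior` and the use of scipy.stats are ours):

```python
import numpy as np
from scipy import stats

def log_posterior(log_likelihood, length_scales, sigma_f):
    # Unnormalized log posterior over the hyperparameters: Eq. (5) plus log priors.
    # Gamma(2, 1) prior on each characteristic length-scale in W,
    # standard normal N(0, 1) prior on the signal standard deviation sigma_f.
    log_prior = np.sum(stats.gamma.logpdf(length_scales, a=2.0, scale=1.0))
    log_prior += stats.norm.logpdf(sigma_f, loc=0.0, scale=1.0)
    return log_likelihood + log_prior
```

A stochastic optimizer would then search over the hyperparameters by evaluating this quantity instead of the raw likelihood.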

4 Discussion

Results for the synthetic problem are presented in Figures 1, 2(a) and 2(b). These results first show that using the knowledge of the noise on the inputs improves the consistency of the regression over the standard Gaussian Process, since the error bars of the uncertain-input GP completely include the real function while those of the standard GP do not. Second, the uncertain-input GP is also able to discriminate which part of the noise comes from the inputs and which comes from the outputs, as shown in Figures 2(a) and 2(b). As the standard GP does not assume any noise on the inputs, it always attributes the noise to the outputs, and thus learns a large noise hyperparameter, which also increases its mean squared error.

The difficulties of this approach arise as soon as the hyperparameters have to be optimized. Indeed, the log-likelihood function is riddled with local maxima that cannot be avoided using classic gradient methods. An interesting avenue would be to look at natural gradient approaches [11]. Another direction for future work concerns the application of this method to learning continuous Hidden Markov Models, as well as continuous POMDPs, by using the belief state as a noisy input [12].

To conclude, we proposed a Gaussian Process model for regression that is able to learn with noise on both the inputs and the outputs, and to predict with a lower mean squared error than previous approaches while remaining consistent with the true function. Results on a synthetic problem illustrate the advantages of the method, while results on the cart-pole problem show the applicability of the approach to the learning of real nonlinear dynamical systems, largely outperforming previous methods.

References

1. Golub, G., Van Loan, C.: An Analysis of the Total Least Squares Problem. SIAM J. Numer. Anal. 17 (1980) 883-893
2. Carroll, R., Ruppert, D., Stefanski, L.: Measurement Error in Nonlinear Models. Chapman and Hall (1995)
3. Ghahramani, Z., Jordan, M.I.: Supervised Learning from Incomplete Data via an EM Approach. In: NIPS. (1993) 120-127
4. Tresp, V., Ahmad, S., Neuneier, R.: Training Neural Networks with Deficient Data. In: NIPS. (1993) 128-135
5. Quiñonero-Candela, J., Roweis, S.T.: Data Imputation and Robust Training with Gaussian Processes (2003)
6. Girard, A.: Approximate Methods for Propagation of Uncertainty with Gaussian Process Models. PhD thesis, University of Glasgow, Glasgow, UK (2004)
7. Rasmussen, C.E., Williams, C.K.I.: Gaussian Processes for Machine Learning. The MIT Press (2006)
8. Girard, A., Rasmussen, C.E., Quiñonero-Candela, J., Murray-Smith, R.: Gaussian Process Priors with Uncertain Inputs - Application to Multiple-Step Ahead Time Series Forecasting. In: NIPS. (2002) 529-536
9. Florian, R.: Correct Equations for the Dynamics of the Cart-Pole System. Technical report, Center for Cognitive and Neural Studies (2007)
10. Quiñonero-Candela, J.: Learning with Uncertainty - Gaussian Processes and Relevance Vector Machines. PhD thesis, Technical University of Denmark, Denmark (2004)
11. Le Roux, N., Manzagol, P.A., Bengio, Y.: Topmoumoute Online Natural Gradient Algorithm. In: NIPS. (2008) 849-856
12. Dallaire, P., Besse, C., Chaib-draa, B.: GP-POMDP: Bayesian Reinforcement Learning in Continuous POMDPs with Gaussian Processes. In: Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems. (2009) To appear