Gaussian processes with monotonicity information


Anonymous Author, Anonymous Author
Unknown Institution, Unknown Institution

Abstract

A method for using monotonicity information in multivariate Gaussian process regression and classification is proposed. Monotonicity information is introduced with virtual derivative observations, and the resulting posterior is approximated with expectation propagation. The behaviour of the method is illustrated with artificial regression examples, and the method is applied to a real-world health care classification problem to include monotonicity information with respect to one of the covariates.

1 INTRODUCTION

In modelling problems there is sometimes a priori knowledge available, concerning the function to be learned, which can be used to improve the performance of the model. Such information may be inaccurate, and may concern the behaviour of the output variable as a function of the input variables. For instance, instead of having measurements of derivatives, the output function can be known to be monotonic with respect to an input variable. For univariate and multivariate additive functions, monotonicity can be forced by construction, see e.g. (Shively et al., 2009). A generic approach for multivariate models was proposed by Sill and Abu-Mostafa (1997), who introduced monotonicity information to multilayer perceptron (MLP) neural networks using hints, that is, virtual observations placed appropriately in the input space. See also (Lampinen and Selonen, 1997) for a more explicit formulation. However, the use of hints can be problematic with MLPs due to the nonstationarity of their smoothness properties and the difficulties in integrating over the posterior distribution.

Preliminary work. Under review by AISTATS. Do not distribute.

In this paper, we propose a method, similar to the hint approach, for including monotonicity information in a Gaussian process (GP) model using virtual derivative observations with a Gaussian distribution. In Gaussian processes, smoothness can be controlled in a more systematic way than in MLPs through the selection of a covariance function. In this work, the required integrals are approximated using the fast expectation propagation (EP) algorithm. We first illustrate the behaviour and examine the performance of the approach with artificial univariate regression data sets. We then illustrate the benefits of monotonicity information in a real-world multivariate classification problem with monotonicity with respect to one of the covariates. Section 2 briefly presents Gaussian processes with derivative observations, and Section 3 describes the proposed method. Experiments are shown in Section 4, and conclusions are drawn in Section 5.

2 GAUSSIAN PROCESSES AND DERIVATIVE OBSERVATIONS

A Gaussian process (GP) is a flexible nonparametric model in which the prior is set directly over functions of one or more input variables, see e.g. (O'Hagan, 1978; MacKay, 1998; Neal, 1999; Rasmussen and Williams, 2006). Gaussian process models are attractive for modelling complex phenomena since they allow for nonlinear effects, and if there are dependencies between covariates, a GP can handle these interactions implicitly. Let x denote a D-dimensional covariate vector, and let the matrix X, of size N × D, collect all N training input vectors. We assume a zero-mean Gaussian process prior

p(f | X) = N(f | 0, K(X, X)),

where f is a vector of N latent values. The covariance matrix K(X, X) between the latent values depends on the covariates and is determined by the covariance function.
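The idea of a prior placed directly over functions can be made concrete with a short sketch. The snippet below draws random functions from the zero-mean GP prior p(f | X) = N(f | 0, K(X, X)); it is a minimal illustration, assuming the squared exponential covariance that is introduced in the next section, and the helper name sq_exp_kernel is mine rather than from the paper.

```python
import numpy as np

def sq_exp_kernel(X1, X2, eta2=1.0, rho=1.0):
    """Squared exponential covariance K(X1, X2) (assumed form, see Section 2)."""
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2            # pairwise squared distances per dimension
    return eta2 * np.exp(-0.5 * np.sum(rho * d2, axis=2))

# Draw a few random functions from the zero-mean GP prior p(f | X) = N(0, K(X, X)).
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 100)[:, None]                     # N x D inputs, here D = 1
K = sq_exp_kernel(X, X) + 1e-9 * np.eye(len(X))             # jitter for numerical stability
f_samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)
```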

Throughout this work, we use the stationary squared exponential covariance function, which produces smooth functions, given by

Cov[f_i, f_j] = K(x_i, x_j) = \eta^2 \exp\left( -\frac{1}{2} \sum_{d=1}^{D} \rho_d (x_{i,d} - x_{j,d})^2 \right),    (1)

where \eta^2 and \rho = \{\rho_1, \ldots, \rho_D\} are the hyperparameters of the GP model.

In the regression case, having the vector y of N noisy outputs, we assume a Gaussian relationship between the latent function values and the noisy observations,

p(y | f) = N(y | f, \sigma^2 I),    (2)

where \sigma^2 is the noise variance and I is the identity matrix. Given the training data X and y, the conditional predictive distribution for a new covariate vector x_* is Gaussian with mean and variance

E[f_* | x_*, y, X, \theta] = K(x_*, X) (K(X, X) + \sigma^2 I)^{-1} y,    (3)

Var[f_* | x_*, y, X, \theta] = K(x_*, x_*) - K(x_*, X) (K(X, X) + \sigma^2 I)^{-1} K(X, x_*),    (4)

where \theta = \{\eta^2, \rho, \sigma^2\}. Instead of integrating out the hyperparameters, for simplicity we find a point estimate for the values of the hyperparameters \theta by optimising the marginal likelihood

p(y | X, \theta) = \int p(y | f, \theta) \, p(f | X, \theta) \, df,

and in the computations we use the logarithm of the marginal likelihood,

\log p(y | X, \theta) = -\frac{1}{2} y^T (K(X, X) + \sigma^2 I)^{-1} y - \frac{1}{2} \log |K(X, X) + \sigma^2 I| - \frac{N}{2} \log 2\pi.

The derivative of a Gaussian process remains a Gaussian process because differentiation is a linear operator (Rasmussen, 2003; Solak et al., 2003). This makes it possible to include derivative observations in the GP model, or to compute predictions about derivatives. The mean of the derivative is equal to the derivative of the latent mean,

E\left[ \frac{\partial f_i}{\partial x_{i,d}} \right] = \frac{\partial}{\partial x_{i,d}} E[f_i].

Likewise, the covariance between a partial derivative and a function value satisfies

Cov\left[ \frac{\partial f_i}{\partial x_{i,d}}, f_j \right] = \frac{\partial}{\partial x_{i,d}} Cov[f_i, f_j],

and the covariance between partial derivatives satisfies

Cov\left[ \frac{\partial f_i}{\partial x_{i,g}}, \frac{\partial f_j}{\partial x_{j,h}} \right] = \frac{\partial^2}{\partial x_{i,g} \partial x_{j,h}} Cov[f_i, f_j].

For the squared exponential covariance function, the covariances between function values and partial derivatives are given by

Cov\left[ \frac{\partial f_i}{\partial x_{i,g}}, f_j \right] = \eta^2 \exp\left( -\frac{1}{2} \sum_{d=1}^{D} \rho_d (x_{i,d} - x_{j,d})^2 \right) \left( -\rho_g (x_{i,g} - x_{j,g}) \right),

and between partial derivatives by

Cov\left[ \frac{\partial f_i}{\partial x_{i,g}}, \frac{\partial f_j}{\partial x_{j,h}} \right] = \eta^2 \exp\left( -\frac{1}{2} \sum_{d=1}^{D} \rho_d (x_{i,d} - x_{j,d})^2 \right) \rho_g \left( \delta_{gh} - \rho_h (x_{i,h} - x_{j,h}) (x_{i,g} - x_{j,g}) \right),

where \delta_{gh} = 1 if g = h, and 0 otherwise. For instance, having observed the values of y, the mean of the derivative of the latent function f_* with respect to dimension d is

E\left[ \frac{\partial f_*}{\partial x_{*,d}} \right] = \frac{\partial K(x_*, X)}{\partial x_{*,d}} (K(X, X) + \sigma^2 I)^{-1} y,

and the variance is

Var\left[ \frac{\partial f_*}{\partial x_{*,d}} \right] = \frac{\partial^2 K(x_*, x_*)}{\partial x_{*,d} \partial x_{*,d}} - \frac{\partial K(x_*, X)}{\partial x_{*,d}} (K(X, X) + \sigma^2 I)^{-1} \frac{\partial K(X, x_*)}{\partial x_{*,d}},

similarly to equations (3) and (4). To use derivative observations in the Gaussian process, the observation vector y can be extended to include the derivative observations, and the covariance matrix between the observations can be extended to include the covariances between the observations and the partial derivatives, and the covariances between the partial derivatives.
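The quantities above translate directly into a few lines of linear algebra. The following is a minimal NumPy sketch of the squared exponential covariance (1), the predictive equations (3) and (4), and the cross-covariance between a partial derivative and a function value; the function names are mine and this is an illustration under those assumptions, not an implementation from the paper.

```python
import numpy as np

def se_kernel(X1, X2, eta2, rho):
    """Squared exponential covariance, eq. (1); rho is a length-D array."""
    diff = X1[:, None, :] - X2[None, :, :]
    return eta2 * np.exp(-0.5 * np.einsum('d,nmd->nm', rho, diff ** 2))

def se_kernel_dx1(X1, X2, eta2, rho, g):
    """Cross-covariance Cov[df_i/dx_{i,g}, f_j] = -rho_g (x_{i,g} - x_{j,g}) K(x_i, x_j)."""
    K = se_kernel(X1, X2, eta2, rho)
    return -rho[g] * (X1[:, None, g] - X2[None, :, g]) * K

def gp_predict(X, y, Xs, eta2, rho, sigma2):
    """Predictive mean and variance, eqs. (3) and (4)."""
    Ky = se_kernel(X, X, eta2, rho) + sigma2 * np.eye(len(X))
    Ks = se_kernel(Xs, X, eta2, rho)
    mean = Ks @ np.linalg.solve(Ky, y)
    v = np.linalg.solve(Ky, Ks.T)                       # (K(X,X) + sigma^2 I)^{-1} K(X, x*)
    var = se_kernel(Xs, Xs, eta2, rho).diagonal() - np.einsum('ij,ji->i', Ks, v)
    return mean, var

def gp_predict_derivative(X, y, Xs, eta2, rho, sigma2, g=0):
    """Predictive mean of the partial derivative df/dx_g at the test inputs Xs."""
    Ky = se_kernel(X, X, eta2, rho) + sigma2 * np.eye(len(X))
    dKs = se_kernel_dx1(Xs, X, eta2, rho, g)            # Cov[df(x*)/dx_g, f(X)]
    return dKs @ np.linalg.solve(Ky, y)
```

In Section 3 the same predictive form is reused, with y and \sigma^2 I replaced by the joint EP site means and variances.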

3 EXPRESSING MONOTONICITY INFORMATION

In this section we present the method for introducing monotonicity information into a Gaussian process model. Since the function is assumed to be smooth, instead of evaluating the derivative everywhere it is sufficient to choose a finite number of locations at which the derivative is evaluated. The monotonicity condition is the following: at the operating point x_i, the derivative of the target function is non-negative with respect to the input dimension d_i. We use the notation m_i^{d_i} for the derivative information, where monotonicity is with respect to dimension d_i at the location x_i. We denote with m a set of M derivative points inducing the monotonicity at the operating points X_m (a matrix of size M × D). To express this monotonicity, the following probit likelihood is assumed for a derivative observation:

p\left( m_i^{d_i} \,\middle|\, \frac{\partial f_i}{\partial x_{i,d_i}} \right) = \Phi\left( \frac{\partial f_i}{\partial x_{i,d_i}} \frac{1}{\nu} \right),    (5)

where \Phi(z) = \int_{-\infty}^{z} N(t | 0, 1) \, dt. By using the probit function instead of a step function, the likelihood tolerates small errors. The probit function in (5) approaches the step function when \nu \to 0, and in all experiments in this work we fixed \nu = 10^{-6}. However, it is possible to adjust the steepness of the step, and thereby control the strictness of the monotonicity information, with the parameter \nu in the likelihood.

To include the information from this likelihood in the GP model, the expectation propagation algorithm (Minka, 2001) is used to form virtual derivative observations. For now we assume we have a set of locations X_m where the function is known to be monotonic. By assuming a zero-mean Gaussian process prior for the latent function values, the joint prior for the latent values and derivatives is given by

p(f, f' | X, X_m) = N(f_{joint} | 0, K_{joint}),  where  f_{joint} = \begin{bmatrix} f \\ f' \end{bmatrix}  and  K_{joint} = \begin{bmatrix} K_{f,f} & K_{f,f'} \\ K_{f',f} & K_{f',f'} \end{bmatrix}.    (6)

In (6), f' is used as a shorthand notation for the derivative of the latent function f with respect to some of the input dimensions, and the subscripts of K denote the variables between which the covariance is computed. Using Bayes' rule, the joint posterior is obtained as

p(f, f' | y, m) = \frac{1}{Z} p(f, f' | X, X_m) \, p(y | f) \, p(m | f'),    (7)

where p(m | f') = \prod_{i=1}^{M} \Phi\left( \frac{\partial f_i}{\partial x_{i,d_i}} \frac{1}{\nu} \right), and the normalisation term is

Z = \int\int p(f, f' | X, X_m) \, p(y | f) \, p(m | f') \, df \, df'.    (8)

Since the likelihood for the derivative observations in (8) is not Gaussian, the posterior is analytically intractable. We apply the EP algorithm and compute a Gaussian approximation for the posterior distribution. The local likelihood approximations given by EP are then used in the model as virtual derivative observations, in addition to the observations y. The EP algorithm approximates the posterior distribution in (7) with

q(f, f' | y, m) = \frac{1}{Z_{EP}} p(f, f' | X, X_m) \, p(y | f) \prod_{i=1}^{M} t_i(\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2),

where t_i(\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \tilde{Z}_i N(f'_i | \tilde{\mu}_i, \tilde{\sigma}_i^2) are local likelihood approximations with site parameters \tilde{Z}_i, \tilde{\mu}_i and \tilde{\sigma}_i^2. The approximate posterior is a product of Gaussian distributions and can be simplified to

q(f, f' | y, m) = N(f_{joint} | \mu, \Sigma).    (9)

The posterior mean is \mu = \Sigma \Sigma_{joint}^{-1} \mu_{joint} and the covariance is \Sigma = (K_{joint}^{-1} + \Sigma_{joint}^{-1})^{-1}, where

\mu_{joint} = \begin{bmatrix} y \\ \tilde{\mu} \end{bmatrix}  and  \Sigma_{joint} = \begin{bmatrix} \sigma^2 I & 0 \\ 0 & \tilde{\Sigma} \end{bmatrix}.    (10)

In (10), \tilde{\mu} is the vector of site means \tilde{\mu}_i, and \tilde{\Sigma} is a diagonal matrix with the site variances \tilde{\sigma}_i^2 on the diagonal. The desired posterior marginal moments with the likelihood (5) are updated as

\hat{Z}_i = \Phi(z_i),

\hat{\mu}_i = \mu_{-i} + \frac{\sigma_{-i}^2 N(z_i | 0, 1)}{\Phi(z_i) \, \nu \sqrt{1 + \sigma_{-i}^2 / \nu^2}},

\hat{\sigma}_i^2 = \sigma_{-i}^2 - \frac{\sigma_{-i}^4 N(z_i | 0, 1)}{\Phi(z_i) (\nu^2 + \sigma_{-i}^2)} \left( z_i + \frac{N(z_i | 0, 1)}{\Phi(z_i)} \right),

where

z_i = \frac{\mu_{-i}}{\nu \sqrt{1 + \sigma_{-i}^2 / \nu^2}},

and \mu_{-i} and \sigma_{-i}^2 are the parameters of the cavity distribution in EP. These equations are similar to those of binary classification with a probit likelihood, and the EP algorithm is otherwise as presented, for example, in Chapter 3 of Rasmussen and Williams (2006).
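As an illustration, the tilted moments above can be evaluated numerically as follows. This is a minimal SciPy sketch for a single virtual derivative observation; the division step at the end is the generic Gaussian EP site update, included for context, and is not code from the paper.

```python
import numpy as np
from scipy.stats import norm

def tilted_moments(mu_cav, var_cav, nu=1e-6):
    """Moments of the tilted distribution Phi(f'/nu) N(f' | mu_cav, var_cav),
    i.e. the hat-quantities above, for one virtual derivative observation."""
    denom = np.sqrt(nu ** 2 + var_cav)                  # equals nu * sqrt(1 + var_cav / nu^2)
    z = mu_cav / denom
    Z_hat = norm.cdf(z)
    ratio = norm.pdf(z) / Z_hat                         # N(z | 0, 1) / Phi(z)
    mu_hat = mu_cav + var_cav * ratio / denom
    var_hat = var_cav - var_cav ** 2 * ratio * (z + ratio) / (nu ** 2 + var_cav)
    return Z_hat, mu_hat, var_hat

def site_update(mu_cav, var_cav, mu_hat, var_hat):
    """Standard EP step (not from the paper): divide the tilted Gaussian by the
    cavity to obtain the site mean and variance."""
    var_site = 1.0 / (1.0 / var_hat - 1.0 / var_cav)
    mu_site = var_site * (mu_hat / var_hat - mu_cav / var_cav)
    return mu_site, var_site
```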

The normalisation term is approximated with EP as

Z_{EP} = q(y, m | X, X_m, \theta) = \int\int p(f, f' | X, X_m) \, p(y | f) \prod_{i=1}^{M} t_i(\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) \, df \, df' = Z_{joint} \prod_{i=1}^{M} \tilde{Z}_i \int\int N(f_{joint} | \mu, \Sigma) \, df \, df' = Z_{joint} \prod_{i=1}^{M} \tilde{Z}_i,

where the normalisation term of the product of Gaussians is

Z_{joint} = (2\pi)^{-(N+M)/2} |K_{joint} + \Sigma_{joint}|^{-1/2} \exp\left( -\frac{1}{2} \mu_{joint}^T (K_{joint} + \Sigma_{joint})^{-1} \mu_{joint} \right),

and the remaining terms \tilde{Z}_i are the normalisation constants from EP. In the computations we use the logarithm of the normalisation term, and after the convergence of EP the approximation for the logarithm of the marginal likelihood is computed as

\log Z_{EP} = -\frac{1}{2} \log |K_{joint} + \Sigma_{joint}| - \frac{1}{2} \mu_{joint}^T (K_{joint} + \Sigma_{joint})^{-1} \mu_{joint} + \sum_{i=1}^{M} \frac{(\mu_{-i} - \tilde{\mu}_i)^2}{2 (\sigma_{-i}^2 + \tilde{\sigma}_i^2)} + \sum_{i=1}^{M} \log \Phi\left( \frac{\mu_{-i}}{\nu \sqrt{1 + \sigma_{-i}^2 / \nu^2}} \right) + \frac{1}{2} \sum_{i=1}^{M} \log( \sigma_{-i}^2 + \tilde{\sigma}_i^2 ).

The values for the hyperparameters are found by optimising the logarithm of this joint marginal likelihood approximation for the observations and the derivative information. To use the virtual derivative observations in the GP model predictions, the approximate predictive mean and variance for the latent variable can be computed as

E[f_* | x_*, y, X, m, X_m] = K_{*, f_{joint}} (K_{joint} + \Sigma_{joint})^{-1} \mu_{joint},

Var[f_* | x_*, y, X, m, X_m] = K_{*,*} - K_{*, f_{joint}} (K_{joint} + \Sigma_{joint})^{-1} K_{f_{joint}, *},

analogously to the standard GP prediction equations (3) and (4).

In classification examples, we assume the probit likelihood for the class observations,

p(c | f) = \prod_{i=1}^{N} \Phi(f_i c_i),

where c_i \in \{-1, 1\} denotes the two output classes. We apply the expectation propagation algorithm to both the class observations and the virtual derivative observations. EP approximates the joint posterior of f and f' similarly to the regression case in (9), except that the vector of observations y and the noise \sigma^2 I in (10) are now replaced with the site approximations \tilde{\mu}_{class} and \tilde{\Sigma}_{class}, denoting the mean and variance site terms given by EP and associated with the class observations. The parameter \nu in the likelihood for a virtual derivative observation causes the desired posterior marginal moments to be computed slightly differently, depending on whether the moments are computed for class observations or derivative observations. For class observations the moments are given, for example, in Chapter 3 of Rasmussen and Williams (2006), and for virtual derivative observations the moments are computed as in the regression case. The values for the hyperparameters are found by optimising the joint marginal likelihood approximation of the class observations and the virtual derivative observations. The normalisation term is computed as in regression, except that again y and the noise \sigma^2 I in (10) are replaced with the site terms \tilde{\mu}_{class} and \tilde{\Sigma}_{class}. Furthermore, in the computation of the normalisation of the joint posterior, the normalisation site terms \tilde{Z}_{class} of the class observations are also taken into account. In classification, the predictions for the latent values using the class observations and the virtual derivative observations are made by using the extended vector of site means and the extended covariance matrix having the site variances on the diagonal.

3.1 PLACING THE VIRTUAL DERIVATIVE POINTS

In low-dimensional problems the derivative points can be placed on a grid to assure monotonicity. A drawback is that the number of grid points increases exponentially with the number of input dimensions. In higher-dimensional cases the distribution of X can be assumed to be the empirical distribution of the observations X, and the virtual points can be chosen to be at the unique locations of the observed input data points. Alternatively, a random subset of points from the empirical distribution can be chosen. If the distance between the derivative points is short enough compared to the lengthscale, the monotonicity information also affects the regions between the virtual points according to the correlation structure.
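The two placement strategies just described are easy to sketch. The following is a minimal illustration; the function names, the number of points M, and the random seed are my choices, not values from the paper.

```python
import numpy as np

def virtual_points_grid(lower, upper, points_per_dim):
    """Grid placement of the virtual derivative points (low-dimensional case)."""
    axes = [np.linspace(lo, up, points_per_dim) for lo, up in zip(lower, upper)]
    mesh = np.meshgrid(*axes, indexing='ij')
    return np.stack(mesh, axis=-1).reshape(-1, len(lower))

def virtual_points_from_data(X, M, seed=0):
    """Higher-dimensional case: a random subset of the unique observed inputs."""
    rng = np.random.default_rng(seed)
    X_unique = np.unique(X, axis=0)
    idx = rng.choice(len(X_unique), size=min(M, len(X_unique)), replace=False)
    return X_unique[idx]
```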
Due to the computational scaling, O((N + M)^3), it may be necessary to use a smaller number of derivative points. In such a case, a general solution is to use the GP predictions about the values of the derivatives at the observed unique data points.
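The selection criterion, described next, can be summarised in a short sketch: given the GP predictive mean and variance of the derivative (Section 2), the probability that the derivative is negative is evaluated at each candidate location, and candidates where that probability is high are chosen as virtual points. The threshold below is an illustrative choice of mine, not a value from the paper.

```python
import numpy as np
from scipy.stats import norm

def prob_derivative_negative(dmean, dvar):
    """P(df/dx_d < 0) at candidate inputs, from the GP predictive mean and
    variance of the derivative."""
    return norm.cdf(-dmean / np.sqrt(dvar))

def select_virtual_points(X_candidates, dmean, dvar, threshold=0.5):
    """Return the candidate inputs where the derivative is likely to be negative."""
    p_neg = prob_derivative_negative(dmean, dvar)
    return X_candidates[p_neg > threshold]
```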

The probability of the derivative being negative is computed, and at the locations where this probability is high, virtual derivative points are placed. After conditioning on the virtual data points, new predictions for the derivative values at the remaining unique observed points can be computed, and virtual derivative points can be added, moved, or removed if needed. This iteration can be continued to assure monotonicity in the interesting regions. To place virtual derivative points between the observed data points, or outside the convex hull of the observed X, a more elaborate distribution model for X is needed. Again, the probability of the derivative being negative can easily be computed, and more virtual derivative points can be placed at locations where this probability is high.

4 EXPERIMENTAL RESULTS

4.1 DEMONSTRATION

An example of Gaussian process regression with monotonicity information is shown in Figure 1. Subfigure (a) illustrates the GP prediction (mean and 95% interval) without monotonicity information, with hyperparameter values found by optimising the marginal likelihood. Subfigures (b) and (c) show the predictions with monotonicity information, with hyperparameter values that maximise the approximation of the joint marginal likelihood. The short vertical lines in (b) and (c) mark the locations of the virtual derivative points. In Subfigure (b), the locations of the virtual points are found by choosing a subset amongst the observed data points, at the locations where the probability of the derivative being negative is large before conditioning on any monotonicity information (the derivative is seen in Subfigure (d)). In Subfigure (c) the virtual points are placed on a grid. The predictions in (b) and (c) are similar, and (e) and (f) illustrate the corresponding derivatives of the latent functions. Since the probability of the derivative being negative in (e) and (f) over the observed data range is very low, adding more virtual derivative points is unnecessary.

The effect of the monotonicity information is also illustrated in Figure 2. Subfigures (a)-(c) show the case without monotonicity information: (a) shows the marginal likelihood as a function of the lengthscale and noise variance parameters (the signal magnitude is fixed to one), and (b) and (c) show two different solutions (mean and 95% interval) at the two modes shown in (a). The mode with the shorter lengthscale and smaller noise variance (function estimate in (b)) has higher density. Subfigures (d)-(f) show the case with monotonicity information. Subfigure (d) shows the approximate marginal likelihood for the observations and virtual derivative observations. Now the mode corresponding to the longer lengthscale and the monotone function shown in (f) has much higher density. Since the virtual observations are not placed densely, there is still another mode at a shorter lengthscale (function estimate in (e)), although with much lower density. This shows the importance of having enough virtual observations, and this second mode would eventually vanish if the number of virtual observations were increased.

4.2 ARTIFICIAL EXAMPLES

We test the Gaussian process model with monotonicity information by performing simulation experiments on four artificial data sets. We consider the following functions: (a) f(x) = 0 if x < 0.5, f(x) = 1 if x ≥ 0.5 (step); (b) f(x) = x (linear); (c) f(x) = exp(1.5x) (exponential); (d) f(x) = 2/(1 + exp(-8x + 4)) (logistic), and draw observations from the model y_i = f(x_i) + \epsilon_i, where the x_i are i.i.d. samples from the uniform distribution U(0, 1) and the \epsilon_i are i.i.d. zero-mean Gaussian noise. We normalise x and y to have mean zero and standard deviation 0.5.
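A sketch of this data-generating setup is given below. The function constants follow the reconstruction above, while the noise standard deviation and the random seed are assumptions of mine, since they are not stated here.

```python
import numpy as np

rng = np.random.default_rng(1)

# The four test functions of Section 4.2 (constants as reconstructed above).
functions = {
    'step':        lambda x: np.where(x < 0.5, 0.0, 1.0),
    'linear':      lambda x: x,
    'exponential': lambda x: np.exp(1.5 * x),
    'logistic':    lambda x: 2.0 / (1.0 + np.exp(-8.0 * x + 4.0)),
}

def make_dataset(name, N, noise_sd=0.1):
    """Draw y_i = f(x_i) + eps_i with x ~ U(0, 1); the noise level is an assumption."""
    x = rng.uniform(0.0, 1.0, size=N)
    y = functions[name](x) + rng.normal(0.0, noise_sd, size=N)
    # Normalise x and y to zero mean and standard deviation 0.5, as in the text.
    x = 0.5 * (x - x.mean()) / x.std()
    y = 0.5 * (y - y.mean()) / y.std()
    return x[:, None], y
```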
For the Gaussian process with monotonicity information, we introduce virtual observations spaced equally between the observed minimum and maximum values of the x variable. We compare the results of this model to a Gaussian process with no monotonicity information. The performances of the models are evaluated using the root-mean-square error (RMSE). The RMSE estimates are evaluated against the true function values on a grid of equally spaced x-values. Table 1 summarises the simulation results. The results are based on repeated simulations with two training sample sizes, a smaller and a larger N.

For the step function, the GP model with monotonicity information performs worse than the GP without the monotonicity assumption, because the proposed method has a tendency to favour smooth increasing functions. In the case of heavy truncation by the step likelihood, the result may not be well approximated with a Gaussian distribution, and thus the derivative information represented by the virtual observations can be slightly biased away from zero. On the other hand, the GP without the monotonicity assumption estimates the step function with a shorter lengthscale, producing a better fit but with wiggling behaviour.
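The evaluation protocol described at the start of this subsection (equally spaced virtual points, and RMSE against the true function on a grid) can be sketched as follows; the number of virtual points, the grid size, and the grid range are illustrative choices of mine, not values from the paper.

```python
import numpy as np

def equally_spaced_virtual_points(x, M=10):
    """Virtual derivative observation locations spaced equally between the
    observed minimum and maximum of x (M is an illustrative choice)."""
    return np.linspace(x.min(), x.max(), M)[:, None]

def rmse_against_truth(f_true, predict_mean, n_grid=500):
    """RMSE of a predictive mean against the true function on an equally
    spaced grid, as used for Table 1 (grid over [0, 1] assumed here)."""
    x_grid = np.linspace(0.0, 1.0, n_grid)
    return np.sqrt(np.mean((predict_mean(x_grid) - f_true(x_grid)) ** 2))
```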

Figure 1: Example of a Gaussian process solution (mean and 95% interval) without monotonicity information (a), and the corresponding derivative of the latent function (d). Subfigures (b) and (c) illustrate the solutions with monotonicity information, and the corresponding derivatives, \partial f / \partial x, are shown in (e) and (f). The virtual derivative observations, shown with short vertical lines in (b), are placed at locations where the probability of the derivative being negative is large (seen in Subfigure (d)). In Subfigure (c) the derivative points are placed on a grid.

Figure 2: Contour plot of the log marginal likelihood without monotonicity information (a), as a function of the characteristic lengthscale and the noise variance, and the corresponding solutions (b) and (c) at the modes. Subfigure (d) shows the contour plot of the marginal likelihood with monotonicity information, and Subfigures (e) and (f) illustrate the corresponding solutions at the modes. The locations of the virtual observations are shown with short vertical lines in Subfigures (e) and (f).

Table 1: Root-mean-square errors for the artificial examples.

                    smaller N              larger N
function            GP       GP monot.     GP       GP monot.
step                .5       .76           .9       .67
linear              .9       .68           .5       .4
exponential         .74      .68           .54      .5
logistic            .78      .77           .6       .6

For the linear and exponential functions the GP with the monotonicity assumption gives better estimates, as the monotonicity information favours smoother solutions and prevents the estimated functions from wiggling. With the larger number of observations the differences between the estimates of the two models were smaller for the linear and exponential functions, as the possibility of overfitting decreases. For the logistic function both models gave similar results.

4.3 MODELLING THE RISK OF INSTITUTIONALISATION

In this section we report the results of assessing the institutionalisation risk of users of communal elderly care services. The risk of institutionalisation was modelled using data produced from health care registers, and the aim was to study whether a patient becomes institutionalised or not during the next three months. The study population consisted of patients over 65 years of age in one city, during a study period ending in 2004. In this study the following seven variables were used as predictors: age, whether the patient had nursing home periods, whether the patient had an ADL (activities of daily living) evaluation, maximum memory problem score, maximum behavioural symptoms score, maximum number of daily home care visits, and number of days in hospital. Since only a small number of institutionalisation events were available in the whole data set, the training data was balanced such that approximately half of the patients became institutionalised; this balanced set was used as the training data. Classification was done using a Gaussian process binary classification model with the probit likelihood function and the squared exponential covariance function with an individual lengthscale parameter for each input variable.

We modelled the risk of institutionalisation with a GP in which no information about monotonicity with respect to any of the covariates was assumed. This model was compared to a GP model where monotonicity information was added such that the institutionalisation risk was assumed to increase as a function of age. Virtual observations were placed at the unique locations of the input training data points. To test the predictive abilities of the two GP models, ROC curves were computed for the younger and the older (the oldest third) age groups using an independent test data set of observation periods. The predictive performances of the models were similar for the younger age group, but the GP model with monotonicity information gave better predictions for the older age group (Figure 3). As age increases, the data become scarcer and the monotonicity assumption more useful.

We also studied the effect of the monotonicity information in the model by comparing the predicted risks of institutionalisation as a function of age and different daily home care levels. The predictions for a low-risk subgroup are shown in Figure 4. The GP model without monotonicity information gives a slight decrease in the institutionalisation risk for patients over 80 (Subfigure (a)), whereas the GP model with monotonicity information gives smoother results (Subfigure (b)), suggesting more realistic estimates.

Figure 3: ROC curves (true positive rate versus false positive rate) for the probability of institutionalisation of the elderly, for the younger and older age groups, with and without monotonicity information.

5 CONCLUSION

We have proposed a method for introducing monotonicity information into a nonparametric Gaussian process model.
The monotonicity information is specified using virtual derivative observations concerning the behaviour of the target function at the desired locations of the input space. In the method, a Gaussian approximation for the virtual derivative observations is found using the EP algorithm, and the virtual observations are used in the GP model in addition to the real observations.

Figure 4: Simulated estimates for the probabilities of institutionalisation of the elderly as a function of age and daily home care level. Each panel shows the probability of institutionalisation against age (65-95 years) for three daily home care levels (no visits and two levels of maximum daily visits). The estimates using a Gaussian process model are shown in Subfigure (a), and the estimates using a Gaussian process with monotonicity information in Subfigure (b).

In the cases where the target function is monotonic, a solution that is less prone to overfitting, and therefore better, can be achieved by using monotonicity information. This is emphasised in cases where only a small number of observations is available. When the target function has flat areas with sharp steps, the virtual derivative observations can lead to worse performance, caused by a bias away from zero due to the Gaussian approximation of the truncated derivative distribution. Therefore, virtual derivative observations implying monotonicity are most useful when the target function is smooth. Further, if the distance between the virtual derivative observations is too large with respect to the estimated characteristic lengthscale, the solution can become non-monotonic. However, by placing and adding the virtual points iteratively, this can be avoided and a monotonic solution can be guaranteed.

Acknowledgements

References

Lampinen, J. and Selonen, A. (1997). Using background knowledge in multilayer perceptron learning. In Proceedings of the 10th Scandinavian Conference on Image Analysis, pages 545-549.

MacKay, D. J. C. (1998). Introduction to Gaussian processes. In Bishop, C. M., editor, Neural Networks and Machine Learning, pages 133-166. Springer-Verlag.

Minka, T. (2001). Expectation Propagation for approximate Bayesian inference. In Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence (UAI 2001), pages 362-369. Morgan Kaufmann.

Neal, R. M. (1999). Regression and classification using Gaussian process priors (with discussion). In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., editors, Bayesian Statistics 6, pages 475-501. Oxford University Press.

O'Hagan, A. (1978). Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society. Series B (Methodological), 40:1-42.

Rasmussen, C. E. (2003). Gaussian processes to speed up Hybrid Monte Carlo for expensive Bayesian integrals. In Bayesian Statistics 7, pages 651-659. Oxford University Press.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Shively, T. S., Sager, T. W., and Walker, S. G. (2009). A Bayesian approach to non-parametric monotone function estimation. Journal of the Royal Statistical Society: Series B, 71:159-175.

Sill, J. and Abu-Mostafa, Y. (1997). Monotonicity hints. In Advances in Neural Information Processing Systems 9, pages 634-640. MIT Press.

Solak, E., Murray-Smith, R., Leithead, W. E., Leith, D. J., and Rasmussen, C. E. (2003). Derivative observations in Gaussian process models of dynamic systems. In Advances in Neural Information Processing Systems 15. MIT Press.