Variational Inference for Heteroscedastic Semiparametric Regression

From the SelectedWorks of Matt Wand, February 7, 2014. Variational Inference for Heteroscedastic Semiparametric Regression. Marianne Menictas, University of Technology, Sydney; Matt Wand, University of Technology, Sydney.

Variational inference for heteroscedastic semiparametric regression

BY MARIANNE MENICTAS AND MATT P. WAND

School of Mathematical Sciences, University of Technology Sydney, Broadway 2007, Australia

1st August, 2014

Summary

We develop fast mean field variational methodology for Bayesian heteroscedastic semiparametric regression, in which both the mean and variance are smooth, but otherwise arbitrary, functions of the predictors. Our resulting algorithms are purely algebraic, devoid of numerical integration and Monte Carlo sampling. The locality property of mean field variational Bayes implies that the methodology also applies to larger models possessing variance function components. Simulation studies indicate good to excellent accuracy, and considerable time savings compared with Markov chain Monte Carlo. We also provide some illustrations from applications.

Keywords: Approximate Bayesian inference; mean field variational Bayes; non-conjugate variational message passing; variance function estimation.

1 Introduction

Data that are big in terms of volume and/or velocity are becoming increasingly ubiquitous and there is a strong imperative for the development of fast, flexible and extendable methodology for processing such data. Semiparametric regression (e.g. Ruppert, Wand & Carroll, 2003, 2009) is an important class of flexible statistical models and methods, but is mainly geared towards moderate sample sizes and batch processing. Recently, Luts, Broderick & Wand (2014) developed new nonparametric regression methodology specifically tailored to high volume/velocity situations. The present article extends their general approach to accommodate possible heteroscedasticity. Whilst we focus on univariate and bivariate nonparametric regression and additive models, the modularity of our approach allows easy extension to more complex models.

The accommodation of heteroscedasticity is aided by the variational inference technique known as non-conjugate variational message passing (e.g. Knowles & Minka, 2011). The particular form of non-conjugate variational message passing that we use involves approximating particular posterior density functions by Multivariate Normal density functions, which is an established notion within the variational approximation literature (e.g. Hinton & van Camp, 1993; Barber & Bishop, 1997; Raiko et al., 2007; Challis & Barber, 2013).

In nonparametric regression it is common to ignore heteroscedasticity in the data and instead invoke the constant variance assumption. Figure 1 contains nonparametric regression examples, based on the R function smooth.spline (R Core Team, 2014) with default settings. Both data sets appear in Ruppert et al. (2003), where they are first described in Sections 5.3 and 2.7 respectively. The scatterplots and the standardized residual plots show that there is significant heteroscedasticity, the ignorance of which will lead to erroneous inferential statements, such as prediction intervals.

Figure 1: Two example nonparametric regression fits and corresponding standardized residual plots. Horizontal dashed lines at ±2 and ±3 aid assessment of standard normality of the standardized residuals.

Heteroscedastic nonparametric regression overcomes the problems apparent in Figure 1. For a set of regression data (x_i, y_i), 1 ≤ i ≤ n, it involves replacement of the homoscedastic nonparametric model E(y_i) = f(x_i), Var(y_i) = σ²,

by

    E(y_i) = f(x_i),   Var(y_i) = g(x_i),   (1)

with the variance function g to be estimated simultaneously with the mean function f. Several approaches to fitting (1) now exist (e.g. Ruppert et al., 2003; Rigby & Stasinopoulos, 2005; Crainiceanu et al., 2007). A particularly attractive approach, from an extendability standpoint, involves graphical model representations of mixed model-based penalized splines (Wand, 2009). This allows one to take advantage of the growing body of methodology and software for approximate inference in general graphical models. Graphical model-based Bayesian inference engines such as BUGS (Spiegelhalter et al., 2003), Infer.NET (Minka et al., 2013) and Stan (Stan Development Team, 2013) now accommodate a wide range of nonparametric and semiparametric regression models (e.g. Marley & Wand, 2010; Luts, Wang, Ormerod & Wand, 2014). However, the form of the approximate inference differs markedly among the various inference engines. BUGS and Stan each use relatively slow but highly accurate Markov chain Monte Carlo (MCMC) methodology, whereas Infer.NET uses faster, but less accurate, deterministic approximations such as mean field variational Bayes (MFVB). The latter type of methodology is our focus here. A relatively new modification of MFVB, non-conjugate variational message passing, is shown to be particularly useful for handling heteroscedasticity since it results in algorithms that involve only closed form algebraic expressions.

The modularity of the graphical model and MFVB approaches implies that methodology for handling heteroscedasticity in semiparametric regression applies to arbitrarily large models. We call this the locality property of MFVB; it is described in Section 3 of Wand et al. (2011).

If a very large graphical model includes a component where one variable is modelled as a heteroscedastic nonparametric regression function of another variable, then variational inference for the parameters in that part of the graph can be achieved using the methodology developed here.

Variational methodology for heteroscedastic regression models has also been developed by Lázaro-Gredilla & Titsias (2011) and Nott, Tran & Leng (2012), although the latter article was confined to linear mean and log-variance functions. Lázaro-Gredilla & Titsias (2011) achieved simultaneous nonparametric mean and variance function estimation via an elegant Gaussian process approach. The ensuing nonparametric function estimators are full-rank and their variance function estimation strategy relies on Gauss-Hermite quadrature. Our approach differs by using low-rank penalized splines and a non-conjugate MFVB strategy that provides closed form updates, making it more amenable to high volume/velocity data.

Section 2 gives a description of the Bayesian penalized spline model for simultaneous mean and variance function estimation based on univariate data. The variational inference methodology is provided in Section 3 and our main algorithm is presented there. Section 4 provides numerical illustrations which give evidence of the speed achieved by non-conjugate MFVB. In Sections 5 to 7 we describe extensions to bivariate nonparametric regression, additive models and real-time semiparametric regression. Some concluding remarks are made in Section 8. The appendix contains some algebraic details used in the derivation of Algorithm 1.

2 Model Description

The Gaussian heteroscedastic nonparametric regression model has generic form

    y_i ind.~ N(f(x_i), g(x_i)),   1 ≤ i ≤ n,   (2)

where (x_i, y_i) is the ith predictor/response pair of a regression data-set and "ind.~" means "distributed independently as". The functions f and g are smooth, but otherwise arbitrary, functions. We refer to f as the mean function and g as the variance function. We also use the term standard deviation function for √g. Mixed model-based penalized spline fitting of (2) (e.g. Ruppert et al., 2003) involves modelling f and g according to

    f(x) = β_0 + β_1 x + Σ_{k=1}^{K_u} u_k z_k^u(x),   u_k ind.~ N(0, σ_u²),
    g(x) = exp{γ_0 + γ_1 x + Σ_{k=1}^{K_v} v_k z_k^v(x)},   v_k ind.~ N(0, σ_v²).   (3)

The {z_k^u : 1 ≤ k ≤ K_u} and {z_k^v : 1 ≤ k ≤ K_v} are spline bases of sizes K_u and K_v, respectively. Our default for the z_k^u and z_k^v is suitably transformed cubic O'Sullivan splines, as described in Section 4 of Wand & Ormerod (2008). If K_u = K_v then the two bases are identical, but we leave open the possibility of different basis sizes for the mean and variance functions.

In this article, we take a Bayesian approach to fitting (2) and (3), and impose the following priors on the model parameters:

    β_0, β_1 ind.~ N(0, σ_β²),   γ_0, γ_1 ind.~ N(0, σ_γ²),   σ_u ~ Half-Cauchy(A_u)   and   σ_v ~ Half-Cauchy(A_v),   (4)

with hyperparameters σ_β, σ_γ, A_u, A_v > 0 to be specified by the user. The Half-Cauchy(A) density function is given by p(x) = {2/(π A)}/{1 + (x/A)²}, x > 0.
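The spline model (3) translates directly into design matrices holding the basis function evaluations, which are then combined in the hierarchical model below. The following R sketch illustrates the structure under stated assumptions: a simple truncated-line basis stands in for the default O'Sullivan splines, the sample size, knot placement, the true functions used to simulate y, and all object names (x, y, X, Zu, Zv, Cnu, Comega) are illustrative choices rather than part of the paper's specification.

    ## Illustrative set-up of the design matrices used throughout the paper,
    ## with a truncated-line spline basis as a stand-in for O'Sullivan splines.
    set.seed(1)
    n  <- 200
    x  <- sort(runif(n))
    Ku <- 15 ; Kv <- 15
    knotsU <- quantile(unique(x), seq(0, 1, length = Ku + 2)[-c(1, Ku + 2)])
    knotsV <- knotsU                                  # identical bases when Ku = Kv

    X  <- cbind(1, x)                                 # n x 2 fixed-effects design
    Zu <- outer(x, knotsU, function(x, k) pmax(x - k, 0))   # n x Ku spline design
    Zv <- outer(x, knotsV, function(x, k) pmax(x - k, 0))   # n x Kv spline design

    Cnu    <- cbind(X, Zu)    # so that X beta + Zu u  = Cnu nu
    Comega <- cbind(X, Zv)    # so that X gamma + Zv v = Comega omega

    ## Simulated heteroscedastic response, with illustrative mean and variance functions:
    fTrue <- function(x) sin(3*pi*x)
    gTrue <- function(x) exp(cos(4*pi*x) - 1)
    y <- fTrue(x) + sqrt(gTrue(x))*rnorm(n)

The same layout applies regardless of the basis choice; only the columns of Zu and Zv change.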

Assuming that the data have been pre-transformed to have zero mean and unit variance, we use the same large default value for each of σ_β, σ_γ, A_u and A_v throughout this article.

Result 5 of Wand et al. (2011) allows us to replace σ_u ~ Half-Cauchy(A_u) by

    σ_u² | a_u ~ Inverse-Gamma(1/2, 1/a_u),   a_u ~ Inverse-Gamma(1/2, 1/A_u²),   (5)

where x ~ Inverse-Gamma(A, B) means that x has an Inverse Gamma distribution with shape parameter A > 0 and rate parameter B > 0. The corresponding density function is p(x) = B^A Γ(A)^{-1} x^{-A-1} exp(-B/x), x > 0. Representation (5) is more amenable to variational approximate inference. A similar replacement is made for σ_v.

The full Bayesian hierarchical model corresponding to (2)-(4) is

    y | β, γ, u, v ~ N(Xβ + Z_u u, diag{exp(Xγ + Z_v v)}),
    u | σ_u² ~ N(0, σ_u² I),   v | σ_v² ~ N(0, σ_v² I),   β ~ N(0, σ_β² I),   γ ~ N(0, σ_γ² I),
    σ_u² | a_u ~ Inverse-Gamma(1/2, 1/a_u),   a_u ~ Inverse-Gamma(1/2, 1/A_u²),   (6)
    σ_v² | a_v ~ Inverse-Gamma(1/2, 1/a_v),   a_v ~ Inverse-Gamma(1/2, 1/A_v²).

Here β and γ are 2 × 1 vectors of fixed effects containing (β_0, β_1) and (γ_0, γ_1) respectively, u is the K_u × 1 vector containing u_1, ..., u_{K_u}, v is a K_v × 1 vector defined similarly, and σ_u² and σ_v² are variance components corresponding to u and v respectively. The design matrices X, Z_u and Z_v are defined as

    X ≡ [1  x_i]_{1≤i≤n},   Z_u ≡ [z_k^u(x_i)]_{1≤i≤n, 1≤k≤K_u}   and   Z_v ≡ [z_k^v(x_i)]_{1≤i≤n, 1≤k≤K_v}.

It is also convenient to combine the mean function coefficients and variance function coefficients into single vectors:

    ν ≡ [β^T  u^T]^T   and   ω ≡ [γ^T  v^T]^T.

An analogous combining of the design matrices,

    C_ν ≡ [X  Z_u]   and   C_ω ≡ [X  Z_v],

then allows us to write Xβ + Z_u u = C_ν ν and Xγ + Z_v v = C_ω ω. Figure 2 shows the directed acyclic graph corresponding to the model conveyed in (6).

MCMC schemes for fitting and inference in (6) are relatively straightforward to devise and implement. The BUGS and Stan MCMC-based inference engines also support (6) and an illustration of such is given in Section 4. However, MCMC does not scale well to high volume/velocity data and larger models with heteroscedastic nonparametric regression components.
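The equivalence underlying (5) is easy to confirm by simulation. The short R sketch below draws (a_u, σ_u²) from the auxiliary-variable representation and compares the implied distribution of σ_u with direct Half-Cauchy(A_u) draws; the value of A_u and the simulation size are illustrative, and the quantile comparison is simply one convenient check.

    ## Monte Carlo check of representation (5): if a ~ Inverse-Gamma(1/2, 1/A^2) and
    ## sigma^2 | a ~ Inverse-Gamma(1/2, 1/a), then sigma ~ Half-Cauchy(A).
    ## Inverse-Gamma(shape, rate) draws are obtained as reciprocals of Gamma draws.
    set.seed(2)
    A <- 2.5 ; nSim <- 1e5
    a     <- 1/rgamma(nSim, shape = 0.5, rate = 1/A^2)
    sigsq <- 1/rgamma(nSim, shape = 0.5, rate = 1/a)
    sigma <- sqrt(sigsq)

    sigmaHC <- abs(A*rcauchy(nSim))      # direct Half-Cauchy(A) draws

    ## The two sets of quantiles should agree up to Monte Carlo error:
    round(rbind(auxiliary  = quantile(sigma,   c(0.25, 0.5, 0.75)),
                halfCauchy = quantile(sigmaHC, c(0.25, 0.5, 0.75))), 3)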

Figure 2: Directed acyclic graph for the model in (6). The shaded node corresponds to the observed data vector. Random effects and auxiliary variables are referred to as hidden nodes. The grey dashed boxes indicate that β and u are combined into a vector denoted by ν and that ω is the concatenation of γ and v.

3 Variational Inference Methodology

We now consider the problem of mean field-type variational inference for (6) (e.g. Wainwright & Jordan, 2008; Ormerod & Wand, 2010). This involves an approximation to the joint posterior density function of the form

    p(ν, ω, σ², a | y) ≈ q(ν) q(ω) q(σ²) q(a),   (7)

where σ² = (σ_u², σ_v²) and a = (a_u, a_v). The q-densities are chosen to minimize the Kullback-Leibler divergence between the left-hand side and right-hand side of (7). This is equivalent to maximizing the variational lower bound on the marginal log-likelihood log p(y):

    log p(y; q) ≡ E_q[log p(ν, ω, σ², a, y) − log{q(ν) q(ω) q(σ²) q(a)}].

The optimal q-densities, denoted by q*, can be shown to satisfy (e.g. Ormerod & Wand, 2010)

    q*(ν) ∝ exp{E_{q(−ν)} log p(ν | rest)},   q*(ω) ∝ exp{E_{q(−ω)} log p(ω | rest)},
    q*(σ²) ∝ exp{E_{q(−σ²)} log p(σ² | rest)}   and   q*(a) ∝ exp{E_{q(−a)} log p(a | rest)},   (8)

where, for example, E_{q(−ν)} denotes expectation with respect to the q-densities of all parameters except ν. Also, "rest" denotes all of the random variables in the model other than those in ν, including y. The full conditionals p(ν | rest), p(σ² | rest) and p(a | rest) each have standard forms which result in closed form q-density expressions. However, p(ω | rest) has a non-standard form and challenging integrals arise in the determination of q*(ω). We achieve a tractable solution by imposing the additional restriction that

    q(ω) = q(ω; µ_{q(ω)}, Σ_{q(ω)}) is a N(µ_{q(ω)}, Σ_{q(ω)}) density function   (9)

for some mean vector µ_{q(ω)} and covariance matrix Σ_{q(ω)}. The corresponding marginal log-likelihood lower bound is

    log p(y; q, µ_{q(ω)}, Σ_{q(ω)}) ≡ E_q[log p(ν, ω, σ², a, y) − log{q(ν) q(ω; µ_{q(ω)}, Σ_{q(ω)}) q(σ²) q(a)}]   (10)

and minimization of the Kullback-Leibler divergence corresponds to µ_{q(ω)} and Σ_{q(ω)} being chosen to maximise (10). This approach, due to Knowles & Minka (2011), has been labelled non-conjugate variational message passing since it provides a way of circumventing non-conjugacies in MFVB. Note that variational message passing is an alternative formulation of MFVB; see e.g. Minka & Winn (2008). Knowles & Minka (2011) propose a fixed point iteration scheme as a means of maximizing (10). Wand (2014) provides the algebraic details and simplification of the fixed point updates for Multivariate Normal q-density parameters such as those in (9).

For fixed (µ_{q(ω)}, Σ_{q(ω)}), application of (8), with q*(ω) omitted, leads to

    q*(ν) is the N(µ_{q(ν)}, Σ_{q(ν)}) density function,
    q*(σ²) = q*(σ_u², σ_v²) is a product of Inverse-Gamma(½(K_u + 1), B_{q(σ_u²)}) and Inverse-Gamma(½(K_v + 1), B_{q(σ_v²)}) density functions, and
    q*(a) = q*(a_u, a_v) is a product of Inverse-Gamma(1, B_{q(a_u)}) and Inverse-Gamma(1, B_{q(a_v)}) density functions   (11)

for parameters µ_{q(ν)} and Σ_{q(ν)}, the mean and covariance matrix of q*(ν); B_{q(σ_u²)}, the rate parameter of q*(σ_u²); B_{q(σ_v²)}, the rate parameter of q*(σ_v²); B_{q(a_u)}, the rate parameter of q*(a_u); and B_{q(a_v)}, the rate parameter of q*(a_v). These optimal parameters are all interrelated and are obtained through an iterative algorithm, listed below as Algorithm 1. The algorithm also includes fixed point iterative updates for µ_{q(ω)} and Σ_{q(ω)}. Details on the derivation of (11), as well as the fixed point updates, are given in an appendix.

Before presenting Algorithm 1 some additional matrix algebraic notation is required. For two matrices A and B of equal size, A ⊙ B denotes their element-wise product. If a is a d × 1 vector then diag(a) is the d × d diagonal matrix with the entries of a along the diagonal. For a d × d matrix A, diagonal(A) is the d × 1 vector comprising the diagonal entries of A. We write blockdiag(A, B) for the block-diagonal matrix with blocks A and B. Also, µ_{q(u)} is defined to be the sub-vector of µ_{q(ν)} corresponding to u, and Σ_{q(u)} is the sub-matrix of Σ_{q(ν)} corresponding to u. The symbols µ_{q(v)} and Σ_{q(v)} are defined similarly.

Initialize: µ_{q(ω)} a (K_v + 2) × 1 vector, Σ_{q(ω)} a (K_v + 2) × (K_v + 2) positive definite matrix, µ_{q(1/σ_u²)}, µ_{q(1/σ_v²)} > 0, and µ_{q(r_ν)} an n × 1 vector.

Cycle:

    ψ_{q(ω)} ← exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)}

    Σ_{q(ν)} ← [C_ν^T diag(ψ_{q(ω)}) C_ν + blockdiag(σ_β^{−2} I, µ_{q(1/σ_u²)} I)]^{−1}

    µ_{q(ν)} ← Σ_{q(ν)} C_ν^T diag(ψ_{q(ω)}) y

    Σ_{q(ω)} ← [½ C_ω^T diag(µ_{q(r_ν)} ⊙ ψ_{q(ω)}) C_ω + blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I)]^{−1}

    µ_{q(ω)} ← µ_{q(ω)} + Σ_{q(ω)} [½ C_ω^T {µ_{q(r_ν)} ⊙ ψ_{q(ω)} − 1} − blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I) µ_{q(ω)}]

    µ_{q(1/a_u)} ← 1/{µ_{q(1/σ_u²)} + A_u^{−2}} ;   µ_{q(1/a_v)} ← 1/{µ_{q(1/σ_v²)} + A_v^{−2}}

    µ_{q(1/σ_u²)} ← (K_u + 1)/{2 µ_{q(1/a_u)} + ‖µ_{q(u)}‖² + tr(Σ_{q(u)})}

    µ_{q(1/σ_v²)} ← (K_v + 1)/{2 µ_{q(1/a_v)} + ‖µ_{q(v)}‖² + tr(Σ_{q(v)})}

    µ_{q(r_ν)} ← diagonal{(y − C_ν µ_{q(ν)})(y − C_ν µ_{q(ν)})^T + C_ν Σ_{q(ν)} C_ν^T}

until the absolute relative change in log p(y; q, µ_{q(ω)}, Σ_{q(ω)}) is negligible.

Algorithm 1: MFVB algorithm for the determination of the optimal parameters in q*(ω), q*(ν), q*(σ_u²), q*(σ_v²), q*(a_u) and q*(a_v).
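To make the flow of Algorithm 1 concrete, the R sketch below gives one possible rendering of its cycle, reusing the objects y, Cnu and Comega from the earlier design-matrix sketch. The state list, the blockDiag() and mfvbCycle() helper names, the hyperparameter values and the simple parameter-based convergence check are all illustrative assumptions; the paper's own stopping rule monitors the lower bound log p(y; q, µ_{q(ω)}, Σ_{q(ω)}), which is not computed here.

    ## Minimal sketch of the Algorithm 1 cycle (batch fitting).
    blockDiag <- function(A, B) {
      out <- matrix(0, nrow(A) + nrow(B), ncol(A) + ncol(B))
      out[seq_len(nrow(A)), seq_len(ncol(A))] <- A
      out[nrow(A) + seq_len(nrow(B)), ncol(A) + seq_len(ncol(B))] <- B
      out
    }

    mfvbCycle <- function(s) {
      ## s holds the data, the current q-density parameters and the hyperparameters.
      with(s, {
        Ku <- ncol(Cnu) - 2 ; Kv <- ncol(Comega) - 2
        ## E_q{exp(-C_omega omega)}, evaluated row-wise:
        psi <- exp(-as.vector(Comega %*% mu.q.omega) +
                   0.5*rowSums((Comega %*% Sigma.q.omega)*Comega))
        ## q*(nu) updates:
        Dnu <- blockDiag(diag(2)/sigsq.beta, diag(Ku)*mu.q.recip.sigsq.u)
        Sigma.q.nu <- solve(crossprod(Cnu, psi*Cnu) + Dnu)
        mu.q.nu <- as.vector(Sigma.q.nu %*% crossprod(Cnu, psi*y))
        ## non-conjugate fixed-point updates for q(omega):
        Domega <- blockDiag(diag(2)/sigsq.gamma, diag(Kv)*mu.q.recip.sigsq.v)
        Sigma.q.omega <- solve(0.5*crossprod(Comega, (mu.q.rnu*psi)*Comega) + Domega)
        mu.q.omega <- as.vector(mu.q.omega + Sigma.q.omega %*%
            (0.5*crossprod(Comega, mu.q.rnu*psi - 1) - Domega %*% mu.q.omega))
        ## auxiliary variable and variance component updates:
        mu.q.recip.au <- 1/(mu.q.recip.sigsq.u + Au^(-2))
        mu.q.recip.av <- 1/(mu.q.recip.sigsq.v + Av^(-2))
        ranInds <- -(1:2)
        mu.q.recip.sigsq.u <- (Ku + 1)/(2*mu.q.recip.au +
            sum(mu.q.nu[ranInds]^2) + sum(diag(Sigma.q.nu)[ranInds]))
        mu.q.recip.sigsq.v <- (Kv + 1)/(2*mu.q.recip.av +
            sum(mu.q.omega[ranInds]^2) + sum(diag(Sigma.q.omega)[ranInds]))
        ## E_q{(y - C_nu nu)^2}, evaluated entry-wise:
        resid <- y - as.vector(Cnu %*% mu.q.nu)
        mu.q.rnu <- resid^2 + rowSums((Cnu %*% Sigma.q.nu)*Cnu)
        modifyList(s, list(Sigma.q.nu = Sigma.q.nu, mu.q.nu = mu.q.nu,
          Sigma.q.omega = Sigma.q.omega, mu.q.omega = mu.q.omega,
          mu.q.recip.sigsq.u = mu.q.recip.sigsq.u,
          mu.q.recip.sigsq.v = mu.q.recip.sigsq.v, mu.q.rnu = mu.q.rnu))
      })
    }

    ## Batch fit: initialise the state and iterate the cycle.
    state <- list(y = y, Cnu = Cnu, Comega = Comega,
                  mu.q.omega = rep(0, ncol(Comega)),
                  Sigma.q.omega = diag(ncol(Comega)),
                  mu.q.recip.sigsq.u = 1, mu.q.recip.sigsq.v = 1,
                  mu.q.rnu = rep(1, length(y)),
                  sigsq.beta = 1e10, sigsq.gamma = 1e10, Au = 25, Av = 25)
    for (iter in 1:200) {
      old <- state$mu.q.omega
      state <- mfvbCycle(state)
      if (max(abs(state$mu.q.omega - old)) < 1e-6) break
    }
    fHat  <- as.vector(Cnu %*% state$mu.q.nu)                  # plug-in mean function fit
    sdHat <- exp(0.5*as.vector(Comega %*% state$mu.q.omega))   # plug-in st. dev. function fit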

Convergence of Algorithm 1 can be monitored using the following explicit expression for the marginal log-likelihood lower bound:

    log p(y; q, µ_{q(ω)}, Σ_{q(ω)}) = ½(K_u + K_v + 4) − (n/2) log(2π) + log Γ(½(K_u + 1)) + log Γ(½(K_v + 1)) − log π − log(A_u) − log(A_v)
        − ½ 1^T C_ω µ_{q(ω)} − ½ (y − C_ν µ_{q(ν)})^T diag(ψ_{q(ω)}) (y − C_ν µ_{q(ν)}) − ½ tr{C_ν^T diag(ψ_{q(ω)}) C_ν Σ_{q(ν)}}
        − 2 log(σ_β) − 2 log(σ_γ) + ½ log|Σ_{q(ν)}| + ½ log|Σ_{q(ω)}|
        − (1/(2σ_β²)){‖µ_{q(β)}‖² + tr(Σ_{q(β)})} − (1/(2σ_γ²)){‖µ_{q(γ)}‖² + tr(Σ_{q(γ)})}
        − ½(K_u + 1) log(B_{q(σ_u²)}) − ½(K_v + 1) log(B_{q(σ_v²)})
        − log{µ_{q(1/σ_u²)} + A_u^{−2}} − log{µ_{q(1/σ_v²)} + A_v^{−2}} + µ_{q(1/σ_u²)} µ_{q(1/a_u)} + µ_{q(1/σ_v²)} µ_{q(1/a_v)}.

Unlike ordinary MFVB, there is no guarantee that each iteration leads to an increase in p(y; q, µ_{q(ω)}, Σ_{q(ω)}). Thus the iterations should be stopped when the absolute relative change in its logarithm falls below a negligible amount.

In Figure 3 we return to the examples of Figure 1, armed with our new MFVB methodology. The top panels show the fitted mean functions, according to (6), with pointwise 95% credible sets. The model was fitted using MFVB corresponding to Algorithm 1 and MCMC based on BUGS (Spiegelhalter et al., 2003) with a burn-in of 5000, a kept sample of 5000 and a thinning factor of 5. The much faster MFVB fit is seen to be in excellent agreement with its more computationally costly benchmark. The middle panels show the fitted standard deviation functions, corresponding to √g in the notation of (2), with corresponding 95% credible sets. The agreement between MFVB and MCMC is good, rather than excellent, for √g. The √g-standardized residual plots at the bottom of Figure 3 indicate proper accounting for the heteroscedasticity and agreement with normality.

We also conducted a comprehensive check in order to confirm that the relative change in the approximate marginal log-likelihood does not lead to early stopping. Over several hundred runs, with data generated from different scenarios, we recorded Bayes estimates of f and g at the stopping point and again with iterations continuing 5% beyond the stopping point. The differences in the estimates were negligible.

4 Performance Assessment

We conducted a comprehensive simulation study to assess the performance of Algorithm 1 in terms of inferential accuracy and computing time. Data were simulated according to model (2) with n = 500 and the x_i's uniform on (0, 1). The four mean and variance function pairs are listed in Table 1, where φ(·; µ, σ) and Φ(·; µ, σ) respectively denote the density and distribution functions of the Normal distribution with mean µ and standard deviation σ. We generated 100 data-sets for each of the settings shown in Table 1.

For each replication, inference was obtained using MFVB via Algorithm 1 and, for comparison, MCMC. R was also used for the implementation of Algorithm 1. The MFVB iterations were terminated when the relative change in log p(y; q) fell below a small tolerance. The MCMC was performed using BUGS with sample sizes the same as for the examples in Section 3. The following sections provide details on the accuracy of MFVB against the MCMC benchmark.
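As a small illustration of the stopping rule, the R snippet below computes successive absolute relative changes in the lower bound; the logged values and the tolerance are made-up numbers, since the study's actual tolerance is not reproduced here.

    ## Illustrative stopping check: iterations end once the absolute relative change
    ## in log p(y; q) drops below a chosen tolerance.  All values below are made up.
    logML <- c(-1432.71, -1398.20, -1397.952, -1397.949)   # bound at successive iterations
    relChange <- abs(diff(logML)/logML[-length(logML)])
    tol <- 1e-5
    which(relChange < tol)[1]    # first iteration at which the criterion is met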

Figure 3: Top panels: fitted mean functions for two example data sets from Wand & Jones (1995) and Ruppert et al. (2003). The solid curves are approximate pointwise posterior means whilst the dashed curves are corresponding pointwise 95% credible sets. The approximate fits are based on MFVB via Algorithm 1 and MCMC via BUGS. Middle panels: similar to the top panels but for the standard deviation function. Bottom panels: standardized residual plots based on {y_i − f̂(x_i)}/√{ĝ(x_i)}, where f̂ and ĝ are the MFVB-approximate Bayes estimates of f and g.

4.1 Assessment of Accuracy

Algorithm 1 yields fast approximate inference for the model parameters; however, it does not guarantee that an adequate level of accuracy will be achieved. Figure 4 provides an accuracy assessment of Algorithm 1 using side-by-side boxplots of the accuracy scores for the parameters of interest.

Table 1: Details of simulation study settings. In setting A, f(x) = sin(3πx) and log g(x) = cos(4πx). Settings B, C and D use mean and log-variance functions built from linear terms and the Normal density and distribution functions φ(·; µ, σ) and Φ(·; µ, σ) evaluated at various locations and scales (for example, φ(x; 0.38, 0.08), φ(x; 0.75, 0.03) and φ(x; 0.7, 0.14) appear among the components).

For a generic parameter θ, the accuracy of q*(θ) is defined to be

    accuracy(q*) = 100 {1 − ½ ∫ |q*(θ) − p(θ | y)| dθ} %.   (12)

Note that p(θ | y) can be approximated by applying kernel density estimation to the MCMC sample. This accuracy score is justified in Faes et al. (2011). Accuracy is monitored for the parameters f(H_k) and g(H_k), 1 ≤ k ≤ 5, where the H_k are the sample hexiles of the x_i's. The boxplots illustrate that most of the accuracies for the f(H_k) lie around 90%, while accuracies for the g(H_k) lie around 80%. These very good accuracy results are in keeping with the heuristics given in Section 3.1 of Menictas & Wand (2013).

Figure 4: Summary of the simulation study, with the accuracy values for f(H_1), ..., f(H_5) and g(H_1), ..., g(H_5) summarized as boxplots for each of settings A-D.
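The accuracy score (12) is straightforward to compute once an MCMC sample and the corresponding q-density are available. The R sketch below does this for a generic scalar parameter; the simulated MCMC draws, the Normal form of the q-density and the function name accuracyScore() are stand-ins for illustration only.

    ## Sketch of the accuracy score (12): 100*(1 - 0.5 * L1 distance between the
    ## q-density and a kernel estimate of the MCMC-based posterior density).
    accuracyScore <- function(mcmcSample, qDensity, gridSize = 1001) {
      dens <- density(mcmcSample, n = gridSize)            # kernel estimate of p(theta | y)
      L1   <- sum(abs(qDensity(dens$x) - dens$y))*diff(dens$x[1:2])
      100*(1 - 0.5*L1)
    }

    ## Example usage with simulated stand-ins:
    set.seed(3)
    mcmcSample <- rnorm(5000, mean = 1.30, sd = 0.20)
    qDensity   <- function(x) dnorm(x, mean = 1.28, sd = 0.19)
    accuracyScore(mcmcSample, qDensity)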

Figure 5 shows a comparison of the MCMC and MFVB fitted mean and standard deviation functions for the first replication in each of the four simulation settings. The MCMC and MFVB fits show excellent agreement for the mean functions and relatively good agreement for the standard deviation functions.

Figure 5: Comparison of MCMC and MFVB fitted mean and standard deviation functions, together with the true functions, for the first data-set in each of the four simulation settings.

4.2 Assessment of Coverage

We are also interested in the coverage achieved by the MFVB approximate credible intervals. Table 2 gives the percentages of true parameter coverage based on the approximate 95% credible intervals attained from the MFVB posterior densities. The coverage is good overall and does not fall below 86%. As we have already seen in the previous section, the performance of MFVB is excellent for

the mean function and very good for the variance function.

Table 2 (rows: settings A-D; columns: f(H_1), ..., f(H_5) and g(H_1), ..., g(H_5)): Percentage coverage of the true parameter values by approximate 95% credible intervals based on variational Bayes approximate posterior density functions. The percentages are based on 100 replications.

4.3 Assessment of Speed

During the running of the simulation we monitored the time taken for each model to be fitted via MCMC and MFVB. The results are summarized in Table 3. The simulation was run on the first author's desktop computer (Intel Core processor, 8 gigabytes of random access memory). As mentioned previously, convergence was assessed differently for the two approaches. Also, the speed gains of MFVB are traded off against accuracy losses induced by the product restriction given in (7). However, despite this concern, the results show that MFVB is at least approximately 200 times faster than MCMC across all models. Thus we can assert that a model that takes minutes to run using MCMC will take only seconds to run using our MFVB algorithm.

Table 3 (rows: settings A-D; columns: MCMC, MFVB): 99% Wilcoxon confidence intervals based on run times, in seconds, for MCMC and MFVB fitting.
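Intervals of the kind reported in Table 3 can be obtained in R from recorded run times via the one-sample Wilcoxon signed rank procedure. In the sketch below the run times are simulated stand-ins, not the study's measurements, and the rough orders of magnitude are assumptions chosen purely for illustration.

    ## Illustrative 99% Wilcoxon confidence intervals for MCMC and MFVB run times.
    set.seed(4)
    mcmcTimes <- rnorm(100, mean = 910, sd = 12)    # seconds, hypothetical
    mfvbTimes <- rnorm(100, mean = 4.2, sd = 0.4)   # seconds, hypothetical
    wilcox.test(mcmcTimes, conf.int = TRUE, conf.level = 0.99)$conf.int
    wilcox.test(mfvbTimes, conf.int = TRUE, conf.level = 0.99)$conf.int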

5 Extension to Bivariate Predictors

Algorithm 1 is relatively easy to extend to bivariate predictor nonparametric regression, which is closely related to geostatistics (e.g. Cressie, 1993); we briefly describe this extension here. In this case, the data are of the form

    (x_i, y_i),   x_i ∈ R²,   y_i ∈ R.   (13)

In the classical geostatistics scenario, the x_i's correspond to geographical locations. However, in (13) the x_i's could also represent pairs of non-geographical measurements. The bivariate analogue of (2) is

    y_i ind.~ N(f(x_i), g(x_i)),   1 ≤ i ≤ n,   (14)

where f and g are real-valued functions on R². The extension of (3) is

    f(x) = β_0 + β_1^T x + Σ_{k=1}^{K_u} u_k z_k^u(x),   u_k ind.~ N(0, σ_u²),
    g(x) = exp{γ_0 + γ_1^T x + Σ_{k=1}^{K_v} v_k z_k^v(x)},   v_k ind.~ N(0, σ_v²),   (15)

where each of β_1 and γ_1 are 2 × 1 vectors. The functions {z_k^u : 1 ≤ k ≤ K_u} and {z_k^v : 1 ≤ k ≤ K_v} are now bivariate spline basis functions. A reasonable default for the z_k^u (Ruppert et al., 2003) is the low-rank thin plate spline basis with

    [z_1^u(x), ..., z_{K_u}^u(x)] = [r(x − κ_1^u), ..., r(x − κ_{K_u}^u)] ([r(κ_k^u − κ_{k'}^u)]_{1≤k,k'≤K_u})^{−1/2},   (16)

where κ_1^u, ..., κ_{K_u}^u is a set of bivariate knot locations that efficiently cover the space of the x_i's and r(x) ≡ ‖x‖² log‖x‖. The default z_k^v's have an analogous definition. According to this set-up, the only difference between the univariate heteroscedastic nonparametric regression model, treated in Section 3, and its bivariate counterpart is the basis functions and their coefficients. Hence, Algorithm 1 can be used to fit the bivariate heteroscedastic nonparametric regression model by replacing ν, ω, C_ν and C_ω from Section 3 with

    ν ≡ [β_0, β_1^T, u^T]^T,   ω ≡ [γ_0, γ_1^T, v^T]^T,   C_ν ≡ [1  x_i^T  z_k^u(x_i)]_{1≤i≤n, 1≤k≤K_u}   and   C_ω ≡ [1  x_i^T  z_k^v(x_i)]_{1≤i≤n, 1≤k≤K_v}.

We fitted (14)-(15) to geo-referenced data on sea-floor sediment pollution in the North Sea (source: Pebesma & Duin, 2005). The data are stored in the pcb data frame within the R package gstat (Pebesma, 2004). The response variable is a measurement of polychlorinated biphenyl with Ballschmiter-Zell congener number 138 (PCB-138). The motivating study is concerned with spatial and temporal variability of PCB-138. For the purposes of illustration, we ignore the temporal aspect and focus on geographical variability in the mean and variance of the response. In the notation of model (14)-(15), the variables are x = (x_1, x_2), where x_1 is the x-coordinate and x_2 is the y-coordinate in the Universal Transverse Mercator system for zone 31, and y is PCB-138 measured on the sediment fraction smaller than 63 µm, in µg/kg dry matter. Thin plate spline bases of size K_u = K_v = 50 were used for each functional fit. The estimated mean and standard deviation functions are shown in Figure 6. Both functions are seen to exhibit pronounced spatial effects. Simple bivariate predictor models that ignore the heteroscedasticity described by the right panel of Figure 6 would have erroneous prediction intervals.

Higher dimensional heteroscedastic nonparametric regression can be achieved via Algorithm 1 with little notational change from the bivariate case treated here. The only required modification involves higher-dimensional thin plate spline basis functions instead of those given by (16).
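The low-rank thin plate spline design matrix implied by (16) can be formed with a few lines of R. In the sketch below the knots are chosen by simple random sub-sampling of the predictor locations rather than a space-filling design, and the matrix inverse square root is computed via a singular value decomposition; both are pragmatic stand-ins, the helper names tps.r() and thinPlateZ() are illustrative, and the data are simulated.

    ## Sketch of a low-rank thin plate spline design matrix with r(x) = ||x||^2 log||x||.
    tps.r <- function(d2) ifelse(d2 == 0, 0, 0.5*d2*log(d2))   # r() in terms of squared distance

    thinPlateZ <- function(X, knots) {
      ## pairwise squared distances between the rows of X and the knots:
      D2xk <- outer(rowSums(X^2), rowSums(knots^2), "+") - 2*X %*% t(knots)
      D2kk <- outer(rowSums(knots^2), rowSums(knots^2), "+") - 2*knots %*% t(knots)
      Omega <- tps.r(pmax(D2kk, 0))
      ## pragmatic inverse square root of Omega via its singular value decomposition:
      sv <- svd(Omega)
      OmegaInvSqrt <- sv$u %*% diag(1/sqrt(sv$d)) %*% t(sv$v)
      tps.r(pmax(D2xk, 0)) %*% OmegaInvSqrt
    }

    ## Example usage with hypothetical bivariate predictors:
    set.seed(5)
    xBiv  <- cbind(runif(300), runif(300))
    knots <- xBiv[sample(nrow(xBiv), 50), ]
    Zu    <- thinPlateZ(xBiv, knots)      # 300 x 50 bivariate spline design matrix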

Figure 6: Left panel: fitted mean function (mean PCB-138, µg/kg dry matter) for the polychlorinated biphenyl data described in the text, based on MFVB via Algorithm 1, over UTM zone 31 x- and y-coordinates. Right panel: similar to the left panel but for the standard deviation function (st. dev. PCB-138, µg/kg dry matter).

6 Extension to Additive Models

The final semiparametric regression extension that we discuss briefly in this section is additive models with multiplicative variance functions. Models of this type have a small literature, with Rigby & Stasinopoulos (2005) being a key reference. Their generic form is

    y_i ind.~ N(β_0 + Σ_{j=1}^d f_j(x_{ji}), exp{γ_0 + Σ_{j=1}^d h_j(x_{ji})}),   1 ≤ i ≤ n.   (17)

Here f_j and h_j, 1 ≤ j ≤ d, are smooth but otherwise arbitrary functions. We use penalized spline models of the form

    f_j(x) = β_j x + Σ_{k=1}^{K_j^u} u_{jk} z_{jk}^u(x),   u_{jk} ind.~ N(0, σ_{uj}²),
    h_j(x) = γ_j x + Σ_{k=1}^{K_j^v} v_{jk} z_{jk}^v(x),   v_{jk} ind.~ N(0, σ_{vj}²).   (18)

The {z_{jk}^u : 1 ≤ k ≤ K_j^u}, 1 ≤ j ≤ d, are spline bases of sizes K_j^u, analogous to those presented in Section 2. The {z_{jk}^v : 1 ≤ k ≤ K_j^v}, 1 ≤ j ≤ d, are similarly defined. The priors on the regression coefficients and standard deviation parameters are

    β_j ind.~ N(0, σ_β²),   γ_j ind.~ N(0, σ_γ²),   σ_uj ind.~ Half-Cauchy(A_u),   σ_vj ind.~ Half-Cauchy(A_v).   (19)

The Bayesian model given by (17), (18) and (19) admits a closed form non-conjugate MFVB algorithm, with the regression coefficients for the full mean and variance functions each being Multivariate Normal. Details and illustration are given in Menictas (2014). Section 5.2 of Wand (2014) also contains an illustration for data from a Californian air pollution study.
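For concreteness, the following R sketch assembles additive-model design matrices of the form implied by (17)-(18), again substituting a simple truncated-line basis for the default O'Sullivan splines; the number of predictors, the basis sizes and the helper name truncLineZ() are illustrative assumptions.

    ## Illustrative additive-model design matrices: intercept, linear columns for each
    ## predictor, followed by one spline block per predictor.
    truncLineZ <- function(x, K) {
      knots <- quantile(unique(x), seq(0, 1, length = K + 2)[-c(1, K + 2)])
      outer(x, knots, function(x, k) pmax(x - k, 0))
    }
    set.seed(6)
    n <- 300 ; d <- 3 ; Kj <- 15
    xMat  <- matrix(runif(n*d), n, d)
    Zlist <- lapply(1:d, function(j) truncLineZ(xMat[, j], Kj))
    Cnu    <- cbind(1, xMat, do.call(cbind, Zlist))   # mean-function design matrix
    Comega <- Cnu                                     # same form for the log-variance function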

The extension to additive models with parametric components is trivial. For example, if binary predictor data b_i, 1 ≤ i ≤ n, are also available then the following extension of (17),

    y_i ind.~ N(β_0 + β_b b_i + Σ_{j=1}^d f_j(x_{ji}), exp{γ_0 + γ_b b_i + Σ_{j=1}^d h_j(x_{ji})}),   1 ≤ i ≤ n,

can be handled via simple additions to the design matrices and coefficient vectors. Similar comments apply to the addition of continuous predictors that have a purely linear impact on either the mean function or the log-variance function.

7 Real-Time Heteroscedastic Nonparametric Regression

Almost all nonparametric regression methodology presented to date makes the assumption that the data are processed in batch, that is, all at the same time. Disadvantages of batch processing include the requirement that analysis must wait until the entire data set has been collected, and often the need to store the entire data set in memory. In the real-time case, the analysis is updated as each new data point is collected. This is beneficial, and sometimes essential, for both high volume and high velocity data.

In this section we present a variation of Algorithm 1 that allows for real-time heteroscedastic nonparametric regression. Algorithm 2 processes each new entry of y, denoted by y_new, and its corresponding rows of C_ν and C_ω, denoted by c_ν,new and c_ω,new, successively in real time. The starting values for the real-time procedure are determined by performing a sufficiently large batch fit. This is explained in more detail in Section 2.1.1 of Luts, Broderick & Wand (2014).

The web-site realtime-semiparametric-regression.net features a movie that illustrates Algorithm 2 for data simulated according to setting D. The warm-up sample size is n_warm = 500. The link for the movie is titled Heteroscedastic nonparametric regression and it portrays the effectiveness of real-time processing for mean and standard deviation function fitting.

8 Concluding Remarks

We have developed closed form algorithms for fast batch and real-time fitting and inference for a variety of heteroscedastic semiparametric regression models. The methodology also applies to larger models courtesy of the locality property of mean field variational inference methods. The new methodology has been shown to perform very well on simulated and actual data.

1. Use Algorithm 1 to perform batch-based tuning runs, analogous to those described in Algorithm 2 of Luts et al. (2014), and determine a warm-up sample size n_warm for which convergence is validated.

2. Set µ_{q(ν)}, Σ_{q(ν)}, µ_{q(ω)}, Σ_{q(ω)}, µ_{q(1/σ_u²)} and µ_{q(1/σ_v²)} to their values obtained in the warm-up batch-based tuning run with sample size n_warm. Next set y_warm to be the response vector based on the first n_warm observations. Also set C_ν,warm and C_ω,warm to be the design matrices based on the first n_warm observations. Lastly, assign n ← n_warm.

3. Cycle:

    read in y_new (1 × 1), c_ν,new ((2 + K_u) × 1) and c_ω,new ((2 + K_v) × 1) ;   n ← n + 1

    C_ν ← [C_ν^T  c_ν,new]^T ;   C_ω ← [C_ω^T  c_ω,new]^T ;   y ← [y^T  y_new]^T

    µ_{q(r_ν)} ← diagonal{(y − C_ν µ_{q(ν)})(y − C_ν µ_{q(ν)})^T + C_ν Σ_{q(ν)} C_ν^T}

    ψ_{q(ω)} ← exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)}

    Σ_{q(ν)} ← [C_ν^T diag(ψ_{q(ω)}) C_ν + blockdiag(σ_β^{−2} I, µ_{q(1/σ_u²)} I_{K_u})]^{−1}

    µ_{q(ν)} ← Σ_{q(ν)} C_ν^T diag(ψ_{q(ω)}) y

    Σ_{q(ω)} ← [½ C_ω^T diag(µ_{q(r_ν)} ⊙ ψ_{q(ω)}) C_ω + blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I_{K_v})]^{−1}

    µ_{q(ω)} ← µ_{q(ω)} + Σ_{q(ω)} [½ C_ω^T {µ_{q(r_ν)} ⊙ ψ_{q(ω)} − 1} − blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I_{K_v}) µ_{q(ω)}]

    µ_{q(1/a_u)} ← 1/{µ_{q(1/σ_u²)} + A_u^{−2}} ;   µ_{q(1/a_v)} ← 1/{µ_{q(1/σ_v²)} + A_v^{−2}}

    µ_{q(1/σ_u²)} ← (K_u + 1)/{2 µ_{q(1/a_u)} + ‖µ_{q(u)}‖² + tr(Σ_{q(u)})}

    µ_{q(1/σ_v²)} ← (K_v + 1)/{2 µ_{q(1/a_v)} + ‖µ_{q(v)}‖² + tr(Σ_{q(v)})}

until the analysis is complete or data are no longer available.

Algorithm 2: MFVB algorithm for real-time determination of the optimal parameters in q*(ω), q*(ν), q*(σ_u²), q*(σ_v²), q*(a_u) and q*(a_v).
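In the spirit of Algorithm 2, the real-time scheme can be sketched in R by reusing the mfvbCycle() function given after Algorithm 1: a warm-up batch fit supplies the starting state, and each newly arriving observation appends one row to the data objects before a single sweep of the variational updates. The onlineUpdate() helper and the commented feed loop below are illustrative assumptions, and newDataPoint() is a hypothetical stand-in for whatever source supplies the incoming data.

    ## One real-time update: append the new observation, then run one cycle sweep.
    onlineUpdate <- function(state, yNew, cNuNew, cOmegaNew) {
      state$y        <- c(state$y, yNew)
      state$Cnu      <- rbind(state$Cnu, cNuNew)
      state$Comega   <- rbind(state$Comega, cOmegaNew)
      state$mu.q.rnu <- c(state$mu.q.rnu, 1)    # placeholder entry for the new residual
      mfvbCycle(state)
    }

    ## Real-time loop (schematic): 'state' comes from the warm-up fit of size n_warm.
    # repeat {
    #   d <- newDataPoint()                     # hypothetical feed: list(y, cNu, cOmega)
    #   state <- onlineUpdate(state, d$y, d$cNu, d$cOmega)
    # }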

Appendix: Derivation of optimal q-density functions

Derivation of q*(ν)

First note that

    log q*(ν) = E_q{log p(ν | rest)} + const
              = −½ [ν^T {C_ν^T diag(E_q e^{−C_ω ω}) C_ν + blockdiag(σ_β^{−2} I, µ_{q(1/σ_u²)} I_{K_u})} ν − 2 ν^T C_ν^T diag(E_q e^{−C_ω ω}) y] + const,

where "const" denotes terms not depending on the argument of q*. The form of q*(ν) follows from this and the fact that

    E_q(e^{−C_ω ω}) = exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)} = ψ_{q(ω)}.

Therefore

    µ_{q(ν)} = Σ_{q(ν)} C_ν^T diag(ψ_{q(ω)}) y   and   Σ_{q(ν)} = [C_ν^T diag(ψ_{q(ω)}) C_ν + blockdiag(σ_β^{−2} I, µ_{q(1/σ_u²)} I_{K_u})]^{−1}.

Derivation of q*(σ_u²) and q*(σ_v²)

    log q*(σ_u²) = E_q{log p(σ_u² | rest)} + const
                 = −½(K_u + 3) log(σ_u²) − {½ E_q‖u‖² + µ_{q(1/a_u)}}/σ_u² + const.

The form of q*(σ_u²) follows from this and the fact that E_q‖u‖² = ‖µ_{q(u)}‖² + tr(Σ_{q(u)}). The expression for µ_{q(1/σ_u²)} follows from a standard result for the Inverse-Gamma distribution: if a random variable v has an Inverse-Gamma distribution with density function p(v) = B^A Γ(A)^{−1} v^{−A−1} exp(−B/v), then E(1/v) = A/B. Therefore

    B_{q(σ_u²)} = ½ {‖µ_{q(u)}‖² + tr(Σ_{q(u)})} + µ_{q(1/a_u)}   and   µ_{q(1/σ_u²)} = ½(K_u + 1)/B_{q(σ_u²)}.

The derivations of B_{q(σ_v²)} and µ_{q(1/σ_v²)} are similar.

Derivation of q*(a_u) and q*(a_v)

    log q*(a_u) = E_q{log p(a_u | rest)} + const
                = −2 log(a_u) − {µ_{q(1/σ_u²)} + A_u^{−2}}/a_u + const.

The expressions for B_{q(a_u)} and µ_{q(1/a_u)} follow immediately. The derivations of B_{q(a_v)} and µ_{q(1/a_v)} are similar to that just shown.
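The lognormal-mean identity used above, E{exp(−c^T ω)} = exp(−c^T µ + ½ c^T Σ c) for ω ~ N(µ, Σ), is easy to verify numerically. In the R sketch below all numerical values (dimension, mean, covariance and the vector c) are illustrative.

    ## Monte Carlo check of E{exp(-c' omega)} when omega ~ N(mu, Sigma).
    set.seed(7)
    p <- 3
    mu <- c(0.3, -0.1, 0.2)
    A  <- matrix(rnorm(p*p), p, p)
    Sigma <- 0.05*crossprod(A)                     # a small, valid covariance matrix
    cvec <- c(1, 0.5, -2)
    U <- chol(Sigma)                               # Sigma = t(U) %*% U
    omega <- matrix(rnorm(1e5*p), ncol = p) %*% U + matrix(mu, 1e5, p, byrow = TRUE)
    c(monteCarlo = mean(exp(-omega %*% cvec)),
      closedForm = exp(-sum(cvec*mu) + 0.5*sum(cvec*(Sigma %*% cvec))))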

Derivation of the µ_{q(ω)} and Σ_{q(ω)} updates

The updates for µ_{q(ω)} and Σ_{q(ω)} are based on maximisation of the current value of the marginal log-likelihood lower bound log p(y; q, µ_{q(ω)}, Σ_{q(ω)}) over these parameters using fixed point iteration. Wand (2014) shows that the updates reduce to

    Σ_{q(ω)} ← [−2 vec^{−1}{(D_{vec(Σ_{q(ω)})} S)^T}]^{−1}   and   µ_{q(ω)} ← µ_{q(ω)} + Σ_{q(ω)} (D_{µ_{q(ω)}} S)^T,

where

    S ≡ E_q{log p(y | ν, ω) + log p(ω | σ_v²)},

D denotes the derivative vector, as defined in Magnus & Neudecker (1999), and vec and vec^{−1} are as defined in Wand (2014). However, a result in the appendix of Opper & Archambeau (2009) shows that an equivalent expression for the first of these updates is

    Σ_{q(ω)} ← {−H_{µ_{q(ω)}} S}^{−1},

where H denotes the Hessian matrix, as defined in Magnus & Neudecker (1999). We work with this alternative form here. An explicit expression for S is

    S = −½ 1^T C_ω µ_{q(ω)} − ½ µ_{q(r_ν)}^T exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)}
        − ½ tr[blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I){µ_{q(ω)} µ_{q(ω)}^T + Σ_{q(ω)}}] + terms not involving µ_{q(ω)} or Σ_{q(ω)}.

This leads to

    d_{µ_{q(ω)}} S = −½ 1^T C_ω dµ_{q(ω)} + ½ µ_{q(r_ν)}^T diag[exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)}] C_ω dµ_{q(ω)} − µ_{q(ω)}^T blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I) dµ_{q(ω)}
                   = ½ [µ_{q(r_ν)} ⊙ exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)} − 1]^T C_ω dµ_{q(ω)} − µ_{q(ω)}^T blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I) dµ_{q(ω)}.

Then, by Theorem 6, Chapter 5, of Magnus & Neudecker (1999),

    D_{µ_{q(ω)}} S = ½ [C_ω^T {µ_{q(r_ν)} ⊙ exp(−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)) − 1}]^T − µ_{q(ω)}^T blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I).

Next, taking the differential a second time and using the identity diag(a) b = a ⊙ b,

    d²_{µ_{q(ω)}} S = −½ dµ_{q(ω)}^T C_ω^T diag[µ_{q(r_ν)} ⊙ exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)}] C_ω dµ_{q(ω)} − dµ_{q(ω)}^T blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I) dµ_{q(ω)}.

Therefore, by Theorem 6, Chapter 6, of Magnus & Neudecker (1999),

    H_{µ_{q(ω)}} S = −½ C_ω^T diag[µ_{q(r_ν)} ⊙ exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)}] C_ω − blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I).

Combining these expressions leads to the updates in Algorithm 1.

9 Acknowledgments

This research was partially supported by an Australian Postgraduate Award and an Australian Research Council Discovery Project. The authors are grateful to Jan Luts, an associate editor and two referees for their comments on this research.

References

Barber, D. & Bishop, C.M. (1997). Ensemble learning for multi-layer networks. In Jordan, M.I., Kearns, K.J. and Solla, S.A. (eds), Advances in Neural Information Processing Systems 10.

Challis, E. & Barber, D. (2013). Gaussian Kullback-Leibler approximate inference. Journal of Machine Learning Research, 14.

Crainiceanu, C.M., Ruppert, D., Carroll, R.J., Joshi, A. & Goodner, B. (2007). Spatially adaptive Bayesian penalized splines with heteroscedastic errors. Journal of Computational and Graphical Statistics, 16.

Cressie, N.A.C. (1993). Statistics for Spatial Data, Revised Edition. New York: Wiley.

Hinton, G.E. & van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th Annual Association for Computing Machinery Conference on Computational Learning Theory.

Knowles, D.A. & Minka, T.P. (2011). Non-conjugate variational message passing for multinomial and binary regression. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F. Pereira and K.Q. Weinberger (eds), Advances in Neural Information Processing Systems 24.

Lázaro-Gredilla, M. & Titsias, M.K. (2011). Variational heteroscedastic Gaussian process regression. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, Washington, USA.

Luts, J., Broderick, T. & Wand, M.P. (2014). Real-time semiparametric regression. Journal of Computational and Graphical Statistics, 23.

Luts, J., Wang, S.S.J., Ormerod, J.T. & Wand, M.P. (2014). Semiparametric regression analysis via Infer.NET. Unpublished manuscript.

Marley, J.K. & Wand, M.P. (2010). Non-standard semiparametric regression via BRugs. Journal of Statistical Software, 37(5).

Menictas, M. & Wand, M.P. (2013). Variational inference for marginal longitudinal semiparametric regression. Stat, 2.

Menictas, M. (2014). Variational Inference for Heteroscedastic and Longitudinal Regression Models. PhD thesis, University of Technology, Sydney (anticipated).

Minka, T. & Winn, J. (2008). Gates: a graphical notation for mixture models. Microsoft Research Technical Report Series.

Minka, T., Winn, J., Guiver, J. & Knowles, D. (2013). Infer.NET 2.5.

Nott, D.J., Tran, M.-N. & Leng, C. (2012). Variational approximation for heteroscedastic linear models and matching pursuit algorithms. Statistics and Computing, 22.

Opper, M. & Archambeau, C. (2009). The variational Gaussian approximation revisited. Neural Computation, 21.

Ormerod, J.T. & Wand, M.P. (2010). Explaining variational approximations. The American Statistician, 64.

Pebesma, E.J. (2004). Multivariate geostatistics in R: the gstat package. Computers and Geosciences, 30.

Pebesma, E.J. & Duin, R.N.M. (2005). Spatial patterns of temporal change in North Sea sediment quality on different spatial scales. In P. Renard, H. Demougeot-Renard and R. Froidevaux (eds), Geostatistics for Environmental Applications: Proceedings of the Fifth European Conference on Geostatistics for Environmental Applications. Berlin: Springer.

Raiko, T., Valpola, H., Harva, M. & Karhunen, J. (2007). Building blocks for variational Bayesian learning of latent variable models. Journal of Machine Learning Research, 8.

R Core Team (2014). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Rigby, R.A. & Stasinopoulos, D.M. (2005). Generalized additive models for location, scale and shape. Applied Statistics, 54.

Ruppert, D., Wand, M.P. & Carroll, R.J. (2003). Semiparametric Regression. New York: Cambridge University Press.

Ruppert, D., Wand, M.P. & Carroll, R.J. (2009). Semiparametric regression during 2003-2007. Electronic Journal of Statistics, 3.

Spiegelhalter, D.J., Thomas, A., Best, N.G., Gilks, W.R. & Lunn, D. (2003). BUGS: Bayesian inference using Gibbs sampling. Medical Research Council Biostatistics Unit, Cambridge, England.

Stan Development Team (2013). Stan: a C++ library for probability and sampling.

Wainwright, M.J. & Jordan, M.I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1.

Wand, M.P. & Jones, M.C. (1995). Kernel Smoothing. London: Chapman and Hall.

Wand, M.P. & Ormerod, J.T. (2008). On semiparametric regression with O'Sullivan penalised splines. Australian and New Zealand Journal of Statistics, 50.

Wand, M.P. (2009). Semiparametric regression and graphical models. Australian and New Zealand Journal of Statistics, 51.

Wand, M.P., Ormerod, J.T., Padoan, S.A. & Frühwirth, R. (2011). Mean field variational Bayes for elaborate distributions. Bayesian Analysis, 6(4).

Wand, M.P. (2014). Fully simplified Multivariate Normal updates in non-conjugate variational message passing. Journal of Machine Learning Research, 15.


More information

Bayesian Inference for the Multivariate Normal

Bayesian Inference for the Multivariate Normal Bayesian Inference for the Multivariate Normal Will Penny Wellcome Trust Centre for Neuroimaging, University College, London WC1N 3BG, UK. November 28, 2014 Abstract Bayesian inference for the multivariate

More information

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals

More information

Bayesian Sampling and Ensemble Learning in Generative Topographic Mapping

Bayesian Sampling and Ensemble Learning in Generative Topographic Mapping Bayesian Sampling and Ensemble Learning in Generative Topographic Mapping Akio Utsugi National Institute of Bioscience and Human-Technology, - Higashi Tsukuba Ibaraki 35-8566, Japan March 6, Abstract Generative

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 218 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Understanding Covariance Estimates in Expectation Propagation

Understanding Covariance Estimates in Expectation Propagation Understanding Covariance Estimates in Expectation Propagation William Stephenson Department of EECS Massachusetts Institute of Technology Cambridge, MA 019 wtstephe@csail.mit.edu Tamara Broderick Department

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 10 Alternatives to Monte Carlo Computation Since about 1990, Markov chain Monte Carlo has been the dominant

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Riemann Manifold Methods in Bayesian Statistics

Riemann Manifold Methods in Bayesian Statistics Ricardo Ehlers ehlers@icmc.usp.br Applied Maths and Stats University of São Paulo, Brazil Working Group in Statistical Learning University College Dublin September 2015 Bayesian inference is based on Bayes

More information

Scaling up Bayesian Inference

Scaling up Bayesian Inference Scaling up Bayesian Inference David Dunson Departments of Statistical Science, Mathematics & ECE, Duke University May 1, 2017 Outline Motivation & background EP-MCMC amcmc Discussion Motivation & background

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Widths. Center Fluctuations. Centers. Centers. Widths

Widths. Center Fluctuations. Centers. Centers. Widths Radial Basis Functions: a Bayesian treatment David Barber Bernhard Schottky Neural Computing Research Group Department of Applied Mathematics and Computer Science Aston University, Birmingham B4 7ET, U.K.

More information

Nearest Neighbor Gaussian Processes for Large Spatial Data

Nearest Neighbor Gaussian Processes for Large Spatial Data Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns

More information

Grid Based Variational Approximations

Grid Based Variational Approximations Grid Based Variational Approximations John T Ormerod 1 School of Mathematics and Statistics, University of Sydney, NSW 6, Australia Abstract Variational methods for approximate Bayesian inference provide

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables Revised submission to IEEE TNN Aapo Hyvärinen Dept of Computer Science and HIIT University

More information

Real-Time Semiparametric Regression via Sequential Monte Carlo

Real-Time Semiparametric Regression via Sequential Monte Carlo Real-Time Semiparametric Regression via Sequential Monte Carlo Mathew W. McLean Chris. J. Oates Matt P. Wand {mathew.mclean, christopher.oates, matt.wand}@uts.edu.au June 9, 2017 Abstract Sequential Monte

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

A general mixed model approach for spatio-temporal regression data

A general mixed model approach for spatio-temporal regression data A general mixed model approach for spatio-temporal regression data Thomas Kneib, Ludwig Fahrmeir & Stefan Lang Department of Statistics, Ludwig-Maximilians-University Munich 1. Spatio-temporal regression

More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Discussion of Maximization by Parts in Likelihood Inference

Discussion of Maximization by Parts in Likelihood Inference Discussion of Maximization by Parts in Likelihood Inference David Ruppert School of Operations Research & Industrial Engineering, 225 Rhodes Hall, Cornell University, Ithaca, NY 4853 email: dr24@cornell.edu

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Anton Rodomanov Higher School of Economics, Russia Bayesian methods research group (http://bayesgroup.ru) 14 March

More information

The Recycling Gibbs Sampler for Efficient Learning

The Recycling Gibbs Sampler for Efficient Learning The Recycling Gibbs Sampler for Efficient Learning L. Martino, V. Elvira, G. Camps-Valls Universidade de São Paulo, São Carlos (Brazil). Télécom ParisTech, Université Paris-Saclay. (France), Universidad

More information

Bayesian Modeling of Conditional Distributions

Bayesian Modeling of Conditional Distributions Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2 Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

More information

Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models

Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models Kohta Aoki 1 and Hiroshi Nagahashi 2 1 Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology

More information

Automating variational inference for statistics and data mining

Automating variational inference for statistics and data mining Automating variational inference for statistics and data mining Tom Minka Machine Learning and Perception Group Microsoft Research Cambridge A common situation You have a dataset Some models in mind Want

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information