Variational Inference for Heteroscedastic Semiparametric Regression

From the SelectedWorks of Matt Wand, February 7, 2014. Variational Inference for Heteroscedastic Semiparametric Regression. Marianne Menictas, University of Technology, Sydney; Matt Wand, University of Technology, Sydney.

Variational inference for heteroscedastic semiparametric regression

BY MARIANNE MENICTAS AND MATT P. WAND

School of Mathematical Sciences, University of Technology Sydney, Broadway 2007, Australia

1st August, 2014

Summary

We develop fast mean field variational methodology for Bayesian heteroscedastic semiparametric regression, in which both the mean and variance are smooth, but otherwise arbitrary, functions of the predictors. Our resulting algorithms are purely algebraic, devoid of numerical integration and Monte Carlo sampling. The locality property of mean field variational Bayes implies that the methodology also applies to larger models possessing variance function components. Simulation studies indicate good to excellent accuracy, and considerable time savings compared with Markov chain Monte Carlo. We also provide some illustrations from applications.

Keywords: Approximate Bayesian inference; mean field variational Bayes; non-conjugate variational message passing; variance function estimation.

1 Introduction

Data that are big in terms of volume and/or velocity are becoming increasingly ubiquitous and there is a strong imperative for the development of fast, flexible and extendable methodology for processing such data. Semiparametric regression (e.g. Ruppert, Wand & Carroll, 2003, 2009) is an important class of flexible statistical models and methods, but is mainly geared towards moderate sample sizes and batch processing. Recently, Luts, Broderick & Wand (2014) developed new nonparametric regression methodology specifically tailored to high volume/velocity situations. The present article extends their general approach to accommodate possible heteroscedasticity. Whilst we focus on univariate and bivariate nonparametric regression and additive models, the modularity of our approach allows easy extension to more complex models.

The accommodation of heteroscedasticity is aided by the variational inference technique known as non-conjugate variational message passing (e.g. Knowles & Minka, 2011). The particular form of non-conjugate variational message passing that we use involves approximating particular posterior density functions by Multivariate Normal density functions, which is an established notion within the variational approximation literature (e.g. Hinton & van Camp, 1993; Barber & Bishop, 1997; Raiko et al., 2007; Challis & Barber, 2013).

In nonparametric regression it is common to ignore heteroscedasticity in the data and instead invoke the constant variance assumption. Figure 1 contains nonparametric regression examples, based on the R function smooth.spline (R Core Team, 2014) with default settings. Both data sets appear in Ruppert et al. (2003), where they are first described in Sections 5.3 and 2.7 respectively. The scatterplots and the standardized residual plots show that there is significant heteroscedasticity, the ignorance of which will lead to erroneous inferential statements, such as prediction intervals.

Figure 1: Two example nonparametric regression fits and corresponding standardized residual plots. Horizontal dashed lines at ±2 and ±3 aid assessment of standard normality of the standardized residuals.

Heteroscedastic nonparametric regression overcomes the problems apparent in Figure 1. For a set of regression data (x_i, y_i), 1 ≤ i ≤ n, it involves replacement of the homoscedastic nonparametric model E(y_i) = f(x_i), Var(y_i) = σ²,

by

    E(y_i) = f(x_i),   Var(y_i) = g(x_i),   (1)

with the variance function g to be estimated simultaneously with the mean function f. Several approaches to fitting (1) now exist (e.g. Ruppert et al., 2003; Rigby & Stasinopoulos, 2005; Crainiceanu et al., 2007). A particularly attractive approach, from an extendability standpoint, involves graphical model representations of mixed model-based penalized splines (Wand, 2009). This allows one to take advantage of the growing body of methodology and software for approximate inference in general graphical models. Graphical model-based Bayesian inference engines such as BUGS (Spiegelhalter et al., 2003), Infer.NET (Minka et al., 2013) and Stan (Stan Development Team, 2013) now accommodate a wide range of nonparametric and semiparametric regression models (e.g. Marley & Wand, 2010; Luts, Wang, Ormerod & Wand, 2014). However, the form of the approximate inference differs markedly among the various inference engines. BUGS and Stan each use relatively slow but highly accurate Markov chain Monte Carlo (MCMC) methodology, whereas Infer.NET uses faster, but less accurate, deterministic approximations such as mean field variational Bayes (MFVB). The latter type of methodology is our focus here. A relatively new modification of MFVB, non-conjugate variational message passing, is shown to be particularly useful for handling heteroscedasticity since it results in algorithms that involve only closed form algebraic expressions.

The modularity of the graphical model and MFVB approaches implies that methodology for handling heteroscedasticity in semiparametric regression applies to arbitrarily large models. We call this the locality property of MFVB; it is described in Section 3 of Wand et al. (2011).

If a very large graphical model includes a component where one variable is modelled as a heteroscedastic nonparametric regression function of another variable, then variational inference for the parameters in that part of the graph can be achieved using the methodology developed here.

Variational methodology for heteroscedastic regression models has also been developed by Lázaro-Gredilla & Titsias (2011) and Nott, Tran & Leng (2012), although the latter article was confined to linear mean and log-variance functions. Lázaro-Gredilla & Titsias (2011) achieved simultaneous nonparametric mean and variance function estimation via an elegant Gaussian process approach. The ensuing nonparametric function estimators are full-rank and their variance function estimation strategy relies on Gauss-Hermite quadrature. Our approach differs by using low-rank penalized splines and a non-conjugate MFVB strategy that provides closed form updates, making it more amenable to high volume/velocity data.

Section 2 gives a description of the Bayesian penalized spline model for simultaneous mean and variance function estimation based on univariate data. The variational inference methodology is provided in Section 3 and our main algorithm is presented there. Section 4 provides numerical illustrations which give evidence of the speed achieved by non-conjugate MFVB. In Sections 5 to 7 we describe extensions to bivariate nonparametric regression, additive models and real-time semiparametric regression. Some concluding remarks are made in Section 8. The appendix contains some algebraic details used in the derivation of Algorithm 1.

2 Model Description

The Gaussian heteroscedastic nonparametric regression model has generic form

    y_i ind.~ N(f(x_i), g(x_i)),   1 ≤ i ≤ n,   (2)

where (x_i, y_i) is the ith predictor/response pair of a regression data-set and "ind.~" means "distributed independently as". The functions f and g are smooth, but otherwise arbitrary, functions. We refer to f as the mean function and g as the variance function. We also use the term standard deviation function for √g. Mixed model-based penalized spline fitting of (2) (e.g. Ruppert et al., 2003) involves modelling f and g according to

    f(x) = β_0 + β_1 x + Σ_{k=1}^{K_u} u_k z_k^u(x),   u_k ind.~ N(0, σ_u²),
    g(x) = exp{γ_0 + γ_1 x + Σ_{k=1}^{K_v} v_k z_k^v(x)},   v_k ind.~ N(0, σ_v²).   (3)

The {z_k^u : 1 ≤ k ≤ K_u} and {z_k^v : 1 ≤ k ≤ K_v} are spline bases of sizes K_u and K_v, respectively. Our default for the z_k^u and z_k^v is suitably transformed cubic O'Sullivan splines, as described in Section 4 of Wand & Ormerod (2008). If K_u = K_v then the two bases are identical, but we leave open the possibility of different basis sizes for the mean and variance functions.

In this article, we take a Bayesian approach to fitting (2) and (3), and impose the following priors on the model parameters:

    β_0, β_1 ind.~ N(0, σ_β²),   γ_0, γ_1 ind.~ N(0, σ_γ²),   σ_u ~ Half-Cauchy(A_u)   and   σ_v ~ Half-Cauchy(A_v),   (4)

with hyperparameters σ_β, σ_γ, A_u, A_v > 0 to be specified by the user. The Half-Cauchy(A) density function is given by p(x) = {2/(π A)}/{1 + (x/A)²}, x > 0.
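The spline model (3) translates directly into design matrices holding the basis function evaluations, which are then combined in the hierarchical model below. The following R sketch illustrates the structure under stated assumptions: a simple truncated-line basis stands in for the default O'Sullivan splines, the sample size, knot placement, the true functions used to simulate y, and all object names (x, y, X, Zu, Zv, Cnu, Comega) are illustrative choices rather than part of the paper's specification.

    ## Illustrative set-up of the design matrices used throughout the paper,
    ## with a truncated-line spline basis as a stand-in for O'Sullivan splines.
    set.seed(1)
    n  <- 200
    x  <- sort(runif(n))
    Ku <- 15 ; Kv <- 15
    knotsU <- quantile(unique(x), seq(0, 1, length = Ku + 2)[-c(1, Ku + 2)])
    knotsV <- knotsU                                  # identical bases when Ku = Kv

    X  <- cbind(1, x)                                 # n x 2 fixed-effects design
    Zu <- outer(x, knotsU, function(x, k) pmax(x - k, 0))   # n x Ku spline design
    Zv <- outer(x, knotsV, function(x, k) pmax(x - k, 0))   # n x Kv spline design

    Cnu    <- cbind(X, Zu)    # so that X beta + Zu u  = Cnu nu
    Comega <- cbind(X, Zv)    # so that X gamma + Zv v = Comega omega

    ## Simulated heteroscedastic response, with illustrative mean and variance functions:
    fTrue <- function(x) sin(3*pi*x)
    gTrue <- function(x) exp(cos(4*pi*x) - 1)
    y <- fTrue(x) + sqrt(gTrue(x))*rnorm(n)

The same layout applies regardless of the basis choice; only the columns of Zu and Zv change.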

Assuming that the data have been pre-transformed to have zero mean and unit variance, we use the same large default value for each of σ_β, σ_γ, A_u and A_v throughout this article.

Result 5 of Wand et al. (2011) allows us to replace σ_u ~ Half-Cauchy(A_u) by

    σ_u² | a_u ~ Inverse-Gamma(1/2, 1/a_u),   a_u ~ Inverse-Gamma(1/2, 1/A_u²),   (5)

where x ~ Inverse-Gamma(A, B) means that x has an Inverse Gamma distribution with shape parameter A > 0 and rate parameter B > 0. The corresponding density function is p(x) = B^A Γ(A)^{-1} x^{-A-1} exp(-B/x), x > 0. Representation (5) is more amenable to variational approximate inference. A similar replacement is made for σ_v.

The full Bayesian hierarchical model corresponding to (2)-(4) is

    y | β, γ, u, v ~ N(Xβ + Z_u u, diag{exp(Xγ + Z_v v)}),
    u | σ_u² ~ N(0, σ_u² I),   v | σ_v² ~ N(0, σ_v² I),   β ~ N(0, σ_β² I),   γ ~ N(0, σ_γ² I),
    σ_u² | a_u ~ Inverse-Gamma(1/2, 1/a_u),   a_u ~ Inverse-Gamma(1/2, 1/A_u²),   (6)
    σ_v² | a_v ~ Inverse-Gamma(1/2, 1/a_v),   a_v ~ Inverse-Gamma(1/2, 1/A_v²).

Here β and γ are 2 × 1 vectors of fixed effects containing (β_0, β_1) and (γ_0, γ_1) respectively, u is the K_u × 1 vector containing u_1, ..., u_{K_u}, v is a K_v × 1 vector defined similarly, and σ_u² and σ_v² are variance components corresponding to u and v respectively. The design matrices X, Z_u and Z_v are defined as

    X ≡ [1  x_i]_{1≤i≤n},   Z_u ≡ [z_k^u(x_i)]_{1≤i≤n, 1≤k≤K_u}   and   Z_v ≡ [z_k^v(x_i)]_{1≤i≤n, 1≤k≤K_v}.

It is also convenient to combine the mean function coefficients and variance function coefficients into single vectors:

    ν ≡ [β^T  u^T]^T   and   ω ≡ [γ^T  v^T]^T.

An analogous combining of the design matrices,

    C_ν ≡ [X  Z_u]   and   C_ω ≡ [X  Z_v],

then allows us to write Xβ + Z_u u = C_ν ν and Xγ + Z_v v = C_ω ω. Figure 2 shows the directed acyclic graph corresponding to the model conveyed in (6).

MCMC schemes for fitting and inference in (6) are relatively straightforward to devise and implement. The BUGS and Stan MCMC-based inference engines also support (6) and an illustration of such is given in Section 4. However, MCMC does not scale well to high volume/velocity data and larger models with heteroscedastic nonparametric regression components.
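The equivalence underlying (5) is easy to confirm by simulation. The short R sketch below draws (a_u, σ_u²) from the auxiliary-variable representation and compares the implied distribution of σ_u with direct Half-Cauchy(A_u) draws; the value of A_u and the simulation size are illustrative, and the quantile comparison is simply one convenient check.

    ## Monte Carlo check of representation (5): if a ~ Inverse-Gamma(1/2, 1/A^2) and
    ## sigma^2 | a ~ Inverse-Gamma(1/2, 1/a), then sigma ~ Half-Cauchy(A).
    ## Inverse-Gamma(shape, rate) draws are obtained as reciprocals of Gamma draws.
    set.seed(2)
    A <- 2.5 ; nSim <- 1e5
    a     <- 1/rgamma(nSim, shape = 0.5, rate = 1/A^2)
    sigsq <- 1/rgamma(nSim, shape = 0.5, rate = 1/a)
    sigma <- sqrt(sigsq)

    sigmaHC <- abs(A*rcauchy(nSim))      # direct Half-Cauchy(A) draws

    ## The two sets of quantiles should agree up to Monte Carlo error:
    round(rbind(auxiliary  = quantile(sigma,   c(0.25, 0.5, 0.75)),
                halfCauchy = quantile(sigmaHC, c(0.25, 0.5, 0.75))), 3)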

Figure 2: Directed acyclic graph for the model in (6). The shaded node corresponds to the observed data vector. Random effects and auxiliary variables are referred to as hidden nodes. The grey dashed boxes indicate that β and u are combined into a vector denoted by ν and that ω is the concatenation of γ and v.

3 Variational Inference Methodology

We now consider the problem of mean field-type variational inference for (6) (e.g. Wainwright & Jordan, 2008; Ormerod & Wand, 2010). This involves an approximation to the joint posterior density function of the form

    p(ν, ω, σ², a | y) ≈ q(ν) q(ω) q(σ²) q(a),   (7)

where σ² = (σ_u², σ_v²) and a = (a_u, a_v). The q-densities are chosen to minimize the Kullback-Leibler divergence between the left-hand side and right-hand side of (7). This is equivalent to maximizing the variational lower bound on the marginal log-likelihood log p(y):

    log p(y; q) ≡ E_q[log p(ν, ω, σ², a, y) − log{q(ν) q(ω) q(σ²) q(a)}].

The optimal q-densities, denoted by q*, can be shown to satisfy (e.g. Ormerod & Wand, 2010)

    q*(ν) ∝ exp{E_{q(−ν)} log p(ν | rest)},   q*(ω) ∝ exp{E_{q(−ω)} log p(ω | rest)},
    q*(σ²) ∝ exp{E_{q(−σ²)} log p(σ² | rest)}   and   q*(a) ∝ exp{E_{q(−a)} log p(a | rest)},   (8)

where, for example, E_{q(−ν)} denotes expectation with respect to the q-densities of all parameters except ν. Also, "rest" denotes all of the random variables in the model other than those in ν, including y. The full conditionals p(ν | rest), p(σ² | rest) and p(a | rest) each have standard forms which result in closed form q-density expressions. However, p(ω | rest) has a non-standard form and challenging integrals arise in the determination of q*(ω). We achieve a tractable solution by imposing the additional restriction that

    q(ω) = q(ω; µ_{q(ω)}, Σ_{q(ω)}) is a N(µ_{q(ω)}, Σ_{q(ω)}) density function   (9)

for some mean vector µ_{q(ω)} and covariance matrix Σ_{q(ω)}. The corresponding marginal log-likelihood lower bound is

    log p(y; q, µ_{q(ω)}, Σ_{q(ω)}) ≡ E_q[log p(ν, ω, σ², a, y) − log{q(ν) q(ω; µ_{q(ω)}, Σ_{q(ω)}) q(σ²) q(a)}]   (10)

and minimization of the Kullback-Leibler divergence corresponds to µ_{q(ω)} and Σ_{q(ω)} being chosen to maximise (10). This approach, due to Knowles & Minka (2011), has been labelled non-conjugate variational message passing since it provides a way of circumventing non-conjugacies in MFVB. Note that variational message passing is an alternative formulation of MFVB; see e.g. Minka & Winn (2008). Knowles & Minka (2011) propose a fixed point iteration scheme as a means of maximizing (10). Wand (2014) provides the algebraic details and simplification of the fixed point updates for Multivariate Normal q-density parameters such as those in (9).

For fixed (µ_{q(ω)}, Σ_{q(ω)}), application of (8), with q*(ω) omitted, leads to

    q*(ν) is the N(µ_{q(ν)}, Σ_{q(ν)}) density function,
    q*(σ²) = q*(σ_u², σ_v²) is a product of Inverse-Gamma(½(K_u + 1), B_{q(σ_u²)}) and Inverse-Gamma(½(K_v + 1), B_{q(σ_v²)}) density functions, and
    q*(a) = q*(a_u, a_v) is a product of Inverse-Gamma(1, B_{q(a_u)}) and Inverse-Gamma(1, B_{q(a_v)}) density functions   (11)

for parameters µ_{q(ν)} and Σ_{q(ν)}, the mean and covariance matrix of q*(ν); B_{q(σ_u²)}, the rate parameter of q*(σ_u²); B_{q(σ_v²)}, the rate parameter of q*(σ_v²); B_{q(a_u)}, the rate parameter of q*(a_u); and B_{q(a_v)}, the rate parameter of q*(a_v). These optimal parameters are all interrelated and are obtained through an iterative algorithm, listed below as Algorithm 1. The algorithm also includes fixed point iterative updates for µ_{q(ω)} and Σ_{q(ω)}. Details on the derivation of (11), as well as the fixed point updates, are given in an appendix.

Before presenting Algorithm 1 some additional matrix algebraic notation is required. For two matrices A and B of equal size, A ⊙ B denotes their element-wise product. If a is a d × 1 vector then diag(a) is the d × d diagonal matrix with the entries of a along the diagonal. For a d × d matrix A, diagonal(A) is the d × 1 vector comprising the diagonal entries of A. We write blockdiag(A, B) for the block-diagonal matrix with blocks A and B. Also, µ_{q(u)} is defined to be the sub-vector of µ_{q(ν)} corresponding to u, and Σ_{q(u)} is the sub-matrix of Σ_{q(ν)} corresponding to u. The symbols µ_{q(v)} and Σ_{q(v)} are defined similarly.

Initialize: µ_{q(ω)} a (K_v + 2) × 1 vector, Σ_{q(ω)} a (K_v + 2) × (K_v + 2) positive definite matrix, µ_{q(1/σ_u²)}, µ_{q(1/σ_v²)} > 0, and µ_{q(r_ν)} an n × 1 vector.

Cycle:

    ψ_{q(ω)} ← exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)}

    Σ_{q(ν)} ← [C_ν^T diag(ψ_{q(ω)}) C_ν + blockdiag(σ_β^{−2} I, µ_{q(1/σ_u²)} I)]^{−1}

    µ_{q(ν)} ← Σ_{q(ν)} C_ν^T diag(ψ_{q(ω)}) y

    Σ_{q(ω)} ← [½ C_ω^T diag(µ_{q(r_ν)} ⊙ ψ_{q(ω)}) C_ω + blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I)]^{−1}

    µ_{q(ω)} ← µ_{q(ω)} + Σ_{q(ω)} [½ C_ω^T {µ_{q(r_ν)} ⊙ ψ_{q(ω)} − 1} − blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I) µ_{q(ω)}]

    µ_{q(1/a_u)} ← 1/{µ_{q(1/σ_u²)} + A_u^{−2}} ;   µ_{q(1/a_v)} ← 1/{µ_{q(1/σ_v²)} + A_v^{−2}}

    µ_{q(1/σ_u²)} ← (K_u + 1)/{2 µ_{q(1/a_u)} + ‖µ_{q(u)}‖² + tr(Σ_{q(u)})}

    µ_{q(1/σ_v²)} ← (K_v + 1)/{2 µ_{q(1/a_v)} + ‖µ_{q(v)}‖² + tr(Σ_{q(v)})}

    µ_{q(r_ν)} ← diagonal{(y − C_ν µ_{q(ν)})(y − C_ν µ_{q(ν)})^T + C_ν Σ_{q(ν)} C_ν^T}

until the absolute relative change in log p(y; q, µ_{q(ω)}, Σ_{q(ω)}) is negligible.

Algorithm 1: MFVB algorithm for the determination of the optimal parameters in q*(ω), q*(ν), q*(σ_u²), q*(σ_v²), q*(a_u) and q*(a_v).
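To make the flow of Algorithm 1 concrete, the R sketch below gives one possible rendering of its cycle, reusing the objects y, Cnu and Comega from the earlier design-matrix sketch. The state list, the blockDiag() and mfvbCycle() helper names, the hyperparameter values and the simple parameter-based convergence check are all illustrative assumptions; the paper's own stopping rule monitors the lower bound log p(y; q, µ_{q(ω)}, Σ_{q(ω)}), which is not computed here.

    ## Minimal sketch of the Algorithm 1 cycle (batch fitting).
    blockDiag <- function(A, B) {
      out <- matrix(0, nrow(A) + nrow(B), ncol(A) + ncol(B))
      out[seq_len(nrow(A)), seq_len(ncol(A))] <- A
      out[nrow(A) + seq_len(nrow(B)), ncol(A) + seq_len(ncol(B))] <- B
      out
    }

    mfvbCycle <- function(s) {
      ## s holds the data, the current q-density parameters and the hyperparameters.
      with(s, {
        Ku <- ncol(Cnu) - 2 ; Kv <- ncol(Comega) - 2
        ## E_q{exp(-C_omega omega)}, evaluated row-wise:
        psi <- exp(-as.vector(Comega %*% mu.q.omega) +
                   0.5*rowSums((Comega %*% Sigma.q.omega)*Comega))
        ## q*(nu) updates:
        Dnu <- blockDiag(diag(2)/sigsq.beta, diag(Ku)*mu.q.recip.sigsq.u)
        Sigma.q.nu <- solve(crossprod(Cnu, psi*Cnu) + Dnu)
        mu.q.nu <- as.vector(Sigma.q.nu %*% crossprod(Cnu, psi*y))
        ## non-conjugate fixed-point updates for q(omega):
        Domega <- blockDiag(diag(2)/sigsq.gamma, diag(Kv)*mu.q.recip.sigsq.v)
        Sigma.q.omega <- solve(0.5*crossprod(Comega, (mu.q.rnu*psi)*Comega) + Domega)
        mu.q.omega <- as.vector(mu.q.omega + Sigma.q.omega %*%
            (0.5*crossprod(Comega, mu.q.rnu*psi - 1) - Domega %*% mu.q.omega))
        ## auxiliary variable and variance component updates:
        mu.q.recip.au <- 1/(mu.q.recip.sigsq.u + Au^(-2))
        mu.q.recip.av <- 1/(mu.q.recip.sigsq.v + Av^(-2))
        ranInds <- -(1:2)
        mu.q.recip.sigsq.u <- (Ku + 1)/(2*mu.q.recip.au +
            sum(mu.q.nu[ranInds]^2) + sum(diag(Sigma.q.nu)[ranInds]))
        mu.q.recip.sigsq.v <- (Kv + 1)/(2*mu.q.recip.av +
            sum(mu.q.omega[ranInds]^2) + sum(diag(Sigma.q.omega)[ranInds]))
        ## E_q{(y - C_nu nu)^2}, evaluated entry-wise:
        resid <- y - as.vector(Cnu %*% mu.q.nu)
        mu.q.rnu <- resid^2 + rowSums((Cnu %*% Sigma.q.nu)*Cnu)
        modifyList(s, list(Sigma.q.nu = Sigma.q.nu, mu.q.nu = mu.q.nu,
          Sigma.q.omega = Sigma.q.omega, mu.q.omega = mu.q.omega,
          mu.q.recip.sigsq.u = mu.q.recip.sigsq.u,
          mu.q.recip.sigsq.v = mu.q.recip.sigsq.v, mu.q.rnu = mu.q.rnu))
      })
    }

    ## Batch fit: initialise the state and iterate the cycle.
    state <- list(y = y, Cnu = Cnu, Comega = Comega,
                  mu.q.omega = rep(0, ncol(Comega)),
                  Sigma.q.omega = diag(ncol(Comega)),
                  mu.q.recip.sigsq.u = 1, mu.q.recip.sigsq.v = 1,
                  mu.q.rnu = rep(1, length(y)),
                  sigsq.beta = 1e10, sigsq.gamma = 1e10, Au = 25, Av = 25)
    for (iter in 1:200) {
      old <- state$mu.q.omega
      state <- mfvbCycle(state)
      if (max(abs(state$mu.q.omega - old)) < 1e-6) break
    }
    fHat  <- as.vector(Cnu %*% state$mu.q.nu)                  # plug-in mean function fit
    sdHat <- exp(0.5*as.vector(Comega %*% state$mu.q.omega))   # plug-in st. dev. function fit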

Convergence of Algorithm 1 can be monitored using the following explicit expression for the marginal log-likelihood lower bound:

    log p(y; q, µ_{q(ω)}, Σ_{q(ω)}) = ½(K_u + K_v + 4) − (n/2) log(2π) + log Γ(½(K_u + 1)) + log Γ(½(K_v + 1)) − log π − log(A_u) − log(A_v)
        − ½ 1^T C_ω µ_{q(ω)} − ½ (y − C_ν µ_{q(ν)})^T diag(ψ_{q(ω)}) (y − C_ν µ_{q(ν)}) − ½ tr{C_ν^T diag(ψ_{q(ω)}) C_ν Σ_{q(ν)}}
        − 2 log(σ_β) − 2 log(σ_γ) + ½ log|Σ_{q(ν)}| + ½ log|Σ_{q(ω)}|
        − (1/(2σ_β²)){‖µ_{q(β)}‖² + tr(Σ_{q(β)})} − (1/(2σ_γ²)){‖µ_{q(γ)}‖² + tr(Σ_{q(γ)})}
        − ½(K_u + 1) log(B_{q(σ_u²)}) − ½(K_v + 1) log(B_{q(σ_v²)})
        − log{µ_{q(1/σ_u²)} + A_u^{−2}} − log{µ_{q(1/σ_v²)} + A_v^{−2}} + µ_{q(1/σ_u²)} µ_{q(1/a_u)} + µ_{q(1/σ_v²)} µ_{q(1/a_v)}.

Unlike ordinary MFVB, there is no guarantee that each iteration leads to an increase in p(y; q, µ_{q(ω)}, Σ_{q(ω)}). Thus the iterations should be stopped when the absolute relative change in its logarithm falls below a negligible amount.

In Figure 3 we return to the examples of Figure 1, armed with our new MFVB methodology. The top panels show the fitted mean functions, according to (6), with pointwise 95% credible sets. The model was fitted using MFVB corresponding to Algorithm 1 and MCMC based on BUGS (Spiegelhalter et al., 2003) with a burn-in of 5000, a kept sample of 5000 and a thinning factor of 5. The much faster MFVB fit is seen to be in excellent agreement with its more computationally costly benchmark. The middle panels show the fitted standard deviation functions, corresponding to √g in the notation of (2), with corresponding 95% credible sets. The agreement between MFVB and MCMC is good, rather than excellent, for √g. The √g-standardized residual plots at the bottom of Figure 3 indicate proper accounting for the heteroscedasticity and agreement with normality.

We also conducted a comprehensive check in order to confirm that the relative change in the approximate marginal log-likelihood does not lead to early stopping. Over several hundred runs, with data generated from different scenarios, we recorded Bayes estimates of f and g at the stopping point and again with iterations continuing 5% beyond the stopping point. The differences in the estimates were negligible.

4 Performance Assessment

We conducted a comprehensive simulation study to assess the performance of Algorithm 1 in terms of inferential accuracy and computing time. Data were simulated according to model (2) with n = 500 and the x_i's uniform on (0, 1). The four mean and variance function pairs are listed in Table 1, where φ(·; µ, σ) and Φ(·; µ, σ) respectively denote the density and distribution functions of the Normal distribution with mean µ and standard deviation σ. We generated 100 data-sets for each of the settings shown in Table 1.

For each replication, inference was obtained using MFVB via Algorithm 1 and, for comparison, MCMC. R was also used for the implementation of Algorithm 1. The MFVB iterations were terminated when the relative change in log p(y; q) fell below a small tolerance. The MCMC was performed using BUGS with sample sizes the same as for the examples in Section 3. The following sections provide details on the accuracy of MFVB against the MCMC benchmark.
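As a small illustration of the stopping rule, the R snippet below computes successive absolute relative changes in the lower bound; the logged values and the tolerance are made-up numbers, since the study's actual tolerance is not reproduced here.

    ## Illustrative stopping check: iterations end once the absolute relative change
    ## in log p(y; q) drops below a chosen tolerance.  All values below are made up.
    logML <- c(-1432.71, -1398.20, -1397.952, -1397.949)   # bound at successive iterations
    relChange <- abs(diff(logML)/logML[-length(logML)])
    tol <- 1e-5
    which(relChange < tol)[1]    # first iteration at which the criterion is met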

Figure 3: Top panels: fitted mean functions for two example data sets from Wand & Jones (1995) and Ruppert et al. (2003). The solid curves are approximate pointwise posterior means whilst the dashed curves are corresponding pointwise 95% credible sets. The approximate fits are based on MFVB via Algorithm 1 and MCMC via BUGS. Middle panels: similar to the top panels but for the standard deviation function. Bottom panels: standardized residual plots based on {y_i − f̂(x_i)}/√{ĝ(x_i)}, where f̂ and ĝ are the MFVB-approximate Bayes estimates of f and g.

4.1 Assessment of Accuracy

Algorithm 1 yields fast approximate inference for the model parameters; however, it does not guarantee that an adequate level of accuracy will be achieved. Figure 4 provides an accuracy assessment of Algorithm 1 using side-by-side boxplots of the accuracy scores for the parameters of interest.

Table 1: Details of simulation study settings. In setting A, f(x) = sin(3πx) and log g(x) = cos(4πx). Settings B, C and D use mean and log-variance functions built from linear terms and the Normal density and distribution functions φ(·; µ, σ) and Φ(·; µ, σ) evaluated at various locations and scales (for example, φ(x; 0.38, 0.08), φ(x; 0.75, 0.03) and φ(x; 0.7, 0.14) appear among the components).

For a generic parameter θ, the accuracy of q*(θ) is defined to be

    accuracy(q*) = 100 {1 − ½ ∫ |q*(θ) − p(θ | y)| dθ} %.   (12)

Note that p(θ | y) can be approximated by applying kernel density estimation to the MCMC sample. This accuracy score is justified in Faes et al. (2011). Accuracy is monitored for the parameters f(H_k) and g(H_k), 1 ≤ k ≤ 5, where the H_k are the sample hexiles of the x_i's. The boxplots illustrate that most of the accuracies for the f(H_k) lie around 90%, while accuracies for the g(H_k) lie around 80%. These very good accuracy results are in keeping with the heuristics given in Section 3.1 of Menictas & Wand (2013).

Figure 4: Summary of the simulation study, with the accuracy values for f(H_1), ..., f(H_5) and g(H_1), ..., g(H_5) summarized as boxplots for each of settings A-D.
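The accuracy score (12) is straightforward to compute once an MCMC sample and the corresponding q-density are available. The R sketch below does this for a generic scalar parameter; the simulated MCMC draws, the Normal form of the q-density and the function name accuracyScore() are stand-ins for illustration only.

    ## Sketch of the accuracy score (12): 100*(1 - 0.5 * L1 distance between the
    ## q-density and a kernel estimate of the MCMC-based posterior density).
    accuracyScore <- function(mcmcSample, qDensity, gridSize = 1001) {
      dens <- density(mcmcSample, n = gridSize)            # kernel estimate of p(theta | y)
      L1   <- sum(abs(qDensity(dens$x) - dens$y))*diff(dens$x[1:2])
      100*(1 - 0.5*L1)
    }

    ## Example usage with simulated stand-ins:
    set.seed(3)
    mcmcSample <- rnorm(5000, mean = 1.30, sd = 0.20)
    qDensity   <- function(x) dnorm(x, mean = 1.28, sd = 0.19)
    accuracyScore(mcmcSample, qDensity)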

Figure 5 shows a comparison of the MCMC and MFVB fitted mean and standard deviation functions for the first replication in each of the four simulation settings. The MCMC and MFVB fits show excellent agreement for the mean functions and relatively good agreement for the standard deviation functions.

Figure 5: Comparison of MCMC and MFVB fitted mean and standard deviation functions, together with the true functions, for the first data-set in each of the four simulation settings.

4.2 Assessment of Coverage

We are also interested in the coverage achieved by the MFVB approximate credible intervals. Table 2 gives the percentages of true parameter coverage based on the approximate 95% credible intervals attained from the MFVB posterior densities. The coverage is good overall and does not fall below 86%. As we have already seen in the previous section, the performance of MFVB is excellent for

the mean function and very good for the variance function.

Table 2 (rows: settings A-D; columns: f(H_1), ..., f(H_5) and g(H_1), ..., g(H_5)): Percentage coverage of the true parameter values by approximate 95% credible intervals based on variational Bayes approximate posterior density functions. The percentages are based on 100 replications.

4.3 Assessment of Speed

During the running of the simulation we monitored the time taken for each model to be fitted via MCMC and MFVB. The results are summarized in Table 3. The simulation was run on the first author's desktop computer (Intel Core processor, 8 gigabytes of random access memory). As mentioned previously, convergence was assessed differently for the two approaches. Also, the speed gains of MFVB are traded off against accuracy losses induced by the product restriction given in (7). However, despite this concern, the results show that MFVB is at least approximately 200 times faster than MCMC across all models. Thus we can assert that a model that takes minutes to run using MCMC will take only seconds to run using our MFVB algorithm.

Table 3 (rows: settings A-D; columns: MCMC, MFVB): 99% Wilcoxon confidence intervals based on run times, in seconds, for MCMC and MFVB fitting.
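Intervals of the kind reported in Table 3 can be obtained in R from recorded run times via the one-sample Wilcoxon signed rank procedure. In the sketch below the run times are simulated stand-ins, not the study's measurements, and the rough orders of magnitude are assumptions chosen purely for illustration.

    ## Illustrative 99% Wilcoxon confidence intervals for MCMC and MFVB run times.
    set.seed(4)
    mcmcTimes <- rnorm(100, mean = 910, sd = 12)    # seconds, hypothetical
    mfvbTimes <- rnorm(100, mean = 4.2, sd = 0.4)   # seconds, hypothetical
    wilcox.test(mcmcTimes, conf.int = TRUE, conf.level = 0.99)$conf.int
    wilcox.test(mfvbTimes, conf.int = TRUE, conf.level = 0.99)$conf.int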

5 Extension to Bivariate Predictors

Algorithm 1 is relatively easy to extend to bivariate predictor nonparametric regression, which is closely related to geostatistics (e.g. Cressie, 1993); we briefly describe this extension here. In this case, the data are of the form

    (x_i, y_i),   x_i ∈ R²,   y_i ∈ R.   (13)

In the classical geostatistics scenario, the x_i's correspond to geographical locations. However, in (13) the x_i's could also represent pairs of non-geographical measurements. The bivariate analogue of (2) is

    y_i ind.~ N(f(x_i), g(x_i)),   1 ≤ i ≤ n,   (14)

where f and g are real-valued functions on R². The extension of (3) is

    f(x) = β_0 + β_1^T x + Σ_{k=1}^{K_u} u_k z_k^u(x),   u_k ind.~ N(0, σ_u²),
    g(x) = exp{γ_0 + γ_1^T x + Σ_{k=1}^{K_v} v_k z_k^v(x)},   v_k ind.~ N(0, σ_v²),   (15)

where each of β_1 and γ_1 are 2 × 1 vectors. The functions {z_k^u : 1 ≤ k ≤ K_u} and {z_k^v : 1 ≤ k ≤ K_v} are now bivariate spline basis functions. A reasonable default for the z_k^u (Ruppert et al., 2003) is the low-rank thin plate spline basis with

    [z_1^u(x), ..., z_{K_u}^u(x)] = [r(x − κ_1^u), ..., r(x − κ_{K_u}^u)] ([r(κ_k^u − κ_{k'}^u)]_{1≤k,k'≤K_u})^{−1/2},   (16)

where κ_1^u, ..., κ_{K_u}^u is a set of bivariate knot locations that efficiently cover the space of the x_i's and r(x) ≡ ‖x‖² log‖x‖. The default z_k^v's have an analogous definition. According to this set-up, the only difference between the univariate heteroscedastic nonparametric regression model, treated in Section 3, and its bivariate counterpart is the basis functions and their coefficients. Hence, Algorithm 1 can be used to fit the bivariate heteroscedastic nonparametric regression model by replacing ν, ω, C_ν and C_ω from Section 3 with

    ν ≡ [β_0, β_1^T, u^T]^T,   ω ≡ [γ_0, γ_1^T, v^T]^T,   C_ν ≡ [1  x_i^T  z_k^u(x_i)]_{1≤i≤n, 1≤k≤K_u}   and   C_ω ≡ [1  x_i^T  z_k^v(x_i)]_{1≤i≤n, 1≤k≤K_v}.

We fitted (14)-(15) to geo-referenced data on sea-floor sediment pollution in the North Sea (source: Pebesma & Duin, 2005). The data are stored in the pcb data frame within the R package gstat (Pebesma, 2004). The response variable is a measurement of polychlorinated biphenyl with Ballschmiter-Zell congener number 138 (PCB-138). The motivating study is concerned with spatial and temporal variability of PCB-138. For the purposes of illustration, we ignore the temporal aspect and focus on geographical variability in the mean and variance of the response. In the notation of model (14)-(15), the variables are x = (x_1, x_2), where x_1 is the x-coordinate and x_2 is the y-coordinate in the Universal Transverse Mercator system for zone 31, and y is PCB-138 measured on the sediment fraction smaller than 63 µm, in µg/kg dry matter. Thin plate spline bases of size K_u = K_v = 50 were used for each functional fit. The estimated mean and standard deviation functions are shown in Figure 6. Both functions are seen to exhibit pronounced spatial effects. Simple bivariate predictor models that ignore the heteroscedasticity described by the right panel of Figure 6 would have erroneous prediction intervals.

Higher dimensional heteroscedastic nonparametric regression can be achieved via Algorithm 1 with little notational change from the bivariate case treated here. The only required modification involves higher-dimensional thin plate spline basis functions instead of those given by (16).
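The low-rank thin plate spline design matrix implied by (16) can be formed with a few lines of R. In the sketch below the knots are chosen by simple random sub-sampling of the predictor locations rather than a space-filling design, and the matrix inverse square root is computed via a singular value decomposition; both are pragmatic stand-ins, the helper names tps.r() and thinPlateZ() are illustrative, and the data are simulated.

    ## Sketch of a low-rank thin plate spline design matrix with r(x) = ||x||^2 log||x||.
    tps.r <- function(d2) ifelse(d2 == 0, 0, 0.5*d2*log(d2))   # r() in terms of squared distance

    thinPlateZ <- function(X, knots) {
      ## pairwise squared distances between the rows of X and the knots:
      D2xk <- outer(rowSums(X^2), rowSums(knots^2), "+") - 2*X %*% t(knots)
      D2kk <- outer(rowSums(knots^2), rowSums(knots^2), "+") - 2*knots %*% t(knots)
      Omega <- tps.r(pmax(D2kk, 0))
      ## pragmatic inverse square root of Omega via its singular value decomposition:
      sv <- svd(Omega)
      OmegaInvSqrt <- sv$u %*% diag(1/sqrt(sv$d)) %*% t(sv$v)
      tps.r(pmax(D2xk, 0)) %*% OmegaInvSqrt
    }

    ## Example usage with hypothetical bivariate predictors:
    set.seed(5)
    xBiv  <- cbind(runif(300), runif(300))
    knots <- xBiv[sample(nrow(xBiv), 50), ]
    Zu    <- thinPlateZ(xBiv, knots)      # 300 x 50 bivariate spline design matrix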

Figure 6: Left panel: fitted mean function (mean PCB-138, µg/kg dry matter) for the polychlorinated biphenyl data described in the text, based on MFVB via Algorithm 1, over UTM zone 31 x- and y-coordinates. Right panel: similar to the left panel but for the standard deviation function (st. dev. PCB-138, µg/kg dry matter).

6 Extension to Additive Models

The final semiparametric regression extension that we discuss briefly in this section is additive models with multiplicative variance functions. Models of this type have a small literature, with Rigby & Stasinopoulos (2005) being a key reference. Their generic form is

    y_i ind.~ N(β_0 + Σ_{j=1}^d f_j(x_{ji}), exp{γ_0 + Σ_{j=1}^d h_j(x_{ji})}),   1 ≤ i ≤ n.   (17)

Here f_j and h_j, 1 ≤ j ≤ d, are smooth but otherwise arbitrary functions. We use penalized spline models of the form

    f_j(x) = β_j x + Σ_{k=1}^{K_j^u} u_{jk} z_{jk}^u(x),   u_{jk} ind.~ N(0, σ_{uj}²),
    h_j(x) = γ_j x + Σ_{k=1}^{K_j^v} v_{jk} z_{jk}^v(x),   v_{jk} ind.~ N(0, σ_{vj}²).   (18)

The {z_{jk}^u : 1 ≤ k ≤ K_j^u}, 1 ≤ j ≤ d, are spline bases of sizes K_j^u, analogous to those presented in Section 2. The {z_{jk}^v : 1 ≤ k ≤ K_j^v}, 1 ≤ j ≤ d, are similarly defined. The priors on the regression coefficients and standard deviation parameters are

    β_j ind.~ N(0, σ_β²),   γ_j ind.~ N(0, σ_γ²),   σ_uj ind.~ Half-Cauchy(A_u),   σ_vj ind.~ Half-Cauchy(A_v).   (19)

The Bayesian model given by (17), (18) and (19) admits a closed form non-conjugate MFVB algorithm, with the regression coefficients for the full mean and variance functions each being Multivariate Normal. Details and illustration are given in Menictas (2014). Section 5.2 of Wand (2014) also contains an illustration for data from a Californian air pollution study.
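For concreteness, the following R sketch assembles additive-model design matrices of the form implied by (17)-(18), again substituting a simple truncated-line basis for the default O'Sullivan splines; the number of predictors, the basis sizes and the helper name truncLineZ() are illustrative assumptions.

    ## Illustrative additive-model design matrices: intercept, linear columns for each
    ## predictor, followed by one spline block per predictor.
    truncLineZ <- function(x, K) {
      knots <- quantile(unique(x), seq(0, 1, length = K + 2)[-c(1, K + 2)])
      outer(x, knots, function(x, k) pmax(x - k, 0))
    }
    set.seed(6)
    n <- 300 ; d <- 3 ; Kj <- 15
    xMat  <- matrix(runif(n*d), n, d)
    Zlist <- lapply(1:d, function(j) truncLineZ(xMat[, j], Kj))
    Cnu    <- cbind(1, xMat, do.call(cbind, Zlist))   # mean-function design matrix
    Comega <- Cnu                                     # same form for the log-variance function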

The extension to additive models with parametric components is trivial. For example, if binary predictor data b_i, 1 ≤ i ≤ n, are also available then the following extension of (17),

    y_i ind.~ N(β_0 + β_b b_i + Σ_{j=1}^d f_j(x_{ji}), exp{γ_0 + γ_b b_i + Σ_{j=1}^d h_j(x_{ji})}),   1 ≤ i ≤ n,

can be handled via simple additions to the design matrices and coefficient vectors. Similar comments apply to the addition of continuous predictors that have a purely linear impact on either the mean function or the log-variance function.

7 Real-Time Heteroscedastic Nonparametric Regression

Almost all nonparametric regression methodology presented to date makes the assumption that the data are processed in batch, that is, all at the same time. Disadvantages of batch processing include the requirement that analysis must wait until the entire data set has been collected, and often the need to store the entire data set in memory. In the real-time case, the analysis is updated as each new data point is collected. This is beneficial, and sometimes essential, for both high volume and high velocity data.

In this section we present a variation of Algorithm 1 that allows for real-time heteroscedastic nonparametric regression. Algorithm 2 processes each new entry of y, denoted by y_new, and its corresponding rows of C_ν and C_ω, denoted by c_ν,new and c_ω,new, successively in real time. The starting values for the real-time procedure are determined by performing a sufficiently large batch fit. This is explained in more detail in Section 2.1.1 of Luts, Broderick & Wand (2014).

The web-site realtime-semiparametric-regression.net features a movie that illustrates Algorithm 2 for data simulated according to setting D. The warm-up sample size is n_warm = 500. The link for the movie is titled Heteroscedastic nonparametric regression and it portrays the effectiveness of real-time processing for mean and standard deviation function fitting.

8 Concluding Remarks

We have developed closed form algorithms for fast batch and real-time fitting and inference for a variety of heteroscedastic semiparametric regression models. The methodology also applies to larger models courtesy of the locality property of mean field variational inference methods. The new methodology has been shown to perform very well on simulated and actual data.

1. Use Algorithm 1 to perform batch-based tuning runs, analogous to those described in Algorithm 2 of Luts et al. (2014), and determine a warm-up sample size n_warm for which convergence is validated.

2. Set µ_{q(ν)}, Σ_{q(ν)}, µ_{q(ω)}, Σ_{q(ω)}, µ_{q(1/σ_u²)} and µ_{q(1/σ_v²)} to their values obtained in the warm-up batch-based tuning run with sample size n_warm. Next set y_warm to be the response vector based on the first n_warm observations. Also set C_ν,warm and C_ω,warm to be the design matrices based on the first n_warm observations. Lastly, assign n ← n_warm.

3. Cycle:

    read in y_new (1 × 1), c_ν,new ((2 + K_u) × 1) and c_ω,new ((2 + K_v) × 1) ;   n ← n + 1

    C_ν ← [C_ν^T  c_ν,new]^T ;   C_ω ← [C_ω^T  c_ω,new]^T ;   y ← [y^T  y_new]^T

    µ_{q(r_ν)} ← diagonal{(y − C_ν µ_{q(ν)})(y − C_ν µ_{q(ν)})^T + C_ν Σ_{q(ν)} C_ν^T}

    ψ_{q(ω)} ← exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)}

    Σ_{q(ν)} ← [C_ν^T diag(ψ_{q(ω)}) C_ν + blockdiag(σ_β^{−2} I, µ_{q(1/σ_u²)} I_{K_u})]^{−1}

    µ_{q(ν)} ← Σ_{q(ν)} C_ν^T diag(ψ_{q(ω)}) y

    Σ_{q(ω)} ← [½ C_ω^T diag(µ_{q(r_ν)} ⊙ ψ_{q(ω)}) C_ω + blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I_{K_v})]^{−1}

    µ_{q(ω)} ← µ_{q(ω)} + Σ_{q(ω)} [½ C_ω^T {µ_{q(r_ν)} ⊙ ψ_{q(ω)} − 1} − blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I_{K_v}) µ_{q(ω)}]

    µ_{q(1/a_u)} ← 1/{µ_{q(1/σ_u²)} + A_u^{−2}} ;   µ_{q(1/a_v)} ← 1/{µ_{q(1/σ_v²)} + A_v^{−2}}

    µ_{q(1/σ_u²)} ← (K_u + 1)/{2 µ_{q(1/a_u)} + ‖µ_{q(u)}‖² + tr(Σ_{q(u)})}

    µ_{q(1/σ_v²)} ← (K_v + 1)/{2 µ_{q(1/a_v)} + ‖µ_{q(v)}‖² + tr(Σ_{q(v)})}

until the analysis is complete or data are no longer available.

Algorithm 2: MFVB algorithm for real-time determination of the optimal parameters in q*(ω), q*(ν), q*(σ_u²), q*(σ_v²), q*(a_u) and q*(a_v).
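In the spirit of Algorithm 2, the real-time scheme can be sketched in R by reusing the mfvbCycle() function given after Algorithm 1: a warm-up batch fit supplies the starting state, and each newly arriving observation appends one row to the data objects before a single sweep of the variational updates. The onlineUpdate() helper and the commented feed loop below are illustrative assumptions, and newDataPoint() is a hypothetical stand-in for whatever source supplies the incoming data.

    ## One real-time update: append the new observation, then run one cycle sweep.
    onlineUpdate <- function(state, yNew, cNuNew, cOmegaNew) {
      state$y        <- c(state$y, yNew)
      state$Cnu      <- rbind(state$Cnu, cNuNew)
      state$Comega   <- rbind(state$Comega, cOmegaNew)
      state$mu.q.rnu <- c(state$mu.q.rnu, 1)    # placeholder entry for the new residual
      mfvbCycle(state)
    }

    ## Real-time loop (schematic): 'state' comes from the warm-up fit of size n_warm.
    # repeat {
    #   d <- newDataPoint()                     # hypothetical feed: list(y, cNu, cOmega)
    #   state <- onlineUpdate(state, d$y, d$cNu, d$cOmega)
    # }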

Appendix: Derivation of optimal q-density functions

Derivation of q*(ν)

First note that

    log q*(ν) = E_q{log p(ν | rest)} + const
              = −½ [ν^T {C_ν^T diag(E_q e^{−C_ω ω}) C_ν + blockdiag(σ_β^{−2} I, µ_{q(1/σ_u²)} I_{K_u})} ν − 2 ν^T C_ν^T diag(E_q e^{−C_ω ω}) y] + const,

where "const" denotes terms not depending on the argument of q*. The form of q*(ν) follows from this and the fact that

    E_q(e^{−C_ω ω}) = exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)} = ψ_{q(ω)}.

Therefore

    µ_{q(ν)} = Σ_{q(ν)} C_ν^T diag(ψ_{q(ω)}) y   and   Σ_{q(ν)} = [C_ν^T diag(ψ_{q(ω)}) C_ν + blockdiag(σ_β^{−2} I, µ_{q(1/σ_u²)} I_{K_u})]^{−1}.

Derivation of q*(σ_u²) and q*(σ_v²)

    log q*(σ_u²) = E_q{log p(σ_u² | rest)} + const
                 = −½(K_u + 3) log(σ_u²) − {½ E_q‖u‖² + µ_{q(1/a_u)}}/σ_u² + const.

The form of q*(σ_u²) follows from this and the fact that E_q‖u‖² = ‖µ_{q(u)}‖² + tr(Σ_{q(u)}). The expression for µ_{q(1/σ_u²)} follows from a standard result for the Inverse-Gamma distribution: if a random variable v has an Inverse-Gamma distribution with density function p(v) = B^A Γ(A)^{−1} v^{−A−1} exp(−B/v), then E(1/v) = A/B. Therefore

    B_{q(σ_u²)} = ½ {‖µ_{q(u)}‖² + tr(Σ_{q(u)})} + µ_{q(1/a_u)}   and   µ_{q(1/σ_u²)} = ½(K_u + 1)/B_{q(σ_u²)}.

The derivations of B_{q(σ_v²)} and µ_{q(1/σ_v²)} are similar.

Derivation of q*(a_u) and q*(a_v)

    log q*(a_u) = E_q{log p(a_u | rest)} + const
                = −2 log(a_u) − {µ_{q(1/σ_u²)} + A_u^{−2}}/a_u + const.

The expressions for B_{q(a_u)} and µ_{q(1/a_u)} follow immediately. The derivations of B_{q(a_v)} and µ_{q(1/a_v)} are similar to that just shown.
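The lognormal-mean identity used above, E{exp(−c^T ω)} = exp(−c^T µ + ½ c^T Σ c) for ω ~ N(µ, Σ), is easy to verify numerically. In the R sketch below all numerical values (dimension, mean, covariance and the vector c) are illustrative.

    ## Monte Carlo check of E{exp(-c' omega)} when omega ~ N(mu, Sigma).
    set.seed(7)
    p <- 3
    mu <- c(0.3, -0.1, 0.2)
    A  <- matrix(rnorm(p*p), p, p)
    Sigma <- 0.05*crossprod(A)                     # a small, valid covariance matrix
    cvec <- c(1, 0.5, -2)
    U <- chol(Sigma)                               # Sigma = t(U) %*% U
    omega <- matrix(rnorm(1e5*p), ncol = p) %*% U + matrix(mu, 1e5, p, byrow = TRUE)
    c(monteCarlo = mean(exp(-omega %*% cvec)),
      closedForm = exp(-sum(cvec*mu) + 0.5*sum(cvec*(Sigma %*% cvec))))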

Derivation of the µ_{q(ω)} and Σ_{q(ω)} updates

The updates for µ_{q(ω)} and Σ_{q(ω)} are based on maximisation of the current value of the marginal log-likelihood lower bound log p(y; q, µ_{q(ω)}, Σ_{q(ω)}) over these parameters using fixed point iteration. Wand (2014) shows that the updates reduce to

    Σ_{q(ω)} ← [−2 vec^{−1}{(D_{vec(Σ_{q(ω)})} S)^T}]^{−1}   and   µ_{q(ω)} ← µ_{q(ω)} + Σ_{q(ω)} (D_{µ_{q(ω)}} S)^T,

where

    S ≡ E_q{log p(y | ν, ω) + log p(ω | σ_v²)},

D denotes the derivative vector, as defined in Magnus & Neudecker (1999), and vec and vec^{−1} are as defined in Wand (2014). However, a result in the appendix of Opper & Archambeau (2009) shows that an equivalent expression for the first of these updates is

    Σ_{q(ω)} ← {−H_{µ_{q(ω)}} S}^{−1},

where H denotes the Hessian matrix, as defined in Magnus & Neudecker (1999). We work with this alternative form here. An explicit expression for S is

    S = −½ 1^T C_ω µ_{q(ω)} − ½ µ_{q(r_ν)}^T exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)}
        − ½ tr[blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I){µ_{q(ω)} µ_{q(ω)}^T + Σ_{q(ω)}}] + terms not involving µ_{q(ω)} or Σ_{q(ω)}.

This leads to

    d_{µ_{q(ω)}} S = −½ 1^T C_ω dµ_{q(ω)} + ½ µ_{q(r_ν)}^T diag[exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)}] C_ω dµ_{q(ω)} − µ_{q(ω)}^T blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I) dµ_{q(ω)}
                   = ½ [µ_{q(r_ν)} ⊙ exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)} − 1]^T C_ω dµ_{q(ω)} − µ_{q(ω)}^T blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I) dµ_{q(ω)}.

Then, by Theorem 6, Chapter 5, of Magnus & Neudecker (1999),

    D_{µ_{q(ω)}} S = ½ [C_ω^T {µ_{q(r_ν)} ⊙ exp(−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)) − 1}]^T − µ_{q(ω)}^T blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I).

Next, taking the differential a second time and using the identity diag(a) b = a ⊙ b,

    d²_{µ_{q(ω)}} S = −½ dµ_{q(ω)}^T C_ω^T diag[µ_{q(r_ν)} ⊙ exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)}] C_ω dµ_{q(ω)} − dµ_{q(ω)}^T blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I) dµ_{q(ω)}.

Therefore, by Theorem 6, Chapter 6, of Magnus & Neudecker (1999),

    H_{µ_{q(ω)}} S = −½ C_ω^T diag[µ_{q(r_ν)} ⊙ exp{−C_ω µ_{q(ω)} + ½ diagonal(C_ω Σ_{q(ω)} C_ω^T)}] C_ω − blockdiag(σ_γ^{−2} I, µ_{q(1/σ_v²)} I).

Combining these expressions leads to the updates in Algorithm 1.

9 Acknowledgments

This research was partially supported by an Australian Postgraduate Award and an Australian Research Council Discovery Project. The authors are grateful to Jan Luts, an associate editor and two referees for their comments on this research.

References

Barber, D. & Bishop, C.M. (1997). Ensemble learning for multi-layer networks. In Jordan, M.I., Kearns, K.J. and Solla, S.A. (eds), Advances in Neural Information Processing Systems 10.

Challis, E. & Barber, D. (2013). Gaussian Kullback-Leibler approximate inference. Journal of Machine Learning Research, 14.

Crainiceanu, C.M., Ruppert, D., Carroll, R.J., Joshi, A. & Goodner, B. (2007). Spatially adaptive Bayesian penalized splines with heteroscedastic errors. Journal of Computational and Graphical Statistics, 16.

Cressie, N.A.C. (1993). Statistics for Spatial Data, Revised Edition. New York: Wiley.

Hinton, G.E. & van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th Annual Association for Computing Machinery Conference on Computational Learning Theory.

Knowles, D.A. & Minka, T.P. (2011). Non-conjugate variational message passing for multinomial and binary regression. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F. Pereira and K.Q. Weinberger (eds), Advances in Neural Information Processing Systems 24.

Lázaro-Gredilla, M. & Titsias, M.K. (2011). Variational heteroscedastic Gaussian process regression. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, Washington, USA.

Luts, J., Broderick, T. & Wand, M.P. (2014). Real-time semiparametric regression. Journal of Computational and Graphical Statistics, 23.

Luts, J., Wang, S.S.J., Ormerod, J.T. & Wand, M.P. (2014). Semiparametric regression analysis via Infer.NET. Unpublished manuscript.

Marley, J.K. & Wand, M.P. (2010). Non-standard semiparametric regression via BRugs. Journal of Statistical Software, 37(5).

Menictas, M. & Wand, M.P. (2013). Variational inference for marginal longitudinal semiparametric regression. Stat, 2.

Menictas, M. (2014). Variational Inference for Heteroscedastic and Longitudinal Regression Models. PhD thesis, University of Technology, Sydney (anticipated).

Minka, T. & Winn, J. (2008). Gates: a graphical notation for mixture models. Microsoft Research Technical Report Series.

Minka, T., Winn, J., Guiver, J. & Knowles, D. (2013). Infer.NET 2.5.

Nott, D.J., Tran, M.-N. & Leng, C. (2012). Variational approximation for heteroscedastic linear models and matching pursuit algorithms. Statistics and Computing, 22.

Opper, M. & Archambeau, C. (2009). The variational Gaussian approximation revisited. Neural Computation, 21.

Ormerod, J.T. & Wand, M.P. (2010). Explaining variational approximations. The American Statistician, 64.

Pebesma, E.J. (2004). Multivariate geostatistics in R: the gstat package. Computers and Geosciences, 30.

Pebesma, E.J. & Duin, R.N.M. (2005). Spatial patterns of temporal change in North Sea sediment quality on different spatial scales. In P. Renard, H. Demougeot-Renard and R. Froidevaux (eds), Geostatistics for Environmental Applications: Proceedings of the Fifth European Conference on Geostatistics for Environmental Applications. Berlin: Springer.

Raiko, T., Valpola, H., Harva, M. & Karhunen, J. (2007). Building blocks for variational Bayesian learning of latent variable models. Journal of Machine Learning Research, 8.

R Core Team (2014). R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Rigby, R.A. & Stasinopoulos, D.M. (2005). Generalized additive models for location, scale and shape. Applied Statistics, 54.

Ruppert, D., Wand, M.P. & Carroll, R.J. (2003). Semiparametric Regression. New York: Cambridge University Press.

Ruppert, D., Wand, M.P. & Carroll, R.J. (2009). Semiparametric regression during 2003-2007. Electronic Journal of Statistics, 3.

Spiegelhalter, D.J., Thomas, A., Best, N.G., Gilks, W.R. & Lunn, D. (2003). BUGS: Bayesian inference using Gibbs sampling. Medical Research Council Biostatistics Unit, Cambridge, England.

Stan Development Team (2013). Stan: a C++ library for probability and sampling.

Wainwright, M.J. & Jordan, M.I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1.

Wand, M.P. & Jones, M.C. (1995). Kernel Smoothing. London: Chapman and Hall.

Wand, M.P. & Ormerod, J.T. (2008). On semiparametric regression with O'Sullivan penalised splines. Australian and New Zealand Journal of Statistics, 50.

Wand, M.P. (2009). Semiparametric regression and graphical models. Australian and New Zealand Journal of Statistics, 51.

Wand, M.P., Ormerod, J.T., Padoan, S.A. & Frühwirth, R. (2011). Mean field variational Bayes for elaborate distributions. Bayesian Analysis, 6(4).

Wand, M.P. (2014). Fully simplified Multivariate Normal updates in non-conjugate variational message passing. Journal of Machine Learning Research, 15.


More information

Bayesian Inference for the Multivariate Normal

Bayesian Inference for the Multivariate Normal Bayesian Inference for the Multivariate Normal Will Penny Wellcome Trust Centre for Neuroimaging, University College, London WC1N 3BG, UK. November 28, 2014 Abstract Bayesian inference for the multivariate

More information

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals

More information

Bayesian Sampling and Ensemble Learning in Generative Topographic Mapping

Bayesian Sampling and Ensemble Learning in Generative Topographic Mapping Bayesian Sampling and Ensemble Learning in Generative Topographic Mapping Akio Utsugi National Institute of Bioscience and Human-Technology, - Higashi Tsukuba Ibaraki 35-8566, Japan March 6, Abstract Generative

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 218 Outlines Overview Introduction Linear Algebra Probability Linear Regression 1

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Understanding Covariance Estimates in Expectation Propagation

Understanding Covariance Estimates in Expectation Propagation Understanding Covariance Estimates in Expectation Propagation William Stephenson Department of EECS Massachusetts Institute of Technology Cambridge, MA 019 wtstephe@csail.mit.edu Tamara Broderick Department

More information

Least Absolute Shrinkage is Equivalent to Quadratic Penalization

Least Absolute Shrinkage is Equivalent to Quadratic Penalization Least Absolute Shrinkage is Equivalent to Quadratic Penalization Yves Grandvalet Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne, BP 20.529, 60205 Compiègne Cedex, France Yves.Grandvalet@hds.utc.fr

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 10 Alternatives to Monte Carlo Computation Since about 1990, Markov chain Monte Carlo has been the dominant

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Riemann Manifold Methods in Bayesian Statistics

Riemann Manifold Methods in Bayesian Statistics Ricardo Ehlers ehlers@icmc.usp.br Applied Maths and Stats University of São Paulo, Brazil Working Group in Statistical Learning University College Dublin September 2015 Bayesian inference is based on Bayes

More information

Scaling up Bayesian Inference

Scaling up Bayesian Inference Scaling up Bayesian Inference David Dunson Departments of Statistical Science, Mathematics & ECE, Duke University May 1, 2017 Outline Motivation & background EP-MCMC amcmc Discussion Motivation & background

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Widths. Center Fluctuations. Centers. Centers. Widths

Widths. Center Fluctuations. Centers. Centers. Widths Radial Basis Functions: a Bayesian treatment David Barber Bernhard Schottky Neural Computing Research Group Department of Applied Mathematics and Computer Science Aston University, Birmingham B4 7ET, U.K.

More information

Nearest Neighbor Gaussian Processes for Large Spatial Data

Nearest Neighbor Gaussian Processes for Large Spatial Data Nearest Neighbor Gaussian Processes for Large Spatial Data Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public Health, Johns

More information

Grid Based Variational Approximations

Grid Based Variational Approximations Grid Based Variational Approximations John T Ormerod 1 School of Mathematics and Statistics, University of Sydney, NSW 6, Australia Abstract Variational methods for approximate Bayesian inference provide

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables Revised submission to IEEE TNN Aapo Hyvärinen Dept of Computer Science and HIIT University

More information

Real-Time Semiparametric Regression via Sequential Monte Carlo

Real-Time Semiparametric Regression via Sequential Monte Carlo Real-Time Semiparametric Regression via Sequential Monte Carlo Mathew W. McLean Chris. J. Oates Matt P. Wand {mathew.mclean, christopher.oates, matt.wand}@uts.edu.au June 9, 2017 Abstract Sequential Monte

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

A general mixed model approach for spatio-temporal regression data

A general mixed model approach for spatio-temporal regression data A general mixed model approach for spatio-temporal regression data Thomas Kneib, Ludwig Fahrmeir & Stefan Lang Department of Statistics, Ludwig-Maximilians-University Munich 1. Spatio-temporal regression

More information

Introduction to Probabilistic Graphical Models

Introduction to Probabilistic Graphical Models Introduction to Probabilistic Graphical Models Sargur Srihari srihari@cedar.buffalo.edu 1 Topics 1. What are probabilistic graphical models (PGMs) 2. Use of PGMs Engineering and AI 3. Directionality in

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Discussion of Maximization by Parts in Likelihood Inference

Discussion of Maximization by Parts in Likelihood Inference Discussion of Maximization by Parts in Likelihood Inference David Ruppert School of Operations Research & Industrial Engineering, 225 Rhodes Hall, Cornell University, Ithaca, NY 4853 email: dr24@cornell.edu

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Bayesian Linear Regression

Bayesian Linear Regression Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective

More information

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Anton Rodomanov Higher School of Economics, Russia Bayesian methods research group (http://bayesgroup.ru) 14 March

More information

The Recycling Gibbs Sampler for Efficient Learning

The Recycling Gibbs Sampler for Efficient Learning The Recycling Gibbs Sampler for Efficient Learning L. Martino, V. Elvira, G. Camps-Valls Universidade de São Paulo, São Carlos (Brazil). Télécom ParisTech, Université Paris-Saclay. (France), Universidad

More information

Bayesian Modeling of Conditional Distributions

Bayesian Modeling of Conditional Distributions Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2 Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

More information

Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models

Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models Bayesian Image Segmentation Using MRF s Combined with Hierarchical Prior Models Kohta Aoki 1 and Hiroshi Nagahashi 2 1 Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology

More information

Automating variational inference for statistics and data mining

Automating variational inference for statistics and data mining Automating variational inference for statistics and data mining Tom Minka Machine Learning and Perception Group Microsoft Research Cambridge A common situation You have a dataset Some models in mind Want

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information