Variational Bayesian Inference for Parametric and Non-Parametric Regression with Missing Predictor Data
1 Variational Bayesian Inference for Parametric and Non-Parametric Regression with Missing Predictor Data. Christel Faes. Co-authors: John Ormerod and Matt Wand. May 10, 2012
2 Introduction: Variational Bayes. Bayesian inference for parametric regression has a long history (e.g. Box and Tiao, 1973; Gelman, Carlin, Stern and Rubin, 2004); for non-parametric regression it covers, e.g., mixed model representations of penalized splines (Ruppert, Wand and Carroll, 2003); for dealing with missingness in data it allows incorporation of standard missing-data models (e.g. Little and Rubin, 2004; Daniels and Hogan, 2008). Complex models can be handled via hierarchical Bayesian models: easy via MCMC, but this can be costly in processing time.
3 Variational Approximate Bayesian Inference: fast Bayesian regression analysis. A deterministic approach that yields approximate inference: it approximates posterior densities by other densities for which inference is more tractable. Part of mainstream computer science methodology (e.g. Bishop, 2006): speech recognition and document retrieval (e.g. Jordan, 2004); functional magnetic resonance imaging (e.g. Flandin and Penny, 2007). Recently used in statistical problems (e.g. Ormerod and Wand, 2010): cluster analysis for gene-expression data (e.g. Teschendorff et al., 2005); finite mixture models (e.g. McGrory and Titterington, 2007).
4 Elements of Variational Bayes. Bayesian inference on θ ∈ Θ is based on the posterior density function p(θ|y) = p(y, θ)/p(y) = p(y|θ) p(θ)/p(y), with y the observed data vector. For an arbitrary density function q over Θ, the following inequality holds: p(y) ≥ p(y; q) ≡ exp( ∫ q(θ) log{p(y, θ)/q(θ)} dθ ). Equality holds if and only if q(θ) = p(θ|y) almost everywhere. The gap between log p(y) and log p(y; q) is the Kullback-Leibler (KL) divergence. For most models, the exact solution q_exact(θ) = p(θ|y) is intractable.
5 Elements of Variational Bayes. Variational Bayes relies on product density restrictions on q: q(θ) = ∏_{i=1}^{M} q_i(θ_i) for some partition {θ_1, ..., θ_M} of θ. This buys tractability at the cost of a posterior independence assumption. The optimal densities (with minimum KL divergence) can be shown to satisfy q_i*(θ_i) ∝ exp{E_{−θ_i} log p(θ, y)} ∝ exp{E_{−θ_i} log p(θ_i | rest)}, where E_{−θ_i} denotes expectation with respect to ∏_{j≠i} q_j(θ_j), rest ≡ {y, θ_1, ..., θ_{i−1}, θ_{i+1}, ..., θ_M}, and the p(θ_i | rest) are the full conditionals (e.g. Robert and Casella, 2004).
6 Elements of Variational Bayes. The expressions q_i*(θ_i) ∝ exp{E_{−θ_i} log p(θ_i | rest)} uniquely maximize p(y; q) with respect to the densities q_i(θ_i). The method of alternating variables (coordinate ascent) is used to attain convergence, which is assessed by monitoring the relative increase in log p(y; q). Typically, convergence is achieved within a few hundred iterations.
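As a concrete illustration of these coordinate-ascent updates, here is a minimal sketch (our own toy example, not from the talk) of mean-field variational Bayes for a normal sample y_i ~ N(µ, σ²), with independent priors µ ~ N(0, s0²) and 1/σ² ~ Gamma(a0, b0), under the product restriction q(µ, σ²) = q(µ) q(σ²):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(2.0, 0.5, size=500)            # data: N(mu=2, sd=0.5)
n, s0_sq, a0, b0 = len(y), 100.0, 0.01, 0.01  # vague priors

# q(mu) = N(m, v) and q(1/sigma^2) = Gamma(a_n, b_n)
m, v = 0.0, 1.0
a_n = a0 + 0.5 * n                            # shape is fixed across iterations
b_n = b0 + 0.5 * np.sum((y - m) ** 2)

for _ in range(100):                          # coordinate ascent
    e_tau = a_n / b_n                         # E_q[1/sigma^2]
    prec = 1.0 / s0_sq + n * e_tau            # update q(mu)
    m, v = e_tau * y.sum() / prec, 1.0 / prec
    b_n = b0 + 0.5 * (np.sum((y - m) ** 2) + n * v)  # update q(1/sigma^2)

print(m, np.sqrt(b_n / a_n))  # variational posterior mean of mu, estimate of sd
```

Each update holds the other factor fixed at its current expectation, exactly the alternating scheme described above; here the fixed point is reached in a handful of iterations.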
7 Example: Mixture Model. ELISA test results for Bluetongue virus (BTV-8). Mixture model of normal distributions; Dirichlet prior on the weights of the mixture; conjugate priors on the parameters of the normal distributions.
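The talk does not give code, but a Dirichlet-weight normal mixture of this kind can be fitted by variational Bayes with scikit-learn's `BayesianGaussianMixture`; the sketch below uses synthetic two-component data as a stand-in for the ELISA scores (component locations and weights are our own illustrative values):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# Synthetic stand-in for the ELISA scores: two well-separated normal components
x = np.concatenate([rng.normal(1.5, 0.3, 300), rng.normal(4.5, 0.4, 700)])

# Variational fit: Dirichlet prior on the mixture weights, conjugate
# priors on the component means and precisions
bgm = BayesianGaussianMixture(
    n_components=2,
    weight_concentration_prior_type="dirichlet_distribution",
    max_iter=500,
    random_state=0,
).fit(x.reshape(-1, 1))

print(np.sort(bgm.means_.ravel()))  # recovered component means
print(np.sort(bgm.weights_))        # recovered mixture weights
```

As on the slide, the variational point estimates agree closely with what MCMC would give for well-separated components.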
8 Example: Mixture Model. Prevalence estimation (estimated p1 per province, MCMC versus variational approximation):

Provincie         MCMC p1   VA p1
Antwerpen         0,771     0,771
Vlaams-Brabant    0,794     0,794
Waals-Brabant     0,760     0,761
West-Vlaanderen   0,914     0,915
Oost-Vlaanderen   0,781     0,781
Henegouwen        0,966     0,966
Luik              0,554     0,550
Limburg           0,533     0,533
Luxemburg         0,93      0,933
Namen             0,894     0,894

Mixture-component parameters (mean and se of µ_q(µ1) and µ_q(µ2), MCMC versus VA): 1,46  0,049  4,578  0,  ,45  0,0336  4,577  0,00435
9 Accuracy Estimation - Simulation: simple linear regression with missing predictor data. Assume the model y_i = β0 + β1 x_i + ε_i, ε_i ~ N(0, σ_ε²). Take β0, β1 ~ N(0, σ_β²) and σ_ε² ~ IG(A_ε, B_ε). Suppose that the predictors are susceptible to missingness and assume x_i ~ N(µ_x, σ_x²) with hyperpriors µ_x ~ N(0, σ²_µx) and σ_x² ~ IG(A_x, B_x). Let R_i be the missingness indicators and consider the missingness mechanisms: (1) P(R_i = 1) = p: MCAR; (2) P(R_i = 1) = Φ(φ0 + φ1 y_i) for φ0, φ1 ~ N(0, σ_φ²): MAR; (3) P(R_i = 1) = Φ(φ0 + φ1 x_i) for φ0, φ1 ~ N(0, σ_φ²): MNAR. Use auxiliary variables a_i | φ ~ N((Yφ)_i, 1) or a_i | φ ~ N((Xφ)_i, 1) for the probit regression components.
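A hedged sketch of how data from this model might be simulated; the function name, parameter values, and the convention that R_i = 1 flags an observed predictor are our own choices:

```python
import numpy as np
from scipy.stats import norm

def simulate(n=500, beta0=1.0, beta1=2.0, sd_eps=0.2, mechanism="MCAR",
             phi=(2.5, -1.0), p=0.3, seed=0):
    """Draw (y, x, R) from the simple linear model with missing predictors.
    Convention here: R_i = 1 means x_i is observed; phi drives the probit model."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.5, 0.2, n)                     # x_i ~ N(mu_x, sigma_x^2)
    y = beta0 + beta1 * x + rng.normal(0, sd_eps, n)
    if mechanism == "MCAR":
        prob = np.full(n, 1 - p)                    # constant observation prob.
    elif mechanism == "MAR":                        # depends on observed y only
        prob = norm.cdf(phi[0] + phi[1] * y)
    else:                                           # MNAR: depends on x itself
        prob = norm.cdf(phi[0] + phi[1] * x)
    R = rng.binomial(1, prob)
    return y, np.where(R == 1, x, np.nan), R

y, x_obs, R = simulate(mechanism="MAR")
print(np.isnan(x_obs).mean())  # fraction of predictors missing
```

Swapping the `mechanism` argument reproduces the three regimes on the slide without changing the response model.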
10 Accuracy Estimation - Simulation: directed acyclic graphs (DAGs). [Figure: DAGs for predictor MCAR, MAR and MNAR, with nodes β, σ_ε², y, x_mis, x_obs, µ_x, σ_x², R, φ and a connected by directed edges.] Evidence nodes (observed data), hidden nodes (random variables) and directed edges (conditional dependence). The Markov blanket of a node is the set of its children, parents and co-parents. DAGs aid the algebra for variational Bayes: q_i*(θ_i) ∝ exp{E_{−θ_i} log p(θ_i | rest)} = exp{E_{−θ_i} log p(θ_i | Markov blanket of θ_i)}.
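The Markov-blanket computation is mechanical once each node's parents are listed; a small sketch, with node names and an MCAR-style edge set abstracted from the graph above (our own encoding, not code from the talk):

```python
# parents[v] lists the parents of node v in the DAG (MCAR-style edge set)
parents = {
    "y": ["beta", "sigma_eps", "x_mis", "x_obs"],
    "x_mis": ["mu_x", "sigma_x"],
    "x_obs": ["mu_x", "sigma_x"],
    "R": [],
    "beta": [], "sigma_eps": [], "mu_x": [], "sigma_x": [],
}

def markov_blanket(v, parents):
    """Union of v's parents, children, and co-parents (other parents of v's children)."""
    children = [c for c, ps in parents.items() if v in ps]
    co_parents = {p for c in children for p in parents[c] if p != v}
    return set(parents[v]) | set(children) | co_parents

print(markov_blanket("x_mis", parents))
```

For x_mis this returns its parents {µ_x, σ_x²}, its child y, and the co-parents {β, σ_ε², x_obs}, which is exactly the set the variational update for q*(x_mis) conditions on.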
11 Accuracy Estimation - Simulation. [Figure: DAGs for predictor MCAR, MAR and MNAR, as on the previous slide.] MAR: there is separation between the two hidden node sets {β, σ_ε², x_mis, µ_x, σ_x²} and {a, φ}, so Bayesian inference for the regression parameters is not impacted by the missing-data mechanism. MNAR: this separation does not occur; e.g. the Markov blanket of x_mis includes {a, φ}.
12 Accuracy Estimation - Simulation: approximate inference via variational Bayes. We impose the product density restrictions: MCAR: q(β, σ_ε², x_mis, µ_x, σ_x²) = q(β, µ_x) q(σ_ε², σ_x²) q(x_mis); MAR and MNAR: q(β, σ_ε², x_mis, µ_x, σ_x², φ, a) = q(β, µ_x, φ) q(σ_ε², σ_x²) q(x_mis) q(a). For MCAR, this leads to optimal densities of the form: q*(β) = bivariate normal density; q*(µ_x) = univariate normal density; q*(σ_ε²) = Inverse Gamma density; q*(σ_x²) = Inverse Gamma density; q*(x_mis) = product of univariate normal densities. For the MAR and MNAR situations, the optimal densities for φ and a have easy expressions as well.
13 Accuracy Estimation - Simulation: approximate inference via variational Bayes. Iterative scheme for obtaining the parameters in the optimal densities:

Initialize: µ_q(1/σ_ε²), µ_q(1/σ_x²) > 0, µ_q(β) (2 × 1) and Σ_q(β) (2 × 2).

Cycle:
  σ²_q(x_mis) ← 1 / [ µ_q(1/σ_x²) + µ_q(1/σ_ε²) {µ²_q(β1) + (Σ_q(β))₂₂} + µ²_q(φ1) + (Σ_q(φ))₂₂ ]
  for i = 1, ..., n_mis:
    µ_q(x_mis,i) ← σ²_q(x_mis) [ µ_q(1/σ_x²) µ_q(µ_x) + µ_q(1/σ_ε²) {y_{x_mis,i} µ_q(β1) − (Σ_q(β))₁₂ − µ_q(β0) µ_q(β1)} + µ_q(a_{x_mis,i}) µ_q(φ1) − (Σ_q(φ))₁₂ − µ_q(φ0) µ_q(φ1) ]
  update E_q(x_mis)(X) and E_q(x_mis)(XᵀX)
  Σ_q(β) ← { µ_q(1/σ_ε²) E_q(x_mis)(XᵀX) + σ_β⁻² I }⁻¹ ;  µ_q(β) ← Σ_q(β) µ_q(1/σ_ε²) E_q(x_mis)(X)ᵀ y
  σ²_q(µ_x) ← 1 / ( n µ_q(1/σ_x²) + 1/σ²_µx ) ;  µ_q(µ_x) ← σ²_q(µ_x) µ_q(1/σ_x²) (1ᵀ x_obs + 1ᵀ µ_q(x_mis))
  B_q(σ_ε²) ← B_ε + ½ [ ‖y‖² − 2 yᵀ E_q(x_mis)(X) µ_q(β) + tr{E_q(x_mis)(XᵀX) (Σ_q(β) + µ_q(β) µᵀ_q(β))} ]
  B_q(σ_x²) ← B_x + ½ ( ‖x_obs − µ_q(µ_x) 1‖² + ‖µ_q(x_mis) − µ_q(µ_x) 1‖² + n σ²_q(µ_x) + n_mis σ²_q(x_mis) )
  µ_q(1/σ_ε²) ← (A_ε + ½ n) / B_q(σ_ε²) ;  µ_q(1/σ_x²) ← (A_x + ½ n) / B_q(σ_x²)
  Σ_q(φ) ← { E_q(x_mis)(XᵀX) + σ_φ⁻² I }⁻¹ ;  µ_q(φ) ← Σ_q(φ) E_q(x_mis)(X)ᵀ µ_q(a)
  µ_q(a) ← E_q(x_mis)(X) µ_q(φ) + (2R − 1) ⊙ (2π)^{−1/2} exp{−½ (E_q(x_mis)(X) µ_q(φ))²} / Φ{(2R − 1) ⊙ E_q(x_mis)(X) µ_q(φ)}

until the increase in log p(y, x_obs, R; q) is negligible.
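A simplified, self-contained sketch of the MCAR version of this cycle (the φ and a terms drop out; the data, vague hyperparameters, and iteration count are our own illustrative choices, not those of the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta_true, sd_eps = 400, np.array([0.3, 1.7]), 0.1
x = rng.normal(0.5, 0.2, n)
y = beta_true[0] + beta_true[1] * x + rng.normal(0, sd_eps, n)
miss = rng.random(n) < 0.25                 # MCAR mask: True = x_i missing
x_obs, n_mis = x[~miss], int(miss.sum())

s_beta2, s_mux2, A, B = 1e8, 1e8, 0.01, 0.01  # vague hyperparameters

# Initialise variational parameters
inv_se, inv_sx = 1.0, 1.0                   # mu_q(1/sigma_eps^2), mu_q(1/sigma_x^2)
mu_b, Sig_b = np.zeros(2), np.eye(2)
mu_xm = np.full(n_mis, x_obs.mean())
mu_mux, s2_mux = 0.0, 1.0

for _ in range(200):
    # q(x_mis): common variance, per-observation means
    s2_xm = 1.0 / (inv_sx + inv_se * (mu_b[1] ** 2 + Sig_b[1, 1]))
    y_mis = y[miss]
    mu_xm = s2_xm * (inv_sx * mu_mux
                     + inv_se * (y_mis * mu_b[1] - Sig_b[0, 1] - mu_b[0] * mu_b[1]))
    # E_q[X] and E_q[X^T X] under q(x_mis)
    ex = np.where(miss, 0.0, x)
    ex[miss] = mu_xm
    EX = np.column_stack([np.ones(n), ex])
    EXtX = EX.T @ EX
    EXtX[1, 1] += n_mis * s2_xm             # variance correction for imputed x
    # q(beta): bivariate normal
    Sig_b = np.linalg.inv(inv_se * EXtX + np.eye(2) / s_beta2)
    mu_b = Sig_b @ (inv_se * EX.T @ y)
    # q(mu_x): univariate normal
    s2_mux = 1.0 / (n * inv_sx + 1.0 / s_mux2)
    mu_mux = s2_mux * inv_sx * (x_obs.sum() + mu_xm.sum())
    # q(sigma_eps^2), q(sigma_x^2): Inverse-Gamma rate updates
    B_eps = B + 0.5 * (y @ y - 2 * y @ EX @ mu_b
                       + np.trace(EXtX @ (Sig_b + np.outer(mu_b, mu_b))))
    B_x = B + 0.5 * (np.sum((x_obs - mu_mux) ** 2) + np.sum((mu_xm - mu_mux) ** 2)
                     + n * s2_mux + n_mis * s2_xm)
    inv_se, inv_sx = (A + 0.5 * n) / B_eps, (A + 0.5 * n) / B_x

print(mu_b, 1 / np.sqrt(inv_se))  # variational posterior mean of beta, sd_eps estimate
```

Each pass through the loop is one sweep of the cycle above; with 25% of the predictors missing the regression coefficients are still recovered accurately, consistent with the talk's accuracy findings.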
14 Accuracy Estimation - Simulation: accuracy of variational Bayes inference. Speedy approximate inference, but no guarantees of achieving an acceptable level of accuracy. Assessment of the algorithm via simulated data: compare q*(θ) with the exact posterior density p(θ|y). The KL divergence is dominated by the tail behavior of the densities (Hall, 1987), so use instead the L1 loss, or integrated absolute error (IAE), of q*: IAE(q*) = ∫ |q*(θ) − p(θ|y)| dθ. The accuracy measure is defined as accuracy(q*) = 1 − IAE(q*)/sup_q IAE(q) = 1 − ½ IAE(q*). Note that 0 ≤ accuracy(q*) ≤ 1. MCMC with large samples is used to approximate p(θ|y).
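The IAE-based accuracy measure is easy to approximate on a grid; a small sketch using two known normal densities in place of q* and the MCMC-based estimate of p(θ|y):

```python
import numpy as np
from scipy.stats import norm

def accuracy(q_pdf, p_pdf, grid):
    """1 - IAE/2, with IAE = integral |q - p| approximated on a uniform grid."""
    dx = grid[1] - grid[0]
    iae = np.sum(np.abs(q_pdf(grid) - p_pdf(grid))) * dx
    return 1.0 - 0.5 * iae

grid = np.linspace(-10.0, 10.0, 40001)
same = accuracy(norm(0, 1).pdf, norm(0, 1).pdf, grid)         # identical densities
shifted = accuracy(norm(0, 1).pdf, norm(0.5, 1).pdf, grid)    # mean shifted by 0.5
print(same, shifted)
```

Identical densities score 1, and since the supremum of the IAE over densities is 2, the measure always lands in [0, 1]; a mean shift of 0.5 standard deviations scores about 0.80.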
15 Variational Bayes Accuracy Estimation - Simulation: simple linear regression with predictor MNAR. [Figure: accuracy boxplots for β0, β1, σ_ε², µ_x, σ_x², φ0, φ1 and x_mis,1, ..., x_mis,4 across settings combining (φ0, φ1) values with σ_ε = 0.05, 0.2 and larger.] Accuracy is high for the regression part {β0, β1, σ_ε²}, but drops when there is a large amount of missing data and the data are noisy. Accuracy for the missing covariates {x_i} is high in all situations. Poor performance for the missing-mechanism parameters (φ and a): location is fine, but the spread is deflated.
16 Accuracy Estimation - Simulation: simple linear regression with predictor MNAR. [Figure: successive values of log p(y, x_obs, R; q), showing monotone convergence, alongside VB and MCMC posterior densities for β0, β1, σ_ε², µ_x, σ_x², φ0, φ1 and x_mis,1, ..., x_mis,4, annotated with accuracy percentages (e.g. 95%).]
17 Accuracy Estimation - Simulation: simple linear regression with predictor MNAR. Credible interval coverage: [Table: coverage of credible intervals for β0, β1, σ_ε², µ_x, σ_x², φ0, φ1 and x_mis,i under MCAR and MNAR with low and high missingness; the numerical entries did not survive transcription.]
18 Accuracy Estimation - Simulation: simple linear regression with predictor MNAR. Speed comparisons (two settings each):

             MAR models          MNAR models
MCMC         (5.89, 5.84)        (33.8, 33.9)
var. Bayes   (0.0849, 0.0850)    (0.705, 0.790)
ratio        (76.6, 78.7)        (59.5, 67.8)

General conclusion: a fast alternative method, excellent for the regression parameters.
19 Illustration with Missing Predictor Data. Replace the linear mean function by a smooth flexible function f(x). Use penalized splines with a mixed model representation: f(x) = β0 + β1 x + Σ_{k=1}^{K} u_k z_k(x), with u_k ~ N(0, σ_u²) and {z_k(·); 1 ≤ k ≤ K} a set of spline basis functions. Different spline functions are possible, e.g. O'Sullivan penalized splines (Wand and Ormerod, 2008) or penalized linear splines.
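A sketch of the design matrix C = [1, x, z_1(x), ..., z_K(x)] for the penalized linear spline case, z_k(x) = (x − κ_k)₊, with equally spaced knots (the knot-placement details here are our own illustrative choice):

```python
import numpy as np

def truncated_linear_design(x, knots):
    """Design matrix [1, x, (x - k_1)_+, ..., (x - k_K)_+] for penalized splines."""
    Z = np.maximum(x[:, None] - knots[None, :], 0.0)  # z_k(x) = (x - kappa_k)_+
    return np.column_stack([np.ones_like(x), x, Z])

x = np.linspace(0.0, 1.0, 200)
knots = np.linspace(0.0, 1.0, 32)[1:-1]   # 30 equally spaced interior knots
C = truncated_linear_design(x, knots)
print(C.shape)                            # intercept + linear + 30 spline columns
```

In the mixed-model fit, the intercept and slope columns get the fixed-effect prior N(0, σ_β²) while the 30 spline columns share the random-effect variance σ_u², which is what makes the penalty a variance component.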
20 Illustration with Missing Predictor Data. [Figure: DAGs for predictor MCAR, MAR and MNAR, now with nodes σ_u², (β, u), y, σ_ε², x_mis, x_obs, µ_x, σ_x², R, φ and a.] The nonparametric extension enlarges the DAG. The nodes (β, u) are not broken up into separate nodes, because the expressions are easier if they are kept together. The variational Bayes algorithm is a modification of the previous algorithm, but it leads to non-standard forms of the optimal densities.
21 Illustration with Missing Predictor Data. Let C_x = (1, x, z_1(x), ..., z_K(x)). The optimal densities for x_mis,i take the form q*(x_mis,i) ∝ exp(−½ C_{x_mis,i} Ω_{mis,i} Cᵀ_{x_mis,i}), where the matrix Ω_{mis,i} collects terms corresponding to each entry of x_mis except x_mis,i. This does not have a closed-form integral, so numerical integration is required to obtain the normalizing factors. We take a basic quadrature approach, with the same quadrature grid (g_1, ..., g_M) over all 1 ≤ i ≤ n_mis: ∫ z_1(x) dx ≈ Σ_{j=1}^{M} w_j z_1(g_j), with (w_1, ..., w_M) the quadrature weights. Expressions for the mean and variance can then be derived.
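A sketch of the grid-based normalization step: given an unnormalized log-density evaluated on a common grid, the normalizing factor, mean and variance follow from the quadrature weights (here simple equal-width weights rather than the talk's particular rule, checked against a known normal density):

```python
import numpy as np

def moments_from_unnormalised(log_q, grid):
    """Normalise exp(log_q) on a quadrature grid and return (mean, variance)."""
    w = np.full(grid.size, grid[1] - grid[0])       # equal quadrature weights
    dens = np.exp(log_q(grid) - log_q(grid).max())  # stabilise before exponentiating
    dens /= np.sum(w * dens)                        # divide by the normalising factor
    mean = np.sum(w * grid * dens)
    var = np.sum(w * (grid - mean) ** 2 * dens)
    return mean, var

# Sanity check on a known density: unnormalised N(0.3, 0.2^2)
log_q = lambda t: -0.5 * ((t - 0.3) / 0.2) ** 2
m, v = moments_from_unnormalised(log_q, np.linspace(-2.0, 3.0, 4001))
print(m, v)
```

Because the same grid is reused for every i = 1, ..., n_mis, the basis functions z_k(g_j) need to be evaluated only once per cycle.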
22 Data examples, Variational Bayes Illustration. Example 1: simulated data, n = 300, y_i ~ N(f(x_i), σ_ε²) with f(x) = sin(4πx) and 20% of the x_i's removed completely at random. Penalized splines with a truncated linear spline basis with 30 knots, equally spaced over the range of the observed x_i's. MCMC: burn-in of size 20,000, thinning factor of 20, post-burn-in of 200,000. Example 2: simulated data, same setting, but missingness according to R_i ~ Bernoulli(Φ(φ0 + φ1 x_i)) with (φ0, φ1) = (3, 3). Example 3: ozone data; daily maximum one-hour-average ozone level versus daily temperature at El Monte, California; n = 361, with 137 of the predictor values missing. Predictors and errors are approximately normal and homoscedastic.
23 Results Example 1, Variational Bayes Illustration. [Figure: successive values of log p(y, x_obs; q) and VB vs. MCMC posterior densities with accuracy values: σ_ε² 93%, µ_x 85%, x_mis,7 98%, σ_x² 92%, x_mis,36 97%, σ_u² 62%, x_mis,65 98%.] Good to excellent accuracy of variational Bayes (except for σ_u²). Multimodal posteriors are well approximated by the variational Bayes approximations.
24 Illustration: Results, Examples. [Figure: three panels comparing MCMC and VB fitted functions: nonparametric MCAR example (y vs. x), nonparametric MNAR example (y vs. x), and the ozone data example (maximum one-hour-average ozone level vs. daily temperature in degrees Fahrenheit).] Good agreement between variational Bayes and MCMC in the fitted functions. Time needed: 75 seconds for variational Bayes versus 15.5 hours for MCMC.
25 Example 4, Variational Bayes Illustration. European study of antibiotic use: Defined Daily Dose (DDD) per country per year, with turkey production as a possible explanatory covariate. [Figure: observed data and variational approximation fit, DDD versus turkey production.]
26 Conclusions: Variational Bayes. Variational Bayes inference achieves good to excellent accuracy for the main parameters of interest. Poor accuracy is realized for the missing-data mechanism parameters; better accuracy may be achieved with a more elaborate variational scheme in situations where these are of interest. Variational Bayes approximates multimodal posterior densities with a high degree of accuracy. The speed-up is of the order of several hundreds.
27 References
Bishop, C.M. (2006). Pattern Recognition and Machine Learning. New York: Springer.
Box, G.E.P. and Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis. Reading, Massachusetts: Addison-Wesley.
Daniels, M.J. and Hogan, J.W. (2008). Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis. Boca Raton, Florida: Chapman and Hall (CRC Press).
Faes, C., Ormerod, J.T. and Wand, M.P. (2011). Variational Bayesian inference for parametric and nonparametric regression with missing data. Journal of the American Statistical Association, 106.
Flandin, G. and Penny, W.D. (2007). Bayesian fMRI data analysis with sparse spatial basis function priors. NeuroImage, 34.
Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (2004). Bayesian Data Analysis. Boca Raton, Florida: Chapman and Hall.
Jordan, M.I. (2004). Graphical models. Statistical Science, 19.
McGrory, C.A. and Titterington, D.M. (2007). Variational approximations in Bayesian model selection for finite mixture distributions. Computational Statistics and Data Analysis, 51.
Ormerod, J.T. and Wand, M.P. (2010). Explaining variational approximations. The American Statistician, 64.
Robert, C.P. and Casella, G. (2004). Monte Carlo Statistical Methods, Second Edition. New York: Springer-Verlag.
Ruppert, D., Wand, M.P. and Carroll, R.J. (2003). Semiparametric Regression. New York: Cambridge University Press.
Teschendorff, A.E., Wang, Y., Barbosa-Morais, N.L., Brenton, J.D. and Caldas, C. (2005). A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data. Bioinformatics, 21.
Wand, M.P. and Ormerod, J.T. (2008). On O'Sullivan penalised splines and semiparametric regression. Australian and New Zealand Journal of Statistics, 50.
Two Useful Bounds for Variational Inference John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We review and derive two lower bounds on the
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory
More informationHeriot-Watt University
Heriot-Watt University Heriot-Watt University Research Gateway Prediction of settlement delay in critical illness insurance claims by using the generalized beta of the second kind distribution Dodd, Erengul;
More informationExpectation Propagation for Approximate Bayesian Inference
Expectation Propagation for Approximate Bayesian Inference José Miguel Hernández Lobato Universidad Autónoma de Madrid, Computer Science Department February 5, 2007 1/ 24 Bayesian Inference Inference Given
More informationA graph contains a set of nodes (vertices) connected by links (edges or arcs)
BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,
More informationBayes: All uncertainty is described using probability.
Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w
More informationApproximate Inference Part 1 of 2
Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department
More informationLecture 4: Probabilistic Learning. Estimation Theory. Classification with Probability Distributions
DD2431 Autumn, 2014 1 2 3 Classification with Probability Distributions Estimation Theory Classification in the last lecture we assumed we new: P(y) Prior P(x y) Lielihood x2 x features y {ω 1,..., ω K
More informationLecture 5: Spatial probit models. James P. LeSage University of Toledo Department of Economics Toledo, OH
Lecture 5: Spatial probit models James P. LeSage University of Toledo Department of Economics Toledo, OH 43606 jlesage@spatial-econometrics.com March 2004 1 A Bayesian spatial probit model with individual
More informationSequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007
Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember
More informationWill Penny. SPM short course for M/EEG, London 2015
SPM short course for M/EEG, London 2015 Ten Simple Rules Stephan et al. Neuroimage, 2010 Model Structure The model evidence is given by integrating out the dependence on model parameters p(y m) = p(y,
More informationBayesian Networks BY: MOHAMAD ALSABBAGH
Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee and Andrew O. Finley 2 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. 2 Department of Forestry & Department
More informationMean Field Variational Bayes for Elaborate Distributions
Bayesian Analysis (011) 6, Number 4, pp. 1 48 Mean Field Variational Bayes for Elaborate Distributions Matthew P. Wand, John T. Ormerod, Simone A. Padoan and Rudolf Frührwirth Abstract. We develop strategies
More informationStatistical Machine Learning Lectures 4: Variational Bayes
1 / 29 Statistical Machine Learning Lectures 4: Variational Bayes Melih Kandemir Özyeğin University, İstanbul, Turkey 2 / 29 Synonyms Variational Bayes Variational Inference Variational Bayesian Inference
More informationBayesian Inference for DSGE Models. Lawrence J. Christiano
Bayesian Inference for DSGE Models Lawrence J. Christiano Outline State space-observer form. convenient for model estimation and many other things. Preliminaries. Probabilities. Maximum Likelihood. Bayesian
More informationPart 1: Expectation Propagation
Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud
More informationGaussian Mixture Models
Gaussian Mixture Models Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 Some slides courtesy of Eric Xing, Carlos Guestrin (One) bad case for K- means Clusters may overlap Some
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationVariational Scoring of Graphical Model Structures
Variational Scoring of Graphical Model Structures Matthew J. Beal Work with Zoubin Ghahramani & Carl Rasmussen, Toronto. 15th September 2003 Overview Bayesian model selection Approximations using Variational
More informationVariational Methods in Bayesian Deconvolution
PHYSTAT, SLAC, Stanford, California, September 8-, Variational Methods in Bayesian Deconvolution K. Zarb Adami Cavendish Laboratory, University of Cambridge, UK This paper gives an introduction to the
More informationScale Mixture Modeling of Priors for Sparse Signal Recovery
Scale Mixture Modeling of Priors for Sparse Signal Recovery Bhaskar D Rao 1 University of California, San Diego 1 Thanks to David Wipf, Jason Palmer, Zhilin Zhang and Ritwik Giri Outline Outline Sparse
More informationPMR Learning as Inference
Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning
More informationNon-Parametric Bayes
Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian
More informationLearning Gaussian Process Models from Uncertain Data
Learning Gaussian Process Models from Uncertain Data Patrick Dallaire, Camille Besse, and Brahim Chaib-draa DAMAS Laboratory, Computer Science & Software Engineering Department, Laval University, Canada
More informationWeb Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D.
Web Appendix for Hierarchical Adaptive Regression Kernels for Regression with Functional Predictors by D. B. Woodard, C. Crainiceanu, and D. Ruppert A. EMPIRICAL ESTIMATE OF THE KERNEL MIXTURE Here we
More informationBayesian Sparse Correlated Factor Analysis
Bayesian Sparse Correlated Factor Analysis 1 Abstract In this paper, we propose a new sparse correlated factor model under a Bayesian framework that intended to model transcription factor regulation in
More informationLecture 4: Probabilistic Learning
DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods
More information