A Workshop on Bayesian Nonparametric Regression Analysis


1 A Workshop on Bayesian Nonparametric Regression Analysis George Karabatsos University of Illinois-Chicago Methodological Illustration Presentation (90 min.) Modern Modeling Methods (M3) Conference University of Connecticut Neag School of Education Wednesday, May 21, 2014, 1:00-2:30pm Supported by NSF-MMS Research Grant SES

2 Workshop Outline I. Introduction (20 minutes). A. Regression is ubiquitous in applied social science research. B. Review of linear regression by least-squares (LS). C. Review of the Bayesian approach to linear regression: Prior, Data, Posterior (Denison et al. 2002). 1. Bayes estimate of (β, σ²): a combination of the LS estimate of (β, σ²) and information from the prior π(β, σ²). 2. MCMC methods to estimate the posterior, π(β, σ² | Data). 3. Extensions of the Bayesian linear model for binary regression (probit and logit), ordinal regression, and the HLM/multilevel model.

3 Workshop Outline (continued) I. Introduction (20 minutes) (continued). D. Why Bayesian nonparametric (BNP) regression? Compared to linear models, BNP regression models: 1. Provide more modeling flexibility for real data sets, which are typically more complex than what can be adequately described by a linear model. 2. Provide a richer approach to data analysis than linear models, which focus only on the mean of Y | x. A BNP regression model provides inference on how the entire distribution of Y | x depends on x, and/or provides flexible modeling of random effects. 3. Have an infinite number of parameters. The estimation of the parameters of these models from data can be adequately handled by the Bayesian approach to statistical inference.

4 Workshop Outline (continued) II. BNP regression models (30 min.) A. General Overview of Bayesian Inference and MCMC. 1. Prior-to-posterior updating. 2. Posterior predictive inference. 3. Posterior predictive model fit evaluation. 4. MCMC methods, and convergence assessment. 4

5 Workshop Outline (continued) II. BNP regression models (30 min.) B. BNP Regression, based on infinite mixtures 1. General BNP (density) regression model a) General Dependent Dirichlet process (DDP) b) Dirichlet process (DP) c) General stick-breaking process d) Other normalized random measures 2. ANOVA/Linear DDP model (De Iorio et al. 2004) 3. Infinite probits model (Karabatsos & Walker, 2012b) 4. Infinite probits model, with automatic covariate selection 5

6 Workshop Outline (continued) III. How to apply BNP regression models for data analysis (40 minutes), using my Bayesian Regression Software. A. PIRLS data (students nested within schools) 1. Dependent variable: reading achievement. Covariates of the student, teacher, classroom, & school. 2. Binary regression, and ordinal regression. 3. Multilevel BNP regression analysis via the DDP. 4. Causal analysis, regression discontinuity design (effect of class size on student achievement). B. Larynx data: Censored survival data (survival analysis). C. School Calendar data: Meta-analysis. D. Item response data (NAEP). BNP-IRT analysis. (All data applications above are time-permitting.) IV. Conclusions.

7 Regression and the Social Sciences Social science research questions usually have the form: How does Y depend on x = (x_1, …, x_p)? Examples of such questions: Association of Y and X_1, given X_2 = x_2, …, X_p = x_p? Association of Y and (X_1, X_2), given X_3 = x_3, …, X_p = x_p? Which of (X_1, …, X_p) are important predictors of Y? How do item responses Y_ij relate to examinee ability and to item difficulty? As in Item Response Theory (IRT). (x: indicators (0,1) of examinees; indicators (0,1) of items.) How is Y causally affected by X_1? How is Y causally affected by X_1, given X_2 = x_0? Possibly with some Y observations censored (survival analysis).

8 Regression and the Social Sciences Social science research questions usually have the form: How does Y depend on x = (x_1, …, x_p)? It is often of interest to investigate this dependence based on one or more features of the distribution of Y: Mean of Y (common approach; as in linear regression) Variance of Y (for variance regression) Median of Y (for median regression) Quantiles of Y (for quantile regression; incl. median reg.) Survival function of Y (for survival analysis) Hazard function of Y (for survival analysis) p.d.f. of Y (for density regression) c.d.f. of Y (for density regression)

9 Linear Regression (least-squares) Linear model: y_i = x_i'β + ε_i, ε_i ~ N(0, σ²), i = 1, …, n; i.e., f(y_i | x_i; ζ) = n(y_i | x_i'β, σ²), i = 1, …, n. Notation: x'β = β_0 + β_1 x_1 + ⋯ + β_p x_p, with x = (1, x_1, …, x_p)'. σ² is the variance of the regression errors (the ε_i). N(μ, σ²) is the normal distribution, with mean and variance parameters (μ, σ²), and with probability density function (p.d.f.) n(· | μ, σ²) (the p.d.f. is a bell-shaped curve). f(y | x; ζ) is the likelihood of y conditional on x, given all model parameters ζ. For the linear regression model, ζ = (β, σ²).

10 Linear Regression (least-squares) Linear model: y_i = x_i'β + ε_i, ε_i ~ N(0, σ²), i = 1, …, n; i.e., f(y_i | x_i; ζ) = n(y_i | x_i'β, σ²), i = 1, …, n. Given sample data D_n = {(y_i, x_i)}_{i=1:n}, the maximum-likelihood estimate (MLE) of (β, σ²) is defined by: (β̂, σ̂²) = argmax over (β, σ²) ∈ R^{p+1} × R_+ of ∏_{i=1}^n n(y_i | x_i'β, σ²). For the linear model, the MLE is given by ordinary least-squares (OLS): β̂ = (X'X)⁻¹X'y and σ̂² = (1/n) Σ_{i=1}^n (y_i − x_i'β̂)², where X = ((1, x_i'))_{n×(p+1)} is the design matrix and y = (y_1, …, y_n)'.
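A minimal sketch of this OLS/MLE computation in Python, on simulated data (the n, p, true β, and error s.d. below are illustrative assumptions, not values from the workshop):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
beta_true = np.array([1.0, 0.5, -0.3, 0.0])
y = X @ beta_true + rng.normal(scale=0.8, size=n)

# OLS/MLE: beta_hat = (X'X)^{-1} X'y ; sigma2_hat = (1/n) sum_i (y_i - x_i'beta_hat)^2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.mean((y - X @ beta_hat) ** 2)
print(beta_hat, sigma2_hat)
```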

11 Bayesian Linear Regression Model f(y_i | x_i; ζ) = n(y_i | x_i'β, σ²), i = 1, …, n; β | σ² ~ N_{p+1}(m, σ²V); σ² ~ IG(a, b) (a = shape, b = rate). Prior density: π(β, σ²) = n_{p+1}(β | m, σ²V) ig(σ² | a, b), where n_{p+1}(· | m, Σ) is the density function of the (p+1)-variate normal N_{p+1}(m, Σ), and ig(· | a, b) is the density function of the inverse-gamma distribution IG(a, b). The data, D_n = {(y_i, x_i)}_{i=1:n}, update the prior π(β, σ²) to a posterior density: π(β, σ² | D_n) = ∏_{i=1}^n n(y_i | x_i'β, σ²) π(β, σ²) / ∬ ∏_{i=1}^n n(y_i | x_i'β, σ²) π(β, σ²) dβ dσ². The posterior distribution (density), π(β, σ² | D_n), conveys the probable values of the model parameters (β, σ²), given the data (D_n) and the prior π(β, σ²).

12 Bayesian Linear Regression Model We can estimate the posterior distribution (density) π(β, σ² | D_n) by constructing a Markov (MCMC) chain {(β, σ²)^(s)}_{s=1:S}. First, initialize with (β, σ²)^(0) = (β, σ²) and stage s = 1. Then this chain is constructed by sampling/drawing from the following full conditional posterior distributions (f.c.p.d.s), in the following Sampling Steps: Sampling Step 1. Draw β | X, y, σ² ~ N(m*, σ²V*). Sampling Step 2. Draw σ² | X, y, β ~ IG(a*, b*). Then set (β, σ²)^(s) = (β, σ²), s = s + 1, and repeat Steps 1-2 if s < S. Notation: V* = (V⁻¹ + X'X)⁻¹, m* = V*(V⁻¹m + X'y), a* = a + n/2, b* = b + (1/2)(m'V⁻¹m + y'y − (m*)'(V*)⁻¹m*). As S → ∞, the chain {(β, σ²)^(s)}_{s=1:S} converges to samples from the posterior distribution (density), π(β, σ² | D_n).
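A minimal sketch of this two-step sampler in Python, assuming the conjugate prior above (m, V, a, b supplied by the user); because the prior is conjugate, the σ² draw here does not depend on β, so each pass yields an exact posterior draw:

```python
import numpy as np

def gibbs_linear(X, y, m, V, a, b, S=5000, seed=1):
    rng = np.random.default_rng(seed)
    n, k = X.shape
    Vinv = np.linalg.inv(V)
    Vstar_inv = Vinv + X.T @ X
    Vstar = np.linalg.inv(Vstar_inv)              # V* = (V^{-1} + X'X)^{-1}
    mstar = Vstar @ (Vinv @ m + X.T @ y)          # m* = V*(V^{-1}m + X'y)
    astar = a + n / 2.0
    bstar = b + 0.5 * (m @ Vinv @ m + y @ y - mstar @ Vstar_inv @ mstar)
    draws = np.empty((S, k + 1))
    for s in range(S):
        # Step 2: sigma^2 | X, y ~ IG(a*, b*), drawn as 1 / Gamma(shape=a*, rate=b*)
        sigma2 = 1.0 / rng.gamma(astar, 1.0 / bstar)
        # Step 1: beta | X, y, sigma^2 ~ N(m*, sigma^2 V*)
        beta = rng.multivariate_normal(mstar, sigma2 * Vstar)
        draws[s] = np.r_[beta, sigma2]
    return draws
```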

13 OLS/MLE & Bayesian Connections Recall that the prior density of the linear model is: π(β, σ²) = n_{p+1}(β | m, σ²V) ig(σ² | a, b). If the prior density is non-informative or neutral, i.e., with m = 0, V⁻¹ → 0, a = b = 0, so that the density π(β, σ²) is flat over the range of (β, σ²), then the mode of the posterior density π(β, σ² | D_n) coincides with the OLS/MLE estimator (β̂, σ̂²), and m* = β̂ = E[β | D_n] = V*X'y, with V* = (X'X)⁻¹. Then the sampling distribution of β̂ (frequentist statistics) coincides with the posterior distribution of β given σ²: β | σ², D_n ~ N(m*, σ²V*).

14 Extensions of Bayesian Linear Model Extension to a binary dependent variable, Y ∈ {0, 1}: Likelihood: f(y | x; ζ) = Φ(x'β/σ)^y [1 − Φ(x'β/σ)]^{1−y}, with Φ(·) = Pr[Y* ≤ ·]. Probit model: Φ(·) is the c.d.f. of the normal N(0,1) distribution. Logit model: Φ(·) = exp(·)/[1 + exp(·)] is the Logistic(0,1) c.d.f. MCMC estimation methods are the same as in the linear model, after adding an MCMC sampling step that draws the latent responses: y*_i | D_n, β ~ N(x_i'β, σ²) 1(y*_i > 0)^{y_i} 1(y*_i < 0)^{1−y_i}, i = 1, …, n, and after replacing y with y* = (y*_1, …, y*_i, …, y*_n). Probit model: Fix σ² ≡ 1 (Albert & Chib, 1993). Logit model: Add another MCMC step to sample σ² from the Kolmogorov-Smirnov distribution (Holmes & Held, 2006).
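A minimal sketch of the added data-augmentation step for the probit model (Albert & Chib, 1993) in Python: each latent y*_i is drawn from a normal truncated to (0, ∞) if y_i = 1 and to (−∞, 0) if y_i = 0; X, y, and the current β are assumed to come from the surrounding Gibbs sampler, with σ ≡ 1:

```python
import numpy as np
from scipy.stats import truncnorm

def draw_latent_probit(X, y, beta, rng):
    mu = X @ beta
    # truncnorm is parameterized in standard-normal units, so bounds are shifted by -mu
    lower = np.where(y == 1, -mu, -np.inf)   # y_i = 1: y*_i > 0
    upper = np.where(y == 1, np.inf, -mu)    # y_i = 0: y*_i < 0
    return mu + truncnorm.rvs(lower, upper, random_state=rng)
```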

15 Extensions of Bayesian Linear Model Extension to an ordinal dependent variable, Y ∈ {c = 0, 1, …, C}: Likelihood, cumulative logits or probits model: f(y | x; ζ) = ∏_c [Φ({w_c − x'β}/σ) − Φ({w_{c−1} − x'β}/σ)]^{1(y=c)}. Add a prior π(w) ∝ 1(w_{−1} < w_0 = 0 < w_1 < w_2 < ⋯ < w_C) for the cutoff parameters to the linear model. Then MCMC estimation methods are the same as in the linear model, after adding MCMC sampling steps to draw from the following f.c.p.d.s: y*_i | D_n, w ~ N(x_i'β, σ²) ∏_c 1(w_{c−1} < y*_i < w_c)^{1(y_i=c)} (for i = 1, …, n); w_c | D_n ∝ ∏_{i=1:n} n(y*_i | x_i'β, σ²) 1(w_{c−1} < w_c < w_{c+1}) (c = 1, …, C−1), and after replacing y with y* = (y*_1, …, y*_i, …, y*_n).

16 Extensions of Bayesian Linear Model Extension to a censored dependent variable, Y_i: Suppose that a continuous-valued response y_i (e.g., log survival time) is censored, i.e., known only to fall within a fixed interval A_i = [Y_LBi, Y_UBi]. Interval censored: −∞ < Y_LBi < Y_UBi < ∞. Right censored: −∞ < Y_LBi < Y_UBi = ∞. Left censored: −∞ = Y_LBi < Y_UBi < ∞. Likelihood of the censored response: f(y_i | x_i; ζ) = Φ({Y_UBi − x_i'β}/σ) − Φ({Y_LBi − x_i'β}/σ). MCMC estimation methods are the same as in the linear model, after adding an MCMC sampling step to draw from the f.c.p.d.: y*_i | D_n ~ N(x_i'β, σ²) 1(y*_i ∈ A_i), and after replacing y_i with y*_i.

17 Extensions of Bayesian Linear Model Bayesian HLM / multilevel model: Likelihood: f(y | x; ζ) = n(y | x'β + z'u_h, σ²), for Level-2 groups h = 1, …, H, adding priors u_h | T ~ iid N(0, T) and T ~ π(T) to the linear model. Then MCMC estimation methods are the same as in the linear model, after adding sampling steps in the MCMC algorithm to draw from the following f.c.p.d.s: u_h | D_n, T ∝ ∏_{i=1:n} n(y_i | x_i'β + z_i'u_h, σ²) n(u_h | 0, T) (for h = 1, …, H); T | D_n ∝ ∏_{h=1:H} n(u_h | 0, T) π(T).

18 Why Nonparametric? Why Bayesian nonparametric (BNP) regression? Modeling flexibility is needed for real data sets (typically complex): Simple linear/HLM model → stronger (less credible) data assumptions → inaccurate statistical inferences. More flexible BNP regression model → weaker (more credible) data assumptions → more accurate statistical inferences. BNP also provides a richer approach to data analysis: Linear model/HLM: How does the mean of Y change with x? BNP regression: How does Y change with x? For the mean, variance, p.d.f. (density regression), quantiles (quantile regression), and survival functions of Y. BNP-HLM: Supports all random-effect distributions. Also provides a model-based cluster analysis.

19 [Figure: Scatterplot of student literacy score (z-standardized) vs. years of teaching experience (TEXP). Blue circles: data points (n = 565). Solid red line: normal linear model fit, the OLS estimate of E[Y | x]. Dashed lines: ±2 times the error s.d. estimate.]

20 [Figure: Probability density function (p.d.f.) estimates of the literacy score, in panels at several values of TEXP. Red: normal linear model p.d.f. Blue: estimate of the true p.d.f.]

21 Why Bayes Nonparametric? Why Bayes? A BNP regression model provides a very flexible and rich approach to data analysis by employing an infinite number of parameters. The estimation of the parameters of these models from data can be handled by the Bayesian approach to statistical inference. Hence, before we describe BNP regression models, we give a brief overview of the general approach to Bayesian statistical inference and MCMC.

22 Overview of Bayesian Modeling In general, consider a model (m) with likelihood density function f(· | x; ζ_m), parameter ζ_m ∈ Ω_m (⊆ R^K), and a prior probability density π_m(ζ_m) assigned over the space Ω_m. Data, D_n = {(y_i, x_i)}_{i=1:n}, update the prior π_m(ζ_m) to a posterior density: π_m(ζ_m | D_n) = ∏_{i=1}^n f(y_i | x_i; ζ_m) π_m(ζ_m) / ∫ ∏_{i=1}^n f(y_i | x_i; ζ_m) π_m(ζ_m) dζ_m, with Π_m(ζ_m | D_n) the c.d.f. of π_m(ζ_m | D_n). The posterior density π_m(ζ_m | D_n) conveys the probable values of ζ_m, given the data D_n = {(y_i, x_i)}_{i=1:n} and the prior π_m(ζ_m).

23 Overview of Bayesian Modeling Model predictions from the prior predictive density: f(y | x, m) = ∫ f(y | x; ζ_m) dΠ_m(ζ_m). f(y | x, m) is the likelihood of y given model m and prior Π_m(ζ_m). Bayes factor (B_12): Evidence in favor of model M_1 vs. model M_2 (Jeffreys, 1961; Kass & Raftery, 1995). Table of evidence:
B_12      | 2 log B_12 | Evidence in favor of M_1 vs. M_2
1 to 3    | 0 to 2     | Not worth more than a bare mention
3 to 20   | 2 to 6     | Positive
20 to 150 | 6 to 10    | Strong
> 150     | > 10       | Very strong
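A quick worked reading of the table: a Bayes factor of B_12 = 30 gives 2 log B_12 = 2 ln(30) ≈ 6.8, which falls in the 6-to-10 band, i.e., strong evidence in favor of M_1 over M_2.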

24 Overview of Bayesian Modeling Model predictions from the posterior predictive density: f_n(y | x, m) = ∫ f(y | x; ζ_m) dΠ(ζ_m | D_n); c.d.f. F_n(y | x, m) = ∫_{Y≤y} f(y | x; ζ_m) dΠ(ζ_m | D_n). Model predictive functionals: Mean E_n[Y | x, m] = ∫ y dF_n(y | x, m). Variance Var_n[Y | x] = ∫ (y − E_n[Y | x])² dF_n(y | x, m). c.d.f. F_n(y | x, m) = Pr_n(Y ≤ y | x, m). Median F_n⁻¹(.5 | x, m). Quantile F_n⁻¹(u | x, m), for a choice u ∈ [0, 1]. Survival function 1 − F_n(y | x, m). Hazard function f_n(y | x, m) / [1 − F_n(y | x, m)]. Etc.

25 Overview of Bayesian Modeling Partial dependence (PD) method (Friedman, 2001): Let x_S ⊂ x denote 1 or 2 covariates of interest for studying predictions, with x = x_S ∪ x_C and x_S ∩ x_C = ∅. PD posterior predictives: Mean E_n[y | x_S] = (1/n) Σ_{i=1}^n E_n[y | x_S, x_Ci]. Variance Var_n[y | x_S] = (1/n) Σ_{i=1}^n Var_n[y | x_S, x_Ci]. Density f_n(y | x_S) = (1/n) Σ_{i=1}^n f_n(y | x_S, x_Ci). c.d.f. F_n(y | x_S) = (1/n) Σ_{i=1}^n F_n(y | x_S, x_Ci). Median F_n⁻¹(.5 | x_S) = (1/n) Σ_{i=1}^n F_n⁻¹(.5 | x_S, x_Ci). Quantile F_n⁻¹(u | x_S) = (1/n) Σ_{i=1}^n F_n⁻¹(u | x_S, x_Ci), u ∈ [0, 1]. Survival 1 − F_n(y | x_S) = (1/n) Σ_{i=1}^n [1 − F_n(y | x_S, x_Ci)]. Etc. One can then study such PD predictions over a range of x_S, via scatterplots, Trellis plots, etc.
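A minimal sketch of the PD posterior-predictive mean in Python: the mean at x_S is averaged over the n observed values of the complementary covariates x_C; `pred_mean` stands in for the model's posterior predictive mean E_n[Y | x] (an assumed user-supplied function, not part of the workshop software):

```python
import numpy as np

def pd_mean(xS_grid, XC, pred_mean):
    # xS_grid: values of the covariate(s) of interest x_S
    # XC: (n, q) array of observed complementary covariates x_C
    # pred_mean(xS, xC): posterior predictive mean at the combined covariate vector
    return np.array([np.mean([pred_mean(xS, xC) for xC in XC]) for xS in xS_grid])
```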

26 Overview of Bayesian Modeling Measures of the predictive fit of a model (m), for data D_n: Response y_i fit: standardized residual z_i = {y_i − E_n(Y | x_i, m)} / √Var_n(Y | x_i, m); y_i is an outlier when |z_i| > 2. Model fit: D(m) criterion: D(m) = Σ_{i=1}^n {y_i − E_n(Y | x_i, m)}² + Σ_{i=1}^n Var_n(Y | x_i, m), a predictive mean-square criterion. The first term measures data fit. The second term is a penalty, which increases with model complexity (Laud & Ibrahim, 1995; Gelfand & Ghosh, 1998). Among a set of M models compared, {m = 1, …, M}, the model with the best predictive fit for the data is identified as the one with the smallest value of D(m).
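A minimal sketch of these fit measures in Python, given arrays of posterior predictive means and variances at each x_i (assumed already computed from the model's MCMC output):

```python
import numpy as np

def fit_measures(y, pred_mean, pred_var):
    z = (y - pred_mean) / np.sqrt(pred_var)              # standardized residuals z_i
    D = np.sum((y - pred_mean) ** 2) + np.sum(pred_var)  # data fit + complexity penalty
    outliers = np.abs(z) > 2                             # flag |z_i| > 2 as outliers
    return z, D, outliers
```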

27 Bayesian Modeling and MCMC Posterior density of model: π_m(ζ_m | D_n) = ∏_{i=1}^n f(y_i | x_i; ζ_m) π_m(ζ_m) / ∫ ∏_{i=1}^n f(y_i | x_i; ζ_m) π_m(ζ_m) dζ_m. In typical applications of Bayesian modeling, π(ζ | D_n) cannot be solved directly, especially when the dimensionality of ζ is high and/or when the prior π(ζ) is not conjugate to the likelihood f(· | x; ζ). To infer the posterior π(ζ | D_n), and all posterior functionals of interest, one can use Markov chain Monte Carlo (MCMC). MCMC involves constructing a discrete-time Harris ergodic Markov chain {ζ^(s)}_{s=1:S} having stationary distribution Π(ζ | D_n). Ergodicity is ensured by a proper prior π(ζ) (Robert & Casella, 2004). As S → ∞, the chain {ζ^(s)}_{s=1:S} converges to samples from the posterior Π(ζ | D_n).

28 Bayesian Modeling and MCMC Posterior density of model: π_m(ζ_m | D_n) ∝ ∏_{i=1}^n f(y_i | x_i; ζ_m) π_m(ζ_m). An MCMC chain {ζ^(s)}_{s=1:S} can be constructed by sampling blocks ζ_b, for b = 1, …, B, of ζ, repeatedly S times (∪_b ζ_b = ζ and ζ_b ∩ ζ_c = ∅ for all b ≠ c). At each stage s, sample ζ_mb^(s) ~ π(ζ_mb | D_n, ζ_{m/b}), for b = 1, …, B, with ζ_{m/b} = (ζ_m1^(s), …, ζ_{m,b−1}^(s), ζ_{m,b+1}^(s−1), …, ζ_mB^(s−1)). Gibbs sampler: for sampling a known π(ζ_b | D_n, ζ_{m/b}). Rejection sampler: for sampling π(ζ_b | D_n, ζ_{m/b}) when we only know π(ζ_mb | D_n, ζ_{m/b}) ∝ ∏_{i=1:n} f(y_i | x_i; ζ_m) π(ζ_mb | ζ_{m/b}); e.g., Metropolis-Hastings, adaptive rejection, or slice sampling.
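A minimal sketch of one random-walk Metropolis update for a block ζ_b whose f.c.p.d. is known only up to proportionality; `log_target` is an assumed user-supplied function returning the log of ∏_i f(y_i | x_i; ζ) times the block's prior, as a function of ζ_b:

```python
import numpy as np

def metropolis_step(zeta_b, log_target, step, rng):
    prop = zeta_b + step * rng.normal(size=zeta_b.shape)  # random-walk proposal
    log_accept = log_target(prop) - log_target(zeta_b)    # log acceptance ratio
    return prop if np.log(rng.uniform()) < log_accept else zeta_b
```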

29 Bayesian Modeling and MCMC Given a sampled MCMC chain, {ζ^(s)}_{s=1:S}, check whether S is large enough for the chain to provide samples from the posterior distribution (density), π(ζ | D_n), by performing an MCMC convergence analysis (2 steps; Geyer 2011): 1. Trace plots: For each univariate parameter ζ of interest, make a trace plot of the MCMC samples {ζ^(s)}_{s=1:S} to verify good mixing, i.e., that the samples appear to adequately explore the posterior π(ζ | D_n). The plot should look stable and hairy. 2. 95% Monte Carlo confidence intervals (MCCIs): For each parameter, and each posterior estimate of interest, use overlapping batch means and/or sub-sampling methods (Flegal & Jones, 2011) to calculate the 95% MCCI and verify that it is small enough for practical purposes. If the trace plots and MCCIs are not both satisfactory, then run additional MCMC samples until they are.
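A minimal sketch of a 95% MCCI for a posterior mean via batch means (a simpler non-overlapping variant of the Flegal & Jones 2011 methods); `chain` is a 1-D array of MCMC draws of one parameter:

```python
import numpy as np

def mcci_batch_means(chain, n_batches=30):
    S = len(chain) - len(chain) % n_batches           # trim to a multiple of n_batches
    batches = chain[:S].reshape(n_batches, -1).mean(axis=1)
    est = chain[:S].mean()
    se = batches.std(ddof=1) / np.sqrt(n_batches)     # Monte Carlo standard error
    return est - 1.96 * se, est + 1.96 * se           # 95% MCCI for the posterior mean
```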

30 BNP Regression Modeling A flexible and rich approach to BNP regression is provided by Bayesian density regression models. Bayesian density regression is a relatively new research area that has already seen many important developments. Nearly all proposed Bayesian density regression models have the general form: f(y | x; ζ) = ∫ f(y | x, θ, ξ) dG_x(θ) = Σ_{j=1}^∞ ω_j(x) f(y | x, θ_j(x), ξ).

31 General BNP Model f(y | x; ζ) = ∫ f(y | x, θ, ξ) dG_x(θ) = Σ_{j=1}^∞ ω_j(x) f(y | x, θ_j(x), ξ). Here {f(· | x, θ, ξ) : (θ, ξ)} is a chosen family of parametric densities (kernel densities); G_x = Σ_{j=1}^∞ ω_j(x) δ_{θ_j(x)} is the mixing distribution; the ω_j(x) are mixing weights that sum to 1 at every x ∈ X; δ_{θ(x)} is the probability measure degenerate at θ(x); {ω_j(x)}_{j=1,2,…} and {θ_j(x)}_{j=1,2,…} are infinite collections of processes indexed by X; and ξ collects other model parameters, not part of the mixture.

32 General BNP Model f(y | x; ζ) = Σ_{j=1}^∞ ω_j(x) f(y | x, θ_j(x), ξ). The Bayesian density regression model is completed by the specification of a prior distribution Π(ζ) on the parameters ζ = (ξ, (ω_j(x), θ_j(x))_{j=1,2,…}), x ∈ X. Current research aims to construct such prior distributions so as to guarantee large supports for f(y | x; ζ). The general model encompasses linear models, normal LMMs/HLMs, GLMs, GLMMs, finite-mixture latent class models, hierarchical mixtures-of-experts regression models, and mixtures of Gaussian process regressions.

33 Examples, General BNP Model f(y | x; ζ) = ∫ f(y | x, θ, ξ) dG_x(θ) = Σ_{j=1}^∞ ω_j(x) f(y | x, θ_j(x), ξ). Most Bayesian density regression models are based on the Dependent Dirichlet Process (DDP) (MacEachern, 1999, 2000, 2001). The DDP prior is denoted by: G_x ~ DDP(α, G_0x). A random distribution G_x from a DDP is constructed by: G_x = Σ_{j=1}^∞ ω_j(x) δ_{θ_j(x)}, with stick-breaking weights ω_j(x) = v_j(x) ∏_{l=1}^{j−1} [1 − v_l(x)], j = 1, 2, …, where v_j ~ Q_j, v_j : X → [0, 1], and atoms θ_j(x) ~ ind G_0x (Sethuraman, 1994).

34 Examples, General BNP Model f(y | x; ζ) = ∫ f(y | x, θ, ξ) dG(θ) = Σ_{j=1}^∞ ω_j f(y | x, θ_j, ξ). The ordinary Dirichlet process (DP; Ferguson, 1973) is not covariate-dependent, and is denoted by: G ~ DP(α, G_0), where the random distribution function G = Σ_{j=1}^∞ ω_j δ_{θ_j} is constructed by covariate-independent stick-breaking weights ω_j = v_j ∏_{l=1}^{j−1} (1 − v_l), with v_j ~ iid Beta(1, α), and atoms θ_j ~ iid G_0, j = 1, 2, …. Under DP(α, G_0), E[G(·)] = G_0(·) and Var[G(·)] = {G_0(·)[1 − G_0(·)]} / (α + 1).
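A minimal sketch of Sethuraman's stick-breaking construction of a DP(α, G_0) draw in Python, truncated at J atoms; G_0 is taken to be N(0, 1) purely for illustration:

```python
import numpy as np

def dp_stick_breaking(alpha, J=500, seed=0):
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=J)                # v_j ~ iid Beta(1, alpha)
    w = v * np.cumprod(np.r_[1.0, 1.0 - v[:-1]])    # w_j = v_j * prod_{l<j} (1 - v_l)
    atoms = rng.normal(size=J)                      # theta_j ~ iid G_0 = N(0, 1)
    return w, atoms                                 # G ~= sum_j w_j * delta_{theta_j}
```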

35 Examples, General BNP Model f(y | x; ζ) = Σ_{j=1}^∞ ω_j f(y | x, θ_j, ξ). The more general nonparametric stick-breaking process, G ~ Stick-Breaking((a_j, b_j)_{j=1,2,…}, G_0), for the stick-breaking mixture weights ω_j = v_j ∏_{l=1}^{j−1} (1 − v_l), lets v_j ~ ind Beta(a_j, b_j), j = 1, 2, … (Ishwaran & James, 2001). Special cases of the stick-breaking process: G ~ DP(α, G_0), when a_j = 1 and b_j = α > 0. G ~ PitmanYor(a, b, G_0), when a_j = 1 − a and b_j = b + ja, with 0 ≤ a < 1 and b > −a.

36 ANOVA/Linear DDP Mixture Model f(y | x; ζ) = ∫ f(y | x, θ, ξ) dG_x(θ) = Σ_{j=1}^∞ ω_j(x) f(y | x, θ_j, ξ). A multi-level model, for groups h = 1, …, N with within-group observations i(h) = 1, …, n_h:
y_{i(h)} | X_h ~ f(y_h | X_h), h = 1, …, N
f(y_h | X_h) = ∏_{i(h)=1}^{n_h} Σ_{j=1}^∞ ω_j n(y_{i(h)} | x_{i(h)}'β_j, σ_j²)
ω_j = v_j ∏_{l=1}^{j−1} (1 − v_l), v_j ~ iid Beta(1, α)
(β_j | μ, T) ~ iid N(μ, T), σ_j² ~ iid IG(a_0/2, a_0/2)
μ ~ N(0, r_0 I_{p+1}), T ~ IW(p + 3, s_0 I_{p+1})
α ~ Ga(a, 1/b)
(De Iorio, Müller, Rosner, & MacEachern, 2004)

37 ANOVA/Linear DDP Mixture Model f(y | x; ζ) = ∫ f(y | x, θ, ξ) dG_x(θ) = Σ_{j=1}^∞ ω_j(x) f(y | x, θ_j(x), ξ). While the ANOVA/linear DDP model, as represented on the previous slide, is a mixture model with mixing distribution G ~ Stick-Breaking((a_j, b_j)_{j=1,2,…}, G_0), the model is identical to a model with mixing distribution G_x ~ ANOVA-DDP((a_j, b_j)_{j=1,2,…}, G_0): G_x = Σ_{j=1}^∞ ω_j δ_{θ_j(x)}, with θ_j(x) = x'β_j, β_j ~ iid G_0 = N(μ, T), and normal kernel n(y | θ, σ²). (De Iorio, Müller, Rosner, & MacEachern, 2004, p. 208)

38 Infinite Probits (IP) Model (v1) f(y | x; ζ) = ∫ f(y | θ, ξ) dG_x(θ) = Σ_j ω_j(x) f(y | θ_j, ξ):
y_i | x_i ~ f(y | x_i), i = 1, …, n
f(y | x) = Σ_j n(y | μ_j, σ_j²) ω_j(x)    (Φ(·) is the N(0,1) c.d.f.)
ω_j(x) = Φ([j − x'β] exp(x'λ)^{−1/2}) − Φ([j − 1 − x'β] exp(x'λ)^{−1/2})
(μ_j, σ_j²) | (μ, σ_μ²) ~ N(μ, σ_μ²) IG(1, ·)
with hyperpriors μ ~ N(0, σ_0²), σ_μ ~ U(0, b), and Ga(a_0/2, a_0/2) and N(0, vI) priors on the remaining weight parameters.
(Karabatsos & Walker, 2012b)
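A minimal sketch of the probit-difference mixture weights as reconstructed above, with mean x'β and scale exp(x'λ)^{1/2} (this parameterization is my reading of the garbled slide, so treat it as an assumption); summed over a wide enough range of j, the weights total 1 at every x:

```python
import numpy as np
from scipy.stats import norm

def ip_weights(x, beta, lam, j_range=np.arange(-50, 51)):
    mu = x @ beta                                   # mean of the weight process at x
    s = np.sqrt(np.exp(x @ lam))                    # scale of the weight process at x
    w = norm.cdf((j_range - mu) / s) - norm.cdf((j_range - 1 - mu) / s)
    return j_range, w                               # w sums to ~1 over a wide j_range
```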

39 [Figure: Mixture weights ω_j(x) vs. index j, and the corresponding densities f(y | x), for several values of σ(x).] σ(x) controls the level of multimodality of f(y | x): x more informative → f(y | x) becomes more unimodal; x less informative → f(y | x) becomes more multimodal.

40 Infinite Probits (IP) Model (v2) f(y | x; ζ) = ∫ f(y | θ, ξ) dG_x(θ) = Σ_j ω_j(x) f(y | θ_j, ξ):
y_i | x_i ~ f(y | x_i), i = 1, …, n
f(y | x) = Σ_j n(y | μ_j, σ_j²) ω_j(x)    (Φ(·) is the N(0,1) c.d.f.)
ω_j(x) = Φ([j − x'β] exp(x'λ)^{−1/2}) − Φ([j − 1 − x'β] exp(x'λ)^{−1/2})
(μ_j, σ_j²) | (μ, σ_μ²) ~ N(μ, σ_μ²) IG(1, ·), with μ ~ N(0, σ_0²), σ_μ ~ U(0, b), and Ga(a_0/2, a_0/2) as in v1.
Covariate selection priors: β | γ ~ N(0, diag(v_1^{γ_k} v_0^{1−γ_k})), γ_k ~ iid Ber(1/2), k = 0, 1, …, p.
(Karabatsos & Walker, 2012b)

41 Infinite Probits (IP) Model (v3) f(y | x; ζ) = Σ_j ω_j(x) f(y | x, θ_j(x), ξ):
y_i | x_i ~ f(y | x_i), i = 1, …, n
f(y | x) = Σ_j n(y | μ_j(x), σ_j²) ω_j(x)    (Φ(·) is the N(0,1) c.d.f.)
μ_j(x) = x'β_j, with weights ω_j(x) as in v1; σ_ω ~ U(0, b)
β_j0 ~ N(0, σ²); β_jk | σ², γ_k ~ N(0, σ² v_1^{γ_k} v_0^{1−γ_k}), γ_k ~ Ber(w), k = 1, …, p
σ² ~ IG(a_0/2, a_0/2); λ ~ N(0, σ² vI); σ² ~ IG(a/2, a/2)
(Karabatsos & Walker, 2012b)

42 MCMC Sampling of BNP models The standard Gibbs samplers of Kalli et al. (2010) can be used to perform MCMC sampling of BNP regression models, combined with standard Gibbs sampling methods for sampling the f.c.p.d.s of parameters of ordinary linear/HLM models that are assigned conjugate priors, and possibly combined with rejection sampling methods (e.g., Metropolis or slice sampling) for sampling the f.c.p.d.s of parameters that are assigned non-conjugate priors. MCMC extensions of the linear DDP model to binary, ordinal, or censored Y are similar to those for the linear model, after replacing N(x_i'β, σ²) with N(x_i'β_{z_i}, σ²) in the f.c.p.d.s for the y*_i's, along with the f.c.p.d. for each latent component label, z_i ∝ n(y*_i | x_i'β_{z_i}, σ²) ω_{z_i}. MCMC extensions for v1 or v2 (or v3) of the infinite probits model would instead use N(μ_{z_i}, σ_{z_i}²), with f.c.p.d. z_i ∝ n(y*_i | μ_{z_i}, σ_{z_i}²) ω_{z_i}(x_i) (respectively, N(μ_{z_i} + x_i'β, σ²) and z_i ∝ n(y*_i | μ_{z_i} + x_i'β, σ²) ω_{z_i}(x_i) for v3). See Karabatsos & Walker (2012a,b).
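A minimal sketch of the latent-label draw common to these mixture samplers: each z_i is drawn from its f.c.p.d., proportional to ω_j(x_i) n(y_i | μ_j, σ_j²). This sketch truncates the mixture at J components for simplicity; the slice sampler of Kalli et al. (2010) avoids the truncation:

```python
import numpy as np
from scipy.stats import norm

def draw_labels(y, W, mu, sigma, rng):
    # y: (n,) responses; W: (n, J) weights w_j(x_i); mu, sigma: (J,) kernel parameters
    p = W * norm.pdf(y[:, None], loc=mu, scale=sigma)  # (n, J) unnormalized f.c.p.d.
    p /= p.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(mu), p=pi) for pi in p])
```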

43 Applying BNP Regression Applying BNP regression models for data analysis, using the Bayesian Regression Software. A. PIRLS data (students nested within schools) 1. Dependent variable: reading achievement. Covariates of the student, teacher, classroom, & school. 2. Binary regression, and ordinal regression. 3. Multilevel BNP regression analysis via the DDP. 4. Causal analysis, regression discontinuity design (effect of class size). B. Larynx data: Censored survival data (survival analysis). C. School Calendar data: Meta-analysis. D. Item response data (NAEP). BNP-IRT analysis. (All data applications above are time-permitting.)

44 Bayesian Regression Software Website for the Bayesian Regression Software: Software requirements: The software can run on a Windows (PC), Linux, or Macintosh computer, which has installed either (1) the (free) MATLAB Compiler Runtime (MCR) R2013b (8.2) software for Windows 64-bit (click here if you need to download and install the free MCR R2013b (8.2) software for Windows on your computer); or (2) MATLAB version R2013b. Click this link to download the Bayesian Regression software package (file BayesRegression_2014b_pkg.exe). (8 MB)

45 Bayesian Regression Software For a Macintosh or Linux computer, the Bayesian Regression software can be run under a free software program that can run executable (.exe) files, such as Wine, Wine Bottler, Darwine, VirtualBox, or Bochs PC Emulator. 45

46 Bayesian Regression Software After downloading the Bayesian Regression software onto your computer desktop, click the file (icon): BayesRegression_2014b_pkg.exe to unpack several files, including the software file BayesRegression_2014b.exe, a readme file, a folder containing several example data files, and the software user's manual. The software user's manual provides step-by-step instructions on how to analyze a given data set, using a model of your choice. Any one of the example data files may be used to illustrate the Bayesian Regression Software. The Bayesian Regression software is opened/run by clicking the icon (file) BayesRegression_2014b.exe (it may take a few moments to load on screen). 46

47 [Screenshot slide.]

48 You may use the Import data menu option. This will import your data set for analysis. The data set must be a comma-delimited (.csv) file (variable names in the first line; all data numeric). After the data is imported, the software will automatically rename the data file to have the *.DAT extension. If you already have a data file with the *.DAT extension, then you may just select the Open Data (*.DAT) file menu option (the first menu option).

49 Imported (or opened) data file appears below. 49

50 You may initially inspect the data by running basic descriptive statistics (summaries, frequencies, and/or correlations of variables) and/or by constructing plots of the data (e.g., histograms, Q-Q plots, scatter plots, box plots, etc.).

51 Univariate descriptive statistics for variables: zread, MALE, AGE, CLSIZE. Sample size, mean, median, s.d., min, max, percentiles, etc. 51

52 Bivariate correlations (Pearson r statistics) for the variables: zread, MALE, AGE, CLSIZE.

53 [Screenshot slide.]

54 Also, with any of these menu options, you may reformat the data: e.g., perform simple variable transformations, such as constructing z-scores; create new variables by dummy coding or creating interaction variables; impute missing values of variables; etc.

55 We will construct z-scores of 8 variables, which will be used as covariates for a BNP regression analysis that we will perform shortly. 55

56 After clicking through the Simple variable transformations > Z-score menu options, the new variables, the z-scores of the 8 variables, are instantly created.

57 Select a BNP regression model for data analysis. We select the infinite probits regression model (v1) (one of 43 possible model choices in the software).

58 Select the Dependent Variable for the model: zread, the student literacy score (a z-standardized version of READSC).

59 Select Covariates for the model: We select the z-standardized covariates: Z:MALE, Z:AGE, Z:CLSIZE, Z:ELL, Z:TEXP4, Z:EDLEVEL, Z:ENROL, Z:SCHSAFE.

60 Enter the Parameters of the Prior Distribution for the model: We select the default choices. 60

61 After menu selections, the Infinite Probits regression model appears (middle screen); along with a list of the choices of prior parameters (upper left), and a list of the choices of the dependent variables and covariates (upper right of screen). 61

62 When necessary (optional): (1) If you have censored data for your dependent variable, then you may select variables that indicate the nature of the censoring (e.g., for a survival analysis). (2) If your data observations are weighted, then you may select the variable for the observation weights (e.g., for a meta-analysis). See the software user's manual for more details.

63 Click the Run Posterior Analysis button. This will generate 50K MCMC samples from the model's posterior distribution. Before clicking the button, we entered 50000 in the MC samples box.

64 The analysis is in progress: A wait bar appears, indicating the current number of MCMC sampling iterations run (out of 50K), the time elapsed, the time left to 50K MCMC iterations, and the current value of the model fit (D(m) statistic).

65 After end of analysis run, this text output file is generated and opened automatically. Text display of BNP regression model. 65

66 Text display of chosen prior parameters of BNP regression model. 66

67 Text display of chosen dependent variable and covariates for BNP regression model. 67

68 Reports of fit of the BNP regression model: D(m) = 44.5; R-squared = .97; no outliers, as all standardized residuals were less than 1 in absolute value.

69 Posterior estimates of the parameters of the BNP regression model. 69

70 We need to evaluate the accuracy of the posterior estimates (reported in the text output), through an MCMC convergence analysis of the 50K MCMC samples that we generated. We can do so by clicking the Posterior Summaries button. Recall that in analyzing MCMC convergence, we need to generate, for all model parameters of interest: (1) trace plots, to evaluate MCMC mixing; and (2) 95% Monte Carlo confidence intervals, for posterior estimates of interest.

71 After clicking Posterior Summaries button, select menu option Trace Plots of MC samples. 71

72 Select the parameters that you want to view in the trace plots. 72

73 [Screenshot: trace plots of the selected parameters.]

74 Now we evaluate the 95% Monte Carlo confidence intervals (MCCIs). After clicking Posterior Summaries button, select menu option: Table: Posterior Summaries, with 95% MCCIs. 74

75 For the model parameters, the 95% MCCI half-widths are near .00 to 2 decimal places, for nearly all parameter estimates. Given this result and the trace plots, we conclude convergence. If convergence cannot be confirmed by either the trace plots or the 95% MCCIs, then we may generate additional MCMC samples, beyond the 50K MCMC samples already generated, in order to attain better MCMC convergence. Additional MCMC samples can be generated by clicking the Run Posterior Analysis button of the software again.

76 Under the model, we can investigate the posterior predictions of Y (zread), as a function of one or more covariates of interest. We can do so by clicking the Posterior Predictive button. 76

77 We will choose to investigate how the mean, variance, p.d.f., survival function, and various quantiles, namely the 10th, 25th, 50th (median), 75th, and 90th percentiles, of Y (zread) vary as a function of one covariate, school enrollment (Z:ENROL).

78 Now select the covariate, school enrollment (Z:ENROL). 78

79 We now choose to investigate how posterior predictions of Y (zread) vary as a function of the covariate Z:ENROL, for values of this covariate ranging from −1.7 to 1.7, with the values in this range separated by .1. Therefore, in the text box above, we enter −1.7 : .1 : 1.7.

80 When investigating how posterior predictions of Y (zread) vary as a function of the covariate Z:ENROL, we choose to make posterior predictions using the partial dependence method (Friedman, 2001).

81 Plot of the posterior predictive of Y (zread), as a function of zenrol, in terms of the mean and the 10th, 25th, 50th, 75th, and 90th percentiles of Y.

82 Plot of posterior predictive variance of Y (zread), as a function of zenrol. 82

83 Plot of posterior predictive p.d.f. of Y (zread), as a function of zenrol. 83

84 Plot of posterior predictive survival function of Y (zread), as a function of zenrol. 84

85 Output table of specific numerical estimates of the posterior predictives of Y (zread), as a function of zenrol. Comma-delimited output files of these results are also generated. Also, comma-delimited output files of all posterior predictive results are generated, in case the data analyst wants to further analyze or plot these results on her/his own.

86 Run another posterior predictive analysis by clicking the Posterior Predictive button of the software again. Here, a plot of the posterior predictive p.d.f. of Y (zread), by gender group, using the partial dependence method. zmale = −1.02 is female, and zmale = .98 is male.

87 Plot of the posterior predictive of Y (zread), for male vs. female, in terms of the mean and the 10th, 25th, 50th, 75th, and 90th percentiles of Y. [Panels labeled Female and Male.]

88 Conclusions Linear models, while often used in social science research, do not adequately describe many data sets, which usually exhibit complex relationships between the Y and x variables. In such cases, linear models can provide misleading inferences. Also, ordinary linear models can only convey how the mean of Y varies with x. We presented some BNP regression models that: (1) are flexible enough to adequately capture any complex relationships between the Y and x variables; and (2) provide a richer approach to data analysis, in terms of how the mean, variance, quantiles, p.d.f., survival function, etc., of Y change with x. We illustrated these models on several real data sets, using my free software: Bayesian Regression: Nonparametric and Parametric Models (Karabatsos, 2014a, 2014b).

89 References
Albert, J., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88,
Denison, D.G.T., Holmes, C.C., Mallick, B.K., & Smith, A.F.M. (2002). Bayesian Methods for Nonlinear Classification and Regression. John Wiley & Sons.
De Iorio, M., Müller, P., Rosner, G.L., & MacEachern, S.N. (2004). An ANOVA model for dependent random measures. Journal of the American Statistical Association, 99,
Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29,
Gelfand, A., & Ghosh, J. (1998). Model choice: A minimum posterior predictive loss approach. Biometrika, 85,
George, E.I., & McCulloch, R.E. (1997). Approaches for Bayesian variable selection. Statistica Sinica, 7,
Holmes, C., & Held, L. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1,
Ishwaran, H., & James, L.F. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96,
Jeffreys, H. (1961). The Theory of Probability (3rd ed.). Oxford. (p. 432)
Kalli, M., Griffin, J., & Walker, S.G. (2010). Slice sampling mixture models. Statistics and Computing, 21,
Karabatsos, G., & Walker, S. (2012a). Bayesian nonparametric mixed random utility models. Computational Statistics and Data Analysis, 56,

90 References (continued)
Karabatsos, G., & Walker, S. (2012b). Adaptive-modal Bayesian nonparametric regression. Electronic Journal of Statistics, 6,
Karabatsos, G. (2014a). Bayesian Regression: Nonparametric and Parametric Models, Version 2014b [Software]. University of Illinois-Chicago. Available from
Karabatsos, G. (2014b). Bayesian Regression: Nonparametric and Parametric Models, Version 2014b. Software User's Manual. University of Illinois-Chicago. Available from
Kass, R.E., & Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association, 90,
Laud, P., & Ibrahim, J. (1995). Predictive model selection. Journal of the Royal Statistical Society, Series B, 57,
MacEachern, S.N. (1999). Dependent nonparametric processes. Proceedings of the Bayesian Statistical Sciences Section of the American Statistical Association.
MacEachern, S.N. (2000). Dependent Dirichlet Processes. Technical Report, Department of Statistics, The Ohio State University.
MacEachern, S.N. (2001). Decision theoretic aspects of dependent nonparametric processes. In Bayesian Methods with Applications to Science, Policy and Official Statistics (E. George, ed.). International Society for Bayesian Analysis, Crete.
Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4,


More information

Likelihood-free MCMC

Likelihood-free MCMC Bayesian inference for stable distributions with applications in finance Department of Mathematics University of Leicester September 2, 2011 MSc project final presentation Outline 1 2 3 4 Classical Monte

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

A note on Reversible Jump Markov Chain Monte Carlo

A note on Reversible Jump Markov Chain Monte Carlo A note on Reversible Jump Markov Chain Monte Carlo Hedibert Freitas Lopes Graduate School of Business The University of Chicago 5807 South Woodlawn Avenue Chicago, Illinois 60637 February, 1st 2006 1 Introduction

More information

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison

David B. Dahl. Department of Statistics, and Department of Biostatistics & Medical Informatics University of Wisconsin Madison AN IMPROVED MERGE-SPLIT SAMPLER FOR CONJUGATE DIRICHLET PROCESS MIXTURE MODELS David B. Dahl dbdahl@stat.wisc.edu Department of Statistics, and Department of Biostatistics & Medical Informatics University

More information

Bayesian Methods in Multilevel Regression

Bayesian Methods in Multilevel Regression Bayesian Methods in Multilevel Regression Joop Hox MuLOG, 15 september 2000 mcmc What is Statistics?! Statistics is about uncertainty To err is human, to forgive divine, but to include errors in your design

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo (and Bayesian Mixture Models) David M. Blei Columbia University October 14, 2014 We have discussed probabilistic modeling, and have seen how the posterior distribution is the critical

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

A Nonparametric Model-based Approach to Inference for Quantile Regression

A Nonparametric Model-based Approach to Inference for Quantile Regression A Nonparametric Model-based Approach to Inference for Quantile Regression Matthew Taddy and Athanasios Kottas Department of Applied Mathematics and Statistics, University of California, Santa Cruz, CA

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

Principles of Bayesian Inference

Principles of Bayesian Inference Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters

More information

Bayesian Semiparametric GARCH Models

Bayesian Semiparametric GARCH Models Bayesian Semiparametric GARCH Models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics xibin.zhang@monash.edu Quantitative Methods

More information

Default priors for density estimation with mixture models

Default priors for density estimation with mixture models Bayesian Analysis ) 5, Number, pp. 45 64 Default priors for density estimation with mixture models J.E. Griffin Abstract. The infinite mixture of normals model has become a popular method for density estimation

More information

Bayesian Semiparametric GARCH Models

Bayesian Semiparametric GARCH Models Bayesian Semiparametric GARCH Models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics xibin.zhang@monash.edu Quantitative Methods

More information

Bayesian Hypothesis Testing in GLMs: One-Sided and Ordered Alternatives. 1(w i = h + 1)β h + ɛ i,

Bayesian Hypothesis Testing in GLMs: One-Sided and Ordered Alternatives. 1(w i = h + 1)β h + ɛ i, Bayesian Hypothesis Testing in GLMs: One-Sided and Ordered Alternatives Often interest may focus on comparing a null hypothesis of no difference between groups to an ordered restricted alternative. For

More information

Bayesian Statistics. Debdeep Pati Florida State University. April 3, 2017

Bayesian Statistics. Debdeep Pati Florida State University. April 3, 2017 Bayesian Statistics Debdeep Pati Florida State University April 3, 2017 Finite mixture model The finite mixture of normals can be equivalently expressed as y i N(µ Si ; τ 1 S i ), S i k π h δ h h=1 δ h

More information

Nonparametric Bayesian Modeling for Multivariate Ordinal. Data

Nonparametric Bayesian Modeling for Multivariate Ordinal. Data Nonparametric Bayesian Modeling for Multivariate Ordinal Data Athanasios Kottas, Peter Müller and Fernando Quintana Abstract We propose a probability model for k-dimensional ordinal outcomes, i.e., we

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a

More information

Modeling conditional distributions with mixture models: Theory and Inference

Modeling conditional distributions with mixture models: Theory and Inference Modeling conditional distributions with mixture models: Theory and Inference John Geweke University of Iowa, USA Journal of Applied Econometrics Invited Lecture Università di Venezia Italia June 2, 2005

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis

(5) Multi-parameter models - Gibbs sampling. ST440/540: Applied Bayesian Analysis Summarizing a posterior Given the data and prior the posterior is determined Summarizing the posterior gives parameter estimates, intervals, and hypothesis tests Most of these computations are integrals

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

INVERTED KUMARASWAMY DISTRIBUTION: PROPERTIES AND ESTIMATION

INVERTED KUMARASWAMY DISTRIBUTION: PROPERTIES AND ESTIMATION Pak. J. Statist. 2017 Vol. 33(1), 37-61 INVERTED KUMARASWAMY DISTRIBUTION: PROPERTIES AND ESTIMATION A. M. Abd AL-Fattah, A.A. EL-Helbawy G.R. AL-Dayian Statistics Department, Faculty of Commerce, AL-Azhar

More information

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 Contents Preface to Second Edition Preface to First Edition Abbreviations xv xvii xix PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 1 The Role of Statistical Methods in Modern Industry and Services

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown

More information