PIER HLM Course July 30, 2011 Howard Seltman. Discussion Guide for Bayes and BUGS


1. Classical Statistics is based on parameters as fixed unknown values.
   a. The standard approach is to try to discover, e.g., whether some parameter is equal to zero or not. The steps are:
      i. First posit a probability model (and corresponding assumptions) and null and alternative hypotheses.
      ii. Next, choose a test statistic that tends to differ based on whether the null or alternative hypothesis is correct.
      iii. Next, calculate the distribution of the test statistic over theoretical repetitions of the experiment (the sampling distribution).
      iv. Finally, make a (practical) decision about retaining or rejecting the null hypothesis by comparing the observed statistic to its null sampling distribution (or invert the test to obtain a confidence interval).
      v. This may be supplemented by power analysis.
      vi. Correct classical interpretation must be limited to something like this: If my assumptions are true, then, over the long run (my lifetime of experiments), $100\alpha$% of the times when I happen to run experiments for which $H_0$ is true, on average I will falsely reject $H_0$, and $100(1-\alpha)$% of the time I will correctly retain $H_0$. Also, $100\beta$% of the times when I happen to run experiments for which $H_0$ is false to the specific degree used in calculating power ($1-\beta$), on average I will falsely retain $H_0$, and $100(1-\beta)$% of the time I will correctly reject $H_0$. If the true effect size is larger than that used in my power calculation I will make fewer type 2 errors, and for smaller effect sizes I will make more type 2 errors.
      vii. Illustration: In a series of careers, each with 40 null experiments and 100 experiments with 80% power for the smallest interesting effect size (optimistic!!), and with $\alpha = 0.05$, on average we expect to see 2 false positives, 38 true negatives, 80 true positives, and 20 false negatives. From this we can calculate that the positive predictive value is PV+ = 80/(80+2) = 97.6%, and the negative predictive value is PV- = 38/(38+20) = 65.5%. Every different lifetime experiment description gives a different pair of predictive values. Beyond that, heed the warning: due to dumb luck, results may vary.
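The predictive-value arithmetic in the illustration can be reproduced with a few lines of R. This is a minimal sketch: the function name `predictive_values` is hypothetical, and the significance level of 0.05 is an assumption inferred from the 2 expected false positives among 40 null experiments.

```r
# Expected outcome counts and predictive values over a "career" of experiments.
# alpha = 0.05 is an assumption (it reproduces 2 false positives out of 40).
predictive_values <- function(n_null, n_alt, alpha, power) {
  fp <- n_null * alpha          # null true, falsely rejected
  tn <- n_null * (1 - alpha)    # null true, correctly retained
  tp <- n_alt * power           # null false, correctly rejected
  fn <- n_alt * (1 - power)     # null false, falsely retained
  c(FP = fp, TN = tn, TP = tp, FN = fn,
    PVpos = tp / (tp + fp),     # P(effect is real | rejected)
    PVneg = tn / (tn + fn))     # P(null is true | retained)
}

pv <- predictive_values(n_null = 40, n_alt = 100, alpha = 0.05, power = 0.80)
print(round(pv, 3))
```

Changing `n_null`, `n_alt`, or `power` shows how strongly the two predictive values depend on the mix of experiments in the career.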

2. Bayesian Statistics is based on parameters having a probability distribution.
   a. Example: We want to know the population difference between the math scores of students taught with methods A and B, say $\delta = \mu_A - \mu_B$.
   b. In classical statistics these parameters are fixed unknowns, and 1) we either want to make a decision on whether or not $\mu_A$ and $\mu_B$ are equal (and that decision is either true or false; we never know for sure), or 2) we want to construct, say, a 95% confidence interval for $\delta$, say [L, U], for which, over our lifetime, when our assumptions are met, on average only about 5% of our 95% CIs will not contain the true value. The CI is random, and the parameter is fixed.
   c. In Bayesian statistics, we start with a prior distribution for $\delta$ which reflects our beliefs about $\delta$ before we run the experiment. This can be based on earlier experiments and/or can be subjective, based on looser information, say all of our reading about similar methods. In the well-justified subjective approach we may elicit a prior from experts. Also, different experts (or non-experts) may validly hold different prior beliefs, which are operationalized as different prior distributions. If little pertinent prior information is available, it is appropriate to express our uncertainty as a weak (dispersed) prior distribution, e.g., $\delta \sim N(0, \text{s.d.} = 100 \text{ points})$. An alternative is a non-informative prior, which is an objective off-the-shelf distribution, e.g., all values of $\delta$ are equally likely before running the experiment. These non-informative priors are often improper, i.e., not a valid probability distribution, which causes problems only in some situations (such as multilevel models). Conjugate priors simplify the analysis by matching the likelihood, so that the posterior belongs to the same distributional family as the prior. Generally, it is a good idea to perform a sensitivity analysis to investigate how sensitive vs. robust the findings are to the chosen specification of the prior distribution.
   d. The goal of a Bayesian analysis is to use the experimental results to create a posterior distribution for the quantities of interest (e.g., the $\delta$ above) that expresses what we should (in the technical sense) believe about $\delta$ now that the experiment is complete.
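A prior sensitivity analysis of the kind recommended above can be sketched with the conjugate normal-normal update, where the posterior precision is the sum of the prior and data precisions. Everything numeric here is hypothetical: the estimated difference of 4 points, its standard error of 2, the three priors, and the function name `normal_posterior` are illustrative assumptions, and the sampling variance is treated as known.

```r
# Conjugate normal-normal update for a mean difference with known standard error.
# Returns the posterior mean and s.d. of the difference.
normal_posterior <- function(prior_mean, prior_sd, est, se) {
  prior_prec <- 1 / prior_sd^2
  data_prec  <- 1 / se^2
  post_prec  <- prior_prec + data_prec
  post_mean  <- (prior_prec * prior_mean + data_prec * est) / post_prec
  c(mean = post_mean, sd = sqrt(1 / post_prec))
}

est <- 4   # hypothetical observed difference in points
se  <- 2   # hypothetical standard error of that difference

# Each prior is c(prior mean, prior s.d.); all three are illustrative.
priors <- list(weak = c(0, 100), skeptical = c(0, 5), optimistic = c(10, 3))
sens <- t(sapply(priors, function(p) normal_posterior(p[1], p[2], est, se)))
print(round(sens, 2))
```

If the rows of `sens` lead to the same practical conclusion, the findings are robust to the prior; if they diverge, the data are not strong enough to overwhelm prior beliefs.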

   e. One tool of Bayesian statistics is Bayes' theorem, which can be expressed as:

      $$P(\theta \mid Y) = \frac{P(Y \mid \theta)\,P(\theta)}{\sum_{\theta'} P(Y \mid \theta')\,P(\theta')}$$

      where the summation (or integral) is over all possible values $\theta'$ of the parameter. Here $\theta$ is the parameter of interest, $Y$ is the data, $P(\theta)$ is the prior distribution of the parameter, $P(Y \mid \theta)$ is the likelihood, i.e., it expresses how likely the experimental outcome is for any given value of the parameter according to the model, and $P(\theta \mid Y)$ is the posterior distribution of the parameter.
      i. Example: Three kinds of coins are manufactured with long-run heads probabilities of $\theta_1 = 0.3$, $\theta_2 = 0.5$, and $\theta_3 = 0.7$. Assume we know that fair coins are manufactured at four times the rate of each of the other two. We will flip the coin 5 times and count the number of heads ($Y$). Let $\theta$ be the true heads chance of the single coin we have. The possible values of $\theta$ are {0.3, 0.5, 0.7}, with corresponding prior probabilities of {1/6, 2/3, 1/6} or {0.167, 0.667, 0.167}. Using the binomial distribution we know that:

         $$P(Y \mid \theta) = \binom{5}{Y}\,\theta^{Y}(1-\theta)^{5-Y}$$

         so that
         $P(Y=2 \mid \theta=0.3) = 10 \times 0.09 \times 0.343 = 0.3087$
         $P(Y=2 \mid \theta=0.5) = 10 \times 0.25 \times 0.125 = 0.3125$
         $P(Y=2 \mid \theta=0.7) = 10 \times 0.49 \times 0.027 = 0.1323$

         If we observe $Y=2$, then the denominator sum in Bayes' formula is $(1/6)(0.3087) + (2/3)(0.3125) + (1/6)(0.1323) = 0.2818$ and the posterior probabilities are:
         $P(\theta=0.3 \mid Y=2) = (1/6)(0.3087) / 0.2818 = 0.18$
         $P(\theta=0.5 \mid Y=2) = (2/3)(0.3125) / 0.2818 = 0.74$
         $P(\theta=0.7 \mid Y=2) = (1/6)(0.1323) / 0.2818 = 0.08$
   f. The posterior distributions are the main conclusions of a Bayesian analysis. Unlike classical analysis, because parameters have distributions you can directly and correctly state that, e.g., the probability that $\theta$ is between 5.0 and 10.0 is 95%, or that the probability that $\theta$ is greater than zero is 56%. Posterior probabilities for derived quantities that combine parameters are easy to obtain. Probabilistic model comparison is easy through Bayes factors. Models can also be combined through Bayesian model averaging.
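The three-coin example can be checked directly in R, since the parameter takes only three values and Bayes' theorem reduces to a weighted sum. This is a small sketch using base R's `dbinom()`; the variable names are illustrative.

```r
# Discrete Bayes' theorem for the three-coin example (Y = 2 heads in 5 flips).
theta <- c(0.3, 0.5, 0.7)               # possible heads probabilities
prior <- c(1, 4, 1) / 6                 # fair coins are four times as common
lik   <- dbinom(2, size = 5, prob = theta)   # P(Y = 2 | theta)

denom <- sum(prior * lik)               # denominator of Bayes' theorem
post  <- prior * lik / denom            # P(theta | Y = 2)

print(round(rbind(theta, prior, likelihood = lik, posterior = post), 4))
```

Because the posterior is a full distribution, statements such as the probability that the coin is not biased toward tails follow by summing: `sum(post[theta >= 0.5])`.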

   g. Although direct calculation of posterior distributions is sometimes possible, in practice for most complex problems a sample from the posterior distribution is generated, and all practical results can be obtained from this posterior sample. The main computational tools for generating a posterior sample are the Gibbs Sampler and MCMC (Markov Chain Monte Carlo), often in combination. Briefly, the Gibbs Sampler breaks up large problems into smaller, more manageable pieces, and MCMC is a general-purpose way to generate values from a particular posterior distribution more-or-less directly from the likelihood and prior distribution, without any analytic probability calculations (in particular, without evaluating the denominator of Bayes' theorem). Although there are many practical difficulties that often require knowledge and experience to solve, the use of MCMC allows very complex problems to be analyzed, including many situations where the classical sampling distributions are intractable.
   h. The main tool for constructing models and putting them into the Bayesian calculation apparatus is the directed acyclic graph (DAG), which pictorially represents the data and parameters and their relationships, particularly conditional independence.
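To make the idea of sampling from a posterior using only the likelihood and prior concrete, here is a minimal random-walk Metropolis sketch in R. It is one simple flavor of MCMC, not a description of what BUGS does internally. The setup is an assumption chosen for illustration: 2 heads in 5 flips with a uniform prior on the heads probability, for which the exact posterior is known to be Beta(3, 4) with mean 3/7, so the sampler can be checked. The proposal s.d. of 0.3, the chain length, and the burn-in are arbitrary choices.

```r
set.seed(1)

# Unnormalized log posterior: log likelihood + log prior.
# The normalizing constant of Bayes' theorem is never computed.
log_post <- function(theta, y = 2, n = 5) {
  if (theta <= 0 || theta >= 1) return(-Inf)   # outside the prior's support
  dbinom(y, n, theta, log = TRUE) + dunif(theta, 0, 1, log = TRUE)
}

n_iter  <- 20000
draws   <- numeric(n_iter)
current <- 0.5                                  # starting value
for (t in seq_len(n_iter)) {
  proposal <- current + rnorm(1, 0, 0.3)        # random-walk proposal
  # Accept with probability min(1, posterior ratio), on the log scale
  if (log(runif(1)) < log_post(proposal) - log_post(current)) {
    current <- proposal
  }
  draws[t] <- current
}

kept <- draws[-(1:2000)]                        # discard burn-in
print(c(mean = mean(kept), quantile(kept, c(0.025, 0.975))))
print(mean(kept > 0.5))                         # posterior P(theta > 0.5)
```

All summaries (means, intervals, tail probabilities) are computed from the retained draws, which is the sense in which every practical result comes from the posterior sample.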

3. BUGS and rube
   a. Documentation: http://www.mrcbsu.cam.ac.uk/bugs/documentation/bugs05/manual05.sec-contents.html, http://www.stat.cmu.edu/~hseltman/rube.
   b. BUGS is Windows only, popular, and free. You can set up models via a GUI or by entering computer code. My package, rube, allows running BUGS from inside R, plus it provides extra functionality such as meaningful error messages and on-the-fly changing of IVs, including interactions.
   c. Here is a GUI-generated DAG for a simple school example:
      [Figure: DAG for the school example]

   d. Here is the code for a complete example (schoolmodel.txt):

      # M is the number of neighborhoods
      # NN is the total number of students
      # nbhd[j] is the neighborhood for student j
      # test[j] is the score (outcome) for student j
      model {
        # Neighborhood-level random intercepts
        for (i in 1:M) {
          betaclass[i] ~ dnorm(muc, precc)
        }
        # Student-level model; scores are restricted to the range 0 to 100
        for (j in 1:NN) {
          ability[j] <- betaclass[nbhd[j]] + betap*pretest[j] + betam*male[j] +
                        betas*Nses[nbhd[j]] + betac*Ncrime[nbhd[j]]
          test[j] ~ dnorm(ability[j], precerr)I(0,100)
        }
        # Priors (dnorm is parameterized by mean and precision)
        muc ~ dnorm(50, 0.001)
        precc ~ dgamma(0.0001, )
        precerr ~ dgamma(0.0001, )
        betap ~ dnorm(0, 0.001)
        betam ~ dnorm(0, 0.001)
        betas ~ dnorm(0, 0.001)
        betac ~ dnorm(0, 0.001)
        # Convert precisions to standard deviations for reporting
        sdclass <- sqrt(1/precc)
        sderr <- sqrt(1/precerr)
      }

   e. Here is code to run this using rube (MixedBugs.R):

      # Submit school/neighborhood data to BUGS (crossed hierarchical model)

      # Starting values (from lmer(); should be estimated from data)
      # abs() keeps the starting precisions positive
      thisinit = function() {
        list(muc=rnorm(1,69), precc=abs(rnorm(1,1/8^2,0.02)),
             precerr=abs(rnorm(1,1/10^2,0.02)), betap=rnorm(1,5,0.2),
             betam=rnorm(1,-1,0.2), betas=rnorm(1,3,0.2),
             betac=rnorm(1,-8,0.2), betaclass=rnorm(M,0,0.5))
      }

      # Data
      snd = read.table("schoolneighbor.dat", header=TRUE)
      # sizes = number of students per neighborhood (assumed definition; it is
      # not shown in the source, and rows are assumed sorted by neighborhood)
      sizes = as.vector(table(snd$neighborhoodid))
      M = length(sizes)
      first = c(1, cumsum(sizes)[-M]+1)   # row of the first student in each neighborhood
      NN = nrow(snd)
      thisdata = list(M=M, NN=NN, nbhd=snd$neighborhoodid, test=snd$test,
                      pretest=snd$pretest, male=snd$male,
                      Nses=snd$Nses[first], Ncrime=snd$Ncrime[first])

      # Run the model through WinBUGS
      require("rube")
      rube("schoolmodel.txt", data=thisdata, inits=thisinit)

      Rube Results:
      Constants:
                 Size  Min  Max  Mean  SD  NAs
      M
      NN
      nbhd
      pretest
      male
      Nses
      Ncrime

      Data:
                 Distr            Size  NAs  Initial Value(s) [Range]  Flags
      test       dnorm_i(0,100)               / [28, 100]

      Stochastics:
                 Distr   Size  Parameters   Initial Value(s) [Range]
      betaclass  dnorm   13    muc, precc   / [-0.591, 1.352]
      muc        dnorm         50,
      precc      dgamma  1     1e-04, 1e
      precerr    dgamma  1     1e-04, 1e
      betap      dnorm   1     0,
      betam      dnorm   1     0,
      betas      dnorm   1     0,
      betac      dnorm   1     0,

   f. Here are the results:

      Equivalent lmer() code:

      L1 = lmer(test ~ male+pretest+Nses+Ncrime+(1|classid) + (1|neighborhoodID), data=hw4)
      #
      # Random effects:
      #  Groups         Name         Variance  Std.Dev.
      #  neighborhoodid (Intercept)
      #  Residual
      # Number of obs: 468, groups: neighborhoodid, 13
      #
      # Fixed effects:
      #              Estimate  Std. Error  t value
      # (Intercept)
      # male
      # pretest
      # Nses
      # Ncrime

      rube results:

      my.params = c("sdclass","sderr","betap","betam","betas","betac")
      rslt = rube("schoolmodel.txt", data=thisdata, inits=thisinit,
                  parameters.to.save=my.params, n.iter=3000,
                  n.burnin=1000, n.thin=1)
      rslt

      Rube Results:
      Run at :04 and taking 3.39 secs
                 mean  sd  MCMCerr  2.5%  25%  50%  75%  97.5%  Rhat  n.eff
      sdclass
      sderr
      betap
      betam
      betas
      betac

      [Figure: rube diagnostic plots for betap (betap ~ dnorm): trace by iteration number with Rhat, posterior density, and ACF by lag]
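The Rhat column in the rube summary is a convergence diagnostic that compares variation between chains with variation within chains; values near 1 suggest the chains agree. The sketch below implements the basic Gelman-Rubin formula on simulated chains. It is an illustration only: the source does not state which variant of Rhat rube reports, the function name `rhat` is hypothetical, and the chains are simulated rather than taken from the school model.

```r
set.seed(2)

# Basic Gelman-Rubin potential scale reduction factor.
# chains: matrix with one column per chain and one row per iteration.
rhat <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))     # average within-chain variance
  B <- n * var(colMeans(chains))       # between-chain variance
  var_hat <- (n - 1) / n * W + B / n   # pooled estimate of posterior variance
  sqrt(var_hat / W)
}

# Four simulated chains that sample the same distribution (well mixed)
mixed <- matrix(rnorm(4000), ncol = 4)

# The same chains with two of them shifted, mimicking chains stuck in
# different regions of the parameter space
stuck <- sweep(mixed, 2, c(0, 0, 3, 3), "+")

print(c(mixed = rhat(mixed), stuck = rhat(stuck)))
```

A value well above 1, as for the shifted chains, signals that more iterations or better starting values are needed before the posterior sample can be trusted.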

   g. Comments
      i. Bayesian analysis, e.g., using BUGS, is always appropriate if you are philosophically Bayesian.
      ii. Bayesian analysis matches classical analysis in many ways in many circumstances, e.g., with very weak or uninformative prior distributions.
      iii. Bayesian analysis is appropriate when you want to incorporate prior information in your analysis.
      iv. Bayesian analysis tends to more appropriately model all sources of variation.
      v. Bayesian analysis is often a first choice for unusual models that have no existing software.
      vi. Many difficulties can arise, such as slow convergence, highly correlated posteriors, choice of MCMC proposal distributions, choice of blocks of parameters to update simultaneously, and difficulty in specifying complex priors.
