5.4 wells in Bangladesh Chris Parrish July 3, 2016


Contents

wells in Bangladesh
    data
logistic regression with one predictor
    figure 5.8
first logistic model: switched ~ dist
    model
    fit
more reasonable model: switched ~ dist/100
    model
    fit
    figure 5.9
logistic regression with second input variable
    figure 5.10
model: switched ~ dist/100 + arsenic
    model
    fit
    figure 5.11a
    figure 5.11b

wells in Bangladesh

reference:
- ARM chapter 05, github

    library(rstan)
    rstan_options(auto_write = TRUE)
    options(mc.cores = parallel::detectCores())
    library(ggplot2)

wells in Bangladesh

data

    # Data
    source("wells.data.r", echo = TRUE)

    > N <-
    > switched <- c(1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

    + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    + 1, 1, 0, 0, 1, 0, 1, 1, 0,... [TRUNCATED]
    > arsenic <- c(2.36, 0.71, 2.07, 1.15, 1.1, 3.9, 2.97, , 3.28, 2.52, 3.13, 3.04, 2.91, 3.21, 1.7, 1.8, 1.44, , 2.33, 2.83, 1.79,... [TRUNCATED]
    > dist <- c( , , , , , , ,
    +... [TRUNCATED]
    > assoc <- c(0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1,
    + 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0,
    + 1, 0, 0, 1, 1, 0, 1, 0, 0,... [TRUNCATED]
    > educ <- c(0, 0, 10, 12, 14, 9, 4, 10, 0, 0, 5, 0,
    + 0, 0, 0, 7, 7, 7, 0, 10, 7, 0, 5, 0, 8, 8, 10, 16, 10, 10,
    + 10, 10, 0, 0, 0, 3, 0, [TRUNCATED]

scatterplot

    data <- data.frame(dist, arsenic)
    ggplot(data, aes(dist, arsenic)) +
      geom_point(shape = 20, color = "darkred") +
      geom_smooth()

[Figure: scatterplot of arsenic against dist, with a smoothed fit]

summary statistics

    tbl <- rbind(summary(dist), summary(arsenic))
    row.names(tbl) <- c("dist", "arsenic")
    tbl

            Min. 1st Qu. Median Mean 3rd Qu. Max.
    dist
    arsenic

    apply(cbind(dist, arsenic), 2, sd)

    dist arsenic

logistic regression with one predictor

figure 5.8

    # Logistic regression with one predictor
    # Figure 5.8
    p1 <- ggplot(data.frame(dist)) +
      geom_histogram(aes(dist), color = "seashell", fill = "wheat", binwidth = 10) +
      scale_x_continuous("Distance (in meters) to the nearest safe well") +
      scale_y_continuous("")
    print(p1)

[Figure 5.8: histogram of distance (in meters) to the nearest safe well]

first logistic model: switched ~ dist

model

wells_dist.stan

    data {
      int<lower=0> N;
      int<lower=0,upper=1> switched[N];
      vector[N] dist;
    }
    parameters {
      vector[2] beta;
    }
    model {
      switched ~ bernoulli_logit(beta[1] + beta[2] * dist);
    }

fit

    # First logistic model: switched ~ dist
    data.list.1 <- c("N", "switched", "dist")
    wells_dist.sf <- stan(file='wells_dist.stan', data=data.list.1,
                          iter=1000, chains=4)
    plot(wells_dist.sf)

    ci_level: 0.8 (80% intervals)
    outer_level: 0.95 (95% intervals)

[Figure: interval plot for beta[1] and beta[2]]

    pairs(wells_dist.sf)
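As a rough illustration of what the model block in wells_dist.stan computes, the sketch below evaluates the Bernoulli-logit log-likelihood in plain R for a given coefficient vector. The data are simulated and the coefficient values are hypothetical stand-ins, not the fitted estimates; `loglik_logit`, `dist_sim`, `switched_sim`, and `beta_hyp` are names introduced only for this example.

```r
# Sketch only: simulated data and hypothetical coefficients, not the wells fit.
set.seed(54)
n_sim        <- 200
dist_sim     <- runif(n_sim, 0, 300)          # simulated distances (meters)
beta_hyp     <- c(0.5, -0.005)                # hypothetical intercept and slope
switched_sim <- rbinom(n_sim, 1, plogis(beta_hyp[1] + beta_hyp[2] * dist_sim))

# Log-likelihood matching `switched ~ bernoulli_logit(beta[1] + beta[2] * dist)`
loglik_logit <- function(beta, y, x) {
  eta <- beta[1] + beta[2] * x                # linear predictor on the logit scale
  sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
}

ll_hyp <- loglik_logit(beta_hyp, switched_sim, dist_sim)
```

Because the Stan program declares no priors on beta, its log posterior is this log-likelihood up to an additive constant.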

[Figure: pairs plot of beta[1], beta[2], and lp__]

    print(wells_dist.sf, pars = c("beta", "lp__"))

    Inference for Stan model: wells_dist.
    4 chains, each with iter=1000; warmup=500; thin=1;
    post-warmup draws per chain=500, total post-warmup draws=2000.

            mean se_mean sd 2.5% 25% 50% 75% 97.5%
    beta[1]
    beta[2]
    lp__

            n_eff Rhat
    beta[1]
    beta[2]
    lp__

    Samples were drawn using NUTS(diag_e) at Tue Jul 5 10:11:
    For each parameter, n_eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor on split chains (at
    convergence, Rhat=1).
    The estimated Bayesian Fraction of Missing Information is a measure of
    the efficiency of the sampler with values close to 1 being ideal.
    For each chain, these estimates are

more reasonable model: switched ~ dist/100

model

wells_dist100.stan

    data {
      int<lower=0> N;
      int<lower=0,upper=1> switched[N];
      vector[N] dist;
    }
    transformed data {
      vector[N] dist100;         // rescaling
      dist100 = dist / 100.0;
    }
    parameters {
      vector[2] beta;
    }
    model {
      switched ~ bernoulli_logit(beta[1] + beta[2] * dist100);
    }

fit

    # More reasonable model: switched ~ dist/100
    wells_dist100.sf <- stan(file='wells_dist100.stan', data=data.list.1,
                             iter=1000, chains=4)
    plot(wells_dist100.sf)

    ci_level: 0.8 (80% intervals)
    outer_level: 0.95 (95% intervals)

[Figure: interval plot for beta[1] and beta[2]]

    pairs(wells_dist100.sf)
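To see what the rescaling in `transformed data` changes, the sketch below fits the same logistic regression twice on simulated data, once with distance in meters and once in units of 100 meters. It uses `glm` (maximum likelihood) rather than Stan, purely to keep the example fast; the simulated data and the "true" coefficients are hypothetical and do not come from the wells data.

```r
# Sketch only: glm stands in for the Stan fit; data and coefficients are made up.
set.seed(1)
n_sim    <- 500
dist_sim <- runif(n_sim, 0, 300)                       # simulated distances (meters)
y_sim    <- rbinom(n_sim, 1, plogis(0.6 - 0.006 * dist_sim))

fit_m <- glm(y_sim ~ dist_sim, family = binomial)      # predictor in meters

dist100_sim <- dist_sim / 100                          # same rescaling as the Stan model
fit_100 <- glm(y_sim ~ dist100_sim, family = binomial) # predictor in 100-meter units

slope_m   <- unname(coef(fit_m)[2])
slope_100 <- unname(coef(fit_100)[2])
icpt_m    <- unname(coef(fit_m)[1])
icpt_100  <- unname(coef(fit_100)[1])
```

The fitted probabilities are identical in both versions: the intercept is unchanged and the slope is simply multiplied by 100, which makes the coefficient easier to read as the effect per 100 meters.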

[Figure: pairs plot of beta[1], beta[2], and lp__]

    print(wells_dist100.sf, pars = c("beta", "lp__"))

    Inference for Stan model: wells_dist100.
    4 chains, each with iter=1000; warmup=500; thin=1;
    post-warmup draws per chain=500, total post-warmup draws=2000.

            mean se_mean sd 2.5% 25% 50% 75% 97.5%
    beta[1]
    beta[2]
    lp__

            n_eff Rhat
    beta[1]
    beta[2]
    lp__

    Samples were drawn using NUTS(diag_e) at Tue Jul 5 10:11:
    For each parameter, n_eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor on split chains (at
    convergence, Rhat=1).
    The estimated Bayesian Fraction of Missing Information is a measure of
    the efficiency of the sampler with values close to 1 being ideal.
    For each chain, these estimates are

figure 5.9

    # Figure 5.9
    beta.post.2 <- extract(wells_dist100.sf, "beta")$beta
    beta.mean.2 <- colMeans(beta.post.2)
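`extract(..., "beta")$beta` returns a matrix with one row per post-warmup draw and one column per coefficient, so `colMeans` gives the posterior mean of each coefficient. The sketch below uses a stand-in matrix of simulated draws (hypothetical numbers, not the posterior from this fit) to show the same step, and also how the draws could give an uncertainty band around the curve, which the original figure does not show.

```r
# Stand-in for extract(wells_dist100.sf, "beta")$beta: 2000 draws x 2 coefficients.
# The means and sds below are hypothetical.
set.seed(59)
beta_draws <- cbind(rnorm(2000, mean = 0.6, sd = 0.06),
                    rnorm(2000, mean = -0.6, sd = 0.10))
beta_mean <- colMeans(beta_draws)

# Curve at the posterior mean, with distance d given in meters
curve_mean <- function(d) plogis(beta_mean[1] + beta_mean[2] * d / 100)

# Uncertainty band: evaluate the curve for every draw, then take quantiles
d_grid <- seq(0, 300, by = 50)
curves <- sapply(d_grid, function(d) {
  plogis(beta_draws[, 1] + beta_draws[, 2] * d / 100)
})
band <- apply(curves, 2, quantile, probs = c(0.025, 0.975))
```

Here `plogis(x)` is the same inverse logit, 1 / (1 + exp(-x)), that the plotting code writes out by hand.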

    # dev.new()
    p2 <- ggplot(data.frame(switched, dist), aes(dist, switched)) +
      geom_jitter(position = position_jitter(width = 0.2, height = 0.01),
                  shape = 20, color = "darkred") +
      stat_function(fun = function(x)
        1 / (1 + exp(- beta.mean.2[1] - beta.mean.2[2] * x / 100))) +
      scale_x_continuous("Distance (in meters) to the nearest safe well",
                         breaks = seq(from = 0, by = 50, length.out = 7)) +
      scale_y_continuous("Pr(switching)", breaks = seq(0, 1, 0.2))
    print(p2)

[Figure 5.9: Pr(switching) against distance (in meters) to the nearest safe well, with the fitted logistic curve]

logistic regression with second input variable

figure 5.10

    # Logistic regression with second input variable
    # Figure 5.10
    # dev.new()
    p3 <- ggplot(data.frame(arsenic)) +
      geom_histogram(aes(arsenic), color = "seashell", fill = "wheat", binwidth = 0.25) +
      scale_x_continuous("Arsenic concentration in well water") +
      scale_y_continuous("")
    print(p3)

[Figure 5.10: histogram of arsenic concentration in well water]

model: switched ~ dist/100 + arsenic

model

wells_d100ars.stan

    data {
      int<lower=0> N;
      int<lower=0,upper=1> switched[N];
      vector[N] dist;
      vector[N] arsenic;
    }
    transformed data {
      vector[N] dist100;         // rescaling
      dist100 = dist / 100.0;
    }
    parameters {
      vector[3] beta;
    }
    model {
      switched ~ bernoulli_logit(beta[1] + beta[2] * dist100 + beta[3] * arsenic);
    }
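Written out, the Stan program wells_d100ars.stan corresponds to the following model, with no explicit priors on the coefficients. The subscript notation is introduced here for illustration and does not appear in the original.

```latex
\Pr(\text{switched}_i = 1)
  = \operatorname{logit}^{-1}\!\left(
      \beta_1 + \beta_2 \,\frac{\text{dist}_i}{100} + \beta_3 \,\text{arsenic}_i
    \right),
\qquad
\operatorname{logit}^{-1}(x) = \frac{1}{1 + e^{-x}},
\qquad i = 1, \dots, N.
```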

fit

    # Model: switched ~ dist/100 + arsenic
    data.list.3 <- c("N", "switched", "dist", "arsenic")
    wells_d100ars.sf <- stan(file='wells_d100ars.stan', data=data.list.3,
                             iter=1000, chains=4)
    plot(wells_d100ars.sf)

    ci_level: 0.8 (80% intervals)
    outer_level: 0.95 (95% intervals)

[Figure: interval plot for beta[1], beta[2], and beta[3]]

    pairs(wells_d100ars.sf)

[Figure: pairs plot of beta[1], beta[2], beta[3], and lp__]

    print(wells_d100ars.sf, pars = c("beta", "lp__"))

    Inference for Stan model: wells_d100ars.

    4 chains, each with iter=1000; warmup=500; thin=1;
    post-warmup draws per chain=500, total post-warmup draws=2000.

            mean se_mean sd 2.5% 25% 50% 75% 97.5%
    beta[1]
    beta[2]
    beta[3]
    lp__

            n_eff Rhat
    beta[1]
    beta[2]
    beta[3]
    lp__

    Samples were drawn using NUTS(diag_e) at Tue Jul 5 10:11:
    For each parameter, n_eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor on split chains (at
    convergence, Rhat=1).
    The estimated Bayesian Fraction of Missing Information is a measure of
    the efficiency of the sampler with values close to 1 being ideal.
    For each chain, these estimates are

    beta.post.3 <- extract(wells_d100ars.sf, "beta")$beta
    beta.mean.3 <- colMeans(beta.post.3)

figure 5.11a

    # Figure 5.11 (a)
    # dev.new()
    p4 <- ggplot(data.frame(switched, dist), aes(dist, switched)) +
      geom_jitter(position = position_jitter(width = 0.2, height = 0.01),
                  shape = 20, color = "darkred") +
      # curve for arsenic = 0.5
      stat_function(fun = function(x)
        1 / (1 + exp(- beta.mean.3[1] - beta.mean.3[2] * x / 100
                     - beta.mean.3[3] * 0.5))) +
      # curve for arsenic = 1.0
      stat_function(fun = function(x)
        1 / (1 + exp(- beta.mean.3[1] - beta.mean.3[2] * x / 100
                     - beta.mean.3[3] * 1.0))) +
      annotate("text", x = c(50,75), y = c(0.35, 0.55),
               label = c("if As = 0.5", "if As = 1.0"), size = 4) +
      scale_x_continuous("Distance (in meters) to the nearest safe well",
                         breaks = seq(from = 0, by = 50, length.out = 7)) +
      scale_y_continuous("Pr(switching)", breaks = seq(0, 1, 0.2))
    plot(p4)

[Figure 5.11 (a): Pr(switching) against distance (in meters) to the nearest safe well, with curves for As = 0.5 and As = 1.0]

figure 5.11b

    # Figure 5.11 (b)
    # dev.new()
    p5 <- ggplot(data.frame(switched, arsenic), aes(arsenic, switched)) +
      geom_jitter(position = position_jitter(width = 0.2, height = 0.01),
                  shape = 20, color = "darkred") +
      # curve for dist = 0
      stat_function(fun = function(x)
        1 / (1 + exp(- beta.mean.3[1] - beta.mean.3[3] * x))) +
      # curve for dist = 50, i.e. dist/100 = 0.5
      stat_function(fun = function(x)
        1 / (1 + exp(- beta.mean.3[1] - beta.mean.3[2] * 0.5
                     - beta.mean.3[3] * x))) +
      annotate("text", x = c(1.7,2.5), y = c(0.78, 0.56),
               label = c("if dist = 0", "if dist = 50"), size = 4) +
      scale_x_continuous("Arsenic concentration in well water",
                         breaks = seq(from = 0, by = 2, length.out = 5)) +
      scale_y_continuous("Pr(switching)", breaks = seq(0, 1, 0.2))
    print(p5)

[Figure 5.11 (b): Pr(switching) against arsenic concentration in well water, with curves for dist = 0 and dist = 50]
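Each curve in figures 5.11a and 5.11b is a one-dimensional slice through the fitted two-predictor surface, with the other input held fixed. A minimal sketch of that idea follows, using a hypothetical coefficient vector `beta_hyp` in place of `beta.mean.3` and a helper `pr_switch` introduced only for this example.

```r
# Hypothetical (intercept, dist/100, arsenic) coefficients, not the fitted values.
beta_hyp <- c(0, -0.9, 0.5)

# Probability of switching for distance in meters and arsenic level
pr_switch <- function(dist, arsenic, beta) {
  plogis(beta[1] + beta[2] * dist / 100 + beta[3] * arsenic)
}

d_grid <- seq(0, 300, by = 50)   # distances, as on the x-axis of figure 5.11a
a_grid <- seq(0, 8, by = 2)      # arsenic levels, as on the x-axis of figure 5.11b

# Slices over distance at fixed arsenic (figure 5.11a)
curve_as05 <- pr_switch(d_grid, 0.5, beta_hyp)
curve_as10 <- pr_switch(d_grid, 1.0, beta_hyp)

# Slices over arsenic at fixed distance (figure 5.11b)
curve_d0  <- pr_switch(0,  a_grid, beta_hyp)
curve_d50 <- pr_switch(50, a_grid, beta_hyp)
```

Keeping the division by 100 inside the helper means callers always pass distance in meters, which avoids mixing the two scales when plotting.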
