Homework 6 Solutions


set.seed(1)
library(mvtnorm)
library(knitr)  # for kable()

# Sample theta from its full conditional given Sigma.
samp.theta <- function(y, Sigma, mu.0, lambda.0) {
  n <- nrow(y)
  y.bar <- apply(y, 2, mean)
  theta.var <- solve(solve(lambda.0) + n*solve(Sigma))
  theta.mean <- theta.var %*% (solve(lambda.0) %*% mu.0 + n*solve(Sigma) %*% y.bar)
  return(rmvnorm(1, theta.mean, theta.var))
}

# Sample Sigma from its full conditional given theta: an inverse-Wishart
# draw, obtained by inverting a Wishart draw.
samp.sigma <- function(y, theta, S.0, nu.0) {
  n <- nrow(y)
  S.n <- S.0 + (t(y) - c(theta)) %*% t(t(y) - c(theta))
  return(solve(rWishart(1, nu.0 + n, solve(S.n))[, , 1]))
}

Problem 1

Part a.

There isn't a single correct answer to this problem. One possible answer is:

mu.0 <- c(40, 38)
nu.0 <- length(mu.0) + 2
S.0 <- lambda.0 <- rbind(c(100, 25),
                         c(25, 100))

I set the prior mean to µ_0 = (40, 38) and set Λ_0 = (100, 25; 25, 100), which assumes that the ages of married couples are correlated and that over 95% of ages are between 16 and 100. As in the test score example, we set S_0 = Λ_0 and center the prior for Σ weakly about S_0 by setting ν_0 = 2 + 2.
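For reference, samp.theta and samp.sigma above draw from the standard full conditional distributions of the semiconjugate multivariate normal model, which the Gibbs samplers below alternate between:

θ | Σ, y_1, …, y_n ~ N(µ_n, Λ_n),  where Λ_n = (Λ_0^{-1} + nΣ^{-1})^{-1} and µ_n = Λ_n(Λ_0^{-1}µ_0 + nΣ^{-1}ȳ),
Σ | θ, y_1, …, y_n ~ inverse-Wishart(ν_0 + n, S_n^{-1}),  where S_n = S_0 + ∑_{i=1}^{n} (y_i − θ)(y_i − θ)^T.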

Part b.

Solution depends on your choice in a.

Y <- read.table("http://www.stat.washington.edu/~pdhoff/book/data/hwdata/agehw.dat",
                header = TRUE)
Y1s <- matrix(nrow = nrow(Y), ncol = 4)
Y2s <- matrix(nrow = nrow(Y), ncol = 4)

# Simulate four prior predictive datasets, each with the same number of
# couples as the observed data.
for (i in 1:4) {
  theta <- rmvnorm(1, mu.0, lambda.0)
  Sigma <- solve(rWishart(1, nu.0, solve(S.0))[, , 1])
  Y.sim <- rmvnorm(nrow(Y), theta, Sigma)
  Y1s[, i] <- Y.sim[, 1]
  Y2s[, i] <- Y.sim[, 2]
}

par(mfrow = c(2, 2))
range <- range(cbind(Y, Y1s, Y2s))
for (i in 1:4) {
  plot(Y[, 1], Y[, 2], ylim = range, xlim = range,
       xlab = expression(y[h]), ylab = expression(y[w]))
  points(Y1s[, i], Y2s[, i], pch = 16)
  if (i == 3) {
    legend("topleft", pch = c(1, 16), legend = c("Observed", "Simulated"),
           cex = 0.75)
  }
}

[Figure: four panels of husband age y_h against wife age y_w, with the observed data (open circles) overlaid on each simulated prior predictive dataset (filled circles).]

This prior isn't perfect; ages appear to be less correlated under this prior than in the observed data. But it's close enough to be plausible, so I'll leave it be.

Part c.

Solution depends on your choice in b.

theta <- apply(Y, 2, mean)
Sigma <- cov(Y)
S <- 10000
thetas.c <- matrix(nrow = S, ncol = length(theta))
Sigmas.c <- matrix(nrow = S, ncol = length(as.vector(Sigma)))

# Gibbs sampler: alternate draws from the two full conditionals.
for (i in 1:S) {
  thetas.c[i, ] <- theta <- samp.theta(Y, Sigma, mu.0, lambda.0)
  Sigma <- samp.sigma(Y, theta, S.0, nu.0)
  Sigmas.c[i, ] <- as.vector(Sigma)
}

# 95% posterior intervals for theta_h, theta_w, and the correlation rho;
# columns 1, 2, and 4 of Sigmas.c hold the vectorized 2x2 covariance draws.
post.ci.c <- rbind(quantile(thetas.c[, 1], c(0.025, 0.975)),
                   quantile(thetas.c[, 2], c(0.025, 0.975)),
                   quantile(Sigmas.c[, 2]/sqrt(Sigmas.c[, 1]*Sigmas.c[, 4]),
                            c(0.025, 0.975)))
row.names(post.ci.c) <- c("$\\theta_h$", "$\\theta_w$", "$\\rho$")
kable(post.ci.c, digits = 2)

        2.5%   97.5%
θ_h    41.67   46.99
θ_w    38.34   43.29
ρ       0.86    0.93

plot(thetas.c[, 1], thetas.c[, 2],
     xlab = expression(theta[h]), ylab = expression(theta[w]))

[Figure: scatterplot of the joint posterior draws of θ_h and θ_w.]

hist(Sigmas.c[, 2]/sqrt(Sigmas.c[, 1]*Sigmas.c[, 4]),
     xlab = expression(rho), main = "", freq = FALSE)

[Figure: posterior density histogram of ρ, concentrated between roughly 0.80 and 0.95.]
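These intervals are only trustworthy if the chain mixes well. As a quick check (not part of the assignment), effective sample sizes can be computed with the coda package; a minimal sketch, assuming coda is installed:

library(coda)

# Effective sample sizes for theta_h, theta_w, and rho; values near
# S = 10000 indicate nearly independent draws.
rho.c <- Sigmas.c[, 2]/sqrt(Sigmas.c[, 1]*Sigmas.c[, 4])
effectiveSize(mcmc(cbind(thetas.c, rho = rho.c)))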

Part d.

mu.0 <- rep(0, 2)
lambda.0 <- 10^5*diag(2)
S.0 <- 1000*diag(2)
nu.0 <- 3

theta <- apply(Y, 2, mean)
Sigma <- cov(Y)
S <- 10000
thetas.d <- matrix(nrow = S, ncol = length(theta))
Sigmas.d <- matrix(nrow = S, ncol = length(as.vector(Sigma)))

for (i in 1:S) {
  thetas.d[i, ] <- theta <- samp.theta(Y, Sigma, mu.0, lambda.0)
  Sigma <- samp.sigma(Y, theta, S.0, nu.0)
  Sigmas.d[i, ] <- as.vector(Sigma)
}

post.ci.d <- rbind(quantile(thetas.d[, 1], c(0.025, 0.975)),
                   quantile(thetas.d[, 2], c(0.025, 0.975)),
                   quantile(Sigmas.d[, 2]/sqrt(Sigmas.d[, 1]*Sigmas.d[, 4]),
                            c(0.025, 0.975)))
row.names(post.ci.d) <- c("$\\theta_h$", "$\\theta_w$", "$\\rho$")
kable(post.ci.d, digits = 2)

        2.5%   97.5%
θ_h    41.67   47.17
θ_w    38.30   43.47
ρ       0.79    0.90
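To make the comparison in Part e concrete, the widths of the intervals under the two priors can be placed side by side. A small addition, assuming post.ci.c and post.ci.d are still in the workspace:

# Widths of the 95% posterior intervals under the informative prior
# (Part c) and the diffuse prior (Part d).
round(cbind(width.c = post.ci.c[, 2] - post.ci.c[, 1],
            width.d = post.ci.d[, 2] - post.ci.d[, 1]), 2)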

Part e.

Given that we have n = 100 observations, the likelihood dominates the prior and the results under both priors are very similar. The only exception is that the posterior 95% confidence interval for ρ is slightly wider under the diffuse prior used in Part d. If we had less data, e.g., n = 25, we would expect wider 95% confidence intervals under the more diffuse prior used in Part d than under the more informative prior used in Part c; a quick way to check this is sketched below.
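A minimal sketch of that check, refitting both priors on a random subsample of 25 couples (the subsample, the helper gibbs.rho.ci, and the reuse of S = 10000 are illustrative choices, not part of the assignment):

# Rerun the Part c and Part d samplers on n = 25 couples and compare the
# resulting 95% posterior intervals for rho.
Y.sub <- Y[sample(nrow(Y), 25), ]

gibbs.rho.ci <- function(Y, mu.0, lambda.0, S.0, nu.0, S = 10000) {
  theta <- apply(Y, 2, mean)
  Sigma <- cov(Y)
  rhos <- numeric(S)
  for (i in 1:S) {
    theta <- samp.theta(Y, Sigma, mu.0, lambda.0)
    Sigma <- samp.sigma(Y, theta, S.0, nu.0)
    rhos[i] <- Sigma[1, 2]/sqrt(Sigma[1, 1]*Sigma[2, 2])
  }
  quantile(rhos, c(0.025, 0.975))
}

# Informative prior from Part c, then the diffuse prior from Part d; the
# second interval should be noticeably wider at n = 25.
gibbs.rho.ci(Y.sub, c(40, 38), rbind(c(100, 25), c(25, 100)),
             rbind(c(100, 25), c(25, 100)), 4)
gibbs.rho.ci(Y.sub, rep(0, 2), 10^5*diag(2), 1000*diag(2), 3)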

Problem 2

Part a.

data <- read.table("http://www.stat.washington.edu/~pdhoff/book/data/hwdata/azdiabetes.dat",
                   header = TRUE)
Y.n <- data[data[, "diabetes"] == "No", !names(data) == "diabetes"]
Y.d <- data[data[, "diabetes"] == "Yes", !names(data) == "diabetes"]

# Priors centered at each group's sample moments, with nu.0 = p + 2 = 9
# weakly centering each Sigma at S.0.
mu.0.n <- apply(Y.n, 2, mean)
mu.0.d <- apply(Y.d, 2, mean)
lambda.0.n <- S.0.n <- cov(Y.n)
lambda.0.d <- S.0.d <- cov(Y.d)
nu.0.d <- nu.0.n <- 9

theta.n <- apply(Y.n, 2, mean)
theta.d <- apply(Y.d, 2, mean)
Sigma.n <- cov(Y.n)
Sigma.d <- cov(Y.d)
S <- 10000
thetas.d <- matrix(nrow = S, ncol = length(theta.d))
thetas.n <- matrix(nrow = S, ncol = length(theta.n))
Sigmas.d <- matrix(nrow = S, ncol = length(as.vector(Sigma.d)))
Sigmas.n <- matrix(nrow = S, ncol = length(as.vector(Sigma.n)))

# Independent Gibbs samplers for the diabetic (d) and non-diabetic (n) groups.
for (i in 1:S) {
  thetas.d[i, ] <- theta.d <- samp.theta(Y.d, Sigma.d, mu.0.d, lambda.0.d)
  Sigma.d <- samp.sigma(Y.d, theta.d, S.0.d, nu.0.d)
  Sigmas.d[i, ] <- as.vector(Sigma.d)
  thetas.n[i, ] <- theta.n <- samp.theta(Y.n, Sigma.n, mu.0.n, lambda.0.n)
  Sigma.n <- samp.sigma(Y.n, theta.n, S.0.n, nu.0.n)
  Sigmas.n[i, ] <- as.vector(Sigma.n)
}

par(mfrow = c(4, 2))
par(mar = rep(2, 4))
for (j in 1:7) {
  range <- range(cbind(thetas.d[, j], thetas.n[, j]))
  hist(thetas.n[, j], xlim = range, xlab = "", col = rgb(0, 0, 1, 0.5),
       main = names(Y.d)[j], freq = FALSE)
  hist(thetas.d[, j], xlim = range, add = TRUE, freq = FALSE)
}
plot(range, range, type = "n", axes = FALSE)
legend("center", fill = c("white", rgb(0, 0, 1, 0.5)),
       legend = c("Diabetics", "Non-Diabetics"))

[Figure: overlaid posterior histograms of θ_{d,i} (white) and θ_{n,i} (blue) for npreg, glu, bp, skin, bmi, ped, and age; the diabetic group's posterior sits to the right in every panel.]

The figure shows that the posterior means of all seven variables appear to be higher in the diabetic group. This is consistent with the comparisons of θ_{d,i} and θ_{n,i} in the table below; an estimated probability of exactly 1 simply means that all S = 10000 posterior draws satisfied the inequality.

post.prob <- numeric(length(theta.d))
for (i in 1:length(post.prob)) {
  post.prob[i] <- mean(thetas.d[, i] > thetas.n[, i])
}
names(post.prob) <- paste("$\\text{Pr}\\left(\\theta_{d,", 1:7,
                          "} > \\theta_{n,", 1:7, "}\\right)$", sep = "")
kable(post.prob, digits = 2)

Pr(θ_{d,1} > θ_{n,1})   1
Pr(θ_{d,2} > θ_{n,2})   1
Pr(θ_{d,3} > θ_{n,3})   1
Pr(θ_{d,4} > θ_{n,4})   1
Pr(θ_{d,5} > θ_{n,5})   1
Pr(θ_{d,6} > θ_{n,6})   1
Pr(θ_{d,7} > θ_{n,7})   1

Part b.

par(mfrow = c(1, 1))
par(mar = rep(3, 4))
plot(apply(Sigmas.d, 2, mean), apply(Sigmas.n, 2, mean),
     xlab = "Diabetics", ylab = "Non-Diabetics",
     main = expression(paste("Elements of ", Sigma, sep = "")), pch = 16)
abline(a = 0, b = 1)

[Figure: posterior means of the elements of Σ for diabetics (x-axis) against non-diabetics (y-axis), with a 45° reference line; a single point lies far below the line.]

The single point off the 45° line corresponds to the estimated variance of glucose levels; glucose levels are more variable among diabetic subjects than among non-diabetic subjects. This is consistent with our expectations, given that diabetes is a disease related to glucose level regulation.

# Figure out which point is the outlier
which(abs(apply(Sigmas.d, 2, mean) - apply(Sigmas.n, 2, mean)) ==
        max(abs(apply(Sigmas.d, 2, mean) - apply(Sigmas.n, 2, mean))))

## [1] 9
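To confirm which element of Σ that flat index corresponds to, the column-major index can be mapped back to row and column coordinates. A small illustrative check, not in the original write-up:

# as.vector() stores a 7x7 matrix column-major, so flat index 9 is
# row 2, column 2: the variance of the second variable.
arrayInd(9, c(7, 7))
names(Y.d)[2]

Both lines point to glu, the glucose variable, confirming the interpretation above.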