Ozone Project


Evangeline Singleton

Motivating Application

Ozone ($O_3$) is a gas that occurs naturally in the atmosphere but comes in two varieties: good and bad. Good ozone is present in the upper atmosphere and protects us from harmful UV radiation. Bad ozone, otherwise known as ground-level ozone, occurs near the earth's surface and has been found to be associated with various diseases of the lungs. Ground-level ozone is made by a chemical reaction from pollutants that are emitted from motor vehicles and industrial sources. Due to the positive association between high levels of ground-level ozone and the prevalence of lung disease, the EPA actively monitors ground-level ozone (which I'm just going to call ozone from now on) at many locations across the US.

The file Ozone.RData contains the following ozone data collected in July:

us.polys - a shape file that will draw the U.S. Try: plot(us.polys)

ozone - a numeric vector containing 1081 measurements of the pollutant ozone ($O_3$).

locs - a 1081 × 2 matrix of locations where ozone was measured. Try: points(locs[,1],locs[,2],pch=19)

For this analysis, you will need to install the LatticeKrig and maptools packages in R. You will also need my fit.matern and predict.matern functions, which you can access by sourcing the file:

    source("fitmaterngp.r")

Model/Distribution Specification

In this assignment, you will expand upon the concepts you learned about the independent linear regression model $y \sim N(X\beta, \sigma^2 I)$. The problems in that assignment assumed that $\mathrm{Var}(y) = \sigma^2 I$, so we were assuming no correlation between the responses (all covariances are 0, which for normal data means independence). Now, in the spatial setting, response variables are correlated by virtue of the fact that they were collected near each other (e.g. pollution in Orem is likely to be similar to pollution in Provo).
To account for correlation, we make a small change to our linear regression model such that

$$y \sim N(X\beta, \sigma^2 R) \tag{1}$$

where $X$ is just going to be a column vector of 1's (because we don't have any other explanatory variables in our dataset, just the intercept) and $R$ is an $n \times n$ correlation matrix which has 1's along the diagonal and off-diagonal elements representing the correlations between observations. For example, $r_{ij} = r_{ji}$ is the correlation between observations $i$ and $j$. Note that in this model, $\mathrm{Var}(y)$ is NOT diagonal (there are correlations in the off-diagonal elements), which captures the spatial correlation.

As was the case in previous assignments, we desire to use our data to estimate the distribution in Equation (1) because once we know the distribution we can do all sorts of great things like prediction. The unknowns in this model are the vector $\beta$, the variance $\sigma^2$ and the correlation matrix $R$. For this assignment, however, we are going to pretend that we know $R$. In reality, we won't know $R$ and we will have to estimate it from the data, but we'll get to that in a later assignment. To get $R$, run the following code:

    ## Set R for Ozone Data
    myfit <- fit.matern(ozone~1,locs=locs,nu=1/2) # May take a minute
    R <- rdist(locs) # pairwise distances between the observation locations
    R <- myfit$omega*matern(R,alpha=myfit$alpha,nu=1/2) +
      (1-myfit$omega)*diag(length(ozone))

Estimation for the Correlation Regression Model

We want to use our data to estimate the unknowns in the distribution described by (1). We obtain estimates of $\beta$ and $\sigma^2$ using maximum likelihood estimation. If you know how to do maximum likelihood estimation then try these next few problems. If you don't, then you can just use the answers to analyze the dataset.

1. Find the MLE $\hat{\beta}$ for a known $R$.
   Answer: $\hat{\beta} = (X' R^{-1} X)^{-1} X' R^{-1} y$

2. Find the MLE $\hat{\sigma}^2$ for a known $R$.
   Answer: $\hat{\sigma}^2 = (y - X\hat{\beta})' R^{-1} (y - X\hat{\beta}) / n$, where $n$ is the number of observations in $y$ (i.e. $n$ = nrow(y)).

Back to the Ozone Application

In the previous section we derived how to use our data to estimate the unknown parameters. Now, let's use those estimators to get the distribution:

1. Use your results in the previous section to calculate $\hat{\beta}$ and $\hat{\sigma}^2$ for the ozone dataset. You can make sure your answer is correct by looking at myfit$coeftable and myfit$sigma2.
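To see the two estimators in action without the ozone files, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the locations are random, the values of omega and range.par are made up, and the correlation function exp(-d/range.par) is a stand-in (an exponential correlation, which is the Matérn form with smoothness 1/2) that may be parametrized differently from the matern function in fitmaterngp.r. It uses only base R.

```r
# Toy illustration only: simulated locations and responses, NOT the ozone data.
set.seed(1)
n <- 50
toy.locs <- cbind(runif(n), runif(n))   # hypothetical 2-D locations
D <- as.matrix(dist(toy.locs))          # pairwise Euclidean distances

# Assumed correlation parameters, chosen only for this example
omega <- 0.8
range.par <- 0.3
R.toy <- omega * exp(-D / range.par) + (1 - omega) * diag(n)

# Simulate y ~ N(X beta, sigma2 * R) with beta = 40 and sigma2 = 9
X <- matrix(1, n, 1)                    # intercept-only design matrix
L <- t(chol(R.toy))                     # lower-triangular factor with R = L L'
y <- 40 * X + 3 * L %*% rnorm(n)

# MLE of beta: (X' R^{-1} X)^{-1} X' R^{-1} y
# solve(R.toy, v) computes R^{-1} v without forming the inverse explicitly
beta.hat <- solve(t(X) %*% solve(R.toy, X), t(X) %*% solve(R.toy, y))

# MLE of sigma2: (y - X beta.hat)' R^{-1} (y - X beta.hat) / n
resid <- y - X %*% beta.hat
sigma2.hat <- drop(t(resid) %*% solve(R.toy, resid)) / n

print(drop(beta.hat))
print(sigma2.hat)
```

For the assignment itself, the same two formulas would be applied with ozone in place of y and the R built from myfit in place of R.toy. Using solve(R, v) rather than solve(R) %*% v is a design choice: it gives the same answer but avoids inverting a 1081 × 1081 matrix outright.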
The Math Behind Spatial Prediction

Now that we have our estimates of the distribution, we want to use this distribution to help us predict ozone at locations where we didn't collect data. The whole reason we can predict ozone at new locations is because of the correlation that we incorporated in $R$. That is, because things are correlated in space, we can lean on surrounding observations to get a prediction at a new location. To do the prediction, we are going to use those really cool properties of the multivariate normal distribution that I have been mentioning.

Notationally, let $y_{pred}$ be the ozone levels at the locations we want to predict and $y$ be the observations of ozone that we do have. We want to use $y$ to predict $y_{pred}$. To do this, we will make an assumption that the joint distribution of $y_{pred}$ and $y$ is a multivariate normal. That is, we will assume

$$\begin{pmatrix} y_{pred} \\ y \end{pmatrix} \sim N\left( \begin{pmatrix} X_{pred} \\ X \end{pmatrix} \beta,\ \sigma^2 \begin{pmatrix} R_{pred} & R_{pred,obs} \\ R_{obs,pred} & R \end{pmatrix} \right) \tag{2}$$

where $X_{pred}$ is a matrix of explanatory variables for all the places we want to predict at (in this application it will just be a column of 1's because we don't have explanatory variables), $X$ is the same matrix as in (1), $R_{pred}$ is the correlation matrix for the predictions (note that the predictions need to be correlated because we are predicting for locations near each other), $R_{pred,obs} = R'_{obs,pred}$ is the correlation between the predictions and the observed data (note that our predictions and observations should be correlated because we are predicting to places near our observations) and $R$ is the correlation matrix for our data (the same $R$ as we used in Equation (1) above).

The first really cool property of the multivariate normal (MVN) that I am going to point out is that the marginal distribution of $y$ in (2) is normal with mean $X\beta$ and covariance matrix $\sigma^2 R$ (notice that to get the marginal we just take the part of the mean and covariance matrix associated with $y$). So, essentially, (2) is saying the same thing about $y$ as (1) is, which means that we can use the same estimates of $\beta$ and $\sigma^2$ for the distribution in (2) as we did for (1).

The second really cool property of the MVN is that the conditional distribution of $y_{pred}$ given $y$ is still normal.
In fact,

$$y_{pred} \mid y \sim N\left( X_{pred}\beta + R_{pred,obs} R^{-1} (y - X\beta),\ \sigma^2 \left( R_{pred} - R_{pred,obs} R^{-1} R_{obs,pred} \right) \right). \tag{3}$$

The reason this distribution is useful is that it quantifies what we know about $y_{pred}$ given what we have already observed in $y$. The second great thing is that we have estimates of $\beta$ and $\sigma^2$ that we can just plug in. In general, when we predict in spatial statistics, we just use the mean of this distribution (which is, hopefully, intuitive).

Prediction for the Ozone Data

Given the results above about the conditional distribution of $y_{pred}$ given $y$, make a prediction of $y_{pred}$ and plot it on a map. In the dataset Ozone.RData are the following objects which we will use to do this:

pred.locs - a 5398 × 2 matrix of locations where we desire ozone predictions.

usa.grid.x, usa.grid.y - 100 × 100 matrices of x- and y-coordinates across the US. You can look at them by trying

    plot(us.polys)
    points(c(usa.grid.x),c(usa.grid.y),pch=19,cex=.5)

grid.locs - the elements of usa.grid.x and usa.grid.y that are the prediction locations. For example,

    pred.locs <- cbind(usa.grid.x[grid.locs],usa.grid.y[grid.locs])

pred.mat - a 100 × 100 placeholder matrix that will need to be populated by predictions at grid.locs. So, for example, if you create an object my.predictions containing your ozone predictions at the 5398 prediction locations in pred.locs, then you should populate pred.mat using the following code:

    pred.mat[grid.locs] <- my.predictions

The pred.mat object is now a 100 × 100 matrix with either your prediction (for the 5398 locations) or an NA for locations you didn't predict for. Predictions can then be plotted using the following code:

    image.plot(usa.grid.x,usa.grid.y,pred.mat,add=TRUE)
    plot(us.polys,add=TRUE)

The image.plot(x,y,z) function used above takes at least 3 arguments: x is a matrix of x-coordinates of the prediction locations, y is a matrix of y-coordinates of the prediction locations and z is a matrix of your predictions, where an NA means no prediction, so that cell is left as white space. So, in the above code, usa.grid.x is a 100 × 100 matrix of x-coordinates of locations, usa.grid.y is a 100 × 100 matrix of y-coordinates of locations and pred.mat is your corresponding prediction at each location. There are more pretty pictures you can draw using geom_raster in ggplot2 but this will be good enough for now. We can learn ggplot2 later (and we should).

Here are some steps to help you out:

1. Create $X_{pred}$ as a nrow(pred.locs) × 1 column vector of 1's.

2. Get $R_{pred}$, $R_{pred,obs}$ and $R_{obs,pred}$ using the following code:

    all.locs <- rbind(pred.locs,locs)
    n.pred <- nrow(pred.locs)
    n <- nrow(locs)
    R.all <- myfit$omega*matern(rdist(all.locs),alpha=myfit$alpha,nu=1/2) +
      (1-myfit$omega)*diag(n.pred+n)
    R.pred <- R.all[1:n.pred,1:n.pred]
    R.pred.obs <- R.all[1:n.pred,n.pred+(1:n)]
    R.obs.pred <- R.all[n.pred+(1:n),1:n.pred]
    R <- R.all[n.pred+(1:n),n.pred+(1:n)]

3. Calculate your prediction of $y_{pred}$ using the mean of the conditional distribution in (3), where you plug in your estimates $\hat{\beta}$ and $\hat{\sigma}^2$ from above.

4. Plot your predictions using the code above.

5. Check that your predictions are correct by using the following code:

    my.predictions <- predict.matern(myfit,predlocs=pred.locs)
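The conditional mean and variance in (3) can be sketched on a small simulated example before running them on the 5398 prediction locations. As with any toy example, the pieces here are assumptions for illustration only: the locations are random, omega and range.par are made-up values, and exp(-d/range.par) is an exponential correlation standing in for the matern function from fitmaterngp.r (whose parametrization may differ). Only base R is used.

```r
# Toy illustration only: simulated data, NOT the ozone data.
set.seed(2)
n <- 40                                  # number of observed locations
n.pred <- 5                              # number of prediction locations
obs.locs <- cbind(runif(n), runif(n))
new.locs <- cbind(runif(n.pred), runif(n.pred))

# Stack prediction locations first, then observed ones (same order as the steps above)
all.locs <- rbind(new.locs, obs.locs)
omega <- 0.8                             # assumed value
range.par <- 0.3                         # assumed value
R.all <- omega * exp(-as.matrix(dist(all.locs)) / range.par) +
  (1 - omega) * diag(n.pred + n)

# Partition the joint correlation matrix as in Equation (2)
R.pred     <- R.all[1:n.pred, 1:n.pred]
R.pred.obs <- R.all[1:n.pred, n.pred + (1:n), drop = FALSE]
R          <- R.all[n.pred + (1:n), n.pred + (1:n)]

# Simulate observations and estimate beta and sigma2 by maximum likelihood
X <- matrix(1, n, 1)
X.pred <- matrix(1, n.pred, 1)
y <- 40 + 3 * drop(t(chol(R)) %*% rnorm(n))
beta.hat <- solve(t(X) %*% solve(R, X), t(X) %*% solve(R, y))
resid <- y - X %*% beta.hat
sigma2.hat <- drop(t(resid) %*% solve(R, resid)) / n

# Conditional mean and covariance from Equation (3), with estimates plugged in
cond.mean <- X.pred %*% beta.hat + R.pred.obs %*% solve(R, resid)
cond.var <- sigma2.hat * (R.pred - R.pred.obs %*% solve(R, t(R.pred.obs)))

print(drop(cond.mean))                   # predictions at the new locations
print(diag(cond.var))                    # their conditional variances
```

In the assignment, cond.mean plays the role of my.predictions. The diagonal of cond.var is not needed for the map, but it shows how uncertain each prediction is: it shrinks toward the nugget level near observed locations and approaches sigma2.hat far from them.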
