Analysis of Environmental Data Problem Set Conceptual Foundations: Probability distributions Answers


Note, to answer some of these questions you will likely need to very carefully review the bestiary of probability distributions presented in the lecture notes and more thoroughly in Bolker.

1. Consider a study on the timing of alewife spawning migration runs up a coastal stream. At a fishway at the mouth of the stream, you establish a digital video monitoring station that continuously records fish movement through the fishway, and you then sample the video after the fact to detect and count fish moving through the fishway. For practical reasons, you sample the video for 9 randomly selected 10 minute intervals in each of 7 randomly selected days every 14 days over the course of the spawning season. For each sample you count the number of alewife seen swimming upstream through the fishway. Fish count is the dependent variable and time, expressed as ordinal date of the run (day of the year) plus the time of day (in fractions of days), is the independent variable. You hypothesize fish counts to have a concave relationship over time (i.e., rise, peak and then fall); consequently, you specify the deterministic model to be a quadratic polynomial function of time (i.e., fish count=a+b*time+c*time^2). Based on this data set, answer the following questions:

a. What are the parameters of the deterministic model? a, b, and c

b. Identify at least two (preferably more) potential sources of error in fish counts. Potential sources of measurement error include: Failure to accurately detect and count all fish that move through the fishway on the video; e.g., when a fish is blocked from view by another fish. Accidentally counting the same fish twice; e.g., when a fish moves back and forth through the fishway. Potential sources of process error include: Random behavior of fish regarding their decision to move through the fishway in any given minute. Other factors besides time that govern the fish's decision to move through the fishway in any given 10 minute interval, such as fluctuations in water temperature and flow or the density of fish below the fishway.

c. Specify a suitable probability distribution for the stochastic component of the model. A suitable probability distribution for this type of data is the Poisson, which is suited to count data with no effective upper limit, or the negative binomial, which is similar to the Poisson except that it allows the variance to be larger than the mean (i.e., overdispersed data).
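The difference between the two candidate distributions can be seen by comparing their variances at the same mean. The following is a minimal sketch in R, not part of the original answer: the mean (mu) and the overdispersion parameter (k) are hypothetical values chosen only for illustration, and the variance is computed directly from each probability mass function.

```r
# Compare the variance implied by a Poisson and a negative binomial
# distribution that share the same mean.
# mu and k are hypothetical values, chosen only for illustration.
mu <- 12   # assumed mean fish count per 10 minute sample
k  <- 2    # assumed negative binomial overdispersion parameter

# A range of counts wide enough that the omitted upper tail is negligible
counts <- 0:2000

# Probability mass functions evaluated over that range
p_pois <- dpois(counts, lambda = mu)
p_nb   <- dnbinom(counts, size = k, mu = mu)

# Mean and variance computed directly from a pmf
pmf_mean <- function(x, p) sum(x * p)
pmf_var  <- function(x, p) sum((x - pmf_mean(x, p))^2 * p)

pois_var <- pmf_var(counts, p_pois)  # equals mu for the Poisson
nb_var   <- pmf_var(counts, p_nb)    # equals mu + mu^2 / k for the negative binomial

print(c(poisson = pois_var, negative_binomial = nb_var))
```

With these assumed values the negative binomial variance is several times the mean, while the Poisson variance equals the mean; smaller values of k give stronger overdispersion.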

2. Consider a hypothetical study of wood frog larval abundance in vernal pools. Let's say you sample 100 vernal pools (observational units) and at each pool you take 10 dip net sweeps of the water column in random locations throughout the pool, and you record the presence/absence of wood frog tadpoles in each sweep. Let's assume that the probability of tadpole capture is the same for every sweep. For now, let's also assume that all pools are the same. Given the probability distributions shown here for a binomial distribution with a trial size=10 (#dips) and a per trial probability of success prob=0.3, answer the following questions:

a. What is the probability of observing tadpoles in 6 out of 10 sweeps in a pool? This value can be obtained from the probability mass distribution shown in the upper right subplot of the figure provided by reading off the probability on the y-axis for #successes=6, which is approximately p=0.03. This value can also be obtained mathematically from the binomial probability mass function with parameters size=10 (trial size; # dips in this case) and prob=0.3 (per trial probability of success, capture in this case), which is specified as follows: p(x)=choose(n,x)*p^x*(1-p)^(n-x) where n=trial size, x=#successes, p=per trial probability of success (prob), and choose(n,x) is the binomial coefficient for x successes out of a trial of size n, which is how many different combinations of n trials can produce x successes. In this case there are 210 different ways 10 trials can produce 6 successes. Plugging in the appropriate numbers yields the following result: p(6)=210*(0.3^6)*(1-0.3)^(10-6)=0.0368

b. What is the probability of observing tadpoles in 2 or fewer of the 10 sweeps in a pool? This value can be obtained from the cumulative probability distribution shown in the upper left subplot of the figure provided by reading off the probability on the y-axis for #successes=2, which is approximately p=0.4. This value can also be obtained mathematically using the probability mass function as above (a) by summing the probabilities of observing 0, 1 and 2 successes. This value can be computed in R directly using the pbinom() function as follows: pbinom(2,size=10,prob=0.3)=0.3828
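The two calculations above can be checked in base R. This is a small sketch rather than part of the original answer; it uses only the size and prob values given in the question, and the variable names are chosen here for illustration.

```r
# Binomial probabilities for 10 sweeps with per-sweep capture probability 0.3
n <- 10
p <- 0.3

# (a) Probability of exactly 6 successes, by hand and with dbinom()
n_ways    <- choose(n, 6)                      # number of ways to place 6 successes in 10 trials
p6_manual <- n_ways * p^6 * (1 - p)^(n - 6)    # binomial pmf written out
p6        <- dbinom(6, size = n, prob = p)     # same value, about 0.0368

# (b) Probability of 2 or fewer successes, by summing the pmf and with pbinom()
p_le2_sum <- sum(dbinom(0:2, size = n, prob = p))
p_le2     <- pbinom(2, size = n, prob = p)     # same value, about 0.3828

print(c(n_ways = n_ways, p6 = p6, p_le2 = p_le2))
```

Here dbinom() gives the probability mass at a single value and pbinom() gives the cumulative probability up to and including that value.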

c. What is the probability of observing tadpoles in at least 4 of the 10 sweeps in a pool? This value can be obtained from the cumulative probability distribution shown in the upper left subplot of the figure provided by reading off the probability on the y-axis for #successes=3, which is approximately p=0.65, and then taking the complement (1-0.65=0.35). Note, since we are interested in 4 or more successes, we need to know the cumulative probability of observing 3 (not 4) or fewer. This value can also be obtained mathematically using the probability mass function as above (a) by summing the probabilities of observing 0-3 successes and taking the complement, or by summing the probabilities of observing 4-10 successes. This value can be computed in R directly using the pbinom() function as follows: 1-pbinom(3,size=10,prob=0.3)=0.3504

d. If the per trial probability of capture were 0.5 instead of 0.3, how would the probability mass function (pmf) change? The probability mass distribution would shift to the right such that the most likely (highest probability) #successes would be 5 instead of 3, as shown here in the barplot.

3. Consider a hypothetical study on the effect of road crossing mortality on the age structure of spadefoot toad populations. Spadefoot toads typically undergo annual migrations to and from their breeding sites (seasonal ponds) and have extremely high fidelity to their breeding site (i.e., local populations are relatively independent). You hypothesize that road mortality is sufficient to affect population age structure, since at least a portion of the local population adjacent to a road would be subject to increased mortality rates due to road kill during migration to and from the breeding pond.
Let's say you sample three local populations at their breeding ponds; one pond is adjacent to a busy highway, another one is next to a secondary road with moderate traffic rates, and another one is without any nearby roads. For each population, you randomly sample 100 individuals and determine how many years they each survive. Thus, the data represent for each individual (observation) the age at death. The observed data are plotted here as a bar chart depicting the number of individuals surviving to each age for each of the populations. You are interested in knowing if the annual survival rate differs among the populations and in estimating the probability of individuals in each population surviving for 10 years. Consequently, you specify the deterministic model to be an indexed vector of mean annual mortality rates (i.e., a vector consisting of three different mean mortality rates corresponding to the three different populations) or, alternatively, an indexed vector of expected probability of survival for 10 years. Based on this information, answer the following questions:

a. What are some of the potential sources of error in the final statistical model? Potential sources of measurement error include: Error in determining the age at death, due to the difficulty of determining the age of individuals, which will depend on the aging method. Error in recording the age correctly in your data log. Potential sources of process error include: Randomness in getting killed by a vehicle while crossing a road. Subject the same 100 individuals to the same road and traffic conditions and you will get a different number of roadkills just by pure chance of individuals getting hit or not. Other factors besides roadkill that govern the age structure of the local populations, that are unaccounted for in this study, and that may be causing variations in age structure unrelated to roads.

b. Does this data warrant a discrete or continuous distribution for the stochastic component of the model, and why? This data is clearly discrete as it represents the counts of individuals in each age class. Individuals are indivisible (discrete) units.

c. What is a suitable probability distribution for the stochastic component of this model?
The geometric probability distribution (at least one form of it) gives the number of trials (in this case, years) with a constant probability of failure (in this case, death) until you get a single failure (in this case, the individual dies). Thus, the geometric distribution is mechanistically ideally suited to deal with data representing the number of survived breeding seasons for a seasonally reproducing organism such as the spadefoot toad. Note, the geometric distribution is also the special case of the negative binomial when the overdispersion parameter k=1, and thus the negative binomial distribution would be a viable alternative if the variance ended up being too great for the geometric.

d. Looking ahead to hypothesis testing, how might you go about determining whether roads have a significant effect on population age structure and/or toad longevity; i.e., whether survival rates differ significantly among the highway, road and none populations? Here, the basic hypothesis is that the age structure differs among populations, which can be restated more specifically as the annual survival rates differ among populations, presumably as a result of differential road mortality rates. Consequently, the null hypothesis is that the age structure, and thus the annual survival rates, are the same among populations. If we can specify two models, one representing the null hypothesis (no difference) and the other representing the alternative hypothesis (they differ), and we can find an objective criterion from which to quantitatively assess how well each model fits the data, then we can determine whether the alternative model is better than the null model. In addition, we can determine the likelihood of observing the differences among populations that we observed if in fact these population samples were drawn from the same underlying distribution (i.e., that they in fact are identical and the differences we observed were simply due to chance associated with drawing a sample from the population).

4. Consider a hypothetical study on the effect of fire size on the severity of the fire. There is a general belief among land managers in the west that larger fires are more severe in terms of their ecological impacts. Fire severity is generally defined in terms of the proportion of the overstory vegetation killed by the fire and is often categorized into high severity, mixed severity and low severity. The belief is that as fires get larger the proportion of the fire (inside the fire perimeter) that is classified as high severity increases. Some claim this to be a myth. You decide to test this hypothesis. Specifically, you decide to test the hypothesis that the proportional extent of high severity burn increases logistically as fire size increases; i.e., you specify the deterministic model to be a logistic function of fire size (i.e., proportion high severity=a+((b-a)/(1+e^(c*(d-x))))).
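As a minimal sketch, the logistic function for proportion high severity, a+((b-a)/(1+e^(c*(d-x)))), can be written as an R function and evaluated at a few fire sizes. The function name logistic4 and all parameter values below are hypothetical, chosen only to show the shape of the curve; they are not estimates from the data.

```r
# Four-parameter logistic: a + (b - a) / (1 + exp(c * (d - x)))
logistic4 <- function(x, a, b, c, d) {
  a + (b - a) / (1 + exp(c * (d - x)))
}

# Hypothetical parameter values, for illustration only
lower    <- 0.05    # a: value approached at very small fire sizes
upper    <- 0.60    # b: value approached at very large fire sizes
scale    <- 0.002   # c: how quickly the curve rises around the midpoint
midpoint <- 5000    # d: fire size (ha) at the inflection point

# Evaluate the curve at a small, a midpoint-sized, and a very large fire
fire_sizes <- c(100, 5000, 50000)
predicted  <- logistic4(fire_sizes, a = lower, b = upper, c = scale, d = midpoint)
print(predicted)
```

At x equal to the midpoint the function returns the average of the two end values, (a+b)/2, which is a quick check that the formula has been coded as written.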
Note, this is a 4 parameter logistic function with parameters that control the asymptotes at the left-hand (a) and right-hand (b) ends of the x axis and a parameter (c) that scales the response to x about the midpoint (d), where the curve has its inflection. To confront this model with data, you compile data on 100 randomly selected fires that occurred during the past 10 years in the Rocky Mountains region. For each fire you determine the size (ha) and proportion of high severity burn (via analysis of pre- and post-fire satellite images). Actually, these data already exist for fires greater than 100 acres, so all you have to do is download the data and conduct the analysis. The data are shown here as a scatter plot. Based on this data set, answer the following questions:

a. What are some of the potential sources of error in the final statistical model? Potential sources of measurement error include: Error in measuring the true size of the fire, since this depends on how you define the perimeter of the fire and the spatial resolution of the measuring device. Error in classifying locations to burn severity classes, since this involves deciding how much difference between pre- and post-fire satellite images is necessary to call something high severity, there is uncertainty in the choice of where to make the break between high and low severity, and there is error in the spectral data recorded by the satellite sensor. Potential sources of process error include: Random variation among fires in the extent of high severity due to random fluctuations in the weather that drives fire behavior and fuel conditions (e.g., moisture levels). Other factors besides fire size that influence the distribution and extent of high severity within the fire perimeter, such as fuel loads and terrain that influence fire behavior.

b. Does this data warrant a discrete or continuous distribution for the stochastic component of the model, and why? This data warrants a continuous distribution because the dependent variable is the proportional abundance of high severity burn within the fire perimeter, which is a continuous quantity bounded by 0-1.

c. What is a suitable probability distribution for the stochastic component of this model? The beta probability distribution is phenomenologically ideally suited for data on a proportion scale, and it is the only standard continuous distribution, other than the uniform, that is bounded 0-1. Note, if the data were proportions but did not approach either 0 or 1, then some of the other continuous distributions such as the normal and gamma might also work. Also, the classical approach for dealing with data on a proportion scale was to apply the arcsine square root transformation and then use the normal distribution, but this is not really justifiable anymore given the availability of the beta distribution, and ultimately does not solve the problem of the data being bounded 0-1 and the normal distribution being unbounded.
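As a rough illustration of the boundedness point, the sketch below compares a beta distribution with a normal distribution that has the same mean and standard deviation. The mean (mu) and the precision-like quantity (phi) are hypothetical values, and parameterizing the beta by a mean and a precision is a convenience assumed here, not something taken from the problem set.

```r
# Beta distribution described by an assumed mean and precision
mu  <- 0.15   # assumed expected proportion of high severity burn
phi <- 8      # assumed precision (shape1 + shape2); larger means less spread

shape1 <- mu * phi
shape2 <- (1 - mu) * phi

# Standard deviation of this beta distribution
beta_sd <- sqrt(mu * (1 - mu) / (phi + 1))

# The beta density is zero outside the 0-1 interval ...
outside_density <- dbeta(c(-0.1, 1.1), shape1 = shape1, shape2 = shape2)

# ... and all of its probability lies between 0 and 1
total_mass <- integrate(dbeta, lower = 0, upper = 1,
                        shape1 = shape1, shape2 = shape2)$value

# A normal distribution with the same mean and sd puts some probability below 0
normal_below_zero <- pnorm(0, mean = mu, sd = beta_sd)

print(c(total_mass = total_mass, normal_below_zero = normal_below_zero))
```

With these assumed values the matching normal distribution assigns a non-trivial probability to impossible negative proportions, whereas the beta assigns none.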
Lastly, note the important difference between continuous data measured on a proportion scale, such as fire severity, and discrete proportional data, such as the number of successes out of a given number of trials. Discrete proportional data is handled with the binomial distribution, whereas continuous proportional data is handled with the beta distribution. The key distinction is that with discrete proportional data there is a well-defined trial and trial size (number of trials per sample unit) for which the trial outcome is binary and the number of successful outcomes is counted.

d. Looking ahead to model selection, what other deterministic models might you propose as plausible alternatives to the 4 parameter logistic function based on the scatter plot? And are these likely to be mechanistic or phenomenological models? Since we do not have a mechanistic basis for the deterministic relationship between fire size and fire severity, there are a wide variety of monotonic functions that could be used to phenomenologically describe the apparent monotonic relationship. A simple linear or quadratic polynomial, or any of the saturating response functions that have an upper asymptote like the monomolecular, Beverton-Holt, Holling type III, and Von Bertalanffy would be suitable alternatives.

5. Consider a hypothetical study on the willingness of automobile owners to pay a gas tax to reduce carbon emissions in an effort to combat global warming. Specifically, policy makers are considering an additional tax on gasoline in which the revenue generated would be used to develop alternative renewable energy sources, and they would like to know how much people would be willing to pay to reduce carbon emissions by 50% in an aggressive effort to control global warming and the factors influencing people's willingness to pay. In particular, they believe that the amount people would be willing to pay is going to be linearly related to income level; i.e., the more you make the more you would be willing to pay. However, some policy makers disagree because they think the rich are much less willing to pay for common goods such as clean air. So, you decide to test this hypothesis. Specifically, you test the hypothesis that the amount a person is willing to pay in additional gasoline tax is linearly related to their income level; consequently, you specify the deterministic model to be a simple linear function of income level (i.e., amount willing to pay=a+b*income level). You conduct a random survey of 100 automobile owners in Amherst, Massachusetts and record the income level and the amount they are willing to pay in additional gasoline tax. The dependent variable, amount willing to pay, is an integer (cents). The data are shown here as a scatterplot. You fit a simple linear model with normally distributed errors (i.e., normal probability distribution for the stochastic component of the model). The fitted model is depicted here as the solid line in the scatterplot (note, this is simply a straight line with the best estimates of the intercept a and slope b).
Figuring out how to compute these best estimates is a subject for future consideration: parameter estimation. After fitting the model, you plot a histogram of the residuals: the deviations between the fitted values and the observed values, as shown here. Based on this information, answer the following questions:

a. Based on the information provided, does this model appear to be a properly specified model in terms of both the deterministic and stochastic components? If not, why? No, it does not. The linear model for the deterministic component does not capture the apparent curvilinear relationship between income and willingness to pay. As suspected by some politicians, the wealthier individuals appear to be willing to pay less than the middle income people. Also, the dependent variable is presumably measured on a discrete scale, since cents are indivisible monetary units, and thus the normal distribution is not the most appropriate.

b. Given the scatterplot, what is a reasonable alternative deterministic function for this relationship? Any hump-shaped function might suffice; for example, a quadratic polynomial would be a logical choice. Other functions that rise, peak and then decline might also work, but the parabolic nature of the curvilinear relationship might not lend itself well to many of these.

c. Given the dependent variable (amount willing to pay in additional cents/gallon), what is a more suitable probability distribution for the stochastic component of this model? Justify your choice. The Poisson probability distribution is phenomenologically ideally suited for discrete data consisting of non-negative integers that are effectively unbounded on the upper end, which is the case here since cents can take on any non-negative integer, including zero, and could go as high as someone was willing to pay. The mean and the variance of the Poisson distribution are the same, which allows the variance to increase as the mean increases, but if the variance is much larger than the mean, then the negative binomial would be more appropriate, since it is similar to the Poisson except that it allows the variance to be larger than the mean (i.e., overdispersed data).

6. Consider a hypothetical study on the effect of bedrock geology on the calcium concentration of second and third-order streams in western Massachusetts. Calcium-rich waters are especially important for certain organisms, such as bivalves. You sample 100 streams in the study area and for each stream you measure the percent of the watershed underlain by calcareous bedrock from a GIS data layer available from USGS. In addition, you collect water samples from each stream and measure the calcium concentration, which is measured on a continuous scale and ranges from trace amounts (0.01) to a little over 1.4.
You are unsure what to expect for the relationship between percent calcareous and stream calcium concentration, so you plot the data. After seeing the data, you decide to fit a simple linear model (i.e., calcium=a+b*calcareous) with normally distributed errors. The linear fit is shown in the figure (solid line), as are the residuals of the model. Based on this information, answer the following questions:

a. What are some of the potential sources of error in the final statistical model?

Potential sources of measurement error include: Error in measuring the true calcium concentration of the water sample, since the assays are not perfect. Error in measuring the true percentage of the watershed underlain by calcareous bedrock, since the mapped geological data are very coarse approximations. Potential sources of process error include: Random variation in calcium concentration over time and space within the stream, such that any single water sample has a varying amount of calcium in it. Other factors besides bedrock geology that influence the calcium concentration of the water.

b. What are three problems with the specification of this statistical model (i.e., the linear deterministic function and the normal probability distribution) with this dataset?

The linear model does not capture the apparent curvilinear relationship. Perhaps a function that allows for curvature would be a better fit. Also, the intercept of the fitted linear model is negative, which means that calcium concentration is predicted to be negative when percent calcareous is very low, which is an impossible outcome since calcium concentration can never be negative.

The normal distribution allows for negative values, but calcium concentration cannot be negative. Thus, if the mean calcium concentration is near zero, e.g., when percent calcareous is near zero, the normal distribution will predict some observations to be negative. A more appropriate distribution would be the gamma, which does not allow negative or zero values. Note, even if we fit a zero-intercept linear model for the deterministic component to account for the problem with the fitted negative intercept, which forces the fitted line through the origin, the normal distribution will still allow impossible negative values.
In the normal distribution the mean and the variance are independent, which means that the variance should not change with the mean; i.e., it remains constant as the mean changes. In this example, the variance clearly increases as the mean increases. The gamma distribution has a more complicated relationship between the mean and variance, but one that allows the variance to increase with the mean.

c. Looking ahead to parameter estimation, can you think of a way to assess how well the specified linear model fits the data (or the lack of fit) that makes use of the probability distribution in an explicit way? If we are willing to assume that the data were drawn from a normal distribution with a mean (expected value) calcium concentration linearly dependent on percent calcareous bedrock, then we can determine the probability (or likelihood actually, but we will return to this distinction later) of observing any particular value of calcium concentration given the mean (expected) value for any value of calcareous bedrock. Specifically, for any single observation, the linear model gives us the expected value (or mean) of calcium concentration for the measured value of percent calcareous. The normal probability distribution gives us the probability of any outcome (calcium concentration) given the mean (expected value determined from the linear model) and standard deviation. So, if we are willing to assume a particular standard deviation for the normal probability distribution, then we can use the normal probability density function to determine the probability of the observed value of calcium concentration for the given value of percent calcareous. If we calculate the probability of observing each observation and multiply them together (assuming the observations are independent), we will get the probability of observing the entire dataset, which is a measure of how well the specified model fits the data. By trying different combinations of values for model parameters, we can search for the combination that gives us the greatest probability, which becomes our best estimate of the parameters: the maximum likelihood estimates.

7. Consider a hypothetical dataset on the time between major flooding events in a river floodplain. Let's say you record the length of time (in, say, days or years) between flooding events that exceed a specified threshold in magnitude, say a 5 year flood event, under the assumption that the probability of a 5 year flood event is the same every year. What probability distribution would be appropriate as a mechanistic description of the distribution of time between events? There are at least two possibilities depending on whether time between events is considered a discrete variable or a continuous variable. If time between events is considered discrete, for example if time is measured in years and each year is considered a discrete unit, then the geometric distribution provides a mechanistic description of the data, because it is the number of trials (years) until you get a single failure (5 year flood event), given that there is a constant probability of a 5 year flood event every year.
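The discrete case can be sketched with R's geometric distribution functions. The sketch assumes that a "5 year flood" has an annual probability of 1/5, which is the usual reading of a return period but is not stated in the problem; note also that R's dgeom() and pgeom() count the number of flood-free years before the first flood year, which is one of the forms of the geometric distribution.

```r
# Assumed annual probability of a 5 year flood (return period of 5 years)
p_flood <- 1 / 5

# Probability of 0, 1, 2 or 3 flood-free years before the next flood year
years_without_flood <- 0:3
p_wait <- dgeom(years_without_flood, prob = p_flood)

# The same probabilities written out: p * (1 - p)^x
p_wait_manual <- p_flood * (1 - p_flood)^years_without_flood

# Probability that the next flood arrives within 5 years
# (at most 4 flood-free years first)
p_within_5 <- pgeom(4, prob = p_flood)

# Expected number of flood-free years between floods: (1 - p) / p
expected_gap <- (1 - p_flood) / p_flood

print(p_wait)
print(c(p_within_5 = p_within_5, expected_gap = expected_gap))
```

Because the annual probability is constant, each additional flood-free year simply multiplies the probability by (1 - p), which is what gives the geometric distribution its steadily declining shape.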
On the other hand, if time between events is considered continuous, for example if time is measured in days and days, which are discrete units, are merely an arbitrary measurement scale for an intrinsically continuous variable, then the exponential distribution provides a mechanistic description of the data, because it is the distribution of waiting times for a single event to happen, given that there is a constant probability per unit time that it will happen. Thus, the exponential distribution is the continuous counterpart of the geometric distribution.

8. Consider a hypothetical dataset on gypsy moth abundance in oak stands in the Quabbin watershed. Let's say that you put out pheromone traps in 100 locations for a 1 week period during the flight period and count the number of moths collected in each trap. The distribution of counts is shown here in the histogram. The computed mean count is 5.09 (moths/trap) and the variance is Based on this information, what probability distribution would be most appropriate for this data? This example represents classic simple count data for which the Poisson distribution is ideal, because it gives the distribution of the number of events or counts in a given sampling unit of counting effort if each event is independent of all the others. The assumption of independence of events may be problematic in this case, but there are statistical methods for dealing with this lack of independence. However, the Poisson distribution has a single parameter, lambda, and assumes that the mean and variance are both equal to lambda. Given the computed mean and variance, this assumption does not hold for this dataset. Fortunately, the negative binomial distribution is well suited, at least phenomenologically, to deal with count data in which the variance is greater than the mean. So, in this case, the negative binomial is the preferred distribution.

9. Consider a hypothetical study on the energy performance of three different window types. You experimentally expose 20 window panes each of 3 different window types to direct sunlight under the same conditions (e.g., ambient air temperature) and measure the BTUs on the inside of the window. I don't know if this makes sense at all, but it doesn't matter for the point of this exercise. The data are shown here as a box-and-whisker plot. Let's say that you hypothesize that the mean BTU differs among window types (i.e., the deterministic part of the model) and that you are willing to assume that the data were derived from a normal distribution (i.e., the stochastic part of the model) with a mean equal to the sample mean of the window type and a standard deviation equal to the pooled sample standard deviation; in other words, that the data were drawn from 3 normal distributions that differ in their means, but have the same spread. Note, this is the classical statistical model for this dataset.
Given the sample means and pooled standard deviation below (note, these are the parameters of the normal probability distribution), answer the following questions: Mean BTU: type 1=1.28; type 2=1.84; type 3=3.09 Pooled standard deviation=1.61 a. What is the probability density of observing a btu=3 for a window of type=1? What about for a window of type=3? These values are easily computed from the normal probability density function given the specified means and standard deviation. See the lecture notes for the mathematical function. Here are the computations: window type 1: (1/sqrt(2*pi*1.61^2))*exp(-((3-1.28)^2)/(2*1.61^2))=0.14

window type 3: (1/sqrt(2*pi*1.61^2))*exp(-((3-3.09)^2)/(2*1.61^2))=0.25

b. Are there any problems with the use of the normal probability distribution with this dataset? If so, is there a better alternative? Yes, there are at least two related problems. First, the normal distribution is unbounded. Thus, it allows for negative values, which are illogical in this case. This may not be a practical issue if the means are >>0, since negative values may be so unlikely as to not affect anything. However, if the means are close to 0, as in this case, this can be a real issue. In this case, it is apparent that the distributions of BTU values for window types 1 and 2 are being truncated at zero, resulting in positively skewed distributions, which brings us to the second issue. The normal distribution is symmetrical about the mean. Due to the zero truncation problem, the distributions, at least for window types 1 and 2, are clearly not symmetrical. This is a very common situation with environmental data. Fortunately, the gamma distribution can be used, at least phenomenologically, with positively skewed distributions of positive real numbers, which is exactly the case here. Note, the gamma distribution does not allow for zeros; the data must be positive numbers. This limits the use of the gamma to situations in which the data must take on a positive value, or else it requires a minor adjustment to the data (e.g., adding a small value to each observation) so that the gamma can be used.
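The density calculations in part (a) and the negative-value problem in part (b) can both be checked in base R. In this sketch the means and pooled standard deviation are those given in the question; the gamma parameters are obtained by matching the mean and variance, which is a simple illustrative choice assumed here and not a fitting method prescribed by the problem set.

```r
# Sample means and pooled standard deviation given in the question
means     <- c(type1 = 1.28, type2 = 1.84, type3 = 3.09)
pooled_sd <- 1.61

# (a) Normal probability density at btu = 3 for each window type
dens <- dnorm(3, mean = means, sd = pooled_sd)

# The same density for type 1, written out from the normal pdf
dens_type1_manual <- (1 / sqrt(2 * pi * pooled_sd^2)) *
  exp(-((3 - 1.28)^2) / (2 * pooled_sd^2))

# (b) Probability the assumed normal distribution assigns to negative values
normal_below_zero <- pnorm(0, mean = means, sd = pooled_sd)

# Gamma distributions with the same means and variance (moment matching,
# an assumption made for illustration): shape = mean^2 / var, scale = var / mean
gamma_shape <- means^2 / pooled_sd^2
gamma_scale <- pooled_sd^2 / means
gamma_below_zero <- pgamma(0, shape = gamma_shape, scale = gamma_scale)

print(round(dens, 2))
print(round(normal_below_zero, 3))
print(gamma_below_zero)
```

The normal model places a substantial share of its probability below zero for window types 1 and 2, while the gamma places none there, which is the practical motivation for the alternative suggested in part (b).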
