Simulating Realistic Ecological Count Data
Lisa Madsen and Dave Birkes
Oregon State University Statistics Department Seminar, May 2, 2011
Outline
1. Motivation (Example: Weed Counts)
2. Pearson Correlation; Spearman Correlation; Limits to Dependence
3. Simulation
4. Simulating Outliers; Simulation Results
Why Simulate Data?
Simulation studies are useful for:
- Assessing the performance of analytical procedures
- Power analysis or sample size determination
- Finding a good sampling design
Weed Counts vs. Soil Magnesium (Heijting et al., 2007)
[Figure: scatterplot of weed counts against soil magnesium (mg/kg), with outliers marked.]
Maps of Weed Counts and Magnesium
[Figure: side-by-side maps of weed counts (outliers and zeros marked) and soil magnesium over the field; x and y coordinates in meters.]
Pearson Correlation
The usual measure of dependence between X and Y is the Pearson product-moment correlation coefficient:
ρ(X, Y) = [E(XY) − E(X)E(Y)] / [var(X) var(Y)]^{1/2}.
Estimate ρ(X, Y) from a sample (X_1, Y_1), ..., (X_n, Y_n) as
ρ̂(X, Y) = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / [Σ_{i=1}^n (X_i − X̄)² Σ_{i=1}^n (Y_i − Ȳ)²]^{1/2}.
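The sample estimator above is easy to check numerically; a minimal sketch in Python (the data are simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)

def pearson(x, y):
    """Sample Pearson correlation, computed term by term from the formula."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r = pearson(x, y)
```

The hand-rolled estimator agrees with `np.corrcoef` to floating-point precision.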
Pearson Correlation Measures Linear Dependence
[Figure: plot of X against X; ρ(X, X) = 1.]
Pearson Correlation Measures Linear Dependence
[Figure: plot of e^X against X; ρ(X, e^X) < 1 despite a perfectly monotone relationship.]
Pearson Correlation Measures Linear Dependence
For bivariate normal X and Y, ρ(X, Y) completely characterizes dependence.
For non-normal X and Y, other measures of dependence may be more appropriate.
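The linearity point above can be checked numerically; a small sketch (the uniform sample is illustrative):

```python
import numpy as np

# Pearson correlation equals 1 only under an exact linear relationship;
# a monotone but nonlinear transform such as exp() pulls it below 1.
rng = np.random.default_rng(0)
x = rng.uniform(size=10_000)
r_linear = np.corrcoef(x, x)[0, 1]       # exactly 1
r_exp = np.corrcoef(x, np.exp(x))[0, 1]  # high, but strictly less than 1
```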
Spearman Correlation
The Spearman correlation coefficient is
ρ_S(X, Y) = 3{P[(X − X₁)(Y − Y₁) > 0] − P[(X − X₁)(Y − Y₁) < 0]},
where X₁ =_d X and Y₁ =_d Y, with X₁ and Y₁ independent of one another and of (X, Y).
Estimating Spearman Correlation
Given a bivariate sample (X_1, Y_1), ..., (X_n, Y_n), calculate ranks r(X_i) and r(Y_i). Then
ρ̂_S(X, Y) = Σ_{i=1}^n [r(X_i) − (n+1)/2][r(Y_i) − (n+1)/2] / [n(n² − 1)/12],
the sample Pearson correlation coefficient of the ranked data.
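The "Pearson on ranks" identity above is easy to verify; a sketch with tie-free continuous data:

```python
import numpy as np

def ranks(x):
    """Ranks 1..n of a tie-free sample."""
    return np.argsort(np.argsort(x)) + 1.0

def spearman(x, y):
    """Sample Spearman correlation via the rank formula above (no ties)."""
    n = len(x)
    c = (n + 1) / 2.0
    return ((ranks(x) - c) * (ranks(y) - c)).sum() / (n * (n ** 2 - 1) / 12.0)

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = x + rng.normal(size=500)
```

Because ranks are invariant to monotone transforms, `spearman(x, np.exp(x))` equals 1 for continuous data, illustrating the next slide's point.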
Spearman Correlation Measures Monotone Dependence
ρ_S(X, e^X) = ρ_S(X, X) = 1 ... provided X is continuous.
Correcting for Ties
When X is discrete, one can construct X and Y so that X = Y almost surely but ρ_S(X, Y) < 1.
Rescale ρ_S so that it ranges between −1 and 1:
ρ_RS(X, Y) = ρ_S(X, Y) / {[1 − Σ_x p(x)³][1 − Σ_y q(y)³]}^{1/2},
where p(x) = P(X = x) and q(y) = P(Y = y) (Nešlehová, 2007).
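A small check of the rescaling, assuming X = Y ~ Bernoulli(0.4) (chosen for illustration): the population ρ_S falls short of 1, and dividing by the tie-correction factor restores it.

```python
import numpy as np
from itertools import product

def spearman_pop_identical(v, p):
    """Population rho_S for (X, Y) with X = Y a.s. and P(X = v[k]) = p[k],
    computed directly from the defining probabilities."""
    s = 0.0
    for (x, px), (x1, p1), (y1, q1) in product(zip(v, p), repeat=3):
        s += px * p1 * q1 * np.sign((x - x1) * (x - y1))
    return 3.0 * s

v = [0, 1]
p = np.array([0.6, 0.4])                  # Bernoulli(0.4)
rho_s = spearman_pop_identical(v, p)      # 0.72, despite X = Y a.s.
rho_rs = rho_s / (1.0 - (p ** 3).sum())   # rescaled; here q(y) = p(y)
```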
Ties in Sample Ranks
Two common methods for handling ties in a sample X_1, ..., X_n:
- Random ranks: when u tied values would occupy ranks p_1, ..., p_u if they were distinct, randomly assign these u ranks to the tied values.
- Midranks: assign each tied value the average rank, (1/u) Σ_{k=1}^u p_k.
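A midrank helper, sketched to make the tie-handling rule concrete:

```python
import numpy as np

def midranks(x):
    """Assign each tied value the average of the ranks the ties would occupy."""
    x = np.asarray(x)
    order = np.argsort(x, kind="stable")
    xs = x[order]
    ranks = np.empty(len(xs))
    i = 0
    while i < len(xs):
        j = i
        while j < len(xs) and xs[j] == xs[i]:
            j += 1
        # tied block occupies ranks i+1 .. j; the midrank is their average
        ranks[order[i:j]] = (i + 1 + j) / 2.0
        i = j
    return ranks
```

For example, `midranks([1.0, 2.0, 2.0, 3.0])` gives the tied pair the average of ranks 2 and 3.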
Rescaled Spearman Correlation and Midranks
For a sample (X_1, Y_1), ..., (X_n, Y_n), let the distribution of (X, Y) be the empirical distribution function of the sample. Then ρ_RS(X, Y) coincides with the sample Pearson correlation coefficient of the midranks (Nešlehová, 2007).
Limits to Dependence
For X and Y with joint CDF H(x, y) and marginal CDFs F(x) and G(y), the Fréchet-Hoeffding bounds are
max[F(x) + G(y) − 1, 0] ≤ H(x, y) ≤ min[F(x), G(y)].
These bounds induce margin-dependent bounds on ρ(X, Y) and ρ_S(X, Y).
For X ~ Bernoulli(p_X) and Y ~ Bernoulli(p_Y) with p_X ≥ p_Y,
ρ(X, Y) ≤ {(1 − p_X)p_Y / [p_X(1 − p_Y)]}^{1/2}.
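The Bernoulli bound can be verified exactly, since under the upper Fréchet-Hoeffding bound P(X = 1, Y = 1) = min(p_X, p_Y); a sketch with illustrative p values:

```python
import math

p_x, p_y = 0.7, 0.3   # p_x >= p_y; values chosen only for illustration

# Correlation attained at the upper Fréchet-Hoeffding bound,
# where P(X = 1, Y = 1) = min(p_x, p_y).
p11 = min(p_x, p_y)
cov = p11 - p_x * p_y
rho_max = cov / math.sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))

# The margin-dependent bound from the slide.
bound = math.sqrt((1 - p_x) * p_y / (p_x * (1 - p_y)))
```

Even maximally dependent Bernoullis with unequal means cannot reach ρ = 1.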
Simulation
Suppose we want to simulate a dependent vector Y = [Y_1, ..., Y_N], where Y_i has marginal CDF F_i.
1. Simulate a multivariate standard normal vector Z. Its variance-covariance matrix Σ_Z will determine the dependence among the Y_i. (Aside: to simulate maximally dependent Y_i and Y_j, set the corresponding element of Σ_Z equal to 1.)
2. Transform each element of Z to obtain the desired marginals: Y_i = F_i^{−1}{Φ(Z_i)}, where Φ(·) is the standard normal CDF.
Inverse CDF for Discrete Distributions
For a discrete distribution, use the generalized inverse
F_i^{−1}(u) = inf{x : F_i(x) ≥ u}.
[Figure: Bernoulli(0.4) CDF F(x), a step function with jumps at x = 0 and x = 1.]
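The two-step recipe above, sketched with Bernoulli(0.4) marginals (chosen only so the generalized inverse is trivial) and an AR(1)-type Σ_Z; both choices are illustrative assumptions, not the weed-data model:

```python
import math
import numpy as np

# Standard normal CDF via math.erf, vectorized over arrays.
Phi = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def inv_cdf_bernoulli(u, p):
    """Generalized inverse F^{-1}(u) = inf{x : F(x) >= u} for Bernoulli(p)."""
    return (u > 1.0 - p).astype(int)

N, p, rho = 5, 0.4, 0.8
idx = np.arange(N)
Sigma_Z = rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1)-type correlation

rng = np.random.default_rng(4)
L = np.linalg.cholesky(Sigma_Z)
Z = L @ rng.standard_normal((N, 20_000))  # step 1: correlated standard normals
Y = inv_cdf_bernoulli(Phi(Z), p)          # step 2: transform to the marginals
```

Each row of `Y` is marginally Bernoulli(0.4), while dependence across rows is inherited from Σ_Z.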
Weed Data
[Figure: weed counts vs. soil magnesium (mg/kg), with outliers marked.]
A Plausible Marginal Model
The negative binomial hurdle model is a Bernoulli mixture of a point mass at 0 and a negative binomial left-truncated at 1:
P(Y = 0) = π,
P(Y = y) = (1 − π) · [Γ(θ + y) / (Γ(θ)Γ(y + 1))] [θ/(θ + µ)]^θ [µ/(θ + µ)]^y / {1 − [θ/(θ + µ)]^θ}, for y ≥ 1.
Model π and the negative binomial mean µ as functions of the covariate, x = soil magnesium.
Negative Binomial Hurdle CDF
The target CDF for Y_i is then
F_i(y) = π_i + [(1 − π_i) / (1 − g_i(0 | µ_i, θ))] {G_i(y | µ_i, θ) − g_i(0 | µ_i, θ)} for y ≥ 0,
where G_i(· | µ_i, θ) and g_i(· | µ_i, θ) are the negative binomial CDF and PMF, with
log(µ_i) = β_0 + β_1 x_i and logit(π_i) = γ_0 + γ_1 x_i.
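The target CDF can be coded directly from the formula above; a sketch using `math.lgamma` for the negative binomial terms (parameter values are illustrative, not the fitted ones):

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial PMF in the (mu, theta) parameterization used above."""
    logp = (math.lgamma(theta + y) - math.lgamma(theta) - math.lgamma(y + 1)
            + theta * math.log(theta / (theta + mu))
            + y * math.log(mu / (theta + mu)))
    return math.exp(logp)

def hurdle_cdf(y, pi, mu, theta):
    """F(y) = pi + (1 - pi) * [G(y) - g(0)] / [1 - g(0)],  y >= 0."""
    g0 = nb_pmf(0, mu, theta)
    G = sum(nb_pmf(k, mu, theta) for k in range(y + 1))
    return pi + (1.0 - pi) * (G - g0) / (1.0 - g0)
```

At y = 0 the truncated part contributes nothing, so F(0) = π, and F(y) climbs to 1 as y grows.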
Weed Data with Fitted Means
[Figure: weed counts vs. soil magnesium, with the fitted mean curve from the NB hurdle model overlaid.]
Comments
- Unlike in data analysis, the goal here is a tractable but flexible model.
- The marginal CDFs can be determined by other methods.
- Different marginals need not come from the same family.
The Principle of Spatial Dependence
Dependence between observations is higher when they are close together.
[Figure: dependence decaying toward 0 with distance.]
The Variogram
var(Y_i − Y_j) is small if Y_i and Y_j are dependent.
[Figure: variogram increasing with distance.]
Stationarity
A typical spatial data set represents a single incomplete sample of size N = 1 from a spatial random process.
To make inference feasible, we assume stationarity, i.e.
E(Y_i) = E(Y_j) and var(Y_i − Y_j) = 2γ(h_ij),
where h_ij is the vector between the locations of Y_i and Y_j, and γ(·) is called the semivariogram.
Weed counts are not stationary: means differ, and larger means are associated with larger variances.
The stationarity assumption is more reasonable for ranks than for counts.
Ranking Spatial Data
The estimator of ρ_S uses a sample (X_1, Y_1), ..., (X_n, Y_n), but a spatial sample has no replication.
Kruskal (1958): the population analog of the rank r(Y_i) is F(Y_i).
For each Y_i, we can estimate its CDF F_i by plugging in point estimates of the parameters.
If Y_i is unusually large (or small) given its estimated distribution, F̂_i(Y_i) will also be unusually large (or small), but F̂_1(Y_1), ..., F̂_n(Y_n) will all be on the same scale.
Estimating Spatial Dependence
Fit a parametric semivariogram model to the ranked spatial counts. For Y_i and Y_j separated by a distance h_ij,
(1/2) var[F_i(Y_i) − F_j(Y_j)] = 0.03 + 0.027 (1 − e^{−h_ij/1.36}),
so that
ρ̂_RS(Y_i, Y_j) = 0.47 e^{−h_ij/1.36}.
[Figure: empirical semivariance of the ranks vs. distance, with the fitted exponential model.]
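The conversion from fitted semivariogram to rank correlation, using the fitted nugget (0.03), partial sill (0.027), and range (1.36) above; the relation ρ(h) = 1 − γ(h)/sill is the standard one for a stationary process:

```python
import math

nugget, psill, range_par = 0.03, 0.027, 1.36
sill = nugget + psill                     # total variance of the ranks

def semivariogram(h):
    """Fitted exponential model: gamma(h) = nugget + psill*(1 - exp(-h/range))."""
    return nugget + psill * (1.0 - math.exp(-h / range_par))

def rank_corr(h):
    """rho_RS(h) = 1 - gamma(h)/sill = (psill/sill) * exp(-h/range)."""
    return 1.0 - semivariogram(h) / sill
```

Here psill/sill = 0.027/0.057 ≈ 0.47, recovering the displayed correlation function.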
Calculating Σ_Z
1. For each pair i, j, obtain
ρ̂_S(Y_i, Y_j) = {[1 − Σ_{r=0}^∞ f̂_i(r)³][1 − Σ_{s=0}^∞ f̂_j(s)³]}^{1/2} ρ̂_RS(Y_i, Y_j),
where f̂_i and f̂_j are the estimated PMFs of Y_i and Y_j.
2. Then numerically solve for δ = ρ(Z_i, Z_j) in
ρ̂_S(Y_i, Y_j) = 3 Σ_{r=0}^∞ Σ_{s=0}^∞ f̂_i(r) f̂_j(s) (Φ_δ{Φ^{−1}[F̂_i(r−1)], Φ^{−1}[F̂_j(s−1)]} + Φ_δ{Φ^{−1}[1 − F̂_i(r)], Φ^{−1}[1 − F̂_j(s)]} − Φ_{−δ}{Φ^{−1}[F̂_i(r−1)], Φ^{−1}[1 − F̂_j(s)]} − Φ_{−δ}{Φ^{−1}[1 − F̂_i(r)], Φ^{−1}[F̂_j(s−1)]}),
where Φ_δ denotes the bivariate standard normal CDF with correlation δ.
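Step 2 has no closed form in general; a sketch of the numerical solve for a pair of Bernoulli(1/2) margins (a toy stand-in for the fitted hurdle PMFs), with the bivariate normal CDF computed by one-dimensional quadrature:

```python
import math
from statistics import NormalDist

nd = NormalDist()

def Phi2(a, b, delta, n=1500):
    """P(Z1 <= a, Z2 <= b), standard bivariate normal with correlation delta,
    by trapezoid quadrature of phi(z) * Phi((b - delta*z)/sqrt(1 - delta^2))."""
    if a == -math.inf or b == -math.inf:
        return 0.0
    lo, hi = -8.0, min(a, 8.0)
    if hi <= lo:
        return 0.0
    s = math.sqrt(1.0 - delta * delta)
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        z = lo + k * h
        w = 0.5 if k in (0, n) else 1.0
        total += w * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) \
                   * nd.cdf((b - delta * z) / s)
    return total * h

def q(u):
    """Generalized standard normal quantile, with q(0) = -inf, q(1) = +inf."""
    if u <= 0.0:
        return -math.inf
    if u >= 1.0:
        return math.inf
    return nd.inv_cdf(u)

def spearman_from_delta(delta, p=0.5):
    """rho_S(Y_i, Y_j) for Bernoulli(p) margins linked by Gaussian correlation
    delta, via the four-term formula above (terms with q(0) vanish)."""
    f = [1.0 - p, p]
    F = {-1: 0.0, 0: 1.0 - p, 1: 1.0}
    total = 0.0
    for r in (0, 1):
        for s in (0, 1):
            total += f[r] * f[s] * (
                Phi2(q(F[r - 1]), q(F[s - 1]), delta)
                + Phi2(q(1.0 - F[r]), q(1.0 - F[s]), delta)
                - Phi2(q(F[r - 1]), q(1.0 - F[s]), -delta)
                - Phi2(q(1.0 - F[r]), q(F[s - 1]), -delta)
            )
    return 3.0 * total

def solve_delta(target, p=0.5):
    """Bisect for delta = rho(Z_i, Z_j) matching a target rho_S."""
    lo, hi = -0.999, 0.999
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if spearman_from_delta(mid, p) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

delta = solve_delta(0.5)   # Gaussian correlation reproducing rho_S = 0.5
```

For symmetric Bernoulli margins this reduces to ρ_S = 3 arcsin(δ)/(2π), which provides a check on the solver.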
Apply
- Retain the locations and covariate values from the data set.
- Simulate a multivariate standard normal vector Z with correlation matrix Σ_Z.
- Set Y_i = F̂_i^{−1}{Φ(Z_i)}.
Two Outlier Processes
[Figure: weed counts vs. soil magnesium, with the two groups of outliers marked.]
Outliers Localized
[Figure: maps of weed counts and magnesium; the outliers are confined to one band of the field.]
Empirical Observations About Outliers
Outliers occur in the region between y = 17 and y = 33 meters.
Outliers associated with magnesium between 250 and 300 exceed the target means by between 12.9 and 14.9, whereas outliers associated with magnesium above 330 exceed them by between 2.6 and 10.3.
Augmenting the Simulated Data with Outliers
From the 139 simulated weed counts:
- Randomly select 4 to 6 locations with y-coordinates between 17 and 33 and magnesium between 250 and 300.
- Set these counts equal to the integer part of the target mean plus a random Uniform(12, 15) draw.
- Randomly select another 4 to 6 points with y-coordinates between 17 and 33 and magnesium exceeding 330.
- Set these counts equal to the integer part of the target mean plus a random Uniform(2, 11) draw.
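The augmentation rules above can be sketched as follows; the coordinate, magnesium, and target-mean arrays are simulated stand-ins for the corresponding columns of the real data set (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative stand-ins for the 139 locations, covariate, and fitted means.
n = 139
y_coord = rng.uniform(0.0, 50.0, n)
mg = rng.uniform(220.0, 360.0, n)
target_mean = rng.uniform(0.5, 4.0, n)
counts = rng.poisson(target_mean)

def add_outliers(counts, eligible, target_mean, lo, hi, rng):
    """Replace 4-6 randomly chosen eligible counts with the integer part of
    the target mean plus a Uniform(lo, hi) draw."""
    idx = np.flatnonzero(eligible)
    m = min(int(rng.integers(4, 7)), len(idx))  # guard small eligible sets
    pick = rng.choice(idx, size=m, replace=False)
    out = counts.copy()
    out[pick] = (target_mean[pick] + rng.uniform(lo, hi, m)).astype(int)
    return out

in_band = (y_coord > 17) & (y_coord < 33)
counts = add_outliers(counts, in_band & (mg > 250) & (mg < 300),
                      target_mean, 12, 15, rng)
counts = add_outliers(counts, in_band & (mg > 330), target_mean, 2, 11, rng)
```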
Simulated Data vs. Observed Data
[Figure: 100 simulated datasets overlaid on the observed weed counts vs. soil magnesium.]
Simulated Data vs. Observed Data
[Figure: observed counts with simulated means and target means vs. soil magnesium.]
Simulated Data vs. Observed Data
[Figure: rank correlation vs. distance, simulated and target.]
A Couple of Simulated Maps
[Figure: two simulated weed-count maps over the field, with zeros marked.]
References
S. Heijting, W. van der Werf, A. Stein, and M. J. Kropff (2007), "Are weed patches stable in location? Application of an explicitly two-dimensional methodology," Weed Research 47(5), pp. 381-395. DOI: 10.1111/j.1365-3180.2007.00580.x
W. H. Kruskal (1958), "Ordinal measures of association," Journal of the American Statistical Association 53, pp. 814-861.
L. Madsen and D. Birkes (2013), "Simulating dependent discrete data," Journal of Statistical Computation and Simulation 83(4), pp. 677-691.
J. Nešlehová (2007), "On rank correlation measures for non-continuous random variables," Journal of Multivariate Analysis 98, pp. 544-567.