Simulating Realistic Ecological Count Data

Simulating Realistic Ecological Count Data. Lisa Madsen and Dave Birkes, Oregon State University. Statistics Department Seminar, May 2, 2011.

Outline
1 Motivation: Example: Weed Counts
2 Pearson Correlation; Spearman Correlation; Limits to Dependence
3 Simulation
4 Apply: Simulating Outliers; Simulation Results

Why Simulate Data?

Simulation studies are useful for: assessing the performance of analytical procedures; power analysis or sample size determination; finding a good sampling design.

Weed Counts vs. Soil Magnesium (Heijting et al., 2007). [Scatter plot of weed counts against soil magnesium (mg/kg), with ordinary counts and outliers marked.]

Maps of Weed Counts and Magnesium. [Side-by-side maps of the field showing weed counts (counts, outliers, zeros) and soil magnesium.]

Pearson Correlation

The usual measure of dependence between X and Y is the Pearson product-moment correlation coefficient:

ρ(X, Y) = [E(XY) - E(X)E(Y)] / [var(X) var(Y)]^{1/2}.

Estimate ρ(X, Y) from a sample (X_1, Y_1), ..., (X_n, Y_n) as

ρ̂(X, Y) = Σ_{i=1}^n (X_i - X̄)(Y_i - Ȳ) / [Σ_{i=1}^n (X_i - X̄)² Σ_{i=1}^n (Y_i - Ȳ)²]^{1/2},

the sample Pearson correlation coefficient.
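For concreteness, a minimal numpy version of this estimator (the function name is mine):

```python
import numpy as np

def pearson_corr(x, y):
    """Sample Pearson product-moment correlation of paired data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    # Centered cross-product over the product of centered sums of squares
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))
```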

Pearson Correlation Measures Linear Dependence. [Two plots: X against X, where ρ(X, X) = 1, and e^X against X, where ρ(X, e^X) < 1.]

Pearson Correlation Measures Linear Dependence

For bivariate normal X and Y, ρ(X, Y) completely characterizes dependence. For non-normal X and Y, other measures of dependence may be more appropriate.

Spearman Correlation

The Spearman correlation coefficient is

ρ_S(X, Y) = 3{P[(X - X_1)(Y - Y_1) > 0] - P[(X - X_1)(Y - Y_1) < 0]},

where X_1 =_d X and Y_1 =_d Y, with X_1 and Y_1 independent of one another and of (X, Y).

Estimating Spearman Correlation

Given a bivariate sample (X_1, Y_1), ..., (X_n, Y_n), calculate the ranks r(X_i) and r(Y_i). Then

ρ̂_S(X, Y) = Σ_{i=1}^n [r(X_i) - (n + 1)/2][r(Y_i) - (n + 1)/2] / [n(n² - 1)/12],

the sample Pearson correlation coefficient of the ranked data.
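A sketch of this estimator (it uses scipy's rankdata, which assigns midranks to ties; for distinct values the centered sum of squared ranks reduces to the n(n² - 1)/12 denominator above):

```python
import numpy as np
from scipy.stats import rankdata

def spearman_corr(x, y):
    """Spearman correlation: the Pearson correlation of the ranks.
    rankdata assigns midranks to tied values."""
    rx, ry = rankdata(x), rankdata(y)
    rxc, ryc = rx - rx.mean(), ry - ry.mean()
    return (rxc @ ryc) / np.sqrt((rxc @ rxc) * (ryc @ ryc))
```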

Spearman Correlation Measures Monotone Dependence

ρ_S(X, e^X) = ρ_S(X, X) = 1 ... provided X is continuous.

Correcting for Ties

When X is discrete, one can construct X and Y so that X = Y almost surely but ρ_S(X, Y) < 1. Rescale ρ_S so that it ranges between -1 and 1:

ρ_RS(X, Y) = ρ_S(X, Y) / {[1 - Σ_x p(x)³][1 - Σ_y q(y)³]}^{1/2},

where p(x) = P(X = x) and q(y) = P(Y = y) (Nešlehová, 2007).
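To see the rescaling in action, one can compute the population ρ_S for the comonotone case X = Y almost surely. From the definition of ρ_S, conditioning on X = x gives ρ_S(X, X) = 3 Σ_x p(x)[F(x⁻) - (1 - F(x))]², which the rescaling pushes back up to exactly 1 (a sketch; this closed form is my derivation, not from the slides, and the function names are mine):

```python
import numpy as np

def rho_S_comonotone(p):
    """Population Spearman correlation rho_S(X, X) for discrete X with
    PMF p on support 0, 1, ..., len(p)-1.  With ties it falls below 1."""
    p = np.asarray(p, dtype=float)
    F = np.cumsum(p)       # F(x)
    Fm = F - p             # F(x-), the left limit
    S = 1.0 - F            # P(X > x)
    return 3.0 * np.sum(p * (Fm - S) ** 2)

def rho_RS_comonotone(p):
    """Neslehova's rescaled Spearman correlation for X = Y a.s."""
    p = np.asarray(p, dtype=float)
    return rho_S_comonotone(p) / (1.0 - np.sum(p ** 3))
```

For a Bernoulli(0.4) margin, ρ_S(X, X) = 3·(0.4)(0.6) = 0.72 < 1, while the rescaled version returns 1.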

Ties in Sample Ranks

Two common methods for handling ties in a sample X_1, ..., X_n:
Random ranks: when u tied values would occupy ranks p_1, ..., p_u if they were distinct, randomly assign these u ranks to the tied values.
Midranks: assign each tied value the average rank, (1/u) Σ_{k=1}^u p_k.

Rescaled Spearman Correlation and Midranks

For a sample (X_1, Y_1), ..., (X_n, Y_n), let the distribution of (X, Y) be the empirical distribution function of the sample. Then ρ_RS(X, Y) coincides with the sample Pearson correlation coefficient of the midranks (Nešlehová, 2007).

Limits to Dependence

For X and Y with joint CDF H(x, y) and marginal CDFs F(x) and G(y), the Fréchet-Hoeffding bounds are

max[F(x) + G(y) - 1, 0] ≤ H(x, y) ≤ min[F(x), G(y)].

These bounds induce margin-dependent bounds on ρ(X, Y) and ρ_S(X, Y). For X ~ Bernoulli(p_X) and Y ~ Bernoulli(p_Y) with p_X ≥ p_Y,

ρ(X, Y) ≤ {(1 - p_X) p_Y / [p_X (1 - p_Y)]}^{1/2}.
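A small numerical check of the Bernoulli bound (a sketch; helper names are mine, and I take the bound as the square root of the ratio (1 - p_X)p_Y / [p_X(1 - p_Y)] for p_X ≥ p_Y, attained by the comonotone coupling H = min(F, G)):

```python
import numpy as np

def max_bernoulli_corr(pX, pY):
    """Upper bound on Pearson correlation for Bernoulli(pX), Bernoulli(pY)
    margins, assuming pX >= pY."""
    return np.sqrt((1 - pX) * pY / (pX * (1 - pY)))

def corr_at_upper_bound(pX, pY):
    """Correlation under the Frechet-Hoeffding upper bound
    H(x, y) = min(F(x), G(y)), where P(X=1, Y=1) = min(pX, pY)."""
    p11 = min(pX, pY)
    cov = p11 - pX * pY
    return cov / np.sqrt(pX * (1 - pX) * pY * (1 - pY))
```

For p_X = 0.7 and p_Y = 0.3 the maximum attainable correlation is 0.3/0.7 ≈ 0.43, well below 1.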

Simulation

Suppose we want to simulate a dependent vector Y = [Y_1, ..., Y_N] where Y_i has marginal CDF F_i.

1. Simulate a multivariate standard normal vector Z. Its variance-covariance matrix Σ_Z will determine the dependence among the Y_i. (Aside: to simulate maximally dependent Y_i and Y_j, set the corresponding element of Σ_Z equal to 1.)

2. Transform each element of Z to obtain the desired marginals: Y_i = F_i^{-1}{Φ(Z_i)}, where Φ(·) is the standard normal CDF.

Inverse CDF for Discrete Distributions

F_i^{-1}(u) = inf{x : F_i(x) ≥ u}. [Plot of the Bernoulli(0.4) CDF F(x), illustrating the generalized inverse.]
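The two steps can be sketched as follows (an illustration, not the authors' code: I substitute Poisson margins for the fitted CDFs, and scipy's ppf implements the generalized inverse inf{x : F(x) ≥ u}):

```python
import numpy as np
from scipy.stats import norm, poisson

def simulate_counts(Sigma_Z, inv_cdfs, rng):
    """Step 1: correlated standard normals Z; step 2: Y_i = F_i^{-1}(Phi(Z_i))."""
    L = np.linalg.cholesky(Sigma_Z)          # Sigma_Z must be positive definite
    Z = L @ rng.standard_normal(len(inv_cdfs))
    U = norm.cdf(Z)                          # Phi(Z_i) is Uniform(0, 1)
    return np.array([F_inv(u) for F_inv, u in zip(inv_cdfs, U)])

# Illustration (my choice): three Poisson counts with decaying correlation
Sigma_Z = np.array([[1.0, 0.6, 0.36],
                    [0.6, 1.0, 0.6],
                    [0.36, 0.6, 1.0]])
mus = [2.0, 5.0, 10.0]
inv_cdfs = [lambda u, m=m: poisson.ppf(u, m) for m in mus]
rng = np.random.default_rng(1)
y = simulate_counts(Sigma_Z, inv_cdfs, rng)  # one simulated count vector
```

The same recipe applies with the fitted negative binomial hurdle CDFs in place of the Poisson margins.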

Weed Data. [Scatter plot of weed counts against soil magnesium (mg/kg), counts and outliers marked.]

A Plausible Marginal Model

The negative binomial hurdle model is a Bernoulli mixture of a point mass at 0 and a negative binomial left-truncated at 1:

P(Y = y) = π, for y = 0;
P(Y = y) = (1 - π) [Γ(θ + y) / (Γ(θ) Γ(y + 1))] [θ/(θ + μ)]^θ [μ/(θ + μ)]^y / {1 - [θ/(θ + μ)]^θ}, for y ≥ 1.

Model π and the negative binomial mean μ as functions of the covariate, x = soil magnesium.

Negative Binomial Hurdle CDF

The target CDF for Y_i is then

F_i(y) = π_i + [(1 - π_i) / (1 - g_i(0 | μ_i, θ))] {G_i(y | μ_i, θ) - g_i(0 | μ_i, θ)}

for y ≥ 0, where G_i(· | μ_i, θ) and g_i(· | μ_i, θ) are the negative binomial CDF and PMF, with log(μ_i) = β_0 + β_1 x_i and logit(π_i) = γ_0 + γ_1 x_i.
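A sketch of this target CDF using scipy's negative binomial, which under the parameterization n = θ, p = θ/(θ + μ) matches the PMF on the previous slide (the function name is mine):

```python
import numpy as np
from scipy.stats import nbinom

def hurdle_cdf(y, pi0, mu, theta):
    """CDF of the negative binomial hurdle model: point mass pi0 at 0,
    plus weight (1 - pi0) on a NB(mu, theta) left-truncated at 1."""
    p = theta / (theta + mu)        # scipy's success probability
    g0 = nbinom.pmf(0, theta, p)    # untruncated P(Y* = 0)
    G = nbinom.cdf(y, theta, p)     # untruncated CDF
    return np.where(np.asarray(y) < 0, 0.0,
                    pi0 + (1 - pi0) * (G - g0) / (1 - g0))
```

By construction F(0) = π, and the truncation rescaling pushes the CDF to 1 in the tail.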

Weed Data with Fitted Means. [Scatter plot of weed counts against soil magnesium with the fitted NB hurdle mean curve overlaid.]

Comments

Unlike data analysis, the goal here is a tractable but flexible model. The marginal CDFs can be determined by other methods, and different marginals need not come from the same family.

The Principle of Spatial Dependence

Dependence between observations is higher when they are close together. [Plot of dependence decaying with distance.]

The Variogram

var(Y_i - Y_j) is small if Y_i and Y_j are dependent. [Plot of a variogram increasing with distance.]

Stationarity

A typical spatial data set represents a single incomplete sample of size N = 1 from a spatial random process. To make inference feasible, we assume stationarity, i.e. E(Y_i) = E(Y_j) and var(Y_i - Y_j) = 2γ(h_ij), where h_ij is the vector between the locations of Y_i and Y_j, and γ(·) is called the semivariogram.

Weed counts are not stationary: means differ, and larger means are associated with larger variances. The stationarity assumption is more reasonable for ranks than for counts.

Ranking Spatial Data

The estimator of ρ_S uses a sample (X_1, Y_1), ..., (X_n, Y_n), but a spatial sample has no replication. Kruskal (1958): the population analog of the rank r(Y_i) is F(Y_i). For each Y_i, we can estimate its CDF F_i by plugging in point estimates of the parameters. If Y_i is unusually large (or small) given its estimated distribution, F̂_i(Y_i) will also be unusually large (or small), but F̂_1(Y_1), ..., F̂_n(Y_n) will all be on the same scale.

Estimating Spatial Dependence

Fit a parametric semivariogram model to the ranked spatial counts. [Plot of the empirical semivariogram of the ranks with the fitted exponential model against distance.] For Y_i and Y_j separated by a distance h_ij,

(1/2) var[F_i(Y_i) - F_j(Y_j)] = 0.03 + 0.027 (1 - e^{-h_ij/1.36}),

which implies ρ̂_RS(Y_i, Y_j) = 0.47 e^{-h_ij/1.36}.
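The fitted numbers convert to the rank-correlation function as follows (a sketch; the function name is mine, and dividing the partial sill by the total sill, i.e. treating the sill as the variance of the ranks, is my reading of the slide's step, which reproduces the 0.47 coefficient):

```python
import numpy as np

def rank_corr_from_semivariogram(h, nugget=0.03, psill=0.027, range_=1.36):
    """Rank correlation implied by an exponential semivariogram
    gamma(h) = nugget + psill * (1 - exp(-h / range_)):
    rho(h) = C(h) / sill, with covariance C(h) = psill * exp(-h / range_)."""
    sill = nugget + psill
    return (psill / sill) * np.exp(-np.asarray(h, dtype=float) / range_)
```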

Calculating Σ_Z

1. For each pair i, j, obtain

ρ̂_S(Y_i, Y_j) = {[1 - Σ_{r=0}^∞ f̂_i(r)³][1 - Σ_{s=0}^∞ f̂_j(s)³]}^{1/2} ρ̂_RS(Y_i, Y_j),

where f̂_i and f̂_j are the estimated PMFs of Y_i and Y_j.

2. Then numerically solve for δ = ρ(Z_i, Z_j) in

ρ̂_S(Y_i, Y_j) = 3 Σ_{r=0}^∞ Σ_{s=0}^∞ f̂_i(r) f̂_j(s) (Φ_δ{Φ^{-1}[F̂_i(r-1)], Φ^{-1}[F̂_j(s-1)]} + Φ_δ{Φ^{-1}[1 - F̂_i(r)], Φ^{-1}[1 - F̂_j(s)]} - Φ_δ{Φ^{-1}[F̂_i(r-1)], Φ^{-1}[1 - F̂_j(s)]} - Φ_δ{Φ^{-1}[1 - F̂_i(r)], Φ^{-1}[F̂_j(s-1)]}),

where Φ_δ denotes the bivariate standard normal CDF with correlation δ.

Apply

Retain the locations and covariate values from the data set. Simulate a multivariate standard normal vector Z with correlation matrix Σ_Z. Set Y_i = F̂_i^{-1}{Φ(Z_i)}.

Two Outlier Processes. [Scatter plot of weed counts against soil magnesium (mg/kg), counts and outliers marked.]

Outliers Localized. [Maps of weed counts and magnesium showing the outliers confined to one band of the field.]

Empirical Observations About Outliers

Outliers occur in the region between y = 17 and y = 33 meters. Outliers associated with magnesium between 250 and 300 are between 12.9 and 14.9 larger than the target means, whereas outliers associated with magnesium above 330 are between 2.6 and 10.3 larger.

Augmenting the Simulated Data with Outliers

From the 139 simulated weed counts:
Randomly select 4 to 6 locations with y-coordinates between 17 and 33 and magnesium between 250 and 300; set these counts equal to the integer part of the target mean plus a random uniform on (12, 15).
Randomly select another 4 to 6 points with y-coordinates between 17 and 33 and magnesium exceeding 330; set these to the integer part of the target mean plus a random uniform on (2, 11).
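The recipe above can be sketched as follows (array names are mine; the y-coordinate strip, magnesium thresholds, and uniform ranges follow the slides):

```python
import numpy as np

def add_outliers(counts, ycoord, mg, target_means, rng):
    """Overwrite a few simulated counts with outliers: 4-6 big outliers
    where 250 <= mg <= 300 and 4-6 smaller ones where mg > 330, all
    inside the strip 17 <= y <= 33."""
    counts = np.asarray(counts).copy()
    strip = (ycoord >= 17) & (ycoord <= 33)
    # Group 1: moderate magnesium, uniform bump on (12, 15)
    g1 = np.flatnonzero(strip & (mg >= 250) & (mg <= 300))
    pick = rng.choice(g1, size=rng.integers(4, 7), replace=False)
    counts[pick] = (target_means[pick] + rng.uniform(12, 15, pick.size)).astype(int)
    # Group 2: high magnesium, uniform bump on (2, 11)
    g2 = np.flatnonzero(strip & (mg > 330))
    pick = rng.choice(g2, size=rng.integers(4, 7), replace=False)
    counts[pick] = (target_means[pick] + rng.uniform(2, 11, pick.size)).astype(int)
    return counts
```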

Simulated Data vs. Observed Data. [100 simulated data sets of weed counts against soil magnesium (mg/kg), with the observed data overlaid.]

Simulated Data vs. Observed Data. [Weed counts against soil magnesium with simulated means and target means overlaid.]

Simulated Data vs. Observed Data. [Simulated and target rank correlation plotted against distance.]

A Couple of Simulated Maps. [Two simulated maps of weed counts over the field, with zeros marked.]

References

S. Heijting, W. Van Der Werf, A. Stein, and M.J. Kropff (2007), Are weed patches stable in location? Application of an explicitly two-dimensional methodology, Weed Research 47(5), pp. 381-395. DOI: 10.1111/j.1365-3180.2007.00580.x

W.H. Kruskal (1958), Ordinal measures of association, Journal of the American Statistical Association 53, pp. 814-861.

L. Madsen and D. Birkes (2013), Simulating dependent discrete data, Journal of Statistical Computation and Simulation 83(4), pp. 677-691.

J. Nešlehová (2007), On rank correlation measures for non-continuous random variables, Journal of Multivariate Analysis 98, pp. 544-567.