M statistic commands: interpoint distance distribution analysis with Stata

Similar documents
BMC Medical Informatics and Decision Making 2005, 5:19

Meta-analysis of epidemiological dose-response studies

The effect of spatial aggregation on performance when mapping a risk of disease

Cluster Analysis using SaTScan

Practical Statistics

Homework Solutions Applied Logistic Regression

Spatial and Syndromic Surveillance for Public Health

ARIC Manuscript Proposal # PC Reviewed: _9/_25_/06 Status: A Priority: _2 SC Reviewed: _9/_25_/06 Status: A Priority: _2

Topic 7: Heteroskedasticity

Topic 3: Sampling Distributions, Confidence Intervals & Hypothesis Testing. Road Map Sampling Distributions, Confidence Intervals & Hypothesis Testing

GIST 4302/5302: Spatial Analysis and Modeling Point Pattern Analysis

Chapter 6 Spatial Analysis

STA6938-Logistic Regression Model

Postgraduate course: Anova and Repeated measurements Day 2 (part 2) Mogens Erlandsen, Department of Biostatistics, Aarhus University, November 2010

Statistical tests for a sequence of random numbers by using the distribution function of random distance in three dimensions

Implementing Matching Estimators for. Average Treatment Effects in STATA. Guido W. Imbens - Harvard University Stata User Group Meeting, Boston

Simulating complex survival data

Testing for global spatial autocorrelation in Stata

Disease mapping with Gaussian processes

Binary Dependent Variables

Understanding the multinomial-poisson transformation

Kullback-Leibler Designs

The Simulation Extrapolation Method for Fitting Generalized Linear Models with Additive Measurement Error

Epidemiology Principle of Biostatistics Chapter 11 - Inference about probability in a single population. John Koval

Gaussian processes for spatial modelling in environmental health: parameterizing for flexibility vs. computational efficiency

Generalized Linear Model under the Extended Negative Multinomial Model and Cancer Incidence

Bayesian Hierarchical Models

Epidemiology Wonders of Biostatistics Chapter 11 (continued) - probability in a single population. John Koval

Epidemiology Principle of Biostatistics Chapter 14 - Dependent Samples and effect measures. John Koval

Cluster Analysis using SaTScan. Patrick DeLuca, M.A. APHEO 2007 Conference, Ottawa October 16 th, 2007

Spatial Analysis 1. Introduction

BIO5312 Biostatistics Lecture 6: Statistical hypothesis testings

Lecture Outline. Biost 518 Applied Biostatistics II. Choice of Model for Analysis. Choice of Model. Choice of Model. Lecture 10: Multiple Regression:

Confidence intervals for the variance component of random-effects linear models

General Linear Model (Chapter 4)

Spatial Point Pattern Analysis

Econometrics for PhDs

1 Independent Practice: Hypothesis tests for one parameter:

ON SMALL SAMPLE PROPERTIES OF PERMUTATION TESTS: INDEPENDENCE BETWEEN TWO SAMPLES

FleXScan User Guide. for version 3.1. Kunihiko Takahashi Tetsuji Yokoyama Toshiro Tango. National Institute of Public Health

Obtaining Critical Values for Test of Markov Regime Switching

Gaussian Process Regression Model in Spatial Logistic Regression

Distribution Fitting (Censored Data)

TESTING FOR EQUAL DISTRIBUTIONS IN HIGH DIMENSION

Spatio-temporal correlations in fuel pin simulation : prediction of true uncertainties on local neutron flux (preliminary results)

Introduction to Geostatistics

Statistical Analysis of Spatio-temporal Point Process Data. Peter J Diggle

Cluster investigations using Disease mapping methods International workshop on Risk Factors for Childhood Leukemia Berlin May

USING CLUSTERING SOFTWARE FOR EXPLORING SPATIAL AND TEMPORAL PATTERNS IN NON-COMMUNICABLE DISEASES

Interactive GIS in Veterinary Epidemiology Technology & Application in a Veterinary Diagnostic Lab

Response surface models for the Elliott, Rothenberg, Stock DF-GLS unit-root test

Chapter 7 Fall Chapter 7 Hypothesis testing Hypotheses of interest: (A) 1-sample

An Introduction to SaTScan

Confounding and effect modification: Mantel-Haenszel estimation, testing effect homogeneity. Dankmar Böhning

Instance-based Learning CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

The simulation extrapolation method for fitting generalized linear models with additive measurement error

Outline. Practical Point Pattern Analysis. David Harvey s Critiques. Peter Gould s Critiques. Global vs. Local. Problems of PPA in Real World

SIMULATION-BASED SENSITIVITY ANALYSIS FOR MATCHING ESTIMATORS

Response surface models for the Elliott, Rothenberg, Stock DF-GLS unit-root test

10/8/2014. The Multivariate Gaussian Distribution. Time-Series Plot of Tetrode Recordings. Recordings. Tetrode

On block bootstrapping areal data Introduction

Finding Outliers in Monte Carlo Computations

Intensity Analysis of Spatial Point Patterns Geog 210C Introduction to Spatial Data Analysis

Functional Form. So far considered models written in linear form. Y = b 0 + b 1 X + u (1) Implies a straight line relationship between y and X

Analyzing spatial autoregressive models using Stata

A short guide and a forest plot command (ipdforest) for one-stage meta-analysis

Implementing Conditional Tests with Correct Size in the Simultaneous Equations Model

Learning Objectives for Stat 225

Density Estimation (II)

Unit 14: Nonparametric Statistical Methods

NONPARAMETRIC ADJUSTMENT FOR MEASUREMENT ERROR IN TIME TO EVENT DATA: APPLICATION TO RISK PREDICTION MODELS

Chapte The McGraw-Hill Companies, Inc. All rights reserved.

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas

Implementing Matching Estimators for. Average Treatment Effects in STATA

Lecture 3.1 Basic Logistic LDA

ADJUSTED POWER ESTIMATES IN. Ji Zhang. Biostatistics and Research Data Systems. Merck Research Laboratories. Rahway, NJ

Lecture 12: Effect modification, and confounding in logistic regression

Propensity Score Matching and Analysis TEXAS EVALUATION NETWORK INSTITUTE AUSTIN, TX NOVEMBER 9, 2018

2. We care about proportion for categorical variable, but average for numerical one.

Nonrecursive Models Highlights Richard Williams, University of Notre Dame, Last revised April 6, 2015

The purpose of this section is to derive the asymptotic distribution of the Pearson chi-square statistic. k (n j np j ) 2. np j.

Global Spatial Autocorrelation Clustering

Defining Statistically Significant Spatial Clusters of a Target Population using a Patient-Centered Approach within a GIS

Multivariate Distributions

Inferential Statistics

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p )

Recall the Basics of Hypothesis Testing

Multivariate Analysis of Crime Data using Spatial Outlier Detection Algorithm

dqd: A command for treatment effect estimation under alternative assumptions

(Mis)use of matching techniques

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements:

Bootstrapping, Permutations, and Monte Carlo Testing

Bayesian SAE using Complex Survey Data Lecture 4A: Hierarchical Spatial Bayes Modeling

ECE295, Data Assimila0on and Inverse Problems, Spring 2015

Inference for Binomial Parameters

ECON 594: Lecture #6

SaTScan TM. User Guide. for version 7.0. By Martin Kulldorff. August

Person-Time Data. Incidence. Cumulative Incidence: Example. Cumulative Incidence. Person-Time Data. Person-Time Data

Estimating the Fractional Response Model with an Endogenous Count Variable

Transcription:

M statistic commands: interpoint distance distribution analysis with Stata

Motivation: spatial distribution (1) In many situations we are interested in answering the following QUESTION: Is a certain phenomenon more (less) concentrated in a certain area? Typical question of cluster analysis 2

Motivation: spatial distribution (2) More general problem: spatial distribution analysis QUESTION: Does the group of interest have a different spatial distribution than the population (null distribution)? QUESTION: Does group 1 have a different spatial distribution than group 2? 3

Motivation: spatial distribution (3) longitude 2000 4000 6000 8000 10000 4000 6000 8000 10000 latitude dental reduction no dental reduction Source: Alt and Vach Data, Waller, L. and Gotway, C., Applied Spatial Statistics for Public Health Data. Wiley IEEE, 2004. 4

Motivation: spatial distribution (4) Massachusetts Leukemia Data Leukemia Data In Massachusetts H0 : do locations w/ OBS>EXP and OBS<EXP have the same SD? 0 1 Massachusetts Breast Cancer Data Breast Cancer Data In Massachusetts 0 Source: Massachusetts Cancer Report 2006, elaboration and 1 coordinates Harvard School of Public Health. 5

New commands Mstat and Mtest are Stata routines that can be used to test H0. Based on the Euclidean distance in bi dimensional spaces. Applications Epidemiology, Sociology, Economics, Demography, etc whenever the fact that two phenomena are equally distributed over a given area of interest is not obvious and relevant at the same time. Extensions K dimensional spaces Non Euclidean metrics or general dissimilarity measures H0 : is group j distributed as the underlying null (population) distibution (1 sample M statistic) 6

Theory: Interpoint Distance Distribution (IDD) The main statistic on which M is based is the Empirical (Cumulative) Density Function (ECDF) of the Interpoint Distance Distribution. X1 d1,2 Xi di,j X2 Xj X3 d3,n Xn D= 0 d... d 1,2 1, n d 2,1 0 M M O d n 1, n d... d 0 n,1 nn, 1 7

Interpoint Distance Distribution (2) From n observations D is an nxn symmetric matrix w/ zero main diagonal. We calculate di,j, i j D= 0 d... d 1,2 1, n d 2,1 0 M M O d n 1, n d... d 0 n,1 nn, 1 d is a sample of n nn ( 1) = 2 2 DEPENDENT distances 8

Interpoint Distance Distribution (3) n nn ( 1) 2 2 We use all the = distances. From a sample of n observations we get roughly n 2 /2 distances: COMPLEX RELATION. Others use different, less informative statistics: distance to the nearest neighbor ( or k<n 1 neighbors), average distance, etc Forsberg et al. [4] show that using all distances is more powerful. 9

Density 0.00005.0001.00015.0002.00025 Interpoint Distance Distribution (4) We build the Empirical Density Function f(d) for the IDD: a statistic we use to collect information on the (unknown) spatial distribution. 0 2000 4000 6000 8000 Interpoint Distance 0 2000 4000 6000 8000 Interpoint Distance 10

The M statistic Based on a discretized version of the ECDF: d vector of cutoffs values of the BINS, whose number affects the power of our test [3]. With k bins: Fˆ( d) = Fˆ( d ),..., Fˆ( d ) 1 k where Fˆ d 1 n = 1 2 i, j i j ( ) { d d } l l 11

The M statistic (2) 1 sample M statistic: one group vs population H0: 0 ˆ ( d ) F d = F ( ) ( ) being the observed ECDF: F ˆ( ) ( ) M = F F Σ F F ( 0 ) T d d ( ˆ( d) 0 ( d) ) is the Moore Penrose (Mata pinv()) generalized inverse of the variance covariance matrix of F ˆ d,. Σ ( ) Σ d 12

The M statistic (3) σ Σ= l,m σ 1, with n = 4 1 d 1 d ˆ d ˆ d { d } { d } F( ) F( ) l, m i, j l i, k m l m 3 i< j, k Bonetti and Pagano (2005) show that M d χ k Slow convergence Monte Carlo test. 2 13

The M statistic (4) 2 samples M statistic: Group 1 vs Group 2 H0: F ( ) ( ) 1 2 Equal variance assumption: σ d M = F F Σ F F = F ( ( ) ( )) T ˆ ( ) ( ) 1 d ˆ ˆ ˆ 2 d 1 d 2 d 1 d ( ) n + n n = 4 1 d 1 d { d } { d } 1 2 l, m i, j l i, k m nn 1 2 3 i< j, k 14

2 samples M Test Kernel Densities density 0.5 1 1.5 2 10 observations 0.5 1 interpoint distance density: Group1 density: Group2 45 distances 15

Mstat and Mtest algorithm Mstat: (1) Generate the distance matrix D; (2) Generate the cutoff vector d so to have EQUIPROBABLE BINS (wrt the population); (3) Compute the ECDF in Group 1 and 2 at d; (4) Compute the matrix, take its (generalized) inverse and compute M. Σ Mtest: (A) Execute steps (1) (4) of Mstat algorithm; (B) Permute the Group indicator variable (dummy 0 1); (C) Execute step (3) of Mstat algorithm; (D) Using d and Σ from step (2) and (4) of Mstat algorithm compute M; (E) Iterate (A) (D) P times, generating a vector [M,M1,M2,,MP]; (F) Compute the Monte Carlo p value= (#Mi M)/P, and its exact binomial confidence interval. 16

Mstat and Mtest commands DATASET: must contain three variables x coordinates y coordinates Group dummy variable Syntax: Mstat, x(varname) y(varname) g(varname) bins(#) scatter density chi2 Mtest, x(varname) y(varname) g(varname) bins(#) sc den iter(#) level(#) 17

Mstat and Mtest commands (2) longitude 2000 4000 6000 8000 10000 4000 6000 8000 10000 latitude density 0.00005.0001.00015.0002.00025 Kernel Densities 0 2000 4000 6000 8000 interpoint distance dental reduction no dental reduction density: dental reduction density: no dental reduction 18

Mstat options x*(varname) y*(varname) g*(varname) x coordinates y coordinates 0 1 dummy bins(#) scatter density chi2 number of bins scatter plot Kernel density asymptotic Chi2 pvalue 19

Mtest options x*(varname) y*(varname) g*(varname) bins(#) scatter density iter(#) level(#) x coordinates y coordinates 0 1 dummy number of bins scatter plot Kernel density # of permutations conf level for pvalue C.I. 20

Cancer Data in Massachusetts Datasets are fully compatible with Pisati s spmap Leukemia Data From MA cancer report, we have 348 locations (census tracts) with Massachusetts Leukemia Data EXPECTED EVENTS OBSERVED EVENTS We build the dummy Group1: EXP<OBS Group2: OBS EXP Plot with spmap Run Mtest 0 1. Mtest, x(lat) y(long) g(g) den iter(1000) M statistic Monte Carlo permutation results H0: The two groups have the same spatial distribution Number of bins = 20 Number of permutations = 1000 T T(obs) c n p=c/n SE(p) [95% Conf. Interval] M 26.28204 159 1000 0.1590 0.0116.1368657.1831623 Note: confidence interval is with respect to p=c/n. Note: c = #{T >= T(obs)} 21

Cancer Data in Massachusetts (2) Breast Cancer Data Massachusetts Breast Cancer Data. Mtest, x(lat) y(long) g(g) den iter(1000) 0 1 M statistic Monte Carlo permutation results H0: The two groups have the same spatial distribution Number of bins = 20 Number of permutations = 1000 T T(obs) c n p=c/n SE(p) [95% Conf. Interval] M 38.56281 4 1000 0.0040 0.0020.0010909.0102097 Note: confidence interval is with respect to p=c/n. Note: c = #{T >= T(obs)} 22

Future Developments Two samples M with general, non Euclidean dissimilarity. Possible for the user to input the matrix D. One sample M. The user need to be familiar with the underlying theory: specification of the null distribution critical. 23

Acknowledgments Harvard University, School of Public Health, Biostatistics Department. Universita Commerciale Luigi Bocconi, Dipartimento di Scienze delle Decisioni. Research supported in part by a grant from the NIH, P01 CA 134294. A special thanks to Justin Manjourides, PhD and Al Ozonoff, PhD for the helpful insights and suggestions during my Summer 2010 visit at HSPH. 24

Selected References [1] Bonetti, M. and Pagano, M., The interpoint distance distribution as a descriptor of point patterns, with an application to spatial disease clustering. Stat Med 2005; 24(5):753 773. [2] Manjourides, J. and Pagano, M., A test of the difference between two interpoint distance distributions. Submitted [3] Forsberg, L., Bonetti, M. and Pagano, M., The choice of the number of bins for the M statistic. Computational Statistics & Data Analysis 2009; 53(10): 3640 3649. [4] Bonetti, M., Forsberg, L., Ozonoff, A. and Pagano, M., The distribution of interpoint distances. Mathematical Modeling Applications in Homeland Security. HT Banks and C Castillo Chavez, Eds. ; 2003:87 106. [5] Forsberg, L., Bonetti, M., Jeffery, C., Ozonoff, A. and Pagano, M., Distance Based Methods for Spatial and Spatio Temporal Surveillance. Spatial and Syndromic Surveillance for Public Health (ch.8). John Wiley & Sons, Ltd; 2005. 25