The energy Package. October 3, Title E-statistics (energy statistics) tests of fit, independence, clustering

The energy Package

October 3, 2005

Title        E-statistics (energy statistics) tests of fit, independence, clustering
Version      1.0-3
Date         2005-10-02
Author       Maria L. Rizzo <rizzo@math.ohiou.edu> and Gabor J. Szekely <gabors@bgnet.bgsu.edu>
Description  E-statistics (energy) tests for comparing distributions: multivariate normality, multivariate k-sample test for equal distributions, hierarchical clustering by e-distances, multivariate independence test, Poisson test. The energy-statistics concept, based on a generalization of Newton's potential energy, is due to Gabor J. Szekely.
Maintainer   Maria Rizzo <rizzo@math.ohiou.edu>
License      GPL 2.0 or later

R topics documented:

    edist
    energy.hclust
    eqdist.etest
    Independence E-test
    ksample.e
    mvnorm.etest
    poisson.mtest


edist    E-distance

Description

Returns the E-distances (energy statistics) between clusters.

Usage

    edist(x, sizes, distance = FALSE, ix = 1:sum(sizes), alpha = 1)

Arguments

    x         data matrix of pooled sample or Euclidean distances
    sizes     vector of sample sizes
    distance  logical: if TRUE, x is a distance matrix
    ix        a permutation of the row indices of x
    alpha     distance exponent

Details

A vector containing the pairwise two-sample multivariate E-statistics for comparing clusters or samples is returned. The e-distance between clusters is computed from the original pooled data, stacked in matrix x where each row is a multivariate observation, or from the distance matrix x of the original data, or distance object returned by dist. The first sizes[1] rows of the original data matrix are the first sample, the next sizes[2] rows are the second sample, etc. The permutation vector ix may be used to obtain e-distances corresponding to a clustering solution at a given level in the hierarchy.

The e-distance between two clusters $C_i$, $C_j$ of size $n_i$, $n_j$ proposed by Szekely and Rizzo (2003) is the e-distance $e(C_i, C_j)$, defined by

$$e(C_i, C_j) = \frac{n_i n_j}{n_i + n_j} \left[ 2 M_{ij} - M_{ii} - M_{jj} \right],$$

where

$$M_{ij} = \frac{1}{n_i n_j} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \| X_{ip} - X_{jq} \|^{\alpha},$$

$\| \cdot \|$ denotes Euclidean norm, $\alpha$ = alpha, and $X_{ip}$ denotes the p-th observation in the i-th cluster. The exponent alpha should be in the interval (0,2].

Value

An object of class dist containing the lower triangle of the e-distance matrix of cluster distances corresponding to the permutation of indices ix is returned.

References

Szekely, G. J. and Rizzo, M. L. (2005) Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method, Journal of Classification 22(2) (in press).

Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).

Szekely, G. J. (2000) Technical Report 03-05: E-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.

See Also

energy.hclust, eqdist.etest, ksample.e

Examples

    ## compute e-distances for 3 samples of iris data
    data(iris)
    edist(iris[,1:4], c(50,50,50))

    ## compute e-distances from vector of group labels
    d <- dist(matrix(rnorm(100), nrow=50))
    g <- cutree(energy.hclust(d), k=4)
    edist(d, sizes=table(g), ix=rank(g, ties.method="first"))


energy.hclust    Hierarchical Clustering by Minimum (Energy) E-distance

Description

Performs hierarchical clustering by minimum (energy) E-distance method.

Usage

    energy.hclust(dst, alpha = 1)

Arguments

    dst    Euclidean distances in a dist object, or a distance matrix produced
           by dist, or lower triangle of distance matrix as vector in column
           order. If dst is a square matrix, the lower triangle is interpreted
           as a vector of distances.
    alpha  distance exponent

Details

Dissimilarities are $d(x, y) = \| x - y \|^{\alpha}$, where the exponent $\alpha$ is in the interval (0,2]. This function performs agglomerative hierarchical clustering. Initially, each of the n singletons is a cluster. At each of n-1 steps, the procedure merges the pair of clusters with minimum e-distance. The e-distance between two clusters $C_i$, $C_j$ of sizes $n_i$, $n_j$ is given by

$$e(C_i, C_j) = \frac{n_i n_j}{n_i + n_j} \left[ 2 M_{ij} - M_{ii} - M_{jj} \right],$$

where

$$M_{ij} = \frac{1}{n_i n_j} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \| X_{ip} - X_{jq} \|^{\alpha},$$

$\| \cdot \|$ denotes Euclidean norm, and $X_{ip}$ denotes the p-th observation in the i-th cluster.

The return value is an object of class hclust, so hclust methods such as print or plot methods, plclust, and cutree are available. See the documentation for hclust.

The e-distance measures both the heterogeneity between clusters and the homogeneity within clusters. E-clustering ($\alpha = 1$) is particularly effective in high dimension, and is more effective than some standard hierarchical methods when clusters have equal means (see example below). For other advantages see the references.

Value

An object of class hclust which describes the tree produced by the clustering process. The object is a list with components:

    merge        an n-1 by 2 matrix, where row i of merge describes the merging
                 of clusters at step i of the clustering. If an element j in the
                 row is negative, then observation -j was merged at this stage.
                 If j is positive then the merge was with the cluster formed at
                 the (earlier) stage j of the algorithm.
    height       the clustering height: a vector of n-1 non-decreasing real
                 numbers (the e-distance between merging clusters)
    order        a vector giving a permutation of the indices of original
                 observations suitable for plotting, in the sense that a cluster
                 plot using this ordering and matrix merge will not have
                 crossings of the branches.
    labels       labels for each of the objects being clustered.
    call         the call which produced the result.
    method       the cluster method that has been used (e-distance).
    dist.method  the distance that has been used to create dst.

References

Szekely, G. J. and Rizzo, M. L. (2005) Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method, Journal of Classification 22(2) (in press).

Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).

Szekely, G. J. (2000) Technical Report 03-05: E-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.

See Also

edist, ksample.e, eqdist.etest, hclust

Examples

    ## Not run:
    library(cluster)
    data(animals)
    plot(energy.hclust(dist(animals)))
    ## End(Not run)

    data(USArrests)
    ecl <- energy.hclust(dist(USArrests))
    print(ecl)
    plot(ecl)
    cutree(ecl, k=3)
    cutree(ecl, h=150)

    ## compare performance of e-clustering, Ward's method, group average method
    ## when sampled populations have equal means: n=200, d=5, two groups
    z <- rbind(matrix(rnorm(1000), nrow=200), matrix(rnorm(1000, 0, 5), nrow=200))
    g <- c(rep(1, 200), rep(2, 200))
    d <- dist(z)
    e <- energy.hclust(d)
    a <- hclust(d, method="average")
    w <- hclust(d^2, method="ward")
    list("e" = table(cutree(e, k=2) == g),
         "Ward" = table(cutree(w, k=2) == g),
         "Avg" = table(cutree(a, k=2) == g))


eqdist.etest    Multisample E-statistic (Energy) Test of Equal Distributions

Description

Performs the nonparametric multisample E-statistic (energy) test for equality of multivariate distributions.

Usage

    eqdist.etest(x, sizes, distance = FALSE, incomplete = FALSE, N = 100, R = 999)

Arguments

    x           data matrix of pooled sample
    sizes       vector of sample sizes
    distance    logical: if TRUE, first argument is a distance matrix
    incomplete  logical: if TRUE, compute incomplete E-statistics
    N           sample size for incomplete statistics
    R           number of bootstrap replicates

Details

The k-sample multivariate E-test of equal distributions is performed. The statistic is computed from the original pooled samples, stacked in matrix x where each row is a multivariate observation, or the corresponding distance matrix. The first sizes[1] rows of x are the first sample, the next sizes[2] rows of x are the second sample, etc.

The test is implemented by nonparametric bootstrap, an approximate permutation test with R replicates. For large samples it is more efficient if x contains the data matrix rather than the distances.

Incomplete statistics are supported for the two-sample test. If incomplete==TRUE, at most N observations from each sample (by sampling without replacement) are used in the calculation of the statistic. If distance==TRUE, complete statistics are always computed.

The definition of the multisample E-statistic is given in the ksample.e documentation.

Value

A list with class htest containing

    method     description of test
    statistic  observed value of the test statistic
    p.value    approximate p-value of the test
    data.name  description of data

Note

The pairwise e-distances between samples can be conveniently computed by the edist function, which returns a dist object. The function ksample.e computes the test statistic without storing the distances, which is more efficient than calling eqdist.etest with R = 0.

References

Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).

Szekely, G. J. (2000) Technical Report 03-05: E-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.

See Also

ksample.e, edist, energy.hclust

Examples

    data(iris)

    ## test if the 3 varieties of iris data (d=4) have equal distributions
    eqdist.etest(iris[,1:4], c(50,50,50), R = 199)

    ## compare incomplete versions of two sample test
    x <- c(rpois(400, 2), rnbinom(600, size=1, mu=2))
    eqdist.etest(x, c(400, 600), incomplete=TRUE, N=100, R = 199)
    eqdist.etest(x, c(400, 600), incomplete=TRUE, N=200, R = 199)
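The approximate permutation test described for eqdist.etest can be illustrated with a minimal base R sketch for the two-sample case. This is not the package's internal code: the helper names estat2 and perm_test are hypothetical, and the p-value convention (1 + count) / (R + 1) is an assumption chosen for the illustration.

```r
## Two-sample e-distance computed from a full distance matrix d,
## where the first n1 rows/columns belong to sample 1.
estat2 <- function(d, n1) {
  n <- nrow(d)
  n2 <- n - n1
  i1 <- seq_len(n1)
  i2 <- n1 + seq_len(n2)
  m12 <- mean(d[i1, i2])  # average between-sample distance
  m11 <- mean(d[i1, i1])  # average within-sample distance, sample 1
  m22 <- mean(d[i2, i2])  # average within-sample distance, sample 2
  n1 * n2 / n * (2 * m12 - m11 - m22)
}

## Approximate permutation test with R random permutations of the
## pooled sample (hypothetical helper, two samples only).
perm_test <- function(x, sizes, R = 199) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  observed <- estat2(d, sizes[1])
  reps <- replicate(R, {
    ix <- sample.int(n)          # random relabelling of observations
    estat2(d[ix, ix], sizes[1])
  })
  list(statistic = observed,
       p.value = (1 + sum(reps >= observed)) / (R + 1))
}

set.seed(1)
x <- c(rnorm(30), rnorm(30, mean = 5))
perm_test(x, c(30, 30), R = 99)
```

Computing the distance matrix once and permuting its rows and columns avoids recomputing distances in every replicate, at the cost of storing the full matrix.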

Independence E-test    Energy Statistic Test of Independence

Description

Computes the multivariate nonparametric E-statistic and test of independence.

Usage

    indep.e(x, y)
    indep.etest(x, y, R=199)

Arguments

    x  matrix: first sample, observations in rows
    y  matrix: second sample, observations in rows
    R  number of replicates

Details

Computes the coefficient $\mathcal{I}$ and performs a nonparametric E-test of independence. The test decision is obtained via bootstrap, with R replicates. The sample sizes (number of rows) of the two samples must agree, and samples must not contain missing values. The statistic $\mathcal{E} = n \mathcal{I}^2$ is a ratio of V-statistics based on interpoint distances $\| x_i - y_j \|$. See the reference below for details.

Value

The value of the test statistic is returned by indep.e. The function indep.etest returns a list with class htest containing

    method     description of test
    statistic  observed value of the coefficient I
    p.value    approximate p-value of the test
    data.name  description of data

References

Bakirov, N.K., Rizzo, M.L., and Szekely, G.J. (2004), A Multivariate Nonparametric Test of Independence, Journal of Multivariate Analysis (submitted).

Examples

    ## independent multivariate data
    x <- matrix(rnorm(60), nrow=20, ncol=3)
    y <- matrix(rnorm(40), nrow=20, ncol=2)
    indep.e(x, y)
    indep.etest(x, y, R = 99)

    ## dependent bivariate data
    library(MASS)
    Sigma <- matrix(c(1, .5, .5, 1), 2, 2)
    x <- mvrnorm(30, c(0, 0), Sigma)
    indep.etest(x[,1], x[,2], R = 99)


ksample.e    E-statistic (Energy Statistic) for Multivariate k-sample Test of Equal Distributions

Description

Returns the E-statistic (energy statistic) for the multivariate k-sample test of equal distributions.

Usage

    ksample.e(x, sizes, distance = FALSE, ix = 1:sum(sizes), incomplete = FALSE, N = 100)

Arguments

    x           data matrix of pooled sample
    sizes       vector of sample sizes
    distance    logical: if TRUE, x is a distance matrix
    ix          a permutation of the row indices of x
    incomplete  logical: if TRUE, compute incomplete E-statistics
    N           incomplete sample size

Details

The k-sample multivariate E-statistic for testing equal distributions is returned. The statistic is computed from the original pooled samples, stacked in matrix x where each row is a multivariate observation, or from the distance matrix x of the original data. The first sizes[1] rows of x are the first sample, the next sizes[2] rows of x are the second sample, etc.

Incomplete statistics are supported for the two-sample case. If incomplete==TRUE, at most N observations from each sample (by sampling without replacement) are used in the calculation of the statistic. If distance==TRUE, complete statistics are always computed.

The two-sample E-statistic proposed by Szekely and Rizzo (2004) is the e-distance $e(S_i, S_j)$, defined for two samples $S_i$, $S_j$ of size $n_i$, $n_j$ by

$$e(S_i, S_j) = \frac{n_i n_j}{n_i + n_j} \left[ 2 M_{ij} - M_{ii} - M_{jj} \right],$$

where

$$M_{ij} = \frac{1}{n_i n_j} \sum_{p=1}^{n_i} \sum_{q=1}^{n_j} \| X_{ip} - X_{jq} \|,$$

$\| \cdot \|$ denotes Euclidean norm, and $X_{ip}$ denotes the p-th observation in the i-th sample. The k-sample E-statistic is defined by summing the pairwise e-distances over all $k(k-1)/2$ pairs of samples:

$$\mathcal{E} = \sum_{1 \le i < j \le k} e(S_i, S_j).$$

Large values of $\mathcal{E}$ are significant.

Value

The value of the multisample E-statistic corresponding to the permutation ix is returned.

Note

The pairwise e-distances between samples can be conveniently computed by the edist function, which returns a dist object. The function ksample.e computes the E-statistic only. For the test decision, a nonparametric bootstrap test (approximate permutation test) is provided by the function eqdist.etest. With the default arguments, ksample.e computes the statistic without storing the distance matrix, which is more efficient than calling eqdist.etest with R = 0.

References

Szekely, G. J. and Rizzo, M. L. (2004) Testing for Equal Distributions in High Dimension, InterStat, November (5).

Szekely, G. J. (2000) Technical Report 03-05: E-statistics: Energy of Statistical Samples, Department of Mathematics and Statistics, Bowling Green State University.

See Also

eqdist.etest, edist, energy.hclust

Examples

    ## compute 3-sample E-statistic for 4-dimensional iris data
    data(iris)
    ksample.e(iris[,1:4], c(50,50,50))

    ## compute a 3-sample univariate E-statistic
    ksample.e(rnorm(150), c(25,75,50))
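To make the definition of the k-sample E-statistic concrete, the following base R sketch evaluates it directly from the formulas for $M_{ij}$ and $e(S_i, S_j)$ given for ksample.e. The helper names mean_dist, edist_pair and ksample_stat are hypothetical; this is an illustration of the definition, not the package's implementation, and it makes no attempt at efficiency.

```r
## M_ij: average Euclidean distance between the rows of a and the rows of b.
mean_dist <- function(a, b) {
  na <- nrow(a)
  d <- as.matrix(dist(rbind(a, b)))
  mean(d[seq_len(na), na + seq_len(nrow(b)), drop = FALSE])
}

## Two-sample e-distance e(S_i, S_j) for samples stored as matrices.
edist_pair <- function(a, b) {
  na <- nrow(a)
  nb <- nrow(b)
  na * nb / (na + nb) *
    (2 * mean_dist(a, b) - mean_dist(a, a) - mean_dist(b, b))
}

## k-sample statistic: sum of e-distances over all pairs of samples.
## x is the pooled sample; the first sizes[1] rows are sample 1, etc.
ksample_stat <- function(x, sizes) {
  x <- as.matrix(x)
  grp <- rep(seq_along(sizes), times = sizes)
  k <- length(sizes)
  total <- 0
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      total <- total + edist_pair(x[grp == i, , drop = FALSE],
                                  x[grp == j, , drop = FALSE])
    }
  }
  total
}

## 3-sample statistic for the 4-dimensional iris data
ksample_stat(iris[, 1:4], c(50, 50, 50))
```

Note that the within-sample averages $M_{ii}$ include the zero distances of each observation to itself, matching the double sum over all p and q in the definition.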

mvnorm.etest    E-statistic (Energy) Test of Multivariate Normality

Description

Performs the E-statistic (energy) test of multivariate or univariate normality.

Usage

    mvnorm.etest(x, R = 999)
    mvnorm.e(x)
    normal.e(x)

Arguments

    x  data matrix of multivariate sample, or univariate data vector
    R  number of bootstrap replicates

Details

If x is a matrix, each row is a multivariate observation. The data will be standardized to zero mean and identity covariance matrix using the sample mean vector and sample covariance matrix. If x is a vector, the univariate statistic normal.e(x) is returned. If the data contains missing values or the sample covariance matrix is singular, NA is returned.

The E-test of multivariate normality was proposed and implemented by Szekely and Rizzo (2005). The test statistic for d-variate normality is given by

$$\mathcal{E} = n \left( \frac{2}{n} \sum_{i=1}^{n} E \| y_i - Z \| - E \| Z - Z' \| - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \| y_i - y_j \| \right),$$

where $y_1, \ldots, y_n$ is the standardized sample, $Z$, $Z'$ are iid standard d-variate normal, and $\| \cdot \|$ denotes Euclidean norm.

The E-test of multivariate (univariate) normality is implemented by parametric bootstrap with R replicates.

Value

The value of the E-statistic for univariate normality is returned by normal.e. The value of the E-statistic for multivariate normality is returned by mvnorm.e. mvnorm.etest returns a list with class htest containing

    method     description of test
    statistic  observed value of the test statistic
    p.value    approximate p-value of the test
    data.name  description of data
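The standardization step mentioned in the Details (zero mean and identity covariance, using the sample mean vector and sample covariance matrix) could be sketched in base R as follows. The helper name standardize is hypothetical, and the symmetric inverse square root via eigen is one possible choice assumed here for illustration; the package may compute the transformation differently.

```r
## Transform the rows of x so that the sample mean is zero and the
## sample covariance matrix is the identity (hypothetical helper).
standardize <- function(x) {
  x <- as.matrix(x)
  centered <- sweep(x, 2, colMeans(x))   # subtract column means
  e <- eigen(cov(x), symmetric = TRUE)
  # symmetric inverse square root of the sample covariance matrix
  inv_sqrt <- e$vectors %*%
    diag(1 / sqrt(e$values), nrow = length(e$values)) %*%
    t(e$vectors)
  centered %*% inv_sqrt
}

y <- standardize(iris[1:50, 1:4])
round(cov(y), 10)   # numerically the 4 x 4 identity matrix
```

This sketch fails when the sample covariance matrix is singular, which corresponds to the case where the documented functions return NA.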

References

Szekely, G. J. and Rizzo, M. L. (2005) A New Test for Multivariate Normality, Journal of Multivariate Analysis, 93/1, 58-80, http://dx.doi.org/10.1016/j.jmva.2003.12.002.

Rizzo, M. L. (2002). A New Rotation Invariant Goodness-of-Fit Test, Ph.D. dissertation, Bowling Green State University.

Szekely, G. J. (1989) Potential and Kinetic Energy in Statistics, Lecture Notes, Budapest Institute of Technology (Technical University).

Examples

    ## compute normality test statistics for iris Setosa data
    data(iris)
    mvnorm.e(iris[1:50, 1:4])
    normal.e(iris[1:50, 1])

    ## test if the iris Setosa data has multivariate normal distribution
    mvnorm.etest(iris[1:50,1:4], R = 199)

    ## test a univariate sample for normality
    x <- runif(50, 0, 10)
    mvnorm.etest(x, R = 199)


poisson.mtest    Mean Distance Test for Poisson Distribution

Description

Performs the mean distance goodness-of-fit test of Poisson distribution with unknown parameter.

Usage

    poisson.mtest(x, R = 999)
    poisson.m(x)

Arguments

    x  vector of nonnegative integers, the sample data
    R  number of bootstrap replicates

Details

The mean distance test of Poissonity was proposed and implemented by Szekely and Rizzo (2004). The test is based on the result that the sequence of expected values $E|X - j|$, $j = 0, 1, 2, \ldots$ characterizes the distribution of the random variable $X$. As an application of this characterization one can get an estimator $\hat{F}(j)$ of the CDF. The test statistic (see poisson.m) is a Cramer-von Mises type of distance, with M-estimates replacing the usual EDF estimates of the CDF:

$$M_n = n \sum_{j=0}^{\infty} \left( \hat{F}(j) - F(j; \hat{\lambda}) \right)^2 f(j; \hat{\lambda}).$$

The test is implemented by parametric bootstrap with R replicates.

Value

The function poisson.m returns the test statistic. The function poisson.mtest returns a list with class htest containing

    method     description of test
    statistic  observed value of the test statistic
    p.value    approximate p-value of the test
    data.name  description of data
    estimate   sample mean

References

Szekely, G. J. and Rizzo, M. L. (2004) Mean Distance Test of Poisson Distribution, Statistics and Probability Letters, 67/3, 241-247. http://dx.doi.org/10.1016/j.spl.2004.01.005.

Examples

    x <- rpois(20, 1)
    poisson.m(x)
    poisson.mtest(x, R = 199)

Index

    Topic cluster         edist, energy.hclust
    Topic htest           eqdist.etest, Independence E-test, ksample.e, mvnorm.etest, poisson.mtest
    Topic multivariate    edist, energy.hclust, eqdist.etest, ksample.e, mvnorm.etest
    Topic nonparametric   edist, eqdist.etest, ksample.e

Aliases: indep.e and indep.etest (Independence E-test); mvnorm.e and normal.e (mvnorm.etest); poisson.m (poisson.mtest).