Estimating the number of shared species by a jackknife procedure

Environ Ecol Stat (2015) 22: DOI /s

Chia-Jui Chuang 1,2, Tsung-Jen Shen 1, Wen-Han Hwang 1

Received: 12 September 2014 / Revised: 13 April 2015 / Published online: 7 May 2015. Springer Science+Business Media New York 2015

Handling Editor: Pierre Dutilleul.

Electronic supplementary material: The online version of this article (doi: /s ) contains supplementary material, which is available to authorized users.

Wen-Han Hwang, wenhan@nchu.edu.tw
1 Department of Applied Mathematics and Institute of Statistics, National Chung Hsing University, Taichung, Taiwan
2 Center for Biomedical Resources, National Health Research Institutes, Zhunan, Taiwan

Abstract A sequence of jackknife estimators is developed to estimate the number of shared species in two communities. The estimators have simple and explicit formulae. A sequential testing criterion is also developed to determine a proper order for these jackknife estimators. The performance of the estimators is evaluated using empirical data on two forests from Malaysia, where 209 shared species are present in both forests, and using simulated data. Results for the empirical data and simulated scenarios (for sampling fractions ranging from 0.5 to 20 %) show that the jackknife estimator, compared with other existing estimators, has a smaller bias and provides more reliable interval estimation in most cases. Additionally, two avian datasets from Taiwan and Hong Kong are used to demonstrate the proposed method. To extend the proposed method to three communities, we also list the first six orders of the jackknife estimators explicitly.

Keywords Quadrat sampling, Shared species, Two-sample jackknife

1 Introduction

Assume that there are $M_1$ and $M_2$ species in communities I and II, respectively. Of these species, without loss of generality, assume that the first $S$ species are shared by both communities, with $M_1 - S$ species and $M_2 - S$ species being unique to communities I and II, respectively. Let $X = (X_1, \ldots, X_{M_1})$ and $Y = (Y_1, \ldots, Y_{M_2})$ be the frequencies of the $M_1$ and $M_2$ species randomly sampled from the two communities, respectively. Note that all species observed in both samples are undoubtedly shared species. However, species observed in only one sample have an uncertain status: they may be unique species or shared species. The objective of this study is to estimate the number of shared species $S$ based on such samples.

Estimating the number of species shared by two distinct ecological communities is essential for understanding the spatial distribution of diversity in landscapes, for modeling species diversity patterns, such as the species-area relationship and beta diversity (Ostling et al. 2003; Krishnamani et al. 2004; Tjørve and Tjørve 2008), and for inferring mechanisms of diversity maintenance (Condit et al. 2002). The shared species between communities, traditionally characterized by Jaccard or Sørensen similarity indices, form the basis of community ordination analysis and are widely used to measure beta diversity in macroecology (Magurran 2004; Colwell and Elsensohn 2014). However, in ecological applications, species similarities are nearly always calculated directly from sampled data, with the implied assumption that samples constitute full coverage of the communities from which they are drawn. Given that in reality samples are almost always a minute fraction of the communities they are supposed to represent, this assumption is rarely true, and the resulting sample bias can be substantial (Chao et al. 2006). In response to this problem, much effort has recently been dedicated to estimating shared species between communities based on samples (Chao et al. 2000; Schloss and Handelsman 2006; Chao et al. 2006, 2008; Yue and Clayton 2012).
In the context of estimating the shared species richness $S$, there are three major studies in the literature. Chao et al. (2000) were the first to take the shared species approach to richness estimation. They extended a richness estimator based on the sample coverage for a single community (i.e., the so-called ACE estimator; Colwell and Coddington 1994) to estimate the number of species shared by two communities. Chao et al. (2006) derived a simple estimator for shared species based on the Laplace approximation formula. In a more recent attempt, Pan et al. (2009) proposed a nonparametric lower bound for shared species among multiple communities, which can be considered an extension of the popular richness estimator Chao2 (Colwell and Coddington 1994). However, these statistical developments have not been subjected to thorough examination, and no empirical test has yet been conducted to evaluate their performance in practice.

Prior to this study, we conducted empirical tests using simulated data of two large-scale forest plots in Malaysia to evaluate the performance of these three methods. Our results showed that the three methods all suffered from serious underestimation of shared species if the sampling intensity is less than 5 % of the true community. This deficiency is critical because it is very rare to have sampling effort larger than 1 % in field surveys (Chiarucci et al. 2003). This made us question the practical utility of these methods and motivated us to develop more reliable methods for estimating shared species richness.

Our method is based on a two-sample jackknife procedure (Schechtman and Wang 2004). The derived method has an explicit form consisting of a sequence of estimators. We further develop a sequential acceptance-rejection criterion to determine the jackknife order among the sequence of estimators. Note that jackknife estimators in a single community have been widely applied to estimate species richness or population size. This approach has been examined and recommended by several studies. For example, based on comprehensive studies of simulated data (Burnham and Overton 1978, 1979; Amstrup et al. 2010) and real datasets (Palmer 1990, 1991; Colwell and Coddington 1994; Walther and Morand 1998; Walther and Moore 2005; Williams et al. 2007; Gotelli and Colwell 2009), there is a general consensus that the jackknife method can be useful in practice.

In the remainder of this study, we first introduce the model structures and provide a brief overview of the three main methods proposed in the literature. We then derive the jackknife estimator for the number of shared species between communities and extend its application to quadrat sampling data, which are frequently used in plant surveys. The jackknife estimator is tested using simulated data of two forest plots in Malaysia. We conclude by discussing a generalization to estimate shared species in multiple communities.

2 Model and overview

Suppose there are two communities. Community I has $M_1$ species with relative abundances $(p_1, \ldots, p_{M_1})$ and community II has $M_2$ species with relative abundances $(q_1, \ldots, q_{M_2})$. Let $n_1$ and $n_2$ be the sample sizes for samples from communities I and II, respectively. Suppose that each individual is observed or detected independently of the others; hence the species counts $X = (X_1, \ldots, X_{M_1})$ follow a multinomial distribution with total $n_1$ and probabilities $(p_1, \ldots, p_{M_1})$, and similarly for $Y = (Y_1, \ldots, Y_{M_2})$. Let $I(\cdot)$ be an indicator function, where $I(A) = 1$ if the event $A$ occurs and 0 otherwise. Then

\[
D = \sum_{i=1}^{S} I(X_i \ge 1, Y_i \ge 1)
\]

denotes the number of observed shared species from the samples.
For any two nonnegative integers $j$ and $k$, we define

\[
f_{jk} = \sum_{i=1}^{S} I(X_i = j, Y_i = k)
\]

as the number of shared species with precisely $j$ individuals in sample I (the sample from community I) and $k$ individuals in sample II (the sample from community II). We can observe $f_{jk}$ only if both $j$ and $k$ are positive. For simplicity, we further let

\[
f_{j+} = \sum_{k \ge 1} f_{jk} = \sum_{i=1}^{S} I(X_i = j, Y_i \ge 1)
\]

be the number of observed shared species with $j$ individuals in sample I, and similarly define $f_{+k}$. Note that the parameter of interest $S$ can be decomposed into four terms

\[
S = D + f_{0+} + f_{+0} + f_{00}, \tag{1}
\]

where the last three terms are unobserved. With the above notation, we review the three methods for estimating $S$ in the literature.
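The observable statistics defined above can be tabulated directly from two count vectors. The following is a minimal sketch, assuming the two samples are indexed over a common species list (with 0 for a species not seen in that sample); the function name `shared_frequency_counts` and the toy data are illustrative and not part of the original text.

```python
from collections import Counter


def shared_frequency_counts(x, y):
    """Tabulate D, f_jk, f_j+ and f_+k from paired species counts.

    x[i] and y[i] are the counts of species i in samples I and II
    (assumed aligned on a common species list; 0 means not observed).
    """
    if len(x) != len(y):
        raise ValueError("x and y must be aligned on the same species list")
    # Only species seen in BOTH samples are observably shared (j >= 1, k >= 1).
    f = Counter((xi, yi) for xi, yi in zip(x, y) if xi >= 1 and yi >= 1)
    D = sum(f.values())
    f_row = Counter()  # f_{j+}: shared species with j individuals in sample I
    f_col = Counter()  # f_{+k}: shared species with k individuals in sample II
    for (j, k), count in f.items():
        f_row[j] += count
        f_col[k] += count
    return D, f, f_row, f_col


# Toy data: six species, four of them observed in both samples.
x = [1, 2, 0, 1, 3, 5]
y = [1, 0, 2, 3, 1, 2]
D, f, f_row, f_col = shared_frequency_counts(x, y)
print(D, f[(1, 1)], f_row[1], f_col[1])
```

The unobserved terms $f_{0+}$, $f_{+0}$, and $f_{00}$ in (1) cannot be tabulated this way, which is why they have to be estimated.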

Method I: Sample coverage

In Chao et al. (2000), the sample coverage $C$ with respect to two communities is defined as:

\[
C = \frac{\sum_{i=1}^{S} p_i q_i I(X_i \ge 1, Y_i \ge 1)}{\sum_{i=1}^{S} p_i q_i}, \tag{2}
\]

which is the fraction of product probabilities associated with the species common to both samples. Based on the method of moments, Chao et al. (2000) suggested an estimator of $C$:

\[
\hat{C} = 1 - \frac{\sum_{i=1}^{D} \{X_i I(Y_i = 1) + Y_i I(X_i = 1) - I(X_i = 1, Y_i = 1)\}}{\sum_{i=1}^{D} X_i Y_i}.
\]

This estimator performs very well in many situations. The underlying concept can be traced back to Good (1953), who considered the probability that an extra individual sampled from a community is a new species. Good's coverage has been widely used to develop methods for richness estimation (Darroch and Ratcliff 1980; Esty 1985; Chao and Lee 1992). As a special case, if the relative abundances are uniform, that is, $p_i = 1/M_1$ and $q_j = 1/M_2$ for all $i$ and $j$, the sample coverage (2) reduces to $C = D/S$. Thus, $D/\hat{C}$ is an estimate of $S$ in this case. However, the uniform case is unrealistic for most field studies.

To account for more general situations, Chao et al. (2000) considered the asymptotic bias of $D/\hat{C}$ in terms of coefficients of covariance (CCVs). Let $r_i = p_i q_i$ for $i = 1, \ldots, S$, $\bar{p} = \sum_{i=1}^{S} p_i / S$, $\bar{q} = \sum_{i=1}^{S} q_i / S$, and $\bar{r} = \sum_{i=1}^{S} r_i / S$. The CCVs are

\[
\Gamma_1 = \frac{\sum_{i=1}^{S} (p_i - \bar{p})(r_i - \bar{r})}{S \bar{p} \bar{r}}, \quad
\Gamma_2 = \frac{\sum_{i=1}^{S} (q_i - \bar{q})(r_i - \bar{r})}{S \bar{q} \bar{r}}, \quad
\Gamma_{12} = \frac{\sum_{i=1}^{S} (p_i - \bar{p})(q_i - \bar{q})(r_i - \bar{r})}{S \bar{p} \bar{q} \bar{r}}.
\]

Furthermore, these CCVs can be estimated via the method of moments with some approximations. Based on the above-mentioned preliminary work, Chao et al. (2000) proposed an estimator of $S$:

\[
\hat{S}_{Cov} = \frac{D}{\hat{C}} + \frac{1}{\hat{C}} \left( f_{1+} \hat{\Gamma}_1 + f_{+1} \hat{\Gamma}_2 + f_{11} \hat{\Gamma}_{12} \right),
\]

where $\hat{\Gamma}_1$, $\hat{\Gamma}_2$, and $\hat{\Gamma}_{12}$ are the associated CCV estimates and are given in Chao et al. (2000). Note that the estimator $\hat{S}_{Cov}$ is influenced by high frequency values, so the authors suggested applying it to a subset of the original data with $X_i \le 10$ and $Y_i \le 10$ for all $i$.
The estimator $\hat{S}_{Cov}$ has been implemented in the software programs Spade (Chao and Shen 2010) and EstimateS (Colwell and Elsensohn 2014).

Method II: Laplace approximation

The second method relies on the Laplace approximation, which is a popular technique for approximating integrals numerically (Goutis and Casella 1999). Owing to the requirements of the Laplace technique, Chao et al. (2006) assumed that each relative abundance ($p_i$ and $q_i$) is bounded below by a positive constant. When the sample sizes $n_1$ and $n_2$ are large, they established

\[
E(f_{0+}) \approx \frac{E(f_{1+}^2)}{2E(f_{2+})}, \quad
E(f_{+0}) \approx \frac{E(f_{+1}^2)}{2E(f_{+2})}, \quad
E(f_{00}) \approx \frac{E(f_{11}) E(f_{1+}) E(f_{+1})}{4E(f_{2+}) E(f_{+2})}.
\]
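Replacing each expectation above by its observed count and substituting the results into (1) gives a plug-in estimate of $S$. The sketch below is an illustration under that assumption; the function name `laplace_plugin_estimate` is hypothetical, and the sketch only handles the case where both $f_{2+}$ and $f_{+2}$ are positive.

```python
def laplace_plugin_estimate(D, f1p, f2p, fp1, fp2, f11):
    """Plug-in estimate of S based on the Laplace approximations.

    Arguments are the observed D, f_{1+}, f_{2+}, f_{+1}, f_{+2}, f_{11}.
    Assumes f_{2+} > 0 and f_{+2} > 0 (no zero-denominator correction here).
    """
    if f2p <= 0 or fp2 <= 0:
        raise ValueError("f_{2+} and f_{+2} must be positive in this sketch")
    f0p_hat = f1p ** 2 / (2 * f2p)                    # estimate of f_{0+}
    fp0_hat = fp1 ** 2 / (2 * fp2)                    # estimate of f_{+0}
    f00_hat = f11 * f1p * fp1 / (4 * f2p * fp2)       # estimate of f_{00}
    return D + f0p_hat + fp0_hat + f00_hat


# Illustration with the frequency counts quoted for the Hong Kong data in Sect. 6.2.
print(laplace_plugin_estimate(D=116, f1p=6, f2p=4, fp1=10, fp2=7, f11=1))
```

Each of the three added terms estimates one of the unobserved components in (1), so a zero singleton count simply removes the corresponding correction.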

According to (1), Chao et al. (2006) proposed a simple estimator using the Laplace approximation:

\[
\hat{S}_{Lap} = D + \frac{f_{1+}^2}{2 f_{2+}} + \frac{f_{+1}^2}{2 f_{+2}} + \frac{f_{11} f_{1+} f_{+1}}{4 f_{2+} f_{+2}}.
\]

If one of the denominators ($f_{2+}$ and $f_{+2}$) is zero, Chao et al. (2006) suggested using a bias-corrected formula:

\[
\hat{S}_{Lap} = D + \frac{f_{1+}^2}{2(f_{2+} + 1)} + \frac{f_{+1}^2}{2(f_{+2} + 1)} + \frac{f_{11} f_{1+} f_{+1}}{4(f_{2+} + 1)(f_{+2} + 1)}. \tag{3}
\]

Method III: Lower bound

Instead of seeking an estimate of $S$ directly, Pan et al. (2009) obtained a lower bound estimate for the number of shared species. Pan et al. (2009) applied the Cauchy–Schwarz inequality and showed that

\[
2E(f_{0+})E(f_{2+}) \ge \frac{(n_1 - 1)E^2(f_{1+})}{n_1}, \quad
2E(f_{+0})E(f_{+2}) \ge \frac{(n_2 - 1)E^2(f_{+1})}{n_2}, \quad
4E(f_{00})E(f_{22}) \ge \frac{(n_1 - 1)(n_2 - 1)E^2(f_{11})}{n_1 n_2}.
\]

Substituting these terms into (1) and considering $(n_j - 1)/n_j \approx 1$ for $j = 1, 2$, as the $n_j$ are usually large, Pan et al. (2009) developed a lower bound estimator of $S$:

\[
\hat{S}_{Low} = D + \frac{f_{1+}^2}{2 f_{2+}} + \frac{f_{+1}^2}{2 f_{+2}} + \frac{f_{11}^2}{4 f_{22}}.
\]

Similarly, when $f_{2+} = 0$, $f_{+2} = 0$, or $f_{22} = 0$, a correction like (3) could be adopted as well.

3 Jackknife procedure for estimating S

3.1 A sequence of jackknife estimators

The jackknife method was invented by Quenouille (1949) and has been widely applied for correcting statistical bias and estimating standard errors (Shao and Tu 1995). In ecology, Burnham and Overton (1978) applied the procedure to obtain a series of population size estimators for a closed capture-recapture model. Heltshe and Forrester (1983) considered species richness estimation based on quadrat sampling data. Traditionally, the first-order jackknife method is carried out by recomputing a desired statistic while successively leaving out one observation at a time from a one-sample dataset. Because our interest is in two-sample data, we followed extended works (Arvesen 1969; Schechtman and Wang 2004) to apply the jackknife procedure to two-sample situations.
Suppose that individuals in sample I have labels $a_l$, $l = 1, \ldots, n_1$, and individuals in sample II have labels $b_m$, $m = 1, \ldots, n_2$. For a parameter of interest $\theta$, let $\hat{\theta}$ be an estimator of $\theta$, $\hat{\theta}^{(-l,\cdot)}$ be the estimate when individual $a_l$ is removed from sample I, $\hat{\theta}^{(\cdot,-m)}$ be the estimate when individual $b_m$ is removed from sample II, and $\hat{\theta}^{(-l,-m)}$ be the estimate after individuals $a_l$ and $b_m$ are removed from samples I and II, respectively.

Recall that $D$ is the number of observed shared species from the samples; intuitively, $\hat{S}_0 = D$ is chosen to be a basic estimator of $S$ for the jackknife procedure. The procedure starts by alternately and sequentially deleting $a_l$ and $b_m$ from the full dataset and then recounting the observed number of shared species in the resulting data. For instance, $\hat{S}_0^{(-l,\cdot)}$ is the observed number of shared species after deleting $a_l$ from sample I. Trivially, $\hat{S}_0^{(-l,\cdot)}$ can be either $D$ or $D - 1$, where the latter occurs when individual $a_l$ belongs to a species $i$ with $X_i = 1$ and $Y_i > 0$. Performing the usual jackknife method with respect to sample I yields the estimator:

\[
\hat{S}_{0,X} = n_1 \hat{S}_0 - (n_1 - 1) \frac{\sum_{l=1}^{n_1} \hat{S}_0^{(-l,\cdot)}}{n_1} = D + \frac{n_1 - 1}{n_1} f_{1+}.
\]

Similarly, by jackknifing sample II, we obtain:

\[
\hat{S}_{0,Y} = n_2 \hat{S}_0 - (n_2 - 1) \frac{\sum_{m=1}^{n_2} \hat{S}_0^{(\cdot,-m)}}{n_2} = D + \frac{n_2 - 1}{n_2} f_{+1}.
\]

By taking a weighted average of $\hat{S}_{0,X}$ and $\hat{S}_{0,Y}$ (Arvesen 1969), the first-order jackknife estimator is:

\[
\hat{S}_1 = \frac{n_1 \hat{S}_{0,X} + n_2 \hat{S}_{0,Y}}{n_1 + n_2} = D + \frac{n_1 - 1}{n_1 + n_2} f_{1+} + \frac{n_2 - 1}{n_1 + n_2} f_{+1}.
\]

Nevertheless, as shown in Schechtman and Wang (2004), this first-order jackknife estimator does not reduce the bias in terms of asymptotic order. Hence a further correction is necessary. Following Schechtman and Wang (2004), we consider jackknifing $\hat{S}_{0,X}$ by deleting one individual $b_m$ at a time from sample II. As a result, we find the second-order estimator $\hat{S}_2$:

\[
\begin{aligned}
\hat{S}_2 &= n_2 \hat{S}_{0,X} - (n_2 - 1) \frac{\sum_{m=1}^{n_2} \hat{S}_{0,X}^{(\cdot,-m)}}{n_2} \\
&= n_1 n_2 \hat{S}_0 - n_2 (n_1 - 1) \frac{\sum_{l=1}^{n_1} \hat{S}_0^{(-l,\cdot)}}{n_1} - n_1 (n_2 - 1) \frac{\sum_{m=1}^{n_2} \hat{S}_0^{(\cdot,-m)}}{n_2} + \frac{(n_1 - 1)(n_2 - 1)}{n_1 n_2} \sum_{l=1}^{n_1} \sum_{m=1}^{n_2} \hat{S}_0^{(-l,-m)} \\
&= D + \frac{n_1 - 1}{n_1} f_{1+} + \frac{n_2 - 1}{n_2} f_{+1} + \frac{(n_1 - 1)(n_2 - 1)}{n_1 n_2} f_{11}.
\end{aligned}
\]

We note that, alternatively, it can be shown that $\hat{S}_2 = n_1 \hat{S}_{0,Y} - (n_1 - 1) \sum_{l=1}^{n_1} \hat{S}_{0,Y}^{(-l,\cdot)} / n_1$. Briefly, $\hat{S}_2$ results from jackknifing $\hat{S}_0$ while alternately deleting one individual from either sample.
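The closed forms for the first- and second-order estimators can be checked numerically against a literal delete-one recomputation. The sketch below is illustrative only: the function names are hypothetical, the count vectors are assumed to be aligned on a common species list, and the brute-force version evaluates the three-term deletion expansion of $\hat{S}_2$ directly.

```python
def shared_count(x, y):
    """Number of species seen at least once in both samples (the statistic D)."""
    return sum(1 for xi, yi in zip(x, y) if xi >= 1 and yi >= 1)


def s1_closed(D, f1p, fp1, n1, n2):
    """First-order estimator: D + [(n1-1) f_{1+} + (n2-1) f_{+1}] / (n1 + n2)."""
    return D + ((n1 - 1) * f1p + (n2 - 1) * fp1) / (n1 + n2)


def s2_closed(D, f1p, fp1, f11, n1, n2):
    """Second-order estimator in its explicit form."""
    a, b = (n1 - 1) / n1, (n2 - 1) / n2
    return D + a * f1p + b * fp1 + a * b * f11


def drop_one(counts, i):
    """Copy of counts with one individual of species i removed."""
    out = list(counts)
    out[i] -= 1
    return out


def s2_brute(x, y):
    """Second-order estimator obtained by recounting D after every deletion."""
    n1, n2 = sum(x), sum(y)
    s0 = shared_count(x, y)
    # Species i contributes x[i] deletable individuals in sample I (weight 0 if unseen).
    mean_l = sum(x[i] * shared_count(drop_one(x, i), y) for i in range(len(x))) / n1
    mean_m = sum(y[j] * shared_count(x, drop_one(y, j)) for j in range(len(y))) / n2
    mean_lm = sum(
        x[i] * y[j] * shared_count(drop_one(x, i), drop_one(y, j))
        for i in range(len(x))
        for j in range(len(y))
    ) / (n1 * n2)
    return (n1 * n2 * s0 - n2 * (n1 - 1) * mean_l
            - n1 * (n2 - 1) * mean_m + (n1 - 1) * (n2 - 1) * mean_lm)


# Toy data: D = 3, f_{1+} = 2, f_{+1} = 2, f_{11} = 1, n1 = n2 = 7.
x = [1, 2, 0, 1, 3]
y = [1, 0, 2, 3, 1]
print(s2_brute(x, y), s2_closed(3, 2, 2, 1, 7, 7))
```

The two printed values should agree up to floating-point rounding, which is a quick way to confirm the sign of the $f_{11}$ term.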

In order to further reduce the statistical bias of $\hat{S}_0$, we suggest continuing this procedure. In this way we establish a sequence of estimators $\hat{S}_k$ for $k \ge 0$; the algorithm is summarized as follows.

Step 0: Initialize $\nu = 0$.

Step 1: Let $k = 2\nu + 1$ and define

\[
\hat{S}_{2\nu,X} = n_1 \hat{S}_{2\nu} - (n_1 - 1) \frac{\sum_{l=1}^{n_1} \hat{S}_{2\nu}^{(-l,\cdot)}}{n_1}
\quad \text{and} \quad
\hat{S}_{2\nu,Y} = n_2 \hat{S}_{2\nu} - (n_2 - 1) \frac{\sum_{m=1}^{n_2} \hat{S}_{2\nu}^{(\cdot,-m)}}{n_2}.
\]

The $k$th-order jackknife estimator, $k = 2\nu + 1$, is:

\[
\hat{S}_k = \frac{n_1 \hat{S}_{2\nu,X} + n_2 \hat{S}_{2\nu,Y}}{n_1 + n_2}.
\]

Step 2: The $(k+1)$th-order jackknife estimator, $k + 1 = 2\nu + 2$, is:

\[
\hat{S}_{k+1} = n_2 \hat{S}_{2\nu,X} - (n_2 - 1) \frac{\sum_{m=1}^{n_2} \hat{S}_{2\nu,X}^{(\cdot,-m)}}{n_2}.
\]

Step 3: Increment $\nu$ to $\nu + 1$ and return to Step 1.

In Theorem 1 of the Appendix, we show that $\hat{S}_k$ is a linear combination of the observed frequencies $f_{ij}$ and give the explicit formula. In addition, as the sample sizes ($n_1$ and $n_2$) are usually large in practice, we can further simplify the expression of $\hat{S}_k$; see Corollary 1 in the Appendix.

The variance estimation of $\hat{S}_k$ can be derived from a standard asymptotic approach. Due to random sampling, the random variables $S - D$ and $f_{ij}$, $i, j \ge 1$, follow a multinomial distribution with total $S$ and probabilities $1 - \pi$ and $\pi_{ij}$, $i \ge 1$, $j \ge 1$, where $\pi = \sum_{i \ge 1, j \ge 1} \pi_{ij}$ and $\pi_{ij}$ is the probability that a shared species is observed exactly $i$ times in sample I and $j$ times in sample II. Given $\hat{S}_k$, we estimate $\pi_{ij}$ by $\hat{\pi}_{ij} = f_{ij} / \hat{S}_k$ for all $i \ge 1$ and $j \ge 1$. As a consequence, we have

\[
\widehat{\mathrm{Cov}}(f_{ij}, f_{st}) =
\begin{cases}
f_{ij}(1 - \hat{\pi}_{ij}) & \text{if } i = s,\ j = t; \\
-f_{ij} \hat{\pi}_{st} & \text{otherwise.}
\end{cases}
\]

Rewriting $\hat{S}_k = \sum_{i \ge 1} \sum_{j \ge 1} c_{ij} f_{ij}$ in terms of some constant coefficients $c_{ij}$, the variance estimator of $\hat{S}_k$ can be expressed as:

\[
\widehat{\mathrm{Var}}(\hat{S}_k) = \sum_{i \ge 1} \sum_{j \ge 1} c_{ij}^2 f_{ij} - \hat{S}_k. \tag{4}
\]

Remark Similar to the proof in Cormack (1989), it is straightforward to show that the bias of the initial estimator $D$ cannot be expressed as a power series in the reciprocals of the sample sizes $n_1$ and $n_2$, and hence the bias-reduction assumption in the two-sample jackknife procedure of Schechtman and Wang (2004) is not satisfied. Nevertheless, in practice, the bias can be reduced by jackknife estimators under some conditions. For instance, the second-order jackknife estimator $\hat{S}_2$ can reduce the bias of $D$ in a broad range of situations. To see this, let $d_1 = n_1 \bar{p}$ and $d_2 = n_2 \bar{q}$. Using Taylor expansions of $D$ and $\hat{S}_2$ around $\bar{p}$ and $\bar{q}$, we obtain the asymptotic biases of $D$ and $\hat{S}_2$ in terms of $d_1$, $d_2$, and the coefficients of variation (CVs); see Web Appendix S2. Note that $d_1$ ($d_2$) is the average observed number of individuals for the shared species in community I (community II). As a consequence, we can evaluate the relative asymptotic biases of $D$ and $\hat{S}_2$, given $d_1$, $d_2$, and the CVs. Web Figure 1 displays the results under selected CVs. Based on these results, we conclude that the jackknife estimator $\hat{S}_2$ is able to reduce the bias of $D$, especially when $d_1$ and/or $d_2$ are small. Note that, when both $d_1$ and $d_2$ are large, the absolute biases of both $D$ and $\hat{S}_2$ tend to be small. The asymptotic bias of other jackknife estimators can be evaluated by the same technique, but the results are much more complicated than those for $\hat{S}_2$.

3.2 Order selection

Although the jackknife estimator $\hat{S}_k$ is likely to have a smaller bias for larger $k$, it inevitably inflates the variance as more terms are involved. Thus, there is a bias-variance trade-off in selecting a jackknife order $k$. Here we use a sequential test procedure (Burnham and Overton 1978) as the decision criterion. For each $k \ge 0$, consider the following hypotheses:

\[
H_{0k}: E(\hat{S}_{k+1} - \hat{S}_k) = 0 \quad \text{vs.} \quad H_{1k}: E(\hat{S}_{k+1} - \hat{S}_k) \ne 0. \tag{5}
\]

Assume that, under the null hypothesis $H_{0k}$, the test statistic

\[
T_k = \frac{\hat{S}_{k+1} - \hat{S}_k}{\sqrt{\widehat{\mathrm{Var}}(\hat{S}_{k+1} - \hat{S}_k)}} \tag{6}
\]

is asymptotically normally distributed. For a significance level $\alpha$, the procedure begins by testing the hypothesis in (5) with order $k = 0$ and then continues to the next order until acceptance occurs. In other words, if the p-value associated with the test statistic $T_k$ is smaller than $\alpha$, the procedure goes to the next order of hypothesis, and it stops when the p-value exceeds $\alpha$. When the procedure stops at $k = k^*$, our proposed estimator is $\hat{S}_{JK} = \hat{S}_{k^*}$.
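The sequential test can be sketched for the orders whose explicit forms are given in Sect. 3.1 (orders 0 to 2). Everything below is an illustration rather than the authors' implementation: the function names are hypothetical, higher orders would need the coefficients from Theorem 1 (not reproduced here), and the multinomial variance of the difference is computed with $\hat{\pi}_{ij} = f_{ij}/\hat{S}_{k+1}$, which is an assumption because the text does not say which estimate of $S$ to plug in.

```python
import math


def coefficients(order, n1, n2):
    """Return c(i, j) with S_hat_order = sum_ij c(i, j) * f_ij, for order 0, 1 or 2."""
    a, b = (n1 - 1) / n1, (n2 - 1) / n2
    u, v = (n1 - 1) / (n1 + n2), (n2 - 1) / (n1 + n2)
    if order == 0:
        return lambda i, j: 1.0
    if order == 1:
        return lambda i, j: 1.0 + u * (i == 1) + v * (j == 1)
    if order == 2:
        return lambda i, j: 1.0 + a * (i == 1) + b * (j == 1) + a * b * (i == 1 and j == 1)
    raise ValueError("only orders 0, 1 and 2 are available in this sketch")


def linear_estimate(c, f):
    """Evaluate sum_ij c(i, j) * f_ij for a frequency table f = {(i, j): count}."""
    return sum(c(i, j) * n for (i, j), n in f.items())


def linear_variance(c, f, s_hat):
    """Multinomial variance of a linear combination, with pi_ij estimated by f_ij / s_hat.

    When c yields s_hat itself this reduces to formula (4): sum c^2 f - s_hat.
    """
    total = linear_estimate(c, f)
    return sum(c(i, j) ** 2 * n for (i, j), n in f.items()) - total ** 2 / s_hat


def select_order(f, n1, n2, alpha=0.1, max_order=2):
    """Stop at the first order k whose test of E(S_{k+1} - S_k) = 0 is not rejected."""
    for k in range(max_order):
        ck, ck1 = coefficients(k, n1, n2), coefficients(k + 1, n1, n2)
        s_next = linear_estimate(ck1, f)
        diff = lambda i, j, ck=ck, ck1=ck1: ck1(i, j) - ck(i, j)
        var = linear_variance(diff, f, s_next)  # assumption: plug in S_hat_{k+1}
        if var <= 0:
            return k  # no measurable difference between consecutive orders
        t = linear_estimate(diff, f) / math.sqrt(var)
        p_value = math.erfc(abs(t) / math.sqrt(2))  # two-sided normal p-value
        if p_value > alpha:
            return k
    return max_order


# Hypothetical frequency table f_ij and sample sizes, for illustration only.
f = {(1, 1): 4, (1, 2): 3, (2, 1): 5, (2, 2): 6, (3, 4): 10}
k_star = select_order(f, n1=60, n2=70)
print(k_star, linear_estimate(coefficients(k_star, 60, 70), f))
```

Representing each estimator by its coefficient function keeps the variance of $\hat{S}_{k+1} - \hat{S}_k$ a one-line computation, since the difference is again a linear combination of the $f_{ij}$.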
Note that Burnham and Overton (1978) suggested using an interpolation formula at this stage, but the resulting estimate was less favorable than the proposed method in a simulation study (data not shown).

The variance estimate in the denominator of (6) can be obtained via the same technique shown in (4), since $\hat{S}_{k+1} - \hat{S}_k$ is a linear combination of the observed frequencies $f_{ij}$. However, we caution that (4) is not suitable for estimating the variance of $\hat{S}_{JK}$ because the selected order is a random variable. Specifically, the variance of $\hat{S}_{JK}$ would be underestimated if one treated the selected order as fixed and applied (4) naively. Although an analytic variance estimator of $\hat{S}_{JK}$ is currently not available, we suggest adopting a non-parametric bootstrap (Chao et al. 2000) to obtain a variance estimator instead.

It is possible that the proposed sequential testing procedure never terminates, i.e., the test never yields a p-value that exceeds the desired significance level $\alpha$, though this outcome was unusual in our empirical study. To successfully implement the order selection, we set an upper bound $K_u$ on the jackknife order and an upper threshold to avoid extreme estimates, which can occur frequently for higher orders of $k$. Under these refinements, the sequential test procedure is stopped at $k$ when the next, $(k+1)$th-order, jackknife estimator would exceed the upper threshold. Moreover, when no acceptance occurs before order $K_u$, we take $K_u$ as the selected order. In practice, we suggest taking $K_u = 6$ because the procedure seldom selected an order larger than 6 in our experience. The upper threshold could be 10 times the number of observed shared species in both samples (Hwang and Huang 2003).

4 Quadrat sampling with incidence-based data

In plant ecology, it is common to collect data by quadrat sampling, in which an area of interest is divided into several regular quadrats (usually rectangular), and a random sample of quadrats is taken from the area. Within each sampled quadrat, instead of counting the exact abundance of each species, one only records the presence (1) or absence (0) of each species. Thus the sampling unit is a quadrat, and a vector of 0-1 values reflects species incidence in each quadrat. Although the incidence data differ from the structure we considered in the last section, the jackknife procedure developed in this study is equally applicable. In this section, we redefine the notation and corresponding statistics for the jackknife formula.

Under a quadrat sampling design, let $n_1$ and $n_2$ be the numbers of quadrats taken from communities I and II, respectively. Let $D$ be the observed number of shared species and $f_{jk}$ be the number of shared species detected in $j$ quadrats in community I and in $k$ quadrats in community II. Other symbols like $f_{j+}$ and $f_{+k}$ are defined analogously to those in Sect. 2.
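With incidence data, the role of the abundance counts is played by the number of quadrats in which each species is detected. A minimal sketch of that conversion follows, assuming each quadrat is stored as a 0/1 list over a common species list shared by both communities; the function names and toy data are illustrative.

```python
def incidence_to_counts(quadrats):
    """Number of quadrats in which each species was detected.

    quadrats: list of 0/1 lists, one per sampled quadrat, all of the same
    length (one entry per species on a common species list).
    """
    if not quadrats:
        raise ValueError("at least one quadrat is required")
    n_species = len(quadrats[0])
    counts = [0] * n_species
    for quadrat in quadrats:
        if len(quadrat) != n_species:
            raise ValueError("all quadrats must cover the same species list")
        for i, present in enumerate(quadrat):
            counts[i] += 1 if present else 0
    return counts


def shared_summary(x, y):
    """Return D, f_{1+}, f_{+1}, f_{11} from quadrat-incidence counts x and y."""
    shared = [(xi, yi) for xi, yi in zip(x, y) if xi >= 1 and yi >= 1]
    D = len(shared)
    f1p = sum(1 for xi, _ in shared if xi == 1)
    fp1 = sum(1 for _, yi in shared if yi == 1)
    f11 = sum(1 for xi, yi in shared if xi == 1 and yi == 1)
    return D, f1p, fp1, f11


# Toy data: three quadrats from community I, two from community II, four species.
sample_I = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 1]]
sample_II = [[1, 1, 0, 1], [0, 1, 0, 1]]
x = incidence_to_counts(sample_I)   # n1 = len(sample_I) quadrats
y = incidence_to_counts(sample_II)  # n2 = len(sample_II) quadrats
print(x, y, shared_summary(x, y))
```

Here $n_1$ and $n_2$ are the numbers of quadrats rather than the numbers of individuals, so they should be taken as `len(sample_I)` and `len(sample_II)` when the jackknife formulas are applied.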
However, quadrat sampling incidence data differ from sampling abundance data: the sampling unit, denoted $a_l$ in sample I and $b_m$ in sample II, is now an incidence vector rather than a scalar as in the previous sections. That is, $a_l = (a_{1l}, \ldots, a_{M_1 l})$, where $a_{il} = 1$ if the $i$-th species has been detected in the $l$-th quadrat of sample I and $a_{il} = 0$ otherwise; $b_m = (b_{1m}, \ldots, b_{M_2 m})$ is similarly defined. Following the same arguments as in Sect. 3 but treating $a_l$ and $b_m$ as the removed units, the jackknife estimators derived from the incidence-based quadrat sampling design are the same as before. The sequential testing criterion in Sect. 3.2 is again recommended for selecting the jackknife order. Note that the similarity between the abundance-based and incidence-based data is not coincidental. In fact, it can be shown that all methods in Sect. 2 have the same representations for both incidence-based and abundance-based data; see Pan et al. (2009) for a remark on the two data types of the estimation approaches.

5 Empirical study

The performance of various estimators was assessed by simulation, where two large-scale census rain forest datasets were considered as sampling populations to reflect the species structure in two real communities. The two forest plots, Pasoh and Lambir, are both located in Malaysia. The Pasoh plot is 50 ha ( m) and is located in the Pasoh Forest Reserve, Peninsular Malaysia. The Lambir plot is 52 ha ( m) and is located in Lambir Hills National Park in Sarawak, Malaysia. In each plot, all free-standing trees and shrubs at least 1 cm in diameter at breast height were counted, located on a reference map with precise coordinates, and identified to species. To date, both plots have been censused several times; we use the data collected in 1985 for the Pasoh plot and in 1991 for the Lambir plot. Table 1 summarizes the background of the two plots, which includes locations, average annual rainfall, and species richness. There were 209 tree species in common between the two plots. In Fig. 1 we show the abundances of these shared species, where the number in the Pasoh plot ranges from 1 to 8821 (median 208) and in the Lambir plot from 1 to 3130 (median 141).

Table 1 Basic characteristics of the Pasoh and the Lambir plots

                          Pasoh        Lambir
Location                  2 58 N, E    4 10 N, E
Size of plot (ha)         50           52
Range of elevation (m)
Annual rainfall (mm)
No. of species
No. of individuals        320,         ,602
No. of shared species     209

Fig. 1 Frequency of the 209 shared species of trees in Pasoh (right) and Lambir (left) plots. Note that the horizontal axis, log(frequency), is on the log scale

We simulated quadrat sampling from the two plots and considered three quadrat sizes (5 × 5 m, 10 × 10 m, and 20 × 20 m) and seven sampling proportions (0.5, 1, 3, 5, 10, 20, and 30 %). For each combination of quadrat size and sampling proportion, 2000 pairs of samples of quadrats were randomly selected with replacement from the Pasoh and Lambir plots. Note that sampling without replacement is more appropriate than sampling with replacement in this application; however, since the sampling proportion is usually small in practice and at most 30 % in our study, these sampling schemes yield very similar results. Figure 2 displays the average frequency of $f_{ij}$ used in the jackknife estimators. We found that the frequencies were not sensitive to quadrat sizes; however, the frequencies varied notably with sampling proportions.

For each generated dataset, we computed the following estimators: the sample coverage estimator ($\hat{S}_{Cov}$), the Laplace approximation estimator ($\hat{S}_{Lap}$), the lower bound estimator ($\hat{S}_{Low}$), the jackknife estimators $\hat{S}_1, \ldots, \hat{S}_6$, and the estimator $\hat{S}_{JK}$ selected based on the procedure proposed in Sect. 3.2 with $K_u = 6$ and significance level $\alpha = 0.1$ (results were similar for $\alpha = 0.05$ and $\alpha = 0.15$). For the proposed $\hat{S}_{JK}$, we estimated the standard error (SE) using the bootstrap procedure (Chao et al. 2000) with 100 bootstrap replicates; SEs of the other estimators were derived according to (4). The resulting 2000 estimates along with their SEs were averaged to give the Estimate and $\hat{\sigma}$ in Web Tables 2–4 in the Supplementary Materials. The sample SE (denoted $\sigma$) and sample root mean squared error (RMSE) were also calculated. Moreover, we also computed the percentage of the 2000 simulated datasets in which the 95 % confidence intervals covered the true number of shared species.
Since the distribution of the species richness estimator is skewed to the right in general, a log-transformed confidence interval suggested by Chao (1987) was adopted here.

In Fig. 3 we summarize the results of comparing the proposed jackknife estimators with the existing methods in terms of bias, RMSE, and coverage percentage of the 95 % confidence interval. With regard to the three existing methods, when the sampling proportion is very small (say 0.5 %), the sample coverage-based estimator $\hat{S}_{Cov}$ outperforms $\hat{S}_{Lap}$ and $\hat{S}_{Low}$. However, the ranking of these estimators is reversed as the sampling proportion increases. As we can see from Fig. 3 and Web Tables 2–4, when the sampling proportion increases, the estimator $\hat{S}_{Cov}$ approaches the target value $S = 209$ very slowly compared with the other methods. In this case, $\hat{S}_{Lap}$ and $\hat{S}_{Low}$ are preferred in terms of bias and RMSE. According to the empirical test, we also find that the difference between the Laplace approximation and the lower bound estimate is negligible, but the former usually has a smaller bias and RMSE than the latter. Nevertheless, all three methods have considerable negative bias, especially when the sampling proportion is less than 5 %. Given the same sampling proportion but different quadrat sizes, the magnitude of bias slightly increases when the quadrat size becomes large, except for $\hat{S}_{Cov}$ and $\hat{S}_{JK}$ at sampling proportions less than 3 %. A similar pattern is also found for the RMSE.

The jackknife estimators $\hat{S}_1, \ldots, \hat{S}_6$ present an apparently increasing trend with the order (Web Tables 2–4), where $\hat{S}_5$ has the smallest bias when the sampling proportion is less than 1 %, but this is accompanied by a rather large variance; $\hat{S}_4$ has the smallest RMSE in almost all cases when the sampling proportion is less than 10 %; $\hat{S}_2$ has the best performance among all considered methods when the sampling proportion is more than 20 %. These results are just as anticipated: a higher-order jackknife estimator is required to reduce bias when the data are sparse (i.e., when sampling effort is smaller). However, the variance of a higher-order estimator tends to increase as more terms are involved. In contrast, a lower-order jackknife estimator can work well when data are rich (i.e., when sampling effort is greater). In practical applications, it may be difficult to determine whether a particular dataset is rich enough to consider a jackknife estimate.

Using the order selection procedure in Sect. 3.2, the order-selected jackknife estimator $\hat{S}_{JK}$ performs well. The order selection procedure generally stops at $\hat{S}_2$ when $q \ge 10$ % and stops at $\hat{S}_3$ or $\hat{S}_4$ otherwise; see Web Tables 2–4. In comparison with the three existing methods, $\hat{S}_{JK}$ has the smallest bias and the most reliable interval estimation, with the coverage percentage closest to the nominal level. Assessed using RMSE, $\hat{S}_{JK}$ also performs favorably for small sampling proportions, while its RMSE is comparable with those of the other estimators for large sampling proportions, when the data are rich. Despite its generally superior performance, it is worth noting that the jackknife estimator still suffers a considerable negative bias when the sampling proportion is very small, and the coverage percentage of interval estimation could reach as low as 75 %. Finally, we note that the average of the bootstrap SEs is quite close to the sample SEs; a simulation study (data not shown) found that the SE would be underestimated by more than 30 % if we used (4) and regarded the selected order as a fixed constant. As a consequence, the naïve use of (4) yielded an artificially narrow confidence interval and undermined the performance in terms of coverage percentage.

Fig. 2 Average frequency counts of $f_{ij}$ used in the jackknife estimators, where the data were generated from the Pasoh and Lambir plots for selected combinations of quadrat size (column) and sampling proportion $q$ (row)

Fig. 3 Bias (top panel), root mean squared error (middle panel), and coverage percentage of the 95 % confidence interval (bottom panel) for estimators of the number of shared species in Pasoh and Lambir plots, where the sampling quadrat sizes are 5 × 5 m, 10 × 10 m, and 20 × 20 m
6 Case study 6.1 Example 1: Bird abundance data in two river estuaries This illustrating example considers bird abundance data from two river estuaries, Ker- Ya River and Chung-Kang River, in Taiwan. A local wild bird society in Taiwan collected data weekly from April 1994 to March 1995; see Chao et al. (2000) for further details. There were 155 species (with 85,867 individuals) and 140 species (with 59,646 individuals) observed at the two estuaries. We calculated D = 111 birds common to both estuaries and f 1+ = 10, f 2+ = 2, f 3+ = 6, f +1 = 15, f +2 = 7, f +3 = 3, f 11 = 4, f 12 = 2, f 21 = 1, and f 13 = f 22 = f 23 = 0. Estimated shared species Ŝ k for k = 1,...,4 and associated p-values for the selected orders are shown at the top of Table 2. When the significance level is α = 0.1, the order k = 3 was selected and the corresponding estimate is Ŝ JK = In addition, using the bootstrap procedure with 100 bootstrap replicates, the SE was For comparison, Ŝ Cov, Ŝ Lap, and Ŝ Low were also evaluated. Though Ŝ Lap and Ŝ JK were very similar, Ŝ JK yielded a much smaller SE. In contrast, Ŝ Cov and Ŝ Lap produced much smaller estimates. 6.2 Example 2: Hong Kong big bird race data This example considers incidence-based data collected from a bird watch race; see Chao et al. (2006) for a description. The rules of the race were simple: record as

many bird species in Hong Kong territory as possible in a period of one month. Consequently, all watchers focused on species seen during the race regardless of abundance of observed species. At the end of the race, each team enumerated all the bird species they observed (together with some other watching information), resulting in incidence-based data. There were 19 participating teams in 1999 and 20 teams in During the race, 217 species were observed in 1999 and 220 species were observed in There were 116 species common to both years. In our notation, n_1 = 19, n_2 = 20, and D = 116; the relevant frequency counts were f_{1+} = 6, f_{2+} = 4, f_{3+} = 5, f_{+1} = 10, f_{+2} = 7, f_{+3} = 4, f_{11} = 1, f_{12} = 3, f_{13} = f_{21} = f_{23} = 0, and f_{22} = 1. Based on these key statistics, results produced by various estimation methods are given at the bottom of Table 2. Jackknife estimates of orders 1–4 yield a range over With the selection order k at 2, we see that Ŝ_JK is 133 and the SE is 9.6, slightly larger than results obtained by other estimators.

Table 2 Estimated shared species for Example 1 (abundance-based data, top) and Example 2 (incidence-based data, bottom). Columns: Ŝ_1, Ŝ_2, Ŝ_3, Ŝ_4, Ŝ_JK, Ŝ_Lap, Ŝ_Low, Ŝ_Cov. Rows for each example: estimated S, SE, and p value. The p values indicate the evidence against the null hypothesis H_{0,k−1}; see Sect. 3.2.

7 Discussion

In this study, the two-sample jackknife procedure in Schechtman and Wang (2004) is extended and applied to estimate the number of shared species between two communities. In addition to developing a series of jackknife estimators for shared species richness, we also suggest a sequential testing criterion for selecting a proper order among these jackknife estimators to strike a reasonable trade-off between reducing bias and inflating variance.
The performance of the proposed and existing estimators was evaluated using an empirical study and two real datasets of avian communities. In the empirical study, we found that the proposed estimator Ŝ_JK possesses advantageous properties compared with the other methods, especially for sampling fractions ranging from 0.5 to 20 %. To confirm our results, an additional simulation study based on six postulated communities with low sampling rates was also carried out, and our findings are summarized in the Supplementary Materials, where the performance of the shared

species estimators was similar to what we observed in the empirical data when neither community is a homogeneous population. It is worth noting that the second- and fourth-order jackknife estimators, Ŝ_2 and Ŝ_4, could have better performance than Ŝ_JK in terms of bias, RMSE, and coverage percentage of the 95 % confidence interval. For a quick estimate of S without an order selection step, Ŝ_2 is recommended when the data are rich and Ŝ_4 when the data are sparse. Similar recommendations apply in species richness and population size estimation, where the first- and second-order jackknife estimators are frequently suggested in applications (Heltshe and Forrester 1983; Hellmann and Fowler 1999; Chao 2005). It is further worth remarking that rare species (observed only once or twice) convey the most information about the number of unseen species in the sample; Eren et al. (2012) and Chiu et al. (2014) also underscored the importance of low observed frequencies for estimator performance. The empirical study reveals that the proposed estimator Ŝ_JK can yield interval estimates that are much more reasonable than those of some typical methods when the sample proportion is small (q ≤ 3 %), though this method still suffers from considerable negative bias in this case. Seeking a more satisfactory estimator in this setting is a challenge and is certainly worth pursuing. In a sense, this work at least provides a possible framework for doing so. In particular, the jackknife method appears promising for addressing this problem. For example, as shown in the empirical study, the fifth-order jackknife estimator Ŝ_5 performed well in terms of bias when q ≤ 1 %; unfortunately, it also produced a large variance. Based on our findings, further research may reduce the variance of a higher-order jackknife estimator and/or develop an alternative order selection procedure.
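The quick-estimate recommendation above (Ŝ_2 for rich data, Ŝ_4 for sparse data) can be illustrated with a minimal Python sketch. It assumes the large-sample forms of Ŝ_2 and Ŝ_4 listed in Corollary 1 of the Appendix, with the signs of the Ŝ_4 coefficients read as 3, −2, 3, −2, 9, −6, −6, 4. The function names and the dictionary layout for the frequency counts are hypothetical choices made for this illustration; the counts themselves are those reported for the two case studies in Sect. 6.

```python
def s2_large_sample(D, fx, fy, fxy):
    """Large-sample second-order estimate: D + f_{1+} + f_{+1} + f_{11}."""
    return D + fx.get(1, 0) + fy.get(1, 0) + fxy.get((1, 1), 0)


def s4_large_sample(D, fx, fy, fxy):
    """Large-sample fourth-order estimate (coefficients assumed from Corollary 1)."""
    def g(t, u):
        # f_{tu}: shared species seen t times in sample I and u times in sample II
        return fxy.get((t, u), 0)

    return (D
            + 3 * fx.get(1, 0) - 2 * fx.get(2, 0)
            + 3 * fy.get(1, 0) - 2 * fy.get(2, 0)
            + 9 * g(1, 1) - 6 * g(1, 2) - 6 * g(2, 1) + 4 * g(2, 2))


# Frequency counts as reported in Sect. 6 (fx[t] = f_{t+}, fy[u] = f_{+u}).
estuaries = dict(D=111, fx={1: 10, 2: 2, 3: 6}, fy={1: 15, 2: 7, 3: 3},
                 fxy={(1, 1): 4, (1, 2): 2, (2, 1): 1})
hong_kong = dict(D=116, fx={1: 6, 2: 4, 3: 5}, fy={1: 10, 2: 7, 3: 4},
                 fxy={(1, 1): 1, (1, 2): 3, (2, 2): 1})

print(s2_large_sample(**hong_kong))   # 133
print(s4_large_sample(**hong_kong))   # 137
print(s2_large_sample(**estuaries))   # 140
print(s4_large_sample(**estuaries))   # 186
```

For the Hong Kong counts, the second-order value of 133 agrees with the estimate reported for Ŝ_JK at the selected order k = 2. The other three values are simply what these large-sample formulae return and are not checked against Table 2.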
Burnham and Overton (1978) proposed a sequential criterion to select the order from a series of jackknife estimators to estimate a population size. The selection criterion considered in our study is similar to theirs with one distinction regarding the variance estimation. To estimate the variance of the resulting estimator based on the sequential testing criterion, Burnham and Overton (1978) did not take the randomness of the selection order into account and instead only calculated the asymptotic variance of the selected estimator with a fixed order from the sequential test. The resulting variance estimate is therefore underestimated, as we mentioned in Sect A non-parametric bootstrap procedure (Chao et al. 2000) is useful to overcome this drawback and thus improve the coverage percentage of such an estimator. In principle, as suggested by a referee, the selected order may reflect sufficiency of the sampling information, e.g., the data are sparse if the selected order k > 2 and vice versa. Although this seems reasonable, this suggestion warrants further investigation. More relevant to the overarching goal of this study would be to develop a stopping rule for obtaining an estimate with a desired accuracy (Yip et al. 2003) or an extension that directly accounts for the cost of sampling (Rasmussen and Starr 1979; Chao et al. 1993). It is straightforward to extend our method to estimate the number of shared species in multiple communities (Pan et al. 2009). In the Supplementary Materials, an algorithm describes in detail the sequence of jackknife estimators in the case of three communities. Several first-order jackknife estimators have been explicitly formulated

and tabulated in the Supplementary Materials. For the case of more than three communities, jackknife estimators can be developed in a similar manner.

Acknowledgments The authors are grateful to Professor Fangliang He for his valuable discussions and providing the Lambir forest plot data. The authors thank the referees and editor for their useful comments. We also thank Roman Gulati for his generous editing assistance. This work was supported by the Ministry of Science and Technology of Taiwan.

8 Appendix: A general result of the jackknife estimators Ŝ_k

Define a 2-dimensional array of coefficients d_{t,u} as:

\[
\begin{aligned}
d_{1,1} &= 1; \\
d_{t,t} &= -t\, d_{t-1,t-1}, && t \ge 2; \\
d_{t,1} &= 2^{t} - 1, && t \ge 2; \\
d_{t,u} &= d_{t-1,u} + u\,\bigl(d_{t-1,u} - d_{t-1,u-1}\bigr), && t \ge 2 \text{ and } 2 \le u < t; \\
d_{t,u} &= 0 && \text{otherwise.}
\end{aligned}
\]

These coefficients are used to simplify the expressions of jackknife estimators. The formulae can be summarized with the following Theorem.

Theorem 1 For each nonnegative integer ν, we have:

\[
\hat{S}_{2\nu,X} = D + \sum_{t=1}^{\nu+1} d_{\nu+1,t}\,\frac{n_1 - t}{n_1}\, f_{t+}
+ \sum_{u=1}^{\nu} d_{\nu,u}\,\frac{n_2 - u}{n_2}\, f_{+u}
+ \sum_{t=1}^{\nu+1}\sum_{u=1}^{\nu} d_{\nu+1,t}\, d_{\nu,u}\,\frac{(n_1 - t)(n_2 - u)}{n_1 n_2}\, f_{tu} \tag{7}
\]

and

\[
\hat{S}_{2\nu,Y} = D + \sum_{t=1}^{\nu} d_{\nu,t}\,\frac{n_1 - t}{n_1}\, f_{t+}
+ \sum_{u=1}^{\nu+1} d_{\nu+1,u}\,\frac{n_2 - u}{n_2}\, f_{+u}
+ \sum_{t=1}^{\nu}\sum_{u=1}^{\nu+1} d_{\nu,t}\, d_{\nu+1,u}\,\frac{(n_1 - t)(n_2 - u)}{n_1 n_2}\, f_{tu}. \tag{8}
\]

Therefore, Ŝ_{2ν+1} = (n_1 Ŝ_{2ν,X} + n_2 Ŝ_{2ν,Y})/(n_1 + n_2) is a linear combination of the frequencies f_{tu}. Furthermore, the (2ν + 2)-th order jackknife estimator is:

\[
\hat{S}_{2\nu+2} = D + \sum_{t=1}^{\nu+1} d_{\nu+1,t}\,\frac{n_1 - t}{n_1}\, f_{t+}
+ \sum_{u=1}^{\nu+1} d_{\nu+1,u}\,\frac{n_2 - u}{n_2}\, f_{+u}
+ \sum_{t=1}^{\nu+1}\sum_{u=1}^{\nu+1} d_{\nu+1,t}\, d_{\nu+1,u}\,\frac{(n_1 - t)(n_2 - u)}{n_1 n_2}\, f_{tu}. \tag{9}
\]
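To make the recursion and Theorem 1 concrete, the following Python sketch builds the coefficients d_{t,u} and evaluates Ŝ_k for finite n_1 and n_2. It is an illustration under stated assumptions rather than the authors' code: it assumes the recursion reads d_{t,t} = −t d_{t−1,t−1}, d_{t,1} = 2^t − 1, and d_{t,u} = d_{t−1,u} + u(d_{t−1,u} − d_{t−1,u−1}), and that every term in (7)–(9) carries the weight (n_1 − t)/n_1 or (n_2 − u)/n_2. The names `d_coefficients`, `jackknife_shared`, and the dictionary layout of the counts are hypothetical.

```python
from fractions import Fraction


def d_coefficients(order):
    """Return d[t][u] for 1 <= u <= t <= order; absent entries are zero."""
    d = {1: {1: 1}}
    for t in range(2, order + 1):
        prev = d[t - 1]
        row = {1: 2**t - 1, t: -t * prev[t - 1]}   # first column and diagonal
        for u in range(2, t):                      # interior entries
            row[u] = prev[u] + u * (prev[u] - prev[u - 1])
        d[t] = row
    return d


def _combine(dx, dy, D, n1, n2, fx, fy, fxy):
    """D plus the two single sums and the double sum, for given coefficient rows."""
    wx = {t: Fraction(c * (n1 - t), n1) for t, c in dx.items()}
    wy = {u: Fraction(c * (n2 - u), n2) for u, c in dy.items()}
    total = Fraction(D)
    total += sum(w * fx.get(t, 0) for t, w in wx.items())
    total += sum(w * fy.get(u, 0) for u, w in wy.items())
    total += sum(wx[t] * wy[u] * fxy.get((t, u), 0) for t in wx for u in wy)
    return total


def jackknife_shared(k, D, n1, n2, fx, fy, fxy):
    """k-th order estimate: k = 2*nu + 1 (odd) or k = 2*nu + 2 (even)."""
    nu, is_even = divmod(k - 1, 2)
    d = d_coefficients(nu + 1)
    hi, lo = d[nu + 1], d.get(nu, {})              # lo is empty when nu = 0
    if is_even:
        return _combine(hi, hi, D, n1, n2, fx, fy, fxy)      # form (9)
    s_x = _combine(hi, lo, D, n1, n2, fx, fy, fxy)           # form (7)
    s_y = _combine(lo, hi, D, n1, n2, fx, fy, fxy)           # form (8)
    return (n1 * s_x + n2 * s_y) / (n1 + n2)


print(d_coefficients(3)[3])   # {1: 7, 3: 6, 2: -12}
```

Exact rational arithmetic via `fractions.Fraction` is used only so that the finite-sample weights can be checked by hand; wrapping the result in `float(...)` gives the usual decimal estimate. Under these assumptions each row of coefficients sums to one (1; 3 − 2; 7 − 12 + 6), which is a convenient sanity check on the signs.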

The proof is established by mathematical induction and is shown in the Supplementary Materials due to lengthy algebra. We can further simplify the formulae in the next Corollary.

Corollary 1 When the sample sizes n_1 and n_2 are sufficiently large, define λ_j = (n_j − h)/(n_1 + n_2) for any finite number h and j = 1, 2. Asymptotically, the explicit forms of the jackknife estimators Ŝ_k for k = 1,...,6, are as follows:

\[
\begin{aligned}
\hat{S}_1 &= D + \lambda_1 f_{1+} + \lambda_2 f_{+1}; \\
\hat{S}_2 &= D + f_{1+} + f_{+1} + f_{11}; \\
\hat{S}_3 &= D + (1 + 2\lambda_1) f_{1+} - 2\lambda_1 f_{2+} + (1 + 2\lambda_2) f_{+1} - 2\lambda_2 f_{+2} + 3 f_{11} - 2\lambda_2 f_{12} - 2\lambda_1 f_{21}; \\
\hat{S}_4 &= D + 3 f_{1+} - 2 f_{2+} + 3 f_{+1} - 2 f_{+2} + 9 f_{11} - 6 f_{12} - 6 f_{21} + 4 f_{22}; \\
\hat{S}_5 &= D + (3 + 4\lambda_1) f_{1+} - 2(1 + 5\lambda_1) f_{2+} + 6\lambda_1 f_{3+} + (3 + 4\lambda_2) f_{+1} - 2(1 + 5\lambda_2) f_{+2} + 6\lambda_2 f_{+3} \\
&\quad + 21 f_{11} + (22\lambda_1 - 36) f_{12} + (22\lambda_2 - 36) f_{21} + 24 f_{22} + 18\lambda_1 f_{31} + 18\lambda_2 f_{13} - 12\lambda_1 f_{32} - 12\lambda_2 f_{23}; \\
\hat{S}_6 &= D + 7 f_{1+} - 12 f_{2+} + 6 f_{3+} + 7 f_{+1} - 12 f_{+2} + 6 f_{+3} + 49 f_{11} - 84 f_{12} - 84 f_{21} \\
&\quad + 42 f_{13} + 42 f_{31} + 144 f_{22} - 72 f_{23} - 72 f_{32} + 36 f_{33}.
\end{aligned}
\]

References

Amstrup SC, McDonald TL, Manly BF (eds) (2010) Handbook of capture–recapture analysis. Princeton University Press, Princeton
Arvesen JN (1969) Jackknifing U-statistics. Ann Math Stat 40:
Burnham KP, Overton WS (1978) Estimation of the size of a closed population when capture probabilities vary among animals. Biometrika 65(3):
Burnham KP, Overton WS (1979) Robust estimation of population size when capture probabilities vary among animals. Ecology 60(5):
Chao A (1987) Estimating the population size for capture–recapture data with unequal catchability. Biometrics 43:
Chao A (2005) Species estimation and applications. In: Balakrishnan N, Read CB, Vidakovic B (eds) Encyclopedia of statistical sciences, vol 12, 2nd edn. Wiley, New York, pp
Chao A, Hwang W-H, Chen Y-C, Kuo C-Y (2000) Estimating the number of shared species in two communities. Stat Sin 10:
Chao A, Jost L, Chiang S-C, Jiang Y-H, Chazdon R (2008) A two-stage probabilistic approach to multiple-community similarity indices. Biometrics 64:
Chao A, Lee S-M (1992) Estimating the number of classes via sample coverage.
J Am Stat Assoc 87:
Chao A, Ma M-C, Yang MCK (1993) Stopping rules and estimation for recapture debugging with unequal failure rates. Biometrika 80:
Chao A, Shen T-J (2010) Program SPADE (Species Prediction And Diversity Estimation). Program and User's Guide published at
Chao A, Shen T-J, Hwang W-H (2006) Application of Laplace's boundary-mode approximations to estimate species and shared species richness. Aust N Z J Stat 48:
Chiarucci A, Enright NJ, Perry GLW, Miller BP, Lamont BB (2003) Performance of nonparametric species richness estimators in a high diversity plant community. Divers Distrib 9:
Chiu CH, Wang YT, Walther BA, Chao A (2014) An improved nonparametric lower bound of species richness via a modified Good–Turing frequency formula. Biometrics 70(3):
Colwell RK, Coddington JA (1994) Estimating terrestrial biodiversity through extrapolation. Philos Trans R Soc Lond B 345:
Colwell RK, Elsensohn JE (2014) EstimateS turns 20: statistical estimation of species richness and shared species from samples, with non-parametric extrapolation. Ecography 37:
Condit R, Pitman N, Leigh EG Jr, Chave J, Terborgh J, Foster RB, Núñez P, Aguilar S, Valencia R, Villa G, Muller-Landau HC, Losos E, Hubbell SP (2002) Beta-diversity in tropical forest trees. Science 295:
Cormack RM (1989) Log-linear models for capture–recapture. Biometrics
Darroch JN, Ratcliff D (1980) A note on capture–recapture estimation. Biometrics 36:
Eren MI, Chao A, Hwang WH, Colwell RK (2012) Estimating the richness of a population when the maximum number of classes is fixed: a nonparametric solution to an archaeological problem. PLoS One 7(5):e34179
Esty WW (1985) Estimation of the number of classes in a population and the coverage of a sample. Math Stat 10:41–50
Good IJ (1953) The population frequencies of species and the estimation of population parameters. Biometrika 40:
Gotelli NJ, Colwell RK (2009) Estimating species richness. In: Magurran A, McGill B (eds) Frontiers in measuring biodiversity. Oxford University Press, New York
Goutis C, Casella G (1999) Explaining the saddle point approximation. Am Stat 53:
Heltshe JF, Forrester NE (1983) Estimating species using the jackknife procedure. Biometrics 39:1–11
Hellmann JJ, Fowler GW (1999) Bias, precision, and accuracy of four measures of species richness. Ecol Appl 9:
Hwang WH, Huang SY (2003) Estimation in capture–recapture models when covariates are subject to measurement errors. Biometrics 59:
Krishnamani R, Kumar A, Harte J (2004) Estimating species richness at large spatial scales using data from discrete plots. Ecography 27:
Magurran AE (2004) Measuring biological diversity. Blackwell, Oxford
Ostling A, Harte J, Green J, Kinzig A (2003) A community-level fractal property produces power-law species–area relationships. Oikos 103:
Palmer MW (1990) The estimation of species richness by extrapolation. Ecology 71:
Palmer MW (1991) Estimating species richness: the second-order jackknife reconsidered. Ecology 72:
Pan H-Y, Chao A, Foissner W (2009) A nonparametric lower bound for the number of species shared by multiple communities. J Agric Biol Environ Stat 14:
Quenouille MH (1949) Approximate tests of correlation in time series. J R Stat Soc Ser B 11:68–84
Rasmussen SL, Starr N (1979) Optimal and adaptive stopping in the search for new species. J Am Stat Assoc 74:
Schechtman E, Wang S (2004) Jackknifing two-sample statistics. J Stat Plan Inference 119:
Schloss PD, Handelsman J (2006) Introducing SONS, a tool for OTU-based comparisons of membership and structure between microbial communities. Appl Environ Microbiol 72:
Shao J, Tu D (1995) The jackknife and bootstrap. Springer, New York
Tjørve E, Tjørve KMC (2008) The species–area relationship, self-similarity, and the true meaning of the z-value. Ecology 89:
Walther BA, Moore JL (2005) The concepts of bias, precision and accuracy, and their use in testing the performance of species richness estimators, with a literature review of estimator performance. Ecography 28:
Walther BA, Morand S (1998) Comparative performance of species richness estimation methods. Parasitology 116:
Williams VL, Witkowski ET, Balkwill K (2007) The use of incidence-based species richness estimators, species accumulation curves and similarity measures to appraise ethnobotanical inventories from South Africa. Biodivers Conserv 16:
Yip PS, Fang X, Zhou Y, Wang Y (2003) Sequential procedure for fixed accuracy estimation of the population size in recapture sampling. Aust N Z J Stat 45:
Yue JC, Clayton MK (2012) Sequential sampling in the search for new shared species. J Stat Plan Inference 142:

Chia-Jui Chuang received the Ph.D. degree in mathematics from the National Chung Hsing University, Taiwan in He is now a research fellow in the National Health Research Institutes, Taiwan. His research interests are in ecological statistics and public health.

Tsung-Jen Shen received the Ph.D. degree in statistics from the National Tsing Hua University, Taiwan in Since 2010, he has been an associate professor at the National Chung Hsing University. His research interests are in developing statistical methods to deal with ecological issues, including alpha and beta diversity indices estimation, species richness prediction and so forth.

Wen-Han Hwang received the Ph.D. degree in statistics from the National Tsing Hua University, Taiwan in Since 2012, he has been a professor at the National Chung Hsing University. His research interests are in ecological statistics, measurement error analysis and statistical inference.
