Linearly interpolated FDH efficiency score for nonconvex frontiers

Size: px

Start display at page:

Download "Linearly interpolated FDH efficiency score for nonconvex frontiers"

Lily Wiggins
5 years ago
Views:

Journal of Multivariate Analysis 97 (26) 2141 2161 www.elsevier.

1 Journal of Multivariate Analysis 97 (26) Linearly interpolated FDH efficiency score for nonconvex frontiers Seok-Oh Jeong a, Léopold Simar b, a Department of Statistics, Hankuk University of Foreign Studies, Yong-In, South Korea b Institut de statistique, Université Catholique de Louvain, Louvain-la-Neuve, Belgium Received 3 December 24 Available online 3 January 26 Abstract This paper addresses the problem of estimating the monotone boundary of a nonconvex set in a full nonparametric and multivariate setup. This is particularly useful in the context of productivity analysis where the efficient frontier is the locus of optimal production scenarios. Then efficiency scores are defined by the distance of a firm from this efficient boundary. In this setup, the free disposal hull (FDH) estimator has been extensively used due to its flexibility and because it allows nonconvex attainable production sets. However, the nonsmoothness and discontinuities of the FDH is a drawback for conducting inference in finite samples. In particular, it is shown that the bootstrap of the FDH has poor performances and so is not useful in practice. Our estimator, the LFDH, is a linearized version of the FDH, obtained by linear interpolation of appropriate FDH-efficient vertices. It offers a continuous, smooth version of the FDH. We provide an algorithm for computing the estimator, and we establish its asymptotic properties. We also provide an easy way to approximate its asymptotic sampling distribution. The latter could offer bias-corrected estimator and confidence intervals of the efficiency scores. In a Monte Carlo study, we show that these approximations work well even in moderate sample sizes and that our LFDH estimator outperforms, both in bias and in MSE, the original FDH estimator. 26 Elsevier Inc. All rights reserved. AMS 2 subject classification: 62G5; 62G2; 62G9; 62P2 Keywords: Nonparametric frontiers; Nonconvex monotone boundaries; Efficiency; Bootstrapping FDH The research support from Interuniversity Attraction Pole, Phase V (no. P5/24) from the Belgian Government (Belgian Science Policy) is gratefully acknowledged. Corresponding author. addresses: seokohj@hufs.ac.kr (S.-O. Jeong), simar@stat.ucl.ac.be (L. Simar) X/$ - see front matter 26 Elsevier Inc. All rights reserved. doi:1.116/j.jmva

2 2142 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) Introduction This paper addresses the problem of estimating the monotone boundary of nonconvex set in a full nonparametric and multivariate setup. The problem found its sources in productivity analysis and efficiency measurements of firms. Foundations of the economic theory on productivity and efficiency analysis date back to the works of Koopmans [15] and Debreu [4] on activity analysis. Shephard [2] proposes a modern formulation of the problem. Following these lines, we consider a production technology where the activity of the firms is characterized by a set of inputs x R p + and outputs y R q +. In this framework the production set is the set of technically feasible combinations of (x, y). It is defined as Ψ =(x, y) R p+q + x can produce y}. (1.1) Assumptions are usually done on this set, such as free disposability of inputs and outputs, meaning that if (x, y) Ψ, then (x,y ) Ψ, as soon as 1 x x and y y. In some cases, convexity of Ψ is also assumed (see [2], for more details). As far as efficiency of a firm is of concern, the boundaries of Ψ are of interest. The efficient boundary (frontier) of Ψ is the locus of optimal production scenarios (minimal achievable input level for a given output or maximal achievable output given the input). The Farrell Debreu efficient frontier is defined in a radial sense and the efficiency scores for a given production scenario (x, y) Ψ are defined as Input oriented : θ(x, y) = infθ (θx,y) Ψ}, (1.2) Output oriented : λ(x, y) = supλ 1 (x, λy) Ψ}. (1.3) If (x, y) is laid in Ψ, θ(x, y) 1 is the proportionate reduction of inputs a unit working at the level (x, y) should perform to achieve efficiency. If θ(x, y) = 1, the unit is on the efficient frontier of Ψ. In the output direction, λ(x, y) 1 represents the proportionate increase of outputs the unit operating at level (x, y) should attain to be considered as being efficient. And if λ(x, y) = 1, the unit is on the efficient frontier. In practice Ψ is unknown and so has to be estimated from a random sample of production units (x i,y i ) i = 1,...,n}, where we assume that Prob((x i,y i ) Ψ) = 1 (referred in the literature as deterministic frontier model). So the problem is related to the problem of estimating the support of the random variable (x, y) where, for mathematical convenience, we will assume that Ψ is compact. The most popular nonparametric estimators are based on envelopment ideas: we search for estimators of Ψ which envelops at best the observed data points. The statistical properties of these estimators are now well established (see e.g. [19], for a recent survey). The most flexible nonparametric estimator, initiated by Deprins et al. [5], is the free disposal hull (FDH) estimator. It is provided by the FDH of the sample points: Ψ FDH = n i=1 (x, y) R p+q + y y i,x x i }. (1.4) The FDH efficiency scores are obtained by plugging Ψ FDH in Eqs. (1.2) and (1.3) in place of the unknown Ψ. The asymptotic properties of the resulting estimators are provided by Park 1 From here and below inequalities between vectors a,b R k have to be understood element by element. Writing a b means a i b i, for i = 1,...,k.

3 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) et al. [18]. In summary, the error of estimation converges at a rate n 1/(p+q) to a limiting Weibull distribution. If we assume that Ψ is convex, the convex hull of Ψ FDH provides the data envelopment analysis (DEA) estimator of Ψ, initiated by Farrell [6] and popularized as linear programming estimator by Charnes et al. [3]. It is defined as Ψ DEA = (x, y) R p+q n n + y γ i y i ; x γ i x i for (γ 1,...,γ n ) i=1 i=1 } n such that γ i = 1 ; γ i,i = 1,...,n. (1.5) i=1 It is the smallest free disposal convex set covering all the data points. When using Ψ DEA as the estimated attainable set, the asymptotic properties of resulting DEA efficiency scores have been investigated in Kneip et al. [13], Kneip et al. [14] and Jeong [11]. In summary, the error of estimation converges at a rate n 2/(p+q+1) to a limiting nondegenerate distribution. These nonparametric estimators are very popular and very attractive due to their flexibility. Under convexity assumption, the DEA estimator provides a piecewise linear continuous convex frontier estimate, but the less restrictive FDH estimator, allowing for nonconvex attainable sets, provides a discontinuous boundary estimate (for p = q = 1, it is a stair-case monotone function) whereas the true unknown boundary is often assumed to be smooth (continuous and differentiable). As explained below, the nonsmoothness and the discontinuity of the FDH estimator make it very difficult for us to use for inference in finite sample. And the bootstrap (even in its consistent version) fails to provide sensible practical solutions (poor finite sample properties). The objective of this paper is to propose a smoothed (linearized) version of the FDH estimator allowing nonconvex attainable set Ψ which addresses the main drawbacks of the FDH estimator. Our resulting estimator appears indeed to behave better than the FDH in finite samples and its asymptotic distribution can be easily evaluated in practice. The paper is organized as follows. Section 2 summarizes the main properties of the FDH estimator then, Section 3 presents our linearized version of the FDH (LFDH) and the practical way for computing the corresponding efficiency scores. Section 4 analyzes the asymptotic properties of our estimator and suggests procedures of bias-correction and confidence intervals for the frontier and for the efficiency scores. Section 5 investigates the finite sample properties of our estimator and shows its superiority over the original FDH estimator. Section 6 concludes. In the Appendix, we show that the subsampling bootstrap is consistent for the FDH estimators but we indicate by some simulation its practical limitations in finite sample, advocating again for the use of our LFDH estimator. 2. The FDH estimator Given a sample X n =(x i,y i ), i = 1,...,n} Ψ, the FDH estimate of Ψ was defined in (1.4). The resulting FDH estimators of the efficiency scores for a firm operating at the level (x, y) are defined by ˆθ n (x, y) = minθ (θx,y) Ψ FDH }, ˆλ n (x, y) = maxλ 1 (x, λy) Ψ FDH }.

4 2144 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) One may easily verify that ˆθ n (x, y) = min i y i y ˆλ n (x, y) = max i x i x max 1 k p min 1 k q x (k) i x y (k) i y (k), (2.1) (k). (2.2) where a (k) denotes the kth component of the vector a. From now on, for the presentation, we will only focus on the input orientation, but of course, all the results and properties are easily translated into the output oriented case. Park et al. [18] analyze the asymptotic properties of the estimators. They rely on some regularity assumptions on the data generating process. Assumption 1. θ(x, y) is continuously (partially) differentiable in (x, y) Ψ, and its partial derivatives are all nonzero for all (x, y) Ψ. Assumption 2. (x i,y i ) s are iid with a density f which is continuous, and f(x,y) > onψ and f(x,y) = outside Ψ. It is then shown that n 1/(p+q) (ˆθ n (x, y) θ(x, y)) has a limiting Weibull distribution W(μ xy,p+ q) where μ xy is a parameter depending on the shape of the boundary and on the density f at the boundary point. The expression of μ xy and a way for estimating it in finite sample is provided in Park et al. [18]. The curse of dimensionality is a particularly sensible issue in this setup, and as shown by their simulations it makes the use of the estimated limiting Weibull distribution rather imprecise in small sample (say n smaller than 1, when p + q 5). In fact, in addition to the curse of dimensionality shared by most of the nonparametric techniques, the nonsmoothness and the discontinuity of FDH estimator make its use very difficult for statistical inference in finite sample. An alternative for doing inference might be the bootstrap. In Appendix A, we showed that a subsampling bootstrap provides a consistent approximation of the sampling distribution of FDH estimators. This bootstrap is very easy to implement and avoids the problem of estimating the unknown parameter of the limiting Weibull. However, one may doubt on its usefulness in practice. In fact, from our small simulation study in the Appendix, it is verified that the subsampling bootstrap is not a good idea particularly for the confidence interval. The main reason is that the FDH estimate is determined by only one sample point, and each subsampling chooses this point too often. Also, the choice of the optimal subsample size remains an open and sensible question, as shown in our simulations. Since under our assumptions the true frontier is continuous, we might hope to improve the performance of the FDH estimator by smoothing its corresponding frontier. This is the idea of the linearized free disposal hull (LFDH) estimator defined in the next section. As shown below, it turns out that the LFDH estimator outperforms the FDH estimator. 3. The LFDH estimator 3.1. Main idea Linear interpolation is certainly the simplest plan for smoothing the FDH frontier. The idea is to interpolate the vertices of the FDH of a given data set to get the smoothed version of FDH

5 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) estimate. In a two-dimensional setup (p = q = 1), this would be easy to do, by drawing the polygonal line smoothing the staircase production frontier. But in multidimensional setup it is not straightforward to identify the points which are to be interpolated among the vertices. In this section we propose an algorithm for doing this, which results in the linearly interpolated FDH (LFDH) efficiency scores. We will consider the estimation of θ(x,y ) (or λ(x,y ) in the output oriented case) for a given firm (x,y ). A sketch of the idea is as follows: Step 1: Identify the vertices of the FDH built by the observations X n. Step 2: Move the vertices to a convex surface along the direction parallel to x (or y for the output oriented case). Step 3: Compute the convex hull of the moved points. Step 4: Identify the vertices constituting the facet penetrated by the ray θx (or λy for the output oriented case). Step 5: Interpolate them. A precise algorithm for this idea shall be described in the next subsection. For the practical implementation, Step 2 3 seems rather ambiguous when both input and output variables are multidimensional. Hence we will translate the problem, by considering a new coordinate system, where the frontier will be described by a scalar function and multidimensional covariates. This idea is analogous to the idea developed in Kneip et al. [14] when analyzing the properties of the DEA estimator. For this, we need to build an orthonormal basis of a (d 1)-dimensional space t =v R d t v = } for a given vector t R d, where denotes the transpose of a vector or a matrix. An algorithm to obtain an orthonormal basis of t can be given as follows: ( j ) 1/2 ONB 1: Compute S j = i=1 t2 i for j = 1,...,d. ONB 2: Set j = 1. ONB 3: If S j =, then set ( ). v j =,,...,, 1,,..., }}}} j 1 d j Otherwise, set ( ), v j = t 1 c j,t 2 c j,...,t j c j, S j /S j+1,,..., }}}} j d j 1 where c j = t j+1 /(S j S j+1 ). ONB 4: If j = d 1 then STOP. Otherwise, set j = j + 1 and goto [ONB 3]. One may easily verify that those v j s obtained by the above algorithm are all orthogonal to the vector t R d, and that v j j = 1,...,d 1} are linearly independent. In addition v j v j = 1 holds for all j = 1,...,d 1. Thus v j j = 1,...,d 1} forms an orthonormal basis for the (d 1)-dimensional space t Definition of the LFDH estimator Fix (x,y ) R p + Rq + the point of interest. We are to define a linearly interpolated FDH (LFDH) efficiency score at (x,y ) as an estimator of θ(x,y ). For brevity we restrict the detailed presentation for the input efficiency scores. In a remark below we give the algorithm for the output orientation.

6 2146 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) Let X B be the set of the FDH-efficient vertices obtained by X n, i.e. X B = (x i,y i ) X n ˆθ n (x i,y i ) = 1 = ˆλ } n (x i,y i ), i = 1,...,n, (3.1) and let n B be the cardinality of X B. The problem is now to identify the points to be interpolated among the n B points in X B. Let v j j = 1,...,p 1} be an orthonormal basis for x. Consider a transformation r x from R p + to Rp 1 R + : r x : x x v 1,x v 2,..., x v p 1, x x x x. (3.2) Then (r x (1) (x),...,r x (p 1) (x)) is the coefficient vector of x in the space spanned by v j j = 1,...,p 1} and r x (p) (x) is the distance between x and x. Therefore, it holds that r x (x ) = (,...,, x x ). Moreover, r x is a one-to-one transformation, and the following inverse relation holds x = p 1 j=1 x (x)v j + r x (p) x (x). x x r (j) When p = 1, we have r x (x) = x for all x R +. Consider now a transformation h x,y : R p + Rq + Rp 1+q R + which maps (x, y) to (z, u), where ( ) z = r x (1) (x),...,r x (p 1) (x), y (1) y (1),...,y(q) y (q), u = r x (p) (x). (3.3) Note that h x,y (x,y ) = (,,...,, x x ). See Fig. 1 for a graphical illustration of the new coordinate system given by the transform h x,y. Applying the transform in (3.3) to the points in X B, we get the set of transformed data in the new coordinate system (z, u): } (zi,u i ) (z i,u i ) = h x,y (x i,y i ), (x i,y i ) X B. Now we are to identify the adjacent z i s among them which forms a (smallest) simplex in z- space containing z =. Those adjacent z i s must satisfy the property that the interior of the circumcircle determined by the adjacent z i s contains no other z i s. This is closely related to Delaunay triangulation (or tessellation) in computational geometry, see Barber et al. [1] and Section 5.3 in O Rourke [16]. A simple way to do this, suggested by Brown [2], is as follows. By substituting u i with w i = z i z i for each i = 1,...,n B, we get the moved points (z i,w i ) i = 1,...,n B } laid on the strictly convex surface u = z z in the coordinate system (z, u). Next, compute the convex hull of (z i,w i ) s and identify the set of indices I defined by I =i (z i,w i ) makes the facet of the convex hull containing the origin (z = ), i = 1,...,n B }. (3.4)

7 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) y z 2 x 2 (x, y ) u y x z 1 Fig. 1. An illustration of the new coordinate system (z, u) in the case of p = 2 and q = 1. x 1 z y + u=z'z (x,y ) u + + Fig. 2. Identifying the adjacent points which are to be interpolated. The crosses (+) represent (z i,u i ) s and the asterisks ( ) represent the corresponding (z i,w i ) s. x That is, if we define γ = (γ 1,...,γ n B ) by γ = arg min γ nb n B γ i w i γ i z i =, i=1 i=1 n B i=1 γ i = 1, γ i, i= 1,...,n B }, (3.5) then we have I =i γ i > }. For a graphical illustration of the idea so far, see Fig. 2.

8 2148 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) The linear programming problem in (3.5) has a unique solution when it is feasible because the convex hull of any given points is uniquely defined. Note, however, that problem (3.5) may not have feasible solution. This happens when the u-axis does not intersect the convex hull of the projected FDH-vertices. In the original coordinates system, it means when the u-axis intersects the FDH frontier in one of its extreme parts which is parallel to one of the axes till the infinity (by the free disposability assumption). In this case, there is nothing to interpolate, the LFDH frontier coincides with the FDH frontier and the LFDH efficiency score is thus equal to the original FDH efficiency score. 2 Finally, define the LFDH estimator θ n (x,y ) by θ n (x,y ) = min θ > θx i I γ ix i,y i I γ iy i for some γ i,i I such that } i I γ i = 1. (3.6) Note that the proposed estimator coincides with a DEA estimator with reference set (sample points) given by the points (x i,y i ) i I}. We note also that it satisfies the free disposability assumption. Moreover, we are able to build the corresponding LFDH estimator of the attainable set Ψ by Ψ LFDH = (x, y) R p+q + x θ n (t, s)t, y s, (t, s) Ψ FDH }. (3.7) Remark 1. The algorithm for output LFDH efficiency score is given similarly. Define y = v R q + y v = } and let v j j = 1,...,q 1} be an orthonormal basis for y. Consider a transformation r y from R q + to Rq : r y : s s v 1,...,s s y v p 1, y y. For each observation (x i,y i ) X B, apply a transform (x i,y i ) (z i,w i ): ( z i = x (1) i x (1),...,x(p) i x (p),r y (y i ) (1),...,r y (y i ) (q 1)), w i = z i z i for i = 1,...,n B. Next, we construct a convex hull of the transformed data and identify the set of indices defined by I =i (z i,w i ) makes the facet of the convex hull containing z = }. Hence we have got (x i,y i ) X B i I} which are to be interpolated. Finally the LFDH estimator is given by λ n (x,y ) = max θ 1 x i I γ ix i, λy i I γ iy i for some γ i,i I such that } i I γ i = 1. (3.8) 2 The authors would like to thank Paul Wilson who pointed this problem and suggested the solution.

9 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) Remark 2. In productivity literature the production sets are commonly assumed to have continuous boundaries. While FDH assumes the staircase boundary during estimating the continuous target frontier, LFDH assumes the continuous boundary for the continuous target. Hence one may easily expect that LFDH performs better than FDH in terms of bias except the extreme case when the production set has the (nearly) staircase boundary. As for the variance, since LFDH uses p +q points to define its estimate in contrast to FDH which uses only one point, LFDH tends to have smaller variance than FDH. Remark 3. LFDH estimator can be regarded as a rolling-ball estimator by Hall et al. [9] with variable ball size. LFDH determines the smoothing parameter, i.e. the ball size, automatically in such a way that it locally adapts to the shape of the given cloud of points. 4. Asymptotic distribution 4.1. Asymptotics of LFDH estimator Recall the transformation h x,y : R p + Rq + Rp 1+q R + given by ( ) z i = r x (x i ) (1),...,r x (x i ) (p 1),y (1) i y (1),...,y(q) i y (q), u i = r x (x i ) (p) for i = 1,...,n. We denote the transformed dataset by X n : (4.1) X n =(z i,u i ) i = 1,...,n}. In the new coordinate system (z, u), the attainable set Ψ is reexpressed as G =(z, u) R p 1+q R + (z, u) = h x,y (x, y), (x, y) Ψ}. And we can define the boundary of Ψ through its correspondent in the coordinate system (z, u): g(z x,y ) = infu > (z, u) G}. Hence the set G can equivalently be represented by the function g as well: G =(z, u) R p 1+q R + u g(z x,y )}. (4.2) Furthermore, since the point of interest (x,y ) is transformed by h x,y into (, x ), wehave θ(x,y ) = x 1 g( x,y ), where denotes the Euclidean norm of a vector. See Fig. 3 for a graphical illustration. The following assumptions are the analogy in the coordinate system (z, u) of Assumptions 1 and 2 in Section 2. Assumption 1a. The function g(z) is continuously differentiable in z, and g 1 (z) is nonsingular for all z, where g 1 (z) is the diagonal matrix having ( / z)g(z) as its diagonal elements. Assumption 2a. (z i,u i ) s are iid with a density f which is continuous and positive on G. In particular f (z, g(z)) is bounded away from zero for all z.

10 215 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) y z u=g(z) Psi G y (x, y ) u g() x x Fig. 3. Ψ, G and g in the case of p = 1 and q = 1, where Psi stands for Ψ. Define the LFDH estimator, ĝ LFDH (z x,y ),ofg(z x,y ) for z R p 1+q by the following procedure. Firstly, identify the vertices of the FDH of X n and apply the transform in (4.1) on the vertices to get (z i,u i ) i = 1,...,n B }. Secondly, move (z i,u i ) i = 1,...,n B } to a convex surface along the u-axis. Thirdly, build the convex hull of the projected points and identify the set of indices I z such that (z i,u i ) i I z } is involved in the facet of the convex hull computed at z. Note that I is identical to I in (3.6). Finally define the LFDH estimator of g(z x,y ) by ĝ LFDH (z x,y ) = min γ i Iz i u i γ i I i z i = z for some γ i, i I z z such that } γ i I i = 1. (4.3) z Note that, in the coordinate system (z, u), ĝ LFDH (z x,y ) is the highest piecewise linear surface below (z i,u i ) s. Moreover, it is easily verified that θ z,n = min θ > θx = γ i I i x i,y = γ z i I i y i z for some γ i,i I z such that } γ i I i = 1 z is equal to x 1 ĝ LFDH (z x,y ). Under Assumption 1a 2a one may prove that, as n, θ z=,n = θ n (x,y ) holds with probability tending to one, see Proposition 2 in Jeong and Park [12]. Hence it holds that with probability tending to one ĝ LFDH ( x,y ) = x θ n (x,y ). Consequently, the asymptotic behavior of θ n (x,y ) is equivalent to that of ĝ LFDH ( x,y ).To simplify the notation, we will omit x,y ing( x,y ) and ĝ LFDH ( x,y ) from now on. Consider an estimation of a continuous frontier function g(z), z R p 1+q. Due to the complexity in multidimensional situation it is not easy to derive the explicit formula for the asymptotic distribution of ĝ LFDH (z). So we suggest to follow the strategy used in Hwang et al. [1], Jeong [11] and Jeong and Park [12]. Omitting the detailed proofs which are very similar to those in

11 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) Jeong and Park [12], we sketch the main steps to obtain a large sample approximation of the limit distribution as follows. Consider a linear transformation that takes (z i,u i ) to zi = n 1/(p+q) g 1 (z)(z i z), u i = n 1/(p+q) u i g(z)}. Then (zi,u i ) has as its frontier the surface with the equation u = 1 z + o(1) uniformly on any compact set of z, where 1 = (1, 1,...,1). Moreover, for large n, the density in the new coordinate system (z,u ) is approximated by n 1 g 1 (z) 1 f uniformly in the region (z,u ) z ε n n 1/(p+q), 1 z u 1 z + ε n n 1/(p+q)} for each sequence ε n, where f denotes the density at (z, g(z)) and denotes the determinant of a matrix. Define κ = g 1 (z) / f } 1/(p+q), and consider a new random sample from the uniform distribution on B κ = (z,u ) z [ (κ/2)n 1/(p+q),(κ/2)n 1/(p+q) ] p 1+q, 1 z u 1 z + κn 1/(p+q) }. (4.4) Note that the uniform density on this region is n 1 g 1 (z) 1 f. Let ĝlfdh ( ) be the version of ĝ LFDH ( ) obtained from the new sample. Then, according to the same arguments for Theorem 1 in Jeong and Park [12], we have the following theorem. Theorem 4.1. Under Assumption 1a 2a, for z R p 1+q, n 1/(p+q) ĝ LFDH (z) g(z)} and ĝ LFDH () have the same limit distribution. Corollary 4.1. Under Assumption 1 2, n 1/(p+q) θ n (x,y ) θ(x,y )} and ĝ LFDH ()/ x have the same limit distribution. Once the unknown parameters f and g 1 (z) are determined, the limit distribution of LFDH estimator θ n (x,y ) can be simulated based on this result: we only need to simulate a large number of times the value ĝlfdh (), each of which being computed from a random sample drawn from the uniform distribution on B κ. Of course, f and g 1 (z) are unknown, but they can be consistently estimated as follows. For estimating f, we propose to use the histogram type version in Jeong and Park [12]: 1 n nδ p+q I(z i,u i ) D(δ)}, (4.5) i=1 where D(δ) is the region in the coordinate system (z, u) defined by } D(δ) = (z, u) z [ δ/2, δ/2] p 1+q, ĝ LFDH (z) u ĝ LFDH (z) + δ. Note that the Lebesgue measure of D(δ) is equal to δ p+q. Its consistency is directly derived by the consistency of ĝ LFDH, as long as δ is chosen to satisfy nδ p+q as n, see Theorem 2 in Gijbels et al. [7].

12 2152 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) For estimating g 1 (z), we propose to use the slope of the facet involved in ĝ LFDH (z), see Park [17] and Hall and Park [8]. The slope can be easily obtained by computing the hyperplane in the coordinate system (z, u) which passes through the set of points (z i,u i ) i I z } (z, ĝ LFDH (z))}. That is, the estimator of g 1 (z) is defined by the product of (β, β 1,...,β p 1+q ) which is the solution of the system of equations below: p 1+q u i = β + j=1 β j z (j) i, i I z. (4.6) Note that the cardinality of I z is greater than or equal to 1 and less than or equal to p + q by construction. In fact it is equal to p + q with probability tending to one as n, and hence system (4.6) is nonsingular in probability. When the cardinality of I z is less than p+q, we suggest to add the equation p 1+q ĝ LFDH (z) = β + j=1 β j z (j) to (4.6) at first. If the singularity still exists, then, by the ascending order of z i z for z i s such that (z i,u i ) X n \(z i,u i ) i I z }, add the equation p 1+q ĝ LFDH (z i ) = β + j=1 β j z (j) i to (4.6) in turn until the system becomes nonsingular. It is worthwhile to note that, for estimating g 1, the proposed estimator does not require any smoothing parameter. Having an estimate of the asymptotic distribution of the LFDH estimators, the next subsection suggests procedures for correcting the bias and for constructing confidence intervals of the quantities of interest Bias-corrected estimator and confidence interval By using the distribution of ĝlfdh (), we may indeed quantify the bias of the LFDH estimator ĝ LFDH () or θ n (x,y ). Let ĝlfdh,b ()}B b=1 be the set of B values of ĝ LFDH (), each of which is computed from a random sample from the uniform distribution on Bˆκ, see (4.4), where ˆκ is an estimate κ which is the unknown in the large sample approximation in Theorem 4.1. Since the empirical distribution of ĝlfdh,b ()}B b=1 approximates the distribution of ĝ LFDH (),wemay estimate the asymptotic mean of n 1/(p+q) ĝ LFDH () g()} by B 1 B b=1 ĝ LFDH,b (). Thus, a bias-corrected estimator of g() is given by ĝ LFDH () n 1/(p+q) B 1 B b=1 ĝ LFDH,b ().

13 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) Of course we get the bias-corrected version of θ n (x,y ) by B } x 1 ĝ LFDH () n 1/(p+q) B 1 ĝlfdh,b (). b=1 The empirical distribution of ĝlfdh,b ()}B b=1 also enables us to construct a confidence interval for g(). Let ˆq α be the αth quantile of the empirical distribution of ĝlfdh,b ()}B b=1. Then, 1(1 α)% confidence interval for g() is given by ] [ĝ LFDH () n 1/(p+q) ˆq 1 α/2, ĝ LFDH () n 1/(p+q) ˆq α/2, and hence 1(1 α)% confidence interval for θ(x,y ) is given by [ θn (x,y ) n 1/(p+q) x 1 ˆq 1 α/2, θ n (x,y ) n 1/(p+q) x 1 ˆq α/2 ]. Remark 4. Accurate estimation of the unknown quantity κ is of course crucial for the above procedures. In particular the choice of smoothing parameter δ for estimating f in (4.5) should be done very carefully, because it affects very much on the resulting estimate of f. But one may expect that the choice of δ affects less for bias-correction, which is the same situation as in Gijbels et al. [7]. Anyway, the finding of data-driven techniques for selecting δ is a challenging topic for future research. 5. Numerical study 5.1. Sampling distribution of FDH and LFDH estimators In this section, we compare the sampling distribution of FDH estimator and that of LFDH estimator by a simulation study. For the simulation models we choose the models in Section 4.1 in Park et al. [18]. In order to see how LFDH outperforms FDH, we compare the finite sample properties of LFDH output scores with the results in Park et al. [18] which dealt with the output scores only. Model 1: Simulation II in Park et al. [18]. p = q = 2. Model 2: Simulation III in Park et al. [18]. p = 4, q = 1. We conducted 5 Monte Carlo experiments with the sample sizes n = 1 and n = 4. Fig. 4 depicts the sampling distributions of the FDH and LFDH estimates in all the Monte Carlo scenarios. The purpose of this comparison is to present how LFDH works better than FDH. Note that this is not the comparison between the bias-corrected estimators. For the sampling distribution of FDH, histograms are given in the figures, since the FDH estimator provides a spurious mass at one and any smooth density estimates are not appropriate. The superior performance of LFDH estimators to FDH estimators is clearly seen in the figures: in each case, the sampling distribution of LFDH is more concentrated (less variance) and it is shifted toward the true value of the estimated parameters, which is an empirical evidence of discussion in Remark 2. This is confirmed by analyzing the mean squared errors as provided in Table 1. The table demonstrates again the desirable properties of LFDH estimator over the original FDH: the bias is substantially smaller and the MSE are smaller by a factor 4 5.

14 2154 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) Model 1, n = 1 Model 1, n = Density Density Model 2, n = 1 Model 2, n = Density Density Fig. 4. Comparison of the sampling distribution of FDH (histogram) and that of LFDH (solid line) estimates. Model 1: θ = Model 2: θ =

15 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) Table 1 MSE comparisons of FDH estimator and LFDH estimator n Mean (S.E.) MSE ( 1 1 ) FDH LFDH FDH LFDH Model (.92) (.71) (θ = 2.27) (.65) (.46) Model (.33) (.3) (θ = ) (.26) (.19) probability probability (a) (b) probability probability (c) (d) Fig. 5. Comparison between the simulated limit distribution (solid line) and the empirical distribution (asterisks) of LFDH when (a) n = 1 (b) n = 2 (c) n = 4 (d) n = 1.

16 2156 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) probability probability (a) (b) probability probability (c) (d) Fig. 6. Comparison between the limit distribution (solid line) and the empirical distribution (asterisks) of FDH when (a) n = 1 (b) n = 2 (c) n = 4 (d) n = Limit distribution vs. empirical distribution We conducted another simulation study to evaluate the accuracy of our large sample approximation of the sampling distribution given by Theorem 4.1 in Section 4. Under Model 2 in the previous subsection, 5 Monte Carlo experiments were done with the sample sizes n = 1, 2, 4, 1. We compared the empirical distribution of n 1/(p+q) ĝ LFDH (z) g(z)} and the simulated distribution of ĝlfdh (). Two thousand Monte Carlo simulations were done for the latter. Fig. 5 depicts the results, where we verify that our large sample approximation for LFDH works quite well even with the moderate sample sizes in a five-dimensional setup. We also did the same thing for FDH and Fig. 6 is the result. As we compare it with Fig. 5, we observe that the gap between the asymptotic distribution and the empirical distribution of FDH diminishes much slower than that of LFDH.

17 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) Conclusion In this paper we have proposed a new nonparametric estimator of monotone frontiers which allows for nonconvex sets below the frontier. This is particularly useful in the context of productivity analysis where the FDH estimator has been extensively used. However, the nonsmoothness and discontinuities of the FDH is a drawback for conducting inference in finite samples. The bootstrap, even in a consistent version of it, does not provide a useful alternative (poor finite sample properties). Our estimator, the LFDH, is a linearized version of the FDH, obtained by linear interpolation of the appropriate FDH-efficient vertex in the observed sample. It offers a continuous, smooth version of the FDH. We provide an algorithm for computing the estimator, and we establish its asymptotic properties. We also provide an easy way to approximate its asymptotic sampling distribution. The latter could offer bias-corrected estimator and confidence intervals of the efficiency scores. In a Monte Carlo study, we show that these approximations works well even in moderate sample sizes and that our LFDH estimator outperforms, both in bias and in MSE, the original FDH estimator. Also, the gap between the asymptotic distribution and the empirical distribution diminishes faster as n increases. The approach presented here could be extended if smoother version of the FDH would be wanted. For example, we could use spline interpolation passing through the appropriate FDHefficient vertices. However, we doubt the practical gain of such more elaborate smoothing procedures. Appendix A. Subsampling FDH estimators A.1. The consistency We know that, in this boundary estimation setup, the naive bootstrap (sampling with replacement, samples of size n from the original sample X n ) is inconsistent (see [19] and the references therein). Subsampling is a way to solve the problem. Let m be the size of each subsample such that m/n and m as n, that is, the subsample Xn,m =(X i,y i ), i = 1,...,m} is drawn randomly, without replacement, from the original sample X n. Let ˆθ n,m = ˆθ n,m (x,y ) be the FDH estimate of θ(x,y ) obtained from the subsample Xn,m. Write θ = θ(x,y ) and ˆθ n = ˆθ n (x,y ) for brevity. Theorem A.1. Under the assumptions AI AIII in Park et al. [18], as n we have sup z> Pr n 1/(p+q) (ˆθn ) } /θ 1 z ) 1/(p+q) Pr m (ˆθ n,m /ˆθ n 1 z n} X p. (A.1) Proof. Define NW(x, y) =(x,y ) R p+q + x <x,y >y} Ψ, and the events [ ] E n = NW(x (1 + n 1/(p+q) z), y ) X n =, [ ] En,m = NW(x (1 + m 1/(p+q) z), y ) Xn,m =.

18 2158 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) Then, we may easily prove that for z> Pr n 1/(p+q) (ˆθn ) } /θ 1 >z Pr E n } ; ) } 1/(p+q) Pr m (ˆθ n,m /θ 1 >z X n Pr E n,m } p X n as n. Since m 1/(p+q) (ˆθ n θ 1) p, it suffices to show that } sup z> Pr E m Pr En,m } p X n (A.2) as n. Let N n,m = ( n m) be the number of subsamples, indexed by j = 1,...,Nn,m, available from X n. And let Xn,m,j be the jth subsample, and E n,m,j be the event E n,m corresponding to X n,m,j. Then Pr En,m X } 1 n = N n,m N n,m j=1 I } En,m,j, which is a U-statistic with the mean Pr E m }. Hence, by Hoeffding s inequality (see [21], Theorem A, p. 21), } Pr E n,m X n Pr Em } p. The uniformity in z of the convergence (A.2) is given by Polya s Theorem (see [21]). The preceding argument is for subsampling without replacement, but asymptotically, subsampling with replacement does not make any difference. A.2. Subsampling FDH in action When the sample size n or the subsample size m = n κ is not large enough and the point of interest (x,y ) is close to the frontier of the FDH of the original sample X n, it may happen very often that the point (x,y ) is not laid in the FDH of the subsample Xn,m, which may cause a problem to define the efficiency score and deteriorate the quality of resulting bootstrap distribution. But note that, for any positive real number a, θ(ax,y) = a 1 θ(x, y), ˆθ n (ax,y) = a 1 ˆθn (x, y) and ˆθ n,m (ax,y) = a 1 ˆθ n,m (x, y), so that we have ˆθ n (ax,y)/θ(ax,y) = ˆθ n (x, y)/θ(x, y), ˆθ n,m (ax,y)/ˆθ n (ax,y) = ˆθ n,m (x, y)/ˆθ n (x, y) for all a>. By choosing a large enough for (ax,y ) to be laid in the FDH of each subsample and then computing ˆθ n,m (ax,y )/ˆθ n (ax,y ) instead of ˆθ n,m (x,y )/ˆθ n (x,y ), we can avoid the above technical difficulty. If the value of y is too large (if y > maxy j (x j,y j ) Xn,m }), such value a does not exist. In such a case, we suggest to add the point of interest (x,y ) to the bootstrap sample Xn,m, this does not alter the asymptotic properties of the bootstrap. This trick is applicable to subsampling any other radial efficiency scores such as the DEA efficiency score as analyzed by Kneip et al. [14]. Another practical difficulty is due to the fact that N n,m is very large in general. That is, it may be very difficult in practice to consider all possible Xn,m,j,j = 1,...,N n,m. Therefore we consider the following Monte Carlo algorithm: draw randomly a set of numbers J 1,...,J B } from 1,...,N n,m } with or without replacement. Write dn,m,b = m1/(p+q) (ˆθ n,m,b /ˆθ n 1), where ˆθ n,m,b

19 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) is the value for ˆθ n,m from the subsample X n,m,j b, b = 1,...,B. Then, the empirical distribution of dn,m,b,b = 1,...,B} approximates the exact sampling distribution of n1/(p+q) (ˆθ n /θ 1) given X n as B. For example, we can estimate the bias of FDH estimator by this approximation resulting the following bias-corrected estimator: B 1 ˆθ n 1 + n 1/(p+q) B 1 dn,m,b}. b=1 Moreover we can construct a 1 (1 α)% confidence interval for θ as follows: [ˆθn /(1 + n 1/(p+q) c 1 α/2,m ), ˆθ n /(1 + n 1/(p+q) c α/2,m ) ], where cα,m denotes the αth sample quantile of d n,m,b,b = 1,...,B}. A.3. Simulation study We conducted some simulation studies following the settings in Section 5 in order to investigate the finite sample performances of the subsampling (with replacement) FDH estimators. Table 2 Comparison of FDH estimator and its bias-corrected versions κ n = 1 n = 1 Mean (S.E.) MSE Mean (S.E.) MSE Model 1 (θ = 2.27) FDH (.85) (.53) (.253) (.16) (.243) (.1) (.228) (.96) (.211) (.93) (.19) (.9) (.175) (.87) (.161) (.84) (.148) (.8) (.136) (.76) (.126) (.72) (.116) (.68) Model 2 (θ = ) FDH (.34) (.2) (.95) (.37) (.94) (.36) (.92) (.34) (.9) (.33) (.86) (.32) (.81) (.31) (.75) (.3) (.68) (.29) (.61) (.28) (.55) (.26) (.48) (.24).3154 The values for MSE are multiplied by 1 1.

20 216 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) Table 3 Coverage probabilities of the confidence interval by subsampling FDH estimator κ n = 1 n = 1 9% 95% 99% 9% 95% 99% Model Model From M = 5 Monte Carlo experiments, we computed the MSE of the bias-corrected estimator and the coverage probabilities of the confidence interval obtained from subsampling, which are summarized in Tables 2 and 3. We considered the subsample sizes m = n κ for various κ in < κ 1, and B = 2 was used for each subsampling. As is seen from the simulation results, while the subsampling worked fairly well for the biascorrection, the subsampling for building confidence intervals shows very poor performances. Indeed, the coverage probabilities obtained by the subsampling are not generally close to the nominal level even in the cases of the sample size of n = 1 which is not small at all. These features are mainly due to the nature of FDH estimator rather than that of subsampling. Note that the subsampling approximates the continuous sampling distribution by a discrete distribution, and that the value of FDH estimate is completely determined by only one point of the FDH vertices of the data. In this situation the approximate discrete distribution tends to give too much probability mass on the point. Therefore, the approximation of the sampling distribution would have very poor accuracy particularly in the tail when the (sub)sample size is not large enough, which inherently gives the poor coverage accuracy of the confidence interval with large nominal probability such as 9%, 95% and 99%. Hence we may expect that the coverage probabilities would be very poor especially for the small values of κ, which is observed in Table 3. In the biascorrection we may expect the similar thing by the same reason as the above. By construction, the bias would be over-calibrated by the subsampling when κ is small (since large dn,m,b s would have

21 S.-O. Jeong, L. Simar / Journal of Multivariate Analysis 97 (26) large probability masses in the subsampling distribution), which results in the over-correction, see Table 2. The effect, however, is less crucial than that for the confidence interval, since it involves the average of the d n,m,b s not the tail distribution of the d n,m,b s. References [1] C.B. Barber, D.P. Dobkin, H. Huhdanpaa, The Quickhull algorithm for convex hulls, ACM Trans. Math. Software 22 (1996) [2] D.F. Brown, Voronoi diagrams from convex hulls, Inform. Process. Lett. 9 (1979) [3] A. Charnes, W.W. Cooper, E. Rhodes, Measuring the inefficiency of decision making units, European J. Oper. Res. 2 (1978) [4] G. Debreu, The coefficient of resource utilization, Econometrica 19 (1951) [5] D. Deprins, L. Simar, H. Tulkens, Measuring labor inefficiency in post offices, in: M. Marchand, P. Pestieau, H. Tulkens (Eds.), The Performance of Public Enterprises: Concepts and Measurements, North-Holland, Amsterdam, [6] M.J. Farrell, The measurement of productive efficiency, J. Roy. Statist. Soc. Series A 12 (1957) [7] I. Gijbels, E. Mammen, B.U. Park, L. Simar, On estimation of monotone and concave frontier functions, J. Amer. Statist. Assoc. 94 (1999) [8] P. Hall, B.U. Park, New methods for bias correction at endpoints and boundaries, Ann. Statist. 3 (22) [9] P. Hall, B.U. Park, B. Turlach, Rolling-ball method for estimating the boundary of the support of a point-process intensity, Ann. Inst. H. Poincare Probab. Statist. 38 (22) [1] J.H. Hwang, B.U. Park, W. Ryu, Limit theorems for boundary function estimators, Statist. Probab. Lett. 59 (22) [11] S.-O. Jeong, Asymptotic distribution of DEA efficiency scores, J. Korean Statist. Soc. 33 (24) [12] S.-O. Jeong, B.U. Park, Large sample approximation of the limit distribution of convex-hull estimators of boundaries, Scand. J. Statist., 26, forthcoming. [13] A. Kneip, B.U. Park, L. Simar, A note on the convergence of nonparametric DEA estimators for production efficiency scores, Econometric Theory 14 (1998) [14] A. Kneip, L. Simar, P.W. Wilson, Asymptotics for DEA estimators in nonparametric Frontier models, Discussion Paper #317, Institut de Statistique, UCL, Belgium, [15] T.C. Koopmans, An analysis of production as an efficient combination of activities, in: T.C. Koopmans (Ed.), Activity Analysis of Production and Allocation, Cowles Commission for Research in Economics, Monograph, vol. 13, Wiley, New York, [16] J. O Rourke, Computational Geometry in C, second ed., Cambridge University Press, Cambridge, [17] B.U. Park, On estimating the slope of increasing boundaries, Statist. Probab. Lett. 52 (21) [18] B.U. Park, L. Simar, Ch. Weiner, The FDH estimator for productivity efficiency scores: asymptotic properties, Econometric Theory 16 (2) [19] L. Simar, P.W. Wilson, Statistical inference in nonparametric frontier models: the state of the art, J. Productivity Anal. 13 (2) [2] R.W. Shephard, Theory of Cost and Production Function, Princeton University Press, Princeton, 197. [21] R.J. Serfling, Approximation Theorems of Mathematical Statistics, Wiley, New York, 198.

I N S T I T U T D E S T A T I S T I Q U E B I O S T A T I S T I Q U E E T S C I E N C E S A C T U A R I E L L E S (I S B A)

I N S T I T U T D E S T A T I S T I Q U E B I O S T A T I S T I Q U E E T S C I E N C E S A C T U A R I E L L E S (I S B A) UNIVERSITÉ CATHOLIQUE DE LOUVAIN D I S C U S S I O N P A P E R 2011/35 STATISTICAL