When, Where and How of Efficiency Estimation: Improved Procedures


Mike G. Tsionas

May 23, 2016

Abstract

The issues of functional form, distributions of the error components and endogeneity are for the most part still open in stochastic frontier models. The same is true when it comes to imposition of restrictions of monotonicity and curvature, making efficiency estimation an elusive goal. In this paper we attempt to consider these problems simultaneously and offer practical solutions to the problems raised by Stone (2002) and addressed in Badunenko, Henderson and Kumbhakar (2012). We provide major extensions to smoothly mixing regressions and fractional polynomial approximations for both the functional form of the frontier and the structure of inefficiency. Endogeneity is handled, simultaneously, using copulas. We provide detailed computational experiments and an application to US banks. To explore the posteriors of the new models we rely heavily on Sequential Monte Carlo techniques.

Key words: Stochastic Frontiers; Smoothly Mixing Regressions; Fractional Polynomial Approximations; Efficiency Estimation; Bayesian inference; Sequential Monte Carlo.

Lancaster University Management School, LA1 4YX, UK, & Athens University of Economics and Business, 76 Patission Str., Athens 10434, Greece.

1 Introduction

Stone (2002) has raised several important points in connection to efficiency estimation, including the choice of functional forms, the distributions of the statistical error term and of the one-sided error term that represents inefficiency, etc. Badunenko, Henderson and Kumbhakar (2012, BHK henceforth) used simulations to compare the performance of (i) a two-stage semi-parametric stochastic frontier (SSF) estimator due to Fan et al. (1996), and (ii) the non-parametric bias-corrected DEA estimator of Kneip et al. (2008). One important outcome from their simulation studies is that, in realistic situations, both estimators do a fair job at identifying the best and the worst performing decision making units. However, each does a relatively poor job at identifying the median performer. Another important conclusion for policy making was the following:

"For a practical example, we note that [...] identifying a benchmark firm(s) is often important to regulators. A benchmark firm(s) (say top 5%) is often used to decide the penalty (or carrot) for the bottom firms in yardstick competition. In such a case, it is important to accurately estimate the top 5% and bottom 5% of firms. We note that identification of both the top 5% and bottom 5% of firms is feasible in scenarios s1, s2 and s4. In general, FLW would produce more reliable results for the bottom 5% (except for s2) and KSW would generally produce more reliable results for the upper 5%." (footnote 1)

Of course, BHK used an SSF alternative to the bootstrapped DEA approach. This partly addresses the functional form assumption, although not in a comprehensive manner. There are several problems: (i) With many inputs and small samples, the estimator would not be reliable. (ii) The estimator still uses a normal-half-normal distribution in the second stage to estimate technical efficiencies. (iii) There are other problems with actual data sets that neither Stone (2002) nor BHK considered. (iv) The recommendation that one estimator is best for the upper 5% and another for the bottom 5% of firms is, at best, problematic as the overall statistical properties are hard to analyze.

The questions that Stone (2002) raised, and which were partly answered by the simulation studies in BHK, remain largely open. Functional forms and error term distributions in stochastic frontier analysis are still problems that we need to consider jointly. Another problem that plagues applications is the potential endogeneity of inputs: inputs are not experimentally designed nor are they exogenous to production; they are decided by firms. Nothing really implies that outputs are exogenous and inputs are endogenous. In a cost minimization framework, for example, the firm decides the quantities of inputs, given input prices and outputs. Under this behavioral consideration, inputs would be endogenous and outputs would be exogenous. Many public firms would belong to this category.

Footnote 1: Here, s1 to s4 refer to different scenarios in their simulations. Suppose $\sigma_v^2$ and $\sigma_u^2$ denote the variances of the two-sided error term ($v$) and the one-sided error term ($u$) respectively, and $\lambda = \sigma_u/\sigma_v$. In scenario s1 ($\sigma_v = \sigma_u = 0.01$, $\lambda = 1.0$), both terms are relatively small. In other words, the data are measured with relatively little error and the units are relatively efficient. In scenario s2 ($\sigma_v = 0.01$ and $\sigma_u = 0.05$, $\lambda = 5.0$), the data have relatively little noise, but the units under consideration are relatively inefficient.
In scenario s3 ($\sigma_v = 0.05$ and $\sigma_u = 0.01$, $\lambda = 0.2$), the data are relatively noisy and the firms are relatively efficient. The fourth scenario s4 ($\sigma_v = \sigma_u = 0.05$, $\lambda = 1.0$) is redundant as $\lambda = 1.0$ as in s1. However, we show this case to emphasize that the results of the experiment depend upon the ratio of $\sigma_u$ to $\sigma_v$ and not their absolute values.

Under profit maximization, both inputs and outputs are endogenous as they are both subject to choice by the firm. The endogeneity problem has proven to be quite hard to attack. Measurement error in inputs is also a problem and, for practical purposes, it means that observed inputs are correlated with the error term(s), so we have an endogeneity problem.

In stochastic frontier models the basic model is $y = f(x)\exp(v - u)$, where $y$ is output, $x$ is a vector of inputs, $v$ is statistical noise usually assumed to be normally distributed, $N(0, \sigma_v^2)$, and $u$ is a positive error term usually assumed to follow a half-normal distribution (footnote 2), $u \sim N_+(0, \sigma_u^2)$. The ratio $\lambda = \sigma_u/\sigma_v$ is known as the signal-to-noise ratio and indicates how important inefficiency ($u$) is relative to noise ($v$). The simulation study in BHK concluded that with a low value of $\lambda$ it does not make much sense to proceed and measure inefficiency, as this is inherently hard in this situation. When $\lambda \geq 1$, both methods they used turned out to be, for the most part at least, similar in performance, subject to the considerations stated above. When the value of $\lambda$ is small, the data are not very informative about the presence of an inefficiency signal. However, this could be attributed, at least partially, to the potential correlation between $x$ and $v$ and/or $u$. Ignoring this may seriously bias efficiency scores, while formally accounting for it can lead to better separation of inefficiency from noise.

The distributional assumptions problems seem to be resolved only in the context of bootstrapped DEA, as this approach does not need any such particular assumptions. In the two-stage SSF estimator due to Fan et al. (1996), the first stage estimates a term that includes $f(x)$ and thus it dispenses with the functional form assumption (at least on the surface), but it does assume a normal-half-normal specification in the second stage to estimate technical inefficiencies. From that point of view the approach is not completely satisfactory. Another problem is that monotonicity and concavity are not imposed (e.g. Pu, Parmeter and Racine, 2013) on $f(x)$ and, therefore, it is not clear that a production function is indeed estimated. The bootstrapped DEA method, again, seems to have an advantage here. Why not use the bootstrapped DEA method then? BHK have shown that adopting this practice indiscriminately may lead to serious problems as, generally, the semi-parametric stochastic frontier estimator seems to perform better! Moreover, under conditions of endogeneity, measurement errors, etc., bootstrapped DEA cannot offer a panacea. The semi-parametric stochastic frontier estimator is, clearly, not a panacea either in this situation. Another consideration is that with data sets running in the thousands of observations, applying the bootstrapped DEA method is extremely computationally intensive, whereas applying the SSF estimator is much easier. Therefore, we believe that there are reasons to stick with stochastic frontiers, but functional form issues, issues related to distributional assumptions and endogeneity issues must be resolved simultaneously, under the same umbrella of a common, unifying flexible stochastic frontier model.

We should, nevertheless, mention that there have been important developments in DEA. These are summarized in Daraio and Simar (2007). Two-stage models that allow for explanatory variables are treated fully in Simar and Wilson (2007).

Footnote 2: Many other distributions have been used in the literature, like the exponential, the gamma, the Weibull, etc.

See also Daraio, Bonaccorsi and Simar (2015) and Daraio and Simar (2014), who have used a directional distance function with estimated input- and output-specific directions; see also Simar and Vanhems (2012) and Simar, Vanhems and Wilson (2012).

2 Model

2.1 The approach in Badunenko, Henderson and Kumbhakar (2012)

Badunenko, Henderson and Kumbhakar (2012) compare two promising estimators of technical efficiency in the cross-sectional case. They compare the non-parametric kernel estimator of Fan et al. (1996) with the non-parametric bias-corrected DEA estimator of Kneip et al. (2008). The model of Fan et al. (1996) requires a parametric second stage (and hence it is semiparametric); it is more robust than the initial stochastic frontier model by Aigner et al. (1977) and Meeusen and van den Broeck (1977) (footnote 3). The method of Kneip et al. (2008) introduces statistical inference via bootstrapping and has been shown to complement well the standard DEA model found in Charnes et al. (1978).

Suppose $x_i \in \mathbb{R}^p$ is a vector of inputs and $y_i$ denotes output for firm $i = 1, \ldots, n$. In the model of Kneip et al. (2008) we have:

$$y_i = f(x_i) + v_i - u_i, \quad i = 1, \ldots, n, \quad (1)$$

where $f: \mathbb{R}^p \to \mathbb{R}$ is an unknown smooth function, $v_i$ is a two-sided error term and $u_i \geq 0$ is a one-sided error representing inefficiency. As $E(y_i \mid x_i) = f(x_i) - E(u_i) \neq f(x_i)$, a non-parametric estimator of the regression function $E(y_i \mid x_i)$ is not $f(x_i)$ itself. It becomes necessary to adopt distributional assumptions on both $v$ and $u$ to construct the log-likelihood function to be maximized to obtain the efficiency scores. Under the assumption that $v_i \mid x_i \sim N(0, \sigma_v^2)$ and, independently, $u_i \mid x_i \sim N_+(0, \sigma_u^2)$, the log-likelihood function is:

$$l(\lambda) = -n \ln \hat{\sigma} + \sum_{i=1}^{n} \ln \Phi\!\left(-\frac{\lambda \hat{\varepsilon}_i}{\hat{\sigma}}\right) - \frac{1}{2\hat{\sigma}^2} \sum_{i=1}^{n} \hat{\varepsilon}_i^2, \quad (2)$$

where $\hat{\varepsilon}_i = y_i - \hat{E}(y_i \mid x_i) - \mu(\lambda, \hat{\sigma})$, $\mu(\lambda, \hat{\sigma}) = \lambda \hat{\sigma} \sqrt{2} / \{\pi (1 + \lambda^2)\}^{1/2}$, and

$$\hat{\sigma}^2 = \frac{n^{-1} \sum_{i=1}^{n} \{y_i - \hat{E}(y_i \mid x_i)\}^2}{1 - \frac{2\lambda^2}{\pi(1+\lambda^2)}}. \quad (3)$$

The log-likelihood function in (2) is a function of the single parameter $\lambda = \sigma_u/\sigma_v$, given the estimator $\hat{\sigma}^2$ of $\sigma^2 = \sigma_v^2 + \sigma_u^2$. Given an estimate $\hat{\lambda}$, the estimate $\hat{\sigma}^2$ can be recovered from (3). The point estimator of inefficiency via Jondrow et al. (1982) can be obtained as:

$$\hat{u}_i = \hat{\mu}_{*i} + \hat{\sigma}_* \frac{\phi(\hat{\mu}_{*i}/\hat{\sigma}_*)}{\Phi(\hat{\mu}_{*i}/\hat{\sigma}_*)}, \quad i = 1, \ldots, n, \quad (4)$$

where $\hat{\mu}_{*i} = -\hat{\varepsilon}_i \hat{\sigma}_u^2 / (\hat{\sigma}_v^2 + \hat{\sigma}_u^2)$, $\hat{\sigma}_* = \hat{\sigma}_v \hat{\sigma}_u / (\hat{\sigma}_v^2 + \hat{\sigma}_u^2)^{1/2}$, and $\phi$, $\Phi$ denote the standard normal pdf and cdf respectively.

Footnote 3: For a review, see Kumbhakar and Lovell (2003).
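To make the second stage concrete, the following is a minimal sketch of the concentrated log-likelihood (2)-(3) and the JLMS estimator (4), assuming first-stage estimates of $\hat{E}(y_i \mid x_i)$ are already available (e.g., from a kernel or local-linear fit). The function and variable names are illustrative, not the authors' code.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

def second_stage(y, Ehat, lam_bounds=(1e-3, 50.0)):
    """Normal/half-normal second stage of the two-stage SSF estimator:
    concentrated log-likelihood in lambda (eqs. 2-3) and JLMS scores (eq. 4)."""
    r = y - Ehat                                          # first-stage residuals

    def sigma2_hat(lam):
        return np.mean(r**2) / (1.0 - 2.0 * lam**2 / (np.pi * (1.0 + lam**2)))

    def neg_loglik(lam):
        s2 = sigma2_hat(lam)
        s = np.sqrt(s2)
        mu = lam * s * np.sqrt(2.0) / np.sqrt(np.pi * (1.0 + lam**2))  # E(u)
        eps = r - mu                                      # recentred residuals
        ll = (-len(y) * np.log(s)
              + np.sum(norm.logcdf(-lam * eps / s))
              - np.sum(eps**2) / (2.0 * s2))
        return -ll

    lam = minimize_scalar(neg_loglik, bounds=lam_bounds, method="bounded").x
    s2 = sigma2_hat(lam)
    su2 = s2 * lam**2 / (1.0 + lam**2)                    # sigma_u^2
    sv2 = s2 / (1.0 + lam**2)                             # sigma_v^2
    mu = lam * np.sqrt(s2) * np.sqrt(2.0) / np.sqrt(np.pi * (1.0 + lam**2))
    eps = r - mu
    # JLMS point estimator of inefficiency (eq. 4)
    mu_star = -eps * su2 / (sv2 + su2)
    s_star = np.sqrt(sv2 * su2 / (sv2 + su2))
    u_hat = mu_star + s_star * norm.pdf(mu_star / s_star) / norm.cdf(mu_star / s_star)
    return lam, u_hat
```

Point efficiencies would then typically be reported as $\exp(-\hat{u}_i)$.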

Badunenko, Henderson and Kumbhakar (2012) use a local linear estimator to formulate the log-likelihood in (1).

2.2 In search of a new model

In this paper we are not concerned with DEA. The challenges in stochastic frontier modeling are the following:

i) To estimate the unknown function $f(x)$ in (1). The function must satisfy certain theoretical properties: it must be increasing and concave for all $x \in \mathbb{R}^n_+$.

ii) To dispense with the distributional assumptions on $v_i$ and $u_i$. In this context it would be useful if the distribution of $u_i$ depends flexibly on a given vector of predetermined environmental variables $z_i \in \mathbb{R}^s$. This vector may include $x_i$, but in this case the following point applies more forcefully.

iii.1) To account for the endogeneity of the regressors $x_i$. We should mention that what is endogenous depends on behavioral assumptions. Under cost minimization $x$ is endogenous but $y$ is not. Under profit maximization both are endogenous. Under revenue maximization only $y$ is endogenous.

iii.2) If prices are available, the endogeneity problem can be solved by appending the first-order conditions from the cost minimization, profit maximization or revenue maximization problem. If, as is most common, prices are not available, endogeneity must be addressed in a different way.

3 New models

3.1 Functional form

There is considerable evidence that mixture-of-normals models tend to perform substantially better compared to other alternatives such as the Dirichlet process prior. Geweke and Keane (2007) propose the smoothly mixing regression (SMR):

$$p(y|x) = \sum_{g=1}^{G} \omega_g(x) f_N(\beta_g' x, \sigma_g^2), \quad (5)$$

where $x \in \mathbb{R}^k$ is a multivariate covariate, $f_N(\mu, \sigma^2)$ denotes the density of a $N(\mu, \sigma^2)$ distribution and the weights satisfy $\omega_g(x) \geq 0$, $\sum_{g=1}^{G} \omega_g(x) = 1$. Villani, Kohn and Nott (2012) and Villani, Kohn and Giordani (2009) extend the model as follows:

$$p(y|x) = \sum_{g=1}^{G} \omega_g(x) f_N\!\left(\beta_g' v_g(x), \exp(\delta_g' w_g(x))\right), \quad (6)$$

where $v_g(x)$ and $w_g(x)$ are transformations of the regressors, such as splines. Typically, it is assumed that:

$$\omega_g(x) = \frac{\exp(\gamma_g' x)}{\sum_{s=1}^{G} \exp(\gamma_s' x)}. \quad (7)$$
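As a concrete illustration of (5) and (7), the following minimal sketch evaluates the SMR log-density with softmax weights; the parameter values in the toy example are arbitrary placeholders, not estimates.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

def smr_logpdf(y, X, beta, gamma, sigma):
    """Log of p(y|x) in (5) with softmax weights (7).
    X: (n, k) covariates; beta, gamma: (G, k); sigma: (G,) component std devs."""
    eta = X @ gamma.T                                     # (n, G) weight scores
    logw = eta - logsumexp(eta, axis=1, keepdims=True)    # log omega_g(x)
    mu = X @ beta.T                                       # (n, G) means beta_g'x
    comp = norm.logpdf(y[:, None], loc=mu, scale=sigma[None, :])
    return logsumexp(logw + comp, axis=1)

# toy illustration with arbitrary parameter values
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
beta = rng.normal(size=(3, 3))
gamma = rng.normal(size=(3, 3))
sigma = np.array([0.5, 1.0, 2.0])
y = rng.normal(size=5)
print(smr_logpdf(y, X, beta, gamma, sigma))
```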

Geweke and Petrella (2012) introduce an alternative set of fractional polynomial approximations and show that the set of fractional polynomial approximations is dense on a Sobolev space of functions on a compact set (footnote 4). Moreover, imposing regularity conditions directly on the fractional polynomials produces pseudo-true approximations that converge rapidly to production functions having no exact representation as fractional polynomials. Fractional polynomial approximations (with $k$ terms) are defined as:

$$f(x_1, \ldots, x_n) = \sum_{i=1}^{k} a_i \prod_{l=1}^{n} x_l^{s_{il}}, \quad (8)$$

which is assumed positive on $\mathbb{R}^n_{++}$. Such fractional polynomial approximations can approximate the unknown function and its first two derivatives on a closed compact subset of $\mathbb{R}^n_{++}$. In compact notation fractional polynomials are denoted by:

$$p(x; a) = \sum_{i=1}^{k} a_i x^{j(i)}, \quad (9)$$

where $x = [x_1, \ldots, x_n]'$ and $j(i) = \left[j_1^{(i)}, \ldots, j_n^{(i)}\right]$ is a sequence of multi-indices. Therefore, $x^{j(i)}$ is a Cobb-Douglas-type function whose exponents are given by the multi-index $j(i)$. A twice differentiable function $f(x)$ is called strictly regular on $C \subseteq \mathbb{R}^n_{++}$ if $f(x) > 0$, $\partial f(x)/\partial x > 0$ and $\partial^2 f(x)/\partial x \partial x'$ is negative definite, for all $x \in C$. For strictly regular functions, Geweke and Petrella (2012) have proved that fractional polynomials can approximate to any degree any function and its first two derivatives.

Suppose we have the data $x_t \in \mathbb{R}^n$, $t = 1, \ldots, T$. The fractional approximation can be written as:

$$p(x_t; a) = \sum_{i=1}^{k} a_i \prod_{j=1}^{n} x_{tj}^{s_{ji}}, \quad (10)$$

with $s_{ji} \equiv s_j^{(i)}$. We can represent this function more compactly as:

$$p(x_t; a) = \left[\prod_{l=1}^{n} x_{tl}^{s_{l1}}, \; \prod_{l=1}^{n} x_{tl}^{s_{l2}}, \; \ldots, \; \prod_{l=1}^{n} x_{tl}^{s_{lk}}\right] \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_k \end{bmatrix} = z_t' a. \quad (11)$$

Define the $T \times k$ matrix $Z = [z_1, z_2, \ldots, z_T]'$ so that

$$[p(x_1; a), p(x_2; a), \ldots, p(x_T; a)]' = Za, \quad (12)$$

with the typical element being $z_{ti} = \prod_{l=1}^{n} x_{tl}^{s_{li}}$.

Footnote 4: Fractional polynomials are also used in Sauerbrei and Royston (1999).
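The sketch below evaluates (10) together with the gradient and Hessian given next in (13)-(14), at one data point, from a matrix of exponents $S = [s_{ji}]$; the exponent values in the toy check are only illustrative.

```python
import numpy as np

def fpa_terms(x, S):
    """z_i = prod_l x_l^{s_li} for each of the k terms; S has shape (n, k)."""
    return np.prod(x[:, None] ** S, axis=0)              # (k,)

def fpa_value_grad_hess(x, a, S):
    """p(x;a) = z'a, its gradient W a (eq. 13) and Hessian sum_i a_i C_i (eq. 14)."""
    n, k = S.shape
    z = fpa_terms(x, S)
    W = S * z[None, :] / x[:, None]                       # w_{ij} = s_ji z_i / x_j
    grad = W @ a                                          # eq. (13)
    H = np.zeros((n, n))
    for i in range(k):
        s = S[:, i]
        outer = np.outer(s / x, s / x) * z[i]             # s_ji s_ri z_i / (x_j x_r)
        diag = np.diag(s * z[i] / x**2)                   # delta_jr s_ji z_i / x_j^2
        H += a[i] * (outer - diag)                        # eq. (14)
    return z @ a, grad, H

# toy check with two terms: sqrt(x1 x2) and x1
x = np.array([2.0, 3.0])
S = np.array([[0.5, 1.0],
              [0.5, 0.0]])
a = np.array([1.0, 0.5])
val, grad, H = fpa_value_grad_hess(x, a, S)
print(val, grad, np.linalg.eigvalsh(H))                   # eigenvalues check concavity
```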

It can be shown that the derivatives are given as follows:

$$\partial p(x_t; a)/\partial x_t = W_t a, \quad (13)$$

where $W_t = [w_{tij}]$, $w_{tij} = s_{ji} x_{tj}^{-1} z_{ti}$. The $n \times n$ Hessian matrix is:

$$\partial^2 p(x_t; a)/\partial x_t \partial x_t' = \sum_{i=1}^{k} a_i C_{ti} = (a' \otimes I_n) \begin{bmatrix} C_{t1} \\ \vdots \\ C_{tk} \end{bmatrix}, \quad (14)$$

where $C_{ti} = [c_{tijr}]$, $c_{tijr} = x_{tj}^{-1} x_{tr}^{-1} s_{ji} s_{ri} z_{ti} - \delta_{jr} s_{ji} x_{tj}^{-2} z_{ti}$, and $\delta_{jr}$ represents the Kronecker delta. If we define the $nk \times nT$ matrix $C = [C_{ti}, \, t = 1, \ldots, T, \, i = 1, \ldots, k]$, then we have:

$$\left[\partial^2 p(x_1; a), \ldots, \partial^2 p(x_T; a)\right] = (a' \otimes I_n) C. \quad (15)$$

Positivity and monotonicity correspond to the restrictions:

$$Za > 0, \quad Wa > 0. \quad (16)$$

Strict concavity can be checked using the eigenvalues of the Hessian matrices in (14), which are nonlinear functions of $a$. Geweke and Petrella (2012) advocate checking the conditions at a number of grid points rather than checking or enforcing them at all data points.

With base $b = \tfrac{1}{2}$, a cost function in three input prices, for example, can be represented as

$$a_1 + a_2 x_1 + a_3 x_1^{1/2} x_2^{1/2} + a_4 x_1^{1/2} x_3^{1/2} + a_5 x_1^{1/2} + a_6 x_2 + a_7 x_2^{1/2} x_3^{1/2} + a_8 x_2^{1/2} + a_9 x_3 + a_{10} x_3^{1/2}, \quad (17)$$

when homogeneity is not imposed. Notice that with $b = \tfrac{1}{2}$, when homogeneity is imposed this results in the generalized Leontief functional form. Similarly, polynomials in bases $b = 1, \tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{4}, \ldots$ can be defined. With homogeneity imposed the cost function is:

$$a_1 x_1 + a_2 x_1^{1/2} x_2^{1/2} + a_3 x_1^{1/2} x_3^{1/2} + a_4 x_2 + a_5 x_2^{1/2} x_3^{1/2} + a_6 x_3.$$

In their application to the 158 observations of Christensen and Greene (1976), Geweke and Petrella (2012) impose regularity at 1,000 points randomly generated on a hyper-rectangle bounded below by 90% of the minimum price and bounded above by 110% of the maximum price. A significant result of the study was that base $b = \tfrac{1}{2}$ polynomials provided the best fit assessed by the Bayes factor, so the generalized Leontief functional form is strongly favored by the data (as Bayes factors were always in excess of 5.5).
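A sketch of such a grid-based regularity check follows: positivity, monotonicity and negative-definiteness of the Hessian are verified at random points on a hyper-rectangle spanning 90% of the minimum to 110% of the maximum of the data, in the spirit of Geweke and Petrella (2012). The value/gradient/Hessian callables and the Cobb-Douglas toy function are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def regularity_ok(value, grad, hess, grid, tol=1e-10):
    """Check f>0, positive gradient and a negative-definite Hessian at each grid point."""
    for x in grid:
        if value(x) <= tol or np.any(grad(x) <= tol):
            return False
        if np.max(np.linalg.eigvalsh(hess(x))) >= -tol:
            return False
    return True

def random_grid(X, m, rng):
    """m uniform draws on the hyper-rectangle [0.9*min, 1.1*max] of the data X."""
    lo, hi = 0.9 * X.min(axis=0), 1.1 * X.max(axis=0)
    return lo + (hi - lo) * rng.uniform(size=(m, X.shape[1]))

# toy illustration with a strictly regular Cobb-Douglas f(x) = x1^0.3 x2^0.4
value = lambda x: x[0]**0.3 * x[1]**0.4
grad = lambda x: value(x) * np.array([0.3 / x[0], 0.4 / x[1]])
hess = lambda x: value(x) * (np.outer([0.3 / x[0], 0.4 / x[1]],
                                      [0.3 / x[0], 0.4 / x[1]])
                             - np.diag([0.3 / x[0]**2, 0.4 / x[1]**2]))
rng = np.random.default_rng(1)
grid = random_grid(rng.uniform(1.0, 2.0, size=(200, 2)), 1000, rng)
print(regularity_ok(value, grad, hess, grid))             # True: regular on this grid
```

Inside a posterior sampler, parameter draws that violate the check would simply be rejected.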

3.2 Smoothly mixing regressions and fractional polynomial approximations

The simplicity of fractional polynomials of base $b = \tfrac{1}{2}$ makes them ideal candidates for incorporation into the SMR framework. We can represent the conditional distribution of a dependent variable $y$ given $x \in \mathbb{R}^n$ as:

$$p(y|x) = \sum_{g=1}^{G} \omega_g(x; c) f_N\!\left(v_g(x; a), \exp(w_g(x; b))\right), \quad (18)$$

where

$$v_g(x_t; a) = \sum_{i=1}^{k} a_{gi} \prod_{j=1}^{n} x_{tj}^{s_{ji}}, \quad (19)$$

$$w_g(x_t; b) = \sum_{i=1}^{k} b_{gi} \prod_{j=1}^{n} x_{tj}^{s_{ji}}, \quad (20)$$

assuming that both fractional polynomials have the same number of terms $k$. The parametrization of the variances is made for convenience, so that one does not have to check positivity. The polynomials imply vectors of parameters $a_g = [a_{g1}, \ldots, a_{gk}]'$ and $b_g = [b_{g1}, \ldots, b_{gk}]'$ for each $g = 1, \ldots, G$. Then, $a = [a_1', \ldots, a_G']'$, $b = [b_1', \ldots, b_G']'$ and $c = [c_1', \ldots, c_G']'$.

3.3 Major extension

Given the ability of fractional polynomial expansions to approximate arbitrary regular functions and their derivatives, it seems plausible that they can be generalized to the non-fractional case as follows:

$$p(x_t; \alpha) = \sum_{i=1}^{k} \alpha_i \prod_{j=1}^{n} x_{tj}^{\beta_{ji}}, \quad (21)$$

where the $\alpha_i$ are parameters, as the $a_i$'s are, but the $\beta_{ji}$'s are parameters as well ($\beta_{ji} \geq 0$, $\sum_{j=1}^{n} \beta_{ji} = 1$). The functional form can be written as:

$$p(x_t; \alpha) = \alpha_1 \left(x_{t1}^{\beta_{11}} x_{t2}^{\beta_{21}} \cdots x_{tn}^{\beta_{n1}}\right) + \alpha_2 \left(x_{t1}^{\beta_{12}} x_{t2}^{\beta_{22}} \cdots x_{tn}^{\beta_{n2}}\right) + \cdots + \alpha_k \left(x_{t1}^{\beta_{1k}} x_{t2}^{\beta_{2k}} \cdots x_{tn}^{\beta_{nk}}\right). \quad (22)$$

For identification we can impose the restrictions:

$$\alpha_1 \geq \alpha_2 \geq \ldots \geq \alpha_k. \quad (23)$$

When the $\beta_{ji}$'s are parameters we can economize on the order of a given fractional polynomial expansion.
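A minimal sketch of this additive Cobb-Douglas (ACD) form and of the parameter restrictions discussed in (23)-(24) and the next paragraph, read here as $\alpha_i > 0$ summing to one, a non-increasing ordering of the $\alpha_i$, and exponents $\beta_{ji} \in [0, 1)$ with unit column sums; the numbers below are illustrative placeholders.

```python
import numpy as np

def acd_value(x, alpha, beta):
    """Additive Cobb-Douglas form (21): sum_i alpha_i * prod_j x_j^{beta_ji}.
    beta has shape (n, k), one column of exponents per term."""
    return np.sum(alpha * np.prod(x[:, None] ** beta, axis=0))

def acd_restrictions_ok(alpha, beta, tol=1e-10):
    """Positivity/homogeneity (eq. 24), identification ordering (eq. 23)
    and exponent restrictions on beta, as read from the text."""
    ok_alpha = np.all(alpha > tol) and abs(alpha.sum() - 1.0) < 1e-8
    ok_order = np.all(np.diff(alpha) <= tol)              # alpha_1 >= ... >= alpha_k
    ok_beta = (np.all(beta >= 0) and np.all(beta < 1)
               and np.allclose(beta.sum(axis=0), 1.0))
    return ok_alpha and ok_order and ok_beta

alpha = np.array([0.6, 0.4])
beta = np.array([[0.7, 0.2],
                 [0.3, 0.8]])                             # columns sum to one
x = np.array([1.5, 2.0])
print(acd_value(x, alpha, beta), acd_restrictions_ok(alpha, beta))
```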

Homogeneity of degree one can be imposed easily using:

$$\sum_{i=1}^{k} \alpha_i = 1. \quad (24)$$

Positivity reduces to $\alpha_i > 0$, $i = 1, \ldots, k$, and the curvature restrictions can be imposed using the methodology of the previous section. An important exercise is to test whether the proposed functional form outperforms a given fractional polynomial. It is, indeed, possible to do so when the number of terms $k$ increases. At the same time the proposed functional form automatically satisfies the positivity, homogeneity and positive-derivative properties provided we impose the restrictions $\beta_{ji} \geq 0$, $j = 1, \ldots, n$, $i = 1, \ldots, k$. These restrictions are independent of the data, so it remains to impose only the curvature restrictions. These restrictions, however, are automatically satisfied provided $\beta_{ji} < 1$, $j = 1, \ldots, n$, $i = 1, \ldots, k$, since the functional form is a sum of Cobb-Douglas-type functions. Embedded into the SMR framework, effectively we can generalize the conditional distribution of $y$ given $x$ to a semiparametric form that is capable of approximating arbitrary distributions as the number of groups, $G$, increases. The SMR allows for arbitrary heteroskedasticity, skewness and kurtosis without sacrificing simplicity, as we deal with a mixture-of-normals distribution.

3.4 Technical inefficiency

We can allow for technical inefficiency by assuming:

$$y_t = \sum_{i=1}^{k} \alpha_i \prod_{j=1}^{n} x_{tj}^{\beta_{ji}} + \varepsilon_t + u_t, \quad (25)$$

where $u_t \geq 0$ is the one-sided error term representing technical inefficiency. Flexible distributions for the two error components are as follows:

$$p(\varepsilon_t | x_t) = \sum_{g=1}^{G_\varepsilon} \omega_g^{(\varepsilon)}(x_t) f_N\!\left(0, \exp(w_g^{(\varepsilon)}(x_t))\right), \quad (26)$$

and, independently,

$$p(u_t | x_t) = \sum_{g=1}^{G_u} \omega_g^{(u)}(x_t) f_{N_+}\!\left(0, \exp(w_g^{(u)}(x_t))\right). \quad (27)$$

Here (footnote 5), $\omega_g^{(\varepsilon)}$ and $\omega_g^{(u)}$ represent fractional or Cobb-Douglas-type approximations ($g = 1, \ldots, G_\varepsilon$ or $G_u$). We should notice that standard assumptions in the SFM literature, responsible for its inflexibility, such as independence or orthogonality between the error components and the regressors, are removed here. Here, $f_{N_+}(\mu, \sigma^2)$ denotes the density of the half-normal distribution.

Footnote 5: Clearly, flexibility in modeling the distributions of the two error components also accounts for endogeneity, which is an additional advantage of smoothly mixing regressions.
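As an illustration of (26)-(27), the sketch below evaluates a mixture density for an error component with covariate-dependent weights and variances. For brevity the weight scores and log-variances are taken linear in $x$ (the paper uses fractional-polynomial or ACD terms there), and $\exp(w_g(x))$ is read as a variance; all parameter values are placeholders.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import halfnorm, norm

def flexible_error_logpdf(e, X, gamma, delta, one_sided=False):
    """Log density of the mixture in (26) (two-sided) or (27) (one-sided):
    softmax weights in x and component variances exp(x'delta_g)."""
    eta = X @ gamma.T                                     # (n, G) weight scores
    logw = eta - logsumexp(eta, axis=1, keepdims=True)
    sd = np.exp(0.5 * (X @ delta.T))                      # exp(w_g(x)) read as a variance
    family = halfnorm if one_sided else norm
    comp = family.logpdf(e[:, None], scale=sd)
    return logsumexp(logw + comp, axis=1)

# toy illustration with arbitrary parameters
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(4), rng.uniform(1, 2, size=(4, 2))])
gamma = rng.normal(scale=0.3, size=(2, 3))
delta = rng.normal(scale=0.3, size=(2, 3))
print(flexible_error_logpdf(np.abs(rng.normal(size=4)), X, gamma, delta, one_sided=True))
```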

The assumption can be generalized to the case of the truncated normal distribution:

$$p(u_t | x_t) = \sum_{g=1}^{G_u} \omega_{u,g}(x_t) f_{N_+}\!\left(v_g^{(u)}(x_t), \exp(w_g^{(u)}(x_t))\right). \quad (28)$$

There is an alternative to (21) which is, perhaps, more faithful to the SMR concept. The alternative is to assume:

$$y_t \,|\, x_t, u_t, I_t = g \;\sim\; N\!\left(v_g^{(\varepsilon)}(x_t) + u_t, \exp(w_g^{(\varepsilon)}(x_t))\right), \quad (29)$$

where $I_t$ is an index that represents the group to which the observation belongs and

$$p(I_t = g | x_t) = \omega_g(x) = \frac{\exp(\gamma_g' x)}{\sum_{s=1}^{G} \exp(\gamma_s' x)}, \quad g = 1, \ldots, G. \quad (30)$$

The marginal distribution of $u_t$ is given by (27). In (29), $v_g^{(\varepsilon)}(x_t)$ is a real or fractional polynomial expansion which is group-specific, whereas in (21) a single polynomial expansion is used to approximate the functional form. The distributional assumptions are relaxed via (29) and (30).

4 Panel data and Endogeneity

4.1 Panel data

With panel data, unobserved heterogeneity can be introduced via the assumption:

$$y_{it} \,|\, x_{it}, u_{it}, \lambda_i, I_{it} = g \;\sim\; N\!\left(v_g^{(\varepsilon)}(x_{it}) + \lambda_i + u_{it}, \exp(w_g^{(\varepsilon)}(x_{it}))\right), \quad (31)$$

($i = 1, \ldots, N$, $t = 1, \ldots, T$), where $\lambda_i$ denotes the individual effects, or

$$y_{it} = v_g^{(\varepsilon)}(x_{it}) + \lambda_i + u_{it} + \varepsilon_{it}, \quad (32)$$

conditional on $x_{it}$, $\lambda_i$, $u_{it}$ and $I_{it} = g \in \{1, \ldots, G\}$, along with:

$$\varepsilon_{it} \,|\, x_{it} \sim N\!\left(0, \exp(w_g^{(\varepsilon)}(x_{it}))\right). \quad (33)$$

Modeling the individual effects relies on a substantial generalization of Mundlak's (1978) device:

$$p(\lambda_i) = \sum_{g=1}^{G_\lambda} \omega_{u,g}(\bar{x}_i) f_{N_+}\!\left(0, \exp(w_g^{(\lambda)}(\bar{x}_i))\right), \quad (34)$$

where $\bar{x}_i = T^{-1} \sum_{t=1}^{T} x_{it}$ represents the average value of the covariates and $\exp(w_g^{(\lambda)}(\bar{x}_i))$ is a (flexible) function representing the log of their variance.

In this formulation it is not necessary to assume independence between the individual effects and the regressors.

4.2 Endogeneity

Often the assumption that the regressors $x_{it}$ satisfy a strong exogeneity assumption cannot be maintained. We have used this assumption previously, as flexibility in modeling the distributions of the two error components also accounts for endogeneity, which is an additional advantage of SMR. Suppose an $M \times 1$ vector $d_{it}$ may be available which can be considered exogenous (for example, a time trend). Endogeneity can be handled, alternatively, by assuming a panel vector autoregression (PVAR) of the form:

$$x_{it} = a_i + B x_{i,t-1} + \Gamma d_{it} + \xi_{it}, \quad (35)$$

where $\xi_{it} \sim N(0, \Sigma_\xi)$, $B$ and $\Gamma$ are matrices of dimensions $n \times n$ and $n \times M$ respectively, and the individual effects

$$a_i \sim N_n(0, \Sigma_a), \quad (36)$$

independently of $\xi_{it}$. The assumption can be generalized to an SMR specification for the $a_i$'s as functions of $\bar{x}_i$ and, possibly, $\bar{d}_i = T^{-1} \sum_{t=1}^{T} d_{it}$. It is possible to extend the PVAR in (35) to an SMR context, but this extension does not seem very promising. The important issue is to allow for correlation between $\xi_{it}$ and $\varepsilon_{it}$ in (32). This can be done by assuming that:

$$\begin{bmatrix} \varepsilon_{it} \\ \xi_{it} \end{bmatrix} \sim N_{n+1}(0, \Sigma_{it}). \quad (37)$$

The dependence between $\varepsilon_{it}$ and $\xi_{it}$ can be used to model endogeneity as an alternative to SMR, see (29)-(30). The assumption has been introduced by Kutlu (2010) and Tran and Tsionas (2013), although with a fixed $\Sigma_{it}$. For the elements of $\Sigma_{it}$ we assume that $\mathrm{var}(\varepsilon_{it})$ is given by (33), the diagonal elements of $\mathrm{cov}(\xi_{it})$ are fixed parameters or depend on $x_{i,t-1}$ and possibly $d_{it}$ via flexible functional forms, and the non-diagonal elements are fixed. Alternatively, consider the Cholesky factorization:

$$\Sigma_{it} = H_{it} H_{it}', \quad (38)$$

where $H_{it}$ is a lower triangular $n \times n$ matrix. Denote $h_{it} = \mathrm{vech}(H_{it}) = [h_{it,1}, \ldots, h_{it,m}]'$, the vector of lower-triangular elements, which has dimension $m = \tfrac{n(n+1)}{2}$. We assume

$$h_{it,j} = w^{(j)}(x_{i,t-1}, d_{it}), \quad j = 1, \ldots, m, \quad (39)$$

where $w^{(j)}(\cdot)$ represents a flexible functional form.
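A sketch of (38)-(39) follows: each element of the Cholesky factor $H_{it}$ is a flexible function of $(x_{i,t-1}, d_{it})$, taken to be linear here for illustration. Exponentiating the diagonal of $H_{it}$ is one simple device, an assumption of this sketch rather than necessarily the paper's choice, that keeps $\Sigma_{it} = H_{it} H_{it}'$ positive definite.

```python
import numpy as np

def sigma_from_features(phi, feats):
    """Build Sigma_it = H H' with lower-triangular elements h_j = w^(j)(features);
    here w^(j) is linear in the feature vector. phi: (m, q), m = n(n+1)/2."""
    h = phi @ feats                                       # h_{it,j}, eq. (39)
    m = h.shape[0]
    n = int((np.sqrt(8 * m + 1) - 1) / 2)                 # recover n from m = n(n+1)/2
    H = np.zeros((n, n))
    H[np.tril_indices(n)] = h
    # exponentiate the diagonal (sketch-specific device) so Sigma is positive definite
    H[np.diag_indices(n)] = np.exp(np.diag(H))
    return H @ H.T                                        # eq. (38)

rng = np.random.default_rng(3)
n = 3
phi = rng.normal(scale=0.2, size=(n * (n + 1) // 2, 4))
feats = np.concatenate([[1.0], rng.uniform(1, 2, size=3)])   # e.g. [1, x_{i,t-1}]
Sigma_it = sigma_from_features(phi, feats)
print(np.linalg.eigvalsh(Sigma_it))                       # all eigenvalues positive
```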

The parameters of the specification are unrestricted and allow for a quite flexible specification of the joint distribution of $\varepsilon_{it}$ and $\xi_{it}$. The assumption can be further generalized as follows:

$$p(\varepsilon_{it}, \xi_{it}) \equiv p(\zeta_{it}) = \sum_{g=1}^{G_\zeta} \omega^{(g)}(x_{it}) f_{N, n+1}\!\left(0, \Sigma_{it}^{(g)}\right), \quad (40)$$

where $f_{N, n+1}(0, \Sigma_{it})$ denotes the $(n+1)$-variate normal distribution with mean zero and covariance matrix $\Sigma_{it}$, whose specification is given above. As in SMR, the weights $\omega^{(g)}$ depend on the regressors $x_{it}$. Due to the explicit parametrization of $h_{it}$ in (39), the use of a multivariate normal mixture for the joint multivariate distribution is not as demanding as when $\Sigma_{it}^{(g)}$ is left unrestricted and, therefore, it can be used even when the number of variables ($n$) is relatively large.

4.3 Endogeneity through copulas

In this section suppose $\Phi(z)$ denotes the standard normal cdf and $\Phi_{k+1, \Sigma}(z)$ denotes the cdf of a $(k+1)$-dimensional normal distribution with correlation matrix $\Sigma$. Given the marginal distributions $F_\varepsilon(\varepsilon_{it})$ and $F_j(x_{j,it})$, $j = 1, \ldots, k$, and the densities $f_\varepsilon(\varepsilon_{it})$ and $f_j(x_{j,it})$, $j = 1, \ldots, k$, the joint distribution of $\varepsilon_{it}$ and $x_{it}$ can be represented by a copula function $C(F_\varepsilon, \mathbf{F})$, where $\mathbf{F}(x) = [F_1(x_1), \ldots, F_k(x_k)]$, $x \in \mathbb{R}^k$. The copula is defined on $[0,1]^{k+1}$ by:

$$C(\xi_1, \ldots, \xi_{k+1}) = P\left(F_\varepsilon(\varepsilon_{it}) \leq \xi_1, F_1(x_{1,it}) \leq \xi_2, \ldots, F_k(x_{k,it}) \leq \xi_{k+1}\right), \quad (41)$$

so that the copula is itself a cdf. Moreover, $U_j = F_j(x_j)$ and $U_\varepsilon = F_\varepsilon(\varepsilon)$ have uniform distributions. If $c(\xi_1, \ldots, \xi_{k+1})$ denotes the pdf associated with the copula, then by Sklar's (1959) theorem, we have:

$$f(\varepsilon, x_1, \ldots, x_k) = c(\varepsilon, x_1, \ldots, x_k) \, f_\varepsilon(\varepsilon) \prod_{j=1}^{k} f_j(x_j). \quad (42)$$

One commonly used copula function is the Gaussian copula (footnote 6). The $(k+1)$-dimensional cdf with correlation matrix $\Sigma$ is given by $C(w; \Sigma) = \Phi_{k+1, \Sigma}\!\left(\Phi^{-1}(U_\varepsilon), \Phi^{-1}(U_1), \ldots, \Phi^{-1}(U_k)\right)$, where $w := (U_\varepsilon, U_1, \ldots, U_k) = (F_\varepsilon(\varepsilon), F_1(x_1), \ldots, F_k(x_k))$. The copula density is:

$$c(w; \Sigma) = \det(\Sigma)^{-1/2} \exp\left\{ -\tfrac{1}{2} \left[\Phi^{-1}(U_\varepsilon), \Phi^{-1}(U_1), \ldots, \Phi^{-1}(U_k)\right] \left(\Sigma^{-1} - I_{k+1}\right) \left[\Phi^{-1}(U_\varepsilon), \Phi^{-1}(U_1), \ldots, \Phi^{-1}(U_k)\right]' \right\}, \quad (43)$$

and the joint distribution is given by

$$\log f(\varepsilon, x_1, \ldots, x_k) = \log c(w; \Sigma) + \log f_\varepsilon(\varepsilon) + \sum_{j=1}^{k} \log f_j(x_j). \quad (44)$$

Footnote 6: Relative to other copulas, the Gaussian copula is generally robust for most applications and has many desirable properties (Danaher and Smith, 2011).
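A minimal sketch of the Gaussian copula density (43) follows, using the rescaled empirical CDF introduced below in (45) for the marginals of the regressors. In the toy example the residual marginal is also handled by the empirical CDF purely to keep the sketch self-contained, whereas the paper obtains $F_\varepsilon$ by quadrature of the parametric density; all names and values are illustrative.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def empirical_cdf(col):
    """Empirical CDF as in (45): ranks scaled by (n+1) to keep arguments away from 1."""
    ranks = np.argsort(np.argsort(col)) + 1               # 1, ..., n
    return ranks / (len(col) + 1.0)

def gaussian_copula_logpdf(U, R):
    """Log Gaussian copula density (43) at uniforms U (one row per observation)."""
    Z = norm.ppf(U)                                        # Phi^{-1}(U)
    mvn = multivariate_normal(mean=np.zeros(R.shape[0]), cov=R)
    # log c = log phi_R(z) - sum_j log phi(z_j), equivalent to (43)
    return mvn.logpdf(Z) - norm.logpdf(Z).sum(axis=1)

# toy illustration: residuals and two regressors, equicorrelated copula
rng = np.random.default_rng(4)
eps = rng.normal(size=200)                                 # stand-in for y - f(x; beta)
X = rng.uniform(1, 2, size=(200, 2))
U = np.column_stack([empirical_cdf(eps)] + [empirical_cdf(X[:, j]) for j in range(2)])
R = np.full((3, 3), 0.3)
np.fill_diagonal(R, 1.0)
print(gaussian_copula_logpdf(U, R).sum())                  # copula part of (46)
```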

Clearly, it is easy to formulate and evaluate (44), as $c(w; \Sigma)$ can be computed easily and the marginals $f_\varepsilon(\varepsilon)$ and $f_1(x_1), \ldots, f_k(x_k)$ are assumed available. Regarding the marginal density $f_\varepsilon(\varepsilon)$, enough has been written in the previous sections. For the marginals of the regressors, we approximate them using:

$$F_j(x_j) = \frac{1}{n+1} \sum_{i=1}^{n} I(x_{j,it} \leq x_j), \quad j = 1, \ldots, k, \quad (45)$$

and we use the scaling factor $n+1$ to avoid difficulties arising from the potential unboundedness of $\log c(w; \Sigma)$ as some of the arguments tend to one. Additionally, as $\sum_{j=1}^{k} \log f_j(x_j)$ does not depend on the parameters, it can be omitted from (44). To obtain $F_\varepsilon(\varepsilon) = \int_{-\infty}^{\varepsilon} f_\varepsilon(e)\,de$ we, generally, need to use univariate numerical integration through, for example, Gaussian quadrature. We use 40-point quadrature using the Gauss-Kronrod rule as implemented in IMSL. Based on (44) we can define the log-likelihood function:

$$\log f(\varepsilon, x_1, \ldots, x_k) = \sum_{i,t} \log c(w_{it}; \Sigma) + \sum_{i,t} \log f_\varepsilon(\varepsilon_{it}), \quad (46)$$

ignoring a constant term, where $\varepsilon_{it} := y_{it} - f(x_{it}; \beta)$ and $w_{it} := (F_\varepsilon(\varepsilon_{it}), F_1(x_{1,it}), \ldots, F_k(x_{k,it}))$. The log-likelihood can be maximized with respect to $\theta = [\beta, \sigma, \lambda]$ and the different elements of $\Sigma$.

5 Computational experiment I

In this computational experiment we consider artificial data with $N = 1{,}500$ and $N = 5{,}000$ observations with $T = 10$ time periods (and therefore $n = 150$ or $n = 500$). Of course, $N = nT$. The sample sizes are typical of what is used in applied econometrics. We have three relative input prices and three outputs, and a cost function is assumed along with share equations derived from Shephard's lemma. The cost function is of the form $C = C(w_1, w_2, w_3; y_1, y_2, y_3, t)$. We consider two basic data generating processes: (i) a mixture of Translog models with $C$ components; (ii) a mixture of Quadratic models with $C$ components. All models satisfy the theoretical regularity conditions. For the generalized Leontief this is straightforward, while for the translog models we follow the procedure set forth by Perelman and Santin (undated). We vary $C$ from 1 to 5, so effectively we have ten data generating processes in cases (i) and (ii). Mixture models are generated so that each component has equal probability. We set $\sigma_v = \sigma_u = 0.3$, which is also typical. Individual effects are included in all models and are generated from a normal distribution with zero mean and standard deviation [...].

Our interest focuses on estimating (i) technical change, (ii) technical inefficiency, (iii) input elasticities and returns to scale. These measures depend on the parametrizations used to make the functional forms regular for each observation in the samples. To determine the best model when using a fractional polynomial approximation (FPA) or an approximation using the additive Cobb-Douglas (ACD) specification, we use the modified log predictive scores methodology. The cross-validated log score (Gelfand et al., 1992) is:

$$\sum_{i=1}^{N} \log p(y_i \mid Y_{-i}) = \sum_{i=1}^{N} \log\left( S^{-1} \sum_{s=1}^{S} p(y_i \mid \theta^{(s)}, Y_{-i}) \right). \quad (47)$$

The modified cross-validated log score (Geweke and Keane, 2007) relies on keeping the first $N_1 < N$ observations for estimation and keeping the remaining for cross-validation. This is repeated $R$ times and we take the mean:

$$R^{-1} \sum_{r=1}^{R} \sum_{i=N_1+1}^{N} \log p\!\left(y_i^{(r)} \,\big|\, Y_{N_1}^{(r)}\right), \quad (48)$$

where $Y^{(r)}$ is a random selection from $Y$ (the data), and averaging over the draws $\theta^{(s)}$ is also performed to compute the score:

$$R^{-1} \sum_{r=1}^{R} \sum_{i=N_1+1}^{N} \log p\!\left(y_i^{(r)} \,\big|\, Y_{N_1}^{(r)}\right) = R^{-1} \sum_{r=1}^{R} \sum_{i=N_1+1}^{N} \log\left( S^{-1} \sum_{s=1}^{S} p\!\left(y_i^{(r)} \,\big|\, Y_{N_1}^{(r)}, \theta^{(s)}\right) \right). \quad (49)$$

The difference is that the Gelfand et al. (1992) variant is more computationally demanding, as $N$ posterior simulators are needed, while the Geweke and Keane (2007) variant uses only $R \ll N$ posterior MCMC simulators, so it can be computationally more efficient. Models with a higher log predictive score perform better in terms of cross-validated prediction. In this work we take $R = 50$, and $N_1 = 1{,}000$ when $N = 1{,}500$ and $N_1 = 4{,}500$ when $N = 5{,}000$, so that 500 observations are always left out for cross-validation.

Posterior simulation via MCMC is performed, for both FPA and ACD, using a variant of Sequential Monte Carlo as described in the Appendix. All posterior simulators rely on a transient phase of 50,000 simulations followed by another 100,000 draws. Convergence is monitored using Geweke's (1992) convergence diagnostics, and numerical standard errors (NSE) as well as relative numerical efficiency (RNE) are computed as well, based on AR(10) approximations to compute the spectral density of the process at the origin. Latent technical efficiency is explicitly integrated out of the posterior to reduce the amount of autocorrelation inherent in MCMC computations.

The theoretical restrictions are imposed on a grid rather than on each observation. Since we have three relative prices and three outputs, following Geweke and Petrella (2012) we choose a random grid of size $aN$ where $a = \tfrac{1}{2}$, 1 or 2. When, for example, $N = 5{,}000$ this means that the theoretical restrictions are imposed at 2,500, 5,000 or 10,000 points using rejection. In all cases the results were qualitatively and quantitatively the same when $a = 1$ or $a = 2$, implying that the imposition of restrictions at $N/2$ points was more than adequate.
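A sketch of the modified cross-validated log predictive score, reading (48)-(49) as the log of the posterior-averaged predictive density summed over the held-out observations and averaged over the $R$ random splits; the arrays in the toy example are fabricated placeholders standing in for $\log p(y_i \mid Y_{N_1}, \theta^{(s)})$.

```python
import numpy as np
from scipy.special import logsumexp

def modified_lps(logp_draws):
    """Modified cross-validated log predictive score (48)-(49).
    logp_draws: list of R arrays, each of shape (S, N_out), holding
    log p(y_i | Y_train, theta^(s)) for one random train/test split."""
    scores = []
    for lp in logp_draws:
        S = lp.shape[0]
        # log of the posterior-averaged predictive density, per held-out observation
        per_obs = logsumexp(lp, axis=0) - np.log(S)
        scores.append(per_obs.sum())
    return np.mean(scores)                                 # average over the R splits

# toy illustration: R = 3 splits, S = 100 draws, 500 held-out observations
rng = np.random.default_rng(5)
fake = [rng.normal(-1.2, 0.1, size=(100, 500)) for _ in range(3)]
print(modified_lps(fake))
```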

Relative to Geweke and Petrella (2012) we added the following check to facilitate the imposition of theoretical restrictions: the restrictions are imposed at the point of means and then also at points on a ball whose radius was $r = \tfrac{1}{2}$ and $r = \tfrac{3}{2}$ relative to the mean. We count the number of violations in each specification and, if it exceeds $\tfrac{1}{3}$ of the sample size ($N$), we add more points. We ended up with a fixed specification where the theoretical restrictions are exactly imposed at points with radius $r = 1$ (the mean) and also $r = \pm \tfrac{3}{k}$ for $k = 2, 4, 5$ and $r = \pm \tfrac{1}{k}$ for $k = 2, 3, 4, 5$, plus the 90% and 110% minimum and maximum of the data: the random grid was set up so that 90% and 110% of the minimum and maximum, respectively, of the data are included in the random grid. We use standard uniform random numbers to construct the grid instead of quasi-random numbers as in Geweke and Petrella (2012). We did not experiment with quasi-random numbers as the algorithms were quite successful after imposing the constraints at points around the means as described above. To facilitate the imposition of theoretical restrictions, the data are divided by their means so that in logs they have zero mean (in the translog) or unit mean (in the Quadratic specification).

As a basis of comparison we take the simple translog specification when $N = 1{,}500$. The results for $N = 5{,}000$ are quite similar qualitatively and are not reported in the interest of space.

Table 1. Model Comparison. Columns: Fractional Polynomial Approximation ($k = 1, \ldots, 5$) and Additive Cobb-Douglas ($k = 1, \ldots, 6$). Rows: Translog; Mixture of Translog, $C = 2, \ldots, 5$; Quadratic; Mixture of Quadratic, $C = 2, \ldots, 5$. [Table entries not available.]

Notes to Table 1: Although we have panel data, we do not determine the sample of size $N_1$ by retaining the firm structure. Therefore, we randomize as if the data were iid. We feel this provides a more stringent cross-validation comparison. Otherwise, we would have to select a sample of size $n_1 < n$ which retains all $T$ observations for a particular firm. This could bias the results for technical change reported below in Table 2. To make the comparison easier, we normalize the log predictive score to 0 and take deviations of the other LPS values from the LPS at zero.

From Table 1 it turns out that low-order fractional polynomials but also low-order ACD models perform quite well in terms of cross-validation.

Perhaps surprisingly, the ACD model does not need more than three components to approximate complicated DGPs like the Translog or Quadratic specification with a large number of mixing components. This shows that the ACD can be a valid competitor to the more complicated Fractional Polynomial Approximation. Of course, the critical issue is whether functions of interest like technical change and technical inefficiency can be approximated equally well by the FPA and the ACD approximations. For the best models, selected using the modified cross-validated LPS of Geweke and Keane (2007), relevant information is presented in Table 2, where we examine the performance of both FPA and ACD.

Table 2. Model Comparison in terms of functions of interest (rank correlation coefficients, median across all MCMC draws finally retained, skipping every other 10th draw). Columns: Technical change; Inefficiency; $\varepsilon_{w_1}$; $\varepsilon_{w_2}$; $\varepsilon_{w_3}$; $\varepsilon_{y_1}$; $\varepsilon_{y_2}$; $\varepsilon_{y_3}$. Rows: Translog; Mixture of Translog, $C = 2, \ldots, 5$; Quadratic; Mixture of Quadratic, $C = 2, \ldots, 5$. [Table entries not available.]

Notes to Table 2: $\varepsilon_{w_j} = \frac{\partial C}{\partial w_j}\frac{w_j}{C}$ and $\varepsilon_{y_j} = \frac{\partial C}{\partial y_j}\frac{y_j}{C}$. For each MCMC draw these are series which can be computed for both the FPA and ACD approximations. The rank correlation is then computed, monitored for each draw (omitting every other tenth) and then the medians are reported. A measure of inverse returns to scale is $e_{cy} = \sum_j \varepsilon_{y_j}$, that is $RTS = e_{cy}^{-1}$. Technical change is $TC = -\frac{1}{C}\frac{\partial C}{\partial t}$. Technical inefficiency for each draw is computed from its posterior conditional distribution after the draws for the regression parameters and scale parameters become available, using well-known expressions. Averaging across parameter draws produces technical inefficiency estimates which are then compared for the FPA and ACD approximations using rank correlation coefficients.

From Table 2, it is evident that estimated functions of interest are quite close in terms of rank correlation coefficients and, therefore, the good behavior of ACD reported in Table 1 produces estimated functions of interest that are quite close to those corresponding to the FPA. Given the hardness of imposing the theoretical restrictions in FPA, the excellent behavior of ACD should, undoubtedly, encourage its use in applied econometrics.

An interesting question is whether Finite-Mixture-of-Normals Models (FMNM) can also be used profitably as approximations. FMNM based on Cobb-Douglas regression models with different scale parameters can also approximate, in theory, arbitrary functional forms (Norets and Pelenis, 2011).

Although Norets and Pelenis (2011) employ a more general approach to extract conditional distributions from a general distribution provided by a FMNM, a simpler approach is to use FMNM of stochastic frontier models in the usual sense often used in the literature.

Table 3. Results from FMNM (rank correlation coefficients, median across all MCMC draws finally retained, skipping every other 10th draw). Columns: Technical change; Inefficiency; Optimal order of FMNM. Rows: Translog; Mixture of Translog, $C = 2, \ldots, 5$; Quadratic; Mixture of Quadratic, $C = 2, \ldots, 5$. [Table entries not available.]

Notes to Table 3: Reported results are for sample size $N = 5{,}000$. Results for $N = 1{,}500$ were qualitatively similar.

From Table 3 it turns out that the approximation properties of FMNM are worse when compared to ACD and FPA. We have computed, but do not report in the interest of space, rank correlations between true and estimated elasticities and returns to scale (RTS), obtaining a similar result. The rank correlations between true and estimated elasticities and RTS range from, approximately, 0.3 to [...].

6 Computational experiments II

In this section we consider the computational experiments in Badunenko, Henderson and Kumbhakar (2012). They consider two simple production functions: (1) Cobb-Douglas (CD), $y = x_1^{\alpha} x_2^{1-\alpha} \exp(v - u)$, and (2) a constant elasticity of substitution (CES), $y = \left[\beta x_1^{\rho} + (1-\beta) x_2^{\rho}\right]^{1/\rho} \exp(v - u)$. They set $\alpha = \tfrac{1}{3}$, $\beta = \tfrac{2}{3}$ and $\rho = \tfrac{1}{2}$. For the error term we assume $v \sim N(0, \sigma_v^2)$ and for inefficiency we have $u \sim N_+(0, \sigma_u^2)$. Three sample sizes are analyzed (the total number of observations is 50, 100 and 200). To generate $x_1$ and $x_2$ they assume that they are uniformly drawn in the interval $[1, 2]$, iid and independently of each other.
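A sketch of these two data-generating processes (inputs uniform on $[1, 2]$, half-normal inefficiency, and the scenario values of $\sigma_v$ and $\sigma_u$ listed after this sketch); the function and variable names are illustrative.

```python
import numpy as np

def simulate_dgp(n, sigma_v, sigma_u, technology="cd",
                 alpha=1/3, beta=2/3, rho=1/2, seed=0):
    """One Monte Carlo sample: inputs U[1,2], v ~ N(0, s_v^2), u half-normal,
    output from a Cobb-Douglas or CES frontier, exactly as described above."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.uniform(1, 2, size=(2, n))
    v = rng.normal(0.0, sigma_v, size=n)
    u = np.abs(rng.normal(0.0, sigma_u, size=n))          # half-normal inefficiency
    if technology == "cd":
        frontier = x1**alpha * x2**(1.0 - alpha)
    else:                                                 # CES
        frontier = (beta * x1**rho + (1.0 - beta) * x2**rho) ** (1.0 / rho)
    y = frontier * np.exp(v - u)
    return x1, x2, y, np.exp(-u)                          # last entry: true TE_i

# e.g. scenario S2 below: sigma_v = 0.01, sigma_u = 0.05, lambda = 5
x1, x2, y, te = simulate_dgp(100, 0.01, 0.05, technology="ces", seed=42)
print(te[:5])
```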

With respect to the parameters of noise and inefficiency they have four scenarios:

In scenario S1 ($\sigma_v = \sigma_u = 0.01$, $\lambda = 1.0$), both terms are relatively small. In other words, the data are measured with relatively little error and the units are relatively efficient.

In scenario S2 ($\sigma_v = 0.01$, $\sigma_u = 0.05$, $\lambda = 5.0$), the data have relatively little noise, but the units under consideration are relatively inefficient.

In scenario S3 ($\sigma_v = 0.05$, $\sigma_u = 0.01$, $\lambda = 0.2$), the data are relatively noisy and the firms are relatively efficient.

In scenario S4 ($\sigma_v = 0.05$, $\sigma_u = 0.05$, $\lambda = 1.0$), the data are relatively noisy and the firms are relatively inefficient. This scenario is actually redundant as the results mostly depend on $\lambda$ and not the individual values of $\sigma_v$ and $\sigma_u$. However, the case is interesting since we can examine whether the results depend on the actual value of $\lambda$ and/or the magnitude of noise and inefficiency.

All experiments consist of 2000 Monte Carlo trials. Data sets where the residuals have the wrong (positive) skewness are discarded. We compare our results only with the frontier estimator (FLW), as the DEA estimator (KSW) has been found to have disappointing performance under noise (even moderate) in the study of Badunenko, Henderson and Kumbhakar (2012). We use the following metrics, as in Badunenko, Henderson and Kumbhakar (2012):

$$\text{Bias} = n^{-1} \sum_{i=1}^{n} \left(\widehat{TE}_i - TE_i\right),$$

$$\text{RMSE} = \left\{ n^{-1} \sum_{i=1}^{n} \left(\widehat{TE}_i - TE_i\right)^2 \right\}^{1/2},$$

$$\text{Upward Bias} = n^{-1} \sum_{i=1}^{n} I\!\left(\widehat{TE}_i > TE_i\right),$$

$$\text{Kendall's } \tau = \frac{n_c - n_d}{\tfrac{1}{2}\, n (n-1)},$$

where $\widehat{TE}_i$ is estimated technical efficiency, $TE_i = \exp(-u_i)$ is actual technical efficiency, and $n_c$, $n_d$ represent the number of concordant pairs and the number of discordant pairs in the data set (efficiency ranks), respectively.

From Tables 4 and 5 it turns out, in the cases of Cobb-Douglas and CES, that the methods developed in this study improve drastically over the procedures in FLW. Coverage is much better in many cases, biases and RMSEs are significantly lower, the upward bias is mitigated and the correlations between actual and predicted efficiency are much higher.
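The four metrics just defined are straightforward to compute; a minimal sketch follows, with simulated placeholder efficiencies standing in for the Monte Carlo output.

```python
import numpy as np
from scipy.stats import kendalltau

def efficiency_metrics(te_hat, te_true):
    """Bias, RMSE, upward bias and Kendall's tau, as defined above."""
    diff = te_hat - te_true
    bias = diff.mean()
    rmse = np.sqrt(np.mean(diff**2))
    upward = np.mean(te_hat > te_true)                     # share strictly above the truth
    tau, _ = kendalltau(te_hat, te_true)
    return bias, rmse, upward, tau

# illustration with noisy placeholder estimates of true efficiencies
rng = np.random.default_rng(6)
te_true = np.exp(-np.abs(rng.normal(0, 0.05, size=100)))
te_hat = np.clip(te_true + rng.normal(0, 0.01, size=100), 0, 1)
print(efficiency_metrics(te_hat, te_true))
```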

Table 4. Finite sample performance of the efficiency estimates: Cobb-Douglas production technology. Columns: ECA, 95% (a); Bias; RMSE; Upward bias (c); Correlation (d); each reported for FLW (b) and for this study. Rows: Scenario S1 ($\sigma_v = \sigma_u = 0.01$, $\lambda = 1.0$), Scenario S2 ($\sigma_v = 0.01$, $\sigma_u = 0.05$, $\lambda = 5.0$), Scenario S3 ($\sigma_v = 0.05$, $\sigma_u = 0.01$, $\lambda = 0.2$) and Scenario S4 ($\sigma_v = 0.05$, $\sigma_u = 0.05$, $\lambda = 1.0$), each for $n = 50$, $100$ and $200$. [Table entries not available.]

Notes: Cobb-Douglas production function: $y_i = x_{i1}^{\alpha} x_{i2}^{1-\alpha} \exp(v_i - u_i)$ with $\alpha = \tfrac{1}{3}$, $v_i \sim N(0, \sigma_v^2)$, $u_i \sim N_+(0, \sigma_u^2)$, $\lambda = \sigma_u/\sigma_v$. (a) Empirical Coverage Accuracy is the share of true technical efficiencies that are within the bounds of the predicted 95% confidence interval for estimated technical efficiency. Reported in this table is the median of such shares across all Monte Carlo simulations. (b) FLW represents the SSF estimator and the results are taken from Badunenko, Henderson and Kumbhakar (2012). (c) Upward Bias is the share of predicted technical efficiencies strictly larger than the true ones. The desired value of upward bias is 0.5. Values less (greater) than 0.5 indicate systematic underestimation (overestimation) of technical efficiencies. Reported in the table is the median of such shares across all Monte Carlo simulations. (d) Kendall correlation coefficient between predicted and true technical efficiencies. Reported in the table is the median of such coefficients across all Monte Carlo simulations.

Table 5. Finite sample performance of the efficiency estimates: CES production technology. Columns: ECA, 95% (a); Bias; RMSE; Upward bias (c); Correlation (d); each reported for FLW (b) and for this study. Rows: Scenarios S1-S4 as in Table 4, each for $n = 50$, $100$ and $200$. [Table entries not available.]

Notes: CES production function: $y_i = \left[\beta x_{i1}^{\rho} + (1-\beta) x_{i2}^{\rho}\right]^{1/\rho} \exp(v_i - u_i)$ with $\beta = \tfrac{2}{3}$, $\rho = \tfrac{1}{2}$, $v_i \sim N(0, \sigma_v^2)$, $u_i \sim N_+(0, \sigma_u^2)$, $\lambda = \sigma_u/\sigma_v$. (a) Empirical Coverage Accuracy is the share of true technical efficiencies that are within the bounds of the predicted 95% confidence interval for estimated technical efficiency. Reported in this table is the median of such shares across all Monte Carlo simulations. (b) FLW represents the SSF estimator and the results are taken from Badunenko, Henderson and Kumbhakar (2012). (c) Upward Bias is the share of predicted technical efficiencies strictly larger than the true ones. The desired value of upward bias is 0.5. Values less (greater) than 0.5 indicate systematic underestimation (overestimation) of technical efficiencies. Reported in the table is the median of such shares across all Monte Carlo simulations. (d) Kendall correlation coefficient between predicted and true technical efficiencies. Reported in the table is the median of such coefficients across all Monte Carlo simulations.

In Tables 4 and 5 we have shown that on average our improved procedure is arguably better than FLW. One interesting question is what would happen at the lower and upper ends of the efficiency distribution. We provide the results in Tables 4a and 5a for the 5% and 10% lower and upper ends of the efficiency distribution. Specifically, we proceed as follows: given that an observation has efficiency in the lower 5% of the efficiency distribution in the data, we first examine whether it belongs to the lower 5% of the efficiency distribution in a given simulation. If not, we record the correlation as zero; otherwise, we compute the actual correlation. From the results in Tables 4a and 5a it turns out that, in most cases, the correlations at the lower and upper 5% and 10% ends of the efficiency distribution are lower than the average but fairly close to the average value.

Therefore, the performance of the improved procedure does not deteriorate, and it can be used safely in most cases.

Table 4a. Finite sample performance of the efficiency estimates: Cobb-Douglas production technology (correlations at the lower and upper ends of the efficiency distribution). Columns: lower 5%; lower 10%; upper 5%; upper 10%. Rows: Scenarios S1-S4 as in Table 4, each for $n = 50$, $100$ and $200$. [Table entries not available.]

Notes: See notes to Table 4.

Table 5a. Finite sample performance of the efficiency estimates: CES production technology (correlations at the lower and upper ends of the efficiency distribution). Columns: lower 5%; lower 10%; upper 5%; upper 10%. Rows: Scenarios S1-S4 as in Table 5, each for $n = 50$, $100$ and $200$. [Table entries not available.]

Notes: See notes to Table 5.

7 Data

We apply the new techniques to the US banking data of Malikov et al. (2015), on which we rely heavily for the following description. Our data on commercial banks come from Call Reports available from the Federal Reserve Bank of Chicago and include all FDIC-insured commercial banks with reported data for 2001:Q1 to 2010:Q4. We focus on a selected subsample of relatively homogeneous large banks, namely those with total assets in excess of $1 billion (in 2005 US dollars) in the first 3 years of observation. We further exclude Internet banks, commercial banks conducting primarily credit card activities and banks chartered outside the continental USA. After cleaning the data we have an unbalanced panel with 2,397 bank-year observations for 285 banks. We deflate all nominal stock variables to 2005 US dollars using the consumer price index (for all urban consumers).

We follow the commonly used intermediation approach of Sealey and Lindley (1977) to model the bank's production technology. According to this approach a bank's balance sheet is assumed to capture the essential structure of its core business. Liabilities, together with physical capital and labor, are taken as inputs to the bank's production process, whereas assets (other than physical) are considered as outputs.

Liabilities include core deposits and purchased funds; assets include loans and trading securities. We define the following outputs of the bank's production process: consumer loans (y1), real estate loans (y2), commercial and industrial loans (y3) and securities (y4). These output categories are essentially the same as those in Berger and Mester (1997, 2003). Following Hughes and Mester (1998, 2013), we further include off-balance-sheet income (y5) as an additional output. We also include the bank's total non-performing loans (b). The variable inputs are labor, i.e. the number of full-time equivalent employees (x1), physical capital (x2), purchased funds (x3), interest-bearing transaction accounts (x4) and non-transaction accounts (x5). We also include financial capital (equity, e) as an additional input to the production technology. We follow Berger and Mester (1997, 2003) and Feng and Serletis (2009) and assume that equity is a quasi-fixed input. The treatment of equity capital as an input to banking production technology is consistent with Hughes and Mester's (1993, 1998) argument that banks may use it as a source of funds and thus as potential protection against losses. We compute the prices of variable inputs (w1 to w5) by dividing total expenses on each input by the corresponding input quantity. Table I in Malikov, Kumbhakar and Tsionas (2015) presents summary statistics of the data we use (footnote 7).

8 Empirical results

We will consider several models to examine whether the new techniques perform better and/or provide a better description of the data. Model I: a translog cost function model which depends on input prices, outputs, equity, non-performing loans and a time trend. The model does not allow for inefficiency. This model is used as a benchmark. Model II: an FPA model with normal and half-normal distributions for the two error components of the model. Model III: an FPA model with the following modification: the log variance of the two-sided error term is itself an FPA model, and the log of the scale parameter of the half-normal distribution for the one-sided error term is also an FPA model. Models IV, V and VI: as models I, II and III, with the modification that we use ACD instead of FPA. Model VII: a general SMR (or GSMR) model for the cost function. The probabilities and the logs of the scale parameters of the normal and half-normal distributions for the two error components of the model are functions of the logs of input prices, outputs, equity, non-performing loans and, finally, a time trend. Additionally, we will also look closely into the models developed for panel data in sections 4.1, 4.2 and 4.3.

Footnote 7: The data are available in [...].

Table 6. Model comparison for US banking data. Left panel: the translog and Models I-VII with their normalized LPS. Right panel: normalized LPS by order of the Additive Cobb-Douglas (upper part) and by base of the Fractional Polynomial Approximation (lower part). [Table entries not available.]

Notes to Table 6: Model I: a translog cost function model which depends on input prices, outputs, equity, non-performing loans and a time trend. The model does not allow for inefficiency and is used as a benchmark. Model II: an FPA model with normal and half-normal distributions for the two error components of the model. Model III: an FPA model with the following modifications: the log variance of the two-sided error term is itself an FPA model, and the log of the scale parameter of the half-normal distribution for the one-sided error term is also an FPA model. Models IV, V and VI: as models I, II and III, with the modification that we use ACD instead of FPA. Model VII: a general SMR (or GSMR) model for the cost function; the probabilities and the logs of the scale parameters of the normal and half-normal distributions for the two error components of the model are functions of the logs of input prices, outputs, equity, non-performing loans and, finally, a time trend.

From the results in the left panel of Table 6 it turns out that an ACD specification is better, followed by model VII (the full SMR, which uses base $\tfrac{1}{3}$ for the cost function and $\tfrac{1}{2}$ for the logs of the scale parameters of the two error components; the weight function also turns out to be of base $\tfrac{1}{2}$) (footnote 8). The right panel of Table 6 provides the normalized LPS for (i) the order of the ACD in the upper panel, and (ii) the base of the FPA in the lower panel. Given these results, it seems safe to proceed conditional on the choices of the particular order of ACD or base of the FPA for models IV, V, VI and II, III. Before doing so, it would be desirable to test the different models based on the alternative assumptions proposed in section 4, viz. panel data (section 4.1), endogeneity (section 4.2) and copulas (section 4.3). To summarize, in section 4.1 we introduce one-sided individual effects $\lambda_i$ which are handled in a flexible way, see (33) and (34). In section 4.2 we capture the correlation between $x_{it}$ and $\varepsilon_{it}$ using a panel VAR and a normal joint distribution where the elements of the Cholesky factor of the covariance matrix are flexible functions, see (35)-(37) and (38)-(39). Finally, in section 4.3 we have a copula dependence model, see (44) or (46). To save space we report results only for the best models in each of the three alternatives, and we report the results in Table 7.

Footnote 8: For model VII we used a full comparison between FPAs for the cost function, the two scale parameters and the weights in bases $\tfrac{1}{k}$, $k = 1, \ldots, 10$. This involved a choice among 10,000 different models, so we give only the final best choice.

Table 7. Alternative assumptions about panel data and endogeneity. Upper panel: order of the Additive Cobb-Douglas for models IV, V, VI and their variants x.1, x.2, x.3. Lower panel: base of the Fractional Polynomial Approximation for models II, III and their variants x.1, x.2, x.3. [Table entries not available.]

Notes to Table 7: Results for models IV, V, VI and II, III are the same as in Table 6 to allow direct comparison. Models x.1, x.2 and x.3 correspond to x = IV, V, VI and II, III. Models of the form x.1 are as in section 4.1, where we introduce one-sided individual effects $\lambda_i$ which are handled in a flexible way, see (33) and (34). Models of the form x.2 are as in section 4.2, where we capture the correlation between $x_{it}$ and $\varepsilon_{it}$ using a panel VAR and a normal joint distribution where the elements of the Cholesky factor of the covariance matrix are flexible functions, see (35)-(37) and (38)-(39). Models of the form x.3 are as in section 4.3, where we have a copula dependence model, see (44) or (46). In models x.1 and x.2 the orders of the various flexible functions are optimized and the best results are provided. All model orders run from 1 to 10 and they are combined with different orders of the Additive Cobb-Douglas or the Fractional Polynomial Approximation for the functional form of the cost function. Precise model orders are available on request.

From the results in Table 7 it turns out that, at least in this empirical application, it is not necessary to resort to the flexible panel specification (section 4.1), panel-VAR-based endogeneity as in section 4.2, or copulas (section 4.3). In fact, copulas seem to perform worst. The flexible panel specification performs worse relative to the benchmark models (IV, V, VI or II, III).

Assuming cost inefficiency to be flexible but time-invariant is quite restrictive, and a panel VAR (even with a quite flexible specification for the elements of the Cholesky factor of the covariance matrix $\Sigma_{it}$) does not seem sufficient to handle the joint distribution of $x_{it}$ and $\varepsilon_{it}$ relative to the benchmark SMR specification. This shows that the SMR, and its substantial generalizations in this study, are quite capable of representing actual and possibly complicated features of the data, including endogeneity in particular. In fact, the dominance of models III and VI (with VI being the best) also implies that there is considerable value added from modeling endogeneity: the LPS of III and VI are [...] and [...], respectively, over a simple translog model. From Table 6, models that do not account for endogeneity (like model II) sometimes imply a sizable LPS (like [...]), which is still 10 times smaller compared to the LPS of model III. This is, of course, an overwhelming difference, implying that endogeneity is critical to the performance of models in this empirical application. The copula specification seems to perform worst relative to the other two models. The bad performance of copulas is evidently a result of the more flexible specifications in SMR. The specification in section 4.2 (a panel VAR with a flexible parametrization for the elements of the Cholesky decomposition of the covariance matrix, $\Sigma_{it}$) performs best, suggesting that explicit modeling of $\Sigma_{it}$ is critical. However, none of the models in section 4 performs better than the models presented in Table 6. This suggests that these extensions, although interesting for future applications, are complicated enough and over-parametrized that they fail to improve the log-predictive score statistic.

To show the value added from accounting for endogeneity we can consider models I and II under the specifications in section 4.2, where the organizing principle is that of a panel vector autoregression (PVAR) with a joint distribution of $\varepsilon_{it}$ and $\xi_{it}$. We take again the translog as a benchmark with a LPS of zero, and the results are presented in Table 8. In case A we have a PVAR model with a general covariance matrix $\Sigma_{it}$ for $\varepsilon_{it}$ and $\xi_{it}$. In case B we restrict the covariance matrix not to depend on $x_{it}$. In cases C and D we consider the general specification in (40) with a general and a fixed covariance matrix, respectively. In case D, although the covariance matrix is fixed, it differs among groups ($g = 1, \ldots, G_\zeta$).

Table 8. Additional results for endogeneity
Notes: FPA and ACD denote Fractional Polynomial Approximation and Additive Cobb-Douglas, respectively. The results presented are for the best orders of FPA and ACD; the precise orders are not reported to save space but are available on request. In case A we have a PVAR model with a general covariance matrix Σ_it for ε_it and ξ_it. In case B we restrict the covariance matrix not to depend on x_it. In cases C and D we consider the general specification in (40) with a general and a fixed covariance matrix, respectively. In case D, although the covariance matrix is fixed, it differs among groups (g = 1, ..., G_ζ).
                                                              FPA    ACD    translog
A. Panel VAR, fixed Σ_it
B. Panel VAR, joint distribution of ε_it and ξ_it, fixed Σ_it
C. Joint distribution of ε_it and ξ_it, fixed Σ_it
D. Joint distribution of ε_it and ξ_it, general Σ_it
From the results in Table 8, there are definitely gains from considering endogeneity relative to a simple translog, so endogeneity matters empirically. The issue is whether modeling endogeneity in this way is better than the alternatives in Table 7. As the highest LPS here is , it turns out worse than an LPS of for the FPA model IV in Table 7, and certainly worse than the LPS of or for models VI and III in the same table. Therefore, at least in this empirical application, modeling endogeneity through a PVAR as in section 4.2 does not appear viable. Although there is clearly considerable value added from modeling heterogeneity, PVAR models (section 4.2) turn out to be worst when compared to the alternative, more general and flexible models like II, III or IV, VI. The different models can also be compared in terms of returns to scale (RTS), efficiency change (EC), productivity growth (PG), which is equal to technical change plus EC, and cost efficiency.
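As a rough illustration of what a flexible, observation-specific parametrization of Σ_it via its Cholesky factor might look like in the PVAR cases of Table 8, consider the sketch below. The log-linear diagonal and linear off-diagonal elements are illustrative stand-ins for the paper's flexible functions, and all names are hypothetical.

```python
# Minimal sketch of building an observation-specific covariance matrix from a
# parametrized Cholesky factor; the basis functions here are illustrative only.
import numpy as np

def sigma_it(w_it, theta_diag, theta_off):
    """Return a positive-definite covariance matrix for one observation.
    w_it       : (p,) covariate vector driving the Cholesky elements
    theta_diag : (d, p) coefficients for the log-diagonal elements
    theta_off  : (d, d, p) coefficients for the strictly lower-triangular elements
    """
    d = theta_diag.shape[0]
    L = np.zeros((d, d))
    # exponentiate the diagonal so it is strictly positive
    np.fill_diagonal(L, np.exp(theta_diag @ w_it))
    # unrestricted strictly lower-triangular elements
    il = np.tril_indices(d, k=-1)
    L[il] = (theta_off @ w_it)[il]
    return L @ L.T
```

Building Σ_it = L L' from a Cholesky factor guarantees positive definiteness for any coefficient values, which is presumably why the parametrization operates on the factor rather than on Σ_it directly.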

Figure 1. Sample distribution of RTS
Notes: Returns to scale are computed as rts = 1 / Σ_{k=1}^{K} ∂ log C(p, y)/∂ log y_k. The measure is averaged across all SMC draws to account for parameter uncertainty. Model I: a translog cost function model which depends on input prices, outputs, equity, non-performing loans and a time trend. The model does not allow for inefficiency and is used as a benchmark. Model II: an FPA model with normal and half-normal distributions for the two error components of the model. Model III: an FPA model with the following modification: the log variance of the two-sided error term is itself an FPA model and the log of the scale parameter of the half-normal distribution for the one-sided error term is also an FPA model. Models IV, V and VI: as models I, II and III with the modification that we use ACD instead of FPA. Model VII: a general SMR (or GSMR) model for the cost function; the probabilities and the logs of the scale parameters of the normal and half-normal distributions for the two error components are functions of the logs of input prices, outputs, equity, non-performing loans and, finally, a time trend.
The good performance of models VI and VII in terms of LPS matters in practice because these models have quite different implications for RTS, PG and efficiency than the remaining models. From Figure 1, models VI and VII agree that RTS is close to 0.98, whereas the other models provide RTS measures from 0.82 to , which is typical in banking studies employing flexible functional forms. The translog itself provides average RTS close to 0.82 and a very small probability that RTS could be higher than . From Figure 2, models VI and VII suggest that PG is very close to zero and very few banks have PG close to 1% or -1%. The other models again provide results which indicate that PG averages almost 4% and can be as large as 10% for certain banks. Models VI and VII are inconsistent with this prediction and suggest that PG, if any, is quite small.
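The RTS measure defined in the notes to Figure 1 can be approximated numerically as in the sketch below; `log_cost` is a hypothetical handle to a fitted log-cost function (for instance, the ACD or FPA form evaluated at a parameter draw), and the finite-difference step is an assumption.

```python
# Numerical sketch of the RTS measure in the notes to Figure 1:
# rts = 1 / sum_k d log C / d log y_k, averaged over posterior (SMC) draws.
import numpy as np

def rts_for_draw(theta, log_p, log_y, log_cost, h=1e-4):
    """Central finite-difference output cost elasticities for one parameter draw."""
    K = len(log_y)
    elasticities = np.empty(K)
    for k in range(K):
        up, down = log_y.copy(), log_y.copy()
        up[k] += h
        down[k] -= h
        elasticities[k] = (log_cost(theta, log_p, up)
                           - log_cost(theta, log_p, down)) / (2 * h)
    return 1.0 / elasticities.sum()

def rts_posterior_mean(draws, log_p, log_y, log_cost):
    """Average the RTS measure across SMC draws to account for parameter uncertainty."""
    return np.mean([rts_for_draw(th, log_p, log_y, log_cost) for th in draws])
```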

Figure 2. Sample distribution of PG
Notes: Model definitions as in the notes to Figure 1.

Figure 3. Sample distribution of technical efficiency
Notes: Model definitions as in the notes to Figure 1.
Another difference is in terms of technical efficiency (see Figure 3). Models VI and VII suggest that cost efficiency averages 95%-96%, but the sample distribution is highly skewed to the left. The remaining models suggest that cost efficiency averages 92%, but their sample distributions are quite different; for example, the remaining models suggest that efficiency in excess of 96% is practically impossible, contrary to the predictions of models VI and VII. It should be mentioned that, since one can put confidence intervals on efficiency estimates, 0.96 and 0.98 will probably be statistically the same. In Figure 4 we examine the temporal pattern of PG. Again, models VI and VII suggest that PG has been fairly close to zero from 2001 to . The other models provide quite different predictions, with a generally declining pattern of PG that remains positive and well above 3% on average. Model IV even suggests that PG increased in the aftermath of the sub-prime crisis. In terms of efficiency change (Figure 5), models VI and VII suggest that there has been no serious change in cost efficiency, contrary to the remaining models. Some models provide positive estimates while others provide negative estimates for efficiency change, so we have a mixed bag in terms of the temporal behavior of efficiency change. As the models are quite different and have different implications, this is not unexpected.
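The efficiency levels discussed here are of the form r̂_it = exp(−û_it). For a normal/half-normal composed error such as that of models II-VI, a textbook conditional-mean estimator of exp(−u) can be sketched as follows; this is a generic formula with homoskedastic scale parameters, not necessarily the exact posterior quantity reported in the paper, where the scales are flexible functions and everything is averaged over SMC draws.

```python
# Textbook sketch of cost efficiency exp(-u) under a normal/half-normal
# composed error eps = v + u (cost frontier), using the conditional mean
# E[exp(-u) | eps].
import numpy as np
from scipy.stats import norm

def cost_efficiency(eps, sigma_v, sigma_u):
    s2 = sigma_u**2 + sigma_v**2
    mu_star = eps * sigma_u**2 / s2              # location of u | eps before truncation
    sig_star = sigma_u * sigma_v / np.sqrt(s2)   # scale of u | eps
    z = mu_star / sig_star
    return (norm.cdf(z - sig_star) / norm.cdf(z)) * np.exp(-mu_star + 0.5 * sig_star**2)
```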

The fact that models VI and VII suggest that PG has been fairly close to zero since 2001 is more in line with reality. In banking, most innovations (like ATMs and electronic banking) had exhausted their productivity effects well before the 2000s, as they had already been introduced extensively by almost all banks. From that point of view it is not surprising to find that productivity growth has been quite low.
Figure 4. Temporal behavior of PG
Notes: Model definitions as in the notes to Figure 1.

Figure 5. Cost efficiency change
Notes: Model definitions as in the notes to Figure 1.
What we observe in terms of cost efficiency change is similar to productivity change. Most models agree that cost efficiency change has been quite low and in the neighborhood of ±2-3%, with the preferred models VI and VII showing no evidence of quantitatively important changes in cost efficiency. This challenges the conventional view that efficiency can adjust rapidly to changing economic conditions as, for example, during or in the aftermath of the sub-prime crisis. On the contrary, conventional models like model I deliver the empirical implication that efficiency has been constantly improving at an average rate of 1.5-2% per year, which is hard to believe and must be attributed to the inflexibilities of the functional forms in this model.
The marginal effects of several variables on cost efficiency are provided in Figures 6a through 6c. These marginal effects may be computed as ∂ log r̂_it/∂w_it, where r̂_it = exp(−û_it) and û_it is estimated cost inefficiency.⁹ To approximate these effects numerically we use finite differences based on w̃_it = w_it + h, where h is a vector whose elements are 0.1 times the minimum absolute value of the elements of w_it. The marginal effects are computed for each draw and averaged across all draws so that they fully account for parameter uncertainty. Finally, results are reported only for our preferred specification, viz. Model VI. For visual clarity all variables are normalized to lie in the interval [0, 1].
⁹ Here, w_it stands for the vector of variables that affect the ACD specification. We remind the reader that most elements of w_it are already in logs.

Figure 6a. Marginal effects on cost efficiency
One of the most important technical aspects of our major extensions is the ability to deliver the marginal effects of certain variables on cost efficiency. The marginal effects are plotted in Figures 6a through 6c. The effects are highly nonlinear and do not always have the same sign throughout the entire (relevant) domain. The effect of consumer loans (Fig. 6a) is positive, starting from about zero and finally reaching a value of . The effect of real estate loans is negative for the most part but positive when these loans are below their 20th percentile; at the median their effect is and it increases to at higher percentiles, a fact that can be attributed to the volatile character of real estate prices. Commercial and industrial loans have a monotonic impact on efficiency, with their effect being negative for values below the 35th percentile and increasing to reach a maximum of . Finally, securities have a consistently positive and monotonic effect which is close to 0.01 at the median and increases to 0.04 near the maximum.
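The finite-difference scheme just described can be sketched as follows; `u_hat` is a hypothetical handle returning estimated cost inefficiency for a given parameter draw, and the per-element application of the common step h is an assumption about the implementation.

```python
# Sketch of the finite-difference marginal effects of w_it on log cost
# efficiency described in the text: perturb one element of w_it at a time,
# recompute log r_hat = -u_hat, and average the differences over SMC draws.
import numpy as np

def marginal_effects(draws, w_it, u_hat):
    h = 0.1 * np.min(np.abs(w_it))           # step size as described in the text
    effects = []
    for theta in draws:                       # average over draws for parameter uncertainty
        base = -u_hat(theta, w_it)            # log r_hat = -u_hat
        fx = np.empty_like(w_it, dtype=float)
        for j in range(len(w_it)):
            w_pert = w_it.copy()
            w_pert[j] += h
            fx[j] = (-u_hat(theta, w_pert) - base) / h
        effects.append(fx)
    return np.mean(effects, axis=0)
```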

Figure 6b. Marginal effects on cost efficiency
From Figure 6b, it turns out that labor and capital consistently decrease cost efficiency, whereas purchased funds have a positive effect, with a minor positive contribution from off-balance-sheet income. These results indicate that further expansion of the banks in terms of capital or labor cannot possibly increase the efficiency of their operations, although there are clearly ways to do so by re-balancing their activities in terms of scope. Purchased funds, for example (Fig. 6b), as well as securities and consumer loans (Fig. 6a), can contribute to higher efficiency.

Figure 6c. Marginal effects on cost efficiency
We can see further how economies of scope in efficiency work if we take a look at Figure 6c. Non-performing loans have a clear negative effect on cost efficiency (of nearly at the median), while interest-bearing accounts, equity and non-transaction accounts have a positive (although non-monotonic) effect on efficiency. Non-monotonicity is particularly pronounced in the case of equity: its effect takes off just above the 70th percentile, although even the effect at the median is quantitatively important (nearly 0.05). In terms of policy implications these results are important. Cost efficiency can be improved through a shift from capital, labor and non-performing loans to equity, interest-bearing transaction and non-transaction accounts, purchased funds and, somewhat less importantly, consumer loans. The use of commercial and industrial loans and other items is more ambiguous and depends on a particular bank's current position in the (relevant) domain of this variable. These results are not merely descriptive, even though we deliberately considered the separate effect of each variable on technical efficiency. As a matter of fact, for each particular bank we can consider, in a formal way, the precise effect (along with 95% Bayes probability intervals) of changing the mix of outputs on its efficiency. In that way, we can formally design a particular scheme to increase cost efficiency by, say, three or four percentage points by keeping labor and capital constant and changing the mix of consumer versus industrial and commercial loans and securities. The flexibility of our new models therefore solves another outstanding problem, viz. that researchers often can estimate efficiency but do not know what to do when it comes to specific proposals for increasing efficiency levels. Our results can be used to design formal schemes to achieve precisely this goal.
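A hedged sketch of the exercise described above, for a single bank and a proposed re-balancing of its output mix, might look as follows; `efficiency` is a hypothetical handle returning predicted cost efficiency for a given parameter draw, and the interval is a simple pointwise 95% Bayes probability interval across draws.

```python
# Hypothetical sketch: compare predicted cost efficiency under the current
# output mix and under a proposed re-balanced mix, draw by draw, and report
# the posterior mean change with a 95% Bayes probability interval.
import numpy as np

def efficiency_change(draws, w_current, w_proposed, efficiency):
    diffs = np.array([efficiency(th, w_proposed) - efficiency(th, w_current)
                      for th in draws])
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), (lo, hi)
```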

Finally, in Figure 7 we report the marginal effect ∂ log r̂_it/∂t, which we call the technical change function. First, the values of the function are quite close to zero, a fact that is consistent with previous results. Second, it seems that at the beginning of the 2000s there was some positive, albeit quite small, technical change; the general pattern then shows a negative trend, and although a recovery took place in recent years, technical change in the banking sector is still close to zero and negative.
Figure 7. Technical change function
Note: The dotted lines represent 95% Bayes probability intervals.
In view of the empirical fact that there has been immaterial technical change and productivity growth, it is clear that the banking sector can realize cost savings mainly through increasing its cost efficiency. As we have shown, the matter is both quantitative and qualitative, in the sense that a new mix of outputs, or a re-balancing, must be sought, with specific changes in the mix that can be computed through an obvious generalization of marginal effects in the direction of changing several outputs at a time. Due to the flexibility of our functional forms and the many stages at which they enter (functional form, mixing weights, variances, etc.), there are rich patterns that can be modeled in a systematic and attractive manner. Indeed, one of the elusive goals of efficiency estimation so far has been not only to estimate efficiency (although doing this in a flexible manner has proved to be quite challenging) but also to figure out what has to be done in order to realize cost savings in practice. The models in this paper are likely to contribute to this goal and open new avenues for further research on this important policy issue.
