ON STATISTICAL INFERENCE UNDER ASYMMETRIC LOSS FUNCTIONS

Michael Baron

AMS 1991 subject classifications: 62A15, 62C10. Key words and phrases: loss function, prior distribution, minimum risk estimator, fixed point, range of posterior expectation.

Abstract. We introduce a wide class of asymmetric loss functions and show how to obtain asymmetric-type optimal decision rules from standard and commonly used Bayesian procedures. Important properties of minimum risk estimators are established. In particular, we discuss their sensitivity to the magnitude of asymmetry of the loss function.

1 Introduction

Consider a statistical decision problem where underestimating the parameter is, say, cheaper or less harmful than overestimating it by the same amount. Problems of this nature appear in actuarial science, marketing, the pharmaceutical industry, quality control, change-point analysis, child's IQ estimation, estimation of water levels for dam construction, and other situations ([1], [2], [3],
section 4.4, [6], section 3.3, [8], [9], [10], [11]). For example, an underestimated premium leads to some financial loss, whereas an overestimated premium may lead to a loss of a contract.

For the sake of simplicity, it is still common in practical applications to use standard Bayesian decision rules such as the posterior mean or median, even in asymmetric situations. However, these procedures do not reflect the difference in losses. It is recognized that some type of correction should be introduced to take account of the asymmetry.

In this paper we introduce a wide class of asymmetric loss functions (1). The corresponding asymmetric-type minimum risk decision rules can then be obtained as fixed points of standard estimators (Theorem 1). A simple alternative is to use the linear or quadratic approximations (14) and (15) for situations when over- and underestimation costs are different, but comparable. In this case, a standard Bayes estimator should be corrected by a term proportional to the posterior mean absolute deviation.

Suppose that a continuous or discrete parameter $\theta \in \Theta$ is to be estimated. Here $\Theta$ is either a connected subset of $\mathbb{R}$ or a possibly infinite collection of points $\{\ldots < \theta_0 < \theta_1 < \ldots\}$. Let $\pi$ be a probability measure on $\Theta$, which may be realized as the prior distribution, or the posterior, after a sample $x = \{x_1, \ldots, x_n\}$ has been observed. As an initial setting, we choose an arbitrary measure $\pi$ on $\Theta$ and a loss function $L(\theta, \delta)$, where $\delta$ is a decision rule. The loss $L$ may be one of the standard symmetric loss functions; however, symmetry is not required for the subsequent discussion.
Introduce an asymmetric loss function of the form

$$W(\theta, \delta) = K_1 L(\theta, \delta)\, I\{\delta \le \theta\} + K_2 L(\theta, \delta)\, I\{\delta > \theta\} \qquad (1)$$

for some positive $K_1$ and $K_2$. It generalizes the linear loss proposed in [3], section 4.4.2, for estimating the child's IQ. Another popular asymmetric loss function is the linex loss ([10], [11]). Unlike the linex, (1) defines a wide family of loss functions, due to an arbitrary choice of $L(\theta, \delta)$, where costs of over- and underestimation are comparable.

Without loss of generality, we can study the case $K_1 \le K_2$ only, because the other situation obtains by the reparameterization $\theta \mapsto -\theta$. Also we can assume that $K_1 = 1$. Then the loss function $W$ has the form

$$W(\theta, \delta) = \begin{cases} L(\theta, \delta) & \text{if } \delta \le \theta, \\ (1 + \lambda)\, L(\theta, \delta) & \text{if } \delta > \theta, \end{cases} \qquad (2)$$

for some $\lambda \ge 0$, which denotes the additional relative cost of overestimation. Let $\delta^*$ be the minimum risk estimator of $\theta$,

$$\delta^* = \delta^*(\lambda) = \arg\min_{\delta} \int W(\theta, \delta)\, d\pi(\theta).$$

If overestimation is costly, one will tend to underestimate. A way to express this is to consider a family of weighted measures $\{\pi_t\}$ obtained from $\pi$ by increasing the probabilities of small values of $\theta$,

$$\pi_t(d\theta) = \begin{cases} (1+\lambda)\, \pi(d\theta) \Big/ \Big(1 + \lambda \int_{\theta < t} d\pi(\theta)\Big) & \text{if } \theta < t, \\[4pt] \pi(d\theta) \Big/ \Big(1 + \lambda \int_{\theta < t} d\pi(\theta)\Big) & \text{if } \theta \ge t. \end{cases} \qquad (3)$$

For every $t$, let

$$\tilde\delta(t) = \arg\min_{\delta} \int L(\theta, \delta)\, d\pi_t(\theta).$$
In practice, $\pi$ is usually a posterior distribution $\pi(\theta \mid x)$ under some prior $\pi(\theta)$. One defines an altered prior $\pi_t(\theta)$ similarly to (3), by assigning higher probabilities to smaller values of $\theta$. Then $\pi_t$ in (3) is the corresponding posterior.

In the next section, we show that $\delta^*$ is the (local and global) minimum and the fixed point of $\tilde\delta(t)$. The rest of the paper gives simple methods of obtaining $\delta^*(\lambda)$ in practice and establishes robustness properties of $\delta^*$ with respect to the choice of $\lambda$. Popular examples are discussed.

2 The fixed point of $\tilde\delta(t)$

We assume that $L(\theta, \delta)$ is a strictly convex nonnegative function of $\delta$ for any $\theta$, with $L(\theta, \theta) = 0$. Consider the risks $r(\delta)$ and $r_t(\delta)$ associated with $(W, \pi)$ and $(L, \pi_t)$ respectively. One has $\tilde\delta(t) = \arg\min_\delta r_t(\delta)$ for any $t$, and $\delta^* = \arg\min_\delta r(\delta)$. Then $\tilde\delta(t)$ also minimizes

$$\psi_t(\delta) = C(\lambda, t)\, r_t(\delta) = \int L(\theta, \delta)\, d\pi(\theta) + \lambda \int_{\theta < t} L(\theta, \delta)\, d\pi(\theta),$$

where $C(\lambda, t) = 1 + \lambda \int_{\theta < t} d\pi(\theta)$ is the normalizing constant of (3), and $\delta^*$ minimizes $\psi(\delta) = \psi_\delta(\delta)$. Under our assumptions, $\psi_t(\delta)$ and $\psi(\delta)$ are strictly convex functions of $\delta$. The following lemma establishes an important property of $\tilde\delta(t)$.

Lemma 1. The decision rule $\tilde\delta(t)$ is a non-increasing function on $(-\infty, T)$ and a non-decreasing function on $(T, +\infty)$, where $T = \inf\{s : \tilde\delta(s) \le s\} = \sup\{s : \tilde\delta(s) > s\}$.
Proof: For any $s \le t$ one has

$$\psi_t(\delta) = \psi_s(\delta) + \lambda \int_{s \le \theta < t} L(\theta, \delta)\, d\pi(\theta). \qquad (4)$$

Hence, by convexity, the minimum of $\psi_t(\delta)$ is attained between the points of minima of $\psi_s(\delta)$ and $\lambda \int_{s \le \theta < t} L(\theta, \delta)\, d\pi(\theta)$. In other words, since the latter attains its minimum at some point between $s$ and $t$,

$$\min\{s,\, \tilde\delta(s)\} \le \tilde\delta(t) \le \max\{t,\, \tilde\delta(s)\}. \qquad (5)$$

If $\tilde\delta(s) \le s$, then (5) results in $\tilde\delta(s) \le \tilde\delta(t) \le t$, from which $\{s : \tilde\delta(s) \le s\}$ is a right half-line and $\tilde\delta(t)$ is non-decreasing on it. Let $T = \inf\{s : \tilde\delta(s) \le s\}$. Then $\tilde\delta(t) > t$ for all $t < T$, and for any $s < t < T$, (5) implies $\tilde\delta(t) \le \tilde\delta(s)$. Thus $\tilde\delta(t)$ is non-increasing on $(-\infty, T)$. $\Box$

We now handle the discrete and continuous cases separately. First, let $\pi$ be a discrete distribution on $\Theta = \{\theta_k\}$, where the $\theta_k$ are enumerated in increasing order. Then it follows from (4) that $\psi_t(\delta) = \psi_{\theta_k}(\delta)$ for $t \in (\theta_{k-1}, \theta_k]$, and therefore $\tilde\delta(t) = \tilde\delta(\theta_k)$. In particular, $\psi(\delta) = \psi_{\delta^*}(\delta) = \psi_{\theta_k}(\delta)$ when $\theta_{k-1} < \delta^* \le \theta_k$.

We show that $\psi(\delta)$ attains a local minimum at $\delta = T$. Then, by convexity of $\psi$, $\delta^* = T$. Let $\theta_{k-1} < T \le \theta_k$ for some $k$. Then

$$T = \sup\{s : \tilde\delta(s) > s\} = \sup\{s : \tilde\delta(\theta_k) > s\} = \tilde\delta(\theta_k), \qquad (6)$$

and hence $\psi_{\theta_k}(\delta)$ has a local minimum at $T$. If $T < \theta_k$, then $\psi(\delta)$ also attains a local minimum at $\delta = T$, because both functions coincide in some neighborhood of $T$. Otherwise, if $T = \theta_k$ for some $k$, then

$$T = \inf\{s : \tilde\delta(s) \le s\} = \inf\{s : \tilde\delta(\theta_{k+1}) \le s\} = \tilde\delta(\theta_{k+1}).$$
Thus, from (6), $\tilde\delta(\theta_k) = \tilde\delta(\theta_{k+1}) = T$, and $T$ is a point of minimum for both functions $\psi_{\theta_k}$ and $\psi_{\theta_{k+1}}$. Since $\psi = \psi_{\theta_k}$ to the left of $T$ and $\psi = \psi_{\theta_{k+1}}$ to the right of it, and $\psi$ is continuous, it follows that $T$ is a point of minimum for $\psi(\delta)$.

Formula (6) implies that $\delta^*$ solves the equation

$$\tilde\delta(t) = t. \qquad (7)$$

Moreover, $\delta^*$ is the unique root of (7). Indeed, suppose that $\tilde\delta(t) = t$ for some $t \ne T$. By the definition of $T$, the only possibility is $t > T$. Let $\theta_m < t \le \theta_{m+1}$, and take any $s \in (\max\{T, \theta_m\},\, t)$. Then $\tilde\delta(s) = \tilde\delta(t) = t > s$, which contradicts $s > T$.

The fact that $\delta^*$ is the unique fixed point of $\tilde\delta(t)$ suggests a computational method for the numerical evaluation of $\delta^*$. We also note that from (5) with $s = T$ one has $T \le \tilde\delta(t)$ for any $t > T$. The same inequality obtains for $t < T$, because $\tilde\delta(T - \varepsilon) = T$ for some $\varepsilon > 0$, and by Lemma 1, $\tilde\delta(s) \ge \tilde\delta(T - \varepsilon)$ for any $s < T$. Hence, $\delta^* = \min_t \tilde\delta(t)$.

Now turn to the case when $\pi$ is a continuous distribution on $\Theta$, having a density $\pi(\theta)$. Then, from (4), one has $\psi_s \to \psi_t$ pointwise as $s \to t$, and the following statement holds.

Lemma 2. If $\pi$ is absolutely continuous with respect to Lebesgue measure, then $\tilde\delta(t)$ is a continuous function of $t$.

The proof of Lemma 2 is given in the Appendix. By continuity, it follows again from Lemma 1 that $\tilde\delta(T) = T = \min_t \tilde\delta(t)$. It remains to show that $\delta^* = T$. Since $\psi_t(\delta)$ increases in $t$ pointwise, for any $t > T$,

$$\psi(t) = \psi_t(t) > \psi_t(\tilde\delta(t)) \ge \psi_T(\tilde\delta(t)) \ge \psi_T(T) = \psi(T).$$
Hence, $\delta^* > T$ is impossible. Suppose that $\delta^* < T$. Choose $\delta^* < t < T$ and a sequence $t_j \downarrow t$, on which the density $\pi$ is bounded by some $M$. Then, since $\delta^* = \arg\min \psi$ and $\psi$ is convex, one has $\psi(t_j) > \psi(t)$, and

$$\psi(t_j) - \psi_t(t_j) > \psi_t(t) - \psi_t(t_j). \qquad (8)$$

Consider both sides of (8). From (4),

$$\psi(t_j) - \psi_t(t_j) = \lambda \int_t^{t_j} L(\theta, t_j)\, \pi(\theta)\, d\theta \le \lambda M L(t, t_j)(t_j - t) = o(t_j - t)$$

as $j \to \infty$, because $L(t, t_j) \to 0$. However, by convexity of $\psi_t$,

$$\psi_t(t) - \psi_t(t_j) \ge m (t_j - t), \quad \text{where} \quad m = \frac{\psi_t(t) - \psi_t(\tilde\delta(t))}{\tilde\delta(t) - t} > 0.$$

Now we have a contradiction to (8). Thus we have proved the following theorem.

Theorem 1. The minimum risk estimator $\delta^*$ of $\theta$ under the loss function $W$ and the prior $\pi$ is the unique root of the equation $\tilde\delta(t) = t$. Also, it equals $\min_t \tilde\delta(t)$.

As an illustration, we consider the problem of estimating the binomial parameter $\theta$ when one variable $X$ from the binomial$(5, \theta)$ distribution is observed, and $\theta$ has a uniform prior distribution. For $\lambda \in \{1, 6\}$ and all possible values of $X$, the graphs of $\tilde\delta(t)$ are depicted in Figure 1. Their shapes are in accordance with Lemma 1. The dotted line represents the graph $\tilde\delta(t) = t$. By Theorem 1, $\delta^*(\lambda)$ coincides with the intersection points for different values of $X$. Obviously, higher values of $\lambda$ imply a higher penalty for overestimation, which leads to lower values of $\tilde\delta(t)$ and $\delta^*$.
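The fixed-point characterization of Theorem 1 is easy to try numerically. The following sketch (not part of the paper; the grid size is our own choice) reproduces the binomial illustration: the posterior of $\theta$ is put on a grid, $\tilde\delta(t)$ is evaluated from formula (11) of Section 4 for the squared-error $L$, and $\delta^*$ is located by bisection, using the fact that $\tilde\delta(t) - t$ is positive to the left of the unique root and nonpositive to the right of it.

```python
# Sketch of the fixed-point method of Theorem 1 for X ~ binomial(5, theta)
# with a uniform prior on theta and squared-error L.
def posterior(x, n=5, m=4001):
    # uniform prior => posterior weights proportional to the likelihood
    thetas = [i / (m - 1) for i in range(m)]
    w = [th ** x * (1 - th) ** (n - x) for th in thetas]
    s = sum(w)
    return thetas, [v / s for v in w]

def delta_tilde(t, lam, thetas, w):
    # formula (11): (mu + lam*E[theta; theta < t]) / (1 + lam*P{theta < t})
    mu = sum(th * p for th, p in zip(thetas, w))
    part = sum(th * p for th, p in zip(thetas, w) if th < t)
    prob = sum(p for th, p in zip(thetas, w) if th < t)
    return (mu + lam * part) / (1 + lam * prob)

def delta_star(lam, x):
    thetas, w = posterior(x)
    lo, hi = 0.0, 1.0   # delta_tilde(t) > t left of the root, <= t right of it
    for _ in range(50):
        t = (lo + hi) / 2
        if delta_tilde(t, lam, thetas, w) > t:
            lo = t
        else:
            hi = t
    return (lo + hi) / 2

# lam = 0 recovers the posterior mean (X+1)/7; larger lam pulls delta* down,
# in accordance with Figure 1
print(round(delta_star(0.0, x=3), 4))   # posterior mean 4/7, about 0.5714
print(delta_star(6.0, x=3) < delta_star(1.0, x=3) < delta_star(0.0, x=3))
```

For $\lambda = 0$ the map $\tilde\delta$ is constant and the root is the posterior mean; for $\lambda > 0$ the computed roots decrease in $\lambda$, matching the discussion of Figure 1.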
Figure 1. Behavior of $\tilde\delta(t)$ and $\delta^*$ in the binomial case (panels for $\lambda = 1$ and $\lambda = 6$; one curve for each $X = 0, 1, \ldots, 5$).

2.1 The range of decision rules

Methods of Bayesian global posterior robustness suggest giving an interval of decision rules corresponding to a family of priors, rather than specifying one optimal procedure for one given prior ([4], [5]). One considers an interval

$$D = \left[\, \inf_{\pi \in \Gamma} \delta^*,\ \sup_{\pi \in \Gamma} \delta^* \,\right]$$

of the minimum risk decision rules corresponding to all priors of $\Gamma$. In asymmetric problems with a fixed $\lambda$, one considers the family of prior distributions $\Gamma = \{\pi_t\}$ for all values of $t$, as defined in (3). A direct application of Lemma 1 and Theorem 1 gives the form of $D$,

$$D = [\delta^*(\lambda),\ \delta^*(0)], \qquad (9)$$

whereas Lemma 4 below bounds the length of $D$ for the case of the squared-error loss.
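The form (9) of the range can be seen directly in a small example. The sketch below (our own illustration; the three-point distribution is hypothetical) evaluates $\tilde\delta(t)$ of formula (11), Section 4, over a grid of $t$ and confirms that its values sweep exactly from $\delta^*(\lambda) = \min_t \tilde\delta(t)$ up to $\delta^*(0) = \mu$:

```python
# Sketch illustrating (9) for a discrete pi under squared-error L: over the
# family {pi_t}, the rules delta_tilde(t) range over [delta*(lam), delta*(0)].
support = [0.0, 1.0, 2.0]     # hypothetical three-point distribution
probs = [0.5, 0.3, 0.2]
lam = 3.0
mu = sum(th * p for th, p in zip(support, probs))

def delta_tilde(t):
    # formula (11): (mu + lam*E[theta; theta < t]) / (1 + lam*P{theta < t})
    part = sum(th * p for th, p in zip(support, probs) if th < t)
    prob = sum(p for th, p in zip(support, probs) if th < t)
    return (mu + lam * part) / (1 + lam * prob)

ts = [i / 100 for i in range(-100, 301)]
values = [delta_tilde(t) for t in ts]
print(abs(max(values) - mu) < 1e-12)   # sup delta_tilde = delta*(0) = mu
print(min(values) < mu)                # inf delta_tilde = delta*(lam) < mu
```

The supremum is attained as $t \to -\infty$, where $\pi_t = \pi$ and $\tilde\delta$ reduces to the ordinary posterior mean; the infimum is the fixed point $\delta^*(\lambda)$ of Theorem 1.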
Further, one can consider an extended family of priors $\Gamma = \{\pi_{t, \lambda}\}$ for all $t \in \mathbb{R}$ and $\lambda \in [\lambda_1, \lambda_2]$, $0 \le \lambda_1 < \lambda_2$. Then the range of the posterior expectation is

$$D = [\delta^*(\lambda_2),\ \delta^*(0)], \qquad (10)$$

independent of $\lambda_1$. If a sample from $f(x \mid \theta)$ is available, and the $\delta^*$ are Bayes rules under the prior distributions $\pi \in \Gamma$, their ranges are still given by (9) and (10).

3 Absolute error loss

In order to have a unique minimizer, we required that $L(\theta, \delta)$ be a strictly convex function of $\delta$. However, if $L(\theta, \delta) = |\theta - \delta|$, all the minimum risk estimators with respect to the loss function $W$ are still solutions of (7). Indeed, if $L$ is the absolute error loss, then $\tilde\delta(t)$ is any median of $\pi_t$, and (7) is equivalent to a system of two inequalities,

$$\int_{\theta \le t} d\pi_t(\theta) = \frac{(1 + \lambda) \int_{\theta \le t} d\pi(\theta)}{(1 + \lambda) \int_{\theta < t} d\pi(\theta) + \int_{\theta \ge t} d\pi(\theta)} \ge \frac{1}{2}$$

and a similar inequality for $\int_{\theta \ge t} d\pi_t(\theta)$. Any $1/(\lambda + 2)$-quantile of $\pi$ solves this system. According to [3], section 4.4, the Bayes rule has the same form.

4 Squared error loss and weighted losses

When $L(\theta, \delta) = (\theta - \delta)^2$, it is easy to check that

$$\tilde\delta(t) = E_{\pi_t}(\theta) = \frac{\mu + \lambda E_\pi(\theta I_{\theta < t})}{1 + \lambda P_\pi\{\theta < t\}}. \qquad (11)$$
Here and later, $I_A = 1$ if $A$ holds and $I_A = 0$ otherwise, and $\mu = E_\pi(\theta)$. Note that $E_\pi(\theta I_{\theta < t}) = E_\pi(\theta \mid \theta < t)\, P_\pi\{\theta < t\}$. Hence, $\tilde\delta(t)$ is a convex linear combination of $\mu = \tilde\delta(-\infty)$ and $E_\pi(\theta \mid \theta < t)$. Then, according to (11) and Theorem 1, $\delta^*$ solves the equation

$$E_\pi(\theta - t) + \lambda E_\pi\left[(\theta - t)\, I_{\theta < t}\right] = 0, \qquad (12)$$

which is equivalent to

$$E_\pi\left[(\theta - t)^+\right] = (1 + \lambda)\, E_\pi\left[(\theta - t)^-\right],$$

where $a^+ = \max\{a, 0\}$ and $a^- = \max\{-a, 0\}$. In general, for any loss function of the form $L(\theta, \delta) = |\theta - \delta|^\alpha$, $\alpha \ge 1$, $\delta^*$ solves the equation

$$E_\pi\left[\left((\theta - t)^+\right)^{\alpha - 1}\right] = (1 + \lambda)\, E_\pi\left[\left((\theta - t)^-\right)^{\alpha - 1}\right].$$

The case $\alpha = 1$ yields the $1/(\lambda + 2)$-quantile mentioned in the previous section.

Generalization to the case of weighted squared error loss functions $L(\theta, \delta) = w(\theta)(\theta - \delta)^2$ is straightforward. The weights $w(\theta)$ usually allow larger errors for larger true values of the parameter. It is equivalent to the decision problem with the unweighted loss, if the distribution $\pi_t(d\theta)$ is replaced by

$$\pi'_t(d\theta) = \begin{cases} (1 + \lambda)\, w(\theta)\, \pi(d\theta) \Big/ \Big(E_\pi w(\theta) + \lambda \int_{\theta < t} w(\theta)\, d\pi(\theta)\Big) & \text{if } \theta < t, \\[4pt] w(\theta)\, \pi(d\theta) \Big/ \Big(E_\pi w(\theta) + \lambda \int_{\theta < t} w(\theta)\, d\pi(\theta)\Big) & \text{if } \theta \ge t. \end{cases}$$

Any weighted loss function can be considered similarly, as long as the weight function is $\pi$-integrable.
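The equation $E(\theta - t)^+ = (1 + \lambda)E(\theta - t)^-$ is convenient computationally, because its left side minus its right side is decreasing in $t$, so the root can be found by bisection. A sketch (not from the paper; the two-point distribution is a hypothetical example):

```python
# Bisection for the squared-error fixed point: solve
#   E(theta - t)^+ = (1 + lam) * E(theta - t)^-
# over a discrete pi; h(t) below is decreasing, so the root is unique.
def solve_t(support, probs, lam, iters=100):
    lo, hi = min(support), max(support)
    def h(t):
        plus = sum(p * max(th - t, 0.0) for th, p in zip(support, probs))
        minus = sum(p * max(t - th, 0.0) for th, p in zip(support, probs))
        return plus - (1 + lam) * minus
    for _ in range(iters):
        mid = (lo + hi) / 2
        if h(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# two-point pi with equal masses on {0, 1}: the root is 1/(2 + lam)
print(round(solve_t([0.0, 1.0], [0.5, 0.5], lam=2.0), 9))   # → 0.25
```

For the equal-mass two-point distribution the root $1/(2 + \lambda)$ happens to coincide with the $1/(\lambda + 2)$-quantile level of Section 3, which makes it a handy sanity check.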
4.1 Sensitivity with respect to $\lambda$, and correction of standard decision rules

When the choice of $\lambda$, the relative additional cost of overestimation, is not obvious, it is natural to investigate the robustness of $\delta^*$ with respect to different $\lambda$. The following results concern the rate of change of $\delta^* = \delta^*(\lambda)$ for $\lambda \to 0$ and $\lambda \to +\infty$.

Lemma 3. Let $L(\theta, \delta)$ be the squared error loss. Then

$$\frac{d\, \delta^*(\lambda)}{d \lambda}\bigg|_{\lambda = 0} = \mathrm{Cov}(\theta,\, I_{\theta < \mu}) = -\frac{1}{2}\, E_\pi |\theta - \mu|. \qquad (13)$$

Lemma 3 shows that $|\delta^*(\lambda) - \delta^*(0)|$ has the linear order in $\lambda$ as $\lambda \to 0$. As $\lambda \downarrow 0$, one has

$$\delta^*(\lambda) = \mu - \frac{\lambda}{2}\, E_\pi |\theta - \mu| + O(\lambda^2). \qquad (14)$$

In applications, this determines a simple strategy for a decision maker. Every time overestimation is slightly more costly than underestimation, the standard Bayes estimator $\delta^*(0) = \mu$ is to be corrected by a term proportional to the mean absolute deviation of $\theta$. Suppose, for example, that the loss caused by overestimating the parameter is 10% higher than the loss caused by underestimating it by the same amount. Then the corrected decision rule

$$\delta^*(0.1) \approx \mu - 0.05\, E_\pi |\theta - \mu|$$

should be used instead of the standard posterior mean $\mu = E_\pi(\theta)$. Differentiating once more, one obtains

$$\frac{d^2\, \delta^*(\lambda)}{d \lambda^2}\bigg|_{\lambda = 0} = E_\pi |\theta - \mu|\, P_\pi\{\theta < \mu\}.$$
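The first-order correction (14) can be checked against an exact minimization. The sketch below (our own check; the four-point distribution is hypothetical, not data from the paper) minimizes the asymmetric risk $\psi(d) = E(\theta - d)^2 + \lambda E[(\theta - d)^2;\ \theta < d]$ by ternary search, which is valid because $\psi$ is convex, and compares the minimizer with $\mu - (\lambda/2)E|\theta - \mu|$ for a 10% asymmetry:

```python
# Sketch: exact minimizer of the asymmetric squared-error risk vs. the
# first-order correction (14), for a hypothetical four-point pi.
support = [0.0, 1.0, 2.0, 5.0]
probs = [0.4, 0.3, 0.2, 0.1]

def psi(d, lam):
    # risk under W of (2): overestimation (theta < d) costs (1+lam) times more
    return sum(p * (th - d) ** 2 * ((1 + lam) if th < d else 1.0)
               for th, p in zip(support, probs))

def argmin_psi(lam, lo=0.0, hi=5.0, iters=200):
    # ternary search; psi is convex in d
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if psi(m1, lam) < psi(m2, lam):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

mu = sum(th * p for th, p in zip(support, probs))
mad = sum(p * abs(th - mu) for th, p in zip(support, probs))
lam = 0.1                    # overestimation 10% more costly
print(abs(argmin_psi(lam) - (mu - lam / 2 * mad)) < 0.01)   # → True
```

In this example the exact minimizer and the corrected posterior mean differ by a few thousandths, of order $O(\lambda^2)$ as (14) predicts.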
This provides a more accurate correction of the standard decision rule,

$$\delta^*(\lambda) = \mu - \frac{\lambda}{2}\, E_\pi |\theta - \mu| \left(1 - \lambda P_\pi\{\theta < \mu\}\right) + O(\lambda^3), \quad \text{as } \lambda \to 0. \qquad (15)$$

The next result bounds all possible deviations of $\delta^*$.

Lemma 4. One has

$$|\delta^*(\lambda) - \delta^*(0)| \le \sigma(\theta)\, \frac{\lambda}{2\sqrt{1 + \lambda}}, \qquad (16)$$

where $\sigma(\theta)$ denotes the standard deviation of $\theta$ under the distribution $\pi$.

A simple calculation shows that if $\Theta$ consists of only two points, then $|\rho(t)| = 1$ everywhere between them, where $\rho(t)$ is the correlation coefficient between $\theta$ and $I_{\theta < t}$. In this case, equality holds in (16). According to Lemma 4, when $\Theta$ is unbounded, $\delta^*(\lambda)$ deviates from $\delta^*(0)$ at the rate $O(\sqrt{\lambda})$, as $\lambda \to +\infty$.

4.2 Examples

Bayesian estimation of the normal mean. Let $X_1, \ldots, X_n$ be a sample from the normal distribution with $E X = \theta$ and $\mathrm{Var}(X) = \sigma^2$. Suppose that $\sigma^2$ is known and $\theta$, the parameter of interest, has a prior distribution $\pi(\theta)$ which is normal$(\mu, \tau^2)$. The corresponding posterior $\pi = \pi(\theta \mid x)$ is normal$(\mu_x, \sigma_x^2)$, where

$$\mu_x = \frac{n \bar X / \sigma^2 + \mu / \tau^2}{n / \sigma^2 + 1 / \tau^2} \quad \text{and} \quad \sigma_x^2 = \frac{1}{n / \sigma^2 + 1 / \tau^2}$$

(e.g. [3], [7]). Then $z = (\theta - \mu_x)/\sigma_x$ follows the standard normal distribution under $\pi(\theta \mid x)$. Let $\phi(u)$ and $\Phi(u)$ denote the standard normal pdf and cdf,
respectively. Note that $E(z I_{z < u}) = -\phi(u)$. From (11),

$$\tilde\delta(t) = \frac{\mu_x + \lambda\, E\left[(\mu_x + \sigma_x z)\, I_{\mu_x + \sigma_x z < t}\right]}{1 + \lambda\, P\{\mu_x + \sigma_x z < t\}} = \mu_x - \sigma_x\, \frac{\lambda \phi(u)}{1 + \lambda \Phi(u)},$$

where $u = (t - \mu_x)/\sigma_x$. Then the fixed-point equation (7) is equivalent to

$$\lambda \phi(u) + \lambda u \Phi(u) + u = 0. \qquad (17)$$

Hence, the minimum asymmetric risk decision rule is given by

$$\delta^*(\lambda) = \mu_x + \sigma_x\, u(\lambda), \qquad (18)$$

where $u(\lambda)$ is a solution of (17). According to (18), $\frac{d}{d\lambda}\delta^*(\lambda) = \sigma_x u'(\lambda)$, which is independent of $X_1, \ldots, X_n$. In particular,

$$\frac{d}{d\lambda}\, \delta^*(\lambda)\bigg|_{\lambda = 0} = -\frac{\sigma_x}{\sqrt{2\pi}} = -\frac{1}{\sqrt{2\pi\,(n/\sigma^2 + 1/\tau^2)}},$$

and the same formula can be obtained by Lemma 3. By (15),

$$\delta^*(\lambda) \approx \mu_x - \sigma_x\, \frac{\lambda (1 - \lambda/2)}{\sqrt{2\pi}}$$

for small $\lambda$, and the correction term is the same regardless of the observed sample.

Asymmetric least squares. In the no-intercept linear regression model $y_i = \beta x_i + \varepsilon_i$, let $\hat\beta$ be selected to minimize

$$\psi(b) = \sum (y_i - x_i b)^2 + \lambda \sum (y_i - x_i b)^2\, I_{y_i < x_i b} = \sum x_i^2 (r_i - b)^2 + \lambda \sum x_i^2 (r_i - b)^2\, I_{r_i < b}, \qquad (19)$$
with $r_i = y_i / x_i$. The additional weights assigned to squared residuals reflect the higher cost of overestimating $\beta$ and overpredicting $y$. Suppose that the ratios $r_i$ are enumerated in increasing order, $r_1 \le \ldots \le r_n$. For every $k = 1, \ldots, n$, consider

$$\psi_k(b) = \sum_{i=1}^n x_i^2 (r_i - b)^2 + \lambda \sum_{i=1}^k x_i^2 (r_i - b)^2.$$

This function is minimized at

$$\hat\beta(k) = \frac{\sum_1^n x_i^2 r_i + \lambda \sum_1^k x_i^2 r_i}{\sum_1^n x_i^2 + \lambda \sum_1^k x_i^2} = \frac{\int r\, d\nu(r)}{\int d\nu(r)} = E_\nu(r),$$

where the distribution $\nu$ puts masses

$$\begin{cases} C x_i^2 & \text{at } r_i,\ i > k, \\ C (1 + \lambda) x_i^2 & \text{at } r_i,\ i \le k. \end{cases}$$

Thus $\hat\beta(k)$ has the Bayes-type form of $\tilde\delta(k)$, and according to Theorem 1, the asymmetric least squares solution $\hat\beta = \delta^*$ can be found as the fixed point of $\hat\beta(k)$. Clearly, the case $\lambda = 0$ gives the ordinary least squares solution.

Change-point estimation. In a classical change-point problem, a sample of independent variables $x = (X_1, \ldots, X_n)$ is such that the distribution of $X_j$ depends on whether $j < \theta$ or $j \ge \theta$. One needs to estimate the change-point parameter $\theta$. Since a change-point has to be detected "as soon as possible", suppose that overestimation leads to a higher loss than underestimation. Further, assume a (continuous) uniform prior distribution of $\theta$ over the interval $\Theta = [0, n]$. If the pre-change and the post-change densities or probability mass functions $f$ and $g$ are mutually absolutely continuous, the standard
Bayes rule under the squared-error loss has the form

$$\delta^* = \delta^*(0) = E(\theta \mid x) = \frac{\sum_{k=1}^n k\, \Lambda(k)}{\sum_{k=1}^n \Lambda(k)}, \qquad (20)$$

where $\Lambda(k) = f(x_1) \cdots f(x_k) / g(x_1) / \cdots / g(x_k)$ ([8]). Correction for an asymmetric loss is straightforward. According to (15), one obtains the optimal decision rule under the loss function (2),

$$\delta^*(\lambda) = \delta^* - \frac{\lambda}{2}\, E\left(|\theta - \delta^*| \,\big|\, x\right) \left(1 - \lambda\, P\{\theta < \delta^* \mid x\}\right) + O(\lambda^3)$$

as $\lambda \to 0$, where $\delta^*$ is given by (20), and the posterior moments $E(|\theta - \delta^*| \mid x)$ and $P\{\theta < \delta^* \mid x\}$ can be written explicitly in terms of the weights $\Lambda(k)$ and the integer and fractional parts $[\delta^*]$ and $\{\delta^*\}$ of $\delta^*$ (the derivation is omitted).

5 Appendix

Proof of Lemma 2. Suppose that there exist $\varepsilon > 0$ and a monotone sequence $t_j \to t$ such that $|\tilde\delta(t_j) - \tilde\delta(t)| > \varepsilon$ for all $j$. Let

$$\Delta = \min\left\{\psi_t(\tilde\delta(t) - \varepsilon),\ \psi_t(\tilde\delta(t) + \varepsilon)\right\} - \psi_t(\tilde\delta(t)) > 0.$$

If $t_j \uparrow t$, then $\psi_{t_j}(\delta) \uparrow \psi_t(\delta)$, and $\psi_{t_j}(\tilde\delta(t) \pm \varepsilon) > \psi_t(\tilde\delta(t) \pm \varepsilon) - \Delta/2$ for sufficiently large $j$. Then, since $\psi_{t_j}(\tilde\delta(t)) \le \psi_t(\tilde\delta(t)) < \psi_t(\tilde\delta(t) \pm \varepsilon) - \Delta/2$, it follows from convexity of $\psi_{t_j}$ that it has a minimum on $(\tilde\delta(t) - \varepsilon,\ \tilde\delta(t) + \varepsilon)$, which contradicts $|\tilde\delta(t_j) - \tilde\delta(t)| > \varepsilon$. If $t_j \downarrow t$, then $\psi_{t_j}(\tilde\delta(t)) \ge \psi_{t_j}(\tilde\delta(t_j)) \ge \psi_t(\tilde\delta(t_j)) > \psi_t(\tilde\delta(t)) + \Delta$, and we have a contradiction: $\psi_{t_j}(\tilde\delta(t)) \not\to \psi_t(\tilde\delta(t))$.
Hence, $\tilde\delta(t)$ is continuous. $\Box$

Proof of Lemma 3. According to (12),

$$\int (\theta - \delta^*)\, d\pi(\theta) + \lambda \int_{\theta < \delta^*} (\theta - \delta^*)\, d\pi(\theta) = 0.$$

Differentiating implicitly, one obtains

$$\frac{d\, \delta^*(\lambda)}{d \lambda} = \frac{E_\pi\left[(\theta - \delta^*)\, I_{\theta < \delta^*}\right]}{1 + \lambda\, P_\pi\{\theta < \delta^*\}}.$$

Notice that $\lambda = 0$ corresponds to the symmetric squared-error loss $W = L$; hence $\delta^*(0) = \mu$. Also,

$$E_\pi\left[(\theta - \mu)\, I_{\theta < \mu}\right] = E_\pi(\theta I_{\theta < \mu}) - \mu\, E_\pi(I_{\theta < \mu}) = \mathrm{Cov}(\theta,\, I_{\theta < \mu}), \qquad (21)$$

and the first equality in (13) follows. Next, observe that

$$E_\pi(\theta - \mu) I_{\theta > \mu} + E_\pi(\theta - \mu) I_{\theta < \mu} = E_\pi(\theta - \mu) = 0 \qquad (22)$$

and

$$E_\pi(\theta - \mu) I_{\theta > \mu} - E_\pi(\theta - \mu) I_{\theta < \mu} = E_\pi|\theta - \mu|. \qquad (23)$$

Subtracting (23) from (22), one completes the proof. $\Box$

Proof of Lemma 4. For any $t$, let $\rho(t)$ denote the correlation coefficient between $\theta$ and $I_{\theta < t}$ under $\pi$. Then, similarly to (21), one obtains from (11)

$$\tilde\delta(t) - \delta^*(0) = \frac{\lambda\, \mathrm{Cov}(\theta,\, I_{\theta < t})}{1 + \lambda\, P_\pi\{\theta < t\}} = \rho(t)\, \sigma(\theta)\, \frac{\lambda \sqrt{P_\pi\{\theta < t\}\left(1 - P_\pi\{\theta < t\}\right)}}{1 + \lambda\, P_\pi\{\theta < t\}}.$$
Clearly, $\tilde\delta(t) \le \delta^*(0)$. Then, by Theorem 1, $\delta^*(\lambda)$ maximizes $|\tilde\delta(t) - \delta^*(0)|$ over all $t$, and

$$|\delta^*(\lambda) - \delta^*(0)| \le \sigma(\theta)\, \max_{0 \le p \le 1} \frac{\lambda \sqrt{p(1 - p)}}{1 + \lambda p} = \sigma(\theta)\, \frac{\lambda}{2\sqrt{1 + \lambda}}. \qquad \Box$$

Acknowledgment. The author is grateful to Professors A. L. Rukhin, R. J. Serfling, and K. Ostaszewski for their valuable comments on earlier versions of this paper.

References

[1] J. Aitchison and I. R. Dunsmore. Statistical Prediction Analysis. Cambridge University Press, London, 1975.

[2] M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes: Theory and Application. PTR Prentice-Hall, Inc., 1993.

[3] J. O. Berger. Statistical Decision Theory. Springer-Verlag, New York, 1985.

[4] J. O. Berger. Robust Bayesian analysis: sensitivity to the prior. J. Stat. Plann. Inf., 25:303–328, 1990.

[5] L. DeRobertis and J. A. Hartigan. Bayesian inference using intervals of measures. Ann. Statist., 9:235–244, 1981.

[6] R. V. Hogg and S. A. Klugman. Loss Distributions. Wiley, New York, 1984.
[7] E. L. Lehmann. Theory of Point Estimation. Wiley, New York, 1983.

[8] A. L. Rukhin. Change-point estimation under asymmetric loss. Statistics & Decisions, 15:141–163, 1995.

[9] J. Shao and S.-C. Chow. Constructing release targets for drug products: a Bayesian decision theory approach. Appl. Statist., 40:381–390, 1991.

[10] H. R. Varian. A Bayesian approach to real estate assessment. In S. E. Fienberg and A. Zellner, editors, Studies in Bayesian Econometrics and Statistics, pages 195–208. North-Holland, Amsterdam, 1975.

[11] A. Zellner. Bayesian estimation and prediction using asymmetric loss functions. J. Amer. Stat. Assoc., 81:446–451, 1986.

Michael Baron
Programs in Mathematical Sciences
University of Texas at Dallas
Richardson, TX 75083-0688