Robust Inference Although the statistical functions we have considered have intuitive interpretations, the question remains as to which distributional measures are most useful for describing a given distribution. In a simple case such as a normal distribution, the choices are obvious. For skewed distributions, or distributions that arise from mixtures of simpler distributions, the choices of useful distributional measures are not so obvious. A central concern in robust statistics is how a functional of a CDF behaves as the distribution is perturbed. If a functional is rather sensitive to small changes in the distribution, then one has more to worry about if the observations from the process of interest are contaminated with observations from some other process.
Sensitivity of Statistical Functions to Perturbations in the Distribution One of the most interesting things about a function (or a functional) is how its value varies as the argument is perturbed. Two key properties are continuity and differentiability. For the case in which the arguments are functions, the cardinality of the possible perturbations is greater than that of the continuum. We can be precise in discussions of continuity and differentiability of a functional $\Upsilon$ at a point (function) $F$ in a domain $\mathcal{F}$ by defining another set $\mathcal{D}$ consisting of difference functions over $\mathcal{F}$; that is, the set of functions $D = F_1 - F_2$ for $F_1, F_2 \in \mathcal{F}$.
Derivatives of Functionals The concept of differentiability for functionals is necessarily more complicated than for functions over real domains. For a functional $\Upsilon$ over the domain $\mathcal{F}$, we define three levels of differentiability at the function $F \in \mathcal{F}$. All definitions are in terms of a domain $\mathcal{D}$ of difference functions over $\mathcal{F}$, and a linear functional $\Lambda_F$ defined over $\mathcal{D}$ in a neighborhood of $F$. The first type of derivative is very general. The other two types depend on a metric $\rho$ on $\mathcal{F} \times \mathcal{F}$ induced by a norm on $\mathcal{F}$.
Derivatives of Functionals Gâteaux differentiable. $\Upsilon$ is Gâteaux differentiable at $F$ iff there exists a linear functional $\Lambda_F(D)$ over $\mathcal{D}$ such that for $t \in \mathbb{R}$ for which $F + tD \in \mathcal{F}$,
$$\lim_{t \to 0}\left(\frac{\Upsilon(F + tD) - \Upsilon(F)}{t} - \Lambda_F(D)\right) = 0.$$
ρ-Hadamard differentiable. $\Upsilon$ is $\rho$-Hadamard differentiable at $F$ iff there exists a linear functional $\Lambda_F(D)$ over $\mathcal{D}$ such that for any sequence $t_j \to 0$ in $\mathbb{R}$ and sequence $D_j \in \mathcal{D}$ such that $\rho(D_j, D) \to 0$ and $F + t_j D_j \in \mathcal{F}$,
$$\lim_{j \to \infty}\left(\frac{\Upsilon(F + t_j D_j) - \Upsilon(F)}{t_j} - \Lambda_F(D_j)\right) = 0.$$
ρ-Fréchet differentiable. $\Upsilon$ is $\rho$-Fréchet differentiable at $F$ iff there exists a linear functional $\Lambda_F(D)$ over $\mathcal{D}$ such that for any sequence $F_j \in \mathcal{F}$ for which $\rho(F_j, F) \to 0$,
$$\lim_{j \to \infty}\frac{\Upsilon(F_j) - \Upsilon(F) - \Lambda_F(F_j - F)}{\rho(F_j, F)} = 0.$$
Differentials of Functionals The linear functional $\Lambda_F$ is called the [Gâteaux | ρ-Hadamard | ρ-Fréchet] differential of $\Upsilon$ at $F$.
Perturbations In statistical applications using functionals defined on the CDF, we often consider a simple type of function in the neighborhood of the CDF. These are CDFs formed by adding a single mass point to the given distribution. For a given CDF $P(y)$, we can define a simple perturbation as
$$P_{x,\epsilon}(y) = (1 - \epsilon)P(y) + \epsilon I_{[x,\infty)}(y), \qquad (1)$$
where $0 \le \epsilon \le 1$. We will refer to this distribution as an ɛ-mixture distribution. The distribution with CDF $P$ is the reference distribution. (This, of course, is the distribution of interest, so I often refer to it without any qualification.)
Perturbations A simple interpretation of the perturbation in equation (1) is that it is the CDF of a mixture of a distribution with CDF $P$ and a degenerate distribution with a single mass point at $x$, which may or may not be in the support of the distribution. The extent of the perturbation depends on $\epsilon$; if $\epsilon = 0$, the distribution is the reference distribution. If the distribution with CDF $P$ is continuous with PDF $p$, the PDF of the mixture is
$$\mathrm{d}P_{x,\epsilon}(y)/\mathrm{d}y = (1 - \epsilon)p(y) + \epsilon\,\delta(y - x),$$
where $\delta(\cdot)$ is the Dirac delta function. If the distribution is discrete, the probability mass function has nonzero probabilities (scaled by $(1 - \epsilon)$) at each of the mass points associated with $P$ together with a mass point at $x$ with probability $\epsilon$.
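As a concrete illustration (a minimal sketch, not part of the original development: it assumes a standard normal reference distribution and the availability of NumPy and SciPy), the ɛ-mixture CDF of equation (1) can be evaluated directly, and the jump of size ɛ at the mass point $x$ is visible numerically:

```python
import numpy as np
from scipy.stats import norm

def mixture_cdf(y, x, eps, ref_cdf=norm.cdf):
    """CDF of the eps-mixture: P_{x,eps}(y) = (1-eps)P(y) + eps*I_[x,inf)(y)."""
    y = np.asarray(y, dtype=float)
    return (1 - eps) * ref_cdf(y) + eps * (y >= x)

# The mixture CDF jumps by eps at the mass point x:
x, eps = 1.25, 0.1
below = mixture_cdf(x - 1e-9, x, eps)   # just below the mass point
at = mixture_cdf(x, x, eps)             # at the mass point
```

For a continuous reference CDF the jump `at - below` is ɛ up to the (negligible) change in the continuous component over the interval of width 1e-9.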
PDFs and the CDF of the ɛ-mixture Distribution [Figure: the left-hand graph shows the PDF of a continuous reference distribution, $p(y)$ (solid line), and the PDF of the ɛ-mixture distribution, $(1-\epsilon)p(y)$ (dotted line), together with the mass point of probability $\epsilon$ at $x$; the right-hand graph shows the corresponding CDF $P_{x,\epsilon}(y)$, which jumps by $\epsilon$ at $x$.]
Perturbations A statistical function evaluated at $P_{x,\epsilon}$ compared to the function evaluated at $P$ allows us to determine the effect of the perturbation on the statistical function. For example, we can determine the mean of the distribution with CDF $P_{x,\epsilon}$ in terms of the mean $\mu$ of the reference distribution to be $(1-\epsilon)\mu + \epsilon x$. This is easily seen by thinking of the distribution as a mixture. For example, for the $M$ functional we have
$$M(P_{x,\epsilon}) = \int y \,\mathrm{d}\big((1-\epsilon)P(y) + \epsilon I_{[x,\infty)}(y)\big) = (1-\epsilon)\int y \,\mathrm{d}P(y) + \epsilon\int y\,\delta(y - x)\,\mathrm{d}y = (1-\epsilon)\mu + \epsilon x. \qquad (2)$$
Perturbations For a discrete distribution we would follow the same steps using summations (instead of an integral of y times a Dirac delta function, we just have a point mass of 1 at x), and would get the same result.
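The discrete case can be checked directly by summation (a minimal sketch; the support points and probabilities below are an assumed illustrative example, not from the text). The mixture mean computed point by point agrees with the closed form $(1-\epsilon)\mu + \epsilon x$ of equation (2):

```python
import numpy as np

# Assumed discrete reference distribution: support points and probabilities.
support = np.array([0.0, 1.0, 2.0, 5.0])
probs = np.array([0.2, 0.3, 0.4, 0.1])
mu = np.dot(support, probs)   # mean of the reference distribution

def mixture_mean(x, eps):
    """Mean of the eps-mixture: rescale the reference mass by (1-eps),
    add a point mass eps at x, and sum y * Pr(y) directly."""
    pts = np.append(support, x)
    wts = np.append((1 - eps) * probs, eps)
    return np.dot(pts, wts)

x, eps = 10.0, 0.05
direct = mixture_mean(x, eps)           # summation over mass points
formula = (1 - eps) * mu + eps * x      # closed form from equation (2)
```

The two agree to floating-point precision, as the derivation says they must.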
Quantiles under Perturbations The $\pi$ quantile of the mixture distribution, $\Xi_\pi(P_{x,\epsilon}) = P_{x,\epsilon}^{-1}(\pi)$, is somewhat more difficult to work out. This quantile, which we will call $q$, is shown relative to the $\pi$ quantile $y_\pi$ of the continuous reference distribution for two cases.
Quantiles under Perturbations For example, if the reference distribution is a standard normal, $\pi = 0.7$, so $y_\pi = 0.52$, and $\epsilon = 0.1$, we have the graphs. [Figure: two graphs of the scaled PDF $(1-\epsilon)p(y)$ with the mass point of probability $\epsilon$; in the left-hand graph the mass point is at $x_1 = -1.25$, below the quantile (ordering $x_1 < q < y_\pi$); in the right-hand graph it is at $x_2 = 1.25$, above the quantile (ordering $y_\pi < q < x_2$).]
Quantiles under Perturbations We see that in the case of a continuous reference distribution (implying $P$ is strictly increasing),
$$P_{x,\epsilon}^{-1}(\pi) = \begin{cases} P^{-1}\!\left(\dfrac{\pi - \epsilon}{1 - \epsilon}\right), & \text{for } (1-\epsilon)P(x) + \epsilon < \pi, \\[1ex] x, & \text{for } (1-\epsilon)P(x) \le \pi \le (1-\epsilon)P(x) + \epsilon, \\[1ex] P^{-1}\!\left(\dfrac{\pi}{1 - \epsilon}\right), & \text{for } \pi < (1-\epsilon)P(x). \end{cases} \qquad (3)$$
The conditions in equation (3) can also be expressed in terms of $x$ and quantiles of the reference distribution. For example, the first condition is equivalent to $x < y_{(\pi-\epsilon)/(1-\epsilon)}$.
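The three cases of equation (3) translate directly into code (a minimal sketch, assuming a standard normal reference distribution and SciPy; the numerical values reuse the example above with $\pi = 0.7$, $\epsilon = 0.1$, and mass points at $\pm 1.25$):

```python
from scipy.stats import norm

def mixture_quantile(pi, x, eps, ref_cdf=norm.cdf, ref_ppf=norm.ppf):
    """Quantile of the eps-mixture, following the three cases of equation (3)."""
    Px = ref_cdf(x)
    if (1 - eps) * Px + eps < pi:       # mass point lies below the quantile
        return ref_ppf((pi - eps) / (1 - eps))
    if pi < (1 - eps) * Px:             # mass point lies above the quantile
        return ref_ppf(pi / (1 - eps))
    return x                            # the quantile is at the mass point itself

# Standard normal reference, pi = 0.7, eps = 0.1:
q_low = mixture_quantile(0.7, -1.25, 0.1)   # mass below: q is pulled below y_pi
q_high = mixture_quantile(0.7, 1.25, 0.1)   # mass above: q is pushed above y_pi
```

In both cases the mixture CDF evaluated at the returned quantile equals $\pi$ exactly, which is a convenient sanity check on the case logic.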
The Influence Function The extent of the perturbation depends on $\epsilon$, and so we are interested in the relative effect; in particular, the relative effect as $\epsilon$ approaches zero. The influence function for the functional $\Upsilon$ and the CDF $P$, defined at $x$ as
$$\phi_{\Upsilon,P}(x) = \lim_{\epsilon \downarrow 0} \frac{\Upsilon(P_{x,\epsilon}) - \Upsilon(P)}{\epsilon} \qquad (4)$$
if the limit exists, is a measure of the sensitivity of the distributional measure defined by $\Upsilon$ to a perturbation of the distribution at the point $x$. The influence function is also called the influence curve, and denoted by IC. The limit is the right-hand Gâteaux derivative of the functional $\Upsilon$ at $P$ and $x$.
The Influence Function The influence function can also be expressed as the limit of the derivative of $\Upsilon(P_{x,\epsilon})$ with respect to $\epsilon$:
$$\phi_{\Upsilon,P}(x) = \lim_{\epsilon \downarrow 0} \frac{\partial}{\partial \epsilon} \Upsilon(P_{x,\epsilon}). \qquad (5)$$
This form is often more convenient for evaluating the influence function.
The Influence Function for the M Functional Some influence functions are easy to work out, for example, the influence function for the $M$ functional that defines the mean of a distribution, which we denote by $\mu$. The influence function for this functional operating on the CDF $P$ at $x$ is
$$\phi_{\mu,P}(x) = \lim_{\epsilon \downarrow 0} \frac{M(P_{x,\epsilon}) - M(P)}{\epsilon} = \lim_{\epsilon \downarrow 0} \frac{(1-\epsilon)\mu + \epsilon x - \mu}{\epsilon} = x - \mu. \qquad (6)$$
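The limit in equation (4) can be checked numerically for the mean (a minimal sketch; the finite-difference step `eps=1e-6` and the test values of `x` and `mu` are assumed for illustration). Because $M(P_{x,\epsilon})$ is linear in $\epsilon$, the difference quotient equals $x - \mu$ exactly, for any $\epsilon$:

```python
def mixture_mean(x, eps, mu=0.0):
    """Mean of the eps-mixture of a reference distribution with mean mu
    and a point mass at x; this is equation (2)."""
    return (1 - eps) * mu + eps * x

def influence_mean(x, mu=0.0, eps=1e-6):
    """Finite-difference approximation to the influence function (4)
    for the mean functional."""
    return (mixture_mean(x, eps, mu) - mu) / eps

# Equation (6) says the result is x - mu: linear and unbounded in x.
```

Evaluating `influence_mean` at increasingly extreme `x` exhibits the unboundedness noted below: the influence grows without bound as the contaminating point moves away from $\mu$.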
The Influence Function We note that the influence function of a functional is a type of derivative of the functional, $\partial M(P_{x,\epsilon})/\partial \epsilon$. The influence function for other moments can be computed in the same way. Note that the influence function for the mean is unbounded in $x$; that is, it increases or decreases without bound as $x$ increases or decreases without bound. Note also that this result is the same for multivariate or univariate distributions.
The Influence Function for Quantiles The influence function for a quantile is more difficult to work out. The problem arises from the difficulty in evaluating the quantile. As I informally described the distribution with CDF $P_{x,\epsilon}$, it is a mixture of some given distribution and a degenerate discrete distribution. Even if the reference distribution is continuous, the CDF of the mixture, $P_{x,\epsilon}$, does not have an inverse over the full support (although for quantiles we will write $P_{x,\epsilon}^{-1}$). Let us consider a simple instance: a univariate continuous reference distribution, and assume $p(y_\pi) > 0$. We approach the problem by considering the PDF, or the probability mass function.
The Influence Function for Quantiles In the left-hand graph of the second figure, the total probability mass up to the point $y_\pi$ is $(1-\epsilon)$ times the area under the curve, that is, $(1-\epsilon)\pi$, plus the mass at $x_1$, that is, $\epsilon$. Assuming $\epsilon$ is small enough, the $\pi$ quantile of the ɛ-mixture distribution is the $(\pi-\epsilon)/(1-\epsilon)$ quantile of the reference distribution, or $P^{-1}\!\left(\frac{\pi-\epsilon}{1-\epsilon}\right)$, consistent with the first case of equation (3). Equivalently, it is the point at which the scaled component $(1-\epsilon)p(y)$ has accumulated probability $\pi - \epsilon$, so that this mass together with the mass $\epsilon$ at $x_1$ totals $\pi$. Use of the definitions is somewhat messy. It is more straightforward to differentiate $P_{x_1,\epsilon}^{-1}$ and take the limit.
The Influence Function for Quantiles For fixed $x < y_\pi$, we have
$$\frac{\partial}{\partial \epsilon} P^{-1}\!\left(\frac{\pi - \epsilon}{1 - \epsilon}\right) = \frac{1}{p\!\left(P^{-1}\!\left(\frac{\pi - \epsilon}{1 - \epsilon}\right)\right)} \cdot \frac{\pi - 1}{(1 - \epsilon)^2}.$$
Likewise, we take the derivatives for the other cases, and then take limits. We get
$$\phi_{\Xi_\pi,P}(x) = \begin{cases} \dfrac{\pi - 1}{p(y_\pi)}, & \text{for } x < y_\pi, \\[1ex] 0, & \text{for } x = y_\pi, \\[1ex] \dfrac{\pi}{p(y_\pi)}, & \text{for } x > y_\pi. \end{cases}$$
The Influence Function for Quantiles Notice that the actual value of $x$ is not in the influence function; only whether $x$ is less than, equal to, or greater than the quantile. Notice also that, unlike the influence function for the mean, the influence function for a quantile is bounded; hence, a quantile is less sensitive than the mean to perturbations of the distribution. Likewise, quantile-based measures of scale and skewness are less sensitive than the moment-based measures to perturbations of the distribution. The $L_J$ and $M_\rho$ functionals, depending on $J$ or $\rho$, can also be very insensitive to perturbations of the distribution.
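The limiting values $(\pi-1)/p(y_\pi)$ and $\pi/p(y_\pi)$ can be verified by finite differences on the mixture-quantile cases of equation (3) (a minimal sketch, assuming a standard normal reference distribution and SciPy; the step `eps=1e-6` and the mass points $\pm 1.25$ are assumed for illustration):

```python
from scipy.stats import norm

def quantile_influence_fd(x, pi, eps=1e-6):
    """Finite-difference approximation to the influence function of the
    pi-quantile under a standard normal reference distribution, using
    the three cases of equation (3) for the mixture quantile."""
    Px = norm.cdf(x)
    if (1 - eps) * Px + eps < pi:       # mass point below the quantile
        q = norm.ppf((pi - eps) / (1 - eps))
    elif pi < (1 - eps) * Px:           # mass point above the quantile
        q = norm.ppf(pi / (1 - eps))
    else:                               # quantile at the mass point
        q = x
    return (q - norm.ppf(pi)) / eps

pi = 0.7
phi_below = quantile_influence_fd(-1.25, pi)   # should approach (pi-1)/p(y_pi)
phi_above = quantile_influence_fd(1.25, pi)    # should approach pi/p(y_pi)
```

Both approximations are bounded regardless of how extreme the mass point is, in contrast with the influence function of the mean.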
The mean and variance of the influence function at a random point are of interest; in particular, we may wish to restrict the functional so that
$$\mathrm{E}\big(\phi_{\Upsilon,P}(X)\big) = 0$$
and
$$\mathrm{E}\big((\phi_{\Upsilon,P}(X))^2\big) < \infty.$$
Sensitivity of Estimators Based on Statistical Functions If a distributional measure of interest is defined on the CDF as $\Upsilon(P)$, we are interested in the performance of the plug-in estimator $\Upsilon(P_n)$; specifically, we are interested in $\Upsilon(P_n) - \Upsilon(P)$. This turns out to depend crucially on the differentiability of $\Upsilon$. If we assume Gâteaux differentiability, we can write
$$\sqrt{n}\,\big(\Upsilon(P_n) - \Upsilon(P)\big) = \Lambda_P\big(\sqrt{n}(P_n - P)\big) + R_n = \frac{1}{\sqrt{n}} \sum_i \phi_{\Upsilon,P}(Y_i) + R_n,$$
where the remainder $R_n \to 0$.
Convergence of Estimators We are interested in the stochastic convergence. First, we assume $\mathrm{E}(\phi_{\Upsilon,P}(X)) = 0$ and $\mathrm{E}\big((\phi_{\Upsilon,P}(X))^2\big) < \infty$. Then the question is the stochastic convergence of $R_n$. Gâteaux differentiability does not guarantee that $R_n$ converges fast enough. However, $\rho$-Hadamard differentiability does imply that $R_n$ is $o_P(1)$, because it implies that norms of functionals (with or without random arguments) go to 0. We can also get that $R_n$ is $o_P(1)$ by assuming $\Upsilon$ is $\rho$-Fréchet differentiable and that $\sqrt{n}\,\rho(P_n, P)$ is $O_P(1)$.
Convergence of Estimators Assuming either $\rho$-Hadamard or $\rho$-Fréchet differentiability, given the moment properties of $\phi_{\Upsilon,P}(X)$ and that $R_n$ is $o_P(1)$, we have by Slutsky's theorem
$$\sqrt{n}\,\big(\Upsilon(P_n) - \Upsilon(P)\big) \xrightarrow{d} \mathrm{N}\big(0, \sigma^2_{\Upsilon,P}\big),$$
where $\sigma^2_{\Upsilon,P} = \mathrm{E}\big((\phi_{\Upsilon,P}(X))^2\big)$.
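This asymptotic result can be illustrated by Monte Carlo (a sketch, not part of the original development: the sample size, replication count, and RNG seed are assumed). For the median of a standard normal, the quantile influence function gives $\sigma^2_{\Upsilon,P} = \pi(1-\pi)/p(y_\pi)^2 = (1/4)/p(0)^2$, and the scaled variance of the sample median should be close to this value:

```python
import numpy as np
from scipy.stats import norm

# Monte Carlo check of the asymptotic variance of the plug-in median.
rng = np.random.default_rng(42)
n, reps = 400, 2000
samples = rng.standard_normal((reps, n))
medians = np.median(samples, axis=1)           # plug-in estimator per replicate

emp_var = n * medians.var()                    # empirical var of sqrt(n)*(median - 0)
theory = 0.25 / norm.pdf(0.0) ** 2             # pi(1-pi)/p(y_pi)^2 with pi = 1/2
```

The theoretical value is $\pi/2 \approx 1.571$, and the empirical value agrees up to Monte Carlo error; the agreement improves as `n` and `reps` grow.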
Asymptotic Variance of Estimators For a given plug-in estimator based on the statistical function $\Upsilon$, knowing $\mathrm{E}\big((\phi_{\Upsilon,P}(X))^2\big)$ (and assuming $\mathrm{E}(\phi_{\Upsilon,P}(X)) = 0$) provides us with an estimator of the asymptotic variance of the estimator.
Robust Estimators The influence function is very important in leading us to estimators that are robust; that is, to estimators that are relatively insensitive to departures from the underlying assumptions about the distribution. As mentioned above, the functionals $L_J$ and $M_\rho$, depending on $J$ or $\rho$, can be very insensitive to perturbations of the distribution; therefore estimators based on them, called L-estimators and M-estimators, can be robust. A class of L-estimators that are particularly useful are linear combinations of the order statistics. Because of the sufficiency and completeness of the order statistics in many cases of interest, such estimators can be expected to exhibit good statistical properties.
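A simple example of an L-estimator is the trimmed mean: a linear combination of the central order statistics with zero weight on the extremes. A minimal sketch (the data values, including the gross outlier, are assumed for illustration), using SciPy's `trim_mean`:

```python
import numpy as np
from scipy.stats import trim_mean

# Assumed data: five well-behaved observations plus one gross outlier.
data = np.array([1.1, 0.9, 1.0, 1.2, 0.8, 50.0])

plain = data.mean()                              # pulled far off by the outlier
robust = trim_mean(data, proportiontocut=0.2)    # drop 20% from each tail
```

The untrimmed mean is dragged toward the contaminating point, while the trimmed mean stays near the bulk of the data, reflecting the bounded influence of the central order statistics.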
Robust Estimators Another class of estimators similar to the L-estimators are those based on ranks, which are simpler than order statistics. These are not sufficient (the data values have been converted to their ranks); nevertheless, they preserve a lot of the information. The fact that they lose some information can actually work in their favor; they can be robust to extreme values of the data. A functional to define even a simple linear combination of ranks is rather complicated. As with the $L_J$ functional, we begin with a function $J$, which in this case we require to be strictly increasing, and also, in order to ensure uniqueness, we require that the CDF $P$ be strictly increasing.
$R_J$ Estimators The $R_J$ functional is defined as the solution to the equation
$$\int J\left(\frac{P(y) + 1 - P(2R_J(P) - y)}{2}\right) \mathrm{d}P(y) = 0. \qquad (7)$$
A functional defined as the solution to equation (7) is called an $R_J$ functional, and an estimator based on applying it to an ECDF is called an $R_J$ estimator or just an R-estimator.
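A classical example of an R-estimator of location is the Hodges-Lehmann estimator, the median of all pairwise averages, which is associated with the Wilcoxon scores (the connection to a particular choice of $J$ in equation (7) is stated here as an assumption, not derived). A minimal sketch with assumed illustrative data:

```python
import numpy as np

def hodges_lehmann(x):
    """Hodges-Lehmann estimator of location: the median of all pairwise
    averages (x_i + x_j)/2 over i <= j, a classical R-estimator."""
    x = np.asarray(x, dtype=float)
    i, j = np.triu_indices(len(x))        # index pairs with i <= j, diagonal included
    return np.median((x[i] + x[j]) / 2.0)

# Assumed data with one gross outlier; the estimate stays near the bulk.
data = np.array([1.0, 2.0, 3.0, 100.0])
```

Like the quantile-based estimators, it has bounded influence: a single wild observation moves the estimate only a limited amount.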