arxiv: v2 [stat.me] 24 Dec 2017

Bayesian Robustness to Outliers in Linear Regression

Philippe Gagnon, Université de Montréal, Canada
Alain Desgagné, Université du Québec à Montréal, Canada
and Mylène Bédard, Université de Montréal, Canada

Address for correspondence: Philippe Gagnon, Département de mathématiques et de statistique, Université de Montréal, C.P. 6128, Succursale Centre-Ville, Montréal, Québec, Canada, H3C 3J7. gagnonp@dms.umontreal.ca

ABSTRACT. Whole robustness is an appealing attribute to look for in statistical models. It implies that the impact of outliers, defined here as the observations that are not in line with the general trend, gradually vanishes as they move further and further away from this trend. Consequently, the conclusions obtained are consistent with the bulk of the data. Nonrobust models may lead to inferences that are not in line with either the outliers or the nonoutliers. In this paper, we make the following contribution: we generalise existing whole robustness results for simple linear regression through the origin to the usual linear regression model. The strategy to attain whole robustness is simple: replace the traditional normal assumption on the error term by a super heavy-tailed distribution assumption. Analyses are then conducted as usual, and typical methods of inference such as statistical hypothesis testing are available.

Key words: Bayesian inference, Built-in robustness, Maximum likelihood estimation, Super heavy-tailed distributions, Whole robustness.

1. Introduction

The linear regression model is commonly used to predict values of a dependent variable through auxiliary information (a set of observations from explanatory variables), or to quantify the strength of the relationship between the dependent variable and the explanatory variables. The underlying assumption in this model is that the explanatory variables are linearly related to the dependent variable, with an error term that traditionally is normally distributed. It is well understood that conflicting sources of information may contaminate the inference when normality is assumed. From a Bayesian perspective, these conflicting sources may represent the prior and outliers. We focus on the latter in this paper. A conflict therefore represents the fact that a group of observations produces a rather different inference than that arising from the bulk of the data and the prior. Under the normality assumption, this may lead to the following undesirable effect caused by the light normal tails: the posterior concentrates in an area that is not supported by any source of information in order to incorporate them all. In this context of robustness against outliers in linear regression, a contaminated inference corresponds to predictions that are not in line with either the outliers or the nonoutliers, and to a misleading quantification of the strength of the relationship between the dependent variable and the explanatory variables. We believe that the appropriate way to address the problem is to limit the impact of outliers in order to obtain conclusions that are consistent with the majority of the observations.

Box and Tiao (1968) were the first to propose a Bayesian solution for dealing with outliers in linear regression. They suggested letting the error term be distributed as a mixture of two normals: one component for the nonoutliers and the other one, with a larger variance, for the outliers. This approach was generalised by West (1984), who modelled errors with heavy-tailed distributions constructed as scale mixtures of normals, which include the Student distribution. More recently, Peña et al. (2009) introduced a different robust Bayesian approach in which the weight of each observation decreases with the distance between this observation and the bulk of the data. The authors proved that the Kullback-Leibler divergence between the posterior arising from the nonoutliers only and the posterior arising from the sample containing outliers is bounded. This result essentially indicates that these two posterior densities can be different, but that their ratio is bounded.

The strategies of Box and Tiao (1968) and West (1984) are appealing because of their simplicity. There is however a lack of theoretical results that validate the underlying ideas. In fact, heavy-tailed distributions such as the Student only yield partial robustness (see the results of Andrade and O'Hagan (2011) in the context of the location-scale model), which means that outliers have, as they approach plus or minus infinity, a significant but limited impact on the inference. Consequently, we believe that there is still a need for simple strategies with evidence of their validity.

In this paper, we aim at filling this gap through an approach as simple as those of Box and Tiao (1968) and West (1984), supported by theoretical results ensuring whole robustness. Like these authors, we replace the traditional normal assumption on the error term by a distribution assumption that accommodates the presence of outliers. The tails of the distribution are however heavier than those of heavy-tailed distributions; they are super heavy tails (e.g. log-Pareto or log-Student tails). We borrowed the idea from Desgagné and Gagnon (2017), in which simple linear regressions through the origin are considered and whole robustness is proved. Our work is thus aligned with the theory of conflict resolution in Bayesian statistics, as described by O'Hagan and Pericchi (2012) in their extensive literature review on that topic.

In the following, we first present the general model (without any specific distribution assumption on the error term) in Section 2.1. We then describe the super heavy-tailed distributions that we use to attain whole robustness in Section 2.2. They are log-regularly varying distributions. The linear regression model with an error term assumed to follow this type of distribution is characterised by its built-in robustness that resolves conflicts in a sensitive and automatic way. This shall be witnessed in Section 2.3, in which the robustness results are presented. The key result is the convergence of the posterior distribution (arising from the whole sample) towards the posterior arising from the nonoutliers only, when the outliers approach plus or minus infinity. This corresponds to whole robustness and indicates that users can conduct their analyses as usual, estimating parameters with posterior medians and Bayesian credible intervals for instance. The result follows from the pointwise convergence of the posterior density towards that arising from the nonoutliers only. The posterior density being proportional to the likelihood function if a prior proportional to 1 is used, we are able to establish the robustness of the maximum likelihood estimates. Our strategy is therefore also useful under the frequentist paradigm.

In Section 3, we illustrate the practical implications of the theoretical results presented in Section 2.3 through a numerical study. In particular, we show the relevance of our approach in the analysis of the relationship between the dependent variable and the explanatory variables. The relationship between two stock market indexes is studied. We also evaluate the accuracy of the estimates arising from the robust multiple linear regression model. Our robust model is compared with the nonrobust (with the normal assumption) and partially robust (with the Student distribution assumption) models. It is shown that our model performs as well as the nonrobust and partially robust ones in the absence of outliers, in addition to being completely robust. It indicates that, by simply changing the assumption on the error term, we obtain adequate estimates whether there are outliers or not.

2. Resolution of Conflicts in Linear Regression

2.1. Model

(i) Let $Y_1, \ldots, Y_n \in \mathbb{R}$ be $n$ random variables representing data points from the dependent variable and $x_1^T := (1, x_{12}, \ldots, x_{1p}), \ldots, x_n^T := (1, x_{n2}, \ldots, x_{np})$ be $n$ vectors of observations from the explanatory variables, where $p \in \{2, 3, \ldots\}$, $n \ge p + 1$ and $x_{ij} \in \mathbb{R}$ are assumed to be known. We want to study the relationship between the dependent variable and the explanatory variables assuming that the following model is suitable:

$$Y_i = x_i^T \beta + \epsilon_i, \quad i = 1, \ldots, n,$$

where the $n$ random variables $\epsilon_1, \ldots, \epsilon_n \in \mathbb{R}$ and the $p$-dimensional random variable $\beta := (\beta_1, \ldots, \beta_p)^T \in \mathbb{R}^p$ represent the errors and regression coefficients, respectively. These $n + 1$ random variables are conditionally independent given $\sigma > 0$, the scale parameter, with a conditional density for $\epsilon_i$ given by

$$\epsilon_i \mid \beta, \sigma \overset{D}{=} \epsilon_i \mid \sigma \overset{D}{\sim} \frac{1}{\sigma} f\!\left(\frac{\epsilon_i}{\sigma}\right), \quad i = 1, \ldots, n.$$

(ii) We assume that $f$ is a strictly positive continuous probability density function on $\mathbb{R}$ that is symmetric with respect to the origin, and that is such that both tails of $|z| f(z)$ are monotonic (see Section 5 for the detailed definition of monotonicity), which implies that the tails of $f(z)$ are also monotonic. These assumptions on $f$ imply that $f(z)$ and $|z| f(z)$ are bounded on the real line, and both converge to 0 as $|z| \to \infty$. Note that the function $f(z)$ can have parameters, e.g. a shape parameter; however, their value is assumed to be known.

(iii) We assume that the joint prior density of $\beta$ and $\sigma$, denoted $\pi(\beta, \sigma)$, is bounded on $\sigma > 1$, and is such that $\pi(\beta, \sigma)/(1/\sigma)$ is bounded on $0 < \sigma \le 1$, for all $\beta \in \mathbb{R}^p$. Together, these assumptions are equivalent to: $\pi(\beta, \sigma)/\max(1, 1/\sigma)$ is bounded on $\sigma > 0$, for all $\beta \in \mathbb{R}^p$. A large variety of priors fit within this assumed structure; for instance, this is the case for all proper densities. In addition, non-informative priors such as $\pi(\beta, \sigma) \propto 1/\sigma$, the usual one for this type of random variables, and $\pi(\beta, \sigma) \propto 1$ satisfy these assumptions.

Inference on the parameters $\beta$ and $\sigma$ helps us understand the relationship between the dependent and the explanatory variables. For simplicity, we assume that all explanatory variables are continuous.
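As an illustration of how the pieces of the model fit together, the following minimal sketch evaluates the unnormalised log-posterior implied by Bayes' rule: the log-prior plus the sum over observations of the log of $(1/\sigma) f((y_i - x_i^T\beta)/\sigma)$. The function names, the default standard normal $f$ and the default prior $\pi(\beta, \sigma) \propto 1/\sigma$ are choices made here for illustration only; any log-density satisfying assumption (ii) could be passed instead.

```python
import math

def log_std_normal(z):
    """Log-density of the standard normal, used here as a default choice of f."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * z * z

def log_prior_inv_sigma(beta, sigma):
    """Log of the non-informative prior pi(beta, sigma) proportional to 1/sigma."""
    return -math.log(sigma)

def log_posterior(beta, sigma, X, y, log_f=log_std_normal, log_prior=log_prior_inv_sigma):
    """Unnormalised log-posterior: log prior + sum_i log[(1/sigma) f((y_i - x_i^T beta)/sigma)]."""
    if sigma <= 0:
        return -math.inf  # sigma is a scale parameter, so it must be positive
    total = log_prior(beta, sigma)
    for x_i, y_i in zip(X, y):
        residual = y_i - sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))
        total += -math.log(sigma) + log_f(residual / sigma)
    return total

# Toy data lying exactly on the line y = 1 + x (each row of X starts with 1 for the intercept).
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
y = [1.0, 2.0, 3.0]
print(log_posterior((1.0, 1.0), 1.0, X, y))
```

Working on the log scale avoids numerical underflow when the product over observations becomes very small, which matters once heavy-tailed choices of $f$ and large residuals are involved.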

Given that the scale parameter of the distribution of the error term is $\sigma$, homoscedasticity is an underlying assumption of the model. When the classical normality is assumed, $\sigma$ also represents the (conditional) standard deviation of each error $\epsilon_i$. Note that if in addition $\pi(\beta, \sigma) \propto 1$, the maximum a posteriori probability (MAP) estimator of $\beta$ is the well-known least squares estimator. The MAP estimator corresponds to the maximum likelihood estimator when the prior is proportional to 1. An important drawback of the classical normality assumption is that outliers have a significant impact on the estimation.

In this paper, we study robustness of the estimation of $\beta$ and $\sigma$. The objective is to find sufficient conditions to attain whole robustness, meaning that the impact of outliers gradually vanishes as they move further and further away from the trend emerging from the bulk of the data (this is the definition of outliers on which we focus in this paper). In other words, we want to attain robustness against observations with errors $\epsilon_i = y_i - x_i^T \beta$ approaching $+\infty$ or $-\infty$, where $\beta$ defines the probable hyperplanes for the bulk of the data. Our strategy to mathematically represent this situation is to let $y_i$ approach $+\infty$ or $-\infty$ for the outliers and to consider the vectors $x_1, \ldots, x_n$ as fixed. Studying this theoretical framework is indeed sufficient to attain, in practice, robustness against any type of outlier that fits our definition. Consider for instance an outlier with an extreme $x$ value, but a $y$ value comparable to those in the bulk of the data. This outlier can be viewed as an observation with a fixed $x$ value, but an extremely low (or high) $y$ value compared with the trend emerging from the bulk of the data.

We now set our mathematical framework for dealing with outliers. Among the $n$ observations of $Y_1, \ldots, Y_n$, denoted by $\mathbf{y}_n := (y_1, \ldots, y_n)$, we assume that $k$ of them form a group of nonoutlying observations, that we denote $\mathbf{y}_k$, while $l = n - k$ of them are considered as outliers. For $i = 1, \ldots, n$, we define the binary functions $k_i$ and $l_i$ as follows: if $y_i$ is a nonoutlying value, $k_i = 1$, and if it is an outlier, $l_i = 1$. These functions take the value 0 otherwise. Therefore, we have $k_i + l_i = 1$ for $i = 1, \ldots, n$, with $\sum_{i=1}^n k_i = k$ and $\sum_{i=1}^n l_i = l$. We assume that each outlier converges towards $-\infty$ or $+\infty$ at its own specific rate, to the extent that the ratio of two outliers is bounded. More precisely, we assume that $y_i = a_i + b_i \omega$, for $i = 1, \ldots, n$, where $a_i$ and $b_i$ are constants such that $a_i \in \mathbb{R}$ and (i) $b_i = 0$ if $k_i = 1$, (ii) $b_i \ne 0$ if $l_i = 1$, and we let $\omega \to \infty$.

Let the joint posterior density of $\beta$ and $\sigma$ be denoted by $\pi(\beta, \sigma \mid \mathbf{y}_n)$ and the marginal density of $(Y_1, \ldots, Y_n)$ be denoted by $m(\mathbf{y}_n)$, where

$$\pi(\beta, \sigma \mid \mathbf{y}_n) = [m(\mathbf{y}_n)]^{-1} \, \pi(\beta, \sigma) \prod_{i=1}^n \frac{1}{\sigma} f\!\left(\frac{y_i - x_i^T \beta}{\sigma}\right), \quad \beta \in \mathbb{R}^p, \ \sigma > 0.$$

Let the joint posterior density of $\beta$ and $\sigma$ arising from the nonoutlying observations only be denoted by $\pi(\beta, \sigma \mid \mathbf{y}_k)$ and the corresponding marginal density be denoted by $m(\mathbf{y}_k)$, where

$$\pi(\beta, \sigma \mid \mathbf{y}_k) = [m(\mathbf{y}_k)]^{-1} \, \pi(\beta, \sigma) \prod_{i=1}^n \left[\frac{1}{\sigma} f\!\left(\frac{y_i - x_i^T \beta}{\sigma}\right)\right]^{k_i}, \quad \beta \in \mathbb{R}^p, \ \sigma > 0.$$

Note that if the prior $\pi(\beta, \sigma)$ is proportional to 1, the likelihood functions, given by the product terms in the posteriors above, can also be expressed as follows:

$$L(\beta, \sigma \mid \mathbf{y}_n) = m(\mathbf{y}_n) \, \pi(\beta, \sigma \mid \mathbf{y}_n) \quad \text{and} \quad L(\beta, \sigma \mid \mathbf{y}_k) = m(\mathbf{y}_k) \, \pi(\beta, \sigma \mid \mathbf{y}_k). \tag{1}$$

Proposition 1. Consider the Bayesian context described in Section 2.1, and assume that $k > p + 1$. Then, the joint posterior densities $\pi(\beta, \sigma \mid \mathbf{y}_k)$ and $\pi(\beta, \sigma \mid \mathbf{y}_n)$ are proper. The condition $k \ge p + 1$ is sufficient if we assume that $\pi(\beta, \sigma)/(1/\sigma)$ is bounded on $\sigma > 0$ (instead of: $\pi(\beta, \sigma)$ is bounded on $\sigma > 1$ and $\pi(\beta, \sigma)/(1/\sigma)$ is bounded on $0 < \sigma \le 1$).

The proof of Proposition 1 is provided in Section 5.1. Note that, in fact, we only need $n > p + 1$ (or $n \ge p + 1$ if we assume that $\pi(\beta, \sigma)/(1/\sigma)$ is bounded on $\sigma > 0$) to guarantee that $\pi(\beta, \sigma \mid \mathbf{y}_n)$ is proper. The condition $k > p + 1$ (or $k \ge p + 1$) is stronger.

2.2. Log-Regularly Varying Distributions

As mentioned in the introduction, our approach to attain robustness is to replace the traditional normal assumption on the error term by a log-regularly varying distribution assumption. In this section, we provide an overview of such distributions. For more information, we refer the reader to Desgagné (2013) and Desgagné (2015).

Definition 1 (Log-regularly varying distribution). A random variable $Z$ with a symmetric density $f(z)$ is said to have a log-regularly varying distribution with index $\rho \ge 1$ if $z f(z)$ is log-regularly varying at $\infty$ with this same index $\rho$ (see Definition 2).

Definition 2 (Log-regularly varying function). We say that a measurable function $g$ is log-regularly varying at $\infty$ with index $\rho \in \mathbb{R}$, denoted $g \in L_\rho(\infty)$, if for all $\epsilon > 0$ and all $\tau \ge 1$, there exists a constant $A(\epsilon, \tau) > 1$ such that

$$z \ge A(\epsilon, \tau) \ \text{and} \ 1/\tau \le \nu \le \tau \ \Rightarrow \ \left| \nu^{\rho} \, g(z^{\nu})/g(z) - 1 \right| < \epsilon.$$

If $\rho = 0$, $g$ is said to be log-slowly varying at $\infty$.

Log-regularly varying functions form an interesting class of functions with useful properties for robustness. As indicated in Definition 2, they are such that $g \in L_\rho(\infty)$ if $g(z^{\nu})/g(z)$ converges towards $\nu^{-\rho}$ uniformly in any set $\nu \in [1/\tau, \tau]$ (for any $\tau \ge 1$) as $z \to \infty$, where $\rho \in \mathbb{R}$. This implies that for any $\rho \in \mathbb{R}$, we have $g \in L_\rho(\infty)$ if and only if there exists a constant $A > 1$ and a function $s \in L_0(\infty)$ such that for $z \ge A$, $g$ can be written as $g(z) = (\log z)^{-\rho} s(z)$. It means that a symmetric distribution with tails behaving like $(1/|z|)(\log |z|)^{-\rho}$ with $\rho \ge 1$ is a log-regularly varying distribution. An example of such a distribution is presented later in the paper.

2.3. Resolution of conflicts

We now present the main contribution of this paper, which is Theorem 1. This theorem contains our robustness results, which are stated for models using auxiliary information, i.e. for models with $p \ge 2$. The case $p = 1$ corresponds to the location-scale model, and we refer the reader to Desgagné (2015) for suitable assumptions under which robustness is attained in this specific situation.

Theorem 1. Consider the model and context described in Section 2.1. If we assume that

(i) $f$ is a log-regularly varying distribution, i.e. $z f(z) \in L_\rho(\infty)$ with $\rho \ge 1$,

(ii) $l \le \lfloor n/2 - (p - 1/2) \rfloor \Leftrightarrow k \ge \lceil n/2 + (p - 1/2) \rceil$ (i.e. the difference between the number of nonoutliers and the number of outliers, $k - l$, is at least $2(p - 1/2)$),

then, recalling that $y_i = a_i + b_i \omega$ with $b_i = 0$ for the nonoutliers and $b_i \ne 0$ for the outliers, we obtain the following results:

(a) $$\lim_{\omega \to \infty} \frac{m(\mathbf{y}_n)}{\prod_{i=1}^n [f(y_i)]^{l_i}} = m(\mathbf{y}_k),$$

(b) $$\lim_{\omega \to \infty} \pi(\beta, \sigma \mid \mathbf{y}_n) = \pi(\beta, \sigma \mid \mathbf{y}_k),$$ uniformly on $(\beta, \sigma) \in [-\lambda, \lambda]^p \times [1/\tau, \tau]$, for any $\lambda \ge 0$ and $\tau \ge 1$,

(c) $$\lim_{\omega \to \infty} \int_0^\infty \int_{\mathbb{R}^p} \left| \pi(\beta, \sigma \mid \mathbf{y}_n) - \pi(\beta, \sigma \mid \mathbf{y}_k) \right| d\beta \, d\sigma = 0,$$

(d) as $\omega \to \infty$, $$\beta, \sigma \mid \mathbf{y}_n \overset{D}{\to} \beta, \sigma \mid \mathbf{y}_k,$$ and in particular $\beta_i \mid \mathbf{y}_n \overset{D}{\to} \beta_i \mid \mathbf{y}_k$, $i = 1, \ldots, p$, and $\sigma \mid \mathbf{y}_n \overset{D}{\to} \sigma \mid \mathbf{y}_k$,

(e) $$\lim_{\omega \to \infty} \frac{m(\mathbf{y}_k)}{m(\mathbf{y}_n)} L(\beta, \sigma \mid \mathbf{y}_n) = L(\beta, \sigma \mid \mathbf{y}_k),$$ uniformly on $(\beta, \sigma) \in [-\lambda, \lambda]^p \times [1/\tau, \tau]$, for any $\lambda \ge 0$ and $\tau \ge 1$.

The proof of Theorem 1 is provided in Section 5.2. Theorem 1 is of great practical use considering the simplicity of its two sufficient conditions. Condition (i) indicates that modelling must be performed using a density $f$ with sufficiently heavy tails, and more precisely, using a log-regularly varying distribution (see Definition 1). Well-known heavy-tailed distributions, such as the Student distribution, do not satisfy this criterion. Desgagné (2015) introduced an appealing distribution family that belongs to the family of log-regularly varying distributions, and therefore satisfies condition (i): the family of log-Pareto-tailed symmetric distributions. In our opinion, the most appealing log-Pareto-tailed symmetric distribution is a distribution called the log-Pareto-tailed standard normal distribution, with parameters $\alpha > 1$ and $\phi > 1$. It exactly matches the standard normal on the interval $[-\alpha, \alpha]$, while having tails that behave like $(1/|z|)(\log |z|)^{-\phi}$ (which is a log-Pareto behaviour). This distribution shall be described in more detail in the numerical study in Section 3.

Condition (ii) indicates that there must be at most $\lfloor n/2 - (p - 1/2) \rfloor$ outliers, or equivalently, that there must be at least $\lceil n/2 + (p - 1/2) \rceil$ nonoutliers, where $\lfloor \cdot \rfloor$ and $\lceil \cdot \rceil$ are the floor and ceiling functions, respectively. For instance, consider a sample of size $n = 15$ with one explanatory variable (the simple linear regression model is therefore used, i.e. $p = 2$). Robustness is attained provided that there are $l = 6$ outliers or less (which leaves $k = 9$ nonoutliers or more). The breakdown point, which is the proportion of outliers that an estimator can handle, is therefore $6/15 = 0.4$ when $n = 15$. More generally, the condition $l \le \lfloor n/2 - (p - 1/2) \rfloor$ translates into a breakdown point of $\lfloor n/2 - (p - 1/2) \rfloor / n$. For fixed $p$, the breakdown point thus converges to 0.5 as $n \to \infty$, usually considered as the maximum desired value. Condition (ii) is generally satisfied in practice given that the number of outliers rarely reaches half the sample size minus $(p - 1/2)$.

Another feature of Theorem 1 that contributes to its practical use is that the results are easy to interpret. In result (a), the asymptotic behaviour of the marginal $m(\mathbf{y}_n)$ is described. This result is used in Section 3.1 to assess robustness of Bayes factors for testing $H_0: \beta_i = 0$ versus $H_1: \beta_i \ne 0$ (when $i \ge 2$) through the convergence $m(\mathbf{y}_n)/m(\mathbf{y}_n \mid \beta_i = 0) \to m(\mathbf{y}_k)/m(\mathbf{y}_k \mid \beta_i = 0)$ as $\omega \to \infty$. Result (a) is in fact the centrepiece of Theorem 1; its demonstration requires considerable work, and then leads relatively easily to the other conclusions of the theorem.

Result (b) indicates that the posterior density arising from the whole sample converges uniformly towards the posterior density arising from the nonoutliers only. The impact of outliers then gradually vanishes as they approach plus or minus infinity. Result (b) leads to result (c), stating that the posterior density arising from the whole sample converges in $L^1$ towards the posterior density arising from the nonoutlying observations only. This last result implies the following convergence: $P(\beta, \sigma \in E \mid \mathbf{y}_n) \to P(\beta, \sigma \in E \mid \mathbf{y}_k)$ as $\omega \to \infty$, uniformly for all sets $E \subset \mathbb{R}^p \times \mathbb{R}^+$. This result is slightly stronger than convergence in distribution (result (d)), which requires only pointwise convergence. The convergence of the posterior marginal distributions then follows.
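The arithmetic behind condition (ii) of Theorem 1 can be checked with a few lines of code. The helper names below are introduced here for illustration; they simply evaluate $\lfloor n/2 - (p - 1/2) \rfloor$, $\lceil n/2 + (p - 1/2) \rceil$ and the resulting breakdown point.

```python
import math

def max_outliers(n, p):
    """Largest number of outliers l allowed by condition (ii): floor(n/2 - (p - 1/2))."""
    return math.floor(n / 2 - (p - 0.5))

def min_nonoutliers(n, p):
    """Smallest number of nonoutliers k required by condition (ii): ceil(n/2 + (p - 1/2))."""
    return math.ceil(n / 2 + (p - 0.5))

def breakdown_point(n, p):
    """Proportion of outliers that can be handled: floor(n/2 - (p - 1/2)) / n."""
    return max_outliers(n, p) / n

# Simple linear regression (p = 2) with n = 15 observations.
print(max_outliers(15, 2), min_nonoutliers(15, 2), breakdown_point(15, 2))

# For fixed p, the breakdown point increases towards 0.5 as n grows.
for n in (15, 150, 1500, 15000):
    print(n, breakdown_point(n, 2))
```

The two bounds always add up to $n$, which is the equivalence between the two forms of condition (ii).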
Result (d) is the most practical result; it indicates that any estimation of $\beta$ and $\sigma$ based on posterior quantiles (e.g. using posterior medians and Bayesian credible intervals) is robust to outliers. Result (e) follows from result (b) and indicates that, for a given sample, the likelihood (up to a multiplicative constant that is independent of $\beta$ and $\sigma$) converges uniformly to the likelihood arising from the nonoutliers only. Consequently, the maximum of $L(\beta, \sigma \mid \mathbf{y}_n)$ converges to the maximum of $L(\beta, \sigma \mid \mathbf{y}_k)$, and therefore, the maximum likelihood estimate also converges, as $\omega \to \infty$.

3. Numerical Study

In this section, we aim at increasing the understanding of the practical implications of Theorem 1. This is first achieved, in Section 3.1, via an analysis of the relationship between two stock market indexes. Second, in Section 3.2, we study the behaviour of the posterior when we artificially move an observation towards plus or minus infinity. In other words, we visually represent the convergence of the posterior towards that arising from the nonoutliers only (result (d) of Theorem 1). In Section 3.3, a simulation study is conducted to evaluate the accuracy of the estimates arising from our robust linear regression model with two explanatory variables. In all the analyses, we compare the results obtained from the robust model with those from the nonrobust (with the normal assumption) and partially robust (with the Student distribution assumption) models.

3.1. Analysis of the Relationship Between Two Stock Market Indexes

We know from Theorem 1 that the impact of outliers is asymptotically null if we assume a super heavy-tailed distribution on the error term of the linear regression model. In other words, this robust model excludes outliers in the limit. In this section, we show that this model turns out to effectively limit the impact of outliers in practice. In particular, we show how this translates into adequate conclusions, contrary to the nonrobust model. The data set presented in Figure 1 and Table 1 is analysed for this purpose.

Fig. 1. January 2011 daily returns of S&P 500 ($y$) and S&P/TSX ($x$) with posterior medians under the nonrobust (blue solid line), partially robust (orange dashed line), and robust (green dotted line) models, respectively

Table 1. Returns for day $i$ of January 2011 for S&P 500 ($y_i$) and S&P/TSX ($x_i$) in percentages, $i = 1, \ldots, 19$

The data are therefore analysed using a simple linear regression where the daily returns of the S&P 500 and S&P/TSX play the role of the dependent and explanatory variables, respectively. We set the prior to be non-informative, and more precisely, $\pi(\beta, \sigma) \propto 1/\sigma$. For each of the three models considered (robust, partially robust and nonrobust), we first compute posterior medians and highest posterior density (HPD) intervals for the parameters. For the partially robust model, the degrees of freedom of the heavy-tailed Student distribution are arbitrarily set to 10, and a known scale parameter of 0.88 is added to this distribution in order to match the 2.5th and 97.5th percentiles with those of the standard normal. For the robust model, we assume that the error term has a log-Pareto-tailed standard normal distribution. Its density (depicted in Figure 2) is expressed as

$$f(x) = \begin{cases} (2\pi)^{-1/2} \exp(-x^2/2), & \text{if } |x| \le \alpha \ \text{(the standard normal part)}, \\ (2\pi)^{-1/2} \exp(-\alpha^2/2) \, \dfrac{\alpha}{|x|} \left( \dfrac{\log \alpha}{\log |x|} \right)^{\phi}, & \text{if } |x| > \alpha \ \text{(the log-Pareto tails)}. \end{cases} \tag{2}$$

We set $\alpha$ such that this distribution also has the same 2.5th and 97.5th percentiles as the standard normal, i.e. $\alpha = 1.96$. This implies that, according to the procedure described in Section 4 of Desgagné (2015), $\phi = 4.08$ (this procedure ensures that $f$ is a continuous probability density function). Therefore, all three error distributions studied in this section have 95% of their mass in the interval $[-1.96, 1.96]$.
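The density in (2) is straightforward to code. In the sketch below, the formula used for $\phi$ is derived here from the requirement that the density integrates to one: the two tails carry a total mass of $2 (2\pi)^{-1/2} \exp(-\alpha^2/2) \, \alpha \log \alpha / (\phi - 1)$, which must equal the standard normal mass outside $[-\alpha, \alpha]$. This is our reading of the normalisation constraint, not a transcription of the procedure in Desgagné (2015); the function names are illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1

def lptn_phi(alpha):
    """Tail index phi making the density in (2) integrate to one (requires alpha > 1)."""
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1 so that log(alpha) > 0")
    central_mass = 2 * _STD_NORMAL.cdf(alpha) - 1  # normal mass on [-alpha, alpha]
    return 1 + 2 * _STD_NORMAL.pdf(alpha) * alpha * math.log(alpha) / (1 - central_mass)

def lptn_pdf(x, alpha=1.96, phi=None):
    """Log-Pareto-tailed standard normal density of equation (2)."""
    if phi is None:
        phi = lptn_phi(alpha)
    abs_x = abs(x)
    if abs_x <= alpha:
        return _STD_NORMAL.pdf(x)  # standard normal part
    # log-Pareto tails, continuous with the normal part at |x| = alpha
    return _STD_NORMAL.pdf(alpha) * (alpha / abs_x) * (math.log(alpha) / math.log(abs_x)) ** phi

print(round(lptn_phi(1.96), 2))
print(lptn_pdf(10.0), _STD_NORMAL.pdf(10.0))
```

With $\alpha = 1.96$ the computed index rounds to the value of $\phi$ used in this section, and the last line shows how much more mass the log-Pareto tail places far from the origin than the normal tail does.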

Fig. 2. Comparison between the densities of the standard normal (blue solid line), Student with 10 degrees of freedom and a known scale parameter of 0.88 (orange dashed line), and log-Pareto-tailed standard normal with $\alpha = 1.96$ and $\phi = 4.08$ (green dotted line)

The parameter estimates are provided in Table 2. Contrary to the partially robust and robust models, the 95% HPD interval for $\beta_2$ of the nonrobust model includes 0. The posterior in this case therefore puts some mass around 0. We now perform Bayesian tests for hypotheses $H_0: \beta_2 = 0$ versus $H_1: \beta_2 \ne 0$ through the Bayes factor $m(\mathbf{y}_n)/m(\mathbf{y}_n \mid H_0)$ (i.e. the marginal density of $\mathbf{y}_n$ divided by the marginal density of $\mathbf{y}_n$ under $H_0$, which is the marginal density of $\mathbf{y}_n$ under the simple location-scale model). The Bayes factors are 3.69, 9.23, and 23. for the nonrobust, partially robust and robust models, respectively. Under the robust model, it seems clear that there is enough evidence to conclude that the relationship between the variables is statistically significant. Under the nonrobust model, it is not that clear.

We now redo the analysis while excluding observation 18 (an outlier, as can be seen in Figure 1) in order to show its impact. The results are presented in Table 3. The results are now relatively similar for all three models. In particular, the 95% HPD intervals for $\beta_2$ do not include 0. Also, the Bayes factors are 62.75, 5.6, and [...] for the nonrobust, partially robust, and robust models, respectively.

The impact of the outlier is much more significant on $\hat{\sigma}$ than on $\hat{\beta}$ (as can be seen by comparing the posterior medians in Table 2 with those in Table 3). This is due to the position of $(x_{18}, y_{18}) = (.2, .79)$, which is centred with respect to the $x$ values in the data set (as seen in Figure 1). Its presence therefore results in a higher probability for a larger scaling. The impact of the outlier is however not only on the marginal posterior distribution of $\sigma$; it propagates through the joint posterior of $\beta$ and $\sigma$, increasing the posterior scaling of $\beta$. This is especially true for the nonrobust model (and this is what led to the differences in the test results presented above), but this is also true for the partially robust model. Our approach is appealing because it provides whole robustness for all parameters. In this example, whole robustness leads to an inference that best reflects the behaviour of the bulk of the data, compared with the inferences arising from the nonrobust and partially robust models.

Note that, as mentioned in Section 2.3, the Bayes factor is a robust measure when it is assumed that the error term has a super heavy-tailed distribution such as the log-Pareto-tailed standard normal distribution. Indeed, result (a) of Theorem 1 states that the marginal $m(\mathbf{y}_n)$ behaves like $m(\mathbf{y}_k) \prod_{i=1}^n [f(y_i)]^{l_i}$. Furthermore, the marginal $m(\mathbf{y}_n \mid H_0)$ behaves like $m(\mathbf{y}_k \mid H_0) \prod_{i=1}^n [f(y_i)]^{l_i}$, because when the assumptions of Theorem 1 are satisfied, those of Theorem 1 in Desgagné (2015), ensuring whole robustness for the location-scale model, are also satisfied. As a result, the Bayes factor $m(\mathbf{y}_n)/m(\mathbf{y}_n \mid H_0)$ behaves like $m(\mathbf{y}_k)/m(\mathbf{y}_k \mid H_0)$.

Table 2. Posterior medians and 95% HPD intervals under the three models based on the analysis of the data set presented in Table 1

Assumptions on f            β1               β2               σ
Standard normal             .4 (.25, .34)    .4 (.2, .83)     .62 (.43, .88)
Student (10 d.f.)           . (.4, .35)      .4 (.7, .76)     .53 (.33, .8)
Log-Pareto-tailed normal    .3 (.9, .34)     .43 (.3, .72)    .42 (.26, .65)

Table 3. Posterior medians and 95% HPD intervals under the three models based on the analysis of the data set presented in Table 1, but excluding observation 18

Assumptions on f            β1               β2               σ
Standard normal             .5 (.3, .33)     .44 (.8, .69)    .37 (.25, .53)
Student (10 d.f.)           .6 (.2, .34)     .4 (.6, .68)     .39 (.25, .58)
Log-Pareto-tailed normal    .5 (.4, .33)     .43 (.7, .69)    .37 (.24, .54)

3.2. Convergence of the Posterior

In this section, we illustrate how the posterior behaves when we artificially move an observation towards plus or minus infinity, through the evolution of point estimates. More precisely, we use the same data set as in Section 3.1 (see Table 1 and Figure 1), and we gradually move the observation identified as an outlier in that section (i.e. $(x_{18}, y_{18})$) according to the following procedure: $y_{18}$ is gradually moved from the value .5 (a nonoutlier) to 6 (a large outlier), while $x_{18}$ remains fixed. In Section 3.1, we have observed the impact of an outlier centred with respect to the $x$ values in the data set. We therefore change the value of $x_{18}$ to [...] to illustrate the impact of an outlier located at one of the endpoints of the $x$ values. While we gradually move the observation, we estimate the parameters $\beta := (\beta_1, \beta_2)^T$ and $\sigma$ for each data set related to a different value of $y_{18}$ using posterior medians with a prior proportional to $1/\sigma$.
This whole process is performed under the three models (the nonrobust, partially robust, and wholly robust models with the same distribution assumptions as in Section 3.1).

We now present the results in Figure 3, in which $|y_{18}| = -y_{18}$ is defined as the x-axis (instead of $y_{18}$) so that larger values correspond to larger distances to the bulk of the data. We first notice that, for the robust model, the impact of the point $(x_{18}, y_{18})$ grows until $y_{18}$ reaches a threshold, and beyond this threshold, the impact vanishes as the observation approaches minus infinity. The threshold is around $|y_{18}| = .4$, and based on the data set with $|y_{18}| = .4$, we find $\hat{\beta}_1 =$ [...], $\hat{\beta}_2 = .54$ and $\hat{\sigma} = .43$. The point estimates converge towards .6, .43 and .36 for $\beta_1$, $\beta_2$ and $\sigma$, respectively, which are the point estimates when $(x_{18}, y_{18})$ is excluded from the sample. Whole robustness is, as expected, attained for both $\beta$ and $\sigma$. Note that an increase in the value of the parameter $\alpha$ would result in an increase in the value of the threshold. Setting $\alpha = 1.96$ seems suitable for practical use.

Fig. 3. Estimation of $\beta_1$, $\beta_2$ and $\sigma$ (one panel each) when $|y_{18}| = -y_{18}$ increases from .5 to 6 under three different assumptions on $f$: standard normal density (blue solid line), Student density (orange dashed line) and log-Pareto-tailed standard normal density (green dotted line)

For the partially robust model, the estimation of $\beta$ is robust as the impact of the outlier slowly decreases after a certain threshold. The impact on $\sigma$ is limited, but does not decrease when the outlying value moves away, reflecting partial robustness. The inference is clearly not robust when it is assumed that the error has a normal distribution, as the estimates of $\beta_2$ and $\sigma$ grow, and that of $\beta_1$ diminishes.

3.3. Simulation Study

In this section, we evaluate through a simulation study the accuracy of the estimates arising from our robust linear regression model with two explanatory variables. The model is $Y_i = x_i^T \beta + \epsilon_i$ with $\beta := (\beta_1, \beta_2, \beta_3)^T$ and $\epsilon_i \mid \sigma \overset{D}{\sim} (1/\sigma) f(\epsilon_i/\sigma)$, $i = 1, \ldots, n$, where $f$ is assumed to be a log-Pareto-tailed standard normal density with $\alpha = 1.96$ and $\phi = 4.08$ in our robust model. It is compared with the same linear regression model, but where $f$ is assumed to be a standard normal density in the nonrobust model, and a Student density with 10 degrees of freedom and a known scale parameter of 0.88 in the partially robust model. Note that the parameters of the distributions are the same as in Section 3.1.

For the simulation study, we set $n = 30$, $x_{1,2}, x_{2,2}, \ldots, x_{30,2} = 1, 2, \ldots, 30$ (which are the observations from the first explanatory variable), $x_{1,3}, x_{2,3}, \ldots, x_{30,3} = 0^2, 1^2, \ldots, 29^2$ (which are the observations from the second explanatory variable), and $\pi(\beta, \sigma) \propto 1$. We simulate 5,[...] data sets using values for $\beta$ and $\sigma$ respectively set to ([...], [...], [...])$^T$ and 2 for each of the three scenarios that we now describe. In the first one, the errors are generated from a standard normal distribution; therefore, the probability of observing outliers is negligible. In the second scenario, the errors are generated from a mixture of two normals where the first component is a standard normal and the second has a mean of [...] and a variance of [...], with weights of 0.95 and 0.05, respectively. This last component can contaminate the data sets by generating extreme values. In the third and last scenario, the errors are also generated from a mixture of two normals, but the contamination is due to the second component's scaling. More precisely, the first component is again a standard normal distribution, but the second has a mean of [...] and a variance of [...]$^2$, with weights of 0.9 and 0.1, respectively.
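The data-generating mechanism of the three scenarios can be sketched as follows. This is an illustration only: the contamination means and standard deviations, the coefficient values and the way $\sigma$ multiplies the simulated error are placeholders and assumptions chosen here, not the values or the exact mechanism used in the study. Only the design ($n = 30$, first covariate $1, \ldots, 30$, second covariate $0^2, \ldots, 29^2$) and the mixture weights come from the text.

```python
import random

def design_matrix(n=30):
    """Rows (1, i, (i - 1)^2) for i = 1..n: intercept, 1..n, and 0^2..(n - 1)^2."""
    return [(1.0, float(i), float((i - 1) ** 2)) for i in range(1, n + 1)]

def draw_error(rng, weight, mean, sd):
    """Standard normal with probability 1 - weight, otherwise N(mean, sd^2).

    Returns the error and a flag telling whether it came from the contaminating component.
    """
    if rng.random() < weight:
        return rng.gauss(mean, sd), True
    return rng.gauss(0.0, 1.0), False

def simulate(rng, beta, sigma, weight, mean, sd, n=30):
    """One data set y_i = x_i^T beta + sigma * error_i (the scaling by sigma is an assumption)."""
    X = design_matrix(n)
    y, contaminated = [], []
    for row in X:
        error, flag = draw_error(rng, weight, mean, sd)
        y.append(sum(x * b for x, b in zip(row, beta)) + sigma * error)
        contaminated.append(flag)
    return X, y, contaminated

rng = random.Random(2017)
beta_demo = (1.0, 1.0, 1.0)  # hypothetical coefficients, for illustration only
# Location contamination (weight 0.05) and scale contamination (weight 0.1);
# the mean 8.0 and standard deviation 8.0 are hypothetical placeholders.
_, y_location, flags_location = simulate(rng, beta_demo, 2.0, 0.05, 8.0, 1.0)
_, y_scale, flags_scale = simulate(rng, beta_demo, 2.0, 0.10, 0.0, 8.0)
print(sum(flags_location), sum(flags_scale))
```

Returning the contamination flags alongside the responses makes it easy to compare estimates obtained from the whole sample with those obtained from the uncontaminated draws only.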

For each simulated data set, we estimate $\beta$ and $\sigma$ using MAP estimation for the three models (recall that MAP estimation corresponds to maximum likelihood estimation when the prior is proportional to 1, as in this analysis). Within each simulation scenario, we evaluate the performance of each model using sample mean square errors (MSE), based on the true values of $\beta_1$, $\beta_2$, $\beta_3$ and $\sigma$ given above; the performance of the estimation of $\beta$ is evaluated through the sum of the MSE of the estimators of $\beta_1$, $\beta_2$ and $\beta_3$. The results are presented in Tables 4 and 5.

We first notice that when there is no outlier (the 100% $N(0, 1)$ scenario), the accuracy of the estimates arising from the three models is approximately the same. This is not surprising given that the three densities studied are similar, especially on the interval $[-1.96, 1.96]$, in which 95% of their mass is located. They however differ in the thickness of their tails, a feature that plays a major role in robustness. The density with the slimmest tails, the normal, corresponds to the case where the impact of outliers is the most significant. When assuming the Student distribution on the error term, the impact on $\hat\beta$ is limited, but the impact on $\hat\sigma$ is significant, reflecting partial robustness. Under the robust model, the impact on $(\hat\beta, \hat\sigma)$ is, as expected, limited. Note that for all models, the impact on each of the $\hat\beta_i$ is in line with that on $\hat\beta$.

Table 4. Sum of the MSE of the estimators of $\beta_1$, $\beta_2$ and $\beta_3$ under the three scenarios and the three assumptions on $f$
Scenarios (columns): 100% N(0, 1); 95% N(0, 1) + 5% N(, ); 90% N(0, 1) + 10% N(, 2)
Assumptions on $f$ (rows): Standard normal; Student (10 d.f.); Log-Pareto-tailed normal

Table 5. MSE of the estimators of $\sigma$ under the three scenarios and the three assumptions on $f$
Scenarios (columns): 100% N(0, 1); 95% N(0, 1) + 5% N(, ); 90% N(0, 1) + 10% N(, 2)
Assumptions on $f$ (rows): Standard normal; Student (10 d.f.); Log-Pareto-tailed normal

4. Conclusion

In this paper, we have proved whole robustness results for linear regression models, including the following key result: the convergence of the posterior distribution towards that arising from the nonoutliers only when the outliers approach plus or minus infinity (result (d), Theorem 1). These results hold under two simple and intuitive assumptions. First, the error term must follow a super heavy-tailed distribution to accommodate the presence of outliers. Second, the number of outliers must not exceed half the sample size minus $p - 1/2$. In other words, there must be at most $n/2 - (p - 1/2)$ outliers (or equivalently, there must be at least $n/2 + (p - 1/2)$ nonoutliers).

Although the whole robustness results are asymptotic, their practical relevance has been shown through numerical analyses in Section 3. The robust model used in these analyses results from
the assumption that the error term follows the log-Pareto-tailed standard normal density provided in (2). This model has been compared with the nonrobust (with the normal assumption) and partially robust (with the Student distribution assumption) models. The conclusion is that our model performs as well as the nonrobust and partially robust models in absence of outliers, in addition to being completely robust. In fact, it performs almost as well in absence of outliers, as seen in Tables 4 and 5 in Section 3.3. This is a rather small price to pay for whole robustness. We therefore believe it is suitable to always assume that the error term has the density detailed in (2) to obtain adequate results, regardless of whether there are outliers or not, by computing estimates as usual from the posterior distribution.

5. Proofs

The proof of Proposition 1 is given in Section 5.1 and the proof of Theorem 1 is given in Section 5.2. Prior to that, we define the constant $B > 0$ as follows:

$$B := \max\Big\{ \sup_{z \in \mathbb{R}} f(z),\ \sup_{z \in \mathbb{R}} |z| f(z),\ \sup_{\beta \in \mathbb{R}^p,\, \sigma > 0} \pi(\beta, \sigma) / \max(1, 1/\sigma) \Big\}.$$

The monotonicity of the tails of $f(z)$ and $|z| f(z)$ implies that there exists a constant $M > 0$ such that

$$|y| \geq |z| \geq M \ \Rightarrow\ f(y) \leq f(z) \ \text{ and } \ |y| f(y) \leq |z| f(z). \qquad (3)$$

5.1. Proof of Proposition 1

To prove that $\pi(\beta, \sigma \mid \mathbf{y}_n)$ is proper (the proof for $\pi(\beta, \sigma \mid \mathbf{y}_k)$ is omitted because it is similar), it suffices to show that the marginal $m(\mathbf{y}_n)$ is finite. Recall that we require that $n > p + 1$. The reader will notice that only $n \geq p + 1$ is required if $\pi(\beta, \sigma)$ is bounded by $B/\sigma$ for all $\sigma > 0$ (instead of $\pi(\beta, \sigma)$ being bounded by $B \max(1, 1/\sigma)$).

We first show that the function is integrable on the area where the ratio $1/\sigma$ is bounded. More precisely, we consider $\beta \in \mathbb{R}^p$ and $\delta M^{-1} \leq \sigma < \infty$, where $\delta$ is a positive constant that can be chosen as small as we want (upper bounds are provided in the proof). We next show that the function is integrable on the area where the ratio $1/\sigma$ approaches infinity, that is $0 < \sigma < \delta M^{-1}$.
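We have

$$\int_{\delta M^{-1}}^{\infty} \int_{\mathbb{R}^p} \pi(\beta, \sigma) \prod_{i=1}^{n} \frac{1}{\sigma} f\Big(\frac{y_i - x_i^T\beta}{\sigma}\Big)\, d\beta\, d\sigma$$
$$\overset{a}{\leq} B^{n-p+1} \max\Big(1, \frac{M}{\delta}\Big) \int_{\delta M^{-1}}^{\infty} \frac{1}{\sigma^{n-p}} \int_{\mathbb{R}^p} \prod_{i=1}^{p} \frac{1}{\sigma} f\Big(\frac{y_i - x_i^T\beta}{\sigma}\Big)\, d\beta\, d\sigma$$
$$\overset{b}{=} B^{n-p+1} \max\Big(1, \frac{M}{\delta}\Big) \left| \det \begin{pmatrix} x_1^T \\ \vdots \\ x_p^T \end{pmatrix} \right|^{-1} \int_{\delta M^{-1}}^{\infty} \frac{1}{\sigma^{n-p}}\, d\sigma \int_{-\infty}^{\infty} f(u_1)\, du_1 \cdots \int_{-\infty}^{\infty} f(u_p)\, du_p$$
$$\propto \int_{\delta M^{-1}}^{\infty} \frac{1}{\sigma^{n-p}}\, d\sigma \overset{c}{=} \frac{(M/\delta)^{n-p-1}}{n-p-1} < \infty.$$

In step a, we use $\pi(\beta, \sigma) \leq B \max(1, 1/\sigma)$ and we bound each of $n - p$ densities $f$ by $B$. In step b, we use the change of variables $u_i = (y_i - x_i^T\beta)/\sigma$ for $i = 1, \ldots, p$. The determinant is non-null because all explanatory variables are continuous. Indeed, consider the case $p = 2$ for instance (i.e. the simple linear regression); the determinant is different from 0 provided that $x_{12} \neq x_{22}$, and this happens with probability 1. In step c, we use $n > p + 1$. Note that if, instead, we bound $\pi(\beta, \sigma)$ by $B/\sigma$ in step a, one can verify that the condition $n \geq p + 1$ is sufficient to bound above the integral.

We now show that the integral is finite on $\beta \in \mathbb{R}^p$ and $0 < \sigma < \delta M^{-1}$. In this area, the ratio $1/\sigma$ approaches infinity. We have to carefully analyse the subareas where $y_i - x_i^T\beta$ is close to 0 in order to deal with the $0/0$ form of the ratios $(y_i - x_i^T\beta)/\sigma$. In order to achieve this, we split the domain of $\beta$ as follows:

$$\mathbb{R}^p = \Big[ \bigcap_{i_1=1}^{n} R_{i_1}^c \Big] \cup \Big[ \bigcup_{i_1=1}^{n} \Big( R_{i_1} \cap \Big( \bigcap_{\substack{i_2=1 \\ (i_2 \neq i_1)}}^{n} R_{i_2}^c \Big) \Big) \Big] \cup \Big[ \bigcup_{\substack{i_1,i_2=1 \\ (i_1 \neq i_2)}}^{n} \Big( R_{i_1} \cap R_{i_2} \cap \Big( \bigcap_{\substack{i_3=1 \\ (i_3 \neq i_1,i_2)}}^{n} R_{i_3}^c \Big) \Big) \Big] \cup \cdots \cup \Big[ \bigcup_{\substack{i_1,i_2,\ldots,i_p=1 \\ (i_j \neq i_s\ \forall i_j,i_s \text{ s.t. } j \neq s)}}^{n} \big( R_{i_1} \cap R_{i_2} \cap \cdots \cap R_{i_p} \big) \Big], \qquad (4)$$

where $R_i := \{\beta : |y_i - x_i^T\beta| < \delta\}$, $i \in \{1, \ldots, n\}$. The set $R_i$ represents the hyperplanes $y = x^T\beta$ characterised by the different values of $\beta$ that satisfy $|y_i - x_i^T\beta| < \delta$. In other words, it represents the hyperplanes passing near the point $(x_i, y_i)$, and more precisely, at a vertical distance of less than $\delta$. The set $\bigcap_{i_1=1}^{n} R_{i_1}^c$ is therefore comprised of the hyperplanes that are not passing close to any point. The set $\bigcup_{i_1=1}^{n} \big( R_{i_1} \cap \big( \bigcap_{i_2=1\,(i_2 \neq i_1)}^{n} R_{i_2}^c \big) \big)$ represents the hyperplanes passing near one (and only one) point. The set $\bigcup_{i_1,i_2=1\,(i_1 \neq i_2)}^{n} \big( R_{i_1} \cap R_{i_2} \cap \big( \bigcap_{i_3=1\,(i_3 \neq i_1,i_2)}^{n} R_{i_3}^c \big) \big)$ represents the hyperplanes passing near two (and only two) points, and so on. We choose $\delta$ small enough to ensure that $R_{i_1} \cap R_{i_2} \cap \cdots \cap R_{i_p} \cap R_{i_{p+1}} = \emptyset$ when $i_1, \ldots, i_{p+1}$ are all different. It is possible to do so because a hyperplane passes through no more than $p$ points.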
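To make the decomposition concrete, the sketch below lists, for a given $\beta$, the indices $i$ such that $\beta \in R_i$, which identifies the set of the decomposition to which $\beta$ belongs. The function name and the toy data are hypothetical, and indices are 0-based in the code.

```python
def near_points(beta, X, y, delta):
    """Return the indices i with |y_i - x_i^T beta| < delta, i.e. beta in R_i."""
    indices = []
    for i, (x, y_i) in enumerate(zip(X, y)):
        fitted = sum(x_j * b_j for x_j, b_j in zip(x, beta))
        if abs(y_i - fitted) < delta:
            indices.append(i)
    return indices


# Toy data with p = 2 (intercept and one explanatory variable).
X = [[1, 0], [1, 1], [1, 2]]
y = [0, 1, 5]

# The line y = x passes through the first two points only.
print(near_points([0, 1], X, y, delta=0.1))   # [0, 1]
# A line far from all points belongs to the intersection of the complements.
print(near_points([10, 0], X, y, delta=0.1))  # []
```

With $\delta$ small enough, the returned list never contains more than $p$ indices, in line with the choice of $\delta$ made above.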
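This implies that

$$\bigcup_{\substack{i_1,i_2,\ldots,i_p=1 \\ (i_j \neq i_s\ \forall i_j,i_s \text{ s.t. } j \neq s)}}^{n} \big( R_{i_1} \cap R_{i_2} \cap \cdots \cap R_{i_p} \big) = \bigcup_{\substack{i_1,i_2,\ldots,i_p=1 \\ (i_j \neq i_s\ \forall i_j,i_s \text{ s.t. } j \neq s)}}^{n} \Big( R_{i_1} \cap R_{i_2} \cap \cdots \cap R_{i_p} \cap \Big( \bigcap_{\substack{i_{p+1}=1 \\ (i_{p+1} \neq i_1,i_2,\ldots,i_p)}}^{n} R_{i_{p+1}}^c \Big) \Big).$$

Note that all sets $R_{i_1} \cap R_{i_2} \cap \cdots \cap R_{i_p}$ are nonempty when $i_1, \ldots, i_p$ are all different, because all explanatory variables are continuous (which implies that the $p \times p$ matrix with rows given by $x_{i_1}^T, \ldots, x_{i_p}^T$ has a determinant different from 0). Note also that $R_{i_1} \cap \big( \bigcap_{i_2=1\,(i_2 \neq i_1)}^{n} R_{i_2}^c \big)$ is nonempty for all $i_1$, $R_{i_1} \cap R_{i_2} \cap \big( \bigcap_{i_3=1\,(i_3 \neq i_1,i_2)}^{n} R_{i_3}^c \big)$ is nonempty for all $i_1, i_2$ such that $i_1 \neq i_2$, and so on. Finally, note that the decomposition of $\mathbb{R}^p$ given in (4) is comprised of $\sum_{i=0}^{p} \binom{n}{i}$ mutually exclusive sets given by $\bigcap_{i_1=1}^{n} R_{i_1}^c$, $R_{i_1} \cap \big( \bigcap_{i_2=1\,(i_2 \neq i_1)}^{n} R_{i_2}^c \big)$, $i_1 = 1, \ldots, n$, $R_{i_1} \cap R_{i_2} \cap \big( \bigcap_{i_3=1\,(i_3 \neq i_1,i_2)}^{n} R_{i_3}^c \big)$, $i_1, i_2 = 1, \ldots, n$ with $i_1 \neq i_2$, and so on.

We now consider $0 < \sigma < \delta M^{-1}$ and $\beta$ in one of the $\sum_{i=0}^{p} \binom{n}{i}$ mutually exclusive sets given in (4). As explained above, the difficulty lies in dealing with the hyperplanes $\beta$ that are such that $|y_i - x_i^T\beta| < \delta$ for some points $(x_i, y_i)$. The strategy is essentially to use the product of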

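$(1/\sigma) f((y_i - x_i^T\beta)/\sigma)$ of these points to integrate over $\beta$, and to bound the other terms of $m(\mathbf{y}_n)$. Therefore, if $\beta \in R_{i_1} \cap R_{i_2} \cap \cdots \cap R_{i_p}$, we consider the points $(x_{i_1}, y_{i_1}), (x_{i_2}, y_{i_2}), \ldots, (x_{i_p}, y_{i_p})$ to integrate over $\beta$. If $\beta \in R_{i_1} \cap R_{i_2} \cap \cdots \cap R_{i_{p-1}} \cap \big( \bigcap_{i_p=1\,(i_p \neq i_1,\ldots,i_{p-1})}^{n} R_{i_p}^c \big)$, we consider the points $(x_{i_1}, y_{i_1}), (x_{i_2}, y_{i_2}), \ldots, (x_{i_{p-1}}, y_{i_{p-1}})$, and any other point $(x_{i_p}, y_{i_p})$ to integrate over $\beta$, and so on. We have

$$\pi(\beta, \sigma) \prod_{i=1}^{n} \frac{1}{\sigma} f\Big(\frac{y_i - x_i^T\beta}{\sigma}\Big) \overset{a}{\leq} \frac{B}{\sigma} \max(\delta M^{-1}, 1) \prod_{i \notin \{i_1,\ldots,i_p\}} \frac{1}{\sigma} f\Big(\frac{y_i - x_i^T\beta}{\sigma}\Big) \prod_{i \in \{i_1,\ldots,i_p\}} \frac{1}{\sigma} f\Big(\frac{y_i - x_i^T\beta}{\sigma}\Big)$$
$$\propto \frac{1}{\sigma} \prod_{i \notin \{i_1,\ldots,i_p\}} \frac{1}{\sigma} f\Big(\frac{y_i - x_i^T\beta}{\sigma}\Big) \prod_{i \in \{i_1,\ldots,i_p\}} \frac{1}{\sigma} f\Big(\frac{y_i - x_i^T\beta}{\sigma}\Big)$$
$$\overset{b}{\leq} \frac{1}{\sigma} \Big[ \frac{1}{\sigma} f\Big(\frac{\delta}{\sigma}\Big) \Big]^{n-p} \prod_{i \in \{i_1,\ldots,i_p\}} \frac{1}{\sigma} f\Big(\frac{y_i - x_i^T\beta}{\sigma}\Big)$$
$$\overset{c}{\leq} \Big[ \frac{B}{\delta} \Big]^{n-p-1} \frac{1}{\sigma^2} f\Big(\frac{\delta}{\sigma}\Big) \prod_{i \in \{i_1,\ldots,i_p\}} \frac{1}{\sigma} f\Big(\frac{y_i - x_i^T\beta}{\sigma}\Big).$$

In step a, we use $\pi(\beta, \sigma) \leq B \max(1, 1/\sigma) = (B/\sigma) \max(\sigma, 1) \leq (B/\sigma) \max(\delta M^{-1}, 1)$. In step b, for all $i \notin \{i_1, \ldots, i_p\}$ we use $f((y_i - x_i^T\beta)/\sigma) \leq f(\delta/\sigma)$ by the monotonicity of the tails of $f$, because $|y_i - x_i^T\beta|/\sigma \geq \delta/\sigma \geq \delta \delta^{-1} M = M$. In step c, we bound $n - p - 1$ terms $(1/\sigma) f(\delta/\sigma)$ by $B/\delta$. Finally, we bound the integral of $(1/\sigma^2) f(\delta/\sigma) \prod_{i \in \{i_1,\ldots,i_p\}} (1/\sigma) f((y_i - x_i^T\beta)/\sigma)$ by

$$\int_{0}^{\delta M^{-1}} \frac{1}{\sigma^2} f\Big(\frac{\delta}{\sigma}\Big) \int_{\mathbb{R}^p} \prod_{i \in \{i_1,\ldots,i_p\}} \frac{1}{\sigma} f\Big(\frac{y_i - x_i^T\beta}{\sigma}\Big)\, d\beta\, d\sigma \overset{a}{=} \left| \det \begin{pmatrix} x_{i_1}^T \\ \vdots \\ x_{i_p}^T \end{pmatrix} \right|^{-1} \int_{0}^{\delta M^{-1}} \frac{1}{\sigma^2} f\Big(\frac{\delta}{\sigma}\Big)\, d\sigma \overset{b}{=} \left| \det \begin{pmatrix} x_{i_1}^T \\ \vdots \\ x_{i_p}^T \end{pmatrix} \right|^{-1} \delta^{-1} \int_{M}^{\infty} f(\sigma')\, d\sigma' < \infty.$$

In step a, we use the same change of variables as above: $u_j = (y_{i_j} - x_{i_j}^T\beta)/\sigma$ for $j = 1, \ldots, p$. In step b, we use the change of variable $\sigma' = \delta/\sigma$.

5.2. Proof of Theorem 1

Consider the model and the context described in Section 2.. We assume that $l \leq n/2 - (p - 1/2) \Leftrightarrow k \geq n/2 + (p - 1/2) \Leftrightarrow k \geq l + 2p - 1$. In addition, we assume that $l \geq 1$, i.e. that there is at least one outlier, otherwise the proof would be trivial. Two propositions and a lemma that are used in the proof of Theorem 1 are first given, and the proofs of results (a) to (e) follow. The proofs of these propositions and this lemma can be found in Desgagné 25).

Proposition 2 (Dominance). If $s \in L_0(\infty)$ and $g \in L_\rho(\infty)$, then for all $\delta > 0$, there exists a constant $A(\delta) > 1$ such that $z \geq A(\delta)$ implies that $(\log z)^{-\delta} < s(z) < (\log z)^{\delta}$ and $(\log z)^{-\rho-\delta} < g(z) < (\log z)^{-\rho+\delta}$.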
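The equivalence between the three forms of the assumption on the number of outliers stated at the beginning of this proof can be checked by brute force. The sketch below is only an illustration with hypothetical function names; it uses $n = k + l$ and integer arithmetic to avoid rounding issues.

```python
def max_outliers(n, p):
    """Largest integer l with l <= n/2 - (p - 1/2), i.e. 2l <= n - 2p + 1."""
    return (n - 2 * p + 1) // 2


def nonoutlier_condition(n, p, l):
    """Check k >= l + 2p - 1, where k = n - l is the number of nonoutliers."""
    k = n - l
    return k >= l + 2 * p - 1


# The two formulations agree for every admissible (n, p, l).
for n in range(1, 60):
    for p in range(2, 6):
        for l in range(0, n + 1):
            assert (l <= max_outliers(n, p)) == nonoutlier_condition(n, p, l)

print(max_outliers(30, 3))  # 12
```

For instance, with 30 observations and $p = 3$, at most 12 outliers are allowed under this assumption.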

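Proposition 3 (Location-scale transformation). If $z f(z) \in L_\rho(\infty)$, then we have $(1/\sigma) f((z - \mu)/\sigma) / f(z) \to 1$ as $z \to \infty$, uniformly on $(\mu, \sigma) \in [-\lambda, \lambda] \times [1/\tau, \tau]$, for any $\lambda \geq 0$ and $\tau \geq 1$.

Lemma 1. For all $\lambda \geq 0$, $\tau \geq 1$, there exists a constant $D(\lambda, \tau) \geq 1$ such that $z \in \mathbb{R}$ and $(\mu, \sigma) \in [-\lambda, \lambda] \times [1/\tau, \tau]$ implies that

$$1/D(\lambda, \tau) \leq (1/\sigma) f((z - \mu)/\sigma) / f(z) \leq D(\lambda, \tau).$$

Note that Lemma 1 is a corollary of Proposition 3.

Proof of Result (a). We first observe that

$$\frac{m(\mathbf{y}_n)}{m(\mathbf{y}_k) \prod_{i=1}^{n} [f(y_i)]^{l_i}} = \frac{m(\mathbf{y}_n)}{m(\mathbf{y}_k) \prod_{i=1}^{n} [f(y_i)]^{l_i}} \int_{\mathbb{R}^p} \int_{0}^{\infty} \pi(\beta, \sigma \mid \mathbf{y}_n)\, d\sigma\, d\beta$$
$$= \int_{\mathbb{R}^p} \int_{0}^{\infty} \frac{\pi(\beta, \sigma) \prod_{i=1}^{n} \big[ (1/\sigma) f((y_i - x_i^T\beta)/\sigma) \big]^{k_i + l_i}}{m(\mathbf{y}_k) \prod_{i=1}^{n} [f(y_i)]^{l_i}}\, d\sigma\, d\beta$$
$$= \int_{\mathbb{R}^p} \int_{0}^{\infty} \pi(\beta, \sigma \mid \mathbf{y}_k) \prod_{i=1}^{n} \Big[ \frac{(1/\sigma) f((y_i - x_i^T\beta)/\sigma)}{f(y_i)} \Big]^{l_i}\, d\sigma\, d\beta.$$

We show that the last integral converges towards 1 as $\omega \to \infty$ to prove result (a). If we use Lebesgue's dominated convergence theorem to interchange the limit $\omega \to \infty$ and the integral, we have

$$\lim_{\omega \to \infty} \int_{\mathbb{R}^p} \int_{0}^{\infty} \pi(\beta, \sigma \mid \mathbf{y}_k) \prod_{i=1}^{n} \Big[ \frac{(1/\sigma) f((y_i - x_i^T\beta)/\sigma)}{f(y_i)} \Big]^{l_i}\, d\sigma\, d\beta$$
$$= \int_{\mathbb{R}^p} \int_{0}^{\infty} \lim_{\omega \to \infty} \pi(\beta, \sigma \mid \mathbf{y}_k) \prod_{i=1}^{n} \Big[ \frac{(1/\sigma) f((y_i - x_i^T\beta)/\sigma)}{f(y_i)} \Big]^{l_i}\, d\sigma\, d\beta$$
$$= \int_{\mathbb{R}^p} \int_{0}^{\infty} \pi(\beta, \sigma \mid \mathbf{y}_k)\, d\sigma\, d\beta = 1,$$

using Proposition 3 in the second equality, since $x_1, \ldots, x_n$ are fixed, and then Proposition 1. Note that the condition of Proposition 1 is satisfied because $k \geq l + 2p - 1 \Rightarrow k \geq p + 2$ (because $l \geq 1$ and $p \geq 2$). Note also that pointwise convergence is sufficient, for any value of $\beta \in \mathbb{R}^p$ and $\sigma > 0$, once the limit is inside the integral. However, in order to use Lebesgue's dominated convergence theorem, we need to prove that the integrand is bounded by an integrable function of $\beta$ and $\sigma$ that does not depend on $\omega$, for any value of $\omega \geq \underline{y}$, where $\underline{y}$ is a constant. The constant $\underline{y}$ can be chosen as large as we want, and minimum values for $\underline{y}$ will be given throughout the proof.

In order to bound the integrand, we divide the domain of integration into two areas: $1 \leq \sigma < \infty$ and $0 < \sigma < 1$. Again, we want to separately analyse the area where the ratio $1/\sigma$ approaches infinity.

We assumed that $y_i$ can be written as $y_i = a_i + b_i\omega$, where $\omega \to \infty$, and $a_i$ and $b_i$ are constants such that $a_i \in \mathbb{R}$ and $b_i \neq 0$ if $l_i = 1$ (if the observation is an outlier). Therefore, the ranking of the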

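elements in the set $\{|y_i| : l_i = 1\}$ is primarily determined by the values $|b_1|, \ldots, |b_n|$, and we can choose the constant $\underline{y}$ larger than a certain threshold to ensure that this ranking remains unchanged for all $\omega \geq \underline{y}$. Without loss of generality, we assume for convenience that $\omega = \min_{\{i:\, l_i = 1\}} |y_i|$ and consequently $\min_{\{i:\, l_i = 1\}} |b_i| = 1$.

We now bound above the integrand on the first area.

Area 1: Consider $1 \leq \sigma < \infty$ and assume without loss of generality that $y_1, \ldots, y_p$ are $p$ nonoutliers (therefore $k_1 = \ldots = k_p = 1$). We have

$$\pi(\beta, \sigma \mid \mathbf{y}_k) \prod_{i=1}^{n} \Big[ \frac{(1/\sigma) f((y_i - x_i^T\beta)/\sigma)}{f(y_i)} \Big]^{l_i} \propto \pi(\beta, \sigma) \frac{1}{\sigma^n} \prod_{i=1}^{n} \frac{f((y_i - x_i^T\beta)/\sigma)}{[f(y_i)]^{l_i}}$$
$$\overset{a}{\leq} \frac{B}{\sigma^n} \prod_{i=1}^{n} \frac{D(|a_i|, 1)\, f((b_i\omega - x_i^T\beta)/\sigma)}{[f(y_i)]^{l_i}}$$
$$\overset{b}{\leq} \frac{B}{\sigma^n [f(\omega)]^{l}} \prod_{i=1}^{n} D(|a_i|, 1) \big[ |b_i| D(|a_i|, |b_i|) \big]^{l_i} f((b_i\omega - x_i^T\beta)/\sigma)$$
$$\overset{c}{\propto} \frac{1}{\sigma^n [f(\omega)]^{l}} \prod_{i=1}^{n} \big[ f(x_i^T\beta/\sigma) \big]^{k_i} \big[ f((b_i\omega - x_i^T\beta)/\sigma) \big]^{l_i}$$
$$\overset{d}{=} \frac{1}{\sigma^{k-p-1/2}} \prod_{i=1}^{p} \frac{1}{\sigma} f(x_i^T\beta/\sigma) \times \Big[ \frac{\omega/\sigma}{\omega f(\omega)} \Big]^{l} \frac{1}{\sigma^{1/2}} \prod_{i=p+1}^{n} \big[ f(x_i^T\beta/\sigma) \big]^{k_i} \big[ f((b_i\omega - x_i^T\beta)/\sigma) \big]^{l_i}.$$

In step a, we use $y_i = a_i + b_i\omega$ and Lemma 1 to obtain $f((y_i - x_i^T\beta)/\sigma) = f((b_i\omega - x_i^T\beta)/\sigma + a_i/\sigma) \leq D(|a_i|, 1)\, f((b_i\omega - x_i^T\beta)/\sigma)$, because $|a_i/\sigma| \leq |a_i|$ for all $i$. We also use $\pi(\beta, \sigma) \leq B \max(1, 1/\sigma) = B$. In step b, we use again Lemma 1 to obtain $f(\omega)/f(y_i) = f((y_i - a_i)/b_i)/f(y_i) \leq |b_i| D(|a_i|, |b_i|)$. In step c, we set $b_i = 0$ if $k_i = 1$ and we use the symmetry of $f$ to obtain $f(-x_i^T\beta/\sigma) = f(x_i^T\beta/\sigma)$. In step d, we use the assumption $k_1 = \ldots = k_p = 1$.

Now it suffices to demonstrate that

$$\Big[ \frac{\omega/\sigma}{\omega f(\omega)} \Big]^{l} \frac{1}{\sigma^{1/2}} \prod_{i=p+1}^{n} \big[ f(x_i^T\beta/\sigma) \big]^{k_i} \big[ f((b_i\omega - x_i^T\beta)/\sigma) \big]^{l_i} \qquad (5)$$

is bounded by a constant that does not depend on $\omega$, $\beta$ and $\sigma$ since $\prod_{i=1}^{p} (1/\sigma) f(x_i^T\beta/\sigma)$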

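$\times\, (1/\sigma)^{k-p-1/2}$ is an integrable function on area 1. Indeed, since $k > p + 1$, we have

$$\int_{1}^{\infty} (1/\sigma)^{k-p-1/2} \int_{\mathbb{R}^p} \prod_{i=1}^{p} (1/\sigma) f(x_i^T\beta/\sigma)\, d\beta\, d\sigma = \left| \det \begin{pmatrix} x_1^T \\ \vdots \\ x_p^T \end{pmatrix} \right|^{-1} \int_{1}^{\infty} \frac{1}{\sigma^{k-p-1/2}}\, d\sigma < \infty,$$

using the following change of variables: $u_i = x_i^T\beta/\sigma$, $i = 1, \ldots, p$. The determinant is different from 0 because all the explanatory variables are continuous. Note that if instead, in step a above, we bound $\pi(\beta, \sigma)$ by $B/\sigma$, one can verify that the condition $k \geq p + 1$ is sufficient to bound above the integral.

In order to bound the function in (5), we split area 1 into three parts: $1 \leq \sigma < \omega^{1/2}$, $\omega^{1/2} \leq \sigma < \omega/(\gamma M)$ and $\omega/(\gamma M) \leq \sigma < \infty$, where $M$ is defined in (3) and $\gamma$ is a positive constant that can be chosen as large as we want (lower bounds are provided in the proof). Note that this split is well defined if $\underline{y} > \max(1, (\gamma M)^2)$ since $\omega \geq \underline{y}$.

First, consider $\omega/(\gamma M) \leq \sigma < \infty$. We have

$$\Big[ \frac{\omega/\sigma}{\omega f(\omega)} \Big]^{l} \frac{1}{\sigma^{1/2}} \prod_{i=p+1}^{n} \big[ f(x_i^T\beta/\sigma) \big]^{k_i} \big[ f((b_i\omega - x_i^T\beta)/\sigma) \big]^{l_i} \overset{a}{\leq} B^{n-p} \Big[ \frac{\omega/\sigma}{\omega f(\omega)} \Big]^{l} \frac{1}{\sigma^{1/2}}$$
$$\overset{b}{\leq} B^{n-p} (\gamma M)^{l+1/2} \frac{(1/\omega)^{1/2}}{[\omega f(\omega)]^{l}} \overset{c}{\leq} B^{n-p} (\gamma M)^{l+1/2} (1/\omega)^{1/2} (\log \omega)^{(\rho+1)l} \overset{d}{\leq} B^{n-p} (\gamma M)^{l+1/2} \big[ 2(\rho+1)l/e \big]^{(\rho+1)l} < \infty.$$

In step a, we use $f \leq B$. In step b, we use $\omega/\sigma \leq \gamma M$ and $1/\sigma \leq \gamma M/\omega$. In step c, we use $\omega f(\omega) > (\log \omega)^{-\rho-1}$ if $\omega \geq \underline{y} \geq A(1)$, where $A(1)$ comes from Proposition 2. For step d, it is purely algebraic to show that the maximum of $(\log \omega)^{\xi}/\omega^{1/2}$ is $(2\xi/e)^{\xi}$ for $\omega > 1$ and $\xi > 0$, where $\xi = (\rho + 1)l$ in our situation.

Now, consider the two other parts combined (we will split them in the next step), that is $1 \leq \sigma \leq \omega/(\gamma M)$. We have

$$\Big[ \frac{\omega/\sigma}{\omega f(\omega)} \Big]^{l} \frac{1}{\sigma^{1/2}} \prod_{i=p+1}^{n} \big[ f(x_i^T\beta/\sigma) \big]^{k_i} \big[ f((b_i\omega - x_i^T\beta)/\sigma) \big]^{l_i}$$
$$= \frac{1}{\sigma^{1/2}} \Big[ \frac{(\omega/\sigma) f(\omega/\sigma)}{\omega f(\omega)} \Big]^{l} \prod_{i=p+1}^{n} \big[ f(x_i^T\beta/\sigma) \big]^{k_i} \Big[ \frac{f((b_i\omega - x_i^T\beta)/\sigma)}{f(\omega/\sigma)} \Big]^{l_i}$$
$$\overset{a}{\leq} \frac{1}{\sigma^{1/2}} \Big[ \frac{(\omega/\sigma) f(\omega/\sigma)}{\omega f(\omega)} \Big]^{l} B^{k-p} \big[ D(0, \gamma)\gamma \big]^{l}.$$

In step a, we use

$$\prod_{i=p+1}^{n} \big[ f(x_i^T\beta/\sigma) \big]^{k_i} \Big[ \frac{f((b_i\omega - x_i^T\beta)/\sigma)}{f(\omega/\sigma)} \Big]^{l_i} \leq B^{k-p} \big[ D(0, \gamma)\gamma \big]^{l}. \qquad (6)$$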
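The algebraic fact used in step d can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and the exponent `c` is introduced here as a parameter so that the same check covers any positive power of $\omega$ in the denominator (with $c = 1/2$ it reproduces the bound $(2\xi/e)^{\xi}$). Setting the derivative of $\xi \log\log\omega - c \log\omega$ to zero gives $\log\omega = \xi/c$, hence the maximum $(\xi/(ce))^{\xi}$.

```python
import math


def ratio(omega, xi, c=0.5):
    """Evaluate (log omega)^xi / omega^c for omega > 1."""
    return math.log(omega) ** xi / omega ** c


def claimed_maximum(xi, c=0.5):
    """Closed-form maximum (xi / (c e))^xi, attained at omega = exp(xi / c)."""
    return (xi / (c * math.e)) ** xi


xi = 3.0
# Grid over log(omega) in (0, 60]; the maximiser log(omega) = xi / c = 6 lies inside.
grid_max = max(ratio(math.exp(t / 100.0), xi) for t in range(1, 6001))
print(abs(grid_max - claimed_maximum(xi)) < 1e-9)  # True
```

The grid maximum never exceeds the closed-form value, which is what allows the bound to be free of $\omega$.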

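The proof of this inequality is substantial. Therefore, to ease the reading, it is deferred after the demonstration that the remaining term, i.e.

$$\frac{1}{\sigma^{1/2}} \Big[ \frac{(\omega/\sigma) f(\omega/\sigma)}{\omega f(\omega)} \Big]^{l},$$

is bounded.

We first consider $\omega^{1/2} \leq \sigma \leq \omega/(\gamma M)$. We have

$$\frac{1}{\sigma^{1/2}} \Big[ \frac{(\omega/\sigma) f(\omega/\sigma)}{\omega f(\omega)} \Big]^{l} \overset{a}{\leq} B^{l} \frac{(1/\omega)^{1/4}}{[\omega f(\omega)]^{l}} \overset{b}{\leq} B^{l} (1/\omega)^{1/4} (\log \omega)^{(\rho+1)l} \overset{c}{\leq} B^{l} \big[ 4(\rho+1)l/e \big]^{(\rho+1)l} < \infty.$$

In step a, we use $(\omega/\sigma) f(\omega/\sigma) \leq B$ and $(1/\sigma)^{1/2} \leq (1/\omega)^{1/4}$. In step b, we use $\omega f(\omega) > (\log \omega)^{-\rho-1}$ if $\omega \geq \underline{y} \geq A(1)$, where $A(1)$ comes from Proposition 2. In step c, it is purely algebraic to show that the maximum of $(\log \omega)^{\xi}/\omega^{1/4}$ is $(4\xi/e)^{\xi}$ for $\omega > 1$ and $\xi > 0$, where $\xi = (\rho + 1)l$ in our situation.

We now consider $1 \leq \sigma \leq \omega^{1/2}$. We have

$$\frac{1}{\sigma^{1/2}} \Big[ \frac{(\omega/\sigma) f(\omega/\sigma)}{\omega f(\omega)} \Big]^{l} \overset{a}{\leq} \Big[ \frac{\omega^{1/2} f(\omega^{1/2})}{\omega f(\omega)} \Big]^{l} \overset{b}{\leq} 2^{(\rho+1)l} < \infty.$$

In step a, we use $1/\sigma \leq 1$ and $(\omega/\sigma) f(\omega/\sigma) \leq \omega^{1/2} f(\omega^{1/2})$ by the monotonicity of the tails of $|z| f(z)$, since $\omega/\sigma \geq \omega^{1/2} \geq \underline{y}^{1/2} \geq M$ if $\underline{y} \geq M^2$. In step b, we use $\omega^{1/2} f(\omega^{1/2})/(\omega f(\omega)) \leq 2(1/2)^{-\rho} = 2^{\rho+1}$ if $\omega \geq \underline{y} \geq A(1, 2)$, where $A(1, 2)$ comes from the definition of log-regularly varying functions (see Definition 2).

Finally, we prove the inequality in (6). Recall that we assume that $k \geq n/2 + (p - 1/2) \Leftrightarrow k \geq l + 2p - 1$. We know that $y_1, \ldots, y_p$ are $p$ nonoutliers (therefore $k_1 = \ldots = k_p = 1$). We also know that the number of remaining nonoutliers (the nonoutliers among observations $p + 1$ to $n$) is greater than or equal to $p$ because $k - p \geq l + p - 1 \geq p$ (because we assume that $l \geq 1$). In order to prove the result, we split the domain of $\beta$ as follows:

$$\mathbb{R}^p = \Big[ \bigcap_{i} O_i^c \Big] \cup \Big[ \bigcup_{i} \Big( O_i \cap \Big( \bigcap_{i_1} F_{i_1}^c \Big) \Big) \Big] \cup \Big[ \bigcup_{i, i_1} \Big( O_i \cap F_{i_1} \cap \Big( \bigcap_{i_2 \neq i_1} F_{i_2}^c \Big) \Big) \Big] \cup \cdots$$
$$\cup \Big[ \bigcup_{\substack{i, i_1, \ldots, i_{p-1} \\ (i_j \neq i_s\ \forall i_j,i_s \text{ s.t. } j \neq s)}} \Big( O_i \cap F_{i_1} \cap \cdots \cap F_{i_{p-1}} \cap \Big( \bigcap_{i_p \neq i_1, \ldots, i_{p-1}} F_{i_p}^c \Big) \Big) \Big] \cup \Big[ \bigcup_{\substack{i, i_1, \ldots, i_p \\ (i_j \neq i_s\ \forall i_j,i_s \text{ s.t. } j \neq s)}} \big( O_i \cap F_{i_1} \cap \cdots \cap F_{i_p} \big) \Big],$$

where

$$O_i := \{\beta : |b_i\omega - x_i^T\beta| < \omega/2\}, \quad i \in I_O, \qquad (7)$$
$$F_i := \{\beta : |x_i^T\beta| < \omega/\gamma\}, \quad i \in I_F, \qquad (8)$$

and $I_O := \{i : i \in \{p+1, \ldots, n\} \text{ and } l_i = 1\}$ and $I_F := \{i : i \in \{p+1, \ldots, n\} \text{ and } k_i = 1\}$ are the sets of indexes of outliers and remaining fixed observations (nonoutliers), respectively.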

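The set $O_i$ represents the hyperplanes $y = x^T\beta$ characterised by the different values of $\beta$ that satisfy $|b_i\omega - x_i^T\beta| < \omega/2$. In other words, it represents the hyperplanes that pass at a vertical distance of less than $\omega/2$ of the point $(x_i, b_i\omega)$, which is considered as an outlier since $\omega \to \infty$ (recall that $b_i\omega = y_i - a_i$). Analogously, the set $F_i$ represents the hyperplanes that pass at a vertical distance of less than $\omega/\gamma$ of the point $(x_i, 0)$, which is considered as a nonoutlier. Therefore, the set $\bigcap_i O_i^c$ represents the hyperplanes that pass at a vertical distance of at least $\omega/2$ of all the points $(x_i, b_i\omega)$ (all the outliers). The set $\bigcup_i \big( O_i \cap \big( \bigcap_{i_1} F_{i_1}^c \big) \big)$ represents the hyperplanes that pass at a vertical distance of less than $\omega/2$ of at least one point $(x_i, b_i\omega)$ (an outlier), but at a vertical distance of at least $\omega/\gamma$ of all the points $(x_{i_1}, 0)$ (all the nonoutliers). For each $i_1 \in I_F$, the set $\bigcup_i \big( O_i \cap F_{i_1} \cap \big( \bigcap_{i_2 \neq i_1} F_{i_2}^c \big) \big)$ represents the hyperplanes that pass at a vertical distance of less than $\omega/2$ of at least one point $(x_i, b_i\omega)$ (an outlier), at a vertical distance of less than $\omega/\gamma$ of the point $(x_{i_1}, 0)$ (a nonoutlier), but at a vertical distance of at least $\omega/\gamma$ of all the other nonoutliers, and so on.

Now, we claim that $O_i \cap F_{i_1} \cap \cdots \cap F_{i_p} = \emptyset$ for all $i, i_1, \ldots, i_p$ with $i_j \neq i_s$, $\forall i_j, i_s$ such that $j \neq s$. To prove this, we use the fact that $x_i$ (a vector of size $p$) can be expressed as a linear combination of $x_{i_1}, \ldots, x_{i_p}$. This is true because all explanatory variables are continuous, and therefore, linearly independent with probability 1. As a result, considering that $\beta \in F_{i_1} \cap \cdots \cap F_{i_p}$ and $x_i = \sum_{s=1}^{p} a_s x_{i_s}$ for some $a_1, \ldots, a_p \in \mathbb{R}$, we have

$$|b_i\omega - x_i^T\beta| = \Big| b_i\omega - \Big( \sum_{s=1}^{p} a_s x_{i_s} \Big)^T \beta \Big| \overset{a}{\geq} |b_i|\omega - \Big| \sum_{s=1}^{p} a_s x_{i_s}^T\beta \Big| \overset{b}{\geq} \omega - \frac{\omega}{\gamma} \sum_{s=1}^{p} |a_s| \overset{c}{\geq} \frac{\omega}{2}.$$

In step a, we use the reverse triangle inequality. In step b, we use that $|b_i| \geq 1$ and $\big| \sum_{s=1}^{p} a_s x_{i_s}^T\beta \big| \leq \sum_{s=1}^{p} |a_s| |x_{i_s}^T\beta| \leq \sum_{s=1}^{p} |a_s|\, \omega/\gamma$ because $\beta \in F_{i_1} \cap \cdots \cap F_{i_p}$, which means that $|x_i^T\beta| < \omega/\gamma$ for all $i \in \{i_1, \ldots, i_p\}$. In step c, we define the constant $\gamma$ such that $\gamma \geq 2 \sum_{s=1}^{p} |a_s|$ (we choose $\gamma$ such that it satisfies this inequality for any combination of $i$ and $i_1, \ldots, i_p$). Therefore, we have that $\beta \notin O_i$. This proves that $O_i \cap F_{i_1} \cap \cdots \cap F_{i_p} = \emptyset$ for all $i, i_1, \ldots, i_p$ with $i_j \neq i_s$, $\forall i_j, i_s$ such that $j \neq s$. This in turn implies that

$$\mathbb{R}^p = \Big[ \bigcap_{i} O_i^c \Big] \cup \Big[ \bigcup_{i} \Big( O_i \cap \Big( \bigcap_{i_1} F_{i_1}^c \Big) \Big) \Big] \cup \Big[ \bigcup_{i, i_1} \Big( O_i \cap F_{i_1} \cap \Big( \bigcap_{i_2 \neq i_1} F_{i_2}^c \Big) \Big) \Big] \cup \cdots \cup \Big[ \bigcup_{\substack{i, i_1, \ldots, i_{p-1} \\ (i_j \neq i_s\ \forall i_j,i_s \text{ s.t. } j \neq s)}} \Big( O_i \cap F_{i_1} \cap \cdots \cap F_{i_{p-1}} \cap \Big( \bigcap_{i_p \neq i_1, \ldots, i_{p-1}} F_{i_p}^c \Big) \Big) \Big].$$

This decomposition of $\mathbb{R}^p$ is comprised of $1 + \sum_{i=0}^{p-1} \binom{k-p}{i}$ mutually exclusive sets given by $\bigcap_i O_i^c$, $\bigcup_i \big( O_i \cap \big( \bigcap_{i_1} F_{i_1}^c \big) \big)$, $\bigcup_i \big( O_i \cap F_{i_1} \cap \big( \bigcap_{i_2 \neq i_1} F_{i_2}^c \big) \big)$ for $i_1 \in I_F$, and so on. We are now ready to bound the function on the left-hand side in (6).

We first show that the function is bounded on $\beta \in \bigcap_i O_i^c$ and $1 \leq \sigma \leq \omega/(\gamma M)$. For all $i \in I_O$, we have

$$\frac{f((b_i\omega - x_i^T\beta)/\sigma)}{f(\omega/\sigma)} \leq \frac{f(\omega/(2\sigma))}{f(\omega/\sigma)} \leq 2D(0, 2) \leq D(0, \gamma)\gamma,$$

using the monotonicity of $f$ because $|b_i\omega - x_i^T\beta|/\sigma \geq \omega/(2\sigma) \geq \gamma M/2 \geq M$ (we choose $\gamma \geq 2$),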
