arxiv: v2 [stat.me] 24 Dec 2017

Size: px
Start display at page:

Download "arxiv: v2 [stat.me] 24 Dec 2017"

Transcription

1 Bayesian Robustness to Outliers in Linear Regression arxiv:62.698v2 stat.me] 24 Dec 27 Philippe Gagnon, Université de Montréal, Canada Alain Desgagné Université du Québec à Montréal, Canada and Mylène Bédard Université de Montréal, Canada ABSTRACT. Whole robustness is an appealing attribute to look for in statistical models. It implies that the impact of outliers, defined here as the observations that are not in line with the general trend, gradually vanishes as they move further and further away from this trend. Consequently, the conclusions obtained are consistent with the bulk of the data. Nonrobust models may lead to inferences that are not in line with either the outliers or the nonoutliers. In this paper, we make the following contribution: we generalise existing whole robustness results for simple linear regression through the origin to the usual linear regression model. The strategy to attain whole robustness is simple: replace the traditional normal assumption on the error term by a super heavy-tailed distribution assumption. Analyses are then conducted as usual, and typical methods of inference as statistical hypothesis testing are available. Key words: Bayesian inference, Built-in robustness, Maximum likelihood estimation, Super heavy-tailed distributions, Whole robustness. Introduction The linear regression model is commonly used to predict values of a dependent variable through auxiliary information a set of observations from explanatory variables), or to quantify the strength of the relationship between the dependent variable and the explanatory variables. The underlying assumption in this model is that the explanatory variables are linearly related to the dependent variable, with an error term that traditionally is normally distributed. It is well understood that conflicting sources of information may contaminate the inference when normality is assumed. From a Bayesian perspective, these conflicting sources may represent the prior and outliers. We focus on the latter in this paper. A conflict therefore represents the fact that a group of observations produces a rather different inference than that arising from the bulk of the data and the prior. Under the normality assumption, this may lead to the following undesirable effect caused by the light normal tails: the posterior concentrates in an area that is not supported by any source of information in order to incorporate them all. In this context of robustness against outliers in linear regression, a contaminated inference corresponds to predictions that are not in line with either the outliers or the Address for correspondence: Philippe Gagnon, Département de mathématiques et de statistique, Université de Montréal, C.P. 628, Succursale Centre-Ville, Montréal, Québec, Canada, H3C 3J7. gagnonp@dms.umontreal.ca

2 2 P. Gagnon, A. Desgagné and M. Bédard nonoutliers, and misleading quantification of the strength of the relationship between the dependent variable and the explanatory variables. We believe that the appropriate way to address the problem is to limit the impact of outliers in order to obtain conclusions that are consistent with the majority of the observations. Box and Tiao 968) were the first to propose a Bayesian solution for dealing with outliers in linear regression. They suggested to let the error term be distributed as a mixture of two normals: one component for the nonoutliers and the other one, with a larger variance, for the outliers. This approach has been generalised by West 984) who modelled errors with heavy-tailed distributions constructed as scale mixtures of normals, which includes the Student distribution. Peña et al. 29) introduced more recently a different robust Bayesian approach in which the weight of each observation decreases with the distance between this observation and the bulk of the data. The authors proved that the Kullback-Leibler divergence between the posterior arising from the nonoutliers only and the posterior arising from the sample containing outliers is bounded. This result essentially indicates that these two posterior densities can be different, but that their ratio is bounded. The strategies of Box and Tiao 968) and West 984) are appealing by their simplicity. There is however a lack of theoretical results that validate the underlying ideas. In fact, heavy-tailed distributions as the Student only allow to attain partial robustness see the results of Andrade and O Hagan 2) in a context of location-scale model), which means that outliers have, as they approach plus or minus infinity, a significant but limited impact on the inference. Consequently, we believe that there is still a need for simple strategies with evidences of their validity. In this paper, we aim at filling this gap through an approach as simple as those of Box and Tiao 968) and West 984), and theoretical results ensuring whole robustness. As these authors, we replace the traditional normal assumption on the error term by a distribution assumption that accommodates for the presence of outliers. The tails of the distribution are however heavier than those of heavytailed distributions; they are super heavy tails e.g. log-pareto or log-student tails). We borrowed the idea from Desgagné and Gagnon 27), in which simple linear regressions through the origin are considered and whole robustness is proved. Our work is thus aligned with the theory of conflict resolution in Bayesian statistics, as described by O Hagan and Pericchi 22) in their extensive literature review on that topic. In the following, we first present the general model without any specific distribution assumption on the error term) in Section 2.. We then describe the super heavy-tailed distributions that we use to attain whole robustness in Section 2.2. They are log-regularly varying distributions. The linear regression model with an error term assumed to follow this type of distribution is characterised by its built-in robustness that resolves conflicts in a sensitive and automatic way. This shall be witnessed in Section 2.3 in which the robustness results are presented. The key result is the convergence of the posterior distribution arising from the whole sample) towards the posterior arising from the nonoutliers only, when the outliers approach plus or minus infinity. This corresponds to whole robustness and indicates that users can conduct their analyses as usual, estimating parameters with posterior medians and Bayesian credible intervals for instance. The result follows from the pointwise convergence of the posterior density towards that arising from the nonoutliers only. The posterior density being proportional to the likelihood function if a prior proportional to is used, we are able to establish the robustness of the maximum likelihood estimates. Our strategy is therefore also useful under the frequentist paradigm.

3 Bayesian Robust Linear Regression 3 In Section 3, we illustrate the practical implications of the theoretical results presented in Section 2.3 through a numerical study. In particular, we show the relevance of our approach in the analysis of the relationship between the dependent variable and the explanatory variables. The relationship between two stock market indexes is studied. We also evaluate the accuracy of the estimates arising from the robust multiple linear regression model. Our robust model is compared with the nonrobust with the normal assumption) and partially robust with the Student distribution assumption) models. It is showed that our model performs as well as the nonrobust and partially robust ones in absence of outliers, in addition to being completely robust. It indicates that, by simply changing the assumption on the error term, we obtain adequate estimates whether there are outliers or not. 2. Resolution of Conflicts in Linear Regression 2.. Model i) Let Y,..., Y n R be n random variables representing data points from the dependent variable and x T :=, x 2,..., x p ),..., xn T :=, x n2,..., x np ) be n vectors of observations from the explanatory variables, where p {2, 3,...}, n p + and x ij R are assumed to be known. We want to study the relationship between the dependent variable and the explanatory variables assuming that the following model is suitable: Y i = x T i β + ɛ i, i =,..., n, where the n random variables ɛ,..., ɛ n R and the p-dimensional random variable β := β,..., β p ) T R p represent the errors and regression coefficients, respectively. These n + random variables are conditionally independent given σ >, the scale parameter, with a conditional density for ɛ i given by ɛ i β, σ = D ɛ i σ D σ f ɛi ), i =,..., n. σ ii) We assume that f is a strictly positive continuous probability density function on R that is symmetric with respect to the origin, and that is such that both tails of z fz) are monotonic see Section 5 for the detailed definition of monotonicity), which implies that the tails of fz) are also monotonic. These assumptions on f imply that fz) and z fz) are bounded on the real line, and both converge to as z. Note that the function fz) can have parameters, e.g. a shape parameter; however, their value is assumed to be known. iii) We assume that the joint prior density of β and σ, denoted πβ, σ), is bounded on σ >, and is such that πβ, σ)//σ) is bounded on < σ, for all β R p. Together, these assumptions are equivalent to: πβ, σ)/ max, /σ) is bounded on σ >, for all β R p. A large variety of priors fit within this assumed structure; for instance, this is the case for all proper densities. In addition, non-informative priors such as πβ, σ) /σ, the usual one for this type of random variables, and πβ, σ) satisfy these assumptions. Inference on the parameters β and σ helps us understanding the relationship between the dependent and the explanatory variables. For simplicity, we assume that all explanatory variables are continuous.

4 4 P. Gagnon, A. Desgagné and M. Bédard Given that the scale parameter of the distribution of the error term is σ, homoscedasticity is an underlying assumption of the model. When the classical normality is assumed, σ also represents the conditional) standard deviation of each error ɛ i. Note that if in addition πβ, σ), the maximum a posteriori probability MAP) estimator of β is the well-known least square estimator. The MAP estimator corresponds to the maximum likelihood estimator when the prior is proportional to. An important drawback of the classical normality assumption is that outliers have a significant impact on the estimation. In this paper, we study robustness of the estimation of β and σ. The objective is to find sufficient conditions to attain whole robustness, meaning that the impact of outliers gradually vanishes as they move further and further away from the trend emerging from the bulk of the data this is the definition of outliers on which we focus in this paper). In other words, we want to attain robustness against observations with errors ɛ i = y i xi T β approaching + or, where β defines the probable hyperplanes for the bulk of the data. Our strategy to mathematically represent this situation is to let y i approach + or for the outliers and to consider the vectors x,..., x n as fixed. Studying this theoretical framework is indeed sufficient to attain in practise robustness against any type of outliers that fits our definition. Consider for instance an outlier with an extreme x value, but a y value comparable to those in the bulk of the data. This outlier can be viewed as an observation with a fixed x value, but an extremely low or high) y value compared with the trend emerging from the bulk of the data. We now set our mathematical framework for dealing with outliers. Among the n observations of Y,..., Y n, denoted by y n := y,..., y n ), we assume that k of them form a group of nonoutlying observations, that we denote y k, while l = n k of them are considered as outliers. For i =,..., n, we define the binary functions k i and l i as follows: if y i is a nonoutlying value k i =, and if it is an outlier l i =. These functions take the value of otherwise. Therefore, we have k i + l i = for i =,..., n, with n k i = k, and n l i = l. We assume that each outlier converges towards or + at its own specific rate, to the extent that the ratio of two outliers is bounded. More precisely, we assume that y i = a i + b i ω, for i =,..., n, where a i and b i are constants such that a i R and i) b i = if k i =, ii) b i if l i =, and we let ω. Let the joint posterior density of β and σ be denoted by πβ, σ y n ) and the marginal density of Y,..., Y n ) be denoted by my n ), where πβ, σ y n ) = my n )] πβ, σ) /σ)fy i xi T β)/σ), β R p, σ >. Let the joint posterior density of β and σ arising from the nonoutlying observations only be denoted by πβ, σ y k ) and the corresponding marginal density be denoted by my k ), where πβ, σ y k ) = my k )] πβ, σ) /σ)fyi xi T β)/σ) ] k i, β R p, σ >. Note that if the prior πβ, σ) is proportional to, the likelihood functions, given by the product

5 terms in the posteriors above, can also be expressed as follows: Bayesian Robust Linear Regression 5 Lβ, σ y n ) = my n )πβ, σ y n ) and Lβ, σ y k ) = my k )πβ, σ y k ). ) Proposition. Consider the Bayesian context described in Section 2., and assume that k > p +. Then, the joint posterior densities πβ, σ y k ) and πβ, σ y n ) are proper. The condition k p+ is sufficient if we assume that πβ, σ)//σ) is bounded on σ > instead of πβ, σ) is bounded on σ > and πβ, σ)//σ) is bounded on < σ ). The proof of Proposition is provided in Section 5.. Note that, in fact, we only need n > p + or n p+ if we assume that πβ, σ)//σ) is bounded on σ > ) to guarantee that πβ, σ y n ) is proper. The condition k > p + or k p + ) is stronger Log-Regularly Varying Distributions As mentioned in the introduction, our approach to attain robustness is to replace the traditional normal assumption on the error term by a log-regularly varying distribution assumption. In this section, we provide an overview of such distributions. For more information, we refer the reader to Desgagné 23) and Desgagné 25). Definition Log-regularly varying distribution). A random variable Z with a symmetric density fz) is said to have a log-regularly varying distribution with index ρ if zfz) is log-regularly varying at with this same index ρ see Definition 2). Definition 2 Log-regularly varying function). We say that a measurable function g is log-regularly varying at with index ρ R, denoted g L ρ ), if ɛ >, τ, there exists a constant Aɛ, τ) > such that z Aɛ, τ) and /τ ν τ ν ρ gz ν )/gz) < ɛ. If ρ =, g is said to be log-slowly varying at. Log-regularly varying functions form an interesting class of functions with useful properties for robustness. As indicated in Definition 2, they are such that g L ρ ) if gz ν )/gz) converges towards ν ρ uniformly in any set ν /τ, τ] for any τ ) as z, where ρ R. This implies that for any ρ R, we have g L ρ ) if and only if there exists a constant A > and a function s L ) such that for z A, g can be written as gz) = log z) ρ sz). It means that a symmetric distribution with tails behaving like / z )log z ) ρ with ρ is a log-regularly varying distribution. Note that an example of such distributions is presented in Section Resolution of conflicts We now present the main contribution of this paper, which is Theorem. This theorem contains our robustness results which are stated for models using auxiliary information, i.e. for models with p 2. The case p = corresponds to the location-scale model, and we refer the reader to Desgagné 25) for suitable assumptions under which robustness is attained in this specific situation.

6 6 P. Gagnon, A. Desgagné and M. Bédard Theorem. Consider the model and context described in Section 2.. If we assume that i) f is a log-regularly varying distribution i.e. zfz) L ρ ) with ρ ), ii) l n/2 p /2) k n/2 + p /2) i.e. the difference between the number of nonoutliers and the number of outliers, k l, is at least 2p /2)), then, recalling that y i = a i + b i ω with b i = for the nonoutliers and b i for the outliers, we obtain the following results: a) b) c) lim my n ) n fy i)] li = my k ), lim πβ, σ y n) = πβ, σ y k ), uniformly on β, σ) λ, λ] p /τ, τ], for any λ and τ, d) as ω, and in particular lim R p πβ, σ yn ) πβ, σ y k ) dβ dσ =, β, σ y n D β, σ yk, β i y n D βi y k, i =,..., p, and σ y n D σ yk, e) lim my k)/my n )] Lβ, σ y n ) = Lβ, σ y k ), uniformly on β, σ) λ, λ] p /τ, τ], for any λ and τ. The proof of Theorem is provided in Section 5.2. Theorem is of great practical use considering the simplicity of its two sufficient conditions. Condition i) indicates that modelling must be performed using a density f with sufficiently heavy tails, and more precisely, using a log-regularly varying distribution see Definition ). Well-known heavy-tailed distributions, such as the Student distribution, do not satisfy this criterion. Desgagné 25) introduced an appealing distribution family that belongs to the family of log-regularly varying distributions, and therefore, satisfies condition i): the family of log-pareto-tailed symmetric distributions. In our opinion, the most appealing log-pareto-tailed symmetric distribution is a distribution called the log-pareto-tailed standard normal distribution, with parameters α > and ϕ >. It exactly matches the standard normal on the interval α, α], while having tails that behave like / z )log z ) ϕ which is a log-pareto behaviour). This distribution shall be described in more details in the numerical study in Section 3. Condition ii) indicates that there must be at most n/2 p /2) outliers, or equivalently, that there must be at least n/2 + p /2) nonoutliers, where and are the floor and ceiling functions, respectively. For instance, consider a sample of size n = 5 with one explanatory variable

7 Bayesian Robust Linear Regression 7 the simple linear regression model is therefore used, i.e. p = 2). Robustness is attained provided that there are l = 6 outliers or less which leaves k = 9 nonoutliers or more). The breakdown point, which is the proportion of outliers that an estimator can handle, is therefore 6/5 =.4 when n = 5. More generally, the condition l n/2 p /2) translates into a breakdown point of n/2 p /2) /n. For fixed p, the breakdown point thus converges to.5 as n, usually considered as the maximum desired value. Condition ii) is generally satisfied in practise given that the number of outliers rarely reaches half the sample size minus p /2). Another feature of Theorem that contributes to its practical use is that the results are easy to interpret. In result a), the asymptotic behaviour of the marginal my n ) is described. This result is used in Section 3. to assess robustness of Bayes factors for testing H : β i = versus H : β i when i 2) through the convergence my n )/my n β i = ) my k )/my k β i = ) as ω. Result a) is in fact the centrepiece of Theorem ; its demonstration requires considerable work, and then leads relatively easily to the other conclusions of the theorem. Result b) indicates that the posterior density arising from the whole sample converges uniformly towards the posterior density arising from the nonoutliers only. The impact of outliers then gradually vanishes as they approach plus or minus infinity. Result b) leads to result c), stating that the posterior density arising from the whole sample converges in L towards the posterior density arising from the nonoutlying observations only. This last result implies the following convergence: Pβ, σ E y n ) Pβ, σ E y k ) as ω, uniformly for all sets E R p R +. This result is slightly stronger than convergence in distribution result d)) which requires only pointwise convergence. The convergence of the posterior marginal distributions then follows. Result d) is the most practical result; it indicates that any estimation of β and σ based on posterior quantiles e.g. using posterior medians and Bayesian credible intervals) is robust to outliers. Result e) follows from result b) and indicates that for a given sample, the likelihood up to a multiplicative constant that is independent of β and σ) converges uniformly to the likelihood arising from the nonoutliers only. Consequently, the maximum of Lβ, σ y n ) converges to the maximum of Lβ, σ y k ), and therefore, the maximum likelihood estimate also converges, as ω. 3. Numerical Study In this section, we aim at increasing the understanding of the practical implications of Theorem. This is first achieved, in Section 3., via an analysis of the relationship between two stock market indexes. Second, in Section 3.2, we study the behaviour of the posterior when we artificially move an observation towards plus or minus infinity. In other words, we visually represent the convergence of the posterior towards that arising from the nonoutliers only result d) of Theorem ). In Section 3.3, a simulation study is conducted to evaluate the accuracy of the estimates arising from our robust linear regression model with two explanatory variables. In all the analyses, we compare the results obtained from the robust model with those from the nonrobust with the normal assumption) and partially robust with the Student distribution assumption) models. 3.. Analysis of the Relationship Between Two Stock Market Indexes We know from Theorem that the impact of outliers is asymptotically null if we assume a super heavy-tailed distribution on the error term of the linear regression model. In other words, this robust

8 8 P. Gagnon, A. Desgagné and M. Bédard model excludes outliers in the limit. In this section, we show that this model turns out to effectively limit the impact of outliers in practise. In particular, we show how this translates into adequate conclusions, contrary to the nonrobust model. The data set presented in Figure and Table is analysed for this purpose. January 2 daily returns of S&P 5 y) and S&P/TSX x) y x Fig.. January 2 daily returns of S&P 5 and S&P/TSX with posterior medians under the nonrobust blue solid line), partially robust orange dashed line), and robust green dotted line) models, respectively Table. Returns for day i of January 2 for S&P 5 y i) and S&P/TSX x i) in percentages, i =,..., 9 y i x i y i x i The data are therefore analysed using a simple linear regression where the daily returns of the S&P 5 and S&P/TSX play the role of the dependent and explanatory variables, respectively. We set the prior to be non-informative, and more precisely, πβ, σ) /σ. For each of the three models considered robust, partially robust and nonrobust), we first compute posterior medians and highest posterior density HPD) intervals for the parameters. For the partially robust model, the degrees of freedom of the heavy-tailed Student distribution are arbitrarily set to, and a known scale parameter of.88 is added to this distribution in order to match the 2.5th and 97.5th percentiles with those of the standard normal. For the robust model, we assume that the error term has a log- Pareto-tailed standard normal distribution. Its density depicted in Figure 2) is expressed as { 2π) /2 exp x 2 /2), if x α the standard normal part), fx) = 2π) /2 exp α 2 /2)α/ x )log α/ log x ) ϕ, if x > α the log-pareto tails). 2) We set α such that this distribution has also the same 2.5th and 97.5th percentiles as the standard normal, i.e. α =.96. This implies that, according to the procedure described in Section 4 of Desgagné 25), ϕ = 4.8 this procedure ensures that f is a continuous probability density function). Therefore, all three error distributions studied in this section have 95% of their mass in the interval.96,.96].

9 Bayesian Robust Linear Regression 9 Comparison between the standard normal, Student and log-pareto-tailed standard normal fx) fx) x x Fig. 2. Densities of the standard normal blue solid line), Student with degrees of freedom and a known scale parameter of.88 orange dashed line), and log-pareto-tailed standard normal with α =.96 and ϕ = 4.8 green dotted line) The parameter estimates are provided in Table 2. Contrary to the partially robust and robust models, the 95% HPD interval for β 2 of the nonrobust model includes. The posterior in this case puts therefore some mass around. We now perform Bayesian tests for hypotheses H : β 2 = versus H : β 2 through the Bayes factor my n )/my n H ) i.e. the marginal density of y n divided by the marginal density of y n under H, which is the marginal density of y n under the simple location-scale model). The Bayes factors are 3.69, 9.23, and 23. for the nonrobust, partially robust and robust models, respectively. Under the robust model, it seems clear that there is enough evidence to conclude that the relationship between the variables is statistically significant. Under the nonrobust model, it is not that clear. We now redo the analysis while excluding observation 8 an outlier as can be seen in Figure ) in order to show its impact. The results are presented in Table 3. The results are now relatively similar for all three models. In particular, the 95% HPD intervals for β 2 do not include. Also, the Bayes factors are 62.75, 5.6, and for the nonrobust, partially robust, and robust models, respectively. The impact of the outlier is much more significant on ˆσ than on ˆβ as can be seen by comparing the posterior medians in Table 2 with those in Table 3). This is due to the position of x 8, y 8 ) =.2,.79), which is centred with respect to the x values in the data set as seen in Figure ). Its presence therefore results in a higher probability for a larger scaling. The impact of the outlier is however not only on the marginal posterior distribution of σ; it propagates through the joint posterior of β and σ, increasing the posterior scaling of β. This is especially true for the nonrobust model and this is what led to the differences in the test results presented above), but this is also true for the partially robust model. Our approach is appealing because it provides whole robustness for all parameters. In this example, whole robustness leads to an inference that best reflects the behaviour of the bulk of the data, compared with the inferences arising from the nonrobust and partially robust models. Note that, as mentioned in Section 2.3, the Bayes factor is a robust measure when it is assumed that the error term has a super heavy-tailed distribution as the log-pareto-tailed standard normal distribution. Indeed, result a) of Theorem states that the marginal my n ) behaves like

10 P. Gagnon, A. Desgagné and M. Bédard my k ) n fy i)] li. Furthermore, the marginal my n H ) behaves like my k H ) n fy i) ] li, because when the assumptions of Theorem are satisfied, those of Theorem in Desgagné 25), ensuring whole robustness for the location-scale model, are also satisfied. As a result, the Bayes factor my n )/my n H ) behaves like my k )/my k H ). Table 2. Posterior medians and 95% HPD intervals under the three models based on the analysis of the data set presented in Table Assumptions on f Estimates for β β 2 σ Standard normal.4.25,.34).4.2,.83).62.43,.88) Student d.f.)..4,.35).4.7,.76).53.33,.8) Log-Pareto-tailed normal.3.9,.34).43.3,.72).42.26,.65) Table 3. Posterior medians and 95% HPD intervals under the three models based on the analysis of the data set presented in Table, but excluding observation 8 Assumptions on f Estimates for β β 2 σ Standard normal.5.3,.33).44.8,.69).37.25,.53) Student d.f.).6.2,.34).4.6,.68).39.25,.58) Log-Pareto-tailed normal.5.4,.33).43.7,.69).37.24,.54) 3.2. Convergence of the Posterior In this section, we illustrate how the posterior behaves when we artificially move an observation towards plus or minus infinity, through the evolution of point estimates. More precisely, we use the same data set as in Section 3. see Table and Figure ), and we gradually move the observation identified as an outlier in that section i.e. x 8, y 8 )) according to the following procedure: y 8 is gradually moved from the value.5 a nonoutlier) to 6 a large outlier), while x 8 remains fixed. In Section 3., we have observed the impact of an outlier centred with respect to the x values in the data set. We therefore change the value of x 8 to to illustrate the impact of an outlier located at one of the endpoints of the x values. While we gradually move the observation, we estimate the parameters β := β, β 2 ) T and σ for each data set related to a different value of y 8 using posterior medians with a prior proportional to /σ. This whole process is performed under the three models the nonrobust, partially robust, and wholly robust models with the same distribution assumptions as in Section 3.). We now present the results in Figure 3, in which y 8 = y 8 is defined as the x-axis instead of y 8 ) so that larger values correspond to larger distances to the bulk of the data. We first notice that, for the robust model, the impact of the point x 8, y 8 ) grows until y 8 reaches a threshold, and beyond this threshold, the impact vanishes as the observation approaches minus infinity. The threshold is around y 8 =.4, and based on the data set with y 8 =.4, we find ˆβ =., ˆβ 2 =.54 and ˆσ =.43. The point estimates converge towards.6,.43 and.36 for β, β 2 and σ, respectively, which are the point estimates when x 8, y 8 ) is excluded from the sample. Whole robustness is, as expected, attained for both β and σ. Note that an increase in the

11 β Bayesian Robust Linear Regression Estimation of β under the three Estimation of β 2 under the three Estimation of σ under the three models when y 8 = y 8 models when y 8 = y 8 models when y 8 = y 8 increases from.5 to 6 increases from.5 to 6 increases from.5 to y8 β y8 σ y8 Fig. 3. Estimation of β and σ when y 8 = y 8 increases from.5 to 6 under three different assumptions on f: standard normal density blue solid line), Student density orange dashed line) and log-pareto-tailed standard normal density green dotted line) value of the parameter α would result in an increase in the value of the threshold. Setting α =.96 seems suitable for practical uses. For the partially robust model, the estimation of β is robust as the impact of the outlier slowly decreases after a certain threshold. The impact on σ is limited, but does not decrease when the outlying value moves away, reflecting partial robustness. The inference is clearly not robust when it is assumed that the error has a normal distribution, as the estimates of β 2 and σ grow, and that of β diminishes Simulation Study In this section, we evaluate through a simulation study the accuracy of the estimates arising from our robust linear regression model with two explanatory variables. The model is Y i = xi T β + ɛ i with β := β, β 2, β 3 ) T and ɛ i σ /σ)fɛ D i /σ), i =,..., n, where f is assumed to be a log-pareto-tailed standard normal density with α =.96 and ϕ = 4.8 in our robust model. It is compared with the same linear regression model, but where f is assumed to be a standard normal density in the nonrobust model, and a Student density with degrees of freedom and a known scale parameter of.88 in the partially robust model. Note that the parameters of the distributions are the same as in Section 3.. For the simulation study, we set n = 3, x,2, x 2,2,..., x 3,2 =, 2,..., 3 which are the observations from the first explanatory variable), x,3, x 2,3,..., x 3,3 = 2, 2,..., 29 2 which are the observations from the second explanatory variable), and πβ, σ). We simulate 5, data sets using values for β and σ respectively set to,,.) T and 2 for each of the three scenarios that we now describe. In the first one, the errors are generated from a standard normal distribution; therefore, the probability to observe outliers is negligible. In the second scenario, the errors are generated from a mixture of two normals where the first component is a standard normal and the second has a mean of and a variance of, with weights of.95 and.5, respectively. This last component can contaminate the data sets by generating extreme values. In the third and last scenario, the errors are also generated from a mixture of two normals, but the contamination is due to the second component s scaling. More precisely, the first component is again a standard normal distribution, but the second has a mean of and a variance of 2, with weights of.9 and., respectively.

12 2 P. Gagnon, A. Desgagné and M. Bédard For each simulated data set, we estimate β and σ using MAP estimation for the three models recall that MAP estimation corresponds to maximum likelihood estimation when the prior is proportional to as in this analysis). Within each simulation scenario, we evaluate the performance of each model using sample mean square errors MSE), based on the true values β =, β 2 =, β 3 =., and σ = 2; the performance of the estimation of β is evaluated through the sum of the MSE of the estimators of β, β 2, and β 3. The results are presented in Tables 4 and 5. We first notice that when there is no outlier the % N, ) scenario), the accuracy of the estimates arising from the three models is approximately the same. This is not surprising given that the three densities studied are similar, especially on the interval.96,.96] in which 95% of their mass is located. They however differ in the thickness of their tails, a feature that plays a major role in robustness. The density with the slimmest tails, the normal, corresponds to the case where the impact of outliers is the most significant. When assuming the Student distribution on the error term, the impact on ˆβ is limited, but the impact on ˆσ is significant, reflecting partial robustness. Under the robust model, the impact on ˆβ, ˆσ) is, as expected, limited. Note that for all models, the impact on each of the ˆβ i is in line with that on ˆβ. Table 4. Sum of the MSE of the estimators of β, β 2 and β 3 under the three scenarios and the three assumptions of f Assumptions on f Scenarios %N, ) 95%N, ) + 5%N, ) 9%N, ) + %N, 2 ) Standard normal Student d.f.) Log-Pareto-tailed normal Table 5. MSE of the estimators of σ under the three scenarios and the three assumptions of f Assumptions on f Scenarios %N, ) 95%N, ) + 5%N, ) 9%N, ) + %N, 2 ) Standard normal Student d.f.) Log-Pareto-tailed normal Conclusion In this paper, we have proved whole robustness results for linear regression models, as the following key result: the convergence of the posterior distribution towards that arising from the nonoutliers only when the outliers approach plus or minus infinity result d), Theorem ). These results hold under two simple and intuitive assumptions. First, assume that the error term follows a super heavytailed distribution to accommodate for the presence of outliers. Second, the number of outliers must not exceed half the sample size minus p /2. In other words, there must be at most n/2 p /2) outliers or equivalently, that must be at least n/2 + p /2) nonoutliers). Although the whole robustness results are asymptotic, their practical relevance has been shown through numerical analyses in Section 3. The robust model used in these analyses results from

13 Bayesian Robust Linear Regression 3 the assumption that the error term follows the log-pareto-tailed standard normal density provided in 2). This model has been compared with the nonrobust with the normal assumption) and partially robust with the Student distribution assumption) models. The conclusion is that our model performs as well as the nonrobust and partially robust models in absence of outliers, in addition to being completely robust. In fact, it performs almost as well in absence of outliers, as seen in Tables 4 and 5 in Section 3.3. This is a rather small price to pay for whole robustness. We therefore believe it is suitable to always assume that the error term has the density detailed in 2) to obtain adequate results, regardless of whether there are outliers or not, by computing estimates as usual from the posterior distribution. 5. Proofs The proof of Proposition is given in Section 5. and the proof of Theorem is given in Section 5.2. Prior to that, we define the constant B > as follows: { } B := max sup z R fz), sup z R z fz), sup β R p, σ> πβ, σ)/ max, /σ) The monotonicity of the tails of fz) and z fz) implies that there exists a constant M > such that y z M fy) fz) and y fy) z fz). 3). 5.. Proof of Proposition To prove that πβ, σ y n ) is proper the proof for πβ, σ y k ) is omitted because it is similar), it suffices to show that the marginal my n ) is finite. Recall that we require that n > p +. The reader will notice that only n p + is required if πβ, σ) is bounded by B/σ for all σ > instead of πβ, σ) is bounded by B max, /σ)). We first show that the function is integrable on the area where the ratio /σ is bounded. More precisely, we consider β R p and δm σ <, where δ is a positive constant that can be chosen as small as we want upper bounds are provided in the proof). We next show that the function is integrable on the area where the ratio /σ approaches infinity, that is < σ < δm. We have δm R p πβ, σ) a B n p+ b max, M δ /σ)fy i xi T β)/σ) dβ dσ max, ) δm σ ) x T B n p+ det.. fu p ) du p σ n p x T p p σ f R p yi xi T β ) dβ dσ σ dσ δm σn p fu ) du

14 4 P. Gagnon, A. Desgagné and M. Bédard δm σ n p dσ c = M/δ) n p /n p ) <. In step a, we use πβ, σ) B max, /σ) and we bound each of n p densities f by B. In step b, we use the change of variables u i = y i xi T β)/σ for i =,..., p. The determinant is non-null because all explanatory variables are continuous. Indeed, consider the case p = 2 for instance i.e. the simple linear regression); the determinant is different from provided that x 2 x 22, and this happens with probability. In step c, we use n > p +. Note that if, instead, we bound πβ, σ) by B/σ in step a, one can verify that the condition n p + is sufficient to bound above the integral. We now show that the integral is finite on β R p and < σ < δm. In this area, the ratio /σ) approaches infinity. We have to carefully analyse the subareas where y i xi T β is close to in order to deal with the / form of the ratios y i xi T β)/σ. In order to achieve this, we split the domain of β as follows: R p = n i =R c ] i n i = n i,i 2=i i 2) ))] R i n i2=i2 i) Rc i2 R i R i2 n i3=i3 i,i2) Rc i3 ))] n i,i 2,...,i p=i j i s i j,i s s.t. j s) Ri R i2... R ip ) ], 4) where R i := {β : y i xi T β < δ}, i {,..., n}. The set R i represents the hyperplanes y = xi T β characterised by the different values of β that satisfy y i xi T β < δ. In other words, it represents the hyperplanes passing near the point x i, y i ), and more precisely, at a vertical distance of less than δ. The set n i = Rc i is therefore comprised of the hyperplanes that are not passing close to any point. The set n i R = i n i 2=i 2 i ) Rc i 2 )) represents the hyperplanes passing near one and only one) point. The set n i R,i 2=i i 2) i R i2 n i 3=i 3 i,i 2) Rc i 3 )) represents the hyperplanes passing near two and only two) points, and so on. We choose δ small enough to ensure that R i R i2... R ip R ip+ = when i,..., i p+ are all different. It is possible to do so because an hyperplane passes through no more than p points. This implies that n i,i 2,...,i p=i j i s i j,i s s.t. j s) = n i,i 2,...,i p=i j i s i j,i s s.t. j s) Ri R i2... R ip ) ] ))] R i R i2... R ip n ip+=ip+ i,i2,...,ip) Rc ip+. Note that all sets R i R i2... R ip are nonempty when i,..., i p are all different, because all explanatory variables are continuous which implies that the p p matrix with rows given by xi T,..., xi T p has a determinant different from ). Note also that R i n i 2=i 2 i ) Rc i 2 ) is nonempty for all i, R i R i2 n i 3=i 3 i,i 2) Rc i 3 ) is nonempty for all i, i 2 such that i i 2, and so on. Finally note that the decomposition of R p given in 4) is comprised of p n i= i) mutually exclusive sets given by n i = Rc i, R i n i 2=i 2 i ) Rc i 2 ), i =,..., n, R i R i2 n i 3=i 3 i,i 2) Rc i 3 ), i, i 2 =,..., n with i i 2, and so on. We now consider < σ < δm and β in one of the p n i= i) mutually exclusive sets given in 4). As explained above, the difficulty lies in dealing with the hyperplanes β that are such that y i xi T β < δ for some points x i, y i ). The strategy is essentially to use the product of

15 Bayesian Robust Linear Regression 5 /σ)fy i x T i β)/σ) of these points to integrate over β, and to bound the other terms of my n). Therefore, if β R i R i2... R ip, we consider the points x i, y i ), x i2, y i2 ),..., x ip, y ip ) to integrate over β. If β R i R i2... R ip n i p=i p i,...,i p ) Rc i p ), we consider the points x i, y i ), x i2, y i2 ),..., x ip, y ip ), and any other point x ip, y ip ) to integrate over β, and so on. We have πβ, σ) /σ)fy i xi T β)/σ) a B/σ) maxδm, ) /σ)fy i xi T β)/σ) /σ) i {i,...,i p} b /σ)/σ)f δ/σ)] n p c B/δ] n p /σ 2 )f δ/σ) /σ)fy i x T i β)/σ) i {i,...,i p} i {i,...,i p} i {i,...,i p} /σ)fy i x T i β)/σ) /σ)fy i x T i β)/σ). /σ)fy i x T i β)/σ) In step a, we use πβ, σ) B max, /σ) = B/σ) maxσ, ) B/σ) maxδm, ). In step b, for all i {i,..., i p } we use fy i xi T β)/σ) fδ/σ) by the monotonicity of the tails of f because y i xi T β /σ δ/σ δδ M = M. In step c, we bound n p terms /σ)fδ/σ) by B/δ. Finally, we bound the integral of /σ 2 )f δ/σ) i {i /σ)fy,...,i p} i xi T β)/σ) by /σ 2 )f δ/σ) a = det x T i. x T i p R p i {i,...,i p} /σ)fy i x T i β)/σ) dβ dσ /σ 2 )f δ/σ) dσ = b det x T i. x T i p f σ ) dσ <. In step a, we use the same change of variables as above: u j = y ij x T i j β)/σ for j =,..., p. In step b, we use the change of variable σ = δ/σ Proof of Theorem Consider the model and the context described in Section 2.. We assume that l n/2 p /2) k n/2 + p /2) k l + 2p. In addition, we assume that l, i.e. that there is at least one outlier, otherwise the proof would be trivial. Two propositions and a lemma that are used in the proof of Theorem are first given, and the proofs of results a) to e) follow. The proofs of these propositions and this lemma can be found in Desgagné 25). Proposition 2 Dominance). If s L ) and g L ρ ), then for all δ >, there exists a constant Aδ) > such that z Aδ) implies that log z) δ < sz) < log z) δ and log z) ρ δ < gz) < log z) ρ+δ.

16 6 P. Gagnon, A. Desgagné and M. Bédard Proposition 3 Location-scale transformation). If zfz) L ρ ), then we have /σ)fz µ)/σ)/fz) as z, uniformly on µ, σ) λ, λ] /τ, τ], for any λ and τ. Lemma. For all λ, τ, there exists a constant Dλ, τ) such that z R and µ, σ) λ, λ] /τ, τ] implies that /Dλ, τ) /σ)fz µ)/σ)/fz) Dλ, τ). Note that Lemma is a corollary of Proposition 3. Proof of Result a). We first observe that my n ) my k ) my n ) n fy = i)] li my k ) n fy πβ, σ y n ) dσ dβ i)] li R p πβ, σ) n /σ)fyi xi T = β)/σ)] k i+l i R p my k ) n fy dσ dβ i)] li /σ)fyi xi T = πβ, σ y k ) β)/σ) ] li dσ dβ. fy i ) R p We show that the last integral converges towards as ω to prove result a). If we use Lebesgue s dominated convergence theorem to interchange the limit ω and the integral, we have lim R p = = R p R p πβ, σ y k ) /σ)fyi xi T β)/σ) ] li dσ dβ fy i ) /σ)fyi xi T β)/σ) ] li dσ dβ fy i ) lim πβ, σ y k) πβ, σ y k ) dσ dβ =, using Proposition 3 in the second equality, since x,..., x n are fixed, and then Proposition. Note that the condition of Proposition is satisfied because k l + 2p k p + 2 because l and p 2). Note also that pointwise convergence is sufficient, for any value of β R p and σ >, once the limit is inside the integral. However, in order to use Lebesgue s dominated convergence theorem, we need to prove that the integrand is bounded by an integrable function of β and σ that does not depend on ω, for any value of ω y, where y is a constant. The constant y can be chosen as large as we want, and minimum values for y will be given throughout the proof. In order to bound the integrand, we divide the domain of integration into two areas: σ < and < σ <. Again, we want to separately analyse the area where the ratio /σ approaches infinity. We assumed that y i can be written as y i = a i + b i ω, where ω, and a i and b i are constants such that a i R and b i if l i = if the observation is an outlier). Therefore, the ranking of the

17 Bayesian Robust Linear Regression 7 elements in the set { y i : l i = } is primarily determined by the values b,..., b n, and we can choose the constant y larger than a certain threshold to ensure that this ranking remains unchanged for all ω y. Without loss of generality, we assume for convenience that ω = min y i and consequently min b i =. {i: l } {i: l } We now bound above the integrand on the first area. Area : Consider σ < and assume without loss of generality that y,..., y p are p nonoutliers therefore k =... = k p = ). We have πβ, σ y k ) a B σ n b c = d = /σ)f yi xi T β)/σ) ] li fy i ) n fω)] l B σ n fω)] l σ n D a i, )fb i ω x T i β)/σ) πβ, σ) σ n fy i xi T β)/σ) fy i )] li fy i )] li D a i, )fb i ω xi T β)/σ) b i D a i, b i )] li fb i ω xi T β)/σ) fx T fω)] l σ n i β/σ) ] k i fbi ω xi T β)/σ) ] l i p /σ)fxt i β/σ) ] l ω/σ fx T σ k p /2 ωfω) σ /2 i β/σ) ] k i fbi ω xi T β)/σ) ] l i. i=p+ In step a, we use y i = a i + b i ω and Lemma to obtain fy i x T i β)/σ) = fb i ω x T i β)/σ + a i /σ) D a i, )fb i ω x T i β)/σ), because a i /σ a i for all i. We also use πβ, σ) B max, /σ) = B. In step b, we use again Lemma to obtain fω)/fy i ) = fy i a i )/b i )/fy i ) b i D a i, b i ). In step c, we set b i = if k i = and we use the symmetry of f to obtain f xi T β/σ) = fxt i β/σ). In step d, we use the assumption k =... = k p =. Now it suffices to demonstrate that ω/σ ] l ωfω) σ /2 i=p+ fx T i β/σ) ] k i fbi ω x T i β)/σ) ] l i 5) is bounded by a constant that does not depend on ω, β and σ since p /σ)fxt i β/σ)

18 8 P. Gagnon, A. Desgagné and M. Bédard /σ) k p /2 is an integrable function on area. Indeed, since k > p +, we have /σ) k p /2 p x T /σ)fxi T β/σ) dβ dσ = det.. R p x T p dσ <, σk p /2 using the following change of variables: u i = xi T β/σ, i =,..., p. The determinant is different from because all the explanatory variables are continuous. Note that if instead, in step a above, we bound πβ, σ) by B/σ, one can verify that the condition k p + is sufficient to bound above the integral. In order to bound the function in 5), we split area into three parts: σ < ω /2, ω /2 σ < ω/γm) and ω/γm) σ <, where M is defined in 3) and γ is a positive constant that can be chosen as large as we want lower bounds are provided in the proof). Note that this split is well defined if y > max, γm) 2 ) since ω y. First, consider ω/γm) σ <. We have ω/σ ] l ωfω) σ /2 i=p+ b B n p l+/2 /ω)/2 γm) ωfω)] l fx T i β/σ) ] k i fbi ω x T i β)/σ) ] l i a B n p d B n p γm) l+/2 2ρ + )l/e] ρ+)l <. σ /2 c B n p γm) l+/2 /ω) /2 log ω) ρ+)l ] l ω/σ ωfω) In step a, we use f B. In step b, we use ω/σ γm and /σ γm/ω. In step c, we use ωfω) > log ω) ρ if ω y A), where A) comes from Proposition 2. For step d, it is purely algebraic to show that the maximum of log ω) ξ /ω /2 is 2ξ/e) ξ for ω > and ξ >, where ξ = ρ + )l in our situation. Now, consider the two other parts combined we will split them in the next step), that is σ ω/γm). We have In step a, we use ω/σ ] l ωfω) σ /2 = σ /2 a σ /2 i=p+ i=p+ fx T i β/σ) ] k i fbi ω x T i β)/σ) ] l i ] l n ω/σ)fω/σ) ωfω) i=p+ ] l ω/σ)fω/σ) B k p D, γ)γ] l. ωfω) fx T i β/σ) ] k i fbi ω xi T β)/σ) ] li fω/σ) fx T i β/σ) ] k i fbi ω xi T β)/σ) ] li B k p D, γ)γ] l. 6) fω/σ)

19 Bayesian Robust Linear Regression 9 The proof of this inequality is substantial. Therefore, to ease the reading, it is deferred after the demonstration that the remaining term, i.e. is bounded. σ /2 ] l ω/σ)fω/σ), ωfω) We first consider ω /2 σ ω/γm). We have σ /2 ω/σ)fω/σ) ωfω) ] l a B l /ω)/4 ωfω)] l b B l /ω) /4 log ω) ρ+)l c B l 4ρ + )l/e] ρ+)l <. In step a, we use ω/σ)fω/σ) B and /σ) /2 /ω) /4. In step b, we use ωfω) > log ω) ρ if ω y A), where A) comes from Proposition 2. In step c, it is purely algebraic to show that the maximum of log ω) ξ /ω /4 is 4ξ/e) ξ for ω > and ξ >, where ξ = ρ + )l in our situation. We now consider σ ω /2. We have σ /2 ] l ω/σ)fω/σ) a ω /2 fω /2 ] l ) b 2 ρ+)l <. ωfω) ωfω) In step a, we use /σ and ω/σ)fω/σ) ω /2 fω /2 ) by the monotonicity of the tails of z fz) since ω/σ ω /2 y /2 M if y M 2. In step b, we use ω /2 fω /2 )/ωfω)) 2/2) ρ = 2 ρ+ if ω y A, 2), where A, 2) comes from the definition of log-regularly varying functions see Definition 2). Finally, we prove the inequality in 6). Recall that we assume that k n/2 + p /2) k l + 2p. We know that y,..., y p are p nonoutliers therefore k =... = k p = ). We also know that the number of remaining nonoutliers the nonoutliers among observations p + to n) is greater than or equal to p because k p l + p p because we assume that l ). In order to prove the result, we split the domain of β as follows: R p = i Oi c ] i Oi i Fi c ))] i,i Oi F i i2 i Fi c ))] 2 ))] i,i,...,i p i j i s i j,i s s.t. j s) O i F i F ip ip i,...,i p F c ip where i,i,...,i pi j i s i j,i s s.t. j s) Oi F i F ip )], O i := {β : b i ω x T i β < ω/2}, i I O, 7) F i := {β : x T i β < ω/γ}, i I F, 8) I O := {i : i {p +,..., n} and l i = } and I F := {i : i {p +,..., n} and k i = } are the sets of indexes of outliers and remaining fixed observations nonoutliers), respectively.

20 2 P. Gagnon, A. Desgagné and M. Bédard The set O i represents the hyperplanes y = xi T β characterised by the different values of β that satisfy b i ω xi T β < ω/2. In other words, it represents the hyperplanes that pass at a vertical distance of less than ω/2 of the point x i, b i ω), which is considered as an outlier since ω recall that b i ω = y i a i ). Analogously, the set F i represents the hyperplanes that pass at a vertical distance of less than ω/γ of the point x i, ), which is considered as a nonoutlier. Therefore, the set i Oi c represents the hyperplanes that pass at a vertical distance of at least ω/2 of all the points x i, b i ω) all the outliers). The set i O i i Fi c )) represents the hyperplanes that pass at a vertical distance of less than ω/2 of at least one point x i, b i ω) an outlier), but at a vertical distance of at least ω/γ of all the points x i, ) all the nonoutliers). For each i I F, the set i O i F i i2 i Fi c 2 )) represents the hyperplanes that pass at a vertical distance of less than ω/2 of at least one point x i, b i ω) an outlier), at a vertical distance of less than ω/γ of the point x i, ) a nonoutlier), but at a vertical distance of at least ω/γ of all the other nonoutliers, and so on. Now, we claim that O i F i F ip = for all i, i,..., i p with i j i s, i j, i s such that j s. To prove this, we use the fact that x i a vector of size p) can be expressed as a linear combination of x i,..., x ip. This is true because all explanatory variables are continuous, and therefore, linearly independent with probability. As a result, considering that β F i F ip and x i = p s= a sx is for some a,..., a p R, we have p ) T b i ω xi T β = b a iω a s x is β p b i ω a s xi T b s β ω ω p a s c ω ω γ 2. s= In step a, we use the reverse triangle inequality. In step b, we use that b i and p s= a sxi T s β p s= a s xi T s β p s= a s ω/γ because β F i F ip, which means that xi T β < ω/γ for all i {i,..., i p }. In step c, we define the constant γ such that γ 2 p s= a s we choose γ such that it satisfies this inequality for any combination of i and i,..., i p ). Therefore, we have that β O i. This proves that O i F i F ip = for all i, i,..., i p with i j i s, i j, i s such that j s. This in turn implies that R p = i Oi c ] i Oi i Fi c ))] i,i Oi F i i2 i Fi c ))] 2 ))] i,i,...,i p i j i s i j,i s s.t. j s) O i F i F ip ip i,...,i p F c ip. This decomposition of R p is comprised of + p k p ) i= i mutually exclusive sets given by i Oi c, i O i i Fi c )), i O i F i i2 i Fi c 2 )) for i I F, and so on. We are now ready to bound the function on the left-hand side in 6). We first show that the function is bounded on β i O c i and σ ω/γm). For all i I O, we have fb i ω x T i β)/σ) fω/σ) fω/2σ)) fω/σ) s= 2D, 2) D, γ)γ, using the monotonicity of f because b i ω xi T β /σ ω/2σ) γm/2 M we choose γ 2), s=

Robust estimation of skew-normal distribution with location and scale parameters via log-regularly varying functions

Robust estimation of skew-normal distribution with location and scale parameters via log-regularly varying functions Robust estimation of skew-normal distribution with location and scale parameters via log-regularly varying functions Shintaro Hashimoto Department of Mathematics, Hiroshima University October 5, 207 Abstract

More information

Bayesian inference for the location parameter of a Student-t density

Bayesian inference for the location parameter of a Student-t density Bayesian inference for the location parameter of a Student-t density Jean-François Angers CRM-2642 February 2000 Dép. de mathématiques et de statistique; Université de Montréal; C.P. 628, Succ. Centre-ville

More information

Bayesian robustness modelling of location and scale parameters

Bayesian robustness modelling of location and scale parameters Bayesian robustness modelling of location and scale parameters A. O Hagan and J.A.A. Andrade Dept. of Probability and Statistics, University of Sheffield, Sheffield, S 7RH, UK. Dept. of Statistics and

More information

1 Hypothesis Testing and Model Selection

1 Hypothesis Testing and Model Selection A Short Course on Bayesian Inference (based on An Introduction to Bayesian Analysis: Theory and Methods by Ghosh, Delampady and Samanta) Module 6: From Chapter 6 of GDS 1 Hypothesis Testing and Model Selection

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

A Bayesian perspective on GMM and IV

A Bayesian perspective on GMM and IV A Bayesian perspective on GMM and IV Christopher A. Sims Princeton University sims@princeton.edu November 26, 2013 What is a Bayesian perspective? A Bayesian perspective on scientific reporting views all

More information

Proof. We indicate by α, β (finite or not) the end-points of I and call

Proof. We indicate by α, β (finite or not) the end-points of I and call C.6 Continuous functions Pag. 111 Proof of Corollary 4.25 Corollary 4.25 Let f be continuous on the interval I and suppose it admits non-zero its (finite or infinite) that are different in sign for x tending

More information

A Note on Auxiliary Particle Filters

A Note on Auxiliary Particle Filters A Note on Auxiliary Particle Filters Adam M. Johansen a,, Arnaud Doucet b a Department of Mathematics, University of Bristol, UK b Departments of Statistics & Computer Science, University of British Columbia,

More information

Invariant HPD credible sets and MAP estimators

Invariant HPD credible sets and MAP estimators Bayesian Analysis (007), Number 4, pp. 681 69 Invariant HPD credible sets and MAP estimators Pierre Druilhet and Jean-Michel Marin Abstract. MAP estimators and HPD credible sets are often criticized in

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by

L p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may

More information

The Instability of Correlations: Measurement and the Implications for Market Risk

The Instability of Correlations: Measurement and the Implications for Market Risk The Instability of Correlations: Measurement and the Implications for Market Risk Prof. Massimo Guidolin 20254 Advanced Quantitative Methods for Asset Pricing and Structuring Winter/Spring 2018 Threshold

More information

Statistical Data Analysis

Statistical Data Analysis DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the

More information

Importance sampling with the generalized exponential power density

Importance sampling with the generalized exponential power density Statistics and Computing 5: 89 96, 005 C 005 Springer Science + Business Media, Inc. Manufactured in The Netherlands. Importance sampling with the generalized exponential power density ALAIN DESGAGNÉ and

More information

Module 1. Probability

Module 1. Probability Module 1 Probability 1. Introduction In our daily life we come across many processes whose nature cannot be predicted in advance. Such processes are referred to as random processes. The only way to derive

More information

Bayesian Inference for Normal Mean

Bayesian Inference for Normal Mean Al Nosedal. University of Toronto. November 18, 2015 Likelihood of Single Observation The conditional observation distribution of y µ is Normal with mean µ and variance σ 2, which is known. Its density

More information

Bayesian spatial quantile regression

Bayesian spatial quantile regression Brian J. Reich and Montserrat Fuentes North Carolina State University and David B. Dunson Duke University E-mail:reich@stat.ncsu.edu Tropospheric ozone Tropospheric ozone has been linked with several adverse

More information

Measure and integration

Measure and integration Chapter 5 Measure and integration In calculus you have learned how to calculate the size of different kinds of sets: the length of a curve, the area of a region or a surface, the volume or mass of a solid.

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Chapter 8. General Countably Additive Set Functions. 8.1 Hahn Decomposition Theorem

Chapter 8. General Countably Additive Set Functions. 8.1 Hahn Decomposition Theorem Chapter 8 General Countably dditive Set Functions In Theorem 5.2.2 the reader saw that if f : X R is integrable on the measure space (X,, µ) then we can define a countably additive set function ν on by

More information

AMS-207: Bayesian Statistics

AMS-207: Bayesian Statistics Linear Regression How does a quantity y, vary as a function of another quantity, or vector of quantities x? We are interested in p(y θ, x) under a model in which n observations (x i, y i ) are exchangeable.

More information

Linear Models A linear model is defined by the expression

Linear Models A linear model is defined by the expression Linear Models A linear model is defined by the expression x = F β + ɛ. where x = (x 1, x 2,..., x n ) is vector of size n usually known as the response vector. β = (β 1, β 2,..., β p ) is the transpose

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

A Note on Hypothesis Testing with Random Sample Sizes and its Relationship to Bayes Factors

A Note on Hypothesis Testing with Random Sample Sizes and its Relationship to Bayes Factors Journal of Data Science 6(008), 75-87 A Note on Hypothesis Testing with Random Sample Sizes and its Relationship to Bayes Factors Scott Berry 1 and Kert Viele 1 Berry Consultants and University of Kentucky

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

2 Sequences, Continuity, and Limits

2 Sequences, Continuity, and Limits 2 Sequences, Continuity, and Limits In this chapter, we introduce the fundamental notions of continuity and limit of a real-valued function of two variables. As in ACICARA, the definitions as well as proofs

More information

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions

Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error Distributions Journal of Modern Applied Statistical Methods Volume 8 Issue 1 Article 13 5-1-2009 Least Absolute Value vs. Least Squares Estimation and Inference Procedures in Regression Models with Asymmetric Error

More information

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form.

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form. Stat 8112 Lecture Notes Asymptotics of Exponential Families Charles J. Geyer January 23, 2013 1 Exponential Families An exponential family of distributions is a parametric statistical model having densities

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

Divergence Based priors for the problem of hypothesis testing

Divergence Based priors for the problem of hypothesis testing Divergence Based priors for the problem of hypothesis testing gonzalo garcía-donato and susie Bayarri May 22, 2009 gonzalo garcía-donato and susie Bayarri () DB priors May 22, 2009 1 / 46 Jeffreys and

More information

Teaching Bayesian Model Comparison With the Three-sided Coin

Teaching Bayesian Model Comparison With the Three-sided Coin Teaching Bayesian Model Comparison With the Three-sided Coin Scott R. Kuindersma Brian S. Blais, Department of Science and Technology, Bryant University, Smithfield RI January 4, 2007 Abstract In the present

More information

d(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N

d(x n, x) d(x n, x nk ) + d(x nk, x) where we chose any fixed k > N Problem 1. Let f : A R R have the property that for every x A, there exists ɛ > 0 such that f(t) > ɛ if t (x ɛ, x + ɛ) A. If the set A is compact, prove there exists c > 0 such that f(x) > c for all x

More information

Lebesgue measure and integration

Lebesgue measure and integration Chapter 4 Lebesgue measure and integration If you look back at what you have learned in your earlier mathematics courses, you will definitely recall a lot about area and volume from the simple formulas

More information

Multivariate Statistics

Multivariate Statistics Multivariate Statistics Chapter 2: Multivariate distributions and inference Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2016/2017 Master in Mathematical

More information

Primer on statistics:

Primer on statistics: Primer on statistics: MLE, Confidence Intervals, and Hypothesis Testing ryan.reece@gmail.com http://rreece.github.io/ Insight Data Science - AI Fellows Workshop Feb 16, 018 Outline 1. Maximum likelihood

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Bayesian vs frequentist techniques for the analysis of binary outcome data

Bayesian vs frequentist techniques for the analysis of binary outcome data 1 Bayesian vs frequentist techniques for the analysis of binary outcome data By M. Stapleton Abstract We compare Bayesian and frequentist techniques for analysing binary outcome data. Such data are commonly

More information

Recall that if X is a compact metric space, C(X), the space of continuous (real-valued) functions on X, is a Banach space with the norm

Recall that if X is a compact metric space, C(X), the space of continuous (real-valued) functions on X, is a Banach space with the norm Chapter 13 Radon Measures Recall that if X is a compact metric space, C(X), the space of continuous (real-valued) functions on X, is a Banach space with the norm (13.1) f = sup x X f(x). We want to identify

More information

Understanding Regressions with Observations Collected at High Frequency over Long Span

Understanding Regressions with Observations Collected at High Frequency over Long Span Understanding Regressions with Observations Collected at High Frequency over Long Span Yoosoon Chang Department of Economics, Indiana University Joon Y. Park Department of Economics, Indiana University

More information

The scatterplot is the basic tool for graphically displaying bivariate quantitative data.

The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Example: Some investors think that the performance of the stock market in January

More information

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u

So far our focus has been on estimation of the parameter vector β in the. y = Xβ + u Interval estimation and hypothesis tests So far our focus has been on estimation of the parameter vector β in the linear model y i = β 1 x 1i + β 2 x 2i +... + β K x Ki + u i = x iβ + u i for i = 1, 2,...,

More information

Statistical Inference

Statistical Inference Statistical Inference Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Week 12. Testing and Kullback-Leibler Divergence 1. Likelihood Ratios Let 1, 2, 2,...

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Asymptotic Statistics-III. Changliang Zou

Asymptotic Statistics-III. Changliang Zou Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data.

Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Bivariate Data: Graphical Display The scatterplot is the basic tool for graphically displaying bivariate quantitative data. Example: Some investors think that the performance of the stock market in January

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

Lebesgue Integration on R n

Lebesgue Integration on R n Lebesgue Integration on R n The treatment here is based loosely on that of Jones, Lebesgue Integration on Euclidean Space We give an overview from the perspective of a user of the theory Riemann integration

More information

Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box Durham, NC 27708, USA

Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box Durham, NC 27708, USA Testing Simple Hypotheses R.L. Wolpert Institute of Statistics and Decision Sciences Duke University, Box 90251 Durham, NC 27708, USA Summary: Pre-experimental Frequentist error probabilities do not summarize

More information

Objective Prior for the Number of Degrees of Freedom of a t Distribution

Objective Prior for the Number of Degrees of Freedom of a t Distribution Bayesian Analysis (4) 9, Number, pp. 97 Objective Prior for the Number of Degrees of Freedom of a t Distribution Cristiano Villa and Stephen G. Walker Abstract. In this paper, we construct an objective

More information

Inference in Hybrid Bayesian Networks with Nonlinear Deterministic Conditionals

Inference in Hybrid Bayesian Networks with Nonlinear Deterministic Conditionals KU SCHOOL OF BUSINESS WORKING PAPER NO. 328 Inference in Hybrid Bayesian Networks with Nonlinear Deterministic Conditionals Barry R. Cobb 1 cobbbr@vmi.edu Prakash P. Shenoy 2 pshenoy@ku.edu 1 Department

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

robustness, efficiency, breakdown point, outliers, rank-based procedures, least absolute regression

robustness, efficiency, breakdown point, outliers, rank-based procedures, least absolute regression Robust Statistics robustness, efficiency, breakdown point, outliers, rank-based procedures, least absolute regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth

More information

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation

More information

Integral Jensen inequality

Integral Jensen inequality Integral Jensen inequality Let us consider a convex set R d, and a convex function f : (, + ]. For any x,..., x n and λ,..., λ n with n λ i =, we have () f( n λ ix i ) n λ if(x i ). For a R d, let δ a

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation

Ridge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation Patrick Breheny February 8 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/27 Introduction Basic idea Standardization Large-scale testing is, of course, a big area and we could keep talking

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y

INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y Predictor or Independent variable x Model with error: for i = 1,..., n, y i = α + βx i + ε i ε i : independent errors (sampling, measurement,

More information

Do Shareholders Vote Strategically? Voting Behavior, Proposal Screening, and Majority Rules. Supplement

Do Shareholders Vote Strategically? Voting Behavior, Proposal Screening, and Majority Rules. Supplement Do Shareholders Vote Strategically? Voting Behavior, Proposal Screening, and Majority Rules Supplement Ernst Maug Kristian Rydqvist September 2008 1 Additional Results on the Theory of Strategic Voting

More information

Problem Selected Scores

Problem Selected Scores Statistics Ph.D. Qualifying Exam: Part II November 20, 2010 Student Name: 1. Answer 8 out of 12 problems. Mark the problems you selected in the following table. Problem 1 2 3 4 5 6 7 8 9 10 11 12 Selected

More information

Probabilities & Statistics Revision

Probabilities & Statistics Revision Probabilities & Statistics Revision Christopher Ting Christopher Ting http://www.mysmu.edu/faculty/christophert/ : christopherting@smu.edu.sg : 6828 0364 : LKCSB 5036 January 6, 2017 Christopher Ting QF

More information

Independent Component (IC) Models: New Extensions of the Multinormal Model

Independent Component (IC) Models: New Extensions of the Multinormal Model Independent Component (IC) Models: New Extensions of the Multinormal Model Davy Paindaveine (joint with Klaus Nordhausen, Hannu Oja, and Sara Taskinen) School of Public Health, ULB, April 2008 My research

More information

Measure Theory and Lebesgue Integration. Joshua H. Lifton

Measure Theory and Lebesgue Integration. Joshua H. Lifton Measure Theory and Lebesgue Integration Joshua H. Lifton Originally published 31 March 1999 Revised 5 September 2004 bstract This paper originally came out of my 1999 Swarthmore College Mathematics Senior

More information

1 Introduction. 2 Measure theoretic definitions

1 Introduction. 2 Measure theoretic definitions 1 Introduction These notes aim to recall some basic definitions needed for dealing with random variables. Sections to 5 follow mostly the presentation given in chapter two of [1]. Measure theoretic definitions

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression Reading: Hoff Chapter 9 November 4, 2009 Problem Data: Observe pairs (Y i,x i ),i = 1,... n Response or dependent variable Y Predictor or independent variable X GOALS: Exploring

More information

Connectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ).

Connectedness. Proposition 2.2. The following are equivalent for a topological space (X, T ). Connectedness 1 Motivation Connectedness is the sort of topological property that students love. Its definition is intuitive and easy to understand, and it is a powerful tool in proofs of well-known results.

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Does k-th Moment Exist?

Does k-th Moment Exist? Does k-th Moment Exist? Hitomi, K. 1 and Y. Nishiyama 2 1 Kyoto Institute of Technology, Japan 2 Institute of Economic Research, Kyoto University, Japan Email: hitomi@kit.ac.jp Keywords: Existence of moments,

More information

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples Bayesian inference for sample surveys Roderick Little Module : Bayesian models for simple random samples Superpopulation Modeling: Estimating parameters Various principles: least squares, method of moments,

More information

Gibbs Sampling in Endogenous Variables Models

Gibbs Sampling in Endogenous Variables Models Gibbs Sampling in Endogenous Variables Models Econ 690 Purdue University Outline 1 Motivation 2 Identification Issues 3 Posterior Simulation #1 4 Posterior Simulation #2 Motivation In this lecture we take

More information

A NOTE ON SECOND ORDER CONDITIONS IN EXTREME VALUE THEORY: LINKING GENERAL AND HEAVY TAIL CONDITIONS

A NOTE ON SECOND ORDER CONDITIONS IN EXTREME VALUE THEORY: LINKING GENERAL AND HEAVY TAIL CONDITIONS REVSTAT Statistical Journal Volume 5, Number 3, November 2007, 285 304 A NOTE ON SECOND ORDER CONDITIONS IN EXTREME VALUE THEORY: LINKING GENERAL AND HEAVY TAIL CONDITIONS Authors: M. Isabel Fraga Alves

More information

Solutions Final Exam May. 14, 2014

Solutions Final Exam May. 14, 2014 Solutions Final Exam May. 14, 2014 1. Determine whether the following statements are true or false. Justify your answer (i.e., prove the claim, derive a contradiction or give a counter-example). (a) (10

More information

Discussion of Hypothesis testing by convex optimization

Discussion of Hypothesis testing by convex optimization Electronic Journal of Statistics Vol. 9 (2015) 1 6 ISSN: 1935-7524 DOI: 10.1214/15-EJS990 Discussion of Hypothesis testing by convex optimization Fabienne Comte, Céline Duval and Valentine Genon-Catalot

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible

More information

Isomorphisms between pattern classes

Isomorphisms between pattern classes Journal of Combinatorics olume 0, Number 0, 1 8, 0000 Isomorphisms between pattern classes M. H. Albert, M. D. Atkinson and Anders Claesson Isomorphisms φ : A B between pattern classes are considered.

More information

P-adic Functions - Part 1

P-adic Functions - Part 1 P-adic Functions - Part 1 Nicolae Ciocan 22.11.2011 1 Locally constant functions Motivation: Another big difference between p-adic analysis and real analysis is the existence of nontrivial locally constant

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Statistics Journal Club, 36-825 Beau Dabbs and Philipp Burckhardt 9-19-2014 1 Paper

More information

Minimum Message Length Analysis of the Behrens Fisher Problem

Minimum Message Length Analysis of the Behrens Fisher Problem Analysis of the Behrens Fisher Problem Enes Makalic and Daniel F Schmidt Centre for MEGA Epidemiology The University of Melbourne Solomonoff 85th Memorial Conference, 2011 Outline Introduction 1 Introduction

More information

General Bayesian Inference I

General Bayesian Inference I General Bayesian Inference I Outline: Basic concepts, One-parameter models, Noninformative priors. Reading: Chapters 10 and 11 in Kay-I. (Occasional) Simplified Notation. When there is no potential for

More information

Lecture 4 Noisy Channel Coding

Lecture 4 Noisy Channel Coding Lecture 4 Noisy Channel Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 9, 2015 1 / 56 I-Hsiang Wang IT Lecture 4 The Channel Coding Problem

More information

Probability and statistics; Rehearsal for pattern recognition

Probability and statistics; Rehearsal for pattern recognition Probability and statistics; Rehearsal for pattern recognition Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics 166 36 Prague 6, Jugoslávských

More information

Statistical Inference of Covariate-Adjusted Randomized Experiments

Statistical Inference of Covariate-Adjusted Randomized Experiments 1 Statistical Inference of Covariate-Adjusted Randomized Experiments Feifang Hu Department of Statistics George Washington University Joint research with Wei Ma, Yichen Qin and Yang Li Email: feifang@gwu.edu

More information

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33

Hypothesis Testing. Econ 690. Purdue University. Justin L. Tobias (Purdue) Testing 1 / 33 Hypothesis Testing Econ 690 Purdue University Justin L. Tobias (Purdue) Testing 1 / 33 Outline 1 Basic Testing Framework 2 Testing with HPD intervals 3 Example 4 Savage Dickey Density Ratio 5 Bartlett

More information

A Very Brief Summary of Bayesian Inference, and Examples

A Very Brief Summary of Bayesian Inference, and Examples A Very Brief Summary of Bayesian Inference, and Examples Trinity Term 009 Prof Gesine Reinert Our starting point are data x = x 1, x,, x n, which we view as realisations of random variables X 1, X,, X

More information

MATHS 730 FC Lecture Notes March 5, Introduction

MATHS 730 FC Lecture Notes March 5, Introduction 1 INTRODUCTION MATHS 730 FC Lecture Notes March 5, 2014 1 Introduction Definition. If A, B are sets and there exists a bijection A B, they have the same cardinality, which we write as A, #A. If there exists

More information

Convex Optimization Notes

Convex Optimization Notes Convex Optimization Notes Jonathan Siegel January 2017 1 Convex Analysis This section is devoted to the study of convex functions f : B R {+ } and convex sets U B, for B a Banach space. The case of B =

More information

Space Telescope Science Institute statistics mini-course. October Inference I: Estimation, Confidence Intervals, and Tests of Hypotheses

Space Telescope Science Institute statistics mini-course. October Inference I: Estimation, Confidence Intervals, and Tests of Hypotheses Space Telescope Science Institute statistics mini-course October 2011 Inference I: Estimation, Confidence Intervals, and Tests of Hypotheses James L Rosenberger Acknowledgements: Donald Richards, William

More information

A misère-play -operator

A misère-play -operator A misère-play -operator Matthieu Dufour Silvia Heubach Urban Larsson arxiv:1608.06996v1 [math.co] 25 Aug 2016 July 31, 2018 Abstract We study the -operator (Larsson et al, 2011) of impartial vector subtraction

More information

Bayesian Point Process Modeling for Extreme Value Analysis, with an Application to Systemic Risk Assessment in Correlated Financial Markets

Bayesian Point Process Modeling for Extreme Value Analysis, with an Application to Systemic Risk Assessment in Correlated Financial Markets Bayesian Point Process Modeling for Extreme Value Analysis, with an Application to Systemic Risk Assessment in Correlated Financial Markets Athanasios Kottas Department of Applied Mathematics and Statistics,

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

On the errors introduced by the naive Bayes independence assumption

On the errors introduced by the naive Bayes independence assumption On the errors introduced by the naive Bayes independence assumption Author Matthijs de Wachter 3671100 Utrecht University Master Thesis Artificial Intelligence Supervisor Dr. Silja Renooij Department of

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information