Robust estimators for additive models using backfitting

Graciela Boente, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires and CONICET, Argentina
Alejandra Martínez, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires and CONICET, Argentina
Matias Salibián Barrera, University of British Columbia, Canada

Abstract

Additive models provide an attractive setup to estimate regression functions in a nonparametric context. They provide a flexible and interpretable model, where each regression function depends only on a single explanatory variable and can be estimated at an optimal univariate rate. It is easy to see that most estimation procedures for these models are highly sensitive to the presence of even a small proportion of outliers in the data. In this paper, we show that a relatively simple robust version of the backfitting algorithm (consisting of using robust local polynomial smoothers) corresponds to the solution of a well-defined optimization problem. This formulation allows us to find mild conditions to show that these estimators are Fisher consistent and to study the convergence of the algorithm. Our numerical experiments show that the resulting estimators have good robustness and efficiency properties. We illustrate the use of these estimators on a real data set where the robust fit reveals the presence of influential outliers.

Keywords: Fisher consistency; Kernel weights; Robust estimation; Smoothing

1 Introduction

Consider a general regression model, where a response variable $Y \in \mathbb{R}$ is related to a vector $X = (X_1, \ldots, X_d)^t \in \mathbb{R}^d$ of explanatory variables through the following non-parametric regression model:
$$Y = g_0(X) + \sigma_0\,\varepsilon. \qquad (1)$$
The error $\varepsilon$ is assumed to be independent from $X$ and centred at zero, while $\sigma_0$ is the error scale parameter. When $\varepsilon$ has a finite first moment, we have the usual regression representation $E(Y \mid X) = g_0(X)$. Standard estimators for $g_0$ can thus be derived relying on local estimates of the conditional mean, such as kernel or local polynomial regression estimators. It is easy to see that such procedures can be seriously affected either by a small proportion of outliers in the response variable, or when the distribution of $Y \mid X$ has heavy tails. Note, however, that even when $\varepsilon$ does not have a finite first moment, the function $g_0(X)$ can still be interpreted as a location parameter for the distribution of $Y \mid X$. In this case, local robust estimators can be used to estimate the regression function, for example, the local M-estimators proposed in Boente and Fraiman (1989) and the local medians studied in Welsh (1996).

Unfortunately, both robust and non-robust non-parametric regression estimators are affected by the curse of dimensionality, which is caused by the fact that the expected number of observations in local neighbourhoods decreases exponentially as a function of $d$, the number of covariates. This results in regression estimators with a very slow convergence rate. Stone (1985) showed that additive models can avoid these problems and produce non-parametric multiple regression estimators with a univariate rate of convergence. In an additive model, the regression function is assumed to satisfy
$$g_0(X) = \mu_0 + \sum_{j=1}^{d} g_{0,j}(X_j), \qquad (2)$$
where $\mu_0 \in \mathbb{R}$ and $g_{0,j} : \mathbb{R} \to \mathbb{R}$, $1 \le j \le d$, are unknown smooth functions with $E(g_{0,j}(X_j)) = 0$. Such a model retains the ease of interpretation of linear regression models, where each component $g_{0,j}$ can be thought of as the effect of the $j$-th covariate on the centre of the conditional distribution of $Y$. Moreover, Linton (1997), Fan et al. (1998) and Mammen et al. (1999) obtained different oracle properties showing that each additive component can be estimated as well as when all the other ones are known.

Several algorithms to fit additive models have been proposed in the literature. In this paper, we focus on the backfitting algorithm as introduced in Friedman and Stuetzle (1981) and discussed further in Buja et al. (1989). The backfitting algorithm can be intuitively motivated by observing that, if (2) holds, then
$$g_{0,j}(x) = E\Bigl( Y - \mu_0 - \sum_{l \neq j} g_{0,l}(X_l) \Bigm| X_j = x \Bigr). \qquad (3)$$
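At the sample level, (3) suggests cycling over the components and smoothing partial residuals against the corresponding covariate until the fits stabilize. The following R sketch illustrates this classical (least-squares) version of backfitting with a simple Nadaraya–Watson smoother; the function, its arguments and its defaults are illustrative assumptions and not the implementation discussed later in the paper.

```r
# Illustrative classical backfitting: cycle over components, smoothing the partial
# residuals of (3) on X[, j] with a Nadaraya-Watson (kernel mean) smoother.
backfit_ls <- function(X, y, h, max_iter = 50, tol = 1e-6) {
  n <- nrow(X); d <- ncol(X)
  mu <- mean(y)
  g  <- matrix(0, n, d)                          # additive components at the sample points
  nw <- function(x, r, x0, bw) {                 # kernel mean of r on x, evaluated at x0
    sapply(x0, function(z) {
      w <- dnorm((x - z) / bw)
      sum(w * r) / sum(w)
    })
  }
  for (it in 1:max_iter) {
    g_old <- g
    for (j in 1:d) {
      r <- y - mu - rowSums(g[, -j, drop = FALSE])   # partial residuals, cf. (3)
      g[, j] <- nw(X[, j], r, X[, j], h[j])
      g[, j] <- g[, j] - mean(g[, j])                # re-centre so the g_j average to zero
    }
    if (max(abs(g - g_old)) < tol) break
  }
  list(mu = mu, g = g)
}
```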

Hence, given a sample, the backfitting algorithm iteratively estimates the components $g_{0,j}$, $1 \le j \le d$, using a univariate smoother of the partial residuals in (3) as functions of the $j$-th covariate. This algorithm is widely used due to its flexibility (different univariate smoothers can be used), ease of implementation and intuitive motivation. Furthermore, it has been shown to work very well in simulation studies (Sperlich et al. 1999) and applications, although its statistical properties are difficult to study due to its iterative nature. Some results regarding its bias and conditional variance can be found in Opsomer and Ruppert (1997), Wand (1999) and Opsomer (2000). When second moments exist, Breiman and Friedman (1985) showed that, under certain regularity conditions, the backfitting procedure finds functions $m_1(X_1), \ldots, m_d(X_d)$ minimizing $E\bigl(Y - \mu - \sum_{j=1}^d m_j(X_j)\bigr)^2$ over the space of functions with $E[m_j(X_j)] = 0$ and finite second moments. In other words, even if the regression function $g_0$ in (1) does not satisfy the additive model (2), the backfitting algorithm finds the orthogonal projection of the regression function onto the linear space of additive functions in $L^2$. Equivalently, backfitting finds the closest additive approximation (in the $L^2$ sense) to $E(Y \mid X_1, \ldots, X_d)$. Furthermore, the backfitting algorithm is a coordinate-wise descent algorithm minimizing the squared loss functional above. The sample version of the algorithm solves an $nd \times nd$ system of normal equations and corresponds to the Gauss–Seidel algorithm for linear systems of equations.

If the smoother chosen to estimate (3) is not resistant to outliers, then the estimated additive components can be seriously affected by a relatively small proportion of atypical observations. Given the local nature of non-parametric regression estimators, we will be concerned with the case where outliers are present in the response variable. Bianco and Boente (1998) considered robust estimators for additive models using kernel regression, which are a robust version of those defined in Baek and Wehrly (1993). The main drawback of this approach is that it assumes that $Y - g_{0,j}(X_j)$ is independent of $X_j$, which is difficult to justify or verify in practice. Outlier-resistant fits for generalized additive models have been considered recently in the literature. When the variance is a known function of the mean and the dispersion parameter is known, we refer to Alimadad and Salibián-Barrera (2012) and Wong et al. (2014), who consider robust fits based on backfitting and penalized splines M-estimators, respectively. In the case of model (1), the approach of Wong et al. (2014) reduces to that of Oh et al. (2007), which is an alternative based on penalized splines. On the other hand, Croux et al. (2011) provide a robust fit for generalized additive models with nuisance parameters using penalized splines, but no theoretical support is provided for their method.

In this paper, we consider an intuitively appealing way to obtain robust estimators for model (1), which combines the backfitting algorithm with robust univariate smoothers. For example, one can consider those proposed in Boente and Fraiman (1989), Härdle and Tsybakov (1988), Härdle (1990) and Oh et al. (2007). One of the main contributions of this paper is to show that this intuitive approach to obtain a robust backfitting algorithm is well justified.

E[ρ((Y µ d i=1 m j(x j ))/σ )] over functions m 1 (X 1 ),..., m d (X d ) with E[m j (X j )] =, where ρ is a loss function. Furthermore, this robust backfitting corresponds to a coordinatewise descent algorithm and can be shown to converge. We also establish sufficient conditions for these robust backfitting estimators to be Fisher consistent for the true additive components when (2) holds. Our numerical experiments confirm that these estimators have very good finite-sample properties, both in terms of robustness, and efficiency with respect to the classical approach when the data do not contain outliers. These robust estimators cannot be interpreted as orthogonal projections of the regression function onto the space of additive functions of the predictors. However, the first-order conditions for the minimum of this optimization problem are closely related to the robust conditional location functional defined in Boente and Fraiman (1989). The rest of the paper is organized as follows. In Section 2, we show that the robust backfitting algorithm mentioned above corresponds to a coordinate-descent algorithm to minimize a robust functional using a convex loss function. We also prove that the resulting estimator is Fisher consistent, which means that the solution to the population version of the problem is the object of interest (in our case, the true regression function). The convergence of this algorithm is studied in Section 2.1, while its finite-sample version using local M regression smoothers is presented in Section 3. The results of our numerical experiments conducted to evaluate the performance of the proposed procedure are reported in Section 4. To study the sensitivity of the robust backfitting with respect to single outliers, Section provides a numerical study of the empirical influence function. Finally, in Section 6 we illustrate the advantage of using robust backfitting on a real data set. All proofs are relegated to the Appendix. 2 The robust backfitting functional In this section, we introduce a population-level version of the robust backfitting algorithm. By showing that the robust backfitting corresponds to a coordinate-descent algorithm to minimize a robust functional, we are able to find sufficient conditions for the robust backfitting to be Fisher-consistent. In what follows, we will assume that (X t, Y ) t is a random vector satisfying the additive model (2), where Y R and X = (X 1,..., X d ) t, that is, Y = µ + d g,j (X j ) + σ ε. (4) j=1 As it is customary, to ensure identifiability of the components of the model, we will further assume that Eg,j (X j ) =, 1 j d. When second moments exist, it is easy to see that 4

When second moments exist, it is easy to see that the backfitting estimators solve the following minimization problem
$$\min_{(\nu, m) \in \mathbb{R} \times \mathcal{H}_{ad}} E\Bigl(Y - \nu - \sum_{j=1}^{d} m_j(X_j)\Bigr)^2, \qquad (5)$$
where $\mathcal{H}_{ad} = \bigl\{ m(x) = \sum_{j=1}^d m_j(x_j) : m_j \in \mathcal{H}_j \bigr\} \subset \mathcal{H}$, $\mathcal{H} = \{ r(x) : E(r(X)) = 0,\ E(r^2(X)) < \infty \}$, and $\mathcal{H}_j$ is the Hilbert space of measurable functions $m_j$ of $X_j$ with zero mean and finite second moment, i.e., $E\,m_j(X_j) = 0$ and $E\,m_j^2(X_j) < \infty$. The solution to (5) is characterized by its residual $Y - \mu - g(X)$ being orthogonal to $\mathcal{H}_{ad}$. Since this space is spanned by $\mathcal{H}_l$, $1 \le l \le d$, the solution of (5) satisfies $E\bigl(Y - \mu - \sum_{j=1}^d g_j(X_j)\bigr) = 0$ and $E\bigl(Y - \mu - \sum_{j=1}^d g_j(X_j) \mid X_l\bigr) = 0$ for $1 \le l \le d$, from which it follows that $\mu = E(Y)$ and $g_l(X_l) = E\bigl(Y - \mu - \sum_{j \neq l} g_j(X_j) \mid X_l\bigr)$, $1 \le l \le d$. Given a sample, the backfitting algorithm iterates the above system of equations replacing the conditional expectations with non-parametric regression estimators (e.g. local polynomial smoothers).

To reduce the effect of outliers on the regression estimates, we replace the square loss function in (5) by a function with bounded derivative, such as the Huber or Tukey loss functions. For these losses, $\rho_c(u) = c^2 \rho_1(u/c)$, where $c > 0$ is a tuning constant chosen to achieve a given efficiency. The Huber-type loss corresponds to $\rho_1 = \rho_h$ and Tukey's loss to $\rho_1 = \rho_t$, where $\rho_h(u) = u^2/2$ if $|u| \le 1$ and $\rho_h(u) = |u| - 1/2$ otherwise, while $\rho_t(u) = \min(3u^2 - 3u^4 + u^6, 1)$. Other possible choices are $\rho_1(u) = \sqrt{1 + u^2} - 1$, which is a smooth approximation of the Huber function, and $\rho_1(u) = u \arctan(u) - \tfrac{1}{2}\ln(1 + u^2)$, which has derivative $\rho_1'(u) = \arctan(u)$.
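The loss functions above are straightforward to code. The following R sketch implements them as stated, in the standardized form $\rho_1$, so that the tuning constant enters through $\rho_c(u) = c^2 \rho_1(u/c)$; the function names are ours.

```r
# The loss functions mentioned above, in standardized form rho_1, and their derivatives.
rho_huber <- function(u) ifelse(abs(u) <= 1, u^2 / 2, abs(u) - 1/2)
psi_huber <- function(u) pmax(-1, pmin(1, u))
rho_tukey <- function(u) pmin(3 * u^2 - 3 * u^4 + u^6, 1)
psi_tukey <- function(u) ifelse(abs(u) <= 1, 6 * u * (1 - u^2)^2, 0)
rho_smooth_huber <- function(u) sqrt(1 + u^2) - 1           # smooth approximation of Huber's loss
rho_arctan <- function(u) u * atan(u) - 0.5 * log(1 + u^2)  # derivative is atan(u)
# Rescaled version used in practice: rho_c(u) = c^2 * rho_1(u / c) for a tuning constant c > 0
rho_c <- function(u, rho1, c) c^2 * rho1(u / c)
```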

The bounded derivative of the loss function controls the effect of outlying values in the response variable (sometimes called vertical outliers in the literature). Formally, our objective function is given by
$$\Upsilon(\nu, m) = E\,\rho\Bigl( \frac{Y - \nu - \sum_{j=1}^{d} m_j(X_j)}{\sigma_0} \Bigr), \qquad (6)$$
where $\rho : \mathbb{R} \to [0, \infty)$ is even, $\nu \in \mathbb{R}$ and the functions $m_j \in \mathcal{H}_j$, $1 \le j \le d$. Let $P$ be a distribution on $\mathbb{R}^{d+1}$ and let $(X^t, Y)^t \sim P$. Define the functional $(\mu(P), g(P))$ as the solution of the following optimization problem:
$$(\mu(P), g(P)) = \mathop{\mathrm{argmin}}_{(\nu, m) \in \mathbb{R} \times \mathcal{H}_{ad}} \Upsilon(\nu, m), \qquad (7)$$
where $g(P)(X) = \sum_{j=1}^{d} g_j(P)(X_j) \in \mathcal{H}_{ad}$.

To prove that the functional in (7) is Fisher-consistent and to derive first-order conditions for the point where it attains its minimum value, we will need the following assumptions:

E1 The random variable $\varepsilon$ has a density function $f_0(t)$ that is even, non-increasing in $|t|$, and strictly decreasing for $|t|$ in a neighbourhood of $0$.

R1 The function $\rho : \mathbb{R} \to [0, \infty)$ is continuous and non-decreasing on $[0, \infty)$, with $\rho(0) = 0$ and $\rho(u) = \rho(-u)$. Moreover, if $0 \le u < v$ with $\rho(v) < \sup_t \rho(t)$, then $\rho(u) < \rho(v)$.

A1 Given functions $m_j \in \mathcal{H}_j$, if $P\bigl(\sum_{j=1}^d m_j(X_j) = 0\bigr) = 1$ then, for all $1 \le j \le d$, we have $P(m_j(X_j) = 0) = 1$.

Remark 2.1. Assumption E1 is a standard condition needed to ensure Fisher consistency of an M location functional (see, e.g., Maronna et al., 2006). Assumption R1 is satisfied by the so-called family of rho functions in Maronna et al. (2006), which includes many commonly used robust loss functions, such as those mentioned above. Since the loss function $\rho$ can be chosen by the user, this assumption is not restrictive. Finally, assumption A1 allows us to write the functional $g(P)$ in (7) uniquely as $g(P) = \sum_{j=1}^d g_j(P)$.

Assumption A1 appears to be the most restrictive one and deserves some discussion. It is closely related to the identifiability of the additive model (4) and holds if the explanatory variables are independent from each other. Indeed, denote by $(x, \mathbf{X}_{-\alpha})$ the vector with the $\alpha$-th coordinate equal to $x$ and the other coordinates equal to $X_j$, $j \neq \alpha$, and let $m(X) = \sum_{j=1}^d m_j(X_j)$, for $m_j \in \mathcal{H}_j$. For any fixed $1 \le \alpha \le d$, the condition $P(m(X) = 0) = 1$ implies that, for almost every $x_\alpha$, $P(m(x_\alpha, \mathbf{X}_{-\alpha}) = 0 \mid X_\alpha = x_\alpha) = 1$. Using that the components of $X$ are independent, we obtain that $P(m(x_\alpha, \mathbf{X}_{-\alpha}) = 0) = 1$, which implies that $\int m(x_\alpha, u_{-\alpha})\, dF_{\mathbf{X}_{-\alpha}}(u) = 0$. Note that, since $E\,m_j(X_j) = 0$ for all $j$, $\int m(x_\alpha, u_{-\alpha})\, dF_{\mathbf{X}_{-\alpha}}(u) = m_\alpha(x_\alpha) + \sum_{j \neq \alpha} \int m_j(u_j)\, dF_{\mathbf{X}_{-\alpha}}(u) = m_\alpha(x_\alpha)$. Hence, $m_\alpha(x_\alpha) = 0$ for almost every $x_\alpha$, as desired. However, if the components of $X$ are not independent, then $P(m(x_\alpha, \mathbf{X}_{-\alpha}) = 0 \mid X_\alpha = x_\alpha) = 1$ does not imply $\int m(x_\alpha, u_{-\alpha})\, dF_{\mathbf{X}_{-\alpha}}(u) = 0$. This has already been observed by Hastie and Tibshirani (1990, page 17).

The fact that $\mathcal{H}_{ad}$ is closed in $\mathcal{H}$ entails that, under mild assumptions, the minimum of $E(Y - m(X))^2$ over $\mathcal{H}_{ad}$ exists and is unique. However, the individual functions $m_j(x_j)$ may not be uniquely determined, since the dependence among the covariates may lead to more than one representation of the same surface (see also Breiman and Friedman, 1985). In fact, condition A1 is analogous to assumption 5.1 of Breiman and Friedman (1985). It is also worth noticing that Stone (1985) gives conditions ensuring that A1 holds. Indeed, Lemma 1 in Stone (1985) implies Proposition 2.1 below, which gives weak conditions for the unique representation and hence, as shown in Theorem 2.1 below, for the Fisher consistency of the functional $g(P)$. Its proof is omitted since it follows straightforwardly.

Proposition 2.1. Assume that $X$ has compact support $\mathcal{S}$, that its density $f_X$ is bounded on $\mathcal{S}$ and that $\inf_{x \in \mathcal{S}} f_X(x) > 0$. Let $V_j = m_j(X_j)$ be random variables such that $P\bigl(\sum_{j=1}^d V_j = 0\bigr) = 1$ and $E(V_j) = 0$; then $P(V_j = 0) = 1$.

The next theorem establishes the Fisher consistency of the functional $(\mu(P), g(P))$. In other words, it shows that the solutions to the optimization problem (7) are the target quantities to be estimated under model (4).

Theorem 2.1. Assume that the random vector $(X^t, Y)^t \in \mathbb{R}^{d+1}$ satisfies (4) and let $P$ stand for its distribution.
a) If E1 and R1 hold, then $\Upsilon(\nu, m)$ in (6) achieves its unique minimum over $\mathbb{R} \times \mathcal{H}_{ad}$ at $(\mu(P), g(P)) = (\mu(P), \sum_{j=1}^d g_j(P))$ with $\mu(P) = \mu_0$ and $P\bigl(\sum_{j=1}^d g_j(P)(X_j) = \sum_{j=1}^d g_{0,j}(X_j)\bigr) = 1$.
b) If in addition A1 holds, the unique minimum $(\mu(P), g(P)) = (\mu(P), \sum_{j=1}^d g_j(P))$ satisfies $\mu(P) = \mu_0$ and $P\bigl(g_j(P)(X_j) = g_{0,j}(X_j)\bigr) = 1$ for $1 \le j \le d$.

It is worth noticing that a minimizer $(\mu(P), g(P))$ of (7) always exists if $\rho$ is a strictly convex function, even if E1 does not hold. If in addition A1 holds, the minimizer will have a unique representation.

For $\nu \in \mathbb{R}$, $x = (x_1, \ldots, x_d)^t \in \mathbb{R}^d$ and $m = (m_1, \ldots, m_d)^t \in \mathcal{H}_1 \times \mathcal{H}_2 \times \cdots \times \mathcal{H}_d$, let $\Gamma(\nu, m, x) = (\Gamma_0(\nu, m), \Gamma_1(\nu, m, x_1), \ldots, \Gamma_d(\nu, m, x_d))^t$, where
$$\Gamma_0(\nu, m) = E\Bigl[\psi\Bigl(\frac{Y - \nu - \sum_{j=1}^{d} m_j(X_j)}{\sigma_0}\Bigr)\Bigr],$$
$$\Gamma_l(\nu, m, x_l) = E\Bigl[\psi\Bigl(\frac{Y - \nu - \sum_{j=1}^{d} m_j(X_j)}{\sigma_0}\Bigr) \Bigm| X_l = x_l\Bigr], \quad 1 \le l \le d. \qquad (8)$$
Our next theorem shows that it is possible to choose the solution $g(P)$ of (7) so that its additive components $g_j = g_j(P)$ satisfy first-order conditions which generalize those corresponding to the classical case where $\rho(u) = u^2$.

Theorem 2.2. Let $\rho$ be a differentiable function satisfying R1 and such that its derivative $\rho' = \psi$ is bounded and continuous. Let $(X^t, Y)^t \sim P$ be a random vector such that $(\mu(P), g(P))$ is a minimizer of $\Upsilon(\nu, m)$ over $\mathbb{R} \times \mathcal{H}_{ad}$, where $\mu(P) \in \mathbb{R}$ and $g(P) = \sum_{j=1}^d g_j(P) \in \mathcal{H}_{ad}$, i.e., $(\mu(P), g(P))$ is the solution of (7). Then, $(\mu(P), g(P))$ satisfies the system of equations $\Gamma(\mu(P), g(P), x) = 0$ almost surely ($P_X$).

Remark 2.2. a) It is also worth mentioning that if $(X^t, Y)^t$ satisfies (4) with the errors satisfying E1, then $\Gamma(\mu_0, g_0, x) = 0$, where $g_0 = (g_{0,1}, \ldots, g_{0,d})^t$.

Moreover, if the model is heteroscedastic,
$$Y = g_0(X) + \sigma_0(X)\,\varepsilon = \mu_0 + \sum_{j=1}^{d} g_{0,j}(X_j) + \sigma_0(X)\,\varepsilon, \qquad (9)$$
where the errors $\varepsilon$ are symmetrically distributed and the score function $\psi$ is odd, then $(\mu_0, g_0)$ satisfies
$$E\Bigl[\psi\Bigl(\frac{Y - \mu_0 - \sum_{j=1}^{d} g_{0,j}(X_j)}{\sigma_0(X)}\Bigr)\Bigr] = 0,$$
$$E\Bigl[\frac{1}{\sigma_0(X)}\,\psi\Bigl(\frac{Y - \mu_0 - \sum_{j \neq l} g_{0,j}(X_j) - g_{0,l}(X_l)}{\sigma_0(X)}\Bigr) \Bigm| X_l\Bigr] = 0, \quad 1 \le l \le d,$$
which provides a way to extend the robust backfitting algorithm to heteroscedastic models.

b) Assume now that missing responses can arise in the sample, that is, we have a sample $(X_i^t, Y_i, \delta_i)^t$, $1 \le i \le n$, where $\delta_i = 1$ if $Y_i$ is observed and $\delta_i = 0$ if $Y_i$ is missing, and $(X_i^t, Y_i)^t$ satisfy an additive heteroscedastic model (9). Let $(X^t, Y, \delta)^t$ be a random vector with the same distribution as $(X_i^t, Y_i, \delta_i)^t$. Moreover, assume that responses may be missing at random (mar), i.e., $P(\delta = 1 \mid (X, Y)) = P(\delta = 1 \mid X) = p(X)$. Define $(\mu(P), g(P)) = \mathop{\mathrm{argmin}}_{(\nu, m) \in \mathbb{R} \times \mathcal{H}_{ad}} \Upsilon_\delta(\nu, m)$, where
$$\Upsilon_\delta(\nu, m) = E\Bigl[\delta\,\rho\Bigl(\frac{Y - \nu - \sum_{j=1}^{d} m_j(X_j)}{\sigma_0(X)}\Bigr)\Bigr] = E\Bigl[p(X)\,\rho\Bigl(\frac{Y - \nu - \sum_{j=1}^{d} m_j(X_j)}{\sigma_0(X)}\Bigr)\Bigr].$$
Arguments analogous to those considered in the proof of Theorem 2.1 allow us to show that, if E1 and R1 hold, $\Upsilon_\delta(\nu, m)$ achieves its unique minimum at $(\nu, m) \in \mathbb{R} \times \mathcal{H}_{ad}$, where $\nu = \mu_0$ and $P\bigl(m(X) = \sum_{j=1}^d g_{0,j}(X_j)\bigr) = 1$. Besides, if in addition A1 holds, the unique minimum satisfies $\mu(P) = \mu_0$ and $P(g_j(P)(X_j) = g_{0,j}(X_j)) = 1$, that is, the functional is Fisher consistent. On the other hand, the proof of Theorem 2.2 can also be generalized to the case of a homoscedastic additive model (4) with missing responses. Indeed, when $\inf_x p(x) > 0$, using the mar assumption it is possible to show that there exists a unique measurable solution $g_l(x)$ of $\lambda_{l,\delta}(x, a) = 0$, where
$$\lambda_{l,\delta}(x, a) = E\Bigl\{ p(X)\,\psi\Bigl(\frac{Y - \mu(P) - \sum_{j \neq l} g_j(P)(X_j) - a}{\sigma_0}\Bigr) \Bigm| X_l = x \Bigr\}.$$
More precisely, let $\Gamma_\delta(\nu, m, x) = (\Gamma_{0,\delta}(\nu, m), \Gamma_{1,\delta}(\nu, m, x_1), \ldots, \Gamma_{d,\delta}(\nu, m, x_d))^t$ with $m = (m_1, \ldots, m_d)^t$ and
$$\Gamma_{0,\delta}(\nu, m) = E\Bigl[p(X)\,\psi\Bigl(\frac{Y - \nu - \sum_{j=1}^{d} m_j(X_j)}{\sigma_0}\Bigr)\Bigr],$$
$$\Gamma_{l,\delta}(\nu, m, x_l) = E\Bigl[p(X)\,\psi\Bigl(\frac{Y - \nu - \sum_{j \neq l} m_j(X_j) - m_l(X_l)}{\sigma_0}\Bigr) \Bigm| X_l = x_l\Bigr], \quad 1 \le l \le d.$$
Arguments similar to those considered in the proof of Theorem 2.2 allow us to show that, if there exists a unique minimizer $(\mu(P), g(P)) \in \mathbb{R} \times \mathcal{H}_{ad}$ of $\Upsilon_\delta(\nu, m)$, then $(\mu(P), g(P))$ is a solution of $\Gamma_\delta(\nu, m, x) = 0$.

Note that, instead of this simplified approach, a propensity score approach can also be considered, taking $\delta/p(X)$ instead of $\delta$. In this case, $\Upsilon_\delta(\nu, m) = \Upsilon(\nu, m)$ defined in (6) and $\Gamma_{l,\delta} = \Gamma_l$ defined in (8). The propensity approach is useful when preliminary estimates of the missing probability are available; otherwise, the simplified approach is preferred.

2.1 The population version of the robust backfitting algorithm

In this section, we derive an algorithm to solve (7) and study its convergence. For simplicity, we will assume that the vector $(X^t, Y)^t$ is completely observed and that it satisfies (4). By Theorem 2.2, the robust functional $(\mu(P), g(P))$ satisfies the equations in (8). To simplify the notation, in what follows we will write $\mu = \mu(P)$ and $g_j = g_j(P)$, $1 \le j \le d$, and $\sum_{s=l}^{m} a_s$ will be understood as $0$ if $m < l$. The robust backfitting algorithm is given in Algorithm 1.

Algorithm 1 Population version of the robust backfitting
1: Let $l = 0$, let $g^{(0)} = (g_1^{(0)}, \ldots, g_d^{(0)})^t$ be an initial set of additive components (for example, $g^{(0)} = 0$) and let $\mu^{(0)}$ be an initial location parameter.
2: repeat
3:   $l \leftarrow l + 1$
4:   for $j = 1$ to $d$ do
5:     Let $R_j^{(l)} = Y - \mu^{(l-1)} - \sum_{s=1}^{j-1} g_s^{(l)}(X_s) - \sum_{s=j+1}^{d} g_s^{(l-1)}(X_s)$
6:     Let $\tilde g_j^{(l)}$ solve $E\bigl[\psi\bigl((R_j^{(l)} - \tilde g_j^{(l)}(X_j))/\sigma_0\bigr) \mid X_j = x\bigr] = 0$ a.s.
7:   end for
8:   for $j = 1$ to $d$ do
9:     $g_j^{(l)} = \tilde g_j^{(l)} - E[\tilde g_j^{(l)}(X_j)]$
10:  end for
11:  Let $\mu^{(l)}$ solve $E\bigl[\psi\bigl((Y - \mu^{(l)} - \sum_{j=1}^{d} g_j^{(l)}(X_j))/\sigma_0\bigr)\bigr] = 0$.
12: until convergence

Our next theorem shows that each step $l$ of the algorithm above reduces the objective function $\Upsilon(\mu^{(l)}, g^{(l)})$.

Theorem 2.3. Let $\rho$ be a differentiable function satisfying R1 and such that its derivative $\rho' = \psi$ is a strictly increasing, bounded and continuous function with $\lim_{t \to +\infty} \psi(t) > 0$ and $\lim_{t \to -\infty} \psi(t) < 0$. Let $(\mu^{(l)}, g^{(l)})_{l \ge 1} = (\mu^{(l)}, g_1^{(l)}, \ldots, g_d^{(l)})_{l \ge 1}$ be the sequence obtained with Algorithm 1.

Then, $\{\Upsilon(\mu^{(l)}, g^{(l)})\}_{l \ge 1}$ is a decreasing sequence, so that the algorithm converges.

3 The sample version of the robust backfitting algorithm

In practice, given a random sample $(X_i^t, Y_i)^t$, $1 \le i \le n$, from the additive model (4), we apply Algorithm 1 replacing the unknown conditional expectations with univariate robust smoothers. Different smoothers can be considered, including splines, kernel weights or even nearest neighbours with kernel weights. In what follows we describe the algorithm for local polynomial kernel M-estimators. Let $K : \mathbb{R} \to \mathbb{R}$ be a kernel function and let $K_h(t) = (1/h)K(t/h)$. The estimators of the solutions of (8) using local polynomial kernel M-estimators of order $q$ are given by the solution to the following system of equations:
$$\frac{1}{n}\sum_{i=1}^{n} \psi\Bigl(\frac{Y_i - \hat\mu - \sum_{j=1}^{d} \hat g_j(X_{i,j})}{\hat\sigma}\Bigr) = 0,$$
$$\frac{1}{n}\sum_{i=1}^{n} K_{h_j}(X_{i,j} - x_j)\, \psi\Bigl(\frac{Y_i - \hat\mu - \sum_{l \neq j} \hat g_l(X_{i,l}) - \sum_{s=0}^{q} \hat\beta_{s,j} Z_{i,j,s}}{\hat\sigma}\Bigr)\, Z_{i,j}(x_j) = 0, \quad 1 \le j \le d,$$
where $Z_{i,j}(x_j) = (Z_{i,j,0}, Z_{i,j,1}, \ldots, Z_{i,j,q})^t$ with $Z_{i,j,s} = (X_{i,j} - x_j)^s$, $0 \le s \le q$. Then, we have $\hat g_j(x_j) = \hat\beta_{0,j}$, $1 \le j \le d$. The corresponding algorithm is described in detail in Algorithm 2. The same procedure can be applied to the complete cases when responses are missing.

Algorithm 2 The sample version of the robust backfitting
1: Let $l = 0$, let $\hat g^{(0)} = (\hat g_1^{(0)}, \ldots, \hat g_d^{(0)})^t$ be an initial set of additive components (for example, $\hat g^{(0)} = 0$) and let $\hat\sigma$ be a robust residual scale estimator. Moreover, let $\hat\mu^{(0)}$ be an initial location estimator, such as an M location estimator of the responses.
2: repeat
3:   $l \leftarrow l + 1$
4:   for $j = 1$ to $d$ do
5:     for $k = 1$ to $n$ do
6:       Let $x_j = X_{k,j}$
7:       for $i = 1$ to $n$ do
8:         Let $Z_{i,j}(x_j) = (1, (X_{i,j} - x_j), (X_{i,j} - x_j)^2, \ldots, (X_{i,j} - x_j)^q)^t$ and $R_{i,j}^{(l)} = Y_i - \hat\mu^{(l-1)} - \sum_{s=1}^{j-1} \hat g_s^{(l)}(X_{i,s}) - \sum_{s=j+1}^{d} \hat g_s^{(l-1)}(X_{i,s})$.
9:       end for
10:      Let $\hat\beta_j(x_j) = (\hat\beta_{0j}(x_j), \hat\beta_{1j}(x_j), \ldots, \hat\beta_{qj}(x_j))^t$ be the solution to
         $$\frac{1}{n}\sum_{i=1}^{n} K_{h_j}(X_{i,j} - x_j)\, \psi\Bigl(\frac{R_{i,j}^{(l)} - \hat\beta_j(x_j)^t Z_{i,j}(x_j)}{\hat\sigma}\Bigr)\, Z_{i,j}(x_j) = 0.$$
11:      Let $\tilde g_j^{(l)}(x_j) = \hat\beta_{0j}(x_j)$.
12:    end for
13:  end for
14:  for $j = 1$ to $d$ do
15:    $\hat g_j^{(l)} = \tilde g_j^{(l)} - \sum_{i=1}^{n} \tilde g_j^{(l)}(X_{i,j})/n$
16:  end for
17:  Let $\hat\mu^{(l)}$ solve $\frac{1}{n}\sum_{i=1}^{n} \psi\Bigl(\frac{Y_i - \hat\mu^{(l)} - \sum_{j=1}^{d} \hat g_j^{(l)}(X_{i,j})}{\hat\sigma}\Bigr) = 0$.
18: until convergence

Remark 3.1. A possible choice of the preliminary scale estimator $\hat\sigma$ is obtained by calculating the mad of the residuals obtained with a simple and robust nonparametric regression estimator, such as local medians. In that case we have $\hat\sigma = \mathrm{mad}_{1 \le i \le n}\{Y_i - \hat Y_i\}$, where $\hat Y_i = \mathrm{median}_{1 \le j \le n}\{Y_j : |X_{j,k} - X_{i,k}| \le h_k,\ 1 \le k \le d\}$. The bandwidths $h_k$ are preliminary values to be selected; alternatively, they can be chosen as the distance between $X_{i,k}$ and its $r$-th nearest neighbour among $\{X_{j,k}\}_{j \neq i}$.
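The following R sketch is a minimal version of Algorithm 2 for local constant smoothers ($q = 0$) with Huber's $\psi$, using a crude preliminary scale and assuming every kernel neighbourhood contains observations. It is only meant to illustrate the structure of the iteration and is not the authors' released implementation (available at http://www.stat.ubc.ca/~matias/soft.html).

```r
# Minimal sketch of Algorithm 2 with local constant smoothers (q = 0) and Huber's psi.
psi_huber <- function(u, c = 1.345) pmax(-c, pmin(c, u))

# Local constant M-smoother: solves sum_i K_h(x_i - x0) psi((r_i - a)/sigma) = 0 in a
# by iteratively reweighted averaging, with (unnormalized) Epanechnikov kernel weights.
local_m <- function(x, r, x0, h, sigma, iters = 30) {
  sapply(x0, function(z) {
    k <- pmax(1 - ((x - z) / h)^2, 0)
    a <- sum(k * r) / sum(k)                      # start from the local weighted mean
    for (it in 1:iters) {
      u <- (r - a) / sigma
      w <- k * ifelse(abs(u) < 1e-10, 1, psi_huber(u) / u)
      a <- sum(w * r) / sum(w)
    }
    a
  })
}

robust_backfit <- function(X, y, h, max_iter = 50, tol = 1e-6) {
  n <- nrow(X); d <- ncol(X)
  sigma <- mad(y)                                 # crude preliminary scale (cf. Remark 3.1)
  mu <- median(y)                                 # initial robust location estimate
  g <- matrix(0, n, d)
  for (it in 1:max_iter) {
    g_old <- g
    for (j in 1:d) {
      r <- y - mu - rowSums(g[, -j, drop = FALSE])      # partial residuals
      g[, j] <- local_m(X[, j], r, X[, j], h[j], sigma)
      g[, j] <- g[, j] - mean(g[, j])                   # re-centre the component
    }
    res <- y - rowSums(g)            # update mu so that sum_i psi((res_i - mu)/sigma) = 0
    for (it2 in 1:30) {
      u <- (res - mu) / sigma
      w <- ifelse(abs(u) < 1e-10, 1, psi_huber(u) / u)
      mu <- sum(w * res) / sum(w)
    }
    if (max(abs(g - g_old)) < tol) break
  }
  list(mu = mu, g = g, sigma = sigma)
}
```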

4 Numerical studies

This section contains the results of a simulation study designed to compare the behaviour of our proposed estimator with that of the classical one. We generated N = 500 samples of size n = 500 for models with d = 2 and d = 4 components. We considered samples without outliers and also samples contaminated in different ways. We also included in our experiment cases where the response variable may be missing, as described in Remark 2.2. All computations were carried out using an R implementation of our algorithm, publicly available online at http://www.stat.ubc.ca/~matias/soft.html.

To generate missing responses, we first generated observations $(X_i^t, Y_i)^t$ satisfying the additive model $Y = g_0(X) + u = \mu_0 + \sum_{j=1}^{d} g_{0,j}(X_j) + u$, where $u = \sigma_0\,\varepsilon$. Then, we generated $\{\delta_i\}_{i=1}^{n}$ independent Bernoulli random variables such that $P(\delta_i = 1 \mid Y_i, X_i) = P(\delta_i = 1 \mid X_i) = p(X_i)$, where we used a different function $p$ for each value of $d$. When $d = 2$ we used $p(x) = p_2(x) = 0.4 + 0.5\,\cos^2(x_1 + 0.2)$, which yields around 31.5% of missing responses. For $d = 4$, we set $p(x) = p_4(x) = 0.4 + 0.5\,\cos^2(x_1 x_3 + 0.2)$, which produces approximately 33% of missing $Y_i$'s. In addition, we also considered the case where all responses are observed, i.e., $p(x) \equiv 1$.
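As an illustration, the missing-response mechanism for $d = 2$ can be simulated as follows. The sample size and the coefficients of $p_2$ are those reconstructed above and should be treated as assumptions; the covariates are taken uniform on the unit square as in Section 4.1.

```r
# Simulating the missing-response mechanism for d = 2.
set.seed(1)
n <- 500
X <- matrix(runif(2 * n), n, 2)
p2 <- function(x) 0.4 + 0.5 * cos(x[, 1] + 0.2)^2   # P(delta = 1 | X = x)
delta <- rbinom(n, size = 1, prob = p2(X))          # delta_i = 1 if Y_i is observed
mean(delta == 0)                                    # proportion of missing responses (~0.315)
```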

We compared the following estimators:

- The classical backfitting estimator adapted to missing responses, denoted $\hat g_{bc}$.
- A robust backfitting estimator $\hat g_{br,h}$ using Huber's loss function with tuning constant $c = 1.345$. This loss function is such that $\rho_c'(u) = \psi_c(u) = \min(c, \max(-c, u))$.
- A robust backfitting estimator $\hat g_{br,t}$ using Tukey's bisquare loss function with tuning constant $c = 4.685$. This loss function satisfies $\rho_c'(u) = \psi_c(u) = u\,(1 - (u/c)^2)^2\, I_{[-c,c]}(u)$.

The univariate smoothers were computed using the Epanechnikov kernel $K(u) = 0.75\,(1 - u^2)\, I_{[-1,1]}(u)$. We used both local constants and local linear polynomials, which correspond to $q = 0$ and $q = 1$ in Algorithm 2, denoted in all tables and figures with a subscript of 0 and 1, respectively.

The performance of each estimator $\hat g_j$ of $g_{0,j}$, $1 \le j \le d$, was measured through the following approximated integrated squared error (ise):
$$\mathrm{ise}(\hat g_j) = \frac{1}{\sum_{i=1}^{n} \delta_i} \sum_{i=1}^{n} \delta_i\,\bigl(g_{0,j}(X_{ij}) - \hat g_j(X_{ij})\bigr)^2,$$
where $X_{ij}$ is the $j$-th component of $X_i$, and $\delta_i = 0$ if the $i$-th response was missing and $\delta_i = 1$ otherwise. An approximation of the mean integrated squared error (mise) was obtained by averaging the ise above over all replications. A similar measure was used to compare the estimators of the regression function $g_0 = \mu_0 + \sum_{j=1}^{d} g_{0,j}$.

4.1 Monte Carlo study with d = 2 additive components

In this case, the covariates were generated from a uniform distribution on the unit square, $X_i = (X_{i,1}, X_{i,2})^t \sim U([0,1]^2)$, the error scale was $\sigma_0 = 0.5$ and the overall location was $\mu_0 = 0$. The additive components were chosen to be
$$g_{0,1}(x_1) = 24\,(x_1 - 0.5)^2 - 2, \qquad g_{0,2}(x_2) = 2\pi \sin(\pi x_2) - 4. \qquad (10)$$
Optimal bandwidths with respect to the mise can be computed assuming that the other components in the model are known (see, for instance, Härdle et al., 2004). Since the explanatory variables are uniformly distributed, the optimal bandwidths for the local constant ($q = 0$) and local linear ($q = 1$) estimators are the same and very close to our choice of $h = (0.1, 0.1)$.
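The complete-case ise criterion defined earlier in this section can be computed directly; in the following sketch, g0j and g_hat_j denote the true and fitted $j$-th component evaluated at the sample points, and the function name is ours.

```r
# Complete-case approximation to ise(g_hat_j): delta flags the observed responses.
ise_j <- function(g0j, g_hat_j, delta) sum(delta * (g0j - g_hat_j)^2) / sum(delta)
```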

For the errors, we considered the following settings:

C_0: $u_i \sim N(0, \sigma_0^2)$.
C_1: $u_i \sim (1 - 0.1)\, N(0, \sigma_0^2) + 0.1\, N(1, .1)$.
C_2: $u_i \sim N(1, .1)$ for all $i$ such that $X_i \in D_{0.3}$, where $D_\eta = [0.2, 0.2 + \eta]^2$.
C_3: $u_i \sim N(1, .1)$ for all $i$ such that $X_i \in D_{0.09}$, where $D_\eta$ is as above.
C_4: $u_i \sim (1 - 0.3)\, N(0, \sigma_0^2) + 0.3\, N(1, .1)$ for all $i$ such that $X_i \in D_{0.3}$.

Case C_0 corresponds to samples without outliers; it illustrates the loss of efficiency incurred by using a robust estimator when it may not be needed. The contamination setting C_1 corresponds to a gross-error model where all observations have the same chance of being contaminated. On the other hand, case C_2 is highly pathological in the sense that all observations with covariates in the square $[0.2, 0.5] \times [0.2, 0.5]$ are severely affected, while C_3 is similar but in the region $[0.2, 0.29] \times [0.2, 0.29]$. The difference between C_2 and C_3 is that the bandwidths we used ($h = 0.1$) are wider than the contaminated region in C_3. Finally, case C_4 is a gross-error model with a higher probability of observing an outlier, but these outliers are restricted to the square $[0.2, 0.5] \times [0.2, 0.5]$. Figure 1 illustrates these contamination scenarios on one randomly generated sample; the panels correspond to settings C_2, C_3 and C_4, with solid triangles indicating contaminated cases.

Figure 1: Scatter plots of the covariates $(x_1, x_2)^t$, with solid triangles indicating observations with contaminated response variables, for contamination settings C_2 (panel a), C_3 (panel b) and C_4 (panel c). The square regions indicate the sets $D_\eta$ for each scenario.

Tables 1 to 3 report the obtained values of the mise when estimating the regression function $g_0 = \mu_0 + g_{0,1} + g_{0,2}$ and each additive component $g_{0,1}$ and $g_{0,2}$, respectively. Estimates obtained using local constant ($q = 0$) and local linear ($q = 1$) smoothers are indicated with a subscript 0 and 1, respectively. To complement the results on the mise given in Tables 1 to 3, we report in Tables 4 to 6 the ratio between the mean integrated squared error obtained for each considered contamination and for the clean data; this ratio is denoted $\mathrm{mise}(C_j)/\mathrm{mise}(C_0)$.
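For concreteness, the region-based contamination C_2 can be generated along the following lines; the mean and scale of the contaminating normal distribution (mu_c, sd_c) are placeholders read off the text above and are not guaranteed to match the original study.

```r
# Region-based contamination C_2: errors of observations whose covariates fall in
# D_eta = [0.2, 0.2 + eta]^2 are replaced by draws from the contaminating normal.
contaminate_C2 <- function(X, u, eta = 0.3, mu_c = 1, sd_c = 0.1) {
  in_D <- X[, 1] >= 0.2 & X[, 1] <= 0.2 + eta &
          X[, 2] >= 0.2 & X[, 2] <= 0.2 + eta
  u[in_D] <- rnorm(sum(in_D), mean = mu_c, sd = sd_c)
  u
}
```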

p(x) 1 p2(x) =.4 +. cos 2 (x1 +.2) ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 C.74.73.718.77.79.79.741.76.79.11.113.113 C1.8437.136.73.926.62.91 6.167.147.8 6.312.897.272 C2 8.823.247.73 8.218.1471.86 1.841.3.837 1.1.34.396 C3.16.722.719.93.9.8.1882.786.79.1224.127.113 C4.919.764.718.823.122.8 1.117.84.782 1.37.169.11 Table 1: mise of the estimators of the regression function g = µ + g,1 + g,2 under different contaminations and missing mechanisms. p(x) 1 p2(x) =.4 +. cos 2 (x1 +.2) ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 C.44.44.44.96.97.97.493.2.21.138.14.14 C1.3638.3.49.393.12.13.176.92.23.94.334.271 C2 3.49.1348.473 3.371.66.11 3.793.1641.47 3.914.942.176 C3.93.468.44.49.13.97.184.32.21.99.147.14 C4.431.14.43.347.119.98.491.87.17.3991.166.141 Table 2: mise of the estimators of the additive component g,1 under different contaminations and missing mechanisms. p(x) 1 p2(x) =.4 +. cos 2 (x1 +.2) ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 C.4.46.41.16.17.17.46.476.497.17.19.19 C1.39.494.418.397.16.113.494.99.498.731.273.184 C2 3.1891.932.434 3.3314.633.11 4.168.172.2 4.1969.1642.38 C3.683.43.416.48.111.17.893.476.496.69.16.19 C4.3237.391.41.341.119.18.419.46.492.4433.177.16 Table 3: mise of the estimators of the additive component g,2 under different contaminations and missing mechanisms. 14

p(x) 1 p2(x) =.4 +. cos 2 (x1 +.2) ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 mise(c1)/mise(c) 82.98 1.877 1.161 766.6964 7.827 1.1499 83.17 1.928 1.124 7.4987 7.93 2.4127 mise(c2)/mise(c) 121.839 3.141 1.492 116.972 18.64 1.97 136.2 4.6984 1.94 912.8 26.967 3.13 mise(c3)/mise(c) 2.2147 1.277 1.1 12.83 1.131 1.1 2.383 1.47.9996 11.161 1.1286 1.6 mise(c4)/mise(c) 13.19 1.869.9997 11.786 1.411 1.138 14.889 1.1118.9898 93.97 1.4974 1.16 Table 4: Ratio between the mise of the estimators of the additive component g = µ + 2 j=1 g,j under the considered contaminations and under clean data for the two missing mechanisms. p(x) 1 p2(x) =.4 +. cos 2 (x1 +.2) ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 mise(c1)/mise(c) 8.1696 1.1298 1.113 4.93 1.61 1.2 1.7 1.183 1.44 42.624 2.3876 1.9396 mise(c2)/mise(c) 79.6889 3.283 1.427 31.462 6.847 1.39 76.99 3.2722 1.49 29.429 6.74 1.2 mise(c3)/mise(c) 2.994 1. 1.2.949 1.86 1.24 2.1986 1.69 1.8 4.321 1.29 1.31 mise(c4)/mise(c) 9.7694 1.141.9993 36.89 1.2287 1.63 9.966 1.171.9934 28.8339 1.189 1.73 Table : Ratio between the mise of the estimators of the additive component g,1 under the considered contaminations and under clean data for the two missing mechanisms. p(x) 1 p2(x) =.4 +. cos 2 (x1 +.2) ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 mise(c1)/mise(c) 8.7994 1.2173 1.62 36.7398 1.491 1.28 1.6429 1.267 1.22 36.4372 1.7173 1.199 mise(c2)/mise(c) 78.8384 2.293 1.44 313.383.899 1.28 86.42 3.2998 1.461 266.814 1.3392 2.424 mise(c3)/mise(c) 1.6873.992 1.1 4.18 1.36 1.2 1.9229.9997.998 4.419 1.386 1.27 mise(c4)/mise(c) 8.22.961 1. 32.112 1.192 1.4 9.289.9769.9896 28.1834 1.114 1.63 Table 6: Ratio between the mise of the estimators of the additive component g,1 under the considered contaminations and under clean data for the two missing mechanisms. 1

Consider first the situation where there are no missing responses. As expected, when the data do not contain outliers, the robust estimators are slightly less efficient than the least squares one, although the differences in the estimated mise's are well within the Monte Carlo margin of error. For the contamination cases C_1 and C_2, when using either local constants or local linear smoothers, the mise of the classical estimator for $g_0$ is notably larger than those of the robust estimators (more than 4 times larger). This difference is smaller when estimating each component $g_{0,1}$ and $g_{0,2}$, but remains fairly large nonetheless. In general, when the data contain outliers, we note that the robust estimators give noticeably better regression estimators (both for $g_0$ and its components) than the classical one. The estimator based on Tukey's bisquare function is very stable across all the contamination cases considered here, while for the Huber one there is a slight increase in mise for the estimated additive components under the contamination setting C_1 and a larger one under C_2. The advantage of the estimators based on Tukey's loss function over those based on the Huber one becomes more apparent when one inspects the ratio between the mise's obtained with and without outliers given in Tables 4 to 6.

It is worth noting that, for clean data, local linear smoothers achieve much smaller mise values than local constants. This may be explained by the well-known bias-reducing property of local polynomials, particularly near the boundary. This behaviour can also be observed across contamination settings for the robust estimators.

Interestingly, the results of our experiments with missing responses yield the same overall conclusions. The estimators based on the subset of complete data remain Fisher-consistent, but, as one would expect, they all yield larger mise's due to the smaller sample sizes used to compute them. Moreover, the estimators based on Tukey's loss function are still preferable, although they are slightly less efficient than those based on the Huber loss when $q = 0$. When comparing the behaviour of the robust local constant and local linear estimators, one notices that, as above, local linear estimators have smaller mise's. However, the ratios of mise reported in Tables 4 to 6 show that, under C_2, the mise's of the local linear estimators of $g_0$ with missing responses are more than 4 times larger than those obtained with local constants. This effect is mainly due to the poor estimation of $g_{0,2}$ and of the location parameter $\mu_0$, as suggested by the reported results.

We also looked at the median ise across the simulation replications to determine whether the differences in the observed averaged ise's are representative of most samples or whether they were influenced by poor fits obtained in a few bad samples. Tables 7 to 9 report the median over replications of the ise, denoted Medise, for the estimators of the regression function $g_0$ and each additive component when $d = 2$. In addition, Tables 10 to 12 report the ratio between the median integrated squared error (Medise) obtained for each considered contamination and for the clean data; this ratio is denoted $\mathrm{Medise}(C_j)/\mathrm{Medise}(C_0)$. The median ise results show that the robust estimators behave similarly when $p(x) \equiv 1$ and when responses may be missing. Furthermore, the performances of Tukey's local linear and local constant smoothers are equally good.

Based on the above results, we recommend using local linear smoothers computed with Tukey's loss function, although in some situations local constants may give better results, in particular when responses are missing and some neighbourhoods contain too few observations.

p(x) 1 p2(x) =.4 +. cos 2 (x1 +.2) ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 C.7.7.714.73.74.74.733.72.784.1.18.17 C1.7681.126.713.823.87.86 6.24.1379.793 6.147.676.126 C2 8.4718.237.747 8.46.1333.82 9.788.2982.82 9.7619.1794.118 C3.137.717.71.73.83.7.13.783.781.892.122.18 C4.8481.762.712.782.113.7.9826.827.776.9236.17.11 Table 7: Medise of the estimators of the regression function g under different contaminations and missing mechanisms. p(x) 1 p2(x) =.4 +. cos 2 (x1 +.2) ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 C.43.433.44.68.69.7.46.478.1.99.99.99 C1.3296.489.443.397.13.73.474.7.498.377.194.19 C2 3.188.1292.43 3.33.613.73 3.7.126.23 3.4773.7.18 C3.86.42.441.46.78.7.981.1.2.498.19.99 C4.4121..436.332.96.7.444.67.496.32.131.1 Table 8: Medise of the estimators of the additive component g,1 under different contaminations and missing mechanisms. p(x) 1 p2(x) =.4 +. cos 2 (x1 +.2) ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 C.382.383.391.71.72.72.43.443.462.12.13.13 C1.318.46.398.317.13.78.43.63.463.36.193.114 C2 3.1744.874.49 3.384.7.77 3.978.128.489 4.1319.884.113 C3.62.381.387.49.77.72.767.437.463.88.111.14 C4.2993.369.391.3171.86.73.3722.434.461.39.122.1 Table 9: Medise of the estimators of the additive component g,2 under different contaminations and missing mechanisms. 17

Medise(C1) Medise(C) Medise(C2) Medise(C) Medise(C3) Medise(C) Medise(C4) Medise(C) p(x) 1 p2(x) =.4 +. cos 2 (x1 +.2) ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 82.3814 1.797.9991 82.7378 7.9218 1.161 82.614 1.834 1.113 84.4446 6.2633 1.1736 12.9971 3.3876 1.47 118.677 17.9962 1.14 133.979 3.9676 1.2 928.673 16.613 1.978 1.9638 1.24.991 1.3761 1.1264 1.9 2.946 1.416.9961 8.4822 1.1289 1.78 12.1124 1.899.9971 18.2387 1.199 1.123 13.4123 1.13.996 87.88 1.419 1.276 Table 1: Ratio between the Medise of the estimators of the additive component g under the considered contaminations and under clean data for the two missing mechanisms. Medise(C1) Medise(C) Medise(C2) Medise(C) Medise(C3) Medise(C) Medise(C4) Medise(C) p(x) 1 p2(x) =.4 +. cos 2 (x1 +.2) ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 7.6679 1.1314 1.61 2.8814 1.887 1.462 9.842 1.164.993 4.2267 1.919 1.976 81.874 2.9874 1.294 49.31 8.871 1.44 79.611 3.1894 1.43 3.68 7.627 1.876 2.131 1.48 1.19.9736 1.1294 1.16 2.197 1.6 1.19.231 1.977.9964 9.897 1.16.9911 48.8814 1.3842 1.81 9.473 1.1843.9894 3.14 1.3224 1.83 Table 11: Ratio between the Medise of the estimators of the additive component g,1 under the considered contaminations and under clean data for the two missing mechanisms. Medise(C1) Medise(C) Medise(C2) Medise(C) Medise(C3) Medise(C) Medise(C4) Medise(C) p(x) 1 p2(x) =.4 +. cos 2 (x1 +.2) ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 8.322 1.2142 1.177 49.2713 1.864 1.717 1.483 1.2699 1.2 49.143 1.877 1.182 83.179 2.2814 1.43 463.4817 8.8 1.9 92.617 2.8382 1.71 43.2129 8.6146 1.931 1.6226.9934.9911.7296 1.781.9914 1.7847.986 1.22.7381 1.82 1.44 7.8349.9617.999 44.429 1.1974 1.12 8.6646.9798.9963 38.99 1.1879 1.214 Table 12: Ratio between the Medise of the estimators of the additive component g,2 under the considered contaminations and under clean data for the two missing mechanisms. 18

4.2 Monte Carlo study with d = 4 additive components

For this model we generated covariates $X_i = (X_{i1}, X_{i2}, X_{i3}, X_{i4}) \sim U([-3, 3]^4)$, independent errors $\varepsilon_i \sim N(0, 1)$ and $\sigma_0 = 0.1$. We used the following additive components:
$$g_{0,1}(x_1) = x_1^3/12, \quad g_{0,2}(x_2) = \sin(-x_2), \quad g_{0,3}(x_3) = x_3^2/2 - 1.5, \quad g_{0,4}(x_4) = e^{x_4}/4 - (e^{3} - e^{-3})/24.$$
As in dimension $d = 2$, optimal bandwidths with respect to the mise were computed assuming that the other components in the model are known, resulting in $h^{MISE}_{opt} = (0.36, 0.38, 0.34, 0.29)$. However, it was difficult to obtain a reliable estimate of the residual scale $\sigma_0$ using these bandwidths (see Remark 3.1), since many 4-dimensional neighbourhoods did not contain sufficient observations. To solve this problem we used $h_\sigma = (0.93, 0.93, 0.93, 0.93)$ to estimate $\sigma_0$ (we expect an average of 5 points in each 4-dimensional neighbourhood using $h_\sigma$). We then applied the backfitting algorithm with the optimal bandwidths $h^{MISE}_{opt}$. In addition to the settings C_0 and C_1 described above, we modified the contamination setting C_2 so that $u_i \sim N(1, .1)$ for all $i$ such that $X_{i,j} \in [-1.5, 1.5]$ for all $1 \le j \le 4$.

Tables 13 to 17 report the mise for the different estimators, contamination settings and missing mechanisms. Ratios of the mise's for clean and contaminated settings, as well as median ise's, are reported in Tables 18 to 32. As observed in Section 4.1, our experiments with and without missing responses yield similar conclusions regarding the advantage of the robust procedure over the classical backfitting and the benefits of Tukey's loss function over the Huber one. As in dimension $d = 2$, when responses are missing, all the mise's are slightly inflated, as is to be expected. It is also not surprising that, when the data do not contain outliers (C_0), the robust estimators have a slightly larger mise than their classical counterparts. However, when outliers are present, both robust estimators provide a substantially better performance than the classical one, giving similar results to those obtained for clean data. Also, as noted for the model with $d = 2$, the mise of the estimators based on Tukey's bisquare score function is more stable across the different contamination settings than that of those using Huber's score function.

However, one difference with the results in dimension $d = 2$ should be pointed out. Under the gross error model C_1, the local linear smoother performs worse than the local constant when missing responses arise. This difference is highlighted when one looks at the ratio of the corresponding mise's. However, the median ise's reported in Tables 23 to 27 are more stable, which indicates that the difference in mise's may be due to a few samples. In fact, due to the relatively large number of missing observations that may be present in this setting (around 33%), around 6% of the replications produce robust fits with Tukey's loss function with atypically large values of ise, which are due to sparsely populated neighbourhoods. Based on these observations, we also recommend using Tukey's local linear estimators.
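The centring constants in the $d = 4$ components, as reconstructed above, can be checked numerically; the following short R computation verifies that $g_{0,3}$ and $g_{0,4}$ integrate to zero against the $U(-3, 3)$ density.

```r
# Numerical check that the reconstructed centring constants make E g_{0,j}(X_j) = 0
# for X_j ~ U(-3, 3); shown for g_{0,3} and g_{0,4}.
g3 <- function(x) x^2 / 2 - 1.5
g4 <- function(x) exp(x) / 4 - (exp(3) - exp(-3)) / 24
integrate(function(x) g3(x) / 6, -3, 3)$value   # ~ 0
integrate(function(x) g4(x) / 6, -3, 3)$value   # ~ 0
```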

p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 C.123.12.12.23.23.23.13.142.1.33.33.33 C1 7.334.2.134 7.69.497.46 8.4183.699.192 8.881.163.932 C2 4.8441.36.128 4.8224.221.2.9889.392.148 6.121.28.38 Table 13: mise of the estimators of the regression function g = µ + 4 j=1 g,j under different contaminations and missing mechanisms. p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 C.46.47.47.2.2.2.9.6.61.3.3.3 C1.47.8.49.636.66.21.83.17.63.973.2.178 C2.9691.89.47.9897.6.2 1.1446.1.62 1.196.73.3 Table 14: mise of the estimators of the additive component g,1 under different contaminations and missing mechanisms. p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 C.2.2.2.16.16.16.3.3.36.24.24.24 C1.332.62.27.6189.81.26.782.11.38.8982.226.136 C2.9298.66.26.9482..16 1.19.83.37 1.2339.67.24 Table 1: mise of the estimators of the additive component g,2 under different contaminations and missing mechanisms. p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ3,bc, ĝ3,br,h, ĝ3,br,t, ĝ3,bc,1 ĝ3,br,h,1 ĝ3,br,t,1 ĝ3,bc, ĝ3,br,h, ĝ3,br,t, ĝ3,bc,1 ĝ3,br,h,1 ĝ3,br,t,1 C.91.92.92.42.42.42.136.141.146.82.82.82 C1.991.137.9.6741.16.2.919.273.17 1.679.449.337 C2 1.137.1.94 1.7.8.42 1.22.27.144 1.226.126.82 Table 16: mise of the estimators of the additive component g,3 under different contaminations and missing mechanisms. p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ4,bc, ĝ4,br,h, ĝ4,br,t, ĝ4,bc,1 ĝ4,br,h,1 ĝ4,br,t,1 ĝ4,bc, ĝ4,br,h, ĝ4,br,t, ĝ4,bc,1 ĝ4,br,h,1 ĝ4,br,t,1 C.7.7.71.36.36.36.9.98.13.8.8.8 C1.6818.118.72.78.97.37 1.89.23.127 1.231.83.429 C2 1.82.12.71 1.92.78.36 1.3678.19.11 1.3877.117.61 Table 17: mise of the estimators of the additive component g,4 under different contaminations and missing mechanisms. 2

p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 mise(c1)/mise(c) 94.686 4.1927 1.66 3263.3298 21.3232 1.963 622.6124 4.9188 1.286 2677.7184 47.1836 28.937 mise(c2)/mise(c) 392.7639 2.8789 1.24 268.873 9.4793 1.62 442.936 2.7631.9876 1817.42 7.82 1.1449 Table 18: Ratio between the mise of the estimators of the additive component g = µ + 4 j=1 g,j under the considered contaminations and under clean data for the two missing mechanisms. p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 mise(c1)/mise(c) 119.9119 1.822 1.448 318.2129 3.318 1.2 142.179 1.761 1.33 327.38 8.4212.996 mise(c2)/mise(c) 29.497 1.9119 1.18 49. 3.24 1.18 194.7711 1.743 1.134 43.116 2.4681 1.27 Table 19: Ratio between the mise of the estimators of the additive component g,1 under the considered contaminations and under clean data for the two missing mechanisms. p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 mise(c1)/mise(c) 213.8436 2.4642 1.692 38.1917.16 1.6478 22.133 2.8729 1.99 379.1973 9.277.74 mise(c2)/mise(c) 372.9278 2.648 1.39 9.17 3.441 1.28 344.82 2.333 1.324 2.8911 2.8161 1.29 Table 2: Ratio between the mise of the estimators of the additive component g,2 under the considered contaminations and under clean data for the two missing mechanisms. p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ3,bc, ĝ3,br,h, ĝ3,br,t, ĝ3,bc,1 ĝ3,br,h,1 ĝ3,br,t,1 ĝ3,bc, ĝ3,br,h, ĝ3,br,t, ĝ3,bc,1 ĝ3,br,h,1 ĝ3,br,t,1 mise(c1)/mise(c) 66.132 1.488 1.318 16.38 2.264 1.2368 67.6382 1.938 1.77 13.993.1 4.1322 mise(c2)/mise(c) 111.721 1.684 1.1 237.9769 2.19 1.87 89.7438 1.4627.988 1.2898 1.44 1.8 Table 21: Ratio between the mise of the estimators of the additive component g,3 under the considered contaminations and under clean data for the two missing mechanisms. p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ4,bc, ĝ4,br,h, ĝ4,br,t, ĝ4,bc,1 ĝ4,br,h,1 ĝ4,br,t,1 ĝ4,bc, ĝ4,br,h, ĝ4,br,t, ĝ4,bc,1 ĝ4,br,h,1 ĝ4,br,t,1 mise(c1)/mise(c) 97.8712 1.6842 1.244 211.838 2.7222 1.34 114.1378 2.448 1.237 212.268 1.466 7.388 mise(c2)/mise(c) 11.914 1.776 1.11 296.8837 2.1873 1.131 143.3664 1.6293.9839 239.2791 2.83 1.41 Table 22: Ratio between the mise of the estimators of the additive component g,4 under the considered contaminations and under clean data for the two missing mechanisms. 21

p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 C.123.12.12.23.23.23.13.142.148.33.33.33 C1 7.2318.497.133 7.283.421.27 8.3171.481.1 8.784.43.4 C2 4.774.344.128 4.76.23.24.949.36.147.938.21.3 Table 23: Medise of the estimators of the regression function g under different contaminations and missing mechanisms. p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 C.4.4.4.13.13.13..1.2.19.19.19 C1.139.77.42.99.9.14.7747.99..96.7.21 C2.96.84.41.9677.6.13 1.1128.96.3 1.191.61.2 Table 24: Medise of the estimators of the additive component g,1 under different contaminations and missing mechanisms. p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 C.2.2.2.11.11.11.28.28.29.16.16.16 C1.498.6.22.879.4.12.7289.71.31.8413.64.18 C2.918.62.21.926..11 1.1789.77.31 1.241.9.16 Table 2: Medise of the estimators of the additive component g,2 under different contaminations and missing mechanisms. p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ3,bc, ĝ3,br,h, ĝ3,br,t, ĝ3,bc,1 ĝ3,br,h,1 ĝ3,br,t,1 ĝ3,bc, ĝ3,br,h, ĝ3,br,t, ĝ3,bc,1 ĝ3,br,h,1 ĝ3,br,t,1 C.76.77.77.23.23.23.12.18.111.44.44.44 C1.686.124.8.6297.7.24.867.167.11 1.1.11.46 C2 1.46.141.79.997.69.23 1.189.177.11 1.193.96.4 Table 26: Medise of the estimators of the additive component g,3 under different contaminations and missing mechanisms. p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ4,bc, ĝ4,br,h, ĝ4,br,t, ĝ4,bc,1 ĝ4,br,h,1 ĝ4,br,t,1 ĝ4,bc, ĝ4,br,h, ĝ4,br,t, ĝ4,bc,1 ĝ4,br,h,1 ĝ4,br,t,1 C.6.7.8.22.22.22.73.76.8.34.34.34 C1.61.16.61.713.78.23 1.298.141.8 1.1628.112.37 C2 1.484.11.9 1.447.66.22 1.3321.14.79 1.3391.84.3 Table 27: Medise of the estimators of the additive component g,4 under different contaminations and missing mechanisms. 22

Medise(C1) Medise(C) Medise(C2) Medise(C) p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 ĝbc, ĝbr,h, ĝbr,t, ĝbc,1 ĝbr,h,1 ĝbr,t,1 87.928 3.991 1.632 3284.712 18.399 1.184 616.264 3.3837 1.47 2694.8344 13.3422 1.233 386.7698 2.7623 1.23 27.99 8.879 1.682 437.7162 2.68.991 182.7418 6.67 1.631 Table 28: Ratio between the Medise of the estimators of the additive component g under the considered contaminations and under clean data for the two missing mechanisms. Medise(C1) Medise(C) Medise(C2) Medise(C) p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 ĝ1,bc, ĝ1,br,h, ĝ1,br,t, ĝ1,bc,1 ĝ1,br,h,1 ĝ1,br,t,1 129.727 1.9127 1.13 46.2723 4.32 1.668 16.483 1.927 1.649 466.3972 3.886 1.932 241.387 2.99 1.12 73.776 4.31 1.116 224.7727 1.8676 1.267 96.688 3.13 1.223 Table 29: Ratio between the Medise of the estimators of the additive component g,1 under the considered contaminations and under clean data for the two missing mechanisms. Medise(C1) Medise(C) Medise(C2) Medise(C) p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 ĝ2,bc, ĝ2,br,h, ĝ2,br,t, ĝ2,bc,1 ĝ2,br,h,1 ĝ2,br,t,1 23.998 2.8292 1.822 7.17.1 1.94 28.182 2.478 1.66 24.3816 4.31 1.112 464.4 3.1424 1.488 877.376 4.7631 1.244 418.119 2.726 1.618 7.317 3.682 1.27 Table 3: Ratio between the Medise of the estimators of the additive component g,2 under the considered contaminations and under clean data for the two missing mechanisms. Medise(C1) Medise(C) Medise(C2) Medise(C) p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ3,bc, ĝ3,br,h, ĝ3,br,t, ĝ3,bc,1 ĝ3,br,h,1 ĝ3,br,t,1 ĝ3,bc, ĝ3,br,h, ĝ3,br,t, ĝ3,bc,1 ĝ3,br,h,1 ĝ3,br,t,1 7.1423 1.62 1.372 273.7164 3.247 1.414 8.413 1.31.9944 226.6848 2.612 1.1 ) 132.767 1.8189 1.282 43.641 3.177 1.127 116.2487 1.6494.9913 27.687 2.1641 1.84 Table 31: Ratio between the Medise of the estimators of the additive component g,3 under the considered contaminations and under clean data for the two missing mechanisms. Medise(C1) Medise(C) Medise(C2) Medise(C) p(x) 1 p4(x) =.4 +. cos 2 (x1 x3 +.2) ĝ4,bc, ĝ4,br,h, ĝ4,br,t, ĝ4,bc,1 ĝ4,br,h,1 ĝ4,br,t,1 ĝ4,bc, ĝ4,br,h, ĝ4,br,t, ĝ4,bc,1 ĝ4,br,h,1 ĝ4,br,t,1 11.3211 1.821 1.28 331.927 3.6263 1.442 14.1222 1.8468.9912 344.1223 3.3286 1.869 18.714 2.11 1.223 484.78 3.44 1.2 181.218 1.83.9832 396.297 2.483 1.267 Table 32: Ratio between the Medise of the estimators of the additive component g,2 under the considered contaminations and under clean data for the two missing mechanisms. 23