On Classification Based on Totally Bounded Classes of Functions when There are Incomplete Covariates


Journal of Statistical Theory and Applications, Number 4, pp. 353-369. ISSN 1538-7887

On Classification Based on Totally Bounded Classes of Functions when There are Incomplete Covariates

Majid Mojirsheibani and Zahra Montazeri

Abstract. This article deals with the two-group classification problem, where the class conditional probability $\pi(z) = P\{Y = 1 \mid Z = z\}$ belongs to a known class of functions $\mathcal{F}$ which is totally bounded with respect to the supremum norm. Given an $\epsilon$-cover $\mathcal{F}_\epsilon$ of $\mathcal{F}$, we consider kernel regression methods for constructing classifiers using members of $\mathcal{F}_\epsilon$. A Horvitz-Thompson-type inverse weighting approach will be used to handle the presence of incomplete covariates in the data. Conditions under which the resulting classifiers are strongly consistent are also given.

Key Words and Phrases. Classification, consistency, empirical process, covering number.

AMS 2000 Subject Classifications. 62H30.

Department of Mathematics, California State University Northridge, CA 91330, USA. email: majid.mojirsheibani@csun.edu

Department of Epidemiology and Community Medicine, Faculty of Medicine, University of Ottawa, 451 Smyth (358), Ottawa, ON, K1H 8M5. email: zmontaze@uottawa.ca

1 Introduction

Consider the following standard two-group classification problem. Let $(Z, Y)$ be a random pair, where $Z \in \mathbb{R}^s$ is a random vector of covariates (or predictors) and $Y \in \{0, 1\}$ has to be predicted based on the vector $Z$. More precisely, one would like to find a function (a classifier) $g: \mathbb{R}^s \to \{0, 1\}$ for which the misclassification error probability $L(g) = P\{g(Z) \neq Y\}$ is as small as possible. The best classifier, called the Bayes classifier and denoted by $g_B$, is given by

$$g_B(z) = \begin{cases} 1 & \text{if } \pi^*(z) > 1/2 \\ 0 & \text{otherwise,} \end{cases} \qquad (1)$$

where

$$\pi^*(z) = P\{Y = 1 \mid Z = z\} = E(Y \mid Z = z). \qquad (2)$$

(For a proof of this fact see, for example, Devroye et al. (1996; Chapter 2).) The error of this classifier will be denoted by $L^*$ throughout this paper, i.e.,

$$L^* := P\{g_B(Z) \neq Y\}. \qquad (3)$$

In passing, we also note that if $Z$ and $Y$ are independent then $\pi^*(z)$ is a constant function of $z$ and is in fact equal to $P\{Y = 1\}$. In this extreme case, $g_B(Z)$ is either always 1 or always 0. On the other hand, if $Y = I\{Z \in B\}$ for some $B \subset \mathbb{R}^s$ then $\pi^*(z) = I\{z \in B\}$ and also $g_B(z) = I\{z \in B\}$. In practice one does not know the underlying probability distribution of the pair $(Z, Y)$, and therefore finding $g_B$ is virtually impossible. However, in statistics, one usually has access to a training sample $(Z_1, Y_1), (Z_2, Y_2), \dots, (Z_n, Y_n)$ drawn from the distribution of $(Z, Y)$. The goal is then to construct a data-based classification rule $g_n$ whose conditional error rate

$$L(g_n) = P\{g_n(Z) \neq Y \mid (Z_i, Y_i),\ i = 1, \dots, n\}$$

is in some sense small.

A desirable property for a data-based classifier is consistency: a classifier $g_n$ is said to be consistent if $L(g_n)$ converges to $L(g_B)$ in probability. If the convergence holds almost surely then $g_n$ is said to be strongly consistent.

Next, let $\mathcal{F}$ be a given class of functions $\pi: \mathbb{R}^s \to [0, 1]$. For any real-valued function $f$ on $\mathbb{R}^s$, let $\|f\|_\infty = \sup_{z \in \mathbb{R}^s} |f(z)|$ be its usual supremum norm and put

$$B(\pi, \epsilon) = \big\{ h: \mathbb{R}^s \to [0, 1] \ \big|\ \|\pi - h\|_\infty < \epsilon \big\},$$

i.e., $B(\pi, \epsilon)$ is the open ball of functions, centered at $\pi$, with $\|\cdot\|_\infty$-radius $\epsilon > 0$. Suppose that the finite set of functions $\mathcal{F}_\epsilon = \{\pi_1, \dots, \pi_{N(\epsilon)}\}$, where $\pi_i: \mathbb{R}^s \to [0, 1]$, $1 \le i \le N(\epsilon) < \infty$, is an $\epsilon$-cover of the family $\mathcal{F}$ in the usual sense that $\sup_{\pi \in \mathcal{F}} \min_{1 \le i \le N(\epsilon)} \|\pi - \pi_i\|_\infty \le \epsilon$. Note that $\mathcal{F} \subset \bigcup_{i=1}^{N(\epsilon)} B(\pi_i, \epsilon)$. Here, each member of $\mathcal{F}_\epsilon$ may or may not be a member of $\mathcal{F}$. The covering number of the family $\mathcal{F}$ with respect to the $\|\cdot\|_\infty$-norm, denoted by $\mathcal{N}(\epsilon, \mathcal{F})$, is the cardinality $|\mathcal{F}_\epsilon|$ of the smallest $\epsilon$-cover of $\mathcal{F}$. If $\mathcal{N}(\epsilon, \mathcal{F}) < \infty$ for every $\epsilon > 0$ then $\mathcal{F}$ is said to be totally bounded. In passing we also note the close relationship between compactness and total boundedness (also called pre-compactness): compactness implies total boundedness, but the converse is not in general true. (In fact, a metric space is compact if and only if it is complete and totally bounded; this is the Heine-Borel theorem for general metric spaces.) For more on these and other properties of compact metric spaces one may refer, for example, to Willard (2004).

Next, for each $\pi \in \mathcal{F}$ consider the classifier

$$g_\pi(z) = \begin{cases} 1 & \text{if } \pi(z) > 1/2, \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$

Let

$$\hat{L}_n(\pi) = \frac{1}{n} \sum_{i=1}^{n} I\{g_\pi(Z_i) \neq Y_i\} \qquad (5)$$

be the empirical error rate of $g_\pi$.

Then the so-called skeleton estimate of $\pi^*$, selected from $\mathcal{F}_\epsilon$, is given by (see, for example, Chapter 28 of Devroye et al. (1996))

$$\hat{\pi}_n = \arg\min_{\pi \in \mathcal{F}_\epsilon} \hat{L}_n(\pi), \qquad (6)$$

with the corresponding sample-based classifier (see (1)):

$$g_{\hat{\pi}_n}(z) = \begin{cases} 1 & \text{if } \hat{\pi}_n(z) > 1/2, \\ 0 & \text{otherwise.} \end{cases}$$

Let $L(\hat{\pi}_n) = P\{g_{\hat{\pi}_n}(Z) \neq Y \mid (Z_i, Y_i),\ i = 1, \dots, n\}$ be the error of the classifier $g_{\hat{\pi}_n}$. The following theorem establishes the consistency of the resulting classifier (see Theorem 28.1 of Devroye et al. (1996)).

Theorem 1 Let $\mathcal{F}$ be a totally bounded class of functions mapping $\mathbb{R}^s$ into $[0, 1]$. If $\pi^*(z) \in \mathcal{F}$ then there is a sequence $\epsilon_n > 0$ and a sequence of $\epsilon_n$-covers $\mathcal{F}_{\epsilon_n}$ of $\mathcal{F}$ such that, for the skeleton estimate $\hat{\pi}_n$ selected from $\mathcal{F}_{\epsilon_n}$, one has $L(\hat{\pi}_n) \to L^*$ a.s. Here $\epsilon_n$ can be taken as the smallest positive number for which $\log \mathcal{N}(\epsilon_n, \mathcal{F}) \le n\epsilon_n^2$.

See Devroye et al. (1996; Chapter 28) for a proof of this result.
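The selection rule (4)-(6) is straightforward to carry out once a finite $\epsilon$-cover is in hand. The following is a minimal sketch (not taken from the paper) of how the skeleton estimate could be computed with fully observed data; the cover is represented as a list of callables, and all names (`cover`, `Z`, `Y`, the toy threshold family) are illustrative assumptions only.

```python
import numpy as np

def skeleton_estimate(cover, Z, Y):
    """Pick the member of a finite epsilon-cover that minimizes the
    empirical misclassification error, as in (4)-(6)."""
    best_pi, best_err = None, np.inf
    for pi in cover:
        g = (pi(Z) > 0.5).astype(int)   # classifier g_pi of (4)
        err = np.mean(g != Y)           # empirical error L_n(pi) of (5)
        if err < best_err:
            best_pi, best_err = pi, err
    return best_pi, best_err

# toy illustration: a crude "cover" made of shifted threshold rules
rng = np.random.default_rng(0)
Z = rng.uniform(-1, 1, size=(200, 1))
Y = (Z[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
cover = [lambda z, t=t: np.clip(0.5 + (z[:, 0] - t), 0, 1)
         for t in np.linspace(-1, 1, 21)]
pi_hat, err = skeleton_estimate(cover, Z, Y)
print("selected empirical error:", err)
```

Theorem 1 then says that if the cover is refined at the rate dictated by the covering numbers, the selected classifier is strongly consistent.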

In the next section we shall consider the case where some of the components of the covariate vectors $Z_i$ may be missing. More specifically, we study the case with $Z_i = (X_i, V_i) \in \mathbb{R}^{d+p}$, $d + p = s$, where $X_i \in \mathbb{R}^d$, $d \ge 1$, is always observable, but $V_i \in \mathbb{R}^p$ may be missing for the $i$th observation. To deal with this difficulty, we propose a Horvitz-Thompson-type estimation approach which works by weighting the complete cases by the inverse of the missing-data probabilities. The problem of classification with missing covariates has also been addressed by Mojirsheibani and Montazeri (2007), under different assumptions.

2 Main Results

2.1 Motivation

In this section we consider the case where some components of the $Z_i$'s may be missing. More specifically, we consider the situation where $Z_i = (X_i, V_i) \in \mathbb{R}^{d+p}$, and where $X_i \in \mathbb{R}^d$, $d \ge 1$, is always observable, but $V_i \in \mathbb{R}^p$ may be missing for the $i$th observation. We also define the random variables

$$\Delta_i = \begin{cases} 0 & \text{if } V_i \text{ is missing} \\ 1 & \text{otherwise,} \end{cases} \qquad i = 1, \dots, n.$$

Now, the data may be represented by

$$D_n = \big\{ (Z_1, Y_1, \Delta_1), \dots, (Z_n, Y_n, \Delta_n) \big\} = \big\{ (X_1, V_1, Y_1, \Delta_1), \dots, (X_n, V_n, Y_n, \Delta_n) \big\}.$$

Let $(Z, Y)$ be a new observation, for which $Y \in \{0, 1\}$ has to be predicted based on $Z$ and the data $D_n$; here $(Z, Y) \stackrel{iid}{=} (Z_1, Y_1)$. Clearly the minimization in (6) is no longer possible under the current setup where there are missing $V_i$'s among the data. This is because the computation of the right-hand side of (5) requires every $Z_i$, $i = 1, \dots, n$. Using the complete cases alone in (5) will not solve the problem; here a complete case refers to a fully observable $Z_i$ (i.e., when $\Delta_i = 1$). The reason is that if we choose $\hat{\pi}_n$ as the minimizer of

$$\tilde{L}_n(\pi) := \frac{1}{n} \sum_{i=1}^{n} \Delta_i\, I\{g_\pi(Z_i) \neq Y_i\},$$

then the corresponding empirical process $\{\tilde{L}_n(\pi) - L(\pi) \mid \pi \in \mathcal{F}\}$ is not centered in general (not even asymptotically), and this plays a crucial role in establishing the theoretical validity of $g_{\hat{\pi}_n}$. In fact, it is clear that $\tilde{L}_n(\pi)$ is not in general unbiased for $L(\pi)$. To motivate the procedures of this section, we also need to define the missing probability mechanism, i.e., the quantity

$$p(Z_i, Y_i) := P\{\Delta_i = 1 \mid Z_i, Y_i\} = E(\Delta_i \mid Z_i, Y_i), \qquad i = 1, \dots, n.$$

In what follows we shall also assume that

$$p(Z_i, Y_i) \ge p_0 > 0; \qquad (7)$$

this is an assumption which says, in a sense, that there is always a nonzero probability $p_0$ that an observation is not missing. Now, consider the hypothetical situation where the above function $p$ is known and put

$$\hat{L}_n^{\,p}(\pi) := \frac{1}{n} \sum_{i=1}^{n} \frac{\Delta_i}{p(Z_i, Y_i)}\, I\{g_\pi(Z_i) \neq Y_i\}, \qquad (8)$$

where $g_\pi$ is as in (4).

In passing we also note that (5) is the special case of (8) when $E(\Delta_i) = 1$ for all $i$. In fact, it is straightforward to see that $\hat{L}_n^{\,p}(\pi)$ satisfies $E[\hat{L}_n^{\,p}(\pi)] = L(\pi)$, where $L(\pi) = P\{g_\pi(Z) \neq Y\}$. It is important to mention that the idea in (8) is very similar to that used by Györfi et al. (2002; Chapter 26) for the unbiased estimation of a mean from censored data. Next, define the following revised version of the estimator $\hat{\pi}_n$ in (6) and let $g_{\hat{\pi}_n}$ be its corresponding classifier, i.e.,

$$\hat{\pi}_n = \arg\min_{\pi \in \mathcal{F}_\epsilon} \hat{L}_n^{\,p}(\pi), \qquad (9)$$

$$g_{\hat{\pi}_n}(z) = \begin{cases} 1 & \text{if } \hat{\pi}_n(z) > 1/2, \\ 0 & \text{otherwise.} \end{cases} \qquad (10)$$

To study the performance of $g_{\hat{\pi}_n}$, let $L(\hat{\pi}_n) = P\{g_{\hat{\pi}_n}(Z) \neq Y \mid D_n\}$ be the misclassification error of $g_{\hat{\pi}_n}$. Then we have the following result.

Theorem 2 Let $\mathcal{F}$ be a totally bounded class of functions mapping $\mathbb{R}^{d+p}$ into $[0, 1]$ containing the function $\pi^*(z) = P\{Y = 1 \mid Z = z\}$. Then for every $\epsilon$ and $\delta$ satisfying $\delta > 2\epsilon > 0$ one has

$$P\{L(\hat{\pi}_n) - L^* > \delta\} \le 2\,\mathcal{N}(\epsilon, \mathcal{F}) \exp\big\{ -2n(\delta/2 - \epsilon)^2 p_0^2 \big\},$$

where $p_0$ is as in (7).

The proofs of the theorems will be deferred until all the results have been stated. The following corollary is an immediate consequence of the Borel-Cantelli lemma:

Corollary 1 Let $\epsilon_n$ be a sequence of positive constants decreasing to 0. Also let $\mathcal{F}$ be the class of functions defined in Theorem 2. If, as $n \to \infty$,

$$\frac{\log \mathcal{N}(\epsilon_n, \mathcal{F})}{n} \to 0,$$

then $L(\hat{\pi}_n) \to L^*$ a.s.

Thus, if the missing probability mechanism $p(Z_i, Y_i)$ were known, the above approach would provide the theoretical basis to construct strongly consistent classifiers. Unfortunately, in practice, the missing probability mechanism is almost always unknown and must be estimated. In the next section we propose a kernel-based approach to overcome this problem.
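For intuition, the criterion (8) and the selection rule (9) amount to a one-line change to the complete-data procedure: each complete case is reweighted by the inverse of its selection probability, and incomplete cases contribute nothing. A minimal sketch under the hypothetical assumption that the selection probability function p is known (all names are illustrative, not from the paper):

```python
import numpy as np

def ht_weighted_error(pi, Z, Y, delta, p):
    """Horvitz-Thompson-type estimate (8) of the error of g_pi: only complete
    cases (delta == 1) enter, each weighted by 1 / p(Z_i, Y_i)."""
    n = len(Y)
    cc = delta == 1                          # complete cases only
    g = (pi(Z[cc]) > 0.5).astype(int)        # g_pi evaluated on complete cases
    return np.sum((g != Y[cc]) / p(Z[cc], Y[cc])) / n

def select_classifier(cover, Z, Y, delta, p):
    """Minimizer (9) of the weighted empirical error over a finite epsilon-cover."""
    errs = [ht_weighted_error(pi, Z, Y, delta, p) for pi in cover]
    return cover[int(np.argmin(errs))]
```

Sections 2.2 and 2.3 simply replace the unknown p by a kernel or least-squares estimate, with the weighting otherwise unchanged.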

2.2 Kernel Regression

Let $p(Z_i, Y_i) = P\{\Delta_i = 1 \mid Z_i, Y_i\}$ be the missingness probability mechanism, i.e., the conditional probability that $V_i$ is not missing (recall that $Z_i = (X_i, V_i)$). Under the commonly used assumption of data Missing At Random (MAR), one assumes that the probability that $V_i$ is missing does not depend on $V_i$ itself. That is,

$$P\{\Delta_i = 1 \mid Z_i, Y_i\} = P\{\Delta_i = 1 \mid X_i, Y_i\} =: q(X_i, Y_i). \qquad (11)$$

When $P\{\Delta_i = 1 \mid Z_i, Y_i\} = P\{\Delta_i = 1\}$ then $V_i$ is said to be Missing Completely At Random (MCAR). For these definitions and a survey of other missing patterns one may refer to the book by Little and Rubin (2002). Now consider the following kernel-based estimator of the function $q(X_i, Y_i)$ defined in (11):

$$\hat{q}(X_i, Y_i) = \frac{\sum_{j=1,\, j \neq i}^{n} \Delta_j\, I\{Y_j = Y_i\}\, K\!\left(\frac{X_j - X_i}{h}\right)}{\sum_{j=1,\, j \neq i}^{n} I\{Y_j = Y_i\}\, K\!\left(\frac{X_j - X_i}{h}\right)}, \qquad (12)$$

with the convention 0/0 = 0, where $K: \mathbb{R}^d \to \mathbb{R}^+$ is any kernel with the smoothing parameter $h$ (here $h \equiv h(n) \to 0$, as $n \to \infty$). Next, for each $\pi \in \mathcal{F}$, put

$$\hat{L}_n^{\,\hat{q}}(\pi) := \frac{1}{n} \sum_{i=1}^{n} \frac{\Delta_i}{\hat{q}(X_i, Y_i)}\, I\{g_\pi(Z_i) \neq Y_i\},$$

and define $\hat{\pi}_n = \arg\min_{\pi \in \mathcal{F}_\epsilon} \hat{L}_n^{\,\hat{q}}(\pi)$. Then the corresponding classifier is given by

$$g_{\hat{\pi}_n}(z) = \begin{cases} 1 & \text{if } \hat{\pi}_n(z) > 1/2, \\ 0 & \text{otherwise.} \end{cases} \qquad (13)$$
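The estimator (12) is an ordinary leave-one-out kernel regression of the missingness indicator $\Delta$ on $X$, computed separately within each observed label value $Y_i$. A minimal sketch with a Gaussian product kernel (the kernel choice, bandwidth, and variable names are assumptions made here for illustration; the paper only imposes conditions such as C4 below):

```python
import numpy as np

def kernel_qhat(X, Y, delta, h):
    """Leave-one-out kernel estimate (12) of q(X_i, Y_i) = P(Delta = 1 | X_i, Y_i).
    X: (n, d) array, Y and delta: (n,) integer arrays; convention 0/0 = 0."""
    n = X.shape[0]
    q_hat = np.zeros(n)
    for i in range(n):
        same = (Y == Y[i])
        same[i] = False                               # leave observation i out
        u = (X[same] - X[i]) / h
        k = np.exp(-0.5 * np.sum(u ** 2, axis=1))     # Gaussian K((X_j - X_i)/h)
        denom = k.sum()
        q_hat[i] = (delta[same] * k).sum() / denom if denom > 0 else 0.0
    return q_hat
```

The resulting values $\hat{q}(X_i, Y_i)$ are then substituted for $p(Z_i, Y_i)$ in the weighted criterion, exactly as in the sketch following Section 2.1.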

To assess the performance of $g_{\hat{\pi}_n}$ we will make the following assumptions:

C1: The MAR assumption (11) holds with $q(X_i, Y_i) \ge q_0 > 0$, for some positive constant $q_0$ (compare with (7)).

C2: The random vector $X$ has a compactly supported density function, $f(x)$, and $f$ is bounded away from zero on its support. Furthermore, $f$ and its first-order partial derivatives are uniformly bounded on its support.

C3: The partial derivatives $\frac{\partial}{\partial x_i} q(x, y)$, $i = 1, \dots, d$, exist and are bounded on the compact support of $f$, uniformly in $x$.

C4: The kernel $K$ satisfies $\int K(u)\,du = 1$ and $\int |u_i|\, K(u)\,du < \infty$, $i = 1, \dots, d$, and $\|K\|_\infty < \infty$. The smoothing parameter $h$ satisfies $h \to 0$ and $nh^d \to \infty$, as $n \to \infty$.

The following theorem gives performance bounds for the classifier $g_{\hat{\pi}_n}$.

Theorem 3 Let $\mathcal{F}$ be as in Theorem 2 and define the classifier $g_{\hat{\pi}_n}$ as in (13). Also suppose that conditions C1-C4 hold.

(i) For every $\delta > 2\epsilon > 0$ there is an $n_0 > 0$ such that for all $n > n_0$,

$$P\{L(\hat{\pi}_n) - L^* > \delta\} \le 2\,\mathcal{N}(\epsilon, \mathcal{F})\, e^{-n(\delta - 2\epsilon)^2 q_0^2/8} + 4n\, e^{-c_1 n ((\delta - 2\epsilon)/4)^2 h^d} + 2n\, e^{-c_2 n h^d},$$

where $L(\hat{\pi}_n) = P\{g_{\hat{\pi}_n}(Z) \neq Y \mid D_n\}$ and where $c_1$ and $c_2$ are positive constants not depending on $n$, $\delta$, or $\epsilon$.

(ii) Let $\epsilon_n$ be a sequence of positive constants decreasing to 0. If, as $n \to \infty$,

$$\frac{\log \mathcal{N}(\epsilon_n, \mathcal{F})}{n} \to 0 \quad\text{and}\quad \frac{\log n}{n h^d} \to 0,$$

then $L(\hat{\pi}_n) \to L^*$ a.s.

The above results, as well as those in Theorem 2 and Corollary 1, are based on the requirement that $\mathcal{F}$ is totally bounded. Furthermore, the $\epsilon$-covering number $\mathcal{N}(\epsilon, \mathcal{F})$ of the class $\mathcal{F}$ should not grow too fast (as $\epsilon$ gets closer and closer to 0). There are many important classes of functions that satisfy these requirements; here we give two examples.

Example 1. (Differentiable functions.) Let $k_1, \dots, k_d$ be non-negative integers and put $k = (k_1, \dots, k_d)$ and $|k| = k_1 + \dots + k_d$. Also, for any $g: [0, 1]^d \to \mathbb{R}$, let $D^{(k)} g(u) = \partial^{|k|} g(u) / \partial u_1^{k_1} \cdots \partial u_d^{k_d}$. Consider the class of functions with bounded partial derivatives of order $r$:

$$\mathcal{G} = \Big\{ g: [0, 1]^d \to \mathbb{R} \ \Big|\ \max_{|k| \le r} \sup_u \big| D^{(k)} g(u) \big| \le A < \infty \Big\}.$$

Then, for every $\epsilon > 0$, $\log \mathcal{N}(\epsilon, \mathcal{G}) \le M \epsilon^{-\alpha}$, where $\alpha = d/r$ and $M \equiv M(d, r)$. This result is due to Kolmogorov and Tikhomirov (1959).
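To see how the smoothness level $r$ and the dimension $d$ of Example 1 translate into a concrete choice of $\epsilon_n$ in Theorem 1 (the smallest $\epsilon$ with $\log \mathcal{N}(\epsilon, \mathcal{F}) \le n\epsilon^2$), one can solve the balance equation numerically. A small sketch, assuming the entropy bound $\log \mathcal{N}(\epsilon, \mathcal{F}) = M\epsilon^{-d/r}$ with the illustrative constant $M = 1$:

```python
import numpy as np

def smallest_eps(n, d, r, M=1.0):
    """Smallest eps on a grid with M * eps**(-d/r) <= n * eps**2,
    i.e. the balance condition of Theorem 1 under the entropy bound of Example 1.
    Solving exactly gives eps = (M / n) ** (r / (2 * r + d))."""
    grid = np.logspace(-6, 0, 10000)          # ascending grid of candidate eps values
    ok = M * grid ** (-d / r) <= n * grid ** 2
    return grid[ok][0] if ok.any() else None

for n in (10**2, 10**4, 10**6):
    print(n, smallest_eps(n, d=2, r=2), (1.0 / n) ** (2 / (2 * 2 + 2)))
```

The printed grid solution and the closed form $(M/n)^{r/(2r+d)}$ agree, showing the familiar nonparametric trade-off: larger $d$ slows the rate at which $\epsilon_n$ can shrink, while higher smoothness $r$ speeds it up.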

Example 2. Consider the class $\Psi$ of all convex functions $\psi: C \to [0, 1]$, where $C \subset \mathbb{R}^d$ is compact and convex. If $\psi$ satisfies the Lipschitz condition $|\psi(z_1) - \psi(z_2)| \le L \|z_1 - z_2\|$, for all $z_1, z_2 \in C$, then $\log \mathcal{N}(\epsilon, \Psi) \le M \epsilon^{-d/2}$, for every $\epsilon > 0$, where $M \equiv M(d, L)$; see van der Vaart and Wellner (1996).

2.3 Least-squares Regression

In this section we consider least-squares (LS) estimates of the function $q$. The method works as follows. Suppose that $q$ belongs to the known class of functions $\mathcal{Q}$ of the form $q: \mathbb{R}^d \times \{0, 1\} \to [q_0, 1]$, where $q_0$ is as in assumption C1. The least-squares estimate of $q$ is given by

$$\hat{q}_n = \arg\min_{q \in \mathcal{Q}} \frac{1}{n} \sum_{i=1}^{n} \big( \Delta_i - q(X_i, Y_i) \big)^2.$$

Now, for each $\pi \in \mathcal{F}$, let

$$\hat{L}_n^{\,\hat{q}_n}(\pi) := \frac{1}{n} \sum_{i=1}^{n} \frac{\Delta_i}{\hat{q}_n(X_i, Y_i)}\, I\{g_\pi(Z_i) \neq Y_i\},$$

and define $\hat{\pi}_n = \arg\min_{\pi \in \mathcal{F}_\epsilon} \hat{L}_n^{\,\hat{q}_n}(\pi)$. In this case, we consider the following classifier

$$g_{\hat{\pi}_n}(z) = \begin{cases} 1 & \text{if } \hat{\pi}_n(z) > 1/2, \\ 0 & \text{otherwise.} \end{cases} \qquad (14)$$

To study the performance of $g_{\hat{\pi}_n}$ we also need the following standard notation from empirical process theory. Fix $(x_1, y_1), \dots, (x_n, y_n)$ and let $\mathcal{N}_1(\epsilon, \mathcal{Q}, (x_i, y_i)_1^n)$ be the $\epsilon$-covering number of the class $\mathcal{Q}$ with respect to the empirical measure of the points $(x_1, y_1), \dots, (x_n, y_n)$. That is, $\mathcal{N}_1(\epsilon, \mathcal{Q}, (x_i, y_i)_1^n)$ is the cardinality of the smallest subclass of functions $\mathcal{Q}_\epsilon = \{q_1, \dots, q_{N(\epsilon)}\}$, $q_i: \mathbb{R}^d \times \{0, 1\} \to [q_0, 1]$, such that for every $q \in \mathcal{Q}$ and every $\epsilon > 0$ there is a $q' \in \mathcal{Q}_\epsilon$ such that $\frac{1}{n} \sum_{i=1}^{n} |q(x_i, y_i) - q'(x_i, y_i)| < \epsilon$. For more on this one may refer, for example, to Pollard (1984) or van der Vaart and Wellner (1996).
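As a purely illustrative instance of the least-squares step above, suppose $\mathcal{Q}$ is taken to be a small parametric family of logistic-type functions truncated below at $q_0$ (this particular choice of $\mathcal{Q}$ is an assumption made here, not one from the paper). Then $\hat{q}_n$ is obtained by ordinary numerical minimization of the empirical squared error between $\Delta_i$ and $q(X_i, Y_i)$:

```python
import numpy as np
from scipy.optimize import minimize

def fit_q_least_squares(X, Y, delta, q0):
    """Least-squares estimate of q over a hypothetical parametric class
    Q = { max(q0, sigmoid(a + X@b + c*Y)) }, i.e. the minimizer of
    (1/n) * sum_i (delta_i - q(X_i, Y_i))**2."""
    n, d = X.shape

    def q_theta(theta, Xnew, Ynew):
        lin = theta[0] + Xnew @ theta[1:1 + d] + theta[1 + d] * Ynew
        return np.maximum(q0, 1.0 / (1.0 + np.exp(-lin)))   # keep q >= q0

    def objective(theta):
        return np.mean((delta - q_theta(theta, X, Y)) ** 2)

    res = minimize(objective, x0=np.zeros(d + 2), method="Nelder-Mead")
    return lambda Xnew, Ynew: q_theta(res.x, Xnew, Ynew)
```

The fitted $\hat{q}_n(X_i, Y_i)$ then replaces $p(Z_i, Y_i)$ in the weighted criterion, just as with the kernel estimate of Section 2.2; the covering number $\mathcal{N}_1$ above is what controls how rich the class $\mathcal{Q}$ is allowed to be.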

We then have the following result.

Theorem 4 Let $\mathcal{F}$ be as in Theorem 2 and suppose that condition C1 holds. Also, define the classifier $g_{\hat{\pi}_n}$ as in (14) and set $L(\hat{\pi}_n) = P\{g_{\hat{\pi}_n}(Z) \neq Y \mid D_n\}$. Then:

(i) For every $\delta > 2\epsilon > 0$ there is an $n_0 > 0$ such that for all $n > n_0$,

$$P\{L(\hat{\pi}_n) - L^* > \delta\} \le 2\,\mathcal{N}(\epsilon, \mathcal{F})\, e^{-n(\delta - 2\epsilon)^2 q_0^2/8} + 8\,E\Big[\mathcal{N}_1\Big(\tfrac{(\delta - 2\epsilon) q_0^2}{64}, \mathcal{Q}, (X_i, Y_i)_1^n\Big)\Big]\, e^{-c_3 n (\delta - 2\epsilon)^2} + 8\,E\Big[\mathcal{N}_1\Big(\tfrac{(\delta - 2\epsilon)^2 q_0^4}{1024}, \mathcal{Q}, (X_i, Y_i)_1^n\Big)\Big]\, e^{-c_4 n (\delta - 2\epsilon)^4},$$

where $c_3$ and $c_4$ are positive constants not depending on $n$, $\delta$, or $\epsilon$.

(ii) Let $\epsilon_n$ be a sequence of positive constants decreasing to 0. If, as $n \to \infty$,

$$\frac{\log \mathcal{N}(\epsilon_n, \mathcal{F})}{n} \to 0 \quad\text{and}\quad \frac{\log E\big[\mathcal{N}_1\big(c, \mathcal{Q}, (X_i, Y_i)_1^n\big)\big]}{n} \to 0 \quad\text{for every } c > 0,$$

then $L(\hat{\pi}_n) \to L^*$ a.s.

3 Proofs

Proof of Theorem 2. The proof is based on standard arguments (see, for example, Devroye et al. (1996; Section 28.3)), and goes as follows. First observe that for any classifier $g$,

$$P\{g(Z) \neq Y\} = 1 - P\{g(Z) = Y\} = 1 - \big[ P\{g(Z) = 1, Y = 1\} + P\{g(Z) = 0, Y = 0\} \big]$$
$$= 1 - E\big[ I\{g(Z) = 1\}\, I\{Y = 1\} \big] - E\big[ I\{g(Z) = 0\}\, I\{Y = 0\} \big]$$
$$= 1 - E\Big[ E\big( I\{g(Z) = 1\}\, I\{Y = 1\} \mid Z \big) \Big] - E\Big[ E\big( I\{g(Z) = 0\}\, I\{Y = 0\} \mid Z \big) \Big]$$
$$= 1 - E\big[ I\{g(Z) = 1\}\, \pi^*(Z) + I\{g(Z) = 0\}\,(1 - \pi^*(Z)) \big],$$

where $\pi^*(Z) = P\{Y = 1 \mid Z\}$. Thus,

$$P\{g(Z) \neq Y\} - L^* = E\big[ I\{g_B(Z) = 1\}\, \pi^*(Z) + I\{g_B(Z) = 0\}\,(1 - \pi^*(Z)) \big] - E\big[ I\{g(Z) = 1\}\, \pi^*(Z) + I\{g(Z) = 0\}\,(1 - \pi^*(Z)) \big]$$
$$= E\Big[ \pi^*(Z)\big( I\{g_B(Z) = 1\} - I\{g(Z) = 1\} \big) + (1 - \pi^*(Z))\big( I\{g_B(Z) = 0\} - I\{g(Z) = 0\} \big) \Big]$$
$$= E\Big[ \big( 2\pi^*(Z) - 1 \big) \big( I\{g_B(Z) = 1\} - I\{g(Z) = 1\} \big) \Big]$$
$$= E\big[ |2\pi^*(Z) - 1|\; I\{g_B(Z) \neq g(Z)\} \big], \qquad (15)$$

in view of the definitions of $g_B$ and $\pi^*$ in (1) and (2).

Now let $\pi \in \mathcal{F}$ and put $L(\pi) = P\{g_\pi(Z) \neq Y\}$, where

$$g_\pi(z) = \begin{cases} 1 & \text{if } \pi(z) > 1/2 \\ 0 & \text{otherwise,} \end{cases}$$

and note that by (15)

$$L(\pi) - L(\pi^*) = E\big[ |2\pi^*(Z) - 1|\; I\{g_B(Z) \neq g_\pi(Z)\} \big] \le 2\, E\,|\pi(Z) - \pi^*(Z)|, \qquad (16)$$

where the last bound follows since $|\pi^*(Z) - 0.5| \le |\pi(Z) - \pi^*(Z)|$ whenever $g_B(Z) \neq g_\pi(Z)$. Let $\tilde{\pi} \in \mathcal{F}_\epsilon$ be such that $\tilde{\pi} \in B(\pi^*, \epsilon)$; this is possible since $\mathcal{F}_\epsilon$ is an $\epsilon$-cover of $\mathcal{F}$ and $\pi^* \in \mathcal{F}$. Since

$$\inf_{\pi \in \mathcal{F}_\epsilon} L(\pi) - L^* \le 2\, E\,|\tilde{\pi}(Z) - \pi^*(Z)| \quad \text{(by (16))} \quad \le 2 \sup_{z \in \mathbb{R}^{d+p}} |\tilde{\pi}(z) - \pi^*(z)| \le 2\epsilon \quad \text{(because } \tilde{\pi} \in B(\pi^*, \epsilon)\text{)}, \qquad (17)$$

one finds that for every $\delta > 2\epsilon > 0$

$$P\{L(\hat{\pi}_n) - L^* > \delta\} \le P\Big\{ L(\hat{\pi}_n) - \inf_{\pi \in \mathcal{F}_\epsilon} L(\pi) > \delta - 2\epsilon \Big\}$$
$$= P\Big\{ L(\hat{\pi}_n) - \hat{L}_n^{\,p}(\hat{\pi}_n) + \hat{L}_n^{\,p}(\hat{\pi}_n) - \inf_{\pi \in \mathcal{F}_\epsilon} L(\pi) > \delta - 2\epsilon \Big\}$$
$$\le P\Big\{ 2 \sup_{\pi \in \mathcal{F}_\epsilon} \big| \hat{L}_n^{\,p}(\pi) - L(\pi) \big| > \delta - 2\epsilon \Big\}$$
$$\le \mathcal{N}(\epsilon, \mathcal{F}) \sup_{\pi \in \mathcal{F}_\epsilon} P\Big\{ \big| \hat{L}_n^{\,p}(\pi) - L(\pi) \big| > \frac{\delta}{2} - \epsilon \Big\}.$$

Now, by Hoeffding's inequality, this last probability statement appearing above can be bounded by $2\exp\{-2n(\delta/2 - \epsilon)^2 p_0^2\}$, and this completes the proof of the theorem.

Proof of Theorem 3. Part (i): For each $\pi \in \mathcal{F}$, let

$$\hat{L}_n^{\,q}(\pi) := \frac{1}{n} \sum_{i=1}^{n} \frac{\Delta_i}{q(X_i, Y_i)}\, I\{g_\pi(Z_i) \neq Y_i\}$$

and observe that

$$\big| \hat{L}_n^{\,\hat{q}}(\pi) - \hat{L}_n^{\,q}(\pi) \big| = \bigg| \frac{1}{n} \sum_{i=1}^{n} \Delta_i\, I\{g_\pi(Z_i) \neq Y_i\} \bigg( \frac{1}{\hat{q}(X_i, Y_i)} - \frac{1}{q(X_i, Y_i)} \bigg) \bigg| \le \frac{1}{n} \sum_{i=1}^{n} \frac{\big| \hat{q}(X_i, Y_i) - q(X_i, Y_i) \big|}{\hat{q}(X_i, Y_i)\, q(X_i, Y_i)}.$$

Furthermore, since

$$L(\hat{\pi}_n) - \inf_{\pi \in \mathcal{F}_\epsilon} L(\pi) = \Big[ L(\hat{\pi}_n) - \hat{L}_n^{\,\hat{q}}(\hat{\pi}_n) \Big] + \Big[ \hat{L}_n^{\,\hat{q}}(\hat{\pi}_n) - \inf_{\pi \in \mathcal{F}_\epsilon} L(\pi) \Big] \le 2 \sup_{\pi \in \mathcal{F}_\epsilon} \big| \hat{L}_n^{\,\hat{q}}(\pi) - L(\pi) \big|,$$

one finds that

$$P\{L(\hat{\pi}_n) - L^* > \delta\} \le P\Big\{ L(\hat{\pi}_n) - \inf_{\pi \in \mathcal{F}_\epsilon} L(\pi) > \delta - 2\epsilon \Big\} \qquad \text{(in view of (17))}$$
$$\le P\Big\{ 2 \sup_{\pi \in \mathcal{F}_\epsilon} \big| \hat{L}_n^{\,\hat{q}}(\pi) - L(\pi) \big| > \delta - 2\epsilon \Big\}$$
$$\le P\bigg\{ \frac{1}{n} \sum_{i=1}^{n} \frac{\big| \hat{q}(X_i, Y_i) - q(X_i, Y_i) \big|}{\hat{q}(X_i, Y_i)\, q(X_i, Y_i)} > \frac{\delta - 2\epsilon}{4} \bigg\} + P\Big\{ \sup_{\pi \in \mathcal{F}_\epsilon} \big| \hat{L}_n^{\,q}(\pi) - L(\pi) \big| > \frac{\delta - 2\epsilon}{4} \Big\} := I_n + II_n \quad \text{(say)}. \qquad (18)$$

But, using the MAR assumption (see (11)), it is straightforward to see that $E[\hat{L}_n^{\,q}(\pi)] = L(\pi)$. Therefore

$$II_n \le \mathcal{N}(\epsilon, \mathcal{F}) \sup_{\pi \in \mathcal{F}_\epsilon} P\Big\{ \big| \hat{L}_n^{\,q}(\pi) - L(\pi) \big| > \frac{\delta - 2\epsilon}{4} \Big\} \le 2\,\mathcal{N}(\epsilon, \mathcal{F})\, e^{-n(\delta - 2\epsilon)^2 q_0^2 / 8} \quad \text{(via Hoeffding's inequality)}. \qquad (19)$$

As for the term $I_n$ in (18), first note that

$$I_n \le P\bigg\{ \frac{1}{n} \sum_{i=1}^{n} \frac{\big| \hat{q}(X_i, Y_i) - q(X_i, Y_i) \big|}{\hat{q}(X_i, Y_i)\, q(X_i, Y_i)} > \frac{\delta - 2\epsilon}{4},\ \bigcap_{i=1}^{n} \Big\{ \hat{q}(X_i, Y_i) \ge \frac{q_0}{2} \Big\} \bigg\} + P\bigg\{ \bigcup_{i=1}^{n} \Big\{ \hat{q}(X_i, Y_i) < \frac{q_0}{2} \Big\} \bigg\}$$
$$\le \sum_{i=1}^{n} P\Big\{ \big| \hat{q}(X_i, Y_i) - q(X_i, Y_i) \big| \big/ (q_0^2/2) > \frac{\delta - 2\epsilon}{4} \Big\} + \sum_{i=1}^{n} P\Big\{ \hat{q}(X_i, Y_i) < \frac{q_0}{2} \Big\}. \qquad (20)$$

It will be shown at the end of the proof that for every constant $b > 0$, and $n$ large enough,

$$P\big\{ \big| \hat{q}(X_i, Y_i) - q(X_i, Y_i) \big| > b \big\} \le 4\, e^{-C_3 n h^d b^2}, \qquad (21)$$

where $C_3$ is a positive constant not depending on $n$ or $\epsilon$. Therefore, taking $b = (\delta - 2\epsilon) q_0^2 / 8$ in (21), the first sum on the r.h.s. of (20) is bounded by $4n\, e^{-C_4 n h^d (\delta - 2\epsilon)^2}$, for $n$ large enough, where $C_4 > 0$ does not depend on $n$, $\delta$, or $\epsilon$. Similarly, since

$$P\Big\{ \hat{q}(X_i, Y_i) < \frac{q_0}{2} \Big\} \le P\Big\{ \big| \hat{q}(X_i, Y_i) - q(X_i, Y_i) \big| > \frac{q_0}{2} \Big\},$$

one finds, via (21), that for $n$ large enough, the second sum on the r.h.s. of (20) is bounded by $4n\, e^{-C_5 n h^d}$, where the constant $C_5$ is positive and does not depend on $n$ or $\epsilon$. Putting the above together, we have shown that for $n$ large enough,

$$I_n \le 4n\, e^{-C_4 n h^d (\delta - 2\epsilon)^2} + 4n\, e^{-C_5 n h^d}.$$

This completes the proof of part (i) of Theorem 3. Part (ii) follows from the Borel-Cantelli lemma.

Proof of (21). Since $|\hat{q}(X_i, Y_i) - q(X_i, Y_i)| \le 1$, it is sufficient to prove (21) for $0 < b \le 1$. Now, let

$$S(X_i, Y_i) = f(X_i)\, P\{Y = Y_i \mid X = X_i\}\, q(X_i, Y_i),$$
$$\hat{S}(X_i, Y_i) = \frac{1}{(n-1) h^d} \sum_{j=1,\, j \neq i}^{n} \Delta_j\, I\{Y_j = Y_i\}\, K\Big( \frac{X_j - X_i}{h} \Big),$$
$$R(X_i, Y_i) = f(X_i)\, P\{Y = Y_i \mid X = X_i\},$$
$$\hat{R}(X_i, Y_i) = \frac{1}{(n-1) h^d} \sum_{j=1,\, j \neq i}^{n} I\{Y_j = Y_i\}\, K\Big( \frac{X_j - X_i}{h} \Big),$$

and observe that

$$\hat{q}(X_i, Y_i) - q(X_i, Y_i) = \frac{\hat{S}(X_i, Y_i)}{\hat{R}(X_i, Y_i)} - \frac{S(X_i, Y_i)}{R(X_i, Y_i)} = \frac{\hat{S}(X_i, Y_i)/\hat{R}(X_i, Y_i)}{R(X_i, Y_i)} \big( R(X_i, Y_i) - \hat{R}(X_i, Y_i) \big) + \frac{\hat{S}(X_i, Y_i) - S(X_i, Y_i)}{R(X_i, Y_i)},$$

so that

$$\big| \hat{q}(X_i, Y_i) - q(X_i, Y_i) \big| \le \frac{\big| \hat{R}(X_i, Y_i) - R(X_i, Y_i) \big|}{R(X_i, Y_i)} + \frac{\big| \hat{S}(X_i, Y_i) - S(X_i, Y_i) \big|}{R(X_i, Y_i)},$$

where we have used the fact that $\hat{S}(X_i, Y_i)/\hat{R}(X_i, Y_i) \le 1$. Therefore, since $R(X_i, Y_i) \ge C_6 > 0$ (by assumption C2), one finds that for every $b > 0$

$$P\big\{ \big| \hat{q}(X_i, Y_i) - q(X_i, Y_i) \big| > b \big\} \le P\big\{ \big| \hat{S}(X_i, Y_i) - S(X_i, Y_i) \big| > C_7 b \big\} + P\big\{ \big| \hat{R}(X_i, Y_i) - R(X_i, Y_i) \big| > C_7 b \big\} := \pi_{n1} + \pi_{n2}, \qquad (22)$$

where $C_7 = C_6/2$. Now, by the results of Mojirsheibani et al. (2011; Lemma A.1, with $g(Z, Y) = 1$) one finds

$$\Big| S(X_i, Y_i) - E\big[ \hat{S}(X_i, Y_i) \mid X_i, Y_i \big] \Big| \le C\, h, \qquad (23)$$

where $C > 0$ is a constant not depending on $n$. Therefore

$$\pi_{n1} \le P\Big\{ \big| \hat{S}(X_i, Y_i) - E[\hat{S}(X_i, Y_i) \mid X_i, Y_i] \big| + \big| E[\hat{S}(X_i, Y_i) \mid X_i, Y_i] - S(X_i, Y_i) \big| > C_7 b \Big\}$$
$$\le P\Big\{ \big| \hat{S}(X_i, Y_i) - E[\hat{S}(X_i, Y_i) \mid X_i, Y_i] \big| > C_8 b \Big\} \qquad \text{(for $n$ large, by (23), where $C_8 = C_7/2$)}$$
$$= E\Big[ P\Big\{ \big| \hat{S}(X_i, Y_i) - E[\hat{S}(X_i, Y_i) \mid X_i, Y_i] \big| > C_8 b \ \Big|\ X_i, Y_i \Big\} \Big]$$
$$= E\bigg[ P\bigg\{ \Big| \frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} \Gamma_j(X_i, Y_i) \Big| > C_8 b \ \bigg|\ X_i, Y_i \bigg\} \bigg], \qquad (24)$$

where

$$\Gamma_j(X_i, Y_i) = h^{-d} \bigg[ \Delta_j\, I\{Y_j = Y_i\}\, K\Big( \frac{X_j - X_i}{h} \Big) - E\Big( \Delta_j\, I\{Y_j = Y_i\}\, K\Big( \frac{X_j - X_i}{h} \Big) \ \Big|\ X_i, Y_i \Big) \bigg].$$

However, conditional on $(X_i, Y_i)$, the terms $\Gamma_j(X_i, Y_i)$, $j = 1, \dots, n$, $j \neq i$, are independent, zero-mean random variables, bounded by $-h^{-d}\|K\|_\infty$ and $+h^{-d}\|K\|_\infty$. We also note that

$$\mathrm{Var}\big( \Gamma_j(X_i, Y_i) \mid X_i, Y_i \big) = E\big[ \Gamma_j^2(X_i, Y_i) \mid X_i, Y_i \big] \le h^{-d}\, \|K\|_\infty\, \|f\|_\infty.$$

Therefore, by Bennett's inequality (Bennett, 1962), for any fixed (nonrandom) $x$ and $y$,

$$P\bigg\{ \Big| \frac{1}{n-1} \sum_{j=1,\, j \neq i}^{n} \Gamma_j(X_i, Y_i) \Big| > C_8 b \ \bigg|\ X_i = x, Y_i = y \bigg\} \le 2 \exp\bigg\{ - \frac{(n-1) h^d\, C_8^2\, b^2}{2\big( \|K\|_\infty \|f\|_\infty + C_8 b \big)} \bigg\},$$

where the bound does not depend on $x$ or $y$. Therefore, in view of (24) and the fact that $0 < b \le 1$, one finds (for $n$ large enough),

$$\pi_{n1} \le 2 \exp\bigg\{ - \frac{(n-1) h^d\, C_8^2}{2\big( \|K\|_\infty \|f\|_\infty + C_8 \big)}\, b^2 \bigg\}.$$

Similarly, one can also show (with, in fact, less effort) that, for $n$ large enough,

$$\pi_{n2} \le 2 \exp\bigg\{ - \frac{(n-1) h^d\, C_9}{\|K\|_\infty \|f\|_\infty + C_9}\, b^2 \bigg\},$$

where $C_9$ is a positive constant not depending on $n$ or $b$. This completes the proof of (21).

Proof of Theorem 4. Part (i): Using (17) and the arguments that lead to (18), we find

$$P\{L(\hat{\pi}_n) - L^* > \delta\} \le I_n + II_n,$$

where $II_n$ is as in (18) and

$$I_n := P\bigg\{ \frac{1}{n} \sum_{i=1}^{n} \frac{\big| \hat{q}_n(X_i, Y_i) - q(X_i, Y_i) \big|}{\hat{q}_n(X_i, Y_i)\, q(X_i, Y_i)} > \frac{\delta - 2\epsilon}{4} \bigg\}.$$

But, by (19),

$$II_n \le 2\,\mathcal{N}(\epsilon, \mathcal{F})\, e^{-n(\delta - 2\epsilon)^2 q_0^2 / 8}.$$

To deal with the term $I_n$ first note that since $\hat{q}_n \ge q_0$, one finds

$$I_n \le P\bigg\{ \frac{1}{q_0^2} \cdot \frac{1}{n} \sum_{i=1}^{n} \big| \hat{q}_n(X_i, Y_i) - q(X_i, Y_i) \big| > \frac{\delta - 2\epsilon}{4} \bigg\}$$
$$\le P\bigg\{ \frac{1}{n} \sum_{i=1}^{n} \big| \hat{q}_n(X_i, Y_i) - q(X_i, Y_i) \big| - E\big[ |\hat{q}_n(X, Y) - q(X, Y)| \,\big|\, D_n \big] + E\big[ |\hat{q}_n(X, Y) - q(X, Y)| \,\big|\, D_n \big] > \frac{(\delta - 2\epsilon) q_0^2}{4} \bigg\}$$
$$\le P\bigg\{ \sup_{q' \in \mathcal{Q}} \Big| \frac{1}{n} \sum_{i=1}^{n} \big| q'(X_i, Y_i) - q(X_i, Y_i) \big| - E\big| q'(X, Y) - q(X, Y) \big| \Big| > \frac{(\delta - 2\epsilon) q_0^2}{8} \bigg\} + P\bigg\{ E\big[ |\hat{q}_n(X, Y) - q(X, Y)| \,\big|\, D_n \big] > \frac{(\delta - 2\epsilon) q_0^2}{8} \bigg\}$$
$$:= I_n(A) + I_n(B). \qquad (25)$$

Standard results from empirical process theory (see, for example, Pollard (1984)) yield

$$I_n(A) \le 8\, E\Big[ \mathcal{N}_1\Big( \frac{(\delta - 2\epsilon) q_0^2}{64}, \mathcal{Q}, (X_i, Y_i)_1^n \Big) \Big]\, \exp\Big\{ - \frac{n (\delta - 2\epsilon)^2 q_0^4}{(8^2)(128)} \Big\}.$$

As for the term $I_n(B)$, put

$$S_n(q) = \frac{1}{n} \sum_{i=1}^{n} \big[ \Delta_i - q(X_i, Y_i) \big]^2$$

and observe that

$$I_n(B) \le P\bigg\{ E\Big[ \big( \hat{q}_n(X, Y) - q(X, Y) \big)^2 \,\Big|\, D_n \Big] > \frac{(\delta - 2\epsilon)^2 q_0^4}{64} \bigg\} \qquad \text{(by the Cauchy-Schwarz inequality)}$$
$$= P\bigg\{ E\Big[ \big( \Delta - \hat{q}_n(X, Y) \big)^2 \,\Big|\, D_n \Big] - E\big( \Delta - q(X, Y) \big)^2 > \frac{(\delta - 2\epsilon)^2 q_0^4}{64} \bigg\}$$
$$\le P\bigg\{ 2 \sup_{q' \in \mathcal{Q}} \Big| S_n(q') - E\big( \Delta - q'(X, Y) \big)^2 \Big| > \frac{(\delta - 2\epsilon)^2 q_0^4}{64} \bigg\},$$

where the last line above follows from the following arguments:

$$E\Big[ \big( \Delta - \hat{q}_n(X, Y) \big)^2 \,\Big|\, D_n \Big] - E\big( \Delta - q(X, Y) \big)^2 = E\Big[ \big( \Delta - \hat{q}_n(X, Y) \big)^2 \,\Big|\, D_n \Big] - \inf_{q' \in \mathcal{Q}} E\big( \Delta - q'(X, Y) \big)^2$$
$$= \sup_{q' \in \mathcal{Q}} \Big\{ E\Big[ \big( \Delta - \hat{q}_n(X, Y) \big)^2 \,\Big|\, D_n \Big] - S_n(q') + S_n(q') - E\big( \Delta - q'(X, Y) \big)^2 \Big\}$$
$$\le \Big\{ E\Big[ \big( \Delta - \hat{q}_n(X, Y) \big)^2 \,\Big|\, D_n \Big] - S_n(\hat{q}_n) \Big\} + \sup_{q' \in \mathcal{Q}} \Big\{ S_n(q') - E\big( \Delta - q'(X, Y) \big)^2 \Big\}$$
$$\le 2 \sup_{q' \in \mathcal{Q}} \Big| S_n(q') - E\big( \Delta - q'(X, Y) \big)^2 \Big|,$$

and where we have used the fact that $S_n(\hat{q}_n) - S_n(q') \le 0$ for every $q' \in \mathcal{Q}$, by the definition of $\hat{q}_n$. Therefore

$$I_n(B) \le 8\, E\Big[ \mathcal{N}_1\Big( \frac{(\delta - 2\epsilon)^2 q_0^4}{1024}, \mathcal{Q}, (X_i, Y_i)_1^n \Big) \Big]\, e^{-C n (\delta - 2\epsilon)^4},$$

where $C > 0$ does not depend on $n$ or $\epsilon$. Part (ii) follows from the Borel-Cantelli lemma.

Acknowledgements. The authors would like to thank Professor Hamedani and the referees for the helpful comments.

References

[1] Bennett, G. (1962). Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 57, 33-45.

[2] Devroye, L., Györfi, L., and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Springer, New York.

[3] Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer, New York.

[4] Kolmogorov, A.N. and Tikhomirov, V.M. (1959). ε-entropy and ε-capacity of sets in function spaces. Uspekhi Matematicheskikh Nauk, 14, 3-86.

[5] Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis With Missing Data. Wiley, New York.

[6] Mojirsheibani, M., Montazeri, Z., and Rajaeefard, A. (2011). On classification with incomplete covariates. Statistics, 45, 427-450.

[7] Mojirsheibani, M. and Montazeri, Z. (2007). Statistical classification with missing covariates. Journal of the Royal Statistical Society, Ser. B, 69, 839-857.

[8] Pollard, D. (1984). Convergence of Stochastic Processes. Springer-Verlag, New York.

[9] van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag, New York.

[10] Willard, S. (2004). General Topology. Dover Publications.