Nonparametric estimation of conditional distributions


László Györfi 1 and Michael Kohler 2

1 Department of Computer Science and Information Theory, Budapest University of Technology and Economics, 1521 Stoczek U.2, Budapest, Hungary, email: gyorfi@szit.bme.hu
2 Fachrichtung 6.1-Mathematik, Universität des Saarlandes, Postfach 151150, D-66041 Saarbrücken, Germany, email: kohler@math.uni-sb.de

November 9, 2005

Abstract. Estimation of conditional distributions is considered. It is assumed that the conditional distribution is either discrete or that it has a density with respect to the Lebesgue-Borel measure. Partitioning estimates of the conditional distribution are constructed, and results concerning consistency and rate of convergence of the integrated total variation error of the estimates are presented.

Key words and phrases: conditional density, conditional distribution, confidence sets, partitioning estimate, Poisson regression, rate of convergence, universal consistency.

Running title: Estimation of conditional distributions

Please send correspondence and proofs to: Michael Kohler, Fachrichtung 6.1-Mathematik, Universität des Saarlandes, Postfach 151150, D-66041 Saarbrücken, Germany, phone +49-681-3022435, fax +49-681-3026583, e-mail: kohler@math.uni-sb.de

1 Introduction

One of the main tasks in statistics is to estimate a distribution from a given sample. Let $\mu$ be a probability distribution on $\mathbb{R}^d$ and let $X_1, X_2, \dots$ be independent and identically distributed random variables with distribution $\mu$. A simple but powerful estimate of $\mu$ is the empirical distribution

$$\mu_n(A) = \frac{1}{n} \sum_{i=1}^{n} I_A(X_i),$$

where $I_A$ denotes the indicator function of the set $A$. By the strong law of large numbers we have

$$\mu_n(A) \to \mu(A) \quad \text{a.s.} \tag{1}$$

for each Borel set $A$. If we want to make some statistical inference about $\mu$ it is not enough to have (1) for each set $A$ individually; instead we need convergence of $\mu_n$ to $\mu$ uniformly over classes of sets. By the Glivenko-Cantelli theorem the empirical distribution satisfies

$$\sup_{x \in \mathbb{R}^d} \left| \mu_n((-\infty, x]) - \mu((-\infty, x]) \right| \to 0 \quad \text{a.s.}, \tag{2}$$

where $(-\infty, x] = (-\infty, x^{(1)}] \times \dots \times (-\infty, x^{(d)}]$ for $x = (x^{(1)}, \dots, x^{(d)}) \in \mathbb{R}^d$. This is useful when we want to make some statistical inference about intervals, but for more general investigations it would be much nicer if we were able to control the error in total variation, defined as

$$\sup_{A \in \mathcal{B}^d} \left| \mu_n(A) - \mu(A) \right|, \tag{3}$$

where $\mathcal{B}^d$ are the Borel sets in $\mathbb{R}^d$. Clearly, for the empirical distribution the error (3) does not converge to zero in general, since if $\mu$ has a continuous distribution function we have $\mu(\{X_1, \dots, X_n\}) = 0$ and $\mu_n(\{X_1, \dots, X_n\}) = 1$. If we are able to construct estimates $\hat\mu_n$ of $\mu$ such that

$$\sup_{A \in \mathcal{B}^d} \left| \hat\mu_n(A) - \mu(A) \right| \to 0 \quad \text{a.s.}, \tag{4}$$
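The contrast between (2) and (3) is easy to see numerically in one dimension. The sketch below (our own illustration, not from the paper) takes $\mu = U[0,1]$; the helper name `ks_distance_uniform` is ours. It evaluates the Glivenko-Cantelli error (2) at the order statistics, where the supremum of the piecewise-constant empirical c.d.f. difference is attained, while the total variation error (3) of $\mu_n$ is identically 1 for this continuous $\mu$.

```python
import numpy as np

def ks_distance_uniform(sample):
    """sup_x |mu_n((-inf, x]) - mu((-inf, x])| for mu = U[0, 1] (d = 1).

    The empirical c.d.f. is piecewise constant, so the supremum is
    attained at the order statistics; it suffices to check those points.
    """
    x = np.sort(np.asarray(sample))
    n = len(x)
    ecdf_right = np.arange(1, n + 1) / n   # mu_n((-inf, x_(i)])
    ecdf_left = np.arange(0, n) / n        # left limit at x_(i)
    return max(np.max(np.abs(ecdf_right - x)),
               np.max(np.abs(x - ecdf_left)))

rng = np.random.default_rng(0)
for n in (100, 10_000):
    sample = rng.uniform(size=n)
    # (2): the Glivenko-Cantelli error shrinks as n grows ...
    print(n, ks_distance_uniform(sample))
    # ... while (3) stays maximal: A = {X_1, ..., X_n} is a Borel set
    # with mu_n(A) = 1 and mu(A) = 0, so the total variation error is 1.
```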

then it is easy to construct confidence sets $\hat A_n$ for the values of $X_1$ such that they have asymptotically level $\alpha$ for given $\alpha \in (0,1)$, i.e. such that

$$\liminf_{n\to\infty} \mu(\hat A_n) \ge 1 - \alpha \quad \text{a.s.}$$

Indeed, any set $\hat A_n$ with $\hat\mu_n(\hat A_n) \ge 1 - \alpha$ has this property, since

$$\mu(\hat A_n) = \hat\mu_n(\hat A_n) - \left( \hat\mu_n(\hat A_n) - \mu(\hat A_n) \right) \ge 1 - \alpha - \sup_{A \in \mathcal{B}^d} \left| \hat\mu_n(A) - \mu(A) \right|.$$

Unfortunately, as was shown in Devroye and Györfi (1990), it is impossible to construct estimates $\hat\mu_n$ such that (4) holds for all distributions $\mu$. However, it follows from Barron et al. (1992) that if we restrict ourselves to distributions whose nonatomic part is absolutely continuous with respect to a known dominating measure, it is possible to construct estimates such that (4) holds for all such distributions. Special cases include discrete measures (where we assume for notational convenience that $\mu(\mathbb{N}_0) = 1$) and measures which have a density with respect to the Lebesgue-Borel measure. By Scheffé's theorem it suffices in these cases to construct estimates $(\hat\mu_n(\{k\}))_{k \in \mathbb{N}_0}$ of $(\mu(\{k\}))_{k \in \mathbb{N}_0}$ and estimates $\hat f_n$ of the density $f$ of $\mu$, resp., which satisfy

$$\sum_{k=0}^{\infty} \left| \hat\mu_n(\{k\}) - \mu(\{k\}) \right| \to 0 \quad \text{a.s.} \tag{5}$$

and

$$\int \left| \hat f_n(x) - f(x) \right| \lambda(dx) \to 0 \quad \text{a.s.}, \tag{6}$$

where $\lambda$ denotes the Lebesgue-Borel measure. Here one estimates $\mu(A)$ by

$$\hat\mu_n(A) = \sum_{k \in \mathbb{N}_0 \cap A} \hat\mu_n(\{k\}) \quad \text{and} \quad \hat\mu_n(A) = \int_A \hat f_n(x)\,dx, \quad \text{resp.}$$
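The confidence-set construction above is completely explicit once a pmf estimate satisfying (5) is available: any set whose estimated mass is at least $1-\alpha$ inherits asymptotic level $1-\alpha$. A minimal sketch (the helper `confidence_set` and the example pmf are ours, not the paper's) greedily collects the largest estimated masses, which gives a smallest-cardinality such set:

```python
def confidence_set(pmf_hat, alpha):
    """Smallest set A (by greedy mass) with pmf_hat(A) >= 1 - alpha.

    pmf_hat: estimated probabilities of {0, 1, ..., len(pmf_hat) - 1}.
    By the argument in the text, mu(A) >= 1 - alpha - sup_B |mu_hat - mu|,
    so A has asymptotic level 1 - alpha whenever (4) (resp. (5)) holds.
    """
    order = sorted(range(len(pmf_hat)), key=lambda k: -pmf_hat[k])
    acc, members = 0.0, []
    for k in order:
        members.append(k)
        acc += pmf_hat[k]
        if acc >= 1 - alpha:
            break
    return set(members)

# Hypothetical estimated pmf on {0, ..., 9}
pmf = [0.4, 0.25, 0.15, 0.08, 0.05, 0.03, 0.02, 0.01, 0.007, 0.003]
A = confidence_set(pmf, alpha=0.1)   # collects mass until >= 0.9
```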

Many estimates which satisfy (6) universally for all densities are constructed in Devroye and Györfi (1985a).

In this paper we want to apply the above ideas in the regression context. Here we are given independent and identically distributed random vectors $(X,Y), (X_1,Y_1), \dots$ with values in $\mathbb{R}^d \times \mathbb{R}^{d'}$. Given the sample $D_n = \{(X_1,Y_1), \dots, (X_n,Y_n)\}$ of the distribution of $(X,Y)$, we want to construct estimates $\hat P_n\{A \mid x\}$ of the conditional distribution $P\{Y \in A \mid X = x\}$ of $Y$ given $X$ such that

$$\int \sup_{A \in \mathcal{B}^{d'}} \left| \hat P_n\{A \mid x\} - P\{Y \in A \mid X = x\} \right| \mu(dx) \to 0 \quad \text{a.s.}, \tag{7}$$

where $\mu$ denotes again the distribution of $X$. In contrast to standard regression, where $d' = 1$ and where only the mean $E\{Y \mid X = x\}$ of the conditional distribution is estimated (cf., e.g., Györfi et al. (2002)), we can use estimates with the property (7) not only for prediction of the value of $Y$ for a given value of $X$, but also to construct confidence regions for the value of $Y$ given the value of $X$. Indeed, similarly as above one gets that (7) implies that any set $C_n(x)$ with $\hat P_n\{C_n(x) \mid x\} \ge 1 - \alpha$ satisfies

$$\liminf_{n\to\infty} P\{Y \in C_n(X) \mid D_n\} \ge 1 - \alpha \quad \text{a.s.},$$

since we have, with $P^*\{\cdot\} = P\{\cdot \mid D_n\}$,

$$P\{Y \in C_n(X) \mid D_n\} = \int P^*\{Y \in C_n(x) \mid X = x\}\,\mu(dx)$$

$$\ge \int \hat P_n\{C_n(x) \mid x\}\,\mu(dx) - \int \left| \hat P_n\{C_n(x) \mid x\} - P^*\{Y \in C_n(x) \mid X = x\} \right| \mu(dx)$$

$$\ge 1 - \alpha - \int \sup_{A \in \mathcal{B}^{d'}} \left| \hat P_n(A \mid x) - P\{Y \in A \mid X = x\} \right| \mu(dx).$$

In order to construct estimates with the property (7), we consider two special cases: In the first case the conditional distribution of $Y$ given $X$ is discrete (and for notational convenience we assume again that the support is contained in $\mathbb{N}_0$). In the second case the conditional distribution of $Y$ given $X = x$ has a density $f(\cdot \mid x)$ with respect to the Lebesgue-Borel measure. In both cases Scheffé's theorem implies that in order to have (7) we have to construct estimates of $P\{Y = k \mid X = x\}$ and $f(\cdot \mid x)$ such that

$$\sum_{k=0}^{\infty} \int \left| \hat P_n\{k \mid x\} - P\{Y = k \mid X = x\} \right| \mu(dx) \to 0 \quad \text{a.s.} \tag{8}$$

and

$$\int \int \left| f_n(y \mid x) - f(y \mid x) \right| \lambda(dy)\,\mu(dx) \to 0 \quad \text{a.s.}, \tag{9}$$

resp.

In order to construct, in the first case, estimates with the property (8), we use two different approaches: In the first approach we consider for each $y \in \mathbb{N}_0$

$$P\{Y = y \mid X = x\} = E\{I_{\{Y=y\}} \mid X = x\}$$

as a regression function and estimate it by applying a partitioning estimate to a sample of $(X, I_{\{Y=y\}})$. In the second approach we consider Poisson regression, i.e., we make a parametric assumption on the way the conditional distribution of $Y$ given $X = x$ depends on $m(x)$ and assume that

$$P\{Y = y \mid X = x\} = \frac{m(x)^y}{y!}\,e^{-m(x)} \quad (y \in \mathbb{N}_0)$$

for some $m : \mathbb{R}^d \to (0, \infty)$, where $m$ is completely unknown. In this case we estimate $m(x) = E\{Y \mid X = x\}$ by a partitioning estimate $m_n(x)$ applied to a sample of $(X, Y)$, and consider the plug-in estimate

$$\hat P_n\{Y = y \mid X = x\} = \frac{m_n(x)^y}{y!}\,e^{-m_n(x)} \quad (y \in \mathbb{N}_0).$$

In both approaches we present results concerning universal consistency, i.e. we show (8) for all corresponding discrete conditional distributions, and we analyze the rate of convergence of the estimates.

Estimates of the conditional density in the second case are defined as suitable partitioning estimates. We present results concerning universal consistency, i.e., we show (9) for all conditional distributions with density, and we analyze the rate of convergence under regularity assumptions on the smoothness of the conditional density.

The paper is organized as follows: Our main results concerning estimation of discrete conditional distributions and conditional densities are described in Sections 2 and 3, resp. The proofs are given in Section 4.

2 The estimation of discrete conditional distributions

In this section we study partitioning estimates of discrete conditional distributions. In our first two theorems each conditional probability $P\{Y = y \mid X = x\}$ is estimated separately. We have the following result concerning consistency of the estimate.

Theorem 1 Let $\mathcal{P}_n = \{A_{n,j} : j\}$ be a partition of $\mathbb{R}^d$ and for $x \in \mathbb{R}^d$ denote by $A_n(x)$ that cell $A_{n,j}$ of $\mathcal{P}_n$ that contains $x$. Let

$$\hat P_n\{y \mid x\} = \frac{\sum_{i=1}^{n} I_{A_n(x)}(X_i)\, I_{\{Y_i = y\}}}{\sum_{j=1}^{n} I_{A_n(x)}(X_j)}$$

be the partitioning estimate of $P\{Y = y \mid X = x\}$. Assume that the underlying partition $\mathcal{P}_n = \{A_{n,j} : j\}$ satisfies for each sphere $S$ centered at the origin

$$\lim_{n\to\infty} \max_{j :\, A_{n,j} \cap S \ne \emptyset} \mathrm{diam}(A_{n,j}) = 0 \tag{10}$$

and

$$\lim_{n\to\infty} \frac{\left| \{ j : A_{n,j} \cap S \ne \emptyset \} \right|}{n} = 0, \tag{11}$$
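On a cubic partition of side length $h_n$, the estimate of Theorem 1 can be sketched as follows. This is our own vectorized helper (names are ours, not the paper's); cell membership is computed via `np.floor(x / h)`, so cells are half-open cubes $[kh, (k+1)h)$:

```python
import numpy as np

def partition_pmf_estimate(x_train, y_train, x, y, h):
    """Partitioning estimate P_hat_n{y | x} of Theorem 1 on a cubic
    partition with side length h.

    x_train: (n, d) array of covariates X_i
    y_train: (n,) array of labels Y_i in N_0
    x: query point (length-d sequence), y: label in N_0
    """
    cell = np.floor(np.asarray(x) / h)            # index of A_n(x)
    cells = np.floor(np.asarray(x_train) / h)     # cell indices of the X_i
    in_cell = np.all(cells == cell, axis=1)       # I_{A_n(x)}(X_i)
    denom = in_cell.sum()                         # n * mu_n(A_n(x))
    if denom == 0:
        return 0.0                                # empty cell: 0/0 := 0
    return float(np.sum(in_cell & (y_train == y))) / denom
```

The estimate is simply the empirical frequency of the label $y$ among those training points falling into the same cell as $x$, which is the regression estimate of $E\{I_{\{Y=y\}} \mid X=x\}$ described above.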

where $\mathrm{diam}(A)$ denotes the diameter of the set $A$. Then

$$\sum_{y=0}^{\infty} \int \left| \hat P_n\{y \mid x\} - P\{Y = y \mid X = x\} \right| \mu(dx) \to 0 \quad \text{a.s.}$$

Next we consider the rate of convergence of the above partitioning estimate. It is well known that in order to derive non-trivial rate of convergence results in nonparametric regression one needs smoothness assumptions on the underlying regression function (cf. Devroye (1982)). In our next result we assume that the conditional probabilities are locally Lipschitz continuous, such that the integral over the sum of the Lipschitz constants is finite.

Theorem 2 Assume $X$ is bounded a.s.,

$$\left| P\{Y = y \mid X = x\} - P\{Y = y \mid X = z\} \right| \le C_y(x)\,\|x - z\|$$

for all $x, z$ from the bounded support of $X$ and for some local Lipschitz constants $C_y(x)$ satisfying

$$\sum_{y=0}^{\infty} \int C_y(x)\,\mu(dx) = C < \infty,$$

and assume

$$\sum_{y=0}^{\infty} \sqrt{P\{Y = y\}} < \infty.$$

Let $\hat P_n\{y \mid x\}$ be the partitioning estimate of $P\{Y = y \mid X = x\}$ with respect to a partition of $\mathbb{R}^d$ consisting of cubes with side length $h_n$. Then

$$\sum_{y=0}^{\infty} E \int \left| \hat P_n\{y \mid x\} - P\{Y = y \mid X = x\} \right| \mu(dx) \le c_1\,\frac{1 + \sum_{y=0}^{\infty} \sqrt{P\{Y = y\}}}{\sqrt{n\,h_n^d}} + \sqrt{d}\,C\,h_n,$$

so for

$$h_n = c_2\, n^{-1/(d+2)}$$

we get

$$\sum_{y=0}^{\infty} E \int \left| \hat P_n\{y \mid x\} - P\{Y = y \mid X = x\} \right| \mu(dx) \le c_3\, n^{-1/(d+2)}.$$

In the next theorem we consider Poisson regression. Here the conditional distribution of $Y$ given $X$ is given by

$$P\{Y = y \mid X = x\} = \frac{m(x)^y}{y!}\,e^{-m(x)} \quad (y \in \mathbb{N}_0)$$

for some $m : \mathbb{R}^d \to (0, \infty)$. Because of $m(x) = E\{Y \mid X = x\}$ we can estimate it by applying a partitioning estimate to $D_n$ and use a plug-in estimate

$$\hat P_n\{y \mid x\} = \frac{m_n(x)^y}{y!}\,e^{-m_n(x)} \quad (y \in \mathbb{N}_0)$$

to estimate the conditional distribution of $Y$ given $X$. For this estimate we have the following result.

Theorem 3 Assume that $E\{Y\} < \infty$ and

$$P\{Y = y \mid X = x\} = \frac{m(x)^y}{y!}\,e^{-m(x)} \quad (y \in \mathbb{N}_0)$$

for some $m : \mathbb{R}^d \to (0, \infty)$. Let

$$m_n(x) = \begin{cases} \dfrac{\sum_{i=1}^{n} I_{A_n(x)}(X_i)\,Y_i}{\sum_{i=1}^{n} I_{A_n(x)}(X_i)} & \text{if } \sum_{i=1}^{n} I_{A_n(x)}(X_i) > \log n, \\[2mm] 0 & \text{otherwise,} \end{cases}$$

be the (modified) partitioning estimate of $m$ with partition $\mathcal{P}_n = \{A_{n,j} : j\}$ and set

$$\hat P_n\{y \mid x\} = \frac{m_n(x)^y}{y!}\,e^{-m_n(x)} \quad (y \in \mathbb{N}_0).$$

a) Assume that the underlying partition $\mathcal{P}_n$ satisfies (10) and for each sphere $S$ centered at the origin

$$\lim_{n\to\infty} \frac{\left| \{ j : A_{n,j} \cap S \ne \emptyset \} \right| \log n}{n} = 0. \tag{12}$$

Then

$$\sum_{y=0}^{\infty} \int \left| \hat P_n\{y \mid x\} - P\{Y = y \mid X = x\} \right| \mu(dx) \to 0 \quad \text{a.s.}$$
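The plug-in estimate of Theorem 3, including the $\log n$ truncation of the partitioning estimate $m_n$, can be sketched as follows (a minimal sketch with our own names; cubic cells via `np.floor` as before):

```python
import numpy as np
from math import exp, factorial, log

def poisson_plugin(x_train, y_train, x, y, h):
    """Plug-in estimate P_hat_n{y | x} = m_n(x)^y e^{-m_n(x)} / y! of
    Theorem 3 on a cubic partition with side length h.

    m_n(x) is the cell average of the Y_i, set to 0 when the cell of x
    contains at most log n training points (the modification in the text).
    """
    x_train = np.asarray(x_train)
    y_train = np.asarray(y_train)
    n = len(x_train)
    in_cell = np.all(np.floor(x_train / h) == np.floor(np.asarray(x) / h),
                     axis=1)
    if in_cell.sum() > log(n):
        m_n = float(y_train[in_cell].sum()) / in_cell.sum()
    else:
        m_n = 0.0
    return m_n ** y * exp(-m_n) / factorial(y)
```

Note that for an empty or underpopulated cell the estimate degenerates to the point mass at 0 ($m_n = 0$ gives $\hat P_n\{0 \mid x\} = 1$), which is harmless for the asymptotics because of condition (12).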

b) Assume $X$ is bounded a.s. and assume that $E\{Y^2\} < \infty$ and $m$ is Lipschitz continuous, i.e.

$$|m(x) - m(z)| \le C\,\|x - z\|$$

for some constant $C \in \mathbb{R}_+$. Choose the underlying partition such that it consists of cubes of side length $h_n$. Then

$$\sum_{y=0}^{\infty} E \int \left| \hat P_n\{y \mid x\} - P\{Y = y \mid X = x\} \right| \mu(dx) \le \frac{c_4}{\sqrt{n\,h_n^d}} + c_5\,h_n,$$

so for $h_n = c_6\, n^{-1/(d+2)}$ we get

$$\sum_{y=0}^{\infty} E \int \left| \hat P_n\{y \mid x\} - P\{Y = y \mid X = x\} \right| \mu(dx) \le c_7\, n^{-1/(d+2)}.$$

Remark 1. Assume that the assumptions of Theorem 3 b) hold, and let $B$ be a bound on the bounded (since Lipschitz continuous on a bounded support) regression function $m$. The function $f(u) = u^y e^{-u}/y!$ satisfies for $u \in [0, B]$

$$|f'(u)| = \left| \frac{y\,u^{y-1}}{y!}\,e^{-u} - \frac{u^y}{y!}\,e^{-u} \right| \le \frac{(B+1)\,B^{y-1}}{(y-1)!},$$

so by boundedness of $m$ we get for $y > 0$

$$\left| P\{Y = y \mid X = x\} - P\{Y = y \mid X = z\} \right| \le \frac{(B+1)\,B^{y-1}}{(y-1)!}\,C\,\|x - z\|.$$

This implies that the conditional probabilities are Lipschitz continuous and that the integral over the sum of the Lipschitz constants is bounded by

$$\left( 1 + \sum_{y=1}^{\infty} \frac{(B+1)\,B^{y-1}}{(y-1)!} \right) C = \left( 1 + (B+1)\,e^{B} \right) C,$$

hence under the assumptions of Theorem 3 b) the estimate in Theorem 2 achieves the same rate of convergence although it does not depend on the particular form of the conditional distribution.

Remark 2. Under more restrictive regularity assumptions on the underlying distribution, consistency of a localized log-likelihood Poisson regression estimate was shown in Kohler and Krzyżak (2005).

3 The estimation of conditional densities

In this section assume that $Y$ takes values in $\mathbb{R}^{d'}$. Our aim is to estimate the conditional distribution of $Y$ given $X$ consistently in total variation. We assume that $Y$ has an absolutely continuous distribution, and the conditional density of $Y$ given $X$ is denoted by $f(y \mid x)$. For estimating $f(y \mid x)$, introduce a histogram estimate. Let $\mathcal{Q}_n = \{B_{n,j} : j\}$ be a partition of $\mathbb{R}^{d'}$ such that the Lebesgue measure $\lambda$ of each cell is positive and finite. Let $B_n(y)$ be the cell of $\mathcal{Q}_n$ into which $y$ falls. As before, let $\mathcal{P}_n = \{A_{n,j} : j\}$ be a partition of $\mathbb{R}^d$ and denote the cell into which $x$ falls by $A_n(x)$. Put

$$\nu_n(A, B) = \frac{1}{n} \sum_{i=1}^{n} I_{\{X_i \in A,\, Y_i \in B\}},$$

then the histogram estimate is as follows:

$$f_n(y \mid x) = \frac{\nu_n(A_n(x), B_n(y))}{\mu_n(A_n(x))\,\lambda(B_n(y))}.$$

We will use the following conditions: assume that for each sphere $S$ centered at the origin we have

$$\lim_{n\to\infty} \max_{j :\, B_{n,j} \cap S \ne \emptyset} \mathrm{diam}(B_{n,j}) = 0 \tag{13}$$

and

$$\lim_{n\to\infty} \frac{\left| \{ j : B_{n,j} \cap S \ne \emptyset \} \right|}{n} = 0. \tag{14}$$

The next theorem extends the density-free strong consistency result of Abou-Jaoude (1976) to conditional density estimation.
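The histogram estimate $f_n(y \mid x)$ can be sketched directly from its definition. The helper below is ours (a minimal sketch assuming cubic partitions with side lengths $h$ for $x$ and $H$ for $y$, as in Theorem 5):

```python
import numpy as np

def cond_density_histogram(x_train, y_train, x, y, h, H):
    """Histogram estimate f_n(y|x) = nu_n(A_n(x), B_n(y)) /
    (mu_n(A_n(x)) * lambda(B_n(y))) on cubic partitions.

    x_train: (n, d) covariates, y_train: (n, d') responses,
    x: query covariate, y: query response, h/H: cell side lengths.
    """
    x_train = np.asarray(x_train)
    y_train = np.asarray(y_train)
    n = len(x_train)
    in_A = np.all(np.floor(x_train / h) == np.floor(np.asarray(x) / h),
                  axis=1)
    in_B = np.all(np.floor(y_train / H) == np.floor(np.asarray(y) / H),
                  axis=1)
    mu_n = in_A.sum() / n                     # mu_n(A_n(x))
    if mu_n == 0:
        return 0.0                            # empty x-cell: 0/0 := 0
    nu_n = np.sum(in_A & in_B) / n            # nu_n(A_n(x), B_n(y))
    d_prime = y_train.shape[1]
    return nu_n / (mu_n * H ** d_prime)       # lambda(B_n(y)) = H^{d'}
```

For fixed $x$ this is a histogram density estimate of the responses falling into the cell of $x$, so it integrates to one over $y$ whenever the cell of $x$ is nonempty.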

Theorem 4 Assume that the partitions $\mathcal{P}_n$ and $\mathcal{Q}_n$ satisfy (10), (11), (13) and (14), resp. Then

$$\int \int \left| f_n(y \mid x) - f(y \mid x) \right| \lambda(dy)\,\mu(dx) \to 0 \quad \text{a.s.}$$

Devroye and Györfi (1985a), and Beirlant and Györfi (1998) calculated the rate of convergence of the expected $L_1$ error of the histogram. Next we extend these results to the estimates of conditional densities.

Theorem 5 Assume $X$ and $Y$ are bounded a.s., and

$$\left| f(u \mid x) - f(y \mid x) \right| \le C_1(x)\,\|u - y\| \quad \text{and} \quad \left| f(y \mid z) - f(y \mid x) \right| \le C_2(y)\,\|x - z\|$$

for all $x, z$ from the bounded support of $X$ and for all $y, u$ from the bounded support of $Y$, such that

$$\int C_1(z)\,\mu(dz) < \infty \quad \text{and} \quad \int C_2(y)\,\lambda(dy) < \infty.$$

Let $f_n(y \mid x)$ be the histogram estimate of $f(y \mid x)$ with respect to partitions $\mathcal{P}_n$ and $\mathcal{Q}_n$ consisting of cubes with side lengths $h_n$ and $H_n$, resp. Then

$$E \int \int \left| f_n(y \mid x) - f(y \mid x) \right| \lambda(dy)\,\mu(dx) \le \frac{c_8}{\sqrt{n\,h_n^d}} + \frac{c_9}{\sqrt{n\,h_n^d\,H_n^{d'}}} + \sqrt{d}\,c_{10}\,h_n + \sqrt{d'}\,c_{11}\,H_n,$$

so for

$$h_n = c_{12}\, n^{-1/(d + d' + 2)} \quad \text{and} \quad H_n = c_{13}\, n^{-1/(d + d' + 2)}$$

we get

$$E \int \int \left| f_n(y \mid x) - f(y \mid x) \right| \lambda(dy)\,\mu(dx) \le c_{14}\, n^{-1/(d + d' + 2)}.$$

4 Proofs

4.1 Proof of Theorem 1

Using

$$|a - b| = 2\,(b - a)_+ + (a - b)$$

(where $x_+ = \max\{x, 0\}$) we get

$$\sum_{y=0}^{\infty} \int \left| \hat P_n\{y \mid x\} - P\{Y = y \mid X = x\} \right| \mu(dx) = 2 \sum_{y=0}^{\infty} \int \left( P\{Y = y \mid X = x\} - \hat P_n\{y \mid x\} \right)_+ \mu(dx) + \sum_{y=0}^{\infty} \int \hat P_n\{y \mid x\}\,\mu(dx) - \sum_{y=0}^{\infty} \int P\{Y = y \mid X = x\}\,\mu(dx).$$

Using the Cauchy-Schwarz inequality and Theorem 23.1 in Györfi et al. (2002) we get for each fixed $y \in \mathbb{N}_0$

$$\int \left( P\{Y = y \mid X = x\} - \hat P_n\{y \mid x\} \right)_+ \mu(dx) \le \int \left| \hat P_n\{y \mid x\} - P\{Y = y \mid X = x\} \right| \mu(dx) \le \sqrt{\int \left( \hat P_n\{y \mid x\} - P\{Y = y \mid X = x\} \right)^2 \mu(dx)} \to 0 \quad \text{a.s.},$$

which implies, together with the dominated convergence theorem, that the first term on the right-hand side above converges to zero. Concerning the second term we observe

$$\sum_{y=0}^{\infty} \int \hat P_n\{y \mid x\}\,\mu(dx) = \int \frac{\sum_{i=1}^{n} I_{A_n(x)}(X_i)}{\sum_{j=1}^{n} I_{A_n(x)}(X_j)}\, I_{\left\{ \sum_{j=1}^{n} I_{A_n(x)}(X_j) > 0 \right\}}\,\mu(dx) = 1 - \sum_{j} \mu(A_{n,j})\, I_{\{\mu_n(A_{n,j}) = 0\}},$$

while $\sum_{y=0}^{\infty} \int P\{Y = y \mid X = x\}\,\mu(dx) = 1$.

Together with (11), this implies

$$\sum_{j} \mu(A_{n,j})\, I_{\{\mu_n(A_{n,j}) = 0\}} \le \sum_{j} \left| \mu(A_{n,j}) - \mu_n(A_{n,j}) \right| \to 0 \quad \text{a.s.},$$

hence $\sum_{y=0}^{\infty} \int \hat P_n\{y \mid x\}\,\mu(dx) \to \sum_{y=0}^{\infty} \int P\{Y = y \mid X = x\}\,\mu(dx)$ a.s. (cf. Lemma 1 in Devroye and Györfi (1985b) or, with a better constant in the exponential upper bound, cf. the proof of Lemma 23.2 in Györfi et al. (2002)).

4.2 Proof of Theorem 2

In the sequel we use the notation

$$\nu_{y,n}(A) = \frac{1}{n} \sum_{i=1}^{n} I_{\{Y_i = y,\, X_i \in A\}},$$

and with this notation the partitioning estimate is given by

$$\hat P_n\{y \mid x\} = \frac{\nu_{y,n}(A_n(x))}{\mu_n(A_n(x))}.$$

Thus

$$\int \left| \hat P_n\{y \mid x\} - P\{Y = y \mid X = x\} \right| \mu(dx) = \int \left| \frac{\nu_{y,n}(A_n(x))}{\mu_n(A_n(x))} - P\{Y = y \mid X = x\} \right| \mu(dx)$$

$$\le \sum_{A \in \mathcal{P}_n} \int_A \left| \frac{\nu_{y,n}(A)}{\mu_n(A)} - \frac{\nu_{y,n}(A)}{\mu(A)} \right| \mu(dx) + \sum_{A \in \mathcal{P}_n} \int_A \left| \frac{\nu_{y,n}(A)}{\mu(A)} - \frac{P\{Y = y,\, X \in A\}}{\mu(A)} \right| \mu(dx) + \sum_{A \in \mathcal{P}_n} \int_A \left| \frac{P\{Y = y,\, X \in A\}}{\mu(A)} - P\{Y = y \mid X = x\} \right| \mu(dx).$$

Since the integrands of the first two terms are constant on each cell, this equals

$$\sum_{A \in \mathcal{P}_n} \nu_{y,n}(A) \left| \frac{1}{\mu_n(A)} - \frac{1}{\mu(A)} \right| \mu(A) + \sum_{A \in \mathcal{P}_n} \left| \nu_{y,n}(A) - P\{Y = y,\, X \in A\} \right| + \sum_{A \in \mathcal{P}_n} \int_A \left| \frac{P\{Y = y,\, X \in A\}}{\mu(A)} - P\{Y = y \mid X = x\} \right| \mu(dx)$$

$$\le \sum_{A \in \mathcal{P}_n} \left| \mu_n(A) - \mu(A) \right| + \sum_{A \in \mathcal{P}_n} \left| \nu_{y,n}(A) - P\{Y = y,\, X \in A\} \right| + \sum_{A \in \mathcal{P}_n} \int_A \left| \frac{P\{Y = y,\, X \in A\}}{\mu(A)} - P\{Y = y \mid X = x\} \right| \mu(dx),$$

where we have used for the last inequality that $\nu_{y,n}(A) \le \mu_n(A)$.

Since $n\,\mu_n(A)$ is binomially distributed with parameters $n$ and $\mu(A)$, we get by the Cauchy-Schwarz inequality

$$\sum_{A \in \mathcal{P}_n} E\{ |\mu_n(A) - \mu(A)| \} \le \sum_{A \in \mathcal{P}_n} \sqrt{E\{ (\mu_n(A) - \mu(A))^2 \}} \le \sum_{A \in \mathcal{P}_n} \sqrt{\frac{\mu(A)}{n}}.$$

By Jensen's inequality we have

$$\left( \frac{a_1 + \dots + a_l}{l} \right)^2 \le \frac{a_1^2 + \dots + a_l^2}{l},$$

which implies

$$a_1 + \dots + a_l \le \sqrt{l\,(a_1^2 + \dots + a_l^2)}.$$

Using this inequality in the sum above for the at most $c_{15}/h_n^d$ many cells $A \in \mathcal{P}_n$ contained in the bounded support of $X$ (which are the only ones with $\mu(A) \ne 0$), we conclude

$$\sum_{A \in \mathcal{P}_n} E\{ |\mu_n(A) - \mu(A)| \} \le \sqrt{\frac{c_{15}}{h_n^d} \sum_{A \in \mathcal{P}_n} \frac{\mu(A)}{n}} \le \sqrt{\frac{c_{15}}{h_n^d\, n}}.$$

Similarly we get

$$\sum_{A \in \mathcal{P}_n} E\{ |\nu_{y,n}(A) - P\{Y = y,\, X \in A\}| \} \le \sum_{A \in \mathcal{P}_n} \sqrt{E\{ (\nu_{y,n}(A) - P\{Y = y,\, X \in A\})^2 \}} \le \sum_{A \in \mathcal{P}_n} \sqrt{\frac{P\{Y = y,\, X \in A\}}{n}} \le \sqrt{\frac{c_{15}}{h_n^d} \sum_{A \in \mathcal{P}_n} \frac{P\{Y = y,\, X \in A\}}{n}} \le \sqrt{\frac{c_{15}}{h_n^d\, n}}\, \sqrt{P\{Y = y\}}.$$

Finally,

$$\sum_{A \in \mathcal{P}_n} \int_A \left| \frac{P\{Y = y,\, X \in A\}}{\mu(A)} - P\{Y = y \mid X = x\} \right| \mu(dx) = \sum_{A \in \mathcal{P}_n} \int_A \left| \frac{\int_A P\{Y = y \mid X = z\}\,\mu(dz)}{\mu(A)} - \frac{\int_A P\{Y = y \mid X = x\}\,\mu(dz)}{\mu(A)} \right| \mu(dx) \le \sum_{A \in \mathcal{P}_n} \int_A \frac{\int_A \left| P\{Y = y \mid X = z\} - P\{Y = y \mid X = x\} \right| \mu(dz)}{\mu(A)}\,\mu(dx)$$

$$\le \sum_{A \in \mathcal{P}_n} \int_A C_y(x)\,\mathrm{diam}(A)\,\frac{\mu(A)}{\mu(A)}\,\mu(dx) \le \sqrt{d}\,h_n \int C_y(x)\,\mu(dx).$$

Summarizing the above results and summing over $y$, the assertion follows.

4.3 Proof of Theorem 3

In the proof we will use the following lemma.

Lemma 1 For arbitrary $u, v \in \mathbb{R}_+$ we have

$$\sum_{j=0}^{\infty} \left| \frac{u^j}{j!}\,e^{-u} - \frac{v^j}{j!}\,e^{-v} \right| \le 2\,|u - v|.$$

Proof. W.l.o.g. assume $u < v$. Then

$$\sum_{j=0}^{\infty} \left| \frac{u^j}{j!}\,e^{-u} - \frac{v^j}{j!}\,e^{-v} \right| \le \sum_{j=0}^{\infty} \left| \frac{u^j}{j!}\,e^{-u} - \frac{u^j}{j!}\,e^{-v} \right| + \sum_{j=0}^{\infty} \left| \frac{u^j}{j!}\,e^{-v} - \frac{v^j}{j!}\,e^{-v} \right| = \left( \sum_{j=0}^{\infty} \frac{u^j}{j!} \right) \left( e^{-u} - e^{-v} \right) + \left( \sum_{j=0}^{\infty} \frac{v^j - u^j}{j!} \right) e^{-v} = e^{u}\left( e^{-u} - e^{-v} \right) + e^{-v}\left( e^{v} - e^{u} \right) = 2\left( 1 - e^{-(v-u)} \right) \le 2\,(v - u),$$

since $1 + x \le e^x$ ($x \in \mathbb{R}$).

Proof of Theorem 3. Proof of a): By Lemma 1 we get

$$\sum_{y=0}^{\infty} \int \left| \hat P_n\{y \mid x\} - P\{Y = y \mid X = x\} \right| \mu(dx) = \sum_{y=0}^{\infty} \int \left| \frac{m_n(x)^y}{y!}\,e^{-m_n(x)} - \frac{m(x)^y}{y!}\,e^{-m(x)} \right| \mu(dx)$$
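Lemma 1 is easy to check numerically. The sketch below (helper names are ours) computes truncated Poisson probability vectors iteratively, which avoids large factorials, and verifies the bound $2|u - v|$ on a few pairs:

```python
from math import exp

def poisson_pmf(lam, j_max):
    """First j_max probabilities of the Poisson(lam) distribution,
    computed by the recursion p_{j+1} = p_j * lam / (j + 1)."""
    p, out = exp(-lam), []
    for j in range(j_max):
        out.append(p)
        p *= lam / (j + 1)
    return out

def l1_distance(u, v, j_max=200):
    """sum_j |u^j e^{-u}/j! - v^j e^{-v}/j!|, truncated at j_max
    (ample for moderate u, v; truncation only undercounts)."""
    pu, pv = poisson_pmf(u, j_max), poisson_pmf(v, j_max)
    return sum(abs(a - b) for a, b in zip(pu, pv))

# Lemma 1: the L1 distance never exceeds 2|u - v|
for u, v in [(1.0, 1.5), (0.2, 3.0), (4.0, 4.001)]:
    assert l1_distance(u, v) <= 2 * abs(u - v) + 1e-9
```

This is exactly the inequality that reduces the total variation error of the plug-in estimate to the $L_1$ error of $m_n$ in (15) below.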

$$\le 2 \int \left| m_n(x) - m(x) \right| \mu(dx) \tag{15}$$

$$\to 0 \quad \text{a.s.}$$

by Györfi (1991) (see also Theorem 23.3 in Györfi et al. (2002)).

Proof of b): Using (15),

$$\sum_{y=0}^{\infty} E \int \left| \hat P_n\{y \mid x\} - P\{Y = y \mid X = x\} \right| \mu(dx) \le 2\,E\left\{ \sqrt{\int \left| m_n(x) - m(x) \right|^2 \mu(dx)} \right\} \le \frac{c_4}{\sqrt{n\,h_n^d}} + c_5\,h_n,$$

where the last step can be done in a similar way as the proof of Theorem 4.3 in Györfi et al. (2002).

4.4 Proof of Theorem 4

Introduce the notation

$$\nu(A, B) = E\{\nu_n(A, B)\} = P\{X \in A,\, Y \in B\},$$

then

$$\int \int \left| f_n(y \mid x) - f(y \mid x) \right| \lambda(dy)\,\mu(dx) = \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \left| \frac{\nu_n(A, B)}{\mu_n(A)\,\lambda(B)} - f(y \mid x) \right| \lambda(dy)\,\mu(dx)$$

$$\le \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \left| \frac{\nu_n(A, B)}{\mu_n(A)\,\lambda(B)} - \frac{\nu_n(A, B)}{\mu(A)\,\lambda(B)} \right| \lambda(dy)\,\mu(dx) + \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \left| \frac{\nu_n(A, B)}{\mu(A)\,\lambda(B)} - \frac{\nu(A, B)}{\mu(A)\,\lambda(B)} \right| \lambda(dy)\,\mu(dx) + \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \left| \frac{\nu(A, B)}{\mu(A)\,\lambda(B)} - f(y \mid x) \right| \lambda(dy)\,\mu(dx)$$

$$= \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \left| \frac{\nu_n(A, B)}{\mu_n(A)\,\lambda(B)} - \frac{\nu_n(A, B)}{\mu(A)\,\lambda(B)} \right| \mu(A)\,\lambda(B)$$

$$+ \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \left| \frac{\nu_n(A, B)}{\mu(A)\,\lambda(B)} - \frac{\nu(A, B)}{\mu(A)\,\lambda(B)} \right| \mu(A)\,\lambda(B) + \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \left| \frac{\nu(A, B)}{\mu(A)\,\lambda(B)} - f(y \mid x) \right| \lambda(dy)\,\mu(dx),$$

therefore

$$\int \int \left| f_n(y \mid x) - f(y \mid x) \right| \lambda(dy)\,\mu(dx) \le \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \nu_n(A, B) \left| \frac{1}{\mu_n(A)} - \frac{1}{\mu(A)} \right| \mu(A) + \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \left| \nu_n(A, B) - \nu(A, B) \right| + \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \left| \frac{\nu(A, B)}{\mu(A)\,\lambda(B)} - f(y \mid x) \right| \lambda(dy)\,\mu(dx)$$

$$\le \sum_{A \in \mathcal{P}_n} \left| \mu_n(A) - \mu(A) \right| \tag{16}$$

$$+ \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \left| \nu_n(A, B) - \nu(A, B) \right| \tag{17}$$

$$+ \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \left| \frac{\nu(A, B)}{\mu(A)\,\lambda(B)} - f(y \mid x) \right| \lambda(dy)\,\mu(dx), \tag{18}$$

where we have used for the last inequality that $\sum_{B \in \mathcal{Q}_n} \nu_n(A, B) = \mu_n(A)$.

Because of (11), (16) tends to 0 a.s., while (11) and (14) imply that (17) tends to 0 a.s. (cf. Lemma 1 in Devroye and Györfi (1985b)). Concerning the convergence of the bias term (18), introduce the notation

$$\bar f_n(y \mid x) = \frac{\int_{A_n(x)} \int_{B_n(y)} f(u \mid z)\,\lambda(du)\,\mu(dz)}{\mu(A_n(x))\,\lambda(B_n(y))},$$

then

$$\sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \left| \frac{\nu(A, B)}{\mu(A)\,\lambda(B)} - f(y \mid x) \right| \lambda(dy)\,\mu(dx) = \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \left| \frac{\int_A \int_B f(u \mid z)\,\lambda(du)\,\mu(dz)}{\mu(A)\,\lambda(B)} - f(y \mid x) \right| \lambda(dy)\,\mu(dx)$$

$$= \int \int \left| \bar f_n(y \mid x) - f(y \mid x) \right| \lambda(dy)\,\mu(dx) \to 0,$$

because of the conditions (10) and (13). This convergence is obvious if $f(y \mid x)$ is continuous and has compact support. In general, we use that $f(y \mid x) \in L_1(\mu \times \lambda)$ and refer to the denseness result that the set of continuous functions with compact support is dense in $L_1(\mu \times \lambda)$ (cf., e.g., Devroye and Györfi (2002)). An alternative technique would be the Lebesgue density theorem (cf., e.g., Lemma 24.5 in Györfi et al. (2002)), which gives a pointwise convergence; together with the Scheffé theorem and the dominated convergence theorem we are done.

4.5 Proof of Theorem 5

Because of the proof of Theorem 4,

$$E\left\{ \int \int \left| f_n(y \mid x) - f(y \mid x) \right| \lambda(dy)\,\mu(dx) \right\} \le \sum_{A \in \mathcal{P}_n} E\{ |\mu_n(A) - \mu(A)| \} + \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} E\{ |\nu_n(A, B) - \nu(A, B)| \} + \int \int \left| \frac{\nu(A_n(x), B_n(y))}{\mu(A_n(x))\,\lambda(B_n(y))} - f(y \mid x) \right| \lambda(dy)\,\mu(dx).$$

According to the proof of Theorem 2, the condition that $X$ is bounded implies that

$$\sum_{A \in \mathcal{P}_n} E\{ |\mu_n(A) - \mu(A)| \} \le \sqrt{\frac{c_{15}}{h_n^d\, n}},$$

and, similarly, using that $X$ and $Y$ are bounded we can show

$$\sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} E\{ |\nu_n(A, B) - \nu(A, B)| \} \le \sqrt{\frac{c_{16}}{h_n^d\, H_n^{d'}\, n}}.$$

Concerning the rate of convergence of the bias term we observe

$$\int \int \left| \frac{\nu(A_n(x), B_n(y))}{\mu(A_n(x))\,\lambda(B_n(y))} - f(y \mid x) \right| \lambda(dy)\,\mu(dx)$$

$$= \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \left| \frac{\nu(A, B)}{\mu(A)\,\lambda(B)} - f(y \mid x) \right| \lambda(dy)\,\mu(dx) = \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \left| \frac{\int_A \int_B f(u \mid z)\,\lambda(du)\,\mu(dz)}{\mu(A)\,\lambda(B)} - f(y \mid x) \right| \lambda(dy)\,\mu(dx)$$

$$\le \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \frac{\int_A \int_B \left| f(u \mid z) - f(y \mid z) \right| \lambda(du)\,\mu(dz)}{\mu(A)\,\lambda(B)}\,\lambda(dy)\,\mu(dx) + \sum_{A \in \mathcal{P}_n} \sum_{B \in \mathcal{Q}_n} \int_A \int_B \frac{\int_A \int_B \left| f(y \mid z) - f(y \mid x) \right| \lambda(du)\,\mu(dz)}{\mu(A)\,\lambda(B)}\,\lambda(dy)\,\mu(dx).$$

Applying the conditions of the theorem we get

$$\int \int \left| \frac{\nu(A_n(x), B_n(y))}{\mu(A_n(x))\,\lambda(B_n(y))} - f(y \mid x) \right| \mu(dx)\,\lambda(dy) \le \int C_1(z)\,\mu(dz)\,\lambda(S_Y)\,\sqrt{d'}\,H_n + \int C_2(y)\,\lambda(dy)\,\sqrt{d}\,h_n,$$

where $S_Y$ is the bounded support of $Y$.

References

[1] Abou-Jaoude, S. (1976). Conditions nécessaires et suffisantes de convergence $L_1$ en probabilité de l'histogramme pour une densité. Annales de l'Institut Henri Poincaré, XII, pp. 213-231.

[2] Barron, A. R., Györfi, L. and van der Meulen, E. C. (1992). Distribution estimation consistent in total variation and in two types of information divergence. IEEE Trans. Information Theory 38, pp. 1437-1454.

[3] Beirlant, J. and Györfi, L. (1998). On the $L_1$ error in histogram density estimation: the multidimensional case. Nonparametric Statistics 9, pp. 197-216.

[4] Devroye, L. (1982). Any discrimination rule can have an arbitrarily bad probability of error for finite sample size. IEEE Transactions on Pattern Analysis and Machine Intelligence 4, pp. 154-157.

[5] Devroye, L. and Györfi, L. (1985a). Nonparametric Density Estimation: The $L_1$ View. John Wiley, New York.

[6] Devroye, L. and Györfi, L. (1985b). Distribution-free exponential bound for the $L_1$ error of partitioning estimates of a regression function. In Probability and Statistical Decision Theory, F. Konecny, J. Mogyoródi, W. Wertz, Eds., D. Reidel, pp. 67-76.

[7] Devroye, L. and Györfi, L. (1990). No empirical measure can converge in the total variation sense for all distributions. Annals of Statistics 18, pp. 1496-1499.

[8] Devroye, L. and Györfi, L. (2002). Distribution and density estimation. In Principles of Nonparametric Learning, L. Györfi (Ed.), Springer-Verlag, Wien, pp. 223-286.

[9] Györfi, L. (1991). Universal consistency of a regression estimate for unbounded regression functions. In Nonparametric Functional Estimation and Related Topics (ed. G. Roussas), pp. 329-338, NATO ASI Series, Kluwer Academic Publishers, Dordrecht.

[10] Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Springer Series in Statistics, Springer.

[11] Kohler, M. and Krzyżak, A. (2005). Asymptotic confidence intervals for Poisson regression. Submitted for publication.