Brno-06, Lecture 2, 16.05.06
D/Stat/Brno-06/2.tex
www.mast.queensu.ca/~blevit/

Applications of the van Trees inequality to non-parametric estimation.

Regular non-parametric problems. As an example of such problems we consider the asymptotic optimality of the empirical cumulative distribution function. Let $X_1, \dots, X_n$ be i.i.d. with c.d.f. $F(x)$ and p.d.f. $f(x)$. A typical non-parametric problem, and a classical one, is to estimate the cdf $F(x)$, for any given $x$. It is called a regular non-parametric problem, since in many ways it is similar to estimating an unknown parameter in a regular parametric model. In such problems a) the rate of convergence is typically $1/\sqrt{n}$ and b) there exist unbiased or asymptotically unbiased estimators. Here $F(x)$ itself can be seen as the unknown parameter of interest, although it is now a functional parameter, or an infinite-dimensional parameter.

The ecdf. With the indicator function
\[
\mathbf{1}(x) = \begin{cases} 1 & x \ge 0, \\ 0 & x < 0, \end{cases}
\]
we have
\[
\hat F_n(x) = \frac{1}{n}\,\#\{X_i : X_i \le x\} = \frac{1}{n}\sum_{i=1}^n \mathbf{1}(x - X_i),
\qquad
\mathbf{1}(x - X_i) = \begin{cases} 1 & \text{with probability } F(x), \\ 0 & \text{with probability } 1 - F(x). \end{cases}
\]
Therefore
\[
E_F^X \bigl(\hat F_n(x) - F(x)\bigr)^2 = \frac{F(x)(1 - F(x))}{n},
\]
or equivalently $n\,E_F^X(\hat F_n(x) - F(x))^2 = F(x)(1 - F(x))$.

In the case of a real unknown parameter $\theta$, we used intervals $(a, b)$ to introduce the notion of locally asymptotically minimax estimators. What plays the role of such intervals in a non-parametric setting? We need a notion of open vicinities $V$ in our case. The easiest way to achieve this is to use a metric, or a distance function, on the set of all distributions. This can be done in many different ways. We choose the classical distance in variation metric
\[
\varrho(F_1, F_2) = \int |f_1(x) - f_2(x)|\,dx.
\]
The measure of closeness thus defined satisfies all the axioms of a distance:
positivity: $\varrho(F_1, F_2) \ge 0$;
symmetry: $\varrho(F_1, F_2) = \varrho(F_2, F_1)$;
the triangle inequality: $\varrho(F_1, F_2) \le \varrho(F_1, F_3) + \varrho(F_3, F_2)$.
A set $V$ is called open if for any $F_0 \in V$ there exists $\varepsilon > 0$ such that $\{F : \varrho(F, F_0) < \varepsilon\} \subset V$. Now for any subset $V$
\[
\sup_{F \in V} n\,E_F^X \bigl(\hat F_n(x) - F(x)\bigr)^2 = \sup_{F \in V} F(x)(1 - F(x)). \tag{1}
\]
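As a quick numerical illustration of the last identity, the following Python sketch estimates $n\,E_F^X(\hat F_n(x) - F(x))^2$ by Monte Carlo and compares it with $F(x)(1 - F(x))$; the uniform distribution, the sample size and the number of replications are arbitrary illustrative choices.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def ecdf(sample, x):
    """Empirical cdf: the fraction of observations not exceeding x."""
    return np.mean(sample <= x)

n, x, reps = 200, 0.5, 20_000
F_x = 0.5                          # U(0,1) cdf at x = 0.5
sq_err = [(ecdf(rng.uniform(0, 1, n), x) - F_x) ** 2 for _ in range(reps)]
print("n * MSE      :", n * np.mean(sq_err))   # approx 0.25
print("F(x)(1-F(x)) :", F_x * (1 - F_x))       # exactly 0.25
\end{verbatim}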
Theorem 1. For any open subset $V$ and any sequence of estimators $F_n(x)$,
\[
\limsup_{n\to\infty}\,\sup_{F \in V} n\,E_F^X \bigl(F_n(x) - F(x)\bigr)^2 \ge \sup_{F \in V} F(x)(1 - F(x)). \tag{2}
\]

Corollary. According to (1)-(2), the empirical cdf $\hat F_n(x)$ is locally asymptotically minimax.

Proof of Theorem 1. Let $V$ be an arbitrary vicinity and let $F_0(x)$ be an arbitrary element of $V$. Denote the corresponding pdf by $f_0(x)$. Let us introduce a one-dimensional parametric sub-family of distributions $F(x\,|\,\theta)$ as follows:
\[
f(y\,|\,\theta) = f_0(y)\bigl(1 + \theta(\mathbf{1}(x - y) - F_0(x))\bigr), \qquad |\theta| < 1/2. \tag{3}
\]
We mention some properties of this family.

1. $f(\cdot\,|\,\theta)$ is a probability density for all $|\theta| < 1/2$. Indeed, since
\[
|\theta(\mathbf{1}(x - y) - F_0(x))| \le |\theta|\,|\mathbf{1}(x - y) - F_0(x)| \le \tfrac12\bigl(\mathbf{1}(x - y) + F_0(x)\bigr) \le \tfrac12(1 + 1) = 1,
\]
we have $1 + \theta(\mathbf{1}(x - y) - F_0(x)) \ge 1 - 1 = 0$. Moreover,
\[
\int f(y\,|\,\theta)\,dy = \int f_0(y)\,dy + \theta \int \bigl(\mathbf{1}(x - y) - F_0(x)\bigr) f_0(y)\,dy = 1 + \theta\bigl(F_0(x) - F_0(x)\bigr) = 1.
\]

2. $\varrho(F_\theta, F_0) = \int |f(y\,|\,\theta) - f_0(y)|\,dy = \int f_0(y)\,|\theta|\,|\mathbf{1}(x - y) - F_0(x)|\,dy \le |\theta| \int 2 f_0(y)\,dy = 2|\theta|$.

Corollary. There is $\delta > 0$ such that $F_\theta \in V$ for all $|\theta| < \delta$.

3. $\psi(\theta) := F(x\,|\,\theta)$ is a continuously differentiable function and $\psi'(\theta) = F_0(x)(1 - F_0(x))$. Indeed,
\[
\psi(\theta) = \int \mathbf{1}(x - y) f(y\,|\,\theta)\,dy = \int \mathbf{1}(x - y) f_0(y)\bigl(1 + \theta(\mathbf{1}(x - y) - F_0(x))\bigr)\,dy
= F_0(x) + \theta \int \bigl(\mathbf{1}(x - y) - F_0(x)\bigr)\mathbf{1}(x - y) f_0(y)\,dy,
\]
and since $\int(\mathbf{1}(x - y) - F_0(x)) f_0(y)\,dy = 0$, the last integral equals $\int(\mathbf{1}(x - y) - F_0(x))^2 f_0(y)\,dy$, so that
\[
\psi(\theta) = F_0(x) + \theta\,F_0(x)(1 - F_0(x)).
\]

4. $I(\theta)$ is continuously differentiable and $I(0) = F_0(x)(1 - F_0(x))$. Indeed,
\[
\log f(y\,|\,\theta) = \log f_0(y) + \log\bigl(1 + \theta(\mathbf{1}(x - y) - F_0(x))\bigr),
\]
thus
\[
\frac{\partial}{\partial\theta} \log f(y\,|\,\theta) = \frac{\mathbf{1}(x - y) - F_0(x)}{1 + \theta(\mathbf{1}(x - y) - F_0(x))},
\]
\[
I(\theta) = \int \Bigl(\frac{\partial}{\partial\theta} \log f(y\,|\,\theta)\Bigr)^2 f(y\,|\,\theta)\,dy
= \int \frac{(\mathbf{1}(x - y) - F_0(x))^2}{1 + \theta(\mathbf{1}(x - y) - F_0(x))}\, f_0(y)\,dy.
\]
It is easy to see now that $I(\theta)$ is continuous and
\[
I(0) = \int \bigl(\mathbf{1}(x - y) - F_0(x)\bigr)^2 f_0(y)\,dy = \operatorname{Var}_{X_1}\mathbf{1}(x - X_1) = F_0(x)(1 - F_0(x)).
\]
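The following sketch verifies properties 1, 3 and 4 numerically, for an illustrative choice of $f_0$ (standard normal) and $x = 0.3$: it checks the total mass, the linear form of $\psi(\theta)$, and the value of $I(0)$.

\begin{verbatim}
import numpy as np
from scipy import stats

x, theta = 0.3, 0.4                      # point of interest, |theta| < 1/2
f0, F0 = stats.norm.pdf, stats.norm.cdf  # f_0 taken to be standard normal
y = np.linspace(-10.0, 10.0, 200_001)    # integration grid
dy = y[1] - y[0]
integrate = lambda v: np.sum(v) * dy     # simple Riemann sum

# Tilted density f(y|theta) = f0(y) (1 + theta (1(x - y) - F0(x)))
f_t = f0(y) * (1 + theta * ((y <= x).astype(float) - F0(x)))

print("total mass :", integrate(f_t))    # should be ~ 1
psi = integrate((y <= x) * f_t)          # psi(theta) = F(x|theta)
print("psi(theta) :", psi, "vs", F0(x) + theta * F0(x) * (1 - F0(x)))
score0 = (y <= x) - F0(x)                # d/dtheta log f(y|theta) at theta = 0
print("I(0)       :", integrate(score0**2 * f0(y)), "vs", F0(x) * (1 - F0(x)))
\end{verbatim}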
Combining these properties with an asymptotic version of the van Trees inequality, we obtain for any estimator $F_n(x)$ of $F(x)$ (using that $F_\theta \in V$ for $|\theta| < \delta$):
\[
\limsup_{n\to\infty}\,\sup_{F \in V} n\,E_F^X \bigl(F_n(x) - F(x)\bigr)^2
\ge \limsup_{n\to\infty}\,\sup_{|\theta| \le \delta} n\,E_\theta^X \bigl(F_n(x) - F(x\,|\,\theta)\bigr)^2
\]
\[
=: \limsup_{n\to\infty}\,\sup_{|\theta| \le \delta} n\,E_\theta^X \bigl(\psi_n(x) - \psi(\theta)\bigr)^2
\ge \frac{(\psi'(0))^2}{I(0)} = \frac{\bigl(F_0(x)(1 - F_0(x))\bigr)^2}{F_0(x)(1 - F_0(x))} = F_0(x)(1 - F_0(x)).
\]
Since $F_0$ is an arbitrary element of $V$, we also obtain
\[
\limsup_{n\to\infty}\,\sup_{F \in V} n\,E_F^X \bigl(F_n(x) - F(x)\bigr)^2 \ge \sup_{F \in V} F(x)(1 - F(x)).
\]

Remark. How was the form of the family (3) determined? Let us look at an arbitrary sub-family
\[
f(y\,|\,\theta) = f_0(y)\bigl(1 + \theta h(y)\bigr), \tag{3$'$}
\]
where $h(y)$ is any bounded function such that $\int f_0(y) h(y)\,dy = 0$. Then (3$'$) determines a family of distributions, for all sufficiently small $\theta$. Similarly to the above calculations we then find that
\[
\psi'(0) = \int \bigl(\mathbf{1}(x - y) - F_0(x)\bigr) h(y) f_0(y)\,dy \qquad\text{and}\qquad I(0) = \int h^2(y) f_0(y)\,dy.
\]
Therefore, the resulting lower bound would be
\[
\frac{(\psi'(0))^2}{I(0)} = \frac{\Bigl(\int (\mathbf{1}(x - y) - F_0(x)) h(y) f_0(y)\,dy\Bigr)^2}{\int h^2(y) f_0(y)\,dy} \to \max.
\]
Now, by the Cauchy-Schwarz inequality,
\[
\Bigl(\int \bigl(\mathbf{1}(x - y) - F_0(x)\bigr) f_0(y) h(y)\,dy\Bigr)^2 \le \int h^2(y) f_0(y)\,dy \int \bigl(\mathbf{1}(x - y) - F_0(x)\bigr)^2 f_0(y)\,dy.
\]
Thus the best possible lower bound is the one we have obtained before, and it can be achieved if and only if $h(y) = \mathrm{const}\,(\mathbf{1}(x - y) - F_0(x))$. That is exactly the choice we have used, the choice of the constant being insignificant. Thus the family used in the proof is the hardest parametric subfamily. Accordingly, this method of proving asymptotic optimality is called the method of the hardest parametric subfamily.
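This maximization is easy to check numerically. The sketch below (again with the illustrative choices $f_0$ standard normal and $x = 0.3$) evaluates $(\psi'(0))^2 / I(0)$ for the hardest choice of $h$ and for two arbitrary competitors; the hardest choice attains the maximum $F_0(x)(1 - F_0(x))$.

\begin{verbatim}
import numpy as np
from scipy import stats

x = 0.3
f0, F0 = stats.norm.pdf, stats.norm.cdf
y = np.linspace(-10.0, 10.0, 200_001)
dy = y[1] - y[0]
integrate = lambda v: np.sum(v) * dy

def bound(h_y):
    """(psi'(0))^2 / I(0) for the sub-family f0(y)(1 + theta h(y))."""
    h_y = h_y - integrate(h_y * f0(y))          # center so that E_0 h(X) = 0
    psi_dot = integrate(((y <= x) - F0(x)) * h_y * f0(y))
    return psi_dot**2 / integrate(h_y**2 * f0(y))

hardest = (y <= x) - F0(x)                      # the choice used in the proof
for name, h_y in [("hardest", hardest),
                  ("sin(y)", np.sin(y)),
                  ("clipped y", np.clip(y, -1, 1))]:
    print(f"{name:10s} bound = {bound(h_y):.4f}")
print("maximum    =", F0(x) * (1 - F0(x)))      # attained by the hardest choice
\end{verbatim}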
Singular non-parametric problems. As a typical example of such problems, we will consider kernel density estimation, in the setting
\[
X = (X_1, \dots, X_n) \ \text{i.i.d.} \sim f(x).
\]
The classical kernel density estimators are
\[
f_n(x) = \frac{1}{nh} \sum_{i=1}^n K\Bigl(\frac{x - X_i}{h}\Bigr).
\]
The function $K(x)$ is called the kernel function. The parameter $h$ is called the bandwidth: $h = h_n \to 0$. How good is this estimate? Does it converge to the true density? It is clear that as $n \to \infty$, $h$ should go to $0$. We will assume that $K$ is an arbitrary symmetric function, $K(x) \equiv K(-x)$, and that
\[
\int K(x)\,dx = 1, \qquad |K(x)| \le c < \infty, \qquad K(x) = 0 \ \text{if}\ |x| \ge 1 \ \text{(bounded support)},
\]
and that $\int K(x) x^2\,dx \le c$, $\int K^2(x)\,dx \le c$. Note that since $K(x)$ is symmetric, $\int x K(x)\,dx = 0$. We will also assume that
\[
|f(x)| \le C, \qquad |f'(x)| \le C, \qquad |f''(x)| \le C.
\]
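A minimal implementation sketch of this estimator follows; the Epanechnikov kernel used below is one convenient choice satisfying all of the above assumptions (symmetric, supported on $[-1,1]$, integrating to $1$, bounded, with $\int z^2 K(z)\,dz = 0.2$ and $\int K^2(z)\,dz = 0.6$), and the sample and evaluation points are illustrative.

\begin{verbatim}
import numpy as np

def epanechnikov(u):
    """A symmetric kernel with support [-1, 1] integrating to 1."""
    return 0.75 * (1 - u**2) * (np.abs(u) <= 1)

def kde(sample, x, h):
    """Kernel density estimate f_n(x) = (1/(n h)) sum_i K((x - X_i)/h)."""
    x = np.atleast_1d(x)
    u = (x[:, None] - sample[None, :]) / h
    return epanechnikov(u).mean(axis=1) / h

rng = np.random.default_rng(1)
n = 1000
sample = rng.normal(size=n)
h = n ** (-1 / 5)                   # bandwidth of the rate-optimal order
xs = np.array([-1.0, 0.0, 1.0])
print("estimate:", kde(sample, xs, h))
print("true    :", np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi))
\end{verbatim}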
The variance-bias decomposition of the mean square error:
\[
E\bigl(f_n(x) - f(x)\bigr)^2 = \operatorname{Var}(f_n(x)) + \bigl(E f_n(x) - f(x)\bigr)^2,
\]
where
\[
E f_n(x) = \frac{1}{h}\,E\,K\Bigl(\frac{x - X_1}{h}\Bigr), \qquad
\operatorname{Var}(f_n(x)) = \frac{1}{n h^2}\operatorname{Var} K\Bigl(\frac{x - X_1}{h}\Bigr) \le \frac{1}{n h^2}\,E\,K^2\Bigl(\frac{x - X_1}{h}\Bigr).
\]
Let us analyze these expectations using the Taylor expansion. Since $K$ is symmetric,
\[
\frac{1}{h}\,E\,K\Bigl(\frac{x - X_1}{h}\Bigr) = \frac{1}{h}\,E\,K\Bigl(\frac{X_1 - x}{h}\Bigr)
= \frac{1}{h} \int K\Bigl(\frac{y - x}{h}\Bigr) f(y)\,dy
= \int K(z)\Bigl(f(x) + f'(x) h z + \tfrac12 f''(x + \vartheta h z)(h z)^2\Bigr)\,dz
\]
\[
= f(x) + \tfrac12 h^2 \int K(z) f''(x + \vartheta h z) z^2\,dz = f(x) + O(h^2). \tag{4}
\]
Thus $E f_n(x) = f(x) + O(h^2)$. Similarly,
\[
\frac{1}{h}\,E\,K^2\Bigl(\frac{x - X_1}{h}\Bigr) = \frac{1}{h} \int K^2\Bigl(\frac{y - x}{h}\Bigr) f(y)\,dy
= \int K^2(z) f(x + h z)\,dz = f(x) \int K^2(z)\,dz + O(h^2) \le C. \tag{5}
\]
Therefore
\[
\operatorname{Var}(f_n(x)) \le \frac{1}{n h^2}\,E\,K^2\Bigl(\frac{x - X_1}{h}\Bigr) \le \frac{C}{n h}.
\]
Combining these estimates together shows that
\[
E\bigl(f_n(x) - f(x)\bigr)^2 \le \frac{C}{n h} + C h^4 \le C\Bigl(\frac{1}{n h} + h^4\Bigr). \tag{6}
\]
If we choose $h$ too small, then the variance term becomes large (under-smoothing). If we choose $h$ large, then the bias becomes large (over-smoothing). So we have to strike a balance, by minimizing the r.h.s. If we choose a bandwidth $h$ such that
\[
\frac{1}{n h} = h^4, \qquad h^5 = \frac{1}{n}, \qquad h = \frac{1}{n^{1/5}},
\]
then
\[
E\bigl(f_n(x) - f(x)\bigr)^2 \le C\Bigl(\frac{1}{n h} + h^4\Bigr) = C\Bigl(\frac{1}{n^{4/5}} + \frac{1}{n^{4/5}}\Bigr) = \frac{2C}{n^{4/5}}.
\]
If we minimize the r.h.s. of (6) w.r.t. $h$, we will obtain the same rate, with a somewhat better constant. Note however that even with the best balance possible, the bias and variance terms are of the same order. In other words, the resulting estimator is asymptotically biased! Note also that the resulting rate of convergence is slower than in the case of the cdf, and this clearly shows in simulations! Let
\[
\mathcal F = \Bigl\{f(x) : \sup_x |f(x)| \le C,\ \sup_x |f'(x)| \le C,\ \sup_x |f''(x)| \le C\Bigr\}.
\]

Theorem 2. There exist constants $0 < d < D < \infty$ and an estimator $\hat f_n(x)$ such that for any real $x$
\[
\sup_{f \in \mathcal F} E_f \bigl(\hat f_n(x) - f(x)\bigr)^2 \le \frac{D}{n^{4/5}}.
\]
On the other hand, for any estimator $f_n(x)$ of $f(x)$ and any real $x$
\[
\sup_{f \in \mathcal F} E_f \bigl(f_n(x) - f(x)\bigr)^2 \ge \frac{d}{n^{4/5}}.
\]
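Before turning to the proof of the lower bound, here is a small Monte Carlo sketch of the balance in (6); the standard normal data, the point $x = 0$ and the listed bandwidths are all illustrative choices. Both very small and very large $h$ inflate the error, while $h \approx n^{-1/5}$ is near the minimum.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
K = lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)   # Epanechnikov kernel

def mse_at_point(n, h, x=0.0, reps=2000):
    """Monte Carlo estimate of E(f_n(x) - f(x))^2 for N(0,1) data."""
    f_x = 1 / np.sqrt(2 * np.pi)                     # true density at x = 0
    errs = np.empty(reps)
    for r in range(reps):
        s = rng.normal(size=n)
        errs[r] = (K((x - s) / h).mean() / h - f_x) ** 2
    return errs.mean()

n = 500
for h in [0.02, 0.1, n ** (-1 / 5), 1.0, 3.0]:
    print(f"h = {h:5.3f}   MSE = {mse_at_point(n, h):.5f}")
\end{verbatim}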
Proof of the lower bound. Let $x$ be fixed. Choose $f_0 \in \mathcal F$ such that $f_0(x) > 0$ and for some $\varepsilon > 0$
\[
\sup_y |f_0(y)| \le C - \varepsilon, \qquad \sup_y |f_0'(y)| \le C - \varepsilon, \qquad \sup_y |f_0''(y)| \le C - \varepsilon. \tag{7}
\]
Also choose a symmetric density $K(x)$ satisfying all the properties assumed above and additionally
\[
K(0) > 0, \qquad |K'(y)| \le c, \qquad |K''(y)| \le c.
\]
Let us choose a one-dimensional parametric sub-family $f(x\,|\,\theta)$:
\[
f(y\,|\,\theta) = f_0(y)\Bigl(1 + \theta\Bigl(K\Bigl(\frac{x - y}{h}\Bigr) - K_h\Bigr)\Bigr), \qquad
K_h := \int K\Bigl(\frac{x - y}{h}\Bigr) f_0(y)\,dy,
\]
where according to (4) the centering constant satisfies $K_h = h\bigl(f_0(x) + O(h^2)\bigr) = O(h)$, so that $\int f(y\,|\,\theta)\,dy = 1$. Here $h = h_n \to 0$ will be chosen later. Note that since we want to estimate $f(x)$, our target functional is
\[
\psi(\theta) := f(x\,|\,\theta) = f_0(x)\Bigl(1 + \theta\bigl(K(0) - K_h\bigr)\Bigr).
\]
Let us note the following properties of the parametric family $f(y\,|\,\theta)$.

1. If $\delta$ is a sufficiently small number, then $f(y\,|\,\theta) \in \mathcal F$ for all $|\theta| < \delta h^2$. Indeed, the most severe restriction on $\theta$ stems from the requirement
\[
|f_{yy}(y\,|\,\theta)| \le C. \tag{8}
\]
Since
\[
f_{yy}(y\,|\,\theta) = f_0''(y) + \theta\Bigl(\frac{1}{h^2} K''\Bigl(\frac{y - x}{h}\Bigr) f_0(y) + \frac{2}{h} K'\Bigl(\frac{y - x}{h}\Bigr) f_0'(y) + \Bigl(K\Bigl(\frac{y - x}{h}\Bigr) - K_h\Bigr) f_0''(y)\Bigr),
\]
the required property (8) easily follows from (7) whenever $|\theta| < \delta h^2$ with $\delta$ small enough. Note that only a very small piece of the above family fits into the family $\mathcal F$. The size of this piece will feature prominently in the resulting lower bound.

2. According to (4), for all sufficiently small $h$,
\[
\psi'(\theta) = f_0(x)\bigl(K(0) - K_h\bigr) = f_0(x)\bigl(K(0) + O(h)\bigr) \ge K(0) f_0(x)/2 > 0.
\]

3. Barring the technicalities, we have
\[
\log f(y\,|\,\theta) = \log f_0(y) + \log\Bigl(1 + \theta\Bigl(K\Bigl(\frac{x - y}{h}\Bigr) - K_h\Bigr)\Bigr)
\approx \log f_0(y) + \theta\Bigl(K\Bigl(\frac{x - y}{h}\Bigr) - K_h\Bigr)
\]
and according to (4)-(5)
\[
I(\theta) = \int \Bigl(\frac{\partial}{\partial\theta} \log f(y\,|\,\theta)\Bigr)^2 f(y\,|\,\theta)\,dy
\approx \int \Bigl(K\Bigl(\frac{x - y}{h}\Bigr) - K_h\Bigr)^2 f_0(y)\,dy
\le \int K^2\Bigl(\frac{x - y}{h}\Bigr) f_0(y)\,dy
= h\Bigl(f_0(x) \int K^2(z)\,dz + O(h^2)\Bigr) \le c_1 h.
\]
Combining all the above properties of the thus chosen family $f(y\,|\,\theta)$ with the van Trees lower bound, we obtain for an arbitrary estimator $f_n(x)$
\[
\sup_{f \in \mathcal F} E_f^X \bigl(f_n(x) - f(x)\bigr)^2
\ge \sup_{|\theta| \le \delta h^2} E_\theta^X \bigl(f_n(x) - f(x\,|\,\theta)\bigr)^2
= \sup_{|\theta| \le \delta h^2} E_\theta^X \bigl(\psi_n - \psi(\theta)\bigr)^2
\]
\[
\ge \frac{\Bigl(\int \psi'(\theta) \lambda(\theta)\,d\theta\Bigr)^2}{n \int I(\theta) \lambda(\theta)\,d\theta + \dfrac{\pi^2}{\delta^2 h^4}}
\ge \frac{f_0^2(x)\bigl(K(0)/2\bigr)^2}{c_1 n h + \dfrac{\pi^2}{\delta^2 h^4}}
\ge \frac{f_0^2(x)\bigl(K(0)/2\bigr)^2}{c_2\Bigl(n h + \dfrac{1}{h^4}\Bigr)}
\]
(here $\lambda$ is a prior density supported on $[-\delta h^2, \delta h^2]$ whose Fisher information equals $\pi^2/(\delta^2 h^4)$). Here again we have to balance the two terms appearing in the denominator, a problem which is similar to, but slightly different from, (6). We can simply choose $h$ such that
\[
n h = \frac{1}{h^4}, \qquad h^5 = \frac{1}{n}, \qquad h = \frac{1}{n^{1/5}}, \qquad n h + \frac{1}{h^4} = 2 n^{4/5}.
\]
Thus it follows that for some $d > 0$
\[
\sup_{f \in \mathcal F} E_f^X \bigl(f_n(x) - f(x)\bigr)^2 \ge \frac{d}{2 n^{4/5}}.
\]
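As a closing sanity check, the sketch below (standard normal $f_0$, Epanechnikov kernel, $x = 0$; all illustrative choices) confirms numerically that the perturbed family integrates to one, that $K_h / h \approx f_0(x)$, and that $I(0)/h \approx f_0(x) \int K^2(z)\,dz$, in line with properties 1-3 above.

\begin{verbatim}
import numpy as np

f0 = lambda y: np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density
K = lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)     # Epanechnikov kernel

x, n = 0.0, 10_000
h = n ** (-1 / 5)
y = np.linspace(-10.0, 10.0, 400_001)
dy = y[1] - y[0]
integrate = lambda v: np.sum(v) * dy

K_h = integrate(K((x - y) / h) * f0(y))     # centering constant
theta = 0.05 * h**2                         # inside the admissible range |theta| < delta h^2
f_t = f0(y) * (1 + theta * (K((x - y) / h) - K_h))

print("K_h / h   :", K_h / h, "vs f0(x) =", f0(x))     # = f0(x) + O(h^2)
print("total mass:", integrate(f_t))                   # = 1
I0 = integrate((K((x - y) / h) - K_h) ** 2 * f0(y))    # Fisher information at theta = 0
print("I(0) / h  :", I0 / h, "vs f0(x) * int K^2 =", 0.6 * f0(x))
\end{verbatim}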