Notes On Nonparametric Density Estimation

James L. Powell
Department of Economics
University of California, Berkeley


Univariate Density Estimation via Numerical Derivatives

Consider the problem of estimating the density function $f(x)$ of a scalar, continuously-distributed i.i.d. sequence $x_i$ at a particular point $x$. If the density $f$ is in a known parametric family (e.g., Gaussian), estimation of the density reduces to estimation of the finite-dimensional parameters that characterize that particular density in the parametric family. Without a parametric assumption, though, estimation of the density $f$ over all points in its support would involve estimation of an infinite number of parameters, known in statistics as a nonparametric estimation problem (though "infinite-parametric estimation" might be a more accurate title).

Since the density function $f(x)$ is the derivative of the cumulative distribution function $F(x) \equiv \Pr\{x_i \le x\}$, and since the empirical cdf
$$\hat F(x) \equiv \frac{1}{n}\sum_{i=1}^n 1\{x_i \le x\}$$
is the natural nonparametric estimator of the cdf, it seems natural to base estimation of $f$ on the empirical cdf. However, while $\hat F$ is $\sqrt{n}$-consistent and asymptotically normal, it would be clearly nonsensical to estimate $f$ by differentiating $\hat F$, since its derivative is either zero or undefined. Using the definition of the density $f$ as the (right-) derivative of the cdf,
$$f(x) = \lim_{h \downarrow 0} \frac{F(x+h) - F(x)}{h},$$
we might estimate the density $f$ by a corresponding difference ratio of $\hat F$:
$$\hat f(x) \equiv \frac{\hat F(x+h) - \hat F(x)}{h} = \frac{1}{nh}\sum_{i=1}^n 1\{x < x_i \le x+h\},$$
where the perturbation $h$ (also known as a "bandwidth" or "window width") is positive but small, depending upon the sample size $n$ (i.e., $h = h_n$). To show MSE consistency of $\hat f$, it suffices to choose the bandwidth sequence so that the mean bias and variance of $\hat f$ both tend to zero as the sample size increases. Since the empirical cdf $\hat F$ is an unbiased estimator of $F$ (that is, $E[\hat F(x)] = F(x)$), the bias of $\hat f$ is evidently
$$E[\hat f(x)] - f(x) = \frac{F(x+h) - F(x)}{h} - f(x) \to 0$$
if $h \to 0$ as $n \to \infty$.
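As a concrete illustration, here is a minimal Python sketch of the numerical-derivative estimator just defined; the helper name `fhat_numderiv` and the Monte Carlo setup are ours, not from the notes.

```python
import numpy as np

def fhat_numderiv(x, data, h):
    """Numerical-derivative density estimate:
    (Fhat(x + h) - Fhat(x)) / h = #{x < x_i <= x + h} / (n h)."""
    data = np.asarray(data)
    return np.mean((data > x) & (data <= x + h)) / h

# Illustrative check against the standard normal density at x = 0.
rng = np.random.default_rng(0)
sample = rng.standard_normal(10_000)
print(fhat_numderiv(0.0, sample, h=0.2))   # near 1/sqrt(2*pi) = 0.399
```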

The variance of $\hat f(x)$ is
$$V(\hat f(x)) = V\!\left(\frac{1}{nh}\sum_{i=1}^n 1\{x < x_i \le x+h\}\right) = \frac{1}{nh^2}\,V\big(1\{x < x_i \le x+h\}\big),$$
which will tend to zero if $nh \to \infty$, since
$$\frac{1}{nh^2}\,V\big(1\{x < x_i \le x+h\}\big) = \frac{1}{nh^2}\,\big[F(x+h) - F(x)\big]\big[1 - (F(x+h) - F(x))\big] = \frac{f(x)}{nh} + O\!\left(\frac{1}{n}\right).$$
Thus, if the bandwidth sequence $h = h_n$ tends to zero as $n$ tends to infinity, but more slowly than $1/n$ (so that $nh \to \infty$), the MSE of $\hat f$ will converge to zero, ensuring its (weak) consistency.

To narrow down the choice of bandwidth sequence, we might want to choose it to maximize the rate of convergence of the MSE of $\hat f$ to zero. To do this, we need an explicit expression for its bias as a function of $h$. Assuming $F(x+h)$ is smooth enough to admit a second-order Taylor series expansion around $h = 0$,
$$F(x+h) = F(x) + f(x)h + \frac{f'(x)}{2}h^2 + o(h^2),$$
the MSE of $\hat f$ can be expressed as
$$\mathrm{MSE}(\hat f(x); h) = \big[E(\hat f(x)) - f(x)\big]^2 + V(\hat f(x)) = \left(\frac{f'(x)}{2}\right)^2 h^2 + o(h^2) + \frac{f(x)}{nh} + O\!\left(\frac{1}{n}\right).$$
Because the squared bias is directly related to $h$, while the variance is inversely related to $h$, the fastest convergence of the MSE to zero occurs when the squared bias and variance converge to zero at the same speed. (If they converge at different speeds, the slower speed dominates.) Thus the optimal bandwidth sequence $h^*$ will satisfy
$$O\big((h^*)^2\big) = O\!\left(\frac{1}{nh^*}\right),$$
so
$$h^* = O\big(n^{-1/3}\big)$$
will give the fastest rate of convergence of the MSE to zero,
$$\mathrm{MSE}(\hat f(x); h^*) = O\big(n^{-2/3}\big).$$
This rate is slower than the usual parametric rate of convergence of the MSE (if it exists) to zero, which is $O(n^{-1})$ for, e.g., the sample mean.
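A quick Monte Carlo check of the $n^{-2/3}$ rate, as a hedged sketch (the sample sizes, seed, and replication count are arbitrary choices of ours):

```python
import numpy as np

def fhat_numderiv(x, data, h):
    # (Fhat(x + h) - Fhat(x)) / h, as defined above
    return np.mean((data > x) & (data <= x + h)) / h

rng = np.random.default_rng(1)
x0, f0 = 0.0, 1 / np.sqrt(2 * np.pi)       # true N(0,1) density at 0

for n in (500, 5_000, 50_000):
    h = n ** (-1 / 3)                      # bandwidth at the rate derived above
    est = np.array([fhat_numderiv(x0, rng.standard_normal(n), h)
                    for _ in range(200)])
    mse = np.mean((est - f0) ** 2)
    print(n, mse, mse * n ** (2 / 3))      # last column roughly constant in n
```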

Although the optimal rate of convergence of $h$ to zero does not depend upon $f$ itself, the optimal level would require knowledge of $f(x)$. Assuming the bandwidth sequence is of the form $h = c\,n^{-1/3}$, we can choose $c$ to minimize the limiting normalized MSE
$$\lim_{n\to\infty} n^{2/3}\,\mathrm{MSE}(\hat f(x); h) = \left(\frac{f'(x)}{2}\right)^2 c^2 + \frac{f(x)}{c},$$
which is minimized in $c$ at
$$c^* = \left(\frac{2f(x)}{(f'(x))^2}\right)^{1/3}.$$
So the optimal bandwidth sequence
$$h^* = \left(\frac{2f(x)}{(f'(x))^2}\right)^{1/3} n^{-1/3}$$
that minimizes the leading terms in the MSE formula is infeasible, since it requires knowledge of the density $f(x)$ itself and its first derivative. Though it can be shown (under additional conditions) that replacing the constant $c^*$ with a consistent estimator will not affect the rate of convergence of $\hat f$, nonparametric estimation of $f'(x)$ cannot be based upon $\hat f(x)$ directly (since its derivative is zero wherever it is defined), but would require a similar numerical-derivative approach, which would entail its own bandwidth choice problem. An alternative is to standardize the choice of the constant term $c$ for some known parametric density function; for example, if $f(x) = \frac{1}{\sigma}\phi\!\left(\frac{x-\mu}{\sigma}\right)$, where $\phi$ is the standard normal density, then
$$c^* = \left(\frac{2}{z^2\,\phi(z)}\right)^{1/3}\sigma, \qquad z \equiv \frac{x-\mu}{\sigma},$$
and estimation of the optimal constant would reduce to estimation of the standard deviation $\sigma$ of the $x_i$ distribution.

Instead of choosing the bandwidth for a specific value of $x$, we might want to choose it to minimize a global MSE criterion over all possible $x$ values, such as the integrated mean-squared error criterion
$$\mathrm{IMSE}(\hat f; h) \equiv \int \mathrm{MSE}(\hat f(x); h)\,dx = \left[\int \left(\frac{f'(x)}{2}\right)^2 dx\right] h^2 + \frac{1}{nh} + o(h^2) + O\!\left(\frac{1}{n}\right).$$
The same calculations as for the pointwise optimal bandwidth yield the optimal IMSE bandwidth $h^*_+$ to be
$$h^*_+ = \left(\frac{2}{\int (f'(x))^2\,dx}\right)^{1/3} n^{-1/3},$$
which still depends upon the derivative of $f$. (A similar expression holds for the bandwidth minimizing an average MSE criterion, where $dx$ is replaced by $f(x)\,dx$ in the formula for the IMSE.)
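The normal-reference pointwise constant $c^*$ is simple to compute; a hedged sketch (the function name is ours, and the formula is the one just derived, undefined at $z = 0$ where $f'(x) = 0$):

```python
import numpy as np

def normal_reference_c(x, mu, sigma):
    """c* = sigma * (2 / (z^2 * phi(z)))**(1/3), z = (x - mu)/sigma,
    the pointwise-optimal constant when f is N(mu, sigma^2)."""
    z = (x - mu) / sigma
    phi = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return sigma * (2 / (z ** 2 * phi)) ** (1 / 3)

# e.g., one standard deviation away from the mean:
print(normal_reference_c(1.0, 0.0, 1.0))   # approx. 2.02
```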

For the Gaussian density $f(x) = \frac{1}{\sigma}\phi\!\left(\frac{x-\mu}{\sigma}\right)$, the optimal IMSE bandwidth would be
$$h^*_+ = \left(\frac{2}{\sigma^{-3}\int (z\phi(z))^2\,dz}\right)^{1/3} n^{-1/3} = (8\sqrt{\pi})^{1/3}\,\sigma\,n^{-1/3} \approx 2.42\,\sigma\,n^{-1/3},$$
so standardizing the bandwidth choice to be optimal for normal densities reduces the bandwidth choice problem to estimation of the standard deviation $\sigma$.

Multivariate Kernel Density Estimation

The numerical-derivative estimator of the univariate density $f(x)$ above is a special case of a general class of nonparametric density estimators called kernel density estimators. Now supposing $x_i \in \mathbb{R}^p$, we can think of smoothing out the empirical cdf $\hat F(x)$ by replacing it with a convolution of $\hat F$ and the distribution of $h\varepsilon$, where $h$ is a small positive bandwidth as above and $\varepsilon$ is an independent, continuously-distributed noise term with a known density function $K(\varepsilon)$ (also known as the kernel function). For a particular realized value of $x_i$, the density function for $X_i = x_i + h\varepsilon$ would be
$$f_{X_i}(x) = \frac{1}{h^p}\,K\!\left(\frac{x - x_i}{h}\right)$$
by the usual change-of-variables formula for multivariate densities. (Here the absolute value of the determinant of the Jacobian $\partial\varepsilon/\partial x_i'$ is $h^{-p}$.) Thus the kernel density estimator $\hat f(x)$ is the average of $f_{X_i}$ over the observed values of $x_i$ in the sample,
$$\hat f(x) \equiv \frac{1}{n}\sum_{i=1}^n \frac{1}{h^p}\,K\!\left(\frac{x - x_i}{h}\right).$$
As long as $\int K(u)\,du = 1$, the density estimator $\hat f(x)$ will also integrate to one over $x$, as befits a density function. The numerical-derivative estimator discussed above is a special case with $p = 1$ and $K(u) = 1\{-1 \le u < 0\}$, the density function for a Uniform$(-1, 0)$ random variable.

The MSE calculations are straightforward extensions of those for the numerical-derivative estimator. The expectation of $\hat f$ is
$$E[\hat f(x)] = \frac{1}{n}\sum_{i=1}^n E\!\left[\frac{1}{h^p}K\!\left(\frac{x - x_i}{h}\right)\right] = E\!\left[\frac{1}{h^p}K\!\left(\frac{x - x_i}{h}\right)\right] = \int \frac{1}{h^p}K\!\left(\frac{x - z}{h}\right) f(z)\,dz.$$
Making the change of variables $u = u(z) = (x - z)/h$ (with $du = h^{-p}\,dz$),
$$E[\hat f(x)] = \int K(u)\,f(x - hu)\,du,$$
which clearly tends to $f(x)$ as $h \to 0$ if $f$ is continuous and bounded above (by dominated convergence), as long as $\int K(u)\,du = 1$. (The last condition implies $\int |K(u)|\,du < \infty$, a condition that will be more relevant later, when the nonnegativity restriction $K(u) \ge 0$ is relaxed.)
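A minimal sketch of this estimator, taking $K$ to be the standard $p$-variate normal density (one plausible kernel choice; nothing in the notes prescribes it, and the bandwidth rate used below is the one derived in the next passage):

```python
import numpy as np

def kde(x, data, h):
    """fhat(x) = (1/(n h^p)) * sum_i K((x - x_i)/h),
    with K the standard p-variate normal density."""
    data = np.atleast_2d(data)           # n x p
    n, p = data.shape
    u = (np.asarray(x) - data) / h       # n x p
    k = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (p / 2)
    return k.sum() / (n * h ** p)

rng = np.random.default_rng(2)
data = rng.standard_normal((2_000, 2))   # p = 2
h = 2_000 ** (-1 / (2 + 4))              # rate n^(-1/(p+4)), see below
print(kde(np.zeros(2), data, h))         # roughly 1/(2*pi) = 0.159
```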

Assuming that $f(x)$ is smooth enough for $f(x - hu)$ to admit a second-order Taylor expansion around $h = 0$, i.e.,
$$f(x - hu) = f(x) - h\,\frac{\partial f(x)}{\partial x'}u + \frac{h^2}{2}\,u'\frac{\partial^2 f(x)}{\partial x\,\partial x'}u + o(h^2) = f(x) - h\,\frac{\partial f(x)}{\partial x'}u + \frac{h^2}{2}\,\mathrm{tr}\!\left(\frac{\partial^2 f(x)}{\partial x\,\partial x'}\,uu'\right) + o(h^2),$$
the bias of $\hat f$ can be expressed as
$$E[\hat f(x)] - f(x) = -h\,\frac{\partial f(x)}{\partial x'}\int u\,K(u)\,du + \frac{h^2}{2}\,\mathrm{tr}\!\left(\frac{\partial^2 f(x)}{\partial x\,\partial x'}\int uu'\,K(u)\,du\right) + o(h^2).$$
If the mean of the kernel, $\int u\,K(u)\,du$, is nonzero (as with the one-sided numerical-derivative estimator above), then the bias is $O(h)$; however, if the kernel function $K(u)$ is chosen to be symmetric about zero, or, more generally, if
$$\int u\,K(u)\,du = 0, \qquad (*)$$
then the bias is
$$E[\hat f(x)] - f(x) = O(h^2).$$
Similarly, the variance of $\hat f(x)$ can be calculated as
$$\begin{aligned}
\mathrm{Var}(\hat f(x)) &= \frac{1}{n}\,\mathrm{Var}\!\left(\frac{1}{h^p}K\!\left(\frac{x - x_i}{h}\right)\right) \\
&= \frac{1}{n}\,E\!\left[\frac{1}{h^{2p}}K\!\left(\frac{x - x_i}{h}\right)^2\right] - \frac{1}{n}\big(E[\hat f(x)]\big)^2 \\
&= \frac{1}{n}\int \frac{1}{h^{2p}}K\!\left(\frac{x - z}{h}\right)^2 f(z)\,dz - \frac{1}{n}\big(E[\hat f(x)]\big)^2 \\
&= \frac{1}{nh^p}\int [K(u)]^2\,f(x - hu)\,du - \frac{1}{n}\big(E[\hat f(x)]\big)^2 \\
&= \frac{f(x)}{nh^p}\int [K(u)]^2\,du + o\!\left(\frac{1}{nh^p}\right).
\end{aligned}$$
(The first equality exploits the fact that the $x_i$ are i.i.d., and the fourth makes the same change of variables as for the bias formula.) So the bias of $\hat f(x)$ is $O(h^2)$ under condition (*), and its variance is $O((nh^p)^{-1})$; the optimal bandwidth sequence $h^*$, which equates the rates of convergence of the squared bias and variance to zero, thus satisfies
$$O\big((h^*)^4\big) = O\!\left(\frac{1}{n(h^*)^p}\right),$$
so
$$h^* = O\big(n^{-1/(p+4)}\big).$$
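The kernel moments driving these bias and variance expressions are easy to verify numerically; a hedged sketch for the univariate standard normal kernel, using SciPy quadrature (a tool of ours, not referenced in the notes):

```python
import numpy as np
from scipy.integrate import quad

phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

total, _ = quad(phi, -np.inf, np.inf)                          # = 1
mean, _ = quad(lambda u: u * phi(u), -np.inf, np.inf)          # = 0: condition (*)
second, _ = quad(lambda u: u ** 2 * phi(u), -np.inf, np.inf)   # = 1: bias constant
rough, _ = quad(lambda u: phi(u) ** 2, -np.inf, np.inf)        # = 1/(2*sqrt(pi)): variance constant

print(total, mean, second, rough)
```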

The MSE evaluated at $h^*$ is then
$$\mathrm{MSE}(\hat f(x); h^*) = O\big(n^{-4/(p+4)}\big).$$
Note that, as $p$ increases (i.e., as the number of components in $x_i$, and thus the number of arguments of $f(x)$, increases), the best rate of convergence of the MSE declines, a phenomenon referenced by the catchphrase "the curse of dimensionality."

The Asymptotic Distribution of the Kernel Density Estimator

The kernel density estimator $\hat f(x)$ can be rewritten as a sample average of independent, identically-distributed random variables,
$$\hat f(x) = \frac{1}{n}\sum_{i=1}^n z_{in}, \qquad \text{where } z_{in} \equiv \frac{1}{h^p}K\!\left(\frac{x - x_i}{h}\right).$$
Here the second subscript $n$ in $z_{in}$ denotes its dependence on the sample size $n$ through the bandwidth term $h = h_n$. Such doubly-subscripted random variables, where the range of the first subscript is bounded above by the second, are known as "triangular arrays," and classical limit theorems must be modified to account for the changing distribution of the observations as the sample size increases. To demonstrate asymptotic normality of $\hat f(x)$, a convenient central limit theorem is Liapunov's Central Limit Theorem for Triangular Arrays. It states that, if the scalar random variables $z_{in}$ are independently (but not necessarily identically) distributed with variance $\mathrm{Var}(z_{in}) = \sigma^2_{in}$ and $r$-th absolute central moment $E[|z_{in} - E(z_{in})|^r] = \rho_{in} < \infty$ for some $r > 2$, and if
$$\frac{\left(\sum_{i=1}^n \rho_{in}\right)^{1/r}}{\left(\sum_{i=1}^n \sigma^2_{in}\right)^{1/2}} \to 0 \quad \text{as } n \to \infty$$
(known as the Liapunov condition), then the sample average $\bar z_n = \frac{1}{n}\sum_{i=1}^n z_{in}$ is asymptotically normal,
$$\frac{\bar z_n - E[\bar z_n]}{\sqrt{\mathrm{Var}(\bar z_n)}} \overset{d}{\to} N(0, 1).$$
Applying this theorem to $\hat f(x)$, we obtain the variance of the $z_{in}$ terms as
$$\sigma^2_{in} = \mathrm{Var}(z_{in}) = \mathrm{Var}\!\left(\frac{1}{h^p}K\!\left(\frac{x - x_i}{h}\right)\right) = \frac{f(x)}{h^p}\int [K(u)]^2\,du + o\!\left(\frac{1}{h^p}\right)$$

from our earlier calculations; setting $r = 3$, we get an upper bound for the third absolute central moment of $z_{in}$ as
$$\rho_{in} = E\big[|z_{in} - E(z_{in})|^3\big] \le 8\,E\big[|z_{in}|^3\big] = 8\,E\!\left[\left|\frac{1}{h^p}K\!\left(\frac{x - x_i}{h}\right)\right|^3\right] = \frac{8f(x)}{h^{2p}}\int |K(u)|^3\,du + o\!\left(\frac{1}{h^{2p}}\right),$$
where the inequality uses the expansion
$$E\big[|z_{in} - E(z_{in})|^3\big] \le E\big[(|z_{in}| + |E(z_{in})|)^3\big] = E[|z_{in}|^3] + 3E[|z_{in}|^2]\,|E(z_{in})| + 3E[|z_{in}|]\,|E(z_{in})|^2 + |E(z_{in})|^3 \le 8E[|z_{in}|^3],$$
each term on the right being bounded by $E[|z_{in}|^3]$ via the moment inequality $(E[|z|^s])^{1/s} \le (E[|z|^3])^{1/3}$ for $s \le 3$, and the last equality makes the same change-of-variables calculation as for the variance of $z_{in}$. Thus, for this problem the Liapunov condition is
$$\frac{\left(\sum_i \rho_{in}\right)^{1/3}}{\left(\sum_i \sigma^2_{in}\right)^{1/2}} = \frac{\left(n\,\dfrac{8f(x)}{h^{2p}}\displaystyle\int |K(u)|^3\,du + o(n h^{-2p})\right)^{1/3}}{\left(n\,\dfrac{f(x)}{h^p}\displaystyle\int [K(u)]^2\,du + o(n h^{-p})\right)^{1/2}} = O\!\left(\frac{n^{1/3} h^{-2p/3}}{n^{1/2} h^{-p/2}}\right) = O\big((nh^p)^{-1/6}\big) \to 0$$
if $nh^p \to \infty$, the same condition imposed to ensure that $\mathrm{Var}(\hat f(x)) \to 0$ as $n \to \infty$. Under this condition, then,
$$\frac{\hat f(x) - E[\hat f(x)]}{\sqrt{\mathrm{Var}(\hat f(x))}} \overset{d}{\to} N(0, 1),$$
or, substituting the expression $\mathrm{Var}(\hat f(x)) = O((nh^p)^{-1})$ and collecting terms,
$$\sqrt{nh^p}\,\big(\hat f(x) - E[\hat f(x)]\big) \overset{d}{\to} N\!\left(0,\; f(x)\int [K(u)]^2\,du\right).$$
This is almost, but not quite, in the form of the usual expression for an asymptotic distribution of an estimator; the difference is that the expectation of the estimator, $E[\hat f(x)]$, rather than the true value $f(x)$, is subtracted from $\hat f(x)$ in this expression. Writing
$$\sqrt{nh^p}\,\big(\hat f(x) - f(x)\big) = \sqrt{nh^p}\,\big(\hat f(x) - E[\hat f(x)]\big) + \sqrt{nh^p}\,\big(E[\hat f(x)] - f(x)\big),$$
asymptotic normality of $\hat f(x)$ around $f(x)$ will require the second, bias, term to converge to a constant. Inserting the expression for the bias,
$$\sqrt{nh^p}\,\big(E[\hat f(x)] - f(x)\big) = \sqrt{nh^p}\left[\frac{h^2}{2}\,\mathrm{tr}\!\left(\frac{\partial^2 f(x)}{\partial x\,\partial x'}\int uu'\,K(u)\,du\right) + o(h^2)\right] = O\!\left(\sqrt{n h^{p+4}}\right).$$
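Before turning to how this bias term behaves under different bandwidth rates, a hedged Monte Carlo sketch of the centered limit above for $p = 1$ with a standard normal kernel (seed, sample size, and replication count are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

n, x0 = 20_000, 0.0
h = n ** (-1 / 5)                 # h = n^(-1/(p+4)) with p = 1
reps = np.array([
    phi((x0 - rng.standard_normal(n)) / h).mean() / h   # fhat(x0)
    for _ in range(500)
])

# sqrt(n h) * (fhat - E[fhat]) should be approx N(0, f(x0) * int K^2)
scaled = np.sqrt(n * h) * (reps - reps.mean())
print(scaled.std(), np.sqrt(phi(0.0) / (2 * np.sqrt(np.pi))))  # close
```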

If the bandwidth $h$ takes the form
$$h = c\,n^{-1/(p+4)},$$
so that it converges to zero at the optimal rate, then
$$\sqrt{nh^p}\,\big(E[\hat f(x)] - f(x)\big) \to \frac{c^{(p+4)/2}}{2}\,\mathrm{tr}\!\left(\frac{\partial^2 f(x)}{\partial x\,\partial x'}\int uu'\,K(u)\,du\right) \equiv \delta(x),$$
and
$$\sqrt{nh^p}\,\big(\hat f(x) - f(x)\big) \overset{d}{\to} N\!\left(\delta(x),\; f(x)\int [K(u)]^2\,du\right).$$
Here the approximating normal distribution for $\hat f(x)$ would be centered at $f(x) + \delta(x)/\sqrt{nh^p}$, not $f(x)$, and construction of the usual confidence regions or test statistics would be complicated by the fact that $\delta(x)$ depends upon the unknown second derivative of $f(x)$. If the bandwidth tends to zero faster than the optimal rate, i.e.,
$$h = o\big(n^{-1/(p+4)}\big),$$
then
$$\sqrt{nh^p}\,\big(E[\hat f(x)] - f(x)\big) \to 0,$$
and the bias term vanishes from the asymptotic distribution,
$$\sqrt{nh^p}\,\big(\hat f(x) - f(x)\big) \overset{d}{\to} N\!\left(0,\; f(x)\int [K(u)]^2\,du\right).$$
Often in practice this sort of "undersmoothing," which implies the bias of $\hat f(x)$ is negligible relative to its variance, is assumed, and confidence intervals of the form
$$\left[\hat f(x) - 1.96\sqrt{\frac{\hat f(x)\int [K(u)]^2\,du}{nh^p}},\;\; \hat f(x) + 1.96\sqrt{\frac{\hat f(x)\int [K(u)]^2\,du}{nh^p}}\right]$$
are reported, though it is best to view the claimed 95% asymptotic coverage rate with some skepticism. Note that if the bandwidth tends to zero more slowly than the optimal rate, e.g.,
$$h = O\big(n^{-1/\gamma}\big), \qquad \gamma > p + 4,$$
then the bias of $\hat f(x)$ dominates its standard deviation, and the normalized difference $\sqrt{nh^p}\,(\hat f(x) - f(x))$ diverges.

Rescaling for Multivariate Kernels

Derivatives of the Kernel Density and Regression Estimators

Higher-Order (Bias-Reducing) Kernels
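The transcription breaks off at these three topic headings. As a hedged illustration of the last topic only (this example is standard but not from the notes): a fourth-order Gaussian-based kernel is $K_4(u) = \tfrac{1}{2}(3 - u^2)\phi(u)$, which integrates to one but has vanishing first and second moments, so the $O(h^2)$ bias term above drops out, at the cost of $K_4$ taking negative values. A quick quadrature check:

```python
import numpy as np
from scipy.integrate import quad

phi = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
k4 = lambda u: 0.5 * (3 - u ** 2) * phi(u)   # fourth-order Gaussian-based kernel

print(quad(k4, -np.inf, np.inf)[0])                          # 1: integrates to one
print(quad(lambda u: u ** 2 * k4(u), -np.inf, np.inf)[0])    # 0: second moment vanishes
print(quad(lambda u: u ** 4 * k4(u), -np.inf, np.inf)[0])    # -3: nonzero, so bias is O(h^4)
```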