STAT 425: Introduction to Nonparametric Statistics, Winter 2018

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Instructor: Yen-Chi Chen

Reference: Section 8.4 of All of Nonparametric Statistics.

7.1 k-nearest neighbor

k-nearest neighbor (k-NN) is a cool and powerful idea for nonparametric estimation. Today we will talk about its application to density estimation. In the future, we will learn how to use it for regression analysis and classification.

Let $X_1, \cdots, X_n \sim p$ be our random sample. Assume each observation has $d$ different variables; namely, $X_i \in \mathbb{R}^d$. For a given point $x$, we first rank every observation based on its distance to $x$. Let $R_k(x)$ denote the distance from $x$ to its $k$-th nearest neighbor point.

For a given point $x$, the k-NN density estimator estimates the density by

$$\hat{p}_k(x) = \frac{k}{n} \cdot \frac{1}{V_d R_k^d(x)} = \frac{k}{n} \cdot \frac{1}{\text{Volume of a $d$-dimensional ball with radius being } R_k(x)},$$

where $V_d = \frac{\pi^{d/2}}{\Gamma(\frac{d}{2} + 1)}$ is the volume of a unit $d$-dimensional ball and $\Gamma(x)$ is the Gamma function. Here are the results when $d = 1, 2,$ and $3$:

- $d = 1$, $V_1 = 2$: $\hat{p}_k(x) = \frac{k}{n \cdot 2 R_k(x)}$.
- $d = 2$, $V_2 = \pi$: $\hat{p}_k(x) = \frac{k}{n \cdot \pi R_k^2(x)}$.
- $d = 3$, $V_3 = \frac{4}{3}\pi$: $\hat{p}_k(x) = \frac{3k}{n \cdot 4\pi R_k^3(x)}$.
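This estimator is easy to implement directly from the definition. Below is a minimal Python sketch (the code and the function name knn_density are illustrative additions, not part of the notes; numpy and scipy are assumed available):

```python
import numpy as np
from scipy.special import gamma

def knn_density(x, data, k):
    """k-NN density estimate at a query point x.

    x    : array of shape (d,), the query point.
    data : array of shape (n, d), the sample X_1, ..., X_n.
    k    : the number of nearest neighbors.
    """
    n, d = data.shape
    # Distance from x to every observation.
    dists = np.linalg.norm(data - x, axis=1)
    # R_k(x): the distance to the k-th nearest neighbor.
    R_k = np.sort(dists)[k - 1]
    # V_d: the volume of the unit d-dimensional ball.
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)
    return k / (n * V_d * R_k ** d)
```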
What is the intuition behind a k-NN density estimator? By the definition of $R_k(x)$, the ball centered at $x$ with radius $R_k(x)$,

$$B(x, R_k(x)) = \{y : \|x - y\| \le R_k(x)\},$$

satisfies the fact that

$$\frac{1}{n}\sum_{i=1}^n I(X_i \in B(x, R_k(x))) = \frac{k}{n}.$$

Namely, the ratio of observations within $B(x, R_k(x))$ is $k/n$. Recall from the relation between the EDF and the CDF that the quantity

$$\frac{1}{n}\sum_{i=1}^n I(X_i \in B(x, R_k(x)))$$

can be viewed as an estimator of the quantity

$$P(X_i \in B(x, R_k(x))) = \int_{B(x, R_k(x))} p(y)\,dy.$$

When $n$ is large and $k$ is relatively small compared to $n$, $R_k(x)$ will be small because the ratio $k/n$ is small. Thus, the density $p(y)$ within the region $B(x, R_k(x))$ will not change too much; namely, $p(y) \approx p(x)$ for every $y \in B(x, R_k(x))$. Note that $x$ is the center of the ball $B(x, R_k(x))$. Therefore,

$$P(X_i \in B(x, R_k(x))) = \int_{B(x, R_k(x))} p(y)\,dy \approx p(x)\int_{B(x, R_k(x))} dy = p(x) \cdot V_d R_k^d(x).$$

This quantity is the target of the estimator $\frac{1}{n}\sum_{i=1}^n I(X_i \in B(x, R_k(x)))$, which equals $k/n$. As a result, we can say that

$$p(x) \cdot V_d R_k^d(x) \approx P(X_i \in B(x, R_k(x))) \approx \frac{1}{n}\sum_{i=1}^n I(X_i \in B(x, R_k(x))) = \frac{k}{n},$$

which leads to

$$p(x) \approx \frac{k}{n\, V_d R_k^d(x)}.$$

This motivates us to use

$$\hat{p}_k(x) = \frac{k}{n\, V_d R_k^d(x)}$$

as a density estimator.

Example. We consider a simple example in $d = 1$. Assume our data is $X = \{1, 2, 6, 11, 13, 14, 20, 33\}$. What is the k-NN density estimator at $x = 5$ with $k = 2$? First, we calculate $R_2(5)$. The distance from $x = 5$ to each data point in $X$ is $\{4, 3, 1, 6, 8, 9, 15, 28\}$. Thus, $R_2(5) = 3$ and

$$\hat{p}_2(5) = \frac{2}{8 \cdot 2 \cdot R_2(5)} = \frac{1}{24}.$$

What will the density estimator be when we choose $k = 5$? In this case, $R_5(5) = 8$, so

$$\hat{p}_5(5) = \frac{5}{8 \cdot 2 \cdot R_5(5)} = \frac{5}{128}.$$

Now we see that a different value of $k$ gives a different density estimate, even at the same $x$. How do we choose $k$? Well, just as with the smoothing bandwidth in the KDE, this is a very difficult problem in practice. However, we can do some theoretical analysis to get a rough idea of how $k$ should change with respect to the sample size $n$.
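As a sanity check, the knn_density sketch above reproduces the example (this usage snippet is again an illustrative addition):

```python
import numpy as np

X = np.array([1, 2, 6, 11, 13, 14, 20, 33], dtype=float).reshape(-1, 1)
x = np.array([5.0])

print(knn_density(x, X, k=2))  # 1/24  ~ 0.04167
print(knn_density(x, X, k=5))  # 5/128 ~ 0.03906
```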
7.1.1 Asymptotic theory

The asymptotic analysis of a k-NN estimator is quite complicated, so here I only state the result in $d = 1$. The bias of the k-NN estimator is

$$\mathrm{bias}(\hat{p}_k(x)) = E(\hat{p}_k(x)) - p(x) = b_1 \frac{p''(x)}{p^2(x)}\left(\frac{k}{n}\right)^2 + b_2 \frac{p(x)}{k} + o\left(\left(\frac{k}{n}\right)^2 + \frac{1}{k}\right),$$

where $b_1$ and $b_2$ are two constants. The variance of the k-NN estimator is

$$\mathrm{Var}(\hat{p}_k(x)) = v_1 \frac{p^2(x)}{k} + o\left(\frac{1}{k}\right),$$

where $v_1$ is a constant.

The quantity $k$ is something we can choose. We need $k \to \infty$ and $k/n \to 0$ when $n \to \infty$ to make sure both the bias and the variance converge to $0$. However, how $k$ diverges affects the quality of the estimation. When $k$ is large, the variance is small while the bias is large. When $k$ is small, the variance is large and the bias tends to be small, but it could also be large (the second component of the bias, $b_2 p(x)/k$, will be large). To balance the bias and variance, we consider the mean squared error, which is at the rate

$$\mathrm{MSE}(\hat{p}_k(x)) = O\left(\frac{k^4}{n^4}\right) + O\left(\frac{1}{k}\right).$$

This motivates us to choose $k = C \cdot n^{4/5}$ for some constant $C$. This leads to the optimal convergence rate for a k-NN density estimator:

$$\mathrm{MSE}(\hat{p}_{k_{\mathrm{opt}}}(x)) = O(n^{-4/5}).$$

Remark.

- When we consider $d$-dimensional data, the bias will be of the rate

$$\mathrm{bias}(\hat{p}_k(x)) = O\left(\left(\frac{k}{n}\right)^{2/d} + \frac{1}{k}\right)$$

and the variance is still at rate

$$\mathrm{Var}(\hat{p}_k(x)) = O\left(\frac{1}{k}\right).$$

This shows a very different phenomenon compared to the KDE. In the KDE, the rate of the variance depends on the dimension whereas the rate of the bias remains the same. In the k-NN, the rate of the variance stays the same but the rate of the bias changes with respect to the dimension. One intuition is that no matter what the dimension is, the ball $B(x, R_k(x))$ always contains $k$ observations. Thus, the variability of a k-NN estimator is driven by $k$ points, which is independent of the dimension. On the other hand, in the KDE, the same $h$ in different dimensions will cover a different number of observations, so the variability changes with respect to the dimension.

- The k-NN approach has the advantage that it can be computed very efficiently using the kd-tree algorithm (https://en.wikipedia.org/wiki/k-d_tree). This is a particularly useful feature when we have a huge amount of data and when the dimension of the data is large. However, a downside of the k-NN is that the density estimate often has a heavy tail, which implies it may not work well when $x$ is very large. Moreover, when $d = 1$, the density estimator $\hat{p}_k(x)$ is not even a density function (the integral is infinite!).
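To illustrate the kd-tree remark, here is a sketch of how one might evaluate the estimator at many query points at once using scipy.spatial.cKDTree (my own code; scipy's kd-tree is just one possible implementation):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density_kdtree(queries, data, k):
    """Evaluate the k-NN density estimator at many query points.

    queries : array of shape (m, d) of query points.
    data    : array of shape (n, d), the sample.
    """
    n, d = data.shape
    tree = cKDTree(data)                 # build the kd-tree once
    dists, _ = tree.query(queries, k=k)  # distances to the k nearest neighbors
    # tree.query returns shape (m,) when k == 1 and (m, k) otherwise.
    R_k = dists if k == 1 else dists[:, k - 1]
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)
    return k / (n * V_d * R_k ** d)
```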
7.2 Basis approach

In this section, we assume that the PDF $p(x)$ is supported on $[0, 1]$; namely, $p(x) > 0$ only in $[0, 1]$. When the PDF $p(x)$ is smooth (in general, we need $p$ to be square integrable, i.e., $\int_0^1 p^2(x)\,dx < \infty$), we can use an orthonormal basis to approximate this function. This approach goes by several other names: the basis estimator, the projection estimator, and the orthogonal series estimator.

Let $\{\varphi_1(x), \varphi_2(x), \cdots, \varphi_m(x), \cdots\}$ be a set of basis functions. Then we write

$$p(x) = \sum_{j=1}^{\infty} \theta_j \varphi_j(x).$$

The quantity $\theta_j$ is the coefficient of each basis function. In signal processing, these quantities are referred to as the signals. The collection $\{\varphi_1(x), \varphi_2(x), \cdots, \varphi_m(x), \cdots\}$ is called a basis if its elements have the following properties:

- Unit norm:
$$\int_0^1 \varphi_j^2(x)\,dx = 1 \qquad (7.1)$$
for every $j = 1, 2, \cdots$.

- Orthonormality:
$$\int_0^1 \varphi_j(x)\varphi_k(x)\,dx = 0 \qquad (7.2)$$
for every $j \ne k$.

Here are some concrete examples of such a basis:

- Cosine basis:
$$\varphi_1(x) = 1, \quad \varphi_j(x) = \sqrt{2}\cos(\pi(j-1)x),\ j = 2, 3, \cdots$$

- Trigonometric basis:
$$\varphi_1(x) = 1, \quad \varphi_{2j}(x) = \sqrt{2}\cos(2\pi j x), \quad \varphi_{2j+1}(x) = \sqrt{2}\sin(2\pi j x),\ j = 1, 2, \cdots$$

Often the basis is something we choose, so it is known to us. What is unknown to us is the coefficients $\theta_1, \theta_2, \cdots$. Thus, the goal is to estimate these coefficients using the random sample $X_1, \cdots, X_n$.
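Properties (7.1) and (7.2) are easy to check numerically for the cosine basis. The sketch below is an illustrative addition (it uses a simple midpoint-rule quadrature), and the function cosine_basis is reused in later snippets:

```python
import numpy as np

def cosine_basis(j, x):
    """The j-th cosine basis function on [0, 1], j = 1, 2, ..."""
    if j == 1:
        return np.ones_like(x)
    return np.sqrt(2.0) * np.cos(np.pi * (j - 1) * x)

# Midpoint-rule grid: the mean over this grid approximates the integral on [0, 1].
xs = (np.arange(100000) + 0.5) / 100000
for j in range(1, 4):
    for k in range(j, 4):
        val = np.mean(cosine_basis(j, xs) * cosine_basis(k, xs))
        print(j, k, round(val, 8))  # ~1 when j == k, ~0 otherwise
```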
How do we estimate these coefficients? We start with some simple analysis. For any basis function $\varphi_j(x)$, consider the following integral:

$$E(\varphi_j(X_1)) = \int \varphi_j(x)\,dP(x) = \int \varphi_j(x) p(x)\,dx = \int \varphi_j(x) \sum_{k=1}^{\infty}\theta_k\varphi_k(x)\,dx = \sum_{k=1}^{\infty}\theta_k \underbrace{\int \varphi_j(x)\varphi_k(x)\,dx}_{=0 \text{ except } k = j} = \sum_{k=1}^{\infty}\theta_k I(k = j) = \theta_j.$$

Namely, the expectation of $\varphi_j(X_1)$ is exactly the coefficient $\theta_j$. This motivates us to use the sample average as an estimator:

$$\hat{\theta}_j = \frac{1}{n}\sum_{i=1}^n \varphi_j(X_i).$$

By construction, this estimator is unbiased, i.e., $E(\hat{\theta}_j) = \theta_j$. The variance of this estimator is

$$\mathrm{Var}(\hat{\theta}_j) = \frac{\mathrm{Var}(\varphi_j(X_1))}{n} = \frac{E(\varphi_j^2(X_1)) - E^2(\varphi_j(X_1))}{n} = \frac{E(\varphi_j^2(X_1)) - \theta_j^2}{n} = \frac{\sigma_j^2}{n},$$

where $\sigma_j^2 = E(\varphi_j^2(X_1)) - \theta_j^2$.

In practice, we cannot use all the basis functions because an infinite number of coefficients would have to be calculated. So we use only the first $M$ basis functions in our estimator. Namely, our estimator is

$$\hat{p}_{n,M}(x) = \sum_{j=1}^M \hat{\theta}_j \varphi_j(x). \qquad (7.3)$$

Later, in the asymptotic analysis, we will see that we should not choose $M$ too large because of the bias-variance tradeoff.
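A minimal implementation of (7.3) with the cosine basis (again an illustrative sketch, reusing cosine_basis from above):

```python
import numpy as np

def basis_density(x, data, M):
    """Evaluate the basis estimator (7.3) with the cosine basis at points x."""
    est = np.zeros_like(x, dtype=float)
    for j in range(1, M + 1):
        theta_hat = np.mean(cosine_basis(j, data))  # hat{theta}_j = (1/n) sum_i phi_j(X_i)
        est += theta_hat * cosine_basis(j, x)
    return est

# Example: estimate the density of a Beta(2,2) sample with M = 10 terms.
rng = np.random.default_rng(0)
sample = rng.beta(2.0, 2.0, size=500)
grid = np.linspace(0.0, 1.0, 201)
p_hat = basis_density(grid, sample, M=10)
```

Note that $\hat{p}_{n,M}$ need not be nonnegative or integrate to one; it is an $L^2$ approximation of $p$.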
7.2.1 Asymptotic theory

To analyze the quality of our estimator, we use the mean integrated squared error (MISE). Namely, we want to analyze

$$\mathrm{MISE}(\hat{p}_{n,M}) = E\left(\int (\hat{p}_{n,M}(x) - p(x))^2\,dx\right) = \int \left[\mathrm{bias}^2(\hat{p}_{n,M}(x)) + \mathrm{Var}(\hat{p}_{n,M}(x))\right] dx.$$

Analysis of Bias. Because each $\hat{\theta}_j$ is an unbiased estimator of $\theta_j$, we have

$$E(\hat{p}_{n,M}(x)) = E\left(\sum_{j=1}^M \hat{\theta}_j\varphi_j(x)\right) = \sum_{j=1}^M E(\hat{\theta}_j)\varphi_j(x) = \sum_{j=1}^M \theta_j\varphi_j(x).$$

Thus, the bias at a point $x$ is

$$\mathrm{bias}(\hat{p}_{n,M}(x)) = E(\hat{p}_{n,M}(x)) - p(x) = \sum_{j=1}^M \theta_j\varphi_j(x) - \sum_{j=1}^{\infty} \theta_j\varphi_j(x) = -\sum_{j=M+1}^{\infty} \theta_j\varphi_j(x),$$

so the integrated squared bias is

$$\int \mathrm{bias}^2(\hat{p}_{n,M}(x))\,dx = \int \left(\sum_{j=M+1}^{\infty}\theta_j\varphi_j(x)\right)^2 dx = \int \left(\sum_{j=M+1}^{\infty}\theta_j\varphi_j(x)\right)\left(\sum_{k=M+1}^{\infty}\theta_k\varphi_k(x)\right) dx = \sum_{j=M+1}^{\infty}\sum_{k=M+1}^{\infty}\theta_j\theta_k\underbrace{\int \varphi_j(x)\varphi_k(x)\,dx}_{=I(j=k)} = \sum_{j=M+1}^{\infty}\theta_j^2. \qquad (7.4)$$

Namely, the bias is determined by the signal strength of the ignored basis functions, which makes sense: the bias should reflect the fact that we are not using all the basis functions, and if some important basis functions (the ones with large $\theta_j$) are ignored, the bias ought to be large.
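Equation (7.4) can be verified numerically. For instance, for $p(x) = 6x(1-x)$ (the Beta(2,2) density; my choice of example), integration by parts gives $\theta_1 = 1$ and $\theta_j = -6\sqrt{2}\,(1 + (-1)^{j-1})/((j-1)^2\pi^2)$ for $j \ge 2$. The following sketch (reusing cosine_basis) compares the two sides of (7.4):

```python
import numpy as np

def theta(j):
    """Cosine-basis coefficients of p(x) = 6x(1-x), via integration by parts."""
    if j == 1:
        return 1.0
    return -6.0 * np.sqrt(2.0) * (1 + (-1) ** (j - 1)) / ((j - 1) ** 2 * np.pi ** 2)

M = 5
xs = (np.arange(100000) + 0.5) / 100000                # midpoint grid on [0, 1]
p = 6 * xs * (1 - xs)
mean_est = sum(theta(j) * cosine_basis(j, xs) for j in range(1, M + 1))

print(np.mean((mean_est - p) ** 2))                    # integrated squared bias
print(sum(theta(j) ** 2 for j in range(M + 1, 4001)))  # tail sum in (7.4): nearly equal
```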
We know that in the KDE and the k-NN, the bias is often associated with the smoothness of the density function. How does the smoothness come into play in this case? It turns out that if the density is smooth, the remaining signals $\sum_{j=M+1}^{\infty}\theta_j^2$ will also be small. To see this, we consider a very simple model by assuming that we are using the cosine basis and that

$$\int_0^1 (p''(x))^2\,dx \le L \qquad (7.5)$$

for some constant $L > 0$. Namely, the overall curvature of the density function is bounded. We use the fact that, for the cosine basis functions $\varphi_j(x)$,

$$\varphi_j'(x) = -\sqrt{2}\,\pi(j-1)\sin(\pi(j-1)x), \qquad \varphi_j''(x) = -\sqrt{2}\,\pi^2(j-1)^2\cos(\pi(j-1)x) = -\pi^2(j-1)^2\varphi_j(x).$$

Thus, equation (7.5) implies

$$L \ge \int_0^1 (p''(x))^2\,dx = \int_0^1 \left(\sum_{j=1}^{\infty}\theta_j\varphi_j''(x)\right)^2 dx = \int_0^1 \left(\sum_{j=1}^{\infty}\pi^2(j-1)^2\theta_j\varphi_j(x)\right)^2 dx = \pi^4 \sum_{j=1}^{\infty}\sum_{k=1}^{\infty}(j-1)^2(k-1)^2\theta_j\theta_k\underbrace{\int_0^1 \varphi_j(x)\varphi_k(x)\,dx}_{=I(j=k)} = \pi^4\sum_{j=1}^{\infty}(j-1)^4\theta_j^2.$$

Namely, equation (7.5) implies

$$\sum_{j=1}^{\infty}(j-1)^4\theta_j^2 \le \frac{L}{\pi^4}. \qquad (7.6)$$

This further implies that

$$\sum_{j=M+1}^{\infty}\theta_j^2 = O(M^{-4}). \qquad (7.7)$$

An intuitive explanation is as follows. Equation (7.6) implies that when $j \to \infty$, the signal $\theta_j^2 = O(j^{-5})$. The reason is: if $\theta_j^2 \sim \frac{1}{j^5}$, equation (7.6) becomes

$$\sum_{j=1}^{\infty}(j-1)^4\theta_j^2 \sim \sum_{j=1}^{\infty}(j-1)^4\frac{1}{j^5} \sim \sum_{j=1}^{\infty}\frac{1}{j} = \infty!$$

Thus, the tail signals have to be shrinking toward $0$ at rate $O(j^{-5})$. As a result,

$$\sum_{j=M+1}^{\infty}\theta_j^2 = \sum_{j=M+1}^{\infty}O(j^{-5}) = O(M^{-4}).$$

(More directly: for $j \ge M+1$ we have $(j-1)^4 \ge M^4$, so $\sum_{j=M+1}^{\infty}\theta_j^2 \le M^{-4}\sum_{j=M+1}^{\infty}(j-1)^4\theta_j^2 \le \frac{L}{\pi^4 M^4}$.)

Equations (7.7) and (7.4) together imply that the bias is at rate

$$\int \mathrm{bias}^2(\hat{p}_{n,M}(x))\,dx = O(M^{-4}).$$

Analysis of Variance. Now we turn to the analysis of the variance:

$$\int \mathrm{Var}(\hat{p}_{n,M}(x))\,dx = \int \mathrm{Var}\left(\sum_{j=1}^M \hat{\theta}_j\varphi_j(x)\right) dx = \sum_{j=1}^M \mathrm{Var}(\hat{\theta}_j)\underbrace{\int \varphi_j^2(x)\,dx}_{=1} + \sum_{j \ne k}\mathrm{Cov}(\hat{\theta}_j, \hat{\theta}_k)\underbrace{\int \varphi_j(x)\varphi_k(x)\,dx}_{=0} = \sum_{j=1}^M \mathrm{Var}(\hat{\theta}_j) = \sum_{j=1}^M \frac{\sigma_j^2}{n} = O\left(\frac{M}{n}\right).$$

Now, putting both bias and variance together, we obtain the rate of the MISE:

$$\mathrm{MISE}(\hat{p}_{n,M}) = \int \left[\mathrm{bias}^2(\hat{p}_{n,M}(x)) + \mathrm{Var}(\hat{p}_{n,M}(x))\right] dx = O(M^{-4}) + O\left(\frac{M}{n}\right).$$

Thus, the optimal choice is $M = M^* = C \cdot n^{1/5}$ for some positive constant $C$, and this leads to

$$\mathrm{MISE}(\hat{p}_{n,M^*}) = O(n^{-4/5}),$$

which is the same rate as the KDE and the k-NN.
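The bias-variance tradeoff in $M$ is visible in a small simulation (an illustrative addition, reusing basis_density from above): for a fixed sample, the integrated squared error typically first decreases and then increases as $M$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = (np.arange(10000) + 0.5) / 10000
p_true = 6 * xs * (1 - xs)                # true Beta(2,2) density
sample = rng.beta(2.0, 2.0, size=500)

for M in (1, 2, 5, 10, 50, 200):
    p_hat = basis_density(xs, sample, M)
    ise = np.mean((p_hat - p_true) ** 2)  # integrated squared error on the grid
    print(M, round(ise, 5))               # small M: bias dominates; large M: variance
```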
Remark.

- (Sobolev space) Again, we see that the bias depends on the smoothness of the density function. Here, as long as the density function has a bounded overall curvature (second derivative), we have an optimal MISE at the rate of $O(n^{-4/5})$. In fact, if the density function has stronger smoothness, such as the overall third derivative being bounded, we will have an even faster convergence rate, $O(n^{-6/7})$. Now, for any positive integer $\beta$ and a positive number $L > 0$, we define

$$W(\beta, L) = \left\{p : \int |p^{(\beta)}(x)|^2\,dx \le L < \infty\right\}$$

to be a collection of smooth density functions. Then the bias of a basis estimator for any density $p \in W(\beta, L)$ is at the rate $O(M^{-2\beta})$ and the optimal convergence rate is $O(n^{-\frac{2\beta}{2\beta+1}})$. The collection $W(\beta, L)$ is a space of functions and is known as a Sobolev space.

- (Tuning parameter) The number of basis functions $M$ in a basis estimator, the number of neighbors $k$ in the k-NN approach, and the smoothing bandwidth $h$ in the KDE all play a very similar role in the bias-variance tradeoff. These quantities are called tuning parameters in statistics and machine learning. Just as we have seen in the analysis of the KDE, choosing these parameters is often a very difficult task because the optimal choice depends on the actual density function, which is unknown to us. A principled approach to choosing a tuning parameter is to minimize an error estimator. For instance, in the basis estimator, we try to find an estimator $\widehat{\mathrm{MISE}}(\hat{p}_{n,M})$ of the MISE. When we change the value of $M$, this error estimator also changes. We then choose $M$ by minimizing the estimated error, i.e.,

$$M^* = \mathrm{argmin}_M\, \widehat{\mathrm{MISE}}(\hat{p}_{n,M}).$$

Then we pick the first $M^*$ basis functions and construct our estimator.
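The notes do not prescribe a specific error estimator, but one standard choice is least-squares cross-validation: the quantity $\mathrm{MISE}(\hat{p}_{n,M}) - \int p^2$, which differs from the MISE by a constant not depending on $M$, can be estimated by $\int \hat{p}_{n,M}^2 - \frac{2}{n}\sum_i \hat{p}_{n,M}^{(-i)}(X_i)$, where $\hat{p}_{n,M}^{(-i)}$ is the estimator computed without $X_i$. The sketch below (my own implementation, reusing cosine_basis) exploits the fact that $\int \hat{p}_{n,M}^2 = \sum_{j \le M}\hat{\theta}_j^2$ by orthonormality:

```python
import numpy as np

def cv_score(data, M):
    """Least-squares cross-validation score for the basis estimator with M terms."""
    n = len(data)
    Phi = np.column_stack([cosine_basis(j, data) for j in range(1, M + 1)])  # (n, M)
    theta_hat = Phi.mean(axis=0)
    int_phat_sq = np.sum(theta_hat ** 2)  # integral of p_hat^2, by orthonormality
    # Leave-one-out coefficients and the values p_hat^{(-i)}(X_i), vectorized over i.
    theta_loo = (n * theta_hat[None, :] - Phi) / (n - 1)
    p_loo = np.sum(theta_loo * Phi, axis=1)
    return int_phat_sq - 2.0 * np.mean(p_loo)

rng = np.random.default_rng(2)
sample = rng.beta(2.0, 2.0, size=500)
scores = {M: cv_score(sample, M) for M in range(1, 31)}
print(min(scores, key=scores.get))  # data-driven choice of M
```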