Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach


STAT 425: Introduction to Nonparametric Statistics — Winter 2018
Instructor: Yen-Chi Chen
Reference: Section 8.4 of All of Nonparametric Statistics.

7.1 k-nearest neighbor

k-nearest neighbor (kNN) is a cool and powerful idea for nonparametric estimation. Today we will talk about its application in density estimation. In the future, we will learn how to use it for regression analysis and classification.

Let $X_1, \dots, X_n$ be our random sample. Assume each observation has $d$ different variables; namely, $X_i \in \mathbb{R}^d$. For a given point $x$, we first rank every observation based on its distance to $x$. Let $R_k(x)$ denote the distance from $x$ to its $k$-th nearest neighbor point.

For a given point $x$, the kNN density estimator estimates the density by
$$
\widehat{p}_k(x) = \frac{k}{n\, V_d\, R_k^d(x)} = \frac{k}{n}\cdot\frac{1}{\text{volume of a $d$-dimensional ball with radius } R_k(x)},
$$
where
$$
V_d = \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2}+1\right)}
$$
is the volume of a unit $d$-dimensional ball and $\Gamma(\cdot)$ is the Gamma function. Here are the results when $d = 1, 2$, and $3$:

$d = 1$, $V_1 = 2$: $\widehat{p}_k(x) = \dfrac{k}{n \cdot 2 R_k(x)}$.

$d = 2$, $V_2 = \pi$: $\widehat{p}_k(x) = \dfrac{k}{n \cdot \pi R_k^2(x)}$.

$d = 3$, $V_3 = \tfrac{4}{3}\pi$: $\widehat{p}_k(x) = \dfrac{3k}{n \cdot 4\pi R_k^3(x)}$.
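As a quick check, these three constants follow from the general formula for $V_d$, using $\Gamma(3/2) = \sqrt{\pi}/2$, $\Gamma(2) = 1$, and $\Gamma(5/2) = \tfrac{3}{4}\sqrt{\pi}$:
$$
V_1 = \frac{\pi^{1/2}}{\Gamma(3/2)} = \frac{\sqrt{\pi}}{\sqrt{\pi}/2} = 2, \qquad
V_2 = \frac{\pi}{\Gamma(2)} = \pi, \qquad
V_3 = \frac{\pi^{3/2}}{\Gamma(5/2)} = \frac{\pi^{3/2}}{\tfrac{3}{4}\sqrt{\pi}} = \frac{4}{3}\pi.
$$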

What is the intuition behind the kNN density estimator? By the definition of $R_k(x)$, the ball centered at $x$ with radius $R_k(x)$,
$$
B(x, R_k(x)) = \{y : \|x - y\| \le R_k(x)\},
$$
satisfies
$$
k = \sum_{i=1}^n I\big(X_i \in B(x, R_k(x))\big).
$$
Namely, the ratio of observations within $B(x, R_k(x))$ is $k/n$. Recall, from the relation between the EDF and the CDF, that the quantity
$$
\frac{1}{n}\sum_{i=1}^n I\big(X_i \in B(x, R_k(x))\big)
$$
can be viewed as an estimator of the quantity
$$
P\big(X_i \in B(x, R_k(x))\big) = \int_{B(x, R_k(x))} p(y)\, dy.
$$
When $n$ is large and $k$ is relatively small compared to $n$, $R_k(x)$ will be small because the ratio $k/n$ is small. Thus, the density $p(y)$ within the region $B(x, R_k(x))$ will not change too much; namely, $p(y) \approx p(x)$ for every $y \in B(x, R_k(x))$ (note that $x$ is the center of the ball $B(x, R_k(x))$). Therefore,
$$
P\big(X_i \in B(x, R_k(x))\big) = \int_{B(x, R_k(x))} p(y)\, dy \approx p(x)\int_{B(x, R_k(x))} dy = p(x)\, V_d\, R_k^d(x).
$$
This quantity is the target of the estimator $\frac{1}{n}\sum_{i=1}^n I(X_i \in B(x, R_k(x)))$, which equals $k/n$. As a result, we can say that
$$
p(x)\, V_d\, R_k^d(x) \approx P\big(X_i \in B(x, R_k(x))\big) \approx \frac{1}{n}\sum_{i=1}^n I\big(X_i \in B(x, R_k(x))\big) = \frac{k}{n},
$$
which leads to
$$
p(x) \approx \frac{k}{n\, V_d\, R_k^d(x)}.
$$
This motivates us to use
$$
\widehat{p}_k(x) = \frac{k}{n\, V_d\, R_k^d(x)}
$$
as a density estimator.

Example. We consider a simple example in $d = 1$. Assume our data set is $X = \{1, 2, 6, 11, 13, 14, 20, 33\}$. What is the kNN density estimator at $x = 5$ with $k = 2$? First, we calculate $R_2(5)$. The distances from $x = 5$ to the data points in $X$ are $\{4, 3, 1, 6, 8, 9, 15, 28\}$. Thus, $R_2(5) = 3$ and
$$
\widehat{p}_k(5) = \frac{2}{8\cdot 2\, R_2(5)} = \frac{1}{24}.
$$
What will the density estimator be when we choose $k = 5$? In this case, $R_5(5) = 8$, so
$$
\widehat{p}_k(5) = \frac{5}{8\cdot 2\, R_5(5)} = \frac{5}{128}.
$$
Now we see that different values of $k$ give different density estimates, even at the same $x$. How do we choose $k$? Well, just as with the smoothing bandwidth in the KDE, it is a very difficult problem in practice. However, we can do some theoretical analysis to get a rough idea of how $k$ should change with respect to the sample size $n$.
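Before turning to the theory, here is a minimal Python sketch of the one-dimensional estimator that reproduces the example above (the function name `knn_density_1d` and the use of NumPy are illustrative choices):

```python
import numpy as np

def knn_density_1d(data, x, k):
    """1-D kNN density estimate: p_hat_k(x) = k / (n * 2 * R_k(x))."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    dists = np.sort(np.abs(data - x))   # distances from x to all observations
    R_k = dists[k - 1]                  # k-th nearest-neighbor distance
    return k / (n * 2 * R_k)            # V_1 = 2 when d = 1

X = [1, 2, 6, 11, 13, 14, 20, 33]
print(knn_density_1d(X, x=5, k=2))      # 2 / (8*2*3) = 1/24  ~ 0.0417
print(knn_density_1d(X, x=5, k=5))      # 5 / (8*2*8) = 5/128 ~ 0.0391
```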

7.1.1 Asymptotic theory

The asymptotic analysis of a kNN estimator is quite complicated, so here I only state the result for $d = 1$. The bias of the kNN estimator is
$$
\mathrm{bias}(\widehat{p}_k(x)) = E(\widehat{p}_k(x)) - p(x)
= b_1 \frac{p''(x)}{p^2(x)}\left(\frac{k}{n}\right)^2 + b_2\frac{p(x)}{k}
+ o\!\left(\left(\frac{k}{n}\right)^2\right) + o\!\left(\frac{1}{k}\right),
$$
where $b_1$ and $b_2$ are two constants. The variance of the kNN estimator is
$$
\mathrm{Var}(\widehat{p}_k(x)) = v_1 \frac{p^2(x)}{k} + o\!\left(\frac{1}{k}\right),
$$
where $v_1$ is a constant.

The quantity $k$ is something we can choose. We need $k \to \infty$ and $k/n \to 0$ as $n \to \infty$ to make sure both the bias and the variance converge to $0$. However, how $k$ diverges affects the quality of the estimation. When $k$ is large, the variance is small while the bias is large. When $k$ is small, the variance is large and the bias tends to be small, but it could also be large (the second component of the bias will be large). To balance the bias and variance, we consider the mean squared error, which is at the rate
$$
\mathrm{MSE}(\widehat{p}_k(x)) = O\!\left(\frac{k^4}{n^4} + \frac{1}{k}\right).
$$
Balancing the two terms ($k^4/n^4 \asymp 1/k$, i.e., $k^5 \asymp n^4$) motivates us to choose $k = C\, n^{4/5}$ for some constant $C$. This leads to the optimal convergence rate for a kNN density estimator:
$$
\mathrm{MSE}(\widehat{p}_{k_{\mathrm{opt}}}(x)) = O\!\left(n^{-4/5}\right).
$$

Remark. When we consider $d$-dimensional data, the bias will be of the rate
$$
\mathrm{bias}(\widehat{p}_k(x)) = O\!\left(\left(\frac{k}{n}\right)^{2/d} + \frac{1}{k}\right)
$$
and the variance is still at the rate
$$
\mathrm{Var}(\widehat{p}_k(x)) = O\!\left(\frac{1}{k}\right).
$$
This shows a very different phenomenon compared to the KDE. In the KDE, the rate of the variance depends on the dimension whereas the rate of the bias remains the same. In the kNN, the rate of the variance stays the same but the rate of the bias changes with respect to the dimension. One intuition is that no matter what the dimension is, the ball $B(x, R_k(x))$ always contains $k$ observations. Thus, the variability of a kNN estimator is driven by $k$ points, which is independent of the dimension. On the other hand, in the KDE, the same $h$ in different dimensions covers a different number of observations, so the variability changes with respect to the dimension.

The kNN approach has the advantage that it can be computed very efficiently using the kd-tree algorithm (https://en.wikipedia.org/wiki/k-d_tree). This is a particularly useful feature when we have a huge amount of data and when the dimension of the data is large. However, a downside of the kNN estimator is that the estimated density often has a heavy tail, which implies it may not work well when $\|x\|$ is very large. Moreover, when $d = 1$, the density estimator $\widehat{p}_k(x)$ is not even a density function (the integral is infinite!).
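To illustrate the kd-tree point, here is a minimal sketch of the general-$d$ estimator using SciPy's `cKDTree` for the neighbor search (the function name `knn_density` and the variable names are illustrative choices):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(X, query, k):
    """kNN density estimate p_hat_k at each row of `query`.

    X     : (n, d) array of observations
    query : (m, d) array of evaluation points
    k     : number of nearest neighbors
    """
    X = np.asarray(X, dtype=float)
    query = np.atleast_2d(query)
    n, d = X.shape
    tree = cKDTree(X)                          # build the kd-tree once
    dist, _ = tree.query(query, k=k)           # distances to the k nearest neighbors
    R_k = dist[:, -1] if k > 1 else dist       # k-th nearest-neighbor distance
    V_d = np.pi ** (d / 2) / gamma(d / 2 + 1)  # volume of the unit d-ball
    return k / (n * V_d * R_k ** d)

# The 1-D data from the example above, written as an (n, 1) array:
X = np.array([[1.], [2.], [6.], [11.], [13.], [14.], [20.], [33.]])
print(knn_density(X, [[5.]], k=2))             # ~ 0.0417 = 1/24
```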

7.2 Basis approach

In this section, we assume that the PDF $p(x)$ is supported on $[0,1]$; namely, $p(x) > 0$ only in $[0,1]$. When the PDF $p(x)$ is smooth (in general, we need $p$ to be square integrable, i.e., $\int_0^1 p^2(x)\,dx < \infty$), we can use an orthonormal basis to approximate this function. This approach has several other names: the basis estimator, the projection estimator, and the orthogonal series estimator.

Let $\{\phi_1(x), \phi_2(x), \dots, \phi_m(x), \dots\}$ be a set of basis functions. Then we have
$$
p(x) = \sum_{j=1}^\infty \theta_j \phi_j(x).
$$
The quantity $\theta_j$ is the coefficient of each basis function. In signal processing, these quantities are referred to as the signal. The collection $\{\phi_1(x), \phi_2(x), \dots, \phi_m(x), \dots\}$ is called a basis if its elements have the following properties:

Unit norm:
$$
\int_0^1 \phi_j^2(x)\,dx = 1 \tag{7.1}
$$
for every $j = 1, 2, \dots$.

Orthogonality:
$$
\int_0^1 \phi_j(x)\phi_k(x)\,dx = 0 \tag{7.2}
$$
for every $j \ne k$.

Here are some concrete examples of bases:

Cosine basis: $\phi_1(x) = 1$, $\phi_j(x) = \sqrt{2}\cos(\pi(j-1)x)$, $j = 2, 3, \dots$.

Trigonometric basis: $\phi_1(x) = 1$, $\phi_{2j}(x) = \sqrt{2}\cos(2\pi j x)$, $\phi_{2j+1}(x) = \sqrt{2}\sin(2\pi j x)$, $j = 1, 2, \dots$.

Often the basis is something we can choose, so it is known to us. What is unknown to us is the coefficients $\theta_1, \theta_2, \dots$. Thus, the goal is to estimate these coefficients using the random sample $X_1, \dots, X_n$.

How do we estimate these parameters? We start with some simple analysis. For any basis function $\phi_j(x)$, consider the following integral:
$$
E(\phi_j(X_1)) = \int \phi_j(x)\,dP(x) = \int \phi_j(x)\, p(x)\,dx
= \int \phi_j(x) \sum_{k=1}^\infty \theta_k \phi_k(x)\,dx
= \sum_{k=1}^\infty \theta_k \underbrace{\int \phi_j(x)\phi_k(x)\,dx}_{=\,0 \text{ except } k = j}
= \sum_{k=1}^\infty \theta_k\, I(k = j) = \theta_j.
$$
Namely, the expectation of $\phi_j(X_1)$ is exactly the coefficient $\theta_j$.
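As a toy illustration (a made-up example), take the cosine basis and the density $p(x) = 1 + \tfrac{\sqrt{2}}{2}\cos(\pi x)$ on $[0,1]$, which is nonnegative and integrates to $1$. Then only two coefficients are nonzero:
$$
\theta_1 = \int_0^1 p(x)\phi_1(x)\,dx = 1, \qquad
\theta_2 = \int_0^1 p(x)\phi_2(x)\,dx = \tfrac{1}{2}, \qquad
\theta_j = 0 \ \text{ for } j \ge 3,
$$
so $p(x) = \theta_1\phi_1(x) + \theta_2\phi_2(x)$ exactly, and $E(\phi_2(X_1)) = \tfrac{1}{2}$, as the calculation above predicts.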

This motivates us to use the sample average as an estimator:
$$
\widehat{\theta}_j = \frac{1}{n}\sum_{i=1}^n \phi_j(X_i).
$$
By construction, this estimator is unbiased, i.e., $E(\widehat{\theta}_j) = \theta_j$. The variance of this estimator is
$$
\mathrm{Var}(\widehat{\theta}_j) = \frac{1}{n}\mathrm{Var}(\phi_j(X_1))
= \frac{1}{n}\left(E(\phi_j^2(X_1)) - E^2(\phi_j(X_1))\right)
= \frac{1}{n}\left(E(\phi_j^2(X_1)) - \theta_j^2\right) = \frac{\sigma_j^2}{n},
$$
where $\sigma_j^2 = E(\phi_j^2(X_1)) - \theta_j^2$.

In practice, we cannot use all the basis functions because there would be an infinite number of them to compute. So we use only the first $M$ basis functions in our estimator. Namely, our estimator is
$$
\widehat{p}_{n,M}(x) = \sum_{j=1}^M \widehat{\theta}_j \phi_j(x). \tag{7.3}
$$
Later, in the asymptotic analysis, we will show that we should not choose $M$ to be too large because of the bias-variance tradeoff.
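Here is a minimal Python sketch of the estimator (7.3) with the cosine basis (the function names `cosine_basis` and `basis_density_estimator` are illustrative; the data are assumed to lie in $[0,1]$):

```python
import numpy as np

def cosine_basis(x, M):
    """Evaluate phi_1, ..., phi_M of the cosine basis at points x; returns a (len(x), M) array."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    j = np.arange(1, M + 1)
    Phi = np.sqrt(2.0) * np.cos(np.pi * (j - 1) * x[:, None])
    Phi[:, 0] = 1.0                                  # phi_1(x) = 1
    return Phi

def basis_density_estimator(data, M):
    """Return p_hat_{n,M}(x) = sum_j theta_hat_j * phi_j(x) as a callable."""
    theta_hat = cosine_basis(data, M).mean(axis=0)   # theta_hat_j = (1/n) sum_i phi_j(X_i)
    return lambda x: cosine_basis(x, M) @ theta_hat

# Usage: estimate the density of a sample supported on [0, 1].
rng = np.random.default_rng(0)
data = rng.beta(2, 5, size=500)                      # some data on [0, 1]
p_hat = basis_density_estimator(data, M=10)
print(p_hat([0.1, 0.3, 0.8]))                        # estimated density at a few points
```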

7.2.1 Asymptotic theory

To analyze the quality of our estimator, we use the mean integrated squared error (MISE). Namely, we want to analyze
$$
\mathrm{MISE}(\widehat{p}_{n,M}) = E\!\left(\int \left(\widehat{p}_{n,M}(x) - p(x)\right)^2 dx\right)
= \int \left[\mathrm{bias}^2(\widehat{p}_{n,M}(x)) + \mathrm{Var}(\widehat{p}_{n,M}(x))\right] dx.
$$

Analysis of the bias. Because each $\widehat{\theta}_j$ is an unbiased estimator of $\theta_j$, we have
$$
E(\widehat{p}_{n,M}(x)) = E\!\left(\sum_{j=1}^M \widehat{\theta}_j\phi_j(x)\right)
= \sum_{j=1}^M E(\widehat{\theta}_j)\phi_j(x) = \sum_{j=1}^M \theta_j\phi_j(x).
$$
Thus, the bias at a point $x$ is
$$
\mathrm{bias}(\widehat{p}_{n,M}(x)) = E(\widehat{p}_{n,M}(x)) - p(x)
= \sum_{j=1}^M \theta_j\phi_j(x) - \sum_{j=1}^\infty \theta_j\phi_j(x)
= -\sum_{j=M+1}^\infty \theta_j\phi_j(x),
$$
and the integrated squared bias is
$$
\int \mathrm{bias}^2(\widehat{p}_{n,M}(x))\,dx
= \int\left(\sum_{j=M+1}^\infty \theta_j\phi_j(x)\right)^2 dx
= \int\left(\sum_{j=M+1}^\infty \theta_j\phi_j(x)\right)\left(\sum_{k=M+1}^\infty \theta_k\phi_k(x)\right) dx
= \sum_{j=M+1}^\infty\sum_{k=M+1}^\infty \theta_j\theta_k \underbrace{\int \phi_j(x)\phi_k(x)\,dx}_{=\,I(j=k)}
= \sum_{j=M+1}^\infty \theta_j^2. \tag{7.4}
$$
Namely, the bias is determined by the signal strength of the ignored basis functions. This makes sense because the bias should reflect the fact that we are not using all the basis functions: if some important basis functions (the ones with a large $\theta_j$) are ignored, the bias ought to be large.

We know that in the KDE and kNN, the bias is often associated with the smoothness of the density function. How does smoothness come into play in this case? It turns out that if the density is smooth, the remaining signal $\sum_{j=M+1}^\infty \theta_j^2$ will also be small. To see this, we consider a very simple model by assuming that we are using the cosine basis and that
$$
\int_0^1 \left(p''(x)\right)^2 dx \le L \tag{7.5}
$$
for some constant $L > 0$. Namely, the overall curvature of the density function is bounded.

We use the fact that for the cosine basis functions $\phi_j(x)$,
$$
\phi_j'(x) = -\sqrt{2}\,\pi(j-1)\sin(\pi(j-1)x), \qquad
\phi_j''(x) = -\sqrt{2}\,\pi^2(j-1)^2\cos(\pi(j-1)x) = -\pi^2(j-1)^2\phi_j(x).
$$
Thus, equation (7.5) implies
$$
L \ge \int_0^1 \left(p''(x)\right)^2 dx
= \int_0^1\left(\sum_{j=1}^\infty \theta_j\phi_j''(x)\right)^2 dx
= \int_0^1\left(\sum_{j=1}^\infty \pi^2(j-1)^2\theta_j\phi_j(x)\right)\left(\sum_{k=1}^\infty \pi^2(k-1)^2\theta_k\phi_k(x)\right) dx
= \pi^4\sum_{j=1}^\infty\sum_{k=1}^\infty (j-1)^2(k-1)^2\theta_j\theta_k\underbrace{\int \phi_j(x)\phi_k(x)\,dx}_{=\,I(j=k)}
= \pi^4\sum_{j=1}^\infty (j-1)^4\theta_j^2.
$$

Namely, equation (7.5) implies
$$
\sum_{j=1}^\infty (j-1)^4\theta_j^2 \le \frac{L}{\pi^4}. \tag{7.6}
$$
This further implies that
$$
\sum_{j=M+1}^\infty \theta_j^2 = O\!\left(M^{-4}\right). \tag{7.7}
$$
An intuitive explanation is as follows. Equation (7.6) implies that when $j \to \infty$, the signal $\theta_j^2 = O(j^{-5})$. The reason is: if $\theta_j^2 \asymp j^{-5}$, equation (7.6) becomes
$$
\sum_{j} (j-1)^4\theta_j^2 \asymp \sum_{j} (j-1)^4 j^{-5} \asymp \sum_{j} j^{-1} = \infty!
$$
Thus, the tail signals have to shrink toward $0$ at rate $O(j^{-5})$. As a result,
$$
\sum_{j=M+1}^\infty \theta_j^2 \le \sum_{j=M+1}^\infty O(j^{-5}) = O\!\left(M^{-4}\right).
$$
Equations (7.7) and (7.4) together imply that the bias is at the rate
$$
\int \mathrm{bias}^2(\widehat{p}_{n,M}(x))\,dx = O\!\left(M^{-4}\right).
$$

Analysis of the variance. Now we turn to the analysis of the variance:
$$
\int \mathrm{Var}(\widehat{p}_{n,M}(x))\,dx
= \int \mathrm{Var}\!\left(\sum_{j=1}^M \widehat{\theta}_j\phi_j(x)\right) dx
= \int\left[\sum_{j=1}^M \phi_j^2(x)\mathrm{Var}(\widehat{\theta}_j) + \sum_{j\ne k}\phi_j(x)\phi_k(x)\mathrm{Cov}(\widehat{\theta}_j,\widehat{\theta}_k)\right] dx
$$
$$
= \sum_{j=1}^M \mathrm{Var}(\widehat{\theta}_j)\underbrace{\int\phi_j^2(x)\,dx}_{=\,1}
+ \sum_{j\ne k}\mathrm{Cov}(\widehat{\theta}_j,\widehat{\theta}_k)\underbrace{\int\phi_j(x)\phi_k(x)\,dx}_{=\,0}
= \sum_{j=1}^M \mathrm{Var}(\widehat{\theta}_j) = \sum_{j=1}^M \frac{\sigma_j^2}{n} = O\!\left(\frac{M}{n}\right).
$$

Now, putting both the bias and the variance together, we obtain the rate of the MISE:
$$
\mathrm{MISE}(\widehat{p}_{n,M}) = \int\left[\mathrm{bias}^2(\widehat{p}_{n,M}(x)) + \mathrm{Var}(\widehat{p}_{n,M}(x))\right] dx
= O\!\left(M^{-4}\right) + O\!\left(\frac{M}{n}\right).
$$
Balancing the two terms ($M^{-4} \asymp M/n$, i.e., $M^5 \asymp n$), the optimal choice is $M = M^* = C\, n^{1/5}$ for some positive constant $C$, and this leads to
$$
\mathrm{MISE}(\widehat{p}_{n,M^*}) = O\!\left(n^{-4/5}\right),
$$
which is the same rate as the KDE and kNN.
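For completeness, the tail bound (7.7) also follows directly from (7.6) without the heuristic rate argument: since $(j-1)^4 \ge M^4$ whenever $j \ge M+1$,
$$
\sum_{j=M+1}^\infty \theta_j^2
\le \frac{1}{M^4}\sum_{j=M+1}^\infty (j-1)^4\theta_j^2
\le \frac{L}{\pi^4 M^4} = O\!\left(M^{-4}\right).
$$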

Remarks.

(Sobolev space) Again, we see that the bias depends on the smoothness of the density function. Here, as long as the density function has a bounded overall curvature (second derivative), we have an optimal MISE at the rate $O(n^{-4/5})$. In fact, if the density function has stronger smoothness, such as the overall third derivative being bounded, we will have an even faster convergence rate, $O(n^{-6/7})$. Now, for any positive integer $\beta$ and a positive number $L > 0$, define
$$
W(\beta, L) = \left\{p : \int \left(p^{(\beta)}(x)\right)^2 dx \le L < \infty\right\}
$$
to be a collection of smooth density functions. Then the integrated squared bias of a basis estimator for any density $p \in W(\beta, L)$ is at the rate $O(M^{-2\beta})$ and the optimal convergence rate is $O\!\left(n^{-\frac{2\beta}{2\beta+1}}\right)$. The collection $W(\beta, L)$ is a space of functions and is known as a Sobolev space.

(Tuning parameter) The number of basis functions $M$ in a basis estimator, the number of neighbors $k$ in the kNN approach, and the smoothing bandwidth $h$ in the KDE all play a very similar role in the bias-variance tradeoff. These quantities are called tuning parameters in statistics and machine learning. Just like what we have seen in the analysis of the KDE, choosing these parameters is often a very difficult task because the optimal choice depends on the actual density function, which is unknown to us. A principled approach to choosing a tuning parameter is to minimize an estimate of the error. For instance, in the basis estimator, we try to find an estimator of the MISE, $\widehat{\mathrm{MISE}}(\widehat{p}_{n,M})$. When we change the value of $M$, this error estimate also changes. We then choose $\widehat{M}$ by minimizing the estimated error, i.e.,
$$
\widehat{M} = \mathrm{argmin}_M\; \widehat{\mathrm{MISE}}(\widehat{p}_{n,M}).
$$
Then we pick $\widehat{M}$ basis functions and construct our estimator.
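One concrete possibility for $\widehat{\mathrm{MISE}}$ (a common choice, shown here only as an illustration and not derived in these notes) is least-squares cross-validation, which estimates $\int \widehat{p}_{n,M}^2 - 2\int \widehat{p}_{n,M}\, p$ (the MISE up to the constant $\int p^2$) using leave-one-out coefficients. A minimal sketch with the cosine basis:

```python
import numpy as np

def cosine_basis(x, M):
    """phi_1, ..., phi_M of the cosine basis evaluated at points x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    j = np.arange(1, M + 1)
    Phi = np.sqrt(2.0) * np.cos(np.pi * (j - 1) * x[:, None])
    Phi[:, 0] = 1.0
    return Phi

def lscv_score(data, M):
    """Least-squares cross-validation score for the M-term cosine-basis estimator.

    Estimates MISE(p_hat_{n,M}) minus the constant int p^2, so minimizing it
    over M is a proxy for minimizing the MISE.
    """
    n = len(data)
    Phi = cosine_basis(data, M)                  # (n, M) matrix of phi_j(X_i)
    theta_hat = Phi.mean(axis=0)                 # theta_hat_j
    int_phat_sq = np.sum(theta_hat ** 2)         # int p_hat^2 dx, by orthonormality
    # leave-one-out coefficients: theta_hat_j^{(-i)} = (n*theta_hat_j - phi_j(X_i)) / (n-1)
    theta_loo = (n * theta_hat[None, :] - Phi) / (n - 1)
    phat_loo_at_Xi = np.sum(theta_loo * Phi, axis=1)   # p_hat^{(-i)}(X_i)
    return int_phat_sq - 2.0 * phat_loo_at_Xi.mean()

# Choose M over a grid (data assumed rescaled to [0, 1]):
rng = np.random.default_rng(0)
data = rng.beta(2, 5, size=500)
M_hat = min(range(1, 31), key=lambda M: lscv_score(data, M))
print(M_hat)
```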