Boundary Correction Methods in Kernel Density Estimation. Tom Alberts, Courant Institute. Joint work with R.J. Karunamuni, University of Alberta. November 29, 2007.

Outline: Overview of Kernel Density Estimation; Boundary Effects; Methods for Removing Boundary Effects; the Karunamuni and Alberts Estimator.

What is Density Estimation? Basic question: given an i.i.d. sample of data $X_1, X_2, \ldots, X_n$, can one estimate the distribution the data comes from? As usual, there are parametric and non-parametric estimators; here we consider only non-parametric estimators. Assumptions on the distribution: it has a probability density function, which we call $f$, and $f$ is as smooth as we need, at least having continuous second derivatives.

Most Basic Estimator: the Histogram! Parameters: an origin $x_0$ and a bandwidth $h$. Create bins $\ldots, [x_0 - h, x_0), [x_0, x_0 + h), [x_0 + h, x_0 + 2h), \ldots$ [Figure: histogram of the data, treatment length vs. frequency.] Dataset: lengths (in days) of 86 spells of psychiatric treatment for patients in a study of suicide risks.

Most Basic Estimator: the Histogram! Can write the estimator as
$$f_n(x) = \frac{1}{nh} \#\{X_i : X_i \text{ is in the same bin as } x\}.$$
Is it accurate? In the limit, yes. A consequence of the Strong Law of Large Numbers: as $n \to \infty$ and $h \to 0$ (with $nh \to \infty$), $f_n(x) \to f(x)$ almost surely. Advantages: simple, computationally easy, well known by the general public. Disadvantages: depends very strongly on the choice of $x_0$ and $h$; ugly.
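
A minimal sketch of this bin-counting estimator in Python (NumPy assumed; the exponential sample is only an illustrative stand-in, not the treatment-length data):

```python
import numpy as np

def histogram_density(x, data, x0, h):
    """f_n(x) = (1/(n*h)) * #{X_i : X_i in the same bin as x}."""
    n = len(data)
    # bins are [x0 + k*h, x0 + (k+1)*h); find the bin index of x and of each X_i
    bin_of_x = np.floor((x - x0) / h)
    bin_of_data = np.floor((data - x0) / h)
    return np.sum(bin_of_data == bin_of_x) / (n * h)

rng = np.random.default_rng(0)
sample = rng.exponential(scale=100.0, size=86)   # stand-in data, not the suicide study
print(histogram_density(150.0, sample, x0=0.0, h=50.0))
```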

Dependence on $x_0$. [Figure: two histograms of the treatment-length data with the same bandwidth but different origins $x_0$, giving visibly different estimates.]

Making the Histogram a Local Estimator. There's an easy way to get rid of the dependence on $x_0$. Recall
$$f(x) = \lim_{h \to 0} \frac{1}{2h} P(x - h < X < x + h),$$
which can be naively estimated by
$$f_n(x) = \frac{1}{2nh} \#\{X_i : x - h \le X_i \le x + h\}.$$
In pictures: [Figure.]

Making the Histogram a Local Estimator. Let $K(x) = \frac{1}{2}\mathbf{1}\{-1 \le x \le 1\}$. Can also write the estimator as
$$f_n(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\left(\frac{x - X_i}{h}\right).$$
This is the general form of a kernel density estimator. Nothing special about the choice $K(x) = \frac{1}{2}\mathbf{1}\{-1 \le x \le 1\}$: one can use a smooth $K$ and get smooth kernel estimators.
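
A sketch of this general form in Python, with the box kernel above and a Gaussian kernel to show that only $K$ changes (NumPy assumed; data and bandwidth are illustrative):

```python
import numpy as np

def box_kernel(t):
    return 0.5 * ((t >= -1) & (t <= 1))

def gaussian_kernel(t):
    return np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)

def kde(x, data, h, K=box_kernel):
    """f_n(x) = (1/n) * sum_i (1/h) * K((x - X_i)/h), vectorized over x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    t = (x[:, None] - data[None, :]) / h     # shape (len(x), n)
    return K(t).mean(axis=1) / h

rng = np.random.default_rng(1)
data = rng.normal(size=200)
grid = np.linspace(-3, 3, 7)
print(kde(grid, data, h=0.4))                      # box kernel
print(kde(grid, data, h=0.4, K=gaussian_kernel))   # smooth kernel
```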

Properties of the Kernel. What other properties should $K$ satisfy? Positive; symmetric about zero; $\int K(t)\,dt = 1$; $\int t K(t)\,dt = 0$; $0 < \int t^2 K(t)\,dt < \infty$. If $K$ satisfies the above, it follows immediately that $f_n(x) \ge 0$ and $\int f_n(x)\,dx = 1$.
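
A quick numerical check of these moment conditions, sketched with SciPy quadrature on the Epanechnikov kernel (chosen here only for illustration; any kernel satisfying the conditions would do):

```python
from scipy.integrate import quad

def epanechnikov(t):
    return 0.75 * (1 - t**2) if abs(t) <= 1 else 0.0

mass, _ = quad(epanechnikov, -1, 1)                    # should be 1
m1, _ = quad(lambda t: t * epanechnikov(t), -1, 1)     # should be 0 by symmetry
m2, _ = quad(lambda t: t**2 * epanechnikov(t), -1, 1)  # finite and positive (= 0.2)
print(mass, m1, m2)
```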

Different Kernels. [Figure: plots of the Box, Triangle, Epanechnikov, Biweight, and Gaussian kernels.]

Does the Choice of Kernel Matter? For reasons that we will see, the optimal $K$ should minimize
$$C(K) = \left(\int t^2 K(t)\,dt\right)^{2/5} \left(\int K(t)^2\,dt\right)^{4/5}.$$
It has been proven that the Epanechnikov kernel is the minimizer. However, for most other kernels $C(K)$ is not much larger than $C(\text{Epanechnikov})$. Of the five presented here the worst is the box kernel, but $C(\text{Box}) < 1.1\, C(\text{Epanechnikov})$. Therefore, the kernel is usually chosen based on other considerations, e.g. desired smoothness.
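
A short sketch that evaluates $C(K)$ numerically for the box and Epanechnikov kernels and checks the ratio quoted above (SciPy assumed):

```python
from scipy.integrate import quad

def C(K):
    """C(K) = (int t^2 K dt)^(2/5) * (int K^2 dt)^(4/5)."""
    m2, _ = quad(lambda t: t**2 * K(t), -1, 1)
    rough, _ = quad(lambda t: K(t)**2, -1, 1)
    return m2**0.4 * rough**0.8

box = lambda t: 0.5 if abs(t) <= 1 else 0.0
epanechnikov = lambda t: 0.75 * (1 - t**2) if abs(t) <= 1 else 0.0
print(C(box) / C(epanechnikov))   # roughly 1.06, comfortably below 1.1
```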

How Does Bandwidth Affect the Estimator? The bandwidth $h$ acts as a smoothing parameter. Choose $h$ too small and spurious fine structure becomes visible. Choose $h$ too large and many important features may be oversmoothed. [Figure: estimates of the same data with $h = 2$ and $h = 0.2$.]

How Does Bandwidth Affect the Estimator? A common choice for the optimal value of $h$ is
$$\left(\int t^2 K(t)\,dt\right)^{-2/5} \left(\int K(t)^2\,dt\right)^{1/5} \left(\int f''(x)^2\,dx\right)^{-1/5} n^{-1/5}.$$
Note the optimal choice still depends on the unknown $f$. Finding a good estimator of $h$ is probably the most important problem in kernel density estimation, but it's not the focus of this talk.
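
A sketch evaluating this formula in the one textbook case where every ingredient is known in closed form: a Gaussian kernel with a standard normal $f$ (an illustrative choice, not an example from the talk). It reproduces the familiar $1.06\, n^{-1/5}$ rule of thumb.

```python
import numpy as np

def h_opt(mu2_K, rough_K, rough_f2, n):
    """(int t^2 K)^(-2/5) * (int K^2)^(1/5) * (int f''^2)^(-1/5) * n^(-1/5)."""
    return mu2_K**-0.4 * rough_K**0.2 * rough_f2**-0.2 * n**-0.2

# Gaussian kernel: int t^2 K = 1, int K^2 = 1/(2*sqrt(pi));
# standard normal f: int (f'')^2 = 3/(8*sqrt(pi))
n = 500
print(h_opt(1.0, 1 / (2 * np.sqrt(np.pi)), 3 / (8 * np.sqrt(np.pi)), n))
print(1.06 * n**-0.2)   # nearly the same number
```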

Measuring Error. How do we measure the error of an estimator $f_n(x)$? Use Mean Squared Error throughout. Can measure the error at a single point:
$$E\left[(f_n(x) - f(x))^2\right] = \left(E[f_n(x)] - f(x)\right)^2 + \operatorname{Var}(f_n(x)) = \text{Bias}^2 + \text{Variance}.$$
Can also measure the error over the whole line by integrating:
$$\int E\left[(f_n(x) - f(x))^2\right] dx.$$
The latter is called Mean Integrated Squared Error (MISE). MISE has an integrated bias part and an integrated variance part.

Bias and Variance.
$$\begin{aligned}
E[f_n(x)] - f(x) &= \frac{1}{nh} \sum_{i=1}^{n} E\left[K\left(\frac{x - X_i}{h}\right)\right] - f(x) = \frac{1}{h} E\left[K\left(\frac{x - X_i}{h}\right)\right] - f(x) \\
&= \frac{1}{h} \int K\left(\frac{x - y}{h}\right) f(y)\,dy - f(x) = \int K(t) f(x - ht)\,dt - f(x) \\
&= \int K(t)\,(f(x - ht) - f(x))\,dt \\
&= \int K(t)\left(-ht f'(x) + \tfrac{1}{2} h^2 t^2 f''(x) + \cdots\right) dt \\
&= -h f'(x) \int t K(t)\,dt + \tfrac{1}{2} h^2 f''(x) \int t^2 K(t)\,dt + \cdots \\
&= \tfrac{1}{2} h^2 f''(x) \int t^2 K(t)\,dt + \text{higher order terms in } h.
\end{aligned}$$

Bias and Variance. Can work out the variance in a similar way:
$$E[f_n(x)] - f(x) = \frac{h^2}{2} f''(x) \int t^2 K(t)\,dt + o(h^2), \qquad \operatorname{Var}(f_n(x)) = \frac{1}{nh} f(x) \int K(t)^2\,dt + o\left(\frac{1}{nh}\right).$$
Notice how $h$ affects the two terms in opposite ways. Can integrate the bias and variance estimates above to get the MISE:
$$\frac{h^4}{4} \left(\int t^2 K(t)\,dt\right)^2 \int f''(x)^2\,dx + \frac{1}{nh} \int K(t)^2\,dt,$$
plus some higher order terms.

Bias and Variance. The optimal $h$ from before was chosen so as to minimize the MISE. The minimum of the MISE turns out to be
$$\frac{5}{4}\, C(K) \left(\int f''(x)^2\,dx\right)^{1/5} n^{-4/5},$$
where $C(K)$ is the functional of the kernel given earlier. Thus we chose the optimal kernel to be the one that minimizes the MISE, all else held equal. Note that when using the optimal bandwidth, the MISE goes to zero like $n^{-4/5}$.

Boundary Effects. All of these calculations implicitly assume that the density is supported on the entire real line. If it's not, then the estimator can behave quite poorly due to what are called boundary effects. Combating these is the main focus of this talk. For simplicity, we'll assume from now on that $f$ is supported on $[0, \infty)$. Then $[0, h)$ is called the boundary region.

Boundary Effects. In the boundary region, $f_n$ usually underestimates $f$. This is because $f_n$ doesn't feel the boundary, and penalizes for the lack of data on the negative axis. [Figure: histogram and kernel estimate of the treatment-length data near the boundary at zero.]

Boundary Effects. For $x \in [0, h)$, the bias of $f_n(x)$ is of order $O(h)$ rather than $O(h^2)$. In fact it's even worse: $f_n(x)$ is not even a consistent estimator of $f(x)$. Writing $x = ch$, $0 \le c \le 1$,
$$E[f_n(x)] = f(x) \int_{-1}^{c} K(t)\,dt - h f'(x) \int_{-1}^{c} t K(t)\,dt + \frac{h^2}{2} f''(x) \int_{-1}^{c} t^2 K(t)\,dt + o(h^2),$$
$$\operatorname{Var}(f_n(x)) = \frac{f(x)}{nh} \int_{-1}^{c} K(t)^2\,dt + o\left(\frac{1}{nh}\right).$$
Note the variance isn't much changed.

Methods for Removing Boundary Effects. There is a vast literature on removing boundary effects. I briefly mention four common techniques: reflection of data, transformation of data, pseudo-data methods, and boundary kernel methods. They all have their advantages and disadvantages. One disadvantage we don't like is that some of them, especially boundary kernels, can produce negative estimates.

Reflection of Data Method. Basic idea: since the kernel estimator is penalizing for a lack of data on the negative axis, why not just put some there? Simplest way: just add $-X_1, -X_2, \ldots, -X_n$ to the data set. The estimator becomes
$$\hat f_n(x) = \frac{1}{nh} \sum_{i=1}^{n} \left\{ K\left(\frac{x - X_i}{h}\right) + K\left(\frac{x + X_i}{h}\right) \right\}$$
for $x \ge 0$, and $\hat f_n(x) = 0$ for $x < 0$. It is easy to show that $\hat f_n'(0) = 0$. Hence it's a very good method if the underlying density has $f'(0) = 0$.
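
A sketch of the reflection estimator under this definition (NumPy assumed; the Epanechnikov kernel and exponential test data are illustrative choices):

```python
import numpy as np

def epanechnikov(t):
    return 0.75 * (1 - t**2) * (np.abs(t) <= 1)

def reflection_kde(x, data, h, K=epanechnikov):
    """(1/(n*h)) * sum_i [K((x - X_i)/h) + K((x + X_i)/h)] for x >= 0, else 0."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    t_minus = (x[:, None] - data[None, :]) / h
    t_plus = (x[:, None] + data[None, :]) / h
    est = (K(t_minus) + K(t_plus)).mean(axis=1) / h
    return np.where(x >= 0, est, 0.0)

rng = np.random.default_rng(2)
data = rng.exponential(size=300)          # a density supported on [0, infinity)
print(reflection_kde([0.0, 0.5, 1.0], data, h=0.3))
```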

Transformation of Data Method. Take a one-to-one, continuous function $g : [0, \infty) \to [0, \infty)$. Use the regular kernel estimator with the transformed data set $\{g(X_1), g(X_2), \ldots, g(X_n)\}$:
$$\hat f_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - g(X_i)}{h}\right).$$
Note this isn't really estimating the pdf of $X$, but instead that of $g(X)$. This leaves room for manipulation: one can choose $g$ to make the data produce whatever you want.
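
A sketch of the transformed-data estimator; the particular $g$ below is hypothetical, chosen only because it is increasing on $[0, \infty)$ (NumPy assumed):

```python
import numpy as np

def gaussian_kernel(t):
    return np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)

def transformation_kde(x, data, h, g, K=gaussian_kernel):
    """Ordinary KDE on {g(X_i)}: (1/(n*h)) * sum_i K((x - g(X_i))/h).
    Estimates the density of g(X), not of X."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    t = (x[:, None] - g(data)[None, :]) / h
    return K(t).mean(axis=1) / h

rng = np.random.default_rng(3)
data = rng.exponential(size=300)
g = lambda y: y + 0.5 * y**2              # illustrative increasing transformation
print(transformation_kde([0.0, 0.5, 1.0], data, h=0.3, g=g))
```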

Pseudo-Data Methods. Due to Cowling and Hall, this generates data beyond the left endpoint of the support of the density. Kind of a reflected transformation estimator: it transforms the data into a new set, then puts this new set on the negative axis.
$$\hat f_n(x) = \frac{1}{nh} \left[ \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) + \sum_{i=1}^{m} K\left(\frac{x - X_{(-i)}}{h}\right) \right]$$
Here $m \le n$, and
$$X_{(-i)} = -5 X_{(i/3)} - 4 X_{(2i/3)} + \frac{10}{3} X_{(i)},$$
where $X_{(t)}$ linearly interpolates among $0, X_{(1)}, X_{(2)}, \ldots, X_{(n)}$.

Boundary Kernel Method. At each point in the boundary region, use a different kernel for the estimate. Usually the new kernels give up the symmetry property and put more weight on the positive axis:
$$\hat f_n(x) = \frac{1}{nh_c} \sum_{i=1}^{n} K_{(c/b(c))}\left(\frac{x - X_i}{h_c}\right),$$
where $x = ch$, $0 \le c \le 1$, $b(c) = 2 - c$, and $h_c = b(c)h$. Also
$$K_{(c)}(t) = \frac{12}{(1+c)^4}\,(1 + t)\left\{(1 - 2c)\,t + \frac{3c^2 - 2c + 1}{2}\right\} \mathbf{1}\{-1 \le t \le c\}.$$
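
A small numerical check of the boundary kernel $K_{(c)}$ as written above: on $[-1, c]$ it keeps unit mass and zero first moment, which is what restores the $O(h^2)$ bias near the boundary, and at $c = 1$ it reduces to the Epanechnikov kernel (SciPy assumed):

```python
from scipy.integrate import quad

def boundary_kernel(t, c):
    """K_(c)(t) = 12/(1+c)^4 * (1+t) * ((1-2c)*t + (3c^2 - 2c + 1)/2) on [-1, c]."""
    if t < -1 or t > c:
        return 0.0
    return 12 / (1 + c)**4 * (1 + t) * ((1 - 2*c) * t + (3*c**2 - 2*c + 1) / 2)

for c in (0.0, 0.25, 0.5, 1.0):
    mass, _ = quad(boundary_kernel, -1, c, args=(c,))
    m1, _ = quad(lambda t: t * boundary_kernel(t, c), -1, c)
    print(c, round(mass, 6), round(m1, 6))   # mass ~1 and first moment ~0 for every c
```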

Method of Karunamuni and Alberts. Our method combines transformation and reflection:
$$\tilde f_n(x) = \frac{1}{nh} \sum_{i=1}^{n} \left\{ K\left(\frac{x + g(X_i)}{h}\right) + K\left(\frac{x - g(X_i)}{h}\right) \right\}$$
for some transformation $g$ to be determined. We choose $g$ so that the bias is of order $O(h^2)$ in the boundary region, rather than $O(h)$. Also choose $g$ so that $g(0) = 0$, $g'(0) = 1$, and $g$ is continuous and increasing.

Method of Karunamuni and Alberts. Do a very careful Taylor expansion of $f$ and $g$ in
$$E\left[\tilde f_n(x)\right] = \frac{1}{h} \int \left\{ K\left(\frac{x + g(y)}{h}\right) + K\left(\frac{x - g(y)}{h}\right) \right\} f(y)\,dy$$
to compute the bias. Setting the $h$ coefficient of the bias to zero requires
$$g''(0) = 2 f'(0) \int_c^1 (t - c) K(t)\,dt \Big/ \left[ f(0)\left( c + 2 \int_c^1 (t - c) K(t)\,dt \right) \right],$$
where $x = ch$, $0 \le c \le 1$. Most novel feature: note that $g''(0)$ actually depends on $x$! What this means: at different points $x$, the data is transformed by a different amount.

Method of Karunamuni and Alberts. Simplest possible $g$ satisfying these conditions:
$$g(y) = y + \frac{1}{2} d\, k_c\, y^2 + \lambda_0 (d\, k_c)^2 y^3,$$
where $d = f'(0)/f(0)$,
$$k_c = 2 \int_c^1 (t - c) K(t)\,dt \Big/ \left( c + 2 \int_c^1 (t - c) K(t)\,dt \right),$$
and $\lambda_0$ is big enough so that $g$ is strictly increasing. Note $g$ really depends on $c$, so we write $g_c(y)$ instead. Hence the amount of transformation of the data depends on the point $x$ at which we're estimating $f(x)$. Important feature: $k_c \to 0$ as $c \to 1$.
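
A sketch of how these pieces fit together at a single evaluation point, assuming the Epanechnikov kernel and treating $d$ and $\lambda_0$ as supplied by the user (the talk estimates $d$ from the data, as described below, and only requires $\lambda_0$ large enough for monotonicity; the values here are illustrative):

```python
import numpy as np
from scipy.integrate import quad

def epanechnikov(t):
    return 0.75 * (1 - np.asarray(t)**2) * (np.abs(t) <= 1)

def k_c(c, K=epanechnikov):
    """k_c = 2*int_c^1 (t-c)K(t)dt / (c + 2*int_c^1 (t-c)K(t)dt); note k_1 = 0."""
    integral, _ = quad(lambda t: (t - c) * K(t), c, 1)
    return 2 * integral / (c + 2 * integral)

def ka_estimate(x, data, h, d, lam0, K=epanechnikov):
    """Transformation-plus-reflection estimate at one point x >= 0."""
    c = min(x / h, 1.0)                       # c = x/h in the boundary region, 1 elsewhere
    kc = k_c(c, K)
    g = data + 0.5 * d * kc * data**2 + lam0 * (d * kc)**2 * data**3
    return np.mean(K((x + g) / h) + K((x - g) / h)) / h

rng = np.random.default_rng(4)
data = rng.exponential(size=500)              # f(x) = exp(-x), so d = f'(0)/f(0) = -1
for x in (0.0, 0.1, 0.3, 1.0):
    print(x, ka_estimate(x, data, h=0.3, d=-1.0, lam0=1.0))
```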

Method of Karunamuni and Alberts. Consequently, $g_c(y) = y$ for $c \ge 1$. This means our estimator reduces to the regular kernel estimator at interior points! We like that feature: the regular kernel estimator does well at interior points, so why mess with a good thing? Also note that our estimator is always positive. Moreover, by performing a careful Taylor expansion at the boundary, one can show the variance is still $O\left(\frac{1}{nh}\right)$:
$$\operatorname{Var}\left(\tilde f_n(x)\right) = \frac{f(0)}{nh} \left\{ 2 \int_c^1 K(t) K(2c - t)\,dt + \int_{-1}^{1} K^2(t)\,dt \right\} + o\left(\frac{1}{nh}\right).$$

Method of Karunamuni and Alberts. Note that $g_c(y)$ requires a parameter $d = f'(0)/f(0)$. Of course we don't know this, so we have to estimate it somehow. We note $d = \frac{d}{dx} \log f(x)\big|_{x=0}$, which we can estimate by
$$\hat d = \frac{\log f_n^*(h_1) - \log f_n^*(0)}{h_1},$$
where $f_n^*$ is some other kind of density estimator. We follow the methodology of Zhang, Karunamuni and Jones for this. Important feature: if $d = 0$, then $g_c(y) = y$. This means our estimator reduces to the reflection estimator if $f'(0) = 0$!
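
A sketch of this finite-difference estimate of $d$; the pilot density estimate below is just the reflection estimator standing in for illustration, whereas the talk follows Zhang, Karunamuni and Jones for the pilot (NumPy assumed; $h_1$ is an arbitrary illustrative value):

```python
import numpy as np

def epanechnikov(t):
    return 0.75 * (1 - np.asarray(t)**2) * (np.abs(t) <= 1)

def reflection_kde_at(x, data, h):
    return np.mean(epanechnikov((x - data) / h) + epanechnikov((x + data) / h)) / h

def d_hat(data, h1):
    """(log f*(h1) - log f*(0)) / h1 with a crude pilot estimate f*."""
    return (np.log(reflection_kde_at(h1, data, h1)) -
            np.log(reflection_kde_at(0.0, data, h1))) / h1

rng = np.random.default_rng(5)
data = rng.exponential(size=2000)     # data-generating density exp(-x) has d = -1
print(d_hat(data, h1=0.3))
```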

Method of Karunamuni and Alberts. I mention that our method can be generalized to having two distinct transformations involved:
$$\tilde f_n(x) = \frac{1}{nh} \sum_{i=1}^{n} \left\{ K\left(\frac{x + g_1(X_i)}{h}\right) + K\left(\frac{x - g_2(X_i)}{h}\right) \right\}.$$
With both $g_1$ and $g_2$ there are many degrees of freedom. In another paper we investigated five different pairs of $(g_1, g_2)$. As would be expected, no one pair did exceptionally well on all shapes of densities. The previous choice $g_1 = g_2 = g$ was the most consistent of all the choices, so we recommend it for practical use.

Simulations of Our Estimator: $f(x) = \frac{x^2}{2} e^{-x}$, $x \ge 0$, with $h = 0.832109$. [Figure: eight panels comparing estimates near the boundary — Estimator V, Boundary Kernel, LL with BVF, LL without BVF, Z,K & F, J & F, H & P, C & H.]

Simulations of Our Estimator: $f(x) = \frac{2}{\pi(1 + x^2)}$, $x \ge 0$, with $h = 0.690595$. [Figure: eight panels comparing estimates near the boundary — Estimator V, Boundary Kernel, LL with BVF, LL without BVF, Z,K & F, J & F, H & P, C & H.]

Simulations of Our Estimator: $f(x) = \frac{5}{4}(1 + 15x) e^{-5x}$, $x \ge 0$, with $h = 0.139332$. [Figure: eight panels comparing estimates near the boundary — Estimator V, Boundary Kernel, LL with BVF, LL without BVF, Z,K & F, J & F, H & P, C & H.]

Simulations of Our Estimator: $f(x) = 5 e^{-5x}$, $x \ge 0$, with $h = 0.136851$. [Figure: eight panels comparing estimates near the boundary — Estimator V, Boundary Kernel, LL with BVF, LL without BVF, Z,K & F, J & F, H & P, C & H.]

Our First Estimator $g_1 = g_2 = g$ on the Suicide Data. Dashed line: regular kernel estimator. Solid line: Karunamuni and Alberts. [Figure: both estimates plotted over treatment lengths from 0 to 500 days.]

Slides Produced With Asymptote: The Vector Graphics Language. http://asymptote.sf.net (freely available under the GNU public license)