A new method of nonparametric density estimation

Size: px

Start display at page:

Download "A new method of nonparametric density estimation"

Patrick Hicks
5 years ago
Views:

1 A new method of nonparametric density estimation Andrey Pepelyshev Cardi December 7, /32 A. Pepelyshev A new method of nonparametric density estimation

2 Contents Introduction A new density estimate Selection of a smoothing parameter Application in quality control 2/32 A. Pepelyshev A new method of nonparametric density estimation

3 Introduction Let (x 1,..., x m ) be a sample from a distribution F (x) with density p(x) = F (x). The problem is to estimate p(x) and F (x). 3/32 A. Pepelyshev A new method of nonparametric density estimation

4 Empirical distribution function The EDF F m (x) is the minimum-variance unbiased nonparametric estimate of F (x). F m (x) = 1 m m 1 [xi, )(x) i=1 4/32 A. Pepelyshev A new method of nonparametric density estimation

5 Kernel estimation The kernel density estimate is ˆp m,h (x) = 1 mh m i=1 K( x x i h ) where h is the bandwidth and the kernel K is a continuous function, K(x)dx = 1. The popular choice of K(x) is the Gaussian kernel K(x) = 1 2π e x2 /2. 5/32 A. Pepelyshev A new method of nonparametric density estimation

6 Kernel estimation The corresponding estimate of F (x) have the form ˆF (x) = (K F m )(x) = K(x z)f m (z)dz since the equality x K(y z)dy = z K(y x)dy holds for all x and z if K(x) is a symmetric kernel. 6/32 A. Pepelyshev A new method of nonparametric density estimation

7 Bandwidth selection An optimal bandwidth minimizes a risk function, e.g. MISE(h) = E (p(x) ˆp m,h (x)) 2 dx. The LSCV-bandwidth minimizes Ψ(h) = ˆp m,h (x) 2 dx 2 m m i=1 ˆp [x 1,...,x i 1,x i+1,...,x n ] m 1,h (x i ) 7/32 A. Pepelyshev A new method of nonparametric density estimation

8 A new estimate A new estimate p(x) = d dx F (x) The aim is to obtain a smoothed EDF which can be dierentiated. 8/32 A. Pepelyshev A new method of nonparametric density estimation

9 A new estimate A new estimate Consider values of the EDF at equidistant points as a time series. Apply Singular Spectrum Analysis (SSA) for its smoothing. Golyandina N., Nekrutkin V., Zhigljavsky A. (2001) Analysis of Time Series Structure: SSA and Related Techniques. London: Chapman & Hall/CRC. 9/32 A. Pepelyshev A new method of nonparametric density estimation

10 A new estimate The SSA procedure for a CDF series Transform a series (f 1,..., f N ) to a matrix X as X=(x i,j )=(f i+j 1 ), i=1,..., L, j =1,..., N L+1. Make the SVD decomposition X = L i=1 λi U i V T i, λ 1 λ X = f 1 f 2 f 3 f N L+1 f 2 f 3 f 4 f N L+2 f 3 f 4 f 5 f N L+3... f L f L+1 f L+2 f N. 10/32 A. Pepelyshev A new method of nonparametric density estimation

11 A new estimate The SSA procedure for a CDF series Transform a series (f 1,..., f N ) to a matrix X as X=(x i,j )=(f i+j 1 ), i=1,..., L, j =1,..., N L+1. Make the SVD decomposition X = L λi U i V T i, λ 1 λ i=1 Extract the r leading components X (r) = ( x (r) i,j ) = r i=1 λi U i V T i. Transform X (r) back to the series ˆf j = ˆf j (L, r) = 0 1 j < L, 1 L x (r) k,j k+1 L L j K, k=1 1 K < j N, 11/32 A. Pepelyshev A new method of nonparametric density estimation

12 A new estimate Representation of the SSA procedure ˆf j = L i=1 L (u 1,i u 1,l u r,i u r,l )f j+i l /L l=1 for j = L,..., K, where (u l,1,..., u l,l ) T = U l. This is a data-adaptive lter. L is half of the length of the lter. r is a complexity of the lter. L controls the smoothness. 12/32 A. Pepelyshev A new method of nonparametric density estimation

13 A new estimate Selection of points t j for time series Choose δ such that and dene t j as f j = F m (t j ), t j+1 t j = δ δ 1 m max x p(x) t j = δ(j 2L) + min x i, i=1,...,m j = 1,..., N, N = 2L max + 4L, maxi x i min i x i L max =. 2δ 13/32 A. Pepelyshev A new method of nonparametric density estimation

14 A new estimate Inuence of the parameter L Measurements of Lean Body Mass from the Australian Institute of Sport Data The SSA 1c estimate (i.e. with r = 1) of the distribution function and the density for L=20,..., 60 14/32 A. Pepelyshev A new method of nonparametric density estimation

15 A new estimate Inuence of the parameter L Measurements of Lean Body Mass from the Australian Institute of Sport Data The SSA 1c estimate (i.e. with r = 1) of the distribution function and the density for L=20,..., 60 15/32 A. Pepelyshev A new method of nonparametric density estimation

16 A new estimate Specic SSA estimates SSA 1c : the SSA estimate with r = 1 SSA 2c : the SSA estimate with r = 2 SSA b : the SSA estimate with r = 1 and bias correction 16/32 A. Pepelyshev A new method of nonparametric density estimation

17 A new estimate Validity of the SSA procedure The estimated series ( ˆf 1,..., ˆf N ) can be transformed to the SSA estimate of F (x) as follows ˆF m (x) = N ( j=2 ˆf j 1 + ( ˆf j ˆf j 1 ) x t ) j 1 1 [tj 1,t t j t j )(x) j 1 +1 [tn, )(x), Lemma. The SSA 1c estimate is a valid distribution function. 17/32 A. Pepelyshev A new method of nonparametric density estimation

18 A new estimate The SSA b estimate Let S(F ) be a lter (weighted moving average). Our case: S(F ) is the SSA 1c estimate. The problem: this lter has a bias. The idea is to correct S(F ) as an estimate F by use of S(S(F )) and S(S(S(F ))). S b (F) = 3S(F) 3S 2 (F) + S 3 (F) 18/32 A. Pepelyshev A new method of nonparametric density estimation

19 A new estimate Simulation results for dierent L The performance of SSA estimates for samples of size 100, the replication number is ED ISE ED KS ED H model 0.4N(0, 1) + 0.6N(5, 2 2 ) Kernel est. with h LSCV SSA 1c est. with L= SSA 1c est. with L= SSA 1c est. with L= SSA 2c est. with L= SSA 2c est. with L= SSA 2c est. with L= SSA b est. with L= SSA b est. with L= SSA b est. with L= /32 A. Pepelyshev A new method of nonparametric density estimation

20 The automatic choice of a smoothing parameter How to choose L automatically A law of the iterated logarithm. Glivenko-Cantelli result The empirical distribution function F m (x) satises almost surely, where R m = lim sup R Fm m (x) F (x) 1 m 2 m 2 ln ln m. If an estimate is far from the EDF, then this estimate is not good. 20/32 A. Pepelyshev A new method of nonparametric density estimation

21 The automatic choice of a smoothing parameter How to choose L automatically 1. Compute L=max { L : F m ˆF m (l) } 1/R m l {1,..., L}. 2. Compute the sequence M 1,..., M L, where M j is the number of modes of estimated density for L = j. This sequence has decrasing tendency. 3. Compute ˇM j = min{m 1, M 2,..., M j }. 21/32 A. Pepelyshev A new method of nonparametric density estimation

22 The automatic choice of a smoothing parameter How to choose L automatically 4. Divide the set {1, 2,..., L} into groups such that the values ˇM j are equal to each other within each group, i.e. dene a 1, b 1,..., a k, b k and k such that 1 = a 1 b 1 <... < a k b k = L, a i+1 = b i + 1 and ˇM j = ˇM j for all i, j {a l,..., b l }, l {1,..., k}. 22/32 A. Pepelyshev A new method of nonparametric density estimation

23 The automatic choice of a smoothing parameter How to choose L automatically 5. Finally, compute L a (a 'best' value of L) as an average with weight coecients, which are proportional to the sizes of these k groups, namely k L a = c i w i, where i=1 c i = γ i a i + (1 γ i )b i, w i = b i a i k j=1 b j a j, γ i = 1/2 if ˇMai = 1 and γ i = 0.9 otherwise. 23/32 A. Pepelyshev A new method of nonparametric density estimation

24 The automatic choice of a smoothing parameter How to choose the bandwidth 1. Compute h=max { h : Fm ˆF m ( ) } 1/R m (0, h). 2. Dene a dense set {h 1,..., h n } (0, h) and compute the sequence M 1,..., M n, where M j be the number of modes of estimated density for h = h j. 3. Compute ˇM j = min{m 1, M 2,..., M j }. 4. Divide the set {h 1,..., h n } into groups as earlier. 5. Compute h a = k i=1 c iw i, where c i and w i are dened in the same manner. 24/32 A. Pepelyshev A new method of nonparametric density estimation

25 The automatic choice of a smoothing parameter Consistency Lemma. The kernel estimate with h a and any SSA estimate with L a of the distribution function are consistent. 25/32 A. Pepelyshev A new method of nonparametric density estimation

26 The automatic choice of a smoothing parameter Simulation study Consider several models of the density p(x) and simulate samples of size 100. D ISE (ˆp, p) = (ˆp(x) p(x) ) 2dx D KS ( ˆF, F ) = ˆF F = max ˆF (x) F (x) x ( ) 2dx D H (ˆp, p) = ˆp(x) p(x) 26/32 A. Pepelyshev A new method of nonparametric density estimation

27 The automatic choice of a smoothing parameter Results on bandwidth selection ED ISE ED KS ED H model N(0, 1) Kernel est. with h LSCV Kernel est. with h SJPI Kernel est. with h ICV Kernel est. with h a model 0.4N(0, 1) + 0.6N(5, 2 2 ) Kernel est. with h LSCV Kernel est. with h SJPI Kernel est. with h ICV Kernel est. with h a model 0.3N(0, 1) + 0.7N(15, 4 2 ) Kernel est. with h LSCV Kernel est. with h SJPI Kernel est. with h ICV Kernel est. with h a /32 A. Pepelyshev A new method of nonparametric density estimation

28 The automatic choice of a smoothing parameter Results on the SSA estimates ED ISE ED KS ED H EL a model N(0, 1) Kernel est. with h a SSA 1c est SSA 2c est SSA b est model 0.4N(0, 1) + 0.6N(5, 2 2 ) Kernel est. with h a SSA 1c est SSA 2c est SSA b est model 0.3N(0, 1) + 0.7N(15, 4 2 ) Kernel est. with h a SSA 1c est SSA 2c est SSA 3c est SSA b est /32 A. Pepelyshev A new method of nonparametric density estimation

29 Application Application in photovoltaics In quality control, acceptance sampling is used to determine whether to accept or reject a large production lot by checking a small number of items. A sampling plan consists of the number n of items to be measured and the critical value c such that a lot is rejected if an appropriate statistic is larger than c. 29/32 A. Pepelyshev A new method of nonparametric density estimation

30 Application in photovoltaics Sampling plans for quality control Let measurements be distributed according to a distribution G(x) with mean a and variance σ 2. The asymptotically optimal sampling plan (n, c) is ( Φ 1 (α) Φ 1 (1 β) ) 2 n = ( F 1 (AQL) F 1 (RQL) ) 2, n ( c = F 1 (AQL) + F 1 (RQL) ), 2 where F (x) = G((x a)/σ), AQL is the acceptable quality level, RQL is the rejectable quality level, α is the producer risk and β is the consumer risk. 30/32 A. Pepelyshev A new method of nonparametric density estimation

31 Application in photovoltaics Comparison of sampling plans Means and standard deviations of sampling plan size using the empirical distribution function, the kernel density estimate and the SSA 1c, SSA b and SSA 2c estimates for samples of size m from 0.4N(0, 1) + 0.6N(5, 2 2 ). The true sampling plan size is 392. m = 250 m = 500 m = 1000 EDF 469.7(468.9) 462.7(278.9) 422.0(160.1) Kernel h LSCV 322.3(102.8) 324.0(79.7) 337.8(59.2) Kernel h a 302.1(77.2) 321.2(65.2) 339.1(52.8) SSA 1c est (71.8) 316.3(61.3) 329.1(46.9) SSA b est (112.9) 408.9(88.8) 403.6(67.7) SSA 2c est (107.8) 393.1(81.4) 392.8(62.1) 31/32 A. Pepelyshev A new method of nonparametric density estimation

32 Application in photovoltaics Thank you for your attention! 32/32 A. Pepelyshev A new method of nonparametric density estimation

Estimation of the quantile function using Bernstein-Durrmeyer polynomials

Estimation of the quantile function using Bernstein-Durrmeyer polynomials Andrey Pepelyshev Ansgar Steland Ewaryst Rafaj lowic The work is supported by the German Federal Ministry of the Environment, Nature