
UNIVERSITATIS IAGELLONICAE ACTA MATHEMATICA, FASCICULUS XLIII, 2005

ON SOME TWO-STEP DENSITY ESTIMATION METHOD

by Jolanta Jarnicka

Abstract. We introduce a new two-step kernel density estimation method, based on the EM algorithm and the generalized kernel density estimator. The accuracy obtained is better, in particular, in the case of multimodal or skewed densities.

1. Introduction. Density estimation has been investigated in many papers. In particular, the kernel estimation method, introduced in [11], has received much attention. Let us recall some problems connected with kernel density estimation. Let $X_i : \Omega \to \mathbb{R}$, $i = 1, \ldots, n$, be a sequence of random variables defined on a probability space $(\Omega, \mathcal{F}, P)$. Suppose that they are identically and independently distributed with an unknown density $f$, and that $\{x_i\}$ is the sequence of corresponding observations. The kernel density estimator is characterized by two components: the bandwidth $h(n)$ and the kernel $K$.

Definition 1.1. Let $\{h(n)\}_{n=1}^{\infty} \subset (0, +\infty)$ with $h(n) \to 0$. Let $K : \mathbb{R} \to \mathbb{R}$ be a measurable nonnegative function. The kernel density estimator is given by the formula
\[
(1.1)\qquad \hat f_h(x) = \frac{1}{n\,h(n)} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h(n)}\right).
\]
We call $h$ the bandwidth (window width or smoothing parameter). We call $K$ the kernel.

Statistical properties of this estimator, like biasedness, asymptotical unbiasedness, and efficiency, were presented in [11], [10], and [13].
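A minimal numerical sketch of the estimator (1.1). The sample, the Gaussian kernel, and the bandwidth value below are illustrative assumptions, not part of the paper.

```python
import numpy as np

def gaussian_kernel(t):
    """Gaussian kernel K(t) = exp(-t^2/2) / sqrt(2*pi)."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def kde(x, sample, h, kernel=gaussian_kernel):
    """Kernel density estimator (1.1): average of scaled kernels centered at the observations."""
    x = np.atleast_1d(x)
    # shape (len(x), n): kernel evaluated at (x - x_i) / h
    u = (x[:, None] - sample[None, :]) / h
    return kernel(u).mean(axis=1) / h

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(size=200)          # illustrative i.i.d. sample
    grid = np.linspace(-4, 4, 9)
    print(kde(grid, sample, h=0.4))
```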

The kernel determines the regularity, symmetry about zero, and shape of the estimator, while the bandwidth determines the amount of smoothing. In particular, $\hat f_h$ is a density, provided that $K \geq 0$ and $\int_{\mathbb{R}} K(t)\,dt = 1$. Considerable research has been carried out on the question of how one should select $K$ in order to optimize the properties of the kernel density estimator $\hat f_h$ (see e.g. [5], [4], and [3]) and numerous examples have been discussed. A suggested choice is a density symmetric about zero with finite variance, like e.g. the Gaussian kernel $K(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$, the Epanechnikov kernel $K(t) = \frac{3}{4\sqrt{5}}\left(1 - \frac{t^2}{5}\right)$ for $|t| < \sqrt{5}$, or the rectangular kernel $K(t) = \frac{1}{2}$ for $|t| < 1$ (see e.g. [3] or [4] for more examples). But in some situations nonsymmetric kernels, e.g. the exponential kernel $K(t) = e^{-t}$ for $t > 0$, guarantee better results. A comparative study of this problem was carried out in [7].

The main problem of kernel density estimation is still the choice of the bandwidth $h(n)$. Various measures of accuracy of the estimator $\hat f_h$ have been used to obtain the optimal bandwidth (see e.g. [3], [13]). We will here focus on the mean integrated squared error, given by
\[
\mathrm{MISE}(\hat f_h) = E \int_{\mathbb{R}} \big(\hat f_h(x) - f(x)\big)^2\,dx,
\]
and its asymptotic version for $h(n) \to 0$ and $n\,h(n) \to \infty$ as $n \to \infty$, often abbreviated as $\mathrm{AMISE}(\hat f_h)$ (see [5], [8]). The mean integrated squared error can be written as the sum of the integrated squared bias and the integrated variance of $\hat f_h$:
\[
\mathrm{MISE}(\hat f_h) = \int_{\mathbb{R}} \big(E\hat f_h(x) - f(x)\big)^2\,dx + \int_{\mathbb{R}} V\big(\hat f_h(x)\big)\,dx.
\]
We want to keep both the bias and the variance possibly small, so the optimal bandwidth $h(n)$ should be chosen so that MISE is minimal. Under additional assumptions on the density $f$ (we assume that $f$ is twice differentiable and that $\int_{\mathbb{R}} (f''(x))^2\,dx < \infty$), and for a kernel being a symmetric density with finite variance, we can provide asymptotic approximations for the bias and the variance of $\hat f_h$ (see [13]). Having obtained the formula for the asymptotic mean integrated squared error, we get the optimal bandwidth $h(n)$ in the following form:
\[
(1.2)\qquad h_{\mathrm{opt}}(n) = \left( \frac{\int_{\mathbb{R}} K^2(t)\,dt}{\left(\int_{\mathbb{R}} t^2 K(t)\,dt\right)^2 \int_{\mathbb{R}} (f''(x))^2\,dx} \right)^{1/5} n^{-1/5}.
\]
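To illustrate formula (1.2): for a known kernel the constants $\int K^2$ and $\int t^2 K$ can be computed numerically, and only $\int (f'')^2$ has to be supplied. The sketch below evaluates (1.2) for the Gaussian kernel in the (hypothetical) case where $f$ is itself a standard normal density, for which $\int (f''(x))^2\,dx = 3/(8\sqrt{\pi})$; all function names are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad

def h_opt(n, kernel, f_second_deriv_sq_integral):
    """Optimal bandwidth (1.2) for a kernel with finite second moment."""
    R_K, _ = quad(lambda t: kernel(t) ** 2, -np.inf, np.inf)        # int K^2(t) dt
    mu2, _ = quad(lambda t: t ** 2 * kernel(t), -np.inf, np.inf)    # int t^2 K(t) dt
    return (R_K / (mu2 ** 2 * f_second_deriv_sq_integral)) ** 0.2 * n ** -0.2

gauss = lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)
# For f = standard normal, int (f'')^2 dx = 3 / (8 * sqrt(pi)).
print(h_opt(200, gauss, 3 / (8 * np.sqrt(np.pi))))   # approx 1.06 * n**(-1/5)
```

With the sample standard deviation in place of the reference value this reduces to the rule of thumb $1.06\,\hat\sigma\,n^{-1/5}$ recalled below.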

Although the integrals $\int_{\mathbb{R}} K^2(t)\,dt$ and $\int_{\mathbb{R}} t^2 K(t)\,dt$ can be calculated if $K$ is known, formula (1.2) depends on the unknown density $f$, and hence, in practice, the ability to calculate $h_{\mathrm{opt}}(n)$ depends on obtaining an estimate of $\int_{\mathbb{R}} (f''(x))^2\,dx$. Several ways to solve this problem have appeared in the literature. One of the simplest methods, suggested in [13] and often called Silverman's rule of thumb, is to assume that $f$ is known. For example, assuming that $f$ is the density of a normal distribution with parameters $\mu$ and $\sigma^2$, and using the Gaussian kernel, we obtain $h_{\mathrm{opt}}(n) = 1.06\,\sigma\,n^{-1/5}$, where $\sigma$ can be estimated by the sample standard deviation. Other well-known methods (beside those giving automatic procedures for selecting the bandwidth, like cross-validation, see e.g. [13]) are based on the idea of constructing a pilot function and then using it in the actual estimation (see e.g. [3], [9], [1], [12], [15], and [6]). The problem is that none of these methods guarantees good results for a wide class of estimated density functions.

In this paper we propose a new two-step kernel density estimation method, which generalizes methods considered in [3] and [6]. Its practical application is in fact based on the solution of an optimization problem. In Section 2 we present the idea of the two-step method and prove some basic properties of the proposed generalized kernel density estimator. Section 3 is dedicated to the problem of the choice of the optimal bandwidths $\{h_j(n)\}_{j=1}^m$. In Section 4 we propose an algorithm which enables us to construct a pilot. The last section contains a discussion of an example which presents the possibilities of our method. We emphasize that we have carried out a large number of experiments and computer simulations in order to check the efficiency of the new method in comparison with the methods mentioned. According to the statistical analysis, we can state that in some cases (e.g. when estimating bimodal or skewed densities) our method gives better results than other known methods, and in numerous cases its efficiency is comparable to that of the methods mentioned (see [7]).

2. The idea of the two-step method. Let $(\Omega, \mathcal{F}, P)$ be a probability space, $X_i : \Omega \to \mathbb{R}$, $i = 1, 2, \ldots, n$, a sequence of random variables, and suppose they are independently and identically distributed. We assume that $f$ is the unknown density function of $X_i$ and $\{x_i\}_{i=1}^n$ is the sequence of observations on $X_i$. Consider a family of functions $\{\phi_j(x)\}_{j=1}^m$ such that
\[
0 \leq \phi_j(x) \leq 1, \qquad \sum_{j=1}^{m} \phi_j(x) = 1, \qquad x \in \mathbb{R},
\]

and a sequence of corresponding bandwidths $\{h_j(n)\}_{j=1}^m$, such that $h_j(n) > 0$ and $h_j(n) \to 0$, $j = 1, \ldots, m$. Assume also that the kernel $K : \mathbb{R} \to \mathbb{R}$ is a nonnegative measurable function.

Definition 2.1. We define the generalized kernel density estimator as the function given by
\[
(2.1)\qquad \hat f_\phi(x) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\phi_j(x_i)}{h_j(n)}\, K\!\left(\frac{x - x_i}{h_j(n)}\right).
\]

The proposed method consists of two steps:
1. We choose a pilot function $f_0$ by parametric estimation, using the EM algorithm (see Section 4), under the assumption that $f_0$ is given as follows
\[
f_0(x) = \sum_{j} \alpha_j f_j(x), \qquad \text{where} \quad f_j(x) = \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x - \mu_j)^2}{2\sigma_j^2}}, \quad j = 1, \ldots, m.
\]
2. Considering the generalized kernel density estimator (nonparametric estimation), we choose a sequence of bandwidths $h_j(n) > 0$, $h_j(n) \to 0$ for $n \to \infty$, $j = 1, \ldots, m$, using the pilot calculated in step 1, by taking
\[
(2.2)\qquad \phi_j(x) = \frac{\alpha_j f_j(x)}{f_0(x)} = \frac{\alpha_j f_j(x)}{\sum_{k=1}^{m} \alpha_k f_k(x)}.
\]

Remark 2.2. The first modification with respect to the traditional kernel estimation method is the fact that we choose a sequence of bandwidths $h_j(n)$, $j = 1, \ldots, m$, instead of a single value $h(n)$. This may cause some complication in the calculations but guarantees a better accuracy of the estimator.

Remark 2.3. Formula (2.1) corresponds to the methods described in [1], [3], and [5]. The idea of constructing the pilot function by parametric estimation comes from [6].

Remark 2.4. Note that for $m = 1$, taking $\phi_1 \equiv 1$, we get the kernel density estimator (1.1) with the bandwidth $h_1(n)$.
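A sketch of the estimator (2.1) with the weights $\phi_j$ of (2.2) computed from a two-component normal pilot; the pilot parameters, bandwidth values, and all names are illustrative assumptions, not the paper's fitted values.

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def phi_weights(xi, alphas, mus, sigma2s):
    """Weights (2.2): phi_j(x) = alpha_j f_j(x) / f_0(x)."""
    comps = np.array([a * normal_pdf(xi, m, s2) for a, m, s2 in zip(alphas, mus, sigma2s)])
    return comps / comps.sum(axis=0)

def generalized_kde(x, sample, h, alphas, mus, sigma2s,
                    kernel=lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)):
    """Generalized kernel estimator (2.1): each x_i contributes through m kernels
    with bandwidths h_j, weighted by phi_j(x_i)."""
    x = np.atleast_1d(x)
    phi = phi_weights(sample, alphas, mus, sigma2s)          # shape (m, n)
    out = np.zeros_like(x, dtype=float)
    for j, h_j in enumerate(h):
        u = (x[:, None] - sample[None, :]) / h_j
        out += (phi[j] * kernel(u)).sum(axis=1) / h_j
    return out / len(sample)

rng = np.random.default_rng(2)
sample = np.concatenate([rng.normal(-2, 0.5, 60), rng.normal(1, 1.0, 140)])
est = generalized_kde(np.linspace(-4, 4, 9), sample,
                      h=[0.3, 0.5], alphas=[0.3, 0.7], mus=[-2.0, 1.0], sigma2s=[0.25, 1.0])
print(est)
```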

Remark 2.5. As in the case of the kernel density estimator (1.1), the kernel $K$ has an essential effect on the properties of $\hat f_\phi$. In particular, if $K$ is a density, so that $K(x) \geq 0$ and $\int_{\mathbb{R}} K(x)\,dx = 1$, then also $\hat f_\phi(x) \geq 0$ and
\[
\int_{\mathbb{R}} \hat f_\phi(x)\,dx = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \phi_j(x_i) \int_{\mathbb{R}} \frac{1}{h_j(n)}\, K\!\left(\frac{x - x_i}{h_j(n)}\right) dx = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \phi_j(x_i) \int_{\mathbb{R}} K(t)\,dt = 1,
\]
where we put $\frac{x - x_i}{h_j(n)} = t$ for a fixed $j$ and use $\sum_{j=1}^{m} \phi_j(x_i) = 1$.

2.1. Properties of the generalized kernel density estimator. We will now present some statistical properties of $\hat f_\phi$ and then use its asymptotic behaviour in a criterion for the choice of $h_j(n)$, $j = 1, \ldots, m$. Under the above assumptions and notation, the following lemma is true.

Lemma 2.6. Assume that
\[
E\left|\phi_j(X_i)\, K\!\left(\frac{x - X_i}{h_j(n)}\right)\right| < \infty \quad \text{and} \quad E\left|\phi_j(X_i)\phi_k(X_i)\, K\!\left(\frac{x - X_i}{h_j(n)}\right) K\!\left(\frac{x - X_i}{h_k(n)}\right)\right| < \infty, \qquad j, k = 1, \ldots, m.
\]
Then
\[
(2.3)\qquad E\hat f_\phi(x) = \sum_{j=1}^{m} \int_{\mathbb{R}} K(y)\, \phi_j(x + h_j(n)y)\, f(x + h_j(n)y)\,dy,
\]
\[
(2.4)\qquad V\hat f_\phi(x) = \frac{1}{n} \sum_{j,k=1}^{m} \frac{1}{h_j(n) h_k(n)} \int_{\mathbb{R}} K\!\left(\frac{x - y}{h_j(n)}\right) K\!\left(\frac{x - y}{h_k(n)}\right) \phi_j(y)\phi_k(y) f(y)\,dy - \frac{1}{n} \left( \sum_{j=1}^{m} \int_{\mathbb{R}} K(y)\, \phi_j(x + h_j(n)y)\, f(x + h_j(n)y)\,dy \right)^{2}.
\]

Proof. Fix $n \in \mathbb{N}$. Since $K$ is a measurable nonnegative function and $\{X_i\}_{i=1}^n$ is a sequence of identically and independently distributed random variables, $\left\{\sum_{j=1}^{m} \frac{\phi_j(X_i)}{h_j(n)} K\!\left(\frac{x - X_i}{h_j(n)}\right)\right\}_{i=1}^{n}$ forms a sequence of identically and independently distributed random variables. Thus
\[
E\hat f_\phi(x) = \frac{1}{n} \sum_{i=1}^{n} E\left[\sum_{j=1}^{m} \frac{\phi_j(X_i)}{h_j(n)} K\!\left(\frac{x - X_i}{h_j(n)}\right)\right] = \sum_{j=1}^{m} E\left[\frac{\phi_j(X_i)}{h_j(n)} K\!\left(\frac{x - X_i}{h_j(n)}\right)\right].
\]

Observe that, for every $j = 1, \ldots, m$,
\[
E\left[\frac{\phi_j(X_i)}{h_j(n)} K\!\left(\frac{x - X_i}{h_j(n)}\right)\right] = \frac{1}{h_j(n)} \int_{\mathbb{R}} K\!\left(\frac{x - t}{h_j(n)}\right) \phi_j(t) f(t)\,dt = \int_{\mathbb{R}} K(y)\, \phi_j(x + h_j(n)y)\, f(x + h_j(n)y)\,dy,
\]
by a linear change of variables. Hence, summing both sides over $j = 1, \ldots, m$, we get the expected value of the generalized kernel estimator,
\[
E\hat f_\phi(x) = \sum_{j=1}^{m} \int_{\mathbb{R}} K(y)\, \phi_j(x + h_j(n)y)\, f(x + h_j(n)y)\,dy.
\]
To prove (2.4), we use the equation $V\hat f_\phi(x) = E\hat f_\phi^{\,2}(x) - \big(E\hat f_\phi(x)\big)^2$. There is
\[
V\hat f_\phi(x) = \frac{1}{n^2} \sum_{i=1}^{n} \left\{ E\left[\sum_{j=1}^{m} \frac{\phi_j(X_i)}{h_j(n)} K\!\left(\frac{x - X_i}{h_j(n)}\right) \sum_{k=1}^{m} \frac{\phi_k(X_i)}{h_k(n)} K\!\left(\frac{x - X_i}{h_k(n)}\right)\right] - \left[\sum_{j=1}^{m} E\left(\frac{\phi_j(X_i)}{h_j(n)} K\!\left(\frac{x - X_i}{h_j(n)}\right)\right)\right]^{2} \right\}.
\]

Thanks to the fact that the $X_i$, $i = 1, \ldots, n$, are identically and independently distributed, the first term can be written as
\[
\frac{1}{n^2} \sum_{i=1}^{n} E\left[\sum_{j=1}^{m} \frac{\phi_j(X_i)}{h_j(n)} K\!\left(\frac{x - X_i}{h_j(n)}\right) \sum_{k=1}^{m} \frac{\phi_k(X_i)}{h_k(n)} K\!\left(\frac{x - X_i}{h_k(n)}\right)\right] = \frac{1}{n} \sum_{j=1}^{m} \sum_{k=1}^{m} \frac{1}{h_j(n) h_k(n)} \int_{\mathbb{R}} K\!\left(\frac{x - y}{h_j(n)}\right) K\!\left(\frac{x - y}{h_k(n)}\right) \phi_j(y)\phi_k(y) f(y)\,dy.
\]
Hence the variance of the generalized kernel estimator is given by
\[
V\hat f_\phi(x) = \frac{1}{n} \sum_{j,k=1}^{m} \frac{1}{h_j(n) h_k(n)} \int_{\mathbb{R}} K\!\left(\frac{x - y}{h_j(n)}\right) K\!\left(\frac{x - y}{h_k(n)}\right) \phi_j(y)\phi_k(y) f(y)\,dy - \frac{1}{n} \left( \sum_{j=1}^{m} \int_{\mathbb{R}} K(y)\, \phi_j(x + h_j(n)y)\, f(x + h_j(n)y)\,dy \right)^{2},
\]
which completes the proof.

Remark 2.7. Taking $m = 1$ and $\phi_1 \equiv 1$ in Lemma 2.6, we obtain the formulae for the expected value and variance of the traditional kernel estimator $\hat f_h$. They can be found, for example, in [13].

In order to show further statistical properties of $\hat f_\phi$, we use the following lemma (see [10] for a similar result for the traditional kernel estimator).

Lemma 2.8. Under the above assumptions, let $K$ be a bounded and integrable function such that $\lim_{|t| \to \infty} |t K(t)| = 0$, and let $h_j(n) > 0$, $h_j(n) \to 0$ for $n \to \infty$ and $j = 1, \ldots, m$. Then, for every point $x$ of continuity of $f$,
\[
\sum_{j=1}^{m} E\left[\frac{\phi_j(X_i)}{h_j(n)} K\!\left(\frac{x - X_i}{h_j(n)}\right)\right] \longrightarrow f(x) \int_{\mathbb{R}} K(t)\,dt, \qquad n \to \infty.
\]

Proof. Our goal is to show that, for every $j = 1, \ldots, m$,
\[
\lim_{n \to \infty} \frac{1}{h_j(n)} \int_{\mathbb{R}} \big( f(x + \zeta)\phi_j(x + \zeta) - f(x)\phi_j(x + \zeta) \big)\, K\!\left(\frac{\zeta}{h_j(n)}\right) d\zeta = 0.
\]

For a fixed $j$, there is
\[
\left| \frac{1}{h_j(n)} \int_{\mathbb{R}} \big( f(x + \zeta)\phi_j(x + \zeta) - f(x)\phi_j(x + \zeta) \big)\, K\!\left(\frac{\zeta}{h_j(n)}\right) d\zeta \right| \leq \frac{1}{h_j(n)} \int_{\mathbb{R}} \big| f(x + \zeta) - f(x) \big|\, \phi_j(x + \zeta)\, K\!\left(\frac{\zeta}{h_j(n)}\right) d\zeta,
\]
for $K$ is nonnegative. Since $0 \leq \phi_j(x + \zeta) \leq 1$, the right-hand side does not exceed
\[
\frac{1}{h_j(n)} \int_{\mathbb{R}} \big| f(x + \zeta) - f(x) \big|\, K\!\left(\frac{\zeta}{h_j(n)}\right) d\zeta.
\]
Take an arbitrary $\delta > 0$. Splitting the integration interval into two parts, $\{\zeta : |\zeta| \geq \delta\}$ and $\{\zeta : |\zeta| < \delta\}$, and applying the triangle inequality to the first term, we get
\[
\frac{1}{h_j(n)} \int_{\{|\zeta| \geq \delta\}} \big| f(x + \zeta) - f(x) \big|\, K\!\left(\frac{\zeta}{h_j(n)}\right) d\zeta \leq \frac{1}{h_j(n)} \int_{\{|\zeta| \geq \delta\}} f(x + \zeta)\, K\!\left(\frac{\zeta}{h_j(n)}\right) d\zeta + \frac{f(x)}{h_j(n)} \int_{\{|\zeta| \geq \delta\}} K\!\left(\frac{\zeta}{h_j(n)}\right) d\zeta.
\]

Since $K$ is bounded and $f$ is a density, we get
\[
\frac{1}{h_j(n)} \int_{\{|\zeta| \geq \delta\}} f(x + \zeta)\, K\!\left(\frac{\zeta}{h_j(n)}\right) d\zeta = \int_{\{|\zeta| \geq \delta\}} f(x + \zeta)\, \frac{\zeta}{h_j(n)} K\!\left(\frac{\zeta}{h_j(n)}\right) \frac{d\zeta}{\zeta} \leq \frac{1}{\delta} \sup_{|y| \geq \delta/h_j(n)} |y K(y)|
\]
and
\[
\frac{f(x)}{h_j(n)} \int_{\{|\zeta| \geq \delta\}} K\!\left(\frac{\zeta}{h_j(n)}\right) d\zeta = f(x) \int_{\{|z| \geq \delta/h_j(n)\}} K(z)\,dz.
\]
Hence, for $n \to \infty$, the first term tends to $0$:
\[
\frac{1}{h_j(n)} \int_{\{|\zeta| \geq \delta\}} \big| f(x + \zeta) - f(x) \big|\, K\!\left(\frac{\zeta}{h_j(n)}\right) d\zeta \leq \frac{1}{\delta} \sup_{|y| \geq \delta/h_j(n)} |y K(y)| + f(x) \int_{\{|z| \geq \delta/h_j(n)\}} K(z)\,dz \longrightarrow 0,
\]
since $h_j(n) \to 0$, $\lim_{|t| \to \infty} |t K(t)| = 0$, and $K$ is integrable. From the continuity of $f$ at $x$, the second integral may be estimated as follows:
\[
\frac{1}{h_j(n)} \int_{\{|\zeta| < \delta\}} \big| f(x + \zeta) - f(x) \big|\, K\!\left(\frac{\zeta}{h_j(n)}\right) d\zeta \leq \varepsilon_\delta\, \frac{1}{h_j(n)} \int_{\{|\zeta| < \delta\}} K\!\left(\frac{\zeta}{h_j(n)}\right) d\zeta = \varepsilon_\delta \int_{\{|y| < \delta/h_j(n)\}} K(y)\,dy \leq \varepsilon_\delta \int_{\mathbb{R}} K(y)\,dy,
\]
where $\varepsilon_\delta = \sup_{|\zeta| < \delta} |f(x + \zeta) - f(x)|$ can be made arbitrarily small by the choice of $\delta$. Thus, for every $j = 1, \ldots, m$,
\[
E\left[\frac{\phi_j(X_i)}{h_j(n)} K\!\left(\frac{x - X_i}{h_j(n)}\right)\right] \longrightarrow f(x)\,\phi_j(x) \int_{\mathbb{R}} K(t)\,dt, \qquad n \to \infty.
\]
After summing both sides over $j$, the proof is completed.

Remark 2.9. Under the additional assumption that $\int_{\mathbb{R}} K(t)\,dt = 1$, we have $E\hat f_\phi(x) - f(x) \to 0$ as $n \to \infty$, and hence the generalized kernel estimator is asymptotically unbiased. Provided that, for every $j = 1, \ldots, m$, $n\,h_j(n) \to \infty$ for $n \to \infty$, we obtain $V\hat f_\phi(x) \to 0$ as $n \to \infty$.

3. Criterion for the bandwidth choice. As in the case of the traditional kernel estimator (1.1), we will use the asymptotic mean integrated squared error $\mathrm{AMISE}(\hat f_\phi)$ as a criterion for the choice of $h_j(n) > 0$, $h_j(n) \to 0$ for $n \to \infty$, $j = 1, \ldots, m$. We obtain the formula for $\mathrm{AMISE}(\hat f_\phi)$ by asymptotic approximations of $E\hat f_\phi(x) - f(x)$ and $V\hat f_\phi(x)$. We will use a well-known corollary of Taylor's formula.

Corollary 3.1. Suppose that $g$ is an arbitrary function, $(n-1)$-times differentiable in a neighbourhood of $x_0$, and that the $n$-th derivative $g^{(n)}(x_0)$ exists at $x_0$. Then
\[
\forall_{\varepsilon > 0}\ \exists_{\delta > 0}\ \forall_{0 < |\theta| < \delta}: \qquad \left| g(x_0 + \theta) - g(x_0) - \theta g'(x_0) - \ldots - \frac{\theta^n}{n!} g^{(n)}(x_0) \right| < \frac{|\theta|^n}{n!}\,\varepsilon.
\]

Under the assumptions from the last section we will formulate the criterion for the choice of the sequence of bandwidths. From now on, we assume that $K$ is a measurable nonnegative function such that $\int_{\mathbb{R}} t^2 K(t)\,dt < \infty$ and $\int_{\mathbb{R}} K(t)\,dt = 1$. We will consider two cases, a symmetric and a nonsymmetric kernel, starting with the symmetric case.

Theorem 3.2. Under the assumptions from the last section, suppose that the functions $\phi_j f$, $j = 1, \ldots, m$, are differentiable in a neighbourhood of $x$, that the second derivatives $(\phi_j(x) f(x))''$, $j = 1, \ldots, m$, exist at $x$, and that the kernel $K$ satisfies $\int_{\mathbb{R}} t K(t)\,dt = 0$. Then
\[
E\hat f_\phi(x) = f(x) + \frac{1}{2} \sum_{j=1}^{m} h_j^2(n)\, (\phi_j(x) f(x))'' \int_{\mathbb{R}} t^2 K(t)\,dt + o\!\left(\sum_{j=1}^{m} h_j^2(n)\right),
\]
\[
V\hat f_\phi(x) = \frac{1}{n} \sum_{j,k=1}^{m} \frac{1}{h_j(n) h_k(n)} \int_{\mathbb{R}} K\!\left(\frac{x - y}{h_j(n)}\right) K\!\left(\frac{x - y}{h_k(n)}\right) \phi_j(y)\phi_k(y) f(y)\,dy + o\!\left(\frac{1}{n \min_j h_j(n)}\right).
\]

Proof. Let $\varepsilon > 0$. By Corollary 3.1, for every $j = 1, \ldots, m$ there exists a $\delta > 0$ such that, for $0 < |h_j(n) t| < \delta$,
\[
\left| \phi_j(x + h_j(n)t) f(x + h_j(n)t) - \phi_j(x) f(x) - h_j(n) t\, (\phi_j(x) f(x))' - \frac{h_j^2(n) t^2}{2!}\, (\phi_j(x) f(x))'' \right| < \frac{h_j^2(n) t^2}{2!}\,\varepsilon.
\]

Multiplying the inequalities
\[
-\frac{h_j^2(n) t^2}{2!}\,\varepsilon < \phi_j(x + h_j(n)t) f(x + h_j(n)t) - \phi_j(x) f(x) - h_j(n) t\, (\phi_j(x) f(x))' - \frac{h_j^2(n) t^2}{2!}\, (\phi_j(x) f(x))'' < \frac{h_j^2(n) t^2}{2!}\,\varepsilon
\]
by $K(t)$ and integrating with respect to $t$, we obtain
\[
-\frac{h_j^2(n)\,\varepsilon}{2!} \int_{\mathbb{R}} t^2 K(t)\,dt < \int_{\mathbb{R}} \phi_j(x + h_j(n)t) f(x + h_j(n)t) K(t)\,dt - \phi_j(x) f(x) - \frac{h_j^2(n)}{2}\, (\phi_j(x) f(x))'' \int_{\mathbb{R}} t^2 K(t)\,dt < \frac{h_j^2(n)\,\varepsilon}{2!} \int_{\mathbb{R}} t^2 K(t)\,dt,
\]
provided that $K$ is a symmetric density (so that $\int_{\mathbb{R}} t K(t)\,dt = 0$). Summing over $j = 1, \ldots, m$ and using (2.3) together with $\sum_{j=1}^{m} \phi_j(x) = 1$, we get
\[
-\frac{\varepsilon}{2} \sum_{j=1}^{m} h_j^2(n) \int_{\mathbb{R}} t^2 K(t)\,dt < E\hat f_\phi(x) - f(x) - \frac{1}{2} \sum_{j=1}^{m} h_j^2(n)\, (\phi_j(x) f(x))'' \int_{\mathbb{R}} t^2 K(t)\,dt < \frac{\varepsilon}{2} \sum_{j=1}^{m} h_j^2(n) \int_{\mathbb{R}} t^2 K(t)\,dt.
\]
Since $h_j(n) > 0$, $j = 1, \ldots, m$, denoting $\mu_2 = \int_{\mathbb{R}} t^2 K(t)\,dt$, we obtain
\[
\left| E\hat f_\phi(x) - f(x) - \frac{\mu_2}{2} \sum_{j=1}^{m} h_j^2(n)\, (\phi_j(x) f(x))'' \right| < \frac{\varepsilon\,\mu_2}{2} \sum_{j=1}^{m} h_j^2(n).
\]
Observe that $h_k^2(n) \leq \sum_{j=1}^{m} h_j^2(n)$ and $h_j(n) \to 0$ as $n \to \infty$ for $j = 1, \ldots, m$. Since $\varepsilon > 0$ is arbitrary, we get
\[
E\hat f_\phi(x) - f(x) - \frac{\mu_2}{2} \sum_{j=1}^{m} h_j^2(n)\, (\phi_j(x) f(x))'' = o\!\left(\sum_{j=1}^{m} h_j^2(n)\right).
\]

Now, since
\[
V\hat f_\phi(x) = \frac{1}{n^2} \sum_{i=1}^{n} \left\{ E\left[\sum_{j=1}^{m} \frac{\phi_j(X_i)}{h_j(n)} K\!\left(\frac{x - X_i}{h_j(n)}\right) \sum_{k=1}^{m} \frac{\phi_k(X_i)}{h_k(n)} K\!\left(\frac{x - X_i}{h_k(n)}\right)\right] - \left[\sum_{j=1}^{m} E\left(\frac{\phi_j(X_i)}{h_j(n)} K\!\left(\frac{x - X_i}{h_j(n)}\right)\right)\right]^{2} \right\},
\]
applying Lemma 2.8 and Remark 2.9 we conclude that the second term is $o\!\left(\frac{1}{n \min_j h_j(n)}\right)$, while, by (2.4), the first term equals
\[
\frac{1}{n} \sum_{j,k=1}^{m} \frac{1}{h_j(n) h_k(n)} \int_{\mathbb{R}} K\!\left(\frac{x - y}{h_j(n)}\right) K\!\left(\frac{x - y}{h_k(n)}\right) \phi_j(y)\phi_k(y) f(y)\,dy,
\]
which completes the proof.

In the nonsymmetric case we proceed in the following way.

Theorem 3.3. Under the assumptions from the last section, if the functions $\phi_j f$ are differentiable at $x$ for $j = 1, \ldots, m$, and $\int_{\mathbb{R}} t K(t)\,dt \neq 0$ with $\left|\int_{\mathbb{R}} t K(t)\,dt\right| < \infty$, then
\[
E\hat f_\phi(x) = f(x) + \sum_{j=1}^{m} h_j(n)\, (\phi_j(x) f(x))' \int_{\mathbb{R}} t K(t)\,dt + o\!\left(\sum_{j=1}^{m} h_j(n)\right),
\]
\[
V\hat f_\phi(x) = \frac{1}{n} \sum_{j,k=1}^{m} \frac{1}{h_j(n) h_k(n)} \int_{\mathbb{R}} K\!\left(\frac{x - y}{h_j(n)}\right) K\!\left(\frac{x - y}{h_k(n)}\right) \phi_j(y)\phi_k(y) f(y)\,dy + o\!\left(\frac{1}{n \min_j h_j(n)}\right).
\]

Proof. Let $\varepsilon > 0$ be arbitrary. By the Peano formula, for every $j = 1, \ldots, m$ there exists a $\delta > 0$ such that, for $0 < |h_j(n) t| < \delta$,
\[
\left| \phi_j(x + h_j(n)t) f(x + h_j(n)t) - \phi_j(x) f(x) - h_j(n) t\, (\phi_j(x) f(x))' \right| < h_j(n)\,|t|\,\varepsilon.
\]
Multiplying by $K(t)$ and integrating with respect to $t$, we get, by the assumptions on $K$,
\[
\left| \int_{\mathbb{R}} \phi_j(x + h_j(n)t) f(x + h_j(n)t) K(t)\,dt - \phi_j(x) f(x) - h_j(n)\, (\phi_j(x) f(x))'\,\mu_1 \right| < \varepsilon\, h_j(n)\,\mu_1, \qquad \text{where } \mu_1 = \int_{\mathbb{R}} t K(t)\,dt.
\]

Summing over $j = 1, \ldots, m$ and using (2.3), we obtain
\[
\left| E\hat f_\phi(x) - f(x) - \sum_{j=1}^{m} h_j(n)\, (\phi_j(x) f(x))'\,\mu_1 \right| < \varepsilon\,\mu_1 \sum_{j=1}^{m} h_j(n).
\]
Since, for every $j$, $h_j(n) \to 0$ as $n \to \infty$ and $\frac{h_j(n)}{\sum_{k} h_k(n)}$ is bounded, and $\varepsilon > 0$ is arbitrary,
\[
E\hat f_\phi(x) - f(x) - \mu_1 \sum_{j=1}^{m} h_j(n)\, (\phi_j(x) f(x))' = o\!\left(\sum_{j=1}^{m} h_j(n)\right).
\]
In the second part of the proof we can proceed as in the symmetric case (see the proof of Theorem 3.2).

Under the above assumptions, the following corollaries are true.

Corollary 3.4. By Theorem 3.2, if $K$ is symmetric and
\[
\int_{\mathbb{R}} \left| (\phi_j(x) f(x))''\, (\phi_k(x) f(x))'' \right| dx < \infty
\]
for every $j, k = 1, \ldots, m$, then the asymptotic mean integrated squared error is given by
\[
(3.1)\qquad \mathrm{AMISE}(\hat f_\phi) = \sum_{j=1}^{m} \sum_{k=1}^{m} \left[ \frac{1}{4}\, h_j^2(n) h_k^2(n)\,\mu_2^2 \int_{\mathbb{R}} (\phi_j(x) f(x))''\, (\phi_k(x) f(x))''\,dx + \frac{1}{n\, h_k(n)} \int_{\mathbb{R}} K(t)\, K\!\left(\frac{t\, h_j(n)}{h_k(n)}\right) dt \int_{\mathbb{R}} \phi_j(y)\phi_k(y) f(y)\,dy \right],
\]
where $\mu_2 = \int_{\mathbb{R}} t^2 K(t)\,dt$.

Corollary 3.5. By Theorem 3.3, if $K$ is nonsymmetric and
\[
\int_{\mathbb{R}} \left| (\phi_j(x) f(x))'\, (\phi_k(x) f(x))' \right| dx < \infty
\]

for every $j, k = 1, \ldots, m$, then the asymptotic mean integrated squared error is given by
\[
(3.2)\qquad \mathrm{AMISE}(\hat f_\phi) = \sum_{j=1}^{m} \sum_{k=1}^{m} \left[ h_j(n) h_k(n)\,\mu_1^2 \int_{\mathbb{R}} (\phi_j(x) f(x))'\, (\phi_k(x) f(x))'\,dx + \frac{1}{n\, h_k(n)} \int_{\mathbb{R}} K(t)\, K\!\left(\frac{t\, h_j(n)}{h_k(n)}\right) dt \int_{\mathbb{R}} \phi_j(y)\phi_k(y) f(y)\,dy \right],
\]
where $\mu_1 = \int_{\mathbb{R}} t K(t)\,dt$.

Unfortunately, the complexity of formulae (3.1) and (3.2) prevents the derivation of explicit formulae for $h_j(n)$, $j = 1, \ldots, m$; hence the choice of the sequence of bandwidths should be done numerically, by minimization of AMISE for a symmetric or a nonsymmetric kernel, respectively. Obviously, the above formulae get simpler if we know the kernel. For example, in the case of the Gaussian kernel and for fixed $j, k$, we get
\[
\mu_2 = \int_{\mathbb{R}} t^2 K_g(t)\,dt = 1 \quad \text{and} \quad \int_{\mathbb{R}} K_g(t)\, K_g\!\left(\frac{t\, h_j(n)}{h_k(n)}\right) dt = \frac{1}{2\pi} \int_{\mathbb{R}} e^{-\frac{t^2}{2}\left(1 + \frac{h_j^2(n)}{h_k^2(n)}\right)} dt = \frac{h_k(n)}{\sqrt{2\pi\left(h_j^2(n) + h_k^2(n)\right)}}.
\]
Therefore,
\[
\mathrm{AMISE}(\hat f_\phi) = \sum_{j=1}^{m} \sum_{k=1}^{m} \left[ \frac{1}{4}\, h_j^2(n) h_k^2(n) \int_{\mathbb{R}} (\phi_j(x) f(x))''\, (\phi_k(x) f(x))''\,dx + \frac{1}{\sqrt{2\pi}\, n \sqrt{h_j^2(n) + h_k^2(n)}} \int_{\mathbb{R}} \phi_j(x)\phi_k(x) f(x)\,dx \right].
\]
In the case of the rectangular kernel, for fixed $j, k$,
\[
\mu_2 = \int_{\mathbb{R}} t^2 K_p(t)\,dt = \frac{1}{3} \quad \text{and} \quad \int_{\mathbb{R}} K_p(t)\, K_p\!\left(\frac{t\, h_j(n)}{h_k(n)}\right) dt = \frac{\min\{h_j(n), h_k(n)\}}{2\, h_j(n)},
\]
and hence
\[
\mathrm{AMISE}(\hat f_\phi) = \sum_{j=1}^{m} \sum_{k=1}^{m} \left[ \frac{1}{36}\, h_j^2(n) h_k^2(n) \int_{\mathbb{R}} (\phi_j(x) f(x))''\, (\phi_k(x) f(x))''\,dx + \frac{\min\{h_j(n), h_k(n)\}}{2 n\, h_j(n) h_k(n)} \int_{\mathbb{R}} \phi_j(x)\phi_k(x) f(x)\,dx \right].
\]

Using the nonsymmetric exponential kernel, we obtain
\[
\mathrm{AMISE}(\hat f_\phi) = \sum_{j=1}^{m} \sum_{k=1}^{m} \left[ h_j(n) h_k(n) \int_{\mathbb{R}} (\phi_j(x) f(x))'\, (\phi_k(x) f(x))'\,dx + \frac{1}{n\,(h_j(n) + h_k(n))} \int_{\mathbb{R}} \phi_j(x)\phi_k(x) f(x)\,dx \right],
\]
since, for fixed $j, k$,
\[
\mu_1 = \int_{\mathbb{R}} t K_w(t)\,dt = 1 \quad \text{and} \quad \int_{\mathbb{R}} K_w(t)\, K_w\!\left(\frac{t\, h_j(n)}{h_k(n)}\right) dt = \int_{0}^{\infty} e^{-t}\, e^{-t\, h_j(n)/h_k(n)}\,dt = \frac{h_k(n)}{h_j(n) + h_k(n)}.
\]

Remark 3.6. If $K$ is symmetric, taking $m = 1$ and $\phi_1 \equiv 1$ (see Remark 2.4), by (3.1) we obtain
\[
\mathrm{AMISE}(\hat f_\phi) = \frac{h(n)^4}{4} \left(\int_{\mathbb{R}} t^2 K(t)\,dt\right)^2 \int_{\mathbb{R}} \big((\phi_1(x) f(x))''\big)^2\,dx + \frac{1}{n\,h(n)} \int_{\mathbb{R}} K^2(t)\,dt \int_{\mathbb{R}} \phi_1^2(x) f(x)\,dx = \frac{h(n)^4}{4} \left(\int_{\mathbb{R}} t^2 K(t)\,dt\right)^2 \int_{\mathbb{R}} (f''(x))^2\,dx + \frac{1}{n\,h(n)} \int_{\mathbb{R}} K^2(t)\,dt,
\]
which gives the formula for AMISE for the traditional kernel estimator (1.1).

Remark 3.7. Using a nonsymmetric kernel, for $m = 1$ and $\phi_1 \equiv 1$, by (3.2) we get the formula for AMISE for the estimator (1.1):
\[
\mathrm{AMISE}(\hat f_\phi) = h(n)^2 \left(\int_{\mathbb{R}} t K(t)\,dt\right)^2 \int_{\mathbb{R}} \big((\phi_1(x) f(x))'\big)^2\,dx + \frac{1}{n\,h(n)} \int_{\mathbb{R}} K^2(t)\,dt \int_{\mathbb{R}} \phi_1^2(x) f(x)\,dx = h(n)^2 \left(\int_{\mathbb{R}} t K(t)\,dt\right)^2 \int_{\mathbb{R}} (f'(x))^2\,dx + \frac{1}{n\,h(n)} \int_{\mathbb{R}} K^2(t)\,dt
\]
(see [7] for results on the nonsymmetric case in the traditional kernel estimation).

In Section 4 we will construct the pilot function by parametric estimation. The main problem is the choice of the number of components in the model
\[
\alpha_1 f_1(x\,|\,\theta_1) + \alpha_2 f_2(x\,|\,\theta_2) + \ldots + \alpha_m f_m(x\,|\,\theta_m).
\]
This is a known and much investigated problem in many methods of parametric estimation. However, in our case it is not so important, since the pilot is just

a tool for further estimation. So we suggest the choice of $m = 2$, because it does not decrease the efficiency of the method considered.

Remark 3.8. In the case $m = 2$, we choose the two bandwidths $h_1(n)$ and $h_2(n)$ by minimizing:
for the Gaussian kernel,
\[
(3.3)\qquad \mathrm{AMISE}(\hat f_\phi) = \frac{1}{2} h_1^2(n) h_2^2(n) \int_{\mathbb{R}} (\phi_1(x) f(x))''\, (\phi_2(x) f(x))''\,dx + \frac{1}{4} h_1^4(n) \int_{\mathbb{R}} \big((\phi_1(x) f(x))''\big)^2\,dx + \frac{1}{4} h_2^4(n) \int_{\mathbb{R}} \big((\phi_2(x) f(x))''\big)^2\,dx + \frac{1}{2\sqrt{\pi}\, n\, h_1(n)} \int_{\mathbb{R}} \phi_1^2(x) f(x)\,dx + \frac{1}{2\sqrt{\pi}\, n\, h_2(n)} \int_{\mathbb{R}} \phi_2^2(x) f(x)\,dx + \frac{\sqrt{2}}{\sqrt{\pi}\, n \sqrt{h_1^2(n) + h_2^2(n)}} \int_{\mathbb{R}} \phi_1(x)\phi_2(x) f(x)\,dx;
\]
for the rectangular kernel,
\[
\mathrm{AMISE}(\hat f_\phi) = \frac{1}{18} h_1^2(n) h_2^2(n) \int_{\mathbb{R}} (\phi_1(x) f(x))''\, (\phi_2(x) f(x))''\,dx + \frac{1}{36} h_1^4(n) \int_{\mathbb{R}} \big((\phi_1(x) f(x))''\big)^2\,dx + \frac{1}{36} h_2^4(n) \int_{\mathbb{R}} \big((\phi_2(x) f(x))''\big)^2\,dx + \frac{1}{2 n\, h_1(n)} \int_{\mathbb{R}} \phi_1^2(x) f(x)\,dx + \frac{1}{2 n\, h_2(n)} \int_{\mathbb{R}} \phi_2^2(x) f(x)\,dx + \frac{\min\{h_1(n), h_2(n)\}}{n\, h_1(n) h_2(n)} \int_{\mathbb{R}} \phi_1(x)\phi_2(x) f(x)\,dx;
\]
for the exponential kernel,
\[
\mathrm{AMISE}(\hat f_\phi) = h_1^2(n) \int_{\mathbb{R}} \big((\phi_1(x) f(x))'\big)^2\,dx + h_2^2(n) \int_{\mathbb{R}} \big((\phi_2(x) f(x))'\big)^2\,dx + \frac{1}{2 n\, h_1(n)} \int_{\mathbb{R}} \phi_1^2(x) f(x)\,dx + \frac{1}{2 n\, h_2(n)} \int_{\mathbb{R}} \phi_2^2(x) f(x)\,dx + 2 h_1(n) h_2(n) \int_{\mathbb{R}} (\phi_1(x) f(x))'\, (\phi_2(x) f(x))'\,dx + \frac{2}{n\,(h_1(n) + h_2(n))} \int_{\mathbb{R}} \phi_1(x)\phi_2(x) f(x)\,dx.
\]
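The minimization of (3.3) has to be carried out numerically. Below is a sketch of one possible way to do this, under the (hypothetical) assumption that the unknown $f$ in the integrals is replaced by the pilot $f_0$, with numerical quadrature and a finite-difference second derivative; the pilot parameters and all names are illustrative, not the paper's prescription.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

# Two-component normal pilot f_0 = a1*N(m1, s1^2) + a2*N(m2, s2^2) (illustrative values).
a = np.array([0.3, 0.7]); m = np.array([-2.0, 1.0]); s2 = np.array([0.25, 1.0])

def comp(x, j):            # alpha_j * f_j(x)
    return a[j] * np.exp(-0.5 * (x - m[j]) ** 2 / s2[j]) / np.sqrt(2 * np.pi * s2[j])

def f0(x):                 # pilot density
    return comp(x, 0) + comp(x, 1)

def phi(x, j):             # weights (2.2)
    return comp(x, j) / f0(x)

def second_deriv(g, x, eps=1e-4):
    return (g(x + eps) - 2 * g(x) + g(x - eps)) / eps ** 2

# Integrals entering (3.3), with f replaced by the pilot f_0.
B = [[quad(lambda x: second_deriv(lambda y: phi(y, j) * f0(y), x)
                     * second_deriv(lambda y: phi(y, k) * f0(y), x), -10, 10)[0]
      for k in range(2)] for j in range(2)]
V = [[quad(lambda x: phi(x, j) * phi(x, k) * f0(x), -10, 10)[0] for k in range(2)]
     for j in range(2)]

def amise(h, n=200):
    h1, h2 = h
    bias = 0.25 * h1**4 * B[0][0] + 0.25 * h2**4 * B[1][1] + 0.5 * h1**2 * h2**2 * B[0][1]
    var = (V[0][0] / (2 * np.sqrt(np.pi) * n * h1)
           + V[1][1] / (2 * np.sqrt(np.pi) * n * h2)
           + np.sqrt(2) * V[0][1] / (np.sqrt(np.pi) * n * np.sqrt(h1**2 + h2**2)))
    return bias + var

res = minimize(amise, x0=[0.3, 0.3], bounds=[(1e-3, 2), (1e-3, 2)])
print(res.x)   # chosen bandwidths h_1(n), h_2(n)
```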

4. Construction of the pilot function (parametric estimation). In this section we present a way of constructing the pilot function, based on a modification of the well-known maximum likelihood method. We apply the Expectation–Maximization algorithm, introduced in [2] and used to estimate the parameters in stochastic models, in hidden Markov models, to estimate a hazard function, as well as to estimate the parameters in mixture models. The last of these applications is the matter of our interest and will be used in the normal mixture model
\[
\alpha_1 N(\mu_1, \sigma_1^2) + \alpha_2 N(\mu_2, \sigma_2^2) + \ldots + \alpha_m N(\mu_m, \sigma_m^2).
\]

4.1. The EM algorithm. Let $x = (x_1, x_2, \ldots, x_n)$ be a sample of independent observations from an absolutely continuous distribution with a density function $f_X(x\,|\,\theta)$, where $\theta$ denotes the parameters of the distribution.

Definition 4.1. We define the likelihood function and the log-likelihood function
\[
L_x(\theta\,|\,x) := \prod_{i=1}^{n} f_X(x_i\,|\,\theta), \qquad l_x(\theta\,|\,x) := \ln L_x(\theta\,|\,x) = \sum_{i=1}^{n} \ln f_X(x_i\,|\,\theta).
\]
The maximum likelihood estimator of $\theta$ is given by
\[
\hat\theta = \operatorname*{argmax}_{\theta \in \Theta} L_x(\theta\,|\,x) = \operatorname*{argmax}_{\theta \in \Theta} l_x(\theta\,|\,x).
\]

To present the EM algorithm we consider the data model
\[
(4.1)\qquad \{(x_i, y_i),\ i = 1, \ldots, n\},
\]
from the distribution of the random variable $(X, Y)$, where $x = (x_1, \ldots, x_n)$ is a sample of independent observations from the absolutely continuous distribution of $X$, with the density $f_X$ treated as a marginal density of $(X, Y)$, and $y = (y_1, \ldots, y_n)$ denotes the latent (or missing) data. The sample $x$, together with $y$, is called the complete data.

Definition 4.2. For the data model (4.1), the likelihood and log-likelihood functions are defined as follows:
\[
L(\theta\,|\,x, y) = \prod_{i=1}^{n} f(x_i, y_i\,|\,\theta), \qquad l(\theta\,|\,x, y) = \ln L(\theta\,|\,x, y) = \sum_{i=1}^{n} \ln f(x_i, y_i\,|\,\theta).
\]
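A small numerical illustration of the maximum likelihood estimator from Definition 4.1, under the (hypothetical) assumption of a normal model; the data, the grid, and all names are illustrative.

```python
import numpy as np

def log_likelihood(theta, x):
    """l_x(theta | x) for a normal model with theta = (mu, sigma2)."""
    mu, sigma2 = theta
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - 0.5 * (x - mu) ** 2 / sigma2)

rng = np.random.default_rng(5)
x = rng.normal(loc=1.0, scale=2.0, size=200)
# crude argmax over a grid of candidate means (sigma2 fixed at the sample variance)
mus = np.linspace(-1, 3, 401)
best = mus[np.argmax([log_likelihood((m, x.var()), x) for m in mus])]
print(best, x.mean())   # the grid maximizer agrees with the closed-form MLE mu_hat = mean(x)
```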

From now on, we will use the following notation:
\[
(4.4)\qquad f_X(x\,|\,\theta) = \prod_{i=1}^{n} f_X(x_i\,|\,\theta), \qquad f(x, y\,|\,\theta) = \prod_{i=1}^{n} f(x_i, y_i\,|\,\theta), \qquad f_{Y|X}(y\,|\,x, \theta) = \prod_{i=1}^{n} f_{Y|X}(y_i\,|\,x_i, \theta).
\]

Corollary 4.3. For the considered data model, the maximum likelihood estimator of $\theta$ is given by
\[
\hat\theta = \operatorname*{argmax}_{\theta \in \Theta} \big( l(\theta\,|\,x, y) - \ln f_{Y|X}(y\,|\,x, \theta) \big),
\]
where $f_{Y|X}(y\,|\,x, \theta)$ denotes the conditional density of $Y$, given $X = x$ and $\theta$.

Proof. By the definition of conditional density,
\[
f_X(x\,|\,\theta) = \frac{f(x, y\,|\,\theta)}{f_{Y|X}(y\,|\,x, \theta)},
\]
and hence, by Definition 4.1,
\[
\hat\theta = \operatorname*{argmax}_{\theta \in \Theta} l_x(\theta\,|\,x) = \operatorname*{argmax}_{\theta \in \Theta} \big( \ln f(x, y\,|\,\theta) - \ln f_{Y|X}(y\,|\,x, \theta) \big) = \operatorname*{argmax}_{\theta \in \Theta} \big( l(\theta\,|\,x, y) - \ln f_{Y|X}(y\,|\,x, \theta) \big).
\]

Now let
\[
(4.7)\qquad Q(\theta, \theta^t) := E\big[\ln f(x, y\,|\,\theta)\,\big|\,x, \theta^t\big], \qquad (4.8)\qquad H(\theta, \theta^t) := E\big[\ln f_{Y|X}(y\,|\,x, \theta)\,\big|\,x, \theta^t\big],
\]
where $E[\,\cdot\,|\,x, \theta^t]$ denotes the conditional expected value, given $X = x$ and a fixed $\theta = \theta^t$, and $\theta^t$ is the value of the parameter $\theta$ in iteration $t$. The EM algorithm gives an iterative procedure that maximizes $Q(\theta, \theta^t)$ for every $t = 0, 1, 2, \ldots$, giving $\theta^{t+1} = \operatorname{argmax}_\theta Q(\theta, \theta^t)$. After a sufficient number of iterations, $\operatorname{argmax}_\theta l_x(\theta\,|\,x)$ will be found with a high probability (see [2], [16]).

Theorem 4.4. Under the above assumptions, the procedure
\[
\theta^{t+1} = \operatorname*{argmax}_{\theta \in \Theta} Q(\theta, \theta^t), \qquad t = 0, 1, 2, \ldots,
\]

guarantees that
\[
l_x(\theta^{t+1}\,|\,x) \geq l_x(\theta^{t}\,|\,x),
\]
with equality iff $Q(\theta^{t+1}, \theta^t) = Q(\theta^{t}, \theta^t)$, for $t = 0, 1, 2, \ldots$

Proof. Consider the expression $Q(\theta^{t+1}, \theta^t) - H(\theta^{t+1}, \theta^t)$. From formulae (4.7) and (4.8), it follows that
\[
Q(\theta^{t+1}, \theta^t) - H(\theta^{t+1}, \theta^t) = E\big[\ln f(x, y\,|\,\theta^{t+1})\,\big|\,x, \theta^t\big] - E\big[\ln f_{Y|X}(y\,|\,x, \theta^{t+1})\,\big|\,x, \theta^t\big] = E\left[\ln \frac{f(x, y\,|\,\theta^{t+1})}{f_{Y|X}(y\,|\,x, \theta^{t+1})}\,\Big|\,x, \theta^t\right],
\]
and by the definition of conditional density and (4.4),
\[
l_x(\theta^{t+1}\,|\,x) = \ln \frac{f(x, y\,|\,\theta^{t+1})}{f_{Y|X}(y\,|\,x, \theta^{t+1})}.
\]
Thus
\[
Q(\theta^{t+1}, \theta^t) - H(\theta^{t+1}, \theta^t) = E\big[l_x(\theta^{t+1}\,|\,x)\,\big|\,x, \theta^t\big] = l_x(\theta^{t+1}\,|\,x).
\]
Consequently,
\[
l_x(\theta^{t+1}\,|\,x) - l_x(\theta^{t}\,|\,x) = \big( Q(\theta^{t+1}, \theta^t) - Q(\theta^{t}, \theta^t) \big) + \big( H(\theta^{t}, \theta^t) - H(\theta^{t+1}, \theta^t) \big) \geq 0,
\]
provided that both bracketed terms are nonnegative. The first term, $Q(\theta^{t+1}, \theta^t) - Q(\theta^{t}, \theta^t) \geq 0$, by the choice of $\theta^{t+1}$ as a maximizer of $Q(\cdot, \theta^t)$. The second one, by (4.8) and the definition of the conditional expected value, can be written as
\[
H(\theta^{t}, \theta^t) - H(\theta^{t+1}, \theta^t) = E\left[\ln \frac{f_{Y|X}(y\,|\,x, \theta^{t})}{f_{Y|X}(y\,|\,x, \theta^{t+1})}\,\Big|\,x, \theta^t\right] = \int \ln \frac{f_{Y|X}(y\,|\,x, \theta^{t})}{f_{Y|X}(y\,|\,x, \theta^{t+1})}\, f_{Y|X}(y\,|\,x, \theta^{t})\,dy.
\]
As
\[
\ln \frac{f_{Y|X}(y\,|\,x, \theta^{t})}{f_{Y|X}(y\,|\,x, \theta^{t+1})} = -\ln\!\left( 1 + \frac{f_{Y|X}(y\,|\,x, \theta^{t+1}) - f_{Y|X}(y\,|\,x, \theta^{t})}{f_{Y|X}(y\,|\,x, \theta^{t})} \right)
\]
and $\ln(1 + t) < t$ for every $t > -1$, $t \neq 0$, the proof is completed.

The EM algorithm can be stated as follows.

Start: Set the initial value $\theta^0$. Each step consists of two substeps; for every $t = 0, 1, 2, \ldots$ we proceed as follows.

Expectation step: Let $\theta^t$ be the current estimate of the unknown parameters $\theta$. By (4.7), we calculate
\[
Q(\theta, \theta^t) = E\big[\ln f(x, y\,|\,\theta)\,\big|\,x, \theta^t\big] = \int \ln f(x, y\,|\,\theta)\, f_{Y|X}(y\,|\,x, \theta^t)\,dy = \int \ln f(x, y\,|\,\theta)\, \frac{f(x, y\,|\,\theta^t)}{f_X(x\,|\,\theta^t)}\,dy.
\]

Maximization step: Since $f_X(x\,|\,\theta^t)$ does not depend on $\theta$, we get
\[
\theta^{t+1} = \operatorname*{argmax}_{\theta} Q(\theta, \theta^t) = \operatorname*{argmax}_{\theta} \int \ln f(x, y\,|\,\theta)\, f(x, y\,|\,\theta^t)\,dy.
\]

By iterating the EM steps, the algorithm converges to parameters that should be the maximum likelihood parameters, but the convergence speed is rather low (see e.g. [16]). There is some research on accelerating the convergence of the EM algorithm, but the resulting procedures are usually difficult to carry out. In this paper we focus on the traditional form of the EM algorithm and apply it to the estimation of a mixture of normal densities.

4.2. Fitting the mixture of normal densities. As previously, consider $x = (x_1, x_2, \ldots, x_n)$, a sample of independent observations from a density $f_X(x\,|\,\theta)$, where $\theta = (\theta_1, \ldots, \theta_m)$ denotes the parameters of the distribution, and consider the following model. Suppose that $f_X$ is given by the mixture density
\[
f_X(x\,|\,\theta) = \sum_{j=1}^{m} \alpha_j f_j(x\,|\,\theta_j), \qquad \text{where} \quad f_j(x\,|\,\theta_j) = \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x - \mu_j)^2}{2\sigma_j^2}}, \quad \theta_j = (\mu_j, \sigma_j^2),
\]
and
\[
\alpha_j > 0, \quad j = 1, \ldots, m, \qquad \sum_{j=1}^{m} \alpha_j = 1.
\]
We call $\alpha_j$ the prior probability of the $j$-th mixture component, for $j = 1, \ldots, m$.

Suppose also, as in (4.1), that $y = (y_1, y_2, \ldots, y_n)$ denotes latent data from the distribution of the random variable $Y = (Y_1, Y_2, \ldots, Y_n)$, such that $P(Y_i = j) = \alpha_j$, $j = 1, \ldots, m$, $i = 1, \ldots, n$. We calculate $Q(\theta, \theta^t)$ for the above model. The conditional density of $Y$, given $X = x$ and $\theta^t$, is given by
\[
(4.9)\qquad f_{Y|X}(y\,|\,x, \theta^t) = \prod_{i=1}^{n} f_{Y|X}(y_i\,|\,x_i, \theta^t),
\]
and the joint density of $X$ and $Y$ is
\[
(4.10)\qquad f(x, y\,|\,\theta) = \prod_{i=1}^{n} f(x_i, y_i\,|\,\theta),
\]
which by the definition of conditional density gives
\[
(4.11)\qquad f(x, y\,|\,\theta) = \prod_{i=1}^{n} \alpha_{y_i} f_{y_i}(x_i\,|\,\theta_{y_i}).
\]
From formulae (4.9), (4.11), and (4.7), we obtain
\[
(4.12)\qquad Q(\theta, \theta^t) = E\big[\ln f(x, y\,|\,\theta)\,\big|\,x, \theta^t\big] = E\Big[\ln \prod_{i=1}^{n} \alpha_{y_i} f_{y_i}(x_i\,|\,\theta_{y_i})\,\Big|\,x, \theta^t\Big] = \sum_{j=1}^{m} \sum_{i=1}^{n} \ln\big( \alpha_j f_j(x_i\,|\,\theta_j) \big)\, P(j\,|\,x_i, \theta^t).
\]
Thus, in each iteration $t$ of the EM algorithm, the expectation step calculates $q_{ij}^t = P(j\,|\,x_i, \theta^t)$. From now on, we will denote by $\alpha_j^t$, $\mu_j^t$, and $(\sigma_j^t)^2$ the estimators of the parameters $\alpha_j$, $\mu_j$, $\sigma_j^2$ in iteration $t$.

Lemma 4.5. Under the above assumptions,
\[
(4.13)\qquad \alpha_j^t = \frac{1}{n} \sum_{i=1}^{n} q_{ij}^t,
\]
\[
(4.14)\qquad \mu_j^t = \frac{\sum_{i=1}^{n} q_{ij}^t\, x_i}{\sum_{i=1}^{n} q_{ij}^t},
\]
\[
(4.15)\qquad (\sigma_j^t)^2 = \frac{\sum_{i=1}^{n} (x_i - \mu_j^t)^2\, q_{ij}^t}{\sum_{i=1}^{n} q_{ij}^t}.
\]

Proof. To find $\alpha_j$, we use the Lagrange multiplier $\lambda$ with the constraint $\sum_{j=1}^{m} \alpha_j = 1$. For $j = 1, \ldots, m$, we calculate
\[
\frac{\partial}{\partial \alpha_j}\left[ Q(\theta, \theta^t) + \lambda\Big( 1 - \sum_{j=1}^{m} \alpha_j \Big) \right] = 0.
\]
By formula (4.12),
\[
\frac{\partial}{\partial \alpha_j}\left[ \sum_{j=1}^{m} \sum_{i=1}^{n} \ln\big( \alpha_j f_j(x_i\,|\,\theta_j) \big)\, q_{ij}^t + \lambda\Big( 1 - \sum_{j=1}^{m} \alpha_j \Big) \right] = 0, \qquad j = 1, \ldots, m,
\]
and hence
\[
\sum_{i=1}^{n} q_{ij}^t = \lambda\,\alpha_j, \qquad j = 1, \ldots, m.
\]
Summing both sides over $j$, we get $\lambda = n$, and thus (4.13). To prove (4.14) and (4.15), we maximize the function
\[
Q(\theta, \theta^t) = \sum_{j=1}^{m} \sum_{i=1}^{n} \ln\big( \alpha_j f_j(x_i\,|\,\theta_j) \big)\, q_{ij}^t.
\]
By the assumptions of the model,
\[
f_j(x\,|\,\theta_j) = \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x - \mu_j)^2}{2\sigma_j^2}}, \qquad \theta_j = (\mu_j, \sigma_j^2).
\]
We get the function given by
\[
(4.16)\qquad Q(\mu_j, \sigma_j) = \sum_{j=1}^{m} \sum_{i=1}^{n} \left( \ln \alpha_j - \frac{1}{2}\ln(2\pi) - \ln \sigma_j - \frac{1}{2}\left(\frac{x_i - \mu_j}{\sigma_j}\right)^{2} \right) q_{ij}^t.
\]
By differentiating (4.16) with respect to $\mu_j$ and $\sigma_j$, for $j = 1, \ldots, m$, we obtain
\[
\frac{\partial Q}{\partial \mu_j} = \sum_{i=1}^{n} \frac{x_i - \mu_j}{\sigma_j^2}\, q_{ij}^t, \qquad \frac{\partial Q}{\partial \sigma_j} = \sum_{i=1}^{n} \left( -\frac{1}{\sigma_j} + \frac{(x_i - \mu_j)^2}{\sigma_j^3} \right) q_{ij}^t.
\]
Solving the equations
\[
\sum_{i=1}^{n} \frac{x_i - \mu_j}{\sigma_j^2}\, q_{ij}^t = 0 \qquad \text{and} \qquad \sum_{i=1}^{n} \left( -\frac{1}{\sigma_j} + \frac{(x_i - \mu_j)^2}{\sigma_j^3} \right) q_{ij}^t = 0,
\]
we get (4.14) and (4.15).
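A sketch of one run of these updates: the E-step computes the responsibilities $q_{ij}^t = P(j\,|\,x_i, \theta^t)$ and the M-step applies the closed-form maximizers (4.13)–(4.15) of Lemma 4.5. The data, starting values, and function names are illustrative assumptions.

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def em_normal_mixture(x, alpha, mu, sigma2, iters=30):
    """EM for a mixture of m normal densities; alpha, mu, sigma2 have length m."""
    x = np.asarray(x, dtype=float)
    for _ in range(iters):
        # E-step: responsibilities q[i, j] = P(Y_i = j | x_i, theta^t)
        q = np.array([a * normal_pdf(x, m_, s2) for a, m_, s2 in zip(alpha, mu, sigma2)]).T
        q /= q.sum(axis=1, keepdims=True)
        # M-step: closed-form updates (4.13)-(4.15)
        nj = q.sum(axis=0)
        alpha = nj / len(x)
        mu = (q * x[:, None]).sum(axis=0) / nj
        sigma2 = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return alpha, mu, sigma2

rng = np.random.default_rng(3)
data = np.concatenate([rng.normal(-2, 0.5, 50), rng.normal(1, 1.0, 150)])
print(em_normal_mixture(data, alpha=[0.5, 0.5], mu=[-0.5, 0.5], sigma2=[1.0, 1.0]))
```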

The scheme of the EM algorithm for the mixture of normal densities can be stated as follows.

Start: We set $\alpha_j^0$, $\mu_j^0$, and $(\sigma_j^0)^2$, for $j = 1, 2, \ldots, m$. We iterate, for $t = 1, 2, \ldots$:
E-step: We calculate $q_{ij}^t$.
M-step: We calculate
\[
\alpha_j^t = \frac{1}{n} \sum_{i=1}^{n} q_{ij}^t, \qquad \mu_j^t = \frac{\sum_{i=1}^{n} q_{ij}^t\, x_i}{\sum_{i=1}^{n} q_{ij}^t}, \qquad (\sigma_j^t)^2 = \frac{\sum_{i=1}^{n} (x_i - \mu_j^t)^2\, q_{ij}^t}{\sum_{i=1}^{n} q_{ij}^t}.
\]
The above scheme, with the exact formulae, can easily be used to construct the pilot function.

5. Example. In this section we present an example of how to apply the two-step method in practice. Suppose that we are given a sample of 200 independent observations from an absolutely continuous distribution with an unknown density $f$. In this example we consider the mixture of two Weibull densities, with parameters $\alpha_1 = 0.25$, $a_1 = 2.9$, $b_1 = 3.5$ and $\alpha_2 = 0.75$, $a_2 = 1.6$, $b_2 = 0.6$, which is nonsymmetric and bimodal. At first we construct a pilot function as a mixture of $m = 2$ normal densities,
\[
f_0(x) = \alpha_1 f_1(x\,|\,\mu_1, \sigma_1^2) + \alpha_2 f_2(x\,|\,\mu_2, \sigma_2^2), \qquad \text{where} \quad f_j(x\,|\,\mu_j, \sigma_j^2) = \frac{1}{\sqrt{2\pi\sigma_j^2}}\, e^{-\frac{(x - \mu_j)^2}{2\sigma_j^2}}, \quad j = 1, 2,
\]
using the EM algorithm. We set the initial values of the estimates of the parameters: $\alpha_1^0 = 0.5$, $\alpha_2^0 = 0.5$, $\mu_1^0 = 0.5$, $\mu_2^0 = 0.5$, $(\sigma_1^0)^2 = 0.5$, $(\sigma_2^0)^2 = 0.5$, and after 30 iterations of the EM algorithm we get $\hat\alpha_1 = 0.315$, $\hat\mu_1 = 2.698$, $\hat\sigma_1^2 = 1.560$, $\hat\alpha_2 = 0.685$, $\hat\mu_2 = 0.489$, and $\hat\sigma_2^2$.

Therefore, the pilot is given by the formula
\[
f_0(x) = \frac{\hat\alpha_1}{\sqrt{2\pi\hat\sigma_1^2}}\, e^{-\frac{(x - \hat\mu_1)^2}{2\hat\sigma_1^2}} + \frac{\hat\alpha_2}{\sqrt{2\pi\hat\sigma_2^2}}\, e^{-\frac{(x - \hat\mu_2)^2}{2\hat\sigma_2^2}}.
\]
In the second step we use the generalized kernel density estimator (2.1) with the Gaussian kernel. In this case
\[
\phi_1(x) = \frac{\hat\alpha_1}{\sqrt{2\pi\hat\sigma_1^2}}\, e^{-\frac{(x - \hat\mu_1)^2}{2\hat\sigma_1^2}} \Big/ f_0(x), \qquad \phi_2(x) = \frac{\hat\alpha_2}{\sqrt{2\pi\hat\sigma_2^2}}\, e^{-\frac{(x - \hat\mu_2)^2}{2\hat\sigma_2^2}} \Big/ f_0(x).
\]
We obtain the bandwidths $h_1$ and $h_2$ by minimizing $\mathrm{AMISE}(\hat f_\phi)$ given by formula (3.3); the minimization is carried out numerically. Hence, we have the density estimator given by the formula
\[
\hat f_\phi(x) = \frac{1}{200} \sum_{i=1}^{200} \left[ \frac{\phi_1(x_i)}{h_1}\, K\!\left(\frac{x - x_i}{h_1}\right) + \frac{\phi_2(x_i)}{h_2}\, K\!\left(\frac{x - x_i}{h_2}\right) \right],
\]
where $x_1, \ldots, x_{200}$ denote the given sample, and $\phi_1(x)$, $\phi_2(x)$, $h_1$, $h_2$ are calculated above. To show the results of the estimation, the graphs of the estimator $\hat f_\phi$ and the estimated density are presented in the figure below.

[Figure: the two-step estimator $\hat f_\phi$ (solid line) and the estimated density $f$ (dotted line).]

One can see that the estimator $\hat f_\phi$ (solid line) is very close to the estimated density $f$ (dotted line), which gives evidence of the high efficiency of the proposed method.
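A compact end-to-end sketch of the two steps on a sample of this kind, under stated assumptions: the Weibull parameters are read as shape/scale pairs, the unknown $f$ in the AMISE integrals is replaced by the pilot $f_0$, and grid quadrature is used; all names and numerical results are illustrative and will not reproduce the paper's exact values.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import weibull_min

rng = np.random.default_rng(4)

# Step 0: a sample of 200 observations from a Weibull mixture
# (shape/scale parametrization assumed: a_j = shape, b_j = scale).
n = 200
z = rng.uniform(size=n) < 0.25
data = np.where(z, weibull_min.rvs(2.9, scale=3.5, size=n, random_state=rng),
                   weibull_min.rvs(1.6, scale=0.6, size=n, random_state=rng))

npdf = lambda x, m, s2: np.exp(-0.5 * (x - m) ** 2 / s2) / np.sqrt(2 * np.pi * s2)

# Step 1: pilot f_0 = alpha_1 N(mu_1, s_1^2) + alpha_2 N(mu_2, s_2^2) fitted by EM.
alpha = np.array([0.5, 0.5])
mu = np.array([data.mean() - 0.5, data.mean() + 0.5])
s2 = np.array([1.0, 1.0])
for _ in range(30):
    q = np.array([a * npdf(data, m, s) for a, m, s in zip(alpha, mu, s2)]).T
    q /= q.sum(axis=1, keepdims=True)
    nj = q.sum(axis=0)
    alpha, mu = nj / n, (q * data[:, None]).sum(axis=0) / nj
    s2 = (q * (data[:, None] - mu) ** 2).sum(axis=0) / nj

f0 = lambda x: alpha[0] * npdf(x, mu[0], s2[0]) + alpha[1] * npdf(x, mu[1], s2[1])
phi = lambda x, j: alpha[j] * npdf(x, mu[j], s2[j]) / f0(x)

# Step 2: bandwidths minimizing the Gaussian-kernel AMISE (3.3), with f replaced
# by the pilot f_0 and the integrals approximated on a grid.
g = np.linspace(data.min() - 1, data.max() + 1, 2001)
dx = g[1] - g[0]
d2 = lambda v: np.gradient(np.gradient(v, dx), dx)
pf = [phi(g, j) * f0(g) for j in range(2)]
B = [[np.trapz(d2(pf[j]) * d2(pf[k]), g) for k in range(2)] for j in range(2)]
V = [[np.trapz(phi(g, j) * phi(g, k) * f0(g), g) for k in range(2)] for j in range(2)]

def amise(h):
    h1, h2 = h
    return (0.25 * h1**4 * B[0][0] + 0.25 * h2**4 * B[1][1] + 0.5 * h1**2 * h2**2 * B[0][1]
            + V[0][0] / (2 * np.sqrt(np.pi) * n * h1) + V[1][1] / (2 * np.sqrt(np.pi) * n * h2)
            + np.sqrt(2) * V[0][1] / (np.sqrt(np.pi) * n * np.sqrt(h1**2 + h2**2)))

h1, h2 = minimize(amise, x0=[0.2, 0.2], bounds=[(1e-3, 2), (1e-3, 2)]).x

# Final estimator (2.1), evaluated at a few points.
K = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)
def f_phi(x):
    x = np.atleast_1d(x)
    u1, u2 = (x[:, None] - data) / h1, (x[:, None] - data) / h2
    return (phi(data, 0) * K(u1) / h1 + phi(data, 1) * K(u2) / h2).sum(axis=1) / n

print(h1, h2, f_phi(np.array([0.5, 1.0, 2.0, 3.0])))
```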

References

1. Abramson I.S., Arbitrariness of the Pilot Estimator in Adaptive Kernel Methods, J. Multivariate Anal., 12 (1982).
2. Dempster A.P., Laird N.M., Rubin D.B., Maximum Likelihood from Incomplete Data via the EM Algorithm, J. Roy. Statist. Soc. Ser. B (1977).
3. Devroye L., Györfi L., Nonparametric Density Estimation: the L1 View, Wiley, New York.
4. Domański Cz., Pruska K., Nieklasyczne metody statystyczne, PWE, Warszawa.
5. Hall P., Marron J.S., Choice of Kernel Order in Density Estimation, Ann. Statist., 16 (1987).
6. Hjort N.L., Glad I.K., Nonparametric Density Estimation with a Parametric Start, Ann. Statist., 23, 3 (1995).
7. Jarnicka J., Dwukrokowa metoda estymacji gęstości, preprint.
8. Marron J.S., Wand M.P., Exact Mean Integrated Squared Error, Ann. Statist., 20, 2 (1992).
9. Park B.U., Marron J.S., Comparison of Data-Driven Bandwidth Selectors, J. Amer. Statist. Assoc.
10. Parzen E., On Estimation of a Probability Density Function and Mode, Ann. Math. Statist.
11. Rosenblatt M., Remarks on Some Nonparametric Estimates of a Density Function, Ann. Math. Statist.
12. Sheather S.J., Jones M.C., A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation, J. Roy. Statist. Soc. Ser. B, 53, 3 (1991).
13. Silverman B.W., Density Estimation for Statistics and Data Analysis, Chapman and Hall, London, New York.
14. Terrell G.R., Scott D.W., Variable Kernel Density Estimation, Ann. Statist., 20, 3 (1992).
15. Wand M.P., Marron J.S., Ruppert D., Transformations in Density Estimation, J. Amer. Statist. Assoc., 86, 414 (1991).
16. Wu C.F.J., On the Convergence Properties of the EM Algorithm, Ann. Statist., 11 (1983).

Received November 2.
jj@gamma.im.uj.edu.pl


More information

OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1

OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 The Annals of Statistics 1997, Vol. 25, No. 6, 2512 2546 OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 By O. V. Lepski and V. G. Spokoiny Humboldt University and Weierstrass Institute

More information

A Bayesian approach to parameter estimation for kernel density estimation via transformations

A Bayesian approach to parameter estimation for kernel density estimation via transformations A Bayesian approach to parameter estimation for kernel density estimation via transformations Qing Liu,, David Pitt 2, Xibin Zhang 3, Xueyuan Wu Centre for Actuarial Studies, Faculty of Business and Economics,

More information

arxiv: v2 [stat.me] 13 Sep 2007

arxiv: v2 [stat.me] 13 Sep 2007 Electronic Journal of Statistics Vol. 0 (0000) ISSN: 1935-7524 DOI: 10.1214/154957804100000000 Bandwidth Selection for Weighted Kernel Density Estimation arxiv:0709.1616v2 [stat.me] 13 Sep 2007 Bin Wang

More information

Assignment # 1. STA 355 (K. Knight) February 8, (a) by the Central Limit Theorem. n(g( Xn ) g(µ)) by the delta method ( ) (b) i.

Assignment # 1. STA 355 (K. Knight) February 8, (a) by the Central Limit Theorem. n(g( Xn ) g(µ)) by the delta method ( ) (b) i. STA 355 (K. Knight) February 8, 2014 Assignment # 1 Saša Milić 997 410 626 1. (a) n( n ) d N (0, σ 2 ()) by the Central Limit Theorem n(g( n ) g()) d g (θ)n (0, σ 2 ()) by the delta method σ 2 ()g (θ)

More information

Frontier estimation based on extreme risk measures

Frontier estimation based on extreme risk measures Frontier estimation based on extreme risk measures by Jonathan EL METHNI in collaboration with Ste phane GIRARD & Laurent GARDES CMStatistics 2016 University of Seville December 2016 1 Risk measures 2

More information

Brownian Motion. An Undergraduate Introduction to Financial Mathematics. J. Robert Buchanan. J. Robert Buchanan Brownian Motion

Brownian Motion. An Undergraduate Introduction to Financial Mathematics. J. Robert Buchanan. J. Robert Buchanan Brownian Motion Brownian Motion An Undergraduate Introduction to Financial Mathematics J. Robert Buchanan 2010 Background We have already seen that the limiting behavior of a discrete random walk yields a derivation of

More information

Positive data kernel density estimation via the logkde package for R

Positive data kernel density estimation via the logkde package for R Positive data kernel density estimation via the logkde package for R Andrew T. Jones 1, Hien D. Nguyen 2, and Geoffrey J. McLachlan 1 which is constructed from the sample { i } n i=1. Here, K (x) is a

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Computational Statistics. Jian Pei School of Computing Science Simon Fraser University

Computational Statistics. Jian Pei School of Computing Science Simon Fraser University Computational Statistics Jian Pei School of Computing Science Simon Fraser University jpei@cs.sfu.ca BASIC OPTIMIZATION METHODS J. Pei: Computational Statistics 2 Why Optimization? In statistical inference,

More information

Kullback-Leibler Designs

Kullback-Leibler Designs Kullback-Leibler Designs Astrid JOURDAN Jessica FRANCO Contents Contents Introduction Kullback-Leibler divergence Estimation by a Monte-Carlo method Design comparison Conclusion 2 Introduction Computer

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Nonparametric Function Estimation with Infinite-Order Kernels

Nonparametric Function Estimation with Infinite-Order Kernels Nonparametric Function Estimation with Infinite-Order Kernels Arthur Berg Department of Statistics, University of Florida March 15, 2008 Kernel Density Estimation (IID Case) Let X 1,..., X n iid density

More information