ICIC Express Letters, ICIC International © 2009, ISSN 1881-803X, Volume 3, Number 3, September 2009, pp. 1-6

STOCHASTIC INFORMATION GRADIENT ALGORITHM BASED ON MAXIMUM ENTROPY DENSITY ESTIMATION

Badong Chen, Yu Zhu, Jinchun Hu and Ming Zhang
Department of Precision Instruments and Mechanology, Tsinghua University, Beijing, 100084, P. R. China
chenbd04@mails.tsinghua.edu.cn

Received December 2009; accepted February 2010

Abstract. We propose a new stochastic information gradient (SIG) algorithm based on an online maximum entropy density estimation of the error distribution. The proposed algorithm is simple yet efficient in implementation, as it involves no choice of bandwidth and does not resort to Newton's method of optimization. Simulation results demonstrate the favorable performance of the new algorithm.

1. Introduction. The minimum mean square error (MMSE) criterion and the corresponding least mean-squares (LMS) algorithm are rather efficient for adaptive filtering if the error is normally distributed [1]. When the error distribution is non-Gaussian, the MMSE criterion (and hence the LMS algorithm) usually results in significant performance degradation [2, 3, 4]. Recently, in order to deal with non-Gaussian error distributions, the minimum error entropy (MEE) criterion has been proposed as an information-theoretic alternative to the MMSE criterion in supervised adaptation [5, 6, 7, 8]. Under the MEE criterion, the stochastic-gradient-based filtering algorithm, i.e., the so-called stochastic information gradient (SIG) algorithm, takes the form [6]

W_{k+1} = W_k − η ∇_W { −log p̂(e_k) }    (1)

where W_k = [w_1, w_2, …, w_M]^T denotes the filter's weight vector at iteration k, η is the step-size, and p̂(·) denotes the estimated probability density function (PDF) of the error e_k. The SIG algorithm (1) is somewhat similar to adaptive estimation (AE) [9].
Adaptive estimation first estimates the error distribution based on the consistent ordinary least squares (OLS) residuals, and then uses the estimated likelihood function to estimate the parameters. Unlike AE, however, the SIG algorithm is an online adaptive filtering (learning) algorithm, which estimates the error distribution and updates the weight vector simultaneously. The existing online PDF estimate is usually obtained by a non-parametric kernel approach with a sliding window of error samples [1], that is

p̂(e_k) = (1/L) Σ_{i=k−L+1}^{k} K_σ(e_k − e_i)    (2)

where L is the length of the sliding window of error data {e_{k−L+1}, …, e_{k−1}, e_k}, and K_σ(·) is the kernel function with bandwidth σ. However, as stressed in [9], a parametric estimate of the error distribution, which does not involve bandwidth selection, may outperform non-parametric ones. So in this letter, we propose a new SIG algorithm, in which the error distribution is modeled as a generalized exponential density and is estimated through the Maximum Entropy Principle (MEP) [9, 10]. The reason for using the maximum entropy
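For illustration, the sliding-window estimate (2) with a Gaussian kernel might be implemented as follows (a minimal sketch; the function name and arguments are our own, not from the paper):

```python
import numpy as np

def kernel_pdf_estimate(errors, e_k, sigma):
    """Sliding-window kernel estimate of p(e_k), cf. Eq. (2).

    `errors` holds the last L error samples {e_{k-L+1}, ..., e_k}
    and `sigma` is the kernel bandwidth.
    """
    u = e_k - np.asarray(errors)
    # Gaussian kernel K_sigma(u) = exp(-u^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)
    return np.mean(np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma))
```

The estimate is simply the average of L kernel evaluations, so its cost per sample is O(L), but its quality hinges on the choice of σ, which is the point the letter addresses.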
(Maxent) density estimation is twofold. First, with this method, the selection of the kernel bandwidth is bypassed (an inappropriate choice of bandwidth will significantly deteriorate the behavior of the SIG algorithm); and second, the Maxent densities nest a wide range of statistical distributions as special cases, yet fit the known data without committing extensively to unseen data.

2. Algorithm. The Maxent density is of the generalized exponential family [9], that is

p(e) = exp( −λ_0 − Σ_{i=1}^{m} λ_i f_i(e) )    (3)

where the functions {f_i(e)} establish the characterizing moments {E[f_i(e)]} and {λ_i} are the Lagrange multipliers. Let E_p[f_i(e)] = μ_i; then the Lagrange multipliers {λ_i} can be obtained by solving the optimization [9]:

max_λ { Γ = λ_0 + Σ_{i=1}^{m} λ_i μ_i },  where  λ_0 = log ∫ exp( −Σ_{i=1}^{m} λ_i f_i(e) ) de    (4)

and λ = [λ_1, …, λ_m]^T. This optimization can be solved using Newton's method [9]. However, the computational burden of Newton's method is very heavy due to the many numerical integrals involved, so it is not suitable for online density estimation. In order to reduce the computational burden, we use the method of [10] to calculate the Lagrange multipliers. Specifically, setting ∂Γ/∂λ_i = 0, we get

μ_i = ∫ f_i(e) exp( −λ_0 − Σ_{j=1}^{m} λ_j f_j(e) ) de    (5)

Applying integration by parts and assuming the function F_i(e) = ∫ f_i(e) de satisfies F_i(e) p(e) → 0 as |e| → ∞, we have

μ_i = Σ_{j=1}^{m} λ_j E_p[ F_i(e) f_j′(e) ] = Σ_{j=1}^{m} λ_j β_ij    (6)

where β_ij = E_p[ F_i(e) f_j′(e) ]. Hence, the Lagrange multipliers are given by the solution of the linear system of equations in (6), that is

λ = β^{−1} μ    (7)

where μ = [μ_1, …, μ_m]^T and β = [β_ij]. The moment vector μ and the matrix β can be approximated by using sample means.
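As a sketch of this sample-mean approximation of (7), assuming the user supplies the moment functions f_i together with their antiderivatives F_i and derivatives f_i′ (all names below are illustrative):

```python
import numpy as np

def maxent_multipliers(e, f, F, df):
    """Estimate the Lagrange multipliers via Eq. (7), lambda = beta^{-1} mu.

    e  : array of error samples
    f  : list of moment functions f_i
    F  : list of antiderivatives F_i of f_i
    df : list of derivatives f_j' of f_j
    The expectations mu_i and beta_ij are replaced by sample means.
    """
    m = len(f)
    mu = np.array([np.mean(fi(e)) for fi in f])            # mu_i ~ E[f_i(e)]
    beta = np.array([[np.mean(F[i](e) * df[j](e))          # beta_ij ~ E[F_i f_j']
                      for j in range(m)] for i in range(m)])
    return np.linalg.solve(beta, mu)                       # lambda = beta^{-1} mu
```

As a sanity check, for standard Gaussian samples with f_1(e) = e, f_2(e) = e², the solution should approach λ ≈ [0, 1/2], recovering p(e) ∝ exp(−e²/2).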
So we propose an online Maxent density estimation of the error distribution:

p̂(e_k) = C(λ̂(k)) exp( −Σ_{i=1}^{m} λ̂_i(k) f_i(e_k) )
λ̂(k) = β̂(k)^{−1} μ̂(k)
μ̂_i(k) = γ_μ μ̂_i(k−1) + (1 − γ_μ) f_i(e_k)
β̂_ij(k) = γ_β β̂_ij(k−1) + (1 − γ_β) F_i(e_k) f_j′(e_k)    (8)

where C(λ̂(k)) stands for the normalization factor, and γ_μ and γ_β are the exponential weighting parameters (0 < γ_μ < 1, 0 < γ_β < 1). Based on the estimated PDF p̂(e_k) in (8), the SIG algorithm of (1) becomes

W_{k+1} = W_k − η ∂/∂W ( Σ_{i=1}^{m} λ̂_i(k) f_i(e_k) ) = W_k − η ( Σ_{i=1}^{m} λ̂_i(k) f_i′(e_k) ) ∂e_k/∂W    (9)
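The recursions in (8) can be sketched as a small online estimator. The identity initialization of β̂ is our own assumption, made so that the matrix is invertible from the first sample; it decays geometrically and soon becomes negligible:

```python
import numpy as np

class MaxentEstimator:
    """Online exponentially weighted tracking of mu and beta, cf. Eq. (8).

    f, F, df hold the moment functions, their antiderivatives, and
    their derivatives (illustrative sketch, names are our own).
    """
    def __init__(self, f, F, df, gamma_mu=0.999, gamma_beta=0.999):
        self.f, self.F, self.df = f, F, df
        self.gm, self.gb = gamma_mu, gamma_beta
        m = len(f)
        self.mu = np.zeros(m)
        self.beta = np.eye(m)   # identity start keeps beta invertible early on

    def update(self, e_k):
        """One step of Eq. (8); returns the current lambda_hat(k)."""
        m = len(self.f)
        for i in range(m):
            self.mu[i] = self.gm * self.mu[i] + (1 - self.gm) * self.f[i](e_k)
            for j in range(m):
                self.beta[i, j] = (self.gb * self.beta[i, j]
                                   + (1 - self.gb) * self.F[i](e_k) * self.df[j](e_k))
        return np.linalg.solve(self.beta, self.mu)
```

Each update costs O(m²) for the recursions plus O(m³) for the solve, with m typically 3 to 5, so the per-sample cost is negligible compared with Newton iterations over numerical integrals.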
If the adaptive filter is of finite impulse response (FIR) structure, we have

W_{k+1} = W_k + η ( Σ_{i=1}^{m} λ̂_i(k) f_i′(e_k) ) X_k    (10)

where X_k = [x_k, x_{k−1}, …, x_{k−M+1}]^T is the input vector (M is the number of taps). The above algorithm depends on the choice of the functions {f_i(e)}. In practice, as contended in [9], the expression of the Maxent density can be chosen as

p(e) = exp( −λ_0 − λ_1 e − λ_2 e² − λ_3 log(1 + e²) − λ_4 sin(e) − λ_5 cos(e) )    (11)

or simply

p(e) = exp( −λ_0 − λ_1 e − λ_2 e² − λ_3 log(1 + e²) )    (12)

To make a distinction, we denote by SIG-Kernel and SIG-Maxent the SIG algorithms based on the kernel approach and the maximum entropy approach, respectively.

3. Simulation Results. Now we present a simulation experiment to demonstrate the performance of the SIG-Maxent algorithm, in comparison with the SIG-Kernel and the well-known LMS algorithms. Consider the FIR channel identification scenario, in which the error signal is given by

e_k = (W_*^T X_k + n_k) − W_k^T X_k = V_k^T X_k + n_k    (13)

where W_* denotes the weight vector of the unknown FIR channel, n_k is the interference noise, and V_k = W_* − W_k is the weight error vector. In the simulation, we set

W_* = [0.1, 0.2, 0.3, 0.4, 0.1, 0.3, 0.4, 0.3, 0.2, 0.1, 0.1, 0.2, 0.3, 0.4, 0.5]^T    (14)

Further, let the input {x_k} be a unit-power white Gaussian process, and {n_k} be a unit-dispersion symmetric alpha-stable (SαS) noise (1 < α ≤ 2) [2]. For the SIG-Maxent algorithm, we adopt (12) as the Maxent density and set γ_μ = γ_β = 0.999. For the SIG-Kernel algorithm, we choose the Gaussian kernel and determine the bandwidth according to Silverman's rule of thumb [11]. Fig. 1 and Fig. 2 show the convergence curves of the mean square deviation (MSD), E[V_k^T V_k], averaged over 10 independent Monte Carlo runs, for α = 2 and α = 1.5, respectively.
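Combining (8), (10) and the density (12), one possible sketch of the complete SIG-Maxent FIR adaptation is given below. The step-size, weighting factors and initialization are illustrative assumptions, not values prescribed by the letter:

```python
import numpy as np

def sig_maxent_fir(x, d, M, eta=0.005, gm=0.999, gb=0.999):
    """SIG-Maxent adaptation of an M-tap FIR filter, cf. Eq. (10),
    with the Maxent density of Eq. (12). Minimal sketch."""
    # f_i(e) of Eq. (12), their antiderivatives F_i and derivatives f_i'
    f  = [lambda e: e, lambda e: e**2, lambda e: np.log1p(e**2)]
    F  = [lambda e: e**2 / 2, lambda e: e**3 / 3,
          lambda e: e * np.log1p(e**2) - 2 * e + 2 * np.arctan(e)]
    df = [lambda e: 1.0, lambda e: 2 * e, lambda e: 2 * e / (1 + e**2)]
    m = len(f)
    W = np.zeros(M)
    mu, beta = np.zeros(m), np.eye(m)      # identity start keeps beta invertible
    for k in range(M - 1, len(x)):
        X = x[k - M + 1:k + 1][::-1]       # X_k = [x_k, ..., x_{k-M+1}]^T
        e = d[k] - W @ X                   # a priori error e_k
        # moment recursions of Eq. (8)
        mu = gm * mu + (1 - gm) * np.array([fi(e) for fi in f])
        beta = gb * beta + (1 - gb) * np.outer([Fi(e) for Fi in F],
                                               [dfj(e) for dfj in df])
        lam = np.linalg.solve(beta, mu)    # lambda_hat(k) = beta^{-1} mu
        # weight update of Eq. (10)
        W = W + eta * sum(lam[i] * df[i](e) for i in range(m)) * X
    return W
```

On a simple identification task with Gaussian noise the recovered weights approach the true channel; the lam-weighted error nonlinearity plays the role that the raw error e_k plays in LMS.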
Evidently, the two SIG algorithms are both robust to the impulsive noise (α = 1.5), and in particular, the SIG-Maxent algorithm achieves the smallest misadjustment in both the Gaussian (α = 2) and impulsive noise cases.

Remark: In a previous paper [12], we showed that the SIG algorithm with Gaussian kernel (SIG-G) is not robust to impulsive noises, and proposed a Laplacian kernel based SIG (SIG-L) algorithm to improve the robustness. Our new study shows, however, that if the error's PDF is estimated as in (2), the SIG-G algorithm is actually robust to impulsive noises. Notice that in [12], the estimated PDF is given by

p̂(e_k) = (1/L) Σ_{i=k−L}^{k−1} K_σ(e_k − e_i)    (15)

4. Conclusion. Based on an online Maxent density estimation of the error distribution, we develop the SIG-Maxent algorithm. Simulation results demonstrate the favorable performance of the new algorithm.

Acknowledgment. This work was supported by the National Natural Science Foundation of China (60904054), the National Key Basic Research and Development Program (973) of China (2009CB724205), the China Postdoctoral Science Foundation Funded Project (20080440384), and a grant from the State Key Laboratory of Tribology (SKLT) at Tsinghua University (SKLT08B04, SKLT08B06).
Figure 1. Average convergence curves for different algorithms (α = 2.0)

Figure 2. Average convergence curves for different algorithms (α = 1.5)

REFERENCES

[1] S. Haykin, Adaptive Filter Theory, 3rd ed., NY: Prentice Hall, 1996.
[2] M. Shao, C. L. Nikias, Signal processing with fractional lower order moments: stable processes and their applications, Proceedings of the IEEE, vol.81, no.7, pp.986-1009, 1993.
[3] S. C. Pei, C. C. Tseng, Least mean p-power error criterion for adaptive FIR filter, IEEE Journal on Selected Areas in Communications, vol.12, no.9, pp.1540-1547, 1994.
[4] S. C. Douglas, H. Y. Meng, Stochastic gradient adaptation under general error criteria, IEEE Transactions on Signal Processing, vol.42, pp.1335-1351, 1994.
[5] J. C. Principe, D. Xu, Q. Zhao, J. W. Fisher, Learning from examples with information theoretic criteria, Journal of VLSI Signal Processing Systems, vol.26, pp.61-77, 2000.
[6] D. Erdogmus, K. E. Hild II, J. C. Principe, Online entropy manipulation: stochastic information gradient, IEEE Signal Processing Letters, vol.10, no.8, pp.242-245, 2003.
[7] D. Erdogmus, J. C. Principe, Convergence properties and data efficiency of the minimum error entropy criterion in Adaline training, IEEE Transactions on Signal Processing, vol.51, no.7, pp.1966-1978, 2003.
[8] B. D. Chen, J. C. Hu, L. Pu, Z. Q. Sun, Stochastic gradient algorithm under (h, φ)-entropy criterion, Circuits, Systems and Signal Processing, vol.26, pp.941-960, 2007.
[9] X. Wu, T. Stengos, Partially adaptive estimation via the maximum entropy densities, Econometrics Journal, vol.8, pp.352-366, 2005.
[10] D. Erdogmus, K. E. Hild II, Y. N. Rao, J. C. Principe, Minimax mutual information approach for independent component analysis, Neural Computation, vol.16, pp.1235-1252, 2004.
[11] B. W. Silverman, Density Estimation for Statistics and Data Analysis, NY: Chapman & Hall, 1986.
[12] M. Zhang, B. D. Chen, Y. Zhu, J. C. Hu, Laplacian kernel based SIG algorithm for FIR filtering in the presence of alpha-stable noise, ICIC Express Letters, vol.4, no.1, pp.173-176, 2010.