STOCHASTIC INFORMATION GRADIENT ALGORITHM BASED ON MAXIMUM ENTROPY DENSITY ESTIMATION. Badong Chen, Yu Zhu, Jinchun Hu and Ming Zhang

ICIC Express Letters, ICIC International, ISSN 1881-803X, Volume 3, Number 3, September 2009, pp. 1-6

STOCHASTIC INFORMATION GRADIENT ALGORITHM BASED ON MAXIMUM ENTROPY DENSITY ESTIMATION

Badong Chen, Yu Zhu, Jinchun Hu and Ming Zhang
Department of Precision Instruments and Mechanology, Tsinghua University, Beijing 100084, P. R. China
chenbd04@mails.tsinghua.edu.cn

Received December 2009; accepted February 2010

Abstract. We propose a new stochastic information gradient (SIG) algorithm based on an online maximum entropy density estimation of the error distribution. The proposed algorithm is simple yet efficient in implementation, as it involves no choice of kernel bandwidth and does not resort to Newton's method of optimization. Simulation results demonstrate the favorable performance of the new algorithm.

1. Introduction. The minimum mean square error (MMSE) criterion and the corresponding least mean squares (LMS) algorithm are efficient for adaptive filtering when the error is normally distributed [1]. When the error distribution is non-Gaussian, the MMSE criterion (and hence the LMS algorithm) usually suffers significant performance degradation [2, 3, 4]. Recently, in order to deal with non-Gaussian error distributions, the minimum error entropy (MEE) criterion has been proposed as an information-theoretic alternative to the MMSE criterion in supervised adaptation [5, 6, 7, 8]. Under the MEE criterion, the stochastic-gradient-based filtering algorithm, the so-called stochastic information gradient (SIG) algorithm, takes the form [6]

W_{k+1} = W_k + η ∂/∂W_k { log p̂(e_k) }    (1)

where W_k = [w_1, w_2, ..., w_M]^T denotes the filter weight vector at the k-th iteration, η is the step size, and p̂(·) denotes the estimated probability density function (PDF) of the error e_k. The SIG algorithm (1) is somewhat similar to adaptive estimation (AE) [9].
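As a concrete illustration (our sketch, not the authors' code), one SIG iteration of (1) for an FIR filter can be written as follows, using the sliding-window kernel density estimate (2) introduced below with a Gaussian kernel; the window length, bandwidth σ and step size η are illustrative choices, and the windowed past errors are treated as constants when differentiating:

```python
import numpy as np

def sig_kernel_step(W, x_vec, d, past_errors, eta=0.02, sigma=0.5):
    """One SIG-Kernel iteration for an FIR filter: gradient ascent on
    log p_hat(e_k), with p_hat the sliding-window Gaussian kernel estimate (2).
    past_errors holds e_{k-L+1}, ..., e_{k-1}; e_k completes the window."""
    e_k = d - W @ x_vec                        # filtering error
    window = np.append(past_errors, e_k)       # window {e_{k-L+1}, ..., e_k}
    diffs = e_k - window
    K = np.exp(-diffs**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    dK = -(diffs / sigma**2) * K               # derivative of K_sigma at e_k - e_i
    p_hat, dp_de = K.mean(), dK.mean()
    # For e_k = d_k - W^T x_k we have de_k/dW = -x_vec, hence the update (1):
    W = W + eta * (dp_de / p_hat) * (-x_vec)
    return W, e_k
```

Each step nudges W so that the current error moves toward the mode of the windowed error density, which is what drives the error entropy down.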
Adaptive estimation first estimates the error distribution from the consistent ordinary least squares (OLS) residuals, and then uses the estimated likelihood function to estimate the parameters. Unlike AE, however, the SIG algorithm is an online adaptive filtering (learning) algorithm, which estimates the error distribution and updates the weight vector simultaneously. The existing online PDF estimate is usually obtained by a nonparametric kernel approach over a sliding window of error samples [1], that is,

p̂(e_k) = (1/L) Σ_{i=k−L+1}^{k} K_σ(e_k − e_i)    (2)

where L is the length of the sliding window of error data {e_{k−L+1}, ..., e_{k−1}, e_k} and K_σ(·) is the kernel function with bandwidth σ. However, as stressed in [9], a suitable parametric estimate of the error distribution, which involves no bandwidth selection, may outperform nonparametric ones. So in this letter we propose a new SIG algorithm, in which the error distribution is modeled as a generalized exponential density and is estimated through the Maximum Entropy Principle (MEP) [9, 10]. The reason for using the maximum entropy

(Maxent) density estimation is twofold. First, with this method the selection of a kernel bandwidth is bypassed (an inappropriate choice of bandwidth significantly deteriorates the behavior of the SIG algorithm); second, Maxent densities nest a wide range of statistical distributions as special cases, yet fit the known data without committing extensively to unseen data.

2. Algorithm. The Maxent density belongs to the generalized exponential family [9], that is,

p(e) = exp( −λ_0 − Σ_{i=1}^{m} λ_i f_i(e) )    (3)

where the functions {f_i(e)} establish the characterizing moments {E[f_i(e)]} and {λ_i} are the Lagrange multipliers. Let E_p[f_i(e)] = µ_i; the Lagrange multipliers {λ_i} can then be obtained by solving the optimization [9]

max_λ { Γ = λ_0 + Σ_{i=1}^{m} λ_i µ_i },   λ_0 = log ∫ exp( −Σ_{i=1}^{m} λ_i f_i(e) ) de    (4)

where λ = [λ_1, ..., λ_m]^T. This optimization can be solved using Newton's method [9]. However, the computational burden of Newton's method is heavy because many numerical integrals are involved, so it is not suitable for online density estimation. To reduce the computational burden, we use the method of [10] to calculate the Lagrange multipliers. Specifically, setting ∂Γ/∂λ_i = 0 gives

µ_i = ∫ f_i(e) exp( −λ_0 − Σ_{j=1}^{m} λ_j f_j(e) ) de    (5)

Integrating by parts, and assuming the antiderivative F_i(e) = ∫ f_i(e) de satisfies F_i(e) p(e) → 0 as |e| → ∞, we have

µ_i = Σ_{j=1}^{m} λ_j E_p[ F_i(e) f_j′(e) ] = Σ_{j=1}^{m} λ_j β_ij    (6)

where β_ij = E_p[ F_i(e) f_j′(e) ]. Hence the Lagrange multipliers are given by the solution of the linear system of equations (6), that is,

λ = β^{−1} µ    (7)

where µ = [µ_1, ..., µ_m]^T and β = [β_ij]. The moment vector µ and the matrix β can be approximated by sample means.
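To make the linear-equation route (6)-(7) concrete, here is a small numerical sketch (our illustration, not from the paper). With the choice f_1(e) = e and f_2(e) = e², so that f′ = (1, 2e) and F = (e²/2, e³/3), the sample-mean estimates of µ and β should recover the Gaussian multipliers λ_1 ≈ 0 and λ_2 ≈ 1/(2σ²) from Gaussian data:

```python
import numpy as np

def maxent_multipliers(e):
    """Estimate the Lagrange multipliers of (3) from samples via (6)-(7),
    using f_1(e) = e, f_2(e) = e^2 (hence f' = (1, 2e), F = (e^2/2, e^3/3))."""
    f = np.stack([e, e**2])                    # f_i(e), shape (2, N)
    fp = np.stack([np.ones_like(e), 2 * e])    # f_i'(e)
    F = np.stack([e**2 / 2, e**3 / 3])         # antiderivatives F_i(e)
    mu = f.mean(axis=1)                        # mu_i = E[f_i(e)]
    beta = (F[:, None, :] * fp[None, :, :]).mean(axis=2)  # beta_ij = E[F_i f_j']
    return np.linalg.solve(beta, mu)           # lambda = beta^{-1} mu, eq. (7)

rng = np.random.default_rng(1)
lam = maxent_multipliers(rng.normal(0.0, 1.0, size=200_000))
# For unit-variance Gaussian samples, lambda_1 -> 0 and lambda_2 -> 1/(2 sigma^2)
# = 0.5, i.e. p(e) proportional to exp(-e^2/2): the standard normal, as expected.
```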
So we propose the following online Maxent density estimate of the error distribution:

p̂(e_k) = C(λ̂(k)) exp( −Σ_{i=1}^{m} λ̂_i(k) f_i(e_k) )
λ̂(k) = β̂(k)^{−1} µ̂(k)
µ̂_i(k) = γ_µ µ̂_i(k−1) + (1−γ_µ) f_i(e_k)
β̂_ij(k) = γ_β β̂_ij(k−1) + (1−γ_β) F_i(e_k) f_j′(e_k)    (8)

where C(λ̂(k)) stands for the normalization factor, and γ_µ and γ_β are exponential weighting parameters (0 < γ_µ < 1, 0 < γ_β < 1). Based on the estimated PDF p̂(e_k) in (8), the SIG algorithm (1) becomes

W_{k+1} = W_k − η ∂/∂W_k ( Σ_{i=1}^{m} λ̂_i(k) f_i(e_k) ) = W_k − η (∂e_k/∂W_k) Σ_{i=1}^{m} λ̂_i(k) f_i′(e_k)    (9)

If the adaptive filter has a finite impulse response (FIR) structure, we have

W_{k+1} = W_k + η ( Σ_{i=1}^{m} λ̂_i(k) f_i′(e_k) ) X_k    (10)

where X_k = [x_k, x_{k−1}, ..., x_{k−M+1}]^T is the input vector (M is the number of taps). The above algorithm depends on the choice of the functions {f_i(e)}. In practice, as contended in [9], the Maxent density can be chosen as

p(e) = exp( −λ_0 − λ_1 e − λ_2 e² − λ_3 log(1 + e²) − λ_4 sin(e) − λ_5 cos(e) )    (11)

or simply

p(e) = exp( −λ_0 − λ_1 e − λ_2 e² − λ_3 log(1 + e²) )    (12)

To make a distinction, we denote by SIG-Kernel and SIG-Maxent the SIG algorithms based on the kernel approach and the maximum entropy approach, respectively.

3. Simulation Results. We now present a simulation experiment to demonstrate the performance of the SIG-Maxent algorithm, in comparison with the SIG-Kernel and the well-known LMS algorithms. Consider an FIR channel identification scenario, in which the error signal is given by

e_k = (W*^T X_k + n_k) − W_k^T X_k = V_k^T X_k + n_k    (13)

where W* denotes the weight vector of the unknown FIR channel, n_k is the interference noise, and V_k = W* − W_k is the weight error vector. In the simulation, we set

W* = [0.1, 0.2, 0.3, 0.4, 0.1, 0.3, 0.4, 0.3, 0.2, 0.1, 0.1, 0.2, 0.3, 0.4, 0.5]^T    (14)

Further, let the input {x_k} be a unit-power white Gaussian process, and let {n_k} be unit-dispersion symmetric alpha-stable (SαS) noise (1 < α ≤ 2) [2]. For the SIG-Maxent algorithm, we adopt (12) as the Maxent density and set γ_µ = γ_β = 0.999. For the SIG-Kernel algorithm, we choose the Gaussian kernel and determine the bandwidth according to Silverman's rule of thumb [11]. Fig. 1 and Fig. 2 show the convergence curves of the mean square deviation (MSD) E[V_k^T V_k], averaged over 10 independent Monte Carlo runs, for α = 2 and α = 1.5 respectively.
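Putting (8), (10) and (12) together, the identification experiment can be sketched as follows. This is our own minimal illustration, not the paper's code: the interference is Gaussian (the α = 2 case), the step size and the small diagonal initialization of β̂ are illustrative choices, and the SIG-Kernel and LMS baselines are omitted.

```python
import numpy as np

# Characterizing functions of the Maxent density (12), their derivatives,
# and their antiderivatives F_i(e) (the integral of f_i), used in recursion (8):
def f(e):
    return np.array([e, e**2, np.log1p(e**2)])

def f_prime(e):
    return np.array([1.0, 2.0 * e, 2.0 * e / (1.0 + e**2)])

def F(e):
    return np.array([e**2 / 2, e**3 / 3,
                     e * np.log1p(e**2) - 2.0 * e + 2.0 * np.arctan(e)])

def sig_maxent_identify(x, d, M=15, eta=0.01, g_mu=0.999, g_beta=0.999):
    """Online SIG-Maxent FIR identification: moment recursions (8) plus the
    weight update (10). The small diagonal initialization of beta_hat is our
    own choice, to keep the linear solve well-posed during start-up."""
    W = np.zeros(M)
    mu_hat = np.zeros(3)
    beta_hat = 1e-2 * np.eye(3)
    for k in range(M - 1, len(x)):
        X_k = x[k - M + 1:k + 1][::-1]            # X_k = [x_k, ..., x_{k-M+1}]^T
        e_k = d[k] - W @ X_k
        mu_hat = g_mu * mu_hat + (1 - g_mu) * f(e_k)
        beta_hat = g_beta * beta_hat + (1 - g_beta) * np.outer(F(e_k), f_prime(e_k))
        lam = np.linalg.solve(beta_hat, mu_hat)   # lambda_hat(k) = beta_hat^{-1} mu_hat
        W = W + eta * (lam @ f_prime(e_k)) * X_k  # weight update (10)
    return W
```

Note that for a near-Gaussian error the dominant term of λ̂·f′(e) is roughly e/σ², so the update behaves like a normalized LMS step; the extra log(1 + e²) term is what tempers the gradient for large (impulsive) errors.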
Evidently, the two SIG algorithms are both robust to impulsive noise (α = 1.5), and in particular the SIG-Maxent algorithm achieves the smallest misadjustment in both the Gaussian (α = 2) and impulsive-noise cases.

Remark: In a previous paper [12], we showed that the SIG algorithm with a Gaussian kernel (SIG-G) is not robust to impulsive noise, and proposed a Laplacian-kernel-based SIG (SIG-L) algorithm to improve robustness. Our new study shows, however, that if the error PDF is estimated as in (2), the SIG-G algorithm is in fact robust to impulsive noise. Notice that in [12] the estimated PDF is given by

p̂(e_k) = (1/L) Σ_{i=k−L}^{k−1} K_σ(e_k − e_i)    (15)

4. Conclusion. Based on an online Maxent density estimation of the error distribution, we have developed the SIG-Maxent algorithm. Simulation results demonstrate the favorable performance of the new algorithm.

Acknowledgment. This work was supported by the National Natural Science Foundation of China (60904054), the National Key Basic Research and Development Program (973) of China (2009CB724205), the China Postdoctoral Science Foundation (20080440384), and grants from the State Key Laboratory of Tribology (SKLT) at Tsinghua University (SKLT08B04, SKLT08B06).

Figure 1. Average convergence curves (MSD in dB versus iteration) for the LMS, SIG-Kernel and SIG-Maxent algorithms (α = 2.0)

Figure 2. Average convergence curves (MSD in dB versus iteration) for the LMS, SIG-Kernel and SIG-Maxent algorithms (α = 1.5)

REFERENCES

[1] S. Haykin, Adaptive Filter Theory, 3rd ed., Prentice Hall, 1996.
[2] M. Shao and C. L. Nikias, Signal processing with fractional lower order moments: stable processes and their applications, Proceedings of the IEEE, vol. 81, no. 7, pp. 986-1009, 1993.
[3] S. C. Pei and C. C. Tseng, Least mean p-power error criterion for adaptive FIR filter, IEEE Journal on Selected Areas in Communications, vol. 12, no. 9, pp. 1540-1547, 1994.
[4] S. C. Douglas and H.-Y. Meng, Stochastic gradient adaptation under general error criteria, IEEE Transactions on Signal Processing, vol. 42, pp. 1335-1351, 1994.
[5] J. C. Principe, D. Xu, Q. Zhao and J. W. Fisher, Learning from examples with information theoretic criteria, Journal of VLSI Signal Processing Systems, vol. 26, pp. 61-77, 2000.
[6] D. Erdogmus, K. E. Hild II and J. C. Principe, Online entropy manipulation: stochastic information gradient, IEEE Signal Processing Letters, vol. 10, no. 8, pp. 242-245, 2003.

[7] D. Erdogmus and J. C. Principe, Convergence properties and data efficiency of the minimum error entropy criterion in Adaline training, IEEE Transactions on Signal Processing, vol. 51, no. 7, pp. 1966-1978, 2003.
[8] B. D. Chen, J. C. Hu, L. Pu and Z. Q. Sun, Stochastic gradient algorithm under (h, φ)-entropy criterion, Circuits, Systems and Signal Processing, vol. 26, pp. 941-960, 2007.
[9] X. Wu and T. Stengos, Partially adaptive estimation via the maximum entropy densities, Econometrics Journal, vol. 8, pp. 352-366, 2005.
[10] D. Erdogmus, K. E. Hild II, Y. N. Rao and J. C. Principe, Minimax mutual information approach for independent component analysis, Neural Computation, vol. 16, pp. 1235-1252, 2004.
[11] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, 1986.
[12] M. Zhang, B. D. Chen, Y. Zhu and J. C. Hu, Laplacian kernel based SIG algorithm for FIR filtering in the presence of alpha-stable noise, ICIC Express Letters, vol. 4, no. 1, pp. 173-176, 2010.