TIME SERIES ANALYSIS WITH INFORMATION THEORETIC LEARNING AND KERNEL METHODS


TIME SERIES ANALYSIS WITH INFORMATION THEORETIC LEARNING AND KERNEL METHODS By PUSKAL P. POKHAREL A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA

© 2007 Puskal P. Pokharel

To my parents, professors, and friends.

ACKNOWLEDGMENTS

With the patience, persistence and support of many individuals, my stay for doctoral research has been very memorable and fruitful. Without the help and the active encouragement of some of these people, the amount of time and effort required for the Ph.D. degree would have made it overwhelmingly daunting. I acknowledge and thank Dr. Jose C. Principe for his role as advisor and mentor during my stay here. Our numerous discussions and exchanges of ideas have been integral to this research. I also thank him for providing a stimulating environment in the Computational NeuroEngineering Laboratory, where I have developed and expanded a lot of my engineering and research capabilities. I am also grateful to the members of my advisory committee, Dr. John G. Harris, Dr. K. Clint Slatton and Dr. Murali Rao, for their time and valuable suggestions. I also would like to express my appreciation to all the CNEL members, especially those of the ITL group, for their collaboration in this and other research projects. Finally, I thank my parents for encouraging me to pursue a Ph.D. in the first place, and for supporting me every step of the way.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1  INTRODUCTION
   Signal Detection
   Optimal and Adaptive Filtering
2  NEW SIMILARITY AND CORRELATION MEASURE
   Kernel-Based Algorithms and ITL
   Generalized Correlation Function
      Correntropy as Generalized Correlation
      Properties
   Similarity Measure
   Optimal Signal Processing Based on Correntropy
3  CORRENTROPY BASED MATCHED FILTERING
   Detection Statistics and Template Matching
      Linear Matched Filter
      Correntropy as Decision Statistic
      Interpretation from Kernel Methods
      Impulsive Noise Distributions
         Two-term Gaussian mixture model
         Alpha-stable distribution
         Locally suboptimal receiver
      Selection of Kernel Size
   Experiments and Results
      Additive White Gaussian Noise
      Additive Impulsive Noise by Mixture of Gaussians
      Alpha-Stable Noise in Linear Channel
      Effect of Kernel Size
      Low Cost CMF Using Triangular Kernel
4  APPLICATION TO SHAPE CLASSIFICATION OF PARTIALLY OCCLUDED OBJECTS
   Introduction
   4.2 Problem Model and Solution
5  WIENER FILTERING AND REGRESSION
   Linear Wiener Filter
   Correntropy Based Wiener Filter
   Simple Regression Models with Kernels
      Radial Basis Function Network
      Nadaraya-Watson Estimator
      Normalized RBF Network
   Parametric Improvement on Correntropy Filter
   Experiments and Results
      System Identification
      Time Series Prediction of Mackey-Glass Time Series
6  ON-LINE FILTERS
   Background
   Correntropy LMS
   Kernel LMS
      Self-Regularized Property
      Experimental Comparison between LMS and KLMS
   Kernel LMS with Restricted Growth
      Sparsity Condition for the Feature Vectors
      Algorithm
      Experimental Comparison between KLMS and KLMS with Restricted Growth
   Application to Nonlinear System Identification
      Motivation
      Identification of a Wiener System
      Experimental Results
   Summary
7  APPLICATION TO BLIND EQUALIZATION
   Motivation
   Problem Setting
   Cost Function and Iterative Algorithm
   Simulation Results
8  SUMMARY
   Summary
   Other Applications
   Future Work

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1  Values for the statistic for the two cases using the Gaussian kernel
Recognition and misclassification rates

LIST OF FIGURES

1-1  Set up of a typical time series problem
1-2  General overview for solving time series problems in the feature space
2-1  Probabilistic interpretation of V_XY (the maximum of each curve has been normalized to 1 for visual convenience)
-  Receiver operating characteristic curves for synchronous detection in AWGN channel with kernel variance σ²(CMF) = 15, σ²(MI) = 15 (the curves for MF and CMF for 10 dB overlap)
-  Receiver operating characteristic curves for asynchronous detection in AWGN with kernel variance σ²(CMF) = 15, σ²(MI) =
-  Receiver operating characteristic curves for synchronous detection in additive impulsive noise with kernel variance σ²(CMF) = 5, σ²(MI) =
-  Receiver operating characteristic curves for asynchronous detection in additive impulsive noise with kernel variance σ²(CMF) = 5, σ²(MI) =
-  Receiver operating characteristic curves for synchronous detection in additive white α-stable distributed noise, kernel variance σ² = 3, α =
-  Receiver operating characteristic curves for synchronous detection in additive white α-stable distributed noise, kernel variance σ² = 3, SNR = 15 dB (the plots for MI and CMF almost coincide for both values of α)
-  Receiver operating characteristic curves for asynchronous detection in additive white α-stable distributed noise, kernel variance σ² = 3, SNR = 15 dB (the plots for MI and CMF almost coincide for both values of α)
-  Area under the ROC for various kernel size values for additive alpha-stable distributed noise using synchronous detection, α =
-  Area under the ROC for various kernel size values for additive impulsive noise with the mixture of Gaussians using synchronous detection
-  Triangular function that can be used as a kernel
-  Receiver operating characteristic curves for SNR of 5 dB for the various detection methods
-  The fish template database
-  The occluded fish
-  The extracted boundary of the fish
5-1  The computation of the output data sample given an embedded input vector using the correntropy Wiener filter
-  Time delayed neural network to be modeled
-  Input and output signals generated by the TDNN (desired response), CF and WF using a time embedding of
-  Mean square error values for WF and CF for the system modeling example
-  Mean square error values for MG time series prediction for various embedding sizes
-  Mean square error values for MG time series prediction for different sizes of training data
-  Linear filter structure with feature vectors in the feature space
-  Error samples for KLMS in predicting the Mackey-Glass time series
-  Learning curves for the LMS, the KLMS and the regularized solution
-  Comparison of the mean square error for the three methods with varying embedding dimension (filter order for LMS) of the input
-  Learning curve of linear LMS and kernel LMS with restricted growth for three values of δ
-  Performance of kernel LMS with restricted growth for various values of δ
-  Number of kernel centers after training for various values of δ
-  A nonlinear Wiener system
-  Signal in the middle of the Wiener system vs. output signal for binary input symbols and different indexes n
-  Mean square error for the identification of the nonlinear Wiener system with the three methods (the values for KRLS are only shown after the first window)
-  Inter-symbol interference (ISI) convergence curves for correntropy and CMA under Gaussian noise
-  Convergence curves of the equalizer coefficients for correntropy and CMA under Gaussian noise (the true solution is w = (0, 1, 0.5))
-  Inter-symbol interference (ISI) convergence curves for correntropy and CMA under impulsive noise

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

TIME SERIES ANALYSIS WITH INFORMATION THEORETIC LEARNING AND KERNEL METHODS

By Puskal P. Pokharel
December 2007
Chair: Jose C. Principe
Major: Electrical and Computer Engineering

The major goal of our research is to develop simple and effective nonlinear versions of some basic time series tools for signal detection, optimal filtering, and on-line adaptive filtering. These extensions shall be based on concepts being developed in information theoretic learning (ITL) and kernel methods. In general, all ITL algorithms can be interpreted from kernel methods, because ITL is based on extracting higher order information (that beyond second order, as given by the autocorrelation function) directly from the data samples by exploiting nonparametric density estimation with translation-invariant kernel functions. ITL in general still lacks tools to exploit the time structure in the data, because the assumption of independently distributed data samples is usually an essential requirement. Kernel methods provide an elegant means of obtaining nonlinear versions of linear algorithms expressed in terms of inner products by using the so-called kernel trick and Mercer's theorem. This has given rise to a variety of algorithms in the field of machine learning, but most of them are computationally very expensive, requiring a large Gram matrix whose dimension equals the number of data points. Since these large matrices are usually ill-conditioned, they require an additional regularization step in methods like kernel regression. Our goal is to design basic signal analysis tools for time signals that extract higher order information directly from the data, like ITL, and also avoid the complexities of many kernel methods.

We present new methods for time series analysis (matched filtering and optimal adaptive filtering) based on correntropy, a newly introduced ITL concept, and on kernel methods. Correntropy induces an RKHS that has the same dimensionality as the input space but is nonlinearly related to it. It is different from the conventional kernel methods, in both scope and detail. This in effect helps us derive elegant versions of a few tools that form the basic building blocks of signal processing: the matched filter (correlation receiver), the Wiener filter, and the least mean square (LMS) filter.

CHAPTER 1
INTRODUCTION

Natural processes of interest for engineering are composed of two basic characteristics: the statistical distribution of amplitudes and the time structure. Time in itself is fundamental and crucial to many real-world problems, and the instantaneous random variables are hardly ever independently distributed, i.e., stochastic processes possess a time structure. For this reason there are widely used measures that quantify the time structure, like the autocorrelation function. On the other hand, there are a number of methods that are based solely on the statistical distribution, ignoring the time structure. A single measure that includes both of these important characteristics could greatly enhance the theory of stochastic processes. The fact that reproducing kernels are covariance functions, as described by Aronszajn [1] and Parzen [2], explains their early role in inference problems. More recently, numerous algorithms using kernel methods, including Support Vector Machines (SVM) [3], kernel principal component analysis (K-PCA) [4], kernel Fisher discriminant analysis (K-FDA) [5], and kernel canonical correlation analysis (K-CCA) [6],[7], have been proposed. Likewise, advances in information theoretic learning (ITL) have brought out a number of applications where entropy and divergence employ Parzen's nonparametric estimation [8], [9], [10], [11]. Many of these algorithms have given very elegant solutions to complicated nonlinear problems. Most of these contemporary algorithms are based on the assumption of independently distributed data, which in many cases is not realistic. Obviously, an accurate description of a stochastic process requires both information about the distribution and about its time structure. We can describe the time series problems seen in engineering in two broad categories. (A) Detection: This problem has the general setup given by figure 1-1. A known signal template (usually chosen from a finite set of possibilities) passes through a channel which modifies the signal based on an underlying system model. Usually an additional source,

independent of the channel and the template, is also present; this source is called the noise. This results in an observed signal that is a distorted version of the original signal. The problem then is simply to detect which signal template was transmitted through the channel based on the observed signal, usually without knowing or calculating the channel explicitly [12].

Figure 1-1. Set up of a typical time series problem.

(B) Estimation: Estimation usually involves estimating or calculating any component (or a part of it) of the time series problem model given in figure 1-1 based on the observed signal. Those components can be the original input signal, channel parameters or even certain noise characteristics. The problem may or may not have prior knowledge of the input signal. If such information is absent, then the estimation problem is called blind. Good examples of blind estimation are blind source separation [13], blind deconvolution [14] and blind equalization [15]. There can be other estimation problems as well, like noise estimation (which usually involves calculating certain noise parameters like the covariance), system identification [16], time series prediction [17], interference cancellation [18], etcetera. The integral component for both these categories of problems is a correlation (or, more generally, similarity) measure that extracts information about the time structure. These ideas are summarized in figure 1-2. It is here that our proposed method of higher order correlation will make an impact for the improvement of these basic tasks in time series analysis. Presently, the most widely used measure of similarity is the correlation function, which appears naturally when the underlying assumptions are limited to linearity and Gaussianity. The kernel methods that were mentioned earlier also employ a similarity

measure, the kernel function. But here the similarity is point-wise among pairs of data samples. Thus, to represent the data set for solving most of the problems, one is required to evaluate the kernel pairwise over all the available data samples, creating a large Gram matrix. Though these types of methods can solve certain complex problems, they are usually quite burdensome in both time and memory. Yet many researchers are attracted toward kernel methods because they can solve complex problems elegantly using conventional optimal signal processing theory, but in a rich kernel-induced reproducing kernel Hilbert space (RKHS). This space is usually very high dimensional, but most solutions in the RKHS can be readily calculated in the input space using the kernel function, which acts as an inner product of the feature vectors.

Figure 1-2. General overview for solving time series problems in the feature space.

Through our study, we shall present a new function that is a statistical measure having the same form as the correlation function and, like the kernel, will have an associated RKHS. This provides a completely new avenue of research, potentially solving complex problems more accurately and conveniently. Specifically, we shall define a generalized correlation function, which, due to its intriguing

relationship with Renyi's quadratic entropy and properties similar to correlation, is termed correntropy. This new function can provide novel means of performing various signal processing tasks involving detection and estimation. The specific tasks that we discuss in this dissertation are signal detection (correlation receiver) and optimal filtering. The integral component for both of these problems is a similarity measure, built using the concepts of correntropy and kernel methods. Now we introduce these tasks more specifically.

1.1 Signal Detection

Detection of known signals transmitted through linear and nonlinear channels is a fundamental problem in signal processing theory with a wide range of applications in communications, radar, and biomedical engineering, to name just a few [12, 19]. The linear correlation filter or matched filter has been the basic building block for the majority of these applications. The limitations of the matched filter, though, are already defined by the assumptions under which its optimality can be proved. It is well known that for the detection of a known signal linearly added to white Gaussian noise (AWGN), the matched filter maximizes the signal to noise ratio (SNR) among all linear filters [20]. Theoretically, this means that the matched filter output is a maximum likelihood statistic for hypothesis testing under the assumptions of linearity and Gaussianity. The optimality is predicated on the sufficiency of second order statistics to characterize the noise. Unfortunately, most real world signals are not completely characterized by their second order statistics, and sub-optimality inevitably creeps in. Optimal detection in non-Gaussian (or nonlinear) noise environments requires the use of the characteristic function and is much more complex [21] because it requires higher order statistics to accurately model the noise. This motivates the recent interest in nonlinear filters (kernel matched filters) [22, 23] or nonlinear cost functions [24], but the computational complexity of such systems outweighs their usefulness in applications where high processing delay cannot be tolerated, such as radar and mobile communication systems. With kernel

methods, a nonlinear version of the template matching problem is first formulated in the kernel feature space by using a nonlinear mapping, and the so-called kernel trick [22] is employed to give a computationally tractable formulation. But the correlation matrix formed in this infinite dimensional feature space is also infinitely large, and the resulting formulation is complex, requiring a large set of training data. Alternatively, it can be formulated as a discriminant function in kernel space [23], but this still suffers from the need to train the system beforehand and store the training data. The matched filter based on quadratic mutual information is another recently introduced nonlinear filter that maximizes the mutual information between the template and the output of the filter [24]. This method does not require an initial training step since it is non-parametric. However, the method requires the estimation of the quadratic mutual information with kernels and is ideally valid only for identically and independently distributed (iid) samples, which is rarely the case in reality. Moreover, the computational load is still O(N^2) at best. The derivation of the method introduced in this dissertation uses a recently introduced positive definite function called correntropy [15], which quantifies higher order moments of the noise distribution and has a computational complexity of O(N), thus providing a useful combination of good representation and low computational complexity.

1.2 Optimal and Adaptive Filtering

Due to the power of the solution and its relatively easy implementation, Wiener filters have been extensively used in all areas of electrical engineering. Despite this widespread use, Wiener filters are solutions in linear vector spaces. Therefore, many attempts have been made to create nonlinear versions of the Wiener filter, mostly based on Volterra series [25], but unfortunately the solutions are very complex with many coefficients. There are also two types of nonlinear models that have been commonly used: the Hammerstein and the Wiener models. They are composed of a static nonlinearity and a linear system, where the linear system is adapted using the Wiener solution. However, the choice of the nonlinearity is critical for good performance, because the linear solution is

obtained in the transformed space. Recent advances in nonlinear signal processing have used nonlinear filters, commonly known as dynamic neural networks [26], that have been extensively used in the same basic applications as Wiener filters when the system under study is nonlinear. However, there is no analytical solution to obtain the parameters of multi-layered neural networks. They are normally trained using the back propagation algorithm or its modifications. In some other cases, a nonlinear transformation of the input is first implemented and a regression is computed at the output. Good examples of this are the radial basis function (RBF) network [27] and the kernel methods [3, 28]. The disadvantage of these alternate projection techniques is the tremendous amount of computation required, due to the inversion of a huge matrix, and there is usually a need for regularization. We show how to extend the analytic solution in linear vector spaces proposed by Wiener to a nonlinear manifold that is obtained through a reproducing kernel Hilbert space. The main idea is to transform the input data nonlinearly so that it is Gaussian distributed and then apply the linear Wiener filter solution. Our method actually encompasses and enriches the Hammerstein model by inducing nonlinearities which may not be achieved via a static nonlinearity. For viable computation, this approach utilizes a rather simplistic approximation, due to which the results are less optimal but still perform better than the linear Wiener filter. To improve its results we propose to obtain the least mean square error solution on-line, following the approach employed in Widrow's famous LMS algorithm [29].

CHAPTER 2
NEW SIMILARITY AND CORRELATION MEASURE

2.1 Kernel-Based Algorithms and ITL

In the last years a number of kernel methods, including Support Vector Machines (SVM) [3], kernel principal component analysis (K-PCA) [4], kernel Fisher discriminant analysis (K-FDA) [5], and kernel canonical correlation analysis (K-CCA) [6],[7], have been proposed and successfully applied to important signal processing problems. The basic idea of kernel algorithms is to transform the data $x_i$ from the input space to a high dimensional feature space of vectors $\Phi(x_i)$, where the inner products can be computed using a positive definite kernel function satisfying Mercer's conditions [3]: $\kappa(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$. This simple and elegant idea allows us to obtain nonlinear versions of any linear algorithm expressed in terms of inner products, without even knowing the exact mapping $\Phi$. A particularly interesting characteristic of the feature space is that it is a reproducing kernel Hilbert space (RKHS): i.e., the span of functions $\{\kappa(\cdot, x) : x \in \mathcal{X}\}$ defines a unique functional Hilbert space [1], [2], [30], [31]. The crucial property of these spaces is the reproducing property of the kernel

$$f(x) = \langle \kappa(\cdot, x), f \rangle, \quad \forall f \in \mathcal{F}. \qquad (2-1)$$

In particular, we can define our nonlinear mapping from the input space to an RKHS as $\Phi(x) = \kappa(\cdot, x)$; then we have

$$\langle \Phi(x), \Phi(y) \rangle = \langle \kappa(\cdot, x), \kappa(\cdot, y) \rangle = \kappa(x, y), \qquad (2-2)$$

and thus $\Phi(x) = \kappa(\cdot, x)$ defines the Hilbert space associated with the kernel. Without loss of generality, in this chapter we will only consider the translation-invariant Gaussian kernel, which is the most widely used Mercer kernel:

$$\kappa(x - y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right). \qquad (2-3)$$
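As a concrete reference for the estimators used later, the following Python sketch (an illustration we add here, not code from the dissertation; function names and the test data are arbitrary) implements the Gaussian kernel of (2-3) and checks on a small sample that the resulting Gram matrix is symmetric and positive definite, as Mercer's conditions require:

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    # Translation-invariant Gaussian kernel of (2-3); evaluates element-wise on arrays
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.exp(-(u - v) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(50)
    K = gaussian_kernel(x[:, None], x[None, :], sigma=1.0)   # Gram matrix K_ij = kappa(x_i - x_j)
    # K is symmetric; its eigenvalues are non-negative up to floating-point round-off
    print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min())
```

The same helper (repeated inline where needed) is reused in the later sketches, so all of them share the normalization of (2-3).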

On the other hand, Information Theoretic Learning (ITL) addresses the issue of extracting information directly from data in a non-parametric manner [8]. Typically, Renyi's entropy or some approximation to the Kullback-Leibler distance have been used as ITL cost functions, and they have achieved excellent results on a number of problems: e.g., time series prediction [9], blind source separation [10] or equalization [11]. It has been recently shown that ITL cost functions, when estimated using the Parzen method, can also be expressed using inner products in a kernel feature space which is defined by the Parzen kernel, thus suggesting a close relationship between ITL and kernel methods [32],[33]. For instance, if we have a data set $x_1, \ldots, x_N \in \mathbb{R}^d$, and the corresponding set of transformed data points $\Phi(x_1), \ldots, \Phi(x_N)$, then it turns out that the squared mean of the transformed vectors, i.e.,

$$\|m_\Phi\|^2 = \left\langle \frac{1}{N}\sum_{i=1}^{N}\Phi(x_i),\; \frac{1}{N}\sum_{j=1}^{N}\Phi(x_j) \right\rangle = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\kappa(x_i - x_j), \qquad (2-4)$$

is the information potential $V(x)$ as defined in [8] (the quadratic Renyi's entropy is defined as $H_R = -\log(V(x))$). Another interesting concept in information theoretic learning is the Cauchy-Schwarz pdf distance, which has been proved to be effective for measuring the closeness between two probability density functions and has been used successfully for non-parametric clustering. If $p_X(x)$ and $p_Y(y)$ are the two pdfs, the Cauchy-Schwarz pdf distance is defined [8] by

$$D(p_X, p_Y) = -\log \frac{\int p_X(x)\, p_Y(x)\, dx}{\sqrt{\int p_X^2(x)\, dx \int p_Y^2(x)\, dx}} \ge 0. \qquad (2-5)$$

Mutual information (MI) indicates the amount of shared information between two or more random variables. In information theory, the MI between two random variables X and Y is

traditionally defined by Shannon as [34]

$$I_s(X; Y) = \int\!\!\int p_{XY}(x, y) \log \frac{p_{XY}(x, y)}{p_X(x)\, p_Y(y)}\, dx\, dy, \qquad (2-6)$$

where $p_{XY}(x, y)$ is the joint probability density function (pdf) of X and Y, and $p_X(x)$ and $p_Y(y)$ are the marginal pdfs. The crucial property of mutual information for our purposes is the fact that it measures the dependency (even nonlinear) between two random variables X and Y. If X and Y are independent, MI becomes zero [34]. In a sense, MI can be considered a generalization of correlation to nonlinear dependencies; that is, MI can be used to detect nonlinear dependencies between two random variables, whereas the usefulness of correlation is limited to linear dependencies. However, in order to estimate MI one has to assume that the samples are iid, which is not the case for templates that are waveforms. Although Shannon's MI is the traditionally preferred measure of shared information, essentially it is a measure of divergence of the variables X and Y from independence. Based on this understanding, a different, but qualitatively similar, measure of independence can be obtained using the Cauchy-Schwarz inequality for inner products in vector spaces: $|\langle x, y \rangle| \le \|x\|\, \|y\|$. The following expression is defined as the Cauchy-Schwarz quadratic mutual information (CS-QMI) between X and Y [8]:

$$I_{CS}(X; Y) = \frac{1}{2} \log \frac{\int\!\!\int p_{XY}^2(x, y)\, dx\, dy \; \int\!\!\int p_X^2(x)\, p_Y^2(y)\, dx\, dy}{\left( \int\!\!\int p_{XY}(x, y)\, p_X(x)\, p_Y(y)\, dx\, dy \right)^2}. \qquad (2-7)$$

With data available, $I_{CS}(X; Y)$ can be estimated using Parzen window density estimation and can be used as a statistic for signal detection as in [24]. We shall use CS-QMI as a comparison against our proposed method as well, since this is a direct template matching scheme that requires no training and shows improved performance in non-Gaussian and nonlinear situations. Similarly, the equivalence between kernel independent component analysis (K-ICA) and a Cauchy-Schwarz independence measure has been pointed out in [35]. In fact, all learning algorithms that use nonparametric probability density function (pdf) estimates in the input space admit an alternative formulation as kernel

methods expressed in terms of dot products. This interesting link allows us to gain some geometrical understanding of kernel methods, as well as to determine the optimal kernel parameters by looking at the pdf estimates in the input space. Since the cost functions optimized by ITL algorithms (or, equivalently, by kernel methods) involve pdf estimates, these techniques are able to extract the higher order statistics of the data, and that explains to some extent the improvement over their linear counterparts observed in a number of problems. Despite their evident success, a major limitation of all these techniques is that they assume independent and identically distributed (i.i.d.) input data. However, in practice most of the signals in engineering have some correlation or temporal structure. Moreover, this temporal structure can be known in advance for some problems (for instance in digital communications working with coded source signals). Therefore, it seems that most of the conventional ITL measures are not using all the available information in the case of temporally correlated (non-white) input signals. The main goal is to present a new function that, unlike conventional ITL measures, effectively exploits both the statistical and the time-domain information about the input signal. This new function, which we refer to as the correntropy function, will be presented in the next section.

2.2 Generalized Correlation Function

2.2.1 Correntropy as Generalized Correlation

A new measure of generalized correlation called correntropy was presented in [15]. The definition is as follows: Let $\{x_t, t \in T\}$ be a stochastic process with $T$ being an index set. The generalized correlation function $V(s, t)$ is defined as a function from $T \times T$ into $\mathbb{R}^+$ given by

$$V(s, t) = E\left[ \kappa(x(s), x(t)) \right], \qquad (2-8)$$

where $E[\cdot]$ denotes mathematical expectation. Using a series expansion for the Gaussian kernel, the correlation function can be rewritten as

$$V(s, t) = \frac{1}{\sqrt{2\pi}\,\sigma} \sum_{n=0}^{\infty} \frac{(-1)^n}{2^n \sigma^{2n}\, n!}\, E\left[ \|x_s - x_t\|^{2n} \right], \qquad (2-9)$$

which involves all the even-order moments of the random variable $x_s - x_t$. Specifically, the term corresponding to $n = 1$ in (2-9) is proportional to $E[x_s^2] + E[x_t^2] - 2E[x_s x_t] = \sigma_{x_s}^2 + \sigma_{x_t}^2 - 2R_x(s, t)$ (for zero-mean signals). This shows that the information provided by the conventional autocorrelation function is included within the new function. From (2-9), we can see that in order to have a univariate correlation function, all the even-order moments must be invariant to a time shift. This is a stronger condition than wide sense stationarity, which involves only second-order moments. More precisely, a sufficient condition to have $V(t, t - \tau) = V(\tau)$ is that the input stochastic process must be strictly stationary on the even moments; this means that the joint pdf $p(x_t, x_{t+\tau})$ must be unaffected by a change of time origin. We will assume this condition in the rest of the dissertation when using $V(\tau)$. For a discrete-time stationary stochastic process we define the generalized correlation function as $V[m] = E[\kappa(x_n - x_{n-m})]$, which can be easily estimated through the sample mean

$$\hat{V}[m] = \frac{1}{N - m + 1} \sum_{n=m}^{N} \kappa(x_n - x_{n-m}). \qquad (2-10)$$

This form of higher order correlation has been termed correntropy because of its unique capability to provide information both on the time structure (correlation) of the signal and on the distribution of the random variable (the average across delays or indexes is the information potential) [15]. Correntropy is positive definite and, because of this, defines a whole new reproducing kernel Hilbert space (RKHS) [1]. It also has a maximum when the two indexes s and t are equal. These properties are also satisfied by the widely known correlation function [2]. These facts have made it possible to explore a whole new avenue of research which in essence combines the benefits of kernel methods and information theoretic learning [36],[37]. In the same way, a covariance version of correntropy can

be defined by employing vectors that are centered in the feature space defined by the Mercer kernel. Given any two random variables $x_1$ and $x_2$, the corresponding feature vectors $\Phi(x_1)$ and $\Phi(x_2)$ can be centered as

$$\tilde{\Phi}(x_i) = \Phi(x_i) - E[\Phi(x_i)], \quad i = 1, 2. \qquad (2-11)$$

Then the inner product between these two centered feature vectors becomes

$$\langle \tilde{\Phi}(x_1), \tilde{\Phi}(x_2) \rangle = \langle \Phi(x_1), \Phi(x_2) \rangle + \langle E[\Phi(x_1)], E[\Phi(x_2)] \rangle - \langle E[\Phi(x_1)], \Phi(x_2) \rangle - \langle \Phi(x_1), E[\Phi(x_2)] \rangle. \qquad (2-12)$$

Taking the statistical expectation on both sides of the above equation, we get

$$E\left[ \langle \tilde{\Phi}(x_1), \tilde{\Phi}(x_2) \rangle \right] = E\left[ \langle \Phi(x_1), \Phi(x_2) \rangle \right] - \langle E[\Phi(x_1)], E[\Phi(x_2)] \rangle = E[\kappa(x_1, x_2)] - E_{x_1} E_{x_2}[\kappa(x_1, x_2)]. \qquad (2-13)$$

This results in the centered correntropy between the two random variables. Now we can define the centered correntropy for random vectors: given a random vector $x = [x_1, x_2, \ldots, x_L]^T$, its centered correntropy can be defined as a matrix $V$ such that the $(i, j)$th element is

$$V(i, j) = E[\kappa(x_i, x_j)] - E_{x_i} E_{x_j}[\kappa(x_i, x_j)]. \qquad (2-14)$$

The following theorem is the basis of the novel Wiener filter formulation.

Theorem 1. For any symmetric positive definite kernel (i.e., Mercer kernel) $\kappa(x(i), x(j))$ defined on $\mathbb{R} \times \mathbb{R}$ and a random process $x(n)$, the correntropy $V$ defined by $V(i, j) = E[\kappa(x(i), x(j))] - E_{x(i)} E_{x(j)}[\kappa(x(i), x(j))]$ is a covariance function of a Gaussian random process.

Proof: It can be proved that $V$ is non-negative definite and symmetric. The symmetry of $V$ is a direct consequence of the symmetry of the kernel. Non-negativity of $V$ can be shown by using the vectors $\Phi(x(i))$ obtained from Mercer's theorem. Let us take

any set of real numbers $(\alpha_1, \alpha_2, \ldots, \alpha_L)$, not all zero. Then,

$$\sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j V(i, j) = \sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j \left( E[\kappa(x(i), x(j))] - E_{x(i)} E_{x(j)}[\kappa(x(i), x(j))] \right). \qquad (2-15)$$

Using (2-13),

$$\sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j V(i, j) = E\left[ \left\langle \sum_{i=1}^{L} \alpha_i \left[ \Phi(x(i)) - E[\Phi(x(i))] \right], \sum_{j=1}^{L} \alpha_j \left[ \Phi(x(j)) - E[\Phi(x(j))] \right] \right\rangle \right] = E\left[ \left\| \sum_{i=1}^{L} \alpha_i \left[ \Phi(x(i)) - E[\Phi(x(i))] \right] \right\|^2 \right] \ge 0. \qquad (2-16)$$

Thus $V$ is non-negative definite and symmetric. Now, it can be proved that $R$ is the auto-covariance function of a random process if and only if $R$ is a symmetric non-negative definite kernel [2]. The theorem then follows. This means that, given $V(i, j)$ for a random process $x(n)$, there exists a Gaussian process $z(n)$ such that $E[z(i) z(j)] - E[z(i)] E[z(j)] = V(i, j)$.

Theorem 2. For any identically and independently distributed random vector $x$, the corresponding Gaussian random vector $z$ in the feature space defined by correntropy is uncorrelated, i.e., the covariance matrix is diagonal.

Proof: Let $V$ be the centered correntropy matrix for $x$ and hence the covariance matrix for $z$, with its $(i, j)$th element for $i \neq j$ given by

$$V(i, j) = E[z_i z_j] - E[z_i] E[z_j] = E[\kappa(x(i), x(j))] - E_{x(i)} E_{x(j)}[\kappa(x(i), x(j))] = E_{x(i)} E_{x(j)}[\kappa(x(i), x(j))] - E_{x(i)} E_{x(j)}[\kappa(x(i), x(j))] = 0, \qquad (2-17)$$

where the second step uses the independence of $x(i)$ and $x(j)$. It is also easy to see that, since $x$ is identically distributed, all the diagonal terms are equal.
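The following Python sketch (our own illustration, not the dissertation's code; the test signal, lags and kernel size are arbitrary) estimates the auto-correntropy of (2-10) and the centered correntropy matrix of (2-14) from a sample, then checks numerically that the matrix is symmetric with non-negative eigenvalues, consistent with Theorem 1:

```python
import numpy as np

def gaussian_kernel(u, v, sigma):
    # Gaussian kernel of (2-3)
    return np.exp(-(u - v) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def correntropy(x, max_lag, sigma=1.0):
    # Sample estimate of V[m] = E[kappa(x_n - x_{n-m})], cf. (2-10)
    x = np.asarray(x, dtype=float)
    N = len(x)
    return np.array([np.mean(gaussian_kernel(x[m:], x[:N - m], sigma))
                     for m in range(max_lag + 1)])

def centered_correntropy_matrix(x, P, sigma=1.0):
    # Centered correntropy matrix of (2-14) built from P lagged versions of x;
    # the product of marginal expectations is approximated by averaging over all sample pairs.
    x = np.asarray(x, dtype=float)
    N = len(x)
    lags = np.stack([x[P - 1 - i:N - i] for i in range(P)])   # row i = x delayed by i samples
    V = np.empty((P, P))
    for i in range(P):
        for j in range(P):
            joint = np.mean(gaussian_kernel(lags[i], lags[j], sigma))
            indep = np.mean(gaussian_kernel(lags[i][:, None], lags[j][None, :], sigma))
            V[i, j] = joint - indep
    return V

rng = np.random.default_rng(0)
x = np.sin(0.2 * np.arange(500)) + 0.1 * rng.standard_normal(500)
print(correntropy(x, max_lag=5, sigma=1.0))
V = centered_correntropy_matrix(x, P=4, sigma=1.0)
# symmetric; eigenvalues non-negative up to round-off, as Theorem 1 predicts
print(np.allclose(V, V.T), np.linalg.eigvalsh(V))
```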

2.2.2 Properties

Some important properties of the GCF can be listed as follows:

Property 1: For any symmetric positive definite kernel (i.e., Mercer kernel) $\kappa(x_s, x_t)$ defined on $\mathbb{R} \times \mathbb{R}$, the generalized correlation function defined as $V(s, t) = E[\kappa(x_s, x_t)]$ is a reproducing kernel.

Proof: Since $\kappa(x_s, x_t)$ is symmetric, it is obvious that $V(s, t)$ is also symmetric. Now, since $\kappa(x_s, x_t)$ is positive definite, for any set of $n$ points $\{x_1, \ldots, x_n\}$ and any set of real numbers $\{a_1, \ldots, a_n\}$, not all zero,

$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \kappa(x_i, x_j) > 0. \qquad (2-18)$$

It is also true that for any strictly positive function $g(\cdot, \cdot)$ of two random variables $x$ and $y$, $E[g(x, y)] > 0$. Then

$$E\left[ \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \kappa(x_i, x_j) \right] > 0 \;\Rightarrow\; \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j E[\kappa(x_i, x_j)] = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j V(i, j) > 0. \qquad (2-19)$$

Thus, $V(s, t)$ is both symmetric and positive definite. Now, the Moore-Aronszajn theorem [1] proves that for every real symmetric positive definite function $\kappa$ of two real variables there exists a unique reproducing kernel Hilbert space (RKHS) with $\kappa$ as its reproducing kernel. Hence, $V(s, t) = E[\kappa(x_s, x_t)]$ is a reproducing kernel. This concludes the demonstration.

The following properties consider a discrete-time stochastic process; obviously, the properties are also satisfied for continuous-time processes.

Property 2: $V[m]$ is a symmetric function: $V[-m] = V[m]$.

Property 3: $V[m]$ reaches its maximum at the origin, i.e., $V[m] \le V[0], \;\forall m$.

Property 4: $V[m] \ge 0$ and $V[0] = \frac{1}{\sqrt{2\pi}\,\sigma}$.

All these properties can be easily proved. Properties 2 and 3 are also satisfied by the conventional autocorrelation function, whereas Property 4 is a direct consequence of the positiveness of the Gaussian kernel.

Property 5: Let $\{x_n, n = 0, \ldots, N-1\}$ be a set of i.i.d. data drawn according to some distribution $p(x)$. The mean value of the GCF estimator (2-10) coincides with the estimate of the information potential obtained through Parzen windowing with Gaussian kernels.

Proof: The Parzen pdf estimate is given by $\hat{p}(x) = \frac{1}{N} \sum_{n=0}^{N-1} \kappa(x - x_n)$, and the estimate of the information potential is

$$\hat{V} = \int \left( \frac{1}{N} \sum_{n=0}^{N-1} \kappa(x - x_n) \right)^2 dx = \frac{1}{N^2} \sum_{n=0}^{N-1} \sum_{i=0}^{N-1} \bar{\kappa}(x_n - x_i), \qquad (2-20)$$

where $\bar{\kappa}$ denotes a Gaussian kernel with twice the kernel size of $\kappa$. On the other hand, the GCF estimate is $\hat{V}[m] = \frac{1}{N - |m|} \sum_{n=|m|}^{N-1} \kappa(x_n - x_{n-|m|})$, for $-(N-1) \le m \le (N-1)$, and therefore its mean value over the lags is

$$\bar{V} = \frac{1}{2N - 1} \sum_{m=-N+1}^{N-1} \frac{1}{N - |m|} \sum_{n=|m|}^{N-1} \kappa(x_n - x_{n-|m|}). \qquad (2-21)$$

Finally, it is trivial to check that all the terms in (2-20) are also in (2-21). This concludes the proof.

Property 5 clearly demonstrates that this generalization includes information about the pdf. On the other hand, we also showed that it conveys information about the correlation. For these reasons, in the sequel we will refer to $V[m]$ as correntropy.
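As a rough numerical illustration of Property 5 (our own sketch, not from the dissertation; sample size and kernel sizes are arbitrary, and the enlarged kernel of (2-20) is used on both sides so that the terms of (2-20) and (2-21) line up), one can compare the lag-averaged correntropy with the Parzen information potential:

```python
import numpy as np

def gaussian_kernel(u, v, sigma):
    # Gaussian kernel of (2-3)
    return np.exp(-(u - v) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

rng = np.random.default_rng(1)
x = rng.standard_normal(2000)          # i.i.d. samples, as Property 5 assumes
sigma = 0.5
sigma_bar = np.sqrt(2) * sigma         # kernel with twice the variance, the kappa-bar of (2-20)
N = len(x)

# Information potential (2-20): pairwise average of the enlarged kernel
ip = np.mean(gaussian_kernel(x[:, None], x[None, :], sigma_bar))

# Mean over lags of the correntropy estimates, cf. (2-21)
lag_avg = np.mean([np.mean(gaussian_kernel(x[m:], x[:N - m], sigma_bar))
                   for m in range(1, N)])

print(ip, lag_avg)                     # the two averages should be close for i.i.d. data
```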

Property 6: Given $V[m]$ for $m = 0, \ldots, P-1$, the following Toeplitz correntropy matrix of dimensions $P \times P$,

$$V = \begin{bmatrix} V[0] & V[1] & \cdots & V[P-1] \\ V[1] & V[0] & \cdots & V[P-2] \\ \vdots & \vdots & \ddots & \vdots \\ V[P-1] & V[P-2] & \cdots & V[0] \end{bmatrix}, \qquad (2-22)$$

is positive definite.

Proof: The matrix $V$ can be decomposed as $V = \sum_n A_n$, where $A_n$ is given by

$$A_n = \begin{bmatrix} \kappa(x_n - x_n) & \kappa(x_n - x_{n-1}) & \cdots & \kappa(x_n - x_{n-P+1}) \\ \kappa(x_n - x_{n-1}) & \kappa(x_n - x_n) & \cdots & \kappa(x_n - x_{n-P+2}) \\ \vdots & \vdots & \ddots & \vdots \\ \kappa(x_n - x_{n-P+1}) & \kappa(x_n - x_{n-P+2}) & \cdots & \kappa(x_n - x_n) \end{bmatrix}. \qquad (2-23)$$

If $\kappa(x_i, x_j)$ is a kernel satisfying Mercer's conditions, then $A_n$ is a positive definite matrix for every $n$. On the other hand, the sum of positive definite matrices is also positive definite [38]; this proves that $V$ is a positive definite matrix.

Property 7: Let $\{x_n, n \in T\}$ be a discrete-time w.s.s. zero-mean Gaussian process with autocorrelation function $r[m] = E[x_n x_{n-m}]$. The correntropy function for this process is given by

$$V[m] = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma}, & m = 0 \\[2mm] \dfrac{1}{\sqrt{2\pi(\sigma^2 + \sigma^2[m])}}, & m \neq 0 \end{cases} \qquad (2-24)$$

where $\sigma$ is the kernel size and $\sigma^2[m] = 2(r[0] - r[m])$.

Proof: The correntropy function is defined as $V[m] = E[\kappa(x_n - x_{n-m})]$. Since $x_n$ is a zero-mean Gaussian random process, for $m \neq 0$, $z_m = x_n - x_{n-m}$ is also a zero-mean Gaussian random variable, with variance $\sigma^2[m] = 2(r[0] - r[m])$. Therefore

$$V[m] = \int \kappa(z_m)\, \frac{1}{\sqrt{2\pi}\,\sigma[m]} \exp\left( -\frac{z_m^2}{2\sigma^2[m]} \right) dz_m. \qquad (2-25)$$

Since we are considering a Gaussian kernel with variance $\sigma^2$, equation (2-25) is the convolution of two zero-mean Gaussians of variances $\sigma^2$ and $\sigma^2[m]$ evaluated at the origin; this yields (2-24) immediately.

Property 7 clearly reflects that correntropy conveys information about the time structure of the process and also about its pdf via quadratic Renyi's entropy. As a consequence of Property 7, if $\{x_n, n \in T\}$ is a white zero-mean Gaussian process with variance $\sigma_x^2$, we have that $V[m] = \frac{1}{\sqrt{2\pi(\sigma^2 + \sigma_x^2)}}$ for $m \neq 0$, which coincides with the mean value of the function and, of course, is the information potential of a Gaussian random variable of variance $\sigma_x^2$ when its pdf has been estimated via Parzen windowing with a Gaussian kernel of size $\sigma^2$ (in this case $\int (\hat{p}(x))^2 dx = \frac{1}{\sqrt{2\pi(\sigma^2 + \sigma_x^2)}}$).

Property 8: The correntropy estimator (2-10) is unbiased and asymptotically consistent. The properties of the estimator can be derived following the same lines used for the conventional correlation function [39].

2.3 Similarity Measure

For template matching and detection, the decision statistic is in general a similarity function. For matched filtering, the cross-correlation function is used. Here we define a non-linear similarity metric which can be explained in probabilistic terms. Inspired by the concept of correntropy, we define a means of measuring similarity between two random processes. The cross-correntropy at $i$ and $j$ for two random processes $X_k$ and $Y_k$ can be defined by

$$V_{XY}(i, j) = E[\kappa(X_i, Y_j)]. \qquad (2-26)$$

With the assumptions of ergodicity and stationarity, the cross-correntropy at lag $m$ between the two random processes can be estimated by

$$\hat{V}_{XY}(m) = \frac{1}{N} \sum_{k=1}^{N} \kappa(x_{k-m}, y_k). \qquad (2-27)$$

For detection, if we assume that we know the timing of the signal, we just need to use

$$\hat{V}_{XY}(0) = \frac{1}{N} \sum_{k=1}^{N} \kappa(x_k, y_k). \qquad (2-28)$$

Property 9: The cross-correntropy estimate defined in (2-28) for two random variables $x_k$ and $y_k$, each iid, approaches, through Parzen density estimation, a measure of probability density along a line given by

$$D_{XY} = \int\!\!\int P_{XY}(x, y)\, \delta(x - y)\, dx\, dy, \qquad (2-29)$$

where $\delta(x - y)$ is the Dirac delta function. $D_{XY}$ in (2-29) is the integral of the joint probability density function (PDF) $P_{XY}$ along the line $x = y$. Hence, this gives a direct probabilistic measure of how similar the two random variables are.

Proof: The estimate of the joint pdf using Parzen windowing with a Gaussian kernel and the available N samples can be written as

$$\hat{P}_{XY}(x, y) = \frac{1}{N} \sum_{k=1}^{N} \kappa_\sigma(x, x_k)\, \kappa_\sigma(y, y_k), \qquad (2-30)$$

where $\kappa_\sigma$ is the Gaussian kernel with variance $\sigma^2$. Using the estimate of the pdf we have

$$\hat{D}_{XY} = \frac{1}{N} \sum_{k=1}^{N} \int\!\!\int \kappa_\sigma(x, x_k)\, \kappa_\sigma(y, y_k)\, \delta(x - y)\, dx\, dy = \frac{1}{N} \sum_{k=1}^{N} \int \kappa_\sigma(x, x_k)\, \kappa_\sigma(x, y_k)\, dx. \qquad (2-31)-(2-33)$$

Here we can use the convolution property of the Gaussian kernel, resulting in

$$\hat{D}_{XY} = \frac{1}{N} \sum_{k=1}^{N} \kappa_{2\sigma}(x_k, y_k). \qquad (2-34)$$

This coincides with (2-28), with the kernel having variance $2\sigma^2$. This can also be inferred from (2-26), since

$$V_{XY}(0) = E[\kappa_\sigma(x, y)] = \int\!\!\int \kappa_\sigma(x, y)\, p_{XY}(x, y)\, dx\, dy. \qquad (2-35)$$

This is just the integral along a strip (defined by the kernel) around the $x = y$ line. For $\sigma \to 0$, the kernel becomes the Dirac delta function, coinciding with (2-29). Figure 2-1 illustrates this graphically. This means that $V_{XY}(0)$ increases as more data points in the joint space lie closer to the line $x = y$, and the kernel size regulates what is considered close. This also gives the motivation for using $V_{XY}$ as a similarity measure between X and Y.

Property 10: The cross-correntropy estimate defined in (2-28) for two random variables $x_i$ and $y_i$, each iid, is directly related through Parzen density estimation to the Cauchy-Schwarz pdf distance defined in (2-5).

To explore the relationship between these two quantities, let us estimate the pdfs of these random variables using Parzen estimation:

$$\hat{p}_X(x) = \frac{1}{N} \sum_{i=1}^{N} \kappa(x, x_i), \qquad (2-36)$$

$$\hat{p}_Y(y) = \frac{1}{N} \sum_{i=1}^{N} \kappa(y, y_i). \qquad (2-37)$$

The numerator in (2-5) can be written as

$$\int \hat{p}_X(x)\, \hat{p}_Y(x)\, dx = \frac{1}{N^2} \sum_{j=1}^{N} \sum_{i=1}^{N} \int \kappa_\sigma(x, x_i)\, \kappa_\sigma(x, y_j)\, dx = \frac{1}{N^2} \sum_{j=1}^{N} \sum_{i=1}^{N} \kappa_{2\sigma}(x_i, y_j) \approx E_Y E_X[\kappa_{2\sigma}(X, Y)] \approx E[\kappa_{2\sigma}(X, Y)] \approx \frac{1}{N} \sum_{i=1}^{N} \kappa_{2\sigma}(x_i, y_i) = \hat{V}_{XY}. \qquad (2-38)$$

Here the approximations are for switching between statistical expectations and sample averages; the marginal expectations above (with respect to the subscripts in the expectation operator) are replaced with a joint expectation (without the subscript). These steps are valid as long as the random variables are independent. $\hat{V}_{XY}$ is the same as in (2-28) with a kernel variance of $2\sigma^2$. Thus, from (2-5), the Cauchy-Schwarz pdf distance can be written as

$$D(\hat{p}_X, \hat{p}_Y) = -\log(\hat{V}_{XY}) + \frac{1}{2}\log G(X) + \frac{1}{2}\log G(Y), \qquad (2-39)$$

where $G(X)$ and $G(Y)$ are estimates of the information potentials (defined in (2-4)) of X and Y, respectively. Note that $V_{XY}$ accounts for the interaction between the two random variables in the Cauchy-Schwarz pdf distance, since it gives an estimate of the inner product between the two pdfs, as shown in (2-38).

Figure 2-1. Probabilistic interpretation of $V_{XY}$ (the maximum of each curve has been normalized to 1 for visual convenience).
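A small Python sketch of the quantities in (2-28) and (2-39) follows (an illustration we add here, not dissertation code; the sample signals and kernel size are arbitrary, and for simplicity the same kernel size is used in every term, so this is only an approximate version of (2-39)):

```python
import numpy as np

def gaussian_kernel(u, v, sigma):
    # Gaussian kernel of (2-3)
    return np.exp(-(u - v) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def cross_correntropy(x, y, sigma):
    # V_XY(0) of (2-28): average kernel similarity of the paired samples
    return np.mean(gaussian_kernel(np.asarray(x), np.asarray(y), sigma))

def information_potential(x, sigma):
    # Pairwise kernel average, the estimate of G(X) in (2-39), cf. (2-4)
    x = np.asarray(x)
    return np.mean(gaussian_kernel(x[:, None], x[None, :], sigma))

rng = np.random.default_rng(2)
base = rng.standard_normal(512)
x = base + 0.1 * rng.standard_normal(512)   # two noisy observations of the same source,
y = base + 0.1 * rng.standard_normal(512)   # so the sample pairs cluster around the x = y line

sigma = 0.5
v_xy = cross_correntropy(x, y, sigma)
d_cs = -np.log(v_xy) + 0.5 * np.log(information_potential(x, sigma)) \
       + 0.5 * np.log(information_potential(y, sigma))   # Cauchy-Schwarz pdf distance, cf. (2-39)
print(v_xy, d_cs)
```

Larger values of v_xy (and correspondingly smaller d_cs) indicate that the joint samples concentrate near the x = y line, which is exactly the similarity interpretation of Figure 2-1.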

Property 11: The cross-correntropy function between two random variables X and Y defined in (2-26) is the cross-correlation of two random variables U and Z. That is,

$$\hat{V}_{XY} = \frac{1}{N} \sum_{k=1}^{N} \kappa(x_k, y_k) = \frac{1}{N} \sum_{k=1}^{N} u_k z_k = \hat{R}_{UZ}, \qquad (2-40)$$

where $\hat{R}_{UZ}$ is the estimate of the cross-correlation between U and Z using the sample average. The correntropy matrix for the vector $[X\; Y]^T$, given by

$$\hat{\mathbf{V}} = \begin{bmatrix} \hat{V}_{XX} & \hat{V}_{XY} \\ \hat{V}_{YX} & \hat{V}_{YY} \end{bmatrix}, \qquad (2-41)$$

is positive definite; then, using arguments similar to those used to prove Theorems 1 and 2 [2], a random vector $[U\; Z]^T$ will exist whose correlation matrix is given by the same matrix. Thus

$$\hat{\mathbf{R}} = \begin{bmatrix} \hat{R}_{UU} & \hat{R}_{UZ} \\ \hat{R}_{ZU} & \hat{R}_{ZZ} \end{bmatrix} = \hat{\mathbf{V}}. \qquad (2-42)$$

Hence two random variables U and Z will exist satisfying (2-40). This shows that using the cross-correntropy function is equivalent to simply computing the cross-correlation of two other random variables nonlinearly related to the original data through correntropy.

2.4 Optimal Signal Processing Based on Correntropy

As we have discussed earlier, given the data samples $\{x_i\}_{i=1}^{N}$, the correntropy kernel creates another data set $\{z_x(i)\}_{i=1}^{N}$ preserving the similarity measure in the sense that $E[z_x(i)\, z_x(j)] = E[\kappa(x_i, x_j)] = V(i, j)$. In [2], Parzen described an interesting concept of the Hilbert space representation of a random process. For a stochastic process $\{X_t, t \in T\}$ with $T$ being an index set, the auto-correlation function $R_X(t, s)$, defined as $E[X_t X_s]$, is symmetric and positive-definite, thus defining an RKHS. Parzen showed that the inner product structure of this RKHS is equivalent to the conventional theory of second order stochastic processes. According to Parzen, a symmetric non-negative definite kernel is the covariance kernel of a random function (in other words, a random process) and vice versa. Therefore, given a

random process $\{X_t, t \in T\}$, there exists another random process $\{z_t, t \in T\}$ such that

$$E[z_t z_s] = V(t, s) = E[\kappa(X_t, X_s)]. \qquad (2-43)$$

$z_t$ is nonlinearly related to, but of the same dimension as, $X_t$, while preserving the similarity measure in the sense of (2-43). Meanwhile, the linear manifold of $\{z_t, t \in T\}$ forms a congruent RKHS with correntropy $V(t, s)$ as the reproducing kernel. The vectors in this space are nonlinearly related to the input space based on the underlying input statistics, and hence any linear signal processing solution using a covariance function (given by correntropy) in this feature space automatically gives a nonlinear formulation with respect to the original input data. For instance, if we construct a desired signal by the optimal projection onto the space spanned by $\{z_t\}$, this filter would be a linear optimal filter in the feature space, and the corresponding nonlinear filter would be obtained with respect to the original input, thus potentially modeling the nonlinear dependencies between the input and the desired signal. This concept of using the feature space given by correntropy can clearly be extended to most supervised learning methods in general. In the following chapters we shall present the usefulness of the correntropy function both as a similarity and correlation measure and as a means of optimal signal processing in the corresponding feature space. We conclude this chapter by pointing out the differences and similarities between the RKHS induced by correntropy (VRKHS), the RKHS induced by the Gaussian kernel (GRKHS) as used in kernel methods, and the old Parzen formulation of the RKHS (PRKHS) based on the autocorrelation of random processes. The primary characteristic of the GRKHS is the nonlinear transformation to a high dimensional space controlled by the number of input data samples. This creates difficulties in understanding some solutions in the input space (e.g., kernel PCA), and requires regularized solutions, since the number of unknowns is equal to the number of samples. Powerful and elegant methods have been used to solve this difficulty [3], [40], but they complicate the solution. In fact, instead of

straight least squares for kernel regression, one has to solve a constrained quadratic optimization problem. The nonlinearity is controlled solely by the kernel utilized, and most of the time it is unclear how to select it optimally, although some important works have addressed the difficulty [41]. Although correntropy using the Gaussian kernel is related to the GRKHS, VRKHS is conceptually and practically much closer to PRKHS. Since correntropy is different from correlation in the sense that it involves higher-order statistics of the input signals, the VRKHS induced by auto-correntropy is not equivalent to statistical inference on Gaussian processes. The transformation from the input space to VRKHS is nonlinear, and the inner product structure of VRKHS provides the possibility of obtaining closed form optimal nonlinear filter solutions by utilizing higher-order statistics, as we shall also demonstrate. Another important difference compared with existing machine learning methods based on the GRKHS is that the feature space has the same dimension as the input space. This has advantages because in VRKHS there is no need to regularize the solution when the number of samples is large compared with the input space dimension. Further work needs to be done regarding this point, but we hypothesize that in our methodology regularization is automatically achieved due to the inbuilt constraint on the dimensionality of the data. The fixed dimensionality also carries disadvantages because the user has no control over the VRKHS dimensionality. Therefore, the quality of the nonlinear solution depends solely on the nonlinear transformation between the input space and VRKHS. Another important attribute of the VRKHS is that the nonlinear transformation is mediated by the data statistics. As is well known from the theory of Hammerstein-Wiener models [42] for nonlinear function approximation (and to a certain extent in SVM theory [3]), the nonlinearity plays an important role in performance. In other words, in the Hammerstein models a static nonlinearity which is independent of the data is chosen a priori, whereas our approach induces the nonlinearity implicitly by defining a generalized similarity measure which is data dependent. Our method actually encompasses and enriches Hammerstein-Wiener models by inducing

nonlinearities which may not be achieved via static nonlinearities, and which are tuned to the data (i.e., the same kernel may induce different nonlinear transformations for different data sets). The RKHS induced by correntropy is indeed different from the two most widely used RKHSs studied in machine learning and statistical signal processing. Being different does not mean that it will always be better, but these results show that it provides a promising opening for both theoretical and applied research. Still, certain facets of this research, like the exact relationship of the input data with the feature data, require further investigation for a more concrete understanding of the impact on the betterment of conventional signal processing.

CHAPTER 3
CORRENTROPY BASED MATCHED FILTERING

3.1 Detection Statistics and Template Matching

Detecting a known signal in noise can be statistically formulated as a hypothesis testing problem, choosing between two hypotheses, one with the signal $s_{1,k}$ present ($H_1$) and the other with the signal $s_{0,k}$ present ($H_0$). When the channel is additive with noise $n_k$, independent of the signal templates, the two hypotheses are

$$H_0: r_k = n_k + s_{0,k}, \qquad H_1: r_k = n_k + s_{1,k}. \qquad (3-1)$$

3.1.1 Linear Matched Filter

When both hypotheses are governed by a Gaussian probability distribution, the optimal decision statistic is given by the log likelihood [43] as

$$L(r) = \log\frac{|R_0|}{|R_1|} - \left[ (r - m_1)^T R_1^{-1} (r - m_1) - (r - m_0)^T R_0^{-1} (r - m_0) \right], \qquad (3-2)$$

where $R_0$ and $R_1$ are the respective covariance matrices, and $m_0$ and $m_1$ are the respective mean vectors of the hypotheses. $L(r)$ is then compared to a threshold $\eta$; if the statistic is greater, hypothesis $H_1$ is chosen, otherwise $H_0$. This is a complicated quadratic form in the data that is usually not preferred in computation. The assumption of independently and identically distributed (iid) zero mean noise reduces this to a much simpler expression. Then $R_0 = R_1 = \sigma_n^2 I$, $m_0 = s_0$ and $m_1 = s_1$, where $\sigma_n^2$ is the noise variance, $I$ is the identity matrix, and $s_0$ and $s_1$ are the two possible transmitted signal vectors:

$$L(r) = -\frac{1}{2\sigma_n^2}\left[ (r - s_1)^T(r - s_1) - (r - s_0)^T(r - s_0) \right] = -\frac{1}{2\sigma_n^2}\left[ r^T r - 2 r^T s_1 + s_1^T s_1 - r^T r + 2 r^T s_0 - s_0^T s_0 \right] = -\frac{1}{2\sigma_n^2}\left[ -2 r^T s_1 + s_1^T s_1 + 2 r^T s_0 - s_0^T s_0 \right]. \qquad (3-3)$$

The terms not depending on $r$ can be dropped and the expression can be rescaled without any effect on the final decision. These changes are simply reflected in the threshold to which the statistic is compared. So, we can use the following decision statistic, which is nothing but the difference between the correlations of the received signal $r$ with the two templates:

$$L(r) = r^T (s_1 - s_0). \qquad (3-4)$$

This, in fact, is the output $y_k$ of the matched filter at the time instant $k = \tau$, the signal length, such that

$$y_k = r_k * h_k, \qquad (3-5)$$

where $*$ denotes convolution and $h_k = s_{1,\tau - k} - s_{0,\tau - k}$ is the matched filter impulse response. Thus, the filter output is composed of a signal and a noise component. The output achieves its maximum value at the time instant $\tau$, when there is maximum correlation between the matched filter impulse response and the template, thereby maximizing the signal to noise ratio (SNR), defined as the ratio of the total energy of the signal template to the noise variance,

$$\mathrm{SNR} = \frac{1}{\sigma_n^2} \sum_{k=0}^{T} s_k^2. \qquad (3-6)$$

The matched filter is one of the fundamental building blocks of almost all communication receivers, automatic target recognition systems, and many other applications where the transmitted waveforms are known. The wide applicability of the matched filter principle is due to its simplicity and optimality under the linear additive white Gaussian noise (AWGN) framework.

3.1.2 Correntropy as Decision Statistic

Now, inspired by Properties 9 and 10 in chapter 2, we shall define the decision statistic used henceforth. These properties demonstrate that cross-correntropy is a probabilistic similarity between two random vectors. Let us assume that we have a receiver and a channel with white additive noise. For simplicity we take the binary

detection problem where the received vector is $r$, with two possible cases: (a) when the signal $s_0$ is present (hypothesis $H_0$, $r_0 = n + s_0$) and (b) when the signal $s_1$ is present (hypothesis $H_1$, $r_1 = n + s_1$). Now we basically want to check whether the received vector is closer to $s_1$ (validating $H_1$) or to $s_0$ (validating $H_0$) based on our similarity measure (correntropy). Thus, when the timing of the signal is known and the receiver is synchronized, we define the following as the correntropy matched filter statistic:

$$L_C(r) = \frac{1}{N} \sum_{i=1}^{N} \kappa(r_i, s_{1,i}) - \frac{1}{N} \sum_{i=1}^{N} \kappa(r_i, s_{0,i}). \qquad (3-7)$$

For the two cases of receiving either $r_1$ (signal $s_1$ in noise) or $r_0$ (signal $s_0$ in noise), table 3-1 summarizes the values of the statistic for the correntropy matched filter using the Gaussian kernel. Since the pdf of the noise $n_i$ is considered symmetric, $n_i$ and $-n_i$ are statistically the same, and hence $L_C(r_1) = -L_C(r_0)$.

Table 3-1. Values of the statistic for the two cases using the Gaussian kernel.
Received signal $r = n + s_0$:  $L_C = \frac{1}{N\sqrt{2\pi\sigma^2}} \sum_{i=1}^{N} e^{-(s_{1,i} - s_{0,i} - n_i)^2 / 2\sigma^2} - \frac{1}{N\sqrt{2\pi\sigma^2}} \sum_{i=1}^{N} e^{-n_i^2 / 2\sigma^2}$
Received signal $r = n + s_1$:  $L_C = \frac{1}{N\sqrt{2\pi\sigma^2}} \sum_{i=1}^{N} e^{-n_i^2 / 2\sigma^2} - \frac{1}{N\sqrt{2\pi\sigma^2}} \sum_{i=1}^{N} e^{-(s_{1,i} - s_{0,i} + n_i)^2 / 2\sigma^2}$

We should also note that the correntropy matched filter given by (3-7) defaults to a linear matched filter because, according to Property 11 in chapter 2, (3-7) is also measuring the difference in correlation between the random feature vectors derived from correntropy corresponding to $r$, $s_0$ and $s_1$. Thus the correntropy matched filter creates a linear matched filter that is nonlinearly related to the original input data. If the transmitted signal is lagged by a certain delay and this is not known a priori, the following correntropy

decision statistic should be used:

$$L_C(r) = \max_m \left\{ \frac{1}{N} \sum_{i=1}^{N} \kappa(r_{i-m}, s_{1,i}) \right\} - \max_m \left\{ \frac{1}{N} \sum_{i=1}^{N} \kappa(r_{i-m}, s_{0,i}) \right\}, \qquad (3-8)$$

where $r_{i-m}$ is the $i$th sample in the received signal vector of samples in the symbol window delayed by $m$. Now, depending on whether the detection scheme is synchronous or not, all that is required is for $L_C$ to be compared with a threshold to decide when the signal template was transmitted.

3.1.3 Interpretation from Kernel Methods

So far we have introduced the correntropy matched filter from an information theoretic learning (ITL) perspective. We can also arrive at expression (3-7) from kernel methods with a few assumptions. We shall transform the data from the input space to the kernel feature space and compute (3-4) in the feature space by using the kernel trick. But instead of using the original kernel $\kappa$ we shall use the sum of kernels, $\tilde{\kappa}$, defined by

$$\tilde{\kappa}(r, s) = \frac{1}{N} \sum_{i=1}^{N} \kappa(r_i, s_i), \qquad (3-9)$$

where $r = [r_1, r_2, \ldots, r_N]^T$ and $s = [s_1, s_2, \ldots, s_N]^T$ are the input vectors. Note that $\tilde{\kappa}$ is a valid kernel by all means. It is trivial to show that $\tilde{\kappa}$ is symmetric and positive definite merely from the fact that it is a sum of symmetric and positive definite functions, implying that it is a Mercer kernel. Hence $\tilde{\kappa}$ can be written as

$$\tilde{\kappa}(r, s) = \tilde{\Phi}^T(r)\, \tilde{\Phi}(s), \qquad (3-10)$$

where $\tilde{\Phi}$ maps the input vector to a possibly (depending on the kernel chosen) infinite dimensional feature vector. With the two possible transmitted vectors $s_0$ and $s_1$ and $r$ as the received signal in the input space, the corresponding feature vectors will be $\tilde{\Phi}(s_0)$, $\tilde{\Phi}(s_1)$ and $\tilde{\Phi}(r)$, respectively. Now, applying the decision statistic (3-4) in the feature

space, we get

$$L_{\tilde{\Phi}}(r) = \left( \tilde{\Phi}(s_1) - \tilde{\Phi}(s_0) \right)^T \tilde{\Phi}(r) = \frac{1}{N} \sum_{i=1}^{N} \kappa(r_i, s_{1,i}) - \frac{1}{N} \sum_{i=1}^{N} \kappa(r_i, s_{0,i}). \qquad (3-11)$$

$L_{\tilde{\Phi}}$ coincides with (3-7). Of course, $L_{\tilde{\Phi}}$ is not the maximum likelihood statistic in the feature space, since the data in the feature space are not guaranteed to be Gaussian. But second order information in the feature space is known to extract higher order information in the input space. For example, (2-4) shows that the squared norm of the mean of the feature vectors is the information potential, which gives Renyi's quadratic entropy. We can expect the same effect here, and, as we shall see later, this is also demonstrated in the results.

3.1.4 Impulsive Noise Distributions

Since we aim to show the effectiveness of the proposed method in impulsive noise environments, we shall briefly introduce the most commonly used pdf models for such distributions. These distributions are commonly used to model noise observed in low-frequency atmospheric noise, fluorescent lighting systems, combustion engine ignition, radio and underwater acoustic channels, economic stock prices, and biomedical signals [44],[45],[46]. There are two main models used in the literature, which we present next. To our knowledge, a single detection method that can be applied easily to both such models has not been presented in the literature so far. We shall demonstrate that the proposed CMF is an exception.

3.1.4.1 Two-term Gaussian mixture model

The two-term Gaussian mixture model, which is an approximation to the more general Middleton Class A noise model [47], has been used to test various algorithms under an impulsive noise environment [23],[46],[48]. The noise is generated as a mixture of two Gaussian density functions, such that the noise distribution is $f_N(n) = (1 - \varepsilon)\, N(0, \sigma_1^2) + \varepsilon\, N(0, \sigma_2^2)$, where $\varepsilon$ is the percentage of noise spikes and usually $\sigma_2^2 \gg \sigma_1^2$.
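The hedged Python sketch below (an illustration we add here, not the dissertation's experimental code; the template shapes, noise parameters and kernel size are arbitrary choices) draws two-term Gaussian mixture noise and compares the correntropy matched filter statistic of (3-7) with the linear matched filter statistic of (3-4) for a synchronized binary detection problem:

```python
import numpy as np

def gaussian_kernel(u, v, sigma):
    # Gaussian kernel of (2-3)
    return np.exp(-(u - v) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def mixture_noise(n_samples, eps=0.1, sigma1=0.1, sigma2=3.0, rng=None):
    # Two-term Gaussian mixture: (1 - eps) N(0, sigma1^2) + eps N(0, sigma2^2)
    if rng is None:
        rng = np.random.default_rng()
    spikes = rng.random(n_samples) < eps
    return np.where(spikes, rng.normal(0.0, sigma2, n_samples),
                            rng.normal(0.0, sigma1, n_samples))

def cmf_statistic(r, s1, s0, sigma):
    # Correntropy matched filter statistic L_C of (3-7), synchronized case
    return np.mean(gaussian_kernel(r, s1, sigma)) - np.mean(gaussian_kernel(r, s0, sigma))

def mf_statistic(r, s1, s0):
    # Linear matched filter statistic of (3-4)
    return r @ (s1 - s0)

rng = np.random.default_rng(3)
N = 64
s1 = np.sin(2 * np.pi * np.arange(N) / 16)   # hypothetical templates
s0 = -s1
sigma = 1.0                                   # kernel size; its selection is discussed later in this chapter

hits_cmf = hits_mf = 0
trials = 2000
for _ in range(trials):
    truth = rng.integers(0, 2)                # which template was sent
    r = (s1 if truth else s0) + mixture_noise(N, rng=rng)
    hits_cmf += int((cmf_statistic(r, s1, s0, sigma) > 0) == bool(truth))
    hits_mf += int((mf_statistic(r, s1, s0) > 0) == bool(truth))

print("CMF accuracy:", hits_cmf / trials, " MF accuracy:", hits_mf / trials)
```

Sweeping the decision threshold instead of fixing it at zero would trace out the ROC curves that are used for the comparisons in section 3.2.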

Alpha-stable distribution

$\alpha$-stable distributions are also widely used to model impulsive noise behavior [44], [45]. The distribution gradually deviates from Gaussianity as $\alpha$ decreases from 2 to 1 (at $\alpha = 1$ it becomes the Cauchy distribution). This range is also appropriate because, even though the higher moments diverge, the mean is still defined. Although the pdf of an $\alpha$-stable distribution does not have a closed form expression, it can be expressed in terms of its characteristic function (the Fourier transform of the pdf). The general form of the characteristic function and many details on the $\alpha$-stable distribution can be found in [49]. Here we consider only symmetric $\alpha$-stable noise, whose characteristic function is given by
$$\Psi_\alpha(u) = e^{-\sigma^\alpha |u|^\alpha}, \qquad (3\text{-}12)$$
where $\sigma$ represents a scale parameter, similar to a standard deviation. Such a random variable has no moment of order greater than or equal to $\alpha$, except for the case $\alpha = 2$ [49]. It is noteworthy that the deterministic or degenerate case ($\alpha = 0$), the Gaussian case ($\alpha = 2$), the Cauchy case ($\alpha = 1$) and the Levy or Pearson distribution (in the non-symmetric framework, for $\alpha = 0.5$) are the only cases for which the pdf possesses a closed form expression [49].

Locally suboptimal receiver

When we present the simulations in a later section, for one case with additive $\alpha$-stable noise we shall compare the performance of the proposed CMF detector with the locally suboptimum (LSO) detector, which gives impressive performance with minimum complexity [50]. We therefore briefly introduce these concepts; for details, please refer to the respective citations.

The LSO detector is derived directly from the locally optimum (LO) detector [51], whose test statistic is given by
$$T_{LO}(\mathbf{r}) = \sum_{k=1}^{N} s_k\, g_{LO}(r_k), \qquad (3\text{-}13)$$
where the nonlinear score function is
$$g_{LO}(x) = -\frac{f_n'(x)}{f_n(x)}, \qquad (3\text{-}14)$$
and $f_n'(x)$ is the first derivative of $f_n(x)$, the pdf of the additive noise. Since $f_n(x)$ does not have a closed form expression when the noise is $\alpha$-stable distributed, $g_{LO}$ cannot be found exactly. The LSO detector uses the following approximation for the score function:
$$g_{LSO}(x) = \begin{cases} c\,x, & |x| \le \lambda \\[4pt] \dfrac{\alpha+1}{x}, & |x| > \lambda \end{cases} \qquad (3\text{-}15)$$
where $c = \dfrac{\alpha+1}{\lambda^2}$ and $\lambda = 2.73\alpha - 1.75$ is an empirical estimate of the peak of $g_{LO}$. As can easily be seen, a drawback of the LSO detector is that one needs to know the value of $\alpha$ before it can be employed. We shall assume that this information is available in the simulations presented at the end of this chapter.

Selection of Kernel Size

The kernel size (the variance parameter in the Gaussian kernel) is a free parameter in kernel based and information theoretic methods, so it has to be chosen by the user. There are in fact numerous publications in statistics on selecting the proper kernel size [52], [53], [54] in the realm of density estimation, but a systematic study of the effect of the kernel size in ITL is far from complete. For the particular case of the correntropy matched filter we can show the following interesting property.

Property 1: The decision using the statistic $L_C$ reduces to the optimal Gaussian ML decision in the input space given by (3-4) as the kernel size (the variance of the Gaussian kernel) increases.

Proof: Using the Taylor series expansion we get
$$\kappa_\sigma(r_n, s_n) = 1 - \frac{(r_n - s_n)^2}{2\sigma^2} + \frac{(r_n - s_n)^4}{2\,(2\sigma^2)^2} - \cdots \qquad (3\text{-}16)$$

Since the order of the terms increases by two with each term, as $\sigma$ increases the contribution of the higher order terms becomes less significant compared to the lower order terms. Then $\kappa_\sigma(r_n, s_n) \approx 1 - \frac{(r_n - s_n)^2}{2\sigma^2}$, and it is easy to see that using $L_C$ is then equivalent to using $L$, where $L$ is given by (3-4). This means that the kernel size acts as a means of tuning the correntropy matched filter, with a larger kernel size adjusting it toward linearity and Gaussianity. For instance, in AWGN, by Property 1 the kernel size should be chosen relatively larger than the dynamic range of the received signal, so that the performance defaults to that of the linear matched filter, which is optimal in this case. In the experiments presented below, the kernel size is chosen heuristically so that the best performance is observed. In the cases where the correntropy matched filter is expected to bring benefits, such as additive impulsive noise, the kernel size should be selected according to density estimation rules so that it represents the template well (e.g., Silverman's rule [53], or a value on the order of the variance of the template signal). Silverman's rule is given by
$$\sigma_{opt} = \sigma_x \left\{4N^{-1}(2d+1)^{-1}\right\}^{\frac{1}{d+4}},$$
where $\sigma_x$ is the standard deviation of the data, $N$ is the data size and $d$ is the dimension of the data. As shall be illustrated later, choosing an appropriate kernel size for the detection problem at hand is not very exacting (compared with many other kernel estimation problems), since a wide interval of values provides near-optimal performance for a given noise environment.

3.2 Experiments and Results

Receiver operating characteristic (ROC) curves [55] were used to compare the performance of the linear matched filter (MF), the matched filter based on mutual information (MI) and the proposed correntropy matched filter (CMF). ROC curves plot the probability of detection $P_D$ against the probability of false alarm $P_{FA}$ for a range $[0, \infty)$ of threshold values $\gamma$ (the highest threshold corresponds to the origin). The area under the ROC can be used as a measure of the overall performance of a detector. The ROC curves were plotted for various values of signal-to-noise ratio (SNR), defined as the ratio of the total energy of the signal template to the noise variance.

For $\alpha$-stable distributions, where the variance of the noise is not defined, the SNR was estimated as the ratio of the signal power to the squared scale of the noise. 10,000 Monte Carlo (MC) simulations were run for each of the following cases. The template is a sinusoid of length 64 given by $s_i = \sin(2\pi i/10)$, $i = 1, 2, \ldots, 64$. Segments (chips) of length equal to the signal template, some containing the signal and others without it, were generated with a probability of 1/2 of transmitting the signal. Simulations were performed under the following situations:

1. Additive white Gaussian noise linear channel, $r_k = s_k + n_k$.
2. Additive white impulsive noise linear channel, where the noise is generated as a mixture of two Gaussian density functions such that the noise distribution is $f_N(n) = (1-\varepsilon)\mathcal{N}(0, \sigma_1^2) + \varepsilon\mathcal{N}(0, \sigma_2^2)$, with the fraction of noise spikes $\varepsilon = 0.15$ and $\sigma_2^2 = 50\,\sigma_1^2$.
3. Additive zero mean $\alpha$-stable noise channel.

For each case two sets of ROC curves are presented: one for the case when the timing of the received symbols is known and the signal is synchronized (using the statistic (3-7)), and the other when there might be an unknown delay in the received symbols (using the statistic (3-8)). These two cases shall be referred to respectively as synchronous detection and asynchronous detection in the following sections. For the latter, the delay was simulated to be no larger than 20 samples, and hence the corresponding similarity measures (correlation (MF), correntropy (CMF), mutual information (MI) and correlation with the suboptimal score (LSO)) were maximized over lags less than 20. We have used the method given in [56] with the skewness parameter $\beta = 0$ to generate the symmetric $\alpha$-stable distributed samples.

Additive White Gaussian Noise

For the AWGN case, the linear matched filter is optimal, so it should outperform all the other filters under test.

But the CMF, although obtained in a different way, provides almost the same performance, as can be expected from (3-16) for a large kernel size and as observed in figure 3-1 for synchronous detection and figure 3-2 for asynchronous detection. In fact, the performance of the CMF approaches that of the MF arbitrarily closely as the kernel size increases. The MI based matched filter also approaches similar performance when the kernel size is increased to high values, but its results are a bit inferior because the computation of CS-QMI [24] disregards the ordering of the samples.

Figure 3-1. Receiver operating characteristic curves for synchronous detection in an AWGN channel with kernel variances $\sigma^2(\mathrm{CMF}) = 15$ and $\sigma^2(\mathrm{MI}) = 15$ (the curves for MF and CMF at 10 dB overlap).

Additive Impulsive Noise by Mixture of Gaussians

When the additive noise is impulsive, the proposed CMF clearly outperforms both the MI detector and the linear MF (see figure 3-3 for synchronous detection and figure 3-4 for asynchronous detection). The increased robustness to impulsive noise can be attributed to the properties of the correntropy function with the Gaussian kernel mentioned earlier (heavy attenuation of large outlier values).
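For concreteness, the following Python sketch (an illustrative re-implementation with our own function names, not the code used to generate the figures) evaluates the synchronous statistic (3-7) and the asynchronous statistic (3-8) with the Gaussian kernel; how the delayed symbol window is slid over the received samples is an assumption on our part:

import numpy as np

def gaussian_kernel(x, y, sigma):
    """Gaussian kernel evaluated elementwise between two sample vectors."""
    return np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)

def cmf_statistic(r, s1, s0, sigma):
    """Synchronous correntropy matched filter statistic L_C(r), equation (3-7)."""
    return np.mean(gaussian_kernel(r, s1, sigma)) - np.mean(gaussian_kernel(r, s0, sigma))

def cmf_statistic_async(r, s1, s0, sigma, max_lag=20):
    """Asynchronous statistic, equation (3-8): each correntropy term is maximized over
    candidate delays m = 0..max_lag; r must hold at least len(s1) + max_lag samples."""
    N = len(s1)
    c1 = max(np.mean(gaussian_kernel(r[m:m + N], s1, sigma)) for m in range(max_lag + 1))
    c0 = max(np.mean(gaussian_kernel(r[m:m + N], s0, sigma)) for m in range(max_lag + 1))
    return c1 - c0

# Example with the 64-sample sinusoidal template of the experiments; here s0 = 0 plays the
# role of "no signal transmitted", and the decision compares L_C with a threshold.
i = np.arange(1, 65)
s1 = np.sin(2 * np.pi * i / 10.0)
s0 = np.zeros_like(s1)
r = s1 + np.random.default_rng(0).standard_normal(64)   # template in white Gaussian noise
L_C = cmf_statistic(r, s1, s0, sigma=np.sqrt(15.0))      # kernel variance sigma^2 = 15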

Figure 3-2. Receiver operating characteristic curves for asynchronous detection in AWGN with kernel variances $\sigma^2(\mathrm{CMF}) = 15$, $\sigma^2(\mathrm{MI}) =$

Alpha-Stable Noise in Linear Channel

Now let us observe the behavior of the detection methods for $\alpha$-stable distributed noise. In this case the comparison with the LSO detector is also presented, since it is close to optimal for additive $\alpha$-stable distributed noise [50]. We assume that the value of $\alpha$ is known beforehand when using this detector. Since these distributions have second moments tending to infinity, the linear matched filter utterly fails. Figures 3-5 and 3-6 show these results for synchronous detection. It can be seen that the CMF and the LSO detector give almost identical performance, but the LSO detector requires the exact value of $\alpha$ a priori. Of course, as $\alpha$ increases and approaches 2, the performance of the linear MF improves, since the noise then becomes Gaussian (see figure 3-6). The curves in figure 3-6 for the two values of $\alpha$ corresponding to each nonlinear detector (CMF, LSO and MI) almost coincide.

Figure 3-3. Receiver operating characteristic curves for synchronous detection in additive impulsive noise with kernel variances $\sigma^2(\mathrm{CMF}) = 5$ and $\sigma^2(\mathrm{MI}) = 2$.

Since the variance of the $\alpha$-stable noise is not well defined, SNR here means the ratio of the signal power to the squared scale of the noise. Figure 3-7 shows the ROC plots for asynchronous detection. Once again, the performance of the CMF rivals that of the LSO detector, which is designed exclusively for $\alpha$-stable noise with a given $\alpha$. These simulations demonstrate the effectiveness of the proposed CMF for this widely used impulsive noise model.

Effect of Kernel Size

The choice of kernel size, though important, is not as critical as in many other kernel methods and density estimation problems. For detection with the correntropy matched filter, we plot the area under the ROC curve for different values of the kernel size to evaluate its effect. As can be seen in figures 3-9 and 3-8, a wide range of kernel sizes works well.

Figure 3-4. Receiver operating characteristic curves for asynchronous detection in additive impulsive noise with kernel variances $\sigma^2(\mathrm{CMF}) = 5$ and $\sigma^2(\mathrm{MI}) = 2$.

For the case with impulsive noise using the two-term Gaussian model, the values given by Silverman's rule (2.2 and 4.05 for SNR values of 5 dB and 0 dB, respectively) fall in the high-performance region; for the $\alpha$-stable noise, Silverman's rule is not computable, since the variance of the noise is ill-defined. However, it is trivial to choose a value for the kernel size through a quick scan selecting the best performance for the particular application.

Low Cost CMF Using Triangular Kernel

Though we have presented most of our arguments using the Gaussian kernel, they hold for any valid kernel. In fact, the type of kernel is usually chosen based on the type of problem or simply based on convenience. For example, if it is known beforehand that the nonlinearity involved in a system to be estimated is polynomial, then a polynomial kernel would be used [57]. Likewise, if the problem solution is based on Parzen pdf estimation, then a pdf kernel such as the Gaussian or Laplacian is used. Here we demonstrate the use of a triangular kernel to simplify the implementation of the CMF even further.

Figure 3-5. Receiver operating characteristic curves for synchronous detection in additive white $\alpha$-stable distributed noise; kernel variance $\sigma^2 = 3$, $\alpha = 1.1$.

Although the Gaussian kernel gives superior performance, a machine usually evaluates it through a polynomial-based expansion, and bearing this complexity may not always be necessary. For instance, in our problem, by using the triangular kernel (which is a valid Parzen kernel) we can simplify the CMF implementation. The triangular kernel evaluated between two real points $x$ and $y$ is given by
$$U_a(x, y) = T\!\left(\frac{x - y}{a}\right), \qquad (3\text{-}17)$$
where $a$ is a real positive number and
$$T(x) = \begin{cases} 1 - |x|, & -1 \le x \le 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (3\text{-}18)$$
Figure 3-10 shows this function.

Figure 3-6. Receiver operating characteristic curves for synchronous detection in additive white $\alpha$-stable distributed noise; kernel variance $\sigma^2 = 3$, SNR = 15 dB. The plots for MI and CMF almost coincide for both values of $\alpha$.

All of the ideas and arguments presented in this chapter remain valid for this kernel. Note that with this kernel the decision statistic can be computed without a single multiplication operation. Figure 3-11 shows the ROC plot of the previous methods along with the CMF using the triangular kernel. The SNR is 5 dB, with the noise generated from the mixture of Gaussian pdfs discussed before. The triangular kernel (3-17) used a width of $2a = 1.5$. The parameter values for the other methods are the same as those shown in the corresponding figure above.
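As a rough sketch of how cheap the triangular-kernel CMF is (again our own illustrative code, not the original implementation), the statistic (3-7) with the kernel (3-17)-(3-18) needs only absolute differences, comparisons and a scaling by 1/a:

import numpy as np

def triangular_kernel(x, y, a):
    """Triangular kernel U_a(x, y) = T((x - y)/a), with T(u) = 1 - |u| on [-1, 1], 0 elsewhere."""
    u = np.abs(x - y) / a
    return np.where(u <= 1.0, 1.0 - u, 0.0)

def cmf_statistic_triangular(r, s1, s0, a):
    """Correntropy matched filter statistic (3-7) computed with the triangular kernel."""
    return np.mean(triangular_kernel(r, s1, a)) - np.mean(triangular_kernel(r, s0, a))

# Example with the same sinusoidal template; width 2a = 1.5 as in the experiment reported here.
i = np.arange(1, 65)
s1 = np.sin(2 * np.pi * i / 10.0)
s0 = np.zeros_like(s1)
r = s1 + np.random.default_rng(1).standard_normal(64)
L_C_tri = cmf_statistic_triangular(r, s1, s0, a=0.75)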

Figure 3-7. Receiver operating characteristic curves for asynchronous detection in additive white $\alpha$-stable distributed noise; kernel variance $\sigma^2 = 3$, SNR = 15 dB. The plots for MI and CMF almost coincide for both values of $\alpha$.

Figure 3-8. Area under the ROC for various kernel-size values for additive $\alpha$-stable distributed noise using synchronous detection, $\alpha =$

Figure 3-9. Area under the ROC for various kernel-size values for additive impulsive noise with the mixture of Gaussians, using synchronous detection.

Figure 3-10. Triangular function that can be used as a kernel.

Figure 3-11. Receiver operating characteristic curves for an SNR of 5 dB for the various detection methods.

CHAPTER 4
APPLICATION TO SHAPE CLASSIFICATION OF PARTIALLY OCCLUDED OBJECTS

4.1 Introduction

A vast number of papers and books in the statistical and pattern analysis literature have been devoted to the problem of recognizing shapes using various landmarks [58], [59], [60]. Nevertheless, one of the major challenges yet to be solved is the automatic extraction of landmarks for the shape when the object is partially occluded [60], [61], [46]. The problem is mostly caused by misplaced landmarks corresponding to the occluded part of the object. Using the proposed robust detector, the influence of the occluded portions automatically becomes less significant. The idea is that the occlusion may be treated as outliers and hence modeled by some kind of heavy-tailed distribution, and we have already seen in the previous chapter that the correntropy matched filter is robust against impulsive noise. With our approach there is no need to extract so-called landmarks based on properties of the shape boundary such as changes in curvature or the presence of corners. Instead, our algorithm uses the shape boundaries directly, trying to match them with one of the known shape templates. This is done by simultaneously adjusting for the possible rotations and scaling required, through a fixed number of parameter updates.

4.2 Problem Model and Solution

A large class of classification problems can be solved using the classical linear model. Consider the following M-ary classification problem. For $m = 0, 1, \ldots, M-1$, the $m$th hypothesis is
$$Y = A X_m + E, \qquad (4\text{-}1)$$
where $Y$ is a $K \times N$ measurement matrix, $A X_m$ is the measurement component due to the signal of interest when model or class $m$ is in force, and $E$ is an additive zero mean noise matrix of size $K \times N$. $A$ is an unknown $K \times P$ matrix and $X_m$ is a $P \times N$ matrix. The signal component can also be called a subspace signal [46]. All these quantities may also be complex valued.

Hence, given the measurement matrix $Y$, the problem is to classify which of the $M$ signal components $A X_m$ is present. If some of the model parameters are unknown, as here, a usual approach for detectors is the generalized likelihood ratio (GLR) principle [62]. The GLR detector replaces the unknown parameters by their maximum likelihood estimates (MLEs). However, if the pdf of the noise matrix is unknown, or it is not possible to find the MLEs of those parameters, GLR based methods are hard to implement. This is usually the case when the noise model deviates from the usual assumption of Gaussianity, and it is the case in our example as well. In this chapter we demonstrate the proposed robust detector on shape classification of partially occluded fish silhouettes. Figure 4-1 shows the set of 16 fish silhouettes that constitute the templates applied in the numerical example. For each silhouette the boundary was extracted. In relation to the model given by (4-1), $X_m$ is the matrix whose columns are the points on the shape boundary, and $A$ is a scaling and rotation matrix of the form
$$A = \begin{pmatrix} a & -b \\ b & a \end{pmatrix}. \qquad (4\text{-}2)$$
A distorted version of one of the shapes in the template database is given, and that shape is then matched against each of the database elements; hence, in our case, 16 template matching operations are performed. For each template, the optimal scaling, rotation and translation parameters are found by maximizing the correntropy between the database shape boundary and the boundary of the given distorted shape. The corresponding cost function is
$$J(A, \mathbf{d}) = \sum_{i=1}^{N} e^{-\frac{\|A\mathbf{x}_i - \mathbf{d} - \mathbf{y}_i\|^2}{2\sigma^2}}, \qquad (4\text{-}3)$$
where $\mathbf{x}_i$ and $\mathbf{y}_i$ are the $i$th columns of $X_m$ and $Y$ respectively. This function is maximized with respect to the scaling and rotation matrix $A$ and the translation vector $\mathbf{d}$.
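A minimal Python sketch of evaluating this cost for a candidate $A$ and $\mathbf{d}$ is given below (our own illustrative code; the boundary matrices X and Y are assumed to be 2 x N arrays of corresponding points):

import numpy as np

def correntropy_cost(A, d, X, Y, sigma):
    """Cost (4-3): J(A, d) = sum_i exp(-||A x_i - d - y_i||^2 / (2 sigma^2)),
    where x_i and y_i are corresponding columns of the 2 x N arrays X and Y."""
    resid = A @ X - d[:, None] - Y            # 2 x N matrix of alignment errors
    sq_norm = np.sum(resid ** 2, axis=0)      # squared Euclidean norm per point pair
    return np.sum(np.exp(-sq_norm / (2.0 * sigma ** 2)))

# Quick check with synthetic boundaries and the identity transform:
X = np.random.default_rng(0).standard_normal((2, 100))
Y = X + 0.1 * np.random.default_rng(1).standard_normal((2, 100))
J = correntropy_cost(np.eye(2), np.zeros(2), X, Y, sigma=15.0)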

Figure 4-1. The fish template database.

Let
$$\mathbf{a} = \begin{pmatrix} a \\ b \end{pmatrix}. \qquad (4\text{-}4)$$
Then, by differentiating (4-3) with respect to $\mathbf{a}$ and equating the gradient to zero, one obtains the following fixed point update rule:
$$\mathbf{a} \;\leftarrow\; \frac{\displaystyle\sum_{i=1}^{N} \mathbf{z}_i\, e^{-\frac{\|A\mathbf{x}_i - \mathbf{d} - \mathbf{y}_i\|^2}{2\sigma^2}}}{\displaystyle\sum_{i=1}^{N} \|\mathbf{x}_i\|^2\, e^{-\frac{\|A\mathbf{x}_i - \mathbf{d} - \mathbf{y}_i\|^2}{2\sigma^2}}}, \qquad (4\text{-}5)$$

with
$$\mathbf{z}_i = \begin{pmatrix} x_{i1}\,y_{i1} + x_{i2}\,y_{i2} \\ x_{i1}\,y_{i2} - x_{i2}\,y_{i1} \end{pmatrix}, \qquad (4\text{-}6)$$
where $x_{ik}$ and $y_{ik}$ are the $k$th components of the vectors $\mathbf{x}_i$ and $\mathbf{y}_i$ respectively. Similarly, by differentiating with respect to $\mathbf{d}$ and equating the gradient to zero, one obtains the following fixed point update rule for $\mathbf{d}$:
$$\mathbf{d} \;\leftarrow\; \frac{\displaystyle\sum_{i=1}^{N} (A\mathbf{x}_i - \mathbf{y}_i)\, e^{-\frac{\|A\mathbf{x}_i - \mathbf{d} - \mathbf{y}_i\|^2}{2\sigma^2}}}{\displaystyle\sum_{i=1}^{N} e^{-\frac{\|A\mathbf{x}_i - \mathbf{d} - \mathbf{y}_i\|^2}{2\sigma^2}}}. \qquad (4\text{-}7)$$
After the optimization is finished for each of the templates in the database, the shape corresponding to the largest value of the cost function is chosen, and the occluded shape is thereby recognized. Since there are 16 silhouettes, a database of the 16 templates is stored by extracting the boundary of each of them. Since a shape is completely described by its boundary, any conventional technique can be used to obtain the boundaries; in our example, 100 points were evenly extracted from each silhouette template to represent its boundary. The boundary points of the occluded shape must be extracted as well. When using the proposed robust detector, we have to extract an ordered set of points (like a time series), and the number of ordered boundary points must equal the number of boundary points in the templates, because the kernel function is evaluated between each point of the occluded boundary and the corresponding point of the template. An easy and apparently effective way to order the points is the following: for $k = 1, 2, \ldots, N$ (the number of boundary points in a template), pick the $k$th point on the boundary of the template and choose the closest point on the boundary of the occluded object as its $k$th point. We shall call this the nearest point ordering (NPO).

Since the perimeter of the occluded object might be much larger, it is advisable to extract more boundary points for the occluded object than for the templates in the database; here we have used twice as many points for the occluded shape as for the database templates. Thus, for each shape template there is an ordered set of boundary points, i.e., the matrix $X_m$, $m = 1, 2, \ldots, M$, together with a corresponding matrix $Y$ of boundary points of the occluded object, where $M$ is the number of templates or classes (here the database size, 16). A summary of the algorithm is presented next (a short code sketch of these updates is given below).

- Center the boundary points of the received occluded shape by subtracting the mean vector from each point.
- For template index m = 1 to M:
  - For fixed point update index i = 1 to the number of iterations:
    - Perform nearest point ordering of Y with respect to X_m to get the new Y.
    - Perform an update of A and d using (4-5) and (4-7).
  - End of the inner loop, yielding A_opt and d_opt.
  - Compute J_opt(m) = J(A_opt, d_opt).
- End of the outer loop, yielding J_opt(m), m = 1, 2, ..., M.
- Set m_opt = arg max_m J_opt(m).
- Choose the m_opt-th shape template as the most likely shape for the received occluded object.

The inputs to the detector were occluded versions of shape number 1 in figure 4-1. The occluded versions were constructed by random rotation (within ±60 degrees), scaling by a random factor (between 1 and 2) and translation of two fishes of shape no. 1. Only occluded silhouettes where the ratio of the occluded area to the area of the silhouette itself was between 50% and 90% were considered. Figures 4-2 and 4-3 show a possible occluded version of shape no. 1 and its boundaries, respectively.
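The following Python sketch (our own illustrative re-implementation of the summary above, with assumed function names and a fixed iteration count) carries out the nearest point ordering and the fixed point updates (4-5) and (4-7) for a single template; classification then simply picks the template with the largest returned cost:

import numpy as np

def nearest_point_ordering(X, Y_points):
    """For each template boundary point (column of the 2 x N array X), pick the closest
    point from the occluded boundary Y_points (2 x M array); returns a 2 x N array."""
    d2 = np.sum((X[:, :, None] - Y_points[:, None, :]) ** 2, axis=0)   # N x M squared distances
    return Y_points[:, np.argmin(d2, axis=1)]

def fit_template(X, Y_points, sigma=15.0, n_iter=20):
    """Maximize (4-3) over the scaling/rotation parameters (a, b) and translation d
    using the fixed point updates (4-5) and (4-7) combined with nearest point ordering."""
    a, b = 1.0, 0.0                                   # start from the identity transform
    d = np.zeros(2)
    for _ in range(n_iter):
        A = np.array([[a, -b], [b, a]])
        Y = nearest_point_ordering(X, Y_points)       # re-pair the boundary points
        resid = A @ X - d[:, None] - Y
        w = np.exp(-np.sum(resid ** 2, axis=0) / (2.0 * sigma ** 2))   # correntropy weights
        z = np.vstack((X[0] * Y[0] + X[1] * Y[1],                      # z_i of (4-6)
                       X[0] * Y[1] - X[1] * Y[0]))
        a, b = (z @ w) / (np.sum(X ** 2, axis=0) @ w)                  # update (4-5)
        A = np.array([[a, -b], [b, a]])
        d = ((A @ X - Y) @ w) / np.sum(w)                              # update (4-7)
    A = np.array([[a, -b], [b, a]])
    Y = nearest_point_ordering(X, Y_points)
    resid = A @ X - d[:, None] - Y
    J = np.sum(np.exp(-np.sum(resid ** 2, axis=0) / (2.0 * sigma ** 2)))
    return A, d, J

# Classification over a database of centered 2 x N templates (illustrative usage):
# costs = [fit_template(Xm, Y_occluded)[2] for Xm in templates]; m_opt = int(np.argmax(costs))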

Figure 4-2. The occluded fish.

Table 4-1. Recognition and misclassification rates.

  Shape number    Proposed method    LMedS (RWR)
  1               95%                63%
  2 to 16         5%                 27%

1000 Monte Carlo (MC) simulations were performed by generating the occluded shape randomly and classifying it as one of the 16 possible known shapes. Table 4-1 shows the recognition and misclassification rates for the proposed method along with the LMedS method using row weighted residuals (RWR), as reported in [46]. As in all methods that use the Gaussian kernel, the kernel size σ is important. In this application the kernel size tunes what is considered close and what is considered far, and far points are heavily discounted when computing the correntropy value. In this numerical demonstration a kernel size of σ = 15 was used. Since σ plays the role of the standard deviation in the Gaussian kernel, points that are within ±σ of one another return much higher kernel values than points that are farther apart.
