SENSITIVITY ANALYSIS OF BLIND SEPARATION OF SPEECH MIXTURES

by

Savaskan Bulek

A Dissertation Submitted to the Faculty of The College of Engineering & Computer Science in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Florida Atlantic University
Boca Raton, FL
December 2010


ACKNOWLEDGEMENTS

I gratefully acknowledge my advisor Dr. Nurgun Erdol for her guidance, encouragement, and great patience. She has provided continued motivation and generous advice throughout my Ph.D. study. I wish to express my thanks to Dr. Valentine Aalo, Dr. Christopher Beetle, and Dr. Hanqi Zhuang for their support and valuable suggestions. I would also like to thank the Lifelong Learning Society at FAU, FAU's Center for Ocean Energy Technology, NASA, and the Department of Computer and Electrical Engineering and Computer Science for providing financial support.

ABSTRACT

Author: Savaskan Bulek
Title: Sensitivity Analysis of Blind Separation of Speech Mixtures
Institution: Florida Atlantic University
Dissertation Advisor: Dr. Nurgun Erdol
Degree: Doctor of Philosophy
Year: 2010

Blind source separation (BSS) refers to a class of methods by which multiple sensor signals are combined with the aim of estimating the original source signals. Independent component analysis (ICA) is one such method that effectively resolves static linear combinations of independent non-Gaussian distributions. We propose a method that can track variations in the mixing system by seeking a compromise between adaptive and block methods by using mini-batches. The resulting permutation indeterminacy is resolved based on the correlation continuity principle. Methods employing higher order cumulants in the separation criterion are susceptible to outliers in the finite sample case. We propose a robust method based on low-order non-integer moments by exploiting the Laplacian model of speech signals. We study separation methods for even- or over-determined linear convolutive mixtures in the frequency domain based on joint diagonalization of matrices employing time-varying second order statistics. We investigate the sources affecting the sensitivity of the solution under the finite sample case, such as the set size, overlap amount, and cross-spectrum estimation methods.

To my family.

Contents

List of Figures

1 Introduction
  1.1 Abstract
  1.2 Motivation
  1.3 System Model
    1.3.1 Source signals
    1.3.2 Mixing system
    1.3.3 Noise
    1.3.4 Mixture signals
    1.3.5 Demixing system
    1.3.6 Global system
  1.4 Blind Source Separation (BSS)
    1.4.1 Independent Component Analysis
    1.4.2 Indeterminacies
    1.4.3 Separability and uniqueness
  1.5 Contrast functions
    1.5.1 Higher order statistics
    1.5.2 Fractional order statistics
    1.5.3 Second order statistics
  1.6 Iterative Search Algorithms
    1.6.1 Deflation scheme
    1.6.2 Symmetric scheme
  Joint diagonalization
    Orthogonal Joint Diagonalization
    Non-orthogonal Joint Diagonalization
    Diagonality Measures
    Uniqueness Conditions
    Optimization methods
  Applications
    Hearing aids
    Teleconferencing
    Speech recognition
  Simulation Setup and Performance Metrics
    Source Signals
    Mixing System
    Mixture signals
    Noise
    Performance measures
  Outline of Dissertation
  Appendix

2 BSS using Fractional Order Moments
  2.1 Abstract
  2.2 Introduction
  2.3 Contrast function
  2.4 The Search Surface
  2.5 Optimization on the unit circle
  2.6 Statistical Properties of the Sample Contrast Estimator
  Simulations
    Synthetic data
    Speech data
  Some Generalizations
    Generalization to arbitrary orders
    Generalization to MIMO
  Chapter Summary
  Appendix
    Generalized Gaussian Distribution
    Distribution of linear combinations of independent Laplacian variables
    Statistics of the Sample Estimator

3 Block Adaptive ICA with a Time Varying Mixing Matrix
  Abstract
  Introduction
  Problem Formulation
  Numerical Simulations
    Sinusoidal source signals
    Speech source signals
  Chapter Summary

4 Sensitivity Analysis of Joint Diagonalization in Convolutive BSS
  Abstract
  Introduction
  System Model
    4.3.1 Scaling Correction
    4.3.2 Permutation Correction
  Demixing System Estimation
  Joint Diagonalization of Cross-Spectral Matrices
    Cost Function
    Uniqueness Conditions
  Cross-Spectral Matrix Estimation
    Multitaper estimates
    Lag window estimates
    Frequency-averaged cross-periodogram
  Effects of Imperfect Cross-Spectral Matrix Estimation
  Numerical Simulations
    Database
    Simulations I
    Simulations II
  Chapter Summary

5 Conclusion and Outlook
  Conclusions
  Open Issues

Bibliography

List of Figures

1.1 Simultaneously active speaker scenario
1.2 Conceptual block diagram of the blind source separation problem
1.3 Impulse responses of synthetic mixing channels
1.4 Magnitude responses of synthetic mixing channels
1.5 Prerecorded room impulse responses
1.6 Magnitude responses of prerecorded room channels
2.1 Mutual information and negentropy of two Laplace mixtures
2.2 D kurtosis and fractional order moments surfaces
2.3 D kurtosis surface
2.4 D fractional order moments surface
2.5 Optimization of demixing angle with different initializations
2.6 Bias and variance of sample cost functions
2.7 Performance measures of kurtosis and fractional order moments
2.8 Normalized fractional order moments of a generalized Gaussian pdf
2.9 Performance measure of fractional order moments in a MIMO scenario
2.10 Various members of a generalized Gaussian pdf
3.1 Time varying mixing coefficients of a TITO system
3.2 Effects of the block size on the separation of sinusoids
3.3 Effects of initializations on the separation
3.4 Correlations of the output signals
3.5 Effects of the block size on the separation of speech mixtures
4.1 SIR for various cross-spectrum estimators and overlap amounts
4.2 SIR for various cross-spectrum estimators and number of tapers
4.3 Histograms of performance and uniqueness measures
4.4 Histograms of various statistics of source cross-spectra
4.5 Effects of number of tapers on uniqueness and performance measures
4.6 Effects of set size on uniqueness and performance measures
4.7 Effects of number of tapers & segment size on various measures

Chapter 1

Introduction

1.1 ABSTRACT

This chapter serves as an introduction to the problem considered in this dissertation. All the elements of the problem, along with the key principles that lead to several approaches, are clearly explained. A comprehensive account of existing approaches to the problem is organized into categories according to the nature of the methods. Audio applications for which the techniques may be useful are provided. An overview of the dissertation is given at the end of this chapter.

1.2 MOTIVATION

Speech enhancement is a signal processing task required in many situations in which an improvement in the quality of a degraded speech signal is desired. The source of degradation may be reverberation, multiple interfering speakers, or background noise. Most of today's single-microphone noise reduction systems are based on spectral subtraction and signal subspace decomposition methods [1], [2]. These methods have limited performance in the case of multiple interfering speakers. One way to overcome these limitations is to employ multiple microphones, which is motivated by the human binaural system. Early multi-microphone noise reduction systems relied on fixed or adaptive beamforming, which remains in use today [3]. These systems

Figure 1.1: Two input two output simultaneously active speaker scenario.

require some prior knowledge such as inactive time periods of the target source, source locations, or microphone array geometry. In practice, however, this prior knowledge is rarely available, hence a system that does not depend on this information is highly desirable. Blind source separation (BSS) is aptly named because it aims to recover the source signals from their mixtures when neither the mixing system nor the source signals are observable. The general problem may be described by the example illustrated in Fig. 1.1, where the speech signals of two simultaneously active speakers are recorded by two microphones. The speech signals spoken by the speakers are called source signals, and the microphone recordings are the mixture signals. The acoustic environment is represented by the mixing system. The microphone measurements typically contain components from both sources. The procedure that resolves the individual speaker's speech by operating on the recordings, without information on each source, such as its active time periods and location, or on the mixing system, is called BSS.

Figure 1.2: Conceptual block diagram of the blind source separation problem.

1.3 SYSTEM MODEL

Fig. 1.2 depicts the conceptual block diagram of the BSS problem. The general N_s-input, N_x-output mixing system can be formulated as

X(n) = H(S(n)) + V(n),  (1.1)

where S(n) denotes the (N_s × 1) vector of source signals S_1(n), ..., S_{N_s}(n), X(n) denotes the (N_x × 1) vector of mixture signals X_1(n), ..., X_{N_x}(n), V(n) is the (N_x × 1) vector of noise signals V_1(n), ..., V_{N_x}(n), and n is the discrete time index. Here, H is the (N_x × N_s) multichannel (MIMO) mixing system. Accordingly, the N_x-input, N_y-output demixing system can be formulated as

Y(n) = W(X(n)),  (1.2)

where Y(n) denotes the (N_y × 1) vector of output signals Y_1(n), ..., Y_{N_y}(n), and W is the (N_y × N_x) multichannel demixing system. In the following, each component of the models (1.1) and (1.2) will be explained in detail along with some typical examples and assumptions.

1.3.1 Source signals

In BSS we have a set of physical sources, located at distinct unknown locations, that simultaneously emit the signals S_1(n), ..., S_{N_s}(n). These signals occupy the same frequency range and are referred to as the source signals. Moreover, it is implied that the source signals are measured at the sources. For example, in the case of speech, S_i(n) is the waveform giving the pressure change with time at the lips. In this dissertation we deal with speech signals.

Assumptions on the source signals S(n)

The following assumptions on the source signals are made throughout this dissertation. A BSS system must perform well for all speech signals; thus, from the system point of view, its inputs are random processes whose sample functions are randomly selected by the users.

(S1) Each S_m(n) has zero mean, that is, E[S_m(n)] = 0 for all n, m = 1, ..., N_s.

(S2) S_1(n), ..., S_{N_s}(n) are statistically mutually independent at each time instant n.

In order to solve the BSS problem the source signals should have some distinct characteristics, such as non-Gaussianity, nonstationarity, or nonwhiteness. By exploiting each characteristic we obtain a different BSS method. Further assumptions will be given in later sections and chapters.

1.3.2 Mixing system

The mixing system, H, has multiple sources delivering S_1(n), ..., S_{N_s}(n) at the input end and multiple sensors receiving the observed signals X_1(n), ..., X_{N_x}(n) at the output end; hence it is an (N_x × N_s) MIMO system. The sensors discussed in this dissertation are microphones. They are often designed with an omnidirectional

response. Received signals at the microphones are interchangeably called mixture signals, in the sense that each X_i(n) contains some contribution from all the source signals.

Assumptions on the mixing system H

The following assumptions on the mixing system are made throughout this dissertation.

(M1) H is a linear multichannel system.

(M2) H is stable.

(M3) H is causal.

(M4) H is convolutive (with memory).

(M5) The ratio of the number of sources to the number of mixtures is (N_s / N_x) ≤ 1.

(M6) The frequency response matrix H(f) of the mixing system has full column rank for all frequencies f.

The latter two assumptions are necessary to invert a linear mixing system with a linear demixing system.

Example

In the following an example on room acoustics is given. Consider the speech separation application (see Fig. 1.1) in which the speech (source) signals are recorded in a room with an array of microphones. Here the medium of propagation is air bounded by walls. Depending on the locations of the sources and the microphones, each source signal undergoes changes such as refraction, reflection, diffraction, and attenuation [4]. As a result of these distortions, each microphone will pick up not only an exact copy of each speech signal from the direct path (possibly with some propagation delay) but also delayed and attenuated copies from indirect paths. This phenomenon is usually referred to as reverberation, and its duration varies with the geometry of the environment. Under the assumptions of linearity (M1) and (M4), the channel between the mth source and the lth microphone may be modeled by its time-varying impulse response H_lm(n; n_0). The impulse response provides a model of all the possible paths that the speech signal experiences on its travel from the source to the microphone. Here n_0 denotes the response time of the filter to the unit impulse applied by the source at time n − n_0. The term time-varying generally implies motion of the sources and/or receivers. For stationary (fixed position) sources and microphones, which may be valid over a short observation interval, the channel between the source and the receiver can be assumed time-invariant. The impulse response of a linear time-invariant (LTI) convolutive mixing system takes the form

H(n) = Σ_{k=0}^{L_h − 1} H(k) δ(n − k).

If the mixing system is nonconvolutive, then the impulse response reduces to H(n) = H(0) δ(n), or we simply drop the time index and denote it by H. The instantaneous mixing model is usually used in anechoic environments for experimental purposes. The simulation setup section details the types of mixing systems used in the simulations.

1.3.3 Noise

We assume that the noise is additive and statistically independent of the source signals. Typical examples are thermal (sensor) noise and background noise that cannot be modeled as a point source, e.g., wind or traffic (diffuse noise).

1.3.4 Mixture signals

Consider that the mixing model (1.1) is linear, so that the mixture signals take the following form:

X(n) = H(n) ∗ S(n) + V(n),  (1.3)

where ∗ denotes convolution. (1.3) will be referred to as the convolutive mixing model. Taking the STFT of (1.3) leads to

X(f, i) = H(f) S(f, i) + V(f, i),  (1.4)

where X(f, i), S(f, i) are the STFTs of the mixture and source signals at frequency f and segment i. Here, H(f) denotes the complex-valued (N_x × N_s) frequency response matrix of the LTI mixing system H(n). Note that the convolutive mixing problem reduces to the instantaneous mixing one by moving from the time domain to the frequency domain. Other differences between these two formulations are (i) the variables in (1.3) are real-valued, whereas they are complex-valued in (1.4), and (ii) the amplitude distributions of the signals are different in the two domains. On the other hand, one particular case of (1.3), obtained when the mixing system is nonconvolutive, is the linear instantaneous model

X(n) = H S(n) + V(n).  (1.5)

Since the Fourier transform preserves linear relations, (1.5) can be written in the frequency domain as

X(f, i) = H S(f, i) + V(f, i).  (1.6)

1.3.5 Demixing system

The demixing system, W, has a set of sensors delivering the mixture signals X(n) at the input end and multiple outputs delivering the output signals Y(n) at the output end. W is an (N_y × N_x) MIMO system whose parameters need to be adjusted according to some criterion to achieve separation. The type of W, such as linear/nonlinear or convolutive/nonconvolutive, depends directly on the type of H. Assuming a linear, convolutive demixing model, the choice of its structure, i.e., FIR (tapped-delay-line), lattice, direct-form IIR, etc., is also an important issue.
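The convolutive model (1.3) is easy to exercise numerically. The following is a minimal NumPy sketch, not taken from the dissertation: the 2×2 system, the 3-tap random channels, and the noise level are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

Ns, Nx, Lh, T = 2, 2, 3, 1000            # sources, sensors, channel taps, samples
S = rng.laplace(size=(Ns, T))            # Laplacian sources (a common speech model)
H = rng.standard_normal((Nx, Ns, Lh))    # random FIR mixing channels (illustrative)
V = 0.01 * rng.standard_normal((Nx, T))  # weak additive sensor noise

# Eq. (1.3): X_l(n) = sum_m (H_lm * S_m)(n) + V_l(n)
X = np.zeros((Nx, T))
for l in range(Nx):
    for m in range(Ns):
        X[l] += np.convolve(H[l, m], S[m])[:T]
X += V
```

Each mixture channel X_l contains a filtered contribution from every source, which is exactly what makes the separation problem nontrivial.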

1.3.6 Global system

The cascade of the mixing and demixing systems is usually referred to as the global system G. It is widely used in formulating various performance measures for controlled test simulations (see the section on performance measures). Assuming the linear instantaneous model (1.5), the global system G is an (N_y × N_s) matrix

G = W H.  (1.7)

Under the linear convolutive model (1.3) the global system G(n) is the MIMO filter

G(n) = W(n) ∗ H(n).  (1.8)

1.4 BLIND SOURCE SEPARATION (BSS)

Blind source separation (BSS) is an example of an inverse problem, in the sense that it identifies the inverse of the mixing system, referred to as the demixing system. Moreover, it falls in the realm of unsupervised learning, owing to the fact that identification of the demixing system has to be performed without access to a reference signal or the mixing system. To get around this difficulty there is a need for some strong a priori information on the signals of interest. BSS algorithms incorporate this prior information into the design criteria so as to estimate the demixing system. The prior information has to be statistical to be effective. One of the earliest methods, applied in communications, is the constant modulus algorithm (CMA). This method achieves separation and equalization in a blind fashion by minimizing the deviation of the separated output magnitudes from a fixed gain. The underlying assumption is that the source signals, such as PSK and FSK, have constant magnitudes with non-Gaussian (sub-Gaussian) pdfs.
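As a toy check of (1.7), assuming the linear instantaneous model with a hypothetical 2×2 mixing matrix: a perfect demixer W = H^{-1} yields an identity global system, while a demixer that merely rescales and permutes the outputs yields a global matrix with a single nonzero entry per row and column, which still counts as separation.

```python
import numpy as np

H = np.array([[1.0, 0.5],
              [0.3, 1.0]])               # hypothetical 2x2 instantaneous mixer

# a perfect demixer gives an identity global system, Eq. (1.7)
G_ideal = np.linalg.inv(H) @ H

# a demixer that permutes and rescales the outputs still separates
Pi = np.array([[0.0, 1.0],
               [1.0, 0.0]])              # permutation matrix
Lam = np.diag([2.0, -0.5])               # arbitrary nonzero scaling
G_sep = (Pi @ Lam @ np.linalg.inv(H)) @ H

assert np.allclose(G_ideal, np.eye(2))
assert np.allclose(G_sep, Pi @ Lam)      # one nonzero entry per row and column
```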

Another class of algorithms, collectively termed Independent Component Analysis (ICA), uses statistical independence and non-Gaussian amplitude distributions of the source signals as the prior information to solve the BSS problem. A wide range of ICA algorithms are based on higher order statistics and information theory. ICA can be used for speech signals because the amplitude distribution of speech is super-Gaussian, e.g., Laplacian, for a wide range of segment sizes. Other BSS approaches exploit the non-stationarity and non-whiteness of the source signals as the prior information. These assumptions allow for Gaussian sources, and hence second order statistics (SOS) (see Sect. 1.5.3) are sufficient for separation. Speech signals are considered to be stationary over 30-40 ms long segments. Over these stationary segments, they are temporally correlated. For durations greater than 40 ms, speech signals are non-stationary, in the sense that the temporal correlations, hence the variances, vary from one segment to another.

1.4.1 Independent Component Analysis

ICA is a statistical method that is widely used to solve the BSS problem. The main assumption behind ICA is (S2). The aim of ICA is to estimate a demixing system W, using the N_x available mixture signals X(n), such that the output signals Y(n) are statistically independent. Implicit in the ICA formulation is that each signal S_i(n) is a strict-sense stationary (SSS) random process with a pdf f_{S_i} (note that, if S_i(n) is SSS, then f_{S_i} is independent of n). Similarly, each random vector S(n) is described by the joint pdf f_S. Statistical dependencies between multiple random processes with arbitrary pdfs are quantified by mutual information. This information theoretic measure is the basis of ICA. The mutual information I(Y_1; ...; Y_{N_y}) between

N_y random variables with joint pdf f_Y is defined as [5]

I(Y_1; ...; Y_{N_y}) = I(Y) = ∫ f_Y(u) log [ f_Y(u) / Π_i f_{Y_i}(u_i) ] du.  (1.9)

It becomes zero when the underlying random processes are statistically independent. Mutual information is a particular case of the Kullback-Leibler divergence (KLD). The KLD between two probability distributions f_{S_k} and f_{S_l} is defined as [5]

D_KL(S_k; S_l) = ∫ f_{S_k}(x) log [ f_{S_k}(x) / f_{S_l}(x) ] dx.  (1.10)

D_KL(S_k; S_l) is a nonnegative measure and becomes zero if and only if f_{S_k} = f_{S_l}. From the definitions it is clear that I(Y) = D_KL(f_Y; Π_i f_{Y_i}). I(Y) is nonnegative and becomes zero if and only if the Y_i(n) are statistically independent (the joint pdf factorizes into the product of the marginal pdfs). As in the KLD, the mutual information depends only on the pdf of the random vector, and hence mutual information is sometimes written as I(f_Y) rather than I(Y). In practice, the mutual information is difficult to utilize; therefore, various statistical criteria have been proposed based on its approximations. Contrast functions, as we will see in Sect. 1.5, cast these statistical criteria into optimization problems, the elements of W being the optimization parameters. The contrast function should possess desirable properties, such as a global optimum point that defines a W yielding Y(n) that are statistically independent. Also, any contrast function should be invariant to several factors, such as scaling and permutation, as they are inherent indeterminacies in any BSS algorithm. This will be discussed next.

1.4.2 Indeterminacies

Ideally, we want to achieve Y(n) = S(n). However, such a case requires perfect separation and dereverberation, which is not possible without precise knowledge of the sources or the mixing. In BSS, sources can be estimated only up to several indeterminacies

because neither S(n) nor H is accessible. Depending on the type of H (and W) we have different indeterminacies. For linear models, these are (i) an arbitrary scaling (filtering) of each source and (ii) a permutation of the source indices [6]. Due to the multiplicative form of the linear instantaneous mixing model, (1.5) can be rewritten as X(n) = (H Λ^{-1} Π^{-1})(Π Λ S(n)) + V(n), where Λ is a diagonal matrix with nonzero elements and Π is a permutation matrix obtained by interchanging the columns of the identity matrix. Because of these ambiguities, the goal of BSS is not to recover identical copies of the source signals at the outputs; rather, it is to recover the source signals without any interference from other sources. This is equivalent to finding W such that the global system (1.7) satisfies

G = Π Λ.  (1.11)

If Λ and Π are the only two indeterminacies in finding W, then the solution to the BSS problem is said to be unique. Equivalently, the matrices W and H^{-1} are said to be essentially equal [6]. Under the convolutive mixing model (1.3) the permutation indeterminacy stays the same; however, the scaling indeterminacy Λ becomes a filtering indeterminacy Λ(n), and the goal becomes to estimate a W(n) that satisfies

G(n) = Π Λ(n).  (1.12)

Note that, if the demixing estimation is performed independently for each frequency, then both the scaling and permutation factors become frequency dependent.

1.4.3 Separability and uniqueness

This section presents an overview of the separability and uniqueness issues of the demixing system W in BSS. Separability means recovery of the sources at the

outputs by means of W. By uniqueness we mean that the W achieving separation is unique up to the aforementioned ambiguities (see Sect. 1.4.2). The question to be addressed is under what assumptions statistical independence of Y(n) guarantees a W satisfying (1.11) or (1.12). For the linear instantaneous mixing model (1.5), theoretical results are provided in [7], [8], [9]. In [10] it was shown that the problem has no solution for Gaussian and temporally iid sources. For temporally iid sequences, temporal correlations at nonzero lags vanish, hence only the statistical properties at zero lag may be used. For Gaussian distributed sources this reduces to the use of auto- and cross-correlations at zero lag, as they are the sole parameters that determine their multivariate pdf. However, as we will see in Sect. 1.5, by putting constraints on the cross-correlations and variances of the output signals, W can only be determined up to an orthogonal transformation. In order to uniquely identify W one can assume that the sources are

- possibly temporally iid but non-Gaussian, and use contrasts involving higher order statistics, such as mutual information, negentropy, entropy, or cumulants, to find it [7];

- possibly Gaussian but not temporally iid (i.e., having temporal structure), and use contrasts involving second order statistics, such as cross-correlation matrices at multiple lags [8] or zero-lag cross-correlation matrices at multiple times [11], [9], to find it.

For convolutive mixtures, theoretical results are provided in [12], [13], [14], [15].
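The scaling/permutation ambiguity discussed above can be verified numerically. A small sketch (the 2×2 matrices and Laplacian sources are arbitrary illustrative choices) shows that the observations cannot distinguish the original pair (H, S) from a rescaled and permuted pair:

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((2, 2))        # arbitrary invertible mixing matrix
S = rng.laplace(size=(2, 500))         # two independent Laplacian sources

Pi = np.array([[0.0, 1.0],
               [1.0, 0.0]])            # permutation
Lam = np.diag([3.0, 0.25])             # nonzero scaling

# the same mixtures arise from rescaled/permuted sources and a modified mixer
H2 = H @ np.linalg.inv(Lam) @ np.linalg.inv(Pi)
S2 = Pi @ Lam @ S
assert np.allclose(H @ S, H2 @ S2)     # observations are identical
```

Since X(n) is all that is observed, no blind criterion can prefer (H, S) over (H2, S2); this is exactly why separation is only defined up to Π and Λ.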

Effects of noise

In the noiseless and overdetermined case, there is no advantage in using the additional N_x − N_s mixtures, and any set of N_s out of the N_x mixtures (provided that the associated N_s × N_s mixing matrix is invertible) can be used for separation [16]. This effectively reduces the dimension of the demixing parameter space, turning the problem into an even-determined one. In the noisy case, depending on the noise level on each channel, unreliable channels can degrade the separation performance and must be excluded from the combination [17]. Under the overdetermined model, if the noise is spatially iid with equal variance, then classical subspace-based methods may be employed to find the whitening matrix, one of the two matrices constituting the demixing matrix (see Sect. 1.5). The other part of the demixing matrix is usually found using higher order statistics, which allows (in theory) unbiased estimation of W so that WH = ΠΛ. Even so, the output signals cannot restore the sources, because Y(n) = ΠΛS(n) + WV(n). The additive term implies that noise may be amplified.

1.5 CONTRAST FUNCTIONS

In supervised adaptive filtering, a reference signal is typically employed to determine the optimal demixing parameters. In unsupervised adaptive filtering, e.g., BSS, such reference signals are not available. Therefore, there is a need to construct a (contrast) function of the demixing system parameters W that does not utilize S(n) or H. Moreover, at the global maxima of this function, WH = ΠΛ needs to be satisfied. A formal definition of a contrast functional is given in [18] in the SISO blind deconvolution context; [7] extended it to ICA under a linear instantaneous mixing model. A contrast ψ is a function mapping the pdf f_S of a multidimensional random

process S(n) to a real scalar, satisfying the following properties [7], [19]:

(C1) Invariance to scaling: ψ(S(n)) = ψ(ΛS(n)).

(C2) Invariance to permutation: ψ(S(n)) = ψ(ΠS(n)).

(C3) If S(n) has independent components and G is an invertible matrix, then ψ(GS(n)) ≤ ψ(S(n)).

(C4) Equality holds if and only if G takes the form in (1.11).

Note that, as we will see in the following examples, a contrast ψ is a function of the pdf f_Y of the output signals Y(n). This implies that ψ depends on W. In the following, we briefly review the contrast functions used in ICA.

Mutual information

The negative of the mutual information (1.9), that is, ψ(Y(n)) = −I(Y_1; ...; Y_{N_y}), is a contrast function [7]. This has been recognized as the canonical contrast for ICA. The problem with the mutual information is that its estimation requires the estimation of joint and marginal pdfs. The density estimators could be parametric [20] or nonparametric, the latter usually based on histograms [21] or kernels [22].

Likelihood

When the source distribution f_S is known, the maximum likelihood principle leads to minimizing the KL divergence between f_Y and f_S. In [23] it was shown that ψ(Y(n)) = −D_KL(Y(n); S(n)) is a contrast function.
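Properties (C1) and (C2) can be illustrated with a simple cumulant-based functional. The sketch below (an illustrative choice, not one of the specific contrasts cited above) uses the sum of squared normalized kurtoses of zero-mean data, which is exactly invariant to componentwise scaling and to permutation:

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.laplace(size=(2, 100000))      # two zero-mean super-Gaussian components

def contrast(Y):
    """Sum of squared normalized kurtoses of the rows of Y."""
    m2 = np.mean(Y**2, axis=1)
    m4 = np.mean(Y**4, axis=1)
    return np.sum((m4 / m2**2 - 3.0) ** 2)

psi = contrast(Y)
psi_scaled = contrast(np.diag([5.0, -0.2]) @ Y)   # (C1): rescale each component
psi_perm = contrast(Y[::-1])                      # (C2): swap the components
assert np.isclose(psi_scaled, psi)
assert np.isclose(psi, psi_perm)
```

The normalization m4/m2^2 cancels any componentwise gain, and summing over components makes the value blind to their ordering, which is precisely what (C1) and (C2) demand.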

Negentropy

By introducing a reference random vector Y_G having a multivariate Gaussian distribution with the same covariance matrix C_Y as Y, the negentropy N(Y) of the random vector Y can be written as [7]

N(Y) = H(Y_G) − H(Y),  (1.13)

where H(Y) denotes the differential entropy of Y. Negentropy N(Y) is nonnegative and invariant under invertible linear transformations; that is, for any nonsingular matrix A one can readily verify that N(Y) = N(AY) holds. In other words, N(Y) is not a discriminator of W, and hence it cannot be used as a contrast. However, from (1.9) and (1.13) the mutual information may be written as [7]

I(Y) = N(Y) − Σ_i N(Y_i) + (1/2) log [ det diag C_Y / det C_Y ].  (1.14)

The middle term in (1.14) is the sum of the marginal negentropies of the Y_i. As opposed to N(Y), N(Y_i) varies with W, and hence it can be used as a discriminator. The last term in (1.14) contains second-order statistics and vanishes when C_Y is diagonal. This is usually achieved by spatial whitening (sphering) of the mixture signals X(n) (see Sect. 1.5). Under these conditions, the only term that can be used to minimize the mutual information is the middle one, and thus ψ(Y(n)) = Σ_i N(Y_i) can be used as a contrast function. There are various methods utilizing the sum of marginal negentropies as a contrast to be maximized, with the main difference lying in its approximations. As an example, the FastICA algorithm uses several nonlinear functions to approximate the negentropy [24]. These nonlinear functions imply the use of higher order statistics.
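Equation (1.13) can be evaluated numerically for a super-Gaussian pdf. A sketch (grid limits and step are arbitrary choices) that computes the negentropy of a unit-variance Laplacian by discretizing the two differential entropies:

```python
import numpy as np

x = np.linspace(-30, 30, 200001)
dx = x[1] - x[0]

b = 1 / np.sqrt(2)                                 # unit-variance Laplacian scale
f_lap = np.exp(-np.abs(x) / b) / (2 * b)
f_gau = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)     # Gaussian with the same variance

def dentropy(f):
    """Discretized differential entropy -int f log f."""
    mask = f > 0
    return -np.sum(f[mask] * np.log(f[mask])) * dx

neg = dentropy(f_gau) - dentropy(f_lap)            # Eq. (1.13)
assert neg > 0
```

The positive value reflects that, among all pdfs with a given covariance, the Gaussian has maximal entropy, so any non-Gaussian (here super-Gaussian) pdf has strictly positive negentropy.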

Spatial Whitening

Decorrelating the output signals and normalizing their variances to unity is usually referred to as prewhitening or sphering. For linear instantaneous mixtures, prewhitening amounts to a linear transformation of the mixture signals by a whitening matrix Q, which is usually taken as any square root of the inverse of the spatial mixture covariance matrix. The spatial covariance matrix of the mixture signals is defined as

C_x = E[X(n) X^H(n)],  (1.15)

which is an (N_x × N_x) positive definite Hermitian symmetric matrix. The spatial covariance matrix of the output signals is defined similarly and denoted by C_y. In simple terms, the idea is to satisfy C_y = I. Let C_x = E_x D_x E_x^H be the EVD of C_x; then Q = C_x^{-1/2} = D_x^{-1/2} E_x^H is a whitening matrix. In other words, Ỹ(n) = C_x^{-1/2} X(n) has identity covariance matrix under the assumption that the source covariance matrix is also the identity, i.e., C_s = I. Note that the whitening matrix Q is not unique, in the sense that any orthogonal matrix multiplying it from the left gives another whitening matrix. Whitening causes the last term in (1.17) and (1.18) to vanish, confining the demixing matrix to the set of orthogonal matrices (unitary matrices in the complex-valued case). When X(n) is noisy, i.e., V(n) ≠ 0 in (1.5), it can be shown that the additive noise V(n) introduces a bias in the estimated whitening matrix Q [25]. If the noise covariance matrix C_v = E[V(n) V^H(n)] is known or can be estimated, then bias removal may be employed [26]. For the linear convolutive mixing case, prewhitening is achieved through a linear multichannel prewhitening filter Q(n) [27].
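A minimal numerical sketch of the whitening step for real-valued data (the 2×2 mixing matrix and Laplacian sources are hypothetical): form the sample covariance as in (1.15), take its EVD, build Q = D^{-1/2} E^T, and verify that the whitened data have identity covariance.

```python
import numpy as np

rng = np.random.default_rng(3)
H = np.array([[1.0, 0.8],
              [0.4, 1.0]])              # hypothetical mixing matrix
S = rng.laplace(size=(2, 200000))       # independent sources
X = H @ S                               # correlated mixtures

Cx = X @ X.T / X.shape[1]               # sample spatial covariance, Eq. (1.15)
d, E = np.linalg.eigh(Cx)               # EVD: Cx = E diag(d) E^T
Q = np.diag(d ** -0.5) @ E.T            # whitening matrix Q = D^{-1/2} E^T

Z = Q @ X                               # whitened (sphered) mixtures
Cz = Z @ Z.T / Z.shape[1]
assert np.allclose(Cz, np.eye(2), atol=1e-8)
```

Note that Q only decorrelates; the remaining rotation (the orthogonal factor mentioned above) must be found with a contrast such as the higher-order ones discussed next.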

1.5.1 Higher order statistics

The moments and cumulants of integer orders greater than two are usually referred to as higher order statistics. Fourier transforms of higher order cumulants give the polyspectra. For example, the Fourier transform of the third order cumulant sequence is called the bispectrum or bispectral density. In [7] the negentropy N(Y_i) is approximated using a finite number of cumulants. The underlying assumption is that f_{Y_i} is given by a reference Gaussian distribution multiplied by a fourth order polynomial (e.g., an Edgeworth expansion). For zero mean and unit variance Y_i it yields

N(Y_i) ≈ (1/12) K_3^2(Y_i) + (1/48) K_4^2(Y_i) + (7/48) K_3^4(Y_i) − (1/8) K_3^2(Y_i) K_4(Y_i),  (1.16)

where K_3(Y_i) and K_4(Y_i) denote the skewness and kurtosis of Y_i, respectively. Substituting (1.16) for N(Y_i) in (1.14), I(Y) can be approximated as

I(Y) ≈ N(Y) − (1/48) Σ_i { 4 K_3^2(Y_i) + K_4^2(Y_i) + 7 K_3^4(Y_i) − 6 K_3^2(Y_i) K_4(Y_i) } + (1/2) log [ det diag C_Y / det C_Y ].  (1.17)

If f_{Y_i} is symmetric around its mean, then K_3(Y_i) = 0 and (1.17) reduces to

I(Y) ≈ N(Y) − (1/48) Σ_i K_4^2(Y_i) + (1/2) log [ det diag C_Y / det C_Y ].  (1.18)

Even though derived as approximations to the mutual information through polynomial expansions of pdfs, higher order cumulants yield contrast functions. In particular, the following functionals utilizing fourth-order cumulants (kurtosis), Σ_i K_4^2(Y_i), Σ_i K_4(Y_i), Σ_i K_4^2(Y_i)/K_2^4(Y_i), and Σ_i K_4(Y_i)/K_2^2(Y_i), are contrasts, implying that they are free from spurious maxima. As a result, the associated iterative algorithms are globally convergent to a valid separation solution.
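The role of the kurtosis K_4 as a non-Gaussianity measure can be checked by simulation (the sample size is an arbitrary choice): a Laplacian has theoretical excess kurtosis 3, while a Gaussian has 0.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

def kurt(y):
    """Normalized (excess) kurtosis K4 of a zero-mean sample."""
    return np.mean(y**4) / np.mean(y**2) ** 2 - 3.0

k_lap = kurt(rng.laplace(size=n))        # Laplacian: K4 = 3 in theory
k_gau = kurt(rng.standard_normal(n))     # Gaussian: K4 = 0 in theory
assert abs(k_lap - 3.0) < 0.3
assert abs(k_gau) < 0.1
```

The wider tolerance on the Laplacian estimate hints at the sample-size issue raised below: fourth-order sample statistics of heavy-tailed data converge slowly.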

While the first two require the whiteness constraint, the latter two do not employ any constraint as they are already normalized [7], [28], [29], [3]. Another criterion based on fourth-order cumulants, ψ(Y(n)) = Σ_{ijkl, i≠j} κ_Y^2(i, j, k, l) under the whiteness constraint, is the JADE contrast proposed in [31]. Besides fourth-order cumulant based contrasts, the functional ψ(Y(n)) = Σ_i K_m^2(Y_i) under the whiteness constraint has been shown to be a contrast⁴ for any m ≥ 3 in [7]. In practice, cumulants need to be estimated from the received data. This is usually done by sample averaging; however, according to [32], the sample size needed to estimate the mth order statistics of a stochastic process, subject to prescribed values of estimation bias and variance, increases exponentially with the order m [33]. This justifies the use of the fourth order cumulants among the HOS as a contrast function.

1.5.2 Fractional order statistics

The absolute moments of noninteger-valued orders of probability density functions are referred to as fractional order statistics (FOS). Definitions of cumulants and moments of integer-valued orders can be generalized to noninteger values of the order m by means of the techniques of fractional calculus [34], [35]. We should emphasize that we are interested in moments of fractional orders with values less than four. The motivation behind the use of low FOS in a contrast function is that, for a given sample size, their sample estimators have lower variance compared to that of the HOS. Fractional moments have no obvious pictorial interpretation in terms of the pdf, whereas

⁴ The use of odd-valued m is justified if the underlying pdfs are skew.

mean is related to the center for unimodal pdfs, variance is an indicator of spread, skewness is related to symmetry, kurtosis is a measure of peakedness, etc. The mth order (0 < m < ∞) absolute moment of a pdf f_S associated with a RV S(n) is defined by [35]

E|S|^m = ∫ |s|^m f_S(s) ds,  (1.19)

provided that the integral exists. Methods utilizing FOS for BSS can be found in [36], [37], [38].

1.5.3 Second order statistics

Second order statistics (SOS) include second order cumulants, i.e., the cross-correlation and auto-correlation functions in the time domain. Their Fourier transforms give the power spectral density and cross-spectral density functions in the frequency domain. SOS based approaches have the advantage that they do not require any a priori information on the source pdfs. The major limitation of the methods utilizing SOS is that separation is possible only when the source signals are temporally colored and/or nonstationary. For instance, when the source signals have no temporal characteristics that can be exploited, then separation is not possible [7], [9]. Speech signals are considered to be non-stationary for durations greater than 4 ms [39]. Another attribute of speech signals is that they are temporally correlated (colored). These two properties are often exploited to derive contrasts based on SOS in the BSS of speech signals. Methods utilizing SOS for the linear instantaneous mixing case are introduced in [4], [8], [9]. Approaches for the linear convolutive mixing case based on SOS generally exploit the non-stationarity property of the source signals either in the frequency domain [41], [42], [43], [44], [45], or in the time domain [46].
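The lower estimation variance of the low-order fractional moments, which motivates their use here, is easy to check empirically. The following sketch (our own illustration, not code from the cited works) compares the frame-to-frame spread of the sample estimators of E|S|^(3/2) and of the fourth moment on short Laplacian frames:

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, frame_len = 2000, 400                    # many short data frames
frames = rng.laplace(size=(n_frames, frame_len))

m_frac = np.mean(np.abs(frames) ** 1.5, axis=1)    # sample E|S|^(3/2) per frame
m_four = np.mean(frames ** 4, axis=1)              # sample E{S^4} per frame

# relative spread (std/mean) of each estimator across frames
rel_frac = m_frac.std() / m_frac.mean()
rel_four = m_four.std() / m_four.mean()
# rel_frac comes out several times smaller than rel_four
```

The fourth moment is dominated by occasional large samples of the heavy-tailed Laplacian data, while the order-3/2 moment is far less sensitive to them; this is the finite-sample effect exploited in Chapter 2.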

1.6 ITERATIVE SEARCH ALGORITHMS

Since a given contrast function ψ has no closed form solution, its stationary points are determined by an iterative algorithm:

W^(i+1) = W^(i) + µ ∆W |_{W = W^(i)},  (1.20)

where W^(i) is an estimate at iteration i = 0, 1, 2, …, µ denotes the step size, which is usually chosen as a small positive constant, and ∆W is the update used to improve the estimate W^(i+1) for the next iteration. There are a variety of algorithms, differing in the way the update ∆W is constructed. Typically, ∆W is a function of the gradients of the contrast function ψ. However, the choice of the update term introduces a trade-off between convergence speed, in terms of the required number of iterations, and the computational complexity per iteration. On one hand, methods based on the first-order gradient, such as the steepest ascent method, have low computational complexity at the expense of a slow convergence rate. On the other hand, methods utilizing the second-order gradient, such as Newton's method, exhibit faster convergence at increased computational complexity. There exist other methods exploiting the structure of the ICA model, such as the natural gradient [47], the equivariant algorithm [48], and the fixed point algorithm [24]. In particular, [48] proposes to use multiplicative updates as opposed to the additive one in (1.20):

W^(i+1) = (I + ε) W^(i),  (1.21)

where the gradient ε of the contrast function for this multiplicative scheme is called the relative gradient. [47] approaches the problem by considering the underlying space of parameters W as Riemannian. They show that the steepest ascent direction in the Riemannian space of W is not ∂J/∂W as in the Euclidean space but rather

(∂J/∂W) W^T W,  (1.22)

and call this the natural gradient. [24] derives his fixed point algorithm based on an approximation of Newton's method and names it FastICA. Note that the FastICA algorithm operates on batch data. However, both the natural and relative gradient algorithms may be employed in on-line and off-line (batch) modes. All these latter algorithms have superior convergence rates compared to standard steepest ascent adaptation. When the underlying mixing system is time-invariant, batch methods are preferable because of their convergence speed, due to more accurate gradient estimates than sample-based methods.

1.6.1 Deflation scheme

In the deflation scheme, the idea is to extract one output signal, then remove it from the mixture, recursively, that is, one after another [28]. In particular, at the first stage, the first row w_1 of W is estimated using an iterative algorithm as in (1.20)⁵. After convergence, one output signal is extracted and its contribution is removed from the mixtures, leading to an N_x × (N_s − 1) mixing system for the second stage. Typically, Gram-Schmidt orthogonalization is performed at each stage to remove the projections of the previously estimated rows from the current one [49]. This orthogonality constraint is necessary to prevent the algorithm from converging to previously estimated demixing system parameters. It should be emphasized here that the second term in (1.14) allows for the deflation scheme of separation. In other words, we maximize Σ_i N(Y_i) by maximizing its summands N(Y_i) at each stage through w_i. Advantages of deflation type BSS algorithms are the ability to estimate a subset of the source signals and reduced computational load. The major drawback, however, is the propagation of error to the later stages.

⁵ The parameter of the multivariate contrast is a vector instead of a matrix in the deflation scheme.
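The Gram-Schmidt step used at each deflation stage can be sketched as follows (a generic illustration of ours, assuming prewhitened mixtures so that the rows of W may be kept orthonormal):

```python
import numpy as np

def deflate(w, W_prev):
    """Remove from the current row iterate w its projections onto the
    previously extracted rows (the rows of W_prev), then renormalize."""
    for w_k in W_prev:
        w = w - (w @ w_k) * w_k          # subtract the projection onto w_k
    return w / np.linalg.norm(w)

W_prev = np.array([[1.0, 0.0, 0.0]])     # one row already extracted
w = np.array([0.8, 0.5, 0.3])            # current iterate from (1.20)
w_new = deflate(w, W_prev)               # orthogonal to the extracted row
```

This projection removal is what prevents the iteration at a later stage from reconverging to an already extracted component.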

1.6.2 Symmetric scheme

In the symmetric scheme, all the output signals are extracted in a single stage. All of the demixing system parameters W are optimized simultaneously as in (1.20). Joint diagonalization is an example of the symmetric scheme.

1.7 JOINT DIAGONALIZATION

The joint diagonalization (JD) problem may be stated as finding a matrix W that operates, in the congruence sense, on a set C of D symmetric matrices C_i, referred to as target matrices, so that D_i = W C_i W^* are diagonal for i = 1, …, D. It is well known that any two symmetric matrices can be exactly jointly diagonalized under some mild conditions using the generalized eigenvalue decomposition [5]. In the BSS context, the target matrices C_i admit the following structure:

C_i = H Λ_i H^*,  i = 1, …, D,  (1.23)

where H is the nonsingular mixing matrix, and the Λ_i are diagonal matrices for all i. Under these conditions, the demixing matrix defined by W = H^(-1), up to permutation and scaling indeterminacies, is characterized by an exact joint diagonalizer of the set C. In general, the assumptions on the source signals, in a similar way that they turn into various contrast functions, are used to construct the diagonal matrices Λ_i. Some typical examples are the correlation matrices at multiple lags [8], at multiple times [11], [9], and higher order joint cumulant matrices [31]. Usually, the target matrices in the set are estimated from the available data. Because of the estimation errors, the hypothesized structure (1.23) of the target matrices is lost and an exact joint diagonalization is no longer possible. In this case, however, it is possible to estimate a W that will approximately jointly diagonalize the estimated target matrices in the set C. It is beneficial to jointly diagonalize more than two matrices to avoid the sensitivity

to estimation errors in the target matrices [51], [52], [53]. We will elaborate on this in Chapter 4. The JD problem may be broadly categorized as (i) orthogonal joint diagonalization and (ii) non-orthogonal joint diagonalization, according to the restrictions on the form of H, and hence W.

1.7.1 Orthogonal Joint Diagonalization

In the orthogonal joint diagonalization (OJD) problem, the joint diagonalizer is restricted to be orthogonal. In a general BSS context, this is usually done by first finding a whitening matrix Q as any square root of the inverse of the spatial mixture covariance matrix, say one of the matrices, C_1, in the set C, and then transforming the remaining matrices C_i, i = 2, …, D in C into C̃_i = Q C_i Q^*. This prewhitening stage reduces the JD problem to seeking an orthogonal joint diagonalizer matrix W̃ of the transformed set {C̃_2, …, C̃_D}. The non-orthogonal demixing matrix is then found as W = W̃ Q [31], [8].

1.7.2 Non-orthogonal Joint Diagonalization

In the non-orthogonal joint diagonalization (NOJD) problem, the joint diagonalizer W is not restricted to be orthogonal. The prewhitening stage in the OJD approach attains exact joint diagonalization of C_1; however, possible estimation errors in C_1 may have a severe effect on both the transformed set and the resulting orthogonal diagonalizer. Therefore, such an approach is known to limit the attainable separation performance [54], [55]. As a consequence, many authors have proposed NOJD methods to avoid prewhitening [51], [56], [55], [57], [58].
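For D = 2 positive-definite target matrices with the structure (1.23), the exact joint diagonalizer can be read off a generalized eigenvalue decomposition, as noted above. A minimal sketch (our own construction on synthetic target matrices; the numerical values are arbitrary):

```python
import numpy as np

H = np.array([[2.0, 0.5, 0.0],
              [0.3, 1.5, 0.2],
              [0.0, 0.4, 1.8]])                     # nonsingular mixing
L1, L2 = np.diag([1.0, 2.0, 3.0]), np.diag([3.0, 1.0, 2.0])
C1, C2 = H @ L1 @ H.T, H @ L2 @ H.T                 # structure (1.23)

# Eigenvectors of C2^{-1} C1 are the columns of H^{-T} up to scale
# (eigenvalues are the ratios of the Lambda diagonals), so their
# transpose jointly diagonalizes the pair by congruence.
_, V = np.linalg.eig(np.linalg.solve(C2, C1))
W = np.real(V).T                  # pencil eigenvalues are real here
D1, D2 = W @ C1 @ W.T, W @ C2 @ W.T   # both numerically diagonal
```

With more than two, and estimated, target matrices no exact diagonalizer exists in general, which is what the approximate criteria reviewed next address.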

1.7.3 Diagonality Measures

Any JD method aims at minimizing some measure of joint deviation from diagonality. In the following, two common off-diagonality measures are reviewed.

Frobenius norm criterion

The first measure is based on the following least-squares squared Frobenius norm:

J(W) = Σ_{i=1}^{D} || W C_i W^* − diag(W C_i W^*) ||_F^2,  (1.24)

where ||·||_F^2 is the squared Frobenius norm. The minimizer W_opt of J is called the joint diagonalizer of the set C = {C_1, …, C_D}. To avoid trivial or singular minimizers, constraints such as unit determinant [58], rows of unit norm [59], unit diagonal [6], or orthogonality [31], [8] are usually employed. Other variants of (1.24) include a set of positive weights α_i, yielding a weighted least-squares criterion [55].

Log-likelihood function criterion

The second measure is suitable for positive-definite matrices C_i and can be traced back to the likelihood criterion [61], [11], [51]:

J(W) = Σ_{i=1}^{D} log( det diag(W C_i W^*) / det(W C_i W^*) ),  (1.25)

where det is the determinant operator, and diag(A) is a diagonal matrix with the same diagonal as A. It can be shown that J ≥ 0, with equality if and only if W C_i W^* is diagonal for all i. Furthermore, (1.25) is both scale and permutation invariant, that is, J(ΠΛW) = J(W). Note that (1.24) does not have this invariance property. Other variants of (1.25) include a set of positive weights α_i in the criterion [9].
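Both measures are straightforward to evaluate; the helper functions below (our own naming) compute (1.24) and (1.25) for a candidate W and a list of real symmetric target matrices:

```python
import numpy as np

def off_frobenius(W, Cs):
    """Least-squares off-diagonality criterion (1.24)."""
    total = 0.0
    for C in Cs:
        M = W @ C @ W.T
        total += np.sum((M - np.diag(np.diag(M))) ** 2)
    return total

def log_likelihood_criterion(W, Cs):
    """Criterion (1.25); each W C W^T must be positive definite."""
    total = 0.0
    for C in Cs:
        M = W @ C @ W.T
        _, logdet = np.linalg.slogdet(M)
        total += np.sum(np.log(np.diag(M))) - logdet
    return total

Cs = [np.diag([1.0, 2.0]), np.array([[2.0, 0.3], [0.3, 1.0]])]
J_f = off_frobenius(np.eye(2), Cs)              # 2 * 0.3**2 = 0.18
J_ll = log_likelihood_criterion(np.eye(2), Cs)  # positive: one matrix is not diagonal
```

Hadamard's inequality guarantees that the log-likelihood criterion is nonnegative and vanishes only when every transformed matrix is diagonal, matching the property stated for (1.25).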

1.7.4 Uniqueness Conditions

[52] quantified the uniqueness of the solution of the JD problem by introducing a parameter ρ called the modulus of uniqueness. It is defined on the (D × N_s) matrix Ψ that collects the diagonals of the Λ_i, with the diagonal of Λ_i in row i, as shown:

Ψ = [ Λ_11 Λ_12 ⋯ Λ_1N_s ; ⋮ ⋱ ⋮ ; Λ_D1 Λ_D2 ⋯ Λ_DN_s ] = [ λ_1 λ_2 ⋯ λ_N_s ],  (1.26)

where the (D × 1) vector λ_i denotes the ith column of Ψ. Collinearity between the columns may be measured by the cosine of the angle between them,

ρ_ij = |λ_i^* λ_j| / ( ||λ_i|| ||λ_j|| ),  i ≠ j = 1, …, N_s.  (1.27)

It is assumed that ρ_ij = 1 if λ_i = 0 for some i. The modulus of uniqueness for the set of diagonal matrices Λ_i, i = 1, …, D, is defined as ρ = max_{i,j} ρ_ij. The uniqueness of the solution, that is, the essential equivalence of W and H^(-1), is formulated as ρ < 1.

1.7.5 Optimization methods

In general, the solution of the JD problem is found using an iterative algorithm. Many algorithms have been proposed, with the main differences in the type of iterations used to minimize the cost function with constraints and in the parameterizations of the joint diagonalizer. For example, [58] uses a Jacobi-like algorithm to construct a constrained matrix of determinant one with equal column norms by successive multiplications of Givens rotations, hyperbolic rotations, and diagonal matrices. In [51] a computationally efficient iterative algorithm for solving the minimization of (1.25) was proposed. Since we are going to use this method in Chapter 4, it is briefly described next. The algorithm is based on the classic Jacobi approach of making successive transformations on each pair of rows of W as follows: Let w_i^T and w_j^T denote the i-th and j-th

rows of W. They are transformed as

[ w_i^T ; w_j^T ] ← T_ij [ w_i^T ; w_j^T ],  (1.28)

without changing the other rows. Here, T_ij is a 2 × 2 transformation matrix having the following closed form:

T_ij = I − ( 2 / (1 + √(1 − 4 e_ij e_ji)) ) [ 0 e_ij ; e_ji 0 ],  (1.29)

where

[ e_ij ; e_ji ] = [ f_ij 1 ; 1 f_ji ]^(-1) [ g_ij ; g_ji ],  (1.30)

with

f_ij = (1/D) Σ_{l=1}^{D} [W C_l W^*]_jj / [W C_l W^*]_ii,   g_ij = (1/D) Σ_{l=1}^{D} Re[W C_l W^*]_ij / [W C_l W^*]_ii,  (1.31)

[A]_ij denoting the (i, j)-th element of A. The key point is that the transformation T_ij in (1.28) always decreases (1.25) unless g_ij = g_ji = 0. The iterations proceed by applying the procedure to all of the N_y (N_y − 1)/2 pairs of rows until convergence is attained. Since the transformations are not orthogonal, the resulting matrix W is not orthogonal. Note that this procedure requires that all the target matrices in C be positive-definite.

1.8 APPLICATIONS

In the following we briefly review possible application areas where BSS may be beneficial.

1.8.1 Hearing aids

For hearing aid users, enhancement of hearing and understanding of the desired speech are essential. Amplification helps most hearing-impaired people to hear speech.

However, in a noisy place, hearing aids will amplify noise as well as the desired speech signal. Many schemes exist to suppress background noise and interfering sources, and to enhance the desired speech, improving the signal-to-noise ratio [62]. As an example, microphone array systems performing fixed or adaptive beamforming are still in use today [63], [3]. The drawback of beamforming is that it needs a priori information about the source positions and the microphone array geometry. In practice, however, such information is rarely available, so BSS methods may be used instead [64], [65].

1.8.2 Teleconferencing

Audio and video conferencing, collectively termed teleconferencing systems, are widely used to facilitate communication among several people located far away from one another. These systems are often used for meetings during which numerous people using a single teleconferencing device talk to each other in a room. In such situations, the sound captured by the multiple microphones of the teleconferencing device is a mixture of multiple reverberant speech signals, resulting in poor intelligibility for the remote listener. In such applications, BSS can be used to improve the sound quality [66]. In addition, this improvement would lead to better audio compression, enhancing the efficiency of the transmission.

1.8.3 Speech recognition

Speech recognition is one of the key technologies that will enable verbal communication between humans and computers. One of the shortcomings of present speech recognition technology arises when the speech is recorded at a distance from the speaker. In addition, other talkers and noise in the environment can corrupt the speech signal as it is recorded. BSS methods may be used in such scenarios as a preprocessing stage to help improve the recognition rate. The application areas include voice controlled devices used in intelligent home and office environments, humanoid robots, automobiles, speaker identifiers, and speech-to-speech translation [67].

1.9 SIMULATION SETUP AND PERFORMANCE METRICS

In this section we summarize the types of source signals and mixing channels used in the simulations of this dissertation.

1.9.1 Source Signals

In the simulations we use the following types of signals as S(n):

an iid sequence of random samples drawn from the Laplacian distribution;

speech signals from the TIMIT database [68]. The speech signals are constructed from different utterances, half from male and half from female speakers, without intervening pauses. The utterances have been recorded in a quiet environment with a close microphone, so that any reverberation is negligibly small. The speech signals are sampled at 16 kHz.

1.9.2 Mixing System

For the simulation tests the above sources are mixed using a set of different mixing situations, including instantaneous mixing matrices and convolutive mixing matrices. Under the linear instantaneous mixing model (1.5), H is chosen according to the following:

an N_x × N_s matrix with elements drawn from the zero mean, unit variance Gaussian distribution;

an N_s × N_s orthogonal matrix. In particular, when we consider the TITO model, the orthogonal H will be chosen as the Givens rotation matrix

H = [ cos θ −sin θ ; sin θ cos θ ],  (1.32)

where θ is the rotation parameter.

Under the LTI model (1.3), the elements of H(n) are selected according to the following:

synthetic mixing: (i) iid zero mean, unit variance variables drawn from the Gaussian distribution, (ii) minimum phase FIR channels;

real mixing: measured room impulse responses from the R-HINT-E database provided in [69].

In the following we give two examples of linear, convolutive 2 × 2 mixing systems. The impulse responses of the synthetic mixing filters (ii) are plotted in Fig. 1.3. Fig. 1.4 shows the magnitude response functions of the mixing channels in panels (a)-(d), and the condition number of H(f) is plotted as a function of frequency in panel (e), all in dB scale. As the second example we used the premeasured room impulse responses obtained from the R-HINT-E database provided in [69]. These measurements were conducted in a hearing aid design context at McMaster University. Details on the room configurations, measurement setup, and technique are provided in [7]. Here, we briefly describe the measurement environment. The impulse responses were measured at microphones placed in the ears of a human head and torso model (KEMAR) from different locations in a reverberant classroom (reverberation time T_60 around 13 ms). KEMAR was located in the center of the room

Figure 1.3: Impulse response functions of the 8-tap mixing channels.

with a microphone in each ear 55 above the floor. A single loudspeaker was moved to 48 different locations around KEMAR, with angles varying from 0 to 360 degrees in the clockwise direction in front of KEMAR. For each location, room impulse responses were measured and stored in a database called R-HINT-E. In the simulations involving the 2 × 2 model, we chose the position of the first speaker at 0 degrees and the other one at 45 degrees on a circle around the microphones. Both speakers are located 6 high and 6 away from the microphones (circle radius). The original sampling rate was 44.1 kHz; however, to make it consistent with the sampling rate of the source signals (speech), they were resampled to f_s = 16 kHz. Fig. 1.5 plots the four elements of H(n), each 248 samples long. Fig. 1.6 shows the magnitude response functions of the mixing channels in panels (a)-(d) in dB scale. Furthermore, the condition number

Figure 1.4: Magnitude response functions (a)-(d) of the channels associated with Fig. 1.3, condition number of H(f) in (e), and performance index Index(H(f)) in (f).

of H(f) is plotted as a function of frequency in dB scale in panel (e). The condition number of H(f) takes values around 4 dB for all f.

1.9.3 Mixture signals

The mixture signals X(n) are generated according to (1.3).

1.9.4 Noise

The noise V(n) is generated as iid zero mean Gaussian variables with variance σ². The variance σ² is chosen according to the SNR level.
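The variance selection from a target SNR can be made concrete; the sketch below is our own helper (hypothetical names), shown for the instantaneous model:

```python
import numpy as np

def add_noise(X, snr_db, rng):
    """Add iid zero-mean Gaussian noise V(n) whose variance is chosen so
    that the mixtures X(n) are observed at the prescribed SNR (in dB)."""
    p_signal = np.mean(X ** 2)
    sigma2 = p_signal / 10.0 ** (snr_db / 10.0)   # noise variance for target SNR
    return X + rng.normal(scale=np.sqrt(sigma2), size=X.shape)

rng = np.random.default_rng(6)
S = rng.laplace(size=(2, 100000))                 # Laplacian sources
H = np.array([[1.0, 0.5], [0.4, 1.0]])            # instantaneous mixing (1.5)
X_noisy = add_noise(H @ S, snr_db=20, rng=rng)
```

Measuring the empirical ratio of signal power to added-noise power recovers the requested SNR up to sampling fluctuations.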

Figure 1.5: Impulse response functions of the premeasured mixing channels [69].

1.9.5 Performance measures

In BSS, the quality of separation is measured using various metrics, some of them being application specific. [71], [72] provide detailed discussions on the evaluation of BSS methods, the latter one on audio applications. One of the most widely used measures in the instantaneous mixing case is the index Index(G), which measures the cross-talk or interchannel interference [73]:

Index(G) ≜ Σ_i [ Σ_j |G_ij| / max_k |G_ik| − 1 ] + Σ_j [ Σ_i |G_ij| / max_k |G_kj| − 1 ].  (1.33)
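A direct implementation of (1.33) is short (our own code; the example matrices here are ours, not the ones discussed below):

```python
import numpy as np

def perf_index(G):
    """Cross-talk index (1.33); zero iff G is a scaled permutation matrix."""
    A = np.abs(G)
    rows = np.sum(A / A.max(axis=1, keepdims=True), axis=1) - 1.0
    cols = np.sum(A / A.max(axis=0, keepdims=True), axis=0) - 1.0
    return rows.sum() + cols.sum()

G_perfect = np.array([[0.0, 2.0], [-1.0, 0.0]])   # perfect demixing (1.11)
G_poor = np.array([[1.0, 0.5], [0.4, 1.0]])       # residual cross-talk
# perf_index(G_perfect) is 0; perf_index(G_poor) is 1.8
```

Each row (column) is normalized by its largest entry, so every row and column of a scaled permutation contributes exactly zero, and any off-dominant leakage adds to the index.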

Figure 1.6: Magnitude response functions (a)-(d) of the premeasured mixing channels associated with Fig. 1.5, condition number of H(f) in (e), and performance index Index(H(f)) in (f).

Index(G) ≥ 0, with equality if and only if perfect demixing (1.11) is achieved. For example, matrices G_1 and G_2 of the perfect demixing form yield Index(G_1) = Index(G_2) = 0, while the imperfect examples G_3 and G_4 yield Index(G_3) = −21.3 dB and Index(G_4) = −18.2 dB. In practice, an index value around −20 dB indicates successful performance. To illustrate two unsuccessful trials

consider two further example matrices, G_5 and G_6, with Index(G_5) = −2.7 dB and Index(G_6) = −0.8 dB. Panels (f) of Figs. 1.4 and 1.6 show the index applied to the frequency domain mixing matrices, Index(H(f)), as a function of frequency. They fluctuate around 0 dB, and any demixing matrix W(f) tries to pull Index(G(f)) below those values. The second measure to be used in this dissertation, especially for the convolutive mixing scenarios, is the signal-to-interference ratio (SIR). Suppose for a moment that the signal of interest is S_i(n) and we are trying to estimate it at the jth output as Y_j(n); then the SIR at the jth output is defined as

SIR_j = Σ_n [ (G_ji ∗ S_i)(n) ]² / Σ_{k≠i} Σ_n [ (G_jk ∗ S_k)(n) ]²,  (1.34)

where ∗ denotes convolution. Averaging (1.34) over all N_y outputs, assuming that the desired signal is different at each output, we get an average output SIR. Similarly, we can measure the input SIR by substituting H for G in (1.34). The ratio of the average output SIR to the average input SIR is usually called the average SIR improvement. Considering the two example mixing systems given above, the average input SIR for the first system, depicted in Fig. 1.3, is 4.2 dB, while for the second system, given in Fig. 1.5, it is 3.4 dB.

1.10 OUTLINE OF DISSERTATION

The following is a detailed outline of the remaining chapters of this dissertation: Chapter 2 concentrates on the two-input two-output instantaneous mixing system, which is the simplest case of the multi-input multi-output problem. Detailed studies of this subset of the general problem provide some insights into the BSS problem and

thus serve as a good starting point. The objective of this chapter is to demonstrate the gains in source separation that can be obtained by using fractional lower order moments. Given an explicitly defined probabilistic model (the Laplacian distribution) for the sources, we explore the use of fractional lower order moments as a criterion for blind source separation. This method starts with prewhitening of the mixture signals and relies on moment matching by means of an orthogonal transformation. A gradient based iterative search algorithm is used to solve the ICA problem. It is also shown that such criteria enjoy basic properties that preclude the existence of nonseparating solutions. Comparison of the proposed method with normalized kurtosis on both synthetic data and speech data shows that the separation performance is in favor of our approach over a wide range of block sizes. Some extensions to the general MIMO system and to other moment orders are discussed.

Chapter 3 discusses adaptive ICA algorithms under the nonstationary instantaneous mixing scenario. In environments where the rules of source combination change rapidly, adaptive or block adaptive methods must be deployed, and the associated problems of convergence and permutation ambiguity solved. We propose using ICA on overlapping blocks (mini-batches) and resolve the permutation ambiguity based on the principle of correlation continuity. We explore the effect of different initializations, block length, overlap percentage, and the sufficiency and utility of second order statistics to maintain continuity in the resolved signals. We demonstrate results using simulated test signals and real speech recordings.

Chapter 4 examines the separation of convolutive mixtures of speech signals in the frequency domain. We investigate the sensitivity of the joint approximate diagonalization of a set of time-varying cross-spectral matrices.
We study the effect of the number of matrices in this set, and show that the estimation of the demixing system parameters is related both to several statistics of the perturbation term, occurring due to nonvanishing cross-spectra, and to the uniqueness of the joint diagonalizer, measured by the modulus of uniqueness parameter. The second part discusses cross-spectral matrix estimation in the orthogonal multitaper framework. Four different nonparametric cross-spectrum estimators that fall into this framework are compared via numerical simulations, where real speech signals, and both synthetic and real room impulse responses, are used in a two-input, two-output scenario.

Chapter 5 is the concluding chapter; suggestions for further development are included.

1.11 APPENDIX

The (joint) characteristic function ϕ_Y of a multivariate (joint) distribution f_Y is defined as [74]

ϕ_Y(ν) = ∫ f_Y(u) exp(jν^T u) du,  (1.35)

where ν is a vector of deterministic variables ν_i.⁶ It is the inverse Fourier transform of the joint pdf, and under general conditions ϕ_Y and f_Y completely determine each other. Note that ϕ_Y is real and even if and only if f_Y is symmetric around the origin. If the elements of Y(n) are statistically independent, we have

ϕ_Y(ν) = ϕ_{Y_1}(ν_1) ϕ_{Y_2}(ν_2) ⋯ ϕ_{Y_{N_y}}(ν_{N_y}),  (1.36)

where ϕ_{Y_i} is the characteristic function of the marginal pdf f_{Y_i}. Moreover, one can show that ϕ_Y(ν) is continuous at ν = 0, and hence it can be expanded in a Taylor series. Note that this is a polynomial expansion, where the exponent of each term is a positive integer. The coefficients in the expansion yield the joint moments, and hence ϕ_Y(ν) is also referred to as the moment generating function. The logarithm of the characteristic function is called the cumulant generating function, ψ_Y(ν) = log ϕ_Y(ν), because the coefficients of its Taylor series expansion about ν = 0 reveal the cumulants. For example, let 1 ≤ i_1, …, i_k ≤ N_y; then the kth order cumulant of Y(n) is defined as a k dimensional array with (i_1, …, i_k)th element

κ_Y(i_1, …, i_k) = (−j)^k ∂^k ψ_Y(ν) / (∂ν_{i_1} ⋯ ∂ν_{i_k}) |_{ν=0}.  (1.37)

If we let 1 ≤ i_1, i_2, i_3, i_4 ≤ N_y and assume that Y(n)⁷ is zero-mean, then the fourth order cumulant, also called the fourth order cross-cumulant, may be expressed in terms of

⁶ If Y is a real-valued (complex-valued) random vector, then ν is also real-valued (complex-valued) with the same size.
⁷ To simplify the presentation, we assume that Y(n) is real valued.

its joint moments of orders up to four as:

κ_Y(i_1, i_2, i_3, i_4) = E{Y_{i_1}(n) Y_{i_2}(n) Y_{i_3}(n) Y_{i_4}(n)} − E{Y_{i_1}(n) Y_{i_2}(n)} E{Y_{i_3}(n) Y_{i_4}(n)} − E{Y_{i_1}(n) Y_{i_3}(n)} E{Y_{i_2}(n) Y_{i_4}(n)} − E{Y_{i_2}(n) Y_{i_3}(n)} E{Y_{i_1}(n) Y_{i_4}(n)}.  (1.38)

Letting i_1 = i_2 = i_3 = i_4 = i and assuming zero mean Y_i(n), the fourth order cumulant is called the kurtosis of Y_i(n):

K_4(Y_i) = κ_Y(i, i, i, i) = E{Y_i^4(n)} − 3 (E{Y_i^2(n)})².  (1.39)

If the components of Y(n) are statistically independent, then all the cross-cumulants of any order k vanish. Moreover, higher order (k > 2) cumulants of a Gaussian random vector are zero. In general, cumulants may be interpreted as a set of descriptive constants of a pdf. The problem is that there are an infinite number of them; however, only a finite number of them are typically used in a contrast function. The differential entropy of a random vector Y with joint pdf f_Y is defined as follows [75]:

H(Y) = − ∫ f_Y(u) log f_Y(u) du.  (1.40)
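As a quick numerical check of (1.39) (our own sampling sketch): for a unit-variance Laplacian sample the kurtosis should come out near 3, and for a Gaussian sample near 0:

```python
import numpy as np

def kurtosis(y):
    """Fourth order cumulant (1.39) of a zero-mean sample sequence."""
    y = y - y.mean()
    return np.mean(y ** 4) - 3.0 * np.mean(y ** 2) ** 2

rng = np.random.default_rng(3)
lap = rng.laplace(scale=1.0 / np.sqrt(2), size=1_000_000)  # unit variance
gau = rng.standard_normal(1_000_000)
# kurtosis(lap) is close to 3, kurtosis(gau) is close to 0
```

The slow convergence of the Laplacian estimate toward 3, even with a large sample, reflects the large variance of fourth order sample statistics discussed in Sect. 1.5.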

Chapter 2

Blind Separation of Laplacian Sources based on Fractional Order Moments

2.1 ABSTRACT

A new contrast function based on low fractional moments is proposed for blind source separation (BSS) of speech signals. Its study is motivated by the need to perform blind speech separation over short data frames. The new contrast function is numerically more stable; its estimates over short frames have better statistical properties than higher order measures such as kurtosis. The proposed contrast function is enabled by the Laplacian distribution of speech signals. Its theoretical and statistical properties are derived and tested using pseudo-random data as well as speech. Its performance is compared to that of kurtosis, and it is shown that this contrast function consistently outperforms the normalized kurtosis over a wide range of frame lengths chosen between 5 and 5 ms.

2.2 INTRODUCTION

Blind source separation (BSS) resolves mixtures into statistically independent signals by optimizing a contrast function such as kurtosis [76]. The extrema of the cost function correspond to an inversion of the mixing operation, yielding the source

signals. Conditions for a successful numerical operation rely on the goodness of the estimate of the cost function. For finite data sets, large deviations from theoretical values create spurious peaks, and the demixing operation fails [77], [78]. The need to work with short frames is common in real-time speech separation applications, where delays beyond 1 ms are not tolerable. The use of the fourth moment, for example, causes the kurtosis estimate to have a large variance and to be highly susceptible to an occasional large valued sample. Insufficient data similarly corrupts gradient estimates in the adaptive case [28]. Use of lower order moments would reduce the estimation variance; however, third moments are zero for all symmetric distributions, and the variance is insufficient as a discriminator. The next logical choice is the use of fractional moments [36], [37]. It is rather fortunate that the Laplace distribution [79], whose fractional moments have many salient properties, is the widely accepted distribution of speech signals [8]. In this chapter we propose a novel contrast function defined in terms of fractional moments and show that it outperforms normalized kurtosis in its discrimination of speech signals. The proposed fractional-moments contrast function is developed in Sect. 2.3. The theoretical search surfaces of the proposed contrast function are analyzed and compared to those of the kurtosis in Sect. 2.4.

2.3 CONTRAST FUNCTION

The absolute moments of the Laplace distribution are given by

ν_a(X) = E|X − η|^a = Γ(a + 1) (σ/√2)^a,  (2.1)

where η and σ are the location and standard deviation parameters, respectively. For the Gaussian distribution, the absolute moments satisfy

ν_a(X) = (1/√π) (√2 σ)^a Γ((a + 1)/2).  (2.2)
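Formula (2.1) is easy to check by simulation; the following sketch (our own code, with η = 0) compares the sample absolute moment of order a = 3/2 of Laplacian draws against the closed form:

```python
import numpy as np
from math import gamma, sqrt

a, sigma = 1.5, 2.0
theory = gamma(a + 1.0) * (sigma / sqrt(2.0)) ** a        # (2.1) with eta = 0

rng = np.random.default_rng(4)
# NumPy's Laplace "scale" b gives std sqrt(2)*b, so scale = sigma/sqrt(2)
x = rng.laplace(scale=sigma / sqrt(2.0), size=1_000_000)
empirical = np.mean(np.abs(x) ** a)
# empirical agrees with theory to a few parts per thousand
```

The same check with a = 5/2 works identically; only the Γ factor and the exponent change.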

Theorem. The absolute fractional moments at a = 3/2 and a = 5/2 of all distributions characterizing linear combinations of independent Laplace random variables satisfy

( ν_{3/2} / Γ(1 + 3/2) )^{2/3} ≥ ( ν_{5/2} / Γ(1 + 5/2) )^{2/5},  (2.3)

with equality if and only if the distribution is Laplace.

Proof. The equality in (2.3) for the Laplacian distribution is obtained by evaluating (2.1) at a = 3/2 and 5/2 and solving for σ/√2. For the Gaussian distribution, we evaluate (2.2) at a = 3/2 and divide by Γ(1 + 3/2) to get

( ν_{3/2} / Γ(1 + 3/2) )^{2/3} = ( Γ(5/4) / Γ(5/2) )^{2/3} √2 σ π^{−1/3} = 0.748 σ.  (2.4)

The analogous operation for a = 5/2 yields

( ν_{5/2} / Γ(1 + 5/2) )^{2/5} = ( Γ(7/4) / Γ(7/2) )^{2/5} √2 σ π^{−1/5} = 0.6727 σ,  (2.5)

and establishes the inequality. It has been shown in [81] (and references therein) that the non-Gaussianity of the sum of independent random variables (RVs) is monotonically non-increasing. It follows, therefore, that the fractional absolute moments of the sum of independent Laplace RVs will also satisfy the inequality of (2.3). The above theorem suggests the proposed optimization statement to be used for speech separation:

J_fm = ( ν_{3/2} / Γ(1 + 3/2) )^{2/3} − ( ν_{5/2} / Γ(1 + 5/2) )^{2/5}.  (2.6)

In the next section we analyze the search surface, both in two and three dimensional demixing parameter spaces under the TITO model, and compare it to that of the normalized kurtosis function.
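The discriminative behavior of J_fm can be illustrated on samples: it should be near zero for Laplacian data and positive for data that is closer to Gaussian, such as a normalized sum of two independent Laplacian variables (our own sketch; the helper name is ours):

```python
import numpy as np
from math import gamma

def j_fm(y):
    """Contrast (2.6) from sample absolute moments of orders 3/2 and 5/2."""
    t1 = (np.mean(np.abs(y) ** 1.5) / gamma(2.5)) ** (2.0 / 3.0)
    t2 = (np.mean(np.abs(y) ** 2.5) / gamma(3.5)) ** (2.0 / 5.0)
    return t1 - t2

rng = np.random.default_rng(5)
n = 1_000_000
s1 = rng.laplace(scale=1.0 / np.sqrt(2), size=n)   # unit-variance Laplacian
s2 = rng.laplace(scale=1.0 / np.sqrt(2), size=n)
mix = (s1 + s2) / np.sqrt(2)                       # unit variance, more Gaussian
# j_fm(s1) is near 0, while j_fm(mix) is clearly positive (about 0.03)
```

This is exactly the gap that the separation algorithm of this chapter exploits: minimizing J_fm over the orthogonal demixing parameter drives the outputs back toward the Laplacian sources.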

2.4 THE SEARCH SURFACE

Recall that the elements of S(n) = [S_1(n), S_2(n)]^T are statistically independent and identically Laplace distributed with pdf f_S, and that they are mixed by the orthogonal mixing matrix H, chosen as the Givens rotation matrix with rotation parameter \theta given in (1.32), to produce the mixture signals X(n) = [X_1(n), X_2(n)]^T by

X(n) = H S(n). \qquad (2.7)

Orthogonality of H preempts that the mixtures are decorrelated and allows us to focus on the merits of the proposed contrast to resolve the mixtures into independent components. Thus, we may set our aim to finding an orthogonal demixing matrix

W = \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix}, \qquad (2.8)

with rotation angle \alpha \in [-\pi, \pi). Clearly the global matrix (1.7) is another orthogonal matrix with rotation parameter \phi = \alpha - \theta:

G = \begin{pmatrix} \cos\phi & \sin\phi \\ -\sin\phi & \cos\phi \end{pmatrix}. \qquad (2.9)

Since (2.9) establishes the link between Y_i(n) and S_i(n) via Y(n) = G S(n), it is fairly easy to determine the marginal pdf^1 f_Y of the output signals Y_i(n) as a parametric family of distributions with parameter \phi,

f_Y(y) = \begin{cases} \dfrac{\cos\phi\, e^{-\sqrt{2}|y|/\cos\phi} - \sin\phi\, e^{-\sqrt{2}|y|/\sin\phi}}{\sqrt{2}\cos 2\phi}, & \phi \in \Phi_1 \\[1ex] \left( 1/2 + |y| \right) e^{-2|y|}, & \phi \in \Phi_2 \\[1ex] \left( 1/\sqrt{2} \right) e^{-\sqrt{2}|y|}, & \phi \in \Phi_3 \end{cases} \qquad (2.10)

^1 Because of the symmetry, f_{Y_i} = f_Y, i = 1, 2.
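The rotation algebra behind (2.7)–(2.9) can be sketched directly. This is an illustrative addition; it assumes H in (1.32) is the standard Givens rotation (a plausible reading, since (1.32) is not reproduced here), and the angle values are arbitrary examples. When α = θ, the global matrix is the identity and the sources are recovered exactly:

```python
import numpy as np

def mixing(theta):
    # H: standard Givens rotation (assumed form of (1.32))
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def demixing(alpha):
    # W as in (2.8)
    return np.array([[np.cos(alpha),  np.sin(alpha)],
                     [-np.sin(alpha), np.cos(alpha)]])

theta, alpha = 0.3, 1.1                     # hypothetical mixing/demixing angles
H, W = mixing(theta), demixing(alpha)
G = W @ H                                   # global matrix, Y(n) = G S(n)

rng = np.random.default_rng(2)
S = rng.laplace(scale=1 / np.sqrt(2), size=(2, 1000))  # unit-variance sources
X = H @ S                                   # mixtures (2.7)
Y = demixing(theta) @ X                     # demixing with alpha = theta
```

With these conventions the product W H is again a rotation, with parameter φ = α − θ, exactly as (2.9) states.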

where \Phi_3 consists of the integer multiples of \pi/2, \Phi_2 = \{\pm\pi/4, \pm 3\pi/4\}, and \Phi_1 is [-\pi, \pi) excluding both sets. The joint pdf f_{Y_1 Y_2} of Y(n) can be determined using the convolution theorem in probability theory [82] as

f_{Y_1 Y_2}(y_1, y_2) = \frac{1}{2} \exp\left( -\sqrt{2}\, |\sin\phi\, y_1 + \cos\phi\, y_2| - \sqrt{2}\, |\cos\phi\, y_1 - \sin\phi\, y_2| \right). \qquad (2.11)

Using (2.10) and (2.11), the mutual information I(Y_1; Y_2) between Y_1(n) and Y_2(n) (1.9), the negentropy N(Y_1) (1.13), and the cumulants and moments of the output signals can easily be found. Fig. 2.1 plots I(Y_1; Y_2) in panel (a) and N(Y_1) in panel (b) as a function of the orthogonal global matrix parameter \phi. Separation points are the ones where the mutual information becomes zero and the negentropy takes its maximum value. In particular:

At \phi = -\pi: Y_1(n) = -S_1(n), Y_2(n) = -S_2(n).
At \phi = -\pi/2: Y_1(n) = -S_2(n), Y_2(n) = S_1(n).
At \phi = 0: Y_1(n) = S_1(n), Y_2(n) = S_2(n).
At \phi = \pi/2: Y_1(n) = S_2(n), Y_2(n) = -S_1(n).

Using (2.10), the normalized kurtosis defined by

J_{kt} = \frac{\nu_4}{\nu_2^2} - 3 \qquad (2.12)

may be written as a function of the orthogonal global matrix parameter \phi as

J_{kt} = \begin{cases} \dfrac{3 \left( \cos 6\phi + 7 \cos 2\phi \right)}{8 \cos 2\phi}, & \phi \in \Phi_1 \\[1ex] 3/2, & \phi \in \Phi_2 \\[1ex] 3, & \phi \in \Phi_3. \end{cases} \qquad (2.13)

Note that, since the Laplacian sources are super-Gaussian, (2.12) is sign definite, and we do not need to take the square or absolute value of (2.12) as a cost function to be
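Using the identity cos 6φ + 7 cos 2φ = 2 cos 2φ (cos 4φ + 3), the Φ1 branch of (2.13) equals 3(cos 4φ + 3)/4, which reduces to 3/2 at odd multiples of π/4 and to 3 at multiples of π/2, so the three branches join continuously. The Monte Carlo check below is an illustrative addition (sample size and seed are arbitrary); it also hints at the large variance of the kurtosis estimator discussed at the start of the chapter:

```python
import numpy as np

def j_kt_theory(phi):
    # Phi_1 branch of (2.13); algebraically equal to 3*(cos(4*phi) + 3)/4
    return 3 * (np.cos(6 * phi) + 7 * np.cos(2 * phi)) / (8 * np.cos(2 * phi))

def j_kt_sample(y):
    # normalized kurtosis (2.12) estimated from data
    y = np.asarray(y, dtype=float)
    return np.mean(y ** 4) / np.mean(y ** 2) ** 2 - 3.0

rng = np.random.default_rng(3)
S = rng.laplace(scale=1 / np.sqrt(2), size=(2, 500_000))  # unit-variance sources
phi = np.pi / 8
Y1 = np.cos(phi) * S[0] + np.sin(phi) * S[1]              # one output of Y = G S
```

Even with half a million samples, the empirical kurtosis still scatters noticeably around its theoretical value, which is the finite-sample sensitivity this chapter sets out to avoid.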

Figure 2.1: Mutual information (a), and negentropy (b) as a function of the orthogonal global matrix parameter φ.

maximized. It is well known that the maximization of (2.12) corresponds to the true demixing angles [76], [28]. To verify the same for the proposed fractional contrast function, we compute (2.6) for the Laplacian mixture distribution f_Y(y) given by (2.10). Using

g(\phi, a) = \left( \frac{\cos^{a+2}\phi - \sin^{a+2}\phi}{\cos 2\phi} \right)^{1/a}, \qquad (2.14)

the result is given as

J_{fm} = \begin{cases} \frac{1}{\sqrt{2}} \left( g(\phi, 3/2) - g(\phi, 5/2) \right), & \phi \in \Phi_1 \\[1ex] \frac{1}{2} \left( (1 + 3/4)^{2/3} - (1 + 5/4)^{2/5} \right), & \phi \in \Phi_2 \\[1ex] 0, & \phi \in \Phi_3. \end{cases} \qquad (2.15)
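The closed form (2.14)–(2.15) can be verified numerically on 0 ≤ φ < π/4, where the trigonometric terms are non-negative. This is an illustrative check (the test angles are arbitrary): the Φ1 branch vanishes at φ = 0, is positive in between, and approaches the Φ2 constant as φ → π/4, so the piecewise definition is continuous.

```python
import numpy as np

def g(phi, a):
    # (2.14); valid for 0 <= phi < pi/4, where cos(phi) and sin(phi) >= 0
    num = np.cos(phi) ** (a + 2) - np.sin(phi) ** (a + 2)
    return (num / np.cos(2 * phi)) ** (1.0 / a)

def j_fm_theory(phi):
    # Phi_1 branch of (2.15)
    return (g(phi, 1.5) - g(phi, 2.5)) / np.sqrt(2.0)

# Phi_2 constant of (2.15)
phi2_value = 0.5 * ((1 + 3 / 4) ** (2 / 3) - (1 + 5 / 4) ** (2 / 5))
```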

Figure 2.2: (a) Kurtosis, and (b) fractional order moments as a function of the orthogonal global matrix parameter φ.

We can readily verify that J_{fm} \geq 0, with equality only at the separation points, where Y_i = \pm S_j, i, j \in \{1, 2\}, meaning that minimization of J_{fm} will result in perfect separation. Note that, due to the symmetry of the Laplace pdf, \pm S_j have the same pdf and the sign ambiguity cannot be resolved. Fig. 2.2 illustrates the cost functions (2.13) in panel (a) and (2.15) in panel (b) as functions of the orthogonal global matrix parameter φ. It can be observed that the minima of J_{fm} and the maxima of J_{kt} occur at the correct demixing angles, corresponding to the separated signals \pm S_1, \pm S_2. We can also plot the 3-D surfaces of the contrast functions in the space of W_{11} and W_{12}. Figs. 2.3 and 2.4 show the surfaces of the kurtosis and the fractional order moment based contrasts, respectively. The separation points of this 2 \times 2 problem are given by the maxima points of the kurtosis surface in Fig. 2.3 and the minima

Figure 2.3: Kurtosis as the contrast function in the space of W_{11} and W_{12}.

points of the fractional contrast in Fig. 2.4. Note that the projection of the surfaces onto the unit circle, that is, where W_{11}^2 + W_{12}^2 = 1, yields the plots shown in panels (a) and (b) of Fig. 2.2. The algorithmic properties of the fractional order moments based contrast are derived in the next section.

2.5 OPTIMIZATION ON THE UNIT CIRCLE

We can utilize one of the gradient-based optimization techniques to find the minima of J_{fm}. For its simplicity we prefer to use the steepest descent algorithm, with the following update:

\alpha(i+1) = \alpha(i) - \lambda \left( \frac{\partial J_{fm}}{\partial \alpha} \right)_{\alpha = \alpha(i)}, \qquad (2.16)

Figure 2.4: Fractional order moments based cost function in the space of W_{11} and W_{12}.

where \lambda denotes the step size of the update, and i is the iteration index. Using the chain rule, the gradient can be found as follows:

\frac{\partial J_{fm}}{\partial \alpha} = \frac{2}{3}\, \frac{\nu_{3/2}^{-1/3}}{\Gamma^{2/3}(1+3/2)}\, \frac{\partial \nu_{3/2}}{\partial \alpha} - \frac{2}{5}\, \frac{\nu_{5/2}^{-3/5}}{\Gamma^{2/5}(1+5/2)}\, \frac{\partial \nu_{5/2}}{\partial \alpha}, \qquad (2.17)

where

\frac{\partial \nu_a}{\partial \alpha} = a\, E\!\left[ |Y_i|^{a-1}\, \frac{\partial |Y_i|}{\partial \alpha} \right], \qquad \frac{\partial |Y_i|}{\partial \alpha} = \mathrm{sgn}(Y_i)\, \frac{\partial Y_i}{\partial \alpha}, \qquad (2.18)

with \partial Y_1 / \partial \alpha = -X_1 \sin\alpha + X_2 \cos\alpha = Y_2 and \partial Y_2 / \partial \alpha = -X_1 \cos\alpha - X_2 \sin\alpha = -Y_1. Note that the absolute moments \nu_a are presumed to possess continuous first-order derivatives excluding the case Y_i = 0. Since the cost function J_{fm}, which is shown in Fig. 2.5 as a function of the orthogonal demixing angle \alpha, is free of spurious minima, \alpha will converge to one of the four separating angles, depending on the initialization.
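A minimal end-to-end sketch of the steepest descent loop (2.16)–(2.18), evaluated with sample moments on the first output. This is an illustrative addition: the mixing angle, initialization, step size, iteration count, and sample size are all arbitrary choices, and H is assumed to be the standard Givens rotation.

```python
import math
import numpy as np

def demixing(alpha):
    # W as in (2.8)
    return np.array([[np.cos(alpha),  np.sin(alpha)],
                     [-np.sin(alpha), np.cos(alpha)]])

def grad_j_fm(alpha, X):
    # sample version of (2.17)-(2.18), applied to the first output Y1
    y1, y2 = demixing(alpha) @ X
    ay = np.abs(y1)
    dy1 = y2                                  # dY1/dalpha = Y2
    nu32, nu52 = np.mean(ay ** 1.5), np.mean(ay ** 2.5)
    dnu32 = 1.5 * np.mean(ay ** 0.5 * np.sign(y1) * dy1)
    dnu52 = 2.5 * np.mean(ay ** 1.5 * np.sign(y1) * dy1)
    return ((2 / 3) * nu32 ** (-1 / 3) / math.gamma(2.5) ** (2 / 3) * dnu32
            - (2 / 5) * nu52 ** (-3 / 5) / math.gamma(3.5) ** (2 / 5) * dnu52)

rng = np.random.default_rng(4)
S = rng.laplace(scale=1 / np.sqrt(2), size=(2, 200_000))  # unit-variance sources
theta = 0.3                                   # hypothetical mixing angle
H = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = H @ S                                     # mixtures (2.7)

alpha, lam = 0.9, 0.5                         # initialization and step size
for _ in range(300):
    alpha -= lam * grad_j_fm(alpha, X)        # update (2.16)

phi = alpha - theta                           # residual global rotation
```

With these settings, φ settles near a multiple of π/2, i.e. one of the four separating angles, up to the sign and permutation indeterminacies noted above.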


More information

FuncICA for time series pattern discovery

FuncICA for time series pattern discovery FuncICA for time series pattern discovery Nishant Mehta and Alexander Gray Georgia Institute of Technology The problem Given a set of inherently continuous time series (e.g. EEG) Find a set of patterns

More information

Comparative Performance Analysis of Three Algorithms for Principal Component Analysis

Comparative Performance Analysis of Three Algorithms for Principal Component Analysis 84 R. LANDQVIST, A. MOHAMMED, COMPARATIVE PERFORMANCE ANALYSIS OF THR ALGORITHMS Comparative Performance Analysis of Three Algorithms for Principal Component Analysis Ronnie LANDQVIST, Abbas MOHAMMED Dept.

More information

Lessons in Estimation Theory for Signal Processing, Communications, and Control

Lessons in Estimation Theory for Signal Processing, Communications, and Control Lessons in Estimation Theory for Signal Processing, Communications, and Control Jerry M. Mendel Department of Electrical Engineering University of Southern California Los Angeles, California PRENTICE HALL

More information

Lecture Notes 5: Multiresolution Analysis

Lecture Notes 5: Multiresolution Analysis Optimization-based data analysis Fall 2017 Lecture Notes 5: Multiresolution Analysis 1 Frames A frame is a generalization of an orthonormal basis. The inner products between the vectors in a frame and

More information

Lecture 19 IIR Filters

Lecture 19 IIR Filters Lecture 19 IIR Filters Fundamentals of Digital Signal Processing Spring, 2012 Wei-Ta Chu 2012/5/10 1 General IIR Difference Equation IIR system: infinite-impulse response system The most general class

More information

Introduction to Independent Component Analysis. Jingmei Lu and Xixi Lu. Abstract

Introduction to Independent Component Analysis. Jingmei Lu and Xixi Lu. Abstract Final Project 2//25 Introduction to Independent Component Analysis Abstract Independent Component Analysis (ICA) can be used to solve blind signal separation problem. In this article, we introduce definition

More information

An Improved Cumulant Based Method for Independent Component Analysis

An Improved Cumulant Based Method for Independent Component Analysis An Improved Cumulant Based Method for Independent Component Analysis Tobias Blaschke and Laurenz Wiskott Institute for Theoretical Biology Humboldt University Berlin Invalidenstraße 43 D - 0 5 Berlin Germany

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Adaptive Systems Homework Assignment 1

Adaptive Systems Homework Assignment 1 Signal Processing and Speech Communication Lab. Graz University of Technology Adaptive Systems Homework Assignment 1 Name(s) Matr.No(s). The analytical part of your homework (your calculation sheets) as

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

Advanced Digital Signal Processing -Introduction

Advanced Digital Signal Processing -Introduction Advanced Digital Signal Processing -Introduction LECTURE-2 1 AP9211- ADVANCED DIGITAL SIGNAL PROCESSING UNIT I DISCRETE RANDOM SIGNAL PROCESSING Discrete Random Processes- Ensemble Averages, Stationary

More information

Single Channel Signal Separation Using MAP-based Subspace Decomposition

Single Channel Signal Separation Using MAP-based Subspace Decomposition Single Channel Signal Separation Using MAP-based Subspace Decomposition Gil-Jin Jang, Te-Won Lee, and Yung-Hwan Oh 1 Spoken Language Laboratory, Department of Computer Science, KAIST 373-1 Gusong-dong,

More information

BLIND SOURCE SEPARATION TECHNIQUES ANOTHER WAY OF DOING OPERATIONAL MODAL ANALYSIS

BLIND SOURCE SEPARATION TECHNIQUES ANOTHER WAY OF DOING OPERATIONAL MODAL ANALYSIS BLIND SOURCE SEPARATION TECHNIQUES ANOTHER WAY OF DOING OPERATIONAL MODAL ANALYSIS F. Poncelet, Aerospace and Mech. Eng. Dept., University of Liege, Belgium G. Kerschen, Aerospace and Mech. Eng. Dept.,

More information

BLOCK-BASED MULTICHANNEL TRANSFORM-DOMAIN ADAPTIVE FILTERING

BLOCK-BASED MULTICHANNEL TRANSFORM-DOMAIN ADAPTIVE FILTERING BLOCK-BASED MULTICHANNEL TRANSFORM-DOMAIN ADAPTIVE FILTERING Sascha Spors, Herbert Buchner, and Karim Helwani Deutsche Telekom Laboratories, Technische Universität Berlin, Ernst-Reuter-Platz 7, 10587 Berlin,

More information

Dimensionality Reduction. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Dimensionality Reduction. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Dimensionality Reduction CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Visualize high dimensional data (and understand its Geometry) } Project the data into lower dimensional spaces }

More information

Analytical solution of the blind source separation problem using derivatives

Analytical solution of the blind source separation problem using derivatives Analytical solution of the blind source separation problem using derivatives Sebastien Lagrange 1,2, Luc Jaulin 2, Vincent Vigneron 1, and Christian Jutten 1 1 Laboratoire Images et Signaux, Institut National

More information

PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION. A Thesis MELTEM APAYDIN

PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION. A Thesis MELTEM APAYDIN PHASE RETRIEVAL OF SPARSE SIGNALS FROM MAGNITUDE INFORMATION A Thesis by MELTEM APAYDIN Submitted to the Office of Graduate and Professional Studies of Texas A&M University in partial fulfillment of the

More information

TWO METHODS FOR ESTIMATING OVERCOMPLETE INDEPENDENT COMPONENT BASES. Mika Inki and Aapo Hyvärinen

TWO METHODS FOR ESTIMATING OVERCOMPLETE INDEPENDENT COMPONENT BASES. Mika Inki and Aapo Hyvärinen TWO METHODS FOR ESTIMATING OVERCOMPLETE INDEPENDENT COMPONENT BASES Mika Inki and Aapo Hyvärinen Neural Networks Research Centre Helsinki University of Technology P.O. Box 54, FIN-215 HUT, Finland ABSTRACT

More information

Sparse Kernel Density Estimation Technique Based on Zero-Norm Constraint

Sparse Kernel Density Estimation Technique Based on Zero-Norm Constraint Sparse Kernel Density Estimation Technique Based on Zero-Norm Constraint Xia Hong 1, Sheng Chen 2, Chris J. Harris 2 1 School of Systems Engineering University of Reading, Reading RG6 6AY, UK E-mail: x.hong@reading.ac.uk

More information

Linear Algebra Massoud Malek

Linear Algebra Massoud Malek CSUEB Linear Algebra Massoud Malek Inner Product and Normed Space In all that follows, the n n identity matrix is denoted by I n, the n n zero matrix by Z n, and the zero vector by θ n An inner product

More information

Independent Component Analysis (ICA) Bhaskar D Rao University of California, San Diego

Independent Component Analysis (ICA) Bhaskar D Rao University of California, San Diego Independent Component Analysis (ICA) Bhaskar D Rao University of California, San Diego Email: brao@ucsdedu References 1 Hyvarinen, A, Karhunen, J, & Oja, E (2004) Independent component analysis (Vol 46)

More information

New Statistical Model for the Enhancement of Noisy Speech

New Statistical Model for the Enhancement of Noisy Speech New Statistical Model for the Enhancement of Noisy Speech Electrical Engineering Department Technion - Israel Institute of Technology February 22, 27 Outline Problem Formulation and Motivation 1 Problem

More information