y 2 a 12 a 1n a 11 s 2 w 11 f n a 21 s n f n y 1 a n1 x n x 2 x 1

Size: px

Start display at page:

Download "y 2 a 12 a 1n a 11 s 2 w 11 f n a 21 s n f n y 1 a n1 x n x 2 x 1"

Harvey Boone
6 years ago
Views:

1 Maximum Likelihood Blind Source Separation: A Context-Sensitive Generalization of ICA Barak A. Pearlmutter Computer Science Dept, FEC 33 University of Ne Mexico Albuquerque, NM 873 bap@cs.unm.edu Lucas C. Parra Siemens Corporate Research 755 College Road East Princeton, NJ lucas@scr.siemens.com Abstract In the square linear blind source separation problem, one must nd a linear unmixing operator hich can detangle the result x i (t) of mixing n unknon independent sources s i (t) through an unknon n n mixing matrix A(t) of causal linear lters: x i = P a i s. We cast the problem as one of maximum likelihood density estimation, and in that frameork introduce an algorithm that searches for independent components using both temporal and spatial cues. We call the resulting algorithm \Contextual ICA," after the (Bell and Senoski 995) Infomax algorithm, hich e sho to be a special case of cica. Because cica can make use of the temporal structure of its input, it is able separate in a number of situations here standard methods cannot, including sources ith lo kurtosis, colored Gaussian sources, and sources hich have Gaussian histograms. The Blind Source Separation Problem Consider a set of n indepent sources s(t) ::: s n (t). We P are given n linearly distorted sensor reading hich combine these sources, x i = a is, here a i is a lter beteen source and sensor i, asshon in gure a. This can be expressed as X X X x i (t) = a i ()s (t ; ) = = a i s

2 f( s(t) s ( t ),... ) f 2 s s 2 a a 2 a 2 a n f( y (t) y ( t ),...; () ) f 2 y y n f n s n f n y n a n n x x 2 x n x x 2 x n Figure : The left diagram shos a generative model of data production for blind source separation problem. The cica algorithm ts the reparametrized generative model on the right to the data. Since (unless the mixing process is singular) both diagrams give linear maps beteen the sources and the sensors, they are mathematically equivalent. Hoever, (a) makes the transformation from s to x explicit, hile (b) makes the transformation from x to y, the estimated sources, explicit. P or, in matrix notation, x(t) = = A()s(t ; ) =A s: The square linear blind source separation problem is to recover s from x. There is an inherent ambiguity in this, for if e dene a ne set of sources s by s i = b i s i here b i () is some invertable lter, then the various s i are independent, P and constitute ust as good a solution to the problem as the true s i,sincex i = (a i b ; ) s : Similarly the sources could be arbitrarily permuted. Surprisingly, up to permutation of the sources and linear ltering of the individual sources, the problem is ell posed assuming that the sources s are not Gaussian. The reason for this is that only ith a correct separation are the recovered sources truly statistically independent, and this fact serves as a sucient constraint. Under the assumptions e have made, and further assuming that P the linear transformation A is invertible, e ill speak of recovering y i (t) = i x here these y i are a ltered and permuted version of the original unknon s i. For clarity of exposition, ill often refer to \the" solution and refer to the y i as \the" recovered sources, rather than refering to an point inthe manifold of solutions and a set of consistent recovered sources. 2 Maximum likelihood density estimation Folloing Pham, Garrat, and Jutten (992) and Belouchrani and Cardoso (995), e cast the BSS problem as one of maximum likelihood density estimation. In the MLE frameork, one begins ith a probabilistic model of the data production process. This probabilistic model is parametrized by a vector of modiable parameters, and it therefore assigns a -dependent probability density p(x x ::: ) toa each possible dataset x x :::. The task is then to nd a hich maximizes this probability. There are a number of approaches to performing this maximization. Here e apply Without these assumptions, for instance in the presence of noise, even a linear mixing process leads to an optimal unmixing process that is highly nonlinear.

3 the stochastic gradient method, in hich a single stochastic sample x is chosen from the dataset and ;dlogp(x P )=d is used as a stochastic estimate of the gradient of the negative likelihood t ;dlogp(x(t) )=d. 2. The likelihood of the data The model of data production e consider is shon in gure a. In that model, the sensor readings x are an explicit linear function of the underlying sources s. In this model of the data production, there are to stages. In the rst stage, the sources independently produce signals. These signals are time-dependent, and the probability density of source i producing value s (t) at time t is f (s (t)s (t ; ) s (t ; 2) :::). Although this source model could be of almost any dierentiable form, e used a generalized autoregressive model described in appendix A. For expository purposes, e can consider using a simple AR model, so e models (t) = b ()s (t ; ) + b (2)s (t ; 2) + + b (T )s (t ; T )+r, here r is an iid random variable, perhaps ith a complicated density. It is important to distinguish to dierent, although related, linear lters. When the source models are simple AR models, there are to types of linear convolutions being performed. The rst is in the ay each source produces its signal: as a linear function of its recent history plus a hite driving term, hich could be expressed as a moving average model, a convolution ith a hite driving term, s = b r. The second is in the ay the sources are mixed: linear functions of the output of each source are added, x i = P a i s = P (a i b ) r. Thus, ith AR sources, the source convolution could be folded into the convolutions of the linear mixing process. If e ere to estimate values for the free parameters of this model, i.e. to estimate the lters, then the task of recovering the estimated sources from the sensor output ould require inverting the linear A =(a i ), as ell as some technique to guarantee its non-singularity. Such a model is shon in gure a. Instead, e parameterize the model by W = A ;, an estimated unmixing matrix, as shon in gure b. In this indirect representation, s is an explicit linear function of x, and therefore x is only an implicit linear function of s. This parameterization of the model is equally convenient for assigning probabilities to samples x, and is therefore suitable for MLE. Its advantage is that because the transformation from sensors to sources is estimated explicitly, the sources can be recovered directly from the data and the estimated model, ithout invertion. Note that in this inverse parameterization, the estimated mixture process is stored in inverse form. The source-specic models f i are kept in forard form. Each source-specic model i has a vector of parameters, hich e denote (i). We are no in a position to calculate the likelihood of the data. For simplicityeuse a matrix W of real numbers rather than FIR lters. Generalizing this derivation to a matrix of lters is straightforard, folloing the same techniques used bylambert (996), Torkkola (996), A. Bell (997), but space precludes a derivation here. The individual generative source models give p(y(t)y(t ; ) y(t ; 2) :::)= Y i f i (y i (t)y i (t ; ) y i (t ; 2) :::) ()

4 here the probability densities f i are each parameterized by vectors (i). Using these equations, e ould like to express the likelihood of x(t) in closed form, given the history x(t ; ) x(t ; 2) :::. Since the history is knon, e therefore also kno the history of the recovered sources, y(t ; ) y(t ; 2) :::. This means that e can calculate the density p(y(t)y(t ; ) :::). Using this, e can express the density ofx(t) and expand G b =log^p(x ) =logw + P log f (y (t)y (t ; ) y (t ; 2) ::: () ) There are to sorts of parameters hich emust take the derivative ith respect to: the matrix W and the source parameters (). The source parameters do not inuence our recovered sources, and therefore have a simple form d G b = ; df (y )=d d f (y ) Hoever, a change to the matrix W changes y, hichintroduces a fe extra terms. Note that d log W =dw = W ;T, the transpose inverse. Next, since y = Wx, e see that dy =dw = (x) T, a matrix of zeros except for the vector x in ro. No e note that df ()=dw term has to logical components: the rst from the eectofchanging W upon y (t), and the second from the eect of changing W upon y (t ; ) y (t ; 2) :::. (This second is called the \recurrent term", and such terms are frequently dropped for convenience. As shon in gure 3, dropping this term here is not a reasonable approximation.) df (y (t)y (t ; ) ::: ) dw dy (t) (t) dw dy (t ; (t ; ) dw Note that the expression dy(t;) is the only matrix, and it is zero except for the dw th ro, hich isx(t ; ). The =@y (t) e shall denote f (), and (t ; ) e shall denote f () (). We thenhave d b G dw = ;W;T ; f () f () X x(t) T ; = f () () f ()! x(t ; ) T (2) here (expr()) denotes the column vector hose elements are expr() ::: expr(n). 2.2 The natural gradient Folloing Amari, Cichocki, and Yang (996), e follo a pseudogradient. Instead of using equation 2, e post-multiply this quantity by W T W. Since this is a positivedenite matrix, it does not aect the stochastic gradient convergence criteria, and the resulting quantity simplies in a fashion that neatly eliminates the costly matrix inversion otherise required. Convergence is also accelerated. 3 Experiments We conducted a number of experiments to test the ecacy of the cica algorithm. The rst, shon in gure 2, as a toy problem involving a set of processed deliberately constructed to be dicult for conventional source separation algorithms. In the second experiment, shon in gure 3, ten real sources ere digitally mixed ith an instantaneous matrix and separation performance as measured as a funciton of varying model complexity parameters. These sources have are available for benchmarking purposes in

5 5 Filtered and Mixed Uniform Distributions Recovered Source Values Estimated Density in Unximed Source Space estimated density source source Figure 2: cica using a history of one time step and a mixture of ve logistic densities for each source as applied to 5, samples of a mixture of to one-dimensional uniform distributions each ltered by convolution ith a decaying exponential of time constant of99.5. Shon is a scatterplot of the data input to the algorithm, along ith the true source axes (left), the estimated residual probability density (center), and a scatterplot of the residuals of the data transformed into the estimated source space coordinates (right). The product of the true mixing matrix and the estimated unmixing matrix deviates from a scaling and permutation matrix by about 3%. Truncated Gradient Full Gradient Noise Model S/N Ratio S/N Ratio S/N Ratio number of AR filter taps number of AR filter taps 2 number of logistics Figure 3: The performance of cica as a function of model complexity and gradient accuracy. In all simulations, ten ve-second clips taken digitally from ten audio CD ere digitally mixed through a random ten-by-ten instantanious mixing matrix. The signal to noise ratio of each original source as expressed in the recovered sources is plotted. In (a) and (b), AR source models ith a logistic noise term ere used, and the number of taps of the AR model as varied. (This reduces to Bell-Senoski infomax hen the number of taps is zero.) Is (a), the recurrent term of the gradient as left out, hile in (b) the recurrent term as included. Clearly the recurrent term is important. In (c), a degenerate AR model ith zero taps as used, but the noise term as a mixture of logistics, and the number of logistics as varied. 4 Discussion The Infomax algorithm (Baram and Roth 994) used for source separation (Bell and Senoski 995) is a special case of the above algorithm in hich (a) the mixing is not convolutional, so W () = W (2) = ::: =, and (b) the sources are assumed to be iid, and therefore the distributions f i (y(t)) are not history sensitive. Further, the form of the f i is restricted to avery special distribution: the logistic density,

6 the derivative of the sigmoidal function =( + exp ;). Although ICA has enoyed avariety of applications (Makeig et al. 996 Bell and Senoski 996b Baram and Roth 995 Bell and Senoski 996a), there are a number of sources hich it cannot separate. These include all sources ith Gaussian histograms (e.g. colored gaussian sources, or even speech to run through the right sort of slight nonlinearity), and sources ith lo kurtosis. As shon in the experiments above, these are of more than theoretical interest. If e simplify our model to use ordinary AR models for the sources, ith gaussian noise terms of xed variance, it is possible to derive a closed-form expression for W (Hagai Attias, personal communication). It may be that for many sources of practical interest, trading aay this model accuracy for speed ill be fruitful. 4. Weakened assumptions It seems clear that, in general, separating hen there are feer microphones than sources requires a strong bayesian prior, and even given perfect knoledge of the mixture process and perfect source models, inverting the mixing process ill be computationally burdensome. Hoever, hen there are more microphones than sources, there is an opportunity to improve the performance of the system in the presence of noise. This seems straightforard to integrate into our frameork. Similarly, fast-timescale microphone nonlinearities are easily incorporated into this maximum likelihood approach. The structure of this problem ould seem to lend itself to EM. Certainly the individual source models can be easily optimized using EM, assuming that they themselves are of suitable form. References A. Bell, T.-W. L. (997). Blind separation of delayed and convolved sources. In Advances in Neural Information Processing Systems 9. MIT Press. In this volume. Amari, S., Cichocki, A., and Yang, H. H. (996). A ne learning algorithm for blind signal separation. In Advances in Neural Information Processing Systems 8. MIT Press. Baram, Y. and Roth, Z. (994). Density Shaping by Neural Netorks ith Application to Classication, Estimation and Forecasting. Tech. rep. CIS-94-2, Center for Intelligent Systems, Technion, Israel Institute for Technology, Haifa. Baram, Y. and Roth, Z. (995). Forecasting by Density Shaping Using Neural Netorks. In Computational Intelligence for Financial Engineering Ne York City. IEEE Press. Bell, A. J. and Senoski, T. J. (995). An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation, 7 (6), 29{59. Bell, A. J. and Senoski, T. J. (996a). The Independent Components of Natural Scenes. Vision Research. Submitted.

7 Bell, A. J. and Senoski, T. J. (996b). Learning the higher-order structure of a natural sound. Netork: Computation in Neural Systems. In press. Belouchrani, A. and Cardoso, J.-F. (995). Maximum likelihood source separation by the expectation-maximization technique: Deterministic and stochastic implementation. In Proceedings of 995 International Symposium on Non-Linear Theory and Applications, pp. 49{53 Las Vegas, NV. In press. Lambert, R. H. (996). Multichannel Blind Deconvolution: FIR Matrix Algebra and Separation of Multipath Mixtures. Ph.D. thesis, USC. Makeig, S., Anllo-Vento, L., Jung, T.-P., Bell, A. J., Senoski, T. J., and Hillyard, S. A. (996). Independent component analysis of event-related potentials during selective attention. Society for Neuroscience Abstracts, 22. Pearlmutter, B. A. and Parra, L. C. (996). A Context-Sensitive Generalization of ICA. In International Conference on Neural Information Processing Hong Kong. Springer-Verlag. Url ftp://ftp.cnl.salk.edu/pub/bap/iconip-96- cica.ps.gz. Pham, D., Garrat, P., and Jutten, C. (992). Separation of a mixture of independent sources through a maximum likelihood approach. In European Signal Processing Conference, pp. 77{774. Torkkola, K. (996). Blind separation of convolved sources based on information maximization. In Neural Netorks for Signal Processing VI Kyoto, Japan. IEEE Press. In press. A Fixed mixture AR models The f (u )e used ere a mixture AR processes driven by logistic noise terms, as in Pearlmutter and Parra (996). Each source model as f (u (t)u (t ; ) u (t ; 2) ::: )= X k m k h((u (t) ; u k )= k )= k (3) here k is a scale parameter for logistic density k of source andisanelement Pof, and the mixing coecients m k are elements of and are constrained by m k k =. The component means u k are taken to be linear functions of the recent values of that source, u k = X = a k () u (t ; )+b k (4) here the linear prediction coecients a k () and bias b k are elements of. The derivatives of these are straightforard see Pearlmutter and Parra (996) for details. One complication is to note that, after P each eight update, the mixing coecients must be normalized, m k m k = m k k :

Maximum Likelihood Blind Source Separation: A Context-Sensitive Generalization of ICA

Maximum Likelihood Blind Source Separation: A Context-Sensitive Generalization of ICA Barak A. Pearlmutter Computer Science Dept, FEC 313 University of New Mexico Albuquerque, NM 87131 bap@cs.unm.edu Lucas