Maximal Causes for Non-linear Component Extraction


Journal of Machine Learning Research 9 (2008). Submitted 5/07; Revised 11/07; Published 6/08.

Maximal Causes for Non-linear Component Extraction

Jörg Lücke (LUCKE@GATSBY.UCL.AC.UK)
Maneesh Sahani (MANEESH@GATSBY.UCL.AC.UK)
Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, UK

Editor: Yoshua Bengio

Abstract

We study a generative model in which hidden causes combine competitively to produce observations. Multiple active causes combine to determine the value of an observed variable through a max function, in the place where algorithms such as sparse coding, independent component analysis, or non-negative matrix factorization would use a sum. This max rule can represent a more realistic model of non-linear interaction between basic components in many settings, including acoustic and image data. While exact maximum-likelihood learning of the parameters of this model proves to be intractable, we show that efficient approximations to expectation-maximization (EM) can be found in the case of sparsely active hidden causes. One of these approximations can be formulated as a neural network model with a generalized softmax activation function and Hebbian learning. Thus, we show that learning in recent softmax-like neural networks may be interpreted as approximate maximization of a data likelihood. We use the bars benchmark test to numerically verify our analytical results and to demonstrate the competitiveness of the resulting algorithms. Finally, we show results of learning model parameters to fit acoustic and visual data sets in which max-like component combinations arise naturally.

Keywords: component extraction, maximum likelihood, approximate EM, competitive learning, neural networks

1. Introduction

In recent years, algorithms such as independent components analysis (ICA; Comon, 1994; Bell and Sejnowski, 1997), sparse coding (SC; Olshausen and Field, 1996), and non-negative matrix factorization (NMF; Lee and Seung, 1999) have been used to describe the statistics of the natural environment, and the components extracted by these methods have been linked to sensory neuronal response properties. Stated in the language of probabilistic generative models (see, e.g., Dayan and Abbott, 2001; Rao et al., 2002), these systems describe sensory data as a linear superposition of learned components. For many types of data, including images, this assumed linear cooperation between generative causes is unrealistic. Alternative, more competitive generative models have also been proposed: for instance, Saund (1995) suggests a model in which hidden causes are combined by a noisy-or rule, while Dayan and Zemel (1995) suggest a yet more competitive scheme. Here, we formulate an extreme case of competition, in which the strongest generative influence on an observed variable (e.g., an image pixel) alone determines its value. Such a rule has the property of selecting, for each observed variable, a single generative cause to determine that variable's value.

This form of combination emerges naturally in the context of spectrotemporal masking in mixed audio signals. For image data, occlusion leads to a different combination rule, but one that shares the selection property in that, under constant lighting conditions, the appearance of each observed pixel is determined by a single object.

In parallel to this development of generative approaches, a number of artificial neural network architectures have been designed to tackle the problem of non-linear component extraction, mostly in artificial data (e.g., Spratling and Johnson, 2002; Lücke and von der Malsburg, 2004; Lücke and Bouecke, 2005; Spratling, 2006), although sometimes in natural images (e.g., Harpur and Prager, 1996; Charles et al., 2002; Lücke, 2007). These models often perform quite well with respect to various benchmark tests. However, the relationship between them and the density models that are implicit or explicit in the generative approach has not, thus far, been made clear. We show here that inference and learning in a restricted form of our novel generative model correspond closely in form to the processing and plasticity rules used in such neural network approaches, thus bringing together these two disparate threads of investigation.

The organization of the remainder of this article is as follows. In Section 2 we define the novel generative model and then proceed to obtain the associated parameter update rules in Section 3. In Section 4 we derive computationally efficient approximations to these update rules, in the context of sparsely active hidden causes, that is, when a small number of hidden causes generally suffices to explain the data. In Section 5 we relate a restricted form of the generative model to neural network learning rules with Hebbian plasticity and divisive normalization. Results of numerical experiments in Section 6 show the component extraction performance of the generative schemes as well as a comparison to other algorithms. Finally, in Section 7, we discuss our analytical and numerical results.

2. A Generative Model with Maximum Non-linearity

We consider a generative model for D observed variables y_d (d = 1,...,D), in which H hidden binary causes s_h (h = 1,...,H), each taking the value 0 or 1, compete to determine the value of each observation (see Figure 1). Associated with each pair (s_h, y_d) is a weight W_hd. Given a set of active causes (i.e., those taking the value 1), the distribution of y_d is determined by the largest of the weights associated with the active causes and y_d. Much of our discussion will apply generally to all models of this causal structure, irrespective of the details of the distributions involved. For concreteness, however, we focus on a particular choice, in which the hidden variables are drawn from a multivariate Bernoulli distribution, and the observed variables are non-negative, integer-valued and, given the causes, conditionally independent and Poisson-distributed. Thus, collecting all the causes into a single binary vector s in {0,1}^H, and all the observed variables into an integer vector y in Z_+^D, we have:

p(s | π) = prod_{h=1}^H p(s_h | π_h),   p(s_h | π_h) = π_h^{s_h} (1 − π_h)^{1 − s_h},   (1)

p(y | s, W) = prod_{d=1}^D p(y_d | W̄_d(s, W)),   p(y_d | w) = (w^{y_d} / y_d!) e^{−w}.   (2)

Here, π in [0,1]^H parameterizes the prior distribution on s, while the weight matrix W in R^{H×D} parameterizes the influence of the hidden causes on the distribution of y.

It will be convenient to group these parameters together into Θ = (π, W).

Figure 1: A generative model with H = 3 hidden variables and D = 5 observed variables. The values y_d of the observed variables are conditionally independent given the values s_h of the hidden variables. The value y_d is drawn from a distribution which is determined by the parameters W_1d, W_2d, and W_3d. For a given binary vector s these parameters combine competitively according to the function W̄_d(s, W) = max_h {s_h W_hd}.

The function W̄_d(s, W) in (2) gives the effective weight on y_d, resulting from a particular pattern of causes s. Thus, in the model considered here,

W̄_d(s, W) = max_h {s_h W_hd}.   (3)

It is useful to place the model (1)-(3) in context. Models of this general type, in which the observations are conditionally independent of one another given a set of hidden causes, are widespread. They underlie algorithms such as ICA, SC, principal components analysis (PCA), factor analysis (see, e.g., Everitt, 1984), and NMF. In these five cases, and indeed in the majority of such models studied, the effective weights W̄_d(s, W) are formed by a linear combination of all the weights that link hidden variables to the observation; that is, W̄_d(s, W) = sum_h s_h W_hd. Some other models, notably those of Saund (1995) and Dayan and Zemel (1995), have implemented more competitive combination rules, where larger individual weights dominate the effective combination. The present model takes this competition to an extreme, so that only the single largest weight (amongst those associated with active hidden variables) determines the output distribution. Thus, where ICA, PCA, SC, or NMF use a sum, we use a max. We refer to this new generative model as the Maximal Causes Analysis (MCA) model.

Figure 2 illustrates the difference between linear superposition and competitive combination using (3). Let us suppose that noise-free observations are generated by causes in the form of horizontal and vertical objects with the same gray-value, on a dark (black) background (see Figure 2). If these objects occlude one another, they may generate an observed image such as that illustrated in Figure 2B. However, if we were to use the actual causes and weights in Figure 2A, but instead combine them linearly, we would obtain the (different) input pattern of Figure 2C. In this case, competitive combination using the max-rule of Equation (3) would result in the correct pattern. This is not, of course, generally true, but for monochrome objects with small variations in their gray-values it holds approximately.
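To make the model concrete, the following is a minimal sketch (not taken from the paper; function and variable names are illustrative) of how observations could be drawn from Equations (1)-(3) with numpy:

# Illustrative sketch: sampling from the MCA generative model, Eqs. (1)-(3).
import numpy as np

def sample_mca(pi, W, N, rng=None):
    """Draw N observations.
    pi : (H,) Bernoulli activation probabilities (Equation 1).
    W  : (H, D) non-negative weight matrix.
    Returns S (N, H) binary causes and Y (N, D) Poisson observations."""
    rng = np.random.default_rng() if rng is None else rng
    S = (rng.random((N, W.shape[0])) < pi).astype(int)       # hidden causes, Eq. (1)
    # Effective rates W_bar_d(s, W) = max_h { s_h * W_hd }, Eq. (3)
    rates = np.max(S[:, :, None] * W[None, :, :], axis=1)    # (N, D)
    Y = rng.poisson(rates)                                    # Poisson noise, Eq. (2)
    return S, Y

# Example with H = 3 causes and D = 5 observed variables (hypothetical values)
pi = np.array([0.3, 0.3, 0.3])
W = np.array([[10., 10.,  0.,  0.,  0.],
              [ 0., 10., 10.,  0.,  0.],
              [ 0.,  0.,  0., 10., 10.]])
S, Y = sample_mca(pi, W, N=5)
print(Y)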

Figure 2: An illustration of non-linear versus linear combination of hidden causes. A Four examples of hidden causes with gray-value 200. B The input image that may result if sources occlude one another. In this case, the correct function W̄_d(s, W) (see Figure 1) to combine the hidden causes is the max-operation. C The input image that results if the four causes combine linearly (gray-values are scaled to fill the interval [0,255]). For C, the correct function W̄_d(s, W) is linear superposition.

More generally, the maximum combination rule is always closer to the result of occlusion than is the simple sum implied by models such as ICA. As stated above, although in this paper we focus on the specific distributions given in (1) and (2), much of the analytical treatment is independent of these specific choices. Thus, update rules for learning the weights W from data will be derived in a general form that can accommodate alternative, non-factored distributions for the binary hidden variables. This general form is also preserved if the Poisson distribution is replaced, for example, by a Gaussian. Poisson variability represents a reasonable choice for the non-negative data considered in this paper, and resembles the cost function introduced by Lee and Seung (1999) for NMF.

3. Maximum Likelihood

Given a set of observed data vectors Y = {y^(n)}_{n=1,...,N}, taken to be generated independently from a stationary process, we seek parameter values Θ* = (π*, W*) that maximize the likelihood of the data under the generative model of Equations (1) to (3):

Θ* = argmax_Θ {L(Θ)}   with   L(Θ) = log p(y^(1),..., y^(N) | Θ).

We use Expectation-Maximization (EM; Dempster et al., 1977; see also Neal and Hinton, 1998, for the formulation that appears here) to maximize the likelihood in this latent variable model. To do so, we introduce the free-energy F(Θ, q), a data-dependent function of the parameters Θ and an unknown distribution q(s^(1),..., s^(N)) over the hidden data or variables, that is always equal to or less than the likelihood evaluated at the same parameter values. For independently generated data vectors y^(n), the distribution q may be taken (without loss of generality) to factor over the hidden vectors, q(s^(1),..., s^(N)) = prod_n q_n(s^(n)).

Then the free-energy is defined as:

F(Θ, q) = sum_{n=1}^N [ sum_s q_n(s) [ log p(y^(n) | s, Θ) + log p(s | Θ) ] ] + H(q)  ≤  L(Θ),   (4)

where H(q) = sum_n H(q_n(s)) = − sum_n sum_s q_n(s) log q_n(s) is the Shannon entropy of q. The iterations of EM alternately increase F with respect to the distributions q_n while holding Θ fixed (the E-step), and with respect to Θ while holding the q_n fixed (the M-step). Thus, if we consider a pair of steps beginning from parameters Θ', the E-step first finds new distributions q_n that depend on Θ' and the observations y^(n), which we write as q_n(s; Θ'). Ideally, these distributions maximize F for fixed Θ', in which case it can be shown that q_n(s; Θ') = p(s | y^(n), Θ') and F(Θ', q_n(s; Θ')) = L(Θ') (Neal and Hinton, 1998). In practice, computation of this exact posterior may be intractable, and it is often replaced by an approximation.

After choosing the q_n's in the E-step, we maximize F with respect to Θ in the M-step while holding the q_n distributions fixed. Thus the free-energy can be re-written in terms of Θ and Θ':

F(Θ, Θ') = sum_{n=1}^N [ sum_s q_n(s; Θ') [ log p(y^(n) | s, Θ) + log p(s | Θ) ] ] + H(Θ'),   (5)

where H(Θ') = sum_n H(q_n(s; Θ')). A necessary condition to achieve this maximum with respect to W_id in Θ is that (see Appendix A for details):

(∂/∂W_id) F(Θ, Θ') = sum_n sum_s q_n(s; Θ') ( y_d^(n) / W̄_d(s, W) − 1 ) (∂/∂W_id) W̄_d(s, W) = 0.   (6)

Unfortunately, under the max-combination rule of Equation (3), W̄_d is not differentiable. Instead, we define a smooth function W̄_d^ρ that converges to W̄_d as ρ approaches infinity:

W̄_d^ρ(s, W) := ( sum_{h=1}^H (s_h W_hd)^ρ )^{1/ρ},   lim_{ρ→∞} W̄_d^ρ(s, W) = W̄_d(s, W),   (7)

and replace the derivative of W̄_d by the limiting value of the derivative of W̄_d^ρ, which we write as A_id (see Appendix A for details):

A_id(s, W) := lim_{ρ→∞} (∂/∂W_id) W̄_d^ρ(s, W) = lim_{ρ→∞} s_i (W_id)^ρ / sum_h s_h (W_hd)^ρ.   (8)

Armed with this definition, a rearrangement of the terms in (6) yields (see Appendix A):

W_id = ( sum_n <A_id(s, W)>_{q_n} y_d^(n) ) / ( sum_n <A_id(s, W)>_{q_n} ),   (9)

where <A_id(s, W)>_{q_n} is the expectation of A_id(s, W) under the distribution q_n(s; Θ'):

<A_id(s, W)>_{q_n} = sum_s q_n(s; Θ') A_id(s, W).   (10)
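As an illustration of how Equations (8)-(10) can be turned into one EM iteration for small models, the following sketch (an assumed implementation, not the authors' code) evaluates A_id directly from the limit in (8), forms the exact posterior over all 2^H hidden states, and applies the fixed-point update (9):

# Illustrative sketch (small H only): exact E-step and fixed-point M-step.
import itertools
import numpy as np
from scipy.special import gammaln

def A(s, W):
    """A_id(s, W) of Eq. (8): 1 where W_id is the maximum among the active
    causes' weights for dimension d (shared equally in case of ties)."""
    active = s[:, None] * W                          # s_h * W_hd, shape (H, D)
    Wbar = active.max(axis=0)                        # Eq. (3)
    winners = (active == Wbar) & (s[:, None] > 0) & (Wbar > 0)
    counts = np.maximum(winners.sum(axis=0), 1)
    return winners / counts                          # (H, D); all zeros if s = 0

def log_joint(s, y, pi, W):
    """log p(s, y | Theta) under Equations (1)-(3)."""
    Wbar = (s[:, None] * W).max(axis=0)
    lp_s = np.sum(s * np.log(pi) + (1 - s) * np.log(1 - pi))
    with np.errstate(divide='ignore', invalid='ignore'):
        lp_y = np.sum(np.where(y > 0, y * np.log(Wbar), 0.0) - Wbar - gammaln(y + 1))
    return lp_s + lp_y

def m_step_W(Y, pi, W_old):
    """One fixed-point update of Eq. (9), with <A_id>_{q_n} from Eq. (10)."""
    H, D = W_old.shape
    states = [np.array(s) for s in itertools.product([0, 1], repeat=H)]
    num = np.zeros((H, D)); den = np.zeros((H, D))
    for y in Y:
        lj = np.array([log_joint(s, y, pi, W_old) for s in states])
        q = np.exp(lj - lj.max()); q /= q.sum()      # exact posterior, Eq. (12)
        EA = sum(qs * A(s, W_old) for qs, s in zip(q, states))
        num += EA * y[None, :]
        den += EA
    return num / np.maximum(den, 1e-12)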

Equation (9) represents a set of non-linear equations (one for each W_id) that defines the necessary conditions for an optimum of F with respect to W. The equations do not represent straightforward update rules for W_id because the right-hand-side does not depend only on the old values W' in Θ'. They can, however, be used as fixed-point iteration equations, by simply evaluating the derivatives A_id at W' instead of W. Although there is no guarantee that these iterations converge, if they do converge the corresponding parameters must lie at a stationary point of the free-energy. Numerical experiments described later confirm that this fixed-point approach is, in fact, robust and convergent. Note that the denominator in (9) vanishes only if q_n(s; Θ') A_id(s, W) = 0 for all s and n (assuming positive weights), in which case (6) is already satisfied, and no update of W_id is required.

Thus far, we have not made explicit reference to the form of prior source distribution, and so the result of Equation (9) is independent of this choice. For our chosen Bernoulli distribution (1), the M-step is obtained by setting the derivative of F with respect to π_i to zero, giving (after rearrangement):

π_i = (1/N) sum_n <s_i>_{q_n},   with   <s_i>_{q_n} = sum_s q_n(s; Θ') s_i.   (11)

Parameter values that satisfy Equations (9) and (11) maximize the free-energy given the distributions q_n = q_n(s; Θ'). As stated before, the optimum with respect to q (and therefore, exact optimization of the likelihood, since the optimal setting of q forces the free-energy bound to be tight) is obtained by setting the q_n to the posterior distributions:

q_n(s; Θ') = p(s | y^(n), Θ') = p(s, y^(n) | Θ') / sum_{s'} p(s', y^(n) | Θ'),   (12)

where p(s, y^(n) | Θ') = p(s | π') p(y^(n) | s, W'), with the latter distributions given by (1) and (2), respectively. Equations (9) to (12) thus represent a complete set of update rules for maximizing the data likelihood under the generative model. The only approximation made to this point is to use the old values W' on the right-hand-side of the M-step equation in (9). We therefore refer to this set of updates as a pseudo-exact learning rule and call the algorithm they define MCA_ex, with the subscript for exact. We will see in numerical experiments that MCA_ex does indeed maximize the likelihood.

4. E-Step Approximations

The computational cost of finding the exact sufficient statistics <A_id(s, W)>_{q_n}, with q_n equal to the posterior probability (12), is intractable in general. It grows exponentially in the smaller of the number of hidden causes H and the number of observed variables D (see Appendix B for details). A practical learning algorithm, then, must depend on finding a computationally tractable approximation to the true expectation. One approach, a form of variational method (Jordan et al., 1999), would be to optimize the q_n within a constrained class of distributions; for example, distributions that factor over the sources s_h. Unfortunately, this conventional factoring approach provides limited benefit here, as the form of A_id(s, W) resists straightforward evaluation of the expected value with respect to the individual sources. Instead, we base our approximations on an assumption of sparsity: that only a small number of active hidden sources is needed to explain any one observed data vector (note that sparsity here refers to the number of active hidden sources, rather than to their proportion).

The resulting expressions relate to those that would be found by a variational optimization constrained to distributions that are sparse in the sense above, but are not identical. The relationship will be explored further in the Discussion.

To develop the sparse approximations, consider grouping the terms in the expected value of Equation (10) according to the number of active sources in the vector s:

<A_id(s, W)>_{q_n} = sum_s p(s | y^(n), Θ') A_id(s, W)   (13)
= sum_a p(s_a | y^(n), Θ') A_id(s_a, W) + sum_{a,b: a<b} p(s_ab | y^(n), Θ') A_id(s_ab, W) + sum_{a,b,c: a<b<c} p(s_abc | y^(n), Θ') A_id(s_abc, W) + ...,

where s_a := (0,...,0,1,0,...,0) with only s_a = 1, s_ab := (0,...,0,1,0,...,0,1,0,...,0) with only s_a = 1 and s_b = 1 (a ≠ b), and s_abc etc. are defined analogously. Note that A_id(0, W) = 0 because of (7) and (8). Now, each of the conditional probabilities p(s | y^(n), Θ') implicitly contains a similar sum over s for normalization:

p(s | y^(n), Θ') = (1/Z) p(s, y^(n) | Θ'),   Z := sum_s p(s, y^(n) | Θ'),   (14)

and the terms of this sum may be grouped in the same way:

Z = p(0, y^(n) | Θ') + sum_a p(s_a, y^(n) | Θ') + sum_{a,b: a<b} p(s_ab, y^(n) | Θ') + sum_{a,b,c: a<b<c} p(s_abc, y^(n) | Θ') + ....

Combining (13) and (14) yields:

<A_id(s, W)>_{q_n} = [ sum_a p(s_a, y^(n) | Θ') A_id(s_a, W) + sum_{a,b: a<b} p(s_ab, y^(n) | Θ') A_id(s_ab, W) + ... ] / [ p(0, y^(n) | Θ') + sum_a p(s_a, y^(n) | Θ') + sum_{a,b: a<b} p(s_ab, y^(n) | Θ') + ... ].   (15)

A similar grouping of terms is possible for the expectation <s_h>_{q_n}. If we now assume that the significant posterior probability mass will concentrate on vectors s with only a limited number of non-zero entries, the expanded sums in both numerator and denominator of (15) may be truncated without significant loss. The accuracy of the approximation depends both on the sparsity of the true generative process and on the distance of the current model parameters (in the current EM iteration) from the true ones. In general, provided that the true process is indeed sparse, a truncated approximation will become more accurate as the estimated parameters approach their maximum likelihood values. The convergence properties and accuracy of algorithms based on this form of approximation will be tested numerically in Section 6. Different choices of the truncation yield approximate algorithms with different properties. Two of these will be considered here.
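A sketch of the truncated expectation (15) (again an assumed implementation, reusing A() and log_joint() from the sketch above): only hidden vectors with at most three active sources are enumerated, so the cost grows polynomially rather than exponentially in H:

# Illustrative sketch: truncated sums of Eq. (15) over sparse hidden vectors.
import itertools
import numpy as np

def sparse_states(H, max_active=3):
    """All binary vectors s with 1 <= sum(s) <= max_active (the zero vector
    contributes nothing to the numerator since A_id(0, W) = 0)."""
    states = []
    for k in range(1, max_active + 1):
        for idx in itertools.combinations(range(H), k):
            s = np.zeros(H, dtype=int); s[list(idx)] = 1
            states.append(s)
    return states

def truncated_EA(y, pi, W, max_active=3):
    """Approximate <A_id(s, W)>_{q_n} by truncating both sums in Eq. (15)."""
    H = W.shape[0]
    s0 = np.zeros(H, dtype=int)
    states = sparse_states(H, max_active)
    lj = np.array([log_joint(s, y, pi, W) for s in [s0] + states])
    p = np.exp(lj - lj.max())                     # unnormalized joints p(s, y | Theta)
    num = sum(ps * A(s, W) for ps, s in zip(p[1:], states))
    return num / p.sum()                          # denominator includes p(0, y | Theta)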

4.1 MCA_3

In the first approximation, we truncate all but one of the sums that appear in the expansions of <A_id(s, W)>_{q_n} and <s_i>_{q_n} after the terms that include three active sources, while truncating the numerator of <A_id(s, W)>_{q_n} after the two-source terms (see Appendix C for details):

<A_id(s, W)>_{q_n} ≈ [ π̃_i exp(I_i^(n)) + π̃_i sum_{c (c≠i)} π̃_c exp(I_ic^(n)) H(W_id − W_cd) ] / [ 1 + sum_h π̃_h exp(I_h^(n)) + sum_{a,b: a<b} π̃_a π̃_b exp(I_ab^(n)) + sum_{a,b,c: a<b<c} π̃_a π̃_b π̃_c exp(I_abc^(n)) ]   (16)

and

<s_i>_{q_n} ≈ [ π̃_i exp(I_i^(n)) + π̃_i sum_{c (c≠i)} π̃_c exp(I_ic^(n)) + (α/2) π̃_i sum_{b,c (b≠c, b,c≠i)} π̃_b π̃_c exp(I_ibc^(n)) ] / [ 1 + sum_h π̃_h exp(I_h^(n)) + sum_{a,b: a<b} π̃_a π̃_b exp(I_ab^(n)) + sum_{a,b,c: a<b<c} π̃_a π̃_b π̃_c exp(I_abc^(n)) ],   (17)

where

π̃_i = π_i / (1 − π_i),   I_i^(n) = sum_d ( log(W_id) y_d^(n) − W_id ),
W̄_d^ab = max(W_ad, W_bd),   I_ab^(n) = sum_d ( log(W̄_d^ab) y_d^(n) − W̄_d^ab ),
W̄_d^abc = max(W_ad, W_bd, W_cd),   I_abc^(n) = sum_d ( log(W̄_d^abc) y_d^(n) − W̄_d^abc ),   (18)

and where H(x) = 1 for x > 0, 1/2 for x = 0, and 0 for x < 0 is the Heaviside function. The above equations have been simplified by dividing both numerator and denominator by terms that do not depend on s, for example by prod_{i=1}^H (1 − π_i) (see Appendix C). Approximations (16) and (17) are used in the fixed-point updates of Equations (9) and (11), where the parameters that appear on the right-hand-side are held at their current values. Thus all parameters that appear on the right-hand-side of the approximations take values in Θ' = (π', W').

The early truncation of the numerator in (16) improves performance in experiments, partly by increasing competition between causes further, and partly by reducing the contribution of more complex data patterns that are better fit, given the current parameter settings, by three active sources than by two. By contrast, the three-source terms are kept in the numerator of (17). In this case, neglecting complex input patterns as in (16) would lead to greater errors in the estimated source activation probabilities π_i. Indeed, even while keeping these terms, the π_i tend to be underestimated if the input data include many patterns with more than three active sources. To compensate, we introduce a factor of α > 1 multiplying the three-source term in (17) (so that α = 1 corresponds to the actual truncated sum), which is updated as described in Appendix C. This scheme yields good estimates of π_i, even if more than three sources are often active in the input data.

The M-step Equations (9) and (11) together with E-step approximations (16) and (17) represent a complete set of update equations for the MCA generative model. The computational cost of one parameter update grows polynomially in the total number of causes, with order H^3. The algorithm that is defined by these updates will therefore be referred to as MCA_3.
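The quantities in Equation (18) are the only data-dependent terms needed by (16) and (17); the following small sketch (assumed code, not from the paper) computes the one- and two-source scores for a single observation:

# Illustrative sketch: the scores I_i and I_ab of Equation (18).
import numpy as np

def poisson_score(y, rates):
    """sum_d ( log(rate_d) * y_d - rate_d ), with 0 * log(0) treated as 0."""
    with np.errstate(divide='ignore', invalid='ignore'):
        term = np.where(y > 0, np.log(rates) * y, 0.0)
    return (term - rates).sum()

def scores(y, W):
    """Return I_i (shape H) and I_ab (shape H x H, -inf on the diagonal)."""
    H = W.shape[0]
    I1 = np.array([poisson_score(y, W[i]) for i in range(H)])
    I2 = np.full((H, H), -np.inf)
    for a in range(H):
        for b in range(a + 1, H):
            I2[a, b] = I2[b, a] = poisson_score(y, np.maximum(W[a], W[b]))  # bar W_d^ab
    return I1, I2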

4.2 R-MCA_2

In the second place, we consider a restriction of the generative model in which (i) all s_h are distributed according to the same prior distribution with fixed parameter π; (ii) the weights W_id associated with each source variable i are constrained to sum to a constant C:

for all i in {1,...,H}:   π_i = π   and   sum_d W_id = C;   (19)

and (iii) on average, the influence of each hidden source is homogeneously covered by the other sources. This third restriction means that each non-zero generating weight W_id^gen associated with cause i can be covered by the same number of weights W_cd^gen ≥ W_id^gen:

W_id^gen > 0  ⇒  sum_{c≠i} H(W_cd^gen − W_id^gen) = b_i,   (20)

where H is the Heaviside function and b_i is the number of causes that can cover cause i. Figure 3 illustrates this condition.

Figure 3: A and B show patterns of weights that satisfy the uniformity condition (20) whereas weights in C violate it. Each hidden cause is symbolized by an ellipse, with the gray-level of the ellipse representing the value W_id of each weight within the ellipse. Weights outside the ellipse for each cause are zero (black). The black squares indicate the 4-by-4 grid of observed pixels.

Figure 3A,B show weight patterns associated with hidden causes for which the condition is fulfilled; for instance, in Figure 3B, b_i = 0 for all causes with horizontal weight patterns, while b_i = 1 for the vertically oriented cause. In Figure 3C the condition is violated. Roughly, these conditions guarantee that all hidden causes have equal average effects on the generated data vectors. They make the development of a more efficient approximate learning algorithm possible but, despite their role in the derivation, the impact of these assumptions is limited in practice, in the sense that the resulting algorithm can perform well even when the input data set violates assumptions (19) and (20). This is demonstrated in a series of numerical experiments detailed below.

Update rules for the restricted generative model can again be derived by approximate expectation-maximization (see Appendix C). Using both the sum constraint of (19) and the assumption of homogeneous coverage of causes, we obtain the M-step update:

W_id = C ( sum_n <A_id(s, W)>_{q_n} y_d^(n) ) / ( sum_{d'} sum_n <A_id'(s, W)>_{q_n} y_d'^(n) ).   (21)
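The sum-constrained M-step (21) is straightforward once responsibilities are available; a minimal sketch (assumed code) is given below, where EA[n, i] stands for <A_id(s, W)>_{q_n}, which in the restricted model is the same for every d and could be computed, for example, with truncated_EA() above or with the closed-form approximation derived below:

# Illustrative sketch: the sum-constrained fixed-point update of Eq. (21).
import numpy as np

def r_mca2_m_step(Y, EA, C):
    """Y: (N, D) data, EA: (N, H) responsibilities, C: row-sum constraint."""
    num = EA.T @ Y                                    # sum_n <A_id>_{q_n} y_d^(n), shape (H, D)
    return C * num / num.sum(axis=1, keepdims=True)   # each row of W again sums to C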

Empirically, we find that the restricted parameter space of this model means that we can approximate the sufficient statistics <A_id(s, W)>_{q_n} by a more severe truncation than before, now keeping two-source terms in the denominator, but only single-source terms in the numerator, of the expansion (15). This approximation, combined with the fact that any zero-valued observed patterns (i.e., those with y^(n) = 0) do not affect the update rule (21) and so can be neglected, yields the expression (see Appendix C):

<A_id(s, W)>_{q_n} ≈ exp(I_i^(n)) / ( sum_h exp(I_h^(n)) + (π̃/2) sum_{a,b: a≠b} exp(I_ab^(n)) ),   π̃ := π / (1 − π),   (22)

with abbreviations given in (18). Equations (21) and (22) are update rules for the MCA generative model, subject to the conditions (19) and (20). They define an algorithm that we will refer to as R-MCA_2, with R for restricted and with 2 indicating a computational cost that grows quadratically with H.

5. Relation to Neural Networks

We now relate component extraction as learned within the MCA framework to that achieved by a family of artificial neural networks. Consider the network of Figure 4, which consists of D input variables (or units) with values y_1,...,y_D and H hidden units with values g_1,...,g_H. An observation y is represented by the values (or activities) of the input units, which act through connections parameterized by (W_id) to determine the activities of the hidden units through an activation function g_i = g_i(y, W). These parameters (W_id) are known as the network (or synaptic) weights.

Figure 4: Architecture of a two-layer neural network. Input is represented by values y_1 to y_D of D input units (small black circles). These values combine with synaptic weights W to determine the activities of the hidden units g_1 to g_H (big black circles). The dotted horizontal arrows symbolize lateral information exchange that may be required to compute the functions g_1 to g_H. After the g_i are computed, the parameters (W_id) are modified using a Δ-rule.

Learning in such a neural network involves adjusting the weights W in response to a series of input patterns, using a rule that is heuristically designed to extract some form of structure from these inputs. A standard choice is the Hebbian Δ-rule with divisive normalization:

ΔW_id = ε g_i(y, W) y_d   and   W_id^new = C (W_id + ΔW_id) / sum_{d'} (W_id' + ΔW_id').   (23)

The normalization step is needed to prevent weights from growing without bound, and the divisive form used here is most common. Here, the constant C defines the value at which sum_d W_id is held constant; it will be related below to the C appearing in Equation (19). Many neural networks with the structure depicted in Figure 4, and that use a learning rule identical or similar to (23), have been shown to converge to weight values that identify clusters in, or extract useful components from, a set of input patterns (O'Reilly, 2001; Spratling and Johnson, 2002; Yuille and Geiger, 2003; Lücke and von der Malsburg, 2004; Lücke, 2004; Lücke and Bouecke, 2005; Spratling, 2006). The update rule (23) depends on only one input pattern, and is usually applied online, with the weights being changed in response to each pattern in turn. If, instead, we consider the effect of presenting a group of patterns {y^(n)}, the net change is approximately (see Appendix D):

W_id^new ≈ C ( sum_n g_i(y^(n), W) y_d^(n) ) / ( sum_{d'} sum_n g_i(y^(n), W) y_d'^(n) ).   (24)

Now, comparing (24) to (21), we see that if the activation function of a neural network were chosen so that g_i(y^(n), W) = <A_id(s, W)>_{q_n}, then the network would optimize the parameters of the restricted MCA generative model, with W = W' (we drop the distinction between W and W' from now on). Unfortunately, the expectation <A_id(s, W)>_{q_n} depends on d, and thus exact optimization in the general case would require a modified Hebbian rule. However, the truncated approximation of (22) is the same for all d, and so the changes in each weight depend only on the activities of the corresponding pre- and post-synaptic units. Thus, the Hebbian Δ-rule,

ΔW_id = ε g_i y_d   with   g_i = exp(I_i) / ( sum_h exp(I_h) + (π̃/2) sum_{a,b: a≠b} exp(I_ab) )   (25)

(where I_h, I_ab, and π̃ are the abbreviations introduced in Equations 18 and 22), when combined with divisive normalization, implements an online version of the R-MCA_2 algorithm. We refer to this online weight update rule as R-MCA_NN (for Neural Network).

Note that the function g_i in (25) resembles the softmax function (see, e.g., Yuille and Geiger, 2003), but contains an additional term in the denominator. This added term reduces the change in weights when an input pattern results in more than one hidden unit with significant activity. That is, the system tries to explain a given input pattern using the current state of its model parameters W. If one hidden unit explains the input better than any combination of two units, that unit is modified. If the input is better explained by a combination of two units, the total learning rate is reduced. Soft winner-take-all (WTA) activation functions, such as the softmax, are found in many networks that serve to both cluster and extract components from inputs, as appropriate. For clustering, the relationship between WTA-like competition and maximum-likelihood methods is well known (Nowlan, 1990). The connection drawn here offers a probabilistic account of the effectiveness of similar rules for component identification. If the probability of more than one cause being active is small (i.e., π is small), our activation rule for g_i (25) reduces to the standard softmax, suggesting that neural networks with activation and learning functions that resemble Equations (25) may perform well at both component extraction and clustering.
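The following sketch (assumed code, reusing scores() from the earlier sketch) shows one online R-MCA_NN step: the generalized softmax of Equation (25) followed by the Hebbian Δ-rule and divisive normalization of Equation (23):

# Illustrative sketch: one online R-MCA_NN update for a single pattern y.
import numpy as np

def r_mca_nn_step(y, W, pi, C, eps=0.1):
    """W: (H, D) weights with rows summing to C; pi: scalar prior probability."""
    pi_t = pi / (1.0 - pi)
    I1, I2 = scores(y, W)                              # I_i and I_ab of Eq. (18)
    m = I1.max()                                       # subtract for numerical stability
    g = np.exp(I1 - m) / (np.exp(I1 - m).sum()
                          + 0.5 * pi_t * np.exp(I2 - m).sum())   # Eq. (25); pair sum counts twice
    W = W + eps * g[:, None] * y[None, :]              # Hebbian delta-rule, Eq. (23)
    return C * W / W.sum(axis=1, keepdims=True)        # divisive normalization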

6. Experiments

The MCA generative model, along with the associated learning algorithms that have been introduced here, is designed to extract component features from non-linear mixtures. To study their performance, we employ numerical experiments, using artificial as well as more realistic data. The artificial data sets are based on a widely-used benchmark for non-linear component extraction, while the more realistic data are taken from acoustic recordings in one case and from natural images in the other. The goals of these experiments are (1) to establish whether the approximate algorithms do indeed increase the likelihood of the model parameters; (2) to test convergence and asymptotic accuracy of the algorithms; (3) to compare component extraction using MCA to other component-extraction algorithms; and (4) to demonstrate the applicability of the model and algorithms to more realistic data where non-linear component combinations arise naturally.

6.1 The Bars Test

The data sets used in experiments on artificial data were drawn from variants of the bars test introduced by Földiák (1990). Each data vector represents a grayscale image, with a non-linear combination of randomly chosen horizontal and vertical light-colored bars, each extending all the way across a black background. Most commonly, the intensity of the bars is uniform and equal, and the combination rule is such that overlapping regions remain at the same intensity. This type of data is a benchmark for the study of component extraction with non-linear interactions between hidden causes. Many component-extraction algorithms have been applied to a version of the bars test, including some with probabilistic generative semantics (Saund, 1995; Dayan and Zemel, 1995; Hinton et al., 1995; Hinton and Ghahramani, 1997), as well as many with non-generative objective functions (Harpur and Prager, 1996; Hochreiter and Schmidhuber, 1999; Lee and Seung, 2001; Hoyer, 2004), a substantial group of which have been neurally inspired (Földiák, 1990; Fyfe, 1997; O'Reilly, 2001; Charles et al., 2002; Spratling and Johnson, 2002; Lücke and von der Malsburg, 2004; Lücke and Bouecke, 2005; Spratling, 2006; Butko and Triesch, 2007).

In most of the experiments described here the input data were 25-dimensional vectors, representing a 5-by-5 grid of pixels; that is, D = 5×5. There were b possible single bars, some of which were superimposed to create each image. On the 5-by-5 grid there are 5 possible horizontal and 5 vertical bar positions, so that b = 10. Each bar appears independently with a probability π, with areas of overlap retaining the same value as the individual bars. Figure 5A shows an example set of noisy data vectors constructed in this way.

6.2 Annealing

The likelihood surface for the MCA generative model is potentially multimodal. Thus, hill-climbing algorithms based on EM may converge to local optima in the likelihood, which may well be considerably poorer than the global optimum. This tendency to find sub-optimal fixed points can be reduced by incorporating a deterministic annealing, or relaxation, procedure (Ueda and Nakano, 1998; Sahani, 1999), whereby the entropy of the posterior distribution in the free energy (4) is artificially inflated in early iterations, with this inflation progressively reduced in later iterations, under the control of a temperature parameter. All of the experiments discussed here incorporate deterministic annealing, the details of which are given in Appendix E.
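A sketch (assumed code, not from the paper) of a generator for the standard bars test described above, with b = 10 bars on a 5-by-5 grid, each present independently with probability 2/10, overlaps keeping the bar intensity, and optional Poisson observation noise:

# Illustrative sketch: generating bars-test data.
import numpy as np

def make_bars_data(N=500, size=5, p_bar=2/10, intensity=10.0, poisson=True, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    b = 2 * size                                      # horizontal + vertical bars
    bars = np.zeros((b, size, size))
    for i in range(size):
        bars[i, i, :] = intensity                     # horizontal bars
        bars[size + i, :, i] = intensity              # vertical bars
    bars = bars.reshape(b, -1)                        # (b, D) with D = size * size
    S = (rng.random((N, b)) < p_bar).astype(int)      # which bars are present
    Y = np.max(S[:, :, None] * bars[None, :, :], axis=1)  # overlaps keep the bar intensity
    if poisson:
        Y = rng.poisson(Y)
    return S, Y, bars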

Figure 5: Bars test data with b = 10 bars on D = 5×5 pixels and a bar appearance probability of π = 2/10. A 24 patterns from the set of N = 500 input patterns that were generated according to the generative model with Poisson noise. B Change of the parameters W if MCA_3 is used for parameter update. Learning stopped automatically after 108 iterations in this trial (see Appendix E).

6.3 Convergence

From a theoretical standpoint, none of the four algorithms MCA_ex, MCA_3, R-MCA_2, or R-MCA_NN can be guaranteed to maximize the likelihood of the MCA generative model. All of them update the parameters in the M-step using a fixed-point iteration, rather than either maximization or a gradient step. All but MCA_ex also approximate the posterior sufficient statistics (10). Thus, our first numerical experiments are designed to verify that the algorithms do, in fact, increase parameter likelihood in practice, and that they do converge. For this purpose, it is appropriate to use a version of the bars test in which observations are generated by the MCA model. Thus, we selected MCA parameters that generate noisy bar-like images. There were 10 hidden sources in the generating model, one corresponding to each bar. The associated matrix of generating weights, W^gen, was 10×25, with each row representing a horizontal or vertical bar in a 5-by-5 pixel grid. The weights W_id^gen that correspond to the pixels of the bar were set to 10, the others to 0, so that sum_d W_id^gen = 50. Each source was active with probability π_i^gen = 2/10, leading to an average of two bars appearing in each image.

We generated N = 500 input patterns (each with 25 elements) using Equations (1) to (3); a subset of the resulting patterns is displayed in Figure 5A.

Figure 6: Change of the MCA parameter likelihood under different MCA learning algorithms. Data were generated as in Figure 5. To allow for comparison, the same set of N = 500 input patterns was used for all experiments shown. The likelihood of the generating parameters (W^gen, π^gen) is shown by the dotted horizontal line. The main axes show likelihood values of the batch-mode algorithms MCA_ex, MCA_3, and R-MCA_2 as a function of EM iteration. The inset axes show likelihood values of the online algorithm R-MCA_NN as a function of the number of input pattern presentations. Patterns were randomly selected from the set of N = 500 inputs, and the parameters were updated for each pattern.

Figure 6 shows the evolution of parameter likelihoods, as a function of iteration, for each of the MCA algorithms, with 5 different choices of initial parameters for each. With the exception of the first few iterations of R-MCA_2, the likelihood of the parameters under the batch-mode algorithms increased at almost every iteration. The online R-MCA_NN showed greater fluctuations, as updates based on individual data vectors inevitably perturbed the parameter estimates. As might be expected, given the observation of increasing likelihoods and the fact that the likelihoods are bounded, each algorithm eventually converged from each initial value used in Figure 6. Furthermore, in each case, the likelihood of the solution found was close to the likelihood of the actual weights used in generation (the dashed horizontal lines). The final likelihood values for MCA_ex were slightly higher than the likelihoods of (W^gen, π^gen), as is expected for an exact maximum-likelihood algorithm in noisy data; whereas the values achieved by the approximations MCA_3 and R-MCA_2 were slightly lower. In fact, in 100 further experiments, the annealing and parameter initialization schemes described in Appendix E brought the likelihood close to that of the generating weights in 98 of 100 runs using R-MCA_2 and in 90 of 100 runs using MCA_3.

We did not run these more extensive tests for MCA_ex due to its long running time (it is also omitted from similar quantitative analyses below). The two basic observations, that likelihoods generally increase at each iteration and that the batch-mode algorithms all reliably converge, held true for all of the experiments described here and below, even where data were not generated from a version of the MCA model. Thus, we conclude that these algorithms are generally robust in practice, despite the absence of any theoretical guarantees.

6.4 Parameter Recovery

Figure 5B shows the evolution of the parameters W under the approximate MCA_3 algorithm, showing that the estimated W_id indeed converge to values close to the generating parameters W^gen, as was suggested by the convergence of the likelihood to values close to that of the generative parameters. While not shown, the convergence of W under MCA_ex, R-MCA_2 or R-MCA_NN was qualitatively similar to this sequence. Clearly, if MCA_ex finds the global optimum, we would expect the parameters found to be close to those used for generation. The same is not necessarily true of the approximate algorithms. However, both MCA_3 and R-MCA_2 did in fact find weights W that were very close to the generating values whenever an obviously poor local optimum was avoided. In MCA_3 the average pixel intensity of a bar was estimated to be 10.0 ± 0.5 (standard deviation), taken across all bar pixels in 90 trials where the likelihood increased to a high value. Using R-MCA_2 this value was estimated to be 10.0 ± 0.8 (across all bar pixels in 98 high-likelihood trials). Note that the Poisson distribution (2) results in a considerable variance of bar pixel intensities around the mean of 10.0 (compare Figure 5A), which explains the high standard deviation around the relatively precise mean value. The background pixels (original value zero) are estimated to have an intensity of 0.05 ± 0.02 in MCA_3 and are all virtually zero in R-MCA_2.

MCA_3 also estimates the parameters π. Because of the finite number of patterns (N = 500), we compared the estimates with the actual frequency of occurrence of each bar i: π̂_i = (number of occurrences of bar i in the input)/N. The mean absolute difference between the estimated π_i and the actual probability π̂_i was small (across the 90 trials with high likelihood), which demonstrates the relative accuracy of the solutions, despite the approximation made in Equation (17).

For the neural network algorithm R-MCA_NN given by (25), we observed virtually the same behavior as for R-MCA_2 when using a small learning rate (e.g., ε = 0.1) and the same cooling schedule in both cases (see Lücke and Sahani, 2007). The additional noise introduced by the online updates of R-MCA_NN had only a negligible effect. For larger learning rates the situation was different, however. For later comparison to noisy neural network algorithms, we used a version of R-MCA_NN with a relatively high learning rate of ε = 1.0. Furthermore, instead of a cooling schedule, we used a fixed temperature T = 16 and added Gaussian noise (σ = 0.02) at each parameter update: ΔW_id = ε g_i y_d + σ η. With these learning parameters, R-MCA_NN learned very rapidly, requiring fewer than 1000 pattern presentations in the majority of trials. Ten plots of likelihoods against the number of presented patterns are shown for R-MCA_NN in Figure 6 (inset axes, black lines) for the same N = 500 patterns as used for the batch-mode algorithms. Because of the additional noise in W, the final likelihood values were somewhat lower than those of the generating weights. Using R-MCA_NN with the same parameters but without added noise (σ = 0), final likelihood values were often higher (inset axes, gray lines) but the algorithm also converged to local optima more often.

In contrast, R-MCA_NN with noise avoided local optima in all 100 trials. In the following, R-MCA_NN will therefore refer to the noisy version with σ = 0.02 unless otherwise stated.

Table 1: Comparison of MCA algorithms with other systems in the standard bars test with b = 10 bars (D = 5×5, π = 2/10, N = 500). For the MCA algorithms, reliability values are computed on the basis of 100 trials. Values for these algorithms are also given for the same bars test with Poisson noise. Reliability values for the other systems are taken from the literature. For instance, the model of Hochreiter and Schmidhuber (1999) is reported to fail to extract all bars in one of 25 trials. Two systems, back-propagation (BP) and GeneRec, that are described by O'Reilly (2001) have also been applied to this bars test. In their standard versions, BP and GeneRec achieve 10% and 60% reliability, respectively. Hochreiter and Schmidhuber (1999) report that ICA and PCA extract only subsets of all bars. Trained without bar overlap.

  Model      Reliability (noisy)   Reliability (no noise)
  MCA_3      90%                   81%
  R-MCA_2    98%                   96%
  R-MCA_NN   >99%                  >99%

  Model        Reliability (no noise)   Reference
  noisy-or     27%                      Saund, 1995
  competitive  69%                      Dayan and Zemel, 1995
  LOCOCODE     96%                      Hochreiter and Schmidhuber, 1999

6.5 Comparison to Other Algorithms — Noiseless Bars

To compare the component extraction results of MCA to those of other algorithms reported in the literature, we used a standard version of the bars benchmark test, in which the bars appear with no noise. The competing algorithms do not necessarily employ probabilistic semantics, and may not be explicitly generative; thus, we cannot compare performance in terms of likelihoods, nor in terms of the accuracy with which generative parameters are recovered. Instead, we adopt a commonly used measure, which asks how reliably all the different bars are identified (see, e.g., Hochreiter and Schmidhuber, 1999; O'Reilly, 2001; Spratling and Johnson, 2002; Lücke and von der Malsburg, 2004; Spratling, 2006). For each model, an internal variable (say the activities of the hidden units, or the posterior probabilities of each source being active) is identified as the response to an image. The responses evoked in the learned model by each of the possible single-bar images are then considered, and the most active unit or most probable source corresponding to each bar is identified. If the mapping from single-bar images to the most active internal variable is injective, that is, for each single bar a different hidden unit or source is the most active, then this instance of the model is said to have represented all of the bars. The reliability is the frequency with which each model represents all possible bars, when started from random initial conditions, and given a random set of images generated with the same parameter settings. For the MCA algorithms, the responses are defined to be the approximate posterior values for each possible source vector with only one active source, evaluated at the final parameter values after learning: q(s_h; Θ) ≈ p(s_h | y_bar, W).

The reliabilities of MCA_3, R-MCA_2, and R-MCA_NN as well as some other published component-extraction algorithms are shown in Table 1. These experiments used a configuration of the bars test much as above (D = 5×5, b = 10, and π^gen = 2/10), which is perhaps the most commonly used in the literature (e.g., Saund, 1995; Dayan and Zemel, 1995; Hochreiter and Schmidhuber, 1999; O'Reilly, 2001).

The bars have a fixed and equal gray-value. We generated N = 500 patterns according to these settings and normalized the input patterns y^(n) to lie in the interval [0,10] (i.e., bar pixels have a value of 10 and the background is 0). We considered both the case with Poisson noise (which has been discussed above) and the standard noiseless case. Experiments were run starting from 100 different randomly initialized parameters W. The same algorithms and the same cooling schedule were used (the same fixed T in the case of R-MCA_NN) to fit patterns with and without noise. Without noise, MCA_3 with H = 10 hidden variables found all 10 bars in 81 of 100 experiments. R-MCA_2 with H = 10 found all bars in 96 of 100 experiments. Using the criterion of reliability, R-MCA_NN performed best and found all bars in all 100 of 100 experiments. This seems likely to result from the fact that the added Gaussian noise, as well as noise introduced by the online updates, combine to drive the system out of shallow optima. Furthermore, R-MCA_NN was, on average, faster than MCA_3 and R-MCA_2 in terms of required pattern presentations. It took fewer than 1000 pattern presentations to find all bars in the majority of 100 experiments,(1) although in a few trials learning did take much longer. On the other hand, MCA_3 and R-MCA_2 achieved better likelihoods and recovered generative parameters closer to the true values. These algorithms also have the advantage of a well-defined stopping criterion. MCA_3 learns the parameters of the prior distribution whereas R-MCA_2 uses a fixed value. R-MCA_2 does, however, remain highly reliable, even when the fixed parameter π differs significantly from the true value π^gen.

Figure 7: A common local optimum found by MCA_3 in the standard bars test. Two weight patterns reflect the same hidden cause, while another represents the superposition of two causes.

As was the case for the noisy bars, the R-MCA algorithms avoided local optima more often. This may well be a result of the smaller parameter space associated with the restricted model. A common local optimum for MCA_3 is displayed in Figure 7, where the weights associated with two sources generate the same horizontal bar, while a third source generates a weaker combination of two bars. This local solution is suboptimal, but the fact that MCA_3 has parameters to represent varying probabilities for each cause being present means that it can adjust the corresponding rates to match the data. The fixed setting of π for R-MCA would introduce a further likelihood penalty for this solution.

Many component-extraction algorithms, particularly those based on artificial neural networks, use models with more hidden elements than there are distinct causes in the input data (e.g., Charles et al., 2002; Lücke and von der Malsburg, 2004; Spratling, 2006). If we used H = 12 hidden variables, then all the MCA algorithms (MCA_3, R-MCA_2, and R-MCA_NN) found all of the bars in all of 100 trials.

(1) Note that, according to the definition above, all bars are often already represented at intermediate likelihood values.
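The reliability criterion described in Section 6.5 can be checked mechanically; the following sketch (assumed code, reusing poisson_score() from an earlier sketch) presents each single-bar image, takes the single-source posterior values as responses, and tests whether the resulting map from bars to sources is injective:

# Illustrative sketch: has a learned model represented all bars?
import numpy as np

def represents_all_bars(W, pi, bars):
    """W: (H, D) learned weights, pi: learned prior(s), bars: (b, D) single-bar images."""
    pi_t = pi / (1.0 - pi)
    winners = []
    for y in bars:
        I1 = np.array([poisson_score(y, W[h]) for h in range(W.shape[0])])
        resp = pi_t * np.exp(I1 - I1.max())          # response q(s_h | y_bar), up to a constant
        winners.append(int(np.argmax(resp)))
    return len(set(winners)) == len(bars)             # injective mapping -> all bars represented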

Figure 8: Experiments with increased bar overlap. In A bar overlap is increased by increasing the bar appearance probability to π^gen = 3/10 (an average of three bars per pattern). In B bar overlap is varied using different bar widths (two one-pixel-wide bars and one three-pixel-wide bar for each orientation). In the bars test in C there are 8 (two-pixel-wide) horizontal bars and 8 (two-pixel-wide) vertical bars on a D = 9×9 pixel grid. Each bar appears with probability π^gen = 2/16 (two bars per input pattern on average). Each experimental data set is illustrated by 14 typical input patterns. For A to C the parameters W of a typical trial are shown if MCA_3 is used for learning. The vectors W_i = (W_i1,...,W_iD) appear in order of decreasing learned appearance probability π_i. In D the parameters W for a typical trial using R-MCA_NN are shown.

6.6 Comparison to Other Algorithms — Bar Overlap

For most component-extraction algorithms that have been tested against the bars benchmark, it is difficult to know how specialized they are to the form of this test. The algorithms might, for example, depend on the fact that all bars appear with the same probability, or that they have the same width. Different versions of the bars test have therefore been introduced to probe how generally the different algorithms might succeed. In particular, there has been considerable recent interest in studying robustness to varying degrees of overlap between bars (see, e.g., Lücke and von der Malsburg, 2004; Lücke, 2004; Spratling, 2006). This is because it is the non-linear combination within the regions of overlap that most distinguishes the bars test images from linear superpositions of sources. In three different experiments we varied the degree of overlap in three different ways. Following Spratling (2006), in all experiments the MCA model had twice as many possible sources as there were bars in the generative input. In all experiments we used the same algorithms, initial conditions, and cooling schedules as described above and in Appendix E. Again, each trial used a newly generated set of training patterns and a different randomly generated matrix W. In the following, reliability values are computed on the basis of 25 trials each.

The most straightforward way to increase the degree of bar overlap is to use the standard bars test with an average of three instead of two bars per image, that is, to take π = 3/10 for an otherwise unchanged bars test with b = 10 bars on D = 5×5 pixels (see Figure 8A for some examples). When using H = 20 hidden variables, MCA_3 extracted all bars in 92% of 25 experiments. Thus the algorithm works well even for relatively high degrees of superposition. The values of W found in a typical trial are shown in Figure 8A. The parameters W_i = (W_i1,...,W_iD) that are associated with a hidden variable or unit are sorted according to the learned appearance probabilities π_i. Like MCA_3, both R-MCA_2 and R-MCA_NN were run without changing any parameters. In the restricted case, this meant that the assumed value for the source probability (π = 2/10) was different from the generating value (π^gen = 3/10). Nevertheless, the performance of both algorithms remained better than that of MCA_3, with R-MCA_2 and R-MCA_NN finding all 10 bars in 96% and 100% of 25 trials, respectively.

We can also choose unequal bar appearance probabilities (cf. Lücke and von der Malsburg, 2004). For example, when half the bars appeared with probability π_h^gen = (1+γ)·2/10 and the other half(2) appeared with probability π_h^gen = (1−γ)·2/10, MCA_3 extracted all bars in all of 25 experiments for γ = 0.5. For γ = 0.6 (when half the bars appeared 4 times more often than the other half) all bars were extracted in 88% of 25 experiments. For γ = 0.6, R-MCA_2 and R-MCA_NN found all bars in 96% and 100% of 25 experiments, respectively. Reliability values for R-MCA_NN started to decrease for γ = 0.7 (92% reliability).

As suggested by Lücke and von der Malsburg (2004), we also varied the bar overlap in a second experiment by choosing bars of different widths. For each orientation we used two one-pixel-wide bars and one three-pixel-wide bar. Thus, for this data set, b = 6 and D = 5×5. The bar appearance probability was π = 2/6, so that an input contained, as usual, two bars on average. Figure 8B shows some examples. MCA_3 extracted all bars in 84% of 25 experiments for this test. Reliability values decreased for more extreme differences in the bar sizes. R-MCA_2 and R-MCA_NN both found all bars in all 25 trials each. Thus, although the unequal bar sizes violate the assumption sum_d W_id = C that was made in the derivation of R-MCA_2 and R-MCA_NN, the algorithms' performance in terms of reliability seemed unaffected.

(2) If bars are numbered h = 1 to 5 for the horizontal and h = 6 to 10 for the vertical, we chose the ones with even numbers to appear with the higher probability.

Figure 9: Comparison of MCA_3, R-MCA_2, and R-MCA_NN with other systems in the bars test with increased occlusion (compare Figure 8C and Figure 2). Bars test parameters are D = 9×9, b = 16, π = 2/16, and N = 400. Data for the non-MCA algorithms are taken from Spratling (2006). The bar heights represent the average numbers of extracted bars in 25 trials. Error bars indicate the largest and the lowest number of bars found in a trial. The algorithms NN-DI and DI are feed-forward neural networks of the type depicted in Figure 4. All other (non-MCA) algorithms (NMF_mse, NMF_div, S-NMF, L-NMF, SC-NMF_Y, SC-NMF_AY, SC-NMF_A, NN-SC) are versions of NMF with different objectives and constraints (see Appendix E and Spratling, 2006, for details).

In the third experiment we changed the degree of bar overlap more substantially, using a bars test that included overlapping parallel bars as introduced by Lücke (2004). We used eight horizontal and eight vertical bars, each two pixels wide, on a 9-by-9 grid. Thus, two parallel neighboring bars had substantial overlap. Figure 8C shows some typical input patterns. Note that the introductory example of Figure 2A,B is also of this type. To allow for a detailed comparison with other systems we adopted the exact settings used by Spratling (2006), that is, we considered 25 runs of a system with H = 32 hidden variables using bars test parameters D = 9×9, π^gen = 2/16, and N = 400. For these data, MCA_3 found all 16 bars in all of 25 experiments. The same is true for R-MCA_2, whereas R-MCA_NN missed one bar in one of the 25 trials. Figure 9 shows a quantitative comparison with other algorithms that have been applied to this version of the bars test. Of the ten algorithms studied by Spratling (2006), just one, namely non-negative sparse coding (NN-SC; Hoyer, 2002, with sparseness parameter λ = 1), is as reliable as MCA_3 and R-MCA_2. The other systems, including forms of NMF both with and without a sparseness constraint, fail partly or entirely in extracting the actual hidden causes.

For a typical trial using MCA_3 the final parameters W are displayed in Figure 8C. Again the W_i's associated with the different hidden variables are sorted according to their learned parameters π_i. A qualitatively different set of W_i's was obtained when R-MCA_NN was used for learning. Figure 8D shows a typical outcome (the W_i's are not sorted). In this case, only the actual causes are clearly represented, whereas the W_i's of the supernumerary units remain unspecialized. The same feature is reported by Spratling (2006) for the algorithms NN-DI and DI used in this same test.

Convergence to a representation that contains just the true hidden causes and leaves supernumerary units unspecialized can improve the interpretability of the result. When using a higher fixed temperature for R-MCA_NN, all the hidden units represented bars, with some bars represented by more than one unit. However, hidden units that represented more composite inputs, as seen for MCA_3, were rarely observed. On the other hand, the parameters found by MCA_3 provide an indication of the significance of each weight pattern in the appearance probabilities π_i. Thus, in Figure 8C the appearance probabilities for the first 16 sources are much higher than for the others. The later sources may be interpreted as capturing some of the higher-order structure that results from a finite set of input patterns. In contrast to R-MCA, such higher-order representations need not adversely affect the data likelihood because the corresponding appearance probabilities can be relatively small.

Figure 10: Experiments with more causes and hidden variables than observed variables. A The 12 patterns used to generate the data. Each is a 1-by-2 pixel bar on a 3-by-3 grid (D = 9). B Ten examples of the 500 input patterns generated using the causes shown in A. C Parameters W found in a typical run of MCA_3 with H = 24. The vectors W_i = (W_i1,...,W_iD) appear in order of decreasing learned appearance probability π_i.

6.7 More Causes than Observed Variables

In the experiments described above, the number of hidden causes was always smaller than the number of observed variables. We next briefly studied the over-complete case where data were generated, and models were fit, using more hidden causes than observed variables. We generated N = 500 patterns on a 3-by-3 grid (D = 9), using sparse combinations of 12 hidden causes corresponding to 6 horizontal and 6 vertical bars, each 1-by-2 pixels in size and thus extending across only a portion of the image (Figure 10A). As in the bars tests above, black was assigned a value of 0 and white of 10. Patterns were generated without noise, with an average of two bars appearing in each (π = 2/12). Ten such patterns are shown in Figure 10B.

Figure 10C shows the weights learned during a typical run using MCA_3 with the same parameter settings as above and twice as many hidden variables as observed ones (H = 24). Weights are sorted in order of decreasing inferred appearance probabilities π_i. All 12 causes were identified, with many represented more than once. A few hidden variables, with lower inferred probabilities of appearance, were associated with more composite patterns. MCA_3 extracted all causes in all of 25 trials. R-MCA_2 also extracted all causes in all of 25 trials, and never represented composite patterns. R-MCA_NN only extracted all causes when run at fixed temperatures that were lower than those used for the bars tests above (e.g., T = 3), in which case it did so in all of 25 trials. This requirement for a lower temperature was consistent with the observation that a lower data dimension D leads to a decrease in the critical temperatures associated with the algorithms (see Appendix E). For larger values of T (e.g., T = 16), R-MCA_NN did not extract single causes.

MCA_3 extracted all causes in all of 25 trials. R-MCA_2 also extracted all causes in all of 25 trials, and never represented composite patterns. R-MCA_NN only extracted all causes when run at fixed temperatures that were lower than those used for the bars tests above (e.g., T = 3), in which case it did so in all of 25 trials. This requirement for a lower temperature was consistent with the observation that a lower data dimension D leads to a decrease in the critical temperatures associated with the algorithms (see Appendix E). For larger values of T (e.g., T = 16) R-MCA_NN did not extract single causes.

6.8 Violations of Model Assumptions

To optimize the likelihood of the data under the MCA generative model, each of the approximate learning algorithms relies on the fact that, under the Bernoulli prior (1), some number of the observed data vectors will be generated by only a small number of active sources. To highlight this point we explicitly removed such sparse data vectors from a standard bars test, thereby violating the Bernoulli prior assumption of the generative model. We used bars tests as described above, with b = 10 or b = 16 bars and π = 2/b, generating N = 500 (or more) patterns, in each case by first drawing causes from the Bernoulli distribution (1) and then rejecting patterns in which fewer than m causes were active. As might be expected, when m was 3 or greater the approximate algorithms all failed to learn the weights associated with single causes. However, when only patterns with fewer than 2 bars had been removed, MCA_3 was still able to identify all the bars in many of the runs. More precisely, using data generated as above with b = 10, m = 2, and N = 500, MCA_3 with H = 10 hidden variables found all causes in 69 of 100 trials with noisy observations and in 37 of 100 trials without noise (the parameters for MCA_3 and the associated annealing schedule were unchanged). Note that in these experiments the average number of active causes per input vector is increased by the removal of sparse data vectors. An increase in reliability in the noisy case is consistent with our other experiments. The relatively low reliability seen for noiseless bars in this experiment may be due to the combined violation of both the assumed prior and noise distributions.

As long as the data set did contain some vectors generated by few sources, the learning algorithms could all relatively robustly identify the causes given sufficient data, even when the average observation contained many active sources. For instance, in a standard noiseless bars test with b = 16 bars on an 8 × 8 grid, and N = 1000 patterns with an average of four active causes in each (π = 4/16), all three algorithms still achieved high reliability values, using twice as many hidden variables as actual bars (H = 32), and using the same parameters as for the standard bars test above. MCA_3 found all causes in 20 of 25 trials on these data (80% reliability). Reliabilities of R-MCA_2 and R-MCA_NN (25 trials each) were 76% and 100%, respectively. The reliabilities of all algorithms fell when the data set contained fewer patterns, or when the average number of bars per pattern was larger.
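The rejection step used above to violate the Bernoulli prior is straightforward to emulate. The sketch below is our own illustration (not the original experimental code; the function name is ours): cause vectors are drawn repeatedly from the Bernoulli prior, and only those with at least m active causes are kept, so that the retained data set no longer contains the sparse vectors on which the approximate algorithms rely.

```python
import numpy as np

def sample_causes_min_active(n_causes, pi, N, m, rng):
    """Draw N Bernoulli cause vectors, rejecting any with fewer than m active causes."""
    S = np.empty((N, n_causes), dtype=bool)
    filled = 0
    while filled < N:
        s = rng.random(n_causes) < pi            # one draw from the Bernoulli prior
        if s.sum() >= m:                         # keep only sufficiently dense patterns
            S[filled] = s
            filled += 1
    return S

rng = np.random.default_rng(0)
S = sample_causes_min_active(n_causes=10, pi=2/10, N=500, m=2, rng=rng)
print(S.sum(axis=1).mean())                      # average number of active causes now exceeds 2
```

For b = 10 and π = 2/10, conditioning on at least two active causes raises the average number of active causes per retained pattern above two, which is the effect noted above.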
6.9 Applications to More Realistic Data

We study two examples of component extraction in more realistic settings, applying the MCA algorithms to both acoustic and image data.

Acoustic data. Sound waveforms from multiple different sources combine linearly, and so are conventionally unmixed using algorithms such as ICA applied to simultaneous recordings from multiple microphones. The situation is different, however, for spectrogram representations of natural sound.

Figure 11: Application to acoustic data. A Pressure waveforms of six phonemes spoken by a male voice. Axes here, and for the waveform in C, are as shown for [a ] (A is a normalized amplitude). B The log-spectrograms of the phonemes in A. We used 50 frequency channels and nine time windows (t = 1, ..., 9). Axes of all log-spectrograms in the figure are as shown for [a ]. C Waveform of the linear mixture of phonemes [a ] and [k], and the log-spectrogram of this linear mixture. D Six examples of the N = 500 data vectors that were used for the experiments. Each data vector is the log-spectrogram of a linear mixture of the phoneme waveforms in A. The data sets for the experiments used an average of two waveforms per data vector. E Parameters W found by MCA_3 with H = 12, using 500 mixture log-spectrograms. The parameter vectors W_i = (W_i1, ..., W_iD) appear in order of decreasing learned appearance probability π_i and are linearly scaled to fill the gray scale.

The power of natural sounds in individual time-frequency bins varies over many orders of magnitude, and so is typically measured logarithmically and expressed in units of decibels, giving a representation that is closely aligned with the response of the cochlea to the corresponding sound. In this representation, the combination of log-spectrograms of the different sources may be well approximated by the max rule (R. K. Moore, 1983, quoted by Roweis, 2003). In particular, the logarithmic power distribution, as well as the sub-linear power summation due to phase misalignment, both lead to the total power in a time-frequency bin being dominated by the single largest contribution to that bin (see Discussion).

To study the extraction of components from mixtures of sound by MCA, we based the following experiment on six recordings of phonemes spoken by a male voice (see Figure 11A). The phoneme waveforms were mixed linearly to generate N = 500 superpositions, with each phoneme appearing in each mixture with probability π = 2/6. Thus each mixture comprised two phonemes on average, with a combination rule that resembled the MCA max rule in the approximate sense described above. We applied the MCA algorithms to the log-spectrograms of these mixtures. Figure 11B shows the log-spectrograms of the individual phonemes and Figure 11C shows the log-spectrogram of an example phoneme mixture. We used 50 frequency channels and 9 time bins to construct the log-spectrograms. The resulting values were thresholded and then rescaled linearly so that power levels across all phonemes filled the interval [0, 10], as in the standard bars test. For more details see Appendix E.

The MCA algorithms were used with the same parameter settings as in the bars tests above, except that annealing began at a lower initial temperature (see Appendix E). As in the bars tests with increased overlap, we used twice as many hidden variables (H = 12) as there were causes in the input. Figure 11E shows the parameters W learned in one run using MCA_3. The parameter vectors W_i = (W_i1, ..., W_iD) are displayed in decreasing order of the corresponding learned value of π_i. As can be seen, the first six such vectors converged to spectrogram representations similar to those of the six original phonemes. The six hidden variables associated with lower values of π_i converged to weight vectors that represented more composite spectrograms. This result is representative of those found with MCA_3. R-MCA_2 also converged to single spectrogram representations, but tended to represent those single spectrograms multiple times rather than representing more composite patterns with the additional components. Results for R-MCA_NN were very similar to those for R-MCA_2 when we used a high fixed temperature (see Appendix E for details). For intermediate fixed temperatures, results for R-MCA_NN were similar to those of the bars test in Figure 8D in that each cause was represented just once, with additional hidden units displaying little structure in their weights. For lower fixed temperatures (starting from T ≈ 40) R-MCA_NN failed to represent all causes.

In general, the reliability values of all three algorithms were high. These were measured as described for the bars tests above, by checking whether, after learning, inference based on each individual phoneme log-spectrogram led to a different hidden cause being most probable. MCA_3 found all causes in 21 of 25 trials (84% reliability); R-MCA_2 found all causes in all of 25 trials, as did R-MCA_NN (with fixed T = 70). Reliability for MCA_3 improved to 96% with a slower cooling procedure (θ_W = ; see Appendix E).
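The max-rule approximation for log-spectrograms that motivates this experiment is easy to check numerically. The sketch below is our own illustration rather than part of the paper's pipeline: the two synthetic tones stand in for phoneme recordings, and the spectrogram parameters are illustrative rather than those of Appendix E. It compares the log-magnitude spectrogram of a linear mixture with the element-wise maximum of the components' log-spectrograms.

```python
import numpy as np

def log_spectrogram(x, win=256, hop=128, eps=1e-10):
    """Log-magnitude short-time spectrum with a Hann window (dB up to a constant)."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    mag = np.abs(np.fft.rfft(np.asarray(frames), axis=1))
    return 20.0 * np.log10(mag + eps)

fs = 16000
t = np.arange(fs) / fs
a = np.sin(2 * np.pi * 220.0 * t) * np.exp(-3.0 * t)     # synthetic stand-ins for two sources
b = 0.5 * np.sin(2 * np.pi * 1800.0 * t) * np.exp(-1.0 * t)

La, Lb = log_spectrogram(a), log_spectrogram(b)
Lmix = log_spectrogram(a + b)                             # log-spectrogram of the linear mixture
Lmax = np.maximum(La, Lb)                                 # max-rule prediction

active = Lmax > Lmax.max() - 60.0                         # ignore near-silent bins
print("median |Lmix - Lmax| in active bins: %.2f dB"
      % np.median(np.abs(Lmix - Lmax)[active]))
```

Deviations from the max rule are largest in bins where the two sources have comparable energy; where one source clearly dominates, the two quantities agree to within a few decibels.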
Visual data. Finally, we consider a data set for which the exact hidden sources and their mixing rule are unknown. The data were taken from a single 250-by-250 pixel gray-level image of grass, taken from the van Hateren database of natural images (see Figure 12A and Appendix E).

Figure 12: Application to visual data. A The 250-by-250 pixel image used as the basis for the experiments. The image is taken from the van Hateren database of natural images (see Appendix E). For visualization we have brightened the image (we let values in the lower half of the light intensity range fill the range of gray values from zero to 255 and clamped values in the upper half to the value 255). Without brightening, the image would appear unnaturally dark on a finite gray scale because of a small number of pixels with very high values. B 35 examples taken from the 10-by-10 pixel patches that were used for the numerical experiments. The patches represent light intensities linearly. For visualization, each patch has been scaled to fill the range of gray values. C Parameters W resulting from a typical run of R-MCA_2 with H = 50 hidden variables and N = 5000 image patches. For visualization, each parameter vector W_i = (W_i1, ..., W_iD) has been linearly scaled to fill the range of gray values. D Patches generated using the restricted generative model and weights as in C (patches have been scaled as in B and C).
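To make the preprocessing behind Figure 12 concrete, the following sketch is our own illustration (not the authors' implementation): extract_patches samples random 10-by-10 patches of linear light intensities from an image, and generate_patches draws patterns from learned weights via the max combination rule, as in Figure 12D. The single scalar pi is a placeholder for the appearance probabilities, and observation noise is omitted.

```python
import numpy as np

def extract_patches(image, patch=10, n_patches=5000, seed=0):
    """Sample n_patches random square patches (patch x patch pixels) of linear intensities."""
    rng = np.random.default_rng(seed)
    rows, cols = image.shape
    ys = rng.integers(0, rows - patch + 1, size=n_patches)
    xs = rng.integers(0, cols - patch + 1, size=n_patches)
    return np.stack([image[y:y + patch, x:x + patch].ravel()
                     for y, x in zip(ys, xs)])            # shape (n_patches, patch * patch)

def generate_patches(W_learned, pi, n=35, seed=0):
    """Generate patches from learned weights with the max rule (noise omitted)."""
    rng = np.random.default_rng(seed)
    H = W_learned.shape[0]
    S = rng.random((n, H)) < pi                           # Bernoulli cause vectors
    return np.max(S[:, :, None] * W_learned[None, :, :], axis=1)
```

Applied to the grass image, extract_patches would yield a training matrix of the same shape as that used here (N = 5000 vectors of dimension D = 100) for fitting R-MCA_2 with H = 50 hidden variables.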
