Maximal Causes for Non-linear Component Extraction


Journal of Machine Learning Research 9 (2008). Submitted 5/07; Revised 11/07; Published 6/08.

Maximal Causes for Non-linear Component Extraction

Jörg Lücke (LUCKE@GATSBY.UCL.AC.UK)
Maneesh Sahani (MANEESH@GATSBY.UCL.AC.UK)
Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, UK

Editor: Yoshua Bengio

Abstract

We study a generative model in which hidden causes combine competitively to produce observations. Multiple active causes combine to determine the value of an observed variable through a max function, in the place where algorithms such as sparse coding, independent component analysis, or non-negative matrix factorization would use a sum. This max rule can represent a more realistic model of non-linear interaction between basic components in many settings, including acoustic and image data. While exact maximum-likelihood learning of the parameters of this model proves to be intractable, we show that efficient approximations to expectation-maximization (EM) can be found in the case of sparsely active hidden causes. One of these approximations can be formulated as a neural network model with a generalized softmax activation function and Hebbian learning. Thus, we show that learning in recent softmax-like neural networks may be interpreted as approximate maximization of a data likelihood. We use the bars benchmark test to numerically verify our analytical results and to demonstrate the competitiveness of the resulting algorithms. Finally, we show results of learning model parameters to fit acoustic and visual data sets in which max-like component combinations arise naturally.

Keywords: component extraction, maximum likelihood, approximate EM, competitive learning, neural networks

1. Introduction

In recent years, algorithms such as independent components analysis (ICA; Comon, 1994; Bell and Sejnowski, 1997), sparse coding (SC; Olshausen and Field, 1996), and non-negative matrix factorization (NMF; Lee and Seung, 1999) have been used to describe the statistics of the natural environment, and the components extracted by these methods have been linked to sensory neuronal response properties. Stated in the language of probabilistic generative models (see, e.g., Dayan and Abbott, 2001; Rao et al., 2002), these systems describe sensory data as a linear superposition of learned components. For many types of data, including images, this assumed linear cooperation between generative causes is unrealistic. Alternative, more competitive generative models have also been proposed: for instance, Saund (1995) suggests a model in which hidden causes are combined by a noisy-or rule, while Dayan and Zemel (1995) suggest a yet more competitive scheme. Here, we formulate an extreme case of competition, in which the strongest generative influence on an observed variable (e.g., an image pixel) alone determines its value. Such a rule has the property of selecting, for each observed variable, a single generative cause to determine that variable's value.

This form of combination emerges naturally in the context of spectrotemporal masking in mixed audio signals. For image data, occlusion leads to a different combination rule, but one that shares the selection property in that, under constant lighting conditions, the appearance of each observed pixel is determined by a single object.

In parallel to this development of generative approaches, a number of artificial neural network architectures have been designed to tackle the problem of non-linear component extraction, mostly in artificial data (e.g., Spratling and Johnson, 2002; Lücke and von der Malsburg, 2004; Lücke and Bouecke, 2005; Spratling, 2006), although sometimes in natural images (e.g., Harpur and Prager, 1996; Charles et al., 2002; Lücke, 2007). These models often perform quite well with respect to various benchmark tests. However, the relationship between them and the density models that are implicit or explicit in the generative approach has not, thus far, been made clear. We show here that inference and learning in a restricted form of our novel generative model correspond closely in form to the processing and plasticity rules used in such neural network approaches, thus bringing together these two disparate threads of investigation.

The organization of the remainder of this article is as follows. In Section 2 we define the novel generative model and then proceed to obtain the associated parameter update rules in Section 3. In Section 4 we derive computationally efficient approximations to these update rules, in the context of sparsely active hidden causes, that is, when a small number of hidden causes generally suffices to explain the data. In Section 5 we relate a restricted form of the generative model to neural network learning rules with Hebbian plasticity and divisive normalization. Results of numerical experiments in Section 6 show the component extraction performance of the generative schemes as well as a comparison to other algorithms. Finally, in Section 7, we discuss our analytical and numerical results.

2. A Generative Model with Maximum Non-linearity

We consider a generative model for D observed variables y_d (d = 1,...,D), in which H hidden binary causes s_h (h = 1,...,H), each taking the value 0 or 1, compete to determine the value of each observation (see Figure 1). Associated with each pair (s_h, y_d) is a weight W_hd. Given a set of active causes (i.e., those taking the value 1), the distribution of y_d is determined by the largest of the weights associated with the active causes and y_d. Much of our discussion will apply generally to all models of this causal structure, irrespective of the details of the distributions involved. For concreteness, however, we focus on a particular choice, in which the hidden variables are drawn from a multivariate Bernoulli distribution, and the observed variables are non-negative, integer-valued and, given the causes, conditionally independent and Poisson-distributed. Thus, collecting all the causes into a single binary vector s in {0,1}^H, and all the observed variables into an integer vector y in Z_+^D, we have:

p(s | π) = prod_{h=1}^H p(s_h | π_h),   p(s_h | π_h) = π_h^{s_h} (1 − π_h)^{1 − s_h},   (1)

p(y | s, W) = prod_{d=1}^D p(y_d | W̄_d(s, W)),   p(y_d | w) = (w^{y_d} / y_d!) e^{−w}.   (2)

Here, π in [0,1]^H parameterizes the prior distribution on s, while the weight matrix W in R^{H×D} parameterizes the influence of the hidden causes on the distribution of y.

It will be convenient to group these parameters together into Θ = (π, W).

Figure 1: A generative model with H = 3 hidden variables and D = 5 observed variables. The values y_d of the observed variables are conditionally independent given the values s_h of the hidden variables. The value y_d is drawn from a distribution which is determined by the parameters W_1d, W_2d, and W_3d. For a given binary vector s these parameters combine competitively according to the function W̄_d(s, W) = max_h {s_h W_hd}.

The function W̄_d(s, W) in (2) gives the effective weight on y_d, resulting from a particular pattern of causes s. Thus, in the model considered here,

W̄_d(s, W) = max_h {s_h W_hd}.   (3)

It is useful to place the model (1)-(3) in context. Models of this general type, in which the observations are conditionally independent of one another given a set of hidden causes, are widespread. They underlie algorithms such as ICA, SC, principal components analysis (PCA), factor analysis (see, e.g., Everitt, 1984), and NMF. In these five cases, and indeed in the majority of such models studied, the effective weights W̄_d(s, W) are formed by a linear combination of all the weights that link hidden variables to the observation; that is, W̄_d(s, W) = sum_h s_h W_hd. Some other models, notably those of Saund (1995) and Dayan and Zemel (1995), have implemented more competitive combination rules, where larger individual weights dominate the effective combination. The present model takes this competition to an extreme, so that only the single largest weight (amongst those associated with active hidden variables) determines the output distribution. Thus, where ICA, PCA, SC, or NMF use a sum, we use a max. We refer to this new generative model as the Maximal Causes Analysis (MCA) model.

Figure 2 illustrates the difference between linear superposition and competitive combination using (3). Let us suppose that noise-free observations are generated by causes in the form of horizontal and vertical objects with the same gray-value, on a dark (black) background (see Figure 2). If these objects occlude one another, they may generate an observed image such as that illustrated in Figure 2B. However, if we were to use the actual causes and weights in Figure 2A, but instead combine them linearly, we would obtain the (different) input pattern of Figure 2C. In this case, competitive combination using the max-rule of Equation (3) would result in the correct pattern. This is not, of course, generally true, but for monochrome objects with small variations in their gray-values it holds approximately.
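To make the model concrete, the following is a minimal sketch (not taken from the paper; function and variable names are illustrative) of how observations could be drawn from Equations (1)-(3) with numpy:

# Illustrative sketch: sampling from the MCA generative model, Eqs. (1)-(3).
import numpy as np

def sample_mca(pi, W, N, rng=None):
    """Draw N observations.
    pi : (H,) Bernoulli activation probabilities (Equation 1).
    W  : (H, D) non-negative weight matrix.
    Returns S (N, H) binary causes and Y (N, D) Poisson observations."""
    rng = np.random.default_rng() if rng is None else rng
    S = (rng.random((N, W.shape[0])) < pi).astype(int)       # hidden causes, Eq. (1)
    # Effective rates W_bar_d(s, W) = max_h { s_h * W_hd }, Eq. (3)
    rates = np.max(S[:, :, None] * W[None, :, :], axis=1)    # (N, D)
    Y = rng.poisson(rates)                                    # Poisson noise, Eq. (2)
    return S, Y

# Example with H = 3 causes and D = 5 observed variables (hypothetical values)
pi = np.array([0.3, 0.3, 0.3])
W = np.array([[10., 10.,  0.,  0.,  0.],
              [ 0., 10., 10.,  0.,  0.],
              [ 0.,  0.,  0., 10., 10.]])
S, Y = sample_mca(pi, W, N=5)
print(Y)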

Figure 2: An illustration of non-linear versus linear combination of hidden causes. A Four examples of hidden causes with gray-value 200. B The input image that may result if sources occlude one another. In this case, the correct function W̄_d(s, W) (see Figure 1) to combine the hidden causes is the max-operation. C The input image that results if the four causes combine linearly (gray-values are scaled to fill the interval [0,255]). For C, the correct function W̄_d(s, W) is linear superposition.

More generally, the maximum combination rule is always closer to the result of occlusion than is the simple sum implied by models such as ICA. As stated above, although in this paper we focus on the specific distributions given in (1) and (2), much of the analytical treatment is independent of these specific choices. Thus, update rules for learning the weights W from data will be derived in a general form that can accommodate alternative, non-factored distributions for the binary hidden variables. This general form is also preserved if the Poisson distribution is replaced, for example, by a Gaussian. Poisson variability represents a reasonable choice for the non-negative data considered in this paper, and resembles the cost function introduced by Lee and Seung (1999) for NMF.

3. Maximum Likelihood

Given a set of observed data vectors Y = {y^(n)}_{n=1,...,N}, taken to be generated independently from a stationary process, we seek parameter values Θ* = (π*, W*) that maximize the likelihood of the data under the generative model of Equations (1) to (3):

Θ* = argmax_Θ {L(Θ)}   with   L(Θ) = log p(y^(1),..., y^(N) | Θ).

We use Expectation-Maximization (EM; Dempster et al., 1977; see also Neal and Hinton, 1998, for the formulation that appears here) to maximize the likelihood in this latent variable model. To do so, we introduce the free-energy F(Θ, q), a data-dependent function of the parameters Θ and an unknown distribution q(s^(1),..., s^(N)) over the hidden data or variables, that is always equal to or less than the likelihood evaluated at the same parameter values. For independently generated data vectors y^(n), the distribution q may be taken (without loss of generality) to factor over the hidden vectors, q(s^(1),..., s^(N)) = prod_n q_n(s^(n)).

Then the free-energy is defined as:

F(Θ, q) = sum_{n=1}^N [ sum_s q_n(s) [ log p(y^(n) | s, Θ) + log p(s | Θ) ] ] + H(q)  ≤  L(Θ),   (4)

where H(q) = sum_n H(q_n(s)) = − sum_n sum_s q_n(s) log q_n(s) is the Shannon entropy of q. The iterations of EM alternately increase F with respect to the distributions q_n while holding Θ fixed (the E-step), and with respect to Θ while holding the q_n fixed (the M-step). Thus, if we consider a pair of steps beginning from parameters Θ', the E-step first finds new distributions q_n that depend on Θ' and the observations y^(n), which we write as q_n(s; Θ'). Ideally, these distributions maximize F for fixed Θ', in which case it can be shown that q_n(s; Θ') = p(s | y^(n), Θ') and F(Θ', q_n(s; Θ')) = L(Θ') (Neal and Hinton, 1998). In practice, computation of this exact posterior may be intractable, and it is often replaced by an approximation.

After choosing the q_n's in the E-step, we maximize F with respect to Θ in the M-step while holding the q_n distributions fixed. Thus the free-energy can be re-written in terms of Θ and Θ':

F(Θ, Θ') = sum_{n=1}^N [ sum_s q_n(s; Θ') [ log p(y^(n) | s, Θ) + log p(s | Θ) ] ] + H(Θ'),   (5)

where H(Θ') = sum_n H(q_n(s; Θ')). A necessary condition to achieve this maximum with respect to W_id in Θ is that (see Appendix A for details):

(∂/∂W_id) F(Θ, Θ') = sum_n sum_s q_n(s; Θ') ( y_d^(n) / W̄_d(s, W) − 1 ) (∂/∂W_id) W̄_d(s, W) = 0.   (6)

Unfortunately, under the max-combination rule of Equation (3), W̄_d is not differentiable. Instead, we define a smooth function W̄_d^ρ that converges to W̄_d as ρ approaches infinity:

W̄_d^ρ(s, W) := ( sum_{h=1}^H (s_h W_hd)^ρ )^{1/ρ},   lim_{ρ→∞} W̄_d^ρ(s, W) = W̄_d(s, W),   (7)

and replace the derivative of W̄_d by the limiting value of the derivative of W̄_d^ρ, which we write as A_id (see Appendix A for details):

A_id(s, W) := lim_{ρ→∞} (∂/∂W_id) W̄_d^ρ(s, W) = lim_{ρ→∞} s_i (W_id)^ρ / sum_h s_h (W_hd)^ρ.   (8)

Armed with this definition, a rearrangement of the terms in (6) yields (see Appendix A):

W_id = ( sum_n <A_id(s, W)>_{q_n} y_d^(n) ) / ( sum_n <A_id(s, W)>_{q_n} ),   (9)

where <A_id(s, W)>_{q_n} is the expectation of A_id(s, W) under the distribution q_n(s; Θ'):

<A_id(s, W)>_{q_n} = sum_s q_n(s; Θ') A_id(s, W).   (10)
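As an illustration of how Equations (8)-(10) can be turned into one EM iteration for small models, the following sketch (an assumed implementation, not the authors' code) evaluates A_id directly from the limit in (8), forms the exact posterior over all 2^H hidden states, and applies the fixed-point update (9):

# Illustrative sketch (small H only): exact E-step and fixed-point M-step.
import itertools
import numpy as np
from scipy.special import gammaln

def A(s, W):
    """A_id(s, W) of Eq. (8): 1 where W_id is the maximum among the active
    causes' weights for dimension d (shared equally in case of ties)."""
    active = s[:, None] * W                          # s_h * W_hd, shape (H, D)
    Wbar = active.max(axis=0)                        # Eq. (3)
    winners = (active == Wbar) & (s[:, None] > 0) & (Wbar > 0)
    counts = np.maximum(winners.sum(axis=0), 1)
    return winners / counts                          # (H, D); all zeros if s = 0

def log_joint(s, y, pi, W):
    """log p(s, y | Theta) under Equations (1)-(3)."""
    Wbar = (s[:, None] * W).max(axis=0)
    lp_s = np.sum(s * np.log(pi) + (1 - s) * np.log(1 - pi))
    with np.errstate(divide='ignore', invalid='ignore'):
        lp_y = np.sum(np.where(y > 0, y * np.log(Wbar), 0.0) - Wbar - gammaln(y + 1))
    return lp_s + lp_y

def m_step_W(Y, pi, W_old):
    """One fixed-point update of Eq. (9), with <A_id>_{q_n} from Eq. (10)."""
    H, D = W_old.shape
    states = [np.array(s) for s in itertools.product([0, 1], repeat=H)]
    num = np.zeros((H, D)); den = np.zeros((H, D))
    for y in Y:
        lj = np.array([log_joint(s, y, pi, W_old) for s in states])
        q = np.exp(lj - lj.max()); q /= q.sum()      # exact posterior, Eq. (12)
        EA = sum(qs * A(s, W_old) for qs, s in zip(q, states))
        num += EA * y[None, :]
        den += EA
    return num / np.maximum(den, 1e-12)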

Equation (9) represents a set of non-linear equations (one for each W_id) that defines the necessary conditions for an optimum of F with respect to W. The equations do not represent straightforward update rules for W_id because the right-hand-side does not depend only on the old values W' in Θ'. They can, however, be used as fixed-point iteration equations, by simply evaluating the derivatives A_id at W' instead of W. Although there is no guarantee that these iterations converge, if they do converge the corresponding parameters must lie at a stationary point of the free-energy. Numerical experiments described later confirm that this fixed-point approach is, in fact, robust and convergent. Note that the denominator in (9) vanishes only if q_n(s; Θ') A_id(s, W) = 0 for all s and n (assuming positive weights), in which case (6) is already satisfied, and no update of W_id is required.

Thus far, we have not made explicit reference to the form of prior source distribution, and so the result of Equation (9) is independent of this choice. For our chosen Bernoulli distribution (1), the M-step is obtained by setting the derivative of F with respect to π_i to zero, giving (after rearrangement):

π_i = (1/N) sum_n <s_i>_{q_n},   with   <s_i>_{q_n} = sum_s q_n(s; Θ') s_i.   (11)

Parameter values that satisfy Equations (9) and (11) maximize the free-energy given the distributions q_n = q_n(s; Θ'). As stated before, the optimum with respect to q (and therefore, exact optimization of the likelihood, since the optimal setting of q forces the free-energy bound to be tight) is obtained by setting the q_n to the posterior distributions:

q_n(s; Θ') = p(s | y^(n), Θ') = p(s, y^(n) | Θ') / sum_{s'} p(s', y^(n) | Θ'),   (12)

where p(s, y^(n) | Θ') = p(s | π') p(y^(n) | s, W'), with the latter distributions given by (1) and (2), respectively. Equations (9) to (12) thus represent a complete set of update rules for maximizing the data likelihood under the generative model. The only approximation made to this point is to use the old values W' on the right-hand-side of the M-step equation in (9). We therefore refer to this set of updates as a pseudo-exact learning rule and call the algorithm they define MCA_ex, with the subscript for exact. We will see in numerical experiments that MCA_ex does indeed maximize the likelihood.

4. E-Step Approximations

The computational cost of finding the exact sufficient statistics <A_id(s, W)>_{q_n}, with q_n equal to the posterior probability (12), is intractable in general. It grows exponentially in the smaller of the number of hidden causes H and the number of observed variables D (see Appendix B for details). A practical learning algorithm, then, must depend on finding a computationally tractable approximation to the true expectation. One approach, a form of variational method (Jordan et al., 1999), would be to optimize the q_n within a constrained class of distributions; for example, distributions that factor over the sources s_h. Unfortunately, this conventional factoring approach provides limited benefit here, as the form of A_id(s, W) resists straightforward evaluation of the expected value with respect to the individual sources. Instead, we base our approximations on an assumption of sparsity: that only a small number of active hidden sources is needed to explain any one observed data vector (note that sparsity here refers to the number of active hidden sources, rather than to their proportion).

The resulting expressions relate to those that would be found by a variational optimization constrained to distributions that are sparse in the sense above, but are not identical. The relationship will be explored further in the Discussion.

To develop the sparse approximations, consider grouping the terms in the expected value of Equation (10) according to the number of active sources in the vector s:

<A_id(s, W)>_{q_n} = sum_s p(s | y^(n), Θ') A_id(s, W)   (13)
= sum_a p(s_a | y^(n), Θ') A_id(s_a, W) + sum_{a,b: a<b} p(s_ab | y^(n), Θ') A_id(s_ab, W) + sum_{a,b,c: a<b<c} p(s_abc | y^(n), Θ') A_id(s_abc, W) + ...,

where s_a := (0,...,0,1,0,...,0) with only s_a = 1, s_ab := (0,...,0,1,0,...,0,1,0,...,0) with only s_a = 1 and s_b = 1 (a ≠ b), and s_abc etc. are defined analogously. Note that A_id(0, W) = 0 because of (7) and (8). Now, each of the conditional probabilities p(s | y^(n), Θ') implicitly contains a similar sum over s for normalization:

p(s | y^(n), Θ') = (1/Z) p(s, y^(n) | Θ'),   Z := sum_s p(s, y^(n) | Θ'),   (14)

and the terms of this sum may be grouped in the same way:

Z = p(0, y^(n) | Θ') + sum_a p(s_a, y^(n) | Θ') + sum_{a,b: a<b} p(s_ab, y^(n) | Θ') + sum_{a,b,c: a<b<c} p(s_abc, y^(n) | Θ') + ....

Combining (13) and (14) yields:

<A_id(s, W)>_{q_n} = [ sum_a p(s_a, y^(n) | Θ') A_id(s_a, W) + sum_{a,b: a<b} p(s_ab, y^(n) | Θ') A_id(s_ab, W) + ... ] / [ p(0, y^(n) | Θ') + sum_a p(s_a, y^(n) | Θ') + sum_{a,b: a<b} p(s_ab, y^(n) | Θ') + ... ].   (15)

A similar grouping of terms is possible for the expectation <s_h>_{q_n}. If we now assume that the significant posterior probability mass will concentrate on vectors s with only a limited number of non-zero entries, the expanded sums in both numerator and denominator of (15) may be truncated without significant loss. The accuracy of the approximation depends both on the sparsity of the true generative process and on the distance of the current model parameters (in the current EM iteration) from the true ones. In general, provided that the true process is indeed sparse, a truncated approximation will become more accurate as the estimated parameters approach their maximum likelihood values. The convergence properties and accuracy of algorithms based on this form of approximation will be tested numerically in Section 6. Different choices of the truncation yield approximate algorithms with different properties. Two of these will be considered here.
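A sketch of the truncated expectation (15) (again an assumed implementation, reusing A() and log_joint() from the sketch above): only hidden vectors with at most three active sources are enumerated, so the cost grows polynomially rather than exponentially in H:

# Illustrative sketch: truncated sums of Eq. (15) over sparse hidden vectors.
import itertools
import numpy as np

def sparse_states(H, max_active=3):
    """All binary vectors s with 1 <= sum(s) <= max_active (the zero vector
    contributes nothing to the numerator since A_id(0, W) = 0)."""
    states = []
    for k in range(1, max_active + 1):
        for idx in itertools.combinations(range(H), k):
            s = np.zeros(H, dtype=int); s[list(idx)] = 1
            states.append(s)
    return states

def truncated_EA(y, pi, W, max_active=3):
    """Approximate <A_id(s, W)>_{q_n} by truncating both sums in Eq. (15)."""
    H = W.shape[0]
    s0 = np.zeros(H, dtype=int)
    states = sparse_states(H, max_active)
    lj = np.array([log_joint(s, y, pi, W) for s in [s0] + states])
    p = np.exp(lj - lj.max())                     # unnormalized joints p(s, y | Theta)
    num = sum(ps * A(s, W) for ps, s in zip(p[1:], states))
    return num / p.sum()                          # denominator includes p(0, y | Theta)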

4.1 MCA_3

In the first approximation, we truncate all but one of the sums that appear in the expansions of <A_id(s, W)>_{q_n} and <s_i>_{q_n} after the terms that include three active sources, while truncating the numerator of <A_id(s, W)>_{q_n} after the two-source terms (see Appendix C for details):

<A_id(s, W)>_{q_n} ≈ [ π̃_i exp(I_i^(n)) + π̃_i sum_{c (c≠i)} π̃_c exp(I_ic^(n)) H(W_id − W_cd) ] / [ 1 + sum_h π̃_h exp(I_h^(n)) + sum_{a,b: a<b} π̃_a π̃_b exp(I_ab^(n)) + sum_{a,b,c: a<b<c} π̃_a π̃_b π̃_c exp(I_abc^(n)) ]   (16)

and

<s_i>_{q_n} ≈ [ π̃_i exp(I_i^(n)) + π̃_i sum_{c (c≠i)} π̃_c exp(I_ic^(n)) + (α/2) π̃_i sum_{b,c (b≠c, b,c≠i)} π̃_b π̃_c exp(I_ibc^(n)) ] / [ 1 + sum_h π̃_h exp(I_h^(n)) + sum_{a,b: a<b} π̃_a π̃_b exp(I_ab^(n)) + sum_{a,b,c: a<b<c} π̃_a π̃_b π̃_c exp(I_abc^(n)) ],   (17)

where

π̃_i = π_i / (1 − π_i),   I_i^(n) = sum_d ( log(W_id) y_d^(n) − W_id ),
W̄_d^ab = max(W_ad, W_bd),   I_ab^(n) = sum_d ( log(W̄_d^ab) y_d^(n) − W̄_d^ab ),
W̄_d^abc = max(W_ad, W_bd, W_cd),   I_abc^(n) = sum_d ( log(W̄_d^abc) y_d^(n) − W̄_d^abc ),   (18)

and where H(x) = 1 for x > 0, 1/2 for x = 0, and 0 for x < 0 is the Heaviside function. The above equations have been simplified by dividing both numerator and denominator by terms that do not depend on s, for example by prod_{i=1}^H (1 − π_i) (see Appendix C). Approximations (16) and (17) are used in the fixed-point updates of Equations (9) and (11), where the parameters that appear on the right-hand-side are held at their current values. Thus all parameters that appear on the right-hand-side of the approximations take values in Θ' = (π', W').

The early truncation of the numerator in (16) improves performance in experiments, partly by increasing competition between causes further, and partly by reducing the contribution of more complex data patterns that are better fit, given the current parameter settings, by three active sources than by two. By contrast, the three-source terms are kept in the numerator of (17). In this case, neglecting complex input patterns as in (16) would lead to greater errors in the estimated source activation probabilities π_i. Indeed, even while keeping these terms, the π_i tend to be underestimated if the input data include many patterns with more than three active sources. To compensate, we introduce a factor of α > 1 multiplying the three-source term in (17) (so that α = 1 corresponds to the actual truncated sum), which is updated as described in Appendix C. This scheme yields good estimates of π_i, even if more than three sources are often active in the input data.

The M-step Equations (9) and (11) together with E-step approximations (16) and (17) represent a complete set of update equations for the MCA generative model. The computational cost of one parameter update grows polynomially in the total number of causes, with order H^3. The algorithm that is defined by these updates will therefore be referred to as MCA_3.
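The quantities in Equation (18) are the only data-dependent terms needed by (16) and (17); the following small sketch (assumed code, not from the paper) computes the one- and two-source scores for a single observation:

# Illustrative sketch: the scores I_i and I_ab of Equation (18).
import numpy as np

def poisson_score(y, rates):
    """sum_d ( log(rate_d) * y_d - rate_d ), with 0 * log(0) treated as 0."""
    with np.errstate(divide='ignore', invalid='ignore'):
        term = np.where(y > 0, np.log(rates) * y, 0.0)
    return (term - rates).sum()

def scores(y, W):
    """Return I_i (shape H) and I_ab (shape H x H, -inf on the diagonal)."""
    H = W.shape[0]
    I1 = np.array([poisson_score(y, W[i]) for i in range(H)])
    I2 = np.full((H, H), -np.inf)
    for a in range(H):
        for b in range(a + 1, H):
            I2[a, b] = I2[b, a] = poisson_score(y, np.maximum(W[a], W[b]))  # bar W_d^ab
    return I1, I2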

4.2 R-MCA_2

In the second place, we consider a restriction of the generative model in which (i) all s_h are distributed according to the same prior distribution with fixed parameter π; (ii) the weights W_id associated with each source variable i are constrained to sum to a constant C:

for all i in {1,...,H}:   π_i = π   and   sum_d W_id = C;   (19)

and (iii) on average, the influence of each hidden source is homogeneously covered by the other sources. This third restriction means that each non-zero generating weight W_id^gen associated with cause i can be covered by the same number of weights W_cd^gen ≥ W_id^gen:

W_id^gen > 0  ⇒  sum_{c≠i} H(W_cd^gen − W_id^gen) = b_i,   (20)

where H is the Heaviside function and b_i is the number of causes that can cover cause i. Figure 3 illustrates this condition.

Figure 3: A and B show patterns of weights that satisfy the uniformity condition (20) whereas weights in C violate it. Each hidden cause is symbolized by an ellipse, with the gray-level of the ellipse representing the value W_id of each weight within the ellipse. Weights outside the ellipse for each cause are zero (black). The black squares indicate the 4-by-4 grid of observed pixels.

Figure 3A,B show weight patterns associated with hidden causes for which the condition is fulfilled; for instance, in Figure 3B, b_i = 0 for all causes with horizontal weight patterns, while b_i = 1 for the vertically oriented cause. In Figure 3C the condition is violated. Roughly, these conditions guarantee that all hidden causes have equal average effects on the generated data vectors. They make the development of a more efficient approximate learning algorithm possible but, despite their role in the derivation, the impact of these assumptions is limited in practice, in the sense that the resulting algorithm can perform well even when the input data set violates assumptions (19) and (20). This is demonstrated in a series of numerical experiments detailed below.

Update rules for the restricted generative model can again be derived by approximate expectation-maximization (see Appendix C). Using both the sum constraint of (19) and the assumption of homogeneous coverage of causes, we obtain the M-step update:

W_id = C ( sum_n <A_id(s, W)>_{q_n} y_d^(n) ) / ( sum_{d'} sum_n <A_id'(s, W)>_{q_n} y_d'^(n) ).   (21)
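The sum-constrained M-step (21) is straightforward once responsibilities are available; a minimal sketch (assumed code) is given below, where EA[n, i] stands for <A_id(s, W)>_{q_n}, which in the restricted model is the same for every d and could be computed, for example, with truncated_EA() above or with the closed-form approximation derived below:

# Illustrative sketch: the sum-constrained fixed-point update of Eq. (21).
import numpy as np

def r_mca2_m_step(Y, EA, C):
    """Y: (N, D) data, EA: (N, H) responsibilities, C: row-sum constraint."""
    num = EA.T @ Y                                    # sum_n <A_id>_{q_n} y_d^(n), shape (H, D)
    return C * num / num.sum(axis=1, keepdims=True)   # each row of W again sums to C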

Empirically, we find that the restricted parameter space of this model means that we can approximate the sufficient statistics <A_id(s, W)>_{q_n} by a more severe truncation than before, now keeping two-source terms in the denominator, but only single-source terms in the numerator, of the expansion (15). This approximation, combined with the fact that any zero-valued observed patterns (i.e., those with y^(n) = 0) do not affect the update rule (21) and so can be neglected, yields the expression (see Appendix C):

<A_id(s, W)>_{q_n} ≈ exp(I_i^(n)) / ( sum_h exp(I_h^(n)) + (π̃/2) sum_{a,b: a≠b} exp(I_ab^(n)) ),   π̃ := π / (1 − π),   (22)

with abbreviations given in (18). Equations (21) and (22) are update rules for the MCA generative model, subject to the conditions (19) and (20). They define an algorithm that we will refer to as R-MCA_2, with R for restricted and with 2 indicating a computational cost that grows quadratically with H.

5. Relation to Neural Networks

We now relate component extraction as learned within the MCA framework to that achieved by a family of artificial neural networks. Consider the network of Figure 4, which consists of D input variables (or units) with values y_1,...,y_D and H hidden units with values g_1,...,g_H. An observation y is represented by the values (or activities) of the input units, which act through connections parameterized by (W_id) to determine the activities of the hidden units through an activation function g_i = g_i(y, W). These parameters (W_id) are known as the network (or synaptic) weights.

Figure 4: Architecture of a two-layer neural network. Input is represented by values y_1 to y_D of D input units (small black circles). These values combine with synaptic weights W to determine the activities of the hidden units g_1 to g_H (big black circles). The dotted horizontal arrows symbolize lateral information exchange that may be required to compute the functions g_1 to g_H. After the g_i are computed, the parameters (W_id) are modified using a Δ-rule.

Learning in such a neural network involves adjusting the weights W in response to a series of input patterns, using a rule that is heuristically designed to extract some form of structure from these inputs. A standard choice is the Hebbian Δ-rule with divisive normalization:

ΔW_id = ε g_i(y, W) y_d   and   W_id^new = C (W_id + ΔW_id) / sum_{d'} (W_id' + ΔW_id').   (23)

The normalization step is needed to prevent weights from growing without bound, and the divisive form used here is most common. Here, the constant C defines the value at which sum_d W_id is held constant; it will be related below to the C appearing in Equation (19). Many neural networks with the structure depicted in Figure 4, and that use a learning rule identical or similar to (23), have been shown to converge to weight values that identify clusters in, or extract useful components from, a set of input patterns (O'Reilly, 2001; Spratling and Johnson, 2002; Yuille and Geiger, 2003; Lücke and von der Malsburg, 2004; Lücke, 2004; Lücke and Bouecke, 2005; Spratling, 2006). The update rule (23) depends on only one input pattern, and is usually applied online, with the weights being changed in response to each pattern in turn. If, instead, we consider the effect of presenting a group of patterns {y^(n)}, the net change is approximately (see Appendix D):

W_id^new ≈ C ( sum_n g_i(y^(n), W) y_d^(n) ) / ( sum_{d'} sum_n g_i(y^(n), W) y_d'^(n) ).   (24)

Now, comparing (24) to (21), we see that if the activation function of a neural network were chosen so that g_i(y^(n), W) = <A_id(s, W)>_{q_n}, then the network would optimize the parameters of the restricted MCA generative model, with W = W' (we drop the distinction between W and W' from now on). Unfortunately, the expectation <A_id(s, W)>_{q_n} depends on d, and thus exact optimization in the general case would require a modified Hebbian rule. However, the truncated approximation of (22) is the same for all d, and so the changes in each weight depend only on the activities of the corresponding pre- and post-synaptic units. Thus, the Hebbian Δ-rule,

ΔW_id = ε g_i y_d   with   g_i = exp(I_i) / ( sum_h exp(I_h) + (π̃/2) sum_{a,b: a≠b} exp(I_ab) )   (25)

(where I_h, I_ab, and π̃ are the abbreviations introduced in Equations 18 and 22), when combined with divisive normalization, implements an online version of the R-MCA_2 algorithm. We refer to this online weight update rule as R-MCA_NN (for Neural Network).

Note that the function g_i in (25) resembles the softmax function (see, e.g., Yuille and Geiger, 2003), but contains an additional term in the denominator. This added term reduces the change in weights when an input pattern results in more than one hidden unit with significant activity. That is, the system tries to explain a given input pattern using the current state of its model parameters W. If one hidden unit explains the input better than any combination of two units, that unit is modified. If the input is better explained by a combination of two units, the total learning rate is reduced. Soft winner-take-all (WTA) activation functions, such as the softmax, are found in many networks that serve to both cluster and extract components from inputs, as appropriate. For clustering, the relationship between WTA-like competition and maximum-likelihood methods is well known (Nowlan, 1990). The connection drawn here offers a probabilistic account of the effectiveness of similar rules for component identification. If the probability of more than one cause being active is small (i.e., π is small), our activation rule for g_i (25) reduces to the standard softmax, suggesting that neural networks with activation and learning functions that resemble Equations (25) may perform well at both component extraction and clustering.
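The following sketch (assumed code, reusing scores() from the earlier sketch) shows one online R-MCA_NN step: the generalized softmax of Equation (25) followed by the Hebbian Δ-rule and divisive normalization of Equation (23):

# Illustrative sketch: one online R-MCA_NN update for a single pattern y.
import numpy as np

def r_mca_nn_step(y, W, pi, C, eps=0.1):
    """W: (H, D) weights with rows summing to C; pi: scalar prior probability."""
    pi_t = pi / (1.0 - pi)
    I1, I2 = scores(y, W)                              # I_i and I_ab of Eq. (18)
    m = I1.max()                                       # subtract for numerical stability
    g = np.exp(I1 - m) / (np.exp(I1 - m).sum()
                          + 0.5 * pi_t * np.exp(I2 - m).sum())   # Eq. (25); pair sum counts twice
    W = W + eps * g[:, None] * y[None, :]              # Hebbian delta-rule, Eq. (23)
    return C * W / W.sum(axis=1, keepdims=True)        # divisive normalization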

6. Experiments

The MCA generative model, along with the associated learning algorithms that have been introduced here, is designed to extract component features from non-linear mixtures. To study their performance, we employ numerical experiments, using artificial as well as more realistic data. The artificial data sets are based on a widely-used benchmark for non-linear component extraction, while the more realistic data are taken from acoustic recordings in one case and from natural images in the other. The goals of these experiments are (1) to establish whether the approximate algorithms do indeed increase the likelihood of the model parameters; (2) to test convergence and asymptotic accuracy of the algorithms; (3) to compare component extraction using MCA to other component-extraction algorithms; and (4) to demonstrate the applicability of the model and algorithms to more realistic data where non-linear component combinations arise naturally.

6.1 The Bars Test

The data sets used in experiments on artificial data were drawn from variants of the bars test introduced by Földiák (1990). Each data vector represents a grayscale image, with a non-linear combination of randomly chosen horizontal and vertical light-colored bars, each extending all the way across a black background. Most commonly, the intensity of the bars is uniform and equal, and the combination rule is such that overlapping regions remain at the same intensity. This type of data is a benchmark for the study of component extraction with non-linear interactions between hidden causes. Many component-extraction algorithms have been applied to a version of the bars test, including some with probabilistic generative semantics (Saund, 1995; Dayan and Zemel, 1995; Hinton et al., 1995; Hinton and Ghahramani, 1997), as well as many with non-generative objective functions (Harpur and Prager, 1996; Hochreiter and Schmidhuber, 1999; Lee and Seung, 2001; Hoyer, 2004), a substantial group of which have been neurally inspired (Földiák, 1990; Fyfe, 1997; O'Reilly, 2001; Charles et al., 2002; Spratling and Johnson, 2002; Lücke and von der Malsburg, 2004; Lücke and Bouecke, 2005; Spratling, 2006; Butko and Triesch, 2007).

In most of the experiments described here the input data were 25-dimensional vectors, representing a 5-by-5 grid of pixels; that is, D = 5×5. There were b possible single bars, some of which were superimposed to create each image. On the 5-by-5 grid there are 5 possible horizontal and 5 vertical bar positions, so that b = 10. Each bar appears independently with a probability π, with areas of overlap retaining the same value as the individual bars. Figure 5A shows an example set of noisy data vectors constructed in this way.

6.2 Annealing

The likelihood surface for the MCA generative model is potentially multimodal. Thus, hill-climbing algorithms based on EM may converge to local optima in the likelihood, which may well be considerably poorer than the global optimum. This tendency to find sub-optimal fixed points can be reduced by incorporating a deterministic annealing, or relaxation, procedure (Ueda and Nakano, 1998; Sahani, 1999), whereby the entropy of the posterior distribution in the free energy (4) is artificially inflated in early iterations, with this inflation progressively reduced in later iterations, under the control of a temperature parameter. All of the experiments discussed here incorporate deterministic annealing, the details of which are given in Appendix E.
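A sketch (assumed code, not from the paper) of a generator for the standard bars test described above, with b = 10 bars on a 5-by-5 grid, each present independently with probability 2/10, overlaps keeping the bar intensity, and optional Poisson observation noise:

# Illustrative sketch: generating bars-test data.
import numpy as np

def make_bars_data(N=500, size=5, p_bar=2/10, intensity=10.0, poisson=True, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    b = 2 * size                                      # horizontal + vertical bars
    bars = np.zeros((b, size, size))
    for i in range(size):
        bars[i, i, :] = intensity                     # horizontal bars
        bars[size + i, :, i] = intensity              # vertical bars
    bars = bars.reshape(b, -1)                        # (b, D) with D = size * size
    S = (rng.random((N, b)) < p_bar).astype(int)      # which bars are present
    Y = np.max(S[:, :, None] * bars[None, :, :], axis=1)  # overlaps keep the bar intensity
    if poisson:
        Y = rng.poisson(Y)
    return S, Y, bars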

Figure 5: Bars test data with b = 10 bars on D = 5×5 pixels and a bar appearance probability of π = 2/10. A 24 patterns from the set of N = 500 input patterns that were generated according to the generative model with Poisson noise. B Change of the parameters W if MCA_3 is used for parameter update. Learning stopped automatically after 108 iterations in this trial (see Appendix E).

6.3 Convergence

From a theoretical standpoint, none of the four algorithms MCA_ex, MCA_3, R-MCA_2, or R-MCA_NN can be guaranteed to maximize the likelihood of the MCA generative model. All of them update the parameters in the M-step using a fixed-point iteration, rather than either maximization or a gradient step. All but MCA_ex also approximate the posterior sufficient statistics (10). Thus, our first numerical experiments are designed to verify that the algorithms do, in fact, increase parameter likelihood in practice, and that they do converge. For this purpose, it is appropriate to use a version of the bars test in which observations are generated by the MCA model. Thus, we selected MCA parameters that generate noisy bar-like images. There were 10 hidden sources in the generating model, one corresponding to each bar. The associated matrix of generating weights, W^gen, was 10×25, with each row representing a horizontal or vertical bar in a 5-by-5 pixel grid. The weights W_id^gen that correspond to the pixels of the bar were set to 10, the others to 0, so that sum_d W_id^gen = 50. Each source was active with probability π_i^gen = 2/10, leading to an average of two bars appearing in each image.

We generated N = 500 input patterns (each with 25 elements) using Equations (1) to (3); a subset of the resulting patterns is displayed in Figure 5A.

Figure 6: Change of the MCA parameter likelihood under different MCA learning algorithms. Data were generated as in Figure 5. To allow for comparison, the same set of N = 500 input patterns was used for all experiments shown. The likelihood of the generating parameters (W^gen, π^gen) is shown by the dotted horizontal line. The main axes show likelihood values of the batch-mode algorithms MCA_ex, MCA_3, and R-MCA_2 as a function of EM iteration. The inset axes show likelihood values of the online algorithm R-MCA_NN as a function of the number of input pattern presentations. Patterns were randomly selected from the set of N = 500 inputs, and the parameters were updated for each pattern.

Figure 6 shows the evolution of parameter likelihoods, as a function of iteration, for each of the MCA algorithms, with 5 different choices of initial parameters for each. With the exception of the first few iterations of R-MCA_2, the likelihood of the parameters under the batch-mode algorithms increased at almost every iteration. The online R-MCA_NN showed greater fluctuations, as updates based on individual data vectors inevitably perturbed the parameter estimates. As might be expected, given the observation of increasing likelihoods and the fact that the likelihoods are bounded, each algorithm eventually converged from each initial value used in Figure 6. Furthermore, in each case, the likelihood of the solution found was close to the likelihood of the actual weights used in generation (the dashed horizontal lines). The final likelihood values for MCA_ex were slightly higher than the likelihoods of (W^gen, π^gen), as is expected for an exact maximum-likelihood algorithm in noisy data; whereas the values achieved by the approximations MCA_3 and R-MCA_2 were slightly lower. In fact, in 100 further experiments, the annealing and parameter initialization schemes described in Appendix E brought the likelihood close to that of the generating weights in 98 of 100 runs using R-MCA_2 and in 90 of 100 runs using MCA_3.

We did not run these more extensive tests for MCA_ex due to its long running time (it is also omitted from similar quantitative analyses below). The two basic observations, that likelihoods generally increase at each iteration and that the batch-mode algorithms all reliably converge, held true for all of the experiments described here and below, even where data were not generated from a version of the MCA model. Thus, we conclude that these algorithms are generally robust in practice, despite the absence of any theoretical guarantees.

6.4 Parameter Recovery

Figure 5B shows the evolution of the parameters W under the approximate MCA_3 algorithm, showing that the estimated W_id indeed converge to values close to the generating parameters W^gen, as was suggested by the convergence of the likelihood to values close to that of the generative parameters. While not shown, the convergence of W under MCA_ex, R-MCA_2 or R-MCA_NN was qualitatively similar to this sequence. Clearly, if MCA_ex finds the global optimum, we would expect the parameters found to be close to those used for generation. The same is not necessarily true of the approximate algorithms. However, both MCA_3 and R-MCA_2 did in fact find weights W that were very close to the generating values whenever an obviously poor local optimum was avoided. In MCA_3 the average pixel intensity of a bar was estimated to be 10.0 ± 0.5 (standard deviation), taken across all bar pixels in 90 trials where the likelihood increased to a high value. Using R-MCA_2 this value was estimated to be 10.0 ± 0.8 (across all bar pixels in 98 high-likelihood trials). Note that the Poisson distribution (2) results in a considerable variance of bar pixel intensities around the mean of 10.0 (compare Figure 5A), which explains the high standard deviation around the relatively precise mean value. The background pixels (original value zero) are estimated to have an intensity of 0.05 ± 0.02 in MCA_3 and are all virtually zero in R-MCA_2.

MCA_3 also estimates the parameters π. Because of the finite number of patterns (N = 500), we compared the estimates with the actual frequency of occurrence of each bar i: π̂_i = (number of occurrences of bar i in the input)/N. The mean absolute difference between the estimated π_i and the actual probability π̂_i was small (across the 90 trials with high likelihood), which demonstrates the relative accuracy of the solutions, despite the approximation made in Equation (17).

For the neural network algorithm R-MCA_NN given by (25), we observed virtually the same behavior as for R-MCA_2 when using a small learning rate (e.g., ε = 0.1) and the same cooling schedule in both cases (see Lücke and Sahani, 2007). The additional noise introduced by the online updates of R-MCA_NN had only a negligible effect. For larger learning rates the situation was different, however. For later comparison to noisy neural network algorithms, we used a version of R-MCA_NN with a relatively high learning rate of ε = 1.0. Furthermore, instead of a cooling schedule, we used a fixed temperature T = 16 and added Gaussian noise (σ = 0.02) at each parameter update: ΔW_id = ε g_i y_d + σ η. With these learning parameters, R-MCA_NN learned very rapidly, requiring fewer than 1000 pattern presentations in the majority of trials. Ten plots of likelihoods against the number of presented patterns are shown for R-MCA_NN in Figure 6 (inset axes, black lines) for the same N = 500 patterns as used for the batch-mode algorithms. Because of the additional noise in W, the final likelihood values were somewhat lower than those of the generating weights. Using R-MCA_NN with the same parameters but without added noise (σ = 0), final likelihood values were often higher (inset axes, gray lines) but the algorithm also converged to local optima more often.

In contrast, R-MCA_NN with noise avoided local optima in all 100 trials. In the following, R-MCA_NN will therefore refer to the noisy version with σ = 0.02 unless otherwise stated.

Table 1: Comparison of MCA algorithms with other systems in the standard bars test with b = 10 bars (D = 5×5, π = 2/10, N = 500). For the MCA algorithms, reliability values are computed on the basis of 100 trials. Values for these algorithms are also given for the same bars test with Poisson noise. Reliability values for the other systems are taken from the literature. For instance, the model of Hochreiter and Schmidhuber (1999) is reported to fail to extract all bars in one of 25 trials. Two systems, back-propagation (BP) and GeneRec, that are described by O'Reilly (2001) have also been applied to this bars test. In their standard versions, BP and GeneRec achieve 10% and 60% reliability, respectively. Hochreiter and Schmidhuber (1999) report that ICA and PCA extract only subsets of all bars. Trained without bar overlap.

  Model      Reliability (noisy)   Reliability (no noise)
  MCA_3      90%                   81%
  R-MCA_2    98%                   96%
  R-MCA_NN   >99%                  >99%

  Model        Reliability (no noise)   Reference
  noisy-or     27%                      Saund, 1995
  competitive  69%                      Dayan and Zemel, 1995
  LOCOCODE     96%                      Hochreiter and Schmidhuber, 1999

6.5 Comparison to Other Algorithms — Noiseless Bars

To compare the component extraction results of MCA to those of other algorithms reported in the literature, we used a standard version of the bars benchmark test, in which the bars appear with no noise. The competing algorithms do not necessarily employ probabilistic semantics, and may not be explicitly generative; thus, we cannot compare performance in terms of likelihoods, nor in terms of the accuracy with which generative parameters are recovered. Instead, we adopt a commonly used measure, which asks how reliably all the different bars are identified (see, e.g., Hochreiter and Schmidhuber, 1999; O'Reilly, 2001; Spratling and Johnson, 2002; Lücke and von der Malsburg, 2004; Spratling, 2006). For each model, an internal variable (say the activities of the hidden units, or the posterior probabilities of each source being active) is identified as the response to an image. The responses evoked in the learned model by each of the possible single-bar images are then considered, and the most active unit or most probable source corresponding to each bar is identified. If the mapping from single-bar images to the most active internal variable is injective, that is, for each single bar a different hidden unit or source is the most active, then this instance of the model is said to have represented all of the bars. The reliability is the frequency with which each model represents all possible bars, when started from random initial conditions, and given a random set of images generated with the same parameter settings. For the MCA algorithms, the responses are defined to be the approximate posterior values for each possible source vector with only one active source, evaluated at the final parameter values after learning: q(s_h; Θ) ≈ p(s_h | y_bar, W).

The reliabilities of MCA_3, R-MCA_2, and R-MCA_NN as well as some other published component-extraction algorithms are shown in Table 1. These experiments used a configuration of the bars test much as above (D = 5×5, b = 10, and π^gen = 2/10), which is perhaps the most commonly used in the literature (e.g., Saund, 1995; Dayan and Zemel, 1995; Hochreiter and Schmidhuber, 1999; O'Reilly, 2001).

The bars have a fixed and equal gray-value. We generated N = 500 patterns according to these settings and normalized the input patterns y^(n) to lie in the interval [0,10] (i.e., bar pixels have a value of 10 and the background is 0). We considered both the case with Poisson noise (which has been discussed above) and the standard noiseless case. Experiments were run starting from 100 different randomly initialized parameters W. The same algorithms and the same cooling schedule were used (the same fixed T in the case of R-MCA_NN) to fit patterns with and without noise. Without noise, MCA_3 with H = 10 hidden variables found all 10 bars in 81 of 100 experiments. R-MCA_2 with H = 10 found all bars in 96 of 100 experiments. Using the criterion of reliability, R-MCA_NN performed best and found all bars in all 100 of 100 experiments. This seems likely to result from the fact that the added Gaussian noise, as well as noise introduced by the online updates, combine to drive the system out of shallow optima. Furthermore, R-MCA_NN was, on average, faster than MCA_3 and R-MCA_2 in terms of required pattern presentations. It took fewer than 1000 pattern presentations to find all bars in the majority of 100 experiments,(1) although in a few trials learning did take much longer. On the other hand, MCA_3 and R-MCA_2 achieved better likelihoods and recovered generative parameters closer to the true values. These algorithms also have the advantage of a well-defined stopping criterion. MCA_3 learns the parameters of the prior distribution whereas R-MCA_2 uses a fixed value. R-MCA_2 does, however, remain highly reliable, even when the fixed parameter π differs significantly from the true value π^gen.

Figure 7: A common local optimum found by MCA_3 in the standard bars test. Two weight patterns reflect the same hidden cause, while another represents the superposition of two causes.

As was the case for the noisy bars, the R-MCA algorithms avoided local optima more often. This may well be a result of the smaller parameter space associated with the restricted model. A common local optimum for MCA_3 is displayed in Figure 7, where the weights associated with two sources generate the same horizontal bar, while a third source generates a weaker combination of two bars. This local solution is suboptimal, but the fact that MCA_3 has parameters to represent varying probabilities for each cause being present means that it can adjust the corresponding rates to match the data. The fixed setting of π for R-MCA would introduce a further likelihood penalty for this solution.

Many component-extraction algorithms, particularly those based on artificial neural networks, use models with more hidden elements than there are distinct causes in the input data (e.g., Charles et al., 2002; Lücke and von der Malsburg, 2004; Spratling, 2006). If we used H = 12 hidden variables, then all the MCA algorithms (MCA_3, R-MCA_2, and R-MCA_NN) found all of the bars in all of 100 trials.

(1) Note that, according to the definition above, all bars are often already represented at intermediate likelihood values.
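The reliability criterion described in Section 6.5 can be checked mechanically; the following sketch (assumed code, reusing poisson_score() from an earlier sketch) presents each single-bar image, takes the single-source posterior values as responses, and tests whether the resulting map from bars to sources is injective:

# Illustrative sketch: has a learned model represented all bars?
import numpy as np

def represents_all_bars(W, pi, bars):
    """W: (H, D) learned weights, pi: learned prior(s), bars: (b, D) single-bar images."""
    pi_t = pi / (1.0 - pi)
    winners = []
    for y in bars:
        I1 = np.array([poisson_score(y, W[h]) for h in range(W.shape[0])])
        resp = pi_t * np.exp(I1 - I1.max())          # response q(s_h | y_bar), up to a constant
        winners.append(int(np.argmax(resp)))
    return len(set(winners)) == len(bars)             # injective mapping -> all bars represented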

Figure 8: Experiments with increased bar overlap. In A bar overlap is increased by increasing the bar appearance probability to π^gen = 3/10 (an average of three bars per pattern). In B bar overlap is varied using different bar widths (two one-pixel-wide bars and one three-pixel-wide bar for each orientation). In the bars test in C there are 8 (two-pixel-wide) horizontal bars and 8 (two-pixel-wide) vertical bars on a D = 9×9 pixel grid. Each bar appears with probability π^gen = 2/16 (two bars per input pattern on average). Each experimental data set is illustrated by 14 typical input patterns. For A to C the parameters W of a typical trial are shown if MCA_3 is used for learning. The vectors W_i = (W_i1,...,W_iD) appear in order of decreasing learned appearance probability π_i. In D the parameters W for a typical trial using R-MCA_NN are shown.

6.6 Comparison to Other Algorithms — Bar Overlap

For most component-extraction algorithms that have been tested against the bars benchmark, it is difficult to know how specialized they are to the form of this test. The algorithms might, for example, depend on the fact that all bars appear with the same probability, or that they have the same width. Different versions of the bars test have therefore been introduced to probe how generally the different algorithms might succeed. In particular, there has been considerable recent interest in studying robustness to varying degrees of overlap between bars (see, e.g., Lücke and von der Malsburg, 2004; Lücke, 2004; Spratling, 2006). This is because it is the non-linear combination within the regions of overlap that most distinguishes the bars test images from linear superpositions of sources. In three different experiments we varied the degree of overlap in three different ways. Following Spratling (2006), in all experiments the MCA model had twice as many possible sources as there were bars in the generative input. In all experiments we used the same algorithms, initial conditions, and cooling schedules as described above and in Appendix E. Again, each trial used a newly generated set of training patterns and a different randomly generated matrix W. In the following, reliability values are computed on the basis of 25 trials each.

The most straightforward way to increase the degree of bar overlap is to use the standard bars test with an average of three instead of two bars per image, that is, to take π = 3/10 for an otherwise unchanged bars test with b = 10 bars on D = 5×5 pixels (see Figure 8A for some examples). When using H = 20 hidden variables, MCA_3 extracted all bars in 92% of 25 experiments. Thus the algorithm works well even for relatively high degrees of superposition. The values of W found in a typical trial are shown in Figure 8A. The parameters W_i = (W_i1,...,W_iD) that are associated with a hidden variable or unit are sorted according to the learned appearance probabilities π_i. Like MCA_3, both R-MCA_2 and R-MCA_NN were run without changing any parameters. In the restricted case, this meant that the assumed value for the source probability (π = 2/10) was different from the generating value (π^gen = 3/10). Nevertheless, the performance of both algorithms remained better than that of MCA_3, with R-MCA_2 and R-MCA_NN finding all 10 bars in 96% and 100% of 25 trials, respectively.

We can also choose unequal bar appearance probabilities (cf. Lücke and von der Malsburg, 2004). For example, when half the bars appeared with probability π_h^gen = (1+γ)·2/10 and the other half(2) appeared with probability π_h^gen = (1−γ)·2/10, MCA_3 extracted all bars in all of 25 experiments for γ = 0.5. For γ = 0.6 (when half the bars appeared 4 times more often than the other half) all bars were extracted in 88% of 25 experiments. For γ = 0.6, R-MCA_2 and R-MCA_NN found all bars in 96% and 100% of 25 experiments, respectively. Reliability values for R-MCA_NN started to decrease for γ = 0.7 (92% reliability).

As suggested by Lücke and von der Malsburg (2004), we also varied the bar overlap in a second experiment by choosing bars of different widths. For each orientation we used two one-pixel-wide bars and one three-pixel-wide bar. Thus, for this data set, b = 6 and D = 5×5. The bar appearance probability was π = 2/6, so that an input contained, as usual, two bars on average. Figure 8B shows some examples. MCA_3 extracted all bars in 84% of 25 experiments for this test. Reliability values decreased for more extreme differences in the bar sizes. R-MCA_2 and R-MCA_NN both found all bars in all 25 trials each. Thus, although the unequal bar sizes violate the assumption sum_d W_id = C that was made in the derivation of R-MCA_2 and R-MCA_NN, the algorithms' performance in terms of reliability seemed unaffected.

(2) If bars are numbered h = 1 to 5 for the horizontal and h = 6 to 10 for the vertical, we chose the ones with even numbers to appear with the higher probability.

Figure 9: Comparison of MCA_3, R-MCA_2, and R-MCA_NN with other systems in the bars test with increased occlusion (compare Figure 8C and Figure 2). Bars test parameters are D = 9×9, b = 16, π = 2/16, and N = 400. Data for the non-MCA algorithms are taken from Spratling (2006). The bar heights represent the average numbers of extracted bars in 25 trials. Error bars indicate the largest and the lowest number of bars found in a trial. The algorithms NN-DI and DI are feed-forward neural networks of the type depicted in Figure 4. All other (non-MCA) algorithms (NMF_mse, NMF_div, S-NMF, L-NMF, SC-NMF_Y, SC-NMF_AY, SC-NMF_A, NN-SC) are versions of NMF with different objectives and constraints (see Appendix E and Spratling, 2006, for details).

In the third experiment we changed the degree of bar overlap more substantially, using a bars test that included overlapping parallel bars as introduced by Lücke (2004). We used eight horizontal and eight vertical bars, each two pixels wide, on a 9-by-9 grid. Thus, two parallel neighboring bars had substantial overlap. Figure 8C shows some typical input patterns. Note that the introductory example of Figure 2A,B is also of this type. To allow for a detailed comparison with other systems we adopted the exact settings used by Spratling (2006), that is, we considered 25 runs of a system with H = 32 hidden variables using bars test parameters D = 9×9, π^gen = 2/16, and N = 400. For these data, MCA_3 found all 16 bars in all of 25 experiments. The same is true for R-MCA_2, whereas R-MCA_NN missed one bar in one of the 25 trials. Figure 9 shows a quantitative comparison with other algorithms that have been applied to this version of the bars test. Of the ten algorithms studied by Spratling (2006), just one, namely non-negative sparse coding (NN-SC; Hoyer, 2002, with sparseness parameter λ = 1), is as reliable as MCA_3 and R-MCA_2. The other systems, including forms of NMF both with and without a sparseness constraint, fail partly or entirely in extracting the actual hidden causes.

For a typical trial using MCA_3 the final parameters W are displayed in Figure 8C. Again the W_i's associated with the different hidden variables are sorted according to their learned parameters π_i. A qualitatively different set of W_i's was obtained when R-MCA_NN was used for learning. Figure 8D shows a typical outcome (the W_i's are not sorted). In this case, only the actual causes are clearly represented, whereas the W_i's of the supernumerary units remain unspecialized. The same feature is reported by Spratling (2006) for the algorithms NN-DI and DI used in this same test.

Convergence to a representation that contains just the true hidden causes and leaves supernumerary units unspecialized can improve the interpretability of the result. When using a higher fixed temperature for R-MCA_NN, all the hidden units represented bars, with some bars represented by more than one unit. However, hidden units that represented more composite inputs, as seen for MCA_3, were rarely observed. On the other hand, the parameters found by MCA_3 provide an indication of the significance of each weight pattern in the appearance probabilities π_i. Thus, in Figure 8C the appearance probabilities for the first 16 sources are much higher than for the others. The later sources may be interpreted as capturing some of the higher-order structure that results from a finite set of input patterns. In contrast to R-MCA, such higher-order representations need not adversely affect the data likelihood because the corresponding appearance probabilities can be relatively small.

Figure 10: Experiments with more causes and hidden variables than observed variables. A The 12 patterns used to generate the data. Each is a 1-by-2 pixel bar on a 3-by-3 grid (D = 9). B Ten examples of the 500 input patterns generated using the causes shown in A. C Parameters W found in a typical run of MCA_3 with H = 24. The vectors W_i = (W_i1,...,W_iD) appear in order of decreasing learned appearance probability π_i.

6.7 More Causes than Observed Variables

In the experiments described above, the number of hidden causes was always smaller than the number of observed variables. We next briefly studied the over-complete case where data were generated, and models were fit, using more hidden causes than observed variables. We generated N = 500 patterns on a 3-by-3 grid (D = 9), using sparse combinations of 12 hidden causes corresponding to 6 horizontal and 6 vertical bars, each 1-by-2 pixels in size and thus extending across only a portion of the image (Figure 10A). As in the bars tests above, black was assigned a value of 0 and white of 10. Patterns were generated without noise, with an average of two bars appearing in each (π = 2/12). Ten such patterns are shown in Figure 10B.

Figure 10C shows the weights learned during a typical run using MCA_3 with the same parameter settings as above and twice as many hidden variables as observed ones (H = 24). Weights are sorted in order of decreasing inferred appearance probabilities π_i. All 12 causes were identified, with many represented more than once. A few hidden variables, with lower inferred probabilities of appearance, were associated with more composite patterns. MCA_3 extracted all causes in all of 25 trials. R-MCA_2 also extracted all causes in all of 25 trials, and never represented composite patterns. R-MCA_NN only extracted all causes when run at fixed temperatures that were lower than those used for the bars tests above (e.g., T = 3), in which case it did so in all of 25 trials. This requirement for a lower temperature was consistent with the observation that a lower data dimension D leads to a decrease in the critical temperatures associated with the algorithms (see Appendix E). For larger values of T (e.g., T = 16), R-MCA_NN did not extract single causes.

MCA_3 extracted all causes in all of 25 trials. R-MCA_2 also extracted all causes in all of 25 trials, and never represented composite patterns. R-MCA_NN only extracted all causes when run at fixed temperatures that were lower than those used for the bars tests above (e.g., T = 3), in which case it did so in all of 25 trials. This requirement for a lower temperature was consistent with the observation that a lower data dimension D leads to a decrease in the critical temperatures associated with the algorithms (see Appendix E). For larger values of T (e.g., T = 16) R-MCA_NN did not extract single causes.

6.8 Violations of Model Assumptions

To optimize the likelihood of the data under the MCA generative model, each of the approximate learning algorithms relies on the fact that, under the Bernoulli prior (1), some number of the observed data vectors will be generated by only a small number of active sources. To highlight this point we explicitly removed such sparse data vectors from a standard bars test, thereby violating the Bernoulli prior assumption of the generative model. We used bars tests as described above, with b = 10 or b = 16 bars and π = 2/b, generating N = 500 (or more) patterns, in each case by first drawing causes from the Bernoulli distribution (1) and then rejecting patterns in which fewer than m causes were active. As might be expected, when m was 3 or greater the approximate algorithms all failed to learn the weights associated with single causes. However, when only patterns with fewer than 2 bars had been removed, MCA_3 was still able to identify all the bars in many of the runs. More precisely, using data generated as above with b = 10, m = 2, and N = 500, MCA_3 with H = 10 hidden variables found all causes in 69 of 100 trials with noisy observations and in 37 of 100 trials without noise (the parameters for MCA_3 and the associated annealing schedule were unchanged). Note that in these experiments the average number of active causes per input vector is increased by the removal of sparse data vectors. An increase in reliability in the noisy case is consistent with our other experiments. The relatively low reliability seen for noiseless bars in this experiment may be due to the combined violation of both the assumed prior and noise distributions.

As long as the data set did contain some vectors generated by few sources, the learning algorithms could all relatively robustly identify the causes given sufficient data, even when the average observation contained many active sources. For instance, in a standard noiseless bars test with b = 16 bars on an 8 × 8 grid, and N = 1000 patterns with an average of four active causes in each (π = 4/16), all three algorithms still achieved high reliability values, using twice as many hidden variables as actual bars (H = 32), and using the same parameters as for the standard bars test above. MCA_3 found all causes in 20 of 25 trials on these data (80% reliability). Reliabilities of R-MCA_2 and R-MCA_NN (25 trials each) were 76% and 100%, respectively. The reliabilities of all algorithms fell when the data set contained fewer patterns, or when the average number of bars per pattern was larger.
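The rejection step used above to violate the Bernoulli prior is straightforward to emulate. The sketch below is our own illustration (not the original experimental code; the function name is ours): cause vectors are drawn repeatedly from the Bernoulli prior, and only those with at least m active causes are kept, so that the retained data set no longer contains the sparse vectors on which the approximate algorithms rely.

```python
import numpy as np

def sample_causes_min_active(n_causes, pi, N, m, rng):
    """Draw N Bernoulli cause vectors, rejecting any with fewer than m active causes."""
    S = np.empty((N, n_causes), dtype=bool)
    filled = 0
    while filled < N:
        s = rng.random(n_causes) < pi            # one draw from the Bernoulli prior
        if s.sum() >= m:                         # keep only sufficiently dense patterns
            S[filled] = s
            filled += 1
    return S

rng = np.random.default_rng(0)
S = sample_causes_min_active(n_causes=10, pi=2/10, N=500, m=2, rng=rng)
print(S.sum(axis=1).mean())                      # average number of active causes now exceeds 2
```

For b = 10 and π = 2/10, conditioning on at least two active causes raises the average number of active causes per retained pattern above two, which is the effect noted above.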
6.9 Applications to More Realistic Data

We study two examples of component extraction in more realistic settings, applying the MCA algorithms to both acoustic and image data.

Acoustic data. Sound waveforms from multiple different sources combine linearly, and so are conventionally unmixed using algorithms such as ICA applied to simultaneous recordings from multiple microphones. The situation is different, however, for spectrogram representations of natural sound.

Figure 11: Application to acoustic data. A Pressure waveforms of six phonemes spoken by a male voice. Axes here, and for the waveform in C, are as shown for [a ] (A is a normalized amplitude). B The log-spectrograms of the phonemes in A. We used 50 frequency channels and nine time windows (t = 1, ..., 9). Axes of all log-spectrograms in the figure are as shown for [a ]. C Waveform of the linear mixture of phonemes [a ] and [k], and the log-spectrogram of this linear mixture. D Six examples of the N = 500 data vectors that were used for the experiments. Each data vector is the log-spectrogram of a linear mixture of the phoneme waveforms in A. The data sets for the experiments used an average of two waveforms per data vector. E Parameters W found by MCA_3 with H = 12, using 500 mixture log-spectrograms. The parameter vectors W_i = (W_i1, ..., W_iD) appear in order of decreasing learned appearance probability π_i and are linearly scaled to fill the gray scale.

The power of natural sounds in individual time-frequency bins varies over many orders of magnitude, and so is typically measured logarithmically and expressed in units of decibels, giving a representation that is closely aligned with the response of the cochlea to the corresponding sound. In this representation, the combination of log-spectrograms of the different sources may be well approximated by the max rule (R. K. Moore, 1983, quoted by Roweis, 2003). In particular, the logarithmic power distribution, as well as the sub-linear power summation due to phase misalignment, both lead to the total power in a time-frequency bin being dominated by the single largest contribution to that bin (see Discussion).

To study the extraction of components from mixtures of sound by MCA, we based the following experiment on six recordings of phonemes spoken by a male voice (see Figure 11A). The phoneme waveforms were mixed linearly to generate N = 500 superpositions, with each phoneme appearing in each mixture with probability π = 2/6. Thus each mixture comprised two phonemes on average, with a combination rule that resembled the MCA max rule in the approximate sense described above. We applied the MCA algorithms to the log-spectrograms of these mixtures. Figure 11B shows the log-spectrograms of the individual phonemes and Figure 11C shows the log-spectrogram of an example phoneme mixture. We used 50 frequency channels and 9 time bins to construct the log-spectrograms. The resulting values were thresholded and then rescaled linearly so that power levels across all phonemes filled the interval [0, 10], as in the standard bars test. For more details see Appendix E.

The MCA algorithms were used with the same parameter settings as in the bars tests above, except that annealing began at a lower initial temperature (see Appendix E). As in the bars tests with increased overlap, we used twice as many hidden variables (H = 12) as there were causes in the input. Figure 11E shows the parameters W learned in one run using MCA_3. The parameter vectors W_i = (W_i1, ..., W_iD) are displayed in decreasing order of the corresponding learned value of π_i. As can be seen, the first six such vectors converged to spectrogram representations similar to those of the six original phonemes. The six hidden variables associated with lower values of π_i converged to weight vectors that represented more composite spectrograms. This result is representative of those found with MCA_3. R-MCA_2 also converged to single spectrogram representations, but tended to represent those single spectrograms multiple times rather than representing more composite patterns with the additional components. Results for R-MCA_NN were very similar to those for R-MCA_2 when we used a high fixed temperature (see Appendix E for details). For intermediate fixed temperatures, results for R-MCA_NN were similar to those of the bars test in Figure 8D in that each cause was represented just once, with additional hidden units displaying little structure in their weights. For lower fixed temperatures (starting from T ≈ 40) R-MCA_NN failed to represent all causes.

In general, the reliability values of all three algorithms were high. These were measured as described for the bars tests above, by checking whether, after learning, inference based on each individual phoneme log-spectrogram led to a different hidden cause being most probable. MCA_3 found all causes in 21 of 25 trials (84% reliability); R-MCA_2 found all causes in all of 25 trials, as did R-MCA_NN (with fixed T = 70). Reliability for MCA_3 improved to 96% with a slower cooling procedure (θ_W = ; see Appendix E).
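The max-rule approximation for log-spectrograms that motivates this experiment is easy to check numerically. The sketch below is our own illustration rather than part of the paper's pipeline: the two synthetic tones stand in for phoneme recordings, and the spectrogram parameters are illustrative rather than those of Appendix E. It compares the log-magnitude spectrogram of a linear mixture with the element-wise maximum of the components' log-spectrograms.

```python
import numpy as np

def log_spectrogram(x, win=256, hop=128, eps=1e-10):
    """Log-magnitude short-time spectrum with a Hann window (dB up to a constant)."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    mag = np.abs(np.fft.rfft(np.asarray(frames), axis=1))
    return 20.0 * np.log10(mag + eps)

fs = 16000
t = np.arange(fs) / fs
a = np.sin(2 * np.pi * 220.0 * t) * np.exp(-3.0 * t)     # synthetic stand-ins for two sources
b = 0.5 * np.sin(2 * np.pi * 1800.0 * t) * np.exp(-1.0 * t)

La, Lb = log_spectrogram(a), log_spectrogram(b)
Lmix = log_spectrogram(a + b)                             # log-spectrogram of the linear mixture
Lmax = np.maximum(La, Lb)                                 # max-rule prediction

active = Lmax > Lmax.max() - 60.0                         # ignore near-silent bins
print("median |Lmix - Lmax| in active bins: %.2f dB"
      % np.median(np.abs(Lmix - Lmax)[active]))
```

Deviations from the max rule are largest in bins where the two sources have comparable energy; where one source clearly dominates, the two quantities agree to within a few decibels.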
Visual data. Finally, we consider a data set for which the exact hidden sources and their mixing rule are unknown. The data were taken from a single 250-by-250 pixel gray-level image of grass, taken from the van Hateren database of natural images (see Figure 12A and Appendix E).

Figure 12: Application to visual data. A The 250-by-250 pixel image used as the basis for the experiments. The image is taken from the van Hateren database of natural images (see Appendix E). For visualization we have brightened the image (we let values in the lower half of the light intensity range fill the range of gray values from zero to 255 and clamped values in the upper half to the value 255). Without brightening, the image would appear unnaturally dark on a finite gray scale because of a small number of pixels with very high values. B 35 examples taken from the 10-by-10 pixel patches that were used for the numerical experiments. The patches represent light intensities linearly. For visualization, each patch has been scaled to fill the range of gray values. C Parameters W resulting from a typical run of R-MCA_2 with H = 50 hidden variables and N = 5000 image patches. For visualization, each parameter vector W_i = (W_i1, ..., W_iD) has been linearly scaled to fill the range of gray values. D Patches generated using the restricted generative model and weights as in C (patches have been scaled as in B and C).
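To make the preprocessing behind Figure 12 concrete, the following sketch is our own illustration (not the authors' implementation): extract_patches samples random 10-by-10 patches of linear light intensities from an image, and generate_patches draws patterns from learned weights via the max combination rule, as in Figure 12D. The single scalar pi is a placeholder for the appearance probabilities, and observation noise is omitted.

```python
import numpy as np

def extract_patches(image, patch=10, n_patches=5000, seed=0):
    """Sample n_patches random square patches (patch x patch pixels) of linear intensities."""
    rng = np.random.default_rng(seed)
    rows, cols = image.shape
    ys = rng.integers(0, rows - patch + 1, size=n_patches)
    xs = rng.integers(0, cols - patch + 1, size=n_patches)
    return np.stack([image[y:y + patch, x:x + patch].ravel()
                     for y, x in zip(ys, xs)])            # shape (n_patches, patch * patch)

def generate_patches(W_learned, pi, n=35, seed=0):
    """Generate patches from learned weights with the max rule (noise omitted)."""
    rng = np.random.default_rng(seed)
    H = W_learned.shape[0]
    S = rng.random((n, H)) < pi                           # Bernoulli cause vectors
    return np.max(S[:, :, None] * W_learned[None, :, :], axis=1)
```

Applied to the grass image, extract_patches would yield a training matrix of the same shape as that used here (N = 5000 vectors of dimension D = 100) for fitting R-MCA_2 with H = 50 hidden variables.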
