On Sparsity by NUV-EM, Gaussian Message Passing, and Kalman Smoothing

Hans-Andrea Loeliger, Lukas Bruderer, Hampus Malmberg, Federico Wadehn, and Nour Zalmai
ETH Zurich, Dept. of Information Technology & Electrical Engineering

Abstract—Normal priors with unknown variance (NUV) have long been known to promote sparsity and to blend well with parameter learning by expectation maximization (EM). In this paper, we advocate this approach for linear state space models for applications such as the estimation of impulsive signals, the detection of localized events, smoothing with occasional jumps in the state space, and the detection and removal of outliers. The actual computations boil down to multivariate-Gaussian message passing algorithms that are closely related to Kalman smoothing. We give improved tables of Gaussian-message computations from which such algorithms are easily synthesized, and we point out two preferred such algorithms.

I. INTRODUCTION

This paper is about two topics: 1) a particular approach to modeling and estimating sparse parameters based on zero-mean normal priors with unknown variance (NUV); 2) multivariate-Gaussian message passing (variations of Kalman smoothing) in such models. The main point of the paper is that these two things go very well together and combine to a versatile toolbox. This is not entirely new, of course, and the body of related literature is large. Nonetheless, the specific perspective of this paper has not, as far as known to these authors, been advocated before.

Concerning the second topic, linear state space models continue to be an essential tool for a broad variety of applications, cf. [1]-[4]. The primary algorithms for such models are variations and generalizations of Kalman filtering and smoothing, or, equivalently, multivariate-Gaussian message passing in the corresponding factor graph [5], [6] or similar graphical model [1]. A variety of such algorithms can easily be synthesized from tables of message computations as in [6]. In this paper, we give a new version of these tables with many improvements over those in [6], and we point out two preferred such algorithms.

Concerning the first topic, NUV priors (zero-mean normal priors with unknown variance) originated in Bayesian inference [7]-[9]. The sparsity-promoting nature of such priors is the basis of automatic relevance determination (ARD) and sparse Bayesian learning developed by Neal [9], Tipping [10], [11], Wipf et al. [12], [13], and others.

The basic properties of NUV priors are illustrated by the following simple example. Let U be a variable (or parameter) of interest, which we model as a zero-mean real scalar Gaussian random variable with unknown variance $s^2$. Assume that we observe $Y = U + Z$, where the noise Z is zero-mean Gaussian with known variance $\sigma^2$ and independent of U. The maximum likelihood (ML) estimate of $s^2$ from a single sample $Y = \mu \in \mathbb{R}$ is easily determined:
  $\hat{s}^2 = \operatorname*{argmax}_{s^2} \frac{1}{\sqrt{2\pi(s^2+\sigma^2)}}\, e^{-\mu^2/(2(s^2+\sigma^2))}$   (1)
            $= \max\{0,\ \mu^2 - \sigma^2\}$.   (2)
In a second step, for $s^2$ fixed to $\hat{s}^2$ as in (2), the MAP/MMSE/LMMSE estimate of U is
  $\hat{u} = \frac{\hat{s}^2}{\hat{s}^2 + \sigma^2}\,\mu$   (3)
          $= \begin{cases} \frac{\mu^2 - \sigma^2}{\mu^2}\,\mu, & \text{if } \mu^2 > \sigma^2, \\ 0, & \text{otherwise.} \end{cases}$   (4)
Equations (1)-(4) continue to hold if the scalar observation Y is generalized to an observation $Y \in \mathbb{R}^n$ such that, for fixed $Y = y$, the likelihood function $p(y|u)$ is Gaussian (up to a scale factor) with mean $\mu$ and variance $\sigma^2$. In fact, this is all we need to know in this paper about NUV priors per se.
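As a small illustration of (1)-(4) (our own sketch, not code from the paper; the function name is hypothetical), the two-step estimate can be computed directly:

```python
# Sketch of the scalar NUV estimate (1)-(4): ML estimate of the unknown
# variance s^2 from a single observation mu, then the resulting estimate of U.

def scalar_nuv_estimate(mu: float, sigma2: float):
    """Return (s2_hat, u_hat) for observation mu and known noise variance sigma2."""
    s2_hat = max(0.0, mu**2 - sigma2)          # eq. (2)
    u_hat = s2_hat / (s2_hat + sigma2) * mu    # eq. (3); zero whenever mu^2 <= sigma2
    return s2_hat, u_hat

print(scalar_nuv_estimate(0.5, 1.0))   # (0.0, 0.0): small observations are zeroed out
print(scalar_nuv_estimate(5.0, 1.0))   # (24.0, 4.8): large observations are barely shrunk
```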
The estimate (4) has some pleasing properties: first, it promotes sparsity and can thus be used to select features or relevant parameters; second, it has no a priori preference as to the scale of U, and large values of U are not scaled down. Note that the latter property is lost if ML estimation of $s^2$ is replaced by MAP estimation based on a proper prior on $s^2$.

In this paper, we will stick to basic NUV regularization as above, with no prior on the unknown variances: variables or parameters of interest are modeled as independent Gaussian random variables, each with its own unknown variance that is estimated (exactly or approximately) by maximum likelihood. We will advocate the use of NUV regularization in linear state space models, for applications such as the estimation of impulsive signals, the detection of localized events, smoothing with occasional jumps in the state space, and the detection and removal of outliers.

Concerning the actual computations, estimating the unknown variances is not substantially different from learning other parameters of state space models and can be carried out by expectation maximization (EM) [14]-[17] and other methods in such a way that the actual computations essentially amount to Gaussian message passing.

The paper is structured as follows. In Section II, we begin with a quick look at NUV regularization in a standard linear model. Estimation of the unknown variances is addressed in Section III. Factor graphs and state space models are reviewed

in Sections IV and V, respectively, and NUV regularization in such models is addressed in Section VI. The new tables of Gaussian-message computations are given in Appendix A.

II. SUM OF GAUSSIANS AND LEAST SQUARES WITH NUV REGULARIZATION

We begin with an elementary linear model (a special case of a relevance vector machine [10]) as follows. For $b_1, \ldots, b_K \in \mathbb{R}^n \setminus \{0\}$, let
  $Y = \sum_{k=1}^{K} b_k U_k + Z$   (5)
where $U_1, \ldots, U_K$ are independent zero-mean real scalar Gaussian random variables with unknown variances $\sigma_1^2, \ldots, \sigma_K^2$, and where the noise Z is $\mathbb{R}^n$-valued zero-mean Gaussian with covariance matrix $\sigma^2 I$ and independent of $U_1, \ldots, U_K$. For a given observation $Y = y \in \mathbb{R}^n$, we wish to estimate, first, $\sigma_1^2, \ldots, \sigma_K^2$ by maximum likelihood, and second, $U_1, \ldots, U_K$ with $\sigma_1^2, \ldots, \sigma_K^2$ fixed. In the first step, we achieve sparsity: if the ML estimate of $\sigma_k^2$ is zero, then $U_k = 0$ is fixed in the second step.

The second step (the estimation of $U_1, \ldots, U_K$ for fixed $\sigma_1^2, \ldots, \sigma_K^2$) is a standard Gaussian estimation problem where MAP estimation, MMSE estimation, and LMMSE estimation coincide and amount to minimizing
  $\frac{1}{\sigma^2}\Big\| y - \sum_{k \in K_+} b_k u_k \Big\|^2 + \sum_{k \in K_+} \frac{u_k^2}{\sigma_k^2}$,   (6)
where $K_+$ denotes the set of those indices $k \in \{1, \ldots, K\}$ for which $\sigma_k^2 > 0$. A closed-form solution of this minimization is
  $\hat{u}_k = \sigma_k^2 b_k^\mathsf{T} W y$   (7)
with
  $W = \Big( \sum_{k=1}^{K} \sigma_k^2 b_k b_k^\mathsf{T} + \sigma^2 I \Big)^{-1}$,   (8)
as may be obtained from standard least-squares equations (see also [11]). An alternative proof will be given in Appendix B, where we also point out how W can be computed without a matrix inversion.

In (2) and (4), the estimate is zero if and only if $y^2 \leq \sigma^2$. Two different generalizations of this condition to the setting of this section are given in the following theorem. Let $p(y, \ldots)$ denote the probability density of Y (and any other variables) according to (5).

Theorem. Let $\sigma_1, \ldots, \sigma_K$ be fixed at a local maximum or at a saddle point of $p(y \mid \sigma_1^2, \ldots, \sigma_K^2)$. Then $\sigma_k^2 = 0$ if and only if
  $\big(b_k^\mathsf{T} \tilde{W}_k y\big)^2 \leq b_k^\mathsf{T} \tilde{W}_k b_k$   (9)
with
  $\tilde{W}_k = \Big( \sum_{\ell \neq k} \sigma_\ell^2 b_\ell b_\ell^\mathsf{T} + \sigma^2 I \Big)^{-1}$.   (10)
Moreover, with W as in (8), we have
  $\big(b_k^\mathsf{T} W y\big)^2 \leq b_k^\mathsf{T} W b_k$,   (11)
with equality if $\sigma_k^2 > 0$.

The proof will be given in Appendix B. The matrices $\tilde{W}_k$ and W are both positive definite. The former depends on k, but not on $\sigma_k^2$; the latter depends also on $\sigma_k^2$, but not on k.

III. VARIANCE ESTIMATION

Following a standard approach, the unknown variances $\sigma_1^2, \ldots, \sigma_K^2$ in Section II (and analogous quantities in later sections) can be estimated by an EM algorithm as follows:
  1) Begin with an initial guess of $\sigma_1^2, \ldots, \sigma_K^2$.
  2) Compute the mean $m_{U_k}$ and the variance $\sigma_{U_k}^2$ of the Gaussian posterior distribution $p(u_k \mid y, \sigma_1^2, \ldots, \sigma_K^2)$, with $\sigma_1^2, \ldots, \sigma_K^2$ fixed.
  3) Update $\sigma_1^2, \ldots, \sigma_K^2$ according to (13) below.
  4) Repeat steps 2 and 3 until convergence, or until some pragmatic stopping criterion is met.
  5) Optionally update $\sigma_1^2, \ldots, \sigma_K^2$ according to (16) below.
The standard EM update for the variances is
  $\sigma_k^2 \leftarrow E\big[ U_k^2 \mid y, \sigma_1^2, \ldots, \sigma_K^2 \big]$   (12)
             $= m_{U_k}^2 + \sigma_{U_k}^2$.   (13)
The required quantities $m_{U_k}$ and $\sigma_{U_k}^2$ are given by (85) and (88), respectively. With this update, basic EM theory guarantees that the likelihood $p(y \mid \sigma_1^2, \ldots, \sigma_K^2)$ cannot decrease (and will normally increase) in step 3 of the algorithm.

The stated EM algorithm is safe, but the convergence can be slow. The following alternative update rule, due to MacKay [7], often converges much faster:
  $\sigma_k^2 \leftarrow \frac{m_{U_k}^2}{1 - \sigma_{U_k}^2/\sigma_k^2}$.   (14)
However, this alternative update rule comes without guarantees; sometimes, it is too aggressive and the algorithm fails completely.

An individual variance $\sigma_k^2$ can also be estimated by a maximum-likelihood step as in (2):
  $\sigma_k^2 \leftarrow \operatorname*{argmax}_{\sigma_k^2} p(y \mid \sigma_1^2, \ldots, \sigma_K^2)$   (15)
             $= \max\big\{0,\ \overleftarrow{m}_{U_k}^2 - \overleftarrow{\sigma}_{U_k}^2\big\}$,   (16)
where the mean $\overleftarrow{m}_{U_k}$ is given by (104) and the variance $\overleftarrow{\sigma}_{U_k}^2$ is given by (95).
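The EM iteration of this section can be made concrete for the model (5). The following is a minimal sketch of ours (not code from the paper; it assumes NumPy, and all function and variable names are our own). It computes the posterior mean and variance of each $U_k$ in closed form, cf. (7)-(8) and (85), (88) in Appendix B, and then applies the update (13):

```python
import numpy as np

def nuv_em(y, B, sigma2, iters=200):
    """NUV-EM for y = sum_k b_k U_k + Z; B is n x K with columns b_k."""
    n, K = B.shape
    s2 = np.ones(K)                                            # initial guess of sigma_k^2
    for _ in range(iters):
        W = np.linalg.inv(B @ np.diag(s2) @ B.T + sigma2 * np.eye(n))     # eq. (8)
        m_U = s2 * (B.T @ W @ y)                               # posterior means, cf. (7)/(85)
        v_U = s2 - s2**2 * np.einsum('ik,ij,jk->k', B, W, B)   # posterior variances, cf. (88)
        s2 = m_U**2 + v_U                                      # EM update, eq. (13)
    return s2, m_U

# Toy example: only the first two coefficients are actually nonzero.
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 10))
u_true = np.zeros(10)
u_true[:2] = [3.0, -2.0]
y = B @ u_true + 0.1 * rng.standard_normal(20)
s2_hat, u_hat = nuv_em(y, B, sigma2=0.01)
print(np.round(u_hat, 2))   # the spurious coefficients should be shrunk toward zero
```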
However, for parallel updates (simultaneously for all $k \in \{1, \ldots, K\}$), as in step 3 of the algorithm above, the rule (16) is normally too aggressive and fails.

Later on, the same algorithm will be used for estimating parameters or variables in linear state space models. In this case, we have no useful analytical expressions for the analogs of $m_{U_k}$ and $\sigma_{U_k}^2$, but these quantities are easily computed by Gaussian message passing.

IV. ON FACTOR GRAPHS AND GAUSSIAN MESSAGE PASSING

From now on, we will heavily use factor graphs, both for reasoning and for describing algorithms. We will use factor graphs as in [5], [6], where nodes/boxes represent factors and

edges represent variables. By contrast, factor graphs as in [18] have both variable nodes and factor nodes.

[Fig. 1. Cycle-free factor graph of (5) with NUV regularization.]

Figure 1, for example, represents the probability density $p(y, z, u_1, \ldots, u_K \mid \sigma_1, \ldots, \sigma_K)$ of the model (5), with auxiliary variables $\tilde{U}_k = b_k U_k$ and $X_k = X_{k-1} + \tilde{U}_k$ with $X_0 = 0$. The nodes labeled N represent zero-mean normal densities with variance 1; the node labeled $\mathcal{N}(0, \sigma^2 I)$ represents a zero-mean multivariate normal density with covariance matrix $\sigma^2 I$. All other nodes in Figure 1 represent deterministic constraints.

For fixed $\sigma_1, \ldots, \sigma_K$, Figure 1 is a cycle-free linear Gaussian factor graph, and MAP/MMSE/LMMSE estimation of any variables can be carried out by Gaussian message passing, as described in detail in [6]. Interestingly, in this particular example, most of the message passing can be carried out symbolically, i.e., as a technique to derive closed-form expressions for the estimates.

Every message in this paper is a (scalar or multivariate) Gaussian distribution, up to a scale factor. Sometimes, we also allow a degenerate limit of a Gaussian, such as a Gaussian with variance zero or infinity, but we will not discuss this in detail. Scale factors can be ignored in this paper. Messages can thus be parameterized by a mean vector and a covariance matrix. For example, $\overrightarrow{m}_X$ and $\overrightarrow{V}_X$ denote the mean vector and the covariance matrix, respectively, of the message traveling forward on the edge X in Figure 1, while $\overleftarrow{m}_X$ and $\overleftarrow{V}_X$ denote the mean vector and the covariance matrix, respectively, of the message traveling backward on the edge X. Alternatively, messages can be parameterized by the precision matrix $\overrightarrow{W}_X$ (the inverse of the covariance matrix $\overrightarrow{V}_X$) and the precision-weighted mean vector
  $\overrightarrow{\xi}_X = \overrightarrow{W}_X \overrightarrow{m}_X$.   (17)
Again, the backward message along the same edge will be denoted by reversed arrows.

In a directed graphical model without cycles as in Figure 1, forward messages represent priors while backward messages represent likelihood functions (up to a scale factor). In addition, we also work with marginals of the posterior distribution, i.e., the product of forward message and backward message along the same edge [5], [6]. For example, $m_X$ and $V_X$ denote the posterior mean vector and the posterior covariance matrix, respectively, of X. An important role in this paper is played by the alternative parameterization with the dual precision matrix $\tilde{W}_X$ and the dual mean vector $\tilde{\xi}_X$:
  $\tilde{W}_X = \big(\overrightarrow{V}_X + \overleftarrow{V}_X\big)^{-1}$   (18)
  $\tilde{\xi}_X = \tilde{W}_X \big(\overrightarrow{m}_X - \overleftarrow{m}_X\big)$.   (19)
Message computations with all these parameterizations are given in Tables I-VI in Appendix A, which contain numerous improvements over the corresponding tables in [6].

V. LINEAR STATE SPACE MODELS

Consider a standard linear state space model with state $X_k \in \mathbb{R}^n$ and observation $Y_k \in \mathbb{R}^L$ evolving according to
  $X_k = A X_{k-1} + B U_k$   (20)
  $Y_k = C X_k + Z_k$   (21)
with $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times m}$, $C \in \mathbb{R}^{L \times n}$, and where $U_k$ (with values in $\mathbb{R}^m$) and $Z_k$ (with values in $\mathbb{R}^L$) are independent zero-mean white Gaussian noise processes. We will usually assume, first, that $L = 1$, and second, that the covariance matrix of $U_k$ is an identity matrix, but these assumptions are not essential. A cycle-free factor graph of such a model is shown in Figure 2. In Section VI, we will vary and augment such models with NUV priors on various quantities.

Inference in such a state space model amounts to Kalman filtering and smoothing [1], [2] or, equivalently, to Gaussian message passing in the factor graph of Figure 2 [5], [6].
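The parameterizations above are easy to exercise numerically. The following sketch (ours, assuming NumPy; not from the paper) builds a forward and a backward message on a single edge and checks, for the dual quantities (18)-(19), two of the marginal relations that reappear in Table IV of Appendix A:

```python
import numpy as np
rng = np.random.default_rng(1)

def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)       # symmetric positive definite

n = 3
Vf, Vb = random_spd(n), random_spd(n)    # forward / backward covariance matrices
mf, mb = rng.standard_normal(n), rng.standard_normal(n)

Wf, Wb = np.linalg.inv(Vf), np.linalg.inv(Vb)
xif, xib = Wf @ mf, Wb @ mb              # precision-weighted means, eq. (17)

V = np.linalg.inv(Wf + Wb)               # posterior covariance (product of the two messages)
m = V @ (xif + xib)                      # posterior mean

W_dual = np.linalg.inv(Vf + Vb)          # dual precision matrix, eq. (18)
xi_dual = W_dual @ (mf - mb)             # dual mean, eq. (19)

print(np.allclose(m, mf - Vf @ xi_dual))         # (IV.9):  m = m_fwd - V_fwd xi_dual
print(np.allclose(V, Vf - Vf @ W_dual @ Vf))     # (IV.13): V = V_fwd - V_fwd W_dual V_fwd
```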
Estimating the input $U_k$ is not usually considered in the Kalman filter literature, but it is essential for signal processing, cf. [19], [20]. With the tables in the appendix, it is easy to put together a large variety of such algorithms. The relative merits of different such algorithms depend on the particulars

of the problem.

[Fig. 2. One section of the factor graph of the linear state space model (20) and (21). The whole factor graph consists of many such sections and optional initial and/or terminal conditions. The dashed block will be varied in Section VI.]

However, we find the following two algorithms usually to be the most advantageous, both in terms of computational complexity and in terms of numerical stability. If both the input $U_k$ and the output $Y_k$ are scalar (or can be decomposed into multiple scalar inputs and outputs), neither of these two algorithms requires a matrix inversion.

The first of these algorithms is essentially the Modified Bryson-Frazier (MBF) smoother [21] augmented with input-signal estimation.

MBF Message Passing:
  1) Perform forward message passing with $\overrightarrow{m}_X$ and $\overrightarrow{V}_X$ using (II.1), (II.2), (III.1), (III.2), (V.1), (V.2). This is the standard Kalman filter.
  2) Perform backward message passing with $\tilde{\xi}_X$ and $\tilde{W}_X$, beginning with $\tilde{\xi}_X = 0$ and $\tilde{W}_X = 0$ at the end of the horizon, using (II.6), (II.7), (III.7), (III.8), and either (V.4), (V.6), (V.8) or (V.5), (V.7), (V.9).
  3) Inputs $U_k$ may then be estimated using (II.6), (II.7), (III.7), (III.8), (IV.9), (IV.13).
  4) The posterior mean $m_X$ and covariance matrix $V_X$ of any state X (or of an individual component thereof) may be obtained from (IV.9) and (IV.13).
  5) Outputs $\tilde{Y}_k = C X_k$ may then (very obviously) be estimated using (I.5), (I.6), (III.5), (III.6).
In step 2, the initialization with $\tilde{W}_X = 0$ corresponds to the typical situation with no a priori information about the state X at the end of the horizon. MBF message passing is especially attractive for input signal estimation as in step 3 above, without steps 4 and 5 (a numerical sketch of this recursion is given below).

The second algorithm is an exact dual to MBF message passing and especially attractive for state estimation and output signal estimation, i.e., for standard Kalman smoothing, without steps 4 and 5 below. This algorithm (backward recursion with a time-reversed information filter, forward recursion with marginals; BIFM) does not seem to be widely known.

BIFM Message Passing:
  1) Perform backward message passing with $\overleftarrow{\xi}_X$ and $\overleftarrow{W}_X$ using (I.3), (I.4), (III.3), (III.4), and (VI.1), (VI.2) with the changes for the reverse direction stated in Table VI. This is a time-reversed version of the standard information filter.
  2) Perform forward message passing with $m_X$ and $V_X$ using (I.5), (I.6), (III.5), (III.6), and either (VI.4), (VI.6), (VI.8) or (VI.5), (VI.7), (VI.9).
  3) Outputs $\tilde{Y}_k$ may then (very obviously) be estimated using (I.5), (I.6), (III.5), (III.6).
  4) The dual means $\tilde{\xi}_X$ and the dual precision matrices $\tilde{W}_X$ may be obtained from (IV.3) and (IV.7).
  5) Inputs $U_k$ may then be estimated using (II.6), (II.7), (III.7), (III.8), (IV.9), (IV.13).

VI. SPARSITY BY NUV IN STATE SPACE MODELS

Sparse input signals are easily introduced: simply replace the normal prior on $U_k$ in (20) and in Figure 2 by a NUV prior, as shown in Figure 3. This approach was used in [22] to estimate the input signal $U_1, U_2, \ldots$ itself. However, we may also be interested in the clean output signal $\tilde{Y}_k = C X_k$. For example, consider the problem of approximating some given signal $y_1, y_2, \ldots \in \mathbb{R}$ by constant segments, as illustrated in Figure 8. The constant segments can be represented by the simplest possible state space model with $n = 1$, $A = C = 1$, and no input. For the occasional jumps between the constant segments, we use a sparse input signal $U_1, U_2, \ldots$ with a NUV prior and with $B = b = 1$, as in Figure 3. The sparsity level (i.e., the number of constant segments) can be controlled by the assumed observation noise $\sigma^2$.

The sparse scalar input signal of Figure 3 can be generalized in several different directions.
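The following is a minimal numerical sketch of MBF message passing as described in Section V (our own illustration, not the authors' code; it assumes NumPy, a scalar output, a white-noise input prior N(0, I), and variable names of our choosing): a forward Kalman filter, the backward recursion with the dual quantities started from zero at the end of the horizon, and state/input estimation via (IV.9) and (IV.13):

```python
import numpy as np

def mbf_smoother(y, A, B, C, sigma2, m0, V0):
    """MBF sketch for x_k = A x_{k-1} + B u_k, y_k = C x_k + z_k with scalar y_k,
    u_k ~ N(0, I), z_k ~ N(0, sigma2); (m0, V0) is the prior on the initial state."""
    N, n = len(y), A.shape[0]
    m_pred, V_pred = np.zeros((N, n)), np.zeros((N, n, n))
    m_filt, V_filt = np.zeros((N, n)), np.zeros((N, n, n))
    m, V = np.asarray(m0, float), np.asarray(V0, float)
    for k in range(N):                                   # step 1: forward Kalman filter
        m, V = A @ m, A @ V @ A.T + B @ B.T
        m_pred[k], V_pred[k] = m, V
        S = (C @ V @ C.T).item() + sigma2                # innovation variance
        K = (V @ C.T).ravel() / S                        # Kalman gain
        m = m + K * (y[k] - (C @ m).item())
        V = V - np.outer(K, C @ V)
        m_filt[k], V_filt[k] = m, V
    xi, W = np.zeros(n), np.zeros((n, n))                # step 2: duals, zero at end of horizon
    x_smooth, u_hat = np.zeros((N, n)), np.zeros((N, B.shape[1]))
    c = C.ravel()
    for k in reversed(range(N)):
        x_smooth[k] = m_filt[k] - V_filt[k] @ xi         # posterior state mean, (IV.9)
        S = (C @ V_pred[k] @ C.T).item() + sigma2
        G = 1.0 / S
        F = np.eye(n) - np.outer(V_pred[k] @ c, c) * G   # F = I - V_pred C^T G C, cf. (V.9)
        xi = F.T @ xi + c * G * ((C @ m_pred[k]).item() - y[k])   # cf. (V.5)
        W = F.T @ W @ F + np.outer(c, c) * G                      # cf. (V.7)
        u_hat[k] = -B.T @ xi                             # step 3: input estimate, N(0, I) prior
        xi, W = A.T @ xi, A.T @ W @ A                    # back through the A node, (III.7)/(III.8)
    return x_smooth, u_hat

# Usage sketch: constant-velocity model driven by two isolated input pulses.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
rng = np.random.default_rng(3)
u = np.zeros(80); u[20], u[55] = 2.0, -3.0
x, y = np.zeros(2), []
for uk in u:
    x = A @ x + B.ravel() * uk
    y.append((C @ x).item() + 0.3 * rng.standard_normal())
x_smooth, u_hat = mbf_smoother(np.array(y), A, B, C, sigma2=0.09, m0=np.zeros(2), V0=np.eye(2))
```

Under the plain white-noise prior, the input estimate u_hat is smeared around the true pulse positions; the NUV priors introduced above sharpen such estimates.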
A first obvious generalization is to combine a primary white-noise input with a secondary sparse input, as shown in Figure 4. For example, the constant segments in Figure 8 are thus generalized to random-walk segments as in Figure 9.

Another generalization of Figure 8 is shown in Figure 10, where the constant-level segments are replaced by straight-line segments, which can be represented by a state space model of order $n = 2$. The corresponding input block, with two separate sparse scalar input signals, is shown in Figure 5; the first input, $U_{k,1}$, affects the magnitude and the second input, $U_{k,2}$, affects the slope of the line model. The further generalization to polynomial segments is obvious. Continuity can be enforced by omitting the input $U_{k,1}$, and continuity of derivatives can be enforced likewise. More generally, Figure 5 (generalized to an arbitrary number of sparse scalar input signals) can be used to allow occasional jumps in individual components of the state of arbitrary state space models.
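As a concrete instance of the piecewise-constant example above (Figure 8), the following sketch of ours (assuming NumPy; not code from the paper) combines the scalar MBF recursion with the NUV-EM update (13): each input $u_k$ gets its own variance, which is re-estimated from its posterior mean and variance in every EM iteration:

```python
import numpy as np

def nuv_piecewise_constant(y, sigma2, iters=100):
    """Fit a piecewise-constant signal to y (assumed noise variance sigma2) by NUV-EM,
    using the model x_k = x_{k-1} + u_k, y_k = x_k + z_k (n = 1, A = B = C = 1)."""
    N = len(y)
    s2 = np.ones(N)                              # unknown input variances sigma_k^2
    x_hat = np.zeros(N)
    for _ in range(iters):
        # forward Kalman filter
        mp, Vp = np.zeros(N), np.zeros(N)        # predicted mean / variance
        mf, Vf = np.zeros(N), np.zeros(N)        # filtered mean / variance
        m, V = 0.0, 1e6                          # vague prior on the initial state
        for k in range(N):
            mp[k], Vp[k] = m, V + s2[k]
            K = Vp[k] / (Vp[k] + sigma2)
            m, V = mp[k] + K * (y[k] - mp[k]), (1.0 - K) * Vp[k]
            mf[k], Vf[k] = m, V
        # backward MBF recursion with the dual quantities, plus the EM update (13)
        xi, W = 0.0, 0.0
        for k in reversed(range(N)):
            x_hat[k] = mf[k] - Vf[k] * xi        # smoothed clean output, (IV.9)
            G = 1.0 / (Vp[k] + sigma2)
            F = 1.0 - Vp[k] * G
            xi = F * xi + G * (mp[k] - y[k])     # dual quantities on the input edge
            W = F * W * F + G
            mu = -s2[k] * xi                     # posterior mean of u_k
            vu = s2[k] - s2[k]**2 * W            # posterior variance of u_k
            s2[k] = mu**2 + vu                   # EM update, eq. (13)
            # A = 1, so the duals pass unchanged to the previous section
    return x_hat, s2

# Toy usage: a noisy two-level signal; after convergence, most s2[k] should be
# (close to) zero, leaving sizable variances only at the jump positions.
rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(50), 4.0 * np.ones(50)]) + 0.5 * rng.standard_normal(100)
x_hat, s2 = nuv_piecewise_constant(y, sigma2=0.25)
```

The sparsity level (the number of segments) can be tuned through the assumed observation noise sigma2, as noted above.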

[Fig. 3. Alternative input block to replace the dashed box in Figure 2 for a sparse scalar input signal U_1, U_2, ....]
[Fig. 4. Input block with both white noise and an additional sparse scalar input.]
[Fig. 5. Input block with two separate sparse scalar inputs for two degrees of freedom, such as in Figure 10.]
[Fig. 6. Input block allowing general sparse pulses, each with its own signature, in addition to full-rank white noise.]
[Fig. 7. Alternative output block for a scalar signal with outliers.]
[Fig. 8. Estimating or fitting a piecewise constant signal.]
[Fig. 9. Estimating a random walk with occasional jumps.]
[Fig. 10. Approximation with straight-line segments.]
[Fig. 11. Outlier removal according to Figure 7.]

In all these examples, the parameters $\sigma_k^2$ (or $\sigma_{k,l}^2$) can be learned as described in Section III, and the required quantities $m_{U_k}$ and $\sigma_{U_k}^2$ (or $m_{U_{k,l}}$ and $\sigma_{U_{k,l}}^2$, respectively) can be computed by message passing in the pertinent factor graph, as described in Section V.

A more substantial generalization of Figure 3 is shown in Figure 6, with $b\sigma_k$ of Figure 3 generalized to $b_k \in \mathbb{R}^n$. We mention without proof that this generalized NUV prior on $\tilde{U}_{k,1} = b_k U_{k,1}$ still promotes sparsity and can be learned by EM, provided that $BB^\mathsf{T}$ has full rank [23]. This input model allows quite general events to happen, each with its own signature. The estimated nonzero vectors $\hat{b}_1, \hat{b}_2, \ldots$ may be viewed as features of the given signal $y_1, y_2, \ldots$ that can be used for further analysis.

Finally, we turn to the output block in Figure 2. A simple and effective method to detect and to remove outliers from the scalar output signal of a state space model is to replace (21) with
  $Y_k = C X_k + Z_k + \tilde{Z}_k$   (22)
with sparse $\tilde{Z}_k$, as shown in Figure 7 [24]. Again, the parameters $\tilde{\sigma}_k$ can be estimated by EM essentially as described in Section III, and the required quantities $m_{\tilde{Z}_k}$ and $\sigma_{\tilde{Z}_k}^2$ can be computed by message passing as described in Section V. An example of this method is shown in Figure 11 for some state space model of order $n = 4$ with details that are irrelevant for this paper.

VII. CONCLUSION

We have given improved tables of Gaussian-message computations for estimation in linear state space models, and we have pointed out two preferred message passing algorithms: the first algorithm is essentially the Modified Bryson-Frazier smoother, the second algorithm is a dual of it. In addition, we have advocated NUV priors (together with EM algorithms from sparse Bayesian learning) for introducing sparsity into linear state space models and outlined several applications. In this paper, all factor graphs were cycle-free, so that Gaussian message passing yields exact marginals. The use of NUV regularization in factor graphs with cycles, and its relative merits in comparison with, e.g., AMP [25], remains to be investigated.

APPENDIX A
TABULATED GAUSSIAN-MESSAGE COMPUTATIONS

Tables I-VI are improved versions of the corresponding tables in [6]. The notation for the different parameterizations of the messages was defined in Section IV. The main novelties of this new version are the following:
  1) New notation $\overrightarrow{\xi} = \overrightarrow{W}\overrightarrow{m}$ and $\overleftarrow{\xi} = \overleftarrow{W}\overleftarrow{m}$.
  2) Introduction of the dual marginal $\tilde{\xi}$ (IV.1), with pertinent new expressions in Tables I-V, and new expressions with the dual precision matrix $\tilde{W}$, especially (V.4)-(V.9). These results from [20] are used both in Appendix B and in the two preferred algorithms in Section V.
  3) New expressions (VI.4)-(VI.9) for the marginals, which are essential for the BIFM Kalman smoother in Section V.
The proofs below are given only for the new expressions; for the other proofs, we refer to [6].

TABLE I: GAUSSIAN MESSAGE PASSING THROUGH AN EQUALITY-CONSTRAINT NODE
Constraint $X = Y = Z$, expressed by the factor $\delta(z - x)\,\delta(y - x)$:
  $\overrightarrow{\xi}_Z = \overrightarrow{\xi}_X + \overleftarrow{\xi}_Y$   (I.1)
  $\overrightarrow{W}_Z = \overrightarrow{W}_X + \overleftarrow{W}_Y$   (I.2)
  $\overleftarrow{\xi}_X = \overleftarrow{\xi}_Y + \overleftarrow{\xi}_Z$   (I.3)
  $\overleftarrow{W}_X = \overleftarrow{W}_Y + \overleftarrow{W}_Z$   (I.4)
  $m_X = m_Y = m_Z$   (I.5)
  $V_X = V_Y = V_Z$   (I.6)
  $\tilde{\xi}_X = \tilde{\xi}_Y + \tilde{\xi}_Z$   (I.7)

TABLE II: GAUSSIAN MESSAGE PASSING THROUGH AN ADDER NODE
Constraint $Z = X + Y$, expressed by the factor $\delta(z - (x + y))$:
  $\overrightarrow{m}_Z = \overrightarrow{m}_X + \overrightarrow{m}_Y$   (II.1)
  $\overrightarrow{V}_Z = \overrightarrow{V}_X + \overrightarrow{V}_Y$   (II.2)
  $\overleftarrow{m}_X = \overleftarrow{m}_Z - \overrightarrow{m}_Y$   (II.3)
  $\overleftarrow{V}_X = \overleftarrow{V}_Z + \overrightarrow{V}_Y$   (II.4)
  $m_Z = m_X + m_Y$   (II.5)
  $\tilde{\xi}_X = \tilde{\xi}_Y = \tilde{\xi}_Z$   (II.6)
  $\tilde{W}_X = \tilde{W}_Y = \tilde{W}_Z$   (II.7)

TABLE III: GAUSSIAN MESSAGE PASSING THROUGH A MATRIX MULTIPLIER NODE WITH ARBITRARY REAL MATRIX A
Constraint $Y = AX$, expressed by the factor $\delta(y - Ax)$:
  $\overrightarrow{m}_Y = A\,\overrightarrow{m}_X$   (III.1)
  $\overrightarrow{V}_Y = A\,\overrightarrow{V}_X A^\mathsf{T}$   (III.2)
  $\overleftarrow{\xi}_X = A^\mathsf{T}\,\overleftarrow{\xi}_Y$   (III.3)
  $\overleftarrow{W}_X = A^\mathsf{T}\,\overleftarrow{W}_Y A$   (III.4)
  $m_Y = A\,m_X$   (III.5)
  $V_Y = A\,V_X A^\mathsf{T}$   (III.6)
  $\tilde{\xi}_X = A^\mathsf{T}\,\tilde{\xi}_Y$   (III.7)
  $\tilde{W}_X = A^\mathsf{T}\,\tilde{W}_Y A$   (III.8)

TABLE IV: GAUSSIAN SINGLE-EDGE MARGINALS $m$, $V$ AND THEIR DUALS $\tilde{\xi}$, $\tilde{W}$
  $\tilde{\xi}_X = \tilde{W}_X (\overrightarrow{m}_X - \overleftarrow{m}_X)$   (IV.1)
          $= \overrightarrow{\xi}_X - \overrightarrow{W}_X m_X$   (IV.2)
          $= \overleftarrow{W}_X m_X - \overleftarrow{\xi}_X$   (IV.3)
  $\tilde{W}_X = (\overrightarrow{V}_X + \overleftarrow{V}_X)^{-1}$   (IV.4)
          $= \overrightarrow{W}_X V_X \overleftarrow{W}_X$   (IV.5)
          $= \overrightarrow{W}_X - \overrightarrow{W}_X V_X \overrightarrow{W}_X$   (IV.6)
          $= \overleftarrow{W}_X - \overleftarrow{W}_X V_X \overleftarrow{W}_X$   (IV.7)
  $m_X = V_X (\overrightarrow{\xi}_X + \overleftarrow{\xi}_X)$   (IV.8)
       $= \overrightarrow{m}_X - \overrightarrow{V}_X \tilde{\xi}_X$   (IV.9)
       $= \overleftarrow{m}_X + \overleftarrow{V}_X \tilde{\xi}_X$   (IV.10)
  $V_X = (\overrightarrow{W}_X + \overleftarrow{W}_X)^{-1}$   (IV.11)
       $= \overrightarrow{V}_X \tilde{W}_X \overleftarrow{V}_X$   (IV.12)
       $= \overrightarrow{V}_X - \overrightarrow{V}_X \tilde{W}_X \overrightarrow{V}_X$   (IV.13)
       $= \overleftarrow{V}_X - \overleftarrow{V}_X \tilde{W}_X \overleftarrow{V}_X$   (IV.14)

TABLE V: GAUSSIAN MESSAGE PASSING THROUGH AN OBSERVATION BLOCK
Constraints $X = Z$ and $Y = AX$ (the backward message on Y comes from an observation):
  $\overrightarrow{m}_Z = \overrightarrow{m}_X + \overrightarrow{V}_X A^\mathsf{T} G\,(\overleftarrow{m}_Y - A\overrightarrow{m}_X)$   (V.1)
  $\overrightarrow{V}_Z = \overrightarrow{V}_X - \overrightarrow{V}_X A^\mathsf{T} G A\,\overrightarrow{V}_X$   (V.2)
  with $G = \big(\overleftarrow{V}_Y + A\,\overrightarrow{V}_X A^\mathsf{T}\big)^{-1}$   (V.3)
  $\tilde{\xi}_X = F^\mathsf{T} \tilde{\xi}_Z + A^\mathsf{T} \overleftarrow{W}_Y\,(A\overrightarrow{m}_Z - \overleftarrow{m}_Y)$   (V.4)
          $= F^\mathsf{T} \tilde{\xi}_Z + A^\mathsf{T} G\,(A\overrightarrow{m}_X - \overleftarrow{m}_Y)$   (V.5)
  $\tilde{W}_X = F^\mathsf{T} \tilde{W}_Z F + A^\mathsf{T} \overleftarrow{W}_Y A F$   (V.6)
          $= F^\mathsf{T} \tilde{W}_Z F + A^\mathsf{T} G A$   (V.7)
  with $F = I - \overrightarrow{V}_Z A^\mathsf{T} \overleftarrow{W}_Y A$   (V.8)
       $= I - \overrightarrow{V}_X A^\mathsf{T} G A$   (V.9)
For the reverse direction, replace $\overrightarrow{m}_Z$ by $\overleftarrow{m}_X$, $\overrightarrow{V}_Z$ by $\overleftarrow{V}_X$, $\overrightarrow{m}_X$ by $\overleftarrow{m}_Z$, $\overrightarrow{V}_X$ by $\overleftarrow{V}_Z$, exchange $\tilde{\xi}_X$ and $\tilde{\xi}_Z$, exchange $\tilde{W}_X$ and $\tilde{W}_Z$, and change $+$ to $-$ in (V.4) and (V.5).

TABLE VI: GAUSSIAN MESSAGE PASSING THROUGH AN INPUT BLOCK
Constraint $Z = X + AY$:
  $\overrightarrow{\xi}_Z = \overrightarrow{\xi}_X + \overrightarrow{W}_X A H\,(\overrightarrow{\xi}_Y - A^\mathsf{T}\overrightarrow{\xi}_X)$   (VI.1)
  $\overrightarrow{W}_Z = \overrightarrow{W}_X - \overrightarrow{W}_X A H A^\mathsf{T} \overrightarrow{W}_X$   (VI.2)
  with $H = \big(\overrightarrow{W}_Y + A^\mathsf{T}\overrightarrow{W}_X A\big)^{-1}$   (VI.3)
  $m_X = F^\mathsf{T} m_Z + A\,\overrightarrow{V}_Y\,(A^\mathsf{T}\overrightarrow{\xi}_Z - \overrightarrow{\xi}_Y)$   (VI.4)
       $= F^\mathsf{T} m_Z + A H\,(A^\mathsf{T}\overrightarrow{\xi}_X - \overrightarrow{\xi}_Y)$   (VI.5)
  $V_X = F^\mathsf{T} V_Z F + A\,\overrightarrow{V}_Y A^\mathsf{T} F$   (VI.6)
       $= F^\mathsf{T} V_Z F + A H A^\mathsf{T}$   (VI.7)
  with $F = I - \overrightarrow{W}_Z A\,\overrightarrow{V}_Y A^\mathsf{T}$   (VI.8)
       $= I - \overrightarrow{W}_X A H A^\mathsf{T}$   (VI.9)
For the reverse direction, replace $\overrightarrow{\xi}_Z$ by $\overleftarrow{\xi}_X$, $\overrightarrow{W}_Z$ by $\overleftarrow{W}_X$, $\overrightarrow{\xi}_X$ by $\overleftarrow{\xi}_Z$, $\overrightarrow{W}_X$ by $\overleftarrow{W}_Z$, exchange $m_X$ and $m_Z$, exchange $V_X$ and $V_Z$, and replace $\overrightarrow{\xi}_Y$ by $-\overrightarrow{\xi}_Y$.

Proof of (I.7): Using (IV.3), (I.3), (I.4), and (I.5), we have
  $\tilde{\xi}_X = \overleftarrow{W}_X m_X - \overleftarrow{\xi}_X$   (23)
          $= (\overleftarrow{W}_Y + \overleftarrow{W}_Z)\,m_X - (\overleftarrow{\xi}_Y + \overleftarrow{\xi}_Z)$   (24)
          $= (\overleftarrow{W}_Y m_Y - \overleftarrow{\xi}_Y) + (\overleftarrow{W}_Z m_Z - \overleftarrow{\xi}_Z)$   (25)
          $= \tilde{\xi}_Y + \tilde{\xi}_Z$.   (26)

Proof of (II.6): We first note
  $\overrightarrow{m}_X - \overleftarrow{m}_X = \overrightarrow{m}_X + \overrightarrow{m}_Y - \overleftarrow{m}_Z$   (27)
                    $= \overrightarrow{m}_Z - \overleftarrow{m}_Z$,   (28)
and (II.6) follows from (II.7).

Proof of (III.7): Using [6, eq. III.9], we have
  $\tilde{\xi}_X = \overleftarrow{W}_X (m_X - \overleftarrow{m}_X)$   (29)
          $= \overleftarrow{W}_X m_X - \overleftarrow{W}_X \overleftarrow{m}_X$   (30)
          $= A^\mathsf{T} \overleftarrow{W}_Y A\,m_X - A^\mathsf{T} \overleftarrow{W}_Y \overleftarrow{m}_Y$   (31)
          $= A^\mathsf{T} \overleftarrow{W}_Y (m_Y - \overleftarrow{m}_Y)$,   (32)
which is $A^\mathsf{T} \tilde{\xi}_Y$.

Proof of (IV.9) and (IV.2): Using (IV.13) and (IV.12), we have
  $m_X = V_X \overrightarrow{\xi}_X + V_X \overleftarrow{\xi}_X$   (33)
       $= \overrightarrow{V}_X \overrightarrow{\xi}_X - \overrightarrow{V}_X \tilde{W}_X \overrightarrow{V}_X \overrightarrow{\xi}_X + \overrightarrow{V}_X \tilde{W}_X \overleftarrow{V}_X \overleftarrow{\xi}_X$   (34)
       $= \overrightarrow{m}_X - \overrightarrow{V}_X \tilde{W}_X (\overrightarrow{m}_X - \overleftarrow{m}_X)$   (35)
       $= \overrightarrow{m}_X - \overrightarrow{V}_X \tilde{\xi}_X$,   (36)
and (IV.2) follows by multiplication with $\overrightarrow{W}_X$.

Proof of (V.9): From (I.2) and (III.4), we have
  $\overrightarrow{W}_Z = \overrightarrow{W}_X + A^\mathsf{T} \overleftarrow{W}_Y A$,   (37)
from which we obtain
  $\overrightarrow{W}_X = \overrightarrow{W}_Z - A^\mathsf{T} \overleftarrow{W}_Y A$   (38)
              $= \overrightarrow{W}_Z F$.   (39)
Thus $\overrightarrow{V}_Z \overrightarrow{W}_X = F$ and
  $\overrightarrow{V}_Z = F\,\overrightarrow{V}_X$.   (40)
On the other hand, we have
  $\overrightarrow{V}_Z = (I - \overrightarrow{V}_X A^\mathsf{T} G A)\,\overrightarrow{V}_X$   (41)
from (V.2), and $F = I - \overrightarrow{V}_X A^\mathsf{T} G A$ follows.

Proof of (V.4): Using (I.7), (III.7), and (IV.3), we have
  $\tilde{\xi}_X = \tilde{\xi}_Z + A^\mathsf{T} \tilde{\xi}_Y$   (42)
          $= \tilde{\xi}_Z + A^\mathsf{T}\big(\overleftarrow{W}_Y m_Y - \overleftarrow{W}_Y \overleftarrow{m}_Y\big)$.   (43)
Using (III.5) and (IV.9), we further have
  $m_Y = A\,m_Z$   (44)
       $= A\,\big(\overrightarrow{m}_Z - \overrightarrow{V}_Z \tilde{\xi}_Z\big)$,   (45)
and inserting (45) into (43) yields (V.4).

Proof of (V.5): We begin with $m_X = m_Z$. Using (IV.9), we have
  $\overrightarrow{m}_X - \overrightarrow{V}_X \tilde{\xi}_X = \overrightarrow{m}_Z - \overrightarrow{V}_Z \tilde{\xi}_Z$   (46)
  $= \overrightarrow{m}_X + \overrightarrow{V}_X A^\mathsf{T} G\,(\overleftarrow{m}_Y - A\overrightarrow{m}_X) - \overrightarrow{V}_X F^\mathsf{T} \tilde{\xi}_Z$,   (47)
where the second step uses (V.1) and $\overrightarrow{V}_Z = F\,\overrightarrow{V}_X = \overrightarrow{V}_X F^\mathsf{T}$ from (40). Subtracting $\overrightarrow{m}_X$ and multiplying by $\overrightarrow{V}_X^{-1}$ yields (V.5).

Proof of (V.7): We begin with $V_X = V_Z$. Using (IV.13), we have
  $\overrightarrow{V}_X - \overrightarrow{V}_X \tilde{W}_X \overrightarrow{V}_X = \overrightarrow{V}_Z - \overrightarrow{V}_Z \tilde{W}_Z \overrightarrow{V}_Z$   (48)
  $= \overrightarrow{V}_X - \overrightarrow{V}_X A^\mathsf{T} G A\,\overrightarrow{V}_X - \overrightarrow{V}_X F^\mathsf{T} \tilde{W}_Z F\,\overrightarrow{V}_X$,   (49)
where the second step uses (V.2) and (40). Subtracting $\overrightarrow{V}_X$ and multiplying by $\overrightarrow{V}_X^{-1}$ yields (V.7).

Proof of (V.6): As we have already established (V.7), we only need to prove
  $A^\mathsf{T} G A = A^\mathsf{T} \overleftarrow{W}_Y A F$.   (50)
Using (V.9), we have
  $A^\mathsf{T} G A = \overrightarrow{W}_X (I - F)$   (51)
             $= \overrightarrow{W}_X \overrightarrow{V}_Z A^\mathsf{T} \overleftarrow{W}_Y A$   (52)
             $= A^\mathsf{T} \overleftarrow{W}_Y A\,\overrightarrow{V}_Z \overrightarrow{W}_X$,   (53)
where the last step follows from $(A^\mathsf{T} G A)^\mathsf{T} = A^\mathsf{T} G A$. Inserting (39) then yields (50).

Proof of (VI.9): From (II.2) and (III.2), we have
  $\overrightarrow{V}_Z = \overrightarrow{V}_X + A\,\overrightarrow{V}_Y A^\mathsf{T}$,   (54)
from which we obtain
  $\overrightarrow{V}_X = \overrightarrow{V}_Z - A\,\overrightarrow{V}_Y A^\mathsf{T}$   (55)
              $= \overrightarrow{V}_Z F$.   (56)
Thus $\overrightarrow{W}_Z \overrightarrow{V}_X = F$ and
  $\overrightarrow{W}_Z = F\,\overrightarrow{W}_X$.   (57)
On the other hand, we have
  $\overrightarrow{W}_Z = (I - \overrightarrow{W}_X A H A^\mathsf{T})\,\overrightarrow{W}_X$   (58)
from (VI.2), and $F = I - \overrightarrow{W}_X A H A^\mathsf{T}$ follows.

Proof of (VI.4): Using (II.3), (III.5), and (IV.9), we have
  $m_X = m_Z - A\,m_Y$   (59)
       $= m_Z - A\,\big(\overrightarrow{m}_Y - \overrightarrow{V}_Y \tilde{\xi}_Y\big)$.   (60)
Using (II.6), (III.7), and (IV.2), we further have
  $\tilde{\xi}_Y = A^\mathsf{T} \tilde{\xi}_Z$   (61)
          $= A^\mathsf{T}\big(\overrightarrow{\xi}_Z - \overrightarrow{W}_Z m_Z\big)$,   (62)
and inserting (62) into (60) yields (VI.4).

Proof of (VI.5): We begin with $\tilde{\xi}_X = \tilde{\xi}_Z$ from (II.6). Using (IV.2), we have
  $\overrightarrow{\xi}_X - \overrightarrow{W}_X m_X = \overrightarrow{\xi}_Z - \overrightarrow{W}_Z m_Z$   (63)
  $= \overrightarrow{\xi}_X + \overrightarrow{W}_X A H\,(\overrightarrow{\xi}_Y - A^\mathsf{T}\overrightarrow{\xi}_X) - \overrightarrow{W}_X F^\mathsf{T} m_Z$,   (64)
where the second step uses (VI.1) and $\overrightarrow{W}_Z = F\,\overrightarrow{W}_X$ from (57). Subtracting $\overrightarrow{\xi}_X$ and multiplying by $\overrightarrow{V}_X$ yields (VI.5).

Proof of (VI.7): We begin with $\tilde{W}_X = \tilde{W}_Z$ from (II.7). Using (IV.6), we have
  $\overrightarrow{W}_X - \overrightarrow{W}_X V_X \overrightarrow{W}_X = \overrightarrow{W}_Z - \overrightarrow{W}_Z V_Z \overrightarrow{W}_Z$   (65)
  $= \overrightarrow{W}_X - \overrightarrow{W}_X A H A^\mathsf{T} \overrightarrow{W}_X - \overrightarrow{W}_X F^\mathsf{T} V_Z F\,\overrightarrow{W}_X$,   (66)
where the second step uses (VI.2) and (57). Subtracting $\overrightarrow{W}_X$ and multiplying by $\overrightarrow{V}_X$ yields (VI.7).

Proof of (VI.6): Since we have already established (VI.7), we only need to prove
  $A H A^\mathsf{T} = A\,\overrightarrow{V}_Y A^\mathsf{T} F$.   (67)
Using (VI.9), we have
  $A H A^\mathsf{T} = \overrightarrow{V}_X (I - F)$   (68)
             $= \overrightarrow{V}_X \overrightarrow{W}_Z A\,\overrightarrow{V}_Y A^\mathsf{T}$   (69)
             $= A\,\overrightarrow{V}_Y A^\mathsf{T} \overrightarrow{W}_Z \overrightarrow{V}_X$,   (70)
where the last step follows from $(A H A^\mathsf{T})^\mathsf{T} = A H A^\mathsf{T}$. Inserting (57) then yields (67).

APPENDIX B
MESSAGE PASSING IN FIGURE 1 AND PROOFS

In this appendix, we demonstrate how all the quantities pertaining to computations mentioned in Sections II and III, as well as the proof of the theorem in Section II, are obtained by symbolic message passing using the tables in Appendix A. The key ideas of this section are from [6, Section V.C]. Throughout this section, $\sigma_1, \ldots, \sigma_K$ are fixed.

A. Key Quantities $\tilde{\xi}_X$ and $\tilde{W}_X$

The pivotal quantities of this section are the dual mean vector $\tilde{\xi}_{\tilde{U}_k}$ and the dual precision matrix $\tilde{W}_{\tilde{U}_k}$. Concerning the former, we have
  $\tilde{\xi}_{\tilde{U}_k} = \tilde{\xi}_{X_{k-1}} = \tilde{\xi}_{X_k} = \tilde{\xi}_Y$   (71)
              $= -\tilde{W}_Y\,y$,   (72)
for $k = 1, \ldots, K$, where (71) follows from (II.6), and (72) follows from
  $\tilde{\xi}_Y = \tilde{W}_Y (\overrightarrow{m}_Y - \overleftarrow{m}_Y)$   (73)
          $= -\tilde{W}_Y\,y$   (74)
since $\overrightarrow{m}_Y = 0$. Concerning $\tilde{W}_{\tilde{U}_k}$, we have
  $\tilde{W}_{\tilde{U}_k} = \tilde{W}_{X_{k-1}} = \tilde{W}_{X_k} = \tilde{W}_Y$   (75)
              $= W$ as defined in (8),   (76)
for $k = 1, \ldots, K$, where (75) follows from (II.7), and (76) follows from
  $\tilde{W}_Y = \big(\overrightarrow{V}_Y + \overleftarrow{V}_Y\big)^{-1}$   (77)
with $\overleftarrow{V}_Y = 0$ and
  $\overrightarrow{V}_Y = \sum_{k=1}^{K} \sigma_k^2 b_k b_k^\mathsf{T} + \sigma^2 I$.   (78)
The matrix W can be computed without a matrix inversion as follows. First, we note that
  $\tilde{W}_{X_0} = \big(\overrightarrow{V}_{X_0} + \overleftarrow{V}_{X_0}\big)^{-1}$   (79)
           $= \big(0 + \overleftarrow{V}_{X_0}\big)^{-1}$   (80)
           $= \overleftarrow{W}_{X_0}$.   (81)
Second, using (VI.2), the matrix $\overleftarrow{W}_{X_{k-1}}$ can be computed by the backward recursion
  $\overleftarrow{W}_{X_{k-1}} = \overleftarrow{W}_{X_k} - \overleftarrow{W}_{X_k} b_k \big(\sigma_k^{-2} + b_k^\mathsf{T} \overleftarrow{W}_{X_k} b_k\big)^{-1} b_k^\mathsf{T} \overleftarrow{W}_{X_k}$   (82)
starting from $\overleftarrow{W}_{X_K} = \sigma^{-2} I$. The complexity of this alternative computation of W is $O(n^2 K)$; by contrast, the direct computation of (8) using Gauss-Jordan elimination for the matrix inversion has complexity $O(n^2 K + n^3)$.

B. Posterior Distribution and MAP Estimate of $U_k$

For fixed $\sigma_1, \ldots, \sigma_K$, the MAP estimate of $U_k$ is the mean $m_{U_k}$ of the Gaussian posterior of $U_k$. From (IV.9) and (III.7), we have
  $m_{U_k} = \overrightarrow{m}_{U_k} - \overrightarrow{V}_{U_k} \tilde{\xi}_{U_k}$   (83)
          $= -\sigma_k^2 b_k^\mathsf{T} \tilde{\xi}_{\tilde{U}_k}$,   (84)
and (72) yields
  $m_{U_k} = \sigma_k^2 b_k^\mathsf{T} W y$,   (85)
which proves (7). For re-estimating the variance $\sigma_k^2$ as in Section III, we also need the variance $\sigma_{U_k}^2$ of the posterior distribution of $U_k$. From (IV.13) and (III.8), we have
  $\sigma_{U_k}^2 = \overrightarrow{V}_{U_k} - \overrightarrow{V}_{U_k} \tilde{W}_{U_k} \overrightarrow{V}_{U_k}$   (86)
          $= \sigma_k^2 - \sigma_k^2 b_k^\mathsf{T} \tilde{W}_{\tilde{U}_k} b_k\,\sigma_k^2$,   (87)
and (76) yields
  $\sigma_{U_k}^2 = \sigma_k^2 - \sigma_k^2 b_k^\mathsf{T} W b_k\,\sigma_k^2$.   (88)

C. Likelihood Function and Backward Message of $U_k$

We now consider the backward message along the edge $U_k$, which is the likelihood function $p(y \mid u_k, \sigma_1, \ldots, \sigma_K)$ (for fixed y and fixed $\sigma_1, \ldots, \sigma_K$), up to a scale factor. For use in Section B-D below, we give two different expressions both for the mean $\overleftarrow{m}_{U_k}$ and for the variance $\overleftarrow{\sigma}_{U_k}^2$ of this message. As to the latter, we have
  $\overleftarrow{W}_{U_k} = b_k^\mathsf{T} \overleftarrow{W}_{\tilde{U}_k} b_k$   (89)
from (III.4), and thus
  $\overleftarrow{\sigma}_{U_k}^2 = \big(b_k^\mathsf{T} \overleftarrow{W}_{\tilde{U}_k} b_k\big)^{-1}$.   (90)

We also note from (II.4) that
  $\overleftarrow{W}_{\tilde{U}_k} = \big(\overleftarrow{V}_{X_k} + \overrightarrow{V}_{X_{k-1}}\big)^{-1}$   (91)
              $= \tilde{W}_k$ as defined in (10).   (92)
Alternatively, we have
  $\overleftarrow{\sigma}_{U_k}^2 = \tilde{W}_{U_k}^{-1} - \overrightarrow{V}_{U_k}$   (93)
              $= \big(b_k^\mathsf{T} \tilde{W}_{\tilde{U}_k} b_k\big)^{-1} - \sigma_k^2$   (94)
              $= \big(b_k^\mathsf{T} W b_k\big)^{-1} - \sigma_k^2$,   (95)
where we used (IV.4), (III.8), and (76).

As to the mean $\overleftarrow{m}_{U_k}$, we have
  $\overleftarrow{\xi}_{U_k} = b_k^\mathsf{T} \overleftarrow{\xi}_{\tilde{U}_k}$   (96)
              $= b_k^\mathsf{T} \overleftarrow{W}_{\tilde{U}_k} \overleftarrow{m}_{\tilde{U}_k}$   (97)
              $= b_k^\mathsf{T} \overleftarrow{W}_{\tilde{U}_k} \big(\overleftarrow{m}_{X_k} - \overrightarrow{m}_{X_{k-1}}\big)$   (98)
              $= b_k^\mathsf{T} \overleftarrow{W}_{\tilde{U}_k}\,y$   (99)
from (III.3) and (II.3), and thus
  $\overleftarrow{m}_{U_k} = \overleftarrow{\sigma}_{U_k}^2 \overleftarrow{\xi}_{U_k}$   (100)
              $= \big(b_k^\mathsf{T} \overleftarrow{W}_{\tilde{U}_k} b_k\big)^{-1} b_k^\mathsf{T} \overleftarrow{W}_{\tilde{U}_k}\,y$   (101)
from (90). Alternatively, we have
  $\overleftarrow{m}_{U_k} = \overrightarrow{m}_{U_k} - \tilde{W}_{U_k}^{-1} \tilde{\xi}_{U_k}$   (102)
              $= -\big(b_k^\mathsf{T} \tilde{W}_{\tilde{U}_k} b_k\big)^{-1} b_k^\mathsf{T} \tilde{\xi}_{\tilde{U}_k}$   (103)
              $= \big(b_k^\mathsf{T} W b_k\big)^{-1} b_k^\mathsf{T} W y$,   (104)
where we used (IV.1), (III.8), (III.7), (72), and (76).

D. Proof of the Theorem in Section II

Let $\sigma_1, \ldots, \sigma_K$ be fixed at a local maximum or at a saddle point of the likelihood $p(y \mid \sigma_1, \ldots, \sigma_K)$. Then
  $\sigma_k = \operatorname*{argmax}_{\sigma_k} p(y \mid \sigma_1, \ldots, \sigma_K)$   (105)
and
  $\sigma_k^2 = \max\big\{0,\ \overleftarrow{m}_{U_k}^2 - \overleftarrow{\sigma}_{U_k}^2\big\}$   (106)
from (2). From (101) and (90), we have
  $\overleftarrow{m}_{U_k}^2 - \overleftarrow{\sigma}_{U_k}^2 = \frac{\big(b_k^\mathsf{T} \overleftarrow{W}_{\tilde{U}_k} y\big)^2}{\big(b_k^\mathsf{T} \overleftarrow{W}_{\tilde{U}_k} b_k\big)^2} - \frac{1}{b_k^\mathsf{T} \overleftarrow{W}_{\tilde{U}_k} b_k}$.   (107)
With (92), it is obvious that $\overleftarrow{m}_{U_k}^2 - \overleftarrow{\sigma}_{U_k}^2 \leq 0$ if and only if (9) holds. As to (11), we have
  $\overleftarrow{m}_{U_k}^2 - \overleftarrow{\sigma}_{U_k}^2 = \frac{\big(b_k^\mathsf{T} W y\big)^2}{\big(b_k^\mathsf{T} W b_k\big)^2} - \frac{1}{b_k^\mathsf{T} W b_k} + \sigma_k^2$   (108)
from (104) and (95). We now distinguish two cases. If $\sigma_k^2 > 0$, (106) and (108) together imply
  $\frac{\big(b_k^\mathsf{T} W y\big)^2}{\big(b_k^\mathsf{T} W b_k\big)^2} - \frac{1}{b_k^\mathsf{T} W b_k} = 0$.   (109)
On the other hand, if $\sigma_k^2 = 0$, (106) and (108) imply
  $\frac{\big(b_k^\mathsf{T} W y\big)^2}{\big(b_k^\mathsf{T} W b_k\big)^2} \leq \frac{1}{b_k^\mathsf{T} W b_k}$.   (110)
Combining these two cases yields (11).

REFERENCES

[1] S. Roweis and Z. Ghahramani, "A unifying review of linear Gaussian models," Neural Computation, vol. 11, pp. 305-345, Feb. 1999.
[2] T. Kailath, A. H. Sayed, and B. Hassibi, Linear Estimation. Prentice Hall, NJ, 2000.
[3] J. Durbin and S. J. Koopman, Time Series Analysis by State Space Methods. Oxford Univ. Press, 2012.
[4] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[5] H.-A. Loeliger, "An introduction to factor graphs," IEEE Signal Proc. Mag., Jan. 2004, pp. 28-41.
[6] H.-A. Loeliger, J. Dauwels, Junli Hu, S. Korl, Li Ping, and F. R. Kschischang, "The factor graph approach to model-based signal processing," Proceedings of the IEEE, vol. 95, no. 6, pp. 1295-1322, June 2007.
[7] D. J. C. MacKay, "Bayesian interpolation," Neural Comp., vol. 4, no. 3, pp. 415-447, 1992.
[8] S. Gull, "Bayesian inductive inference and maximum entropy," in Maximum-Entropy and Bayesian Methods in Science and Engineering, G. J. Erickson and C. R. Smith, eds., Kluwer, 1988.
[9] R. M. Neal, Bayesian Learning for Neural Networks. New York: Springer Verlag, 1996.
[10] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Machine Learning Research, vol. 1, pp. 211-244, 2001.
[11] M. E. Tipping and A. C. Faul, "Fast marginal likelihood maximisation for sparse Bayesian models," Proc. 9th Int. Workshop on Artificial Intelligence and Statistics, 2003.
[12] D. Wipf and S. Nagarajan, "A new view of automatic relevance determination," Advances in Neural Information Processing Systems, 2008.
[13] D. P. Wipf and B. D. Rao, "Sparse Bayesian learning for basis selection," IEEE Trans. Signal Proc., vol. 52, no. 8, pp. 2153-2164, Aug. 2004.
[14] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, Series B, pp. 1-38, 1977.
[15] P. Stoica and Y. Selén, "Cyclic minimizers, majorization techniques, and the expectation-maximization algorithm: a refresher," IEEE Signal Proc. Mag., Jan. 2004, pp. 112-114.
[16] Z. Ghahramani and G. E. Hinton, Parameter Estimation for Linear Dynamical Systems. Techn. Report CRG-TR-96-2, Univ. of Toronto, 1996.
[17] J. Dauwels, A. Eckford, S. Korl, and H.-A. Loeliger, "Expectation maximization as message passing — Part I: principles and Gaussian messages," arXiv preprint.
[18] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Information Theory, vol. 47, pp. 498-519, Feb. 2001.
[19] L. Bruderer and H.-A. Loeliger, "Estimation of sensor input signals that are neither bandlimited nor sparse," 2014 Information Theory & Applications Workshop (ITA), San Diego, CA, Feb. 9-14, 2014.
[20] L. Bruderer, Input Estimation and Dynamical System Identification: New Algorithms and Results. PhD thesis, ETH Zurich No. 22575, 2015.
[21] G. J. Bierman, Factorization Methods for Discrete Sequential Estimation. New York: Academic Press, 1977.
[22] L. Bruderer, H. Malmberg, and H.-A. Loeliger, "Deconvolution of weakly-sparse signals and dynamical-system identification by Gaussian message passing," 2015 IEEE Int. Symp. on Information Theory (ISIT), Hong Kong, June 14-19, 2015.
[23] N. Zalmai, H. Malmberg, and H.-A. Loeliger, "Blind deconvolution of sparse but filtered pulses with linear state space models," 41st IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, March 20-25, 2016.
[24] F. Wadehn, L. Bruderer, V. Sahdeva, and H.-A. Loeliger, "Outlier-insensitive Kalman smoothing and marginal message passing," in preparation.
[25] D. L. Donoho, A. Maleki, and A. Montanari, "Message-passing algorithms for compressed sensing," Proc. National Academy of Sciences, vol. 106, no. 45, pp. 18914-18919, 2009.


More information

1. Define the following terms (1 point each): alternative hypothesis

1. Define the following terms (1 point each): alternative hypothesis 1 1. Define the following terms (1 point each): alternative hypothesis One of three hypotheses indicating that the parameter is not zero; one states the parameter is not equal to zero, one states the parameter

More information

Sub-Gaussian Model Based LDPC Decoder for SαS Noise Channels

Sub-Gaussian Model Based LDPC Decoder for SαS Noise Channels Sub-Gaussian Model Based LDPC Decoder for SαS Noise Channels Iulian Topor Acoustic Research Laboratory, Tropical Marine Science Institute, National University of Singapore, Singapore 119227. iulian@arl.nus.edu.sg

More information

SVETLANA KATOK AND ILIE UGARCOVICI (Communicated by Jens Marklof)

SVETLANA KATOK AND ILIE UGARCOVICI (Communicated by Jens Marklof) JOURNAL OF MODERN DYNAMICS VOLUME 4, NO. 4, 010, 637 691 doi: 10.3934/jmd.010.4.637 STRUCTURE OF ATTRACTORS FOR (a, )-CONTINUED FRACTION TRANSFORMATIONS SVETLANA KATOK AND ILIE UGARCOVICI (Communicated

More information

Expectation Propagation in Dynamical Systems

Expectation Propagation in Dynamical Systems Expectation Propagation in Dynamical Systems Marc Peter Deisenroth Joint Work with Shakir Mohamed (UBC) August 10, 2012 Marc Deisenroth (TU Darmstadt) EP in Dynamical Systems 1 Motivation Figure : Complex

More information

FIR Filters for Stationary State Space Signal Models

FIR Filters for Stationary State Space Signal Models Proceedings of the 17th World Congress The International Federation of Automatic Control FIR Filters for Stationary State Space Signal Models Jung Hun Park Wook Hyun Kwon School of Electrical Engineering

More information

Lecture 8: Bayesian Estimation of Parameters in State Space Models

Lecture 8: Bayesian Estimation of Parameters in State Space Models in State Space Models March 30, 2016 Contents 1 Bayesian estimation of parameters in state space models 2 Computational methods for parameter estimation 3 Practical parameter estimation in state space

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

Parameter Estimation in a Moving Horizon Perspective

Parameter Estimation in a Moving Horizon Perspective Parameter Estimation in a Moving Horizon Perspective State and Parameter Estimation in Dynamical Systems Reglerteknik, ISY, Linköpings Universitet State and Parameter Estimation in Dynamical Systems OUTLINE

More information

Lecture 6 January 15, 2014

Lecture 6 January 15, 2014 Advanced Graph Algorithms Jan-Apr 2014 Lecture 6 January 15, 2014 Lecturer: Saket Sourah Scrie: Prafullkumar P Tale 1 Overview In the last lecture we defined simple tree decomposition and stated that for

More information

Probabilistic and Bayesian Machine Learning

Probabilistic and Bayesian Machine Learning Probabilistic and Bayesian Machine Learning Day 4: Expectation and Belief Propagation Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London http://www.gatsby.ucl.ac.uk/

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Minimizing a convex separable exponential function subject to linear equality constraint and bounded variables

Minimizing a convex separable exponential function subject to linear equality constraint and bounded variables Minimizing a convex separale exponential function suect to linear equality constraint and ounded variales Stefan M. Stefanov Department of Mathematics Neofit Rilski South-Western University 2700 Blagoevgrad

More information

DEPARTMENT OF ECONOMICS

DEPARTMENT OF ECONOMICS ISSN 089-64 ISBN 978 0 7340 405 THE UNIVERSITY OF MELBOURNE DEPARTMENT OF ECONOMICS RESEARCH PAPER NUMBER 06 January 009 Notes on the Construction of Geometric Representations of Confidence Intervals of

More information

For final project discussion every afternoon Mark and I will be available

For final project discussion every afternoon Mark and I will be available Worshop report 1. Daniels report is on website 2. Don t expect to write it based on listening to one project (we had 6 only 2 was sufficient quality) 3. I suggest writing it on one presentation. 4. Include

More information

Fractional Belief Propagation

Fractional Belief Propagation Fractional Belief Propagation im iegerinck and Tom Heskes S, niversity of ijmegen Geert Grooteplein 21, 6525 EZ, ijmegen, the etherlands wimw,tom @snn.kun.nl Abstract e consider loopy belief propagation

More information

DIFFUSION NETWORKS, PRODUCT OF EXPERTS, AND FACTOR ANALYSIS. Tim K. Marks & Javier R. Movellan

DIFFUSION NETWORKS, PRODUCT OF EXPERTS, AND FACTOR ANALYSIS. Tim K. Marks & Javier R. Movellan DIFFUSION NETWORKS PRODUCT OF EXPERTS AND FACTOR ANALYSIS Tim K Marks & Javier R Movellan University of California San Diego 9500 Gilman Drive La Jolla CA 92093-0515 ABSTRACT Hinton (in press) recently

More information

Optimal Routing in Chord

Optimal Routing in Chord Optimal Routing in Chord Prasanna Ganesan Gurmeet Singh Manku Astract We propose optimal routing algorithms for Chord [1], a popular topology for routing in peer-to-peer networks. Chord is an undirected

More information

RECURSIVE OUTLIER-ROBUST FILTERING AND SMOOTHING FOR NONLINEAR SYSTEMS USING THE MULTIVARIATE STUDENT-T DISTRIBUTION

RECURSIVE OUTLIER-ROBUST FILTERING AND SMOOTHING FOR NONLINEAR SYSTEMS USING THE MULTIVARIATE STUDENT-T DISTRIBUTION 1 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 3 6, 1, SANTANDER, SPAIN RECURSIVE OUTLIER-ROBUST FILTERING AND SMOOTHING FOR NONLINEAR SYSTEMS USING THE MULTIVARIATE STUDENT-T

More information

Compressive Sensing under Matrix Uncertainties: An Approximate Message Passing Approach

Compressive Sensing under Matrix Uncertainties: An Approximate Message Passing Approach Compressive Sensing under Matrix Uncertainties: An Approximate Message Passing Approach Asilomar 2011 Jason T. Parker (AFRL/RYAP) Philip Schniter (OSU) Volkan Cevher (EPFL) Problem Statement Traditional

More information