Factor Analysis with Poisson Output

Gopal Santhanam, Byron Yu, Krishna V. Shenoy
Department of Electrical Engineering, Neurosciences Program
Stanford University, Stanford, CA 94305, USA
{gopal,byronyu,shenoy}@stanford.edu

Maneesh Sahani
Gatsby Computational Neuroscience Unit, University College London
17 Queen Square, London WC1N 3AR, UK
maneesh@gatsby.ucl.ac.uk

Technical Report NPSL-TR-06-1
March 9, 2006

Abstract

We derive a modified version of factor analysis for data that is Poisson rather than Gaussian distributed. This modified approach may better fit certain classes of data, including neuronal spiking data commonly collected in electrophysiology experiments.

1 Introduction

Factor analysis and other similar dimensionality reduction approaches (e.g., PCA or SPCA) are derived using a state-space model. The latent state is modeled as a Gaussian distribution. The observed output is modeled as a linear function of the latent state with additive Gaussian noise. This approach can provide the benefit of reducing the dimensionality of the observed, but noisy, data to a small number of underlying factors. These factors may then be used to provide meaningful predictions on new data.

For count, or point process, data, the Gaussian output noise model used in factor analysis may not provide a good description of the data. Instead, we modify the output noise model to be Poisson. Additionally, we extend the state-space model to incorporate a mixture of Gaussians rather than a single Gaussian distribution. This extension can serve to better model the latent state, especially when there is an a priori expectation that the data is clustered. Once trained, the model can be used to make predictions of the latent or unobserved state for new observed data. We dub our new approach Factor Analysis with Poisson Output, or FAPO for short.

2 Generative Model

The generative model for FAPO is given below:
\begin{align}
x \mid s &\sim \mathcal{N}(\mu_s, \Sigma_s) \tag{1}\\
y_i \mid x &\sim \mathrm{Poisson}\big(h(c_i^\top x + d_i)\,\Delta\big) \quad \text{for } i = 1, \dots, q \tag{2}
\end{align}
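To make the generative process concrete, here is a minimal sampling sketch in Python/NumPy (the symbols are defined in detail in the next paragraph). The dimensions, parameter values, and the choice of the exponential link $h(z) = e^z$ are illustrative assumptions, not values from the report.

```python
# Minimal sketch of sampling from the FAPO generative model.
# All sizes and parameter values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
M, p, q = 3, 2, 10          # mixture components, latent dim, observed dim
delta = 0.05                # time bin width (Delta)

pi = np.ones(M) / M                          # mixture weights pi_s
mu = rng.normal(size=(M, p))                 # component means mu_s
Sigma = np.stack([np.eye(p)] * M)            # component covariances Sigma_s
C = rng.normal(size=(q, p))                  # rows are c_i'
d = rng.normal(size=q)                       # offsets d_i

def sample(n):
    """Draw n i.i.d. (s, x, y) triples from the generative model."""
    s = rng.choice(M, size=n, p=pi)          # mixture indicator s
    x = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in s])
    rate = np.exp(x @ C.T + d) * delta       # exponential link h(z) = e^z
    y = rng.poisson(rate)                    # Poisson counts, one per output
    return s, x, y

s, x, y = sample(500)
```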
The random variable $s$ is the mixture component indicator and has a discrete probability distribution over $\{1, \dots, M\}$, i.e., $P(s) = \pi_s$. Given $s$, the latent state vector, $x \in \mathbb{R}^p$, is Gaussian distributed with mean $\mu_s$ and covariance $\Sigma_s$. The outputs, $y_i \in \mathbb{N}_0$, are generated from a Poisson distribution, where $h$ is a link function mapping $\mathbb{R} \to \mathbb{R}^+$, $c_i \in \mathbb{R}^p$ and $d_i \in \mathbb{R}$ are constants, and $\Delta \in \mathbb{R}$ is the time bin width. We collect the counts from all $q$ simultaneously observed variables into a vector $y \in \mathbb{N}_0^q$, whose $i$th element is $y_i$. The choice of the link function $h$ is discussed in the following sections. In this work, we assume that all of the parameters of the model, namely $\pi_s$, $\mu_s$, $\Sigma_s$, $c_i$, and $d_i$ for $s \in \{1, \dots, M\}$ and $i \in \{1, \dots, q\}$, are unknown. The goal is to learn the parameters so that the model can be used to make predictions of $x$ and $s$ for a new $y$.

3 System Identification

The procedure of system identification, or model training, requires learning the parameters from the observed data. The observed data include $N$ observations of $y$, an i.i.d. sequence $y^1, y^2, \dots, y^N$ denoted by $\{y\}$, and $N$ observations of the mixture component indicator $s$, an i.i.d. sequence $s^1, s^2, \dots, s^N$ denoted by $\{s\}$. The latent state vectors are hidden and not observed. This situation is an unsupervised problem, although not completely unsupervised; the system identification can be more challenging if $s$ is also unknown. This latter scenario is beyond the scope of this article. Once the model is trained, however, we estimate the most likely $x$ and $s$ for new observed data $y$, as described in the following section.

The standard approach to system identification in the presence of unobserved latent variables is the Expectation-Maximization (EM) algorithm. The algorithm maximizes the likelihood of the model parameters, i.e., $\theta = \{\pi, \mu_{1,\dots,M}, \Sigma_{1,\dots,M}, c_{1,\dots,q}, d_{1,\dots,q}\}$, over the observed data. The algorithm is iterative and each iteration is performed in two parts, the expectation (E) step and the maximization (M) step. Iterations are performed until the likelihood converges.

3.1 E-step

The E-step of EM requires computing the expected log joint likelihood, $E[\log P(\{x\},\{y\},\{s\} \mid \theta)]$, over the posterior distribution of the hidden state vectors, $P(\{x\} \mid \{y\}, \{s\}, \theta_k)$, where $\theta_k$ are the parameter estimates at the $k$th EM iteration. Since the observations are i.i.d., we can equivalently maximize the sum of the individual expected log joint likelihoods, $E[\log P(x, y, s \mid \theta)]$. The posterior distribution can be expressed as follows:
\[
P(x \mid y, s, \theta_k) \propto P(y \mid x, \theta_k)\, P(x \mid s, \theta_k). \tag{3}
\]
Because $P(y \mid x)$ is a product of Poissons rather than a Gaussian, the state posterior $P(x \mid y, s)$ will not be of a form that allows for easy computation of the log joint likelihood. Instead, as in [1], we approximate this posterior with a Gaussian centered at the mode of $\log P(x \mid y, s)$ and whose covariance is given by the negative inverse Hessian of the log posterior at that mode. Certain choices of $h$, including $h_1(z) = e^z$ and $h_2(z) = \log(1 + e^z)$, lead to a log posterior that is strictly concave in $x$. In these cases, the unique mode can easily be found by Newton's method.
\begin{align}
\log P(x \mid y, s, \theta) &= \log P(y \mid x, \theta) + \log P(x \mid s, \theta) + C_1 \notag\\
&= \sum_{i=1}^q \log P(y_i \mid x) + \log \mathcal{N}(x \mid \mu_s, \Sigma_s) + C_2 \notag\\
&= \sum_{i=1}^q \Big[-h(c_i^\top x + d_i)\,\Delta + y_i \log\big(h(c_i^\top x + d_i)\,\Delta\big) - \log y_i!\Big] + \log \mathcal{N}(x \mid \mu_s, \Sigma_s) + C_3 \notag\\
&= \sum_{i=1}^q \Big[-h(c_i^\top x + d_i)\,\Delta + y_i \log h(c_i^\top x + d_i)\Big] - \tfrac{1}{2}\, x^\top \Sigma_s^{-1} x + \mu_s^\top \Sigma_s^{-1} x + C_4 \tag{4}
\end{align}
Taking the gradient and Hessian of (4) with respect to $x$ results in the following expressions:
\begin{align}
\nabla_x \log P(x \mid y, s, \theta) &= \sum_{i=1}^q \Big[-\nabla_x\, h(c_i^\top x + d_i)\,\Delta + y_i\, \nabla_x \log h(c_i^\top x + d_i)\Big] - \Sigma_s^{-1} x + \Sigma_s^{-1} \mu_s \notag\\
\nabla_x^2 \log P(x \mid y, s, \theta) &= \sum_{i=1}^q \Big[-\nabla_x^2\, h(c_i^\top x + d_i)\,\Delta + y_i\, \nabla_x^2 \log h(c_i^\top x + d_i)\Big] - \Sigma_s^{-1} \notag
\end{align}
Let $\zeta_i = c_i^\top x + d_i$. For the aforementioned versions $h_1$ and $h_2$, the gradients and Hessians are
\begin{align}
\nabla_x \log P(x \mid y, s, \theta) &= \sum_{i=1}^q \big(-e^{\zeta_i}\Delta\, c_i + y_i\, c_i\big) - \Sigma_s^{-1} x + \Sigma_s^{-1}\mu_s \tag{5}\\
&= \sum_{i=1}^q \big(-e^{\zeta_i}\Delta + y_i\big)\, c_i - \Sigma_s^{-1} x + \Sigma_s^{-1}\mu_s \tag{6}\\
\nabla_x^2 \log P(x \mid y, s, \theta) &= -\sum_{i=1}^q e^{\zeta_i}\Delta\, c_i c_i^\top - \Sigma_s^{-1} \tag{7}
\end{align}
and
\begin{align}
\nabla_x \log P(x \mid y, s, \theta) &= \sum_{i=1}^q \bigg[-\frac{e^{\zeta_i}}{1 + e^{\zeta_i}}\,\Delta + y_i\, \frac{e^{\zeta_i}}{(1 + e^{\zeta_i}) \log(1 + e^{\zeta_i})}\bigg]\, c_i - \Sigma_s^{-1} x + \Sigma_s^{-1}\mu_s \tag{8}\\
\nabla_x^2 \log P(x \mid y, s, \theta) &= \sum_{i=1}^q \bigg[-\frac{e^{\zeta_i}}{(1 + e^{\zeta_i})^2}\,\Delta + y_i\, \frac{e^{\zeta_i}}{(1 + e^{\zeta_i})^2 \log(1 + e^{\zeta_i})}\bigg(1 - \frac{e^{\zeta_i}}{\log(1 + e^{\zeta_i})}\bigg)\bigg]\, c_i c_i^\top - \Sigma_s^{-1}, \tag{9}
\end{align}
respectively.

For observation $n$, let $Q_n$ be a Gaussian distribution in $\mathbb{R}^p$ that approximates $P(x^n \mid y^n, s^n, \theta_k)$ and has mean $\xi_n$ and covariance $\Psi_n$.
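As an illustration of this E-step, the following sketch (our own, not code from the report) runs Newton's method with the exponential-link gradient (6) and Hessian (7) to locate the posterior mode and form the Gaussian approximation; the function name and the fixed iteration count are arbitrary choices.

```python
# Hedged sketch: Newton ascent to the posterior mode for h(z) = e^z,
# returning the mean xi and covariance Psi of the Gaussian approximation.
import numpy as np

def posterior_mode(y, C, d, mu_s, Sigma_s, delta, n_iter=25):
    """y: (q,) counts; C: (q, p) with rows c_i'; d: (q,) offsets."""
    Sinv = np.linalg.inv(Sigma_s)
    x = mu_s.copy()                            # start at the prior mean
    for _ in range(n_iter):                    # fixed count; could test tolerance
        zeta = C @ x + d                       # zeta_i = c_i' x + d_i
        # Gradient (6) and Hessian (7) of the log posterior:
        grad = C.T @ (y - np.exp(zeta) * delta) - Sinv @ (x - mu_s)
        hess = -(C.T * (np.exp(zeta) * delta)) @ C - Sinv
        x = x - np.linalg.solve(hess, grad)    # Newton step (hess is negative definite)
    Psi = np.linalg.inv(-hess)                 # negative inverse Hessian at the mode
    return x, Psi
```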
The expectation of the log joint likelihood for a given observation can be expressed as follows:
\begin{align}
E_n &= E_{Q_n}\big[\log P(x^n, y^n, s^n \mid \theta)\big] \tag{10}\\
&= E_{Q_n}\Big[\sum_{i=1}^q \log P(y_i^n \mid x^n) + \log P(x^n \mid s^n) + \log P(s^n)\Big] \tag{11}\\
&= E_{Q_n}\Big[\sum_{i=1}^q \Big(-h(c_i^\top x^n + d_i)\,\Delta + y_i^n \log\big(h(c_i^\top x^n + d_i)\,\Delta\big) - \log y_i^n!\Big) - \tfrac{p}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma_{s^n}| \notag\\
&\qquad - \tfrac{1}{2}\,(x^n)^\top \Sigma_{s^n}^{-1} x^n + \mu_{s^n}^\top \Sigma_{s^n}^{-1} x^n - \tfrac{1}{2}\,\mu_{s^n}^\top \Sigma_{s^n}^{-1}\mu_{s^n} + \log P(s^n)\Big] \tag{12}
\end{align}
The terms that do not depend on $x^n$ or any component of $\theta$ can be grouped as a constant, $C$, outside the expectation. Doing so, and also moving terms that do not depend on $x^n$ outside the expectation, we have
\begin{align}
E_n &= E_{Q_n}\Big[\sum_{i=1}^q \Big(-h(c_i^\top x^n + d_i)\,\Delta + y_i^n \log h(c_i^\top x^n + d_i)\Big)\Big] - \tfrac{1}{2}\, E_{Q_n}\big[(x^n)^\top \Sigma_{s^n}^{-1} x^n\big] + \mu_{s^n}^\top \Sigma_{s^n}^{-1} E_{Q_n}[x^n] \notag\\
&\qquad - \tfrac{1}{2}\,\mu_{s^n}^\top \Sigma_{s^n}^{-1}\mu_{s^n} - \tfrac{1}{2}\log|\Sigma_{s^n}| + C \tag{13}\\
&= E_{Q_n}\Big[\sum_{i=1}^q \Big(-h(c_i^\top x^n + d_i)\,\Delta + y_i^n \log h(c_i^\top x^n + d_i)\Big)\Big] - \tfrac{1}{2}\,\mathrm{Tr}\big(\Sigma_{s^n}^{-1}(\Psi_n + \xi_n\xi_n^\top)\big) + \mu_{s^n}^\top \Sigma_{s^n}^{-1}\xi_n \notag\\
&\qquad - \tfrac{1}{2}\,\mu_{s^n}^\top \Sigma_{s^n}^{-1}\mu_{s^n} - \tfrac{1}{2}\log|\Sigma_{s^n}| + C, \tag{14}
\end{align}
where (13) is simplified to (14) by using the following relationship:
\[
E_{Q_n}\big[(x^n)^\top \Sigma_{s^n}^{-1} x^n\big] = E_{Q_n}\big[\mathrm{Tr}\big(\Sigma_{s^n}^{-1} x^n (x^n)^\top\big)\big] = \mathrm{Tr}\big(\Sigma_{s^n}^{-1}\, E_{Q_n}\big[x^n (x^n)^\top\big]\big) = \mathrm{Tr}\big(\Sigma_{s^n}^{-1}(\Psi_n + \xi_n\xi_n^\top)\big).
\]
Because the posterior state distributions are approximated as Gaussians in the E-step, the expectation in (14) is a Gaussian integral that involves non-linear functions ($g$ and $h$) and cannot be computed analytically in general. Fortunately, this high-dimensional integral can be reduced to a one-dimensional Gaussian integral with mean $c_i^\top \xi_n + d_i$ and variance $c_i^\top \Psi_n c_i$.
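This reduction is easy to spot-check numerically. The following Monte Carlo sketch, with arbitrary made-up values for $\xi_n$, $\Psi_n$, $c_i$, and $d_i$ and the softplus choice for the nonlinearity, compares the $p$-dimensional expectation against the one-dimensional one:

```python
# Monte Carlo sanity sketch (arbitrary values): E_Q[g(c'x + d)] for
# x ~ N(xi, Psi) matches a 1-D Gaussian expectation with mean c'xi + d
# and variance c'Psi c.
import numpy as np

rng = np.random.default_rng(1)
p = 3
xi = rng.normal(size=p)
A = rng.normal(size=(p, p)); Psi = 0.1 * (A @ A.T)   # a valid covariance
c, d = rng.normal(size=p), 0.3
g = lambda z: np.log1p(np.exp(z))                    # softplus, i.e., h_2

xs = rng.multivariate_normal(xi, Psi, size=200_000)
lhs = g(xs @ c + d).mean()                           # p-dimensional average

m, v = c @ xi + d, c @ Psi @ c                       # 1-D mean and variance
rhs = g(rng.normal(m, np.sqrt(v), size=200_000)).mean()
print(lhs, rhs)                                      # agree up to Monte Carlo error
```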
The expectation of the log joint likelihood over all of the $N$ observations is simply the sum of the individual $E_n$ terms:
\[
E = E_Q\big[\log P(\{x\},\{y\},\{s\} \mid \theta)\big] = \sum_{n=1}^N E_n.
\]

3.2 M-step

The M-step requires finding the $\hat\theta_{k+1}$ that satisfies:
\[
\hat\theta_{k+1} = \arg\max_\theta\; E_Q\big[\log P(\{x\},\{y\},\{s\} \mid \theta)\big]. \tag{15}
\]
This can be achieved by differentiating $E$ with respect to the parameters, $\theta$, as shown below. The indicator function, $I(s^n = s)$, will prove useful. Also, let $N_s = \sum_{n=1}^N I(s^n = s)$.

Prior probability of mixture component identification $s$:
\[
\pi_s = \frac{N_s}{N} \tag{16}
\]

State vector mean, for mixture component identification $s$:
\begin{align}
\frac{\partial E}{\partial \mu_s} &= \sum_{n=1}^N I(s^n = s)\big(\Sigma_s^{-1}\xi_n - \Sigma_s^{-1}\mu_s\big) = 0 \notag\\
\mu_s^{k+1} &= \frac{1}{N_s}\sum_{n=1}^N I(s^n = s)\,\xi_n \tag{17}
\end{align}

State vector covariance, for mixture component identification $s$:
\begin{align}
\frac{\partial E}{\partial \Sigma_s^{-1}} &= \sum_{n=1}^N I(s^n = s)\,\tfrac{1}{2}\Big[\Sigma_s - \big(\Psi_n + \xi_n\xi_n^\top\big) + \mu_s\xi_n^\top + \xi_n\mu_s^\top - \mu_s\mu_s^\top\Big] = 0 \notag\\
\Sigma_s^{k+1} &= \frac{1}{N_s}\sum_{n=1}^N I(s^n = s)\big(\Psi_n + \xi_n\xi_n^\top\big) - \mu_s^{k+1}\big(\mu_s^{k+1}\big)^\top, \tag{18}
\end{align}
where the second line substitutes $\mu_s^{k+1}$ from (17).
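Assuming the E-step has already produced $\xi_n$ and $\Psi_n$ for every observation, the updates (16)–(18) are one-liners; the sketch below (function name and array layout are our own conventions) collects them:

```python
# Sketch of the closed-form M-step updates (16)-(18); assumes every
# component label appears at least once in the data.
import numpy as np

def m_step_state(s, xi, Psi, M):
    """s: (N,) labels; xi: (N, p) posterior means; Psi: (N, p, p) covariances."""
    N, p = xi.shape
    pi_new = np.zeros(M)
    mu_new = np.zeros((M, p))
    Sigma_new = np.zeros((M, p, p))
    for k in range(M):
        idx = (s == k)
        pi_new[k] = idx.sum() / N                             # (16)
        mu_new[k] = xi[idx].mean(axis=0)                      # (17)
        # Average posterior second moment Psi_n + xi_n xi_n':
        second = (Psi[idx] + np.einsum('np,nq->npq', xi[idx], xi[idx])).mean(axis=0)
        Sigma_new[k] = second - np.outer(mu_new[k], mu_new[k])  # (18)
    return pi_new, mu_new, Sigma_new
```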
Observation mapping constants:

We want to maximize the following objective function with respect to $c_i$ and $d_i$:
\[
\tilde E = \sum_{n=1}^N \sum_{i=1}^q E_{Q_n}\Big[-h(c_i^\top x^n + d_i)\,\Delta + y_i^n \log h(c_i^\top x^n + d_i)\Big] + C. \tag{19}
\]
First, let us instead examine the following more general problem: maximize the objective function $E_x\big[g(c^\top x + d)\big]$ with respect to $c$ and $d$, where $g$ is concave and $x$ is Gaussian distributed with mean $\xi$ and covariance $\Psi$. Defining the new variables $\tilde c = [c^\top\ d]^\top$ and $\tilde x = [x^\top\ 1]^\top$, the objective function can be equivalently expressed as
\[
O = E_{\tilde x}\big[g(\tilde c^\top \tilde x)\big] = \int_{\tilde x} g(\tilde c^\top \tilde x)\, \mathcal{N}(\tilde x \mid \tilde\xi, \tilde\Psi)\, d\tilde x = \int_z g(z)\, \mathcal{N}\big(z \mid \tilde c^\top \tilde\xi,\ \tilde c^\top \tilde\Psi \tilde c\big)\, dz, \tag{20}
\]
where $\tilde\xi = [\xi^\top\ 1]^\top$ and $\tilde\Psi$ is a matrix in $\mathbb{R}^{(p+1)\times(p+1)}$ with the upper-left $p \times p$ sub-matrix equal to $\Psi$ and the rest of the elements set to zero. This objective function can be maximized using Newton's method since it is concave in $\tilde c$. However, to perform this optimization, we require the gradient and the Hessian of $O$. The gradient can be obtained as shown below:
\[
\frac{\partial O}{\partial \tilde c} = \int_z g(z)\, \frac{\partial}{\partial \tilde c}\Bigg[\frac{1}{\sqrt{2\pi\, \tilde c^\top \tilde\Psi \tilde c}}\, \exp\bigg(-\frac{(z - \tilde c^\top \tilde\xi)^2}{2\, \tilde c^\top \tilde\Psi \tilde c}\bigg)\Bigg]\, dz. \tag{21}
\]
Taking the partial derivatives in (21) requires the following quantities:
\begin{align}
\frac{\partial}{\partial \tilde c}\big(\tilde c^\top \tilde\Psi \tilde c\big) &= 2\,\tilde\Psi \tilde c, \qquad \frac{\partial}{\partial \tilde c}\big(\tilde c^\top \tilde\xi\big) = \tilde\xi, \notag\\
\frac{\partial}{\partial \tilde c}\big(\tilde c^\top \tilde\Psi \tilde c\big)^{-1/2} &= -\big(\tilde c^\top \tilde\Psi \tilde c\big)^{-3/2}\,\tilde\Psi\tilde c, \notag\\
\frac{\partial}{\partial \tilde c}\, \exp\bigg(-\frac{(z - \tilde c^\top \tilde\xi)^2}{2\, \tilde c^\top \tilde\Psi \tilde c}\bigg) &= \exp\bigg(-\frac{(z - \tilde c^\top \tilde\xi)^2}{2\, \tilde c^\top \tilde\Psi \tilde c}\bigg)\Bigg[\frac{z - \tilde c^\top \tilde\xi}{\tilde c^\top \tilde\Psi \tilde c}\,\tilde\xi + \frac{(z - \tilde c^\top \tilde\xi)^2}{(\tilde c^\top \tilde\Psi \tilde c)^2}\,\tilde\Psi\tilde c\Bigg]. \notag
\end{align}
Using the equations above, we can reduce (21) to
\[
\frac{\partial O}{\partial \tilde c} = \int_z g(z)\Bigg[\frac{z - \tilde c^\top \tilde\xi}{\tilde c^\top \tilde\Psi \tilde c}\,\tilde\xi + \bigg(\frac{(z - \tilde c^\top \tilde\xi)^2}{(\tilde c^\top \tilde\Psi \tilde c)^2} - \frac{1}{\tilde c^\top \tilde\Psi \tilde c}\bigg)\tilde\Psi\tilde c\Bigg]\, \mathcal{N}\big(z \mid \tilde c^\top \tilde\xi,\ \tilde c^\top \tilde\Psi \tilde c\big)\, dz. \tag{22}
\]
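The augmented construction in (20) is purely mechanical, as the following small check with arbitrary values illustrates: the one-dimensional mean and variance built from $\tilde c$, $\tilde\xi$, and $\tilde\Psi$ coincide with $c^\top\xi + d$ and $c^\top\Psi c$.

```python
# Tiny sketch of the augmented-variable construction in (20); all values
# are arbitrary test data.
import numpy as np

rng = np.random.default_rng(4)
p = 3
xi, c, d = rng.normal(size=p), rng.normal(size=p), 0.2
A = rng.normal(size=(p, p)); Psi = A @ A.T

c_t = np.append(c, d)                                    # c-tilde = [c; d]
xi_t = np.append(xi, 1.0)                                # xi-tilde = [xi; 1]
Psi_t = np.zeros((p + 1, p + 1)); Psi_t[:p, :p] = Psi    # zero-padded Psi-tilde

print(c_t @ xi_t, c @ xi + d)            # identical 1-D means
print(c_t @ Psi_t @ c_t, c @ Psi @ c)    # identical 1-D variances
```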
While there does not exist a convenient analytic solution to the above integral, it can be accurately and reasonably efficiently approximated using Gaussian quadrature [2, 3]. Specifically, the Gaussian quadrature rule states that
\[
\int_z f(z)\, \mathcal{N}(z \mid \mu, \sigma^2)\, dz \approx \sum_{j=1}^J w_j\, f(Z_j) \quad \text{for } Z_j = \mu + \gamma_j \sigma, \tag{23}
\]
for any function $f$, where the $w_j$ are the quadrature weights and the $\gamma_j$ are the normalized quadrature points. Identifying the function $f$ in (22) and substituting $Z_j = \tilde c^\top \tilde\xi + \gamma_j \sqrt{\tilde c^\top \tilde\Psi \tilde c}$, the quadrature function for the gradient is
\[
f(\gamma_j) = g\Big(\tilde c^\top \tilde\xi + \gamma_j \sqrt{\tilde c^\top \tilde\Psi \tilde c}\Big)\Bigg[\frac{\gamma_j}{\sqrt{\tilde c^\top \tilde\Psi \tilde c}}\,\tilde\xi + \frac{\gamma_j^2 - 1}{\tilde c^\top \tilde\Psi \tilde c}\,\tilde\Psi\tilde c\Bigg]. \tag{24}
\]
Likewise, the same procedure must be performed to find the Hessian of the objective function:
\[
\frac{\partial^2 O}{\partial \tilde c\, \partial \tilde c^\top} = \int_z g(z)\, \frac{\partial}{\partial \tilde c^\top}\Bigg\{\Bigg[\frac{z - \tilde c^\top \tilde\xi}{\tilde c^\top \tilde\Psi \tilde c}\,\tilde\xi + \bigg(\frac{(z - \tilde c^\top \tilde\xi)^2}{(\tilde c^\top \tilde\Psi \tilde c)^2} - \frac{1}{\tilde c^\top \tilde\Psi \tilde c}\bigg)\tilde\Psi\tilde c\Bigg]\, \mathcal{N}\big(z \mid \tilde c^\top \tilde\xi,\ \tilde c^\top \tilde\Psi \tilde c\big)\Bigg\}\, dz. \tag{25}
\]
The following quantities will be useful:
\begin{align}
\frac{\partial}{\partial \tilde c^\top}\bigg(\frac{\tilde\Psi\tilde c}{\tilde c^\top \tilde\Psi \tilde c}\bigg) &= \frac{\tilde\Psi}{\tilde c^\top \tilde\Psi \tilde c} - \frac{2\,\tilde\Psi\tilde c\, \tilde c^\top \tilde\Psi}{(\tilde c^\top \tilde\Psi \tilde c)^2}, \notag\\
\frac{\partial}{\partial \tilde c^\top}\bigg(\frac{z - \tilde c^\top \tilde\xi}{\tilde c^\top \tilde\Psi \tilde c}\bigg) &= -\frac{\tilde\xi^\top}{\tilde c^\top \tilde\Psi \tilde c} - \frac{2\,(z - \tilde c^\top \tilde\xi)\, \tilde c^\top \tilde\Psi}{(\tilde c^\top \tilde\Psi \tilde c)^2}, \notag\\
\frac{\partial}{\partial \tilde c^\top}\bigg(\frac{(z - \tilde c^\top \tilde\xi)^2}{(\tilde c^\top \tilde\Psi \tilde c)^2}\bigg) &= -\frac{2\,(z - \tilde c^\top \tilde\xi)\,\tilde\xi^\top}{(\tilde c^\top \tilde\Psi \tilde c)^2} - \frac{4\,(z - \tilde c^\top \tilde\xi)^2\, \tilde c^\top \tilde\Psi}{(\tilde c^\top \tilde\Psi \tilde c)^3}. \notag
\end{align}
Next define a function $a$ as
\[
a(\gamma_j) = \frac{\gamma_j}{\sqrt{\tilde c^\top \tilde\Psi \tilde c}}\,\tilde\xi + \frac{\gamma_j^2 - 1}{\tilde c^\top \tilde\Psi \tilde c}\,\tilde\Psi\tilde c. \tag{26}
\]
Substituting these expressions into (25), the quadrature function for the Hessian is
\begin{align}
f(\gamma_j) &= g\Big(\tilde c^\top \tilde\xi + \gamma_j \sqrt{\tilde c^\top \tilde\Psi \tilde c}\Big)\Bigg[a(\gamma_j)\, a(\gamma_j)^\top - \frac{\tilde\xi\,\tilde\xi^\top}{\tilde c^\top \tilde\Psi \tilde c} - \frac{2\gamma_j}{(\tilde c^\top \tilde\Psi \tilde c)^{3/2}}\Big(\tilde\xi\, \tilde c^\top \tilde\Psi + \tilde\Psi\tilde c\,\tilde\xi^\top\Big) \notag\\
&\qquad + \frac{2 - 4\gamma_j^2}{(\tilde c^\top \tilde\Psi \tilde c)^2}\,\tilde\Psi\tilde c\, \tilde c^\top \tilde\Psi + \frac{\gamma_j^2 - 1}{\tilde c^\top \tilde\Psi \tilde c}\,\tilde\Psi\Bigg]. \tag{27}
\end{align}
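One concrete way to implement the rule (23) is Gauss–Hermite quadrature; the sketch below uses NumPy's probabilists' Hermite rule, whose weights integrate against $e^{-z^2/2}$ and therefore need a $\sqrt{2\pi}$ normalization. The test case anticipates the closed form derived in the next section.

```python
# Sketch: approximating integral of f(z) N(z | mu, sigma^2) dz with
# Gauss-Hermite quadrature, per (23).
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gauss_quad(f, mu, sigma, J=20):
    gamma, w = hermegauss(J)           # nodes/weights for weight exp(-z^2/2)
    w = w / np.sqrt(2 * np.pi)         # normalize so the weights sum to 1
    return np.sum(w * f(mu + sigma * gamma))

# Example: E[e^z] under N(0.3, 0.5^2) equals exp(mu + sigma^2/2) exactly.
print(gauss_quad(np.exp, 0.3, 0.5), np.exp(0.3 + 0.5**2 / 2))
```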
For certain choices of $h$ it is possible to compute the gradient and Hessian analytically. To illustrate, we start with the following form:
\[
\int_z g(z)\, \mathcal{N}(z \mid \mu, \sigma^2)\, dz = \int_z g(z)\, \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\bigg(-\frac{(z - \mu)^2}{2\sigma^2}\bigg)\, dz \tag{28}
\]
for $g(z) = \exp(z)$. The classic method to solve this integral is to complete the square in the exponent:
\begin{align}
\int_z \frac{e^z}{\sqrt{2\pi\sigma^2}}\, \exp\bigg(-\frac{(z - \mu)^2}{2\sigma^2}\bigg)\, dz &= \int_z \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\bigg(-\frac{z^2 - 2(\mu + \sigma^2)\,z + \mu^2}{2\sigma^2}\bigg)\, dz \notag\\
&= \exp\bigg(\frac{2\mu\sigma^2 + \sigma^4}{2\sigma^2}\bigg) \int_z \mathcal{N}\big(z \mid \mu + \sigma^2, \sigma^2\big)\, dz \notag\\
&= \exp\Big(\mu + \frac{\sigma^2}{2}\Big). \tag{29}
\end{align}
Relating this form back to (20), the gradient with respect to $\tilde c$ is
\[
\frac{\partial}{\partial \tilde c}\, \exp\Big(\tilde c^\top \tilde\xi + \tfrac{1}{2}\,\tilde c^\top \tilde\Psi \tilde c\Big) = \exp\Big(\tilde c^\top \tilde\xi + \tfrac{1}{2}\,\tilde c^\top \tilde\Psi \tilde c\Big)\big(\tilde\xi + \tilde\Psi\tilde c\big). \tag{30}
\]
Likewise, the Hessian can be computed as follows:
\[
\frac{\partial}{\partial \tilde c^\top}\Big[\exp\Big(\tilde c^\top \tilde\xi + \tfrac{1}{2}\,\tilde c^\top \tilde\Psi \tilde c\Big)\big(\tilde\xi + \tilde\Psi\tilde c\big)\Big] = \exp\Big(\tilde c^\top \tilde\xi + \tfrac{1}{2}\,\tilde c^\top \tilde\Psi \tilde c\Big)\Big[\big(\tilde\xi + \tilde\Psi\tilde c\big)\big(\tilde\xi + \tilde\Psi\tilde c\big)^\top + \tilde\Psi\Big]. \tag{31}
\]
For the other choice, $g(z) = z$, the objective is simply $\tilde c^\top \tilde\xi$, so the gradient is trivially $\tilde\xi$ and the Hessian is the zero matrix.
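As a sanity check (ours, with arbitrary test values), the analytic gradient (30) can be compared against central finite differences of the closed-form objective from (29):

```python
# Finite-difference sanity sketch for the analytic gradient (30); the test
# values are arbitrary and this check is not part of the report's algorithm.
import numpy as np

rng = np.random.default_rng(2)
p = 3
xi_t = np.append(rng.normal(size=p), 1.0)              # xi-tilde
Psi_t = np.zeros((p + 1, p + 1))
A = rng.normal(size=(p, p)); Psi_t[:p, :p] = 0.1 * (A @ A.T)
c_t = rng.normal(size=p + 1)

O = lambda c: np.exp(c @ xi_t + 0.5 * c @ Psi_t @ c)   # closed form via (29)
grad = O(c_t) * (xi_t + Psi_t @ c_t)                   # analytic gradient (30)

eps = 1e-6
I = np.eye(p + 1)
num = np.array([(O(c_t + eps * I[j]) - O(c_t - eps * I[j])) / (2 * eps)
                for j in range(p + 1)])
print(np.max(np.abs(grad - num)))                      # close to zero
```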
4 Inference

Once the model parameters have been chosen, the generative model can be used to make inferences on the training data or new observations. For the training data, the hidden state vector $x$ is the only variable that must be inferred. The posterior distribution of $x$ can be approximated by a Gaussian, exactly as described previously. This results in a distribution $Q$ with mean $\xi$ and covariance $\Psi$. Therefore, the maximum a posteriori estimate of $x$ is simply $\xi$.

When performing inference for a new observation, the mixture component identification, $s$, is assumed to be unknown. The posterior distributions of both $s$ and $x$, given the data, $y$, are potentially of interest. The first of these distributions can be expressed as follows:
\begin{align}
P(s \mid y, \hat\theta) &\propto P(y \mid s, \hat\theta)\, P(s \mid \hat\theta) \notag\\
&= \pi_s \int_x P(y, x \mid s, \hat\theta)\, dx \notag\\
&= \pi_s \int_x P(y \mid x, \hat\theta)\, P(x \mid s, \hat\theta)\, dx \notag\\
&= \pi_s\, E_x\big[P(y \mid x, \hat\theta)\big], \tag{32}
\end{align}
where the expectation in (32) is of a product of Poissons with respect to a Gaussian distribution that has mean $\hat\mu_s$ and covariance $\hat\Sigma_s$. This expectation can be computed using sampling techniques or Laplace's method. To infer $x$ given the data, the following derivation applies:
\begin{align}
P(x \mid y, \hat\theta) &= \sum_{s=1}^M P(x \mid y, s, \hat\theta)\, P(s \mid y, \hat\theta) \notag\\
&\propto \sum_{s=1}^M P(x \mid y, s, \hat\theta)\, \pi_s\, E_x\big[P(y \mid x, \hat\theta)\big]. \tag{33}
\end{align}
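As one concrete realization of the sampling approach mentioned above, the sketch below estimates $P(s \mid y, \hat\theta) \propto \pi_s E_x[P(y \mid x, \hat\theta)]$ by drawing latent states from each fitted component prior; the function name, the exponential link, and the sample count are our own assumptions, and a practical implementation would work in log space to avoid underflow.

```python
# Sketch of inference (32): Monte Carlo estimate of P(s | y) under the
# fitted parameters; for realistic q, accumulate log-likelihoods instead.
import numpy as np
from scipy.stats import poisson

def posterior_s(y, pi, mu, Sigma, C, d, delta, n_samples=5000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    M = len(pi)
    scores = np.zeros(M)
    for k in range(M):
        # Draw latent states from the component prior N(mu_s, Sigma_s).
        xs = rng.multivariate_normal(mu[k], Sigma[k], size=n_samples)
        rate = np.exp(xs @ C.T + d) * delta          # exponential link h(z) = e^z
        lik = poisson.pmf(y, rate).prod(axis=1)      # product of Poissons
        scores[k] = pi[k] * lik.mean()               # pi_s * E_x[P(y | x)]
    return scores / scores.sum()                     # normalized posterior over s
```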
References

[1] A.C. Smith and E.N. Brown. Estimating a state-space model from point process observations. Neural Computation, 15(5):965–991, 2003.

[2] S.J. Julier and J.K. Uhlmann. A new extension of the Kalman filter to nonlinear systems. In Proc. AeroSense: 11th Int. Symp. Aerospace/Defense Sensing, Simulation and Controls, pages 182–193, 1997.

[3] U.N. Lerner. Hybrid Bayesian networks for reasoning about complex systems. PhD thesis, Stanford University, Stanford, CA, 2002.