Collaborative Multi-Output Gaussian Processes for Collections of Sparse Multivariate Time Series
Steven Cheng-Xian Li    Benjamin Marlin
College of Information & Computer Sciences
University of Massachusetts Amherst
{cxl,marlin}@cs.umass.edu

Abstract

Collaborative Multi-Output Gaussian Processes (COGPs) are a flexible tool for modeling multivariate time series. They induce correlation across outputs through the use of shared latent processes. While past work has focused on the computational challenges that result from a single multivariate time series with many observed values, this paper explores the problem of fitting the COGP model to collections of many sparse and irregularly sampled multivariate time series. This work is motivated by applications to modeling physiological data (heart rate, blood pressure, etc.) in Electronic Health Records (EHRs).

1 Introduction

Gaussian process (GP) regression is a well-known and widely-used approach for modeling temporal and spatial data [9]. The main drawback of GP models is the prohibitive cost of the required computations. To address this issue, Hensman et al. [4] recently introduced a scalable algorithm to perform GP inference based on a stochastic variational approximation [5]. Using a similar approach, Nguyen and Bonilla [7] proposed collaborative multi-output Gaussian processes (COGP) for efficiently learning multi-output GPs given a single multivariate time series with many observations. This work extends a long line of prior research on multi-output GPs [1, 2, 3, 11, 12].

In this paper, we consider the problem of learning the COGP model when the data consist of a collection of many sparse and irregularly sampled multivariate time series. This problem is motivated by the analysis of Intensive Care Unit (ICU) Electronic Health Record (EHR) data. In the ICU EHR setting, each patient is represented by an ensemble of sparse and irregularly sampled physiological time series, one per underlying physiological variable such as heart rate, blood pressure, etc.
A typical ICU EHR record contains observations of physiological variables recorded at irregular intervals by clinical staff during the routine course of care. Key variables may have one to two recorded observations per hour, so the data are quite sparse. On the other hand, individual hospitals may have access to EHRs for (many) thousands of patients. Our goal is to fit a common COGP model by leveraging the data from multiple patients. We present an extension to the COGP model and a modified variational learning algorithm that exploits the fact that we have many sparsely observed multivariate time series. We also explore the use of sparsity-inducing regularization on the factors controlling the interactions between outputs to deal with variables that are highly sparsely observed. We present predictive log likelihood results on a real ICU EHR data set.

2 Multi-Output Gaussian Processes

Consider a data set containing a collection of multivariate time series D = {S_1, ..., S_N}. Each time series S_n consists of P channels, S_n = {(t_{n1}, y_{n1}), ..., (t_{nP}, y_{nP})}, in which t_{ni} is a set of
time points and y_{ni} are the corresponding observed values. For ICU EHR data, each time series has only a small number of observations that are irregularly sampled. We extend the collaborative multi-output Gaussian processes (COGP) [7] to model correlation across different channels given a collection of multi-channel time series where each channel is sparse and irregularly sampled.

Let y_{ik} denote the kth observation of channel i from S_n, measured at time t_{ik}. It is modeled as a noisy observation of the sum of a function h_i and a weighted combination of Q shared latent functions g_1, ..., g_Q evaluated at t_{ik}, where each function has an independent Gaussian process (GP) prior: h_i ~ GP(0, k_i^{(h)}(·,·)) for i = 1, ..., P and g_q ~ GP(0, k_q^{(g)}(·,·)) for q = 1, ..., Q. Specifically,

p(y_{ik}) = \mathcal{N}\big( y_{ik} \mid h_i(t_{ik}) + \sum_{q=1}^{Q} w_{iq}\, g_q(t_{ik}),\ \beta_i^{-1} \big).

Note that the hyperparameters of the covariance functions k_i^{(h)} and k_q^{(g)} are shared across the entire time series collection D, and so are the weights w_{iq}. The shared Gaussian precision \beta_i models the noise of the process that is shared by all of the time series in the ith channel.

In order to efficiently estimate the hyperparameters mentioned above, a set of M inducing time points z = [z_1, ..., z_M] is introduced to approximate the original GP posterior for all g_q and h_i. These inducing points provide a universal reference so that we can estimate the combination weights and other hyperparameters solely on the marginal distribution. Moreover, by choosing a smaller M those GPs can be sparsified to speed up computation [4, 5]. Let g_{nq} denote g_q evaluated at the observation times of S_n, and let h_{ni} = h_i(t_{ni}), for n = 1, ..., N, i = 1, ..., P, and q = 1, ..., Q. Like other GP approximations [8], sets of inducing variables u and v are introduced such that

p(g_{nq} \mid u_{nq}) = \mathcal{N}\big( g_{nq} \mid \mu_{nq}^{(g)},\ K_{nq}^{(g)} \big), \qquad p(u_{nq}) = \mathcal{N}\big( u_{nq} \mid 0,\ k_q^{(g)}(z, z) \big),
p(h_{ni} \mid v_{ni}) = \mathcal{N}\big( h_{ni} \mid \mu_{ni}^{(h)},\ K_{ni}^{(h)} \big), \qquad p(v_{ni}) = \mathcal{N}\big( v_{ni} \mid 0,\ k_i^{(h)}(z, z) \big),

where \mu_{ni}^{(h)} and K_{ni}^{(h)} are defined similarly to their (g) counterparts.
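To make the observation model concrete, the following sketch (ours, not the authors' code) simulates one sparse, irregularly sampled multi-channel series from the generative process y_i(t) = h_i(t) + Σ_q w_iq g_q(t) + noise. The weights, precisions, time grid, and per-channel sample counts are illustrative assumptions:

```python
import numpy as np

def se_kernel(x, z, a=1.0, b=1.0):
    """Squared exponential kernel k(x, x') = a * exp(-b * (x - x')^2)."""
    d = x[:, None] - z[None, :]
    return a * np.exp(-b * d ** 2)

def sample_cogp_series(P=3, Q=2, rng=None):
    """Draw one sparse multivariate series from the COGP observation model."""
    rng = rng if rng is not None else np.random.default_rng(0)
    w = rng.normal(size=(P, Q))          # combination weights w_iq
    beta = np.full(P, 25.0)              # per-channel noise precisions beta_i
    grid = np.linspace(0.0, 24.0, 100)   # dense grid (e.g. 24 hours) to draw GPs on
    K = se_kernel(grid, grid) + 1e-5 * np.eye(grid.size)
    L = np.linalg.cholesky(K)
    g = L @ rng.normal(size=(grid.size, Q))   # Q shared latent functions
    h = L @ rng.normal(size=(grid.size, P))   # P channel-specific functions
    series = []
    for i in range(P):
        # Each channel keeps its own small, irregular subset of the grid.
        idx = np.sort(rng.choice(grid.size, size=int(rng.integers(5, 15)),
                                 replace=False))
        f = h[idx, i] + g[idx] @ w[i]
        y = f + rng.normal(scale=beta[i] ** -0.5, size=idx.size)
        series.append((grid[idx], y))
    return series

series = sample_cogp_series()  # list of (times, values) pairs, one per channel
```

Each channel shares the same latent draws g_q but observes them at different times, which is exactly the cross-channel coupling the shared latent processes provide.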
The quantities \mu_{nq}^{(g)} and K_{nq}^{(g)} are the posterior mean and covariance, defined as follows:

\mu_{nq}^{(g)} = k_q^{(g)}(t_n, z)\, k_q^{(g)}(z, z)^{-1} u_{nq},
K_{nq}^{(g)} = k_q^{(g)}(t_n, t_n) - k_q^{(g)}(t_n, z)\, k_q^{(g)}(z, z)^{-1} k_q^{(g)}(z, t_n).

In this work, we use squared exponential kernels for both k_i^{(h)} and k_q^{(g)}, that is, k(x, x') = a \exp(-b (x - x')^2) for a > 0 and b > 0, whereas for k_q^{(g)} we fix the leading coefficient a = 1 since the weights w_{iq} already control the scale.

We use variational inference to estimate the parameters. Following the procedure of COGP, we can derive the evidence lower bound with all g_{nq} and h_{ni} collapsed as in [4] and introduce the mean-field variational distributions q(u_{nq}) = \mathcal{N}(u_{nq} \mid m_{nq}^{(g)}, S_{nq}^{(g)}) and q(v_{ni}) = \mathcal{N}(v_{ni} \mid m_{ni}^{(h)}, S_{ni}^{(h)}) for all n, i, and q. We obtain the lower bound shown below.
\log p(D) \ge \sum_{n=1}^{N} \Big\{ \int q(u_n, v_n)\, \mathbb{E}_{p(g_n, h_n \mid u_n, v_n)}\big[ \log p(y_n \mid g_n, h_n) \big]\, du_n\, dv_n
\qquad - \sum_{q=1}^{Q} D_{\mathrm{KL}}\big( q(u_{nq}) \,\|\, p(u_{nq}) \big) - \sum_{i=1}^{P} D_{\mathrm{KL}}\big( q(v_{ni}) \,\|\, p(v_{ni}) \big) \Big\}

Since we are working in a scenario where the number of samples in each channel of the ICU EHR is small, instead of updating the variational parameters of u_{nq} and v_{ni} using stochastic optimization as in [7], we can estimate them analytically in the variational E-step to speed up the overall convergence. Specifically, we estimate S_{nq}^{(g)} and S_{ni}^{(h)} individually in closed form by setting the derivatives of the evidence lower bound to zero:

S_{nq}^{(g)} = \Big( k_q^{(g)}(z, z)^{-1} + \sum_{i=1}^{P} \beta_i\, w_{iq}^2\, A_{niq}^{(g)\top} A_{niq}^{(g)} \Big)^{-1},
S_{ni}^{(h)} = \Big( k_i^{(h)}(z, z)^{-1} + \beta_i\, A_{ni}^{(h)\top} A_{ni}^{(h)} \Big)^{-1},

where A_{niq}^{(g)} = k_q^{(g)}(t_{ni}, z)\, k_q^{(g)}(z, z)^{-1} and A_{ni}^{(h)} = k_i^{(h)}(t_{ni}, z)\, k_i^{(h)}(z, z)^{-1}. As for m_{nq}^{(g)} and m_{ni}^{(h)}, we can estimate all of them jointly by solving the following linear system:

\big(S_{nq}^{(g)}\big)^{-1} m_{nq}^{(g)} = \sum_{i=1}^{P} \beta_i\, w_{iq}\, A_{niq}^{(g)\top} \Big( y_{ni} - A_{ni}^{(h)} m_{ni}^{(h)} - \sum_{k \ne q} w_{ik}\, A_{nik}^{(g)} m_{nk}^{(g)} \Big) \quad \text{for all } q,
\big(S_{ni}^{(h)}\big)^{-1} m_{ni}^{(h)} = \beta_i\, A_{ni}^{(h)\top} \Big( y_{ni} - \sum_{q=1}^{Q} w_{iq}\, A_{niq}^{(g)} m_{nq}^{(g)} \Big) \quad \text{for all } i.

3 Experiments and Results

We evaluate the performance of our extension of the multi-output GP model (COGP) using predictive likelihood on held-out data. Our experiments are based on a pediatric ICU EHR data set collected at the Children's Hospital of Los Angeles. The data contain sparse and irregularly sampled time series for 13 standard physiological variables. The data set we use for these experiments contains a collection of 1000 patient records. We extract the samples from the first 24 hours of each episode. The average number of observations per day varies between 7 and 50 for these variables, with considerable variation between patients. We compare the predictive performance on the held-out data points using the COGP with different regularization schemes. We also compare to a baseline method that models each channel as an independent GP (INDEP-GP). We randomly split the 1000 episodes into 500 for training and test on the remaining half.
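Before turning to the results, the closed-form E-step derived in Section 2 can be sketched for a single series as follows. This is an illustrative implementation of ours, not the authors' code: it assumes one shared squared exponential kernel for all processes, and it solves the coupled linear system for the means by block coordinate updates, which converge to the same fixed point as the joint solve.

```python
import numpy as np

def se_kernel(x, z, a=1.0, b=1.0):
    """Squared exponential kernel k(x, x') = a * exp(-b * (x - x')^2)."""
    d = x[:, None] - z[None, :]
    return a * np.exp(-b * d ** 2)

def e_step(t, y, z, w, beta, n_iter=50):
    """Closed-form variational E-step for one series (illustrative sketch).

    t, y : length-P lists with each channel's observation times / values
    z    : (M,) inducing time points
    w    : (P, Q) combination weights
    beta : (P,) noise precisions

    Returns means/covariances (m_g, S_g) for the Q latent processes and
    (m_h, S_h) for the P channel-specific processes.
    """
    P, Q = w.shape
    M = z.size
    Kzz = se_kernel(z, z) + 1e-6 * np.eye(M)
    Kzz_inv = np.linalg.inv(Kzz)
    A = [se_kernel(t[i], z) @ Kzz_inv for i in range(P)]   # A_i = K_tz Kzz^-1

    # Covariances have a closed form: S_q = (Kzz^-1 + sum_i beta_i w_iq^2 A_i^T A_i)^-1
    S_g = [np.linalg.inv(Kzz_inv + sum(beta[i] * w[i, q] ** 2 * A[i].T @ A[i]
                                       for i in range(P))) for q in range(Q)]
    S_h = [np.linalg.inv(Kzz_inv + beta[i] * A[i].T @ A[i]) for i in range(P)]

    # Means: iterate the two stationarity conditions (block coordinate ascent).
    m_g = np.zeros((Q, M))
    m_h = np.zeros((P, M))
    for _ in range(n_iter):
        for q in range(Q):
            rhs = sum(beta[i] * w[i, q] * A[i].T @
                      (y[i] - A[i] @ m_h[i]
                       - sum(w[i, k] * A[i] @ m_g[k] for k in range(Q) if k != q))
                      for i in range(P))
            m_g[q] = S_g[q] @ rhs
        for i in range(P):
            rhs = beta[i] * A[i].T @ (y[i] - sum(w[i, q] * A[i] @ m_g[q]
                                                 for q in range(Q)))
            m_h[i] = S_h[i] @ rhs
    return m_g, S_g, m_h, S_h
```

Because the bound is a concave quadratic in the means, these exact block updates increase it monotonically and reach the same solution as solving the full linear system at once.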
For each channel, we hold out the middle one-third of the observations of each episode and evaluate the predictive distribution at the held-out time points, so that inference has to account for information from other channels due to the lack of reference observations in the neighborhood. This involves estimating m_{nq}^{(g)}, S_{nq}^{(g)}, m_{ni}^{(h)}, and S_{ni}^{(h)} for each test case given the shared hyperparameters that have been trained. Note that we discard episodes that have fewer than 3 observations in the given channel. As the sampling density varies considerably across channels, the number of test cases used to evaluate predictive performance can differ substantially from channel to channel. Therefore, we compute the average log likelihood on each channel and report the average over all 13 per-channel averages as the evaluation metric.

For multi-output GPs, we consider three schemes for regularizing the combination weight matrix w ∈ R^{P×Q}. First, we apply ℓ1 regularization to each entry of w by imposing the constraint \|w\|_1 < \tau (COGP-IND). We also consider regularizing w with the group lasso by adding an extra term \lambda \sum_{G \in \mathcal{G}} \sqrt{ \sum_{(i,q) \in G} w_{iq}^2 } to the negative evidence lower bound, where each G is a set of indices of w that forms a group and \lambda > 0 controls the strength of the regularization. We consider taking each column as a group (COGP-COL) and taking each row as a group (COGP-ROW). In the experiments, we use a projected quasi-Newton algorithm [10] to optimize the regularized evidence lower bound. We also compare to COGP without regularization (COGP). We test different values of Q as well as the parameters \tau and \lambda for each regularization scheme. In the interest of space, we only show the best results for each method. Table 1 shows the best average held-out log-likelihood for the different methods. We consider Q ∈ {3, 5, 8, 10}.

Table 1: Held-out log-likelihood comparison

method      average log-likelihood   Q   regularization parameter
COGP-COL    (±0.045)                 3   τ = 0.2
COGP-ROW    (±0.036)                 3   λ = 2.0
COGP-IND    (±0.043)                 5   λ = 0.8
COGP        (±0.042)                 3
INDEP-GP    (±0.171)                 3

Table 2: Average log-likelihood on each channel

channel   COGP-COL       COGP           INDEP-GP
SpO2           (±0.06)   1.22 (±0.10)   4.51 (±0.29)
HR        0.92 (±0.10)   1.19 (±0.14)   5.84 (±0.45)
RR        0.04 (±0.01)   0.01 (±0.01)   0.77 (±0.16)
sbp       0.53 (±0.10)   0.53 (±0.10)   2.46 (±0.34)
dbp       0.88 (±0.02)   1.37 (±0.04)   0.78 (±0.15)
EtCO2          (±0.01)   0.09 (±0.01)   0.97 (±0.17)
Temp      0.82 (±0.36)   0.80 (±0.37)   0.25 (±0.19)
TGCS      0.58 (±0.08)   0.58 (±0.08)   4.48 (±0.29)
CRR       0.71 (±0.05)   0.75 (±0.07)   0.85 (±0.19)
UO        1.13 (±0.08)   1.14 (±0.08)   6.32 (±0.29)
FiO2           (±0.28)   1.76 (±0.29)   5.57 (±0.89)
Gluc      0.05 (±0.01)   0.01 (±0.01)   0.57 (±0.06)
pH        0.48 (±0.05)   0.47 (±0.05)   1.11 (±0.14)
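As a sketch of the group-lasso regularizer described above (the helper names and index-based grouping representation are our own assumptions, not the authors' code), the penalty λ Σ_G ||w_G||₂ and the column/row groupings can be written as:

```python
import numpy as np

def group_lasso_penalty(w, groups, lam):
    """Group-lasso term lam * sum_G ||w_G||_2 added to the negative ELBO.
    `groups` is a list of index arrays into the flattened weight matrix."""
    flat = w.ravel()
    return lam * sum(np.linalg.norm(flat[g]) for g in groups)

def column_groups(P, Q):
    """One group per column of w (COGP-COL): ties a latent function's
    weights across all P channels, so a whole latent process can be pruned."""
    idx = np.arange(P * Q).reshape(P, Q)
    return [idx[:, q] for q in range(Q)]

def row_groups(P, Q):
    """One group per row of w (COGP-ROW): ties all weights of one channel."""
    idx = np.arange(P * Q).reshape(P, Q)
    return [idx[r, :] for r in range(P)]
```

With column groups, driving an entire column of w to zero removes latent function g_q from every channel, which is why COGP-COL can prune whole latent processes; the paper optimizes the penalized bound with a projected quasi-Newton method [10].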
The results show that smaller numbers of latent GPs yield better performance. Importantly, COGP significantly outperforms the independent baseline model. Table 1 also shows that regularization on columns gives the best result, although no column is zeroed out completely. Table 2 shows the average log-likelihood for each channel. We can see that COGP outperforms INDEP-GP in all cases except for two of the sparser channels (dbp and Temp). With sparsity-inducing regularization, COGP-COL is able to significantly improve the results for dbp while having a mild (positive or negative) effect on the other channels.

4 Conclusion and Future Directions

In this work, we extend collaborative multi-output GPs to learn correlations across different outputs from a collection of sparse and irregularly-sampled multivariate time series. This is an important step toward follow-up machine learning tasks such as time series classification or clustering. Our work can be integrated with, for example, the expected Gaussian kernel [6] to perform various machine learning tasks while making use of the more accurate modeling provided by COGPs.

References

[1] Mauricio A Alvarez, Lorenzo Rosasco, and Neil D Lawrence. Kernels for vector-valued functions: A review. arXiv preprint.
[2] Edwin V Bonilla, Kian M Chai, and Christopher Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems.
[3] Phillip Boyle and Marcus Frean. Dependent Gaussian processes. In Advances in Neural Information Processing Systems.
[4] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. In Conference on Uncertainty in Artificial Intelligence.
[5] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1).
[6] Steven Cheng-Xian Li and Benjamin Marlin. Classification of sparse and irregularly sampled time series with mixtures of expected Gaussian kernels and random features. In Conference on Uncertainty in Artificial Intelligence.
[7] Trung V Nguyen and Edwin V Bonilla. Collaborative multi-output Gaussian processes. In UAI.
[8] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6.
[9] C.E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning.
[10] Mark W Schmidt, Ewout van den Berg, Michael P Friedlander, and Kevin P Murphy. Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In International Conference on Artificial Intelligence and Statistics.
[11] Yee-Whye Teh, Matthias Seeger, and Michael Jordan. Semiparametric latent factor models. In Artificial Intelligence and Statistics 10.
[12] Andrew Wilson, Zoubin Ghahramani, and David A Knowles. Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning (ICML-12).
More information3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X
Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number
More informationDecision Analysis (part 2 of 2) Review Linear Regression
Harvard-MIT Dvson of Health Scences and Technology HST.951J: Medcal Decson Support, Fall 2005 Instructors: Professor Lucla Ohno-Machado and Professor Staal Vnterbo 6.873/HST.951 Medcal Decson Support Fall
More informationCHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE
CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng
More informationLecture 20: November 7
0-725/36-725: Convex Optmzaton Fall 205 Lecturer: Ryan Tbshran Lecture 20: November 7 Scrbes: Varsha Chnnaobreddy, Joon Sk Km, Lngyao Zhang Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer:
More informationEstimating the Fundamental Matrix by Transforming Image Points in Projective Space 1
Estmatng the Fundamental Matrx by Transformng Image Ponts n Projectve Space 1 Zhengyou Zhang and Charles Loop Mcrosoft Research, One Mcrosoft Way, Redmond, WA 98052, USA E-mal: fzhang,cloopg@mcrosoft.com
More information8/25/17. Data Modeling. Data Modeling. Data Modeling. Patrice Koehl Department of Biological Sciences National University of Singapore
8/5/17 Data Modelng Patrce Koehl Department of Bologcal Scences atonal Unversty of Sngapore http://www.cs.ucdavs.edu/~koehl/teachng/bl59 koehl@cs.ucdavs.edu Data Modelng Ø Data Modelng: least squares Ø
More informationIntroduction to Hidden Markov Models
Introducton to Hdden Markov Models Alperen Degrmenc Ths document contans dervatons and algorthms for mplementng Hdden Markov Models. The content presented here s a collecton of my notes and personal nsghts
More informationChapter 15 Student Lecture Notes 15-1
Chapter 15 Student Lecture Notes 15-1 Basc Busness Statstcs (9 th Edton) Chapter 15 Multple Regresson Model Buldng 004 Prentce-Hall, Inc. Chap 15-1 Chapter Topcs The Quadratc Regresson Model Usng Transformatons
More informationGenerative classification models
CS 675 Intro to Machne Learnng Lecture Generatve classfcaton models Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square Data: D { d, d,.., dn} d, Classfcaton represents a dscrete class value Goal: learn
More informationEvaluation for sets of classes
Evaluaton for Tet Categorzaton Classfcaton accuracy: usual n ML, the proporton of correct decsons, Not approprate f the populaton rate of the class s low Precson, Recall and F 1 Better measures 21 Evaluaton
More informationA PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS
HCMC Unversty of Pedagogy Thong Nguyen Huu et al. A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS Thong Nguyen Huu and Hao Tran Van Department of mathematcs-nformaton,
More informationFuzzy Boundaries of Sample Selection Model
Proceedngs of the 9th WSES Internatonal Conference on ppled Mathematcs, Istanbul, Turkey, May 7-9, 006 (pp309-34) Fuzzy Boundares of Sample Selecton Model L. MUHMD SFIIH, NTON BDULBSH KMIL, M. T. BU OSMN
More informationINF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018
INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton
More informationStatistics for Economics & Business
Statstcs for Economcs & Busness Smple Lnear Regresson Learnng Objectves In ths chapter, you learn: How to use regresson analyss to predct the value of a dependent varable based on an ndependent varable
More informationOn Outlier Robust Small Area Mean Estimate Based on Prediction of Empirical Distribution Function
On Outler Robust Small Area Mean Estmate Based on Predcton of Emprcal Dstrbuton Functon Payam Mokhtaran Natonal Insttute of Appled Statstcs Research Australa Unversty of Wollongong Small Area Estmaton
More informationNegative Binomial Regression
STATGRAPHICS Rev. 9/16/2013 Negatve Bnomal Regresson Summary... 1 Data Input... 3 Statstcal Model... 3 Analyss Summary... 4 Analyss Optons... 7 Plot of Ftted Model... 8 Observed Versus Predcted... 10 Predctons...
More informationChapter 12 Analysis of Covariance
Chapter Analyss of Covarance Any scentfc experment s performed to know somethng that s unknown about a group of treatments and to test certan hypothess about the correspondng treatment effect When varablty
More informationPredictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore
Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.
More information4.3 Poisson Regression
of teratvely reweghted least squares regressons (the IRLS algorthm). We do wthout gvng further detals, but nstead focus on the practcal applcaton. > glm(survval~log(weght)+age, famly="bnomal", data=baby)
More informationChapter 11: Simple Linear Regression and Correlation
Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests
More informationand V is a p p positive definite matrix. A normal-inverse-gamma distribution.
OSR Journal of athematcs (OSR-J) e-ssn: 78-578, p-ssn: 39-765X. Volume 3, ssue 3 Ver. V (ay - June 07), PP 68-7 www.osrjournals.org Comparng The Performance of Bayesan And Frequentst Analyss ethods of
More informationSTAT 511 FINAL EXAM NAME Spring 2001
STAT 5 FINAL EXAM NAME Sprng Instructons: Ths s a closed book exam. No notes or books are allowed. ou may use a calculator but you are not allowed to store notes or formulas n the calculator. Please wrte
More informationAppendix B: Resampling Algorithms
407 Appendx B: Resamplng Algorthms A common problem of all partcle flters s the degeneracy of weghts, whch conssts of the unbounded ncrease of the varance of the mportance weghts ω [ ] of the partcles
More information4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA
4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected
More informationNON-CENTRAL 7-POINT FORMULA IN THE METHOD OF LINES FOR PARABOLIC AND BURGERS' EQUATIONS
IJRRAS 8 (3 September 011 www.arpapress.com/volumes/vol8issue3/ijrras_8_3_08.pdf NON-CENTRAL 7-POINT FORMULA IN THE METHOD OF LINES FOR PARABOLIC AND BURGERS' EQUATIONS H.O. Bakodah Dept. of Mathematc
More information