Scalable Gaussian process models on matrices and tensors

Size: px

Start display at page:

Download "Scalable Gaussian process models on matrices and tensors"

Betty Kelley
5 years ago
Views:

1 Scalable Gaussian process models on matrices and tensors Alan Qi CS & Statistics Purdue University Joint work with F. Yan, Z. Xu, S. Zhe, and IBM Research!

2 Models for graph and multiway data Model Algorithm 1. STGP: predict who may be your friends on facebook Data 2. DinTucker: distributed online inference for large graphs and tensors

3 Outline Nonpara. Bayesian: Gaussian processes Utilizing Kronecker product STGP Tensor data DinTucker

4 Tensor data: multiple aspects Patients Biomarkers Medicines Yi,j,k: the value of the k-th biomarker (i.e., cell population) for the j-th patient after taking the i-th medicine Predict drug response

5 Matrix data: networks people people Yi,j: 1 if node i is linked to node j, 0 otherwise. Discover communities and predict unknown interactions

6 Goals! - Predict unknown elements (e.g., drug response and network interactions) - Identify latent multi-aspect groups (communities)

7 Classical Tucker decomposition Generalization of matrix factorization 3D case: U core tensor loading matrices Y Z V Y = G 1 U 2 Z 3 V y ijk = X X X g rst u ir z js v ks r s t Sun et al. 2008

8 Assumptions Complete Continuous Multi-linear

9 Solution Gaussian process models for tensors

10 Gaussian processes: prior on scalar f Powerful Bayesian modeling tool Numerous applications: computer vision, sensor networks, computational biology, etc. f x f N(0; K x )

11 GP models on tensor F GP on a tensor: stochastic process in an infinite tensor space Tensor N (F 0; K u, K v, K z )=(2 ) 3N 2 Y s=u,v,z K s N 2 2 Projection of GP on any tensor of finite size follows a tensor-variate Gaussian distribution exp{ 1 2 (F 1 K 1 u 2 K 1 v 3 K 1 z ) F 2 }

12 Latent sparse GP on tensors Patients Biomarkers (i, j, k) u i : Medicines Element u i : z j : v k : (i, j, k) is characterized by Sparse loading vector in latent medicine groups Sparse loading vector in latent patient groups Sparse loading vector in latent biomarker groups

13 Separate covariance for each dimension Kz K v Patients Biomarkers i r K u (i, r) =k(u i, u r ) Medicines Nonlinear relationship between medicines i and r -Separate covariance/kernel function for each dimension! -The more similar loading vectors, the larger the covariance function value

14 Predict unknown tensor elements Kv Patients K v Biomarkers (i, j, k) Medicines K u w (r, s, t) Simple illustration: 1) Based on observed data, estimate loading vectors: [u i, z j, v k ] 2) Compute weights (similarities) between unknown and observed elements: w(ijk, rst) w([u i, z j, v k ], [u r, z s, v t ]) 3) Predict the unknown element: y i,j,k = P r,s,t w(ijk, rst)y r,s,t

15 Graphical model U =[u 1,...,u N ] U V Z Sparse loading vectors u i exp( u i ) Similarly, sample V and Z F Latent tensor F N(0; K u, K v, K z ) Y U Unknown data Y O Observed data Y O p(y ijk f ijk ) Y (i,j,k)2o p(y ijk f ijk ) : Gaussian for continuous data Probit for binary data Possion for count data

16 Benefits Handle binary and missing data Discover block/group structures Avoid overfitting Model prediction uncertainty Incorporate additional side information!! Yan, Xu & Qi, 2011; Xu, Yan & Qi 2011

17 Two views of the same model Infinite latent factor model / nonlinear stochastic blockmodels! Nonparametric prior on infinite exchangeable (high-dimensional) arrays

18 Infinite nonlinear tensor decomposition (z K ) Nonlinear feature mapping: infinite dimension F G (u I ) (v J ) f ijk = G 1 (u i ) 2 (v j ) 3 (z k ) G N(0, I, I, I) y ijk = f ijk + e Integrating out p(f U, V, Z) = Z G gives GP on tensor F p(f G, U, V, Z)p(G)dG = N (0, K u, K v, K z ) where k(u i, u j )= (u i ) T (u j )

19 Exchangeability and De Finetti theorem An exchangeable sequence: its joint distribution is invariant under arbitrary permutation of the indices. De Finetti (1931) and Hewitt and Savage (1955) showed: If X 1,...,X n,... is exchangeable. Then there exists a probability measure µ on probability measure, such that Z! p(x 1 A 1,...,X n A n )= Q(A 1 ) Q(A n )µ(dq) How about priors on exchangeable arrays representing graphs?

20 Prior for random graphs Aldous (1981); Diaconis & Freedman (1981) (See Lauritzen 2007) showed: Any column-row exchangeable binary matrix can be generated as follows:!!! f p U i,v j W IID Uniform(0, 1) G ij W, U i,v j Bernoulli(f(U i,v j )) Transform U i and V j to other r.v.s via suitable transformations.

21 Prior for random graphs For weakly exchangeable arrays (shared permutations over rows and columns; for undirected graphs), our model (Yan et al. 2011) is f GP U i,v j W IID Uniform(0, 1) G ij W, U i,v j Bernoulli(f(U i,u j )) Replace Bernoulli by Gaussian for continuous values Special case of Random function priors (Lloyd et al., 2012)

22 Related works Latent Eigenmodel: (Hoff 2007): parametric models on exchangeable arrays GP-LVM (Lawrence 2005): modeling only interactions in rows or columns Random function priors (Lloyd et al., 2012)

23 Algorithm: Variational EM Marginal likelihood log p(y O U, V, Z) + log p(u, V, Z) Variational approximation Iterations

24 E step: Variational-EM Maximize the variational lower bound over q(f) & q(s) : Z log p(y O U, V, Z) q(f)q(s) log p(y O, F, S U, V, Z)dF+ where model S H q (F)+H q (S) L(q(F),q(S)) are parameters or auxiliary Z variables of noise and q(f) log q(f)df. p(y ijk f ijk ) H q (F) = M step: Maximize the variational lower bound over U, Vand Z.

25 Algorithm: explore model structures Example: Trace{(I + K u K v K z ) 1 } Direct computation: Matrix inversion N 3 by N 3 O(N 9 ) Kronecker product operation:

26 Properties of Kronecker product Properties: Theorem: Eigen-decomposition K = W W T If K = K u K v K z then W = W u W v W z = u v z W u : eigenvectors of K u diag{ u } : eigenvalues of K u

27 Reduced computational complexity Example: Trace{(I + K u K v K z ) 1 } Direct computation Using this theorem and trace properties O(N 9 ) NX NX NX i=1 j=1 k= u i v j z k O(N 3 ) = MX MX MX i=1 j=1 k= u i v j z k O(N 2 M) M<<N

28 2D case: GP stochastic blockmodels - Undirected networks (friend relationships and protein-protein interactions) - Represented by symmetric adjacent matrices Yan, Xu & Qi, 2011; Xu, Yan & Qi 2011

29 2D: Coauthor networks Membership produced by LEM Area Under Curve AUC values SMGB MMSB LEM Ours Number of latent groups Groups Groups Groups Nodes Membership produced by MMSB Nodes Membership produced by SMGB Nodes NIPS authors Ours Co-authorship dataset: co-authorship links from100 authors who have the largest number of co-authors from NIPS 1-17.

30 3D: Enron s Enron dataset: s from senior management of Enron before its bankruptcy in 2001.! 3D tensor representation: Sender-Recipient-Subject Area Under Curve AUC values InfTucker tp InfTucker gp CP TD HOSVD NCP WCP PTD Ours Number of Factors

31 4D: Digg Area Under Curve AUC values Ours InfTucker tp InfTucker gp CP TD HOSVD NCP WCP PTD Number of Factors Digg dataset: Social news from digg.com 4D tensor representation: user-news-keywords-category

32 4D: Digg latent groups Keyword groups learned by STGP: iphone Olympics american journey apple games australian dream internet china rothschild action *seo beijing moscow wisdom photosynth red century top *seo: search engine optimization

33 Next steps Better models for sparse graphs/ tensors? Distributed inference for big graphs/tensors

34 Outline Hierarchical Bayesian Distributed online inf. STGP NELL, fmri,... DinTucker

35 Big data (with big issues) - High computational cost: O(d 3 ) where d is the size of the largest mode. - Not enough memory to store the data in a single machine - Data are physically stored in different locations

36 Challenge for GPs on arrays Coupled array elements: a huge covariance over all them.! Obvious solutions, ADMM or stochastic gradient descent, do not apply due to the partition function.

37 Solution: Hierarchical Bayesian modeling Local training over subarrays! Global information sharing via the parent nodes

38 Solution: Hierarchical Bayesian modeling {Y Y n }. U Global memberships Ũ n Local memberships M nt Latent tensors Y nt Subarrays t=1... T n n=1... N

39 Solution: Hierarchical Bayesian modeling {Y Y n }. U Global memberships Ũ n Local memberships M nt Latent tensors Local training:! Mapper Y nt Subarrays t=1... T n n=1... N

40 Hierarchical Bayesian on MapReduce {Y Y n }. U Ũ n Global memberships Local memberships Global sharing:! Reducer M nt Latent tensors Local VB inf.:! Mapper Y nt Subarrays t=1... T n n=1... N

41 Subarray sampling strategies Uniform! Weighted! Grid

42 Three questions How does the new distributed online inference compare with the sequential inference? How does it scale in terms of number of CPUs? How does it compare with the state-ofthe-art large-scale tensor decomposition method?

43 Same accuracy as sequential inference PARAFAC NPARAFAC HOSVD Tucker InfTucker DinTucker U DinTucker W DinTucker G AUC DinTucker with! a specific sampling! strategy Number of Factors (b) digg2 AUC Number of Factors (c) enron

44 Nearly linear scalability 3.5 Scales up: Rn/R Number of Machines

45 More accurate, less time Running time (hours) DinTucker GigaTensor AUC Subarray Size (I=J=K) (a) NELL: running time 0.83 GigaTensor DinTucker G DinTucker U DinTucker W (b) NELL: prediction

46 More accurate, less time Running time (hours) DinTucker GigaTensor AUC Subarray Size (I=J=K) (c) ACC: running time 0.95 GigaTensor DinTucker G (d) ACC: prediction DinTucker U DinTucker W

47 Conclusions

48 Scalable nonparametric Bayesian methods for graphs Model Algorithm STGP: predict who may be your friends on facebook Data DinTucker: distributed online inference for large graphs and tensors

49 Acknowledgment! Funding agencies: - NSF: CAREER, Robust Intelligence, CDI, STC - NIH: R01 - Microsoft - IBM - Showalter foundation - Indiana Clinical & Translational Science Institute - Purdue

DinTucker: Scaling Up Gaussian Process Models on Large Multidimensional Arrays

Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) DinTucker: Scaling Up Gaussian Process Models on Large Multidimensional Arrays Shandian Zhe, 1 Yuan Qi, 1 Youngja Park,