Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data

1 Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data
Yarin Gal, Yutian Chen, Zoubin Ghahramani

2 Distribution Estimation
Distribution estimation of categorical data $\{y_n \mid n = 1, \dots, N\}$ with $y_n \in \{1, \dots, K\}$ is easy. But learning a distribution $P(\mathbf{y})$ from vectors of categorical data, $\{\mathbf{y}_n = (y_{n1}, \dots, y_{nD}) \mid n = 1, \dots, N\}$ with $y_{nd} \in \{1, \dots, K\}$, is much more difficult.
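A minimal sketch of why the univariate case is easy and the multivariate case is not. It assumes a frequency-count (maximum-likelihood) estimator, which the slides do not name; `categorical_mle` and `joint_table_size` are illustrative helper names. The values K = 10 and D = 9 are taken from the Wisconsin breast cancer description later in the deck (9 categorical variables with 10 values).

```python
from collections import Counter

def categorical_mle(samples, K):
    """Frequency estimate of P(y = k) for k = 1..K from observed labels."""
    counts = Counter(samples)
    n = len(samples)
    return [counts.get(k, 0) / n for k in range(1, K + 1)]

def joint_table_size(K, D):
    """Number of distinct configurations of D variables with K values each."""
    return K ** D

# Univariate case: counting is enough.
probs = categorical_mle([1, 3, 3, 2, 3, 1], K=3)
print(probs)

# Multivariate case: a full probability table needs K**D cells.
print(joint_table_size(K=10, D=9))
```

With far fewer observations than cells, most configurations are never seen, so a plain count-based table cannot generalise.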

3 Multivariate Categorical Data

5 Example: Wisconsin Breast Cancer
Table columns: Sample code, Thickness, Unif. Cell Size, Unif. Cell Shape, Marginal Adhesion, Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses, Class.
Class takes the values Benign and Malignant; one table entry is marked ?.
683 patients, possible configurations

6 Categorical Latent Gaussian Process
We define our model as
$x_n \overset{iid}{\sim} \mathcal{N}(0, \sigma_x^2 I)$ (continuous latent space of patients)
$f_d \sim \mathcal{GP}(\cdot)$ (distribution over $K$-dimensional functions generating weights)
$y_{nd} \overset{iid}{\sim} \text{Softmax}(f_d(x_n))$ (test result)
$\mathbf{y}_n = (y_{n1}, \dots, y_{nD})$ (medical assessment)
for $N$ patients $x_n$ and $D$ categorical tests, each with $K$ possible test results.
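A minimal sketch of the last step of the model: turning the K function values for one test into a categorical test result through the Softmax. The values in `f_d_at_x` are hypothetical and stand in for the function values at one patient's latent point; `softmax` and `sample_test_result` are illustrative names.

```python
import math
import random

def softmax(f):
    """Map K real-valued weights to K probabilities (max-shifted for stability)."""
    m = max(f)
    exps = [math.exp(v - m) for v in f]
    total = sum(exps)
    return [e / total for e in exps]

def sample_test_result(f, rng):
    """Draw y in {1..K} with P(y = k) = softmax(f)[k - 1]."""
    probs = softmax(f)
    return rng.choices(range(1, len(f) + 1), weights=probs, k=1)[0]

rng = random.Random(0)
f_d_at_x = [2.0, 0.0, -1.0]   # hypothetical weights for K = 3 test results
probs = softmax(f_d_at_x)
draws = [sample_test_result(f_d_at_x, rng) for _ in range(1000)]
print(probs)
```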

7 Sparse Gaussian Processes
A small number of inducing points support a distribution over functions.
Figure: the data points are shown in dark blue and the inducing points in red.

8 Sparse Gaussian Processes
Given sufficient statistics $\{Z_m, u_m\}_{m=1}^{M}$,
define $K_{MM}$ as the kernel evaluated at points $(Z_m)$,
define $K_{Mn} = K_{nM}^T$ as the kernel evaluated between points $(Z_m)$ and point $x_n$.
A sparse Gaussian process is defined through a conditional Gaussian distribution by
$f_n \sim \mathcal{N}(a_n^T u, b_n)$ with $a_n = K_{MM}^{-1} K_{Mn}$, $b_n = K_{nn} - K_{nM} K_{MM}^{-1} K_{Mn}$,
or equivalently,
$f_n = a_n^T u + \sqrt{b_n}\, \varepsilon_n^{(f)}$, $\varepsilon_n^{(f)} \sim \mathcal{N}(0, 1)$.
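A minimal NumPy sketch of the conditional defined on this slide, assuming a squared-exponential kernel with unit variance and a small jitter on the inducing-point kernel for numerical stability. Neither choice is specified on the slide, and `rbf` and `sparse_gp_conditional` are illustrative names.

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel between rows of A (n x Q) and rows of B (m x Q)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def sparse_gp_conditional(x, Z, u, jitter=1e-8):
    """Return the mean a_n^T u and variance b_n of f_n for each row x_n of x."""
    Kmm = rbf(Z, Z) + jitter * np.eye(len(Z))
    Kmn = rbf(Z, x)                                      # shape M x N
    A = np.linalg.solve(Kmm, Kmn)                        # columns are a_n
    mean = A.T @ u
    var = np.diag(rbf(x, x)) - np.sum(Kmn * A, axis=0)   # b_n
    return mean, np.maximum(var, 0.0)

rng = np.random.default_rng(0)
Z = np.linspace(-2.0, 2.0, 5)[:, None]   # M = 5 inducing inputs (assumed)
u = np.sin(Z[:, 0])                      # inducing outputs (assumed)
x = rng.normal(size=(4, 1))
mean, var = sparse_gp_conditional(x, Z, u)

# Equivalent sampling form: f_n = a_n^T u + sqrt(b_n) * eps, eps ~ N(0, 1)
f = mean + np.sqrt(var) * rng.normal(size=mean.shape)
```

Evaluating the conditional at the inducing inputs themselves returns the inducing outputs with near-zero variance, which is a quick sanity check on the formulas.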

9 Generative Model
The resulting generative model, with kernel $K(\cdot, \cdot)$, is
$x_n \overset{iid}{\sim} \mathcal{N}(0, \sigma_x^2 I)$ (patient $n$)
$(u_{dk})_{k=1}^{K} \sim \mathcal{GP}(0, K((Z_m)_{m=1}^{M}))$ (inducing points)
$f_{ndk} \mid x_n, u_{dk} \sim \mathcal{N}(a_n^T u_{dk}, b_n)$ (weight of test result $k$)
$y_{nd} \overset{iid}{\sim} \text{Softmax}(f_{nd1}, \dots, f_{ndK})$ (examination $d$)
$\mathbf{y}_n = (y_{n1}, \dots, y_{nD})$ (medical assessment)
We want to find the posterior over $X$ and $U$: $p(X, U \mid Y)$.
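A sketch of ancestral sampling from this generative model, under stated assumptions: a unit-variance squared-exponential kernel (so $K_{nn} = 1$), randomly placed inducing inputs, and a jitter term for the Cholesky factor. None of these are fixed by the slide, and `sample_clgp` is an illustrative name.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Unit-variance squared-exponential kernel between rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def sample_clgp(N, D, K, Q, M, sigma_x=1.0, seed=0):
    rng = np.random.default_rng(seed)
    X = sigma_x * rng.normal(size=(N, Q))                # x_n ~ N(0, sigma_x^2 I)
    Z = rng.normal(size=(M, Q))                          # inducing inputs (assumed random)
    Kmm = rbf(Z, Z) + 1e-6 * np.eye(M)
    Kmn = rbf(Z, X)
    A = np.linalg.solve(Kmm, Kmn)                        # columns are a_n
    b = np.maximum(1.0 - np.sum(Kmn * A, axis=0), 0.0)   # b_n, with K_nn = 1
    chol = np.linalg.cholesky(Kmm)
    Y = np.empty((N, D), dtype=int)
    for d in range(D):
        U = chol @ rng.normal(size=(M, K))               # u_dk ~ N(0, Kmm)
        F = A.T @ U + np.sqrt(b)[:, None] * rng.normal(size=(N, K))
        P = np.exp(F - F.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)                # Softmax over the K weights
        for n in range(N):
            Y[n, d] = rng.choice(K, p=P[n]) + 1          # test result in {1..K}
    return X, Y

X, Y = sample_clgp(N=6, D=3, K=4, Q=2, M=5)
print(Y)
```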

10 Variational Inference
Our marginal log-likelihood is intractable. We lower bound the log evidence with a variational approximate posterior
$q(X, F, U) = q(X)\, q(U)\, p(F \mid X, U)$,
with
$q(X) = \prod_{n=1}^{N} \prod_{i=1}^{Q} \mathcal{N}(x_{ni} \mid m_{ni}, s_{ni}^2)$,
$q(U) = \prod_{d=1}^{D} \prod_{k=1}^{K} \mathcal{N}(u_{dk} \mid \mu_{dk}, L_d L_d^T)$,
such that
$q(X, F, U) \approx p(X, F, U \mid Y)$.
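Because $q(X)$ factorises into univariate Gaussians, its KL divergence to the prior has a closed form. A minimal sketch, assuming the prior on each coordinate is $\mathcal{N}(0, \sigma_x^2)$ as in the generative model; `kl_diag_gaussian_to_prior` is an illustrative name.

```python
import math

def kl_diag_gaussian_to_prior(m, s, sigma_x=1.0):
    """KL( prod_i N(m_i, s_i^2) || prod_i N(0, sigma_x^2) ), summed over coordinates."""
    total = 0.0
    for m_i, s_i in zip(m, s):
        total += (math.log(sigma_x / s_i)
                  + (s_i**2 + m_i**2) / (2.0 * sigma_x**2)
                  - 0.5)
    return total

# The divergence vanishes when q equals the prior and grows as q moves away.
print(kl_diag_gaussian_to_prior([0.0, 0.0], [1.0, 1.0]))
print(kl_diag_gaussian_to_prior([1.0], [1.0]))
```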

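11 Re-parametrising the Model
This is equivalent to
$x_{ni} = m_{ni} + s_{ni}\, \varepsilon_{ni}^{(x)}$, $\varepsilon_{ni}^{(x)} \sim \mathcal{N}(0, 1)$
$u_{dk} = \mu_{dk} + L_d\, \varepsilon_{dk}^{(u)}$, $\varepsilon_{dk}^{(u)} \sim \mathcal{N}(0, I_M)$
$f_{ndk} = a_n^T u_{dk} + \sqrt{b_n}\, \varepsilon_{ndk}^{(f)}$, $\varepsilon_{ndk}^{(f)} \sim \mathcal{N}(0, 1)$
Then,
$\log p(Y) \ge -\text{KL divergence terms} + \sum_{n=1}^{N} \sum_{d=1}^{D} \mathbb{E}_{\varepsilon_n^{(x)}, \varepsilon_d^{(u)}, \varepsilon_{nd}^{(f)}} \left[ \log \text{Softmax}\left( y_{nd} \mid \mathbf{f}_{nd}\left( \varepsilon_{nd}^{(f)}, U_d(\varepsilon_d^{(u)}), x_n(\varepsilon_n^{(x)}) \right) \right) \right]$.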
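A minimal sketch of the re-parametrisation for a single latent coordinate: the noise is drawn independently of the variational parameters, and the sample is a deterministic function of the two. `reparam_sample` is an illustrative name and the values of `m` and `s` are arbitrary.

```python
import random

def reparam_sample(m, s, rng):
    """Return x = m + s * eps with eps ~ N(0, 1), together with eps."""
    eps = rng.gauss(0.0, 1.0)   # noise does not depend on (m, s)
    return m + s * eps, eps

rng = random.Random(1)
m, s = 2.0, 0.5
samples = [reparam_sample(m, s, rng)[0] for _ in range(20000)]
mean = sum(samples) / len(samples)
std = (sum((v - mean) ** 2 for v in samples) / len(samples)) ** 0.5
print(mean, std)
```

Since x depends on m and s only through this deterministic map, derivatives of the objective with respect to m and s can pass through the sample (dx/dm = 1, dx/ds = eps).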

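12 Monte Carlo integration
Monte Carlo integration approximates the likelihood, obtaining noisy gradients:
$\mathbb{E}_{\varepsilon_n^{(x)}, \varepsilon_d^{(u)}, \varepsilon_{nd}^{(f)}} \left[ \log \text{Softmax}\left( y_{nd} \mid \mathbf{f}_{nd}(\cdot) \right) \right] \approx \frac{1}{T} \sum_{i=1}^{T} \log \text{Softmax}\left( y_{nd} \mid \mathbf{f}_{nd}\left( \varepsilon_{nd,i}^{(f)}, U_d(\varepsilon_{d,i}^{(u)}), x_n(\varepsilon_{n,i}^{(x)}) \right) \right)$
with $\varepsilon_{\cdot,i}^{(x)}, \varepsilon_{\cdot,i}^{(u)}, \varepsilon_{\cdot,i}^{(f)} \sim \mathcal{N}(0, 1)$, independent of the parameters.
Adaptive learning-rate stochastic optimisation is used to optimise the noisy objective with respect to the parameters of the variational distribution.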
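A minimal sketch of the Monte Carlo estimate for one (n, d) pair. It simplifies the slide's setting by assuming the K weights are independent Gaussians with given means and variances, so only the noise on f is sampled; `log_softmax` and `mc_expected_log_softmax` are illustrative names.

```python
import math
import random

def log_softmax(f, y):
    """log P(y) under Softmax(f) for y in {1..K}, computed with log-sum-exp."""
    mx = max(f)
    lse = mx + math.log(sum(math.exp(v - mx) for v in f))
    return f[y - 1] - lse

def mc_expected_log_softmax(mean, var, y, T, rng):
    """Average log Softmax(y | f) over T re-parametrised draws of f."""
    total = 0.0
    for _ in range(T):
        f = [mu + math.sqrt(v) * rng.gauss(0.0, 1.0) for mu, v in zip(mean, var)]
        total += log_softmax(f, y)
    return total / T

rng = random.Random(0)
estimate = mc_expected_log_softmax([1.0, 0.0, -1.0], [0.5, 0.5, 0.5], y=1, T=500, rng=rng)
print(estimate)
```

The estimate is noisy for small T, which is why the slide pairs it with an adaptive learning-rate optimiser.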

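13 Categorical Latent Gaussian Process
Symbolic differentiation gives simple code:

    import theano
    import theano.tensor as T
    # The slide calls st.matrix_inverse without defining st; theano.tensor.nlinalg is assumed here.
    from theano.tensor.nlinalg import matrix_inverse
    # randn, RBF, RBFnn, get_kl_u and get_kl_x are helpers not shown on the slide;
    # m, s, mu, L, Z, sf2, l and Y are symbolic variables declared elsewhere.
    X = m + s * randn(N, Q)                       # re-parametrised latent points
    U = mu + L.dot(randn(M, K))                   # re-parametrised inducing outputs
    Kmm, Kmn, Knn = RBF(sf2, l, Z), RBF(sf2, l, Z, X), RBFnn(sf2, l, X)
    KmmInv = matrix_inverse(Kmm)
    A = KmmInv.dot(Kmn)                           # columns are a_n
    B = Knn - T.sum(Kmn * KmmInv.dot(Kmn), 0)     # variances b_n
    F = A.T.dot(U) + B[:, None]**0.5 * randn(N, K)
    S = T.nnet.softmax(F)
    KL_U, KL_X = get_kl_u(), get_kl_x()
    LS = T.sum(T.log(T.sum(Y * S, 1))) - KL_U - KL_X
    # [ inputs ] stands for the list of symbolic inputs, elided on the slide
    LS_func = theano.function([ inputs ], LS)
    dls_dm = theano.function([ inputs ], T.grad(LS, m))  # and others
    # ... and run RMSProp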
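The slide only names RMSProp as the optimiser. A minimal sketch of one common formulation (a running average of squared gradients scaling each step), written as gradient ascent because the lower bound is maximised. The toy objective, step size, and decay rate are assumptions for illustration, not values from the talk.

```python
import random

def rmsprop_ascent(grad_fn, theta, steps=2000, lr=0.01, decay=0.9, eps=1e-8):
    """Maximise an objective from noisy gradients with an RMSProp-style update."""
    avg_sq = 0.0
    for _ in range(steps):
        g = grad_fn(theta)
        avg_sq = decay * avg_sq + (1.0 - decay) * g * g   # running mean of g^2
        theta += lr * g / (avg_sq ** 0.5 + eps)           # scaled ascent step
    return theta

# Toy objective -(theta - 3)^2 with Gaussian noise added to its gradient.
rng = random.Random(0)

def noisy_grad(theta):
    return -2.0 * (theta - 3.0) + rng.gauss(0.0, 0.1)

theta_hat = rmsprop_ascent(noisy_grad, theta=0.0)
print(theta_hat)
```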

14 Relations to Existing Models
                            | Linear                 | Non-linear
Observed input, continuous  | Linear regression      | Gaussian process regression
Observed input, discrete    | Logistic regression    | Gaussian process classification
Latent input, continuous    | Factor analysis        | Gaussian process latent variable model
Latent input, discrete      | Latent Gaussian models | Categorical Latent Gaussian Process

15 Multi-modal Distributions
Figure: $p(y_d = 1 \mid x)$ as a function of $x$, for $d = 0, 1, 2$ (first, second, and third digits in the XOR triplet, left to right). Panels: (a) (Linear) Latent Gaussian Model; (b) Categorical Latent Gaussian Process.

16 Data Visualisation
Figure panels: (a) Example alphadigits; (b) (Linear) Latent Gaussian Model latent space; (c) Categorical Latent Gaussian Process latent space. The axes in (b) and (c) are latent dimensions.

17 Data Imputation
Test perplexity predicting randomly missing values, comparing four models: Uniform, Multinomial, (Linear) Latent Gaussian Model, and Categorical Latent Gaussian Process.
Datasets:
Terror Warning Effects on Political Attitude (START Terrorism Data Archive; 17 categorical variables with 5-6 values)
Wisconsin Breast Cancer (9 categorical variables with 10 values)
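The slide does not define test perplexity. A minimal sketch assuming the usual definition, the exponential of the negative mean log-probability that a model assigns to the held-out true values; `perplexity` is an illustrative name and the inputs are made-up probabilities, not results from the talk.

```python
import math

def perplexity(probs_of_truth):
    """exp of the negative mean log-probability assigned to the true values."""
    n = len(probs_of_truth)
    return math.exp(-sum(math.log(p) for p in probs_of_truth) / n)

# A uniform guess over 10 values gives perplexity 10; a perfect model gives 1.
print(perplexity([0.1] * 5))
print(perplexity([1.0, 1.0]))
```

Under this definition, lower is better, and the Uniform baseline on a variable with K values scores K.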

18 Inference Robustness
Figure: lower bound (ELBO) and MC standard deviation per iteration (on log scale) for the Alphadigits dataset. Axes: ELBO against log iteration.

19 Scaling Up!
Why not scale the model up?
- We recently showed that sparse Gaussian processes can be scaled well to millions of data points [Gal, van der Wilk, Rasmussen, 2014]
- Mini-batch GP optimisation [Hensman et al. 2014]
- Recognition models with sparse Gaussian processes
- Integrate over unobserved variables
- Handle missing data
- Mixed data
- Using link functions alternative to the Softmax
- Continuous variables, Positives, Ordinals
Currently running experiments!

22 Thank you

Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data
Yarin Gal, Yutian Chen, Zoubin Ghahramani
University of Cambridge
arXiv:1503.02182v1 [stat.ML] 7 Mar 2015
Abstract Multivariate