Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data


Latent Gaussian Processes for Distribution Estimation of Multivariate Categorical Data
Yarin Gal, Yutian Chen, Zoubin Ghahramani (yg279@cam.ac.uk)

Distribution Estimation

Distribution estimation for categorical data $\{y_n \mid n = 1, \dots, N\}$ with $y_n \in \{1, \dots, K\}$ is easy: the maximum-likelihood estimate is just the vector of empirical frequencies. But learning a distribution $P(\mathbf{y})$ from vectors of categorical data, $\{\mathbf{y}_n = (y_{n1}, \dots, y_{nD}) \mid n = 1, \dots, N\}$ with $y_{nd} \in \{1, \dots, K\}$, is much more difficult: a full joint multinomial has $K^D - 1$ free parameters, far more than any realistic number of observations.
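As a rough illustration of the gap (toy numbers, not from the talk), the univariate estimate is a simple frequency count while the joint table blows up:

import numpy as np

# Univariate case: the maximum-likelihood estimate of P(y) is just the
# vector of empirical frequencies.
y = np.random.randint(0, 5, size=1000)        # N = 1000 draws from K = 5 categories
p_hat = np.bincount(y, minlength=5) / len(y)
print(p_hat)

# Multivariate case: a full joint table over D variables with K values each
# has K**D - 1 free parameters, hopeless to fit directly for moderate D.
K, D = 10, 9
print(f"{K**D - 1:,} free parameters")        # 999,999,999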

Multivariate Categorical Data

(Figure slides: illustrations of multivariate categorical data.)

Example: Wisconsin Breast Cancer

Columns: Sample code; Thickness; Unif. Cell Size; Unif. Cell Shape; Marginal Adhesion; Epithelial Cell Size; Bare Nuclei; Bland Chromatin; Normal Nucleoli; Mitoses; Class ('?' marks a missing value).

1000025:  5  1  1  1  2  1  3  1  1  Benign
1002945:  5  4  4  5  7 10  3  2  1  Benign
1015425:  3  1  1  1  2  2  3  1  1  Benign
1016277:  6  8  8  1  3  4  3  7  1  Benign
1017023:  4  1  1  3  2  1  3  1  1  Benign
1017122:  8 10 10  8  7 10  9  7  1  Malignant
1018099:  1  1  1  1  2 10  3  1  1  Benign
1018561:  2  1  2  1  2  1  3  1  1  Benign
1033078:  2  1  1  1  2  1  1  1  5  Benign
1033078:  4  2  1  1  2  1  2  1  1  Benign
1035283:  1  1  1  1  1  1  3  1  1  Benign
1036172:  2  1  1  1  2  1  2  1  1  Benign
1041801:  5  3  3  3  2  3  4  4  1  Malignant
1043999:  1  1  1  1  2  3  3  1  1  Benign
1044572:  8  7  5 10  7  9  5  5  4  Malignant
1047630:  7  4  6  4  6  1  4  3  1  Malignant
1048672:  4  1  1  1  2  1  2  1  1  Benign
1049815:  4  1  1  1  2  1  3  1  1  Benign
1050670: 10  7  7  6  4 10  4  1  2  Malignant
1050718:  6  1  1  1  2  1  3  1  1  Benign
1054590:  7  3  2 10  5 10  5  4  4  Malignant
1054593: 10  5  5  3  6  7  7 10  1  Malignant
1056784:  3  1  1  1  2  1  2  1  1  Benign
1057013:  8  4  5  1  2  ?  7  3  1  Malignant
1059552:  1  1  1  1  2  1  3  1  1  Benign
1065726:  5  2  3  4  2  7  3  6  1  Malignant
1066373:  3  2  1  1  1  1  2  1  1  Benign
1066979:  5  1  1  1  2  1  2  1  1  Benign
1067444:  2  1  1  1  2  1  2  1  1  Benign
1070935:  1  1  3  1  2  1  1  1  1  Benign
1070935:  3  1  1  1  1  1  2  1  1  Benign
1071760:  2  1  1  1  2  1  3  1  1  Benign
1072179: 10  7  7  3  8  5  7  4  3  Malignant

683 patients, about 2 × 10^9 possible configurations.

Categorical Latent Gaussian Process

We define our model as:

$\mathbf{x}_n \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_x^2 I)$   (continuous latent space of patients)
$\mathbf{f}_d \sim \mathcal{GP}(\cdot)$   (distribution over $K$-dimensional functions generating weights)
$y_{nd} \overset{\text{iid}}{\sim} \mathrm{Softmax}(\mathbf{f}_d(\mathbf{x}_n))$   (test result)
$\mathbf{y}_n = (y_{n1}, \dots, y_{nD})$   (medical assessment)

with $N$ patients $\mathbf{x}_n$ and $D$ categorical tests, each with $K$ possible test results.

Sparse Gaussian Processes

A small number of inducing points support a distribution over functions. (Figure: data points in dark blue, inducing points in red.)

Sparse Gaussian Processes

Given inducing points with sufficient statistics $\{\mathbf{z}_m, u_m\}_{m=1}^M$, define $K_{MM}$ as the kernel evaluated at the points $(\mathbf{z}_m)$, and $K_{Mn} = K_{nM}^T$ as the kernel evaluated between the points $(\mathbf{z}_m)$ and the point $\mathbf{x}_n$.

A sparse Gaussian process is defined through a conditional Gaussian distribution

$$f_n \sim \mathcal{N}(\mathbf{a}_n^T \mathbf{u}, \, b_n), \qquad \mathbf{a}_n = K_{MM}^{-1} K_{Mn}, \qquad b_n = K_{nn} - K_{nM} K_{MM}^{-1} K_{Mn},$$

or equivalently

$$f_n = \mathbf{a}_n^T \mathbf{u} + b_n^{1/2} \varepsilon_n^{(f)}, \qquad \varepsilon_n^{(f)} \sim \mathcal{N}(0, 1).$$
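A minimal NumPy sketch of this conditional, assuming an RBF kernel and toy dimensions chosen purely for illustration (this is not the talk's code):

import numpy as np

def rbf(A, B, sf2=1.0, l=1.0):
    # Squared-exponential kernel between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / l ** 2)

M, Q = 10, 2                         # inducing points, latent dimensions
Z = np.random.randn(M, Q)            # inducing inputs z_m
u = np.random.randn(M)               # inducing outputs u_m
x_n = np.random.randn(1, Q)          # a single input x_n

Kmm = rbf(Z, Z) + 1e-6 * np.eye(M)   # jitter for numerical stability
Kmn = rbf(Z, x_n)                    # M x 1
Knn = rbf(x_n, x_n)                  # 1 x 1

a_n = np.linalg.solve(Kmm, Kmn)              # a_n = Kmm^{-1} Kmn
b_n = (Knn - Kmn.T @ a_n).item()             # b_n = Knn - Knm Kmm^{-1} Kmn
f_n = (a_n.T @ u).item() + np.sqrt(max(b_n, 0.0)) * np.random.randn()
print(f_n)                                   # one draw of f_n given u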

Generative Model

The resulting generative model, with kernel $\mathcal{K}(\cdot, \cdot)$, is

$\mathbf{x}_n \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_x^2 I)$   (patient $n$)
$(u_{dk})_{k=1}^K \sim \mathcal{GP}\big(0, \mathcal{K}((\mathbf{z}_m)_{m=1}^M)\big)$   (inducing points)
$f_{ndk} \mid \mathbf{x}_n, \mathbf{u}_{dk} \sim \mathcal{N}(\mathbf{a}_n^T \mathbf{u}_{dk}, \, b_n)$   (weight of test result $k$)
$y_{nd} \overset{\text{iid}}{\sim} \mathrm{Softmax}(f_{nd1}, \dots, f_{ndK})$   (examination $d$)
$\mathbf{y}_n = (y_{n1}, \dots, y_{nD})$   (medical assessment)

We want to find the posterior over $\mathbf{X}$ and $\mathbf{U}$: $p(\mathbf{X}, \mathbf{U} \mid \mathbf{Y})$.

Variational Inference

Our marginal log-likelihood is intractable. We lower bound the log evidence with a variational approximate posterior

$$q(\mathbf{X}, \mathbf{F}, \mathbf{U}) = q(\mathbf{X})\, q(\mathbf{U})\, p(\mathbf{F} \mid \mathbf{X}, \mathbf{U}),$$

with

$$q(\mathbf{X}) = \prod_{n=1}^{N} \prod_{i=1}^{Q} \mathcal{N}(x_{ni} \mid m_{ni}, s_{ni}^2), \qquad q(\mathbf{U}) = \prod_{d=1}^{D} \prod_{k=1}^{K} \mathcal{N}(\mathbf{u}_{dk} \mid \boldsymbol{\mu}_{dk}, L_d L_d^T),$$

such that $q(\mathbf{X}, \mathbf{F}, \mathbf{U}) \approx p(\mathbf{X}, \mathbf{F}, \mathbf{U} \mid \mathbf{Y})$.
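Written out (a standard derivation not shown on the slide), this choice of $q$ makes the $p(\mathbf{F} \mid \mathbf{X}, \mathbf{U})$ factor cancel inside the bound, leaving

$$\log p(\mathbf{Y}) \ge -\,\mathrm{KL}\big(q(\mathbf{X}) \,\|\, p(\mathbf{X})\big) - \mathrm{KL}\big(q(\mathbf{U}) \,\|\, p(\mathbf{U})\big) + \sum_{n=1}^{N} \sum_{d=1}^{D} \mathbb{E}_{q(\mathbf{x}_n)\, q(\mathbf{U}_d)\, p(\mathbf{f}_{nd} \mid \mathbf{x}_n, \mathbf{U}_d)}\big[\log \mathrm{Softmax}(y_{nd} \mid \mathbf{f}_{nd})\big],$$

which is the objective that the next slide re-parametrises.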

Re-parametrising the Model

This is equivalent to

$$x_{ni} = m_{ni} + s_{ni}\, \varepsilon^{(x)}_{ni}, \quad \varepsilon^{(x)}_{ni} \sim \mathcal{N}(0, 1), \qquad \mathbf{u}_{dk} = \boldsymbol{\mu}_{dk} + L_d\, \boldsymbol{\varepsilon}^{(u)}_{dk}, \quad \boldsymbol{\varepsilon}^{(u)}_{dk} \sim \mathcal{N}(0, I_M),$$

$$f_{ndk} = \mathbf{a}_n^T \mathbf{u}_{dk} + b_n^{1/2}\, \varepsilon^{(f)}_{ndk}, \quad \varepsilon^{(f)}_{ndk} \sim \mathcal{N}(0, 1).$$

Then

$$\log p(\mathbf{Y}) \ge \text{KL divergence terms} + \sum_{n=1}^{N} \sum_{d=1}^{D} \mathbb{E}_{\varepsilon^{(x)}_n, \varepsilon^{(u)}_d, \varepsilon^{(f)}_{nd}}\Big[\log \mathrm{Softmax}\big(y_{nd} \mid \mathbf{f}_{nd}(\varepsilon^{(f)}_{nd}, \mathbf{U}_d(\varepsilon^{(u)}_d), \mathbf{x}_n(\varepsilon^{(x)}_n))\big)\Big].$$

Monte Carlo integration

Monte Carlo integration approximates the likelihood, obtaining noisy gradients:

$$\mathbb{E}_{\varepsilon^{(x)}_n, \varepsilon^{(u)}_d, \varepsilon^{(f)}_{nd}}\big[\log \mathrm{Softmax}(\cdot)\big] \approx \frac{1}{T} \sum_{i=1}^{T} \log \mathrm{Softmax}\big(y_{nd} \mid \mathbf{f}_{nd}(\varepsilon^{(f)}_{nd,i}, \mathbf{U}_d(\varepsilon^{(u)}_{d,i}), \mathbf{x}_n(\varepsilon^{(x)}_{n,i}))\big)$$

with $\varepsilon^{(x)}_{\cdot,i}, \varepsilon^{(u)}_{\cdot,i}, \varepsilon^{(f)}_{\cdot,i} \sim \mathcal{N}(0, 1)$, independent of the parameters.

Adaptive learning-rate stochastic optimisation is used to optimise the noisy objective with respect to the parameters of the variational distribution.
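For concreteness, a small NumPy sketch of this estimator for a single data point and a single test; the shapes, kernel, and variational parameters below are illustrative assumptions rather than the talk's values:

import numpy as np

def rbf(A, B, sf2=1.0, l=1.0):
    # Squared-exponential kernel between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / l ** 2)

def log_softmax_at(y, f):
    f = f - f.max()                                 # numerically stable log-softmax
    return f[y] - np.log(np.exp(f).sum())

K, M, Q, T = 3, 10, 2, 50                           # categories, inducing points, latent dims, MC samples
Z = np.random.randn(M, Q)                           # inducing inputs
m, s = np.zeros(Q), np.ones(Q)                      # q(x_n): means m_ni and std devs s_ni
mu, L = np.zeros((M, K)), np.eye(M)                 # q(u_dk): means and shared Cholesky factor L_d
y_nd = 1                                            # observed category for test d

est = 0.0
for _ in range(T):
    x_n = (m + s * np.random.randn(Q))[None, :]     # x_n = m + s * eps_x
    U = mu + L @ np.random.randn(M, K)              # u_dk = mu_dk + L_d * eps_u, all k at once
    Kmm = rbf(Z, Z) + 1e-6 * np.eye(M)
    Kmn, Knn = rbf(Z, x_n), rbf(x_n, x_n)
    a_n = np.linalg.solve(Kmm, Kmn)                 # Kmm^{-1} Kmn
    b_n = (Knn - Kmn.T @ a_n).item()
    f_nd = (a_n.T @ U).ravel() + np.sqrt(max(b_n, 0.0)) * np.random.randn(K)
    est += log_softmax_at(y_nd, f_nd) / T           # running MC average
print("MC estimate of the expected log-Softmax term:", est)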

Categorical Latent Gaussian Process

Symbolic differentiation gives simple code:

import theano
import theano.tensor as T
import theano.tensor.nlinalg as nlinalg

# RBF, RBFnn, randn, get_KL_U, get_KL_X are helper functions defined elsewhere;
# m, s, mu, L, Z, Y, sf2, l are the model's shared variables and parameters.
X = m + s * randn(N, Q)                             # re-parametrised sample from q(X)
U = mu + L.dot(randn(M, K))                         # re-parametrised sample from q(U)
Kmm, Kmn, Knn = RBF(sf2, l, Z), RBF(sf2, l, Z, X), RBFnn(sf2, l, X)
KmmInv = nlinalg.matrix_inverse(Kmm)
A = KmmInv.dot(Kmn)                                 # a_n for all n
B = Knn - T.sum(Kmn * KmmInv.dot(Kmn), 0)           # b_n for all n
F = A.T.dot(U) + B[:, None]**0.5 * randn(N, K)      # f_ndk samples
S = T.nnet.softmax(F)
KL_U, KL_X = get_KL_U(), get_KL_X()
LS = T.sum(T.log(T.sum(Y * S, 1))) - KL_U - KL_X    # Monte Carlo estimate of the lower bound
LS_func = theano.function([inputs], LS)             # "inputs": placeholder for the symbolic inputs
dLS_dm = theano.function([inputs], T.grad(LS, m))   # and others
# ... and run RMS-PROP
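The final comment refers to adaptive learning-rate optimisation (RMS-PROP); a generic sketch of such an update applied to a noisy gradient, with illustrative hyper-parameters rather than the authors' settings, looks like:

import numpy as np

def rmsprop_step(param, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    # Scale each step by a running average of squared gradients.
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param + lr * grad / (np.sqrt(cache) + eps)   # gradient ascent on the noisy bound
    return param, cache

m = np.zeros(5)                        # e.g. the variational means m_ni
cache = np.zeros_like(m)
for step in range(1000):
    grad = np.random.randn(5)          # stand-in for the noisy gradient dLS_dm(inputs)
    m, cache = rmsprop_step(m, grad, cache)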

Relations to Existing Models

                                   Linear                   Non-linear
Continuous data, observed input    Linear regression        Gaussian process regression
Continuous data, latent input      Factor analysis          Gaussian process latent variable model
Discrete data, observed input      Logistic regression      Gaussian process classification
Discrete data, latent input        Latent Gaussian models   Categorical Latent Gaussian Process

Multi-modal Distributions

(Figure: (a) (Linear) Latent Gaussian Model, (b) Categorical Latent Gaussian Process; $p(y_d = 1 \mid x)$ as a function of $x$, for $d = 0, 1, 2$ (first, second, and third digits in the XOR triplet, left to right).)

Data Visualisation

(Figure: (a) example Alphadigits; (b) (Linear) Latent Gaussian Model latent space; (c) Categorical Latent Gaussian Process latent space. Axes: Latent Dim 1 and Latent Dim 2.)

Data Imputation

Test perplexity predicting randomly missing values:

Terror Warning Effects on Political Attitude (START Terrorism Data Archive, 17 categorical variables with 5-6 values):
  Uniform 2.50, Multinomial 2.20, (Linear) Latent Gaussian Model 3.07, Categorical Latent Gaussian Process 2.11

Wisconsin Breast Cancer (9 categorical variables with 10 values):
  Uniform 8.68, Multinomial 4.41, (Linear) Latent Gaussian Model 3.57 ± 0.208, Categorical Latent Gaussian Process 2.86 ± 0.119
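For reference, test perplexity here is presumably the exponentiated average negative log-probability assigned to the held-out values (lower is better); a minimal sketch under that assumption:

import numpy as np

def perplexity(pred_probs):
    # pred_probs: model probability assigned to each held-out categorical value.
    pred_probs = np.asarray(pred_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(pred_probs))))

# A uniform guess over K values scores K; e.g. K = 5:
print(perplexity([0.2] * 10))        # 5.0
print(perplexity([0.6, 0.3, 0.5]))   # ~2.2, better than uniform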

Inference Robustness

(Figure: lower bound and MC standard deviation per iteration (on log scale) for the Alphadigits dataset; y-axis: ELBO, x 10^5; x-axis: iteration.)

Scaling Up!

Why not scale the model up?

- We recently showed that sparse Gaussian processes can be scaled well to millions of data points [Gal, van der Wilk, Rasmussen, 2014]
  - Mini-batch GP optimisation [Hensman et al. 2014]
  - Recognition models with sparse Gaussian processes
- Integrate over unobserved variables
  - Handle missing data
  - Mixed data
- Using link functions alternative to the Softmax
  - Continuous variables, positives, ordinals

Currently running experiments!

Thank you