Nonparametric Bayes Density Estimation and Regression with High Dimensional Data

Nonparametric Bayes Density Estimation and Regression with High Dimensional Data
Abhishek Bhattacharya, Garritt Page, Department of Statistics, Duke University
Joint work with Prof. D. Dunson
September 2010

Contents 1 Background & Motivation 2 Density Model 3 Regression and Classification 4 Further Work 5 Numerical Examples

Background & Motivation: Density Estimation on a High Dimensional Space
A common approach to modelling the distribution of multivariate data is an infinite mixture density. This leads to great difficulty in posterior computation when the data dimension is huge. The reason is that, even for heavy-tailed distributions, most of the variability is along a few directions. That is why, when fitting an infinite mixture density model, we end up with a finite number of clusters, say k of them, which is much smaller than the data dimension m.

Background & Motivation: Our Approach
Instead, we model the projection of the data onto some k-dimensional (affine) subspace with a nonparametric model and fit a parametric distribution, such as a mean-zero Gaussian, to the remaining part. This amounts to fitting an infinite mixture model, but with cluster locations drawn from a k-dimensional affine subspace S. Then, by setting a prior on the subspace and its dimension, we can approximate any density on $\mathbb{R}^m$.

Background & Motivation: NP Regression & Classification
A common approach to regression/classification is to model the joint distribution with a nonparametric mixture density. But when the feature dimension m is much higher than that of the response, problems arise again: the model fails to capture the association between x and y and instead focuses on fitting the marginal of x. Many alternatives exist for such situations, for example directly modelling the conditional of y given x, assuming it depends on a few selected x-coordinates (Chung and Dunson 2009), or assuming it is a stochastic process depending on the projection $P_S(x)$ of x onto some smaller, say k-dimensional, (linear) subspace S (Tokdar et al. 2010).

Background & Motivation: Our Approach
We instead propose to model the joint distribution of y and $P_S(x)$ with a nonparametric mixture, while letting the remaining component of x have an independent parametric distribution such as a Gaussian (not mean zero). Our approach is more flexible than Chung and Dunson (2009) and much easier to implement than Tokdar et al. (2010). Further, by setting a prior on k and S, we can flexibly model the true conditional, whatever it is.

Density Model
$$X \sim f(x;\Theta) = \int_{\mathbb{R}^k} N_m(x;\,\phi(\mu),\,\Sigma)\,P(d\mu),$$
$\phi(\mu) = U\mu + \theta$, with $U$ in the Stiefel manifold $V_{k,m} = \{U \in \mathbb{R}^{m\times k} : U'U = I_k\}$, $\theta \in \mathbb{R}^m$, $U'\theta = 0$.
$\Sigma = U\Sigma_1 U' + \sigma^2(I - UU')$, $\Sigma_1 \in M^+(k)$, $\sigma > 0$.
Parameters: $\Theta = (k, U, \theta, \Sigma_1, \sigma, P)$.
Express $\theta = \mu_0 U_{k+1}$, with $\mu_0 = \|\theta\|$, so that $(U, U_{k+1}) \in V_{k+1,m}$.
Fit a full-support prior such as the Dirichlet process (DP) on $P$, and a full-support parametric distribution on the Stiefel manifold for $(U, U_{k+1})$.
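
As a concrete illustration, the following minimal sketch (not the authors' code; the dimensions, DP truncation level, and hyperparameter values are made-up choices) builds $\Sigma$ from $(U, \Sigma_1, \sigma^2)$ and evaluates $f(x;\Theta)$ under a truncated stick-breaking approximation of the DP on $P$.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
m, k, L = 10, 2, 20          # ambient dim, subspace dim, DP truncation (illustrative)

# Random point on the Stiefel manifold V_{k+1,m}: first k columns form U,
# last column is U_{k+1}, so that theta = mu0 * U_{k+1} satisfies U' theta = 0.
Q, _ = np.linalg.qr(rng.standard_normal((m, k + 1)))
U, U_k1 = Q[:, :k], Q[:, k]
mu0 = 1.5
theta = mu0 * U_k1

Sigma1 = np.diag([1.0, 0.5])                     # k x k within-subspace covariance
sigma2 = 0.1                                     # residual variance
Sigma = U @ Sigma1 @ U.T + sigma2 * (np.eye(m) - U @ U.T)

# Truncated stick-breaking draw of P ~ DP(alpha, N_k(0, s^2 I)) (illustrative base measure).
alpha, s2 = 1.0, 4.0
v = rng.beta(1.0, alpha, size=L)
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
w /= w.sum()                                     # renormalise the truncated weights
atoms = rng.normal(0.0, np.sqrt(s2), size=(L, k))   # cluster locations mu_j in R^k

def density(x):
    """f(x; Theta) under the truncation: sum_j w_j N_m(x; U mu_j + theta, Sigma)."""
    return sum(w_j * multivariate_normal.pdf(x, mean=U @ mu_j + theta, cov=Sigma)
               for w_j, mu_j in zip(w, atoms))

x = rng.standard_normal(m)
print(density(x))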

Density Model: Model Interpretation
There is an affine subspace $S = \{UU'y + \theta : y \in \mathbb{R}^m\}$ of dimension $k \ll m$ such that the orthogonal projection $UU'X + \theta$ of $X$ onto $S$, given isometric coordinates $U'X$, follows
$$U'X \sim \int_{\mathbb{R}^k} N_k(\cdot\,;\mu,\Sigma_1)\,P(d\mu),$$
while the residuals have mean zero and their coordinates follow
$$V'X \sim N_{m-k}\big(\cdot\,;(\mu_0,0,\ldots,0)',\,\sigma^2 I_{m-k}\big),$$
with $V'U = 0$, $V'V = I_{m-k}$, $V_1 = U_{k+1}$.

Density Model
The first $k$ principal coordinates of $X$ live on $S$ if $\sigma^2$ is smaller than the eigenvalues of $\Sigma_1$, but this can hold even more generally. A small $\sigma^2$ also means that the data are concentrated around $S$. Any density on $\mathbb{R}^m$ is in the support of $f(\cdot\,;\Theta)$ if the prior on $k$ includes $m$ in its support and a full-support prior such as the DP is used for $P$.

Regression and Classification
The response $Y$ is low dimensional, say in $\mathbb{R}^l$, or discrete. We want to explain $Y$ flexibly through $k$ important coordinates of $X$, which are linear transformations of all $m$ coordinates of $X$. When $Y$ is continuous in $\mathbb{R}^l$, the model is
$$(U'X, Y) \sim \int_{\mathbb{R}^k \times \mathbb{R}^l} N_k(\cdot\,;\mu,\Sigma_1)\,N_l(\cdot\,;\nu,\Sigma_Y)\,P(d\mu\,d\nu),$$
independent of $V'X \sim N_{m-k}(\mu_1, \sigma^2 I_{m-k})$.

Regression and Classification: Regression Model
Hence
$$(X, Y) \sim f(x, y;\Theta) = \int_{\mathbb{R}^k \times \mathbb{R}^l} N_m(x;\,\phi(\mu),\,\Sigma_X)\,N_l(y;\,\nu,\,\Sigma_Y)\,P(d\mu\,d\nu),$$
where for $\mu \in \mathbb{R}^k$, $\phi(\mu) = U\mu + \theta$ lives on $S$, an affine subspace of dimension $k$, and $\Sigma_X = U\Sigma_1 U' + \sigma^2(I_m - UU')$.
Parameters: $\Theta = (k, U, \theta, \Sigma_1, \sigma, \Sigma_Y, P)$.
Then $V\mu_1 = \theta$, which means $\theta \in \mathbb{R}^m$ satisfies $U'\theta = 0$.

Regression and Classification: Conditional
The conditional of $Y$ given $X$ then depends on $x$ only through its projection onto $L = \{UU'x : x \in \mathbb{R}^m\}$, and is
$$f(y \mid x, \Theta) = \frac{\int N_k(U'x;\,\mu,\Sigma_1)\,N_l(y;\,\nu,\Sigma_Y)\,P(d\mu\,d\nu)}{\int N_k(U'x;\,\mu,\Sigma_1)\,P(d\mu\,d\nu)}.$$
$\theta$ and $\sigma$ are like nuisance parameters, used in explaining the $X$-marginal. $\theta \neq 0$ implies that the projection onto $L$ is not centered at 0, thereby adding more flexibility.
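
A minimal sketch of how this conditional could be evaluated once $P$ is replaced by finitely many weighted atoms $(\mu_j, \nu_j)$ (as happens, e.g., under a truncated DP); all names, dimensions, and hyperparameters below are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(1)
m, k, l, J = 8, 2, 1, 5                      # illustrative dimensions and number of atoms

U, _ = np.linalg.qr(rng.standard_normal((m, k)))   # U in V_{k,m}
Sigma1 = np.eye(k)
SigmaY = 0.2 * np.eye(l)

w = rng.dirichlet(np.ones(J))                # mixture weights standing in for P
mu_atoms = rng.normal(size=(J, k))           # mu_j in R^k
nu_atoms = rng.normal(size=(J, l))           # nu_j in R^l

def conditional_density(y, x):
    """f(y | x) = sum_j w_j N_k(U'x; mu_j, Sigma1) N_l(y; nu_j, SigmaY)
                  / sum_j w_j N_k(U'x; mu_j, Sigma1)."""
    px = np.array([mvn.pdf(U.T @ x, mean=mu, cov=Sigma1) for mu in mu_atoms])
    py = np.array([mvn.pdf(y, mean=nu, cov=SigmaY) for nu in nu_atoms])
    return float(np.sum(w * px * py) / np.sum(w * px))

def conditional_mean(x):
    """E[Y | x]: mixture of the nu_j with weights proportional to w_j N_k(U'x; mu_j, Sigma1)."""
    px = np.array([mvn.pdf(U.T @ x, mean=mu, cov=Sigma1) for mu in mu_atoms])
    wx = w * px / np.sum(w * px)
    return wx @ nu_atoms

x = rng.standard_normal(m)
print(conditional_density(np.zeros(l), x), conditional_mean(x))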

Regression and Classification: Classification Model
When $Y \in \{1, \ldots, c\}$,
$$(X, Y) \sim f(x, y;\Theta) = \int_{\mathbb{R}^k \times S_{c-1}} N_m(x;\,U\mu + \theta,\,\Sigma_X)\,\nu_y\,P(d\mu\,d\nu),$$
with $S_{c-1} = \{\nu \in [0,1]^c : \sum_j \nu_j = 1\}$ and $\Sigma_X = U\Sigma_1 U' + \sigma^2(I_m - UU')$.
The conditional of $Y$ given $X$ then depends on $x$ only through its projection onto the linear subspace $L = \{UU'x : x \in \mathbb{R}^m\}$. The term $\theta$, which can be chosen without loss of generality to be perpendicular to $U$, implies that the projection onto $L$ is not centered at 0, thereby adding more flexibility.
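
Analogously, a small illustrative sketch (again assuming a finite-atom approximation of $P$; not the authors' code) of the induced class probabilities, which depend on $x$ only through the projection coordinates $U'x$.

import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(2)
m, k, c, J = 8, 2, 3, 5                      # feature dim, subspace dim, classes, atoms

U, _ = np.linalg.qr(rng.standard_normal((m, k)))
Sigma1 = np.eye(k)
w = rng.dirichlet(np.ones(J))                # mixture weights standing in for P
mu_atoms = rng.normal(size=(J, k))           # cluster locations in R^k
nu_atoms = rng.dirichlet(np.ones(c), size=J) # class-probability atoms on the simplex

def class_probs(x):
    """P(Y = y | x), y = 1,...,c, proportional to sum_j w_j N_k(U'x; mu_j, Sigma1) nu_{j,y}."""
    px = np.array([mvn.pdf(U.T @ x, mean=mu, cov=Sigma1) for mu in mu_atoms])
    unnorm = (w * px) @ nu_atoms
    return unnorm / unnorm.sum()

x = rng.standard_normal(m)
print(class_probs(x))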

Further Work
Our next target is to extend this method of inference from Euclidean spaces ($\mathbb{R}^m$) to more general manifolds such as spheres or spaces of shapes. Very high dimensional data often arise when analysing shapes of images with many coordinates. Also, in the $\mathbb{R}^m$ case, we would like to prove theoretically that our method does better than existing ones, for example through faster rates of convergence to the true model. We also plan to use projections onto more general sub-manifolds instead of just affine subspaces. One simple approach is to mix across $U$ instead of just $\mu$, which amounts to assuming that the cluster locations are drawn from a union of subspaces, i.e. from some $k$-dimensional polygon. We could also place a nonparametric prior on $\phi(\cdot)$, such as a Gaussian process.

Numerical Examples: Computation
We employ the following priors (these choices were motivated by keeping things simple initially):
$P \sim DP(\alpha = 1,\ P_0 = N(m_n, s^2 I))$
$(U, U_{k+1}) = U^* \sim FB(A, B, C) \propto \mathrm{etr}(A'U^* + C\,U^{*\prime}B\,U^*)$ (FB denotes the Fisher-Bingham distribution on the Stiefel manifold)
$\mu_0 \sim TN(m_0, s_0^2, 0, \infty)$
$p(k) = p_k$ for $k = 1, \ldots, m$
$\Sigma_1^{-1} \sim \mathrm{Wish}(\mathrm{df}, Q)$
$\sigma^2 \sim \mathrm{Gam}(a, b)$

Numerical Examples: Sampling from $[U^*]$
$[U^*] \sim FB(A^*, B^*, C^*)$ with
$$A^* = \Big[\Big(\sum_{i=1}^n x_i\,\mu_{S_i}'\Big)\Sigma_1^{-1} \quad\; \sigma^{-2}\mu_0 \sum_{i=1}^n x_i\Big], \qquad B^* = \sum_{i=1}^n x_i x_i', \qquad C^* = \frac{1}{2}\begin{bmatrix} \sigma^{-2} I_k - \Sigma_1^{-1} & 0 \\ 0 & 0 \end{bmatrix}.$$
Sampling from $[U^*]$ requires one to obtain random draws from the Stiefel manifold.
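
A small sketch (illustrative stand-ins, not the authors' code) of assembling the full-conditional Fisher-Bingham parameters $A^*$, $B^*$, $C^*$ above from the data $x_i$, the current cluster locations $\mu_{S_i}$, and $(\mu_0, \Sigma_1, \sigma^2)$.

import numpy as np

rng = np.random.default_rng(6)
n, m, k = 100, 10, 2
x = rng.standard_normal((n, m))                  # data, rows x_i (stand-ins)
mu_S = rng.standard_normal((n, k))               # mu_{S_i}: cluster location assigned to x_i
mu0, sigma2 = 1.0, 0.1
Sigma1 = np.diag([1.0, 0.5])
Sigma1_inv = np.linalg.inv(Sigma1)

A_lin = (x.T @ mu_S) @ Sigma1_inv                # (sum_i x_i mu_{S_i}') Sigma_1^{-1}, m x k
A_last = (mu0 / sigma2) * x.sum(axis=0)          # sigma^{-2} mu_0 sum_i x_i, length m
A_star = np.column_stack([A_lin, A_last])        # m x (k+1)

B_star = x.T @ x                                 # sum_i x_i x_i', m x m

C_star = np.zeros((k + 1, k + 1))                # quadratic-term coefficient
C_star[:k, :k] = 0.5 * (np.eye(k) / sigma2 - Sigma1_inv)

print(A_star.shape, B_star.shape, C_star.shape)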

Numerical Examples: Sampling from $[U]$
We use ideas from Hoff (2009) and give brief details for the case when $k$ is unknown. In the unknown-$k$ case we work under the condition that $\Sigma_1 = \sigma^2 I_k$. Under this condition
$$[U] \propto \mathrm{etr}(A'U) = \prod_{r=1}^{R} \exp\big(A_{[,r]}'\,U_{[,r]}\big).$$
A Gibbs sampler can be employed to sample the columns of $U$:
$$[U_{[,r]} \mid U_{[,-r]}] \propto \exp\big(A_{[,r]}'\,U_{[,r]}\big)\, I_{[U_{[,r]}'U_{[,-r]} = 0]}.$$

Numerical Examples: Sampling from $[U]$
Let $H_r$ be an orthonormal basis for the null space of $U_{[,-r]}$. Then there is $z \in V_{1,m-k+1}$ such that $U_{[,r]} = H_r z$ and $H_r' U_{[,r]} = z$, and
$$[U_{[,r]} \mid U_{[,-r]}] \propto \exp\big((A_{[,r]}'H_r)\,z\big)\, I_{[z'z = 1]},$$
which is a von Mises-Fisher distribution. Thus one can sample $U$ column by column, conditioned on the others, by first sampling $z$ from $\mathrm{vMF}(A_{[,r]}'H_r)$ and then setting $U_{[,r]} = H_r z$.
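
A possible implementation of this column-wise Gibbs update, assuming SciPy >= 1.11 (which provides scipy.stats.vonmises_fisher); the matrix A, the dimensions, and the function name are illustrative stand-ins for the full-conditional quantities above, not the authors' code.

import numpy as np
from scipy.linalg import null_space
from scipy.stats import vonmises_fisher

def gibbs_update_columns(U, A, rng):
    """One sweep over the columns of U (m x k) with density proportional to etr(A'U)."""
    m, k = U.shape
    for r in range(k):
        H = null_space(np.delete(U, r, axis=1).T)     # m x (m-k+1) orthonormal basis H_r
        b = H.T @ A[:, r]                             # vMF parameter A_{[,r]}' H_r in R^{m-k+1}
        kappa = np.linalg.norm(b)
        if kappa > 1e-12:
            z = vonmises_fisher(mu=b / kappa, kappa=float(kappa)).rvs(1, random_state=rng)
            z = np.asarray(z).reshape(-1)
        else:                                         # flat case: uniform on the sphere
            z = rng.standard_normal(H.shape[1])
            z /= np.linalg.norm(z)
        U[:, r] = H @ z                               # map back; U stays orthonormal
    return U

rng = np.random.default_rng(3)
m, k = 10, 2
U, _ = np.linalg.qr(rng.standard_normal((m, k)))
A = rng.standard_normal((m, k))
U = gibbs_update_columns(U, A, rng)
print(np.round(U.T @ U, 6))   # approximately the identity matrix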

Numerical Examples: Sampling from $[k]$
$$[k] \propto \mathrm{pr}(k)\,\exp\Big\{-\frac{1}{2\sigma^2}\Big(\sum_i \mu_{S_i(k)}'\mu_{S_i(k)} - 2\,\mathrm{tr}\Big(U_{(k)}\sum_i \mu_{S_i(k)}\,x_i'\Big) - 2\mu_0\,U_{k+1}'\sum_i x_i\Big)\Big\} \quad \text{for } k = 1, \ldots, m.$$
This could get prohibitively expensive if $m$ is large. Two approaches to address this are:
Truncate the distribution of $k$.
Introduce a slice-sampling-type variable $u \sim \mathrm{UN}(0,1)$ and replace $\mathrm{pr}(k)$ with $I_{[u < \mathrm{pr}(k)]}$; this means that $k$ will be drawn from the set $\{k : \mathrm{pr}(k) > u\}$, as sketched below.
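
A small sketch of the slice-style restriction of the support of $k$; here $u$ is drawn below $\mathrm{pr}(k)$ at the current value of $k$ (a standard slice-sampling convention) so the candidate set is never empty, the prior $p_k$ and the conditional term are made-up stand-ins, and the whole function is illustrative rather than the authors' code.

import numpy as np

def slice_update_k(k_current, log_cond, prior_k, rng):
    """Slice-style update: draw u ~ Uniform(0, pr(k_current)), then sample k from
    {k : pr(k) > u} with weights proportional to exp(log_cond(k))."""
    u = rng.uniform(0.0, prior_k[k_current - 1])
    candidates = np.flatnonzero(prior_k > u) + 1      # 1-indexed dimensions with pr(k) > u
    logw = np.array([log_cond(k) for k in candidates])
    w = np.exp(logw - logw.max())                     # stabilised unnormalised weights
    return int(rng.choice(candidates, p=w / w.sum()))

rng = np.random.default_rng(4)
m = 50
prior_k = 0.5 ** np.arange(1, m + 1)                  # illustrative pr(k) decaying in k
prior_k /= prior_k.sum()
toy_log_cond = lambda k: -0.5 * (k - 3) ** 2          # stand-in for the exp{...} term above
print(slice_update_k(5, toy_log_cond, prior_k, rng))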

Numerical Examples: Synthetic Data Example
We want to see how the methodology performs in density estimation. We generate 51 observations from
$$x \sim \sum_{k=1}^{3} \tfrac{1}{3}\,N_m(\mu_k,\,\sigma^2 I_m),$$
use 50 observations to fit the model, and evaluate the likelihood at the held-out observation. We set $\sigma^2 = 0.1$ and set $\mu_k$ to a vector of 0s except for a 1 in the $k$th location.
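
A minimal sketch of generating such synthetic data, assuming equal mixture weights and an illustrative ambient dimension m (the slide does not state m); not the authors' code.

import numpy as np

rng = np.random.default_rng(7)
m, n, sigma2 = 10, 51, 0.1
mu = np.eye(m)[:3]                                   # mu_1, mu_2, mu_3 = e_1, e_2, e_3

labels = rng.integers(0, 3, size=n)                  # equal-weight component labels
x = mu[labels] + np.sqrt(sigma2) * rng.standard_normal((n, m))

x_train, x_test = x[:50], x[50]                      # 50 to fit the model, 1 held out
print(x_train.shape, x_test.shape)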

Numerical Examples: Synthetic Data Example
For these data we have
$$\mu_0 = 1/\sqrt{3}, \qquad U_{k+1} = (\sqrt{3}/3,\ \sqrt{3}/3,\ \sqrt{3}/3,\ 0,\ 0,\ \ldots,\ 0)',$$
$$U' = \begin{pmatrix} -1/\sqrt{2} & 1/\sqrt{2} & 0 & 0 & \cdots & 0 \\ -1/\sqrt{6} & -1/\sqrt{6} & 2/\sqrt{6} & 0 & \cdots & 0 \end{pmatrix}$$
(one choice of orthonormal basis for the plane through $\mu_1, \mu_2, \mu_3$).
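
These quantities can be checked numerically: the sketch below (with an illustrative ambient dimension m) recovers the plane through mu_1, mu_2, mu_3, its offset theta, and hence mu_0 = 1/sqrt(3) and U_{k+1}.

import numpy as np

m, k = 10, 2
mu = np.eye(m)[:3]                         # mu_1, mu_2, mu_3 = e_1, e_2, e_3

D = mu[1:] - mu[0]                         # directions spanning the linear part of the plane
U, _ = np.linalg.qr(D.T)                   # m x k orthonormal basis (one valid choice)
theta = (np.eye(m) - U @ U.T) @ mu[0]      # point of the plane closest to the origin
mu0 = np.linalg.norm(theta)
U_k1 = theta / mu0

print(mu0)                                 # 0.577... = 1/sqrt(3)
print(U_k1[:4])                            # (0.577, 0.577, 0.577, 0), i.e. sqrt(3)/3 entries
print(np.round(U.T[:, :4], 3))             # first coordinates of the two rows of U'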

Numerical Examples: Posterior Estimate of $U$ and $S$
Compute $\bar U$ (the usual elementwise mean) from the $T$ MCMC iterates of $U$, and set $\hat U = \bar U(\bar U'\bar U)^{-1/2} \in V_{m,m}$. The estimate of $S$ is then $\hat S = \{\hat U_{(\hat k)}\,y + \hat\mu_0\,\hat U_{k+1} : y \in \mathbb{R}^{\hat k}\}$.
Posterior estimates: $\mu_0 = 0.42$, $\sigma^2 = 0.12$, $\mu_0 U_{k+1} = (0.54, 0.32, 0.46)$.
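
A minimal sketch of the projection step U_hat = U_bar (U_bar' U_bar)^{-1/2}, computed via the SVD (if U_bar = W S V' then U_bar (U_bar' U_bar)^{-1/2} = W V'); the MCMC iterates here are simulated stand-ins, not output from the actual sampler.

import numpy as np

def project_to_stiefel(U_bar):
    """Map an m x k matrix onto the Stiefel manifold via U_bar (U_bar'U_bar)^{-1/2} = W V'."""
    W, _, Vt = np.linalg.svd(U_bar, full_matrices=False)
    return W @ Vt

rng = np.random.default_rng(5)
m, k, T = 10, 2, 200
# Stand-in for MCMC iterates: small perturbations of a fixed orthonormal matrix.
U0, _ = np.linalg.qr(rng.standard_normal((m, k)))
draws = [np.linalg.qr(U0 + 0.1 * rng.standard_normal((m, k)))[0] for _ in range(T)]

U_bar = np.mean(draws, axis=0)                   # the usual elementwise mean
U_hat = project_to_stiefel(U_bar)
print(np.allclose(U_hat.T @ U_hat, np.eye(k)))   # True: U_hat is orthonormal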

Numerical Examples: Competitors?
We also fit the generated data and evaluated the likelihood at the held-out observation for the following: our model with fixed $k = 2$, a finite mixture model with fixed $k = 3$, an infinite mixture model, and a single Gaussian.
Held-out likelihood values:
Varying $k$    fixed $k = 2$    FinM $k = 3$    InfM     Normal
2.07           -3.06            -6.15           -4.48

Numerical Examples: What Is Left to Do? Lots!
Perform a full-blown simulation study generating multiple data sets.
Consider different competitors (more sophisticated methods).
We haven't really touched regression/classification (which is much more interesting than density estimation, in my opinion).
Improve the algorithms to make them more efficient (we want to make sure the methodology scales up well).