Nonparametric Bayes Density Estimation and Regression with High Dimensional Data
Abhishek Bhattacharya, Garritt Page
Department of Statistics, Duke University
Joint work with Prof. D. Dunson
September 2010
Contents
1 Background & Motivation
2 Density Model
3 Regression and Classification
4 Further Work
5 Numerical Examples
Background & Motivation
Density Estimation on a High Dimensional Space
A common approach for modelling the distribution of multivariate data is to use an infinite mixture density. This leads to great difficulty in posterior computation when the data dimension is huge. The reason is that, even for heavy-tailed distributions, most of the variability lies along a few directions. That is why, when fitting an infinite mixture density model, we end up with a finite number of clusters, say $k$, with $k$ much smaller than the data dimension $m$.
Background & Motivation
Our Approach
Instead, we model the projection of the data onto some $k$-dimensional (affine) subspace using a nonparametric model and fit a parametric distribution, such as a mean-0 Gaussian, to the remaining part. This amounts to fitting an infinite mixture model, but with cluster locations drawn from a $k$-dimensional affine subspace $S$. Then, by setting a prior on the subspace and its dimension, we can approximate any density on $\mathbb{R}^m$.
Background & Motivation
NP Regression & Classification
A common approach for regression/classification is to model the joint distribution using a nonparametric mixture density. But when the feature dimension $m$ is very high compared to that of the response, problems arise again: the model fails to capture the association between $x$ and $y$ and instead focuses on fitting the marginal of $x$. Many alternatives exist for such situations, such as directly modelling the conditional of $y$ given $x$, assuming it depends on a few selected $x$-coordinates (Chung and Dunson, 2009), or assuming it to be a stochastic process depending on the projection $P_S(x)$ of $x$ onto some smaller, say $k$-dimensional, (linear) subspace $S$ (Tokdar et al., 2010).
Background & Motivation
Our Approach
We instead propose to model the joint distribution of $y$ and $P_S(x)$ using a nonparametric mixture, while letting the remaining $x$-component have an independent parametric distribution such as a Gaussian (not mean 0). Our approach is more flexible than Chung and Dunson (2009) and much easier to implement than Tokdar et al. (2010). Further, by setting a prior on $k$ and $S$, we can flexibly model the true conditional, whatever it is.
Density Model
$$X \sim f(x;\Theta) = \int_{\mathbb{R}^k} N_m(x;\, \phi(\mu), \Sigma)\, P(d\mu)$$
$\phi(\mu) = U\mu + \theta$, with $U$ in the Stiefel manifold $V_{k,m} = \{U \in \mathbb{R}^{m \times k} : U'U = I_k\}$, $\theta \in \mathbb{R}^m$, $U'\theta = 0$.
$\Sigma = U \Sigma_1 U' + \sigma^2 (I_m - UU')$, $\Sigma_1 \in M^+(k)$, $\sigma > 0$.
Parameters $\Theta = (k, U, \theta, \Sigma_1, \sigma, P)$.
Express $\theta = \mu_0 U_{k+1}$, with $\mu_0 = \|\theta\|$, so that the parameter $(U, U_{k+1}) \in V_{k+1,m}$.
Fit a full-support prior such as a Dirichlet process (DP) on $P$, and a full-support parametric distribution on the Stiefel manifold for $(U, U_{k+1})$.
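To make the generative structure concrete, the following minimal sketch simulates from $f(\cdot\,;\Theta)$ using a truncated stick-breaking approximation to the DP. The truncation level, the $N(0, I_k)$ base measure, and all function and variable names are our illustrative choices, not prescriptions from the slides:

```python
import numpy as np

def sample_from_model(n, m, k, U, theta, Sigma1, sigma2, alpha=1.0,
                      trunc=25, rng=None):
    """Simulate n draws from f(x; Theta) with a truncated stick-breaking
    approximation to P ~ DP(alpha, N(0, I_k)); U: m x k with U'U = I_k,
    theta: offset satisfying U' theta = 0."""
    rng = np.random.default_rng() if rng is None else rng
    # Stick-breaking weights w_j = v_j * prod_{l<j}(1 - v_l), renormalised
    v = rng.beta(1.0, alpha, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    w /= w.sum()
    mus = rng.standard_normal((trunc, k))            # atoms mu_j ~ P0
    # Sigma = U Sigma1 U' + sigma^2 (I_m - U U')
    Sigma = U @ Sigma1 @ U.T + sigma2 * (np.eye(m) - U @ U.T)
    labels = rng.choice(trunc, size=n, p=w)
    means = mus[labels] @ U.T + theta                # phi(mu_j) = U mu_j + theta
    return means + rng.multivariate_normal(np.zeros(m), Sigma, size=n)
```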
Density Model
Model Interpretation
There is an affine subspace $S = \{UU'y + \theta : y \in \mathbb{R}^m\}$ of dimension $k \ll m$ such that the orthogonal projection $UU'X + \theta$ of $X$ onto $S$, which can be given the isometric coordinates $U'X$, follows
$$U'X \sim \int_{\mathbb{R}^k} N_k(\cdot\,;\mu, \Sigma_1)\, P(d\mu),$$
while the residuals have mean zero and their coordinates follow
$$V'X \sim N_{m-k}(\cdot\,;\, (\mu_0, 0, \ldots, 0)',\, \sigma^2 I_{m-k}),$$
with $V'U = 0$, $V'V = I_{m-k}$, $V_1 = U_{k+1}$.
Density Model
The first $k$ principal coordinates of $X$ live on $S$ if $\sigma^2 \le$ the eigenvalues of $\Sigma_1$, but this can hold even more generally. Small $\sigma^2$ also means that the data are concentrated around $S$. Any density on $\mathbb{R}^m$ is in the support of $f(\cdot\,;\Theta)$ provided the prior on $k$ includes $m$ in its support and a full-support prior such as the DP is used for $P$.
Regression and Classification
The response $Y$ is low dimensional, say in $\mathbb{R}^l$, or discrete. We want to explain $Y$ flexibly through $k$ important coordinates of $X$ that are linear transformations of all $m$ $X$-coordinates. When $Y$ is continuous in $\mathbb{R}^l$, the model is
$$(U'X, Y) \sim \int_{\mathbb{R}^k \times \mathbb{R}^l} N_k(\cdot\,;\mu, \Sigma_1)\, N_l(\cdot\,;\nu, \Sigma_Y)\, P(d\mu\, d\nu),$$
independent of $V'X \sim N_{m-k}(\mu_1, \sigma^2 I_{m-k})$.
Regression and Classification
Regression Model
Hence
$$(X, Y) \sim f(x, y; \Theta) = \int_{\mathbb{R}^k \times \mathbb{R}^l} N_m(x;\, \phi(\mu), \Sigma_X)\, N_l(y;\, \nu, \Sigma_Y)\, P(d\mu\, d\nu).$$
$\phi(\mu) = U\mu + \theta$, $\mu \in \mathbb{R}^k$, lives on $S$, an affine subspace of dimension $k$; $\Sigma_X = U\Sigma_1 U' + \sigma^2(I_m - UU')$.
Parameters $\Theta = (k, U, \theta, \Sigma_1, \sigma, \Sigma_Y, P)$.
Then $V\mu_1 = \theta$, which means $\theta \in \mathbb{R}^m$ satisfies $U'\theta = 0$.
Regression and Classification
Conditional
The conditional of $Y$ given $X$ then depends on the projection of $X$ onto $L = \{UU'x : x \in \mathbb{R}^m\}$ and is
$$f(y \mid x, \Theta) = \frac{\int N_k(U'x;\, \mu, \Sigma_1)\, N_l(y;\, \nu, \Sigma_Y)\, P(d\mu\, d\nu)}{\int N_k(U'x;\, \mu, \Sigma_1)\, P(d\mu\, d\nu)}.$$
$\theta$ and $\sigma$ are nuisance parameters, used in explaining the $X$-marginal. $\theta \ne 0$ implies that the projection onto $L$ is not centered at 0, thereby adding more flexibility.
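For a discrete realization $P = \sum_j w_j \delta_{(\mu_j, \nu_j)}$, as produced by the DP, the predictive mean implied by this conditional is a locally weighted average of the cluster means:
$$E(Y \mid x, \Theta) = \frac{\sum_j w_j\, N_k(U'x;\, \mu_j, \Sigma_1)\, \nu_j}{\sum_j w_j\, N_k(U'x;\, \mu_j, \Sigma_1)},$$
so prediction is driven entirely by the $k$-dimensional projection $U'x$ rather than the full $m$-dimensional $x$.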
Regression and Classification
Classification Model
When $Y \in \{1, \ldots, c\}$,
$$(X, Y) \sim f(x, y; \Theta) = \int_{\mathbb{R}^k \times S_{c-1}} N_m(x;\, U\mu + \theta, \Sigma_X)\, \nu_y\, P(d\mu\, d\nu),$$
with $S_{c-1} = \{\nu \in [0,1]^c : \sum_j \nu_j = 1\}$ and $\Sigma_X = U\Sigma_1 U' + \sigma^2(I_m - UU')$.
The conditional of $Y$ given $X$ then depends on the projection of $X$ onto the linear subspace $L = \{UU'x : x \in \mathbb{R}^m\}$. The term $\theta$, which can be chosen without loss of generality to be perpendicular to $U$, implies that the projection onto $L$ is not centered at 0, thereby adding more flexibility.
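Written out, the induced classification rule takes the same ratio form as in the regression case:
$$\Pr(Y = y \mid x, \Theta) = \frac{\int \nu_y\, N_k(U'x;\, \mu, \Sigma_1)\, P(d\mu\, d\nu)}{\int N_k(U'x;\, \mu, \Sigma_1)\, P(d\mu\, d\nu)},$$
since the $N_{m-k}(V'x;\,\cdot)$ factor in the joint cancels between numerator and denominator.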
Further Work
Our next target is to extend this method of inference from Euclidean spaces ($\mathbb{R}^m$) to more general manifolds such as spheres or spaces of shapes. Very high dimensional data often arise when analysing shapes of images with many coordinates. Also, in the $\mathbb{R}^m$ case, we would like to prove theoretically that our method does better than existing ones, for instance through faster rates of convergence to the true model. We also aim to use projections onto more general sub-manifolds instead of just affine subspaces. One simple approach is to mix across $U$ instead of just $\mu$, which amounts to assuming the cluster locations are drawn from a union of subspaces, i.e. from some $k$-dimensional polygon. We can also place a nonparametric prior on $\phi(\cdot)$, such as a Gaussian process (GP).
Numerical Examples
Computation
We employ the following priors (these choices were made to keep things simple initially):
$P \sim DP(\alpha = 1,\; P_0 = N(m_n, s^2 I))$
$(U, U_{k+1}) = U^* \sim FB(A, B, C) \propto \mathrm{etr}(A'U^* + B\, U^{*\prime} C\, U^*)$ (FB denotes the Fisher-Bingham distribution on the Stiefel manifold)
$\mu_0 \sim TN(m_0, s_0^2, 0, \infty)$
$p(k) = p_k$ for $k = 1, \ldots, m$
$\Sigma_1^{-1} \sim \mathrm{Wish}(df, Q)$
$\sigma^{-2} \sim \mathrm{Gam}(a, b)$
Numerical Examples
Sampling from $[U^*]$
$[U^*] \sim FB(A, B, C) \propto \mathrm{etr}(A'U^* + B\, U^{*\prime} C\, U^*)$ with
$$A = \Big[\Big(\sum_{i=1}^n x_i \mu_{S_i}'\Big)\Sigma_1^{-1},\;\; \sigma^{-2}\mu_0 \sum_{i=1}^n x_i\Big],$$
$$B = \frac{1}{2}\begin{bmatrix} \sigma^{-2} I_k - \Sigma_1^{-1} & 0 \\ 0 & 0 \end{bmatrix}, \qquad C = \sum_{i=1}^n x_i x_i'.$$
Sampling from $[U^*]$ requires one to obtain random draws from the Stiefel manifold.
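As a sketch of how $(A, B, C)$ might be assembled in code from the current state of the sampler (function and variable names are ours; mu_S stacks the cluster locations $\mu_{S_i}$ row-wise):

```python
import numpy as np

def fb_full_conditional_params(X, mu_S, mu0, Sigma1, sigma2):
    """Assemble (A, B, C) of the Fisher-Bingham full conditional of
    U* = [U, U_{k+1}] from data X (n x m), cluster locations mu_S (n x k),
    offset length mu0, and current Sigma1 (k x k) and sigma2 (scalar)."""
    m = X.shape[1]
    k = mu_S.shape[1]
    Sigma1_inv = np.linalg.inv(Sigma1)
    # A (m x (k+1)): first k columns from the cluster means, last from theta
    A = np.empty((m, k + 1))
    A[:, :k] = X.T @ mu_S @ Sigma1_inv          # (sum_i x_i mu_Si') Sigma1^{-1}
    A[:, k] = (mu0 / sigma2) * X.sum(axis=0)    # sigma^{-2} mu0 sum_i x_i
    # B ((k+1) x (k+1)): nonzero only in the upper-left k x k block
    B = np.zeros((k + 1, k + 1))
    B[:k, :k] = 0.5 * (np.eye(k) / sigma2 - Sigma1_inv)
    # C (m x m): sum_i x_i x_i'
    C = X.T @ X
    return A, B, C
```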
Numerical Examples
Sampling from $[U^*]$
We use ideas from Hoff (2009) and provide brief details of what we did for the case when $k$ is unknown. In the unknown-$k$ case we work under the condition that $\Sigma_1 = \sigma^2 I_k$. Under this condition $B = 0$ and
$$[U^*] \propto \mathrm{etr}(A'U^*) = \prod_{r=1}^{k+1} \exp(A_{[,r]}' U_{[,r]}^*).$$
A Gibbs sampler can be employed to sample the columns of $U^*$:
$$[U_{[,r]}^* \mid U_{[,-r]}^*] \propto \exp(A_{[,r]}' U_{[,r]}^*)\, I_{[U_{[,r]}^{*\prime} U_{[,-r]}^* = 0]}.$$
Numerical Examples
Sampling from $[U^*]$
Let $H_r$ be an orthonormal basis for the null space of $U_{[,-r]}^*$. Then there is $z \in V_{1,m-k}$ such that $U_{[,r]}^* = H_r z$ and $H_r' U_{[,r]}^* = z$, and then
$$[U_{[,r]}^* \mid U_{[,-r]}^*] \propto \exp\big((A_{[,r]}' H_r)\, z\big)\, I_{[z'z = 1]},$$
which is a von Mises-Fisher density. Thus one can sample $U^*$ by sampling columns conditioned on the others, first drawing $z \sim \mathrm{vMF}(H_r' A_{[,r]})$ and then setting $U_{[,r]}^* = H_r z$.
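A sketch of the resulting column-wise Gibbs sweep (our names and interface; scipy.stats.vonmises_fisher requires SciPy >= 1.11, and the vMF draw could equally be done with Wood's algorithm):

```python
import numpy as np
from scipy.linalg import null_space
from scipy.stats import vonmises_fisher   # SciPy >= 1.11

def gibbs_sweep_columns(U, A, rng=None):
    """One Gibbs sweep over the columns of U (m x R), targeting a density
    proportional to etr(A'U); U is updated in place."""
    rng = np.random.default_rng() if rng is None else rng
    m, R = U.shape
    for r in range(R):
        U_minus = np.delete(U, r, axis=1)        # U[,-r]
        H_r = null_space(U_minus.T)              # orthonormal basis of null(U[,-r]')
        b = H_r.T @ A[:, r]                      # vMF natural parameter H_r' A[,r]
        kappa = np.linalg.norm(b)
        if kappa > 0:
            z = np.ravel(vonmises_fisher(b / kappa, kappa).rvs(1, random_state=rng))
        else:                                    # degenerate case: uniform on sphere
            z = rng.normal(size=H_r.shape[1])
            z /= np.linalg.norm(z)
        U[:, r] = H_r @ z                        # set U[,r] = H_r z
    return U
```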
Numerical Examples
Sampling from $[k]$
$$[k] \propto \mathrm{pr}(k) \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i=1}^n\big(\mu_{S_i(k)}'\mu_{S_i(k)} - 2\,\mu_{S_i(k)}' U_{(k)}' x_i - 2\mu_0 U_{k+1}' x_i\big)\Big\} \quad \text{for } k = 1, \ldots, m.$$
This could get prohibitively expensive if $m$ is large. Two approaches to address this are:
Truncate the distribution of $k$.
Introduce a slice-sampling-type variable $u \sim \mathrm{Unif}(0, 1)$ and replace $\mathrm{pr}(k)$ with $I_{[u < \mathrm{pr}(k)]}$. This means that $k$ will be drawn from the set $\{k : \mathrm{pr}(k) > u\}$.
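A sketch of the slice-sampling variant (hypothetical interface; log_lik(k) stands for the exponential term above). With a prior $p_k$ that decays in $k$, the set $\{k : p_k > u\}$ is typically small, so only a few likelihood terms are evaluated per iteration:

```python
import numpy as np

def slice_sample_k(k_cur, log_lik, prior_p, rng=None):
    """One slice update of k from p(k, u) proportional to lik(k) I[u < pr(k)]:
    draw u | k ~ Unif(0, pr(k_cur)), then draw k from {k : pr(k) > u}
    with weights proportional to the likelihood term."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(0.0, prior_p[k_cur - 1])           # u | k ~ Unif(0, pr(k))
    candidates = [k for k in range(1, len(prior_p) + 1) if prior_p[k - 1] > u]
    logw = np.array([log_lik(k) for k in candidates])
    w = np.exp(logw - logw.max())                      # stabilised weights
    return int(rng.choice(candidates, p=w / w.sum()))
```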
Numerical Examples
Synthetic Data Example
We wish to see how the methodology performs in density estimation. We generate 51 observations from
$$x \sim \frac{1}{3}\sum_{k=1}^{3} N_m(\mu_k, \sigma^2 I_m).$$
50 observations are used to fit the model; we evaluate the likelihood at the remaining held-out observation. We set $\sigma^2 = 0.1$ and set $\mu_k$ to a vector of 0's save for a 1 in the $k$th location.
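A sketch of the data-generating step (equal mixture weights and a concrete $m$ are our illustrative choices; the slide leaves $m$ unspecified):

```python
import numpy as np

def make_synthetic(m=10, n=51, sigma2=0.1, seed=0):
    """Draw n observations from the equal-weight 3-component mixture
    (1/3) sum_k N_m(e_k, sigma^2 I_m); return 50 to fit and 1 held out."""
    rng = np.random.default_rng(seed)
    mus = np.eye(m)[:3]                              # mu_k = e_k, k = 1, 2, 3
    labels = rng.integers(0, 3, size=n)
    X = mus[labels] + np.sqrt(sigma2) * rng.standard_normal((n, m))
    return X[:-1], X[-1]
```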
Numerical Examples
Synthetic Data Example
For these data we have the following:
$$\mu_0 = 1/\sqrt{3}, \qquad U_{k+1} = (\sqrt{3}/3,\; \sqrt{3}/3,\; \sqrt{3}/3,\; 0,\; 0, \ldots, 0)',$$
$$U' = \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} & 0 & 0 & \cdots & 0 \\ 1/\sqrt{6} & 1/\sqrt{6} & -2/\sqrt{6} & 0 & \cdots & 0 \end{pmatrix}.$$
Numerical Examples
Posterior Estimate of $U$ and $S$
Compute $\bar{U}$ (the usual elementwise mean) using the $T$ MCMC iterates of $U^*$.
Set $\hat{U} = \bar{U}(\bar{U}'\bar{U})^{-1/2} \in V_{k+1,m}$. The estimate of $S$ is then
$$\hat{S} = \{\hat{U}_{(\hat{k})}\, y + \hat{\mu}_0 \hat{U}_{k+1} : y \in \mathbb{R}^{\hat{k}}\}.$$
Posterior summaries: $\hat{\mu}_0 = 0.42$, $\hat{\sigma}^2 = 0.12$, $\hat{\mu}_0 \hat{U}_{k+1} = (0.54, 0.32, 0.46)$.
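The projection of the MCMC average back onto the Stiefel manifold is a one-liner; a minimal sketch:

```python
import numpy as np
from scipy.linalg import sqrtm, inv

def project_to_stiefel(U_bar):
    """Map the elementwise MCMC average U_bar onto the Stiefel manifold
    via U_hat = U_bar (U_bar' U_bar)^{-1/2}."""
    S = inv(sqrtm(U_bar.T @ U_bar))
    return np.real(U_bar @ S)   # sqrtm may carry a tiny imaginary part
```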
Numerical Examples
Competitors?
We also fit the generated data and evaluated the likelihood at the held-out observation for the following:
Our model with fixed $k = 2$
Finite mixture model with fixed $k = 3$
Infinite mixture of Gaussians

Held-out log-likelihoods:
Varying $k$   Fixed $k = 2$   FinM $k = 3$   InfM Normal
2.07          -3.06           -6.15          -4.48
Numerical Examples
What is left to do? Lots!!
Perform a full-blown simulation study generating multiple data sets.
Consider different competitors (more sophisticated methods).
We haven't really touched regression/classification (which is much more interesting than density estimation, in my opinion).
Improve the algorithms to make them more efficient (we want to make sure the methodology scales up well).