Latent factor density regression models
Biometrika (2010), 97, 1, pp. 1–7. © 2010 Biometrika Trust. Printed in Great Britain. Advance Access publication on 31 July 2010.

BY A. BHATTACHARYA, D. PATI AND D. B. DUNSON

Department of Statistical Science, Duke University, Durham NC 27708, USA
ab179@stat.duke.edu  dp55@stat.duke.edu  dunson@stat.duke.edu

SUMMARY

In this article, we propose a new class of latent factor conditional density regression models with attractive computational and theoretical properties. The proposed approach is based on a novel extension of the recently proposed latent variable density estimation model, in which the response variables conditioned on the predictors are modeled as unknown functions of uniformly distributed latent variables and the predictors, with an additive Gaussian error. The latent variable specification allows straightforward posterior computation using a uni-dimensional grid via conjugate posterior updates. Moreover, one can center the model on a simple parametric guess, facilitating inference. Our approach relies on characterizing the space of conditional densities induced by the above model as kernel convolutions with a general class of continuous predictor-dependent mixing measures. Theoretical properties in terms of rates of convergence are studied.

Some key words: Density regression; Gaussian process; Factor model; Latent variable; Nonparametric Bayes; Rate of convergence.

1. INTRODUCTION

There is a rich literature on Bayesian methods for density estimation using mixture models of the form

    $y_i \sim f(\theta_i), \quad \theta_i \sim P, \quad P \sim \Pi,$    (1)

where $f(\cdot)$ is a parametric density and $P$ is an unknown mixing distribution assigned a prior $\Pi$. The most common choice of $\Pi$ is the Dirichlet process prior, first introduced by Ferguson (1973, 1974). Recent literature has focused on generalizing model (1) to the density regression setting, in which the entire conditional distribution of $y$ given $x$ changes flexibly with predictors.
Bayesian density regression, also termed conditional density estimation, views the entire conditional density $f(y \mid x)$ as a function-valued parameter and allows its center, spread, skewness, modality and other such features to vary with $x \in \mathbb{R}^p$. For data $\{(y_i, x_i), i = 1, \ldots, n\}$, let

    $y_i \mid x_i \sim f(\cdot \mid x_i), \quad \{f(\cdot \mid x), x \in \mathcal{X}\} \sim \Pi_{\mathcal{X}},$    (2)

where $\mathcal{X}$ is the predictor space and $\Pi_{\mathcal{X}}$ is a prior for the class of conditional densities $\{f(\cdot \mid x), x \in \mathcal{X}\}$ indexed by the predictors. Refer, for example, to Müller et al. (1996); Griffin & Steel (2006, 2008); Dunson et al. (2007); Dunson & Park (2008); Chung & Dunson (2009); Tokdar et al. (2010a) and Pati et al. (2012), among others. The primary focus of this recent development has been mixture models of the form

    $f(y \mid x) = \sum_{h=1}^{\infty} \pi_h(x) \, \phi\!\left(\frac{y - \mu_h(x)}{\sigma_h}\right),$    (3)
where $\phi$ is the standard normal density, $\{\pi_h(x), h = 1, 2, \ldots\}$ are predictor-dependent probability weights that sum to one almost surely for each $x \in \mathcal{X}$, and $(\mu_h, \sigma_h) \sim G_0$ independently, with $G_0$ a base probability measure on $\mathcal{F}_{\mathcal{X}} \otimes \mathbb{R}^+$, $\mathcal{F}_{\mathcal{X}}$ the space of all $\mathcal{X} \to \mathbb{R}$ functions. However, density regression models built on such mixtures are often a black box in realistic applications, which require centering the model on a simple parametric family without sacrificing computational efficiency. For example, it is not clear how to appropriately center (3) on a simple linear regression model; this cannot be achieved simply by letting $\mu_h(x) = x'\beta_h$ and $\beta_h \sim N(\beta_0, \Sigma_0)$. Moreover, (3) is based on infinitely many stochastic processes $\{(\mu_h, \pi_h), h = 1, \ldots, \infty\}$, which is highly computationally demanding irrespective of the dimension. Indeed, Stephen Walker pointed out in one of the ISBA bulletins: "Current density regression models are too big, too non-identifiable and I doubt whether they would survive the test of time."

Lenk (1988, 1991) proposed a logistic Gaussian process prior for density estimation which can be conveniently centered on a parametric family, facilitating inference. Tokdar & Ghosh (2007) and van der Vaart & van Zanten (2008, 2009) derived nice theoretical properties of the logistic Gaussian process prior in terms of the asymptotic behavior of the posterior. Tokdar et al. (2010b) extended the logistic Gaussian process density estimation model to the density regression setting, which allows convenient prior centering and shares similar theoretical properties. Despite such attractive properties, the logistic Gaussian process is computationally quite challenging owing to the presence of an intractable integral in the denominator. Although several existing articles propose to discretize the denominator for posterior inference, it turns out that such an approach is not theoretically justifiable.
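For concreteness, the mixture representation (3) can be evaluated numerically. The logistic stick-breaking weights and linear atoms below are illustrative assumptions for the sketch, not a construction from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(z):
    # standard normal density
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

# Truncated predictor-dependent stick-breaking weights pi_h(x):
# sticks v_h(x) = logistic(a_h + b_h x) vary smoothly with x.
H = 20
a, b = rng.normal(size=H), rng.normal(size=H)
atom = rng.normal(size=(H, 2))        # mu_h(x) = atom[h, 0] + atom[h, 1] * x
sigma = 0.5 + rng.random(H)           # kernel bandwidths sigma_h

def pi(x):
    v = 1.0 / (1.0 + np.exp(-(a + b * x)))
    v[-1] = 1.0                        # truncation: last stick takes the rest
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

def f(y, x):
    # f(y | x) = sum_h pi_h(x) * sigma_h^{-1} * phi((y - mu_h(x)) / sigma_h)
    mu = atom[:, 0] + atom[:, 1] * x
    return float(np.sum(pi(x) * phi((y - mu) / sigma) / sigma))

# Sanity check: f(. | x) integrates to one for each fixed x.
ys = np.linspace(-20.0, 20.0, 8001)
mass = float(np.sum([f(y, 0.3) for y in ys]) * (ys[1] - ys[0]))
```

Note that the truncation at $H$ sticks is a finite approximation; the model in (3) has infinitely many components.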
Moreover, discretization is also not computationally efficient, since one must choose the grid points at which the finite-sum approximation of the integral is evaluated. A recent article by Walker (2011) can potentially bypass this issue, at the cost of introducing several latent variables and performing reversible-jump MCMC to select the appropriate grid points. All these approaches tend to be computationally expensive and suffer from the curse of dimensionality as the number of predictors increases.

Although latent variable models have become increasingly popular as a dimension reduction tool in machine learning applications, it was only recently (Kundu & Dunson, 2011) realized that they are also suitable for density estimation. Kundu & Dunson (2011) developed a density estimation model in which unobserved U(0, 1) latent variables are related to the response variables via a random non-linear regression with an additive error. Consider the nonlinear latent variable model

    $y_i = \mu(\eta_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad (i = 1, \ldots, n)$    (4)
    $\mu \sim \mathrm{GP}(\mu_0, c), \quad \sigma^2 \sim \mathrm{IG}(a, b), \quad \eta_i \sim \mathrm{U}(0, 1),$    (5)

where the $\eta_i$ are subject-specific latent variables, $\mu \in C[0, 1]$ relates the latent variables to the observed variables and is assigned a Gaussian process prior, and $\epsilon_i$ is an idiosyncratic error specific to subject $i$. Kundu & Dunson (2011) and Pati et al. (2011) showed that (4) leads to a highly flexible representation which can approximate a large class of densities subject to minor regularity constraints. Moreover, this allows convenient prior centering, avoids the black-box mixture formulation and enables efficient computation through a griddy Gibbs algorithm. By characterizing the space of densities induced by the above model as kernel convolutions with a general class of continuous mixing measures, Pati et al. (2011) showed optimal posterior convergence properties of this model. They also demonstrated that the posterior converges at a much faster rate under appropriate prior centering.
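Model (4)–(5) and the uni-dimensional griddy Gibbs step can be sketched as follows. For illustration, $\mu$ is fixed at the Exp(1) quantile function $\mu(t) = -\log(1 - t)$, a hypothetical stand-in for a posterior draw of $\mu$ from the Gaussian process; the grid size is likewise an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(1)

# Model (4) with mu fixed, for illustration, at the Exp(1) quantile function
# mu(t) = -log(1 - t); in the actual sampler a GP draw of mu replaces this.
def mu(t):
    return -np.log(1.0 - t)

sigma, n = 0.1, 100_000
eta = rng.uniform(0.0, 1.0, size=n)        # eta_i ~ U(0, 1)
y = mu(eta) + rng.normal(0.0, sigma, n)    # y_i = mu(eta_i) + eps_i

# Marginally y ~ phi_sigma * Exp(1), so mean ~ 1 and variance ~ 1 + sigma^2.
mean, var = float(y.mean()), float(y.var())

# Griddy Gibbs update for one latent eta_i: its full conditional on [0, 1] is
# proportional to phi_sigma(y_i - mu(eta_i)), sampled on a uniform grid.
def update_eta(y_i, grid):
    logw = -0.5 * ((y_i - mu(grid)) / sigma) ** 2
    w = np.exp(logw - logw.max())          # stabilise before normalising
    return float(rng.choice(grid, p=w / w.sum()))

grid = np.linspace(0.0, 1.0 - 1e-6, 501)   # stay inside [0, 1) so mu is finite
eta_new = update_eta(y[0], grid)
```

The grid evaluation is one-dimensional regardless of the model's complexity, which is the computational appeal of the latent variable formulation.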
Kundu & Dunson (2011) also proposed a density regression model by modeling the response and the covariates jointly, using shared dependence on subject-specific latent factors after expressing $y$ and each $x_j$, $j = 1, \ldots, p$, as conditionally independent non-linear factor models. Such an approach suffers from the inevitable drawback of estimating $p + 1$ Gaussian process components, posing computational inefficiency as the number of predictors diverges. Moreover, the joint modeling framework devotes a significant amount of the estimation effort to explaining the dependence among the high-dimensional predictors, when the key is only to infer the conditional dependence of $y$ given the $x$'s.

The purpose of this article is to develop a novel density regression model for high-dimensional predictors which not only possesses theoretically elegant optimality properties, but is also computationally efficient and makes it easy to incorporate prior information so as to perform meaningful statistical inference. To that end, we introduce a general family of continuous mixing measures indexed by the predictors and discuss connections with existing dependent process mixing measures (MacEachern, 1999) used to construct predictor-dependent models (3) for density regression. As opposed to (3), our proposed model is based on a single stochastic process linking the response to the predictors and a latent variable. We also study theoretical support properties and show that our prior encompasses a wide range of conditional densities.

To our knowledge, only a few papers have considered theoretical properties of density regression models. Tokdar et al. (2010a), Pati et al. (2012) and Norets & Pelenis (2010) consider posterior consistency in estimating conditional distributions, focusing on logistic Gaussian process priors and mixture models (3).
Yoon (2009) tackled the problem in a different way, through a limited information approach approximating the likelihood by the quantiles of the true distribution. Tang & Ghosal (2007a,b) provide sufficient conditions for posterior consistency in estimating an autoregressive conditional density and a transition density, rather than regression with respect to another covariate. However, guaranteeing mere consistency is not enough to characterize the behavior of the posterior as the sample size increases. A more informative way to study this behavior is to derive the rate of posterior contraction of density regression models, which quantifies the concentration of the posterior in terms of the smallest shrinking neighborhood around the true conditional density that still accumulates all the posterior mass asymptotically. To that end, we show that our prior leads to a posterior which converges at the minimax optimal rate under a mixed smoothness assumption on the true conditional density.

2. METHODS

Suppose $y_i \in \mathcal{Y} \subset \mathbb{R}$ is observed independently given the covariates $x_i \in \mathcal{X}$, $i = 1, \ldots, n$, which are drawn independently from a probability distribution $Q$ on $\mathcal{X} \subset \mathbb{R}^p$ compact. Assume that $Q$ admits a density $q$ with respect to the Lebesgue measure. Assuming the $x_i$ are rescaled prior to analysis, $x_i \in [0, 1]^p$. Consider the nonlinear latent variable model

    $y_i = \mu(\eta_i, x_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad (i = 1, \ldots, n)$    (6)
    $\mu \sim \Pi_\mu, \quad \sigma \sim \Pi_\sigma, \quad \eta_i \sim \mathrm{U}(0, 1),$    (7)

where $\eta_i \in [0, 1]$ are subject-specific latent variables, $\mu \in C([0, 1]^{p+1})$ is a transfer function relating the latent variables and the covariates to the observed variables, and $\epsilon_i$ is an idiosyncratic error specific to subject $i$. The density of $y$ given $x$ conditional on the transfer function $\mu$ and
scale $\sigma$ is obtained on marginalizing out the latent variable as

    $f(y \mid x; \mu, \sigma) \stackrel{\mathrm{def}}{=} f_{\mu,\sigma}(y \mid x) = \int_0^1 \phi_\sigma(y - \mu(t, x)) \, dt.$    (8)

Define a map $g : C([0, 1]^{p+1}) \times [0, \infty) \to \mathcal{F}$ with $g(\mu, \sigma) = f_{\mu,\sigma}$. One can induce a prior $\Pi$ on $\mathcal{F}$ via the mapping $g$ by placing independent priors $\Pi_\mu$ and $\Pi_\sigma$ on $C([0, 1]^{p+1})$ and $[0, \infty)$ respectively, with $\Pi = (\Pi_\mu \otimes \Pi_\sigma) \circ g^{-1}$. Kundu & Dunson (2011) assumed a Gaussian process prior in (4) with squared exponential covariance kernel on $\mu$ and an inverse-gamma prior on $\sigma^2$. In the density regression context, $\mu$ is assigned a Gaussian process prior with mean $\mu_0 : [0, 1]^{p+1} \to \mathbb{R}$ and a covariance kernel $c : [0, 1]^{p+1} \times [0, 1]^{p+1} \to \mathbb{R}$.

It is not immediately clear whether the class of densities $f_{\mu,\sigma}$ in (8), in the range of $g$, encompasses a large subset of the space of conditional densities

    $\mathcal{F}_d = \left\{ f : f(y \mid x) = g(y, x), \; g : \mathbb{R}^{p+1} \to \mathbb{R}^+, \; \int g(y, x) \, dy = 1 \; \forall \, x \in [0, 1]^p \right\}.$

We provide an intuition that relates the above class with convolutions and is crucially used later on. Let $f_0 \in \mathcal{F}_d$ be a continuous conditional density with cumulative distribution function $F_0(y \mid x) = \int_{-\infty}^{y} f_0(z \mid x) \, dz$. Assume $f_0$ to be non-zero almost everywhere within its support, so that $F_0(\cdot \mid x) : \mathrm{supp}(f_0(\cdot \mid x)) \to [0, 1]$ is strictly monotone for each $x$ and hence has an inverse $F_0^{-1}(\cdot \mid x) : [0, 1] \to \mathrm{supp}(f_0(\cdot \mid x))$ satisfying $F_0\{F_0^{-1}(t \mid x) \mid x\} = t$ for all $t$ and for each $x \in [0, 1]^p$. If $\mathrm{supp}(f_0(\cdot \mid x)) = \mathbb{R}$, then the domain of $F_0^{-1}$ is the open interval $(0, 1)$ instead of $[0, 1]$. Letting $\mu_0(t, x) = F_0^{-1}(t \mid x)$, one obtains

    $f_{\mu_0,\sigma}(y \mid x) = \int_0^1 \phi_\sigma(y - F_0^{-1}(t \mid x)) \, dt = \int_{\mathcal{Y}} \phi_\sigma(y - z) f_0(z \mid x) \, dz,$    (9)

where the second equality follows from the change of variable theorem. Thus $f_{\mu_0,\sigma}(y \mid x) = \phi_\sigma * f_0(y \mid x)$, i.e., $f_{\mu_0,\sigma}$ is the convolution of $f_0$ with a normal density having mean 0 and standard deviation $\sigma$. It is well known that the convolution $\phi_\sigma * f_0(\cdot \mid x)$ can approximate $f_0(\cdot \mid x)$, for each $x$, arbitrarily closely as the bandwidth $\sigma \to 0$.
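The change-of-variables identity (9) can be checked numerically. In this sketch the predictor $x$ is suppressed, $f_0$ is taken to be Exp(1), whose quantile function is $F_0^{-1}(t) = -\log(1 - t)$, and the values of $\sigma$ and $y$ are arbitrary test points, all illustrative assumptions:

```python
import numpy as np

# Numerical check of identity (9) at a single x (suppressed), with f0 = Exp(1)
# and quantile function F0^{-1}(t) = -log(1 - t).
def phi_sigma(z, sigma):
    return np.exp(-0.5 * (z / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

sigma, y = 0.3, 1.2

# Left side of (9): integral over the latent variable t on (0, 1).
t = np.linspace(1e-9, 1.0 - 1e-9, 1_000_001)
lhs = float(np.mean(phi_sigma(y + np.log(1.0 - t), sigma)))

# Right side of (9): the convolution (phi_sigma * f0)(y), integrating z >= 0.
z = np.linspace(0.0, 40.0, 1_000_001)
rhs = float(np.sum(phi_sigma(y - z, sigma) * np.exp(-z)) * (z[1] - z[0]))

# The two Riemann sums agree, illustrating f_{mu0, sigma} = phi_sigma * f0.
```

Shrinking `sigma` toward zero in the same script also makes both sides approach $f_0(y) = e^{-y}$, which is the approximation property used below.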
More precisely, for $f_0(\cdot \mid x) \in L^p(\lambda)$ for any $p \geq 1$ for each $x$, $\|\phi_\sigma * f_0(\cdot \mid x) - f_0(\cdot \mid x)\|_{p,\lambda} \to 0$ as $\sigma \to 0$. Furthermore, the stronger result $\|\phi_\sigma * f_0(\cdot \mid x) - f_0(\cdot \mid x)\|_\infty = O(\sigma^2)$ holds for each $x$ if $f_0(\cdot \mid x)$ is compactly supported for each $x$. However, convergence pointwise in $x$ is a weaker notion, and one might hope for stronger notions of convergence involving a joint topology defined on $[0, 1]^p \times \mathcal{Y}$. Refer to § 3·2 for details. Thus the above model can approximate a large collection of conditional densities $\{f_0(y \mid x)\}$ by letting $\mu(t, x)$ concentrate around the conditional quantile function $F_0^{-1}(t \mid x)$ through the assigned Gaussian process prior. An advantage of this formulation is the feasibility of efficient posterior computation based on a uni-dimensional griddy Gibbs algorithm.

3. THEORY

3·1. Notation

Throughout the paper, Lebesgue measure on $\mathbb{R}$ or $\mathbb{R}^p$ is denoted by $\lambda$. The supremum and $L_1$ norms are denoted by $\|\cdot\|_\infty$ and $\|\cdot\|_1$ respectively. The indicator function of a set $B$ is denoted by $1_B$. Let $L^p(\nu, M)$ denote the space of real-valued measurable functions defined on $M$ with $\nu$-integrable $p$th absolute power. For two density functions $f, g$, the Kullback–Leibler
divergence is given by $K(f, g) = \int \log(f/g) f \, d\lambda$. A ball of radius $r$ with centre $x_0$ relative to the metric $d$ is denoted $B(x_0, r; d)$. The diameter of a bounded metric space $M$ relative to a metric $d$ is defined to be $\sup\{d(x, y) : x, y \in M\}$. The $\epsilon$-covering number $N(\epsilon, M, d)$ of a semimetric space $M$ relative to the semi-metric $d$ is the minimal number of balls of radius $\epsilon$ needed to cover $M$. The logarithm of the covering number is referred to as the entropy. $\delta_0$ stands for a distribution degenerate at 0 and $\mathrm{supp}(\nu)$ for the support of a measure $\nu$.

3·2. Notions of neighborhoods in conditional density estimation

If we define $h(x, y) = q(x) f(y \mid x)$ and $h_0(x, y) = q(x) f_0(y \mid x)$, then $h, h_0 \in \mathcal{F}$. Throughout the paper, $h_0$ is assumed to be a fixed density in $\mathcal{F}$, which we alternatively refer to as the true data generating density, and $\{f_0(\cdot \mid x), x \in \mathcal{X}\}$ is referred to as the true conditional density. The density $q(x)$ will be needed only for theoretical investigation; in practice, we do not need to know it or learn it from the data.

We define the weak, $\nu$-integrated $L_1$ and sup-$L_1$ neighborhoods of the collection of conditional densities $\{f_0(\cdot \mid x), x \in \mathcal{X}\}$ in the following. A sub-base of a weak neighborhood is defined as

    $W_{\epsilon,g}(f_0) = \left\{ f : f \in \mathcal{F}_d, \; \left| \int_{\mathcal{X} \times \mathcal{Y}} g h - \int_{\mathcal{X} \times \mathcal{Y}} g h_0 \right| < \epsilon \right\},$    (10)

for a bounded continuous function $g : \mathcal{Y} \times \mathcal{X} \to \mathbb{R}$. A weak neighborhood base is formed by finite intersections of neighborhoods of the type (10). Define a $\nu$-integrated $L_1$ neighborhood

    $S_\epsilon(f_0; \nu) = \left\{ f : f \in \mathcal{F}_d, \; \int \|f(\cdot \mid x) - f_0(\cdot \mid x)\|_1 \, \nu(x) \, dx < \epsilon \right\}$    (11)

for any measure $\nu$ with $\mathrm{supp}(\nu) \subset \mathcal{X}$. Observe that under the topology in (11), $\mathcal{F}_d$ can be identified with a closed subset of $L^1(\lambda \otimes \nu, \mathcal{Y} \times \mathrm{supp}(\nu))$, making it a complete separable metric space. For $f_1, f_2 \in \mathcal{F}_d$, let $d_{SS}(f_1, f_2) = \sup_{x \in \mathcal{X}} \|f_1(\cdot \mid x) - f_2(\cdot \mid x)\|_1$ and define the sup-$L_1$ neighborhood

    $SS_\epsilon(f_0) = \left\{ f : f \in \mathcal{F}_d, \; d_{SS}(f, f_0) < \epsilon \right\}.$
    (12)

Under the sup-$L_1$ topology, $\mathcal{F}_d$ can be viewed as a closed subset of the separable Banach space of continuous functions from $\mathcal{X}$ to $L^1(\lambda, \mathcal{Y})$ which are norm bounded, and hence a complete separable metric space. Thus measurability issues won't arise with these topologies.

3·3. Posterior convergence rate theorem in the compact case

We recall the definition of the rate of posterior convergence.

DEFINITION 1. The posterior $\Pi_{\mathcal{X}}(\cdot \mid y^n, x^n)$ is said to contract at a rate $\epsilon_n \to 0$ in the $\nu$-integrated $L_1$ topology, or strongly in the sup-$L_1$ topology, at $\{f_0(\cdot \mid x), x \in \mathcal{X}\}$ if $\Pi_{\mathcal{X}}(U_{M\epsilon_n}^c \mid y^n, x^n) \to 0$ a.s., with $U_{M\epsilon_n} = S_{M\epsilon_n}(f_0; \nu)$ and $SS_{M\epsilon_n}(f_0)$ respectively, for large enough $M > 0$.

Here a.s. consistency at $\{f_0(\cdot \mid x), x \in \mathcal{X}\}$ means that the posterior distribution concentrates around a neighborhood of $\{f_0(\cdot \mid x), x \in \mathcal{X}\}$ for almost every sequence $\{y_i, x_i\}_{i=1}^{\infty}$ generated by i.i.d. sampling from the joint density $q(x) f_0(y \mid x)$.

Studying rates of convergence in density regression models becomes more challenging, as we need to assume mixed smoothness in $y$ and $x$. Although posterior contraction rates have been studied widely in mean regression, logistic regression and density estimation models, results on convergence rates for density regression models are lacking. We study posterior convergence rates of the
density regression model (6) by assuming the true conditional density has different smoothness across $y$ and $x$. Assuming $f_0(y \mid x)$ to be compactly supported, twice continuously differentiable in $y$ and thrice continuously differentiable in $x$, we obtain a rate of $n^{-1/3} (\log n)^{t_2}$ using a Gaussian process prior for $\mu$ having a single inverse-gamma bandwidth across different dimensions. The optimal rate in such a mixed smoothness class is $n^{-6/17}$. The slightly slower rate of $n^{-1/3}$ is the drawback of using an isotropic Gaussian process to model an anisotropic function.

First let us consider the very simple case where the true conditional density $f_0(y \mid x)$ is compactly supported for each $x \in [0, 1]^p$ and $p = 1$. Define $\mu_0(t, x) = F_0^{-1}(t \mid x)$.

Assumption 1. For each $x \in [0, 1]^p$, $f_0(\cdot \mid x)$ is compactly supported. Without loss of generality we can assume that the support is $[0, 1]$. Hence $\mu_0 : [0, 1]^2 \to \mathbb{R}$.

Assumption 2. $f_0(y \mid x)$ is twice continuously differentiable in $y$ for any fixed $x \in [0, 1]^p$ and thrice continuously differentiable in $x$ for any fixed $y \in \mathcal{Y}$. Note that under this assumption $\mu_0 \in C^3([0, 1]^2)$. Also

    $\sup_{x \in [0,1]^p} \int_{\mathcal{Y}} \left\{ \frac{\partial^2 f_0(y \mid x)}{\partial y^2} \right\}^2 \Big/ f_0(y \mid x) \, dy < \infty,$    (13)
    $\sup_{x \in [0,1]^p} \int_{\mathcal{Y}} \left\{ \frac{\partial f_0(y \mid x)}{\partial y} \right\}^4 \Big/ f_0(y \mid x) \, dy < \infty.$    (14)

Assumption 3. For any $x \in [0, 1]^p$, there exist $0 < m_x < M_x$ such that $f_0(\cdot \mid x)$ is bounded, non-decreasing on $(-\infty, m_x]$, bounded away from 0 on $[m_x, M_x]$ and non-increasing on $[M_x, \infty)$. In addition we have

    $\inf_{x \in [0,1]^p} (M_x - m_x) > 0.$    (15)

Assumption 4. We assume $\mu$ follows a centered Gaussian process on $[0, 1]^2$, denoted $\mathrm{GP}(0, c)$, with a squared exponential covariance kernel $c(\cdot, \cdot; A)$ and a Gamma prior for the inverse-bandwidth $A$. Thus $c(t, s; A) = e^{-A \|t - s\|^2}$, $t, s \in [0, 1]^2$, $A \sim \mathrm{Ga}(p, q)$.

Assumption 5. We assume $\sigma \sim \mathrm{IG}(a_\sigma, b_\sigma)$.

THEOREM 1. If the true conditional density $f_0$ satisfies Assumptions 1, 2 and 3 and the prior satisfies Assumptions 4 and 5, the posterior rate of convergence in the $q$-integrated $L_1$ topology is given by
    $\epsilon_n = n^{-1/3} \log^{t_0} n,$    (16)

where $t_0$ is a known constant.

Remark 1. The optimal effective smoothness $\alpha_e$ for this problem yields the optimal rate of convergence $n^{-2\alpha_e/(2\alpha_e+1)} = n^{-6/17}$. It is important to note that the obtained rate of convergence is slower by a very small factor.

REFERENCES

CHUNG, Y. & DUNSON, D. (2009). Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association 104.
DUNSON, D. & PARK, J. (2008). Kernel stick-breaking processes. Biometrika 95.
DUNSON, D., PILLAI, N. & PARK, J. (2007). Bayesian density regression. Journal of the Royal Statistical Society, Series B 69.
FERGUSON, T. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1.
FERGUSON, T. (1974). Prior distributions on spaces of probability measures. The Annals of Statistics 2.
GRIFFIN, J. & STEEL, M. (2006). Order-based dependent Dirichlet processes. Journal of the American Statistical Association 101.
GRIFFIN, J. & STEEL, M. (2008). Bayesian nonparametric modelling with the Dirichlet process regression smoother. Statistica Sinica (to appear).
KUNDU, S. & DUNSON, D. (2011). Single factor transformation priors for density regression. DSS Discussion Series.
LENK, P. (1988). The logistic normal distribution for Bayesian, nonparametric, predictive densities. Journal of the American Statistical Association 83.
LENK, P. (1991). Towards a practicable Bayesian nonparametric density estimator. Biometrika 78, 531.
MACEACHERN, S. (1999). Dependent nonparametric processes. In Proceedings of the Section on Bayesian Statistical Science.
MÜLLER, P., ERKANLI, A. & WEST, M. (1996). Bayesian curve fitting using multivariate normal mixtures. Biometrika 83.
NORETS, A. & PELENIS, J. (2010). Posterior consistency in conditional distribution estimation by covariate dependent mixtures. Unpublished manuscript, Princeton Univ.
PATI, D., BHATTACHARYA, A. & DUNSON, D. (2011). Posterior convergence rates in non-linear latent variable models. arXiv preprint (submitted to Bernoulli).
PATI, D., DUNSON, D. & TOKDAR, S. (2012). Posterior consistency in conditional distribution estimation. Journal of Multivariate Analysis (submitted).
TANG, Y. & GHOSAL, S. (2007a). A consistent nonparametric Bayesian procedure for estimating autoregressive conditional densities. Computational Statistics & Data Analysis 51.
TANG, Y. & GHOSAL, S. (2007b). Posterior consistency of Dirichlet mixtures for estimating a transition density. Journal of Statistical Planning and Inference 137.
TOKDAR, S. & GHOSH, J. (2007).
Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference 137.
TOKDAR, S., ZHU, Y. & GHOSH, J. (2010a). Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Analysis 5.
TOKDAR, S., ZHU, Y. & GHOSH, J. (2010b). Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Analysis 5.
VAN DER VAART, A. & VAN ZANTEN, J. (2008). Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics 36.
VAN DER VAART, A. & VAN ZANTEN, J. (2009). Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics 37.
WALKER, S. (2011). Posterior sampling when the normalizing constant is unknown. Communications in Statistics 40.
YOON, J. (2009). Bayesian analysis of conditional density functions: a limited information approach. Unpublished manuscript, Claremont McKenna College.

[Received January. Revised June 2010]
More informationSTA 4273H: Sta-s-cal Machine Learning
STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our
More informationBayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units
Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional
More informationMaximum Smoothed Likelihood for Multivariate Nonparametric Mixtures
Maximum Smoothed Likelihood for Multivariate Nonparametric Mixtures David Hunter Pennsylvania State University, USA Joint work with: Tom Hettmansperger, Hoben Thomas, Didier Chauveau, Pierre Vandekerkhove,
More informationBayesian non-parametric model to longitudinally predict churn
Bayesian non-parametric model to longitudinally predict churn Bruno Scarpa Università di Padova Conference of European Statistics Stakeholders Methodologists, Producers and Users of European Statistics
More informationMarkov Chain Monte Carlo (MCMC)
Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can
More informationAsymptotics for posterior hazards
Asymptotics for posterior hazards Igor Prünster University of Turin, Collegio Carlo Alberto and ICER Joint work with P. Di Biasi and G. Peccati Workshop on Limit Theorems and Applications Paris, 16th January
More informationNonparametric Bayes Uncertainty Quantification
Nonparametric Bayes Uncertainty Quantification David Dunson Department of Statistical Science, Duke University Funded from NIH R01-ES017240, R01-ES017436 & ONR Review of Bayes Intro to Nonparametric Bayes
More informationConsistency of the maximum likelihood estimator for general hidden Markov models
Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models
More informationA Nonparametric Model for Stationary Time Series
A Nonparametric Model for Stationary Time Series Isadora Antoniano-Villalobos Bocconi University, Milan, Italy. isadora.antoniano@unibocconi.it Stephen G. Walker University of Texas at Austin, USA. s.g.walker@math.utexas.edu
More informationEmpirical Processes: General Weak Convergence Theory
Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationLecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016
Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,
More informationis a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.
Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable
More informationANISOTROPIC FUNCTION ESTIMATION USING MULTI-BANDWIDTH GAUSSIAN PROCESSES
Submitted to the Annals of Statistics arxiv: 1111.1044v1 ANISOTROPIC FUNCTION ESTIMATION USING MULTI-BANDWIDTH GAUSSIAN PROCESSES By Anirban Bhattacharya,, Debdeep Pati, and David Dunson, Duke University,
More informationWorst-Case Bounds for Gaussian Process Models
Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis
More informationICES REPORT Model Misspecification and Plausibility
ICES REPORT 14-21 August 2014 Model Misspecification and Plausibility by Kathryn Farrell and J. Tinsley Odena The Institute for Computational Engineering and Sciences The University of Texas at Austin
More informationA Representation Approach for Relative Entropy Minimization with Expectation Constraints
A Representation Approach for Relative Entropy Minimization with Expectation Constraints Oluwasanmi Koyejo sanmi.k@utexas.edu Department of Electrical and Computer Engineering, University of Texas, Austin,
More informationNonparametric Bayes Density Estimation and Regression with High Dimensional Data
Nonparametric Bayes Density Estimation and Regression with High Dimensional Data Abhishek Bhattacharya, Garritt Page Department of Statistics, Duke University Joint work with Prof. D.Dunson September 2010
More informationFlexible Regression Modeling using Bayesian Nonparametric Mixtures
Flexible Regression Modeling using Bayesian Nonparametric Mixtures Athanasios Kottas Department of Applied Mathematics and Statistics University of California, Santa Cruz Department of Statistics Brigham
More informationGeometric ergodicity of the Bayesian lasso
Geometric ergodicity of the Bayesian lasso Kshiti Khare and James P. Hobert Department of Statistics University of Florida June 3 Abstract Consider the standard linear model y = X +, where the components
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationNonparametric Bayesian modeling for dynamic ordinal regression relationships
Nonparametric Bayesian modeling for dynamic ordinal regression relationships Athanasios Kottas Department of Applied Mathematics and Statistics, University of California, Santa Cruz Joint work with Maria
More informationNon-Parametric Bayes
Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian
More informationStatistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation
Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider
More informationA Process over all Stationary Covariance Kernels
A Process over all Stationary Covariance Kernels Andrew Gordon Wilson June 9, 0 Abstract I define a process over all stationary covariance kernels. I show how one might be able to perform inference that
More informationICML Scalable Bayesian Inference on Point processes. with Gaussian Processes. Yves-Laurent Kom Samo & Stephen Roberts
ICML 2015 Scalable Nonparametric Bayesian Inference on Point Processes with Gaussian Processes Machine Learning Research Group and Oxford-Man Institute University of Oxford July 8, 2015 Point Processes
More informationQuantile Processes for Semi and Nonparametric Regression
Quantile Processes for Semi and Nonparametric Regression Shih-Kang Chao Department of Statistics Purdue University IMS-APRM 2016 A joint work with Stanislav Volgushev and Guang Cheng Quantile Response
More informationIntroduction. Chapter 1
Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics
More informationNoninformative Priors for the Ratio of the Scale Parameters in the Inverted Exponential Distributions
Communications for Statistical Applications and Methods 03, Vol. 0, No. 5, 387 394 DOI: http://dx.doi.org/0.535/csam.03.0.5.387 Noninformative Priors for the Ratio of the Scale Parameters in the Inverted
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationSequential Monte Carlo Methods for Bayesian Computation
Sequential Monte Carlo Methods for Bayesian Computation A. Doucet Kyoto Sept. 2012 A. Doucet (MLSS Sept. 2012) Sept. 2012 1 / 136 Motivating Example 1: Generic Bayesian Model Let X be a vector parameter
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationStatistical inference on Lévy processes
Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline
More informationBayesian Modeling of Conditional Distributions
Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings
More informationPriors for the frequentist, consistency beyond Schwartz
Victoria University, Wellington, New Zealand, 11 January 2016 Priors for the frequentist, consistency beyond Schwartz Bas Kleijn, KdV Institute for Mathematics Part I Introduction Bayesian and Frequentist
More informationDirichlet Process Mixtures of Generalized Linear Models
Dirichlet Process Mixtures of Generalized Linear Models Lauren Hannah Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA David Blei Department of
More informationEconomics 204 Fall 2011 Problem Set 2 Suggested Solutions
Economics 24 Fall 211 Problem Set 2 Suggested Solutions 1. Determine whether the following sets are open, closed, both or neither under the topology induced by the usual metric. (Hint: think about limit
More informationFrontier estimation based on extreme risk measures
Frontier estimation based on extreme risk measures by Jonathan EL METHNI in collaboration with Ste phane GIRARD & Laurent GARDES CMStatistics 2016 University of Seville December 2016 1 Risk measures 2
More informationDirichlet Process Mixtures of Generalized Linear Models
Dirichlet Process Mixtures of Generalized Linear Models Lauren A. Hannah Department of Operations Research and Financial Engineering Princeton University Princeton, NJ 08544, USA David M. Blei Department
More informationON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS
Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)
More informationBayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence
Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns
More informationLatent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent
Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary
More informationBayesian Sparse Linear Regression with Unknown Symmetric Error
Bayesian Sparse Linear Regression with Unknown Symmetric Error Minwoo Chae 1 Joint work with Lizhen Lin 2 David B. Dunson 3 1 Department of Mathematics, The University of Texas at Austin 2 Department of
More informationLecture 2: Basic Concepts of Statistical Decision Theory
EE378A Statistical Signal Processing Lecture 2-03/31/2016 Lecture 2: Basic Concepts of Statistical Decision Theory Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: John Miller and Aran Nayebi In this lecture
More informationIntroduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence
Introduction to Empirical Processes and Semiparametric Inference Lecture 08: Stochastic Convergence Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationApproximate Bayesian Computation
Approximate Bayesian Computation Michael Gutmann https://sites.google.com/site/michaelgutmann University of Helsinki and Aalto University 1st December 2015 Content Two parts: 1. The basics of approximate
More informationStat 542: Item Response Theory Modeling Using The Extended Rank Likelihood
Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal
More informationBayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam. aad. Bayesian Adaptation p. 1/4
Bayesian Adaptation Aad van der Vaart http://www.math.vu.nl/ aad Vrije Universiteit Amsterdam Bayesian Adaptation p. 1/4 Joint work with Jyri Lember Bayesian Adaptation p. 2/4 Adaptation Given a collection
More informationA Bayesian Nonparametric Hierarchical Framework for Uncertainty Quantification in Simulation
Submitted to Operations Research manuscript Please, provide the manuscript number! Authors are encouraged to submit new papers to INFORMS journals by means of a style file template, which includes the
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More informationTwo Useful Bounds for Variational Inference
Two Useful Bounds for Variational Inference John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We review and derive two lower bounds on the
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationBayes methods for categorical data. April 25, 2017
Bayes methods for categorical data April 25, 2017 Motivation for joint probability models Increasing interest in high-dimensional data in broad applications Focus may be on prediction, variable selection,
More informationLecture 35: December The fundamental statistical distances
36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose
More informationStatistica Sinica Preprint No: SS R2
Statistica Sinica Preprint No: SS-2017-0074.R2 Title The semi-parametric Bernstein-von Mises theorem for regression models with symmetric errors Manuscript ID SS-2017-0074.R2 URL http://www.stat.sinica.edu.tw/statistica/
More informationNormalized kernel-weighted random measures
Normalized kernel-weighted random measures Jim Griffin University of Kent 1 August 27 Outline 1 Introduction 2 Ornstein-Uhlenbeck DP 3 Generalisations Bayesian Density Regression We observe data (x 1,
More informationPATTERN RECOGNITION AND MACHINE LEARNING
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality
More informationInference with few assumptions: Wasserman s example
Inference with few assumptions: Wasserman s example Christopher A. Sims Princeton University sims@princeton.edu October 27, 2007 Types of assumption-free inference A simple procedure or set of statistics
More information