Latent factor density regression models


Biometrika (2010), 97, 1, pp. 1-7
© 2010 Biometrika Trust. Advance Access publication on 31 July 2010.

Latent factor density regression models

BY A. BHATTACHARYA, D. PATI, D. B. DUNSON
Department of Statistical Science, Duke University, Durham NC 27708, USA
ab179@stat.duke.edu dp55@stat.duke.edu dunson@stat.duke.edu

SUMMARY

In this article, we propose a new class of latent factor conditional density regression models with attractive computational and theoretical properties. The proposed approach is based on a novel extension of the recently proposed latent variable density estimation model, in which the response variables, conditioned on the predictors, are modeled as unknown functions of uniformly distributed latent variables and the predictors with an additive Gaussian error. The latent variable specification allows straightforward posterior computation using a uni-dimensional grid via conjugate posterior updates. Moreover, one can center the model on a simple parametric guess, facilitating inference. Our approach relies on characterizing the space of conditional densities induced by the above model as kernel convolutions with a general class of continuous predictor-dependent mixing measures. Theoretical properties in terms of rates of convergence are studied.

Some key words: Density regression; Gaussian process; Factor model; Latent variable; Nonparametric Bayes; Rate of convergence.

1. INTRODUCTION

There is a rich literature on Bayesian methods for density estimation using mixture models of the form

    $y_i \sim f(\theta_i), \quad \theta_i \sim P, \quad P \sim \Pi,$    (1)

where $f(\cdot)$ is a parametric density and $P$ is an unknown mixing distribution assigned a prior $\Pi$. The most common choice of $\Pi$ is the Dirichlet process prior, first introduced by Ferguson (1973, 1974). Recent literature has focused on generalizing model (1) to the density regression setting, in which the entire conditional distribution of $y$ given $x$ changes flexibly with predictors. Bayesian density regression, also termed conditional density estimation, views the entire conditional density $f(y \mid x)$ as a function-valued parameter and allows its center, spread, skewness, modality and other such features to vary with $x \in \mathbb{R}^p$. For data $\{(y_i, x_i), i = 1, \ldots, n\}$ let

    $y_i \mid x_i \sim f(\cdot \mid x_i), \quad \{f(\cdot \mid x),\ x \in \mathcal{X}\} \sim \Pi_{\mathcal{X}},$    (2)

where $\mathcal{X}$ is the predictor space and $\Pi_{\mathcal{X}}$ is a prior for the class of conditional densities $\{f(\cdot \mid x), x \in \mathcal{X}\}$ indexed by the predictors. Refer, for example, to Müller et al. (1996); Griffin & Steel (2006, 2008); Dunson et al. (2007); Dunson & Park (2008); Chung & Dunson (2009); Tokdar et al. (2010a) and Pati et al. (2012), among others. The primary focus of this recent development has been mixture models of the form

    $f(y \mid x) = \sum_{h=1}^{\infty} \pi_h(x)\, \phi\!\left\{ \frac{y - \mu_h(x)}{\sigma_h} \right\},$    (3)

where $\phi$ is the standard normal density, $\{\pi_h(x), h = 1, 2, \ldots\}$ are predictor-dependent probability weights that sum to one almost surely for each $x \in \mathcal{X}$, and $(\mu_h, \sigma_h) \sim G_0$ independently, with $G_0$ a base probability measure on $\mathcal{F}_{\mathcal{X}} \times \mathbb{R}^+$, where $\mathcal{F}_{\mathcal{X}}$ denotes the space of all functions $\mathcal{X} \to \mathbb{R}$.
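To make (3) concrete, the following minimal sketch evaluates a truncated ($H$-component) version of the mixture in Python. The softmax form of the weights $\pi_h(x)$ and the linear component means $\mu_h(x)$ are placeholder choices of ours, introduced only so the snippet is self-contained; they are not constructions studied in this paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
H = 20                               # truncation level (illustrative)
beta = rng.normal(size=(H, 2))       # coefficients driving the weights pi_h(x) (placeholder)
mu_coef = rng.normal(size=(H, 2))    # linear component means mu_h(x) = a_h + b_h * x (placeholder)
sigma = rng.gamma(2.0, 0.5, size=H)  # component scales sigma_h

def weights(x):
    # predictor-dependent probability weights summing to one (softmax; placeholder scheme)
    z = beta[:, 0] + beta[:, 1] * x
    w = np.exp(z - z.max())
    return w / w.sum()

def f_cond(y, x):
    # truncated analogue of (3): sum_h pi_h(x) N(y; mu_h(x), sigma_h^2)
    mu = mu_coef[:, 0] + mu_coef[:, 1] * x
    return float(np.sum(weights(x) * norm.pdf(y, loc=mu, scale=sigma)))

print(f_cond(0.3, x=0.5))            # conditional density evaluated at one (y, x) pair
```

Even this toy version makes the computational burden visible: every predictor value requires fresh weights and means for all $H$ components, which is the issue discussed next.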

However, mixture-based density regression models are often a black box in realistic applications, which require centering the model on a simple parametric family without sacrificing computational efficiency. For example, it is not clear how to appropriately center (3) on a simple linear regression model. Observe that this cannot be achieved simply by letting $\mu_h(x) = x^{\mathrm{T}} \beta_h$ with $\beta_h \sim \mathrm{N}(\beta_0, \Sigma_0)$. Moreover, (3) is based on infinitely many stochastic processes $\{(\mu_h, \pi_h), h = 1, 2, \ldots\}$, which is computationally demanding irrespective of the dimension. Indeed, Stephen Walker pointed out in one of the ISBA bulletins that "current density regression models are too big, too non-identifiable and I doubt whether they would survive the test of time."

Lenk (1988, 1991) proposed a logistic Gaussian process prior for density estimation which can be conveniently centered on a parametric family, facilitating inference. Tokdar & Ghosh (2007) and van der Vaart & van Zanten (2008, 2009) derived attractive theoretical properties of the logistic Gaussian process prior in terms of the asymptotic behavior of the posterior. Tokdar et al. (2010b) extended the logistic Gaussian process density estimation model to the density regression setting, which allows convenient prior centering and shares similar theoretical properties. Despite such attractive properties, the logistic Gaussian process is computationally quite challenging owing to the presence of an intractable integral in the denominator. Although much of the existing literature proposes to discretize the denominator for posterior inference, such an approach is not theoretically justified. Moreover, discretization is not computationally efficient either, since it requires choosing the grid points at which the finite-sum approximation of the integral is evaluated. A recent article by Walker (2011) can potentially bypass this issue, at the cost of introducing several latent variables and performing reversible jump MCMC to select appropriate grid points. All these approaches tend to be computationally expensive and suffer from the curse of dimensionality as the number of predictors increases.

Although latent variable models have become increasingly popular as a dimension reduction tool in machine learning applications, it was only recently realized (Kundu & Dunson, 2011) that they are also suitable for density estimation. Kundu & Dunson (2011) developed a density estimation model where unobserved U(0, 1) latent variables are related to the response variables via a random non-linear regression with an additive error. Consider the nonlinear latent variable model

    $y_i = \mu(\eta_i) + \epsilon_i, \quad \epsilon_i \sim \mathrm{N}(0, \sigma^2), \quad (i = 1, \ldots, n)$    (4)
    $\mu \sim \mathrm{GP}(\mu_0, c), \quad \sigma^2 \sim \mathrm{IG}(a, b), \quad \eta_i \sim \mathrm{U}(0, 1),$    (5)

where the $\eta_i$'s are subject-specific latent variables, $\mu \in C[0, 1]$ relates the latent variables to the observed variables and is assigned a Gaussian process prior, and $\epsilon_i$ is an idiosyncratic error specific to subject $i$. Kundu & Dunson (2011) and Pati et al. (2011) showed that (4) leads to a highly flexible representation which can approximate a large class of densities subject to minor regularity constraints.
Moreover, this allows convenient prior centering, avoids the black-box mixture formulation and enables efficient computation through a griddy Gibbs algorithm. By characterizing the space of densities induced by the above model as kernel convolutions with a general class of continuous mixing measures, Pati et al. (2011) showed optimal posterior convergence properties of this model. They also demonstrated that the posterior converges at a much faster rate under appropriate prior centering.
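To illustrate the griddy Gibbs step mentioned above, the following sketch updates the latent $\eta_i$'s in model (4) by discretizing their full conditional on a fixed grid: since the prior on $\eta_i$ is uniform, the grid weights are simply proportional to the Gaussian likelihood $\phi_\sigma\{y_i - \mu(t_j)\}$. The grid size, the curve standing in for a Gaussian process draw of $\mu$, and $\sigma = 0.1$ are our own illustrative assumptions, not settings from Kundu & Dunson (2011).

```python
import numpy as np

rng = np.random.default_rng(1)

def update_eta(y, mu_grid, t_grid, sigma):
    # Griddy Gibbs update: for each i, sample eta_i from the grid {t_j} with
    # probabilities proportional to phi_sigma(y_i - mu(t_j)) (uniform prior on eta_i).
    ll = -0.5 * ((y[:, None] - mu_grid[None, :]) / sigma) ** 2   # n x G log-likelihood
    w = np.exp(ll - ll.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    idx = np.array([rng.choice(len(t_grid), p=wi) for wi in w])
    return t_grid[idx]

# toy usage with a placeholder curve in place of a Gaussian process draw of mu
t_grid = np.linspace(0, 1, 101)
mu_grid = np.sin(2 * np.pi * t_grid)
y = rng.normal(np.sin(2 * np.pi * rng.uniform(size=200)), 0.1)
eta = update_eta(y, mu_grid, t_grid, sigma=0.1)
```

Given the sampled $\eta_i$'s, the updates of $\mu$ on the grid and of $\sigma^2$ are conditionally conjugate, which is what makes the overall algorithm efficient.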

Kundu & Dunson (2011) also proposed a density regression model by modeling the response and the covariates jointly, using shared dependence on subject-specific latent factors after expressing $y$ and each $x_j$, $j = 1, \ldots, p$, as conditionally independent non-linear factor models. Such an approach suffers from the drawback of estimating $p + 1$ Gaussian process components, which becomes computationally inefficient as the number of predictors grows. Moreover, the joint modeling framework expends a significant amount of the estimation effort on explaining the dependence among the high-dimensional predictors, when the key is only to infer the conditional distribution of $y$ given the $x$'s.

The purpose of this article is to develop a novel density regression model for high-dimensional predictors which not only possesses theoretically elegant optimality properties, but is also computationally efficient and makes it easy to incorporate prior information for meaningful statistical inference. To that end, we introduce a general family of continuous mixing measures indexed by the predictors and discuss connections with existing dependent process mixing measures (MacEachern, 1999) used to construct predictor-dependent models (3) for density regression. As opposed to (3), our proposed model is based on a single stochastic process linking the response to the predictors and a latent variable. We also study theoretical support properties and show that our prior encompasses a wide range of conditional densities.

To our knowledge, only a few papers have considered theoretical properties of density regression models. Tokdar et al. (2010a), Pati et al. (2012) and Norets & Pelenis (2010) consider posterior consistency in estimating conditional distributions, focusing on logistic Gaussian process priors and mixture models (3). Yoon (2009) tackled the problem in a different way through a limited information approach, approximating the likelihood by the quantiles of the true distribution. Tang & Ghosal (2007a,b) provide sufficient conditions for posterior consistency in estimating an autoregressive conditional density and a transition density, rather than regression with respect to a separate covariate. However, guaranteeing mere consistency is not enough to characterize the behavior of the posterior as the sample size increases. A more informative approach is to derive the rate of posterior contraction, which quantifies the concentration of the posterior in terms of the smallest shrinking neighborhood around the true conditional density that still accumulates all the posterior mass asymptotically. To that end, we show that our prior leads to a posterior which converges at nearly the minimax optimal rate under a mixed smoothness assumption on the true conditional density.

2. METHODS

Suppose $y_i \in \mathcal{Y} \subset \mathbb{R}$ is observed independently given the covariates $x_i \in \mathcal{X}$, $i = 1, \ldots, n$, which are drawn independently from a probability distribution $Q$ on a compact $\mathcal{X} \subset \mathbb{R}^p$. Assume that $Q$ admits a density $q$ with respect to the Lebesgue measure. Assuming the $x_i$'s are rescaled prior to analysis, $x_i \in [0, 1]^p$.
Consider the nonlinear latent variable model

    $y_i = \mu(\eta_i, x_i) + \epsilon_i, \quad \epsilon_i \sim \mathrm{N}(0, \sigma^2), \quad (i = 1, \ldots, n)$    (6)
    $\mu \sim \Pi_{\mu}, \quad \sigma \sim \Pi_{\sigma}, \quad \eta_i \sim \mathrm{U}(0, 1),$    (7)

where the $\eta_i \in [0, 1]$ are subject-specific latent variables, $\mu \in C[0, 1]^{p+1}$ is a transfer function relating the latent variables and the covariates to the observed variables, and $\epsilon_i$ is an idiosyncratic error specific to subject $i$.
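The following sketch simulates data from (6)-(7) with $p = 1$. The transfer function $\mu$ is drawn approximately from a centered Gaussian process with squared exponential kernel $e^{-A\|t - s\|^2}$ evaluated on a coarse grid; the fixed inverse-bandwidth $A = 5$, the grid resolution and the nearest-neighbour lookup are conveniences of ours for illustration, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(2)

# draw an approximate GP path for mu on a coarse (t, x) grid over [0, 1]^2
G, A = 25, 5.0
t_grid = np.linspace(0, 1, G)
x_grid = np.linspace(0, 1, G)
TT, XX = np.meshgrid(t_grid, x_grid, indexing="ij")
pts = np.column_stack([TT.ravel(), XX.ravel()])
K = np.exp(-A * ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))   # squared exponential kernel
mu_grid = (np.linalg.cholesky(K + 1e-8 * np.eye(len(pts))) @ rng.normal(size=len(pts))).reshape(G, G)

def mu(t, x):
    # crude nearest-neighbour evaluation of the gridded GP draw
    i = np.clip(np.rint(t * (G - 1)).astype(int), 0, G - 1)
    j = np.clip(np.rint(x * (G - 1)).astype(int), 0, G - 1)
    return mu_grid[i, j]

# forward simulation from (6)-(7): x_i ~ U(0,1), eta_i ~ U(0,1), y_i = mu(eta_i, x_i) + eps_i
n, sigma = 500, 0.1
x = rng.uniform(size=n)
eta = rng.uniform(size=n)
y = mu(eta, x) + rng.normal(scale=sigma, size=n)
```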

The density of $y$ given $x$, conditional on the transfer function $\mu$ and the scale $\sigma$, is obtained by marginalizing out the latent variable as

    $f(y \mid x; \mu, \sigma) \stackrel{\mathrm{def}}{=} f_{\mu,\sigma}(y \mid x) = \int_0^1 \phi_{\sigma}\{y - \mu(t, x)\}\, dt.$    (8)

Define a map $g : C[0, 1]^{p+1} \times [0, \infty) \to \mathcal{F}$ with $g(\mu, \sigma) = f_{\mu,\sigma}$. One can induce a prior $\Pi$ on $\mathcal{F}$ via the mapping $g$ by placing independent priors $\Pi_{\mu}$ and $\Pi_{\sigma}$ on $C[0, 1]^{p+1}$ and $[0, \infty)$ respectively, with $\Pi = (\Pi_{\mu} \otimes \Pi_{\sigma}) \circ g^{-1}$. Kundu & Dunson (2011) assumed a Gaussian process prior in (4) with a squared exponential covariance kernel on $\mu$ and an inverse-gamma prior on $\sigma^2$. In the density regression context, $\mu$ is assigned a Gaussian process prior with mean $\mu_0 : [0, 1]^{p+1} \to \mathbb{R}$ and a covariance kernel $c : [0, 1]^{p+1} \times [0, 1]^{p+1} \to \mathbb{R}$.

It is not immediately clear whether the class of densities $f_{\mu,\sigma}$ in (8), i.e., the range of $g$, encompasses a large subset of the space of conditional densities

    $\mathcal{F}_d = \left\{ f : f(y \mid x) = g(y, x),\ g : \mathbb{R}^{p+1} \to \mathbb{R}^+,\ \int g(y, x)\, dy = 1 \ \text{for all } x \in [0, 1]^p \right\}.$

We provide an intuition that relates the above class with convolutions and is used crucially later on. Let $f_0 \in \mathcal{F}_d$ be a continuous conditional density with cumulative distribution function $F_0(y \mid x) = \int_{-\infty}^{y} f_0(z \mid x)\, dz$. Assume $f_0$ to be non-zero almost everywhere within its support, so that $F_0(\cdot \mid x) : \mathrm{supp}\{f_0(\cdot \mid x)\} \to [0, 1]$ is strictly monotone for each $x$ and hence has an inverse $F_0^{-1}(\cdot \mid x) : [0, 1] \to \mathrm{supp}\{f_0(\cdot \mid x)\}$ satisfying $F_0\{F_0^{-1}(t \mid x) \mid x\} = t$ for all $t \in [0, 1]$ and for each $x \in [0, 1]^p$. If $\mathrm{supp}\{f_0(\cdot \mid x)\} = \mathbb{R}$, then the domain of $F_0^{-1}(\cdot \mid x)$ is the open interval $(0, 1)$ instead of $[0, 1]$. Letting $\mu_0(t, x) = F_0^{-1}(t \mid x)$, one obtains

    $f_{\mu_0,\sigma}(y \mid x) = \int_0^1 \phi_{\sigma}\{y - F_0^{-1}(t \mid x)\}\, dt = \int_{\mathcal{Y}} \phi_{\sigma}(y - z)\, f_0(z \mid x)\, dz,$    (9)

where the second equality follows from the change of variables theorem. Thus $f_{\mu_0,\sigma}(y \mid x) = \phi_{\sigma} * f_0(y \mid x)$, i.e., $f_{\mu_0,\sigma}(\cdot \mid x)$ is the convolution of $f_0(\cdot \mid x)$ with a normal density having mean 0 and standard deviation $\sigma$. It is well known that the convolution $\phi_{\sigma} * f_0(\cdot \mid x)$ can approximate $f_0(\cdot \mid x)$ arbitrarily closely for each $x$ as the bandwidth $\sigma \to 0$. More precisely, for $f_0(\cdot \mid x) \in L_p(\lambda)$ for any $p \geq 1$ for each $x$, $\|\phi_{\sigma} * f_0(\cdot \mid x) - f_0(\cdot \mid x)\|_{p,\lambda} \to 0$ as $\sigma \to 0$. Furthermore, the stronger result $\|\phi_{\sigma} * f_0(\cdot \mid x) - f_0(\cdot \mid x)\| = O(\sigma^2)$ holds for each $x$ if $f_0(\cdot \mid x)$ is compactly supported for each $x$. However, convergence pointwise in $x$ is a weaker notion, and one might hope for stronger notions of convergence involving a joint topology defined on $[0, 1]^p \times \mathcal{Y}$; refer to Section 3.2 for details. Thus the above model can approximate a large collection of conditional densities $\{f_0(y \mid x)\}$ by letting $\mu(t, x)$ concentrate around the conditional quantile function $F_0^{-1}(t \mid x)$ through the assigned Gaussian process prior. A key advantage of this formulation is the feasibility of efficient posterior computation based on a uni-dimensional griddy Gibbs algorithm.
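As a quick numerical check of (9), take $f_0(\cdot \mid x)$ to be $\mathrm{N}\{\sin(2\pi x), 0.3^2\}$, an illustrative choice of ours for which both the conditional quantile function and the convolution $\phi_\sigma * f_0(\cdot \mid x)$ are available in closed form; a grid average of $\phi_\sigma\{y - F_0^{-1}(t \mid x)\}$ over $t$ then reproduces the convolution to grid accuracy.

```python
import numpy as np
from scipy.stats import norm

s0, sigma = 0.3, 0.2
m = lambda x: np.sin(2 * np.pi * x)                 # conditional mean of the toy f0
mu0 = lambda t, x: m(x) + s0 * norm.ppf(t)          # mu0(t, x) = F0^{-1}(t | x)

def lhs(y, x, n_grid=2000):
    # left-hand side of (9): grid approximation of int_0^1 phi_sigma(y - mu0(t, x)) dt
    t = (np.arange(n_grid) + 0.5) / n_grid
    return norm.pdf(y, loc=mu0(t, x), scale=sigma).mean()

def rhs(y, x):
    # right-hand side of (9): (phi_sigma * f0)(y | x); normal convolved with normal is normal
    return norm.pdf(y, loc=m(x), scale=np.sqrt(s0 ** 2 + sigma ** 2))

y, x = 0.4, 0.25
print(lhs(y, x), rhs(y, x))                         # agree up to the grid approximation error
```

Shrinking $\sigma$ in this toy example also illustrates the approximation statement above: the grid-marginalized density approaches $f_0(\cdot \mid x)$ as the bandwidth decreases.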

3. THEORY

3.1. Notations

Throughout the paper, Lebesgue measure on $\mathbb{R}$ or $\mathbb{R}^p$ is denoted by $\lambda$. The supremum and $L_1$ norms are denoted by $\|\cdot\|$ and $\|\cdot\|_1$ respectively. The indicator function of a set $B$ is denoted by $1_B$. Let $L_p(\nu, M)$ denote the space of real-valued measurable functions defined on $M$ with $\nu$-integrable $p$th absolute power. For two density functions $f, g$, the Kullback-Leibler divergence is given by $K(f, g) = \int \log(f/g)\, f\, d\lambda$. A ball of radius $r$ with centre $x_0$ relative to a metric $d$ is denoted by $B(x_0, r; d)$. The diameter of a bounded metric space $M$ relative to a metric $d$ is defined as $\sup\{d(x, y) : x, y \in M\}$. The $\epsilon$-covering number $N(\epsilon, M, d)$ of a semi-metric space $M$ relative to the semi-metric $d$ is the minimal number of balls of radius $\epsilon$ needed to cover $M$. The logarithm of the covering number is referred to as the entropy. $\delta_0$ stands for the distribution degenerate at 0 and $\mathrm{supp}(\nu)$ for the support of a measure $\nu$.

3.2. Notions of neighborhoods in conditional density estimation

If we define $h(x, y) = q(x) f(y \mid x)$ and $h_0(x, y) = q(x) f_0(y \mid x)$, then $h, h_0 \in \mathcal{F}$. Throughout the paper, $h_0$ is assumed to be a fixed density in $\mathcal{F}$, which we alternatively refer to as the true data-generating density, and $\{f_0(\cdot \mid x), x \in \mathcal{X}\}$ is referred to as the true conditional density. The density $q(x)$ is needed only for the theoretical investigation; in practice, we do not need to know it or learn it from the data.

We define the weak, $\nu$-integrated $L_1$ and sup-$L_1$ neighborhoods of the collection of conditional densities $\{f_0(\cdot \mid x), x \in \mathcal{X}\}$ as follows. A sub-base of a weak neighborhood is defined as

    $W_{\epsilon, g}(f_0) = \left\{ f \in \mathcal{F}_d : \left| \int_{\mathcal{X}} \int_{\mathcal{Y}} g h - \int_{\mathcal{X}} \int_{\mathcal{Y}} g h_0 \right| < \epsilon \right\},$    (10)

for a bounded continuous function $g : \mathcal{Y} \times \mathcal{X} \to \mathbb{R}$. A weak neighborhood base is formed by finite intersections of neighborhoods of the type (10). Define a $\nu$-integrated $L_1$ neighborhood

    $S_{\epsilon}(f_0; \nu) = \left\{ f \in \mathcal{F}_d : \int \| f(\cdot \mid x) - f_0(\cdot \mid x) \|_1\, \nu(x)\, dx < \epsilon \right\}$    (11)

for any measure $\nu$ with $\mathrm{supp}(\nu) \subset \mathcal{X}$. Observe that under the topology in (11), $\mathcal{F}_d$ can be identified with a closed subset of $L_1\{\lambda \times \nu, \mathcal{Y} \times \mathrm{supp}(\nu)\}$, making it a complete separable metric space. For $f_1, f_2 \in \mathcal{F}_d$, let $d_{SS}(f_1, f_2) = \sup_{x \in \mathcal{X}} \| f_1(\cdot \mid x) - f_2(\cdot \mid x) \|_1$ and define the sup-$L_1$ neighborhood

    $SS_{\epsilon}(f_0) = \left\{ f \in \mathcal{F}_d : d_{SS}(f, f_0) < \epsilon \right\}.$    (12)

Under the sup-$L_1$ topology, $\mathcal{F}_d$ can be viewed as a closed subset of the separable Banach space of norm-bounded continuous functions from $\mathcal{X}$ to $L_1(\lambda, \mathcal{Y})$, and hence is a complete separable metric space. Thus measurability issues do not arise with these topologies.
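For a concrete reading of (11) and (12), the sketch below approximates the $\nu$-integrated $L_1$ distance (taking $\nu$ to be the uniform density on $[0, 1]$, standing in for $q$) and the sup-$L_1$ distance $d_{SS}$ on grids, for two illustrative normal conditional densities chosen by us purely for demonstration.

```python
import numpy as np
from scipy.stats import norm

y_grid = np.linspace(-3, 3, 1201)
x_grid = np.linspace(0, 1, 101)
dy = y_grid[1] - y_grid[0]

f0 = lambda y, x: norm.pdf(y, loc=np.sin(2 * np.pi * x), scale=0.3)   # "true" conditional density
f1 = lambda y, x: norm.pdf(y, loc=np.sin(2 * np.pi * x), scale=0.4)   # a competing conditional density

# ||f1(.|x) - f0(.|x)||_1 evaluated on the x grid
l1_x = np.array([np.abs(f1(y_grid, x) - f0(y_grid, x)).sum() * dy for x in x_grid])

print("nu-integrated L1 distance, cf. (11):", l1_x.mean())   # nu = Uniform(0, 1)
print("sup-L1 distance d_SS, cf. (12):     ", l1_x.max())
```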

3.3. Posterior convergence rate theorem in the compact case

We recall the definition of the rate of posterior convergence.

DEFINITION 1. The posterior $\Pi_{\mathcal{X}}(\cdot \mid y^n, x^n)$ is said to contract at a rate $\epsilon_n \to 0$ in the $\nu$-integrated $L_1$ topology, or strongly in the sup-$L_1$ topology, at $\{f_0(\cdot \mid x), x \in \mathcal{X}\}$ if $\Pi_{\mathcal{X}}(U_{M\epsilon_n}^c \mid y^n, x^n) \to 0$ a.s. for a large enough $M > 0$, with $U_{M\epsilon_n} = S_{M\epsilon_n}(f_0; \nu)$ and $SS_{M\epsilon_n}(f_0)$ respectively.

Here a.s. consistency at $\{f_0(\cdot \mid x), x \in \mathcal{X}\}$ means that the posterior distribution concentrates around a neighborhood of $\{f_0(\cdot \mid x), x \in \mathcal{X}\}$ for almost every sequence $\{y_i, x_i\}_{i=1}^{\infty}$ generated by i.i.d. sampling from the joint density $q(x) f_0(y \mid x)$.

Studying rates of convergence in density regression models is more challenging because we need to assume mixed smoothness in $y$ and $x$. Although posterior contraction rates have been studied widely in mean regression, logistic regression and density estimation models, results on convergence rates for density regression models are lacking. We study posterior convergence rates of the density regression model (6) by assuming that the true conditional density has different smoothness across $y$ and $x$. Assuming $f_0(\cdot \mid x)$ to be compactly supported and twice and thrice continuously differentiable in $y$ and $x$ respectively, we obtain a rate of $n^{-1/3} (\log n)^{t_0}$ using a Gaussian process prior for $\mu$ having a single inverse-gamma bandwidth across the different dimensions. The optimal rate in such a mixed smoothness class is $n^{-6/17}$. The slightly slower rate of $n^{-1/3}$ is the price of using an isotropic Gaussian process to model an anisotropic function.

First consider the simple case where the true conditional density $f_0(y \mid x)$ is compactly supported for each $x \in [0, 1]^p$ and $p = 1$. Define $\mu_0(t, x) = F_0^{-1}(t \mid x)$.

Assumption 1. For each $x \in [0, 1]^p$, $f_0(\cdot \mid x)$ is compactly supported. Without loss of generality we assume that the support is $[0, 1]$. Hence $\mu_0 : [0, 1]^2 \to \mathbb{R}$.

Assumption 2. $f_0(y \mid x)$ is twice continuously differentiable in $y$ for any fixed $x \in [0, 1]^p$ and thrice continuously differentiable in $x$ for any fixed $y \in \mathcal{Y}$. Note that under this assumption $\mu_0 \in C^3[0, 1]^2$. Also

    $\sup_{x \in [0,1]^p} \int_{\mathcal{Y}} \left\{ \frac{\partial^2 f_0(y \mid x)}{\partial y^2} \right\}^2 \big/ f_0(y \mid x)\, dy < \infty,$    (13)
    $\sup_{x \in [0,1]^p} \int_{\mathcal{Y}} \left\{ \frac{\partial f_0(y \mid x)}{\partial y} \right\}^4 \big/ f_0(y \mid x)\, dy < \infty.$    (14)

Assumption 3. For any $x \in [0, 1]^p$, there exist $0 < m_x < M_x$ such that $f_0(\cdot \mid x)$ is bounded, non-decreasing on $(-\infty, m_x]$, bounded away from 0 on $[m_x, M_x]$ and non-increasing on $[M_x, \infty)$. In addition,

    $\inf_{x \in [0,1]^p} (M_x - m_x) > 0.$    (15)

Assumption 4. We assume $\mu$ follows a centered Gaussian process, denoted $\mathrm{GP}(0, c)$, on $[0, 1]^2$, with a squared exponential covariance kernel $c(\cdot, \cdot; A)$ and a Gamma prior for the inverse-bandwidth $A$. Thus

    $c(t, s; A) = e^{-A \|t - s\|^2}, \quad t, s \in [0, 1]^2, \quad A \sim \mathrm{Ga}(p, q).$

Assumption 5. We assume $\sigma \sim \mathrm{IG}(a_{\sigma}, b_{\sigma})$.

THEOREM 1. If the true conditional density $f_0$ satisfies Assumptions 1, 2 and 3 and the prior satisfies Assumptions 4 and 5, the posterior rate of convergence in the $q$-integrated $L_1$ topology is

    $n^{-1/3} (\log n)^{t_0},$    (16)

where $t_0$ is a known constant.

Remark 1. The optimal effective smoothness $\alpha_e$ for this problem is given by $1/\alpha_e = 1/2 + 1/3 = 5/6$, i.e., $\alpha_e = 6/5$, and hence the optimal rate of convergence is $n^{-\alpha_e/(2\alpha_e + 1)} = n^{-6/17}$. It is important to note that the obtained rate of convergence is slower than this by only a very small factor.

REFERENCES

CHUNG, Y. & DUNSON, D. (2009). Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association 104, 1646-1660.
DUNSON, D. & PARK, J. (2008). Kernel stick-breaking processes. Biometrika 95, 307-323.
DUNSON, D., PILLAI, N. & PARK, J. (2007). Bayesian density regression. Journal of the Royal Statistical Society, Series B 69, 163-183.

FERGUSON, T. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1, 209-230.
FERGUSON, T. (1974). Prior distributions on spaces of probability measures. The Annals of Statistics 2, 615-629.
GRIFFIN, J. & STEEL, M. (2006). Order-based dependent Dirichlet processes. Journal of the American Statistical Association 101, 179-194.
GRIFFIN, J. & STEEL, M. (2008). Bayesian nonparametric modelling with the Dirichlet process regression smoother. Statistica Sinica (to appear).
KUNDU, S. & DUNSON, D. (2011). Single factor transformation priors for density regression. DSS Discussion Series.
LENK, P. (1988). The logistic normal distribution for Bayesian, nonparametric, predictive densities. Journal of the American Statistical Association 83, 509-516.
LENK, P. (1991). Towards a practicable Bayesian nonparametric density estimator. Biometrika 78, 531.
MACEACHERN, S. (1999). Dependent nonparametric processes. In Proceedings of the Section on Bayesian Statistical Science.
MÜLLER, P., ERKANLI, A. & WEST, M. (1996). Bayesian curve fitting using multivariate normal mixtures. Biometrika 83, 67-79.
NORETS, A. & PELENIS, J. (2010). Posterior consistency in conditional distribution estimation by covariate dependent mixtures. Unpublished manuscript, Princeton University.
PATI, D., BHATTACHARYA, A. & DUNSON, D. (2011). Posterior convergence rates in non-linear latent variable models. arXiv preprint arXiv:1109.5000 (submitted to Bernoulli).
PATI, D., DUNSON, D. & TOKDAR, S. (2012). Posterior consistency in conditional distribution estimation. Journal of Multivariate Analysis (submitted).
TANG, Y. & GHOSAL, S. (2007a). A consistent nonparametric Bayesian procedure for estimating autoregressive conditional densities. Computational Statistics & Data Analysis 51, 4424-4437.
TANG, Y. & GHOSAL, S. (2007b). Posterior consistency of Dirichlet mixtures for estimating a transition density. Journal of Statistical Planning and Inference 137, 1711-1726.
TOKDAR, S. & GHOSH, J. (2007). Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference 137, 34-42.
TOKDAR, S., ZHU, Y. & GHOSH, J. (2010a). Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Analysis 5, 1-26.
TOKDAR, S., ZHU, Y. & GHOSH, J. (2010b). Bayesian density regression with logistic Gaussian process and subspace projection. Bayesian Analysis 5, 1-26.
VAN DER VAART, A. & VAN ZANTEN, J. (2008). Rates of contraction of posterior distributions based on Gaussian process priors. The Annals of Statistics 36, 1435-1463.
VAN DER VAART, A. & VAN ZANTEN, J. (2009). Adaptive Bayesian estimation using a Gaussian random field with inverse Gamma bandwidth. The Annals of Statistics 37, 2655-2675.
WALKER, S. (2011). Posterior sampling when the normalizing constant is unknown. Communications in Statistics 40, 784-792.
YOON, J. (2009). Bayesian analysis of conditional density functions: a limited information approach. Unpublished manuscript, Claremont McKenna College.

[Received January 2010. Revised June 2010]