Bayesian Nonparametrics
Peter Orbanz, Columbia University
PARAMETERS AND PATTERNS

Parameters: P(X | θ) = Probability[data | pattern]

[Figure: data points scattered around an underlying curve]

Inference idea: data = underlying pattern + independent randomness.
Bayesian statistics tries to compute the posterior probability P[pattern | data].

Peter Orbanz 2 / 16
NONPARAMETRIC MODELS

Parametric model: the number of parameters is fixed (or constantly bounded) w.r.t. sample size.
Nonparametric model: the number of parameters grows with sample size; ∞-dimensional parameter space.

Example: density estimation.
[Figure: parametric fit (single Gaussian with mean µ) vs. nonparametric density estimate p(x)]
NONPARAMETRIC BAYESIAN MODEL

Definition: A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.

Interpretation: parameter space T = set of possible patterns, for example:

Problem            | T
Density estimation | Probability distributions
Regression         | Smooth functions
Clustering         | Partitions

Solution to a Bayesian problem = posterior distribution on patterns. [Sch95]
(NONPARAMETRIC) BAYESIAN STATISTICS

Task:
- Define a prior distribution Q(Θ ∈ ·) and an observation model P[X ∈ · | Θ].
- Compute the posterior distribution Q[Θ ∈ · | X_1 = x_1, ..., X_n = x_n].

Parametric case: Bayes' theorem

    Q[dθ | x_1, ..., x_n] = ( ∏_{j=1}^n p(x_j | θ) / p(x_1, ..., x_n) ) Q(dθ)

Condition: Q[· | X = x] ≪ Q for all x.

Nonparametric case: Bayes' theorem is (often) not applicable. The parameter space is not locally compact, hence there are no density representations.
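In the parametric case, the Bayes' theorem formula above can be computed in closed form for conjugate pairs. A minimal sketch for the Beta–Bernoulli pair (an example chosen for illustration, not taken from the slides; the function name is hypothetical):

```python
def beta_bernoulli_posterior(data, a=1.0, b=1.0):
    """Posterior Beta(a', b') for a Bernoulli parameter theta under a
    Beta(a, b) prior.  Bayes' theorem says posterior is proportional to
    likelihood * prior; for this conjugate pair the posterior stays in
    the Beta family, so updating reduces to counting successes/failures.
    """
    heads = sum(data)
    tails = len(data) - heads
    return a + heads, b + tails

# Four observations, three of them successes, under a uniform Beta(1, 1) prior.
a_post, b_post = beta_bernoulli_posterior([1, 1, 0, 1])
posterior_mean = a_post / (a_post + b_post)
```

The nonparametric case is exactly where such density-based updates break down, which is why the slide emphasizes that Bayes' theorem is often not applicable there.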
EXCHANGEABILITY

Can we justify our assumptions? Recall: data = pattern + noise.

In Bayes' theorem:

    Q(dθ | x_1, ..., x_n) = ( ∏_{j=1}^n p(x_j | θ) / p(x_1, ..., x_n) ) Q(dθ)

de Finetti's theorem: X_1, X_2, ... exchangeable if and only if

    P(X_1 = x_1, X_2 = x_2, ...) = ∫_{M(X)} ∏_{j=1}^∞ θ(X_j = x_j) Q(dθ)

where M(X) is the set of probability measures on X, and θ are values of a random probability measure Θ with distribution Q. [Sch95, Kal05]
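The constructive direction of de Finetti's theorem is easy to simulate: draw a random parameter θ from a mixing measure Q, then draw the data i.i.d. given θ; the resulting sequence is exchangeable. A minimal sketch, assuming a Beta mixing measure and Bernoulli observations (the function name and this particular choice of Q are illustrative, not from the slides):

```python
import random

def exchangeable_bernoulli_sequence(n, a=1.0, b=1.0, rng=None):
    """Generate an exchangeable binary sequence via de Finetti's
    construction: first draw theta from a Beta(a, b) mixing measure Q,
    then draw X_1, ..., X_n i.i.d. Bernoulli(theta).  Conditionally
    i.i.d. given theta implies the joint law is exchangeable.
    """
    rng = rng or random.Random(0)
    theta = rng.betavariate(a, b)
    return [1 if rng.random() < theta else 0 for _ in range(n)]

xs = exchangeable_bernoulli_sequence(10)
```

Marginally the X_j are dependent (observing many 1s makes further 1s more likely), even though they are conditionally independent given θ; that is exactly the structure the theorem describes.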
EXAMPLES
GAUSSIAN PROCESSES

Nonparametric regression: patterns = continuous functions, say on [a, b]:

    θ : [a, b] → ℝ,    T = C([a, b], ℝ)

Hyperparameter: the kernel function; controls the smoothness of Θ.

Inference on data (sample size n):
- n × n kernel matrix
- Posterior is again a Gaussian process
- Posterior computation reduces to matrix computations

[Figure: sample path Θ(s) on [a, b]] [RW06]
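The inference steps above (build the n × n kernel matrix, reduce the posterior to linear algebra) fit in a few lines. A minimal sketch assuming a squared-exponential kernel and a small observation-noise term for numerical stability; the helper names `rbf_kernel` and `gp_posterior` are hypothetical, not from the slides:

```python
import numpy as np

def rbf_kernel(s, t, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel; the lengthscale hyperparameter
    controls the smoothness of the sampled functions."""
    d = s[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-2, lengthscale=1.0):
    """GP regression posterior mean and covariance at test points.
    Inference reduces to solves against the n x n kernel matrix."""
    K = rbf_kernel(x_train, x_train, lengthscale) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, lengthscale)   # n x m cross-kernel
    K_ss = rbf_kernel(x_test, x_test, lengthscale)   # m x m test kernel
    mean = K_s.T @ np.linalg.solve(K, y_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

x = np.array([0.0, 1.0, 2.0])
y = np.sin(x)
mean, cov = gp_posterior(x, y, np.array([0.5, 1.5]))
```

The two `solve` calls against K are where the cubic cost in the sample size comes from, which motivates the approximation methods discussed later.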
RANDOM DISCRETE MEASURES

Random discrete probability measure:

    Θ = ∑_{i=1}^∞ C_i δ_{Φ_i}

Application: mixture models

    ∫ p(x | φ) dΘ(φ) = ∑_{i=1}^∞ C_i p(x | Φ_i)

Example: Dirichlet process
- Sample Φ_1, Φ_2, ... i.i.d. from G
- Sample V_1, V_2, ... i.i.d. from Beta(1, α) and set C_i := V_i ∏_{j=1}^{i−1} (1 − V_j)
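The stick-breaking recipe for the weights C_i above translates directly into code. A minimal sketch that truncates the infinite sequence at a finite level (the truncation is an approximation; the infinite weight sequence sums to 1 almost surely, and the function name is hypothetical):

```python
import random

def stick_breaking_weights(alpha, truncation, rng=None):
    """Truncated stick-breaking construction of Dirichlet process
    weights: V_i ~ Beta(1, alpha) i.i.d., C_i = V_i * prod_{j<i}(1 - V_j).
    Each V_i breaks off a fraction of the remaining stick."""
    rng = rng or random.Random(0)
    weights, remaining = [], 1.0
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

w = stick_breaking_weights(alpha=2.0, truncation=100)
```

Pairing each weight with an atom Φ_i drawn i.i.d. from G gives a (truncated) draw of the random measure Θ; plugging the weights into the mixture formula above gives a Dirichlet process mixture.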
MORE EXAMPLES

Application                 | Pattern                | Bayesian nonparametric model
Classification & regression | Function               | Gaussian process
Clustering                  | Partition              | Chinese restaurant process
Density estimation          | Density                | Dirichlet process mixture
Hierarchical clustering     | Hierarchical partition | Dirichlet/Pitman-Yor diffusion tree, Kingman's coalescent, nested CRP
Latent variable modelling   | Features               | Beta process/Indian buffet process
Survival analysis           | Hazard                 | Beta process, neutral-to-the-right process
Power-law behaviour         |                        | Pitman-Yor process, stable-beta process
Dictionary learning         | Dictionary             | Beta process/Indian buffet process
Dimensionality reduction    | Manifold               | Gaussian process latent variable model
Deep learning               | Features               | Cascading/nested Indian buffet process
Topic models                | Atomic distribution    | Hierarchical Dirichlet process
Time series                 |                        | Infinite HMM
Sequence prediction         | Conditional probs      | Sequence memoizer
Reinforcement learning      | Conditional probs      | Infinite POMDP
Spatial modelling           | Functions              | Gaussian process, dependent Dirichlet process
Relational modelling        |                        | Infinite relational model, infinite hidden relational model, Mondrian process
...                         | ...                    | ...
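The Chinese restaurant process in the clustering row has a particularly simple generative description, which a short simulation can make concrete. A minimal sketch, assuming the standard seating rule (item i joins an existing block with probability proportional to its size, or opens a new block with probability proportional to α); the function name is illustrative:

```python
import random

def chinese_restaurant_process(n, alpha, rng=None):
    """Sample a random partition of n items from the CRP: item i sits
    at an occupied table with probability proportional to that table's
    size, or at a new table with probability proportional to alpha."""
    rng = rng or random.Random(0)
    assignments = []   # table index for each item
    counts = []        # current table sizes
    for i in range(n):
        r = rng.random() * (i + alpha)   # total mass = sum(counts) + alpha
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                assignments.append(k)
                counts[k] += 1
                break
        else:  # landed in the alpha-mass: open a new table
            assignments.append(len(counts))
            counts.append(1)
    return assignments

z = chinese_restaurant_process(20, alpha=1.0)
```

The resulting partition is exchangeable over items, and the number of occupied tables grows with n, which is exactly the "parameters grow with sample size" behaviour a nonparametric clustering model needs.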
RESEARCH PROBLEMS
INFERENCE

MCMC:
- The models are generative, so MCMC is a natural choice.
- Gibbs samplers are easy to derive and can sample through hierarchies.
- However: for most available samplers, inference is probably too slow or wrong.

Gaussian process inference:
- On data: positive definite matrices (Mercer's theorem).
- Inference is based on numerical linear algebra.
- Naive methods scale cubically with sample size.

Approximations:
- For latent variable models: variational approximations.
- For Gaussian processes: inducing point methods.
ASYMPTOTICS

Consistency: a Bayesian model is consistent at P_0 if the posterior converges to δ_{P_0} as the sample size grows.

Convergence rate: find the smallest balls B_{ε_n}(θ_0) for which

    Q(B_{ε_n}(θ_0) | X_1, ..., X_n) → 1

- Rate = sequence ε_1, ε_2, ...
- The optimal rate is ε_n ~ n^{−1/2}.

[Figure: the model {P_θ} inside M(X); P_0 = P_{θ_0} if well-specified, P_0 outside the model: misspecified]

Example result: bandwidth adaptation with GPs. For a true parameter θ_0 ∈ C^α[0, 1]^d with smoothness α unknown, a gamma prior on the GP bandwidth yields convergence rate n^{−α/(2α+d)}. [Gho10, KvdV06, Sch65, GvdV07, vdVvZ08a, vdVvZ08b]
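The rate formula n^{−α/(2α+d)} is easy to evaluate numerically, which makes the trade-offs visible: smoother truths (larger α) contract faster, higher dimension d slows contraction. A small sketch (the function name is illustrative; the formula is the one quoted on the slide, stated up to constants and log factors):

```python
def gp_contraction_rate(n, alpha, d):
    """Posterior contraction rate n^{-alpha/(2*alpha + d)} for an
    alpha-smooth truth in d dimensions, as quoted for bandwidth-adaptive
    GP priors (up to constants and logarithmic factors)."""
    return n ** (-alpha / (2 * alpha + d))

# alpha = 2, d = 1: exponent is 2/5, so the rate at n = 10^4 is 10^{-1.6}.
r = gp_contraction_rate(10_000, alpha=2.0, d=1)
```

Note that for any finite α the rate is slower than the parametric n^{−1/2}, approaching it only as α → ∞.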
ERGODIC THEORY

de Finetti as ergodic decomposition:

    P S-invariant:  P(A) = ∫_{M(X)} θ^∞(A) ν(dθ)   for a unique ν ∈ M(M(X))
    P G-invariant:  P(A) = ∫_E e(A) ν(de)          for a unique ν ∈ M(E)

where G is a (nice) group acting on X and E is its set of ergodic measures.

[Figure: simplex of invariant measures with extreme points e_1, e_2, e_3 and measures ν_1, ν_2, ν_3 around P]

Relevance to statistics: de Finetti covers random infinite sequences. What if the data is matrix-valued, network-valued, ...? Examples:
- Partitions (Kingman)
- Graphs (Aldous, Hoover)
- Markov chains (Diaconis & Freedman)
SUMMARY

Motivation, in hindsight. Bayesian (nonparametric) modeling:
- Identify the pattern/explanatory object (function, discrete measure, ...).
- Usually, applied probability already knows a random version of this object.
- Use that process as a prior and develop inference.

Technical tools:
- Stochastic processes.
- Exchangeability/ergodic theory.
- Graphical, hierarchical and dependent models.
- Inference: MCMC sampling, optimization methods, numerical linear algebra.

Open challenges:
- Novel models and useful applications.
- Better inference and flexible software packages.
- Mathematical statistics for Bayesian nonparametric models.
REFERENCES

[Gho10] S. Ghosal. Dirichlet process, related priors and posterior asymptotics. In N. L. Hjort et al., editors, Bayesian Nonparametrics, pages 36–83. Cambridge University Press, 2010.
[GvdV07] S. Ghosal and A. W. van der Vaart. Posterior convergence rates of Dirichlet mixtures at smooth densities. Annals of Statistics, 35(2):697–723, 2007.
[Kal05] O. Kallenberg. Probabilistic Symmetries and Invariance Principles. Springer, 2005.
[KvdV06] B. J. K. Kleijn and A. W. van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics, 34(2):837–877, 2006.
[RW06] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[Sch65] L. Schwartz. On Bayes procedures. Z. Wahrscheinlichkeitstheorie verw. Gebiete, 4:10–26, 1965.
[Sch95] M. J. Schervish. Theory of Statistics. Springer, 1995.
[vdVvZ08a] A. W. van der Vaart and J. H. van Zanten. Rates of contraction of posterior distributions based on Gaussian process priors. Annals of Statistics, 36(3):1435–1463, 2008.
[vdVvZ08b] A. W. van der Vaart and J. H. van Zanten. Reproducing kernel Hilbert spaces of Gaussian priors. In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, volume 3 of Inst. Math. Stat. Collect., pages 200–222. Institute of Mathematical Statistics, Beachwood, OH, 2008.