Bayesian Nonparametrics


Peter Orbanz, Columbia University

PARAMETERS AND PATTERNS

Parameters
P(X | θ) = Probability[data | pattern]

[Figure: noisy observations scattered around an underlying curve.]

Inference idea
data = underlying pattern + independent randomness
Bayesian statistics tries to compute the posterior probability P[pattern | data].

NONPARAMETRIC MODELS

Parametric model
Number of parameters fixed (or bounded by a constant) w.r.t. sample size.

Nonparametric model
Number of parameters grows with sample size; ∞-dimensional parameter space.

Example: density estimation.
[Figure: a parametric fit of p(x) (single Gaussian with mean µ) next to a nonparametric fit.]
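
To make the distinction concrete, here is a minimal Python sketch (my illustration, not from the slides) contrasting a parametric density estimate, whose parameter count is fixed, with a kernel density estimate, whose effective complexity grows with the sample size. The data-generating mixture is an assumption chosen for the example.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Data from a two-component mixture that a single Gaussian cannot represent.
    x = np.concatenate([rng.normal(-2, 0.5, 150), rng.normal(1, 1.0, 150)])

    # Parametric: a single Gaussian, always exactly 2 parameters.
    mu, sigma = x.mean(), x.std()
    parametric = stats.norm(mu, sigma)

    # Nonparametric: kernel density estimate, one kernel per observation,
    # so the effective "parameter count" grows with n.
    kde = stats.gaussian_kde(x)

    grid = np.linspace(-4, 4, 9)
    print(np.round(parametric.pdf(grid), 3))
    print(np.round(kde(grid), 3))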

NONPARAMETRIC BAYESIAN MODEL

Definition
A nonparametric Bayesian model is a Bayesian model on an ∞-dimensional parameter space.

Interpretation
Parameter space T = set of possible patterns, for example:

    Problem            | T
    Density estimation | Probability distributions
    Regression         | Smooth functions
    Clustering         | Partitions

Solution to the Bayesian problem = posterior distribution on patterns. [Sch95]

(NONPARAMETRIC) BAYESIAN STATISTICS

Task
Define a prior distribution Q(Θ ∈ ·) and an observation model P[X ∈ · | Θ].
Compute the posterior distribution Q[Θ ∈ · | X_1 = x_1, ..., X_n = x_n].

Parametric case: Bayes' theorem

    Q(dθ | x_1, ..., x_n) = ( ∏_{j=1}^n p(x_j | θ) / p(x_1, ..., x_n) ) Q(dθ)

Condition: Q[· | X = x] ≪ Q for all x.

Nonparametric case
Bayes' theorem (often) not applicable: the parameter space is not locally compact.
Hence: no density representations.
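
In the parametric case this density form of Bayes' theorem is directly computable. As a minimal sketch (added for illustration; the conjugate Beta-Bernoulli pair and the hyperparameters are my assumptions):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.binomial(1, 0.7, size=50)       # observations x_1, ..., x_n

    # Prior Q = Beta(a0, b0); likelihood p(x | theta) = theta^x (1-theta)^(1-x).
    a0, b0 = 1.0, 1.0
    # The posterior density is proportional to prod_j p(x_j | theta) times the
    # prior density, which for this conjugate pair is again a Beta distribution.
    a_n, b_n = a0 + x.sum(), b0 + len(x) - x.sum()
    posterior = stats.beta(a_n, b_n)
    print(posterior.mean(), posterior.interval(0.95))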

EXCHANGEABILITY

Can we justify our assumptions? Recall: data = pattern + noise.

In Bayes' theorem:

    Q(dθ | x_1, ..., x_n) = ( ∏_{j=1}^n p(x_j | θ) / p(x_1, ..., x_n) ) Q(dθ)

de Finetti's theorem

    X_1, X_2, ... exchangeable
    ⟺
    P(X_1 = x_1, X_2 = x_2, ...) = ∫_{M(X)} ∏_{j=1}^∞ θ(X_j = x_j) Q(dθ)

where:
M(X) is the set of probability measures on X.
The θ are values of a random probability measure Θ with distribution Q. [Sch95, Kal05]
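
de Finetti's representation is constructive: draw the random measure Θ ~ Q once, then sample the sequence i.i.d. from it. A small illustration (my example, not from the slides), with Q a Beta distribution over Bernoulli parameters:

    import numpy as np

    rng = np.random.default_rng(0)

    def exchangeable_sequence(n):
        # Draw the "pattern" theta ~ Q once, then i.i.d. observations from it.
        theta = rng.beta(2.0, 2.0)
        return rng.binomial(1, theta, size=n)

    # Any permutation of such a sequence has the same distribution:
    # empirically, P(X_1=1, X_2=0) and P(X_1=0, X_2=1) agree.
    samples = np.array([exchangeable_sequence(2) for _ in range(100_000)])
    print(np.mean((samples[:, 0] == 1) & (samples[:, 1] == 0)),
          np.mean((samples[:, 0] == 0) & (samples[:, 1] == 1)))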

EXAMPLES

GAUSSIAN PROCESSES

Nonparametric regression
Patterns = continuous functions, say on [a, b]:

    θ : [a, b] → R        T = C([a, b], R)

Hyperparameter
Kernel function; controls the smoothness of Θ.

[Figure: sample paths Θ(s) for s ∈ [a, b].]

Inference
On data (sample size n): n × n kernel matrix.
The posterior is again a Gaussian process; posterior computation reduces to matrix computation. [RW06]
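
A minimal numpy sketch of this "posterior = matrix computation" reduction, assuming an RBF kernel and a fixed noise level (both illustrative choices, not from the slides):

    import numpy as np

    def rbf(A, B, lengthscale=1.0):
        # Squared-exponential kernel matrix between row-vector inputs A and B.
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-0.5 * d2 / lengthscale**2)

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (20, 1))                   # training inputs
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
    Xs = np.linspace(-3, 3, 100)[:, None]             # test inputs

    noise = 0.1
    K = rbf(X, X) + noise**2 * np.eye(len(X))         # n x n kernel matrix
    L = np.linalg.cholesky(K)                         # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

    mean = rbf(Xs, X) @ alpha                         # posterior mean
    V = np.linalg.solve(L, rbf(X, Xs))
    cov = rbf(Xs, Xs) - V.T @ V                       # posterior covariance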

RANDOM DISCRETE MEASURES

Random discrete probability measure

    Θ = ∑_{i=1}^∞ C_i δ_{Φ_i}

Application: mixture models

    p(x) = ∫ p(x | φ) dΘ(φ) = ∑_{i=1}^∞ C_i p(x | Φ_i)

Example: Dirichlet process
Sample Φ_1, Φ_2, ... iid ~ G.
Sample V_1, V_2, ... iid ~ Beta(1, α) and set

    C_i := V_i ∏_{j=1}^{i-1} (1 - V_j)
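
A small sketch of this stick-breaking construction, truncated at a finite number of atoms; the truncation level and the standard normal base measure G are my assumptions for the example:

    import numpy as np

    def stick_breaking(alpha, n_atoms, rng):
        # C_i = V_i * prod_{j<i} (1 - V_j), with V_i ~ Beta(1, alpha).
        V = rng.beta(1.0, alpha, size=n_atoms)
        C = V * np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))
        Phi = rng.standard_normal(n_atoms)    # atoms Phi_i ~ G = N(0, 1)
        return C, Phi

    rng = np.random.default_rng(0)
    C, Phi = stick_breaking(alpha=2.0, n_atoms=500, rng=rng)
    print(C.sum())    # close to 1; the remainder is truncation error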

MORE EXAMPLES

    Application                 | Pattern                   | Bayesian nonparametric model
    Classification & regression | Function                  | Gaussian process
    Clustering                  | Partition                 | Chinese restaurant process
    Density estimation          | Density                   | Dirichlet process mixture
    Hierarchical clustering     | Hierarchical partition    | Dirichlet/Pitman-Yor diffusion tree, Kingman's coalescent, nested CRP
    Latent variable modelling   | Features                  | Beta process / Indian buffet process
    Survival analysis           | Hazard                    | Beta process, neutral-to-the-right process
    Power-law behaviour         |                           | Pitman-Yor process, stable-beta process
    Dictionary learning         | Dictionary                | Beta process / Indian buffet process
    Dimensionality reduction    | Manifold                  | Gaussian process latent variable model
    Deep learning               | Features                  | Cascading/nested Indian buffet process
    Topic models                | Atomic distribution       | Hierarchical Dirichlet process
    Time series                 |                           | Infinite HMM
    Sequence prediction         | Conditional probabilities | Sequence memoizer
    Reinforcement learning      | Conditional probabilities | Infinite POMDP
    Spatial modelling           | Functions                 | Gaussian process, dependent Dirichlet process
    Relational modelling        |                           | Infinite relational model, infinite hidden relational model, Mondrian process
    ...                         | ...                       | ...
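
The Chinese restaurant process in the clustering row is the partition counterpart of the Dirichlet process above. A minimal simulation of its prior on partitions (the concentration α is an illustrative choice):

    import numpy as np

    def crp_partition(n, alpha, rng):
        # Customer i joins an existing table with probability count / (i + alpha)
        # or opens a new table with probability alpha / (i + alpha).
        counts, labels = [], []
        for i in range(n):
            probs = np.array(counts + [alpha]) / (i + alpha)
            k = rng.choice(len(probs), p=probs)
            if k == len(counts):
                counts.append(1)
            else:
                counts[k] += 1
            labels.append(k)
        return labels, counts

    rng = np.random.default_rng(0)
    labels, counts = crp_partition(n=100, alpha=2.0, rng=rng)
    print(len(counts), counts)    # number of clusters grows roughly like alpha * log(n)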

RESEARCH PROBLEMS

INFERENCE

MCMC
Models are generative, so MCMC is a natural choice.
Gibbs samplers are easy to derive and can sample through hierarchies.
However: for most available samplers, inference is probably too slow or wrong.

Gaussian process inference
On data: positive definite matrices (Mercer's theorem).
Inference is based on numerical linear algebra; naive methods scale cubically with the sample size.

Approximations
For latent variable methods: variational approximations.
For Gaussian processes: inducing point methods (sketched below).
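
A rough sketch of one inducing-point scheme, a subset-of-regressors (Nyström-type) approximation, which replaces the exact GP's O(n^3) solve with O(n m^2) for m inducing points. The kernel, the inducing-point locations, and the noise level are all illustrative assumptions, and practical methods (e.g. variational inducing points) are more refined:

    import numpy as np

    def rbf(A, B, lengthscale=1.0):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-0.5 * d2 / lengthscale**2)

    def sor_mean(X, y, Z, Xs, noise=0.1):
        # Subset-of-regressors predictive mean:
        #   mu(x*) = K_{*m} (K_{mn} K_{nm} + noise^2 K_{mm})^{-1} K_{mn} y
        # Only m x m systems are solved, so the cost is O(n m^2).
        Kmm = rbf(Z, Z)
        Knm = rbf(X, Z)
        A = Knm.T @ Knm + noise**2 * Kmm + 1e-8 * np.eye(len(Z))
        w = np.linalg.solve(A, Knm.T @ y)
        return rbf(Xs, Z) @ w

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (2000, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(len(X))
    Z = np.linspace(-3, 3, 20)[:, None]      # m = 20 inducing points
    Xs = np.linspace(-3, 3, 5)[:, None]
    print(sor_mean(X, y, Z, Xs))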

ASYMPTOTICS

Consistency
A Bayesian model is consistent at P_0 if the posterior converges to δ_{P_0} with growing sample size.

Convergence rate
Find the smallest balls B_{ε_n}(θ_0) for which

    Q( B_{ε_n}(θ_0) | X_1, ..., X_n ) → 1

A rate is a sequence ε_1, ε_2, ...; the optimal rate is ε_n ~ n^{-1/2}.
If P_0 = P_{θ_0} for some θ_0, the model is well specified; if P_0 lies outside the model, it is misspecified.

[Figure: the model as a subset of M(X), with P_0 = P_{θ_0} inside the model, or P_0 outside it when misspecified.]

Example result
Bandwidth adaptation with GPs: true parameter θ_0 ∈ C^α([0,1]^d), smoothness α unknown.
With a gamma prior on the GP bandwidth, the convergence rate is n^{-α/(2α+d)}. [Gho10, KvdV06, Sch65, GvdV07, vdVvZ08a, vdVvZ08b]
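
For intuition, a toy illustration (my addition, not from the slides) of posterior contraction at the parametric n^{-1/2} rate, using the Beta-Bernoulli model from earlier:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    p0 = 0.3                                   # true parameter P_0
    for n in [100, 1_000, 10_000, 100_000]:
        x = rng.binomial(1, p0, size=n)
        post = stats.beta(1 + x.sum(), 1 + n - x.sum())
        lo, hi = post.interval(0.95)
        # The credible-ball radius shrinks roughly like n^{-1/2}:
        # the rescaled width (hi - lo) * sqrt(n) stays roughly constant.
        print(n, round(hi - lo, 4), round((hi - lo) * np.sqrt(n), 2))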

ERGODIC THEORY

de Finetti as ergodic decomposition

    P S_∞-invariant (exchangeable):
        P(A) = ∫_{M(X)} (∏_{j=1}^∞ θ)(A) Q(dθ)    for a unique Q ∈ M(M(X))

    P G-invariant:
        P(A) = ∫_E e(A) ν(de)                      for a unique ν ∈ M(E)

where G is a (nice) group acting on X and E is its set of ergodic measures.

[Figure: the invariant measures form a simplex with ergodic measures e_1, e_2, e_3 as extreme points; P is a mixture of them with weights ν_1, ν_2, ν_3.]

Relevance to statistics
de Finetti covers random infinite sequences. What if the data is matrix-valued, network-valued, ...?
Examples: partitions (Kingman), graphs (Aldous, Hoover), Markov chains (Diaconis & Freedman).

SUMMARY

Motivation, in hindsight
Bayesian (nonparametric) modeling:
Identify the pattern/explanatory object (function, discrete measure, ...).
Usually, applied probability already knows a random version of this object.
Use that process as a prior and develop inference.

Technical tools
Stochastic processes. Exchangeability/ergodic theory. Graphical, hierarchical and dependent models.
Inference: MCMC sampling, optimization methods, numerical linear algebra.

Open challenges
Novel models and useful applications.
Better inference and flexible software packages.
Mathematical statistics for Bayesian nonparametric models.

REFERENCES I

[Gho10] S. Ghosal. Dirichlet process, related priors and posterior asymptotics. In N. L. Hjort et al., editors, Bayesian Nonparametrics, pages 36-83. Cambridge University Press, 2010.

[GvdV07] S. Ghosal and A. W. van der Vaart. Posterior convergence rates of Dirichlet mixtures at smooth densities. Annals of Statistics, 35(2):697-723, 2007.

[Kal05] O. Kallenberg. Probabilistic Symmetries and Invariance Principles. Springer, 2005.

[KvdV06] B. J. K. Kleijn and A. W. van der Vaart. Misspecification in infinite-dimensional Bayesian statistics. Annals of Statistics, 34(2):837-877, 2006.

[RW06] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[Sch65] L. Schwartz. On Bayes procedures. Z. Wahrsch. Verw. Gebiete, 4:10-26, 1965.

[Sch95] M. J. Schervish. Theory of Statistics. Springer, 1995.

[vdVvZ08a] A. W. van der Vaart and J. H. van Zanten. Rates of contraction of posterior distributions based on Gaussian process priors. Annals of Statistics, 36(3):1435-1463, 2008.

[vdVvZ08b] A. W. van der Vaart and J. H. van Zanten. Reproducing kernel Hilbert spaces of Gaussian priors. In Pushing the Limits of Contemporary Statistics: Contributions in Honor of Jayanta K. Ghosh, volume 3 of IMS Collections, pages 200-222. Institute of Mathematical Statistics, Beachwood, OH, 2008.