Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016

Overview of the lectures: I. Intro to nonparametric Bayesian statistics; II. Consistency and contraction rates; III. Contraction rates for Gaussian process priors; IV. Rate-adaptive BNP, challenges, ... 2 / 50

Overview of Lecture I: Bayesian statistics; nonparametric Bayesian statistics; nonparametric priors: Dirichlet processes (distribution function estimation), Gaussian processes (nonparametric regression), conditionally Gaussian processes, Dirichlet mixtures (nonparametric density estimation); some more examples; concluding remarks. 3 / 50

Bayesian statistics 4 / 50

Bayesian vs. frequentist statistics. Mathematical statistics: we have data $X$ and possible distributions $\{P_\theta : \theta \in \Theta\}$, and want to make inference about $\theta$ on the basis of $X$. Paradigms in mathematical statistics: Classical/frequentist paradigm: there is a true value $\theta_0 \in \Theta$; assume $X \sim P_{\theta_0}$. Bayesian paradigm: think of the data as being generated in steps: the parameter is random, $\theta \sim \Pi$ (terminology: $\Pi$ is the prior); the data given the parameter satisfy $X \mid \theta \sim P_\theta$. One can then consider $\theta \mid X$: the posterior distribution. 5 / 50

Bayes example - 1 [Bayes, Price (1763)] Suppose we have a coin that has probability p of turning up heads. We do 50 independent tosses and observe 42 heads. What can we say about p? Here we have an observation (the number 42) from a binomial distribution with parameters 50 and p and want to estimate p. Standard frequentist solution: take the estimate 42/50 = 0.84. 6 / 50

Bayes example - 2. Bayesian approach: choose a prior distribution on $p$, say uniform on $[0, 1]$. Compute the posterior: the Beta(43, 9)-distribution (mode at 42/50 = 0.84). [Figure: prior (uniform), data, and posterior (Beta(43, 9)) densities on $p \in [0, 1]$.] 7 / 50

Bayes rule. Observations $X$ take values in a sample space $\mathcal{X}$. Model $\{P_\theta : \theta \in \Theta\}$, all $P_\theta$ dominated: $P_\theta \ll \mu$, with density $p_\theta = dP_\theta/d\mu$. Prior distribution $\Pi$ on the parameter $\theta$. For the Bayesian: $\theta \sim \Pi$ and $X \mid \theta \sim P_\theta$. Hence the pair $(\theta, X)$ has density $(\theta, x) \mapsto p_\theta(x)$ relative to $\Pi \times \mu$. Then $X$ has marginal density $x \mapsto \int_\Theta p_\theta(x)\, \Pi(d\theta)$, and hence the conditional distribution of $\theta$ given $X = x$, i.e. the posterior, has density $\theta \mapsto \dfrac{p_\theta(x)}{\int_\Theta p_\theta(x)\, \Pi(d\theta)}$ relative to the prior $\Pi$. 8 / 50

Bayes example again. Have $X \sim \mathrm{Bin}(n, \theta)$, $\theta \in (0, 1)$. Likelihood: $p_\theta(X) = \binom{n}{X} \theta^X (1-\theta)^{n-X}$. Prior: uniform distribution on $(0, 1)$. By Bayes' rule, the posterior density is proportional to $\theta \mapsto \theta^X (1-\theta)^{n-X}$. Hence the posterior is $\mathrm{Beta}(X+1, n-X+1)$. 9 / 50
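
A minimal Python sketch of this conjugate update for the coin example above (scipy is used for the Beta distribution; it is not part of the original slides):

```python
# Minimal sketch of the conjugate Beta-binomial update from the coin example.
# With a uniform Beta(1, 1) prior and X heads in n tosses, the posterior is
# Beta(X + 1, n - X + 1); here n = 50, X = 42 gives Beta(43, 9).
from scipy.stats import beta

n, x = 50, 42                      # tosses and observed heads
a_post, b_post = x + 1, n - x + 1  # posterior Beta parameters under a uniform prior

posterior = beta(a_post, b_post)
posterior_mode = (a_post - 1) / (a_post + b_post - 2)   # = 42/50 = 0.84
print("posterior mean:", posterior.mean())
print("posterior mode:", posterior_mode)
print("95% credible interval:", posterior.interval(0.95))
```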

Bayesian nonparametrics 10 / 50

Bayesian nonparametrics. Challenges lie in particular in the area of high-dimensional or nonparametric models. Illustration 1: parametric vs. nonparametric regression, $Y_i = f(t_i) + \mathrm{error}_i$. [Figure: two scatter plots against $t$ on $[0, 1]$, comparing a parametric and a nonparametric regression fit.] 11 / 50

Bayesian nonparametrics. Illustration 2: parametric vs. nonparametric density estimation, $X_1, \ldots, X_n \sim f$. [Figure: two histograms (one of $x$, one of waiting-time data), comparing a parametric and a nonparametric density estimate.] 12 / 50

Bayesian nonparametrics In nonparametric problems, the parameter of interest is typically a function: e.g. a density, regression function, distribution function, hazard rate,..., or some other infinite-dimensional object. Bayesian approach is not at all fundamentally restricted to the parametric case, but: How do we construct priors on infinite-dimensional (function) spaces? How do we compute posteriors, or generate draws? What is the fundamental performance of procedures? 13 / 50

Nonparametric priors 14 / 50

Nonparametric priors - first remarks. It is often enough to describe how realizations are generated. Possible ways to construct priors on an infinite-dimensional space $\Theta$: Discrete priors: consider (random) points $\theta_1, \theta_2, \ldots$ in $\Theta$ and (random) probability weights $w_1, w_2, \ldots$ and define $\Pi = \sum_j w_j \delta_{\theta_j}$. Stochastic process approach: if $\Theta$ is a function space, use the machinery for constructing stochastic processes. Random series approach: if $\Theta$ is a function space, consider series expansions and put priors on the coefficients... 15 / 50

Nonparametric priors - Dirichlet process 16 / 50

Dirichlet process - 1. Step 1: a prior on the simplex of probability vectors of length $k$: $\Delta_{k-1} = \{(y_1, \ldots, y_k) \in \mathbb{R}^k : y_1, \ldots, y_k \ge 0, \sum y_i = 1\}$. For $\alpha = (\alpha_1, \ldots, \alpha_k) \in (0, \infty)^k$, define $f_\alpha(y_1, \ldots, y_{k-1}) = C_\alpha \prod_{i=1}^{k} y_i^{\alpha_i - 1}\, 1_{\{(y_1, \ldots, y_k) \in \Delta_{k-1}\}}$ on $\mathbb{R}^{k-1}$, where $y_k = 1 - y_1 - \cdots - y_{k-1}$ and $C_\alpha$ is the appropriate normalizing constant. A random vector $(Y_1, \ldots, Y_k)$ in $\mathbb{R}^k$ is said to have a Dirichlet distribution with parameter $\alpha = (\alpha_1, \ldots, \alpha_k)$ if $(Y_1, \ldots, Y_{k-1})$ has density $f_\alpha$ and $Y_k = 1 - Y_1 - \cdots - Y_{k-1}$. 17 / 50

Dirichlet process - 2. Step 2: definition of the DP. Let $\alpha$ be a finite measure on $\mathbb{R}$. A random probability measure $P$ on $\mathbb{R}$ is called a Dirichlet process with parameter $\alpha$ if for every partition $A_1, \ldots, A_k$ of $\mathbb{R}$, the vector $(P(A_1), \ldots, P(A_k))$ has a Dirichlet distribution with parameter $(\alpha(A_1), \ldots, \alpha(A_k))$. Notation: $P \sim \mathrm{DP}(\alpha)$. 18 / 50

Dirichlet process - 3. Step 3: prove that the DP exists! Theorem. For any finite measure $\alpha$ on $\mathbb{R}$, the Dirichlet process with parameter $\alpha$ exists. Proof. For instance: use Kolmogorov's consistency theorem to show there exists a process $P = (P(A) : A \in \mathcal{B}(\mathbb{R}))$ with the right finite-dimensional distributions, then prove there exists a version of $P$ such that every realization is a measure. 19 / 50

Dirichlet process - 4. [Figure: ten realizations from the Dirichlet process with parameter $25 \cdot N(0, 1)$, plotted as distribution functions on $[-4, 4]$.] 20 / 50

Dirichlet process - 5. Draws from the DP are discrete measures on $\mathbb{R}$: Theorem. Let $\alpha$ be a finite measure, define $M = \alpha(\mathbb{R})$ and $\bar\alpha = \alpha/M$. If we have independent $\theta_1, \theta_2, \ldots \sim \bar\alpha$ and $Y_1, Y_2, \ldots \sim \mathrm{Beta}(1, M)$, and $V_j = Y_j \prod_{l=1}^{j-1} (1 - Y_l)$, then $\sum_{j=1}^{\infty} V_j \delta_{\theta_j} \sim \mathrm{DP}(\alpha)$. This is the stick-breaking representation. 21 / 50
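
A minimal Python sketch of drawing (a truncation of) $\sum_j V_j \delta_{\theta_j}$ by stick-breaking, for $\alpha = M \cdot N(0, 1)$ as in the figure above; the helper name dp_stick_breaking and the truncation level are illustrative choices, not from the slides:

```python
# Minimal sketch of a (truncated) stick-breaking draw from DP(M * N(0, 1)):
# atoms from the base measure, Beta(1, M) stick lengths.
import numpy as np

def dp_stick_breaking(M=25.0, truncation=500, seed=None):
    rng = np.random.default_rng(seed)
    atoms = rng.standard_normal(truncation)        # theta_j ~ N(0, 1) (base measure)
    sticks = rng.beta(1.0, M, size=truncation)     # Y_j ~ Beta(1, M)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
    weights = sticks * remaining                   # V_j = Y_j * prod_{l<j} (1 - Y_l)
    return atoms, weights                          # P ~ sum_j V_j * delta_{atoms[j]}

atoms, weights = dp_stick_breaking(seed=0)
print("mass captured by truncation:", weights.sum())   # close to 1
```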

Distribution function estimation. The DP is a conjugate prior for full distribution estimation: if $P \sim \mathrm{DP}(\alpha)$ and $X_1, \ldots, X_n \mid P \sim P$, then $P \mid X_1, \ldots, X_n \sim \mathrm{DP}(\alpha + \sum_{i=1}^{n} \delta_{X_i})$. [Figure: simulated data, 500 draws from a $N(1, 1)$-distribution; prior: Dirichlet process with parameter $25 \cdot N(0, 1)$. Left: 10 draws from the prior. Right: 10 draws from the posterior.] 22 / 50
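
A minimal sketch of this conjugacy in Python, drawing from the posterior $\mathrm{DP}(\alpha + \sum_i \delta_{X_i})$ by stick-breaking with the updated base measure; the helper name dp_posterior_draw and the simulation settings are illustrative:

```python
# Minimal sketch of conjugacy: P | X_1..X_n ~ DP(alpha + sum_i delta_{X_i}),
# with alpha = M * N(0, 1), drawn via truncated stick-breaking.
import numpy as np

def dp_posterior_draw(data, M=25.0, truncation=500, seed=None):
    rng = np.random.default_rng(seed)
    n = len(data)
    total_mass = M + n                               # mass of alpha + sum_i delta_{X_i}
    sticks = rng.beta(1.0, total_mass, size=truncation)
    weights = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
    # Posterior base measure: mixture of N(0, 1) (weight M/(M+n)) and the
    # empirical distribution of the data (weight n/(M+n)).
    from_prior = rng.random(truncation) < M / total_mass
    atoms = np.where(from_prior,
                     rng.standard_normal(truncation),
                     rng.choice(data, size=truncation))
    return atoms, weights

data = np.random.default_rng(1).normal(1.0, 1.0, size=500)  # simulated N(1, 1) sample
atoms, weights = dp_posterior_draw(data, seed=2)
```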

Nonparametric priors - Gaussian processes 23 / 50

Gaussian process priors - 1. A stochastic process $W = (W_t : t \in T)$ is called Gaussian if for all $n \in \mathbb{N}$ and $t_1, \ldots, t_n \in T$, the vector $(W_{t_1}, \ldots, W_{t_n})$ has an $n$-dimensional Gaussian distribution. Associated functions: mean function $m(t) = \mathbb{E} W_t$; covariance function $r(s, t) = \mathrm{Cov}(W_s, W_t)$. The GP is called centered, or zero-mean, if $m(t) = 0$ for all $t \in T$. 24 / 50

Gaussian process priors - 2. For $a_1, \ldots, a_n \in \mathbb{R}$ and $t_1, \ldots, t_n \in T$, $\sum_{i,j} a_i a_j r(t_i, t_j) = \mathrm{Var}\big(\sum_i a_i W_{t_i}\big) \ge 0$, hence $r$ is a positive definite, symmetric function on $T \times T$. Theorem. Let $T$ be a set, $m : T \to \mathbb{R}$ a function and $r : T \times T \to \mathbb{R}$ a positive definite, symmetric function. Then there exists a Gaussian process with mean function $m$ and covariance function $r$. Proof: Kolmogorov's consistency theorem. 25 / 50
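
A minimal Python sketch of this construction on a finite grid, drawing centered sample paths from the finite-dimensional Gaussian distributions determined by a covariance kernel; the kernels, grid and jitter constant are illustrative:

```python
# Minimal sketch: sample paths of a centered GP on a grid, drawn from the
# multivariate normal finite-dimensional distributions of a covariance kernel.
import numpy as np

def gp_sample_paths(kernel, grid, n_paths=10, seed=None):
    rng = np.random.default_rng(seed)
    cov = kernel(grid[:, None], grid[None, :])       # covariance matrix r(t_i, t_j)
    cov += 1e-10 * np.eye(len(grid))                 # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(grid)), cov, size=n_paths)

# Brownian motion: r(s, t) = min(s, t); squared exponential: exp(-(s - t)^2).
brownian = lambda s, t: np.minimum(s, t)
sq_exp   = lambda s, t: np.exp(-(s - t) ** 2)

grid = np.linspace(0.0, 1.0, 200)
paths = gp_sample_paths(brownian, grid, n_paths=10, seed=0)
```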

Gaussian process priors: examples - 1. Brownian motion: $m(t) = 0$, $r(s, t) = s \wedge t$. Regularity: 1/2. [Figure: a Brownian motion sample path on $[0, 1]$.] 26 / 50

Gaussian process priors: examples - 2. Integrated Brownian motion: $\int_0^t W_s\, ds$, for $W$ a Brownian motion. $m(t) = 0$, $r(s, t) = \frac{s^2 t}{2} - \frac{s^3}{6}$ for $s \le t$. Regularity: 3/2. [Figure: an integrated Brownian motion sample path on $[0, 1]$.] 27 / 50

Gaussian process priors: examples - 3. By Fubini and integration by parts, $\int_0^t \int_0^{t_n} \cdots \int_0^{t_2} W_{t_1}\, dt_1\, dt_2 \cdots dt_n = \frac{1}{(n-1)!} \int_0^t (t-s)^{n-1} W_s\, ds = \frac{1}{n!} \int_0^t (t-s)^n\, dW_s$. The Riemann-Liouville process with parameter $\alpha > 0$: $W^\alpha_t = \int_0^t (t-s)^{\alpha - 1/2}\, dW_s$. The process has regularity $\alpha$. 28 / 50

Gaussian process priors: examples - 4. Consider a centered Gaussian process $W = (W_t : t \in T)$, with $T \subset \mathbb{R}^d$, such that $\mathbb{E} W_s W_t = r(t - s)$, $s, t \in T$, for a continuous $r : \mathbb{R}^d \to \mathbb{R}$. Such a process is called stationary, or homogeneous. By Bochner's theorem: $r(t) = \int_{\mathbb{R}^d} e^{i \langle \lambda, t \rangle}\, \mu(d\lambda)$, for a finite Borel measure $\mu$, called the spectral measure of the process. 29 / 50

Gaussian process priors: examples - 5. The squared exponential process: $r(s, t) = \exp(-\|t - s\|^2)$. Spectral measure: $2^{-d} \pi^{-d/2} \exp(-\|\lambda\|^2/4)\, d\lambda$. Regularity: $\infty$. [Figure: a squared exponential sample path on $[0, 1]$.] 30 / 50

Gaussian process priors: examples - 6. The Matérn process: $\mu(d\lambda) \propto (1 + \|\lambda\|^2)^{-(\alpha + d/2)}\, d\lambda$, $\alpha > 0$. Covariance function: $r(s, t) = \frac{2^{1-\alpha}}{\Gamma(\alpha)} \|t - s\|^\alpha K_\alpha(\|t - s\|)$, where $K_\alpha$ is the modified Bessel function of the second kind of order $\alpha$. Regularity: $\alpha$. For $d = 1$, $\alpha = 1/2$, one gets the Ornstein-Uhlenbeck process. 31 / 50
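
A minimal sketch of the two stationary covariance functions above (squared exponential and Matérn, the latter via scipy's modified Bessel function $K_\alpha$); the unit length scale follows the slides' normalization, and the check at $\alpha = 1/2$ illustrates the Ornstein-Uhlenbeck case:

```python
# Minimal sketch of the squared exponential and Matérn covariance functions.
import numpy as np
from scipy.special import kv, gamma

def squared_exponential(s, t):
    return np.exp(-np.abs(t - s) ** 2)

def matern(s, t, alpha=1.5):
    d = np.abs(t - s)
    d = np.where(d == 0, 1e-12, d)            # K_alpha blows up at 0; r(t, t) = 1
    return (2 ** (1 - alpha) / gamma(alpha)) * d ** alpha * kv(alpha, d)

# For alpha = 1/2 the Matérn kernel reduces to exp(-|t - s|) (Ornstein-Uhlenbeck).
print(matern(0.0, 1.0, alpha=0.5), np.exp(-1.0))
```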

Gaussian process regression - 1. Observations: $X_i = f(t_i) + \varepsilon_i$, with $t_i \in [0, 1]$ fixed and $\varepsilon_i$ independent $N(0, 1)$. Prior on $f$: the law of a centered GP with covariance function $r$. Posterior: this prior is conjugate for this model: $(f(t_1), \ldots, f(t_n)) \mid X_1, \ldots, X_n \sim N_n\big((I + \Sigma^{-1})^{-1} X, (I + \Sigma^{-1})^{-1}\big)$, where $\Sigma$ is the matrix with $\Sigma_{ij} = r(t_i, t_j)$. 32 / 50
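
A minimal Python sketch of this conjugate posterior at the design points, computed in the algebraically equivalent form $\Sigma(\Sigma + I)^{-1}$ for numerical convenience; the kernel and simulated data are illustrative:

```python
# Minimal sketch of the GP regression posterior at the design points for unit
# noise variance: N((I + Sigma^{-1})^{-1} X, (I + Sigma^{-1})^{-1}).
import numpy as np

def gp_posterior_at_design(t, X, kernel):
    n = len(t)
    Sigma = kernel(t[:, None], t[None, :])             # Sigma_ij = r(t_i, t_j)
    A = np.linalg.solve(Sigma + np.eye(n), np.eye(n))  # (Sigma + I)^{-1}
    post_mean = Sigma @ A @ X                          # = (I + Sigma^{-1})^{-1} X
    post_cov = Sigma - Sigma @ A @ Sigma               # = (I + Sigma^{-1})^{-1}
    return post_mean, post_cov

# Example with a squared exponential kernel and simulated data.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
X = np.sin(2 * np.pi * t) + rng.standard_normal(200)   # f(t_i) + N(0, 1) noise
mean, cov = gp_posterior_at_design(t, X, lambda s, u: np.exp(-(s - u) ** 2))
```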

Gaussian process regression - 2. Data: 200 simulated data points. Prior: a multiple of integrated Brownian motion. [Figure: left, 10 draws from the prior; right, 10 draws from the posterior.] 33 / 50

Nonparametric priors - Conditionally Gaussian processes 34 / 50

CGPs - 1. Observation about GPs: families of GPs typically depend on auxiliary parameters: hyperparameters. Performance can heavily depend on the tuning of these parameters. How to choose values of the hyperparameters? 35 / 50

CGPs - 2. Regression with a squared exponential GP with covariance $(x, y) \mapsto \exp(-(x - y)^2/\ell^2)$, for different length scale hyperparameters $\ell$. [Figure: two regression fits, one with $\ell$ too small and one with $\ell$ correct.] 36 / 50

CGPs - 3. Q: How to choose the best values of the hyperparameters? A: Let the data decide! Possible approaches: put a prior on the hyperparameters as well (full Bayes), or estimate the hyperparameters (empirical Bayes). 37 / 50

CGPs - 4. Squared exponential GP with gamma length scale: $\ell \sim \Gamma(a, b)$, $f \mid \ell \sim$ GP with covariance $(x, y) \mapsto \exp(-(x - y)^2/\ell^2)$. This is an example of a hierarchical prior; the prior is only conditionally Gaussian. Q: does this solve the bias-variance issue? 38 / 50
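
A minimal Python sketch of drawing from this hierarchical prior: first the gamma length scale, then a GP path given that length scale. The shape/rate values (with $\Gamma(a, b)$ read as rate $b$), the grid, and the jitter constant are illustrative assumptions:

```python
# Minimal sketch of a draw from the conditionally Gaussian prior:
# l ~ Gamma(a, b), then f | l ~ GP with squared exponential covariance.
import numpy as np

def draw_conditionally_gaussian_path(grid, a=2.0, b=2.0, seed=None):
    rng = np.random.default_rng(seed)
    length_scale = rng.gamma(shape=a, scale=1.0 / b)          # l ~ Gamma(a, b), rate b
    cov = np.exp(-((grid[:, None] - grid[None, :]) / length_scale) ** 2)
    cov += 1e-10 * np.eye(len(grid))                          # numerical jitter
    f = rng.multivariate_normal(np.zeros(len(grid)), cov)     # f | l ~ GP
    return length_scale, f

grid = np.linspace(0.0, 1.0, 200)
length_scale, f = draw_conditionally_gaussian_path(grid, seed=0)
```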

Nonparametric priors - Dirichlet mixtures 39 / 50

DP mixture priors - 1. Idea: consider location/scale mixtures of Gaussians of the form $p_G(x) = \int \varphi_\sigma(x - \mu)\, G(d\mu, d\sigma)$, where $\varphi_\sigma(\cdot - \mu)$ is the $N(\mu, \sigma^2)$-density and $G$ is a probability measure (the mixing measure). Construct a prior on densities by making $G$ random. 40 / 50
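
A minimal Python sketch of one random density $p_G$ drawn from a DP mixture of normals prior, using truncated stick-breaking for $G$; the base measure on $(\mu, \sigma)$ and all parameter values are illustrative choices, not the slides':

```python
# Minimal sketch of a draw p_G from a DP mixture of normals prior: G is drawn
# by truncated stick-breaking from DP(M * G0), with an illustrative base
# measure G0 (normal for mu, inverse gamma for sigma^2).
import numpy as np
from scipy.stats import norm

def draw_dp_mixture_density(M=1.0, truncation=200, seed=None):
    rng = np.random.default_rng(seed)
    sticks = rng.beta(1.0, M, size=truncation)
    weights = sticks * np.concatenate(([1.0], np.cumprod(1.0 - sticks)[:-1]))
    mus = rng.normal(0.0, 3.0, size=truncation)                   # mu_j ~ N(0, 3^2)
    sigmas = np.sqrt(1.0 / rng.gamma(2.0, 1.0, size=truncation))  # 1/sigma_j^2 ~ Gamma(2, 1)
    def p_G(x):
        x = np.asarray(x, dtype=float)[..., None]
        return np.sum(weights * norm.pdf(x, loc=mus, scale=sigmas), axis=-1)
    return p_G

p_G = draw_dp_mixture_density(seed=0)
density = p_G(np.linspace(-10, 10, 400))   # one random density from the prior
```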

DP mixture priors - 2. Drawing $p_G$ from a Gaussian DP mixture prior: $G \sim \mathrm{DP}(G_0)$ ($G_0$ often normal-inverse-Wishart), then form the density $p_G$. [Figure: draws of $p_G$ from the prior.] Another example of a hierarchical prior. 41 / 50

DP mixture density estimation. Data: 272 waiting times between geyser eruptions. Prior: DP mixture of normals. [Figure: posterior mean density estimate over the waiting times.] 42 / 50

Some more examples 43 / 50

Estimating the drift of a diffusion. Observation model: $dX_t = b(X_t)\, dt + dW_t$. Goal: estimate $b$. Prior: $s \sim \mathrm{IG}(a, b)$, $J \sim \mathrm{Pois}(\lambda)$, $b \mid s, J \sim s \sum_{j=1}^{J} j^{-2} Z_j e_j$, with $e_j$ a Fourier basis and $Z_j \sim N(0, 1)$. [Figure: comparison of drift estimates on the butane dihedral angle data; red solid: the Fourier prior (parameter 1.5); blue dashed: results of Papaspiliopoulos et al. (2012); posterior means with marginal 68% credible bands; right panel: histogram of the data.] [Van der Meulen, Schauer, vZ. (2014)] 44 / 50
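
A minimal Python sketch, under the prior as reconstructed above, of drawing the drift $b$ (inverse-gamma scale, Poisson number of terms, standard normal coefficients on a Fourier basis); the particular basis functions and all parameter values are illustrative assumptions, not the choices of the cited paper:

```python
# Minimal sketch of a draw of the drift b from the random series prior:
# s ~ IG(a, b), J ~ Pois(lambda), b | s, J = s * sum_{j<=J} j^{-2} Z_j e_j.
import numpy as np

def draw_drift(a=2.0, b_rate=1.0, lam=10.0, seed=None):
    rng = np.random.default_rng(seed)
    s = 1.0 / rng.gamma(shape=a, scale=1.0 / b_rate)   # s ~ IG(a, b)
    J = max(rng.poisson(lam), 1)                       # J ~ Pois(lambda)
    Z = rng.standard_normal(J)                         # Z_j ~ N(0, 1)
    def drift(x):
        x = np.asarray(x, dtype=float)[..., None]
        j = np.arange(1, J + 1)
        basis = np.sin(j * x)                          # illustrative Fourier basis e_j
        return s * np.sum(j ** -2.0 * Z * basis, axis=-1)
    return drift

b_draw = draw_drift(seed=0)
values = b_draw(np.linspace(0, 2 * np.pi, 100))
```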

Nonparametric estimation of a Poisson intensity. Observation model: counts from an inhomogeneous Poisson process with periodic intensity $\lambda$. Goal: estimate $\lambda$. Prior: B-spline expansion with priors on knots and coefficients. [Figure: posterior distribution of the intensity function over a 24-hour period, based on a thinned dataset retaining about 1,000 counts; the uncertainty in the posterior becomes clearly visible. Top panel: posterior mean (blue) with point-wise 95% credible intervals (red). Lower panel: histogram of the posterior distribution of the knot locations.] [Belitser, Serra, vZ. (2015)] 45 / 50

Binary prediction on a graph. Observation model: $P(Y_i = 1) = \Psi(f(i))$, for $f : G \to \mathbb{R}$. Prior: conditionally Gaussian with precision $L^p$, where $L$ is the graph Laplacian. [Hartog, vZ. (in prep.)] 46 / 50

Concluding remarks 47 / 50

Take home from Lecture I. Within the Bayesian paradigm it is perfectly possible and natural to deal with nonparametric statistical problems. Many nonparametric priors have been proposed and studied: DPs, GPs, DP mixtures, series expansions, ... Numerical techniques have been developed to sample from the corresponding posteriors. In a variety of statistical settings, the results can be quite satisfactory. Some (theoretical) questions: Do these procedures do what we expect them to do? Why/why not? Do they have desirable properties like consistency? Can we say more about performance, e.g. about (optimal) convergence rates? 48 / 50

Some references for Lecture I - 1. DP: Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1, 209–230. DP mixtures: Ferguson, T. S. (1983). Bayesian density estimation by mixtures of normal distributions. In Recent Advances in Statistics, ed. M. Rizvi et al., 287–302. Escobar, M. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577–588. MacEachern, S. N. and Muller, P. (1998). Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics, 7(2), 223–338. Neal, R. M. (2000). Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9, 249–265. 49 / 50

Some references for Lecture I - 2. GP priors: Lenk, P. J. (1988). The logistic normal distribution for Bayesian, nonparametric, predictive densities. Journal of the American Statistical Association, 83, 509–516. Lenk, P. J. (1991). Towards a practicable Bayesian nonparametric density estimator. Biometrika, 78, 531–543. Rasmussen, C. E. and Williams, C. K. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA. General text: Hjort, N. L., et al., eds. (2010). Bayesian Nonparametrics. Vol. 28. Cambridge University Press. 50 / 50