Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu

1 Lecture 16-17: Bayesian Nonparametrics I STAT 6474 Instructor: Hongxiao Zhu

2 Plan for today: Why Bayesian nonparametrics? Dirichlet distribution and Dirichlet processes.

3 Parameters and Patterns (reference: Peter Orbanz, NIPS tutorial). Parameters == patterns. Data = underlying patterns + noise.

4 Parametric vs. nonparametric models. Parametric model: the number of parameters is fixed with respect to the sample size. Nonparametric model: the data distribution is defined through an infinite-dimensional parameter θ, or the dimension of θ can grow with the sample size.

5 Example 1: Density estimation. Option 1: fit one Gaussian to all data points (two parameters). Option 2: pin one Gaussian with fixed variance to each data point (a mixture of Gaussians).
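The contrast in Example 1 can be sketched numerically. The snippet below is a minimal illustration, not the lecture's own code: the data set, the bandwidth value, and all function names are hypothetical. It fits a single Gaussian (two parameters, whatever the sample size) and compares it with a mixture that pins one fixed-variance Gaussian to each data point (one component per observation).

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of N(mean, sd^2) evaluated at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_single_gaussian(data):
    """Parametric fit: two parameters (mean, sd), regardless of sample size."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    return mean, math.sqrt(var)

def pinned_gaussians_pdf(x, data, bandwidth):
    """Nonparametric fit: one fixed-variance Gaussian pinned to each data point."""
    return sum(gaussian_pdf(x, xi, bandwidth) for xi in data) / len(data)

# Hypothetical bimodal sample, chosen only for illustration.
data = [-2.1, -1.9, -2.0, 1.8, 2.0, 2.2]
mean, sd = fit_single_gaussian(data)

single_at_0 = gaussian_pdf(0.0, mean, sd)                    # single Gaussian puts its peak between the modes
pinned_at_0 = pinned_gaussians_pdf(0.0, data, bandwidth=0.3)  # almost no mass between the modes
pinned_at_2 = pinned_gaussians_pdf(2.0, data, bandwidth=0.3)  # high density near the observed cluster
```

With this sample, the single Gaussian places most of its mass near 0, where no data were observed, while the pinned mixture follows the two clusters. The price is that the number of mixture components grows with the sample size.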

6 Nonparametric Bayesian methods: a Bayesian model on an infinite-dimensional parameter space. How do we interpret it?

7 When the parameter is infinite-dimensional, we need to define priors on an infinite-dimensional parameter space. These priors are often stochastic processes on spaces of densities and functions: Gaussian process priors, Dirichlet process priors (most popular), and many others.

8 To understand Dirichlet process priors, we start with inference using Dirichlet distribution priors.

10 Dirichlet Distribution: the Dirichlet distribution is a distribution over a vector of probabilities. Continuous or discrete? Special case?
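For reference, the standard Dirichlet density can be written as below. The symbols p = (p_1, ..., p_m) for the probability vector and α_1, ..., α_m for the parameters are notation assumed here and may differ from the slide's.

```latex
f(p_1,\dots,p_m \mid \alpha_1,\dots,\alpha_m)
  = \frac{\Gamma\!\left(\sum_{j=1}^{m}\alpha_j\right)}{\prod_{j=1}^{m}\Gamma(\alpha_j)}
    \prod_{j=1}^{m} p_j^{\alpha_j-1},
\qquad p_j \ge 0,\quad \sum_{j=1}^{m} p_j = 1,\quad \alpha_j > 0 .
```

The distribution is continuous on the probability simplex, even though each draw p is itself a discrete distribution over m categories. With m = 2 it reduces to the Beta(α_1, α_2) distribution.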

11 Example 2. We have the word counts of n = 1740 papers from the NIPS conference. Consider m = 1000 words. We will use a Bayesian model to estimate the probability of observing each word. Assume the counts from different papers are i.i.d. A real data set: NIPS Conference Papers, Vols. 0-12. Text-mining preprocessing: stemming (reducing word variants to a common stem); corpus importance metric (tf-idf: term frequency-inverse document frequency).

12 Re-parameterization of the Dirichlet distribution: write the parameter vector as a precision parameter times a base measure (a probability vector). Why do we re-parameterize?

13 Re-parameterization of the Dirichlet distribution (cont'd): the density function in the new parameterization.
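A sketch of the re-parameterized density, assuming the common convention in which α denotes the precision parameter and g_0 = (g_{01}, ..., g_{0m}) the base probability vector, so that the original Dirichlet parameters equal α g_{0j}. These symbols are assumptions and may not match the slide's notation.

```latex
f(p \mid \alpha, g_0)
  = \frac{\Gamma(\alpha)}{\prod_{j=1}^{m}\Gamma(\alpha\, g_{0j})}
    \prod_{j=1}^{m} p_j^{\alpha g_{0j}-1},
\qquad
\mathbb{E}[p_j] = g_{0j},
\qquad
\operatorname{Var}(p_j) = \frac{g_{0j}\,(1-g_{0j})}{\alpha+1}.
```

Under this convention the base measure sets the prior mean and the precision controls how tightly p concentrates around that mean, which is one common motivation for re-parameterizing.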

15 Image plot of the Dirichlet distribution. Why the triangle shape? How does the mode move?
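The two questions on this slide can be explored with a small computation. The sketch below uses the standard formula for the Dirichlet mode, which holds when every parameter exceeds 1. The function name and the example parameter values are hypothetical.

```python
def dirichlet_mode(alpha):
    """Mode of Dirichlet(alpha); the formula requires every alpha_j > 1."""
    if any(a <= 1.0 for a in alpha):
        raise ValueError("mode formula requires all alpha_j > 1")
    denom = sum(alpha) - len(alpha)
    return [(a - 1.0) / denom for a in alpha]

# Hypothetical parameter vectors for a three-category Dirichlet.
symmetric_mode = dirichlet_mode([2.0, 2.0, 2.0])  # centre of the simplex
skewed_mode = dirichlet_mode([5.0, 2.0, 2.0])     # pulled toward the first corner
```

With three categories the vector p lies on the two-dimensional simplex, which is a triangle when plotted. Increasing one parameter relative to the others moves the mode toward the corresponding corner.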

16 What if the sample space is continuous?

18 Lecture 16 ends here.

19 Lecture 17: Plan for today. Definition of the Dirichlet process (DP). Three different metaphors for the DP: stick-breaking representation, Polya urn sampler, Chinese restaurant process representation.

20 Dirichlet Distribution. Key: the Dirichlet distribution is a distribution over distributions.

21 Reparameterization of the Dirichlet distribution: precision parameter and base measure.

22 Posterior of p under a discrete likelihood
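The conjugate update behind this slide can be sketched as follows, assuming a multinomial (categorical) likelihood over m categories: a Dirichlet(α_1, ..., α_m) prior combined with observed counts n_1, ..., n_m gives a Dirichlet(α_1 + n_1, ..., α_m + n_m) posterior. The function names, the prior, and the counts below are hypothetical.

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters after observing category counts."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + n for a, n in zip(alpha, counts)]

def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: each parameter divided by the total."""
    total = sum(alpha)
    return [a / total for a in alpha]

# Hypothetical three-word vocabulary with a uniform prior.
prior = [1.0, 1.0, 1.0]
counts = [5, 0, 15]
posterior = dirichlet_posterior(prior, counts)
posterior_mean = dirichlet_mean(posterior)
```

Note that the word with zero observed count still receives positive posterior probability, because the prior contributes one pseudo-count per category in this example.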

23 Dirichlet distribution through the Polya urn model. The last equation is the posterior predictive probability in Example 3.
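A minimal simulation of a finite Polya urn, assuming the usual scheme: the urn starts with weight α_j on color j, a color is drawn with probability proportional to its current weight, and the drawn color's weight is then increased by one. Under that scheme the probability of the next color is (α_j + n_j) / (α + n), where α is the sum of the α_j. Function names and parameter values are hypothetical.

```python
import random

def polya_urn(alpha, n_draws, rng):
    """Draw colors sequentially; each draw reinforces the drawn color by 1."""
    weights = list(alpha)
    draws = []
    for _ in range(n_draws):
        color = rng.choices(range(len(weights)), weights=weights)[0]
        draws.append(color)
        weights[color] += 1.0
    return draws, weights

def predictive(alpha, counts):
    """Probability of each color on the next draw: (alpha_j + n_j) / (alpha + n)."""
    total = sum(alpha) + sum(counts)
    return [(a + n) / total for a, n in zip(alpha, counts)]

alpha = [1.0, 1.0, 1.0]  # hypothetical initial urn composition
draws, weights = polya_urn(alpha, n_draws=50, rng=random.Random(0))
counts = [draws.count(j) for j in range(len(alpha))]
next_draw_probs = predictive(alpha, counts)
```

The predictive probabilities coincide with the normalized urn weights after the draws, so running the urn never requires sampling the probability vector p explicitly.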

24 To sum up: the Dirichlet distribution can be thought of as a distribution over a probability measure p. The Dirichlet distribution has two parameters: the precision parameter and the base measure. The Polya urn scheme is a way to sample from the marginal likelihood.

25 So, the Dirichlet distribution prior is a distribution over distributions. We now extend the Dirichlet distribution to the case when the number of categories becomes infinite.

26 Dirichlet Process: the Dirichlet process is defined through finite partitions of the sample space (Ferguson, 1973).
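In symbols, Ferguson's partition-based definition is usually stated as below. The notation α for the precision parameter and A_1, ..., A_k for the partition sets is assumed here; G_0 denotes the base measure.

```latex
G \sim \mathrm{DP}(\alpha, G_0)
\iff
\bigl(G(A_1),\dots,G(A_k)\bigr) \sim
\mathrm{Dirichlet}\bigl(\alpha G_0(A_1),\dots,\alpha G_0(A_k)\bigr)
\quad\text{for every finite measurable partition } (A_1,\dots,A_k) \text{ of the sample space.}
```

Each finite partition turns the random measure G into an ordinary Dirichlet-distributed probability vector, which is how the finite-dimensional case carries over to a continuous sample space.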

28 See R code.

30 Three metaphors (equivalent definitions/interpretations) of Dirichlet processes: stick-breaking process, Polya urn process, Chinese restaurant process. You can also think of these metaphors as different versions of the definition of the DP.

31 Dirichlet Process -- the Stick-breaking Process Metaphor. The Dirichlet process can be defined through stick breaking (Sethuraman, 1994, Statistica Sinica).

32 Stick-breaking interpretation of the DP (figure: a unit stick running from 0 to 1). It provides a way to construct a DP.

33 In practice, one can approximate the infinite sum by truncating it at a finite number of terms.
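A truncated stick-breaking draw can be sketched as follows, assuming the standard construction: V_k ~ Beta(1, α), weight π_k = V_k (1 - V_1) ... (1 - V_{k-1}), and atoms θ_k drawn i.i.d. from G_0. The truncation rule used here (giving the leftover stick to the last atom so the weights sum to one) is one common choice and an assumption of this sketch, as are the function name, the standard normal base measure, and the parameter values.

```python
import random

def stick_breaking(alpha, base_sampler, truncation, rng):
    """Truncated stick-breaking draw: returns (weights, atoms) of equal length."""
    weights, atoms = [], []
    remaining = 1.0  # length of the stick not yet broken off
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick to break off
        weights.append(remaining * v)
        atoms.append(base_sampler(rng))
        remaining *= 1.0 - v
    # Assumption: the leftover stick goes to a final atom so the weights sum to 1.
    weights.append(remaining)
    atoms.append(base_sampler(rng))
    return weights, atoms

def standard_normal(rng):
    """Hypothetical base measure G0 = N(0, 1)."""
    return rng.gauss(0.0, 1.0)

weights, atoms = stick_breaking(alpha=2.0, base_sampler=standard_normal,
                                truncation=50, rng=random.Random(0))
```

Smaller values of alpha tend to break off large pieces early, so a few atoms carry most of the mass; larger values spread the mass over many atoms and call for a larger truncation level.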

34 A nice cartoon from a Bayesian pony about the stick-breaking process.

35 Dirichlet Process -- the Polya Urn Process Metaphor. See Blackwell and MacQueen, "Ferguson distributions via Polya urn schemes" (1973).

36 Note that we can have an infinite number of colors as n approaches infinity if the sample space of G0 is infinite.

37 The Polya urn provides a way to sample from a DP. Each draw comes from a mixture of two options: either a new value from G0, or a value equal to one of the previous values.
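The two-option mixture described on this slide can be sketched as a sampler, assuming the usual Blackwell-MacQueen weights: with i values already drawn, the next value is a fresh draw from G_0 with probability α / (α + i), and otherwise a uniformly chosen copy of one of the previous values. The function name, the standard normal base measure, and the parameter values are hypothetical.

```python
import random

def blackwell_macqueen(alpha, base_sampler, n, rng):
    """Sequentially sample n values whose joint law is that of draws from a DP-distributed G."""
    values = []
    for i in range(n):
        # With i previous values: new atom w.p. alpha / (alpha + i), else repeat an old one.
        if rng.random() < alpha / (alpha + i):
            values.append(base_sampler(rng))
        else:
            values.append(rng.choice(values))
    return values

def standard_normal(rng):
    """Hypothetical base measure G0 = N(0, 1)."""
    return rng.gauss(0.0, 1.0)

values = blackwell_macqueen(alpha=1.0, base_sampler=standard_normal,
                            n=200, rng=random.Random(0))
n_distinct = len(set(values))
```

Because old values are copied exactly, the sample contains ties even though G_0 is continuous; the number of distinct values typically grows slowly with n.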

38 Dirichlet Process -- the Chinese Restaurant Process Metaphor

39 Note that the Chinese restaurant process describes the distribution of cluster assignments.
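A minimal simulation of the seating rule, assuming the standard Chinese restaurant process: with i customers already seated, the next customer joins an existing table j with probability n_j / (α + i) and opens a new table with probability α / (α + i). The function name and parameter values are hypothetical.

```python
import random

def chinese_restaurant_process(alpha, n, rng):
    """Seat n customers; returns (table index per customer, size of each table)."""
    assignments = []
    table_sizes = []
    for _ in range(n):
        # Existing tables are weighted by their size, a new table by alpha.
        weights = table_sizes + [alpha]
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(table_sizes):
            table_sizes.append(1)   # open a new table
        else:
            table_sizes[table] += 1
        assignments.append(table)
    return assignments, table_sizes

assignments, table_sizes = chinese_restaurant_process(alpha=1.0, n=100,
                                                      rng=random.Random(0))
```

Only the table labels are generated here, which matches the point of the slide: the process describes cluster assignments, and cluster-specific values would be attached separately by drawing one value from G0 per table.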

41 Exercise: compute the probability of this arrangement. Then permute the labels of the customers (not the table arrangement) and calculate the probability again. See how the probability changes.

42 This is the marginal distribution of S in a DP. K is the number of clusters formed by the first n samples, and n_j is the number of samples in the jth cluster.
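The partition probability can be checked numerically. The sketch below assumes the standard closed form for the Chinese restaurant process, α^K (n_1 - 1)! ... (n_K - 1)! divided by α (α + 1) ... (α + n - 1), and compares it with the product of sequential seating probabilities for two customer orderings of the same partition. The seating sequences and function names are hypothetical.

```python
import math

def sequential_probability(assignments, alpha):
    """Multiply the seating probabilities customer by customer."""
    prob = 1.0
    sizes = {}
    for i, table in enumerate(assignments):
        if table in sizes:
            prob *= sizes[table] / (alpha + i)   # join an occupied table
        else:
            prob *= alpha / (alpha + i)          # open a new table
        sizes[table] = sizes.get(table, 0) + 1
    return prob

def closed_form_probability(cluster_sizes, alpha):
    """alpha^K * prod_j (n_j - 1)! / prod_{i=0}^{n-1} (alpha + i)."""
    n = sum(cluster_sizes)
    numerator = alpha ** len(cluster_sizes)
    for size in cluster_sizes:
        numerator *= math.factorial(size - 1)
    denominator = 1.0
    for i in range(n):
        denominator *= alpha + i
    return numerator / denominator

# Two hypothetical customer orderings with the same table sizes (3 and 2).
seating = ["A", "A", "B", "A", "B"]
reordered = ["B", "A", "A", "B", "A"]
p_seating = sequential_probability(seating, alpha=1.0)
p_reordered = sequential_probability(reordered, alpha=1.0)
p_closed = closed_form_probability([3, 2], alpha=1.0)
```

All three quantities agree, which illustrates exchangeability: the probability depends only on the number of clusters and their sizes, not on the order in which customers arrive.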

43 Summary: stick breaking, Polya urn, Chinese restaurant.
