Bayesian Nonparametrics: Models Based on the Dirichlet Process


1 Bayesian Nonparametrics: Models Based on the Dirichlet Process. Alessandro Panella, Department of Computer Science, University of Illinois at Chicago. Machine Learning Seminar Series, February 18, 2013.

2 Sources and Inspirations. Tutorials (slides): P. Orbanz and Y.W. Teh, Modern Bayesian Nonparametrics, NIPS; M. Jordan, Dirichlet Process, Chinese Restaurant Process, and All That, NIPS. Articles etc.: E.B. Sudderth, chapter in PhD thesis; E. Fox, chapter in PhD thesis; Y.W. Teh, Dirichlet Processes, Encyclopedia of Machine Learning, Springer; ...

3 Outline. 1 Introduction and background: Bayesian learning; Nonparametric models. 2 Finite mixture models: Bayesian models; Clustering with FMMs; Inference. 3 Dirichlet process mixture models: Going nonparametric!; The Dirichlet process; DP mixture models; Inference. 4 A little more theory...: De Finetti's REDUX; Dirichlet process REDUX. 5 The hierarchical Dirichlet process.

5 The meaning of it all (Introduction and background, Bayesian learning). BAYESIAN NONPARAMETRICS.

8 Bayesian statistics (Introduction and background, Bayesian learning). Estimate a parameter θ ∈ Θ after observing data x. Frequentist: Maximum Likelihood (ML): θ̂_MLE = argmax_θ p(x | θ) = argmax_θ L(θ : x). Bayesian: Bayes rule: p(θ | x) = p(x | θ) p(θ) / p(x). Bayesian prediction (using the whole posterior, not just one estimator): p(x_new | x) = ∫_Θ p(x_new | θ) p(θ | x) dθ. Maximum A Posteriori (MAP): θ̂_MAP = argmax_θ p(x | θ) p(θ).

11 Introduction and background, Bayesian learning. De Finetti's theorem. A premise: Definition. An infinite sequence of random variables (x_1, x_2, ...) is said to be (infinitely) exchangeable if, for every N and every permutation π on (1, ..., N), p(x_1, x_2, ..., x_N) = p(x_π(1), x_π(2), ..., x_π(N)). Note: exchangeable does not mean i.i.d.! Example (Polya urn): an urn contains some red balls and some black balls; an infinite sequence of colors is drawn recursively as follows: draw a ball, mark down its color, then put the ball back in the urn along with an additional ball of the same color.

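The Polya urn can be simulated directly. Below is a minimal Python sketch (not from the slides; the starting composition of one red and one black ball is an illustrative assumption) that draws many length-3 sequences and checks that orderings with the same color counts occur with roughly the same frequency, which is exactly what exchangeability asserts.

```python
import random
from collections import Counter

def polya_urn_sequence(n, red=1, black=1, seed=None):
    """Draw n colors from a Polya urn: draw a ball, record its color, then
    return it to the urn together with one extra ball of the same color."""
    rng = random.Random(seed)
    colors = []
    for _ in range(n):
        color = 'R' if rng.random() < red / (red + black) else 'B'
        colors.append(color)
        if color == 'R':
            red += 1
        else:
            black += 1
    return ''.join(colors)

# Sequences with the same color counts should be (approximately) equally frequent,
# even though the draws are clearly not independent.
counts = Counter(polya_urn_sequence(3) for _ in range(200_000))
for seq in ('RRB', 'RBR', 'BRR'):
    print(seq, round(counts[seq] / 200_000, 4))   # each close to 1/12
```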

13 Introduction and background, Bayesian learning. De Finetti's theorem (cont'd). Theorem (De Finetti, a.k.a. Representation Theorem). A sequence of random variables (x_1, x_2, ...) is infinitely exchangeable if and only if, for all N, there exist a random variable θ and a probability measure p on it such that p(x_1, x_2, ..., x_N) = ∫_Θ p(θ) ∏_{i=1}^N p(x_i | θ) dθ, i.e., there exists a parameter space and a measure on it that make the variables i.i.d.! The representation theorem motivates (and encourages!) the use of Bayesian statistics.

15 Introduction and background, Bayesian learning. Bayesian learning. Hypothesis space H. Given data D, compute p(h | D) = p(D | h) p(h) / p(D). Then we probably want to predict some future data D', by either: averaging over H, i.e. p(D' | D) = ∫_H p(D' | h) p(h | D) dh; choosing the MAP h (or computing it directly), i.e. p(D' | D) = p(D' | h_MAP); sampling from the posterior; ... H can be anything! Bayesian learning as a general learning framework. We will consider the case in which h is a probabilistic model itself, i.e. a parameter vector θ.

16 Introduction and background, Bayesian learning. A simple example. Infer the bias θ ∈ [0, 1] of a coin after observing N tosses. H = 1, T = 0, p(x = 1 | θ) = θ, hence the hypothesis space is [0, 1]. Sequence of Bernoulli trials: p(x_1, ..., x_N | θ) = θ^{n_H} (1 − θ)^{N − n_H}, where n_H = # heads. Unknown θ: p(x_1, ..., x_N) = ∫_0^1 θ^{n_H} (1 − θ)^{N − n_H} p(θ) dθ. Need to find a good prior p(θ)... Beta distribution!

17 Introduction and background, Bayesian learning. A simple example (cont'd). Beta distribution: θ ~ Beta(a, b), p(θ | a, b) = (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1}. Bayesian learning: p(h | D) ∝ p(D | h) p(h); for us: p(θ | x_1, ..., x_N) ∝ p(x_1, ..., x_N | θ) p(θ) = θ^{n_H} (1 − θ)^{n_T} · (1 / B(a, b)) θ^{a−1} (1 − θ)^{b−1} ∝ θ^{n_H + a − 1} (1 − θ)^{n_T + b − 1}, i.e. θ | x_1, ..., x_N ~ Beta(a + n_H, b + n_T). We're lucky! The Beta distribution is a conjugate prior to the binomial distribution. (Plots: Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3), Beta(10, 10).)

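As a quick illustration of this conjugate update, here is a hedged Python sketch (not part of the original slides; it assumes SciPy is available) that computes the Beta posterior for the coin example.

```python
from scipy import stats

def beta_posterior(tosses, a=1.0, b=1.0):
    """Beta(a + n_H, b + n_T) posterior over the coin bias, given tosses coded 1 = H, 0 = T."""
    n_h = sum(tosses)
    n_t = len(tosses) - n_h
    return stats.beta(a + n_h, b + n_t)

post = beta_posterior([1, 0, 1, 1], a=2, b=3)   # observe H T H H under a Beta(2, 3) prior
print(post.mean())                              # posterior mean of the bias
print(post.interval(0.95))                      # central 95% credible interval
```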

19 Introduction and background, Bayesian learning. A simple example (cont'd). Three sequences of four tosses: H T H H, H H H T, H H H H. (Plots of the corresponding posteriors.)

20 Introduction and background, Nonparametric models. Nonparametric models. Nonparametric doesn't mean no parameters! Rather: the number of parameters grows as more data are observed; ∞-dimensional parameter space; finite data ⇒ bounded number of parameters. Definition: a nonparametric model is a Bayesian model on an ∞-dimensional parameter space. Example: density estimation, parametric vs. nonparametric (figure from Orbanz and Teh, NIPS 2011).

24 Finite mixture models, Bayesian models. Models in Bayesian data analysis. Model = generative process: expresses how we think the data are generated; contains hidden variables (the subject of learning); specifies relations between variables, e.g. graphical models. Posterior inference: knowing p(D | M, θ), i.e. how the data are generated, compute p(θ | M, D); akin to reversing the generative process.

25 Finite mixture models, Clustering with FMMs. Finite mixture models (FMMs). Bayesian approach to clustering: each data point is assumed to belong to one of K clusters. General form: a sequence of data points x = (x_1, ..., x_N), each with probability p(x_i | π, θ_1, ..., θ_K) = Σ_{k=1}^K π_k f(x_i | θ_k), where π lies in the (K−1)-simplex. Generative process: for each i, draw a cluster assignment z_i ~ π, then draw a data point x_i ~ F(θ_{z_i}).

26 Finite mixture models, Clustering with FMMs. FMMs (example). Mixture of univariate Gaussians: θ_k = (µ_k, σ_k), p(x_i | π, µ, σ) = Σ_{k=1}^K π_k f_N(x_i; µ_k, σ_k). Example: π = (0.15, 0.25, 0.6) with components N(1, 1), N(4, .5), N(6, .7).
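
A minimal Python/NumPy sketch (not from the slides) of the generative process for this example mixture; the sample size is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.15, 0.25, 0.6])          # mixing weights from the slide
mu = np.array([1.0, 4.0, 6.0])            # component means
sigma = np.array([1.0, 0.5, 0.7])         # component standard deviations

N = 1000
z = rng.choice(len(pi), size=N, p=pi)     # cluster assignments z_i ~ pi
x = rng.normal(mu[z], sigma[z])           # observations x_i ~ N(mu_{z_i}, sigma_{z_i})
print(np.bincount(z) / N)                 # empirical proportions, roughly pi
```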

27 Finite mixture models, Clustering with FMMs. FMMs (cont'd). Clustering with FMMs: we need priors for π and θ. Usually π is given a (symmetric) Dirichlet distribution prior, and the θ_k's are given a suitable prior H, depending on the data. Full model: π ~ Dir(α/K, ..., α/K); θ_k ~ H, k = 1, ..., K; z_i | π ~ π; x_i | θ, z_i ~ F(θ_{z_i}), i = 1, ..., N.

28 Finite mixture models, Clustering with FMMs. Dirichlet distribution. Multivariate generalization of the Beta distribution. (Density plots: Dir(1, 1, 1), Dir(2, 2, 2), Dir(5, 5, 5), Dir(5, 5, 2), Dir(5, 2, 2), Dir(0.7, 0.7, 0.7); from Teh, MLSC 2008.)

29 Finite mixture models, Clustering with FMMs. Dirichlet distribution (cont'd). π ~ Dir(α/K, ..., α/K) iff p(π_1, ..., π_K) = (Γ(α) / ∏_k Γ(α/K)) ∏_{k=1}^K π_k^{α/K − 1}. Conjugate prior to the categorical/multinomial: π ~ Dir(α/K, ..., α/K) and z_i | π ~ π (i = 1, ..., N) imply π | z_1, ..., z_N ~ Dir(α/K + n_1, α/K + n_2, ..., α/K + n_K). Moreover, p(z_1, ..., z_N | α) = (Γ(α) / Γ(α + N)) ∏_{k=1}^K Γ(n_k + α/K) / Γ(α/K), and p(z_i = k | z_{−i}, α) = (n_k^{(−i)} + α/K) / (α + N − 1).
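
These conjugacy and predictive formulas are easy to sanity-check numerically. The following sketch is illustrative only (α, K and N are arbitrary choices, not from the slides); it draws π and z as above and prints the Dirichlet posterior parameters and the predictive probabilities for the next assignment.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, K, N = 2.0, 3, 10

pi = rng.dirichlet(np.full(K, alpha / K))      # pi ~ Dir(alpha/K, ..., alpha/K)
z = rng.choice(K, size=N, p=pi)                # z_i | pi ~ pi
n = np.bincount(z, minlength=K)                # cluster counts n_k

print("posterior Dirichlet parameters:", alpha / K + n)          # Dir(alpha/K + n_1, ..., alpha/K + n_K)
print("predictive for a new z:", (n + alpha / K) / (alpha + N))  # p(z_{N+1} = k | z, alpha)
```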

30 Finite mixture models, Inference. Inference in FMMs. Clustering: infer z (marginalizing over π, θ): p(z | x, α, H) = p(x | z, H) p(z | α) / Σ_z p(x | z, H) p(z | α), where p(z | α) = (Γ(α) / Γ(α + N)) ∏_{k=1}^K Γ(n_k + α/K) / Γ(α/K), and p(x | z, H) = ∫_Θ [∏_{i=1}^N p(x_i | θ_{z_i})] [∏_{k=1}^K H(θ_k)] dθ. Parameter estimation: infer π, θ: p(π, θ | x, α, H) = Σ_z p(π | z, α) [∏_{k=1}^K p(θ_k | x, z, H)] p(z | x, α, H). No analytic procedure.

35 Finite mixture models, Inference. Approximate inference for FMMs. No exact inference because of the unknown cluster identifiers z. Expectation-Maximization (EM) is widely used, but we will focus on MCMC because of the connection with the Dirichlet process. Gibbs sampling: a Markov chain Monte Carlo (MCMC) integration method. Set of random variables v = {v_1, v_2, ..., v_M}; we want to compute p(v). Randomly initialize the values. At each iteration, sample one variable and hold the rest constant: v_i^{(t)} ~ p(v_i | v_j^{(t−1)}, j ≠ i) (usually tractable), v_j^{(t)} = v_j^{(t−1)} for j ≠ i. This creates a Markov chain with p(v) as its equilibrium distribution.

36 Finite mixture models, Inference. Gibbs sampling for FMMs. State variables: z_1, ..., z_N, θ_1, ..., θ_K, π. Conditional distributions: p(π | z, θ) = Dir(α/K + n_1, ..., α/K + n_K); p(θ_k | x, z) ∝ p(θ_k) ∏_{i: z_i = k} p(x_i | θ_k) = H(θ_k) ∏_{i: z_i = k} F_{θ_k}(x_i); p(z_i = k | π, θ, x) ∝ p(z_i = k | π_k) p(x_i | z_i = k, θ_k) = π_k F_{θ_k}(x_i). We can avoid sampling π: p(z_i = k | z_{−i}, θ, x) ∝ p(x_i | θ_k) p(z_i = k | z_{−i}) ∝ F_{θ_k}(x_i) (n_k^{(−i)} + α/K).
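
A hedged sketch of this sampler for one-dimensional Gaussian components with known variance and a conjugate normal prior on each mean (these modelling choices, and the use of NumPy/SciPy, are assumptions not made explicit on the slide). The mixing weights are integrated out in the z update, as above.

```python
import numpy as np
from scipy import stats

def gibbs_fmm(x, K, alpha=1.0, sigma=1.0, mu0=0.0, tau0=10.0, iters=100, seed=0):
    """Gibbs sampler for a K-component Gaussian mixture with known component
    variance sigma^2 and N(mu0, tau0^2) priors on the component means."""
    rng = np.random.default_rng(seed)
    N = len(x)
    z = rng.integers(K, size=N)
    mu = rng.normal(mu0, tau0, size=K)
    for _ in range(iters):
        for i in range(N):                                    # resample z_i given z_{-i}, theta, x
            n = np.bincount(np.delete(z, i), minlength=K)     # counts n_k^{(-i)}
            logp = stats.norm.logpdf(x[i], mu, sigma) + np.log(n + alpha / K)
            p = np.exp(logp - logp.max())
            z[i] = rng.choice(K, p=p / p.sum())
        for k in range(K):                                    # resample theta_k given x, z
            xk = x[z == k]
            prec = 1 / tau0**2 + len(xk) / sigma**2           # Normal-Normal posterior precision
            mean = (mu0 / tau0**2 + xk.sum() / sigma**2) / prec
            mu[k] = rng.normal(mean, np.sqrt(1 / prec))
    return z, mu
```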

37 Finite mixture models, Inference. Gibbs sampling for FMMs (example). Mixture of 4 bivariate Gaussians; Normal-inverse-Wishart prior on θ_k = (µ_k, Σ_k), conjugate to the normal distribution: Σ_k ~ W(ν, ·), µ_k | Σ_k ~ N(ϑ, Σ_k / κ). (Figure: learning a mixture of K = 4 Gaussians using the Gibbs sampler; plots show the current parameters after T = 2, T = 10, and T = 40 iterations from two random initializations, each labeled by the current data log-likelihood; from Sudderth, 2008.)

38 Finite mixture models, Inference. FMMs: alternative representation. Instead of explicit indicators, place the mixture in the parameters: G(θ) = Σ_{k=1}^K π_k δ(θ, θ_k), with π ~ Dir(α/K, ..., α/K) and θ_k ~ H; then θ̄_i ~ G and x_i ~ F(θ̄_i). This is equivalent to the indicator representation π ~ Dir(α/K, ..., α/K), θ_k ~ H, z_i ~ π, x_i ~ F(θ_{z_i}).

40 Dirichlet process mixture models, Going nonparametric! Going nonparametric! The problem with finite FMMs: what if K is unknown? How many parameters? Idea: let's use ∞ parameters! We want something of the kind p(x_i | π, θ_1, θ_2, ...) = Σ_{k=1}^∞ π_k p(x_i | θ_k). How to define such a measure? We'd like the nice conjugacy properties of the Dirichlet to carry over... Is there such a thing, the limit of a Dirichlet?

43 Dirichlet process mixture models, The Dirichlet process. The (practical) Dirichlet process. The Dirichlet process DP(α, H) is a distribution over probability measures over Θ. H(θ) is the base (mean) measure: think of µ for a Gaussian, but in the space of probability measures. α is the concentration parameter: it controls the dispersion around the mean H.

44 Dirichlet process mixture models, The Dirichlet process. The Dirichlet process (cont'd). A draw G ~ DP(α, H) is an infinite discrete probability measure: G(θ) = Σ_{k=1}^∞ π_k δ(θ, θ_k), where θ_k ~ H and π is sampled from a stick-breaking prior. (Figure: a draw G on Θ; from Orbanz & Teh, 2008.) Break a stick: imagine a stick of length one; for k = 1, 2, ..., break off a fraction of the remaining stick drawn from Beta(1, α), let π_k be the length broken off, and keep the remainder. Following standard convention, we write π ~ GEM(α). (Details in the second part of the talk.)

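A minimal sketch (not from the slides) of a truncated stick-breaking draw of π ~ GEM(α); the truncation level is an arbitrary assumption.

```python
import numpy as np

def gem_weights(alpha, K_trunc=100, rng=None):
    """Truncated stick-breaking: beta_k ~ Beta(1, alpha), pi_k = beta_k * prod_{l<k} (1 - beta_l)."""
    rng = rng or np.random.default_rng()
    beta = rng.beta(1.0, alpha, size=K_trunc)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - beta)[:-1]))   # length of stick left before break k
    return beta * remaining

for a in (1.0, 5.0):
    pi = gem_weights(a, rng=np.random.default_rng(0))
    print(f"alpha={a}: five largest weights {np.sort(pi)[::-1][:5].round(3)}")
```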

46 Dirichlet process mixture models, The Dirichlet process. Stick-breaking, intuitively. (Figure: stick-breaking weights π_1 = β_1, π_2 = β_2(1 − β_1), π_3 = β_3(1 − β_1)(1 − β_2), ..., and sampled weights π_k for α = 1 and α = 5; from Sudderth, 2008.) Small α: lots of weight assigned to few θ_k's; G will be very different from the base measure H. Large α: weight distributed more equally over the θ_k's; G will resemble the base measure H.

47 Dirichlet process mixture models, The Dirichlet process. The Dirichlet process. (Figure: a base measure H and draws G ~ DP(α, H); from Navarro et al., 2005.)

48 Dirichlet process mixture models, DP mixture models. The DP mixture model (DPMM). Let's use G ~ DP(α, H) to build an infinite mixture model: G ~ DP(α, H); θ̄_i | G ~ G; x_i | θ̄_i ~ F(θ̄_i), for i = 1, ..., N.

49 Dirichlet process mixture models, DP mixture models. DPM (cont'd). Using explicit cluster indicators z = (z_1, z_2, ..., z_N): π ~ GEM(α); θ_k ~ H, k = 1, ..., ∞; z_i | π ~ π; x_i | z_i, θ ~ F(θ_{z_i}), i = 1, ..., N.

50 Dirichlet process mixture models, DP mixture models. Chinese restaurant process. So far, we only have a generative model. Is there a nice conjugacy property to use during inference? It turns out (details in part 2) that, if π ~ GEM(α) and z_i | π ~ π, the distribution p(z | α) = ∫ p(z | π) p(π) dπ is easily tractable, and is known as the Chinese restaurant process (CRP).

51 Dirichlet process mixture models, DP mixture models. Chinese restaurant process (cont'd). A restaurant with infinitely many tables, each of infinite capacity. z_i = table at which customer i sits upon entering. Customer 1 sits at table 1. Customer 2 sits at table 1 with probability 1/(1 + α), at table 2 with probability α/(1 + α). In general, customer i sits at table k with probability proportional to n_k (# of people at k) and at a new table with probability proportional to α: p(z_i = k) = n_k / (α + i − 1), p(z_i = k_new) = α / (α + i − 1).

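A minimal Python sketch (not from the slides) simulating these seating probabilities; the values of N and α are arbitrary.

```python
import numpy as np

def crp(N, alpha, rng=None):
    """Sample table assignments from a Chinese restaurant process: customer i joins
    table k with prob. n_k / (alpha + i - 1), a new table with prob. alpha / (alpha + i - 1)."""
    rng = rng or np.random.default_rng()
    assignments, counts = [], []
    for i in range(N):                                          # i customers already seated
        probs = np.array(counts + [alpha], dtype=float) / (alpha + i)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)                                    # open a new table
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts

z, counts = crp(100, alpha=2.0, rng=np.random.default_rng(0))
print(len(counts), "tables with sizes", counts)
```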

56 Dirichlet process mixture models, Inference. Gibbs sampling for DPMMs. Via the CRP, we can find the conditional distributions for Gibbs sampling. State: θ_1, ..., θ_K, z. p(θ_k | x, z) ∝ p(θ_k) ∏_{i: z_i = k} p(x_i | θ_k) = h(θ_k) ∏_{i: z_i = k} f(x_i | θ_k). p(z_i = k | z_{−i}, θ, x) ∝ p(x_i | θ_k) p(z_i = k | z_{−i}) ∝ n_k^{(−i)} f(x_i | θ_k) for an existing k, and ∝ α f(x_i | θ_k) for a new k. The number of clusters grows as more data are observed, asymptotically as α log n.
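
The sketch below is an illustration, not the slide's exact algorithm: one sweep of such a sampler for one-dimensional Gaussian components with known variance and base measure H = N(mu0, tau0^2), where the new-cluster option is scored with a single fresh parameter drawn from H, in the spirit of Neal's auxiliary-variable samplers. All hyperparameters and the usage data are assumptions.

```python
import numpy as np
from scipy import stats

def dpmm_gibbs_sweep(x, z, mu, alpha, sigma=1.0, mu0=0.0, tau0=10.0, rng=None):
    """One Gibbs sweep for a DP mixture of 1-D Gaussians. z is an int array of cluster
    labels, mu a dict label -> mean. Existing cluster k gets weight n_k^{(-i)} f(x_i | theta_k);
    a new cluster gets weight alpha f(x_i | theta_new) with theta_new drawn from H."""
    rng = rng or np.random.default_rng()
    for i in range(len(x)):
        z[i] = -1                                             # remove x_i from its cluster
        labels, counts = np.unique(z[z >= 0], return_counts=True)
        mu = {k: mu[k] for k in labels}                       # drop clusters left empty
        mu_new = rng.normal(mu0, tau0)                        # candidate parameter theta_new ~ H
        means = np.array([mu[k] for k in labels] + [mu_new])
        logw = np.log(np.append(counts, alpha)) + stats.norm.logpdf(x[i], means, sigma)
        w = np.exp(logw - logw.max())
        choice = rng.choice(len(means), p=w / w.sum())
        if choice == len(labels):                             # open a new cluster
            k_new = (max(mu) + 1) if mu else 0
            mu[k_new] = mu_new
            z[i] = k_new
        else:
            z[i] = labels[choice]
    for k in list(mu):                                        # resample theta_k | x, z (Normal-Normal)
        xk = x[z == k]
        prec = 1 / tau0**2 + len(xk) / sigma**2
        mu[k] = rng.normal((mu0 / tau0**2 + xk.sum() / sigma**2) / prec, np.sqrt(1 / prec))
    return z, mu

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(m, 1.0, 50) for m in (-4.0, 0.0, 5.0)])   # toy data, three clumps
z, mu = np.zeros(len(x), dtype=int), {0: 0.0}                            # start with a single cluster
for _ in range(50):
    z, mu = dpmm_gibbs_sweep(x, z, mu, alpha=1.0, rng=rng)
print("clusters found:", len(mu))
```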

57 Dirichlet process mixture models, Inference. Gibbs sampling for DPMMs (example). Mixture of bivariate Gaussians. (Figure: learning a mixture of Gaussians using the Dirichlet process Gibbs sampler; columns show the parameters of clusters currently assigned to observations, and corresponding data log-likelihoods, after T = 2, T = 10, and T = 50 iterations from two initializations; from Sudderth, 2008.)

58 END OF FIRST PART.

60 A little more theory..., De Finetti's REDUX. De Finetti's REDUX. Theorem (De Finetti, a.k.a. Representation Theorem). A sequence of random variables (x_1, x_2, ...) is infinitely exchangeable if and only if, for all N, there exist a random variable θ and a probability measure p on it such that p(x_1, x_2, ..., x_N) = ∫_Θ p(θ) ∏_{i=1}^N p(x_i | θ) dθ. The theorem wouldn't be true if θ's range were limited to Euclidean vector spaces: we need to allow θ to range over measures. p(θ) is then a distribution on measures, like the DP.

62 A little more theory..., Dirichlet process REDUX. Dirichlet process REDUX. Definition. Let Θ be a measurable space (of parameters), H a probability distribution on Θ, and α a positive scalar. A Dirichlet process is the distribution of a random probability measure G over Θ such that, for any finite partition (T_1, ..., T_K) of Θ, (G(T_1), ..., G(T_K)) ~ Dir(αH(T_1), ..., αH(T_K)). In particular, E[G(T_k)] = H(T_k). (Figure: two partitions of Θ; from Sudderth, 2008.)
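
This finite-dimensional characterization can be checked numerically with a truncated stick-breaking draw (the stick-breaking form is derived later in these slides). A minimal sketch, assuming H = Uniform(0, 1), Θ = [0, 1], and an arbitrary three-set partition:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, K_trunc, draws = 2.0, 500, 2000
edges = np.array([0.0, 0.3, 0.7, 1.0])                   # partition T_1, T_2, T_3 of [0, 1]

masses = np.zeros((draws, len(edges) - 1))
for d in range(draws):
    beta = rng.beta(1.0, alpha, size=K_trunc)             # truncated stick-breaking weights
    pi = beta * np.concatenate(([1.0], np.cumprod(1 - beta)[:-1]))
    theta = rng.uniform(0.0, 1.0, size=K_trunc)           # atoms theta_k ~ H = U(0, 1)
    masses[d] = [pi[(theta >= a) & (theta < b)].sum() for a, b in zip(edges[:-1], edges[1:])]

# The definition says (G(T_1), G(T_2), G(T_3)) ~ Dir(0.3*alpha, 0.4*alpha, 0.3*alpha),
# so in particular E[G(T_k)] = H(T_k):
print(masses.mean(axis=0))                                # approximately (0.3, 0.4, 0.3)
```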

63 A little more theory..., Dirichlet process REDUX. Posterior conjugacy. Via the conjugacy of the Dirichlet distribution, we know that p(G(T_1), ..., G(T_K) | θ ∈ T_k) = Dir(αH(T_1), ..., αH(T_k) + 1, ..., αH(T_K)). Formalizing this analysis, we obtain that if G ~ DP(α, H) and θ̄_i | G ~ G for i = 1, ..., N, the posterior measure also follows a Dirichlet process: G | θ̄_1, ..., θ̄_N, α, H ~ DP(α + N, (αH + Σ_{i=1}^N δ_{θ̄_i}) / (α + N)). The DP defines a conjugate prior for distributions on arbitrary measure spaces.

64 A little more theory..., Dirichlet process REDUX. Generating samples: stick breaking. Sethuraman (1994) gave an equivalent definition of the Dirichlet process through the stick-breaking construction: G ~ DP(α, H) iff G(θ) = Σ_{k=1}^∞ π_k δ(θ, θ_k), where θ_k ~ H and π_k = β_k ∏_{l=1}^{k−1} (1 − β_l), with β_l ~ Beta(1, α). (Figure: stick-breaking weights for α = 1 and α = 5; from Sudderth, 2008.)

65 A little more theory..., Dirichlet process REDUX. Stick-breaking (derivation) [Teh 2007]. We know that (posterior): if G ~ DP(α, H) and θ | G ~ G, then θ ~ H marginally and G | θ ~ DP(α + 1, (αH + δ_θ)/(α + 1)). Consider the partition ({θ}, Θ \ {θ}) of Θ. We have (G(θ), G(Θ \ θ)) ~ Dir((α + 1) [(αH + δ_θ)/(α + 1)](θ), (α + 1) [(αH + δ_θ)/(α + 1)](Θ \ θ)) = Dir(1, α) = Beta(1, α). So G has a point mass located at θ: G = β δ_θ + (1 − β) G', with β ~ Beta(1, α), where G' is the renormalized probability measure with the point mass removed. What is G'?

69 A little more theory..., Dirichlet process REDUX. Stick-breaking (derivation) [Teh 2007]. We have: G ~ DP(α, H), θ | G ~ G, θ ~ H, G | θ ~ DP(α + 1, (αH + δ_θ)/(α + 1)), and G = β δ_θ + (1 − β) G' with β ~ Beta(1, α). Consider a further partition ({θ}, T_1, ..., T_K) of Θ: (G(θ), G(T_1), ..., G(T_K)) = (β, (1 − β) G'(T_1), ..., (1 − β) G'(T_K)) ~ Dir(1, αH(T_1), ..., αH(T_K)). Using the agglomerative/decimative property of the Dirichlet, we get (G'(T_1), ..., G'(T_K)) ~ Dir(αH(T_1), ..., αH(T_K)), i.e. G' ~ DP(α, H).

72 A little more theory..., Dirichlet process REDUX. Stick-breaking (derivation) [Teh 2007]. Therefore, G ~ DP(α, H) can be written as G = β_1 δ_{θ_1} + (1 − β_1) G_1 = β_1 δ_{θ_1} + (1 − β_1)(β_2 δ_{θ_2} + (1 − β_2) G_2) = ... = Σ_{k=1}^∞ π_k δ_{θ_k}, where π_k = β_k ∏_{l=1}^{k−1} (1 − β_l) and β_l ~ Beta(1, α), which is the stick-breaking construction.

73 A little more theory..., Dirichlet process REDUX. Chinese restaurant (derivation). Once again, we start from the posterior: G | θ̄_1, ..., θ̄_N, α, H ~ DP(α + N, (αH + Σ_{i=1}^N δ_{θ̄_i}) / (α + N)). The expected measure of any subset T ⊆ Θ is E[G(T) | θ̄_1, ..., θ̄_N, α, H] = (αH + Σ_{i=1}^N δ_{θ̄_i})(T) / (α + N). Since G is discrete, some of the {θ̄_i}_{i=1}^N take identical values. Assuming K ≤ N unique values θ_1, ..., θ_K with multiplicities N_1, ..., N_K: E[G(T) | θ̄_1, ..., θ̄_N, α, H] = (αH + Σ_{k=1}^K N_k δ_{θ_k})(T) / (α + N).

76 A little more theory..., Dirichlet process REDUX. Chinese restaurant (derivation). A bit informally: let T_k contain θ_k and shrink it arbitrarily. In the limit, p(θ̄_{N+1} = θ | θ̄_1, ..., θ̄_N, α, H) = (αH(θ) + Σ_{k=1}^K N_k δ_{θ_k}(θ)) / (α + N). This is the generalized Polya urn scheme: an urn contains one ball for each preceding observation, with a different color for each distinct θ_k; for each ball drawn from the urn, we replace that ball and add one more ball of the same color; there is a special weighted ball which is drawn with probability proportional to α normal balls, and has a new, previously unseen color. [This description is from Sudderth, 2008.] This allows us to sample from a Dirichlet process without explicitly constructing the underlying G ~ DP(α, H).

78 A little more theory..., Dirichlet process REDUX. Chinese restaurant (derivation). The Dirichlet process implicitly partitions the data. Let z_i indicate the subset (cluster) associated with the i-th observation, i.e. θ̄_i = θ_{z_i}. From the previous slide, we get p(z_{N+1} = z | z_1, ..., z_N, α) = (α δ(z, k_new) + Σ_{k=1}^K N_k δ(z, k)) / (α + N). This is the Chinese restaurant process (CRP). It induces an exchangeable distribution on partitions: the joint distribution is invariant to the order in which observations are assigned to clusters.

80 A little more theory..., Dirichlet process REDUX. Take-away message: these representations are all equivalent. Posterior DP: G ~ DP(α, H), θ | G ~ G, θ ~ H, G | θ ~ DP(α + 1, (αH + δ_θ)/(α + 1)). Stick-breaking construction: G(θ) = Σ_{k=1}^∞ π_k δ(θ, θ_k), θ_k ~ H, π ~ GEM(α). Generalized Polya urn: p(θ̄_{N+1} = θ | θ̄_1, ..., θ̄_N, α, H) = (αH(θ) + Σ_{k=1}^K N_k δ_{θ_k}(θ)) / (α + N). Chinese restaurant process: p(z_{N+1} = z | z_1, ..., z_N, α) = (α δ(z, k_new) + Σ_{k=1}^K N_k δ(z, k)) / (α + N).

82 The hierarchical Dirichlet process. The DP mixture model (DPMM). Recall: we use G ~ DP(α, H) to build an infinite mixture model: G ~ DP(α, H); θ̄_i | G ~ G; x_i | θ̄_i ~ F(θ̄_i).

83 The hierarchical Dirichlet process. Related subgroups of data. Dataset with J related groups x = (x_1, ..., x_J); group x_j = (x_{j1}, ..., x_{jN_j}) contains N_j observations. We want these groups to share clusters (transfer knowledge). (Figure from Jordan, 2005.)

84 The hierarchical Dirichlet process. Hierarchical Dirichlet process (HDP). Global probability measure G_0 ~ DP(γ, H) defines a set of shared clusters: G_0(θ) = Σ_{k=1}^∞ β_k δ(θ, θ_k), with θ_k ~ H and β ~ GEM(γ). Group-specific distributions G_j ~ DP(α, G_0): G_j(θ) = Σ_{t=1}^∞ π̃_{jt} δ(θ, θ̃_{jt}), with θ̃_{jt} ~ G_0 and π̃_j ~ GEM(α). Note G_0 as the base measure! Each local cluster has its parameter θ̃_{jt} copied from some global cluster θ_k. For each group, data points are generated according to θ̄_{ji} ~ G_j, x_{ji} ~ F(θ̄_{ji}).

87 The hierarchical Dirichlet process. The HDP mixture model. G_0 ~ DP(γ, H); G_j | G_0 ~ DP(α, G_0), j = 1, ..., J; θ̄_{ji} | G_j ~ G_j; x_{ji} | θ̄_{ji} ~ F(θ̄_{ji}).

88 The hierarchical Dirichlet process. The HDP mixture model (cont'd). G_j(θ) = Σ_{t=1}^∞ π̃_{jt} δ(θ, θ̃_{jt}), with θ̃_{jt} ~ G_0 and π̃_j ~ GEM(α). G_0 is discrete, so each group might create several copies of the same global cluster. Aggregating the probabilities: G_j(θ) = Σ_{k=1}^∞ π_{jk} δ(θ, θ_k), with π_{jk} = Σ_{t: k_{jt} = k} π̃_{jt}. It can be shown that π_j ~ DP(α, β). β = (β_1, β_2, ...): average weight of the clusters across groups; π_j = (π_{j1}, π_{j2}, ...): group-specific weights; α controls the variability of the cluster weights across groups.
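
A hedged sketch of this hierarchical construction via truncated stick-breaking (illustrative only; H = N(0, 1), the truncation level, and the concentration values are assumptions). Because each group's atoms are drawn from the discrete G_0, all groups reuse the same global θ_k's, only with different aggregated weights π_{jk}.

```python
import numpy as np

def gem(alpha, K, rng):
    """Truncated stick-breaking weights."""
    b = rng.beta(1.0, alpha, size=K)
    return b * np.concatenate(([1.0], np.cumprod(1 - b)[:-1]))

rng = np.random.default_rng(0)
gamma_c, alpha_c, K_trunc, J = 5.0, 1.0, 50, 3

theta = rng.normal(0.0, 1.0, size=K_trunc)       # global atoms theta_k ~ H = N(0, 1)
beta = gem(gamma_c, K_trunc, rng)                 # global weights beta ~ GEM(gamma)

for j in range(J):                                # group-specific G_j ~ DP(alpha, G_0)
    k_jt = rng.choice(K_trunc, size=K_trunc, p=beta)    # local atoms are copies of global ones
    pi_tilde = gem(alpha_c, K_trunc, rng)               # local weights ~ GEM(alpha)
    pi_j = np.bincount(k_jt, weights=pi_tilde, minlength=K_trunc)   # aggregate copies: pi_{jk}
    top = np.argsort(pi_j)[::-1][:3]
    print(f"group {j}: top global clusters {top}, weights {pi_j[top].round(3)}")
```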

89 THANK YOU. QUESTIONS?

Lecture 3a: Dirichlet processes

Lecture 3a: Dirichlet processes Lecture 3a: Dirichlet processes Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced Topics

More information

Non-parametric Clustering with Dirichlet Processes

Non-parametric Clustering with Dirichlet Processes Non-parametric Clustering with Dirichlet Processes Timothy Burns SUNY at Buffalo Mar. 31 2009 T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 1 / 24 Introduction

More information

Bayesian Nonparametrics

Bayesian Nonparametrics Bayesian Nonparametrics Lorenzo Rosasco 9.520 Class 18 April 11, 2011 About this class Goal To give an overview of some of the basic concepts in Bayesian Nonparametrics. In particular, to discuss Dirichelet

More information

Bayesian Nonparametrics: Dirichlet Process

Bayesian Nonparametrics: Dirichlet Process Bayesian Nonparametrics: Dirichlet Process Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL http://www.gatsby.ucl.ac.uk/~ywteh/teaching/npbayes2012 Dirichlet Process Cornerstone of modern Bayesian

More information

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model

More information

Dirichlet Processes: Tutorial and Practical Course

Dirichlet Processes: Tutorial and Practical Course Dirichlet Processes: Tutorial and Practical Course (updated) Yee Whye Teh Gatsby Computational Neuroscience Unit University College London August 2007 / MLSS Yee Whye Teh (Gatsby) DP August 2007 / MLSS

More information

Non-Parametric Bayes

Non-Parametric Bayes Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian

More information

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I

CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr Dirichlet Process I X i Ν CS281B / Stat 241B : Statistical Learning Theory Lecture: #22 on 19 Apr 2004 Dirichlet Process I Lecturer: Prof. Michael Jordan Scribe: Daniel Schonberg dschonbe@eecs.berkeley.edu 22.1 Dirichlet

More information

Bayesian nonparametrics

Bayesian nonparametrics Bayesian nonparametrics 1 Some preliminaries 1.1 de Finetti s theorem We will start our discussion with this foundational theorem. We will assume throughout all variables are defined on the probability

More information

Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu

Lecture 16-17: Bayesian Nonparametrics I. STAT 6474 Instructor: Hongxiao Zhu Lecture 16-17: Bayesian Nonparametrics I STAT 6474 Instructor: Hongxiao Zhu Plan for today Why Bayesian Nonparametrics? Dirichlet Distribution and Dirichlet Processes. 2 Parameter and Patterns Reference:

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 4 Problem: Density Estimation We have observed data, y 1,..., y n, drawn independently from some unknown

More information

Image segmentation combining Markov Random Fields and Dirichlet Processes

Image segmentation combining Markov Random Fields and Dirichlet Processes Image segmentation combining Markov Random Fields and Dirichlet Processes Jessica SODJO IMS, Groupe Signal Image, Talence Encadrants : A. Giremus, J.-F. Giovannelli, F. Caron, N. Dobigeon Jessica SODJO

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Bayesian Nonparametric Models

Bayesian Nonparametric Models Bayesian Nonparametric Models David M. Blei Columbia University December 15, 2015 Introduction We have been looking at models that posit latent structure in high dimensional data. We use the posterior

More information

Bayesian Nonparametrics for Speech and Signal Processing

Bayesian Nonparametrics for Speech and Signal Processing Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer

More information

A Brief Overview of Nonparametric Bayesian Models

A Brief Overview of Nonparametric Bayesian Models A Brief Overview of Nonparametric Bayesian Models Eurandom Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin Also at Machine

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Clustering using Mixture Models

Clustering using Mixture Models Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior

More information

Part IV: Monte Carlo and nonparametric Bayes

Part IV: Monte Carlo and nonparametric Bayes Part IV: Monte Carlo and nonparametric Bayes Outline Monte Carlo methods Nonparametric Bayesian models Outline Monte Carlo methods Nonparametric Bayesian models The Monte Carlo principle The expectation

More information

Nonparametric Mixed Membership Models

Nonparametric Mixed Membership Models 5 Nonparametric Mixed Membership Models Daniel Heinz Department of Mathematics and Statistics, Loyola University of Maryland, Baltimore, MD 21210, USA CONTENTS 5.1 Introduction................................................................................

More information

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes Yee Whye Teh (1), Michael I. Jordan (1,2), Matthew J. Beal (3) and David M. Blei (1) (1) Computer Science Div., (2) Dept. of Statistics

More information

Learning Bayesian network : Given structure and completely observed data

Learning Bayesian network : Given structure and completely observed data Learning Bayesian network : Given structure and completely observed data Probabilistic Graphical Models Sharif University of Technology Spring 2017 Soleymani Learning problem Target: true distribution

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

Stochastic Processes, Kernel Regression, Infinite Mixture Models

Stochastic Processes, Kernel Regression, Infinite Mixture Models Stochastic Processes, Kernel Regression, Infinite Mixture Models Gabriel Huang (TA for Simon Lacoste-Julien) IFT 6269 : Probabilistic Graphical Models - Fall 2018 Stochastic Process = Random Function 2

More information

Dirichlet Process. Yee Whye Teh, University College London

Dirichlet Process. Yee Whye Teh, University College London Dirichlet Process Yee Whye Teh, University College London Related keywords: Bayesian nonparametrics, stochastic processes, clustering, infinite mixture model, Blackwell-MacQueen urn scheme, Chinese restaurant

More information

Construction of Dependent Dirichlet Processes based on Poisson Processes

Construction of Dependent Dirichlet Processes based on Poisson Processes 1 / 31 Construction of Dependent Dirichlet Processes based on Poisson Processes Dahua Lin Eric Grimson John Fisher CSAIL MIT NIPS 2010 Outstanding Student Paper Award Presented by Shouyuan Chen Outline

More information

Collapsed Variational Dirichlet Process Mixture Models

Collapsed Variational Dirichlet Process Mixture Models Collapsed Variational Dirichlet Process Mixture Models Kenichi Kurihara Dept. of Computer Science Tokyo Institute of Technology, Japan kurihara@mi.cs.titech.ac.jp Max Welling Dept. of Computer Science

More information

Spatial Normalized Gamma Process

Spatial Normalized Gamma Process Spatial Normalized Gamma Process Vinayak Rao Yee Whye Teh Presented at NIPS 2009 Discussion and Slides by Eric Wang June 23, 2010 Outline Introduction Motivation The Gamma Process Spatial Normalized Gamma

More information

Dirichlet Processes and other non-parametric Bayesian models

Dirichlet Processes and other non-parametric Bayesian models Dirichlet Processes and other non-parametric Bayesian models Zoubin Ghahramani http://learning.eng.cam.ac.uk/zoubin/ zoubin@cs.cmu.edu Statistical Machine Learning CMU 10-702 / 36-702 Spring 2008 Model

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Bayesian non parametric approaches: an introduction

Bayesian non parametric approaches: an introduction Introduction Latent class models Latent feature models Conclusion & Perspectives Bayesian non parametric approaches: an introduction Pierre CHAINAIS Bordeaux - nov. 2012 Trajectory 1 Bayesian non parametric

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

CSCI 5822 Probabilistic Model of Human and Machine Learning. Mike Mozer University of Colorado

CSCI 5822 Probabilistic Model of Human and Machine Learning. Mike Mozer University of Colorado CSCI 5822 Probabilistic Model of Human and Machine Learning Mike Mozer University of Colorado Topics Language modeling Hierarchical processes Pitman-Yor processes Based on work of Teh (2006), A hierarchical

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing

More information

Infinite latent feature models and the Indian Buffet Process

Infinite latent feature models and the Indian Buffet Process p.1 Infinite latent feature models and the Indian Buffet Process Tom Griffiths Cognitive and Linguistic Sciences Brown University Joint work with Zoubin Ghahramani p.2 Beyond latent classes Unsupervised

More information

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix

Infinite-State Markov-switching for Dynamic. Volatility Models : Web Appendix Infinite-State Markov-switching for Dynamic Volatility Models : Web Appendix Arnaud Dufays 1 Centre de Recherche en Economie et Statistique March 19, 2014 1 Comparison of the two MS-GARCH approximations

More information

Hierarchical Models & Bayesian Model Selection

Hierarchical Models & Bayesian Model Selection Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

Hierarchical Bayesian Languge Model Based on Pitman-Yor Processes. Yee Whye Teh

Hierarchical Bayesian Languge Model Based on Pitman-Yor Processes. Yee Whye Teh Hierarchical Bayesian Languge Model Based on Pitman-Yor Processes Yee Whye Teh Probabilistic model of language n-gram model Utility i-1 P(word i word i-n+1 ) Typically, trigram model (n=3) e.g., speech,

More information

Lecture: Probabilistic Machine Learning. Riashat Islam, Reasoning and Learning Lab, McGill University, September 11, 2018.
Computer Vision Group, Prof. Daniel Cremers: 14. Clustering.
Probabilistic Graphical Models: Infinite Feature Models, the Indian Buffet Process. Eric Xing, Lecture 21; slides first drafted by Sinead Williamson.
Non-parametric Bayesian Methods. Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, University College London. Uncertainty in Artificial Intelligence tutorial.
Gentle Introduction to Infinite Gaussian Mixture Modeling, with an application in neuroscience (spike sorting). Frank Wood; based on Rasmussen, NIPS 1999.
STAT 625: Advanced Bayesian Inference (the Dirichlet distribution). Meng Li, Department of Statistics.
Chapter 8: Probabilistic Models for Text Mining. Yizhou Sun, Department of Computer Science, University of Illinois at Urbana-Champaign, and Hongbo Deng.
STAT 499/962 Topics in Statistics: Bayesian Inference and Decision Theory, Handout 01, January 2018. Nasser Sadeghkhani.
PMR: Learning as Inference. Probabilistic Modelling and Reasoning, Amos Storkey, School of Informatics, University of Edinburgh.
STA561: Probabilistic Machine Learning. Introduction: MLE, MAP, Bayesian reasoning (28/8/13). Lecturer: Barbara Engelhardt; scribes: K. Ulrich, J. Subramanian, N. Raval, J. O'Hollaren.
Bayesian Inference for Dirichlet-Multinomials. Mark Johnson, Macquarie University, Sydney, Australia. MLSS Summer School.
Lecture 4: Probabilistic Learning. DD2431, Autumn 2015.
CSE 446: Point Estimation, Winter 2012. Dan Weld; some slides from Carlos Guestrin, Luke Zettlemoyer, and K. Gajos.
Bayesian Models in Machine Learning. Lukáš Burget, Escuela de Ciencias Informáticas 2017, Buenos Aires, July 24-29, 2017.
Lecture 4: Probabilistic Learning. Estimation Theory; Classification with Probability Distributions. DD2431, Autumn 2014.
An introduction to Bayesian inference, with an application to network analysis. http://jakehofman.com, January 13, 2010.
Density Estimation. Seungjin Choi, Department of Computer Science and Engineering, Pohang University of Science and Technology.
The Indian Buffet Process: An Introduction and Review. Thomas L. Griffiths, Department of Psychology. Journal of Machine Learning Research 12 (2011), 1185-1224.
Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures. Shipeng Yu, Kai Yu, Volker Tresp, et al. 17th European Conference on Machine Learning, Berlin, Germany, 2006.
Clustering problems, mixture models and Bayesian nonparametrics. Nguyễn Xuân Long, Department of Statistics and Department of Electrical Engineering and Computer Science, University of Michigan.
Parametric Techniques, Lecture 3. Jason Corso, SUNY at Buffalo, January 22, 2009.
Exchangeability. Peter Orbanz, Columbia University.
Bayesian Nonparametrics. Peter Orbanz, Columbia University.
Types of learning and modeling data: supervised learning (classification, regression) vs. unsupervised learning (clustering). Lecture notes, 9/12/17.
Bayesian Mixtures of Bernoulli Distributions. Laurens van der Maaten, Department of Computer Science and Engineering, University of California, San Diego.
Advanced Machine Learning, Lecture 10: Mixture Models II. Bastian Leibe, RWTH Aachen, 30.11.2015. http://www.vision.rwth-aachen.de/
Bayesian Nonparametric Learning of Complex Dynamical Phenomena. Emily Fox, Department of Statistical Science, Duke University; joint work with Erik Sudderth (Brown University) and Michael Jordan (UC Berkeley).
DS-GA 1003: Machine Learning and Computational Statistics, Homework 7: Bayesian Modeling. Due May 10, 2016.
Applied Nonparametric Bayes. Michael I. Jordan, Department of Electrical Engineering and Computer Science and Department of Statistics, University of California, Berkeley.
Parametric Models. Shuang Liang, School of Software Engineering, Tongji University, Fall 2012.
Lecture 6: Graphical Models: Learning. 4F13: Machine Learning, Zoubin Ghahramani and Carl Edward Rasmussen, Department of Engineering, University of Cambridge, February 3, 2010.
Statistical and Learning Techniques in Computer Vision, Lecture 2: Maximum Likelihood and Bayesian Estimation. Jens Rittscher and Chuck Stewart.
CMPS 242: Project Report. RadhaKrishna Vuppala, University of California, Santa Cruz.
Bayesian RL Seminar. Chris Mansley, September 9, 2008.
STA 4273H: Statistical Machine Learning, Lecture 7. Russ Salakhutdinov, Department of Computer Science and Department of Statistical Sciences, University of Toronto.
Dirichlet Enhanced Latent Semantic Analysis. Kai Yu (Siemens Corporate Technology, Munich) and Shipeng Yu (Institute for Computer Science, University of Munich).
19: Bayesian Nonparametrics: The Indian Buffet Process. 10-708: Probabilistic Graphical Models, Spring 2015. Lecturer: Avinava Dubey; scribes: Rishav Das, Adam Brodie, and Hemank Lamba.
Naïve Bayes classification.
COS 597C: Bayesian Nonparametrics, Lecture 3 (Gibbs Sampling with a DP). Lecturer: David Blei; scribes: Jordan Boyd-Graber and Francisco Pereira, October 1, 2007.
COMP90051 Statistical Machine Learning, Semester 2, 2017. Lecturer: Trevor Cohn; adapted from slides by Ben Rubinstein.
Introduction to Machine Learning: Generative Models. Varun Chandola, Computer Science & Engineering, State University of New York at Buffalo.
Model-based cognitive neuroscience approaches to computational psychiatry: clustering and classification. Thomas V. Wiecki, Jeffrey Poland, and Michael Frank, July 10, 2014.
Time Series and Dynamic Models, Section 1: Intro to Bayesian Inference. Carlos M. Carvalho, The University of Texas at Austin.
Bayesian nonparametric latent feature models. François Caron, UBC, October 2, 2007.
Parametric Techniques. Jason J. Corso, SUNY at Buffalo.
Hierarchical Dirichlet Processes. Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei, University of California at Berkeley.
CSC321 Lecture 18: Learning Probabilistic Models. Roger Grosse.
A Simple Proof of the Stick-Breaking Construction of the Dirichlet Process. John Paisley, Department of Computer Science, Princeton University. (An illustrative sketch of this construction appears after this list.)
Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia, August 2011; based on tutorial slides by Lester Mackey and Ariel Kleiner.
Unsupervised Learning: Bayesian Model Comparison. Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, University College London.
Computational Cognitive Science, Lecture 8. Frank Keller, School of Informatics, University of Edinburgh, October 14, 2016; based on slides by Sharon Goldwater.
Machine Learning Summer School, Lecture 3: Learning parameters and structure. Zoubin Ghahramani, Department of Engineering, University of Cambridge.
Lecture 2: Priors and Conjugacy. Melih Kandemir, May 6, 2014.
Probabilistic modeling. Slides closely adapted from Subhransu Maji's slides.
Structured Bayesian Nonparametric Models with Variational Inference. Percy Liang and Dan Klein, ACL Tutorial, Prague, Czech Republic, June 24, 2007.
Probability Distributions. J. Elder, CSE 6390/PSYC 6225: Computational Modeling of Visual Perception.
Naïve Bayes classification (probability theory review), 11/15/16.
Inference for a Population Proportion. Al Nosedal, University of Toronto, November 11, 2015.
Tree-Based Inference for Dirichlet Process Mixtures. Yang Xu (Machine Learning Department, Carnegie Mellon University), Katherine A. Heller (Department of Engineering, University of Cambridge), and Zoubin Ghahramani.
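Several of the entries above (Paisley's stick-breaking note, the infinite Gaussian mixture tutorial, and the DP mixture references) rely on the stick-breaking construction of the Dirichlet process. As a rough, self-contained illustration of that construction, and not code taken from any of the listed documents, the following minimal sketch draws an approximate sample G ~ DP(alpha, G0) by truncating the stick-breaking process at T atoms; the function name, the truncation level, and the choice of a standard normal base measure are illustrative assumptions.

import numpy as np

def truncated_dp_sample(alpha, base_sampler, T=100, rng=None):
    # Approximate draw from DP(alpha, G0) via truncated stick-breaking:
    # pi_k = beta_k * prod_{j<k} (1 - beta_j),  beta_k ~ Beta(1, alpha),
    # atoms theta_k drawn i.i.d. from the base measure G0.
    rng = np.random.default_rng(rng)
    betas = rng.beta(1.0, alpha, size=T)                         # stick proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining                                  # truncated weights (sum < 1)
    atoms = np.array([base_sampler(rng) for _ in range(T)])      # atoms from G0
    return weights, atoms

# Example usage: G0 = N(0, 1); larger alpha spreads mass over more atoms.
weights, atoms = truncated_dp_sample(alpha=2.0,
                                     base_sampler=lambda r: r.normal(0.0, 1.0),
                                     T=50, rng=0)
print(weights[:5], atoms[:5])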