Bayesian Nonparametrics: Models Based on the Dirichlet Process
1 Bayesian Nonparametrics: Models Based on the Dirichlet Process
Alessandro Panella
Department of Computer Science, University of Illinois at Chicago
Machine Learning Seminar Series, February 18, 2013
Alessandro Panella (CS Dept. - UIC), Bayesian Nonparametrics, February 18, 2013 (57 slides)
2 Sources and Inspirations
Tutorials (slides):
- P. Orbanz and Y. W. Teh, Modern Bayesian Nonparametrics. NIPS tutorial.
- M. Jordan, Dirichlet Process, Chinese Restaurant Process, and All That. NIPS tutorial.
Articles etc.:
- E. B. Sudderth, chapter in PhD thesis.
- E. Fox, chapter in PhD thesis.
- Y. W. Teh, Dirichlet Processes. Encyclopedia of Machine Learning, Springer.
3 Outline
1 Introduction and background: Bayesian learning; Nonparametric models
2 Finite mixture models: Bayesian models; Clustering with FMMs; Inference
3 Dirichlet process mixture models: Going nonparametric!; The Dirichlet process; DP mixture models; Inference
4 A little more theory...: De Finetti's REDUX; Dirichlet process REDUX
5 The hierarchical Dirichlet process
5 The meaning of it all (Introduction and background: Bayesian learning)
BAYESIAN NONPARAMETRICS
8 Bayesian statistics
Estimate a parameter θ ∈ Θ after observing data x.
Frequentist, Maximum Likelihood (ML):
  θ̂_MLE = argmax_θ p(x | θ) = argmax_θ L(θ ; x)
Bayesian, via Bayes' rule:
  p(θ | x) = p(x | θ) p(θ) / p(x)
Bayesian prediction (using the whole posterior, not just one estimator):
  p(x_new | x) = ∫_Θ p(x_new | θ) p(θ | x) dθ
Maximum A Posteriori (MAP):
  θ̂_MAP = argmax_θ p(x | θ) p(θ)
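The three quantities above all have closed forms for the coin-flip model used later in the talk (Bernoulli likelihood, Beta prior). A minimal numeric sketch; the data counts and the Beta(2, 2) prior are illustrative choices, not from the slides:

```python
def coin_estimates(n_heads, n_tails, a=2.0, b=2.0):
    """Return (MLE, MAP, posterior-predictive P(heads)) for a Beta-Bernoulli model."""
    n = n_heads + n_tails
    mle = n_heads / n                           # argmax of the likelihood alone
    map_ = (n_heads + a - 1) / (n + a + b - 2)  # argmax of likelihood * prior
    predictive = (n_heads + a) / (n + a + b)    # integrates over the whole posterior
    return mle, map_, predictive

mle, map_, pred = coin_estimates(n_heads=7, n_tails=3)
print(mle, map_, pred)   # MLE 0.7, MAP = 8/12, predictive = 9/14
```

Note how the prior pulls both the MAP estimate and the predictive probability toward 1/2 relative to the MLE.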
11 De Finetti's theorem
A premise:
Definition. An infinite sequence of random variables (x_1, x_2, ...) is said to be (infinitely) exchangeable if, for every N and every permutation π of (1, ..., N),
  p(x_1, x_2, ..., x_N) = p(x_π(1), x_π(2), ..., x_π(N))
Note: exchangeability does not imply i.i.d.!
Example (Polya urn). An urn contains some red balls and some black balls; an infinite sequence of colors is drawn recursively as follows: draw a ball, mark down its color, then put the ball back in the urn along with an additional ball of the same color.
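The Polya urn makes exchangeability concrete: the probability of a colour sequence depends only on the colour counts, not on their order, even though the draws are clearly not independent. A small sketch (the starting composition of one red and one black ball is an illustrative choice):

```python
def polya_sequence_prob(seq, red=1, black=1):
    """Exact probability of a colour sequence ('R'/'B') from a Polya urn that
    starts with `red` red and `black` black balls: draw a ball, replace it,
    and add one more ball of the same colour."""
    r, b, p = red, black, 1.0
    for c in seq:
        if c == 'R':
            p *= r / (r + b)
            r += 1
        else:
            p *= b / (r + b)
            b += 1
    return p

# Exchangeability: every ordering of two reds and one black has probability 1/12.
print(polya_sequence_prob('RRB'), polya_sequence_prob('RBR'), polya_sequence_prob('BRR'))
```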
13 De Finetti's theorem (cont'd)
Theorem (De Finetti, a.k.a. the representation theorem). A sequence of random variables (x_1, x_2, ...) is infinitely exchangeable if and only if, for all N, there exist a random variable θ and a probability measure p on it such that
  p(x_1, x_2, ..., x_N) = ∫_Θ p(θ) ∏_{i=1}^N p(x_i | θ) dθ
i.e., there exists a parameter space and a measure on it that make the variables i.i.d.!
The representation theorem motivates (and encourages!) the use of Bayesian statistics.
15 Bayesian learning
Hypothesis space H. Given data D, compute
  p(h | D) = p(D | h) p(h) / p(D)
Then we probably want to predict some future data D', by either:
- averaging over H, i.e. p(D' | D) = ∫_H p(D' | h) p(h | D) dh
- choosing the MAP h (or computing it directly), i.e. p(D' | D) = p(D' | h_MAP)
- sampling from the posterior...
H can be anything! Bayesian learning is a general learning framework.
We will consider the case in which h is a probabilistic model itself, i.e. a parameter vector θ.
16 A simple example
Infer the bias θ ∈ [0, 1] of a coin after observing N tosses.
H = 1, T = 0; p(h) = θ, hence the hypothesis space is Θ = [0, 1].
Sequence of Bernoulli trials:
  p(x_1, ..., x_N | θ) = θ^{n_H} (1 - θ)^{N - n_H},  where n_H = # heads.
Unknown θ:
  p(x_1, ..., x_N) = ∫_0^1 θ^{n_H} (1 - θ)^{N - n_H} p(θ) dθ
(Graphical model: θ → x_1, x_2, ..., x_N; equivalently, in plate notation, θ → x_i with i = 1...N.)
We need to find a good prior p(θ)... the Beta distribution!
17 A simple example (cont'd)
Beta distribution: θ ∼ Beta(a, b), i.e. p(θ | a, b) = (1 / B(a, b)) θ^{a-1} (1 - θ)^{b-1}
Bayesian learning: p(h | D) ∝ p(D | h) p(h); for us:
  p(θ | x_1, ..., x_N) ∝ p(x_1, ..., x_N | θ) p(θ)
    = θ^{n_H} (1 - θ)^{n_T} (1 / B(a, b)) θ^{a-1} (1 - θ)^{b-1}
    ∝ θ^{n_H + a - 1} (1 - θ)^{n_T + b - 1}
i.e. θ | x_1, ..., x_N ∼ Beta(a + n_H, b + n_T).
We're lucky! The Beta distribution is a conjugate prior to the binomial distribution.
(Figure: densities of Beta(0.1, 0.1), Beta(1, 1), Beta(2, 3), Beta(10, 10).)
19 A simple example (cont'd)
(Figure: posterior densities after three sequences of four tosses: H T H H, H H H T, H H H H.)
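The conjugate update from the previous slide can be checked mechanically for sequences like these. A sketch under a uniform Beta(1, 1) prior (the prior choice is illustrative):

```python
def beta_update(a, b, tosses):
    """Conjugate update: a Beta(a, b) prior and Bernoulli observations
    (1 = heads, 0 = tails) give a Beta(a + n_H, b + n_T) posterior."""
    n_h = sum(tosses)
    n_t = len(tosses) - n_h
    return a + n_h, b + n_t

# The sequence H T H H takes Beta(1, 1) to Beta(4, 2), with posterior mean 2/3.
a, b = beta_update(1.0, 1.0, [1, 0, 1, 1])
```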
20 Nonparametric models
Nonparametric doesn't mean "no parameters"! Rather:
- The number of parameters grows as more data are observed.
- ∞-dimensional parameter space.
- Finite data ⇒ bounded number of parameters.
Definition. A nonparametric model is a Bayesian model on an ∞-dimensional parameter space.
(Example figure, from Orbanz and Teh, NIPS 2011: a parametric density p(x) with a finite parameter μ vs. a nonparametric density estimate.)
23 Models in Bayesian data analysis
Model = generative process:
- Expresses how we think the data is generated.
- Contains hidden variables (the subject of learning).
- Specifies relations between variables, e.g. via graphical models.
Posterior inference: knowing p(D | M, θ), i.e. how the data is generated, compute p(θ | M, D).
Akin to reversing the generative process.
(Diagram: prior p(θ) over θ in model M; likelihood p(D | M, θ) generates data D; inference recovers p(θ | D, M).)
25 Finite mixture models (FMMs)
A Bayesian approach to clustering: each data point is assumed to belong to one of K clusters.
General form: a sequence of data points x = (x_1, ..., x_N), each with probability
  p(x_i | π, θ_1, ..., θ_K) = Σ_{k=1}^K π_k f(x_i | θ_k),  π ∈ Π_{K-1} (the simplex)
Generative process. For each i:
- Draw a cluster assignment z_i ∼ π.
- Draw a data point x_i ∼ F(θ_{z_i}).
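The two-step generative process above is directly executable. A sketch using only the standard library, with Gaussian components; the mixture weights and parameters are the ones from the univariate example on the next slide:

```python
import random

def sample_fmm(n, pi, components):
    """Finite-mixture generative process: z_i ~ pi, x_i ~ N(mu_{z_i}, sigma_{z_i})."""
    draws = []
    for _ in range(n):
        z = random.choices(range(len(pi)), weights=pi)[0]  # cluster assignment z_i
        mu, sigma = components[z]
        draws.append((z, random.gauss(mu, sigma)))         # data point x_i
    return draws

random.seed(0)
draws = sample_fmm(1000, pi=[0.15, 0.25, 0.6], components=[(1, 1), (4, 0.5), (6, 0.7)])
```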
26 FMMs (example)
Mixture of univariate Gaussians: θ_k = (μ_k, σ_k), x_i ∼ N(μ_k, σ_k), and
  p(x_i | π, μ, σ) = Σ_{k=1}^K π_k f_N(x_i ; μ_k, σ_k)
(Figure: density of the mixture with π = (0.15, 0.25, 0.6) and components N(1, 1), N(4, .5), N(6, .7).)
27 FMMs (cont'd)
Clustering with FMMs: we need priors for π and θ.
- Usually, π is given a (symmetric) Dirichlet distribution prior.
- The θ_k's are given a suitable prior H, depending on the data.
Model:
  π ∼ Dir(α/K, ..., α/K)
  θ_k ∼ H,  k = 1 ... K
  z_i ∼ π,  x_i | θ, z_i ∼ F(θ_{z_i}),  i = 1 ... N
(Graphical model: α → π → z_i → x_i; H → θ_k, plate over k = 1...K; plate over i = 1...N.)
28 Dirichlet distribution
Multivariate generalization of the Beta distribution.
(Figure, from Teh, MLSC 2008: densities of Dir(1,1,1), Dir(2,2,2), Dir(5,5,5), Dir(5,5,2), Dir(5,2,2), Dir(0.7,0.7,0.7) on the simplex.)
29 Dirichlet distribution (cont'd)
  π ∼ Dir(α/K, ..., α/K)  iff  p(π_1, ..., π_K) = (Γ(α) / ∏_k Γ(α/K)) ∏_{k=1}^K π_k^{α/K - 1}
Conjugate prior to the categorical/multinomial, i.e. π ∼ Dir(α/K, ..., α/K) and z_i ∼ π for i = 1...N imply
  π | z_1, ..., z_N ∼ Dir(α/K + n_1, α/K + n_2, ..., α/K + n_K)
Moreover,
  p(z_1, ..., z_N | α) = (Γ(α) / Γ(α + N)) ∏_{k=1}^K (Γ(n_k + α/K) / Γ(α/K))
and
  p(z_i = k | z_{-i}, α) = (n_k^{(-i)} + α/K) / (α + N - 1)
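Both facts above are easy to exercise numerically. A sketch that draws from a Dirichlet via the standard Gamma-normalization trick (no special library needed) and evaluates the collapsed predictive; the counts and α are illustrative:

```python
import random

def sample_dirichlet(alphas):
    """Draw pi ~ Dir(alphas) by normalizing independent Gamma(alpha_k, 1) draws."""
    g = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]

def predictive(counts_minus_i, alpha, n_minus_i):
    """p(z_i = k | z_{-i}, alpha) with pi collapsed out:
    (n_k^{(-i)} + alpha/K) / (alpha + N - 1), where n_minus_i = N - 1."""
    K = len(counts_minus_i)
    return [(n_k + alpha / K) / (alpha + n_minus_i) for n_k in counts_minus_i]

random.seed(0)
pi = sample_dirichlet([0.5, 0.5, 0.5])              # symmetric Dir with alpha = 1.5, K = 3
probs = predictive([2, 1, 0], alpha=1.5, n_minus_i=3)
```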
30 Inference in FMMs
Clustering: infer z (marginalize over π, θ):
  p(z | x, α, H) = p(x | z, H) p(z | α) / Σ_z p(x | z, H) p(z | α)
where
  p(z | α) = (Γ(α) / Γ(α + N)) ∏_{k=1}^K (Γ(n_k + α/K) / Γ(α/K))
  p(x | z, H) = ∫_Θ [ ∏_{i=1}^N p(x_i | θ_{z_i}) ] ∏_{k=1}^K H(θ_k) dθ
Parameter estimation: infer π, θ:
  p(π, θ | x, α, H) = Σ_z [ p(π | z, α) ∏_{k=1}^K p(θ_k | x, H) ] p(z | x, α, H)
No analytic procedure.
35 Approximate inference for FMMs
No exact inference because of the unknown cluster identifiers z.
Expectation-Maximization (EM) is widely used, but we will focus on MCMC because of the connection with the Dirichlet process.
Gibbs sampling, a Markov chain Monte Carlo (MCMC) integration method:
- Set of random variables v = {v_1, v_2, ..., v_M}; we want to compute p(v).
- Randomly initialize their values.
- At each iteration t, sample one variable v_i and hold the rest constant:
    v_i^{(t)} ∼ p(v_i | v_j^{(t-1)}, j ≠ i)   (usually tractable)
    v_j^{(t)} = v_j^{(t-1)} for j ≠ i
This creates a Markov chain with p(v) as its equilibrium distribution.
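As a self-contained illustration of the scheme (deliberately unrelated to mixtures), here is a Gibbs sampler for a standard bivariate normal with correlation ρ, where each full conditional is the textbook result N(ρ · other, 1 - ρ²):

```python
import random

def gibbs_bivariate_normal(rho, iters, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho.
    Full conditionals: x | y ~ N(rho*y, 1 - rho^2), and symmetrically for y."""
    random.seed(seed)
    sd = (1 - rho ** 2) ** 0.5
    x = y = 0.0
    samples = []
    for _ in range(iters):
        x = random.gauss(rho * y, sd)   # sample x | y, holding y fixed
        y = random.gauss(rho * x, sd)   # sample y | x, holding x fixed
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(rho=0.8, iters=2000)
```

After burn-in, the pairs are (correlated) draws from the target, and their empirical correlation should approach ρ.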
36 Gibbs sampling for FMMs
State variables: z_1, ..., z_N, θ_1, ..., θ_K, π. Conditional distributions:
  p(π | z, θ) = Dir(α/K + n_1, ..., α/K + n_K)
  p(θ_k | x, z) ∝ p(θ_k) ∏_{i : z_i = k} p(x_i | θ_k) = H(θ_k) ∏_{i : z_i = k} F_{θ_k}(x_i)
  p(z_i = k | π, θ, x) ∝ p(z_i = k | π) p(x_i | z_i = k, θ_k) = π_k F_{θ_k}(x_i)
We can avoid sampling π (collapsing it out):
  p(z_i = k | z_{-i}, θ, x) ∝ p(x_i | θ_k) p(z_i = k | z_{-i}) ∝ F_{θ_k}(x_i) (n_k^{(-i)} + α/K)
37 Gibbs sampling for FMMs (example)
Mixture of 4 bivariate Gaussians; normal-inverse-Wishart prior on θ_k = (μ_k, Σ_k), conjugate to the normal distribution:
  Σ_k ∼ W(ν, ·),  μ_k ∼ N(ϑ, Σ_k / κ)
(Figure, from Sudderth, 2008: learning a mixture of K = 4 Gaussians with the Gibbs sampler; plots show the current parameters and data log-likelihood log p(x | π, θ) after T=2 (top), T=10 (middle), and T=40 (bottom) iterations from two random initializations.)
38 FMMs: alternative representation
Standard representation:
  π ∼ Dir(α),  θ_k ∼ H,  z_i ∼ π,  x_i ∼ F(θ_{z_i})
Equivalent representation with a discrete mixing measure:
  G(θ) = Σ_{k=1}^K π_k δ(θ, θ_k),  with π ∼ Dir(α), θ_k ∼ H
  θ̄_i ∼ G,  x_i ∼ F(θ̄_i)
(Graphical models: α → π → z_i → x_i with H → θ_k, vs. α, H → G → θ̄_i → x_i.)
40 Going nonparametric!
The problem with finite FMMs: what if K is unknown? How many parameters?
Idea: let's use ∞ parameters! We want something of the kind:
  p(x_i | π, θ_1, θ_2, ...) = Σ_{k=1}^∞ π_k p(x_i | θ_k)
How to define such a measure? We'd like the nice conjugacy properties of the Dirichlet to carry over... Is there such a thing as the limit of a Dirichlet?
43 The (practical) Dirichlet process
The Dirichlet process DP(α, H) is a distribution over probability measures over Θ.
- H(θ) is the base (mean) measure. Think of μ for a Gaussian, but in the space of probability measures.
- α is the concentration parameter. It controls the dispersion around the mean H.
44 The Dirichlet process (cont'd)
A draw G ∼ DP(α, H) is an infinite discrete probability measure:
  G(θ) = Σ_{k=1}^∞ π_k δ(θ, θ_k),  where θ_k ∼ H
and π is sampled from a stick-breaking prior.
(Figure, from Orbanz & Teh, 2008: a draw G on Θ.)
Break a stick: imagine a stick of length one. For k = 1, ..., ∞, do the following:
- Break the stick at a point drawn from Beta(1, α).
- Let π_k be the length of the piece broken off, and keep the remainder of the stick.
Following standard convention, we write π ∼ GEM(α). (Details in the second part of the talk.)
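The "break a stick" loop translates almost line for line into code. A truncated sketch (the truncation level is an illustrative choice; for moderate α the leftover stick after that many breaks is vanishingly small):

```python
import random

def stick_breaking(alpha, n_atoms, seed=1):
    """Truncated stick-breaking: beta_k ~ Beta(1, alpha) and
    pi_k = beta_k * prod_{l<k} (1 - beta_l), i.e. beta_k times the remaining stick."""
    random.seed(seed)
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        beta = random.betavariate(1.0, alpha)
        weights.append(remaining * beta)   # length of the piece broken off
        remaining *= 1.0 - beta            # keep the remainder of the stick
    return weights

w = stick_breaking(alpha=2.0, n_atoms=100)
```

Pairing each weight with an atom θ_k ∼ H gives a (truncated) draw from DP(α, H).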
46 Stick-breaking, intuitively
(Figure, from Sudderth, 2008: stick-breaking proportions β_k and resulting weights π_k for α = 1 and α = 5.)
- Small α: lots of weight assigned to few θ_k's; G will be very different from the base measure H.
- Large α: weight distributed more equally over the θ_k's; G will resemble the base measure H.
47 The Dirichlet process
(Figure, from Navarro et al., 2005: a base measure H and draws G ∼ DP(α, H).)
48 The DP mixture model (DPMM)
Let's use G ∼ DP(α, H) to build an infinite mixture model:
  G ∼ DP(α, H)
  θ̄_i ∼ G
  x_i ∼ F(θ̄_i),  i = 1 ... N
(Graphical model: α, H → G → θ̄_i → x_i, plate over i.)
49 DPM (cont'd)
Using explicit cluster indicators z = (z_1, z_2, ..., z_N):
  π ∼ GEM(α)
  θ_k ∼ H,  k = 1, ..., ∞
  z_i ∼ π,  x_i ∼ F(θ_{z_i}),  i = 1, ..., N
(Graphical model: α → π → z_i → x_i; H → θ_k, infinite plate over k.)
50 Chinese restaurant process
So far, we only have a generative model. Is there a nice conjugacy property to use during inference?
It turns out (details in part 2) that if
  π ∼ GEM(α),  z_i ∼ π
then the distribution p(z | α) = ∫ p(z | π) p(π) dπ is easily tractable, and is known as the Chinese restaurant process (CRP).
51 Chinese restaurant process (cont'd)
Restaurant with ∞ tables, each with ∞ capacity; z_i = table at which customer i sits upon entering.
- Customer 1 sits at table 1.
- Customer 2 sits at table 1 w. prob. 1/(1 + α), at table 2 w. prob. α/(1 + α).
- Customer i sits at table k w. prob. ∝ n_k (# people at k), at a new table w. prob. ∝ α:
    p(z_i = k) = n_k / (α + i - 1)
    p(z_i = k_new) = α / (α + i - 1)
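The seating rule can be simulated directly (the number of customers and α below are illustrative):

```python
import random

def crp(n_customers, alpha, seed=0):
    """Simulate table assignments under the Chinese restaurant process:
    customer i joins table k with prob. n_k / (alpha + i - 1),
    or a new table with prob. alpha / (alpha + i - 1)."""
    random.seed(seed)
    tables = []        # tables[k] = number of customers at table k
    assignments = []
    for _ in range(n_customers):
        weights = tables + [alpha]   # existing tables, then the "new table" option
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)         # open a new table
        else:
            tables[k] += 1
        assignments.append(k)
    return assignments, tables

assignments, tables = crp(100, alpha=1.0)
```

The rich-get-richer effect is visible in `tables`: a few tables collect most customers, and the number of occupied tables grows only logarithmically in n.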
55 Gibbs sampling for DPMMs
Via the CRP, we can find the conditional distributions for Gibbs sampling. State: θ_1, ..., θ_K, z.
  p(θ_k | x, z) ∝ p(θ_k) ∏_{i : z_i = k} p(x_i | θ_k) = h(θ_k) ∏_{i : z_i = k} f(x_i | θ_k)
  p(z_i = k | z_{-i}, θ, x) ∝ p(x_i | θ_k) p(z_i = k | z_{-i})
    ∝ n_k^{(-i)} f(x_i | θ_k)   for an existing k
    ∝ α f(x_i | θ_k)            for a new k
The number of clusters K grows as more data are observed, asymptotically as α log n.
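A sketch of just the assignment step above, for a DP mixture of unit-variance Gaussians. The new-cluster weight α ∫ f(x_i | θ) dH(θ) is approximated here by a Monte Carlo average over draws from a hypothetical base measure H = N(0, 3); the counts, cluster means, and α are all illustrative:

```python
import math
import random

def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def assignment_probs(x_i, counts, thetas, alpha, prior_draws):
    """CRP-based Gibbs step for a DP mixture of unit-variance Gaussians:
    existing cluster k scores n_k^{(-i)} * f(x_i | theta_k); the new-cluster
    score alpha * E_H[f(x_i | theta)] is a Monte Carlo average over prior_draws."""
    scores = [n * normal_pdf(x_i, mu) for n, mu in zip(counts, thetas)]
    scores.append(alpha * sum(normal_pdf(x_i, mu) for mu in prior_draws) / len(prior_draws))
    total = sum(scores)
    return [s / total for s in scores]

random.seed(0)
prior_draws = [random.gauss(0.0, 3.0) for _ in range(500)]   # draws from H = N(0, 3)
probs = assignment_probs(x_i=0.1, counts=[5, 3], thetas=[0.0, 4.0],
                         alpha=1.0, prior_draws=prior_draws)
```

A full sampler would draw z_i from these probabilities, instantiate a new θ from the posterior given x_i when the new-cluster option wins, and then resample each θ_k given its assigned data.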
57 Gibbs sampling for DPMMs (example)
Mixture of bivariate Gaussians.
(Figure, from Sudderth, 2008: learning a mixture of Gaussians using the Dirichlet process Gibbs sampler. Columns show the parameters of clusters currently assigned to observations, and the corresponding data log-likelihoods, after T=2 (top), T=10 (middle), and T=50 (bottom) iterations from two initializations.)
58 END OF FIRST PART.
60 De Finetti's REDUX
Theorem (De Finetti, a.k.a. the representation theorem). A sequence of random variables (x_1, x_2, ...) is infinitely exchangeable if and only if, for all N, there exist a random variable θ and a probability measure p on it such that
  p(x_1, x_2, ..., x_N) = ∫_Θ p(θ) ∏_{i=1}^N p(x_i | θ) dθ
The theorem wouldn't be true if θ's range were limited to Euclidean vector spaces. We need to allow θ to range over measures: p(θ) is a distribution on measures, like the DP.
62 Dirichlet process REDUX
Definition. Let Θ be a measurable space (of parameters), H a probability distribution on Θ, and α a positive scalar. A Dirichlet process is the distribution of a random probability measure G over Θ such that, for any finite partition (T_1, ..., T_K) of Θ,
  (G(T_1), ..., G(T_K)) ∼ Dir(αH(T_1), ..., αH(T_K))
In particular, E[G(T_k)] = H(T_k).
(Figure, from Sudderth, 2008: two finite partitions (T_1, T_2, T_3) and (T̃_1, ..., T̃_5) of Θ.)
63 Posterior conjugacy
Via the conjugacy of the Dirichlet distribution, we know that if θ ∈ T_k, then
  p(G(T_1), ..., G(T_K) | θ ∈ T_k) = Dir(αH(T_1), ..., αH(T_k) + 1, ..., αH(T_K))
Formalizing this analysis, we obtain that if
  G ∼ DP(α, H),  θ_i | G ∼ G,  i = 1, ..., N
then the posterior measure also follows a Dirichlet process:
  p(G | θ_1, ..., θ_N, α, H) = DP( α + N, (αH + Σ_{i=1}^N δ_{θ_i}) / (α + N) )
The DP defines a conjugate prior for distributions on arbitrary measure spaces.
64 Generating samples: stick breaking
Sethuraman (1994): equivalent definition of the Dirichlet process through the stick-breaking construction. G ∼ DP(α, H) iff
  G(θ) = Σ_{k=1}^∞ π_k δ(θ, θ_k)
where θ_k ∼ H and
  π_k = β_k ∏_{l=1}^{k-1} (1 - β_l),  β_l ∼ Beta(1, α)
(Figure, from Sudderth, 2008: stick-breaking proportions β_k and weights π_k for α = 1 and α = 5.)
65 Stick-breaking (derivation) [Teh 2007]
We know that (posterior), if G ∼ DP(α, H) and θ | G ∼ G, then θ ∼ H and
  G | θ ∼ DP( α + 1, (αH + δ_θ) / (α + 1) )
Consider the partition ({θ}, Θ \ {θ}) of Θ. We have:
  (G({θ}), G(Θ \ {θ})) ∼ Dir( (α + 1) [(αH + δ_θ)/(α + 1)]({θ}), (α + 1) [(αH + δ_θ)/(α + 1)](Θ \ {θ}) )
    = Dir(1, α) = Beta(1, α)
So G has a point mass located at θ:
  G = β δ_θ + (1 - β) G',  β ∼ Beta(1, α)
where G' is the renormalized probability measure with the point mass removed. What is G'?
69 A little more theory... Dirichlet process REDUX

Stick-breaking (derivation) [Teh 2007]

We have: G ~ DP(α, H), θ | G ~ G, hence G | θ ~ DP(α + 1, (αH + δ_θ)/(α + 1)), and

G = β δ_θ + (1 − β) G′, β ~ Beta(1, α)

Consider a further partition ({θ}, T_1, ..., T_K) of Θ:

(G({θ}), G(T_1), ..., G(T_K)) = (β, (1 − β) G′(T_1), ..., (1 − β) G′(T_K)) ~ Dir(1, αH(T_1), ..., αH(T_K))

Using the agglomerative/decimative property of the Dirichlet distribution, we get

(G′(T_1), ..., G′(T_K)) ~ Dir(αH(T_1), ..., αH(T_K)), i.e. G′ ~ DP(α, H)
72 A little more theory... Dirichlet process REDUX

Stick-breaking (derivation) [Teh 2007]

Therefore, for G ~ DP(α, H):

G = β_1 δ_{θ_1} + (1 − β_1) G′_1
  = β_1 δ_{θ_1} + (1 − β_1)(β_2 δ_{θ_2} + (1 − β_2) G′_2)
  = ...
  = Σ_{k=1}^∞ π_k δ_{θ_k}

where

π_k = β_k Π_{l=1}^{k−1} (1 − β_l), β_l ~ Beta(1, α),

which is the stick-breaking construction.
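The construction above truncates naturally for simulation. As a minimal NumPy sketch (the function name `stick_breaking` and the truncation level are ours, not from the slides), the weights π_k can be drawn and checked to sum to (essentially) one:

```python
import numpy as np

def stick_breaking(alpha, num_sticks, rng):
    """Weights pi_k = beta_k * prod_{l<k} (1 - beta_l), truncated at num_sticks."""
    betas = rng.beta(1.0, alpha, size=num_sticks)
    leftover = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * leftover

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=2.0, num_sticks=1000, rng=rng)
# With enough sticks, the weights account for essentially all of the unit mass.
```

Smaller α breaks off larger pieces early, so the mass concentrates on fewer atoms.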
73 A little more theory... Dirichlet process REDUX

Chinese restaurant (derivation)

Once again, we start from the posterior:

p(G | θ̄_1, ..., θ̄_N, α, H) = DP(α + N, (1/(α + N)) (αH + Σ_{i=1}^N δ_{θ̄_i}))

The expected measure of any subset T ⊆ Θ is:

E[G(T) | θ̄_1, ..., θ̄_N, α, H] = (1/(α + N)) (αH(T) + Σ_{i=1}^N δ_{θ̄_i}(T))

Since G is discrete, some of the {θ̄_i}_{i=1}^N ~ G take identical values. Assuming K ≤ N unique values θ_1, ..., θ_K, where N_k is the number of θ̄_i equal to θ_k:

E[G(T) | θ̄_1, ..., θ̄_N, α, H] = (1/(α + N)) (αH(T) + Σ_{k=1}^K N_k δ_{θ_k}(T))
76 A little more theory... Dirichlet process REDUX

Chinese restaurant (derivation)

A bit informally... let T_k contain θ_k and shrink it arbitrarily. In the limit, we have that

p(θ̄_{N+1} = θ | θ̄_1, ..., θ̄_N, α, H) = (1/(α + N)) (αH(θ) + Σ_{k=1}^K N_k δ_{θ_k}(θ))

This is the generalized Polya urn scheme: an urn contains one ball for each preceding observation, with a different color for each distinct θ_k. Each time a ball is drawn from the urn, we replace it and add one more ball of the same color. There is also a special weighted ball, drawn with probability proportional to α relative to the normal balls, which yields a new, previously unseen color drawn from H. [This description is from Sudderth, 2008]

This allows us to sample from a Dirichlet process without explicitly constructing the underlying G ~ DP(α, H).
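The urn scheme is itself a few-line sampler. Below is a sketch under illustrative assumptions (H = N(0, 1); the helper names `polya_urn` and `base_sampler` are ours, not from the slides):

```python
import numpy as np

def polya_urn(num_draws, alpha, base_sampler, rng):
    """Sample theta_1..theta_N marginally from DP(alpha, H), never building G."""
    thetas = []
    for n in range(num_draws):
        if rng.random() < alpha / (alpha + n):
            thetas.append(base_sampler(rng))        # weighted ball: fresh draw from H
        else:
            thetas.append(thetas[rng.integers(n)])  # copy the color of a previous ball
    return np.array(thetas)

rng = np.random.default_rng(1)
draws = polya_urn(200, alpha=2.0, base_sampler=lambda r: r.normal(), rng=rng)
# Ties appear quickly, so the number of unique values stays far below 200.
```

Note that drawing a uniformly chosen previous value is exactly drawing an existing color with probability N_k/(α + n).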
78 A little more theory... Dirichlet process REDUX

Chinese restaurant (derivation)

The Dirichlet process implicitly partitions the data. Let z_i indicate the subset (cluster) associated with the i-th observation, i.e. θ̄_i = θ_{z_i}. From the previous slide, we get:

p(z_{N+1} = z | z_1, ..., z_N, α) = (1/(α + N)) (α δ(z, k̄) + Σ_{k=1}^K N_k δ(z, k)),

where k̄ denotes a previously unused cluster index. This is the Chinese restaurant process (CRP). It induces an exchangeable distribution on partitions: the joint distribution is invariant to the order in which observations are assigned to clusters.
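In the restaurant metaphor, customers (observations) pick tables (clusters) sequentially. A minimal simulation of the assignment rule above (`crp` is our name for this sketch):

```python
import numpy as np

def crp(num_customers, alpha, rng):
    """Sequential table assignments z_1..z_N under the Chinese restaurant process."""
    counts = [1]            # N_k for each occupied table; customer 1 opens table 0
    z = [0]
    for n in range(1, num_customers):
        probs = np.array(counts + [alpha]) / (n + alpha)  # existing: N_k, new: alpha
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):
            counts.append(1)        # open a new table
        else:
            counts[table] += 1
        z.append(table)
    return z, counts

rng = np.random.default_rng(2)
z, counts = crp(num_customers=500, alpha=3.0, rng=rng)
# The number of occupied tables grows roughly like alpha * log(N).
```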
80 A little more theory... Dirichlet process REDUX

Take-away message: these representations are all equivalent!

Posterior DP: G ~ DP(α, H), θ | G ~ G, G | θ ~ DP(α + 1, (αH + δ_θ)/(α + 1))

Stick-breaking construction: G(θ) = Σ_{k=1}^∞ π_k δ(θ, θ_k), θ_k ~ H, π ~ GEM(α)

Generalized Polya urn: p(θ̄_{N+1} = θ | θ̄_1, ..., θ̄_N, α, H) = (1/(α + N)) (αH(θ) + Σ_{k=1}^K N_k δ_{θ_k}(θ))

Chinese restaurant process: p(z_{N+1} = z | z_1, ..., z_N, α) = (1/(α + N)) (α δ(z, k̄) + Σ_{k=1}^K N_k δ(z, k))
81 Outline The hierarchical Dirichlet process 1 Introduction and background Bayesian learning Nonparametric models 2 Finite mixture models Bayesian models Clustering with FMMs Inference 3 Dirichlet process mixture models Going nonparametric! The Dirichlet process DP mixture models Inference 4 A little more theory... De Finetti's REDUX Dirichlet process REDUX 5 The hierarchical Dirichlet process
82 The hierarchical Dirichlet process

The DP mixture model (DPMM)

Let's use G ~ DP(α, H) to build an infinite mixture model:

G ~ DP(α, H)
θ̄_i | G ~ G
x_i | θ̄_i ~ F(θ̄_i), i = 1, ..., N

[Graphical model: H, α → G → θ̄_i → x_i, with a plate over i = 1, ..., N]
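Concretely, a DPMM draw can be simulated by combining a truncated stick-breaking G with a likelihood F. In the sketch below, H = N(0, 5²) and F(θ) = N(θ, 1) are illustrative choices, not prescribed by the slides:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, num_sticks, n_obs = 1.0, 200, 100

# Truncated stick-breaking weights and cluster parameters theta_k ~ H.
b = rng.beta(1.0, alpha, size=num_sticks)
pi = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
pi /= pi.sum()                                  # renormalize the truncation
theta = rng.normal(0.0, 5.0, size=num_sticks)   # H = N(0, 5^2), illustrative

# theta_bar_i | G ~ G amounts to picking atom k with probability pi_k;
# then x_i | theta_bar_i ~ F = N(theta_bar_i, 1), again illustrative.
k = rng.choice(num_sticks, size=n_obs, p=pi)
x = rng.normal(theta[k], 1.0)
```

Because G is discrete, many observations share the same atom, which is exactly the clustering behavior the mixture exploits.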
83 The hierarchical Dirichlet process

Related subgroups of data

Dataset with J related groups x = (x_1, ..., x_J), where x_j = (x_{j1}, ..., x_{jN_j}) contains N_j observations. We want these groups to share clusters (transfer knowledge). [Figure: grouped observations x_{ij} organized into groups 1, 2, ..., m; from Jordan, 2005]
84 The hierarchical Dirichlet process

Hierarchical Dirichlet process (HDP)

Global probability measure G_0 ~ DP(γ, H) defines a set of shared clusters:

G_0(θ) = Σ_{k=1}^∞ β_k δ(θ, θ_k), θ_k ~ H, β ~ GEM(γ)

Group-specific distributions G_j ~ DP(α, G_0) (note G_0 as the base measure):

G_j(θ) = Σ_{t=1}^∞ π̃_{jt} δ(θ, θ̃_{jt}), θ̃_{jt} ~ G_0, π̃_j ~ GEM(α)

Each local cluster has its parameter θ̃_{jt} copied from some global cluster θ_k. For each group, data points are generated according to:

θ̄_{ji} | G_j ~ G_j, x_{ji} ~ F(θ̄_{ji})
87 The hierarchical Dirichlet process

The HDP mixture model (HDPMM)

G_0 | γ, H ~ DP(γ, H)
G_j | α, G_0 ~ DP(α, G_0)
θ̄_{ji} | G_j ~ G_j
x_{ji} | θ̄_{ji} ~ F(θ̄_{ji})

[Graphical model: H, γ → G_0; α → G_j → θ̄_{ji} → x_{ji}, with plates over groups j = 1, ..., J and observations i = 1, ..., N_j]
88 The hierarchical Dirichlet process

The HDP mixture model (HDPMM)

Recall G_j(θ) = Σ_{t=1}^∞ π̃_{jt} δ(θ, θ̃_{jt}), θ̃_{jt} ~ G_0, π̃_j ~ GEM(α). Since G_0 is discrete, each group might create several copies of the same global cluster. Aggregating the probabilities:

G_j(θ) = Σ_{k=1}^∞ π_{jk} δ(θ, θ_k), π_{jk} = Σ_{t: k_{jt}=k} π̃_{jt}

It can be shown that π_j ~ DP(α, β), where β = (β_1, β_2, ...) gives the average weights of the global clusters, π_j = (π_{j1}, π_{j2}, ...) the group-specific weights, and α controls the variability of the cluster weights across groups.
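The relation π_j ~ DP(α, β) is easy to visualize numerically: over a finite truncation of β, a DP with a discrete base measure reduces to a finite Dirichlet. A sketch (truncation level and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, alpha, K, J = 5.0, 2.0, 50, 4

# Truncated global weights beta ~ GEM(gamma), renormalized.
b = rng.beta(1.0, gamma, size=K)
beta = b * np.concatenate(([1.0], np.cumprod(1.0 - b)[:-1]))
beta /= beta.sum()

# Over this finite truncation, pi_j | beta ~ DP(alpha, beta) reduces to
# a finite Dirichlet: pi_j ~ Dir(alpha * beta_1, ..., alpha * beta_K).
pi = rng.dirichlet(alpha * beta, size=J)

# Every group reuses the same K global clusters; alpha sets how closely
# each group's weights pi_j track the average weights beta.
```

Large α makes the groups nearly identical to β; small α lets each group emphasize its own subset of the shared clusters.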
89 The hierarchical Dirichlet process

THANK YOU. QUESTIONS?
More informationBayesian nonparametric latent feature models
Bayesian nonparametric latent feature models François Caron UBC October 2, 2007 / MLRG François Caron (UBC) Bayes. nonparametric latent feature models October 2, 2007 / MLRG 1 / 29 Overview 1 Introduction
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationHierarchical Dirichlet Processes
Hierarchical Dirichlet Processes Yee Whye Teh, Michael I. Jordan, Matthew J. Beal and David M. Blei Computer Science Div., Dept. of Statistics Dept. of Computer Science University of California at Berkeley
More informationCSC321 Lecture 18: Learning Probabilistic Models
CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling
More informationA Simple Proof of the Stick-Breaking Construction of the Dirichlet Process
A Simple Proof of the Stick-Breaking Construction of the Dirichlet Process John Paisley Department of Computer Science Princeton University, Princeton, NJ jpaisley@princeton.edu Abstract We give a simple
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationComputational Cognitive Science
Computational Cognitive Science Lecture 8: Frank Keller School of Informatics University of Edinburgh keller@inf.ed.ac.uk Based on slides by Sharon Goldwater October 14, 2016 Frank Keller Computational
More informationMachine Learning Summer School
Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,
More informationLecture 2: Priors and Conjugacy
Lecture 2: Priors and Conjugacy Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 6, 2014 Some nice courses Fred A. Hamprecht (Heidelberg U.) https://www.youtube.com/watch?v=j66rrnzzkow Michael I.
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More informationProbabilistic modeling of NLP
Structured Bayesian Nonparametric Models with Variational Inference ACL Tutorial Prague, Czech Republic June 24, 2007 Percy Liang and Dan Klein Probabilistic modeling of NLP Document clustering Topic modeling
More informationPROBABILITY DISTRIBUTIONS. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception
PROBABILITY DISTRIBUTIONS Credits 2 These slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Parametric Distributions 3 Basic building blocks: Need to determine given Representation:
More informationNaïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability
Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish
More informationInference for a Population Proportion
Al Nosedal. University of Toronto. November 11, 2015 Statistical inference is drawing conclusions about an entire population based on data in a sample drawn from that population. From both frequentist
More informationTree-Based Inference for Dirichlet Process Mixtures
Yang Xu Machine Learning Department School of Computer Science Carnegie Mellon University Pittsburgh, USA Katherine A. Heller Department of Engineering University of Cambridge Cambridge, UK Zoubin Ghahramani
More information