STAT Advanced Bayesian Inference

1 STAT Advanced Bayesian Inference 1 / 32. Meng Li, Department of Statistics. Jan 23, 2018.

2 The Dirichlet distribution 2 / 32. θ ∼ Dirichlet(a_1, ..., a_k) with density p(θ_1, θ_2, ..., θ_k) = [Γ(Σ_{j=1}^k a_j) / Π_{j=1}^k Γ(a_j)] Π_{j=1}^k θ_j^{a_j − 1}. Define α = Σ_{j=1}^k a_j and π = (π_1, ..., π_k) = a/α. Expected value and variance of the Dirichlet(a_1, ..., a_k) distribution: E(θ_j) = a_j/α = π_j and V(θ_j) = E(θ_j)[1 − E(θ_j)]/(1 + α). Note that α is a precision parameter (the prior sample size): large α means low variance.
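
As a quick sanity check on these moment formulas, the sketch below (Python with NumPy; my own illustration, with made-up parameters rather than anything from the slides) draws from a Dirichlet and compares Monte Carlo means and variances with a_j/α and π_j(1 − π_j)/(1 + α).

```python
import numpy as np

# Hypothetical Dirichlet parameters (not from the slides).
a = np.array([2.0, 5.0, 3.0])
alpha = a.sum()                 # precision (prior sample size)
pi = a / alpha                  # prior mean vector

draws = np.random.default_rng(0).dirichlet(a, size=200_000)

print("MC mean     :", draws.mean(axis=0))
print("theory mean :", pi)
print("MC variance :", draws.var(axis=0))
print("theory var  :", pi * (1 - pi) / (1 + alpha))
```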

3 Conjugate analysis for multinomial data 3 / 32. Data: y = (n_1, ..., n_k), where n_j = number of items in category j. Prior: θ ∼ Dirichlet(a_1, ..., a_k). Likelihood: p(n_1, n_2, ..., n_k | θ_1, θ_2, ..., θ_k) ∝ Π_{j=1}^k θ_j^{n_j}. Posterior: θ | n_1, ..., n_k ∼ Dirichlet(n_1 + a_1, ..., n_k + a_k). Posterior expected value: E(θ_j | n_1, ..., n_k) = (n_j + a_j)/(n + α).
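
A minimal sketch of this conjugate update (Python/NumPy; the prior and the counts are made-up numbers, not the course data):

```python
import numpy as np

a = np.array([1.0, 1.0, 1.0, 1.0])     # hypothetical Dirichlet prior a_1..a_k
counts = np.array([12, 7, 30, 1])      # hypothetical category counts n_1..n_k

a_post = a + counts                    # Dirichlet(n_j + a_j) posterior
post_mean = a_post / a_post.sum()      # (n_j + a_j) / (n + alpha)
print(post_mean)
```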

4 Bayesian histograms 4 / 32. A histogram partitions the data space, ξ_0 < ξ_1 < ... < ξ_k, and records how many observations end up in each bin B_h, giving multinomial data. Probability model for histograms: f(y) = Σ_{h=1}^k 1(ξ_{h−1} < y ≤ ξ_h) π_h/(ξ_h − ξ_{h−1}), y ∈ R, where π = (π_1, ..., π_k) is an unknown probability vector and n_h = number of data points in partition (bin) h: ξ_{h−1} < y ≤ ξ_h. Prior on π = (π_1, ..., π_k): π ∼ Dirichlet(a_1, ..., a_k). Posterior: π | n_1, ..., n_k ∼ Dirichlet(n_1 + a_1, ..., n_k + a_k).

5 Illustration of Bayesian histograms 5 / 32. [Figure: simulated data with knots ξ_0 < ξ_1 < ... < ξ_6, and the base measure P_0 giving prior bin probabilities such as P_0(B_1), scaled by the precision α.]

6 Bayesian histograms, cont. 6 / 32. Posterior: π | n_1, ..., n_k ∼ Dirichlet(n_1 + a_1, ..., n_k + a_k). Specify a_1, ..., a_k through π = (π_1, ..., π_k) and α = Σ_{j=1}^k a_j; specify π from a base distribution P_0, so that for the hth bin π_h = P_0(B_h) = Pr(ξ_{h−1} < y ≤ ξ_h). Pros: the procedure is easy in that we have conjugacy, and it allows prior information to be included in frequentist histogram estimates. Cons: results are sensitive to the knots; allowing free knots is computationally demanding; even averaging over random knots tends to produce bumps in the density estimate; and the estimate lacks smoothness, since all pairs of bins have negative correlations regardless of how near they are.

7 Bayesian histogram example 7 / 32. [Figure: the data, the prior, and the posterior histogram estimate for a small value of α.]

8 Bayesian histogram example 8 / 32. [Figure: posterior histogram estimates for several values of the precision α.]

9 Histograms are sensitive to the choice of bins 9 / 32. [Figure: posterior histogram estimates under different bin choices and several values of α.]

10 Simulation Experiment 10 / 32. Simulate data from the mixture f(y) = 0.75 Beta(y; 1, 5) + 0.25 Beta(y; 2, 2), with n = 1 samples. Assume the data lie in [0, 1], choose 1 equally-spaced knots, and then apply the Bayes histogram approach.
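
A sketch of this experiment (Python/NumPy). The sample size and number of knots on the slide were lost in extraction, so the values n = 500 and 10 knots below are placeholders, as is the uniform base measure P_0:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate from 0.75*Beta(1,5) + 0.25*Beta(2,2); n and the number of knots
# are placeholder choices, not the values used on the slide.
n, k, alpha = 500, 10, 1.0
comp = rng.random(n) < 0.75
y = np.where(comp, rng.beta(1, 5, n), rng.beta(2, 2, n))

knots = np.linspace(0.0, 1.0, k + 1)          # xi_0 < ... < xi_k on [0, 1]
counts, _ = np.histogram(y, bins=knots)       # n_h per bin
widths = np.diff(knots)

# Base measure P_0 = Uniform(0,1): prior bin probabilities pi_h = P_0(B_h).
a = alpha * widths                            # a_h = alpha * pi_h
post_mean_pi = (counts + a) / (n + alpha)     # E(pi_h | data)
post_mean_density = post_mean_pi / widths     # posterior mean of f on each bin
print(post_mean_density)
```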

11 Bayes Histogram Estimate 11 / 32

12 The Dirichlet process 12 / 32. Let B_1, B_2, ..., B_k be a partition of the outcome space Ω, and let P(B_1), ..., P(B_k) denote the distribution over the partition. The Dirichlet distribution gives a distribution over a space of distributions: (P(B_1), ..., P(B_k)) ∼ Dirichlet(αP_0(B_1), ..., αP_0(B_k)), where P_0 is a fixed probability measure (e.g. the N(0,1) distribution). The Dirichlet distribution is closed under summation or splitting of bins, so it can be used to define a stochastic process in a consistent way; compare with GPs. A random probability measure P follows a Dirichlet process, P ∼ DP(αP_0), with base measure P_0 and precision parameter α iff (P(B_1), ..., P(B_k)) ∼ Dirichlet(αP_0(B_1), ..., αP_0(B_k)) for any finite (measurable) partition B_1, ..., B_k.

13 The Dirichlet process - properties 13 / 32. If P ∼ DP(αP_0) then, for any measurable set B, P(B) ∼ Beta[αP_0(B), α(1 − P_0(B))], so E[P(B)] = P_0(B) and Var[P(B)] = P_0(B)[1 − P_0(B)]/(1 + α). Model: y_i | P ∼ P iid, for i = 1, ..., n; prior: P ∼ DP(αP_0). Posterior for a finite partition: (P(B_1), ..., P(B_k)) | y ∼ Dirichlet(αP_0(B_1) + Σ_{i=1}^n 1(y_i ∈ B_1), ..., αP_0(B_k) + Σ_{i=1}^n 1(y_i ∈ B_k)).

14 The Dirichlet process - properties 14 / 32. Posterior for the unknown probability distribution P: P | y_1, ..., y_n ∼ DP(αP_0 + Σ_{i=1}^n δ_{y_i}). Since P(B) | y ∼ Beta(αP_0(B) + Σ_{i=1}^n 1(y_i ∈ B), α(1 − P_0(B)) + Σ_{i=1}^n 1(y_i ∈ B^c)), we have E(P(B) | y_1, ..., y_n) = (α/(α + n)) P_0(B) + (n/(α + n)) (1/n) Σ_{i=1}^n 1(y_i ∈ B).

15 Estimating a d.f. with a DP prior 15 / 32. If B = (−∞, y] then E(F(y) | y_1, ..., y_n) = (α/(α + n)) F_0(y) + (n/(α + n)) F_n(y), where F(y) is the unknown d.f., F_0(y) is the d.f. of P_0, and F_n(y) = (1/n) Σ_{i=1}^n 1(y_i ≤ y) is the empirical d.f. Note: under the DP posterior, F(·) is discrete with probability one, which is not great for continuous data. This is true in general: realisations from a DP are discrete with probability one.
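
A small sketch of the posterior mean d.f., which shrinks the empirical d.f. toward F_0 (Python with NumPy/SciPy; the standard normal base measure, α, and the data are illustrative choices of mine, not course values):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical setup: P_0 = N(0,1), precision alpha, and a small data set.
alpha = 5.0
y = np.array([-0.3, 0.1, 0.4, 1.2, 2.0])
n = len(y)

def posterior_mean_cdf(t):
    """E[F(t) | y_1..y_n] = alpha/(alpha+n) * F0(t) + n/(alpha+n) * Fn(t)."""
    F0 = norm.cdf(t)
    Fn = np.mean(y[:, None] <= np.atleast_1d(t), axis=0)
    return (alpha * F0 + n * Fn) / (alpha + n)

grid = np.linspace(-3, 3, 7)
print(posterior_mean_cdf(grid))
```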

16 Estimating a d.f. with a DP prior 16 / 32. [Figure: the empirical d.f. F_n, the prior d.f. F_0, and the posterior mean d.f. for several values of α.]

17 Stick-breaking characterization of DP (Sethuraman) 17 / 32. P ∼ DP(αP_0) is equivalent to an infinite mixture of point masses (stick-breaking picture): P(·) = Σ_{h=1}^∞ π_h δ_{θ_h}, where π_h = V_h Π_{l<h} (1 − V_l), V_h iid ∼ Beta(1, α), and θ_h iid ∼ P_0. Alternative notation for P ∼ DP(αP_0): π = (π_1, π_2, ...) ∼ Stick(α) and θ_h ∼ P_0.
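
The sketch below (Python/NumPy; the truncation level, α, and the N(0,1) base measure are arbitrary illustration choices) draws one approximate realisation from a DP by truncating the stick-breaking construction:

```python
import numpy as np

rng = np.random.default_rng(2)

def stick_breaking_draw(alpha, base_sampler, trunc=1000):
    """One truncated draw from DP(alpha * P0): atoms theta_h with weights pi_h."""
    V = rng.beta(1.0, alpha, size=trunc)                       # V_h ~ Beta(1, alpha)
    pi = V * np.cumprod(np.concatenate(([1.0], 1 - V[:-1])))   # pi_h = V_h * prod_{l<h}(1 - V_l)
    theta = base_sampler(trunc)                                # theta_h ~ P0
    return theta, pi

theta, pi = stick_breaking_draw(alpha=1.0, base_sampler=lambda m: rng.normal(0, 1, m))
print("weight captured by truncation:", pi.sum())              # close to 1 for large trunc
```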

18 Simulating stick-breaking priors α = 1 18 / 32

19 Simulating stick-breaking priors α = 1 19 / 32

20 Simulating stick-breaking priors α = 1 20 / 32

21 Simulating stick-breaking priors α = 1 21 / 32

22 Beyond DP - Pitman-Yor and Probit sticks 22 / 32. Pitman-Yor process with parameters P_0, 0 ≤ a < 1 and b > −a: P(·) = Σ_{h=1}^∞ π_h δ_{θ_h}, θ_h iid ∼ P_0, π_h = V_h Π_{l<h} (1 − V_l), V_h ∼ Beta(1 − a, b + ha). Probit stick-breaking with parameters µ and σ: P(·) = Σ_{h=1}^∞ π_h δ_{θ_h}, θ_h iid ∼ P_0, π_h = V_h Π_{l<h} (1 − V_l), V_h = Φ(x_h), where x_h iid ∼ N(µ, σ²).
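
For comparison with the DP sticks above, here is a sketch of the two alternative weight constructions (Python with NumPy/SciPy; a, b, µ, σ and the truncation level are illustrative choices, not course values):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
trunc = 1000
h = np.arange(1, trunc + 1)

def sticks_from_V(V):
    # pi_h = V_h * prod_{l<h} (1 - V_l)
    return V * np.cumprod(np.concatenate(([1.0], 1 - V[:-1])))

# Pitman-Yor sticks: V_h ~ Beta(1 - a, b + h*a)
a, b = 0.25, 1.0
pi_py = sticks_from_V(rng.beta(1 - a, b + h * a))

# Probit sticks: V_h = Phi(x_h), x_h ~ N(mu, sigma^2)
mu, sigma = 0.0, 1.0
pi_probit = sticks_from_V(norm.cdf(rng.normal(mu, sigma, trunc)))

print(pi_py[:5], pi_probit[:5])
```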

23 Finite mixture models 23 / 32. Mixture of normals: p(y) = Σ_{j=1}^k π_j φ(y; µ_j, σ_j²). Use allocation variables: I_i = j if y_i comes from φ(y; µ_j, σ_j²). Let I = (I_1, ..., I_n) and n_j = Σ_{i=1}^n 1(I_i = j). Gibbs sampling algorithm: (1) π_1, ..., π_k | I, y ∼ Dirichlet(a_1 + n_1, a_2 + n_2, ..., a_k + n_k); (2) σ_j² | I, y ∼ Inv-χ² and µ_j | I, σ_j², y ∼ Normal, for j = 1, ..., k; (3) I_i | π, µ, σ², y ∼ Multinomial(ω_{i,1}, ..., ω_{i,k}), i = 1, ..., n, where ω_{i,j} = π_j φ(y_i; µ_j, σ_j²) / Σ_{q=1}^k π_q φ(y_i; µ_q, σ_q²).
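
A compact sketch of this Gibbs sampler (Python with NumPy/SciPy). The priors on µ_j and σ_j² below are independent normal and inverse-gamma choices of my own rather than the exact normal/scaled-inverse-χ² prior implied on the slide, so σ_j² and µ_j are updated conditionally on each other; all hyperparameter values are placeholders:

```python
import numpy as np
from scipy.stats import invgamma

rng = np.random.default_rng(4)

def gibbs_mixture(y, k, iters=2000, a0=1.0, m0=0.0, v0=10.0, s0=1.0, nu0=2.0):
    """Gibbs sampler for a k-component normal mixture.
    Illustrative priors: pi ~ Dirichlet(a0,...,a0), mu_j ~ N(m0, v0),
    sigma_j^2 ~ InvGamma(nu0/2, nu0*s0/2)."""
    n = len(y)
    mu = rng.normal(np.mean(y), np.std(y), k)
    sig2 = np.full(k, np.var(y))
    pi = np.full(k, 1.0 / k)
    for it in range(iters):
        # 1. allocations I_i | pi, mu, sigma^2, y (multinomial with weights omega_ij)
        logw = (np.log(pi) - 0.5 * np.log(2 * np.pi * sig2)
                - 0.5 * (y[:, None] - mu) ** 2 / sig2)
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        I = np.array([rng.choice(k, p=wi) for wi in w])
        nj = np.bincount(I, minlength=k)
        # 2. mixture weights pi | I, y
        pi = rng.dirichlet(a0 + nj)
        # 3. component parameters: mu_j | sigma_j^2, I, y and sigma_j^2 | mu_j, I, y
        for j in range(k):
            yj = y[I == j]
            prec = 1.0 / v0 + nj[j] / sig2[j]
            mean = (m0 / v0 + yj.sum() / sig2[j]) / prec
            mu[j] = rng.normal(mean, np.sqrt(1.0 / prec))
            shape = 0.5 * (nu0 + nj[j])
            scale = 0.5 * (nu0 * s0 + np.sum((yj - mu[j]) ** 2))
            sig2[j] = invgamma.rvs(shape, scale=scale, random_state=rng)
    return pi, mu, sig2

# Example usage on synthetic two-component data.
y = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 0.5, 50)])
print(gibbs_mixture(y, k=2, iters=500))
```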

24 Infinite mixture models - DP mixtures 24 / 32. General mixture formulation: f(y | P) = ∫ K(y | θ) dP(θ), where K(y | θ) is a kernel and P(θ) is a mixing measure. Example 1: Student-t, t_ν(µ, σ²). Take K(y | θ) = φ(y | µ, λ) with µ fixed, θ = λ, and P(θ) the Inv-χ²(ν, σ²) distribution. Example 2: finite mixture of normals. K(y | θ) = φ(y | µ, σ²), θ = (µ, σ²), and P(θ) is a discrete distribution with Pr[θ = (µ_j, σ_j²)] = π_j, for j = 1, ..., k. Example 3: P ∼ DP(αP_0) yields the infinite mixture f(y) = Σ_{h=1}^∞ π_h K(y | θ_h), with π ∼ Stick(α).

25 DP mixture is like a finite mixture with large k 25 / 32. In infinite mixtures every observation has its own parameter: y_i ∼ K(θ_i). The DP is almost surely discrete, which produces ties: some of the θ_i will have exactly the same values, so the DP leads to clustering of the θ_i. Each observation has potentially its own parameter θ_i, but that parameter may be shared by other observations. In finite mixture models each observation also has its own parameter: y_i | I_i ∼ K(θ_{I_i}), I_i | π ∼ Multinomial(π_1, ..., π_k), θ_j ∼ P_0, π ∼ Dirichlet(α/k, ..., α/k). Neal (2000) shows that this finite mixture model approaches the DP mixture as k → ∞.
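
To see why Dirichlet(α/k, ..., α/k) weights behave like DP weights when k grows, this small sketch (Python/NumPy; α and the values of k are arbitrary illustration choices) counts how many components carry almost all of the mass:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha = 1.0

for k in (10, 100, 1000):
    pi = rng.dirichlet(np.full(k, alpha / k))
    # Number of components holding 99% of the total mass stays small as k grows.
    m = np.searchsorted(np.cumsum(np.sort(pi)[::-1]), 0.99) + 1
    print(f"k = {k:5d}: components holding 99% of mass = {m}")
```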

26 Marginalizing out P from a DP - Polya scheme 26 / 32. Hierarchical representation of DP mixtures: y_i ∼ K(θ_i), θ_i ∼ P, P ∼ DP(αP_0). We can actually marginalize out P to obtain the Polya urn scheme: p(θ_i | θ_1, ..., θ_{i−1}) ∼ (α/(α + i − 1)) P_0(θ_i) + (1/(α + i − 1)) Σ_{j=1}^{i−1} δ_{θ_j}. So p(θ_i | θ_1, ..., θ_{i−1}) is a mixture of the base measure P_0 and point masses at the previously drawn θ-values. A way to think about the scary "marginalizing out P": integrate out π in the finite mixture model and let k → ∞ [Neal, 2000].

27 DPs and the Chinese restaurant process 27 / 32. The so-called Polya urn scheme: p(θ_i | θ_1, ..., θ_{i−1}) ∼ (α/(α + i − 1)) P_0(θ_i) + (1/(α + i − 1)) Σ_{j=1}^{i−1} δ_{θ_j}. Chinese restaurant process: the first customer sits at an empty table and obtains the dish θ_1* from P_0. The second customer sits at the first customer's table with probability 1/(1 + α) and has dish θ_1*, or sits at a new table with probability α/(1 + α) and has dish θ_2* ∼ P_0. The ith customer sits at the table serving dish θ_j* with probability proportional to n_j, the number of customers sitting at table j, or sits at a new table with probability proportional to α.
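
A short simulation of the Chinese restaurant process (Python/NumPy; α and n are placeholder values), which also illustrates how the number of occupied tables grows slowly with n:

```python
import numpy as np

rng = np.random.default_rng(6)

def chinese_restaurant(n, alpha):
    """Return the table assignment of each of n customers under CRP(alpha)."""
    tables = []          # tables[j] = number of customers at table j
    seating = []
    for i in range(n):
        probs = np.array(tables + [alpha], dtype=float)
        probs /= probs.sum()                  # P(table j) ∝ n_j, P(new table) ∝ alpha
        j = rng.choice(len(probs), p=probs)
        if j == len(tables):
            tables.append(1)                  # open a new table (dish theta* ~ P_0)
        else:
            tables[j] += 1
        seating.append(j)
    return np.array(seating), np.array(tables)

seating, tables = chinese_restaurant(n=500, alpha=2.0)
print("number of occupied tables:", len(tables))
print("table sizes:", tables)
```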

28 Gibbs sampling DP mixtures - marginalizing P 28 / 32. Similar to Gibbs sampling for finite mixtures: data augmentation with mixture component indicators I_i. (1) Update the component allocation for the ith observation y_i by sampling from a multinomial with Pr(I_i = j | ...) ∝ n_j^(−i) K(y_i | θ_j*) for j = 1, ..., k^(−i), and ∝ α ∫ K(y_i | θ) dP_0(θ) for j = k^(−i) + 1 (a new component). (2) Update the unique parameter values θ* by sampling from p(θ_j* | ...) ∝ P_0(θ_j*) Π_{i: I_i = j} K(y_i | θ_j*). Note that, unlike finite mixtures, the I_i are not independent conditional on θ*; this is because we have marginalized out P. They have to be sampled sequentially.

29 Gibbs sampling for truncated DP mixtures 29 / 32. Set an upper bound N for the number of components and approximate the DP mixture by setting π_h = 0 for h = N + 1, .... Posterior sampling for the infinite mixture is now very similar to a finite mixture, and the I_i can be sampled independently. (1) Update the component allocation for the ith observation y_i by sampling from a multinomial with Pr(I_i = j | ...) ∝ π_j K(y_i | θ_j*), for j = 1, 2, ..., N. (2) Update the stick-breaking weights [recall: π_h = V_h Π_{l<h} (1 − V_l)]: V_j ∼ Beta(1 + n_j, α + Σ_{q=j+1}^N n_q), for j = 1, ..., N − 1 (with V_N = 1 closing the stick). (3) Update the unique parameter values θ_1*, ..., θ_N* just like in the finite mixture model; sample θ_j* from the prior P_0(θ) for empty clusters.
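
A blocked-Gibbs sketch along these lines (Python/NumPy). To keep it short the kernel variance is fixed and the base measure for the means is normal; the truncation level N, α, and all hyperparameters are illustrative choices, not values from the course:

```python
import numpy as np

rng = np.random.default_rng(7)

def blocked_gibbs_dpm(y, N=20, alpha=1.0, iters=1000, sigma2=1.0, m0=0.0, v0=10.0):
    """Truncated (blocked) Gibbs sampler for a DP mixture of normals.
    Assumptions: known common kernel variance sigma2, base measure mu_j ~ N(m0, v0)."""
    n = len(y)
    mu = rng.normal(m0, np.sqrt(v0), N)           # theta_h* drawn from P_0
    pi = np.full(N, 1.0 / N)
    for it in range(iters):
        # 1. allocations: Pr(I_i = j) ∝ pi_j * N(y_i; mu_j, sigma2), j = 1..N
        logw = np.log(pi) - 0.5 * (y[:, None] - mu) ** 2 / sigma2
        w = np.exp(logw - logw.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)
        I = np.array([rng.choice(N, p=wi) for wi in w])
        nj = np.bincount(I, minlength=N)
        # 2. stick-breaking weights: V_j ~ Beta(1 + n_j, alpha + sum_{q>j} n_q)
        tail = np.concatenate((np.cumsum(nj[::-1])[::-1][1:], [0]))
        V = rng.beta(1 + nj, alpha + tail)
        V[-1] = 1.0                                # V_N = 1 closes the stick
        pi = V * np.cumprod(np.concatenate(([1.0], 1 - V[:-1])))
        # 3. atoms: conjugate normal update for occupied clusters, prior draw for empty ones
        for j in range(N):
            if nj[j] == 0:
                mu[j] = rng.normal(m0, np.sqrt(v0))
            else:
                prec = 1 / v0 + nj[j] / sigma2
                mean = (m0 / v0 + y[I == j].sum() / sigma2) / prec
                mu[j] = rng.normal(mean, np.sqrt(1 / prec))
    return pi, mu, I

y = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])
pi, mu, I = blocked_gibbs_dpm(y)
print("occupied clusters:", np.unique(I).size)
```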

30 MCMC for DP mixtures 30 / 32. Let's look at the updating step: Pr(I_i = j | ...) ∝ n_j^(−i) K(y_i | θ_j*) for j = 1, ..., k^(−i), and ∝ α ∫ K(y_i | θ) dP_0(θ) for j = k^(−i) + 1. A customer chooses a table based on: the number of existing customers at the tables (with α imaginary customers at a new table); how compatible the taste of the customer (y_i) is with the different dishes served at occupied tables (θ_j*); and how compatible the taste of the customer (y_i) is with the different dishes that may be served at a new table. A P_0(θ) with large variance is equivalent to a very experimental cook: you never know what you get. The hyperparameter α clearly matters for the number of clusters (tables), but so does P_0. The hyperparameter α can be learned from data, and P_0 may contain hyperparameters (e.g. P_0 = N(µ, σ²)); just add updating steps for those.

31 Mixture of multivariate regressions - Model 31 / 32. The response vector y is p-dimensional and the covariate vector x is q-dimensional. The model is of the form p(y | x) = Σ_{j=1}^∞ π_j N(y_i; B_j x_i, Σ_j). Each component in the mixture is a Gaussian multivariate regression with its own regression coefficient matrix and covariance matrix: y_i (p × 1) = B_j (p × q) x_i (q × 1) + ε_i, with ε_i iid ∼ N(0, Σ_j) (p × 1). The mixture weights follow a DP stick-breaking prior, π ∼ Stick(α).

32 Mixture of multivariate regressions - Data 32 / 32. [Figure: scatter plots of the responses y_1 and y_2 against the covariates x_1 and x_2.]

33 Mixture of multivariate regressions - DPM 33 / 32. [Figure: fitted DP mixture of multivariate regressions, with the data coloured by estimated component, plotted against x_1 and x_2; panel title reports the number of components.]

34 Mixture of multivariate regressions - DPM 34 / 32. [Figure: posterior distribution of the number of components.]
