Bayesian Structure Modeling. SPFLODD December 1, 2011



2 Outline
Defining Bayesian
Parametric Bayesian models
  Latent Dirichlet allocation (Blei et al., 2003)
  Bayesian HMM (Goldwater and Griffiths, 2007)
A little bit about inference
Nonparametric Bayesian models
  Dirichlet process and hierarchical DP
  NP Bayes and grammars (Beal et al., 2001; Johnson et al., 2007; Cohn et al., 2009)

3 In Statistics
Better to assign "frequentist" and "Bayesian" to analyses, not people.
Frequentist analysis (most of science today): parameters are fixed and unknown; we gain information by repeated experiments.
  Point estimates, standard errors, confidence intervals ("in P% of experiments, the interval will cover the true θ"), hypothesis tests with α fixed in advance, reason about p(data | H_0).
Bayesian analysis: treat unknown parameters probabilistically; update beliefs as evidence arrives.
  Start with p(θ) and infer p(θ | data), means and quantiles of the posterior over θ, intervals corresponding to P% belief that θ is in the interval.

4 The Attraction of Bayesian Thinking
Write down your model declaratively, worry later about how to fit it from data.
Prior encodes prior knowledge. We have lots of this when it comes to language, or at least we think we do!
Manage uncertainty about the model the same way we manage uncertainty about the data.
Bayesian methods are strongly associated with:
  unsupervised (and latent variable) learning
  generative models

5 Evolving Definitions
MLE (not Bayesian):
Maximum a posteriori estimation:
Computing the posterior over the parameters (fully Bayesian):
Empirical Bayesian:
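The four definitions named on this slide usually take the following textbook forms. This is a sketch in assumed notation, not the slide's own equations: v is the observed data (the V of the later slides), θ the parameters, and α the hyperparameters of the prior.

```latex
\begin{align*}
\text{MLE:} \quad & \hat{\theta} = \arg\max_{\theta} \; p(V = v \mid \theta) \\
\text{MAP:} \quad & \hat{\theta} = \arg\max_{\theta} \; p(\theta \mid \alpha)\, p(V = v \mid \theta) \\
\text{Fully Bayesian:} \quad & p(\theta \mid V = v, \alpha) \propto p(\theta \mid \alpha)\, p(V = v \mid \theta) \\
\text{Empirical Bayesian:} \quad & \hat{\alpha} = \arg\max_{\alpha} \int p(\theta \mid \alpha)\, p(V = v \mid \theta)\, d\theta
\end{align*}
```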


6 MAP Learning as a Graphical Model
[Graphical model, from one month ago: nodes R, w, L, V with factors p(w), p_w(L), p_w(V | L)]
Combined inference (max over w, sum over L) is very hard.
If w were fixed, getting the posterior over L wouldn't be so bad.
If L were fixed, maximizing over w wouldn't be so bad.
Standard EM doesn't have p(w); it's very simple to add and useful in practice.

7 [Four graphical models. MLE: nodes θ, L, V. MAP: nodes α, θ, L, V. Fully Bayesian: nodes α, θ, L, V. Empirical Bayesian: nodes α, θ, L, V.]

8 Multinomials
Let's assume discrete distributions that simply assign probabilities to finite sets of events.
n-gram models, HMMs, PCFGs, ...

9 Distributions over Multinomials
You can think of a multinomial distribution over d events as a point in the (d − 1)-simplex.
[Figure: simplex with vertices [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
To randomly pick a point in this space, we need a continuous distribution over the simplex.

10 Dirichlet Distribution
A distribution over the d-event probability simplex.
Parameters: ρ, the mean of the Dirichlet, and α, the concentration around that mean (large α means smaller variance).
$$p(\theta \mid \alpha, \rho) = \frac{1}{B(\alpha\rho)} \prod_{i=1}^{d} \theta_i^{\alpha\rho_i - 1}$$
Beta function:
$$B(\alpha\rho) = \frac{\prod_{i=1}^{d} \Gamma(\alpha\rho_i)}{\Gamma(\alpha)}$$
Gamma function (generalized factorial):
$$\Gamma(a) = \int_{0}^{\infty} t^{a-1} e^{-t} \, dt$$
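To make the density concrete, here is a minimal Python sketch that evaluates log p(θ | α, ρ) in the mean/concentration parameterization used on this slide. The function name `dirichlet_log_density` is made up for illustration, and the code assumes ρ sums to 1 and every θ_i is strictly positive.

```python
import math

def dirichlet_log_density(theta, alpha, rho):
    """Log of p(theta | alpha, rho): Dirichlet with mean rho, concentration alpha."""
    # Event i has the exponent parameter alpha * rho_i.
    params = [alpha * r for r in rho]
    # log B(alpha * rho) = sum_i log Gamma(alpha * rho_i) - log Gamma(sum_i alpha * rho_i);
    # the last term equals log Gamma(alpha) when rho sums to 1.
    log_beta = sum(math.lgamma(a) for a in params) - math.lgamma(sum(params))
    # log of prod_i theta_i^(alpha * rho_i - 1), minus the log normalizer.
    return sum((a - 1.0) * math.log(t) for a, t in zip(params, theta)) - log_beta
```

With d = 3, α = 3, and uniform ρ, every exponent is zero, so the density is constant over the simplex and equals Γ(3) = 2.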

11 Dirichlet, d = 3 (various parameter settings)
[Figure from answers.com]

12 Dirichlet, d = 3 (different means and variances)
[Figure from Liang and Klein, 2007]

13 Sampling from a Dirichlet
For i from 1 to d, sample v_i from a gamma distribution with shape αρ_i and scale 1.
Renormalize the vector v to obtain θ.
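The two steps above can be sketched with Python's standard library alone. The function name `sample_dirichlet` and the `seed` argument are assumptions made for this illustration.

```python
import random

def sample_dirichlet(alpha, rho, seed=None):
    """Draw theta ~ Dirichlet(alpha, rho) by normalizing independent gamma draws."""
    rng = random.Random(seed)
    # Step 1: v_i ~ Gamma(shape = alpha * rho_i, scale = 1), one per event.
    v = [rng.gammavariate(alpha * r, 1.0) for r in rho]
    # Step 2: renormalize so the vector lies on the simplex.
    total = sum(v)
    return [x / total for x in v]

theta = sample_dirichlet(alpha=5.0, rho=[0.2, 0.3, 0.5], seed=0)
```

Larger α makes repeated draws cluster more tightly around ρ; very small αρ_i values may need extra care because the gamma draws can underflow toward zero.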

14 MAP with a Dirichlet
Recall that we can use a prior to smooth an estimate.
For a multinomial θ with a Dirichlet prior where every αρ_i > 1, this equates to adding pseudocounts to the vector of observed counts:
$$\hat{\theta}_i = \frac{N_i + \alpha\rho_i - 1}{N + \alpha - d}$$
As counts become large, the prior matters less. Closed form!
Regularizer view:
$$R(\theta) = \sum_{i=1}^{d} (\alpha\rho_i - 1) \log \theta_i$$
Flat prior: α = d, all ρ_i = 1/d (equates to MLE).
Sparse prior, with αρ_i < 1 (encourages most θ_i to go to zero), but now it's not closed form.
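A small Python sketch of the closed-form estimate above. The function name `map_multinomial` is hypothetical, and the code only covers the case where every pseudocount αρ_i − 1 is non-negative, since the closed form does not apply to the sparse prior.

```python
def map_multinomial(counts, alpha, rho):
    """MAP estimate of a multinomial under a Dirichlet(alpha, rho) prior."""
    d = len(counts)
    n = sum(counts)
    # Pseudocount added to event i is alpha * rho_i - 1.
    pseudo = [alpha * r - 1.0 for r in rho]
    if any(p < -1e-12 for p in pseudo):
        raise ValueError("closed form requires alpha * rho_i >= 1 for every i")
    denom = n + alpha - d
    if denom <= 0:
        raise ValueError("no data and a flat prior: the estimate is undefined")
    return [(c + p) / denom for c, p in zip(counts, pseudo)]

# Counts (3, 1, 0) with alpha = 6 and uniform rho: one pseudocount per event.
smoothed = map_multinomial([3, 1, 0], 6.0, [1.0 / 3] * 3)
# Flat prior (alpha = d): reduces to the relative-frequency MLE.
mle = map_multinomial([3, 1, 0], 3.0, [1.0 / 3] * 3)
```

In the first call the unseen third event gets probability 1/7 instead of 0, which is the smoothing effect the slide describes.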

15 Mixture of Unigrams
The generative story for a classical document-clustering model would be something like this (Nigam et al., 2000):
For i = 1 ... M (number of documents):
  Draw a document length N_i from some distribution.
  Draw a topic z_i for the document from a multinomial over topics, θ.
  For j = 1 ... N_i:
    Draw word w_{i,j} from the multinomial β_{z_i}.
Nigam et al. learned this using EM.
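The generative story above can be sketched in Python as follows. The slide leaves the document-length distribution open, so the uniform choice between 1 and `max_len` is purely an assumption, as are the function name `generate_corpus` and the toy topics.

```python
import random

def generate_corpus(theta, beta, num_docs, max_len=20, seed=0):
    """Sample (topic, words) pairs from a mixture of unigrams.

    theta: list of K topic probabilities.
    beta:  list of K dicts mapping word -> probability under that topic.
    """
    rng = random.Random(seed)
    topic_ids = list(range(len(theta)))
    corpus = []
    for _ in range(num_docs):
        # Document length N_i (assumed uniform; the slide says "some distribution").
        n_i = rng.randint(1, max_len)
        # One topic z_i for the whole document.
        z_i = rng.choices(topic_ids, weights=theta)[0]
        # Every word in the document comes from the same unigram model beta[z_i].
        vocab = list(beta[z_i].keys())
        probs = list(beta[z_i].values())
        words = rng.choices(vocab, weights=probs, k=n_i)
        corpus.append((z_i, words))
    return corpus

toy_beta = [{"parse": 0.7, "tree": 0.3}, {"goal": 0.6, "match": 0.4}]
docs = generate_corpus(theta=[0.5, 0.5], beta=toy_beta, num_docs=5)
```

Because z_i is drawn once per document, a document never mixes vocabulary from two topics here.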


16 Mixture of Unigrams
[Plate diagram: θ (mixture coefficients), Z_i, W_{i,j} (words), β_z; plates over N_i, M (documents), and K (components of the mixture)]
Z_i is sometimes called the topic of document i.
A topic z is defined by a unigram distribution β_z.
M = number of documents
N_i = number of words in document i
K = number of mixture components

17 A Word Clustering Model
[Plate diagram: θ (mixture coefficients), Z_{i,j}, W_{i,j} (words), β_z; plates over N_i, M (documents), and K (components of the mixture)]
Problem: all words are the same; document information is irrelevant. (This is exactly a zero-order HMM.)
M = number of documents
N_i = number of words in document i
K = number of mixture components


18 Probabilistic LSI (Hofmann, 1996)
[Plate diagram: θ_i, Z_{i,j}, W_{i,j} (words), β_z; plates over N_i, M (documents), and K (components of the mixture)]
This has very little to do with latent semantic indexing, except that it's a probabilistic model trying to perform a similar task.
Word clustering; documents correspond to distributions over topics.
Problem: can't describe new documents!
M = number of documents
N_i = number of words in document i
K = number of mixture components

19 Latent Dirichlet Allocation (Blei et al., 2003)
[Plate diagram: α, ρ; θ_i; Z_{i,j}; W_{i,j} (words); β_z; plates over N_i, M (documents), and K (components of the mixture)]
Documents are mixtures of topics, but a prior over those mixtures lets us reason about new documents, too.
M = number of documents
N_i = number of words in document i
K = number of mixture components
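As a sketch of how the model on this slide generates one document: draw the document's topic mixture θ_i from the Dirichlet prior, then draw a topic and a word for each position. The function names, the fixed `length` argument, and the toy topics are assumptions for illustration; this is a generative sketch only, not an inference procedure.

```python
import random

def sample_dirichlet(alpha, rho, rng):
    """theta ~ Dirichlet(alpha, rho) via normalized gamma draws."""
    v = [rng.gammavariate(alpha * r, 1.0) for r in rho]
    total = sum(v)
    return [x / total for x in v]

def generate_lda_document(alpha, rho, beta, length, rng):
    """Sample one document: returns (theta_i, [(z_ij, w_ij), ...])."""
    # Per-document topic mixture, drawn from the shared Dirichlet prior.
    theta_i = sample_dirichlet(alpha, rho, rng)
    topic_ids = list(range(len(beta)))
    doc = []
    for _ in range(length):
        # Each word position gets its own topic, unlike the mixture of unigrams.
        z = rng.choices(topic_ids, weights=theta_i)[0]
        vocab = list(beta[z].keys())
        probs = list(beta[z].values())
        w = rng.choices(vocab, weights=probs)[0]
        doc.append((z, w))
    return theta_i, doc

rng = random.Random(0)
toy_beta = [{"parse": 0.7, "tree": 0.3}, {"goal": 0.6, "match": 0.4}]
theta_i, doc = generate_lda_document(2.0, [0.5, 0.5], toy_beta, length=12, rng=rng)
```

Because θ_i comes from a prior rather than being a free parameter per document, the same procedure applies unchanged to a new, unseen document.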

20 LDA on ACL Papers (Gimpel, 2006)

21 Smoothed Latent Dirichlet Allocation (Blei et al., 2003)
[Plate diagram: α, ρ; θ_i; Z_{i,j}; W_{i,j} (words); a second α, ρ over β_z; plates over N_i, M (documents), and K (components of the mixture)]
M = number of documents
N_i = number of words in document i
K = number of mixture components

22 Topic Models Beyond LDA
Small industry in variations on topic models, usually adding more evidence to be explained by the topics. Examples:
  Supervised LDA (Blei and McAuliffe, 2007) adds an observed document category.
  Link LDA (Erosheva et al., 2004) adds citations to the document, explained by more draws from θ_i.
  Author-topic model (Rosen-Zvi et al., 2004) adds authors.
  Correlated topic model (Blei and Lafferty, 2006) lets different topics correlate more flexibly through a different prior.
  Comment LDA (Yano et al., 2009) generates comments from a different set of unigram models but the same topics.

23 Where is the Structure?
Through the topics and θ_i, words in a document become interdependent.
Kind of a joint document/word clustering.
Not really discrete structure the way we've mostly discussed in this class, though.
LDA is a Bayesian zero-order HMM.

24 Where is the Prediction?
Topics are hard to evaluate; no gold standard.
This is either an open problem or the nail in the coffin, depending on your point of view.

25 Hidden Markov Models
[Two plate diagrams, labeled "Bayesian HMM (Goldwater and Griffiths, 2007)" and "Unsupervised HMM (Merialdo, 1994)": nodes α, ρ; γ_y; Y_{i,j−1}; Y_{i,j}; X_{i,j}; η_y; plates over N_i, M, and K]
M = number of sequences
N_i = number of words in sequence i
K = number of states

26 The Engineering Part
Typically approximate inference is required.
  Markov chain Monte Carlo (e.g., Gibbs sampling)
  Variational inference (e.g., mean field)
Graphical model view is really helpful when designing inference algorithms for your Bayesian model!
Learning: approximate inference + optimization of hyperparameters (for LDA, usually α and β; ρ is often assumed uniform).
  Stochastic or variational EM, depending on your choice of approximate inference.
Full Bayesian: fix the prior and do inference on all your data.
  Implications for train/test methodology?

27 Mean Field Variational Inference in One Slide
Consider all hidden variables (structure, parameters, whatever). Call these L and everything else V.
Define a parametric distribution q over each variable.
Optimize this lower bound on likelihood:
$$\arg\max_{q} \; \mathbb{E}_{q(L)}\left[\log p(L, V = v)\right] + H(q(L))$$
Use the collection of q as a posterior.
Interestingly, the block coordinate ascent algorithms used to do this for multinomial-based models look an awful lot like EM!

28 Going Nonparametric
How many topics or states?
Nonparametric: let the data decide.
  Not necessarily Bayesian or even probabilistic!
  More data justify more parameters.
Most common nonparametric and Bayesian tools in NLP are based on the Dirichlet process.
  DP is not the same as the Dirichlet distribution.

29 Dirichlet Process (Key Property)
Two parameters, α_0 (concentration) and G_0 (base distribution).
If we draw G from DP(α_0, G_0), then G is drawn from an infinite mixture (Sethuraman, 1994):
$$G(\theta) = \sum_{k=1}^{\infty} \pi_k \cdot \begin{cases} 1 & \text{if } \theta = \theta_k \\ 0 & \text{otherwise} \end{cases}$$
where π_k is a mixture coefficient and θ_k is an independent draw from G_0.

30 Dirichlet Process (Key Property)
Let Ω denote the event space for G_0 and G.
For any finite partition of Ω into N parts Ω_1, ..., Ω_N:
$$\langle G(\Omega_1), \ldots, G(\Omega_N) \rangle \sim \mathrm{Dirichlet}\left(\alpha_0, \langle G_0(\Omega_1), \ldots, G_0(\Omega_N) \rangle\right)$$
Special case: Ω is finite; result is a Dirichlet.

31 Two Helpful Views
Stick-breaking view emphasizes the infinite mixture property.
  Can be understood as generating the model, then the data.
  Useful for deriving a variational inference algorithm.
Chinese restaurant process view.
  Data and model generated together.
  Useful for deriving an MCMC inference algorithm.

32 Stick-Breaking Process
1. Draw an infinitely long sequence of mixture components θ_1, ... from the base distribution G_0.
2. Draw an infinitely long sequence of mixture coefficients π.
3. Draw θ from the mixture:
$$G(\theta) = \sum_{k=1}^{\infty} \pi_k \cdot \begin{cases} 1 & \text{if } \theta = \theta_k \\ 0 & \text{otherwise} \end{cases}$$

33 Mixture Coefficients and GEM
For i = 1 to ∞:
  Draw v_i from Beta(1, α_0).
  Aside: a more general framework uses Beta(1 − b, α_0 + ib), giving a Pitman-Yor process.
Let:
$$\pi_i = v_i \prod_{j=1}^{i-1} (1 - v_j)$$
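A minimal Python sketch of the stick-breaking construction above. An infinite sequence cannot be stored, so the sketch stops after `truncation` sticks and returns the leftover mass. The truncation, the function name `stick_breaking`, and the `seed` argument are assumptions, not part of the slide.

```python
import random

def stick_breaking(alpha0, truncation, seed=0):
    """Return the first `truncation` GEM(alpha0) coefficients and the leftover mass."""
    rng = random.Random(seed)
    remaining = 1.0  # length of the stick not yet broken off: prod_j (1 - v_j)
    pis = []
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha0)  # v_i ~ Beta(1, alpha_0)
        pis.append(v * remaining)         # pi_i = v_i * prod_{j<i} (1 - v_j)
        remaining *= 1.0 - v
    return pis, remaining

pis, leftover = stick_breaking(alpha0=2.0, truncation=50)
```

Small α_0 puts most of the mass on the first few sticks; large α_0 spreads it over many, so the truncation level has to grow with α_0 for the leftover mass to stay negligible.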

34 Infinite Language Model
[Plate diagram: α_0; π (stick); z; G_0; θ_z (unigram language model); W_i (words); plate over N]

35 Chinese Restaurant Process
A different way to get to the same distribution.

36 Chinese Restaurant Process
1. A customer, X_1, walks into a restaurant and sits at the first table. Draw X_1 from G_0. Let t = 1.
2. For i = 2, ...:
  A. Customer i chooses a table:
$$p(Y_i = j \mid y_1, \ldots, y_{i-1}) = \begin{cases} \dfrac{N_j}{\alpha_0 + i - 1} & \text{if } j \in \{1, \ldots, t\} \\[1ex] \dfrac{\alpha_0}{\alpha_0 + i - 1} & \text{if } j = t + 1 \end{cases}$$
  B. If Y_i = t + 1, then draw X_i from G_0 and ++t. Otherwise X_i is copied from other customers at the table.
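The seating procedure above can be simulated in a few lines of Python. This sketch tracks only the table assignments Y_i and the table sizes N_j; drawing each new table's value from G_0 is left out. Tables are numbered from 0 rather than 1, and the function name and `seed` argument are assumptions.

```python
import random

def chinese_restaurant_process(n, alpha0, seed=0):
    """Seat n >= 1 customers; return (table index per customer, table sizes)."""
    rng = random.Random(seed)
    assignments = [0]  # customer 1 sits at the first table
    counts = [1]       # N_j for each occupied table
    for _ in range(2, n + 1):
        # Existing table j has weight N_j, a new table has weight alpha_0;
        # choices() normalizes, which supplies the alpha_0 + i - 1 denominator.
        weights = counts + [alpha0]
        j = rng.choices(range(len(weights)), weights=weights)[0]
        if j == len(counts):
            counts.append(1)  # new table opened
        else:
            counts[j] += 1    # joins an existing table
        assignments.append(j)
    return assignments, counts

assignments, counts = chinese_restaurant_process(n=1000, alpha0=1.0)
```

Running it with different values of α_0 shows the rich-get-richer behaviour: a few large tables and a tail of small ones, with more tables as α_0 grows.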

37 Why the CRP?
No infinitely large random variables. θ and π are now encoded in Y.
Highlights the exchangeability property: the probability of the data items doesn't depend on the order in which they are observed.
  Weaker than IID assumptions.
  The data are, in fact, not IID, and that is why inference is much harder.

38 Dirichlet Process Prior
What kinds of distributions does it prefer?
The expected number of tables grows with the number of customers N as α_0 log N.
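The growth rate can be checked directly. Under the Chinese restaurant process, customer i opens a new table with probability α_0 / (α_0 + i − 1), so the expected number of tables is the sum of those probabilities. The helper name `expected_tables` is an assumption for this sketch.

```python
import math

def expected_tables(n, alpha0):
    """Exact expected number of occupied tables after n customers."""
    # Customer i (1-based) starts a new table with probability alpha0 / (alpha0 + i - 1).
    return sum(alpha0 / (alpha0 + i) for i in range(n))

for a0 in (0.1, 1.0, 10.0, 100.0):
    # Compare the exact expectation with the alpha_0 * log N growth rate.
    print(a0, expected_tables(1000, a0), a0 * math.log(1000))
```

The α_0 log N rate is an asymptotic statement in N for fixed α_0; when α_0 is large relative to N the exact sum is noticeably smaller than α_0 log N.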

39 [Figure: α_0 = 1, 1000 customers]

40 [Figures: α_0 = 10, 1000 customers; α_0 = 100, 1000 customers]

41 [Figure: α_0 = 0.1, 1000 customers]

42 Hierarchical DP
First draw from a DP. Use that as a base distribution for a second DP, from which we draw multiple times.
  E.g., β_z distributions in a topic model or η distributions in an HMM, which should range over the same vocabulary.
  E.g., γ distributions in an HMM, which should range over the same set of states.
You can do the same thing with a Pitman-Yor process.

43 Infinite HMM (Beal et al., 2001)
The states, transitions, and emissions are generated by an HDP.
The base DP creates a set of states.
Each transition distribution is generated by the secondary DP, all with the same base DP; so the states are shared.
Infinite-state PCFGs studied by Liang et al. (2007) and Finkel et al. (2007).

44 Adaptor Grammars (Goldwater et al., 2006)
PCFG: when expanding nonterminal N, draw from θ_N; then recur on children.
In adaptor grammars, the above happens when a customer in the N restaurant sits at a new table.
  I.e., the PCFG is a base distribution.
Existing tables are associated with entire subtrees, down to the leaves.

45 Bayesian Tree Substitution Grammars (Cohn et al., 2009)
TSG is a context-free formalism in which rewrite rules can be whole chunks of a tree, with nonterminals at the frontier.
  Rules are elementary trees.
  Treebank does not show you where the boundaries are between the elementary trees.
Base distribution over elementary trees.
HDP to generate the grammar.
Similar to adaptor grammar, but elementary trees get cached, not entire subtrees.

47 Why Are These Models Called Infinite?
The number of effective grammar rules is arbitrarily large (not infinite).
That is, the posterior distribution will believe that the data came from only a finite number of rules.
But with more observations, more rules might get used to explain the data.

48 Remember
Bayesian describes your model, not you!
Nonparametric is one way to get around the "how many clusters" problem.
  Orthogonal to Bayesian.
  Not the only way to do it, but quite useful in structured settings where your latent variables aren't just clusters.
