Latent Variable Models: Probabilistic Models in the Study of Language, Day 4
1 Latent Variable Models: Probabilistic Models in the Study of Language, Day 4. Roger Levy, UC San Diego, Department of Linguistics.
2 Preamble: plate notation for graphical models. Here is the kind of hierarchical model we've seen so far: [graphical model with nodes θ; y_11 ... y_1n_1, y_21 ... y_2n_2, ..., y_m1 ... y_mn_m; b_1, b_2, ..., b_m; and Σ_b]
3-7 Plate notation for graphical models. Here is a more succinct representation of the same model: [graphical model with nodes i, θ, y, b, Σ_b and plates N and m]
- The rectangles labeled N and m are plates; the semantics of a plate labeled n is "replicate this node n times".
- N = Σ_{i=1}^{m} n_i (see previous slide).
- The i node is a cluster identity node.
- In our previous application of hierarchical models to regression, cluster identities were known.
8-12 The plan for today's lecture. We are going to study the simplest type of latent-variable models. [graphical model with nodes φ, i, θ, y, b, Σ_b and plates N and m]
- Technically speaking, "latent variable" means any variable whose value is unknown.
- But the term is conventionally used to refer to hidden structural relations among observations.
- In today's clustering applications, we simply treat i as unknown.
- Inferring values of i induces a clustering among the observations; to do so, we need to put a probability distribution over i.
13-15 The plan for today's lecture. We will cover two types of simple latent-variable models:
- the mixture of Gaussians, for continuous multivariate data;
- Latent Dirichlet Allocation (LDA; also called topic models), for categorical data (words) in collections of documents.
16-19 Mixture of Gaussians. Motivating example: how are phonological categories learned? There is evidence that learning involves a combination of both innate bias and experience:
- Infants can distinguish some contrasts that adult speakers of languages lacking them cannot: alveolar [d] versus retroflex [ɖ] for English speakers, [r] versus [l] for Japanese speakers (Werker and Tees, 1984; Kuhl et al., 2006, inter alia).
- Other contrasts are not reliably distinguished until 1 year of age by native speakers (e.g., syllable-initial [n] versus [ŋ] in Filipino language environments; Narayan et al., 2010).
20 Learning vowel categories. To appreciate the potential difficulties of vowel category learning, consider inter-speaker variation (data courtesy of Vallabha et al., 2007). [Figure: scatter-plot matrix of F1, F2, and duration for the vowels e, E, i, I, shown separately for speakers S1 and S2.]
21 Framing the category learning problem. Here are 19 speakers' data mixed together: [Figure: scatter-plot matrix of F1, F2, and duration for the vowels e, E, i, I, all speakers pooled.]
22-28 Framing the category learning problem. Learning from such data can be thought of in two ways:
- grouping the observations into categories;
- determining the underlying category representations (positions, shapes, and sizes).
Formally: every possible grouping of the observations y into categories is a partition Π of y. If θ are the parameters describing the category representations, our problem is to infer P(Π, θ | y), from which we can recover the two marginal probability distributions of interest:
- P(Π | y), the distribution over partitions given the data;
- P(θ | y), the distribution over category properties given the data.
29-31 The mixture of Gaussians. A simple generative model of the data: we have k multivariate Gaussians with frequencies φ = (φ_1, ..., φ_k), each with its own mean μ_i and covariance matrix Σ_i (here we punt on how to induce the correct number of categories). N observations are generated i.i.d. by:

$$i \sim \mathrm{Multinom}(\phi), \qquad y \sim N(\mu_i, \Sigma_i)$$

Here is the corresponding graphical model: [nodes φ, i, y, μ, Σ with plates n and m]
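The two-line generative story above is easy to simulate directly. A minimal sketch, where the mixing weights, means, and covariances are made-up illustration values rather than anything fitted to the vowel data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture: k = 2 bivariate Gaussians with frequencies phi
phi = np.array([0.4, 0.6])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

def sample_mog(n):
    """Draw n points i.i.d.: i ~ Multinom(phi), then y ~ N(mu_i, Sigma_i)."""
    idx = rng.choice(len(phi), size=n, p=phi)   # latent cluster identities
    y = np.array([rng.multivariate_normal(mus[i], Sigmas[i]) for i in idx])
    return idx, y

idx, y = sample_mog(500)
```

In the clustering setting, only `y` is observed; `idx` is exactly the latent variable i the model must infer.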
32 Can we use maximum likelihood? For observations y all known to come from the same k-dimensional Gaussian, the MLE for the Gaussian's parameters is

$$\hat\mu = (\bar y_1, \bar y_2, \dots, \bar y_k)$$

$$\hat\Sigma = \begin{pmatrix} \mathrm{Var}(y_1) & \mathrm{Cov}(y_1,y_2) & \cdots & \mathrm{Cov}(y_1,y_k) \\ \mathrm{Cov}(y_1,y_2) & \mathrm{Var}(y_2) & \cdots & \mathrm{Cov}(y_2,y_k) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(y_1,y_k) & \mathrm{Cov}(y_2,y_k) & \cdots & \mathrm{Var}(y_k) \end{pmatrix}$$

where Var and Cov are the sample variance and covariance.
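These estimators are one line each in NumPy; the mean vector and covariance below are arbitrary illustration values. Note that the MLE divides by n rather than n-1, hence `bias=True`:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated data from a single bivariate Gaussian (illustration values)
y = rng.multivariate_normal(mean=[1.0, -1.0],
                            cov=[[2.0, 0.3], [0.3, 1.0]], size=2000)

mu_hat = y.mean(axis=0)                          # vector of sample means
Sigma_hat = np.cov(y, rowvar=False, bias=True)   # MLE covariance (divides by n)
```

With 2000 points, `mu_hat` and `Sigma_hat` land close to the true parameters.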
33 Can we use maximum likelihood? So you might ask: why not use the method of maximum likelihood, searching through all the possible partitions of the data and choosing the partition that gives the highest data likelihood? [Figure: example data y.]
34 Can we use maximum likelihood? The set of all partitions into 3,3 observations for our example data: [Figure lost in transcription.]
35-40 Can we use maximum likelihood? This looks like a daunting search task, but there is an even bigger problem.
- Suppose I try a partition into 5,1 observations. The ML fit for this partition is degenerate: the singleton category gets a singular covariance matrix, and its likelihood is unbounded!
- More generally, for a V-dimensional problem you need at least V+1 points in each cell of the partition.
- But this constraint would prevent you from finding intuitive solutions to your problem!
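The degeneracy is easy to demonstrate numerically: with fewer than V+1 points in a V-dimensional cluster, the sample covariance is singular. A sketch with two points in two dimensions:

```python
import numpy as np

# Two points in V = 2 dimensions: fewer than the V + 1 = 3 points needed
# for a full-rank sample covariance, so the ML estimate is singular.
pts = np.array([[0.0, 0.0], [1.0, 2.0]])
Sigma_hat = np.cov(pts, rowvar=False, bias=True)   # MLE (divides by n)
det = np.linalg.det(Sigma_hat)
# det == 0: the fitted Gaussian collapses onto the line through the two
# points, and the density (hence the partition's likelihood) is unbounded.
```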
41-46 Bayesian mixture of Gaussians. [Graphical model: α → (μ, Σ); φ → i → y; plates n and m.]

$$i \sim \mathrm{Multinom}(\phi), \qquad y \sim N(\mu_i, \Sigma_i)$$

- The Bayesian framework allows us to build in explicit assumptions about what constitutes a sensible category size.
- Returning to our graphical model, we put in a prior on category size/shape.
- For now we will just leave category prior probabilities uniform: φ_1 = φ_2 = φ_3 = φ_4 = 1/4.
- Here is a conjugate prior distribution for multivariate Gaussians:

$$\Sigma_i \sim \mathrm{IW}(\Sigma_0, \nu), \qquad \mu_i \mid \Sigma_i \sim N(\mu_0, \Sigma_i / A)$$
47-50 The Inverse Wishart distribution. Perhaps the best way to understand the Inverse Wishart distribution is to look at samples from it. [Figure: samples drawn for a fixed scale matrix Σ (values lost in transcription), with k = 2 in the top row and k = 5 in the bottom row.]
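Samples like those on the slide can be generated without special-purpose libraries. A sketch, assuming an integer degrees-of-freedom ν at least the dimension, using the fact that the inverse of an IW(Σ₀, ν) draw is Wishart-distributed with scale Σ₀⁻¹; the scale matrix below is an illustration value, not the one from the slide:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_inverse_wishart(Sigma0, nu):
    """Draw Sigma ~ IW(Sigma0, nu) by sampling a Wishart and inverting.

    If Sigma ~ IW(Sigma0, nu), then Sigma^{-1} ~ Wishart(Sigma0^{-1}, nu);
    for integer nu >= dim, a Wishart draw is the sum of outer products of
    nu draws from N(0, Sigma0^{-1}).
    """
    d = Sigma0.shape[0]
    X = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Sigma0), size=nu)
    W = X.T @ X                 # Wishart(Sigma0^{-1}, nu) draw
    return np.linalg.inv(W)

# Larger nu concentrates samples around E[Sigma] = Sigma0 / (nu - d - 1)
Sigma0 = np.array([[1.0, 0.5], [0.5, 1.0]])
draws = [sample_inverse_wishart(Sigma0, nu=50) for _ in range(200)]
mean_draw = np.mean(draws, axis=0)
```

Varying ν while holding Σ₀ fixed reproduces the qualitative pattern on the slide: small ν gives wildly varying covariance shapes, large ν gives samples tightly clustered around the (scaled) scale matrix.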
51-59 Inference for mixture of Gaussians using Gibbs sampling. We still have not given a solution to the search problem. One broadly applicable solution is Gibbs sampling. Simply put:
1. Randomly initialize cluster assignments.
2. On each iteration through the data, for each point:
   2.1 Forget the cluster assignment of the current point x_i.
   2.2 Compute the probability distribution over x_i's cluster assignment conditional on the rest of the partition:

   $$P(C_i \mid x_i, \Pi_{-i}) = \frac{\int_\theta P(x_i \mid C_i, \theta)\, P(C_i \mid \theta)\, P(\theta)\, d\theta}{\sum_j \int_\theta P(x_i \mid C_j, \theta)\, P(C_j \mid \theta)\, P(\theta)\, d\theta}$$

   2.3 Randomly sample a cluster assignment for x_i from P(C_i | x_i, Π_{-i}) and continue.
3. Do this for many iterations (e.g., until the unnormalized marginal data likelihood is high).
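The loop above can be sketched in a few lines. This is a deliberately simplified version: it resamples the cluster means explicitly instead of integrating them out, assumes a known within-cluster variance and uniform φ, and runs on made-up 1-D toy data rather than the vowel measurements:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: two well-separated 1-D clusters (a stand-in for the vowel data)
y = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(8.0, 1.0, 50)])
k, n = 2, len(y)
sigma2 = 1.0       # known within-cluster variance (simplifying assumption)
tau2 = 100.0       # broad prior variance on cluster means

z = rng.integers(k, size=n)        # step 1: random initial assignments
mu = rng.normal(0.0, 10.0, k)

for _ in range(200):               # step 2: sweep through the data
    # steps 2.1-2.3: resample every point's cluster given current means
    logp = -0.5 * (y[:, None] - mu[None, :]) ** 2 / sigma2  # uniform phi cancels
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(k, p=p_i) for p_i in p])
    # resample each cluster mean from its conjugate normal posterior
    for j in range(k):
        nj = (z == j).sum()
        var = 1.0 / (nj / sigma2 + 1.0 / tau2)
        mean = var * y[z == j].sum() / sigma2
        mu[j] = rng.normal(mean, np.sqrt(var))
```

On data this well separated, the sampled means settle near the two true cluster centers within a few sweeps.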
60 Inference for mixture of Gaussians using Gibbs sampling. Starting point for our problem: [Figure lost in transcription.]
61 One pass of Gibbs sampling through the data
62 Results of Gibbs sampling with known category probabilities. Posterior modes of category structures: [Figure: three panels, F1 versus F2, F1 versus duration, and F2 versus duration.]
63 Results of Gibbs sampling with known category probabilities. Confusion table of assignments of observations to categories, unsupervised versus supervised, for the true vowels e, E, i, I against the inferred clusters. [Table values lost in transcription.]
64 Extending the model to learning category probabilities. The multinomial extension of the beta distribution is the Dirichlet distribution, characterized by parameters α_1, ..., α_k, with density

$$D(\pi_1, \dots, \pi_k) \overset{\mathrm{def}}{=} \frac{1}{Z}\, \pi_1^{\alpha_1 - 1}\, \pi_2^{\alpha_2 - 1} \cdots \pi_k^{\alpha_k - 1}$$

where the normalizing constant Z is

$$Z = \frac{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\cdots\Gamma(\alpha_k)}{\Gamma(\alpha_1 + \alpha_2 + \cdots + \alpha_k)}$$
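NumPy can draw from this distribution directly; the symmetric α below is an illustration value. Every draw lies on the probability simplex, and with all α_j = 1 the distribution is uniform over it:

```python
import numpy as np

rng = np.random.default_rng(4)

alpha = np.array([1.0, 1.0, 1.0, 1.0])   # symmetric alpha: uniform on simplex
phi = rng.dirichlet(alpha)               # one draw of category probabilities

# Many draws: components are nonnegative, each draw sums to 1, and
# E[phi_j] = alpha_j / sum(alpha), here 1/4 for every category.
many = rng.dirichlet(alpha, size=10000)
```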
65-68 Extending the model to learning category probabilities. So we set φ ~ D(Σ_φ) and combine this with the rest of the model:

$$\Sigma_i \sim \mathrm{IW}(\Sigma_0, \nu), \qquad \mu_i \mid \Sigma_i \sim N(\mu_0, \Sigma_i / A)$$
$$i \sim \mathrm{Multinom}(\phi), \qquad y \sim N(\mu_i, \Sigma_i)$$

[Graphical model: hyperparameter nodes Σ_θ and Σ_φ feeding θ and φ; φ → i → y; b with its hyperparameters; plates n and m.]
69 Having to learn category probabilities too makes the problem harder. [Figure: three panels, F1 and F2, F1 and duration, F2 and duration.]
70 Having to learn category probabilities too makes the problem harder. We can make the problem even more challenging by skewing the category probabilities:

Category  Probability
e         0.04
E         0.05
i         0.29
I         0.62
71 Having to learn category probabilities too makes the problem harder. [Figure: three panels, F1 and F2, F1 and duration, F2 and duration.]
72 Having to learn category probabilities too makes the problem harder. Confusion tables for these cases, with versus without learning of category frequencies, for the true vowels e, E, i, I against the inferred clusters. [Table values lost in transcription.]
73-78 Summary. We can use the exact same models for unsupervised (latent-variable) learning as for hierarchical/mixed-effects regression! However, category induction presents additional difficulties for category learning:
- non-convexity of the objective function, which makes search difficult;
- degeneracy of maximum likelihood.
In general you need far more data, and/or additional information sources, to converge on good solutions. Relevant references: tons! Read about MOGs for automated speech recognition in Jurafsky and Martin (2008, Chapter 9). See Vallabha et al. (2007) and Feldman et al. (2009) for earlier applications of MOGs to phonetic category learning.
79 References I
Feldman, N. H., Griffiths, T. L., and Morgan, J. L. (2009). Learning phonetic categories by learning a lexicon. In Proceedings of the 31st Annual Conference of the Cognitive Science Society. Cognitive Science Society, Austin, TX.
Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, second edition.
Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., and Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2):F13-F21.
Narayan, C. R., Werker, J. F., and Beddor, P. S. (2010). The interaction between acoustic salience and language experience in developmental speech perception: evidence from nasal place discrimination. Developmental Science, 13(3).
80 References II
Vallabha, G. K., McClelland, J. L., Pons, F., Werker, J. F., and Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33).
Werker, J. F. and Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7:49-63.
Kernel Density Topic Models: Visual Topics Without Visual Words Konstantinos Rematas K.U. Leuven ESAT-iMinds krematas@esat.kuleuven.be Mario Fritz Max Planck Institute for Informatics mfrtiz@mpi-inf.mpg.de
More informationComputational Cognitive Science
Computational Cognitive Science Lecture 8: Frank Keller School of Informatics University of Edinburgh keller@inf.ed.ac.uk Based on slides by Sharon Goldwater October 14, 2016 Frank Keller Computational
More informationLecture 4: Probabilistic Learning
DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationRecent Advances in Bayesian Inference Techniques
Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationIntroduction to Bayesian inference
Introduction to Bayesian inference Thomas Alexander Brouwer University of Cambridge tab43@cam.ac.uk 17 November 2015 Probabilistic models Describe how data was generated using probability distributions
More informationUnsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models
SUBMISSION TO IEEE TRANS. ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1 Unsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models Xiaogang Wang, Xiaoxu Ma,
More informationClustering using Mixture Models
Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior
More informationHidden Markov Models
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 22 April 2, 2018 1 Reminders Homework
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationLecture 6: Graphical Models: Learning
Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)
More informationMixtures of Gaussians. Sargur Srihari
Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationA Fully Nonparametric Modeling Approach to. BNP Binary Regression
A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation
More informationExponential Families
Exponential Families David M. Blei 1 Introduction We discuss the exponential family, a very flexible family of distributions. Most distributions that you have heard of are in the exponential family. Bernoulli,
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationA Brief and Friendly Introduction to Mixed-Effects Models in Linguistics
A Brief and Friendly Introduction to Mixed-Effects Models in Linguistics Cluster-specific parameters ( random effects ) Σb Parameters governing inter-cluster variability b1 b2 bm x11 x1n1 x21 x2n2 xm1
More informationMachine Learning Techniques for Computer Vision
Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM
More informationPROBABILITY DISTRIBUTIONS. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception
PROBABILITY DISTRIBUTIONS Credits 2 These slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Parametric Distributions 3 Basic building blocks: Need to determine given Representation:
More informationOutline. Limits of Bayesian classification Bayesian concept learning Probabilistic models for unsupervised and semi-supervised category learning
Outline Limits of Bayesian classification Bayesian concept learning Probabilistic models for unsupervised and semi-supervised category learning Limitations Is categorization just discrimination among mutually
More informationTopic Models. Charles Elkan November 20, 2008
Topic Models Charles Elan elan@cs.ucsd.edu November 20, 2008 Suppose that we have a collection of documents, and we want to find an organization for these, i.e. we want to do unsupervised learning. One
More informationClustering and Gaussian Mixture Models
Clustering and Gaussian Mixture Models Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 25, 2016 Probabilistic Machine Learning (CS772A) Clustering and Gaussian Mixture Models 1 Recap
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationGibbs Sampling in Linear Models #2
Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling
More informationIntroduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak
Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,
More informationDay 1: Probability and speech perception
Day 1: Probability and speech perception 1 Day 2: Human sentence parsing 2 Day 3: Noisy-channel sentence processing? Day 4: Language production & acquisition whatsthat thedoggie yeah wheresthedoggie Grammar/lexicon
More informationUnsupervised Learning
2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and
More informationLecture 2: Priors and Conjugacy
Lecture 2: Priors and Conjugacy Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 6, 2014 Some nice courses Fred A. Hamprecht (Heidelberg U.) https://www.youtube.com/watch?v=j66rrnzzkow Michael I.
More informationLecture 14. Clustering, K-means, and EM
Lecture 14. Clustering, K-means, and EM Prof. Alan Yuille Spring 2014 Outline 1. Clustering 2. K-means 3. EM 1 Clustering Task: Given a set of unlabeled data D = {x 1,..., x n }, we do the following: 1.
More informationGaussian Mixture Model
Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,
More informationMachine Learning Summer School
Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationPart 6: Multivariate Normal and Linear Models
Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of
More information13 : Variational Inference: Loopy Belief Propagation and Mean Field
10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction
More informationCS Lecture 18. Topic Models and LDA
CS 6347 Lecture 18 Topic Models and LDA (some slides by David Blei) Generative vs. Discriminative Models Recall that, in Bayesian networks, there could be many different, but equivalent models of the same
More informationProbabilistic Methods in Linguistics Lecture 2
Probabilistic Methods in Linguistics Lecture 2 Roger Levy UC San Diego Department of Linguistics October 2, 2012 A bit of review & terminology A Bernoulli distribution was defined as π if x = 1 P(X = x)
More informationUnsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models
SUBMISSION TO IEEE TRANS. ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1 Unsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models Xiaogang Wang, Xiaoxu Ma,
More informationSTAT Advanced Bayesian Inference
1 / 32 STAT 625 - Advanced Bayesian Inference Meng Li Department of Statistics Jan 23, 218 The Dirichlet distribution 2 / 32 θ Dirichlet(a 1,...,a k ) with density p(θ 1,θ 2,...,θ k ) = k j=1 Γ(a j) Γ(
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationLatent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993-1022, January 2003. Following slides borrowed ant then heavily modified from: Jonathan Huang
More informationParametric Techniques Lecture 3
Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to
More informationA Bayesian Perspective on Residential Demand Response Using Smart Meter Data
A Bayesian Perspective on Residential Demand Response Using Smart Meter Data Datong-Paul Zhou, Maximilian Balandat, and Claire Tomlin University of California, Berkeley [datong.zhou, balandat, tomlin]@eecs.berkeley.edu
More informationSequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007
Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember
More informationBayesian Mixtures of Bernoulli Distributions
Bayesian Mixtures of Bernoulli Distributions Laurens van der Maaten Department of Computer Science and Engineering University of California, San Diego Introduction The mixture of Bernoulli distributions
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 2. Statistical Schools Adapted from slides by Ben Rubinstein Statistical Schools of Thought Remainder of lecture is to provide
More informationInterpretable Latent Variable Models
Interpretable Latent Variable Models Fernando Perez-Cruz Bell Labs (Nokia) Department of Signal Theory and Communications, University Carlos III in Madrid 1 / 24 Outline 1 Introduction to Machine Learning
More informationModeling Environment
Topic Model Modeling Environment What does it mean to understand/ your environment? Ability to predict Two approaches to ing environment of words and text Latent Semantic Analysis (LSA) Topic Model LSA
More information