Latent Variable Models
Probabilistic Models in the Study of Language, Day 4
Roger Levy, UC San Diego Department of Linguistics

Preamble: plate notation for graphical models
Here is the kind of hierarchical model we've seen so far:
[Graphical model: shared parameters θ and Σ_b; cluster-level parameters b_1, b_2, ..., b_m; observations y_11, ..., y_1n_1, y_21, ..., y_2n_2, ..., y_m1, ..., y_mn_m]

Plate notation for graphical models
Here is a more succinct representation of the same model:
[Graphical model: θ → y with cluster-identity node i, the (i, y) pair inside a plate of size N; b inside a plate of size m; Σ_b shared]
- The rectangles labeled N and m are plates; the semantics of a plate labeled n is "replicate this node n times"
- N = Σ_{i=1}^m n_i (see previous slide)
- The i node is a cluster-identity node
- In our previous application of hierarchical models to regression, cluster identities are known

The plan for today's lecture
[Graphical model: φ → i → y, with b and Σ_b as before; the (i, y) pair in a plate of size N, b in a plate of size m]
We are going to study the simplest type of latent-variable models.
- Technically speaking, "latent variable" means any variable whose value is unknown
- But it's conventionally used to refer to hidden structural relations among observations
- In today's clustering applications, we simply treat i as unknown
- Inferring values of i induces a clustering among the observations; to do so we need to put a probability distribution over i

The plan for today's lecture
We will cover two types of simple latent-variable models:
- The mixture of Gaussians, for continuous multivariate data
- Latent Dirichlet Allocation (LDA; also called topic models), for categorical data (words) in collections of documents

Mixture of Gaussians
Motivating example: how are phonological categories learned?
Evidence that learning involves a combination of both innate bias and experience:
- Infants can distinguish some contrasts that adults whose languages lack them cannot (alveolar [d] versus retroflex [ɖ] for English speakers, [r] versus [l] for Japanese speakers; Werker and Tees, 1984; Kuhl et al., 2006, inter alia)
- Other contrasts are not reliably distinguished until 1 year of age even by native speakers (e.g., syllable-initial [n] versus [ŋ] in Filipino language environments; Narayan et al., 2010)

Learning vowel categories
To appreciate the potential difficulties of vowel category learning, consider inter-speaker variation (data courtesy of Vallabha et al., 2007):
[Figure: scatter-plot matrices of F1, F2, and duration for two speakers, S1 and S2, with the vowel categories e, E, i, I]

Framing the category learning problem
Here are 19 speakers' data mixed together:
[Figure: scatter-plot matrix of F1, F2, and duration pooled across 19 speakers, with the vowel categories e, E, i, I]

Framing the category learning problem
Learning from such data can be thought of in two ways:
- Grouping the observations into categories
- Determining the underlying category representations (positions, shapes, and sizes)
Formally: every possible grouping of the observations y into categories represents a partition Π of the observations y.
If θ are parameters describing the category representations, our problem is to infer P(Π, θ | y), from which we could recover the two marginal probability distributions of interest:
- P(Π | y)  (distribution over partitions given the data)
- P(θ | y)  (distribution over category properties given the data)
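These marginals are not spelled out on the slide; as a reminder (standard probability, nothing model-specific), they come from integrating out θ and summing out Π from the joint posterior:

    P(\Pi \mid y) = \int P(\Pi, \theta \mid y)\, d\theta
    \qquad\qquad
    P(\theta \mid y) = \sum_{\Pi} P(\Pi, \theta \mid y)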

The mixture of Gaussians
A simple generative model of the data: we have k multivariate Gaussians with frequencies φ = (φ_1, ..., φ_k), each with its own mean µ_i and covariance matrix Σ_i (here we punt on how to induce the correct number of categories).
N observations are generated i.i.d. by:
    i ~ Multinom(φ)
    y ~ N(µ_i, Σ_i)
Here is the corresponding graphical model:
[Graphical model: φ → i → y, with (µ, Σ) → y; the (i, y) pair in a plate of size n, the (µ, Σ) pair in a plate of size m]
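Here is a minimal sketch of this generative process in Python/NumPy. All parameter values (k = 2 components, the weights phi, the means and covariances) are illustrative placeholders, not values from the lecture.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative mixture parameters (not the lecture's): k = 2 bivariate Gaussians
    phi = np.array([0.5, 0.5])                      # mixing weights phi_1, ..., phi_k
    mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
    Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

    def sample_mixture(N):
        """Draw N observations: i ~ Multinom(phi), y ~ N(mu_i, Sigma_i)."""
        comps = rng.choice(len(phi), size=N, p=phi)  # latent cluster identities i
        ys = np.array([rng.multivariate_normal(mus[i], Sigmas[i]) for i in comps])
        return comps, ys

    i_true, y = sample_mixture(500)
    print(y.shape)   # (500, 2)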

Can we use maximum likelihood?
For observations y all known to come from the same k-dimensional Gaussian, the MLE for the Gaussian's parameters is

    µ = (ȳ_1, ȳ_2, ..., ȳ_k)

    Σ = [ Var(y_1)        Cov(y_1, y_2)   ...   Cov(y_1, y_k) ]
        [ Cov(y_1, y_2)   Var(y_2)        ...   Cov(y_2, y_k) ]
        [ ...                                                 ]
        [ Cov(y_1, y_k)   Cov(y_2, y_k)   ...   Var(y_k)      ]

where Var and Cov are the sample variance and covariance.
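As a quick numerical sanity check of these formulas (a sketch of my own, not part of the slides): the MLE mean is the vector of sample means, and the MLE covariance is the divide-by-N sample covariance.

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.multivariate_normal([0.0, 2.0], [[1.0, 0.3], [0.3, 2.0]], size=1000)

    mu_hat = y.mean(axis=0)                          # vector of sample means (MLE for mu)
    Sigma_hat = np.cov(y, rowvar=False, bias=True)   # divide-by-N covariance (MLE for Sigma)
    print(mu_hat)
    print(Sigma_hat)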

Can we use maximum likelihood?
So you might ask: why not use the method of maximum likelihood, searching through all the possible partitions of the data and choosing the partition that gives the highest data likelihood?
[Figure: scatter plot of a small example dataset of six points]

Can we use maximum likelihood?
The set of all partitions of our example data into two groups of 3 observations:
[Figure: grid of scatter plots, one per 3-3 partition, each panel labeled with a data-likelihood score for that partition (values ranging from 10.65 to 17.24)]

Can we use maximum likelihood?
This looks like a daunting search task, but there is an even bigger problem. Suppose I try a partition into groups of 5 and 1...
[Figure: the example data partitioned into five points plus one singleton; the ML fit for this partition is degenerate]
- More generally, for a V-dimensional problem you need at least V + 1 points in each partition
- But this constraint would prevent you from finding intuitive solutions to your problem!
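A tiny numerical illustration of why the V + 1 constraint matters (my own sketch, not from the slides): with V or fewer points in a V-dimensional category the ML covariance estimate is singular, and for a singleton category the likelihood grows without bound as the fitted variance shrinks.

    import numpy as np
    from scipy.stats import norm

    # Two points in 2 dimensions: the ML covariance has rank <= 1, so it is singular.
    pts = np.array([[1.0, 2.0], [3.0, 5.0]])
    Sigma_hat = np.cov(pts, rowvar=False, bias=True)
    print(np.linalg.det(Sigma_hat))        # 0.0 (up to rounding): no valid density

    # A singleton "category" is even worse: its ML variance is 0, and shrinking the
    # variance toward 0 around the point sends the Gaussian density to infinity.
    x = 4.2
    for s in [1.0, 0.1, 0.01]:
        print(norm.pdf(x, loc=x, scale=s))  # grows without bound as s -> 0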

Bayesian Mixture of Gaussians
[Graphical model: φ → i → y, with (µ, Σ) → y and a hyperparameter node α added; the (i, y) pair in a plate of size n, the (µ, Σ) pair in a plate of size m]
    i ~ Multinom(φ)
    y ~ N(µ_i, Σ_i)
- The Bayesian framework allows us to build in explicit assumptions about what constitutes a sensible category size
- Returning to our graphical model, we put in a prior on category size/shape

Bayesian Mixture of Gaussians
    i ~ Multinom(φ)
    y ~ N(µ_i, Σ_i)
- For now we will just leave the category prior probabilities uniform: φ_1 = φ_2 = φ_3 = φ_4 = 1/4
- Here is a conjugate prior distribution for multivariate Gaussians:
    Σ_i ~ IW(Σ_0, ν)
    µ_i | Σ_i ~ N(µ_0, Σ_i / A)
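A sketch of drawing one category's parameters from this conjugate (normal-inverse-Wishart) prior with SciPy; the hyperparameter values Σ_0, ν, µ_0, and A below are placeholders chosen for illustration, not the lecture's settings.

    import numpy as np
    from scipy.stats import invwishart

    rng = np.random.default_rng(2)

    # Illustrative hyperparameters (not the lecture's settings)
    Sigma0 = np.eye(2)      # scale matrix of the inverse-Wishart
    nu = 5                  # degrees of freedom
    mu0 = np.zeros(2)       # prior mean
    A = 1.0                 # prior "pseudo-count" scaling the mean's covariance

    def sample_category_params():
        """Sigma_i ~ IW(Sigma0, nu); mu_i | Sigma_i ~ N(mu0, Sigma_i / A)."""
        Sigma_i = invwishart.rvs(df=nu, scale=Sigma0, random_state=rng)
        mu_i = rng.multivariate_normal(mu0, Sigma_i / A)
        return mu_i, Sigma_i

    mu_i, Sigma_i = sample_category_params()
    print(mu_i, Sigma_i, sep="\n")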

The Inverse Wishart distribution
- Perhaps the best way to understand the Inverse Wishart distribution is to look at samples from it
- Below I give samples for Σ = (1 0; 0 1)
- Here, k = 2 (top row) or k = 5 (bottom row)
[Figure: samples from the Inverse Wishart distribution]
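To build the same intuition numerically (a sketch; I am assuming the slide's k refers to the degrees-of-freedom parameter, which the transcription does not confirm), we can draw Inverse Wishart samples with the identity scale matrix and compare their eigenvalues, which summarize the size and shape of each sampled covariance:

    import numpy as np
    from scipy.stats import invwishart

    rng = np.random.default_rng(3)
    Sigma0 = np.eye(2)   # identity scale matrix, as on the slide

    # Assumption: treat the slide's k as the IW degrees-of-freedom parameter.
    for df in (2, 5):
        samples = invwishart.rvs(df=df, scale=Sigma0, size=5, random_state=rng)
        # Each sample is a 2x2 covariance matrix; its eigenvalues summarize size/shape.
        eigs = np.array([np.linalg.eigvalsh(S) for S in samples])
        print(f"df={df}: eigenvalues per sample:\n{np.round(eigs, 2)}")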

Inference for Mixture of Gaussians using Gibbs Sampling
- We still have not given a solution to the search problem
- One broadly applicable solution is Gibbs sampling
Simply put:
1. Randomly initialize cluster assignments
2. On each iteration through the data, for each point:
   2.1 Forget the cluster assignment of the current point x_i
   2.2 Compute the probability distribution over x_i's cluster assignment conditional on the rest of the partition:
       P(C_i | x_i, Π_{-i}) = ∫_θ P(x_i | C_i, θ) P(C_i | θ) P(θ) dθ  /  Σ_j ∫_θ P(x_i | C_j, θ) P(C_j | θ) P(θ) dθ
   2.3 Randomly sample a cluster assignment for x_i from P(C_i | x_i, Π_{-i}) and continue
3. Do this for many iterations (e.g., until the unnormalized marginal data likelihood is high)
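Below is a sketch of a Gibbs sampler for a deliberately simplified version of this setup: known spherical covariance σ²·I, uniform category probabilities, and a Gaussian prior on the means, alternating between resampling assignments and resampling means. This is an uncollapsed variant, not the collapsed step in the formula above, and every parameter value is illustrative.

    import numpy as np

    rng = np.random.default_rng(4)

    def gibbs_gmm(x, K, sigma2=1.0, tau2=10.0, iters=200):
        """Toy Gibbs sampler: known spherical covariance sigma2*I, uniform mixing
        weights, prior mu_k ~ N(0, tau2*I). Alternates sampling assignments z
        and cluster means mu (a simplified, uncollapsed variant)."""
        N, d = x.shape
        z = rng.integers(K, size=N)                           # 1. random initial assignments
        mus = x[rng.choice(N, size=K, replace=False)].copy()  # init means at random points
        for _ in range(iters):
            # 2.2-2.3: resample each point's assignment given the current means
            d2 = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
            logp = -0.5 * d2 / sigma2                  # log N(x_i; mu_k, sigma2*I) + const
            p = np.exp(logp - logp.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            z = np.array([rng.choice(K, p=row) for row in p])
            # Resample each cluster mean from its conjugate Gaussian posterior
            for k in range(K):
                members = x[z == k]
                n_k = len(members)
                prec = n_k / sigma2 + 1.0 / tau2       # posterior precision (per dimension)
                mean = members.sum(axis=0) / sigma2 / prec if n_k else np.zeros(d)
                mus[k] = rng.multivariate_normal(mean, np.eye(d) / prec)
        return z, mus

    # Toy data: two well-separated spherical clusters
    x = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(5.0, 1.0, (100, 2))])
    z, mus = gibbs_gmm(x, K=2)
    print(np.round(mus, 2))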

Inference for Mixture of Gaussians using Gibbs Sampling
Starting point for our problem:
[Figure: the vowel data with randomly initialized cluster assignments]

One pass of Gibbs sampling through the data
[Figure: cluster assignments after one pass of Gibbs sampling]

Results of Gibbs sampling with known category probabilities
Posterior modes of category structures:
[Figure: fitted category structures in three panels: F1 versus F2, F1 versus Duration, F2 versus Duration]

Results of Gibbs sampling with known category probabilities
Confusion table of assignments of observations to categories:
[Figure: confusion tables (true vowel e, E, i, I by cluster 1-4) for the unsupervised model and a supervised comparison]
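For completeness, here is a small sketch of how such a confusion table can be computed once the sampler's cluster assignments are in hand; the labels and cluster indices below are hypothetical stand-ins, not the lecture's results.

    import numpy as np

    def confusion_table(true_labels, clusters, label_names):
        """Counts of each true category (rows) assigned to each inferred cluster (columns)."""
        k = clusters.max() + 1
        table = np.zeros((len(label_names), k), dtype=int)
        for row, name in enumerate(label_names):
            mask = true_labels == name
            table[row] = np.bincount(clusters[mask], minlength=k)
        return table

    # Hypothetical example: true vowels and inferred cluster indices
    true_labels = np.array(["e", "E", "i", "I", "i", "I", "e", "E"])
    clusters = np.array([0, 0, 2, 3, 2, 3, 1, 0])
    print(confusion_table(true_labels, clusters, ["e", "E", "i", "I"]))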

Extending the model to learning category probabilities
The multinomial extension of the beta distribution is the Dirichlet distribution, characterized by parameters α_1, ..., α_k and density D(π_1, ..., π_k):

    D(π_1, ..., π_k)  =def  (1/Z) π_1^(α_1 - 1) π_2^(α_2 - 1) ... π_k^(α_k - 1)

where the normalizing constant Z is

    Z = Γ(α_1) Γ(α_2) ... Γ(α_k) / Γ(α_1 + α_2 + ... + α_k)
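A quick numerical companion to these formulas (my own sketch): drawing Dirichlet samples with NumPy, and evaluating the normalizing constant and density via log-gamma functions. The parameter vector α is an arbitrary example.

    import numpy as np
    from scipy.special import gammaln

    alpha = np.array([2.0, 3.0, 4.0])       # illustrative Dirichlet parameters

    # Normalizing constant Z = prod Gamma(alpha_j) / Gamma(sum alpha_j), via logs
    logZ = gammaln(alpha).sum() - gammaln(alpha.sum())
    print(np.exp(logZ))

    # Samples are probability vectors (non-negative, summing to 1)
    rng = np.random.default_rng(5)
    pis = rng.dirichlet(alpha, size=5)
    print(pis.sum(axis=1))                  # each row sums to 1

    # Density at a point pi, using the formula above
    pi = np.array([0.2, 0.3, 0.5])
    log_density = ((alpha - 1) * np.log(pi)).sum() - logZ
    print(np.exp(log_density))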

Extending the model to learning category probabilities
So we set φ ~ D(Σ_φ), and combine this with the rest of the model:
    Σ_i ~ IW(Σ_0, ν)
    µ_i | Σ_i ~ N(µ_0, Σ_i / A)
    i ~ Multinom(φ)
    y ~ N(µ_i, Σ_i)
[Graphical model: nodes Σ_θ, θ, Σ_φ, φ, i, y, b, Σ_b, Σ_Σb, with plates of size n and m]
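The slide does not spell out how the sampler changes, but because the Dirichlet is conjugate to the multinomial, learning φ adds one standard Gibbs step: resample φ from a Dirichlet whose parameters are the prior plus the current cluster counts. A minimal sketch, with an illustrative symmetric prior:

    import numpy as np

    rng = np.random.default_rng(6)

    def resample_phi(z, K, alpha):
        """Dirichlet-multinomial conjugate update: phi | z ~ Dirichlet(alpha + counts).
        This is the extra Gibbs step that learning category probabilities adds."""
        counts = np.bincount(z, minlength=K)
        return rng.dirichlet(alpha + counts)

    # Example: 4 categories, symmetric prior, and some current cluster assignments
    alpha = np.ones(4)                       # illustrative symmetric Dirichlet prior
    z = rng.integers(4, size=200)            # stand-in for the sampler's current assignments
    phi = resample_phi(z, K=4, alpha=alpha)
    print(np.round(phi, 3), phi.sum())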

Having to learn category probabilities too makes the problem harder
[Figure: fitted category structures in three panels: F1 and F2, F1 and Duration, F2 and Duration]

Having to learn category probabilities too makes the problem harder
We can make the problem even more challenging by skewing the category probabilities:
    Category   Probability
    e          0.04
    E          0.05
    i          0.29
    I          0.62

Having to learn category probabilities too makes the problem harder
[Figure: fitted category structures in three panels: F1 and F2, F1 and Duration, F2 and Duration]

Having to learn category probabilities too makes the problem harder
Confusion tables for these cases:
[Figure: confusion tables (true vowel e, E, i, I by cluster 1-4), with and without learning of category frequencies]

Summary
- We can use the exact same models for unsupervised (latent-variable) learning as for hierarchical/mixed-effects regression!
- However, category induction (category learning) presents additional difficulties:
    - Non-convexity of the objective function → difficulty of search
    - Degeneracy of maximum likelihood
- In general you need far more data, and/or additional information sources, to converge on good solutions
- Relevant references: tons! Read about MOGs for automated speech recognition in Jurafsky and Martin (2008, Chapter 9). See Vallabha et al. (2007) and Feldman et al. (2009) for earlier applications of MOGs to phonetic category learning.

References I
Feldman, N. H., Griffiths, T. L., and Morgan, J. L. (2009). Learning phonetic categories by learning a lexicon. In Proceedings of the 31st Annual Conference of the Cognitive Science Society, pages 2208–2213. Cognitive Science Society, Austin, TX.
Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, second edition.
Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., and Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2):F13–F21.
Narayan, C. R., Werker, J. F., and Beddor, P. S. (2010). The interaction between acoustic salience and language experience in developmental speech perception: evidence from nasal place discrimination. Developmental Science, 13(3):407–420.

References II
Vallabha, G. K., McClelland, J. L., Pons, F., Werker, J. F., and Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33):13273–13278.
Werker, J. F. and Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7:49–63.