Latent Variable Models: Probabilistic Models in the Study of Language, Day 4
1 Latent Variable Models: Probabilistic Models in the Study of Language, Day 4. Roger Levy, UC San Diego, Department of Linguistics.
2 Preamble: plate notation for graphical models. Here is the kind of hierarchical model we've seen so far: [graphical model with nodes θ; y_11 ... y_1n_1, y_21 ... y_2n_2, ..., y_m1 ... y_mn_m; b_1, b_2, ..., b_m; and Σ_b]
3-7 Plate notation for graphical models. Here is a more succinct representation of the same model: [graphical model with nodes i, θ, y, b, Σ_b and plates N and m]
- The rectangles labeled N and m are plates; the semantics of a plate labeled n is "replicate this node n times".
- N = Σ_{i=1}^{m} n_i (see previous slide).
- The i node is a cluster identity node.
- In our previous application of hierarchical models to regression, cluster identities were known.
8-12 The plan for today's lecture. We are going to study the simplest type of latent-variable models. [graphical model with nodes φ, i, θ, y, b, Σ_b and plates N and m]
- Technically speaking, "latent variable" means any variable whose value is unknown.
- But the term is conventionally used to refer to hidden structural relations among observations.
- In today's clustering applications, we simply treat i as unknown.
- Inferring values of i induces a clustering among the observations; to do so, we need to put a probability distribution over i.
13-15 The plan for today's lecture. We will cover two types of simple latent-variable models:
- the mixture of Gaussians, for continuous multivariate data;
- Latent Dirichlet Allocation (LDA; also called topic models), for categorical data (words) in collections of documents.
16-19 Mixture of Gaussians. Motivating example: how are phonological categories learned? There is evidence that learning involves a combination of both innate bias and experience:
- Infants can distinguish some contrasts that adult speakers of languages lacking them cannot: alveolar [d] versus retroflex [ɖ] for English speakers, [r] versus [l] for Japanese speakers (Werker and Tees, 1984; Kuhl et al., 2006, inter alia).
- Other contrasts are not reliably distinguished until 1 year of age by native speakers (e.g., syllable-initial [n] versus [ŋ] in Filipino language environments; Narayan et al., 2010).
20 Learning vowel categories. To appreciate the potential difficulties of vowel category learning, consider inter-speaker variation (data courtesy of Vallabha et al., 2007). [Figure: scatter-plot matrix of F1, F2, and duration for the vowels e, E, i, I, shown separately for speakers S1 and S2.]
21 Framing the category learning problem. Here are 19 speakers' data mixed together: [Figure: scatter-plot matrix of F1, F2, and duration for the vowels e, E, i, I, all speakers pooled.]
22-28 Framing the category learning problem. Learning from such data can be thought of in two ways:
- grouping the observations into categories;
- determining the underlying category representations (positions, shapes, and sizes).
Formally: every possible grouping of the observations y into categories is a partition Π of y. If θ are the parameters describing the category representations, our problem is to infer P(Π, θ | y), from which we can recover the two marginal probability distributions of interest:
- P(Π | y), the distribution over partitions given the data;
- P(θ | y), the distribution over category properties given the data.
29-31 The mixture of Gaussians. A simple generative model of the data: we have k multivariate Gaussians with frequencies φ = (φ_1, ..., φ_k), each with its own mean μ_i and covariance matrix Σ_i (here we punt on how to induce the correct number of categories). N observations are generated i.i.d. by:

$$i \sim \mathrm{Multinom}(\phi), \qquad y \sim N(\mu_i, \Sigma_i)$$

Here is the corresponding graphical model: [nodes φ, i, y, μ, Σ with plates n and m]
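The two-line generative story above is easy to simulate directly. A minimal sketch, where the mixing weights, means, and covariances are made-up illustration values rather than anything fitted to the vowel data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture: k = 2 bivariate Gaussians with frequencies phi
phi = np.array([0.4, 0.6])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

def sample_mog(n):
    """Draw n points i.i.d.: i ~ Multinom(phi), then y ~ N(mu_i, Sigma_i)."""
    idx = rng.choice(len(phi), size=n, p=phi)   # latent cluster identities
    y = np.array([rng.multivariate_normal(mus[i], Sigmas[i]) for i in idx])
    return idx, y

idx, y = sample_mog(500)
```

In the clustering setting, only `y` is observed; `idx` is exactly the latent variable i the model must infer.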
32 Can we use maximum likelihood? For observations y all known to come from the same k-dimensional Gaussian, the MLE for the Gaussian's parameters is

$$\hat\mu = (\bar y_1, \bar y_2, \dots, \bar y_k)$$

$$\hat\Sigma = \begin{pmatrix} \mathrm{Var}(y_1) & \mathrm{Cov}(y_1,y_2) & \cdots & \mathrm{Cov}(y_1,y_k) \\ \mathrm{Cov}(y_1,y_2) & \mathrm{Var}(y_2) & \cdots & \mathrm{Cov}(y_2,y_k) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(y_1,y_k) & \mathrm{Cov}(y_2,y_k) & \cdots & \mathrm{Var}(y_k) \end{pmatrix}$$

where Var and Cov are the sample variance and covariance.
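These estimators are one line each in NumPy; the mean vector and covariance below are arbitrary illustration values. Note that the MLE divides by n rather than n-1, hence `bias=True`:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated data from a single bivariate Gaussian (illustration values)
y = rng.multivariate_normal(mean=[1.0, -1.0],
                            cov=[[2.0, 0.3], [0.3, 1.0]], size=2000)

mu_hat = y.mean(axis=0)                          # vector of sample means
Sigma_hat = np.cov(y, rowvar=False, bias=True)   # MLE covariance (divides by n)
```

With 2000 points, `mu_hat` and `Sigma_hat` land close to the true parameters.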
33 Can we use maximum likelihood? So you might ask: why not use the method of maximum likelihood, searching through all the possible partitions of the data and choosing the partition that gives the highest data likelihood? [Figure: example data y.]
34 Can we use maximum likelihood? The set of all partitions into 3,3 observations for our example data: [Figure lost in transcription.]
35-40 Can we use maximum likelihood? This looks like a daunting search task, but there is an even bigger problem.
- Suppose I try a partition into 5,1 observations. The ML fit for this partition is degenerate: the singleton category gets a singular covariance matrix, and its likelihood is unbounded!
- More generally, for a V-dimensional problem you need at least V+1 points in each cell of the partition.
- But this constraint would prevent you from finding intuitive solutions to your problem!
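The degeneracy is easy to demonstrate numerically: with fewer than V+1 points in a V-dimensional cluster, the sample covariance is singular. A sketch with two points in two dimensions:

```python
import numpy as np

# Two points in V = 2 dimensions: fewer than the V + 1 = 3 points needed
# for a full-rank sample covariance, so the ML estimate is singular.
pts = np.array([[0.0, 0.0], [1.0, 2.0]])
Sigma_hat = np.cov(pts, rowvar=False, bias=True)   # MLE (divides by n)
det = np.linalg.det(Sigma_hat)
# det == 0: the fitted Gaussian collapses onto the line through the two
# points, and the density (hence the partition's likelihood) is unbounded.
```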
41-46 Bayesian mixture of Gaussians. [Graphical model: α → (μ, Σ); φ → i → y; plates n and m.]

$$i \sim \mathrm{Multinom}(\phi), \qquad y \sim N(\mu_i, \Sigma_i)$$

- The Bayesian framework allows us to build in explicit assumptions about what constitutes a sensible category size.
- Returning to our graphical model, we put in a prior on category size/shape.
- For now we will just leave category prior probabilities uniform: φ_1 = φ_2 = φ_3 = φ_4 = 1/4.
- Here is a conjugate prior distribution for multivariate Gaussians:

$$\Sigma_i \sim \mathrm{IW}(\Sigma_0, \nu), \qquad \mu_i \mid \Sigma_i \sim N(\mu_0, \Sigma_i / A)$$
47-50 The Inverse Wishart distribution. Perhaps the best way to understand the Inverse Wishart distribution is to look at samples from it. [Figure: samples drawn for a fixed scale matrix Σ (values lost in transcription), with k = 2 in the top row and k = 5 in the bottom row.]
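Samples like those on the slide can be generated without special-purpose libraries. A sketch, assuming an integer degrees-of-freedom ν at least the dimension, using the fact that the inverse of an IW(Σ₀, ν) draw is Wishart-distributed with scale Σ₀⁻¹; the scale matrix below is an illustration value, not the one from the slide:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_inverse_wishart(Sigma0, nu):
    """Draw Sigma ~ IW(Sigma0, nu) by sampling a Wishart and inverting.

    If Sigma ~ IW(Sigma0, nu), then Sigma^{-1} ~ Wishart(Sigma0^{-1}, nu);
    for integer nu >= dim, a Wishart draw is the sum of outer products of
    nu draws from N(0, Sigma0^{-1}).
    """
    d = Sigma0.shape[0]
    X = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Sigma0), size=nu)
    W = X.T @ X                 # Wishart(Sigma0^{-1}, nu) draw
    return np.linalg.inv(W)

# Larger nu concentrates samples around E[Sigma] = Sigma0 / (nu - d - 1)
Sigma0 = np.array([[1.0, 0.5], [0.5, 1.0]])
draws = [sample_inverse_wishart(Sigma0, nu=50) for _ in range(200)]
mean_draw = np.mean(draws, axis=0)
```

Varying ν while holding Σ₀ fixed reproduces the qualitative pattern on the slide: small ν gives wildly varying covariance shapes, large ν gives samples tightly clustered around the (scaled) scale matrix.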
51-59 Inference for mixture of Gaussians using Gibbs sampling. We still have not given a solution to the search problem. One broadly applicable solution is Gibbs sampling. Simply put:
1. Randomly initialize cluster assignments.
2. On each iteration through the data, for each point:
   2.1 Forget the cluster assignment of the current point x_i.
   2.2 Compute the probability distribution over x_i's cluster assignment conditional on the rest of the partition:

   $$P(C_i \mid x_i, \Pi_{-i}) = \frac{\int_\theta P(x_i \mid C_i, \theta)\, P(C_i \mid \theta)\, P(\theta)\, d\theta}{\sum_j \int_\theta P(x_i \mid C_j, \theta)\, P(C_j \mid \theta)\, P(\theta)\, d\theta}$$

   2.3 Randomly sample a cluster assignment for x_i from P(C_i | x_i, Π_{-i}) and continue.
3. Do this for many iterations (e.g., until the unnormalized marginal data likelihood is high).
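The loop above can be sketched in a few lines. This is a deliberately simplified version: it resamples the cluster means explicitly instead of integrating them out, assumes a known within-cluster variance and uniform φ, and runs on made-up 1-D toy data rather than the vowel measurements:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: two well-separated 1-D clusters (a stand-in for the vowel data)
y = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(8.0, 1.0, 50)])
k, n = 2, len(y)
sigma2 = 1.0       # known within-cluster variance (simplifying assumption)
tau2 = 100.0       # broad prior variance on cluster means

z = rng.integers(k, size=n)        # step 1: random initial assignments
mu = rng.normal(0.0, 10.0, k)

for _ in range(200):               # step 2: sweep through the data
    # steps 2.1-2.3: resample every point's cluster given current means
    logp = -0.5 * (y[:, None] - mu[None, :]) ** 2 / sigma2  # uniform phi cancels
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(k, p=p_i) for p_i in p])
    # resample each cluster mean from its conjugate normal posterior
    for j in range(k):
        nj = (z == j).sum()
        var = 1.0 / (nj / sigma2 + 1.0 / tau2)
        mean = var * y[z == j].sum() / sigma2
        mu[j] = rng.normal(mean, np.sqrt(var))
```

On data this well separated, the sampled means settle near the two true cluster centers within a few sweeps.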
60 Inference for mixture of Gaussians using Gibbs sampling. Starting point for our problem: [Figure lost in transcription.]
61 One pass of Gibbs sampling through the data
62 Results of Gibbs sampling with known category probabilities. Posterior modes of category structures: [Figure: three panels, F1 versus F2, F1 versus duration, and F2 versus duration.]
63 Results of Gibbs sampling with known category probabilities. Confusion table of assignments of observations to categories, unsupervised versus supervised, for the true vowels e, E, i, I against the inferred clusters. [Table values lost in transcription.]
64 Extending the model to learning category probabilities. The multinomial extension of the beta distribution is the Dirichlet distribution, characterized by parameters α_1, ..., α_k, with density

$$D(\pi_1, \dots, \pi_k) \overset{\mathrm{def}}{=} \frac{1}{Z}\, \pi_1^{\alpha_1 - 1}\, \pi_2^{\alpha_2 - 1} \cdots \pi_k^{\alpha_k - 1}$$

where the normalizing constant Z is

$$Z = \frac{\Gamma(\alpha_1)\,\Gamma(\alpha_2)\cdots\Gamma(\alpha_k)}{\Gamma(\alpha_1 + \alpha_2 + \cdots + \alpha_k)}$$
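NumPy can draw from this distribution directly; the symmetric α below is an illustration value. Every draw lies on the probability simplex, and with all α_j = 1 the distribution is uniform over it:

```python
import numpy as np

rng = np.random.default_rng(4)

alpha = np.array([1.0, 1.0, 1.0, 1.0])   # symmetric alpha: uniform on simplex
phi = rng.dirichlet(alpha)               # one draw of category probabilities

# Many draws: components are nonnegative, each draw sums to 1, and
# E[phi_j] = alpha_j / sum(alpha), here 1/4 for every category.
many = rng.dirichlet(alpha, size=10000)
```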
65-68 Extending the model to learning category probabilities. So we set φ ~ D(Σ_φ) and combine this with the rest of the model:

$$\Sigma_i \sim \mathrm{IW}(\Sigma_0, \nu), \qquad \mu_i \mid \Sigma_i \sim N(\mu_0, \Sigma_i / A)$$
$$i \sim \mathrm{Multinom}(\phi), \qquad y \sim N(\mu_i, \Sigma_i)$$

[Graphical model: hyperparameter nodes Σ_θ and Σ_φ feeding θ and φ; φ → i → y; b with its hyperparameters; plates n and m.]
69 Having to learn category probabilities too makes the problem harder. [Figure: three panels, F1 and F2, F1 and duration, F2 and duration.]
70 Having to learn category probabilities too makes the problem harder. We can make the problem even more challenging by skewing the category probabilities:

Category  Probability
e         0.04
E         0.05
i         0.29
I         0.62
71 Having to learn category probabilities too makes the problem harder. [Figure: three panels, F1 and F2, F1 and duration, F2 and duration.]
72 Having to learn category probabilities too makes the problem harder. Confusion tables for these cases, with versus without learning of category frequencies, for the true vowels e, E, i, I against the inferred clusters. [Table values lost in transcription.]
73-78 Summary. We can use the exact same models for unsupervised (latent-variable) learning as for hierarchical/mixed-effects regression! However, category induction presents additional difficulties for category learning:
- non-convexity of the objective function, which makes search difficult;
- degeneracy of maximum likelihood.
In general you need far more data, and/or additional information sources, to converge on good solutions. Relevant references: tons! Read about MOGs for automated speech recognition in Jurafsky and Martin (2008, Chapter 9). See Vallabha et al. (2007) and Feldman et al. (2009) for earlier applications of MOGs to phonetic category learning.
79 References I
Feldman, N. H., Griffiths, T. L., and Morgan, J. L. (2009). Learning phonetic categories by learning a lexicon. In Proceedings of the 31st Annual Conference of the Cognitive Science Society. Cognitive Science Society, Austin, TX.
Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall, second edition.
Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., and Iverson, P. (2006). Infants show a facilitation effect for native language phonetic perception between 6 and 12 months. Developmental Science, 9(2):F13-F21.
Narayan, C. R., Werker, J. F., and Beddor, P. S. (2010). The interaction between acoustic salience and language experience in developmental speech perception: evidence from nasal place discrimination. Developmental Science, 13(3).
80 References II
Vallabha, G. K., McClelland, J. L., Pons, F., Werker, J. F., and Amano, S. (2007). Unsupervised learning of vowel categories from infant-directed speech. Proceedings of the National Academy of Sciences, 104(33).
Werker, J. F. and Tees, R. C. (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7:49-63.
Kernel Density Topic Models: Visual Topics Without Visual Words Konstantinos Rematas K.U. Leuven ESAT-iMinds krematas@esat.kuleuven.be Mario Fritz Max Planck Institute for Informatics mfrtiz@mpi-inf.mpg.de
More informationComputational Cognitive Science
Computational Cognitive Science Lecture 8: Frank Keller School of Informatics University of Edinburgh keller@inf.ed.ac.uk Based on slides by Sharon Goldwater October 14, 2016 Frank Keller Computational
More informationLecture 4: Probabilistic Learning
DD2431 Autumn, 2015 1 Maximum Likelihood Methods Maximum A Posteriori Methods Bayesian methods 2 Classification vs Clustering Heuristic Example: K-means Expectation Maximization 3 Maximum Likelihood Methods
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More informationRecent Advances in Bayesian Inference Techniques
Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian
More informationPattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions
Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite
More informationIntroduction to Bayesian inference
Introduction to Bayesian inference Thomas Alexander Brouwer University of Cambridge tab43@cam.ac.uk 17 November 2015 Probabilistic models Describe how data was generated using probability distributions
More informationUnsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models
SUBMISSION TO IEEE TRANS. ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1 Unsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models Xiaogang Wang, Xiaoxu Ma,
More informationClustering using Mixture Models
Clustering using Mixture Models The full posterior of the Gaussian Mixture Model is p(x, Z, µ,, ) =p(x Z, µ, )p(z )p( )p(µ, ) data likelihood (Gaussian) correspondence prob. (Multinomial) mixture prior
More informationHidden Markov Models
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 22 April 2, 2018 1 Reminders Homework
More informationSTA 414/2104: Machine Learning
STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far
More informationBased on slides by Richard Zemel
CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we
More informationLecture 6: Graphical Models: Learning
Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)
More informationMixtures of Gaussians. Sargur Srihari
Mixtures of Gaussians Sargur srihari@cedar.buffalo.edu 1 9. Mixture Models and EM 0. Mixture Models Overview 1. K-Means Clustering 2. Mixtures of Gaussians 3. An Alternative View of EM 4. The EM Algorithm
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationA Fully Nonparametric Modeling Approach to. BNP Binary Regression
A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation
More informationExponential Families
Exponential Families David M. Blei 1 Introduction We discuss the exponential family, a very flexible family of distributions. Most distributions that you have heard of are in the exponential family. Bernoulli,
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More informationA Brief and Friendly Introduction to Mixed-Effects Models in Linguistics
A Brief and Friendly Introduction to Mixed-Effects Models in Linguistics Cluster-specific parameters ( random effects ) Σb Parameters governing inter-cluster variability b1 b2 bm x11 x1n1 x21 x2n2 xm1
More informationMachine Learning Techniques for Computer Vision
Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM
More informationPROBABILITY DISTRIBUTIONS. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception
PROBABILITY DISTRIBUTIONS Credits 2 These slides were sourced and/or modified from: Christopher Bishop, Microsoft UK Parametric Distributions 3 Basic building blocks: Need to determine given Representation:
More informationOutline. Limits of Bayesian classification Bayesian concept learning Probabilistic models for unsupervised and semi-supervised category learning
Outline Limits of Bayesian classification Bayesian concept learning Probabilistic models for unsupervised and semi-supervised category learning Limitations Is categorization just discrimination among mutually
More informationTopic Models. Charles Elkan November 20, 2008
Topic Models Charles Elan elan@cs.ucsd.edu November 20, 2008 Suppose that we have a collection of documents, and we want to find an organization for these, i.e. we want to do unsupervised learning. One
More informationClustering and Gaussian Mixture Models
Clustering and Gaussian Mixture Models Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 25, 2016 Probabilistic Machine Learning (CS772A) Clustering and Gaussian Mixture Models 1 Recap
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationGibbs Sampling in Linear Models #2
Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling
More informationIntroduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak
Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,
More informationDay 1: Probability and speech perception
Day 1: Probability and speech perception 1 Day 2: Human sentence parsing 2 Day 3: Noisy-channel sentence processing? Day 4: Language production & acquisition whatsthat thedoggie yeah wheresthedoggie Grammar/lexicon
More informationUnsupervised Learning
2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and
More informationLecture 2: Priors and Conjugacy
Lecture 2: Priors and Conjugacy Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 6, 2014 Some nice courses Fred A. Hamprecht (Heidelberg U.) https://www.youtube.com/watch?v=j66rrnzzkow Michael I.
More informationLecture 14. Clustering, K-means, and EM
Lecture 14. Clustering, K-means, and EM Prof. Alan Yuille Spring 2014 Outline 1. Clustering 2. K-means 3. EM 1 Clustering Task: Given a set of unlabeled data D = {x 1,..., x n }, we do the following: 1.
More informationGaussian Mixture Model
Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,
More informationMachine Learning Summer School
Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationPart 6: Multivariate Normal and Linear Models
Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of
More information13 : Variational Inference: Loopy Belief Propagation and Mean Field
10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction
More informationCS Lecture 18. Topic Models and LDA
CS 6347 Lecture 18 Topic Models and LDA (some slides by David Blei) Generative vs. Discriminative Models Recall that, in Bayesian networks, there could be many different, but equivalent models of the same
More informationProbabilistic Methods in Linguistics Lecture 2
Probabilistic Methods in Linguistics Lecture 2 Roger Levy UC San Diego Department of Linguistics October 2, 2012 A bit of review & terminology A Bernoulli distribution was defined as π if x = 1 P(X = x)
More informationUnsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models
SUBMISSION TO IEEE TRANS. ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 1 Unsupervised Activity Perception in Crowded and Complicated Scenes Using Hierarchical Bayesian Models Xiaogang Wang, Xiaoxu Ma,
More informationSTAT Advanced Bayesian Inference
1 / 32 STAT 625 - Advanced Bayesian Inference Meng Li Department of Statistics Jan 23, 218 The Dirichlet distribution 2 / 32 θ Dirichlet(a 1,...,a k ) with density p(θ 1,θ 2,...,θ k ) = k j=1 Γ(a j) Γ(
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationLatent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993-1022, January 2003. Following slides borrowed ant then heavily modified from: Jonathan Huang
More informationParametric Techniques Lecture 3
Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to
More informationA Bayesian Perspective on Residential Demand Response Using Smart Meter Data
A Bayesian Perspective on Residential Demand Response Using Smart Meter Data Datong-Paul Zhou, Maximilian Balandat, and Claire Tomlin University of California, Berkeley [datong.zhou, balandat, tomlin]@eecs.berkeley.edu
More informationSequential Monte Carlo and Particle Filtering. Frank Wood Gatsby, November 2007
Sequential Monte Carlo and Particle Filtering Frank Wood Gatsby, November 2007 Importance Sampling Recall: Let s say that we want to compute some expectation (integral) E p [f] = p(x)f(x)dx and we remember
More informationBayesian Mixtures of Bernoulli Distributions
Bayesian Mixtures of Bernoulli Distributions Laurens van der Maaten Department of Computer Science and Engineering University of California, San Diego Introduction The mixture of Bernoulli distributions
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 2. Statistical Schools Adapted from slides by Ben Rubinstein Statistical Schools of Thought Remainder of lecture is to provide
More informationInterpretable Latent Variable Models
Interpretable Latent Variable Models Fernando Perez-Cruz Bell Labs (Nokia) Department of Signal Theory and Communications, University Carlos III in Madrid 1 / 24 Outline 1 Introduction to Machine Learning
More informationModeling Environment
Topic Model Modeling Environment What does it mean to understand/ your environment? Ability to predict Two approaches to ing environment of words and text Latent Semantic Analysis (LSA) Topic Model LSA
More information