Unsupervised learning (part 1) Lecture 19

1 Unsupervised learning (part 1), Lecture 19. David Sontag, New York University. Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer, Dan Weld, Vibhav Gogate, and Andrew Moore

2 Bayesian networks enable use of domain knowledge: Will my car start this morning? $p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{Pa(i)})$ (Heckerman et al., Decision-Theoretic Troubleshooting, 1995)

3 Bayesian networks enable use of domain knowledge: What is the differential diagnosis? $p(x_1, \ldots, x_n) = \prod_{i \in V} p(x_i \mid x_{Pa(i)})$ (Beinlich et al., The ALARM Monitoring System, 1989)

4 Bayesian networks are generative models: Can sample from the joint distribution, top-down. Suppose Y can be spam or not spam, and $X_i$ is a binary indicator of whether word i is present in the e-mail. Let's try generating a few e-mails! [Figure: label Y with arrows to features $X_1, X_2, X_3, \ldots, X_n$.] It often helps to think about Bayesian networks as a generative model when constructing the structure and thinking about the model assumptions.
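The top-down sampling described on this slide can be sketched as follows. This is a minimal illustration, not the lecture's code: the word list, the prior `P_SPAM`, and the per-word probabilities in `P_WORD` are hypothetical values chosen only to make the example run.

```python
import random

# Hypothetical parameters, chosen only for illustration.
P_SPAM = 0.4                          # P(Y = spam)
WORDS = ["prize", "meeting", "free"]  # word i corresponds to feature X_i
P_WORD = {                            # P(X_i = 1 | Y) for each word
    "spam":     [0.30, 0.05, 0.60],
    "not spam": [0.01, 0.40, 0.10],
}

def sample_email(rng):
    """Ancestral (top-down) sampling: draw Y first, then each X_i given Y."""
    y = "spam" if rng.random() < P_SPAM else "not spam"
    x = [1 if rng.random() < p else 0 for p in P_WORD[y]]
    return y, x

rng = random.Random(0)
samples = [sample_email(rng) for _ in range(5)]
for y, x in samples:
    print(y, dict(zip(WORDS, x)))
```

Each variable is sampled only after its parents, which is what "top-down" means for a general Bayesian network as well.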

5 Inference in Bayesian networks: Computing marginal probabilities in tree-structured Bayesian networks is easy. The algorithm called belief propagation generalizes what we showed for hidden Markov models to arbitrary trees. [Figures: label Y with features $X_1, X_2, X_3, \ldots, X_n$; a network over $X_1, \ldots, X_6$ and $Y_1, \ldots, Y_6$.] Wait, this isn't a tree! What can we do?

6 Inference in Bayesian networks: In some cases (such as this) we can transform this into what is called a junction tree, and then run belief propagation.

7 Approximate inference: There is also a wealth of approximate inference algorithms that can be applied to Bayesian networks such as these. Markov chain Monte Carlo algorithms repeatedly sample assignments for estimating marginals. Variational inference algorithms (deterministic) find a simpler distribution which is close to the original, then compute marginals using the simpler distribution.

8 Maximum likelihood estimation in Bayesian networks: Suppose that we know the Bayesian network structure G. Let $\theta_{x_i \mid x_{pa(i)}}$ be the parameter giving the value of the CPD $p(x_i \mid x_{pa(i)})$. With $x^m$ denoting the m-th of M training examples, maximum likelihood estimation corresponds to solving:
$$\max_\theta \frac{1}{M} \sum_{m=1}^{M} \log p(x^m; \theta)$$
subject to the non-negativity and normalization constraints. This is equal to:
$$\max_\theta \frac{1}{M} \sum_{m=1}^{M} \log p(x^m; \theta) = \max_\theta \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{N} \log p(x_i^m \mid x_{pa(i)}^m; \theta) = \max_\theta \sum_{i=1}^{N} \frac{1}{M} \sum_{m=1}^{M} \log p(x_i^m \mid x_{pa(i)}^m; \theta)$$
The optimization problem decomposes into an independent optimization problem for each CPD! Each one has a simple closed-form solution.
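For tabular CPDs and fully observed data, the closed-form solution is a ratio of counts: the estimate of $p(x_i \mid x_{pa(i)})$ is the number of examples with that child and parent configuration divided by the number with that parent configuration. A minimal sketch, assuming discrete variables stored as dictionaries; the helper `mle_cpd` and the two-variable `rain`/`wet` data set are hypothetical.

```python
from collections import Counter

def mle_cpd(data, child, parents):
    """Estimate one CPD by counting: count(parents, child) / count(parents)."""
    joint = Counter()
    parent_counts = Counter()
    for row in data:
        pa = tuple(row[p] for p in parents)
        joint[(pa, row[child])] += 1
        parent_counts[pa] += 1
    # Keys are (parent_values, child_value); only observed combinations appear.
    return {key: n / parent_counts[key[0]] for key, n in joint.items()}

# Hypothetical data for a two-node network rain -> wet.
data = [
    {"rain": 1, "wet": 1},
    {"rain": 1, "wet": 1},
    {"rain": 1, "wet": 0},
    {"rain": 0, "wet": 0},
]
cpd_rain = mle_cpd(data, "rain", [])       # p(rain), no parents
cpd_wet = mle_cpd(data, "wet", ["rain"])   # p(wet | rain)
print(cpd_rain)
print(cpd_wet)
```

Each CPD is estimated from its own counts only, which mirrors the decomposition into independent per-CPD problems.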

9 Returning to clustering: Clusters may overlap. Some clusters may be wider than others. Can we model this explicitly? With what probability is a point from a cluster?

10 Probabilistic Clustering: Try a probabilistic model! It allows overlaps, clusters of different size, etc. Can tell a generative story for data: $P(Y)\,P(X \mid Y)$. Challenge: we need to estimate model parameters without labeled Ys. [Table: columns Y, $X_1$, $X_2$; unknown entries shown as ??.]

11 Gaussian Mixture Models: P(Y): there are k components. P(X | Y): each component generates data from a multivariate Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$. Each data point is assumed to have been sampled from a generative process: 1. Choose component i with probability P(y = i) [Multinomial]. 2. Generate datapoint $\sim \mathcal{N}(\mu_i, \Sigma_i)$.
$$P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right]$$
By fitting this model (unsupervised learning), we can learn new insights about the data. [Figure: three clusters with means $\mu_1$, $\mu_2$, $\mu_3$.]
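The two-step generative process can be sketched directly. The version below is one-dimensional for brevity (a standard deviation stands in for the covariance matrix); the function name `sample_gmm_1d` and all parameter values are hypothetical.

```python
import random

def sample_gmm_1d(weights, means, stds, n, rng):
    """Draw n (component, value) pairs from a 1-D Gaussian mixture."""
    draws = []
    for _ in range(n):
        # Step 1: choose component i with probability P(y = i).
        i = rng.choices(range(len(weights)), weights=weights)[0]
        # Step 2: generate the datapoint from that component's Gaussian.
        draws.append((i, rng.gauss(means[i], stds[i])))
    return draws

rng = random.Random(0)
draws = sample_gmm_1d([0.3, 0.7], [0.0, 10.0], [1.0, 1.0], 1000, rng)
share = sum(1 for i, _ in draws if i == 1) / len(draws)
print(f"share of points from component 1: {share:.2f}")
```

In the unsupervised setting only the second element of each pair is observed; the component index is the missing label Y.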

12 Multivariate Gaussians: $P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right]$. Case shown: $\Sigma \propto$ identity matrix.

13 Multivariate Gaussians: $P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right]$. Case shown: $\Sigma$ = diagonal matrix, so the $X_i$ are independent, à la Gaussian NB.

14 Multivariate Gaussians: $P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right]$. Case shown: $\Sigma$ = arbitrary (positive semidefinite) matrix, which specifies a rotation (change of basis); its eigenvalues specify the relative elongation.

15 Multivariate Gaussians: Covariance matrix, $\Sigma$, = degree to which the $x_i$ vary together. [Figure label: eigenvalue, $\lambda$, of $\Sigma$.] $P(X = x_j \mid Y = i) = \frac{1}{(2\pi)^{m/2} |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2} (x_j - \mu_i)^T \Sigma_i^{-1} (x_j - \mu_i)\right]$
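To make the density formula concrete, here is a small sketch that evaluates it for m = 2, writing out the 2x2 determinant and inverse by hand so that only the standard library is needed. The function name `gaussian_pdf_2d` is hypothetical, and the sketch assumes $\Sigma$ is positive definite (non-zero determinant).

```python
import math

def gaussian_pdf_2d(x, mu, sigma):
    """Density of a 2-D Gaussian at x, for a 2x2 covariance matrix sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    quad = sum(dx[r] * inv[r][s] * dx[s] for r in range(2) for s in range(2))
    # (2*pi)^(m/2) with m = 2 is just 2*pi.
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

identity = [[1.0, 0.0], [0.0, 1.0]]
elongated = [[4.0, 0.0], [0.0, 1.0]]   # diagonal: wider along the first axis
print(gaussian_pdf_2d((0.0, 0.0), (0.0, 0.0), identity))
print(gaussian_pdf_2d((0.0, 0.0), (0.0, 0.0), elongated))
```

With the identity covariance the peak density is $1/(2\pi)$; stretching one axis by a factor of 2 (variance 4) halves the peak, since the same probability mass is spread over a wider ellipse.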

16 Modelling eruption of geysers: Old Faithful Data Set. [Figure: scatter plot with axes "Time to Eruption" and "Duration of Last Eruption".]

17 Modelling eruption of geysers: Old Faithful Data Set. [Figure panels: Single Gaussian; Mixture of two Gaussians.]

18 Marginal distribution for mixtures of Gaussians: [Figure with K=3; annotations: "Component", "Mixing coefficient".]

19 Marginal distribution for mixtures of Gaussians

20 Learning mixtures of Gaussians: [Figure panels: Original data (hypothesized); Observed data (y missing); Inferred y's (learned model).] Shown is the posterior probability that a point was generated from the i-th Gaussian: $\Pr(Y = i \mid x)$

21 ML estimation in supervised setting: Cases covered: univariate Gaussian; mixture of multivariate Gaussians. The ML estimate for each of the multivariate Gaussians is given by:
$$\mu_{ML}^{k} = \frac{1}{n_k} \sum_{j=1}^{n_k} x_j \qquad \Sigma_{ML}^{k} = \frac{1}{n_k} \sum_{j=1}^{n_k} (x_j - \mu_{ML}^{k})(x_j - \mu_{ML}^{k})^T$$
Just sums over the $n_k$ points x generated from the k-th Gaussian.
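When the labels are observed, these estimates are plain per-class averages. A one-dimensional sketch (a variance replaces the covariance matrix); the helper `supervised_mle` and the toy data are hypothetical, and the class proportion is returned as well since it is the corresponding estimate of P(Y = k).

```python
def supervised_mle(xs, ys, k):
    """Per-class ML estimates (proportion, mean, variance) from labeled 1-D data."""
    params = []
    for c in range(k):
        pts = [x for x, y in zip(xs, ys) if y == c]   # points generated by class c
        n_c = len(pts)
        mu = sum(pts) / n_c
        var = sum((x - mu) ** 2 for x in pts) / n_c   # ML estimate divides by n_c
        params.append((n_c / len(xs), mu, var))
    return params

params = supervised_mle([0.0, 2.0, 10.0, 14.0], [0, 0, 1, 1], k=2)
print(params)
```

The sketch assumes every class has at least one labeled point; an empty class would need separate handling.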

22 What about with unobserved data? Maximize marginal likelihood:
$$\arg\max_\theta \prod_j P(x_j) = \arg\max_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)$$
Almost always a hard problem! Usually no closed form solution. Even when $\lg P(x, y)$ is convex, $\lg P(x)$ generally isn't. Many local optima.
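The objective itself is easy to evaluate even though it is hard to maximize. The sketch below computes the log of the marginal likelihood for a 1-D mixture, summing over the hidden component inside the logarithm with the log-sum-exp trick for numerical stability. The function names and the choice of a 1-D Gaussian mixture are assumptions made for illustration.

```python
import math

def log_normal(x, mu, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def marginal_log_likelihood(xs, weights, means, variances):
    """sum_j log sum_k P(Y_j = k) p(x_j | Y_j = k), computed stably."""
    total = 0.0
    for x in xs:
        terms = [math.log(w) + log_normal(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        top = max(terms)
        # log-sum-exp: factor out the largest term before exponentiating.
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total

xs = [0.0, 1.0, 9.0, 10.0]
print(marginal_log_likelihood(xs, [0.5, 0.5], [0.5, 9.5], [1.0, 1.0]))
```

The sum over k sits inside the log, so the expression does not split into separate per-component terms the way the fully observed log-likelihood does.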

23 1977: Dempster, Laird, & Rubin. Expectation Maximization

24 The EM Algorithm: A clever method for maximizing marginal likelihood: $\arg\max_\theta \prod_j P(x_j) = \arg\max_\theta \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j)$. Based on coordinate descent. Easy to implement (e.g., no line search, learning rates, etc.). Alternate between two steps: compute an expectation; compute a maximization. Not magic: still optimizing a non-convex function with lots of local optima. The computations are just easier (often, significantly so).

25 EM: Two Easy Steps. Objective:
$$\arg\max_\theta \lg \prod_j \sum_{k=1}^{K} P(Y_j = k, x_j; \theta) = \arg\max_\theta \sum_j \lg \sum_{k=1}^{K} P(Y_j = k, x_j; \theta)$$
Data: $\{x_j \mid j = 1 \ldots n\}$. E-step: Compute expectations to fill in missing y values according to current parameters, $\theta$. For all examples j and values k for $Y_j$, compute $P(Y_j = k \mid x_j; \theta)$. M-step: Re-estimate the parameters with weighted MLE estimates. Set
$$\theta_{new} = \arg\max_\theta \sum_j \sum_k P(Y_j = k \mid x_j; \theta_{old}) \log P(Y_j = k, x_j; \theta)$$
Particularly useful when the E and M steps have closed form solutions.

26 Gaussian Mixture Example: Start

27 After first iteration

28 After 2nd iteration

29 After 3rd iteration

30 After 4th iteration

31 After 5th iteration

32 After 6th iteration

33 After 20th iteration

34 EM for GMMs: only learning means (1D). Iterate: on the t-th iteration let our estimates be $\lambda_t = \{\mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_K^{(t)}\}$. E-step: compute "expected" classes of all datapoints:
$$P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\left(-\frac{1}{2\sigma^2} (x_j - \mu_k)^2\right) P(Y_j = k)$$
M-step: compute most likely new $\mu$s given class expectations:
$$\mu_k = \frac{\sum_{j=1}^{m} P(Y_j = k \mid x_j)\, x_j}{\sum_{j=1}^{m} P(Y_j = k \mid x_j)}$$
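A minimal sketch of these two updates, assuming a shared known $\sigma$, fixed class priors (uniform unless given), and a fixed number of iterations; the function name `em_means_1d`, the toy data, and the starting means are hypothetical.

```python
import math

def em_means_1d(xs, mus, sigma=1.0, priors=None, iters=50):
    """EM for a 1-D mixture where only the means are learned."""
    k = len(mus)
    priors = priors or [1.0 / k] * k
    for _ in range(iters):
        # E-step: P(Y_j = k | x_j) proportional to exp(-(x_j - mu_k)^2 / (2 sigma^2)) P(Y_j = k).
        resp = []
        for x in xs:
            w = [p * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                 for p, mu in zip(priors, mus)]
            z = sum(w)
            resp.append([wi / z for wi in w])
        # M-step: each mean becomes the responsibility-weighted average of the data.
        mus = [sum(r[c] * x for r, x in zip(resp, xs)) / sum(r[c] for r in resp)
               for c in range(k)]
    return mus

xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mus = em_means_1d(xs, mus=[0.0, 6.0])
print(mus)
```

For points very far from every mean the unnormalized weights can underflow to zero; a more careful version would work in log space.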

35 What if we do hard assignments? Iterate: on the t-th iteration let our estimates be $\lambda_t = \{\mu_1^{(t)}, \mu_2^{(t)}, \ldots, \mu_K^{(t)}\}$. E-step: compute "expected" classes of all datapoints:
$$P(Y_j = k \mid x_j, \mu_1 \ldots \mu_K) \propto \exp\left(-\frac{1}{2\sigma^2} (x_j - \mu_k)^2\right) P(Y_j = k)$$
M-step: compute most likely new $\mu$s given class expectations. The soft update
$$\mu_k = \frac{\sum_{j=1}^{m} P(Y_j = k \mid x_j)\, x_j}{\sum_{j=1}^{m} P(Y_j = k \mid x_j)}$$
becomes
$$\mu_k = \frac{\sum_{j=1}^{m} \delta(Y_j = k, x_j)\, x_j}{\sum_{j=1}^{m} \delta(Y_j = k, x_j)}$$
where $\delta$ represents hard assignment to the "most likely" or nearest cluster. Equivalent to the k-means clustering algorithm!!!
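With hard assignments the loop reduces to the familiar k-means updates. A 1-D sketch follows; it assigns each point to the nearest mean, which coincides with the most likely cluster when the priors are equal and $\sigma$ is shared. The function name `kmeans_1d` and the toy data are hypothetical.

```python
def kmeans_1d(xs, mus, iters=50):
    """Hard-assignment EM in 1-D, i.e. k-means."""
    assign = []
    for _ in range(iters):
        # Hard E-step: delta(Y_j = k, x_j) is 1 for the nearest mean, else 0.
        assign = [min(range(len(mus)), key=lambda c: (x - mus[c]) ** 2) for x in xs]
        # M-step: each mean is the plain average of its assigned points.
        new_mus = []
        for c in range(len(mus)):
            pts = [x for x, a in zip(xs, assign) if a == c]
            new_mus.append(sum(pts) / len(pts) if pts else mus[c])  # keep empty clusters in place
        if new_mus == mus:   # assignments and means have stopped changing
            break
        mus = new_mus
    return mus, assign

xs = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
mus, assign = kmeans_1d(xs, mus=[0.0, 6.0])
print(mus, assign)
```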

36 E.M. for General GMMs. Iterate: on the t-th iteration let our estimates be $\lambda_t = \{\mu_1^{(t)}, \ldots, \mu_K^{(t)}, \Sigma_1^{(t)}, \ldots, \Sigma_K^{(t)}, p_1^{(t)}, \ldots, p_K^{(t)}\}$, where $p_k^{(t)}$ is shorthand for the estimate of P(y = k) on the t-th iteration. E-step: compute "expected" classes of all datapoints for each class, evaluating the probability of a multivariate Gaussian at $x_j$:
$$P(Y_j = k \mid x_j; \lambda_t) \propto p_k^{(t)}\, p(x_j; \mu_k^{(t)}, \Sigma_k^{(t)})$$
M-step: compute weighted MLE for $\mu$, $\Sigma$, and p given expected classes above:
$$\mu_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)\, x_j}{\sum_j P(Y_j = k \mid x_j; \lambda_t)}$$
$$\Sigma_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)\, [x_j - \mu_k^{(t+1)}][x_j - \mu_k^{(t+1)}]^T}{\sum_j P(Y_j = k \mid x_j; \lambda_t)}$$
$$p_k^{(t+1)} = \frac{\sum_j P(Y_j = k \mid x_j; \lambda_t)}{m}$$
where m = #training examples.
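The full update can be sketched in one dimension, where each covariance matrix reduces to a variance. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the name `em_gmm_1d`, the small variance floor `var_floor` (a common safeguard against collapsing components), the toy data, and the starting values are all hypothetical. The log-likelihood at each iteration is recorded as a sanity check.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, variances, weights, iters=30, var_floor=1e-6):
    """EM for a 1-D Gaussian mixture: learns means, variances, and mixing weights."""
    m, k = len(xs), len(mus)
    history = []   # log-likelihood of the parameters used in each E-step
    for _ in range(iters):
        # E-step: P(Y_j = k | x_j) proportional to p_k * N(x_j; mu_k, var_k).
        resp, loglik = [], 0.0
        for x in xs:
            joint = [weights[c] * normal_pdf(x, mus[c], variances[c]) for c in range(k)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        history.append(loglik)
        # M-step: weighted MLE for each component.
        nk = [sum(r[c] for r in resp) for c in range(k)]
        mus = [sum(r[c] * x for r, x in zip(resp, xs)) / nk[c] for c in range(k)]
        variances = [max(var_floor,
                         sum(r[c] * (x - mus[c]) ** 2 for r, x in zip(resp, xs)) / nk[c])
                     for c in range(k)]
        weights = [nk[c] / m for c in range(k)]
    return mus, variances, weights, history

xs = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
mus, variances, weights, history = em_gmm_1d(xs, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(mus, variances, weights)
```

As long as the variance floor is not hit, the recorded log-likelihood should never decrease from one iteration to the next, which makes `history` a convenient check for implementation bugs.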

37 The general learning problem with missing data: Marginal likelihood: X is observed, Z (e.g. the class labels Y) is missing. Objective: find $\arg\max_\theta\, l(\theta : \text{Data})$. Assuming hidden variables are missing completely at random (otherwise, we should explicitly model why the values are missing).

38 Properties of EM: One can prove that EM converges to a local maximum and that each iteration improves the log-likelihood. How? (Same as k-means.) Likelihood objective instead of k-means objective. M-step can never decrease likelihood.

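39 EM pictorially: [Figure: the likelihood objective $L(\theta)$ and the lower bound $l(\theta \mid \theta_n)$ at iteration n, plotted against $\theta$; marked values are $L(\theta_{n+1})$, $l(\theta_{n+1} \mid \theta_n)$, and $L(\theta_n) = l(\theta_n \mid \theta_n)$, with $\theta_n$ and $\theta_{n+1}$ on the horizontal axis.] (Figure from tutorial by Sean Borman)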

40 What you should know: Mixture of Gaussians. EM for mixture of Gaussians: how to learn maximum likelihood parameters in the case of unlabeled data. Relation to K-means: two-step algorithm, just like K-means; hard / soft clustering; probabilistic model. Remember, EM can get stuck in local optima, and empirically it DOES.
