Probabilistic & Unsupervised Learning
1 Probabilistic & Unsupervised Learning. Week 2: Latent Variable Models. Maneesh Sahani, Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science, University College London. Term 1, Autumn 2009.
2 The story so far... Unsupervised Learning: beliefs about data represented by a parameterised distribution P(D | θ, m) = ∏_{i=1}^n P(x_i | θ, m); learning involves estimating a value (or distribution) for θ. Bayes: P(θ | D, m) = P(D | θ, m) P(θ | m) / P(D | m). MAP: θ* = argmax_θ P(θ | D, m). ML: θ* = argmax_θ P(D | θ, m). And (possibly) model selection based on P(m | D), which requires the marginal likelihood ∫ dθ P(D | θ, m) P(θ | m).
3 Aren't We Done? No! Direct application of Bayes' rule is intractable for all but the simplest models. Even maximum-likelihood (or similar) learning may be prohibitive. Algorithms (and approximations) are dictated by the form of the distribution.
4 What about these data? [Scatter plot of x_i2 against x_i1.] The joint distribution p(x_i1, x_i2) is not Normal. The conditional distributions p(x_i1 | x_i2) and p(x_i2 | x_i1) vary, and are difficult to codify. The data are easier to describe by a latent, tri-valued discrete process: s_i ~ Discrete[1/3, 1/3, 1/3], x_i | s_i ~ N(µ_{s_i}, Σ).
5 Latent Variable Models. Explain correlations in x by assuming some latent variables y. [Hierarchy: objects, illumination, pose → object parts, surfaces → edges → retinal image, i.e. pixels.] Generative model: y ~ P[θ_y], x | y ~ P[θ_x], so that p(x, y; θ_x, θ_y) = p(x | y; θ_x) p(y; θ_y) and p(x; θ_x, θ_y) = ∫ dy p(x | y; θ_x) p(y; θ_y).
6 Latent variables and Gaussians. [3D scatter plot with axes x_1, x_2, x_3.] The data were measured in 3D, but are mostly specified by position on a 2D surface.
7 Principal Components Analysis. Data such as these are often modelled using Principal Components Analysis (PCA). Assume the data D = {x_i} have zero mean (if not, subtract it!). Find the direction of greatest variance, λ_(1): λ_(1) = argmax_{‖v‖=1} ∑_n (x_n^T v)^2. Find the direction orthogonal to λ_(1) with greatest variance, λ_(2). ... Find the direction orthogonal to {λ_(1), λ_(2), ..., λ_(n−1)} with greatest variance, λ_(n). Terminate when the remaining variance drops below a threshold.
8 Eigendecomposition of a Covariance Matrix. The eigendecomposition of a covariance matrix makes finding the PCs easy. Recall that u is an eigenvector, with scalar eigenvalue ω, of a matrix C if Cu = ωu. u can have any norm, but we will define it to be unity (i.e., u^T u = 1). For a covariance matrix C = ⟨xx^T⟩ (which is D×D, symmetric, positive semi-definite): in general there are D eigenvector–eigenvalue pairs (u_(i), ω_(i)), except if two or more eigenvectors share the same eigenvalue (in which case the eigenvectors are degenerate: any linear combination is also an eigenvector). The D eigenvectors are orthogonal (or orthogonalisable, if ω_(i) = ω_(j)). Thus, they form an orthonormal basis: ∑_i u_(i) u_(i)^T = I. Any vector v can be written as v = (∑_i u_(i) u_(i)^T) v = ∑_i (u_(i)^T v) u_(i) = ∑_i v_(i) u_(i). The original matrix C can be written C = ∑_i ω_(i) u_(i) u_(i)^T = U S U^T, where U = [u_(1), u_(2), ..., u_(D)] collects the eigenvectors and S = diag(ω_(1), ω_(2), ..., ω_(D)).
9 PCA and Eigenvectors. The variance in direction u_(i) is ⟨(x^T u_(i))^2⟩ = u_(i)^T ⟨xx^T⟩ u_(i) = u_(i)^T C u_(i) = u_(i)^T ω_(i) u_(i) = ω_(i). The variance in an arbitrary direction v is ⟨(x^T v)^2⟩ = ⟨(x^T ∑_i v_(i) u_(i))^2⟩ = ∑_{ij} v_(i) u_(i)^T C u_(j) v_(j) = ∑_{ij} v_(i) ω_(j) v_(j) u_(i)^T u_(j) = ∑_i v_(i)^2 ω_(i). If v^T v = 1, then ∑_i v_(i)^2 = 1, and so argmax_{‖v‖=1} ⟨(x^T v)^2⟩ = u_(max). The direction of greatest variance is the eigenvector with the largest eigenvalue. In general, the PCs are exactly the eigenvectors of the empirical covariance matrix, ordered by decreasing eigenvalue. The eigenspectrum shows how the variance is distributed across dimensions. [Plots: eigenvalue (variance) against eigenvalue number; fractional variance remaining against eigenvalue number.]
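As a concrete (non-lecture) illustration, here is a minimal numpy sketch of PCA via eigendecomposition of the empirical covariance. The data matrix X, the helper name pca_eig, and the toy example are assumptions for illustration, not part of the notes.

```python
import numpy as np

def pca_eig(X, K):
    """PCA of zero-mean data X (N x D): eigendecompose the empirical covariance
    and return the top-K eigenvectors plus the full eigenspectrum."""
    C = X.T @ X / X.shape[0]                 # empirical covariance, D x D
    evals, evecs = np.linalg.eigh(C)         # symmetric eigendecomposition (ascending order)
    order = np.argsort(evals)[::-1]          # reorder by decreasing eigenvalue
    return evecs[:, order[:K]], evals[order]

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2)) @ rng.standard_normal((2, 3))   # data near a 2D subspace in 3D
X += 0.05 * rng.standard_normal(X.shape)                          # plus a little noise
X -= X.mean(axis=0)                                               # subtract the mean first!
U, spectrum = pca_eig(X, K=2)
print(spectrum / spectrum.sum())             # fractional variance along each eigenvector
```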
10 PCA subspace. The K principal components define the K-dimensional subspace of greatest variance. [3D scatter plot with the principal subspace; axes x_1, x_2, x_3.] Each data point x_n is associated with a projection x̂_n into the principal subspace: x̂_n = ∑_{k=1}^K x_{n(k)} λ_(k). This can be used for lossy compression, denoising, recognition, ...
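A hedged sketch of this projection and reconstruction step, assuming zero-mean data and the same eigendecomposition route as above; the function and variable names are illustrative only.

```python
import numpy as np

def pca_project(X, K):
    """Project zero-mean data X (N x D) onto the K-dimensional principal
    subspace and reconstruct: a lossy-compression sketch."""
    C = X.T @ X / X.shape[0]
    evals, evecs = np.linalg.eigh(C)
    U = evecs[:, np.argsort(evals)[::-1][:K]]   # top-K principal directions (D x K)
    Z = X @ U                                   # coordinates x_{n(k)} in the subspace (N x K)
    X_hat = Z @ U.T                             # reconstruction x_hat_n = sum_k x_{n(k)} lambda_(k)
    return Z, X_hat

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 10))
X -= X.mean(axis=0)
Z, X_hat = pca_project(X, K=5)
print("mean squared reconstruction error:", np.mean((X - X_hat) ** 2))
```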
11 Example of PCA: Eigenfaces from www-white.media.mit.edu/vismod/demos/facerec/basic.html
12 Another view of PCA: Mutual Information. Problem: given x, find y = Ax, with the columns of A unit vectors, such that I(y; x) is maximised (assuming that P(x) is Gaussian). I(y; x) = H(y) + H(x) − H(y, x) = H(y). So we want to maximise the entropy of y. What is the entropy of a Gaussian? H(z) = −∫ dz p(z) ln p(z) = ½ ln |Σ| + (D/2)(1 + ln 2π). Therefore we want the distribution of y to have the largest volume (i.e. determinant of its covariance matrix): Σ_y = A Σ_x A^T = A U S_x U^T A^T. So A should be aligned with the columns of U associated with the largest eigenvalues (variances). Projection onto the principal component subspace preserves the most information about the (Gaussian) data.
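A small numerical check (not from the slides) of the Gaussian entropy formula and of the claim that, among unit-vector projections, the top eigenvector of Σ_x gives the largest entropy; the toy covariance and names are assumptions.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of N(0, cov): 0.5*ln|cov| + (D/2)*(1 + ln 2*pi)."""
    D = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * logdet + 0.5 * D * (1.0 + np.log(2.0 * np.pi))

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
Sigma_x = A @ A.T                                # covariance of the Gaussian data x
evals, evecs = np.linalg.eigh(Sigma_x)

# 1-D projections y = a^T x: entropy is 0.5*ln(a^T Sigma_x a) + const,
# so among unit vectors it is maximised by the top eigenvector of Sigma_x.
top = evecs[:, -1]
rand = rng.standard_normal(3); rand /= np.linalg.norm(rand)
for a in (top, rand):                            # prints the larger value (top) first
    print(gaussian_entropy(np.array([[a @ Sigma_x @ a]])))
```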
13 Network Interpretations and Encoder–Decoder Duality. [Network diagram: input units x_1, x_2, ..., x_D → hidden units y_1, ..., y_K → output units x̂_1, x̂_2, ..., x̂_D; the input-to-hidden mapping is the encoder (recognition) and the hidden-to-output mapping is the decoder (generation).]
14 From Supervised Learning to PCA. [Same encoder–decoder network diagram as the previous slide.] A linear autoencoder neural network trained to minimise squared error learns to perform PCA (Baldi & Hornik, 1989).
15 Probabilistic PCA. We could think of the projection x̂_n as a latent variable, but it is more usual to construct a standard normal latent y. [Graphical model: latents y_1, ..., y_K with arrows to observations x_1, x_2, ..., x_D.] Observed vector data D = {x_1, x_2, ..., x_N}, x_i ∈ R^D; assumed latent variables {y_1, y_2, ..., y_N}, y_i ∈ R^K, with K < D. Linear generative model: x_d = ∑_{k=1}^K Λ_dk y_k + ɛ_d, where the y_k are independent N(0, 1) Gaussian factors and the ɛ_d are independent N(0, ψ) Gaussian noise. So the model for the observations x is still Gaussian, with p(y) = N(0, I), p(x | y) = N(Λy, ψI), and p(x) = ∫ p(y) p(x | y) dy = N(0, ΛΛ^T + ψI), where Λ is a D×K matrix. This is Probabilistic PCA (pPCA). The maximum-likelihood parameter Λ, for fixed ψ, lies in the K-principal subspace.
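A rough sketch, under the model just stated, of sampling from the pPCA generative process and checking that the empirical marginal covariance of x matches ΛΛ^T + ψI; the dimensions and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
D, K, N = 5, 2, 100000
Lam = rng.standard_normal((D, K))          # factor loading matrix Lambda (D x K)
psi = 0.1                                  # isotropic noise variance

# Generate from the pPCA model: y ~ N(0, I_K), x | y ~ N(Lambda y, psi * I_D)
Y = rng.standard_normal((N, K))
X = Y @ Lam.T + np.sqrt(psi) * rng.standard_normal((N, D))

# The marginal of x should be N(0, Lambda Lambda^T + psi I)
empirical_cov = X.T @ X / N
model_cov = Lam @ Lam.T + psi * np.eye(D)
print(np.max(np.abs(empirical_cov - model_cov)))   # small for large N
```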
16 Probabilistic PCA. [Illustrative figures: samples in the latent space (y_1, y_2) and the corresponding points in the data space (x_1, x_2) under the pPCA generative model.]
17 PCA and pPCA. In PCA the noise is orthogonal to the subspace, and we can project x_n → x̂_n trivially. In pPCA the noise is more sensible (equal in all directions). But what is the projection? Find the expected value of y_n | x_n and then take x̂_n = Λ⟨y_n⟩. Tactic: write p(y_n, x_n | θ) and consider x_n to be fixed. What is this as a function of y_n?
p(y_n, x_n) = p(y_n) p(x_n | y_n)
= (2π)^{−K/2} exp{−½ y_n^T y_n} |2πΨ|^{−1/2} exp{−½ (x_n − Λy_n)^T Ψ^{−1} (x_n − Λy_n)}
= c exp{−½ [y_n^T y_n + (x_n − Λy_n)^T Ψ^{−1} (x_n − Λy_n)]}
= c exp{−½ [y_n^T (I + Λ^T Ψ^{−1} Λ) y_n − 2 y_n^T Λ^T Ψ^{−1} x_n]}
= c exp{−½ [y_n^T Σ^{−1} y_n − 2 y_n^T Σ^{−1} µ + µ^T Σ^{−1} µ]}.
So Σ = (I + Λ^T Ψ^{−1} Λ)^{−1} = I − βΛ and µ = ΣΛ^T Ψ^{−1} x_n = βx_n, where β = ΣΛ^T Ψ^{−1}. Thus
x̂_n = Λ(I + Λ^T Ψ^{−1} Λ)^{−1} Λ^T Ψ^{−1} x_n = x_n − Ψ(ΛΛ^T + Ψ)^{−1} x_n.
This is not the same projection: pPCA takes into account noise in the principal subspace. As ψ → 0, pPCA → PCA.
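A sketch (with made-up dimensions) verifying numerically that the two expressions for the pPCA projection x̂_n agree; the names Lam, Psi, and the random test point are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
D, K = 5, 2
Lam = rng.standard_normal((D, K))
Psi = 0.1 * np.eye(D)                      # pPCA: spherical noise psi * I
x = rng.standard_normal(D)

# Posterior over y given x: Sigma = (I + Lambda^T Psi^-1 Lambda)^-1, mu = Sigma Lambda^T Psi^-1 x
Psi_inv = np.linalg.inv(Psi)
Sigma = np.linalg.inv(np.eye(K) + Lam.T @ Psi_inv @ Lam)
mu = Sigma @ Lam.T @ Psi_inv @ x

# Two equivalent forms of the projection x_hat = Lambda E[y | x]
x_hat1 = Lam @ mu
x_hat2 = x - Psi @ np.linalg.solve(Lam @ Lam.T + Psi, x)
print(np.allclose(x_hat1, x_hat2))         # True: same projection
```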
18 Factor Analysis. If the dimensions are not equivalent, equal noise variance makes no sense. [Graphical model: latents y_1, ..., y_K with arrows to observations x_1, x_2, ..., x_D.] Observed vector data D = {x_1, x_2, ..., x_N}, x_i ∈ R^D; assumed latent variables {y_1, y_2, ..., y_N}, y_i ∈ R^K, with K < D. Linear generative model: x_d = ∑_{k=1}^K Λ_dk y_k + ɛ_d, where the y_k are independent N(0, 1) Gaussian factors and the ɛ_d are independent N(0, Ψ_dd) Gaussian noise. So the model for the observations x is still Gaussian, with p(y) = N(0, I), p(x | y) = N(Λy, Ψ), and p(x) = ∫ p(y) p(x | y) dy = N(0, ΛΛ^T + Ψ), where Λ is a D×K matrix and Ψ is D×D and diagonal. Dimensionality reduction: finds a low-dimensional projection of high-dimensional data that captures the correlation structure of the data.
19 Factor Analysis (cont.). ML learning finds Λ and Ψ given data. Number of parameters (corrected for symmetries): DK + D − K(K−1)/2. If the number of parameters exceeds D(D+1)/2 (the number of free entries in the data covariance), the model is not identifiable. There is no closed-form solution for the ML parameters of N(0, ΛΛ^T + Ψ). [A Bayesian treatment would also place priors over Λ and Ψ and would average over them for prediction.]
20 Factor Analysis Projections. Our analysis for pPCA still applies: x̂_n = Λ(I + Λ^T Ψ^{−1} Λ)^{−1} Λ^T Ψ^{−1} x_n = x_n − Ψ(ΛΛ^T + Ψ)^{−1} x_n, but now Ψ is diagonal rather than spherical. Note, though, that the ML Λ is generally different from that found by pPCA.
21 Gradient Methods of Learning FA. Write down the negative log likelihood (for a zero-mean data point x): ½ log |2π(ΛΛ^T + Ψ)| + ½ x^T (ΛΛ^T + Ψ)^{−1} x. Optimise w.r.t. Λ and Ψ (needs matrix calculus), subject to constraints. We will soon see an easier way to learn latent variable models...
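A minimal numpy sketch of this negative log likelihood for factor analysis, assuming zero-mean data; it only evaluates the objective and does not implement the gradient-based optimisation itself.

```python
import numpy as np

def fa_neg_log_lik(Lam, psi_diag, X):
    """Average negative log likelihood of zero-mean data X (N x D) under the
    factor analysis marginal x ~ N(0, Lambda Lambda^T + Psi), Psi = diag(psi_diag)."""
    C = Lam @ Lam.T + np.diag(psi_diag)
    _, logdet = np.linalg.slogdet(2.0 * np.pi * C)
    quad = np.einsum('nd,dn->n', X, np.linalg.solve(C, X.T))   # x^T C^-1 x per data point
    return 0.5 * logdet + 0.5 * quad.mean()

rng = np.random.default_rng(5)
X = rng.standard_normal((500, 4))
X -= X.mean(axis=0)
print(fa_neg_log_lik(rng.standard_normal((4, 2)), np.ones(4), X))
```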
22 FA vs PCA. PCA and pPCA are rotationally invariant; FA is not: if x → Ux for unitary U, then λ^PCA_(i) → Uλ^PCA_(i). FA is measurement-scale invariant; PCA and pPCA are not: if x → Sx for diagonal S, then λ^FA_(i) → Sλ^FA_(i). FA and pPCA define a probabilistic model; PCA does not.
23 Limitations of Gaussian, FA and PCA models. Gaussian, FA and PCA models are easy to understand and use in practice. They are a convenient way to find interesting directions in very high dimensional data sets, e.g. as preprocessing. Their problem is that they make very strong assumptions about the distribution of the data: only the mean and variance of the data are taken into account. The class of densities which can be modelled is too restrictive. [Scatter plot of x_i2 against x_i1, as before.] By using mixtures of simple distributions, such as Gaussians, we can expand the class of densities greatly.
24 Mixture Distributions. [Scatter plot of x_i2 against x_i1, as before.] A mixture distribution has a single discrete latent variable: s_i ~ iid Discrete[π], x_i | s_i ~ P_{s_i}[θ_{s_i}]. Mixtures arise naturally when observations from different sources have been collated. They can also be used to approximate arbitrary distributions.
25 The Mixture Likelihood. The mixture model is s_i ~ iid Discrete[π], x_i | s_i ~ P_{s_i}[θ_{s_i}]. Under the discrete distribution, P(s_i = m) = π_m, with π_m ≥ 0 and ∑_{m=1}^k π_m = 1. Thus, the probability (density) at a single data point x_i is P(x_i) = ∑_{m=1}^k P(x_i | s_i = m) P(s_i = m) = ∑_{m=1}^k π_m P_m(x_i; θ_m). The mixture distribution (density) is a convex combination (or weighted average) of the component distributions (densities).
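For concreteness, a small sketch (not from the notes) evaluating a one-dimensional mixture-of-Gaussians density as a convex combination of component densities; the particular weights, means, and variances are arbitrary.

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """1-D Gaussian density N(x; mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mixture_pdf(x, pis, mus, vars_):
    """Convex combination of components: sum_m pi_m N(x; mu_m, var_m)."""
    return sum(p * gauss_pdf(x, m, v) for p, m, v in zip(pis, mus, vars_))

x = np.linspace(-4, 4, 9)
pis, mus, vars_ = [1/3, 1/3, 1/3], [-2.0, 0.0, 2.0], [0.5, 0.5, 0.5]
print(mixture_pdf(x, pis, mus, vars_))         # density of the 3-component mixture at x
```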
26 Approximation with a Mixture of Gaussians (MoG). The component densities may be viewed as elements of a basis which can be combined to approximate arbitrary distributions. Here are examples where non-Gaussian densities are modelled (approximated) as a mixture of Gaussians. The red curves show the (weighted) Gaussians, and the blue curve the resulting density. [Examples: Uniform, Triangle, Heavy tails.] Given enough mixture components we can model (almost) any density (as accurately as desired), but still only need to work with the well-known Gaussian form.
27 Clustering with a MoG
28 Clustering with a MoG. In clustering applications, the latent variable s_i represents the (unknown) identity of the cluster to which the ith observation belongs. Thus, the latent distribution gives the prior probability of a data point coming from each cluster: P(s_i = m | π) = π_m. Data from the mth cluster are distributed according to the mth component: P(x_i | s_i = m) = P_m(x_i). Once we observe a data point, the posterior probability distribution for the cluster it belongs to is P(s_i = m | x_i) = P_m(x_i) π_m / ∑_{m'} P_{m'}(x_i) π_{m'}. This is often called the responsibility of the mth cluster for the ith data point.
29 The MoG likelihood. Each component of a MoG is a Gaussian, with mean µ_m and covariance matrix Σ_m. Thus, the probability density evaluated at a set of n iid observations, D = {x_1 ... x_n} (i.e. the likelihood), is
p(D | {µ_m}, {Σ_m}, π) = ∏_{i=1}^n ∑_{m=1}^k π_m N(x_i | µ_m, Σ_m) = ∏_{i=1}^n ∑_{m=1}^k π_m |2πΣ_m|^{−1/2} exp[−½ (x_i − µ_m)^T Σ_m^{−1} (x_i − µ_m)].
The log of the likelihood is
log p(D | {µ_m}, {Σ_m}, π) = ∑_{i=1}^n log ∑_{m=1}^k π_m |2πΣ_m|^{−1/2} exp[−½ (x_i − µ_m)^T Σ_m^{−1} (x_i − µ_m)].
Note that the logarithm fails to simplify the component density terms. A mixture distribution does not lie in the exponential family. Direct optimisation is not easy.
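A hedged sketch of evaluating this log likelihood stably via a log-sum-exp over components, using scipy's Gaussian log-density; the toy parameters and data are assumptions, not part of the lecture.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mog_log_lik(X, pis, mus, Sigmas):
    """log p(D) = sum_i log sum_m pi_m N(x_i | mu_m, Sigma_m), computed stably."""
    log_p = np.stack([np.log(pi) + multivariate_normal.logpdf(X, mu, Sig)
                      for pi, mu, Sig in zip(pis, mus, Sigmas)], axis=1)   # (N, k)
    return logsumexp(log_p, axis=1).sum()

rng = np.random.default_rng(6)
X = rng.standard_normal((100, 2))
pis = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
print(mog_log_lik(X, pis, mus, Sigmas))
```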
30 Maximum Likelihood for a Mixture Model. The log likelihood is L = ∑_{i=1}^n log ∑_{m=1}^k π_m P_m(x_i; θ_m). Its partial derivative wrt θ_m is
∂L/∂θ_m = ∑_{i=1}^n [π_m / ∑_{m'} π_{m'} P_{m'}(x_i; θ_{m'})] ∂P_m(x_i; θ_m)/∂θ_m
or, using ∂P/∂θ = P ∂ log P/∂θ,
= ∑_{i=1}^n [π_m P_m(x_i; θ_m) / ∑_{m'} π_{m'} P_{m'}(x_i; θ_{m'})] ∂ log P_m(x_i; θ_m)/∂θ_m = ∑_{i=1}^n r_im ∂ log P_m(x_i; θ_m)/∂θ_m,
where the bracketed term is the responsibility r_im. Its partial derivative wrt π_m is
∂L/∂π_m = ∑_{i=1}^n P_m(x_i; θ_m) / ∑_{m'} π_{m'} P_{m'}(x_i; θ_{m'}) = ∑_{i=1}^n r_im / π_m.
31 MoG Derivatives. For a MoG, with θ_m = {µ_m, Σ_m}, we get
∂L/∂µ_m = ∑_{i=1}^n r_im Σ_m^{−1} (x_i − µ_m)
∂L/∂Σ_m^{−1} = ½ ∑_{i=1}^n r_im (Σ_m − (x_i − µ_m)(x_i − µ_m)^T).
These equations can be used (along with the derivatives wrt π_m) for gradient-based learning; e.g., taking small steps in the direction of the gradient (or using conjugate gradients).
32 The K-means Algorithm. The K-means algorithm is a limiting case of the mixture of Gaussians (c.f. PCA and Factor Analysis). Take π_m = 1/k and Σ_m = σ²I, with σ² → 0. Then the responsibilities become binary, r_im → δ(m, argmin_l ‖x_i − µ_l‖²), with 1 for the component with the closest mean and 0 for all other components. We can then solve directly for the means by setting the gradient to 0. The k-means algorithm iterates these two steps: assign each point to its closest mean (set r_im = δ(m, argmin_l ‖x_i − µ_l‖²)); update the means to the average of their assigned points (set µ_m = ∑_i r_im x_i / ∑_i r_im). This usually converges within a few iterations, although the fixed point depends on the initial values chosen for µ_m. The algorithm has no learning rate, but the assumptions are quite limiting.
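A minimal sketch of the two k-means steps just described; the initialisation from random data points, the fixed iteration count, and the toy data are arbitrary choices, not part of the lecture.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain k-means: hard assignments to the closest mean, then mean updates."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=k, replace=False)]           # initialise at random data points
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)    # squared distances (N, k)
        assign = d2.argmin(axis=1)                                # binary r_im as hard labels
        for m in range(k):
            if np.any(assign == m):                               # keep the old mean if a cluster empties
                mus[m] = X[assign == m].mean(axis=0)
    return mus, assign

rng = np.random.default_rng(7)
X = np.vstack([rng.standard_normal((50, 2)) + c for c in ([0, 0], [5, 5], [0, 5])])
mus, assign = kmeans(X, k=3)
print(mus)
```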
33 A preview of the EM algorithm. We wrote the k-means algorithm in terms of binary responsibilities. Suppose, instead, we used the fractional responsibilities from the full (non-limiting) MoG, but still neglected the dependence of the responsibilities on the parameters. We could then solve for both µ_m and Σ_m. The EM algorithm for MoGs iterates these two steps: evaluate the responsibilities for each point given the current parameters; optimise the parameters assuming the responsibilities stay fixed, µ_m = ∑_i r_im x_i / ∑_i r_im and Σ_m = ∑_i r_im (x_i − µ_m)(x_i − µ_m)^T / ∑_i r_im. Although this appears ad hoc, we will see (later) that it is a special case of a general algorithm, and is actually guaranteed to increase the likelihood at each iteration.
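A compact sketch of these two EM steps for a MoG (with a small ridge added to the covariances, as a practical guard against the singularities mentioned on the next slide); all names, constants, and toy data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog(X, k, n_iter=50, seed=0):
    """EM for a mixture of Gaussians: E-step (responsibilities), M-step (pi, mu, Sigma)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pis = np.full(k, 1.0 / k)
    mus = X[rng.choice(N, size=k, replace=False)]
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(D)] * k)
    for _ in range(n_iter):
        # E-step: responsibilities r_im proportional to pi_m N(x_i | mu_m, Sigma_m)
        R = np.stack([pi * multivariate_normal.pdf(X, mu, Sig)
                      for pi, mu, Sig in zip(pis, mus, Sigmas)], axis=1)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: update parameters with the responsibilities held fixed
        Nm = R.sum(axis=0)
        pis = Nm / N
        mus = (R.T @ X) / Nm[:, None]
        for m in range(k):
            Xc = X - mus[m]
            Sigmas[m] = (R[:, m, None] * Xc).T @ Xc / Nm[m] + 1e-6 * np.eye(D)
    return pis, mus, Sigmas

rng = np.random.default_rng(8)
X = np.vstack([rng.standard_normal((100, 2)) + c for c in ([0, 0], [4, 4])])
pis, mus, Sigmas = em_mog(X, k=2)
print(pis, mus, sep="\n")
```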
34 Issues. There are several problems with the new algorithms: slow convergence for the gradient-based method; the gradient-based method may develop invalid covariance matrices; local minima, so the end configuration may depend on the starting state; how do you adjust k? (using the likelihood alone is no good); and singularities: components responsible for a single data point will have their covariance going to zero and the likelihood will tend to infinity.