Homework 2 Solutions: EM, Mixture Models, PCA, Duality


CMU 10-715: Machine Learning (Fall 2015)
http://www.cs.cmu.edu/~bapoczos/classes/ml10715_2015fall/
OUT: Oct 5, 2015    DUE: Oct 19, 2015, 10:20 AM

1 An EM algorithm for a Mixture of Bernoullis [Eric; 25 pts]

In this section, you will derive an expectation-maximization (EM) algorithm to cluster black and white images. The inputs x_i can be thought of as vectors of binary values corresponding to black and white pixel values, and the goal is to cluster the images into groups. You will be using a mixture of Bernoullis model to tackle this problem. For the sake of brevity, you do not need to substitute previously derived expressions into later problems. For example, beyond question 1.1.1 you may use P(x_i | p_k) in your answers.

1.1 Mixture of Bernoullis

1. (2 pts) Consider a vector of binary random variables, x \in \{0, 1\}^D. Assume each variable x_d is drawn from a Bernoulli(p_d) distribution, so P(x_d = 1) = p_d. Let p \in [0, 1]^D be the resulting vector of Bernoulli parameters. Write an expression for P(x | p).

   P(x | p) = \prod_{d=1}^{D} p_d^{x_d} (1 - p_d)^{1 - x_d}

2. (2 pts) Now suppose we have a mixture of K Bernoulli distributions: each vector x_i is drawn from some vector of Bernoulli random variables with parameters p_k; we will call this Bernoulli(p_k). Let p = \{p_1, ..., p_K\}. Assume a distribution \pi over the selection of which set of Bernoulli parameters p_k is chosen. Write an expression for P(x_i | p, \pi).

   Let A_k be the event that x_i was drawn from p_k. Then:

   P(x_i | p, \pi) = \sum_k P(x_i, A_k | p, \pi) = \sum_k P(x_i | A_k, p, \pi) P(A_k | p, \pi) = \sum_{k=1}^{K} \pi_k P(x_i | p_k)

3. (2 pts) Finally, suppose we have inputs X = \{x_i\}_{i=1,...,N}. Using the above, write an expression for the log likelihood of the data X, \log P(X | \pi, p).

   \log L(p, \pi) = \sum_{i=1}^{N} \log P(x_i | p, \pi) = \sum_{i=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k P(x_i | p_k) \right)
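
As a quick numerical companion to the log likelihood above (not part of the assignment), the following Octave sketch evaluates \log P(X | \pi, p) for a mixture of Bernoullis. It assumes the same conventions as Section 2 below: Xs is an N x D binary matrix, p is a K x D matrix of Bernoulli parameters, and mix_p is a K x 1 vector of mixture weights; the function name mob_loglik is ours, not part of the handout.

    function ll = mob_loglik(Xs, p, mix_p)
      % log P(x_i | p_k) for every pair (i, k): an N x K matrix whose entries are
      % sum_d [ x_id*log(p_kd) + (1 - x_id)*log(1 - p_kd) ].
      log_px = Xs * log(p)' + (1 - Xs) * log(1 - p)';
      % log [ sum_k pi_k P(x_i | p_k) ] via log-sum-exp for numerical stability.
      a = log_px + log(mix_p(:)');            % add log pi_k to each column
      m = max(a, [], 2);
      ll = sum(m + log(sum(exp(a - m), 2)));  % total log likelihood of the data
    end

The log-sum-exp step is the same trick recommended later in the programming hints; summing the raw products would underflow for realistic D.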

1.2 Expectation step

1. (4 pts) Now, we introduce the latent variables for the EM algorithm. Let z_i \in \{0, 1\}^K be an indicator vector, such that z_{ik} = 1 if x_i was drawn from Bernoulli(p_k), and 0 otherwise. Let Z = \{z_i\}_{i=1,...,N}. What is P(z_i | \pi)? What is P(x_i | z_i, p, \pi)?

   Let A_{ik} be the event that x_i was drawn from p_k. Then:

   P(z_i | \pi) = \prod_{k=1}^{K} \pi_k^{z_{ik}}

   P(x_i | z_i, p, \pi) = \prod_{k=1}^{K} P(x_i | p_k, A_{ik})^{z_{ik}} = \prod_{k=1}^{K} P(x_i | p_k)^{z_{ik}}

2. (2 pts) Using the above two quantities, derive the likelihood of the data and the latent variables, P(Z, X | \pi, p).

   P(Z, X | \pi, p) = \prod_{i=1}^{N} P(x_i, z_i | \pi, p) = \prod_{i=1}^{N} P(x_i | z_i, \pi, p) P(z_i | \pi) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left( \pi_k P(x_i | p_k) \right)^{z_{ik}}

3. (5 pts) Let \eta(z_{ik}) = E[z_{ik} | x_i, \pi, p]. Show that

   \eta(z_{ik}) = \frac{ \pi_k \prod_{d=1}^{D} p_{kd}^{x_{id}} (1 - p_{kd})^{1 - x_{id}} }{ \sum_{j=1}^{K} \pi_j \prod_{d=1}^{D} p_{jd}^{x_{id}} (1 - p_{jd})^{1 - x_{id}} }

   Let p', \pi' be the new parameters that we would like to maximize, so p, \pi are from the previous iteration. Use this to derive the following final expression for the E step in the expectation-maximization algorithm:

   E[ \log P(Z, X | p', \pi') | X, p, \pi ] = \sum_{i=1}^{N} \sum_{k=1}^{K} \eta(z_{ik}) \left[ \log \pi'_k + \sum_{d=1}^{D} \left( x_{id} \log p'_{kd} + (1 - x_{id}) \log (1 - p'_{kd}) \right) \right]

   Solution. Since z_{ik} is a 0/1 indicator,

   \eta(z_{ik}) = E[z_{ik} | x_i, \pi, p] = P(z_{ik} = 1 | x_i, \pi, p) = \frac{ P(x_i | z_{ik} = 1, \pi, p) P(z_{ik} = 1 | \pi, p) }{ \sum_{j=1}^{K} P(x_i | z_{ij} = 1, \pi, p) P(z_{ij} = 1 | \pi, p) } = \frac{ \pi_k \prod_{d=1}^{D} p_{kd}^{x_{id}} (1 - p_{kd})^{1 - x_{id}} }{ \sum_{j=1}^{K} \pi_j \prod_{d=1}^{D} p_{jd}^{x_{id}} (1 - p_{jd})^{1 - x_{id}} }

   Next we compute the complete-data log likelihood:

   \log P(Z, X | \pi', p') = \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \left[ \log \pi'_k + \log P(x_i | p'_k) \right] = \sum_{i=1}^{N} \sum_{k=1}^{K} z_{ik} \left[ \log \pi'_k + \sum_{d=1}^{D} \left( x_{id} \log p'_{kd} + (1 - x_{id}) \log (1 - p'_{kd}) \right) \right]

   Taking the expected value and replacing E[z_{ik} | x_i, \pi, p] with \eta(z_{ik}) finishes the solution.
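
The responsibilities \eta(z_{ik}) are exactly what the E step of the programming part computes. Below is a minimal Octave sketch of that step, normalizing \log \pi_k + \log P(x_i | p_k) across k in log space (the log-sum-exp trick recommended in the hints of Section 2). The interface matches the eta = Estep(Xs, p, mix_p) signature given there, but the implementation is only an illustration, not the official starter code.

    function eta = Estep(Xs, p, mix_p)
      % Unnormalized log responsibilities: log pi_k + log P(x_i | p_k), an N x K matrix.
      log_r = Xs * log(p)' + (1 - Xs) * log(1 - p)' + log(mix_p(:)');
      % Normalize each row in log space so that sum_k eta(i,k) = 1.
      m   = max(log_r, [], 2);
      r   = exp(log_r - m);
      eta = r ./ sum(r, 2);
    end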

1.3 Maximization step

1. (4 pts) We need to maximize the above expression with respect to \pi', p'. First, show that the value of p' that maximizes the E step is

   p'_{kd} = \frac{ \sum_{i=1}^{N} \eta(z_{ik}) x_{id} }{ N_k },   where N_k = \sum_{i=1}^{N} \eta(z_{ik}).

   Solution. Setting the derivative to 0:

   \frac{\partial}{\partial p'_{kd}} E[ \log P(Z, X | \pi', p') ] = \sum_{i=1}^{N} \eta(z_{ik}) \left( \frac{x_{id}}{p'_{kd}} - \frac{1 - x_{id}}{1 - p'_{kd}} \right) = 0

   Multiply by the denominators:

   \sum_{i=1}^{N} \eta(z_{ik}) \left( x_{id} (1 - p'_{kd}) - (1 - x_{id}) p'_{kd} \right) = \sum_{i=1}^{N} \eta(z_{ik}) ( x_{id} - p'_{kd} ) = 0

   Solving for p'_{kd} results in p'_{kd} = \sum_i \eta(z_{ik}) x_{id} / N_k.

2. (4 pts) Show that the value of \pi' that maximizes the E step is

   \pi'_k = \frac{N_k}{N}.

   The exponential families notation may be useful. Alternatively, you can use Lagrange multipliers.

   Solution. We only need to maximize \sum_{i=1}^{N} \sum_{k=1}^{K} \eta(z_{ik}) \log \pi'_k, since the rest is not a function of \pi'. In order to keep \pi' a distribution, we require \sum_k \pi'_k = 1. Let \lambda be the dual variable for this constraint:

   L(\pi', \lambda) = \sum_{i=1}^{N} \sum_{k=1}^{K} \eta(z_{ik}) \log \pi'_k + \lambda \left( 1 - \sum_{k=1}^{K} \pi'_k \right)

   Taking the derivative w.r.t. \pi'_k:

   \frac{\partial L(\pi', \lambda)}{\partial \pi'_k} = \frac{ \sum_i \eta(z_{ik}) }{ \pi'_k } - \lambda = 0

   Solve for \pi'_k and we get \pi'_k = N_k / \lambda. Now we solve for \lambda by substituting this back in:

   L(\lambda) = \sum_{k=1}^{K} N_k \left( \log N_k - \log \lambda \right) + \lambda - \sum_{k=1}^{K} N_k

   Take the derivative w.r.t. \lambda:

   - \frac{ \sum_{k=1}^{K} N_k }{ \lambda } + 1 = 0

   Solve for \lambda: \lambda = \sum_{k=1}^{K} N_k = \sum_k \sum_i \eta(z_{ik}) = N, since the responsibilities of each data point sum to one. Therefore \pi'_k = N_k / N.
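
The two closed-form updates translate directly into a couple of matrix operations. The Octave sketch below implements the unsmoothed M step; its argument list (Xs, eta) is a simplification of the handout's [p, mix_p] = Mstep(Xs, model, alpha1, alpha2) interface, and the Dirichlet-smoothed variant required by the autograder appears in Section 2.

    function [p, mix_p] = Mstep(Xs, eta)
      % N_k = sum_i eta(i,k), a 1 x K row vector of soft cluster sizes.
      Nk = sum(eta, 1);
      % p(k,d) = sum_i eta(i,k) * x_id / N_k      (K x D)
      p = (eta' * Xs) ./ Nk(:);
      % pi_k = N_k / N                            (K x 1)
      mix_p = Nk(:) / size(Xs, 1);
    end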

2 Clustering Images of Numbers [Eric; 25 pts]

In this section you will use the above algorithm to cluster images of numbers. You will be using the MNIST dataset. Each input is a binary vector corresponding to black and white pixels, and is a flattened version of the 28x28 pixel image. We will use the following conventions:

- N is the number of datapoints, D is the dimension of each input, and K is the number of clusters.
- Xs is an N x D matrix of the input data, where row i is the pixel data for picture i.
- p is a K x D matrix of Bernoulli parameters, where row k is the vector of parameters for the k-th mixture of Bernoullis.
- mix_p is a K x 1 vector containing the distribution over the various mixtures.
- eta is an N x K matrix containing the results of the E step, so eta(i, k) = \eta(z_{ik}).
- clusters is an N x 1 vector containing the final cluster labels of the input data. Each label is a number from 1 to K.

2.1 Programming

1. Implement the E step of the algorithm within eta = Estep(Xs, p, mix_p), saving your calculated values within eta.
2. Implement the M step of the algorithm within [p, mix_p] = Mstep(Xs, model, alpha1, alpha2). p and mix_p returned by this function should contain the new values that maximize the E step. alpha1, alpha2 are Dirichlet smoothing parameters (explained below).
3. Implement clusters = MoBlabels(Xs, p, mix_p). This function will take in the estimated parameters and return the resulting labels that cluster the data.
4. Some hints and tips:
- The autograder will separately grade the accuracy of your implementation for the E-step (4 pts), the M-step (4 pts), and the M-step with smoothing (2 pts). Finally, it will run your implementation for several iterations on a subset of the MNIST dataset (8 pts).
- A small subset of the MNIST dataset is provided for your convenience.
- Use the log operator to make your calculations more numerically stable. In particular, pay attention to the calculation of \eta(z_{ik}).
- You will need to avoid zeros in \pi and p, or else you will take log(0). Use Dirichlet prior smoothing with the parameters \alpha_1, \alpha_2 when updating these variables:

   p_{kd} = \frac{ \sum_i \eta(z_{ik}) x_{id} + \alpha_1 }{ N_k + \alpha_1 D },   \pi_k = \frac{ N_k + \alpha_2 }{ N + \alpha_2 K }

- Initialize your parameters p_k by randomly sampling from a Uniform(0, 1) distribution and normalizing each p_k to have unit length, and set \pi_k = 1/K.
- Running your implementation on the MNIST dataset with K = 10 clusters and \alpha_1 = \alpha_2 = 10^{-8} for 20 iterations should give you results similar to the following (our reference code runs in approximately 20 seconds). [Figure: reference cluster means omitted.]
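
Putting the pieces together, one full run of the algorithm looks roughly like the sketch below, which follows the initialization and smoothing instructions above: the smoothed M step is inlined, the Estep sketch from Section 1.2 is reused, and the final line plays the role of MoBlabels by assigning each image to its most responsible mixture. The variable Xs is assumed to be loaded already, and the iteration count is illustrative.

    K = 10; alpha1 = 1e-8; alpha2 = 1e-8; iters = 20;
    [N, D] = size(Xs);                     % Xs: N x D binary data, loaded beforehand

    % Initialization: p_k ~ Uniform(0,1), normalized to unit length; pi_k = 1/K.
    p = rand(K, D);
    p = p ./ sqrt(sum(p.^2, 2));
    mix_p = ones(K, 1) / K;

    for t = 1:iters
      eta = Estep(Xs, p, mix_p);                           % responsibilities eta(i,k)
      Nk  = sum(eta, 1)';                                  % soft cluster sizes, K x 1
      p     = (eta' * Xs + alpha1) ./ (Nk + alpha1 * D);   % smoothed Bernoulli parameters
      mix_p = (Nk + alpha2) / (N + alpha2 * K);            % smoothed mixture weights
    end

    [best, clusters] = max(eta, [], 2);    % hard labels: index of the most responsible mixture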

2.2 Analysis

1. (5 pts) For each cluster k, reshape the pixels p_k into a 28x28 matrix and print the resulting grayscale images. What do you see? Explain, and include one such image in your writeup. See software/octave/doc/interpreter/representing-images.html for help on printing the matrix, or use the provided helper function show_clusters(p, a, b), which will print the mixtures in p in an a x b grid.

2. (2 pts) Using your implemented MoBlabels function, cluster the data. Using the true labels given in ytrain, how many unique digits does each cluster typically have? Are there any clusters that picked out exactly one digit?

3 Kernel PCA [Fish; 25 pts]

Principal component analysis (PCA) is used to emphasize variation and is often used to make data easy to explore and visualize. Suppose we have data x_1, x_2, ..., x_N \in R^d with zero mean. PCA aims at finding a new coordinate system such that the greatest variance under some projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on. The basis vectors of the new coordinate system are the eigenvectors of the covariance matrix, i.e.,

   \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T = \sum_{i=1}^{d} \lambda_i v_i v_i^T,

where v_1, v_2, ..., v_d are orthogonal (<v_i, v_j> = 0 for i \neq j). Then we can plot our data points in the new coordinate system. The position of each data point in the new coordinate system can be derived by projecting x onto the basis vectors of the new coordinate system, i.e., v_1, v_2, ..., v_d.

3.1 Kernel PCA

Often we will want to make a linear or non-linear transformation of the data so that the data are projected to a higher dimensional space. Suppose there is a transformation function \phi: R^d -> R^l. We map all the data to the new space through this function, so we have \phi(x_1), \phi(x_2), ..., \phi(x_N). We again want to do PCA on the transformed data in the new space. How can we do this?
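
For reference, the plain PCA just described takes only a few lines in Octave; the kernel PCA questions below replace this explicit covariance eigendecomposition with operations on a kernel matrix. The snippet is purely illustrative and uses synthetic data.

    % X: N x d data matrix, one (zero-mean) data point per row.
    X = randn(200, 5) * diag([3 2 1 0.5 0.1]);   % synthetic example
    X = X - mean(X, 1);                          % enforce zero mean

    C = (X' * X) / size(X, 1);                   % d x d covariance matrix (1/N) sum_i x_i x_i'
    [V, L] = eig(C);                             % columns of V are eigenvectors v_i
    [lam, order] = sort(diag(L), 'descend');     % sort by decreasing variance
    V = V(:, order);

    scores = X * V;                              % coordinates of each point in the new basis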

1. Write out the covariance matrix, C, for \phi(x_1), \phi(x_2), ..., \phi(x_N). Define \bar{\phi}(x) = \frac{1}{N} \sum_{j=1}^{N} \phi(x_j).

   C = \frac{1}{N} \sum_{i=1}^{N} \left( \phi(x_i) - \bar{\phi}(x) \right) \left( \phi(x_i) - \bar{\phi}(x) \right)^T

2. Now we want to find the basis vectors of the orthogonalized new space by solving for the eigenvectors of C. Using the definition of an eigenvector, \lambda v = C v, explain why v is a linear combination of \phi(x_1) - \bar{\phi}(x), \phi(x_2) - \bar{\phi}(x), ..., \phi(x_N) - \bar{\phi}(x).

   Solution. From the eigenvector equation,

   \lambda v = C v = \frac{1}{N} \sum_{i=1}^{N} \left( \phi(x_i) - \bar{\phi}(x) \right) \left( \phi(x_i) - \bar{\phi}(x) \right)^T v = \frac{1}{N} \sum_{i=1}^{N} < \phi(x_i) - \bar{\phi}(x), v > \left( \phi(x_i) - \bar{\phi}(x) \right),

   so (for \lambda \neq 0) v is a linear combination of these vectors,

   v = \sum_{i=1}^{N} \alpha_i \left( \phi(x_i) - \bar{\phi}(x) \right),   with \alpha_i = \frac{1}{\lambda N} < \phi(x_i) - \bar{\phi}(x), v >.

3. Before starting to derive \alpha and find v, we introduce the kernel function. A kernel function has the form k(x_i, x_j) = < \phi(x_i), \phi(x_j) >. Notice that we do not need to know the function \phi to calculate k(x_i, x_j); we can simply define a function of x_i and x_j with k(x_i, x_j) = k(x_j, x_i). A classic example is the radial basis function (RBF) kernel, k(x_i, x_j) = exp( - ||x_i - x_j||^2 / (2 \sigma^2) ). In order to use these kernel functions, we need to get rid of all the \phi(x_i) and replace them with the kernel function. A kernel matrix is defined as K \in R^{N x N} with K_{ij} = k(x_i, x_j). Since we are dealing with non-centered data, we first use a modified kernel matrix \tilde{K} with \tilde{K}_{ij} = < \phi(x_i) - \bar{\phi}(x), \phi(x_j) - \bar{\phi}(x) >. Using the results of questions 3.1.1 and 3.1.2 and the definition of an eigenvector, show that

   N \lambda \tilde{K} \alpha = \tilde{K}^2 \alpha.

   Solution. Define the matrix \Phi = ( \phi(x_1) - \bar{\phi}(x), \phi(x_2) - \bar{\phi}(x), ..., \phi(x_N) - \bar{\phi}(x) ). So we have v = \Phi \alpha, C = \frac{1}{N} \Phi \Phi^T and \tilde{K} = \Phi^T \Phi. Rewriting \lambda v = C v with these notations, we have

   \lambda \Phi \alpha = \frac{1}{N} \Phi \Phi^T \Phi \alpha.

   Multiplying both sides by \Phi^T, we have

   \lambda \Phi^T \Phi \alpha = \frac{1}{N} \Phi^T \Phi \Phi^T \Phi \alpha,   i.e.,   N \lambda \tilde{K} \alpha = \tilde{K}^2 \alpha.

4. Show that the solutions \alpha of

   N \lambda \alpha = \tilde{K} \alpha

   are also solutions of the equation N \lambda \tilde{K} \alpha = \tilde{K}^2 \alpha.

   Solution. If \alpha satisfies N \lambda \alpha = \tilde{K} \alpha, then multiplying both sides by \tilde{K} gives N \lambda \tilde{K} \alpha = \tilde{K}^2 \alpha, so \alpha also satisfies the previous equation. Thus, by solving for the eigenvectors of \tilde{K}, we get \alpha.

5. Show that

   \tilde{K} = \left( I - \frac{1}{N} e e^T \right) K \left( I - \frac{1}{N} e e^T \right),

   where e = (1, 1, ..., 1)^T \in R^N.

   Solution. Define \Theta = ( \phi(x_1), \phi(x_2), ..., \phi(x_N) ), so that \Phi = \Theta ( I - \frac{1}{N} e e^T ). Rewriting \tilde{K}:

   \tilde{K} = \left( \Theta - \frac{1}{N} \Theta e e^T \right)^T \left( \Theta - \frac{1}{N} \Theta e e^T \right)
             = \Theta^T \Theta - \frac{1}{N} \Theta^T \Theta e e^T - \frac{1}{N} e e^T \Theta^T \Theta + \frac{1}{N^2} e e^T \Theta^T \Theta e e^T
             = K - \frac{1}{N} K e e^T - \frac{1}{N} e e^T K + \frac{1}{N^2} e e^T K e e^T
             = \left( I - \frac{1}{N} e e^T \right) K \left( I - \frac{1}{N} e e^T \right).

6. After obtaining \alpha, we get the un-normalized basis vector v. To normalize it, what is the factor you need to multiply v by?

   ||v||^2 = v^T v = \alpha^T \Phi^T \Phi \alpha = \alpha^T \tilde{K} \alpha,

   so we need to multiply v by 1 / \sqrt{ \alpha^T \tilde{K} \alpha }. You should not simply write "divide by ||v||", because you do not know what \phi is; again, you need to obtain everything by going through the kernel function.

7. For a data point x, what is its position in the normalized new space? Explain why you can get the new coordinates without explicitly calculating \phi(x).

   The projection onto the basis vector \Phi \alpha is

   \frac{1}{ \sqrt{ \alpha^T \tilde{K} \alpha } } \left( \phi(x) - \bar{\phi}(x) \right)^T \Phi \alpha = \frac{1}{ \sqrt{ \alpha^T \tilde{K} \alpha } } \sum_{i=1}^{N} \alpha_i \tilde{k}(x, x_i),

   where \tilde{k}(x, x_i) = < \phi(x) - \bar{\phi}(x), \phi(x_i) - \bar{\phi}(x) >, so only the kernel function is needed: expanding \tilde{k}(x, x_i) gives k(x, x_i) - \frac{1}{N} \sum_j k(x, x_j) - \frac{1}{N} \sum_j k(x_i, x_j) + \frac{1}{N^2} \sum_{j,l} k(x_j, x_l), which can be computed entirely from k.
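
The whole derivation condenses into a short computational recipe: build K, center it, take its leading eigenvectors \alpha, rescale each by 1 / \sqrt{ \alpha^T \tilde{K} \alpha }, and project with the kernel. The Octave sketch below assumes an RBF kernel and is meant only to illustrate the formulas above, not as required code.

    function [Z, A, Kc] = kernel_pca(X, sigma, m)
      % X: N x d data, sigma: RBF bandwidth, m: number of components to keep.
      N  = size(X, 1);
      sq = sum(X.^2, 2);
      K  = exp(-(sq + sq' - 2 * (X * X')) / (2 * sigma^2));   % K_ij = k(x_i, x_j)

      J  = eye(N) - ones(N) / N;        % I - (1/N) e e'
      Kc = J * K * J;                   % centered kernel matrix K~

      [A, L] = eig((Kc + Kc') / 2);     % symmetrize to guard against round-off
      [lam, order] = sort(diag(L), 'descend');   % eigenvalues of K~ equal N*lambda
      A = A(:, order(1:m));             % eigenvectors alpha solving N*lambda*alpha = K~ alpha

      % Normalize each alpha so the implicit basis vector v = Phi*alpha has unit norm:
      % v'v = alpha' K~ alpha, so divide alpha by sqrt(alpha' K~ alpha).
      for j = 1:m
        A(:, j) = A(:, j) / sqrt(A(:, j)' * Kc * A(:, j));
      end

      Z = Kc * A;                       % projections of the training points onto the new basis
    end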

4 Huber function and its application in solving the dual problem [Fish; 35 pts]

4.1 Huber function

Define a function

   B(x, d) = \min_{\lambda \geq 0} \; \lambda + \frac{x^2}{4 (\lambda + d)},

where d > 0.

1. (3 pts) Show that

   B(x, d) = \frac{x^2}{4d}  if |x| \leq 2d,   and   B(x, d) = |x| - d  if |x| > 2d,

   with the minimizer \lambda^* = \max(0, |x|/2 - d). B is often called a Huber function.

   Solution. Differentiating the objective with respect to \lambda, we have 1 - x^2 / (4 (\lambda + d)^2) = 0. Solving, we get \lambda = |x|/2 - d. To match the constraint that \lambda \geq 0, we need |x| \geq 2d. Besides, with \lambda = |x|/2 - d, we have

   B(x, d) = \frac{|x|}{2} - d + \frac{x^2}{4 (|x|/2)} = \frac{|x|}{2} - d + \frac{|x|}{2} = |x| - d.

   On the other side, if |x| < 2d, notice that g(\lambda) = \lambda + x^2 / (4 (\lambda + d)) is an increasing function, since g'(\lambda) = 1 - x^2 / (4 (\lambda + d)^2) \geq 1 - x^2 / (4 d^2) > 0, which means the minimum value is obtained at \lambda = 0. So

   \min_{\lambda \geq 0} \; \lambda + \frac{x^2}{4 (\lambda + d)} = \frac{x^2}{4d}   at \lambda = 0.

2. (3 pts) Is the function B convex in x? (Assume d is a fixed constant.)

   In the range |x| < 2d, \partial B / \partial x = x / (2d) and \partial^2 B / \partial x^2 = 1 / (2d) > 0, so the function is convex in this range. Next we examine the part where |x| > 2d: \partial B / \partial x = sign(x) and \partial^2 B / \partial x^2 = 0, so it is convex there. Then we examine the points where |x| = 2d: the derivatives approaching from the directions |x| < 2d and |x| > 2d are both sign(x), so the derivative exists, and the second derivative is again 0. So the function is also convex at these points. We thus conclude that the function is convex by showing that the second derivative is greater than or equal to zero at all points.

3. (3 pts) Derive the gradient of the function B at any given point (x, d). Remember d > 0.

   \nabla B(x, d) = \left( \frac{\partial B}{\partial x}, \frac{\partial B}{\partial d} \right) =
   \left( \frac{x}{2d}, -\frac{x^2}{4 d^2} \right)  if |x| \leq 2d,   and   \left( sign(x), -1 \right)  if |x| > 2d.

   (The two cases agree at |x| = 2d, so the gradient is continuous.)

4. (3 pts) The function B is often used as a penalty term in classification or regression problems, which have the form

   \min_x \; L(x) + B(x, d),

   where L is the loss function and d is a parameter with positive value. Describe how the parameter d affects the penalty term B on the solution.

   When d is large, B acts like an \ell_2 penalty; when d is small, it acts like an \ell_1 penalty.
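
Because both the closed form and the gradient of B are piecewise, it is worth checking them numerically against the variational definition. The Octave sketch below does exactly that; the function name huber_B is ours, and the constants match the definition B(x, d) = min_{\lambda \geq 0} \lambda + x^2 / (4(\lambda + d)) used in this section.

    function [B, gx, gd] = huber_B(x, d)
      % Closed form of B(x,d) = min_{lambda>=0} lambda + x^2/(4*(lambda+d)),  d > 0,
      % together with its gradient (dB/dx, dB/dd).
      if abs(x) <= 2*d
        B  = x^2 / (4*d);
        gx = x / (2*d);
        gd = -x^2 / (4*d^2);
      else
        B  = abs(x) - d;
        gx = sign(x);
        gd = -1;
      end
    end

    % Quick numerical check against the variational definition:
    % x = 3; d = 0.7; lam = linspace(0, 50, 1e6);
    % [huber_B(x, d), min(lam + x^2 ./ (4*(lam + d)))]   % the two values should agree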

4.2 An application of using the Huber loss to solve a dual problem

Consider the optimization problem

   \min_x \; \sum_{i=1}^{n} d_i x_i^2 + r_i x_i   s.t.  a^T x = 1,  |x_i| \leq 1  for i = 1, ..., n,

where each d_i > 0.

1. (3 pts) Show that the problem has strictly feasible solutions if and only if ||a||_1 > 1.

   By Hölder's inequality, we have a^T x \leq ||a||_1 ||x||_\infty, and ||x||_\infty < 1 implies a^T x < ||a||_1; so a strictly feasible point (a^T x = 1 with every |x_i| < 1) can exist only if ||a||_1 > 1. For the other direction, if ||a||_1 > 1, you can set x_i = sign(a_i) / ||a||_1: then a^T x = 1 and |x_i| = 1 / ||a||_1 < 1, so this is a strictly feasible solution to the problem.

2. (3 pts) Re-express |x_i| \leq 1 as x_i^2 \leq 1, for i = 1, ..., n. Write down the dual of this problem.

   The Lagrange function for this problem is

   L(x, u, v) = \sum_i \left( d_i x_i^2 + r_i x_i \right) + u \left( a^T x - 1 \right) + \sum_i v_i \left( x_i^2 - 1 \right).

   The dual problem is \sup_{u, v \geq 0} \inf_x L(x, u, v) (u is unconstrained because a^T x = 1 is an equality constraint). Now we solve for x so that we have a function that only contains u, v:

   \frac{\partial L(x, u, v)}{\partial x_i} = 2 d_i x_i + r_i + u a_i + 2 v_i x_i = 0,   so   x_i = - \frac{ r_i + u a_i }{ 2 ( d_i + v_i ) }.

   Plug this into the Lagrange function and we have

   \inf_x L(x, u, v) = - \sum_i \frac{ ( r_i + u a_i )^2 }{ 4 ( d_i + v_i ) } - \sum_i v_i - u.

   So the dual problem is

   \max_{u, v} \; - \sum_i \frac{ ( r_i + u a_i )^2 }{ 4 ( d_i + v_i ) } - \sum_i v_i - u   s.t.  v_i \geq 0  for i = 1, ..., n.

   Or you can write it as

   \min_{u, v} \; u + \sum_i \left( \frac{ ( r_i + u a_i )^2 }{ 4 ( d_i + v_i ) } + v_i \right)   s.t.  v_i \geq 0  for i = 1, ..., n.

3. Write down the KKT conditions for this problem. Do they characterize the optimal solution?

   Yes: since the problem is convex and (when ||a||_1 > 1) strictly feasible, the KKT conditions characterize the optimal solution. They are:
   - Stationarity: \nabla_x L(x, u, v) = 0, i.e., 2 ( d_i + v_i ) x_i + r_i + u a_i = 0 for all i.
   - Primal feasibility: x_i^2 \leq 1 for all i, and a^T x = 1.
   - Dual feasibility: v_i \geq 0 for all i.
   - Complementary slackness: v_i ( x_i^2 - 1 ) = 0 for all i.

4. (5 pts) Show that we can further reduce the dual problem to the one-dimensional convex problem

   \min_\mu \; \mu + \sum_i B( r_i + \mu a_i, d_i ),

   where B is the Huber function defined in the previous section.

   Solution. Look at the minimization form of the dual above: for a fixed u = \mu, each v_i appears only in the term v_i + ( r_i + \mu a_i )^2 / ( 4 ( d_i + v_i ) ), and minimizing this term over v_i \geq 0 is exactly the definition of B, where v_i plays the role of \lambda, r_i + \mu a_i plays the role of x, and d_i plays the role of d. Minimizing over v first therefore leaves the one-dimensional problem above, and it is convex in \mu because each B( r_i + \mu a_i, d_i ) is a convex function composed with an affine map.

5. (3 pts) Describe an algorithm to solve this dual problem. What is the time complexity of your algorithm?

   We have shown that the Huber function is convex and derived its gradient in question 4.1.3, so we can simply apply gradient descent to solve it. The time complexity is O(kn), where k is the number of gradient steps, since each step costs O(n) to evaluate the gradient.

6. (3 pts) How can you recover an optimal primal solution x after solving the dual?

   After deriving u and v (the optimal v_i is the minimizing \lambda of the i-th Huber term, v_i = \max(0, |r_i + u a_i| / 2 - d_i)), we can recover x through the stationarity condition derived in question 4.2.2:

   x_i = - \frac{ r_i + u a_i }{ 2 ( d_i + v_i ) }.
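
A concrete version of the procedure in the last two questions: run gradient descent on the one-dimensional dual objective \mu + \sum_i B(r_i + \mu a_i, d_i), then recover v from the minimizing \lambda of each Huber term and x from the stationarity condition. The Octave sketch reuses huber_B from Section 4.1; the step size and iteration count are placeholders, and the whole thing illustrates the bookkeeping rather than a tuned solver.

    function [x, u, v] = solve_via_dual(d, r, a, iters, step)
      % min_x sum_i d_i x_i^2 + r_i x_i   s.t. a'x = 1, |x_i| <= 1,
      % solved through the one-dimensional dual  min_mu  mu + sum_i B(r_i + mu*a_i, d_i).
      u = 0;
      for t = 1:iters
        g = 1;                                   % d/dmu of the leading "mu" term
        for i = 1:numel(r)
          [Bi, gx] = huber_B(r(i) + u*a(i), d(i));
          g = g + gx * a(i);                     % chain rule through x = r_i + mu*a_i
        end
        u = u - step * g;                        % gradient step, O(n) per iteration
      end
      % Recover the remaining dual variables and the primal solution.
      v = max(0, abs(r + u*a)/2 - d);            % minimizing lambda of each Huber term
      x = -(r + u*a) ./ (2*(d + v));             % stationarity condition
    end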
