Homework 2: EM, Mixture Models, PCA, Duality
CMU 10-715: Machine Learning (Fall 2015)
OUT: Oct 5, 2015
DUE: Oct 19, 2015, 10:20 AM

Guidelines

The homework is due at 10:20 am on Monday, October 19. Each student will be given two late days that can be spent on any homeworks, but at most one late day per homework. Once you have used up your late days for the term, late homework submissions will receive no credit.

Submit both a paper copy and an electronic copy through the submission website: https://autolab.cs.cmu.edu/courses/10715-f15. You can sign in using your Andrew credentials. You should make sure to edit your account information and choose a nickname/handle. This handle will be used to display your results for any competition-style questions on the class leaderboard.

Some questions will be autograded. Please make sure to carefully follow the submission instructions for these questions.

We recommend that you typeset your solutions using software such as LaTeX. If you write by hand, ensure your handwriting is clear and legible. The TAs will not invest undue effort to decrypt bad handwriting.

Programming guidelines:

Octave: You must write submitted code in Octave. Octave is a free scientific programming language, with syntax similar to that of MATLAB. Installation instructions can be found on the Octave website. (You can develop your code in MATLAB if you prefer, but you must test it in Octave before submitting, or it may fail in the autograder.)

Autograding: This problem is autograded using the CMU Autolab system. The code which you write will be executed remotely against a suite of tests, and the results used to automatically assign you a grade. To make sure your code executes correctly on our servers, you should avoid using libraries which are not present in the basic Octave install.

Submission Instructions: For each programming question you will be given a function signature. You will be asked to write a single Octave function which satisfies the signature.
In the code handout linked above, we have provided you with a single folder containing stubs for each of the functions you need to complete. Do not modify the structure of this directory or rename these files. Complete each of these functions, then compress this directory as a tar file and submit to Autolab online. You may submit code as many times as you like. When you download the files, you should confirm that the autograder is functioning correctly by compressing and submitting the directory of stubs provided. This should result in a grade of zero for all questions.

SUBMISSION CHECKLIST
- Submission executes on our machines in less than 20 minutes.
- Submission is smaller than 2000K.
- Submission is a .tar file.
- Submission returns matrices of the exact dimension specified.
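The compress-and-submit step above can be done from the command line. A minimal sketch, assuming the stub directory is named `handin` (the actual name comes from the code handout):

```shell
# Package the (hypothetical) handin directory as a tar file for Autolab.
tar -cf handin.tar handin/

# Optional: list the archive contents to confirm the directory structure
# was preserved (Autolab expects the original layout of the stub folder).
tar -tf handin.tar
```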
1 An EM Algorithm for a Mixture of Bernoullis [Eric; 25 pts]

In this section, you will derive an expectation-maximization (EM) algorithm to cluster black and white images. The inputs x^(i) can be thought of as vectors of binary values corresponding to black and white pixel values, and the goal is to cluster the images into groups. You will be using a mixture of Bernoullis model to tackle this problem. For the sake of brevity, you do not need to substitute in previously derived expressions in later problems. For example, in later questions you may use P(x^(i) | p^(k)) in your answers.

1.1 Mixture of Bernoullis

1. (2pts) Consider a vector of binary random variables, x ∈ {0,1}^D. Assume each variable x_d is drawn from a Bernoulli(p_d) distribution, so P(x_d = 1) = p_d. Let p ∈ (0,1)^D be the resulting vector of Bernoulli parameters. Write an expression for P(x | p).

2. (2pts) Now suppose we have a mixture of K Bernoulli distributions: each vector x^(i) is drawn from some vector of Bernoulli random variables with parameters p^(k); we will call this Bernoulli(p^(k)). Let p = {p^(1), ..., p^(K)}. Assume a distribution π(k) over the selection of which set of Bernoulli parameters p^(k) is chosen. Write an expression for P(x^(i) | p, π).

3. (2pts) Finally, suppose we have inputs X = {x^(i)}_{i=1,...,n}. Using the above, write an expression for the log likelihood of the data X, log P(X | π, p).

1.2 Expectation step

1. (4pts) Now, we introduce the latent variables for the EM algorithm. Let z^(i) ∈ {0,1}^K be an indicator vector, such that z^(i)_k = 1 if x^(i) was drawn from a Bernoulli(p^(k)), and 0 otherwise. Let Z = {z^(i)}_{i=1,...,n}. What is P(z^(i) | π)? What is P(x^(i) | z^(i), p, π)?

2. (2pts) Using the above two quantities, derive the likelihood of the data and the latent variables, P(Z, X | π, p).

3. (5pts) Let η(z^(i)_k) = E[z^(i)_k | x^(i), π, p].
Show that

    η(z^(i)_k) = [ π_k ∏_{d=1}^D (p^(k)_d)^{x^(i)_d} (1 − p^(k)_d)^{1 − x^(i)_d} ] / [ Σ_{j=1}^K π_j ∏_{d=1}^D (p^(j)_d)^{x^(i)_d} (1 − p^(j)_d)^{1 − x^(i)_d} ]

Let p, π be the new parameters that we would like to maximize, and let p̃, π̃ be the parameters from the previous iteration. Use this to derive the following final expression for the E step in the expectation-maximization algorithm, where η is computed using the old parameters p̃, π̃:

    E[log P(Z, X | p, π) | X, p̃, π̃] = Σ_{i=1}^N Σ_{k=1}^K η(z^(i)_k) [ log π_k + Σ_{d=1}^D ( x^(i)_d log p^(k)_d + (1 − x^(i)_d) log(1 − p^(k)_d) ) ]

1.3 Maximization step

1. (4pts) We need to maximize the above expression with respect to π, p. First, show that the value of p that maximizes the E step is

    p^(k) = ( Σ_{i=1}^N η(z^(i)_k) x^(i) ) / N_k,  where N_k = Σ_{i=1}^N η(z^(i)_k).

2. (4pts) Show that the value of π that maximizes the E step is

    π_k = N_k / Σ_{k'} N_{k'}

The exponential families notation may be useful. Alternatively, you can use Lagrange multipliers.
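To make the E and M updates above concrete, here is a small illustrative sketch of one EM iteration for a mixture of Bernoullis in plain Python. (The assignment itself must be solved in Octave; the function and variable names here are hypothetical, and this sketch omits the smoothing and log-space tricks discussed in Section 2.)

```python
def em_step(X, p, pi):
    """One EM iteration for a mixture of Bernoullis.

    X  : list of N binary vectors (length D each)
    p  : list of K Bernoulli parameter vectors (length D each), entries in (0, 1)
    pi : list of K mixture weights summing to 1
    Returns updated (p, pi) and the responsibilities eta (N x K).
    """
    N, K, D = len(X), len(p), len(X[0])

    # E step: eta[i][k] is proportional to pi_k * prod_d p_kd^x_id (1 - p_kd)^(1 - x_id),
    # normalized over k (the expression for eta derived in question 1.2.3).
    eta = []
    for x in X:
        w = []
        for k in range(K):
            lik = pi[k]
            for d in range(D):
                lik *= p[k][d] if x[d] == 1 else (1.0 - p[k][d])
            w.append(lik)
        s = sum(w)
        eta.append([wk / s for wk in w])

    # M step: p^(k) is the eta-weighted mean of the data, pi_k is proportional to N_k.
    Nk = [sum(eta[i][k] for i in range(N)) for k in range(K)]
    new_p = [[sum(eta[i][k] * X[i][d] for i in range(N)) / Nk[k] for d in range(D)]
             for k in range(K)]
    new_pi = [Nk[k] / N for k in range(K)]
    return new_p, new_pi, eta
```

Each row of eta sums to one, and the updated pi sums to one, matching the constraint that Lagrange multipliers enforce in question 1.3.2.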
2 Clustering Images of Numbers [Eric; 25 pts]

In this section you will use the above algorithm to cluster images of numbers. You will be using the MNIST dataset. Each input is a binary vector corresponding to black and white pixels, and is a flattened version of the 28x28 pixel image. We will use the following conventions:

- N is the number of datapoints, D is the dimension of each input, and K is the number of clusters.
- Xs is an N x D matrix of the input data, where row i is the pixel data for picture i.
- p is a K x D matrix of Bernoulli parameters, where row k is the vector of parameters for the kth mixture of Bernoullis.
- mix_p is a K x 1 vector containing the distribution over the various mixtures.
- eta is an N x K matrix containing the results of the E step, so eta[i,k] = η(z^(i)_k).
- clusters is an N x 1 vector containing the final cluster labels of the input data. Each label is a number from 1 to K.

2.1 Programming (20pts)

1. Implement the E step of the algorithm within [eta] = Estep(Xs, p, mix_p), saving your calculated values within eta.

2. Implement the M step of the algorithm within [p, mix_p] = Mstep(Xs, model, alpha1, alpha2). p and mix_p returned by this function should contain the new values that maximize the E step. alpha1, alpha2 are Dirichlet smoothing parameters (explained below).

3. Implement [clusters] = MoBlabels(Xs, p, mix_p). This function will take in the estimated parameters and return the resulting labels that cluster the data.

4. Some hints and tips:

The autograder will grade the accuracy of your implementation separately for the E-step (4pts), M-step (4pts), and M-step with smoothing (2pts). Finally, it will run your implementation for several iterations on a subset of the MNIST dataset (8pts). A small subset of the MNIST dataset is provided for your convenience.

Use the log operator to make your calculations more numerically stable. In particular, pay attention to the calculation of η(z^(i)_k). You will need to avoid zeros in π and p, or else you will take log(0) = −∞.
Use Dirichlet prior smoothing with the parameters α1, α2 when updating these variables:

    p^(k)_d = ( Σ_{i=1}^N η(z^(i)_k) x^(i)_d + α1 ) / ( N_k + α1 D )

    π_k = ( N_k + α2 ) / ( Σ_{k'} N_{k'} + α2 K )

Initialize your parameters p by randomly sampling from a Uniform(0, 1) distribution and normalizing each p^(k) to have unit length, and set π_k = 1/K.

Running your implementation on the MNIST dataset with K = 20 clusters and α1 = α2 = 10^−8 for 20 iterations should give you results similar to the following (our reference code runs in approximately 10 seconds):
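The numerical-stability hint above (compute η in log space) is usually implemented with the log-sum-exp trick. A minimal Python illustration (the function name is hypothetical; the graded code must be Octave):

```python
import math

def normalize_log_weights(log_w):
    """Given unnormalized log-weights log_w[k] = log(pi_k) + log P(x | p^(k)),
    return the normalized responsibilities eta_k without underflow.

    Subtracting the max before exponentiating keeps every exponent <= 0,
    so exp() never overflows, and the largest term is exactly 1.
    """
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    s = sum(w)
    return [wk / s for wk in w]
```

For a 784-pixel binary image, the per-cluster log-likelihoods are easily around −500 or smaller; exponentiating them directly would underflow every weight to 0.0, while the shifted version normalizes correctly.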
2.2 Analysis

1. (5pts) For each cluster, reshape the pixels into a 28x28 matrix and print the resulting grayscale images. What do you see? Explain, and include one such image in your writeup. See software/octave/doc/interpreter/representing-images.html for help on printing the matrix, or use the provided helper function show_clusters(p, a, b), which will print the mixtures in p in an a x b grid.

2. (2pts) Using your implemented MoBlabels function, cluster the data. Using the true labels given in ytrain, how many unique digits does each cluster typically have? Are there any clusters that picked out exactly one digit?

3 Kernel PCA [Fish; 15 pts]

Principal component analysis (PCA) is used to emphasize variation and is often used to make data easy to explore and visualize. Suppose we have data x_1, x_2, ..., x_N ∈ R^d with zero mean. PCA aims at finding a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate, the second greatest variance on the second coordinate, and so on. The basis vectors of the new coordinate system are the eigenvectors of the covariance matrix, i.e.,

    (1/N) Σ_{i=1}^N x_i x_i^T = Σ_i λ_i v_i v_i^T,   (1)

where v_1, v_2, ..., v_d are orthogonal (⟨v_i, v_j⟩ = 0 for i ≠ j). Then we can plot our data points in the new coordinate system. The position of each data point in the new coordinate system can be derived by projecting x onto the basis vectors of the new coordinate system, i.e., v_1, v_2, ..., v_d.

3.1 Kernel PCA

Often we will want to make a linear or non-linear transformation of the data so that the points are projected into a higher-dimensional space. Suppose there is a transformation function φ : R^d → R^l. We map all the data to the new space through this function, obtaining φ(x_1), φ(x_2), ..., φ(x_N). We again want to do PCA on the transformed data in the new space. How can we do this?

1. (2pts) Write out the covariance matrix, C, for φ(x_1), φ(x_2), ..., φ(x_N). (Define φ̄(x) = (1/N) Σ_{j=1}^N φ(x_j).)
2. (2pts) Now we want to find the basis vectors of the new space by solving for the eigenvectors of C. Using the definition of an eigenvector, λv = Cv, explain why v is a linear combination of (φ(x_1) − φ̄(x)), (φ(x_2) − φ̄(x)), ..., (φ(x_N) − φ̄(x)). Since v is a linear combination of these vectors, v = Σ_{i=1}^N α_i (φ(x_i) − φ̄(x)).

3. (2pts) Before starting to derive α and find v, we introduce the kernel function here. A kernel function has the form k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩. Notice that we do not need to know the function φ to calculate k(x_i, x_j). We can simply define a function that is related to x_i, x_j and satisfies k(x_i, x_j) = k(x_j, x_i). A classic example is the radial basis function (RBF) kernel, k(x_i, x_j) = exp(−‖x_i − x_j‖² / 2σ²). In order to use these kernel functions, we need to get rid of all the φ(x_i) and replace them with the kernel function. A kernel matrix is defined as K ∈ R^{N×N} with K_ij = k(x_i, x_j). Since we are dealing with non-centered data, we first use a modified kernel matrix K̃ with K̃_ij = ⟨φ(x_i) − φ̄(x), φ(x_j) − φ̄(x)⟩. By using the results of the previous questions and the definition of eigenvectors, show that

    N λ K̃ α = K̃² α.   (2)

4. (2pts) Show that the solutions α of equation (2) are also solutions of

    N λ α = K̃ α.   (3)

5. (3pts) Show that

    K̃ = (I − ee^T) K (I − ee^T),   (4)

where e is the vector with all entries equal to 1/√N. Thus, by solving for the eigenvectors of K̃, we get α.

6. (2pts) After obtaining α, we get the un-normalized basis vector v. To normalize it, what is the factor by which you need to multiply v?

7. (2pts) For a data point x, what is its position in the normalized new space? Explain why you can get the new coordinates without explicitly calculating φ(x).

4 Huber Function and Its Application in Solving Dual Problems [Fish; 35pts]

4.1 Huber function

Define a function

    B(x, d) = min_{λ ≥ 0} [ λ/2 + x² / (2(λ + d)) ],   (5)

where d > 0.

1. (3pts) Show that

    B(x, d) = x²/(2d)   if |x| ≤ d,
              |x| − d/2  if |x| > d,   (6)

with the minimizer λ* = max(0, |x| − d). B is often called a Huber function.

2. (3pts) Is the function B convex in x? (Assume d is a fixed constant.)

3.
(3pts) Derive the gradient of the function B at any given point (x, d). (Remember that d > 0.)
4. (3pts) The function B is often used as a penalty term in classification or regression problems, which have the form

    min_x L(x) + B(x, d),   (7)

where L is the loss function, and d is a parameter with positive value. Describe how the parameter d affects the penalty term B on the solution.

4.2 An application of the Huber loss in solving a dual problem

Consider the optimization problem

    min_x Σ_{i=1}^n ( (1/2) d_i x_i² + r_i x_i )   (8)

    s.t. a^T x = 1,  x_i ∈ [−1, 1] for i = 1, 2, ..., n.   (9)

1. (3pts) Show that the problem has strictly feasible solutions if and only if ‖a‖₁ > 1.

2. (3pts) Re-express x_i ∈ [−1, 1] as x_i² ≤ 1, for i = 1, 2, ..., n. Write down the dual of this problem.

3. (6pts) Write down the KKT conditions for this problem. Do they characterize the optimal solution?

4. (5pts) Show that we can further reduce the dual problem to the one-dimensional convex problem

    min_μ  μ + Σ_{i=1}^n B(r_i + μ a_i, d_i),   (10)

where B is the Huber function defined in the previous section.

5. (3pts) Describe an algorithm to solve this dual problem. What is the time complexity of your algorithm?

6. (3pts) How can you recover an optimal primal solution x after solving the dual?
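As a sanity check on Section 4, here is a small Python sketch (hypothetical names; not part of the Octave submission) that implements the closed form of B from equation (6) and minimizes a one-dimensional convex objective of the form in equation (10) by ternary search, one natural answer to question 4.2.5:

```python
def huber_B(x, d):
    """Closed-form Huber function: x^2/(2d) for |x| <= d, else |x| - d/2."""
    ax = abs(x)
    return ax * ax / (2.0 * d) if ax <= d else ax - d / 2.0

def solve_dual(r, a, d, lo=-1e6, hi=1e6, iters=200):
    """Minimize f(mu) = mu + sum_i B(r_i + mu * a_i, d_i) over mu.

    f is convex in mu (a sum of convex functions composed with affine maps,
    plus a linear term), so ternary search over a bracketing interval works.
    """
    def f(mu):
        return mu + sum(huber_B(ri + mu * ai, di)
                        for ri, ai, di in zip(r, a, d))

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2  # a minimizer lies in [lo, m2]
        else:
            lo = m1  # a minimizer lies in [m1, hi]
    return (lo + hi) / 2.0
```

Note that since B(x, d) grows like |x| for large |x|, this objective is bounded below exactly when ‖a‖₁ > 1, which is consistent with the strict feasibility condition of question 4.2.1. Each of the O(log((hi − lo)/ε)) ternary-search steps costs O(n), for O(n log((hi − lo)/ε)) total.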
Utah State University DigitalCommons@USU Founations of Wave Phenomena Physics, Department of 1-1-2004 05 The Continuum Limit an the Wave Equation Charles G. Torre Department of Physics, Utah State University,
More informationAP Calculus. Derivatives and Their Applications. Presenter Notes
AP Calculus Derivatives an Their Applications Presenter Notes 2017 2018 EDITION Copyright 2017 National Math + Science Initiative, Dallas, Texas. All rights reserve. Visit us online at www.nms.org Copyright
More informationLecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations
Lecture XII Abstract We introuce the Laplace equation in spherical coorinates an apply the metho of separation of variables to solve it. This will generate three linear orinary secon orer ifferential equations:
More informationSolving the Schrödinger Equation for the 1 Electron Atom (Hydrogen-Like)
Stockton Univeristy Chemistry Program, School of Natural Sciences an Mathematics 101 Vera King Farris Dr, Galloway, NJ CHEM 340: Physical Chemistry II Solving the Schröinger Equation for the 1 Electron
More informationKernel Principal Component Analysis
Kernel Principal Component Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationThermal conductivity of graded composites: Numerical simulations and an effective medium approximation
JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University
More informationA note on asymptotic formulae for one-dimensional network flow problems Carlos F. Daganzo and Karen R. Smilowitz
A note on asymptotic formulae for one-imensional network flow problems Carlos F. Daganzo an Karen R. Smilowitz (to appear in Annals of Operations Research) Abstract This note evelops asymptotic formulae
More informationDiagonalization of Matrices Dr. E. Jacobs
Diagonalization of Matrices Dr. E. Jacobs One of the very interesting lessons in this course is how certain algebraic techniques can be use to solve ifferential equations. The purpose of these notes is
More informationECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction
ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering
More information1 A Support Vector Machine without Support Vectors
CS/CNS/EE 53 Advanced Topics in Machine Learning Problem Set 1 Handed out: 15 Jan 010 Due: 01 Feb 010 1 A Support Vector Machine without Support Vectors In this question, you ll be implementing an online
More informationCalculus and optimization
Calculus an optimization These notes essentially correspon to mathematical appenix 2 in the text. 1 Functions of a single variable Now that we have e ne functions we turn our attention to calculus. A function
More informationCSC411 Fall 2018 Homework 5
Homework 5 Deadline: Wednesday, Nov. 4, at :59pm. Submission: You need to submit two files:. Your solutions to Questions and 2 as a PDF file, hw5_writeup.pdf, through MarkUs. (If you submit answers to
More informationCOMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017
COMS 4721: Machine Learning for Data Science Lecture 19, 4/6/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University PRINCIPAL COMPONENT ANALYSIS DIMENSIONALITY
More informationAnalyzing Tensor Power Method Dynamics in Overcomplete Regime
Journal of Machine Learning Research 18 (2017) 1-40 Submitte 9/15; Revise 11/16; Publishe 4/17 Analyzing Tensor Power Metho Dynamics in Overcomplete Regime Animashree Ananumar Department of Electrical
More informationSupport Vector Machines
EE 17/7AT: Optimization Models in Engineering Section 11/1 - April 014 Support Vector Machines Lecturer: Arturo Fernandez Scribe: Arturo Fernandez 1 Support Vector Machines Revisited 1.1 Strictly) Separable
More informationRotation by 90 degree counterclockwise
0.1. Injective an suejective linear map. Given a linear map T : V W of linear spaces. It is natrual to ask: Problem 0.1. Is that possible to fin an inverse linear map T 1 : W V? We briefly review those
More informationAnnouncements - Homework
Announcements - Homework Homework 1 is graded, please collect at end of lecture Homework 2 due today Homework 3 out soon (watch email) Ques 1 midterm review HW1 score distribution 40 HW1 total score 35
More informationPhysics 251 Results for Matrix Exponentials Spring 2017
Physics 25 Results for Matrix Exponentials Spring 27. Properties of the Matrix Exponential Let A be a real or complex n n matrix. The exponential of A is efine via its Taylor series, e A A n = I + n!,
More informationu!i = a T u = 0. Then S satisfies
Deterministic Conitions for Subspace Ientifiability from Incomplete Sampling Daniel L Pimentel-Alarcón, Nigel Boston, Robert D Nowak University of Wisconsin-Maison Abstract Consier an r-imensional subspace
More informationFinal Exam: Sat 12 Dec 2009, 09:00-12:00
MATH 1013 SECTIONS A: Professor Szeptycki APPLIED CALCULUS I, FALL 009 B: Professor Toms C: Professor Szeto NAME: STUDENT #: SECTION: No ai (e.g. calculator, written notes) is allowe. Final Exam: Sat 1
More informationOutcomes. Unit 14. Review of State Machines STATE MACHINES OVERVIEW. State Machine Design
4. Outcomes 4.2 Unit 4 tate Machine Design I can create a state iagram to solve a sequential problem I can implement a working state machine given a state iagram 4.3 Review of tate Machines 4.4 TATE MACHINE
More informationConnection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis
Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Alvina Goh Vision Reading Group 13 October 2005 Connection of Local Linear Embedding, ISOMAP, and Kernel Principal
More informationAPPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France
APPROXIMAE SOLUION FOR RANSIEN HEA RANSFER IN SAIC URBULEN HE II B. Bauouy CEA/Saclay, DSM/DAPNIA/SCM 91191 Gif-sur-Yvette Ceex, France ABSRAC Analytical solution in one imension of the heat iffusion equation
More informationMidterm. Introduction to Machine Learning. CS 189 Spring You have 1 hour 20 minutes for the exam.
CS 189 Spring 2013 Introduction to Machine Learning Midterm You have 1 hour 20 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. Please use non-programmable calculators
More informationInfluence of weight initialization on multilayer perceptron performance
Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -
More informationDerivative Methods: (csc(x)) = csc(x) cot(x)
EXAM 2 IS TUESDAY IN QUIZ SECTION Allowe:. A Ti-30x IIS Calculator 2. An 8.5 by inch sheet of hanwritten notes (front/back) 3. A pencil or black/blue pen Covers: 3.-3.6, 0.2, 3.9, 3.0, 4. Quick Review
More information