Machine Learning CS-6350, Assignment - 5, Due: 14th November 2013

SCHOOL OF COMPUTING, UNIVERSITY OF UTAH
Machine Learning CS-6350, Assignment - 5, Due: 14th November 2013
Chandramouli, Shridharan (sdharan@cs.utah.edu)
Singla, Sumedha (sumedha.singla@utah.edu)
November 14, 2013

INTRODUCTION

In this assignment we implement the mixture of Gaussians model. Section 1 uses the mixture of Gaussians for classification, whereas Sections 2 and 3 use the mixture of Gaussians as a tool for regression.

1 MIXTURE OF GAUSSIANS FOR CLASSIFICATION

Question 1(a): In an unsupervised classification problem, we are given data $[x_i]_{i=1}^n$ and a number of classes $m$, and we try to learn

p(y = j \mid x) = \frac{p(x \mid y = j)\, p(y = j)}{p(x)} = \frac{p(x \mid y = j)\, p(y = j)}{\sum_{k=1}^{m} p(x \mid y = k)\, p(y = k)}    (1.1)

where $y \in \{1, \dots, m\}$ are the class labels. Our hypothesis is that the data is generated as

y \sim \mathrm{Multinomial}(\phi), \qquad (x \mid y = j) \sim \mathcal{N}(\mu_j, \Sigma_j), \quad j \in \{1, \dots, m\}.    (1.2)
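To make the generative hypothesis (1.2) concrete, the following minimal MATLAB sketch draws synthetic data from it; the specific values of n, phi, mu and Sigma below are illustrative assumptions, not the assignment's data set.

    % Sample from the generative model in (1.2): y ~ Multinomial(phi), x|y ~ N(mu_y, Sigma_y).
    % phi, mu, Sigma here are made-up example parameters.
    % mvnrnd requires the Statistics and Machine Learning Toolbox.
    n     = 500;
    phi   = [0.3 0.3 0.4];                      % mixing weights, sum to 1
    mu    = [7 4; 2 4; 5 6];                    % m x d matrix of component means
    Sigma = cat(3, eye(2), eye(2), eye(2));     % d x d x m covariance matrices
    y = zeros(n, 1);
    x = zeros(n, 2);
    for i = 1:n
        y(i)    = find(rand <= cumsum(phi), 1);             % draw the latent class label
        x(i, :) = mvnrnd(mu(y(i), :), Sigma(:, :, y(i)));   % draw x given the class
    end
    scatter(x(:, 1), x(:, 2), 10, y, 'filled');             % colour points by class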

Hence, the parameters of this model are $\phi$, $[\mu_j]_{j=1}^m$ and $[\Sigma_j]_{j=1}^m$. Since the labels $y_i$ of the data points are unknown, we need the EM algorithm to find locally-maximal log-likelihood estimates for $\phi$, $[\mu_j]_{j=1}^m$ and $[\Sigma_j]_{j=1}^m$.

Question: Derive the update equations for the parameters $\phi$, $[\mu_j]_{j=1}^m$ and $[\Sigma_j]_{j=1}^m$ of the model as maximizers of the expected log-likelihood given the distributions $p(y_i = j) = w_{i,j}$.

Answer: The mixture of Gaussians model for one $x_i$ in terms of a latent variable $y$ is

p(x_i) = \sum_{j=1}^{m} p(x_i \mid y = j)\, p(y = j), \qquad p(y = j) = \phi_j, \qquad p(x_i \mid y = j) = \mathcal{N}(x_i; \mu_j, \Sigma_j).    (1.3)

Let us now introduce the indicator variable $q_{i,j}$, equal to 1 if Gaussian $j$ emitted $x_i$ and 0 otherwise. Writing $\theta = (\phi, [\mu_j]_{j=1}^m, [\Sigma_j]_{j=1}^m)$ for the parameters, the joint likelihood of all the $X$ and $q$ is

p(X, q \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{m} p(x_i \mid y = j)^{q_{i,j}}\, p(y = j)^{q_{i,j}}.    (1.4)

Taking the log of both sides,

\log p(X, q \mid \theta) = \sum_{i=1}^{n} \sum_{j=1}^{m} \big( q_{i,j} \log p(x_i \mid y = j) + q_{i,j} \log p(y = j) \big).    (1.5)

The corresponding auxiliary function is

A(\theta) = E[\log p(X, q \mid \theta)]
          = \sum_{i=1}^{n} \sum_{j=1}^{m} \big( E[q_{i,j}] \log p(x_i \mid y = j) + E[q_{i,j}] \log p(y = j) \big)
          = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( E[q_{i,j}] \log\!\left( \frac{1}{\sqrt{(2\pi)^a |\Sigma_j|}}\, e^{-\frac{1}{2}(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j)} \right) + E[q_{i,j}] \log \phi_j \right).    (1.6)

The E-step estimates the posterior:

E[q_{i,j}] = 1 \cdot p(q_{i,j} = 1 \mid X) + 0 \cdot p(q_{i,j} = 0 \mid X) = p(y = j \mid x_i) = w_{i,j}.    (1.7)

The M-step finds the parameters $\theta$ that maximize $A$:

\frac{\partial A}{\partial \theta} = 0.    (1.8)
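Before deriving the individual M-step updates, here is a minimal sketch of the E-step in (1.7), computing the responsibilities $w_{i,j}$ from the current parameter estimates; the variable names and array shapes are assumptions for illustration.

    % E-step sketch: w(i,j) = p(y = j | x_i) for the current phi, mu, Sigma.
    % X is n x d, phi is 1 x m, mu is m x d, Sigma is d x d x m (assumed shapes).
    % mvnpdf requires the Statistics and Machine Learning Toolbox.
    function w = estep(X, phi, mu, Sigma)
        n = size(X, 1);
        m = numel(phi);
        p = zeros(n, m);
        for j = 1:m
            p(:, j) = phi(j) * mvnpdf(X, mu(j, :), Sigma(:, :, j));   % phi_j N(x_i; mu_j, Sigma_j)
        end
        w = p ./ sum(p, 2);   % normalize over components: Bayes' rule, equation (1.7)
    end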

M-Step for Means

Setting $\partial A / \partial \mu_j = 0$, only the $\log p(x_i \mid y = j)$ terms depend on $\mu_j$:

\frac{\partial A}{\partial \mu_j} = \sum_{i=1}^{n} p(y = j \mid x_i)\, \frac{\partial}{\partial \mu_j} \log\!\left( \frac{1}{\sqrt{(2\pi)^a |\Sigma_j|}}\, e^{-\frac{1}{2}(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j)} \right) = \sum_{i=1}^{n} w_{i,j}\, \Sigma_j^{-1} (x_i - \mu_j) = 0

\Rightarrow \sum_{i=1}^{n} w_{i,j}\, x_i = \sum_{i=1}^{n} w_{i,j}\, \mu_j \;\Rightarrow\; \mu_j = \frac{\sum_{i=1}^{n} w_{i,j}\, x_i}{\sum_{i=1}^{n} w_{i,j}}.    (1.9)

M-Step for Variances

Setting $\partial A / \partial \Sigma_j = 0$ (the derivative is written here in the one-dimensional case; in the multivariate case the square becomes the outer product):

\frac{\partial A}{\partial \Sigma_j} = \sum_{i=1}^{n} p(y = j \mid x_i)\, \frac{\partial}{\partial \Sigma_j} \log\!\left( \frac{1}{\sqrt{(2\pi)^a |\Sigma_j|}}\, e^{-\frac{1}{2}(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j)} \right) = \sum_{i=1}^{n} w_{i,j} \left( -\frac{1}{2\Sigma_j} + \frac{(x_i - \mu_j)^2}{2\Sigma_j^2} \right) = 0

\Rightarrow \sum_{i=1}^{n} w_{i,j}\, (x_i - \mu_j)^2 = \sum_{i=1}^{n} w_{i,j}\, \Sigma_j \;\Rightarrow\; \Sigma_j = \frac{\sum_{i=1}^{n} w_{i,j}\, (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{n} w_{i,j}}.    (1.10)

M-Step for Weights

The mixing weights are constrained by

\sum_{j=1}^{m} \phi_j = 1.    (1.11)

We can rewrite the auxiliary function as

J(\theta) = A(\theta) + \left( 1 - \sum_{j=1}^{m} \phi_j \right) \lambda,    (1.12)

where λ is the Lagrange multiplier. Setting $\partial J / \partial \phi_j = 0$:

\frac{\partial J}{\partial \phi_j} = \frac{\partial A}{\partial \phi_j} - \lambda = \sum_{i=1}^{n} p(y = j \mid x_i)\, \frac{1}{\phi_j} - \lambda = 0 \;\Rightarrow\; \phi_j = \frac{\sum_{i=1}^{n} w_{i,j}}{\lambda}.    (1.13)

Summing (1.13) over $j$ and using $\sum_{j=1}^{m} \phi_j = 1$ together with $\sum_{j=1}^{m} w_{i,j} = \sum_{j=1}^{m} p(y = j \mid x_i) = 1$ gives $\lambda = n$, so

\phi_j = \frac{1}{n} \sum_{i=1}^{n} w_{i,j}.

Question 1(b): Plot the data points. Based on the plot, how many classes m do you think the data set should be classified into?

Answer 1(b): Based on the plot we have chosen m = 3. The plot of the data in the 2-D plane is shown in figure 1.1.

Figure 1.1: Plot of the initial data

Question 1(c): Implement the EM algorithm to learn the parameters.

Answer: For the EM algorithm, we wrote a Matlab script which runs the E-step and the M-step alternately until the log-likelihood converges (a minimal sketch of such a loop is given after Question 1(d) below). After running the EM algorithm, the final values for the means are:

\mu_1 = \begin{bmatrix} 7.02 \\ 4 \end{bmatrix}, \qquad \mu_2 = \begin{bmatrix} 1.98 \\ 3.9 \end{bmatrix}, \qquad \mu_3 = \begin{bmatrix} 5.01 \\ 5.97 \end{bmatrix}

We found that repeating the experiment with different initial means did not change the final values much. The values remain almost the same across runs, with minor changes due to the initial values of the means. This is probably because the data set has well-defined peaks.

Question 1(c): How did you initialize the EM algorithm? How does the initialization affect the result? Describe what you see.

Answer: We initialize the EM algorithm with a random mean for each Gaussian. We take the initial value of each Sigma as the identity matrix and each $\phi_j$ as $1/m$, where m is the number of Gaussians. The final parameter values are relatively similar and have little dependence on the initial values, but the number of iterations needed for convergence depends on the initial guess of the means: if the initial guess is very poor, it takes more iterations to converge. Refer to figure 1.2.

Question 1(c): Show plots of the means and the 1-sigma contours of the Gaussian distributions $x \mid y = j$ for $j = 1, \dots, m$ on top of the data upon convergence.

Answer: The plot is shown in figure 1.3.

Question 1(c): Show plots of the evolution of the expected log-likelihood during the iterations of the EM algorithm.

Answer: The plot is shown in figure 1.4.

Question 1(d): Given the parameters $\phi$, $[\mu_j]_{j=1}^m$ and $[\Sigma_j]_{j=1}^m$ as computed above, give an implicit equation for the decision boundaries between each of the classes. Plot the decision boundaries.

Answer: The decision boundary between the $i$-th Gaussian and the $j$-th Gaussian is given implicitly by the equation

p(x \mid y = i)\, p(y = i) = p(x \mid y = j)\, p(y = j)

for each unique combination of $i, j$ with $i, j \in \{1, \dots, m\}$. The plot is shown in figure 1.5.
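For reference, here is a minimal sketch of the EM loop described above (alternating E-step and M-step until the log-likelihood converges). The function name, the convergence tolerance and the small covariance ridge are our own illustrative choices, not the authors' actual script.

    % EM sketch for fitting a mixture of m Gaussians (illustrative, self-contained).
    % mvnpdf requires the Statistics and Machine Learning Toolbox.
    function [phi, mu, Sigma, loglik] = em_gmm(X, m, maxIter, tol)
        [n, d] = size(X);
        mu     = X(randperm(n, m), :);           % random data points as initial means
        Sigma  = repmat(eye(d), [1 1 m]);        % identity initial covariances
        phi    = ones(1, m) / m;                 % uniform initial mixing weights
        loglik = -inf(1, maxIter);
        for it = 1:maxIter
            % E-step: responsibilities w(i,j) = p(y = j | x_i), equation (1.7)
            p = zeros(n, m);
            for j = 1:m
                p(:, j) = phi(j) * mvnpdf(X, mu(j, :), Sigma(:, :, j));
            end
            w = p ./ sum(p, 2);
            % M-step: closed-form updates, equations (1.9), (1.10), (1.13)
            Nj  = sum(w, 1);
            phi = Nj / n;
            for j = 1:m
                mu(j, :) = (w(:, j)' * X) / Nj(j);
                Xc = X - mu(j, :);
                Sigma(:, :, j) = (Xc' * (Xc .* w(:, j))) / Nj(j) + 1e-6 * eye(d);  % small ridge for stability
            end
            % convergence check on the expected log-likelihood
            loglik(it) = sum(log(sum(p, 2)));
            if it > 1 && abs(loglik(it) - loglik(it - 1)) < tol
                loglik = loglik(1:it);
                break;
            end
        end
    end

A call such as [phi, mu, Sigma, ll] = em_gmm(X, 3, 200, 1e-6) followed by plot(ll) would, under these assumptions, produce log-likelihood curves of the kind shown in figure 1.4.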

Figure 1.2: Convergence plot for a poor initial mean

Figure 1.3: Plots of the means and the 1-sigma contours of the Gaussian distributions

Figure 1.4: The expected log-likelihood vs. the number of iterations

Figure 1.5: Decision boundaries between the classes, plotted on top of the data
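The decision boundaries of Question 1(d) (figure 1.5) can be reproduced by contouring the zero level set of the implicit boundary equation on a grid. In the sketch below, phi, mu and Sigma are assumed to come from a previous EM fit and the grid range is a placeholder.

    % Plot pairwise decision boundaries p(x | y=i) p(y=i) = p(x | y=j) p(y=j).
    % phi, mu, Sigma are assumed outputs of an EM fit; the grid range is a guess.
    [gx, gy] = meshgrid(linspace(0, 10, 300), linspace(0, 10, 300));
    G = [gx(:), gy(:)];
    m = numel(phi);
    scores = zeros(size(G, 1), m);
    for j = 1:m
        scores(:, j) = phi(j) * mvnpdf(G, mu(j, :), Sigma(:, :, j));  % p(x | y=j) p(y=j)
    end
    hold on;
    for i = 1:m
        for j = i+1:m
            d = reshape(scores(:, i) - scores(:, j), size(gx));
            contour(gx, gy, d, [0 0], 'k');   % zero level set = boundary between classes i and j
        end
    end
    hold off;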

2 QUESTION 2: MIXTURE OF GAUSSIANS FOR NON-LINEAR REGRESSION

The mixture of Gaussians model can be used to parameterize multi-modal distributions of data. The model is given by

p(x) = \sum_{j=1}^{m} \phi_j\, \mathcal{N}(x : \mu_j, \Sigma_j),    (2.1)

for a given number m of constituent Gaussians, where $\mathcal{N}(x : \mu, \Sigma)$ is the probability density at x of a Gaussian distribution with mean µ and variance Σ:

\mathcal{N}(x : \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^a |\Sigma|}} \exp\!\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),    (2.2)

where a is the dimension of x. Here too, $\phi$, $[\mu_j]_{j=1}^m$ and $[\Sigma_j]_{j=1}^m$ are the parameters of the model, which can be learned with the same EM algorithm as above given a set of data, with $\phi_j \in \mathbb{R}$, $\mu_j \in \mathbb{R}^{\dim(\hat{X})}$ and $\Sigma_j \in \mathbb{R}^{\dim(\hat{X}) \times \dim(\hat{X})}$, where $\hat{X}$ is a feature vector whose elements are functions of the raw data elements of x.

Question 2(a): Assume that x is distributed according to a mixture of Gaussians distribution with parameters $\phi$, $[\mu_j]_{j=1}^m$ and $[\Sigma_j]_{j=1}^m$ and a given number m of constituent Gaussians. Give equations for the mean $\mu = E(x)$ and variance $\Sigma = \mathrm{Var}(x)$ of the distribution.

Answer: The mean $\mu = E(x)$ is given by

\mu = E(x) = \sum_{j=1}^{m} \phi_j\, \mu_j,    (2.3)

and the variance $\Sigma = \mathrm{Var}(x)$ is given by

\Sigma = \mathrm{Var}(x) = \sum_{j=1}^{m} \phi_j\, \Sigma_j + \sum_{j=1}^{m} \phi_j\, (\mu_j - E(x))(\mu_j - E(x))^T.    (2.4)

Question 2(b): Give equations for the parameters of the marginal mixture of Gaussians distribution over x in terms of the parameters of the joint distribution.

Answer: If the joint distribution of x and y is given as

P\!\left( \begin{bmatrix} x \\ y \end{bmatrix} \right) = \sum_{j=1}^{m} \phi_j\, \mathcal{N}\!\left( \begin{bmatrix} x \\ y \end{bmatrix} ; \begin{bmatrix} \mu_j^X \\ \mu_j^Y \end{bmatrix}, \begin{bmatrix} \Sigma_j^X & \Sigma_j^{XY} \\ \Sigma_j^{YX} & \Sigma_j^Y \end{bmatrix} \right),    (2.5)

then the marginal mixture of Gaussians distribution over x has the following parameters in terms of the parameters of the joint distribution.

For each component $j = 1, \dots, m$, the marginal keeps the mixing weight $\phi_j$ and the X-blocks of the joint mean and covariance:

X \sim \mathrm{MoG}_m\!\left( \{ \phi_j, \mu_j^X, \Sigma_j^X \}_{j=1}^{m} \right),    (2.6)

where the joint block mean and covariance can be estimated from the data as

\begin{bmatrix} \mu^X \\ \mu^Y \end{bmatrix} = \frac{1}{n} \sum_{i=1}^{n} \begin{bmatrix} x_i \\ y_i \end{bmatrix},    (2.7)

\begin{bmatrix} \Sigma^X & \Sigma^{XY} \\ \Sigma^{YX} & \Sigma^Y \end{bmatrix} = \frac{1}{n} \sum_{i=1}^{n} \left( \begin{bmatrix} x_i \\ y_i \end{bmatrix} - \begin{bmatrix} \mu^X \\ \mu^Y \end{bmatrix} \right) \left( \begin{bmatrix} x_i \\ y_i \end{bmatrix} - \begin{bmatrix} \mu^X \\ \mu^Y \end{bmatrix} \right)^T.    (2.8)

Conditioning on an observed X, the component weights, means and covariances of the resulting conditional mixture over y are

\phi_j' = \frac{\mathcal{N}(X; \mu_j^X, \Sigma_j^X)\, \phi_j}{\sum_{k=1}^{m} \mathcal{N}(X; \mu_k^X, \Sigma_k^X)\, \phi_k},    (2.9)

\mu_j' = \mu_j^Y + \Sigma_j^{YX} (\Sigma_j^X)^{-1} (X - \mu_j^X),    (2.10)

\Sigma_j' = \Sigma_j^Y - \Sigma_j^{YX} (\Sigma_j^X)^{-1} \Sigma_j^{XY}.    (2.11)

Question 2(c): The conditional distribution x | y is also a mixture of Gaussians distribution. Give equations for the parameters of the conditional mixture of Gaussians distribution of x | y in terms of the parameters of the joint distribution.

Answer: The parameters of the conditional mixture of Gaussians distribution of x | y in terms of the parameters of the joint distribution are:

\phi_j' = \frac{\mathcal{N}(Y; \mu_j^Y, \Sigma_j^Y)\, \phi_j}{\sum_{k=1}^{m} \mathcal{N}(Y; \mu_k^Y, \Sigma_k^Y)\, \phi_k},    (2.12)

\mu_j' = \mu_j^X + \Sigma_j^{XY} (\Sigma_j^Y)^{-1} (Y - \mu_j^Y),    (2.13)

\Sigma_j' = \Sigma_j^X - \Sigma_j^{XY} (\Sigma_j^Y)^{-1} \Sigma_j^{YX}.    (2.14)

Question 2(d): Plot the data and compute the parameters of the joint mixture of Gaussians distribution for a varying number m of constituent Gaussians using the EM algorithm. Plot the means and 1-sigma contours of the constituent Gaussians on top of the data.

Answer: The plot of the data is shown in figure 2.1.
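The conditional formulas above can be used directly for regression: for a fitted joint two-dimensional mixture (x in the first coordinate, y in the second), the sketch below evaluates E(y | x) and Var(y | x) on a grid of query points, which is the kind of curve shown later in figures 2.5 and 2.6. The variable names, shapes and query range are assumptions for illustration.

    % Sketch: E(y|x) and Var(y|x) from a joint 2-D mixture fit (x first, y second).
    % phi (1 x m), mu (m x 2), Sigma (2 x 2 x m) are assumed outputs of an EM fit.
    xs = linspace(0, 1, 200)';        % query points; the range is a placeholder
    Ey = zeros(size(xs));
    Vy = zeros(size(xs));
    m  = numel(phi);
    for t = 1:numel(xs)
        xq = xs(t);
        w = zeros(1, m);  muC = zeros(1, m);  varC = zeros(1, m);
        for j = 1:m
            muX = mu(j, 1);  muY = mu(j, 2);
            sXX = Sigma(1, 1, j);  sXY = Sigma(1, 2, j);  sYY = Sigma(2, 2, j);
            w(j)    = phi(j) * normpdf(xq, muX, sqrt(sXX));   % conditional weight, cf. (2.9)
            muC(j)  = muY + sXY / sXX * (xq - muX);           % conditional mean,  cf. (2.10)
            varC(j) = sYY - sXY^2 / sXX;                      % conditional variance, cf. (2.11)
        end
        w = w / sum(w);
        Ey(t) = w * muC';                                     % mixture mean, cf. (2.3)
        Vy(t) = w * (varC + (muC - Ey(t)).^2)';               % law of total variance, cf. (2.4)
    end
    plot(xs, Ey, 'b', xs, Ey + sqrt(Vy), 'r--', xs, Ey - sqrt(Vy), 'r--');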

Figure 2.1: Plots of the data

Figure 2.2: The means and 1-sigma contours of the constituent Gaussians on top of the data

Figure 2.3: The means and 1-sigma contours of the constituent Gaussians on top of the data

Question 2(d): Plot the graphs of $E(y \mid x)$ and the 1-sigma contours $E(y \mid x) \pm \sqrt{\mathrm{Var}(y \mid x)}$ as functions of x on top of the data. What happens if you change the number m of constituent Gaussians?

Answer: The plot is shown in figure 2.5. On increasing the number m of constituent Gaussians, the accuracy of the model increases and the data fits more accurately between the $E(y \mid x) \pm \sqrt{\mathrm{Var}(y \mid x)}$ boundaries.

3 QUESTION 3: MIXTURE OF GAUSSIANS FOR NON-LINEAR REGRESSION

Question 3(a): Generate $n = 10{,}000$ random states and control inputs $\{x_i, u_i\}_{i=0}^{n-1}$ with bounds $-10 < x < 10$, $-10 < y < 10$, $-\pi < \theta < \pi$, $-1 \le \phi \le 1$ and $-2 \le v \le 2$, and compute $x^{next}_i = f(x_i, u_i) + m_i$ with $m_i \sim N(0, 0.0001 I)$ for all $i = 0, \dots, n-1$. To learn the dynamics model f from this data, learn the parameters of the joint mixture of Gaussians distribution of the data $(x^{next}_i, x_i, u_i)^T$ for a varying number m (m should be on the order of tens) of constituent Gaussians using the EM algorithm.

Answer:

Figure 2.4: The means and 1-sigma contours of the constituent Gaussians on top of the data

Figure 2.5: E(y | x) and the 1-sigma contours E(y | x) ± sqrt(Var(y | x)) as functions of x

We wrote a Matlab function called generate_data.m which takes as input the number of data points n, returns a randomly generated set of vectors $x_i$, $u_i$, and calculates the value of $x^{next}_i$ from the equations below:

[X, U, Xnext] = generate_data(n, noise_flag)

where n is the number of data points needed (n = 10,000 as required by the assignment) and noise_flag determines whether the values of X, U and Xnext are modified with additive Gaussian noise.

x_{k+1} = \begin{cases} x_k + v_k \cos(\theta_k)\, \Delta t & \text{if } \phi_k = 0, \\ x_k - \dfrac{\lambda \sin(\theta_k)}{\tan(\phi_k)} + \dfrac{\lambda \sin\!\left(\theta_k + \frac{v_k \Delta t \tan(\phi_k)}{\lambda}\right)}{\tan(\phi_k)} & \text{otherwise} \end{cases}    (3.1)

y_{k+1} = \begin{cases} y_k + v_k \sin(\theta_k)\, \Delta t & \text{if } \phi_k = 0, \\ y_k + \dfrac{\lambda \cos(\theta_k)}{\tan(\phi_k)} - \dfrac{\lambda \cos\!\left(\theta_k + \frac{v_k \Delta t \tan(\phi_k)}{\lambda}\right)}{\tan(\phi_k)} & \text{otherwise} \end{cases}    (3.2)

\theta_{k+1} = \theta_k + \frac{v_k \Delta t \tan(\phi_k)}{\lambda}    (3.3)

The function find_next_position.m uses the data generated by generate_data and determines the model parameters for a mixture of Gaussians. The number of Gaussians is given as an input to this function.
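A minimal sketch of the data-generation step just described, implementing equations (3.1)-(3.3); the wheelbase lambda, the time step dt and all variable names are illustrative assumptions rather than the constants used in the authors' generate_data.m.

    % Sketch of generating (X, U, Xnext) with the dynamics (3.1)-(3.3) plus noise.
    % lambda (wheelbase) and dt are placeholder values, not the assignment's constants.
    n      = 10000;
    lambda = 1;  dt = 0.1;
    X = [20*rand(n,1) - 10, 20*rand(n,1) - 10, 2*pi*rand(n,1) - pi];   % states [x, y, theta]
    U = [2*rand(n,1) - 1, 4*rand(n,1) - 2];                            % controls [phi, v]
    Xnext = zeros(n, 3);
    for i = 1:n
        x = X(i,1); y = X(i,2); th = X(i,3); ph = U(i,1); v = U(i,2);
        if abs(tan(ph)) < 1e-9                        % straight-line case, phi_k = 0
            xn  = x + v*cos(th)*dt;
            yn  = y + v*sin(th)*dt;
            thn = th;
        else
            w   = v*dt*tan(ph)/lambda;                                      % heading change, (3.3)
            xn  = x - lambda*sin(th)/tan(ph) + lambda*sin(th + w)/tan(ph);  % (3.1)
            yn  = y + lambda*cos(th)/tan(ph) - lambda*cos(th + w)/tan(ph);  % (3.2)
            thn = th + w;
        end
        Xnext(i,:) = [xn, yn, thn] + sqrt(1e-4)*randn(1,3);   % additive noise m_i ~ N(0, 0.0001 I)
    end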

Figure 2.6: E(y | x) and the 1-sigma contours E(y | x) ± sqrt(Var(y | x)) as functions of x

Question 3(b): Now, by constructing the conditional mixture of Gaussians distribution $x^{next} \mid x, u$ and computing its expected value $E(x^{next} \mid x, u)$, we get a learned estimate of the function f. Generate a new set of $n = 10{,}000$ random states and control inputs $\{x_i, u_i\}_{i=0}^{n-1}$ with bounds $-10 < x < 10$, $-10 < y < 10$, $-\pi < \theta < \pi$, $-1 \le \phi \le 1$ and $-2 \le v \le 2$, and compute the root-mean-squared error. Plot this error for a varying number m of constituent Gaussians. What is the optimal number of constituent Gaussians?

Answer:

[mu, phi, Sigma, rmsError] = find_next_position(mGaussianClusters, nDataPoints)

Here µ, φ, Σ are the model parameters of the mixture of m Gaussians over n data points, estimated using the EM algorithm. Using the joint distribution of $X^{next}$, X and U, the conditional distribution of $X^{next} \mid X, U$ is calculated as follows:

X^{next} \mid X, U \sim \mathrm{MoG}_m\!\left( \{ \phi_j', \mu_j', \Sigma_j' \}_{j=1}^{m} \right),    (3.4)

where

\phi_j' = \frac{\mathcal{N}(Y; \mu_j^Y, \Sigma_j^Y)\, \phi_j}{\sum_{k=1}^{m} \mathcal{N}(Y; \mu_k^Y, \Sigma_k^Y)\, \phi_k},    (3.5)

\mu_j' = \mu_j^X + \Sigma_j^{XY} (\Sigma_j^Y)^{-1} (Y - \mu_j^Y),    (3.6)

\Sigma_j' = \Sigma_j^X - \Sigma_j^{XY} (\Sigma_j^Y)^{-1} \Sigma_j^{YX}.    (3.7)

Here $Y = [X\; U]^T$ is used for easier notation. The expected value of the above conditional distribution gives us the estimate of the new $X^{next}$ value:

E(X^{next} \mid X, U) = \sum_{j=1}^{m} \phi_j'\, \mu_j'.    (3.8)

We ran the above code for m = 5, 10, and larger values of m, a number of times, on newly generated data containing X, U and $X^{next}$, and plotted the average RMS values. We found a significant increase in the accuracy of the predicted $x^{next}$ at around m = 30 or 40, and the accuracy almost saturated beyond that. The plot of the RMS error against the number of constituent Gaussians is shown in Figure 3.1. Since there is very little difference in the RMS values to justify a huge m, the optimal value of m is between 30 and 40.

4 WORK SPLIT-UP

The work was split evenly between the authors. The Matlab programs were coded together, and the equations for the first, second and third questions were solved individually and verified.

Figure 3.1: Plot of RMS error against the number of constituent Gaussians

5 CONCLUSION

In this assignment, we used the mixture of Gaussians model to model the distribution of data and as a means of running regressions on data. The first part of the assignment dealt with using the mixture of Gaussians for classifying a set of data, whereas the second and third parts dealt with using mixture of Gaussians models to run regressions on non-linear data.
