CSC411 Fall 2018 Homework 5

Deadline: Wednesday, Nov. 14, at 11:59pm.

Submission: You need to submit two files:

1. Your solutions to Questions 1 and 2 as a PDF file, hw5_writeup.pdf, through MarkUs [1]. (If you submit answers to Question 3, we will give feedback, but you will get the points for free; see below.)
2. Your completed Python code for Question 1, as q1.py.

Neatness Point: One of the 10 points will be given for neatness. You will receive this point as long as we don't have a hard time reading your solutions or understanding the structure of your code.

Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.

Collaboration: Weekly homeworks are individual work. See the Course Information handout [2] for detailed policies.

[1] https://markus.teach.cs.toronto.edu/csc411-2018-09
[2] http://www.cs.toronto.edu/~rgrosse/courses/csc411_f18/syllabus.pdf

1. [3pts] Gaussian Discriminant Analysis. For this question you will build classifiers to label images of handwritten digits. Each image is 8 by 8 pixels and is represented as a vector of dimension 64 by listing all the pixel values in raster scan order. The images are grayscale and the pixel values are between 0 and 1. The labels y are 0, 1, 2, ..., 9, corresponding to which character was written in the image. There are 700 training cases and 400 test cases for each digit; they can be found in a2digits.zip.

Starter code is provided to help you load the data (data.py). A skeleton (q1.py) is also provided for each question that you should use to structure your code.

Using maximum likelihood, fit a set of 10 class-conditional Gaussians with a separate, full covariance matrix for each class. Remember that the conditional multivariate Gaussian probability density is given by

    p(x | y = k, µ_k, Σ_k) = (2π)^{−d/2} |Σ_k|^{−1/2} exp{ −(1/2) (x − µ_k)^T Σ_k^{−1} (x − µ_k) }    (1)

You should take p(y = k) = 0.1. You will compute parameters µ_kj and Σ_k for k ∈ (0, ..., 9), j ∈ (1, ..., 64). You should implement the covariance computation yourself (i.e. without the aid of np.cov). Hint: To ensure numerical stability you may have to add a small multiple of the identity to each covariance matrix. For this assignment you should add 0.01I to each covariance matrix.

(a) [1pt] Using the parameters you fit on the training set and Bayes rule, compute the average conditional log-likelihood, i.e. (1/N) Σ_{i=1}^N log p(y^{(i)} | x^{(i)}, θ), on both the train and test set and report it.
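For reference, here is a minimal sketch of how the density in Equation (1) might be evaluated once the means and covariances have been fit, assuming numpy. The helper name gaussian_log_density is hypothetical and not part of the provided q1.py skeleton; working in log space via np.linalg.slogdet and np.linalg.solve is just one reasonable way to keep the computation numerically stable.

```python
import numpy as np

def gaussian_log_density(X, mean, cov):
    """Log of the class-conditional density in Equation (1) for each row of X.

    Hypothetical helper (not part of the provided q1.py skeleton).
    X: (N, d) data, mean: (d,) class mean, cov: (d, d) class covariance
    (already regularized with 0.01 * I).
    """
    d = X.shape[1]
    diff = X - mean                             # (N, d)
    # Work in log space with slogdet/solve rather than det/inv,
    # which avoids underflow and an explicit matrix inverse.
    _, logdet = np.linalg.slogdet(cov)
    solved = np.linalg.solve(cov, diff.T).T     # rows are Sigma^{-1} (x - mu)
    quad = np.sum(diff * solved, axis=1)        # (x - mu)^T Sigma^{-1} (x - mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)
```

Combining these class-conditional log-densities with the uniform prior p(y = k) = 0.1 through Bayes rule then gives the conditional log-likelihoods asked for in part (a).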

(b) [1pt] Select the most likely posterior class for each training and test data point as your prediction, and report your accuracy on the train and test set.

(c) [1pt] Compute the leading eigenvectors (largest eigenvalue) for each class covariance matrix (you can use np.linalg.eig) and plot them side by side as 8 by 8 images.

Report your answers to the above questions, and submit your completed Python code for q1.py.

2. [2pts] Categorical Distribution. Let's consider fitting the categorical distribution, which is a discrete distribution over K outcomes, which we'll number 1 through K. The probability of each category is explicitly represented with a parameter θ_k. For it to be a valid probability distribution, we clearly need θ_k ≥ 0 and Σ_k θ_k = 1. We'll represent each observation x as a 1-of-K encoding, i.e., a vector where one of the entries is 1 and the rest are 0. Under this model, the probability of an observation can be written in the following form:

    p(x; θ) = ∏_{k=1}^K θ_k^{x_k}.

Denote the count for outcome k as N_k, and the total number of observations as N. In the previous assignment, you showed that the maximum likelihood estimate, in terms of these counts, was:

    θ̂_k = N_k / N.

Now let's derive the Bayesian parameter estimate.

(a) [1pt] For the prior, we'll use the Dirichlet distribution, which is defined over the set of probability vectors (i.e. vectors that are nonnegative and whose entries sum to 1). Its PDF is as follows:

    p(θ) ∝ θ_1^{a_1 − 1} · · · θ_K^{a_K − 1}.

A useful fact is that if θ ~ Dirichlet(a_1, ..., a_K), then

    E[θ_k] = a_k / Σ_{k'} a_{k'}.

Determine the posterior distribution p(θ | D), where D is the set of observations. From that, determine the posterior predictive probability that the next outcome will be k.

(b) [1pt] Still assuming the Dirichlet prior distribution, determine the MAP estimate of the parameter vector θ. For this question, you may assume each a_k > 1.
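The expectation stated above is easy to confirm numerically, which can serve as a sanity check while working through part (a). The sketch below is purely illustrative, uses arbitrary example concentration parameters, and is not required for the assignment.

```python
import numpy as np

# Illustrative numerical check of the stated fact that
# E[theta_k] = a_k / sum_k' a_k' when theta ~ Dirichlet(a_1, ..., a_K).
a = np.array([3.0, 1.0, 2.0])                   # arbitrary example parameters
samples = np.random.dirichlet(a, size=100_000)  # each row is a probability vector
print(samples.mean(axis=0))                     # approx. [0.5, 0.167, 0.333]
print(a / a.sum())                              # exact: a_k / sum_k' a_k'
```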

3. [4pts] Factor Analysis. This question is about the EM algorithm. Since some of you will have seen EM in more detail than others before reading week, we have decided to give you the 4 points for free. So you don't need to submit a solution to this part if you don't want to. But we recommend you make an effort anyway, since you probably know enough to solve it, and it will help you practice the course material.

In lecture, we covered the EM algorithm applied to mixture of Gaussians models. In this question, we'll look at another interesting example of EM, namely factor analysis. This is a model very similar in spirit to PCA: we have data in a high-dimensional space, and we'd like to summarize it with a lower-dimensional representation. Unlike PCA, we formulate the problem in terms of a probabilistic model. We assume the latent code vector z is drawn from a standard Gaussian distribution N(0, I), and that the observations are drawn from a diagonal-covariance Gaussian whose mean is a linear function of z. We'll consider the slightly simplified case of scalar-valued z. The probabilistic model is given by:

    z ~ N(0, 1)
    x | z ~ N(zu, Σ),

where Σ = diag(σ_1^2, ..., σ_D^2). Note that the observation model can be written in terms of coordinates:

    x_j | z ~ N(z u_j, σ_j^2).

We have a set of observations {x^{(i)}}_{i=1}^N, and z is a latent variable, analogous to the mixture component in a mixture-of-Gaussians model. In this question, we'll derive both the E-step and the M-step for the EM algorithm. If you don't feel like you understand the EM algorithm yet, don't worry; we'll walk you through it, and the question will be mostly mechanical.

(a) E-step (2pts). In this step, our job is to calculate the statistics of the posterior distribution q(z) = p(z | x) which we'll need for the M-step. In particular, your job is to find formulas for the (univariate) statistics:

    m = E[z | x] =
    s = E[z^2 | x] =

Tips:

- Compare the model here with the linear Gaussian model of the Appendix. Note that z here is a scalar, while the Appendix gives the more general formulation where x and z are both vectors.
- Determine p(z | x). To help you check your work: p(z | x) is a univariate Gaussian distribution whose mean is a linear function of x, and whose variance does not depend on x.
- Once you have figured out the mean and variance, that will give you the conditional expectations.

(b) M-step (2pts). In this step, we need to re-estimate the parameters of the model. The parameters are u and Σ = diag(σ_1^2, ..., σ_D^2). For this part, your job is to derive a formula for u_new that maximizes the expected log-likelihood, i.e.,

    u_new ← arg max_u (1/N) Σ_{i=1}^N E_{q(z^{(i)})}[ log p(z^{(i)}, x^{(i)}) ].

(Recall that q(z) is the distribution computed in part (a).) This is the new estimate obtained by the EM procedure, and will be used again in the next iteration of the E-step. Your answer should be given in terms of the m^{(i)} and s^{(i)} from the previous part. (I.e., you don't need to expand out the formulas for m^{(i)} and s^{(i)} in this step, because if you were implementing this algorithm, you'd use the values m^{(i)} and s^{(i)} that you previously computed.)

Tips:

- Expand log p(z^{(i)}, x^{(i)}) to log p(z^{(i)}) + log p(x^{(i)} | z^{(i)}) (log is the natural logarithm).
- Expand out the PDF of the Gaussian distribution.
- Apply linearity of expectation. You should wind up with terms proportional to E_{q(z^{(i)})}[z^{(i)}] and E_{q(z^{(i)})}[(z^{(i)})^2]. Replace these expectations with m^{(i)} and s^{(i)}. You should get an equation that does not mention z^{(i)}.
- In order to find the maximum likelihood parameter u_new, you need to take the derivative with respect to u_j, set it to zero, and solve for u_new.
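Before reusing m^{(i)} and s^{(i)} in the M-step, it can be reassuring to check the closed-form expressions you derive in part (a) against a brute-force numerical computation of the posterior moments. The sketch below does this with a dense grid over z; all parameter values are arbitrary, and nothing here is required for the assignment.

```python
import numpy as np

# Brute-force check of closed-form expressions for m = E[z | x] and
# s = E[z^2 | x], using a dense grid over z. Values are illustrative only.
rng = np.random.default_rng(0)
D = 5
u = rng.normal(size=D)                  # loading vector
sigma_sq = rng.uniform(0.5, 2.0, D)     # diagonal variances sigma_j^2
x = rng.normal(size=D)                  # one observation

z_grid = np.linspace(-10.0, 10.0, 20001)
# log p(z) + log p(x | z), up to additive constants that cancel below
log_post = (-0.5 * z_grid**2
            - 0.5 * np.sum((x[None, :] - z_grid[:, None] * u[None, :])**2
                           / sigma_sq[None, :], axis=1))
post = np.exp(log_post - log_post.max())
post /= post.sum()                      # normalized weights on the grid

m_numeric = np.sum(post * z_grid)       # compare with your formula for m
s_numeric = np.sum(post * z_grid**2)    # compare with your formula for s
print(m_numeric, s_numeric)
```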

(c) M-step, cont'd (optional). Find the M-step update for the observation variances {σ_j^2}_{j=1}^D. This can be done in a similar way to part (b).
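If you do implement the resulting updates, one convenient test is to run EM on synthetic data drawn from the model itself and check that the true parameters are recovered (u only up to sign, since the prior on z is symmetric). The sketch below is an illustrative data generator for that purpose; the sizes, seed, and variable names are made up and are not part of the starter code.

```python
import numpy as np

# Illustrative synthetic data drawn from the factor analysis model above,
# useful for sanity-checking an EM implementation: with enough data the
# recovered u should match u_true up to sign, and the recovered variances
# should approach sigma_sq_true.
rng = np.random.default_rng(411)
N, D = 10_000, 8
u_true = rng.normal(size=D)
sigma_sq_true = rng.uniform(0.1, 1.0, D)

z = rng.normal(size=N)                                    # z^(i) ~ N(0, 1)
noise = rng.normal(size=(N, D)) * np.sqrt(sigma_sq_true)  # diagonal Gaussian noise
X = z[:, None] * u_true[None, :] + noise                  # x^(i) | z^(i) ~ N(z u, Sigma)
```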

Appendix: Some Properties of Conditional Gaussians

Consider a multivariate Gaussian random variable z with mean µ and covariance matrix Λ^{−1} (Λ is the inverse of the covariance matrix and is called the precision matrix). We denote this by

    p(z) = N(z | µ, Λ^{−1}).

Now consider another Gaussian random variable x, whose mean is an affine function of z (in a form to be made clear soon), and whose covariance L^{−1} is independent of z. The conditional distribution of x given z is

    p(x | z) = N(x | Az + b, L^{−1}).

Here the matrix A and the vector b are of appropriate dimensions.

In some problems, we are interested in knowing the distribution of z given x, or the marginal distribution of x. One can apply Bayes rule to find the conditional distribution p(z | x). After some calculations, we can obtain the following useful formulae:

    p(x) = N(x | Aµ + b, L^{−1} + AΛ^{−1}A^T)
    p(z | x) = N(z | C(A^T L(x − b) + Λµ), C)

with

    C = (Λ + A^T L A)^{−1}.
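These are standard linear-Gaussian identities, and the marginal formula in particular is easy to spot-check by simulation. The sketch below is only an illustrative Monte Carlo check with arbitrary, made-up dimensions and parameters.

```python
import numpy as np

# Illustrative Monte Carlo spot-check of the marginal formula
# p(x) = N(x | A mu + b, L^{-1} + A Lambda^{-1} A^T).
rng = np.random.default_rng(0)
dz, dx = 2, 3

def random_precision(d):
    M = rng.normal(size=(d, d))
    return np.linalg.inv(M @ M.T + d * np.eye(d))  # inverse of a random PD covariance

mu = rng.normal(size=dz)
Lam = random_precision(dz)        # precision of z
L = random_precision(dx)          # precision of x given z
A = rng.normal(size=(dx, dz))
b = rng.normal(size=dx)

n = 200_000
z = rng.multivariate_normal(mu, np.linalg.inv(Lam), size=n)
x = z @ A.T + b + rng.multivariate_normal(np.zeros(dx), np.linalg.inv(L), size=n)

print(x.mean(axis=0))                                   # should be close to:
print(A @ mu + b)
print(np.cov(x.T))                                      # should be close to:
print(np.linalg.inv(L) + A @ np.linalg.inv(Lam) @ A.T)
```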