Massachusetts Institute of Technology
6.867 Machine Learning, Fall 2006

Problem Set 5
Due Date: Thursday, Nov 30, 12:00 noon

You may submit your solutions in class or in the box.

1. Wilhelm and Klaus are friends, but Wilhelm only speaks English and Klaus only speaks German. Klaus has sent Wilhelm an urgent message, and Wilhelm is locked in a room with only the proceedings of the European Parliament (in both German and English) to keep him company. Luckily, Wilhelm is a machine learning expert. He decides to treat Klaus as a "noisy channel": assume that thoughts in Klaus's head start out in English, but before they are emitted, they pass through a medium that replaces the English words with new (German) words and rearranges them from the English word-ordering into a new (German) ordering, all according to some unknown distributions. Next, he plans to come up with parametrizations for the distributions and then to optimize the parameters over the European Parliament data he is conveniently locked up with. Wilhelm thus arrives at the following optimization problem:

$\arg\max_e P(e \mid g) = \arg\max_e \frac{P(g \mid e)\,P(e)}{P(g)} = \arg\max_e P(g \mid e)\,P(e),$

where $e$ and $g$ denote English and German sentences, respectively. This corresponds to Wilhelm's noisy-channel characterization of Klaus's head.

Next, Wilhelm decides on parameterizations for the two terms in his optimization criterion. To estimate probabilities of English sentences, he will pretend that English is a Markov process (that is, that English speakers choose each word in a sentence based only on the last two words they have chosen):

$P(e) = \prod_{i=1}^{|e|} LM(e_i \mid e_{i-1}, e_{i-2}),$   (1)

where $|e|$ denotes the length of the English sentence $e$, $e_i$ denotes the $i$th word of $e$, and $e_0$ and $e_{-1}$ correspond to start symbols $S_0$ and $S_{-1}$, respectively, which are placed before every English sentence for simplicity. For the other term, Wilhelm is aware that both the words and their order may change, so he introduces a hidden variable: the alignment between the words.

$P(g, a \mid e) = \prod_{j=1}^{|g|} T(g_j \mid e_{a_j})\, D(a_j \mid j, |e|, |g|)$   (2)
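As a concrete illustration of Eqns 1 and 2, here is a minimal sketch (in Python rather than the provided MATLAB templates) that evaluates the trigram language model and the joint alignment model on a toy sentence pair; the sentences, alignment, and all table values are made up for illustration and are not part of the provided data.

```python
# Illustrative sketch only: evaluating Eqn 1 and Eqn 2 on a toy sentence pair.
# All words, alignments, and probabilities are hypothetical.
import numpy as np

e = ["the", "house", "is", "small"]   # English sentence e_1 ... e_|e|
g = ["das", "haus", "ist", "klein"]   # German sentence g_1 ... g_|g|
a = [1, 2, 3, 4]                      # alignment: g_j is generated from e_{a_j}

# Hypothetical trigram LM table: LM[(w_i, w_{i-1}, w_{i-2})] = LM(e_i | e_{i-1}, e_{i-2})
LM = {("the", "S0", "S-1"): 0.20, ("house", "the", "S0"): 0.10,
      ("is", "house", "the"): 0.30, ("small", "is", "house"): 0.05}

# Hypothetical translation table: T[(g_word, e_word)] = T(g_j | e_{a_j})
T = {("das", "the"): 0.4, ("haus", "house"): 0.6,
     ("ist", "is"): 0.7, ("klein", "small"): 0.5}

# Hypothetical distortion table: D[(i, j, len_e, len_g)] = D(a_j = i | j, |e|, |g|)
D = {(j, j, 4, 4): 0.7 for j in range(1, 5)}   # favors the diagonal alignment

def log_p_e(e, LM):
    """Eqn 1: log P(e) under the trigram LM with start symbols S0 and S-1."""
    padded = ["S-1", "S0"] + e
    return sum(np.log(LM[(padded[i], padded[i - 1], padded[i - 2])])
               for i in range(2, len(padded)))

def log_p_g_a_given_e(g, a, e, T, D):
    """Eqn 2: log P(g, a | e) for a fixed alignment a (1-based positions)."""
    total = 0.0
    for j, (g_word, a_j) in enumerate(zip(g, a), start=1):
        total += np.log(T[(g_word, e[a_j - 1])])       # T(g_j | e_{a_j})
        total += np.log(D[(a_j, j, len(e), len(g))])   # D(a_j | j, |e|, |g|)
    return total

print(log_p_e(e, LM) + log_p_g_a_given_e(g, a, e, T, D))  # log P(e) + log P(g, a | e)
```

Summing this joint score over all possible alignments yields the marginal $P(g \mid e)$ derived next.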

Thus,

$P(g \mid e) = \sum_a P(g, a \mid e) = \sum_a \prod_{j=1}^{|g|} T(g_j \mid e_{a_j})\, D(a_j \mid j, |e|, |g|) = \prod_{j=1}^{|g|} \sum_{a_j=1}^{|e|} T(g_j \mid e_{a_j})\, D(a_j \mid j, |e|, |g|).$

This set-up is known as IBM Model 2.

(a) To perform the optimization, Wilhelm decides to use the EM algorithm. Derive the updates of the parameters LM, T, and D. We place no constraints on these distributions other than that they are distributions (cf. regularization below).

(b) The initial parameter settings are important. Give an example of a bad choice. What should you choose?

(c) Insert the updates and initial parameter settings into the provided templates ibm2_train_lm.m and ibm2_train.m, and train the system using the provided data europarl.m. Use the provided decoder ibm2_beam_decoder.m to decode Klaus's message klaus.m with the trained system. The system is able to achieve surprisingly good results with a very simple probability model! (Although the training is easy, the decoding is non-trivial; check out the source of the decoder.)

(d) Suppose Wilhelm believes that his model is reordering the German too aggressively. He is considering regularizing the alignments. He has several ideas; what effect do you think each of these regularization techniques will have?

i. Regularize $D(a_j \mid j, |e|, |g|)$ toward a uniform distribution.

ii. Regularize $D(a_j \mid j, |e|, |g|)$ toward the distribution $D\big(a_j = j\,\tfrac{|e|}{|g|} \mid j, |e|, |g|\big) = 1$.

iii. Estimate first a simpler distribution $D(a_j \mid |e|, |g|)$ and regularize $D(a_j \mid j, |e|, |g|)$ toward it.
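Returning to the expression for $P(g \mid e)$ above: the last equality is what makes the model tractable, since the alignment variables $a_1, \ldots, a_{|g|}$ enter independently and the sum over all $|e|^{|g|}$ alignments factorizes into a product of per-position sums. A minimal sketch of that computation, reusing the toy dictionary tables T and D from the earlier snippet (again hypothetical, not the provided data or code):

```python
import numpy as np

def log_p_g_given_e(g, e, T, D, eps=1e-12):
    """log P(g | e) = sum_j log sum_i T(g_j | e_i) * D(i | j, |e|, |g|).

    T and D are the dictionary tables from the previous sketch; entries missing
    from the toy tables are treated as probability ~0 via eps.
    """
    total = 0.0
    for j, g_word in enumerate(g, start=1):               # 1-based German position j
        s = sum(T.get((g_word, e_word), 0.0) * D.get((i, j, len(e), len(g)), 0.0)
                for i, e_word in enumerate(e, start=1))   # sum over a_j = 1, ..., |e|
        total += np.log(s + eps)
    return total

print(log_p_g_given_e(g, e, T, D))   # marginal score of the toy German sentence
```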

2. In this problem, we will explore the combined use of regression and Expectation Maximization. The motivating application is as follows: we'd like to understand how the expression levels of genes (in a cell) change with time. The expression level of a gene indicates, roughly, the amount of gene product in the cell. The experimental data we have available are noisy measurements of this level (for each gene) at certain time points. More precisely, we have $m$ genes ($g_1, \ldots, g_m$) and measurements are available for $r$ timepoints ($t_1, \ldots, t_r$). The data matrix is $Y = [y_1, \ldots, y_m]^T$, where each row $y_i^T = [y_{i1}, \ldots, y_{ir}]$ corresponds to the expression levels of gene $g_i$ across the $r$ timepoints. Our goal is to estimate continuous functions, each $\hat{f}_i(t)$ capturing the expression level of one gene $g_i$. We'll call these expression curves.

Clearly, if we treat each gene's expression as independent of the others, we would need to perform $m$ unrelated regression tasks. We can do better, however, since we expect that there are groups of genes with similar expression curves. We hypothesize that the expression curves can be grouped into $k$ classes and that, within each group, the functions $f_i(t)$ look very similar. When estimating $\hat{f}_i(t)$ we can pool together data from all the genes in the same class. We have two problems to solve: (1) figure out which class each gene belongs to, and (2) given class assignments, estimate the expression curves. As you might guess, we will use the EM algorithm. However, we still need to understand how to estimate continuous curves from discrete data points.

To perform regression, we will use an idea similar to kernel regression. Recall that even if $f(t)$ is some non-linear function of $t$, it is possible to find a feature mapping $\phi(t) = [\phi_1(t), \ldots, \phi_p(t)]^T$ such that $f(t)$ is a linear function of the $\phi$'s: $f(t) = \sum_j w_j \phi_j(t)$. In class, we discussed how to implement this idea to perform kernel regression without explicitly constructing the feature mapping. In the current setting, we extend this to include a prior over the parameters $w = [w_1, \ldots, w_p]^T$. We set this to be simply a spherical multivariate Gaussian $N(w; 0, \sigma_w^2 I)$. Now, given any set of time points $t_1, \ldots, t_r$, we imagine generating function values at those time points according to the following procedure:

Step 1: draw one sample $w \sim N(0, \sigma_w^2 I)$

Step 2: generate function values for the time points of interest:

$f(t_k) = \sum_{j=1}^{p} w_j \phi_j(t_k), \quad k = 1, \ldots, r$   (3)

Since the $w_j$ are normally distributed, so is $f = [f(t_1), \ldots, f(t_r)]^T$. The distribution of this vector is $N(f; 0, K)$, where $K_{kl} = K(t_k, t_l) = \sigma_w^2\, \phi(t_k)^T \phi(t_l)$. Looks familiar? Let's try to understand it a bit better. We have a distribution over function values at any set of time points. In other words, we have defined a prior distribution over functions! The kernel function $K(t, t')$, where the inputs are the time points, gives rise to a Gram matrix $K_{kl} = K(t_k, t_l)$ over any finite set of inputs. This probability distribution over functions is known as a Gaussian Process (GP for short). The choice of the kernel $K(t, t')$ controls the kind of functions that are sampled. Consider the following kernel, similar to the radial basis function kernel seen before:

$K(t, t') = \sigma_f^2 \exp\!\left(-\frac{(t - t')^2}{2\rho^2}\right)$   (4)

The hyperparameter $\rho$ controls the time span over which the function values are strongly coupled. A large $\rho$ implies, for example, that the sampled functions would appear roughly constant across a long time range. $\sigma_f$ scales this correlation by a constant factor.

(a) Match each of the following four figures with the four kernels defined below. Each figure shows some functions sampled using a particular kernel. Note that these functions are drawn by selecting finely spaced time points and drawing a sample of the corresponding function values at these points from $N(f; 0, K)$, where $K$ is the Gram matrix over the same points. Here are the kernels:

i. $3 \exp\!\left(-\frac{(t - t')^2}{2(2.1)^2}\right)$

ii. $2 \exp\!\left(-\frac{(t - t')^2}{2(0.7)^2}\right)$

iii. $2 \exp\!\left(-\frac{\sin^2(\pi (t - t')/4)}{2(0.3)^2}\right)$

iv. $2 \exp\!\left(-\frac{(t - t')^2}{2(0.05)^2}\right) + 0.1\,(t\,t')^2$
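The matching is easiest to reason about by drawing such samples yourself. Here is a minimal NumPy sketch (not part of the provided code) that builds the Gram matrix over a fine grid of time points and samples $f \sim N(0, K)$:

```python
# Illustrative sketch (not part of the provided .m files): drawing sample
# functions from a GP prior by sampling f ~ N(0, K) at finely spaced points.
import numpy as np

def rbf_kernel(t1, t2, sigma_f=1.0, rho=1.0):
    """Eqn 4: sigma_f^2 * exp(-(t - t')^2 / (2 rho^2))."""
    return sigma_f**2 * np.exp(-(t1 - t2)**2 / (2.0 * rho**2))

def sample_gp_prior(kernel, t, n_samples=5, jitter=1e-6):
    """Draw n_samples vectors of function values from N(0, K) over the grid t."""
    K = kernel(t[:, None], t[None, :])                    # Gram matrix on the grid
    L = np.linalg.cholesky(K + jitter * np.eye(len(t)))   # small jitter for stability
    return (L @ np.random.randn(len(t), n_samples)).T     # each row is one sample

t = np.linspace(0, 10, 200)
samples = sample_gp_prior(lambda a, b: rbf_kernel(a, b, sigma_f=np.sqrt(2.0), rho=0.7), t)
```

Plotting several such samples for each of the four kernels (e.g., with matplotlib) makes the effect of $\rho$ on the wiggliness of the curves, and of the periodic and polynomial terms, easy to see.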

[Figure 1: Each subfigure corresponds to one kernel.]

Let us now consider how we will use GPs to simultaneously perform regression across genes belonging to the same class. Since we assume they have similar expression curves, we will model them as samples from the same Gaussian process. Our model for the data is as follows. There are $k$ classes in the data and each gene belongs to a single class. Let the genes in class $l$ be $g_{l_1}, g_{l_2}, \ldots, g_{l_{m_l}}$, where $m_l$ is the number of genes in that class. The expression curves for these genes are assumed to be samples from the same GP. The kernels of the $k$ GPs (one for each class) all have the same parametric form, but different hyperparameters $\theta = (\sigma_f, \rho, \sigma_n)$:

$K(t, t') = \sigma_f^2 \exp\!\left(-\frac{(t - t')^2}{2\rho^2}\right) + \sigma_n^2\, \delta(t, t'),$   (5)

where

$\delta(t, t') = \begin{cases} 1 & \text{if } t = t' \\ 0 & \text{otherwise} \end{cases}$   (6)

This is essentially the same as before, except for the second term, which accounts for measurement noise. In other words, the responses $y$ are modeled as $y(t) = f(t) + \epsilon_t$, $\epsilon_t \sim N(0, \sigma_n^2)$, and $f(t)$ is a Gaussian process with kernel given by Eqn 4. Thus, for any time points $t_1, \ldots, t_r$, the response vector $y$ has a distribution $N(0, K)$, where $K$ is the Gram matrix from the kernel in Eqn 5.

(b) We have provided for you a function log_likelihood_gp(params, t, Yobs), where params specifies the hyperparameters of the kernel, t is a vector of r timepoints, and Yobs is a q x r matrix of observed gene expression values (for q genes) at those timepoints. For each of the q sets of observations, the function computes the log-likelihood that the observations were generated by the GP specified by params. Fill in the missing lines of the code. Attach a printout of your code (you only need to show the parts you changed).

We now go back to our original problem and set up an Expectation Maximization framework to solve it. Suppose we know that the genes in the data cluster into $k$ classes (i.e., $k$ is a user-specified parameter). We start off by guessing some initial values for the hyperparameters of the $k$ GPs. In each E-step, we compute the posterior assignment probability $P(k \mid i)$, i.e., the probability that gene $g_i$ is in class $k$. To simplify the M-step we will turn these probabilities into hard assignments. In each M-step, we compute the MLE estimates of the hyperparameters of the GPs associated with the classes.

(c) To perform EM, we have provided you a function [W,V,L] = em_gp(t, Yobs, k, init_class). The code is missing some steps in the Expectation sub-function. Fill them in. Attach a printout of your code for the Expectation function.
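For intuition about part (b): under this model, the log-likelihood of one gene's observation vector is a zero-mean multivariate Gaussian log-density with covariance given by the Gram matrix of Eqn 5. Below is an illustrative NumPy version of that computation (a sketch, not the provided log_likelihood_gp template; it assumes $\sigma_n > 0$ so the Cholesky factorization exists):

```python
# Illustrative sketch (not the provided MATLAB code): log-likelihood of each
# row of Yobs under a zero-mean GP with the kernel of Eqn 5.
import numpy as np

def gp_log_likelihood(params, t, Yobs):
    """params = (sigma_f, rho, sigma_n); t: (r,) timepoints; Yobs: (q, r) matrix.

    Returns a length-q array with log N(y_i; 0, K) for each observation row y_i.
    """
    sigma_f, rho, sigma_n = params
    t = np.asarray(t, dtype=float)
    diff = t[:, None] - t[None, :]
    K = sigma_f**2 * np.exp(-diff**2 / (2.0 * rho**2)) + sigma_n**2 * np.eye(len(t))
    L = np.linalg.cholesky(K)
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Yobs.T))   # K^{-1} y_i, column-wise
    quad = np.sum(Yobs.T * alpha, axis=0)                      # y_i^T K^{-1} y_i
    return -0.5 * quad - 0.5 * logdet - 0.5 * len(t) * np.log(2.0 * np.pi)
```

In the E-step described above, each gene's row would be scored in this way under every class's hyperparameters before the assignment probabilities are turned into hard assignments.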

(d) Using the dataset provided with this problem, run em_gp with different choices of k (k = 2, 3, 4, 5). We have provided a function plot_results(W,V,t,Yobs,k,nn) which you may find useful: you can use it to plot nn randomly chosen curves from each of the k classes. What choice of k seems most appropriate?

(e) Gene expression data often has missing values, i.e., you may not know the value $y_{ij}$, the expression of gene $g_i$ at time $t_j$. With some machine learning methods, handling these missing values can be a big hassle. But with our model, the missing-values problem can be handled very simply. How?

(f) (Optional: Extra Credit) Our initial choice of hyperparameters is constructed as follows: we assign genes to specific classes, either randomly or based upon some insights. Using this class assignment, the initial hyperparameters are computed by an MLE approach. Describe a method that could be used to generate a good initial class assignment.

(g) (Optional: Extra Credit) Since we have defined a prior over functions, we could turn the clustering problem into a Bayesian model selection problem. We assume that each cluster has a single underlying expression curve $f(t)$, sampled from a fixed GP model, and that the responses for each gene in the cluster are subsequently sampled from $y(t) = f(t) + \epsilon_t$, $\epsilon_t \sim N(0, \sigma_n^2)$, using the same $f(t)$. Write an expression for the marginal likelihood of the data corresponding to a fixed assignment of genes into clusters.