Machine Learning, Fall 2011: Homework 5

10-601 Machine Learning, Fall 2011: Homework 5
Machine Learning Department, Carnegie Mellon University
Due: ???????????, 5pm

Instructions

There are ???? questions on this assignment. The ?????? question involves coding, so start early. Please submit your completed homework to Sharon Cavlovich (GHC 8215) by ???????????. Submit your homework as ????? separate sets of pages, one for each question. Please staple or paperclip all pages from a single question together, but DO NOT staple pages from different questions together. This will make it much easier for the TAs to split up your homework for grading. Include your name and email address on each set.

1 Hidden Markov Models

[Figure: an HMM with hidden states S_1, S_2, S_3 and corresponding observations O_1, O_2, O_3.]

1. Assume that we have the Hidden Markov Model (HMM) depicted in the figure above. If each of the states can take on k different values and a total of m different observations are possible across all states, how many parameters are required to fully define this HMM? Justify your answer.

SOLUTION: There are a total of three probability distributions that define the HMM: the initial probability distribution, the transition probability distribution, and the emission probability distribution. There are a total of k states, so k parameters are required to define the initial probability distribution (we'll ignore all of the -1s for this problem to make things cleaner). For the transition distribution, we can transition from any one of the k states to any of the k states (including staying in the same state), so $k^2$ parameters are required. Then, we need a total of km parameters for the emission probability distribution, since each of the k states can emit each of the m observations. Thus, the total number of parameters required is $k + k^2 + km$. Note that the number of parameters does not depend on the length of the HMM.
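As a quick sanity check (not part of the original assignment), the count $k + k^2 + km$ can be computed for the binary example used in the later questions (k = 2 states, m = 2 observation symbols); the helper name below is ours:

```python
def count_hmm_params(k, m):
    """Parameters for an HMM with k hidden states and m observation symbols,
    counting full (unnormalized) distributions as in the solution above."""
    initial = k         # pi(s) for each state
    transition = k * k  # P(s' | s) for every state pair
    emission = k * m    # P(o | s) for every state/observation pair
    return initial + transition + emission

print(count_hmm_params(2, 2))  # 2 + 4 + 4 = 10
```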

2. What conditional independences hold in this HMM? Justify your answer.

SOLUTION: The past is independent of the future, given the present. More precisely, we have that
$\{S_1, \ldots, S_{t-1}, O_1, \ldots, O_{t-1}\} \perp \{S_{t+1}, \ldots, S_n, O_{t+1}, \ldots, O_n\} \mid S_t$.
Note that these independence relations follow directly from d-separation, because an HMM is just a Bayes net.

Suppose that we have binary states labeled A and B, binary observations labeled 0 and 1, and the initial, transition, and emission probabilities as in the following tables.

(a) Initial probs.
State   P(S_1)
A       0.99
B       0.01

(b) Transition probs.
S_1   S_2   P(S_2 | S_1)
A     A     0.99
A     B     0.01
B     A     0.01
B     B     0.99

(c) Emission probs.
S   O   P(O | S)
A   0   0.8
A   1   0.2
B   0   0.1
B   1   0.9

3. Using the forward algorithm, compute the probability that we observe the sequence O_1 = 0, O_2 = 1, and O_3 = 0. Show your work (i.e., show each of your alphas).

SOLUTION: The values of the different alphas and the probability of the sequence are as follows (notice that your answers may be slightly different if you kept a different number of decimal places):

$\alpha_1^A = 0.8 \times 0.99 = 0.792$
$\alpha_1^B = 0.1 \times 0.01 = 0.001$
$\alpha_2^A = 0.2 \times (0.792 \times 0.99 + 0.001 \times 0.01) = 0.156818$
$\alpha_2^B = 0.9 \times (0.792 \times 0.01 + 0.001 \times 0.99) = 0.008019$
$\alpha_3^A = 0.8 \times (0.156818 \times 0.99 + 0.008019 \times 0.01) = 0.124264$
$\alpha_3^B = 0.1 \times (0.156818 \times 0.01 + 0.008019 \times 0.99) = 0.000950699$
$P(\{O_t\}_{t=1}^T) = \alpha_3^A + \alpha_3^B = 0.1252147$

4. Using the backward algorithm, compute the probability that we observe the aforementioned sequence O_1 = 0, O_2 = 1, and O_3 = 0. Again, show your work (i.e., show each of your betas).

SOLUTION: The values of the different betas and the probability of the sequence are as follows (again, your answers may vary slightly due to rounding):

$\beta_3^A = 1$
$\beta_3^B = 1$
$\beta_2^A = 0.99 \times 0.8 \times 1 + 0.01 \times 0.1 \times 1 = 0.793$
$\beta_2^B = 0.01 \times 0.8 \times 1 + 0.99 \times 0.1 \times 1 = 0.107$
$\beta_1^A = 0.99 \times 0.2 \times 0.793 + 0.01 \times 0.9 \times 0.107 = 0.157977$
$\beta_1^B = 0.01 \times 0.2 \times 0.793 + 0.99 \times 0.9 \times 0.107 = 0.096923$
$P(\{O_t\}_{t=1}^T) = 0.792 \times 0.157977 + 0.001 \times 0.096923 = 0.1252147$

5. Do your results from the forward and backward algorithm agree?

SOLUTION: Yes, the results from the forward and backward algorithm agree!
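For readers who want to check the arithmetic, here is a minimal numpy sketch of the forward and backward passes for this specific three-step HMM. It is not part of the original assignment, and the array names (pi, T, E, obs) are ours:

```python
import numpy as np

pi = np.array([0.99, 0.01])            # initial probs, index 0 = state A, 1 = state B
T = np.array([[0.99, 0.01],            # T[i, j] = P(S_{t+1} = j | S_t = i)
              [0.01, 0.99]])
E = np.array([[0.8, 0.2],              # E[i, o] = P(O = o | S = i)
              [0.1, 0.9]])
obs = [0, 1, 0]                        # O_1 = 0, O_2 = 1, O_3 = 0

# Forward pass: alpha[t, i] = P(O_1..O_{t+1}, S_{t+1} = i) with 0-based t.
alpha = np.zeros((3, 2))
alpha[0] = pi * E[:, obs[0]]
for t in range(1, 3):
    alpha[t] = E[:, obs[t]] * (alpha[t - 1] @ T)

# Backward pass: beta[t, i] = P(O_{t+2}..O_3 | S_{t+1} = i) with 0-based t.
beta = np.ones((3, 2))
for t in range(1, -1, -1):
    beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])

print(alpha)                        # rows match the alphas above
print(beta)                         # rows match the betas above
print(alpha[-1].sum())              # ~0.1252147 from the forward pass
print((alpha[0] * beta[0]).sum())   # same value recovered via the backward pass
```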

6. Using the forward-backward algorithm, compute and report the most likely setting for each state. Hint: you already have the alphas and betas from the above sub-problems.

SOLUTION: We already have the alphas and betas from the previous two computations. Note that the most likely state at time t is state A if $\alpha_t^A \beta_t^A > \alpha_t^B \beta_t^B$ and state B if the inequality is reversed. We predict that A is the most likely setting for every state. The relevant values for the computation are:

$\alpha_1^A \beta_1^A = 0.792 \times 0.157977 = 0.1251178$
$\alpha_1^B \beta_1^B = 0.001 \times 0.096923 = 9.6923 \times 10^{-5}$
$\alpha_2^A \beta_2^A = 0.156818 \times 0.793 = 0.1243567$
$\alpha_2^B \beta_2^B = 0.008019 \times 0.107 = 0.000858033$
$\alpha_3^A \beta_3^A = 0.124264 \times 1 = 0.124264$
$\alpha_3^B \beta_3^B = 0.000950699 \times 1 = 0.000950699$

7. Use the Viterbi algorithm to compute and report the most likely sequence of states. Show your work (i.e., show each of your Vs).

SOLUTION: The Viterbi algorithm predicts that the most likely sequence of states is A, A, A. The relevant computations are:

$V_1^A = 0.99 \times 0.8 = 0.792$
$V_1^B = 0.01 \times 0.1 = 0.001$
$V_2^A = 0.2 \times \max(0.792 \times 0.99,\ 0.001 \times 0.01) = 0.156816$
$V_2^B = 0.9 \times \max(0.792 \times 0.01,\ 0.001 \times 0.99) = 0.007128$
$V_3^A = 0.8 \times \max(0.156816 \times 0.99,\ 0.007128 \times 0.01) = 0.1241983$
$V_3^B = 0.1 \times \max(0.156816 \times 0.01,\ 0.007128 \times 0.99) = 0.000705672$

8. Is the most likely sequence of states the same as the sequence comprised of the most likely setting for each individual state? Does this make sense? Provide a 1-2 sentence justification for your answer.

SOLUTION: In this case, the two are the same. This is because the probability of changing states is so low. However, in general, the two need not be the same. In fact, we have seen an example in class where the two sequences are different.
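A short numpy sketch of both decodings on the same model (ours, not part of the assignment): posterior (forward-backward) decoding picks the state maximizing $\alpha_t \beta_t$ at each step, while Viterbi maximizes over whole state sequences. For this model both return A, A, A.

```python
import numpy as np

# Same model and observations as in the previous sketch.
pi = np.array([0.99, 0.01])
T = np.array([[0.99, 0.01], [0.01, 0.99]])
E = np.array([[0.8, 0.2], [0.1, 0.9]])
obs = [0, 1, 0]
states = ['A', 'B']

# Recompute alpha and beta as before (or reuse them).
alpha = np.zeros((3, 2)); beta = np.ones((3, 2))
alpha[0] = pi * E[:, obs[0]]
for t in range(1, 3):
    alpha[t] = E[:, obs[t]] * (alpha[t - 1] @ T)
for t in range(1, -1, -1):
    beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])

# Posterior (forward-backward) decoding: most likely state at each time step.
posterior_path = [states[i] for i in (alpha * beta).argmax(axis=1)]

# Viterbi decoding: most likely joint sequence of states, with backpointers.
V = np.zeros((3, 2)); back = np.zeros((3, 2), dtype=int)
V[0] = pi * E[:, obs[0]]
for t in range(1, 3):
    scores = V[t - 1][:, None] * T        # scores[i, j]: best path ending in i, then i -> j
    back[t] = scores.argmax(axis=0)
    V[t] = E[:, obs[t]] * scores.max(axis=0)
path = [V[2].argmax()]
for t in range(2, 0, -1):
    path.append(back[t][path[-1]])
viterbi_path = [states[i] for i in reversed(path)]

print(posterior_path, viterbi_path)  # both ['A', 'A', 'A'] for this model
```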

2 Neural Networks [Mladen Kolar, 20 points]

In this problem, we will consider neural networks constructed using the following two types of activation functions instead of sigmoid functions:

identity: $g_I(x) = x$
step function: $g_S(x) = \begin{cases} 1 & \text{if } x \ge 0, \\ 0 & \text{otherwise.} \end{cases}$

For example, the following figure represents a neural network with one input x, a single hidden layer with K units having step function activations, and a single output with identity activation. The output can be written as

$\text{out}(x) = g_I\left(w_0 + \sum_{k=1}^{K} w_k\, g_S(w_0^k + w_1^k x)\right)$

Now you will construct some neural networks using these activation functions.

1. [5 pts] Consider the step function
$u(x) = \begin{cases} y & \text{if } x \ge a, \\ 0 & \text{otherwise.} \end{cases}$
Construct a neural network with one input x and one hidden layer, whose response is u(x). Draw the structure of the neural network, specify the activation function for each unit (either $g_I$ or $g_S$), and specify the values for all weights in terms of a and y.

SOLUTION: One solution can be obtained as follows. Use a single hidden unit with activation $g_S$ and an identity ($g_I$) output unit, and set $w_0 = 0$, $w_1 = y$, $w_0^1 = -a$, $w_1^1 = 1$.

2. [5 pts] Now consider the indicator function
$1_{[a,b)}(x) = \begin{cases} 1 & \text{if } x \in [a, b), \\ 0 & \text{otherwise.} \end{cases}$
Construct a neural network with one input x and one hidden layer, whose response is $y \cdot 1_{[a,b)}(x)$, for given real values y, a and b. Draw the structure of the neural network, specify the activation function for each unit (either $g_I$ or $g_S$), and specify the values for all weights in terms of a, b and y.

SOLUTION: One solution can be obtained as follows. Use two hidden units with activation $g_S$ and an identity ($g_I$) output unit, and set $w_0 = 0$, $w_1 = y$, $w_2 = -y$, $w_0^1 = -a$, $w_1^1 = 1$, $w_0^2 = -b$, $w_1^2 = 1$.
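A small sketch (ours, not part of the assignment) that implements the generic network $\text{out}(x) = g_I(w_0 + \sum_k w_k g_S(w_0^k + w_1^k x))$ and checks the indicator construction from part 2; the function names and the particular values a = 1, b = 3, y = 2 are just for illustration:

```python
import numpy as np

def g_S(z):
    """Step activation: 1 if z >= 0, else 0."""
    return (z >= 0).astype(float)

def out(x, w0, w, w0_hidden, w1_hidden):
    """One-hidden-layer network: identity output over K step-activation hidden units."""
    x = np.asarray(x, dtype=float)
    hidden = g_S(np.outer(x, w1_hidden) + w0_hidden)   # shape (len(x), K)
    return w0 + hidden @ w

# Part 2 construction: response y * 1_{[a, b)}(x) with a = 1, b = 3, y = 2.
a, b, y = 1.0, 3.0, 2.0
xs = np.array([0.0, 1.0, 2.0, 2.9, 3.0, 4.0])
vals = out(xs, w0=0.0, w=np.array([y, -y]),
           w0_hidden=np.array([-a, -b]), w1_hidden=np.array([1.0, 1.0]))
print(vals)  # expected: [0. 2. 2. 2. 0. 0.]
```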

3. [10 points] A neural network with a single hidden layer can provide an arbitrarily close approximation to any 1-dimensional bounded smooth function. This question will guide you through the proof. Let $f(x)$ be any function whose domain is $[C, D]$, for real values $C < D$. Suppose that the function is Lipschitz continuous, that is, $\forall x, x' \in [C, D]$, $|f(x) - f(x')| \le L |x - x'|$, for some constant $L \ge 0$.

Use the building blocks constructed in the previous part to construct a neural network with one hidden layer that approximates this function within $\epsilon > 0$, that is, $\forall x \in [C, D]$, $|f(x) - \text{out}(x)| \le \epsilon$, where $\text{out}(x)$ is the output of your neural network given input $x$. Your network should use only the activation functions $g_I$ and $g_S$ given above. You need to specify the number $K$ of hidden units, the activation function for each unit, and a formula for calculating each weight $w_0$, $w_k$, $w_0^k$, and $w_1^k$, for each $k \in \{1, 2, \ldots, K\}$. These weights may be specified in terms of $C$, $D$, $L$ and $\epsilon$, as well as the values of $f(x)$ evaluated at a finite number of $x$ values of your choosing (you need to explicitly specify which $x$ values you use). You do not need to explicitly write the $\text{out}(x)$ function. Why does your network attain the given accuracy $\epsilon$?

SOLUTION: Here is one valid solution. The hidden units all use $g_S$ activation functions, and the output unit uses $g_I$. Let $K = \lceil (D - C)L / (2\epsilon) \rceil + 1$. For $k \in \{1, \ldots, K\}$, set $w_1^k = 1$ and $w_0^k = -(C + (k-1) \cdot 2\epsilon/L)$. For the second layer, we have $w_0 = 0$, $w_1 = f(C + \epsilon/L)$, and for $k \in \{2, 3, \ldots, K\}$, $w_k = f(C + (2k-1)\epsilon/L) - f(C + (2k-3)\epsilon/L)$. We only evaluate $f(x)$ at the points $C + (2k-1)\epsilon/L$, for $k \in \{1, \ldots, K\}$, which is a finite number of points.

Note that for any $x \in [C + (k-1) \cdot 2\epsilon/L,\ C + k \cdot 2\epsilon/L)$, exactly the first $k$ hidden units fire, so their second-layer weights telescope and the output is exactly $\text{out}(x) = f(C + (2k-1)\epsilon/L)$, the value of $f$ at the midpoint of that interval. Since $x$ is within $\epsilon/L$ of that midpoint and $f$ is $L$-Lipschitz, $|f(x) - \text{out}(x)| \le L \cdot \epsilon/L = \epsilon$.
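A sketch (ours) that builds this piecewise-constant step network for one concrete choice of $f$, $C$, $D$, $L$, and $\epsilon$ and checks the error bound numerically; the helper names and the example $f(x) = \sin(x)$ with $L = 1$ are assumptions made for illustration:

```python
import numpy as np

def g_S(z):
    return (z >= 0).astype(float)                        # step activation

def build_approximator(f, C, D, L, eps):
    """Weights for the one-hidden-layer step network from the solution above."""
    K = int(np.ceil((D - C) * L / (2 * eps))) + 1
    w0_hidden = -(C + np.arange(K) * 2 * eps / L)        # thresholds C + (k-1)*2eps/L
    w1_hidden = np.ones(K)
    midpoints = C + (2 * np.arange(1, K + 1) - 1) * eps / L
    w = np.empty(K)
    w[0] = f(midpoints[0])
    w[1:] = f(midpoints[1:]) - f(midpoints[:-1])         # telescoping second-layer weights
    def out(x):
        hidden = g_S(np.outer(np.asarray(x, float), w1_hidden) + w0_hidden)
        return hidden @ w                                # w_0 = 0
    return out

# Example: f(x) = sin(x) on [0, 3] has Lipschitz constant L = 1.
f, C, D, L, eps = np.sin, 0.0, 3.0, 1.0, 0.05
out = build_approximator(f, C, D, L, eps)
xs = np.linspace(C, D, 2001)
print(np.max(np.abs(f(xs) - out(xs))) <= eps)            # True
```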

3 Principal Component Analysis [William Bishop, ??? points]

The fourth slide presented in class on November 10th states that PCA finds principal component vectors such that the projection onto these vectors yields minimum mean squared reconstruction error. In the case of a set of sample vectors $x_1, \ldots, x_n$ and finding only the first principal component, $v$, this amounts to finding the $v$ which minimizes the objective:

$J(v) = \sum_{i=1}^{n} \| x_i - (v^T x_i) v \|^2$

To ensure that this problem has a meaningful solution, we require that $\|v\| = 1$, where $\|\cdot\|$ denotes the length of the vector. In class we also learned that we can view PCA as maximizing the variance of the data after it has been projected onto $v$. If we assume the sample vectors $x_1, \ldots, x_n$ have zero mean, we can write this as:

$O(v) = \sum_{i=1}^{n} (v^T x_i)^2 = v^T X X^T v$

where we again require $\|v\| = 1$. In the equation on the right, the columns of the matrix $X$ are made up of the sample vectors, i.e., $X = [x_1, \ldots, x_n]$. Note that to maximize $O(v)$ subject to the constraint on $v$ we can find the stationary point of the Lagrangian:

$L_O(v, \lambda) = v^T X X^T v + \lambda (1 - v^T v)$

Please show that minimizing $J(v)$ can be turned into an equivalent maximization problem which has the same Lagrangian. Note that it is sufficient to show that the two Lagrangians are the same; you do not need to find the actual stationary point.

Note: $L_O(v, \lambda)$ can equivalently be expressed as $L_O(v, \lambda) = \sum_{i=1}^{n} (v^T x_i)^2 + \lambda (1 - v^T v)$. If you are more comfortable using this form so as to not need to deal with matrices, you may do so.

Note: For those who may need a primer on Lagrange multipliers, Appendix E of the Bishop book has some basic information. All you need to solve this problem can be found in the first few pages of this appendix.

3.1 Solution

Answer: Consider minimizing $J(v)$. We can write:

$\min_{\|v\|=1} J(v) = \min_{\|v\|=1} \sum_{i=1}^{n} \| x_i - (v^T x_i) v \|^2$

In the above notation, $\min_{\|v\|=1} J(v)$ means that we minimize $J(v)$ subject to the constraint that the length of $v$ is 1. Continuing our derivation:

$\min_{\|v\|=1} J(v) = \min_{\|v\|=1} \sum_{i=1}^{n} [x_i - (v^T x_i) v]^T [x_i - (v^T x_i) v]$
$\quad = \min_{\|v\|=1} \sum_{i=1}^{n} [x_i^T - (v^T x_i) v^T][x_i - (v^T x_i) v]$
$\quad = \min_{\|v\|=1} \sum_{i=1}^{n} \left( x_i^T x_i - (v^T x_i) v^T x_i - (v^T x_i) x_i^T v + (v^T x_i)(v^T x_i) v^T v \right)$
$\quad = \min_{\|v\|=1} \sum_{i=1}^{n} \left( x_i^T x_i - v^T x_i x_i^T v - v^T x_i x_i^T v + (v^T x_i x_i^T v)(v^T v) \right)$

In the last line we have used the facts that $(v^T x_i)(v^T x_i) = v^T x_i x_i^T v$, $(v^T x_i)(x_i^T v) = v^T x_i x_i^T v$, and $(v^T x_i)(v^T x_i)(v^T v) = (v^T x_i x_i^T v)(v^T v)$. We can then write:

$\min_{\|v\|=1} J(v) = \min_{\|v\|=1} \sum_{i=1}^{n} \left( x_i^T x_i - v^T x_i x_i^T v\,(2 - v^T v) \right)$
$\quad = \max_{\|v\|=1} \sum_{i=1}^{n} \left( -x_i^T x_i + v^T x_i x_i^T v\,(2 - v^T v) \right)$
$\quad = \max_{\|v\|=1} \sum_{i=1}^{n} v^T x_i x_i^T v\,(2 - v^T v)$  (dropping the constant $\sum_i x_i^T x_i$, which does not depend on $v$)
$\quad = \max_{\|v\|=1} \sum_{i=1}^{n} v^T x_i x_i^T v\,(2 - 1)$  (using the constraint $v^T v = 1$)
$\quad = \max_{\|v\|=1} \sum_{i=1}^{n} v^T x_i x_i^T v$
$\quad = \max_{\|v\|=1} v^T X X^T v$

Thus minimizing $J(v)$ subject to $\|v\| = 1$ is equivalent to maximizing $v^T X X^T v$ subject to the same constraint, and the Lagrangian of this maximization is exactly $L_O(v, \lambda) = v^T X X^T v + \lambda (1 - v^T v)$.
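A quick numerical check of this equivalence (ours, not part of the assignment): the top eigenvector of $X X^T$, which maximizes $v^T X X^T v$ over unit vectors, also attains the smallest reconstruction error $J(v)$. The data and function names below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Zero-mean samples as columns of X (d = 2, n = 500), stretched along one direction.
X = rng.standard_normal((2, 500)) * np.array([[3.0], [0.5]])
X = X - X.mean(axis=1, keepdims=True)

def J(v, X):
    """Sum of squared reconstruction errors for a unit vector v."""
    proj = np.outer(v, v @ X)            # columns are (v^T x_i) v
    return np.sum((X - proj) ** 2)

# Maximizer of v^T X X^T v: the leading eigenvector of X X^T.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
v_top = eigvecs[:, -1]

# Compare J at the top eigenvector against J at random unit vectors.
random_dirs = rng.standard_normal((100, 2))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
print(J(v_top, X) <= min(J(v, X) for v in random_dirs))  # True
```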

2. [5 pts] Draw the first two principal components for the following two datasets. Based on this, comment on a limitation of PCA.

Answer: One of the limitations of PCA is that it can be highly affected by just a few outliers. Another limitation is that it may fail to capture meaningful components that do not pass through the origin.
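To make the outlier point concrete, a small sketch (ours, with made-up data): adding a handful of extreme points to an otherwise elongated cloud noticeably rotates the first principal component.

```python
import numpy as np

rng = np.random.default_rng(1)

def first_pc(X):
    """First principal component (unit vector) of data with samples as columns."""
    Xc = X - X.mean(axis=1, keepdims=True)
    _, vecs = np.linalg.eigh(Xc @ Xc.T)
    return vecs[:, -1]

# Elongated cloud along the x-axis (samples are columns).
clean = rng.standard_normal((2, 300)) * np.array([[3.0], [0.3]])
# Same cloud plus a few extreme points far off the main direction.
outliers = np.array([[0.5, -0.5, 1.0], [40.0, 35.0, 45.0]])
dirty = np.hstack([clean, outliers])

pc_clean, pc_dirty = first_pc(clean), first_pc(dirty)
angle = np.degrees(np.arccos(abs(pc_clean @ pc_dirty)))
print(pc_clean, pc_dirty, f"rotation = {angle:.1f} degrees")
```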