
UNIVERSITY OF TORONTO
Faculty of Arts and Science
APRIL 2018 EXAMINATIONS
CSC321H1S
Duration: 3 hours
No Aids Allowed

Name:
Student number:

This is a closed-book test. It is marked out of 35 marks. Please answer ALL of the questions. Here is some advice:

- The questions are NOT arranged in order of difficulty, so you should attempt every question.
- Questions that ask you to justify your answer or briefly explain something only require short (1-3 sentence) explanations. Don't write a full page of text. We're just looking for the main idea.
- None of the questions require long derivations. If you find yourself plugging through lots of equations, consider giving less detail or moving on to the next question.
- Many questions have more than one right answer.

Q1:  /2    Q2:  /1    Q3:  /1    Q4:  /2    Q5:  /2    Q6:  /1
Q7:  /3    Q8:  /2    Q9:  /1    Q10: /2    Q11: /4    Q12: /2
Q13: /2    Q14: /2    Q15: /2    Q16: /3    Q17: /2    Q18: /1

Final mark: /35

1. [2pts] Recall that multilayer perceptrons are universal for the set of functions mapping binary-valued input vectors to binary-valued outputs.

(a) [1pt] What do we mean by universal?

(b) [1pt] If multilayer perceptrons are universal, why do we still consider other architectures?

2. [1pt] Give an example of a data augmentation technique that would be useful for classifying images of cats vs. dogs, but not for classifying handwritten digits. Briefly explain your answer.

3. [1pt] Suppose we have a grayscale image represented as an array, where larger values denote lighter pixels. What is the effect when we convolve it with the following kernel?

    0   1   0
    1  -4   1
    0   1   0
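For Question 3, here is a small numerical probe of the kernel's effect (a sketch only: the toy image and the use of scipy.signal.convolve2d are illustrative choices, not part of the exam, and the kernel is taken as printed above).

import numpy as np
from scipy.signal import convolve2d

# The kernel from Question 3 (the discrete Laplacian).
kernel = np.array([[0,  1, 0],
                   [1, -4, 1],
                   [0,  1, 0]], dtype=float)

# Toy grayscale image: dark (0) on the left half, light (1) on the right half.
img = np.zeros((5, 6))
img[:, 3:] = 1.0

print(convolve2d(img, kernel, mode='valid'))
# Uniform regions give 0; only the two columns next to the dark/light boundary
# respond, so the filter singles out edges rather than smooth regions.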

4. [2pts] The learning rate is an important parameter for gradient descent.

(a) [1pt] Briefly describe something that can go wrong if we choose too high a learning rate for full batch gradient descent.

(b) [1pt] Briefly describe something that can go wrong if we choose too high a learning rate for stochastic gradient descent, but is not a problem in the full batch setting.

5. [2pts] Suppose we are training an RNN language model using teacher forcing (the method you implemented in Assignment 3).

(a) [1pt] What are the inputs to the network at training time?

(b) [1pt] What are the inputs to the network at test time?
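As a reference point for Question 5, here is a minimal sketch contrasting the two regimes on a hypothetical toy PyTorch language model (the module names, sizes, and greedy decoding are made-up illustrations, not the Assignment 3 code).

import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 30, 16, 32
embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
readout = nn.Linear(hid_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 10))    # one training sequence

# Training time (teacher forcing): the inputs are the ground-truth previous tokens.
inputs, targets = tokens[:, :-1], tokens[:, 1:]
hidden, _ = rnn(embed(inputs))
logits = readout(hidden)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

# Test time: the inputs are the tokens the model itself generated at earlier steps.
generated, state = tokens[:, :1], None            # seed with a start token
for _ in range(9):
    out, state = rnn(embed(generated[:, -1:]), state)
    next_token = readout(out[:, -1]).argmax(dim=-1, keepdim=True)
    generated = torch.cat([generated, next_token], dim=1)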

6. [1pt] Here is a modified version of code from Programming Assignment 2. The methods downconv1, rfconv, etc. implement convolution layers. Add edges to the diagram to represent the network architecture this implements. You don't need to justify your answer.

class MyNet(nn.Module):
    ...

    def forward(self, x):
        self.h1 = self.downconv1(x)
        self.h2 = self.downconv2(self.h1)
        self.h3 = self.rfconv(self.h2)
        self.h4 = self.upconv1(torch.cat([self.h3, self.h2], 1))
        self.h5 = self.upconv2(self.h4)
        self.out = self.finalconv(torch.cat([self.h5, x], 1))
        return self.out

[Diagram not reproduced in this transcription.]

7. [3pts] Recall the (multivariate) linear regression model:

    y = w^T x + b
    L(y, t) = (1/2) (y - t)^2

Your job is to implement full batch gradient descent in NumPy. In particular, suppose we are given an N x D NumPy array X representing all the training inputs, and an N-dimensional NumPy vector t representing the targets, where N is the number of data points and D is the input dimension. The weights are represented with a D-dimensional NumPy vector w, and the bias is represented with a scalar b. The learning rate is given as alpha. Write NumPy code which implements one iteration of batch gradient descent. It should be vectorized, i.e. it should not involve a for-loop. You don't need to show your work, but doing so may help you get partial credit.
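For reference, one possible vectorized sketch of such an update, assuming the batch cost is the average of the per-example losses (a common convention, though the question does not fix it); the toy data below stands in for the quantities defined in the question.

import numpy as np

# Toy stand-ins for the quantities defined in the question (illustrative only).
N, D = 5, 3
X = np.random.randn(N, D)
t = np.random.randn(N)
w = np.zeros(D)
b = 0.0
alpha = 0.1

# One iteration of full-batch gradient descent.
y = np.dot(X, w) + b              # predictions for the whole batch, shape (N,)
error = y - t                     # dL/dy for each training case
w = w - alpha * np.dot(X.T, error) / N
b = b - alpha * np.mean(error)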

8. [2pts] Suppose we have a convolution layer which takes as input an array x = (x_1  x_2  x_3) and convolves x with the kernel (2  -1). This layer has a linear activation function. The output is an array of length 4.

Now let's design a fully connected layer which computes the same function. It has a linear activation function and no bias, so it computes y = Wx, where the output y is a vector of length 4. Give the 4 x 3 weight matrix W which makes this fully connected layer equivalent to the convolution layer above. You don't need to justify your answer, but doing so may help you get partial credit.

Hint: first write the values of each output as a linear function of the inputs. To help you check your work, if x = (1  2  3), your answer should give y = (2  3  4  -3).

9. [1pt] Briefly explain one flaw of encoder-decoder architectures for machine translation which do not use attention, and how attention can fix it.
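A quick way to check a candidate answer to Question 8 against the hint (a sketch: the matrix below is one candidate that passes the check, and NumPy's np.convolve in its default 'full' mode is assumed to match the convolution above).

import numpy as np

x = np.array([1., 2., 3.])
kernel = np.array([2., -1.])

W = np.array([[ 2.,  0.,  0.],
              [-1.,  2.,  0.],
              [ 0., -1.,  2.],
              [ 0.,  0., -1.]])

print(np.convolve(x, kernel))     # [ 2.  3.  4. -3.]
print(W @ x)                      # same values, so the two layers agree on this input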

10. [2pts] Recall that in order to add a new primitive operation to Autograd, you need to define a vector-Jacobian product (VJP). To refresh your memory, here is code which defines VJPs for exponentiation and multiplication.

defvjp(exp, lambda g, ans, x: ans * g)
defvjp(multiply,
       lambda g, ans, x, y: y * g,
       lambda g, ans, x, y: x * g)

The arguments to defvjp are the primitive op, followed by functions implementing the VJPs for each of the arguments. The arguments to the VJP function are: the output gradient g, the output ans of the op, and the arguments fed to the op.

(a) [1pt] Write Python code that defines a vector-Jacobian product for sin.

(b) [1pt] Write Python code that defines a vector-Jacobian product for divide, the function which computes the elementwise division of two arrays (i.e. divide(x, y) is equivalent to x / y). (This is floating point division, not integer division.)
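For reference, sketches in the same style as the examples above (this assumes cos is available as a primitive alongside sin, and that g, ans, x, and y are arrays of matching shape; it is one possible way to write these VJPs, not the official solution).

# d/dx sin(x) = cos(x), so the VJP scales the output gradient by cos(x).
defvjp(sin, lambda g, ans, x: g * cos(x))

# divide(x, y) = x / y: the partial derivatives are 1/y and -x/y**2, i.e. -ans/y.
defvjp(divide,
       lambda g, ans, x, y: g / y,
       lambda g, ans, x, y: -g * ans / y)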

11. [4pts] Suppose we receive two binary sequences x_1 = (x_1^(1), ..., x_1^(T)) and x_2 = (x_2^(1), ..., x_2^(T)) of equal length, and we would like to design an RNN to determine if they are identical. We will use the following (rather unusual) architecture, drawn with self-loops on the left and unrolled on the right. [Diagram not reproduced in this transcription.]

The computation in each time step is as follows:

    h^(t) = φ(W x^(t) + b)
    y^(t) = φ(v^T h^(t) + r y^(t-1) + c)    for t > 1
    y^(1) = φ(v^T h^(1) + c_0)              for t = 1

where φ denotes the hard threshold activation function:

    φ(z) = 1 if z > 0
    φ(z) = 0 if z <= 0

The parameters are a 2 x 2 weight matrix W, a 2-dimensional bias vector b, a 2-dimensional weight vector v, a scalar recurrent weight r, a scalar bias c for all but the first time step, and a separate bias c_0 for the first time step.

We'll use the following strategy. We'll proceed one step at a time, and at time t, the binary-valued elements x_1^(t) and x_2^(t) will be fed as inputs. The output unit y^(t) at time t will compute whether all pairs of elements have matched up to time t. The two hidden units h_1^(t) and h_2^(t) will help determine if both inputs match at a given time step. Hint: have h_1^(t) determine if both inputs are 0, and h_2^(t) determine if both inputs are 1.

(Continued on next page)

(Question 11, cont'd) Give parameters which correctly implement this function:

W =          b =          v =          r =          c =          c_0 =
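One consistent choice under the hinted strategy, together with a small NumPy check (a sketch: these particular numbers are just one of many valid settings, and the check itself is not part of the exam).

import numpy as np

phi = lambda z: (np.asarray(z) > 0).astype(float)    # hard threshold

W = np.array([[-1., -1.],    # h_1 fires iff x_1 = x_2 = 0
              [ 1.,  1.]])   # h_2 fires iff x_1 = x_2 = 1
b = np.array([0.5, -1.5])
v = np.array([1., 1.])
r, c, c0 = 1., -1.5, -0.5

def same_sequence(x1, x2):
    y = None
    for t, x in enumerate(zip(x1, x2)):
        h = phi(W @ np.array(x, dtype=float) + b)
        pre = v @ h + (c0 if t == 0 else r * y + c)
        y = float(phi(pre))
    return y

print(same_sequence([1, 0, 1], [1, 0, 1]))    # 1.0: the sequences are identical
print(same_sequence([1, 0, 1], [1, 1, 1]))    # 0.0: they differ at the second step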

12. [2pts] Suppose we have flipped a coin multiple times, and it came up heads N_H times and tails N_T times. We would like to model the coin as a Bernoulli random variable, and fit the model using maximum likelihood.

(a) [1pt] Give the formula for the log-likelihood ℓ(θ), where θ is the probability of heads.

(b) [1pt] Solve for the maximum likelihood estimate of θ by setting dℓ/dθ = 0.

13. [2pts] Recall the CycleGAN architecture for style transfer.

(a) [1pt] What might go wrong if we eliminate the discriminator terms from the cost function?

(b) [1pt] What might go wrong if we eliminate the reconstruction terms from the cost function?
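For reference on Question 12, the standard Bernoulli maximum-likelihood derivation (a worked sketch, not the marking scheme):

    ℓ(θ) = N_H log θ + N_T log(1 - θ)
    dℓ/dθ = N_H/θ - N_T/(1 - θ) = 0   ⇒   θ̂ = N_H / (N_H + N_T)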

14. [2pts] Recall that a GAN could, in principle, be trained using the following minimax formulation, where G is the generator function, D is the probability the discriminator assigns to the sample being data, and J_D and J_G are the cost functions for the discriminator and generator, respectively.

    J_D = E_{x~D}[-log D(x)] + E_z[-log(1 - D(G(z)))]
    J_G = -J_D = const + E_z[log(1 - D(G(z)))]

However, in practice, the generator is usually trained with a different loss function.

(a) [1pt] What cost function do we typically use for the generator?

(b) [1pt] What is the reason to use this cost function rather than the one given above?

15. [2pts] We've covered autoregressive generative models based on both convolutional networks and RNNs.

(a) [1pt] Give one advantage of using a convolutional network rather than an RNN.

(b) [1pt] Give one advantage of using an RNN rather than a convolutional network.
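For background on Question 14, the alternative usually seen in practice is the non-saturating generator cost J_G = E_z[-log D(G(z))], which keeps the generator's gradient strong precisely when the discriminator confidently rejects its samples (a standard fact from the GAN literature, noted here for reference).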

16. [3pts] Reversible architectures are based on a reversible block. Let's modify the definition of the reversible block:

    y_1 = r ⊙ x_1 + F(x_2)
    y_2 = s ⊙ x_2,

where ⊙ denotes elementwise multiplication. This modified block is identical to the ordinary reversible block, except that the inputs x_1 and x_2 are multiplied elementwise by vectors r and s, all of whose entries are positive. You don't need to justify your answers for this question, but doing so may help you receive partial credit.

(a) [1pt] Give equations for inverting this block, i.e. computing x_1 and x_2 from y_1 and y_2. You may use / to denote elementwise division.

(b) [1pt] Give a formula for the Jacobian ∂y/∂x, where y denotes the concatenation of y_1 and y_2.

(c) [1pt] Give a formula for the determinant of the Jacobian from part (b).
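A small numerical sketch of the modified block (F, r, s, and the dimension are arbitrary illustrative choices), showing concretely that it can be inverted in the way part (a) asks about.

import numpy as np

rng = np.random.default_rng(0)
d = 4
r, s = rng.uniform(0.5, 2.0, d), rng.uniform(0.5, 2.0, d)    # positive entries
A = rng.normal(size=(d, d))
F = lambda z: np.tanh(A @ z)             # F need not be invertible itself

def forward(x1, x2):
    return r * x1 + F(x2), s * x2

def inverse(y1, y2):
    x2 = y2 / s                          # recover x2 first, since y2 ignores x1
    x1 = (y1 - F(x2)) / r
    return x1, x2

x1, x2 = rng.normal(size=d), rng.normal(size=d)
print(np.allclose((x1, x2), inverse(*forward(x1, x2))))      # True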

17. [2pts] Suppose we have an MDP with two time steps. It has an initial state distribution p(s_1), transition probabilities p(s_{t+1} | s_t, a_t), and a deterministic reward function r(s, a). The agent is currently following a stochastic policy π_θ(a | s) parameterized by θ.

(a) [1pt] Give the formula for the probability p(τ) of a rollout τ = (s_1, a_1, s_2, a_2).

(b) [1pt] What is the function that REINFORCE is trying to maximize with respect to θ? (You can give your answer in terms of p(τ).)

18. [1pt] Recall that the discounted return is defined as:

    G_t = Σ_{i=0}^∞ γ^i r_{t+i},

where γ is the discount factor and r_t is the reward at time t. Give the definition of the action-value function Q^π(s, a) for policy π, state s, and action a. You can either give an equation or explain it verbally.
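As a notation reference for Question 18: the action-value function is conventionally defined as Q^π(s, a) = E_π[G_t | s_t = s, a_t = a], the expected discounted return obtained by taking action a in state s and following policy π thereafter (the standard textbook definition).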