CS Machine Learning Qualifying Exam

Georgia Institute of Technology
March 30, 2017

The exam is divided into four areas: Core, Statistical Methods and Models, Learning Theory, and Decision Processes. There are three questions in each area. You must answer two out of three questions in the Core area. You must choose two of the remaining areas and answer two out of three questions in each one.

1 Core

Question 1. Bayes Error

Let $(X, Y)$ be a pair of random variables taking their values in $\mathbb{R}^d$ and $\{0, 1\}$, respectively. We are interested in classification, namely predicting $Y$ from $X$. Let $g^*(X)$ denote the Bayes classifier and $L^* = P\{g^*(X) \neq Y\}$ denote its probability of misclassification, otherwise known as the Bayes error. The Bayes classifier can be viewed as the output of the optimal learning algorithm for $(X, Y)$: given any other classifier $g(X)$ with associated probability of error $L$, we have $L^* \leq L$.

(a) Show that any feature transform $T$ which maps $X$ into a transformed random variable $T(X)$ can only increase the Bayes error (meaning that the Bayes error for $T(X)$ will never be less than $L^*$). Does this result have any consequences for deep learning?

(b) Prove that $L^* \leq \min(p, 1-p)$, where $p = P(Y = 1)$. Show that equality is attained if $X$ and $Y$ are independent. Construct a distribution where $X$ is not independent of $Y$ but $L^* = \min(p, 1-p)$.

(c) In a paper from 1996, David Wolpert proved the No Free Lunch theorem, which demonstrated that bias-free learning is futile. Specifically, if one is interested in the off-training-set error, then there is no a priori reason to prefer one learning algorithm over another (see www.no-free-lunch.org). Is there any contradiction between these results and the definition of the Bayes classifier above? Explain.
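For intuition on part (b), a small numerical sketch in pure Python (the discrete distributions below are invented for illustration): with $\eta(x) = P(Y = 1 \mid X = x)$, the Bayes error is $L^* = E[\min(\eta(X), 1 - \eta(X))]$, which never exceeds $\min(p, 1-p)$ and matches it when $\eta$ is constant (i.e., when $X$ and $Y$ are independent).

```python
# Bayes error for a discrete joint distribution over X and Y in {0, 1}.
# L* = sum_x P(x) * min(eta(x), 1 - eta(x)), where eta(x) = P(Y=1 | X=x).

def bayes_error(p_x, eta):
    """p_x[x] = P(X=x); eta[x] = P(Y=1 | X=x)."""
    return sum(p_x[x] * min(eta[x], 1 - eta[x]) for x in p_x)

def prior_bound(p_x, eta):
    """min(p, 1-p) with p = P(Y=1) = sum_x P(x) * eta(x)."""
    p = sum(p_x[x] * eta[x] for x in p_x)
    return min(p, 1 - p)

p_x = {0: 0.5, 1: 0.5}
eta_dep = {0: 0.1, 1: 0.9}   # X informative about Y: L* well below the bound
eta_ind = {0: 0.3, 1: 0.3}   # eta constant, X and Y independent: L* hits the bound

print(bayes_error(p_x, eta_dep), prior_bound(p_x, eta_dep))  # 0.1 vs 0.5
print(bayes_error(p_x, eta_ind), prior_bound(p_x, eta_ind))  # 0.3 vs 0.3
```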

Question 2. Naive Bayes Classifier

Consider the task of predicting whether or not an email is spam based on the words that it contains. Let $\theta_{cw}$ be the probability that the word $w$ occurs in emails of class $c$, where $c = 1$ is spam and $c = 0$ is non-spam. Define the indicator variable $x_w$, which equals 1 if word $w$ occurs in the email and 0 otherwise. Then the probability that the email belongs to class $c$ is:

$$p(x \mid c, \theta) = \prod_{w=1}^{W} \theta_{cw}^{x_w} (1 - \theta_{cw})^{1 - x_w},$$

where $x = (x_1, \ldots, x_W)$ is a bit vector and $W$ is the number of words in the vocabulary.

(a) Show that this is a Naive Bayes model.

(b) Show that the class-conditional log-likelihood can be written $\log p(x \mid c, \theta) = \phi(x)^T \beta_c$, for appropriately chosen vectors $\phi(x)$ and $\beta_c$ of length $W + 1$.

(c) Assuming that $p(c = 0) = p(c = 1) = 0.5$, write an expression for the posterior log-odds ratio $\log \frac{p(c=1 \mid x)}{p(c=0 \mid x)}$. What type of classifier is this?

(d) Some words will occur in both spam and non-spam emails and therefore will not be very discriminative. The posterior odds for these words should be 50-50 (i.e., not informative in deciding the class label). For a particular word $w$, state the conditions on $\theta_{0w}$ and $\theta_{1w}$ such that the presence or absence of the word in the email will have no effect on the classifier decision from part (c).
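A quick pure-Python sketch of the structure parts (b)–(d) point at (the per-word probabilities below are made up): under the uniform prior, the posterior log-odds is a linear function of the bit vector $x$, and a word with $\theta_{0w} = \theta_{1w}$ contributes a zero weight.

```python
import math

# Hypothetical occurrence probabilities theta[c][w] for a W = 3 word vocabulary.
theta = {
    0: [0.2, 0.5, 0.7],   # non-spam
    1: [0.6, 0.5, 0.1],   # spam; note word 1 has theta_0w == theta_1w
}
W = 3

def log_lik(x, c):
    """Class-conditional log-likelihood of the Bernoulli Naive Bayes model."""
    return sum(x[w] * math.log(theta[c][w]) +
               (1 - x[w]) * math.log(1 - theta[c][w]) for w in range(W))

def log_odds_direct(x):
    """Posterior log-odds under the uniform prior p(c) = 0.5."""
    return log_lik(x, 1) - log_lik(x, 0)

# The same quantity rewritten as a linear function w . x + b:
w_lin = [math.log(theta[1][w] / theta[0][w]) -
         math.log((1 - theta[1][w]) / (1 - theta[0][w])) for w in range(W)]
b_lin = sum(math.log((1 - theta[1][w]) / (1 - theta[0][w])) for w in range(W))

def log_odds_linear(x):
    return sum(w_lin[w] * x[w] for w in range(W)) + b_lin

for x in [(0, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 1)]:
    assert abs(log_odds_direct(x) - log_odds_linear(x)) < 1e-12

print(w_lin[1])  # 0.0: the non-discriminative word drops out of the decision
```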

Question 3. Bias-Variance Tradeoff

(a) What is the bias-variance tradeoff? Suppose we are training a neural network to solve a regression problem. What does the bias-variance tradeoff tell us about the training problem?

(b) Write the bias-variance decomposition of the expected mean squared error $E_D[(f(x; D) - E[y \mid x])^2]$ for the problem of predicting $y$ given $x$ using a neural network trained on data $D$. Draw three pictures to illustrate the roles of the bias, the variance, and the noise in $y$ independent of $x$.

(c) Use the bias-variance tradeoff to discuss the recently demonstrated effectiveness of neural-network-based classifiers (in the form of deep models) at solving modern prediction problems in computer vision and speech. To what extent is the observed empirical performance explained and/or contradicted by the classical bias-variance tradeoff?
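To make part (b) concrete, a minimal Monte Carlo sketch in pure Python (the toy target $E[y \mid x] = x^2$ and the deliberately rigid constant predictor are invented for illustration): averaging over many training sets $D$, the squared error of the learned predictor at a query point splits exactly into bias$^2$ plus variance.

```python
import random

random.seed(0)

def make_dataset(n=20):
    """n training points with E[y|x] = x^2 plus Gaussian observation noise."""
    xs = [random.random() for _ in range(n)]
    ys = [x * x + random.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    """A rigid model: predict the mean of y everywhere (high bias, low variance)."""
    m = sum(ys) / len(ys)
    return lambda x: m

x0, f0 = 0.8, 0.8 ** 2            # query point and its true E[y | x0]
preds = []
for _ in range(2000):             # average over many independent training sets D
    xs, ys = make_dataset()
    preds.append(fit_constant(xs, ys)(x0))

mean_pred = sum(preds) / len(preds)
bias = mean_pred - f0
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
mse = sum((p - f0) ** 2 for p in preds) / len(preds)

# The decomposition is an exact identity: MSE = bias^2 + variance.
assert abs(mse - (bias ** 2 + variance)) < 1e-9
print(bias ** 2, variance)   # rigid model: bias^2 dominates the variance
```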

2 Statistical Methods and Models

Question 1. Logistic Regression

Suppose you have two datasets: $D_1 = \{x_i^{(1)}, y_i^{(1)}\}_{i \in 1 \ldots n_1}$ and $D_2 = \{x_j^{(2)}, y_j^{(2)}\}_{j \in 1 \ldots n_2}$, with each $y \in \{-1, 1\}$ and each $x \in \mathbb{R}^P$. The dimension $P$ is identical for $D_1$ and $D_2$. Let $w^{(1)}$ be the unregularized logistic regression (LR) coefficients from training on dataset $D_1$ under the model $P(y \mid x; w) = \sigma(y (x \cdot w))$, with $\sigma$ indicating the sigmoid function and $x \cdot w$ indicating the dot product of the features $x$ and the coefficients $w$. Let $w^{(2)}$ be the unregularized LR coefficients (same model) from training on dataset $D_2$. Finally, let $w$ be the unregularized LR coefficients from training on the combined dataset $D_1 \cup D_2$. Under these conditions, prove that for any feature $n$,

$$\min(w_n^{(1)}, w_n^{(2)}) \leq w_n \leq \max(w_n^{(1)}, w_n^{(2)}).$$
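A numerical sanity check of the claim in the simplest case $P = 1$ (pure Python, gradient descent on the negative log-likelihood; the two tiny non-separable datasets are made up for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_lr_1d(data, lr=0.1, iters=20000):
    """Unregularized single-feature LR, P(y|x;w) = sigmoid(y * x * w),
    fit by gradient descent on the negative log-likelihood."""
    w = 0.0
    for _ in range(iters):
        # d/dw [ -sum log sigmoid(y x w) ] = -sum y x sigmoid(-y x w)
        grad = -sum(y * x * sigmoid(-y * x * w) for x, y in data)
        w -= lr * grad
    return w

# Two made-up, non-separable 1-D datasets of (x, y) pairs with y in {-1, +1}.
D1 = [(1.0, 1), (1.0, -1), (2.0, 1)]
D2 = [(1.5, 1), (1.0, -1), (0.8, -1)]

w1 = fit_lr_1d(D1)
w2 = fit_lr_1d(D2)
w12 = fit_lr_1d(D1 + D2)

# The combined coefficient lies between the two individual coefficients.
assert min(w1, w2) - 1e-6 <= w12 <= max(w1, w2) + 1e-6
print(w1, w2, w12)
```

In one dimension the result is easy to see: each dataset's score function $\sum_i y_i x_i \, \sigma(-y_i x_i w)$ is strictly decreasing in $w$, and the combined score is their sum, so its root must lie between the two individual roots.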

Question 2. IO-HMM

Figure 1 shows a typical input-output hidden Markov model (IO-HMM) that describes how a robot can interact with its environment. The environment states $x_t$ are partially observed through the measurements $y_t$, and the agent's actions $u_t$ directly affect the state. In this graphical model the actions are exogenous, meaning they are determined by factors outside of the system.

Figure 1: A dynamical system with open-loop control (hidden states $x_1, x_2, x_3$; actions $u_2, u_3$; measurements $y_1, y_2, y_3$).

How might you learn the parameters of an IO-HMM from a dataset $D$? Assume a transition model $p(x_{t+1} \mid x_t, u_{t+1})$, a measurement model $p(y_t \mid x_t)$, and a dataset consisting of a length-$N$ sequence of actions and observations, e.g. $\{u_1, y_1, u_2, y_2, \ldots, u_N, y_N\}$. First, describe the parameters that must be learned. Then suggest an algorithm that learns these parameters from data and describe how it works. Provide as much detail as you can.

Question 3. Variational Inference

Consider a probability distribution $P$ corresponding to the pairwise Markov network (diamond-shaped) illustrated in Figure 2(a). Consider approximating it with a distribution $Q$ that is represented by the pairwise Markov network of Figure 2(b) (plus-shaped). Derive the potentials $\psi_1(A, C)$ and $\psi_2(B, D)$ that minimize $D(Q \| P)$.

Figure 2: Two pairwise Markov networks in the variables $A, B, C, D$ in the shape of (a) a diamond (left) and (b) a plus (right).

3 Learning Theory

Question 1. PAC Learning of Simple Classes

(a) Suppose that $C$ is a finite set of functions from $X$ to $\{0, 1\}$. Prove that for any distribution $D$ over $X$, any target function, and any $\epsilon, \delta > 0$, if we draw a sample $S$ from $D$ of size

$$|S| \geq \frac{1}{\epsilon} \left( \ln |C| + \ln \frac{1}{\delta} \right),$$

then with probability $\geq 1 - \delta$, all $h \in C$ with $\mathrm{err}_D(h) \geq \epsilon$ have $\mathrm{err}_S(h) > 0$. Here, $\mathrm{err}_D(h)$ is the true error of $h$ under $D$ and $\mathrm{err}_S(h)$ is the empirical error of $h$ on the sample $S$.

(b) Present an algorithm for PAC learning disjunctions over $\{0, 1\}^n$.
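The sample-size bound in part (a) is easy to evaluate. A one-function sketch (pure Python; the example numbers are arbitrary):

```python
import math

def pac_sample_size(hyp_count, eps, delta):
    """Smallest integer m with m >= (1/eps) * (ln|C| + ln(1/delta)),
    the bound from part (a) for a finite class of hyp_count functions."""
    return math.ceil((math.log(hyp_count) + math.log(1.0 / delta)) / eps)

# e.g. |C| = 1000 hypotheses, eps = 0.1, delta = 0.05:
print(pac_sample_size(1000, 0.1, 0.05))  # 100
```

Note the logarithmic dependence on $|C|$ and $1/\delta$: tightening $\epsilon$ by a factor of ten multiplies the sample size by ten, while squaring the hypothesis class only doubles the $\ln|C|$ term.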

Question 2. Expressivity of LTFs

Assume that each example $x$ is given by $n$ boolean features (variables). A decision list is a function of the form: "if $\ell_1$ then $b_1$, else if $\ell_2$ then $b_2$, else if $\ell_3$ then $b_3$, \ldots, else $b_m$", where each $\ell_i$ is a literal (either a variable or its negation) and each $b_i \in \{0, 1\}$. For instance, one possible decision list is the rule: "if $x_1$ then positive, else if $x_5$ then negative, else positive." Decision lists are a natural representation language in many settings and have also been shown to have a collection of useful theoretical properties.

(a) Show that conjunctions (like $x_1 \wedge x_2 \wedge x_3$) and disjunctions (like $x_1 \vee x_2 \vee x_3$) are special cases of decision lists.

(b) Show that decision lists are a special case of linear threshold functions. That is, any function that can be expressed as a decision list can also be expressed as a linear threshold function $f(x) = +$ iff $w_1 x_1 + \ldots + w_n x_n \geq w_0$, for some values $w_0, w_1, \ldots, w_n$.
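One standard construction for part (b) assigns rule $k$ a weight of magnitude $2^{m-k}$, so the first literal that fires outweighs all later terms combined. A pure-Python sketch of that idea (the encoding of lists and the example rules are illustrative conventions, not the only way to do it):

```python
from itertools import product

# A decision list is [(literal, output), ...] plus a default output, where a
# literal is (var_index, polarity): polarity True means "x_i", False "not x_i".
def eval_dlist(rules, default, x):
    for (i, pol), out in rules:
        if bool(x[i]) == pol:
            return out
    return default

def dlist_to_ltf(rules, default, n):
    """Exponentially decaying magnitudes: rule k gets weight 2^(m-k), so the
    first firing rule dominates the sum of every later term plus the default."""
    w = [0] * n
    bias = 0
    m = len(rules)
    for k, ((i, pol), out) in enumerate(rules):
        s = (2 ** (m - k)) * (1 if out == 1 else -1)
        if pol:
            w[i] += s
        else:              # a negated literal: s * (1 - x_i) = s - s * x_i
            w[i] -= s
            bias += s
    bias += 1 if default == 1 else -1
    return w, bias

def eval_ltf(w, bias, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0

# "if x_1 then positive, else if x_5 then negative, else positive" (0-indexed):
n = 6
rules = [((0, True), 1), ((4, True), 0)]
default = 1
w, bias = dlist_to_ltf(rules, default, n)
assert all(eval_dlist(rules, default, x) == eval_ltf(w, bias, x)
           for x in product([0, 1], repeat=n))
```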

Question 3. VC Dimension and Rademacher Complexity

(a) What is the VC dimension $d$ of axis-parallel rectangles in $\mathbb{R}^3$? Specifically, a legal target function is specified by three intervals $[x_{\min}, x_{\max}]$, $[y_{\min}, y_{\max}]$, and $[z_{\min}, z_{\max}]$, and classifies an example $(x, y, z)$ as positive iff $x \in [x_{\min}, x_{\max}]$, $y \in [y_{\min}, y_{\max}]$, and $z \in [z_{\min}, z_{\max}]$.

(b) Explain the importance of the VC dimension for machine learning.

(c) Explain when and why generalization bounds based on Rademacher complexity could be better than those based on the VC dimension.
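One mechanical way to reason about questions like (a) is a brute-force shattering checker; the sketch below (pure Python; the 2-D point sets are illustrative only and do not answer the 3-D question) uses the fact that a labeling is realizable by an axis-parallel box iff the bounding box of the positive points contains no negative point.

```python
from itertools import product

def box_realizable(points, labels):
    """True iff some axis-parallel box contains exactly the positive points.
    Equivalent test: the bounding box of the positives excludes every negative."""
    pos = [p for p, l in zip(points, labels) if l]
    if not pos:
        return True            # a degenerate box off to the side labels all negative
    d = len(points[0])
    lo = [min(p[i] for p in pos) for i in range(d)]
    hi = [max(p[i] for p in pos) for i in range(d)]
    return all(any(not (lo[i] <= p[i] <= hi[i]) for i in range(d))
               for p, l in zip(points, labels) if not l)

def shattered(points):
    """True iff every one of the 2^k labelings of the points is realizable."""
    return all(box_realizable(points, labels)
               for labels in product([0, 1], repeat=len(points)))

# In R^2, four points in a diamond are shattered by axis-parallel rectangles,
# but adding a fifth point in the middle breaks shattering.
pts4 = [(0, 1), (2, 1), (1, 0), (1, 2)]
print(shattered(pts4))              # True
print(shattered(pts4 + [(1, 1)]))   # False
```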

4 Decision Processes

Question 1. Reinforcement Learning Methods

(a) Explain the difference between the SARSA and Q-learning algorithms. Under what conditions would you prefer one over the other?

(b) For the SARSA algorithm, write the formula for the weight update in the case of a continuous state space and eligibility traces. Explain the meaning of each term in the formula.

(c) For the Q-learning algorithm, show that the Bellman equation in a discrete state space without eligibility traces is given by:

$$Q(x, u) = \sum_{x'} p(x' \mid x, u) \left[ r(x, x', u) + \gamma \max_{u'} Q(x', u') \right]$$

(d) In many real-world applications, the action space for a robot may be continuous instead of discrete. For example, in autonomous driving the steering angle and the acceleration are continuous. How can continuous actions be handled within an RL framework? Describe the strengths and weaknesses of your proposed approach.

(e) What are the advantages of policy gradient methods over TD methods? Give one example of a scenario where a policy gradient approach would be attractive.
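A minimal tabular sketch contrasting the two updates from part (a), on a made-up deterministic chain MDP (states, rewards, and hyperparameters are invented for illustration; the only difference between the algorithms is the bootstrap target):

```python
import random

random.seed(1)

# A tiny deterministic chain: states 0..3, state 3 terminal. Action 'R' moves
# right, 'L' moves left (floored at 0); reaching state 3 pays +1, else 0.
GAMMA, ALPHA, EPS, TERMINAL = 0.9, 0.5, 0.3, 3
ACTIONS = ['L', 'R']

def step(s, a):
    s2 = min(s + 1, TERMINAL) if a == 'R' else max(s - 1, 0)
    return s2, (1.0 if s2 == TERMINAL else 0.0)

def eps_greedy(Q, s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def run(algo, episodes=500):
    Q = {(s, a): 0.0 for s in range(TERMINAL) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = eps_greedy(Q, s)
        while s != TERMINAL:
            s2, r = step(s, a)
            if s2 == TERMINAL:
                target, a2 = r, None
            elif algo == 'q':      # Q-learning, off-policy: bootstrap on the max
                target = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
                a2 = eps_greedy(Q, s2)
            else:                  # SARSA, on-policy: bootstrap on the action taken
                a2 = eps_greedy(Q, s2)
                target = r + GAMMA * Q[(s2, a2)]
            Q[(s, a)] += ALPHA * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

Q_learning = run('q')
Q_sarsa = run('sarsa')
# Optimal values along the chain: Q*(2,R)=1, Q*(1,R)=0.9, Q*(0,R)=0.81.
print(Q_learning[(0, 'R')], Q_learning[(1, 'R')], Q_learning[(2, 'R')])
print(Q_sarsa[(0, 'R')])   # on-policy values depend on the exploration rate
```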

Question 2. Markov Reward Process

Consider an undiscounted Markov reward process with two states $A$ and $B$. The transition process and reward functions are unknown, but you have observed two episodes:

(A, +3) → (A, +2) → (B, −4) → (A, +4) → (B, −3) → Terminate
(B, −2) → (A, +3) → (B, −3) → Terminate

In the above episodes, the tuples denote a sample state transition and the associated reward at each step. For example, (A, +2) → (B, −4) denotes a transition from state A to state B with a reward of +2. Answer the following:

- Using the first-visit Monte Carlo method, estimate the value function $V(A), V(B)$.
- Using the every-visit Monte Carlo method, estimate the value function $V(A), V(B)$.
- Draw a diagram of the Markov reward process that best explains these two episodes from a maximum likelihood perspective (you do not need to prove that you have the ML estimate). Show rewards and transition probabilities on your diagram.
- Solve the Bellman equation to give the true value function $V(A), V(B)$. (Hint: solve the Bellman equations directly rather than iteratively.)
- What value function would batch TD(0) find, if it was applied repeatedly to these two episodes?
- What value function would batch TD(1) find, using accumulating eligibility traces?
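The first two items can be checked mechanically. A small pure-Python sketch (the episode encoding follows the statement: each tuple is the state occupied and the reward received on that step):

```python
def mc_values(episodes, first_visit):
    """Undiscounted first-visit or every-visit Monte Carlo value estimates."""
    returns = {}                            # state -> list of sampled returns
    for ep in episodes:
        # G[t] = sum of rewards from step t to the end of the episode.
        g, G = 0.0, [0.0] * len(ep)
        for t in range(len(ep) - 1, -1, -1):
            g += ep[t][1]
            G[t] = g
        seen = set()
        for t, (s, _) in enumerate(ep):
            if first_visit and s in seen:
                continue                     # count only the first visit per episode
            seen.add(s)
            returns.setdefault(s, []).append(G[t])
    return {s: sum(v) / len(v) for s, v in returns.items()}

episodes = [
    [('A', 3), ('A', 2), ('B', -4), ('A', 4), ('B', -3)],
    [('B', -2), ('A', 3), ('B', -3)],
]

print(mc_values(episodes, first_visit=True))    # {'A': 1.0, 'B': -2.5}
print(mc_values(episodes, first_visit=False))   # {'A': 0.5, 'B': -2.75}
```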

Question 3. Scaling Up Reinforcement Learning

Machine learning algorithms have traditionally had difficulty scaling to large problems. In classification and traditional supervised learning, this problem arises when the data exist in very high-dimensional spaces or when there are many data points for computing, for example, estimates of conditional densities. The same is true in reinforcement learning, arising when, for example, there are very many states or very fine-grained action spaces. Typical approaches to addressing such problems in RL include function approximation and problem decomposition.

Compare and contrast these two approaches. What problems of scale do they address? What are their strengths and weaknesses? Are they orthogonal approaches? Can they work well together? What are the differences between hierarchical and modular reinforcement learning? Explain both the theoretical and practical limits of these approaches.