Qualifying Exam in Machine Learning

October 20, 2009

Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in each of two additional parts (choose two parts out of Models and Algorithms, Decision Processes, and Learning Theory). Please write clearly and support all your arguments as much as possible.

Part 1: Core

1. (a) Machine learning methods always result in optimization problems. True or false? Discuss why or why not. Name and discuss specific methods to support your argument.
   (b) Name at least two examples of common machine learning methods whose objective functions are convex. In practical terms, what are the advantages and disadvantages of convex optimization in machine learning?
   (c) Commonly, in many machine learning methods, most of the parameters are chosen by an optimization routine, while a small number are chosen using a model selection approach (most commonly, k-fold cross-validation). Give at least two examples of this.
   (d) Can model selection by k-fold cross-validation be considered a kind of optimization?
   (e) In each of the two examples, could/should all of the parameters be chosen by a single optimization routine, in principle? In each of the two examples, could/should all of the parameters be chosen by cross-validation, in principle? Discuss the theoretical and practical limits of what is possible.

2. Consider the situation where you observe a sequence of continuous random variables X^(1), ..., X^(n) ~ p_θ in order to estimate θ using an estimator θ̂_n(X^(1), ..., X^(n)). A common way to measure the estimation quality is L_r = E(|θ − θ̂_n|^r).
   (a) Write the expectation in L_r explicitly.
   (b) Name one reason to prefer L_r with r = 2 over r = 1, and one reason to prefer L_r with r = 1 over r = 2.
   (c) Show mathematically that L_2 decomposes into a squared bias term and a variance term. Explain intuitively what roles these two quantities play.
   (d) Show that in the case of θ = E[X] and θ̂_n = (1/n) Σ_{i=1}^n X^(i), the bias and the variance go to 0 as n → ∞.
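Part (d) can also be checked numerically. The following sketch (using NumPy; the Gaussian sampling distribution and all constants are invented for illustration, not part of the exam) estimates the bias and variance of the sample mean over repeated draws:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0   # true parameter: theta = E[X] for the chosen distribution

def bias_and_variance(n, trials=2000):
    # Draw `trials` independent samples of size n from N(theta, 1),
    # compute the sample-mean estimator for each, and estimate the
    # estimator's bias and variance across the trials.
    estimates = rng.normal(loc=theta, scale=1.0, size=(trials, n)).mean(axis=1)
    return estimates.mean() - theta, estimates.var()

bias_small, var_small = bias_and_variance(10)
bias_large, var_large = bias_and_variance(1000)
# The sample mean is unbiased for every n, and its variance shrinks
# like 1/n, so both terms vanish as n grows.
```

The estimated bias hovers near 0 at every n, while the estimated variance drops by roughly the ratio of the sample sizes.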

3. (a) What is the bias-variance tradeoff in machine learning?
   (b) What is overfitting? What are the contributing factors to it?
   (c) What is the relationship between the bias-variance tradeoff and overfitting? Describe a well-known method for resolving the tradeoff and avoiding overfitting in each of the following models: decision trees, logistic regression, and neural networks.

Part 2: Models and Algorithms

1. (a) Name at least three widely-used methods for classification and at least three widely-used methods for density estimation.
   (b) Many would argue that for high-dimensional data, we have good methods for classification, but not for density estimation. Do you agree? Why or why not? Name and discuss specific methods to support your argument.
   (c) Is the issue more to do with the inherent hardness of each type of task, or more to do with the methods we currently have available?
   (d) How could one settle this question rigorously, mathematically? How could one settle this question rigorously, empirically?

2. Consider the following exponential family model

       p_θ(X) = exp( Σ_{i=1}^m θ_i f_i(X) − log Z(θ) ),

   where Z is the normalization term, and a dataset D = (X^(1), ..., X^(n)) sampled iid from p_{θ_0}. Please try to be as rigorous as possible in your answers below.
   (a) Express the gradient vector of log p_θ(X) using expectations, and the Hessian matrix of log p_θ(X) using covariances (in both cases the derivatives are with respect to θ).
   (b) Prove that the log-likelihood is concave.
   (c) Assuming that the maximum likelihood estimator (MLE) exists, prove that it can be obtained by solving for θ in the following m equations:

       E_{p_θ}[f_i(X)] = (1/n) Σ_{j=1}^n f_i(X^(j)),   i = 1, ..., m.

   (d) Assuming the model above, in some cases the MLE does not exist. Explain what these cases are. Draw by hand a log-likelihood function for which this might happen. Explain why this does not contradict (b) above.
   (e) One way to address the situation in (d) is to maximize instead the regularized log-likelihood

       (1/n) Σ_{i=1}^n log p_θ(X^(i)) − α Σ_{j=1}^m θ_j²

   for some α > 0. Repeat (a)-(c) for this case. Prove that in this case the situation described in (d) will never happen.
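As a concrete instance of the moment-matching condition in 2(c), the following sketch (a Bernoulli model written as a one-parameter exponential family; the code and constants are illustrative, not part of the exam) finds the MLE by gradient ascent and recovers E_θ[f(X)] equal to the empirical mean of f:

```python
import numpy as np

# Bernoulli(p) as a one-parameter exponential family:
# p_theta(x) = exp(theta * x - log Z(theta)) with f(x) = x and
# Z(theta) = 1 + exp(theta), so E_theta[f(X)] = sigmoid(theta).
rng = np.random.default_rng(1)
data = rng.binomial(1, 0.7, size=500)
empirical = data.mean()                        # (1/n) sum_j f(X^(j))

theta = 0.0
for _ in range(200):
    model_mean = 1.0 / (1.0 + np.exp(-theta))  # E_theta[f(X)]
    # The gradient of the average log-likelihood is exactly
    # (empirical mean of f) - (model mean of f), as in part (a).
    theta += 0.5 * (empirical - model_mean)

# At convergence the moment-matching condition of part (c) holds:
# E_theta[f(X)] equals the empirical mean of f.
```

Note that if every observation were 1 (empirical mean on the boundary), the ascent would push θ to +∞ and the MLE would not exist, which is exactly the situation in part (d).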

3. Consider the problem of using labeled data (X^(1), Y^(1)), ..., (X^(n), Y^(n)) to learn a linear classifier f : X → Y, f(x) = sign(Σ_j θ_j x_j), as in boosting, logistic regression, and linear SVM.
   (a) Write down the exponential loss that boosting minimizes, the logistic loss that logistic regression minimizes (the negative log-likelihood), and the hinge loss that linear SVM minimizes.
   (b) It is sometimes argued that logistic regression and SVM are more resistant to outliers than boosting. Do you agree with this claim? Explain why it is correct or incorrect.
   (c) Show mathematically, using convex duality, that maximizing the log-likelihood is equivalent to a maximum entropy problem. Write precisely the maximum entropy objective function and its constraints.

Part 3: Decision Processes

1. Machine learning algorithms have traditionally had difficulty scaling to large problems. In classification and traditional supervised learning, this problem arises with data that exist in very high-dimensional spaces or when there are many data points for computing, for example, estimates of conditional densities. In reinforcement learning this is also the case, arising when, for example, there are many, many states or when actions are at a very low level of abstraction. Typical approaches to addressing such problems in RL include function approximation and problem decomposition. Compare and contrast these two approaches. What problems of scale do these approaches address? What are their strengths and weaknesses? Are they orthogonal approaches? Can they work well together? What are the differences between hierarchical and modular reinforcement learning? Explain both the theoretical and practical limits of these approaches.

2. You have been hired as a consultant to solve a problem for a large corporation that has no machine learning researchers. The problem involves a virtual agent with a set of virtual actuators and virtual but real-valued (continuous) sensors.
Your client doesn't know how to describe what he wants the agent to do, but he definitely knows when it has done well or poorly. Formally describe this as a reinforcement learning problem. Argue for this formulation, explaining why it isn't well-modeled as a supervised learning problem, for example. Also detail your actual methodology: how will you set up the problem? How will you train? How will you know that you are doing well? What assumptions will you make? How will you choose the various free parameters of your algorithm (so, yes, please choose an algorithm)? Your client has flipped through a book on unsupervised learning and so demands that you also use unsupervised learning. You are ethical, so you want to use some algorithm in a reasonable and useful way. Given the details of this particular problem, how might you actually use unsupervised learning here to some good effect? Be specific: commit to an algorithm (or two or three) and explain what you would hope to get out of its use.
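One hypothetical concretization of the kind of formulation Question 2 asks for (everything here, the toy environment, the discretization, and the constants, is invented purely for illustration): discretize the continuous sensors into a finite state space, treat the client's good/poor judgments as the reward signal, and run tabular Q-learning with ε-greedy exploration:

```python
import random

random.seed(0)

# Invented toy problem: the continuous sensor reading is discretized
# into 10 states; the actuators allow 2 actions. We pretend the client's
# "done well" judgment rewards action 1 in states 5-9, action 0 in 0-4.
N_STATES, N_ACTIONS = 10, 2

def step(state, action):
    reward = 1.0 if action == (1 if state >= 5 else 0) else 0.0
    next_state = random.randrange(N_STATES)   # sensors jump arbitrarily
    return reward, next_state

# Tabular Q-learning with epsilon-greedy exploration.
alpha, gamma, eps = 0.1, 0.9, 0.1             # free parameters to be tuned
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
state = 0
for _ in range(20000):
    if random.random() < eps:
        action = random.randrange(N_ACTIONS)  # explore
    else:
        action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
    reward, nxt = step(state, action)
    # One-step temporal-difference update toward the Bellman target.
    Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
    state = nxt

greedy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
```

The greedy policy read off the learned Q-table recovers the rewarding action in each state; the discount γ, step size α, and exploration rate ε are exactly the free parameters the question asks you to justify choosing.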

3. Although modeling decision processes as Markov Decision Processes has proven useful, critics have noted that MDPs only model single agents and thus either avoid several multi-agent problems or hide serious complexity inside the transition model, making it difficult for an agent to act optimally under a variety of common conditions. Formally define the extension of MDPs and reinforcement learning to the multi-agent case. What are the requirements of a policy in this case? What are the issues that arise with such an extension? In particular, discuss the implications of extending the Bellman equation to the multi-agent case. What are the theoretical and computational problems that arise? Is this even a particularly reasonable way to proceed? Briefly propose some solutions to the problems you identify. Briefly compare and critique these solutions.

Part 4: Learning Theory

1. PAC learning of simple classes.
   (a) Suppose that C is a finite set of functions from X to {0, 1}. Prove that for any distribution D over X, any target function, and any ε, δ > 0, if we draw a sample S from D of size

       |S| ≥ (1/ε) [ ln(|C|) + ln(1/δ) ],

   then with probability ≥ 1 − δ, all h ∈ C with err_D(h) ≥ ε have err_S(h) > 0. Here, err_D(h) is the true error of h under D and err_S(h) is the empirical error of h on the sample S.
   (b) Present an algorithm for PAC-learning conjunctions over {0, 1}^n.

2. Expressivity of LTFs. Assume each example x is given by n boolean features (variables). A decision list is a function of the form: "if l_1 then b_1, else if l_2 then b_2, else if l_3 then b_3, ..., else b_m," where each l_i is a literal (either a variable or its negation) and each b_i ∈ {0, 1}. For instance, one possible decision list is the rule: "if x_1 then positive, else if x_5 then negative, else positive." Decision lists are a natural representation language in many settings and have also been shown to have a collection of useful theoretical properties.
   (a) Show that conjunctions (like x_1 ∧ x_2 ∧ x_3) and disjunctions (like x_1 ∨ x_2 ∨ x_3) are special cases of decision lists.
   (b) Show that decision lists are a special case of linear threshold functions. That is, any function that can be expressed as a decision list can also be expressed as a linear threshold function f(x) = + iff w_1 x_1 + ... + w_n x_n ≥ w_0, for some values w_0, w_1, ..., w_n.

3. VC-dimension of linear separators. In the problems below you will prove that the VC-dimension of the class H_n of halfspaces (another term for linear threshold functions) in n dimensions is n + 1. (H_n is the set of functions of the form w_1 x_1 + ... + w_n x_n ≥ w_0, where w_0, ..., w_n are real-valued.) We will use the following definition: the convex hull of a set of points S is the set of all convex combinations of points in S; that is, the set of all points that can be written as Σ_{x_i ∈ S} λ_i x_i, where each λ_i ≥ 0 and Σ_i λ_i = 1. It is not hard to see that if a halfspace has all points from a set S on one side, then the entire convex hull of S must be on that side as well.
   (a) [definition] Define the VC-dimension and describe the importance and usefulness of VC-dimension in machine learning.
   (b) [lower bound] Prove that VC-dim(H_n) ≥ n + 1 by presenting a set of n + 1 points in n-dimensional space such that one can partition that set with halfspaces in all possible ways. (And show how one can partition the set in any desired way.)
   (c) [upper bound, part 1] The following is Radon's Theorem, from the 1920s.
       Theorem. Let S be a set of n + 2 points in n dimensions. Then S can be partitioned into two (disjoint) subsets S_1 and S_2 whose convex hulls intersect.
       Show that Radon's Theorem implies that the VC-dimension of halfspaces is at most n + 1. Conclude that VC-dim(H_n) = n + 1.
   (d) [upper bound, part 2] Now we prove Radon's Theorem. We will need the following standard fact from linear algebra: if x_1, ..., x_{n+1} are n + 1 points in n-dimensional space, then they are linearly dependent. That is, there exist real values λ_1, ..., λ_{n+1}, not all zero, such that λ_1 x_1 + ... + λ_{n+1} x_{n+1} = 0. You may now prove Radon's Theorem however you wish. However, as a suggested first step, prove the following: for any set of n + 2 points x_1, ..., x_{n+2} in n-dimensional space, there exist λ_1, ..., λ_{n+2}, not all zero, such that Σ_i λ_i x_i = 0 and Σ_i λ_i = 0. (This is called affine dependence.) Now, think about the lambdas...
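As an illustration of Question 2(b) of Part 4 (a sketch, not part of the exam; the function names are invented), the example decision list "if x_1 then positive, else if x_5 then negative, else positive" can be written as a linear threshold function by giving earlier literals weights large enough to dominate everything after them:

```python
# The example decision list from Part 4, Question 2:
#   "if x1 then positive, else if x5 then negative, else positive."
def decision_list(x):
    if x[0]:      # literal x1 fires
        return 1  # positive
    if x[4]:      # literal x5 fires
        return 0  # negative
    return 1      # default: positive

# Equivalent linear threshold function: f(x) = + iff 4*x1 - 2*x5 >= 0.
# The weight on x1 (4) outweighs the sum of all later weights (2), so
# whenever x1 fires the sign is decided, mirroring the list's order.
def ltf(x):
    return 1 if 4 * x[0] - 2 * x[4] >= 0 else 0
```

Checking all four settings of (x_1, x_5) confirms the two functions agree everywhere, which is the pattern (geometrically decreasing weights) behind the general proof.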
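The affine-dependence step suggested in Question 3(d) can also be checked numerically. This sketch (using NumPy, with four invented points in the plane, so n = 2 and n + 2 = 4) finds the lambdas and exhibits the common point of the two convex hulls:

```python
import numpy as np

# Four points in the plane; the last lies inside the triangle of the others.
pts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [0.5, 0.5]])

# Affine dependence: lambdas, not all zero, with sum_i lambda_i x_i = 0
# and sum_i lambda_i = 0. Appending a trailing 1 to each point makes
# these exactly the null space of a 3x4 matrix.
A = np.vstack([pts.T, np.ones(len(pts))])   # shape (3, 4)
lam = np.linalg.svd(A)[2][-1]               # a null-space vector of A

pos, neg = lam > 0, lam < 0                 # the Radon partition S1, S2
s = lam[pos].sum()                          # equals -lam[neg].sum()

# The same point written as a convex combination of each side,
# so the convex hulls of S1 and S2 intersect.
common = (lam[pos] @ pts[pos]) / s
```

Because the lambdas sum to zero and are not all zero, both sign groups are nonempty, and dividing by s turns each group into a genuine convex combination; that is the heart of Radon's Theorem.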