CSE 546 Final Exam, Autumn 2013


Personal info:

Name:
Student ID:
E-mail address:

1. There should be 15 numbered pages in this exam (including this cover sheet).
2. You can use any material you brought: any book, class notes, your printouts of class materials that are on the class website, including my annotated slides and relevant readings. You cannot use materials brought by other students.
3. Calculators are not necessary. Laptops, PDAs, phones and Internet access are not allowed.
4. If you need more room to work out your answer to a question, use the back of the page and clearly mark on the front of the page if we are to look at what is on the back.
5. Work efficiently. Some questions are easier, some more difficult. Be sure to give yourself time to answer all of the easy ones, and avoid getting bogged down in the more difficult ones before you have answered the easier ones.
6. You have 1 hour and 50 minutes.
7. Good luck!

Question  Topic                        Max Score
1         k-means                      10
2         Expectation Maximization
3         Kernel Regression and k-NN   14
4         Support Vector Machines
5         LOOCV                        12
6         Markov Decision Processes    14
7         Learning Theory              12
8         Bayesian Networks            14
          TOTAL                        100

1 k-means [10 points]

Consider the following clustering method, called Leader Clustering. It receives two parameters: an integer k and a real number t. Similar to k-means, it starts by selecting k instances (which will be called leaders) and assigns each training instance to the cluster of the closest leader. During the assignment step, however, if the distance of a training instance to its closest leader is greater than the input threshold t, then this training instance becomes a new leader. During the same assignment step, remaining points can be assigned to these new leaders. After all the training instances have been assigned to a leader's cluster, new leaders are calculated by averaging each cluster. The process is then repeated until the cluster assignments do not change.

1. [6 points] Given a dataset and a value k, let t vary from 0 to a very large value. When does Leader Clustering produce more, the same number of, or fewer clusters than k-means, assuming that the k initial centers are the same for both? When will the clusterings produced by Leader Clustering and k-means be identical?

2. [4 points] Which of the two methods, k-means or Leader Clustering, will be better at dealing with outliers (data instances that are far away from, or very different to, the other instances in the dataset)? Explain.
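As a concrete illustration of the procedure described above, here is a minimal NumPy sketch of Leader Clustering, assuming Euclidean distance; the function name leader_clustering and the toy data are illustrative choices, not part of the exam.

```python
import numpy as np

def leader_clustering(X, k, t, max_iter=100, seed=0):
    """Sketch of Leader Clustering: k-means-style assignment, except that a
    point farther than t from every current leader becomes a new leader."""
    rng = np.random.default_rng(seed)
    leaders = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev = None
    for _ in range(max_iter):
        assign = np.empty(len(X), dtype=int)
        for i, x in enumerate(X):
            d = np.linalg.norm(leaders - x, axis=1)
            if d.min() > t:                        # farther than t from all leaders:
                leaders = np.vstack([leaders, x])  # promote x to a new leader
                assign[i] = len(leaders) - 1
            else:
                assign[i] = d.argmin()
        if prev is not None and np.array_equal(prev, assign):
            break                                  # cluster assignments stopped changing
        prev = assign
        # Update step: each leader becomes the mean of its (non-empty) cluster.
        leaders = np.array([X[assign == j].mean(axis=0) for j in np.unique(assign)])
    return leaders, assign

# Toy usage: two blobs plus one distant outlier that ends up as its own leader.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(8, 1, (20, 2)), [[40.0, 40.0]]])
leaders, labels = leader_clustering(X, k=2, t=6.0)
print(len(leaders), "clusters")
```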

2 Expectation Maximization [ points]

Let's say that in a certain course, the probability of each student getting a given grade is:

    P(A) = 1/2
    P(B) = µ
    P(C) = 2µ
    P(D) = 1/2 - 3µ

We then observe some data. Let a, b, c and d be the number of students who got an A, B, C or D, respectively.

1. [8 points] What is the Maximum Likelihood Estimator for µ, given a, b, c and d?

Now let's say that we observe some new data, but we don't observe a or b. Instead, we observe h, which is the number of students who got either an A or a B. So we don't know a or b, but we know that h = a + b. We still observe c and d as before. We now intend to use the Expectation Maximization algorithm to find an estimate of µ.

2. [ points] Expectation step: Given µ̂, a current estimate of µ, what are the expected values of a and b?

3. [ points] Maximization step: Given â and b̂, the expected values of a and b, what is the MLE of µ?
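As a sanity check, a closed-form answer to part 1 can be compared against a brute-force maximization of the likelihood. The sketch below assumes the grade probabilities as written above and hypothetical counts a, b, c, d; the helper name log_likelihood and the grid resolution are illustrative.

```python
import numpy as np

def log_likelihood(mu, a, b, c, d):
    """Multinomial log-likelihood of the observed grade counts under the model
    above (dropping the constant multinomial coefficient)."""
    p = np.array([0.5, mu, 2 * mu, 0.5 - 3 * mu])   # P(A), P(B), P(C), P(D)
    if np.any(p <= 0):
        return -np.inf
    return a * np.log(p[0]) + b * np.log(p[1]) + c * np.log(p[2]) + d * np.log(p[3])

# Hypothetical counts; mu must stay in (0, 1/6) so that all probabilities are positive.
a, b, c, d = 14, 6, 9, 10
grid = np.linspace(1e-6, 1/6 - 1e-6, 10001)
lls = np.array([log_likelihood(m, a, b, c, d) for m in grid])
print("numerical MLE of mu is roughly", grid[lls.argmax()])
```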

Kernel Regression and k-nn [ points]. [6 points] Sketch the fit Y given X for the dataset given below using kernel regression with a box kernel { if h xi x K(x i, x j ) = I( h x i x j < h) = j < h 0 otherwise for h = 0.5,. h = 0.5 4.5.5 y.5 0.5 0 0 4 5 6 x h = 4.5.5 y.5 0.5 0 0 4 5 6 x 4

2. [4 points] Sketch or describe a dataset where kernel regression with the box kernel above with h = 0.5 gives the same regression values as -NN but not as -NN in the domain x ∈ [0, 6] below.

[Figure: blank axes for sketching, x from 0 to 6, y from 0 to 5.]

3. [4 points] Sketch or describe a dataset where kernel regression with the box kernel above with h = 0.5 gives the same regression values as -NN but not as -NN in the domain x ∈ (0, 6) below.

[Figure: blank axes for sketching, x from 0 to 6, y from 0 to 5.]

4 Support Vector Machines [ Points]

1. [ points] Suppose we are using a linear SVM (i.e., no kernel), with some large C value, and are given the following data set.

[Figure: scatter plot of the labeled training points, with axes X_1 and X_2.]

Draw the decision boundary of the linear SVM. Give a brief explanation.

2. [ points] In the following image, circle the points such that, by removing that example from the training set and retraining the SVM, we would get a different decision boundary than training on the full sample.

[Figure: the same scatter plot, with axes X_1 and X_2.]

You do not need to provide a formal proof, but give a one or two sentence explanation.
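Since the original scatter plot is a figure, the sketch below uses a hypothetical, linearly separable toy dataset only to illustrate the setup: fitting a linear SVM with a very large C (approximately hard margin), assuming scikit-learn is available, and reading off the learned boundary and support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data standing in for the figure.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],    # class 0
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard-margin SVM, so the boundary is determined
# entirely by the support vectors (the points on the margin).
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]
print("decision boundary: %.3f*x1 + %.3f*x2 + %.3f = 0" % (w[0], w[1], b))
print("support vectors:")
print(clf.support_vectors_)
```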

3. [ points] Suppose that, instead of an SVM, we use regularized logistic regression to learn the classifier. That is,

    (w, b) = argmin_{w ∈ R^2, b ∈ R}  ||w||^2 − Σ_i [ 1[y^(i) = 0] ln( e^{−(w·x^(i) + b)} / (1 + e^{−(w·x^(i) + b)}) ) + 1[y^(i) = 1] ln( 1 / (1 + e^{−(w·x^(i) + b)}) ) ].

In the following image, circle the points such that, by removing that example from the training set and running regularized logistic regression, we would get a different decision boundary than training with regularized logistic regression on the full sample.

[Figure: the same scatter plot, with axes X_1 and X_2.]

You do not need to provide a formal proof, but give a one or two sentence explanation.
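The objective above can be evaluated directly. Below is a small sketch, with hypothetical data and the helper name reg_logreg_objective chosen for illustration, that computes ||w||^2 minus the log-likelihood for a given (w, b).

```python
import numpy as np

def reg_logreg_objective(w, b, X, y):
    """||w||^2 minus the log-likelihood, with p(y = 1 | x) = 1 / (1 + exp(-(w.x + b))),
    matching the objective written above."""
    z = X @ w + b
    log_p1 = -np.log1p(np.exp(-z))      # ln p(y = 1 | x)
    log_p0 = -z - np.log1p(np.exp(-z))  # ln p(y = 0 | x)
    loglik = np.sum(np.where(y == 1, log_p1, log_p0))
    return np.dot(w, w) - loglik

# Toy evaluation on hypothetical points (the exam's figure is not reproduced here).
X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 1.0], [4.0, 3.0]])
y = np.array([0, 0, 1, 1])
print(reg_logreg_objective(np.array([0.5, -0.2]), 0.1, X, y))
```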

4. [4 points] Suppose we have a kernel K(·, ·) such that there is an implicit high-dimensional feature map φ : R^d → R^D that satisfies, for all x, z ∈ R^d, K(x, z) = φ(x) · φ(z), where φ(x) · φ(z) = Σ_{i=1}^{D} φ(x)_i φ(z)_i is the dot product in the D-dimensional space. Show how to calculate the Euclidean distance in the D-dimensional space,

    ||φ(x) − φ(z)|| = sqrt( Σ_{i=1}^{D} (φ(x)_i − φ(z)_i)^2 ),

without explicitly calculating the values of the D-dimensional vectors. For this question, you should provide a formal proof.
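To make the premise concrete, the sketch below checks the identity K(x, z) = φ(x) · φ(z) numerically for one specific kernel, the homogeneous quadratic kernel in d = 2, whose explicit feature map is known; the kernel choice is only an example and is not the kernel intended by the question.

```python
import numpy as np

def K(x, z):
    """Homogeneous quadratic kernel, used here only as a concrete example."""
    return np.dot(x, z) ** 2

def phi(x):
    """Explicit feature map for K above when d = 2 (so D = 3)."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([-0.5, 3.0])
print(K(x, z), np.dot(phi(x), phi(z)))  # identical up to floating point error
```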

5 LOOCV [12 points]

5.1 Mean Squared Error

Suppose you have 100 datapoints {(x^(k), y^(k))}, k = 1, ..., 100. Your dataset has one input and one output. The kth datapoint is generated as follows:

    x^(k) = k/100
    y^(k) ~ Bernoulli(p)

(A random variable with a Bernoulli distribution with parameter p equals 1 with probability p and 0 with probability 1 − p.) Note that all of the y^(k)'s are just noise, drawn independently of all the other y^(k)'s. You will consider two learning algorithms:

    Algorithm NN: 1-nearest neighbor (with ties broken arbitrarily).
    Algorithm Zero: Always predict zero.

Hint: Recall that the mean of Bernoulli(p) is p.

1. [1 point] What is the expected Mean Squared Training Error for Algorithm Zero?

2. [1 point] What is the expected Mean Squared Training Error for Algorithm NN?

3. [1 point] What is the expected Mean Squared LOOCV error for Algorithm Zero?

4. [4 points] What is the expected Mean Squared LOOCV error for Algorithm NN?

5.2 k-NN [5 points]

Come up with a one-dimensional dataset (with 8 to 10 examples) for which the LOOCV accuracy with 1-nearest neighbor is 100%, but with -nearest neighbors is 0%.
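Expected errors like the ones asked for above can be estimated by simulation, which is a useful check on an analytic answer. The sketch below, with illustrative function names and an arbitrary p, estimates the LOOCV mean squared error of Algorithm Zero and Algorithm NN under the data-generating process of Section 5.1, and also provides a LOOCV accuracy checker into which a candidate dataset for Section 5.2 can be substituted.

```python
import numpy as np

def loocv_mse_zero(y):
    """LOOCV mean squared error of the always-predict-zero rule."""
    return np.mean(y ** 2)

def loocv_mse_1nn(x, y):
    """LOOCV MSE of 1-NN: predict each held-out y from its nearest other point."""
    errs = []
    for i in range(len(x)):
        d = np.abs(x - x[i])
        d[i] = np.inf                      # exclude the held-out point itself
        errs.append((y[i] - y[d.argmin()]) ** 2)
    return np.mean(errs)

def loocv_acc_knn(x, y, k):
    """LOOCV classification accuracy of k-NN on a 1-D dataset (majority vote)."""
    correct = 0
    for i in range(len(x)):
        d = np.abs(x - x[i])
        d[i] = np.inf
        nbrs = np.argsort(d)[:k]
        correct += int(np.round(y[nbrs].mean()) == y[i])
    return correct / len(x)

# Monte Carlo estimate under the data-generating process of Section 5.1 (p is arbitrary).
rng = np.random.default_rng(0)
p, n_trials = 0.3, 500
x = np.arange(1, 101) / 100.0
mse0, mse1 = [], []
for _ in range(n_trials):
    y = rng.binomial(1, p, size=100).astype(float)
    mse0.append(loocv_mse_zero(y))
    mse1.append(loocv_mse_1nn(x, y))
print("LOOCV MSE, Algorithm Zero ~", round(np.mean(mse0), 3))
print("LOOCV MSE, Algorithm NN   ~", round(np.mean(mse1), 3))

# Checker for Section 5.2: substitute a candidate dataset for xc, yc and vary k.
xc = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
yc = np.array([0, 1, 0, 1, 0, 1, 0, 1])
print("LOOCV accuracy, 1-NN:", loocv_acc_knn(xc, yc, 1), " 3-NN:", loocv_acc_knn(xc, yc, 3))
```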

6 Markov Decision Processes [14 points]

In the game micro-blackjack, you repeatedly draw a number (with replacement) that is equally likely to be a 2, 3, or 4. You can either Draw (a number) or Stop if the sum of the numbers you have drawn is less than 6. Otherwise, you must Stop. When you Stop, your reward equals the sum if the sum is less than 6, or equals 0 if the sum is 6 or higher. When you Draw, you receive no reward for that action and continue to the next round. The game ends only when you Stop. There is no discount (γ = 1).

1. [1 point] What is the state space for this MDP?

2. [ points] What is the reward function for this MDP?

3. [7 points] In order to learn the optimal policy for this MDP, we must compute the optimal value V*(x), where x is an arbitrary state. In lecture, the Value Iteration method for finding V*(x) was presented. Apply one iteration of Value Iteration to this MDP. Let the initial estimate V_0(x) be set to V_0(x) = max_a R(x, a), where a is an action and R(x, a) is the reward function. Clearly state your answer for V_1(x). You do not have to show basic calculations.

4. [ points] What is the optimal policy for this MDP? No explanation is necessary.
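As a reference for part 3, here is a generic Value Iteration sketch for a small finite MDP, using the update V_{t+1}(x) = max_a [ R(x, a) + γ Σ_{x'} P(x' | x, a) V_t(x') ] and the initialization V_0(x) = max_a R(x, a) described above. The three-state MDP in the usage example is hypothetical and is not the micro-blackjack game.

```python
def value_iteration(states, actions, P, R, gamma=1.0, n_iters=50):
    """Generic value iteration:
        V_{t+1}(x) = max_a [ R(x, a) + gamma * sum_{x'} P(x' | x, a) * V_t(x') ],
    where P[(x, a)] maps next states to probabilities and R[(x, a)] is the reward.
    Initialization follows the problem statement: V_0(x) = max_a R(x, a)."""
    V = {x: max(R[(x, a)] for a in actions if (x, a) in R) for x in states}
    for _ in range(n_iters):
        V = {x: max(R[(x, a)] + gamma * sum(p * V[x2] for x2, p in P[(x, a)].items())
                    for a in actions if (x, a) in R)
             for x in states}
    return V

# Hypothetical three-state toy MDP (not the micro-blackjack game), only to show the update.
states, actions = ["s0", "s1", "end"], ["go", "stop"]
R = {("s0", "go"): 0.0, ("s0", "stop"): 1.0,
     ("s1", "go"): 0.0, ("s1", "stop"): 3.0,
     ("end", "stop"): 0.0}
P = {("s0", "go"): {"s1": 0.5, "end": 0.5}, ("s0", "stop"): {"end": 1.0},
     ("s1", "go"): {"end": 1.0}, ("s1", "stop"): {"end": 1.0},
     ("end", "stop"): {"end": 1.0}}
print(value_iteration(states, actions, P, R, n_iters=10))
```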

7 Learning Theory [12 Points]

For this question, we consider the hypothesis space of decision lists. A decision list classifier consists of a series of binary tests. If a test succeeds, a prediction is returned. Otherwise, processing continues with the next test in the list. An example decision list of length 3 is shown below; it classifies whether someone should purchase a specific pair of hiking boots.

[Figure: a decision list with three tests, "Are a female", "On a tight budget", and "Need waterproof"; each test's Yes/No branches either return a Yes/No prediction or pass on to the next test.]

For this problem, we assume each datapoint consists of d binary attributes. Thus, each decision list represents a function f : {0, 1}^d → {0, 1}.

1. [8 points] Suppose we consider only decision lists of length k. Derive an upper bound on the size of the hypothesis space for this problem. Some possible answers include: (4d)^k, n^(k), d^k, k^d. Bounds that are overly loose will be given partial credit (by overly loose, we mean off by a very large amount, so don't worry about this too much). Show your work!
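For very small d and k, the size of the hypothesis space can be checked by brute force. The sketch below assumes one particular reading of the description, namely that each rule tests a single attribute against 0 or 1 and outputs 0 or 1, with a default output at the end of the list, and counts the distinct functions this produces; it is an empirical sanity check, not the derivation the question asks for.

```python
from itertools import product

def dlist_predict(rules, default, x):
    """Evaluate a decision list: each rule is (attribute index, test value, output)."""
    for j, v, out in rules:
        if x[j] == v:
            return out
    return default

def count_distinct_functions(d, k):
    """Brute-force count of distinct f: {0,1}^d -> {0,1} computed by decision
    lists of k rules plus a default output (one possible parameterization)."""
    inputs = list(product([0, 1], repeat=d))
    rule_choices = list(product(range(d), [0, 1], [0, 1]))  # (attribute, test value, output)
    functions = set()
    for rules in product(rule_choices, repeat=k):
        for default in (0, 1):
            functions.add(tuple(dlist_predict(rules, default, x) for x in inputs))
    return len(functions)

# Tiny cases only: the number of syntactically distinct lists grows as 2 * (4d)^k.
for d, k in [(2, 1), (2, 2), (3, 1)]:
    print("d=%d, k=%d: %d distinct functions, 2*(4d)^k = %d"
          % (d, k, count_distinct_functions(d, k), 2 * (4 * d) ** k))
```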

2. [4 points] Assume the true classification function can be expressed by a decision list of length k. Give a PAC bound for the number of examples needed to learn decision lists of length k with d binary attributes so as to guarantee a generalization error less than ε with probability at least 1 − δ. For this problem, assume there is no label noise.

8 Bayes Nets [14 Points]

For this question, consider the following two Bayes nets over binary variables. Each variable may take values from the set {0, 1}.

[Figure: the two Bayes net structures, labeled (a) and (b).]

1. [4 points] Write down two conditional independencies (e.g., X ⊥ Y | Z) that hold for both (a) and (b).

2. [4 points] Write down two conditional independencies that hold for one graph but not the other. Specify for which graphs your statements are true and for which graphs they are false.

3. [6 points] For model (a), consider the following marginal distribution P(A, B, C):

[Table: the joint probability P(A, B, C) for each of the eight assignments of (A, B, C).]

Based on the above table, list the probabilities P(A), P(B | A), and P(C | A) associated with the Bayes net. If this is not possible, explain why not.
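For reference, the sketch below shows the mechanics of reading P(A), P(B | A), and P(C | A) off an arbitrary joint distribution by marginalizing and conditioning, and of checking whether the joint factorizes accordingly; the numeric joint used here is made up for illustration, not the table from the exam.

```python
import numpy as np

# Hypothetical joint P(A, B, C), indexed as joint[a, b, c]; the exam's table
# gives the actual values, and the numbers here are made up.
joint = np.array([[[0.10, 0.05],
                   [0.15, 0.10]],
                  [[0.20, 0.10],
                   [0.20, 0.10]]])
assert np.isclose(joint.sum(), 1.0)

P_A = joint.sum(axis=(1, 2))                    # P(A)
P_B_given_A = joint.sum(axis=2) / P_A[:, None]  # P(B | A)
P_C_given_A = joint.sum(axis=1) / P_A[:, None]  # P(C | A)

# These three tables reproduce the joint exactly when it factorizes as
# P(A) P(B | A) P(C | A), i.e. when B and C are conditionally independent given A.
reconstructed = P_A[:, None, None] * P_B_given_A[:, :, None] * P_C_given_A[:, None, :]
print("P(A) =", P_A)
print("P(B | A) =", P_B_given_A)
print("P(C | A) =", P_C_given_A)
print("joint factorizes as P(A) P(B|A) P(C|A):", bool(np.allclose(reconstructed, joint)))
```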