
10-601 Machine Learning, Midterm Exam: Spring 2008 (Solutions)

Please put your name on this cover sheet. If you need more room to work out your answer to a question, use the back of the page and clearly mark on the front of the page if we are to look at what's on the back. This exam is open book, open notes, no applets, no wireless communication. There are 80 points total on the entire exam, and you have 80 minutes to complete it. Good luck!

Name:

Andrew ID:

Q   Topic                            Max. Score   Score
1   Short answer questions           20
2   d-separation                     10
3   On-line learning                 10
4   Learning methods we studied      10
5   A new learning method            15
6   Naive Bayes and Decision Trees   15
    Total                            80

Part I. Short-answer questions (20 pts)

1. (4 pts) Basic probability.

(a) You have received a shiny new coin and want to estimate the probability θ that it will come up heads if you flip it. A priori you assume that the most probable value of θ is 0.5. You then flip the coin 3 times, and it comes up heads twice. Which will be higher, your maximum likelihood estimate (MLE) of θ, or your maximum a posteriori probability (MAP) estimate of θ? Explain intuitively why, using only one sentence.

The MLE: a prior whose most probable value is 0.5 will pull the MAP estimate toward 0.5, while the MLE is 2/3.

(b) Show, using a proof or an example, that if Pr(A | B, C) = Pr(A | C), then Pr(A, B | C) = Pr(A | C) Pr(B | C).

Pr(A, B | C) = Pr(A | B, C) Pr(B | C) = Pr(A | C) Pr(B | C).

2. (6 pts) Bayes network structures.

(a) Using the chain rule of probability (also called the product rule) it is easy to express the joint distribution over A, B, and C, Pr(A, B, C), in several ways. For example,

i. Pr(A, B, C) = Pr(A | B, C) Pr(C | B) Pr(B)
ii. Pr(A, B, C) = Pr(B | A, C) Pr(C | A) Pr(A)

Write the Bayes net structures corresponding to each of these two ways of factoring Pr(A, B, C) (i.e., the Bayes nets whose Conditional Probability Tables correspond to the factors). Indicate which is which.

For Pr(A | B, C) Pr(C | B) Pr(B): B is a parent of both C and A, and C is a parent of A only.
For Pr(B | A, C) Pr(C | A) Pr(A): A is a parent of both C and B, and C is a parent of B only.

(b) How many distinct Bayes network graph structures can you construct to correctly describe the joint distribution Pr(A, B, C), assuming there are no conditional independencies? Please give the exact number, plus a one-sentence explanation. (You do not need to show the networks.)

6: the only structures with no conditional independencies are the fully connected chain-rule structures from the previous part, obtained by permuting the roles of A, B, and C.
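To make part 1(a) concrete, here is a short numeric sketch. The exam does not fix a particular prior; the Beta(2, 2) prior below is an illustrative assumption whose mode is 0.5, matching the statement that the most probable value of θ is 0.5 a priori.

```python
# MLE vs. MAP for a coin that comes up heads 2 times in 3 flips.
heads, flips = 2, 3
alpha = beta = 2.0  # assumed Beta(2, 2) prior hyperparameters (mode = 0.5)

mle = heads / flips                                          # 2/3 ~ 0.667
map_est = (heads + alpha - 1) / (flips + alpha + beta - 2)   # 3/5 = 0.600

print(f"MLE = {mle:.3f}, MAP = {map_est:.3f}")  # MLE > MAP, as the solution argues
```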

3. (4 pts) Statistical significance.

(a) You have trained a logistic regression classifier to predict movie preferences from past movie rentals. Over the 100 training examples, it correctly classifies 90. Over a separate set of 50 test examples, it correctly classifies 43. Give an unbiased estimate of the true error of this classifier, and a 90% confidence interval around that estimate (you may give your confidence interval in the form of an expression).

The thing to notice here is that using the training data results (10 incorrect out of 100) will yield a biased estimate, and so you may only use the test results (7 errors out of 50). Some people confused accuracy and error (which is what the question asks for). The answer is the following (where Z_90 = 1.64):

7/50 = 0.14, so the interval is 0.14 ± Z_90 sqrt(0.14 (1 - 0.14) / 50).

(b) Show that you really know what a confidence interval means, by describing in one sentence a simple experiment you could perform repeatedly, where you would expect in 90% of the experiments to observe an accuracy that falls inside your confidence interval.

If we repeated 100 times the experiment "collect 50 randomly drawn test examples and measure the fraction misclassified by the above trained classifier", then we expect the confidence intervals derived from the above expression to contain the true error in at least 90 of these experiments.

4. (3 pts) PAC learning. You want to learn a classifier over examples described by 49 real-valued features, using a learning method that is guaranteed to always find a hyperplane that perfectly separates the training data (whenever it is separable). You decide to use PAC learning theory to determine the number m of examples that suffice to assure your learner will with probability 0.9 output a hypothesis with accuracy at least 0.95. Just as you are about to begin, you notice that the number of features is actually 99, not 49. How many examples must you use now, compared to m? Give a one-sentence justification for your answer.

One thing to notice here is that the class of hypotheses is the class of hyperplanes, and therefore we can use the formula relating the dimensionality of the data to its VC dimension, i.e., VC = dimensionality + 1. Therefore, going from 49 to 99 features doubles the VC dimension from 50 to 100. Since m is linear in the VC dimension (as opposed to logarithmic in |H|), this will also double m.

5. (3 pts) Bayes rule. Suppose that in answering a question on a true/false test, an examinee either knows the answer, with probability p, or she guesses, with probability 1 - p. Assume that the probability of answering a question correctly is 1 for an examinee who knows the answer and 0.5 for the examinee who guesses. What is the probability that an examinee knew the answer to a question, given that she has correctly answered it?

Classic application of Bayes rule. A common mistake was to assume p = 0.5, although this isn't stated in the problem. We get:

P(knew | correct) = P(correct | knew) P(knew) / P(correct) = P(correct | knew) P(knew) / [P(correct | knew) P(knew) + P(correct | guessed) P(guessed)].

Notice in the denominator we had to consider the two ways she can get a question correct: by knowing, or by guessing. Plugging in, we get 1 p / (1 p + 0.5 (1 - p)) = 2p / (p + 1).
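A quick numeric check of 3(a) and 5 may be useful; the 1.64 z-value and the 2p/(p + 1) form come straight from the solutions above, and the code is only arithmetic.

```python
import math

# 3(a): test-set error estimate and 90% confidence interval (Z_90 = 1.64).
errors, n = 7, 50
err_hat = errors / n                                    # 0.14
half_width = 1.64 * math.sqrt(err_hat * (1 - err_hat) / n)
print(f"error = {err_hat:.2f} +/- {half_width:.3f}")

# 5: P(knew | correct) = 1*p / (1*p + 0.5*(1 - p)) = 2p / (p + 1).
for p in (0.25, 0.5, 0.9):
    posterior = p / (p + 0.5 * (1 - p))
    assert abs(posterior - 2 * p / (p + 1)) < 1e-12     # the algebraic simplification
    print(f"p = {p}: P(knew | correct) = {posterior:.3f}")
```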

Part II. d-separation (10 pts)

Consider the Bayes network above.

1. (2 pts) Is the set {W, X} d-separated from {V, Y} given {E}? Why or why not?

No. The path W-U-V is not blocked, since U is not in {E}; and the path X-Z-Y is not blocked, since a descendant of Z (namely E) is in {E}.

2. (4 pts) Let S be a set of variables such that {W, X} is d-separated from {V, Y} given S. What variables must be in S and why? What variables must not be in S and why?

In S: U, which is needed to block the path W-U-V. Not in S: Z and E, since Z is a collider on the path X-Z-Y and E is a descendant of Z, so conditioning on either of them would unblock that path.

3. (4 pts) Consider the subgraph containing only X, Y, Z and E. Is X ⊥ Y | E? If so, prove it. If not, give an example of CPTs such that X and Y are not independent given E.

X and Y are not independent given E. One example: Pr(X = 1) = Pr(Y = 1) = 0.5, with Z = X XOR Y and E = Z; observing E then reveals whether X and Y agree, so they become dependent.
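Question 3 can be checked by brute-force enumeration. The XOR and copy CPTs used below are an illustrative choice consistent with Pr(X = 1) = Pr(Y = 1) = 0.5, not necessarily the exam's exact table, for the subgraph X → Z ← Y with E a child of Z.

```python
from itertools import product

# Subgraph X -> Z <- Y, Z -> E, with illustrative CPTs:
#   P(X=1) = P(Y=1) = 0.5, Z = X XOR Y (deterministic), E = Z (deterministic).
joint = {}
for x, y in product([0, 1], repeat=2):
    z = x ^ y
    joint[(x, y, z, z)] = 0.25              # P(x) * P(y); Z and E are deterministic

def prob(predicate):
    return sum(p for a, p in joint.items() if predicate(*a))

# If X were independent of Y given E, these two conditionals would be equal.
p_x_given_e  = prob(lambda x, y, z, e: x == 1 and e == 0) / prob(lambda x, y, z, e: e == 0)
p_x_given_ye = prob(lambda x, y, z, e: x == 1 and y == 1 and e == 0) / \
               prob(lambda x, y, z, e: y == 1 and e == 0)
print(p_x_given_e, p_x_given_ye)            # 0.5 vs. 1.0: X and Y are dependent given E
```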

Part III. On-line learning (10 pts)

Consider an on-line learner B and a teacher A. Recall that learning proceeds as follows. For each round t = 1, ..., T:

- A sends B an instance x_t, withholding the associated label y_t.
- B predicts a class ŷ_t for x_t, and is assessed an error if ŷ_t ≠ y_t.
- A sends B the correct label y_t for x_t.

1. (5 pts) Let H be a set of hypotheses, and assume that A is reasonable for H in the following sense: for every sequence (x_1, y_1), ..., (x_T, y_T) presented by A to B, there exists some h* ∈ H such that y_t = h*(x_t) for all t. Show A can still force the learner B to make at least d mistakes, where d is the VC dimension of H.

VCdim(H) = d means there is a set S with |S| = d such that for every S' ⊆ S there is some h ∈ H with h ∩ S = S' (i.e., H shatters S). Present each element of S to B; whatever B predicts, A can provide the opposite prediction and still be consistent with some h ∈ H.

2. (5 pts) Suppose D = (x_1, y_1), ..., (x_T, y_T) is a dataset where

- each x_i represents a document d_i, where d_i has no more than 100 English words (and for the purpose of this question there are no more than 100,000 English words);
- each x_i is a sparse vector (b_1, ..., b_100,000), where b_w = 1 if word w is in the document d_i, and b_w = 0 otherwise;
- y_i is a label (+1 or -1) indicating the class of the document;
- there exists some vector u such that y_i (u · x_i) > 1 for all i.

Write down an upper bound on R, the maximum distance of an example x_i from the origin, and m, the number of mistakes made by a perceptron before converging to a correct hypothesis v_k, when trained on-line using data from D.

m < R^2 / γ^2. Plugging in, R^2 = ||x||^2 ≤ 100 (each document has at most 100 nonzero binary entries), and γ = 1 (as y_i (u · x_i) > 1; it is also OK to make an assumption about ||u||). So m < 100.
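For a concrete picture of the protocol in question 2, here is a minimal sketch of an on-line perceptron that counts mistakes on sparse binary "document" vectors and compares them against the mistake bound. The dimensions, the random separator, and the data generator are illustrative assumptions, not part of the exam.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for D: sparse binary vectors with at most 100 nonzero entries,
# labeled by a hidden linear separator u (all of this is made up for illustration).
n_features, n_docs, max_words = 2_000, 500, 100
u = rng.normal(size=n_features)
X = np.zeros((n_docs, n_features))
for i in range(n_docs):
    words = rng.choice(n_features, size=rng.integers(1, max_words + 1), replace=False)
    X[i, words] = 1.0
y = np.where(X @ u >= 0, 1, -1)

# On-line perceptron: predict first, then update only when a mistake is made.
w = np.zeros(n_features)
mistakes = 0
for x_i, y_i in zip(X, y):
    y_hat = 1 if w @ x_i >= 0 else -1
    if y_hat != y_i:
        w += y_i * x_i
        mistakes += 1

# Mistake bound m <= R^2 * ||u*||^2, where u* is u rescaled so min_i y_i (u* . x_i) = 1.
R_squared = (X * X).sum(axis=1).max()           # <= 100 here, as in the exam's bound
u_star = u / np.min(y * (X @ u))
print(f"mistakes = {mistakes}, bound = {R_squared * (u_star @ u_star):.1f}")
```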

Part IV. Understanding the performance of previously-studied learning methods (10 pts)

Consider the three datasets (A), (B), and (C) above, where the circles are positive examples (with two numeric features, the coordinates of the example in 2D space) and the squares are negative examples. Now consider the following four learning methods: Naive Bayes (NB), logistic regression (LR), decision trees with pruning (DT+p), and decision trees without pruning (DT-p).

1. (5 pts) For each of the three datasets, list the learning system(s) that you think will perform well, and say why.

There was some freedom here as to how many methods you listed, in which order, etc. The general idea, though, is that (A) is linearly separable and should be learnable by NB, LR, DT-p and DT+p. (B) is not linearly separable (and so can't be learned by NB or LR), but can be perfectly classified with two lines, i.e., by a decision tree such as DT-p or DT+p. Finally, (C) is a noisy version of (B), and so we need DT+p's pruning to deal with the noise (DT-p would overfit the noise).

2. (5 pts) For each of the three datasets, list the learning system(s) that you think will perform poorly, and say why.

As above, all methods should do well on (A); NB and LR should fail on (B) because of the form of the decision boundary; and NB, LR and DT-p should fail on (C) because of the noise.
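The (B)/(C) argument can be reproduced with a small experiment. Since the original figure is not shown here, the sketch below invents an XOR-like layout standing in for (B) and a label-noised copy standing in for (C), and approximates "pruning" with a depth limit; the data layout, the noise rate, and the pruning proxy are all assumptions, not the exam's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def xor_data(n):
    """XOR-like layout: positive iff the two features have opposite signs."""
    X = rng.uniform(-1, 1, size=(n, 2))
    return X, ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

X_train, y_B = xor_data(400)
y_C = np.where(rng.random(400) < 0.15, 1 - y_B, y_B)   # (C): 15% label noise
X_test, y_test = xor_data(400)                         # noise-free test labels

models = {
    "NB": GaussianNB(),
    "LR": LogisticRegression(),
    "DT-p": DecisionTreeClassifier(random_state=0),               # unpruned
    "DT+p": DecisionTreeClassifier(max_depth=2, random_state=0),  # depth limit as a crude "pruning"
}

for dataset, y_train in [("(B)", y_B), ("(C)", y_C)]:
    for name, model in models.items():
        acc = model.fit(X_train, y_train).score(X_test, y_test)
        print(f"{dataset} {name}: test accuracy {acc:.2f}")
```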

Part V. Understanding a novel learning method: Uniform Naive Bayes (15 pts)

Suppose you are building a Naive Bayes classifier for data containing two continuous attributes, X_1 and X_2, in addition to the boolean label Y. You decide to represent the conditional distributions p(X_i | Y) as uniform distributions; that is, to build a Uniform Naive Bayes Classifier. Your conditional distributions would be represented using the parameters

p(X_i = x | Y = y_k) = 1 / (b_ik - a_ik)   for a_ik ≤ x ≤ b_ik, where a_ik < b_ik

(of course P(X_i = x | Y = y_k) = 0 when x < a_ik or x > b_ik).

1. (5 pts) Given a set of data, how would we choose MLEs for the parameters a_ik and b_ik?

Take a_ik and b_ik to be the minimum and maximum observed values of X_i among the training examples of class y_k. (See homework 2.1.)

2. (5 pts) What will the decision surface look like? (If you like you can describe it in words.)

Two rectangles that may or may not intersect. If they intersect, there is a tie in the overlap.

3. (5 pts) Sometimes it is desirable to use a classifier to find examples with highly confident predictions, e.g., to rank examples x according to P(Y = + | x, θ). If you're using a classifier for this, it is usually undesirable for the scores of two examples to be tied (e.g., for P(Y = + | x_1, θ) = P(Y = + | x_2, θ)). Are ties more or less likely for Uniform Naive Bayes, compared to Gaussian Naive Bayes, and why?

More likely. Under UNB there are ties throughout the intersection of the two rectangles, as well as in any region outside both rectangles (where the probability is 0 for both classes). Under GNB, ties occur only along the (linear) decision boundary.
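A minimal sketch of such a Uniform Naive Bayes classifier under the setup above: per-class, per-feature min/max as the MLE endpoints, class priors estimated from counts, and prediction by comparing the (possibly zero) class-conditional densities. The class name, interface, and tie-breaking rule are illustrative choices.

```python
import numpy as np

class UniformNaiveBayes:
    """Uniform Naive Bayes: p(X_i | Y = y_k) is Uniform(a_ik, b_ik), fit by MLE."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.a_, self.b_, self.prior_ = {}, {}, {}
        for k in self.classes_:
            Xk = X[y == k]
            self.a_[k] = Xk.min(axis=0)           # MLE lower endpoints a_ik
            self.b_[k] = Xk.max(axis=0)           # MLE upper endpoints b_ik
            self.prior_[k] = len(Xk) / len(X)
        return self

    def _score(self, X, k):
        a, b = self.a_[k], self.b_[k]
        inside = np.all((X >= a) & (X <= b), axis=1)
        density = np.prod(1.0 / (b - a)) * self.prior_[k]
        return np.where(inside, density, 0.0)     # zero outside the class rectangle

    def predict(self, X):
        scores = np.column_stack([self._score(X, k) for k in self.classes_])
        return self.classes_[np.argmax(scores, axis=1)]  # ties broken toward the first class

# Tiny usage example with made-up 2-D data.
rng = np.random.default_rng(0)
X = np.vstack([rng.uniform(0.0, 1.0, (50, 2)), rng.uniform(0.7, 2.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(UniformNaiveBayes().fit(X, y).predict(np.array([[0.5, 0.5], [1.5, 1.5]])))  # -> [0 1]
```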

Part VI. Understanding a novel learning method: Naive Bayes/Decision tree hybrids (15 pts)

Suppose you were to augment a decision tree by adding a Naive Bayes classifier at each leaf. That is, consider learning a decision tree of depth k (where k is smaller than the number of variables n), where each leaf contains not a class, but a Naive Bayes classifier, which is based on the n - k variables not appearing on the path to that leaf.

1. (5 pts) Briefly outline a learning algorithm for this representation, assuming that k is a fixed parameter.

The simplest idea was to grow a decision tree by the usual method to depth k, and then learn a NB model over the examples in each of the leaves at level k. More sophisticated versions of this method, e.g., iterating one round of decision tree growth with one round of NB learning, were also acceptable.

2. (5 pts) Will this result in having the same NB classifier at each leaf? Why or why not? If not, how will they be different?

The NB classifiers learned at each leaf will be different, because they will have been trained on different examples, based on the different paths the examples take through the tree and the corresponding conditions each example satisfies or doesn't.

3. (5 pts) Briefly describe a plausible means of selecting a value for k.

Two common answers here: either a pruning-like method, based on information gain or something similar, or a cross-validation method, i.e., growing the tree for various values of k until performance on held-out data starts to fall. We didn't consider the time complexity of these various methods, but we did penalize for not explicitly using held-out data for cross-validation; e.g., choosing k so as to maximize performance directly on test data was not acceptable.
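The "simplest idea" from question 1 maps fairly directly onto code. Below is a minimal sketch, assuming sklearn's DecisionTreeClassifier and GaussianNB as stand-ins for the tree-growing and NB steps; unlike the exam's description, the leaf models here reuse all n features rather than only the n - k features off the path, purely to keep the sketch short, and the class name and interface are made up. k would then be chosen by cross-validation on held-out data, as in question 3.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

class TreeNBHybrid:
    """Depth-k decision tree whose leaves each hold a Naive Bayes model
    trained on the training examples that reach that leaf (sketch)."""

    def __init__(self, k):
        self.k = k

    def fit(self, X, y):
        self.tree_ = DecisionTreeClassifier(max_depth=self.k, random_state=0).fit(X, y)
        leaves = self.tree_.apply(X)               # leaf id reached by each training example
        self.leaf_models_ = {}
        for leaf in np.unique(leaves):
            mask = leaves == leaf
            if len(np.unique(y[mask])) > 1:
                self.leaf_models_[leaf] = GaussianNB().fit(X[mask], y[mask])
            else:                                   # pure leaf: just remember its class
                self.leaf_models_[leaf] = y[mask][0]
        return self

    def predict(self, X):
        preds = []
        for x_row, leaf in zip(X, self.tree_.apply(X)):
            model = self.leaf_models_[leaf]
            if isinstance(model, GaussianNB):
                preds.append(model.predict(x_row.reshape(1, -1))[0])
            else:
                preds.append(model)
        return np.array(preds)

# Usage (illustrative): TreeNBHybrid(k=2).fit(X_train, y_train).predict(X_test)
```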