MIDTERM SOLUTIONS: FALL 2012 CS 6375 INSTRUCTOR: VIBHAV GOGATE

March 28, 2012

The exam is closed book. You are allowed a double-sided, one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an answer, continue on the back of the page. Attach your cheat sheet to your exam.

NAME:
UTD-ID (if known):

Problem 1:
Problem 2:
Problem 3:
Problem 4:
Problem 5:
Problem 6:
TOTAL:

PROBLEM 1: TRUE/FALSE QUESTIONS (10 points)

1. (2 points) Naive Bayes is a linear classifier. True or False. Explain.

False. The decision rule for Naive Bayes cannot be written as $\sum_i w_i x_i > k$.

2. (2 points) Classifier A has 90% accuracy on the training set and 75% accuracy on the test set. Classifier B has 78% accuracy on both the training and test sets. Therefore, we can conclude that classifier A is better than classifier B (because it has better mean accuracy). True or False. Explain.

False. Classifier A is probably overfitting the training set and not generalizing as well as classifier B (although we do not have overwhelming evidence in favor of classifier B).

3. (2 points) Consider a data set that is linearly separable and two perceptrons, one trained using the gradient descent rule and the other trained using the perceptron rule. Both perceptrons will have the same accuracy on the training set and the test set. True or False. Explain.

False. A perceptron does not learn a unique decision boundary.

4. (2 points) Given infinite training data over n Boolean attributes, we can always learn the target concept using a decision tree but not using Naive Bayes or Logistic Regression. True or False. Explain.

True. A decision tree of depth n can represent any Boolean function perfectly, so given infinite training data it will learn the target concept exactly. Logistic Regression and Naive Bayes, on the other hand, cannot represent all possible Boolean functions. Naive Bayes is in fact a probabilistic classifier; it does not represent a target concept directly.

5. (2 points) A decision tree is the smallest possible representation of the target concept. In other words, there exists no other representation that is smaller than the decision tree. True or False. Explain.

False. A decision tree representation of a SAT formula can be exponential in size, while the SAT formula itself can be written down in linear space.

PROBLEM 2: LINEAR REGRESSION (10 points)

Consider a linear regression problem $y = w_1 x + w_0$, with a training set having m examples $(x_1, y_1), \ldots, (x_m, y_m)$. Suppose that we wish to minimize the mean 5th-degree error (loss function) given by:
$$\text{Loss} = \frac{1}{m} \sum_{i=1}^{m} (y_i - w_1 x_i - w_0)^5$$

1. (3 points) Calculate the gradient with respect to the parameter $w_1$. Hint: $\frac{d}{dx} x^k = k x^{k-1}$.

The gradient is:
$$\frac{\partial \text{Loss}}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} 5 (y_i - w_1 x_i - w_0)^4 (-x_i)$$

2. (4 points) Write down pseudo-code for online gradient descent on $w_1$ for this problem. (You do not need to include the equations for $w_0$.)

Initialize $w_0$ and $w_1$; assume a value for the learning rate $\eta$.
Repeat until convergence:
  For each example $(x_i, y_i)$ in the training set:
    $w_1 \leftarrow w_1 + 5\eta (y_i - w_1 x_i - w_0)^4 x_i$  (the constant 5 can be absorbed into $\eta$)

3. (3 points) Give one reason in favor of online gradient descent compared to batch gradient descent, and one reason in favor of batch over online.

Online:
- It is often much faster, especially when the training set is redundant (contains many similar data points).
- It can be used when there is no fixed training set (streaming data).
- The noise in the gradient can help escape from local minima.

Batch:
- Less random and therefore easier to check for convergence.
- Often easier to set the learning rate $\eta$.
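The online update above translates directly into code. Below is a minimal Python sketch of the per-example update for both $w_0$ and $w_1$; the function name, data format, and default hyperparameters are illustrative assumptions, and the exam's 5th-power loss is used as given purely to show the mechanics of the update.

```python
import random

def online_gd_fifth_degree(data, eta=1e-4, epochs=100):
    """Online gradient descent for the exam's loss (y - w1*x - w0)^5,
    updating after each example. The factor of 5 from differentiating
    r^5 is kept explicit rather than absorbed into eta."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)            # visit examples in a random order
        for x, y in data:
            r = y - w1 * x - w0         # residual on this example
            w1 += eta * 5 * r**4 * x    # -eta * d/dw1 (r^5)
            w0 += eta * 5 * r**4        # -eta * d/dw0 (r^5)
    return w0, w1
```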

PROBLEM 3: CLASSIFICATION (10 points)

Imagine that you are given the following set of training examples. Each feature can take on one of three nominal values: a, b, or c.

    F1  F2  F3  Category
    a   c   a   +
    c   a   c   +
    a   a   c   -
    b   c   a   -
    c   c   b   -

1. (5 points) How would a Naive Bayes system classify the following test example? Be sure to show your work. F1 = a, F2 = c, F3 = b.

P(Class = +) = 2/5, P(Class = -) = 3/5
P(F1 = a | +) = 1/2, P(F1 = a | -) = 1/3
P(F1 = b | +) = 0,   P(F1 = b | -) = 1/3
P(F1 = c | +) = 1/2, P(F1 = c | -) = 1/3
P(F2 = a | +) = 1/2, P(F2 = a | -) = 1/3
P(F2 = c | +) = 1/2, P(F2 = c | -) = 2/3
P(F3 = a | +) = 1/2, P(F3 = a | -) = 1/3
P(F3 = b | +) = 0,   P(F3 = b | -) = 1/3
P(F3 = c | +) = 1/2, P(F3 = c | -) = 1/3

For the positive class:
P(F1 = a | +) P(F2 = c | +) P(F3 = b | +) P(+) = (1/2)(1/2)(0)(2/5) = 0

For the negative class:
P(F1 = a | -) P(F2 = c | -) P(F3 = b | -) P(-) = (1/3)(2/3)(1/3)(3/5) = 2/45

The class is given by $\arg\max_c \prod_j P(F_j \mid C = c)\, P(C = c)$. Therefore the predicted class is -.

2. (5 points) Describe how a 3-nearest-neighbor algorithm would classify the test example given above.

The 3-nearest-neighbor algorithm uses a distance measure to find the three training points closest to the test example and assigns it their majority class. Since the features are nominal, we use Hamming distance (the number of attributes on which two examples disagree); equivalently, we can rank examples by the number of attributes on which they agree with the test point. The number of agreements between the test point and the five training examples is 2, 0, 1, 1, 2 (Hamming distances 1, 3, 2, 2, 1). The three nearest neighbors are therefore the first and fifth examples plus either the third or the fourth; both choices yield a majority class of -.
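Both classifiers above can be checked with a few lines of Python. The sketch below reproduces the unsmoothed Naive Bayes scores and the 3-NN vote using the number of matching attributes as the similarity measure; the variable names and list-based representation are assumptions for illustration.

```python
from collections import Counter

# Training data from the problem: (F1, F2, F3, class)
train = [("a", "c", "a", "+"), ("c", "a", "c", "+"),
         ("a", "a", "c", "-"), ("b", "c", "a", "-"), ("c", "c", "b", "-")]
test = ("a", "c", "b")

# Naive Bayes with plain relative-frequency estimates (no smoothing),
# exactly as in the worked solution above.
def nb_score(cls):
    rows = [r for r in train if r[3] == cls]
    score = len(rows) / len(train)                       # prior P(Class = cls)
    for j, v in enumerate(test):
        score *= sum(1 for r in rows if r[j] == v) / len(rows)
    return score

print({c: nb_score(c) for c in ("+", "-")})   # {'+': 0.0, '-': 2/45 ~ 0.044}

# 3-NN using the number of matching attributes as the similarity measure.
sims = [(sum(f == t for f, t in zip(r[:3], test)), r[3]) for r in train]
top3 = sorted(sims, reverse=True)[:3]
print(Counter(c for _, c in top3).most_common(1))        # majority class '-'
```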

PROBLEM 4: BOOSTING, LOGISTIC REGRESSION AND POINT ESTIMATION (10 points)

The diagram (not reproduced here) shows training data for a binary concept where positive examples are denoted by a heart. Also shown are three decision stumps (A, B and C), each of which consists of a linear decision boundary.

1. (4 points) Suppose that AdaBoost chooses A as the first stump in an ensemble and it has to decide between B and C as the next stump. Which will it choose? Explain. What will be the $\epsilon$ and $\alpha$ values for the first iteration?

It will choose B, because the only example misclassified by A is correctly classified by B (misclassified examples are assigned higher weight in the next iteration). In the first iteration, $\epsilon = 1/7$ and $\alpha = \ln\frac{1-\epsilon}{\epsilon} = \ln 6$.

2. (3 points) When learning a logistic regression classifier, you run gradient ascent for 50 iterations with the learning rate $\eta = 0.3$, and compute the conditional log-likelihood $J(\theta)$ after each iteration (where $\theta$ denotes the weight vector). You find that the value of $J(\theta)$ increases quickly and then levels off. Based on this, which of the following conclusions seems most plausible? Explain your choice in a sentence or two.

A. Rather than use the current value of $\eta$, it would be more promising to try a larger value for the learning rate (say $\eta = 1.0$).
B. $\eta = 0.3$ is an effective choice of learning rate.
C. Rather than use the current value of $\eta$, it would be more promising to try a smaller value (say $\eta = 0.1$).

B is the correct answer: $\eta = 0.3$ is an effective choice of learning rate because the algorithm appears to converge after only a few iterations.
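For concreteness, the $\epsilon$ and $\alpha$ computation for the first round can be written out as below; this follows the convention used in the solution above, $\alpha = \ln((1-\epsilon)/\epsilon)$ without a 1/2 factor.

```python
import math

eps = 1 / 7                          # stump A misclassifies 1 of 7 equally weighted examples
alpha = math.log((1 - eps) / eps)    # = ln 6, approx. 1.792
print(eps, alpha)
# In the reweighting step the misclassified example's weight grows relative to the
# correctly classified ones, which is why the next stump chosen is B: B is the
# candidate that classifies that now heavily weighted example correctly.
```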

3. (3 points) (Point Estimation) You are given a coin and a thumbtack and you put Beta priors Beta(100, 100) and Beta(1, 1) on the coin and thumbtack, respectively. You perform the following experiment: toss both the thumbtack and the coin 100 times. To your surprise, you get 60 heads and 40 tails for both the coin and the thumbtack. Are the following two statements true or false?

- The MLE estimate of both the coin and the thumbtack is the same, but the MAP estimate is not.
- The MAP estimate of the parameter $\theta$ (probability of landing heads) for the coin is greater than the MAP estimate of $\theta$ for the thumbtack.

Explain your answer mathematically.

The first statement is True. The MLE estimate is 60/100 = 0.6 in both cases, but the MAP estimates differ because the Beta priors are different.

The second statement is False. With $\alpha_H$ observed heads, $\alpha_T$ observed tails and a Beta($\beta_H$, $\beta_T$) prior, the MAP estimate is
$$\hat{\theta}_{MAP} = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2}.$$
MAP estimate for the coin: $\frac{60 + 100 - 1}{60 + 100 + 40 + 100 - 2} = \frac{159}{298} \approx 0.534$.
MAP estimate for the thumbtack: $\frac{60 + 1 - 1}{60 + 1 + 40 + 1 - 2} = \frac{60}{100} = 0.6$.
Therefore, the MAP estimate for the thumbtack is greater than that for the coin.
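The MLE and MAP numbers above can be verified with a short script; the helper names are illustrative, and the MAP formula is the mode of the Beta posterior as given above.

```python
def mle(n_heads, n_tails):
    """Maximum-likelihood estimate of P(heads); n_heads, n_tails play the role of alpha_H, alpha_T."""
    return n_heads / (n_heads + n_tails)

def map_beta(n_heads, n_tails, beta_h, beta_t):
    """MAP estimate (mode of the Beta posterior) under a Beta(beta_h, beta_t) prior."""
    return (n_heads + beta_h - 1) / (n_heads + beta_h + n_tails + beta_t - 2)

print(mle(60, 40))                  # 0.6 for both the coin and the thumbtack
print(map_beta(60, 40, 100, 100))   # coin:      159/298 ~ 0.534
print(map_beta(60, 40, 1, 1))       # thumbtack: 60/100  = 0.6
```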

PROBLEM 5: SUPPORT VECTOR MACHINES (10 points)

Consider the following 1-dimensional data:

    x:      -3   0   1   2   3   4   5
    Class:   -   -   +   +   +   +   +

1. (3 points) Draw the decision boundary of a linear support vector machine on this data and identify the support vectors.

The decision boundary lies at x = 0.5. The support vectors are x = 0 and x = 1.

2. (3 points) Give the solution parameters w and b, where the linear form is wx + b.

From the support vectors, $w \cdot 0 + b = -1$ and $w \cdot 1 + b = 1$. Therefore $b = -1$ and $w = 2$.

3. (2 points) Suppose we have another instance (x = -5, Class = +). What kernel will you use to classify the training data perfectly?

We should use a polynomial kernel of degree 2. (Intuitively, we need a "circle" around -3 and 0, i.e., a quadratic boundary that isolates the two negative points.)

4. (2 points) Calculate the leave-one-out cross-validation error for this data set.

There are two possible answers here. The number of examples misclassified in leave-one-out cross-validation is either 0, 1, 0, 0, 0, 0, 0 or 0, 1, 1, 0, 0, 0, 0 (when x = 1 is left out, it falls exactly on the learned boundary, so it can be counted either way). Therefore the cross-validation error is 1/7 or 2/7.
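The parameters in parts 1 and 2 can be sanity-checked numerically. The sketch below assumes scikit-learn is available and approximates the hard-margin SVM with a very large C; it is a check of the worked answer, not part of the original solution.

```python
from sklearn.svm import SVC

X = [[-3], [0], [1], [2], [3], [4], [5]]
y = [-1, -1, 1, 1, 1, 1, 1]

# Hard-margin linear SVM approximated with a very large C.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.coef_, clf.intercept_)   # approximately w = 2, b = -1
print(clf.support_vectors_)        # x = 0 and x = 1
```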

PROBLEM 6: COMPLEXITY ANALYSIS (10 points)

Provide pseudo-code for an extension of the standard decision tree algorithm that does a k-move lookahead. We discussed a 1-move lookahead algorithm in class. Assume that you have N instances in the training set and each instance is defined by M binary attributes. Provide a computational analysis of the time complexity of the algorithm with a 2-move lookahead. Hypothesize whether this algorithm will perform better or worse than the standard decision tree and give a qualitative argument why.

The pseudo-code for k-move lookahead is given below. S is the training data and the $X_j$ index the attributes.

Procedure ChooseBestAttribute(S, k):
  Return an attribute $X_j$ having the lowest Error(S, $X_j$, k - 1)

Procedure Error(S, $X_j$, k):
  If k = 0 then
    Return the total error if we split on attribute $X_j$
  Else
    e = infinity
    For every attribute $X_k$ we have not yet split on do
      e = min(e, Error(S | $X_j$ = True, $X_k$, k - 1) + Error(S | $X_j$ = False, $X_k$, k - 1))
    Return e

At each step of decision tree induction, the complexity of 1-move lookahead is O(NM), where N is the number of training examples and M is the dimensionality of the data (the number of attributes). Since the recursive procedure above is called M times for 2-move lookahead, the complexity of 2-move lookahead is O(NM^2). In general, the complexity of k-move lookahead is O(NM^k).

On average, the algorithm is likely to yield a smaller tree than the 1-move lookahead algorithm because it has more knowledge about future splits. When k = M, for example, we are guaranteed to get the smallest possible decision tree.
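A runnable version of the lookahead pseudo-code is sketched below in Python. The data representation (a list of (attribute-dict, label) pairs) and the function names are assumptions for illustration; the structure mirrors ChooseBestAttribute and Error above.

```python
def misclassification_error(S):
    """Errors made by predicting the majority label of S."""
    if not S:
        return 0
    counts = {}
    for _, label in S:
        counts[label] = counts.get(label, 0) + 1
    return len(S) - max(counts.values())

def subset(S, attr, value):
    """Examples in S whose binary attribute `attr` equals `value`."""
    return [(x, y) for x, y in S if x[attr] == value]

def error(S, attr, k, remaining):
    """Error of splitting on `attr`, looking ahead k more moves (Error above)."""
    if k == 0:
        return (misclassification_error(subset(S, attr, True)) +
                misclassification_error(subset(S, attr, False)))
    rest = [a for a in remaining if a != attr]   # attributes not yet split on
    if not rest:
        return error(S, attr, 0, remaining)
    best = float("inf")                          # e = infinity in the pseudo-code
    for nxt in rest:
        best = min(best,
                   error(subset(S, attr, True), nxt, k - 1, rest) +
                   error(subset(S, attr, False), nxt, k - 1, rest))
    return best

def choose_best_attribute(S, attributes, k):
    """Attribute with the lowest k-move lookahead error (ChooseBestAttribute above)."""
    return min(attributes, key=lambda a: error(S, a, k - 1, attributes))

# Example usage on a tiny XOR-like data set with two binary attributes.
S = [({"A": False, "B": False}, 0), ({"A": False, "B": True}, 1),
     ({"A": True,  "B": False}, 1), ({"A": True,  "B": True},  0)]
print(choose_best_attribute(S, ["A", "B"], k=2))
```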