FINAL: CS 6375 (Machine Learning) Fall 2014


FINAL: CS 6375 (Machine Learning) Fall 2014

The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an answer, use an additional sheet (available from the instructor) and staple it to your exam.

NAME:
UTD-ID (if known):

Question                              Points  Score
Decision Trees                            10
Naive Bayes and the EM algorithm          10
Linear Methods                            10
SVMs and Kernels                          10
Ensemble Methods                          10
Neural Networks                           10
Nearest Neighbors and K-means             10
Learning Theory                           10
Bayesian networks                         20
Total:                                   100


Question 1: Decision Trees (10 points)

Consider the training dataset given below. In the dataset, X1, X2, and X3 are the attributes and Y is the class variable.

Example#  X1  X2  X3  Y
E1        0   0   0   +ve
E2        0   0   1   -ve
E3        0   1   0   -ve
E4        0   1   1   +ve
E5        1   0   0   -ve

(a) (3 points) Which attribute has the highest information gain? Justify your answer.

(b) (3 points) Draw the (full) decision tree for this dataset using the information gain criterion.
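For parts (a) and (b), the information gain of each attribute can be checked against the entropy values listed at the end of the exam. Below is a minimal Python sketch of that computation; the function names and the 0/1 list encoding of the table are illustrative, not part of the exam.

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c > 0:
            p = c / total
            h -= p * math.log2(p)
    return h

# Training set from Question 1: (X1, X2, X3, label)
data = [
    (0, 0, 0, '+'), (0, 0, 1, '-'), (0, 1, 0, '-'),
    (0, 1, 1, '+'), (1, 0, 0, '-'),
]

def info_gain(attr_index):
    """Information gain of splitting on the attribute at `attr_index` (0, 1, or 2)."""
    pos = sum(1 for row in data if row[3] == '+')
    neg = len(data) - pos
    gain = entropy(pos, neg)
    for value in (0, 1):
        subset = [row for row in data if row[attr_index] == value]
        if subset:
            p = sum(1 for row in subset if row[3] == '+')
            gain -= len(subset) / len(data) * entropy(p, len(subset) - p)
    return gain

for i, name in enumerate(['X1', 'X2', 'X3']):
    print(name, round(info_gain(i), 3))
```

The unsplit training set has 2 positive and 3 negative examples, so its entropy is the Entropy(2,3) value listed in the table at the end of the exam.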

Suppose that we have the following validation dataset:

Example#  X1  X2  X3  Y
E6        1   1   0   +ve
E7        0   1   1   -ve
E8        1   1   1   -ve

(c) (4 points) Can you construct a decision tree that has 100% accuracy on the validation dataset as well as the training dataset? Answer Yes or No. If your answer is yes, draw a decision tree that achieves 100% accuracy. If your answer is no, explain why it cannot be done. No credit will be given if the explanation is incorrect.

Question 2: Naive Bayes and the EM algorithm (10 points)

In this question, we will develop the EM algorithm for Naive Bayes with hidden variables. Consider the Naive Bayes model given below, in which X1 and X2 are always observed, H is a hidden variable, and Y is the class variable.

[Naive Bayes model: Y is the parent of X1, X2, and H.]

Assume that all variables are binary and that all parameters are initialized as follows:

Parameter     Initial Value
P(Y)          θ
P(X1 | Y)     α1
P(X1 | ¬Y)    α0
P(X2 | Y)     β1
P(X2 | ¬Y)    β0
P(H | Y)      γ1
P(H | ¬Y)     γ0

We are given a dataset with 1000 examples, such that every tuple in the truth table over X1, X2, and Y occurs exactly 125 times in it. Thus, for example, #(X1, X2, Y) = 125, #(X1, X2, ¬Y) = 125, and so on for all eight assignments.

(a) (7 points) After running the EM algorithm until convergence, what will be the values of the parameters θ, α1, α0, β1, β0, γ1, and γ0? Explain your answer. No credit without a correct explanation. To figure out the parameters at convergence, try running EM for 1-2 iterations and the pattern will become clear.
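One way to trace part (a) is to write out the E-step expected counts and the M-step update for the hidden variable's parameters. The sketch below assumes the model reads Y → {X1, X2, H}, so the posterior of H given a fully observed example (x1, x2, y) depends only on y; the function name and the counts passed in are illustrative.

```python
def em_step_gamma(gamma1, gamma0, n_y1, n_y0):
    """One EM iteration for P(H | Y) and P(H | not Y), assuming Y -> {X1, X2, H}."""
    # E-step: expected count of H = 1 among the Y = 1 examples and among the Y = 0 examples.
    expected_h1_y1 = gamma1 * n_y1
    expected_h1_y0 = gamma0 * n_y0
    # M-step: re-estimate the parameters from the expected counts.
    return expected_h1_y1 / n_y1, expected_h1_y0 / n_y0

# With the dataset of Question 2, Y = 1 and Y = 0 each occur 500 times.
print(em_step_gamma(0.3, 0.7, 500, 500))  # dividing the expected counts by the group sizes returns the starting values
```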

(b) (3 points) Your boss claims that hidden variables in a Naive Bayes model, assuming they are children of the class variable, are useless. Is your boss correct? Explain your answer.

Question 3: Linear Methods (10 points)

(a) (2 points) Does a 2-class Gaussian Naive Bayes classifier with parameters µ1,k, σ1,k, µ2,k, and σ2,k for attributes k = 1, ..., m have exactly the same representational power as logistic regression, given no assumptions about the variance values σ1,k^2 and σ2,k^2? True or False. Explain your answer (no credit if the explanation is incorrect).

(b) (8 points) Consider the 2-D dataset given below. Suppose we are going to use logistic regression to train a classifier on this dataset, and let us further assume that we are going to apply regularization to all the weights. Formally, we will train a classifier that maximizes the following expression:

    max_w  sum_{i=1}^{n} log P(y^(i) | x1^(i), x2^(i))  -  (λ0 w0^2 + λ1 w1^2 + λ2 w2^2)

where i indexes the i-th data point, given by the triple (x1^(i), x2^(i), y^(i)); x1 and x2 are the features and y is the class variable. Recall that P(y | x1, x2) is given by:

    P(y | x1, x2) = 1 / (1 + e^-(w0 + w1 x1 + w2 x2))

[2-D scatter plot of the dataset over x1 and x2; both axes run from -3 to 3.]

(Question continues on the next page.)

Draw the decision boundary for each of the following cases:

Case 1: λ0 = 0, λ1 = 0, and λ2 = 0
Case 2: λ0 = ∞, λ1 = 0, and λ2 = 0
Case 3: λ0 = 0, λ1 = ∞, and λ2 = 0
Case 4: λ0 = 0, λ1 = 0, and λ2 = ∞

[Four copies of the empty scatter plot, one per case; both axes run from -3 to 3.]
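To reason about the four cases, it can help to evaluate the regularized objective from the previous page numerically. The sketch below is illustrative only: the function names are not from the exam, and the toy data stands in for the scatter plot, which is not reproduced in this transcription.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def regularized_log_likelihood(w, data, lam):
    """Objective from Question 3(b): sum of log P(y | x1, x2) minus per-weight L2 penalties.

    w    -- (w0, w1, w2)
    data -- list of (x1, x2, y) with y in {0, 1}
    lam  -- (lambda0, lambda1, lambda2)
    """
    w0, w1, w2 = w
    ll = 0.0
    for x1, x2, y in data:
        p = sigmoid(w0 + w1 * x1 + w2 * x2)
        ll += math.log(p if y == 1 else 1.0 - p)
    penalty = lam[0] * w0 ** 2 + lam[1] * w1 ** 2 + lam[2] * w2 ** 2
    return ll - penalty

# Example usage on a hypothetical dataset (not the exam's plot):
toy = [(-2.0, 1.0, 1), (2.0, -1.0, 0), (1.0, 2.0, 1), (-1.0, -2.0, 0)]
print(regularized_log_likelihood((0.0, 0.5, 0.5), toy, (0.0, 0.0, 0.0)))
```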

Question 4: SVMs and Kernels (10 points)

(a) (3 points) In the following image, circle the points such that removing that example from the training set and retraining a linear SVM would give a different decision boundary than training on the full sample. Also give a one- or two-sentence explanation of why you circled those points.

Explanation:

(b) (3 points) Suppose you used a perceptron instead of a linear SVM. Circle the points such that removing that example from the training set and running the perceptron may give a different decision boundary than training the perceptron on the full sample. Note the word "may": you should take the worst-case view. Also give a one- or two-sentence explanation of why you circled those points.

Explanation:

(c) (4 points) You are trying to use SVMs to build a classifier for a dataset that your boss gave you. In the dataset, there are only a few positive training examples, and you are certain that they are always classified correctly by your SVM classifier. The dataset also has a large number of negative training examples, and your boss tells you that it is OK if we misclassify some of them. Modify the basic dual form of the SVM optimization problem given below:

    maximize    sum_{i=1}^{n} α_i  -  (1/2) sum_{i=1}^{n} sum_{j=1}^{n} α_i α_j y_i y_j K(x_i, x_j)
    subject to  α_i >= 0,  and  sum_{i=1}^{n} α_i y_i = 0

to better solve this type of problem. Namely, you would like to ensure that you won't misclassify any of the positive examples but could misclassify some of the negative examples. Introduce additional parameter(s) (or constants, for the purpose of solving the quadratic programming problem) in order to achieve this. In your solution, please use I+ to index positively labeled examples (y_i = +1) and I- for negative examples (y_i = -1). In other words, i ∈ I+ means that y_i = +1 and i ∈ I- means that y_i = -1. Write down your solution using the dual formulation.
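For reference, the unmodified dual objective above and its constraints can be evaluated numerically for candidate multipliers. A small sketch with a linear kernel; the function names and the tiny dataset are illustrative, not from the exam.

```python
def linear_kernel(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def dual_objective(alpha, X, y, kernel=linear_kernel):
    """Standard SVM dual from Question 4(c): sum(alpha) - 1/2 * sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    n = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def satisfies_constraints(alpha, y, tol=1e-9):
    """Check alpha_i >= 0 and sum_i alpha_i y_i = 0 up to a small tolerance."""
    return all(a >= -tol for a in alpha) and abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol

# Tiny illustrative check
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [+1, -1]
alpha = [0.5, 0.5]
print(dual_objective(alpha, X, y), satisfies_constraints(alpha, y))
```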

Question 5: Ensemble Methods (10 points)

For your convenience, AdaBoost is given on the last page.

(a) (1 point) Circle the method or methods that are slower to train than the decision tree classifier. More than one answer is possible. No explanation needed.
    BAGGING    BOOSTING

(b) (1 point) Circle the method or methods that can be parallelized easily (after all, we live in the age of multi-core machines). More than one answer is possible. No explanation needed.
    BAGGING    BOOSTING

(c) (1 point) Circle the method or methods that attempt to reduce the variance. More than one answer is possible. No explanation needed.
    BAGGING    BOOSTING

(d) (1 point) Circle the method or methods that hurt the performance of stable classifiers. To recap, a stable classifier is a classifier whose decision boundary does not change much as we add new data points. More than one answer is possible. No explanation needed.
    BAGGING    BOOSTING

(e) (3 points) Answer True or False and explain your answer: a weak learner with less than 50% accuracy does not present any problem for the AdaBoost algorithm.

(f) (3 points) True or False: AdaBoost is not susceptible to outliers. If your answer is true, explain why. If your answer is false, describe a simple heuristic to fix AdaBoost so that it is not susceptible to outliers.
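The AdaBoost pseudocode referred to above appears on the exam's last page but is not reproduced in this transcription. For parts (e) and (f), the weight-update step of one standard binary AdaBoost round looks roughly like the sketch below (a common formulation, not necessarily the exam's exact notation).

```python
import math

def adaboost_round(weights, predictions, labels):
    """One boosting round: compute the weighted error, the learner weight alpha, and re-weight examples.

    weights     -- current example weights (assumed to sum to 1)
    predictions -- weak learner outputs in {-1, +1}
    labels      -- true labels in {-1, +1}
    """
    error = sum(w for w, p, t in zip(weights, predictions, labels) if p != t)
    alpha = 0.5 * math.log((1.0 - error) / error)  # alpha is positive only when error < 0.5
    new_weights = [w * math.exp(-alpha * t * p)
                   for w, p, t in zip(weights, predictions, labels)]
    z = sum(new_weights)
    return alpha, [w / z for w in new_weights]

print(adaboost_round([0.25] * 4, [+1, +1, -1, -1], [+1, -1, -1, -1]))
```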

Question 6: Neural Networks (10 points)

Consider the neural network given below.

[Two-layer network: inputs x1 and x2 feed hidden units 1 and 2 through weights w1, w2, w3, and w4; hidden units 1 and 2 feed output unit 3 through weights w5 and w6.]

Assume that all internal nodes and output nodes compute the tanh function. In this question, we will derive an explicit expression that shows how backpropagation (applied to minimize the least-squares error function) changes the values of w1, w2, w3, w4, w5, and w6 when the algorithm is given the example (x1, x2, y), with y being the class variable (there are no bias terms). Assume that the learning rate is η. Let o1 and o2 be the outputs of hidden units 1 and 2, respectively, and let o3 be the output of output unit 3.

Hint: the derivative of tanh(x) is 1 - tanh^2(x).

(a) (3 points) Forward propagation. Write equations for o1, o2, and o3.

(b) (3 points) Backward propagation. Write equations for δ1, δ2, and δ3, where δ1, δ2, and δ3 are the values propagated backwards by the units denoted 1, 2, and 3, respectively, in the neural network.

(c) (4 points) Give an explicit expression for the new (updated) weights w1, w2, w3, w4, w5, and w6 after backward propagation.
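A quick way to sanity-check the closed-form updates from parts (a)-(c) is to compare them against finite-difference gradients of the squared error. The sketch below assumes one particular wiring of w1-w6 (the exam's figure is not reproduced here) and uses the (1/2)(y - o3)^2 convention for the error; all function names are illustrative.

```python
import math

# Assumed wiring: w1, w2 feed hidden unit 1; w3, w4 feed hidden unit 2;
# w5, w6 connect hidden units 1 and 2 to output unit 3. No bias terms.
def forward(w, x1, x2):
    w1, w2, w3, w4, w5, w6 = w
    o1 = math.tanh(w1 * x1 + w2 * x2)
    o2 = math.tanh(w3 * x1 + w4 * x2)
    o3 = math.tanh(w5 * o1 + w6 * o2)
    return o1, o2, o3

def squared_error(w, x1, x2, y):
    return 0.5 * (y - forward(w, x1, x2)[2]) ** 2

def numeric_gradient(w, x1, x2, y, eps=1e-6):
    """Finite-difference gradient of the error with respect to each weight."""
    grads = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        grads.append((squared_error(wp, x1, x2, y) - squared_error(wm, x1, x2, y)) / (2 * eps))
    return grads

w = [0.1, -0.2, 0.3, 0.4, -0.5, 0.6]
print(numeric_gradient(w, x1=1.0, x2=-1.0, y=1.0))
```

A gradient-descent update with learning rate η would then subtract η times each of these gradients from the corresponding weight.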

Question 7: Nearest Neighbors and K-means (10 points)

(a) (3 points) As we increase k, the k-nearest-neighbor algorithm begins to overfit the training dataset. True or False? Explain your answer.

(b) (3 points) What is the time complexity of the K-means algorithm? Use O-notation and explain your notation clearly.
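For part (b), the per-iteration cost is easiest to read off a direct implementation. Below is a minimal sketch of one K-means iteration; the function name and the toy input are illustrative.

```python
def kmeans_iteration(points, centers):
    """One K-means iteration: assign each point to its nearest center, then recompute the means.

    With n points, k centers, and d dimensions, the assignment loop is O(n * k * d)
    and the mean update is O(n * d).
    """
    k = len(centers)
    assignments = []
    for p in points:                                                              # n points
        dists = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers]    # k centers, d dims each
        assignments.append(dists.index(min(dists)))
    new_centers = []
    for j in range(k):
        members = [p for p, a in zip(points, assignments) if a == j]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centers.append(centers[j])  # keep an empty cluster's center unchanged
    return assignments, new_centers

print(kmeans_iteration([(0, 0), (0, 1), (4, 4), (5, 4)], [(0.0, 0.0), (5.0, 5.0)]))
```

The full algorithm repeats this until the assignments stop changing, so the total cost also depends on the number of iterations.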

[Scatter plot of the dataset: six points marked '+' on a grid whose axes run from 0 to 4.]

(c) (4 points) Draw the agglomerative clustering tree for the dataset given above. You must use single-link clustering.
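Since the figure's coordinates are not reproduced in this transcription, the sketch below uses hypothetical 2-D points to illustrate the single-link merge rule. It builds flat clusters rather than the full tree the question asks you to draw, and all names and data are illustrative.

```python
def single_link_clusters(points, num_clusters):
    """Greedy single-link agglomerative clustering: repeatedly merge the two clusters
    whose closest pair of points is nearest (squared Euclidean distance)."""
    def d2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                link = min(d2(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or link < best[0]:
                    best = (link, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # record this merge to build the tree
        del clusters[j]
    return clusters

# Hypothetical points standing in for the '+' marks in the exam's figure.
print(single_link_clusters([(1, 1), (1, 2), (3, 3), (3, 4), (2, 1)], 2))
```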

Question 8: Learning Theory (10 points)

(a) (5 points) Prove that the VC dimension of the concept class of all axis-parallel rectangles is greater than or equal to 4.

(b) (5 points) Consider the class of concepts that can be expressed as an axis-parallel rectangle. Assume further that the rectangles are defined in terms of 4 integers a, b, c, and d that lie between 1 and 100 (both included), namely a, b, c, d ∈ {1, ..., 100}. We assume that a consistent algorithm is available.

i. What is the size of the hypothesis space?
ii. How many randomly obtained examples are needed for PAC learning with accuracy parameter ε = 0.1 and confidence parameter δ = 0.05?
iii. What is the accuracy level (what is ε) if it is known that 1000 randomly chosen examples were used and the desired confidence level is δ = 0.05?
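Parts (ii) and (iii) plug into the first sample-complexity bound listed at the end of the exam, m >= (1/ε)(ln(1/δ) + ln|H|). A small sketch of both computations; the hypothesis-space size is left as a placeholder since counting it is part (i), and the function names are illustrative.

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """m >= (1/epsilon) * (ln(1/delta) + ln|H|), the first bound on the exam's formula page."""
    return math.ceil((math.log(1.0 / delta) + math.log(h_size)) / epsilon)

def pac_accuracy_from_m(h_size, m, delta):
    """Solve the same bound for epsilon given a fixed number of examples m."""
    return (math.log(1.0 / delta) + math.log(h_size)) / m

H_SIZE = 100 ** 4  # placeholder: one count consistent with a, b, c, d in {1, ..., 100}; derive part (i) yourself
print(pac_sample_bound(H_SIZE, epsilon=0.1, delta=0.05))
print(pac_accuracy_from_m(H_SIZE, m=1000, delta=0.05))
```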

Question 9: Bayesian networks (20 points)

Consider the Bayesian network given below:

[Bayesian network diagram over the variables X1 through X8.]

(a) (7 points) Let X4 be evidence. Consider the elimination order (X1, X2, X3, X5, X6, X7, X8). Write down the factors that you will encounter during this elimination when computing P(X4 = x4) (namely, trace the variable elimination algorithm). What are the time and space complexity along this ordering?

(b) (7 points) Let X4 be evidence for the Bayesian network given on the previous page. Consider the elimination order (X5, X7, X1, X2, X3, X6, X8). Write down the factors that you will encounter during this elimination when computing P(X4 = x4). What are the time and space complexity along this ordering?

(c) (6 points) Suppose we want to learn a Bayesian network over two binary variables X1 and X2. You have n training examples {(x1^(1), x2^(1)), ..., (x1^(n), x2^(n))}. Let BN1 denote the Bayesian network having no edges and BN2 denote the Bayesian network having an edge from X1 to X2. Suppose you are learning both networks using the maximum likelihood estimation approach.

i. Describe a situation in which you would prefer BN1 over BN2 for modeling this data.
ii. Describe a situation in which you would prefer BN2 over BN1 for modeling this data.
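Part (c) compares maximum-likelihood fits of the two structures. The sketch below computes both log-likelihoods from a small list of (x1, x2) pairs; the function names and the toy data are illustrative, and no smoothing is applied, so an MLE of exactly 0 or 1 would make a log term undefined.

```python
import math

def bern(p, x):
    """Probability of a binary outcome x under a Bernoulli(p)."""
    return p if x == 1 else 1.0 - p

def log_likelihood_bn1(data):
    """BN1 (no edges): P(x1, x2) = P(x1) * P(x2), with both marginals fit by MLE."""
    n = len(data)
    p1 = sum(x1 for x1, _ in data) / n
    p2 = sum(x2 for _, x2 in data) / n
    return sum(math.log(bern(p1, x1)) + math.log(bern(p2, x2)) for x1, x2 in data)

def log_likelihood_bn2(data):
    """BN2 (edge X1 -> X2): P(x1, x2) = P(x1) * P(x2 | x1), with CPTs fit by MLE."""
    n = len(data)
    p1 = sum(x1 for x1, _ in data) / n
    ll = sum(math.log(bern(p1, x1)) for x1, _ in data)
    for v in (0, 1):
        rows = [x2 for x1, x2 in data if x1 == v]
        if rows:
            p2_given_v = sum(rows) / len(rows)
            ll += sum(math.log(bern(p2_given_v, x2)) for x2 in rows)
    return ll

data = [(0, 0), (0, 0), (1, 1), (1, 1), (0, 1), (1, 0)]
print(log_likelihood_bn1(data), log_likelihood_bn2(data))
```

Comparing the two values on different kinds of data (for example, data where X1 and X2 look independent versus data where they are strongly correlated) is one way to build intuition for parts i and ii.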

Entropy

Here Entropy(a, b) denotes the entropy of a collection containing a examples of one class and b examples of the other, i.e., H(p) = -p log2(p) - (1 - p) log2(1 - p) with p = a / (a + b).

Entropy(0,0) = 0.0
Entropy(0,1) = 0.0
Entropy(0,2) = 0.0
Entropy(0,3) = 0.0
Entropy(0,4) = 0.0
Entropy(1,1) = 1.0
Entropy(1,2) = 0.918295834054
Entropy(1,3) = 0.811278124459
Entropy(1,4) = 0.721928094887
Entropy(2,2) = 1.0
Entropy(2,3) = 0.970950594455
Entropy(2,4) = 0.918295834054
Entropy(3,3) = 1.0
Entropy(3,4) = 0.985228136034
Entropy(4,4) = 1.0

Useful formulas for Learning Theory

m >= (1/ε) (ln(1/δ) + ln|H|)
m >= (1/(2ε^2)) (ln(1/δ) + ln|H|)
m >= (1/ε) (4 log2(2/δ) + 8 VC(H) log2(13/ε))
