Midterm: CS 6375 Spring 2018

The exam is closed book (1 cheat sheet allowed). Answer the questions in the spaces provided on the question sheets. If you run out of room for an answer, use an additional sheet (available from the instructor) and staple it to your exam.

NAME: ____________________
UTD-ID (if known): ____________________

Question                          Points  Score
Decision Trees                    10
Neural Networks                   10
Support Vector Machines (SVMs)    10
Short Questions                   20
Total:                            50


Question 1: Decision Trees (10 points)

Consider a large dataset D having n examples in which the positive examples (denoted by +) and negative examples (denoted by -) follow the pattern given below. (Notice that the data is clearly linearly separable.)

[Figure: a 2-D plot with both axes ranging from 0 to 5 showing the pattern of + and - examples in dataset D.]

(a) (5 points) Which among the following is the best upper bound (namely, the smallest one that is a valid upper bound) on the number of leaves in an optimal decision tree for D (n is the number of examples in D)? By optimal, I mean a decision tree having the smallest number of nodes. Circle the answer and explain why it is the best upper bound. No credit without a correct explanation.
1. O(n)
2. O(log n)
3. O(log log n)
4. O((log n)^2)

O(n) is the correct answer. Since all splits are either horizontal or vertical (axis-parallel), each split can correctly classify at most one point of the pattern, so the optimal tree needs on the order of n leaves.
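A quick empirical check of this bound (not part of the original exam): the sketch below builds a diagonally arranged, linearly separable dataset of the kind suggested by the figure and fits an axis-parallel decision tree with scikit-learn; the leaf count grows with n rather than with log n. The exact point pattern is an assumption, since only the general shape of the figure is recoverable.

```python
# Hypothetical reconstruction of the diagonal pattern in Question 1; the exact
# coordinates are an assumption, only the "linearly separable diagonal" shape is known.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

for n in (10, 50, 100):
    xs = np.arange(n, dtype=float)
    X = np.vstack([np.column_stack([xs, xs + 1.0]),   # positives just above the diagonal
                   np.column_stack([xs, xs - 1.0])])  # negatives just below the diagonal
    y = np.array([1] * n + [0] * n)

    clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
    print(n, clf.get_n_leaves())   # grows roughly linearly with n
```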

Consider the dataset given below. X1, X2, X3 and X4 are the attributes (or features) and Y is the class variable.

X1  X2  X3  X4  Y
3   0   0   1   +
1   1   0   0   -
2   0   1   1   -
5   1   1   0   +
4   1   0   1   +
6   0   1   0   -

(b) (2 points) Which attribute (among X1, X2, X3 and X4) has the highest information gain?

X1 has the highest information gain (it has a different value for each example).

(c) (3 points) In the above dataset, is the attribute having the highest information gain useful (namely, will it help improve generalization)? Answer YES/NO. If your answer is YES, explain why the attribute is useful. If your answer is NO, explain how you would change the information gain criterion so that such useless attributes are not selected.

The answer is NO. The attribute does not generalize well. Use the gain ratio criterion we discussed in class (divide the information gain by the entropy of the attribute itself).
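For concreteness, here is a small sketch (not part of the original exam) that computes the information gain and the gain ratio for each attribute in the table above. The gain ratio divides an attribute's gain by its own entropy (its split information), which for the many-valued attribute X1 is log2 6 ≈ 2.58, heavily penalizing it relative to its raw gain of 1.0.

```python
# Information gain and gain ratio for the Question 1 table; column names are taken
# from the question, the implementation is a generic illustrative sketch.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target="Y"):
    base = entropy([r[target] for r in rows])
    n = len(rows)
    cond = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == v]
        cond += len(subset) / n * entropy(subset)
    return base - cond

def gain_ratio(rows, attr):
    split_info = entropy([r[attr] for r in rows])  # entropy of the attribute itself
    g = info_gain(rows, attr)
    return g / split_info if split_info > 0 else 0.0

data = [
    {"X1": 3, "X2": 0, "X3": 0, "X4": 1, "Y": "+"},
    {"X1": 1, "X2": 1, "X3": 0, "X4": 0, "Y": "-"},
    {"X1": 2, "X2": 0, "X3": 1, "X4": 1, "Y": "-"},
    {"X1": 5, "X2": 1, "X3": 1, "X4": 0, "Y": "+"},
    {"X1": 4, "X2": 1, "X3": 0, "X4": 1, "Y": "+"},
    {"X1": 6, "X2": 0, "X3": 1, "X4": 0, "Y": "-"},
]

for a in ("X1", "X2", "X3", "X4"):
    print(a, round(info_gain(data, a), 3), round(gain_ratio(data, a), 3))
```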

Question 2: Neural Networks (10 points)

Consider the neural network given below.

[Figure: a two-layer network. Inputs x1 and x2 feed hidden units 1 and 2 via weights w1 (x1 to unit 1), w2 (x1 to unit 2), w3 (x2 to unit 1) and w4 (x2 to unit 2). Hidden units 1 and 2 feed output units 3 and 4 via weights w5 (unit 1 to unit 3), w6 (unit 1 to unit 4), w7 (unit 2 to unit 3) and w8 (unit 2 to unit 4).]

Assume that all internal nodes and output nodes compute the sigmoid function σ(t). In this question, we will derive an explicit expression that shows how backpropagation (applied to minimize the least squares error function) changes the values of w1, w2, w3, w4, w5, w6, w7 and w8 when the algorithm is given the example (x1, x2, y1, y2), with y1 and y2 being the target outputs at units 3 and 4 respectively (there are no bias terms). Assume that the learning rate is η. Let o1 and o2 be the outputs of hidden units 1 and 2 respectively, and let o3 and o4 be the outputs of output units 3 and 4 respectively.

Hint: the derivative of the sigmoid is d/dt σ(t) = σ(t)(1 - σ(t)).

(a) (2 points) Forward propagation. Write equations for o1, o2, o3 and o4.

o1 = σ(w1 x1 + w3 x2)
o2 = σ(w2 x1 + w4 x2)
o3 = σ(w5 o1 + w7 o2)
o4 = σ(w6 o1 + w8 o2)

(b) (4 points) Backward propagation. Write equations for δ1, δ2, δ3 and δ4, where δ1, δ2, δ3 and δ4 are the values propagated backwards by the units denoted 1, 2, 3 and 4 respectively in the neural network.

δ1 = o1(1 - o1)(δ3 w5 + δ4 w6)
δ2 = o2(1 - o2)(δ3 w7 + δ4 w8)
δ3 = o3(1 - o3)(y1 - o3)
δ4 = o4(1 - o4)(y2 - o4)

(c) (4 points) Give an explicit expression for the new (updated) weights w1, w2, w3, w4, w5, w6, w7 and w8 after backward propagation.

w1 = w1 + η δ1 x1
w2 = w2 + η δ2 x1
w3 = w3 + η δ1 x2
w4 = w4 + η δ2 x2
w5 = w5 + η δ3 o1
w6 = w6 + η δ4 o1
w7 = w7 + η δ3 o2
w8 = w8 + η δ4 o2
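A small numerical sketch (not from the exam) implementing exactly these forward, backward, and update equations, and checking the result against a finite-difference estimate of the squared error; the inputs, targets, initial weights, and learning rate are arbitrary illustrative choices.

```python
# Sketch of Question 2's equations with scalar weights w1..w8; the concrete numbers
# (inputs, targets, initial weights, eta) are illustrative assumptions.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

x1, x2, y1, y2, eta = 0.5, -1.0, 1.0, 0.0, 0.1
w = {i: 0.1 * i for i in range(1, 9)}   # w1..w8

# Forward propagation (part a)
o1 = sigmoid(w[1] * x1 + w[3] * x2)
o2 = sigmoid(w[2] * x1 + w[4] * x2)
o3 = sigmoid(w[5] * o1 + w[7] * o2)
o4 = sigmoid(w[6] * o1 + w[8] * o2)

# Backward propagation (part b)
d3 = o3 * (1 - o3) * (y1 - o3)
d4 = o4 * (1 - o4) * (y2 - o4)
d1 = o1 * (1 - o1) * (d3 * w[5] + d4 * w[6])
d2 = o2 * (1 - o2) * (d3 * w[7] + d4 * w[8])

# Weight updates (part c): w <- w + eta * delta * (input feeding that weight)
updates = {1: d1 * x1, 2: d2 * x1, 3: d1 * x2, 4: d2 * x2,
           5: d3 * o1, 6: d4 * o1, 7: d3 * o2, 8: d4 * o2}
new_w = {i: w[i] + eta * updates[i] for i in w}
print(new_w)

# Finite-difference check on w1, using E = 0.5*((y1-o3)^2 + (y2-o4)^2), the convention
# under which the deltas above give dE/dw1 = -delta1 * x1.
def error(wd):
    a1 = sigmoid(wd[1] * x1 + wd[3] * x2); a2 = sigmoid(wd[2] * x1 + wd[4] * x2)
    a3 = sigmoid(wd[5] * a1 + wd[7] * a2); a4 = sigmoid(wd[6] * a1 + wd[8] * a2)
    return 0.5 * ((y1 - a3) ** 2 + (y2 - a4) ** 2)

eps = 1e-6
wp = dict(w); wp[1] += eps
print((error(wp) - error(w)) / eps, -updates[1])   # the two numbers should match
```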

Question 3: Support Vector Machines (SVMs) (10 points)

Consider the following 2-D dataset (x1 and x2 are the attributes and y is the class variable).

Dataset:
x1  x2  y
0   0   +1
0   1   +1
1   0   +1
1   1   -1

(a) (5 points) Precisely write the expression for the dual problem (assuming a linear SVM). Let α1, α2, α3, α4 be the Lagrange multipliers associated with the four data points.

Maximize: α1 + α2 + α3 + α4 - (1/2)(α2^2 + α3^2 + 2 α4^2 - 2 α2 α4 - 2 α3 α4)
subject to: α1, α2, α3, α4 ≥ 0 and α1 + α2 + α3 - α4 = 0.

(b) (5 points) Identify the support vectors and compute the values of α1, α2, α3 and α4. (Hint: you don't have to solve the dual optimization problem to compute α1, α2, α3 and α4.)

The last three points are the support vectors. This means that α1 = 0 and, from the equality constraint, α4 = α2 + α3. Substituting these two constraints into the quadratic optimization problem, we get:
Maximize: 2 α2 + 2 α3 - (1/2)(α2^2 + α3^2)
Taking derivatives with respect to α2 and α3 and setting them to zero gives α2 = 2 and α3 = 2. Therefore α4 = α2 + α3 = 2 + 2 = 4.
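As a quick sanity check (not part of the exam), one can recover the primal weight vector from these multipliers via w = Σ_i α_i y_i x_i and verify the margin conditions; the sketch below does this with NumPy.

```python
# Verification sketch: recover (w, b) from the claimed multipliers and check that
# y_i (w.x_i + b) >= 1 for all four points, with equality on the support vectors.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([+1, +1, +1, -1], dtype=float)
alpha = np.array([0.0, 2.0, 2.0, 4.0])

w = (alpha * y) @ X              # w = sum_i alpha_i y_i x_i = (-2, -2)
b = y[1] - w @ X[1]              # solve y_i (w.x_i + b) = 1 on a support vector -> b = 3
margins = y * (X @ w + b)
print(w, b, margins)             # margins: [3., 1., 1., 1.]
```

The three support vectors sit exactly on the margin (value 1) and the first point lies strictly outside it (value 3), consistent with α1 = 0.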

Question 4: Short Questions (20 points)

Consider a linear regression problem y = w1 x + w2 z + w0, with a training set having m examples (x1, z1, y1), ..., (xm, zm, ym). Suppose that we wish to minimize the squared error (loss function) given by

Loss = Σ_{i=1}^{m} (y_i - w1 x_i - w2 z_i - w0)^2

under the assumption w1 = w2.

(a) (5 points) Derive a batch gradient descent algorithm that minimizes the loss function. (Note the assumption w1 = w2; this is also known as parameter tying.)

Because w1 and w2 are tied, they share a single partial derivative:
∂Loss/∂w1 = ∂Loss/∂w2 = -2 Σ_{i=1}^{m} (y_i - w1 x_i - w2 z_i - w0)(x_i + z_i)
and the partial derivative with respect to w0 is
∂Loss/∂w0 = -2 Σ_{i=1}^{m} (y_i - w1 x_i - w2 z_i - w0).
Batch gradient descent repeatedly updates the tied weight and w0 by subtracting the learning rate times these derivatives (see the sketch below and the class notes).
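The batch gradient descent loop that the solution defers to the class notes can be written out as follows. This is a minimal sketch under the tied-parameter assumption w1 = w2 = w; the synthetic data, true parameters, learning rate, and iteration count are all illustrative assumptions.

```python
# Minimal batch gradient descent for Loss = sum_i (y_i - w*x_i - w*z_i - w0)^2
# with the tied parameter w = w1 = w2; data and learning rate are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
z = rng.normal(size=200)
y = 1.5 * x + 1.5 * z + 0.7 + 0.01 * rng.normal(size=200)   # true tied weight 1.5, intercept 0.7

w, w0, eta = 0.0, 0.0, 0.01
for _ in range(2000):
    residual = y - w * x - w * z - w0            # y_i - w1 x_i - w2 z_i - w0 with w1 = w2 = w
    grad_w = -2.0 * np.sum(residual * (x + z))   # dLoss/dw (tied weight)
    grad_w0 = -2.0 * np.sum(residual)            # dLoss/dw0
    w -= eta / len(y) * grad_w                   # dividing by m keeps the step size stable
    w0 -= eta / len(y) * grad_w0

print(w, w0)   # converges close to 1.5 and 0.7
```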

(b) (5 points) When do you expect the learning algorithm for the logistic regression classifier to produce the same parameters as the ones produced by the learning algorithm for the Gaussian Naive Bayes model with class-independent variances?

The linear classifiers produced by logistic regression and Gaussian Naive Bayes are identical in the limit as the number of training examples approaches infinity, provided the Naive Bayes assumptions hold.

(c) (4 points) Describe a reasonable strategy to choose k in k-nearest neighbor.

Use a validation set: try different values of k and choose the one that has the highest accuracy on the validation set. Another approach is to penalize based on k and use a Bayesian approach.

(d) (6 points) You are given a coin and a thumbtack, and you put Beta priors Beta(5,5) and Beta(20,20) on the coin and the thumbtack respectively. You perform the following experiment: toss both the thumbtack and the coin 100 times. To your surprise, you get 20 heads and 80 tails for both the coin and the thumbtack. Is each of the following two statements true or false? Explain your answer mathematically.
1. The MLE estimate of θ (the probability of landing heads) is the same for the coin and the thumbtack.
2. The MAP estimate of θ for the coin is greater than the MAP estimate of θ for the thumbtack.

True. The MLE estimates are the same for both: 20/100 = 0.2.
False. The MAP estimate for the coin is (20 + 5 - 1)/(100 + 5 + 5 - 2) = 24/108 ≈ 0.222, while the MAP estimate for the thumbtack is (20 + 20 - 1)/(100 + 20 + 20 - 2) = 39/138 ≈ 0.283. Thus the MAP estimate for the thumbtack is larger than that for the coin.
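A short check of the arithmetic in part (d), not from the exam: with a Beta(a, b) prior and h heads out of n tosses, the MLE is h/n and the MAP estimate of θ is the mode of the Beta(a + h, b + n - h) posterior, (h + a - 1)/(n + a + b - 2).

```python
# MLE vs. MAP under a Beta(a, b) prior; the numbers are those given in part (d).
def mle(h, n):
    return h / n

def map_estimate(h, n, a, b):
    # Mode of the Beta(a + h, b + n - h) posterior (valid when both parameters exceed 1).
    return (h + a - 1) / (n + a + b - 2)

h, n = 20, 100
print(mle(h, n))                       # 0.2 for both the coin and the thumbtack
print(map_estimate(h, n, 5, 5))        # coin:      24/108 ≈ 0.222
print(map_estimate(h, n, 20, 20))      # thumbtack: 39/138 ≈ 0.283
```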