Final Examination CS 540-2: Introduction to Artificial Intelligence

May 7, 2017
LAST NAME: SOLUTIONS
FIRST NAME:

Problem   Max Score
1         14
2         10
3         6
4         10
5         11
6         9
7         8
8         12
9         12
10        8
Total     100

Question 1. [14] Constraint Satisfaction Problems

You are given the following constraint graph representing a map that has to be colored with three colors, red (R), green (G), and blue (B), subject to the constraint that no two adjacent regions (regions connected by an arc in the graph) are assigned the same color. [The constraint graph figure is not reproduced in this transcription.]

(a) [4] Assuming variable C has already been assigned value R, apply the Forward Checking algorithm and cross out all values below that will be eliminated for other variables.

A: R G B
B: R G B
C: R
D: R G B
E: R G B
F: R G B

(b) [4] Assuming variable A has already been assigned value B and variable C has already been assigned value R, apply the Arc-Consistency algorithm (AC-3) (but no backtracking search) and cross out all values below that will be eliminated for other variables.

A: B
B: R G B
C: R
D: R G B
E: R G B
F: R G B

(c) [3] Consider the set of assignments below, after variable B was selected and assigned value G and constraint propagation was performed. Of the remaining five variables, which variable(s) will be selected by the Most-Constraining-Variable heuristic (aka Degree heuristic) as the best candidate(s) to consider next? Circle the variable(s) selected.

A: R B
B: G
C: R B
D: R B
E: R G B
F: R G B

Answer: C

(d) [3] Consider the set of inconsistent assignments below. Variable C has just been selected to be assigned a new value during a local search for a complete and consistent assignment. What new value will be selected for variable C by the Min-Conflicts heuristic? [The current assignments for variables A–F are not fully recoverable from this transcription.]

Answer: R
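Forward checking, as used in part (a), can be summarized in a few lines of code. The sketch below runs on a hypothetical adjacency list (the exam's actual constraint graph is not reproduced above), so the variable names and arcs are illustrative assumptions only.

```python
# Minimal forward-checking sketch for map coloring.
# The adjacency list is a made-up example, NOT the exam's constraint graph.
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "E"],
    "D": ["B"],
    "E": ["C"],
}
domains = {v: {"R", "G", "B"} for v in neighbors}

def forward_check(var, value, domains, neighbors):
    """Assign value to var and prune that value from every neighbor's domain.
    Returns the pruned copy of the domains, or None if some domain becomes empty."""
    new_domains = {v: set(d) for v, d in domains.items()}
    new_domains[var] = {value}
    for n in neighbors[var]:
        new_domains[n].discard(value)
        if not new_domains[n]:
            return None  # dead end: a neighbor has no legal color left
    return new_domains

pruned = forward_check("C", "R", domains, neighbors)
print(pruned)  # C's neighbors (A and B) lose R; all other domains are unchanged
```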

Question 2. [10] Support Vector Machines

(a) [3] Consider an SVM with decision boundary w^T x + b = 0 for a 3D feature space, where T stands for vector transpose, w = (1 2 3)^T (i.e., w is a column vector) and b = 2. w and x are both column vectors. What will be the classification, +1 or -1, of a test example defined by x = (1 -1 -1)^T?

w^T x + b = (1)(1) + (2)(-1) + (3)(-1) + 2 = -2 < 0, so the classification is -1.

(b) [2] True or False: In an SVM defined by the decision boundary w^T x + b = 0, the support vectors are those examples in the training set that lie exactly on this decision boundary.

False

(c) [2] True or False: An SVM that uses slack variables with a penalty parameter C that determines how much to penalize a misclassified example can result in either overfitting or underfitting the training data, depending on the value of the parameter C.

True

(d) [3] Consider a linear SVM trained using n labeled points in a 2-dimensional feature space, resulting in an SVM with k = 2 support vectors. After adding one more labeled training example (so there are now a total of n + 1 training examples) and retraining the SVM, what is the maximum number of support vectors possible in the new SVM?

(i) k = 2
(ii) k + 1 = 3
(iii) k + 2 = 4
(iv) n + 1

(iv) n + 1. (For example, if the current n = 5 points are: negative {(0, 1), (1, 1)}; positive {(2, 0), (3, 0), (4, 0)}, then these have only 2 support vectors: (1, 1) and (2, 0). But if we add one new negative point (2, 1), then all 6 points become support vectors.)
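As a quick check of the arithmetic in part (a), here is a minimal sketch of the linear SVM decision rule sign(w^T x + b), using NumPy purely for illustration:

```python
import numpy as np

# Decision rule for a linear SVM: classify as +1 if w.x + b >= 0, else -1.
w = np.array([1.0, 2.0, 3.0])    # weight vector from part (a)
b = 2.0                          # bias from part (a)
x = np.array([1.0, -1.0, -1.0])  # test example

score = w @ x + b                # (1)(1) + (2)(-1) + (3)(-1) + 2 = -2
label = 1 if score >= 0 else -1
print(score, label)              # -2.0 -1
```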

Question 3. [6] Perceptrons

(a) [2] True or False: A Perceptron that uses a fixed threshold value of 0 and no extra input with a bias weight can learn any linearly-separable function.

False

(b) [4] Consider a Perceptron with 3 inputs, x1, x2, x3, and one output unit that uses a linear threshold unit (LTU) as its activation function. The initial weights are w1 = 0.2, w2 = 0.7, w3 = 0.9, and bias weight w4 = -0.7 (the LTU uses a fixed threshold value of 0). [The network diagram, showing inputs x1, x2, x3 and a constant +1 bias input connected to the output unit through weights w1, w2, w3, w4, is not reproduced in this transcription.] Given the inputs x1 = 0, x2 = 0, x3 = 1 and the above weights, this Perceptron outputs 1. What are the four (4) updated weights' values after applying the Perceptron Learning Rule with this input, actual output = 1, teacher (aka target) output = 0, and learning rate = 0.2?

T = 0 and O = 1, so update the weights using wi = wi + (0.2)(0 - 1)xi. Thus, the new weights are:
w1 = 0.2 + (-0.2)(0) = 0.2
w2 = 0.7 + (-0.2)(0) = 0.7
w3 = 0.9 + (-0.2)(1) = 0.7
w4 = -0.7 + (-0.2)(1) = -0.9
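The update in part (b) is easy to verify mechanically. Below is a minimal sketch of one step of the Perceptron Learning Rule, with the bias folded in as a fourth weight on a constant +1 input, matching the convention used in the question:

```python
# One step of the Perceptron Learning Rule: w_i <- w_i + eta * (target - output) * x_i
weights = [0.2, 0.7, 0.9, -0.7]   # w1, w2, w3 and bias weight w4
x = [0, 0, 1, 1]                  # inputs x1, x2, x3 plus the constant bias input +1
eta = 0.2                         # learning rate
target = 0                        # teacher (target) output

# LTU with threshold 0: output 1 if the weighted sum is >= 0, else 0.
output = 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else 0

# Apply the update rule to every weight.
weights = [w + eta * (target - output) * xi for w, xi in zip(weights, x)]
print(output, weights)  # 1 [0.2, 0.7, 0.7, -0.9] (up to floating-point rounding)
```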

Question 4. [10] Back-Propagation in Neural Networks

(a) [3] Which one (1) of the following best describes the process of learning in a multilayer, feed-forward neural network that uses back-propagation learning?

(i) Activation values are propagated from the input nodes through the hidden layers to the output nodes.
(ii) Activation values are propagated from the output nodes through the hidden layers to the input nodes.
(iii) Weights on the arcs are modified based on values propagated from input nodes to output nodes.
(iv) Weights on the arcs are modified based on values propagated from output nodes to input nodes.
(v) Arcs in the network are modified, gradually shortening the path from input nodes to output nodes.
(vi) Weights on the arcs from the input nodes are compared to the weights on the arcs coming into the output nodes, and then these weights are modified to reduce the difference.

(iv)

(b) [2] True or False: The back-propagation algorithm applied to a multilayer, feed-forward neural network using sigmoid activation functions is guaranteed to achieve 100% correct classification for any set of training examples, given a sufficiently small learning rate.

False (it may converge to a local minimum of the error)

(c) [2] True or False: A multilayer, feed-forward neural network that uses the identity function, g(x) = x, as its activation function and is trained using the back-propagation algorithm can only learn linearly-separable functions.

True

(d) [3] Which one (1) of the following activation functions is the best one to use in a multilayer, feed-forward neural network trained with the back-propagation algorithm, in the sense that it will most likely give the fastest training time?

(i) LTU
(ii) ReLU
(iii) Sigmoid

(ii) ReLU
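To make answer (a)(iv) concrete, here is a minimal sketch of a single training step for a tiny two-layer sigmoid network (the layer sizes, data, and squared-error loss are illustrative assumptions, not taken from the exam): activations flow forward, error terms flow backward from the output toward the input, and the weights on the arcs are then updated from those propagated values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -0.2])      # one training input (arbitrary)
t = np.array([1.0])            # its target output (arbitrary)
W1 = rng.normal(size=(3, 2))   # input -> hidden weights
W2 = rng.normal(size=(1, 3))   # hidden -> output weights
eta = 0.1                      # learning rate

# Forward pass: activation values propagate input -> hidden -> output (choice (i)).
h = sigmoid(W1 @ x)
y = sigmoid(W2 @ h)

# Backward pass: error terms propagate output -> hidden, and the arc weights
# are modified from those backward-propagated values (choice (iv)).
delta_out = (y - t) * y * (1 - y)             # output-layer error term
delta_hid = (W2.T @ delta_out) * h * (1 - h)  # hidden-layer error term
W2 -= eta * np.outer(delta_out, h)
W1 -= eta * np.outer(delta_hid, x)
```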

Question 5. [11] Deep Learning

(a) [5] A Convolutional Neural Network (CNN) has a 100 x 100 array of feature values at one layer that is connected to a Convolution layer that uses 10 x 10 filters with a stride of 5 (i.e., the filter is shifted by 5 horizontally and vertically over different 10 x 10 regions of feature values in the layer below). Only filters that are entirely inside the 100 x 100 array are connected to a unit in the Convolution layer. In addition, each unit in the Convolution layer uses a ReLU activation function to compute its output value.

(i) [2] How many units are in the Convolution layer?

19 x 19 = 361

(ii) [3] How many weights must be learned for this layer, not including any bias weights?

Each unit in the Convolution layer is connected to 10 x 10 = 100 units in the layer below. These weights are shared across all units in this layer, so the total is 100.

(b) [3] What is the main purpose of using Dropout during training in a deep neural network? Select the one (1) best answer of the following:

(i) To speed up training
(ii) To avoid underfitting
(iii) To avoid overfitting
(iv) To increase accuracy on the training set

(iii) To avoid overfitting

(c) [3] Answer True or False for each of the following statements about a Max Pooling layer that uses a 3 x 3 filter with a stride of 3 in a CNN.

(i) It exhibits local translation invariance of the features detected at the previous layer. True
(ii) It reduces the number of units in this layer compared to the layer below it. True
(iii) Shared weights mean only 9 weights have to be learned for this layer. False (there are no weights associated with a Pooling layer)
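A minimal sketch of the geometry arithmetic behind part (a), using the standard formula for "valid" filter placement (positions per dimension = (input size - filter size) / stride + 1):

```python
# Convolution layer geometry for part (a): 100x100 input, 10x10 filter, stride 5,
# filters must fit entirely inside the input, and weights are shared across positions.
input_size, filter_size, stride = 100, 10, 5

positions_per_dim = (input_size - filter_size) // stride + 1   # (100 - 10) / 5 + 1 = 19
units_in_layer = positions_per_dim ** 2                        # 19 * 19 = 361
shared_weights = filter_size ** 2                              # 10 * 10 = 100 (bias excluded)

print(units_in_layer, shared_weights)   # 361 100
```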

Question 6. [9] Probabilistic Reasoning

A barrel contains many balls, some of which are red and the rest are blue. 40% of the balls are made of metal and the rest are made of wood. 30% of the metal balls are colored blue, and 10% of the wood balls are colored blue. Let the Boolean random variable B mean a ball is blue (so ¬B means a ball is red), and M mean a ball is metal (and ¬M mean it is wood). Hence, P(M) = 0.4, P(B | M) = 0.3, and P(B | ¬M) = 0.1.

(a) [3] What is the probability that a wood ball is red, i.e., P(¬B | ¬M)?

P(¬B | ¬M) = 1 - P(B | ¬M) = 1 - 0.1 = 0.9

(b) [3] What is the prior probability that a ball is blue, i.e., P(B)?

P(B) = P(B | M)P(M) + P(B | ¬M)P(¬M) = (0.3)(0.4) + (0.1)(1 - 0.4) = 0.18

(c) [3] What is the posterior probability that a blue ball is metal, i.e., P(M | B)?

P(M | B) = P(B | M) P(M) / P(B) = (0.3)(0.4) / 0.18 = 0.67
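The calculations in parts (b) and (c) are the total-probability rule followed by Bayes' rule; a minimal sketch:

```python
# Given quantities from the question.
p_metal = 0.4               # P(M)
p_blue_given_metal = 0.3    # P(B | M)
p_blue_given_wood = 0.1     # P(B | ~M)

# (b) Total probability: P(B) = P(B|M)P(M) + P(B|~M)P(~M)
p_blue = p_blue_given_metal * p_metal + p_blue_given_wood * (1 - p_metal)

# (c) Bayes' rule: P(M|B) = P(B|M)P(M) / P(B)
p_metal_given_blue = p_blue_given_metal * p_metal / p_blue

print(round(p_blue, 2), round(p_metal_given_blue, 2))   # 0.18 0.67
```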

Question 7. [8] Naïve Bayes

Consider the problem of detecting whether an email message contains a virus. Say we use four random variables to model this problem: a Boolean class variable V indicates whether the message contains a virus, and three Boolean feature variables: A, B, and C. We decide to use a Naïve Bayes classifier to solve this problem, so we create a Bayesian network with arcs from V to each of A, B, and C. Their associated CPTs are created from the following data:

P(V) = 0.2, P(A | V) = 0.8, P(A | ¬V) = 0.4, P(B | V) = 0.3, P(B | ¬V) = 0.1, P(C | V) = 0.1, P(C | ¬V) = 0.6

(a) [4] Compute P(¬A, B, C | V)

Use the conditionalized version of the chain rule plus conditional independence to get:
P(¬A, B, C | V) = P(¬A | B, C, V) P(B | C, V) P(C | V)
               = P(¬A | V) P(B | V) P(C | V)
               = (1 - P(A | V)) P(B | V) P(C | V)
               = (0.2)(0.3)(0.1) = 0.006

(b) [4] Compute P(A, ¬B, ¬C)

Use the conditioning rule, then the conditionalized version of the chain rule as in (a):
P(A, ¬B, ¬C) = P(A, ¬B, ¬C | V) P(V) + P(A, ¬B, ¬C | ¬V) P(¬V)
             = P(A | V) P(¬B | V) P(¬C | V) P(V) + P(A | ¬V) P(¬B | ¬V) P(¬C | ¬V) P(¬V)
             = (0.8)(0.7)(0.9)(0.2) + (0.4)(0.9)(0.4)(0.8) = 0.216
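A minimal sketch of the Naïve Bayes computations in parts (a) and (b), using the conditional-independence factorization of the features given the class:

```python
# CPT entries from the question (probability that each feature is True, given V).
p_v = 0.2
p_a = {True: 0.8, False: 0.4}   # P(A=true | V=v)
p_b = {True: 0.3, False: 0.1}   # P(B=true | V=v)
p_c = {True: 0.1, False: 0.6}   # P(C=true | V=v)

def cond(p, value, v):
    """P(feature = value | V = v) for a Boolean feature with table p."""
    return p[v] if value else 1 - p[v]

def joint_given_v(a, b, c, v):
    """Naive Bayes: features are conditionally independent given the class V."""
    return cond(p_a, a, v) * cond(p_b, b, v) * cond(p_c, c, v)

# (a) P(~A, B, C | V) = (1 - 0.8)(0.3)(0.1) = 0.006
print(joint_given_v(False, True, True, True))

# (b) P(A, ~B, ~C) = sum over V of P(A, ~B, ~C | V) P(V) = 0.216
print(joint_given_v(True, False, False, True) * p_v
      + joint_given_v(True, False, False, False) * (1 - p_v))
```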

Question 8. [12] Bayesian Networks

Consider the following Bayesian network containing 5 Boolean random variables. [The network figure is not reproduced in this transcription; from the answers below, A, S, and H have no parents, E has parents A, S, and H, and C has parents E and H.]

(a) [3] Write an expression for computing P(A, ¬S, H, E, ¬C) given only information that is in the associated CPTs for this network.

P(A) P(¬S) P(H) P(E | A, ¬S, H) P(¬C | E, H)

(b) [2] True or False: P(A, ¬C, E) = P(A) P(¬C) P(E)

False. This is only true when A, C, and E are independent, which they are not here, given the structure of the Bayesian network.

(c) [3] How many numbers must be stored in total in all CPTs associated with this network (excluding numbers that can be calculated from other numbers)?

1 + 1 + 1 + 2^3 + 2^2 = 15

(d) [4] C is conditionally independent of A and S given E and H.
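The count in part (c) follows from each Boolean node needing 2^k independent numbers, where k is its number of parents; a minimal sketch under the structure assumed above:

```python
# Parents of each node, as implied by the factorization in part (a).
parents = {
    "A": [],
    "S": [],
    "H": [],
    "E": ["A", "S", "H"],
    "C": ["E", "H"],
}

# A Boolean node with k Boolean parents needs 2**k independent CPT entries
# (P(node=true | each parent combination); the false case is 1 minus that).
cpt_entries = {node: 2 ** len(ps) for node, ps in parents.items()}
print(cpt_entries, sum(cpt_entries.values()))   # ... 'E': 8, 'C': 4 -> total 15
```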

Question 9. [12] Hidden Markov Models

I have two coins, D1 and D2. At each turn I pick one of the two coins but don't know which one I picked. Which coin is selected is assumed to obey a first-order Markov assumption. After I pick a coin, I flip it and observe whether it's a Head or a Tail. The two coins are not "fair," meaning that they each don't have equal likelihood of turning up heads or tails. The complete situation, including the probability of which coin I pick first, is given by the following HMM, where D1 and D2 are the two hidden states, and H and T are the two possible observable values. [The HMM diagram is not reproduced in this transcription; the parameter values used below are those appearing in the worked solutions.]

(a) [4] Compute P(q1 = D1, q2 = D1, q3 = D2, q4 = D2)

P(q1=D1, q2=D1, q3=D2, q4=D2)
= P(q4=D2 | q3=D2) P(q3=D2 | q2=D1) P(q2=D1 | q1=D1) P(q1=D1)
= (0.3)(0.4)(0.6)(0.3) = 0.0216

(b) [4] Compute P(o1 = H, o2 = T, q1 = D2, q2 = D2)

P(o1=H, o2=T, q1=D2, q2=D2)
= P(q1=D2) P(q2=D2 | q1=D2) P(o1=H | q1=D2) P(o2=T | q2=D2)
= (0.7)(0.3)(0.4)(0.6) = 0.0504

(c) [4] Compute P(q1 = D1 | o1 = H)

P(q1=D1 | o1=H)
= P(o1=H | q1=D1) P(q1=D1) / P(o1=H)    [by Bayes' rule]
= P(o1=H | q1=D1) P(q1=D1) / [P(o1=H | q1=D1) P(q1=D1) + P(o1=H | q1=D2) P(q1=D2)]    [by the conditioning rule]
= ((0.55)(0.3)) / [(0.55)(0.3) + (0.4)(0.7)] = 0.37
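A minimal sketch of the chain-rule computations in parts (a) and (b); the parameter dictionaries below are filled in from the values that appear in the worked solutions, since the HMM diagram itself is not reproduced:

```python
# HMM parameters (initial, transition, emission) as used in the solutions above.
initial = {"D1": 0.3, "D2": 0.7}                 # P(q1)
trans = {"D1": {"D1": 0.6, "D2": 0.4},           # P(q_{t+1} | q_t)
         "D2": {"D1": 0.7, "D2": 0.3}}
emit = {"D1": {"H": 0.55, "T": 0.45},            # P(o_t | q_t)
        "D2": {"H": 0.4, "T": 0.6}}

# (a) P(q1=D1, q2=D1, q3=D2, q4=D2): chain rule with the first-order Markov assumption.
states = ["D1", "D1", "D2", "D2"]
p = initial[states[0]]
for prev, cur in zip(states, states[1:]):
    p *= trans[prev][cur]
print(round(p, 4))   # 0.0216

# (b) P(o1=H, o2=T, q1=D2, q2=D2): multiply in the emission probabilities as well.
p = initial["D2"] * trans["D2"]["D2"] * emit["D2"]["H"] * emit["D2"]["T"]
print(round(p, 4))   # 0.0504
```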

Question 10. [8] AdaBoost

(a) [2] True or False: In AdaBoost, if an example in the training set is classified correctly by a chosen weak classifier, we should eliminate it from the training set used in the next iteration to compute the next weak classifier.

False

(b) [2] True or False: During each iteration of AdaBoost, after the next weak classifier is determined, the weights of the misclassified examples will all be updated (before normalization) by the same factor.

True

(c) [2] True or False: The decisions of the individual weak classifiers (+1 for class 1 and -1 for class 2) selected by AdaBoost are combined by choosing the majority class.

False (the sign of the linear combination of their decisions is used, which is the weighted majority class)

(d) [2] True or False: In a 2-class classification problem where each example is defined by a k-dimensional feature vector, if each weak classifier is defined by a threshold on a single attribute (defining a linear decision boundary), AdaBoost can only learn a strong classifier that has 0 misclassifications on the training set when the training set is linearly separable, no matter how many weak classifiers are combined.

False
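To make (b) concrete, here is a minimal sketch of one round of AdaBoost weight updating on toy data (the weights and classifier outcomes are illustrative assumptions): every misclassified example is multiplied by the same factor e^alpha, every correctly classified example by e^(-alpha), and the weights are then renormalized.

```python
import math

# Toy setup: 5 example weights and whether the chosen weak classifier got each one right.
weights = [0.2, 0.2, 0.2, 0.2, 0.2]
correct = [True, True, False, True, False]

# Weighted error of the weak classifier and its vote weight alpha.
error = sum(w for w, c in zip(weights, correct) if not c)
alpha = 0.5 * math.log((1 - error) / error)

# Same multiplicative factor for every misclassified example (and likewise for correct ones).
weights = [w * math.exp(alpha if not c else -alpha) for w, c in zip(weights, correct)]
total = sum(weights)
weights = [w / total for w in weights]   # renormalize so the weights sum to 1
print([round(w, 3) for w in weights])    # misclassified examples now carry more weight
```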