Final Examination CS540-2: Introduction to Artificial Intelligence

May 9, 2018
LAST NAME: SOLUTIONS
FIRST NAME:

Directions
1. This exam contains 33 questions worth a total of 100 points.
2. Fill in your name and student ID number carefully on the answer sheet.
3. Fill in each oval that you choose completely; do not use a check mark or an X, and do not put a box or circle around the oval.
4. Fill in the ovals with pencil, not pen.
5. If you change an answer, be sure to completely erase the old filled-in oval.
6. Fill in only one oval for each question.
7. When you answer a question, be sure to check that the question number on the answer sheet matches the question number you are answering on the exam.
8. For True/False questions, fill in A for True and B for False.
9. There is no penalty for guessing.

Constraint Satisfaction

Consider the following graph representing 6 countries on a map that needs to be colored using three different colors, 1, 2 and 3, such that no two adjacent countries have the same color. Adjacencies are represented by edges in the graph. We can represent this problem as a Constraint Satisfaction Problem where the variables, A through F, are the countries, and the values are the colors.

1. [4] The table below shows the current domains of all countries after country B has been assigned value 1 and country F has been assigned value 2. What are the reduced domains of the two countries C and D after applying the Forward Checking algorithm to this state? (The domains of countries A and E may change too, but we're not interested in them.)

A: {1, 2, 3}   B: {1}   C: {1, 2, 3}   D: {1, 2, 3}   E: {1, 2, 3}   F: {2}

A. C = {1, 2, 3} and D = {1, 2, 3}
B. C = {2, 3} and D = {1, 2, 3}
C. C = {2, 3} and D = {1, 3}
D. C = {2} and D = {1, 3}
E. C = {2} and D = {3}

Answer: C. C = {2, 3} and D = {1, 3}

2. [4] The table below shows the current domains of all countries after country A has been assigned value 2 and country D has been assigned value 3. What are the reduced domains of the two countries C and F after applying the Arc Consistency algorithm (AC-3) to this state? (The domains of countries B and E may change too, but we're not interested in them.)

A: {2}   B: {1, 2, 3}   C: {1, 2, 3}   D: {3}   E: {1, 2, 3}   F: {1, 2, 3}

A. C = {1} and F = {1}
B. C = {1} and F = {2}
C. C = {1, 3} and F = {1, 2}
D. C = {1, 3} and F = {1}
E. C = {1, 2} and F = {1, 2}

Answer: A. C = {1} and F = {1}
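A minimal Python sketch of the Forward Checking step used in question 1. The adjacency dictionary below is a made-up example for illustration only, since the exam's map figure is not reproduced here; what matters is the mechanism of removing the assigned value from each neighbor's domain.

```python
# Illustrative sketch of Forward Checking for map coloring.
# NOTE: the adjacency below is hypothetical, not the exam's actual map.

def forward_check(domains, adjacency, var, value):
    """Assign `value` to `var` and prune that value from each neighbor's domain."""
    new_domains = {v: set(d) for v, d in domains.items()}
    new_domains[var] = {value}
    for neighbor in adjacency[var]:
        new_domains[neighbor].discard(value)
    return new_domains

adjacency = {'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B'}}   # hypothetical 3-country map
domains = {v: {1, 2, 3} for v in adjacency}

print(forward_check(domains, adjacency, 'B', 1))
# B is fixed to {1}, and value 1 is removed only from the domains of B's
# neighbors (A and C); non-neighbors would keep their full domains.
```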

Neural Networks

3. [2] True or False: The Perceptron Learning Rule is a sound and complete method for a Perceptron to learn to correctly classify any 2-class classification problem.
Answer: False. A Perceptron can only learn linearly separable functions.

4. [2] True or False: A Perceptron can learn the Majority function, i.e., where each input unit is a binary value (0 or 1) and it outputs 1 if there are more 1s in the input than 0s. Assume there are n input units where n is odd.
Answer: True. Set all weights from the n input units to 1 and set the bias weight to -n/2.

5. [2] True or False: Training neural networks has the potential problem of overfitting the training data.
Answer: True.

6. [2] True or False: The back-propagation algorithm, when run until a minimum is achieved, always finds the same solution (i.e., weights) no matter what the initial set of weights is.
Answer: False. Back-propagation iterates until a local minimum of the squared error is reached, and which local minimum it reaches depends on the initial weights.

7. [4] Consider a Perceptron that has two input units and one output unit, which uses an LTU activation function, plus a bias input of +1 and a bias weight w3 = 1. If both inputs associated with an example are 0 and both weights, w1 and w2, connecting the input units to the output unit have value 1, and the desired (teacher) output value is 0, how will the weights change after applying the Perceptron Learning Rule with learning rate parameter α = 1?
A. w1, w2 and w3 will all decrease
B. w1 and w2 will increase, and w3 will decrease
C. w1 and w2 will increase, and w3 will not change
D. w1 and w2 will decrease, and w3 will not change
E. w1 and w2 will not change, and w3 decreases
Answer: E. w1 and w2 will not change, and w3 decreases.
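A minimal Python sketch of the single Perceptron Learning Rule update in question 7, assuming the LTU outputs 1 when the weighted sum is at least 0:

```python
# One Perceptron Learning Rule update for question 7.
# x1 = x2 = 0, bias input x3 = +1, weights w1 = w2 = w3 = 1,
# teacher output t = 0, learning rate alpha = 1.

x = [0, 0, 1]                 # the last component is the bias input
w = [1.0, 1.0, 1.0]
t, alpha = 0, 1.0

# LTU activation: output 1 if the weighted sum is >= 0, else 0.
o = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

# Perceptron rule: w_i <- w_i + alpha * (t - o) * x_i
w = [wi + alpha * (t - o) * xi for wi, xi in zip(w, x)]
print(w)                      # [1.0, 1.0, 0.0]: w1, w2 unchanged, w3 decreased
```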

Convolutional Neural Networks

Consider a Convolutional Neural Network (CNN) that has an Input layer containing a 13 x 13 image that is connected to a Convolution layer using a 4 x 4 filter and a stride of 1 (i.e., the filter is shifted horizontally and vertically by 1 pixel, and only filters that are entirely inside the input array are connected to a unit in the Convolution layer). There is no activation function associated with the units in the Convolution layer. The Convolution layer is connected to a max Pooling layer using a 2 x 2 filter and a stride of 2. (Only filters that are entirely inside the array in the Convolution layer are connected to a unit in the Pooling layer.) The Output layer contains 4 units that each use a ReLU activation function, and these units are fully connected to the units in the Pooling layer.

8. [4] How many units are in the Convolution layer?
A. 16  B. 81  C. 100  D. 144  E. 169
Answer: C, because the filter fits in 13 - 4 + 1 = 10 positions in each dimension, giving 10 x 10 = 100 units.

9. [4] How many distinct weights must be learned for the connections to the Convolution layer?
A. 16  B. 169  C. 2,704  D. 16,900
Answer: A, because the weights are shared across all filter positions, so only the 4 x 4 = 16 filter weights must be learned.

10. [4] How many units are in the Pooling layer?
A. 4  B. 25  C. 81  D. 100  E. 169
Answer: B, because pooling the 10 x 10 Convolution layer with a 2 x 2 filter and a stride of 2 gives 5 x 5 = 25 units.

11. [4] How many distinct weights must be learned for the connections to the Output layer?
A. 4  B. 5  C. 16  D. 100  E. 104
Answer: E, because (25 x 4) + 4 = 104 (including the bias weights).
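A minimal Python sketch of the layer-size arithmetic used in questions 8 through 11, assuming the "valid" filter placement described above (filters entirely inside the input, no padding):

```python
# Layer-size arithmetic for questions 8-11 (valid placement, no padding).

def valid_output_size(in_size, filter_size, stride):
    """Number of filter positions along one dimension."""
    return (in_size - filter_size) // stride + 1

conv = valid_output_size(13, 4, 1)       # 10, so 10 x 10 = 100 Convolution units
pool = valid_output_size(conv, 2, 2)     # 5,  so 5 x 5  = 25 Pooling units
conv_weights = 4 * 4                     # 16 shared filter weights
output_weights = pool * pool * 4 + 4     # 104 fully-connected weights + 4 biases

print(conv * conv, conv_weights, pool * pool, output_weights)   # 100 16 25 104
```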

The next two questions do NOT refer to the specific CNN described above, but rather are about CNNs in general.

12. [2] True or False: CNNs can learn to recognize an object in an image no matter how the object is translated (i.e., shifted horizontally and/or vertically), even if the training set only includes that object in one position.
Answer: True, because of the shared weights.

13. [2] True or False: CNNs can learn to recognize an object in an image no matter how the object is rotated (in the image plane), even if the training set only includes that object in one orientation.
Answer: False.

Support Vector Machines

14. [2] True or False: Given a linearly-separable dataset for a 2-class classification problem, a Linear SVM is better to use than a Perceptron because the SVM will often be able to achieve better classification accuracy on the training set.
Answer: False, because both will have 100% accuracy on the linearly-separable training set.

15. [2] True or False: Given a linearly-separable dataset for a 2-class classification problem, a Linear SVM is better to use than a Perceptron because the SVM will often be able to achieve better classification accuracy on the testing set.
Answer: True, because by maximizing the margin, the decision boundary learned by the SVM will often be farther away from more of the training examples, leading to better performance on testing examples that are close to the decision boundary.
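As an illustration of questions 14 and 15, here is a small sketch, assuming scikit-learn is available and using a made-up toy dataset: both models fit a linearly separable training set perfectly, but the SVM additionally picks the maximum-margin separator.

```python
# Perceptron vs. linear SVM on a tiny, made-up, linearly separable dataset.
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC

X = [[0, 0], [0, 1], [2, 2], [2, 3]]
y = [0, 0, 1, 1]

perceptron = Perceptron().fit(X, y)
svm = SVC(kernel='linear', C=1e6).fit(X, y)     # very large C ~ hard margin

print(perceptron.score(X, y), svm.score(X, y))  # both 1.0 on the training set
print(svm.coef_, svm.intercept_)                # the maximum-margin hyperplane
```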

16. [2] True or False: With a non-linearly-separable dataset that contains some extra noise data points, using an SVM with slack variables to create a soft margin classifier, and a small value for the penalty parameter, C, that controls how much to penalize misclassified points, will often reduce overfitting the training data.
Answer: True, because a small C means the penalty for misclassifying a few points will be small, and therefore we are more likely to maximize the margin between most of the points while misclassifying a few points, including the noise points.

17. [2] True or False: An SVM using the quadratic kernel K(x, y) = (x^T y)^2 can correctly classify the dataset {((1,1), +), ((1,-1), -), ((-1,1), -), ((-1,-1), +)}.
Answer: True, because this quadratic kernel effectively maps each 2D point (p, q) to the 3D point (p^2, √2 pq, q^2), meaning points (1,1) and (-1,-1) map to (1, √2, 1), and points (-1,1) and (1,-1) map to (1, -√2, 1). So these four points are linearly separable in the 3D space.

Probabilities

18. [3] Which one of the following is equal to P(A, B, C), given Boolean random variables A, B and C, and no independence or conditional independence assumptions between any of them?
A. P(A | B) P(B | C) P(C | A)
B. P(C | A, B) P(A) P(B)
C. P(A, B | C) P(C)
D. P(A | B, C) P(B | A, C) P(C | A, B)
E. P(A | B) P(B | C) P(C)
Answer: C
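A minimal Python sketch of the explicit feature map behind the answer to question 17:

```python
# The explicit feature map of the quadratic kernel K(x, y) = (x . y)^2 in 2D.
import math

def phi(p, q):
    """Map a 2D point to the 3D feature space of the quadratic kernel."""
    return (p * p, math.sqrt(2) * p * q, q * q)

data = {(1, 1): '+', (1, -1): '-', (-1, 1): '-', (-1, -1): '+'}
for (p, q), label in data.items():
    print((p, q), label, phi(p, q))
# Both '+' points map to (1, sqrt(2), 1) and both '-' points map to (1, -sqrt(2), 1),
# so the mapped points are linearly separable, e.g. by the sign of the middle feature.
```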

19. [3] Which one of the following expressions is guaranteed to equal 1, given (not necessarily Boolean) random variables A and B, and no independence or conditional independence assumptions between them? Sums are over all possible values of the associated random variable.
A. Σ_x P(A = x | B = b)
B. Σ_y P(A = a | B = y)
C. Σ_x P(A = x) + Σ_y P(B = y)
D. Σ_x Σ_y P(A = x | B = y)
Answer: A

20. [4] Given two Boolean random variables, A and B, where P(A) = 1/2, P(B) = 1/3, and P(A | ¬B) = 1/4, what is P(A | B)?
A. 1/6  B. 1/4  C. 3/4  D. 1  E. Impossible to determine from the given information
Answer: D, because P(A | B) = P(B | A) P(A) / P(B) = (3/2) P(B | A). Now P(B | A) = 1 - P(¬B | A), and P(¬B | A) = P(A | ¬B) P(¬B) / P(A) = (1/4)(2/3)/(1/2) = 1/3, so P(B | A) = 2/3 and therefore P(A | B) = (2/3)(3/2) = 1.

21. [4] A 6-sided die is rolled 15 times and the results are: side 1 comes up 0 times; side 2: 1 time; side 3: 2 times; side 4: 3 times; side 5: 4 times; side 6: 5 times. Based on these results, what is the probability of side 3 coming up when using Add-1 Smoothing?
A. 2/15  B. 1/7  C. 3/16  D. 1/5
Answer: B, because (2 + 1)/(15 + 6) = 3/21 = 1/7.
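A minimal Python sketch checking the arithmetic in questions 20 and 21 with exact fractions:

```python
# Exact-fraction check of the arithmetic in questions 20 and 21.
from fractions import Fraction as F

# Question 20: P(A) = 1/2, P(B) = 1/3, P(A | not-B) = 1/4.
P_A, P_B, P_A_given_notB = F(1, 2), F(1, 3), F(1, 4)
P_notB_given_A = P_A_given_notB * (1 - P_B) / P_A    # Bayes' rule: 1/3
P_A_given_B = (1 - P_notB_given_A) * P_A / P_B       # Bayes' rule again
print(P_A_given_B)                                   # 1

# Question 21: Add-1 smoothing for side 3, given counts [0, 1, 2, 3, 4, 5].
counts = [0, 1, 2, 3, 4, 5]
print(F(counts[2] + 1, sum(counts) + len(counts)))   # 1/7
```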

Probabilistic Reasoning

Say the incidence of a disease D is about 5 cases per 100 people (i.e., P(D) = 0.05). Let Boolean random variable D mean a patient has disease D, and let Boolean random variable TP stand for "tests positive." Tests for disease D are known to be very accurate in the sense that the probability of testing positive when you have the disease is 0.99, and the probability of testing negative when you do not have the disease is 0.97.

22. [4] Compute P(TP), the prior probability of testing positive.
A. 0.0285  B. 0.0495  C. 0.051  D. 0.078
Answer: D, because P(D) = 0.05, P(TP | D) = 0.99, P(¬TP | ¬D) = 0.97, and P(TP) = P(TP | D) P(D) + P(TP | ¬D) P(¬D) = (0.99)(0.05) + (1 - 0.97)(1 - 0.05) = 0.078.

23. [4] Compute P(D | TP), the posterior probability that you have disease D when the test is positive.
A. 0.0495  B. 0.078  C. 0.635  D. 0.97
Answer: C, because P(D | TP) = P(TP | D) P(D) / P(TP) = (0.99)(0.05) / 0.078 = 0.635.
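A minimal Python sketch of the Bayes' rule arithmetic behind questions 22 and 23:

```python
# Bayes' rule arithmetic for questions 22 and 23.
P_D = 0.05                  # prior probability of having the disease
P_TP_given_D = 0.99         # P(tests positive | disease)
P_notTP_given_notD = 0.97   # P(tests negative | no disease)

# Total probability of testing positive.
P_TP = P_TP_given_D * P_D + (1 - P_notTP_given_notD) * (1 - P_D)

# Posterior probability of having the disease given a positive test.
P_D_given_TP = P_TP_given_D * P_D / P_TP

print(round(P_TP, 3), round(P_D_given_TP, 3))   # 0.078 0.635
```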

Bayesian Networks

Consider the following Bayesian Network containing four Boolean random variables, with arcs A → C, A → D and B → D, and the following conditional probability tables:

P(A) = 0.1
P(B) = 0.5
P(C | A) = 0.7    P(C | ¬A) = 0.2
P(D | A, B) = 0.9    P(D | ¬A, B) = 0.6    P(D | A, ¬B) = 0.7    P(D | ¬A, ¬B) = 0.3

24. [4] Compute P(¬A, B, ¬C, D).
A. 0.216  B. 0.054  C. 0.024  D. 0.006
Answer: A, because P(¬A, B, ¬C, D) = P(¬C | ¬A) P(D | ¬A, B) P(¬A) P(B) = (1 - 0.2)(0.6)(1 - 0.1)(0.5) = 0.216.

25. [4] Compute P(A | B, C, D).
A. 0.0315  B. 0.0855  C. 0.368  D. 0.583
Answer: C, because P(A | B, C, D) = P(A, B, C, D) / P(B, C, D), where
P(A, B, C, D) = (0.7)(0.9)(0.1)(0.5) = 0.0315 and
P(B, C, D) = P(A, B, C, D) + P(¬A, B, C, D) = 0.0315 + (0.2)(0.6)(0.9)(0.5) = 0.0855.
So P(A | B, C, D) = 0.0315 / 0.0855 = 0.368.
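A minimal Python sketch that enumerates the joint distribution of this network to check questions 24 and 25:

```python
# Brute-force check of questions 24 and 25 using the CPTs above.
P_A, P_B = 0.1, 0.5
P_C_given_A = {True: 0.7, False: 0.2}                   # P(C = true | A)
P_D_given_AB = {(True, True): 0.9, (False, True): 0.6,  # P(D = true | A, B)
                (True, False): 0.7, (False, False): 0.3}

def joint(a, b, c, d):
    """P(A=a, B=b, C=c, D=d) from the factored form of the network."""
    p = (P_A if a else 1 - P_A) * (P_B if b else 1 - P_B)
    p *= P_C_given_A[a] if c else 1 - P_C_given_A[a]
    p *= P_D_given_AB[(a, b)] if d else 1 - P_D_given_AB[(a, b)]
    return p

print(round(joint(False, True, False, True), 3))        # 0.216 (question 24)

num = joint(True, True, True, True)
den = num + joint(False, True, True, True)
print(round(num / den, 3))                               # 0.368 (question 25)
```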

The next two questions do NOT refer to the specific Bayesian Network shown above, but rather are about other Bayesian Networks.

26. [2] True or False: The Bayesian Network associated with the following computation of a joint probability: P(A) P(B) P(C | A, B) P(D | C) P(E | B, C) has arcs from node A to C, from B to C, from B to E, from C to D, from C to E, and no other arcs.
Answer: True.

27. [2] True or False: The following product of factors corresponds to a valid Bayesian Network over the variables A, B, C and D: P(A | B) P(B | C) P(C | D) P(D | A).
Answer: False, because this corresponds to a network that is not a DAG (it contains the directed cycle A → D → C → B → A).

Hidden Markov Models

Consider the following HMM that models a student's activities every hour, where during each hour period the student is in one of two possible hidden states: Studying (S) or playing Video games (V). While doing each activity the student will exhibit an observable facial expression of either Grinning (G) or Frowning (F).

[HMM diagram: initial probabilities P(q1 = S) = 0.4 and P(q1 = V) = 0.6; transition probabilities P(S → S) = 0.8, P(S → V) = 0.2, P(V → V) = 0.6, P(V → S) = 0.4; emission probabilities P(G | S) = 0.2, P(F | S) = 0.8, P(G | V) = 0.7, P(F | V) = 0.3.]

28. [4] If in the second hour the student is Studying, what's the probability that in the fourth hour the student is playing Video games? That is, compute P(q4 = V | q2 = S).
A. 0.08  B. 0.12  C. 0.16  D. 0.2  E. 0.28
Answer: E, because
P(q4=V | q2=S) = P(q4=V, q3=V | q2=S) + P(q4=V, q3=S | q2=S)
= P(q4=V | q3=V, q2=S) P(q3=V | q2=S) + P(q4=V | q3=S, q2=S) P(q3=S | q2=S)
= P(q4=V | q3=V) P(q3=V | q2=S) + P(q4=V | q3=S) P(q3=S | q2=S)
= (0.6)(0.2) + (0.2)(0.8) = 0.28.

29. [4] What is the probability that in the first hour the student is Grinning? That is, compute P(o1 = G).
A. 0.08  B. 0.25  C. 0.42  D. 0.5  E. 1.0
Answer: D, because
P(o1=G) = P(q1=S, o1=G) + P(q1=V, o1=G), where
P(q1=S, o1=G) = P(q1=S) P(o1=G | q1=S) = (0.4)(0.2) = 0.08 and
P(q1=V, o1=G) = P(q1=V) P(o1=G | q1=V) = (0.6)(0.7) = 0.42.
So P(o1=G) = 0.08 + 0.42 = 0.5.

AdaBoost

30. [2] True or False: If the number of weak classifiers is sufficiently large and each weak classifier is more accurate than chance, the AdaBoost algorithm's accuracy on the training set can always achieve 100% correct classification for any 2-class classification problem where the data can be separated by a linear combination of the weak classifiers' decision boundaries.
Answer: True.
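A minimal Python sketch of the HMM arithmetic in questions 28 and 29, using the initial, transition and emission probabilities given above:

```python
# HMM arithmetic for questions 28 and 29.
init = {'S': 0.4, 'V': 0.6}                                     # P(q1)
trans = {'S': {'S': 0.8, 'V': 0.2}, 'V': {'S': 0.4, 'V': 0.6}}  # P(q_t+1 | q_t)
emit = {'S': {'G': 0.2, 'F': 0.8}, 'V': {'G': 0.7, 'F': 0.3}}   # P(o_t | q_t)

# Question 28: P(q4 = V | q2 = S), marginalizing over the hidden state q3.
print(round(sum(trans['S'][q3] * trans[q3]['V'] for q3 in ('S', 'V')), 2))   # 0.28

# Question 29: P(o1 = G), marginalizing over the hidden state q1.
print(round(sum(init[q1] * emit[q1]['G'] for q1 in ('S', 'V')), 2))          # 0.5
```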

31. [2] True or False: In AdaBoost, the error computed for a weak classifier is calculated by the ratio of the number of misclassified examples to the total number of examples in the training set.
Answer: False; the error is the sum of the current weights of the misclassified examples (a weighted error), not the unweighted fraction of misclassified examples.

32. [2] True or False: For a 2-class classification problem where AdaBoost has selected five weak classifiers, the classification of a test example is determined by the majority class of the classes predicted by the five weak classifiers.
Answer: False; the final classification is a weighted vote in which each weak classifier's vote is weighted according to its error, not a simple majority vote.

33. [4] The AdaBoost algorithm creates an ensemble of weak classifiers by doing which one of the following before determining the next weak classifier:
A. Chooses a new random subset of the training examples to use
B. Decreases the weights of the training examples that were misclassified by the previous weak classifier
C. Increases the weights of the training examples that were misclassified by the previous weak classifier
D. Removes the training examples that were classified correctly by the previous weak classifier
Answer: C
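A minimal Python sketch of one AdaBoost round, illustrating the weighted error in question 31 and the re-weighting step in question 33 (the toy weights and predictions are made up):

```python
# One AdaBoost re-weighting round on made-up data.
import math

weights = [0.25, 0.25, 0.25, 0.25]      # current example weights (sum to 1)
correct = [True, True, True, False]     # did the weak classifier get each example right?

# Weighted error: the sum of weights of misclassified examples (question 31).
error = sum(w for w, ok in zip(weights, correct) if not ok)   # 0.25
alpha = 0.5 * math.log((1 - error) / error)                   # this classifier's vote weight

# Increase weights of misclassified examples, decrease the rest, renormalize (question 33).
weights = [w * math.exp(alpha if not ok else -alpha) for w, ok in zip(weights, correct)]
total = sum(weights)
weights = [w / total for w in weights]

print([round(w, 3) for w in weights])   # [0.167, 0.167, 0.167, 0.5]: the misclassified
                                        # example's weight rises to 0.5 for the next round
```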