CSL302/612 Artificial Intelligence End-Semester Exam (120 Minutes)

Name: ____________________    Roll Number: ____________________

Please read the following instructions carefully:
- Calculators are allowed. However, laptops or mobile phones are not allowed.
- You can bring one A4-size cheat sheet. Please attach the cheat sheet along with this booklet.
- Use the space provided after every question for writing your answer. You will be given additional sheets for rough work. Please attach the additional sheet(s) along with this booklet.
- Be precise and concise in your answers.
- Include explanations, derivations, and examples when appropriate. This can fetch partial scores even if the final answer is incorrect.
- Please write legibly.
- There are 5 questions worth a total of 50 points.
- Work efficiently. Some questions are easier than others; try to answer the easier ones before you get bogged down by the harder ones.
- Keep calm and good luck.

#   Question                     Max. Score   Score
1   Minesweeper                      8
2   Planners                        10
3   Bayesian Networks               12
4   Markov Decision Processes       12
5   Reinforcement Learning           8
    Total                           50

1. Minesweeper (8 points)

Minesweeper is a single-player puzzle. The objective of the game is to clear a rectangular board containing hidden mines without detonating any of them, using clues about the number of neighboring mines in each field. Each square in the rectangular board can be cleared by clicking on it. If a square that contains a mine is clicked, the game is over. If the square does not contain a mine, one of two things can happen:
- A number between 1 and 8 appears, indicating the number of adjacent (including diagonally adjacent) squares containing mines.
- No number appears, in which case there are no mines in the adjacent cells.

The figure below is an example of the game.

[Figure: an example Minesweeper board; the cell at position (4,2) is referred to in part b.]

a. Define a first-order language (functions, objects, relations) that allows one to formalize the knowledge of a player in the game. Represent the following knowledge using the defined language: (5 points)
- There are exactly n mines in the minefield.
- If a cell contains the number 1, then there is exactly one mine in the adjacent cells.

b. Prove by resolution that there must be a mine at position (4,2) in the figure above. (3 points)
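Purely as an illustration (not part of the exam), the clue rule described above can be written as a short sketch. The board representation (a 2-D list of booleans, True meaning a mine) is an assumption made for this sketch.

```python
# Illustrative only: the clue shown when a mine-free square (r, c) is clicked,
# assuming `board` is a list of lists of booleans with True marking a mine.
def clue(board, r, c):
    """Count of adjacent (including diagonally adjacent) squares with mines."""
    rows, cols = len(board), len(board[0])
    count = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols and board[rr][cc]:
                count += 1
    return count  # 0 means no number is displayed
```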

2. Planners (10 points)

Consider the following artificial planning problem:

Initial State: X
Goal State: Y, Z
Actions:
  A1   Prec: none   Effect: Y, X
  A2   Prec: X      Effect: Z

a. Construct the tree resulting from performing one level of progression search. Complete the branch of the tree that will result in the solution. (2 points)
b. Construct the tree resulting from performing one level of regression search. Complete the branch of the tree that will result in the solution. (2 points)
c. Construct the planning graph until the goals are satisfied. (2 points)
d. Identify all the mutex relationships that exist in the graph. (2 points)
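Purely as an illustration (not part of the exam), the progression step referred to in part a can be sketched as follows. The dictionary-based encoding of the actions, and the assumption that the effects listed above only add propositions (no delete effects), are choices made for this sketch.

```python
# Illustrative sketch of STRIPS-style progression for the toy problem above.
INIT = frozenset({"X"})
GOAL = frozenset({"Y", "Z"})
ACTIONS = {
    "A1": {"prec": set(),  "add": {"Y", "X"}},
    "A2": {"prec": {"X"},  "add": {"Z"}},
}

def successors(state):
    """One level of progression: apply every applicable action to `state`."""
    for name, a in ACTIONS.items():
        if a["prec"] <= state:                  # preconditions satisfied
            yield name, frozenset(state | a["add"])

for name, nxt in successors(INIT):
    print(name, "->", sorted(nxt))
```

Progression searches forward from the initial state in this way; regression (part b) instead works backward from the goal set.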

e. In general, suppose the progression search is conducted using A* search with a heuristic h that is inadmissible but overestimates the cost by k units (if the true cost is c, h might give an estimate of c + k). Can we give a guarantee on how far the plan found by A* will be from the optimum? (2 points)

3. Bayesian Networks (12 points)

3.1 Consider the Bayes network shown below.

[Figure: a Bayes network over the nodes A, B, C, D, E, and F.]

a. Is A conditionally independent of E given F? Explain. (1 point)
b. Given the CPTs for A, B, and C and the full joint distribution table, compute the CPTs for nodes D, E, and F. (4 points)

c. Suppose that the variables A, B, C, and F have been observed. Variables D and E are unobserved. Prove from first principles that removing node D from the network will not affect the posterior distribution for E. (3 points)

d. Under the same assumptions as part c, can we remove node D if we are planning to use rejection sampling and likelihood weighting for obtaining the posterior distribution for E? Explain. (4 points)
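Purely for reference (not part of the exam), part d refers to likelihood weighting; a minimal sketch of that sampler for a generic Bayes network over binary variables is given below. The network representation (topological order, parent lists, CPT dictionaries) is an assumed placeholder for this sketch, not the network in the figure.

```python
import random

# Illustrative sketch of likelihood weighting for a generic binary Bayes net.
# parents[X]: list of X's parents; cpt[X]: dict mapping a tuple of parent
# values to P(X = True | parents); `order` is a topological ordering.
def weighted_sample(order, parents, cpt, evidence):
    """Return (sample, weight): evidence variables are fixed and contribute
    their likelihood to the weight; all other variables are sampled."""
    sample, weight = {}, 1.0
    for X in order:
        p_true = cpt[X][tuple(sample[p] for p in parents[X])]
        if X in evidence:
            sample[X] = evidence[X]
            weight *= p_true if evidence[X] else 1.0 - p_true
        else:
            sample[X] = random.random() < p_true
    return sample, weight

def likelihood_weighting(query, order, parents, cpt, evidence, n=10000):
    """Estimate P(query = True | evidence) from n weighted samples."""
    num = den = 0.0
    for _ in range(n):
        s, w = weighted_sample(order, parents, cpt, evidence)
        den += w
        num += w * s[query]
    return num / den
```

Rejection sampling, by contrast, samples every variable from the prior and discards samples that disagree with the evidence.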

4. Markov Decision Processes (12 points)

4.1 An agent would like to use standard search techniques for solving an MDP. What should be the conditions on the MDP for standard search to be applicable? (2 points)

4.2 Given a fixed policy π, where π(s) is the deterministic action to be taken in state s, the value of the policy satisfies the following equation:

    V^{\pi}(s) = \sum_{s'} T(s, \pi(s), s') \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right]

On the other hand, a stochastic policy does not recommend a single, deterministic action for each state. Instead, it gives, for each possible action a in a state s, a probability π(a | s) = P(a | s). Modify the above equation to compute the value of a stochastic policy π. (3 points)

4.3 Consider the grid world illustrated in the figure below, where A is the start state and the squares with double rectangles are the exit states. For an exit state, the only action available is Exit, which results in the listed reward and ends the game. For the non-exit states, the agent can choose the East, West, North, or South actions, which move the agent in the corresponding direction; i.e., the actions are deterministic. There are no living rewards. Assume that V_0(s) = 0 for all s, and γ = 1.

[Figure: a 3x3 grid world with rows labeled Z, Y, X from top to bottom and columns 1-3 from left to right. The exit rewards shown are +5 (row Z) and +10, +15 (row X); the start state A is in row Y.]

a. What is the optimal value V*(A)? (1 point)

b. When running value iteration, what is the non-zero value of V_k(A)? What is the value of k when V_k(A) first takes this non-zero value? (2 points)
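Parts c and d below continue with value iteration. Purely for reference (not part of the exam), one sweep of the value-iteration update used throughout 4.3 can be sketched as follows; the dictionary-based MDP encoding and the convention that action-less states keep value 0 are assumptions made for this sketch, not the grid in the figure.

```python
# Illustrative sketch of one value-iteration sweep,
#   V_{k+1}(s) = max_a sum_{s'} T(s, a, s') [ R(s, a, s') + gamma * V_k(s') ].
def value_iteration_sweep(V, actions, T, R, gamma):
    """actions[s]: available actions in s; T[(s, a)]: list of (s_next, prob);
    R(s, a, s_next): reward. V must contain every reachable state."""
    V_new = {}
    for s in actions:
        if not actions[s]:          # e.g. an absorbing post-exit state
            V_new[s] = 0.0
            continue
        V_new[s] = max(
            sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
            for a in actions[s]
        )
    return V_new
```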

c. After how many value iterations will V_k(A) = V*(A)? (Write "never" if they will never become equal.) (2 points)

d. If γ = 0.5, what is the optimal value V*(A)? (2 points)

5. Reinforcement Learning (8 points)

Consider the grid world illustrated below. The agent is trying to learn the optimal policy. At any square the agent can move North (N), South (S), East (E), or West (W). The terminal states (marked using double squares) also have an Exit action; performing it terminates the MDP. There are no living rewards: the agent receives rewards only when exiting from the terminal states. Let us assume that γ = 1 and α = 0.5. Rows are numbered 1-3 from bottom to top and columns 1-3 from left to right, and states are written as (row, column).

  3 | -10 | -10 | -10 |
  2 |  A  |     |     |
  1 | +10 |     | +15 |
        1     2     3

The agent starts exploring the grid from (2,1), resulting in the following set of episodes. Each entry in an episode is a tuple of the form (s, a, s', r): the agent was in state s, performed action a, ended up in state s', and received a reward of r.

Episode 1:
  (2,1), E, (2,2), 0
  (2,2), S, (1,2), 0
  (1,2), E, (1,3), 0
  (1,3), Exit, -, +15

Episode 2:
  (2,1), E, (2,2), 0
  (2,2), S, (1,2), 0
  (1,2), N, (2,2), 0
  (2,2), N, (3,2), 0
  (3,2), Exit, -, -10

Episode 3:
  (2,1), E, (2,2), 0
  (2,2), E, (2,3), 0
  (2,3), N, (3,3), 0
  (3,3), Exit, -, -10

Episode 4:
  (2,1), S, (1,1), 0
  (1,1), Exit, -, +10

Episode 5:
  (2,1), E, (2,2), 0
  (2,2), S, (1,2), 0
  (1,2), E, (1,3), 0
  (1,3), Exit, -, +15

a. If the agent were to employ direct utility estimation, what would be the q-value estimates for ((2,2), S), ((1,2), E), ((2,3), E), and ((2,3), N)? (2 points)

b. If the agent were to employ Q-learning, what would be the q-value estimates for ((2,2), S), ((1,2), E), ((2,3), E), and ((2,3), N)? Also indicate the episode and iteration number at which the q-value estimate for each of these q-states first becomes non-zero. If a q-value never becomes non-zero, write "never". (4 points)

c. In general, for a deterministic MDP, the Q-learning update with a learning rate of α = 1 will correctly learn the optimal q-values. True or False? Explain. (2 points)
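Purely for reference (not part of the exam), a minimal sketch of the tabular Q-learning update assumed in parts b and c is given below, replayed over (s, a, s', r) tuples written as in the episodes above. The action set and the use of None for the post-exit state are assumptions made for this sketch.

```python
from collections import defaultdict

# Illustrative sketch of the tabular Q-learning update
#   Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a')).
ACTIONS = ["N", "S", "E", "W", "Exit"]

def q_learning(episodes, alpha=0.5, gamma=1.0):
    Q = defaultdict(float)  # all q-values start at zero
    for episode in episodes:
        for s, a, s_next, r in episode:
            # Post-exit "next state" (written "-" above) is None here, value 0.
            best_next = 0.0 if s_next is None else max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q
```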