CS 188 Spring 2019 Section Handout 5 Solutions

Approximate Q-Learning

With feature vectors, we can treat values of states and q-states as linear value functions:

V(s) = w_1 f_1(s) + w_2 f_2(s) + ... + w_n f_n(s) = w · f(s)
Q(s, a) = w_1 f_1(s, a) + w_2 f_2(s, a) + ... + w_n f_n(s, a) = w · f(s, a)

where f(s) = [f_1(s) f_2(s) ... f_n(s)]^T and f(s, a) = [f_1(s, a) f_2(s, a) ... f_n(s, a)]^T represent the feature vectors for state s and q-state (s, a) respectively, and w = [w_1 w_2 ... w_n] represents a weight vector. Defining the difference as

difference = [R(s, a, s') + γ max_a' Q(s', a')] − Q(s, a)

approximate Q-learning works almost identically to Q-learning, using the following update rule:

w_i ← w_i + α · difference · f_i(s, a)

CSPs

CSPs are defined by three factors:

1. Variables - CSPs possess a set of N variables X_1, ..., X_N that can each take on a single value from some defined set of values.
2. Domain - A set {x_1, ..., x_d} representing all possible values that a CSP variable can take on.
3. Constraints - Constraints define restrictions on the values of variables, potentially with regard to other variables.

CSPs are often represented as constraint graphs, where nodes represent variables and edges represent constraints between them.

Unary Constraints - Unary constraints involve a single variable in the CSP. They are not represented in constraint graphs; instead, they are simply used to prune the domain of the variable they constrain when necessary.
Binary Constraints - Binary constraints involve two variables. They're represented in constraint graphs as traditional graph edges.
Higher-order Constraints - Constraints involving three or more variables can also be represented with edges in a CSP graph.

In forward checking, whenever a value is assigned to a variable X_i, we prune from the domains of unassigned variables that share a constraint with X_i any values that would violate that constraint if assigned.

The idea of forward checking can be generalized into the principle of arc consistency. For arc consistency, we interpret each undirected edge of the constraint graph for a CSP as two directed edges pointing in opposite directions. Each of these directed edges is called an arc. The arc consistency algorithm works as follows:

- Begin by storing all arcs in the constraint graph for the CSP in a queue Q.
- Iteratively remove arcs from Q and enforce the condition that in each removed arc X_i → X_j, for every remaining value v for the tail variable X_i, there is at least one remaining value w for the head variable X_j such that X_i = v, X_j = w does not violate any constraints. If some value v for X_i would not work with any of the remaining values for X_j, we remove v from the set of possible values for X_i.
- If at least one value is removed for X_i when enforcing arc consistency for an arc X_i → X_j, add arcs of the form X_k → X_i to Q, for all unassigned variables X_k. If an arc X_k → X_i is already in Q during this step, it doesn't need to be added again.
- Continue until Q is empty, or the domain of some variable is empty and triggers a backtrack.

We've delineated that when solving a CSP, we fix some ordering for both the variables and values involved. In practice, it's often much more effective to compute the next variable and corresponding value on the fly, using two broad principles, minimum remaining values and least constraining value:

Minimum Remaining Values (MRV) - When selecting which variable to assign next, an MRV policy chooses whichever unassigned variable has the fewest valid remaining values (the most constrained variable).
Least Constraining Value (LCV) - Similarly, when selecting which value to assign next, a good policy is to select the value that prunes the fewest values from the domains of the remaining unassigned variables.
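
The following is a minimal Python sketch of the arc consistency procedure described above (essentially AC-3), plus an MRV variable selector. The CSP representation (a dict of domains and a dict of pairwise constraint-checking functions) is an assumption chosen for illustration, not code from the course projects.

from collections import deque

def ac3(domains, constraints):
    # domains: dict mapping each variable to a set of remaining values.
    # constraints: dict mapping each ordered pair (Xi, Xj) to a function
    # satisfied(vi, vj) that is True when Xi = vi, Xj = vj violates nothing.
    # Returns False as soon as some domain becomes empty (triggering a backtrack).
    queue = deque(constraints.keys())          # start with every arc
    while queue:
        xi, xj = queue.popleft()
        if revise(domains, constraints, xi, xj):
            if not domains[xi]:
                return False                   # empty domain
            # The domain of xi shrank, so re-check arcs pointing into xi.
            for (xk, xl) in constraints:
                if xl == xi and xk != xj and (xk, xi) not in queue:
                    queue.append((xk, xi))
    return True

def revise(domains, constraints, xi, xj):
    # Remove values of the tail xi that have no supporting value in the head xj.
    satisfied = constraints[(xi, xj)]
    removed = False
    for v in set(domains[xi]):
        if not any(satisfied(v, w) for w in domains[xj]):
            domains[xi].discard(v)
            removed = True
    return removed

def select_mrv_variable(domains, assignment):
    # MRV: choose the unassigned variable with the fewest remaining values.
    unassigned = [x for x in domains if x not in assignment]
    return min(unassigned, key=lambda x: len(domains[x]))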

Q1. Pacman with Feature-Based Q-Learning

We would like to use a Q-learning agent for Pacman, but the size of the state space for a large grid is too massive to hold in memory. To solve this, we will switch to a feature-based representation of Pacman's state.

(a) We will have two features, F_g and F_p, defined as follows:

F_g(s, a) = A(s) + B(s, a) + C(s, a)
F_p(s, a) = D(s) + 2 E(s, a)

where

A(s) = number of ghosts within 1 step of state s
B(s, a) = number of ghosts Pacman touches after taking action a from state s
C(s, a) = number of ghosts within 1 step of the state Pacman ends up in after taking action a
D(s) = number of food pellets within 1 step of state s
E(s, a) = number of food pellets eaten after taking action a from state s

For this Pacman board, the ghosts will always be stationary, and the action space is {left, right, up, down, stay}. Calculate the features for the actions {left, right, up, stay}.

F_p(s, up) = 1 + 2(1) = 3
F_p(s, left) = 1 + 2(0) = 1
F_p(s, right) = 1 + 2(0) = 1
F_p(s, stay) = 1 + 2(0) = 1

F_g(s, up) = 2 + 0 + 0 = 2
F_g(s, left) = 2 + 1 + 1 = 4
F_g(s, right) = 2 + 1 + 1 = 4
F_g(s, stay) = 2 + 0 + 2 = 4

(b) After a few episodes of Q-learning, the weights are w_g = -10 and w_p = 100. Calculate the Q-value for each action {left, right, up, stay} from the current state shown in the figure.

Q(s, up) = w_p F_p(s, up) + w_g F_g(s, up) = 100(3) + (-10)(2) = 280
Q(s, left) = w_p F_p(s, left) + w_g F_g(s, left) = 100(1) + (-10)(4) = 60
Q(s, right) = w_p F_p(s, right) + w_g F_g(s, right) = 100(1) + (-10)(4) = 60
Q(s, stay) = w_p F_p(s, stay) + w_g F_g(s, stay) = 100(1) + (-10)(4) = 60
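
As a quick arithmetic check on part (b), the snippet below recomputes the four Q-values from the feature values in part (a); the dictionaries simply transcribe those values and are not part of any provided code.

# Feature values transcribed from part (a); names are illustrative only.
F_p = {'up': 3, 'left': 1, 'right': 1, 'stay': 1}
F_g = {'up': 2, 'left': 4, 'right': 4, 'stay': 4}
w_p, w_g = 100, -10

Q = {a: w_p * F_p[a] + w_g * F_g[a] for a in F_p}
print(Q)   # {'up': 280, 'left': 60, 'right': 60, 'stay': 60}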

(c) We observe a transition that starts from the state above, s, takes action up, ends in state s' (the state with the food pellet above), and receives a reward R(s, a, s') = 250. The available actions from state s' are down and stay. Assuming a discount of γ = 0.5, calculate the new estimate of the Q-value for s based on this episode.

Q_new(s, a) = R(s, a, s') + γ max_a' Q(s', a')
            = 250 + 0.5 · max{Q(s', down), Q(s', stay)}
            = 250 + 0.5 · 0
            = 250

where

Q(s', down) = w_p F_p(s', down) + w_g F_g(s', down) = 100(0) + (-10)(2) = -20
Q(s', stay) = w_p F_p(s', stay) + w_g F_g(s', stay) = 100(0) + (-10)(0) = 0

(d) With this new estimate and a learning rate (α) of 0.5, update the weights for each feature.

w_p ← w_p + α (Q_new(s, a) - Q(s, a)) F_p(s, a) = 100 + 0.5 · (250 - 280) · 3 = 55
w_g ← w_g + α (Q_new(s, a) - Q(s, a)) F_g(s, a) = -10 + 0.5 · (250 - 280) · 2 = -40
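
The sketch below runs the whole update from parts (c) and (d) end to end, using the generic rule w_i ← w_i + α · difference · f_i(s, a) from the top of the handout. The feature dictionaries just transcribe the handout's values; the helper names are illustrative.

def q_value(weights, feats):
    # Q(s, a) = w · f(s, a) for features stored in a dict keyed by feature name.
    return sum(weights[k] * v for k, v in feats.items())

weights = {'p': 100, 'g': -10}             # w_p, w_g after a few episodes
f_s_up = {'p': 3, 'g': 2}                  # f(s, up) from part (a)
f_succ = {'down': {'p': 0, 'g': 2},        # f(s', a') for the successor state s'
          'stay': {'p': 0, 'g': 0}}
reward, gamma, alpha = 250, 0.5, 0.5

# Part (c): sample-based target for Q(s, up).
target = reward + gamma * max(q_value(weights, f) for f in f_succ.values())

# Part (d): update each weight with the difference (target - Q(s, up)).
difference = target - q_value(weights, f_s_up)
for k in weights:
    weights[k] += alpha * difference * f_s_up[k]

print(target, weights)   # 250 {'p': 55.0, 'g': -40.0}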

Q2. MDPs and RL

Recall that in approximate Q-learning, the Q-value is a weighted sum of features: Q(s, a) = Σ_i w_i f_i(s, a). To derive a weight update equation, we first defined the loss function L_2 = (1/2)(y - Σ_k w_k f_k(x))^2 and found dL_2/dw_m = -(y - Σ_k w_k f_k(x)) f_m(x). Our label y in this setup is r + γ max_a' Q(s', a'). Putting this all together, we derived the gradient descent update rule for w_m as

w_m ← w_m + α (r + γ max_a' Q(s', a') - Q(s, a)) f_m(s, a).

In the following question, you will derive the gradient descent update rule for w_m using a different loss function:

L_1 = |y - Σ_k w_k f_k(x)|

(a) Find dL_1/dw_m. Ignore the non-differentiable point.

Note that the derivative of |x| is -1 if x < 0 and 1 if x > 0. So for L_1, we have:

dL_1/dw_m = -f_m(x)   if y - Σ_k w_k f_k(x) > 0
dL_1/dw_m = f_m(x)    if y - Σ_k w_k f_k(x) < 0

(b) Write the gradient descent update rule for w_m, using the L_1 loss function.

w_m ← w_m - α (dL_1/dw_m), which gives:

w_m ← w_m + α f_m(x)   if y - Σ_k w_k f_k(x) > 0
w_m ← w_m - α f_m(x)   if y - Σ_k w_k f_k(x) < 0
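
To compare the two update rules concretely, here is a small sketch applying both the L_2 and L_1 updates to one transition, reusing the numbers from Q1 (features [3, 2], weights [100, -10], label y = 250); the NumPy encoding is just for illustration.

import numpy as np

f = np.array([3.0, 2.0])       # f(s, a), reusing the numbers from Q1
w = np.array([100.0, -10.0])   # current weights
y, alpha = 250.0, 0.5          # label y = r + γ max_a' Q(s', a')

residual = y - w.dot(f)        # y - Q(s, a)

# L2 update: the step scales with the size of the residual.
w_l2 = w + alpha * residual * f            # -> [55., -40.]

# L1 update: only the sign of the residual matters (non-differentiable point ignored).
w_l1 = w + alpha * np.sign(residual) * f   # -> [98.5, -11.]

print(w_l2, w_l1)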

Q3. CSPs: Trapped Pacman

Pacman is trapped! He is surrounded by mysterious corridors, each of which leads to either a pit (P), a ghost (G), or an exit (E). In order to escape, he needs to figure out which corridors, if any, lead to an exit and freedom, rather than the certain doom of a pit or a ghost. The one sign of what lies behind the corridors is the wind: a pit produces a strong breeze (S) and an exit produces a weak breeze (W), while a ghost doesn't produce any breeze at all. Unfortunately, Pacman cannot measure the strength of the breeze at a specific corridor. Instead, he can stand between two adjacent corridors and feel the max of the two breezes. For example, if he stands between a pit and an exit he will sense a strong (S) breeze, while if he stands between an exit and a ghost, he will sense a weak (W) breeze. The measurements for all intersections are shown in the figure below. Also, while the total number of exits might be zero, one, or more, Pacman knows that two neighboring squares will not both be exits.

[Figure: Pacman surrounded by corridors 1-6 arranged in a ring. The measured breezes at the intersections are S between 1 and 2, W between 2 and 3, W between 3 and 4, S between 4 and 5, S between 5 and 6, and S between 6 and 1.]

Pacman models this problem using variables X_i for each corridor i and domains P, G, and E.

(a) State the binary and/or unary constraints for this CSP (either implicitly or explicitly).

Binary:
X_1 = P or X_2 = P
X_2 = E or X_3 = E
X_3 = E or X_4 = E
X_4 = P or X_5 = P
X_5 = P or X_6 = P
X_1 = P or X_6 = P
For all i, j such that Adj(i, j): not (X_i = E and X_j = E)

Unary:
X_2 ≠ P
X_3 ≠ P
X_4 ≠ P

Note: This is just one of many solutions. The answers below will be based on this formulation.

(b) Suppose we assign X_1 to E. Perform forward checking after this assignment. Also, enforce unary constraints.

X_1: E
X_2: (empty)
X_3: G, E
X_4: G, E
X_5: P, G, E
X_6: P
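
A small sketch of part (b): encode the breeze constraints, enforce the unary constraints, assign X_1 = E, and forward-check the binary constraints that touch X_1. The pair lists and the consistent() helper below are one illustrative encoding, not official solution code.

S_PAIRS = [(1, 2), (4, 5), (5, 6), (6, 1)]   # strong breeze: at least one pit
W_PAIRS = [(2, 3), (3, 4)]                    # weak breeze: at least one exit
ADJ = S_PAIRS + W_PAIRS                       # neighbors may not both be exits

domains = {i: {'P', 'G', 'E'} for i in range(1, 7)}

# Unary constraints: a corridor at a weak-breeze intersection cannot be a pit.
for i, j in W_PAIRS:
    domains[i].discard('P')
    domains[j].discard('P')

def consistent(i, vi, j, vj):
    # True if the partial assignment X_i = vi, X_j = vj violates no constraint.
    pair = (i, j) if (i, j) in ADJ else (j, i)
    if pair in S_PAIRS and 'P' not in (vi, vj):
        return False
    if pair in W_PAIRS and 'E' not in (vi, vj):
        return False
    if pair in ADJ and vi == 'E' and vj == 'E':
        return False
    return True

# Assign X_1 = E, then forward-check every neighbor of X_1.
domains[1] = {'E'}
for j in range(2, 7):
    if (1, j) in ADJ or (j, 1) in ADJ:
        domains[j] = {vj for vj in domains[j] if consistent(1, 'E', j, vj)}

print(domains)
# X_1: {E}, X_2: {} (empty), X_3: {G, E}, X_4: {G, E}, X_5: {P, G, E}, X_6: {P}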

(c) Suppose forward checking returns the following set of possible assignments:

X_1: P
X_2: G, E
X_3: G, E
X_4: G, E
X_5: P
X_6: P, G, E

According to MRV, which variable or variables could the solver assign first?

X_1 or X_5 (tie-breaking)

(d) Assume that Pacman knows that X_6 = G. List all the solutions of this CSP or write none if no solutions exist.

(P, E, G, E, P, G)
(P, G, E, G, P, G)
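
To double-check part (d), here is a brute-force sketch that fixes X_6 = G, enumerates the remaining 3^5 assignments, and keeps those satisfying every constraint from part (a); the encoding mirrors the sketch after part (b).

from itertools import product

S_PAIRS = [(1, 2), (4, 5), (5, 6), (6, 1)]   # strong breeze: at least one pit
W_PAIRS = [(2, 3), (3, 4)]                    # weak breeze: at least one exit, no pit
ADJ = S_PAIRS + W_PAIRS                       # neighbors may not both be exits

def satisfies(x):
    for i, j in S_PAIRS:
        if 'P' not in (x[i], x[j]):
            return False
    for i, j in W_PAIRS:
        if 'E' not in (x[i], x[j]) or 'P' in (x[i], x[j]):
            return False
    return all(not (x[i] == 'E' and x[j] == 'E') for i, j in ADJ)

solutions = []
for values in product('PGE', repeat=5):
    x = dict(zip(range(1, 6), values))
    x[6] = 'G'                                # Pacman knows corridor 6 is a ghost
    if satisfies(x):
        solutions.append(tuple(x[i] for i in range(1, 7)))

print(sorted(solutions))
# [('P', 'E', 'G', 'E', 'P', 'G'), ('P', 'G', 'E', 'G', 'P', 'G')]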