
1 Bayes Nets

Unfortunately, during spring, due to illness and allergies, Billy is unable to distinguish the cause (X) of his symptoms, which could be coughing (C), sneezing (S), and temperature (T). If he could determine the cause with reasonable accuracy, it would be beneficial for him to take either Robitussin DM or Benadryl. He has asked you to help him formulate his problem as a Bayesian network. At any point in time, Billy can either be sick (X = sick), allergic (X = allergic), or well (X = well).

Figure 1: Bayes Net

(a) (2 points) List all independence and conditional independence relationships implied by this Bayes net.

$$C \perp T \mid X \qquad\text{and}\qquad S \perp X \mid C, T$$

(b) (2 points) Write the algebraic expression for P(C, S) in terms of the joint distribution P(X, C, T, S).

$$P(C, S) = \sum_x \sum_t P(X = x, C, T = t, S)$$

(c) (2 points) What probability and conditional probability tables must we store for this Bayes net?

$$P(X),\quad P(C \mid X),\quad P(T \mid X),\quad P(S \mid C, T)$$

(d) (2 points) Derive an expression for the posterior distribution over X given whether Billy is coughing (C) and sneezing (S).

$$P(X \mid C, S) = \frac{P(X, C, S)}{P(C, S)} = \frac{\sum_t P(X, C, T = t, S)}{\sum_x \sum_t P(X = x, C, T = t, S)}$$

It is also correct to rewrite the above using the fact that $P(X, C, T, S) = P(X)\,P(C \mid X)\,P(T \mid X)\,P(S \mid C, T)$.

(e) (2 points) Derive an expression for the likelihood of observing that he is coughing (C) and sneezing (S) given X.

$$P(C, S \mid X) = \frac{P(X, C, S)}{P(X)} = \frac{\sum_t P(X, C, T = t, S)}{P(X)}$$
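The derivations in parts (b), (d), and (e) all amount to inference by enumeration: sum the joint over the unobserved variable T, then either normalize over X (for the posterior) or divide by P(X) (for the likelihood). The sketch below is a minimal illustration; the network structure matches part (c), but the numeric CPT values and the helper names are invented for illustration, since the exam does not supply the tables.

# Inference by enumeration for the Bayes net X -> C, X -> T, and (C, T) -> S.
# All numeric values below are made up for illustration only.

P_X = {'sick': 0.1, 'allergic': 0.2, 'well': 0.7}
P_C_given_X = {'sick': 0.8, 'allergic': 0.6, 'well': 0.1}    # P(C=true | X)
P_T_given_X = {'sick': 0.7, 'allergic': 0.1, 'well': 0.05}   # P(T=true | X)
P_S_given_CT = {(True, True): 0.9, (True, False): 0.7,       # P(S=true | C, T)
                (False, True): 0.6, (False, False): 0.05}

def joint(x, c, t, s):
    """P(X=x, C=c, T=t, S=s) via the chain rule of the Bayes net."""
    pc = P_C_given_X[x] if c else 1 - P_C_given_X[x]
    pt = P_T_given_X[x] if t else 1 - P_T_given_X[x]
    ps = P_S_given_CT[(c, t)] if s else 1 - P_S_given_CT[(c, t)]
    return P_X[x] * pc * pt * ps

def posterior_X(c, s):
    """Part (d): P(X | C=c, S=s), summing out T and normalizing over X."""
    unnorm = {x: sum(joint(x, c, t, s) for t in (True, False)) for x in P_X}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}

def likelihood(x, c, s):
    """Part (e): P(C=c, S=s | X=x) = sum_t P(x, c, t, s) / P(x)."""
    return sum(joint(x, c, t, s) for t in (True, False)) / P_X[x]

print(posterior_X(True, True))
print(likelihood('sick', True, True))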

2 HMMs

You sometimes get colds, which make you sneeze. You also get allergies, which make you sneeze. Sometimes you are well, which doesn't make you sneeze (much). You decide to model the process using the following HMM, with hidden states X ∈ {well, allergy, cold} and observations E ∈ {sneeze, quiet}.

Figure 2: HMM State Transition Diagram

Figure 3: HMM Probability Tables

(a) (1 point) Fill in the missing entries in the probability tables above.

(b) (2 points) Imagine you observe the sequence quiet, sneeze, sneeze. What is the probability that you were well all three days and observed these effects?

$$P(X_1 = w, X_2 = w, X_3 = w \mid E_1 = q, E_2 = s, E_3 = s) = \frac{P(X_1 = w, X_2 = w, X_3 = w, E_1 = q, E_2 = s, E_3 = s)}{P(E_1 = q, E_2 = s, E_3 = s)}$$

By the HMM factorization,

$$P(X_1, X_2, X_3, E_1, E_2, E_3) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_2)\,P(E_1 \mid X_1)\,P(E_2 \mid X_2)\,P(E_3 \mid X_3)$$

so

$$P(X_1 = w, X_2 = w, X_3 = w, E_1 = q, E_2 = s, E_3 = s) = 0.60 \cdot 0.5 \cdot 0.5 \cdot 0.9 \cdot 0.1 \cdot 0.1 = 0.00135$$

$$P(E_1 = q, E_2 = s, E_3 = s) = \sum_{x_1} \sum_{x_2} \sum_{x_3} P(X_1 = x_1, X_2 = x_2, X_3 = x_3, E_1 = q, E_2 = s, E_3 = s)$$

$$P(X_1 = w, X_2 = w, X_3 = w \mid E_1 = q, E_2 = s, E_3 = s) = 0.00135 / P(E_1 = q, E_2 = s, E_3 = s)$$

(c) (2 points) What is the posterior distribution over your state on day 2 ($X_2$) if $E_1$ = quiet, $E_2$ = sneeze?

This is the filtering problem. Running one step of filtering,

$$P(X_1 \mid E_1 = q) \propto P(X_1)\,P(E_1 = q \mid X_1),$$

we have (in the order well, allergy, cold):

$$P(X_1 \mid E_1 = q) = [0.885 \;\; 0.041 \;\; 0.074]$$

For the second step,

$$P(X_2 \mid E_1 = q, E_2 = s) \propto P(X_2, E_1 = q, E_2 = s) = \sum_{x_1} P(X_2, X_1 = x_1, E_1 = q, E_2 = s)$$
$$= \sum_{x_1} P(E_2 = s \mid X_2)\,P(X_2 \mid X_1 = x_1)\,P(X_1 = x_1)\,P(E_1 = q \mid X_1 = x_1)$$
$$= P(E_2 = s \mid X_2) \sum_{x_1} P(X_2 \mid X_1 = x_1)\,P(X_1 = x_1)\,P(E_1 = q \mid X_1 = x_1)$$
$$\propto P(E_2 = s \mid X_2) \sum_{x_1} P(X_2 \mid X_1 = x_1)\,P(X_1 = x_1 \mid E_1 = q)$$

Using the equation above, after two steps of filtering we have

$$P(X_2 \mid E_1 = q, E_2 = s) = [0.104 \;\; 0.580 \;\; 0.316]$$

(d) (2 points) What is the posterior distribution over your state on day 3 ($X_3$) if $E_1$ = quiet, $E_2$ = sneeze, $E_3$ = sneeze?

We simply need to compute another step of filtering, using our answer from part (c):

$$P(X_3 \mid E_1 = q, E_2 = s, E_3 = s) \propto P(E_3 = s \mid X_3) \sum_{x_2} P(X_3 \mid X_2 = x_2)\,P(X_2 = x_2 \mid E_1 = q, E_2 = s)$$

After normalizing, we have

$$P(X_3 \mid E_1 = q, E_2 = s, E_3 = s) = [0.077 \;\; 0.652 \;\; 0.271]$$
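Parts (c) and (d) are the standard HMM forward (filtering) recursion: predict with the transition model, weight by the observation likelihood, and normalize. The sketch below shows the recursion; since Figure 3 is not reproduced in this transcription, only the entries implied by part (b) (P(X1 = well) = 0.6, P(well | well) = 0.5, P(quiet | well) = 0.9, P(sneeze | well) = 0.1) are grounded in the exam, and the remaining table entries are placeholders, so its output will only match the vectors above when the exam's actual tables are substituted.

import numpy as np

# Forward (filtering) recursion for parts (c) and (d).
states = ['well', 'allergy', 'cold']
prior = np.array([0.6, 0.2, 0.2])            # P(X1); 0.6 for well from part (b)

# T[i, j] = P(X_{t+1} = j | X_t = i); well -> well = 0.5 from part (b)
T = np.array([[0.5, 0.3, 0.2],
              [0.3, 0.6, 0.1],
              [0.2, 0.2, 0.6]])

# E[i, k] = P(observation k | state i), k in {quiet, sneeze};
# quiet | well = 0.9 and sneeze | well = 0.1 from part (b)
E = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.3, 0.7]])

def filter_posteriors(observations):
    """Return P(X_t | e_1:t) for each t; observations are column indices of E."""
    belief = prior * E[:, observations[0]]
    belief /= belief.sum()
    posteriors = [belief]
    for obs in observations[1:]:
        belief = (belief @ T) * E[:, obs]    # predict, then update
        belief /= belief.sum()
        posteriors.append(belief)
    return posteriors

QUIET, SNEEZE = 0, 1
for t, p in enumerate(filter_posteriors([QUIET, SNEEZE, SNEEZE]), start=1):
    print(f"P(X_{t} | e_1:{t}) =", np.round(p, 3))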

3 MDPs

Figure 4: MDP

Consider the above MDP, representing a robot on a circular wheel. The wheel is divided into eight states and the available actions are to move clockwise or counterclockwise. The robot has a poor sense of direction and will move with probability 0.6 in the intended direction and with probability 0.4 in the opposite direction. All states have reward zero, except the terminal states, which have rewards −1 and +1 as shown. The discount factor is γ = 0.9.

(5 points) Compute the numeric values of the state-value function V(s) for states s1 through s6 (compute V(s1), V(s2), ...). Show your work below.

By symmetry, we know that V(s1) = V(s4), V(s2) = V(s5), and V(s3) = V(s6). To compute V(s1), V(s2), and V(s3), we could use value iteration. However, it is much easier to solve directly using the Bellman equations. It is clear that the optimal policy is to move toward the +1 reward.

$$V(s_1) = 0.4\,\gamma\,(-1) + 0.6\,\gamma\,V(s_2)$$
$$V(s_2) = 0.4\,\gamma\,V(s_1) + 0.6\,\gamma\,V(s_3)$$
$$V(s_3) = 0.4\,\gamma\,V(s_2) + 0.6\,\gamma\,(+1)$$

Solving this system of equations yields

$$V(s_1) = -0.217, \quad V(s_2) = 0.265, \quad V(s_3) = 0.635$$
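Because the optimal policy here is fixed (always move toward the +1 terminal), the three Bellman equations are linear in V(s1), V(s2), V(s3) and can be solved directly. A quick numeric check (the variable names are ours):

import numpy as np

# Rearrange the Bellman equations above into A @ [V1, V2, V3] = b:
#   V1 - 0.6*g*V2          = -0.4*g
#  -0.4*g*V1 + V2 - 0.6*g*V3 = 0
#          -0.4*g*V2 + V3   =  0.6*g
g = 0.9
A = np.array([[1.0,      -0.6 * g,  0.0],
              [-0.4 * g,  1.0,     -0.6 * g],
              [0.0,      -0.4 * g,  1.0]])
b = np.array([-0.4 * g, 0.0, 0.6 * g])

V1, V2, V3 = np.linalg.solve(A, b)
print(round(V1, 3), round(V2, 3), round(V3, 3))   # -0.217 0.265 0.635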

4 More MDPs and Reinforcement Learning

(a) (2 points) What is an ε-greedy policy and why is it important in Q-learning?

An ε-greedy policy is one that chooses the best action argmax_a Q(s, a) with probability 1 − ε and chooses a random action with probability ε. In Q-learning, such a policy provides the required balance between exploration and exploitation.

(b) (2 points) Why might one want to keep track of an action-value function Q(s, a) rather than a state-value function V(s)? Can we compute V(s) from Q(s, a)? If so, give an equation. Can we compute Q(s, a) from V(s)? If so, give an equation.

We might want to keep track of an action-value function Q(s, a) because we can directly obtain a policy from Q(s, a) (take the action a maximizing Q(s, a) when in state s), without needing to know the transition model for our environment.

Yes, we can compute V(s) from Q(s, a): $V(s) = \max_a Q(s, a)$ in the case that our policy π is to take the best action; in general, $V(s) = \sum_a \pi(s, a)\,Q(s, a)$.

If we don't know the transition model, then we can't compute Q(s, a) from V(s). However, if we're given the transition model T(s, a, s′), then

$$Q(s, a) = \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V(s')]$$

(c) (2 points) Consider a two-player adversarial zero-sum game. Players MAX and MIN alternate actions in this MDP. When a transition (s, a, s′) occurs, MAX receives reward R(s, a, s′) and MIN receives reward −R(s, a, s′). MAX attempts to maximize his total rewards and MIN attempts to minimize MAX's total rewards. Let Q_MAX(s, a) and Q_MIN(s, a) denote the q-values for performing action a in state s for players MAX and MIN, respectively. Given discount factor γ and transition probabilities T(s, a, s′), write two Bellman equations expressing each of Q_MAX and Q_MIN in terms of adjacent lookahead values of Q_MAX and/or Q_MIN.

Since MAX and MIN alternate turns, MAX must take the minimum over the adjacent lookahead value, as MIN plays next. Similarly, MIN must take the maximum over the adjacent lookahead value.

$$Q_{MAX}(s, a) = \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma \min_{a'} Q_{MIN}(s', a')]$$
$$Q_{MIN}(s, a) = \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma \max_{a'} Q_{MAX}(s', a')]$$
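Parts (a) and (b) fit together in tabular Q-learning: the agent acts ε-greedily with respect to its current Q estimates, and the backup uses V(s′) = max_a′ Q(s′, a′) computed from Q alone, with no transition model. A minimal sketch follows; the environment interface (reset/step) and the hyperparameter values are hypothetical, not part of the exam.

import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Part (a): random action with probability epsilon, else argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behavior policy."""
    Q = defaultdict(float)                      # Q(s, a), initialized to 0
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, actions, epsilon)
            next_state, reward, done = env.step(action)
            # Part (b): V(s') = max_a' Q(s', a') is recovered from Q alone,
            # without the transition model T(s, a, s').
            v_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * v_next - Q[(state, action)])
            state = next_state
    return Q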