Markov Models and Reinforcement Learning. Stephen G. Ware CSCI 4525 / 5525


Camera Vacuum World (CVW) 2 discrete rooms (A and B) with cameras that detect dirt. A mobile robot with a vacuum. The goal is to ensure both rooms are clean.

Camera Vacuum World Observable: Fully. Agents: Single. Deterministic: Deterministic. Episodic: Episodic. Static: Static. Discrete: Discrete.

CVW State Space (figure): the states of the two-room world, connected by the actions Left, Right, and Suck.

Deterministic Process A deterministic process describes how the world transitions from one state to another by taking actions. It can be described as a graph whose nodes are states and whose edges are actions. Most state spaces that we have considered so far (e.g. CVW) are deterministic processes because there is no uncertainty about the outcomes of an action.

Making Decisions with a Deterministic Process A decision process is a process in which some state has been labeled the start state, some states have been labeled as goal states, and a rational agent is expected to find a way to reach a goal from the start state. We call a solution to a decision process a policy. A policy is a function which, for any given current state, specifies which action to take next.

CVW Policy
π(Robot at A, A dirty, B dirty) = Suck
π(Robot at A, A clean, B dirty) = Right
π(Robot at B, A dirty, B clean) = Left
... and so on, for every possible state
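In code, such a policy is just a lookup table from states to actions. A minimal sketch, assuming states are written as (robot location, status of A, status of B) tuples; the names here are illustrative, not course code:

    # Sketch of a CVW policy as a lookup table (illustrative names, not course code).
    # A state is (robot_location, room_A_status, room_B_status).
    policy = {
        ("A", "dirty", "dirty"): "Suck",
        ("A", "clean", "dirty"): "Right",
        ("B", "dirty", "clean"): "Left",
        # ... one entry for every possible state
    }

    def act(state):
        """Return the action the policy prescribes for the current state."""
        return policy[state]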

Finding Deterministic Policies Just as many problems considered so far can be modeled as deterministic processes, many of the search-based solutions we have considered are ways of finding a policy. In general, the problem of solving a deterministic decision process is equivalent to classical planning.

CVW 2 discrete rooms (A and B) with cameras that detect dirt. A mobile robot with a vacuum. The goal is to ensure both rooms are clean.

Stochastic CVW (SCVW) 2 discrete rooms (A and B) with cameras that detect dirt. A mobile robot with a vacuum that fails to clean 10% of the time. The goal is to ensure both rooms are clean.

CVW Observable: Fully. Agents: Single. Deterministic: Deterministic. Episodic: Episodic. Static: Static. Discrete: Discrete.

SCVW Observable: Fully. Agents: Single. Deterministic: Stochastic. Episodic: Episodic. Static: Static. Discrete: Discrete.

CVW State Space (figure): states connected by the actions Left, Right, and Suck, each with a single certain outcome.

SCVW State Space (figure): Left and Right succeed 100% of the time; Suck cleans the room 90% of the time and leaves it unchanged 10% of the time.
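In code, this stochastic state space can be written as a transition model that maps each (state, action) pair to a list of possible outcomes with their probabilities. A minimal sketch, using the same illustrative state tuples as above and the probabilities from the slide:

    # Sketch of the SCVW transition model (illustrative representation, not course code).
    transitions = {
        (("A", "dirty", "dirty"), "Suck"): [
            (0.9, ("A", "clean", "dirty")),  # vacuum succeeds 90% of the time
            (0.1, ("A", "dirty", "dirty")),  # vacuum fails 10% of the time
        ],
        (("A", "clean", "dirty"), "Right"): [
            (1.0, ("B", "clean", "dirty")),  # movement always succeeds
        ],
        (("B", "dirty", "clean"), "Left"): [
            (1.0, ("A", "dirty", "clean")),
        ],
        # ... remaining (state, action) pairs
    }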

Stochastic Process A stochastic process describes how the world transitions from one state to another by taking actions with uncertain effects. It can be described as a graph whose nodes are states and whose edges are actions, annotated with the probability of each possible outcome. SCVW is a stochastic process because the outcome of an action cannot be known in advance (but once the action is taken, the outcome is known).

Markov Process (Markov Chain) When a stochastic process is such that the probability of transitioning to a next state depends only on the previous state and the action taken, we say that process has the Markov Property. This property is named for Andrey Markov, a famous mathematician who studied stochastic processes.

Markov Process (Markov Chain) In other words, we don't need to know all the past states you have been in or all the past actions you have taken. We only need to know your current state and the action you intend to take. From that, we can predict (stochastically) which state you will end up in after taking the action.
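Stated as an equation (standard notation, not from the slides): conditioning the next state on the entire history gives the same distribution as conditioning on the current state and action alone.

    P(S_{t+1} = s' | S_0, A_0, ..., S_t = s, A_t = a) = P(S_{t+1} = s' | S_t = s, A_t = a)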

Markov Chains Markov chains can be used to model simple real world processes. One common application is text mining. Each node in the graph represents a word (w) that appears in the text. An edge leads from w1 to w2 if, somewhere in the corpus, we see w1 followed by w2. The probability of an edge represents the percentage of times w2 is seen to come after w1.
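A minimal sketch of that idea in Python (the corpus and function names are illustrative assumptions, not the course's code):

    import random
    from collections import defaultdict

    def build_chain(text):
        """Count how often each word w2 follows each word w1 in the corpus."""
        counts = defaultdict(lambda: defaultdict(int))
        words = text.split()
        for w1, w2 in zip(words, words[1:]):
            counts[w1][w2] += 1
        return counts

    def generate(chain, start, length=20):
        """Walk the chain, picking each next word in proportion to how often it followed the current one."""
        word, output = start, [start]
        for _ in range(length - 1):
            followers = chain.get(word)
            if not followers:
                break
            nexts, freqs = zip(*followers.items())
            word = random.choices(nexts, weights=freqs)[0]
            output.append(word)
        return " ".join(output)

    corpus = "the cat sat on the mat and the cat ran"  # placeholder corpus
    print(generate(build_chain(corpus), start="the"))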

King James Programming A bot trained on two corpora, The King James Bible and Structure and Interpretation of Computer Programs, that generates random sayings such as: "Jesus saith unto them, Ye know that the relationship between Fahrenheit and Celsius temperatures is Such a constraint." "By running the test with more and more in knowledge and in all things approving ourselves as the ministers of the LORD, and they provoked him to jealousy with that which is good." "He that doeth good is of God: but the calf of the sin offering, and the other registers that need to be immediately sworn and notarized."

Markov Decision Process (MDP) A Markov Decision Process is a decision process based on a Markov chain. An agent transitioning from start to goal cannot always predict the result of an action. Thus, any policy for solving an MDP must account for all states that an agent might accidentally end up in. This can be thought of as classical planning, but where things sometimes go wrong.

SCVW Policy
π(Robot at A, A dirty, B dirty) = Suck
π(Robot at A, A clean, B dirty) = Right
π(Robot at B, A dirty, B clean) = Left
... and so on, for every possible state

SCVW 2 discrete rooms (A and B) with cameras that detect dirt. A mobile robot with a vacuum that fails to clean 10% of the time. The goal is to ensure both rooms are clean.

Hidden SCVW (HSCVW) 2 discrete rooms (A and B) with cameras that detect dirt 95% of the time dirt is present. A mobile robot with a vacuum that fails to clean 10% of the time. The goal is to ensure both rooms are clean.

SCVW Observable: Fully. Agents: Single. Deterministic: Stochastic. Episodic: Episodic. Static: Static. Discrete: Discrete.

HSCVW Observable: Partially. Agents: Single. Deterministic: Stochastic. Episodic: Episodic. Static: Static. Discrete: Discrete.

HSCVW State Space (figure): The robot is in room A. Camera A reports dirt. Camera B reports dirt. The agent's belief over the true state: both rooms dirty 90.25%, only room A dirty 4.75%, only room B dirty 4.75%, neither room dirty 0.25%.
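Those percentages are just products of the camera accuracies, as in this quick sketch (assuming the two camera readings are independent and ignoring any prior over dirt, as the slide does):

    # Both cameras report dirt; each camera is right 95% of the time.
    p_right, p_wrong = 0.95, 0.05
    belief = {
        ("dirty", "dirty"): p_right * p_right,  # 0.9025 -> 90.25%
        ("dirty", "clean"): p_right * p_wrong,  # 0.0475 ->  4.75%
        ("clean", "dirty"): p_wrong * p_right,  # 0.0475 ->  4.75%
        ("clean", "clean"): p_wrong * p_wrong,  # 0.0025 ->  0.25%
    }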

Hidden Markov Model (HMM) A Hidden Markov Model is a process which is assumed to operate according to a Markov process, but whose state is not directly observable. Based on whatever observations are available, an agent must maintain a probability distribution over the states it might currently be in.

Partially Observable Markov Decision Process (POMDP) A Partially Observable Markov Decision Process is a decision process based on a hidden Markov model. An agent transitioning from start to goal does not know the actual state of the world, but can guess it based on observations. It must choose a policy which is expected to maximize the chance of reaching a solution.

Markov Processes
                       How the World Works     How an Agent Should Behave in that World
Fully observable       Markov Chain            Markov Decision Process
Partially observable   Hidden Markov Model     Partially Observable Markov Decision Process

Reinforcement Learning Reinforcement Learning is a kind of machine learning in which labeled data is not available, but periodic feedback (in the form of rewards and punishments) is. An agent takes actions, observes its reward or punishment, and eventually learns which actions lead to success and which lead to failure. For example, consider training a dog.

RL as Supervised Learning Reinforcement learning can be thought of as a kind of supervised learning in which the agent must generate its own labeled data based on trying different combinations of actions.

RL and POMDPs Most reinforcement learning algorithms are designed to learn optimal policies for solving partially observable Markov decision processes. The agent has observations which tell it which states it might be in (and how likely each is). It assumes the process is a Markov process. The agent takes actions and observes rewards or punishments. The agent eventually learns how to behave in the environment so as to maximize its reward.

Exploration vs. Exploitation When an agent takes actions whose outcomes are relatively unknown, it is called exploration. When an agent takes actions whose outcomes are known to produce high rewards, it is called exploitation. One of the central problems in RL is striking a balance between exploration (learning new information) and exploitation (using known information).
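One standard way to strike that balance is an epsilon-greedy rule: with a small probability take a random action (explore), otherwise take the action that currently looks best (exploit). A minimal sketch (the rule is standard, but this code is illustrative, not from the course):

    import random

    def epsilon_greedy(estimated_values, epsilon=0.1):
        """With probability epsilon explore a random action; otherwise exploit the best-known one."""
        if random.random() < epsilon:
            return random.choice(list(estimated_values))         # explore
        return max(estimated_values, key=estimated_values.get)   # exploit

    # Example: the agent's current estimate of each action's payoff.
    values = {"a1": 0.5, "a2": 0.0, "a3": 0.3}
    print(epsilon_greedy(values))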

n-armed Bandits Problem Imagine you have $1000 that you intend to spend on a row of slot machines (one-armed bandits) in Las Vegas. How can you spend that money so as to maximize the amount of money you have left at the end of the day?

n-armed Bandits Problem We assume each machine is controlled by a Markov process. (Figure: a hidden two-state Win/Lose chain with fixed transition probabilities, e.g. 99% / 1% and 45% / 55%.)

n-armed Bandits Problem But we can't see that process; we can only observe the results (wins or losses), so it is a hidden Markov model.

n-armed Bandits Problem Our task is to find an optimal policy, that is, to solve a partially observable Markov decision process.

n-armed Bandits Problem Say you have played Machine A 20 times. You have won 10 times and lost 10 times. You have played Machine B 1 time and lost. Is it logical to conclude that you should never play Machine B because, so far, you have lost 100% of the time?

n-armed Bandits Problem No. You are exploiting Machine A too early without exploring Machine B enough. Machine B might have a better win/loss ratio than A. You need to gather more information.
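One way to see this numerically is with an upper-confidence-bound score (UCB1, a standard bandit strategy not covered in these slides): add to each machine's observed win rate a bonus that shrinks as that machine is played more. A sketch using the counts above:

    import math

    # Observed plays and wins from the example above.
    plays = {"A": 20, "B": 1}
    wins  = {"A": 10, "B": 0}
    total = sum(plays.values())

    for m in plays:
        mean = wins[m] / plays[m]                          # observed win rate
        bonus = math.sqrt(2 * math.log(total) / plays[m])  # uncertainty bonus
        print(m, round(mean, 2), round(mean + bonus, 2))

    # Machine B's score is higher than Machine A's: a single loss tells us
    # almost nothing, so an exploring agent should keep trying Machine B.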