Reinforcement Learning


Reinforcement Learning. Yishay Mansour, Google Inc. & Tel-Aviv University

Outline: Goal of Reinforcement Learning. Mathematical Model (MDP). Planning. Learning. Current Research Issues.

Goal of Reinforcement Learning: goal-oriented learning through interaction; control of large-scale stochastic environments with partial knowledge. Contrast with Supervised / Unsupervised Learning: learning from labeled / unlabeled examples.

Reinforcement Learning - origins: Artificial Intelligence, Control Theory, Operations Research, Cognitive Science & Psychology. Solid foundations; well-established research.

Typical Applications. Robotics: elevator control [CB], Robo-soccer [SV]. Board games: backgammon [T], checkers [S], chess [B]. Scheduling: dynamic channel allocation [SB], inventory problems.

Contrast with Supervised Learning: the system has a state, and the algorithm influences the state distribution. Inherent tradeoff: Exploration versus Exploitation.

Mathematical Model - Motivation. A model of uncertainty: environment, actions, our knowledge. Focus on decision making. Maximize long-term reward. Markov Decision Process (MDP).

Mathematical Model - MDP. Markov decision processes: S - set of states, A - set of actions, δ - transition probability, R - reward function. Similar to a DFA!

MDP model - states and actions. Environment = states; actions = transitions δ(s, a, s'). [Figure: from a state, action a leads to two successor states with probabilities 0.7 and 0.3.]

MDP model - rewards. R(s,a) = the reward at state s for doing action a (a random variable). Example: R(s,a) = -1 with probability 0.5, +10 with probability 0.35, +20 with probability 0.15.
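
As a quick worked step (not on the original slide), the expected immediate reward in this example is E[R(s,a)] = 0.5·(-1) + 0.35·10 + 0.15·20 = -0.5 + 3.5 + 3.0 = 6.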

MDP model - trajectories. A trajectory: s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...

MDP - Return function. Combining all the immediate rewards into a single value. Modeling issues: Are early rewards more valuable than later rewards? Is the system terminating or continuous? Usually the return is linear in the immediate rewards.

MDP model - return functions. Finite Horizon - parameter H: return = Σ_{i=1}^{H} R(s_i, a_i). Infinite Horizon, discounted - parameter γ < 1: return = Σ_{i=0}^{∞} γ^i R(s_i, a_i). Infinite Horizon, undiscounted: return = (1/N) Σ_{i=0}^{N-1} R(s_i, a_i). Terminating MDP.
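
As a small illustration (my own, not from the slides), the discounted return of a finite reward prefix can be computed directly; the reward sequence below is made up for the example.

```python
# Sketch: discounted return of a (truncated) trajectory, return = sum_i gamma^i * r_i.
def discounted_return(rewards, gamma):
    """Sum of gamma^i * r_i over a finite prefix of a trajectory."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

print(discounted_return([0, 1, 2, 3], gamma=0.5))  # 0 + 0.5 + 0.5 + 0.375 = 1.375
```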

MDP model - action selection. AIM: maximize the expected return. Fully Observable - can see the entire state. Policy - a mapping from states to actions. Optimal policy: optimal from any start state. THEOREM: There exists a deterministic optimal policy.

Contrast with Supervised Learning. Supervised Learning: a fixed distribution on examples. Reinforcement Learning: the state distribution is policy dependent!!! A small local change in the policy can make a huge global change in the return.

MDP model - summary: S - set of states, |S| = n. A - set of k actions, |A| = k. δ(s_1, a, s_2) - transition function. R(s,a) - immediate reward function. π : S → A - policy. Σ_{i=0}^{∞} γ^i r_i - discounted cumulative return.
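
One minimal way to encode these pieces in code, purely as an illustration of the notation above (the container and field names are my own, not from the slides); the later planning sketches reuse it.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Minimal MDP container matching the summary: states S, actions A,
# transition function delta(s, a, s'), expected reward R(s, a), discount gamma.
@dataclass
class MDP:
    states: List[int]                               # S, |S| = n
    actions: List[int]                              # A, |A| = k
    delta: Dict[Tuple[int, int], Dict[int, float]]  # (s, a) -> {s': probability}
    reward: Dict[Tuple[int, int], float]            # (s, a) -> expected immediate reward
    gamma: float                                    # discount factor, gamma < 1
```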

Simple example: N-armed bandit. A single state with actions a_1, a_2, a_3, ... Goal: maximize the sum of immediate rewards. Given the model: take the greedy action. Difficulty: unknown model.

N-Armed Bandit: Highlights. Algorithms (near greedy): Exponential weights - G_i is the sum of rewards of action a_i, w_i = β^{G_i}. Follow the (perturbed) leader. Result: for any sequence of T rewards, E[online] ≥ max_i {G_i} - sqrt(T log N).
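
A rough sketch of the exponential-weights idea on the slide (w_i = β^{G_i}); the value of β, the sampling rule, and the bandit-style feedback below are illustrative choices of mine, not the exact algorithm behind the stated regret bound.

```python
import random

# Sketch: keep cumulative rewards G_i per arm, sample arms with weights w_i = beta ** G_i.
def exponential_weights(num_arms, pull, T, beta=1.05):
    G = [0.0] * num_arms                      # G_i: cumulative reward of arm i
    for _ in range(T):
        w = [beta ** g for g in G]            # w_i = beta^{G_i}
        i = random.choices(range(num_arms), weights=w)[0]
        G[i] += pull(i)                       # observe the reward of the chosen arm
    return G

# Hypothetical example: three arms with increasing average payoff.
print(exponential_weights(3, lambda i: random.random() * (i + 1) / 3, T=1000))
```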

Planning - Basic Problems. Given a complete MDP model: Policy evaluation - given a policy π, estimate its return. Optimal control - find an optimal policy π* (maximizing the return from any start state).

Planning - Value Functions. V^π(s): the expected return starting at state s and following π. Q^π(s,a): the expected return starting at state s with action a and then following π. V*(s) and Q*(s,a) are defined using an optimal policy π*: V*(s) = max_π V^π(s).

Planning - Policy Evaluation. Discounted infinite horizon (Bellman Eq.): V^π(s) = E_{s' ~ π(s)} [ R(s, π(s)) + γ V^π(s') ]. Rewriting the expectation: V^π(s) = E[R(s, π(s))] + γ Σ_{s'} δ(s, π(s), s') V^π(s'). This is a linear system of equations.

Algorithms - Policy Evaluation Example. A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random, R(s_i, a) = i. [Figure: four states s_0, s_1, s_2, s_3 arranged in a cycle.] V^π(s_0) = 0 + γ [π(s_0, +1) V^π(s_1) + π(s_0, -1) V^π(s_3)].

Algorithms - Policy Evaluation Example (continued). Same MDP: A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random, R(s_i, a) = i. Solving the system gives V^π(s_0) = 5/3, V^π(s_1) = 7/3, V^π(s_2) = 11/3, V^π(s_3) = 13/3; e.g., V^π(s_0) = 0 + (V^π(s_1) + V^π(s_3))/4.
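
These values can be checked by solving the 4-by-4 linear system V = R + γ P V for the random policy directly; a small sketch of mine using numpy:

```python
import numpy as np

# Policy evaluation for the 4-state cycle example: random policy over {+1, -1},
# gamma = 1/2, delta(s_i, a) = s_{(i + a) mod 4}, R(s_i, a) = i.
gamma, n = 0.5, 4
P = np.zeros((n, n))
for i in range(n):
    P[i, (i + 1) % n] = 0.5    # action +1 taken with probability 1/2
    P[i, (i - 1) % n] = 0.5    # action -1 taken with probability 1/2
R = np.arange(n, dtype=float)  # expected immediate reward R(s_i) = i
V = np.linalg.solve(np.eye(n) - gamma * P, R)
print(V)  # approximately [5/3, 7/3, 11/3, 13/3]
```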

Algorithms - Optimal Control. State-Action Value function: Q^π(s,a) = E[R(s,a)] + γ E_{s' ~ (s,a)} [V^π(s')]. Note V^π(s) = Q^π(s, π(s)) for a deterministic policy π.

Algorithms - Optimal Control Example. Same MDP: A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random, R(s_i, a) = i. Q^π(s_0, +1) = 0 + γ V^π(s_1) = 7/6; Q^π(s_0, -1) = 0 + γ V^π(s_3) = 13/6.

Algorithms - Optimal Control. CLAIM: A policy π is optimal if and only if at each state s: V^π(s) = max_a {Q^π(s,a)} (Bellman Eq.). PROOF: Assume there is a state s and action a such that V^π(s) < Q^π(s,a). Then the strategy of performing a at state s (the first time) is better than π. This is true each time we visit s, so the policy that always performs action a at state s is better than π.

Algorithms - Optimal Control Example. Same MDP: A = {+1, -1}, γ = 1/2, δ(s_i, a) = s_{i+a}, π random, R(s_i, a) = i. Changing the policy using the state-action value function.

MDP - computing the optimal policy: 1. Linear Programming. 2. Value Iteration method: V_{i+1}(s) = max_a { R(s,a) + γ Σ_{s'} δ(s, a, s') V_i(s') }. 3. Policy Iteration method: π_{i+1}(s) = arg max_a { Q^{π_i}(s,a) }.
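
A compact sketch of the value-iteration update above, written against the illustrative MDP container from earlier (the stopping rule and iteration cap are my own choices):

```python
def value_iteration(mdp, max_iters=1000, tol=1e-8):
    """Iterate V_{i+1}(s) = max_a { R(s,a) + gamma * sum_{s'} delta(s,a,s') * V_i(s') }."""
    V = {s: 0.0 for s in mdp.states}
    for _ in range(max_iters):
        V_new = {
            s: max(
                mdp.reward[(s, a)]
                + mdp.gamma * sum(p * V[s2] for s2, p in mdp.delta[(s, a)].items())
                for a in mdp.actions
            )
            for s in mdp.states
        }
        converged = max(abs(V_new[s] - V[s]) for s in mdp.states) < tol
        V = V_new
        if converged:
            break
    return V
```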

Convergence: Value Iteration. Distance of V_i from the optimal V* (in the L_∞ norm): Q_i(s,a) = R(s,a) + γ Σ_{s'} δ(s,a,s') V_i(s') = Q*(s,a) + γ Σ_{s'} δ(s,a,s') [V_i(s') - V*(s')] ≤ Q*(s,a) + γ ||V_i - V*||, hence ||V_{i+1} - V*|| ≤ γ ||V_i - V*||. Convergence rate: 1/(1-γ) - ONLY pseudo-polynomial.

Convergence: Policy Iteration. The Policy Iteration algorithm: compute Q^π(s,a); set π(s) = arg max_a Q^π(s,a); reiterate. Convergence: the policy can only improve, V_{t+1}(s) ≥ V_t(s). Fewer iterations than Value Iteration, but more expensive iterations. OPEN: how many iterations does it require?! LB: linear; UB: 2^n / n (2-action MDPs) [MS].
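
In the same illustrative encoding, the loop above (evaluate Q^π, greedify, reiterate) can be sketched as follows; using a fixed number of evaluation sweeps is my simplification of exact policy evaluation.

```python
def policy_iteration(mdp, eval_sweeps=500):
    """Evaluate Q^pi, set pi(s) = argmax_a Q^pi(s, a), and repeat until pi is stable."""
    pi = {s: mdp.actions[0] for s in mdp.states}      # arbitrary initial policy
    while True:
        # Policy evaluation by repeated Bellman backups for the fixed policy pi.
        V = {s: 0.0 for s in mdp.states}
        for _ in range(eval_sweeps):
            V = {
                s: mdp.reward[(s, pi[s])]
                + mdp.gamma * sum(p * V[s2] for s2, p in mdp.delta[(s, pi[s])].items())
                for s in mdp.states
            }

        # Greedy improvement: pi(s) = argmax_a Q^pi(s, a).
        def Q(s, a):
            return mdp.reward[(s, a)] + mdp.gamma * sum(
                p * V[s2] for s2, p in mdp.delta[(s, a)].items()
            )

        pi_new = {s: max(mdp.actions, key=lambda a, s=s: Q(s, a)) for s in mdp.states}
        if pi_new == pi:
            return pi, V
        pi = pi_new
```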

Outline. Done: Goal of Reinforcement Learning; Mathematical Model (MDP); Planning - value iteration, policy iteration. Now: Learning Algorithms - model based, model free.

Planning versus Learning. Tightly coupled in Reinforcement Learning. Goal: maximize return while learning.

Example - Elevator Control. Learning (alone): model the arrival process well. Planning (alone): given an arrival model, build a schedule. Real objective: construct a schedule while updating the model.

Learning Algorithms. Given access only to the actions performed: 1. policy evaluation; 2. control - find an optimal policy. Two approaches: 1. Model based (Dynamic Programming). 2. Model free (Q-Learning).

Learning - Model Based. Estimate the model from the observations (both transition probabilities and rewards). Use the estimated model as the true model, and find an optimal policy for it. If the estimated model is good, the resulting policy should be good as well.

Learning - Model Based: off policy. Let the policy run for a long time (what is long?!), assuming some exploration. Build an observed model: transition probabilities and rewards. Use the observed model to estimate the value of the policy.
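
A minimal sketch of building the observed model from logged transitions; the counting scheme is the obvious empirical estimate, and the variable names are mine, not from the slides.

```python
from collections import defaultdict

# Sketch: estimate transition probabilities and rewards from observed
# (s, a, r, s') tuples, as in the "build an observed model" step above.
def estimate_model(transitions):
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    reward_sums = defaultdict(float)                 # (s, a) -> sum of observed rewards
    visits = defaultdict(int)                        # (s, a) -> number of visits
    for s, a, r, s2 in transitions:
        counts[(s, a)][s2] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1
    delta_hat = {
        sa: {s2: c / visits[sa] for s2, c in succ.items()}
        for sa, succ in counts.items()
    }
    reward_hat = {sa: reward_sums[sa] / visits[sa] for sa in visits}
    return delta_hat, reward_hat
```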

Learning - Model Based: sample size. Sample size (optimal policy): Naive: O(|S|^2 |A| log(|S| |A|)) samples (approximate each transition δ(s,a,s') well). Better: O(|S| |A| log(|S| |A|)) samples (sufficient to approximate an optimal policy). [KS, NIPS 98]

Learning - Model Based: on policy. The learner has control over the actions. The immediate goal is to learn a model. As before: build an observed model (transition probabilities and rewards) and use it to estimate the value of the policy. Accelerating the learning: how do we reach new places?!

Learning - Model Based: on policy. [Figure: well-sampled nodes versus relatively unknown nodes.]

Learning - Model Based: on policy. [Figure: well-sampled nodes versus relatively unknown nodes; attaching a HIGH REWARD to the unknown nodes turns exploration into planning in the new MDP.]

Learning: Policy improvement. Assume that we can perform the following: given a policy π, estimate the V and Q functions of π. Then we can run policy improvement: π' = Greedy(Q). The process converges if the estimates are accurate.