The Winding Path to RL


Markov Decision Processes (MDP). Ron Parr, CompSci 270, Department of Computer Science, Duke University. With thanks to Kris Hauser for some slides.

The Winding Path to RL: Decision Theory, a descriptive theory of optimal behavior; Markov Decision Processes, a mathematical/algorithmic realization of decision theory; Reinforcement Learning, the application of learning techniques to the challenges of MDPs with numerous or unknown parameters.

Covered today: Decision Theory review; MDPs; algorithms for MDPs: Value Determination and Optimal Policy Selection (Value Iteration, Policy Iteration). Swept under the rug today: utility of money (assumed :), how to determine costs/utilities, how to determine probabilities.

Playing a Game Show. Assume a series of questions of increasing difficulty and increasing payoff. Choice at each step: accept the accumulated earnings and quit, or continue and risk losing everything. (Who wants to be a millionaire?)

State Representation. Dollar amounts indicate the payoff for getting the question right: Start, the $100 question, then (if correct) the $1,000 question, the $10K question, and the $50K question, with probabilistic transitions on each attempt to answer; a wrong answer pays $0, and answering the $50K question correctly pays $61,100. Downward green arrows indicate the choice to exit the game with $100, $1,100, or $11,100; green indicates profit at exit from the game. N.B.: these exit transitions should actually correspond to states.

Making Optimal Decisions: work backward from the future to the present. Consider the $50,000 question. Suppose P(correct) = 1/10. V(stop) = $11,100; V(continue) = 0.9*$0 + 0.1*$61.1K = $6.11K. Optimal decision: stop.

Working Backward: with success probabilities 9/10, 3/4, 1/2, and 1/10 on the $100, $1K, $10K, and $50K questions (a wrong answer pays $0), the state values are V=$3,749, V=$4,166, V=$5,555, V=$11.1K. Red X's indicate the bad choices; the exit payoffs are $100, $1,100, $11,100.
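
As a concrete check on the backward pass, here is a minimal Python sketch of the same computation; the probabilities and payoffs are the ones read off the figure above, and the printed values come out within a fraction of a percent of the figures quoted on the slide.

```python
# Minimal sketch: backward induction for the game-show example above.
# Probabilities and payoffs are the ones shown on the slide.
p_correct = [9/10, 3/4, 1/2, 1/10]     # $100, $1K, $10K, $50K questions
exit_cash = [0, 100, 1_100, 11_100]    # what you keep if you quit before each question
jackpot   = 61_100                     # payoff for answering the $50K question correctly

V = [0.0] * 4
for s in reversed(range(4)):
    keep_playing = p_correct[s] * (jackpot if s == 3 else V[s + 1])   # a wrong answer pays $0
    V[s] = max(exit_cash[s], keep_playing)
    print(f"state {s}: quit = {exit_cash[s]}, continue = {keep_playing:.0f}, V = {V[s]:.0f}")
```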

Decision Theory Review: provides a theory of optimal decisions, the principle of maximizing utility. Easy for small, tree-structured spaces with known utilities and known probabilities. [Outline recap: Decision Theory done; MDPs next; then algorithms for MDPs: Value Determination, Optimal Policy Selection (Value Iteration, Policy Iteration).]

Dealing with Loops: suppose you can pay $1000 (from any losing state) to play again. The success probabilities stay 9/10, 3/4, 1/2, 1/10 and the exit payoffs stay $100, $1,100, $11,100, but losing now costs -$1000 instead of ending the game with $0.

From Policies to Linear Systems: suppose we always pay and play until we win. What is the value of following this policy? Losing returns us to the start; otherwise we continue:
V(s0) = 0.10(-1000 + V(s0)) + 0.90 V(s1)
V(s1) = 0.25(-1000 + V(s0)) + 0.75 V(s2)
V(s2) = 0.50(-1000 + V(s0)) + 0.50 V(s3)
V(s3) = 0.90(-1000 + V(s0)) + 0.10(61,100)
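
These four equations are linear in the unknowns V(s0), ..., V(s3), so they can be solved directly; a small numpy sketch with the coefficients exactly as written above:

```python
import numpy as np

# Sketch: solve the four linear equations above for the "always pay and play" policy.
# Row i encodes V(si) = P(lose)*(-1000 + V(s0)) + P(win)*(V(s_{i+1}) or the $61,100 jackpot).
p_lose = np.array([0.10, 0.25, 0.50, 0.90])
A = np.eye(4)
A[:, 0] -= p_lose                    # -P(lose) * V(s0): paying $1000 sends us back to the start
for i in range(3):
    A[i, i + 1] -= 1 - p_lose[i]     # -P(win) * V(s_{i+1})
b = -1000 * p_lose                   # the P(lose) * (-1000) constant
b[3] += 0.10 * 61_100                # winning the last question pays the jackpot

print(np.linalg.solve(A, b))         # ≈ [32470, 32581, 32952, 34433]
```

These match the "with replay" values quoted on the next slide.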

And the solution is: without the replay option, V=$3,749, V=$4,166, V=$5,555, V=$11.1K; always paying to replay gives V=$32.47K, V=$32.58K, V=$32.95K, V=$34.4K. Is this optimal? How do we find the optimal policy?

The MDP Framework. State space: S. Action space: A. Transition function: P. Reward function: R(s,a,s') or R(s,a) or R(s). Discount factor: γ. Policy: π(s) → a. Objective: maximize expected, discounted return (decision-theoretic optimal behavior).

Applications of MDPs (AI/Computer Science): robotics control (Koenig & Simmons, Thrun et al., Kaelbling et al.), air campaign planning (Meuleau et al.), elevator control (Barto & Crites), computation scheduling (Zilberstein et al.), control and automation (Moore et al.), spoken dialogue management (Singh et al.), cellular channel allocation (Singh & Bertsekas).

Applications of MDPs (Economics/Operations Research): fleet maintenance (Howard, Rust), road maintenance (Golabi et al.), packet retransmission (Feinberg et al.), nuclear plant management (Rothwell & Rust), debt collection strategies (Abe et al.), data center management (DeepMind).

Applications of MDPs (EE/Control): missile defense (Bertsekas et al.), inventory management (Van Roy et al.), football play selection (Patek & Bertsekas). Agriculture: herd management (Kristensen, Toft). Other: sports strategies, video games.

The Markov Assumption. Let S_t be a random variable for the state at time t. Then P(S_t | A_{t-1}, S_{t-1}, ..., A_0, S_0) = P(S_t | A_{t-1}, S_{t-1}). Markov is a special kind of conditional independence: the future is independent of the past given the current state.

Understanding Discounting. Mathematical motivation: keeps values bounded (what if I promise you $0.01 every day you visit me?). Economic motivation: discounts come from inflation; a promise of $1.00 in the future is worth $0.99 today. Probability of dying: suppose there is probability ε of dying at each decision interval, i.e., a transition with probability ε to a state with value 0; this is equivalent to a 1-ε discount factor.

Discounting in Practice. Often chosen unrealistically low: faster convergence of the algorithms we'll see later, but leads to slightly myopic policies. Can reformulate most algorithms for average reward: mathematically uglier, somewhat slower run time.

[Outline recap: now at Value Determination.] Value Determination: determine the value of each state under policy π. Bellman equation for a fixed policy:
V^π(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V^π(s')
Example: if the policy's action in S1 has reward R = 1 and moves to S2 with probability 0.4 and to S3 with probability 0.6, then V(S1) = 1 + γ(0.4 V(S2) + 0.6 V(S3)).

Matrix Form: collect the transition probabilities P(s' | s, π(s)) into a matrix P_π (one row per state s, one column per successor s') and the rewards into a vector R; the Bellman equations then become V = γ P_π V + R. This is a generalization of the game show example from earlier. How do we solve this system efficiently? Does it even have a solution?

Solving for Values: V = γ P_π V + R. For moderate numbers of states we can solve this system exactly:
V = (I - γ P_π)^{-1} R
Guaranteed invertible because γ P_π has spectral radius < 1.
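
A concrete sketch of the closed-form solve; the 3-state transition matrix and rewards below are made-up numbers for illustration (the first row echoes the 0.4/0.6 example above), not values from the slides.

```python
import numpy as np

# Sketch: exact value determination V = (I - gamma * P)^(-1) R for a fixed policy.
gamma = 0.9
P = np.array([[0.0, 0.4, 0.6],    # row s holds P(s' | s, pi(s)); illustrative numbers
              [0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0]])
R = np.array([1.0, 0.0, 2.0])     # R(s, pi(s))

V = np.linalg.solve(np.eye(3) - gamma * P, R)   # preferred over forming the inverse explicitly
print(V)
```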

Iteratively Solving for Values: V = γ P V + R. For larger numbers of states we can solve this system indirectly, iterating V_{i+1} = γ P V_i + R. Guaranteed convergent because γ P has spectral radius < 1.

Establishing Convergence: (1) eigenvalue analysis; (2) monotonicity: assume all values start pessimistic, then one value must always increase and we can never overestimate; easy to prove; (3) contraction analysis.
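
The same toy P and R as in the previous sketch, solved indirectly by repeated backups; the iterates approach the direct solution geometrically.

```python
import numpy as np

# Sketch: iterative value determination V_{i+1} = gamma * P * V_i + R (same toy numbers as above).
gamma = 0.9
P = np.array([[0.0, 0.4, 0.6],
              [0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0]])
R = np.array([1.0, 0.0, 2.0])

V = np.zeros(3)                   # pessimistic start
for _ in range(500):              # gamma^500 is negligible, so this is converged
    V = R + gamma * P @ V
print(V)                          # matches np.linalg.solve(np.eye(3) - gamma * P, R)
```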

Contraction Analysis. Define the maximum norm ||V||_∞ = max_i |V[i]|. Consider two value functions V_a and V_b at the same iteration, and WLOG say ||V_a - V_b||_∞ = ε, so that V_a ≤ V_b + ε1, where ε1 is the vector with ε in every entry.

Contraction Analysis Cont'd. At the next iteration, V_b becomes R + γ P V_b, while for V_a:
R + γ P V_a ≤ R + γ P (V_b + ε1) = R + γ P V_b + γ P ε1 = R + γ P V_b + γ ε1
(distributing P over the sum; P ε1 = ε1 because each row of P sums to 1). Conclude: after one iteration the max norm distance between V_a and V_b is at most γε.
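
The same argument written out in standard notation, as a restatement of the steps above (assuming the row-stochastic P and the max norm defined on the slide):

```latex
\begin{align*}
\|V\|_\infty &= \max_i |V[i]|, \qquad
\|V_a - V_b\|_\infty = \varepsilon \;\Longrightarrow\; V_a \le V_b + \varepsilon\mathbf{1},\\
V_a' = R + \gamma P V_a
  &\le R + \gamma P V_b + \gamma P(\varepsilon\mathbf{1})
   = V_b' + \gamma\varepsilon\mathbf{1}
  && \text{(rows of $P$ sum to 1, so $P(\varepsilon\mathbf{1}) = \varepsilon\mathbf{1}$)},\\
\|V_a' - V_b'\|_\infty &\le \gamma\varepsilon .
\end{align*}
```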

Importance of Contraction: any two value functions get closer with each iteration. The true value function V* is a fixed point (its value doesn't change under the iteration). The max norm distance from V* decreases dramatically quickly with iterations: if ||V_0 - V*||_∞ = ε then ||V_n - V*||_∞ ≤ γ^n ε. [Outline recap: now at Optimal Policy Selection, Value Iteration.]

Finding Good Policies: suppose an expert told you the true value of each state, say V(S2) = 10 and V(S3) = 5, and from S1 action 1 reaches S2 or S3 with probability 0.5 each while action 2 reaches S2 with probability 0.7 and S3 with probability 0.3; then simply pick the action with the higher expected value.

Improving Policies: how do we get the optimal policy? If we knew the values under the optimal policy, then we could just take the optimal action in every state. How do we define these values? By a fixed point equation with a choice in it (the Bellman equation):
V*(s) = max_a [ R(s,a) + γ Σ_{s'} P(s' | s, a) V*(s') ]
This is the decision-theoretic optimal choice given V*. If we know V*, picking the optimal action is easy; if we know the optimal actions, computing V* is easy. How do we compute both at the same time?

Value Iteration: we can't solve the system directly with a max in the equation; can we solve it by iteration?
V_{i+1}(s) = max_a [ R(s,a) + γ Σ_{s'} P(s' | s, a) V_i(s') ]
This is called value iteration, or simply successive approximation. It is the same as value determination, except that we can change actions. Convergence: we can't do eigenvalue analysis (not linear), but it is still monotonic, still a contraction in max norm (exercise), and converges quickly.

Robot Navigation Example: the robot lives in a world described by a 4x3 grid of squares (columns 1-4, rows 1-3), with square (2,2) occupied by an obstacle. A state is defined by the square in which the robot is located, e.g. (1,1) in the figure: 11 states.
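
Before the grid-world example, here is a minimal value-iteration sketch on the earlier game show with the $1000 pay-to-replay option; the numbers are the ones from the game-show slides, and the values it converges to coincide with the always-pay-and-play policy values computed earlier, which indicates that always playing is optimal under these payoffs.

```python
# Sketch: value iteration on the game-show MDP with the $1000 pay-to-replay option.
# Actions in each state: quit (keep the exit cash) or play (answer the question).
p_correct = [9/10, 3/4, 1/2, 1/10]
exit_cash = [0, 100, 1_100, 11_100]
jackpot   = 61_100

V = [0.0] * 4
while True:
    V_new = []
    for s in range(4):
        win  = jackpot if s == 3 else V[s + 1]
        play = p_correct[s] * win + (1 - p_correct[s]) * (-1000 + V[0])  # lose: pay $1000, restart
        V_new.append(max(exit_cash[s], play))
    if max(abs(a - b) for a, b in zip(V_new, V)) < 1e-6:
        break
    V = V_new
print([round(v) for v in V_new])    # ≈ [32470, 32581, 32952, 34433]: playing always beats quitting
```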

Action (Transition) Model: in each state, the robot's possible actions are {U, D, R, L}. For each action, with probability 0.8 the robot does the right thing (moves up, down, right, or left by one square); with probability 0.2 it moves in a direction perpendicular to the intended one (0.1 each way); if the robot can't move, it stays in the same square. [This model satisfies the Markov condition.] For example, from (1,1), U brings the robot to (1,2) with probability 0.8, to (2,1) with probability 0.1, and leaves it at (1,1) with probability 0.1; L brings the robot to (1,1) with probability 0.8 + 0.1 = 0.9 and to (1,2) with probability 0.1.
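
A small Python sketch of this motion model; coordinates are (column, row) with the obstacle at (2,2), as in the grid above.

```python
# Sketch of the noisy motion model above: 0.8 intended direction, 0.1 each perpendicular;
# bumping into the obstacle or a wall means staying in the same square.
MOVES = {'U': (0, 1), 'D': (0, -1), 'R': (1, 0), 'L': (-1, 0)}
PERP  = {'U': 'LR', 'D': 'LR', 'L': 'UD', 'R': 'UD'}
OBSTACLE, COLS, ROWS = (2, 2), 4, 3

def step(square, move):
    nxt = (square[0] + MOVES[move][0], square[1] + MOVES[move][1])
    if nxt == OBSTACLE or not (1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS):
        return square                          # blocked: stay put
    return nxt

def transition(square, action):
    """Return {next_square: probability} for taking `action` in `square`."""
    probs = {}
    for move, p in [(action, 0.8), (PERP[action][0], 0.1), (PERP[action][1], 0.1)]:
        s2 = step(square, move)
        probs[s2] = probs.get(s2, 0.0) + p
    return probs

print(transition((1, 1), 'U'))   # {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}
print(transition((1, 1), 'L'))   # {(1, 1): 0.9, (1, 2): 0.1}
```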

Terminal States, Rewards, and Costs. Two terminal states: (4,2) and (4,3). Rewards: R(4,3) = +1 [the robot finds gold], R(4,2) = -1 [the robot gets trapped in quicksand], and R(s) = -0.04 in all other states. This example (from the textbook) assumes no discounting (γ = 1). Discussion: is this a good modeling decision?

(Stationary) Policy. A stationary policy is a complete map π: state → action. For each non-terminal state it recommends an action, independent of when and how the state is reached. Under the Markov and infinite horizon assumptions, the optimal policy π* is necessarily a stationary policy. [The best action in a state does not depend on the past.]

(Stationary) Policy, continued: the optimal policy tries to avoid the dangerous state (4,2). Finding π* is called an observable Markov Decision Problem (MDP).

Optimal Policies for Various R(s): the optimal policy depends on the per-step reward; the slides show the different arrow maps obtained for R(s) = -0.04, for a strongly negative R(s), for a slightly negative R(s), and for R(s) > 0.

Bellman Equation. If s is terminal: V(s) = R(s). If s is non-terminal:
V(s) = R(s) + max_a Σ_{s'} P(s' | s, a) V(s')   [Bellman equation]
π*(s) = argmax_a Σ_{s'} P(s' | s, a) V(s')
where the max and argmax are over the actions applicable in s. The equations are non-linear: the utility of s depends on the utilities of other states s' (possibly including s), and vice versa.

Value Iteration Applied: 1. Initialize the utility of each non-terminal state to V_0(s) = 0. 2. For t = 0, 1, 2, ... do, for each non-terminal state,
V_{t+1}(s) = R(s) + max_a Σ_{s'} P(s' | s, a) V_t(s')
Starting from all zeros (with +1 and -1 at the terminal squares), the utilities on the 4x3 grid converge to 0.81, 0.87, 0.92, +1 in the top row; 0.76, (obstacle), 0.66, -1 in the middle row; 0.71, 0.66, 0.61, 0.39 in the bottom row.
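
A self-contained Python sketch of value iteration on this 4x3 world (reward -0.04 per non-terminal step, no discounting, and the 0.8/0.1/0.1 motion model described earlier); the printed utilities match the grid above when rounded to two decimals.

```python
# Sketch: value iteration on the 4x3 grid world (R = -0.04, gamma = 1, noise 0.8/0.1/0.1).
MOVES = {'U': (0, 1), 'D': (0, -1), 'R': (1, 0), 'L': (-1, 0)}
PERP  = {'U': 'LR', 'D': 'LR', 'L': 'UD', 'R': 'UD'}
TERMINAL = {(4, 3): 1.0, (4, 2): -1.0}
SQUARES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]

def step(s, m):
    nxt = (s[0] + MOVES[m][0], s[1] + MOVES[m][1])
    return nxt if nxt in SQUARES else s                    # obstacle / wall: stay put

def backup(s, a, V):
    # expected utility of the successor square when trying action a in square s
    return sum(p * (TERMINAL[s2] if s2 in TERMINAL else V[s2])
               for m, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]
               for s2 in [step(s, m)])

V = {s: 0.0 for s in SQUARES if s not in TERMINAL}
for _ in range(100):                                       # plenty of sweeps to converge here
    V = {s: -0.04 + max(backup(s, a, V) for a in MOVES) for s in V}

for y in (3, 2, 1):
    print([round(V[(x, y)], 2) if (x, y) in V else TERMINAL.get((x, y))
           for x in range(1, 5)])
# prints roughly: [0.81, 0.87, 0.92, 1.0], [0.76, None, 0.66, -1.0], [0.71, 0.66, 0.61, 0.39]
```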

State Utilities. The utility of a state s is the maximal expected amount of reward that the robot will collect from s and future states by executing some action in each encountered state, until it reaches a terminal state (infinite horizon). Under the Markov and infinite horizon assumptions, the utility of s is independent of when and how s is reached (it only depends on the possible sequences of states after s, not on the possible sequences before s).

Convergence of Value Iteration: with these updates the utilities settle at the values shown above (0.81, 0.87, 0.92, +1; 0.76, 0.66, -1; 0.71, 0.66, 0.61, 0.39).

Properties of Value Iteration. VI converges to V* (the distance from V* shrinks by a factor of γ each iteration). It converges to the optimal policy. Why? Because we figure out V*, and the optimal policy is the argmax with respect to it. The optimal policy is stationary, i.e. Markovian (it depends only on the current state). Why? Because we are summing utilities. (Thought experiment: suppose you think it is better to change actions the second time you visit a state; why didn't you just take the best action the first time?) [Outline recap: now at Policy Iteration.]

Greedy Policy Construction. Let's name the action that looks best with respect to V:
π_V(s) = argmax_a [ R(s,a) + γ Σ_{s'} P(s' | s, a) V(s') ]
i.e. π_V = greedy(V); the bracketed term is an expectation over next-state values.

Bootstrapping: Policy Iteration. Idea: greedy selection is useful even with a suboptimal V. Guess an initial policy π_0; repeat: V_i = value of acting on π_i (solve the linear system), π_{i+1} = greedy(V_i); until the policy doesn't change. Guaranteed to find the optimal policy, and usually takes a very small number of iterations; computing the value function is the expensive part.
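
A compact numpy sketch of this loop on a made-up 3-state, 2-action MDP (action 0 reuses the toy chain from the earlier sketches; action 1 is another invented option, not from the slides):

```python
import numpy as np

# Sketch of policy iteration.  P[a] is the transition matrix of action a, R[a] its reward vector.
gamma = 0.9
P = np.array([[[0.0, 0.4, 0.6], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]],   # action 0
              [[0.9, 0.1, 0.0], [0.2, 0.0, 0.8], [0.1, 0.9, 0.0]]])  # action 1
R = np.array([[1.0, 0.0, 2.0],                                        # R(s, a=0)
              [0.5, 1.0, 0.0]])                                       # R(s, a=1)
n = 3

pi = np.zeros(n, dtype=int)                                # arbitrary initial policy
while True:
    P_pi = P[pi, np.arange(n)]                             # rows P(. | s, pi(s))
    R_pi = R[pi, np.arange(n)]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)    # exact value of pi (linear system)
    Q = R + gamma * P @ V                                  # Q[a, s] for every action
    pi_new = Q.argmax(axis=0)                              # greedy(V)
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
print(pi, V)
```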

Comparing VI and PI. VI: values change at every step, the policy may change before the exact value of the current policy is computed, many cheap iterations. PI: alternates policy and value updates, solves for the value of each policy exactly, fewer but slower iterations (need to invert a matrix). Convergence: both are contractions in max norm; PI is shockingly fast in practice.

Computational Complexity. VI and PI are both contraction mappings with rate γ (we didn't prove this for PI in class). VI costs less per iteration. For n states and a actions, PI tends to take O(n) iterations in practice; recent results indicate roughly O(na/(1-γ)) iterations in the worst case. Interesting aside: the biggest insight into PI came about 50 years after the algorithm was introduced.

A Unified View of Value Iteration and Policy Iteration. Notation: the update for a fixed policy π defines the T^π operator,
(T^π V)(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V(s')
and the update with policy improvement defines the T operator,
(T V)(s) = max_a [ R(s,a) + γ Σ_{s'} P(s' | s, a) V(s') ]
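
The two operators as code, using the same array conventions as the earlier sketches (P[a] is the transition matrix of action a, R[a] its reward vector); value determination iterates T_pi, value iteration iterates T, and policy iteration alternates an exact T_pi fixed-point solve with greedy improvement.

```python
import numpy as np

def T_pi(V, P, R, pi, gamma):
    """Fixed-policy backup: (T^pi V)(s) = R(s, pi(s)) + gamma * sum_s' P(s'|s, pi(s)) V(s')."""
    idx = np.arange(len(V))
    return R[pi, idx] + gamma * P[pi, idx] @ V

def T(V, P, R, gamma):
    """Optimality backup: (T V)(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]."""
    return (R + gamma * P @ V).max(axis=0)
```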

Value Determination in this notation: for 0 steps, V_0 = R_π; for i steps, V_i = T^π V_{i-1} = (T^π)^i R_π; in the infinite horizon limit, lim_{i→∞} V_i = (T^π)^∞ R_π = (I - γ P_π)^{-1} R_π = V^π.

Value Iteration in this notation: for 0 steps, V_0 = R (if R depends on a, pick the a with the highest immediate reward); for i steps, V_i = T V_{i-1} = T^i R; in the infinite horizon limit, lim_{i→∞} V_i = T^∞ R = V* (and T V* = V*).

Modified Policy Iteration. Guess V_0 (usually just R) and set i = 1. Repeat until convergence: for j = 1 to n, V_i = T^π V_{i-1} and i = i + 1; then π = greedy(V_{i-1}). Special cases: n = 1 gives VI, n → ∞ gives PI.

MDP Limitations and Reinforcement Learning. MDPs operate at the level of states; states are atomic events, and we usually have exponentially (or infinitely) many of them. We also assume P and R are known. Machine learning to the rescue: infer P and R (implicitly or explicitly) from data, and generalize from a small number of states/policies.
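
A short sketch of modified policy iteration on the same made-up 3-state, 2-action MDP used in the policy-iteration sketch; n_backups = 1 behaves like value iteration, and a large n_backups approaches policy iteration.

```python
import numpy as np

# Sketch of modified policy iteration: greedy improvement followed by a few fixed-policy
# backups instead of an exact linear solve.  The MDP numbers are illustrative only.
gamma, n_backups = 0.9, 5
P = np.array([[[0.0, 0.4, 0.6], [0.0, 0.5, 0.5], [1.0, 0.0, 0.0]],
              [[0.9, 0.1, 0.0], [0.2, 0.0, 0.8], [0.1, 0.9, 0.0]]])
R = np.array([[1.0, 0.0, 2.0],
              [0.5, 1.0, 0.0]])
idx = np.arange(3)

V = R.max(axis=0)                                 # guess V_0: best immediate reward in each state
for _ in range(200):
    pi = (R + gamma * P @ V).argmax(axis=0)       # greedy policy w.r.t. the current V
    for _ in range(n_backups):                    # n fixed-policy backups (T^pi applied n times)
        V = R[pi, idx] + gamma * P[pi, idx] @ V
print(pi, V)
```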