Module 8: Linear Programming. CS 886 Sequential Decision Making and Reinforcement Learning, University of Waterloo

Policy Optimization

Value iteration and policy iteration are iterative algorithms that implicitly solve an optimization problem. Can we explicitly write down this optimization problem? Yes: it can be formulated as a linear program.

Primal Linear Program

primalLP(MDP):
    solve $\min_V \sum_s w(s)\, V(s)$
    subject to $V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a)\, V(s') \quad \forall s, a$
    return $V$

Variables: $V(s)$ for all $s$
Objective: $\min_V \sum_s w(s)\, V(s)$, where $w(s)$ is a weight assigned to state $s$
Constraints: $V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a)\, V(s')$ for all $s, a$
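As a concrete sketch (not on the original slide), the primal LP above can be set up and solved with scipy.optimize.linprog. The two-state, two-action MDP below is made up purely for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action MDP; all numbers are invented for illustration.
gamma = 0.9
P = np.array([[[0.8, 0.2],      # P[a, s, s'] = Pr(s' | s, a)
               [0.3, 0.7]],
              [[0.1, 0.9],
               [0.6, 0.4]]])
R = np.array([[1.0, 0.0],       # R[s, a]
              [0.5, 2.0]])
S, A = R.shape
w = np.ones(S)                  # positive weight on every state

# Constraints: V(s) >= R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s')  for all s, a.
# Rewritten for linprog (A_ub @ V <= b_ub): -(I - gamma * P_a) V <= -R_a.
A_ub = np.vstack([-(np.eye(S) - gamma * P[a]) for a in range(A)])
b_ub = np.concatenate([-R[:, a] for a in range(A)])

res = linprog(c=w, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
V_star = res.x
print("Optimal values V*(s):", V_star)
```

Because linprog expects upper-bound constraints, each $V(s) \ge \dots$ constraint is negated. Any strictly positive weight vector $w$ yields the same optimal $V$ here, since the LP is solved exactly.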

Objective

Why do we minimize a weighted combination of the values? Shouldn't we maximize value?

Value functions $V$ that satisfy the constraints are upper bounds on the optimal value function $V^*$: $V(s) \ge V^*(s) \;\forall s$. Minimizing ensures that we choose the lowest upper bound: $\min_V V(s) = V^*(s) \;\forall s$.

Upper Bound

Theorem: Value functions $V$ that satisfy $V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a)\, V(s')$ for all $s, a$ are upper bounds on the optimal value function $V^*$: $V(s) \ge V^*(s) \;\forall s$.

Proof: Since $V(s) \ge R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a)\, V(s')$ for all $s, a$, the inequality holds in particular for the maximizing action, so $V(s) \ge \max_a \left[ R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a)\, V(s') \right] = H^*(V)(s) \;\forall s$. Furthermore, because $H^*$ is monotone, $V \ge H^*(V) \ge H^*(H^*(V)) \ge \cdots \ge (H^*)^\infty(V) = V^*$.
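Spelling out the last step of the proof (this expansion is not on the slide; it is the standard monotonicity-plus-contraction argument):

```latex
\begin{align*}
V \ge H^*(V)
  &\;\Rightarrow\; H^*(V) \ge H^*(H^*(V))
     && \text{(monotonicity: } U \ge W \Rightarrow H^*(U) \ge H^*(W)\text{)}\\
  &\;\Rightarrow\; V \ge (H^*)^n(V) \quad \forall n \ge 1
     && \text{(chain the inequalities)}\\
  &\;\Rightarrow\; V \ge \lim_{n \to \infty} (H^*)^n(V) = V^*
     && (H^* \text{ is a } \gamma\text{-contraction with unique fixed point } V^*)
\end{align*}
```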

Weight Function (Initial State)

How do we choose the weight function? If the policy always starts in the same initial state $s_0$, then set

$w(s) = \begin{cases} 1 & \text{if } s = s_0 \\ 0 & \text{otherwise} \end{cases}$

This ensures that $\sum_s w(s)\, V(s) = V^*(s_0)$.

Weight Function (Any State)

If the policy may start in any state, then assign a positive weight to each state, i.e. $w(s) > 0 \;\forall s$. This ensures that $V$ is minimized at each $s$ and therefore $V(s) = V^*(s) \;\forall s$.

The magnitude of the weights does not matter when the LP is solved exactly. We will revisit the choice of $w(s)$ when we discuss approximate linear programming.

Optimal Policy

The linear program finds $V^*$. We can extract $\pi^*$ from $V^*$ as usual:

$\pi^*(s) \in \operatorname{argmax}_a \; R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a)\, V^*(s')$

Or check the active constraints: for each $s$, find an action $a$ whose constraint holds with equality,

$V^*(s) = R(s,a) + \gamma \sum_{s'} \Pr(s'|s,a)\, V^*(s')$,

while $V^*(s) \ge R(s,a') + \gamma \sum_{s'} \Pr(s'|s,a')\, V^*(s')$ for every other action $a'$, and set $\pi^*(s) \leftarrow a$.
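A minimal sketch of the argmax extraction (the helper name is mine, not from the slides; it can be applied to the V returned by the primal-LP sketch above):

```python
import numpy as np

def extract_policy(R, P, V, gamma):
    """Greedy policy: pi(s) = argmax_a [ R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s') ].

    R: (S, A) rewards, P: (A, S, S) transition probabilities, V: (S,) values.
    """
    Q = R + gamma * (P @ V).T      # Q[s, a] = R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s')
    return Q.argmax(axis=1)        # one deterministic action per state

# Usage (continuing the primal-LP sketch above):
#   pi_star = extract_policy(R, P, V_star, gamma)
```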

Direct Policy Optimization

The optimal solution of the primal linear program is $V^*$, but we still have to extract $\pi^*$. Could we directly optimize $\pi^*$? Yes, by considering the dual linear program.

Dual Linear Program

dualLP(MDP):
    solve $\max_y \sum_{s,a} y(s,a)\, R(s,a)$
    subject to $\sum_{a'} y(s',a') = w(s') + \gamma \sum_{s,a} \Pr(s'|s,a)\, y(s,a) \quad \forall s'$
    and $y(s,a) \ge 0 \quad \forall s, a$
    let $\pi(a|s) = \Pr(a|s) = y(s,a) / \sum_{a'} y(s,a')$
    return $\pi$

Variables: $y(s,a)$ for all $s, a$ — the discounted frequency of each $(s,a)$-pair (proportional to $\pi$)
Objective: $\max_y \sum_{s,a} y(s,a)\, R(s,a)$
Constraints: $\sum_{a'} y(s',a') = w(s') + \gamma \sum_{s,a} \Pr(s'|s,a)\, y(s,a) \quad \forall s'$, and $y(s,a) \ge 0$
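Here is a minimal sketch of the dual LP for the same invented MDP, again with scipy.optimize.linprog: the equality constraints are the flow equations above, and the policy is recovered by normalizing $y$.

```python
import numpy as np
from scipy.optimize import linprog

# Same invented 2-state, 2-action MDP as in the primal-LP sketch.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # P[a, s, s'] = Pr(s' | s, a)
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],                  # R[s, a]
              [0.5, 2.0]])
S, A = R.shape
w = np.ones(S)

# Variables y(s, a), flattened so that index k = s * A + a.
c = -R.reshape(-1)                         # maximize sum_{s,a} y(s,a) R(s,a)

# One equality constraint per state s':
#   sum_a y(s', a) - gamma * sum_{s,a} Pr(s'|s,a) y(s, a) = w(s')
A_eq = np.zeros((S, S * A))
for s_next in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[s_next, s * A + a] = (s == s_next) - gamma * P[a, s, s_next]
b_eq = w

res = linprog(c=c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (S * A))
y = res.x.reshape(S, A)
pi = y / y.sum(axis=1, keepdims=True)      # pi(a|s) = y(s,a) / sum_a' y(s,a')
print("State-action frequencies y:\n", y)
print("Recovered policy pi(a|s):\n", pi)
print("Dual objective:", -res.fun)
```

With $w(s) > 0$ for every state, each row sum of $y$ is strictly positive, so the normalization is well defined; the printed dual objective matches the primal objective, as duality theory guarantees (see below).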

Duality

For every primal linear program of the form

$\min_x c^\top x \quad \text{s.t.} \quad Ax \ge b$

there is an equivalent dual linear program of the form

$\max_y b^\top y \quad \text{s.t.} \quad A^\top y = c, \; y \ge 0$

Interpretation for the MDP LP: $c = w$, $x = V$, $y \sim \pi$, $A = [I - \gamma T_a]_a$ (one block of rows per action $a$, where $T_a$ is the transition matrix for $a$), and $b = [R_a]_a$. By strong duality, $\min_x c^\top x = \max_y b^\top y$.
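Written out with the stacked blocks (an expansion added here for clarity, not on the slide):

```latex
\min_{V} \; w^\top V \quad \text{s.t.} \quad
\underbrace{\begin{bmatrix} I - \gamma T_{a_1} \\ \vdots \\ I - \gamma T_{a_k} \end{bmatrix}}_{A} V
\;\ge\;
\underbrace{\begin{bmatrix} R_{a_1} \\ \vdots \\ R_{a_k} \end{bmatrix}}_{b}
\qquad \text{with dual} \qquad
\max_{y \ge 0} \; b^\top y \quad \text{s.t.} \quad A^\top y = w
```

where $T_a(s,s') = \Pr(s'|s,a)$ and $R_a(s) = R(s,a)$; the dual variable attached to the constraint row for the pair $(s,a)$ is exactly $y(s,a)$.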

State Frequency

Let $f(s)$ be the (discounted) frequency of $s$ under policy $\pi$.

0 steps: $f_0(s) = w(s)$
1 step: $f_1(s') = w(s') + \gamma \sum_s \Pr(s'|s,\pi(s))\, w(s)$
2 steps: $f_2(s'') = w(s'') + \gamma \sum_{s'} \Pr(s''|s',\pi(s'))\, w(s') + \gamma^2 \sum_{s,s'} \Pr(s''|s',\pi(s'))\, \Pr(s'|s,\pi(s))\, w(s)$
n steps: $f_n(s_n) = w(s_n) + \gamma \sum_{s_{n-1}} \Pr(s_n|s_{n-1},\pi(s_{n-1}))\, f_{n-1}(s_{n-1})$
∞ steps: $f(s') = w(s') + \gamma \sum_s \Pr(s'|s,\pi(s))\, f(s)$
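The infinite-horizon recursion is a linear system, $f = w + \gamma P_\pi^\top f$, so $f = (I - \gamma P_\pi^\top)^{-1} w$. A small sketch with an invented MDP and a hypothetical policy:

```python
import numpy as np

# Discounted state frequency under a fixed deterministic policy pi:
#   f(s') = w(s') + gamma * sum_s Pr(s'|s, pi(s)) f(s)
# In matrix form f = w + gamma * P_pi^T f, hence f = (I - gamma * P_pi^T)^{-1} w.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],    # P[a, s, s'], invented numbers
              [[0.1, 0.9], [0.6, 0.4]]])
w = np.ones(2)
pi = np.array([0, 1])                      # hypothetical policy: one action per state

S = len(w)
P_pi = np.array([P[pi[s], s] for s in range(S)])   # P_pi[s, s'] = Pr(s'|s, pi(s))
f = np.linalg.solve(np.eye(S) - gamma * P_pi.T, w)
print("Discounted state frequencies f(s):", f)
```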

State-Action Frequency

Let $y(s,a)$ be the state-action frequency: $y(s,a) = \pi(a|s)\, f(s)$, where $\pi(a|s) = \Pr(a|s)$ is a stochastic policy.

Then the following equations are equivalent:

$f(s') = w(s') + \gamma \sum_s \Pr(s'|s,\pi(s))\, f(s)$
$\sum_{a'} \pi(a'|s')\, f_\pi(s') = w(s') + \gamma \sum_{s,a} \Pr(s'|s,a)\, \pi(a|s)\, f_\pi(s)$  (since $\sum_{a'} \pi(a'|s') = 1$)
$\sum_{a'} y(s',a') = w(s') + \gamma \sum_{s,a} \Pr(s'|s,a)\, y(s,a)$  — the constraint of the dual LP

Policy

We can recover $\pi$ from $y$:

$y(s,a) = \pi(a|s)\, f(s)$  (by definition)
$\pi(a|s) = y(s,a) / f(s)$  (isolate $\pi$)
$\pi(a|s) = y(s,a) / \sum_{a'} y(s,a')$  (since $f(s) = \sum_{a'} y(s,a')$)

$\pi$ may be stochastic. Actions with non-zero probability are necessarily optimal.
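A small helper for this normalization (the function name and the guard for unreached states are my own additions; when $w(s) > 0$ for all $s$, every row sum is strictly positive and the guard never triggers):

```python
import numpy as np

def policy_from_frequencies(y, tol=1e-9):
    """Recover a stochastic policy pi(a|s) = y(s,a) / sum_a' y(s,a')
    from state-action frequencies y of shape (S, A)."""
    totals = y.sum(axis=1, keepdims=True)        # f(s) = sum_a y(s, a)
    pi = np.zeros_like(y)
    reached = totals[:, 0] > tol
    pi[reached] = y[reached] / totals[reached]   # normalize reached states
    pi[~reached, 0] = 1.0                        # arbitrary action for unreached states
    return pi

# Example with made-up frequencies (illustrative only):
y_example = np.array([[3.2, 0.0],
                      [0.0, 6.8]])
print(policy_from_frequencies(y_example))
```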

Objective

Duality theory guarantees that the objectives of the primal and dual LPs are equal:

$\max_y \sum_{s,a} y(s,a)\, R(s,a) = \min_V \sum_s w(s)\, V(s)$

This means that $\sum_{s,a} y(s,a)\, R(s,a)$ implicitly measures the value of the optimal policy.

Solution Algorithms

Two broad classes of algorithms:
Simplex (corner search)
Interior point methods (iterative methods that move through the interior of the feasible region)

Polynomial complexity (solving an MDP is in P, not NP-hard).

Many packages for linear programming, e.g. CPLEX (robust, efficient, and free for academia).