Central-limit approach to risk-aware Markov decision processes

1 Central-limit approach to risk-aware Markov decision processes. Jia Yuan Yu, Concordia University, November 27, 2015. Joint work with Pengqian Yu and Huan Xu.

2 Inventory Management. 1 Look at current inventory level. 2 Decide how much to order. 3 Receive inventory, sell inventory (random demand). 4 New inventory level. (Figure credit: LeanCor.)

3 What is an MDP? 1 Single decision maker, 2 sequential decisions, 3 uncertainty. Time horizon 1, 2, .... States: X. Actions: A. Reward function r : X × A → R. Transition probabilities P : X × A × X → [0, 1].

4 Policies. Sequential decision problem. A policy is a mapping π : X → A. If the decision-maker sticks to policy π, we have a Markov chain x_1^π, x_2^π, ..., and we write r(x_1^π), r(x_2^π), ... instead of r(x_1, π(x_1)), r(x_2, π(x_2)), ....

5 Example. 1 Time horizon: from Black Friday to end of year. 2 State at day t: how many iPhones x_t in stock. 3 Action: how many iPhones a_t to order from the warehouse. 4 Transition: from one inventory level x_t to the next x_{t+1}, driven by random demand and a_t. 5 Cost (negative reward): c(x_t, a_t) includes holding cost, order cost, backorder cost, etc.
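To make this example concrete, here is a minimal sketch of one step of such an inventory MDP. The capacity, cost coefficients, and Poisson demand law are illustrative assumptions, not numbers from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not from the talk).
CAPACITY = 20          # maximum iPhones the store can hold
HOLD_COST = 1.0        # per unit left in stock overnight
ORDER_COST = 2.0       # per unit ordered
BACKORDER_COST = 4.0   # per unit of unmet demand

def step(x_t, a_t):
    """One day of the inventory MDP: order a_t, observe random demand, pay costs."""
    stocked = min(x_t + a_t, CAPACITY)   # receive the order, capped by capacity
    demand = rng.poisson(3)              # assumed Poisson demand
    x_next = max(stocked - demand, 0)    # next inventory level
    unmet = max(demand - stocked, 0)     # backordered demand
    cost = HOLD_COST * x_next + ORDER_COST * a_t + BACKORDER_COST * unmet
    return x_next, cost
```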

6 What is risk? Comparing probability distributions of alternatives.

7 What is risk? Many notions (risk measures): mappings of random variables (probability distributions) to real numbers; parametrized, e.g., by the curvature of u or the value of λ.
Expected utility (von Neumann and Morgenstern): ρ_u(X) = E u(X) with a concave u.
Mean-variance (Markowitz): ρ_λ(X) = E X − λ Var X.
Value-at-risk (VAR): ρ_λ(X) = F_X^{-1}(λ).
Expected shortfall (CVAR or AVAR): an integral of VAR.
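All four notions are easy to estimate from samples. Below is a small Python sketch using one common convention for each (sign and level conventions vary); the utility u, the level λ = 0.05, and the test distribution are assumptions made for illustration only.

```python
import numpy as np

def expected_utility(samples, u=np.sqrt):
    """rho_u(X) = E[u(X)] for a concave utility u (sqrt assumes X >= 0)."""
    return np.mean(u(samples))

def mean_variance(samples, lam=0.5):
    """rho_lam(X) = E[X] - lam * Var(X)."""
    return np.mean(samples) - lam * np.var(samples)

def value_at_risk(samples, lam=0.05):
    """rho_lam(X) = F_X^{-1}(lam), the lam-quantile of X."""
    return np.quantile(samples, lam)

def expected_shortfall(samples, lam=0.05):
    """Mean of the worst lam-tail: an average of VaR levels below lam."""
    var = np.quantile(samples, lam)
    return np.mean(samples[samples <= var])

x = np.random.default_rng(1).normal(loc=10.0, scale=3.0, size=100_000)
print(value_at_risk(x), expected_shortfall(x))
```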

8 VAR. (Figure illustrating value-at-risk on a distribution omitted.)

9 Let's cook up an example. 1 Chain of many stores. 2 The previous boss minimizes E[ Σ_{t=1}^T c(x_t^π) ] for each store. 3 You want to try a different approach: for a given λ, you want
min_π v   s.t.   P( Σ_{t=1}^T c(x_t^π) ≤ v ) ≥ λ.

10 Let's cook up an example. 1 For a given λ, you want
min_π F_S^{-1}(λ),   where S = Σ_{t=1}^T c(x_t^π)
and F_S^{-1}(λ) is the value-at-risk ρ_λ(S). 2 You can then say that a λ-fraction of stores spend less than v on inventory management costs (under an i.i.d. assumption).

11 What set of policies? 1 For a given λ, π* = arg min_{π ∈ Π} F_S^{-1}(λ). 2 π* is the training you give to each store's inventory manager. 3 You want to limit personnel training! 4 ⇒ Limit Π to stationary (Markov) policies π : X → A.

12 What's been done? MDPs without risk (risk-neutral). Risk in i.i.d. settings (in finance, operations research). Risk in MDPs: mean-variance [MT11], [PG13]; probabilistic objective, chance constraint [XM11]; dynamic (Markovian) risk measures [Rus10], [Oso12]; risk constraint (CVAR or AVAR) [CG14]; gradient approach [TCGM15].

13 Risk-neutral MDP. Finite horizon and infinite horizon:
max_{π ∈ Π} E[ Σ_{t=0}^{T−1} r(x_t^π) ],   max_{π ∈ Π} E[ Σ_{t=0}^∞ α^t r(x_t^π) ].
Solution via Bellman equation, value iteration, etc. Unlike E, our ρ is usually not linear.
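For reference, the risk-neutral finite-horizon problem is solved by backward induction. A minimal sketch, assuming a small tabular MDP given as arrays (the state/action encoding is an assumption for illustration):

```python
import numpy as np

def finite_horizon_value_iteration(P, r, T):
    """
    Risk-neutral backward induction.
    P[a] is an |X| x |X| transition matrix under action a; r is an |X| x |A| reward table.
    Returns the optimal value at stage 0 and a greedy decision rule for each stage.
    """
    n_states, n_actions = r.shape
    V = np.zeros(n_states)                       # value-to-go after the horizon
    policies = []
    for _ in range(T):
        Q = np.stack([r[:, a] + P[a] @ V for a in range(n_actions)], axis=1)
        policies.append(Q.argmax(axis=1))        # greedy action per state
        V = Q.max(axis=1)
    policies.reverse()                           # policies[t] is the stage-t decision rule
    return V, policies
```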

14 What makes risk measure ρ hard? Stationary policy π ⇒ Markov chain {x_t^π : x ∈ X, t = 1, 2, ...}.
Find the CDF of S_T^π = Σ_{t=0}^{T−1} r(x_t^π), then solve min_{π ∈ Π} ρ(S_T^π).
For risk-neutral MDPs, an optimal policy lies in Π (well-known). Not in the risk-averse setting! There we need history-dependent policies π_1 : H_1 → A, π_2 : H_2 → A, ..., giving a non-Markov chain {x_t} and
S_T^{π_1,...,π_T} = Σ_{t=0}^{T−1} r(x_t, π_t(x_t)),
and we solve min_{π_1,...,π_T} ρ( S_T^{π_1,...,π_T} ).

15 Mean-variance [MT11], [PG13]. Max-reward:
max_{π ∈ Π} E[ Σ_{t=1}^T r(x_t^π) ]   s.t.   Var( Σ_{t=1}^T r(x_t^π) ) ≤ v.
Min-variance:
min_{π ∈ Π} Var( Σ_{t=1}^T r(x_t^π) )   s.t.   E[ Σ_{t=1}^T r(x_t^π) ] ≥ w.
These problems are NP-hard (reduction from subset sum).

16 Probabilistic goal, chance constraint [XM11]. Given v ∈ R:
max_{π ∈ Π} P( Σ_{t=1}^T r(x_t^π) ≥ v ).
Given v ∈ R and α ∈ (0, 1):
max_{π ∈ Π} E[ Σ_{t=1}^T r(x_t^π) ]   s.t.   P( Σ_{t=1}^T r(x_t^π) ≥ v ) ≥ α.

17 Risk constraint [CG14]. Given β:
min_{π ∈ Π} E[ Σ_{t=1}^T r(x_t^π) ]   s.t.   ρ( Σ_{t=1}^T r(x_t^π) ) ≤ β,
where ρ is CVAR.

18 Dynamic (Markovian) risk measures [Rus10], [Oso12]. For an RV X on (Ω, F, P) and a coherent ρ: ρ(X) = sup_{Q absolutely continuous wrt P} E_Q X. For a fixed x_1 and horizon T, optimize over π_1, ..., π_T the nested objective
r(x_1, π_1(x_1)) + ρ( c(x_2, π_2(x_2)) + ρ( ... + ρ( c_{T+1}(x_{T+1}) ) ... ) ).
Infinite horizon, discount γ:
max_{π ∈ Π} r(x_0^π) + γ ρ( r(x_1^π) + γ ρ( r(x_2^π) + γ ρ( r(x_3^π) + ... ) ) ),
where each inner argument is a random variable and each ρ(·) is a real number. Solve for a stationary π by a version of value iteration.
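As one concrete instance of such a nested backup, the sketch below replaces the expectation over the next state with a one-step CVaR at an assumed level α and an assumed discount γ. It only illustrates the "version of value iteration" idea; it is not the exact operator of [Rus10] or [Oso12].

```python
import numpy as np

def discrete_cvar(values, probs, alpha):
    """CVaR_alpha of a discrete random cost: mean of the worst alpha-probability tail."""
    order = np.argsort(values)[::-1]             # largest costs first
    v, p = values[order], probs[order]
    cum = np.cumsum(p)
    prev = np.concatenate(([0.0], cum[:-1]))
    tail = np.clip(np.minimum(cum, alpha) - np.minimum(prev, alpha), 0.0, None)
    return float(tail @ v) / alpha

def risk_aware_backup(P, c, V, gamma=0.95, alpha=0.1):
    """One Bellman-style backup with CVaR replacing the expectation over next states.
    P[a] is an |X| x |X| transition matrix, c is an |X| x |A| cost table, V the value vector."""
    n_states, n_actions = c.shape
    Q = np.empty((n_states, n_actions))
    for x in range(n_states):
        for a in range(n_actions):
            Q[x, a] = c[x, a] + gamma * discrete_cvar(V, P[a][x], alpha)
    return Q.min(axis=1)
```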

19 Gradient approach [TCGM15]. Static and dynamic risk, without evaluating or estimating the risk value.
Static coherent ρ: min_π ρ( Σ_{t=1}^T r(x_t^π) ); build an estimator for ∇_π ρ( Σ_{t=1}^T r(x_t^π) ) and plug it into a gradient descent algorithm to find a local optimum.
Dynamic (Markovian) ρ_T (cf. [Rus10]): ρ_∞(π) = lim_{T→∞} ρ_T(x_1^π, ..., x_T^π); min_π ρ_∞(π); estimate ∇_π ρ_∞(π), etc.

20 What do we propose? There is a huge number of policies! We consider only the risk of a stationary policy:
min_{π ∈ Π} ρ( r(x_1^π) + ... + r(x_T^π) ). (1)
Each policy π has a risk value attached to it. Evaluating the risk of each policy is challenging. Policy improvement from π to π'. Approximate solution to an NP-hard problem.

21 Policy evaluation. 1 Transition probabilities P^π known. 2 Transition probabilities P^π unknown. 3 Tools: CLT for Markov chains (e.g., Kontoyiannis and Meyn); asymptotic reward variance for an MDP with a fixed policy (e.g., Meyn and Tweedie).

22 CLT. Theorem. Suppose that R_1, R_2, ... are i.i.d. with finite mean µ and variance σ². Let S_n = (R_1 + ... + R_n)/n. Then √n (S_n − µ) ⇒ N(0, σ²).

23 CLT for Markov chains. Theorem (Theorem 5.1, [KM03]). Suppose that X^π and r satisfy ergodicity and continuity assumptions. Let G_T^π(y) denote the distribution function
G_T^π(y) = P{ (S_T^π − T ϕ^π(r)) / (σ √T) ≤ y }.
Let r̂^π(x) be a solution to the Poisson equation
P^π r̂^π = r̂^π − r + ϕ^π(r) 1. (2)
Then, for all x_0^π ∈ X,
G_T^π(y) = g(y) + (γ(y) / (σ √T)) [ (ϱ / (6σ²)) (1 − y²) − r̂^π(x_0^π) ] + o(1/√T),
uniformly in y ∈ R, where g and γ denote the standard normal distribution and density.

24 P^π known. Input: P^π, T, r, λ, x_0.
1 Solve P^π ξ^π = ξ^π for the stationary distribution ξ^π.
2 Compute the asymptotic expected reward per time step: ϕ^π(r) = Σ_x ξ^π(x) r(x).
3 Solve the Poisson equation (2) for r̂^π.
4 Compute the asymptotic variance [MT12]: σ² = Σ_{x ∈ X} [ r̂^π(x)² − (P^π r̂^π(x))² ] ξ^π(x).
5 Estimate the probability distribution G_T^π of S_T^π with g by the CLT.
6 Output the VAR estimate: ρ̂_λ(S_T^π) = inf{ y ∈ R : g( (y − T ϕ^π(r)) / (σ √T) ) > λ }.
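A minimal numerical sketch of these six steps, assuming the transition matrix of the induced chain is known and ergodic, and using a pseudoinverse to pick one solution of the Poisson equation; the O(1/√T) correction involving r̂^π(x_0) from the theorem above is omitted here.

```python
import numpy as np
from scipy.stats import norm

def clt_var_estimate(P_pi, r, T, lam):
    """
    CLT-based VaR estimate of S_T, the sum of rewards under a fixed policy.
    P_pi: |X| x |X| transition matrix of the induced chain; r: |X| reward vector.
    """
    n = len(r)
    # 1. Stationary distribution: left eigenvector of P_pi for eigenvalue 1.
    evals, evecs = np.linalg.eig(P_pi.T)
    xi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    xi = xi / xi.sum()
    # 2. Asymptotic reward per time step.
    phi = xi @ r
    # 3. Poisson equation (I - P_pi) r_hat = r - phi * 1; any solution works.
    r_hat = np.linalg.pinv(np.eye(n) - P_pi) @ (r - phi)
    # 4. Asymptotic variance (cf. Meyn & Tweedie).
    sigma2 = xi @ (r_hat**2 - (P_pi @ r_hat)**2)
    # 5-6. Normal approximation of S_T and its lam-quantile.
    return T * phi + np.sqrt(T * sigma2) * norm.ppf(lam)
```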

25 Estimation error. (Figure: distribution estimates G_T(y) versus y, for σ = 1 and σ = 1.1.)

26 P^π needs to be estimated. Input: π, T, r, λ, x_0.
1 Estimate P^π with P̂^π (the hard part: how many samples?).
2 Estimate the stationary distribution ξ̂^π.
3 Compute the asymptotic expected reward per time step: ϕ̂^π(r) = Σ_x ξ̂^π(x) r(x).
4 Solve the Poisson equation for r̂^π.
5 Compute the asymptotic variance [MT12, Equation 17.44]: σ̂² = Σ_{x ∈ X} [ r̂^π(x)² − (P̂^π r̂^π(x))² ] ξ̂^π(x).
6 Estimate the probability distribution G_T^π of S_T^π with g by the CLT.
7 Output the VAR estimate: ρ̂_λ(S_T^π) = inf{ y ∈ R : g( (y − T ϕ̂^π(r)) / (σ̂ √T) ) > λ }.
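Step 1 might be done by counting transitions along one long trajectory generated under π. The sketch below is an assumption-laden illustration: the Laplace smoothing keeps every row a proper distribution, and how many samples suffice is exactly the hard part flagged above. The estimate can then be fed to the evaluation routine sketched after slide 24.

```python
import numpy as np

def estimate_transition_matrix(trajectory, n_states, smoothing=1.0):
    """Empirical transition matrix of the chain induced by the fixed policy.
    trajectory is a sequence of visited state indices; smoothing is an assumed prior count."""
    counts = np.full((n_states, n_states), smoothing)
    for x, x_next in zip(trajectory[:-1], trajectory[1:]):
        counts[x, x_next] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)

# Usage (hypothetical names): P_hat = estimate_transition_matrix(traj, n_states)
#                             rho_hat = clt_var_estimate(P_hat, r, T, lam)
```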

27 Policy improvement. Parametrized space of policies π_θ, for θ ∈ Θ ⊂ R^d. Randomized policies π_θ (a distribution over actions for each state), continuous in the parameter θ. Stochastic approximation and policy gradient. SPSA (simultaneous perturbation stochastic approximation), with perturbation Δ:
∇̂_θ^{(i)} ρ(S_T^θ) = ( ρ(S_T^{θ + β Δ}) − ρ(S_T^θ) ) / (β Δ^{(i)}),
θ_{t+1}^{(i)} = Γ_i( θ_t^{(i)} + λ(t) ∇̂_θ^{(i)} ρ(S_T^θ) ) for all i.
Step sizes satisfy the stochastic approximation assumptions.
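A minimal sketch of one SPSA step, assuming a Rademacher perturbation Δ, fixed β and step size (the talk uses decaying stochastic-approximation step sizes and a projection Γ, omitted here), and a caller-supplied routine risk_of that returns the risk estimate ρ̂(S_T^θ) for the policy π_θ. The sign is chosen to descend, i.e. to reduce risk.

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa_step(theta, risk_of, beta=0.05, step=0.01):
    """One simultaneous-perturbation step on the policy parameter theta.
    risk_of(theta) must return an estimate of rho(S_T^theta), e.g. the CLT-based VaR estimate."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)             # Rademacher perturbation
    grad = (risk_of(theta + beta * delta) - risk_of(theta)) / (beta * delta)
    return theta - step * grad                                    # move to reduce the risk
```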

28 Policy improvement. We can find a local optimum.

29 What can we guarantee? Soon on arXiv.
Policy evaluation. Theorem. Our estimated risk ρ̂ has an error of O(1/(σ√T)) with high probability: for every ε, δ > 0 and T large enough, we have
P( |ρ̂_λ(S_T^π) − ρ_λ(S_T^π)| ≤ ε ) ≥ 1 − δ.
Policy improvement. Theorem. Under continuity and stochastic approximation assumptions, for every ε > 0, there exists β such that θ_t converges almost surely to an ε-neighbourhood of a local optimum.

30 Extensions. Any risk measure that is a functional of the probability distribution of S_T^π. Use this algorithm for other NP-hard problems (subset sum)? Extension to infinite horizon, broken down into phases of length T.

31 References
[CG14] Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, 2014.
[KM03] Ioannis Kontoyiannis and Sean P. Meyn. Spectral theory and limit theorems for geometrically ergodic Markov processes. Annals of Applied Probability, 2003.
[MT11] S. Mannor and J. N. Tsitsiklis. Mean-variance optimization in Markov decision processes. In Proceedings of ICML, 2011.
[MT12] Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.
[Oso12] Takayuki Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, 2012.
[PG13] L. A. Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, 2013.
[Rus10] Andrzej Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2), 2010.
[TCGM15] Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, and Shie Mannor. Policy gradient for coherent risk measures. In Advances in Neural Information Processing Systems, 2015.
[XM11] Huan Xu and Shie Mannor. Probabilistic goal Markov decision processes. In Proceedings of IJCAI, 2011.
