Central-limit approach to risk-aware Markov decision processes
Slide 1: Central-limit approach to risk-aware Markov decision processes
Jia Yuan Yu, Concordia University, November 27, 2015.
Joint work with Pengqian Yu and Huan Xu.
Slide 2: Inventory management
1. Look at the current inventory level.
2. Decide how much to order.
3. Receive inventory, sell inventory (random demand).
4. New inventory level; repeat.
(Illustration credit: LeanCor.)
Slide 3: What is an MDP?
1. A single decision maker,
2. sequential decisions,
3. uncertainty.
Time horizon $t = 1, 2, \ldots$; states $X$; actions $A$; reward function $r : X \times A \to \mathbb{R}$; transition probabilities $P : X \times A \to \Delta(X)$. A minimal array representation is sketched below.
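The following is a minimal sketch, not from the talk, of how such a finite MDP can be held in arrays; the names (`n_states`, `P`, `r`) and the random entries are our own placeholders.

```python
import numpy as np

# Minimal sketch of a finite MDP as arrays (illustrative, not the talk's
# notation): |X| states, |A| actions.
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

# Reward function r : X x A -> R, stored as an |X| x |A| array.
r = rng.normal(size=(n_states, n_actions))

# Transition kernel P : X x A -> Delta(X), an |X| x |A| x |X| array;
# each slice P[x, a, :] is a probability distribution over next states.
P = rng.random(size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
```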
Slide 4: Policies
A policy is a mapping $\pi : X \to A$. If the decision-maker sticks to policy $\pi$, the states form a Markov chain $x_1^\pi, x_2^\pi, \ldots$, and we write $r(x_1^\pi), r(x_2^\pi), \ldots$ instead of $r(x_1, \pi(x_1)), r(x_2, \pi(x_2)), \ldots$
Slide 5: Example
1. Time horizon: from Black Friday to the end of the year.
2. State at day $t$: how many iPhones $x_t$ are in stock.
3. Action: how many iPhones $a_t$ to order from the warehouse.
4. Transition: from one inventory level $x_t$ to the next $x_{t+1}$, driven by the random demand and $a_t$.
5. Cost ($-$Reward): $c(x_t, a_t)$ includes the holding cost, order cost, backorder cost, etc. A toy implementation is sketched below.
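As a concrete illustration of these dynamics, here is a hypothetical one-day step of the inventory MDP; the Poisson demand and the cost coefficients `h` (holding), `k` (per-unit order), and `b` (backorder) are assumptions of ours, not values from the talk.

```python
import numpy as np

# Hypothetical inventory dynamics for the iPhone example; h, k, b and
# the Poisson demand are illustrative assumptions.
def step(x, a, rng, h=1.0, k=2.0, b=5.0, demand_mean=3.0):
    """One day: order a units at inventory level x, observe random demand d."""
    d = rng.poisson(demand_mean)        # random demand
    x_next = x + a - d                  # may go negative (backorders)
    cost = h * max(x_next, 0) + k * a + b * max(-x_next, 0)
    return x_next, cost

rng = np.random.default_rng(0)
x, total = 10, 0.0
for t in range(30):                     # roughly Black Friday to year end
    x, c = step(x, a=3, rng=rng)
    total += c
```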
Slide 6: What is risk?
Comparing probability distributions of alternatives. (Figure; source citation truncated in the transcription.)
Slide 7: What is risk?
Many notions (risk measures): mappings of random variables (probability distributions) to real numbers, parametrized by the curvature of $u$ or the value of $\lambda$:
- Expected utility (von Neumann and Morgenstern): $\rho_u(X) = \mathbb{E}\,u(X)$ with a concave $u$,
- Mean-variance (Markowitz): $\rho_\lambda(X) = \mathbb{E}X - \lambda \mathbb{V}X$,
- Value-at-risk (VAR): $\rho_\lambda(X) = F_X^{-1}(\lambda)$,
- Expected shortfall (CVAR or AVAR): an integral of VAR.
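For concreteness, here are empirical (sample-based) versions of these four risk measures; the utility $u$, the levels `lam`, and the test distribution are illustrative assumptions.

```python
import numpy as np

# Empirical versions of the four risk measures above, evaluated on a
# sample of X. The utility u, levels lam, and test data are assumptions.
def expected_utility(x, u=lambda z: -np.exp(-z)):   # a concave u
    return np.mean(u(x))

def mean_variance(x, lam=0.5):                      # E[X] - lam * Var[X]
    return np.mean(x) - lam * np.var(x)

def value_at_risk(x, lam=0.05):                     # F_X^{-1}(lam), a quantile
    return np.quantile(x, lam)

def expected_shortfall(x, lam=0.05):                # average of the worst
    v = value_at_risk(x, lam)                       # lam-fraction of outcomes
    return np.mean(x[x <= v])

samples = np.random.default_rng(0).normal(1.0, 2.0, size=100_000)
print(value_at_risk(samples), expected_shortfall(samples))
```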
Slide 8: VAR
(Figure: value-at-risk as the $\lambda$-quantile of a distribution.)
Slide 9: Let's cook up an example
1. A chain of many stores.
2. The previous boss minimizes $\mathbb{E}\sum_{t=1}^T c(x_t^\pi)$ for each store.
3. You want to try a different approach. For a given $\lambda$, you want:
$$\min_{\pi,\, v} \; v \quad \text{s.t.} \quad P\left(\sum_{t=1}^T c(x_t^\pi) \le v\right) \ge \lambda.$$
Slide 10: Let's cook up an example
1. For a given $\lambda$, with $S = \sum_{t=1}^T c(x_t^\pi)$, you want $\min_\pi F_S^{-1}(\lambda)$, where $F_S^{-1}(\lambda)$ is the value-at-risk $\rho_\lambda(S)$.
2. You can then say that a $\lambda$-fraction of the stores spend less than $v$ on inventory-management costs (under an i.i.d. assumption).
Slide 11: What set of policies?
1. For a given $\lambda$: $\pi^* = \arg\min_{\pi \in \Pi} F_S^{-1}(\lambda)$.
2. $\pi^*$ is the training you give to each store's inventory manager.
3. You want to limit personnel training!
4. $\Rightarrow$ Limit $\Pi$ to stationary (Markov) policies $\pi : X \to A$.
Slide 12: What's been done?
- MDPs without risk (risk-neutral).
- Risk in i.i.d. settings (in finance, operations research).
- Risk in MDPs:
  - mean-variance [MT11] [PG13],
  - probabilistic objective, chance constraint [XM11],
  - dynamic (Markovian) risk measures [Rus10] [Oso12],
  - risk constraint (CVAR or AVAR),
  - gradient approach [TCGM15].
Slide 13: Risk-neutral MDP
Finite horizon and infinite horizon:
$$\max_{\pi \in \Pi} \mathbb{E}\sum_{t=0}^{T-1} r(x_t^\pi), \qquad \max_{\pi \in \Pi} \mathbb{E}\sum_{t=0}^{\infty} \alpha^t r(x_t^\pi).$$
Solution via the Bellman equation, value iteration, etc. (sketched below). Unlike $\mathbb{E}$, our $\rho$ is usually not linear.
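As a reminder of how the risk-neutral case is solved, here is a minimal finite-horizon value-iteration sketch; the helper name is our own, and it reuses the `P`, `r` arrays from the earlier MDP sketch.

```python
import numpy as np

# Minimal finite-horizon value iteration for the risk-neutral objective
# max_pi E sum_t r(x_t, a_t); P is |X| x |A| x |X|, r is |X| x |A|.
def finite_horizon_vi(P, r, T):
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                   # terminal value V_T = 0
    policy = np.zeros((T, n_states), dtype=int)
    for t in reversed(range(T)):
        Q = r + P @ V                        # Q[x, a] = r(x,a) + E[V(x')]
        policy[t] = Q.argmax(axis=1)         # greedy action per state
        V = Q.max(axis=1)                    # Bellman backup
    return V, policy                         # V[x] = optimal expected reward
```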
Slide 14: What makes a risk measure ρ hard?
A stationary policy $\pi$ induces a Markov chain $\{x_t^\pi : x \in X,\ t = 1, 2, \ldots\}$: find the CDF of $S_T^\pi = \sum_{t=0}^{T-1} r(x_t^\pi)$, then solve $\min_{\pi \in \Pi} \rho(S_T^\pi)$.
For risk-neutral MDPs, an optimal policy lies in $\Pi$ (well known). Not in the risk-averse setting! There we need history-dependent policies $\pi_1 : H_1 \to A,\ \pi_2 : H_2 \to A, \ldots$, which induce a non-Markov chain $\{x_t\}$ with $S_T^{\pi_1,\ldots,\pi_T} = \sum_{t=0}^{T-1} r(x_t, \pi_t(x_t))$, and we must solve $\min_{\pi_1,\ldots,\pi_T} \rho\big(S_T^{\pi_1,\ldots,\pi_T}\big)$.
Slide 15: Mean-variance [MT11], [PG13]
Max-reward:
$$\max_{\pi \in \Pi} \mathbb{E}\sum_{t=1}^T r(x_t^\pi) \quad \text{s.t.} \quad \mathbb{V}\left(\sum_{t=1}^T r(x_t^\pi)\right) \le v.$$
Min-variance:
$$\min_{\pi \in \Pi} \mathbb{V}\left(\sum_{t=1}^T r(x_t^\pi)\right) \quad \text{s.t.} \quad \mathbb{E}\sum_{t=1}^T r(x_t^\pi) \ge w.$$
These problems are NP-hard (reduction from subset sum).
Slide 16: Probabilistic goal, chance constraint [XM11]
Given $v \in \mathbb{R}$:
$$\max_{\pi \in \Pi} P\left(\sum_{t=1}^T r(x_t^\pi) \ge v\right).$$
Given $v \in \mathbb{R}$ and $\alpha \in (0, 1)$:
$$\max_{\pi \in \Pi} \mathbb{E}\sum_{t=1}^T r(x_t^\pi) \quad \text{s.t.} \quad P\left(\sum_{t=1}^T r(x_t^\pi) \ge v\right) \ge \alpha.$$
Slide 17: Risk constraint [CG14]
Given $\beta$:
$$\min_{\pi \in \Pi} \mathbb{E}\sum_{t=1}^T r(x_t^\pi) \quad \text{s.t.} \quad \rho\left(\sum_{t=1}^T r(x_t^\pi)\right) \le \beta,$$
where $\rho$ is CVAR.
Slide 18: Dynamic (Markovian) risk measures [Rus10] [Oso12]
For a random variable $X$ on $(\Omega, \mathcal{F}, P)$, a coherent $\rho$ satisfies $\rho(X) = \sup_Q \mathbb{E}_Q X$, the supremum taken over a set of measures $Q$ absolutely continuous w.r.t. $P$.
For a fixed $x_1$ and $T$, maximize over $\pi_1, \ldots, \pi_T$ the nested objective
$$r(x_1, \pi_1(x_1)) + \rho\Big(c(x_2, \pi_2(x_2)) + \rho\big(\cdots + \rho(c_{T+1}(x_{T+1}))\big)\Big).$$
Infinite horizon with discount $\gamma$ (each inner $\rho$ maps a random variable to a real number):
$$\max_{\pi \in \Pi}\; r(x_0^\pi) + \gamma\rho\Big(r(x_1^\pi) + \gamma\rho\big(r(x_2^\pi) + \gamma\rho(r(x_3^\pi) + \cdots)\big)\Big).$$
Solve for a stationary $\pi$ by a version of value iteration.
Slide 19: Gradient approach [TCGM15]
Handles static and dynamic risk, without evaluating or estimating the risk value itself:
- static coherent $\rho$: $\min_\pi \rho\big(\sum_{t=1}^T x_t^\pi\big)$; build an estimator for $\nabla_\pi \rho\big(\sum_{t=1}^T x_t^\pi\big)$ and plug it into a gradient-descent algorithm to find a local optimum;
- dynamic (Markovian) $\rho_T$ (cf. [Rus10]): $\rho_\infty(\pi) = \lim_{T\to\infty} \rho_T(x_1^\pi, \ldots, x_T^\pi)$; solve $\min_\pi \rho_\infty(\pi)$ by estimating $\nabla_\pi \rho_\infty(\pi)$, etc.
Slide 20: What do we propose?
There is a huge number of policies! We consider only the risk of a stationary policy:
$$\min_{\pi \in \Pi} \rho\big(r(x_1^\pi) + \cdots + r(x_T^\pi)\big). \tag{1}$$
Each policy $\pi$ has a risk value attached to it, but evaluating the risk of each policy is challenging. We propose policy evaluation plus policy improvement from $\pi$ to $\pi'$: an approximate solution to an NP-hard problem.
Slide 21: Policy evaluation
1. Transition probabilities $P^\pi$ known.
2. Transition probabilities $P^\pi$ unknown.
3. Tools: the CLT for Markov chains (e.g., Kontoyiannis and Meyn) and the asymptotic reward variance of an MDP with a fixed policy (e.g., Meyn and Tweedie).
Slide 22: CLT
Theorem. Suppose that $R_1, R_2, \ldots$ are i.i.d. with finite mean $\mu$ and variance $\sigma^2$. Let $S_n = (R_1 + \cdots + R_n)/n$. Then $\sqrt{n}(S_n - \mu) \Rightarrow N(0, \sigma^2)$.
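A quick numerical sanity check of this statement (our own illustration, with exponential rewards, so $\mu = 2$ and $\sigma^2 = 4$):

```python
import numpy as np

# The normalized sum sqrt(n) * (S_n - mu) should be roughly N(0, sigma^2).
rng = np.random.default_rng(0)
n, reps = 1_000, 10_000
R = rng.exponential(scale=2.0, size=(reps, n))   # mu = 2, sigma^2 = 4
S_n = R.mean(axis=1)
Z = np.sqrt(n) * (S_n - 2.0)
print(Z.mean(), Z.var())                         # approx 0 and 4
```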
Slide 23: CLT for Markov chains
Theorem (Theorem 5.1, [KM03]). Suppose that $X^\pi$ and $r$ satisfy ergodicity and continuity assumptions. Let $G_T^\pi(y)$ denote the distribution function
$$G_T^\pi(y) \triangleq P\left\{\frac{S_T^\pi - T\varphi_\pi(r)}{\sigma\sqrt{T}} \le y\right\}.$$
Let $\hat{r}_\pi(x)$ be a solution to the Poisson equation:
$$P^\pi \hat{r}_\pi = \hat{r}_\pi - r + \varphi_\pi(r)\mathbf{1}. \tag{2}$$
Then, for all $x_0^\pi \in X$,
$$G_T^\pi(y) = g(y) + \frac{\gamma(y)}{\sigma\sqrt{T}}\left[\frac{\varrho}{6\sigma^2}(1 - y^2) - \hat{r}_\pi(x_0^\pi)\right] + o(1/\sqrt{T}),$$
uniformly in $y \in \mathbb{R}$, where $g$ and $\gamma$ denote the standard normal distribution and density.
Slide 24: $P^\pi$ known
Input: $P^\pi$, $T$, $r$, $\lambda$, $x_0$.
1. Solve $\xi^\pi P^\pi = \xi^\pi$ for the stationary distribution $\xi^\pi$ (as a row vector).
2. Compute the asymptotic expected reward per time step: $\varphi_\pi(r) = \sum_x \xi^\pi(x) r(x)$.
3. Solve the Poisson equation (2) for $\hat{r}_\pi$.
4. Compute the asymptotic variance [MT12]:
$$\sigma^2 = \sum_{x \in X}\big[\hat{r}_\pi(x)^2 - (P^\pi \hat{r}_\pi(x))^2\big]\,\xi^\pi(x).$$
5. Estimate the probability distribution $G_T^\pi$ of $S_T^\pi$ with $g$ by the CLT.
6. Output the VAR estimate:
$$\hat{\rho}_\lambda(S_T^\pi) = \inf\left\{ y \in \mathbb{R} : g\!\left(\frac{y - T\varphi_\pi(r)}{\sigma\sqrt{T}}\right) > \lambda \right\}.$$
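A sketch of steps 1-6 in code, assuming an ergodic chain; the helper name `var_estimate` and the least-squares treatment of the (singular) linear systems are our choices, and $g$ is realized as the standard normal CDF, so step 6 reduces to $T\varphi_\pi(r) + \sigma\sqrt{T}\,\Phi^{-1}(\lambda)$.

```python
import numpy as np
from scipy.stats import norm

# Sketch of the slide-24 policy-evaluation algorithm for a known P_pi.
# P_pi: |X| x |X| transition matrix of the chain induced by pi;
# r: per-state reward vector; T: horizon; lam: VaR level.
def var_estimate(P_pi, r, T, lam):
    n = P_pi.shape[0]

    # 1. Stationary distribution: xi P_pi = xi, sum(xi) = 1.
    A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    xi = np.linalg.lstsq(A, b, rcond=None)[0]

    # 2. Asymptotic expected reward per step.
    phi = xi @ r

    # 3. Poisson equation P_pi r_hat = r_hat - r + phi 1, rewritten as
    #    (I - P_pi) r_hat = r - phi 1 (unique up to an additive constant,
    #    which does not affect the variance in step 4).
    r_hat = np.linalg.lstsq(np.eye(n) - P_pi, r - phi, rcond=None)[0]

    # 4. Asymptotic variance [MT12]:
    #    sigma^2 = sum_x [r_hat(x)^2 - (P_pi r_hat)(x)^2] xi(x).
    sigma2 = xi @ (r_hat**2 - (P_pi @ r_hat) ** 2)

    # 5-6. Approximate G_T by the standard normal CDF g and invert:
    #    VaR_lam(S_T) ~ T phi + sigma sqrt(T) Phi^{-1}(lam).
    return T * phi + np.sqrt(sigma2 * T) * norm.ppf(lam)

# Example: two-state chain (placeholder values).
P_pi = np.array([[0.9, 0.1], [0.2, 0.8]])
r = np.array([1.0, -1.0])
print(var_estimate(P_pi, r, T=100, lam=0.05))
```

Because $g$ is continuous and strictly increasing, the infimum in step 6 is attained exactly at the normal quantile, which is what the last line of the helper returns.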
Slide 25: Estimation error
(Figure: $G_T$ as a function of $y$, for $\sigma = 1$ and $\sigma = 1.1$.)
Slide 26: $P^\pi$ needs to be estimated
Input: $\pi$, $T$, $r$, $\lambda$, $x_0$.
1. Estimate $P^\pi$ with $\hat{P}^\pi$ (hard part: how many samples?).
2. Estimate the stationary distribution $\hat{\xi}^\pi$.
3. Compute the asymptotic expected reward per time step: $\hat{\varphi}_\pi(r) = \sum_x \hat{\xi}^\pi(x) r(x)$.
4. Solve the Poisson equation for $\hat{r}_\pi$.
5. Compute the asymptotic variance [MT12, Equation 17.44]:
$$\hat{\sigma}^2 = \sum_{x \in X}\big[\hat{r}_\pi(x)^2 - (\hat{P}^\pi \hat{r}_\pi(x))^2\big]\,\hat{\xi}^\pi(x).$$
6. Estimate the probability distribution $G_T^\pi$ of $S_T^\pi$ with $g$ by the CLT.
7. Output the VAR estimate:
$$\hat{\rho}_\lambda(S_T^\pi) = \inf\left\{ y \in \mathbb{R} : g\!\left(\frac{y - T\hat{\varphi}_\pi(r)}{\hat{\sigma}\sqrt{T}}\right) > \lambda \right\}.$$
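Step 1 is the only genuinely new ingredient relative to slide 24; a simple (assumed) empirical-count estimator is sketched below, whose output can be fed straight into the earlier `var_estimate` sketch. The smoothing constant `alpha` is our assumption, added only to keep every row stochastic.

```python
import numpy as np

# Estimate P_pi from one sample path x_0, x_1, ..., x_N of the chain
# induced by pi, via smoothed empirical transition counts.
def estimate_P(path, n_states, alpha=1e-3):
    counts = np.full((n_states, n_states), alpha)
    for x, x_next in zip(path[:-1], path[1:]):
        counts[x, x_next] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)  # row-stochastic
```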
Slide 27: Policy improvement
Parametrized space of policies $\pi_\theta$, for $\theta \in \Theta \subseteq \mathbb{R}^d$: randomized, continuous policies $\pi_\theta : X \to \Delta(A)$, continuous in the parameter $\theta$.
Stochastic approximation and policy gradient. SPSA (simultaneous perturbation stochastic approximation), with a random perturbation direction $\Delta$:
$$\hat{\nabla}^{(i)}_\theta \rho(S_T^\theta) = \frac{\rho\big(S_T^{\theta + \beta\Delta}\big) - \rho\big(S_T^\theta\big)}{\beta\,\Delta^{(i)}}, \qquad \theta^{(i)}_{t+1} = \Gamma_i\Big(\theta^{(i)}_t + \lambda(t)\, \hat{\nabla}^{(i)}_\theta \rho\big(S_T^{\theta_t}\big)\Big) \quad \text{for all } i,$$
where $\Gamma_i$ projects onto $\Theta$. The step sizes satisfy the usual stochastic-approximation assumptions.
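A minimal SPSA sketch matching the update above (written as descent, since here we minimize risk); `rho_hat`, the Rademacher perturbations, and the step-size schedule are illustrative assumptions of ours.

```python
import numpy as np

# One-sided SPSA for minimizing theta -> rho_hat(theta), where rho_hat
# stands for a (simulated) risk estimate of S_T under pi_theta.
def spsa_minimize(rho_hat, theta0, beta=0.1, n_iters=200, seed=0,
                  lam=lambda t: 1.0 / (t + 1), project=lambda th: th):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for t in range(n_iters):
        delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher Delta
        # All d coordinates of the gradient estimate from two evaluations.
        grad = (rho_hat(theta + beta * delta) - rho_hat(theta)) / (beta * delta)
        # Gamma on the slide is the projection onto Theta.
        theta = project(theta - lam(t) * grad)  # descend to reduce risk
    return theta
```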
Slide 28: Policy improvement
We can find a local optimum.
Slide 29: What can we guarantee? (Soon on arXiv.)
Policy evaluation. Theorem: our estimated risk $\hat{\rho}$ has an error of $O\big(\frac{1}{\sigma\sqrt{T}}\big)$ with high probability; for every $\epsilon, \delta > 0$ and $T$ large enough, we have
$$P\big(|\hat{\rho}_\lambda(S_T^\pi) - \rho_\lambda(S_T^\pi)| \le \epsilon\big) \ge 1 - \delta.$$
Policy improvement. Theorem: under continuity and stochastic-approximation assumptions, for every $\epsilon > 0$ there exists $\beta$ such that $\theta_t$ converges almost surely to an $\epsilon$-neighbourhood of a local optimum.
Slide 30: Extensions
- Any risk measure that is a functional of the probability distribution of $S_T^\pi$.
- Use this algorithm for other NP-hard problems (subset sum)?
- Extension to the infinite horizon, broken down into phases of length $T$.
Slide 31: References
[CG14] Yinlam Chow and Mohammad Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, 2014.
[KM03] Ioannis Kontoyiannis and Sean P. Meyn. Spectral theory and limit theorems for geometrically ergodic Markov processes. Annals of Applied Probability, 2003.
[MT11] Shie Mannor and John N. Tsitsiklis. Mean-variance optimization in Markov decision processes. In Proceedings of ICML, 2011.
[MT12] Sean P. Meyn and Richard L. Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.
[Oso12] Takayuki Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, 2012.
[PG13] L. A. Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances in Neural Information Processing Systems, 2013.
[Rus10] Andrzej Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2), 2010.
[TCGM15] Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, and Shie Mannor. Policy gradient for coherent risk measures. In Advances in Neural Information Processing Systems, 2015.
[XM11] Huan Xu and Shie Mannor. Probabilistic goal Markov decision processes. In Proceedings of IJCAI, 2011.