Payments System Design Using Reinforcement Learning: A Progress Report

A. Desai¹, H. Du¹, R. Garratt², F. Rivadeneyra¹
¹Bank of Canada   ²University of California Santa Barbara

16th Payment and Settlement System Simulation Seminar, 31 August 2018

Motivation

How should new HVPS be designed?
- By performing counterfactual exercises with realistic and reliable estimates of a structural model.

How are new HVPS designed today?
- Empirical evaluation of different design options using historical data.
- But historical data were generated under different rules, and the behaviour of participants would likely change under a new design.
- How important are these behavioural changes?

Objective

First: bring machine learning (ML) techniques to the study of payments systems.
- Approximate the current behavioural rules of participants in the Canadian LVTS.

Future: help design new systems by investigating the implied trade-offs (delay and liquidity) of alternative designs.

Concepts: Why Machine Learning?

- Simulation: the process of replicating known outcomes given input data, the environment, and the rules of agents (ABM).
- ML: estimation of rules given input data and the environment.
- Supervised learning: when both outputs and inputs are known.
- Reinforcement learning (RL): agents learn by observing the results of their interaction with the environment.
- Task: estimate (learn) the policy rules of agents.

Concepts: Why Machine Learning?

- Unsupervised ML: when only the inputs are available; used to find or interpret the structure or topology of the inputs. Not our interest today.
- Deep neural network (DNN): a technique for non-linear estimation.
- Deep learning: estimation of policy rules using a DNN.

Reinforcement Learning

Payments systems are well suited to RL methods because the rules of the environment are fixed and known.
- Outcomes can change with the actions (inputs) of agents.
- Agents interact with the environment and learn by observing the results of those actions.

What we need:
1. A reward function: provides agents with signals of the value of the actions taken.
2. A value function approximator (such as a DNN).
3. An environment: an RTGS simulator.

Payments systems theory: Literature

- Bech & Garratt (2003): liquidity management game and its equilibria.

Agent-based methods:
- Arciero et al. (2009): explore responses of agents to shocks in RTGS.
- Galbiati & Soramäki (2011): model agents choosing initial liquidity to satisfy payment demands.

Reinforcement learning:
- Bellman (1957): dynamic programming, a direct precursor of RL.
- Bertsekas & Tsitsiklis (1996): techniques for approximate dynamic programming.
- Sutton & Barto (1998): early techniques of reinforcement learning.
- Efron & Hastie (2016): estimation of DNNs.
- Lots of open-source work: OpenAI, DeepMind, etc.

Plan of the talk

1. Model: the liquidity management problem as a dynamic programming problem
2. Implementation of the RL algorithm
   2.1 The reward function
   2.2 The Q-learning algorithm
   2.3 The deep neural network
   2.4 The computational architecture
3. Future work

Model: individual liquidity management problem

The day is divided into subperiods $t = 0, 1, \ldots, T$.
- $l_t$: liquidity at $t$
- $d > 0$: cost of delay per dollar per subperiod $t$
- $\{p_t\}$: set of new payments to be processed in subperiod $t$
- $\{s_t\}$: set of payments sent in subperiod $t$
- $\{p^1_t\}$: set of payments queued (internally) up to subperiod $t$
- $\{r_t\}$: set of payments received in subperiod $t$

Model

Objective: minimize the cost of delay, subject to the liquidity constraint and to all payments being sent by the end of the day.

$$V(\{s_t\}; l_t) = \max_{\{s_t\}} \Big[ -\big|\{p^1_{t+1}\}\big|\, d + \beta\, V(\{s_{t+1}\}; l_{t+1}) \Big]$$

subject to
$$l_t = l_{t-1} - \big|\{s_t\}\big| + \big|\{r_t\}\big| \geq 0,$$
$$\{p^1_t\} = \big(\{p_{t-1}\} \cup \{p^1_{t-1}\}\big) \setminus \{s_{t-1}\},$$
and, at the end of the day,
$$l_T = l_{T-1} - \big|\{p^1_T\}\big| - \big|\{p_T\}\big| + \big|\{r_T\}\big| \geq 0, \qquad \{s_T\} = \{p_T\} \cup \{p^1_T\},$$
where $|\cdot|$ denotes the total value of the payments in a set.

Model

Remarks:
- A stochastic version can be formulated similarly.
- In this formulation payments are indivisible and non-interchangeable.
- The integer problem and the curse of dimensionality make this a hard problem to solve by guess-and-verify, the envelope theorem, or backward induction.
- Solution: approximate dynamic programming (Bertsekas & Tsitsiklis, 1996) using deep neural networks.

Model: variants

The basic problem can be extended:
- Liquidity management problem with a choice of initial liquidity ($l_0$).
- Collateral management problem, choosing subsequent collateral apportioning ($c_t$).
- Liquidity management game (Bech & Garratt, 2003): simultaneously solve for the policy function of each participant.

RL Implementation

RL agent objective: learn how much payment value, $P$, to send at any point in time (given current liquidity, payments queued, expected demands, average sent payments, and expected and received incoming payments).

To implement our RL agent we need:
1. A reward function: the normative statement of the value of actions taken by the agent.
2. Q-learning: a method to record the value of action-state pairs.
3. A DNN: a method to approximate the Q function.
4. A simulator: calculates the outcomes of the environment given certain actions, providing the reward.

Reward Function

The reward function provides the returns from taking certain actions given the state:

$$\rho_t = -\big|\{p^1_t\}\big|\, d - c_t$$

- The choice variable is $P$, which determines the amount of delay.
- $c_t$ is the cost of collateral (if needed).
- Variant: add received payments, $\{r_t\}$.
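A minimal sketch of this per-period reward, assuming the queued value and cost parameters are passed in directly (the function name and signature are illustrative, not the authors' code):

```python
def reward(queued_value, delay_cost_per_dollar, collateral_cost=0.0):
    """Per-period reward rho_t: delay cost accrues on the value of payments
    still queued, and any collateral cost is subtracted as well."""
    return -queued_value * delay_cost_per_dollar - collateral_cost

# Example: 80 dollars queued, delay cost of 0.001 per dollar, collateral cost 0.02
print(reward(80.0, 0.001, 0.02))  # -0.1
```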

Q learning algorithm

- Q function: logs the attained rewards of action $P$ given state $l$.
- Q-learning: the procedure to estimate this function.

[Figure: Q-table with actions $P$ along one axis and states $l$ along the other]

Q learning algorithm

1. Initialize the Q function with zero-valued entries.
2. $l_t \leftarrow$ initial state $l_0$, where $l_t$ is liquidity at time $t$.
3. While not converged do:
4. $\hat{P}_t = \begin{cases} \text{sample } P_t \in \text{FIFO}(\{s_t\}) \text{ at random} & \text{with probability } \epsilon \\ \arg\max_{P_t \in \text{FIFO}(\{s_t\})} Q(l_t, P_t) & \text{with probability } 1 - \epsilon \end{cases}$
5. Perform action $\{s_t\} = \text{FIFO}^{-1}(\hat{P}_t)$.
6. $\rho_t \leftarrow \rho_{t+1}$: new reward from the environment.
7. $l_t \leftarrow l_{t+1}$: new state from the environment.
8. $Q(l_t, P_t) = R_t + \gamma\, Q(l_{t+1}, P_{t+1})$.
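For intuition, a tabular sketch of an epsilon-greedy Q-learning loop of this kind. The `env` interface (`reset`/`step`) and the finite `actions` grid of payment values are assumptions for illustration, not the authors' simulator; the update also uses the standard TD rule with a learning rate, slightly more general than the assignment in step 8:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, epsilon=0.1, gamma=0.95, alpha=0.1):
    """Tabular epsilon-greedy Q-learning sketch over discrete states and
    candidate payment values P."""
    Q = defaultdict(float)  # Q[(state, action)], initialized to zero
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy choice of the payment value P to send
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # one-step temporal-difference update toward reward + discounted value
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + gamma * best_next
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```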

FIFO algorithm

- The FIFO algorithm is a solution to the integer problem.
- The agent chooses the value $P$, and the FIFO algorithm returns the set $\{s_t\}$ such that $|\{s_t\}| \leq P$.
- Other algorithms (FIFO bypass, sorting, etc.) are interesting and could help learning (future work).
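A minimal sketch of the FIFO selection step, assuming (for simplicity) that queued payments are represented only by their dollar values:

```python
from collections import deque

def fifo_select(queue, P):
    """Release queued payments in arrival order until the chosen value
    budget P would be exceeded; remaining payments stay queued."""
    sent, total = [], 0.0
    while queue and total + queue[0] <= P:
        payment = queue.popleft()
        sent.append(payment)
        total += payment
    return sent  # the set {s_t}, with total value <= P

# Example: with a budget of 60, only the first two payments are sent.
q = deque([30.0, 25.0, 40.0])
print(fifo_select(q, 60.0))  # [30.0, 25.0]
print(list(q))               # [40.0] remains in the internal queue
```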

Deep Neural Network

A feed-forward neural network is a non-linear system with:
- one input layer of predictors $x_j$ (liquidity, payment demands, ...),
- one or more hidden layers of "neurons" $a_l$,
- and an output layer $o$.

Neurons use the inputs as follows:
$$a_l = g\Big( w^{(1)}_{l0} + \sum_{j=1}^{K_1} w^{(1)}_{lj}\, x_j \Big)$$
and feed into the output
$$P = h\Big( w^{(2)}_{0} + \sum_{l=1}^{K_2} w^{(2)}_{l}\, a_l \Big).$$

Deep Neural Network

- The inputs $x_j$ feed into the neurons with weights $w^{(k)}_{lj}$ and on to the output value $P$.
- The objective is to estimate the weight vector $W$, usually via gradient descent.
- "Deep" refers to the case where the number of hidden layers, $k$, is "large".
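A sketch of the forward pass implied by the formulas above, for a single hidden layer. The activation choices (tanh for $g$, identity for $h$), the array shapes, and the example inputs are assumptions for illustration:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """One-hidden-layer forward pass: a = g(b1 + W1 x), P = h(b2 + W2 a).
    Shapes: x (K1,), W1 (K2, K1), b1 (K2,), W2 (K2,), b2 scalar."""
    a = np.tanh(b1 + W1 @ x)   # g: hidden-layer activation
    P = float(b2 + W2 @ a)     # h: identity output for a real-valued P
    return P

# Example with random weights and three predictors (liquidity, queue, demand).
rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, 0.2])
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=4), 0.0
print(forward(x, W1, b1, W2, b2))
```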

RL Computational Infrastructure

[Diagram: hyperparameters feed the RL algorithm; the RL algorithm estimates the DNN, which interacts with the RTGS simulator; the simulator reads the transaction file and the collateral file]

Each day is an "epoch". The DNN is re-estimated until some convergence criterion of the Q function is attained.
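A rough sketch of the training loop this architecture suggests. The `rl_agent` and `simulator` interfaces, the per-day data structure, and the convergence measure are all hypothetical placeholders, not the authors' implementation:

```python
def train(rl_agent, simulator, transactions_by_day, max_epochs=100, tol=1e-3):
    """Replay each day of transaction data as one epoch and re-estimate the
    agent's value-function approximation until the Q function stabilizes."""
    for epoch in range(max_epochs):
        for day in transactions_by_day:
            simulator.load(day)              # RTGS simulator replays one day
            state = simulator.reset()
            done = False
            while not done:
                action = rl_agent.act(state)        # payment value P to send
                state, reward, done = simulator.step(action)
                rl_agent.observe(reward, state)
        change = rl_agent.refit()            # re-estimate the DNN / Q function
        if change < tol:                     # convergence criterion on Q
            break
```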

Intuition and Expected Results

The agent uses and estimates:
- current liquidity $l$, expected demands $E[p_{t+1}]$, payments queued $\{p^1_t\}$, expected and received incoming payments $E[r_{t+1}]$ and $\{r_t\}$, and past rewards $\rho$,

to maximize cumulative rewards by:
- choosing the value of payments to be sent in that period (action $P_t$);
- the DNN weights $W$ are updated using the variation in rewards.

Expected results:
- The policy function should be a buffer: $l - P > 0$.
- The policy function conditions on deviations from typical payment demands and received payments.

Learning and Challenges

Learning:
- When choosing $P$, the solution is a choice of a liquidity buffer. How to deal with payment priorities?
- Idea: a two-step process. First choose $P$, then optimize within $\{s\}$.

Challenges:
- Estimation of the DNN (e.g. number of layers), choice of hyperparameters (e.g. $\epsilon$), and convergence criteria.
- How to handle the impact of an agent's choices on the ability of others to send payments at the same time, as observed in the data: add collateral.

Future work

- Finish estimation of the individual liquidity management problem.
- Game version: simultaneously estimate the individual policy rules of multiple agents.
- Optimize over hyperparameters and initial liquidity.
- Explore alternatives to FIFO to learn from the structure of payments.

Future work

How to use this in payments system design:
- With the estimated rules, do transfer learning: use the trained DNN and rules under a new environment and re-estimate the outcomes.
- For example, set the simulator (environment) to: (i) reduce the reward to induce throughput guidelines (vary the delay cost); (ii) introduce new LSMs.

Advertisement

5th BoC-PayCan Fall Payments Workshop
Ottawa, ON, September 20-21, 2018
Contact: Francisco Rivadeneyra, riva@bankofcanada.ca