Efficient Reinforcement Learning with Multiple Reward Functions for Randomized Controlled Trial Analysis

Efficient Reinforcement Learning with Multiple Reward Functions for Randomized Controlled Trial Analysis. Dan Lizotte, Michael Bowling, Susan A. Murphy. University of Michigan; University of Alberta.

Overview
- Why are we interested in multiple reward functions? Medical research applications for RL.
- Algorithm for the discrete (tabular) case: more efficient than previous approaches.
- Algorithm for linear function approximation: previous approaches not applicable.

Application: Medical Decision Support
Our goal is to use RL as a tool for data analysis for decision support:
1. Take comparative effectiveness clinical trial data.
2. Produce a policy (or Q-function) based on patient features (state).
3. Give the policy to a clinician.
But really, a policy is too prescriptive. Our data are noisy and incomplete, causing uncertainty in the learned policy (other projects at Michigan and elsewhere address this). Our output is intended for an agent whose reward function we do not know.

Example: Schizophrenia
In treating schizophrenia, one wants few symptoms but good functionality. This is often unachievable.
1. This is a chronic disease; patient state changes over time.
2. The effect of different treatments varies from patient to patient.
3. Different people may have very different preferences about which goal to give up; each has a different reward function/objective.
Properties 1 and 2 make the problem amenable to RL. The goal of this work is to deal with 3 by not committing to a single reward function. Let's look at an idealized version of batch RL and compare it with the type of data we actually have.

Fixed-Horizon Batch RL, Idealized Version
Get trajectories s_1^i, a_1^i, r_1^i, s_2^i, a_2^i, r_2^i, ..., s_T^i, a_T^i, r_T^i.
Find a policy that chooses actions to maximize the expected sum of future rewards. (Note the fixed horizon and knowledge of t.)
One possibility: fitted Q-iteration. Learn Q_T(s_T, a_T) ≈ E[R_T | s_T, a_T], then move backward through time:
Q_t(s_t, a_t) ≈ E[R_t + max_{a_{t+1}} Q_{t+1}(S_{t+1}, a_{t+1}) | s_t, a_t].
Expectations are approximated using the data.
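A minimal tabular sketch of this fitted Q-iteration backup, assuming each trajectory is stored as a dict with lists under the keys "s", "a", "r"; the function and variable names are illustrative, not taken from the authors' code, and unseen next-state/action pairs default to value 0 as a simplification.

```python
from collections import defaultdict

def fitted_q(trajectories, T, n_actions):
    # Q[t] maps (state, action) -> estimated value, for t = 1..T.
    Q = {t: defaultdict(float) for t in range(1, T + 1)}
    for t in range(T, 0, -1):
        targets = defaultdict(list)
        for traj in trajectories:
            s, a, r = traj["s"][t - 1], traj["a"][t - 1], traj["r"][t - 1]
            if t < T:  # bootstrap with the max over next-step Q-values
                s_next = traj["s"][t]
                r = r + max(Q[t + 1][(s_next, b)] for b in range(n_actions))
            targets[(s, a)].append(r)
        for key, ys in targets.items():
            Q[t][key] = sum(ys) / len(ys)  # sample mean approximates the expectation
    return Q
```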

Fixed-Horizon Batch RL, Realistic Version
Get trajectories o_0^i, a_1^i, o_1^i, a_2^i, o_2^i, ..., a_T^i, o_T^i.
The o_t^i include many measurements: symptoms, side-effects, genetic information, ...
We must define (much as in state-feature construction)
s_t^i = s_t(o_{0:t-1}^i, a_{1:t-1}^i) and r_t^i = r_t(o_{0:t}^i, a_{1:t}^i).
Now we have s_1^i, a_1^i, r_1^i, s_2^i, a_2^i, r_2^i, ..., s_T^i, a_T^i, r_T^i and can run fitted Q-iteration.
How can we defer the choice of r_t(o_{0:t}^i, a_{1:t}^i)?

Multiple Reward Functions
Consider a pair of important objectives. Suppose r_t^(0) reflects level of symptoms and r_t^(1) reflects level of functionality. Consider the set of convex combinations of the reward functions:
r_t(s, a, δ) = (1 - δ) r_t^(0)(s, a) + δ r_t^(1)(s, a).
Each δ identifies a specific reward function and induces a corresponding Q_t(·, ·, δ). Depending on δ, the optimal policy cares more about r_t^(0) or r_t^(1).
Standard approach: preference elicitation. Try to determine the decision-maker's true value of δ via time tradeoff, standard gamble, visual analog scales, ...

Inverse Preference Elicitation
We propose a different approach. Take r(s, a, δ) = (1 - δ) r^(0)(s, a) + δ r^(1)(s, a). Run the analysis to find the optimal actions for all δ, i.e. learn Q_t(s, a, δ) and V_t(s, δ) for all t ∈ {1, 2, ..., T} and all δ ∈ [0, 1]. Given a new patient's state, report, for each action, the range of δ for which it is optimal.
[Figure: "Tradeoffs For Which Each Treatment is Optimal, s = 1": a δ-axis from 0 (care about symptoms) to 1 (care about functionality), split into the intervals on which Treatment 9 and Treatment 6 are optimal.]
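A small sketch of this reporting step, assuming the per-action values Q(s, a, 0) and Q(s, a, 1) for the new patient's state are already available; since Q(s, a, δ) is linear in δ, a fine grid over δ is enough to recover the interval on which each action is optimal. The function name and data layout are hypothetical.

```python
import numpy as np

def optimal_delta_ranges(q0, q1, grid=1001):
    # q0[a] = Q(s, a, 0), q1[a] = Q(s, a, 1); Q(s, a, delta) is linear in delta.
    deltas = np.linspace(0.0, 1.0, grid)
    q = np.outer(1.0 - deltas, q0) + np.outer(deltas, q1)  # grid x actions
    best = q.argmax(axis=1)
    # Because the pieces of the upper envelope are contiguous, each action's
    # optimal set of tradeoffs is an interval [min, max].
    return {int(a): (float(deltas[best == a].min()), float(deltas[best == a].max()))
            for a in np.unique(best)}

# Toy example: action 0 favours symptom relief, action 1 favours functionality;
# action 0 is optimal for delta up to about 0.4, action 1 thereafter.
print(optimal_delta_ranges(np.array([0.8, 0.4]), np.array([0.3, 0.9])))
```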

Overview
- Why are we interested in multiple reward functions? Medical research applications for RL.
- Algorithm for the discrete (tabular) case: more efficient than previous approaches.
- Algorithm for linear function approximation: previous approaches not applicable.

Algorithm for Discrete State Space: Preview
r_t(s, a, δ) = (1 - δ) r_t^(0)(s, a) + δ r_t^(1)(s, a)
- Q_T(s, a, 0) and Q_T(s, a, 1) are average rewards, so Q_T(s, a, δ) is linear in δ.
- V_T(s, δ) is continuous and piecewise linear in δ; knots are introduced by the pointwise max over a.
- Q_{T-1}(s, a, δ) is continuous and piecewise linear in δ: an average of V_T(s', δ) over trajectories containing (s, a, s') tuples. [WLOG we consider terminal rewards only.]

Value Backup: Max Over Actions
Q_T(s, a, 0) and Q_T(s, a, 1) are average rewards, so Q_T(s, a, δ) is linear in δ. V_T(s, δ) is continuous and piecewise linear in δ; the knots introduced by the pointwise max over a are found by a convex hull.
[Figure: two panels. "Q function: Line Representation" plots the lines (1 - δ) R_0 + δ R_1 for actions A1-A4 against δ ∈ [0, 1], with V_T(s, δ) as their upper envelope. "Q function: Point Representation" plots each action as a point (R_0, R_1).]

Value Backup: Max Over Actions
Our value function representation remembers which actions are optimal over which intervals of δ.
[Figure: the same two panels, with the pieces of V_T(s, δ) labelled by their optimal actions A1, A2, A3.]
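An illustrative brute-force version of this max-over-actions backup for a single state: each action contributes the line (1 - δ) R_0 + δ R_1, and the upper envelope over δ ∈ [0, 1] gives V_T(s, δ) together with the optimal action on each piece. The paper finds the same knots via a convex hull of the points (R_0, R_1); this sketch simply intersects lines pairwise, and all names are illustrative.

```python
def upper_envelope(R0, R1):
    # Each action a defines the line Q(delta) = (1 - delta) * R0[a] + delta * R1[a].
    # Returns (knots, piece_actions): piece_actions[i] is optimal on
    # [knots[i], knots[i + 1]]. Candidate knots that fall inside a piece are
    # harmless; adjacent pieces with the same action can be merged afterwards.
    n = len(R0)
    slope = [R1[a] - R0[a] for a in range(n)]
    cand = {0.0, 1.0}
    for a in range(n):
        for b in range(a + 1, n):
            if slope[a] != slope[b]:
                d = (R0[b] - R0[a]) / (slope[a] - slope[b])  # intersection of lines a, b
                if 0.0 < d < 1.0:
                    cand.add(d)
    knots = sorted(cand)
    piece_actions = [max(range(n), key=lambda a: R0[a] + slope[a] * ((lo + hi) / 2))
                     for lo, hi in zip(knots[:-1], knots[1:])]
    return knots, piece_actions
```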

Value Backup: Expectation Over Next State
Q_{T-1}(s, a, δ) is continuous and piecewise linear in δ: an average of V_T(s', δ) over trajectories containing (s, a, s') tuples,
Q_{T-1}(s, a, δ) = E_{S'|s,a}[r_{T-1} + V_T(S', δ)].
[Figure: three panels in the "max of lines" representation: V_T(s1, δ), V_T(s2, δ), and the mean V function that gives Q_{T-1}(s, a, δ).]
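A short sketch of this expectation step, assuming each V_T(s', δ) is represented by its knots and the values at those knots; the pointwise mean is again piecewise linear, with knots at the union of the input knot sets. Names are illustrative.

```python
import numpy as np

def average_pwl(v_functions):
    # v_functions: list of (knots, values) pairs, each piecewise linear on [0, 1]
    # with knots sorted in increasing order.
    all_knots = np.unique(np.concatenate([k for k, _ in v_functions]))
    mean_vals = np.mean([np.interp(all_knots, k, v) for k, v in v_functions], axis=0)
    return all_knots, mean_vals
```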

Value Backups: Discrete State (tabular)
- V_T(s, δ) is piecewise linear, with O(|A|) pieces.
- E_{S'|s,a}[V_T(S', δ)] is piecewise linear, with O(|S||A|) pieces.
- At stage T-t, for each s, V(s, δ) has O(|S|^(T-t) |A|^(T-t+1)) pieces. We can compute it in O(|S|^(T-t+1) |A|^(T-t+1)) time by operating directly on the piecewise-linear functions. Previous work took O(|S|^(2(T-t)+1) |A|^(2(T-t)+1) log(|S|^(2(T-t)+1) |A|^(2(T-t)+1))) time.
- V_{t-1}(s, δ) is built from convex combinations of the V_t(s', δ) (followed by a max over actions), and therefore stays convex in δ.
[Figure: the same three "max of lines" panels as on the previous slide.]

Dominated Actions
Some actions are not optimal for any δ.
[Figure: the line and point representations again; A4 is crossed out because it never lies on the upper envelope V(s, δ).]
Some actions are not optimal for any (δ, s)! We can enumerate s to check this.
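A sketch of that check, reusing the upper_envelope function sketched above: in the tabular case an action is globally dominated if it never appears on V(s, δ) for any state s. Variable names are illustrative.

```python
def globally_dominated(Q0, Q1, states, actions):
    # Q0[s][a] = Q(s, a, 0), Q1[s][a] = Q(s, a, 1) for discrete states and actions.
    ever_optimal = set()
    for s in states:
        _, piece_actions = upper_envelope([Q0[s][a] for a in actions],
                                          [Q1[s][a] for a in actions])
        ever_optimal.update(actions[i] for i in piece_actions)
    return [a for a in actions if a not in ever_optimal]
```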

Overview
- Why are we interested in multiple reward functions? Medical research applications for RL.
- Algorithm for the discrete (tabular) case: more efficient than previous approaches.
- Algorithm for linear function approximation: previous approaches not applicable.

Continuous State Space, Linear Function Approximation
What if we want to generalize across state? Estimate Q_T(s, a, 0) and Q_T(s, a, 1) using linear regression rather than sample averages, and use these to compute Q_T(s, a, δ) for other δ.
Recall: r_t(s, a, δ) = (1 - δ) r_t^(0)(s, a) + δ r_t^(1)(s, a).
Construct design matrices S_a (n_a × p) and targets r_a(δ) (n_a × 1) from our data set:
Q_T(s, a, δ; β) = β_a(δ)^T s,   β_a(δ) = (S_a^T S_a)^{-1} S_a^T r_a(δ).
Q_T(s, a, δ; β) is linear in β, each element of β is linear in r, and r is linear in δ, so
Q_T(s, a, δ; β) = ((1 - δ) β_a(0) + δ β_a(1))^T [1, s]^T.
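A brief sketch of the stage-T fit under these definitions: one least-squares fit per action for each basis reward, after which β_a(δ) is obtained by interpolating the two coefficient vectors. The design matrix is assumed to carry an intercept column of ones; the names are hypothetical.

```python
import numpy as np

def fit_stage_T(S_a, r0_a, r1_a):
    # S_a: (n_a x p) design matrix for action a, first column all ones;
    # r0_a, r1_a: rewards under the two basis reward functions.
    beta0, *_ = np.linalg.lstsq(S_a, r0_a, rcond=None)  # beta_a(0)
    beta1, *_ = np.linalg.lstsq(S_a, r1_a, rcond=None)  # beta_a(1)
    return beta0, beta1

def q_T(state_features, beta0, beta1, delta):
    # Q_T(s, a, delta; beta) = ((1 - delta) beta_a(0) + delta beta_a(1))^T [1, s]^T
    x = np.concatenate(([1.0], np.atleast_1d(state_features)))
    return ((1.0 - delta) * beta0 + delta * beta1) @ x
```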

Movie of Q_T
Q_T(s, a, δ; β) = ((1 - δ) β_a(0) + δ β_a(1))^T [1, s]^T

Maximization Over Actions
V_T(s, δ) is continuous and piecewise linear in both δ and s. While learning, we only evaluate V_T(s, δ) at the s_i we have in our dataset. Knots are found by a convex hull.
[Figure: surfaces Q_T(s, a1, δ; β) and Q_T(s, a2, δ; β) over (s, δ), whose pointwise maximum gives V_T(s, δ).]

Regression Over Next State
While learning, we only evaluate V_T(s, δ) at the s_i we have in our dataset. The regression coefficients for Q_{T-1}(s, a, δ; β) are weighted sums of the V_T(s_i, δ). Break the problem into regions of δ-space where the V_T(s_i, δ) are simultaneously linear.
[Figure: the Q_T(s, a1, δ; β) and Q_T(s, a2, δ; β) surfaces again.]

Value Backups: Continuous State
Q_T(s, a, δ; β) = ((1 - δ) β_a(0) + δ β_a(1))^T [1, s]^T.
V_T(s, δ) = max_a Q_T(s, a, δ; β) is piecewise linear in δ.
Q_{T-1}(s, a, δ; β) = β_a(δ)^T s with β_a(δ) = (S_a^T S_a)^{-1} S_a^T V_T(δ), where V_T(δ) stacks the values V_T(s_i', δ); this is not convex in δ.
Each element of V_T(δ) is piecewise linear in δ. Where the V_T(s_i', δ) are simultaneously linear, the elements of β_a(δ) are linear. We must compute β_a(δ) at the knot δs between linear regions: O(n|A|) knots, where n is the number of trajectories. At time T-t, there could be O(n^(T-t) |A|^(T-t)) knots.
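A sketch of how this knot-wise regression could look, assuming each V_T(s_i', δ) is stored as a (knots, values) pair and terminal rewards only (as above); β_a(δ) is computed at every knot and is linear in between. This is illustrative, not the authors' implementation.

```python
import numpy as np

def backup_coefficients(S_a, v_next):
    # S_a: (n_a x p) design matrix for the trajectories that took action a at T-1;
    # v_next[i] = (knots_i, values_i) is the piecewise-linear V_T(s_i', .) for the
    # i-th of those trajectories.
    knots = np.unique(np.concatenate([k for k, _ in v_next]))
    betas = []
    for d in knots:
        y = np.array([np.interp(d, k, v) for k, v in v_next])  # targets V_T(s_i', d)
        beta, *_ = np.linalg.lstsq(S_a, y, rcond=None)
        betas.append(beta)
    # Between consecutive knots every target is linear in delta, so each
    # coefficient vector beta_a(delta) can be linearly interpolated from these values.
    return knots, np.array(betas)
```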

Dominated Actions: Continuous State
The set of dominated actions can be determined analytically in O(|A|^3 · #knots) time for 1D state and 1D tradeoff. The algorithm is based on properties of the boundaries between regions where different actions are optimal. Approach: identify triple points.
[Figure: top-down view of the Q-function over (s, δ), showing the regions where each action is optimal; actions 2, 3, and 5 are not optimal for any (s, δ).]

Reality Check
We must compute β_a(δ) at the knot δs between linear regions. At time T-t, there could be O(n^(T-t) |A|^(T-t)) knots in the worst case. Is this even feasible? Consider 1000 randomly generated datasets with n = 1290, |A| = 3, T = 3, and parameters similar to real data. The maximum time for one simulation run was 6.55 seconds on 8 processors.

         Worst-case #knots   Observed Min   Observed Med   Observed Max
t = 2    3870                687            790            910
t = 1    1.5 × 10^7          2814           3160           3916

Overview
- Why are we interested in multiple reward functions? Medical research applications for RL.
- Algorithm for the discrete (tabular) case: more efficient than previous approaches.
- Algorithm for linear function approximation: previous approaches not applicable.

Future Work - Computing Science, Statistics
- Allow more state variables. For backups: easy! Each element of β is piecewise linear in δ. When checking for dominated actions, 2 reward functions plus 2 state variables is feasible (or 3 reward functions plus 1 state variable).
- Allow more reward functions. For backups: 3 reward functions are feasible. Representing non-convex continuous piecewise-linear functions in high dimensions appears difficult.
- Approximations, now that we know what we are approximating.
- Measures of uncertainty for preference ranges.

Future Work - Clinical Science
1. Schizophrenia: symptom reduction versus functionality, or weight gain.
2. Major Depressive Disorder: symptom reduction versus weight gain and other side-effects.
3. Diabetes: disease complications versus drug side-effects.

Thanks! Supported by National Institutes of Health grants R01 MH080015 and P50 DA10075. Questions?
Related work: Barrett, L. and Narayanan, S. Learning all optimal policies with multiple criteria. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), 2008.