Preference-based Reinforcement Learning


1 Preference-based Reinforcement Learning
Róbert Busa-Fekete 1,2, Weiwei Cheng 1, Eyke Hüllermeier 1, Balázs Szörényi 2,3, Paul Weng 4
1 Computational Intelligence Group, Philipps-Universität Marburg
2 Research Group on Artificial Intelligence of the Hungarian Academy of Sciences and University of Szeged (RGAI)
3 INRIA Lille - Nord Europe, SequeL project, 40 avenue Halley, Villeneuve d'Ascq, France
4 Laboratory of Computer Science, Université Pierre et Marie Curie
EWRL 2013 / Dagstuhl, August 6, 2013

2 Preference-based Reinforcement Learning
Motivating example: medical treatment design
Direct policy search (DPS)
Evolutionary direct policy search approach
Preference-based Evolutionary Direct Policy Search (PB-EDPS)
Preference-based Racing (PBR)

3 Many problems where it is hard to define a reasonable reward function:
- the task of driving [Abbeel and Ng, 2004]
- medical treatment design [Zhao et al., 2009]
Aggregation of rewards: one may not always be willing or able to combine several rewards into a single one (multi-objective reinforcement learning).
Episodic setup: trajectory h obtained by following policy π, trajectory h' by following policy π'.
Given h and h', it might be easier to decide which one is preferred (at least in some problems).
The piece of information we want to learn from is preferences over simulations!

4 Motivating example: medical treatment design [Zhao et al., 2009]
Virtual patient with cancer.
The state captures some essential factors in cancer treatment: tumor size and toxicity.
Episodic setup: an episode corresponds to the treatment of a patient over six months.
The action is the dosage level itself.
Transitions:
- The tumor is constantly growing (without treatment, or if the dosage is too low).
- The higher the dose selected, the higher the toxicity, and the more tumor growth is inhibited.
- The higher the toxicity and the tumor size, the higher the probability of the patient's death.

5 Motivating example [Zhao et al., 2009]
Terminal state: end of the sixth month, or the patient dies.
The reward is defined based on the wellness of the patient:
- tumor size: -5, 5, 15
- toxicity level: -5, 0, 5
- the reward assigned to death is -60
Based on the wellness of patients, it is straightforward to define a preference relation over treatments. Given two trajectories h1 and h2 generated by following two different treatments: h1 ≻ h2 if the patient survives under h1 but not under h2; otherwise, Pareto dominance: h1 ≻ h2 if the tumor size AND the toxicity level are both smaller under h1.
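
A minimal sketch of this preference relation, assuming a trajectory is summarized by a hypothetical triple (died, tumor_size, toxicity) at the end of the treatment; this encoding is illustrative only, not the exact representation used in the paper.

```python
def prefer(h1, h2):
    """Return +1 if h1 is preferred, -1 if h2 is preferred, 0 if incomparable."""
    died1, tumor1, tox1 = h1
    died2, tumor2, tox2 = h2
    if died1 != died2:                       # survival dominates everything else
        return -1 if died1 else +1
    if tumor1 < tumor2 and tox1 < tox2:      # Pareto dominance: both criteria strictly smaller
        return +1
    if tumor2 < tumor1 and tox2 < tox1:
        return -1
    return 0                                 # incomparable

print(prefer((False, 1.2, 0.4), (False, 2.0, 0.9)))  # +1: smaller tumor and lower toxicity
print(prefer((False, 3.0, 0.2), (False, 2.0, 0.9)))  # 0: incomparable
```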

6 Point of departure: preferences
There is no reward function (it is hard to define a reasonable one), and the goal is not to find a reward function!
The piece of information we want to learn from is preferences over trajectories: a partial order over trajectories h ∈ H^(T),
- given by a tutor or an expert, or
- extracted from the trajectories themselves.

7 Decision model
Decision model: lifting the preference relation on H^(T) to a preference relation on the space of policies.
Intermezzo: for a fixed MDP, each policy π generates a probability distribution over the set of trajectories, denoted by P^π; a policy can thus be viewed as a random variable whose realizations are trajectories.
s(π, π') = E_{h ~ P^π, h' ~ P^{π'}} [ I{h ≻ h'} ]   (the probability that π beats π')
Ordinal decision model: π ≻ π' if and only if s(π', π) < s(π, π')
Alternative decision model?

8 Preference-based Evolutionary Direct Policy Search

9 Direct policy search (DPS)
1. Parametric policy space: Π = {π_Θ : Θ ∈ R^d}, for example the space of linear policies π_w(s) = w^T s if S ⊆ R^d.
2. The policy search can be viewed as an optimization task: Π is the search space, and some policy evaluation measure is the target function.
[Diagram: taxonomy of DPS methods. Gradient-based approaches, e.g. REINFORCE [Williams, 1992]; gradient-free approaches, e.g. evolutionary DPS [Heidrich-Meisner and Igel, 2009] with evolution-strategy-based optimizers or the cross-entropy method. Preference-based evolutionary DPS belongs to the gradient-free branch.]
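
A tiny sketch of such a parametric (linear) policy; the state is assumed to be given as a feature vector, and the concrete weights and features are illustrative only.

```python
import numpy as np

def make_linear_policy(w):
    """pi_w(s) = w^T s: maps a state feature vector s to a real-valued action (e.g. a dosage)."""
    w = np.asarray(w, dtype=float)
    return lambda s: float(np.dot(w, np.asarray(s, dtype=float)))

pi = make_linear_policy([0.5, 0.25])
print(pi([1.0, 2.0]))   # -> 1.0
```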

10 Evolutionary direct policy search approach [Heidrich-Meisner and Igel, 2009]
Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [Hansen and Kern, 2004]: it maintains a distribution over the solution space (in this case, over the space of policies).
The expected total reward is optimized; it can be estimated based on a finite set of trajectories {h_1, ..., h_n} ~ P^π as
ρ_π^(n) = (1/n) Σ_{i=1}^{n} V(h_i),
where V(·) is the cumulative reward.

11 Evolutionary direct policy search [Heidrich-Meisner and Igel, 2009]
Repeat these three steps until convergence:
1. Generate a population of candidate solutions (in this case, a set of policies with different parameters): π_{Θ_1}, ..., π_{Θ_λ} where Θ_1, ..., Θ_λ ~ N(m, Σ).
2. Evaluate the candidate solutions (estimate the performance of each policy based on simulations {h_1, ..., h_n} ~ P^{π_{Θ_i}}): ρ_{π_{Θ_i}}^(n) = (1/n) Σ_{j=1}^{n} V(h_j), and select the best μ individuals.
3. Update m and Σ using the parameters of the best μ individuals/policies.
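
A hedged sketch of this generate/evaluate/update loop, using a simplified diagonal-Gaussian (μ, λ) evolution strategy in place of full CMA-ES; rollout_return(θ) is an assumed helper that runs one episode with π_θ and returns its cumulative reward V(h).

```python
import numpy as np

def edps(rollout_return, dim, lam=10, mu=3, n=30, iters=50, seed=0):
    """Simplified evolutionary DPS loop (not full CMA-ES)."""
    rng = np.random.default_rng(seed)
    m, sigma = np.zeros(dim), np.ones(dim)                         # search distribution N(m, diag(sigma^2))
    for _ in range(iters):
        thetas = m + sigma * rng.standard_normal((lam, dim))       # 1. generate lambda candidates
        scores = [np.mean([rollout_return(t) for _ in range(n)])   # 2. Monte Carlo estimate rho^(n)
                  for t in thetas]
        best = thetas[np.argsort(scores)[::-1][:mu]]               #    keep the best mu policies
        m = best.mean(axis=0)                                      # 3. update the mean ...
        sigma = best.std(axis=0) + 1e-8                            #    ... and the (diagonal) spread
    return m
```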

12 Racing algorithm [Heidrich-Meisner and Igel, 2009]
In the bandit literature, these algorithms are called PAC bandits.

13 Basic idea
1. Direct motivation: evolution strategy optimizers only need a ranking of the candidates; they do not need the function values themselves.
2. GOAL: devise a racing algorithm that uses only pairwise comparisons of random samples (in this case, trajectories) and is able to select the best policies with respect to the decision model (≻).
3. This naturally gives rise to a preference-based policy search method.

14 Recall the decision model
s(π, π') = E_{h ~ P^π, h' ~ P^{π'}} [ I{h ≻ h'} ]
π ≻ π' if and only if s(π', π) < s(π, π')
There can be preferential cycles: π ≻ π' AND π' ≻ π'' AND π'' ≻ π, so "select the best options" is not a well-defined task.
Practical solution: a surrogate ranking model. Given π_1, ..., π_K,
π_i ≻_C π_j if and only if d_i < d_j, where d_i = |{k : π_k ≻ π_i, k ≠ i}|.
It is a complete preorder, since it has a numeric representation (d_i).
Unfortunately, the preference relation ≻_C depends on the set of policies considered.
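
A small sketch of this surrogate ranking: given an (estimated) matrix S with S[i, j] ≈ s(π_i, π_j), it computes the degrees d_i and ranks the policies by increasing d_i. The example matrix is made up and deliberately contains a preferential cycle.

```python
import numpy as np

def copeland_rank(S):
    """Rank policies by d_i = |{k : pi_k beats pi_i}| (smaller is better)."""
    S = np.asarray(S, dtype=float)
    K = S.shape[0]
    d = np.array([sum(1 for k in range(K) if k != i and S[k, i] > S[i, k])
                  for i in range(K)])
    return np.argsort(d), d          # indices from best to worst, and the degrees

# Cyclic example: pi_0 beats pi_1, pi_1 beats pi_2, pi_2 beats pi_0.
order, d = copeland_rank([[0.5, 0.7, 0.2],
                          [0.3, 0.5, 0.6],
                          [0.8, 0.4, 0.5]])
print(order, d)                      # all degrees equal 1, so any order is consistent
```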

15 An example of the surrogate ranking model
[Figure: preference graph over π_1, ..., π_5; an edge from π_i to π_j means π_i ≻ π_j. The resulting degrees are d_2 = 4, d_1 = 3, d_3 = d_4 = d_5 = 1.]

16 Concentration property of s(·, ·)
s(π, π') = E_{h ~ P^π, h' ~ P^{π'}} [ I{h ≻ h'} ]
π ≻ π' if and only if s(π', π) < s(π, π')
π_i ≻_C π_j if and only if d_i < d_j, where d_i = |{k : π_k ≻ π_i, k ≠ i}|
s(π, π') can be estimated based on finite sets of trajectories {h_1, ..., h_n} ~ P^π and {h'_1, ..., h'_{n'}} ~ P^{π'} as
ŝ(π, π') = (1/(n n')) Σ_{i=1}^{n} Σ_{j=1}^{n'} I{h_i ≻ h'_j}.
Hoeffding bound for U-statistics, two-sample case [Hoeffding, 1963, Section 5b]: for any ε > 0,
P( |ŝ(π, π') − s(π, π')| ≥ ε ) ≤ 2 exp( −2 min(n, n') ε² ).
Empirical Bernstein bound?
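
A sketch of the two-sample estimator and the Hoeffding-style confidence radius above; prefer(h, h') is an assumed comparison oracle returning +1/0/−1, with incomparable pairs counted as 1/2 (anticipating the Hemelrijk correction on slide 23).

```python
import math

def s_hat(H, Hp, prefer):
    """U-statistic estimate of s(pi, pi') from trajectory sets H ~ pi and Hp ~ pi'."""
    total = 0.0
    for h in H:
        for hp in Hp:
            r = prefer(h, hp)                        # +1: h preferred, -1: hp preferred, 0: tie
            total += 1.0 if r > 0 else (0.5 if r == 0 else 0.0)
    return total / (len(H) * len(Hp))

def hoeffding_radius(n, n_prime, delta=0.05):
    """Width eps with P(|s_hat - s| >= eps) <= delta, from 2*exp(-2*min(n, n')*eps^2) <= delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * min(n, n_prime)))
```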

17 Preference-based racing
We have an efficient estimator for s(π_i, π_j), and we can calculate a confidence interval for s(π_i, π_j).
[Figure: example with K = 7 policies and κ = 3 to be selected; an edge from π_i to π_j indicates that ŝ(π_i, π_j) is significantly bigger than ŝ(π_j, π_i).]
Expected sample complexity: analysis in the style of Even-Dar et al. [2002], in terms of the gaps Δ_{i,j} = 1/2 − s(π_i, π_j).

18 Preference-based evolutionary direct policy search
Repeat these three steps until convergence:
1. Generate a population of candidate solutions (in this case, a set of policies with different parameters): π_{Θ_1}, ..., π_{Θ_λ} where Θ_1, ..., Θ_λ ~ N(m, Σ).
2. Select the best μ individuals by using the preference-based racing algorithm.
3. Update m and Σ using the parameters of the best μ individuals/policies.
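
The loop above mirrors the EDPS sketch from slide 11, with value-based selection replaced by preference-based racing. Below is a minimal Python sketch under that assumption; sample_trajectory(θ) (runs one episode with π_θ) and pbr_select (a preference-based racing routine, e.g. the PBR sketch given after Algorithm 1) are assumed helpers, and plain per-coordinate Gaussian updates stand in for full CMA-ES.

```python
import numpy as np

def pb_edps(sample_trajectory, prefer, pbr_select, dim,
            lam=10, mu=3, n_max=50, delta=0.05, iters=50, seed=0):
    """Simplified PB-EDPS loop: ES generation + preference-based racing selection."""
    rng = np.random.default_rng(seed)
    m, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        thetas = m + sigma * rng.standard_normal((lam, dim))          # 1. generate lambda candidates
        best_idx = pbr_select(list(thetas), sample_trajectory, prefer,
                              kappa=mu, n_max=n_max, delta=delta)     # 2. racing picks mu policies
        best = thetas[np.asarray(best_idx)]
        m = best.mean(axis=0)                                         # 3. update the search distribution
        sigma = best.std(axis=0) + 1e-8
    return m
```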

19 The relation of ≻ and ≻_C (only locally valid)
π_i is a Condorcet winner among a set of policies π_1, ..., π_K if π_l ≺ π_i for all l ≠ i. If the Condorcet winner exists, it is the largest element of ≻_C.
The Smith set is the smallest non-empty set D ⊆ {π_1, ..., π_K} satisfying π_j ≺ π_i for all π_i ∈ D and π_j ∈ {π_1, ..., π_K} \ D.
Proposition: Let Π = {π_1, ..., π_K} be a set of random variables for which there exists a Smith set D of size K_D. Then for any π_i ∈ D and π_j ∈ Π \ D, π_j ≺_C π_i.
[Figure: preference graphs over π_1, ..., π_5 illustrating the proposition, with degrees d_5 = d_4 = d_3 = 1, d_1 = 3, d_2 = 4.]
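
As an illustration of the Smith set, here is a small sketch that computes it from a pairwise win matrix, under the assumption of strict preferences (a tournament); in that case the Smith set is the top cycle, i.e. the policies that reach every other policy along "beats" edges. This is a standard construction given here only for illustration.

```python
import numpy as np

def smith_set(beats):
    """Smith set of a tournament given as a boolean matrix: beats[i, j] iff pi_i beats pi_j."""
    beats = np.asarray(beats, dtype=bool)
    K = beats.shape[0]
    reach = beats.copy()
    for k in range(K):                                  # Floyd-Warshall style transitive closure
        reach |= reach[:, [k]] & reach[[k], :]
    return [i for i in range(K)
            if all(reach[i, j] for j in range(K) if j != i)]

# Three policies forming a cycle: no Condorcet winner, and the Smith set is all of them.
print(smith_set([[False, True, False],
                 [False, False, True],
                 [True, False, False]]))                # -> [0, 1, 2]
```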

20 Issues to be discussed
The existence of global optima.
If there exists a global Condorcet winner, under what assumptions can we find it (w.h.p.) by using an evolution strategy along with the preference-based racing algorithm?
The Hoeffding bound is loose: use Clopper-Pearson-type confidence bounds for trinomial random variables.

21 Bibliography I
P. Abbeel and A.Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the 21st International Conference on Machine Learning, 2004.
E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In Proceedings of the 15th Annual Conference on Computational Learning Theory, 2002.
N. Hansen and S. Kern. Evaluating the CMA evolution strategy on multimodal test functions. In Parallel Problem Solving from Nature - PPSN VIII, 2004.
V. Heidrich-Meisner and C. Igel. Hoeffding and Bernstein races for selecting policies in evolutionary direct policy search. In Proceedings of the 26th International Conference on Machine Learning, 2009.
J. Hemelrijk. Note on Wilcoxon's two-sample test when ties are present. The Annals of Mathematical Statistics, 23(1), 1952.

22 Bibliography II
R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 1992.
Y. Zhao, M.R. Kosorok, and D. Zeng. Reinforcement learning design for cancer clinical trials. Statistics in Medicine, 28(26), 2009.

23 Preference-based racing for ≻_C
One can estimate s(π, π') based on finite sets of trajectories {h_1, ..., h_n} ~ P^π and {h'_1, ..., h'_{n'}} ~ P^{π'}:
ŝ(π, π') = (1/(n n')) Σ_{i=1}^{n} Σ_{j=1}^{n'} I{h_i ≻ h'_j}.
Incomparable trajectories: solution by [Hemelrijk, 1952]:
I(x, x') = 1 if x ≻ x', 0 if x ≺ x', and 1/2 otherwise.
Probabilistic interpretation: if two samples are incomparable, then we select one of them as the preferred one with probability 1/2.
As a consequence, s(π, π') = 1 − s(π', π).

24 Preference-based racing: optimization view
Preference-based case: PBR(π_1, ..., π_K, κ, n_max, δ) returns, with probability at least 1 − δ,
argmax_{I ⊆ {1,...,K}, |I| = κ} Σ_{i ∈ I} Σ_{j ≠ i} I{π_j ≺_C π_i}.
Since s(π_i, π_j) = 1 − s(π_j, π_i), this amounts to
argmax_{I ⊆ {1,...,K}, |I| = κ} Σ_{i ∈ I} Σ_{j ≠ i} I{s(π_i, π_j) > 1/2}.   (1)
We have an efficient estimator of s(π_i, π_j).

25 Algorithm 1 PBR(π_1, ..., π_K, κ, n_max, δ)
1: A = {(i, j) : 1 ≤ i, j ≤ K}, n = 0
2: while (n ≤ n_max) AND (|A| > 0) do
3:   for all i appearing in A do
4:     generate trajectory h_i^(n) by running π_i on the MDP M
5:   end for
6:   for all (i, j) ∈ A do
7:     update ŝ_{i,j} = (1/n²) Σ_{l=1}^{n} Σ_{l'=1}^{n} I{h_i^(l) ≻ h_j^(l')}
8:     c_{i,j} = sqrt( (1/(2n)) log(2K²n_max/δ) ),  u_{i,j} = ŝ_{i,j} + c_{i,j},  l_{i,j} = ŝ_{i,j} − c_{i,j}
9:   end for
10:  for i = 1, ..., K do
11:    z_i = |{j : u_{i,j} < 1/2, j ≠ i}|,  o_i = |{j : l_{i,j} > 1/2, j ≠ i}|
12:  end for
13:  C = { i : K − κ < |{j : K − z_j < o_i}| }   (select)
14:  D = { i : κ < |{j : K − o_j < z_i}| }   (discard)
15:  for (i, j) ∈ A do
16:    if (i, j ∈ C ∪ D) OR (1/2 ∉ [l_{i,j}, u_{i,j}]) then
17:      A = A \ {(i, j)}   (do not update ŝ_{i,j} any more)
18:    end if
19:  end for
20:  n = n + 1
21: end while
22: return the κ options i with the most ŝ_{i,j} above 1/2
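
A simplified Python sketch of Algorithm 1. It maintains the pairwise estimates ŝ_{i,j} with Hoeffding intervals and stops sampling pairs once their comparison is decided, but it omits the select/discard bookkeeping of lines 13-14, so treat it as an illustration of the idea rather than a faithful reimplementation. sample_trajectory(π) and prefer(h, h') (returning +1/0/−1) are assumed helpers.

```python
import math
import numpy as np

def pbr(policies, sample_trajectory, prefer, kappa, n_max, delta):
    """Simplified preference-based racing: return indices of kappa selected policies."""
    K = len(policies)
    histories = [[] for _ in range(K)]
    s_hat = np.full((K, K), 0.5)
    active = {(i, j) for i in range(K) for j in range(K) if i != j}
    n = 0
    while n < n_max and active:
        for i in {i for pair in active for i in pair}:
            histories[i].append(sample_trajectory(policies[i]))       # line 4: one more rollout of pi_i
        n += 1
        c = math.sqrt(math.log(2 * K * K * n_max / delta) / (2 * n))  # line 8: confidence radius
        for (i, j) in active:                                         # lines 6-7: update s_hat_{i,j}
            total = 0.0
            for h in histories[i]:
                for hp in histories[j]:
                    r = prefer(h, hp)
                    total += 1.0 if r > 0 else (0.5 if r == 0 else 0.0)
            s_hat[i, j] = total / (n * n)
        u, l = s_hat + c, s_hat - c
        # lines 15-19 (simplified): keep sampling only the pairs whose comparison is undecided
        active = {(i, j) for (i, j) in active if l[i, j] <= 0.5 <= u[i, j]}
    # line 22: return the kappa options with the most estimates above 1/2
    score = [sum(s_hat[i, j] > 0.5 for j in range(K) if j != i) for i in range(K)]
    return sorted(range(K), key=lambda i: -score[i])[:kappa]
```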

26 Cancer treatment
1. The state space consists of the toxicity level and the tumor size (X, Y).
2. Linear policy space.
3. Each policy search method was trained 100 times, and each policy was evaluated on 300 virtual patients.
4. 6-month treatment.
5. Transitions: X_{t+1} = X_t + ΔX_t and Y_{t+1} = Y_t + ΔY_t, where
ΔY_t = [a_1 max(X_t, X_0) − b_1 (D_t − d_1)] · I{Y_t > 0}
ΔX_t = a_2 max(Y_t, Y_0) + b_2 (D_t − d_2)
6. Probability of death: 1 − exp(−exp(c_0 + c_1 X_t + c_2 Y_t))
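
A sketch of one monthly transition of the virtual-patient dynamics above. The coefficients a1, b1, a2, b2, d1, d2, c0, c1, c2 are not given on the slide, so the values below are placeholders; the sign of the dose term in the toxicity update follows the verbal description on slide 4 (higher dose, higher toxicity), and evaluating the death hazard on the updated state is a modelling choice for this sketch.

```python
import math
import random

def step(X, Y, D, X0, Y0,
         a1=0.1, b1=1.2, d1=0.5,      # tumor-change coefficients (placeholders)
         a2=0.15, b2=1.2, d2=0.5,     # toxicity-change coefficients (placeholders)
         c0=-4.0, c1=1.0, c2=1.0,     # death-hazard coefficients (placeholders)
         rng=random):
    """One transition: X = toxicity, Y = tumor size, D = dosage level in [0, 1]."""
    dY = (a1 * max(X, X0) - b1 * (D - d1)) * (1.0 if Y > 0 else 0.0)   # tumor growth, inhibited by dose
    dX = a2 * max(Y, Y0) + b2 * (D - d2)                               # toxicity builds up with dose
    X_next, Y_next = X + dX, max(Y + dY, 0.0)                          # clamp at 0 (safety assumption)
    p_death = 1.0 - math.exp(-math.exp(c0 + c1 * X_next + c2 * Y_next))
    died = rng.random() < p_death
    return X_next, Y_next, died
```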

27 Cancer treatment
[Figure: toxicity vs. tumor size over 100 repetitions of the training process, with the Pareto front. Legend (value in parentheses for each method): Random (0.355), Const. π(s) = 0.1 (0.405), Const. π(s) = 0.4 (0.318), Const. π(s) = 0.7 (0.383), Const. π(s) = 1.0 (0.448), SARSA(λ) (0.337), PBPI (0.259), EDPS (0.283), PB-EDPS (0.209).]
