Nonparametric Bayesian Inverse Reinforcement Learning
1 PRML Summer School 2013 Nonparametric Bayesian Inverse Reinforcement Learning Jaedeug Choi
2 Sequential Decision Making (1) Multiple decisions over time are made to achieve goals Reinforcement learning (RL): principled approach to sequential decision making problems Mario Bros. [Gibson, 2 / 82
3 Sequential Decision Making (2) Multiple decisions over time are made to achieve goals Reinforcement learning (RL): principled approach to sequential decision making problems Robotic goalkeeper [Busoniu, 3 / 82
4 Sequential Decision Making (3) Multiple decisions over time are made to achieve goals Reinforcement learning (RL): principled approach to sequential decision making problems Spoken dialogue management system [Young, 4 / 82
5 Today's Talk Reinforcement learning (RL) Inverse reinforcement learning (IRL) Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and summary 5 / 82
6 Framework of RL (1) Agent: decision maker Environment: anything external to agent [Russell & Norvig] 6 / 82
7 Framework of RL (2) Agent: decision maker Environment: anything external to agent Reward: describes agent's goal or preference (diagram: agent sends action to environment, receives situation and reward) [Mario wiki] 7 / 82
8 Objective of RL Policy: behavior strategy Find policy maximizing cumulative reward 8 / 82
9 Supervised Learning vs. RL Supervised learning: learning from examples For each input x, correct output y is known Infer input-output relationship y = f(x) RL: learning from experience Correct outputs not available, only rewards Trial-and-error search and delayed reward (diagram: agent takes actions a_0, a_1, a_2 in situations s_0, s_1, s_2 and receives rewards r_0, r_1, r_2 from the environment) 9 / 82
10 Markov Decision Process (MDP) Definition Finite set of states s ∈ S Finite set of actions a ∈ A Transition function T(s, a, s') = P(s' | s, a) Reward function R(s) Assumptions Markovian transition model: P(s_t | s_{t−1}, a_{t−1}, …, s_1, a_1) = P(s_t | s_{t−1}, a_{t−1}) Completely observable environment [Puterman 94] 10 / 82
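The ⟨S, A, T, R⟩ tuple above maps naturally onto a small container class. A minimal sketch in Python (the class layout and the two-state toy chain are illustrative assumptions, not from the slides):

```python
import random

class MDP:
    """Minimal finite MDP container: states, actions, T(s, a, s'), R(s)."""
    def __init__(self, states, actions, T, R, gamma=0.9):
        self.states, self.actions = states, actions
        self.T, self.R, self.gamma = T, R, gamma   # T[(s, a)] -> {s': prob}

    def step(self, s, a):
        """Sample s' from T(s, a, .): by the Markov assumption the next state
        depends only on the current state and action."""
        nxt = self.T[(s, a)]
        return random.choices(list(nxt), weights=list(nxt.values()))[0]

# Two-state toy chain: 'go' flips the state 80% of the time, 'stay' never does.
mdp = MDP(states=["s1", "s2"], actions=["stay", "go"],
          T={("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s2": 0.8, "s1": 0.2},
             ("s2", "stay"): {"s2": 1.0}, ("s2", "go"): {"s1": 0.8, "s2": 0.2}},
          R={"s1": 0.0, "s2": 1.0})
```

The dict-of-dicts transition representation keeps each row of T an explicit probability distribution, which the later examples (value iteration, policy evaluation) can reuse.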
11 Example: Mario in Maze (1) States s ∈ S Position (x, y) of Mario: s_1 = (1,1), s_2 = (1,2), …, s_11 = (4,3) (4×3 grid of states s_1 … s_11 shown) [Russell & Norvig] 11 / 82
12 Example: Mario in Maze (2) Actions a A Move left, right, up, and down [Russell & Norvig] 12 / 82
13 Example: Mario in Maze (3) Transition function T(s, a, s') = P(s' | s, a) Moving in intended direction with probability 0.8, slipping sideways with probability 0.1 each, e.g., for moving right from s_1: T(s_1, →, s_4) = 0.8, T(s_1, →, s_2) = T(s_1, →, s_1) = 0.1 [Russell & Norvig] 13 / 82
14 Example: Mario in Maze (4) Reward function R(s) For terminal states: R(s_11) = +1, R(s_10) = −1 For non-terminal states: R(s) = −0.04, ∀s ∈ S \ {s_10, s_11} [Russell & Norvig] 14 / 82
15 Policy Policy π: mapping from states to actions π(s) = a: execute action a in state s Describes agent's behavior Following policy π (1) Determine current state s (2) Execute action π(s) (3) Go to step (1) 15 / 82
16 Learning Goal Return: discounted cumulative reward R_t^π = R(s_t) + γR(s_{t+1}) + γ²R(s_{t+2}) + ⋯ Discount factor γ ∈ [0, 1) Determines present value of future rewards Encodes increasing uncertainty about future Bounds infinite sum Optimal policy: maximizes expected return π* = argmax_π E[R_0^π] 16 / 82
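The return above is easiest to compute with the backward recursion R_t = r_t + γR_{t+1}. A minimal sketch (the helper name is my own):

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward
    sequence, via the backward recursion R_t = r_t + gamma * R_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The backward pass avoids recomputing powers of γ, which matters for long trajectories.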
17 Planning and Learning Planning Agent uses model to create or improve policy It has a-priori knowledge about model of environment (e.g., MDP) (Reinforcement) learning Agent doesn't have a-priori knowledge about model It may explicitly learn (i.e., construct) model from interaction, but not necessarily It faces exploration vs. exploitation dilemma Planning and learning are closely related [Kee-Eung Kim, Introduction to RL, PRMLWS 2012] 17 / 82
18 Value Functions Recall Return: R_t^π = R(s_t) + γR(s_{t+1}) + γ²R(s_{t+2}) + ⋯ = Σ_{k=0}^∞ γ^k R(s_{t+k}) V^π(s): state-value function for policy π Expected return when starting from s and following π V^π(s) = E_π[R_t | s_t = s] = E_π[Σ_{k=0}^∞ γ^k R(s_{t+k}) | s_t = s] Q^π(s, a): action-value function for policy π Expected return when starting from s, taking a, and following π Q^π(s, a) = E_π[R_t | s_t = s, a_t = a] = E_π[Σ_{k=0}^∞ γ^k R(s_{t+k}) | s_t = s, a_t = a] 18 / 82
19 Bellman Equation for V^π Basic idea R_t = r_t + γr_{t+1} + γ²r_{t+2} + ⋯ = r_t + γ(r_{t+1} + γr_{t+2} + ⋯) = r_t + γR_{t+1} Bellman equation V^π(s) = E_π[R_t | s_t = s] = E_π[r_t + γ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s] = R(s) + γ Σ_{s'} T(s, π(s), s') V^π(s') Immediate reward plus discounted expected value of next states 19 / 82
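The Bellman equation for V^π can be solved by treating it as an update rule and sweeping until the values stop changing (iterative policy evaluation). A minimal sketch, reusing the dict-based transition format and a hypothetical two-state chain:

```python
def policy_evaluation(states, T, R, pi, gamma=0.9, theta=1e-12):
    """Sweep V(s) <- R(s) + gamma * sum_s' T(s, pi(s), s') * V(s') until the
    largest change falls below theta (successive approximation)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = R[s] + gamma * sum(p * V[s2] for s2, p in T[(s, pi[s])].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

# Two-state chain (illustrative): 'go' flips the state 80% of the time.
T = {("s1", "go"): {"s2": 0.8, "s1": 0.2}, ("s1", "stay"): {"s1": 1.0},
     ("s2", "go"): {"s1": 0.8, "s2": 0.2}, ("s2", "stay"): {"s2": 1.0}}
V = policy_evaluation(["s1", "s2"], T, {"s1": 0.0, "s2": 1.0},
                      {"s1": "go", "s2": "stay"})
```

At the fixed point V satisfies the Bellman equation exactly: here V(s2) = 1 + 0.9·V(s2) gives V(s2) = 10.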
20 Optimal Value Functions Partial ordering over policies π ≥ π' if and only if V^π(s) ≥ V^{π'}(s) for all s ∈ S Optimal policy π*: π* ≥ π for all π Optimal policies share same optimal value functions V*(s) = max_π V^π(s) for all s ∈ S Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A Bellman optimality equations for V* and Q* V*(s) = max_a [R(s) + γ Σ_{s'} T(s, a, s') V*(s')] Q*(s, a) = R(s) + γ Σ_{s'} T(s, a, s') max_{a'} Q*(s', a') 20 / 82
21 Value Iteration Algorithm Iterative algorithm for solving Bellman equation V_{i+1}(s) ← max_a [R(s) + γ Σ_{s'} T(s, a, s') V_i(s')] As i → ∞, V_i converges to optimal value function Initialize V_0 arbitrarily, e.g., V_0(s) = 0 for all s ∈ S For i = 0 to ∞ For each s ∈ S: V_{i+1}(s) ← max_a [R(s) + γ Σ_{s'} T(s, a, s') V_i(s')] Until max_s |V_i(s) − V_{i+1}(s)| < θ Output policy π such that π(s) = argmax_a [R(s) + γ Σ_{s'} T(s, a, s') V(s')] 21 / 82
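The pseudocode above translates almost line for line into Python. A minimal sketch on a hypothetical two-state chain (the maze itself is larger but identical in structure):

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-10):
    """V(s) <- max_a [R(s) + gamma * sum_s' T(s,a,s') V(s')] until the largest
    change is below theta, then read off the greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(R[s] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    greedy = lambda s: max(actions, key=lambda a: R[s] + gamma * sum(
        p * V[s2] for s2, p in T[(s, a)].items()))
    return V, {s: greedy(s) for s in states}

# Reward 1 in s2, so the optimal policy heads there and stays.
T = {("s1", "go"): {"s2": 0.8, "s1": 0.2}, ("s1", "stay"): {"s1": 1.0},
     ("s2", "go"): {"s1": 0.8, "s2": 0.2}, ("s2", "stay"): {"s2": 1.0}}
V, pi = value_iteration(["s1", "s2"], ["stay", "go"], T, {"s1": 0.0, "s2": 1.0})
```

This uses in-place (Gauss–Seidel) sweeps rather than keeping separate V_i and V_{i+1} arrays; both converge to the same fixed point.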
22 Value Iteration Example (1) Initialize V_0(s) = 0 for all s ∈ S (grid shown with terminal values +1 and −1) 22 / 82
23 Value Iteration Example (2) V_1(s) ← max_a [R(s) + γ Σ_{s'} T(s, a, s') V_0(s')] (updated grid shown) 23 / 82
24 Value Iteration Example (3) V_2(s) ← max_a [R(s) + γ Σ_{s'} T(s, a, s') V_1(s')] For s_2, the backup R(s_2) + γ Σ_{s'} T(s_2, a, s') V_1(s') is evaluated for each of the four actions over the neighboring states s_1, s_2, s_3, s_4, and V_2(s_2) is the maximum of the four values (numeric grids omitted) 24-26 / 82
27 Value Iteration Example (4) V_3(s) ← max_a [R(s) + γ Σ_{s'} T(s, a, s') V_2(s')] (updated grid shown) 27 / 82
28 Value Iteration Example (5) V_21(s) ← max_a [R(s) + γ Σ_{s'} T(s, a, s') V_20(s')] (values have essentially converged) 28 / 82
29 Value Iteration Example (6) Output policy π(s) = argmax_a [R(s) + γ Σ_{s'} T(s, a, s') V(s')], computed by comparing the four action backups at each state, e.g., at s_2 (numeric grid omitted) 29 / 82
30 Value Iteration Example (7) Optimal policy π* (arrows on grid; terminal states +1 and −1) 30 / 82
31 Contents Reinforcement learning (RL) Markov decision process (MDP) Bellman equations for value functions Value iteration algorithm Inverse reinforcement learning (IRL) Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary 31 / 82
32 Applying RL to Real-World Problems Formalizing problems in framework of RL Dynamics model: relatively easy to obtain Computed by counting events Reward function: non-trivial to obtain Hand-tuned until satisfactory policy is obtained More systematic approach is required! 32 / 82
33 Example: Mario in Maze MDP Transition function: moving in intended direction with probability 0.8 Reward: for terminal states R(s) = ±1, for non-terminal states R(s) = −0.04 Optimal policy π* The optimal policy changes qualitatively for different ranges of R(s) (grids shown for several reward ranges; threshold values omitted) [Russell & Norvig] 33 / 82
34 Example: Autonomous Helicopter Control Reward function design Difficult to balance many desiderata But, easy to collect expert's behavior 34 / 82
35 Inverse Reinforcement Learning (IRL) Definition [Russell 98] Recovering domain expert's reward function from its behavior RL vs. IRL RL: given dynamics model and reward function, find optimal policy (prescribes action to take for each state) IRL: given dynamics model and expert's behavior, find reward function (represents goal or preference) Useful for robotics, human/animal behavior studies, neuroscience, economics, … [Argall et al. 09; Cohen & Ranganath 07; Hopkins 07; Borgers & Sarin 00] 35 / 82
36 IRL Problems IRL for MDP\R Assumption (1) Completely observable environments (MDP) (2) Expert behaves optimally Input (1) Dynamics model: transition function T (2) Expert's behavior: trajectory set D Output: reward function R making expert's policy π_E optimal Expert's behavior: trace of expert's policy π_E Set of trajectories D = {τ_1, τ_2, …, τ_M} Trajectory τ_m = ((s_{m,1}, a_{m,1}), …, (s_{m,H}, a_{m,H})) 36 / 82
37 IRL vs. Imitation Learning Apprenticeship learning via IRL Find good policy using R inferred by IRL Imitation learning [Argall et al. 09] Learn teacher's policy directly using supervised learning Fix class of policy: neural network, decision tree, … Estimate policy (= mapping from states to actions) from training examples (= trajectories in D) Advantages of IRL Reward function is succinct and transferable description of task: π vs. R 37 / 82
38 Matrix Notations (1) Definition of MDP ⟨S, A, T, R⟩ Set of states s ∈ S, set of actions a ∈ A Transition function T(s, a, s') = P(s' | s, a) Reward function R(s, a) Using matrix notations Transition function: |S||A| × |S| matrix T T^π: |S| × |S| matrix with T^π(s, s') = T(s, π(s), s') T_a: |S| × |S| matrix with T_a(s, s') = T(s, a, s') Reward function: |S||A|-dimensional vector R R^π: |S|-dimensional vector with R^π(s) = R(s, π(s)) R_a: |S|-dimensional vector with R_a(s) = R(s, a) 38 / 82
39 Matrix Notations (2) Bellman equations for value functions V^π(s) = R(s, π(s)) + γ Σ_{s'} T(s, π(s), s') V^π(s') ⟺ V^π = R^π + γT^π V^π Q^π(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') V^π(s') ⟺ Q_a^π = R_a + γT_a V^π Here V^π and R^π are |S|-dimensional vectors and T^π is an |S| × |S| matrix 39 / 82
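The payoff of the matrix form is that policy evaluation becomes a single linear solve: V^π = (I − γT^π)^{−1}R^π. A minimal numpy sketch on a two-state chain (the numbers are illustrative, not from the slides):

```python
import numpy as np

gamma = 0.9
# T_pi[s, s'] = T(s, pi(s), s') under a fixed policy; R_pi[s] = R(s, pi(s)).
T_pi = np.array([[0.2, 0.8],
                 [0.0, 1.0]])
R_pi = np.array([0.0, 1.0])

# V^pi = (I - gamma * T^pi)^{-1} R^pi  -- exact, no iteration needed
V_pi = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)
```

Since γ < 1 and T^π is a stochastic matrix, (I − γT^π) is always invertible, so the solve is well defined.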
40 Reward Optimality Condition IRL for MDP\R given policy π_E MDP\R ⟨S, A, T, γ⟩, π_E: S → A Find R making π_E optimal Necessary and sufficient condition for R to guarantee optimality of π_E [Ng & Russell 00] V^{π_E}(s; R) ≥ Q^{π_E}(s, a; R) for all s ∈ S and a ∈ A, i.e., V^{π_E}(R) ⪰ Q_a^{π_E}(R) for all a ∈ A Key fact used in many IRL algorithms 40 / 82
41 Reward Optimality Region (1) Reward optimality region for π Region of reward functions that yield π as optimal: [I − (I^A − γT)(I − γT^π)^{−1}E^π] R ⪯ 0 E^π: |S| × |S||A| matrix with E^π(s, (s', a)) = 1 if s = s' and π(s) = a I^A: stacking the |S| × |S| identity matrix |A| times Proof V^π = R^π + γT^π V^π ⟹ V^π = (I − γT^π)^{−1}R^π Q_a^π = R_a + γT_a V^π V^π(R) ⪰ Q_a^π(R) for all a ∈ A ⟺ R^π + γT^π V^π ⪰ R_a + γT_a V^π ⟺ R^π + γT^π(I − γT^π)^{−1}R^π ⪰ R_a + γT_a(I − γT^π)^{−1}R^π 41 / 82
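The condition V^π ⪰ Q_a^π can be checked numerically: evaluate the policy, back up each action's Q-values, and compare. A sketch on a hypothetical two-state MDP with a state-based reward (the simpler R(s) form from the first part of the talk):

```python
import numpy as np

gamma = 0.9
# T[a][s, s'] is the transition matrix of action a; R is a state-based reward.
T = {"stay": np.array([[1.0, 0.0], [0.0, 1.0]]),
     "go":   np.array([[0.2, 0.8], [0.8, 0.2]])}
R = np.array([0.0, 1.0])
pi = {0: "go", 1: "stay"}   # candidate policy to test for optimality under R

# Evaluate pi exactly: V^pi = (I - gamma * T^pi)^{-1} R
T_pi = np.array([T[pi[s]][s] for s in (0, 1)])
V = np.linalg.solve(np.eye(2) - gamma * T_pi, R)

# R yields pi as optimal iff V(s) >= Q(s, a) for every state and action
Q = {a: R + gamma * T[a] @ V for a in T}
is_optimal = all(np.all(V >= Q[a] - 1e-9) for a in T)
```

Heading toward the rewarding state and staying there does satisfy the condition here, so `is_optimal` is true; scaling R by any positive constant would not change that, which is the ill-posedness discussed next.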
42 Reward Optimality Region (2) Reward optimality region for π_E Region of reward functions that yield π_E as optimal: [I − (I^A − γT)(I − γT^{π_E})^{−1}E^{π_E}] R ⪯ 0 Convex polytope in reward space Difficulty of IRL problems Infinitely many reward functions guarantee optimality of π_E Degenerate case: R = 0 makes any policy optimal Inherently ill-posed 42 / 82
43 Ng and Russell's Algorithm IRL for MDP\R ⟨S, A, T, γ⟩ given policy π_E Choose R in reward optimality region for π_E Maximize sum of margins between π_E and all other actions Favor sparse reward function Optimization problem (linear programming formulation) max_R Σ_s [Q^{π_E}(s, π_E(s)) − max_{a ∈ A∖{π_E(s)}} Q^{π_E}(s, a)] − λ‖R‖_1 s.t. [I − (I^A − γT)(I − γT^{π_E})^{−1}E^{π_E}] R ⪯ 0 (reward optimality region), |R(s, a)| ≤ R_max ∀s ∈ S, a ∈ A First term: sum of margins; λ‖R‖_1: penalty on too many non-zero entries; λ: adjustable parameter [Ng & Russell 00] 43 / 82
44 Apprenticeship Learning (AL) via IRL Goal: learn policy from observing expert Find policy whose performance is as good as expert's policy, measured according to expert's unknown reward function Example: when teaching someone to drive We do not tell what the reward function is Difficult to balance many desiderata We demonstrate driving: preference for lane, safe following distance, keeping away from pedestrians, maintaining reasonable speed [Abbeel & Ng 04] 44 / 82
45 Reward Function Representation Known feature functions Φ: S × A → [0, 1]^D Φ = [φ_1, …, φ_D]: |S||A| × D feature matrix φ_d: |S||A|-dimensional feature vector Φ(s, a): D-dimensional feature vector Unknown weight vector w ∈ [−1, 1]^D Reward function is linear combination of feature functions R = Φw, or R(s, a) = wᵀΦ(s, a) = Σ_{d=1}^D w_d φ_d(s, a) Common assumption in IRL to address problems with large state space 45 / 82
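Concretely, R = Φw is one matrix-vector product. A sketch with a toy 4-row feature matrix (the feature names and weights are illustrative; in IRL the weights w are the unknowns):

```python
import numpy as np

# Feature matrix Phi: one row per (state, action) pair, D = 3 binary features.
Phi = np.array([[1, 0, 0],   # (s1, a1): reaches goal
                [0, 1, 0],   # (s1, a2): collects coin
                [0, 1, 1],   # (s2, a1): collects coin AND takes damage
                [0, 0, 0]])  # (s2, a2): nothing happens
w = np.array([1.0, 0.3, -0.5])   # weight per feature, each in [-1, 1]

R = Phi @ w   # reward vector over all (state, action) pairs
```

The state space can be huge, but the reward is pinned down by only D numbers, which is what makes the representation tractable for IRL.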
46 Example: Feature Functions in Mario Bros. Mario Bros. has huge state space Require compact representation of reward function Binary feature functions indicate: Mario successfully reaching end of level; Mario collecting coin; Mario killing enemy; Mario receiving damage from enemy; Mario getting killed by enemy; and so on 46 / 82
47 Feature Expectation (FE) Feature expectation Expected cumulative discounted sum of feature values μ^π = E[Σ_{t=0}^∞ γ^t Φ(s_t, π(s_t)) | s_0] ∈ R^D V^π(s_0) = E[Σ_{t=0}^∞ γ^t R(s_t, π(s_t)) | s_0] = E[Σ_{t=0}^∞ γ^t wᵀΦ(s_t, π(s_t)) | s_0] = wᵀμ^π Estimate expert's feature expectation μ^{π_E} Expert's policy π_E is not known Expert's behavior is given by set of trajectories D = {τ_1, τ_2, …, τ_M}, τ_m = ((s_{m,1}, a_{m,1}), …, (s_{m,H}, a_{m,H})) μ^{π_E} ≈ μ̂^{π_E} = (1/M) Σ_{m=1}^M Σ_{h=1}^H γ^{h−1} Φ(s_{m,h}, a_{m,h}) Denote μ_E = μ^{π_E} and μ̂_E = μ̂^{π_E} 47 / 82
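The empirical estimate μ̂_E is a discounted average over the observed trajectories. A sketch with two hypothetical trajectories (note `enumerate` starts at h = 0, which matches the 1-indexed γ^{h−1} of the formula):

```python
import numpy as np

gamma = 0.9
# phi maps each (state, action) pair to its D-dimensional feature vector.
phi = {("s1", "a1"): np.array([1.0, 0.0]),
       ("s1", "a2"): np.array([0.0, 1.0]),
       ("s2", "a1"): np.array([1.0, 1.0])}
# Two observed trajectories, each a list of (state, action) pairs.
D = [[("s1", "a1"), ("s2", "a1")],
     [("s1", "a2"), ("s2", "a1")]]

# mu_hat_E = (1/M) * sum_m sum_h gamma^(h-1) * phi(s_mh, a_mh)
mu_E = sum(sum((gamma ** h) * phi[sa] for h, sa in enumerate(tau))
           for tau in D) / len(D)
```

Each trajectory contributes [1, 0] or [0, 1] at step 0 plus 0.9·[1, 1] at step 1, so the average comes out to [1.4, 1.4].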
48 Closeness of FE and Performance Feature expectation μ^π: V^π(s_0) = wᵀμ^π Performance closeness is bounded by FE closeness For any underlying reward R = Φw: |V^{π_E}(s_0) − V^π(s_0)| = |wᵀμ_E − wᵀμ^π| ≤ ‖w‖_2 ‖μ_E − μ^π‖_2 ≤ ‖μ_E − μ^π‖_2 since ‖w‖_2 ≤ ‖w‖_1 ≤ 1 by assumption So: find policy π whose feature expectation is close to μ_E ≈ μ̂_E 48 / 82
49 Algorithm for AL via IRL Alternate RL step and IRL step RL step: compute optimal policy for estimated reward IRL step: estimate reward that makes π_E perform better than all previously found policies Initialize w arbitrarily and set Π ← ∅ Repeat [RL step] Compute optimal policy π for MDP with R = Φw, Π ← Π ∪ {π} [IRL step] Solve max_{x,w} x s.t. wᵀμ_E ≥ wᵀμ^π + x ∀π ∈ Π, ‖w‖_2 ≤ 1 Until x ≤ ε The IRL step finds w s.t. wᵀμ_E ≥ wᵀμ^π ∀π: a quadratically constrained programming problem 49 / 82
50 Experiments: Simulated Highway Learn different driving styles Nice Nasty Right lane nice [Abbeel & Ng 04] 50 / 82
51 Contents Reinforcement learning (RL) Inverse reinforcement learning (IRL) Reward optimality condition Ng and Russell s algorithm Apprenticeship learning via IRL Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary 51 / 82
52 Difficulty of IRL Problems Solution of IRL is not unique: ill-posed Infinitely many reward functions guarantee optimality of π_E Even degenerate case R = 0 makes any policy optimal Previous heuristics to address ill-posedness Maximizing margin between expert's policy and all other policies [Ng & Russell 00; Ratliff et al. 06] Minimizing deviation from expert's policy [Neu & Szepesvari 07] Maximizing worst case performance compared to expert's policy [Syed & Schapire 07] Adopting maximum entropy principle for choosing learned policy [Ziebart et al. 08] 52 / 82
53 Bayesian Inference One of the two dominant paradigms of statistical inference Estimate probability that hypothesis is true as additional evidence is acquired Prior P(H): probability hypothesis is true before any evidence is acquired Likelihood P(E | H): compatibility of evidence with given hypothesis Posterior P(H | E) ∝ P(E | H) P(H): probability of hypothesis given observed evidence 53 / 82
54 Bayesian Framework for IRL Evidence Expert's behavior: set of trajectories D = {τ_1, τ_2, …, τ_M}, τ_m = ((s_{m,1}, a_{m,1}), …, (s_{m,H}, a_{m,H})) Prior P(R): preference or external knowledge on reward functions Likelihood P(D | R): compatibility of reward function with given behavior data Posterior P(R | D) ∝ P(D | R) P(R) [Ramachandran & Amir 07] 54 / 82
55 Prior Assume rewards are i.i.d. P(R) = Π_{s,a} P(R(s, a)) Example distributions Uniform distribution over [−R_max, R_max] No knowledge about rewards other than its range Normal or Laplacian distribution with zero mean P(R(s, a) = r) ∝ exp(−r²/2σ²) or exp(−|r|/2σ) Prefer sparse rewards Beta distribution with α = β = 1/2 P(R(s, a) = r) ∝ (r/R_max)^{−1/2} (1 − r/R_max)^{−1/2} Planning problem: most states have small rewards but a few states (e.g., goal) have high reward 55 / 82
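Because the entries are i.i.d., the log-prior is just a sum of per-entry terms, which is the form MCMC and MAP methods actually use. A sketch of the Gaussian and Laplacian cases (function names are my own; the Laplacian normalizer 4σ follows from the slides' exp(−|r|/2σ) form):

```python
import math

def log_prior_gaussian(R, sigma=1.0):
    """i.i.d. zero-mean Gaussian prior: log P(R) = sum_{s,a} log N(R(s,a); 0, sigma^2)."""
    return sum(-r * r / (2 * sigma ** 2) - 0.5 * math.log(2 * math.pi * sigma ** 2)
               for r in R)

def log_prior_laplacian(R, sigma=1.0):
    """i.i.d. zero-mean Laplacian prior p(r) = exp(-|r|/2sigma) / (4sigma);
    penalizes |r| linearly, so it prefers sparse rewards more strongly."""
    return sum(-abs(r) / (2 * sigma) - math.log(4 * sigma) for r in R)
```

Either function plugs directly into the posterior P(R | D) ∝ P(D | R)P(R) on the next slides as the log P(R) term.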
56 Likelihood Expert's behavior D = {τ_1, …, τ_M}, τ_m = ((s_{m,1}, a_{m,1}), …, (s_{m,H}, a_{m,H})) Assumption: π_E is stationary, so P(D | R) = Π_m Π_{(s,a)∈τ_m} P(s, a | R) Expert behaves optimally with respect to R P(s, a | R) = (1/Z(s; R)) exp(Q*(s, a; R)), Z(s; R) = Σ_{a'∈A} exp(Q*(s, a'; R)) Likelihood of D P(D | R) = Π_m Π_{(s,a)∈τ_m} (1/Z(s; R)) exp(Q*(s, a; R)) (plot: P(s, a | R) as a softmax over Q*(s, a; R) for actions a_1 … a_4) 56 / 82
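In log space the likelihood is a sum of softmax terms, one per observed (s, a) pair. A minimal sketch that assumes Q*(s, a; R) has already been computed (e.g., by value iteration) and is passed in as a dict:

```python
import math

def log_likelihood(D, Q):
    """log P(D | R) under the Boltzmann model of the slides:
    P(s, a | R) = exp(Q*(s, a; R)) / sum_a' exp(Q*(s, a'; R)).
    Q is a dict {(s, a): value}, precomputed for the candidate reward R."""
    actions = sorted({a for (_, a) in Q})
    ll = 0.0
    for tau in D:
        for s, a in tau:
            log_Z = math.log(sum(math.exp(Q[(s, b)]) for b in actions))
            ll += Q[(s, a)] - log_Z
    return ll

# One observed pair, two equally-valued actions: probability 1/2 each.
Q = {("s", "a1"): 0.0, ("s", "a2"): 0.0}
ll = log_likelihood([[("s", "a1")]], Q)   # log(1/2)
```

Rewards under which the demonstrated actions have higher Q-values get higher likelihood, which is exactly how the data pulls on the posterior.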
57 Posterior Posterior: P(R | D) ∝ P(D | R) P(R) Prior: P(R) = Π_{s,a} P(R(s, a)) Likelihood: P(D | R) = Π_m Π_{(s,a)∈τ_m} (1/Z(s; R)) exp(Q*(s, a; R)) (graphical model: Gaussian prior with parameters μ, σ over the |S||A| reward entries R(s, a), generating the observed pairs (s_{m,h}, a_{m,h}), h = 1…H, m = 1…M) 57 / 82
58 Example: 5-State Chain MDP with 5 states arranged in chain Action a_1: moves right with prob. 0.6, moves left with prob. 0.4 Action a_2: always moves to state s_1 True reward R = [0.1, 0, 0, 0, 1] Expert's policy π_E: optimal policy on R, taking a_1 in every state s_1 … s_5 Prior: R(s_2) = R(s_3) = R(s_4) = 0, P(R(s_1)) = N(0.1, 1), P(R(s_5)) = N(1, 1) (plot: true reward, MAP reward, posterior mean reward, and reward optimality region) 58 / 82
59 Algorithms for BIRL PolicyWalk: MCMC algorithm [Ramachandran & Amir 07] Generate samples from posterior P(R | D) Return sample mean as estimate of true mean of posterior Generate Markov chain on intersection points of grid of length δ in reward space Initialize R arbitrarily on the grid Repeat Pick R' uniformly at random from grid neighbors of R Set R ← R' with probability min{1, P(R' | D) / P(R | D)} Gradient method [Choi & Kim 11] Computing approximate maximum-a-posteriori (MAP) estimate 59 / 82
60 Contents Reinforcement learning (RL) Inverse reinforcement learning (IRL) Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary 60 / 82
61 IRL from Collective Behavior Data Multiple experts generate collective behavior data Experts may have reward functions different from each other Identify multiple reward functions from collective behavior data D = {τ_1, …, τ_M} (example: Mario trajectories from a coin collector and an enemy killer, mapped back to the players' preferences by IRL) 61 / 82
62 IRL for Multiple Reward Functions Naive solution Individually infer reward function of each trajectory Data sparsity problem Solution Cluster trajectories according to inferred reward functions Do not know the number of clusters a priori Behavior data D = {τ_1, τ_2, …, τ_M} Reward functions {R_1, R_2, …, R_K}, K ≪ M 62 / 82
63 Dirichlet Process (1) Metaphor: Chinese restaurant with infinite # of tables 1st customer sits at 1st table m-th customer Joins occupied table k with prob. proportional to its popularity (# of customers seated there) Sits at new table with prob. proportional to α Tables are clusters, customers are data points Rich-gets-richer phenomenon: popular tables have higher growing probabilities Cluster data with unknown # of clusters [Neal 00] 63 / 82
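The seating rule above takes only a few lines to simulate, and makes the rich-gets-richer dynamics concrete (the function is an illustrative sketch, not from the slides):

```python
import random

def chinese_restaurant_process(n_customers, alpha=1.0, seed=0):
    """Seat customers one by one: join existing table k with probability
    n_k / (m - 1 + alpha), open a new table with probability alpha / (m - 1 + alpha)."""
    rng = random.Random(seed)
    tables = []        # tables[k] = number of customers seated at table k
    assignments = []
    for _ in range(n_customers):
        weights = tables + [alpha]          # existing tables, then a new one
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(0)                # a brand-new table is opened
        tables[k] += 1                      # rich-get-richer: big tables grow
        assignments.append(k)
    return tables, assignments

tables, assignments = chinese_restaurant_process(200)
```

The number of occupied tables grows only logarithmically with the number of customers (roughly α·ln M in expectation), which is why the DP can infer the number of clusters instead of fixing it.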
64 Dirichlet Process (2) Same metaphor: Chinese restaurant with infinite # of tables, where tables are reward functions R_1, R_2, … and customers are trajectories τ 1st customer sits at 1st table m-th customer joins occupied table k with prob. proportional to its popularity, sits at new table with prob. proportional to α Rich-gets-richer phenomenon: popular tables have higher growing probabilities Cluster trajectories with unknown # of reward functions [Neal 00] 64 / 82
65-66 Gaussian Mixture Model + Dirichlet Prior Motivation How many clusters are present in the data? Dirichlet distribution Conjugate prior of multinomial distribution Generative process β ~ Dir(α/K, …, α/K), K = 2 c_n ~ Mult(β = (β_1, …, β_K)) θ_k ~ G_0 y_n ~ N(y | θ_{c_n}) Cluster assignment prob. P(c_n) follows multinomial distribution 65-66 / 82
67 Infinite Gaussian Mixture Model Motivation How many clusters are present in the data? Dirichlet process Conjugate prior of infinite multinomial distribution Generative process β ~ Dir(α/K, …, α/K), K → ∞ c_n ~ Mult(β = (β_1, …, β_K)) θ_k ~ G_0 y_n ~ N(y | θ_{c_n}) Dirichlet process mixture (DPM) model Cluster assignment prob. P(c_n) follows infinite multinomial distribution [Rasmussen 00] 67 / 82
68 Nonparametric Bayesian IRL for Multiple Rewards Approach Extend BIRL with Dirichlet process mixture (DPM) model Advantages Does not require the number of experts in advance Identifies which trajectory is generated by which expert [Choi & Kim 12] 68 / 82
69 DPM-BIRL Generative process Cluster assignment c = [c_1, …, c_M] is drawn by β ~ Dir(α/K, …, α/K) with K → ∞, c_m ~ Mult(β = (β_1, …, β_K)) Reward R_k is drawn from reward prior P(R) Trajectory τ_m is generated by P(τ_m | R_{c_m}) Inference using MCMC method Collecting samples from joint posterior P(c, {R_k}_{k=1}^K | D, α) Finding maximum-a-posteriori (MAP) estimate 69 / 82
70 Algorithm for DPM-BIRL [Metropolis-Hastings update for cluster assignments] Sample c_m' from P(c_m' | c_{−m}, α) ∝ n_{−m,j} if c_m' = j for some existing cluster j, α if c_m' is a new cluster If c_m' is a new cluster, draw new reward R_{c_m'} from P(R) Set c_m = c_m' with prob. min{1, P(τ_m | R_{c_m'}) / P(τ_m | R_{c_m})} [Update rewards using Langevin algorithm] f(R_k) = P(τ_{c=k} | R_k) P(R_k): unnormalized reward posterior Propose R_k' = R_k + (ξ²/2) ∇log f(R_k) + ξε with ε ~ N(0, 1), proposal density g(x, y) ∝ exp(−‖y − x − (ξ²/2) ∇log f(x)‖² / 2ξ²) Set R_k = R_k' with prob. min{1, [f(R_k') g(R_k', R_k)] / [f(R_k) g(R_k, R_k')]} 70 / 82
71 Information Transfer to New Trajectory (1) Finishing IRL on given data D = {τ_1, τ_2, …, τ_M} Computed IRL results: set of samples {(c^l, R_1^l, …, R_K^l)}_{l=1}^L drawn from joint posterior Inferring reward for new trajectory τ_new Transfer relevant information from pre-computed IRL results Compute conditional posterior of R for τ_new P(R | τ_new, D, α) ∝ P(τ_new | R) P(R | D, α) 71 / 82
72 Information Transfer to New Trajectory (2) Conditional posterior of R for τ_new P(R | τ_new, D, α) ∝ P(τ_new | R) [α/(α+M) P(R) + 1/(α+M) Σ_{l,k} (n_k^l / L) δ_{R_k^l}(R)] Approximated posterior mean is analytically computed E[R | τ_new, D, α] = ∫ R dP(R | τ_new, D, α) ≈ (1/Z) [α/(α+M) R̂ P(τ_new | R̂) + 1/(α+M) Σ_{l,k} (n_k^l / L) P(τ_new | R_k^l) R_k^l] where R̂ = argmax_R P(τ_new | R) P(R): MAP estimate for τ_new alone 72 / 82
73 Experiments: Mario Bros. (1) DPM-BIRL aligns trajectories with players much better than previous algorithm [Babes-Vroman et al. 11] Four players with different preferences over collecting coins and killing enemies: expert player, coin collector, enemy killer, Speedy Gonzales Ground truth: τ_1 … τ_12 grouped by player DPM-BIRL: recovers the ground-truth grouping EM-MLIRL(4) and EM-MLIRL(8): misassign some trajectories (e.g., τ_9) 73 / 82
74 Experiments: Mario Bros. (2) Trajectories τ_1 … τ_12 grouped by DPM-BIRL into clusters #1-#5 Reward functions learned by DPM-BIRL, visualized as weights on features such as φ_coin-collected and φ_enemy-killed 74 / 82
75 Experiments: Mario Bros. (3) Information transfer to new trajectory Visualize posterior probability of player's behavior 75 / 82
76 Contents Reinforcement learning (RL) Inverse reinforcement learning (IRL) Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary 76 / 82
77 Example IRL Applications (1) Quadruped locomotion [Kolter et al. 07] 77 / 82
78 Example IRL Applications (2) Learn driving style [Levine & Koltun 12] 78 / 82
79 Example IRL Applications (3) Personal navigation device predicts driver's route [Ziebart] 79 / 82
80 Summary Reinforcement learning Principled approach to sequential decision making problems Learning from experience Inverse reinforcement learning Recovering expert's reward function (objective or preference) from its behavior Natural way to examine animal and human behaviors Building computational model for making decisions Making robots learn to mimic demonstrator 80 / 82
81 Recommended Reading Reinforcement learning M. L. Puterman, Markov Decision Processes, John Wiley & Sons, R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, Cambridge Univ Press, L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, Planning and acting in partially observable stochastic domains, Artificial Intelligence (101), J. Pineau, G. Gordon, and S. Thrun, Anytime point-based approximations for large POMDPs, Journal of Artificial Intelligence Research (27), Inverse reinforcement learning A. Y. Ng and S. Russell, Algorithms for Inverse Reinforcement Learning, ICML P. Abbeel and A. Y. Ng, Apprenticeship learning via inverse reinforcement learning, ICML N. D. Ratliff, J. Bagnell, and M. A. Zinkevich, Maximum margin planning, ICML D. Ramachandran and E. Amir, Bayesian Inverse Reinforcement Learning, IJCAI B. D. Ziebart, A. Maas, J. Bagnell, and A. K. Dey, Maximum entropy inverse reinforcement learning, AAAI J. Choi and K. Kim, Inverse reinforcement learning in partially observable environments, JMLR (12), S. Levine, Z. Popovic, and V. Koltun, Nonlinear Inverse Reinforcement Learning with Gaussian Processes, NIPS J. Choi and K. Kim, Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions, NIPS J. Choi and K. Kim, Bayesian Nonparametric Feature Construction for Inverse Reinforcement Learning, IJCAI 81 / 82
82 References [Abbeel & Ng 04] P. Abbeel and A. Y. Ng, Apprenticeship learning via inverse reinforcement learning, ICML [Argall et al. 09] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, A survey of robot learning from demonstration, Robotics and Autonomous Systems, 57(5), [Babes-Vroman et al. 11] M. Babes-Vroman, V. Marivate, K. Subramanian, and M. Littman, Apprenticeship Learning About Multiple Intentions, ICML [Borgers & Sarin 00] T. Borgers and R. Sarin, Naive reinforcement learning with endogenous aspirations, International Economic Review, 41(4), [Choi & Kim 11] J. Choi and K. Kim, MAP Inference for Bayesian Inverse Reinforcement Learning, NIPS [Choi & Kim 12] J. Choi and K. Kim, Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions, NIPS [Hopkins 07] E. Hopkins, Adaptive learning models of consumer behavior, Journal of Economic Behavior and Organization, 64(3-4), [Kolter et al. 07] J. Kolter, P. Abbeel, and A. Ng, Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, NIPS [Neal 00] R. Neal, Markov Chain Sampling Methods for Dirichlet Process Mixture Models, Journal of Computational and Graphical Statistics, 9(2), [Neu & Szepesvari 07] G. Neu and C. Szepesvari, Apprenticeship learning using inverse reinforcement learning and gradient methods, UAI [Ng & Russell 00] A. Y. Ng and S. Russell, Algorithms for Inverse Reinforcement Learning, ICML [Niv 09] Y. Niv, Reinforcement learning in the brain, Journal of Mathematical Psychology, 53(3), [Puterman 94] M. L. Puterman, Markov Decision Processes, John Wiley & Sons, [Ramachandran & Amir 07] D. Ramachandran and E. Amir, Bayesian Inverse Reinforcement Learning, IJCAI [Rasmussen 00] C. E. Rasmussen, The Infinite Gaussian Mixture Model, NIPS [Ratliff et al. 06] N. D. Ratliff, J. Bagnell, and M. A. Zinkevich, Maximum margin planning, ICML [Russell 98] S. Russell, Learning agents for uncertain environments (extended abstract), COLT [Russell & Norvig] S. Russell and P. Norvig, Artificial intelligence: A modern approach, Prentice Hall, [Sutton & Barto 98] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, Cambridge Univ Press, [Syed & Schapire 07] U. Syed and R. E. Schapire, A Game-Theoretic approach to apprenticeship learning, NIPS [Ziebart et al. 08] B. D. Ziebart, A. Maas, J. Bagnell, and A. K. Dey, Maximum entropy inverse reinforcement learning, AAAI 82 / 82
More informationReinforcement learning
Reinforcement learning Stuart Russell, UC Berkeley Stuart Russell, UC Berkeley 1 Outline Sequential decision making Dynamic programming algorithms Reinforcement learning algorithms temporal difference
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More informationCS788 Dialogue Management Systems Lecture #2: Markov Decision Processes
CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes Kee-Eung Kim KAIST EECS Department Computer Science Division Markov Decision Processes (MDPs) A popular model for sequential decision
More informationCSE250A Fall 12: Discussion Week 9
CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.
More informationInverse Optimal Control
Inverse Optimal Control Oleg Arenz Technische Universität Darmstadt o.arenz@gmx.de Abstract In Reinforcement Learning, an agent learns a policy that maximizes a given reward function. However, providing
More informationMarkov Decision Processes and Solving Finite Problems. February 8, 2017
Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:
More informationAn Adaptive Clustering Method for Model-free Reinforcement Learning
An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationA Game-Theoretic Approach to Apprenticeship Learning
Advances in Neural Information Processing Systems 20, 2008. A Game-Theoretic Approach to Apprenticeship Learning Umar Syed Computer Science Department Princeton University 35 Olden St Princeton, NJ 08540-5233
More informationIntroduction to Reinforcement Learning. CMPT 882 Mar. 18
Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and
More informationEfficient Probabilistic Performance Bounds for Inverse Reinforcement Learning
Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning Daniel S. Brown and Scott Niekum Department of Computer Science University of Texas at Austin {dsbrown,sniekum}@cs.utexas.edu
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationAn Introduction to Reinforcement Learning
An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@csa.iisc.ernet.in Department of Computer Science and Automation Indian Institute of Science August 2014 What is Reinforcement
More informationChristopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015
Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)
More informationMarks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:
Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,
More informationSequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague
Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:
More information6 Reinforcement Learning
6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More informationAutonomous Helicopter Flight via Reinforcement Learning
Autonomous Helicopter Flight via Reinforcement Learning Authors: Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, Shankar Sastry Presenters: Shiv Ballianda, Jerrolyn Hebert, Shuiwang Ji, Kenley Malveaux, Huy
More informationArtificial Intelligence & Sequential Decision Problems
Artificial Intelligence & Sequential Decision Problems (CIV6540 - Machine Learning for Civil Engineers) Professor: James-A. Goulet Département des génies civil, géologique et des mines Chapter 15 Goulet
More informationAn Introduction to Reinforcement Learning
An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@cse.iitb.ac.in Department of Computer Science and Engineering Indian Institute of Technology Bombay April 2018 What is Reinforcement
More informationTemporal Difference Learning & Policy Iteration
Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.
More informationReinforcement Learning
Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University
More informationPlanning by Probabilistic Inference
Planning by Probabilistic Inference Hagai Attias Microsoft Research 1 Microsoft Way Redmond, WA 98052 Abstract This paper presents and demonstrates a new approach to the problem of planning under uncertainty.
More informationReinforcement Learning
Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques
More informationREINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationMachine Learning I Reinforcement Learning
Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:
More informationModeling Decision Making with Maximum Entropy Inverse Optimal Control Thesis Proposal
Modeling Decision Making with Maximum Entropy Inverse Optimal Control Thesis Proposal Brian D. Ziebart Machine Learning Department Carnegie Mellon University September 30, 2008 Thesis committee: J. Andrew
More informationEfficient Learning in Linearly Solvable MDP Models
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Efficient Learning in Linearly Solvable MDP Models Ang Li Department of Computer Science, University of Minnesota
More informationBalancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu
More informationPreference Elicitation for Sequential Decision Problems
Preference Elicitation for Sequential Decision Problems Kevin Regan University of Toronto Introduction 2 Motivation Focus: Computational approaches to sequential decision making under uncertainty These
More informationRL 14: POMDPs continued
RL 14: POMDPs continued Michael Herrmann University of Edinburgh, School of Informatics 06/03/2015 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally
More informationPoint-Based Value Iteration for Constrained POMDPs
Point-Based Value Iteration for Constrained POMDPs Dongho Kim Jaesong Lee Kee-Eung Kim Department of Computer Science Pascal Poupart School of Computer Science IJCAI-2011 2011. 7. 22. Motivation goals
More informationMaximum Causal Tsallis Entropy Imitation Learning
Maximum Causal Tsallis Entropy Imitation Learning Kyungjae Lee 1, Sungjoon Choi, and Songhwai Oh 1 Dep. of Electrical and Computer Engineering and ASRI, Seoul National University 1 Kakao Brain kyungjae.lee@rllab.snu.ac.kr,
More informationReinforcement Learning. Introduction
Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control
More informationReinforcement Learning. Machine Learning, Fall 2010
Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30
More informationarxiv: v1 [cs.ro] 12 Aug 2016
Density Matching Reward Learning Sungjoon Choi 1, Kyungjae Lee 1, H. Andy Park 2, and Songhwai Oh 1 arxiv:1608.03694v1 [cs.ro] 12 Aug 2016 1 Seoul National University, Seoul, Korea {sungjoon.choi, kyungjae.lee,
More informationPartially observable Markov decision processes. Department of Computer Science, Czech Technical University in Prague
Partially observable Markov decision processes Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:
More informationFood delivered. Food obtained S 3
Press lever Enter magazine * S 0 Initial state S 1 Food delivered * S 2 No reward S 2 No reward S 3 Food obtained Supplementary Figure 1 Value propagation in tree search, after 50 steps of learning the
More informationLecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of
More informationLecture 3: The Reinforcement Learning Problem
Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More information27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling
10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel
More informationActive Learning of MDP models
Active Learning of MDP models Mauricio Araya-López, Olivier Buffet, Vincent Thomas, and François Charpillet Nancy Université / INRIA LORIA Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex France
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationCOMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati
COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning
More informationKalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN)
Kalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN) Alp Sardag and H.Levent Akin Bogazici University Department of Computer Engineering 34342 Bebek, Istanbul,
More informationParking lot navigation. Experimental setup. Problem setup. Nice driving style. Page 1. CS 287: Advanced Robotics Fall 2009
Consider the following scenario: There are two envelopes, each of which has an unknown amount of money in it. You get to choose one of the envelopes. Given this is all you get to know, how should you choose?
More informationGentle Introduction to Infinite Gaussian Mixture Modeling
Gentle Introduction to Infinite Gaussian Mixture Modeling with an application in neuroscience By Frank Wood Rasmussen, NIPS 1999 Neuroscience Application: Spike Sorting Important in neuroscience and for
More informationEfficient Probabilistic Performance Bounds for Inverse Reinforcement Learning
Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning Daniel S. Brown and Scott Niekum Department of Computer Science University of Texas at Austin {dsbrown,sniekum}@cs.utexas.edu
More informationTrust Region Policy Optimization
Trust Region Policy Optimization Yixin Lin Duke University yixin.lin@duke.edu March 28, 2017 Yixin Lin (Duke) TRPO March 28, 2017 1 / 21 Overview 1 Preliminaries Markov Decision Processes Policy iteration
More informationPartially Observable Markov Decision Processes (POMDPs)
Partially Observable Markov Decision Processes (POMDPs) Geoff Hollinger Sequential Decision Making in Robotics Spring, 2011 *Some media from Reid Simmons, Trey Smith, Tony Cassandra, Michael Littman, and
More informationProf. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be
REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while
More information16.4 Multiattribute Utility Functions
285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationImitation Learning. Richard Zhu, Andrew Kang April 26, 2016
Imitation Learning Richard Zhu, Andrew Kang April 26, 2016 Table of Contents 1. Introduction 2. Preliminaries 3. DAgger 4. Guarantees 5. Generalization 6. Performance 2 Introduction Where we ve been The
More informationNon-Parametric Bayes
Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian
More informationReinforcement Learning
CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act
More informationLecture 3: Markov Decision Processes
Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov
More informationLecture 1: March 7, 2018
Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights
More informationCMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro
CMU 15-781 Lecture 11: Markov Decision Processes II Teacher: Gianni A. Di Caro RECAP: DEFINING MDPS Markov decision processes: o Set of states S o Start state s 0 o Set of actions A o Transitions P(s s,a)
More informationMaximum Entropy Inverse Reinforcement Learning
Maximum Entropy Inverse Reinforcement Learning Brian D. Ziebart, Andrew Maas, J.Andrew Bagnell, and Anind K. Dey School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 bziebart@cs.cmu.edu,
More informationLearning in Zero-Sum Team Markov Games using Factored Value Functions
Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer
More informationChapter 3: The Reinforcement Learning Problem
Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationReinforcement Learning as Classification Leveraging Modern Classifiers
Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions
More informationReinforcement Learning (1)
Reinforcement Learning 1 Reinforcement Learning (1) Machine Learning 64-360, Part II Norman Hendrich University of Hamburg, Dept. of Informatics Vogt-Kölln-Str. 30, D-22527 Hamburg hendrich@informatik.uni-hamburg.de
More informationReinforcement Learning
Reinforcement Learning 1 Reinforcement Learning Mainly based on Reinforcement Learning An Introduction by Richard Sutton and Andrew Barto Slides are mainly based on the course material provided by the
More informationOutline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution
Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model
More informationLecture 23: Reinforcement Learning
Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:
More informationMachine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396
Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction
More informationInverse Reinforcement Learning with Simultaneous Estimation of Rewards and Dynamics
Inverse Reinforcement Learning with Simultaneous Estimation of Rewards and Dynamics Michael Herman Tobias Gindele Jörg Wagner Felix Schmitt Wolfram Burgard Robert Bosch GmbH D-70442 Stuttgart, Germany
More informationDecision Theory: Q-Learning
Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning
More informationLecture 9: Policy Gradient II (Post lecture) 2
Lecture 9: Policy Gradient II (Post lecture) 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp. 13 2 With many slides from or derived from David Silver
More informationDecision Theory: Markov Decision Processes
Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies
More informationBe able to define the following terms and answer basic questions about them:
CS440/ECE448 Section Q Fall 2017 Final Review Be able to define the following terms and answer basic questions about them: Probability o Random variables, axioms of probability o Joint, marginal, conditional
More informationBayesian Nonparametrics for Speech and Signal Processing
Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer
More informationLecture 9: Policy Gradient II 1
Lecture 9: Policy Gradient II 1 Emma Brunskill CS234 Reinforcement Learning. Winter 2019 Additional reading: Sutton and Barto 2018 Chp. 13 1 With many slides from or derived from David Silver and John
More informationUniversity of Alberta
University of Alberta NEW REPRESENTATIONS AND APPROXIMATIONS FOR SEQUENTIAL DECISION MAKING UNDER UNCERTAINTY by Tao Wang A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment
More informationSequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague
Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague /doku.php/courses/a4b33zui/start pagenda Previous lecture: individual rational
More information