Nonparametric Bayesian Inverse Reinforcement Learning

1 PRML Summer School 2013 Nonparametric Bayesian Inverse Reinforcement Learning Jaedeug Choi

2 Sequential Decision Making (1) Multiple decisions over time are made to achieve goals Reinforcement learning (RL): principled approach to sequential decision making problems Mario Bros. [Gibson, 2 / 82

3 Sequential Decision Making (2) Multiple decisions over time are made to achieve goals Reinforcement learning (RL): principled approach to sequential decision making problems Robotic goalkeeper [Busoniu, 3 / 82

4 Sequential Decision Making (3) Multiple decisions over time are made to achieve goals Reinforcement learning (RL): principled approach to sequential decision making problems Spoken dialogue management system [Young, 4 / 82

5 Today's Talk Reinforcement learning (RL) Inverse reinforcement learning (IRL) Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary Updated slides 5 / 82

6 Framework of RL (1) Agent: decision maker Environment: anything external to agent [Russell & Norvig] 6 / 82

7 Framework of RL (2) Agent: decision maker Environment: anything external to the agent Reward: describes the agent's goal or preference (Figure: agent-environment loop; the agent takes an action and receives a situation and a reward from the environment) [Mario wiki] 7 / 82

8 Objective of RL Policy: behavior strategy Find policy maximizing cumulative reward 8 / 82

9 Supervised Learning vs. RL Supervised learning: learning from examples; for each input x the correct output y is known; infer the input-output relationship y ≈ f(x) RL: learning from experience; correct outputs are not available, only rewards; trial-and-error search and delayed reward (Figure: agent-environment loop producing a trajectory s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...) 9 / 82

10 Markov Decision Process (MDP) Definition: finite set of states s ∈ S; finite set of actions a ∈ A; transition function T(s, a, s') = P(s' | s, a); reward function R(s) Assumptions: Markovian transition model, P(s_t | s_{t-1}, a_{t-1}, ..., s_1, a_1) = P(s_t | s_{t-1}, a_{t-1}); completely observable environment [Puterman 94] 10 / 82
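
As a concrete companion to the definition above, here is a minimal sketch (not from the talk) of how such an MDP can be stored with NumPy arrays; the two-state numbers are made up purely for illustration.

```python
import numpy as np

# Minimal MDP container: states and actions are integer indices.
# T[s, a, s'] = P(s' | s, a); each T[s, a, :] must sum to 1.
# R[s] is the state reward, gamma the discount factor.
n_states, n_actions = 2, 2
T = np.zeros((n_states, n_actions, n_states))
T[0, 0] = [0.9, 0.1]   # action 0 in state 0 mostly stays
T[0, 1] = [0.2, 0.8]   # action 1 in state 0 mostly moves to state 1
T[1, 0] = [0.0, 1.0]
T[1, 1] = [0.5, 0.5]
R = np.array([0.0, 1.0])   # state 1 is rewarding
gamma = 0.9

# Sanity check: every (s, a) row is a probability distribution.
assert np.allclose(T.sum(axis=2), 1.0)
```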

11 Example: Mario in Maze (1) States s ∈ S: position of Mario (x, y), s_1 = (1,1), s_2 = (1,2), ..., s_11 = (4,3) (Figure: 4×3 grid; top row s_3 s_5 s_8 s_11, middle row s_2 [wall] s_7 s_10, bottom row s_1 s_4 s_6 s_9) [Russell & Norvig] 11 / 82

12 Example: Mario in Maze (2) Actions a A Move left, right, up, and down [Russell & Norvig] 12 / 82

13 Example: Mario in Maze (3) Transition function T(s, a, s') = P(s' | s, a): Mario moves in the intended direction with probability 0.8 and slips to each perpendicular direction with probability 0.1 (staying put when it would hit a wall); e.g., for the action "move right" in s_1: T(s_1, right, s_4) = 0.8, T(s_1, right, s_2) = T(s_1, right, s_1) = 0.1 (Figure: grid world with the transition probabilities drawn as arrows) [Russell & Norvig] 13 / 82

14 Example: Mario in Maze (4) Reward function R(s): for terminal states, R(s_11) = +1 and R(s_10) = -1; for non-terminal states, R(s) = -0.04 for all s ∈ S \ {s_10, s_11} (Figure: grid world with the terminal rewards marked) [Russell & Norvig] 14 / 82

15 Policy Policy π: mapping from states to actions π(s) = a: execute action a in state s Describes the agent's behavior Following policy π: (1) determine the current state s, (2) execute action π(s), (3) go to step (1) 15 / 82

16 Learning Goal Return: discounted cumulative rewards R_t^π = R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + ... Discount factor γ ∈ [0, 1): determines the present value of future rewards, encodes increasing uncertainty about the future, bounds the infinite sum Optimal policy: maximizes the expected return, π* = arg max_π E[ R_0^π ] 16 / 82

17 Planning and Learning Planning: the agent uses a model to create or improve a policy; it has a priori knowledge about the model of the environment (e.g., the MDP) (Reinforcement) learning: the agent doesn't have a priori knowledge about the model; it may explicitly learn (i.e., construct) a model from interaction, but not necessarily; it faces the exploration vs. exploitation dilemma Planning and learning are closely related [Kee-Eung Kim, Introduction to RL, PRMLWS 2012] 17 / 82

18 Value Functions Recall Return: R_t^π = R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + ... = Σ_{k=0}^∞ γ^k R(s_{t+k}) V^π(s): state-value function for policy π, the expected return when starting from s and following π: V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^∞ γ^k R(s_{t+k}) | s_t = s ] Q^π(s, a): action-value function for policy π, the expected return when starting from s, taking a, and following π: Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ] = E_π[ Σ_{k=0}^∞ γ^k R(s_{t+k}) | s_t = s, a_t = a ] 18 / 82

19 Bellman Equation for V^π Basic idea: R_t = R(s_t) + γ R(s_{t+1}) + γ² R(s_{t+2}) + ... = r_t + γ r_{t+1} + γ² r_{t+2} + ... = r_t + γ (r_{t+1} + γ r_{t+2} + ...) = r_t + γ R_{t+1} Bellman equation: V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^∞ γ^k r_{t+k} | s_t = s ] = E_π[ r_t + γ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ] = R(s) + γ Σ_{s'} T(s, π(s), s') E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_{t+1} = s' ] = R(s) + γ Σ_{s'} T(s, π(s), s') V^π(s'), i.e., the immediate reward plus the discounted expected value of the next states 19 / 82

20 Optimal Value Functions Partial ordering over policies: π ≥ π' if and only if V^π(s) ≥ V^{π'}(s) for all s ∈ S Optimal policy π*: π* ≥ π for all π Optimal policies share the same optimal value functions: V*(s) = max_π V^π(s) for all s ∈ S, Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A Bellman optimality equations for V* and Q*: V*(s) = max_a [ R(s) + γ Σ_{s'} T(s, a, s') V*(s') ], Q*(s, a) = R(s) + γ Σ_{s'} T(s, a, s') max_{a'} Q*(s', a') 20 / 82

21 Value Iteration Algorithm Iterative algorithm for solving the Bellman optimality equation: V_{i+1}(s) ← max_a [ R(s) + γ Σ_{s'} T(s, a, s') V_i(s') ] As i → ∞, V_i converges to the optimal value function Initialize V_0 arbitrarily, e.g., V_0(s) = 0 for all s ∈ S For i = 0, 1, 2, ...: for each s ∈ S, V_{i+1}(s) ← max_a [ R(s) + γ Σ_{s'} T(s, a, s') V_i(s') ]; until max_s |V_i(s) − V_{i+1}(s)| < θ Output policy π such that π(s) = arg max_a [ R(s) + γ Σ_{s'} T(s, a, s') V(s') ] 21 / 82
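
The pseudocode above translates almost line by line into NumPy. A minimal sketch, with a made-up two-state MDP standing in for the maze:

```python
import numpy as np

def value_iteration(T, R, gamma, theta=1e-8):
    """Value iteration for an MDP with T[s, a, s'] = P(s'|s, a) and state rewards R[s].

    Returns the optimal value function V* and a greedy policy pi*.
    """
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)
    while True:
        # Backup: Q[s, a] = R[s] + gamma * sum_s' T[s, a, s'] * V[s']
        Q = R[:, None] + gamma * T @ V
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)
    return V, policy

if __name__ == "__main__":
    # Tiny illustrative MDP (not the maze from the slides).
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.0, 1.0], [0.5, 0.5]]])
    R = np.array([0.0, 1.0])
    V, pi = value_iteration(T, R, gamma=0.9)
    print("V* =", V, "pi* =", pi)
```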

22 Value Iteration Example (1) Initialize V_0(s) = 0 for all s ∈ S (Figure: grid world with all values 0 except the terminal states +1 and -1) 22 / 82

23 Value Iteration Example (2) V_1(s) ← max_a [ R(s) + γ Σ_{s'} T(s, a, s') V_0(s') ] (Figure: grid of values after the first iteration, terminals +1 and -1) 23 / 82

24 Value Iteration Example (3) V_2(s) ← max_a [ R(s) + γ Σ_{s'} T(s, a, s') V_1(s') ] Worked example at s_2: x(a) = R(s_2) + γ [ T(s_2, a, s_1) V_1(s_1) + T(s_2, a, s_2) V_1(s_2) + T(s_2, a, s_3) V_1(s_3) + T(s_2, a, s_4) V_1(s_4) ] for each action a (Table: the backup value x(a) for each of the four actions) 24 / 82

25 Value Iteration Example (3) The same backup table; V_2(s_2) is then obtained by taking the maximum over the four actions, V_2(s_2) = max_a x(a) 25 / 82

26 Value Iteration Example (3) V_2(s) ← max_a [ R(s) + γ Σ_{s'} T(s, a, s') V_1(s') ] (Figure: grid of values after the second iteration, terminals +1 and -1) 26 / 82

27 Value Iteration Example (4) V_3(s) ← max_a [ R(s) + γ Σ_{s'} T(s, a, s') V_2(s') ] (Figure: grid of values after the third iteration) 27 / 82

28 Value Iteration Example (5) V_21(s) ← max_a [ R(s) + γ Σ_{s'} T(s, a, s') V_20(s') ] (Figure: grid of values after 21 iterations) 28 / 82

29 Value Iteration Example (6) Output policy: π(s) = arg max_a [ R(s) + γ Σ_{s'} T(s, a, s') V(s') ] (Table: the backup value R(s_2) + γ Σ_i T(s_2, a, s_i) V(s_i) for each action at s_2; the action attaining the maximum becomes π(s_2)) 29 / 82

30 Value Iteration Example (7) Optimal policy π* (Figure: grid world showing the optimal action in each state, terminals +1 and -1) 30 / 82

31 Contents Reinforcement learning (RL) Markov decision process (MDP) Bellman equations for value functions Value iteration algorithm Inverse reinforcement learning (IRL) Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary 31 / 82

32 Applying RL to Real-World Problems Formalizing problems in the framework of RL Dynamics model: relatively easy to obtain, e.g., computed by counting events Reward function: non-trivial to obtain, typically hand-tuned until a satisfactory policy is obtained A more systematic approach is required! (Figure: agent-environment loop) 32 / 82

33 Example: Mario in Maze MDP: transition function moves in the intended direction with probability 0.8; reward R(s) = ±1 for the terminal states and R(s) = -0.04 for non-terminal states Optimal policy π* (Figure: optimal policies for different ranges of the non-terminal reward R(s); the optimal behavior changes qualitatively as R(s) ranges from strongly negative to slightly negative) [Russell & Norvig] 33 / 82

34 Example: Autonomous Helicopter Control Reward function design: difficult to balance many desiderata But it is easy to collect the expert's behavior 34 / 82

35 Inverse Reinforcement Learning (IRL) Definition [Russell 98]: recovering the domain expert's reward function from its behavior RL vs. IRL: RL takes the dynamics model and the reward function (which represents the goal or preference) and its solution is the optimal policy (which prescribes the action to take in each state); IRL takes the dynamics model and the expert's behavior and its solution is the reward function Useful for robotics, human/animal behavior studies, neuroscience, economics, ... [Argall et al. 09; Cohen & Ranganath 07; Hopkins 07; Borgers & Sarin 00] 35 / 82

36 IRL Problems IRL for MDP\R Assumptions: (1) completely observable environment (MDP), (2) the expert behaves optimally Input: (1) dynamics model: transition function T, (2) expert's behavior: trajectory set D Output: reward function R making the expert's policy π_E optimal Expert's behavior: trace of the expert's policy π_E, given as a set of trajectories D = {τ_1, τ_2, ..., τ_M} with trajectory τ_m = ((s_{m,1}, a_{m,1}), ..., (s_{m,H}, a_{m,H})) 36 / 82

37 IRL vs. Imitation Learning Apprenticeship learning via IRL: find a good policy using the R inferred by IRL Imitation learning [Argall et al. 09]: learn the teacher's policy directly using supervised learning; fix a class of policies (neural network, decision tree, ...); estimate the policy (= mapping from states to actions) from training examples (= the trajectory set D) Advantages of IRL: the reward function is a succinct and transferable description of the task (π vs. R) 37 / 82

38 Matrix Notations (1) Definition of MDP ⟨S, A, T, R⟩: set of states s ∈ S, set of actions a ∈ A, transition function T(s, a, s') = P(s' | s, a), reward function R(s, a) Using matrix notations: Transition function: |S||A| × |S| matrix T; T^π: |S| × |S| matrix with T^π(s, s') = T(s, π(s), s'); T^a: |S| × |S| matrix with T^a(s, s') = T(s, a, s') Reward function: |S||A|-dimensional vector R; R^π: |S|-dimensional vector with R^π(s) = R(s, π(s)); R^a: |S|-dimensional vector with R^a(s) = R(s, a) 38 / 82

39 Matrix Notations (2) Bellman equations for value functions: V^π(s) = R(s, π(s)) + γ Σ_{s'} T(s, π(s), s') V^π(s'), i.e., V^π = R^π + γ T^π V^π Q^π(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') V^π(s'), i.e., Q^π_a = R^a + γ T^a V^π Here V^π and R^π are |S|-dimensional vectors and T^π is an |S| × |S| matrix 39 / 82
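
Since V^π = R^π + γ T^π V^π is linear in V^π, policy evaluation reduces to one linear solve, V^π = (I − γT^π)^{-1} R^π. A small sketch (the random MDP at the bottom is only a stand-in):

```python
import numpy as np

def evaluate_policy(T, R, pi, gamma):
    """Exact policy evaluation via V^pi = (I - gamma * T^pi)^{-1} R^pi.

    T[s, a, s'] = P(s'|s, a), R[s, a] = reward, pi[s] = deterministic action.
    """
    n_states = T.shape[0]
    idx = np.arange(n_states)
    T_pi = T[idx, pi]          # |S| x |S| matrix with rows T(s, pi(s), .)
    R_pi = R[idx, pi]          # |S| vector with entries R(s, pi(s))
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_states, n_actions, gamma = 4, 2, 0.9
    T = rng.random((n_states, n_actions, n_states))
    T /= T.sum(axis=2, keepdims=True)          # normalize into distributions
    R = rng.random((n_states, n_actions))
    pi = np.zeros(n_states, dtype=int)         # arbitrary fixed policy
    print("V^pi =", evaluate_policy(T, R, pi, gamma))
```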

40 Reward Optimality Condition IRL for MDP\R given policy π_E: given MDP\R = ⟨S, A, T, γ⟩ and π_E: S → A, find R making π_E optimal Necessary and sufficient condition for R to guarantee optimality of π_E [Ng & Russell 00]: V^{π_E}(s; R) ≥ Q^{π_E}(s, a; R) for all s ∈ S and a ∈ A, i.e., V^{π_E}(R) ⪰ Q^{π_E}_a(R) for all a ∈ A Key fact used in many IRL algorithms 40 / 82

41 Reward Optimality Region (1) Reward optimality region for π: region of reward functions that yield π as optimal, [ (I_A − γT)(I − γT^π)^{-1} E^π − I ] R ⪰ 0, where E^π is the |S| × |S||A| matrix with E^π(s, (s', a)) = 1 if s = s' and π(s) = a, and I_A stacks the |S| × |S| identity matrix |A| times Proof: V^π = R^π + γT^π V^π ⟹ V^π = (I − γT^π)^{-1} R^π; Q^π_a = R^a + γT^a V^π; V^π(R) ⪰ Q^π_a(R) for all a ∈ A ⟺ R^π + γT^π V^π ⪰ R^a + γT^a V^π ⟺ R^π + γT^π (I − γT^π)^{-1} R^π ⪰ R^a + γT^a (I − γT^π)^{-1} R^π 41 / 82
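
To make the polytope concrete, the constraint matrix above can be assembled numerically and used to test whether a given reward lies in the region of a policy. A hedged sketch, assuming state-action rewards R(s, a) flattened in (s, a) order; the small random MDP is a placeholder:

```python
import numpy as np

def optimality_region_matrix(T, pi, gamma):
    """Constraint matrix C such that C @ R.reshape(-1) >= 0 iff pi is optimal for R.

    Implements [(I_A - gamma*T)(I - gamma*T^pi)^{-1} E^pi - I] with (s, a) rows
    flattened as s * |A| + a.
    """
    n_states, n_actions, _ = T.shape
    n_sa = n_states * n_actions
    T_flat = T.reshape(n_sa, n_states)                        # row (s, a) -> T(s, a, .)
    I_A = np.kron(np.eye(n_states), np.ones((n_actions, 1)))  # repeats V(s) for each a
    E_pi = np.zeros((n_states, n_sa))                         # selects R(s, pi(s))
    E_pi[np.arange(n_states), np.arange(n_states) * n_actions + pi] = 1.0
    T_pi = T[np.arange(n_states), pi]
    inv = np.linalg.inv(np.eye(n_states) - gamma * T_pi)
    return (I_A - gamma * T_flat) @ inv @ E_pi - np.eye(n_sa)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_states, n_actions, gamma = 4, 3, 0.9
    T = rng.random((n_states, n_actions, n_states))
    T /= T.sum(axis=2, keepdims=True)
    R = rng.random((n_states, n_actions))

    # Find an optimal policy for R with a short Q-value iteration.
    Q = np.zeros((n_states, n_actions))
    for _ in range(2000):
        Q = R + gamma * T @ Q.max(axis=1)
    pi = Q.argmax(axis=1)

    C = optimality_region_matrix(T, pi, gamma)
    inside = bool((C @ R.reshape(-1) >= -1e-8).all())
    print("R inside reward optimality region of pi:", inside)
```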

42 Reward Optimality Region (2) Reward optimality region for π_E: region of reward functions that yield π_E as optimal, [ (I_A − γT)(I − γT^{π_E})^{-1} E^{π_E} − I ] R ⪰ 0, a convex polytope in the reward space Difficulty of IRL problems: infinitely many reward functions guarantee the optimality of π_E; the degenerate case R = 0 makes any policy optimal; IRL is inherently ill-posed (Figure: reward optimality region of π_E in reward space) 42 / 82

43 Ng and Russell's Algorithm IRL for MDP\R = ⟨S, A, T, γ⟩ given policy π_E: choose R in the reward optimality region for π_E, maximize the sum of margins between π_E and all other actions, and favor a sparse reward function Optimization problem (linear programming formulation): max_R Σ_s min_{a ∈ A\{π_E(s)}} [ Q^{π_E}(s, π_E(s)) − Q^{π_E}(s, a) ] − λ ||R||_1 s.t. [ (I_A − γT)(I − γT^{π_E})^{-1} E^{π_E} − I ] R ⪰ 0 (reward optimality region), |R(s, a)| ≤ R_max for all s ∈ S, a ∈ A The first term is the sum of margins, the L1 term penalizes too many non-zero entries, and λ is an adjustable parameter [Ng & Russell 00] 43 / 82
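
Once the min over actions and the L1 term are linearized with auxiliary variables, the problem above is an ordinary linear program. Below is a hedged sketch using scipy.optimize.linprog for the state-reward variant R(s) of the formulation (as in Ng & Russell's original paper); the random MDP and the regularization value are placeholders:

```python
import numpy as np
from scipy.optimize import linprog

def ng_russell_irl(T, pi_E, gamma, lam=1.0, r_max=1.0):
    """Sketch of the Ng & Russell (2000) LP for state rewards R(s).

    Maximizes sum_s min_{a != pi_E(s)} [Q(s, pi_E(s)) - Q(s, a)] - lam * ||R||_1
    subject to pi_E being optimal and |R(s)| <= r_max, using decision
    variables x = [R, t, u] where t are per-state margins and u = |R|.
    """
    n_states, n_actions, _ = T.shape
    idx = np.arange(n_states)
    T_pi = T[idx, pi_E]
    inv = np.linalg.inv(np.eye(n_states) - gamma * T_pi)

    A_rows, b = [], []
    for a in range(n_actions):
        # (G_a @ R)[s] = Q^{pi_E}(s, pi_E(s)) - Q^{pi_E}(s, a), linear in R
        G_a = gamma * (T_pi - T[:, a, :]) @ inv
        for s in range(n_states):
            if a == pi_E[s]:
                continue
            row = np.zeros(3 * n_states)       # margin: t_s <= (G_a @ R)[s]
            row[:n_states] = -G_a[s]
            row[n_states + s] = 1.0
            A_rows.append(row); b.append(0.0)
            row = np.zeros(3 * n_states)       # optimality: (G_a @ R)[s] >= 0
            row[:n_states] = -G_a[s]
            A_rows.append(row); b.append(0.0)
    for sign in (1.0, -1.0):                   # L1 linearization: -u <= R <= u
        block = np.hstack([sign * np.eye(n_states),
                           np.zeros((n_states, n_states)),
                           -np.eye(n_states)])
        A_rows.extend(block); b.extend([0.0] * n_states)

    c = np.concatenate([np.zeros(n_states), -np.ones(n_states),
                        lam * np.ones(n_states)])
    bounds = ([(-r_max, r_max)] * n_states + [(None, None)] * n_states
              + [(0, None)] * n_states)
    res = linprog(c, A_ub=np.array(A_rows), b_ub=np.array(b),
                  bounds=bounds, method="highs")
    return res.x[:n_states]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = rng.random((4, 2, 4)); T /= T.sum(axis=2, keepdims=True)
    R_true = rng.random(4)
    Q = np.zeros((4, 2))
    for _ in range(1000):                      # expert = optimal policy for R_true
        Q = R_true[:, None] + 0.9 * T @ Q.max(axis=1)
    pi_E = Q.argmax(axis=1)
    print("recovered R:", ng_russell_irl(T, pi_E, gamma=0.9, lam=0.5))
```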

44 Apprenticeship Learning (AL) via IRL Goal: learn a policy from observing an expert; find a policy whose performance is as good as the expert's policy, measured according to the expert's unknown reward function Example: when teaching someone to drive, we do not tell them what the reward function is (it is difficult to balance many desiderata); instead we demonstrate driving: preference for a lane, safe following distance, keeping away from pedestrians, maintaining reasonable speed [Abbeel & Ng 04] 44 / 82

45 Reward Function Representation Known feature functions Φ: S × A → [0,1]^D Φ = [φ_1, ..., φ_D]: |S||A| × D feature matrix; φ_d: |S||A|-dimensional feature vector; Φ(s, a): D-dimensional feature vector Unknown weight vector w ∈ [−1, 1]^D Reward function is a linear combination of the feature functions: R = Φw, i.e., R(s, a) = w^T Φ(s, a) = Σ_{d=1}^D w_d φ_d(s, a) Common assumption in IRL to address problems with large state spaces 45 / 82

46 Example: Feature Functions in Mario Bros. Mario Bros. has a huge state space, so a compact representation of the reward function is required Binary feature functions indicate events such as: Mario successfully reaching the end of the level, Mario collecting a coin, Mario killing an enemy, Mario receiving damage from an enemy, Mario getting killed by an enemy, and so on 46 / 82

47 Feature Expectation (FE) Feature expectation: expected cumulative discounted sum of feature values, μ_π = E[ Σ_{t=0}^∞ γ^t Φ(s_t, π(s_t)) | s_0 ] ∈ R^D V^π(s_0) = E[ Σ_{t=0}^∞ γ^t R(s_t, π(s_t)) | s_0 ] = E[ Σ_{t=0}^∞ γ^t w^T Φ(s_t, π(s_t)) | s_0 ] = w^T μ_π Estimate the expert's feature expectation μ_{π_E}: the expert's policy π_E is not known; the expert's behavior is given by the set of trajectories D = {τ_1, τ_2, ..., τ_M} with τ_m = ((s_{m,1}, a_{m,1}), ..., (s_{m,H}, a_{m,H})) μ_{π_E} ≈ μ̂_{π_E} = (1/M) Σ_{m=1}^M Σ_{h=1}^H γ^{h-1} Φ(s_{m,h}, a_{m,h}) Denote μ_E = μ_{π_E} and μ̂_E = μ̂_{π_E} 47 / 82
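
The empirical estimate μ̂_E is a straightforward discounted sum over the demonstrations. A minimal sketch, where the feature function and the toy trajectories are made up for illustration:

```python
import numpy as np

def empirical_feature_expectation(trajectories, phi, gamma, n_features):
    """mu_hat_E = (1/M) sum_m sum_h gamma^(h-1) * Phi(s_{m,h}, a_{m,h}).

    `trajectories` is a list of [(s, a), ...] lists and `phi(s, a)` returns a
    D-dimensional feature vector.
    """
    mu = np.zeros(n_features)
    for traj in trajectories:
        for h, (s, a) in enumerate(traj):   # h = 0 corresponds to gamma^(1-1)
            mu += gamma ** h * phi(s, a)
    return mu / len(trajectories)

if __name__ == "__main__":
    # Toy features: one-hot on the action (purely illustrative).
    phi = lambda s, a: np.eye(2)[a]
    D = [[(0, 0), (1, 1), (1, 0)], [(0, 1), (1, 1)]]
    print(empirical_feature_expectation(D, phi, gamma=0.9, n_features=2))
```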

48 Closeness of FE and Performance Feature expectation μ_π: V^π(s_0) = w^T μ_π Performance closeness is bounded by FE closeness: for any underlying reward R = Φw, |V^{π_E}(s_0) − V^π(s_0)| = |w^T μ_E − w^T μ_π| ≤ ||w||_2 ||μ_E − μ_π||_2 ≤ ||μ_E − μ_π||_2, since ||w||_2 ≤ ||w||_1 ≤ 1 by assumption So: find a policy π whose feature expectation is close to μ_E ≈ μ̂_E 48 / 82

49 Algorithm for AL via IRL Alternate an RL step and an IRL step RL step: compute the optimal policy for the estimated reward IRL step: estimate a reward that makes π_E perform better than all previously found policies Initialize w arbitrarily and set Π ← ∅ Repeat: compute the optimal policy π for the MDP with R = Φw; Π ← Π ∪ {π} (RL step); solve max_{x,w} x s.t. w^T μ_E ≥ w^T μ_π + x for all π ∈ Π, ||w||_2 ≤ 1 (IRL step: find w s.t. w^T μ_E ≥ w^T μ_π for all π; a quadratically constrained programming problem) Until x ≤ ε 49 / 82
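
The IRL step of this loop is a small quadratically constrained program (maximize the margin x subject to ||w||_2 ≤ 1). A hedged sketch using cvxpy as one possible solver; the feature-expectation vectors below are invented numbers:

```python
import numpy as np
import cvxpy as cp

def apprenticeship_irl_step(mu_E, policy_fes):
    """One IRL step of apprenticeship learning via IRL:
    max_{x, w} x  s.t.  w^T mu_E >= w^T mu_pi + x for all found policies pi,
                        ||w||_2 <= 1.
    Returns the margin x and the weight vector w.
    """
    d = len(mu_E)
    w = cp.Variable(d)
    x = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [mu_E @ w >= mu @ w + x for mu in policy_fes]
    prob = cp.Problem(cp.Maximize(x), constraints)
    prob.solve()
    return x.value, w.value

if __name__ == "__main__":
    mu_E = np.array([0.8, 0.2, 0.5])              # illustrative expert FE
    policy_fes = [np.array([0.3, 0.6, 0.4]),      # FEs of policies found so far
                  np.array([0.5, 0.5, 0.5])]
    margin, w = apprenticeship_irl_step(mu_E, policy_fes)
    print("margin:", margin, "w:", w)
```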

50 Experiments: Simulated Highway Learn different driving styles Nice Nasty Right lane nice [Abbeel & Ng 04] 50 / 82

51 Contents Reinforcement learning (RL) Inverse reinforcement learning (IRL) Reward optimality condition Ng and Russell s algorithm Apprenticeship learning via IRL Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary 51 / 82

52 Difficulty of IRL Problems Solution of IRL is not unique: ill-posed Infinitely many reward functions guarantee optimality of π_E Even the degenerate case R = 0 makes any policy optimal (Figure: reward optimality region of π_E in reward space) Previous heuristics to address ill-posedness: maximizing the margin between the expert's policy and all other policies [Ng & Russell 00; Ratliff et al. 06]; minimizing deviation from the expert's policy [Neu & Szepesvari 07]; maximizing worst-case performance compared to the expert's policy [Syed & Schapire 07]; adopting the maximum entropy principle for choosing the learned policy [Ziebart et al. 08] 52 / 82

53 Bayesian Inference One of the two dominant approaches to statistical inference: estimate the probability that a hypothesis is true as additional evidence is acquired Prior P(H): probability that the hypothesis is true before any evidence is acquired Likelihood P(E | H): compatibility of the evidence with the given hypothesis Posterior P(H | E) ∝ P(E | H) P(H): probability of the hypothesis given the observed evidence 53 / 82

54 Bayesian Framework for IRL Evidence: the expert's behavior, a set of trajectories D = {τ_1, τ_2, ..., τ_M} with trajectory τ_m = ((s_{m,1}, a_{m,1}), ..., (s_{m,H}, a_{m,H})) Prior P(R): preference or external knowledge on reward functions Likelihood P(D | R): compatibility of a reward function with the given behavior data Posterior P(R | D) ∝ P(D | R) P(R) [Ramachandran & Amir 07] 54 / 82

55 Prior Assume rewards are i.i.d.: P(R) = Π_{s,a} P(R(s, a)) Example distributions: Uniform distribution over [−R_max, R_max]: no knowledge about rewards other than their range Normal or Laplacian distribution with zero mean: P(R(s, a) = r) ∝ exp(−r²/(2σ²)) or exp(−|r|/(2σ)); prefers sparse rewards Beta distribution with α = β = 1/2: P(R(s, a) = r) ∝ (r/R_max)^{−1/2} (1 − r/R_max)^{−1/2}; suited to planning problems, where most states have small rewards but a few states (e.g., the goal) have high reward 55 / 82

56 Likelihood Expert's behavior D = {τ_1, τ_2, ..., τ_M}, trajectory τ_m = ((s_{m,1}, a_{m,1}), ..., (s_{m,H}, a_{m,H})) Assumptions: π_E is stationary, so P(D | R) = Π_m Π_{(s,a) ∈ τ_m} P(s, a | R); the expert behaves optimally with respect to R: P(s, a | R) = exp(Q*(s, a; R)) / Z(s; R) with Z(s; R) = Σ_{a'} exp(Q*(s, a'; R)) Likelihood of D: P(D | R) = Π_m Π_{(s,a) ∈ τ_m} exp(Q*(s, a; R)) / Z(s; R) (Figure: P(s, a | R) as a softmax over Q*(s, a; R) across actions a_1, ..., a_4) 56 / 82
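
Evaluating this likelihood amounts to solving for Q* under the candidate reward and scoring each observed state-action pair with the softmax above. A minimal sketch (no temperature parameter, exactly as written on the slide); the toy MDP and trajectories are placeholders:

```python
import numpy as np
from scipy.special import logsumexp

def q_star(T, R, gamma, n_iter=500):
    """Q-value iteration: Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') max_a' Q(s',a')."""
    Q = np.zeros(R.shape)
    for _ in range(n_iter):
        Q = R + gamma * T @ Q.max(axis=1)
    return Q

def birl_log_likelihood(trajectories, T, R, gamma):
    """log P(D | R) = sum_m sum_{(s,a) in tau_m} [ Q*(s,a;R) - log Z(s;R) ]."""
    Q = q_star(T, R, gamma)
    log_Z = logsumexp(Q, axis=1)            # log sum_a' exp(Q*(s, a'))
    return sum(Q[s, a] - log_Z[s] for traj in trajectories for (s, a) in traj)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = rng.random((3, 2, 3)); T /= T.sum(axis=2, keepdims=True)
    R = rng.random((3, 2))
    D = [[(0, 1), (2, 0)], [(1, 1), (1, 0), (0, 1)]]   # made-up trajectories
    print("log P(D|R) =", birl_log_likelihood(D, T, R, gamma=0.9))
```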

57 Posterior Posterior: P(R | D) ∝ P(D | R) P(R) Prior: P(R) = Π_{s,a} P(R(s, a)) Likelihood: P(D | R) = Π_m Π_{(s,a) ∈ τ_m} exp(Q*(s, a; R)) / Z(s; R) (Figure: graphical model with a Gaussian prior (μ, σ) over each R(s, a) in a plate of size |S||A|, generating the observed pairs (s_{m,h}, a_{m,h}) in plates of size H and M) 57 / 82

58 Example: 5-State Chain MDP with 5 states arranged in a chain Action a_1: moves to the right with prob. 0.6, moves to the left with prob. 0.4 Action a_2: always moves to state s_1 True reward R = [0.1, 0, 0, 0, 1] Expert's policy π_E: the optimal policy on R, which takes a_1 in every state s_1 through s_5 Prior: R(s_2) = R(s_3) = R(s_4) = 0, P(R(s_1)) = N(0.1, 1), P(R(s_5)) = N(1, 1) (Figure: true reward, MAP reward, and posterior mean reward, shown together with the reward optimality region) 58 / 82

59 Algorithms for BIRL PolicyWalk: MCMC algorithm [Ramachandran & Amir 07] Generate samples from the posterior P(R | D); return the sample mean as an estimate of the true mean of the posterior Generate a Markov chain on the intersection points of a grid of length δ in R^{|S|}: Initialize R ∈ R^{|S|}/δ arbitrarily Repeat: pick R' uniformly at random from the neighbors of R in R^{|S|}/δ; set R ← R' with probability min(1, P(R' | D) / P(R | D)) Gradient method [Choi & Kim 11]: computes an approximate maximum-a-posteriori (MAP) estimate 59 / 82
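
A stripped-down sketch of the PolicyWalk idea: random-walk Metropolis-Hastings on a reward grid of step δ, accepting moves by the posterior ratio. The log-posterior used in the demo is only a Gaussian stand-in; in BIRL it would be the prior times the softmax likelihood of the previous slides (and the full algorithm also reuses the solved policy between nearby grid points, which is omitted here):

```python
import numpy as np

def policy_walk(log_posterior, n_states, delta=0.1, n_samples=5000, seed=0):
    """Grid random-walk MH over rewards R in R^{|S|}, step size delta.

    `log_posterior(R)` should return log P(R | D) up to a constant.
    Returns all collected samples; their mean estimates the posterior mean.
    """
    rng = np.random.default_rng(seed)
    R = np.zeros(n_states)                       # arbitrary starting grid point
    samples = []
    for _ in range(n_samples):
        # Propose a neighbor on the grid: move one coordinate by +/- delta.
        R_new = R.copy()
        i = rng.integers(n_states)
        R_new[i] += delta * rng.choice([-1.0, 1.0])
        # Accept with probability min(1, P(R_new | D) / P(R | D)).
        if np.log(rng.random()) < log_posterior(R_new) - log_posterior(R):
            R = R_new
        samples.append(R.copy())
    return np.array(samples)

if __name__ == "__main__":
    # Placeholder posterior: isotropic Gaussian centered at [0.1, 0, 0, 0, 1]
    # (roughly the true reward of the chain example); swap in the BIRL posterior.
    target = np.array([0.1, 0.0, 0.0, 0.0, 1.0])
    log_post = lambda R: -0.5 * np.sum((R - target) ** 2) / 0.25
    samples = policy_walk(log_post, n_states=5)
    print("posterior mean estimate:", samples[len(samples) // 2:].mean(axis=0))
```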

60 Contents Reinforcement learning (RL) Inverse reinforcement learning (IRL) Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary 60 / 82

61 IRL from Collective Behavior Data Multiple experts generate collective behavior data Experts may have reward functions different from each other Identify multiple reward functions from the collective behavior data D = {τ_1, ..., τ_M} (Figure: trajectories τ_1, τ_2, ..., τ_M of different Mario players, e.g., a coin collector who prefers collecting coins and an enemy killer who prefers killing enemies; IRL maps the trajectories to the players' preferences) 61 / 82

62 IRL for Multiple Reward Functions Naïve solution: individually infer the reward function of each trajectory; suffers from a data sparsity problem Solution: cluster trajectories according to the inferred reward functions; we do not know the number of clusters a priori Behavior data D = {τ_1, τ_2, ..., τ_M} → reward functions {R_1, R_2, ..., R_K} with K ≤ M (Figure: IRL and clustering of trajectories τ_1, ..., τ_M into reward functions R_1, R_2, ...) 62 / 82

63 Dirichlet Process (1) Metaphor: a Chinese restaurant with an infinite number of tables The 1st customer sits at the 1st table The m-th customer joins occupied table k in proportion to its popularity (the number of customers n_k already seated, i.e., with probability n_k / (m − 1 + α)) and sits at a new table with probability proportional to α (i.e., α / (m − 1 + α)) Tables correspond to clusters and customers to data points Rich-gets-richer phenomenon: popular tables have higher growing probabilities Clusters data with an unknown number of clusters [Neal 00] 63 / 82
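
The seating process can be simulated directly. A short sketch of drawing cluster assignments from the Chinese restaurant process:

```python
import numpy as np

def sample_crp(n_customers, alpha, seed=0):
    """Sample table assignments from a Chinese restaurant process.

    Customer m joins table k with probability n_k / (m - 1 + alpha) and opens a
    new table with probability alpha / (m - 1 + alpha).
    """
    rng = np.random.default_rng(seed)
    assignments = [0]                      # first customer sits at the first table
    counts = [1]
    for m in range(2, n_customers + 1):
        probs = np.array(counts + [alpha]) / (m - 1 + alpha)
        table = rng.choice(len(probs), p=probs)
        if table == len(counts):           # opened a new table
            counts.append(1)
        else:                              # rich-gets-richer: join a popular table
            counts[table] += 1
        assignments.append(table)
    return assignments

if __name__ == "__main__":
    print(sample_crp(n_customers=20, alpha=1.0))
```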

64 Dirichlet Process (2) The same metaphor applied to IRL: tables correspond to reward functions and customers to trajectories The m-th trajectory joins an occupied table (reward function R_k) in proportion to its popularity and sits at a new table with probability proportional to α Rich-gets-richer phenomenon: popular tables have higher growing probabilities Clusters trajectories with an unknown number of reward functions [Neal 00] 64 / 82

65 Gaussian Mixture Model + Dirichlet Prior Motivation: how many clusters are present in the data? Dirichlet distribution: conjugate prior of the multinomial distribution Generative process (illustrated with K = 2): β ~ Dir(α/K, ..., α/K); c_n ~ Mult(β = (β_1, ..., β_K)); θ_k ~ G_0; y_n ~ N(y | θ_{c_n}) The cluster assignment probability P(c_n) follows a multinomial distribution (Figure: two Gaussian clusters θ_1, θ_2 with mixing weights β_1, β_2) 65 / 82

66 Gaussian Mixture Model + Dirichlet Prior Same generative process as above, β ~ Dir(α/K, ..., α/K) with K = 2, c_n ~ Mult(β), θ_k ~ G_0, y_n ~ N(y | θ_{c_n}); the cluster assignment probability P(c_n) follows a multinomial distribution 66 / 82

67 Infinite Gaussian Mixture Model Motivation: how many clusters are present in the data? Dirichlet process: conjugate prior of the infinite multinomial distribution Generative process: β ~ Dir(α/K, ..., α/K) with K → ∞; c_n ~ Mult(β = (β_1, ..., β_K)); θ_k ~ G_0; y_n ~ N(y | θ_{c_n}) This is the Dirichlet process mixture (DPM) model The cluster assignment probability P(c_n) follows an infinite multinomial distribution (Figure: Gaussian clusters θ_1, ..., θ_4 with mixing weights β_1, ..., β_4) [Rasmussen 00] 67 / 82

68 Nonparametric Bayesian IRL for Multiple Rewards Approach: extend BIRL with the Dirichlet process mixture (DPM) model Advantages: does not require knowing the number of experts in advance; identifies which trajectory is generated by which expert [Choi & Kim 12] 68 / 82

69 DPM-BIRL Generative process: the cluster assignment c = [c_1, ..., c_M] is drawn by β ~ Dir(α/K, ..., α/K) with K → ∞ and c_m ~ Mult(β = (β_1, ..., β_K)); each reward R_k is drawn from the reward prior P(R); each trajectory τ_m is generated by P(τ_m | R_{c_m}) (Figure: graphical model with α, β, c_m, and R_{k,s,a} over plates S × A and K, and observations (s_{m,h}, a_{m,h}) over plates H and M) Inference using an MCMC method: collect samples from the joint posterior P(c, {R_k}_{k=1}^K | D, α); find the maximum-a-posteriori (MAP) estimate; cluster the trajectories 69 / 82

70 Algorithm for DPM-BIRL [Metropolis-Hastings update for cluster assignments] Sample c_m' from P(c_m' | c_{-m}, α) ∝ n_{-m,j} if c_m' = j for some j ∈ c_{-m}, and ∝ α if c_m' ≠ j for all j ∈ c_{-m} If c_m' ≠ j for all j ∈ c_{-m}, draw a new reward R_{c_m'} from P(R) Set c_m = c_m' with probability min(1, P(τ_m | R_{c_m'}) / P(τ_m | R_{c_m})) [Update of rewards using the Langevin algorithm] f(R_k) = P(τ_{c=k} | R_k) P(R_k): unnormalized reward posterior Proposal density g(x, y) ∝ exp( −||y − x − (ξ²/2) ∇ log f(x)||² / (2ξ²) ) Sample R_k' from R_k' = R_k + (ξ²/2) ∇ log f(R_k) + ξε with ε ~ N(0, 1) Set R_k = R_k' with probability min(1, f(R_k') g(R_k', R_k) / ( f(R_k) g(R_k, R_k') )) 70 / 82
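
The reward update above is a Metropolis-adjusted Langevin (MALA) step. A generic sketch of one such step; the Gaussian target in the demo is only a placeholder for the unnormalized reward posterior f(R_k):

```python
import numpy as np

def mala_step(R, log_f, grad_log_f, xi, rng):
    """One Metropolis-adjusted Langevin step targeting the density f.

    Propose R' = R + (xi^2 / 2) * grad log f(R) + xi * eps,  eps ~ N(0, I),
    and accept with probability min(1, f(R') g(R', R) / (f(R) g(R, R'))).
    """
    def log_g(x, y):   # log of the Gaussian proposal density g(x, y)
        diff = y - x - 0.5 * xi ** 2 * grad_log_f(x)
        return -np.sum(diff ** 2) / (2.0 * xi ** 2)

    R_prop = R + 0.5 * xi ** 2 * grad_log_f(R) + xi * rng.standard_normal(R.shape)
    log_accept = (log_f(R_prop) + log_g(R_prop, R)) - (log_f(R) + log_g(R, R_prop))
    return R_prop if np.log(rng.random()) < log_accept else R

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder target: standard Gaussian; in DPM-BIRL, f(R_k) would be
    # P(tau_{c=k} | R_k) P(R_k) with its (possibly numerical) gradient.
    log_f = lambda R: -0.5 * np.sum(R ** 2)
    grad_log_f = lambda R: -R
    R = np.zeros(3)
    for _ in range(1000):
        R = mala_step(R, log_f, grad_log_f, xi=0.5, rng=rng)
    print("final sample:", R)
```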

71 Information Transfer to New Trajectory (1) After finishing IRL on the given data D = {τ_1, τ_2, ..., τ_M}, the computed IRL results are a set of samples {(c^l, R_1^l, ..., R_K^l)}_{l=1}^L drawn from the joint posterior Inferring the reward for a new trajectory τ_new: transfer relevant information from the pre-computed IRL results by computing the conditional posterior of R for τ_new, P(R | τ_new, D, α) ∝ P(τ_new | R) P(R | D, α) 71 / 82

72 Information Transfer to New Trajectory (2) Conditional posterior of R for τ_new: P(R | τ_new, D, α) ∝ P(τ_new | R) P(R | D, α) ≈ P(τ_new | R) [ α/(α+M) P(R) + 1/(α+M) Σ_{l,k} (n_k^l / L) δ(R − R_k^l) ] The approximated posterior mean can be computed analytically: E[R | τ_new, D, α] = ∫ R dP(R | τ_new, D, α) ≈ (1/Z) [ α/(α+M) R̄ P(τ_new | R̄) + 1/(α+M) Σ_{l,k} (n_k^l / L) R_k^l P(τ_new | R_k^l) ], where R̄ = arg max_R P(τ_new | R) P(R) is the MAP estimate using only τ_new 72 / 82

73 Experiments: Mario Bros. (1) DPM-BIRL aligns trajectories with the players much better than the previous algorithm [Babes-Vroman et al. 11] (Table: player preferences; the expert player, coin collector, enemy killer, and Speedy Gonzales differ in whether they collect coins and/or kill enemies) Clusterings, with trajectories listed in cluster order: Ground truth: τ_1 τ_2 τ_3 τ_4 τ_5 τ_6 τ_7 τ_8 τ_9 τ_10 τ_11 τ_12 DPM-BIRL: τ_1 τ_2 τ_3 τ_4 τ_5 τ_6 τ_7 τ_8 τ_9 τ_10 τ_11 τ_12 EM-MLIRL(4): τ_1 τ_2 τ_3 τ_4 τ_5 τ_9 τ_6 τ_7 τ_8 τ_10 τ_11 τ_12 EM-MLIRL(8): τ_1 τ_2 τ_3 τ_4 τ_9 τ_5 τ_6 τ_7 τ_8 τ_10 τ_11 τ_12 73 / 82

74 Experiments: Mario Bros. (2) (Table: player preferences as on the previous slide) DPM-BIRL assigns τ_1 through τ_12 to five clusters, Cluster#1 through Cluster#5 (Figure: reward functions learned by DPM-BIRL, shown as weights on the features φ_coin-collected and φ_enemy-killed for each cluster) 74 / 82

75 Experiments: Mario Bros. (3) Information transfer to a new trajectory Visualize the posterior probability of the player's behavior 75 / 82

76 Contents Reinforcement learning (RL) Inverse reinforcement learning (IRL) Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary 76 / 82

77 Example IRL Applications (1) Quadruped locomotion [Kolter et al. 07] 77 / 82

78 Example IRL Applications (2) Learn driving style [Levine & Koltun 12] 78 / 82

79 Example IRL Applications (3) Personal navigation device predicts the driver's route [Ziebart 79 / 82

80 Summary Reinforcement learning: principled approach to sequential decision making problems; learning from experience Inverse reinforcement learning: recovering the expert's reward function (objective or preference) from its behavior; a natural way to examine animal and human behaviors, to build computational models of decision making, and to make robots learn to mimic a demonstrator 80 / 82

81 Recommended Reading
Reinforcement learning:
M. L. Puterman, Markov Decision Processes, John Wiley & Sons, 1994.
R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, Planning and acting in partially observable stochastic domains, Artificial Intelligence, 101, 1998.
J. Pineau, G. Gordon, and S. Thrun, Anytime point-based approximations for large POMDPs, Journal of Artificial Intelligence Research, 27, 2006.
Inverse reinforcement learning:
A. Y. Ng and S. Russell, Algorithms for Inverse Reinforcement Learning, ICML 2000.
P. Abbeel and A. Y. Ng, Apprenticeship learning via inverse reinforcement learning, ICML 2004.
N. D. Ratliff, J. Bagnell, and M. A. Zinkevich, Maximum margin planning, ICML 2006.
D. Ramachandran and E. Amir, Bayesian Inverse Reinforcement Learning, IJCAI 2007.
B. D. Ziebart, A. Maas, J. Bagnell, and A. K. Dey, Maximum entropy inverse reinforcement learning, AAAI 2008.
J. Choi and K. Kim, Inverse reinforcement learning in partially observable environments, JMLR, 12, 2011.
S. Levine, Z. Popovic, and V. Koltun, Nonlinear Inverse Reinforcement Learning with Gaussian Processes, NIPS 2011.
J. Choi and K. Kim, Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions, NIPS 2012.
J. Choi and K. Kim, Bayesian Nonparametric Feature Construction for Inverse Reinforcement Learning, IJCAI 2013.
81 / 82

82 References
[Abbeel & Ng 04] P. Abbeel and A. Y. Ng, Apprenticeship learning via inverse reinforcement learning, ICML 2004.
[Argall et al. 09] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, A survey of robot learning from demonstration, Robotics and Autonomous Systems, 57(5), 2009.
[Babes-Vroman et al. 11] M. Babes-Vroman, V. Marivate, K. Subramanian, and M. Littman, Apprenticeship Learning About Multiple Intentions, ICML 2011.
[Borgers & Sarin 00] T. Borgers and R. Sarin, Naive reinforcement learning with endogenous aspirations, International Economic Review, 41(4), 2000.
[Choi & Kim 11] J. Choi and K. Kim, MAP Inference for Bayesian Inverse Reinforcement Learning, NIPS 2011.
[Choi & Kim 12] J. Choi and K. Kim, Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions, NIPS 2012.
[Hopkins 07] E. Hopkins, Adaptive learning models of consumer behavior, Journal of Economic Behavior and Organization, 64(3-4), 2007.
[Kolter et al. 07] J. Kolter, P. Abbeel, and A. Ng, Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, NIPS 2007.
[Neal 00] R. Neal, Markov Chain Sampling Methods for Dirichlet Process Mixture Models, Journal of Computational and Graphical Statistics, 9(2), 2000.
[Neu & Szepesvari 07] G. Neu and C. Szepesvari, Apprenticeship learning using inverse reinforcement learning and gradient methods, UAI 2007.
[Ng & Russell 00] A. Y. Ng and S. Russell, Algorithms for Inverse Reinforcement Learning, ICML 2000.
[Niv 09] Y. Niv, Reinforcement learning in the brain, Journal of Mathematical Psychology, 53(3), 2009.
[Puterman 94] M. L. Puterman, Markov Decision Processes, John Wiley & Sons, 1994.
[Ramachandran & Amir 07] D. Ramachandran and E. Amir, Bayesian Inverse Reinforcement Learning, IJCAI 2007.
[Rasmussen 00] C. E. Rasmussen, The Infinite Gaussian Mixture Model, NIPS 2000.
[Ratliff et al. 06] N. D. Ratliff, J. Bagnell, and M. A. Zinkevich, Maximum margin planning, ICML 2006.
[Russell 98] S. Russell, Learning agents for uncertain environments (extended abstract), COLT 1998.
[Russell & Norvig] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall.
[Sutton & Barto 98] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[Syed & Schapire 07] U. Syed and R. E. Schapire, A Game-Theoretic Approach to Apprenticeship Learning, NIPS 2007.
[Ziebart et al. 08] B. D. Ziebart, A. Maas, J. Bagnell, and A. K. Dey, Maximum entropy inverse reinforcement learning, AAAI 2008.
82 / 82
