Nonparametric Bayesian Inverse Reinforcement Learning
1 PRML Summer School 2013 Nonparametric Bayesian Inverse Reinforcement Learning Jaedeug Choi
2 Sequential Decision Making (1) Multiple decisions over time are made to achieve goals Reinforcement learning (RL): principled approach to sequential decision making problems Mario Bros. [Gibson, 2 / 82
3 Sequential Decision Making (2) Multiple decisions over time are made to achieve goals Reinforcement learning (RL): principled approach to sequential decision making problems Robotic goalkeeper [Busoniu, 3 / 82
4 Sequential Decision Making (3) Multiple decisions over time are made to achieve goals Reinforcement learning (RL): principled approach to sequential decision making problems Spoken dialogue management system [Young, 4 / 82
5 Today's Talk Reinforcement learning (RL) Inverse reinforcement learning (IRL) Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and summary 5 / 82
6 Framework of RL (1) Agent: decision maker Environment: anything external to agent [Russell & Norvig] 6 / 82
7 Framework of RL (2) Agent: decision maker Environment: anything external to agent Reward: describes agent's goal or preference (diagram: agent sends action to environment, receives situation and reward) [Mario wiki] 7 / 82
8 Objective of RL Policy: behavior strategy Find policy maximizing cumulative reward 8 / 82
9 Supervised Learning vs. RL Supervised learning: learning from examples For each input x, correct output y is known Infer input-output relationship y = f(x) RL: learning from experience Correct outputs not available, only rewards Trial-and-error search and delayed reward (diagram: agent takes actions a_0, a_1, a_2 in situations s_0, s_1, s_2 and receives rewards r_0, r_1, r_2 from the environment) 9 / 82
10 Markov Decision Process (MDP) Definition Finite set of states s ∈ S Finite set of actions a ∈ A Transition function T(s, a, s') = P(s' | s, a) Reward function R(s) Assumptions Markovian transition model: P(s_t | s_{t−1}, a_{t−1}, …, s_1, a_1) = P(s_t | s_{t−1}, a_{t−1}) Completely observable environment [Puterman 94] 10 / 82
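The ⟨S, A, T, R⟩ tuple above maps naturally onto a small container class. A minimal sketch in Python (the class layout and the two-state toy chain are illustrative assumptions, not from the slides):

```python
import random

class MDP:
    """Minimal finite MDP container: states, actions, T(s, a, s'), R(s)."""
    def __init__(self, states, actions, T, R, gamma=0.9):
        self.states, self.actions = states, actions
        self.T, self.R, self.gamma = T, R, gamma   # T[(s, a)] -> {s': prob}

    def step(self, s, a):
        """Sample s' from T(s, a, .): by the Markov assumption the next state
        depends only on the current state and action."""
        nxt = self.T[(s, a)]
        return random.choices(list(nxt), weights=list(nxt.values()))[0]

# Two-state toy chain: 'go' flips the state 80% of the time, 'stay' never does.
mdp = MDP(states=["s1", "s2"], actions=["stay", "go"],
          T={("s1", "stay"): {"s1": 1.0}, ("s1", "go"): {"s2": 0.8, "s1": 0.2},
             ("s2", "stay"): {"s2": 1.0}, ("s2", "go"): {"s1": 0.8, "s2": 0.2}},
          R={"s1": 0.0, "s2": 1.0})
```

The dict-of-dicts transition representation keeps each row of T an explicit probability distribution, which the later examples (value iteration, policy evaluation) can reuse.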
11 Example: Mario in Maze (1) States s ∈ S Position (x, y) of Mario: s_1 = (1,1), s_2 = (1,2), …, s_11 = (4,3) (4×3 grid of states s_1 … s_11 shown) [Russell & Norvig] 11 / 82
12 Example: Mario in Maze (2) Actions a A Move left, right, up, and down [Russell & Norvig] 12 / 82
13 Example: Mario in Maze (3) Transition function T(s, a, s') = P(s' | s, a) Moving in intended direction with probability 0.8, slipping sideways with probability 0.1 each, e.g., for moving right from s_1: T(s_1, →, s_4) = 0.8, T(s_1, →, s_2) = T(s_1, →, s_1) = 0.1 [Russell & Norvig] 13 / 82
14 Example: Mario in Maze (4) Reward function R(s) For terminal states: R(s_11) = +1, R(s_10) = −1 For non-terminal states: R(s) = −0.04, ∀s ∈ S \ {s_10, s_11} [Russell & Norvig] 14 / 82
15 Policy Policy π: mapping from states to actions π(s) = a: execute action a in state s Describes agent's behavior Following policy π (1) Determine current state s (2) Execute action π(s) (3) Go to step (1) 15 / 82
16 Learning Goal Return: discounted cumulative reward R_t^π = R(s_t) + γR(s_{t+1}) + γ²R(s_{t+2}) + ⋯ Discount factor γ ∈ [0, 1) Determines present value of future rewards Encodes increasing uncertainty about future Bounds infinite sum Optimal policy: maximizes expected return π* = argmax_π E[R_0^π] 16 / 82
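The return above is easiest to compute with the backward recursion R_t = r_t + γR_{t+1}. A minimal sketch (the helper name is my own):

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite reward
    sequence, via the backward recursion R_t = r_t + gamma * R_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The backward pass avoids recomputing powers of γ, which matters for long trajectories.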
17 Planning and Learning Planning Agent uses model to create or improve policy It has a-priori knowledge about model of environment (e.g., MDP) (Reinforcement) learning Agent doesn't have a-priori knowledge about model It may explicitly learn (i.e., construct) model from interaction, but not necessarily It faces exploration vs. exploitation dilemma Planning and learning are closely related [Kee-Eung Kim, Introduction to RL, PRMLWS 2012] 17 / 82
18 Value Functions Recall Return: R_t^π = R(s_t) + γR(s_{t+1}) + γ²R(s_{t+2}) + ⋯ = Σ_{k=0}^∞ γ^k R(s_{t+k}) V^π(s): state-value function for policy π Expected return when starting from s and following π V^π(s) = E_π[R_t | s_t = s] = E_π[Σ_{k=0}^∞ γ^k R(s_{t+k}) | s_t = s] Q^π(s, a): action-value function for policy π Expected return when starting from s, taking a, and following π Q^π(s, a) = E_π[R_t | s_t = s, a_t = a] = E_π[Σ_{k=0}^∞ γ^k R(s_{t+k}) | s_t = s, a_t = a] 18 / 82
19 Bellman Equation for V^π Basic idea R_t = r_t + γr_{t+1} + γ²r_{t+2} + ⋯ = r_t + γ(r_{t+1} + γr_{t+2} + ⋯) = r_t + γR_{t+1} Bellman equation V^π(s) = E_π[R_t | s_t = s] = E_π[r_t + γ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s] = R(s) + γ Σ_{s'} T(s, π(s), s') V^π(s') Immediate reward plus discounted expected value of next states 19 / 82
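The Bellman equation for V^π can be solved by treating it as an update rule and sweeping until the values stop changing (iterative policy evaluation). A minimal sketch, reusing the dict-based transition format and a hypothetical two-state chain:

```python
def policy_evaluation(states, T, R, pi, gamma=0.9, theta=1e-12):
    """Sweep V(s) <- R(s) + gamma * sum_s' T(s, pi(s), s') * V(s') until the
    largest change falls below theta (successive approximation)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = R[s] + gamma * sum(p * V[s2] for s2, p in T[(s, pi[s])].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

# Two-state chain (illustrative): 'go' flips the state 80% of the time.
T = {("s1", "go"): {"s2": 0.8, "s1": 0.2}, ("s1", "stay"): {"s1": 1.0},
     ("s2", "go"): {"s1": 0.8, "s2": 0.2}, ("s2", "stay"): {"s2": 1.0}}
V = policy_evaluation(["s1", "s2"], T, {"s1": 0.0, "s2": 1.0},
                      {"s1": "go", "s2": "stay"})
```

At the fixed point V satisfies the Bellman equation exactly: here V(s2) = 1 + 0.9·V(s2) gives V(s2) = 10.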
20 Optimal Value Functions Partial ordering over policies π ≥ π' if and only if V^π(s) ≥ V^{π'}(s) for all s ∈ S Optimal policy π*: π* ≥ π for all π Optimal policies share same optimal value functions V*(s) = max_π V^π(s) for all s ∈ S Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A Bellman optimality equations for V* and Q* V*(s) = max_a [R(s) + γ Σ_{s'} T(s, a, s') V*(s')] Q*(s, a) = R(s) + γ Σ_{s'} T(s, a, s') max_{a'} Q*(s', a') 20 / 82
21 Value Iteration Algorithm Iterative algorithm for solving Bellman equation V_{i+1}(s) ← max_a [R(s) + γ Σ_{s'} T(s, a, s') V_i(s')] As i → ∞, V_i converges to optimal value function Initialize V_0 arbitrarily, e.g., V_0(s) = 0 for all s ∈ S For i = 0 to ∞ For each s ∈ S: V_{i+1}(s) ← max_a [R(s) + γ Σ_{s'} T(s, a, s') V_i(s')] Until max_s |V_i(s) − V_{i+1}(s)| < θ Output policy π such that π(s) = argmax_a [R(s) + γ Σ_{s'} T(s, a, s') V(s')] 21 / 82
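The pseudocode above translates almost line for line into Python. A minimal sketch on a hypothetical two-state chain (the maze itself is larger but identical in structure):

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-10):
    """V(s) <- max_a [R(s) + gamma * sum_s' T(s,a,s') V(s')] until the largest
    change is below theta, then read off the greedy policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(R[s] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            break
    greedy = lambda s: max(actions, key=lambda a: R[s] + gamma * sum(
        p * V[s2] for s2, p in T[(s, a)].items()))
    return V, {s: greedy(s) for s in states}

# Reward 1 in s2, so the optimal policy heads there and stays.
T = {("s1", "go"): {"s2": 0.8, "s1": 0.2}, ("s1", "stay"): {"s1": 1.0},
     ("s2", "go"): {"s1": 0.8, "s2": 0.2}, ("s2", "stay"): {"s2": 1.0}}
V, pi = value_iteration(["s1", "s2"], ["stay", "go"], T, {"s1": 0.0, "s2": 1.0})
```

This uses in-place (Gauss–Seidel) sweeps rather than keeping separate V_i and V_{i+1} arrays; both converge to the same fixed point.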
22 Value Iteration Example (1) Initialize V_0(s) = 0 for all s ∈ S (grid shown with terminal values +1 and −1) 22 / 82
23 Value Iteration Example (2) V_1(s) ← max_a [R(s) + γ Σ_{s'} T(s, a, s') V_0(s')] (updated grid shown) 23 / 82
24 Value Iteration Example (3) V_2(s) ← max_a [R(s) + γ Σ_{s'} T(s, a, s') V_1(s')] For s_2, the backup R(s_2) + γ Σ_{s'} T(s_2, a, s') V_1(s') is evaluated for each of the four actions over the neighboring states s_1, s_2, s_3, s_4, and V_2(s_2) is the maximum of the four values (numeric grids omitted) 24-26 / 82
27 Value Iteration Example (4) V_3(s) ← max_a [R(s) + γ Σ_{s'} T(s, a, s') V_2(s')] (updated grid shown) 27 / 82
28 Value Iteration Example (5) V_21(s) ← max_a [R(s) + γ Σ_{s'} T(s, a, s') V_20(s')] (values have essentially converged) 28 / 82
29 Value Iteration Example (6) Output policy π(s) = argmax_a [R(s) + γ Σ_{s'} T(s, a, s') V(s')], computed by comparing the four action backups at each state, e.g., at s_2 (numeric grid omitted) 29 / 82
30 Value Iteration Example (7) Optimal policy π* (arrows on grid; terminal states +1 and −1) 30 / 82
31 Contents Reinforcement learning (RL) Markov decision process (MDP) Bellman equations for value functions Value iteration algorithm Inverse reinforcement learning (IRL) Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary 31 / 82
32 Applying RL to Real-World Problems Formalizing problems in framework of RL Dynamics model: relatively easy to obtain Computed by counting events Reward function: non-trivial to obtain Hand-tuned until satisfactory policy is obtained More systematic approach is required! 32 / 82
33 Example: Mario in Maze MDP Transition function: moving in intended direction with probability 0.8 Reward: for terminal states R(s) = ±1, for non-terminal states R(s) = −0.04 Optimal policy π* The optimal policy changes qualitatively for different ranges of R(s) (grids shown for several reward ranges; threshold values omitted) [Russell & Norvig] 33 / 82
34 Example: Autonomous Helicopter Control Reward function design Difficult to balance many desiderata But, easy to collect expert's behavior 34 / 82
35 Inverse Reinforcement Learning (IRL) Definition [Russell 98] Recovering domain expert's reward function from its behavior RL vs. IRL RL: given dynamics model and reward function, find optimal policy (prescribes action to take for each state) IRL: given dynamics model and expert's behavior, find reward function (represents goal or preference) Useful for robotics, human/animal behavior studies, neuroscience, economics, … [Argall et al. 09; Cohen & Ranganath 07; Hopkins 07; Borgers & Sarin 00] 35 / 82
36 IRL Problems IRL for MDP\R Assumption (1) Completely observable environments (MDP) (2) Expert behaves optimally Input (1) Dynamics model: transition function T (2) Expert's behavior: trajectory set D Output: reward function R making expert's policy π_E optimal Expert's behavior: trace of expert's policy π_E Set of trajectories D = {τ_1, τ_2, …, τ_M} Trajectory τ_m = ((s_{m,1}, a_{m,1}), …, (s_{m,H}, a_{m,H})) 36 / 82
37 IRL vs. Imitation Learning Apprenticeship learning via IRL Find good policy using R inferred by IRL Imitation learning [Argall et al. 09] Learn teacher's policy directly using supervised learning Fix class of policy: neural network, decision tree, … Estimate policy (= mapping from states to actions) from training examples (= trajectories in D) Advantages of IRL Reward function is succinct and transferable description of task: π vs. R 37 / 82
38 Matrix Notations (1) Definition of MDP ⟨S, A, T, R⟩ Set of states s ∈ S, set of actions a ∈ A Transition function T(s, a, s') = P(s' | s, a) Reward function R(s, a) Using matrix notations Transition function: |S||A| × |S| matrix T T^π: |S| × |S| matrix with T^π(s, s') = T(s, π(s), s') T_a: |S| × |S| matrix with T_a(s, s') = T(s, a, s') Reward function: |S||A|-dimensional vector R R^π: |S|-dimensional vector with R^π(s) = R(s, π(s)) R_a: |S|-dimensional vector with R_a(s) = R(s, a) 38 / 82
39 Matrix Notations (2) Bellman equations for value functions V^π(s) = R(s, π(s)) + γ Σ_{s'} T(s, π(s), s') V^π(s') ⟺ V^π = R^π + γT^π V^π Q^π(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') V^π(s') ⟺ Q_a^π = R_a + γT_a V^π Here V^π and R^π are |S|-dimensional vectors and T^π is an |S| × |S| matrix 39 / 82
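The payoff of the matrix form is that policy evaluation becomes a single linear solve: V^π = (I − γT^π)^{−1}R^π. A minimal numpy sketch on a two-state chain (the numbers are illustrative, not from the slides):

```python
import numpy as np

gamma = 0.9
# T_pi[s, s'] = T(s, pi(s), s') under a fixed policy; R_pi[s] = R(s, pi(s)).
T_pi = np.array([[0.2, 0.8],
                 [0.0, 1.0]])
R_pi = np.array([0.0, 1.0])

# V^pi = (I - gamma * T^pi)^{-1} R^pi  -- exact, no iteration needed
V_pi = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)
```

Since γ < 1 and T^π is a stochastic matrix, (I − γT^π) is always invertible, so the solve is well defined.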
40 Reward Optimality Condition IRL for MDP\R given policy π_E MDP\R ⟨S, A, T, γ⟩, π_E: S → A Find R making π_E optimal Necessary and sufficient condition for R to guarantee optimality of π_E [Ng & Russell 00] V^{π_E}(s; R) ≥ Q^{π_E}(s, a; R) for all s ∈ S and a ∈ A, i.e., V^{π_E}(R) ⪰ Q_a^{π_E}(R) for all a ∈ A Key fact used in many IRL algorithms 40 / 82
41 Reward Optimality Region (1) Reward optimality region for π Region of reward functions that yield π as optimal: [I − (I^A − γT)(I − γT^π)^{−1}E^π] R ⪯ 0 E^π: |S| × |S||A| matrix with E^π(s, (s', a)) = 1 if s = s' and π(s) = a I^A: stacking the |S| × |S| identity matrix |A| times Proof V^π = R^π + γT^π V^π ⟹ V^π = (I − γT^π)^{−1}R^π Q_a^π = R_a + γT_a V^π V^π(R) ⪰ Q_a^π(R) for all a ∈ A ⟺ R^π + γT^π V^π ⪰ R_a + γT_a V^π ⟺ R^π + γT^π(I − γT^π)^{−1}R^π ⪰ R_a + γT_a(I − γT^π)^{−1}R^π 41 / 82
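The condition V^π ⪰ Q_a^π can be checked numerically: evaluate the policy, back up each action's Q-values, and compare. A sketch on a hypothetical two-state MDP with a state-based reward (the simpler R(s) form from the first part of the talk):

```python
import numpy as np

gamma = 0.9
# T[a][s, s'] is the transition matrix of action a; R is a state-based reward.
T = {"stay": np.array([[1.0, 0.0], [0.0, 1.0]]),
     "go":   np.array([[0.2, 0.8], [0.8, 0.2]])}
R = np.array([0.0, 1.0])
pi = {0: "go", 1: "stay"}   # candidate policy to test for optimality under R

# Evaluate pi exactly: V^pi = (I - gamma * T^pi)^{-1} R
T_pi = np.array([T[pi[s]][s] for s in (0, 1)])
V = np.linalg.solve(np.eye(2) - gamma * T_pi, R)

# R yields pi as optimal iff V(s) >= Q(s, a) for every state and action
Q = {a: R + gamma * T[a] @ V for a in T}
is_optimal = all(np.all(V >= Q[a] - 1e-9) for a in T)
```

Heading toward the rewarding state and staying there does satisfy the condition here, so `is_optimal` is true; scaling R by any positive constant would not change that, which is the ill-posedness discussed next.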
42 Reward Optimality Region (2) Reward optimality region for π_E Region of reward functions that yield π_E as optimal: [I − (I^A − γT)(I − γT^{π_E})^{−1}E^{π_E}] R ⪯ 0 Convex polytope in reward space Difficulty of IRL problems Infinitely many reward functions guarantee optimality of π_E Degenerate case: R = 0 makes any policy optimal Inherently ill-posed 42 / 82
43 Ng and Russell's Algorithm IRL for MDP\R ⟨S, A, T, γ⟩ given policy π_E Choose R in reward optimality region for π_E Maximize sum of margins between π_E and all other actions Favor sparse reward function Optimization problem (linear programming formulation) max_R Σ_s [Q^{π_E}(s, π_E(s)) − max_{a ∈ A∖{π_E(s)}} Q^{π_E}(s, a)] − λ‖R‖_1 s.t. [I − (I^A − γT)(I − γT^{π_E})^{−1}E^{π_E}] R ⪯ 0 (reward optimality region), |R(s, a)| ≤ R_max ∀s ∈ S, a ∈ A First term: sum of margins; λ‖R‖_1: penalty on too many non-zero entries; λ: adjustable parameter [Ng & Russell 00] 43 / 82
44 Apprenticeship Learning (AL) via IRL Goal: learn policy from observing expert Find policy whose performance is as good as expert's policy, measured according to expert's unknown reward function Example: when teaching someone to drive We do not tell what the reward function is Difficult to balance many desiderata We demonstrate driving: preference for lane, safe following distance, keeping away from pedestrians, maintaining reasonable speed [Abbeel & Ng 04] 44 / 82
45 Reward Function Representation Known feature functions Φ: S × A → [0, 1]^D Φ = [φ_1, …, φ_D]: |S||A| × D feature matrix φ_d: |S||A|-dimensional feature vector Φ(s, a): D-dimensional feature vector Unknown weight vector w ∈ [−1, 1]^D Reward function is linear combination of feature functions R = Φw, or R(s, a) = wᵀΦ(s, a) = Σ_{d=1}^D w_d φ_d(s, a) Common assumption in IRL to address problems with large state space 45 / 82
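Concretely, R = Φw is one matrix-vector product. A sketch with a toy 4-row feature matrix (the feature names and weights are illustrative; in IRL the weights w are the unknowns):

```python
import numpy as np

# Feature matrix Phi: one row per (state, action) pair, D = 3 binary features.
Phi = np.array([[1, 0, 0],   # (s1, a1): reaches goal
                [0, 1, 0],   # (s1, a2): collects coin
                [0, 1, 1],   # (s2, a1): collects coin AND takes damage
                [0, 0, 0]])  # (s2, a2): nothing happens
w = np.array([1.0, 0.3, -0.5])   # weight per feature, each in [-1, 1]

R = Phi @ w   # reward vector over all (state, action) pairs
```

The state space can be huge, but the reward is pinned down by only D numbers, which is what makes the representation tractable for IRL.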
46 Example: Feature Functions in Mario Bros. Mario Bros. has huge state space Require compact representation of reward function Binary feature functions indicate: Mario successfully reaching end of level; Mario collecting coin; Mario killing enemy; Mario receiving damage from enemy; Mario getting killed by enemy; and so on 46 / 82
47 Feature Expectation (FE) Feature expectation Expected cumulative discounted sum of feature values μ^π = E[Σ_{t=0}^∞ γ^t Φ(s_t, π(s_t)) | s_0] ∈ R^D V^π(s_0) = E[Σ_{t=0}^∞ γ^t R(s_t, π(s_t)) | s_0] = E[Σ_{t=0}^∞ γ^t wᵀΦ(s_t, π(s_t)) | s_0] = wᵀμ^π Estimate expert's feature expectation μ^{π_E} Expert's policy π_E is not known Expert's behavior is given by set of trajectories D = {τ_1, τ_2, …, τ_M}, τ_m = ((s_{m,1}, a_{m,1}), …, (s_{m,H}, a_{m,H})) μ^{π_E} ≈ μ̂^{π_E} = (1/M) Σ_{m=1}^M Σ_{h=1}^H γ^{h−1} Φ(s_{m,h}, a_{m,h}) Denote μ_E = μ^{π_E} and μ̂_E = μ̂^{π_E} 47 / 82
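The empirical estimate μ̂_E is a discounted average over the observed trajectories. A sketch with two hypothetical trajectories (note `enumerate` starts at h = 0, which matches the 1-indexed γ^{h−1} of the formula):

```python
import numpy as np

gamma = 0.9
# phi maps each (state, action) pair to its D-dimensional feature vector.
phi = {("s1", "a1"): np.array([1.0, 0.0]),
       ("s1", "a2"): np.array([0.0, 1.0]),
       ("s2", "a1"): np.array([1.0, 1.0])}
# Two observed trajectories, each a list of (state, action) pairs.
D = [[("s1", "a1"), ("s2", "a1")],
     [("s1", "a2"), ("s2", "a1")]]

# mu_hat_E = (1/M) * sum_m sum_h gamma^(h-1) * phi(s_mh, a_mh)
mu_E = sum(sum((gamma ** h) * phi[sa] for h, sa in enumerate(tau))
           for tau in D) / len(D)
```

Each trajectory contributes [1, 0] or [0, 1] at step 0 plus 0.9·[1, 1] at step 1, so the average comes out to [1.4, 1.4].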
48 Closeness of FE and Performance Feature expectation μ^π: V^π(s_0) = wᵀμ^π Performance closeness is bounded by FE closeness For any underlying reward R = Φw: |V^{π_E}(s_0) − V^π(s_0)| = |wᵀμ_E − wᵀμ^π| ≤ ‖w‖_2 ‖μ_E − μ^π‖_2 ≤ ‖μ_E − μ^π‖_2 since ‖w‖_2 ≤ ‖w‖_1 ≤ 1 by assumption So: find policy π whose feature expectation is close to μ_E ≈ μ̂_E 48 / 82
49 Algorithm for AL via IRL Alternate RL step and IRL step RL step: compute optimal policy for estimated reward IRL step: estimate reward that makes π_E perform better than all previously found policies Initialize w arbitrarily and set Π ← ∅ Repeat [RL step] Compute optimal policy π for MDP with R = Φw, Π ← Π ∪ {π} [IRL step] Solve max_{x,w} x s.t. wᵀμ_E ≥ wᵀμ^π + x ∀π ∈ Π, ‖w‖_2 ≤ 1 Until x ≤ ε The IRL step finds w s.t. wᵀμ_E ≥ wᵀμ^π ∀π: a quadratically constrained programming problem 49 / 82
50 Experiments: Simulated Highway Learn different driving styles Nice Nasty Right lane nice [Abbeel & Ng 04] 50 / 82
51 Contents Reinforcement learning (RL) Inverse reinforcement learning (IRL) Reward optimality condition Ng and Russell s algorithm Apprenticeship learning via IRL Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary 51 / 82
52 Difficulty of IRL Problems Solution of IRL is not unique: ill-posed Infinitely many reward functions guarantee optimality of π_E Even degenerate case R = 0 makes any policy optimal Previous heuristics to address ill-posedness Maximizing margin between expert's policy and all other policies [Ng & Russell 00; Ratliff et al. 06] Minimizing deviation from expert's policy [Neu & Szepesvari 07] Maximizing worst case performance compared to expert's policy [Syed & Schapire 07] Adopting maximum entropy principle for choosing learned policy [Ziebart et al. 08] 52 / 82
53 Bayesian Inference One of the two dominant paradigms of statistical inference Estimate probability that hypothesis is true as additional evidence is acquired Prior P(H): probability hypothesis is true before any evidence is acquired Likelihood P(E | H): compatibility of evidence with given hypothesis Posterior P(H | E) ∝ P(E | H) P(H): probability of hypothesis given observed evidence 53 / 82
54 Bayesian Framework for IRL Evidence Expert's behavior: set of trajectories D = {τ_1, τ_2, …, τ_M}, τ_m = ((s_{m,1}, a_{m,1}), …, (s_{m,H}, a_{m,H})) Prior P(R): preference or external knowledge on reward functions Likelihood P(D | R): compatibility of reward function with given behavior data Posterior P(R | D) ∝ P(D | R) P(R) [Ramachandran & Amir 07] 54 / 82
55 Prior Assume rewards are i.i.d. P(R) = Π_{s,a} P(R(s, a)) Example distributions Uniform distribution over [−R_max, R_max] No knowledge about rewards other than its range Normal or Laplacian distribution with zero mean P(R(s, a) = r) ∝ exp(−r²/2σ²) or exp(−|r|/2σ) Prefer sparse rewards Beta distribution with α = β = 1/2 P(R(s, a) = r) ∝ (r/R_max)^{−1/2} (1 − r/R_max)^{−1/2} Planning problem: most states have small rewards but a few states (e.g., goal) have high reward 55 / 82
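Because the entries are i.i.d., the log-prior is just a sum of per-entry terms, which is the form MCMC and MAP methods actually use. A sketch of the Gaussian and Laplacian cases (function names are my own; the Laplacian normalizer 4σ follows from the slides' exp(−|r|/2σ) form):

```python
import math

def log_prior_gaussian(R, sigma=1.0):
    """i.i.d. zero-mean Gaussian prior: log P(R) = sum_{s,a} log N(R(s,a); 0, sigma^2)."""
    return sum(-r * r / (2 * sigma ** 2) - 0.5 * math.log(2 * math.pi * sigma ** 2)
               for r in R)

def log_prior_laplacian(R, sigma=1.0):
    """i.i.d. zero-mean Laplacian prior p(r) = exp(-|r|/2sigma) / (4sigma);
    penalizes |r| linearly, so it prefers sparse rewards more strongly."""
    return sum(-abs(r) / (2 * sigma) - math.log(4 * sigma) for r in R)
```

Either function plugs directly into the posterior P(R | D) ∝ P(D | R)P(R) on the next slides as the log P(R) term.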
56 Likelihood Expert's behavior D = {τ_1, …, τ_M}, τ_m = ((s_{m,1}, a_{m,1}), …, (s_{m,H}, a_{m,H})) Assumption: π_E is stationary, so P(D | R) = Π_m Π_{(s,a)∈τ_m} P(s, a | R) Expert behaves optimally with respect to R P(s, a | R) = (1/Z(s; R)) exp(Q*(s, a; R)), Z(s; R) = Σ_{a'∈A} exp(Q*(s, a'; R)) Likelihood of D P(D | R) = Π_m Π_{(s,a)∈τ_m} (1/Z(s; R)) exp(Q*(s, a; R)) (plot: P(s, a | R) as a softmax over Q*(s, a; R) for actions a_1 … a_4) 56 / 82
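In log space the likelihood is a sum of softmax terms, one per observed (s, a) pair. A minimal sketch that assumes Q*(s, a; R) has already been computed (e.g., by value iteration) and is passed in as a dict:

```python
import math

def log_likelihood(D, Q):
    """log P(D | R) under the Boltzmann model of the slides:
    P(s, a | R) = exp(Q*(s, a; R)) / sum_a' exp(Q*(s, a'; R)).
    Q is a dict {(s, a): value}, precomputed for the candidate reward R."""
    actions = sorted({a for (_, a) in Q})
    ll = 0.0
    for tau in D:
        for s, a in tau:
            log_Z = math.log(sum(math.exp(Q[(s, b)]) for b in actions))
            ll += Q[(s, a)] - log_Z
    return ll

# One observed pair, two equally-valued actions: probability 1/2 each.
Q = {("s", "a1"): 0.0, ("s", "a2"): 0.0}
ll = log_likelihood([[("s", "a1")]], Q)   # log(1/2)
```

Rewards under which the demonstrated actions have higher Q-values get higher likelihood, which is exactly how the data pulls on the posterior.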
57 Posterior Posterior: P(R | D) ∝ P(D | R) P(R) Prior: P(R) = Π_{s,a} P(R(s, a)) Likelihood: P(D | R) = Π_m Π_{(s,a)∈τ_m} (1/Z(s; R)) exp(Q*(s, a; R)) (graphical model: Gaussian prior with parameters μ, σ over the |S||A| reward entries R(s, a), generating the observed pairs (s_{m,h}, a_{m,h}), h = 1…H, m = 1…M) 57 / 82
58 Example: 5-State Chain MDP with 5 states arranged in chain Action a_1: moves right with prob. 0.6, moves left with prob. 0.4 Action a_2: always moves to state s_1 True reward R = [0.1, 0, 0, 0, 1] Expert's policy π_E: optimal policy on R, taking a_1 in every state s_1 … s_5 Prior: R(s_2) = R(s_3) = R(s_4) = 0, P(R(s_1)) = N(0.1, 1), P(R(s_5)) = N(1, 1) (plot: true reward, MAP reward, posterior mean reward, and reward optimality region) 58 / 82
59 Algorithms for BIRL PolicyWalk: MCMC algorithm [Ramachandran & Amir 07] Generate samples from posterior P(R | D) Return sample mean as estimate of true mean of posterior Generate Markov chain on intersection points of grid of length δ in reward space Initialize R arbitrarily on the grid Repeat Pick R' uniformly at random from grid neighbors of R Set R ← R' with probability min{1, P(R' | D) / P(R | D)} Gradient method [Choi & Kim 11] Computing approximate maximum-a-posteriori (MAP) estimate 59 / 82
60 Contents Reinforcement learning (RL) Inverse reinforcement learning (IRL) Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary 60 / 82
61 IRL from Collective Behavior Data Multiple experts generate collective behavior data Experts may have reward functions different from each other Identify multiple reward functions from collective behavior data D = {τ_1, …, τ_M} (example: Mario trajectories from a coin collector and an enemy killer, mapped back to the players' preferences by IRL) 61 / 82
62 IRL for Multiple Reward Functions Naive solution Individually infer reward function of each trajectory Data sparsity problem Solution Cluster trajectories according to inferred reward functions Do not know the number of clusters a priori Behavior data D = {τ_1, τ_2, …, τ_M} Reward functions {R_1, R_2, …, R_K}, K ≪ M 62 / 82
63 Dirichlet Process (1) Metaphor: Chinese restaurant with infinite # of tables 1st customer sits at 1st table m-th customer Joins occupied table k with prob. proportional to its popularity (# of customers seated there) Sits at new table with prob. proportional to α Tables are clusters, customers are data points Rich-gets-richer phenomenon: popular tables have higher growing probabilities Cluster data with unknown # of clusters [Neal 00] 63 / 82
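The seating rule above takes only a few lines to simulate, and makes the rich-gets-richer dynamics concrete (the function is an illustrative sketch, not from the slides):

```python
import random

def chinese_restaurant_process(n_customers, alpha=1.0, seed=0):
    """Seat customers one by one: join existing table k with probability
    n_k / (m - 1 + alpha), open a new table with probability alpha / (m - 1 + alpha)."""
    rng = random.Random(seed)
    tables = []        # tables[k] = number of customers seated at table k
    assignments = []
    for _ in range(n_customers):
        weights = tables + [alpha]          # existing tables, then a new one
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(0)                # a brand-new table is opened
        tables[k] += 1                      # rich-get-richer: big tables grow
        assignments.append(k)
    return tables, assignments

tables, assignments = chinese_restaurant_process(200)
```

The number of occupied tables grows only logarithmically with the number of customers (roughly α·ln M in expectation), which is why the DP can infer the number of clusters instead of fixing it.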
64 Dirichlet Process (2) Same metaphor: Chinese restaurant with infinite # of tables, where tables are reward functions R_1, R_2, … and customers are trajectories τ 1st customer sits at 1st table m-th customer joins occupied table k with prob. proportional to its popularity, sits at new table with prob. proportional to α Rich-gets-richer phenomenon: popular tables have higher growing probabilities Cluster trajectories with unknown # of reward functions [Neal 00] 64 / 82
65-66 Gaussian Mixture Model + Dirichlet Prior Motivation How many clusters are present in the data? Dirichlet distribution Conjugate prior of multinomial distribution Generative process β ~ Dir(α/K, …, α/K), K = 2 c_n ~ Mult(β = (β_1, …, β_K)) θ_k ~ G_0 y_n ~ N(y | θ_{c_n}) Cluster assignment prob. P(c_n) follows multinomial distribution 65-66 / 82
67 Infinite Gaussian Mixture Model Motivation How many clusters are present in the data? Dirichlet process Conjugate prior of infinite multinomial distribution Generative process β ~ Dir(α/K, …, α/K), K → ∞ c_n ~ Mult(β = (β_1, …, β_K)) θ_k ~ G_0 y_n ~ N(y | θ_{c_n}) Dirichlet process mixture (DPM) model Cluster assignment prob. P(c_n) follows infinite multinomial distribution [Rasmussen 00] 67 / 82
68 Nonparametric Bayesian IRL for Multiple Rewards Approach Extend BIRL with Dirichlet process mixture (DPM) model Advantages Does not require the number of experts in advance Identifies which trajectory is generated by which expert [Choi & Kim 12] 68 / 82
69 DPM-BIRL Generative process Cluster assignment c = [c_1, …, c_M] is drawn by β ~ Dir(α/K, …, α/K) with K → ∞, c_m ~ Mult(β = (β_1, …, β_K)) Reward R_k is drawn from reward prior P(R) Trajectory τ_m is generated by P(τ_m | R_{c_m}) Inference using MCMC method Collecting samples from joint posterior P(c, {R_k}_{k=1}^K | D, α) Finding maximum-a-posteriori (MAP) estimate 69 / 82
70 Algorithm for DPM-BIRL [Metropolis-Hastings update for cluster assignments] Sample c_m' from P(c_m' | c_{−m}, α) ∝ n_{−m,j} if c_m' = j for some existing cluster j, α if c_m' is a new cluster If c_m' is a new cluster, draw new reward R_{c_m'} from P(R) Set c_m = c_m' with prob. min{1, P(τ_m | R_{c_m'}) / P(τ_m | R_{c_m})} [Update rewards using Langevin algorithm] f(R_k) = P(τ_{c=k} | R_k) P(R_k): unnormalized reward posterior Propose R_k' = R_k + (ξ²/2) ∇log f(R_k) + ξε with ε ~ N(0, 1), proposal density g(x, y) ∝ exp(−‖y − x − (ξ²/2) ∇log f(x)‖² / 2ξ²) Set R_k = R_k' with prob. min{1, [f(R_k') g(R_k', R_k)] / [f(R_k) g(R_k, R_k')]} 70 / 82
71 Information Transfer to New Trajectory (1) Finishing IRL on given data D = {τ_1, τ_2, …, τ_M} Computed IRL results: set of samples {(c^l, R_1^l, …, R_K^l)}_{l=1}^L drawn from joint posterior Inferring reward for new trajectory τ_new Transfer relevant information from pre-computed IRL results Compute conditional posterior of R for τ_new P(R | τ_new, D, α) ∝ P(τ_new | R) P(R | D, α) 71 / 82
72 Information Transfer to New Trajectory (2) Conditional posterior of R for τ_new P(R | τ_new, D, α) ∝ P(τ_new | R) [α/(α+M) P(R) + 1/(α+M) Σ_{l,k} (n_k^l / L) δ_{R_k^l}(R)] Approximated posterior mean is analytically computed E[R | τ_new, D, α] = ∫ R dP(R | τ_new, D, α) ≈ (1/Z) [α/(α+M) R̂ P(τ_new | R̂) + 1/(α+M) Σ_{l,k} (n_k^l / L) P(τ_new | R_k^l) R_k^l] where R̂ = argmax_R P(τ_new | R) P(R): MAP estimate for τ_new alone 72 / 82
73 Experiments: Mario Bros. (1) DPM-BIRL aligns trajectories with players much better than previous algorithm [Babes-Vroman et al. 11] Four players with different preferences over collecting coins and killing enemies: expert player, coin collector, enemy killer, Speedy Gonzales Ground truth: τ_1 … τ_12 grouped by player DPM-BIRL: recovers the ground-truth grouping EM-MLIRL(4) and EM-MLIRL(8): misassign some trajectories (e.g., τ_9) 73 / 82
74 Experiments: Mario Bros. (2) Trajectories τ_1 … τ_12 grouped by DPM-BIRL into clusters #1-#5 Reward functions learned by DPM-BIRL, visualized as weights on features such as φ_coin-collected and φ_enemy-killed 74 / 82
75 Experiments: Mario Bros. (3) Information transfer to new trajectory Visualize posterior probability of player's behavior 75 / 82
76 Contents Reinforcement learning (RL) Inverse reinforcement learning (IRL) Bayesian inverse reinforcement learning (BIRL) Nonparametric Bayesian inverse reinforcement learning Applications and Summary 76 / 82
77 Example IRL Applications (1) Quadruped locomotion [Kolter et al. 07] 77 / 82
78 Example IRL Applications (2) Learn driving style [Levine & Koltun 12] 78 / 82
79 Example IRL Applications (3) Personal navigation device predicts driver's route [Ziebart] 79 / 82
80 Summary Reinforcement learning Principled approach to sequential decision making problems Learning from experience Inverse reinforcement learning Recovering expert's reward function (objective or preference) from its behavior Natural way to examine animal and human behaviors Building computational model for making decisions Making robots learn to mimic demonstrator 80 / 82
81 Recommended Reading Reinforcement learning M. L. Puterman, Markov Decision Processes, John Wiley & Sons, R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, Cambridge Univ Press, L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, Planning and acting in partially observable stochastic domains, Artificial Intelligence (101), J. Pineau, G. Gordon, and S. Thrun, Anytime point-based approximations for large POMDPs, Journal of Artificial Intelligence Research (27), Inverse reinforcement learning A. Y. Ng and S. Russell, Algorithms for Inverse Reinforcement Learning, ICML P. Abbeel and A. Y. Ng, Apprenticeship learning via inverse reinforcement learning, ICML N. D. Ratliff, J. Bagnell, and M. A. Zinkevich, Maximum margin planning, ICML D. Ramachandran and E. Amir, Bayesian Inverse Reinforcement Learning, IJCAI B. D. Ziebart, A. Maas, J. Bagnell, and A. K. Dey, Maximum entropy inverse reinforcement learning, AAAI J. Choi and K. Kim, Inverse reinforcement learning in partially observable environments, JMLR (12), S. Levine, Z. Popovic, and V. Koltun, Nonlinear Inverse Reinforcement Learning with Gaussian Processes, NIPS J. Choi and K. Kim, Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions, NIPS J. Choi and K. Kim, Bayesian Nonparametric Feature Construction for Inverse Reinforcement Learning, IJCAI 81 / 82
82 References [Abbeel & Ng 04] P. Abbeel and A. Y. Ng, Apprenticeship learning via inverse reinforcement learning, ICML [Argall et al. 09] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, A survey of robot learning from demonstration, Robotics and Autonomous Systems, 57(5), [Babes-Vroman et al. 11] M. Babes-Vroman, V. Marivate, K. Subramanian, and M. Littman, Apprenticeship Learning About Multiple Intentions, ICML [Borgers & Sarin 00] T. Borgers and R. Sarin, Naive reinforcement learning with endogenous aspirations, International Economic Review, 41(4), [Choi & Kim 11] J. Choi and K. Kim, MAP Inference for Bayesian Inverse Reinforcement Learning, NIPS [Choi & Kim 12] J. Choi and K. Kim, Nonparametric Bayesian Inverse Reinforcement Learning for Multiple Reward Functions, NIPS [Hopkins 07] E. Hopkins, Adaptive learning models of consumer behavior, Journal of Economic Behavior and Organization, 64(3-4), [Kolter et al. 07] J. Kolter, P. Abbeel, and A. Ng, Hierarchical Apprenticeship Learning with Application to Quadruped Locomotion, NIPS [Neal 00] R. Neal, Markov Chain Sampling Methods for Dirichlet Process Mixture Models, Journal of Computational and Graphical Statistics, 9(2), [Neu & Szepesvari 07] G. Neu and C. Szepesvari, Apprenticeship learning using inverse reinforcement learning and gradient methods, UAI [Ng & Russell 00] A. Y. Ng and S. Russell, Algorithms for Inverse Reinforcement Learning, ICML [Niv 09] Y. Niv, Reinforcement learning in the brain, Journal of Mathematical Psychology, 53(3), [Puterman 94] M. L. Puterman, Markov Decision Processes, John Wiley & Sons, [Ramachandran & Amir 07] D. Ramachandran and E. Amir, Bayesian Inverse Reinforcement Learning, IJCAI [Rasmussen 00] C. E. Rasmussen, The Infinite Gaussian Mixture Model, NIPS [Ratliff et al. 06] N. D. Ratliff, J. Bagnell, and M. A. Zinkevich, Maximum margin planning, ICML [Russell 98] S. Russell, Learning agents for uncertain environments (extended abstract), COLT [Russell & Norvig] S. Russell and P. Norvig, Artificial intelligence: A modern approach, Prentice Hall, [Sutton & Barto 98] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction, Cambridge Univ Press, [Syed & Schapire 07] U. Syed and R. E. Schapire, A Game-Theoretic approach to apprenticeship learning, NIPS [Ziebart et al. 08] B. D. Ziebart, A. Maas, J. Bagnell, and A. K. Dey, Maximum entropy inverse reinforcement learning, AAAI 82 / 82
More informationReinforcement learning
Reinforcement learning Stuart Russell, UC Berkeley Stuart Russell, UC Berkeley 1 Outline Sequential decision making Dynamic programming algorithms Reinforcement learning algorithms temporal difference
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More informationCS788 Dialogue Management Systems Lecture #2: Markov Decision Processes
CS788 Dialogue Management Systems Lecture #2: Markov Decision Processes Kee-Eung Kim KAIST EECS Department Computer Science Division Markov Decision Processes (MDPs) A popular model for sequential decision
More informationCSE250A Fall 12: Discussion Week 9
CSE250A Fall 12: Discussion Week 9 Aditya Menon (akmenon@ucsd.edu) December 4, 2012 1 Schedule for today Recap of Markov Decision Processes. Examples: slot machines and maze traversal. Planning and learning.
More informationInverse Optimal Control
Inverse Optimal Control Oleg Arenz Technische Universität Darmstadt o.arenz@gmx.de Abstract In Reinforcement Learning, an agent learns a policy that maximizes a given reward function. However, providing
More informationMarkov Decision Processes and Solving Finite Problems. February 8, 2017
Markov Decision Processes and Solving Finite Problems February 8, 2017 Overview of Upcoming Lectures Feb 8: Markov decision processes, value iteration, policy iteration Feb 13: Policy gradients Feb 15:
More informationAn Adaptive Clustering Method for Model-free Reinforcement Learning
An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationA Game-Theoretic Approach to Apprenticeship Learning
Advances in Neural Information Processing Systems 20, 2008. A Game-Theoretic Approach to Apprenticeship Learning Umar Syed Computer Science Department Princeton University 35 Olden St Princeton, NJ 08540-5233
More informationIntroduction to Reinforcement Learning. CMPT 882 Mar. 18
Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and
More informationEfficient Probabilistic Performance Bounds for Inverse Reinforcement Learning
Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning Daniel S. Brown and Scott Niekum Department of Computer Science University of Texas at Austin {dsbrown,sniekum}@cs.utexas.edu
More informationReinforcement Learning and NLP
1 Reinforcement Learning and NLP Kapil Thadani kapil@cs.columbia.edu RESEARCH Outline 2 Model-free RL Markov decision processes (MDPs) Derivative-free optimization Policy gradients Variance reduction Value
More informationCourse 16:198:520: Introduction To Artificial Intelligence Lecture 13. Decision Making. Abdeslam Boularias. Wednesday, December 7, 2016
Course 16:198:520: Introduction To Artificial Intelligence Lecture 13 Decision Making Abdeslam Boularias Wednesday, December 7, 2016 1 / 45 Overview We consider probabilistic temporal models where the
More informationAn Introduction to Reinforcement Learning
An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@csa.iisc.ernet.in Department of Computer Science and Automation Indian Institute of Science August 2014 What is Reinforcement
More informationChristopher Watkins and Peter Dayan. Noga Zaslavsky. The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015
Q-Learning Christopher Watkins and Peter Dayan Noga Zaslavsky The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (67679) November 1, 2015 Noga Zaslavsky Q-Learning (Watkins & Dayan, 1992)
More informationMarks. bonus points. } Assignment 1: Should be out this weekend. } Mid-term: Before the last lecture. } Mid-term deferred exam:
Marks } Assignment 1: Should be out this weekend } All are marked, I m trying to tally them and perhaps add bonus points } Mid-term: Before the last lecture } Mid-term deferred exam: } This Saturday, 9am-10.30am,
More informationSequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague
Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:
More information6 Reinforcement Learning
6 Reinforcement Learning As discussed above, a basic form of supervised learning is function approximation, relating input vectors to output vectors, or, more generally, finding density functions p(y,
More informationMachine Learning and Bayesian Inference. Unsupervised learning. Can we find regularity in data without the aid of labels?
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone extension 6372 Email: sbh11@cl.cam.ac.uk www.cl.cam.ac.uk/ sbh11/ Unsupervised learning Can we find regularity
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More informationAutonomous Helicopter Flight via Reinforcement Learning
Autonomous Helicopter Flight via Reinforcement Learning Authors: Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, Shankar Sastry Presenters: Shiv Ballianda, Jerrolyn Hebert, Shuiwang Ji, Kenley Malveaux, Huy
More informationArtificial Intelligence & Sequential Decision Problems
Artificial Intelligence & Sequential Decision Problems (CIV6540 - Machine Learning for Civil Engineers) Professor: James-A. Goulet Département des génies civil, géologique et des mines Chapter 15 Goulet
More informationAn Introduction to Reinforcement Learning
An Introduction to Reinforcement Learning Shivaram Kalyanakrishnan shivaram@cse.iitb.ac.in Department of Computer Science and Engineering Indian Institute of Technology Bombay April 2018 What is Reinforcement
More informationTemporal Difference Learning & Policy Iteration
Temporal Difference Learning & Policy Iteration Advanced Topics in Reinforcement Learning Seminar WS 15/16 ±0 ±0 +1 by Tobias Joppen 03.11.2015 Fachbereich Informatik Knowledge Engineering Group Prof.
More informationReinforcement Learning
Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University
More informationPlanning by Probabilistic Inference
Planning by Probabilistic Inference Hagai Attias Microsoft Research 1 Microsoft Way Redmond, WA 98052 Abstract This paper presents and demonstrates a new approach to the problem of planning under uncertainty.
More informationReinforcement Learning
Reinforcement Learning Ron Parr CompSci 7 Department of Computer Science Duke University With thanks to Kris Hauser for some content RL Highlights Everybody likes to learn from experience Use ML techniques
More informationREINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationMachine Learning I Reinforcement Learning
Machine Learning I Reinforcement Learning Thomas Rückstieß Technische Universität München December 17/18, 2009 Literature Book: Reinforcement Learning: An Introduction Sutton & Barto (free online version:
More informationModeling Decision Making with Maximum Entropy Inverse Optimal Control Thesis Proposal
Modeling Decision Making with Maximum Entropy Inverse Optimal Control Thesis Proposal Brian D. Ziebart Machine Learning Department Carnegie Mellon University September 30, 2008 Thesis committee: J. Andrew
More informationEfficient Learning in Linearly Solvable MDP Models
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence Efficient Learning in Linearly Solvable MDP Models Ang Li Department of Computer Science, University of Minnesota
More informationBalancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm
Balancing and Control of a Freely-Swinging Pendulum Using a Model-Free Reinforcement Learning Algorithm Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 2778 mgl@cs.duke.edu
More informationPreference Elicitation for Sequential Decision Problems
Preference Elicitation for Sequential Decision Problems Kevin Regan University of Toronto Introduction 2 Motivation Focus: Computational approaches to sequential decision making under uncertainty These
More informationRL 14: POMDPs continued
RL 14: POMDPs continued Michael Herrmann University of Edinburgh, School of Informatics 06/03/2015 POMDPs: Points to remember Belief states are probability distributions over states Even if computationally
More informationPoint-Based Value Iteration for Constrained POMDPs
Point-Based Value Iteration for Constrained POMDPs Dongho Kim Jaesong Lee Kee-Eung Kim Department of Computer Science Pascal Poupart School of Computer Science IJCAI-2011 2011. 7. 22. Motivation goals
More informationMaximum Causal Tsallis Entropy Imitation Learning
Maximum Causal Tsallis Entropy Imitation Learning Kyungjae Lee 1, Sungjoon Choi, and Songhwai Oh 1 Dep. of Electrical and Computer Engineering and ASRI, Seoul National University 1 Kakao Brain kyungjae.lee@rllab.snu.ac.kr,
More informationReinforcement Learning. Introduction
Reinforcement Learning Introduction Reinforcement Learning Agent interacts and learns from a stochastic environment Science of sequential decision making Many faces of reinforcement learning Optimal control
More informationReinforcement Learning. Machine Learning, Fall 2010
Reinforcement Learning Machine Learning, Fall 2010 1 Administrativia This week: finish RL, most likely start graphical models LA2: due on Thursday LA3: comes out on Thursday TA Office hours: Today 1:30-2:30
More informationarxiv: v1 [cs.ro] 12 Aug 2016
Density Matching Reward Learning Sungjoon Choi 1, Kyungjae Lee 1, H. Andy Park 2, and Songhwai Oh 1 arxiv:1608.03694v1 [cs.ro] 12 Aug 2016 1 Seoul National University, Seoul, Korea {sungjoon.choi, kyungjae.lee,
More informationPartially observable Markov decision processes. Department of Computer Science, Czech Technical University in Prague
Partially observable Markov decision processes Jiří Kléma Department of Computer Science, Czech Technical University in Prague https://cw.fel.cvut.cz/wiki/courses/b4b36zui/prednasky pagenda Previous lecture:
More informationFood delivered. Food obtained S 3
Press lever Enter magazine * S 0 Initial state S 1 Food delivered * S 2 No reward S 2 No reward S 3 Food obtained Supplementary Figure 1 Value propagation in tree search, after 50 steps of learning the
More informationLecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan
COS 402 Machine Learning and Artificial Intelligence Fall 2016 Lecture 18: Reinforcement Learning Sanjeev Arora Elad Hazan Some slides borrowed from Peter Bodik and David Silver Course progress Learning
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Formal models of interaction Daniel Hennes 27.11.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Taxonomy of domains Models of
More informationLecture 3: The Reinforcement Learning Problem
Lecture 3: The Reinforcement Learning Problem Objectives of this lecture: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More information27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling
10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel
More informationActive Learning of MDP models
Active Learning of MDP models Mauricio Araya-López, Olivier Buffet, Vincent Thomas, and François Charpillet Nancy Université / INRIA LORIA Campus Scientifique BP 239 54506 Vandoeuvre-lès-Nancy Cedex France
More informationReinforcement Learning. Yishay Mansour Tel-Aviv University
Reinforcement Learning Yishay Mansour Tel-Aviv University 1 Reinforcement Learning: Course Information Classes: Wednesday Lecture 10-13 Yishay Mansour Recitations:14-15/15-16 Eliya Nachmani Adam Polyak
More informationCOMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati
COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning
More informationKalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN)
Kalman Based Temporal Difference Neural Network for Policy Generation under Uncertainty (KBTDNN) Alp Sardag and H.Levent Akin Bogazici University Department of Computer Engineering 34342 Bebek, Istanbul,
More informationParking lot navigation. Experimental setup. Problem setup. Nice driving style. Page 1. CS 287: Advanced Robotics Fall 2009
Consider the following scenario: There are two envelopes, each of which has an unknown amount of money in it. You get to choose one of the envelopes. Given this is all you get to know, how should you choose?
More informationGentle Introduction to Infinite Gaussian Mixture Modeling
Gentle Introduction to Infinite Gaussian Mixture Modeling with an application in neuroscience By Frank Wood Rasmussen, NIPS 1999 Neuroscience Application: Spike Sorting Important in neuroscience and for
More informationEfficient Probabilistic Performance Bounds for Inverse Reinforcement Learning
Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning Daniel S. Brown and Scott Niekum Department of Computer Science University of Texas at Austin {dsbrown,sniekum}@cs.utexas.edu
More informationTrust Region Policy Optimization
Trust Region Policy Optimization Yixin Lin Duke University yixin.lin@duke.edu March 28, 2017 Yixin Lin (Duke) TRPO March 28, 2017 1 / 21 Overview 1 Preliminaries Markov Decision Processes Policy iteration
More informationPartially Observable Markov Decision Processes (POMDPs)
Partially Observable Markov Decision Processes (POMDPs) Geoff Hollinger Sequential Decision Making in Robotics Spring, 2011 *Some media from Reid Simmons, Trey Smith, Tony Cassandra, Michael Littman, and
More informationProf. Dr. Ann Nowé. Artificial Intelligence Lab ai.vub.ac.be
REINFORCEMENT LEARNING AN INTRODUCTION Prof. Dr. Ann Nowé Artificial Intelligence Lab ai.vub.ac.be REINFORCEMENT LEARNING WHAT IS IT? What is it? Learning from interaction Learning about, from, and while
More information16.4 Multiattribute Utility Functions
285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationImitation Learning. Richard Zhu, Andrew Kang April 26, 2016
Imitation Learning Richard Zhu, Andrew Kang April 26, 2016 Table of Contents 1. Introduction 2. Preliminaries 3. DAgger 4. Guarantees 5. Generalization 6. Performance 2 Introduction Where we ve been The
More informationNon-Parametric Bayes
Non-Parametric Bayes Mark Schmidt UBC Machine Learning Reading Group January 2016 Current Hot Topics in Machine Learning Bayesian learning includes: Gaussian processes. Approximate inference. Bayesian
More informationReinforcement Learning
CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act
More informationLecture 3: Markov Decision Processes
Lecture 3: Markov Decision Processes Joseph Modayil 1 Markov Processes 2 Markov Reward Processes 3 Markov Decision Processes 4 Extensions to MDPs Markov Processes Introduction Introduction to MDPs Markov
More informationLecture 1: March 7, 2018
Reinforcement Learning Spring Semester, 2017/8 Lecture 1: March 7, 2018 Lecturer: Yishay Mansour Scribe: ym DISCLAIMER: Based on Learning and Planning in Dynamical Systems by Shie Mannor c, all rights
More informationCMU Lecture 11: Markov Decision Processes II. Teacher: Gianni A. Di Caro
CMU 15-781 Lecture 11: Markov Decision Processes II Teacher: Gianni A. Di Caro RECAP: DEFINING MDPS Markov decision processes: o Set of states S o Start state s 0 o Set of actions A o Transitions P(s s,a)
More informationMaximum Entropy Inverse Reinforcement Learning
Maximum Entropy Inverse Reinforcement Learning Brian D. Ziebart, Andrew Maas, J.Andrew Bagnell, and Anind K. Dey School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 bziebart@cs.cmu.edu,
More informationLearning in Zero-Sum Team Markov Games using Factored Value Functions
Learning in Zero-Sum Team Markov Games using Factored Value Functions Michail G. Lagoudakis Department of Computer Science Duke University Durham, NC 27708 mgl@cs.duke.edu Ronald Parr Department of Computer
More informationChapter 3: The Reinforcement Learning Problem
Chapter 3: The Reinforcement Learning Problem Objectives of this chapter: describe the RL problem we will be studying for the remainder of the course present idealized form of the RL problem for which
More informationReinforcement Learning as Classification Leveraging Modern Classifiers
Reinforcement Learning as Classification Leveraging Modern Classifiers Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 Machine Learning Reductions
More informationReinforcement Learning (1)
Reinforcement Learning 1 Reinforcement Learning (1) Machine Learning 64-360, Part II Norman Hendrich University of Hamburg, Dept. of Informatics Vogt-Kölln-Str. 30, D-22527 Hamburg hendrich@informatik.uni-hamburg.de
More informationReinforcement Learning
Reinforcement Learning 1 Reinforcement Learning Mainly based on Reinforcement Learning An Introduction by Richard Sutton and Andrew Barto Slides are mainly based on the course material provided by the
More informationOutline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution
Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model
More informationLecture 23: Reinforcement Learning
Lecture 23: Reinforcement Learning MDPs revisited Model-based learning Monte Carlo value function estimation Temporal-difference (TD) learning Exploration November 23, 2006 1 COMP-424 Lecture 23 Recall:
More informationMachine Learning. Reinforcement learning. Hamid Beigy. Sharif University of Technology. Fall 1396
Machine Learning Reinforcement learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Machine Learning Fall 1396 1 / 32 Table of contents 1 Introduction
More informationInverse Reinforcement Learning with Simultaneous Estimation of Rewards and Dynamics
Inverse Reinforcement Learning with Simultaneous Estimation of Rewards and Dynamics Michael Herman Tobias Gindele Jörg Wagner Felix Schmitt Wolfram Burgard Robert Bosch GmbH D-70442 Stuttgart, Germany
More informationDecision Theory: Q-Learning
Decision Theory: Q-Learning CPSC 322 Decision Theory 5 Textbook 12.5 Decision Theory: Q-Learning CPSC 322 Decision Theory 5, Slide 1 Lecture Overview 1 Recap 2 Asynchronous Value Iteration 3 Q-Learning
More informationLecture 9: Policy Gradient II (Post lecture) 2
Lecture 9: Policy Gradient II (Post lecture) 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp. 13 2 With many slides from or derived from David Silver
More informationDecision Theory: Markov Decision Processes
Decision Theory: Markov Decision Processes CPSC 322 Lecture 33 March 31, 2006 Textbook 12.5 Decision Theory: Markov Decision Processes CPSC 322 Lecture 33, Slide 1 Lecture Overview Recap Rewards and Policies
More informationBe able to define the following terms and answer basic questions about them:
CS440/ECE448 Section Q Fall 2017 Final Review Be able to define the following terms and answer basic questions about them: Probability o Random variables, axioms of probability o Joint, marginal, conditional
More informationBayesian Nonparametrics for Speech and Signal Processing
Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer
More informationLecture 9: Policy Gradient II 1
Lecture 9: Policy Gradient II 1 Emma Brunskill CS234 Reinforcement Learning. Winter 2019 Additional reading: Sutton and Barto 2018 Chp. 13 1 With many slides from or derived from David Silver and John
More informationUniversity of Alberta
University of Alberta NEW REPRESENTATIONS AND APPROXIMATIONS FOR SEQUENTIAL DECISION MAKING UNDER UNCERTAINTY by Tao Wang A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment
More informationSequential decision making under uncertainty. Department of Computer Science, Czech Technical University in Prague
Sequential decision making under uncertainty Jiří Kléma Department of Computer Science, Czech Technical University in Prague /doku.php/courses/a4b33zui/start pagenda Previous lecture: individual rational
More information