An Analytic Solution to Discrete Bayesian Reinforcement Learning

Size: px

Start display at page:

Download "An Analytic Solution to Discrete Bayesian Reinforcement Learning"

Everett Hall
6 years ago
Views:

1 An Analytic Solution to Discrete Bayesian Reinforcement Learning Pascal Poupart (U of Waterloo) Nikos Vlassis (U of Amsterdam) Jesse Hoey (U of Toronto) Kevin Regan (U of Waterloo) 1

2 Motivation Automated assistant [Boger et al. IJCAI-05] Use RL to adapt to users Learn through user interactions (no simulation) Bear cost of actions Cannot explore too much Real-time response 2

3 Model-Based Bayesian RL Model-based Bayesian RL: Naturally optimize exploration/exploitation tradeoff Reduce exploration with prior knowledge Mathematically and computationally complex Contributions: Optimal value function has simple parameterization i.e., upper envelope of a set of multivariate polynomials BEETLE: Bayesian Exploration/Exploitation Tradeoff in LEarning Exploit polynomial parameterization 3

4 Outline Bayesian reinforcement learning Value function parameterization BEETLE algorithm Experiments Conclusion 4

5 Reinforcement Learning Markov Decision Process: S: set of states A: set of actions R: set of rewards T(s,a,s ) = Pr(s s,a): transition function U(s,a) = r: reward function Reinforcement Learning Bayesian Model-based Reinforcement Learning Encode unknown prob. by random variables θ i.e., θ sas = Pr(s s,a): random variable in [0,1] i.e., θ sa = Pr( s,a): multinomial distribution 5

6 Model Learning Assume prior b(θ sa ) = Pr(θ sa ) Learning: compute posterior given s,a,s b sas (θ sa ) = k Pr(θ sa ) Pr(s s,a,θ sa ) = k b(θ sa ) θ sas Conjugate prior: Dirichlet prior Dirichlet posterior b(θ sa ) = Dir(θ sa ;n sa ) = k Π s (θ sas ) n sas 1 b sas (θ sa ) = k b(θ sa ) θ sas = k Π s (θ sas ) n sas 1 + δ(s,s ) = k Dir(θ sa ; n sa + δ(s,s )) 6

7 Structural priors Prior Knowledge Tie identical parameters If Pr( s,a) = Pr( s,a ) then θ sa = θ s a Factored representation DBN: unknown conditional dist. Informative priors No knowledge: uniform Dirichlet If (θ 1, θ 2 ) ~ (0.2, 0.8) then set (n 1, n 2 ) to (0.2k, 0.8k) k indicates the level of confidence Pr(p) Dir(p; 1, 1) Dir(p; 2, 8) Dir(p; 20, 80) p 7

8 Classic RL: Policy Optimization V*(s) = max a U(s,a) + Σ s Pr(s s,a) V*(s ) Hard to tell what needs to be explored Exploration heuristics: ε-greedy, Boltzmann, etc. Bayesian RL: V*(s,b) = max a U(s,a) + Σ s Pr(s s,b,a) V*(s,b sas ) Belief b tells us what parts of the model are not well known and therefore worth exploring Optimal exploration/exploitation tradeoff [Dearden 98,99], [Strens 00], [Duff 02], [Wang 05] 8

9 Value Function Parameterization Theorem: V* is the upper envelope of a set of multivariate polynomials (V s (θ) = max i poly i (θ)) Proof: by induction Define value function in terms of θ instead of b i.e. V*(s,b) = θ b(θ) V s (θ) dθ Bellman s equation V s (θ) = max a U(s,a) + Σ s Pr(s s,a,θ) V s (θ) = max a k a + Σ s θ sas max i poly i (θ) = max j poly j (θ) 9

10 BEETLE Algorithm Sample a set of reachable belief points B V {0} Repeat V {} For each b in B compute multivariate polynomial poly as (θ) argmax poly V θ b sas (θ) poly(θ) dθ a* argmax a θ b sas (θ) R(s,a) + Σ s θ sas poly as (θ) dθ poly(θ) U(s,a*) + Σ s θ sa*s poly a*s (θ) V V {poly} V V 10

11 Polynomials Computational issue: # of monomials in each polynomial grows by O( S ) at each iteration poly(θ) = U(s,a*) + Σ s θ sa*s poly a*s (θ) = U(s,a*) + Σ s θ sas Σ i mono i (θ) = U(s,a*) + Σ i,s mono i,s (θ) After n iterations: polynomials have O( S n ) monomials! 11

12 Projection Scheme Approximate polynomials by a linear combination of a fixed set of monomial basis functions φ i (θ): i.e. poly(θ) Σ i c i φ i (θ) Find best coefficients c i by minimizing L n norm: Min c θ poly(θ) - Σ i c i φ i (θ) n dθ For the Euclidean norm (L 2 ), this can be done by solving a system of linear equations Ax = b such that A ij = θ φ i (θ) φ j (θ) dθ b i = θ poly(θ) φ j (θ) dθ x i = c i 12

13 Basis functions Which monomials should we use as basis functions? Recall that: b sas (θ) = k b(θ) θ sas poly(θ) U(s,a) + Σ s θ sas poly as (θ) Hence we use beliefs as basis functions 13

14 Beetle summary Offline: optimize policy at sampled belief points Time: minutes to hours Online: learn transition model by belief monitoring Time: fraction of a second Advantages: Fast enough for online learning Optimizes exploration/exploitation tradeoff Easy to encode prior knowledge in initial belief Disadvantage: Policy may not be good for all belief points 14

15 Empirical Evaluation Comparison with two heuristics Exploit: pure exploitation strategy Greedily select best action of the mean model at each time step Slow execution: must solve an MDP at each time step Discrete POMDP: discretize θ Discretization leads to an exponential number of states Intractable for medium to large problems 15

16 Empirical Evaluation Problem S A Free params Opt Discrete POMDP Exploit Beetle Beetle time (minutes) Chain ± ± ± Chain ± ± ± Chain na-m 3078 ± ± Handw ± ± ± Handw ± ± ± Handw na-m 297 ± ±

17 Informative Priors Problem Opt k = 0 Informative priors k = 10 k = 20 k = 30 Chain ± ± ± ± 32 Handw ± ± ± ± 16 Handw ± ± ± ± 12 17

18 Learning Curves cumulative reward Optimal (utopic) Beetle (prior 0) Beetle (prior 30) Exploit Handw steps 18

19 Conclusion Motivation Learning by interaction with environment (no simulation) Bear consequence of actions Minimal exploration Real-time execution Bayesian RL Optimizes exploration/exploitation tradeoff Can easily encode prior knowledge to reduce exploration Contributions Optimal value function parameterization: as the upper envelope of multivariate polynomials BEETLE algorithm 19

20 Future work Learn user behaviors for assistive technologies Consider partially observable domains Learn dynamic transition models Consider correlated Dirichlets priors 20

Topics of Active Research in Reinforcement Learning Relevant to Spoken Dialogue Systems

Topics of Active Research in Reinforcement Learning Relevant to Spoken Dialogue Systems Pascal Poupart David R. Cheriton School of Computer Science University of Waterloo 1 Outline Review Markov Models