Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning
Christos Dimitrakakis
Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
23 Jan 2010
Outline
1 Introduction: Reinforcement learning; Markov decision processes (MDP); Dynamic programming
2 Exploration and exploitation trade-off: Bayesian RL; Decision-theoretic solution; Belief MDP
3 Examples
4 Tree expansion: Bounds; Stochastic branch and bound
Reinforcement learning
Definition (The Reinforcement Learning Problem): learning how to act in an environment solely through interaction and reinforcement.
Characteristics: the environment is unknown; data is collected by the agent through interaction; the optimal actions are only hinted at via reinforcement (scalar rewards).
Applications: sequential decision-making tasks; control, robotics; scheduling, network routing; game playing and coordination of multiple agents; relation to biological learning.
Markov decision processes Markov decision processes (MDP) We are in some environment µ, where at each time step t: We observe state s t S. We take action a t A. We receive a reward r t R. µ s t s t+1 a t r t+1 Model P µ(s t+1 s t, a t) P µ(r t+1 s t, a t) (Transition distribution) (Reward distribution)
Markov decision processes (MDPs)
The agent
The agent is defined by its policy $\pi$: $P^\pi(a_t \mid s_t)$.
[Figure: the agent $\pi$ selects $a_t$; the environment $\mu$ returns $r_{t+1}$ and moves $s_t \to s_{t+1}$.]
Controlling the environment
We wish to find $\pi$ maximising the expected total future reward (utility)
$\mathbb{E}_{\mu,\pi} \left[ \sum_{t=1}^{T} \gamma^t r_t \right]$
to the horizon $T$, with discount factor $\gamma \in (0, 1]$.
Value functions
State value function:
$V^\pi_{t,\mu}(s) \triangleq \mathbb{E}_{\pi,\mu}\left[\sum_{k=1}^{T} \gamma^k r_{t+k} \,\middle|\, s_t = s\right]$
How good a state is under the policy $\pi$ in the environment $\mu$.
Optimal policy $\pi^*(\mu)$:
$V^{\pi^*(\mu)}_{t,\mu}(s) \geq V^\pi_{t,\mu}(s) \quad \forall \pi, t, s$
The optimal policy $\pi^*$ dominates all other policies $\pi$ everywhere in $S$.
Optimal value function:
$V^*_{t,\mu}(s) \triangleq V^{\pi^*(\mu)}_{t,\mu}(s)$
The optimal value function $V^*$ is the value function of the optimal policy $\pi^*$.
When the environment $\mu$ is known
[Figure: one-step lookahead tree from $s_t$ over actions $a^1_t, a^2_t$ and rewards $r^0_{t+1}, r^1_{t+1}$ to successor states $s^1_{t+1}, \ldots, s^4_{t+1}$.]
Iterative/offline methods
Estimate the optimal value function $V^*$ (e.g. with backwards induction on all of $S$).
Iteratively improve $\pi$ (e.g. with policy iteration) to obtain $\pi^*$.
Online methods
Forward search followed by backwards induction (on a subset of $S$).
Dynamic programming (backwards induction)
$V_t(s_t) = \sup_a \left\{ \mathbb{E}_\mu[r_t \mid s_t, a] + \gamma \sum_i V_{t+1}(s^i_{t+1})\, P_\mu(s^i_{t+1} \mid s_t, a) \right\}$
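As a concrete illustration, here is a minimal sketch of finite-horizon backwards induction for a known MDP. All names (backwards_induction, P, R, the toy numbers) are illustrative assumptions, not from the talk.

```python
import numpy as np

def backwards_induction(P, R, horizon, gamma=0.95):
    """Finite-horizon backwards induction for a known MDP.
    P[a, s, s2] = P_mu(s2 | s, a);  R[s, a] = E_mu[r | s, a]."""
    n_actions, n_states, _ = P.shape
    V = np.zeros((horizon + 1, n_states))        # V[horizon] = 0 at the terminal stage
    pi = np.zeros((horizon, n_states), dtype=int)
    for t in reversed(range(horizon)):
        # Q[s, a] = E[r | s, a] + gamma * sum_{s2} P(s2 | s, a) V_{t+1}(s2)
        Q = R + gamma * np.einsum('ask,k->sa', P, V[t + 1])
        V[t] = Q.max(axis=1)                     # the sup over actions
        pi[t] = Q.argmax(axis=1)                 # a greedy (optimal) action
    return V, pi

# Toy 2-state, 2-action MDP:
P = np.array([[[0.9, 0.1], [0.2, 0.8]],         # transitions under action 0
              [[0.5, 0.5], [0.1, 0.9]]])        # transitions under action 1
R = np.array([[0.0, 1.0], [0.5, 0.0]])          # R[s, a]
V, pi = backwards_induction(P, R, horizon=10)
```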
When the environment $\mu$ is unknown
Decision-theoretic solution using a probabilistic belief
Belief: a distribution over possible MDPs.
Method: take future beliefs into account when planning.
Problem: the combined belief/MDP model is an infinite MDP.
Goal: efficient methods to approximately solve the infinite MDP.
Near-optimal Bayesian RL
Bayesian Reinforcement Learning
Estimating the correct MDP
The true $\mu$ is unknown, but we assume it is in a set $M$. Maintain a belief $\xi_t(\mu)$ over all possible MDPs $\mu \in M$; $\xi_0$ is our initial belief about $\mu \in M$.
[Figure: the agent $\pi$ acts in the unknown environment $\mu$.]
The belief update
$\xi_{t+1}(\mu) \triangleq \xi_t(\mu \mid s_{t+1}, r_{t+1}, s_t, a_t) = \frac{P_\mu(s_{t+1}, r_{t+1} \mid s_t, a_t)\, \xi_t(\mu)}{\xi_t(s_{t+1}, r_{t+1} \mid s_t, a_t)} \quad (1)$
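When $M$ is a finite set of candidate MDPs, the update (1) is only a few lines. A minimal sketch, assuming a hypothetical model object with a prob(s_next, r, s, a) method returning $P_\mu(s_{t+1}, r_{t+1} \mid s_t, a_t)$:

```python
import numpy as np

def belief_update(xi, models, s, a, s_next, r):
    """Posterior belief over a finite set of candidate MDPs.
    xi[i] is the prior weight of models[i]; models[i].prob(...) is assumed
    to return P_mu(s_next, r | s, a) for that model."""
    likelihood = np.array([m.prob(s_next, r, s, a) for m in models])
    posterior = likelihood * xi            # numerator of (1)
    return posterior / posterior.sum()     # denominator: xi_t(s_{t+1}, r_{t+1} | s_t, a_t)
```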
Exploration-exploitation trade-offs in Bayesian RL
Exploration-exploitation trade-off
We have just described an estimation method. But how should we behave while the MDP is not yet well estimated? The relative plausibility of different MDPs matters.
Main idea
Take knowledge gains into account when planning: starting from the current belief, enumerate all possible future beliefs, and take the action that maximises the expected utility.
Decision-theoretic solution
Belief-Augmented Markov Decision Processes
RL problems with state are expressed as MDPs. Under uncertainty there are two types of state variable: the environment's state $s_t$ and our belief state $\xi_t$. Augmenting the state space with the belief space allows us to represent RL under uncertainty as one big MDP over hyperstates. This MDP can be solved with DP techniques (backwards induction).
Belief tree
[Figure: belief tree rooted at $\omega_t$, branching on $(a, s)$ pairs to children $\omega^1_{t+1}, \ldots, \omega^4_{t+1}$.]
Of interest
(Pseudo-)tree structure. Hyperstate $\omega_t \triangleq (s_t, \xi_t)$, with $\Omega_t \triangleq \{\omega^i_t : i = 1, 2, \ldots\}$.
The induced MDP $\nu$
$P_\nu(\omega^i_{t+1} \mid \omega_t, a_t) = \xi_t(s^i_{t+1}, r^i_{t+1} \mid s_t, a_t) = \int_M P_\mu(s^i_{t+1}, r^i_{t+1} \mid s_t, a_t)\, \xi_t(\mu)\, \mathrm{d}\mu$
Backwards induction
$V_t(\omega_t) = \max_{a_t} \sum_{\omega' \in \Omega_{t+1}} \xi_t(\omega' \mid \omega_t, a_t)\left[\mathbb{E}_{\xi_t}(r \mid \omega', \omega_t) + \gamma V_{t+1}(\omega')\right]$
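A minimal sketch of expanding a hyperstate into its children under the induced MDP $\nu$, reusing belief_update from above. The finite state/reward sets and the model interface are assumptions for illustration:

```python
def expand(s, xi, models, actions, states, rewards):
    """For each action a, list the children (p, s2, r, xi2) of omega = (s, xi),
    where p = P_nu(omega' | omega, a) = xi_t(s2, r | s, a)."""
    children = {a: [] for a in actions}
    for a in actions:
        for s2 in states:
            for r in rewards:
                # Marginal probability: integrate P_mu(s2, r | s, a) over the belief.
                p = sum(w * m.prob(s2, r, s, a) for w, m in zip(xi, models))
                if p > 0:
                    xi2 = belief_update(xi, models, s, a, s2, r)
                    children[a].append((p, s2, r, xi2))
    return children
```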
The n-armed bandit problem
Actions $A = \{1, \ldots, n\}$; expected reward $\mathbb{E}(r_t \mid a_t = i) = x_i$; discount factor $\gamma \leq 1$ and/or horizon $T > 0$. If the expected rewards are unknown, what must we do?
Decision-theoretic approach
Assume $r_t \mid a_t = i \sim \psi(\theta_i)$, with $\theta_i \in \Theta$ unknown parameters. Define a prior $\xi(\theta_1, \ldots, \theta_n)$. Select actions to maximise $\mathbb{E}_\xi U_t = \mathbb{E}_\xi \sum_{k=1}^{T-t} \gamma^k r_{t+k}$.
Bernoulli example
Consider $n$ Bernoulli bandits with unknown parameters $\theta_i$, $i = 1, \ldots, n$, such that
$r_t \mid a_t = i \sim \operatorname{Bern}(\theta_i), \quad \mathbb{E}(r_t \mid a_t = i) = \theta_i. \quad (2)$
We model our belief about each bandit's parameter $\theta_i$ as a Beta distribution $\operatorname{Beta}(\alpha_i, \beta_i)$ with density $f(\theta \mid \alpha_i, \beta_i)$, so that $\xi(\theta_1, \ldots, \theta_n) = \prod_{i=1}^n f(\theta_i \mid \alpha_i, \beta_i)$. Recall that the posterior of a Beta prior is also a Beta. Let $k_{t,i} \triangleq \sum_{k=1}^t \mathbb{I}\{a_k = i\}$ be the number of times we played arm $i$, and $\hat{r}_{t,i} \triangleq \frac{1}{k_{t,i}} \sum_{k=1}^t r_k \,\mathbb{I}\{a_k = i\}$ be the empirical mean reward of arm $i$ at time $t$. Then the posterior distribution for the parameter of arm $i$ is
$\xi_t = \operatorname{Beta}\big(\alpha_i + k_{t,i}\hat{r}_{t,i},\ \beta_i + k_{t,i}(1 - \hat{r}_{t,i})\big).$
Since $r_t \in \{0, 1\}$, the possible states of our belief given some prior lie in $\mathbb{N}^{2n}$.
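A minimal sketch of this conjugate update: since the posterior parameters are just success/failure counts added to the prior, tracking $(\alpha_i + k_{t,i}\hat{r}_{t,i},\ \beta_i + k_{t,i}(1-\hat{r}_{t,i}))$ reduces to two increments. Variable names are illustrative.

```python
n = 3                          # number of arms
alpha = [1.0] * n              # prior Beta(alpha_i, beta_i) for each arm
beta = [1.0] * n

def update(i, r):
    """Observe reward r in {0, 1} from arm i and update the posterior:
    alpha_i accumulates k_{t,i} * rhat_{t,i} (successes),
    beta_i accumulates k_{t,i} * (1 - rhat_{t,i}) (failures)."""
    alpha[i] += r
    beta[i] += 1 - r
```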
Belief states
The state of the bandit problem is the state of our belief. A sufficient statistic for our belief is the number of times we played each bandit and the total reward from each bandit. Thus, our state at time $t$ is entirely described by our priors $\alpha, \beta$ (the initial state) and the vectors
$k_t = (k_{t,1}, \ldots, k_{t,n}) \quad (3)$
$\hat{r}_t = (\hat{r}_{t,1}, \ldots, \hat{r}_{t,n}). \quad (4)$
At any time $t$, we can calculate the probability of observing $r_t = 1$ or $r_t = 0$ if we pull arm $i$:
$\xi_t(r_t = 1 \mid a_t = i) = \frac{\alpha_i + k_{t,i}\hat{r}_{t,i}}{\alpha_i + \beta_i + k_{t,i}}$
The next state is well defined and depends only on the current state. Thus, the $n$-armed bandit problem is an MDP.
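Continuing the sketch above: the predictive probability is just the posterior mean, and the belief transition is a deterministic function of $(a_t, r_t)$, which is exactly what makes the bandit problem an MDP over belief states. Names remain illustrative assumptions.

```python
def predictive(i):
    """xi_t(r_t = 1 | a_t = i) = (alpha_i + k_i * rhat_i) / (alpha_i + beta_i + k_i);
    with the counts already folded into (alpha, beta) this is the posterior mean."""
    return alpha[i] / (alpha[i] + beta[i])

def transition(i, r):
    """The next belief state depends only on the current state and (a_t, r_t)."""
    update(i, r)
    return list(alpha), list(beta)
```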
Experiment design
Example: Consider $k$ treatments to be administered to $T$ volunteers; each volunteer can only be used once. At the $t$-th trial, we perform some experiment $a_t \in \{1, \ldots, k\}$ and obtain a reward $r_t = 1$ if the result is successful and $0$ otherwise. If we simply randomise the trials, we will obtain far fewer successes than if we solve the bandit MDP.
Example: We are given a hypothesis set $H = \{h_1, h_2\}$, a prior $\psi_0$ on $H$, a decision set $D = \{d_1, d_2\}$ and a loss function $L : D \times H \to \mathbb{R}$. We can choose from a set of $k$ possible experiments to be performed over $T$ trials. At the $t$-th trial, we choose experiment $a_t \in \{1, \ldots, k\}$ and observe outcome $x_t \in X$. Our posterior is
$\psi_t(h) = \psi_0(h \mid a_1, \ldots, a_t, x_1, \ldots, x_t).$
The reward is $r_t = 0$ for $t < T$ and $r_T = \min_{d \in D} \mathbb{E}_{\psi_T}(L_d)$. The process is a $T$-horizon MDP, which can be solved with standard backwards induction.
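For the two-hypothesis example, the terminal reward $r_T = \min_{d} \mathbb{E}_{\psi_T}(L_d)$ is a one-liner; the loss matrix and posterior values below are made-up assumptions for illustration.

```python
import numpy as np

L = np.array([[0.0, 1.0],        # L[d, h]: loss of decision d when h is true
              [1.0, 0.0]])
psi_T = np.array([0.7, 0.3])     # hypothetical posterior over H = {h1, h2}

r_T = (L @ psi_T).min()          # min over d of E_{psi_T}[L(d, h)] = 0.3
```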
Tree properties
Tree depth: the (naive) error of truncating at depth $k$ is $\epsilon \leq \frac{\gamma^k}{1 - \gamma}$ (see the sketch after this list).
Branching factor: $\phi = |R| \cdot |A| \cdot |S|$.
Practical methods to handle the tree:
Lookahead up to a fixed time $T$.
In some cases, closed-form solutions (e.g. Gittins indices).
Pruning or sparse expansion.
Value function approximations.
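A minimal sketch of how the truncation error dictates the lookahead depth, solving $\gamma^k/(1-\gamma) \leq \epsilon$ for $k$:

```python
import math

def depth_for_error(epsilon, gamma):
    """Smallest k with gamma**k / (1 - gamma) <= epsilon."""
    return math.ceil(math.log(epsilon * (1 - gamma)) / math.log(gamma))

# E.g. gamma = 0.9, epsilon = 0.01 gives k = 66; with branching factor
# phi = |R| * |A| * |S|, the full tree has about phi**k nodes, hence pruning.
print(depth_for_error(0.01, 0.9))
```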
Bounds on node values
Fortunately, we can obtain bounds on the value of any node $\omega = (s, \xi)$. Let $\pi^*(\mu)$ be the optimal policy for $\mu$.
Lower bound
$V^*(\omega) \geq \mathbb{E}_\xi V^{\pi^*(\bar{\mu}_\xi)}_\mu(s)$, where $\bar{\mu}_\xi \triangleq \mathbb{E}_\xi\, \mu$ is the mean MDP.
The optimal policy must be at least as good as any stationary policy.
Upper bound
$V^*(\omega) \leq \mathbb{E}_\xi \max_\pi V^\pi_\mu(s)$
The optimal policy cannot do better than a policy which learns the correct model at the next time step.
Estimating the bounds for some hyperstate $\omega = (s, \xi)$:
$\int V_\mu(s)\, \xi(\mu)\, \mathrm{d}\mu \approx \frac{1}{n} \sum_{i=1}^n \hat{v}_i, \quad \hat{v}_i = V_{\mu_i}(s), \quad \mu_i \sim \xi.$
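A minimal Monte Carlo sketch of these bound estimates: sample $\mu_i \sim \xi$, solve each sampled MDP, and average. The helpers sample_mdp, solve_optimal and policy_value stand in for a model sampler and an MDP solver; they are assumptions, not the talk's code.

```python
def estimate_bounds(xi, s, n=32):
    """Monte Carlo estimates of the lower/upper bounds at omega = (s, xi)."""
    pi_mean = solve_optimal(xi.mean_mdp())       # pi*(mu_bar): optimal for the mean MDP
    lower = upper = 0.0
    for _ in range(n):
        mu = sample_mdp(xi)                      # mu_i ~ xi
        upper += solve_optimal(mu).value(s)      # V*_{mu_i}(s): knows the model
        lower += policy_value(pi_mean, mu, s)    # a fixed stationary policy run in mu_i
    return lower / n, upper / n
```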
Stochastic branch and bound
[Figure: a partially expanded belief tree; leaves are annotated with sampled upper bounds.]
Main idea
Sample once from all leaf nodes. Expand the node with the highest mean upper bound. We quickly discover overoptimistic bounds, while unexplored leaf nodes accumulate samples. A sketch of the main loop follows below.
Hierarchical variant
Sample children instead of leaves. Average bounds along the path to avoid degeneracy.
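A minimal sketch of the main loop described above, assuming hypothetical helpers sample_upper_bound (one stochastic upper-bound sample at a node) and expand_node (generates a leaf's children):

```python
def stochastic_branch_and_bound(root, n_iterations):
    """Repeatedly expand the leaf with the highest mean sampled upper bound."""
    leaves = [root]
    root.samples = [sample_upper_bound(root)]
    for _ in range(n_iterations):
        # One fresh sample per leaf: overoptimistic bounds get discovered,
        # and unexplored leaves keep accumulating samples.
        for leaf in leaves:
            leaf.samples.append(sample_upper_bound(leaf))
        best = max(leaves, key=lambda l: sum(l.samples) / len(l.samples))
        leaves.remove(best)
        for child in expand_node(best):
            child.samples = [sample_upper_bound(child)]
            leaves.append(child)
    return leaves
```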
Complexity results
Let $\Delta$ be the value difference between two branches and $\beta = V_{\max} - V_{\min}$.
If $N$ is the number of times an optimal branch is sampled without being expanded, then
$P(N > n) \leq \exp\left(-2\beta^{-2} n^2 \Delta^2\right).$
If $K$ is the number of times a sub-optimal branch will be expanded, then for $k > k_0 = \log_\gamma(\Delta/\beta)$:
Stochastic branch and bound 1: $P(K > k) = O\left(\exp\{-2\beta^{-2}[(k - k_0)\Delta]^2\}\right)$
Stochastic branch and bound 2: $P(K > k) = \tilde{O}\left(\exp\{-2(k - k_0)^2 \Delta^2 (1 - \gamma)^2\}\right)$
Summary
Results
Development of upper and lower bounds for belief tree values.
Application to efficient tree expansion.
Complexity bounds for tree expansion.
Future work
Sparse sampling and smoothness properties to reduce the branching factor.
Can we get regret bounds via posterior concentration?
Extend the approach to non-parametrics...
Questions? Thank you for your attention.