Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning


Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning
Christos Dimitrakakis, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
23 Jan 2010

Reinforcement learning

1 Introduction: Reinforcement learning; Markov decision processes (MDP); Dynamic programming
2 Exploration and Exploitation Trade-off: Bayesian RL; Decision-theoretic solution; Belief MDP
3 Examples
4 Tree expansion: Bounds; Stochastic branch and bound

Reinforcement learning

Definition (The Reinforcement Learning Problem). Learning how to act in an environment solely by interaction and reinforcement.

Characteristics:
- The environment is unknown.
- Data is collected by the agent through interaction.
- The optimal actions are hinted at via reinforcement (scalar rewards).

Applications (sequential decision making tasks):
- Control, robotics, etc.
- Scheduling, network routing.
- Game playing, co-ordination of multiple agents.
- Relation to biological learning.

Markov decision processes

Markov decision processes (MDP). We are in some environment $\mu$, where at each time step $t$:
- We observe state $s_t \in S$.
- We take action $a_t \in A$.
- We receive a reward $r_t \in \mathbb{R}$.

Model:
$P_\mu(s_{t+1} \mid s_t, a_t)$ (transition distribution)
$P_\mu(r_{t+1} \mid s_t, a_t)$ (reward distribution)

[Figure: agent-environment loop linking $s_t$, $a_t$, $r_{t+1}$, $s_{t+1}$ through $\mu$.]

Markov decision processes (MDPs)

The agent. The agent is defined by its policy $\pi$, with $P_\pi(a_t \mid s_t)$.

Controlling the environment. We wish to find $\pi$ maximising the expected total future reward (utility)
$E_{\mu,\pi} \sum_{t=1}^{T} r_t$
to the horizon $T$.

Markov decision processes (MDPs)

The agent. The agent is defined by its policy $\pi$, with $P_\pi(a_t \mid s_t)$.

Controlling the environment. We wish to find $\pi$ maximising the expected total future reward (utility)
$E_{\mu,\pi} \sum_{t=1}^{T} \gamma^t r_t$
to the horizon $T$ with discount factor $\gamma \in (0, 1]$.

Value functions

State value function:
$V^\pi_{t,\mu}(s) \triangleq E_{\pi,\mu}\!\left(\sum_{k=1}^{T} \gamma^k r_{t+k} \,\middle|\, s_t = s\right)$
How good a state is under the policy $\pi$ for the environment $\mu$.

Optimal policy:
$\pi^*(\mu) : \quad V^{\pi^*(\mu)}_{t,\mu}(s) \ge V^\pi_{t,\mu}(s) \quad \forall \pi, t, s$
The optimal policy $\pi^*$ dominates all other policies $\pi$ everywhere in $S$.

Optimal value function:
$V^*_{t,\mu}(s) \triangleq V^{\pi^*(\mu)}_{t,\mu}(s)$
The optimal value function $V^*$ is the value function of the optimal policy $\pi^*$.

When the environment $\mu$ is known

Iterative/offline methods:
- Estimate the optimal value function $V^*$ (e.g. with backwards induction on all of $S$).
- Iteratively improve $\pi$ (e.g. with policy iteration) to obtain $\pi^*$.

Online methods:
- Forward search followed by backwards induction (on a subset of $S$).

Dynamic programming (backwards induction):
$V_t(s_t) = \sup_a \left\{ E_\mu[r_t \mid s_t, a] + \gamma \sum_i V_{t+1}(s^i_{t+1})\, P_\mu(s^i_{t+1} \mid s_t, a) \right\}$

[Figure: lookahead tree from $s_t$ over action/reward pairs to successor states $s^i_{t+1}$.]
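To make the backwards-induction recursion concrete, here is a minimal sketch in Python for a known finite MDP; the arrays P and R, the horizon T and the tiny two-state example at the end are illustrative inputs, not part of the original slides.

import numpy as np

def backwards_induction(P, R, T, gamma=0.95):
    # Finite-horizon dynamic programming for a known MDP mu.
    # P[a, s, s2] : transition probability P_mu(s2 | s, a)
    # R[s, a]     : expected reward E_mu[r | s, a]
    # Returns V[t, s] and a greedy policy pi[t, s].
    n_actions, n_states, _ = P.shape
    V = np.zeros((T + 1, n_states))          # value beyond the horizon is zero
    pi = np.zeros((T, n_states), dtype=int)
    for t in reversed(range(T)):
        # Q[s, a] = E_mu[r | s, a] + gamma * sum_s2 P_mu(s2 | s, a) V[t+1, s2]
        Q = R + gamma * np.einsum('asn,n->sa', P, V[t + 1])
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi

# Tiny two-state, two-action example with made-up numbers.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # action 0
              [[0.5, 0.5], [0.1, 0.9]]])     # action 1
R = np.array([[0.0, 1.0],                    # R[s, a]
              [0.5, 0.0]])
V, pi = backwards_induction(P, R, T=10)
print(V[0], pi[0])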

When the environment $\mu$ is unknown

Decision-theoretic solution using a probabilistic belief.
- Belief: a distribution over possible MDPs.
- Method: take into account future beliefs when planning.
- Problem: the combined belief/MDP model is an infinite MDP.
- Goal: efficient methods to approximately solve the infinite MDP.

Near-optimal Bayesian RL

1 Introduction: Reinforcement learning; Markov decision processes (MDP); Dynamic programming
2 Exploration and Exploitation Trade-off: Bayesian RL; Decision-theoretic solution; Belief MDP
3 Examples
4 Tree expansion: Bounds; Stochastic branch and bound

Bayesian Reinforcement Learning

Estimating the correct MDP:
- The true $\mu$ is unknown, but we assume it is in $M$.
- Maintain a belief $\xi_t(\mu)$ over all possible MDPs $\mu \in M$.
- $\xi_0$ is our initial belief about $\mu \in M$.

The belief update:
$\xi_{t+1}(\mu) \triangleq \xi_t(\mu \mid s_{t+1}, r_{t+1}, s_t, a_t) = \frac{P_\mu(s_{t+1}, r_{t+1} \mid s_t, a_t)\, \xi_t(\mu)}{\xi_t(s_{t+1}, r_{t+1} \mid s_t, a_t)} \qquad (1)$
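A minimal sketch of the belief update (1), assuming (purely for illustration) that the unknown MDP lies in a small finite set of candidate transition models; only the state-transition likelihood is used here, and the reward likelihood from (1) would enter as an extra factor in exactly the same way. All names and numbers are hypothetical.

import numpy as np

def belief_update(xi, models, s, a, s_next):
    # xi_{t+1}(mu) = P_mu(s_{t+1} | s_t, a_t) * xi_t(mu) / xi_t(s_{t+1} | s_t, a_t)
    # xi     : prior weights xi_t(mu), one per candidate model
    # models : list of transition arrays P_mu[a, s, s2]
    likelihood = np.array([P[a, s, s_next] for P in models])
    posterior = likelihood * xi
    return posterior / posterior.sum()

# Two hypothetical two-state, one-action candidate models and a uniform prior xi_0.
models = [np.array([[[0.9, 0.1], [0.9, 0.1]]]),
          np.array([[[0.1, 0.9], [0.1, 0.9]]])]
xi = np.array([0.5, 0.5])
xi = belief_update(xi, models, s=0, a=0, s_next=1)
print(xi)    # mass shifts toward the model that predicted s_next = 1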

Exploration-exploitation trade-offs with Bayesian RL

Exploration-exploitation trade-off. We just described an estimation method. But how should we behave while the MDP is not well estimated? The plausibility of different MDPs is important.

Main idea: take knowledge gains into account when planning.
- Starting with the current belief, enumerate all possible future beliefs.
- Take the action that maximises the expected utility.

Decision-theoretic solution

Belief-Augmented Markov Decision Processes. RL problems with state are expressed as MDPs. Under uncertainty there are two types of state variables: the environment's state $s_t$ and our belief state $\xi_t$. Augmenting the state space with the belief space allows us to represent RL under uncertainty as a big MDP with a hyperstate. This MDP can be solved with DP techniques (backwards induction).

Belief tree

Of interest:
- (Pseudo)-tree structure.
- Hyperstate $\omega_t \triangleq (s_t, \xi_t)$.
- $\Omega_t \triangleq \{\omega^i_t : i = 1, 2, \ldots\}$.

[Figure: belief tree rooted at $\omega_t$, with children $\omega^1_{t+1}, \ldots, \omega^4_{t+1}$ reached via action/state pairs $(a, s)$.]

The induced MDP $\nu$:
$P_\nu(\omega^i_{t+1} \mid \omega_t, a_t) = \xi_t(s^i_{t+1}, r^i_{t+1} \mid s_t, a_t) = \int_M P_\mu(s^i_{t+1}, r^i_{t+1} \mid s_t, a_t)\, \xi_t(\mu)\, d\mu$

Backwards induction:
$V_t(\omega) = \max_{a_t} \sum_{\omega' \in \Omega_{t+1}} \xi_t(\omega' \mid \omega_t, a_t)\, [E_{\xi_t}(r \mid \omega', \omega_t) + \gamma V_{t+1}(\omega')]$
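The backwards-induction step on an already expanded belief tree can be sketched as follows; the Node container, its children layout (action mapped to (probability, expected reward, child) triples) and the value_bound field used at the leaves are my own illustrative choices, not notation from the slides.

from dataclasses import dataclass, field

@dataclass
class Node:
    value_bound: float = 0.0                       # leaf estimate, e.g. a bound as discussed later
    children: dict = field(default_factory=dict)   # action -> [(prob, reward, child), ...]

def node_value(node, gamma=0.95):
    # V_t(omega) = max_a sum_omega' xi_t(omega' | omega, a) [E(r | omega', omega) + gamma V_{t+1}(omega')]
    if not node.children:                          # leaf: fall back to a value bound
        return node.value_bound
    return max(sum(p * (r + gamma * node_value(child, gamma))
                   for p, r, child in outcomes)
               for outcomes in node.children.values())

# Toy tree: one root with two actions and three leaves (numbers are made up).
leaf_hi = Node(value_bound=1.0)
leaf_lo = Node(value_bound=0.2)
root = Node(children={0: [(0.5, 0.1, leaf_hi), (0.5, 0.1, leaf_lo)],
                      1: [(1.0, 0.3, leaf_lo)]})
print(node_value(root))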

The n-armed bandit problem

- Actions $A = \{1, \ldots, n\}$.
- Expected reward $E(r_t \mid a_t = i) = x_i$.
- Discount factor $\gamma \le 1$ and/or horizon $T > 0$.

If the expected rewards are unknown, what must we do?

Decision-theoretic approach:
- Assume $r_t \mid a_t = i \sim \psi(\theta_i)$, with $\theta_i \in \Theta$ unknown parameters.
- Define a prior $\xi(\theta_1, \ldots, \theta_n)$.
- Select actions to maximise $E_\xi U_t = E_\xi \sum_{k=1}^{T-t} \gamma^k r_{t+k}$.

Bernoulli example

Consider $n$ Bernoulli bandits with unknown parameters $\theta_i$, $i = 1, \ldots, n$, such that
$r_t \mid a_t = i \sim \mathrm{Bern}(\theta_i), \qquad E(r_t \mid a_t = i) = \theta_i. \qquad (2)$

We model our belief for each bandit's parameter $\theta_i$ as a Beta distribution $\mathrm{Beta}(\alpha_i, \beta_i)$, with density $f(\theta \mid \alpha_i, \beta_i)$, so that
$\xi(\theta_1, \ldots, \theta_n) = \prod_{i=1}^{n} f(\theta_i \mid \alpha_i, \beta_i).$

Recall that the posterior of a Beta prior is also a Beta. Let $k_{t,i} \triangleq \sum_{k=1}^{t} \mathbb{I}\{a_k = i\}$ be the number of times we played arm $i$ and $\hat{r}_{t,i} \triangleq \frac{1}{k_{t,i}} \sum_{k=1}^{t} r_k\, \mathbb{I}\{a_k = i\}$ be the empirical mean reward of arm $i$ at time $t$. Then, the posterior distribution for the parameter of arm $i$ is
$\xi_t(\theta_i) = \mathrm{Beta}\big(\alpha_i + k_{t,i}\hat{r}_{t,i},\; \beta_i + k_{t,i}(1 - \hat{r}_{t,i})\big).$

Since $r_t \in \{0, 1\}$, the possible states of our belief given some prior are $\mathbb{N}^{2n}$.
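A small sketch of the conjugate update above for a single Bernoulli arm, written in terms of the statistics $k_{t,i}$ and $\hat{r}_{t,i}$; the prior and counts in the example are made up.

def beta_posterior(alpha, beta, pulls, mean_reward):
    # Posterior Beta parameters for one Bernoulli arm:
    # Beta(alpha + k_{t,i} * rhat_{t,i}, beta + k_{t,i} * (1 - rhat_{t,i})).
    successes = pulls * mean_reward
    failures = pulls * (1.0 - mean_reward)
    return alpha + successes, beta + failures

# Arm played 10 times with 5 successes, starting from a Beta(1, 1) prior.
print(beta_posterior(1.0, 1.0, pulls=10, mean_reward=0.5))   # (6.0, 6.0)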

Belief states

The state of the bandit problem is the state of our belief. A sufficient statistic for our belief is the number of times we played each bandit and the total reward from each bandit. Thus, our state at time $t$ is entirely described by our priors $\alpha, \beta$ (the initial state) and the vectors
$k_t = (k_{t,1}, \ldots, k_{t,n}) \qquad (3)$
$\hat{r}_t = (\hat{r}_{t,1}, \ldots, \hat{r}_{t,n}). \qquad (4)$

At any time $t$, we can calculate the probability of observing $r_t = 1$ or $r_t = 0$ if we pull arm $i$ as
$\xi_t(r_t = 1 \mid a_t = i) = \frac{\alpha_i + k_{t,i}\hat{r}_{t,i}}{\alpha_i + \beta_i + k_{t,i}}.$

The next state is well-defined and depends only on the current state. Thus, the $n$-armed bandit problem is an MDP.
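Continuing the same illustrative single-arm setup, the predictive probability of the next reward and the (deterministic) belief-state transition look like this; encoding the state as (pulls, successes) is equivalent to the slide's $(k_{t,i}, \hat{r}_{t,i})$ pair.

def predict_success(alpha, beta, pulls, mean_reward):
    # xi_t(r_t = 1 | a_t = i) = (alpha_i + k_{t,i} * rhat_{t,i}) / (alpha_i + beta_i + k_{t,i})
    return (alpha + pulls * mean_reward) / (alpha + beta + pulls)

def next_belief_state(pulls, successes, reward):
    # Deterministic belief-state transition for the pulled arm, given the observed reward.
    return pulls + 1, successes + reward

print(predict_success(alpha=1.0, beta=1.0, pulls=10, mean_reward=0.5))   # 6/12 = 0.5
print(next_belief_state(pulls=10, successes=5, reward=1))                # (11, 6)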

Experiment design

Example. Consider $k$ treatments to be administered to $T$ volunteers. Each volunteer can only be used once. At the $t$-th trial, we perform some experiment $a_t \in \{1, \ldots, k\}$ and obtain a reward $r_t = 1$ if the result is successful and $0$ otherwise. If we simply randomise trials, then we will obtain a much lower number of successes than if we solve the bandit MDP.

Example. We are given a hypothesis set $H = \{h_1, h_2\}$, a prior $\psi_0$ on $H$, a decision set $D = \{d_1, d_2\}$ and a loss function $L : D \times H \to \mathbb{R}$. We can choose from a set of $k$ possible experiments to be performed over $T$ trials. At the $t$-th trial, we choose experiment $a_t \in \{1, \ldots, k\}$ and observe outcome $x_t \in X$. Our posterior is
$\psi_t(h) = \psi_0(h \mid a_1, \ldots, a_t, x_1, \ldots, x_t).$
The reward is $r_t = 0$ for $t < T$ and $r_T = \min_{d \in D} E_{\psi_T}(L \mid d)$. The process is a $T$-horizon MDP, which can be solved with standard backwards induction (see the sketch below).
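A sketch of the second example under simplifying assumptions: two hypotheses with Bernoulli outcome likelihoods, a single experiment type and a 0/1 loss; all the numbers, and the helper names posterior and final_reward, are illustrative rather than taken from the slides.

import numpy as np

def posterior(prior, p_success, outcomes):
    # psi_t(h) = psi_0(h | x_1, ..., x_t) for binary outcomes under Bernoulli likelihoods.
    # prior     : psi_0 over the two hypotheses, shape (2,)
    # p_success : P(x = 1 | h) for each hypothesis, shape (2,)
    post = np.array(prior, dtype=float)
    for x in outcomes:
        post *= np.where(x == 1, p_success, 1.0 - p_success)
        post /= post.sum()
    return post

def final_reward(post, loss):
    # r_T = min_{d in D} E_{psi_T}[L(d, h)], with loss[d, h].
    return (loss @ post).min()

prior = np.array([0.5, 0.5])
p_success = np.array([0.8, 0.3])          # P(x = 1 | h_1), P(x = 1 | h_2)
loss = np.array([[0.0, 1.0],              # L(d_1, h_1), L(d_1, h_2)
                 [1.0, 0.0]])             # L(d_2, h_1), L(d_2, h_2)
post = posterior(prior, p_success, outcomes=[1, 1, 0, 1])
print(post, final_reward(post, loss))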

Tree properties

Tree depth: (naive) error at depth $k$ is $\epsilon_k \le \frac{\gamma^k}{1 - \gamma}$ (see the worked example below).

Branching factor: $\phi = |R| \cdot |A| \cdot |S|$.

Practical methods to handle the tree:
- Lookahead up to a fixed time $T$.
- In some cases, closed-form solutions (e.g. Gittins indices).
- Pruning or sparse expansion.
- Value function approximations.
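As a worked instance of the naive depth bound (with illustrative values $\gamma = 0.9$ and target error $\epsilon = 0.1$):

$\frac{\gamma^k}{1 - \gamma} \le \epsilon \iff k \ge \frac{\log\big(\epsilon(1 - \gamma)\big)}{\log \gamma}, \qquad \text{e.g. } \frac{\log 0.01}{\log 0.9} \approx 43.7,$

so roughly 44 levels of lookahead would be needed, each multiplying the tree size by the branching factor $\phi = |R|\,|A|\,|S|$.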

Bounds on node values

Fortunately, we can obtain bounds on the value of any node $\omega = (s, \xi)$. Let $\pi^*(\mu)$ be the optimal policy for $\mu$.

Lower bound:
$V^*(\omega) \ge E_\xi V^{\pi^*(\bar{\mu}_\xi)}_\mu(s)$, where $\bar{\mu}_\xi \triangleq E_\xi\, \mu$ is the mean MDP.
The optimal policy must be at least as good as any stationary policy.

Upper bound:
$E_\xi \max_\pi V^\pi_\mu(s) \ge V^*(\omega)$
The optimal policy cannot do better than the policy which learns the correct model at the next time-step.

Estimating the bounds for some hyperstate $\omega = (s, \xi)$:
$\int V_\mu(s)\, \xi(\mu)\, d\mu \approx \frac{1}{n} \sum_{i=1}^{n} \hat{v}_i, \qquad \hat{v}_i = V_{\mu_i}(s), \quad \mu_i \sim \xi.$
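The Monte Carlo estimate of the two bounds can be sketched for the undiscounted horizon-$T$ Bernoulli bandit of the earlier slides, where the belief is a product of Beta posteriors and the optimal value for a known $\mu$ is simply the horizon times the best arm's mean; the function and parameter names are mine.

import numpy as np

def bandit_bounds(alphas, betas, horizon, n_samples=10000, rng=None):
    # Monte Carlo bounds on V*(omega) for a Bernoulli-bandit belief (independent Beta posteriors).
    # Lower bound: value, under mu_i ~ xi, of the policy that is optimal for the mean MDP,
    #              i.e. always pull the arm with the highest posterior mean.
    # Upper bound: E_xi max_pi V^pi_mu, i.e. the value if the true arm means were revealed.
    # For the undiscounted horizon-T bandit, V_mu of a fixed arm is horizon * (that arm's mean).
    rng = rng or np.random.default_rng(0)
    alphas, betas = np.asarray(alphas, float), np.asarray(betas, float)
    best_mean_arm = np.argmax(alphas / (alphas + betas))                 # pi*(mean MDP)
    thetas = rng.beta(alphas, betas, size=(n_samples, len(alphas)))      # mu_i ~ xi
    lower = horizon * thetas[:, best_mean_arm].mean()
    upper = horizon * thetas.max(axis=1).mean()
    return lower, upper

# Two arms: one well explored (mean around 0.6), one barely explored.
print(bandit_bounds(alphas=[12, 2], betas=[8, 2], horizon=10))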

Stochastic branch and bound

Main idea:
- Sample once from all leaf nodes.
- Expand the node with the highest mean upper bound.
- We quickly discover overoptimistic bounds.
- Unexplored leaf nodes accumulate samples.

Hierarchical variant:
- Sample children instead of leaves.
- Average bounds along the path to avoid degeneracy.

[Figure: successive sampling rounds on a small tree, showing empirical upper-bound estimates at the leaves.]

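A sketch of the basic (non-hierarchical) loop described above; the Leaf container, the toy sampler and the expansion rule are illustrative stand-ins for a real belief-tree node, an upper-bound sampler such as the Monte Carlo estimate above, and the actual node expansion.

import numpy as np

class Leaf:
    # A leaf of the (partially expanded) belief tree, accumulating upper-bound samples.
    def __init__(self, hyperstate):
        self.hyperstate = hyperstate
        self.samples = []
    def mean_upper_bound(self):
        return np.mean(self.samples)

def stochastic_branch_and_bound(root, sample_upper_bound, expand, n_iterations):
    # sample_upper_bound(leaf): one stochastic sample of the leaf's upper bound,
    #                           e.g. the value of an MDP drawn from the leaf's belief.
    # expand(leaf)            : returns the new leaves (children) of the expanded node.
    leaves = [root]
    for _ in range(n_iterations):
        for leaf in leaves:                                      # sample once from all leaf nodes
            leaf.samples.append(sample_upper_bound(leaf))
        best = max(leaves, key=lambda l: l.mean_upper_bound())   # highest mean upper bound
        leaves.remove(best)                                      # expand it ...
        leaves.extend(expand(best))                              # ... its children become leaves
    return leaves    # a full implementation would also back bounds up the tree

# Toy usage: deeper nodes have lower true value; bound samples are noisy.
rng = np.random.default_rng(0)
depth_value = lambda leaf: 1.0 / (1 + leaf.hyperstate) + rng.normal(scale=0.1)
split = lambda leaf: [Leaf(leaf.hyperstate + 1), Leaf(leaf.hyperstate + 1)]
print(len(stochastic_branch_and_bound(Leaf(0), depth_value, split, n_iterations=20)))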

Complexity results

Let $\Delta$ be the value difference between two branches and $\beta = V_{\max} - V_{\min}$.

If $N$ is the number of times an optimal branch is sampled without being expanded:
$P(N > n) \le \exp\!\left(-2\beta^{-2} n^2 \Delta^2\right)$

If $K$ is the number of times a sub-optimal branch will be expanded, then for $k > k_0 = \log_\gamma(\Delta/\beta)$:

Stochastic branch and bound 1:
$P(K > k) = O\!\left(\exp\{-2\beta^{-2}[(k - k_0)\Delta]^2\}\right)$

Stochastic branch and bound 2:
$P(K > k) = \tilde{O}\!\left(\exp\{-2(k - k_0)\Delta^2(1 - \gamma)^2\}\right)$

Summary

Results:
- Development of upper and lower bounds for belief tree values.
- Application to efficient tree expansion.
- Complexity bounds for tree expansion.

Future work:
- Sparse sampling and smoothness property to reduce branching factor.
- Can we get regret bounds via posterior concentration?
- Extend approach to non-parametrics...

Questions? Thank you for your attention.