
Approximate Universal Artificial Intelligence
A Monte-Carlo AIXI Approximation

Joel Veness (University of New South Wales), Kee Siong Ng (National ICT Australia), Marcus Hutter (The Australian National University), David Silver (University College London)

September 26, 2013

General Reinforcement Learning Problem

Worst-case scenario:
- The environment is unknown.
- Observations may be noisy.
- Effects of actions may be stochastic.
- No explicit notion of state; perceptual aliasing.
- Rewards may be sparsely distributed.

Notation:
- The agent interacts with an unknown environment $\mu$ by taking actions $a \in \mathcal{A}$.
- The environment responds with observations $o \in \mathcal{O}$ and rewards $r \in \mathcal{R}$. For convenience, we sometimes use percepts $x \in \mathcal{X} := \mathcal{O} \times \mathcal{R}$.
- $x_{1:n}$ denotes $x_1, x_2, \ldots, x_n$; $x_{<n}$ denotes $x_1, x_2, \ldots, x_{n-1}$; and $ax_{1:n}$ denotes $a_1, x_1, a_2, x_2, \ldots, a_n, x_n$.

MC-AIXI-CTW in context

Some approaches to (aspects of) the general reinforcement learning problem:
- Model-free RL with function approximation (e.g. TD).
- POMDPs (assume an observation / transition model, maybe learn its parameters?).
- Learn some (hopefully compact) state representation, then use MDP solution methods.

Our approach: directly approximate AIXI, a universal Bayesian optimality notion for general reinforcement learning agents.

AIXI: A Bayesian Optimality Notion

$$a_t^{\text{AIXI}} = \arg\max_{a_t} \sum_{x_t} \ldots \max_{a_{t+m}} \sum_{x_{t+m}} \left[ \sum_{i=t}^{t+m} r_i \right] \sum_{\rho \in \mathcal{M}} 2^{-K(\rho)} \, \rho(x_{1:t+m} \mid a_{1:t+m})$$

Expectimax + (a generalised form of) Solomonoff Induction:
- The model class $\mathcal{M}$ contains all enumerable chronological semi-measures.
- Kolmogorov complexity $K(\rho)$ is used as an Ockham prior.
- $m := b - t + 1$ is the remaining search horizon; $b$ is the maximum age of the agent.

Caveat: incomputable. Not an algorithm!

Describing Environments, AIXI Style

A history $h$ is an element of $(\mathcal{A} \times \mathcal{X})^* \cup (\mathcal{A} \times \mathcal{X})^* \times \mathcal{A}$.

An environment $\rho$ is a sequence of conditional probability functions $\{\rho_0, \rho_1, \rho_2, \ldots\}$, where for all $n \in \mathbb{N}$, $\rho_n : \mathcal{A}^n \to \text{Density}(\mathcal{X}^n)$ satisfies

$$\forall a_{1:n} \, \forall x_{<n}: \quad \rho_{n-1}(x_{<n} \mid a_{<n}) = \sum_{x_n \in \mathcal{X}} \rho_n(x_{1:n} \mid a_{1:n}), \qquad \rho_0(\epsilon \mid \epsilon) = 1.$$

The $\rho$-probability of observing $x_n$ in cycle $n$ given history $h = ax_{<n} a_n$ is

$$\rho(x_n \mid ax_{<n} a_n) := \frac{\rho(x_{1:n} \mid a_{1:n})}{\rho(x_{<n} \mid a_{<n})},$$

provided $\rho(x_{<n} \mid a_{<n}) > 0$.

Learning a Model of the Environment

We will be interested in agents that use a mixture environment model to learn the true environment $\mu$:

$$\xi(x_{1:n} \mid a_{1:n}) := \sum_{\rho \in \mathcal{M}} w_0^\rho \, \rho(x_{1:n} \mid a_{1:n})$$

- $\mathcal{M} := \{\rho_1, \rho_2, \ldots\}$ is the model class.
- $w_0^\rho$ is the prior weight for environment $\rho$.
- $\xi$ satisfies the definition of an environment model, so it can predict using

$$\xi(x_n \mid ax_{<n} a_n) = \sum_{\rho \in \mathcal{M}} w_{n-1}^\rho \, \rho(x_n \mid ax_{<n} a_n), \qquad w_{n-1}^\rho := \frac{w_0^\rho \, \rho(x_{<n} \mid a_{<n})}{\sum_{\nu \in \mathcal{M}} w_0^\nu \, \nu(x_{<n} \mid a_{<n})}.$$
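To make the mixture prediction and the posterior weight update concrete, here is a minimal Python sketch of a mixture environment model. It assumes each candidate model is a callable returning the conditional probability of the next percept given the history, and it omits actions for brevity; the class and its interface are illustrative, not the authors' implementation.

```python
# Minimal sketch of a mixture environment model (illustrative, not the
# authors' code). Each candidate model rho is assumed to be a callable
# rho(x, history) returning the conditional probability of percept x
# given the interaction history; actions are omitted for brevity.

class MixtureEnvironmentModel:
    def __init__(self, models, prior_weights):
        self.models = models                 # list of callables rho(x, history)
        self.weights = list(prior_weights)   # prior weights w_0^rho, summing to 1

    def predict(self, x, history):
        """xi(x | history) = sum_rho w_rho * rho(x | history)."""
        return sum(w * rho(x, history)
                   for w, rho in zip(self.weights, self.models))

    def update(self, x, history):
        """After observing x, re-weight each model by how well it predicted x
        (this realises the posterior weights w_{n-1}^rho above)."""
        self.weights = [w * rho(x, history)
                        for w, rho in zip(self.weights, self.models)]
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]

# Hypothetical usage with two coin-flip environments (Pr[x=1] = 0.2 and 0.8):
rho_a = lambda x, h: 0.2 if x == 1 else 0.8
rho_b = lambda x, h: 0.8 if x == 1 else 0.2
xi = MixtureEnvironmentModel([rho_a, rho_b], [0.5, 0.5])
history = []
for x in [1, 1, 0, 1]:
    print(xi.predict(x, history))   # predictive probability of the next percept
    xi.update(x, history)           # Bayesian posterior re-weighting
    history.append(x)
```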

Theoretical Properties

Theorem: Let $\mu$ be the true environment. The $\mu$-expected squared difference between $\mu$ and $\xi$ is bounded as follows. For all $n \in \mathbb{N}$ and for all $a_{1:n}$,

$$\sum_{k=1}^{n} \sum_{x_{1:k}} \mu(x_{<k} \mid a_{<k}) \Big( \mu(x_k \mid ax_{<k} a_k) - \xi(x_k \mid ax_{<k} a_k) \Big)^2 \;\leq\; \min_{\rho \in \mathcal{M}} \Big\{ -\ln w_0^\rho + D_{\mathrm{KL}}\big(\mu(\cdot \mid a_{1:n}) \,\|\, \rho(\cdot \mid a_{1:n})\big) \Big\},$$

where $D_{\mathrm{KL}}(\cdot \| \cdot)$ is the KL divergence between two distributions.

Roughly: the predictions made by $\xi$ will converge to those of $\mu$ if a model close to $\mu$ (with respect to KL divergence) is in $\mathcal{M}$.

Prediction Suffix Trees

A prediction suffix tree (PST) is a simple, tree-based, variable-length Markov model.

For example, using the PST shown below, having initially been given the data 01:

$$\Pr(010 \mid 01) = \Pr(0 \mid 01) \, \Pr(1 \mid 010) \, \Pr(0 \mid 0101) = (1 - \theta_1) \, \theta_2 \, (1 - \theta_1) = 0.9 \times 0.3 \times 0.9 = 0.243$$

[Figure: a depth-2 PST. The root branches on the most recent bit; the leaves hold the parameters $\theta_1 = 0.1$ (history ends in 1), $\theta_2 = 0.3$ (history ends in 10) and $\theta_3 = 0.5$ (history ends in 00), each giving the probability that the next bit is 1 in that context.]
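As an illustration, the following Python sketch encodes that depth-2 PST as nested dictionaries (an assumed representation, chosen only for this example) and reproduces the computation above.

```python
# Minimal sketch of prediction with the depth-2 PST from the slide
# (leaf parameters theta give the probability that the NEXT bit is 1).
# The suffix is read from the most recent bit backwards; the nested-dict
# encoding of the tree is an assumption for illustration.

PST = {                      # branch on the most recent bit
    '1': 0.1,                # theta_1: history ends in 1
    '0': {                   # history ends in 0 -> look one bit further back
        '1': 0.3,            # theta_2: history ends in 10
        '0': 0.5,            # theta_3: history ends in 00
    },
}

def prob_next_one(history, node=PST):
    """Walk from the most recent bit down to a leaf and return theta."""
    for bit in reversed(history):
        if isinstance(node, dict):
            node = node[bit]
        else:
            break
    return node

def sequence_prob(history, future):
    """Pr(future | history) = product of per-bit conditional probabilities."""
    p, h = 1.0, list(history)
    for bit in future:
        theta = prob_next_one(h)
        p *= theta if bit == '1' else (1.0 - theta)
        h.append(bit)
    return p

print(sequence_prob('01', '010'))   # 0.9 * 0.3 * 0.9 = 0.243
```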

Context Tree Weighting

Context Tree Weighting (CTW) is an online prediction method. CTW uses a mixture of prediction suffix trees, where smaller suffix trees are given higher initial weight; this helps to avoid overfitting when data is limited.

Let $\mathcal{C}_D$ denote the class of all prediction suffix trees of maximum depth $D$. CTW computes, in time $O(D)$ per symbol,

$$\Pr(x_{1:t}) = \sum_{M \in \mathcal{C}_D} 2^{-\Gamma_D(M)} \Pr(x_{1:t} \mid M),$$

where $\Gamma_D(M)$ is the description length of the context tree $M$. This is truly amazing, as computing the sum naively would take time double-exponential in $D$!
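The recursion behind this mixture can be written down compactly. The sketch below is a batch (offline) illustration in Python, assuming the initial context is padded with zeros; the real algorithm maintains the same quantities incrementally in $O(D)$ per symbol, which this sketch does not attempt.

```python
# Offline sketch of the CTW mixture probability Pr(x_{1:t}) for a binary
# string with maximum context depth D. Real CTW updates the tree online in
# O(D) per symbol; this batch version only illustrates the recursion
#     P_w(node) = 1/2 * P_KT(node) + 1/2 * P_w(child 0) * P_w(child 1),
# with P_w = P_KT at maximum depth. Zero-padding the initial context is an
# assumption made for illustration.

from math import lgamma, exp

def kt_block(a, b):
    """Krichevsky-Trofimov probability of a block with a zeros and b ones:
    Gamma(a+1/2) * Gamma(b+1/2) / (pi * Gamma(a+b+1))."""
    return exp(lgamma(a + 0.5) + lgamma(b + 0.5)
               - lgamma(0.5) - lgamma(0.5) - lgamma(a + b + 1))

def ctw_probability(bits, depth):
    # Gather, for every context of length `depth`, the bits that followed it.
    padded = [0] * depth + list(bits)
    followers = {}
    for i in range(depth, len(padded)):
        ctx = tuple(padded[i - depth:i])           # most recent bit last
        followers.setdefault(ctx, []).append(padded[i])

    def weighted(prefix):
        # Counts of 0s and 1s observed after any context ending with `prefix`.
        xs = [b for ctx, bs in followers.items()
              if ctx[len(ctx) - len(prefix):] == prefix for b in bs]
        a, b = xs.count(0), xs.count(1)
        if len(prefix) == depth:
            return kt_block(a, b)
        return 0.5 * kt_block(a, b) + \
               0.5 * weighted((0,) + prefix) * weighted((1,) + prefix)

    return weighted(())

print(ctw_probability([0, 1, 1, 0, 1], depth=2))
```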

Model Class Approximation

Action-Conditional Context Tree Weighting algorithm: approximate the model class of AIXI with a mixture over all action-conditional Prediction Suffix Tree structures of maximum depth $D$.

- PSTs are a form of variable-order Markov model.
- The Context Tree Weighting algorithm can be adapted to compute a mixture of over $2^{2^{D-1}}$ environment models in time $O(D)$!
- Inductive bias: smaller PST structures are favoured.
- PST parameters are learnt using KT estimators.
- The KL-divergence term in the previous theorem grows as $O(\log n)$.
- Intuitively, the efficiency of CTW is due to clever exploitation of shared structure.
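The KT estimator mentioned above has a simple closed form. A minimal sketch (purely illustrative) of its predictive rule, and of the sequential probability it assigns to a bit string:

```python
# Sketch of the KT (Krichevsky-Trofimov) estimator used for the PST/CTW
# leaf parameters: after seeing a zeros and b ones in a context, the
# probability that the next bit is 1 is (b + 1/2) / (a + b + 1).

def kt_predict_one(a, b):
    return (b + 0.5) / (a + b + 1.0)

def kt_sequence_prob(bits):
    """Sequential probability of a bit string under a single KT estimator."""
    a = b = 0
    p = 1.0
    for x in bits:
        p1 = kt_predict_one(a, b)
        p *= p1 if x == 1 else (1.0 - p1)
        if x == 1:
            b += 1
        else:
            a += 1
    return p

print(kt_sequence_prob([0, 1, 1]))   # (1/2) * (1/4) * (1/2) = 0.0625
```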

Greedy Action Selection

Consider the bandit setting: no history or state dependence.

- Action $a \in \mathcal{A}$ has value $V(a) = \mathbb{E}[R \mid a]$, the expected return.
- Optimal action/arm: $a^* := \arg\max_a V(a)$ (unknown).
- $V(a)$ is unknown, so use the frequency estimate $\hat{V}(a) := \frac{1}{T(a)} \sum_{t : a_t = a} R_t$, where $R_t$ is the actual return at time $t$.
- $T(a) := \#\{t \leq T : a_t = a\}$ is the number of times arm $a$ has been taken so far.
- Greedy action: $a_{T+1}^{\text{greedy}} = \arg\max_a \hat{V}(a)$.

Problem: if $a^*$ accidentally looks bad (low early $\hat{V}(a^*)$), it will never be taken again $\Rightarrow$ explore/exploit dilemma.

Solution: optimism in the face of uncertainty...

Upper Confidence Algorithm for Bandits

UCB action: $a_{T+1}^{\text{UCB}} := \arg\max_a V^+(a)$, where

$$V^+(a) := \hat{V}(a) + C \sqrt{\frac{\log T}{T(a)}}, \qquad C > 0 \text{ a suitable constant.}$$

- If an arm is under-explored (i.e. $T(a) \ll \log T$), then $V^+(a)$ is huge, so UCB will take arm $a$. Hence every arm is taken infinitely often and $\hat{V}(a) \to V(a)$.
- If a sub-optimal arm is over-explored (i.e. $T(a) \gg \log T$), then $V^+(a) \approx \hat{V}(a) \approx V(a) < V(a^*) \approx \hat{V}(a^*) < V^+(a^*)$, so UCB will not take arm $a$.

Conclusion: $T(a) \propto \log T$ for all suboptimal arms, and $T(a^*) = T - O(\log T)$, i.e. only an $O(\log T)/T$ fraction of actions are suboptimal. UCB is a theoretically optimal explore/exploit strategy for bandits.

Application: use it heuristically in expectimax tree search...
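A minimal Python sketch of the UCB rule on a toy Bernoulli bandit (the arm probabilities and the constant C below are arbitrary choices for illustration, not part of the slides):

```python
# Sketch of the UCB rule on a Bernoulli bandit (illustration only).
import math, random

def ucb_bandit(arm_probs, steps, C=1.0):
    n_arms = len(arm_probs)
    counts = [0] * n_arms      # T(a): times arm a was pulled
    values = [0.0] * n_arms    # V_hat(a): empirical mean return of arm a
    total_reward = 0.0
    for t in range(1, steps + 1):
        if t <= n_arms:
            a = t - 1          # pull every arm once to initialise the estimates
        else:
            # V+(a) = V_hat(a) + C * sqrt(log t / T(a)); pick the largest
            a = max(range(n_arms),
                    key=lambda i: values[i] + C * math.sqrt(math.log(t) / counts[i]))
        r = 1.0 if random.random() < arm_probs[a] else 0.0
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]   # incremental mean
        total_reward += r
    return counts, values, total_reward

counts, values, total = ucb_bandit([0.2, 0.5, 0.8], steps=10_000)
print(counts)   # suboptimal arms end up with only O(log T) pulls
```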

Expectimax Approximation

Monte Carlo Tree Search (MCTS) approximation of expectimax, via an Upper Confidence Tree (UCT) style algorithm:

- Sample observations from the CTW distribution.
- Select actions with the highest Upper Confidence Bound (UCB) value $V^+$.
- Expand the tree by one leaf node (per trajectory).
- Simulate from the leaf node further down using a (fixed) playout policy.
- Propagate the value estimates back up through each node on the path.
- Repeat until timeout.

[Figure: search tree alternating action nodes $a_1, a_2, a_3$ and observation nodes $o_1, \ldots, o_4$, with future reward estimates at the leaves.]

With sufficient time, this converges to the expectimax solution. The value of information is correctly incorporated when instantiated with a mixture environment model, giving a Bayesian solution to the exploration/exploitation dilemma. (A sketch of the search loop follows below.)
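The sketch below shows the general shape of such a search loop in Python. It is a generic UCT-style planner over an assumed generative interface `model(history, action) -> (observation, reward)` with rewards in [0, 1]; it is not the authors' ρUCT implementation (which, among other things, has to undo model updates made during simulated trajectories).

```python
# Generic UCT-style planning sketch over an assumed generative environment
# model (illustrative; not the rho-UCT implementation from the paper).
import math, random

class Node:
    """Search-tree node; decision nodes and chance nodes share this shape."""
    def __init__(self):
        self.visits = 0
        self.value = 0.0     # running mean of sampled returns through this node
        self.children = {}   # action -> Node (decision) or observation -> Node (chance)

def rollout(model, history, actions, depth):
    """Fixed uniformly random playout policy from a leaf to the horizon."""
    ret = 0.0
    for _ in range(depth):
        a = random.choice(actions)
        o, r = model(history, a)
        history = history + (a, o)
        ret += r
    return ret

def simulate(node, model, history, actions, depth, C=1.4):
    """One simulation: select by UCB, expand one leaf, rollout, back up values."""
    if depth == 0:
        return 0.0
    untried = [a for a in actions if a not in node.children]
    if untried:
        a = random.choice(untried)                    # expand one new action node
        node.children[a] = Node()
        o, r = model(history, a)
        ret = r + rollout(model, history + (a, o), actions, depth - 1)
    else:
        a = max(actions, key=lambda b: node.children[b].value +
                C * math.sqrt(math.log(node.visits) / node.children[b].visits))
        o, r = model(history, a)                      # sampled percept (chance node)
        child = node.children[a].children.setdefault(o, Node())
        ret = r + simulate(child, model, history + (a, o), actions, depth - 1, C)
    arm = node.children[a]                            # back up along the path
    arm.visits += 1
    arm.value += (ret - arm.value) / arm.visits
    node.visits += 1
    node.value += (ret - node.value) / node.visits
    return ret

def plan(model, history, actions, horizon, simulations=1000):
    root = Node()
    for _ in range(simulations):
        simulate(root, model, history, actions, horizon)
    return max(root.children, key=lambda a: root.children[a].value)

# Hypothetical usage: a 2-armed "environment model" that ignores history.
if __name__ == "__main__":
    def model(history, action):
        return (0, 1.0 if (action == 1 and random.random() < 0.8) else 0.0)
    print(plan(model, history=(), actions=[0, 1], horizon=3))   # usually prints 1
```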

Agent Architecture (MC-AIXI-CTW = UCT + CTW)

[Figure: the MC-AIXI-CTW agent loop, an approximate AIXI agent. The agent records past actions, observations and rewards; refines its environment model by updating a Bayesian mixture of models (simple models receive a large prior, complex models a small prior); searches over future actions $a_1, a_2, a_3$ and observations $o_1, \ldots, o_4$ to estimate future reward; decides on the best action; performs the action in the real world; and records the new sensor information.]

Relationship to AIXI

Given enough thinking time, MC-AIXI-CTW will choose:

$$a_t = \arg\max_{a_t} \sum_{x_t} \ldots \max_{a_{t+m}} \sum_{x_{t+m}} \left[ \sum_{i=t}^{t+m} r_i \right] \sum_{M \in \mathcal{C}_D} 2^{-\Gamma_D(M)} \Pr(x_{1:t+m} \mid M, a_{1:t+m})$$

In contrast, AIXI chooses:

$$a_t = \arg\max_{a_t} \sum_{x_t} \ldots \max_{a_{t+m}} \sum_{x_{t+m}} \left[ \sum_{i=t}^{t+m} r_i \right] \sum_{\rho \in \mathcal{M}} 2^{-K(\rho)} \Pr(x_{1:t+m} \mid a_{1:t+m}, \rho)$$

Algorithmic Considerations

- Restricted the model class to gain the desirable computational properties of CTW.
- Approximated the finite-horizon expectimax operation with an MCTS procedure.
- $O(Dm \log(|\mathcal{O}||\mathcal{R}|))$ operations are needed to generate $m$ observation/reward pairs (for a single simulation).
- $O(tD \log(|\mathcal{O}||\mathcal{R}|))$ space overhead for storing the context tree.
- $O(D \log(|\mathcal{O}||\mathcal{R}|))$ to update the context tree online.
- Anytime search algorithm; the search can be parallelised.

Experimental Setup

- The agent was tested on a number of POMDP domains, as well as TicTacToe and Kuhn Poker.
- The agent is required to both learn and plan.
- The context depth and search horizon were made as large as possible subject to computational constraints.
- $\epsilon$-greedy training, with a decaying $\epsilon$; greedy evaluation.

Results

Comparison to Other RL Algorithms

[Figure: Learning scalability on Kuhn Poker. Average reward per cycle (from -0.2 to 0.1) versus experience (100 to 1,000,000 cycles, log scale) for MC-AIXI, U-Tree and Active-LZ, with the optimal average reward shown for reference.]

Resources Required for (Near-)Optimal Performance

Domain        Experience    Simulations   Search Time
Cheese Maze   5 x 10^4      500           0.9s
Tiger         5 x 10^4      10000         10.8s
4x4 Grid      2.5 x 10^4    1000          0.7s
TicTacToe     5 x 10^5      5000          8.4s
Biased RPS    1 x 10^6      10000         4.8s
Kuhn Poker    5 x 10^6      3000          1.5s

Timing statistics were collected on an Intel dual quad-core 2.53 GHz Xeon. These toy problems are solvable in reasonable time on a modern workstation, and the general ability of the agent will scale with better hardware.

Limitations

- PSTs are inadequate for representing many simple models compactly. For example, it would be unrealistic to expect the current MC-AIXI-CTW approximation to cope with real-world image or audio data.
- Exploration/exploitation needs more attention: can something principled and efficient be done for general Bayesian agents using large model classes?

Future Work

- A uniform random rollout policy was used in ρUCT; a learnt policy should perform much better.
- All prediction was done at the bit level. Fine for a first attempt, but there is no need to work at such a low level.
- The mixture environment model definition can be extended to continuous model classes.
- Incorporate more (action-conditional) Bayesian machinery.
- Richer notions of context.

References

For more information, see:

A Monte-Carlo AIXI Approximation (2011), J. Veness, K.S. Ng, M. Hutter, W. Uther, D. Silver. http://dx.doi.org/10.1613/jair.3125
Highlights: a direct comparison to U-Tree / Active-LZ, an improved model class approximation (FAC-CTW), and a more relaxed presentation.

Video of the latest version playing Pacman: http://www.youtube.com/watch?v=yfsmhtmgdke