Abstract. Acknowledgements. 1 Introduction The focus of this dissertation Reinforcement Learning... 4

Contents

Abstract
Acknowledgements
1 Introduction
The focus of this dissertation
Reinforcement Learning
A qualitative introduction
The distinguishing characteristic
Examples of reinforcement learning tasks
A brief history of reinforcement learning
An illustration of reinforcement learning
The representation problem
Features, detectors, and generalized states
Feature extraction
Cognitive economy and the big picture
Contributions of This Dissertation
A Brief Preview of Subsequent Chapters
Reinforcement Learning and Feature Extraction
The Elements of Reinforcement Learning
The agent and the environment
Common assumptions
2.2 Q-Learning
Policies and value functions
An illustration
Theoretical convergence
Practical considerations
Survey of Feature Extraction
Gradient-descent function approximation
Targeting important regions in state-space
Distinguishing states with different action values
Good representations
Formalization of Cognitive Economy for Reinforcement Learning
Introduction
Overview
State generalization
A principled approach
Cognitive Economy
Related ideas in reinforcement learning
Three aspects of cognitive economy
Preliminaries
Representational model
Assumptions
Definitions
Relevant Features
Importance
Definitions of importance
Necessary Distinctions
Policy distinctions
Value distinctions
Making sound decisions
Criterion for representational adequacy
State Compatibility
Feature Extraction
Summary
Action Values for Generalized States
Introduction
Action values depend on the agent's representation
Example: State-Action Values for a Discrete Representation
Region-Action Values
Example: Region-action values for a partition representation
Generalized Action Values
Method 1: Exploit assumptions of soft state aggregation
Method 2: Exploit assumptions of error minimization
Example: Generalized action values for a coarse coded representation
The Convex Generalization Property
Effects of Representation on Action Values
Summary
5 State Compatibility and Representational Adequacy
Introduction
Summary of notation
Examples of what can go wrong
Sufficient conditions for ε-adequacy
Generalization of state separation
Common assumptions
Proof of the theorem
Discussion
Necessary and sufficient conditions
The case for partition representations
Appendix: Some Helpful Properties of Convex Combinations
Case Studies in On-Line Feature Extraction
Introduction
Making Relevant Distinctions
Methodology
An Algorithm for On-Line Feature Extraction
Top Level of the Algorithm
Recognizing Surprising States
Investigating Surprising States
Adding and Merging State-Space Regions
Further Considerations
Initial Tests
Case Study: Puck-On-A-Hill Task
Analysis
Results
Case Study: Pole Balancing Task
Analysis
Results
Discussion
Future Work
Conclusion
Contributions
Future work
Afterword
Bibliography

List of Tables

1 Maximum returns
2 Tabular representation of the action values Q(s, a)
3 Action values for up discrete state representation
4 Action values for right discrete state representation
5 Whole path values for up partition representation
6 Whole path values for right partition representation
7 One-step values for up partition representation
8 One-step values for right partition representation
9 Probability of state occurrence under a random policy
10 Generalized action values under soft-state aggregation
11 Whole-path action values, υ1 (s, right)
12 Whole-path action values, υ1 (s, up)
13 Generalized action values under minimization of regional errors
14 Generalized action values under minimization of global errors
15 Inconsistency of action rankings may prevent ε-adequacy (δ = 0.08, ε = 0.2). Here pref_δ(s) = {a_1, a_2, a_3, a_4}
16 Inadequate representation for δ > ε/2 (δ = 0.15, ε = 0.2)
17 State value incompatibilities (δ = 0.1, ε = 0.2)
18 Inadequate representation where Equation 43 is not met (δ = 0.1, ε = 0.2). Here pref_δ(s) = {a_1, a_2, a_3, a_4} {a_1, a_2, a_3} = pref1 ε (s_1)

List of Figures

1 A 5 × 4 gridworld with start state S (2, 3) and goal state G (4, 2)
2 Gridworld partitioned according to preferred actions
3 An agent and its environment
4 A simple three-node reinforcement learning task
5 A larger gridworld, with state generalization
6 A two-action gridworld task
7 Q(s, right) Q(s, up) for the two-action gridworld
8 Plot of the state-probability distribution for the two-action gridworld under random exploration
9 This state region appears to have different values for the action, depending on whether we enter it from s_1 or s_2. Thus the Markov property does not hold for the partition region
10 Examples of successful feature extraction for the two-action gridworld task
11 The agent's representation simplifies the environment by grouping states of the world into generalized states that share the same action values
12 A possible recognition function for the feature tall
13 Several different representations of the gridworld
14 Widely-separated action values are characteristic of important features
15 A necessary policy distinction: The state grouping must be split to avoid mistakes in policy
16 An unnecessary distinction: Splitting the generalized state is probably not worthwhile
17 A necessary value distinction: The resulting states must be kept separate for the agent to seek the better outcome
18 The representation must make appropriate policy distinctions, which affect its next move, and value distinctions, which allow wise choices from earlier states
19 Incremental regret: Compare R, the expected long-term reward from s when we act according to the policy of the generalized state, with R_s, the return that results from taking the action which is best at s itself
20 State compatibility: Allow s_1 and s_2 to be grouped together if a one-step look-ahead reveals their overall values to be close and their preference sets similar
21 Three representations of the 4 × 4 gridworld
22 An optimal path through the partitioned gridworld
23 Q(s, right) under soft-state aggregation
24 Q(s, right) Q(s, up) under soft-state aggregation
25 Q(s, right) under minimization of regional errors
26 Q(s, right) Q(s, up) under minimization of regional errors
27 Superposition of the detectors for the counter-example
28 Q(s, right) Q(s, up) under minimization of global errors
29 Q(s, right) for the discrete representation
30 Q(s, right) Q(s, up) for the discrete representation
31 A value distinction which may or may not be necessary, depending on the values of r_zx and r_zy
32 Top level of the algorithm
33 Selecting surprising states for further investigation
34 Strategy for Active Investigations of Surprising States
35 Feature Extraction Algorithm
36 Judging the Compatibility of Two States
37 The puck-on-a-hill task: balance the puck on the hill to avoid negative reinforcement from hitting the wall
38 Controllable states: states outside this band result in failure
39 An ideal representation: must-push-left states (top curve) and must-push-right states (bottom curve) are separated by the diagonal line
40 Representation constructed automatically, from scratch (24 categories)
41 Representation constructed from a good seed representation
42 A representation inspired by Variable Resolution Dynamic Programming
43 Enhanced VRDP representation
44 Representation designed to limit the loss of controllability (from Yendo Hu, 1996)
45 Averaged performance curves for the original VRDP representation and Yendo Hu's controllability quantization
46 Averaged performance curves for the four best representations
47 The cart-pole apparatus. The task is to balance the pole by pushing the cart to either the left or the right in each control interval
48 Angular acceleration for the pole, f =
49 Acceleration of the cart, for f = 10.0 and θ =
50 Acceleration of the cart, for f = 10.0 and θ =
