Efficient Q-Learning by Division of Labor

Michael Herrmann, RIKEN, Lab. f. Information Representation, 2-1 Hirosawa, Wakoshi 351-01, Japan, michael@sponge.riken.go.jp
Ralf Der, Universitat Leipzig, Inst. f. Informatik, Postf. 920, D-04009 Leipzig, Germany, der@informatik.uni-leipzig.d400.de

Abstract: Q-learning, as well as other learning paradigms, depends strongly on the representation of the underlying state space. As a special case of the hidden state problem we investigate the effect of a self-organizing discretization of the state space in a simple control problem. We apply the neural gas algorithm with adaptation of learning rate and neighborhood range to a simulated cart-pole problem. The learning parameters are determined by the ambiguity of successful actions inside each cell.

1 Introduction

Q-learning [1] is an algorithm for dynamic programming that represents an efficient implementation of the reinforcement learning paradigm. It is based on an estimation of future reinforcement conditioned on the previously acquired policy of assigning actions to the states of a controlled system. One may distinguish two ways of implementing Q-learning: either by means of a state-action look-up table or by approximating the Q-function, e.g. using multilayered neural nets [2]. For continuous state spaces the look-up table refers to a quantization of the states of the system given either by prior knowledge or by an adaptive algorithm. Our paper is concerned with the latter approach and tries to reveal and resolve some of its pitfalls. These are caused by the interference of two types of learning: the adaptation of the partitioner and the reinforcement learning of the controller. The situation is schematically represented in Fig. 1. The determination of quantized states, which are internal states in the full control problem, represents an instance of the hidden state problem (cf. [3]). The reinforcement learning in turn relies on the hidden states that comprise cells of the state space partition.

For discrete actions an ideal partition of the (continuous) state space consists of domains, each having a unique optimal action for all states belonging to that domain. Hence, an optimal partition is defined by means of the policy function assigning actions to states. If the partitioning is done by means of a learning vector quantizer, cells are defined by reference vectors in the input space together with a nearest-neighbor assignment of input states to reference vectors. The learned distribution of the reference vectors, however, is usually determined by statistical properties of the inputs to the vector quantizer [4], the density of which corresponds to the frequency with which the states of the system are encountered. In particular, one obtains a high resolution (fine-grained partitions) around the stable states of the controlled system. In general, the partition obtained in this way will seldom be optimal in the above sense, cf. Fig. 2.

The present paper is devoted to introducing a convenient algorithm which is able to self-learn the optimal partitions on-line. Here, the learning rule for the vector quantizer is based on the neural gas algorithm [5], which allows for cooperativity between the neurons (categories) so that it shares the noise-filtering properties of Kohonen's self-organizing feature maps without relying on a prespecified topology. This algorithm appeared to be well suited for representing data from high-dimensional input spaces. The reinforcement learning for the policy function of the controller is implemented by the standard Q-learning algorithm introduced by Watkins [1]. The two learning rules are introduced in the next section, the modification leading to optimized partitioning being given in section 3. The fourth section is devoted to a numerical study of the controller, where for reasons of comparability we have chosen the pole-and-cart problem with the standard set of parameters (cf. e.g. [6]). Several further improvements of partitioner-controller systems, in particular avoiding the problem of destabilizing feedback loops, will be discussed briefly in the final section.

Figure 1: Scheme of the learning problem combining controller and partitioner learning.

2 Controller and partitioner learning

Q-learning: We consider a general control task of the following type. Suppose a system is characterized by the state vector x ∈ X. Using an appropriate discretization of the generally continuous X, the states can be represented by a finite number of categories j(t). The state x is to be controlled towards and stabilized inside a target region using a sequence of control actions {a(t)}, t = 0, 1, ... The set of possible actions is assumed to be finite. The Q-learning algorithm determines optimal action sequences {a(t)} in correspondence to the categories {j(t)} based on reinforcement signals r(t). Optimal actions a(t) are chosen according to

    a(t) = argmax_a Q(j(t), a)    (1)

which, together with the mapping from states to categories, implicitly defines the policy function a = π(x). Q(j(t), a(t)), which measures the value of a state-action pair, is adjusted according to

    ΔQ(j(t), a(t)) = α ( r(t+1) + γ V(j(t+1)) − Q(j(t), a(t)) )    (2)

where α is the learning rate, γ the discount factor, and

    V(j(t+1)) = max_a Q(j(t+1), a).    (3)

The reinforcement signal r(t) assumes, e.g., an elevated value if the state has reached the target. It is possible to reward states that are close to the target or to use negative reinforcement when the system has reached a 'failure' state. The example in section 4 relies solely on negative reinforcement.
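For concreteness, the following is a minimal Python sketch of the tabular update (1)-(3). The learning rate alpha, the discount gamma, and the numerical values are illustrative assumptions, not parameters reported in the paper.

```python
import numpy as np

def greedy_action(Q, j):
    """Eq. (1): choose the action with the largest Q-value for category j."""
    return int(np.argmax(Q[j]))

def q_update(Q, j, a, r_next, j_next, alpha=0.1, gamma=0.95):
    """Eqs. (2)-(3): one tabular Q-learning step for category j and action index a.
    alpha (learning rate) and gamma (discount factor) are assumed values."""
    V_next = np.max(Q[j_next])                              # Eq. (3)
    Q[j, a] += alpha * (r_next + gamma * V_next - Q[j, a])  # Eq. (2)
    return Q

# Example setup: 162 categories as in section 4, two discrete actions.
Q = np.zeros((162, 2))
```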

Adaptive partitioning: Both the control algorithm and the representation of the system's state should be adaptive if (i) the environment changes in time, (ii) no sufficient prior knowledge is available, or (iii) the distribution of states depends intrinsically on the behavior of the system itself. It appears natural, hence, to learn the partitions for the Q-algorithm. Unsupervised learning algorithms known from neural networks are well suited for this purpose, where the units (neurons) of the net are associated with the categories. In order to avoid the drawbacks of a prescribed topology of reference units we favored the neural gas algorithm [5] rather than a feature map [7]: whereas the latter is definitely faster in learning, the number of units per dimension has to be specified in advance. A particularly useful property consists in the determination of the neighborhood by the order of the distances from a given input datum, such that correlations among neural units can be controlled specifically in a dimension-independent manner. The basic update rule [5] for category vectors w_j representing a prototype state of category j reads

    Δw_j = −ε h_λ(k_j(x, w)) (w_j − x),    (4)

where ε is the learning rate. Topological relations between an input datum and the set of reference vectors w are introduced by the function h_λ, which depends on the rank k_j(x, w) of the distance between w_j and x among all such distances:

    h_λ(k_j(x, w)) = exp(−k_j(x, w) / λ_j).    (5)

In particular, for λ → 0 only the unit closest to x will be influenced by the update, whereas for larger λ other neurons, too, are attracted towards x. In our formulation λ may vary across units. The resulting asymmetry of interaction allows neurons to be transported to regions which are only poorly represented.

Figure 2: Illustrative example of a three-cell partition, first optimized with respect to the reconstruction error, as is common in self-organizing maps, and then optimized with respect to the disambiguity of the partition.

3 Partitioning adapted to Q-learning

The maximal value of the Q-function at a given unit decides which action is to be executed at the next time step. In the ideal partitioning defined in the introduction there is only one optimal action in each domain, so that in the converged state the Q-function is stationary under the learning rule (2). Hence, the fluctuations of Q in each partition of the state space may serve as an indicator for the reliability of this partition. When monitoring Q-values it becomes obvious which domains have already decided for an action. Otherwise, the unit 'requests assistance' from its neighbors by means of an increased λ-value (cf. Eq. 5). In the case of two possible actions, denoted by a(t) = ±1, we define the firmness f ∈ [0, 1] of a unit i as

    f_i = ‖ ⟨a(t)⟩_{t: j(t)=i} ‖.    (6)

The average ⟨...⟩ is a moving average over those time steps where unit i was activated and acted according to the greedy strategy (cf. below). Other definitions of the firmness are possible. In particular, in the case of more than two actions, entropy measures based on the probability of success for each action are suggestive instead of (6). The choice (6) with (7), on the other hand, is not only distinguished by its simplicity but also has the advantage of allowing for small deviations from maximal firmness, which are unavoidable when using a nearest-neighbor classifier to approximate the smooth boundaries in the control problem. This is due to a vanishing derivative of λ with respect to f at f = 1. A neuron cannot be firm for the 'wrong' action, as this is defined in terms of the Q-function. On the other hand, the maximum of the Q-function at unit j will approach the appropriate value as soon as the domain of j is such that a unique correct action exists. For the problem of rarely visited units that may not have achieved a firm state, cf. the Discussion on controlling the resolution properties of a partitioner. Now λ can be defined using the length of the mean action vector, in the present case simply

    λ_i ∝ (1 − f_i)²,    (7)

such that firm units will not disturb their neighbors, whereas non-firm units will effectively attract firm units so that their domain shrinks and reliable control becomes possible.

4 The cart-pole example

The principles described above have been exemplified by the standard task of balancing a pole (cf. [6]); details of the problem are given in the Appendix. (For consistency we will call the set of values of the four degrees of freedom of the system a state space rather than a phase space.) Upon receiving a four-dimensional input vector, the partitioner returns the best-matching unit as the category to which the input vector belongs. The category is submitted to the Q-learning controller together with a reinforcement signal r(t) as the only information the controller receives for advice: r = −1 if a failure occurred and r = 0 otherwise. The number of categories (units) was 162 (compare [6]); two different actions have been allowed. We remark that by exploiting symmetries of the model a solution could be found with far fewer neurons.

The Q-learning of control actions contains a simple form of exploratory behavior: actions are chosen according to (1) with probability c and randomly otherwise. c is linearly increased, reaching unity at the end of the learning phase of length T_learn. During the subsequent test period (T_test = 0.5 T_learn) only the greedy policy, i.e. c = 1, is used, but changes in the Q-function are still allowed. The controller is required to achieve control within a maximal time range T = T_learn + T_test. A trial of T time steps counts as successful if no failure occurred during at least 10000 consecutive time steps.

Results are presented in the figures. Fig. 3 demonstrates the speed-up of the learning time by a factor of more than 2 due to an appropriate assignment of domains. Because of the success-independent exploration strategy used in all simulations, even higher learning times (T ≈ 500000) did not yield success in every trial. The maximal success rates were 0.68 for the fixed-partition and 0.8 for the adaptive-partition controller, respectively. The firmness depicted in Fig. 4 reflects the mean value with respect to the active units. The failure of a trial may, however, be caused by the existence of undecided units which are rarely visited and, hence, do not have much influence on the average of the firmness.
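The sketch below illustrates the firmness-modulated neural gas update of Eqs. (4)-(7) in Python. The function names, the global learning rate eps, the proportionality constant lambda_max, and the finite action buffer standing in for the moving average of Eq. (6) are our assumptions for illustration.

```python
import numpy as np

def firmness(recent_actions):
    """Eq. (6): norm of the (moving) average of the greedy actions a(t) in {-1, +1}
    taken while the unit was active; approximated here by a finite buffer."""
    return abs(np.mean(recent_actions)) if len(recent_actions) else 0.0

def update_lambdas(f, lambda_max=5.0):
    """Eq. (7): lambda_i proportional to (1 - f_i)^2, so firm units (f_i near 1)
    stop disturbing their neighbors; lambda_max is an assumed constant."""
    return lambda_max * (1.0 - np.asarray(f)) ** 2

def neural_gas_step(W, x, lambdas, eps=0.05):
    """Eqs. (4)-(5): rank-based neural gas update with per-unit neighborhood range."""
    dists = np.linalg.norm(W - x, axis=1)
    ranks = np.argsort(np.argsort(dists))            # k_j(x, w): distance rank of unit j
    h = np.exp(-ranks / np.maximum(lambdas, 1e-12))  # Eq. (5), guarding lambda -> 0
    W += eps * h[:, None] * (x - W)                  # Eq. (4): pull units towards x
    return W
```

In this form a unit whose λ_i has shrunk towards zero is only moved when it is itself the best-matching unit, which is exactly the behavior of firm units described above.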

Figure 3: Number of successes in 100 trials at different values of the total learning time T for the adaptive partitioner (full line) and a fixed box partition (dashed line). (Vertical axis: success rate in %.)

Figure 4: Time course of the firmness, averaged over all units, for a typical learning trial.

5 Discussion

The proposed learning algorithm for optimized partitioning of the state space of a controlled system rests on tuning the neighborhood width of a neural gas partitioner according to the diversity of the actions proposed by the Q-learner. The learning rate of the adaptive partitioner has also been modified such that it decreases when the partition improves. The computer simulations of the generic cart-pole problem clearly demonstrate that this new algorithm increases both the accuracy and the speed of learning of the controller. The resulting partition deviates from the box partition by a better adaptation to the dynamics of the problem: if position and momentum have opposite signs, no control action is actually necessary, and thus either of the two available actions is uncritical. This is taken into account by the adaptive partition through a finer resolution in the diagonal region of the state space. Finally, it should be mentioned that the feedback of output information to the learning dynamics of the partitioner may well lead to instabilities in early learning. However, these can be avoided by compensating the shift of the reference vectors by conveniently reevaluating the affected values of the Q-function. In preliminary computer simulations an appropriate algorithm has been shown not only to avoid the instabilities but also to further increase the speed of Q-learning. Detailed results will, for lack of space, be presented elsewhere.

Acknowledgements: Part of this work has been done in the LADY project sponsored by the German BMFT under grant 01 IN 106B/3. MH has received support from the HCM-EC program under grant CHRX-CT93-0097.

References

[1] C. J. C. H. Watkins (1992) Q-learning. Machine Learning 8, pp. 279-292.
[2] P. J. Werbos (1990) A menu of designs for reinforcement learning over time. In: W. T. Miller, R. S. Sutton, P. J. Werbos (eds.) Neural Networks for Control. MIT Press, Cambridge, MA, pp. 67-95.
[3] S. Das, M. C. Mozer (1994) A unified gradient-descent/clustering architecture for finite state machine induction. In: J. D. Cowan, G. Tesauro, J. Alspector (eds.) Advances in Neural Information Processing Systems 6, pp. 19-26.
[4] H. Ritter, K. Schulten (1986) On the stationary state of Kohonen's self-organizing sensory mapping. Biological Cybernetics 54, pp. 99-106.
[5] Th. M. Martinetz, K. Schulten (1991) A 'neural gas' network learns topologies. In: Proc. ICANN-91, Helsinki. Elsevier, Amsterdam.
[6] M. Pendrith (1994) On reinforcement learning of control actions in noisy and non-Markovian domains. University of New South Wales, Technical Report UNSW-CSE-TR-9410.
[7] T. Kohonen (1984) Self-Organization and Associative Memory. Springer, Berlin, Heidelberg, New York.
[8] H.-U. Bauer, R. Der, M. Herrmann (1995) Controlling the magnification factor of self-organizing feature maps. Submitted to Neural Computation.

Appendix: Details of the cart-pole problem

The task (compare [6]) consists in balancing a pole (inverted pendulum) hinged to a cart by applying a binary force F = ±10 N to the cart. The motion of the cart is confined to a track of fixed length, such that the task is successfully solved if the pole remains upright and the cart avoids the ends of the track. The pole is described by the deviation θ from its vertical position and the angular velocity θ̇. The cart position and velocity are denoted by x and ẋ, respectively. The vector v = (x, ẋ, θ, θ̇) serves as input to the partitioner. The four-dimensional system is governed by the following equations:

    θ̈ = [ m g sin θ − cos θ (F + m_p l θ̇² sin θ) ] / [ (4/3) m l − m_p l cos² θ ],
    ẍ = [ F + m_p l (θ̇² sin θ − θ̈ cos θ) ] / m.

Parameter values are: length of track −2.4 m < x < 2.4 m, maximal deviation −0.21 rad < θ < 0.21 rad, mass of cart plus pole m = 1.1 kg, mass of pole m_p = 0.1 kg, gravitational acceleration g = 9.81 m/s², effective length of pole l = 0.5 m. The equations of motion were integrated using the Euler method with time step Δt = 0.02 s.
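A minimal Python sketch of the dynamics and Euler integration above, under the stated parameter values; the function and constant names are our own, and the action encoding (a ∈ {−1, +1} selecting F = ±10 N) follows section 3.

```python
import numpy as np

# Parameter values from the Appendix
G, M, M_P, L, DT, F_MAG = 9.81, 1.1, 0.1, 0.5, 0.02, 10.0

def cartpole_step(state, action):
    """One Euler step of the equations of motion above.
    state = (x, x_dot, theta, theta_dot); action in {-1, +1} selects F = +/-10 N."""
    x, x_dot, th, th_dot = state
    F = F_MAG * action
    sin, cos = np.sin(th), np.cos(th)
    th_acc = (M * G * sin - cos * (F + M_P * L * th_dot**2 * sin)) / \
             ((4.0 / 3.0) * M * L - M_P * L * cos**2)
    x_acc = (F + M_P * L * (th_dot**2 * sin - th_acc * cos)) / M
    return (x + DT * x_dot, x_dot + DT * x_acc,
            th + DT * th_dot, th_dot + DT * th_acc)

def failure(state):
    """Failure condition yielding r = -1: the cart leaves the track or the pole
    deviates by more than 0.21 rad from the vertical."""
    x, _, th, _ = state
    return abs(x) > 2.4 or abs(th) > 0.21
```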