
European Workshop on Reinforcement Learning 14 (2018), October 2018, Lille, France.

A Parameter Investigation of the ϵ-greedy Exploration Ratio Adaptation Method in Multi-agent Reinforcement Learning

Takuya Okano
Department of Policy and Planning Sciences, University of Tsukuba, Tsukuba, Ibaraki 305-0577, Japan
okano565656@gmail.com

Masaki Onishi
Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Tsukuba, Ibaraki 305-8560, Japan
onishi@ni.aist.go.jp

Itsuki Noda
Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, Tsukuba, Ibaraki 305-8560, Japan
i.noda@aist.go.jp

Abstract

Balancing the amounts of exploration and exploitation is one of the most important issues in reinforcement learning, and it is also important for achieving favorable performance in multi-agent and non-stationary environments. In a previous study, we proposed an adaptation method for the exploration ratio ϵ of ϵ-greedy, referred to as Win or Update Exploration ratio (WoUE), which can adapt ϵ in multi-agent and non-stationary environments. However, this method has a parameter that determines the update amount of ϵ and that must be fixed in advance. In this study, we investigate how this parameter affects the performance of WoUE. We confirm that there is a value of this parameter that is optimal for WoUE's performance, and we further compare WoUE with a related adaptation method. We confirm that WoUE improves performance compared with the related method in multi-agent and non-stationary environments, regardless of the value of its parameter.

Keywords: Reinforcement Learning, Multi-agent Reinforcement Learning, Exploration Ratio, ϵ-greedy

1. Introduction

Adapting the exploration ratio of ϵ-greedy is an important issue in reinforcement learning (Sutton and Barto, 1998). Several studies have proposed adaptation methods, but they target only single-agent reinforcement learning and/or stationary environments, and several difficulties arise when attempting to apply them to multi-agent and non-stationary environments. Therefore, we previously proposed the exploration ratio adaptation method WoUE, which adapts the exploration ratio in multi-agent and non-stationary environments. This method requires a parameter that determines the amount by which the exploration ratio is changed at each update. In this study, we investigate the influence of this parameter on the system performance, and we evaluate WoUE and a related exploration ratio adaptation method for several parameter values.

© 2018 Takuya Okano, Masaki Onishi, and Itsuki Noda. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.

2. Background

2.1 Domain

To evaluate exploration ratio adaptation methods, we use the non-stationary Repeated Resource Sharing Problem (RRSP) (Noda and Ohta, 2008). RRSP is a congestion game in which agents share resources. Formally, an RRSP is a tuple

RRSP = \langle G, C, A, R \rangle,  G = {g_1, g_2, ..., g_M},  C = {c_{g_1}, c_{g_2}, ..., c_{g_M}},  A = {a_1, a_2, ..., a_N},

where G is the set of resources, C is the set of resource capacities, A is the set of agents, M is the number of resources, and N is the number of agents. The number of agents n_j that share the same resource j ∈ G determines the reward function R(n_j, c_j), which is monotonically decreasing in n_j / c_j. At every step, each agent selects a resource j according to ϵ-greedy, receives the reward R(n_j, c_j) for the selected resource, and learns by reinforcement learning, i.e., it updates the expected reward Q(j) by

Q(j) ← (1 - α) Q(j) + α R(n_j, c_j),

where α is the learning ratio. This process is repeated at every step.

The globally optimal condition of the game is that the average of the distribution of all agents' rewards is high and its variance is minimal. In other words, the globally optimal condition is that the fraction of agents selecting each resource j ∈ G equals the capacity composition ratio of that resource, i.e., c_j / \sum_{k∈G} c_k = n_j / N for all j ∈ G. Therefore, we define the system performance, Error, by the following equation:

Error ≡ \sum_{j∈G} ( c_j / \sum_{k∈G} c_k - n_j / N )^2.   (1)

The objective of the game is to minimize the Error by multi-agent reinforcement learning, and we evaluate the system performance using the Error.

We also introduce non-stationarity into RRSP to reflect real-world situations: the capacity of each resource may change over time with a certain probability. The non-stationary RRSP can model numerous real-world problems. For instance, when a congestion problem is mapped to RRSP, a road corresponds to a resource and a car to an agent; a change in a resource's capacity then corresponds to a road becoming impassable or blocked, or to a new road being created. Therefore, we use a non-stationary RRSP, in which resource capacities change over time with a certain probability, to evaluate multi-agent reinforcement learning performance. Figure 1 shows a diagram of the non-stationary RRSP.

Figure 1: Diagram of the non-stationary RRSP.
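To make the setting above concrete, the following is a minimal Python sketch of a non-stationary RRSP step with ϵ-greedy agents, the Q update, and the Error of Equation (1). It is illustrative only: the class and function names are ours, the reward form R(n_j, c_j) = c_j / (1 + n_j) is the concrete choice used later in Section 5.2, and the capacity-change rule is left as a pluggable hook.

```python
import random

class RRSP:
    """Minimal sketch of a (non-stationary) Repeated Resource Sharing Problem."""

    def __init__(self, capacities, n_agents, alpha=0.3, epsilon=0.1):
        self.capacities = list(capacities)          # c_j for each resource j
        self.n_agents = n_agents                    # N
        self.alpha = alpha                          # learning ratio
        self.eps = [epsilon] * n_agents             # per-agent exploration ratio
        # Q[i][j]: agent i's expected reward for resource j
        self.Q = [[0.0] * len(capacities) for _ in range(n_agents)]

    def reward(self, n_j, c_j):
        # One monotonically decreasing function of n_j / c_j (the form used in Section 5.2).
        return c_j / (1.0 + n_j)

    def step(self, change_capacities=None):
        # Optional non-stationarity hook (e.g., the capacity doubling of Section 5.2).
        if change_capacities is not None:
            self.capacities = change_capacities(self.capacities)
        m = len(self.capacities)
        # Each agent selects a resource with epsilon-greedy.
        choices = []
        for i in range(self.n_agents):
            if random.random() < self.eps[i]:
                choices.append(random.randrange(m))                        # explore
            else:
                choices.append(max(range(m), key=lambda j: self.Q[i][j]))  # exploit
        counts = [choices.count(j) for j in range(m)]                      # n_j
        # Rewards and Q updates: Q(j) <- (1 - alpha) Q(j) + alpha R(n_j, c_j).
        for i, j in enumerate(choices):
            r = self.reward(counts[j], self.capacities[j])
            self.Q[i][j] = (1 - self.alpha) * self.Q[i][j] + self.alpha * r
        return counts

    def error(self, counts):
        # Equation (1): squared mismatch between capacity share and agent share.
        total_c = sum(self.capacities)
        return sum((c / total_c - n / self.n_agents) ** 2
                   for c, n in zip(self.capacities, counts))
```

For example, env = RRSP([1, 2, 4, 8, 16], n_agents=100) followed by repeated counts = env.step() calls, with env.error(counts) as the per-step Error, reproduces the flavor of the experiments reported in Section 5.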

2.2 Exploration Ratio Adaptation in Multi-agent and Non-stationary Environments

The exploration ratio ϵ of ϵ-greedy is the parameter that determines the balance between exploration and exploitation (Sutton and Barto, 1998). It affects the learning performance, including the performance of multi-agent reinforcement learning. It is well known that this ratio should be positive but should eventually converge to zero in a stationary environment. In non-stationary environments, however, it should not converge to zero: because the environment changes continuously over time, the ratio should remain positive so that the agents can follow the changes of the environment. Moreover, in multi-agent environments the agents influence each other, so the exploration ratio can also be regarded as a parameter that indicates the level of influence of an agent on the other agents. Obtaining a suitable value of this parameter is therefore important for improving the performance of multi-agent reinforcement learning, but doing so is a difficult yet essential issue. Several previous studies have proposed methods to adjust the exploration ratio, but these studies target only single-agent reinforcement learning and/or stationary environments (Tokic, 2010; Tokic and Palm, 2011; Auer et al., 2002; Hester et al., 2013; Singh et al., 2000).

3. Win or Update Exploration Ratio

In our previous study, we proposed the Win or Update Exploration ratio (WoUE) method (Okano and Noda, 2017), which adapts the exploration ratio ϵ of ϵ-greedy in multi-agent and non-stationary environments. In WoUE, an agent i that receives a lower reward than the average reward of all agents (the total average) updates its ϵ_i, with the objective of reaching equilibrium. The ratios are updated so as to minimize the variance of the distribution of all agents' rewards during a certain period, according to the following equation:

ϵ_i ← ϵ_i + η (µ_R - \bar{R}_i)(\bar{R}_{ϵ_i} - \bar{R}_{(1-ϵ_i)})   (if µ_R > \bar{R}_i)
ϵ_i ← ϵ_i                                                           (if µ_R ≤ \bar{R}_i),   (2)

where µ_R is the average reward over all agents, ϵ_i is the exploration ratio of agent i during a certain period, and \bar{R}_i is the average reward that agent i receives under the exploration ratio ϵ_i during that period. \bar{R}_{(1-ϵ_i)} is the average reward derived from exploitation, and \bar{R}_{ϵ_i} is the average reward derived from exploration. The update amount is based on the following relation:

(µ_R - \bar{R}_i)(\bar{R}_{ϵ_i} - \bar{R}_{(1-ϵ_i)}) ∝ -∂σ_R^2/∂ϵ_i.   (3)

Therefore, agent i updates its exploration ratio ϵ_i so as to minimize the variance σ_R^2 of the distribution of all agents' rewards; in other words, agent i updates ϵ_i to reach equilibrium. Because only the agents whose rewards are lower than the total average update their ratios, the method can be expected to avoid falling into a sub-optimal equilibrium in which the average reward is low. The derivation of Equation (3) is given in Appendix A. Since each agent requires only µ_R, the method can update ϵ so as to minimize σ_R^2 by merely broadcasting µ_R to all agents, and it is therefore easy to apply even when there are many agents in the environment. In this study, we use a variant of WoUE (Okano and Noda, 2017) that updates ϵ at every step in order to increase the number of update opportunities; to this end, the quantities µ_R, \bar{R}_i, \bar{R}_{(1-ϵ_i)}, and \bar{R}_{ϵ_i} are estimated with exponential moving averages. The WoUE algorithm is shown in Appendix B.

4. Investigation of WoUE's Parameter η

WoUE requires the parameter η of Equation (2), which scales the update of ϵ, to be set in advance. We investigated the effect of η on the Error (i.e., the system performance) experimentally. The results indicate that there is an optimal η: a large η lets WoUE reach a reasonable performance quickly, whereas a small η achieves a lower Error when sufficient update time is available.

We examine the change of the Error for each η in the non-stationary RRSP. Figure 2 shows the Error of WoUE for each η; each line represents a different step number, i.e., the time available for updating the exploration ratio ϵ. The Error curves of WoUE over η are convex downward, which confirms that there is an optimal η that minimizes the Error. When both η and the step number are small, the system performance is poor. Figure 3 shows the change of ϵ under WoUE with a small (left) and a large (right) value of η; it confirms that, with a small η, ϵ is updated quite slowly, so the poor performance is caused by the insufficient amount of time available in this experiment for updating ϵ.

Figure 2: Error (i.e., the system performance) of WoUE for each η in the non-stationary RRSP. Lines correspond to nb_step = 10000, 25000, 50000, 75000, and 100000; η ranges over powers of two from 2^{-6} to 2^{6}.

Figure 3: Change of ϵ under WoUE with η = 2^{-6} (= 0.015625, left) and η = 2^{6} (= 64, right). The upper panels show the average ϵ over all agents, and the lower panels show ϵ for two sampled agents (agent0 and agent1).

If there is a sufficient amount of update time, WoUE with a small η performs well. With a large η, WoUE reduces the Error quickly, but the Error stays above the minimum value: Figure 3 indicates that ϵ then changes rapidly, jumping between the minimum and the maximum ϵ, so that both the exploration noise and the Error increase. Thus, if there is a sufficient amount of time to adapt ϵ, η should be set to a small value; when the available time is insufficient, η should be set to a large value.

5. Comparison with Related Work

We compared WoUE with other methods in a non-stationary RRSP scenario. Based on these comparisons, we confirmed that WoUE achieves a better performance than the other methods in non-stationary environments, regardless of the value of η.

5.1 Comparative Methods

As comparative methods, we chose the normal ϵ-greedy method and the Value-Difference Based Exploration (VDBE) method (Tokic, 2010). VDBE modifies the ϵ of ϵ-greedy based on temporal-difference errors. Although VDBE was proposed as an adaptation method for ϵ in single-agent environments, it is easily applied to multi-agent environments by letting each agent adapt its own ϵ. VDBE has two parameters, σ and δ; following (Tokic, 2010), we set δ = 1/M. In this experiment, we use a variety of σ values, and we also evaluate the normal ϵ-greedy method, which does not modify ϵ.
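As a reference point, the following is a rough sketch of a per-agent VDBE-style update, based on our reading of Tokic (2010); the exact form of the weighting function should be checked against that paper, and the function name and signature here are ours. Each agent keeps its own ϵ_i and adapts it from the magnitude of its own value change.

```python
import math

def vdbe_update(eps, q_old, q_new, sigma, delta):
    """One VDBE-style update of an agent's epsilon after a Q-value change.

    Sketch after Tokic (2010), under our reading: a Boltzmann-like function of
    the absolute value difference |q_new - q_old| (scaled by sigma) is mixed
    into epsilon with weight delta (delta = 1 / M in this paper's experiments).
    """
    x = math.exp(-abs(q_new - q_old) / sigma)
    f = (1.0 - x) / (1.0 + x)   # in [0, 1): larger value changes push toward more exploration
    return delta * f + (1.0 - delta) * eps
```

For instance, after the Q update of Section 2.1, an agent could call eps_i = vdbe_update(eps_i, q_before, q_after, sigma=0.1, delta=1.0 / M).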

5.2 Experiment

We evaluate the methods in a non-stationary RRSP with N = 100 agents, M = 5 resources, the reward function R(n_j, c_j) = c_j / (1 + n_j), and the original resource capacities C = {1, 2, 4, 8, 16}. At every step, with probability 0.001, each resource's capacity doubles if it is at its original value, or is restored to the original value if it has already been doubled. The parameters of each method are given in Appendix C. The learning ratio of all agents is α = 0.3. In these experiments, all agents act randomly for 5% of the steps so that all agents learn the environment.

Method and parameter    Average Error (SD)    Average ϵ (SD)
ϵ = 0.01                0.104 (0.035)         0.010 (0.000)
ϵ = 0.1                 0.114 (0.019)         0.100 (0.000)
VDBE σ = 0.001          0.569 (0.069)         0.788 (0.032)
VDBE σ = 0.01           0.160 (0.024)         0.246 (0.058)
VDBE σ = 0.1            0.156 (0.058)         0.001 (0.003)
VDBE σ = 1              0.233 (0.079)         0.000 (0.000)
WoUE η = 0.015625       0.109 (0.021)         0.090 (0.030)
WoUE η = 1              0.095 (0.023)         0.029 (0.009)
WoUE η = 64             0.096 (0.023)         0.046 (0.028)

Table 1: Average Error and average ϵ of all agents for each method in the non-stationary RRSP. SD is the standard deviation of the Error and of ϵ.

Figure 4: Changes in the average ϵ of all agents for each method in the non-stationary RRSP (nb_agent = 100, α = 0.3; x-axis: step, y-axis: average ϵ). VDBE with σ = 0.001 is outside the plotted range because its ϵ is very high.

Table 1 shows the Error (i.e., the system performance) and the average ϵ over all agents for each adaptation method in the latter half of all steps. The Error of VDBE varies with σ and is worse than that of the other methods; in particular, the lowest Error of VDBE is higher than the highest Error of WoUE. The results indicate that WoUE performs better than the other methods: even though the performance of VDBE varies with its parameter, the performance of WoUE is favorable regardless of its parameter.
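A minimal sketch of the capacity-switching rule described at the beginning of this subsection, suitable as the change_capacities hook of the RRSP sketch shown after Section 2.1; the function names and the factory structure are ours.

```python
import random

def make_capacity_switcher(original_caps, p=0.001):
    """Return a hook that, at each step and independently per resource,
    doubles a capacity that is at its original value, or restores an
    already-doubled capacity, each with probability p (0.001 in Section 5.2)."""
    def change_capacities(caps):
        new_caps = list(caps)
        for j, c0 in enumerate(original_caps):
            if random.random() < p:
                new_caps[j] = c0 if new_caps[j] != c0 else 2 * c0
        return new_caps
    return change_capacities
```

For the setting above, change = make_capacity_switcher([1, 2, 4, 8, 16]) and env.step(change_capacities=change) would reproduce the intended non-stationary dynamics under these assumptions.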

Figure 4 shows the change of the average ϵ over all agents; the x-axis is the step number and the y-axis is the average ϵ over all agents. The ϵ adapted by VDBE varies strongly with the parameter σ, and this parameter sensitivity also affects the system performance. In contrast, even though the SD of the ϵ adapted by WoUE varies, its average is similar across the values of η. We confirmed the same tendency with other parameter settings (Appendix D). If the η of WoUE is small, the system performance is limited because ϵ changes only slowly; however, if ϵ continues to be updated, the system performance improves.

6. Conclusion and Future Work

In this study, we investigated the parameter η of WoUE, an adaptation method for the ϵ of ϵ-greedy, and compared the results with a related method in multi-agent and non-stationary environments. The parameter η determines the update amount of ϵ. We conducted several experiments to investigate its effect. Our results indicate that there is an optimal value of η. We confirmed that when η is small, WoUE achieves a better performance, provided the simulation period is long enough. When η is large, WoUE reaches a good performance more quickly, but the resulting performance is worse than the best achievable one because of the added exploration noise. We also evaluated WoUE against the related method VDBE with various parameters. The results indicate that WoUE performs more favorably than VDBE in multi-agent and non-stationary environments, regardless of the parameters used. The evaluation of WoUE in real-world applications and its theoretical analysis are topics for future work.

Acknowledgments

This presentation is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO) and by JSPS KAKENHI Grant Number 18H03301.

Appendix A.

We derive Equation (3). First, we express the average reward \bar{R}_i of agent i as

\bar{R}_i ≈ (1-ϵ_i) \bar{R}_{(1-ϵ_i)} + ϵ_i \bar{R}_{ϵ_i}.   (4)

Equation (4) can be derived from the following chain, where T is the number of steps in the period, T_{(1-ϵ_i)} is the number of exploitation steps and T_{ϵ_i} is the number of exploration steps of agent i, and r_{(1-ϵ_i)} and r_{ϵ_i} denote rewards obtained by exploitation and by exploration, respectively:

\bar{R}_i = (1/T) \sum_t r_t   (5)
          = (1/T) ( \sum r_{(1-ϵ_i)} + \sum r_{ϵ_i} )   (6)
          = (1/T) ( T_{(1-ϵ_i)} (1/T_{(1-ϵ_i)}) \sum r_{(1-ϵ_i)} + T_{ϵ_i} (1/T_{ϵ_i}) \sum r_{ϵ_i} )   (7)
          = (T_{(1-ϵ_i)}/T) (1/T_{(1-ϵ_i)}) \sum r_{(1-ϵ_i)} + (T_{ϵ_i}/T) (1/T_{ϵ_i}) \sum r_{ϵ_i}   (8)
          ≈ (1-ϵ_i) (1/T_{(1-ϵ_i)}) \sum r_{(1-ϵ_i)} + ϵ_i (1/T_{ϵ_i}) \sum r_{ϵ_i}   (9)
          = (1-ϵ_i) \bar{R}_{(1-ϵ_i)} + ϵ_i \bar{R}_{ϵ_i},   (10)

where we use T_{(1-ϵ_i)}/T ≈ 1-ϵ_i and T_{ϵ_i}/T ≈ ϵ_i.

Next, we derive Equation (3). The variance σ_R^2 is calculated as

σ_R^2 = (1/N) \sum_i (µ_R - \bar{R}_i)^2.   (11)

Treating µ_R as approximately independent of ϵ_i and using ∂\bar{R}_i/∂ϵ_i = \bar{R}_{ϵ_i} - \bar{R}_{(1-ϵ_i)} from Equation (4), we differentiate σ_R^2 with respect to ϵ_i:

∂σ_R^2/∂ϵ_i = (2/N) (µ_R - \bar{R}_i) ∂(µ_R - \bar{R}_i)/∂ϵ_i   (12)
            ≈ (2/N) (µ_R - \bar{R}_i) (\bar{R}_{(1-ϵ_i)} - \bar{R}_{ϵ_i})   (13)
            ∝ (µ_R - \bar{R}_i) (\bar{R}_{(1-ϵ_i)} - \bar{R}_{ϵ_i}),   (14)

and hence

-∂σ_R^2/∂ϵ_i ∝ (µ_R - \bar{R}_i) (\bar{R}_{ϵ_i} - \bar{R}_{(1-ϵ_i)}).   (15)

Therefore, we arrive at Equation (3):

(µ_R - \bar{R}_i) (\bar{R}_{ϵ_i} - \bar{R}_{(1-ϵ_i)}) ∝ -∂σ_R^2/∂ϵ_i.   (16)
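As a sanity check on this derivation, the following short script (ours, not from the paper) verifies numerically that, under the simplifying assumption that \bar{R}_{(1-ϵ_i)} and \bar{R}_{ϵ_i} do not depend on ϵ_i, the WoUE update direction of Equation (2) is proportional to the negative gradient of σ_R^2 with respect to ϵ_i; all reward values below are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
R_exploit = rng.uniform(0.2, 1.0, size=N)   # hypothetical per-agent exploitation rewards
R_explore = rng.uniform(0.0, 0.8, size=N)   # hypothetical per-agent exploration rewards
eps = rng.uniform(0.05, 0.5, size=N)        # current exploration ratios

def reward_variance(eps):
    R = (1 - eps) * R_exploit + eps * R_explore   # Eq. (4)
    return np.mean((R.mean() - R) ** 2)           # Eq. (11)

i, h = 3, 1e-6                                    # agent index and finite-difference step
e_hi, e_lo = eps.copy(), eps.copy()
e_hi[i] += h
e_lo[i] -= h
grad = (reward_variance(e_hi) - reward_variance(e_lo)) / (2 * h)

R = (1 - eps) * R_exploit + eps * R_explore
direction = (R.mean() - R[i]) * (R_explore[i] - R_exploit[i])  # left-hand side of Eq. (16)

# Eq. (16): the WoUE direction is proportional to the negative gradient
# (the constant of proportionality is 2/N under this simple model).
print(grad, -(2 / N) * direction)
```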

Appendix B.

Algorithm 1: Multi-agent reinforcement learning with WoUE

  Initialize N agents
  for step t = 0 to T do
      for each agent i do
          Select an action using ϵ-greedy
      end for
      for each agent i do
          Get reward r_i
          Update the average reward: \bar{R}_i ← α r_i + (1-α) \bar{R}_i
          Update Q_i(a)
      end for
      Compute the average reward of all agents: µ_{R,t} = (1/N) \sum_i r_i
      Update the average reward of all agents: µ_R ← α µ_{R,t} + (1-α) µ_R
      Broadcast µ_R to all agents
      for each agent i do
          if agent i exploited then
              Update the exploitation reward: \bar{R}_{(1-ϵ_i)} ← ϵ_i r_i + (1-ϵ_i) \bar{R}_{(1-ϵ_i)}
          else if agent i explored then
              Update the exploration reward: \bar{R}_{ϵ_i} ← (1-ϵ_i) r_i + ϵ_i \bar{R}_{ϵ_i}
          end if
          if \bar{R}_i < µ_R then
              ϵ_i ← ϵ_i + η (µ_R - \bar{R}_i)(\bar{R}_{ϵ_i} - \bar{R}_{(1-ϵ_i)})
              ϵ_i ← checkMinMax(ϵ_i)
          end if
      end for
  end for

  function checkMinMax(ϵ)
      if ϵ < minimum ϵ then
          ϵ ← minimum ϵ
      else if maximum ϵ < ϵ then
          ϵ ← maximum ϵ
      end if
      return ϵ
  end function
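For readers who prefer code, the per-agent part of Algorithm 1 (the reward moving averages, the update of Equation (2), and checkMinMax) might be sketched as follows. The dataclass, function name, and defaults are ours; the ϵ bounds are the WoUE values listed in Appendix C, and µ_R is assumed to be maintained globally as in Algorithm 1 (an exponential moving average of the per-step average reward, broadcast to all agents).

```python
from dataclasses import dataclass

@dataclass
class WoUEState:
    eps: float = 0.5        # exploration ratio epsilon_i
    R_bar: float = 0.0      # moving average of the agent's own reward
    R_exploit: float = 0.0  # moving average of rewards obtained when exploiting
    R_explore: float = 0.0  # moving average of rewards obtained when exploring

def woue_step(st: WoUEState, r: float, exploited: bool, mu_R: float,
              eta: float, alpha: float,
              eps_min: float = 0.01, eps_max: float = 0.9) -> None:
    """One per-agent WoUE update, mirroring the inner loops of Algorithm 1."""
    st.R_bar = alpha * r + (1 - alpha) * st.R_bar
    if exploited:
        st.R_exploit = st.eps * r + (1 - st.eps) * st.R_exploit
    else:
        st.R_explore = (1 - st.eps) * r + st.eps * st.R_explore
    if st.R_bar < mu_R:  # only agents below the broadcast average update epsilon (Eq. (2))
        st.eps += eta * (mu_R - st.R_bar) * (st.R_explore - st.R_exploit)
        st.eps = min(max(st.eps, eps_min), eps_max)   # checkMinMax
```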

Appendix C.

Name    Parameter details
WoUE    η = {0.0125, 1, 245}, minimum ϵ = 0.01, maximum ϵ = 0.9
VDBE    σ = {0.001, 0.01, 0.1, 1}, δ = 1/M

Table 2: Adaptation methods for ϵ and their parameters.

Appendix D.

Here, we evaluate the adaptation methods with other settings: N = 300 agents, M = 8 resources, and the original resource capacities C = {3, 6, 9, 12, 15, 18, 21, 24}. The learning ratio of all agents is α = 0.1. The results are shown in Table 3 and Figure 5; we confirmed that they are similar to the results in Section 5.

Method and parameter    Average Error (SD)    Average ϵ (SD)
ϵ = 0.01                0.112 (0.057)         0.010 (0.000)
ϵ = 0.1                 0.091 (0.034)         0.100 (0.000)
VDBE σ = 0.001          0.126 (0.025)         0.280 (0.056)
VDBE σ = 0.01           0.170 (0.067)         0.004 (0.006)
VDBE σ = 0.1            0.247 (0.041)         0.000 (0.000)
VDBE σ = 1              0.300 (0.051)         0.000 (0.000)
WoUE η = 0.015625       0.104 (0.024)         0.207 (0.021)
WoUE η = 1              0.075 (0.030)         0.063 (0.016)
WoUE η = 64             0.071 (0.032)         0.077 (0.037)

Table 3: Average Error and average ϵ of all agents for each method in the non-stationary RRSP. SD is the standard deviation of the Error and of ϵ.

Figure 5: Changes in the average ϵ of all agents for each method in the non-stationary RRSP (nb_agent = 300, α = 0.1). VDBE with σ = 0.001 is outside the plotted range.
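The summary statistics in Tables 1 and 3 are, under our reading, means taken over the latter half of all steps; whether the reported SD is taken over steps or over repeated runs is not stated, and the short sketch below (with names of our own choosing) simply takes it over steps.

```python
import numpy as np

def summarize_run(errors, mean_epsilons):
    """Mean and SD of the per-step Error (Eq. (1)) and of the per-step average
    epsilon over all agents, computed over the latter half of the run."""
    e = np.asarray(errors)[len(errors) // 2:]
    m = np.asarray(mean_epsilons)[len(mean_epsilons) // 2:]
    return (e.mean(), e.std()), (m.mean(), m.std())
```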

References

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

Todd Hester, Manuel Lopes, and Peter Stone. Learning exploration strategies in model-based reinforcement learning. In The Twelfth International Conference on Autonomous Agents and Multiagent Systems (AAMAS), May 2013.

Itsuki Noda and Masayuki Ohta. Meta-level control of multiagent learning in dynamic repeated resource sharing problems. In Tu-Bao Ho and Zhi-Hua Zhou, editors, Proc. of PRICAI 2008, pages 296-308. Springer, Dec. 2008.

Takuya Okano and Itsuki Noda. Adaptation method of the exploration ratio based on the orientation of equilibrium in multi-agent reinforcement learning under non-stationary environments. Journal of Advanced Computational Intelligence and Intelligent Informatics, 21(5):939-947, Sep. 2017.

Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287-308, Mar. 2000.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

Michel Tokic. Adaptive ϵ-greedy exploration in reinforcement learning based on value differences. In Proceedings of the 33rd Annual German Conference on Advances in Artificial Intelligence, KI'10, pages 203-210, Berlin, Heidelberg, 2010.

Michel Tokic and Günther Palm. Value-difference based exploration: Adaptive control between epsilon-greedy and softmax. In KI 2011: Advances in Artificial Intelligence, volume 7006 of Lecture Notes in Computer Science, pages 335-346. Springer, 2011.