How to Minimize Maximum Regret in Repeated Decision-Making


Karl H. Schlag

July

Economics Department, European University Institute, Via della Piazzuola 43, 50133 Florence, Italy, schlag@iue.it

Abstract

Consider repeated decision making in a stationary noisy environment given a finite set of actions in each round. Payoffs belong to a known bounded interval. A rule or strategy attains minimax regret if it minimizes over all rules the maximum over all payoff distributions of the difference between achievable and achieved discounted expected payoffs. Linear rules that attain minimax regret are shown to exist and are optimal for a Bayesian decision-maker endowed with the prior where learning is most difficult. Minimax regret behavior for choosing between two actions given small or intermediate discount factors is derived and only requires two rounds of memory.

JEL classification: D81, D83.

Keywords: Two-armed bandit, Bernoulli, bounded rationality, minimax regret, limited memory.

1 Introduction

Decision-making is an elementary part of human behavior. It is the foundation of any model of strategic interaction. The theory of decision making thus influences directly or indirectly almost any economic prediction. Rational decision making as we call it today (von Neumann-Morgenstern, 1944, Savage, 1972) proceeds as follows. The decision-maker first specifies a prior probability distribution over the set of states that may occur. Then he selects the action that maximizes expected utility and updates his initial prior after any new information arrives. We will refer to a decision-maker as Bayesian if he behaves according to this procedure, as probability updating follows Bayes rule. The underlying behavioral rule will be called Bayesian optimal.

Rational decision-making has been criticized from the beginning. In particular it has been questioned whether individuals are able to form priors and whether they have the ability and time to perform the necessary calculations when making their choices and updating their prior. These objections are particularly relevant when stakes are low, time is scarce and priors are diffuse (cf. Simon, 1982). We follow an alternative approach and investigate the behavior of a decision-maker who makes choices that attain minimax regret (Wald, 1950, Robbins, 1952, cf. Savage, 1951). This is a distribution-free approach in the sense that priors for the specific decision problem are not specified by the decision-maker. The advantage of a distribution-free approach is that the decision-maker does not have to determine a new prior and compute a new behavior each time he faces a similar decision problem. Instead he behaves the same at each first encounter and adapts behavior over time through experience and learning to the specific environment. Numerous different rules are suggested in the literature to describe learning without priors. This paper adds to the few papers (e.g. Börgers et al., 2004, Schlag, 1998) that formally select among such rules.
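The Bayesian procedure just described (form a prior, maximize expected utility, update by Bayes rule) can be sketched in a few lines. The states, payoffs and likelihoods below are hypothetical illustrations, not objects from the paper:

```python
# A minimal sketch of the Bayesian procedure described above, with
# hypothetical numbers: two states, two actions, one Bayes update.

def bayes_update(prior, likelihood):
    """Posterior over states; likelihood[s] = P(observation | state s)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    total = sum(joint)
    return [j / total for j in joint]

def best_action(belief, payoff):
    """Index of an action maximizing expected payoff; payoff[a][s]."""
    expected = [sum(b * row[s] for s, b in enumerate(belief)) for row in payoff]
    return expected.index(max(expected))

prior = [0.5, 0.5]                           # a diffuse prior over two states
payoff = [[1.0, 0.0], [0.0, 1.0]]            # action 0 is best in state 0, action 1 in state 1
posterior = bayes_update(prior, [0.8, 0.2])  # an observation favouring state 0
print(posterior, best_action(posterior, payoff))
```

Under the worst case priors studied later in the paper, Bayesian optimal behavior is of exactly this form, with the belief updated after every observed payoff.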
The environment of this paper is the same as in the classic multi-armed bandit problem which can be described as follows. An individual must repeatedly choose from a finite set of actions or arms. Each choice yields a random payoff which is drawn from an action dependent distribution that is stationary and independent of previous choices or payoffs achieved. All payoffs are assumed to belong to the interval [0, 1]. The specification of a set of actions and

of a payoff distribution for each action will be called a decision problem. So the individual repeatedly and independently faces the same decision problem. Finally, the classic multi-armed bandit specification includes a prior which is a probability measure over the set of decision problems. In the alternative setting where payoffs only belong to {0, 1} we call the decision problem a Bernoulli decision problem. The payoff distribution underlying a choice of an action in a given decision problem the individual actually faces should not be confused with the prior distribution over the decision problems the individual might face.

A rule or strategy is a description of which action the individual chooses next given his previous observations. We distinguish between deterministic rules that do not involve randomizing between actions and (randomized) rules that are probability measures over deterministic rules. We assume that the individual is risk neutral and that future payoffs are discounted with a given discount factor δ where δ ∈ (0, 1). An action that maximizes expected payoffs in a given decision problem is referred to as a best action. For a given rule and a given decision problem, regret is defined as the difference between the maximal discounted expected payoff obtainable (i.e. the payoff to choosing a best action forever) and the discounted expected payoff achieved by this rule in this decision problem. Regret is strictly positive whenever the decision maker is a priori uncertain (or ignorant) of which action is best. This results from the fact that regret is never negative, that regret can be interpreted as the discounted sum of regret per round and that zero regret will not be attained in round one if the decision-maker is uncertain about which action is best.

Before introducing our methodology in more detail it is useful to explain how selecting behavior according to maxmin (Wald, 1950, Gilboa and Schmeidler, 1989) fails in our setting. We allow for any prior over decision problems that yield payoffs in [0, 1].
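The regret notion just defined can be made concrete with a toy computation. The rule, payoff means and discount factor below are hypothetical illustrations, not the paper's objects:

```python
# Discounted regret, normalized by (1 - delta), of the trivial rule
# "always choose action 0" in a decision problem with payoff means mu.

def regret_always_first(mu, delta, horizon=10_000):
    best = max(mu)                     # per-round payoff of a best action
    per_round_loss = best - mu[0]      # per-round loss from ignorance
    return (1 - delta) * sum(
        delta ** (n - 1) * per_round_loss for n in range(1, horizon + 1)
    )

# With means (0.3, 0.7) the per-round loss is 0.4 in every round, so the
# normalized discounted regret is 0.4 (up to truncation of the horizon).
print(round(regret_always_first((0.3, 0.7), delta=0.9), 6))
```

A rule that happens to start on a best action and never leaves it has zero regret; any rule that ignores learning entirely, as here, pays the full per-round loss forever.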
For any given rule expected payoffs are minimized when each action yields the payoff 0 for sure. So all rules yield the same minimal expected payoff and hence a maxmin decision maker should be indifferent among all rules. There is little value to learning about the returns to the different actions if all actions yield similar expected payoffs. We will ignore such decision problems and focus on an individual who wishes to perform well when there is an incentive to learn. Performance of a rule will be measured by the maximal regret it achieves over the set of all decision problems. Accordingly we search for rules that attain minimax regret which means that the decision maker minimizes over

all rules the maximum regret over all decision problems of this rule. In other words, minimax regret behavior minimizes the maximum loss due to ignorance of the true state of affairs.

In our main characterization of minimax regret behavior we extend results obtained by Berry and Fristedt (1985) for Bernoulli two-armed bandits to our setting in which payoffs belong to [0, 1] and where more than two actions are allowed. Accordingly, a rule attains minimax regret if and only if it is an equilibrium strategy of the decision-maker in the zero sum game with nature where the decision-maker minimizes, and nature maximizes, regret. A rule that attains minimax regret is shown to exist within the set of Bernoulli equivalent rules that are symmetric. A Bernoulli equivalent rule is a rule that is linear in payoffs and that behaves in any decision problem as in the Bernoulli decision problem in which actions receive the same expected payoff as in the original decision problem. A rule (or a prior) is called symmetric if its description does not depend on how the actions are labelled. As in Berry and Fristedt (1985) we are able to show the relationship between minimax regret and Bayesian decision-making. Any minimax regret rule is Bayesian optimal under a so-called worst case prior that has the interpretation that it is the environment in which learning is most difficult for a Bayesian decision maker. More formally, a worst case prior maximizes over all priors the regret of a Bayesian decision-maker and is an equilibrium strategy of nature in the fictitious zero sum game mentioned above. Confirming our intuition we find that worst case priors exist within the set of symmetric priors that put weight on Bernoulli decision problems only.

In the rest of the paper we investigate minimax regret behavior in more detail when there are two actions only. Special focus is on whether minimax regret can be attained by a rule with finite memory.
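The reduction to a zero sum game can be illustrated with a deliberately tiny example. The 2x2 regret matrix below is hypothetical, not derived from the bandit model; it only shows how the minimax rule hedges between nature's choices, as the equilibrium characterization suggests:

```python
# Toy zero sum game: R[i][j] is the regret of deterministic rule i in
# decision problem j. Nature maximizes regret, the decision-maker minimizes.

R = [[0.0, 1.0],
     [1.0, 0.0]]

def max_regret(p):
    """Worst-case expected regret of the mixed rule (p, 1 - p)."""
    return max(p * R[0][j] + (1 - p) * R[1][j] for j in (0, 1))

# Grid search over mixtures: the minimax mixture is the equalizer p = 1/2,
# while each deterministic rule (p = 0 or p = 1) has maximal regret 1.
best_p = min((i / 1000 for i in range(1001)), key=max_regret)
print(best_p, max_regret(best_p), max_regret(0.0))
```

The example also previews a theme of Section 4: against a symmetric worst case, deterministic behavior is badly exploitable, so minimax regret rules randomize.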
Specifically, a rule has n round memory for some natural number n if the next choice only depends on choices or payoffs obtained in the previous n rounds. The minimal size of memory needed to describe a rule is a candidate measure of a rule's complexity. Our results build on understanding under which circumstances worst case priors are simple, where simple refers here to the fact that their support only contains two Bernoulli decision

[Footnote 1] See French (1986) for a discussion of minimax regret along with alternative distribution free measures of behavior. Other studies on minimax regret include Chamberlain (2000) and, in terms of relative regret, Neeman (200).

problems. Let Q_0 be the symmetric prior that puts equal weight on the two deterministic (two-action) decision problems in which one action yields payoff 1 and the other payoff 0. When minimax regret can be attained with a rule with finite memory then Q_0 is the only candidate for a symmetric worst case prior that is simple in the above sense. This is proven using results by Kakigi (1983) and Samaranayake (1992) on Bayesian optimal decision-making under simple priors. Furthermore we show that Q_0 can only be a worst case prior when δ ≤ 0.62, where the proofs rely on investigating Taylor expansions of the regret of a Bernoulli equivalent rule near Q_0. This means for δ > 0.62 that either the worst case prior is not simple or minimax regret cannot be attained by a rule with finite memory. Further results below complete the picture as they show that Q_0 is in fact a worst case prior for all δ ≤ 0.62. It is intuitive that Q_0 is a worst case prior when δ is sufficiently small as Q_0 maximizes the minimum regret in the first round. As an aside we obtain that minimax regret behavior is never deterministic as long as δ ≤ 0.62.

There is an obvious candidate for a simple symmetric Bernoulli equivalent rule that attains minimax regret when Q_0 is a worst case prior, as such a rule must be Bayesian optimal against Q_0. It is the single round memory rule that specifies in the first round to choose each action equally likely and in any later round to repeat the previous action with probability equal to the payoff obtained in the previous round. Our calculations show that this rule attains minimax regret if and only if δ ≤ 0.4. We also find that there is no single round memory rule that attains minimax regret when δ > 0.4. Typical for Bernoulli equivalent rules, the single round memory rule selected above prescribes random behavior whenever a payoff is realized in (0, 1). We find that there exists a single round memory rule that is deterministic apart from the choice in the first round if and only if δ ≤ 1/3.
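For the single round memory rule just described, expected behavior in a Bernoulli decision problem with means (mu_a, mu_b) follows a two-state Markov chain, so its discounted regret has a closed form. The computation below is my own sketch of that observation, not the paper's proof:

```python
# Regret of the single round memory rule: first round 50/50, afterwards
# repeat the previous action with probability equal to the last payoff.
# p_n = P(choose a in round n) satisfies p_{n+1} = alpha*p_n + (1 - mu_b)
# with alpha = mu_a + mu_b - 1, a linear recursion summed in closed form.

def regret_single_round(mu_a, mu_b, delta):
    alpha = mu_a + mu_b - 1              # recursion coefficient (here != 1)
    p_star = (1 - mu_b) / (1 - alpha)    # fixed point of the recursion
    # sum_{n>=1} delta^(n-1) p_n with p_n = p* + (1/2 - p*) * alpha^(n-1)
    s = p_star / (1 - delta) + (0.5 - p_star) / (1 - alpha * delta)
    value = (1 - delta) * (mu_b / (1 - delta) + (mu_a - mu_b) * s)
    return max(mu_a, mu_b) - value

# Against Q0 (one action always pays 1, the other always 0) this gives
# regret (1 - delta)/2: the entire loss comes from the 50/50 first round.
print(regret_single_round(1.0, 0.0, 0.3))
```

The degenerate case mu_a = mu_b = 1 (where alpha = 1) is omitted; there every rule has zero regret anyway.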
A rule with this property for all δ ≤ 1/3 specifies to choose each action equally likely in the first round and in later rounds to repeat the previous action if and only if the payoff obtained in the previous round was greater than 1/3.

We then present a symmetric Bernoulli equivalent two round memory rule that attains minimax regret if and only if δ ≤ 0.62. Should payoffs only be realized in {0, 1} then this rule specifies to choose each action equally likely in the first round, to choose the same action again in the next two rounds whenever receiving payoff 1 and to switch actions otherwise. Behavior after receiving interior payoffs is more intricate and essentially involves a four state stochastic

automaton. We also show that there is no two round memory rule that attains minimax regret for δ larger than 0.62 if it attains minimax regret for all δ ≤ 0.62. Finally we investigate for which values of δ between 0.4 and 0.62 minimax regret can be attained with two round memory of actions but only a single round memory of previous payoffs - rules we call two round action memory rules. We find that minimax regret is attainable with such a rule if and only if δ ≤ 0.4. The rule presented with this property is Bernoulli equivalent, symmetric, and specifies to choose the same action again after receiving payoff 1 and to sometimes choose it again after receiving payoff 0 if the same action has been chosen twice in a row.

This is the first paper in which minimax regret behavior has been explicitly derived for two-armed bandits. Partial results existed previously only for the scenario in which all payoffs are contained in {0, 1}. Berry and Fristedt (1985) provide upper and lower bounds on minimax regret when δ is close to 1. A series of papers in the statistics and in the machine learning literature present specific examples of rules to be used when the decision maker is infinitely patient, i.e. δ = 1 (e.g. Robbins, 1952, 1956, Samuels, 1968, Narendra and Thathachar, 1989). In particular, two rules suggested by Robbins (1952, 1956) coincide with the rules selected by us for small (δ ≤ 0.4) and for intermediate (δ ≤ 0.62) discount factors when payoffs are limited to {0, 1}.

The presentation of the material proceeds as follows. Section two introduces the basic setting. In Section three we supply the main characterization result of minimax regret behavior and worst case priors. In Section four we analyze separately rules that attain minimax regret among those with single round memory, two round memory and two round action memory.

2 Decision Problems, Rules and Selection

Let ΔY denote the set of probability measures over the set Y.
A multi-action decision problem (W, P) consists of a finite set of actions or arms W = {a_1, ..., a_|W|} with |W| ≥ 2 and for each action c ∈ W a measurable payoff distribution P_c ∈ Δ[0, 1].² Sometimes we will index parameters by the decision problem D they refer to, e.g. write P_c(D) instead of P_c. The set

[Footnote 2] Our results can be applied to payoff distributions over a known bounded interval [α, ω] by first rescaling payoffs into [0, 1] using the linear transformation x ↦ (x − α)/(ω − α).

of all multi-action decision problems will be denoted by D. A multi-armed bandit is described by a finite set of actions W and by a prior (or probability measure) Q ∈ ΔD over the set of multi-action decision problems with action set W. We add the term Bernoulli if realized payoffs only belong to {0, 1}. The set of all Bernoulli multi-action decision problems will be denoted by D_0. Payoffs 0 and 1 are sometimes referred to as failure and success respectively.³

Consider an individual who repeatedly faces the same multi-armed bandit (W, Q). In each of a sequence of rounds the individual is asked to choose an action from W. Before the first round nature selects the multi-action decision problem (W, P̃) the individual will be facing according to the prior Q. Choice of action c in round t yields a payoff realized according to P̃_c that is drawn independently of previous choices and payoff realizations.

A rule (or strategy) is the formal description of how the individual makes his choice as a function of his previous experience. A deterministic rule is a mapping f : {∅} ∪ [∪_{m≥1} ×_{k=1}^m (W × [0, 1])] → W where f(∅) is the action chosen in the first round and f(a_1, x_1, ..., a_m, x_m) is the action chosen in round m+1 after choosing action a_k and receiving payoff x_k in round k for k = 1, ..., m. The set of deterministic rules will be denoted by F. A (randomized) (behavioral) rule is a probability measure over the set of deterministic rules and hence an element of ΔF. We identify c ∈ W with the probability distribution in ΔW that selects c with probability one so that F ⊂ ΔF. We will also write σ(∅)_c as the probability of choosing action c in the first round and σ(a_1, x_1, ..., a_m, x_m)_c as the probability of choosing action c in round m+1 after the history (a_1, x_1, ..., a_m, x_m). Notice that these probabilities need not be independent across rounds. Assume throughout that the individual decision-maker is risk neutral and discounts future payoffs with a given discount factor δ ∈ (0, 1).⁴
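The formal objects above - histories (a_1, x_1, ..., a_m, x_m) and deterministic rules mapping them to actions - translate directly into code. The concrete rule below (a cutoff at 0.5) is a hypothetical example, not one selected by the paper:

```python
# A deterministic rule f: the empty history fixes the first-round choice,
# any other history is mapped to the next action.

def f(history):
    """history: tuple of (action, payoff) pairs; actions are 'a' or 'b'."""
    if not history:
        return 'a'                      # f(empty): first-round choice
    last_action, last_payoff = history[-1]
    if last_payoff >= 0.5:              # hypothetical cutoff, not the paper's
        return last_action              # stay after a good payoff
    return 'b' if last_action == 'a' else 'a'   # otherwise switch

print(f(()), f((('a', 1.0),)), f((('a', 0.0), ('b', 0.2))))
```

A randomized rule is then a probability measure over such maps; the single round memory rules of Section 4 are exactly those whose output depends only on the last (action, payoff) pair.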
For a given rule σ and a given decision problem D let p_c^(n) = p_c^(n)(σ, D) be the probability of choosing action c ∈ W in round n unconditional on previous choices. Let μ_c(D) = ∫ x dP_c(x, D) denote the expected payoff of choosing action c when facing the multi-action decision problem (W, D). Then π(σ, D) := (1 − δ) Σ_{n≥1} δ^(n−1) Σ_{c∈W} p_c^(n)(σ, D) μ_c(D) is the discounted value of future payoffs. The regret

[Footnote 3] The machine learning literature (cf. Narendra and Thathachar, 1989) refers to the Bernoulli case as the P-model and to our setting with payoffs in [0, 1] as the S-model. In the Q-model the support of the payoff distribution is finite.

[Footnote 4] Our analysis also applies to agents that are not risk neutral by replacing each payoff x with a von Neumann-Morgenstern utility u(x) where u(0) = 0 and u(1) = 1.

(or opportunity loss) of a rule σ when facing the multi-action decision problem D is defined as L_σ(D) := max_{c∈W} {μ_c(D)} − π(σ, D). Regret is a measure of the loss due to ignorance of the true state of affairs where the state of affairs is identified with a decision problem. Elements of arg max_{c∈W} {μ_c(D)} will sometimes be referred to as best actions.

A Bayesian decision-maker is an individual who chooses a rule σ̂ ∈ arg max_σ ∫ π(σ, D) dQ̃(D). His choice σ̂ = σ̂(Q̃) is called a Bayesian optimal rule under Q̃. We will call Q* a worst case prior if it maximizes the expected regret of a Bayesian decision-maker over all priors, i.e. if Q* ∈ arg max_{Q∈ΔD} ∫ L_{σ̂(Q)}(D) dQ(D). Simplifying we obtain that Q* is a worst case prior if and only if Q* ∈ arg max_{Q∈ΔD} min_{σ∈ΔF} ∫ L_σ(D) dQ(D).

If the prior Q̃ is unknown (while W is known) then according to Savage (1972) the individual specifies a subjective prior Q̂ and chooses a Bayesian optimal rule under Q̂. We follow an alternative approach (Wald, 1950, Gibbons, 1992) that is distribution-free as the individual does not invoke a specific prior to select a rule. We assume that the individual selects a rule that minimizes among all rules the maximal regret among all decision problems (W, D). More specifically, we say that σ* attains minimax regret if σ* ∈ arg min_{σ∈ΔF} sup_{D∈D} L_σ(D).

3 A General Characterization

Some definitions are needed before we present our characterization of minimax regret behavior and worst case priors.

3.1 Symmetry

As the various actions belonging to W cannot be distinguished (apart from their labels), symmetry will play an important role in our investigation. Given D ∈ D and a permutation ρ of the elements of W let D^ρ ∈ D be the multi-action decision problem defined by permuting the labels of the actions in D using ρ such that P_c(D^ρ) = P_{ρ(c)}(D) for c ∈ W. For a given multi-armed bandit (W, Q) with Q ∈ ΔD let Q^ρ be the distribution defined by exchanging each decision problem D in the support of Q by D^ρ. A prior

[Footnote 5] Notice how we thus differ from the approach of Savage (1951, cf.
French, 1986) that is based on a set of states, each being without uncertainty and where regret is considered in each state separately.

Q is called symmetric if Q = Q^ρ holds for any permutation ρ of the elements of W. The set of symmetric priors over a subset of ΔD will be indicated by the subscript p, e.g. Δ_p D_0 denotes the symmetric priors over Bernoulli decision problems. Given a deterministic rule f and a permutation ρ of the elements of W let f^ρ be the deterministic rule that is derived from f by permuting actions with ρ such that f^ρ(∅)_c = f(∅)_{ρ(c)} and f^ρ(a_1, x_1, ..., a_m, x_m)_c = f(ρ(a_1), x_1, ..., ρ(a_m), x_m)_{ρ(c)}. A randomized rule σ is called symmetric if σ(T) = σ({f^ρ s.t. f ∈ T}) holds for all permutations ρ of W and for all measurable sets of deterministic rules T. The set of symmetric randomized rules will be denoted by Δ_p F. Notice that if σ is symmetric then σ(∅)_c = 1/|W| for all c ∈ W.

3.2 Linearity and Bernoulli Equivalence

In our setting there are no restrictions on how the action prescribed by a given rule in a given round depends on previous payoffs obtained. We will find that rules that are simple in the sense that behavior is a linear function of previous payoffs will play an important role for attaining minimax regret behavior. More specifically, a subset of the linear rules called Bernoulli equivalent rules will play this important role. A rule σ is called linear if σ(a_1, x_1, ..., a_m, x_m)_c is linear in x_k for all k = 1, ..., m and all m, which means that

σ(a_1, x_1, ..., a_m, x_m)_c = Σ_{j_1=0}^1 ··· Σ_{j_m=0}^1 [Π_{k=1}^m (j_k x_k + (1 − j_k)(1 − x_k))] σ(a_1, j_1, ..., a_m, j_m)_c    (1)

holds for all m and for all a_i ∈ W and x_i ∈ [0, 1], i = 1, ..., m. The set of linear rules will be denoted by L ⊂ ΔF. A linear rule is called Bernoulli equivalent if in any decision problem it behaves as it does in the Bernoulli decision problem in which actions have the same expected payoff as in the original decision problem. More formally, given D ∈ D let D_0(D) ∈ D_0 be defined by the fact that μ_c(D) = μ_c(D_0(D)) holds for all c ∈ W. Then we require for all D ∈ D and for any sequence of actions a_1, ..., a_m that the probability that action a_i is chosen in round i for all i = 1, ..., m is

the same under D as it is under D_0(D). Formally,

∫ σ(∅)_{a_1} Π_{k=1}^{m−1} σ(a_1, x_1, ..., a_k, x_k)_{a_{k+1}} dP_{a_1}(x_1) ··· dP_{a_m}(x_m)
= Σ_{y_1=0}^1 ··· Σ_{y_m=0}^1 [Π_{j=1}^m (y_j μ_{a_j} + (1 − y_j)(1 − μ_{a_j}))] σ(∅)_{a_1} Π_{k=1}^{m−1} σ(a_1, y_1, ..., a_k, y_k)_{a_{k+1}}.

The set of Bernoulli equivalent rules within a set M ⊂ ΔF will be denoted by B ∩ M.

Next we illustrate why not all linear rules are Bernoulli equivalent. Consider a linear rule f. It is easily checked that f satisfies the conditions imposed on a Bernoulli equivalent rule in the first two rounds. However this is not necessarily true in round three. For instance, the probability of obtaining the sequence of actions a, b, b in the first three rounds equals

∫∫ f(∅)_a f(a, x)_b f(a, x, b, y)_b dP_a(x) dP_b(y) = f(∅)_a [μ_b ∫ f(a, x)_b f(a, x, b, 1)_b dP_a(x) + (1 − μ_b) ∫ f(a, x)_b f(a, x, b, 0)_b dP_a(x)].

If f prescribes to randomize independently in each round then

∫ f(a, x)_b f(a, x, b, 1)_b dP_a(x) = (μ_a f(a, 1)_b + (1 − μ_a) f(a, 0)_b)(μ_a f(a, 1, b, 1)_b + (1 − μ_a) f(a, 0, b, 1)_b).

On the other hand, if f is Bernoulli equivalent then

∫ f(a, x)_b f(a, x, b, 1)_b dP_a(x) = μ_a f(a, 1)_b f(a, 1, b, 1)_b + (1 − μ_a) f(a, 0)_b f(a, 0, b, 1)_b.

So if the linear rule f is Bernoulli equivalent and f(∅)_a > 0 then one of the following three statements is true: (i) f(a, 0)_b = f(a, 1)_b, (ii) f(a, 0, b, y)_b = f(a, 1, b, y)_b for all y ∈ [0, 1], or (iii) randomization under f in rounds two and three is not independent. So linearity can only coincide with Bernoulli equivalence if payoffs obtained more than one round ago have only a limited impact on present behavior (see Section 4.2 and below).

Finally we show how to extend a rule σ defined on the set of Bernoulli decision problems to a Bernoulli equivalent rule. We present two ways to generate the same behavior. (A) When a payoff, say x_i ∈ [0, 1], is obtained in round i then realize an independent random variable that yields 1 with probability x_i and 0 otherwise. Remember the realization x̃_i ∈ {0, 1} of this random variable and forget the payoff x_i itself.
Apply the rule σ in all later rounds as if x̃_i was the payoff received in round i. (B) Given a sequence z = (z_i)_{i≥1} with z_i ∈ [0, 1] for all i define the rule

σ_z by setting σ_z(∅) = σ(∅) and setting σ_z(a_1, x_1, ..., a_m, x_m) = σ(a_1, 1_{x_1 ≥ z_1}, ..., a_m, 1_{x_m ≥ z_m}) for any history (a_1, x_1, ..., a_m, x_m), where 1_{x_i ≥ z_i} is the indicator function that takes value 1 if x_i ≥ z_i and value 0 otherwise. The Bernoulli equivalent extension of the rule σ is then obtained by randomizing over the set of rules σ_z by choosing z_i iid from a uniform distribution on [0, 1] for all i.

Under the behavior defined in (A), each sequence of actions has the same probability of occurring in the decision problem D as it does in the Bernoulli decision problem D_0(D). Nonetheless, the construction in (A) does not formally define a rule as the memory of the decision-maker is changed, which is not allowed in our definition of a randomized rule. Our second alternative (B) leads to a formal definition of a rule. It is easily shown that the rule defined in (B) is behaviorally equivalent to the one described in (A) and hence that it is Bernoulli equivalent.

Remark. Linear rules, in particular Bernoulli equivalent rules, typically involve randomizing when receiving payoffs in (0, 1). More specifically, it is easily deduced from (1) for a linear rule f that either f(a_1, x_1, ..., a_n, x_n) is independent of x_1, ..., x_n or f(a_1, x_1, ..., a_n, x_n) ∉ W for all x_1, ..., x_n ∈ (0, 1). In contrast we show in the appendix that Bayesian optimal rules typically do not involve randomizing behavior.

3.3 The Result

The following characterization will be very useful as it reduces the search for a rule that attains minimax regret to the search for an equilibrium of a zero-sum game. At the same time it reveals a close connection between minimax regret behavior and Bayesian decision making.

Proposition 1 (i) There exists a worst case prior in Δ_p D_0 and a rule in B ∩ Δ_p F that attains minimax regret. The value of minimax regret is strictly positive.
(ii) σ* ∈ B ∩ Δ_p F attains minimax regret and Q* ∈ Δ_p D_0 is a worst case prior if and only if

∫ L_{σ*}(D) dQ(D) ≤ ∫ L_{σ*}(D) dQ*(D) ≤ ∫ L_σ(D) dQ*(D)   for all σ ∈ Δ_p F and all Q ∈ Δ_p D_0.

(iii) σ* ∈ ΔF attains minimax regret and Q* ∈ ΔD is a worst case prior if and only if

∫ L_{σ*}(D) dQ(D) ≤ ∫ L_{σ*}(D) dQ*(D) ≤ ∫ L_σ(D) dQ*(D)   for all σ ∈ ΔF and all Q ∈ ΔD.   (2)

In particular, any rule that attains minimax regret is Bayesian optimal under any worst case prior. The above generalizes findings that Berry and Fristedt (1985) have obtained for Bernoulli two-armed bandits.

Proof. We first review the results Berry and Fristedt (1985) obtained for Bernoulli two-armed bandits, which are statement (i) and the if statements of (ii) and (iii). They introduce a topology on the set of strategies and then show, for the zero sum game where the individual chooses a rule to minimize regret and nature chooses a prior to maximize regret, that a Nash equilibrium (σ*, Q*) exists. If (σ*, Q*) is such a Nash equilibrium (i.e. (2) holds when restricted to the case of |W| = 2 and Q ∈ ΔD_0) then

∫ L_{σ*}(D) dQ*(D) = max_{Q∈ΔD_0} ∫ L_{σ*}(D) dQ(D) ≥ min_{σ∈ΔF} max_{Q∈ΔD_0} ∫ L_σ(D) dQ(D) ≥ max_{Q∈ΔD_0} min_{σ∈ΔF} ∫ L_σ(D) dQ(D) ≥ min_{σ∈ΔF} ∫ L_σ(D) dQ*(D) = ∫ L_{σ*}(D) dQ*(D),

so equality holds throughout, which proves the if statement of (iii) for Bernoulli two-armed bandits. Berry and Fristedt (1985) also ensure the existence of a strictly positive lower bound on the value of minimax regret, so this completes (i) for Bernoulli two-armed bandits. Quasi-convexity of max_{Q∈ΔD_0} ∫ L_σ(D) dQ(D) as a function of σ shows that Δ_p F ∩ arg min_{σ∈ΔF} max_{Q∈ΔD_0} ∫ L_σ(D) dQ(D) ≠ ∅. Similarly, quasi-concavity of min_{σ∈ΔF} ∫ L_σ(D) dQ(D) as a function of Q is used to show that Δ_p D_0 ∩ arg max_{Q∈ΔD_0} min_{σ∈ΔF} ∫ L_σ(D) dQ(D) ≠ ∅. Finally, the if statement of (ii) follows from the fact that Δ_p D_0 ∩ arg max_{Q∈ΔD_0} ∫ L_{σ*}(D) dQ(D) ≠ ∅ if σ* ∈ Δ_p F and, similarly, Δ_p F ∩ arg min_{σ∈ΔF} ∫ L_σ(D) dQ*(D) ≠ ∅ if Q* ∈ Δ_p D_0.

The above can be generalized to Bernoulli multi-armed bandits immediately. In the following we will show that the above also holds when payoffs are not restricted to {0, 1}. Let (σ*, Q*) ∈ (B ∩ ΔF) × ΔD_0 be a Nash equilibrium of the zero-sum game when restricting attention to D_0. Since σ* is Bernoulli equivalent, max_{Q∈ΔD_0} ∫ L_{σ*}(D) dQ(D) = max_{Q∈ΔD} ∫ L_{σ*}(D) dQ(D), and Q* ∈ ΔD_0 implies that the minimum min_{σ∈ΔF} ∫ L_σ(D) dQ*(D) is unchanged when D_0 is replaced by D, and hence (2) holds.
Notice furthermore that the if statement of (iii) holds as stated by the same proof as when we considered only D_0. Part (i) and the if statement of (ii) then also follow as above.

Consider now the only if statements of (ii) and (iii). If σ* attains minimax regret and Q* is a worst case prior then

max_{Q∈ΔD} inf_{σ∈ΔF} ∫ L_σ(D) dQ(D) ≤ sup_{Q∈ΔD} ∫ L_{σ*}(D) dQ(D) = inf_{σ∈ΔF} ∫ L_σ(D) dQ*(D) = min_{σ∈ΔF} ∫ L_σ(D) dQ*(D) ≤ ∫ L_{σ*}(D) dQ*(D) ≤ sup_{Q∈ΔD} ∫ L_{σ*}(D) dQ(D),

so the claim follows as we know that min_{σ∈ΔF} sup_{Q∈ΔD_0} ∫ L_σ(D) dQ(D) = max_{Q∈ΔD_0} inf_{σ∈ΔF} ∫ L_σ(D) dQ(D) holds.

4 Two-Armed Bandits

In the following we investigate minimax regret behavior when there are two actions only. Let W = {a, b}. Special attention will focus on whether minimax regret behavior can be implemented with a rule that has finite memory. An important ingredient will be to understand Bayesian optimal behavior under specific very simple priors. We say that the rule σ has n round memory if σ(a_1, x_1, ..., a_m, x_m)_c is independent of (a_k, x_k) for k ≤ m − n. σ has finite memory if there exists n such that σ has n round memory. σ has n round action memory if σ has n round memory and if σ(a_1, x_1, ..., a_m, x_m)_c is independent of x_k for k ≤ m − 1. The amount of memory needed to implement a rule can be considered a measure of its complexity.

4.1 Necessary Conditions

Following Proposition 1, rules that attain minimax regret are Bayesian optimal under some prior over Bernoulli decision problems. Insights into Bayesian optimal behavior can thus teach us about minimax regret behavior. Unfortunately many results on Bayesian optimal decision making deal only with independent arms - we do not expect a worst case prior ever to have this property. We will use results on dependent arms due to Kakigi (1983) and Samaranayake (1992) who consider priors that put weight on two Bernoulli decision problems. Q ∈ Δ_p D_0 will be called a symmetric two point Bernoulli prior if it only has two elements in its support, formally if there exist v and w with 0 ≤ v < w ≤ 1 such that Q(D̃) = 1/2 for the D̃ ∈ D_0 with μ_a(D̃) = v and μ_b(D̃) = w (and hence also 1/2 on its permutation). We will write Q = Q(v, w) and also write

Q_0 instead of Q(0, 1). Kakigi (1983) derives a particular Bayesian optimal rule for such priors. Results found in Samaranayake (1992) can be used to show that a Bayesian optimal rule under such a prior will have the stay with a winner property. The rule σ is said to have the stay with a winner property if it specifies to choose the same action again after any success, i.e. if σ(a_1, x_1, ..., a_{m−1}, x_{m−1}, a_m, 1)_{a_m} = 1 for all a_k, x_k, k = 1, ..., m − 1, all a_m and all m.⁶

Proposition 2 Consider |W| = 2. If the finite memory rule σ attains minimax regret and Q(v, w) is a worst case prior (for instance when arg max_{D∈D_0: μ_a(D)>μ_b(D)} L_σ(D) is single valued) then Q(v, w) = Q_0.

Proof. First assume 0 < v < w < 1. Kakigi (1983) shows that the following symmetric rule is Bayesian optimal under such a prior Q(v, w). Choose action a in round one. Choose action a in round n if and only if the updated belief that a yields a higher expected payoff than b is at least 0.5. Notice that this rule cannot be implemented with a finite memory as 0 < v < w < 1. What we have to show in the following is that no Bayesian optimal rule will have finite memory. It is shown in Kakigi (Proof of Theorem 2, 1983) that the difference between the value of choosing a and the value of choosing b when continuing thereafter optimally is nondecreasing in the belief that action a yields a higher expected payoff than action b. Let r(s) denote this difference where s is the corresponding belief. Samaranayake (Example 2.2, 1992) shows that actions a and b are negatively correlated after any history. Since v < w, the support of the marginal distributions of choosing a (or of choosing b) has two elements. So the result in Proposition 1.2(b) in Samaranayake (1992) holds with strict inequalities. This means that a Bayesian decision maker strictly prefers action c over action d after action c yielded a success. Thus r(s⁺) > 0 if s⁺ is the updated belief after having belief 1/2 and receiving a success by choosing a.
Together with the fact that r is nondecreasing we obtain that r is strictly increasing in s. In other words, the rule from Kakigi (1983) described above prescribes the unique Bayesian optimal behavior whenever the belief does not equal 1/2. Hence no Bayesian optimal rule under Q(v, w) with 0 < v < w < 1 has finite memory.

[Footnote 6] Note that the result proven by Berry and Fristedt (1985) for independent arms is weaker as it only states that there always exists a Bayesian optimal rule with the stay with a winner property.
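The belief dynamics used in this proof can be sketched directly. Under Q(v, w) the belief s that action a is the better arm is updated by Bayes rule after each observed payoff; the numbers below are hypothetical, and the code is a sketch of the updating step, not the paper's derivation. A success on the chosen arm pushes s towards that arm, which is the stay with a winner property in belief form:

```python
# Bayes update of s = P(arm a has mean w) under the two point prior Q(v, w),
# after observing a success (payoff 1) or failure (payoff 0) on one arm.

def update(s, arm, success, v, w):
    """Posterior belief that arm 'a' is the better arm (mean w)."""
    m_if_a_better = w if arm == 'a' else v   # mean of `arm` when a is better
    m_if_b_better = v if arm == 'a' else w   # mean of `arm` when b is better
    la = m_if_a_better if success else 1 - m_if_a_better
    lb = m_if_b_better if success else 1 - m_if_b_better
    return s * la / (s * la + (1 - s) * lb)

# From the symmetric belief 1/2, one success on a moves the belief to
# w / (v + w) > 1/2, so a belief-threshold rule at 0.5 stays with a.
s1 = update(0.5, 'a', True, 0.2, 0.8)
print(s1)
```

Because the posterior depends on the full count of past successes and failures, no bounded window of history suffices to reproduce it when 0 < v < w < 1, which is the intuition behind the finite-memory impossibility above.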

Now assume v = 0 and w ∈ (0, 1). Consider a Bayesian optimal behavior σ under Q(0, w). As behavior when s = 1/2 does not matter we can assume that σ is symmetric and specifies to switch after any failure. Of course σ locks in on the same action whenever the first success is obtained. We now calculate the regret of σ when facing Q(0, w) for some w ∈ (0, 1). Let z be the future value after only failures obtained previously, and let g be the value of currently sitting on the better arm, so that g = (1 − δ)w + δ(w·w + (1 − w)δg) and z = (1 + δ)g/2. Then

z = w(1 + δ)(1 − δ + δw) / (2 − 2δ² + 2δ²w)

and hence

L_σ(Q(0, w)) = w − z = (1 − δ)(1 + δ − δw)w / (2 − 2δ² + 2δ²w).

L_σ(Q(0, w)) as a function of w attains its unique maximum when w = 1 and hence Q(0, w) is never a worst case prior if w < 1.

Finally, assume v > 0 and w = 1. Let σ be the symmetric Bayesian optimal rule under Q(v, 1) that switches after a failure and has the stay with a winner property. Consider regret under Q(v, 1) for some v ∈ (0, 1). Let x be the future value of payoffs after only achieving successes in the previous rounds with the worse action. We obtain

x = (1 − δ)v + δ(vx + (1 − v)), so x = (δ + v − 2δv) / (1 − δv)

and hence

L_σ(Q(v, 1)) = 1/2 − (1/2)(δ + v − 2δv)/(1 − δv) = (1 − δ)(1 − v) / (2(1 − δv)).

L_σ(Q(v, 1)) as a function of v attains its unique maximum when v = 0 and hence Q(v, 1) is never a worst case prior if v > 0. So Q_0 is the only candidate for a simple worst case prior.

Using Taylor expansions of regret we derive an upper bound on the set of discount factors under which Q_0 can be a worst case prior.

Proposition 3 Consider |W| = 2. Then Q_0 is not a worst case prior for δ > 0.62.

Proof. Consider a symmetric Bernoulli equivalent rule σ that attains minimax regret with Q_0 being a worst case prior. Since σ is symmetric, σ(∅)_a = 1/2. Since σ is a Bayesian optimal rule under Q_0, σ has the stay with a winner property and σ(c, 0)_c = 0 for c ∈ {a, b}. So all we have to check in order for σ to attain minimax regret is that Q_0 maximizes the regret of the rule. In the following we will derive necessary conditions such that Q_0 maximizes regret L_σ(Q) among priors Q contained in {Q(v, 1) : 0 ≤ v < 1} ∪ {Q(0, w) : 0 < w ≤ 1}.
In fact we will only be considering priors close to Q_0, which means that v is close to 0 and w is close to 1. Looking at first order effects only means that when facing Q(v, 1) we can ignore events in which the bad action yields two successes. Similarly, when facing Q(0, w) we can ignore events in which the best action yields two failures.
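Closed-form regret expressions of this kind are easy to check numerically. Below is a minimal sketch (function names are ours; it assumes Bernoulli arms with success probabilities 0 and w, the rule that switches after every failure and locks in after the first success, and normalized discounting):

```python
# Value of the rule that switches after every failure and locks in after the
# first success, facing arms with success probabilities (0, w), discount delta.
def rule_value(w, delta, iters=2000):
    vg = vb = 0.0                 # value when about to pull the good / bad arm
    for _ in range(iters):        # fixed-point iteration of the two Bellman equations
        vg, vb = (1 - delta) * w + delta * (w * w + (1 - w) * vb), delta * vg
    return 0.5 * (vg + vb)        # round one: each arm with probability 1/2

def regret(w, delta):             # regret against the best arm, which pays w
    return w - rule_value(w, delta)

def regret_closed_form(w, delta):
    return (1 - delta) * w * (1 + delta - delta * w) / (2 * (1 - delta ** 2 + delta ** 2 * w))
```

Maximizing the closed form over a grid of w also illustrates the role of the critical discount factor: for δ = 0.5 the regret is largest at w = 1, while for δ = 0.7 the maximizer is interior, in line with the first order condition 1 − δ − δ² ≥ 0.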

Below we alter the behavior of σ to obtain a rule σ′ that chooses if possible a best response to both Q(v, 1) and Q(0, w). σ′ retains the properties of σ that σ′ is a best response to Q_0 and that Q_0 maximizes regret under σ′. The latter follows since L_σ(Q_0) = L_σ′(Q_0) and L_σ(Q) ≥ L_σ′(Q): L_σ(Q_0) ≥ L_σ(Q) implies L_σ′(Q_0) ≥ L_σ′(Q). Let σ′ choose action a forever after observing a failure from action b and a success from action a in the first two rounds. Here σ′ chooses a best response to both Q(0, w) and to Q(v, 1). Let σ′ choose action a forever after observing (a, 1, a, 1) or (a, 1, a, 0, a, 1). As we are only interested in a first order approximation, we ignore the possibility that we could be facing (π_a, π_b) = (v, 1). Similarly, based on the first order approximation, σ′ chooses action a forever after observing two failures of b and one failure of a in the first three rounds. Let x = σ(c, 1, c, 0)_c and y = σ(d, 0, c, 0)_c for c ≠ d. Then Π_σ′ can be expanded to first order in 1 − w, where the terms refer, in the order of their appearance, to the payoffs in rounds one and two, the continuation payoffs starting in round three after the events (b, 0, a, 1) and (a, 1, a, 1), the round three payoffs after (a, 1, a, 0), (a, 0, b, 0) and (b, 0, a, 0), and the continuation payoffs starting in round four after (a, 1, a, 0, a, 1), (a, 1, a, 0, b, 0) and after (a, 0, b, 0, b, 0), (a, 0, b, 0, a, 1), (b, 0, a, 0, a, 1) and (b, 0, a, 0, b, 0). Consequently,

L_σ′(Q(0, w)) = ½(1 − δ) − ½(1 − δ)(1 − δ − ½(1 + x)δ²)(1 − w) + o((1 − w)²).

Since Q_0 maximizes L_σ′(Q) the coefficient of (1 − w) must be nonnegative. Evaluated at x = 1, the best response to Q(0, w) after the history (c, 1, c, 0), this reads 1 − δ − δ² ≥ 0, which implies δ ≤ (√5 − 1)/2. We combine Propositions 2 and 3 to obtain the following.

Corollary 4 Consider |W| = 2 and δ > (√5 − 1)/2. Then either there is no finite memory rule that attains minimax regret or arg max_{D ∈ D_0 : π_a(D) > π_b(D)} L_σ(D) is not single valued for any σ that attains minimax regret.

At this point of our analysis we have no evidence for which values (if any) of δ ≤ (√5 − 1)/2 Q_0 is a worst case prior. However, if Q_0 is a worst case prior at δ = (√5 − 1)/2 then the expansion technique used in the proof of Proposition 3 reveals properties of a minimax regret rule.

Lemma 5 Consider |W| = 2 and δ = (√5 − 1)/2. Assume that σ is a symmetric rule that attains minimax regret and assume that Q_0 is a worst case prior. Then σ(c, 0)_c = 0, σ(c, 1, c, 0)_c = 1, σ(c, 1, c, 1, c, 0)_c = 1, σ(c, 1, c, 0, c, 0)_c = 0, σ(c, 0, d, 1, d, 0)_d = 1, σ(c, 0, d, 0, c, 0)_c = 0 if σ(c, 0, d, 0)_d < 1, σ(c, 0, d, 0, d, 0)_d = 0 if σ(c, 0, d, 0)_d > 0, and in the first three rounds σ does not switch after a success. Consequently, neither any single round memory rule nor any n round action memory rule for any n attains minimax regret at this critical value of δ. Nor does one of the rules suggested by Robbins (1956) or Isbell (1959) for n > 2 have this property.

Proof. First we provide the analogous calculations as in the proof of Proposition 3 when facing (π_a, π_b) = (1, v). We calculate Π_σ′ where we do not explicitly calculate events in which two successes of the worse action occur. Π_σ′ can be expanded to first order in v, where the terms refer to the event (a, 1, a, 1, ...), the payoff in round one from choosing action b, and the events (b, 0, a, 1, a, 1, ...), (b, 1, b, 0, a, 1, a, 1, ...) and (b, 1, b, 0, b, 0, a, 1, a, 1, ...). Consequently L_σ′(Q(v, 1)) = ½(1 − δ) + ½(1 − δ)(1 − 2δ − ½xδ²)v + o(v²), and hence 2δ + ½xδ² ≥ 1 is necessary if Q_0 is a worst case prior. Looking a bit more carefully at the above calculations, as well as those in the proof of Proposition 3, it is easily verified that Q_0 is not a worst case prior if one of the conditions in the statement of the lemma does not hold. Finally we show that any rule that attains minimax regret when Q_0 is a worst case prior behaves in round one like a symmetric rule.

Proposition 6 Consider |W| = 2. If Q_0 is a worst case prior and σ attains minimax regret then σ(∅)_a = 1/2.

Proof. Consider a rule σ that attains minimax regret when Q_0 is a worst case prior. Then σ is Bayesian optimal under Q_0. Let D_c be the Bernoulli two-action decision problem with P_c(1) = P_d(0) = 1 where d ≠ c. Then L_σ(D_a) = (1 − δ)σ(∅)_b and L_σ(D_b) = (1 − δ)σ(∅)_a. Since Q_0 is a worst case we obtain L_σ(D_a) = L_σ(D_b) and hence σ(∅)_a = 1/2.

4.2 Sufficient conditions

Above we showed that Q_0 is the only candidate for a simple worst case prior. For any symmetric rule σ regret equals L_σ(D) = ½(1 − δ)|π_a − π_b| + (1 − δ)o(1), so we actually expect Q_0 (which maximizes |π_a − π_b|) to be a worst case prior for sufficiently small δ. Interestingly we find below that δ does not have to be that small for this to be true.

4.2.1 Single round memory

In the following we search for single round memory rules that attain minimax regret. Note that for single round memory rules there is no difference between linearity and Bernoulli equivalence.

Proposition 7 Consider |W| = 2. (i) The symmetric linear single round memory rule that has the stay with a winner property and that satisfies σ(a, 0)_a = 0 attains minimax regret if and only if δ ≤ √2 − 1 ≈ 0.41. This rule yields Π = ½(π_a + π_b) + δ(π_a − π_b)² / (2(1 + δ − δ(π_a + π_b))). (ii) For any δ ∈ (0, 1) there is no other symmetric linear single round memory rule that attains minimax regret. (iii) There is no single round memory rule that attains minimax regret for some δ > √2 − 1.

Notice that Bayesian optimal rules generally do not have finite memory even when δ is small. For instance, as pointed out in the proof of Proposition 2, any Bayesian optimal rule under the two point distribution Q(v, w) with 0 < v < w < 1 does not have finite round memory.

Proof. It follows immediately that the rule described above is the unique symmetric linear single round memory Bayesian optimal rule under Q_0. Let z_c be the discounted future value of payoffs conditional on choosing action c. Then z_a = (1 − δ)π_a + δπ_a z_a + δ(1 − π_a)z_b. Similar

expressions hold for z_b, and solving the two equations in the two unknowns z_a and z_b yields the expression for Π = 0.5(z_a + z_b) given above. For π_a > π_b we obtain

∂L/∂π_a = [(1 + δ − 4δπ_a + 2δπ_b)(1 + δ − δπ_a − δπ_b) + δ(π_a − π_b)(1 + δ − 2δπ_a)] / (2(1 + δ − δπ_a − δπ_b)²),

where the numerator is decreasing in π_a. If π_a = 1 then the numerator is also increasing in π_b. Evaluating the numerator at π_a = 1 and π_b = 0 we obtain 1 − 2δ − δ², which has the positive root √2 − 1. Hence ∂L/∂π_a ≥ 0 holds for all π_a and π_b if δ ≤ √2 − 1. On the other hand, if δ > √2 − 1 then ∂L/∂π_a < 0 holds when π_a = 1 and π_b = 0. Similarly we obtain for π_a > π_b

∂L/∂π_b = −(1 + δ − 2δπ_a)² / (2(1 + δ − δπ_a − δπ_b)²).

Thus, L is maximized at (π_a, π_b) = (1, 0) and Q_0 is a worst case prior if and only if δ ≤ √2 − 1. Since the rule of Proposition 7 is the unique symmetric linear single round memory rule that is Bayesian optimal under Q_0 there is no alternative symmetric linear single round memory rule that attains minimax regret when δ ≤ √2 − 1. It is easily verified that arg max_{D ∈ D_0 : π_a(D) > π_b(D)} L(D) is single valued for all δ ∈ (0, 1). Thus, by Corollary 4, the rule does not attain minimax regret when δ > √2 − 1. Note that p_b = (1 + δ − 2δπ_a) / (2(1 + δ − δπ_a − δπ_b)) is the sum of the discounted probabilities of choosing action b under the rule, where p_b can be derived as the solution to Π = (1 − p_b)π_a + p_b π_b. Consider an alternative symmetric linear single round memory rule σ. Let q_c be the probability of choosing action c in the next round given that c is chosen in the present round; then q_c = π_c y + (1 − π_c)z where y = σ(c, 1)_c and z = σ(c, 0)_c. Consequently Π(σ) = π_a + (1 + δ − 2δq_a)(π_b − π_a) / (2(1 + δ − δq_a − δq_b)) and L_σ = (1 + δ − 2δq_a)(π_a − π_b) / (2(1 + δ − δq_a − δq_b)) when π_a > π_b. It is easily verified that ∂L/∂y < 0 < ∂L/∂z when π_a > π_b. Thus the rule of Proposition 7 is among linear symmetric single round memory rules the only candidate for a Bayesian optimal rule and hence the only candidate for a symmetric rule that attains minimax regret. Following Propositions 6 and 7, any single round memory rule that attains minimax regret randomizes in round one, choosing each action with probability 0.5. However, the rule selected in Proposition 7 also randomizes in later rounds whenever receiving a payoff in (0, 1). In the

following we investigate when and whether this sort of randomizing is also necessary for attaining minimax regret.

Proposition 8 Consider |W| = 2. Consider a single round memory rule σ with σ(c, x)_c ∈ {0, 1} for all x ∈ [0, 1]. Then (i) σ attains minimax regret for all δ ≤ 1/3 if and only if σ(∅)_a = 1/2, σ(c, x)_c = 0 if x < 1/3 and σ(c, x)_c = 1 if x > 1/3. (ii) σ does not attain minimax regret for δ > 1/3.

Proof. Consider a symmetric single round memory rule σ_o with σ_o(c, x)_c ∈ {0, 1} for all x ∈ [0, 1]. Consider D* ∈ arg max_{D ∈ D_0 : π_a(D) > π_b(D)} L_{σ_o}(D). Then σ_o behaves (in terms of sequences of actions chosen) when facing D* as the rule defined in Proposition 7 does when facing D′ ∈ D_0 defined by P_c(1, D′) = P_c({x : σ_o(c, x)_c = 1}, D*). Setting q_c = P_c(1, D′) and using the expression for p_b from the proof of Proposition 7 we obtain L_{σ_o}(D*) = (1 + δ − 2δq_a)(π_a − π_b) / (2(1 + δ − δq_a − δq_b)). Now assume that σ_o attains minimax regret. Let β_o = sup{x : σ_o(c, x)_c = 0} and β_u = inf{x : σ_o(c, x)_c = 1}. Since D* maximizes the regret of σ_o we derive that π_a(D*) = q_a + (1 − q_a)β_o and π_b(D*) = q_b β_u. Since β_o ≤ β_u, the decision-maker can lower this maximal regret by choosing a rule with β_o = β_u =: β. In other words, the decision-maker chooses β and nature chooses q_a and q_b, and regret is given by

L(D*) = (1 + δ − 2δq_a)(q_a + (1 − q_a)β − βq_b) / (2(1 + δ − δq_a − δq_b)).

Following Proposition 7, if σ attains minimax regret then Q_0 is a worst case prior. Hence we need to verify that

∂L/∂q_a |_{(q_a, q_b) = (1, 0)} = ½((1 − β)(1 − δ) − δ(1 + δ)) ≥ 0 and ∂L/∂q_b |_{(q_a, q_b) = (1, 0)} = ½(1 − δ)(δ − β) ≤ 0,

which for all δ ≤ 1/3 simplifies to β = 1/3. Finally, if β = 1/3 then it is easily verified that ∂L/∂q_a ≥ 0 and ∂L/∂q_b ≤ 0 hold for all δ ≤ 1/3, so that (q_a, q_b) = (1, 0) indeed maximizes L.
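The single round memory analysis can likewise be checked numerically; a sketch under the same normalization (Bernoulli arms, fair coin in round one, stay after a success, switch after a failure; names are ours):

```python
# Discounted value of the linear single round memory rule ("stay with
# probability equal to the last payoff") on Bernoulli arms (pa, pb).
def value_markov(pa, pb, delta, horizon=2000):
    p, total, disc = 0.5, 0.0, 1.0        # p = probability of playing arm a
    for _ in range(horizon):
        total += disc * (p * pa + (1 - p) * pb)
        p = p * pa + (1 - p) * (1 - pb)   # stay after a success, arrive from b after a failure
        disc *= delta
    return (1 - delta) * total

def value_closed_form(pa, pb, delta):
    return 0.5 * (pa + pb) + delta * (pa - pb) ** 2 / (2 * (1 + delta - delta * (pa + pb)))

def regret(pa, pb, delta):                # regret when pa >= pb
    return pa - value_closed_form(pa, pb, delta)
```

A grid search over π_b = 0 shows the worst case sitting at π_a = 1 for δ = 0.3 < √2 − 1 but moving to an interior π_a for δ = 0.5.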

4.2.2 Two round memory

Next we search for two round memory rules that attain minimax regret. The rule we select for small and intermediate discount factors turns out to be a Bernoulli equivalent extension of a rule suggested by Robbins (1956) for use in Bernoulli two-action decision problems instead when δ = 1. When payoffs are in {0, 1} this rule prescribes to switch back and forth until the first success is obtained and then only to switch after two consecutive failures.

Proposition 9 Consider |W| = 2. Consider the Bernoulli equivalent symmetric two round memory rule σ̄ that has the stay with a winner property and that satisfies σ̄(c, 0)_c = σ̄(c, 0, c, 0)_c = σ̄(d, 0, c, 0)_c = 0 and σ̄(c, 1, c, 0)_c = 1 for {c, d} = {a, b}. Then

Π = ½(π_a + π_b) + (1 − δ)δ(π_a − π_b)²(1 + δ − δ(π_a + π_b)) / (2(P_a P_b − k_a k_b)),

where P_c = 1 − δπ_c − δ²π_c(1 − π_c) and k_c = δ(1 − π_c)(1 − δπ_c), and σ̄ attains minimax regret if and only if δ ≤ (√5 − 1)/2 ≈ 0.62. No other symmetric two round memory rule attains minimax regret when δ = (√5 − 1)/2.

The only adjustment to the rule suggested by Robbins (1956) is that we require the decision-maker to choose each action equally likely in the first round. Notice that σ̄(d, 1, c, 0)_c is not explicitly specified as (d, 1, c, 0) for d ≠ c occurs with zero probability. σ̄ defined above is simple to implement in Bernoulli two-action decision problems. However, when payoffs can also be realized in (0, 1) then implementation is a bit more complicated as randomization is not independent across rounds. Recalling our discussion of Bernoulli equivalent rules in Section 3.2, one way to make the choices when observing an interior payoff x_m in round m is to take a draw ω_m from a lottery that yields 1 with probability x_m and 0 with probability 1 − x_m, and then to remember ω_m for two rounds and to act, when making choices in rounds m + 1 and m + 2, as if ω_m was the payoff realized in round m. Of course memory of ω_m for two rounds is not necessary after a_{m+1} ≠ a_m as in this case σ̄(a_m, x_m, a_{m+1}, x_{m+1}) is independent of x_m.
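The draw-based bookkeeping described above can be sketched as follows (a hypothetical encoding of the rule's switching logic on the remembered binary draws; names are ours):

```python
import random

def bernoulli_draw(payoff, rng):
    # replace an interior payoff by a binary draw with mean equal to the payoff
    return 1 if rng.random() < payoff else 0

def next_action(a_prev, d_prev, a_cur, d_cur):
    """Two round memory rule of the Robbins type on remembered draws:
    stay after a success; after a failure stay only if the same action
    produced a success in the round before; otherwise switch."""
    if d_cur == 1:
        return a_cur                      # stay with a winner
    if a_prev == a_cur and d_prev == 1:
        return a_cur                      # the (c,1,c,0) history: stay once more
    return "b" if a_cur == "a" else "a"   # two failures in a row (or fresh failure): switch
```

For payoffs in {0, 1} the draws coincide with the payoffs and the rule reduces to the description in the text.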
An alternative way to directly define the behavior of σ̄ for all payoffs in [0, 1] is to use the following stochastic automaton with the four states a1, a2, b1 and b2, graphically represented in Figure 1. Choose action c in state ci. Use the transition function g to find out which state to enter in round one and which state to enter in the next round given the current

state, where g : {∅} ∪ ({a1, a2, b1, b2} × [0, 1]) → Δ({a1, a2, b1, b2}) is given by g(∅)_{a1} = g(∅)_{b1} = 1/2 (so start off in states a1 and b1 each with probability 1/2) and g(c1, x)_{c2} = 1 − g(c1, x)_{d1} = g(c2, x)_{c2} = 1 − g(c2, x)_{c1} = x for x ∈ [0, 1] and {c, d} = {a, b}. State c2 can be interpreted as higher confidence in action c for c ∈ {a, b}.

Figure 1: The selected two round memory rule as a stochastic automaton with four states.

Proof. Consider a symmetric two round memory rule σ that attains minimax regret when δ = (√5 − 1)/2. Following Lemma 5 we obtain σ(c, 0)_c = 0, σ(c, 1, c, 0)_c = 1, σ(c, 0, d, 1)_d = 1, σ(c, 0, c, 0)_c = 0, σ(c, 0, d, 0)_d ∈ {0, 1}, and that σ has the stay with a winner property. If σ(c, 0, d, 0)_d = 1 and π_b = 0 then a direct computation of L gives ∂L/∂π_a |_{(π_a, π_b) = (1, 0)} = (1 − 2δ)(1 + δ)/2, so this rule does not attain minimax regret if δ > 1/2. If instead σ(c, 0, d, 0)_d = 0 (which is the rule σ̄ selected in the statement) then L = π_a − Π with Π as displayed in the statement. Assume δ ≤ (√5 − 1)/2. By first showing that ∂L/∂π_b ≤ 0 and then that ∂L/∂π_a ≥ 0 holds when π_b = 0 it can easily be verified that (π_a, π_b) = (1, 0) is the unique maximizer of L conditional on π_a > π_b. This means that Q_0 is a worst case prior. Now assume δ > (√5 − 1)/2. It can also be easily verified that arg max_{π_a > π_b} L(π_a, π_b) is single valued. Thus, by Corollary 4, σ̄ does not attain minimax regret.
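The automaton description translates directly into code; a minimal sketch (our own naming, with the transition function g as described and payoffs allowed anywhere in [0, 1]):

```python
import random

def start_state(rng):
    return rng.choice(("a1", "b1"))        # g(empty): a1 and b1 with probability 1/2 each

def step(state, payoff, rng):
    """Transition g: with probability equal to the payoff move to (or stay in)
    the high-confidence state c2 of the current action c; otherwise drop:
    from c1 switch to the other action, from c2 fall back to c1."""
    action = state[0]
    other = "b" if action == "a" else "a"
    if rng.random() < payoff:
        return action + "2"                # success-like draw: confidence up
    return other + "1" if state.endswith("1") else action + "1"

def play(pa, pb, rounds, seed=0):
    rng = random.Random(seed)
    state, actions = start_state(rng), []
    for _ in range(rounds):
        actions.append(state[0])
        # Bernoulli arms: the realized payoff is 0 or 1
        payoff = 1.0 if rng.random() < (pa if state[0] == "a" else pb) else 0.0
        state = step(state, payoff, rng)
    return actions
```

On Q_0-type arms (π_a, π_b) = (1, 0) the automaton locks in on the good action from round two on, and with two always-failing arms it alternates forever, as the rule prescribes.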

To keep this paper short we refrain from an exhaustive analysis of two round memory rules as we did for single round memory rules. However, notice that following Proposition 9 we know that there is no two round memory rule that attains minimax regret for all δ ∈ (0, δ_o) where δ_o > (√5 − 1)/2. Combining the result on Q_0 in the proof of Proposition 9 with Propositions 3 and 6 we obtain:

Corollary 10 Consider |W| = 2. (i) Q_0 is a worst case prior if and only if δ ≤ (√5 − 1)/2. (ii) There is no deterministic rule that attains minimax regret when δ ≤ (√5 − 1)/2.

Part (ii) is presented as it is an easy corollary of our previous results. A more general proof that this holds for all δ ∈ (0, 1) is far from obvious.

4.2.3 Two round action memory

Consider now two round action memory rules. In the following we investigate how this restriction changes the range of discount factors given in Proposition 9 in which minimax regret can be achieved.

Proposition 11 Consider |W| = 2. There exists δ_0 with δ_0 ≈ 0.44 such that: (i) If δ ≤ δ_0 then the symmetric linear rule f₊ with two round action memory that has the stay with a winner property and that satisfies f₊(c, 1, c, 0)_c = λ_0 ≈ 0.84 and f₊(c, 0)_c = f₊(d, 1, c, 0)_c = 0 for c ≠ d attains minimax regret. (ii) If δ_0 < δ ≤ (√5 − 1)/2 then there is no two round action memory rule that attains minimax regret.

When compared with the two round memory rule σ̄ from Proposition 9, the rule f₊ given above is simpler in two respects. First of all, f₊ has two round action memory so it requires less memory than σ̄. Second of all, randomization under f₊ occurs independently in each round while the implementation of σ̄ requires a much more complicated randomization process.

Proof. Consider a symmetric two round action memory rule f that is Bayesian optimal against Q_0. Then f(∅)_a = 0.5, f(c, 0)_c = 0 and f(c, 1)_c = f(c, 1, c, 1)_c = f(d, 1, c, 1)_c = 1. In particular, f has the stay with a winner property. Let λ = f(c, 1, c, 0)_c and μ = f(d, 1, c, 0)_c for c ≠ d.

For any given round except round one consider the state described by the present and previous choice. Then there are four states aa, ab, bb, ba, where cd specifies that the present action is d and the previous action was c. Let v_n, w_n, y_n and z_n be the respective probabilities of being in these states in round n ≥ 2. Then v_2 = ½π_a, w_2 = ½(1 − π_a), y_2 = ½π_b and z_2 = ½(1 − π_b). Given the transition matrix M equal to

π_a + (1 − π_a)λ   (1 − π_a)(1 − λ)   0   0
0   0   π_b + (1 − π_b)μ   (1 − π_b)(1 − μ)
0   0   π_b + (1 − π_b)λ   (1 − π_b)(1 − λ)
π_a + (1 − π_a)μ   (1 − π_a)(1 − μ)   0   0

we obtain (v_{n+1}, w_{n+1}, y_{n+1}, z_{n+1})ᵀ = Mᵀ(v_n, w_n, y_n, z_n)ᵀ and hence

L = max{π_a, π_b} − ½(1 − δ)(π_a + π_b) − (1 − δ)δ (π_a, π_b, π_b, π_a)(I − δMᵀ)⁻¹ (v_2, w_2, y_2, z_2)ᵀ,

where I ∈ R^{4×4} is the identity matrix. The explicit expression for L is too elaborate to present here, but for π_a > π_b the partial derivatives ∂L/∂π_a |_{(π_a, π_b) = (1, 0)} and ∂L/∂π_b |_{(π_a, π_b) = (1, 0)} can be computed explicitly as functions of δ, λ and μ. In the following we search for the values of λ and μ that maximize the largest value of δ such that ∂L/∂π_a |_{(π_a, π_b) = (1, 0)} ≥ 0 and ∂L/∂π_b |_{(π_a, π_b) = (1, 0)} ≤ 0 hold. Let δ_0, λ_0 and μ_0 be the solutions to this problem. It follows that μ_0 = 1. So we are looking for δ_0 and λ_0 such that ∂L/∂π_a |_{(1, 0)} = 0 and ∂L/∂π_b |_{(1, 0)} = 0. Solving these two equations yields δ_0 ≈ 0.4369 and λ_0 ≈ 0.84. Thus, for δ > δ_0 either ∂L/∂π_a |_{(1, 0)} < 0 or ∂L/∂π_b |_{(1, 0)} > 0, which means for δ > δ_0 that Q_0 is not a worst case prior. Combining this with Proposition 9 we have proven part (ii).
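The discounted state occupancies can be accumulated iteratively instead of inverting I − δMᵀ; a sketch (our encoding of the transition structure, with stay probabilities λ after a failure following the same action and μ after a failure following the other action):

```python
def regret_two_round_action(pa, pb, delta, lam, mu, horizon=2000):
    # transition matrix over states (aa, ab, bb, ba); row = current state
    M = [
        [pa + (1 - pa) * lam, (1 - pa) * (1 - lam), 0.0, 0.0],
        [0.0, 0.0, pb + (1 - pb) * mu, (1 - pb) * (1 - mu)],
        [0.0, 0.0, pb + (1 - pb) * lam, (1 - pb) * (1 - lam)],
        [pa + (1 - pa) * mu, (1 - pa) * (1 - mu), 0.0, 0.0],
    ]
    pay = (pa, pb, pb, pa)                 # expected payoff in each state
    state = [0.5 * pa, 0.5 * (1 - pa), 0.5 * pb, 0.5 * (1 - pb)]  # round-two distribution
    value = (1 - delta) * 0.5 * (pa + pb)  # round one: fair coin over the two actions
    disc = (1 - delta) * delta
    for _ in range(horizon):
        value += disc * sum(s * q for s, q in zip(state, pay))
        state = [sum(state[i] * M[i][j] for i in range(4)) for j in range(4)]
        disc *= delta
    return max(pa, pb) - value
```

At (π_a, π_b) = (1, 0) the regret is (1 − δ)/2 for any λ and μ, which matches the worst-case regret at Q_0 found earlier.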

In the following we consider δ ≤ δ_0, λ = λ_0, μ = 1 and π_a > π_b, for which L can be written out explicitly, and we prove that L attains its maximum at (π_a, π_b) = (1, 0). First we prove that ∂L/∂π_b ≤ 0 holds when π_b = 0. Let w = 1 − π_a. The numerator of the second factor of ∂L/∂π_b is the only term that can take negative values; looking at this term we find that ∂L/∂π_b |_{(π_a, π_b) = (1, 0)} ≤ 0 implies ∂L/∂π_b |_{π_b = 0} ≤ 0 for all π_a. We also obtain ∂²L/∂π_b² ≤ 0, which completes the proof that ∂L/∂π_b ≤ 0 holds for δ ≤ δ_0. If π_b = 0 then ∂L/∂w can be computed explicitly; since ∂L/∂w |_{(w, π_b) = (0, 0)} ≤ 0 implies ∂L/∂w |_{π_b = 0} ≤ 0, this completes the proof of the fact that (π_a, π_b) = (1, 0) maximizes L if δ ≤ δ_0.

5 Conclusion

This paper demonstrates how simple but well designed rules can have very powerful properties when choosing between two actions under low and intermediate discount factors (δ ≤ 0.62). Reducing the search for minimax regret to the search for a Nash equilibrium of a zero-sum game and discovering the importance of Q_0 are the keys to deriving our results. Whether the cutoff 0.62 is restrictive depends on the particular application as, besides the degree of patience, the discount factor can also be interpreted as the probability of being able to choose again. When δ > 0.62 or when there are more than two actions then our results are weaker; minimax regret can be


More information

Schrödinger s equation.

Schrödinger s equation. Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of

More information

Linear and quadratic approximation

Linear and quadratic approximation Linear an quaratic approximation November 11, 2013 Definition: Suppose f is a function that is ifferentiable on an interval I containing the point a. The linear approximation to f at a is the linear function

More information

Linear Regression with Limited Observation

Linear Regression with Limited Observation Ela Hazan Tomer Koren Technion Israel Institute of Technology, Technion City 32000, Haifa, Israel ehazan@ie.technion.ac.il tomerk@cs.technion.ac.il Abstract We consier the most common variants of linear

More information

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,

More information

Qubit channels that achieve capacity with two states

Qubit channels that achieve capacity with two states Qubit channels that achieve capacity with two states Dominic W. Berry Department of Physics, The University of Queenslan, Brisbane, Queenslan 4072, Australia Receive 22 December 2004; publishe 22 March

More information

Ramsey numbers of some bipartite graphs versus complete graphs

Ramsey numbers of some bipartite graphs versus complete graphs Ramsey numbers of some bipartite graphs versus complete graphs Tao Jiang, Michael Salerno Miami University, Oxfor, OH 45056, USA Abstract. The Ramsey number r(h, K n ) is the smallest positive integer

More information

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21 Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

Tractability results for weighted Banach spaces of smooth functions

Tractability results for weighted Banach spaces of smooth functions Tractability results for weighte Banach spaces of smooth functions Markus Weimar Mathematisches Institut, Universität Jena Ernst-Abbe-Platz 2, 07740 Jena, Germany email: markus.weimar@uni-jena.e March

More information

On the Optimal Use of "Dirt Taxes" and Credit Subsidies in the Presence of Private Financing Constraints

On the Optimal Use of Dirt Taxes and Credit Subsidies in the Presence of Private Financing Constraints On the Optimal Use of "Dirt Taxes" an Creit Subsiies in the Presence of Private Financing Constraints Florian Ho mann Roman Inerst y Ulf Moslener z February 2011 Abstract We consier an economy where prouction

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks A PAC-Bayesian Approach to Spectrally-Normalize Margin Bouns for Neural Networks Behnam Neyshabur, Srinah Bhojanapalli, Davi McAllester, Nathan Srebro Toyota Technological Institute at Chicago {bneyshabur,

More information

Lecture Notes on Bargaining

Lecture Notes on Bargaining Lecture Notes on Bargaining Levent Koçkesen 1 Axiomatic Bargaining and Nash Solution 1.1 Preliminaries The axiomatic theory of bargaining originated in a fundamental paper by Nash (1950, Econometrica).

More information

arxiv: v1 [math.co] 29 May 2009

arxiv: v1 [math.co] 29 May 2009 arxiv:0905.4913v1 [math.co] 29 May 2009 simple Havel-Hakimi type algorithm to realize graphical egree sequences of irecte graphs Péter L. Erős an István Miklós. Rényi Institute of Mathematics, Hungarian

More information

arxiv:hep-th/ v1 3 Feb 1993

arxiv:hep-th/ v1 3 Feb 1993 NBI-HE-9-89 PAR LPTHE 9-49 FTUAM 9-44 November 99 Matrix moel calculations beyon the spherical limit arxiv:hep-th/93004v 3 Feb 993 J. Ambjørn The Niels Bohr Institute Blegamsvej 7, DK-00 Copenhagen Ø,

More information

A variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs

A variance decomposition and a Central Limit Theorem for empirical losses associated with resampling designs Mathias Fuchs, Norbert Krautenbacher A variance ecomposition an a Central Limit Theorem for empirical losses associate with resampling esigns Technical Report Number 173, 2014 Department of Statistics

More information

The Kuhn-Tucker Problem

The Kuhn-Tucker Problem Natalia Lazzati Mathematics for Economics (Part I) Note 8: Nonlinear Programming - The Kuhn-Tucker Problem Note 8 is based on de la Fuente (2000, Ch. 7) and Simon and Blume (1994, Ch. 18 and 19). The Kuhn-Tucker

More information

Necessary and Sufficient Conditions for Sketched Subspace Clustering

Necessary and Sufficient Conditions for Sketched Subspace Clustering Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This

More information

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012 CS-6 Theory Gems November 8, 0 Lecture Lecturer: Alesaner Mąry Scribes: Alhussein Fawzi, Dorina Thanou Introuction Toay, we will briefly iscuss an important technique in probability theory measure concentration

More information

Online Appendix for Trade Policy under Monopolistic Competition with Firm Selection

Online Appendix for Trade Policy under Monopolistic Competition with Firm Selection Online Appenix for Trae Policy uner Monopolistic Competition with Firm Selection Kyle Bagwell Stanfor University an NBER Seung Hoon Lee Georgia Institute of Technology September 6, 2018 In this Online

More information

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES

EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION OF UNIVARIATE TAYLOR SERIES MATHEMATICS OF COMPUTATION Volume 69, Number 231, Pages 1117 1130 S 0025-5718(00)01120-0 Article electronically publishe on February 17, 2000 EVALUATING HIGHER DERIVATIVE TENSORS BY FORWARD PROPAGATION

More information

All s Well That Ends Well: Supplementary Proofs

All s Well That Ends Well: Supplementary Proofs All s Well That Ens Well: Guarantee Resolution of Simultaneous Rigi Boy Impact 1:1 All s Well That Ens Well: Supplementary Proofs This ocument complements the paper All s Well That Ens Well: Guarantee

More information

Volume 30, Issue 3. Monotone comparative statics with separable objective functions. Christian Ewerhart University of Zurich

Volume 30, Issue 3. Monotone comparative statics with separable objective functions. Christian Ewerhart University of Zurich Volume 30, Issue 3 Monotone comparative statics with separable objective functions Christian Ewerhart University of Zurich Abstract The Milgrom-Shannon single crossing property is essential for monotone

More information

On the Aloha throughput-fairness tradeoff

On the Aloha throughput-fairness tradeoff On the Aloha throughput-fairness traeoff 1 Nan Xie, Member, IEEE, an Steven Weber, Senior Member, IEEE Abstract arxiv:1605.01557v1 [cs.it] 5 May 2016 A well-known inner boun of the stability region of

More information

Introduction to Markov Processes

Introduction to Markov Processes Introuction to Markov Processes Connexions moule m44014 Zzis law Gustav) Meglicki, Jr Office of the VP for Information Technology Iniana University RCS: Section-2.tex,v 1.24 2012/12/21 18:03:08 gustav

More information

Some Notes on Costless Signaling Games

Some Notes on Costless Signaling Games Some Notes on Costless Signaling Games John Morgan University of California at Berkeley Preliminaries Our running example is that of a decision maker (DM) consulting a knowledgeable expert for advice about

More information

Introduction to the Vlasov-Poisson system

Introduction to the Vlasov-Poisson system Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

Switching Time Optimization in Discretized Hybrid Dynamical Systems

Switching Time Optimization in Discretized Hybrid Dynamical Systems Switching Time Optimization in Discretize Hybri Dynamical Systems Kathrin Flaßkamp, To Murphey, an Sina Ober-Blöbaum Abstract Switching time optimization (STO) arises in systems that have a finite set

More information

Cascaded redundancy reduction

Cascaded redundancy reduction Network: Comput. Neural Syst. 9 (1998) 73 84. Printe in the UK PII: S0954-898X(98)88342-5 Cascae reunancy reuction Virginia R e Sa an Geoffrey E Hinton Department of Computer Science, University of Toronto,

More information

Calculus in the AP Physics C Course The Derivative

Calculus in the AP Physics C Course The Derivative Limits an Derivatives Calculus in the AP Physics C Course The Derivative In physics, the ieas of the rate change of a quantity (along with the slope of a tangent line) an the area uner a curve are essential.

More information

2Algebraic ONLINE PAGE PROOFS. foundations

2Algebraic ONLINE PAGE PROOFS. foundations Algebraic founations. Kick off with CAS. Algebraic skills.3 Pascal s triangle an binomial expansions.4 The binomial theorem.5 Sets of real numbers.6 Surs.7 Review . Kick off with CAS Playing lotto Using

More information

I m Doing as Well as I Can: Modeling People as Rational Finite Automata

I m Doing as Well as I Can: Modeling People as Rational Finite Automata I m Doing as Well as I Can: Moeling People as Rational Finite Automata Joseph Y Halpern Rafael Pass Lior Seeman Computer Science Dept Cornell University Ithaca, NY E-mail: halpern rafael lseeman@cscornelleu

More information

Some Examples. Uniform motion. Poisson processes on the real line

Some Examples. Uniform motion. Poisson processes on the real line Some Examples Our immeiate goal is to see some examples of Lévy processes, an/or infinitely-ivisible laws on. Uniform motion Choose an fix a nonranom an efine X := for all (1) Then, {X } is a [nonranom]

More information

Learning Automata in Games with Memory with Application to Circuit-Switched Routing

Learning Automata in Games with Memory with Application to Circuit-Switched Routing Learning Automata in Games with Memory with Application to Circuit-Switche Routing Murat Alanyali Abstract A general setting is consiere in which autonomous users interact by means of a finite-state controlle

More information

Dynamics of Knowledge Creation and Transfer: The Two Person Case

Dynamics of Knowledge Creation and Transfer: The Two Person Case Dynamics of Knowlege Creation an Transfer: The Two Person Case Marcus Berliant an Masahisa Fujita June 8, 2006 Abstract This paper presents a micro-moel of knowlege creation an transfer for a couple. Our

More information

Banks, depositors and liquidity shocks: long term vs short term interest rates in a model of adverse selection

Banks, depositors and liquidity shocks: long term vs short term interest rates in a model of adverse selection Banks, depositors and liquidity shocks: long term vs short term interest rates in a model of adverse selection Geethanjali Selvaretnam Abstract This model takes into consideration the fact that depositors

More information

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations

Diophantine Approximations: Examining the Farey Process and its Method on Producing Best Approximations Diophantine Approximations: Examining the Farey Process an its Metho on Proucing Best Approximations Kelly Bowen Introuction When a person hears the phrase irrational number, one oes not think of anything

More information

Some properties of random staircase tableaux

Some properties of random staircase tableaux Some properties of ranom staircase tableaux Sanrine Dasse Hartaut Pawe l Hitczenko Downloae /4/7 to 744940 Reistribution subject to SIAM license or copyright; see http://wwwsiamorg/journals/ojsaphp Abstract

More information

The Subtree Size Profile of Plane-oriented Recursive Trees

The Subtree Size Profile of Plane-oriented Recursive Trees The Subtree Size Profile of Plane-oriente Recursive Trees Michael FUCHS Department of Applie Mathematics National Chiao Tung University Hsinchu, 3, Taiwan Email: mfuchs@math.nctu.eu.tw Abstract In this

More information

1 Introuction In the past few years there has been renewe interest in the nerson impurity moel. This moel was originally propose by nerson [2], for a

1 Introuction In the past few years there has been renewe interest in the nerson impurity moel. This moel was originally propose by nerson [2], for a Theory of the nerson impurity moel: The Schrieer{Wol transformation re{examine Stefan K. Kehrein 1 an nreas Mielke 2 Institut fur Theoretische Physik, uprecht{karls{universitat, D{69120 Heielberg, F..

More information

Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis

Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis Subspace Estimation from Incomplete Observations: A High-Dimensional Analysis Chuang Wang, Yonina C. Elar, Fellow, IEEE an Yue M. Lu, Senior Member, IEEE Abstract We present a high-imensional analysis

More information

Connections Between Duality in Control Theory and

Connections Between Duality in Control Theory and Connections Between Duality in Control heory an Convex Optimization V. Balakrishnan 1 an L. Vanenberghe 2 Abstract Several important problems in control theory can be reformulate as convex optimization

More information

Sharp Thresholds. Zachary Hamaker. March 15, 2010

Sharp Thresholds. Zachary Hamaker. March 15, 2010 Sharp Threshols Zachary Hamaker March 15, 2010 Abstract The Kolmogorov Zero-One law states that for tail events on infinite-imensional probability spaces, the probability must be either zero or one. Behavior

More information

Math 115 Section 018 Course Note

Math 115 Section 018 Course Note Course Note 1 General Functions Definition 1.1. A function is a rule that takes certain numbers as inputs an assigns to each a efinite output number. The set of all input numbers is calle the omain of

More information

Witt#5: Around the integrality criterion 9.93 [version 1.1 (21 April 2013), not completed, not proofread]

Witt#5: Around the integrality criterion 9.93 [version 1.1 (21 April 2013), not completed, not proofread] Witt vectors. Part 1 Michiel Hazewinkel Sienotes by Darij Grinberg Witt#5: Aroun the integrality criterion 9.93 [version 1.1 21 April 2013, not complete, not proofrea In [1, section 9.93, Hazewinkel states

More information

Dot trajectories in the superposition of random screens: analysis and synthesis

Dot trajectories in the superposition of random screens: analysis and synthesis 1472 J. Opt. Soc. Am. A/ Vol. 21, No. 8/ August 2004 Isaac Amiror Dot trajectories in the superposition of ranom screens: analysis an synthesis Isaac Amiror Laboratoire e Systèmes Périphériques, Ecole

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

arxiv: v4 [math.pr] 27 Jul 2016

arxiv: v4 [math.pr] 27 Jul 2016 The Asymptotic Distribution of the Determinant of a Ranom Correlation Matrix arxiv:309768v4 mathpr] 7 Jul 06 AM Hanea a, & GF Nane b a Centre of xcellence for Biosecurity Risk Analysis, University of Melbourne,

More information

A simple model for the small-strain behaviour of soils

A simple model for the small-strain behaviour of soils A simple moel for the small-strain behaviour of soils José Jorge Naer Department of Structural an Geotechnical ngineering, Polytechnic School, University of São Paulo 05508-900, São Paulo, Brazil, e-mail:

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

Error Floors in LDPC Codes: Fast Simulation, Bounds and Hardware Emulation

Error Floors in LDPC Codes: Fast Simulation, Bounds and Hardware Emulation Error Floors in LDPC Coes: Fast Simulation, Bouns an Harware Emulation Pamela Lee, Lara Dolecek, Zhengya Zhang, Venkat Anantharam, Borivoje Nikolic, an Martin J. Wainwright EECS Department University of

More information