Information collection on a graph

Ilya O. Ryzhov, Warren Powell

October 25, 2009

Abstract

We derive a knowledge gradient policy for an optimal learning problem on a graph, in which we use sequential measurements to refine Bayesian estimates of individual edge costs in order to learn about the best path. This problem differs from traditional ranking and selection in that the implementation decision (the path we choose) is distinct from the measurement decision (the edge we measure). Our decision rule is easy to compute, and performs competitively against other learning policies, including a Monte Carlo adaptation of the knowledge gradient policy for ranking and selection.

1 Introduction

Consider a path-finding problem on a graph in which the lengths or values of the edges are random, and their distributions are unknown. We begin with independent normal Bayesian priors for the edge values, and we can obtain noisy measurements of the values which we can use to refine our estimates through Bayesian updating. We are allowed to make N measurements of individual edges, and we can measure any edge at any time, regardless of its location in the graph. After the measurements are complete, we must make a guess as to the best path. Our problem is to sequentially determine the best edges to evaluate, where we can make each choice given what we have learned from prior measurements.

This problem contains an important distinction between measurement and implementation decisions. While we measure, we collect information about individual edges. However, our overarching goal is to find a path. We must choose the edges we measure in such a way as to collect the most information about the graph as a whole. We are not constrained by the graph structure when choosing what to measure, in the sense that we can always measure any edge at any time. Nonetheless, we must still keep the graph structure in mind when choosing edges, because it is relevant to the final implementation decision.

The distinction between measurement and implementation has not been considered in earlier work on optimal learning. A major goal of this work is to open up new avenues for optimal learning in the context of operations research problems, e.g. on graphs. Three possible examples of graph problems where a learning component may come into play are the following:

1. PERT/CPM project management. A complex project can be represented as a graph in which edges correspond to tasks. We wish to find the sequence of tasks that can fulfill the requirements of the project in the shortest possible time. We can change our estimate of the time required to complete a task by analyzing historical data from previous projects involving that task. We do not have time to analyze all available records (they may be expensive to access), so we can only perform a small number of historical studies.

2. Biosurveillance. We are planning a route for a single medical specialist through a region in a developing country. The route should maximize the specialist's total effectiveness in the region. Before committing to a route, we can make contact with hospitals in the region and ask for recent medical data that could change our beliefs about the specialist's potential effectiveness there. Each contact requires money and time to analyze the data, so we cannot contact every hospital.

3. Defense of intellectual property. Certain stores may be unwittingly selling counterfeit products such as printer ink. The ink manufacturer has an estimate of how much counterfeit ink is sold in each store, and wishes to plan a route for a detective to investigate a number of the stores. The estimates can be improved by ordering samples of ink from individual stores. This incurs inventory, transportation and storage costs, so the number of orders is limited.

Optimal information collection has a long history in the context of simple problems such as multi-armed bandits (see e.g. Gittins 1989) and ranking and selection. A general overview of ranking and selection can be found in Bechhofer et al. (1995) and Kim & Nelson (2006), while Law & Kelton (1991) and Goldsman (1983) provide a simulation-oriented perspective. In these problems, there is a finite set of alternatives with unknown values, and the goal is to find the highest value. We can improve our estimates of the values by sequentially measuring different alternatives. In the problem of learning on a graph, we also have a finite set of edges that can be measured, but we are not simply looking for the best edge. We learn by measuring individual edges, but we use the information we collect to improve our ability to find a path.

Stochastic shortest path problems have also been widely studied. An overview is available in Snyder & Steele (1995). However, many of these studies assume that the edge values have known distributions, for example the exponential distribution (Kulkarni 1986, Peer & Sharma 2007). The work by Frieze & Grimmett (1985) describes a probabilistic shortest-path algorithm for more general classes of non-negative distributions, and analyzes the length of the shortest path in the special case of uniformly distributed edge values. Correlations among the edge values have also been studied by Fan et al. (2005), again with the assumption of known cost distributions. For online graph problems, in which we learn in the process of traversing the graph, methods such as Q-learning by Watkins & Dayan (1992) use stochastic approximation to estimate unknown costs, while Bayesian approaches have been proposed by Dearden et al. (1998) and Duff & Barto (1996).

We build on a class of approximate policies originally developed for ranking and selection, where each measurement maximizes the value of information that can be collected in a single time step. This technique was first proposed by Gupta & Miescke (1996) for ranking and selection with independent Gaussian priors, and subsequently expanded in the work on Value of Information Procedures (VIP) by Chick & Inoue (2001a) and Chick & Inoue (2001b). Additional theoretical properties were established by Frazier et al. (2008) for the knowledge gradient (KG) policy. A KG-like methodology was also applied to other learning problems: by Chick et al. (2009), for ranking and selection with unknown measurement noise; by Frazier et al. (2009), for ranking and selection with correlated Gaussian priors; and by Ryzhov et al. (2009) and Ryzhov & Powell (2009), for the online multi-armed bandit problem. In addition to their theoretical properties, KG-type policies have been shown to perform well experimentally. In the offline setting, thorough empirical studies were performed by Inoue et al. (1999) and Branke et al. (2007). In the online case, the variant of KG studied in Ryzhov et al. (2009) performs competitively even against the known, optimal Gittins policy for multi-armed bandits, while being much easier to compute than Gittins indices. These features make KG policies attractive for information collection problems.

This paper makes the following contributions:

(1) We present a new class of optimal learning problems beyond the scope of the literature on ranking and selection and multi-armed bandits. In this problem class, our goal is to solve an optimization problem on a graph with unknown edge values. We can improve our estimate of the optimal solution by making sequential measurements of individual edges.

(2) We show that the knowledge gradient concept can be applied to this problem class, while retaining its theoretical and computational advantages.

(3) We propose an alternate learning policy that treats the problem as a ranking and selection problem, using Monte Carlo sampling to avoid having to enumerate all paths.

(4) We conduct an experimental study comparing these and other learning policies on a diverse set of graph topologies. The study indicates that the KG policy is effective for graphs where there are many paths that could potentially be the best, and the Monte Carlo policy is effective when we are allowed to make a large number of measurements.

Section 2 lays out a mathematical model for information collection on a graph. In Section 3, we derive the exact KG decision rule for an acyclic graph problem, and approximate it for general graphs. We also show that the KG policy is asymptotically optimal as the number of measurements becomes large. In Section 4, we give a decision rule for the Monte Carlo KG policy. Finally, we present numerical results comparing the performance of KG to existing learning policies.

2 Mathematical model

Consider a graph described by a finite set S of nodes and a set E ⊆ S × S of directed edges. Every edge (i, j) ∈ E has a value µ_ij. For notational simplicity, and without loss of generality, we assume that every path must start at some fixed origin node a ∈ S, and that every path must contain exactly T edges. We wish to find the path with the largest total value

    max_p Σ_{(i,j) ∈ E} δ_ij(p) µ_ij    (1)

where p denotes a path that starts at a and contains T edges, and δ_ij(p) is an indicator function that equals 1 if the edge (i, j) appears in the path p, and zero otherwise. Throughout our analysis, we assume that the graph is acyclic, so any edge can appear at most once in a given path. The best path can be found using Bellman's equation for dynamic programming:

    V_t(i) = max_j [ µ_ij + V_{t+1}(j) ],    (2)
    V_T(i) = 0.    (3)

These quantities are defined for each i ∈ S and each t = 0, ..., T. Thus, V_t(i) is the length of the best path that starts at node i and contains T − t edges. It follows that V_0(a) is the optimal value of the problem (1). The actual edges that make up the best path can be found by keeping track of the nodes j that achieve the maximum in (2) for each i.
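
To make (2)-(3) concrete, the following sketch (our own illustration, not code from the paper; the function and variable names such as best_path_value and mu are hypothetical) computes V_0(a) by backward induction and recovers a maximizing path by following the argmax choices forward:

    from collections import defaultdict

    def best_path_value(nodes, mu, a, T):
        """Backward induction for (2)-(3): V_t(i) = max_j [mu_ij + V_{t+1}(j)], V_T(i) = 0.
        Returns (V_0(a), list of edges of a path achieving it)."""
        succ = defaultdict(list)                      # adjacency list: i -> [(j, mu_ij), ...]
        for (i, j), value in mu.items():
            succ[i].append((j, value))
        V = {i: 0.0 for i in nodes}                   # stage T: V_T(i) = 0
        choice = []                                   # choice[t][i] = argmax successor j at stage t
        for t in range(T - 1, -1, -1):
            V_new, best_j = {}, {}
            for i in nodes:
                candidates = [(value + V[j], j) for j, value in succ[i]]
                if candidates:
                    V_new[i], best_j[i] = max(candidates, key=lambda c: c[0])
                else:
                    # No outgoing edge: no path with the required number of edges starts here.
                    V_new[i], best_j[i] = float("-inf"), None
            V, choice = V_new, [best_j] + choice
        path, i = [], a
        for t in range(T):                            # follow the argmax pointers forward from a
            j = choice[t][i]
            if j is None:
                break
            path.append((i, j))
            i = j
        return V[a], path

In this sketch, mu would hold the true values µ_ij when solving (2)-(3), or the current estimates µ^n_ij when solving the time-n version (8)-(9) introduced below.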

If the values µ_ij are known, (2) gives us the exact optimal solution to (1). If the values are random with known distribution, (2) still solves the problem in the sense that it gives us the path with the highest expected total value. However, in our work, the distributions of the values are unknown, and our beliefs about them change as we learn more about them.

2.1 Learning about individual edges

Suppose that the mean values µ_ij are unknown, but we can estimate them by measuring individual edges. When we choose to measure edge (i, j) ∈ E, we observe a random value µ̂_ij, which follows a Gaussian distribution with mean µ_ij and variance σ²_ε. We assume that the measurement error σ²_ε is known, and we sometimes use the notation β_ε = σ_ε^{-2} to refer to the measurement precision. Because µ_ij is itself unknown, we assume that µ_ij ~ N(µ⁰_ij, (σ⁰_ij)²), where µ⁰_ij and (σ⁰_ij)² represent our prior beliefs about µ_ij. We also assume that the values of the edges are mutually independent, conditioned on µ_ij, (i, j) ∈ E.

We learn about the graph by making N sequential measurements, where N is given. One measurement corresponds to exactly one edge. Any edge can be measured at any time, regardless of graph structure. Let F^n be the sigma-algebra generated by our choices of the first n edges, as well as the observations we made on those edges. We say that something happens at time n if it happens immediately after we have made exactly n measurements. Then we can define µ^n_ij = E^n[µ_ij], where E^n = E[· | F^n]. Similarly, we let (σ^n_ij)² be the conditional variance of µ_ij given F^n, with β^n_ij = (σ^n_ij)^{-2} being the conditional precision. Thus, at time n, we believe that µ_ij ~ N(µ^n_ij, (σ^n_ij)²). Our beliefs evolve according to the Bayesian updating equation

    µ^{n+1}_ij = (β^n_ij µ^n_ij + β_ε µ̂^{n+1}_ij) / (β^n_ij + β_ε)    if (i, j) is the (n+1)st edge measured,
    µ^{n+1}_ij = µ^n_ij    otherwise.    (4)

The values of the edges are independent, so we update only our beliefs about the edge that we have just measured. The quantity µ̂^{n+1}_ij is the random value observed by making that measurement. The precision of our beliefs is updated using the equation

    β^{n+1}_ij = β^n_ij + β_ε    if (i, j) is the (n+1)st edge measured,
    β^{n+1}_ij = β^n_ij    otherwise.    (5)
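
The updates (4)-(5) are simple enough to state directly in code. The sketch below (our own, with hypothetical names, not from the paper) stores each edge belief as a (mean, precision) pair and applies one measurement:

    from dataclasses import dataclass

    @dataclass
    class EdgeBelief:
        mu: float     # current estimate mu^n_ij of the edge value
        beta: float   # current precision beta^n_ij = 1 / (sigma^n_ij)^2

    def measure_edge(belief: EdgeBelief, observation: float, beta_eps: float) -> EdgeBelief:
        """Apply equations (4)-(5) to the measured edge; all other edges keep their beliefs."""
        beta_new = belief.beta + beta_eps
        mu_new = (belief.beta * belief.mu + beta_eps * observation) / beta_new
        return EdgeBelief(mu=mu_new, beta=beta_new)

    # Example: prior N(500, 100) measured once with noise variance 100^2.
    prior = EdgeBelief(mu=500.0, beta=1.0 / 100.0)
    posterior = measure_edge(prior, observation=430.0, beta_eps=1.0 / 100.0**2)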

We use the notation µ^n = {µ^n_ij : (i, j) ∈ E} and β^n = {β^n_ij : (i, j) ∈ E}. We also let

    (σ̃^n_ij)² = Var[µ^{n+1}_ij | F^n] = Var[µ_ij | F^n] − Var[µ_ij | F^{n+1}]    (6)

be the reduction in the variance of our beliefs about (i, j) that we achieve by measuring (i, j) at time n. It can be shown that

    (σ̃^n_ij)² = (σ^n_ij)² − (σ^{n+1}_ij)² = 1/β^n_ij − 1/(β^n_ij + β_ε).

It is known, for instance from DeGroot (1970), that the conditional distribution of µ^{n+1}_ij given F^n is N(µ^n_ij, (σ̃^n_ij)²). In other words, given F^n, we can write

    µ^{n+1}_ij = µ^n_ij + σ̃^n_ij Z,

where Z is a standard Gaussian random variable. It follows that

    E^n[µ^{n+1}_ij] = µ^n_ij.    (7)

Our beliefs about the values after n measurements are completely characterized by µ^n and β^n. We can define a knowledge state s^n = (µ^n, β^n) to completely capture all the information we have at time n. If we choose to measure edge (i, j) ∈ E at time n, we write s^{n+1} = K^M(s^n, (i, j), µ̂^{n+1}_ij), where the transition function K^M is described by (4) and (5).

To streamline our presentation, the measurement error σ²_ε is taken to be constant for all edges, similar to Frazier et al. (2008). However, we can allow the measurement error to be edge-dependent without significant changes in our analysis. If we suppose that µ̂^{n+1}_ij ~ N(µ_ij, λ²_ij), we obtain the same model, but with σ²_ε and β_ε replaced by λ²_ij and λ_ij^{-2} in (4), (5) and (6). Except for the modifications in these equations, all theoretical and computational results presented in this paper remain unchanged in the case where the measurement error varies across edges.

The validity of our assumption of Gaussian priors and measurements is problem-dependent. If the measurement is done through statistical sampling with a large enough sample size, the Gaussian distribution is a good approximation. The method of batch means (see e.g. Schmeiser 1982, Kim & Nelson 2007) can be used to design the observations to mitigate the non-normality of the underlying data.

Additionally, Hoff (2009) states, based on a result by Lukacs (1942), that a Gaussian sampling model can be used if we believe the sample mean to be independent from the sample variance (in particular, if the sample variance is known). A Gaussian prior may work well even when the measurements are non-Gaussian. Gelman et al. (2004) suggests that a unimodal and roughly symmetric posterior can be approximated by a Gaussian distribution. Under certain conditions, the posterior is asymptotically normal as the number of measurements becomes large (see Bernardo & Smith 1994). In short, a Gaussian sampling model is appropriate for many learning problems.

2.2 Learning about paths

At time n, our beliefs about the path that solves (2) are expressed using Bellman's equation, with the unknown values µ_ij replaced by the most recent beliefs µ^n_ij:

    V^n_t(i; s^n) = max_j [ µ^n_ij + V^n_{t+1}(j; s^n) ],    (8)
    V^n_T(i; s^n) = 0.    (9)

As with (2), we compute V^n_t for all i and t, from which we can construct the path that we believe to be the best at time n.

It is important to understand the distinction between (2) and (8). The quantity V_0(a) represents the true length of the true best path. The quantity V^n_0(a; s^n) represents our time-n beliefs about which path is the best, and thus depends on s^n. The path that solves (8) is our best time-n guess of the path that solves (2). Intuitively, the solution to (8) should be worse than the expected solution to (2). In other words, there is a penalty for not having perfect information. The following proposition formalizes this idea. The proof uses an induction argument, and can be found in the Appendix.

Proposition 2.1. For all i ∈ S, for all t = 0, ..., T, and for all knowledge states s^n,

    V^n_t(i; s^n) ≤ E^n[V_t(i)]    almost surely.    (10)

We can also make a time-n estimate of the length of a fixed path p:

    V^{p,n}_t(i; s^n) = µ^n_ij + V^{p,n}_{t+1}(j; s^n),    where j = x^p(i),
    V^{p,n}_T(i; s^n) = 0.

Here, x^p(i) denotes the node that follows node i in path p. The true length of path p is given by

    V^p_t(i) = µ_ij + V^p_{t+1}(j),    where j = x^p(i),
    V^p_T(i) = 0.

From these equations, it is clear that E^n[V^p_t(i)] = V^{p,n}_t(i; s^n) for fixed p.

Our use of the index t is a technical convention of dynamic programming. Bellman's equation constructs the best path one edge at a time, and the index t merely serves to indicate how many edges in the path have already been built. It does not have any bearing on how many edges we have measured in the learning problem. For convenience, we will use the notation

    V^n(s^n) = V^n_0(a; s^n),    V^{p,n}(s^n) = V^{p,n}_0(a; s^n)

to refer to our time-n estimates, dropping the index t. Similarly, we use V and V^p to denote V_0(a) and V^p_0(a).

2.3 Measurement policies and implementation decisions

The problem consists of two stages. The first stage consists of N sequential measurements. The second stage occurs at time N, after all measurements have been made, and consists of a single implementation decision: we have to choose a path based on the final knowledge state s^N. We can choose a policy π for choosing edges to measure in the first phase, and an implementation decision ρ for choosing a path in the second phase.

The measurement policy π can be viewed as a collection of decision rules X^{π,0}, ..., X^{π,N−1}, where each X^{π,n} is a function mapping the knowledge state s^n to an edge in E. The time-n decision rule uses the most recent knowledge state s^n to make a decision. The implementation decision ρ can be viewed as a random path, that is, a function mapping an outcome ω of the first N measurements to a path. Thus, our implementation decision will become known immediately after the last measurement, but may not be known exactly before time N. The measurement policy and implementation decision should be chosen to achieve

    sup_π sup_ρ E^π[V^ρ],    (11)

where V^ρ is the true length (in terms of µ) of the random path ρ that becomes known at time N.

One intuitive choice of implementation decision is the path that solves (8) at time N. Once s^N is known, V^N(s^N) will give us the path we believe to be the best at time N, and we can choose this path for our implementation decision. At the same time, it is not immediately obvious that maximizing V^N(s^N) will necessarily reveal something about the true best path. Figure 1 provides a simple illustration using a graph with three paths (top, middle and bottom). Figure 1(a) shows the true values µ_ij for each edge, and Figure 1(b) shows a sample realization of the final beliefs µ^N_ij. The solution to (2) is the bottom path, yet the time-N solution to (8) is the top path.

[Figure 1: An outcome for which the path given by V^N(s^N) (top path) is not the true best path (bottom path). Panel (a) shows the true values µ_ij; panel (b) shows the final beliefs µ^N_ij.]

This example gives rise to the question of whether there might be some other implementation decision that is more likely to find the true best path. For example, we might try to account for the uncertainty in our time-N beliefs by solving the path problem with the edge lengths given by µ^N_ij + z σ^N_ij, where z is a tunable constant, and using the solution to that problem as our guess of the best path. This approach would give rise to an entire class of implementation decisions, parameterized by z. Thus, the space of possible implementation decisions may be quite large. However, the next result shows that the natural choice of implementation decision (finding a path by solving (8) at time N) always achieves the maximum in (11).

Theorem 2.1. If π denotes a measurement policy, and the random path ρ is the implementation decision, then

    sup_π sup_ρ E^π[V^ρ] = sup_π E^π[V^N(s^N)].

The proof can be found in the Appendix. The meaning of this result is that the path that solves (8) at time N is, in fact, the best possible implementation decision.

By maximizing our final guess of the best path, we maximize in expectation the true value of the path we choose. Thus, the problem reduces to choosing a measurement policy π for selecting individual edges in the first phase, and our objective function can be written as

    sup_π E^π[V^N(s^N)].    (12)

Remark 2.1. By taking an expectation of both sides of (10), we find that, for any policy π, E^π[V^N(s^N)] ≤ E^π[V], where V = V_0(a) is the true length of the path that solves (2). Since the true edge values µ do not depend on the policy π, it follows that E^π[V] = E[V] for any π, hence E^π[V^N(s^N)] ≤ E[V] for all π. Thus, Proposition 2.1 gives us a global upper bound on the objective value achieved by any measurement policy.

Note that we use a time-staged graph model, where we are always looking for a path with T edges. This is convenient for modeling, because it enables us to easily write the solution to the path problem using Bellman's equation. However, the KG policy that we derive in Section 3 does not require a time-staged graph, and can be used for many different path problems. For example, if our graph has both a source and a destination node, we would simply let V^n(s^n) be the time-n estimate of the best path from the source to the destination. We are also not bound to the maximization problem in (1). For a shortest-path problem, the derivation in Section 3 will be identical, except V^n(s^n) will be obtained using a shortest-path algorithm. In fact, our computational study in Section 5 solves shortest-path problems on graphs with sources and sinks.

3 The knowledge gradient policy

Suppose that we are at time n, in knowledge state s^n. Let p̄^n be the path that achieves V^n(s^n). Thus, p̄^n is the path that we believe is the best, given our most recent information, and V^n(s^n) is our estimate of its length.

The knowledge gradient policy is based on the idea first developed by Gupta & Miescke (1996) and later studied by Chick & Inoue (2001a), Chick & Inoue (2001b) and Frazier et al. (2008) for the ranking and selection problem. This idea can be stated as choosing the measurement that would be optimal if it were the last measurement we were allowed to make.

If we are at time N − 1, with only one more chance to measure, the best choice is given by

    arg max_{(i,j) ∈ E} E^{N−1}[V^N(s^N)] = arg max_{(i,j) ∈ E} E^{N−1}[ V^N(s^N) − V^{N−1}(s^{N−1}) ],

where E^{N−1} observes all the information known at time N − 1, as well as the choice to measure (i, j) at time N − 1. We bring V^{N−1}(s^{N−1}) into the maximum because this quantity is known at time N − 1, and does not depend on the choice of measurement. If we always assume that we have only one more chance to measure, at every time step, then the decision rule that follows from this assumption is

    X^{KG,n}(s^n) = arg max_{(i,j) ∈ E} E^n[ V^{n+1}(s^{n+1}) − V^n(s^n) ].    (13)

In words, we measure the edge that maximizes the expected improvement in our estimate of the length of the best path that can be obtained from a single measurement. The term "knowledge gradient" is due to (13) being written as a difference.

Remark 3.1. By definition, the KG policy is optimal for N = 1. In this case, a measurement policy consists of only one measurement, and (12) becomes

    max_{(i,j) ∈ E} E^0[V^1(s^1)].

Below, we find the value of a single measurement, and present the knowledge gradient policy.

3.1 The effect of one measurement

In order to compute the right-hand side of (13), we consider the effects of measuring one edge on our beliefs. Fix an edge (i, j) ∈ E and let A_ij = {p : δ_ij(p) = 1} be the set of all paths containing (i, j). Then A^c_ij is the set of all paths not containing that edge. Now define a path p̲^n_ij as follows. If p̄^n ∈ A_ij, let

    p̲^n_ij = arg max_{p ∈ A^c_ij} V^{p,n}(s^n).

On the other hand, if p̄^n ∈ A^c_ij, let

    p̲^n_ij = arg max_{p ∈ A_ij} V^{p,n}(s^n).

Thus, if (i, j) is already in the best time-n path, then p̲^n_ij is the best path that does not contain this edge. If (i, j) is not part of the path we believe to be the best, then p̲^n_ij is the best path that does contain that edge. Thus, by definition, p̲^n_ij ≠ p̄^n.

Proposition 3.1. If we measure edge (i, j) at time n, the path that achieves V^{n+1}(s^{n+1}) will be either p̄^n or p̲^n_ij.

Proof: Suppose that p̄^n ∈ A_ij. By definition, p̄^n = arg max_p V^{p,n}(s^n), so in particular p̄^n = arg max_{p ∈ A_ij} V^{p,n}(s^n). Depending on the outcome of our measurement of (i, j), our beliefs about all paths in A_ij will change, but they will all change by the same amount µ^{n+1}_ij − µ^n_ij. This is because we assume that the graph contains no cycles, so all paths in A_ij contain only one copy of (i, j). Therefore, p̄^n = arg max_{p ∈ A_ij} V^{p,n+1}(s^{n+1}) for every outcome. Thus, p̄^n is the only path in A_ij that can be the best time-(n+1) path. Our beliefs about the paths in A^c_ij will remain the same, because none of those paths contain (i, j), and our beliefs about the other edges do not change as a result of measuring (i, j). Therefore,

    arg max_{p ∈ A^c_ij} V^{p,n+1}(s^{n+1}) = arg max_{p ∈ A^c_ij} V^{p,n}(s^n) = p̲^n_ij

for every outcome. Thus, p̲^n_ij is the only path in A^c_ij that can be the best time-(n+1) path. It follows that p̄^n and p̲^n_ij are the only two paths that can be the best at time n + 1.

If p̄^n ∈ A^c_ij, the argument is the same. By definition, p̄^n is the best path, so p̄^n = arg max_{p ∈ A^c_ij} V^{p,n}(s^n). Our beliefs about the paths in A^c_ij do not change after measuring (i, j), so p̄^n will still be the best path in A^c_ij at time n + 1. Our beliefs about all paths in A_ij will change by the same amount after the measurement, so p̲^n_ij will still be the best path in A_ij at time n + 1. Therefore, p̄^n and p̲^n_ij are again the only two paths that can be the best at time n + 1.

Because p̄^n and p̲^n_ij figure prominently in the KG policy, we must remark on their computation. We can obtain p̄^n via (8). If p̄^n ∈ A_ij, then p̲^n_ij can be found by solving a modified version of (8) with µ^n_ij set to −∞. This ensures that we obtain a path in A^c_ij. If p̄^n ∈ A^c_ij, we can again solve a modified version of (8) with µ^n_ij chosen to be some large number, for instance the sum of the other µ^n values. This will construct a path that includes (i, j), with the other edges chosen optimally.
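
A sketch of this modification, reusing the hypothetical best_path_value() from Section 2 (again our own illustration, not the authors' code; the paper suggests using the sum of the other estimates as the forcing bonus, and this sketch uses a slightly larger value to also cover negative estimates):

    def alternative_path_value(nodes, mu_n, a, T, edge, on_best_path):
        """Value of the alternative path for `edge` = (i, j): the best path excluding the edge
        if it lies on the current best path, or the best path forced to include it otherwise."""
        mu_mod = dict(mu_n)
        if on_best_path:
            # Exclude the edge by making it infinitely unattractive.
            mu_mod[edge] = float("-inf")
            return best_path_value(nodes, mu_mod, a, T)
        # Force the edge in by making it dominate; the other edges are still chosen optimally.
        bonus = 1.0 + 2.0 * sum(abs(v) for v in mu_n.values())
        mu_mod[edge] = mu_n[edge] + bonus
        value, path = best_path_value(nodes, mu_mod, a, T)
        return value - bonus, path          # remove the artificial bonus from the reported value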

3.2 Computation of the KG policy

Define a function f(z) = zΦ(z) + φ(z), where φ and Φ are the standard Gaussian pdf and cdf, respectively. Also, for notational convenience, we define V̲^n_ij(s^n) = V^{p̲^n_ij, n}(s^n). This quantity is our time-n estimate of the length of the path p̲^n_ij defined in Section 3.1. With these definitions, we can present the main result of this section, namely the exact solution of the expectation in (13).

Theorem 3.1. The KG decision rule in (13) can be written as

    X^{KG,n}(s^n) = arg max_{(i,j) ∈ E} ν^{KG,n}_ij,    (14)

where

    ν^{KG,n}_ij = σ̃^n_ij f( −(V^n(s^n) − V̲^n_ij(s^n)) / σ̃^n_ij ).    (15)

Proof: As in the proof of Proposition 3.1, we consider two cases, one where δ_ij(p̄^n) = 1 and one where δ_ij(p̄^n) = 0. The two cases differ slightly, but in the end we derive one unified formula for ν^{KG,n}_ij.

Case 1: p̄^n ∈ A^c_ij. Suppose that the edge (i, j) is not currently part of the best path. Nonetheless, we can potentially gain by measuring it. From Proposition 3.1 we know that only p̄^n or p̲^n_ij can be the best path at time n + 1. Observe that p̲^n_ij will become the best path (beating p̄^n) if

    µ^{n+1}_ij > µ^n_ij + V^n(s^n) − V̲^n_ij(s^n),

that is, our beliefs about (i, j) increase by an amount that is large enough to make up the time-n difference between p̄^n and p̲^n_ij. Note that V^n(s^n) − V̲^n_ij(s^n) ≥ 0 by assumption, because V^n(s^n) is the time-n length of the best time-n path. For all other outcomes of the measurement (that is, if our beliefs about (i, j) do not increase enough), p̄^n will continue to be the best path at time n + 1. The one-period increase in our beliefs about the length of the best path, denoted by V^{n+1}(s^{n+1}) − V^n(s^n), depends on the outcome of the measurement in the following fashion:

    V^{n+1}(s^{n+1}) − V^n(s^n) =
        (µ^{n+1}_ij − µ^n_ij) − (V^n(s^n) − V̲^n_ij(s^n))    if µ^{n+1}_ij − µ^n_ij ≥ V^n(s^n) − V̲^n_ij(s^n),
        0    otherwise.    (16)

The shape of this function can be seen in Figure 2(a).

[Figure 2: Structure of the one-period increase in our beliefs about the best path. Panel (a): Case 1, p̄^n ∈ A^c_ij. Panel (b): Case 2, p̄^n ∈ A_ij.]

Then, the knowledge gradient obtained by measuring (i, j) is

    ν^{KG,n}_ij = E^n[ V^{n+1}(s^{n+1}) − V^n(s^n) ]
                = E^n[ ( (µ^{n+1}_ij − µ^n_ij) − (V^n(s^n) − V̲^n_ij(s^n)) ) · 1{µ^{n+1}_ij ≥ µ^n_ij + V^n(s^n) − V̲^n_ij(s^n)} ].

Equation (7) tells us that, given F^n, µ^{n+1}_ij ~ N(µ^n_ij, (σ̃^n_ij)²). Thus,

    ν^{KG,n}_ij = σ̃^n_ij E[ Z · 1{Z ≥ (V^n(s^n) − V̲^n_ij(s^n)) / σ̃^n_ij} ]
                  − (V^n(s^n) − V̲^n_ij(s^n)) · P( Z ≥ (V^n(s^n) − V̲^n_ij(s^n)) / σ̃^n_ij ),

where Z ~ N(0, 1). It follows that

    ν^{KG,n}_ij = σ̃^n_ij φ( (V^n(s^n) − V̲^n_ij(s^n)) / σ̃^n_ij )
                  − (V^n(s^n) − V̲^n_ij(s^n)) Φ( −(V^n(s^n) − V̲^n_ij(s^n)) / σ̃^n_ij )
                = σ̃^n_ij f( −(V^n(s^n) − V̲^n_ij(s^n)) / σ̃^n_ij ).    (17)

Case 2: p̄^n ∈ A_ij. If we measure an edge that is part of the best path, our estimate of the best path can become better or worse, depending on the outcome of the measurement. Then p̲^n_ij, the best path not containing that edge, will become the best path at time n + 1 if

    µ^{n+1}_ij < µ^n_ij − (V^n(s^n) − V̲^n_ij(s^n)),

that is, if our beliefs about (i, j) drop far enough.

In this case, the one-period improvement is given by

    V^{n+1}(s^{n+1}) − V^n(s^n) =
        −(V^n(s^n) − V̲^n_ij(s^n))    if µ^{n+1}_ij − µ^n_ij < −(V^n(s^n) − V̲^n_ij(s^n)),
        µ^{n+1}_ij − µ^n_ij    otherwise.    (18)

The shape of this function is shown in Figure 2(b). The knowledge gradient is

    ν^{KG,n}_ij = E^n[ V^{n+1}(s^{n+1}) − V^n(s^n) ]
                = −(V^n(s^n) − V̲^n_ij(s^n)) · P( µ^{n+1}_ij − µ^n_ij < −(V^n(s^n) − V̲^n_ij(s^n)) )
                  + E^n[ (µ^{n+1}_ij − µ^n_ij) · 1{µ^{n+1}_ij − µ^n_ij ≥ −(V^n(s^n) − V̲^n_ij(s^n))} ].

As before, µ^{n+1}_ij ~ N(µ^n_ij, (σ̃^n_ij)²) given F^n. Therefore,

    ν^{KG,n}_ij = −(V^n(s^n) − V̲^n_ij(s^n)) · P( Z < −(V^n(s^n) − V̲^n_ij(s^n)) / σ̃^n_ij )
                  + σ̃^n_ij E[ Z · 1{Z ≥ −(V^n(s^n) − V̲^n_ij(s^n)) / σ̃^n_ij} ],

which becomes

    ν^{KG,n}_ij = −(V^n(s^n) − V̲^n_ij(s^n)) Φ( −(V^n(s^n) − V̲^n_ij(s^n)) / σ̃^n_ij )
                  + σ̃^n_ij φ( (V^n(s^n) − V̲^n_ij(s^n)) / σ̃^n_ij ).

This is the same expression as in (17).

The right-hand side of (15) provides us with a simple, easily computable formula for the knowledge gradient. The formula resembles an analogous formula for ranking and selection, examined by Frazier et al. (2008). However, (15) is designed specifically for the graph problem; to run the KG policy at time n, we are required to solve one shortest-path problem for each edge, to find V̲^n_ij(s^n).
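
The formula in (15) is straightforward to implement. The sketch below (our own, with hypothetical names) evaluates f and the knowledge gradient of a single edge from V^n(s^n), V̲^n_ij(s^n) and the belief precision; the identity σ̃² = 1/β^n − 1/(β^n + β_ε) from (6) supplies σ̃^n_ij:

    import math

    def f(z: float) -> float:
        """f(z) = z * Phi(z) + phi(z) for the standard Gaussian cdf Phi and pdf phi."""
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        return z * Phi + phi

    def kg_factor(V_best: float, V_alt: float, beta_n: float, beta_eps: float) -> float:
        """Knowledge gradient nu^{KG,n}_ij of one edge, equation (15).
        V_best = V^n(s^n), V_alt = the alternative-path value, beta_n = belief precision."""
        sigma_tilde = math.sqrt(1.0 / beta_n - 1.0 / (beta_n + beta_eps))   # from (6)
        return sigma_tilde * f(-(V_best - V_alt) / sigma_tilde)

    # The KG policy measures the edge with the largest kg_factor, which requires one
    # alternative-path computation (Section 3.1) per edge.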

Equations (14) and (15) give an exact computation of (13) when the graph contains no cycles. If we allow cycles in the graph, then any path that is the best time-n path containing k copies of (i, j), for any k = 0, 1, ..., can become the best time-(n+1) path after measuring (i, j). It is difficult to enumerate all such paths; if the graph has cycles, we suggest (15) as an approximation to this difficult computation. For shortest-path problems, however, no path with a positive-cost cycle can ever be the shortest, so (14) and (15) closely approximate (13) as long as negative-cost cycles occur with negligible probability.

3.3 Asymptotic optimality of the KG policy

Define the risk function R_p = V − V^p to represent the loss incurred by choosing p instead of the true best path at time N. In this section, we show that the KG policy is asymptotically optimal in the sense of Frazier & Powell (2008), that is,

    lim_{N→∞} E[ min_p E^N[R_p] ] = E[ min_p E[R_p | µ] ].    (19)

In words, the minimum-risk decision after N measurements will attain the minimum risk possible if all values are perfectly known, in the limit as N → ∞. The crucial point is that the KG policy is the only learning policy that is optimal both for N = 1 (in the sense of Remark 3.1) and for N → ∞ (in the sense of (19)). This combination of myopic and asymptotic optimality suggests that KG could also perform well for finite measurement budgets.

All expectations in this discussion are under the KG policy; we drop the policy name from E^{KG} for notational convenience. Observe that

    lim_{N→∞} E[ min_p E^N[R_p] ] = lim_{N→∞} E[ min_p E^N[V − V^p] ]
                                  = lim_{N→∞} E[ E^N[V] − max_p E^N[V^p] ]
                                  = lim_{N→∞} ( E[V] − E[ max_p V^{p,N}(s^N) ] )
                                  = E[V] − lim_{N→∞} E[ V^N(s^N) ].

Using similar calculations, it can be shown that E[ min_p E[R_p | µ] ] = 0, which means that we can rewrite (19) as

    lim_{N→∞} E[ V^N(s^N) ] = E[V].    (20)

From Remark 2.1 we know that E[V^N(s^N)] ≤ E[V] for any N, so (20) means that an asymptotically optimal policy, with our usual implementation decision, achieves the highest possible objective value. The definition given in (19) is in line with the intuitive meaning of asymptotic optimality.

Theorem 3.2. The KG policy of Theorem 3.1 is asymptotically optimal in the sense of (20).

The proof of Theorem 3.2 is technical in nature, and can be found in the Appendix. The work by Frazier & Powell (2008) provides sufficient conditions for the asymptotic optimality of a KG-like learning policy in a general optimal learning setting. Our contribution is to verify that these conditions are satisfied by the KG policy for the graph setting. In the Appendix, we list the conditions in the context of the graph problem, then show that they are satisfied.

4 A Monte Carlo learning policy

In this section, we offer a different strategy for choosing edges. This approach views the paths of the graph as alternatives in a ranking and selection problem. We explain how to model this problem and solve it using the correlated KG algorithm by Frazier et al. (2009), assuming that we can enumerate all the paths. We then discuss how to use Monte Carlo simulation to avoid having to enumerate all the paths, instead generating a small subset of the set of all paths.

4.1 Ranking and selection on paths

Recall from Section 2.2 that V^p denotes the true value of a path p. Suppose for now that we can enumerate all the paths of the graph as p_1, ..., p_P. Let V^{paths} = (V^{p_1}, ..., V^{p_P}) denote the true lengths of these paths. Let V^{paths,n}(s^n) = (V^{p_1,n}(s^n), ..., V^{p_P,n}(s^n)) represent the paths' time-n lengths. Also, let E_p be the set of edges contained in path p ∈ {p_1, ..., p_P}. Because a path is characterized by its index, we will use p to refer to a path, as well as the path's index in the set {1, ..., P}.

From before, we know that E^n[V^p] = V^{p,n}(s^n) for any path p. Because V^p = Σ_{(i,j) ∈ E_p} µ_ij, the conditional covariance of V^p and V^{p'}, given F^n, is expressed by

    Σ^{paths,n}_{p,p'}(s^n) = Σ_{(i,j) ∈ E_p ∩ E_{p'}} (σ^n_ij)².    (21)

As before, the individual edges of the graph are independent. However, two paths are not independent if they have at least one edge in common, and the covariance of two path lengths is the sum of the variances of the edges that the two paths have in common. Then, given F^n, we have

    V^{paths} ~ N( V^{paths,n}(s^n), Σ^{paths,n}(s^n) ),    (22)

where Σ^{paths,n}(s^n) is defined by (21).

Thus, we have a multivariate Gaussian prior distribution on the vector V^{paths} of true path lengths. Now suppose that, instead of measuring one edge in each time step, we can measure a path containing T edges, and use (4) and (5) to update our beliefs about every edge in that path. Because our measurements are independent, the variance of such a measurement is σ²_ε T. Our goal is to find arg max_p V^p, the path with the largest true value. This can be viewed as a traditional ranking and selection problem with correlated Gaussian priors. The alternatives of the problem are paths, our beliefs are given by (22), and we choose a path to measure in every time step.

To solve this problem, we can apply the correlated knowledge gradient algorithm from Frazier et al. (2009). The knowledge gradient for path p in this problem is

    ν^{KGC,n}_p = E^n[ V^{n+1}(s^{n+1}) − V^n(s^n) ].    (23)

For path p, we define a vector

    σ̃^{KGC,n}_p = Σ^{paths,n}(s^n) e_p / sqrt( σ²_ε T + Σ^{paths,n}_{p,p}(s^n) )    (24)

to represent the reduction in the variance of our beliefs about all paths achieved by measuring path p. Here e_p is a vector with 1 at index p and zeros everywhere else. Then, (23) can be rewritten as

    ν^{KGC,n}_p = Σ_{y=1}^{P−1} ( σ̃^{KGC,n}_{p,y+1} − σ̃^{KGC,n}_{p,y} ) f(−|c_y|),    (25)

where the paths have been sorted in order of increasing σ̃^{KGC,n}_{p,y}, f is as in Section 3, and the numbers c_y are such that y = arg max_{p'} [ V^{paths,n}_{p'}(s^n) + σ̃^{KGC,n}_{p,p'} z ] for z ∈ [c_{y−1}, c_y), with ties broken by the largest-index rule. Then, the correlated KG policy for choosing a path is given by

    X^{KGC,n}(s^n) = arg max_p ν^{KGC,n}_p.    (26)
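
A small sketch of (21) and (24) (our own, with hypothetical names; we assume (24) takes the usual correlated-KG form of Frazier et al. 2009, with the measurement variance of a whole path equal to σ²_ε T):

    import numpy as np

    def path_covariance(path_edges, sigma2):
        """Equation (21): Sigma_{p,p'} = sum of (sigma^n_ij)^2 over the edges shared by p and p'.
        path_edges is a list of edge sets; sigma2 maps each edge to its current belief variance."""
        P = len(path_edges)
        cov = np.zeros((P, P))
        for p in range(P):
            for q in range(P):
                cov[p, q] = sum(sigma2[e] for e in path_edges[p] & path_edges[q])
        return cov

    def sigma_tilde_kgc(cov, p, noise_var_path):
        """Equation (24): the vector of belief changes caused by measuring path p,
        where noise_var_path plays the role of sigma_eps^2 * T."""
        return cov[:, p] / np.sqrt(noise_var_path + cov[p, p])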

4.2 Using Monte Carlo sampling to generate a choice set

There are two major problems with using the correlated KG policy to find a path. First, we want to measure individual edges, not paths. If we use (26) to find a path, we also need a rule for choosing an edge from that path. Second, and more importantly, it is difficult to enumerate paths, and thus we cannot use traditional ranking and selection methods on them. As an alternative to the KG policy described in Section 3, we propose a Monte Carlo-based policy that generates a small set of paths, and runs (26) on that set.

We run our Monte Carlo-based version of KG over the paths by first generating K sample realizations of the random variable µ_ij ~ N(µ^n_ij, (σ^n_ij)²) for every edge (i, j). Let µ^n(ω_k) = {µ^n_ij(ω_k) : (i, j) ∈ E} be the kth sample realization. We can find a path corresponding to the kth realization by solving Bellman's equation using µ^n(ω_k) as the edge values. Because some sample realizations might yield the same best path, let K_0 be the number of distinct paths obtained from this procedure, and let l_1, ..., l_{K_0} represent those distinct paths. As before, we will use l to refer to a path as well as the path's index in {1, ..., K_0}.

Define the vector V^{MC} = (V^{l_1}, ..., V^{l_{K_0}}) to represent the true lengths of the paths (using µ), and similarly let V^{MC,n}(s^n) = (V^{l_1,n}(s^n), ..., V^{l_{K_0},n}(s^n)) represent the paths' time-n lengths. Then, given F^n, we have

    V^{MC} ~ N( V^{MC,n}(s^n), Σ^{MC,n}(s^n) ),

where

    Σ^{MC,n}_{l,l'}(s^n) = Σ_{(i,j) ∈ E_l ∩ E_{l'}} (σ^n_ij)².

To put it in words, we first find a set of K_0 different paths by solving K Monte Carlo shortest-path problems. Given the information we know at time n, the mean length of a path is the sum of the time-n lengths of the links in that path, and the covariance of two path lengths is the sum of the variances of the edges that the two paths have in common. Then, given F^n, the vector of path lengths has the multivariate Gaussian prior distribution given above.

We can now apply the correlated KG algorithm for ranking and selection to the K_0 paths generated, and repeat the computations (24), (25) and (26) using V^{MC,n} and Σ^{MC,n} instead of V^{paths,n} and Σ^{paths,n}. This procedure returns a path l^{MC,n}. It remains to select an edge from this path. We propose the highest-variance rule

    X^{MCKG,n}(s^n) = arg max_{(i,j) ∈ E_{l^{MC,n}}} σ^n_ij.    (27)

In the special case where K_0 = 1, we can simply follow (27) for the sole path generated, without additional computation.

In Section 5, we use the MCKG policy as a competitive strategy to evaluate the performance of the KG policy. However, we note that MCKG is also a new algorithm for this problem class. It can be used in a situation where (15) is too expensive to compute, but we can still solve K path problems for some K < |E|. The MCKG policy is equally suitable for cyclic and acyclic graphs.
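
The path-generation step can be sketched as follows (our own illustration, reusing the hypothetical best_path_value() from Section 2; the names are not from the paper):

    import numpy as np

    def mc_candidate_paths(nodes, mu_n, sigma2_n, a, T, K, rng=None):
        """Sample K realizations of the beliefs mu_ij ~ N(mu^n_ij, (sigma^n_ij)^2), solve
        Bellman's equation for each, and return the distinct optimal paths as edge sets."""
        rng = rng or np.random.default_rng()
        distinct = set()
        for _ in range(K):
            sample = {e: rng.normal(mu_n[e], np.sqrt(sigma2_n[e])) for e in mu_n}
            _, path = best_path_value(nodes, sample, a, T)
            distinct.add(tuple(path))
        return [frozenset(p) for p in distinct]

    def highest_variance_edge(path_edges, sigma2_n):
        """Decision rule (27): measure the edge with the largest belief variance on the chosen path."""
        return max(path_edges, key=lambda e: sigma2_n[e])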

5 Computational experiments

We examined the ways in which the performance of KG on a graph, relative to several other learning policies, was affected by the physical structure of the graph, the size of the graph, the measurement budget N, and the amount of information given by the prior. Our methods of graph generation are discussed in Section 5.1. As stated at the end of Section 3, it does not matter whether we are looking for the shortest or longest path, because the KG formula in (15) will be the same in both cases. In our experiments, we minimized path length on graphs with a clearly defined source node and destination node; for all of our learning policies, we used a freeware implementation of Dijkstra's algorithm to solve the shortest-path problems.

In this setting, if π is a measurement policy, and p̄_π is the path that seems to be the best at time N after having followed π, then the opportunity cost of π is defined to be

    C(π) = V^{p̄_π} − V,    (28)

the difference in the true length of the path p̄_π and the true length of the true best path. This is the error we make by choosing the path that seems to be the best after running policy π. The quantity V is found using (2), with the maximum replaced by a minimum. For policies π_1 and π_2,

    C(π_2) − C(π_1) = V^{p̄_{π_2}} − V^{p̄_{π_1}}    (29)

is the amount by which policy π_1 outperforms policy π_2. Positive values of (29) indicate that π_1 found a shorter (better) path, whereas negative values of (29) mean the opposite.

For every experiment in our study, we ran each measurement policy 10^4 times, starting from the same initial data, thus obtaining 10^4 samples of (28) for each policy. The 10^4 sample paths were divided into groups of 500 in order to obtain approximately normal samples of opportunity cost and the standard errors of those averages. The standard error of the difference in (29) is the square root of the sum of the squared standard errors of C(π_1) and C(π_2).

Crucially, this performance metric requires us to know the true values µ_ij for every graph we consider. In order to test a learning policy, we first assume a truth, then evaluate the ability of the policy to find that truth. For this reason, the starting data for our experiments were randomly generated, including the physical graph structure itself. Because we minimized path length, we generated µ_ij and µ⁰_ij large enough to avoid negative edge values in our measurements.
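
A sketch of how such an evaluation could be organized (our own code, with hypothetical names; the batch size of 500 follows the description above):

    import numpy as np

    def batch_standard_error(samples, batch_size=500):
        """Average the opportunity-cost samples in batches, then report the mean of the
        batch means and its standard error (standard deviation / sqrt(#batches))."""
        samples = np.asarray(samples)
        batches = samples[: len(samples) // batch_size * batch_size].reshape(-1, batch_size)
        means = batches.mean(axis=1)
        return means.mean(), means.std(ddof=1) / np.sqrt(len(means))

    def difference_standard_error(se1, se2):
        """Standard error of the difference (29) between two policies' opportunity costs."""
        return np.sqrt(se1**2 + se2**2)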

For each graph, we generated two sets of numbers. In the heterogeneous-prior set, the prior means µ⁰_ij were generated from a uniform distribution on [450, 550]. The prior variances were generated from a uniform distribution on [95, 105]; the purpose of using such a narrow interval was to ensure that all of them would be approximately equal, but any one would be equally likely to be the largest. Then, for each edge (i, j), the true value µ_ij was generated from a Gaussian distribution with mean µ⁰_ij and variance (σ⁰_ij)². This represents a situation in which our prior beliefs are accurate on average, and give us a reasonably good idea about the true values. The measurement noise σ²_ε was chosen to be 100².

In the second set of initial parameters, referred to as the equal-prior set, we generated the prior means µ⁰_ij from a uniform distribution on [495, 505], the purpose of the narrow interval again being to break ties among the priors. The true means µ_ij were generated from a uniform distribution on [300, 700]. The prior variances and the measurement noise were obtained the same way as in the heterogeneous-prior experiments. The true edge lengths fall into roughly the same range as in the heterogeneous-prior experiments, but the priors now give us much less information about them.

Five policies were tested overall; we briefly describe their implementation.

Knowledge gradient on a graph (KG). This policy is defined by the decision rule (14), the exact KG policy for acyclic graphs. The quantity V^n(s^n) is found by solving a shortest-path problem using µ^n as the edge values. The quantity V̲^n_ij(s^n) is found in a similar fashion, with the cost of (i, j) modified as described in Section 3.

Pure exploitation (Ex). The pure exploitation policy consists of finding the path p̄^n that solves (8) with max replaced by min, then measuring the edge given by X^{Ex,n}(s^n) = arg min_{(i,j) ∈ p̄^n} µ^n_ij.

Variance-exploitation (VEx). This policy is a slight modification of the pure exploitation policy. It measures the edge given by X^{VEx,n}(s^n) = arg max_{(i,j) ∈ p̄^n} σ^n_ij. Instead of simply choosing the edge that looks the best on the path that looks the best, it chooses the edge that we are least certain about on that same path.

Monte Carlo correlated KG (MCKG). The Monte Carlo policy is described in Section 4. The decision rule for this policy is given by (27). The policy has one parameter K, the number of Monte Carlo samples generated. In our experiments, we used K = 30. We found that smaller values of K resulted in very few paths. On the other hand, larger values did not appreciably increase the number K_0 of distinct paths generated (which was typically in the single digits), while requiring substantially more computational time.

Pure exploration (Explore). In every iteration, the pure exploration policy chooses an edge uniformly at random and measures it.

5.1 Effect of graph structure on KG performance

We considered three general types of graph structure:

Layered graphs Layer(L, B, c). The layered graph is closest in form to the time-staged model we developed in Section 2. It consists of a source node, a destination node, and L layers in between. Each layer contains B nodes, and every node in every layer except for the last one is connected to c randomly chosen nodes in the next layer. The source is connected to every node in the first layer, and every node in the last layer is connected to the destination. The total number of nodes in the graph is L·B + 2, and the total number of edges is (L − 1)·B·c + 2B. The edges are directed, so every layered graph is acyclic.

Erdős-Rényi graphs ER(D, p). The Erdős-Rényi random graph model was introduced by Gilbert (1959) and Erdős & Rényi (1959). A graph has D nodes, and any two nodes have a fixed probability p of being connected by an edge. Thus, the total number of edges in the graph varies, but on average is equal to p·D(D − 1)/2. In our experiments, the source is the node with index 1 and the sink is the node with index D.

Scale-free graphs SF(S, I, c). We use the scale-free graph model created by Barabási & Albert (1999). We start with S nodes, and run I iterations. In every iteration, we add one new node and connect it to c randomly chosen, previously existing nodes. The total number of nodes is equal to S + I, and the total number of edges is equal to I·c. In our experiments, the source is the first node added and the sink is the last node added.

Figure 3 gives examples of all three types. In the layered graph, any path from source to destination contains the same number of edges. In the other graphs, several nodes have very high degrees, so there tends to be at least one very short path from one node to another. Layered graphs are acyclic, so (14) and (15) give the exact computation of (13). The other two types of graphs can have cycles, so we use (14) as a close approximation of (13). Our edge values are high enough to make the probability of negative-cost cycles negligible.
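
As an illustration of the Layer(L, B, c) construction described above, a minimal generator might look like this (our own sketch; node labels and function names are hypothetical):

    import random

    def layered_graph(L, B, c, seed=None):
        """Build a Layer(L, B, c) graph: returns (nodes, edges) with a source 's', a sink 't',
        and L layers of B nodes each; every non-final layer node points to c random nodes
        of the next layer."""
        rnd = random.Random(seed)
        layers = [[f"{layer}_{b}" for b in range(B)] for layer in range(L)]
        nodes = ["s", "t"] + [v for layer in layers for v in layer]
        edges = [("s", v) for v in layers[0]] + [(v, "t") for v in layers[-1]]
        for layer in range(L - 1):
            for v in layers[layer]:
                for w in rnd.sample(layers[layer + 1], c):
                    edges.append((v, w))
        return nodes, edges          # (L-1)*B*c + 2B edges, L*B + 2 nodes, as in the text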

We generated 10 graphs of each type, each with approximately 30 nodes and 50 edges. The exact types were Layer(4, 5, 3), ER(30, 0.1) and SF(5, 25, 2).

[Figure 3: Examples of layered, Erdős-Rényi and scale-free graphs: (a) Layer(3, 4, 2), (b) ER(12, 0.3), (c) SF(3, 9, 2). The source and destination nodes are marked by s and t.]

The minimum, average, and maximum values of the difference (29) across ten graphs of each type are given in Tables 1, 2 and 3 for both the heterogeneous-prior and equal-prior experiments. The measurement budget was taken to be N = 30, or approximately 60% of the number of edges.

                   Heterogeneous-prior                     Equal-prior
                Min         Average     Max          Min         Average     Max
    KG-Ex       -2.4410     151.5178    337.4421     162.2031    367.6961    673.0983
    KG-VEx      -3.7512      62.6977    104.9868     -15.9721     72.6030    130.8938
    KG-MCKG     29.6875      60.8563     89.8636     -30.9532     54.7674    195.6884
    KG-Explore  13.1494      93.1405    145.6467      33.5755     95.8332    167.7865

    Table 1: Mean differences in opportunity cost across ten Layer(4, 5, 3) graphs.

The KG policy gives the best performance on the layered graphs, where it outperforms all other policies on average. In the worst case, it can be outperformed by pure exploitation and variance-exploitation. However, even then, the difference is negligible, as the value of a typical path in one of these layered graphs is around 2500. Furthermore, in the best case, the KG policy outperforms the competition by a much larger margin.

For the other two types of graphs, KG performs competitively on average, but is outperformed by every policy in the worst case, though the margin is very small for scale-free graphs. The Monte Carlo policy performs especially well on both Erdős-Rényi and scale-free graphs, with a slight edge over KG on average. In general, the competition is much tighter than for the layered graphs.

In Erdős-Rényi and scale-free graphs, there tends to be at least one path from source to destination that contains very few edges. When the values on the edges are similar in magnitude, this means that a path with fewer edges is more likely to be the best. In such graphs, our consideration is narrowed down to a small number of very short paths; even in the equal-prior case, the graph topology provides a great deal of information.

In fact, all five of our policies were able to find the true best path in five out of ten of the Erdős-Rényi graphs. For this reason, Table 2 contains one 0.0 value, meaning that both policies under consideration achieved a zero opportunity cost. In a layered graph, however, every path contains the same number of edges. In this case, small differences in our prior beliefs matter much more, and there are many more paths that could potentially be the best. In this setting, exploitation-based methods quickly get stuck on an incorrect path, while exploration is unable to discover enough useful information. The KG policy, on the other hand, is more effective at finding a good path. Thus, the KG policy is a particularly good choice for a time-staged graph model.

                   Heterogeneous-prior                     Equal-prior
                Min         Average     Max          Min         Average     Max
    KG-Ex      -29.5931      14.7976    161.8455      -3.4380     29.8378    249.9321
    KG-VEx     -82.0065      -8.5772      2.9997     -43.8978      8.4519     79.0177
    KG-MCKG   -161.1705     -17.5574      8.9841     -51.1068     -3.7332     10.5969
    KG-Explore -49.7891       4.9316     53.6483       0.0        24.0246     94.4013

    Table 2: Mean differences in opportunity cost across ten ER(30, 0.1) graphs.

                   Heterogeneous-prior                     Equal-prior
                Min         Average     Max          Min         Average     Max
    KG-Ex       -4.9423       9.6666     53.1207     -36.3735      3.4179     80.8175
    KG-VEx      -5.0684       6.6864     66.7394     -21.9450      6.2264     85.9265
    KG-MCKG     -3.2274       0.2408      3.2134     -30.5455     -2.0340     12.9562
    KG-Explore  -5.2182       9.7683     82.3290       0.0        22.2583     88.9437

    Table 3: Mean differences in opportunity cost across ten SF(5, 25, 2) graphs.

We also see that KG tends to perform better in the equal-prior setting for layered and Erdős-Rényi graphs. On scale-free graphs, the performance of KG suffers in the worst case, but benefits in the best case, with a slight drop in average-case performance. For the most part, we see that KG can learn effectively when the prior gives relatively little information, especially in the layered graph setting. The most effective policy after KG is MCKG, which has the most sophisticated learning mechanism among the competition. This is also the only policy among the competition which adapts well to the equal-prior setting, maintaining its performance relative to KG on average.

Table 4 shows the average standard errors of our estimates of (29) across each set of graphs. The numbers are much smaller than most of the mean differences reported in Tables 1-3. As expected, the standard error is larger for layered graphs, where we have more paths to choose from.

                   Heterogeneous-prior              Equal-prior
                Layer      ER         SF        Layer      ER         SF
    KG-Ex       1.7887     0.3589     0.2364    2.0525     0.5892     0.3060
    KG-VEx      1.9584     0.3447     0.2349    2.4267     0.5329     0.3710
    KG-MCKG     1.7512     0.2479     0.1414    2.1261     0.2482     0.2857
    KG-Explore  2.0358     0.3728     0.2652    2.3589     0.6671     0.4579

    Table 4: Average standard errors of the differences in opportunity cost.

Finally, Table 5 reports the average number of distinct edges measured by each policy for each set of graphs. Once again, we find that all policies except pure exploration examine fewer distinct edges on Erdős-Rényi and scale-free graphs than on layered graphs. For instance, when the MCKG policy takes Monte Carlo samples of the best path on a layered graph, the choice of the best path can be decided by minor variations in the edge samples, and we will sample more distinct paths. However, on a graph where there are one or two paths with very few edges, those few paths will almost always come out on top in the Monte Carlo sampling, and MCKG will do much less exploration than before.

                   Heterogeneous-prior              Equal-prior
               Layer      ER         SF        Layer      ER         SF
    KG         20.5801    10.7087    7.6630    22.9280     8.9953    9.1416
    Ex          3.4162     1.5208    2.2032     3.1745     2.1769    1.9843
    VEx        10.7140     3.9947    3.9494    12.5684     3.6904    3.9536
    MCKG       29.0706     5.8198    5.6184    29.5703     5.4339    5.4444
    Explore    23.2830    21.5285   22.7159    23.2794    21.5349   22.7244

    Table 5: Average number of distinct edges measured by each policy.

5.2 Effect of graph size on KG performance

We examined the effect of graph size on the performance of the KG policy on the layered graphs discussed in Section 5.1, the graph type that most resembles the time-staged model introduced in Section 2. We generated a set of ten Layer(6, 6, 3) graphs. Thus, each graph in the set had 38 nodes and 102 edges, approximately twice as many as the graphs in Section 5.1. We also increased the measurement budget to N = 60, again approximately 60% of the number of edges.

The performance of the KG policy on this set is summarized in Table 6. Every path in these graphs contains two more edges than for the Layer(4, 5, 3) graphs, so the typical path length is correspondingly larger.