Distributed Optimization
Song Chong
EE, KAIST
songchong@kaist.edu
Dynamic Programming for Path Planning
A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links L, a weight function w: L → R+, and two nodes s, t ∈ N. The goal is to find a directed path from s to t having minimal total weight. More generally, there can be multiple destination nodes t. We are interested in distributed solutions, in which each node performs a local computation, with access only to the state of its neighbors.
Principle of optimality: if node x lies on a shortest path from s to t, then the portion of the path from s to x (respectively, from x to t) must also be a shortest path between s and x (resp., x and t). This allows an incremental divide-and-conquer procedure, also known as dynamic programming.
Let h*(i) be the shortest distance from any node i to the goal t. Then the shortest distance from i to t via a node j neighboring i is given by f(i, j) = w(i, j) + h*(j), and h*(i) = min_j f(i, j), which is known as Bellman's optimality equation.
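As a concrete illustration, the fixed point of Bellman's equation can be computed by sweeping h(i) ← min_j [w(i, j) + h(j)] over all nodes until the values stop changing. The sketch below uses a made-up three-node graph; the function name and graph are illustrative, not from the slides.

```python
INF = float("inf")

def shortest_distances(nodes, weights, goal):
    """Iterate Bellman's equation h(i) <- min_j [w(i,j) + h(j)] to a fixed point."""
    h = {i: (0.0 if i == goal else INF) for i in nodes}
    for _ in range(len(nodes)):          # n sweeps suffice for convergence
        for i in nodes:
            if i == goal:
                continue
            candidates = [w + h[j] for (a, j), w in weights.items() if a == i]
            if candidates:
                h[i] = min(candidates)
    return h

# Example graph: s -> x -> t (total weight 2) and a direct s -> t link of weight 5.
weights = {("s", "x"): 1.0, ("x", "t"): 1.0, ("s", "t"): 5.0}
h = shortest_distances(["s", "x", "t"], weights, "t")
```

After convergence, h coincides with h*: the cheaper two-hop route through x is preferred over the direct link.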
Asynchronous Dynamic Programming
Asynchronous Dynamic Programming
One can prove that the ASYNCHDP procedure is guaranteed to converge to the true values; that is, h will converge to h*. Moreover, worst-case convergence requires n iterations.
For realistic problems where n is large, however, not only can convergence be slow, but this procedure requires n processes (or agents), since the procedure assumes an agent for each node. So to be practical we turn to heuristic versions of the procedure, which require a smaller number of agents.
Learning Real-Time A* (LRTA*)
In the learning real-time A*, or LRTA*, algorithm, the agent starts at a given node, performs an operation similar to that of asynchronous dynamic programming, moves to the neighboring node with the shortest estimated distance to the goal, and repeats.
Assume that the set of nodes is finite, that all weights w(i, j) are positive and finite, and that there exists some path from every node in the graph to a goal node.
Note that this procedure uses a given heuristic function h(·) that serves as the initial value for each newly encountered node. To guarantee certain properties of LRTA*, we must assume that h is admissible, which means that h never overestimates the distance to the goal, that is, h(i) ≤ h*(i). One can ensure admissibility by setting h(i) = 0 for all i, although less conservative admissible heuristic functions (built using knowledge of the problem domain) can speed up convergence to the optimal solution.
LRTA*
With these assumptions, LRTA* has the following properties:
- The h-values never decrease, and remain admissible.
- LRTA* terminates; the complete execution from the start node to termination at the goal node is called a trial.
- If LRTA* is repeated while maintaining the h-values from one trial to the next, it eventually discovers the shortest path from the start to a goal node.
- If LRTA* finds the same path on two sequential trials, this is the shortest path. (However, this path may also be found in one or more previous trials before it is found twice in a row.)
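The trial loop above can be sketched in a few lines. The graph, node names, and weights below are invented for illustration; h starts empty, which corresponds to the admissible heuristic h(i) = 0 for newly encountered nodes.

```python
import random  # only used to break ties among equally good successors

def lrta_trial(neighbors, w, h, start, goal):
    """One LRTA* trial: update h at the current node, then move greedily."""
    i, path = start, [start]
    while i != goal:
        # f(i,j) = w(i,j) + h(j) for each neighbor j
        f = {j: w[(i, j)] + h.get(j, 0.0) for j in neighbors[i]}
        h[i] = min(f.values())               # asynchronous-DP-style update
        # Move to a neighbor with minimal estimated distance to the goal.
        i = random.choice([j for j, v in f.items() if v == h[i]])
        path.append(i)
    return path

neighbors = {"s": ["a", "t"], "a": ["t"]}
w = {("s", "a"): 1.0, ("a", "t"): 1.0, ("s", "t"): 5.0}
h, prev = {}, None
while True:                                   # repeat trials, keeping h
    path = lrta_trial(neighbors, w, h, "s", "t")
    if path == prev:                          # same path on two sequential
        break                                 # trials: it is the shortest path
    prev = path
```

On this graph the second trial repeats the first, so the loop stops with the shortest path s → a → t and h(s) raised to the true distance 2.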
LRTA* in Action
LRTA*(2) in Action
LRTA* is a centralized procedure. However, rather than have a single agent execute this procedure, one can have multiple agents execute it.
The properties of the algorithm (call it LRTA*(n), with n agents) are not altered, but convergence to the shortest path can be sped up dramatically. First, if the agents break ties differently, some will reach the goal much faster than others. Second, if they all have access to a shared h-value table, the learning of one agent can teach the others. Specifically, after every round and for every i, h(i) = max_j h_j(i), where h_j(i) is agent j's updated value for h(i).
Action Selection in Multiagent MDPs
Recall that in a single-agent Markov Decision Process (MDP), the optimal policy π* is characterized by the Bellman optimality equations:
Q^{π*}(s, a) = r(s, a) + γ Σ_{s'} P^a_{ss'} V^{π*}(s')
V^{π*}(s) = max_a Q^{π*}(s, a)
These equations turn into an algorithm, specifically the dynamic-programming-style value iteration algorithm.
However, in real-world applications the situation is not that simple:
- The MDP may not be known by the planning agent and thus may have to be learned (this case is discussed in Chapter 7).
- The MDP may simply be too large to iterate over all instances of the equations.
Action Selection in Multiagent MDPs
Consider a multiagent MDP with some modularity of actions, where the (global) action a is a vector of local actions (a_1, ..., a_n), one by each of n agents. The assumption here is that the reward is common, so there is no issue of competition among the agents. There is not even a problem of coordination; we have the luxury of a central planner.
Suppose that the Q-values for the optimal policy, Q^{π*}, have already been computed. Then the optimal policy is easily recovered; the optimal action in state s is argmax_a Q^{π*}(s, a). But if a ranges over an exponential number of choices by all agents, "easy" becomes hard. The question is: can we do better than naively enumerating over all action combinations by the agents?
In general the answer is no, but in practice the interaction among the agents' actions can be quite limited, which can be exploited both in the representation of the Q-function and in the maximization process. Specifically, in some cases we can associate an individual Q_i function with each agent i, and express the Q-function (either precisely or approximately) as a linear sum of the individual Q_i's:
Q(s, a) = Σ_{i=1}^n Q_i(s, a), so that argmax_a Q(s, a) = argmax_a Σ_{i=1}^n Q_i(s, a)
This in and of itself is not very useful. However, it is often also the case that each individual Q_i depends only on a small subset of the variables.
Action Selection in Multiagent MDPs
Q(a_1, a_2, a_3, a_4) = Q_1(a_1, a_2) + Q_2(a_2, a_4) + Q_3(a_1, a_3) + Q_4(a_3, a_4)
argmax_{a_1, a_2, a_3, a_4} [Q_1(a_1, a_2) + Q_2(a_2, a_4) + Q_3(a_1, a_3) + Q_4(a_3, a_4)]
Action Selection in Multiagent MDPs
By employing a variable elimination algorithm, one can compute a conditional strategy from the following set of equations:
a_1* = argmax_{a_1} e_2(a_1)
a_2* = argmax_{a_2} [Q_1(a_1*, a_2) + e_3(a_1*, a_2)]
a_3* = argmax_{a_3} [Q_3(a_1*, a_3) + e_4(a_2*, a_3)]
a_4* = argmax_{a_4} [Q_2(a_2*, a_4) + Q_4(a_3*, a_4)]
where
e_2(a_1) = max_{a_2} [Q_1(a_1, a_2) + e_3(a_1, a_2)]
e_3(a_1, a_2) = max_{a_3} [Q_3(a_1, a_3) + e_4(a_2, a_3)]
e_4(a_2, a_3) = max_{a_4} [Q_2(a_2, a_4) + Q_4(a_3, a_4)]
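The elimination order above (a_4, then a_3, then a_2) can be executed directly on tables. The numeric Q_i tables below are invented for illustration; a brute-force maximization over all 16 joint actions checks that elimination recovers the same maximum.

```python
import itertools

# Binary local actions and made-up factored Q-tables matching the slide:
# Q = Q1(a1,a2) + Q2(a2,a4) + Q3(a1,a3) + Q4(a3,a4)
A = [0, 1]
Q1 = {(a1, a2): 2.0 * a1 - a2 for a1 in A for a2 in A}
Q2 = {(a2, a4): float(a2 + a4) for a2 in A for a4 in A}
Q3 = {(a1, a3): float(a1 * a3) for a1 in A for a3 in A}
Q4 = {(a3, a4): 1.0 - a3 * a4 for a3 in A for a4 in A}

# Eliminate a4, then a3, then a2, exactly as in the e-equations.
e4 = {(a2, a3): max(Q2[a2, a4] + Q4[a3, a4] for a4 in A) for a2 in A for a3 in A}
e3 = {(a1, a2): max(Q3[a1, a3] + e4[a2, a3] for a3 in A) for a1 in A for a2 in A}
e2 = {a1: max(Q1[a1, a2] + e3[a1, a2] for a2 in A) for a1 in A}
best = max(e2.values())

# Naive enumeration over all joint actions, for comparison.
brute = max(Q1[a1, a2] + Q2[a2, a4] + Q3[a1, a3] + Q4[a3, a4]
            for a1, a2, a3, a4 in itertools.product(A, repeat=4))
```

Elimination touches only small tables (at most two variables each) instead of the full joint action space, which is where the exponential savings come from.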
Infinite-Horizon Dynamic Programming
A model for sequential decision-making problems under dynamics.
Markov Decision Process (MDP), a special case of infinite-horizon DP:
- Set of possible states s ∈ S
- Set of possible actions a ∈ A
- Transition probability P^a_{ss'} = Pr(s' | s, a)
- Deterministic policy function a = π(s)
- Reward function r = r(s, a) with discount factor γ ∈ [0, 1)
- Episode τ = (s_0, a_0, s_1, a_1, s_2, a_2, s_3, a_3, s_4, ...)
- Return R^π(τ) = r_0 + γ r_1 + γ^2 r_2 + γ^3 r_3 + ...
- Expected return R^π = E_τ[R^π(τ)] = E_τ[Σ_{t=0}^∞ γ^t r_t]
Two Main Tasks in DP
Value prediction: evaluate how good a policy π is for given s or (s, a)
- State-value function: V^π(s) = E_τ[R^π(τ) | s_0 = s] = r(s, π(s)) + γ Σ_{s'} P^{π(s)}_{ss'} V^π(s'), s ∈ S
- Action-value function (Q-factor function): Q^π(s, a) = E_τ[R^π(τ) | s_0 = s, a_0 = a] = r(s, a) + γ Σ_{s'} P^a_{ss'} V^π(s') = r(s, a) + γ Σ_{s'} P^a_{ss'} Q^π(s', π(s')), s ∈ S, a ∈ A
Policy optimization: find the optimal policy π* maximizing R^π
- π* = argmax_π V^π(s), s ∈ S
- π*(s) = argmax_{a ∈ A} [r(s, a) + γ Σ_{s'} P^a_{ss'} V^{π*}(s')] = argmax_{a ∈ A} Q^{π*}(s, a), s ∈ S
How to Find π* in DP
Value Iteration (VI) method
Suppose that P^a_{ss'} and r(s, a) are known.
Bellman's optimality equation for Q^{π*}, known as the principle of optimality:
Q^{π*} = T(Q^{π*}), where T is a contraction, written as, for s ∈ S, a ∈ A,
Q^{π*}(s, a) = r(s, a) + γ Σ_{s'} P^a_{ss'} V^{π*}(s') = r(s, a) + γ Σ_{s'} P^a_{ss'} max_{a' ∈ A} Q^{π*}(s', a')
Solve Bellman's equation by value iteration, starting with an arbitrary Q_0:
Q_{k+1} = T(Q_k), k = 0, 1, ...; Q_k → Q^{π*}
Find the optimal policy π* by maximization:
π*(s) = argmax_{a ∈ A} Q^{π*}(s, a), s ∈ S
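The VI recursion Q_{k+1} = T(Q_k) can be sketched on a tiny two-state, two-action MDP. The transition probabilities and rewards below are invented for illustration (action a deterministically moves to state a; only state 1 pays a reward of 2).

```python
S, A = [0, 1], [0, 1]
gamma = 0.9
# P[s][a] is a distribution over next states; r[s][a] is the immediate reward.
P = {0: {0: [1.0, 0.0], 1: [0.0, 1.0]},
     1: {0: [1.0, 0.0], 1: [0.0, 1.0]}}
r = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}

def T(Q):
    """Bellman optimality operator (a contraction with modulus gamma)."""
    return {s: {a: r[s][a] + gamma * sum(P[s][a][s2] * max(Q[s2].values())
                                         for s2 in S)
                for a in A} for s in S}

Q = {s: {a: 0.0 for a in A} for s in S}   # arbitrary Q_0
for _ in range(500):                      # Q_{k+1} = T(Q_k)
    Q = T(Q)
pi = {s: max(A, key=lambda a: Q[s][a]) for s in S}   # greedy extraction of pi*
```

Here V*(1) solves V = 2 + γV, giving 20, and V*(0) = 1 + γ·20 = 19, so the greedy policy picks action 1 in both states.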
How to Find π* in DP
Policy Iteration (PI) method
Policy evaluation (critic) phase
- For a given policy π, compute the Q-factor Q^π by value iteration: Q_{k+1} = T_π(Q_k), k = 0, 1, ...; Q_k → Q^π, where T_π is a contraction, written as, for s ∈ S, a ∈ A, Q^π(s, a) = r(s, a) + γ Σ_{s'} P^a_{ss'} Q^π(s', π(s'))
Policy improvement (actor) phase
- For a given Q-factor Q^π, compute a new policy π' by maximization: π'(s) = argmax_{a ∈ A} Q^π(s, a), s ∈ S
Policy improvement theorem
- Q^π(s, π'(s)) = max_{a ∈ A} Q^π(s, a) ≥ Q^π(s, π(s)) = V^π(s), s ∈ S ⟹ V^{π'}(s) ≥ V^π(s), s ∈ S
The critic (evaluation of Q^π) and actor (improvement to π') phases alternate until the policy stops changing.
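The critic/actor alternation can be sketched on the same style of tiny MDP used above (all numbers are invented for illustration):

```python
S, A, gamma = [0, 1], [0, 1], 0.9
P = {0: {0: [1.0, 0.0], 1: [0.0, 1.0]},
     1: {0: [1.0, 0.0], 1: [0.0, 1.0]}}
r = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}

def evaluate(pi, sweeps=500):
    """Critic: iterate the contraction T_pi to approximate Q^pi."""
    Q = {s: {a: 0.0 for a in A} for s in S}
    for _ in range(sweeps):
        Q = {s: {a: r[s][a] + gamma * sum(P[s][a][s2] * Q[s2][pi[s2]]
                                          for s2 in S)
                 for a in A} for s in S}
    return Q

pi = {s: 0 for s in S}                    # arbitrary initial policy
while True:
    Q = evaluate(pi)                      # policy evaluation (critic)
    new_pi = {s: max(A, key=lambda a: Q[s][a]) for s in S}
    if new_pi == pi:                      # policy improvement (actor)
        break                             # stable policy: pi = pi*
    pi = new_pi
```

On this MDP a single improvement step already reaches the optimal policy (action 1 in both states), and the next iteration confirms it is stable.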
From DP to RL
DP requires solving Bellman's equation to get Q^{π*} or Q^π:
- VI: Q^{π*}(s, a) = r(s, a) + γ Σ_{s'} P^a_{ss'} max_{a' ∈ A} Q^{π*}(s', a'), s ∈ S, a ∈ A
- PI: Q^π(s, a) = r(s, a) + γ Σ_{s'} P^a_{ss'} Q^π(s', π(s')), s ∈ S, a ∈ A
P^a_{ss'} and r(s, a) are unknown in practice => curse of modeling (uncertainty)
Cardinalities of the sets S and A are extremely high => curse of dimensionality
Categories of RL algorithms based on what to learn:
- Model-based RL: learn the models P^a_{ss'} and r(s, a) from samples to compute Q^{π*} or Q^π
- Value-based RL: learn the value functions Q^{π*} or Q^π directly from samples
- Policy search RL: search for the optimal policy function π* based on samples
Summation over a huge number of states can be approximated by Monte Carlo estimation and stochastic approximation techniques.
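As a value-based RL sketch, tabular Q-learning estimates Q^{π*} from (s, a, r, s') samples alone, never touching P or r directly, using a stochastic-approximation update toward the Bellman optimality target. The tiny environment below is invented for illustration (it matches the two-state MDP used in the DP examples).

```python
import random

random.seed(0)
S, A, gamma, alpha = [0, 1], [0, 1], 0.9, 0.1

def step(s, a):
    """Environment simulator: the learner only observes (s', r) samples."""
    s2 = a                                # action a deterministically moves to state a
    r = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}[s][a]
    return s2, r

Q = {s: {a: 0.0 for a in A} for s in S}
s = 0
for _ in range(20000):
    a = random.choice(A)                  # exploratory behavior policy
    s2, r = step(s, a)
    # Move Q(s,a) a small step toward the sampled Bellman optimality target.
    target = r + gamma * max(Q[s2].values())
    Q[s][a] += alpha * (target - Q[s][a])
    s = s2
```

With enough samples the table approaches the same Q^{π*} that value iteration computes from the known model (Q*(1,1) = 20, Q*(0,1) = 19).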
Auction-Like Distributed Optimization
Consider two classical optimization problems:
- Assignment problem (weighted matching problem in a bipartite graph)
- Scheduling problem
The aim is to derive distributed problem solving (optimization) that has a certain economic flavor, in particular auction-style solutions.
The assignment problem is a linear programming (LP) problem, which is relatively easy to solve (specifically, solvable in polynomial time), and admits an auction-like optimization procedure with tight guarantees.
The scheduling problem is an integer programming (IP) problem, which is more complex (specifically, NP-complete), and also admits an auction-like optimization procedure, but it does not come with such guarantees.
Assignment Problem and Linear Programming
Definition 2.3.1 (Assignment problem) An assignment problem consists of
- a set N of n agents,
- a set X of n objects,
- a set M ⊆ N × X of possible assignment pairs, and
- a function v: M → R giving the value of each assignment pair.
An assignment is a set of pairs S ⊆ M such that each agent i ∈ N and each object j ∈ X appears in at most one pair in S. A feasible assignment is one in which all agents are assigned an object. A feasible assignment S is optimal if it maximizes Σ_{(i,j)∈S} v(i, j).
An assignment problem can be encoded as a linear program with fractional matches (i.e., 0 ≤ x_{i,j} ≤ 1). Any linear program can be solved in polynomial time.
Assignment Problem and Linear Programming
Lemma 2.3.2 The LP encoding of the assignment problem has at least one integral solution, such that for every i, j it is the case that x_{i,j} = 0 or x_{i,j} = 1. Furthermore, any optimal fractional solution can be converted in polynomial time to an optimal integral solution.
Corollary 2.3.3 The assignment problem can be solved in polynomial time.
However, the polynomial-time solution to the LP problem has complexity roughly O(n^3), which may be too high in some cases. Furthermore, the solution is not parallelizable.
One alternative is based on the economic notion of competitive equilibrium.
Assignment Problem and Competitive Equilibrium
Imagine that each of the objects in X has an associated price; the price vector is p = (p_1, ..., p_n), where p_j is the price of object j. Given an assignment S ⊆ M and a price vector p, define the utility from an assignment of j to agent i as u(i, j) = v(i, j) − p_j. An assignment and a set of prices are in competitive equilibrium when each agent is assigned the object that maximizes his utility given the current prices.
Definition 2.3.4 (Competitive equilibrium) A feasible assignment S and a price vector p are in competitive equilibrium when for every pairing (i, j) ∈ S it is the case that, for all k, u(i, j) ≥ u(i, k).
Theorem 2.3.5 If a feasible assignment S and a price vector p satisfy the competitive equilibrium condition, then S is an optimal assignment. Furthermore, for any optimal solution S, there exists a price vector p such that p and S satisfy the competitive equilibrium condition.
This theorem means that one way to search for solutions of the LP is to search the space of competitive equilibria. A natural way to search that space involves auction-like procedures, in which the individual agents bid for the different resources in a pre-specified way.
Connection between Optimization and Competitive Equilibrium
The primal problem of an LP has a meaningful economic interpretation, namely a production economy. Each product consumes a certain amount of each resource, and each product is sold at a certain price. Interpret x_i as the amount of product i produced, c_i as the price of product i, b_j as the available amount of resource j, and a_{ij} as the amount of resource j needed to produce a unit of product i. The optimization problem can then be interpreted as profit maximization, with constraints capturing the limitation on resources.
Connection between Optimization and Competitive Equilibrium
In the dual problem of an LP, y_i can be given a meaningful economic interpretation, namely as the marginal value of resource i, also known as its shadow price. The shadow price captures the sensitivity of the optimal solution to a small change in the availability of that particular resource, holding everything else constant. To be precise, the shadow price is the value of the Lagrange multiplier at the optimal solution.
A Naive Auction Algorithm
Theorem 2.3.6 The naive algorithm terminates only at a competitive equilibrium.
(Example: X = {x_1, x_2, x_3}, N = {1, 2, 3})
A Naive Auction Algorithm
The naive auction algorithm may fail to terminate. This can occur when more than one object offers maximal value for a given agent; in this case the agent's bid increment will be zero.
(Example: X = {x_1, x_2, x_3}, N = {1, 2, 3})
A Terminating Auction Algorithm
To remedy the flaw, we must ensure that prices continue to increase when objects are contested by a group of agents. The extension is quite straightforward: we add a small amount ε to the bidding increment. Otherwise, the algorithm is as stated earlier.
Because the prices must increase by at least ε at every round, the competitive equilibrium property is no longer preserved over the iterations; agents may overbid on some objects.
Definition 2.3.7 (ε-competitive equilibrium) S and p satisfy ε-competitive equilibrium when for each i ∈ N, if there exists a pair (i, j) ∈ S, then, for all k, u(i, j) + ε ≥ u(i, k).
Theorem 2.3.8 A feasible assignment S with n objects that forms an ε-competitive equilibrium with some price vector is within nε of optimal.
A Terminating Auction Algorithm
Corollary 2.3.9 Consider an assignment problem with an integer valuation function v: M → Z. If ε < 1/n, then any feasible assignment found by the terminating auction algorithm will be optimal.
One can show that the algorithm indeed terminates, and that its running time is O(n^2 max_{i,j} v(i, j) / ε).
Observe that if ε = O(1/n), the algorithm's running time is O(n^3 k), where k is a constant that does not depend on n, yielding worst-case performance similar to the LP solution approach.
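A minimal sketch of the terminating auction on a two-agent, two-object instance (the valuation matrix is invented for illustration): each unassigned agent bids for its best object, raising the price by the gap between its best and second-best utilities plus ε, so prices always strictly increase on contested objects.

```python
# Invented 2x2 valuation matrix v(i, j); agent 0 slightly prefers object 0,
# but the optimal assignment gives object 0 to agent 1 (total value 9 + 6 = 15).
v = {(0, 0): 10.0, (0, 1): 6.0,
     (1, 0): 9.0,  (1, 1): 2.0}
n, eps = 2, 0.25
p = [0.0, 0.0]                       # object prices
owner = {}                           # object -> agent currently holding it
assigned = {}                        # agent -> object

unassigned = [0, 1]
while unassigned:
    i = unassigned.pop()
    # Agent i bids for the object maximizing utility u(i,j) = v(i,j) - p_j.
    utils = {j: v[i, j] - p[j] for j in range(n)}
    j = max(utils, key=utils.get)
    u_sorted = sorted(utils.values(), reverse=True)
    # Bid increment: best-minus-second-best utility, plus eps to force progress.
    p[j] += (u_sorted[0] - u_sorted[1]) + eps
    if j in owner:                   # the previous owner is outbid
        prev = owner[j]
        del assigned[prev]
        unassigned.append(prev)
    owner[j], assigned[i] = i, j
```

With nε = 0.5 and integer-gap valuations, the ε-competitive equilibrium reached here is in fact the optimal assignment (agent 0 takes object 1, agent 1 takes object 0).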
Scheduling Problem and Integer Programming
Definition 2.3.10 (Scheduling problem) A scheduling problem consists of a tuple C = (N, X, q, v), where
- N is a set of n agents;
- X is a set of m discrete and consecutive time slots;
- q = (q_1, ..., q_m) is a reserve price vector, where q_j is the reserve value for time slot x_j; q can be thought of as the value the owner of the resource places on each slot;
- v = (v_1, ..., v_n), where v_i, the valuation function of agent i, is a function over possible allocations of time slots, parameterized by two arguments: d_i, the deadline of agent i, and λ_i, the number of time slots required by agent i. Thus each allocation F_i ∈ 2^X of agent i has a well-defined value v_i(F_i).
Scheduling Problem and Integer Programming
A solution to a scheduling problem is a vector F = (F_0, F_1, ..., F_n), where F_i is the set of time slots assigned to agent i, and F_0 is the set of time slots that are not assigned. The value of a solution is defined as
V(F) = Σ_{j: x_j ∈ F_0} q_j + Σ_{i∈N} v_i(F_i)
The scheduling problem is inherently more complex than the assignment problem. Specifically, the scheduling problem is NP-complete, whereas the assignment problem is in P.
The scheduling problem can be defined much more broadly. It could involve earliest start times as well as deadlines, could require contiguous blocks of time for a given agent, could involve more than one resource, and so on.
Scheduling Problem and Integer Programming
A scheduling problem can be encoded as an integer program in which, for every subset S ⊆ X, a Boolean variable x_{i,S} represents the fact that agent i was allocated the bundle S, with v_i(S) his valuation for that bundle.
IPs are not in general solvable in polynomial time. However, it turns out that a generalization of the auction-like procedure can be applied in this case too. The price we pay is that the generalized algorithm does not come with the same guarantees that we had in the case of the assignment problem.
Scheduling Problem and Generalized Competitive Equilibrium
Definition 2.3.11 (Competitive equilibrium, generalized form) Given a scheduling problem, a solution F is in competitive equilibrium at prices p if and only if
- for all i ∈ N, F_i = argmax_{T ⊆ X} (v_i(T) − Σ_{j: x_j ∈ T} p_j) (the set of time slots allocated to agent i maximizes his surplus at prices p);
- for all j such that x_j ∈ F_0, p_j = q_j (the price of every unallocated time slot is the reserve price); and
- for all j such that x_j ∉ F_0, p_j ≥ q_j (the price of every allocated time slot is greater than or equal to the reserve price).
Theorem 2.3.12 If a solution F to a scheduling problem C is in competitive equilibrium at prices p, then F is also optimal for C.
Proof idea: show that the value of F is at least the value of any other solution F'.
Scheduling Problem and Generalized Competitive Equilibrium V F = q j + v i (F i ) j x j F 0 i N = p j + v i (F i ) j x j F 0 i N = p j + j x j X i N p j + j x j X i N v i F i j x j F i p j v i F i j x j F i p j j x p j + v j F i F i 0 i N j x q j + v j F i F i = V(F ) 0 i N Theorem 2.3.13 A scheduling problem has a competitive equilibrium solution if and only if the LP relaxation of the associated integer program has a integer solution 31
Ascending-Auction Algorithm
The best-known distributed protocol for finding a competitive equilibrium is the so-called ascending-auction algorithm. In this protocol, the center advertises an ask price, and the agents bid the ask price for bundles of time slots that maximize their surplus at the given ask prices. This process repeats until there is no change.
Let b = (b_1, ..., b_m) be the bid price vector, where b_j is the highest bid so far for time slot x_j ∈ X. Let F = (F_1, ..., F_n) be the set of allocated slots for each agent. Finally, let ε be the price increment.
The ascending-auction algorithm is very similar to the assignment problem auction, with one notable difference: the bid increment is always the constant ε.
It is possible for the ascending-auction algorithm to not converge to an equilibrium, no matter how small the increment is. Even if the algorithm converges, we have no guarantee that it converges to an optimal solution; moreover, we cannot even bound how far the solution is from optimal.
One property we can guarantee, however, is termination. The algorithm must terminate, giving us the worst-case running time O(n max_F Σ_{i∈N} v_i(F_i) / ε).
Ascending-Auction Algorithm
Totally Asynchronous Iterative Algorithms
Let X_1, X_2, ..., X_n be given sets, and let X be the Cartesian product X = X_1 × X_2 × ... × X_n. For x ∈ X, we write x = (x_1, x_2, ..., x_n), where x_i ∈ X_i for all i. Let f_i: X → X_i be given functions, and let f: X → X be the function defined by f(x) = (f_1(x), f_2(x), ..., f_n(x)) for x ∈ X.
The problem is to find a fixed point of f, that is, an element x* ∈ X with x* = f(x*) or, equivalently, x_i* = f_i(x*) for all i.
There is a set of times T = {0, 1, 2, ...}, and let T^i ⊆ T be the set of times at which x_i is updated by agent i.
We assume that each agent i updating x_i may not have access to the most recent values of the components of x; thus
x_i(t+1) = f_i(x_1(τ^i_1(t)), ..., x_n(τ^i_n(t))) for t ∈ T^i, and x_i(t+1) = x_i(t) for t ∉ T^i,
where the τ^i_j(t) are times satisfying 0 ≤ τ^i_j(t) ≤ t for t ∈ T.
Asynchronous Convergence Theorem
Assumption 1.1 (Total asynchronism) The sets T^i are infinite, and if {t_k} is a sequence of elements of T^i that tends to infinity, then lim_{k→∞} τ^i_j(t_k) = ∞ for every j.
Assumption 2.1 There is a sequence of nonempty sets {X(k)} with X(k+1) ⊆ X(k) ⊆ X satisfying two conditions:
(a) (Synchronous convergence condition) We have f(x) ∈ X(k+1) for all k and x ∈ X(k). Furthermore, if {y_k} is a sequence such that y_k ∈ X(k) for every k, then every limit point of {y_k} is a fixed point of f.
(b) (Box condition) For every k, there exist sets X_i(k) ⊆ X_i such that X(k) = X_1(k) × X_2(k) × ... × X_n(k).
Proposition 2.1 (Asynchronous convergence theorem) If Assumptions 1.1 and 2.1 hold, and the initial solution estimate x(0) = (x_1(0), ..., x_n(0)) belongs to the set X(0), then every limit point of {x(t)} is a fixed point of f.
If X = R^n, X_i = R for all i, and f: R^n → R^n is a contraction mapping w.r.t. some weighted maximum norm ‖·‖_w with modulus α, Assumption 2.1 is satisfied with the sets {X(k)} defined by X(k) = {x ∈ R^n : ‖x − x*‖_w ≤ α^k ‖x(0) − x*‖_w}.
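The contraction case of the theorem can be sketched numerically: below, a made-up linear map with maximum-norm modulus 1/2 and fixed point (2, 2) is iterated asynchronously, with each update using a possibly stale value of the other component, and the iterates still converge to the fixed point.

```python
import random

def f(x):
    """A maximum-norm contraction with modulus 1/2; fixed point x* = (2, 2)."""
    return [0.5 * x[1] + 1.0, 0.5 * x[0] + 1.0]

random.seed(0)
x = [10.0, -10.0]
history = [[x[0]], [x[1]]]               # all past values of each component
for t in range(1000):
    i = random.randrange(2)              # only component i updates at time t
    j = 1 - i
    # Use an outdated value of the other component (bounded delay, as in
    # total asynchronism: tau^i_j(t) <= t but may lag a few steps behind).
    stale = random.choice(history[j][-3:])
    pair = [0.0, 0.0]
    pair[i], pair[j] = x[i], stale
    x[i] = f(pair)[i]
    history[0].append(x[0])
    history[1].append(x[1])
```

Because the delays are bounded and f contracts the maximum-norm distance to x* by 1/2 per effective update, the error shrinks geometrically despite the stale reads, as the nested-sets construction X(k) predicts.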