CS711008Z Algorithm Design and Analysis


1. CS711008Z Algorithm Design and Analysis. Lecture 9: Lagrangian duality and SVM. Dongbo Bu, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.

2. Outline
Classification problem and the maximum margin strategy;
Solving the maximum margin problem using Lagrangian duality;
The SMO technique;
Kernel tricks.

3. Classification problem and maximum margin strategy

4. Classification problem
Given a set of samples with their category labels, denoted as $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$, $y_i \in \{-1, +1\}$, the goal of the classification problem is to find an appropriate function $f(x)$ that describes the dependency between $y_i$ and $x_i$; thus, for a new sample $x$, we can infer its category based on $f(x)$.
A great variety of classification algorithms have been designed, including Fisher's linear discriminant, logistic regression, decision trees, neural networks, and SVM.

5. Linear classifier
Unlike decision trees, SVM adopts a classifier of the following type: if $f(x) > 0$ then $y = +1$; if $f(x) < 0$ then $y = -1$.
Let's first restrict $f(x)$ to be linear, i.e., $f(x) = \omega^T x + b$.
The hyperplane $\omega^T x + b = 0$ is called the separating hyperplane.

6. The objective of the training procedure is to find a setting of $\omega$ and $b$ such that all samples in the training set are correctly labelled by the classifier. We will consider tolerating a few mislabelled samples later.

7. Maximum margin strategy
There are always multiple settings of $\omega$ and $b$ for which the corresponding classifier works perfectly on all samples. Which one should we use? We prefer the one that maximizes the margin between the positive and negative samples: the wider the margin, the better the generalization performance on new samples. Thus, we need to solve the following optimization problem:
$$\max_{\omega, b} \frac{2}{\|\omega\|} \quad \text{s.t.} \quad y_i(\omega^T x_i + b) - 1 \ge 0, \quad i = 1, 2, \dots, n$$
Note: the restriction $f(x) > 0$ for a positive sample $x$ is implemented as $f(x) \ge 1$. The distance from any point $x$ to the hyperplane $\omega^T x + b = 0$ is $\frac{|\omega^T x + b|}{\|\omega\|}$; thus, the margin is $\frac{2}{\|\omega\|}$.

8. An equivalent form with a quadratic objective function
An equivalent form is:
$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) - 1 \ge 0, \quad i = 1, 2, \dots, n$$
Question: how do we solve this optimization problem subject to inequality constraints? Of course we could solve this problem (called the primal problem hereafter) directly using convex quadratic programming techniques; however, considering its dual problem will bring great benefits. Let's review the conditions for an optimal solution first.
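To make this concrete, the primal problem can be handed to an off-the-shelf QP solver. Below is a minimal sketch assuming cvxpy is available; the four data points are made up for illustration.

```python
import numpy as np
import cvxpy as cp

# Toy separable data (hypothetical): two positive and two negative samples in R^2.
X = np.array([[2.0, 2.0], [2.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
# Primal hard-margin SVM: min (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print(w.value, b.value)   # expect roughly w = (1, 0), b = -1: the hyperplane x^(1) = 1
```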

9. Lagrangian multiplier method, KKT conditions, and Lagrangian dual

10. A brief introduction to Lagrangian multipliers and KKT conditions
It is relatively easy to optimize an objective function without any constraint, say:
$$\min f(x)$$
But how do we optimize an objective function with equality constraints?
$$\min f(x) \quad \text{s.t.} \quad g_i(x) = 0, \ i = 1, 2, \dots, m$$
And how do we optimize an objective function with both equality and inequality constraints?
$$\min f(x) \quad \text{s.t.} \quad g_i(x) = 0, \ i = 1, 2, \dots, m; \quad h_i(x) \le 0, \ i = 1, 2, \dots, p$$

11. Lagrangian multiplier: under equality constraints
Consider the following optimization problem:
$$\min f(x, y) \quad \text{s.t.} \quad g(x, y) = 0$$
Intuition: suppose $(x^*, y^*)$ is the optimum point. Then at $(x^*, y^*)$, $f(x, y)$ does not change when we walk along the curve $g(x, y) = 0$; otherwise, we could follow the curve to make $f(x, y)$ smaller, meaning that the starting point $(x^*, y^*)$ is not optimal.
[Figure: the curve $g(x, y) = 0$ crossing the contour $f(x, y) = 1$, with the gradients $\nabla g$ and $\nabla f$ drawn at $x^*$]

12. [Figure: contours $f(x, y) = 1$ and $f(x, y) = 2$ together with the curve $g(x, y) = 0$]
So at $(x^*, y^*)$, the red curve tangentially touches a blue contour, i.e., there exists a $\lambda$ such that:
$$\nabla f(x, y) = \lambda \nabla g(x, y)$$
Lagrange must have cleverly noticed that the equation above looks like the partial derivatives of some larger scalar function:
$$L(x, y, \lambda) = f(x, y) - \lambda g(x, y)$$
Necessary condition for an optimum point: if $(x^*, y^*)$ is a local optimum, then there exists $\lambda$ such that $\nabla L(x^*, y^*, \lambda) = 0$.

13. Understanding the Lagrangian function
Lagrangian function: a combination of the original optimization objective function and the constraints:
$$L(x, y, \lambda) = f(x, y) - \lambda g(x, y)$$
Note: $\frac{\partial L(x, y, \lambda)}{\partial \lambda} = 0$ is essentially the equality constraint $g(x, y) = 0$.
The critical points of the Lagrangian $L(x, y, \lambda)$ occur at saddle points rather than local minima (or maxima). To utilize numerical optimization techniques, we must first transform the problem such that the critical points lie at local minima; this is done by minimizing the magnitude of the gradient of the Lagrangian.
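A quick symbolic check of the multiplier condition (a sketch assuming sympy; the example problem, minimizing $x + y$ on the unit circle, is mine, not from the slides):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x + y                        # objective
g = x**2 + y**2 - 1              # equality constraint g(x, y) = 0
L = f - lam * g                  # Lagrangian L = f - lambda * g

# Stationarity in all variables: dL/dx = dL/dy = dL/dlambda = 0
sols = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(sols)   # two critical points (±1/√2, ±1/√2); the minimum is at (-1/√2, -1/√2)
```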

14. KKT conditions: under inequality constraints
Consider the following optimization problem:
$$\min f(x) \quad \text{s.t.} \quad g_i(x) \ge 0, \ i = 1, 2, \dots, m$$
[Figure: two cases. Case 1: the optimum point lies on the curve $g(x) = 0$; thus the Lagrangian condition $\nabla L(x, \lambda) = 0$ applies. Case 2: the optimum point lies strictly inside the region $g(x) \ge 0$; thus we have $\nabla f(x^*) = 0$ at $x^*$.]

15. Understanding complementary slackness
[Figure: the same two cases as on the previous slide]
These two cases are summarized by the following two conditions:
(Stationary point) $\nabla L(x^*, \lambda) = 0$
(Complementary slackness) $\lambda_i g_i(x^*) = 0, \ i = 1, 2, \dots, m$
(Note that in case 2, $g(x^*) > 0 \Rightarrow \lambda = 0 \Rightarrow \nabla f(x^*) = 0$.)

16. KKT conditions
Lagrangian:
$$L(x, \lambda) = f(x) - \sum_{i=1}^m \lambda_i g_i(x)$$
Necessary conditions for an optimum point: if $x^*$ is a local optimum satisfying some regularity conditions, then there exists $\lambda$ such that:
1. (Stationary point) $\nabla L(x^*, \lambda) = 0$
2. (Primal feasibility) $g(x^*) \ge 0$
3. (Dual feasibility) $\lambda_i \ge 0, \ i = 1, 2, \dots, m$
4. (Complementary slackness) $\lambda_i g_i(x^*) = 0, \ i = 1, 2, \dots, m$
Note: the KKT conditions are usually not solved directly in optimization; instead, iterative successive approximation is most often used. However, the final result should satisfy the KKT conditions. (The KKT conditions were named after William Karush, Harold W. Kuhn, and Albert W. Tucker.)
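These conditions can be checked mechanically on a one-dimensional toy problem (my example, assuming sympy): minimize $(x - 3)^2$ subject to $g(x) = 1 - x \ge 0$.

```python
import sympy as sp

x, lam = sp.symbols('x lambda', real=True)
f = (x - 3)**2
g = 1 - x                          # constraint g(x) >= 0, i.e. x <= 1
L = f - lam * g                    # Lagrangian in the slides' form

candidates = sp.solve([sp.Eq(sp.diff(L, x), 0),   # stationarity
                       sp.Eq(lam * g, 0)],        # complementary slackness
                      [x, lam], dict=True)
# Keep points that are also primal feasible (g >= 0) and dual feasible (lambda >= 0).
kkt = [s for s in candidates if s[lam] >= 0 and g.subs(x, s[x]) >= 0]
print(kkt)   # [{x: 1, lambda: 4}] -- the constrained minimum; the constraint is active
```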

17. Three strategies to solve an optimization problem under inequality constraints
Solve the primal problem, maintaining primal feasibility;
Solve the dual problem, maintaining dual feasibility;
Interior-point method: complementary slackness, also called orthogonality by Gomory, is essentially equivalent to strong duality, i.e., the primal and dual problems having identical objective values. A relaxation of this condition, $\lambda_i g_i(x^*) = \mu$, is used in the interior-point method to reflect the duality gap.

18. Applying the Lagrangian dual to the maximum margin problem

19. An example
Primal problem:
$$\min x \quad \text{s.t.} \quad x \ge 2, \ x \ge 0$$
Lagrangian:
$$L(x, y, z) = x - y(x - 2) - zx = 2y + x(1 - y - z)$$
Notice that when $y \ge 0$, $z \ge 0$ and $x \ge 2$, $L(x, y, z)$ is a lower bound of the primal objective function $x$.
Lagrangian dual function:
$$g(y, z) = \inf_x L(x, y, z) = \begin{cases} 2y & \text{if } 1 - y - z = 0 \\ -\infty & \text{otherwise} \end{cases}$$
Dual problem:
$$\max 2y \quad \text{s.t.} \quad y \le 1, \ y \ge 0$$
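Both sides of this tiny example can be verified numerically; a sketch assuming scipy:

```python
from scipy.optimize import linprog

# Primal: min x  s.t.  x >= 2, x >= 0  (written as -x <= -2 for linprog)
primal = linprog(c=[1], A_ub=[[-1]], b_ub=[-2], bounds=[(0, None)])
# Dual:   max 2y  s.t.  y <= 1, y >= 0  (written as min -2y)
dual = linprog(c=[-2], A_ub=[[1]], b_ub=[1], bounds=[(0, None)])
print(primal.fun, -dual.fun)   # both 2.0: no duality gap for this example
```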

20. The Lagrangian connects primal and dual
[Figure: the primal objective $x$, the Lagrangian $L(x, y)$, and the dual objective $g(y) = 2y$]
Observation: primal objective function $x$ $\ge$ Lagrangian $L(x, y)$ $\ge$ dual objective function $g(y) = 2y$ in the feasible region.

21. Lagrangian dual explanation of LP duality
Consider an LP model:
$$\min c^T x \quad \text{s.t.} \quad Ax \ge b, \ x \ge 0$$
Lagrangian:
$$L(x, \lambda, s) = c^T x - \sum_{i=1}^m \lambda_i (a_{i1}x_1 + \dots + a_{in}x_n - b_i) - \sum_{i=1}^n s_i x_i$$
Notice that the Lagrangian is a lower bound of the primal objective function, i.e., $c^T x \ge L(x, \lambda, s)$, when $\lambda \ge 0$, $s \ge 0$ and $x$ is feasible. Furthermore, we have
$$c^T x \ge L(x, \lambda, s) \ge \inf_x L(x, \lambda, s)$$
when $\lambda \ge 0$, $s \ge 0$ and $x$ is feasible. Denote the Lagrangian dual as $g(\lambda, s) = \inf_x L(x, \lambda, s)$. The above inequality can be rewritten as:
$$c^T x \ge L(x, \lambda, s) \ge g(\lambda, s)$$

22. Lagrangian dual explanation of LP duality (cont'd)
What is the Lagrangian dual $g(\lambda, s)$?
$$g(\lambda, s) = \inf_x L(x, \lambda, s) = \inf_x \left( c^T x - \sum_{i=1}^m \lambda_i (a_{i1}x_1 + \dots + a_{in}x_n - b_i) - \sum_{i=1}^n s_i x_i \right)$$
$$= \inf_x \left( \lambda^T b + (c^T - \lambda^T A - s^T)x \right) = \begin{cases} \lambda^T b & \text{if } c^T = \lambda^T A + s^T \\ -\infty & \text{otherwise} \end{cases}$$
Thus $\lambda^T b$ is a lower bound of $c^T x$ when $c^T = \lambda^T A + s^T$ with $\lambda \ge 0$, $s \ge 0$.
Note: $g(\lambda)$ is always concave even if the constraints are not convex.

23. Finding a tight bound
Now let's try to find the best lower bound of $f(x)$. The tightest lower bound $\max g(\lambda)$ can be described as:
$$\max \lambda^T b \quad \text{s.t.} \quad \lambda^T A \le c^T, \ \lambda \ge 0$$
Notes:
1. This is exactly the dual form of the LP if we replace $\lambda$ by $y$; thus we have another explanation of the dual variables $y$: the Lagrangian multipliers.
2. The Lagrangian dual is a lower bound of the primal optimum under some conditions. In addition, the Lagrangian dual is concave; thus, the dual problem is always a convex programming problem even if the primal problem is not.
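The relation can be verified on any small instance by solving both programs; a sketch assuming scipy, with made-up data:

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 1.0], [1.0, 3.0]])
b = np.array([2.0, 3.0])
c = np.array([5.0, 4.0])

# Primal: min c^T x  s.t.  Ax >= b, x >= 0   (linprog takes <=, so negate)
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 2)
# Dual:   max b^T y  s.t.  y^T A <= c^T, y >= 0
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(0, None)] * 2)
print(primal.fun, -dual.fun)   # equal optimal values (8.0): strong LP duality
```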

24. Lagrangian dual explanation of the maximum margin problem
Primal problem:
$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1, \ i \in \{1, \dots, n\}$$
Lagrangian:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i y_i (w^T x_i + b) + \sum_{i=1}^n \alpha_i$$
Notice that the Lagrangian is a lower bound of the primal objective function, i.e., $\frac{1}{2}\|w\|^2 \ge L(w, b, \alpha)$, when $\alpha \ge 0$ and $(w, b)$ is feasible. Furthermore, we have
$$\frac{1}{2}\|w\|^2 \ge L(w, b, \alpha) \ge \inf_{w, b} L(w, b, \alpha)$$
when $\alpha \ge 0$ and $(w, b)$ is feasible. Denote the Lagrangian dual as $g(\alpha) = \inf_{w, b} L(w, b, \alpha)$. The above inequality can be rewritten as:
$$\frac{1}{2}\|w\|^2 \ge L(w, b, \alpha) \ge g(\alpha)$$

25. Lagrangian dual explanation of the maximum margin problem (cont'd)
What is the Lagrangian dual $g(\alpha)$?
$$g(\alpha) = \inf_{w, b} L(w, b, \alpha)$$
To calculate the infimum of $L(w, b, \alpha)$, we set its derivatives to 0, i.e.,
$$\frac{\partial L(w, b, \alpha)}{\partial w} = w - \sum_{i=1}^n \alpha_i y_i x_i = 0$$
$$\frac{\partial L(w, b, \alpha)}{\partial b} = -\sum_{i=1}^n \alpha_i y_i = 0$$

26. Lagrangian dual function
Substituting back, the Lagrangian dual function is:
$$g(\alpha) = -\frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j (x_i^T x_j) + \sum_{i=1}^n \alpha_i$$
Thus $g(\alpha)$ is a lower bound of $\frac{1}{2}\|w\|^2$ when $\sum_{i=1}^n \alpha_i y_i = 0$ and $\alpha \ge 0$.

27. Lagrangian dual problem
Now let's find the tightest lower bound of $\frac{1}{2}\|w\|^2$, which can be calculated by solving the following Lagrangian dual problem:
$$\max \ -\frac{1}{2}\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j (x_i^T x_j) + \sum_{i=1}^n \alpha_i \quad \text{s.t.} \quad \sum_{i=1}^n \alpha_i y_i = 0, \ \alpha \ge 0$$
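The dual is a QP in $\alpha$, and small instances can be solved directly. A sketch with scipy's SLSQP (an assumption; dedicated QP or SMO solvers are the practical choice), reusing the toy data from the primal sketch above:

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [2.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j x_i^T x_j

# Maximizing g(alpha) is equivalent to minimizing (1/2) a^T G a - 1^T a.
res = minimize(lambda a: 0.5 * a @ G @ a - a.sum(),
               x0=np.zeros(len(y)),
               bounds=[(0, None)] * len(y),
               constraints=[{'type': 'eq', 'fun': lambda a: a @ y}],
               method='SLSQP')
alpha = res.x
w = (alpha * y) @ X                # stationarity: w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                  # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)     # from y_i (w^T x_i + b) = 1 on support vectors
print(w, b)                        # expect w ≈ (1, 0), b ≈ -1, matching the primal
```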

28. Two explanations of the dual variables y: L. Kantorovich vs. T. Koopmans
1. Price interpretation: constrained optimization plays an important role in economics. Dual variables are also called shadow prices (T. Koopmans), i.e., the instantaneous change in the objective function when a constraint is relaxed, or the marginal cost when a constraint is strengthened.
2. Lagrangian multiplier (L. Kantorovich): the effect of the constraints on the objective function. For example, when $b_i$ increases to $b_i + \Delta b_i$, how much does the objective value change? In fact, we have $\frac{\partial L(x, \lambda)}{\partial b_i} = \lambda_i$.

29. Explanation of the dual variables y: using Diet as an example
Optimal solution to the primal problem with $b_1 = 2000$, $b_2 = 55$, $b_3 = 800$: $x = (14.24, 2.70, 0, 0)$, whose objective value $c^T x$ equals the dual objective $y^T b$, where the dual optimum is $y = (0.0269, 0, \dots)$.
Let's make a slight change to $b$ and watch the effect on $\max c^T x$:
1. $b_1 = 2001$: $\max c^T x$ increases by $y_1 = 0.0269$.
2. $b_2 = 56$: $\max c^T x$ is unchanged, since $y_2 = 0$.
3. $b_3 = 801$: $\max c^T x$ increases by $y_3$.
(See the extra slides.)

30. Weak duality, strong duality, and Slater's conditions

31. Weak duality and strong duality
Sometimes the optimal value of the primal problem is not identical to that of the dual problem; the difference between them is called the duality gap. This case is referred to as weak duality, and the opposite case as strong duality. Conditions under which strong duality holds are called constraint qualifications.

32. Slater's condition for strong duality
The best-known sufficient condition for strong duality is:
The primal problem is a convex program, i.e.,
$$\min f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \ i = 1, 2, \dots, m; \quad Ax = b$$
where $f(x)$ and $g_i(x)$ are convex, and
Slater's condition holds, i.e., there exists some strictly feasible point $x \in \operatorname{relint}(D)$ such that $Ax = b$ and $g_i(x) < 0, \ i = 1, 2, \dots, m$.
Here $D$ represents the feasible domain of the primal problem, and $\operatorname{relint}(D)$ denotes the relative interior of $D$.

33. Slater's condition for strong duality (cont'd)
Slater's condition can be weakened when the constraints of the primal problem are affine. Consider the following convex programming problem:
$$\min f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \ i = 1, 2, \dots, m; \quad Ax = b$$
where $f(x)$ and $g_i(x)$ are convex.
Refined Slater's condition: if the $g_i(x)$ are affine, the inequalities are no longer required to be strict, i.e., it suffices that there exists some feasible point $x \in \operatorname{relint}(D)$ such that $Ax = b$ and $g_i(x) \le 0, \ i = 1, 2, \dots, m$.

34. Strong duality in SVM
For the maximum margin problem, the dual problem has an identical optimal objective value: the objective is convex and the constraints are affine, so the refined Slater condition holds.

35. Kernel method to overcome linear inseparability

36. Linear inseparability
Some sample sets are linearly inseparable. What should we do?
Kernel method: if the samples are linearly inseparable in the original input space, we map them into a higher-dimensional space (called the feature space) and make them linearly separable in that space. In other words, nonlinearity is introduced into the original input space through the map.

37. An example: the XOR problem
Let's define a map $\Phi : R^2 \to R^3$ as follows:
$$[x^{(1)}, x^{(2)}] \mapsto \Phi([x^{(1)}, x^{(2)}]) = [x^{(1)}, x^{(2)}, x^{(1)}x^{(2)}]$$
The XOR samples are linearly separable after mapping into $R^3$ using $\Phi$.
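A direct check of this claim (a numpy sketch; the separating w and b below are hand-picked, not computed):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                 # XOR labels: inseparable in R^2
Phi = np.c_[X, X[:, 0] * X[:, 1]]            # map to R^3: (x1, x2, x1*x2)

w, b = np.array([1.0, 1.0, -2.0]), -0.5      # hand-picked hyperplane in R^3
print(y * (Phi @ w + b))                     # all margins positive: separable after the map
```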

38. Replacing the inner product in the SVM optimization problem
Remember that in the optimization problem for SVM, the objective function contains the inner products $x_i^T x_j$:
$$\max \ -\frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (x_i^T x_j) + \sum_{i=1}^N \alpha_i \quad \text{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0, \ \alpha \ge 0$$
After mapping $x$ into $\Phi(x)$, the inner product $x_i^T x_j$ becomes $\Phi(x_i)^T \Phi(x_j)$, and the optimization problem changes accordingly:
$$\max \ -\frac{1}{2}\sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j y_i y_j (\Phi(x_i)^T \Phi(x_j)) + \sum_{i=1}^N \alpha_i \quad \text{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0, \ \alpha \ge 0$$
Note that we simply replace this inner product with $\Phi(x_i)^T \Phi(x_j)$, without even needing to know the map $\Phi$ explicitly.

39. The separating hyperplane
Remember that in the optimization problem for SVM, the separating hyperplane is:
$$f(x) = \omega^T x + b = \sum_i \alpha_i y_i x_i^T x + b = 0$$
After mapping $x$ into $\Phi(x)$, the separating hyperplane becomes:
$$f(x) = \omega^T \Phi(x) + b = \sum_i \alpha_i y_i \Phi(x_i)^T \Phi(x) + b = 0$$
Note that $\Phi(x_i)^T \Phi(x) = k(x_i, x)$.

40. Deriving the kernel function from a map

41. Example 1
Consider the map $\Phi : R^2 \to R^4$:
$$[x^{(1)}, x^{(2)}] \mapsto \Phi([x^{(1)}, x^{(2)}]) = [x^{(1)2}, x^{(2)2}, x^{(1)}x^{(2)}, x^{(1)}x^{(2)}]$$
The corresponding kernel function is:
$$k(x, z) = x^{(1)2}z^{(1)2} + x^{(2)2}z^{(2)2} + 2x^{(1)}x^{(2)}z^{(1)}z^{(2)} = \langle x, z \rangle^2$$
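The identity $\Phi(x)^T \Phi(z) = \langle x, z \rangle^2$ is easy to confirm numerically (numpy sketch):

```python
import numpy as np

def phi(v):
    # the map above: (x1^2, x2^2, x1*x2, x1*x2)
    return np.array([v[0]**2, v[1]**2, v[0]*v[1], v[0]*v[1]])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(x) @ phi(z), (x @ z)**2)   # both 1.0: the kernel avoids forming phi explicitly
```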

42. Example 2
Consider the map $\Phi : R^3 \to R^{13}$:
$$[x^{(1)}, x^{(2)}, x^{(3)}] \mapsto \Phi([x^{(1)}, x^{(2)}, x^{(3)}]) = [x^{(1)2}, x^{(1)}x^{(2)}, \dots, x^{(3)2}, \sqrt{2c}\,x^{(1)}, \sqrt{2c}\,x^{(2)}, \sqrt{2c}\,x^{(3)}, c]$$
The corresponding kernel function is:
$$k(x, z) = (x^T z + c)^2$$


44. The power of the polynomial kernel
We can map into a high-dimensional feature space while only ever computing inner products in the original space.

45. The power of the polynomial kernel (cont'd)

46. The requirements of kernel functions

47. Definition of a kernel function
A function $k : \mathcal{X} \times \mathcal{X} \to R$ is a kernel function if:
$k$ is symmetric: $k(x, z) = k(z, x)$;
for any $m \in N$ and any $x_1, x_2, \dots, x_m \in \mathcal{X}$, the matrix $K$ defined by $K_{ij} = k(x_i, x_j)$ is positive semidefinite.
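Symmetry and positive semidefiniteness of the matrix K can be spot-checked for a candidate kernel (numpy sketch; the random points and the degree-2 polynomial kernel are my choices):

```python
import numpy as np

k = lambda a, b: (a @ b + 1.0) ** 2           # candidate kernel: polynomial, degree 2
xs = np.random.randn(6, 3)                    # six random points in R^3
K = np.array([[k(a, b) for b in xs] for a in xs])

print(np.allclose(K, K.T))                    # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-9)   # eigenvalues >= 0 (up to rounding): PSD
```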

48. Question: could we map from the input space into a feature space of infinite dimension?

49. Mapping into an infinite-dimensional feature space

50. Let's consider a feature space whose elements are functions, spanned by the functions $\Phi(x_i) = k(\cdot, x_i)$. Thus, any element can be represented as:
$$f(\cdot) = \sum_{i=1}^m \alpha_i k(\cdot, x_i)$$
For two elements $f = \sum_i \alpha_i k(\cdot, x_i)$ and $g = \sum_j \beta_j k(\cdot, z_j)$, we define the inner product as:
$$\langle f, g \rangle = \sum_i \sum_j \alpha_i \beta_j k(x_i, z_j)$$

51. Reproducing kernel Hilbert space
With this inner product, the map $\Phi(x) = k(\cdot, x)$ satisfies
$$\langle \Phi(x), \Phi(z) \rangle = \langle k(\cdot, x), k(\cdot, z) \rangle = k(x, z)$$
which is the reproducing property.


53. Theorem (Farkas' lemma)
Given vectors $a_1, a_2, \dots, a_m, c \in R^n$, exactly one of the following holds:
1. $c \in C(a_1, a_2, \dots, a_m)$; or
2. there is a vector $y \in R^n$ such that $y^T a_i \ge 0$ for all $i$ but $y^T c < 0$.
[Figure: Case 1: $c \in C(a_1, a_2)$. Case 2: $c \notin C(a_1, a_2)$, with a separating vector $y$.]
Here $C(a_1, \dots, a_m)$ denotes the cone spanned by $a_1, \dots, a_m$, i.e., $C(a_1, \dots, a_m) = \{x \mid x = \sum_{i=1}^m \lambda_i a_i, \ \lambda_i \ge 0\}$.

54. Proof.
Suppose that for any vector $y \in R^n$ with $y^T a_i \ge 0$ ($i = 1, 2, \dots, m$), we always have $y^T c \ge 0$. We will show that $c$ must lie within the cone $C(a_1, a_2, \dots, a_m)$.
Consider the following primal problem:
$$\min c^T y \quad \text{s.t.} \quad a_i^T y \ge 0, \ i = 1, 2, \dots, m$$
The primal problem clearly has a feasible solution $y = 0$, and it is bounded since $c^T y \ge 0$. Thus the dual problem also has a bounded optimal solution:
$$\max 0 \quad \text{s.t.} \quad x^T A^T = c^T, \ x \ge 0$$
In other words, there exists a vector $x$ such that $c = \sum_{i=1}^m x_i a_i$ and $x_i \ge 0$.

55. Variants of Farkas' lemma
Farkas' lemma lies at the core of linear optimization. Using Farkas' lemma, we can prove the Separation theorem and the MiniMax theorem in game theory.
Theorem. Let $A$ be an $m \times n$ matrix and $b \in R^m$. Then exactly one of the following holds:
1. $Ax = b, \ x \ge 0$ has a feasible solution; or
2. there is a vector $y \in R^m$ such that $y^T A \ge 0$ but $y^T b < 0$.
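The "exactly one of two systems" structure can be explored with an LP solver (scipy sketch; A and b are made up, and the box bound on y merely keeps the second LP bounded):

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([-1.0, 2.0])                 # b lies outside the cone of A's columns

# System 1: Ax = b, x >= 0 -- a pure feasibility problem
sys1 = linprog(c=[0, 0], A_eq=A, b_eq=b, bounds=[(0, None)] * 2)
# System 2: find y with y^T A >= 0 and y^T b < 0, i.e. minimize y^T b s.t. A^T y >= 0
sys2 = linprog(c=b, A_ub=-A.T, b_ub=[0, 0], bounds=[(-1, 1)] * 2)
print(sys1.status, sys2.fun)   # system 1 infeasible (status 2); system 2 reaches -1 < 0
```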

56. Variants of Farkas' lemma (cont'd)
Theorem. Let $A$ be an $m \times n$ matrix and $b \in R^m$. Then exactly one of the following holds:
1. $Ax \le b$ has a feasible solution; or
2. there is a vector $y \in R^m$ such that $y \ge 0$, $y^T A = 0$ but $y^T b < 0$.

57. Carathéodory's theorem
Theorem. Given vectors $a_1, a_2, \dots, a_m \in R^n$, if $x \in C(a_1, a_2, \dots, a_m)$, then there is a linearly independent subset of $a_1, a_2, \dots, a_m$, say $a_{i_1}, a_{i_2}, \dots, a_{i_r}$, such that $x \in C(a_{i_1}, a_{i_2}, \dots, a_{i_r})$.

58. Separation theorem
Theorem. Let $C \subseteq R^n$ be a closed convex set, and let $x \in R^n$. If $x \notin C$, then there exists a hyperplane separating $x$ from $C$.

59. Application 2: von Neumann's MiniMax theorem in game theory

60. Game theory
Game theory studies competitive and cooperative behaviours among intelligent and rational decision-makers.
In 1928, John von Neumann proved the existence of mixed-strategy equilibria in two-person zero-sum games.
In 1950, John Forbes Nash Jr. developed a criterion for the mutual consistency of players' strategies that applies to a wider range of games than the one proposed by J. von Neumann: he proved the existence of a Nash equilibrium in every finite n-player, non-zero-sum, non-cooperative game (not just 2-player, zero-sum games).
Game theory has been widely applied in mathematical economics, in biology (e.g., the analysis of evolution and stability), and in computer science (e.g., the analysis of interactive computation, lower bounds on the complexity of randomized algorithms, and the equivalence between linear programming and two-person zero-sum games).

61. Paper-rock-scissors: an example of a two-player zero-sum game
Paper-rock-scissors is a hand game usually played by two players, denoted as the row player and the column player: each player selects one of the three hand shapes paper, rock, and scissors, and the players show their selections simultaneously. Besides a tie, it has two possible outcomes: one player wins and the other loses. This can be formally described using the following payoff matrix.

            Paper    Rock     Scissors
Paper       0, 0     1, -1    -1, 1
Rock        -1, 1    0, 0     1, -1
Scissors    1, -1    -1, 1    0, 0

Each player attempts to select an appropriate action to maximize his gain.

62. Matching pennies: another example of a two-person zero-sum game
Matching pennies is a game played by two players, namely the row player and the column player. Each player has a penny and secretly turns it to heads or tails; the players then reveal their selections simultaneously. If the pennies match, the row player keeps both pennies; otherwise, the column player keeps both. The payoff matrix is as follows.

        Head     Tail
Head    1, -1    -1, 1
Tail    -1, 1    1, -1

Each player tries to maximize his gain by making an appropriate selection.

63. Simultaneous games vs. sequential games
Simultaneous games are games in which all players move simultaneously; thus, no player has advance information about the others' selections. Sequential games are games in which a later player has some information, though perhaps imperfect, about the previous actions of the other players.
A complete plan of action for every stage of the game, regardless of whether the action actually arises in play, is called a (pure) strategy. Normal form is used to describe simultaneous games, while extensive form is used to describe sequential games.
J. von Neumann proposed an approach to transform strategies in sequential games into actions in simultaneous games. Note that the transformation is one-way, i.e., multiple sequential games might correspond to the same simultaneous game, and it may result in an exponential blowup in the size of the representation.

64. Normal form
A game $\Gamma$ in normal form among $m$ players contains the following items:
Each player $k$ has a finite set of pure strategies $S_k = \{1, 2, \dots, n_k\}$.
Each player $k$ is associated with a payoff function $H_k : S_1 \times S_2 \times \dots \times S_m \to R$.
To play the game, each player selects a strategy without information about the others, and then all selections are revealed simultaneously. The players' gains are calculated using the corresponding payoff functions. Each player attempts to maximize his gain by selecting an appropriate strategy.

65. Two-person zero-sum games in normal form
In a two-person zero-sum game $\Gamma$, a player's gain or loss is exactly balanced by the other player's loss or gain, i.e., $H_1(s_1, s_2) + H_2(s_1, s_2) = 0$. Thus we can define a single function $H(s_1, s_2) = H_1(s_1, s_2) = -H_2(s_1, s_2)$ and represent it using a payoff matrix.

        Head    Tail
Head    1       -1
Tail    -1      1

The row player aims to maximize $H(s_1, s_2)$ by selecting an appropriate strategy $s_1$, while the column player aims to minimize $H(s_1, s_2)$ by selecting an appropriate strategy $s_2$.

66. von Neumann's MiniMax theorem: motivation
When analyzing a two-person zero-sum game $\Gamma$, von Neumann noticed that the difficulty comes from the difference between games and ordinary optimization problems: the row player tries to maximize $H(s_1, s_2)$, but he controls only $s_1$, as he has no information about the other player's selection $s_2$; the same holds for the column player. Thus von Neumann suggested investigating two auxiliary games without this difficulty, denoted $\Gamma_1$ and $\Gamma_2$, before attacking the challenging game $\Gamma$:
1. $\Gamma_1$: the row player selects a strategy $s_1$ first and exposes his selection to the column player before the column player selects a strategy $s_2$.
2. $\Gamma_2$: the column player selects a strategy $s_2$ first and exposes his selection to the row player before the row player selects a strategy $s_1$.
The two auxiliary games are much easier than the original game $\Gamma$, and more importantly, they provide upper and lower bounds for $\Gamma$.

67. Auxiliary game $\Gamma_1$
Let's consider the column player first. As he knows the row player's selection $s_1$, the objective function $H(s_1, s_2)$ becomes an ordinary optimization function over a single variable $s_2$, and the column player can simply select a strategy $s_2$ with the minimum objective value $\min_{s_2} H(s_1, s_2)$.

        Head    Tail    Row minimum
Head    -2      1       -2
Tail    -1      2       -1   (v1 = -1)

Now consider the row player. When he selects a strategy $s_1$, he can definitely predict the selection of the column player. Since $\min_{s_2} H(s_1, s_2)$ is an ordinary function over a single variable $s_1$, it is easy for the row player to select a strategy $s_1$ with the maximum objective value
$$v_1 = \max_{s_1} \min_{s_2} H(s_1, s_2).$$

68. Auxiliary game $\Gamma_2$
Let's consider the row player first. As he knows the column player's selection $s_2$, the objective function $H(s_1, s_2)$ becomes an ordinary optimization function over a single variable $s_1$, and the row player can simply select a strategy $s_1$ with the maximum objective value $\max_{s_1} H(s_1, s_2)$.

                Head          Tail
Head            -2            1
Tail            -1            2
Column maximum  -1 (v2 = -1)  2

Now consider the column player. When he selects a strategy $s_2$, he can definitely predict the selection of the row player. Since $\max_{s_1} H(s_1, s_2)$ is an ordinary function over a single variable $s_2$, it is easy for the column player to select a strategy $s_2$ with the minimum objective value
$$v_2 = \min_{s_2} \max_{s_1} H(s_1, s_2).$$

69. $\Gamma_1$ and $\Gamma_2$ bound $\Gamma$
For the row player, $\Gamma_1$ is clearly disadvantageous, as he must expose his selection $s_1$ to the column player. On the contrary, $\Gamma_2$ is beneficial to the row player, as he knows the column player's selection $s_2$ before making his decision.

                Head          Tail    Row minimum
Head            -2            1       -2
Tail            -1            2       -1   (v1 = -1)
Column maximum  -1 (v2 = -1)  2

Thus the two auxiliary games provide lower and upper bounds:
$$v_1 \le v \le v_2$$
where $v$ denotes the row player's gain in the original game $\Gamma$.

70. Case 1: $v_1 = v_2$
For a game with the following payoff matrix, we have $v_1 = v = v_2$ and call the game strictly determined.

                Head          Tail    Row minimum
Head            -2            1       -2
Tail            -1            2       -1   (v1 = -1)
Column maximum  -1 (v2 = -1)  2

The saddle point of the payoff matrix $H(s_1, s_2)$ represents a pure-strategy equilibrium. In this equilibrium, each player has nothing to gain by changing only his own strategy; in addition, knowing the opponent's selection brings no gain.
von Neumann proved the existence of an optimal strategy in perfect-information two-person zero-sum games, e.g., chess. L. S. Shapley further showed that a finite two-person zero-sum game has a pure-strategy equilibrium if every $2 \times 2$ submatrix of the game has a pure-strategy equilibrium.

71. Case 2: $v_1 < v_2$
In contrast, matching pennies does not have a pure-strategy equilibrium, as there is no saddle point in the payoff matrix. The same holds for the paper-rock-scissors game.

                Head        Tail    Row minimum
Head            1           -1      -1
Tail            -1          1       -1   (v1 = -1)
Column maximum  1 (v2 = 1)  1

This fact implies that knowing the opponent's selection might bring gain; however, it is impossible to know the opponent's selection, as the players reveal their selections simultaneously. In this case, let's play a mixed strategy rather than a pure strategy.

72. From pure strategies to mixed strategies
A mixed strategy is an assignment of probabilities to pure strategies, allowing a player to randomly select a pure strategy. Consider the payoff matrix below. If the row player selects strategy A with probability 1, he is said to play a pure strategy. If he tosses a coin and selects strategy A if the coin lands heads and B otherwise, then he is said to play a mixed strategy.

    A     B
A   1     -1
B   -1    1

73. Two interpretations of mixed strategies
From the player's viewpoint: J. von Neumann described the motivation underlying the introduction of mixed strategies as follows: since it is impossible to know the opponent's selection exactly, a player can protect himself by randomly selecting his own strategy, making it difficult for the opponent to know the player's selection. However, this interpretation came under heavy fire for lacking behavioural support: seldom do people make choices by following a lottery.
From the opponent's viewpoint: Robert Aumann and Adam Brandenburger interpreted the mixed strategy of a player as the opponent's belief about the player's selection. Thus, a Nash equilibrium is an equilibrium of beliefs rather than actions.

74. Existence of mixed-strategy equilibria
Consider a mixed-strategy game: the row player has $m$ strategies available and selects a strategy $s_1$ according to a distribution $u$, while the column player has $n$ strategies available and selects a strategy $s_2$ according to a distribution $v$, i.e.,
$$\Pr(s_1 = i) = u_i, \ i = 1, \dots, m; \qquad \Pr(s_2 = j) = v_j, \ j = 1, \dots, n$$
Here $u$ and $v$ are independent. Thus the expected gain of the row player is:
$$\sum_{i=1}^m \sum_{j=1}^n u_i H_{ij} v_j = u^T H v$$
The row player attempts to maximize $u^T H v$ by selecting an appropriate $u$, while the column player attempts to minimize it by selecting an appropriate $v$.
Now let's consider the two auxiliary games $\Gamma_1$ and $\Gamma_2$ again and answer the following questions: what happens if the row player exposes his mixed strategy to the column player? And what if we reverse the order of the players?

75. von Neumann's MiniMax theorem [1928]
These questions are answered by von Neumann's MiniMax theorem.
Theorem.
$$\max_u \min_v u^T H v = \min_v \max_u u^T H v$$
The theorem states that in a mixed-strategy zero-sum game, knowing the other player's strategy brings no gain, and the order of play does not change the value. A mixed-strategy Nash equilibrium exists for any two-person zero-sum game with finite sets of actions.
A Nash equilibrium in a two-player game is a pair of strategies, each of which is a best response to the other, i.e., each gives the player using it the highest possible payoff given the other player's strategy.

76. von Neumann's MiniMax theorem: proof
Let's consider the auxiliary game $\Gamma_1$ first, in which the strategy of the row player, i.e., $u$, is exposed to the column player. This is of course beneficial to the column player, since he can select the optimal strategy $v$ to minimize $u^T H v$, which is
$$\inf\{u^T H v \mid v \ge 0, \ 1^T v = 1\} = \min_{j=1,\dots,n} (u^T H)_j$$
Thus the row player should select $u$ to maximize the above value, which can be formulated as a linear program:
$$\max \min_{j=1,\dots,n} (u^T H)_j \quad \text{s.t.} \quad 1^T u = 1, \ u \ge 0$$

77. von Neumann's MiniMax theorem: proof (cont'd)
The linear program can be rewritten as:
$$\max s \quad \text{s.t.} \quad u^T H \ge s 1^T, \ 1^T u = 1, \ u \ge 0$$
Similarly, we consider the auxiliary game $\Gamma_2$ and calculate the optimal strategy $v$ by solving the following linear program:
$$\min t \quad \text{s.t.} \quad H v \le t 1, \ 1^T v = 1, \ v \ge 0$$
These two linear programs are both feasible and form a (Lagrangian) primal-dual pair; thus they have the same optimal objective value by the strong duality property.

78. An example: the paper-rock-scissors game
For the paper-rock-scissors game, we have the following two linear programs.
Linear program for $\Gamma_1$:
$$\max s \quad \text{s.t.} \quad 0u_1 - u_2 + u_3 \ge s; \ u_1 + 0u_2 - u_3 \ge s; \ -u_1 + u_2 + 0u_3 \ge s; \ u_1 + u_2 + u_3 = 1; \ u_1, u_2, u_3 \ge 0$$
Linear program for $\Gamma_2$:
$$\min t \quad \text{s.t.} \quad 0v_1 + v_2 - v_3 \le t; \ -v_1 + 0v_2 + v_3 \le t; \ v_1 - v_2 + 0v_3 \le t; \ v_1 + v_2 + v_3 = 1; \ v_1, v_2, v_3 \ge 0$$
The mixed-strategy equilibrium is $u^T = [\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]$ and $v^T = [\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]$, with game value 0.
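Both programs can be fed to an LP solver; a scipy sketch for the row player's program (the column player's program is symmetric):

```python
import numpy as np
from scipy.optimize import linprog

H = np.array([[0, 1, -1],
              [-1, 0, 1],
              [1, -1, 0]], dtype=float)    # row player's payoff matrix

n = 3
# variables (u1, u2, u3, s); maximize s <=> minimize -s
c = np.r_[np.zeros(n), -1.0]
A_ub = np.c_[-H.T, np.ones(n)]             # s - (u^T H)_j <= 0 for every column j
A_eq = [np.r_[np.ones(n), 0.0]]            # u1 + u2 + u3 = 1
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, None)] * n + [(None, None)])
print(res.x[:n], res.x[n])                 # u = (1/3, 1/3, 1/3), game value s = 0
```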

79. Comments on the mixed-strategy equilibrium by von Neumann
Note that a mixed-strategy equilibrium always exists, no matter whether the payoff matrix $H$ has a saddle point or not.
Regardless of the column player's selection, the row player can select an appropriate strategy to guarantee his gain is at least 0; regardless of the row player's selection, the column player can select an appropriate strategy to guarantee the row player's gain is at most 0.
Using the strategy $u^T = [\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]$, the row player can guarantee that in expectation he won't lose, i.e., the probability of losing is no greater than the probability of winning. The strategy $u^T = [\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]$ is designed for protecting himself rather than attacking his opponent, i.e., it cannot be used to benefit from the opponent's mistakes.

80. Application 3: Yao's MiniMax principle [1977]

81. Yao's MiniMax principle
Consider a problem $\Pi$. Let $A = \{A_1, A_2, \dots, A_n\}$ be the algorithms for $\Pi$, and $I = \{I_1, I_2, \dots, I_m\}$ be the inputs of a given size. Let $T(A_i, I_j)$ be the running time of algorithm $A_i$ on input $I_j$.

        A1      A2      ...
I1      T11     T12
I2      T21     T22
...

Thus $\max_{I_j \in I} T(A_i, I_j)$ represents the worst-case running time of the deterministic algorithm $A_i$.
For a randomized algorithm, however, it is usually difficult to bound its expected running time on worst-case inputs. Yao's MiniMax principle provides a technique to establish lower bounds on the expected running time of any randomized algorithm on its worst-case input.

82. Expected running time of a randomized algorithm $A_q$
A Las Vegas randomized algorithm can be viewed as a distribution over all deterministic algorithms $A = \{A_1, A_2, \dots, A_n\}$. Specifically, let $q$ be a distribution over $A$, and let $A_q$ be a randomized algorithm chosen according to $q$, i.e., $A_q$ refers to the deterministic algorithm $A_i$ with probability $q_i$. Given an input $I_j$, the expected running time of $A_q$ can be written as
$$E[T(A_q, I_j)] = \sum_{i=1}^n q_i T(A_i, I_j)$$
Thus $\max_{I_j \in I} E[T(A_q, I_j)]$ represents the expected running time of $A_q$ on its worst-case input.

83. Expected running time of a deterministic algorithm $A_i$ on a random input
Now consider a deterministic algorithm $A_i$ running on a random input. Let $p$ be a distribution over $I$, and let $I_p$ be a random input chosen from $I$, i.e., $I_p$ refers to $I_j$ with probability $p_j$. Given a deterministic algorithm $A_i$, its expected running time on the random input $I_p$ can be written as
$$E[T(A_i, I_p)] = \sum_{j=1}^m p_j T(A_i, I_j)$$
Thus $\min_{A_i \in A} E[T(A_i, I_p)]$ represents the expected running time of the best deterministic algorithm on the random input $I_p$.

84. Yao's MiniMax principle
Theorem. For any random input $I_p$ and randomized algorithm $A_q$,
$$\min_{A_i \in A} E[T(A_i, I_p)] \le \max_{I_j \in I} E[T(A_q, I_j)]$$
To establish a lower bound on the expected running time of a randomized algorithm on its worst-case input, it suffices to find an appropriate distribution over inputs and prove that on this random input, no deterministic algorithm can do better than the randomized one. The power of this technique lies in the fact that one can choose any distribution over inputs, and the lower bound is constructed based on deterministic algorithms.

85. Yao's MiniMax principle: proof
Proof.
$$\min_{A_i \in A} E[T(A_i, I_p)] \le \max_{u \in \Delta_m} \min_{A_i \in A} E[T(A_i, I_u)] \quad (1)$$
$$= \max_{u \in \Delta_m} \min_{v \in \Delta_n} E[T(A_v, I_u)] \quad (2)$$
$$= \min_{v \in \Delta_n} \max_{u \in \Delta_m} E[T(A_v, I_u)] \quad (3)$$
$$= \min_{v \in \Delta_n} \max_{I_j \in I} E[T(A_v, I_j)] \quad (4)$$
$$\le \max_{I_j \in I} E[T(A_q, I_j)] \quad (5)$$
Here $\Delta_n$ denotes the set of $n$-dimensional probability vectors. Equation (3) follows from von Neumann's MiniMax theorem.

86. Application 4: Revisiting the ShortestPath algorithm

87. The ShortestPath problem
INPUT: $n$ cities and a collection of roads; a road from city $i$ to city $j$ has distance $d(i, j)$; two specific cities $s$ and $t$.
OUTPUT: the shortest path from city $s$ to $t$.
[Figure: example graph with edges $s \to u$ (5), $s \to v$ (8), $u \to v$ (1), $u \to t$ (6), $v \to t$ (2)]

88. The ShortestPath problem: primal problem
[Figure: the same graph with edge variables $x_1 = (s, u)$, $x_2 = (s, v)$, $x_3 = (u, v)$, $x_4 = (u, t)$, $x_5 = (v, t)$]
Primal problem: set a variable for each road (intuition: $x_i = 0/1$ indicates whether edge $i$ appears in the shortest path); each constraint says that if we enter a node through an edge, we also leave it through another edge.
$$\min 5x_1 + 8x_2 + 1x_3 + 6x_4 + 2x_5$$
$$\text{s.t.} \quad x_1 + x_2 = 1 \ (\text{vertex } s); \quad -x_4 - x_5 = -1 \ (\text{vertex } t); \quad -x_1 + x_3 + x_4 = 0 \ (\text{vertex } u); \quad -x_2 - x_3 + x_5 = 0 \ (\text{vertex } v); \quad x_1, \dots, x_5 \in \{0, 1\}$$

89. The ShortestPath problem: primal problem (cont'd)
Primal problem: relax the 0/1 integer linear program into a linear program; the relaxation is exact by the total unimodularity of the constraint matrix.
$$\min 5x_1 + 8x_2 + 1x_3 + 6x_4 + 2x_5$$
$$\text{s.t.} \quad x_1 + x_2 = 1 \ (s); \quad -x_4 - x_5 = -1 \ (t); \quad -x_1 + x_3 + x_4 = 0 \ (u); \quad -x_2 - x_3 + x_5 = 0 \ (v); \quad 0 \le x_i \le 1$$

90. The ShortestPath problem: dual problem
[Figure: the same graph, with a height $y_i$ attached to each city]
Dual problem: set a variable for each city (intuition: $y_i$ is the height of city $i$; thus $y_s - y_t$ denotes the height difference between $s$ and $t$, providing a lower bound on the shortest-path length).
$$\max y_s - y_t$$
$$\text{s.t.} \quad y_s - y_u \le 5 \ (x_1: \text{edge } (s, u)); \quad y_s - y_v \le 8 \ (x_2: \text{edge } (s, v)); \quad y_u - y_v \le 1 \ (x_3: \text{edge } (u, v)); \quad -y_t + y_u \le 6 \ (x_4: \text{edge } (u, t)); \quad -y_t + y_v \le 2 \ (x_5: \text{edge } (v, t))$$
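For this small graph, the primal LP can be solved directly, confirming that the LP optimum equals the shortest-path length (scipy sketch):

```python
from scipy.optimize import linprog

# columns: x1=(s,u), x2=(s,v), x3=(u,v), x4=(u,t), x5=(v,t)
c = [5, 8, 1, 6, 2]
A_eq = [[ 1,  1,  0,  0,  0],    # vertex s: one unit of flow leaves
        [ 0,  0,  0,  1,  1],    # vertex t: one unit of flow arrives
        [-1,  0,  1,  1,  0],    # vertex u: flow conservation
        [ 0, -1, -1,  0,  1]]    # vertex v: flow conservation
res = linprog(c, A_eq=A_eq, b_eq=[1, 1, 0, 0], bounds=[(0, 1)] * 5)
print(res.x, res.fun)   # x = (1, 0, 1, 0, 1), length 8: the path s -> u -> v -> t
```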

91. Dual simplex method

92. Revisiting the primal simplex algorithm
Consider the following primal problem P:
$$\min x_1 + 14x_2 + 6x_3$$
$$\text{s.t.} \quad x_1 + x_2 + x_3 \ge 4; \quad x_1 \ge 2; \quad x_3 \ge 3; \quad 3x_2 + x_3 \ge 6; \quad x_1, x_2, x_3 \ge 0$$
Primal simplex tableau (slack variables $x_4, \dots, x_7$; the rows for the basic variables $x_{B1}, \dots, x_{B4}$ have their numeric entries elided):

       x1     x2      x3     x4    x5    x6    x7
-z=0   c1=1   c2=14   c3=6   c4=0  c5=0  c6=0  c7=0

Primal variables: $x$; feasible: $B^{-1}b \ge 0$. A basis $B$ is called primal feasible if all elements of $B^{-1}b$ (the first column except $-z$) are non-negative.

93. Revisiting the primal simplex algorithm (cont'd)
Now let's consider the dual problem D:
$$\max 4y_1 + 2y_2 + 3y_3 + 6y_4$$
$$\text{s.t.} \quad y_1 + y_2 \le 1; \quad y_1 + 3y_4 \le 14; \quad y_1 + y_3 + y_4 \le 6; \quad y_1, y_2, y_3, y_4 \ge 0$$
(Primal simplex tableau as on the previous slide.)
Dual variables: $y^T = c_B^T B^{-1}$; feasible: $y^T A \le c^T$. A basis $B$ is called dual feasible if all elements of the reduced-cost row $\bar{c}^T = c^T - c_B^T B^{-1} A = c^T - y^T A$ (the first row except $-z$) are non-negative.

94. Another viewpoint of the primal simplex algorithm
Thus, the primal simplex algorithm can also be described as follows:
1. Starting point: the primal simplex algorithm starts with a primal feasible solution ($x_B = B^{-1}b \ge 0$);
2. Maintenance: throughout the process we maintain primal feasibility and move towards dual feasibility;
3. Stopping criterion: $\bar{c}^T = c^T - c_B^T B^{-1} A \ge 0$, i.e., $y^T A \le c^T$.
In other words, the iteration process ends when the basis is both primal feasible and dual feasible.

95. The dual simplex works in just the opposite fashion
Dual simplex:
1. Starting point: the dual simplex algorithm starts with a dual feasible solution ($\bar{c}^T \ge 0$);
2. Maintenance: throughout the process we maintain dual feasibility and move towards primal feasibility;
3. Stopping criterion: $x_B = B^{-1}b \ge 0$.
In other words, the iteration process ends when the basis is both primal feasible and dual feasible.

96. Primal simplex vs. dual simplex
Both the primal simplex and the dual simplex terminate at the same condition, i.e., the basis is both primal feasible and dual feasible. However, the final objective is achieved in totally opposite fashions: the primal simplex method keeps primal feasibility while the dual simplex method keeps dual feasibility during the pivoting process.
The primal simplex algorithm first selects an entering variable and then determines the leaving variable. In contrast, the dual simplex method does the opposite: it first selects a leaving variable and then determines an entering variable.

97 Dual simplex(b I, z, A, b, c) 1: //Dual simplex starts with a dual feasible basis. Here, B I contains the indices of the basic variables. 2: while TRUE do 3: if there is no index l (1 l m) has b l < 0 then 4: x =CalculateX(B I, A, b, c); 5: return (x, z); 6: end if; 7: choose an index l having b l < 0 according to a certain rule; 8: for each index j (1 i n) do 9: if a lj < 0 then 10: j = c j a lj ; 11: else 12: j = ; 13: end if 14: end for 15: choose an index e that minimizes j; 16: if e = then 17: return no feasible solution ; 18: end if 19: (B I, A, b, c, z) = Pivot(B I, A, b, c, z, e, l); 20: end while 97 / 136

98. An example
Standard form:
$$\min 5x_1 + 35x_2 + 20x_3 \quad \text{s.t.} \quad x_1 - x_2 - x_3 \le -2; \quad -x_1 - 3x_2 \le -3; \quad x_1, x_2, x_3 \ge 0$$
Slack form:
$$\min 5x_1 + 35x_2 + 20x_3 \quad \text{s.t.} \quad x_1 - x_2 - x_3 + x_4 = -2; \quad -x_1 - 3x_2 + x_5 = -3; \quad x_1, \dots, x_5 \ge 0$$

99. Step 1

         x1     x2      x3      x4    x5
-z=0     c1=5   c2=35   c3=20   c4=0  c5=0
xB1=x4   b1=-2  1       -1      -1    1     0
xB2=x5   b2=-3  -1      -3      0     0     1

Basis (in blue): $B = \{a_4, a_5\}$. Solution: $x = [0, 0, 0, -2, -3]^T$ (dual feasible, primal infeasible).
Pivoting: choose $a_5$ to leave the basis since $b_2 = -3 < 0$; choose $a_1$ to enter the basis since $\min_{j: a_{2j} < 0} c_j / |a_{2j}| = c_1 / |a_{21}|$.

100. Step 2

         x1     x2      x3      x4    x5
-z=-15   c1=0   c2=20   c3=20   c4=0  c5=5
xB1=x4   b1=-5  0       -4      -1    1     1
xB2=x1   b2=3   1       3       0     0     -1

Basis (in blue): $B = \{a_1, a_4\}$. Solution: $x = [3, 0, 0, -5, 0]^T$.
Pivoting: choose $a_4$ to leave the basis since $b_1 = -5 < 0$; choose $a_2$ to enter the basis since $\min_{j: a_{1j} < 0} c_j / |a_{1j}| = c_2 / |a_{12}|$.

101. Step 3

         x1      x2     x3      x4      x5
-z=-40   c1=0    c2=0   c3=15   c4=5    c5=10
xB1=x2   b1=5/4  0      1       1/4     -1/4   -1/4
xB2=x1   b2=-3/4 1      0       -3/4    3/4    -1/4

Basis (in blue): $B = \{a_1, a_2\}$. Solution: $x = [-3/4, 5/4, 0, 0, 0]^T$.
Pivoting: choose $a_1$ to leave the basis since $b_2 = -3/4 < 0$; choose $a_3$ to enter the basis since $\min_{j: a_{2j} < 0} c_j / |a_{2j}| = c_3 / |a_{23}|$.

102. Step 4

         x1      x2     x3     x4      x5
-z=-55   c1=20   c2=0   c3=0   c4=20   c5=5
xB1=x2   b1=1    1/3    1      0       0      -1/3
xB2=x3   b2=1    -4/3   0      1       -1     1/3

Basis (in blue): $B = \{a_2, a_3\}$. Solution: $x = [0, 1, 1, 0, 0]^T$, objective value 55. Both primal and dual feasible: done!
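The end result of the four pivots can be cross-checked with an LP solver (scipy sketch):

```python
from scipy.optimize import linprog

c = [5, 35, 20]
A_ub = [[ 1, -1, -1],    #  x1 - x2 - x3 <= -2
        [-1, -3,  0]]    # -x1 - 3x2    <= -3
res = linprog(c, A_ub=A_ub, b_ub=[-2, -3], bounds=[(0, None)] * 3)
print(res.x, res.fun)    # x = (0, 1, 1), objective 55, matching the final tableau
```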

103. When is the dual simplex method useful?
1. The dual simplex algorithm is most suited for problems for which an initial dual feasible solution is easily available.
2. It is particularly useful for reoptimizing a problem after a constraint has been added or some parameters have been changed, so that the previously optimal basis is no longer feasible.
3. Trying the dual simplex is also worthwhile if your LP appears to be highly degenerate, i.e., there are many vertices of the feasible region for which the associated basis is degenerate; in such cases the primal simplex may perform a large number of iterations (moves between adjacent vertices) with little or no improvement.
(References: Operations Research Models and Methods, Paul A. Jensen and Jonathan F. Bard; OR-Notes, J. E. Beasley.)

104. Primal-Dual: another improvement approach

105. The Primal-Dual method: a brief history
In 1955, H. Kuhn proposed the Hungarian method for the MaxWeightedMatching problem. This method effectively exploits the duality property of linear programming.
In 1956, G. Dantzig, L. R. Ford, and D. R. Fulkerson extended this idea to solve general linear programming problems.
In 1957, L. R. Ford and D. R. Fulkerson applied this idea to the network-flow problem and the Hitchcock problem.
In 1957, J. Munkres applied this idea to the transportation problem.

106. The Primal-Dual method: basic idea

107. The Primal-Dual method
The Primal-Dual method is a dual method that exploits lower-bound information in subsequent linear programming operations.
Advantages:
1. Unlike the dual simplex, which starts from a dual basic feasible solution, the primal-dual method requires only a dual feasible solution.
2. An optimal solution to DRP usually has a combinatorial explanation, especially for graph-theory problems.
[Diagram: Primal P ($x$) and Dual D ($y$). Step 1: $y \Rightarrow \bar{x}$, giving the Restricted Primal RP ($\bar{x}$); Step 2: $\bar{x} \Rightarrow \bar{y}$ via the dual of RP, DRP ($\bar{y}$); Step 3: $y = y + \theta\bar{y}$.]

108. Basic idea of the primal-dual method
Primal P:
$$\min c_1 x_1 + c_2 x_2 + \dots + c_n x_n$$
$$\text{s.t.} \quad a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n = b_1 \ (y_1); \quad a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n = b_2 \ (y_2); \quad \dots; \quad a_{m1}x_1 + a_{m2}x_2 + \dots + a_{mn}x_n = b_m \ (y_m); \quad x_1, x_2, \dots, x_n \ge 0$$
Dual D:
$$\max b_1 y_1 + b_2 y_2 + \dots + b_m y_m$$
$$\text{s.t.} \quad a_{11}y_1 + a_{21}y_2 + \dots + a_{m1}y_m \le c_1; \quad a_{12}y_1 + a_{22}y_2 + \dots + a_{m2}y_m \le c_2; \quad \dots; \quad a_{1n}y_1 + a_{2n}y_2 + \dots + a_{mn}y_m \le c_n$$
Basic idea: suppose we are given a dual feasible solution $y$. Let's verify whether $y$ is an optimal solution or not:
1. If $y$ is an optimal solution to the dual problem D, then the corresponding primal variables $x$ should satisfy a restricted primal problem called RP;
2. Furthermore, even if $y$ is not optimal, the solution to the dual of RP (called DRP) still provides invaluable information: it can be used to improve $y$.

109. Step 1: $y \Rightarrow \bar{x}$ (I)
Dual problem D:
$$\max b_1 y_1 + b_2 y_2 + \dots + b_m y_m$$
$$\text{s.t.} \quad a_{11}y_1 + a_{21}y_2 + \dots + a_{m1}y_m \le c_1 \ (= \ \Rightarrow x_1 \ge 0 \text{ allowed}); \quad \dots; \quad a_{1n}y_1 + a_{2n}y_2 + \dots + a_{mn}y_m \le c_n \ (< \ \Rightarrow x_n = 0)$$
$y$ provides information about the corresponding primal variables $x$:
1. Given a dual feasible solution $y$, let's check whether $y$ is an optimal solution or not.
2. If $y$ is optimal, we have the following restrictions on $x$: $a_{1i}y_1 + a_{2i}y_2 + \dots + a_{mi}y_m < c_i \Rightarrow x_i = 0$. (Reason: complementary slackness; an optimal solution $y$ satisfies $(a_{1i}y_1 + a_{2i}y_2 + \dots + a_{mi}y_m - c_i)\,x_i = 0$.)
3. Let $J$ record the indices of the tight constraints, where equality holds.

110. Step 1: $y \Rightarrow \bar{x}$ (II)

111. Step 1: $y \Rightarrow \bar{x}$ (III)
4. Thus the corresponding primal solution $\bar{x}$ should satisfy the following restricted primal (RP):
5. RP:
$$a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n = b_1; \quad a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n = b_2; \quad \dots; \quad a_{m1}x_1 + a_{m2}x_2 + \dots + a_{mn}x_n = b_m;$$
$$x_i = 0 \ (i \notin J); \quad x_i \ge 0 \ (i \in J)$$
6. In other words, the optimality of $y$ is determined by solving RP.

112. But how do we solve RP? (I)
RP:
$$a_{11}x_1 + \dots + a_{1n}x_n = b_1; \quad \dots; \quad a_{m1}x_1 + \dots + a_{mn}x_n = b_m; \quad x_i = 0 \ (i \notin J); \quad x_i \ge 0 \ (i \in J)$$
How do we solve RP? Recall that a system $Ax = b, \ x \ge 0$ can be solved via an extended LP.

113. But how do we solve RP? (II)
RP (extended by introducing auxiliary variables):
$$\min \epsilon = s_1 + s_2 + \dots + s_m$$
$$\text{s.t.} \quad s_1 + a_{11}x_1 + \dots + a_{1n}x_n = b_1; \quad s_2 + a_{21}x_1 + \dots + a_{2n}x_n = b_2; \quad \dots; \quad s_m + a_{m1}x_1 + \dots + a_{mn}x_n = b_m;$$
$$x_i = 0 \ (i \notin J); \quad x_i \ge 0 \ (i \in J); \quad s_i \ge 0$$
1. If $\epsilon_{OPT} = 0$, then we have found a feasible solution to RP, implying that $y$ is an optimal solution;
2. If $\epsilon_{OPT} > 0$, $y$ is not an optimal solution.

114. Step 2: $\bar{x} \Rightarrow \bar{y}$ (I)
Alternatively, we can solve the dual of RP, called DRP:
$$\max w = b_1 \bar{y}_1 + b_2 \bar{y}_2 + \dots + b_m \bar{y}_m$$
$$\text{s.t.} \quad a_{1j}\bar{y}_1 + a_{2j}\bar{y}_2 + \dots + a_{mj}\bar{y}_m \le 0 \ (j \in J); \quad \bar{y}_1, \bar{y}_2, \dots, \bar{y}_m \le 1$$
1. If $w_{OPT} = 0$, $y$ is an optimal solution.
2. If $w_{OPT} > 0$, $y$ is not an optimal solution. However, the optimal solution to DRP still provides useful information: it can be used to improve $y$.

115. The difference between DRP and D
Dual problem D:
$$\max b_1 y_1 + \dots + b_m y_m \quad \text{s.t.} \quad a_{1j}y_1 + a_{2j}y_2 + \dots + a_{mj}y_m \le c_j, \ j = 1, \dots, n$$
DRP:
$$\max w = b_1 \bar{y}_1 + \dots + b_m \bar{y}_m \quad \text{s.t.} \quad a_{1j}\bar{y}_1 + a_{2j}\bar{y}_2 + \dots + a_{mj}\bar{y}_m \le 0, \ j \in J; \quad \bar{y}_1, \dots, \bar{y}_m \le 1$$
How to write DRP from D: replace $c_j$ with 0; keep only the $|J|$ restrictions; add the additional restriction $\bar{y}_1, \dots, \bar{y}_m \le 1$.

116. Step 3: $y \Rightarrow y + \theta\bar{y}$ (I)
Why can $\bar{y}$ be used to improve $y$? Consider an improved dual solution $y' = y + \theta\bar{y}$, $\theta > 0$. We have:
Objective function: since $\bar{y}^T b = w_{OPT} > 0$, we get $y'^T b = y^T b + \theta w_{OPT} > y^T b$. In other words, $y + \theta\bar{y}$ is better than $y$.
Constraints: dual feasibility requires that:
For any $j \in J$: $a_{1j}\bar{y}_1 + a_{2j}\bar{y}_2 + \dots + a_{mj}\bar{y}_m \le 0$; thus $y'^T a_j = y^T a_j + \theta \bar{y}^T a_j \le c_j$ for any $\theta > 0$.
For any $j \notin J$, there are two cases (next slide).

117. Step 3: $y \Rightarrow y + \theta\bar{y}$ (II)
1. If $\bar{y}^T a_j \le 0$ for all $j \notin J$: then $y'$ is feasible for any $\theta > 0$, since for $1 \le j \le n$,
$$y'^T a_j = y^T a_j + \theta \bar{y}^T a_j \le y^T a_j \le c_j$$
Hence the dual problem D is unbounded and the primal problem P is infeasible.
2. If $\bar{y}^T a_j > 0$ for some $j \notin J$: we can safely set
$$\theta \le \min_{j \notin J:\ \bar{y}^T a_j > 0} \frac{c_j - y^T a_j}{\bar{y}^T a_j}$$
to guarantee that $y'^T a_j = y^T a_j + \theta \bar{y}^T a_j \le c_j$.

118. The Primal-Dual algorithm
1: Infeasible = "No"; Optimal = "No"; y = y_0; // y_0 is a feasible solution to the dual problem D
2: while TRUE do
3:   find the tight-constraint index set J, and set x_j = 0 for j ∉ J;
4:   // this yields a smaller RP
5:   solve DRP; denote its optimal solution by ȳ;
6:   if the DRP objective w_OPT = 0 then
7:     Optimal = "Yes";
8:     return y;
9:   end if
10:  if ȳ^T a_j ≤ 0 for all j ∉ J then
11:    Infeasible = "Yes";
12:    return;
13:  end if
14:  set θ = min { (c_j − y^T a_j) / (ȳ^T a_j) : ȳ^T a_j > 0, j ∉ J };
15:  update y = y + θȳ;
16: end while

119. Advantages of the Primal-Dual algorithm
Some facts:
The primal-dual algorithm terminates if an anti-cycling rule is used. (Reason: the objective value $y^T b$ increases if there is no degeneracy.)
Neither RP nor DRP explicitly relies on $c$; the information of $c$ is captured in $J$. This leads to another advantage of the primal-dual technique: RP is usually a purely combinatorial problem. Take ShortestPath as an example: its RP corresponds to a connectivity problem.
More and more constraints become tight during the primal-dual process. (See Lecture 10 for a primal-dual algorithm for the MaximumFlow problem.)

120. ShortestPath: Dijkstra's algorithm is essentially a primal-dual algorithm

121. The ShortestPath problem (recap)
[Figure: the example graph with edge variables $x_1, \dots, x_5$ as before]
Primal problem: relax the 0/1 integer linear program into a linear program by total unimodularity.
$$\min 5x_1 + 8x_2 + 1x_3 + 6x_4 + 2x_5$$
$$\text{s.t.} \quad x_1 + x_2 = 1 \ (s); \quad -x_4 - x_5 = -1 \ (t); \quad -x_1 + x_3 + x_4 = 0 \ (u); \quad -x_2 - x_3 + x_5 = 0 \ (v); \quad 0 \le x_i \le 1$$

122. Dual of the ShortestPath problem
[Figure: the graph with heights $y_s, y_u, y_v, y_t$]
Dual problem: set a variable for each city (intuition: $y_i$ is the height of city $i$; thus $y_s - y_t$ denotes the height difference between $s$ and $t$, providing a lower bound on the shortest-path length).
$$\max y_s - y_t$$
$$\text{s.t.} \quad y_s - y_u \le 5 \ (x_1: \text{edge } (s, u)); \quad y_s - y_v \le 8 \ (x_2: \text{edge } (s, v)); \quad y_u - y_v \le 1 \ (x_3: \text{edge } (u, v)); \quad -y_t + y_u \le 6 \ (x_4: \text{edge } (u, t)); \quad -y_t + y_v \le 2 \ (x_5: \text{edge } (v, t))$$

123. A simplified version
[Figure: the same graph with heights]
Dual problem: simplify by setting $y_t = 0$ (and remove the second constraint from the primal problem P accordingly):
$$\max y_s$$
$$\text{s.t.} \quad y_s - y_u \le 5 \ (x_1: \text{edge } (s, u)); \quad y_s - y_v \le 8 \ (x_2: \text{edge } (s, v)); \quad y_u - y_v \le 1 \ (x_3: \text{edge } (u, v)); \quad y_u \le 6 \ (x_4: \text{edge } (u, t)); \quad y_v \le 2 \ (x_5: \text{edge } (v, t))$$

124. Iteration 1 (I)
Dual feasible solution: $y^T = (y_s, y_u, y_v) = (0, 0, 0)$. Let's check the constraints in D:
$$y_s - y_u = 0 < 5 \Rightarrow x_1 = 0; \quad y_s - y_v = 0 < 8 \Rightarrow x_2 = 0; \quad y_u - y_v = 0 < 1 \Rightarrow x_3 = 0; \quad y_u = 0 < 6 \Rightarrow x_4 = 0; \quad y_v = 0 < 2 \Rightarrow x_5 = 0$$
Identifying the tight constraints in D: $J = \emptyset$, implying that $x_1 = x_2 = x_3 = x_4 = x_5 = 0$.
RP:
$$\min s_1 + s_2 + s_3$$
$$\text{s.t.} \quad s_1 + x_1 + x_2 = 1 \ (\text{node } s); \quad s_2 - x_1 + x_3 + x_4 = 0 \ (\text{node } u); \quad s_3 - x_2 - x_3 + x_5 = 0 \ (\text{node } v); \quad s_1, s_2, s_3 \ge 0; \quad x_1, \dots, x_5 = 0$$

125. Iteration 1 (II)
DRP:
$$\max \bar{y}_s \quad \text{s.t.} \quad \bar{y}_s \le 1, \ \bar{y}_u \le 1, \ \bar{y}_v \le 1$$
Solving DRP by a combinatorial technique gives the optimal solution $\bar{y}^T = (1, 0, 0)$. (Note: the optimal solution is not unique.)
Step length $\theta$:
$$\theta = \min\left\{ \frac{c_1 - y^T a_1}{\bar{y}^T a_1}, \frac{c_2 - y^T a_2}{\bar{y}^T a_2} \right\} = \min\{5, 8\} = 5$$
Update $y$: $y^T = y^T + \theta \bar{y}^T = (5, 0, 0)$.

126. Iteration 1 (III)
[Figure: heights after the update: $y_s = 5$, $y_u = y_v = y_t = 0$]
From the point of view of Dijkstra's algorithm:
The optimal solution to DRP, $\bar{y}^T = (1, 0, 0)$, corresponds to the explored vertex set $S = \{s\}$ in Dijkstra's algorithm. In fact, DRP is solved by identifying the nodes reachable from $s$.
The step length $\theta = \min\{\frac{c_1 - y^T a_1}{\bar{y}^T a_1}, \frac{c_2 - y^T a_2}{\bar{y}^T a_2}\} = \min\{5, 8\} = 5$ corresponds to finding the vertex closest to the nodes in $S$ by comparing all edges going out of $S$.

127. Iteration 2 (I)
Dual feasible solution: $y^T = (5, 0, 0)$. Let's check the constraints in D:
$$y_s - y_u = 5 \ (\text{tight}); \quad y_s - y_v = 5 < 8 \Rightarrow x_2 = 0; \quad y_u - y_v = 0 < 1 \Rightarrow x_3 = 0; \quad y_u = 0 < 6 \Rightarrow x_4 = 0; \quad y_v = 0 < 2 \Rightarrow x_5 = 0$$
Identifying the tight constraints in D: $J = \{1\}$, implying that $x_2 = x_3 = x_4 = x_5 = 0$.
RP:
$$\min s_1 + s_2 + s_3$$
$$\text{s.t.} \quad s_1 + x_1 + x_2 = 1 \ (\text{node } s); \quad s_2 - x_1 + x_3 + x_4 = 0 \ (\text{node } u); \quad s_3 - x_2 - x_3 + x_5 = 0 \ (\text{node } v); \quad s_1, s_2, s_3 \ge 0; \quad x_1 \ge 0; \quad x_2, \dots, x_5 = 0$$


More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods 2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University

More information

Support Vector Machines

Support Vector Machines Wien, June, 2010 Paul Hofmarcher, Stefan Theussl, WU Wien Hofmarcher/Theussl SVM 1/21 Linear Separable Separating Hyperplanes Non-Linear Separable Soft-Margin Hyperplanes Hofmarcher/Theussl SVM 2/21 (SVM)

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

More information

Foundation of Intelligent Systems, Part I. SVM s & Kernel Methods

Foundation of Intelligent Systems, Part I. SVM s & Kernel Methods Foundation of Intelligent Systems, Part I SVM s & Kernel Methods mcuturi@i.kyoto-u.ac.jp FIS - 2013 1 Support Vector Machines The linearly-separable case FIS - 2013 2 A criterion to select a linear classifier:

More information

Machine Learning Support Vector Machines. Prof. Matteo Matteucci

Machine Learning Support Vector Machines. Prof. Matteo Matteucci Machine Learning Support Vector Machines Prof. Matteo Matteucci Discriminative vs. Generative Approaches 2 o Generative approach: we derived the classifier from some generative hypothesis about the way

More information

Yinyu Ye, MS&E, Stanford MS&E310 Lecture Note #06. The Simplex Method

Yinyu Ye, MS&E, Stanford MS&E310 Lecture Note #06. The Simplex Method The Simplex Method Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. http://www.stanford.edu/ yyye (LY, Chapters 2.3-2.5, 3.1-3.4) 1 Geometry of Linear

More information

Math 5593 Linear Programming Week 1

Math 5593 Linear Programming Week 1 University of Colorado Denver, Fall 2013, Prof. Engau 1 Problem-Solving in Operations Research 2 Brief History of Linear Programming 3 Review of Basic Linear Algebra Linear Programming - The Story About

More information

Motivating examples Introduction to algorithms Simplex algorithm. On a particular example General algorithm. Duality An application to game theory

Motivating examples Introduction to algorithms Simplex algorithm. On a particular example General algorithm. Duality An application to game theory Instructor: Shengyu Zhang 1 LP Motivating examples Introduction to algorithms Simplex algorithm On a particular example General algorithm Duality An application to game theory 2 Example 1: profit maximization

More information

Convex Optimization Boyd & Vandenberghe. 5. Duality

Convex Optimization Boyd & Vandenberghe. 5. Duality 5. Duality Convex Optimization Boyd & Vandenberghe Lagrange dual problem weak and strong duality geometric interpretation optimality conditions perturbation and sensitivity analysis examples generalized

More information

Convex Optimization and Support Vector Machine

Convex Optimization and Support Vector Machine Convex Optimization and Support Vector Machine Problem 0. Consider a two-class classification problem. The training data is L n = {(x 1, t 1 ),..., (x n, t n )}, where each t i { 1, 1} and x i R p. We

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2

More information

14. Duality. ˆ Upper and lower bounds. ˆ General duality. ˆ Constraint qualifications. ˆ Counterexample. ˆ Complementary slackness.

14. Duality. ˆ Upper and lower bounds. ˆ General duality. ˆ Constraint qualifications. ˆ Counterexample. ˆ Complementary slackness. CS/ECE/ISyE 524 Introduction to Optimization Spring 2016 17 14. Duality ˆ Upper and lower bounds ˆ General duality ˆ Constraint qualifications ˆ Counterexample ˆ Complementary slackness ˆ Examples ˆ Sensitivity

More information

Chapter 1 Linear Programming. Paragraph 5 Duality

Chapter 1 Linear Programming. Paragraph 5 Duality Chapter 1 Linear Programming Paragraph 5 Duality What we did so far We developed the 2-Phase Simplex Algorithm: Hop (reasonably) from basic solution (bs) to bs until you find a basic feasible solution

More information

Support Vector Machine (continued)

Support Vector Machine (continued) Support Vector Machine continued) Overlapping class distribution: In practice the class-conditional distributions may overlap, so that the training data points are no longer linearly separable. We need

More information

OPTIMISATION /09 EXAM PREPARATION GUIDELINES

OPTIMISATION /09 EXAM PREPARATION GUIDELINES General: OPTIMISATION 2 2008/09 EXAM PREPARATION GUIDELINES This points out some important directions for your revision. The exam is fully based on what was taught in class: lecture notes, handouts and

More information

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems

More information

Optimization 4. GAME THEORY

Optimization 4. GAME THEORY Optimization GAME THEORY DPK Easter Term Saddle points of two-person zero-sum games We consider a game with two players Player I can choose one of m strategies, indexed by i =,, m and Player II can choose

More information

Computational Game Theory Spring Semester, 2005/6. Lecturer: Yishay Mansour Scribe: Ilan Cohen, Natan Rubin, Ophir Bleiberg*

Computational Game Theory Spring Semester, 2005/6. Lecturer: Yishay Mansour Scribe: Ilan Cohen, Natan Rubin, Ophir Bleiberg* Computational Game Theory Spring Semester, 2005/6 Lecture 5: 2-Player Zero Sum Games Lecturer: Yishay Mansour Scribe: Ilan Cohen, Natan Rubin, Ophir Bleiberg* 1 5.1 2-Player Zero Sum Games In this lecture

More information

Motivation. Lecture 2 Topics from Optimization and Duality. network utility maximization (NUM) problem:

Motivation. Lecture 2 Topics from Optimization and Duality. network utility maximization (NUM) problem: CDS270 Maryam Fazel Lecture 2 Topics from Optimization and Duality Motivation network utility maximization (NUM) problem: consider a network with S sources (users), each sending one flow at rate x s, through

More information

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings

Structural and Multidisciplinary Optimization. P. Duysinx and P. Tossings Structural and Multidisciplinary Optimization P. Duysinx and P. Tossings 2018-2019 CONTACTS Pierre Duysinx Institut de Mécanique et du Génie Civil (B52/3) Phone number: 04/366.91.94 Email: P.Duysinx@uliege.be

More information

(Kernels +) Support Vector Machines

(Kernels +) Support Vector Machines (Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning

More information

EE/AA 578, Univ of Washington, Fall Duality

EE/AA 578, Univ of Washington, Fall Duality 7. Duality EE/AA 578, Univ of Washington, Fall 2016 Lagrange dual problem weak and strong duality geometric interpretation optimality conditions perturbation and sensitivity analysis examples generalized

More information

Lecture: Duality.

Lecture: Duality. Lecture: Duality http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s lecture notes Introduction 2/35 Lagrange dual problem weak and strong

More information

Linear and Combinatorial Optimization

Linear and Combinatorial Optimization Linear and Combinatorial Optimization The dual of an LP-problem. Connections between primal and dual. Duality theorems and complementary slack. Philipp Birken (Ctr. for the Math. Sc.) Lecture 3: Duality

More information

Game Theory: Lecture 2

Game Theory: Lecture 2 Game Theory: Lecture 2 Tai-Wei Hu June 29, 2011 Outline Two-person zero-sum games normal-form games Minimax theorem Simplex method 1 2-person 0-sum games 1.1 2-Person Normal Form Games A 2-person normal

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

15-780: LinearProgramming

15-780: LinearProgramming 15-780: LinearProgramming J. Zico Kolter February 1-3, 2016 1 Outline Introduction Some linear algebra review Linear programming Simplex algorithm Duality and dual simplex 2 Outline Introduction Some linear

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Appendix A Taylor Approximations and Definite Matrices

Appendix A Taylor Approximations and Definite Matrices Appendix A Taylor Approximations and Definite Matrices Taylor approximations provide an easy way to approximate a function as a polynomial, using the derivatives of the function. We know, from elementary

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Lecture Support Vector Machine (SVM) Classifiers

Lecture Support Vector Machine (SVM) Classifiers Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in

More information

Part IB Optimisation

Part IB Optimisation Part IB Optimisation Based on lectures by F. A. Fischer Notes taken by Dexter Chua Easter 015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

"SYMMETRIC" PRIMAL-DUAL PAIR

SYMMETRIC PRIMAL-DUAL PAIR "SYMMETRIC" PRIMAL-DUAL PAIR PRIMAL Minimize cx DUAL Maximize y T b st Ax b st A T y c T x y Here c 1 n, x n 1, b m 1, A m n, y m 1, WITH THE PRIMAL IN STANDARD FORM... Minimize cx Maximize y T b st Ax

More information

CO 250 Final Exam Guide

CO 250 Final Exam Guide Spring 2017 CO 250 Final Exam Guide TABLE OF CONTENTS richardwu.ca CO 250 Final Exam Guide Introduction to Optimization Kanstantsin Pashkovich Spring 2017 University of Waterloo Last Revision: March 4,

More information

Today: Linear Programming (con t.)

Today: Linear Programming (con t.) Today: Linear Programming (con t.) COSC 581, Algorithms April 10, 2014 Many of these slides are adapted from several online sources Reading Assignments Today s class: Chapter 29.4 Reading assignment for

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts Sridhar Mahadevan: CMPSCI 689 p. 1/32 Margin Classifiers margin b = 0 Sridhar Mahadevan: CMPSCI 689 p.

More information

g(x,y) = c. For instance (see Figure 1 on the right), consider the optimization problem maximize subject to

g(x,y) = c. For instance (see Figure 1 on the right), consider the optimization problem maximize subject to 1 of 11 11/29/2010 10:39 AM From Wikipedia, the free encyclopedia In mathematical optimization, the method of Lagrange multipliers (named after Joseph Louis Lagrange) provides a strategy for finding the

More information

A Brief Review on Convex Optimization

A Brief Review on Convex Optimization A Brief Review on Convex Optimization 1 Convex set S R n is convex if x,y S, λ,µ 0, λ+µ = 1 λx+µy S geometrically: x,y S line segment through x,y S examples (one convex, two nonconvex sets): A Brief Review

More information

Lecture: Algorithms for LP, SOCP and SDP

Lecture: Algorithms for LP, SOCP and SDP 1/53 Lecture: Algorithms for LP, SOCP and SDP Zaiwen Wen Beijing International Center For Mathematical Research Peking University http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html wenzw@pku.edu.cn Acknowledgement:

More information

Support Vector Machines for Regression

Support Vector Machines for Regression COMP-566 Rohan Shah (1) Support Vector Machines for Regression Provided with n training data points {(x 1, y 1 ), (x 2, y 2 ),, (x n, y n )} R s R we seek a function f for a fixed ɛ > 0 such that: f(x

More information

Lagrange Duality. Daniel P. Palomar. Hong Kong University of Science and Technology (HKUST)

Lagrange Duality. Daniel P. Palomar. Hong Kong University of Science and Technology (HKUST) Lagrange Duality Daniel P. Palomar Hong Kong University of Science and Technology (HKUST) ELEC5470 - Convex Optimization Fall 2017-18, HKUST, Hong Kong Outline of Lecture Lagrangian Dual function Dual

More information

Lecture 6: Conic Optimization September 8

Lecture 6: Conic Optimization September 8 IE 598: Big Data Optimization Fall 2016 Lecture 6: Conic Optimization September 8 Lecturer: Niao He Scriber: Juan Xu Overview In this lecture, we finish up our previous discussion on optimality conditions

More information

Support Vector Machines for Classification and Regression

Support Vector Machines for Classification and Regression CIS 520: Machine Learning Oct 04, 207 Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may

More information

Introduction to Mathematical Programming IE406. Lecture 10. Dr. Ted Ralphs

Introduction to Mathematical Programming IE406. Lecture 10. Dr. Ted Ralphs Introduction to Mathematical Programming IE406 Lecture 10 Dr. Ted Ralphs IE406 Lecture 10 1 Reading for This Lecture Bertsimas 4.1-4.3 IE406 Lecture 10 2 Duality Theory: Motivation Consider the following

More information

Support Vector Machine

Support Vector Machine Support Vector Machine Kernel: Kernel is defined as a function returning the inner product between the images of the two arguments k(x 1, x 2 ) = ϕ(x 1 ), ϕ(x 2 ) k(x 1, x 2 ) = k(x 2, x 1 ) modularity-

More information

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Introduction to Machine Learning Prof. Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module - 5 Lecture - 22 SVM: The Dual Formulation Good morning.

More information

Kernel Methods and Support Vector Machines

Kernel Methods and Support Vector Machines Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete

More information

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 SVM linearly separable case n training points (x 1,, x n ) d features x j is a d-dimensional vector Primal problem:

More information

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014

Convex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014 Convex Optimization Dani Yogatama School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA February 12, 2014 Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12,

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

An introductory example

An introductory example CS1 Lecture 9 An introductory example Suppose that a company that produces three products wishes to decide the level of production of each so as to maximize profits. Let x 1 be the amount of Product 1

More information

CONSTRAINED NONLINEAR PROGRAMMING

CONSTRAINED NONLINEAR PROGRAMMING 149 CONSTRAINED NONLINEAR PROGRAMMING We now turn to methods for general constrained nonlinear programming. These may be broadly classified into two categories: 1. TRANSFORMATION METHODS: In this approach

More information

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs E0 270 Machine Learning Lecture 5 (Jan 22, 203) Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in

More information

Game Theory. Greg Plaxton Theory in Programming Practice, Spring 2004 Department of Computer Science University of Texas at Austin

Game Theory. Greg Plaxton Theory in Programming Practice, Spring 2004 Department of Computer Science University of Texas at Austin Game Theory Greg Plaxton Theory in Programming Practice, Spring 2004 Department of Computer Science University of Texas at Austin Bimatrix Games We are given two real m n matrices A = (a ij ), B = (b ij

More information

OPTIMISATION 2007/8 EXAM PREPARATION GUIDELINES

OPTIMISATION 2007/8 EXAM PREPARATION GUIDELINES General: OPTIMISATION 2007/8 EXAM PREPARATION GUIDELINES This points out some important directions for your revision. The exam is fully based on what was taught in class: lecture notes, handouts and homework.

More information

Lecture 2: Linear SVM in the Dual

Lecture 2: Linear SVM in the Dual Lecture 2: Linear SVM in the Dual Stéphane Canu stephane.canu@litislab.eu São Paulo 2015 July 22, 2015 Road map 1 Linear SVM Optimization in 10 slides Equality constraints Inequality constraints Dual formulation

More information

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification

More information

Convex Optimization M2

Convex Optimization M2 Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial

More information

Chap 2. Optimality conditions

Chap 2. Optimality conditions Chap 2. Optimality conditions Version: 29-09-2012 2.1 Optimality conditions in unconstrained optimization Recall the definitions of global, local minimizer. Geometry of minimization Consider for f C 1

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

DEPARTMENT OF STATISTICS AND OPERATIONS RESEARCH OPERATIONS RESEARCH DETERMINISTIC QUALIFYING EXAMINATION. Part I: Short Questions

DEPARTMENT OF STATISTICS AND OPERATIONS RESEARCH OPERATIONS RESEARCH DETERMINISTIC QUALIFYING EXAMINATION. Part I: Short Questions DEPARTMENT OF STATISTICS AND OPERATIONS RESEARCH OPERATIONS RESEARCH DETERMINISTIC QUALIFYING EXAMINATION Part I: Short Questions August 12, 2008 9:00 am - 12 pm General Instructions This examination is

More information

III. Linear Programming

III. Linear Programming III. Linear Programming Thomas Sauerwald Easter 2017 Outline Introduction Standard and Slack Forms Formulating Problems as Linear Programs Simplex Algorithm Finding an Initial Solution III. Linear Programming

More information

LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input

More information