Mathematical Optimization Models and Applications

Size: px

Start display at page:

Download "Mathematical Optimization Models and Applications"

Byron Cole
5 years ago
Views:

1 Mathematical Optimization Models and Applications Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305, U.S.A. yyye Chapters 1, 2.1-2, 6.1-2, 7.2, 11.1,

2 Model Classifications Optimization problems are generally divided into Unconstrained, Linear and Nonlinear Programming based upon the objective and constraints of the problem Unconstrained Optimization: Ω is the entire space R n Linear Optimization: If both the objective and the constraint functions are linear/affine Conic Linear Optimization: If both the objective and the constraint functions are linear/affine, but variables in a convex cone. Nonlinear Optimization: If the constraints contain general nonlinear functions Stochastic Optimization: Optimize the expected objective function with random parameters Fixed-Point Computation: Optimization of multiple agents with zero-sum objectives There are integer program, mixed-integer program etc. We present a few optimization examples in this lecture that we would cover through out this course. 2

3 Logistic Regression I Given a training data point a i R n, according to the logistic model, the probability that it s in one class C is represented by e at i x+x e at i x+x 0 Thus, for some training data points, we like to determine intercept x 0 and slope vector x R n such that e at i x+x 0 1, if a i C =. 1 + e at i x+x 0. 0, otherwise Then the probability to give a right classification answer for all training data points is ( a i C e at i x+x e at i x+x 0 ) a i C e at i x+x 0 3

4 Logistic Regression II Therefore, we like to maximize the probability when deciding intercept x 0 and slope vector x R n ( ) e at i x+x 0 ( ) 1 1 = e at i x+x e at i x+x e at i x x e at i x+x 0 a i C a i C a i C a i C, which is equivalently to maximize ( ) ln(1 + e at i x x 0 ) ln(1 + e at i x+x 0 ). a i C a i C Or min x 0,x ( a i C ln(1 + e at i x x 0 ) ) + a i C ln(1 + e at i x+x 0 ). This is an unconstrained optimization problem, where the objective is a convex function of decision variables: intercept x 0 and slope vector x R n. 4

5 Linear Programming and Conic Linear Programming minimize c T x subject to Ax = b, x K. Linear Programming (LP): when K is the nonnegative orthant cone Second-Order Cone Programming (SOCP): when K is the second-order cone Semidefinite Cone Programming (SDP): when K is the semidefinite matrix cone min 2x 1 + x 2 + x 3 s.t. x 1 + x 2 + x 3 = 1, (x 1 ; x 2 ; x 3 ) 0; min 2x 1 + x 2 + x 3 s.t. x 1 + x 2 + x 3 = 1, x x 2 3 x 1. min 2x 1 + x 2 + x 3 s.t. x 1 + x 2 + x 3 = 1, ( ) x 1 x 2 0, x 2 x 3 5

6 LP Examles: Sparsest Data Fitting We want to find a sparsest solution to fit exact data measurements, that is, to minimize the number of non-zero entries in x such that Ax = b: minimize x 0 = {j : x j 0} subject to Ax = b. Sometimes this objective can be accomplished by LASSO: It can be equivalently represented by (?) minimize subject to Ax = b. x 1 = n j=1 x j minimize n j=1 y j subject to Ax = b, y x y; Both are linear programs! or minimize n j=1 (x j + x j ) subject to A(x x ) = b, x 0, x 0. 6

7 Sparsest Data Fitting continued A better approximation of the objective can be accomplished by minimize x p := subject to Ax = b; ( n j=1 x j p ) 1/p or minimize Ax b 2 + µ ( n j=1 x j p ) 1/p for some 0 < p < 1, where µ > 0 is a regularization parameter. Or simply minimize x p p := subject to Ax = b; ( n j=1 x j p ) or minimize Ax b 2 + β ( n j=1 x j p ) The former is a linearly constrained (nonconvex) optimization problem! 7

8 SOCP Example: Facility Location c j is the location of client j = 1, 2,..., m, and y is the location of a facility to be built. minimize j y c j p. Or equivalently (?) minimize subject to j δ j y + x j = c j, x j p δ j, j. This is a p-order conic linear program for p 1. In particular, when p = 2, it is an SOCP problem. For simplicity, consider m = 3. 8

9 C2 C1 p=1 y Figure 1: Facility Location at Point y. p=2 C3 9

10 Graph Realization and Sensor Network Localization Given a graph G = (V, E) and sets of non negative weights, say {d ij : (i, j) E}, the goal is to compute a realization of G in the Euclidean space R d for a given low dimension d, where the distance information is preserved. More precisely: given anchors a k R d, d ij N x, and ˆd kj N a, find x i R d such that x i x j 2 = d 2 ij, (i, j) N x, i < j, a k x j 2 = ˆd 2 kj, (k, j) N a, This is a set of Quadratic Equations. Does the system have a localization or realization of all x j s? Is the localization unique? Is there a certification for the solution to make it reliable or trustworthy? Is the system partially localizable with certification? It can be relaxed to SOCP (change = to ) or SDP. 10

11 Figure 2: 50-node 2-D Sensor Localization. 11

12 Matrix Representation of SNL and SDP Relaxation Let X = [x 1 x 2... x n ] be the d n matrix that needs to be determined and e j be the vector of all zero except 1 at the jth position. Then x i x j = X(e i e j ) and a k x j = [I X](a k ; e j ) so that x i x j 2 = (e i e j ) T X T X(e i e j ) a k x j 2 = (a k ; e j ) T [I X] T [I X](a k ; e j ) = (a k ; e j ) T I X X T X T X (a k ; e j ). 12

13 Or, equivalently, (e i e j ) T Y (e i e j ) = d 2 ij, i, j N x, i < j, (a k ; e j ) T I X (a k ; e j ) = ˆd 2 X T kj Y, k, j N a, Y = X T X. Relax Y = X T X to Y X T X, which is equivalent to matrix inequality: I X 0. X T Y This matrix has rank at least d; if it s d, then Y = X T X, and the converse is also true. The problem is now an SDP problem. 13

14 More CLP Examples: Portfolio Management For expected return vector r and co-variance matrix V of an investment portfolio, one management model is: minimize x T V x subject to r T x µ, e T x = 1, x 0, where e is the vector of all ones. This is a convex quadratic program. or simply minimize x T V x subject to r T x µ, e T x = 1, 14

15 More CLP Examples: Robust Portfolio Management In applications, r and V may be estimated under various scenarios, say r i and V i for i = 1,..., m. Then, we like minimize max i x T V i x subject to min i r T i x µ, e T x = 1, x 0. minimize subject to α r T i x µ, i xt V i x α, i e T x = 1, x 0. This is a SOCP problem. 15

16 Recall Data Classification: Supporting Vector Machine Like in logistic regression, suppose we have two-class discrimination data. We assign the first class with 1 and the second with 1 for a binary varible. A powerful discrimination method is the Supporting Vector Machine (SVM). Let the first class data points i be given by a i R d, i = 1,..., n 1 and the second class data points j be given by b j R d, j = 1,..., n 2. We like to find a hyperplane to separate the two classes: minimize β + µ x 2 subject to where µ is a fixed positive regularization parameter. This is a quadratic program. a T i x + x 0 + β 1, i, b T j x + x 0 β 1, j, β 0, 16

17 Figure 3: Linear Support Vector Machine 17

18 Stochastic Optimization and Learning In real world, we most often do where ξ represents random variables with the joint distribution F ξ. mimimize x X E Fξ [h(x, ξ)] (1) Pros: In many cases, the expected value is a good measure of performance Cons: One has to know the exact distribution of ξ to perform the stochastic optimization so that we most frequently use sample distribution. Then, deviant from the assumed distribution may result in sub-optimal solutions. Even know the distribution, the solution/decision is generically risky. 18

19 Learning with Noises/Distortions Goodfellow et al. [2014] 19

20 Distributionally Robust Optimization and Learning Why error? Believing that the sample distribution is the true distribution... In practice, although the exact distribution of the random variables may not be known, people usually know certain observed samples or training data and other statistical information. Thus, we can consider an enlarged distribution set D that confidently containing the sample distribution, and do minimize x X max Fξ D E Fξ [h(x, ξ)] (2) In DRO, we consider a set of distributions D and choose one to minimize the expected value for the worst distribution in D. When choosing D, we need to consider the following: Tractability Practical (Statistical) Meanings Performance (the potential loss comparing to the benchmark cases) This is a nonlinear Min Max optimization problem 20

21 The Markov Decision/Game Process Markov decision processes (MDPs) provide a mathematical framework for modeling sequential decision-making in situations where outcomes are partly random and partly under the control of a decision maker. Markov game processes (MGPs) provide a mathematical slidework for modeling sequential decision-making of two-person turn-based zero-sum game. MDGPs are useful for studying a wide range of optimization/game problems solved via dynamic programming, where it was known at least as early as the 1950s (cf. Shapley 1953, Bellman 1957). Modern applications include dynamic planning under uncertainty, reinforcement learning, social networking, and almost all other stochastic dynamic/sequential decision/game problems in Mathematical, Physical, Management and Social Sciences. 21

22 An MDGP Toy Example: Non-Game Setting Actions are in red, blue and black; and all actions have zero cost except the state 4 to the termination state. Which actions to take from every state to minimize the total repeated cost? 22

23 Toy Example: Simplified Representation Actions are in red, blue and black; and all actions have immediate zero cost except the state 4 to the termination state. 23

24 Toy Example: Game Setting States {0, 1, 2} minimize, while States {3, 4} maximize. 24

25 MDP Stationary Policy and Cost-to-Go Value A stationary policy for the decision maker is a function π = {π 1, π 2,, π m } that specifies an action in each state, π i A i, that the decision maker will always choose; which also lead to a cost-to-go value for each state The MDP is to find a stationary policy to minimize/maximize the expected discounted sum over the infinite horizon with a discount factor 0 γ < 1: γ t E[c π i t (i t, i t+1 )]. t=0 If the states are partitioned into two sets, one is to minimize and the other is to maximize the discounted sum, then the process becomes a two-person turn-based zero-sum stochastic game. 25

26 The Cost-to-Go Values of the States /8 3/4 1 1/ Cost-to-go values on each state when actions in red are taken (the current policy is not optimal). 26

27 The Optimal Cost-to-Go Value Vector Let y R m represent the cost-to-go values of the m states, ith entry for ith state, of a given policy. The MDP problem entails choosing the optimal value vector y such that it satisfies: y i = min{c j + γp T j y, j A i }, i, with optimal policy π i = arg min{c j + γp T j y, j A i }, i. In the Game setting, the conditions becomes: y i = min{c j + γp T j y, j A i }, i I, and y i = max{c j + γp T j y, j A i }, i I +. They both are fix-point or saddle-point problems. 27

Project Discussions: SNL/ADMM, MDP/Randomization, Quadratic Regularization, and Online Linear Programming

Project Discussions: SNL/ADMM, MDP/Randomization, Quadratic Regularization, and Online Linear Programming Yinyu Ye Department of Management Science and Engineering Stanford University Stanford, CA 94305,