Parallel PIPS-SBB Multi-level parallelism for 2-stage SMIPS. Lluís-Miquel Munguia, Geoffrey M. Oxberry, Deepak Rajan, Yuji Shinano


Our contribution. PIPS-PSBB*: multi-level parallelism for stochastic mixed-integer programs. A fully-featured MIP solver for any generic 2-stage stochastic MIP. Two levels of nested parallelism (B&B and LP relaxations). Integral parallelization of every component of Branch & Bound. Handles large problems through parallel distribution of the problem data. Distributed-memory parallelization. Novel fine-grained load-balancing strategies. Actually two parallel solvers: PIPS-PSBB and ug[pips-sbb,mpi]. *PIPS-PSBB: Parallel Interior Point Solver, Parallel Simple Branch and Bound.

Introduction. MIPs are NP-hard problems: theoretically and computationally intractable. LP-based Branch & Bound allows us to systematically search the solution space by subdividing the problem. Upper bounds (UB) are provided by the integer solutions found along the Branch & Bound exploration. Lower bounds (LB) are provided by the optimal values of the LP relaxations. The optimality gap is GAP(%) = 100 * (UB - LB) / UB.
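The bounding logic can be made concrete with a toy example: a 0/1 knapsack, whose LP relaxation is the fractional knapsack. This is an illustrative sketch, not PIPS-SBB code, and since the toy is a maximization the bound roles flip relative to the slide's minimization convention: the incumbent supplies the achievable bound and the relaxation the optimistic one.

```python
def lp_bound(values, weights, cap, fixed):
    """Bound from the LP relaxation of the remaining problem: fixed items are
    taken as decided, free items may be taken fractionally (greedy by ratio)."""
    value = weight = 0.0
    free = []
    for i, (v, w) in enumerate(zip(values, weights)):
        if fixed.get(i) == 1:
            value += v
            weight += w
        elif i not in fixed:
            free.append((v, w))
    if weight > cap:
        return None                          # fixed decisions already infeasible
    for v, w in sorted(free, key=lambda t: t[0] / t[1], reverse=True):
        take = min(1.0, (cap - weight) / w)
        value += take * v
        weight += take * w
        if take < 1.0:
            break
    return value

def branch_and_bound(values, weights, cap):
    """Depth-first B&B: the incumbent is the best integer solution found so far;
    nodes whose relaxation bound cannot beat it are fathomed."""
    n, incumbent = len(values), 0.0
    stack = [{}]                             # each node = partial 0/1 assignment
    while stack:
        fixed = stack.pop()
        bound = lp_bound(values, weights, cap, fixed)
        if bound is None or bound <= incumbent:
            continue                         # fathom by infeasibility or bound
        branch_var = next((i for i in range(n) if i not in fixed), None)
        if branch_var is None:               # leaf: all variables are integer
            incumbent = bound                # the relaxation bound is exact here
            continue
        stack.append({**fixed, branch_var: 0})
        stack.append({**fixed, branch_var: 1})
    return incumbent
```

On the classic instance with values (60, 100, 120), weights (10, 20, 30) and capacity 50, the root relaxation gives 240 while the best integer solution is 220; the gap between those two is exactly what the search closes.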

Coarse-grained Parallel Branch and Bound. Branch and bound is straightforward to parallelize: the processing of subproblems is independent. This standard parallelization is present in most state-of-the-art MIP solvers. The processing of a single node becomes the sequential computation bottleneck. Coarse-grained parallelizations are a popular option, but they carry potential performance pitfalls due to the master-slave approach, and LP relaxations are hard to parallelize.

Coarse-grained Parallel Branch and Bound. The Branch and Bound exploration is coordinated by a special process or thread. Worker threads solve open subproblems using a base MIP solver. Centralized communication poses serious challenges: performance bottlenecks and a reduction in parallel efficiency. Communication stress occurs at ramp-up and ramp-down. Limited rebalancing capability leads to a suboptimal distribution of work. Diffusion of information is slow.

Currently available coarse-grained parallelizations. Coarse-grained parallelizations may scale poorly: extra work is performed compared to the sequential case, since the information required to fathom nodes is only discovered over the course of the optimization. Powerful heuristics are necessary to find good feasible solutions early in the search.

Branch and Bound as a graph problem. We can regard parallel Branch and Bound as a parallel graph exploration problem. Given P processors, we define the frontier of the tree as the set of subproblems currently open. The subset currently being processed in parallel are the active nodes. We additionally define a redundant node as a subproblem that would be fathomed if the optimal solution were already known. The goal is to increase the efficiency of parallel Branch and Bound by reducing the number of redundant nodes explored.
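The notion of a redundant node can be made operational after the fact: given the optimal value, any node whose relaxation bound is no better would have been fathomed under a perfect incumbent. A minimal sketch, using the minimization convention and hypothetical bound lists:

```python
def redundant_nodes(node_bounds, optimal_value):
    """Minimization convention: a node whose LP lower bound is already >= the
    optimal value is redundant -- with the optimum known as the incumbent,
    that node would be fathomed without any branching."""
    return [b for b in node_bounds if b >= optimal_value]
```

Counting these nodes is the basis of the node-inefficiency metric used later in the experimental results.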

Our approach to Parallel Branch and Bound. To reduce the number of redundant nodes explored, the search must fathom subproblems early by maintaining high-quality primal incumbents and focusing on the most promising nodes. We increase parallel efficiency by generating a set of active nodes comprised of the most promising nodes, and by employing the processors to explore the smallest possible number of active nodes. Two degrees of parallelism: processing of nodes in parallel (parallel LP relaxation, parallel heuristics, parallel branching, etc.), and Branch and Bound in parallel.

Fine-grained Parallel Branch and Bound. The smallest transferable unit of work is a Branch and Bound node. Because of the exchange of nodes, the queue at each processor becomes a collection of subtrees. This allows great flexibility and fine-grained control of the parallel effort. Coordination of the parallel optimization is decentralized, with the objective of maximizing load balance.

All-to-all parallel node exchange. Load balancing is maintained via synchronous MPI collective communications. The lower bounds of the K most promising nodes of every processor are exchanged and ranked, and the best of the pooled K x N candidates are selected and redistributed in a round-robin fashion (the slide illustrates this for K = 3 and N = 3 solvers). Because of the synchronous nature of the approach, communication must be used strategically in order to avoid parallel overheads. Node transfers are synchronous, while the status of each solver (upper/lower bounds, tree sizes, times, solutions, etc.) is exchanged asynchronously.
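The exchange step can be sketched in a few lines, with nodes represented simply by their lower bounds; this is an illustrative model of the ranking and round-robin deal, not the MPI implementation:

```python
import heapq

def rebalance(queues, K):
    """Sketch of the synchronous all-to-all node exchange: each of the N
    solvers offers its K most promising (lowest-bound) nodes, the pooled
    K*N offers are ranked globally, and the winners are dealt back
    round-robin so every solver receives comparably promising work."""
    N = len(queues)
    offers = []
    for q in queues:
        best = heapq.nsmallest(K, q)         # K most promising nodes
        for node in best:
            q.remove(node)
        offers.extend(best)
    offers.sort()                            # global ranking of K*N candidates
    for rank, node in enumerate(offers):
        queues[rank % N].append(node)        # round-robin redistribution
    return queues
```

After the exchange, the best node of each solver comes from the global top of the pool rather than from whatever subtree that solver happened to hold, which is exactly the load-balancing effect the slide describes.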

Stochastic Mixed Integer Programming: an overview. Stochastic programming models optimization problems involving uncertainty. We consider two-stage stochastic mixed-integer programs (SMIPs) with recourse. The 1st stage models deterministic here-and-now decisions; the 2nd stage depends on the random event and the first-stage decisions. The cost function includes the deterministic variables and the expected value function over the non-deterministic parameters.

Stochastic MIPs and their deterministic equivalent. We consider deterministic equivalent formulations of 2-stage SMIPs under the sample average approximation. This assumption yields the characteristic dual block-angular structure:

    min c'x  s.t.

    [ A              ] [x_0]   [b_0]   <- common constraints
    [ T_1  W_1       ] [x_1]   [b_1]
    [ T_2       W_2  ] [x_2] = [b_2]   <- independent realizations (scenarios)
    [  .           . ] [ . ]   [ . ]
    [ T_N        W_N ] [x_N]   [b_N]

    x >= 0
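For a toy dense example, the block-angular matrix can be assembled explicitly (lists of lists, sizes inferred from the blocks). This is only a sketch of the structure: PIPS-SBB keeps the scenario blocks distributed and never forms the full matrix.

```python
def block_angular(A, T, W):
    """Assemble the dense deterministic-equivalent constraint matrix:
    the first-stage block [A | 0 ... 0] on top, then one block row
    [T_i | 0 ... W_i ... 0] per scenario i."""
    N = len(T)                  # number of scenarios
    ns = len(W[0][0])           # second-stage columns per scenario
    rows = [a + [0] * (N * ns) for a in A]
    for i in range(N):
        for Ti_row, Wi_row in zip(T[i], W[i]):
            rows.append(Ti_row + [0] * (i * ns) + Wi_row
                        + [0] * ((N - 1 - i) * ns))
    return rows
```

With two scenarios the W_i blocks land on the block diagonal to the right of the T_i columns, reproducing the picture above.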

PIPS-SBB: Design philosophy and features. PIPS-SBB is a specialized solver for two-stage stochastic mixed-integer programs that uses Branch and Bound to achieve finite convergence to optimality. It addresses each of the issues associated with stochastic MIPs. A distributed-memory approach allows the second-stage scenario data to be partitioned among multiple computing nodes. As the backbone LP solver, we use PIPS-S: a distributed-memory parallel simplex solver for stochastic linear programs. PIPS-SBB has a structured software architecture that is easy to expand in terms of functionality and features.

Our approach to Parallel Branch and Bound. The two levels of parallelism require a layered organization of the MPI processes into a grid: each row of processes P_m,0 ... P_m,n forms one PIPS-SBB solver, and each column forms a Branch and Bound communicator. In the Branch and Bound communicators, processes exchange Branch and Bound nodes, solutions, lower-bound information, and queue sizes and search status. In the PIPS-S communicators, processes perform in parallel: LP relaxations, primal heuristics, and branching and candidate selection. Strategies for ramp-up: parallel strong branching, or standard Branch and Bound. Strategy for ramp-down: intensify the frequency of node rebalancing.
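The rank-to-communicator mapping implied by this grid reduces to index arithmetic; with real MPI the two values below would be the colors passed to two MPI_Comm_split calls. The function name and the procs_per_solver parameter are illustrative assumptions, not the actual API:

```python
def layered_layout(world_size, procs_per_solver):
    """Map each global MPI rank to its pair of communicators in the 2-level
    grid: consecutive ranks form one PIPS-SBB solver (the PIPS-S communicator
    for parallel LP relaxations), and ranks sharing a local index form one
    Branch and Bound communicator across solvers."""
    layout = {}
    for rank in range(world_size):
        solver = rank // procs_per_solver    # row: PIPS-S (LP) communicator
        bnb = rank % procs_per_solver        # column: B&B communicator
        layout[rank] = (solver, bnb)
    return layout
```

For example, 6 ranks with 3 processes per solver yield two solvers of three ranks each, and three B&B communicators of two ranks each.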

ug[pips-sbb,mpi]. In addition to PIPS-PSBB, we also introduce ug[pips-sbb,mpi]: a coarse-grained external parallelization of PIPS-SBB. UG is a generic framework used to parallelize Branch & Bound based MIP solvers. It exploits the powerful performance of state-of-the-art base solvers, such as SCIP, Xpress, Gurobi, and CPLEX, and uses the base solver as a black box. UG has been widely applied to parallelize many MIP solvers: distributed memory via MPI (ug[scip,mpi], ug[xpress,mpi], ug[cplex,mpi]); shared memory via Pthreads (ug[scip,pth], ug[xpress,pth]). The two designs contrast as follows. PIPS-PSBB: parallel Branch and Bound with decentralized, lightweight control of the workload (fine-grained approach), on top of PIPS-S parallel LP relaxations and data parallelization. ug[pips-sbb,mpi]: parallel Branch and Bound with centralized control of the workload (coarse-grained, black-box approach), on top of sequential Branch and Bound in PIPS-SBB.

ug[pips-sbb,mpi]. UG has been successfully used to solve some open MIP problems using more than 80,000 cores, so it has certainly proven to be scalable. ug[pips-sbb,mpi] was co-developed with Yuji Shinano. It is the second MIP solver in the world (after PIPS-PSBB) to use two levels of nested parallelism.

Experimental performance results. We test our solver on SSLP instances from the SIPLIB library. SSLP instances model server location under uncertainty. Instances are coded as sslp_m_n_s, where s represents the number of scenarios; a larger number of scenarios means a bigger problem. The LP relaxations of all instances fit in memory, even in CPLEX, but PIPS-SBB can handle much larger LP relaxations. Details: see http://www2.isye.gatech.edu/~sahmed/siplib/sslp/sslp.html. PIPS-SBB was run on the Cab cluster: each node has two 8-core Intel Xeon E5-2670 CPUs at 2.6 GHz (16 cores/node), 2 GB RAM/core (32 GB RAM/node), and an InfiniBand QDR interconnect. CPLEX 12.6.2 was used in some comparisons, in its vanilla (default) configuration.

Experimental performance results. We measure parallel performance in terms of speedup, communication overhead, and node inefficiency. Speedup relates the time T_p needed to reach optimality with p processors to the time T_1 needed by a sequential baseline: S_p = T_1 / T_p. Communication overhead is the fraction of time spent on communication and processor synchronization relative to the total execution time: C_ov = (T_comm + T_sync) / T_exec. Node inefficiency is the fraction of redundant nodes explored N_r with respect to the total number of nodes explored N_total: N_ineff = N_r / N_total.
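The three metrics translate directly into code; this is a literal transcription of the formulas above, with function names of my choosing:

```python
def speedup(t_seq, t_par):
    """S_p = T_1 / T_p: sequential time over time with p processors."""
    return t_seq / t_par

def comm_overhead(t_comm, t_sync, t_exec):
    """C_ov = (T_comm + T_sync) / T_exec: fraction of execution time
    spent communicating and synchronizing."""
    return (t_comm + t_sync) / t_exec

def node_inefficiency(n_redundant, n_total):
    """N_ineff = N_r / N_total: fraction of explored nodes that were
    redundant (fathomable had the optimum been known)."""
    return n_redundant / n_total
```

A perfectly efficient run would have speedup equal to p, overhead near zero, and node inefficiency near zero.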

PIPS-PSBB and ug[pips-sbb,mpi]: performance comparison when optimizing small instances, on sslp_15_45_5 (5 scenarios, 3390 binary variables, 301 constraints). The slide plots time to optimality, communication overhead (%), total tree size, and node inefficiency (%) against the number of processors (1 to 400). PIPS-PSBB: scales up to 200 cores (66x); total work performed remains within a factor of 2x of the sequential run; communication overhead dominates after 400 cores; node inefficiency grows at a slower rate than ug[pips-sbb,mpi]. ug[pips-sbb,mpi]: scales up to 200 cores (33x); total work varies by processor configuration; higher communication overhead and higher node inefficiency.

Tuning the communication frequency of PIPS-PSBB. The slide plots time to optimality, communication overhead (%), and total tree size against the number of processors (1 to 400) for three settings: tight communication (10-500), standard communication (50-1000), and loose communication (100-50000). PIPS-PSBB allows the frequency of its synchronous communications to be modified. The frequency is defined by a pair (x, y), where x and y represent the minimum and maximum number of B&B iterations that must be processed before communication takes place. Tighter communication increases communication overhead but reduces the work performed; the opposite occurs under loose communication.
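One way to read the (x, y) rule is as a gate around each synchronous exchange: never sync before x iterations have passed, always sync once y have, and in between sync only when some trigger fires. This is a speculative sketch of that policy, not the PIPS-PSBB implementation; the queue-imbalance trigger is a hypothetical stand-in for whatever condition the solver actually uses:

```python
def due_for_sync(iters_since_last, imbalance_detected, x, y):
    """Gate for the synchronous node exchange under an (x, y) frequency
    setting: below x iterations never communicate, at or above y always
    communicate, and in between communicate only on demand."""
    if iters_since_last < x:
        return False                 # too soon: avoid communication overhead
    if iters_since_last >= y:
        return True                  # overdue: force a rebalancing round
    return imbalance_detected        # optional trigger in the (x, y) window
```

Under this reading, the tight setting (10, 500) syncs often and keeps queues balanced at the cost of overhead, while the loose setting (100, 50000) does the reverse, matching the trade-off stated on the slide.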

PIPS-PSBB solver performance exposed, on sslp_10_50_500 (500 scenarios, 250,010 binary variables, 30,001 constraints). The slide plots root-node LP relaxation time, total tree size, communication overhead (%), and ramp-up communication overhead (%) for configurations ranging from 1 solver with 500 processors per PIPS-S solver (1[500]) to 500 solvers with 1 processor each (500[1]). PIPS-S: speedup up to 10 cores is 6x, and performance increases up to 20 cores. PIPS-PSBB: communication overhead is minimal except at ramp-up, when the LP solver is slow.

PIPS-SBB: Comparison against CPLEX. Performance comparison against CPLEX 12.6.2, with a time limit of 1 hour. Entries show either the time to optimality in parentheses or, where the limit was reached, the remaining GAP(%); (M) marks runs that hit memory limits. SM and DM denote the shared-memory and distributed-memory parallelizations of CPLEX.

    Instance         Scen.  Solvers  PIPS-S  PIPS-PSBB  ug[pips-sbb,mpi]  CPLEX SM        CPLEX DM
                                     procs                                procs  result   procs  result
    sslp_5_25_50       50      2       2     (7.45s)    (8.03s)             4    (0.27s)     4   (0.27s)
    sslp_5_25_100     100      2       2     (22.37s)   (17.79s)            4    (0.64s)     4   (0.64s)
    sslp_15_45_5        5    200       2     (107.11s)  (163.53s)          16    (1.97s)   400   (6.26s)
    sslp_15_45_10      10    200       2     0.09%      0.16%              16    (1.81s)   400   (15.04s)
    sslp_15_45_15      15    200       2     0.25%      0.30%              16    (7.80s)   400   (15.75s)
    sslp_10_50_50      50    200      10     0.13%      0.21%              16    (43.88s) 2000   0.15% (M)
    sslp_10_50_100    100    200      10     0.17%      0.20%              16    (221.69s) 2000  0.16% (M)
    sslp_10_50_500    500    200      10     0.24%      0.24%              16    4.91% (M) 2000  1.25% (M)
    sslp_10_50_1000  1000    200      10     0.24%      0.24%              16    9.91%    2000   6.08%
    sslp_10_50_2000  2000    200      10     0.26%      0.26%              16    19.93%   2000   8.11%

The distributed-memory parallelization of CPLEX is often inferior to its shared-memory counterpart, and both CPLEX versions run into memory limits on some problems. The superior performance of CPLEX's base solver helps on trivial and small problems, while the PIPS-SBB-based solvers show superior performance on large problems.

Conclusions. We developed a lightweight, decentralized, distributed-memory branch-and-bound implementation for PIPS-SBB with two degrees of parallelism: processing of nodes in parallel (parallel LP relaxation, parallel heuristics, parallel branching, etc.), and Branch and Bound in parallel. Better parallel efficiency is achieved by focusing the parallel resources on the most promising nodes. We reduce communication bottlenecks and achieve high processor occupancy via decentralized control of the tree exploration and a lightweight mechanism for exchanging Branch and Bound nodes. The result is competitive performance against state-of-the-art commercial MIP solvers on large instances.

A natural progression in the parallelization of Branch & Bound. The presented work contributes to the ultimate goal of improving the parallel efficiency of Branch & Bound: new parallel heuristics, which leverage parallelism in order to increase the effectiveness, speed, and scalability of primal heuristics, and new parallel algorithms for a better distribution of work, progressing over time toward scalable massively-parallel heuristics and work-efficient parallel Branch & Bound. The code of PIPS-PSBB is available at: https://github.com/llnl/pips-sbb

Thank You!