
Los Alamos National Laboratory

Approved for public release; distribution is unlimited.

Title: The Delayed Coupling Method: An Algorithm for Solving Banded Diagonal Matrix Problems in Parallel

Author(s): N. Mattor, T. J. Williams, D. W. Hewett, A. M. Dimits

Submitted to: IMACS 97, August 24-29, 1997, Berlin, Germany

Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the University of California for the U.S. Department of Energy under contract W-7405-ENG-36. By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S. Government purposes. Los Alamos National Laboratory requests that the publisher identify this article as work performed under the auspices of the U.S. Department of Energy. The Los Alamos National Laboratory strongly supports academic freedom and a researcher's right to publish; as an institution, however, the Laboratory does not endorse the viewpoint of a publication or guarantee its technical correctness. Form 836 (10/96)

DISCLAIMER

This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.


The Delayed Coupling Method: An Algorithm for Solving Banded Diagonal Matrix Problems in Parallel

N. Mattor, T. J. Williams, D. W. Hewett, A. M. Dimits
Lawrence Livermore National Laboratory, Livermore, California 94550 USA
e-mail: mattor@m5.llnl.gov
* Now at Los Alamos National Laboratory.

ABSTRACT

We present a new algorithm for solving banded diagonal matrix problems efficiently on distributed-memory parallel computers, designed originally for use in dynamic alternating-direction implicit (ADI) partial differential equation solvers. The algorithm optimizes efficiency with respect to the number of numerical operations and with respect to the amount of interprocessor communication. We refer to our approach as the delayed coupling method because the communication is deferred until needed. We focus here on tridiagonal and periodic tridiagonal systems.

1 INTRODUCTION

We discuss a new approach to the parallel solution of banded linear systems, the delayed coupling method. The method is analogous to the solution of an inhomogeneous linear differential equation, where the solution is a particular solution added to an arbitrary linear combination of homogeneous solutions, and the coefficients of the homogeneous solutions are later determined by boundary conditions. In our parallel method, each processor is given a contiguous subsection of a tridiagonal system. With no information about the neighboring subsystems, each processor obtains the solution up to two constants. Then the global solution can be found by matching endpoints. Our earlier paper [1] has a more detailed description of the method and its application to tridiagonal systems.

The algorithm is designed with the following objectives, listed in order of priority. The first objective is to minimize the number of interprocessor communications opened, since this is the most time-consuming process. Second, the algorithm allows flexibility in the specific solution method for the tridiagonal submatrices; here we employ a variant of LU decomposition, but this is easily replaced with cyclic reduction or other methods. Third, we wish to minimize storage needs.

2 BASIC ALGORITHM

We consider the N x N tridiagonal linear system

A X = R,   (1)

with

A = \begin{pmatrix} b_1 & c_1 & & & \\ a_2 & b_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{N-1} & b_{N-1} & c_{N-1} \\ & & & a_N & b_N \end{pmatrix},

on a parallel computer with P processors. For simplicity, we assume N = PM, with M an integer. Our algorithm is as follows. First, we divide the linear system of order N into P subsystems of order M. Thus, the N x N matrix A is divided into P tridiagonal submatrices L_p, each of dimension M x M, lying along the block diagonal of A; the only elements of A outside the L_p are the couplings c^p_M and a^{p+1}_1 between adjacent subsystems,

A = \begin{pmatrix} L_1 & c^1_M e_M e_1^T & & \\ a^2_1 e_1 e_M^T & L_2 & \ddots & \\ & \ddots & \ddots & c^{P-1}_M e_M e_1^T \\ & & a^P_1 e_1 e_M^T & L_P \end{pmatrix},

where e_p is the pth column of the M x M identity matrix. Similarly, we divide the N-dimensional vectors X and R into P sub-vectors x_p and r_p, each of dimension M,

X = (x_1, x_2, ..., x_P)^T,   R = (r_1, r_2, ..., r_P)^T.

For each subsystem p we define three vectors x^P_p, x^UH_p, and x^LH_p as the solutions to the equations

L_p x^P_p = r_p,   (2)
L_p x^UH_p = (-a^p_1, 0, ..., 0)^T,   (3)
L_p x^LH_p = (0, ..., 0, -c^p_M)^T.   (4)

The superscripts on the x stand for particular, upper homogeneous, and lower homogeneous solution, respectively, from the inhomogeneous differential equation analogy. Here a^p_m is the mth subdiagonal element of the pth submatrix, etc. The general solution of subsystem p is x^P_p plus an arbitrary linear combination of x^UH_p and x^LH_p,

x_p = x^P_p + ξ^UH_p x^UH_p + ξ^LH_p x^LH_p,   (5)

where ξ^UH_p and ξ^LH_p are as yet undetermined coefficients that depend on the coupling to the neighboring solutions.
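To make Eqs. (2)-(4) concrete, the sketch below solves one subsystem for its particular and homogeneous solutions with an off-the-shelf dense solve; the paper's own implementation uses the fused LU-style sweeps of Section 3. The function name block_solutions and the indexing conventions (global diagonals a, b, c with a[0] = c[N-1] = 0 in the nonperiodic case) are ours, for illustration only.

    import numpy as np

    # Hypothetical helper (name and interface are ours): given the global sub-,
    # main and super-diagonals a, b, c and right-hand side r, solve the M x M
    # block covering global rows lo..hi-1 for the particular solution (Eq. 2)
    # and the upper/lower homogeneous solutions (Eqs. 3-4).
    def block_solutions(a, b, c, r, lo, hi):
        m = hi - lo
        # Dense copy of L_p for clarity; a real code would use a tridiagonal
        # (Thomas/LU) sweep as described in Section 3.
        L = np.diag(b[lo:hi]) + np.diag(c[lo:hi-1], 1) + np.diag(a[lo+1:hi], -1)
        rhs_p = r[lo:hi]
        rhs_uh = np.zeros(m); rhs_uh[0] = -a[lo]     # -a^p_1, coupling to the previous block
        rhs_lh = np.zeros(m); rhs_lh[-1] = -c[hi-1]  # -c^p_M, coupling to the next block
        sols = np.linalg.solve(L, np.column_stack([rhs_p, rhs_uh, rhs_lh]))
        return sols[:, 0], sols[:, 1], sols[:, 2]    # x^P, x^UH, x^LH

Once the coefficients ξ^UH_p and ξ^LH_p are known from the reduced system below, Eq. (5) is just the linear combination xP + xi_uh*xUH + xi_lh*xLH on each block. For the first and last blocks of the nonperiodic system, setting a[0] = c[N-1] = 0 makes the corresponding homogeneous solutions vanish, consistent with ξ^UH_1 = ξ^LH_P = 0 below.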

To find ξ^UH_p and ξ^LH_p, substitute Eq. (5) into Eq. (1). Straightforward calculation shows that ξ^UH_1 = ξ^LH_P = 0, and that the remaining 2P - 2 coefficients are determined by the solution of the following tridiagonal linear system, or "reduced" system, obtained by matching each subsystem solution to the endpoints of its neighbors:

ξ^UH_p = x^P_{p-1,M} + ξ^UH_{p-1} x^UH_{p-1,M} + ξ^LH_{p-1} x^LH_{p-1,M},   p = 2, ..., P,
ξ^LH_p = x^P_{p+1,1} + ξ^UH_{p+1} x^UH_{p+1,1} + ξ^LH_{p+1} x^LH_{p+1,1},   p = 1, ..., P - 1,   (6)

where x_{p,m} refers to the mth element of the appropriate solution from the pth submatrix.

3 COMPUTING THE PARTICULAR AND HOMOGENEOUS SOLUTIONS

We find the three solutions x^P_p, x^UH_p, and x^LH_p by solving Eqs. (2)-(4). Exploiting overlapping calculations and elements with value 0 gives an algorithm with 13M binary floating point operations: forward elimination sweeps over i = 2, 3, ..., M, followed by back substitution sweeps over i = M - 1, M - 2, ..., 1 and a final sweep over i = 2, 3, ..., M, where the processor index p is implicitly present on all variables and the end elements a_1 and c_M are written in the appropriate positions in the a and c arrays. The sample code in Ref. [1] implements this with no temporary storage arrays.
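Before turning to the parallel construction of the reduced matrix, the following serial sketch ties Eqs. (2)-(6) together, reusing block_solutions from the sketch above. It assembles the 2P - 2 matching conditions as a small dense system (rather than exploiting their tridiagonal structure) and checks the result against a direct solve. All names are ours; this is an illustration of the method under our reading of Eq. (6), not the authors' code.

    import numpy as np

    def delayed_coupling_solve(a, b, c, r, P):
        """Serial illustration of the delayed coupling method for A x = r,
        with a, b, c the sub-, main and super-diagonals (a[0] = c[-1] = 0)."""
        N = len(b); M = N // P
        xP, xUH, xLH = [], [], []
        for p in range(P):                       # Eqs. (2)-(4), one block per "processor"
            s = block_solutions(a, b, c, r, p * M, (p + 1) * M)
            xP.append(s[0]); xUH.append(s[1]); xLH.append(s[2])

        # Reduced system, Eq. (6): unknowns xi^LH_p (p = 0..P-2) and xi^UH_p (p = 1..P-1).
        iLH = lambda p: 2 * p                    # column/row index of xi^LH_p
        iUH = lambda p: 2 * (p - 1) + 1          # column/row index of xi^UH_p
        nred = 2 * P - 2
        Ared = np.eye(nred); rhs = np.zeros(nred)
        for p in range(1, P):                    # xi^UH_p matches the last element of block p-1
            rhs[iUH(p)] = xP[p - 1][-1]
            if p - 1 >= 1:
                Ared[iUH(p), iUH(p - 1)] -= xUH[p - 1][-1]
            Ared[iUH(p), iLH(p - 1)] -= xLH[p - 1][-1]
        for p in range(P - 1):                   # xi^LH_p matches the first element of block p+1
            rhs[iLH(p)] = xP[p + 1][0]
            Ared[iLH(p), iUH(p + 1)] -= xUH[p + 1][0]
            if p + 1 <= P - 2:
                Ared[iLH(p), iLH(p + 1)] -= xLH[p + 1][0]
        xi = np.linalg.solve(Ared, rhs)

        # Eq. (5): combine, with xi^UH_1 = xi^LH_P = 0.
        x = np.empty(N)
        for p in range(P):
            xi_uh = xi[iUH(p)] if p >= 1 else 0.0
            xi_lh = xi[iLH(p)] if p <= P - 2 else 0.0
            x[p * M:(p + 1) * M] = xP[p] + xi_uh * xUH[p] + xi_lh * xLH[p]
        return x

    # Check on a random diagonally dominant system.
    rng = np.random.default_rng(0)
    N, P = 32, 4
    a = rng.standard_normal(N); c = rng.standard_normal(N)
    b = np.abs(a) + np.abs(c) + 1.0 + rng.random(N)
    a[0] = 0.0; c[-1] = 0.0
    r = rng.standard_normal(N)
    Afull = np.diag(b) + np.diag(c[:-1], 1) + np.diag(a[1:], -1)
    assert np.allclose(delayed_coupling_solve(a, b, c, r, P), np.linalg.solve(Afull, r))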

4 CONSTRUCTION AND SOLUTION OF THE REDUCED MATRIX

Once each processor has determined x^P_p, x^UH_p, and x^LH_p, we construct and solve the reduced system of Eq. (6). We assume that the following functions are available for interprocessor communication:

Send(ToPid, data, n): When invoked by processor FromPid, the array data of length n is sent to processor ToPid. Send() is nonblocking.

Receive(FromPid, data, n): To complete the data transmission, processor ToPid invokes Receive(). Upon execution, the array sent by processor FromPid is stored in the array data of length n. Receive() is blocking (the processor waits for the data to be received before continuing).

Opening interprocessor communications is generally the most time-consuming step in the entire tridiagonal solution process, so it is important to minimize this. The following algorithm consumes a time of T = (log_2 P) t_c in opening communication channels (where t_c is the time to open one channel).

1. Each processor writes whatever data it has that is relevant to Eq. (6) into the array OutData.

2. The OutData arrays from each processor are concatenated as follows (Fig. 1):
(a) Each processor p sends its OutData array to processor p - 1 (mod P) and receives a corresponding array from processor p + 1 (mod P), as depicted in Fig. 1a. The incoming array is concatenated to the end of OutData.
(b) At the ith step, repeat the first step, except sending to processor p - 2^i (mod P) and receiving from processor p + 2^i (mod P) (Fig. 1b,c), for i = 1, 2, .... After log_2 P iterations (or the next higher integer), each processor has the contents of the reduced matrix in its OutData array.

3. Each processor rearranges the contents of its OutData array into a local copy of the reduced tridiagonal system, and then solves it. At this point, each processor has all the values in Eq. (5) stored locally.

Figure 1: Illustration of the method to pass reduced matrix data between processors, shown for P = 5; after successive steps each processor's OutData holds the contributions of 2, then 4, and finally all 5 processors.
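A minimal sketch of step 2 in MPI terms, assuming mpi4py is available: the paper's Send()/Receive() primitives map onto the nonblocking isend and blocking recv used here, OutData is represented as a plain Python list, and the function name is ours.

    from mpi4py import MPI

    def gather_reduced_data(out_data):
        """Concatenate every processor's OutData onto every processor in
        about log2(P) channel openings (recursive doubling around a ring)."""
        comm = MPI.COMM_WORLD
        rank, P = comm.Get_rank(), comm.Get_size()
        offset = 1
        while offset < P:
            dest = (rank - offset) % P              # send accumulated data "upstream"
            src = (rank + offset) % P               # receive from "downstream"
            req = comm.isend(out_data, dest=dest)   # nonblocking, like Send()
            incoming = comm.recv(source=src)        # blocking, like Receive()
            req.wait()
            out_data = out_data + incoming          # concatenate to the end of OutData
            offset *= 2                             # double the reach each step
        return out_data

Each processor then rearranges the gathered list into its local copy of the reduced tridiagonal system and solves it redundantly, exactly as in step 3; for P not a power of two the gathered list may contain duplicate contributions, which the rearrangement step can simply ignore.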

5 PERFORMANCE

The time consumption for this routine is as follows:

1. Calculating the three solutions x^P, x^UH, and x^LH requires 13M binary floating point operations by each processor, done in parallel.

2. Assembling the reduced matrix in each processor requires log_2 P steps in which interprocessor communications are opened, and the ith opening passes 8 x 2^i real numbers.

3. Solution of the reduced system through LU decomposition requires 8(2P - 2) binary floating point operations by each processor, done in parallel.

4. Calculation of the final solution requires 4M binary floating point operations by each processor, done in parallel.

If t_b is the time of one binary floating point operation, t_c is the time required to open a communication channel (latency), and t_p is the time to pass one real number once communication is opened, then the time T_P to execute this parallel routine is, optimally, the sum of these contributions (Eq. (7)). For cases of present interest and P >> 1, T_P is dominated by (log_2 P) t_c and 17M t_b. The parallel efficiency is defined by ε_P ≡ T_s/(P T_P), where T_s is the execution time of a serial code which solves the system by LU decomposition. Serial LU decomposition solves an N x N system in a time T_s = 8N t_b, which yields the predicted efficiency of Eq. (8), with a theoretical maximum of 8/17, about 47%, when the communication and reduced-system costs are negligible.

To test these claims empirically, we measured the execution times of working serial and parallel codes and calculated ε_P both through its definition and through Eq. (8). Fig. 2 shows ε_P as a function of P for two cases, N = 200 and N = 50,000. We conclude from Fig. 2 that Eq. (8) (smooth lines) is reasonably accurate, both for the theoretical maximum efficiency (47%, achieved for small P and large N) and for the scaling with large P. We made these timings on the BBN TC2000 machine at Lawrence Livermore National Laboratory, using 64-bit floating point arithmetic. This machine had 128 M88100 RISC processors connected by a butterfly-switch network. To calculate the predictions of Eq. (7) we chose t_c = 750 µsec, based on the average time of a send/receive pair measured in our code; based on other measurements, we chose the passage time of a single 64-bit real number as t_p = 5 µsec; and we chose t_b = 1.4 µsec, based on our measured timing of 0.00218 sec for the serial algorithm in the N = 200 case.
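As a back-of-envelope illustration, the sketch below simply sums the four contributions enumerated above using the quoted TC2000 constants. It is our reading of the timing model rather than a transcription of Eqs. (7)-(8), so the numbers it prints are only indicative.

    import math

    t_b, t_c, t_p = 1.4e-6, 750e-6, 5e-6      # seconds: flop, channel opening, per-real passage

    def parallel_time(N, P):
        M = N / P
        steps = math.ceil(math.log2(P))
        reals_passed = sum(8 * 2**i for i in range(1, steps + 1))
        return (13 * M * t_b                  # the three solutions
                + 8 * (2 * P - 2) * t_b       # LU solve of the reduced system
                + 4 * M * t_b                 # final combination, Eq. (5)
                + steps * t_c                 # channel openings
                + reals_passed * t_p)         # data passage

    def efficiency(N, P):
        return 8 * N * t_b / (P * parallel_time(N, P))   # serial LU time is 8 N t_b

    for N in (200, 50_000):
        print(N, [round(efficiency(N, P), 2) for P in (2, 8, 32, 128)])

For large N and modest P this model approaches the 8/17 ≈ 47% ceiling, while for N = 200 the latency term quickly dominates, in line with the trends shown in Fig. 2.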

Figure 2: Results of scaling runs for N = 200 and N = 50,000, comparing the parallel time with the serial LU decomposition time. Here ε_P is the parallel efficiency and P is the number of processors. The smooth lines represent Eq. (8), and the points are empirical results.

6 PERIODIC TRIDIAGONAL SYSTEM

We have generalized our algorithm to a periodic tridiagonal system. This is a tridiagonal system with additional nonzero elements in the far upper and lower corners of the matrix, that is, Eq. (1) with A now of the form

A = \begin{pmatrix} b_1 & c_1 & & & a_1 \\ a_2 & b_2 & c_2 & & \\ & \ddots & \ddots & \ddots & \\ & & a_{N-1} & b_{N-1} & c_{N-1} \\ c_N & & & a_N & b_N \end{pmatrix}.   (9)

The solution proceeds almost precisely as before. First divide Eq. (9) into tridiagonal subsystems, and solve for the particular, upper homogeneous, and lower homogeneous solutions. The subsystems are again all tridiagonal, so no additional consideration need be given to this part. Then use these solutions to construct a reduced system analogous to Eq. (6). Here, however, the first and last subsystems acquire drives for the upper and lower homogeneous solutions, respectively, so the condition ξ^UH_1 = ξ^LH_P = 0 no longer pertains. This leads to a reduced matrix, in which all 2P coefficients appear as unknowns, with nonzero elements in its far corners.
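Continuing the serial sketch of Sections 2-4 (and again reusing block_solutions), the variant below handles the periodic case: the corner elements enter as a[0] and c[-1], so the first and last blocks acquire nonzero homogeneous drives, and all 2P coefficients are retained in a reduced system whose coupling pattern wraps around. Every name here is ours, and the dense solves stand in for the LU-based routine the paper describes.

    import numpy as np

    def periodic_delayed_coupling_solve(a, b, c, r, P):
        """Periodic variant of the serial sketch: a[0] and c[-1] hold the
        corner elements of Eq. (9), and the block index wraps modulo P."""
        N = len(b); M = N // P
        xP, xUH, xLH = [], [], []
        for p in range(P):                      # blocks 0 and P-1 now have nonzero drives
            s = block_solutions(a, b, c, r, p * M, (p + 1) * M)
            xP.append(s[0]); xUH.append(s[1]); xLH.append(s[2])

        iUH = lambda p: 2 * p                   # xi^UH_p
        iLH = lambda p: 2 * p + 1               # xi^LH_p
        Ared = np.eye(2 * P); rhs = np.zeros(2 * P)
        for p in range(P):
            q = (p - 1) % P                     # xi^UH_p matches the last element of block p-1
            rhs[iUH(p)] = xP[q][-1]
            Ared[iUH(p), iUH(q)] -= xUH[q][-1]
            Ared[iUH(p), iLH(q)] -= xLH[q][-1]
            q = (p + 1) % P                     # xi^LH_p matches the first element of block p+1
            rhs[iLH(p)] = xP[q][0]
            Ared[iLH(p), iUH(q)] -= xUH[q][0]
            Ared[iLH(p), iLH(q)] -= xLH[q][0]
        xi = np.linalg.solve(Ared, rhs)         # the wrap-around terms are the corner elements

        x = np.empty(N)
        for p in range(P):
            x[p * M:(p + 1) * M] = xP[p] + xi[iUH(p)] * xUH[p] + xi[iLH(p)] * xLH[p]
        return x

Checking this against numpy.linalg.solve on the dense matrix of Eq. (9), built with its two corner elements in place, should reproduce the direct solution; the 2P x 2P matrix Ared is the periodic analogue of Eq. (6), and its corner entries come from the q = (p ± 1) mod P wrap-around.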

This system necessitates a new solution algorithm for the reduced matrix. The most efficient we know of (not shown here) uses LU decomposition and requires 15P binary operations. The interesting consequence is that the parallel efficiency nearly doubles over the nonperiodic case, since the operation count in the corresponding serial solver also rises, from 8N to 15N. Thus, Eq. (8) for the predicted efficiency becomes

ε_P = 15 / [17 + 16P^2/N + (log_2 P) P t_c/(N t_b) + 15 P^2 t_p/(N t_b)].

7 DISCUSSION AND CONCLUSIONS

The stability of the parallel tridiagonal algorithm is similar to that of serial LU decomposition of a tridiagonal matrix. If the L_p are unstable under LU decomposition, then pivoting could be used. If the L_p are singular, then LU decomposition fails and some alternative must be devised. If the large matrix A is diagonally dominant, then so too are the L_p. If the reduced system is unstable under LU decomposition, its solver can be replaced by a different solution scheme with little loss of overall speed (if P << M).

This routine generalizes from tridiagonal to wider banded systems. For example, in a 5-diagonal system there would be four homogeneous solutions, each with an undetermined coefficient. The coefficients of the homogeneous solutions would be determined by a reduced system analogous to Eq. (6), except with O(4P) equations rather than 2P - 2.

In our applications of the parallel tridiagonal solver we solve a tridiagonal system along each line of grid points parallel to a given direction. In two or more dimensions, each processor owns a segment of each of many systems, giving us a strong advantage in interprocessor communication over solving only a single system: each processor solves all of its triplets of independent subsystems, then packs together all of the data it needs to send to other processors, so there is only one send volley for solving all of the systems. Furthermore, the number of processors each processor communicates with is the number of processors collinear in the one direction, which will generally be smaller than the total number of processors in a multidimensional domain decomposition; this improves parallel efficiency by reducing the value of P below the total number of processors.

This work was performed for the U.S. Department of Energy at Lawrence Livermore National Laboratory under contract W-7405-ENG-48 and Los Alamos National Laboratory under contract W-7405-ENG-36.

REFERENCES

[1] N. Mattor, T. J. Williams, D. W. Hewett, Algorithm for solving tridiagonal matrix problems in parallel, Parallel Computing 21, p. 1769 (1995).

[2] R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles, Adam Hilger, Bristol, p. 185 (1988).