Model Order Reduction via Matlab Parallel Computing Toolbox. Istanbul Technical University


Model Order Reduction via Matlab Parallel Computing Toolbox. E. Fatih Yetkin & Hasan Dağ, Istanbul Technical University, Computational Science & Engineering Department. September 21, 2009.

1. Parallel Computation: Why We Need Parallelism in MOR? What is Parallelism? Parallel Architectures
2. Tools of Parallelization: Programming Models, Parallel Matlab
3. Parallel Version of Rational Krylov Methods: Rational Krylov Methods, H_2 Optimality and Rational Krylov Methods, An Example System, Parallelization of the Algorithm, Results
4. Conclusions

Why We Need Parallelism in MOR? Computational Complexity: Model reduction methods aim to build a reduced model that is easy to handle. However, for some types of methods, such as balanced truncation or rational Krylov, the reduction process takes a long time for dense problems. Computational Complexity of Rational Krylov Methods: the factorization of (s_i I − A), or more generally (A − σ_i E), at each of the k interpolation points costs O(N³). Therefore, especially for dense problems, parallelism is a necessity.

What is Parallelism? Sequential Programming: A single CPU (core) is available. The problem is composed of a series of commands, and each command is executed one after another.

What is Parallelism? Parallel Programming: In the simplest sense, parallel computing is the simultaneous use of multiple computing resources (multiple CPUs or cores) to solve a computational problem. The problem is broken into discrete parts that can be solved concurrently, and each part is executed on a different CPU at the same time.

Parallel Architectures, Shared Memory: Shared memory machines generally have in common the ability for all processors to access all memory as a global address space. Multiple processors can operate independently but share the same memory resources. Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA.

Parallel Architectures, UMA vs. NUMA: In the Uniform Memory Access (UMA) architecture, identical processors have equal access times to memory; such a machine is also called a Symmetric Multiprocessor (SMP). Non-Uniform Memory Access (NUMA) machines are often made by physically linking two or more SMPs, and not all processors have equal access time to all memories.

Parallel Architectures, Distributed Memory: Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors. When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when the data is communicated. Synchronization between tasks is likewise the programmer's responsibility.

Parallel Architectures, Hybrid Memory: The largest and fastest computers in the world today employ both shared and distributed memory architectures. The shared memory component is usually a cache-coherent SMP machine, and processors on a given SMP can address that machine's memory as global. Network communications are required to move data from one SMP to another.

Parallel Programming Models: Threads (POSIX Threads & OpenMP): In the threads model of parallel programming, a single process can have multiple, concurrent execution paths. Threads can come and go, but the main program (a.out) remains present to provide the necessary shared resources until the application has completed. Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP.

Parallel Programming Models: Message Passing Interface (MPI): A set of tasks use their own local memory during computation. Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines. Tasks exchange data by sending and receiving messages.

Matlab Distributed Computing Toolbox, Distributed or Parallel: In Matlab terminology, parallel jobs run on the local workers (e.g., the cores of a single machine), while distributed jobs run on the nodes of a cluster.

Basics of Parallel Computing Toolbox, parfor: In Matlab you can use parfor to create a parallel loop. Message passing and other low-level communication issues are handled by Matlab itself.
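The slides do not reproduce the code itself; the following is a minimal sketch of the idea, assuming the Parallel Computing Toolbox is available (the pool was opened with matlabpool in the 2009-era toolbox, parpool in recent releases):

    parpool(4);            % open a pool of 4 workers (older releases: matlabpool open 4)

    n = 100000;
    y = zeros(1, n);
    parfor i = 1:n
        y(i) = sin(i)^2;   % iterations are independent, so Matlab spreads them
    end                    % across the pool workers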

Basics of Parallel Computing Toolbox: When can we use parfor?
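The slide's code figure is not part of the transcription; as a hedged illustration, parfor applies when every iteration is independent, i.e. each iteration reads shared (broadcast) data and writes only its own slice of an output, or accumulates into a reduction variable:

    A = rand(500);  b = rand(500, 1);   % broadcast data, shared read-only
    res = zeros(100, 1);
    total = 0;
    parfor k = 1:100
        x = (A + k*eye(500)) \ b;   % each iteration solves its own system
        res(k) = norm(x);           % sliced output: iteration k writes only res(k)
        total = total + norm(x);    % reduction variable: order-independent sum
    end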

Basics of Parallel Computing Toolbox: When can we not use parfor?
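Again the slide's example is a figure; a typical case where parfor cannot be used is a loop-carried dependency, as in this sketch:

    x = zeros(1, 10);
    x(1) = 1;
    parfor i = 2:10
        x(i) = 2 * x(i-1);   % depends on the previous iteration: Matlab cannot
    end                      % classify x and refuses to run the loop
    % such a recurrence has to remain a sequential for-loop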

Basics of Parallel Computing Toolbox, single program multiple data (spmd): In Matlab you can use spmd blocks to run the same program on a different data set on each worker.
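A minimal spmd sketch (not from the original slides): every worker executes the same block, distinguished by its worker index labindex:

    spmd
        localData = rand(1000, 1) + labindex;   % each worker builds its own data
        localSum  = sum(localData);             % and processes it independently
    end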

Basics of Parallel Computing Toolbox, single program multiple data (spmd): The master (client) session has access to all of the workers' data.
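In the toolbox this access goes through Composite objects: after an spmd block, a variable assigned inside it can be read from the client by indexing with the worker number. A short continuation of the previous sketch:

    firstSum = localSum{1};               % value computed on worker 1
    allSums  = zeros(1, length(localSum));
    for w = 1:length(localSum)
        allSums(w) = localSum{w};         % gather every worker's result on the client
    end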

Basics of Parallel Computing Toolbox, distributed arrays: It is possible to distribute any array across the workers.
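The slides show the usage as a figure; a minimal sketch with assumed sizes is:

    A = distributed(rand(4000));   % partition a 4000x4000 matrix over the pool workers
    b = distributed(rand(4000, 1));
    x = A \ b;                     % the solve operates in parallel on the distributed pieces
    xg = gather(x);                % bring the result back to the client session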

Matrix transposing: MPI-Fortran vs. Matlab-DCT.

Rational Krylov Methods: If D is selected as zero, the system can be described by the triple Σ = (A, B, C) for
ẋ = Ax + Bu,   y = C^T x + Du.
Two matrices V ∈ R^{n×k} and W ∈ R^{n×k} can be defined such that W^T V = I_k and k ≪ n. With these two matrices the reduced-order system is found as
Â = W^T A V,   B̂ = W^T B,   Ĉ = C^T V.   (1)

Rational Krylov Method: There are many ways to build the projection matrices. One way is to use rational Krylov subspace bases. Assume that k distinct points s_1, ..., s_k in the complex plane are selected for interpolation. Then the interpolation matrices V and Ŵ can be built as shown below:
V = [(s_1 I − A)^{-1} B, ..., (s_k I − A)^{-1} B],
Ŵ = [(s_1 I − A^T)^{-1} C, ..., (s_k I − A^T)^{-1} C].   (2)
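A straightforward (sequential) way to form these bases in Matlab, assuming dense workspace variables A (n-by-n), B and C (n-by-1) and a vector s of k shift points, is sketched below:

    n = size(A, 1);
    k = numel(s);
    V = complex(zeros(n, k));
    W = complex(zeros(n, k));
    I = eye(n);
    for i = 1:k
        V(:, i) = (s(i)*I - A)  \ B;   % (s_i I - A)^{-1} B
        W(:, i) = (s(i)*I - A') \ C;   % (s_i I - A^T)^{-1} C
    end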

Rational Krylov Projectors: Assuming that det(Ŵ^T V) ≠ 0, the projected reduced system can be built as
Â = W^T A V,   B̂ = W^T B,   Ĉ = C^T V,   (3)
where W = Ŵ (V^T Ŵ)^{-1} ensures W^T V = I_k. The basic problem is finding a strategy for selecting the interpolation points; in the worst case, they can simply be chosen at random from the operating frequency range of the system.
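Continuing the previous sketch, the rescaling and projection of Eq. (3) can be written as:

    W  = W / (V' * W);   % W <- Ŵ (VᵀŴ)^{-1}, so that W'*V = I_k (requires V'*W nonsingular)
    Ar = W' * A * V;     % Â
    Br = W' * B;         % B̂
    Cr = C' * V;         % Ĉ (C taken as a column vector, as in the state-space form above)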

Rational Krylov Projectors (continued).

H_2 norm of a system: This approach is not optimal, and several methods can be used to improve it. In this work we use the iterative rational Krylov approach to obtain an H_2-norm optimal reduced model. The H_2 norm of a system is defined as
‖G‖_{H_2} := [ (1/2π) ∫_{−∞}^{+∞} |G(jω)|² dω ]^{1/2}.   (4)

H_2 optimality: The reduced-order system G_r(s) is H_2 optimal if it minimizes the H_2 error,
G_r(s) = arg min_{deg(Ĝ) = r} ‖G(s) − Ĝ(s)‖_{H_2}.   (5)
Two important theorems for obtaining an H_2-optimal reduced model were given by Meier (1967) and Grimme (1997). Antoulas et al. combined these two results into the Iterative Rational Krylov Algorithm (IRKA) for computing an H_2-optimal reduced-order model.

Iterative Rational Krylov Algorithm (IRKA).
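The algorithm listing on this slide is a figure; the sketch below only outlines the well-known IRKA fixed-point iteration in the notation used here. initialShifts, maxIter, tol and the helper buildBases (forming V and W as in Eq. (2)) are assumptions, not part of the original slides:

    s = initialShifts(:);                 % k starting interpolation points
    for iter = 1:maxIter
        [V, W] = buildBases(A, B, C, s);  % hypothetical helper: bases of Eq. (2)
        W  = W / (V' * W);                % enforce W'*V = I_k
        Ar = W' * A * V;                  % reduced A
        sNew = -eig(Ar);                  % new shifts: mirror images of the reduced poles
        if norm(sort(sNew) - sort(s)) < tol * norm(s)
            break;                        % shift set has converged
        end
        s = sNew;
    end
    Br = W' * B;   Cr = C' * V;           % final reduced-order model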

Example: RLC network. We use a ladder RLC network as the benchmark example for the numerical implementation of Alg. 1 and Alg. 2. A minimal realization of the circuit is given in Fig. 1. For this circuit the order of the system is n = 5; on the other hand, the system matrices of this circuit can easily be extended.

Frequency plots of the reduced and original systems, for N = 201 and reduced-system order k = 20.

Computational Cost of the Methods: The computational cost of the rational Krylov methods is O(N³) for dense problems. In IRKA the rational Krylov steps are applied iteratively, so the computational complexity has to be multiplied by the number of iterations r.

Parallel Parts of the Algorithms: Although both algorithms require k factorizations to compute (s_i I − A)^{-1} B, these factorizations can be computed independently on different processors. The matrix-matrix and matrix-vector multiplications in the algorithms are also amenable to parallel processing.
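In the Matlab PCT setting this independence maps naturally onto a parfor loop over the shift points; a hedged sketch (A, B, C and the shift vector s assumed in the workspace, as before):

    n = size(A, 1);   k = numel(s);
    V = complex(zeros(n, k));
    W = complex(zeros(n, k));
    parfor i = 1:k
        I = eye(n);
        V(:, i) = (s(i)*I - A)  \ B;   % each worker factorizes its own shifted matrix
        W(:, i) = (s(i)*I - A') \ C;
    end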

Parallel Version of Alg. 1.

CPU times for Rational Krylov. Table: CPU times of the parallel version of Alg. 1 for different system orders, with reduced system order k = 200.
Proc. no. | time (n = 2000) | time (n = 5000)
1 | 59.8 | 1485.3
2 | 31.4 | 780.7
4 | 21.2 | 451.4
8 | 23.8 | 374.2

CPU times for IRKA. Table: CPU times of the parallel version of Alg. 2 for different system orders, with reduced system order k = 200.
Proc. no. | time (n = 2000) | time (n = 5000)
1 | 512.6 | 2486.2
2 | 410.7 | 1605.9
4 | 203.9 | 810.8
8 | 176.1 | 648.4

Speedup graph for RK: The speedup of a parallel algorithm is defined as
S_p = T_1 / T_p,   (6)
where T_1 is the CPU time on one processor and T_p is the CPU time on p processors.
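For example, using the Alg. 1 timings above for n = 5000: S_2 = 1485.3/780.7 ≈ 1.90, S_4 = 1485.3/451.4 ≈ 3.29 and S_8 = 1485.3/374.2 ≈ 3.97, which already shows the speedup curve flattening as communication costs grow.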

Speedup graph for RK.

Speedup graph for IRKA.

Continued: It can easily be seen from the figures that, as we increase the number of processors, the processing time decreases appreciably up to some point, after which it starts to increase. This is due to communication times becoming dominant over computation time. However, in both algorithms, as the system matrices get larger, better speedups are obtained.

Conclusions: In this work, iterative rational Krylov based H_2-norm optimal model reduction methods are parallelized. These methods require a huge amount of computation, but the algorithms themselves are suitable for parallel processing; therefore, the computational time decreases as the number of processors is increased. Due to the communication needs of the processors, communication time dominates the overall processing time when the system order is small, but for larger orders the parallel algorithm achieves better speedup values.