Model Order Reduction via Matlab Parallel Computing Toolbox

E. Fatih Yetkin & Hasan Dağ
Istanbul Technical University, Computational Science & Engineering Department
Terschelling, September 21, 2009
Outline

1. Parallel Computation
   - Why We Need Parallelism in MOR?
   - What is Parallelism?
   - Parallel Architectures
2. Tools of Parallelization
   - Programming Models
   - Parallel Matlab
3. Parallel Version of Rational Krylov Methods
   - Rational Krylov Methods
   - H_2 Optimality and Rational Krylov Methods
   - An Example System
   - Parallelization of the Algorithm
   - Results
4. Conclusions
Why We Need Parallelism in MOR?

Computational Complexity
- Model reduction methods aim to build a reduced model that is easy to handle. However, for some methods, such as balanced truncation or rational Krylov, the reduction process itself takes a long time on dense problems.

Computational Complexity of Rational Krylov Methods
- The dominant cost is the factorization of $(A - \sigma_i E)$ at each of the $k$ interpolation points, which is $O(N^3)$ per point for dense matrices.
- Therefore, especially for dense problems, parallelism is a necessity.
What is Parallelism?

Sequential Programming
- A single CPU (core) is available.
- The problem is decomposed into a series of instructions.
- Each instruction is executed one after another.
What is Parallelism?

Parallel Programming
- In the simplest sense, parallel computing is the simultaneous use of multiple computing resources (multiple CPUs or cores) to solve a computational problem.
- The problem is broken into discrete parts that can be solved concurrently.
- Each part is executed on a different CPU simultaneously.
Parallel Architectures

Shared Memory
- Shared memory machines generally have in common the ability for all processors to access all memory as a global address space.
- Multiple processors can operate independently while sharing the same memory resources.
- Shared memory machines can be divided into two main classes based on memory access times: UMA and NUMA.
Parallel Architectures

UMA vs. NUMA
- In the Uniform Memory Access (UMA) architecture, identical processors have equal access times to memory. Such a machine is also called a Symmetric Multiprocessor (SMP).
- Non-Uniform Memory Access (NUMA) machines are often made by physically linking two or more SMPs; not all processors have equal access time to all memories.
Parallel Architectures

Distributed Memory
- Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of a global address space across all processors.
- When a processor needs data residing on another processor, it is usually the programmer's task to explicitly define how and when the data is communicated. Synchronization between tasks is likewise the programmer's responsibility.
Parallel Architectures

Hybrid Memory
- The largest and fastest computers in the world today employ both shared and distributed memory architectures.
- The shared memory component is usually a cache-coherent SMP machine; processors on a given SMP can address that machine's memory as global.
- Network communication is required to move data from one SMP to another.
Parallel Programming Models: Threads

POSIX Threads & OpenMP
- In the threads model of parallel programming, a single process can have multiple concurrent execution paths.
- Threads can come and go, but the main program (a.out) remains present to provide the necessary shared resources until the application completes.
- Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP.
Parallel Programming Models: Message Passing

MPI
- A set of tasks use their own local memory during computation.
- Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines.
- Tasks exchange data by sending and receiving messages.
Matlab Distributed Computing Toolbox

Distributed or Parallel
- In Matlab terminology, parallel jobs run on internal workers such as cores, while distributed jobs run on cluster nodes.
Basics of Parallel Computing Toolbox

parfor
- In Matlab you can use parfor to write a parallel loop.
- Message passing and other low-level communication issues are handled by Matlab itself.
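As a minimal sketch (the pool size and loop body are illustrative), a loop whose iterations are independent can be parallelized simply by replacing `for` with `parfor`; `matlabpool` was the pool-management command in the 2009-era releases of the toolbox:

```matlab
% Open a pool of 4 local workers (R2009-era syntax).
matlabpool('open', 4);

n = 1e5;
y = zeros(1, n);
parfor i = 1:n
    % Each iteration is independent of the others, so Matlab can
    % distribute them across the pool; the required communication
    % is handled internally by the toolbox.
    y(i) = sin(i)^2;
end

matlabpool('close');
```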
Basics of Parallel Computing Toolbox

When can we use parfor?
- When the loop iterations are independent of one another, so that they can be executed in any order.

When can we not use parfor?
- When an iteration depends on the result of another iteration (a loop-carried dependency), or when the loop variable is used in a way the toolbox does not support.
Basics of Parallel Computing Toolbox

single program multiple data (spmd)
- In Matlab you can use spmd blocks to run the same program on different data sets.
Basics of Parallel Computing Toolbox

single program multiple data (spmd)
- The client (master) process has the right to access all workers' data.
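A minimal sketch of an spmd block (variable names are illustrative): each worker runs the same code on its own data, and afterwards the client can read every worker's variables through a Composite object:

```matlab
matlabpool('open', 4);

spmd
    % Every worker executes this block; labindex identifies the
    % worker, so each one ends up with different data.
    localData = rand(1000, 1) + labindex;
    localSum  = sum(localData);
end

% Back on the client, localSum is a Composite: localSum{w} is the
% value computed on worker w.
total = 0;
for w = 1:numel(localSum)
    total = total + localSum{w};
end

matlabpool('close');
```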
Basics of Parallel Computing Toolbox

distributed arrays
- It is possible to distribute an array across the workers.
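A minimal sketch of distributed arrays (assuming the operations shown are among those overloaded for distributed arrays in the installed release):

```matlab
matlabpool('open', 4);

A  = rand(2000);          % ordinary array on the client
dA = distributed(A);      % partitioned across the pool of workers

dB = dA * dA';            % overloaded operations execute in parallel

B  = gather(dB);          % collect the full result back on the client
matlabpool('close');
```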
Matrix Transposing
- MPI-Fortran vs. Matlab DCT (timing comparison figure).
Rational Krylov Methods

- If $D$ is selected as zero, the system can be represented by the triple $\Sigma = (A, B, C)$ for

  $\dot{x} = A x + B u, \quad y = C^T x$

- Two matrices $V \in \mathbb{R}^{n \times k}$ and $W \in \mathbb{R}^{n \times k}$ can be defined such that $W^T V = I_k$ and $k \ll n$.
- With these two matrices the reduced-order system is found as

  $\hat{A} = W^T A V, \quad \hat{B} = W^T B, \quad \hat{C} = C V \qquad (1)$
Rational Krylov Method

- There are many ways to build the projection matrices; one way is to use rational Krylov subspace bases.
- Assume that $k$ distinct points $s_1, \dots, s_k$ in the complex plane are selected as interpolation points. Then the interpolation matrices $V$ and $\hat{W}$ can be built as shown below:

  $V = [(s_1 I - A)^{-1} B \;\; \cdots \;\; (s_k I - A)^{-1} B]$
  $\hat{W} = [(s_1 I - A^T)^{-1} C \;\; \cdots \;\; (s_k I - A^T)^{-1} C] \qquad (2)$
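The construction of $V$ and $\hat{W}$ in (2) can be sketched as follows for a single-input, single-output system with $C$ stored as a column vector; the $k$ linear solves are mutually independent, which is exactly what the parallel version exploits:

```matlab
% Sketch: build the rational Krylov bases for interpolation
% points s(1..k) of the system (A, B, C).
n = size(A, 1);
k = numel(s);
I = eye(n);
V    = zeros(n, k);
What = zeros(n, k);
for i = 1:k                              % iterations are independent
    V(:, i)    = (s(i)*I - A)  \ B;      % (s_i I - A)^{-1} B
    What(:, i) = (s(i)*I - A') \ C;      % (s_i I - A^T)^{-1} C
end
```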
Rational Krylov Projectors

- Assuming that $\det(\hat{W}^T V) \neq 0$, the projected reduced system can be built as

  $\hat{A} = W^T A V, \quad \hat{B} = W^T B, \quad \hat{C} = C V \qquad (3)$

  where $W = \hat{W} (V^T \hat{W})^{-1}$ ensures $W^T V = I_k$.
- The basic problem is finding a strategy for selecting the interpolation points. In the worst case, the points can be selected randomly from the operating frequency range of the system.
H_2 Norm of a System

- This approach is not optimal. Several methods can be used to improve it; in this work we use the iterative rational Krylov approach to obtain an $H_2$-norm optimal reduced model.
- The $H_2$ norm of a system is defined as

  $\|G\|_{H_2} := \left( \frac{1}{2\pi} \int_{-\infty}^{+\infty} \|G(j\omega)\|^2 \, d\omega \right)^{1/2} \qquad (4)$
H_2 Optimality

- The reduced-order system $G_r(s)$ is $H_2$ optimal if it minimizes the $H_2$ error:

  $G_r(s) = \arg\min_{\deg(\hat{G}) = r} \|G(s) - \hat{G}(s)\|_{H_2} \qquad (5)$

- Two important theorems for obtaining an $H_2$ optimal reduced model were given by Meier (1967) and Grimme (1997). Antoulas et al. combined these two results to arrive at the Iterative Rational Krylov Algorithm (IRKA), which produces an $H_2$ optimal reduced-order model.
Iterative Rational Krylov Algorithm (IRKA)
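Since the algorithm listing itself is not reproduced here, the following is a hedged sketch of the IRKA fixed-point iteration for the SISO case (the tolerance and iteration limit are illustrative): the interpolation points are repeatedly replaced by the mirror images of the reduced-order poles until they stagnate.

```matlab
% Sketch of IRKA: s0 holds the k initial interpolation points.
s = s0(:);
n = size(A, 1);  k = numel(s);  I = eye(n);
for iter = 1:maxit
    V = zeros(n, k);  What = zeros(n, k);
    for i = 1:k                      % independent solves, as before
        V(:, i)    = (s(i)*I - A)  \ B;
        What(:, i) = (s(i)*I - A') \ C;
    end
    W    = What / (V' * What);       % enforce W'*V = I_k
    Ahat = W' * A * V;               % reduced system matrix
    sNew = -eig(Ahat);               % mirrored reduced-order poles
    if norm(sort(sNew) - sort(s)) < tol * norm(s)
        s = sNew;  break;            % shifts have converged
    end
    s = sNew;
end
```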
Example RLC Network

- We use a ladder RLC network as the benchmark example for the numerical implementation of Alg. 1 and Alg. 2.
- A minimal realization of the circuit is given in Fig. 1; for this circuit the order of the system is n = 5.
- The system matrices of this circuit can easily be extended to larger orders.
Frequency Plots of the Reduced and Original Systems

- N = 201 and the order of the reduced system is k = 20.
Computational Cost of the Methods

- The computational cost of rational Krylov methods is $O(N^3)$ for dense problems.
- In IRKA the rational Krylov method is applied iteratively, so the computational complexity must be multiplied by the number of iterations r.
Parallel Parts of the Algorithms

- Both algorithms require k factorizations to compute $(s_i I - A)^{-1} B$, but these factorizations are independent and can be computed on different processors.
- The matrix-matrix and matrix-vector multiplications in the algorithms are also amenable to parallel processing.
Parallel Version of Alg. 1
CPU Times for Rational Krylov

Table: CPU times (s) of the parallel version of Alg. 1 for different system orders, with reduced system order k = 200.

  Proc. no. | n = 2000 | n = 5000
  ----------|----------|---------
      1     |   59.8   |  1485.3
      2     |   31.4   |   780.7
      4     |   21.2   |   451.4
      8     |   23.8   |   374.2
CPU Times for IRKA

Table: CPU times (s) of the parallel version of Alg. 2 for different system orders, with reduced system order k = 200.

  Proc. no. | n = 2000 | n = 5000
  ----------|----------|---------
      1     |  512.6   |  2486.2
      2     |  410.7   |  1605.9
      4     |  203.9   |   810.8
      8     |  176.1   |   648.4
Speedup Graph for RK

- The speedup of a parallel algorithm is defined as

  $S_p = \frac{T_1}{T_p} \qquad (6)$

  where $T_1$ is the CPU time on one processor and $T_p$ is the CPU time on p processors.
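Applying (6) to the CPU times reported in the tables gives the speedup values behind the graphs; this is a direct computation from the measured Alg. 1 timings:

```matlab
% Speedups S_p = T_1 / T_p from the Alg. 1 timings (k = 200).
p     = [1 2 4 8];
t2000 = [59.8   31.4  21.2  23.8];    % CPU times (s), n = 2000
t5000 = [1485.3 780.7 451.4 374.2];   % CPU times (s), n = 5000

S2000 = t2000(1) ./ t2000;            % approx. [1.00 1.90 2.82 2.51]
S5000 = t5000(1) ./ t5000;            % approx. [1.00 1.90 3.29 3.97]
% The larger problem scales better, and n = 2000 slows down at p = 8.
```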
Speedup Graph for IRKA
Continued

- As the figures show, increasing the number of processors decreases the processing time appreciably up to some point, after which it starts to increase. This is because communication time becomes dominant over computation time.
- In both algorithms, however, larger system matrices yield better speedups.
Conclusions

- In this work, iterative rational Krylov based optimal H_2-norm model reduction methods are parallelized.
- These methods require substantial computation, but the algorithms themselves are well suited to parallel processing; therefore the computational time decreases as the number of processors is increased.
- Because of the communication overhead between processors, communication time dominates the overall run time when the system order is small. For larger orders, the parallel algorithms achieve better speedup values.