Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations!

Size: px

Start display at page:

Download "Parallel Numerics. Scope: Revise standard numerical methods considering parallel computations!"

Phoebe Carroll
5 years ago
Views:

1 Parallel Numerics Scope: Revise standard numerical methods considering parallel computations! Required knowledge: Numerics Parallel Programming Graphs Literature: Dongarra, Du, Sorensen, van der Vorst: Numerical Linear Algebra or High-Perormance Computers Pacheco: A User s Guide to MPI (web) Parallel Programming with MPI Schüle: Paralleles Rechnen

2 Why parallel computing? SETI, weather prediction, quantum simulation TOP500 HLRB-II 2

3 I. Introduction. Computer Science Aspects 2. Numerical Problems 3. Graphs II. Elementary Linear Algebra Problems. BLAS 2. Matrix-Vector Operations 3. Matrix-Matrix-Product III. Linear Equations with Dense Matrices. Gaussian Elimination 2. Vectorization 3. Parallelization 4. QR-Decomposition with Householder matrices IV. Sparse Matrices. General Properties, Storage 2. Sparse Matrices and Graphs 3. Reordering 4. Gaussian Elimination and Graphs V. Iterative Methods or Sparse Matrices. Stationary Methods 2. Nonstationary Methods 3. Preconditioning VI. Domain decomposition VII.Eigenvalues, (Quantum Computing, GPU) 3

4 . Introduction. Computer Science Aspects o Parallel Numerics.. Parallelization in the CPU Elementary operations in CPU are carried out in pipelines: - Divide a task into smaller subtasks - Each small subtask is executed on a piece o hardware that operates concurrently with the other stages o the pipeline. Addition Pipeline: Operand Operand 2 Align exponents accordingly normalize result Stage Stage 2 Stage 3 Stage 4 Compare add mantissa exponents Output Result 4

5 Visualisation Pipelining x 2 x y y 2 5

6 Visualisation Pipelining x 3 x 2 x,y y 2 y 3 6

7 Visualisation Pipelining x 4 2 x 3 x 2,y 2 x,y y 3 y 4 7

8 Visualisation Pipelining x x 4 x 3,y 3 x 2,y 2 x,y y 4 y 5 8

9 Visualisation Pipelining x x 5 x 4,y 4 x 3,y 3 x 2,y 2 x,y y 5 y 6 9

10 Visualisation Pipelining x x 6 x 5,y 5 x 4,y 4 x 3,y 3 x 2,y 2 x +y y 6 y 7 0

11 Visualisation Pipelining x i x i-5 x i-4,y i-4 x i-3,y i-3 x i-2,y i-2 x i-,y i- x i +y i y i-5 y i-6 Startup time = k(=4) clock units Lateron on: per clock unit one result Total time: k*u + n*u

12 Advantages o Pipelines: I pipeline is illed: per clock unit one result is achieved. All additions should be organized such that the pipeline is always illed! I the pipeline is nearly empty, e.g. in the beginning o the computations, it is not eicient! Major task or CPU: Organize all operations such that the operands are just in time at the right position to ill the pipeline and keep it ull. 2

13 CPU - Pipelining 3

14 CPU 4

15 General Steps Instruction Fetch: Get the next command Decoding: Analyse instruction and compute addresses o operands Operand Fetch: Get the values o the next operands Execution step: Carry out command on operands Result Write: Write result in memory Pipelining o these steps, and also inside each step. 5

16 Special case: Vector instruction For set o data the same operation has to be executed on all components. y x M = α M For j =,2,..,n: y j = α x j ; y n x n α α x 5 α x 4 α x 3 α x 2 α x x 6 x 7 Total costs: Startup time + vector length * clock time (pipeline length + vector length ) * τ 6

17 Chaining: Combine pipelines directly α x + y : α α x Multiplication Addition α x + y x y Advantage: total cost = startup time + vector length * clock time 7

18 Problem: Data Dependency! Example: Fibonacci numbers x 0 = 0, x =, x 2 = x 0 + x,, x i = x i-2 + x i- ; x x 0 x 2 x 2 x Ater illing in x 0 and x the next pair needs x 2 which is known only ater the irst computation is inished! Pipeline contains always only one pair is nearly empty all the time! Similar Problem or recursive subroutine calls. 8

19 ..2 Memory Organization small CPU Register ast Capacity Level Cache Level 2 Cache Speed Main Memory large Hard Disc slow World (CD, DVD, Stick, Internet,.) 9

20 General Considerations DRAM (dynamic): Fast, periodically rereshing is necessary SRAM (static) SDRAM (small part SRAM combined with large DRAM) DDR: (double date rate) use both voltage lanks (side,shoulder) Necessary time or reading: - Transport addresses via bus to memory (bus speed) - Time between arrival o adresses and arrival o data: latency time (~4 cycles) - Rereshing o data: cycles - Transport o data rom memory: bus cycle 20

21 Cache Idea Cache as memory buer between large, slow memory and small, ast memory. By considering the data low (last used data), try to predict which data will be requested in the next step: - keep the last used data in ast cache because it is likely that the same data will be used again - keep also the neighborhood o the last used data in ast cache. Memory is organized in pages (main memory, hard disc,..). Hence, together with the last used data put the whole page in the cache. Page size ~ bus band width Main memory Cache Hard disc 2

22 Cache hit: The data requested rom the small, ast memory is ound in the cache: Copy the data to ast memory. Done. Cache miss: The data requested rom the small, ast memory is not ound in the cache: Look or data in the large, slow memory. Copy the related page to the cache (removing the oldest cache entry) and copy it to the ast, small memory. Also: Reuse data as oten as possible! Working blockwise to ensure neighbouring! 22

23 Mapping between Memory Direct mapping: Address in cache, modulo 00 0 Disadvantage: Immediately replacing o data in cache Associative mapping: Partition cache in blocks. Write data to direct mapped address in one o the blocks. 23 Replace oldest data in block.

24 Cyclic data distribution Memory is oten organized in banks connected by bus: bus bank bank2 bankn.... Per cycle n operands can be etched out o the n banks. Storing vectors! x in bank x 2 in bank2,. allows one step access to x 24

25 ..3 Parallel Processors Classical von Neumann model: Code and data in memory! Control unit etches instructions and data rom memory and sequentially coordinates the operations. 25

26 Parallel Computation Flynn s taxonomy: MIMD architecture: multiple instructions multiple data (compare to SISD = single instructions single data, etc.) 26

27 Shared Memory (SMP): memory I/O bus cache cache CPU CPU processors 27

28 Cache Coherence memory I/O start = proc_number; or (S=0; s<s_max; S++) parallel or(i=start; i<n; i+=s+proc_number) x[i]=2.; cache bus cache For S=0 and 2 threads: CPU Thread changes x(0,2,4, ) and thread 2 changes x(,3,5, ). CPU Does the cache contain 4 words (cache line = 4), then each changing step o thread also changes data that is also contained in the cache o thread 2 (and vice versa). Otherwise the data in the two caches is not consistent anymore! To retain the right values in both caches ater each changing step also the Value in the other cache has to be renewed! Leads to a dramatical increase o computational time, ev. slower than sequential computation! 28

29 Locally distributed memory: network memory memory cache cache knot CPU knot CPU P M processors..... memory P n M n Virtual shared memory: Distributed data but organized as shared memory. 29

30 Nonuniorm Memory Access Cluster o multiple CPU processors memory memory 2 controler bus controler cache cache cache cache Symmetric multiprocessor CPU CPU 2 CPU 3 CPU 4 Dierent types o communication! Shared memory and distributed memory! 30

31 Topology o the processor/memory interconnection P cache local m. I/O bus Bus: global memory P n Mesh (Array, Grid): p processors, longest path sqrt(p) P M P M P M. P M P M P M P M 3

32 Time or sending data rom one processor to another depends on the connection network topology: Mesh: 2*sqrt(p) vector or ring: p- or p/2 tree: 2 log(p) hypercube: log(p) Tree: Hypercube: 0d d 2d 3d 4d 32

33 Communication - Topology diameter = largest distance = p-: diameter = largest distance = p/2: diameter = largest distance = 2 sqrt(p) 33

34 Diameter = largest distance = 2 ln(p) Diameter:

35 Tree in Hypercube

36 Dierent Topologies: G p Diam(G) Degree(G) Edges(G) G (n) n n- 2 n- T (n) n loor(n/2) 2 n G 2 (n,n) n 2 2n-2 4 2n 2-2n T 2 (n,n) n 2 2*loor(n/2) 4 2n 2 BT(h) 2 h+ - 2h 3 2 h+ -2 HC(k) 2 k k k 2 k- k G: Grid, T: Torus, BT: binary Tree, HC: Hypercube 36

37 Network based on Switches 3-level Omega network Crossbar P7 P7 P7 P6 P6 P6 P5 P5 P5 P4 P4 P4 P3 P3 P3 P2 P2 P2 P P P P0 P0 P0 37

38 Communication Crossbar: Direct, independent connection between all processors. Nonblocking! Omega network: Blocking network. Simultaneous connection P0 P6 and P P7 is not possible! Turn-over o switches necessary! 38

39 ..4 Perormance Analysis Deinition: Computational Speed r = N / t Mlops, N loating point operations in t microseconds or by known speed r: time or N lops is given by t = N / r Amdahls s Law: Setting: An algorithm takes N lop s. A raction is carried out with speed o V Mlops (good in parallel) A raction - is carried out with S Mlops (bad) : high speed parallel : low speed, strongly sequential 39

40 40 Total CPU time: Overall speed (perormance): (Amdahl s Law) ) ( ) ( S V N S N V N t + = + = S V t N r + = = must be close to in order to beneit signiicantly rom parallelism

41 4 Discussion S S V t N r + = = with S the slow speed To achieve large speed, - has to be small! For very large parallel speed V: S S S V t N r = + + = = 0 The total speed is governed by the raction o the strongly sequential part o the algorithm that cannot be parallelized.

42 Speedup Executing a job using p processors in parallel we can achieve a speedup. Deine t p := wall clock time to execute the job on p parallel processors Speedup: S p := t / t p is the ratio o execution time with versus p processors In the ideal case it would hold t = p t p. Eiciency using p processors: E p = S p / p. 0<= E p <= E p : very good parallelizable, because then S p p or t p t p. Problem scales. E 42 p 0: bad, because E p = S p /p = t / (p t p ) and t << p t p.

43 43 Using the same deinition o speed and raction as above: ideally parallel strongly sequential Ware s Law ) ( ) ( ) ( t p p t t p t t p + = + = p p p S E p p ) ( ) ( + = = p p p p t t S p p + = + = = ) ( ) ( : 0 Ware s Law p E p We always will have a small portion o our algorithm that is not parallelizable and thereore the eiciency will always be zero in the limit!

44 /(-): Reachable Speedup or large p. : S: 0/9 0/8 2 0 Speedup depending on p. Saturation /(-). 44

45 Gustason s Law Other model: We assume that the problem can be solved in unit o time on a parallel machine with p processors. Fraction is good parallelizable, - not Compared with this parallel implementation an uniprocessor would perorm (-) + p or the same job. Speedup: t + p S p = = = p + ( p)( ) t p Eiciency: E p = S p p = p + p 45

46 Example = 0.99, p = 00 or 000: Amdahl/Ware: S= p/(+(-)p)) E=/(+(-)p) p=00: S 00 = 00/.99 ~ 50, E 00 = 0.5, p=000: S 000 = 000/0.99 ~ 00, E 000 = 0., Gustason: S=-+p E=(-)/p+ p=00: S 00 = 99.0, E = p=000: S 000 = 990.0, E =

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1

NCU EE -- DSP VLSI Design. Tsung-Han Tsai 1 Multi-processor vs. Multi-computer architecture µp vs. DSP RISC vs. DSP RISC Reduced-instruction-set Register-to-register operation Higher throughput by using