Mapping to Parallel Hardware

Size: px

Start display at page:

Download "Mapping to Parallel Hardware"

Baldric Malone
5 years ago
Views:

1 - Mapping to Parallel Hardware Bruno Bodin University of Edinburgh, United Kingdom 1 / 27

2 Context - Hardware Samsung Exynos ARM big-little (Heterogenenous), 4 ARM A15, 4 ARM A7, 2-16 cores GPU, 5 Watt. Context Liveness Throughput Memory Mapping 2 / 27

Context - Hardware Kalray MPPA256 256 VLIW cores, 16 clusters of 16

3 Context - Hardware Kalray MPPA VLIW cores, 16 clusters of 16 cores, Torus NoC, 10 Watts. Context Liveness Throughput Memory Mapping 3 / 27

4 Context - Problematic Mapping problem How to select which core for which task? Very Hard problem! Usually (automatically) solved for streaming apps. Ex. StreamIt, SigmaC Context Liveness Throughput Memory Mapping 4 / 27

5 Context - Software The SigmaC dataflow language [Goubier et al., 2011] agent Reader (N) { interface { spec {/* N outputs */}} void start () { /* f => out */ } } agent Writer (N) { interface { spec {/* N inputs */}} void start () { /* in => f */ } } agent Worker { interface { spec {in [1]; out [1]}} void start (in, out )) { /* in => out */ } } subgraph root { map { agent ina = new Reader(N+1); agent outa = new Writer(N+1); agent w[n+1] = new Worker(); for(int i=0;i<=n;i++){ connect(ina.out[i],w[i].in); connect(w[i].out,outa.in[i]); } } } What is a dataflow model! Context Liveness Throughput Memory Mapping 5 / 27

6 Context - Dataflow models Kahn Process Networks [Kahn, 1974] t t A set of processes (T ) communicating through channels (A) Channels are unbound FIFO buffers Reading is a blocking operation A deterministic model Undecidable analysis problems Kahn Networks don t allow buffer sizing [Buck et Lee, 1993]. Context Liveness Throughput Memory Mapping 6 / 27

7 Context - Dataflow models Synchronous Dataflow [Lee et Messerschmitt, 1987] I a a O a t t M0(a) I a is the number of tokens produced, O a is the number of tokens consumed, The initial quantity of tokens is M 0 (a). Synchronous because... N G t I a = N G t O a a A A static model It is possible to study its behavior at compile-time: this model is static. Several fundamental problems become decidable. Context Liveness Throughput Memory Mapping 7 / 27

8 Context - Dataflow models Cyclo-Static Dataflow [Bilsen et al., 1995] [2,3,1] [2,5] t t M0(a) Tasks are divided in ϕ(t) phases in a (k), the production rate of t k the k th phase of t out a (k ), the consumption rate of the k th phase of t. Context Liveness Throughput Memory Mapping 8 / 27

9 I Context - Solution The dataflow compilation Throughput evaluation Buffer sizing Mapping & Routing MPPA -256 ES HFCBGA 1156 V1 00 F1 A11 F12 XXXXXX.XXX-X TWN TF YY WW e1 C1 C2 C3 C4 Fission and fusion Buffer sizing partitioning Mapping and Routing H GF CE C ABD Context Liveness Throughput Memory Mapping 9 / 27

10 Liveness - Liveness and Consistency Liveness A dataflow is alive, if there exists a task execution sequence which can be repeated infinitely often. Consistency A dataflow is consistent if there exists a task execution sequence which can be repeated infinitely often with bounded memory constraint. Context Liveness Throughput Memory Mapping 10 / 27

11 Liveness - Liveness and Consistency - Example Example T i 2 2 T j S = [T i, T j, T k, T k, T i ] T k 1 3: Alive and consistent. Context Liveness Throughput Memory Mapping 11 / 27

12 Throughput - Problem definition As soon as possible scheduling [1,3] a CA A [2] C [6] [3,5] [1,1,4] a AB 0 a BC [6,2,1] B Task duration (WCET): d(a 1 ) = 3, d(a 2 ) = 1 d(b 1 ) = 2, d(b 2 ) = 1 d(b 3 ) = 2, d( ) = 1 No resource constraint A 1 A 1 A 1 A 1 A 1 A 2 B 1 B 2 B 3 B 1 B 2 A 2 B 3 B 1 B 2 B 3 A 2 B 1 B 2 B 3 B 1 A 2 B 2 B 3 B 1 Transient state Steady state Context Liveness Throughput Memory Mapping 12 / 27

13 Throughput - Problem definition Normalized period Functional frequency: Th S t n = lim n S t, n. When a CSDFG has bounded memories, a balance exists between tasks frequency. Ω S G = NG t Th S t t T A 1 A 1 A 1 A 1 A 1 A 2 B 1 B 2 B 3 B 1 B 2 A 2 B 3 B 1 B 2 B 3 A 2 B 1 B 2 B 3 B 1 A 2 B 2 B 3 B 1 Th A = 6/21 = 2/7, Ω S G = 21 Context Liveness Throughput Memory Mapping 13 / 27

14 Throughput - State of the art Existing methods Exact methods ([Ghamarian et al., 2006, Stuijk et al., 2008]): A 1 A 1 A 1 A 1 A 1 A 2 B 1 B 2 B 3 B 1 B 2 A 2 B 3 B 1 B 2 B 3 A 2 B 1 B 2 B 3 B 1 A 2 B 2 B 3 B 1 Execution pattern, Th A = 6/21 = 2/7 Optimal Exponential complexity Approximate methods ([Benabid et al., 2012, Bodin et al., 2013]): A 1 A 2 B 1 B 2 B 3 A 1 A 2 A 1 A 2 A 1 A 2 B 1 B 2 B 3 B 1 B 2 B 3 B 1 B 2 B 3 B 1 B 2 Period, Th A = 1/4 Polynomial Lower bound Context Liveness Throughput Memory Mapping 14 / 27

15 Throughput - Periodic scheduling of CSDFG Definition of precedence relations A [3,5] [1,1,4] a AB B A 1, 1 A 2, 1 A 1, 2 A 2, 2 B 1, 1 B 2, 1 B 3, 1 B 1, 2 B 2, 2 B 3, 2 B 1, 3 B 2, 3 B 3, 3 B 3, 4 Context Liveness Throughput Memory Mapping 15 / 27

16 Throughput - Periodic scheduling of CSDFG Definition of a valid schedule Definition (Valid schedule) If a precedence relation exists between two executions t k, n and t k, n, a valid schedule must check S t k, n S t k, n + d(t k ). Context Liveness Throughput Memory Mapping 16 / 27

17 Throughput - Periodic scheduling of CSDFG A precedence relation between two executions Lemma (existence condition) Let a buffer a = (t, t ). A precedence relation exists between t k, n and t k, n if and only if in a (k) > M 0 (a)+i a t k, n O a t k, n max{0, in a (k) out a (k )}. Between A 1, 1 and B 1, 2? A 1, 1 A 2, 1 A 1, 2 A 2, 2? B 1, 1 B 2, 1 B 3, 1 B 1, 2 B 2, 2 B 3, 2 B 1, 3 B 2, 3 B 3, 3 B 3, 4 Context Liveness Throughput Memory Mapping 17 / 27

18 Throughput - Periodic scheduling of CSDFG Precedence relation between two phases Lemma (Existence condition between two phases) A buffer a = (t, t ) induce a precedence relation between t k, n and t k, n if and only if Between A 1 and B 1? αa min (k, k ) α(n, n ) αa max (k, k ). A 1, 1 A 2, 1 A 1, 2 A 2, 2???????? B 1, 1 B 2, 1 B 3, 1 B 1, 2 B 2, 2 B 3, 2 B 1, 3 B 2, 3 B 3, 3 B 3, 4 Context Liveness Throughput Memory Mapping 18 / 27

19 Throughput - Periodic scheduling of CSDFG Periodic constraints We finally get the following constraint: Theorem (Periodic constraints) Precedence relations induced by a buffer a = (t, t ) are checked by a periodic schedule S if and only if S t k, 1 S t k, 1 d(t k ) + Ω S G αmax a (k, k ) i a (k, k ) {1,, ϕ(t)} {1,, ϕ(t )} with αa min (k, k ) αa max (k, k ). N G t Context Liveness Throughput Memory Mapping 19 / 27

20 Throughput - Periodic scheduling of CSDFG Computation of a periodic schedule These constraints can be used to compute a periodic schedule with maximal throughput: Minimize Ω S G with a = (t, t ) A, (k, k ) Y(a), S t k, 1 S t k, 1 d(t k ) + Ω S G αmax a (k, k ) i a Nt G t T, k {1,, ϕ(t)}, S t k, 1 R + Ω S G R+ {0} A linear program Polynomial (Maximum Cycle Ratio Problem) Maximal periodic throughput Upper bound of the maximal throughput Context Liveness Throughput Memory Mapping 20 / 27

21 Memory - Definition Buffer sizing Buffer sizing impact liveness But also performances [1,2] [2] B t 0.. t 3 tokens max: t 1 t 2 t 1 t 2 t 1 t 2 t 1 t 2 t 1 t 1 t 1 t 1 t 1 t 1 t 1 4 tokens max: t 1 t 2 t 1 t 2 t 1 t 2 t 1 t 2 t 1 t 2 t 1 t 2 t 1 t 2 t 1 t 1 t 1 t 1 t 1 t 1 t 1 t 1 t 1 Context Liveness Throughput Memory Mapping 21 / 27

22 Memory - State of the art Existing solutions Exact method [Stuijk et al., 2008] Exploring a (combinatorial) solution space Maximal throughput evaluation (ASAP) Too long for existing cases Approximative methods [Benazouz et Munier-Kordon, 2013] A maximal throughput is fixed Only periodic schedules are considered Phases execution pattern is fixed. shift 2 shift 3 shift 2 shift 3 Phases fixées t 1, 1 t 1, 2 t 1, 3 période d exécution t 2, 1 t 2, 2 t 2, 3 Context Liveness Throughput Memory Mapping 22 / 27

23 Memory - How to compute it? Buffer sizing with throughput constraint t [1,2] [1,2] B 0 4 Fb(B) [2] [2] t We adapt the linear formulation: Minimize a Fb(A) M 0(a) with a = (t, t ) A, (k, k ) Y(a), S t k, 1 S t k, 1 d(t k ) + Ω S G αmax a (k, k ) i a Nt G a = (t, t ) Fb(A), (k, k ) Y(a ), S t k, 1 S t k, 1 d(t k ) + Ω S G αmax a (k, k ) i a Nt G t T, k {1,, ϕ(t)}, S t k, 1 R + Context Liveness Throughput Memory Mapping 23 / 27

24 Mapping - definition The buffer sizes are known, we need now to select which processor for which task. This is hard, the problem is splitted into sub-problems. Context Liveness Throughput Memory Mapping 24 / 27

25 Mapping - Partitioning Partitioning T1 T1 T1 T2 T2 T2 Producing task groups which will be executed on the same core. Usually solved off-line using meta-heuristics (genetic, taboo, machine learning, etc.). Context Liveness Throughput Memory Mapping 25 / 27

26 Mapping - Mapping Mapping T1 T1 T2 T3 I/O I/O T2 T3 T4 T4 I/O T5 T5 I/O This is the actual association of task groups with computation units. This can be solved at compile-time, but also at run-time using task migration. Context Liveness Throughput Memory Mapping 26 / 27

27 Mapping - Routing Routing T1 T1 T2 T3 I/O I/O T2 T3 T4 T4 I/O T5 T5 I/O For some platforms this is required to determine the actual communication in the NoC. This may be optimally solved with a deterministic NoC. Context Liveness Throughput Memory Mapping 27 / 27

28 - Conclusion Mapping is a hard problem Depends on how parallelism and tasks represented Wide open research area Large research interest at ICSA Edinburgh 1 / 1

29 - References Abir Benabid, Claire Hanen, Olivier Marchetti, et Alix Munier-Kordon. Periodic Schedules for Bounded Timed Weighted Event Graphs. IEEE Transactions on Automatic Control, 57(5): , Mohamed Benazouz et Alix Munier-Kordon. Cyclo-static DataFlow phases scheduling optimization for buffer sizes minimization. In Proceedings of the 16th International Workshop on Software and Compilers for Embedded Systems - M-SCOPES 13, page 3, New York, New York, USA, ACM Press. Greet Bilsen, Marc Engels, Rudy Lauwereins, et J.A. Peperstraete. Cyclo-static data flow. IEEE Transactions on Signal Processing, pages , Bruno Bodin, Alix Munier Kordon, et Benoît Dupont de Dinechin. Periodic schedules for cyclo-static dataflow. In The 11th IEEE Symposium on Embedded Systems for Real-time Multimedia, Montreal, QC, Canada, October 3-4, 2013, pages , JT T Buck et Edward A. Lee. Scheduling dynamic dataflow graphs with bounded memory using the token flow model. icassp, (September): , Amir Hossein Ghamarian, Marc Geilen, Sander Stuijk, Twan Basten, Arno Moonen, Marco J.G. Bekooij, Bart D. Theelen, et MohammadReza Mousavi. Throughput Analysis of Synchronous Data Flow Graphs. In International Conference on Application of Concurrency to System Design (ACSD 06), pages 25 36, Thierry Goubier, Renaud Sirdey, Stéphane Louise, et Vincent David. ΣC: A programming model and language for embedded manycores. Algorithms and Architectures for..., 7016: , Gilles Kahn. The semantics of a simple language for parallel programming. Information processing, Edward A. Lee et David G. Messerschmitt. Synchronous dataflow. Proceedings of the IEEE, 75(9): , Sander Stuijk, Marc Geilen, et Twan Basten. Throughput-Buffering Trade-Off Exploration for Cyclo-Static and Synchronous Dataflow Graphs. IEEE Transactions on Computers, 57(10): , / 1

Fundamental Algorithms for System Modeling, Analysis, and Optimization

Fundamental Algorithms for System Modeling, Analysis, and Optimization Edward A. Lee, Jaijeet Roychowdhury, Sanjit A. Seshia UC Berkeley EECS 144/244 Fall 2010 Copyright 2010, E. A. Lee, J. Roydhowdhury,