DMP Deterministic Shared Memory Multiprocessing 1 Presenter: Wu, Weiyi Yale University
Outline What is determinism? How to make execution deterministic? What s the overhead of determinism? 2
What Is Determinism? The SAME input, the SAME output The same execution sequence (with Linearizability) (strong) The same resource-accessing ordering (weak) 3
(Weak) Determinism ld X ld X ld X ld Y ld Y st S st S ld Y Time Flow st S st T st X st T st T Data Dependency st X st X st Y st Y st Y Concurrent Program Execution 1 Execution 2 4
Non-Determinism ld X ld X ld X ld Y ld Y st S st S ld Y Time Flow st S st T st T st T st X Communications st X st X st Y st Y st Y Concurrent Program Execution 1 Execution 2 5
Non-determinism in Non-Determinism 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 1M barnes 2M 3M 4M 5M Non-Determinism 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0 Practice 1M ocean-contig 2M 3M 4M 5M Time (Insns.) Time (Insns.) Figure 3. Amount of nondeterminism over the execution of barnes and ocean-contig. The x axis is the position in the execution where each sample of 100,000 instructions was taken. The y axis is ND (Eq. 1) computed for each sample in the execution. 6
Is Record n Replay Deterministic? Yes! When the log is replayed deterministically, the replayed execution is definitely deterministic No... It cannot deploy to any machines and still run deterministically for any input 7
How to Execute Deterministically? Eliminating all instruction communications Deterministically arrange all communications Read-after-Write Write-after-Write, Write-after-Read 8
DMP-Serial No concurrency! Make it single-threaded! Programs are divided into quanta All processors act as one Only execute when hold token Deterministically pass token when quanta end 9
Example of DMP-Serial ld X ld X ld Y ld X ld Y st S Time Flow st S st S ld Y st T st X st T st T Token Pass st Y st X st Y st X st Y Quantum Concurrent Program Small Quanta Bigger Quanta 10
Recovering Parallelism DMP-Serial is way too strong Weak determinism is also acceptable Instructions without communication can execute concurrently Deterministically schedule communications 11
DMP-ShTab Break quantum into parallel prefix and serial suffix (dynamically), do prefix concurrently Deterministically control shared memory access Only read from Shared memory position Only write to Private memory position 12
Example of DMP-ShTab ld X ld Y ld X ld X st S st S ld Y st S ld Y Time Flow st T st T st T st X st X st Y Token Pass st Y st X st Y Quantum Concurrent Program DMP-Serial DMP-ShTab 13
Similarity to MESI coherence protocol Probe Write Hit Reset Invalid Read Miss, Exclusive Exclusive Read Hit Read Miss, Shared Probe Write Hit Probe Read Hit Probe Write Hit Write Hit Read Hit Probe Read Hit Shared Probe Read Hit Write Hit Write Miss Modified Read Hit Write Hit 14
Similarity to MESI coherence protocol Probe Write Hit Reset Invalid Read Miss, Exclusive Exclusive Read Hit Read Miss, Shared Probe Write Hit Probe Read Hit Probe Write Hit Write Hit Read Hit Probe Read Hit Shared Probe Read Hit Write Hit Write Miss Modified Read Hit Write Hit Shared Private 14
DMP-TM Atomicity, isolation and deterministic total order of quanta will guarantee determinism Make quantum transaction to explicitly execute concurrently Deterministically commit Abort and re-execute latter quantum when conflicting 15
DMP-TMFwd Read-after-Write will be no conflict if propagating modified values in time Forward writes across transactions Need to abort infected transactions when aborting 16
Example of DMP-TM* N memory operation deterministic token passing deterministic order atomic quantum uncommitted value flow P0 P1 P0 P1 1 ld B st C (squash) st C 1 ld B 2 3 ld T st C 2 3 ld B ld B 4 (a) Pure TM ld B 4 (b) TM w/ uncommitted value flow Figure 7. Recovering parallelism by executing quanta as memory transactions (a). Avoiding unnecessary squashes with un-committed data forwarding (b). 17
How to build quanta? Split code evenly - QB-Count The less waiting time, the better No spinning on lock - QB-SyncFollow (token is enough for synchronization) End quantum when finishing working on shared data others wait for - QB-Sharing 18
Example of QB-SyncFollowing forward immediately if a thread starts spinning on a lock. memory operation deterministic token passing N deterministic order atomic quantum P0 P1 P0 P1 lock L lock L 1 unlock L lock L... spin... (wasted work) 2 1 unlock L lock L (grab) 2 3 grab lock L 3 4 (a) Regular quantum break 4 (b) Better quantum break Figure 8. Example of a situation when better quantum breaking policies leads to better performance. 19
Example of QB-Sharing P0 P1 P0 P1 P0 P1 (a) Parallel execution (b) Deterministic serialized execution at a fine-grain (c) Deterministic serialized execution at a coarse-grain Figure 4. Deterministic serialization of memory operations. Dots are memory operations and dashed arrows are happens-before synchronization. 20
Overhead & Scalability 3.0 10.7 6.2 3.7 3.4 10.2 6.1 3.6 Hw-DMP-TMFwd Hw-DMP-TM Hw-DMP-ShTab Hw-DMP-Serial 3.8 3.1 10.3 6.8 3.6 11.0 7.0 3.8 8.4 5.2 3.1 5.6 3.2 6.4 4.3 12.1 7.0 3.8 3.2 4.4 3.3 11.8 6.9 3.8 6.7 4.6 runtime normalized to non-deterministic parallel execution 2.5 2.0 1.5 1.0 4 8 barnes cholesky 4 8 16 4 8 fft(p) 16 4 fmm 8 4 lu-nc 8 ocean-c(p) 4 8 16 4 8 16 radix(p) 4 8 vlrend 16 water-sp 4 8 16 4 8 16 SPLASH 4 8 16 blacksch 4 8 16 bodytr 4 8 fluid 16 4 8 16 strmcl 4 8 16 PARSEC gmean gmean Figure 9. Runtime overheads with 4, 8 and 16 threads. (P) indicates page-level conflict detection; line-level otherwise. up over word-granularity 500% 450% 400% 350% 300% 250% 200% 150% 100% Hw-DMP-TMFwd Hw-DMP-TM Hw-DMP-ShTab 21 eedup over QB-Count 180% 160% 140% 120% 100% 80% 60% 40% Hw-DMP-TMFwd Hw-DMP-TM Hw-DMP-ShTab Hw-DMP-Serial
Quanta Sensitivity Figure 9. Runtime overheads with 4, 8 and 16 threads. (P) indicates page-level conflict detection; line-level otherwise. however, Hw-DMP-Serial is less affected by quantum size. % speedup over word-granularity 500% 450% 400% 350% 300% 250% 200% 150% 100% 50% 0% -50% -100% % speedup over 1,000-insn quanta 40% 20% 0% -20% -40% -60% -80% -100% fft 2 X C lu-nc lu-nc ocean-c 2 X C vlrend radix SPLASH gmean vlrend 2 X C SPLASH gmean Hw-DMP-TMFwd Hw-DMP-TM Hw-DMP-ShTab Hw-DMP-TMFwd Hw-DMP-Serial Hw-DMP-TM Hw-DMP-ShTab 2 X C strmcl 2 X C PARSEC gmean Figure 10. Performance of 2,000 (2), 10,000 (X) and Figure100,000 11. Performance (C) instruction of page-granularity quanta, relative to conflict 1,000 detectiontion relative quanta. to instruc- line-granularity. bodytr fluid PARSEC gmean strmcl % speedup over QB-Count 180% 160% 140% 120% 100% 80% 60% 40% 20% 0% -20% -40% erate progress along the application s critical path. With 10,000-instruction quanta (Figure 13), the effects of the different quantum builders are more pronounced. In general, the quantum builders that take program synchronization into account (QB-SyncFollow and QB-SyncSharing) outperform those that do not. Hw-DMP-Serial and Hw-DMP-ShTab perform better with smarter quantum building, while the TMs sf ss barnes s sf ss radix(p) s sf ss SPLASH gmean Hw-DMP-TMFwd Hw-DMP-TM Hw-DMP-ShTab Hw-DMP-Serial s sf ss bodytr s sf ss strmcl s sf ss PARSEC gmean Figure 12. Performance of QB-Sharing (s), QB- SyncFollow (sf) and QB-SyncSharing (ss) quantum builders, relative to QB-Count, with 1,000-insn quanta. eedup over QB-Count 22 180% 160% 140% 120% 100% 80% 60% 40% Hw-DMP-TMFwd Hw-DMP-TM Hw-DMP-ShTab Hw-DMP-Serial
Software-DMP 8.5 8.0 7.5 7.0 2 threads 4 threads 8 threads Sw-DMP-ShTab runtime normalized to nondeterministic parallel execution 6.5 6.0 5.5 5.0 4.5 4.0 3.5 3.0 2.5 2.0 1.5 1.0 barnes fft fmm lu-c radix water-sp SPLASH gmean Figure 14. Runtime of Sw-DMP-ShTab relative to nondeterministic execution. y ability of Swity and energy trade-offs; (2) support for debugging; (3) interaction with operating system and I/O nondeterminism; 23 and
Conclusion (First?) implementation of DMP Demonstration of low-overhead determinism for concurrent programs Not pervasive No deterministic interrupts Can not save nondeterministic hardwares
Thanks~ 25