B629 Project - StreamIt MPI Backend

Nilesh Mahajan

March 26, 2013

Abstract

StreamIt is a language based on the dataflow model of computation. A StreamIt program consists of computation units called filters connected by pipelines, split-joins, and feedback loops; the program is therefore a graph with filters as vertices and connections as edges. The compiler estimates the workload statically, computes a partitioning of the graph, and assigns work to threads. StreamIt also has a cluster backend that generates threads which communicate through sockets. MPI is a popular library used in the HPC community to write distributed programs in the Single Program Multiple Data (SPMD) style. Over the years, the MPI community has done a lot of work to identify and optimize the communication patterns that occur in HPC applications. The idea of this project was to see whether there are opportunities to identify collective communication patterns in StreamIt programs and generate efficient MPI code accordingly.

1 Introduction

StreamIt [4] is a language developed at MIT to simplify writing streaming applications in domains such as signal processing. It is based on a dataflow model of computation in which the program consists of basic computation units called filters. The language the filters are written in looks like Java. Filters are connected by structures/channels such as pipelines, split-joins, and feedback loops to form a graph. Filters can have state, although this is discouraged. The channels have known data rates that specify the number of elements moving through them. The resulting graph is structured, which makes it easier for the compiler to optimize, yet general enough that the programmer can write useful programs [5]. This kind of dataflow model is generally referred to as synchronous dataflow. StreamIt adds features such as dynamic data rates, sliding-window operations, and teleport messaging. Mainly three kinds of parallelism are exploited: task (fork-join parallelism), data (DOALL loops), and pipeline (like instruction pipelining in hardware) [1]. The compiler does a static estimation of work and applies various transformations, such as fusion and fission of filters and cache-aware optimizations, to the stream graph. The StreamIt program is translated to Java code which can be run together with other Java or C code. There is also a cluster backend that generates code that communicates via sockets.

2 MPI Model

MPI [2] is a library used in the HPC community to write distributed programs in the SPMD model. There are n processes, each with a unique id called its rank, ranging from 0 to n-1. Each process operates in a separate address space and communicates data through primitives provided by MPI. There are primitives for point-to-point communication (MPI_Send, MPI_Recv), collectives (MPI_Bcast, MPI_Reduce, etc.), and one-sided communication. The MPI community has been actively working on identifying and optimizing collective patterns common across various HPC domains [3]. Vendors also provide their own implementations of collectives that take advantage of specialized hardware. For example, implementing a broadcast as a for loop of point-to-point sends and receives is almost always slower than MPI_Bcast, even on shared memory. So the idea of this project was to see whether we can identify collective patterns in stream programs and generate MPI code instead of the socket-based cluster backend.
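To make the broadcast example concrete, here is a small hand-written C/MPI illustration of the two approaches (a sketch for this report only, not code from the StreamIt compiler or from the benchmarks; the buffer size, datatype, and root rank are arbitrary):

    #include <mpi.h>

    /* Naive broadcast: the root sends the buffer to every other rank, one by one. */
    void broadcast_with_sends(float *buf, int count, int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        if (rank == root) {
            for (int r = 0; r < size; r++)
                if (r != root)
                    MPI_Send(buf, count, MPI_FLOAT, r, 0, comm);
        } else {
            MPI_Recv(buf, count, MPI_FLOAT, root, 0, comm, MPI_STATUS_IGNORE);
        }
    }

    /* Collective broadcast: the MPI library may use a tree, pipelining, or
       shared memory internally, which is why it is almost always faster. */
    void broadcast_with_collective(float *buf, int count, int root, MPI_Comm comm)
    {
        MPI_Bcast(buf, count, MPI_FLOAT, root, comm);
    }

If a stream program's communication amounts to the first form, an MPI backend could recognize the pattern and emit the single MPI_Bcast call instead.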

3 StreamIt Install

In this section, I discuss the difficulties I faced installing StreamIt on my laptop. My system is Ubuntu 12.10 on a dual-core Intel x86-64. I downloaded the source from the git repository and tried to install it according to the installation instructions in the source code. One of the problems was that you need to add the following paths to CLASSPATH:

1. STREAMIT_HOME/src
2. STREAMIT_HOME/3rdparty
3. STREAMIT_HOME/3rdparty/cplex/cplex.jar
4. STREAMIT_HOME/3rdparty/jgraph/jgraph.jar
5. STREAMIT_HOME/3rdparty/jcc/jcc.jar
6. STREAMIT_HOME/3rdparty/JFlex/jflex.jar

Then I needed to add the following line at line 44 in src/at/dms/kjc/cluster/GenerateMasterDotCpp.java:

    p.print("#include <limits.h>\n");

I think with newer gcc versions (4.4 and later) the generated code needs limits.h to compile successfully. After this I was able to compile StreamIt programs successfully.

4 Benchmarks

We evaluate the performance of two typical linear algebra kernels written in StreamIt against C with MPI, namely matrix multiplication and Cholesky factorization. First we compare the programs written in C with MPI against the corresponding StreamIt versions, because they look significantly different.

4.1 Matrix Multiply

The program starts its execution in the MatrixMult pipeline, which has the same name as the file. This is similar to Java, where the name of the public class must equal the name of the file. There are three stages in the top-level pipeline, namely FloatSource, MatrixMultiply, and FloatPrinter. FloatSource is a stateful filter that cyclically produces the numbers 0 through 9. The size of the matrices is 10x10. Then there are filters to transpose the second matrix and to multiply and accumulate the result in parallel. One can compile the code with the strc command. Figure 1 shows the graph generated from the program. I was not sure how to pass the matrix size as a command-line option to the program and could not find anything about it in the source or documentation, so I generated several versions of the program with different matrix sizes. The compiler does a lot of work to generate an optimal version, so it takes a long time to compile programs. It is especially interesting that the compiler takes more time to compile programs with larger matrix sizes and even fails for large sizes. Here are the times the compiler took to compile programs with different sizes:

    Matrix Size   real time    user time    sys time
    10            0m3.628s     0m4.116s     0m0.228s
    100           0m5.445s     0m7.812s     0m0.292s
    200           0m13.419s    0m12.145s    0m0.652s
    300           0m39.196s    0m25.250s    0m1.272s

Figure 1: StreamIt Matrix Multiply Graph

The compilation failed for matrix sizes above 500. For example, for matrix size 1000, compilation failed with the following error after 5m14.521s of real time:

    combined_threads.o: In function init_floatsource_32003_81_0():

We can pass the number of iterations to the executable with the -i command-line option. The following is the StreamIt code:

    void->void pipeline MatrixMult {
        add FloatSource(10);
        add MatrixMultiply(10, 10, 10, 10);
        add FloatPrinter();
    }

    float->float pipeline MatrixMultiply(int x0, int y0, int x1, int y1) {
        // rearrange and duplicate the matrices as necessary:
        add RearrangeDuplicateBoth(x0, y0, x1, y1);
        add MultiplyAccumulateParallel(x0, x0);
    }

    float->float splitjoin RearrangeDuplicateBoth(int x0, int y0, int x1, int y1) {
        split roundrobin(x0 * y0, x1 * y1);
        // the first matrix just needs to get duplicated
        add DuplicateRows(x1, x0);
        // the second matrix needs to be transposed first
        // and then duplicated:
        add RearrangeDuplicate(x0, y0, x1, y1);
        join roundrobin;
    }

    float->float pipeline RearrangeDuplicate(int x0, int y0, int x1, int y1) {
        add Transpose(x1, y1);
        add DuplicateRows(y0, x1 * y1);
    }

    float->float splitjoin Transpose(int x, int y) {
        split roundrobin;
        for (int i = 0; i < x; i++)
            add Identity<float>();
        join roundrobin(y);
    }

    float->float splitjoin MultiplyAccumulateParallel(int x, int n) {
        split roundrobin(x * 2);
        for (int i = 0; i < n; i++)
            add MultiplyAccumulate(x);
        join roundrobin(1);
    }

    float->float filter MultiplyAccumulate(int rowLength) {
        work pop rowLength * 2 push 1 {
            float result = 0;
            for (int x = 0; x < rowLength; x++) {
                result += (peek(0) * peek(1));
                pop();
                pop();
            }
            push(result);
        }
    }

    float->float pipeline DuplicateRows(int x, int y) {
        add DuplicateRowsInternal(x, y);
    }

    float->float splitjoin DuplicateRowsInternal(int times, int length) {
        split duplicate;
        for (int i = 0; i < times; i++)
            add Identity<float>();
        join roundrobin(length);
    }

    void->float stateful filter FloatSource(float maxNum) {
        float num;
        init {
            num = 0;
        }
        work push 1 {
            push(num);
            num++;
            if (num == maxNum)
                num = 0;
        }
    }

    float->void filter FloatPrinter {
        work pop 1 {
            println(pop());
        }
    }

    float->void filter Sink {
        int x;
        init {
            x = 0;
        }
        work pop 1 {
            pop();
            x++;
            if (x == 100) {
                println("done..");
                x = 0;
            }
        }
    }

This is how the sequential C version of matrix multiply looks:

    /* initialize A */
    for (j = 0; j < rows1; j++) {
        for (i = 0; i < columns1; i++) {
            A[i][j] = floatSource();
        }
    }

    /* initialize B */
    for (j = 0; j < rows2; j++) {
        for (i = 0; i < columns2; i++) {
            B[i][j] = floatSource();
        }
    }

    for (j = 0; j < rows1; j++) {
        for (i = 0; i < columns2; i++) {
            int k;
            float val = 0;
            /* calculate C{i,j} */
            for (k = 0; k < columns1; k++) {
                val += A[k][j] * B[i][k];
            }
            C[i][j] = val;
        }
    }

The initialization loops populate the A and B matrices with values generated from a function called floatSource, which is the C equivalent of the StreamIt FloatSource filter.
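The MPI version used in the comparison below is not reproduced in this report. Purely as an illustration of the style of program being compared against (a hand-written sketch: the matrix size, the row-block distribution, the fill values, and the particular collectives are my assumptions, not the benchmarked code), a distributed-memory matrix multiply in C with MPI could be written as follows:

    #include <mpi.h>
    #include <stdlib.h>

    #define N 100   /* matrix size; assumed divisible by the number of processes */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int rows = N / size;                        /* rows of A (and C) per process */
        float *A = NULL, *C = NULL;
        float *B      = malloc(N * N * sizeof(float));
        float *Ablock = malloc(rows * N * sizeof(float));
        float *Cblock = malloc(rows * N * sizeof(float));

        if (rank == 0) {
            A = malloc(N * N * sizeof(float));
            C = malloc(N * N * sizeof(float));
            for (int i = 0; i < N * N; i++) {       /* same role as floatSource() */
                A[i] = (float)(i % 10);
                B[i] = (float)(i % 10);
            }
        }

        /* collectives instead of loops of sends and receives */
        MPI_Scatter(A, rows * N, MPI_FLOAT, Ablock, rows * N, MPI_FLOAT, 0, MPI_COMM_WORLD);
        MPI_Bcast(B, N * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

        for (int i = 0; i < rows; i++)
            for (int j = 0; j < N; j++) {
                float val = 0;
                for (int k = 0; k < N; k++)
                    val += Ablock[i * N + k] * B[k * N + j];
                Cblock[i * N + j] = val;
            }

        MPI_Gather(Cblock, rows * N, MPI_FLOAT, C, rows * N, MPI_FLOAT, 0, MPI_COMM_WORLD);

        free(B); free(Ablock); free(Cblock);
        if (rank == 0) { free(A); free(C); }
        MPI_Finalize();
        return 0;
    }

In this formulation the data movement is expressed with MPI_Scatter, MPI_Bcast, and MPI_Gather rather than with loops of point-to-point messages, which is exactly the kind of structure the proposed backend would aim to produce.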

I was unable to run the generated cluster code on an actual cluster such as odin because some libraries were missing on odin. Another problem is that I was unable to compile StreamIt code with larger matrix sizes, so the MPI comparison might not be fair, because MPI has its advantages at larger array sizes. Still, I compared the sequential, StreamIt, and MPI codes, with the results shown in Figure 2; the x-axis is the matrix size and the y-axis is time in nanoseconds.

Figure 2: Matrix Multiply Comparison

As can be seen in the figure, the MPI version performs worse than both the StreamIt (strc) and sequential versions. This can be expected, since at these sizes MPI's copying overheads dominate the computation time: for an n x n multiply the computation grows as n^3 while the data that has to be moved grows only as n^2, so the communication cost is amortized only at larger sizes. We can also see that the sequential version is better than the StreamIt version; my guess is that the StreamIt code likewise suffers from communication overheads.

4.2 Cholesky Factorization

The second kernel we look at is Cholesky factorization. Again it took a lot of time to compile the StreamIt source, and it took longer to compile programs with larger matrix sizes. The times are as follows:

    Matrix Size   real time    user time    sys time
    100           1m53.304s    1m45.375s    0m1.792s
    150           3m26.589s    3m11.444s    0m2.184s
    200           5m0.306s     3m57.171s    0m3.284s

For size 300, the compilation ran for around 45 minutes and terminated with a StackOverflow exception. The comparison is shown in Figure 3; the x-axis is the matrix size and the y-axis is time in nanoseconds.

Figure 3: Cholesky Factorization Comparison

I tried the MPI version with different numbers of processes (4, 9, 16). The MPI version always performs worse, while the sequential version performs best. My guess is again that this is due to communication overheads dominating the actual computation. Here is the sequential C kernel of Cholesky factorization:

    // cholesky factorization
    for (k = 0; k < N; k++) {
        A(k, k) = sqrt(A(k, k));
        for (i = k + 1; i < N; i++)
            A(i, k) /= A(k, k);
        for (i = k + 1; i < N; i++)
            for (j = k + 1; j <= i; j++)
                A(i, j) -= A(i, k) * A(j, k);
    }
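The MPI Cholesky code used for this comparison is likewise not listed in the report. For illustration only, here is a hand-written sketch of one common MPI formulation of the kernel; the column-cyclic distribution, the matrix contents, and the matrix size N are assumptions of the sketch, not the benchmarked program:

    #include <math.h>
    #include <mpi.h>
    #include <stdlib.h>

    /* Column-cyclic distribution: rank owns columns j with j % size == rank.
       Column j is stored locally at index j / size. */
    int main(int argc, char **argv)
    {
        int rank, size, N = 200;                 /* N is the matrix order (assumed) */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int nloc = (N + size - 1 - rank) / size; /* columns owned by this rank */
        float *A   = malloc((size_t)nloc * N * sizeof *A); /* local columns, column-major */
        float *col = malloc(N * sizeof *col);              /* broadcast buffer for column k */

        /* fill the local columns of a symmetric, diagonally dominant matrix */
        for (int jl = 0; jl < nloc; jl++) {
            int j = jl * size + rank;
            for (int i = 0; i < N; i++)
                A[jl * N + i] = (i == j) ? (float)N : 1.0f / (i + j + 1);
        }

        for (int k = 0; k < N; k++) {
            int owner = k % size;
            if (rank == owner) {
                int kl = k / size;
                A[kl * N + k] = sqrtf(A[kl * N + k]);
                for (int i = k + 1; i < N; i++)
                    A[kl * N + i] /= A[kl * N + k];
                for (int i = k; i < N; i++)
                    col[i] = A[kl * N + i];
            }
            /* the scaled column k is needed by every later column:
               a broadcast, not a loop of point-to-point sends */
            MPI_Bcast(col + k, N - k, MPI_FLOAT, owner, MPI_COMM_WORLD);

            /* each rank updates the columns it owns to the right of k */
            for (int jl = 0; jl < nloc; jl++) {
                int j = jl * size + rank;
                if (j <= k) continue;
                for (int i = j; i < N; i++)
                    A[jl * N + i] -= col[i] * col[j];
            }
        }

        free(A); free(col);
        MPI_Finalize();
        return 0;
    }

Here, too, the only communication is a collective (an MPI_Bcast of the freshly scaled column at every step), the kind of pattern an MPI backend would want to recognize.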

Here is the StreamIt version of Cholesky factorization:

    void->void pipeline chol() {
        int N = 100;
        add source(N);
        add rchol(N);
        // this will generate the original matrix, for testing purposes only
        add recon(N);
        add sink(N);
    }

    /* performs the cholesky decomposition on an N by N matrix where input
       elements are arranged column by column with no redundancy, such that
       there are only N*(N+1)/2 elements. This is a recursive implementation */
    float->float pipeline rchol(int N) {
        // this filter does the divisions
        // corresponding to the first column
        add divises(N);
        // this filter performs the updates
        // corresponding to the first column
        add updates(N);
        // this splitjoin takes apart
        // the first column and performs chol(N-1)
        // on the resulting matrix; it is
        // where the actual recursion occurs.
        if (N > 1)
            add break1(N);
    }

    float->float splitjoin break1(int N) {
        split roundrobin(N, (N * (N - 1)) / 2);
        add Identity<float>();
        add rchol(N - 1);
        join roundrobin(N, (N * (N - 1)) / 2);
    }

    // performs the first column divisions
    float->float filter divises(int N) {
        init {
        }
        work push (N * (N + 1)) / 2 pop (N * (N + 1)) / 2 {
            float temp1;
            temp1 = pop();
            temp1 = sqrt(temp1);
            push(temp1);
            for (int i = 1; i < N; i++)
                push(pop() / temp1);
            for (int i = 0; i < (N * (N - 1)) / 2; i++)
                push(pop());
        }
    }

    // updates the rest of the structure
    float->float filter updates(int N) {
        init {
        }
        work pop (N * (N + 1)) / 2 push (N * (N + 1)) / 2 {
            float[N] temp;
            for (int i = 0; i < N; i++) {
                temp[i] = pop();
                push(temp[i]);
            }
            for (int i = 1; i < N; i++)
                for (int j = i; j < N; j++)
                    push(pop() - temp[i] * temp[j]);
        }
    }

    // this is a source for generating
    // a positive definite matrix
    void->float filter source(int N) {
        init {
        }
        work pop 0 push (N * (N + 1)) / 2 {
            for (int i = 0; i < N; i++) {
                push((i + 1) * (i + 1) * 100);
                for (int j = i + 1; j < N; j++)
                    push(i * j);
            }
        }
    }

    // prints the results:
    float->float filter recon(int N) {
        init {
        }
        work pop (N * (N + 1)) / 2 push (N * (N + 1)) / 2 {
            float[N][N] L;
            float sum = 0;
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    L[i][j] = 0;
            for (int i = 0; i < N; i++)
                for (int j = i; j < N; j++)
                    L[j][i] = pop();
            for (int i = 0; i < N; i++)
                for (int j = i; j < N; j++) {
                    sum = 0;
                    for (int k = 0; k < N; k++)
                        sum += L[i][k] * L[j][k];
                    push(sum);
                }
        }
    }

    float->void filter sink(int N) {
        init {
        }
        work pop (N * (N + 1)) / 2 push 0 {
            for (int i = 0; i < N; i++)
                for (int j = i; j < N; j++) {
                    // println("col: ");
                    // println(i);
                    // println("row: ");
                    // println(j);
                    // println("=");
                    println(pop());
                }
        }
    }

5 Cluster Backend

The -cluster n option selects a backend that compiles to N parallel threads that communicate using sockets. When targeting a cluster of workstations, the sockets communicate over the network using TCP/IP. When targeting a multicore architecture, the sockets provide an interface to shared memory. A hybrid setup is also possible, in which there are multiple machines and multiple threads per machine; some threads communicate via memory, while others communicate over the network. In order to compile for a cluster of workstations, one should create a list of available machines and store it in STREAMIT_HOME/cluster-machines.txt. This file should contain one machine name (or IP address) per line. When the compiler generates N threads, it will assign one thread per machine (for the first N machines in the file). If there are fewer than N machines available, it will distribute the threads across the machines. The compiler generates C++ files named threadN.cpp, where N ranges from 0 to a number the compiler determines based on its optimizations. For example, for the matrix multiply program with matrix size 10, 13 such files were generated, while for Cholesky with the same matrix size, 500 files were generated, contributing to the enormous compile time. Finally, a combined_threads.cpp file connects everything together. The cluster code is generated by compiler files in the directory STREAMIT_HOME/src/at/dms/kjc/cluster. The file GenerateMasterDotCpp.java generates the combined_threads.cpp file, while the socket code is generated by the ClusterCodeGenerator.java, FusionCode.java, and FlatIRToCluster.java files. To generate MPI code, we will need to modify these files; a hand-written sketch of what the generated code could look like for one of the split-joins above is given below.
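As a concrete but hypothetical illustration of the kind of code such a modified backend might emit, the roundrobin split and join around the MultiplyAccumulate workers in the matrix multiply example could be lowered to MPI_Scatter and MPI_Gather. Everything below (the function name, the one-rank-per-filter mapping, the root rank) is an assumption made for the sketch, not existing compiler output:

    #include <mpi.h>
    #include <stdlib.h>

    /* Sketch: one MPI rank plays the role of one MultiplyAccumulate filter.
       The roundrobin(x*2) split becomes an MPI_Scatter and the roundrobin(1)
       join becomes an MPI_Gather. */
    void multiply_accumulate_step(float *all_items,    /* significant on rank 0 */
                                  float *all_results,  /* significant on rank 0 */
                                  int rowLength,       /* x in the StreamIt code */
                                  MPI_Comm comm)
    {
        float *my_items = malloc(2 * rowLength * sizeof *my_items);
        float my_result = 0;

        /* split roundrobin(x*2): deal one chunk of x*2 items to each rank */
        MPI_Scatter(all_items, 2 * rowLength, MPI_FLOAT,
                    my_items, 2 * rowLength, MPI_FLOAT, 0, comm);

        /* body of the MultiplyAccumulate work function */
        for (int k = 0; k < rowLength; k++)
            my_result += my_items[2 * k] * my_items[2 * k + 1];

        /* join roundrobin(1): collect one result from each rank */
        MPI_Gather(&my_result, 1, MPI_FLOAT, all_results, 1, MPI_FLOAT, 0, comm);

        free(my_items);
    }

A real backend would of course have to derive the chunk sizes and the rank-to-filter mapping from the stream graph, but the communication itself collapses to two collective calls.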

6 Conclusion

Although no definitive conclusion can be drawn from this study about the usefulness of writing an MPI backend for StreamIt, I got to explore and understand the StreamIt model of computation better. Given more time and a better study with different benchmarks, we should be able to determine whether it is beneficial to generate MPI code instead of the socket code generated right now. Judging just by the C++ code generated for Cholesky and matrix multiply, it looks likely that an MPI backend would be beneficial. It will take another term project to identify the communication patterns statically from the StreamIt source and modify the relevant parts of the compiler to generate MPI code.

Bibliography

[1] Michael I. Gordon, William Thies, and Saman Amarasinghe. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XII), pages 151-162, New York, NY, USA, 2006. ACM.

[2] Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra. MPI: The Complete Reference, Volume 1: The MPI Core. MIT Press, Cambridge, MA, USA, 2nd (revised) edition, 1998.

[3] Rajeev Thakur. Improving the performance of collective operations in MPICH. In Recent Advances in Parallel Virtual Machine and Message Passing Interface: 10th European PVM/MPI Users' Group Meeting, number 2840 in LNCS, pages 257-267. Springer Verlag, 2003.

[4] William Thies. Language and Compiler Support for Stream Programs. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2009. AAI0821753.

[5] William Thies and Saman Amarasinghe. An empirical characterization of stream programs and its implications for language and compiler design. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT '10), pages 365-376, New York, NY, USA, 2010. ACM.