GPU Computing with Applications in Digital Logic


Tampere International Center for Signal Processing, TICSP series # 62

J. Astola, M. Kameyama, M. Lukac & R. S. Stanković (eds.)

GPU Computing with Applications in Digital Logic

Tampere International Center for Signal Processing
Tampere 2012

ISBN
ISSN

This book is dedicated to all those who like fast computing and computer games, the interest in which gave rise to GPGPU.


Preface

After the opening of the graphics processing unit (GPU) for general purpose computations, an entirely new computing model has emerged, providing a temporary break in the endless race for even faster and more powerful computing methods and devices. Since it originated in hardware primarily intended to implement highly demanding computations in computer graphics, which are essentially based on vector-matrix operations, GPUs are well suited for many tasks in signal processing, including the processing of binary and multiple-valued sequences as mathematical models of logic signals used in binary and multiple-valued switching theory and related applications. This monograph is devoted to the applications of GPU computing in these areas.

The introductory chapter, Chapter 1, presents a short review of the foundations of the GPU architecture and the related programming frameworks, hopefully providing the background necessary for understanding the presentations in the forthcoming chapters. Chapter 2 is devoted to the problems of the implementation of fast (FFT-like) computing algorithms for spectral transforms used in switching theory and logic design. Chapter 3 explores the possibilities for the parallelization of computations required in solving the unate covering problem of Boolean functions. Chapters 4 and 5 are devoted to the application of GPUs in quantum circuit representation and synthesis. GPU acceleration methods in matrix representations of quantum circuits for the efficient synthesis of these circuits are the main subject of Chapter 4. Chapter 5 discusses the GPU acceleration of methods for the synthesis of ternary quantum circuits using Hasse diagrams and genetic algorithms.

The concluding chapter, Chapter 6, is a review of miscellaneous applications of GPU computing in different areas of signal processing in a general sense.

We thank all the authors for their contributions to this publication and we hope it will serve its purpose in providing further ways for various applications of GPU computing in digital logic and related areas.

Jaakko Astola, Tampere, Finland
Michitaka Kameyama, Sendai, Japan
Martin Lukac, Sendai, Japan
Radomir S. Stanković, Niš, Serbia

August 2012

Contents

1 GPU Architecture and the Programming Environment
  Stanislav Stanković, Dušan Gajić, Radomir S. Stanković
  1.1 Introduction
  Computing Model Single Instruction Single Data
  Pipelining and the Multi-core CPU
  Computing Model Single Instruction Multiple Data
  Graphics Processing Unit
  Structure of the GPGPU System
  GPU Programming Frameworks
  Memory Models in CUDA and OpenCL
  The Evolution of GPU Architecture
  References

2 Computing Spectral Transforms Used in Digital Logic on the GPU
  Dušan Gajić, Radomir S. Stanković
  2.1 Introduction
  Related Work in Computing Spectral Transforms
  Spectral Transforms
  Fast Algorithms
  Cooley-Tukey FFT-like Algorithms
  FFT-like Algorithms with Constant Geometry
  Mapping Algorithms to the GPU
  Kernel Design and Optimization
  Cooley-Tukey Algorithms for Kronecker Transforms
  Algorithms with Constant Geometry for Kronecker Transforms
  Algorithms for the Haar Transform
  Experiments
  Experimental Platform
  Experimental Environment
  Referent Implementations
  Experimental Results
  Computation of the Dyadic Correlation and Autocorrelation over the GPU
  References

3 Sources and Obstacles for Parallelization - a Comprehensive Exploration of the Unate Covering Problem Using Both CPU and GPU
  Bernd Steinbach, Christian Posthoff
  3.1 Introduction
  The Problem to Solve: Unate Covering
  Initial GPU Approach - Matrix Multiplication
  Utilizing the Boolean Algebra
  Basic Approach
  The Improved Basic Approach
  Parallelization in the Application Domain
  Parallelization Using MPI on Four or Six CPU-Cores
  Uniform Distribution
  Adaptive Distribution
  Intelligent Master
  Ordered Restricted Vector Evaluation
  Sequential CPU Realization Using One Single Core
  Parallel GPU Realization Using CUDA
  Conclusions
  References

4 GPU Acceleration Methods of Representations for Quantum Circuits
  Martin Lukac, Marek Perkowski, Pawel Kerntopf, Michitaka Kameyama
  4.1 Introduction
  Quantum Computing and Quantum Circuits
  Representations of Quantum and Reversible Functions
  Quantum Circuits
  Linearly Independent Functions
  Relational Specification of Quantum Circuits
  QMDD
  GPU Acceleration
  GPU Micro-Parallelization
  Quantum Circuits Accelerated with GPU
  GPU Accelerated Parallel Algorithm for Computing LIs
  Experiments and Results
  Conclusion
  Synthesis of Reversible Cascades from Relational Specifications
  The Algorithm
  Search Heuristics
  Tree Minimization
  GPU Acceleration
  Experiments and Results
  Conclusion
  Evolutionary Synthesis of Quantum Circuits
  Genetic Algorithm
  Quantum Circuit Representation
  The Comparison of Quantum Circuits
  Evaluation of a Quantum Circuit
  The Selection of Individuals
  Population Evolution
  Error Calculation
  Experimentation
  Discussion
  Quantum Circuits with Structured and Unstructured Quantum Gates
  Width of the Quantum Circuit
  Length of the Quantum Circuit
  Evaluation
  Data Transfer to Memory and Result Transfer to Memory
  GA Limitations: Building Blocks and the Input Gate Set
  QMDD Limitations
  Conclusion
  Closing Remarks
  Examples of Single Qubit Quantum Gates
  Examples of Multiple Qubit Quantum Gates
  Pulse-level Quantum Gates
  References

5 Synthesis of Ternary Quantum Circuits using Hasse Diagrams and Genetic Algorithms
  Maher Hawash, Marek Perkowski, Martin Lukac
  5.1 Introduction
  Ternary Logic System
  Measurement of a Qubit
  Trits and Ternary States
  Reversible Operations
  Ternary Reversible Operators
  Synthesis by Example
  Ternary Logic Synthesis Algorithm
  Control Line Blocking
  Ternary Hasse Input Sequence
  Construction of an Input Sequence
  Hasse Precedence Quandary
  Selection Through a Genetic Algorithm
  Objective Function using Quantum Gate Count
  Genetic Algorithm
  Genotype and Valid Operators
  Mapping of the Algorithm to GPU
  Experimental Results
  GPU Acceleration
  Conclusion
  References

6 An Overview of Miscellaneous Applications of GPU Computing
  Stanislav Stanković, Jaakko Astola
  6.1 Introduction
  Medical Image Processing
  Audio Processing
  General Computer Science Problems
  Graph Theory
  Optimization and Machine Learning
  Dynamic Systems
  Astrophysics
  Statistical Modeling
  Computational Finance
  Engineering Simulations
  Computational Chemistry, Material Science, Nano-technology, Quantum Chemistry
  Computational Systems Biology
  Computational Neuro Biology
  Circuit Design and Testing
  Spectral Techniques
  References

List of Contributors

The names of the contributors to this book are given in alphabetical order.

Jaakko Astola, Dept. of Signal Processing, Tampere University of Technology, Tampere, Finland
Dušan Gajić, Dept. of Computer Science, Faculty of Electronic Engineering, University of Niš, Niš, Serbia
Maher Hawash, Department of Electrical Engineering, Portland State University, Portland, OR, USA
Michitaka Kameyama, Graduate School of Information Sciences, Tohoku University, Sendai, Japan
Pawel Kerntopf, Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland, and Department of Theoretical Physics and Computer Science, University of Lodz, Lodz, Poland
Martin Lukac, Graduate School of Information Sciences, Tohoku University, Sendai, Japan
Marek Perkowski, Department of Electrical Engineering, Portland State University, Portland, OR, USA
Christian Posthoff, The University of The West Indies, St. Augustine Campus, Trinidad & Tobago

Stanislav Stanković, Rovio Entertainment Ltd., Tampere, Finland
Radomir S. Stanković, Dept. of Computer Science, Faculty of Electronic Engineering, University of Niš, Niš, Serbia
Bernd Steinbach, TU Bergakademie Freiberg, Fakultät für Mathematik und Informatik, Institut für Informatik, Freiberg, Germany

Chapter 1
GPU Architecture and the Programming Environment
Stanislav Stanković, Dušan Gajić, Radomir S. Stanković

Abstract Initially, computers were invented as devices to speed up computations and facilitate the performance of repetitive mathematical operations. Their wider application in different areas upgraded this basic role and converted computers from calculating machines into devices for processing large amounts of data, with processing understood in a very general sense. The wonderfully large and ever increasing variety of applications based on various forms of data and information processing imposes various challenges, and computer hardware is supposed to provide the appropriate answers. This is the motivation for the surprisingly fast evolution of computing power in the relatively short history of contemporary computers. It seems that presently the opening of graphics processing units (GPUs) for general purpose computations (GPGPU) is an answer that can meet the high demands for large computing power in certain applications. Their efficient use can provide a short break in the inevitable race for entirely new computing technologies that are necessary, and offer some extended time for the underlying research work. This chapter first discusses the rationales that led to the development of computing resulting in the appearance of the GPGPU and the GPU as the underlying technological platform. Then, we present the foundations of GPU architecture and the GPU programming frameworks to the extent necessary for understanding the applications of GPUs discussed in this book.

Stanislav Stanković, Rovio Entertainment Ltd., Tampere, Finland, stanislav.stankovic@gmail.com
Dušan Gajić, Dept. of Computer Science, Faculty of Electronic Engineering, University of Niš, Niš, Serbia, dule.gajic@gmail.com
Radomir S. Stanković, Dept. of Computer Science, Faculty of Electronic Engineering, University of Niš, Niš, Serbia, Radomir.Stankovic@gmail.com

The work of Stanislav Stanković and Radomir S. Stanković was supported by the Academy of Finland, Finnish Center of Excellence Programme, Grant No.


1.1 Introduction

Before going into any further discussion, it is necessary to first settle the terminology. Nowadays the popular abbreviation in computing, GPU, stands for the Graphics Processing Unit, a device intended for computer-generated graphics, whose origins can be traced to entertainment software, with video games as the major driving force. The rationale for introducing special graphics processing hardware is clear from the following considerations. Video games require real-time 3D graphics, which is basically nothing but dealing simultaneously with a complex polygonal mesh and many bitmap textures. The complexity can be expressed as thousands of polygons in a mesh with large bitmap textures and a frame rate greater than 100 fps (frames per second). This necessarily results in the demand to perform many matrix operations very fast, and the idea of having specialized hardware, like the GPU, for such a task naturally emerged.

The other abbreviation, GPGPU, stands for general purpose computing on the GPU, and it assumes the implementation of any computationally intensive algorithms on GPUs, not necessarily related to 3D graphics. Recall that to produce 2D and 3D graphics, specialized Application Programming Interfaces (APIs), such as DirectX and OpenGL, were developed. In a similar way, specialized parallel programming languages and APIs, such as CUDA and OpenCL, are now available to implement general-purpose algorithms on GPUs, which led to a technique called GPU computing. The main background idea is to allow the exploitation of the parallel processing power of GPUs in other applications in various fields including, for instance, audio signal processing, medical imaging, systems biology, physics, chemistry, etc.

In the following sections, we will first briefly review the development of computing models and the related computer hardware that led to the GPGPU and GPUs. Then, we will discuss GPU architecture and the basic features of GPU programming

and the related programming models. The chapter ends with a brief review of the historical development of GPU architecture.

1.2 Computing Model Single Instruction Single Data

Classical computer architecture, often called the Von Neumann architecture, assumes that a computing task is performed as a series of processor instructions (a program) following the computing model called Single Instruction Single Data (SISD) (Fig. 1.1). The SISD computer architecture was designed as far back as the 1940s, and appeared to be very good for general purpose calculations such as, for instance,

C = A + B
D = 2 * C
D = D + 1
F = A + sin(D).

This way of computing assumes that at each step a different instruction is performed over different data. The program is executed as a series of steps
1. Decode instruction,
2. Fetch data,
3. Perform operation,
4. Write result,
over a hardware platform designed to support these steps. The basic building block of this platform is the Central Processing Unit (CPU) (Fig. 1.2), consisting of
1. An instruction decoder/coder unit,
2. Registers,
3. An arithmetic-logic unit (ALU),
4. A memory interface.

Fig. 1.1: Single instruction single data computing model and related hardware.

Fig. 1.2: Basic components of a hardware platform for execution of the computing model SISD.

The natural and immediate answer to the question of how to speed up the performance of a given computing task is to do each operation within the task faster. In terms of computer hardware this means increasing the clock, and the CPU clock frequency was increased from 1 MHz in 1980 up to about 4 GHz. There are, however, technical limits, since clock frequency is directly related to power dissipation and heating. A conclusion presented in [7] is that, for the technology between 500 nm and 180 nm, the speed was doubled with each generation, while for the 130 nm generation the speedup was smaller. This author considers the speedup as small from the 90 nm generation and beyond. Therefore, performing several operations at the same time is offered as an alternative, leading to the exploitation of various forms of parallelization in computing by using
1. Multi-core processors,
2. Multi-processor machines,
3. Clusters,
4. Grids of PCs, etc.

There are, however, also some limitations in this approach, since each task consists of two parts, a part that can be parallelized and another whose parallelization is impossible and which has to be implemented sequentially. Two opposite parameters have to be taken into account, the time necessary to complete a task and the related workload.

Example 1 Consider a task that can be implemented within 24 hours on a single CPU. Assume that even 95% of the task can be parallelized, while the remaining 5%

is the part of the task that must be done sequentially. This sequential part of the task requires 5% of the total computing time, which is 1.2 hours. It follows that, whatever the speedup is in performing the parallelized part, the total computing time cannot be shorter than 1.2 hours. Therefore, the maximum speedup that can be achieved cannot be larger than 20 times, even when assuming that the implementation time of the parallelized part tends to 0.

The situation with the parallelization of the processing can be analyzed by referring to two fundamental laws, the Amdahl law and the Gustafson law.

The Amdahl law determines the maximum expected improvement to an overall system due to the improvement of a single part of it. In terms of the parallelization of computing and computer hardware, it can be expressed as determining the maximum theoretical speedup that can be achieved by using multiple processors. Mathematically, the Amdahl law can be expressed as

R = 1 / ((1 - P) + P/S),

where P is the portion of the task that can be parallelized, assuming that it is eventually distributed over multiple processors, S is the speedup of the parallelized part, and R is the total speedup. As Fig. 1.3 shows, depending on the size of the parallelizable part of the task, after some point a further increase in the number of processors does not contribute considerably to the speedup of the computations. Thus, the Amdahl law, which assumes that the workload is fixed, can be seen as rather discouraging in the case of parallel computing.

A different picture is obtained from the Gustafson law (also called the Gustafson-Barsis law), which assumes that the time to do the non-parallelized part of the work is constant. The Gustafson law can be expressed as

S = a(n) + p(1 - a(n)),

where S is the speedup, n is the measure of the problem size, and the task a(n) + p b(n), consisting of the parts b(n) and a(n), which can and cannot be parallelized, respectively, is implemented on p processors. In this case, since the non-parallelizable part of the task a(n) = const., it follows that p → ∞ implies S → ∞. Therefore, if the size of the problems considered is increased sufficiently, increasing the number of processors ensures the achievement of any predetermined efficiency.

The difference in the pictures about the efficiency of the parallelization comes from the basic assumptions in these two laws. The Amdahl law is formulated by assuming that the overall workload of a program does not change with respect to the number of processors. It follows that the single-process execution time is fixed. The Gustafson law assumes that the per-process parallel execution time is fixed, while the single-process time scales with the number of processors p. A direct implication is that in many cases the limitations imposed by the non-parallelizable (sequential) part of the program can be overcome by increasing the total amount of computation, given a proper reformulation of the considered problem instances. The idea is to select the size of problems so that they can be solved on the available hardware resources in a fixed time span. Then, faster processors and more processors working in parallel allow us to solve larger problems within the same time.
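As a concrete illustration of the two formulas above, the following small routine (plain C, also compilable as CUDA host code) evaluates both speedup bounds; it is a minimal sketch, and the numbers are only examples, not measurements from this book.

    #include <stdio.h>

    /* Amdahl: fixed workload; P is the parallelizable fraction, S the speedup of that part. */
    static double amdahl(double P, double S) {
        return 1.0 / ((1.0 - P) + P / S);
    }

    /* Gustafson: fixed execution time; a is the sequential fraction, p the number of processors. */
    static double gustafson(double a, double p) {
        return a + p * (1.0 - a);
    }

    int main(void) {
        /* Example 1 from the text: 95% parallelizable, parallel part made arbitrarily fast. */
        printf("Amdahl bound for P = 0.95: %.1f\n", amdahl(0.95, 1e9));           /* approx. 20 */
        printf("Gustafson speedup for a(n) = 0.05, p = 64: %.1f\n", gustafson(0.05, 64.0));
        return 0;
    }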

Fig. 1.3: Amdahl law - the speedup in terms of the number of processors, for different sizes of the parallelized part P (50% to 95%).

1.3 Pipelining and the Multi-core CPU

The Single Instruction Single Data (SISD) computation model assumes that the program is performed on the Von Neumann architecture with a single CPU. The program is executed so that instructions are taken (fetched) from the memory in the order in which they are stored (as specified by the program), then decoded and executed (some operations are performed) over the specified data. The result is sent to the determined register. It follows that during fetching, the arithmetic unit waits for the next instruction to specify which operations should be performed and over which data. The desire to keep all the elements of a CPU as busy as possible all the time led to a speedup technique called pipelining. The basic idea is to allow the next instructions to be fetched (and kept in a buffer) while the arithmetic unit performs the operations specified by the present instruction, which increases the number of instructions that are processed during a given time, although it does not reduce the implementation time of each of them. This method is called instruction pipelining and

should not be confused with arithmetic pipelining, where arithmetic operations are split into steps that can be overlapped when performed. The main idea of instruction pipelining is that instructions (i) are shifted in time (t) when fetched from the memory, and the stages of their implementation do not overlap and thus can be implemented simultaneously by using different resources of the same hardware.

Fig. 1.4: Gustafson law - the speedup in terms of the number of processors, for different sizes of the sequential part a(n) (5% to 50%).

Fig. 1.5 illustrates this method with the example of a five-step pipeline. The steps of the execution of an instruction are
1. IF - instruction fetch,
2. ID - instruction decode,
3. EX - instruction execution over some operands,
4. MEM - memory access,
5. WB - register write back.
The figure illustrates that while the instruction with the order number 1, I(1), performs the step WB, the next instruction I(2) is in the fourth step and is accessing the memory. The other instructions are also shifted, so that at the same time instance the instruction I(5) is fetched from the memory. The operating system handles this time shifting of instructions, making it transparent for the user. To simplify the related procedures, the set of allowed instructions is reduced, from where the name Reduced Instruction Set CPU (RISC) comes. It is clear that larger numbers of instructions or operations that can be in the pipeline provide better performance and, for example, Prescott and Cedar Mill Pentium 4 cores have a 31-step pipeline.

The limitations of the method can be formulated as data dependency, meaning that the next operation needs the results of the previous operations and has to wait for them.

Fig. 1.5: Five-step RISC pipeline (the IF, ID, EX, MEM, and WB stages of consecutive instructions i(1) to i(5), shifted in time).

Multi-core CPUs are an answer to the problem, since they offer several cores which can work in parallel and reduce the problem of data dependency, at the price of an increased complexity of dealing with sets of instructions. Current processors usually have 2 or 4 cores, while 8-core processors are also emerging. Each core has its own instruction decoder, ALU, and registers; however, the cores share the same memory interface. Fig. 1.6 illustrates the basic architecture of a multi-core CPU with the example of the Fujitsu multi-core processor comprising four Fujitsu FR550 processor cores [10]. Regarding functioning, the multi-core CPUs still remain the SISD architecture. This computation model can be illustrated by the following example, which also shows its disadvantages.

Fig. 1.6: Simplified architecture of a multi-core processor equipped with four FR-V FR550 processor cores (with SDRAM, system bus, and internal/external DMA controllers).

Example 2 Consider computing C = A + B, where A, B, and C are n-dimensional vectors. In the SISD computing model, the computation is done as follows

C[1] = A[1] + B[1]
C[2] = A[2] + B[2]
...
C[n] = A[n] + B[n].

Thus, the computation is performed in n steps with the same arithmetic operation at each step. Within each step, the following actions are performed for i = 1,...,n,
1. Fetch instruction,
2. Decode instruction,
3. Get data,
4. Execute C[i] = A[i] + B[i],
5. Write back (to memory).
It is clear that there are many redundant Instruction Fetch (IF) and Instruction Decoding (ID) actions, since this is repeated n times, although each time the same operation (the addition of A[i] and B[i], i = 1,...,n) is performed. The data dependency is low and all the computations are performed on a single ALU. At each instance a single operation of addition (ADD) is performed, resulting in a prolonged computation time.

1.4 Computing Model Single Instruction Multiple Data

Computation tasks similar to the problem discussed in Example 2 motivated the introduction of another computing model, called Single Instruction Multiple Data (SIMD), and the related hardware architecture. Since the same instruction is performed, there is no need for multiple instruction decoders; however, there are many ALUs and a single memory interface (Fig. 1.7). The difference in performing a computing task with respect to the SISD computing model is illustrated by the following example.

Example 3 The computation of C = A + B from Example 2 using the SIMD architecture is performed as

C[1] = A[1] + B[1], C[2] = A[2] + B[2], ..., C[n] = A[n] + B[n],

which means that n operations are executed in a single step. This clearly results in a considerable speedup in the implementation of the task.
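To make the contrast between the two models concrete, the following sketch in CUDA C contrasts the sequential loop of Example 2 with a data-parallel kernel in which every thread applies the same instruction to its own vector element. The function names are illustrative only and do not appear elsewhere in this book.

    #include <cuda_runtime.h>

    /* SISD view: a single instruction stream walks over the data element by element. */
    void vec_add_cpu(const float *A, const float *B, float *C, int n) {
        for (int i = 0; i < n; ++i)
            C[i] = A[i] + B[i];                          /* n sequential steps */
    }

    /* SIMD-style view: every thread executes the same instruction on its own element. */
    __global__ void vec_add_gpu(const float *A, const float *B, float *C, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   /* global index of this thread */
        if (i < n)                                       /* guard threads beyond the vector length */
            C[i] = A[i] + B[i];
    }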

Fig. 1.7: The basic components of a hardware platform for the execution of the SIMD computing model.

1.5 Graphics Processing Unit

GPUs are one of the architectures used to perform the SIMD computing model. A GPU consists of several streaming multiprocessors (compute units), each equipped with several streaming processors (streaming cores or processing elements) that share a single instruction decoder and a common memory interface. Note that, when describing the GPU architecture, different terminology is used in the application programming interfaces (APIs) for programming GPUs, the Compute Unified Device Architecture (CUDA) [21] and the Open Computing Language (OpenCL) [15], [22], as compared in Table 1.1. Note also that there is a difference in terminology between the OpenCL standard version 1.2 [15] and the AMD interpretation of it.

Table 1.1: Terminology in CUDA and OpenCL.

  CUDA                       OpenCL
  GPU device                 GPU device
  Streaming multiprocessor   Compute unit
  Streaming processor        Streaming core (AMD) / Processing element (OpenCL version 1.2)
  Grid                       Global domain
  Block (of threads)         Work-group
  Warp                       Wavefront
  Thread                     Work-item
  Memory:
  Global                     Global
  Constant                   Constant
  Shared                     Local
  Local                      Private

Fig. 1.8 shows a simplified GPU architecture, as it is usually represented in this or similar ways in many publications, see for instance [1], [13], [24]. The main blocks are several streaming multiprocessors (compute units) consisting of several streaming processors (streaming cores) that share a common instruction fetch/decode unit and a shared memory. The simplified structure of a streaming core is shown in Fig. 1.9 by referring to the ATI/AMD GPUs [1].

Fig. 1.8: The simplified architecture of a GPU (streaming multiprocessors with a shared instruction fetch/decode unit and shared memory).

Fig. 1.9: Structure of the streaming processor (streaming core): processing elements, a unit for transcendental operations, a branch execution unit, and general purpose registers.

The fundamental building block of the GPU is the streaming multiprocessor (SM), consisting of several streaming processors (SPs). The streaming multiprocessor is designed to perform the Single Instruction Multiple Data (SIMD) computing model, i.e., each streaming processor performs the same instruction but over different sets of data. Therefore, the streaming multiprocessor is equipped with a single instruction fetch/decode unit. A streaming processor consists of several basic processing elements that can operate simultaneously. The number of processing elements varies between GPU architectures. For instance, the ATI/AMD GPU has four or five very long instruction word (VLIW) processors [8], [15], [19], allowing the execution of several scalar operations simultaneously. The arithmetic units of the processing elements can execute either integer operations or single and double precision floating point operations. In each streaming core, there is a processing element that can perform transcendental operations such as sine, logarithm, etc.

A streaming multiprocessor also contains a number of registers and a small on-chip (typically 16-48 KB) memory shared between the streaming cores. The main storage is located in an off-chip GPU global memory that has a relatively large latency (the number of clock cycles needed to access the required data stored in a specific row or column of the memory), which usually ranges from 400 to 800 clock cycles [22]. Note that the computations on the GPU are fast (up to 32 basic instructions per clock cycle per SM [1], [22]). Therefore, for optimal performance, GPUs use many active threads (the smallest execution entities, i.e., the basic instances of a function in the program, also called work-items in OpenCL) to perform many computations while data is being transferred from/to the global memory.

Recall that the streaming processors operate by passing streams (sets of elements of the same type) of data records through computation kernels. The kernels in general can have one or more input and output streams and perform complex calculations ranging from a few to thousands of operations per input element. Since in a GPU the computing units execute a single set of instructions (SIMD) (the single kernel), data parallelism is used and each streaming core performs the same task on different pieces of distributed data. The operations are controlled by many execution threads working in parallel over a stream of data. For an illustration, in the aforementioned Example 2 discussing the addition of two vectors, if the computation is performed over two processors, the first processor can perform the addition of elements in the upper half of the vectors while the other operates over the bottom half, and the computations will be done twice as fast. A thread performs the addition of two elements A[i] and B[i] to get the result C[i] = A[i] + B[i].

The minimum size of the data processed in the SIMD manner by a streaming (CUDA) multiprocessor is a group of 32 or 64 threads (or work-items in OpenCL), which is called a warp (or wavefront in OpenCL). To facilitate the programming, threads are organized into blocks, which are sets of threads that can mutually communicate and synchronize their execution through the local (shared) memory in a streaming multiprocessor shared by the streaming processors (Fig. 1.10). This hierarchy provides execution granularity and partially corresponds to the hardware

architecture, i.e., streaming processors and streaming multiprocessors. Blocks are viewed as 3D structures and each block has the same number of threads. A block can consist of 64 to 512 threads, and it is executed by the same streaming multiprocessor (since threads within a block are synchronized over the same shared memory). Due to the existence of control registers, a computation unit can execute multiple blocks simultaneously [1]. Blocks are organized into grids; the main background idea of this grouping is that the number of blocks processed simultaneously by the GPU is closely linked to the available hardware resources. Organizing blocks into a grid allows a programmer not to think about fixed resources, since the CUDA runtime (an application programming interface (API) written in C programming language style as a part of the programming resources) will break the grid down into blocks and distribute them over the available hardware. If the hardware resources are restricted, blocks will be executed sequentially; otherwise, they can be processed in parallel if there is a sufficiently large number of processing units. A good consequence is that the same programming code can be executed on different GPUs with a different number of processors. This feature is often called the scalability of the programming model or the programming code.

Fig. 1.10: The hierarchy of execution entities in CUDA (a kernel is launched over a grid of blocks, and each block consists of threads indexed in up to three dimensions).
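A sketch of how this scalability appears in host code is given below, reusing the hypothetical vec_add_gpu kernel from the earlier sketch; the grid size is derived from the problem size rather than from the number of multiprocessors of a particular GPU.

    /* Host-side launch: the grid size depends only on the problem size n, so the same
       code runs unchanged on GPUs with different numbers of streaming multiprocessors. */
    void launch_vec_add(const float *dA, const float *dB, float *dC, int n) {
        int threadsPerBlock = 256;                                   /* one block = several warps */
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
        vec_add_gpu<<<blocksPerGrid, threadsPerBlock>>>(dA, dB, dC, n);
        /* The CUDA runtime distributes the blocks over the available multiprocessors:
           sequentially when resources are scarce, in parallel when they are plentiful. */
    }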

The execution of threads in OpenCL is organized in a similar way. Kernels are functions executed by devices (GPUs) and they are compiled at runtime. Work-items are instances of kernels that execute the same code on different data. To provide data-level granularity, kernels are organized into work-groups and wavefronts. The number of work-items per wavefront is fixed by the hardware. For instance, the Radeon HD5800 card allows 64 work-items within a wavefront. The number of wavefronts per work-group is specified by the user, and the work-group size should clearly be an integer multiple of the wavefront size.

Fig. 1.11: The hierarchy of execution entities in OpenCL (the global domain is divided into work-groups, each consisting of wavefronts, which in turn consist of work-items).

The above organization of the execution is well suited when there is no data dependency between the data processed by groups of threads. If this is not the case, it can be suitable to execute threads independently, which is fulfilled in the computing model called Single Instruction Multiple Threads (SIMT) [1], [13], [21]. Although there are different views on whether this is an entirely new computing model or just a variant of SIMD, it might be said that SIMT is somewhere between SIMD and the Simultaneous Multi-Threading (SMT) technique for improving the exploitation of instruction-level parallelism. A SIMT multiprocessor is capable of executing

individual threads independently, which is different from SIMD, where elements of short vectors are processed in parallel, meaning that threads are executed in synchronous groups. When programming for SIMD systems, data parallelism must be expressed explicitly at the software level for each vector instruction. With the GPU SIMT architecture, data parallelism between independent threads is discovered automatically at the hardware level. For a discussion of SIMT specific to the NVIDIA Tesla GPU, we refer to [18]. Differences between the SIMT and SIMD models are elaborated in more detail in [11], [13], [22].

Table 1.2 provides a brief summary of the main differences between CPUs, GPUs, Digital Signal Processors (DSPs), and multi-core CPUs. For more detailed discussions of this subject see, for instance, [12].

Table 1.2: Comparison of the GPU with other computing architectures.

GPU versus CPU
  CPU: SISD architecture; general purpose; good for sequential code; instruction-level parallelism
  GPU: SIMD architecture; massively parallel; good for matrix-vector calculations; data-level parallelism
  Typical system architecture: PC with a graphics card; GPU on the graphics card, CPU on the motherboard; graphics memory on the graphics card, main system memory on the motherboard; CPU for sequential code, GPU for parallel code; GPU subservient to the CPU

GPU versus DSP
  GPU: more flexible/general purpose; high-level programming language support; more powerful
  DSP: self-contained, stand-alone; intended for embedded systems; less power consumption; more specialized

GPU versus multi-core CPU
  Multi-core (CPU): small number (2, 4, 8) of cores; powerful cores
  Many-core (GPU): hundreds of cores (up to 2880 on the Nvidia Kepler GK110); less powerful cores

1.6 Structure of the GPGPU System

A GPU device should be viewed as a hierarchy of components, not just an array of streaming processors. It is a massively parallel computation platform consisting of

hundreds of streaming processors, and the computations are performed with thousands of execution threads. This makes it very well suited for vector and matrix operations. Therefore, GPUs are a hardware platform for many GPGPU tasks in different areas, as will be shortly reviewed in Chapter 6. A typical GPGPU system consists of a CPU (intended to perform the sequential part of the task) and a GPU (to perform the part that can be parallelized) and can be implemented as a PC with a graphics card.

1.7 GPU Programming Frameworks

There are mainly two Application Programming Interfaces (APIs) available for the development of GPGPU programs, the Compute Unified Device Architecture (CUDA) by NVIDIA [21] and the more recent standard Open Computing Language (OpenCL) [15], [22]. CUDA is a vendor-specific framework and supports accelerated program execution on GPU hardware produced by NVIDIA. OpenCL is an open standard for writing programs that can be executed across heterogeneous platforms. Furthermore, the OpenCL C programming language, which is a part of the OpenCL framework, allows development of code that is at the same time accelerated and portable across various computational platforms and devices (CPUs, GPUs, Field Programmable Gate Arrays (FPGAs), Digital Signal Processors (DSPs), cell processors, embedded processors). The integration of these conceptually different computing devices into common heterogeneous multicore processors is an important recent trend in the evolution of computer systems. The design of the OpenCL framework was strongly motivated by this trend and, therefore, it fully supports programming for heterogeneous systems [11], [20].

The OpenCL platform model represents an abstraction of the underlying hardware architecture and is shown in Figure 1.13. The flow of execution of an OpenCL program is described in detail both in the official standard specification [15] and in [1], [20]. The structure of an OpenCL program, also presented in Figure 1.13, consists of two main parts:
1. The host code implements the non-parallel part of the task; it is executed on CPUs and implements tasks like creating the context for the execution of kernels and making the kernel calls. The host code is usually written in C/C++, although OpenCL bindings for languages like C# and Python also exist [1], [15].
2. The device code implements the parallel part of the task; it actually implements the kernels and is typically processed on GPUs. If there are no GPUs available, then kernels can be executed on the CPU. The device code is developed by using the OpenCL C programming language, which is based on the ISO C99 standard for the C programming language, with certain restrictions (e.g., recursion is not allowed, some limitations are imposed on the usage of pointers, etc.) and special extensions for writing massively parallel programs. OpenCL C is in many aspects similar to CUDA [21], and the transition from one language to the other is relatively straightforward.

Fig. 1.12 illustrates the GPGPU processing flow: 1. copy the input data from the main memory to the GPU global memory, 2. the CPU instructs the processing, 3. the kernel is executed in parallel in each core, 4. copy the result back to the main memory.

Fig. 1.12: GPGPU processing flow.

Fig. 1.13: The OpenCL platform model and the structure of an OpenCL program (the host code, written in C/C++ with no data parallelism, runs on the CPU; the device code, written in OpenCL C with high data parallelism, runs on the GPU).

The host code selects devices, creates a context for the execution of the kernels, and creates a specific data structure called the command queue to determine the order of execution of commands and their distribution (scheduling) over the devices selected for the given context. Devices consist of sets of cores on the CPU and GPU.
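A minimal CUDA host-side sketch of this four-step flow is given below; the function and kernel names are illustrative only, and error checking is omitted for brevity.

    #include <cuda_runtime.h>

    __global__ void scale_by_two(float *x, int n) {      /* step 3: executed in parallel on the device */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    void run_on_gpu(float *host_data, int n) {
        float *dev_data;
        size_t bytes = (size_t)n * sizeof(float);
        cudaMalloc(&dev_data, bytes);                                    /* allocate GPU global memory      */
        cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);  /* 1. copy the input data          */
        int block = 256, grid = (n + block - 1) / block;
        scale_by_two<<<grid, block>>>(dev_data, n);                      /* 2. instruct processing (launch) */
        cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);  /* 4. copy the result back         */
        cudaFree(dev_data);
    }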

The kernels are executed across cores in a data parallel manner. Recall that data parallelism assumes that an individual data element is assigned to a separate logical core for processing. The context enables sharing of memory between devices, assuming that all devices are in the same context. Each device must have a queue and all the work is submitted through queues.

When a kernel is submitted for execution by the host, an index space is defined. A single instance of the kernel, called a work-item (or, in CUDA, a thread), is executed for each point in the index space. A number representing the global identifier of a work-item is assigned to it based on the corresponding point in the index space. This identifier is accessible from the kernel, and is used to distinguish the data to be processed by each work-item. Every time a kernel is launched, many work-items (their number is specified by the programmer) are created. Each work-item executes the same code, but the specific path through the code and the data operated upon can vary for each of the work-items. Work-items are grouped into work-groups in order to provide communication and cooperation between work-items. A program is executed in two steps: the host code is executed as a single thread on the CPU, and the device code as many threads in parallel on the GPU.

1.8 Memory Models in CUDA and OpenCL

In general, computations are performed over data that are stored in a memory. The memory can be viewed as an array of basic memory elements arranged in rows and columns, similar to the structure of a matrix. Regarding the speed of the computation and the total time for performing a task, two critical parameters are
1. The latency (the time from the moment when a memory module receives the instruction to access some memory address until the data stored at this address becomes available),
2. The bandwidth (the amount of data that can be transferred from memory in a unit of time).

In a computing system, memory is organized as a hierarchical structure consisting of modules with different sizes and speeds, including small and fast memories such as registers and caches, and large and slow external memories. Each of these various memory modules has its own objective and is designed to serve a particular purpose. The central part of this hierarchical memory structure is the main system memory, which is aimed at performing general purpose computations by executing sequential programs. Since in such program implementations many memory accesses are usually requested, such a memory has a low latency and a limited bandwidth. A graphics memory is primarily designed for handling big textures and performs the role of a frame buffer, permitting large chunks of data to be transferred at the same time. Therefore, it has a large bandwidth but also a large latency. Therefore, for the efficiency of computations, it is of essential importance to know the memory model

used in the GPU and then arrange the task in a manner that will exploit the positive features of GPU computing in the best possible way. The abstract memory models used in CUDA and OpenCL are basically the same and can be classified as so-called relaxed memory consistency models [15]. The model consists of four memory modules,
1. Private memory, per work-item,
2. Local memory, per work-group,
3. Global memory, shared by all work-items,
4. Constant memory, shared by all work-items,
as shown in Fig. 1.14.

Fig. 1.14: CUDA memory model (per-thread private memory, per-block local memory, and the global/constant memory shared between the device and the host).

The following read and write (R/W) access to memory is allowed for the device code
1. R/W per-thread registers,
2. R/W per-block shared memory,
3. R/W per-grid global memory,
4. R per-grid constant memory (the read-only memory).
The host code can transfer data to and from the per-grid global and constant memories. In OpenCL, registers correspond to the private memory, shared memory to the local memory, work-items to threads, and blocks to work-groups [8]. The main background idea of such an organization of the memory modules is to allow the user to partition the data and the computation task into tasks assigned to separate work-items. For this purpose, data are localized and associated with work-items or within a work-group by the user.
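The following CUDA sketch illustrates how these memory spaces typically appear inside a kernel. It is a minimal example (a per-block weighted sum) assuming a block size of exactly 256 threads; the names are illustrative only.

    __constant__ float weights[16];                      /* constant memory: read-only for all threads   */

    __global__ void block_weighted_sum(const float *in, float *out, int n) {
        __shared__ float partial[256];                   /* shared/local memory: one copy per block      */
        int tid = threadIdx.x;                           /* tid and v live in registers (private memory) */
        int i = blockIdx.x * blockDim.x + tid;

        float v = (i < n) ? in[i] * weights[tid % 16] : 0.0f;   /* in[] resides in global memory         */
        partial[tid] = v;
        __syncthreads();                                 /* threads of a block synchronize via shared memory */

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   /* tree reduction within the block              */
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = partial[0];      /* result written back to global memory         */
    }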

Table 1.3 summarizes the way of allocating and accessing the memory modules by the host code and the device code, with the notation as used in both CUDA and OpenCL.

Table 1.3: Allocation and access of memory modules by the host code and the device code.

  (CUDA/OpenCL)        Global/Global   Constant/Constant   Shared/Local   Local/Private
  Host   Allocation    Dynamic         Dynamic             Dynamic        Forbidden
  Host   Access        Read-write      Read-write          Forbidden      Forbidden
  Device Allocation    Forbidden       Static              Static         Static
  Device Access        Read-write      Read                Read-write     Read-write

It is usually advisable to reduce the communication with the main system memory and also with the global GPU memory. When accessing the global memory, it is good to read data from contiguous memory addresses using regular memory access patterns. For this purpose, memory coalescing has been developed as a technique intended for the optimization of memory addressing by
1. Avoiding unnecessary memory accesses,
2. Avoiding random accesses,
3. Arranging data in consecutive memory locations,
4. Reading big blocks of data in one operation,
5. Exploiting the high bandwidth.
For instance, an algorithm would be optimized if the task is organized in such a way that work-items in the same wavefront read the data from memory locations in the same cache line [1]. In other words, work-items with consecutive identifiers access consecutive memory addresses. This technique is called memory coalescing, with the term borrowed from general memory management methods, where it generally refers to joining two neighboring free blocks of memory. The problem is that important memory characteristics might differ between GPU cards from different producers. For example, in different GPUs the number of cache lines varies. Therefore, the user has to specify and adapt the program to the specific characteristics of the available hardware for a concrete task.
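As a sketch of the difference (illustrative kernels, not taken from the book), compare the two copy kernels below: in the first, work-items with consecutive identifiers touch consecutive addresses, while in the second a stride scatters the accesses and defeats coalescing.

    /* Coalesced: consecutive threads read consecutive addresses, so a warp/wavefront
       is served by a few wide memory transactions. */
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    /* Strided: consecutive threads read addresses that are far apart, which turns
       one wide transaction into many narrow ones. */
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[(i * stride) % n];        /* scattered access pattern */
    }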

1.9 The Evolution of GPU Architecture

The closing section of this chapter is devoted to a brief discussion of the evolution of GPU architecture, summarized in Table 1.4.

It can be said that the Commodore Amiga, originally released in 1985, was the first personal computer that featured hardware specialized for rendering computer graphics. Before the release of the Amiga, CPUs were exclusively in charge of preparing frames and drawing them on computer displays. In the Amiga, the graphics coprocessor included functions such as line drawing, area filling, and a special circuit for the acceleration of bitmap manipulation [17]. The hardware acceleration of 2D graphics became widespread on PCs only about ten years later. In the early 1990s, graphics on PCs were drawn using a Video Graphics Array (VGA) controller, which was essentially a memory controller connected to the RAM memory. Advances in semiconductor technology gradually allowed a set of 2D graphics acceleration functions to be added to VGA controllers. The first 3D graphics acceleration functions, such as triangle rasterization, texture mapping, and shading, were subsequently incorporated into VGA controllers.

On October 11, 1999, Nvidia released the GeForce 256 as the world's first Graphics Processing Unit. The Nvidia GeForce 256 was a single-chip processor dedicated to 2D/3D graphics rendering acceleration, with integrated hardware engines for transform and lighting, triangle setup and clipping, and rendering. GPU architecture in the early 2000s could be described as a predominantly fixed-function processor dedicated to rendering computer graphics. Each step in the graphics rendering pipeline was customized for the processing of a certain task using specialized hardware.

An important event that influenced the later development of GPU architecture happened in 2003, when the designers of traditional CPUs reached a limit in the previously constant increase of processor frequency. This limit was dictated mostly by heat dissipation and energy consumption. Since then, the two main trends in the development of new computer architectures have been the multi-core and the many-core approaches. Designers of traditional CPUs continued the development through a multi-core approach, with the main goal of maintaining the execution speed of sequential programs while adding more and more cores. GPU designers, on the other hand, followed the many-core approach, focusing on providing an ever increasing throughput for the execution of parallel programs. The many-core GPU architecture has also evolved towards providing more programmability with each new generation, with less specialized and more flexible pipeline steps. This evolution resulted in modern GPUs, which are a unified graphics and computing engine with additional fixed-function units, serving as both a programmable graphics processor and a scalable parallel computing platform.

The extraordinary evolution of the GPU, from its appearance in 1999 to date, can be illustrated by pointing out that, as of January 2012, three out of the top five most powerful supercomputers in the world use GPU acceleration [26]. For a review of supercomputers, see for instance [25]. As an illustration of the computing power, the basic characteristics of the Tianhe-1A supercomputer can be summarized as petaflops-scale performance, 14,336 Xeon X5670 CPUs, and 7,168 Nvidia Tesla M2050 GPUs. In August 2011, this computer was ranked second in the TOP500 list. For a historical overview of GPU computing, we refer to [24].

Table 1.4: Historic development of GPUs and GPGPU.

1985: Amiga, the first microcomputer with dedicated graphics hardware
1987: IBM 8514, 2D primitives in hardware
Mid 90ies: first 3D accelerators (3Dfx Voodoo); OpenGL, DirectX (Windows)
Mid 90ies to first half of the 00s: GPUs get more complex; computational power grows; flexibility grows; developers demand more control; programmable vertex and pixel shaders
1999: Nvidia GeForce 256, the first true GPU
Early 2000s: first non-graphics use of GPUs
2007: CUDA framework by Nvidia
2008: OpenCL standard by the Khronos Group (Apple, AMD/ATI, Nvidia, etc.)
Current situation: GPUs cheap and ubiquitous; manufacturers Nvidia and AMD/ATI hold nearly 100% of the market; two software standards, CUDA (Nvidia) and OpenCL (Khronos); Nvidia Tesla and GeForce GTX cards delivering performance on the order of GFLOPS

References

1. Aamodt, T. M., Architecting graphics processors for non-graphics compute acceleration, in Proc. of the 2009 IEEE Pacific Rim Conf. on Communications, Computers and Signal Processing, Victoria, BC, Canada, August 23-26, 2009.
2. Advanced Micro Devices (AMD) Inc., AMD Accelerated Parallel Processing OpenCL Programming Guide. Accessed August 5.
3. Arndt, J., Matters Computational: Ideas, Algorithms, Source Code, Springer.
4. Brooks, F., The Mythical Man-Month: Essays on Software Engineering, Addison-Wesley Professional, Second Edition.
5. Buluc, A., Gilbert, J., Budak, C., Solving path problems on the GPU, Parallel Computing, Vol. 36, No. 5-6.
6. Copeland, A. D., Chang, N. B., Lung, S., GPU accelerated decoding of high performance error correcting codes, Proc. 14th Annual Workshop on HPEC, Lexington, Massachusetts, USA, Sept.
7. Ertl, A., CPU clock speed, July 30, 2008. Accessed August 3, 2012.

8. Farber, R., OpenCL - memory spaces, October 2008. Accessed July 20.
9. Fatahalian, K., Houston, M., A closer look at GPUs, Communications of the ACM, Vol. 51, No. 10, Association for Computing Machinery.
10. Fujitsu Public and Investor Relations, Fujitsu develops multicore processor for high-performance digital consumer products. Accessed August 1.
11. Gaster, B. R., Howes, L., Kaeli, D., Mistry, P., Schaa, D., Heterogeneous Computing with OpenCL, Elsevier, Morgan Kaufmann.
12. Ghorpade, J., Parande, J., Kulkarni, M., Bawaskar, A., GPGPU processing in CUDA architecture, Advanced Computing - An International Journal (ACIJ), Vol. 3, No. 1, 2012.
13. Hennessy, J., Patterson, D., Computer Organization and Design: The Hardware/Software Interface, Elsevier, Morgan Kaufmann, Fourth Revised Edition.
14. Hwu, W. W., GPU Computing Gems - Emerald Edition, Morgan Kaufmann Publishers.
15. Khronos Group, OpenCL Specification 1.2, Khronos OpenCL Working Group.
16. Kirk, D. B., Hwu, W. W., Programming Massively Parallel Processors: A Hands-on Approach, Elsevier, Morgan Kaufmann.
17. Knight, G., Amiga history guide. Accessed July 30.
18. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J., NVIDIA Tesla - A unified graphics and computing architecture, IEEE Micro, March-April 2008.
19. Mathew, B. K., Very large instruction word architectures, in Oklobdzija, V. G. (ed.), The Computer Engineering Handbook, CRC Press.
20. Munshi, A., Gaster, B. R., Mattson, T. G., Fung, J., Ginsburg, D., OpenCL Programming Guide, Pearson Education, Addison-Wesley Professional.
21. Nvidia (2011), Nvidia CUDA. Accessed August 6.
22. NVIDIA, OpenCL Programming Guide for the CUDA Architecture.
23. Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., Philips, J. C., GPU Computing, Proc. IEEE, Vol. 96, No. 5, 2008.
24. Sanders, J., Kandrot, E., CUDA by Example - An Introduction to General-Purpose GPU Programming, Addison-Wesley.
25. Steen, Aad J. van der, Overview of recent supercomputers, NCF/NPC Research, technical report, October.
26. Top 500 Supercomputer Sites. Accessed July 30, 2012.

Chapter 2
Computing Spectral Transforms Used in Digital Logic on the GPU
Dušan Gajić, Radomir S. Stanković

Dušan Gajić, Dept. of Computer Science, Faculty of Electronic Engineering, University of Niš, Niš, Serbia, dule.gajic@gmail.com
Radomir S. Stanković, Dept. of Computer Science, Faculty of Electronic Engineering, University of Niš, Niš, Serbia, Radomir.Stankovic@gmail.com

Abstract GPU computing originated in the opening of the graphics processing units (GPUs), which are devices intended to produce computer graphics, for general purpose computations. Since computer graphics is based on matrix operations, GPUs are purposely designed to implement such operations efficiently. Spectral transforms are defined in terms of sets of basis functions, which can be conveniently arranged as columns of certain matrices whose entries can be complex numbers, real numbers, integers, or elements of some finite fields. In this way, spectral transforms are defined by transform matrices, and determining the spectra of discrete functions, which are conveniently specified by function vectors, reduces to matrix-vector operations. In many cases, the basis functions, and due to that also the related transform matrices, exhibit specific properties enabling us to derive fast computation algorithms by exploiting instruction parallelism. The matrix specification and the parallelism in the related algorithms make computing spectral transforms a task well suited for implementation on GPUs. A direct mapping of existing fast algorithms, however, does not lead to implementations that take full advantage of all GPU features regarding computational power and memory bandwidth. Therefore, a suitable reformulation of the existing algorithms is necessary. Most of the related work is devoted to the implementation of the Fast Fourier Transform (FFT), an algorithm for the efficient computation of the Discrete Fourier Transform (DFT), due to its numerous applications. Since computing the DFT involves complex number arithmetic, which in programming implementations doubles the computing requirements (for the real and the imaginary part), the advantages of using GPUs are impressive. In this chapter, we point out that a considerable speedup in computing can be achieved even in the case of integer-valued transforms or Boolean-valued transforms, such as

the transforms that are used in digital logic for processing logic (Boolean) functions. The chapter discusses the implementation of the Walsh, the Reed-Muller, and the arithmetic transforms, as examples of Kronecker product representable transforms, and the Haar transform, as an example of the layered-Kronecker transforms. In the reformulation of the algorithms to make them well suited for computation on GPUs, special attention is paid to the organization of the computations, the impact of the integer and Boolean arithmetic, the different structure of the fast algorithms, the reduction of memory transfers, and related issues. Furthermore, the chapter presents a discussion of the implementation of two kinds of FFT-like algorithms, the Cooley-Tukey algorithms and the so-called algorithms with constant geometry. The performance of the GPU implementations is compared with classical C/C++ implementations on the Central Processing Unit (CPU). Experiments show that, even though the spectral transforms considered involve arithmetic over integers (in the case of the Walsh, the arithmetic, and the Haar transforms) and over Boolean values (in the case of the Reed-Muller transform), significant speedups can be achieved by implementing the algorithms in OpenCL and performing them on the GPU. As an illustration of the possible applications, the procedures developed for the Walsh transform are used to compute the correlation and autocorrelation of discrete functions of Boolean variables viewed as functions on finite dyadic groups.

The work of Radomir S. Stanković was supported by the Academy of Finland, Finnish Center of Excellence Programme, Grant No.

2.1 Introduction

Digital logic, including switching theory and logic design, is an area where computing with large vectors is often required. In particular, this is the case when spectral transforms are used for the analysis of logic functions and the design of the corresponding networks [12], [18], [22]. Performing spectral transforms is a computationally intensive task in spite of the existence of fast algorithms [7], [12], [18], [19], [22]. In order to fully exploit the possibilities offered by GPUs, the existing algorithms cannot be ported in an ad hoc manner to the new architecture, but have to be modified and adapted to the targeted graphics hardware.

We investigate the implementation of certain spectral transforms, often used in switching theory and logic design, on GPUs, and perform an analysis of the efficiency of such implementations. The motivation for the research is found in the following considerations. The Fast Fourier Transform (FFT) [7] is well suited for processing on the GPU because it involves complex number arithmetic and transcendental operations [11], [24]. The discrete Walsh transform belongs to the same class of transforms, since it can be viewed as the Fourier transform on a particular group, the finite dyadic group on which Boolean functions are defined [12], [18]. Computing the discrete Walsh transform, however, reduces to additions and subtractions over integers. Therefore, it is interesting to explore what the possible advantages of computing the Walsh transform on the GPU are. Experimental investigations reported in this chapter show that, in spite of the simplicity of the arithmetic operations involved in the Walsh transform, a considerable speedup can be achieved by a simple adaptation of the algorithms to the GPU architecture.

We also consider the Reed-Muller transform and the arithmetic transform for the following reasons. These transforms are based on basis vectors of the same form, which take logic (Boolean) values for the Reed-Muller transform and integer values for the arithmetic transform.

Thus, in these two transforms, the computations are performed over the set of integers and in the Galois field GF(2). The idea behind the selection of these transforms is to compare possible differences in the performance of their implementations, since GPUs do not natively support Boolean vectors and, at the hardware level, interpret them as integers [2], [21].

The Walsh, the Reed-Muller, and the arithmetic transforms share the same asymptotic time complexity of $O(N \log_2 N)$, where $N = 2^n$ is the size of the vector to be processed and $n$ is the number of variables in the function represented by the vector. For these transforms, the transform matrices are Kronecker product representable. We also discuss the implementation of the Haar transform [9], [12], [18], [22], which is not a Kronecker product representable transform, but still has a layered-Kronecker structure [9], and whose computation has the time complexity of $O(N)$. Although this transform offers less parallelism and is computationally less demanding than the Walsh transform, and considerably less than the DFT, the experimental results confirm that even in this case considerable speedups can be achieved.

For the studied transforms, we compared the GPU implementations based on two different classes of algorithms, the Cooley-Tukey algorithms, which allow in-place implementations, and the so-called fast algorithms with constant geometry, which require out-of-place implementations [12]. The difference in the algorithms requires different hardware resources and also a different optimization of the programming code, which was the rationale for discussing both classes of algorithms.

2.2 Related Work in Computing Spectral Transforms

The implementation of various FFT algorithms on different technological platforms is a widely studied subject, see for instance [12], [19], [22], [24], and the references therein. Recently, the technique of general purpose computations on the GPU (GPGPU) has proven to be a suitable approach for this and similar computationally intensive tasks [5], [8], [13], [16], [23]. In particular, the GPU-accelerated calculation of FFT algorithms using CUDA is discussed in [11], [25]. NVIDIA provides an FFT library called CUFFT [14]. The CUDA SDK [14] presents examples of the Walsh and the Haar transforms on single-precision real numbers. In [8], a CUDA implementation of Low-Density Parity-Check (LDPC) codes that uses the Walsh transform and the inverse Walsh transform to accelerate the decoding process is presented. The method in [8] reduces the LDPC decoding time from 9 days on the Central Processing Unit (CPU) to less than 6 minutes on an array of high-performance GPUs [8]. The application of the CUDA 2D fast Walsh transform in image processing is described in [23].

OpenCL is a more recent GPGPU development than CUDA. The AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK) [1] has examples of the Walsh and the Haar transforms on single-precision real numbers, but these implementations are rather limited in the size of the vectors to be processed and could not be used for comparison in the discussions below. For example, the Haar transform from [1] offers GPU processing only for vectors with $N \leq 512$.

2.3 Spectral Transforms

Spectral transforms are an efficient tool in solving many tasks in digital logic [12], [18], [22]. Fast computation is an essential requirement that makes their practical application feasible. The spectra can, in principle, be computed directly by matrix-vector multiplication, assuming that both the function to be processed and its spectrum with respect to a given transform are represented as vectors. However, for a vector of size $N$, the transform matrix is of dimension $(N \times N)$ and $N^2$ arithmetic operations are required. In the cases considered in this chapter, $N = 2^n$, where $n$ is the number of variables in a function $f(x_1, \ldots, x_n) : \{0,1\}^n \rightarrow P$, where $P$ is the field of rational numbers for the Walsh, the arithmetic, and the Haar transforms, and the Galois field GF(2) for the Reed-Muller transform. Fast algorithms, derived by the factorization of the transform matrices in the same way as in the case of the FFT, have a complexity of $O(N \log_2 N)$.

In this chapter, we discuss the computation over the GPU of two different kinds of spectral transforms:

1. Kronecker transforms [12], [19], which include the Walsh, the arithmetic, and the Reed-Muller transforms.
2. Layered-Kronecker transforms [9], represented by the Haar transform.

Kronecker transforms are defined by transform matrices that are Kronecker product representable, and the difference between the various transforms is in the basic transform matrices. This observation permits the following unified definition of the Kronecker transforms that will be discussed below.

Definition 1 For functions of $n$ binary valued variables, Kronecker transforms are defined by $(2^n \times 2^n)$ transform matrices of the form

$$T(n) = \bigotimes_{i=1}^{n} T_i(1), \qquad T_i(1) = \begin{bmatrix} a & b \\ c & d \end{bmatrix}, \qquad (2.1)$$

where $\otimes$ is the Kronecker product, and the entries $a$, $b$, $c$, and $d$ of the basic transform matrix are specified for each transform separately, under the condition that the matrix $T_i(1)$ is non-singular.

For the transforms discussed in this chapter, the parameters $a$, $b$, $c$, and $d$ are specified in Table 2.1. For the Reed-Muller transform, the values 0 and 1 are the logic values, i.e., the elements of the Galois field GF(2). When computing the spectrum, it is assumed that the related operations are performed in the algebraic structure in which the basis functions take their values. Thus, the Walsh and the arithmetic transforms are computed over the field of rational numbers, while for the Reed-Muller transform the computations are in GF(2).
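To make the matrix formulation above concrete, the following C sketch builds a Kronecker product representable matrix $T(n)$ from a $(2 \times 2)$ basic matrix and applies it to a vector by direct matrix-vector multiplication, i.e., the $O(N^2)$ computation that the fast algorithms discussed later avoid. The function names and the integer data type are illustrative choices and not part of the implementations described in this chapter.

```c
#include <stdlib.h>

/* Kronecker product of an (ra x ca) matrix a and an (rb x cb) matrix b. */
void kronecker(const int *a, int ra, int ca,
               const int *b, int rb, int cb, int *out)
{
    for (int i = 0; i < ra; i++)
        for (int j = 0; j < ca; j++)
            for (int k = 0; k < rb; k++)
                for (int l = 0; l < cb; l++)
                    out[(i * rb + k) * (ca * cb) + (j * cb + l)] =
                        a[i * ca + j] * b[k * cb + l];
}

/* Builds the (2^n x 2^n) matrix T(n) from the (2 x 2) basic matrix t1. */
int *build_transform(const int t1[4], int n)
{
    int size = 2;
    int *t = malloc(4 * sizeof(int));
    for (int i = 0; i < 4; i++) t[i] = t1[i];
    for (int step = 1; step < n; step++) {
        int *next = malloc((size_t)(2 * size) * (2 * size) * sizeof(int));
        kronecker(t, size, size, t1, 2, 2, next);
        free(t);
        t = next;
        size *= 2;
    }
    return t;
}

/* y = T(n) * x, with N = 2^n -- requires N^2 multiply-add operations. */
void apply(const int *t, const int *x, int *y, int N)
{
    for (int i = 0; i < N; i++) {
        y[i] = 0;
        for (int j = 0; j < N; j++)
            y[i] += t[i * N + j] * x[j];
    }
}
```

For example, calling build_transform with the basic Walsh matrix {1, 1, 1, -1} and n = 3 produces an (8 x 8) matrix, and apply then multiplies it with a vector of length 8 using 64 multiply-add operations, which illustrates the quadratic cost of the direct computation.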

Table 2.1: Entries of the basic transform matrices for the Kronecker transforms.

Transform     a    b    c    d
Walsh         1    1    1   -1
Arithmetic    1    0   -1    1
Reed-Muller   1    0    1    1

Definition 2 For a function of $n$ variables $f(x_1, \ldots, x_n)$ represented by the function vector $F = [f(0), \ldots, f(2^n - 1)]^T$, the Kronecker spectrum $S_f(w_1, \ldots, w_n)$, written as the vector $S_f = [S_f(0), \ldots, S_f(2^n - 1)]^T$, is computed as

$$S_f = (T(n))^{-1} F, \qquad (2.2)$$

where $(T(n))^{-1}$ is the matrix inverse of $T(n)$. From the Kronecker spectrum, the function $f$ is reconstructed as

$$F = T(n) S_f. \qquad (2.3)$$

Definition 3 The non-normalized Haar transform is defined by the transform matrix

$$H(n) = \begin{bmatrix} H(n-1) \otimes \begin{bmatrix} 1 & 1 \end{bmatrix} \\ I(n-1) \otimes \begin{bmatrix} 1 & -1 \end{bmatrix} \end{bmatrix}, \qquad H(0) = \begin{bmatrix} 1 \end{bmatrix}, \qquad (2.4)$$

where $n$ is the number of variables in the function to be processed and $I(n-1)$ is the $(2^{n-1} \times 2^{n-1})$ identity matrix.

Definition 4 For a function of $n$ variables $f(x_1, \ldots, x_n)$ represented by the function vector $F = [f(0), \ldots, f(2^n - 1)]^T$, the Haar spectrum $S_f(w_1, \ldots, w_n)$, written as the vector $S_f = [S_f(0), \ldots, S_f(2^n - 1)]^T$, is computed as

$$S_f = 2^{-n} H(n) F. \qquad (2.5)$$

From the Haar spectrum, the function $f$ is reconstructed as

$$F = H^T(n) S_f, \qquad (2.6)$$

where $H^T(n)$ is the transpose of $H(n)$.

The Haar transform defined in this way is used in switching theory and logic design, since it suits switching functions well. In some other areas, such as signal processing, the normalized Haar transform is equally used [12]. The computing method for the non-normalized Haar transform presented in this chapter extends directly to the normalized Haar transform, since the difference is only in the scaling factors assigned to the Haar functions.

Example 4 Table 2.2 shows the transform matrices for the Walsh, the arithmetic, and the Reed-Muller transforms for $n = 3$. The Haar transform matrix for $n = 3$ is defined as

$$H(3) = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 & -1 & -1 & -1 & -1 \\
1 & 1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & -1 & -1 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1
\end{bmatrix}. \qquad (2.7)$$

Table 2.2: Transform matrices for the Walsh, the arithmetic, and the Reed-Muller transform for $n = 3$.

Transform     Transformation matrix                      Basic matrix
Walsh         $W(3) = W(1) \otimes W(1) \otimes W(1)$    $W(1) = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$
Arithmetic    $A(3) = A(1) \otimes A(1) \otimes A(1)$    $A(1) = \begin{bmatrix} 1 & 0 \\ -1 & 1 \end{bmatrix}$
Reed-Muller   $R(3) = R(1) \otimes R(1) \otimes R(1)$    $R(1) = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$

2.4 Fast Algorithms

Different ways of factorizing the transform matrices $T(n)$ and $H(n)$ yield different fast computing algorithms for the corresponding transforms [12]. In a programming implementation, different factorizations require the usage of different memory resources and different handling of the input data. This feature makes a considerable difference when GPUs are used, since the number of memory accesses is essentially important and can considerably affect the total computation time. For this reason, we consider two different types of FFT-like algorithms, the Cooley-Tukey algorithms and the algorithms with constant geometry [12].

2.4.1 Cooley-Tukey FFT-like algorithms

The Kronecker transform matrix $T(n)$ in (2.1) can be factorized as

$$T(n) = \prod_{i=1}^{n} C_i(n), \qquad (2.8)$$

where

$$C_i(n) = \bigotimes_{j=1}^{n} C_j(1), \qquad C_j(1) = \begin{cases} T(1), & i = j, \\ I(1), & i \neq j, \end{cases}$$

and $I(1)$ is the $(2 \times 2)$ identity matrix. The Haar matrix is factorized in a similar way by using the feature that it is Kronecker product representable by layers. The factorization of the transform matrices is illustrated by the following example.

Example 5 Table 2.3 shows the factorizations of the transform matrices for the Walsh, the arithmetic, and the Reed-Muller transforms for $n = 3$. The Haar matrix for $n = 3$ is factorized as

$$H(3) = C_1(3)\, C_2(3)\, C_3(3), \qquad (2.9)$$

where $C_1(3)$, $C_2(3)$, and $C_3(3)$ are $(2^3 \times 2^3)$ sparse matrices whose rows describe the butterfly operations performed in the corresponding steps of the algorithm.

Figs. 2.1(a), 2.2(a), and 2.3(a) show the flow-graphs of the fast Cooley-Tukey algorithms for the Walsh transform, the arithmetic transform, and the Reed-Muller transform for $n = 3$, derived from the factorization in Table 2.3.

Table 2.3: Factorizations of the transform matrices for the Walsh, the arithmetic, and the Reed-Muller transform for $n = 3$.

Transform     Transformation matrix               Factorization
Walsh         $W(3) = C_1(3)\, C_2(3)\, C_3(3)$   $C_1(3) = W(1) \otimes I(1) \otimes I(1)$
                                                  $C_2(3) = I(1) \otimes W(1) \otimes I(1)$
                                                  $C_3(3) = I(1) \otimes I(1) \otimes W(1)$
Arithmetic    $A(3) = C_1(3)\, C_2(3)\, C_3(3)$   $C_1(3) = A(1) \otimes I(1) \otimes I(1)$
                                                  $C_2(3) = I(1) \otimes A(1) \otimes I(1)$
                                                  $C_3(3) = I(1) \otimes I(1) \otimes A(1)$
Reed-Muller   $R(3) = C_1(3)\, C_2(3)\, C_3(3)$   $C_1(3) = R(1) \otimes I(1) \otimes I(1)$
                                                  $C_2(3) = I(1) \otimes R(1) \otimes I(1)$
                                                  $C_3(3) = I(1) \otimes I(1) \otimes R(1)$

In these figures, the solid lines denote addition, while the dashed lines correspond to the subtraction of the elements linked by the lines. In other words, the weights assigned to these lines are 1 and $-1$, respectively. The flow-graph of the fast algorithm for the arithmetic transform is the same as for the Reed-Muller transform, with correspondingly modified weighting coefficients as specified in Table 2.1. Fig. 2.4(a) shows the flow-graph of the Cooley-Tukey algorithm for the Haar transform for $n = 3$.

Fig. 2.1: Flow-graph of the Cooley-Tukey algorithm (a) and the algorithm with constant geometry (b) for the Walsh transform for $n = 3$.

Fig. 2.2: Flow-graph of the Cooley-Tukey algorithm (a) and the algorithm with constant geometry (b) for the arithmetic transform for $n = 3$.

Fig. 2.3: The flow-graph of the Cooley-Tukey algorithm (a) and the algorithm with constant geometry (b) for the Reed-Muller transform for $n = 3$.

Fig. 2.4: The flow-graph of the Cooley-Tukey algorithm (a) and the algorithm with constant geometry (b) for the Haar transform for $n = 3$.

2.4.2 FFT-like algorithms with constant geometry

The basic idea in algorithms with constant geometry is to factorize the transform matrix into factors of the same form as much as possible. This factorization leads to

algorithms with minimal differences between the steps, from where the name comes; it means that different steps will be implemented in the same manner and with minimal hardware modifications. Depending on the computation platform, this feature of constant geometry might bring advantages [20].

The transform matrix $T(n)$ is factorized as

$$T(n) = (Q_i(n))^n, \qquad (2.10)$$

where the entries of the matrix $Q_i(n)$ are determined by the entries of the concrete transform matrix $T(n)$. The index $i$ is used to differentiate the Walsh, the arithmetic, and the Reed-Muller transform and can be $w$, $a$, and $rm$, respectively. For the general theory, we refer to [20] and [12]; here this factorization will be illustrated by the following example.

Example 6 The matrices $Q_i(3)$, $i \in \{w, a, rm\}$, leading to the algorithms with constant geometry for the Walsh, the arithmetic, and the Reed-Muller transforms for $n = 3$ are the following

$$W(3) = (Q_w(3))^3, \qquad Q_w(3) = \begin{bmatrix}
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & -1
\end{bmatrix}, \qquad (2.11)$$

$$A(3) = (Q_a(3))^3, \qquad Q_a(3) = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
-1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & 1
\end{bmatrix}, \qquad (2.12)$$

$$R(3) = (Q_{rm}(3))^3, \qquad Q_{rm}(3) = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1
\end{bmatrix}. \qquad (2.13)$$

Figs. 2.1(b), 2.2(b), and 2.3(b) show the flow-graphs of the fast algorithms with constant geometry for the Walsh transform, the arithmetic transform, and the Reed-Muller transform for $n = 3$. The fast algorithm with constant geometry for the Haar transform for $n = 3$ is obtained by the factorization

$$H(3) = C_1(3)\, C_2(3)\, C_3(3), \qquad (2.14)$$

where $C_1(3)$, $C_2(3)$, and $C_3(3)$ are $(2^3 \times 2^3)$ sparse matrices whose rows describe the butterfly operations of the corresponding steps, with the operands read from pairs of adjacent elements in the way characteristic of the constant geometry algorithms. Fig. 2.4(b) shows the flow-graph of the algorithm with constant geometry for the Haar transform for $n = 3$.

2.5 Mapping Algorithms to the GPU

Direct mapping of an algorithm to a targeted hardware technology is rarely possible, and some modifications are necessary in order to exploit the good features of the technology and avoid possible bottlenecks. In the case considered in this chapter,

this in particular concerns the organization of the computations, the reduction of memory transfers, the impact of integer and Boolean arithmetic, the structure of the fast algorithms, and other related issues.

As presented in Chapter 1, a program for computing on the GPU (a GPGPU program) consists of two parts, the device code and the host code, which are executed on the GPU and the CPU, respectively. In this section, we describe the corresponding implementations and the algorithms used to compute the studied spectral transforms.

2.5.1 Kernel design and optimization

CPUs are intended for general purpose computations and are optimized for low latency (the time after a memory module receives the instruction to access some memory address and before the data becomes available), meaning that a considerable number of accesses to the memory are acceptable. The GPU architectures are optimized for high throughput at the expense of extended latency. This means that a GPU has a large bandwidth and that large chunks of data are transferred at the same time, enabling high instruction parallelism (the same instruction is executed by many processors over different sets of data). These are the basic considerations that have to be taken into account when designing the device code and the host code to be performed over a GPU system, as discussed in Chapter 1.

Recall that in GPU stream processing, the same kernel is executed by many processors working in parallel over a stream of data. The device code defines the kernel and specifies the data over which it is executed. The host code is typically a sequential C/C++ program that defines the context for the execution of kernels. The context includes resources like devices, kernels, the program, and memory objects. The host creates a data structure called a command-queue to coordinate the execution of the kernels on the devices. The host then places commands into the command-queue, which are afterwards scheduled onto the devices that exist within the context.

When a kernel is submitted for execution by the host, an index space is defined. A single instance of the kernel, called a work-item (or, in CUDA, a thread), is executed for each point in the index space. A number representing the global identifier of a work-item is assigned to it based on the corresponding point in the index space. This identifier is accessible from the kernel, and it is used to distinguish the data to be processed by each work-item. Every time a kernel is launched, many work-items (their number is specified by the programmer) are created. Each work-item executes the same code, but the specific path through the code and the data operated upon can vary between work-items. Work-items are grouped into work-groups in order to provide communication and cooperation between work-items.

For the implementation of the fast computing algorithms for spectral transforms, we used the OpenCL C programming language, which is a subset of the ISO C99 language with certain restrictions (e.g., recursion is not allowed) and special extensions for parallel programming. It is in many aspects similar to CUDA and the transition from one language to the other is relatively easy.
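The host-side preparation described above can be sketched in C with the OpenCL API as follows. This is only a minimal illustration of creating the context, the command-queue, and a kernel object; error handling is omitted and the function name is an illustrative choice, not the authors' actual host code.

```c
#include <CL/cl.h>

/* Minimal sketch of the host-side setup: one platform, one GPU device,
   a context, an in-order command-queue, and a kernel compiled from source. */
cl_kernel setup_kernel(const char *source, const char *kernel_name,
                       cl_context *ctx, cl_command_queue *queue)
{
    cl_platform_id platform;
    cl_device_id   device;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    *ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    *queue = clCreateCommandQueue(*ctx, device, 0, NULL);

    /* the device code (kernel source) is compiled at run time */
    cl_program program =
        clCreateProgramWithSource(*ctx, 1, &source, NULL, NULL);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    return clCreateKernel(program, kernel_name, NULL);
}
```

Since the command-queue created in this way is in-order, consecutive kernel launches that operate on the same buffer are implicitly serialized, which matches the sequential execution of the algorithm steps described below.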

Fig. 2.5 shows the computing model used for mapping the fast algorithms for the studied Kronecker spectral transforms to the GPU system. The same model is used for the computation of the Haar transform by using the Cooley-Tukey algorithm.

Fig. 2.5: The GPU computing model for the Cooley-Tukey algorithms for Kronecker transforms and the Haar transform. A work-item (1) computes op1 and op2 by (2.15), (2) reads the data from the addresses op1 and op2 in the GPU global memory buffer buff, (3) computes the operations specified by T(1), and (4) stores the results at the addresses op1 and op2 in buff.

The kernels access the global GPU device memory and avoid access to the shared GPU memory, in which way the philosophy of global memory algorithms is applied [11]. The computations are organized in such a manner that threads with consecutively numbered global identifiers access consecutive device memory locations, which can be viewed as an implementation of the memory coalescing technique, assuring simultaneous memory accesses by multiple threads in a single memory transaction [2], [14]. We paid attention to this kind of memory optimization since it can have a considerable impact on performance, although the impact is smaller in the case of AMD GPUs and OpenCL than for CUDA and GPUs by NVIDIA [2].

Techniques for the optimization of the kernel code include replacing the integer divide and modulo operations, which require many instruction cycles, with bitwise

operations whenever possible. Recall that the integer division is defined as $x \backslash y = \lfloor x / y \rfloor$, where the floor function means disregarding the fractional part (remainder) of the usual division of integers. The replacement with bitwise operations means that, if $i$ is an integer that is a power of 2 and $j$ is any integer, the integer division of $j$ by $i$ is equivalent to $j \gg \log_2 i$, and the modulo operation $(j \,\%\, i)$ can be replaced with $(j \,\&\, (i-1))$.

2.5.2 Cooley-Tukey algorithms for Kronecker transforms

The Cooley-Tukey algorithms for the studied Kronecker transforms have many common features, as can be seen from their flow-graphs, and therefore we discuss the design of the corresponding kernels in a unified way. Algorithm 1 outlines the basic structure of the Cooley-Tukey FFT-like algorithms for the studied Kronecker transforms. As in other implementations of FFT-like algorithms, the steps of these algorithms are executed sequentially, but the operations within a step are performed in parallel [12].

The function vector of the function to be processed is first transferred from the main memory of the host to the buffer allocated in the global memory of the GPU. After that, the kernels are executed in an in-place manner, as in other in-place implementations of the FFT, meaning that the resulting spectral coefficients are stored in the same buffer. Then, the content of the buffer is copied back to the host. These memory operations are very costly and take from 25% up to 75% of the total GPU running time. It can be concluded that GPU processing is useful for sufficiently large functions. For example, the experiments show that for functions with $n \geq 18$ variables, the time for creating the buffer and performing the transfers becomes acceptable.

The steps are performed by creating and executing in parallel $2^{n-1}$ threads, since the computations are performed over pairs of values of vectors with a length of $2^n$, each thread performing the basic operations specified by the basic transform matrix $T(1)$. The large number of active threads helps in hiding the data access latency. Each thread reads two elements from the GPU buffer with indices op1 and op2 whose values, for the studied Kronecker spectral transforms, are determined as

$$\mathrm{op1} \leftarrow \mathrm{thread(id)} \,\%\, d + 2d \cdot (\mathrm{thread(id)} \backslash d), \qquad (2.15)$$
$$\mathrm{op2} \leftarrow \mathrm{op1} + d,$$

where $\%$ is the modulo operation ($i \,\%\, j$ computes $i$ modulo $j$) and $\backslash$ denotes the integer division. The parameters thread(id) and $d$ are the global identifier of the thread in the index space and the distance between the elements over which the butterfly performs the computation in the current algorithm step, respectively. For the Cooley-Tukey algorithms, in the $k$-th step of the algorithm, $d = 2^{n-k}$. After the computation, the results are stored in the same locations in the GPU global memory.
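A device-code sketch of such a step kernel is given below in OpenCL C. It implements (2.15) with the divide and modulo operations replaced by shifts and masks, as described above; the kernel and parameter names are illustrative, and the butterfly shown is the one specified by W(1).

```c
/* OpenCL C sketch of a Cooley-Tukey step kernel for the Walsh transform.
   The distance d is a power of two, so the % and \ in (2.15) reduce to
   bitwise operations; logd = log2(d) is precomputed on the host.         */
__kernel void walsh_step(__global int *buff, const int d, const int logd)
{
    const int tid = get_global_id(0);

    /* op1 = tid % d + 2*d*(tid \ d), op2 = op1 + d  -- cf. (2.15) */
    const int op1 = (tid & (d - 1)) + ((tid >> logd) << (logd + 1));
    const int op2 = op1 + d;

    const int a = buff[op1];
    const int b = buff[op2];

    /* butterfly specified by W(1): addition and subtraction */
    buff[op1] = a + b;
    buff[op2] = a - b;
}
```

Kernels for the other studied Kronecker transforms have the same structure and differ only in the two assignments that realize the basic transform matrix.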

We use the same kernel; however, the computations are over integers for the Walsh and the arithmetic transform, and over Boolean values for the Reed-Muller transform. As expected, this difference in arithmetic operations does not affect the performance, since current GPUs at the hardware level interpret Boolean values as integers [2], [21] and, therefore, the processing time remains the same. Boolean buffers are not even officially supported by the OpenCL standard specification [21]; however, operating on them produces correct results. As a consequence, the speedups for the Reed-Muller transform on the GPU, compared to the CPU implementation, are not as large as for the Walsh and the arithmetic transform. This is also due to the fact that the C/C++ implementation of the Reed-Muller transform on the CPU operates efficiently on Boolean values represented as single bytes.

Algorithm 1 (The Cooley-Tukey algorithm for the Kronecker transforms)

1. Allocate a buffer buff in the global memory of the GPU device.
2. Transfer the input vector input from the main memory to buff.
3. For each step of the transform, from step = 1 to step = n:
   Call the OpenCL kernel for the appropriate transform with the input parameters being the GPU buffer buff and the value d = 2^(n-step). The kernel is executed by 2^(n-1) threads in parallel on the GPU. Each thread reads op1 and op2 from buff, performs the defined operations, and stores the results in the same locations.
4. Transfer the contents of buff to the output vector in the main memory.
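A hedged host-side sketch of Algorithm 1 in C is given below. It assumes a kernel with the signature of the walsh_step sketch above and an already created context and in-order command-queue, for example as returned by the setup sketch in Section 2.5.1; error handling is omitted and all names are illustrative.

```c
#include <CL/cl.h>

/* Host-side sketch of Algorithm 1: allocate, transfer, run n steps, read back. */
void fwt_gpu(cl_context ctx, cl_command_queue queue, cl_kernel kernel,
             cl_int *data, int n)
{
    const size_t N = (size_t)1 << n;

    /* 1. allocate a buffer in the GPU global memory                       */
    cl_mem buff = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                 N * sizeof(cl_int), NULL, NULL);

    /* 2. transfer the function vector to the device                       */
    clEnqueueWriteBuffer(queue, buff, CL_TRUE, 0, N * sizeof(cl_int),
                         data, 0, NULL, NULL);

    /* 3. n steps, each executed by 2^(n-1) work-items                     */
    size_t global = N / 2;
    for (int step = 1; step <= n; step++) {
        cl_int d    = 1 << (n - step);
        cl_int logd = n - step;
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buff);
        clSetKernelArg(kernel, 1, sizeof(cl_int), &d);
        clSetKernelArg(kernel, 2, sizeof(cl_int), &logd);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                               0, NULL, NULL);
    }

    /* 4. transfer the spectrum back to the host                           */
    clEnqueueReadBuffer(queue, buff, CL_TRUE, 0, N * sizeof(cl_int),
                        data, 0, NULL, NULL);
    clReleaseMemObject(buff);
}
```

Because the default command-queue is in-order, the steps execute one after another without explicit synchronization calls between the kernel launches.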

The following example illustrates the main difference between the implementation of the Cooley-Tukey algorithm for the Walsh transform on the CPU and on the GPU. Similar considerations apply to the same algorithm for the other transforms.

Example 7 Consider a function of $n = 3$ variables represented by the function vector whose elements are stored in a memory with 8 memory cells $m(0), m(1), \ldots, m(7)$. By referring to Fig. 2.1(a), computing the Walsh coefficients can be interpreted as the implementation of four butterfly operations in each step. These butterflies can be enumerated as $B(0)$, $B(1)$, $B(2)$, $B(3)$. Each butterfly performs the operation specified by $W(1)$, i.e., an addition $m(\mathrm{op1}) + m(\mathrm{op2})$ and a subtraction $m(\mathrm{op1}) - m(\mathrm{op2})$ over an ordered pair of input data (operands) $(m(\mathrm{op1}), m(\mathrm{op2}))$. The locations from which the input data $m(\mathrm{op1})$ and $m(\mathrm{op2})$ are read depend on the step of the algorithm. For example, in the first step ($step = 1$), $B(0)$ reads $m(\mathrm{op1})$ and $m(\mathrm{op2})$ from the memory locations $m(0)$ and $m(4)$, computes $m(0) + m(4)$ and $m(0) - m(4)$, and saves the results at the locations $m(0)$ and $m(4)$. In the second step ($step = 2$), $B(0)$ reads the input data from the memory locations $m(0)$ and $m(2)$ and saves the results in the same locations. In the third step ($step = 3$), $B(0)$ reads and saves data in the locations $m(0)$ and $m(1)$. The other butterflies $B(k)$, $k = 1, 2, 3$, perform the computations in a similar manner by reading and writing data in the locations $m(r)$, where $r$ is determined by the values of $k$ and the step. Thus, for a given butterfly, the operands are read and the results stored in the same locations. For the same butterfly, however, the locations which it addresses are different in each step of the algorithm.

The same computations in the implementation of the Cooley-Tukey algorithm on a GPU can be described as follows. Each butterfly corresponds to a thread, i.e., each thread performs the computations specified by $W(1)$. In this example, the parameter $d$, i.e., the distance between the elements involved in the computations performed by a butterfly, is 4, 2, and 1 in the three steps, respectively. In the first step, $d = 2^{3-1} = 4$, and the threads thread(0), thread(1), thread(2), and thread(3), corresponding to the butterflies $B(0)$, $B(1)$, $B(2)$, and $B(3)$, compute over data stored at the memory locations op1 and op2 determined by (2.15) as follows (recall that $\backslash$ denotes the integer division).

Step 1 ($d = 4$)
thread(0): op1 = 0 % 4 + 2·4·(0 \ 4) = 0, op2 = op1 + d = 0 + 4 = 4; thus it reads the data from the memory locations m(0) and m(4) and stores the results in the same locations.
thread(1): op1 = 1 % 4 + 2·4·(1 \ 4) = 1, op2 = op1 + d = 1 + 4 = 5.
thread(2): op1 = 2 % 4 + 2·4·(2 \ 4) = 2, op2 = op1 + d = 2 + 4 = 6.
thread(3): op1 = 3 % 4 + 2·4·(3 \ 4) = 3, op2 = op1 + d = 3 + 4 = 7.

In the second step, $d = 2^{3-2} = 2$, and the threads read the input data from the locations with indices computed as

Step 2 ($d = 2$)
thread(0): op1 = 0 % 2 + 2·2·(0 \ 2) = 0, op2 = op1 + d = 0 + 2 = 2.
thread(1): op1 = 1 % 2 + 2·2·(1 \ 2) = 1, op2 = op1 + d = 1 + 2 = 3.
thread(2): op1 = 2 % 2 + 2·2·(2 \ 2) = 0 + 4 = 4, op2 = op1 + d = 4 + 2 = 6.
thread(3): op1 = 3 % 2 + 2·2·(3 \ 2) = 1 + 4 = 5, op2 = op1 + d = 5 + 2 = 7.

In the third step, $d = 2^{3-3} = 1$, and the threads read from the locations determined as

Step 3 ($d = 1$)
thread(0): op1 = 0 % 1 + 2·1·(0 \ 1) = 0, op2 = op1 + d = 0 + 1 = 1.
thread(1): op1 = 1 % 1 + 2·1·(1 \ 1) = 0 + 2 = 2, op2 = op1 + d = 2 + 1 = 3.
thread(2): op1 = 2 % 1 + 2·1·(2 \ 1) = 0 + 4 = 4, op2 = op1 + d = 4 + 1 = 5.

thread(3): op1 = 3 % 1 + 2·1·(3 \ 1) = 0 + 6 = 6, op2 = op1 + d = 6 + 1 = 7.

We can see that, in the first step, thread(0) and thread(1) compute over the values in the locations $(m(0), m(4))$ and $(m(1), m(5))$, respectively. The threads thread(2) and thread(3) in the same step compute over the values in $(m(2), m(6))$ and $(m(3), m(7))$, respectively. Therefore, threads with consecutive global identifiers perform the computations over successive locations. In the second step, the threads thread(0), thread(1), thread(2), and thread(3) read and write into the locations $(m(0), m(2))$, $(m(1), m(3))$, $(m(4), m(6))$, and $(m(5), m(7))$, respectively. In the third step, the threads read and write into the locations $(m(0), m(1))$, $(m(2), m(3))$, $(m(4), m(5))$, and $(m(6), m(7))$. These locations are the same as specified by the flow-graph of the fast algorithm, which confirms that the mapping is done correctly and shows that these algorithms automatically ensure memory coalescing. The memory requests from successive threads can be grouped together into a single transaction, in which way the memory coalescing technique is applied.

This example explains the efficiency of the computation on the GPU. For a single memory transaction, four threads working in parallel perform a step of the algorithm for $n = 3$. In general, a wavefront consisting of 64 work-items (threads) will simultaneously perform the computation over a subvector of 64 elements of the function vector of the signal to be processed. The number of wavefronts per work-group is determined by the user, and all the work-items in a work-group can be executed on the same streaming multiprocessor, since work-items within a work-group can be synchronized.

2.5.3 Algorithms with constant geometry for Kronecker transforms

To observe the difference between the Cooley-Tukey algorithms and the algorithms with constant geometry, we first analyze the following example.

Example 8 As in Example 7, consider a function of $n = 3$ variables represented by the function vector whose elements are stored in a memory with 8 memory cells $m(0), m(1), \ldots, m(7)$. By referring to Fig. 2.1(b), in the first step ($step = 1$), $B(0)$ reads from the memory locations $m(0)$ and $m(1)$, computes $m(0) + m(1)$ and $m(0) - m(1)$, but saves the results at the locations $m(0)$ and $m(4)$. Thus, unlike in the Cooley-Tukey algorithm, the results are saved in locations different from those where the operands are read; however, a given butterfly addresses the same locations in each step.

The same computations in the implementation of the algorithm with constant geometry on a GPU can be described as follows. For each thread, the operands are determined independently of the step of the algorithm. In this example, the first thread, thread(0), reads from the locations $m(0)$ and $m(1)$ and writes the results into

the locations $m(0)$ and $m(4)$. These memory locations are uniquely determined by the identifier of the thread, thread(id), as

op1 = 2·thread(0) = 2·0 = 0,  op2 = 2·thread(0) + 1 = 1,

for the locations from which the data is read, and

dst1 = thread(0) = 0,  dst2 = thread(0) + 2^(n-1) = 0 + 4 = 4,

for the locations in which the results are stored. For the general case, see formulas (2.16) and (2.17) below.

Fig. 2.6 shows the GPU computing model for the Kronecker transforms when the algorithms with constant geometry are used.

Fig. 2.6: The GPU computing model for the algorithms with constant geometry for Kronecker transforms. A work-item (1) computes op1 and op2 by (2.16), (2) reads the data from the addresses op1 and op2 in one of the buffers buff1 or buff2, depending on the parity of the step, (3) computes the operations specified by T(1), and (4) stores the results at the addresses dst1 and dst2, computed by (2.17), in the other buffer.

Since for a thread the locations to read from and to write to are different, the algorithm has to be implemented out-of-place. Therefore, we use two separate buffers. Note that, although the memory access pattern from (2.16) is quite different from that of the Cooley-Tukey algorithm, threads with successive global identifiers still read and write into neighboring memory locations and, therefore, memory coalescing is applied.

Algorithms with constant geometry for the studied Kronecker transforms read from one pair of vector elements and write the results into another pair and, therefore, cannot be implemented in-place. It follows that two separate buffers are required, one for the input vector and one for the results of a step of the algorithm. Therefore, the corresponding kernels have three arguments, the input buffer, the output buffer, and the current step. The values of the indices op1 and op2 of the elements fetched from the input buffer depend only on the value of the global identifier of the thread

$$\mathrm{op1} \leftarrow 2 \cdot \mathrm{thread(id)}, \qquad (2.16)$$
$$\mathrm{op2} \leftarrow \mathrm{op1} + 1.$$

The indices of the locations dst1 and dst2 in the output buffer, where the results of the butterfly operation performed by the kernel are stored, are calculated by the same formulas as in the case of the in-place algorithm for the Kronecker transforms. Thus, these locations are determined as

$$\mathrm{dst1} \leftarrow \mathrm{thread(id)}, \qquad (2.17)$$
$$\mathrm{dst2} \leftarrow \mathrm{dst1} + 2^{n-1}.$$

Algorithm 2 presents the basic steps in computing the spectra of the Kronecker transforms over a GPU by using the algorithms with constant geometry.

Algorithm 2 (An algorithm with constant geometry for the Kronecker transforms)

1. Allocate two buffers buff1 and buff2 in the global memory of the GPU device.
2. Transfer the input vector input from the host main memory to buff1 and buff2.
3. For each step of the transform, from step = 1 to step = n:
   If step mod 2 = 0, then call the OpenCL kernel for the corresponding transform with the input parameters in the order buff1, buff2. Execute the kernel by 2^(n-1) threads in parallel on the GPU. Each thread reads op1 and op2, determined by (2.16), from buff1, performs the operations defined by the kernel, and stores the results in the locations determined by (2.17) in buff2.
   Else, if step mod 2 ≠ 0, call the OpenCL kernel for the corresponding transform with the order of the first two input arguments swapped: buff2, buff1.
4. If n is even, transfer the contents of buff1 to the vector output in the main memory; else, transfer the contents of buff2 to the vector output in the main memory.
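The kernel body corresponding to (2.16) and (2.17) is small. The following OpenCL C sketch shows a possible form for the Walsh transform; the names are illustrative and do not reproduce the authors' code. Since every step uses the same access pattern, the host only swaps the roles of the two buffers between steps, as described in Algorithm 2.

```c
/* OpenCL C sketch of a constant-geometry step kernel (out-of-place).
   Every step uses the same access pattern, so no step-dependent index
   arithmetic is needed; the host swaps src and dst between steps.       */
__kernel void walsh_step_cg(__global const int *src, __global int *dst,
                            const int half_n)   /* half_n = 2^(n-1) */
{
    const int tid = get_global_id(0);

    const int op1 = 2 * tid;            /* read indices, cf. (2.16)  */
    const int op2 = op1 + 1;

    const int a = src[op1];
    const int b = src[op2];

    dst[tid]          = a + b;          /* write indices, cf. (2.17) */
    dst[tid + half_n] = a - b;
}
```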

Since there are two buffers, the buffer transfer operations occupy bandwidth and increase the total computing time. A straightforward implementation of the related kernels would therefore have poor performance. However, if we add a simple check of the pass order number to the host code, we can execute the kernel with the arguments for the input and the output vectors swapped for the odd and even passes through the loop, i.e., for the odd and even steps of the algorithm. After the spectrum is computed, we check whether the last step is odd or even numbered and, depending on that, select the buffer whose contents will be transferred back to the host. This approach does not require extra bandwidth compared to the in-place algorithm for the Walsh transform. The difference between the in-place and the out-of-place algorithms is thus reduced to two condition checks in the host code and one extra buffer in the device memory, with no additional bandwidth requirements.

2.5.4 Algorithms for the Haar transform

The Cooley-Tukey algorithm for the Haar transform can be viewed as a reduced version of the Cooley-Tukey algorithm for the Walsh transform in the sense that some butterflies are missing, due to the properties of the Haar functions, i.e., since the Haar functions take the value 0 besides the values 1 and $-1$. The butterflies corresponding to 0 in the Haar functions can be avoided, which makes the Haar transform computationally very efficient [9], [12], [18], [22]. In other words, the Haar transform is a local transform in the sense that some Haar coefficients are computed on a subset of values in the function vector of the processed function, which means they are determined in the earlier steps of the algorithm and just transferred to the output in the further steps.

It follows that the kernel used in computing the Walsh spectrum can also be used to compute the Haar spectrum. The difference is that the number of active threads is reduced by a half in each step, starting from $2^{n-1}$ active threads in the first step down to a single thread in the $n$-th step. As noticed above, we use the same computing model as for the Cooley-Tukey algorithms for the Kronecker transforms (Fig. 2.5).

The situation is similar for the algorithm with constant geometry for the Haar transform. The algorithm is implemented out-of-place, as in the case of the same algorithm for the other transforms. The formulas for reading the operands and writing the results in the memory are the same as for the Walsh transform, i.e., (2.16) and (2.17). We also used two buffers and argument swapping, as in the case of the Walsh transform. Fig. 2.7 shows the GPU computing model for the Haar transform.

Since some of the Haar coefficients are computed in the earlier steps of the algorithm and then remain stored in the corresponding buffer, after the $n$-th step of the algorithm the buffers will contain different subsets of the Haar coefficients, which should be transferred back to the host. A simple algorithm can be devised for reading the spectrum from these two buffers, but there is also an alternative solution. We added a third GPU buffer to store the results of each step of the algorithm.

After the $n$-th step, the complete spectrum is stored in this buffer, and its content is copied to the host.

Fig. 2.7: The GPU computing model for the Cooley-Tukey algorithm and the algorithm with constant geometry for the Haar transform. A work-item (1) computes op1 and op2 by (2.16), (2) reads the data from the addresses op1 and op2 in one of the buffers buff1 or buff2, depending on the parity of the step, (3) computes the operations specified by H(1), and (4) stores the results at the addresses dst1 and dst2, computed by (2.17), in the other buffer and in buff3.

In the Haar transform, the possibilities for parallelization are reduced in each step of the algorithm. The experimental results, however, show that after the first step, where the parallelism is fully exploited, the remaining steps are executed very fast on the GPU and the OpenCL C implementation still clearly outperforms its C/C++ CPU counterpart. Algorithms 3 and 4 outline the basic steps of the Cooley-Tukey algorithm and the algorithm with constant geometry for computing the Haar spectrum over a GPU.

Algorithm 3 (The Cooley-Tukey algorithm for the Haar transform)

1. Allocate a buffer buff in the global memory of the GPU device.
2. Transfer the input vector input from the main memory to buff.
3. For each step of the transform, from step = 1 to step = n:
   Call the OpenCL kernel for the Haar transform with the input parameters being the buffer buff in the global memory of the GPU and the value d = 2^(n-step).

   Execute the kernel in parallel on the GPU. The number of active threads is 2^(n-1) for the first step and is reduced by a half in each subsequent step. Each thread reads op1 and op2, determined by (2.15), from buff, performs the defined operations, and stores the results in the same locations.
4. Transfer the contents of buff to the output vector in the main memory.

Algorithm 4 (The algorithm with constant geometry for the Haar transform)

1. Allocate three buffers buff1, buff2, and buff3 in the global memory of the GPU device.
2. Transfer the input vector input from the host main memory to buff1 and buff2.
3. For each step of the transform, from step = 1 to step = n:
   If step mod 2 = 0, then call the OpenCL kernel for the Haar transform with the input parameters in the order buff1, buff2, buff3. Execute the kernel on the GPU. The number of active threads is 2^(n-1) for the first step and is reduced by a half in each step. Each thread reads op1 and op2, determined by (2.16), from buff1, performs the operations, and stores the results in the locations determined by (2.17) in buff2 and buff3.
   Else, if step mod 2 ≠ 0, call the OpenCL kernel for the Haar transform with the first two arguments swapped: buff2, buff1, buff3.
4. Transfer the contents of buff3 to the output vector in the main memory.
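To illustrate the halving of the number of active work-items, the following host-side C sketch shows only the step loop of Algorithm 3; it reuses the butterfly kernel signature sketched in Section 2.5.2 and assumes the buffer has already been created and filled. The names are illustrative and error handling is omitted.

```c
#include <CL/cl.h>

/* Sketch of the step loop of Algorithm 3 (Cooley-Tukey Haar transform):
   the same butterfly kernel as for the Walsh transform is reused, but the
   number of active work-items is halved in every step.                    */
static void haar_steps(cl_command_queue queue, cl_kernel kernel,
                       cl_mem buff, int n)
{
    for (int step = 1; step <= n; step++) {
        cl_int d      = 1 << (n - step);
        cl_int logd   = n - step;
        size_t global = (size_t)1 << (n - step);   /* 2^(n-step) work-items */

        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buff);
        clSetKernelArg(kernel, 1, sizeof(cl_int), &d);
        clSetKernelArg(kernel, 2, sizeof(cl_int), &logd);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                               0, NULL, NULL);
    }
}
```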

2.6 Experiments

This section presents experimental results obtained over a GPU in the OpenCL programming environment and compared with the referent C/C++ implementations on a CPU.

2.6.1 Experimental Platform

The experiments were performed on an HP Pavilion dv7-4060us notebook computer whose basic parameters are shown in Table 2.4. The OpenCL kernels were developed by using MS Visual Studio 2010 Ultimate and the ATI Accelerated Parallel Processing SDK 2.3 [1]. The graphics card driver is ATI Mobility Catalyst. ATI Stream Profiler 2.1 was used for the performance analysis of the OpenCL kernels, as suggested in [2]. The source code was compiled for the x64 platform and the C/C++ implementations were optimized during the compilation for the maximum level of performance. Since in GPU computing the possibilities for parallelization depend on the number of available processors, it is important to emphasize here that the GPU used belongs to the lower-middle performance class.

During the experiments, the GPU was used both for the computations and for rendering the screen contents. Therefore, the speedups would be much larger if a specialized and higher-performance GPU were used. For instance, the GPU used has 80 stream cores and a global memory bandwidth of 25 GB/s, while the high-end ATI Radeon 5970 GPU has 640 stream cores and a memory bandwidth of 256 GB/s. A multithreaded GPU program is partitioned into groups of threads that execute independently from each other, so that a GPU with more cores automatically executes the program faster than a GPU with fewer cores. This is not the case with the referent C/C++ implementations, which execute as single-threaded programs on the CPU, where switching to a more powerful and expensive CPU brings small performance benefits, as noticed in [16].

Table 2.4: Experimental platform.

CPU                  AMD Phenom II N830 triple-core (2.1 GHz)
RAM                  4 GB DDR3
OS                   Windows 7 (64-bit)
GPU                  ATI Mobility Radeon 5650
Engine speed         650 MHz
Global memory        1 GB DDR3, 800 MHz
Compute units        5
Processing elements

2.6.2 Experimental Environment

For the experiments, a test environment was constructed in C/C++. Since in FFT-like computations over function vectors the running times do not depend on the particular function values, we performed the experiments on randomly generated binary vectors, in the same way as in [11] and [25]. The times reported for the experiments are the average values over 10 program executions. We present the results for the computation times on the GPU ($t_c$) and the times needed for the transfer of data to/from the GPU ($t_m$), as well as the total implementation time ($t_{total} = t_c + t_m$), which provides a more complete perspective on the efficiency of the realizations. We did not use any architecture-dependent GPU code optimizations in order to preserve the portability of the code.

2.6.3 Referent Implementations

To investigate and verify the performance of the proposed OpenCL GPU implementations, we performed a comparative analysis with respect to the C/C++ CPU implementations.

The sequential C/C++ implementations of the Kronecker transforms for CPU processing require careful handling of the memory access patterns. For example, in the classical radix-2 FFT, swapping the inner loops which control the order of computations within the steps of the algorithm results in a reduction of the number of trigonometric operations, which in certain situations improves the overall performance [4]. In the case of the spectral transforms discussed in this chapter, there are no transcendental computations of the kind the FFT involves. It follows that in this case the order of the loops is irrelevant for the operation count, but the straightforward ordering produces highly non-local memory access patterns. This poor spatial organization leads to an inefficient use of the cache memory (cache thrashing) [4], [5], [25]. This effect is invisible for small vectors that fit in the cache. However, swapping the inner loops that define the order of the butterfly operations within a step, followed by a slight modification of the rest of the code, results in speedups of 30 times or more, as can be seen from the experimental results in Tables 2.5, 2.6, and 2.7. We decided to address this issue here and include these two different CPU implementations in the experiments, since this possibility for reducing the implementation time by improving the spatial locality of the input data is often overlooked in the implementation of the Kronecker spectral transforms by Cooley-Tukey algorithms. For instance, this is the case in the AMD Accelerated Parallel Processing SDK [1], which contains a C/C++ implementation of the Cooley-Tukey algorithm for the Walsh transform as a referent example.
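The following C sketch illustrates the idea of the two loop orderings for the in-place Cooley-Tukey Walsh transform; it is not the code from [1], only a minimal sketch of the effect under discussion. Both variants compute the same result and differ only in the order in which the butterflies of a step are visited, and hence in their memory access locality.

```c
/* Ordering 1: for a fixed offset j, walk across all blocks.  For large N
   the whole array is re-traversed d times per step (poor locality).      */
void fwt_nonlocal(int *f, int n)
{
    const int N = 1 << n;
    for (int d = N / 2; d >= 1; d >>= 1)
        for (int j = 0; j < d; j++)
            for (int i = j; i < N; i += 2 * d) {
                int a = f[i], b = f[i + d];
                f[i] = a + b;
                f[i + d] = a - b;
            }
}

/* Ordering 2 (inner loops swapped): process one block of 2*d consecutive
   elements completely before moving to the next one (cache friendly).    */
void fwt_local(int *f, int n)
{
    const int N = 1 << n;
    for (int d = N / 2; d >= 1; d >>= 1)
        for (int i = 0; i < N; i += 2 * d)
            for (int j = i; j < i + d; j++) {
                int a = f[j], b = f[j + d];
                f[j] = a + b;
                f[j + d] = a - b;
            }
}
```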

2.6.4 Experimental Results

The Walsh transform

The first set of experiments compares the performance of the implementations of the fast Walsh transform on the CPU and on the GPU. Table 2.5 compares:

1. The CPU implementations of the Cooley-Tukey algorithm for the Walsh transform without and with swapping of the inner loops.
2. The CPU implementation of the algorithm with constant geometry.
3. The OpenCL GPU implementation of the Cooley-Tukey algorithm.
4. The OpenCL GPU implementation of the algorithm with constant geometry.

The source code for the Cooley-Tukey C/C++ implementation for CPU processing without the loop swapping was taken from [1]. Since the constant geometry algorithm is implemented out-of-place and uses two instead of the three for-loops required in the Cooley-Tukey algorithm, the technique of inner-loop swapping does not apply to it. The same results are presented graphically in Fig. 2.8, where for the GPU implementations the computation time is presented.

These experiments show that the memory access patterns resulting from the different orderings of the loops in the C/C++ implementations have a considerable impact on the performance. The CPU implementation with the inner loops swapped is on average 20.8 times faster than the implementation without swapping from [1]. Regarding the computation time ($t_c$), the GPU implementation of the Cooley-Tukey algorithm is 4.6 times and as much as 99.6 times faster than the CPU implementations with and without swapping of the inner loops, respectively. If the total time is taken into account, the GPU implementations are 3.7 and 78.2 times faster, respectively. Regarding the total time, the algorithms with constant geometry over a GPU are on average 4% slower than the Cooley-Tukey algorithms, due to the increased memory requirements. If only the computation time is considered, the algorithms with constant geometry are on average 4% faster than the Cooley-Tukey algorithm for the Walsh transform. It follows that the excess in the total implementation time is due to the communication with the memory. It can generally be concluded that the implementation over a GPU can bring a noticeable improvement in performance, even in the case of algorithms with simple integer arithmetic.

Table 2.5: Implementation times for the Walsh transform (columns: the number of variables n; t_total for the CPU Cooley-Tukey implementation without and with loop swapping; t_total for the CPU constant geometry implementation; t_c, t_m, and t_total for the GPU Cooley-Tukey and constant geometry implementations, in ms; the last row gives the average values).

Arithmetic and Reed-Muller transforms

Tables 2.6 and 2.7 show the times for computing the arithmetic transform and the Reed-Muller transform, respectively, by using the Cooley-Tukey algorithms and the algorithms with constant geometry. The arithmetic transform on the CPU is 26 times faster when the swapping of the inner loops is performed. For the Reed-Muller transform the same factor is 35. Comparing the computation time and the total time on the GPU and the CPU, the speedups for the arithmetic transform are up to 121 times and 83 times against the slower CPU code (without swapping), and 4.7 times and 3.2 times against the faster CPU code (with swapping). For the Reed-Muller transform, the speedups are up to 99.6 times and 68.5 times against the slower CPU code, and 2.8 times and 2 times against the faster CPU code.

Fig. 2.8: Implementation times for the Walsh transform (CPU Cooley-Tukey without and with loop swapping, CPU constant geometry, GPU Cooley-Tukey, and GPU constant geometry; computation time in ms versus the number of input variables).

The computation and the total times for the constant geometry algorithm on the GPU are, in the case of the arithmetic transform, 6.6 times and 4.4 times faster than on the CPU, respectively, and 3.2 times and 2.1 times faster, respectively, for the Reed-Muller transform. Figs. 2.9 and 2.10 show the same information graphically, by comparing the total implementation time for the GPU against the two CPU implementations (with and without swapping) for the arithmetic transform and the Reed-Muller transform, respectively.

Table 2.6: Implementation times for the arithmetic transform (columns: the number of variables n; t_total for the CPU Cooley-Tukey implementation without and with loop swapping; t_total for the CPU constant geometry implementation; t_c, t_m, and t_total for the GPU Cooley-Tukey and constant geometry implementations, in ms; the last row gives the average values).

Fig. 2.9: Implementation times for the arithmetic transform (CPU Cooley-Tukey without and with loop swapping, CPU constant geometry, GPU Cooley-Tukey, and GPU constant geometry; computation time in ms versus the number of input variables).

Table 2.7: Implementation times for the Reed-Muller transform (columns: the number of variables n; t_total for the CPU Cooley-Tukey implementation without and with loop swapping; t_total for the CPU constant geometry implementation; t_c, t_m, and t_total for the GPU Cooley-Tukey and constant geometry implementations, in ms; the last row gives the average values).

Fig. 2.10: Implementation times for the Reed-Muller transform (CPU Cooley-Tukey without and with loop swapping, CPU constant geometry, GPU Cooley-Tukey, and GPU constant geometry; computation time in ms versus the number of variables).

The Haar transform

The Haar transform is computationally very efficient due to the properties of the Haar functions. Therefore, the CPU implementations, although sequential, are faster than those for the Kronecker transforms. Furthermore, these properties of the Haar functions reduce the possibilities for the parallelization of the computations over a GPU, since the number of active threads is halved in each step of the computation compared to the number of active threads in the previous step. For these reasons, the advantages of using a GPU are less impressive than for the Kronecker transforms. Some speedup is still achieved, and it can be important in certain practical applications.

Table 2.8 and Fig. 2.11 show the results of the experiments for the Haar transform. We compared the CPU implementation of the Cooley-Tukey algorithm for the Haar transform with the GPU implementations of this algorithm and of the algorithm with constant geometry. Due to the properties of the Haar transform, the CPU implementation is much faster than the CPU implementations of the studied Kronecker transforms. The GPU implementation of the Cooley-Tukey algorithm is still 3.2 and 2.3 times faster on average when the computation time and the total time are compared, respectively. The computation with the algorithm with constant geometry is up to 5.9 times faster in the calculation time and 1.5 times faster in the total time. The times for the memory transfers to/from the GPU dominate over the GPU calculation times for the fast Haar transform, especially for the algorithm with constant geometry.

Table 2.8: Implementation times for the Haar transform (columns: the number of variables n; t_total for the CPU Cooley-Tukey and constant geometry implementations; t_c, t_m, and t_total for the GPU Cooley-Tukey and constant geometry implementations, in ms; the last row gives the average values).

Fig. 2.11: Implementation times for the Haar transform (CPU Cooley-Tukey, CPU constant geometry, GPU Cooley-Tukey, and GPU constant geometry; computation time in ms versus the number of variables).

2.7 Computation of the Dyadic Correlation and Autocorrelation over the GPU

In this section, we briefly discuss the application of the computing methods and the related GPU kernels for the Walsh transform to the computation of the correlation and the autocorrelation of discrete functions of binary valued variables.

From the abstract harmonic analysis point of view, the discrete finite Walsh transform for sequences of length $2^n$ is the Fourier transform on the finite dyadic group $C_2^n$, where $C_2 = (\{0,1\}, \oplus)$ is the cyclic group of order 2, i.e., it consists of the set with elements 0 and 1 equipped with the addition modulo 2 (EXOR). Therefore, functions of binary valued (Boolean) variables can be viewed as functions on $C_2^n$ and, due to that, may be processed by the Walsh and related spectral transforms such as those discussed in the previous sections. The concepts of convolution and correlation coincide on this group, since there is no difference in the results produced by the addition and the subtraction modulo 2. Therefore, these two terms, convolution and correlation, can be used interchangeably without causing misunderstanding.

For two functions $f(x)$, $g(x)$, $x = (x_1, \ldots, x_n)$, $x_i \in \{0,1\}$, the correlation (convolution) $f * g$ is defined as the function of $n$ variables

$$C_{f,g}(\tau) = \sum_{x=0}^{2^n - 1} f(x)\, g(x \oplus \tau), \qquad \tau = (\tau_1, \ldots, \tau_n), \ \tau_i \in \{0,1\}. \qquad (2.18)$$

The autocorrelation is the correlation of the function with itself; thus, it is defined as

$$B_f(\tau) = \sum_{x=0}^{2^n - 1} f(x)\, f(x \oplus \tau), \qquad \tau = (\tau_1, \ldots, \tau_n). \qquad (2.19)$$

The convolution theorem states that the convolution in the original domain is transferred into multiplication in the spectral domain; thus, the Fourier (Walsh) coefficients of $f * g$ are the product of the Fourier (Walsh) coefficients of $f$ and $g$,

$$S_{f*g}(\tau) = S_f(\tau) \cdot S_g(\tau). \qquad (2.20)$$

If the spectra for $f$ and $g$ are written in matrix notation as vectors of spectral coefficients for $\tau \in \{0, \ldots, 2^n - 1\}$, the multiplication is understood componentwise. In the spectral domain, the autocorrelation is computed as

$$S_{f*f}(\tau) = S_f(\tau) \cdot S_f(\tau) = (S_f(\tau))^2. \qquad (2.21)$$

This equality is usually called the Wiener-Khinchin autocorrelation theorem on finite dyadic groups, since it is a direct analogue of the Wiener-Khinchin theorem in classical Fourier analysis [12].

Recall that, since the Walsh transform matrix $W(n)$ is orthogonal and symmetric, the Walsh transform is a self-inverse transform up to the scaling factor $2^{-n}$ [12]. Thus, the inverse Walsh transform is equal to the direct Walsh transform followed by a multiplication by $2^{-n}$. Then, the convolution theorem and the Wiener-Khinchin theorem permit the formulation of the following computation procedures for the correlation and the autocorrelation of functions on $C_2^n$

$$C_{f,g} = 2^{-n}\, W(n)\big((W(n)f) \cdot (W(n)g)\big), \qquad (2.22)$$

and

$$B_f = 2^{-n}\, W(n)\big((W(n)f)\big)^2, \qquad (2.23)$$

where $W(n)$ is the Walsh transform operator [12] and the multiplication and squaring are understood componentwise. From (2.22) and (2.23), the computation of the correlation and the autocorrelation is performed by the following procedures.

Computing the correlation of f and g

1. Given the functions f and g, compute their Walsh spectra S_f and S_g.
2. Multiply S_f and S_g componentwise.
3. Compute the Walsh transform of S_f · S_g.
4. Multiply the resulting Walsh coefficients by 2^(-n).

The same procedure can be used to compute the autocorrelation, in which case the multiplication of S_f and S_g is replaced by taking the square of S_f.
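A CPU reference for this procedure can be sketched in a few lines of C, assuming an in-place fast Walsh transform fwt() such as the one sketched in the discussion of the referent implementations; the function name and the integer representation are illustrative assumptions.

```c
void fwt(int *v, int n);   /* assumed: in-place fast Walsh transform */

/* Sketch of the correlation procedure based on (2.22).  The input vectors
   sf and sg of length N = 2^n are overwritten; the result is left in sf. */
void dyadic_correlation(int *sf, int *sg, int n)
{
    const int N = 1 << n;

    fwt(sf, n);                        /* 1. Walsh spectra of f and g       */
    fwt(sg, n);

    for (int i = 0; i < N; i++)        /* 2. componentwise multiplication   */
        sf[i] *= sg[i];

    fwt(sf, n);                        /* 3. Walsh transform of the product */

    for (int i = 0; i < N; i++)        /* 4. scaling by 2^(-n); the entries */
        sf[i] /= (1 << n);             /*    are exact multiples of 2^n     */
}
```

Passing the same vector for sf and sg turns the componentwise multiplication into squaring and yields the autocorrelation.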

It follows that the procedures and related kernels developed for the GPU computing of the Walsh transform can directly be used to compute the correlation and the autocorrelation of functions on C_2^n. It is sufficient to develop kernels that perform the componentwise multiplication of spectra for the correlation and the squaring of the spectral coefficients for the autocorrelation, and a kernel for the multiplication by 2^{−n}. Therefore, the computation of the correlation can be performed by Algorithm 5. The autocorrelation can be computed by the same algorithm after Step 4 is modified to calculate (S_f)^2.

Algorithm 5 (Algorithm for computing the correlation)
1. Allocate buffers buff1 and buff2 in the GPU device global memory.
2. Transfer the function vectors F and G for the functions f and g from the host CPU main memory to buff1 and buff2, respectively.
3. Compute the Walsh transform on the vectors in buff1 and buff2, using the OpenCL implementation of the Cooley-Tukey algorithm for the Walsh transform.
4. Execute the OpenCL kernel for the componentwise multiplication of the Walsh spectra with 2^n threads executed in parallel and store the result in buff1.
5. Compute the Walsh transform of the vector in buff1 and store it in buff1.
6. Multiply the vector in buff1 by 2^{−n} using the corresponding OpenCL kernel with 2^n threads working in parallel.
7. Transfer the contents of buff1 to the host CPU main memory.

Besides the steps described above, which invoke the corresponding OpenCL kernels, the host first creates the context for the execution of the OpenCL kernels. The context includes resources such as devices, kernels, programs, and memory objects. The host also creates a data structure called a command-queue to coordinate the execution of the kernels on the various computational devices. Commands are placed into the command-queue and afterwards scheduled onto the devices that exist within the context. The host allocates the memory buffers in the GPU global memory and transfers the input vectors from the main memory to the GPU. When the computation is completed, the host transfers the resulting dyadic correlation coefficients back to the main memory of the system and releases all of the resources that were occupied by the OpenCL program.

To estimate the efficiency of the proposed algorithm, we performed the experiments in the same way as in Section 2.6 and over the same hardware. We computed the correlation and the autocorrelation over sets of 10 randomly generated binary-valued functions for different numbers of binary-valued variables n. We used the Cooley-Tukey algorithm to compute the Walsh spectra. Tables 2.9 and 2.10 as well as Figs. 2.12 and 2.13 show the implementation times for the correlation and the autocorrelation as functions of the number of variables. For the correlation, the speedup achieved by using the GPU is 5.5 when the computation times are compared, and up to 4.6 when the total times including memory transfers are taken into account. For the autocorrelation, the speedup is 5.6 and 4.9, respectively.
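The two auxiliary kernels required by Steps 4 and 6 of Algorithm 5 are purely elementwise and therefore trivial to parallelize. The following sketch expresses them in CUDA for brevity, whereas the implementation described in this chapter uses OpenCL kernels of the same structure; the kernel names are ours.

```cpp
// One thread handles one of the n_elems = 2^n spectral coefficients.
__global__ void componentwiseMul(double* buff1, const double* buff2, int n_elems) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_elems) buff1[i] *= buff2[i];   // Step 4: S_f * S_g, or (S_f)^2 if buff2 == buff1
}

__global__ void scaleCoefficients(double* buff1, int n_elems, double inv2n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_elems) buff1[i] *= inv2n;      // Step 6: multiplication by 2^-n
}
```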

Table 2.9: Implementation times for computing the correlation of randomly generated functions (columns: n; CPU t_total; GPU t_c, t_m, t_total).

Table 2.10: Implementation times for computing the autocorrelation of randomly generated functions (columns: n; CPU t_total; GPU t_c, t_m, t_total).

Fig. 2.12: Implementation times for computing the correlation of randomly generated binary-valued functions for n = 18, 20, 22, 23, 24, 25.

Fig. 2.13: Implementation times for computing the autocorrelation of randomly generated binary-valued functions for n = 18, 20, 22, 23, 24, 25.

References

1. AMD Accelerated Parallel Processing SDK, amd.com/gpu/amdappsdk, AMD Inc.
2. AMD Accelerated Parallel Processing OpenCL Programming Guide, AMD Inc.
3. Aamodt, T. M., Architecting graphics processors for non-graphics compute acceleration, Proc. IEEE Pacific Rim Conf. on Communications, Computers and Signal Processing, Victoria, BC, Canada, August 23-26.
4. Arndt, J., Matters Computational: Ideas, Algorithms, Source Code, Springer.
5. Bryant, R. E., O'Hallaron, D. R., Computer Systems - A Programmer's Perspective, Addison Wesley, 2010.
6. Buluc, A., Gilbert, J., Budak, C., Solving path problems on the GPU, Parallel Computing, Vol. 36, No. 5-6, 2010.
7. Cooley, J. W., Tukey, J. W., An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, No. 90, 1965.
8. Copeland, A. D., Chang, N. B., Lung, S., GPU accelerated decoding of high performance error correcting codes, Proc. 14th Annual Workshop on HPEC, Lexington, Massachusetts, USA, September 15-16.

9. Falkowski, B. J., Relationships between arithmetic and Haar wavelet transforms in the form of layered Kronecker matrices, Electronics Letters, Vol. 35, No. 10, 1999.
10. Gajić, D. B., Stanković, R. S., Computing fast spectral transforms on graphics processing units using OpenCL, Proc. Reed-Muller Workshop, May 25-26, 2011, Tuusula, Finland.
11. Govindaraju, N. K., Lloyd, B., Dotsenko, Y., Smith, B., Manferdelli, J., High performance discrete Fourier transforms on graphics processors, Proc. 2008 ACM/IEEE Conf. on Supercomputing, Austin, Texas, USA, November 15-21, 2008.
12. Karpovsky, M. G., Stanković, R. S., Astola, J. T., Spectral Logic and Its Applications for the Design of Digital Devices, Wiley-Interscience.
13. Lukac, M., Perkowski, M. A., Kerntopf, P., Kameyama, M., GPU acceleration methods and techniques for quantum logic synthesis, Proc. 9th Int. Workshop on Boolean Problems, Freiberg, Germany, September 16-17, 2010.
14. NVidia CUDA - Compute Unified Device Architecture, developer.nvidia.com/object/gpucomputing.html, NVIDIA Corp.
15. NVidia CUDA CUFFT Library, NVIDIA Corp.
16. Paul, E., Steinbach, B., Perkowski, M., Application of CUDA in the Boolean domain for the unate covering problem, Proc. 9th Int. Workshop on Boolean Problems, Freiberg, Germany, September 16-17, 2010.
17. Ryoo, S., Rodrigues, C. I., Baghsorkhi, S. S., Stone, S. S., Kirk, D. B., Hwu, W. M. W., Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, Proc. 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, NY, USA, February 20-23, 2008.
18. Stanković, R. S., Astola, J. T., Spectral Interpretation of Decision Diagrams, Springer.
19. Stanković, R. S., Moraga, C., Astola, J. T., Fourier Analysis on Finite Non-Abelian Groups with Applications in Signal Processing and System Design, Wiley/IEEE Press.
20. Stojić, M. R., Stanković, M. S., Stanković, R. S., Discrete Transforms in Applications, Nauka, Beograd, 1985 (second updated edition).
21. The OpenCL Specification 1.2, Khronos Working Group.
22. Thornton, M. A., Drechsler, R., Miller, D. M., Spectral Techniques in VLSI CAD, Springer.
23. Tong, Y., Zhang, J., Research of fast 2-D Walsh transformation based on GPU, Microelectronics & Computer, CNKI.
24. Van Loan, C., Computational Frameworks for the Fast Fourier Transform, Society for Industrial and Applied Mathematics.
25. Volkov, V., Kazian, B., Fitting FFT onto G80 architecture, UC Berkeley CS258 Project Report, 2008.


Chapter 3
Sources and Obstacles for Parallelization - a Comprehensive Exploration of the Unate Covering Problem Using Both CPU and GPU

Bernd Steinbach, Christian Posthoff

Abstract The high computational power available on Graphics Processing Units (GPUs) has led many programmers to execute time-consuming parts of their software on the huge number of processor cores of the GPU. This large number of processor cores executes the same instruction on different data in parallel and provides an important source of parallelization. The restriction to a single instruction and the required transfer of the data between the Central Processing Unit (CPU) and the GPU are obstacles which must be taken into account. This chapter contains a comprehensive exploration of both the theoretical background and several approaches to the unate covering problem of Boolean functions. This problem is NP-complete [14]; therefore, parallel approaches can make a significant contribution to smaller calculation times. It will be shown that the well-explored method of matrix multiplication on the GPU allows only a very small improvement for the unate covering problem in comparison to the CPU. However, the comprehensive utilization of many properties of the problem finally allows the required calculation time to be reduced by several orders of magnitude.

Bernd Steinbach
Freiberg University of Mining and Technology, Institute of Computer Science, Freiberg, Germany, steinb@informatik.tu-freiberg.de

Christian Posthoff
The University of The West Indies, St. Augustine Campus, Trinidad & Tobago, Christian.Posthoff@sta.uwi.edu


3.1 Introduction

Minimal disjunctive forms of Boolean functions [10] are needed in circuit design. There are two sources of minimizing such an expression. First, the number of variables of each conjunction can be reduced to the necessary number of variables of a prime conjunction. Second, the number of prime conjunctions can be reduced to a certain minimal value. This chapter deals with this second task. The practical benefits of a minimal disjunctive form are not only the reduced circuit space, but also the reduced power consumption and a simpler calculation of test patterns. The explored approaches focus on this covering problem but can be adapted to similar SAT problems which must be solved using typical modern PCs.

Due to the practical importance and the exponential complexity [14], the benefits of parallel computation on a GPU were studied in [5] for a fast solution of unate covering problems. The classical task of the GPU is matrix multiplication, which realizes the needed affine transformation of graphical information. Unfortunately, for the unate covering problem the matrix multiplication approach reached, in the best case, only a 1.9 times shorter calculation time using the GPU in comparison to a single CPU core.

For a direct comparison between the Central Processing Unit (CPU) and the Graphics Processing Unit (GPU), all of the experiments in this chapter were executed using a common hardware base. One, four, or all six cores of the Intel Xeon processor (X5650, six cores, 2.67 GHz) are used, depending on the studied approach. The Tesla C2070 computing processor is used as a GPU. This GPU contains 448 CUDA cores.

There are other sources for parallel calculations besides the large number of processor cores of the GPU. Using a single CPU core, the bits of a machine word can be calculated in parallel. XBOOLE [6] extends this source of improvement to the computation of ternary vectors and reaches a significantly stronger improvement on the same (but additionally larger) benchmarks.

An alternative parallel approach [13] utilizes both XBOOLE and the Message Passing Interface (MPI) and reduces the required solution time further. This approach uses only four or six cores of a single CPU. Utilizing the power of XBOOLE [6] and symmetric functions in a completely different approach on a single CPU core enlarged this improvement again [9].

It could be a conclusion from the comparison between the small success achieved in utilizing the GPU with the well-fitting approach of matrix multiplication and the strong improvement by several orders of magnitude based on the CPU that the unate covering problem is not suitable for parallelization on the GPU. It will be shown that a comprehensive utilization of both the properties of the GPU and the properties of the unate covering problem finally allows a reduction of the calculation time by several orders of magnitude when a GPU is used.

3.2 The Problem to Solve: Unate Covering

A given set of all prime conjunctions {pc_i | i = 1, ..., n} covers the associated Boolean function

    f(x) = ⋁_{i=1}^{n} pc_i.   (3.1)

All of the minimal covers of f(x) can be calculated with regard to this given set of all prime conjunctions. Each subset of k < n prime conjunctions which satisfies equation (3.1) can be used as a source of an approximate cover.

It follows from (3.1) that for each input pattern x = c_0 with f(x = c_0) = 1 there exists at least one prime conjunction pc_i such that [pc_i(x)]_{x=c_0} = 1. In most cases this condition will be satisfied by several prime conjunctions. Otherwise, the single prime conjunction that satisfies the condition is essential and must be used in each cover.

Boolean model variables p_i are introduced in order to describe the covering problem in a formal way. Each model variable p_i is associated with the prime conjunction pc_i. A new Boolean function P(p) can be created as follows:
1. Each x = c_0 with f(x = c_0) = 1 can be covered by several prime conjunctions. Hence, a disjunction of the associated model variables p_i can be created for each such c_0.
2. All of the function values f(x = c_0) = 1 must be covered. Hence, all such disjunctions must be connected by conjunction to get the Boolean function P(p).

The created function P(p) is called a Petrick function. Such a function has a conjunctive form (CF) [10].

Example 3.1. A simple Petrick function P(p) of 8 variables and 8 disjunctions is shown in (3.2).

    P(p) = (p_4 ∨ p_6 ∨ p_7) ∧ (p_4 ∨ p_5 ∨ p_6 ∨ p_8) ∧ (p_1 ∨ p_3 ∨ p_4 ∨ p_7 ∨ p_8) ∧ (p_1 ∨ p_4 ∨ p_5 ∨ p_7 ∨ p_8)
           ∧ (p_1 ∨ p_2 ∨ p_5 ∨ p_6) ∧ (p_4 ∨ p_5 ∨ p_6 ∨ p_7 ∨ p_8) ∧ (p_1 ∨ p_4 ∨ p_5 ∨ p_6 ∨ p_7 ∨ p_8) ∧ (p_2 ∨ p_3 ∨ p_4 ∨ p_7 ∨ p_8)   (3.2)

A complete cover is defined by the equation (3.3). Due to the CF on the left-hand side of (3.3), this equation specifies a satisfiability problem (SAT) [1]. The disjunctions of a SAT formula are called clauses. The solutions of a SAT problem are such values of the variables p_i which satisfy the equation (3.3). The non-negated variables p_i of each solution of (3.3) describe which prime conjunctions pc_i must be used in the associated cover of the basic function f(x):

    P(p) = 1.   (3.3)

It is a special feature of the Petrick function that it is unate. This property can be defined using a simple derivative of the Boolean differential calculus, see [11] and Section 4 in [6].

Definition 3.1. Unate Functions
A Boolean function f(x_1, x_2, ..., x_i, ..., x_n) is called positively unate in x_i iff

    x̄_i ∧ f(x_1, x_2, ..., x_i, ..., x_n) ∧ ∂f(x_1, x_2, ..., x_i, ..., x_n)/∂x_i = 0.   (3.4)

Similarly, the function f(x) is called negatively unate in x_i iff

    x_i ∧ f(x_1, x_2, ..., x_i, ..., x_n) ∧ ∂f(x_1, x_2, ..., x_i, ..., x_n)/∂x_i = 0   (3.5)

holds. If f(x) is positively (negatively) unate in all x_i, then this function is called positively (negatively) unate. If either (3.4) or (3.5) holds for each variable x_i, then the function f is called unate. If this condition fails for at least one variable, then f(x) is called binate.

All of the variables in a Petrick function appear in positive polarity. Hence, the Petrick function is positively unate. This property has a strong influence on the solution process. The assignment ∀i: p_i = 1 solves the Petrick equation (3.3), which means that all prime conjunctions pc_i must be used to cover the function f(x). However, it is the aim of the Unate Covering Problem (UCP) to find a cover using a number of prime conjunctions pc_i which is as small as possible. The set of all solutions of the Petrick equation (3.3) can be classified by the number of variables which appear in positive polarity. The solutions of the UCP are all solutions of the Petrick equation (3.3) which belong to the class with the smallest number of variables in positive polarity.
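For small instances, the UCP can be stated very directly: each clause of P(p) is a bitmask of the model variables it contains, an assignment p is a bitmask of the variables set to 1, and P(p) = 1 exactly when every clause shares at least one set bit with p. The following brute-force C++ reference (our own illustration, usable only for small n) returns the satisfying assignments with the fewest values 1.

```cpp
#include <bit>        // std::popcount (C++20)
#include <cstdint>
#include <vector>

// clauses[k] has bit j-1 set iff p_j appears in clause k; n is the number of variables.
std::vector<std::uint32_t> minimalCovers(const std::vector<std::uint32_t>& clauses, int n) {
    std::vector<std::uint32_t> best;
    int bestOnes = n + 1;
    for (std::uint32_t p = 0; p < (1u << n); ++p) {     // all 2^n assignments (small n only)
        bool covers = true;
        for (std::uint32_t c : clauses)
            if ((p & c) == 0) { covers = false; break; } // one clause is not covered
        if (!covers) continue;
        int ones = std::popcount(p);                     // literal cost of the assignment
        if (ones < bestOnes) { bestOnes = ones; best.clear(); }
        if (ones == bestOnes) best.push_back(p);
    }
    return best;
}

// Usage: encode the eight clauses of (3.2) as bitmasks and call minimalCovers(clauses, 8);
// the result contains exactly the minimal covers of the UCP.
```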

3.3 Initial GPU Approach - Matrix Multiplication

The basic task of the GPU is the transformation of given graphical information into the color values of all the pixels of a computer screen. This task can be solved by the multiplication of the given graphical information with an affine matrix. This matrix covers all the information about the needed scaling, shifting, mirroring, and rotations. The architecture of GPUs allows many of the required multiplications and additions to be executed in parallel. Due to this classical main task of GPUs, the utilization of the matrix multiplication for the unate covering problem was explored in [5]. This approach can easily be understood by means of a very simple Petrick function (3.6) of three variables p_i which appear in four clauses.

    P(p) = (p_1) ∧ (p_1 ∨ p_2) ∧ (p_2 ∨ p_3) ∧ (p_3)   (3.6)

Table 3.1 shows the truth table of the Petrick function (3.6) including the values of all four clauses and the literal costs of the inputs (number of input variables equal to 1). It can be seen that the Petrick function (3.6) has the function value 1 only for two input patterns. The literal costs for these patterns are 2 and 3. The solutions of the UCP are all input patterns with P(p) = 1 and the smallest literal cost, which is p_1 = 1 and p_3 = 1 (with p_2 = 0) for the Petrick function (3.6).

Table 3.1: The truth table of the Petrick function (3.6)

    p_1 p_2 p_3 | p_1 | p_1 ∨ p_2 | p_2 ∨ p_3 | p_3 | P(p) | Literal Cost
     0   0   0  |  0  |     0     |     0     |  0  |  0   |      0
     0   0   1  |  0  |     0     |     1     |  1  |  0   |      1
     0   1   0  |  0  |     1     |     1     |  0  |  0   |      1
     0   1   1  |  0  |     1     |     1     |  1  |  0   |      2
     1   0   0  |  1  |     1     |     0     |  0  |  0   |      1
     1   0   1  |  1  |     1     |     1     |  1  |  1   |      2
     1   1   0  |  1  |     1     |     1     |  0  |  0   |      2
     1   1   1  |  1  |     1     |     1     |  1  |  1   |      3

In order to solve the UCP for the Petrick function (3.6), the matrices of this function and of the set of all input vectors are needed. The Petrick function (3.6) can be represented by a matrix P of three rows, which indicate top down the variables p_1, p_2, and p_3, and four columns, associated from left to right with the four clauses c_1, c_2, c_3, and c_4 of (3.6). The columns of matrix A are associated from left to right with the variables p_1, p_2, and p_3. The product of the matrices A and P is the cover matrix C of 2^3 = 8 rows for the different p_i assignments and four columns which indicate the reached cover of the clauses c_1, c_2, c_3, and c_4. The matrix equation (3.7) shows the details of this calculation.

    A =  0 0 0      P =  1 1 0 0      C = A · P =  0 0 0 0
         0 0 1           0 1 1 0                   0 0 1 1
         0 1 0           0 0 1 1                   0 1 1 0
         0 1 1                                     0 1 2 1
         1 0 0                                     1 1 0 0     (3.7)
         1 0 1                                     1 1 1 1
         1 1 0                                     1 2 1 0
         1 1 1                                     1 2 2 1

The elements c_ij of the matrix C are calculated as usual by

    c_ij = Σ_{k=1}^{k_max} a_ik · p_kj.   (3.8)

This calculation fits well with the Single Instruction Multiple Thread (SIMT) architecture of the GPU. Several solution values of independent data can be calculated in parallel on the GPU.

Row i of the solution matrix C contains detailed information about how the assignment of the variables p_k in row i of the matrix A covers each clause. Taking as an example the fourth row of A with p_2 = 1 and p_3 = 1, the following coverings are indicated in the fourth row of the matrix C of (3.7): c_41 = 0 indicates that the first clause is not covered, c_42 = 1 indicates that the second clause is covered by one of the p_k = 1, c_43 = 2 indicates that the third clause is covered by two of the p_k = 1, and c_44 = 1 indicates that the fourth clause is covered by one of the p_k = 1. Due to c_41 = 0, the fourth row of A in (3.7) is no solution of the Petrick equation (3.3). There are only two rows of A without a zero value in the associated row of the matrix C of (3.7). These solutions of the Petrick equation (3.3) are the sixth row p = (101) and the last row p = (111). The sixth row contains two values 1, i.e., p_1 = 1 and p_3 = 1, which is the only solution of the unate covering problem of the Petrick equation (3.6), because it is a solution of the Petrick equation (3.3) and contains the smallest number of values 1 of the variables p_i.

This example reveals that all the elements of the matrix C must be evaluated to decide on the solutions of the UCP. This is a time-consuming task which can be simplified when the rows of the matrix A are ordered with regard to the horizontal checksum. Utilizing this order, the top-down evaluation of the matrix C can be stopped as soon as a row without any zero element c_ij and with a horizontal checksum L of the literals p_i has been detected and the first row with a horizontal checksum L + 1 has to be evaluated next.

All of the algorithms are implemented using Microsoft Visual Studio 2010 and NVIDIA Parallel Nsight. High-precision time measurement was made using QueryPerformanceCounter(timeStamp).
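Because (3.8) is an independent dot product per element, one GPU thread can be assigned to each c_ij. The following CUDA sketch illustrates this SIMT-friendly mapping; the matrix layouts and names are ours and are not taken from the implementation in [5].

```cpp
// One thread per element c_ij of the cover matrix C = A * P (row-major layouts).
__global__ void coverMatrix(const int* A, const int* P, int* C,
                            int rowsA, int nVars, int nClauses) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of A (one p assignment)
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column (one clause)
    if (i >= rowsA || j >= nClauses) return;
    int sum = 0;
    for (int k = 0; k < nVars; ++k)                  // c_ij = sum_k a_ik * p_kj, see (3.8)
        sum += A[i * nVars + k] * P[k * nClauses + j];
    C[i * nClauses + j] = sum;                       // a value 0 means clause j is not covered
}
```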

All of the programs are executed in the Win32 release mode to exclude debug effects and impacts of the 64-bit technology. Equivalent basic algorithms were implemented as sequential programs for the CPU and as a parallel CUDA program for the GPU. The reused algorithms of [5] are extended by a more detailed time measurement. Algorithm 1 shows the main steps to solve the UCP using only a single CPU core. The time intervals for reading the PI chart and writing the minimal solution into an output file were excluded from the time measurements.

Algorithm 1 MatMulUCP for CPU
1: read the PI chart from the input file (initialize P)
2: create an ordered solution space based on the number of literals in the PI chart (initialize A)
3: perform sequential matrix multiplication (C = A · P)
4: search output rows of C for satisfying rows
5: report minimal solutions to the output file

The multiplication of the matrices A and P is executed in parallel on the GPU using Algorithm 2. As a precondition, the matrices A and P must be moved from the main memory of the CPU to the device memory of the GPU in step 3 of Algorithm 2. Vice versa, the result matrix C is moved in step 5 of Algorithm 2 from the device memory of the GPU to the main memory of the CPU. The time to move these three matrices in the required direction is included in the time interval for matrix multiplication.

Algorithm 2 ParallelMatMulUCP for GPU
1: read the PI chart from the input file (initialize P)
2: create an ordered solution space based on the number of literals in the PI chart (initialize A)
3: transfer the input matrices to the GPU
4: perform CUDA matrix multiplication (C = A · P)
5: transfer the results from the GPU to the main system memory
6: search output rows of C for satisfying rows
7: report minimal solutions to the output file

A set of Petrick functions was generated in the project of [5] as a common basis for all comparisons. These functions depend on 8, 16, 24, or 32 variables p_i and 8, 16, 32, 64, 128, or 256 clauses. Additional Petrick functions of 32 variables p_i and 512 or even 1024 clauses were generated. These benchmark functions were used for all of the experiments in this chapter.

Table 3.2 shows the experimental results of the matrix multiplication approach in a comparison between the sequential CPU implementation and the parallel GPU implementation. The first two columns indicate the used benchmark of a Petrick function by the number nv of variables p_i and the number nc of clauses.

The next two columns show the found solution: the minimal number nv of variables p_i needed to cover the Petrick function P(p) and the number nm of different minimal covers.

Table 3.2: Results in milliseconds for the matrix multiplication approaches on CPU and GPU (columns: benchmark nv, nc; solution nv, nm; Algorithm 1 on CPU sort, multiply, total; Algorithm 2 on GPU sort, multiply, total).

The matrix of all the different patterns of the variables p_i is generated and sorted in both implementations on the CPU. Hence, there are only small time differences caused by the scheduling of the operating system. As expected, the matrix multiplication is faster on the GPU in comparison to the CPU. This can be seen in the columns multiply of Table 3.2, where the multiplication itself takes considerably less time on the GPU than on the CPU for the larger benchmarks. For the GPU approach, this time includes the time needed to transfer the data between CPU and GPU. Due to these additional time intervals, the CPU is in total at least 10 times faster than the GPU for the small benchmarks of 8 variables p_i. Unate covering problems of 16 variables p_i and at least 64 clauses can be solved in a shorter time on the GPU. This benefit grows for larger numbers of clauses and reaches for the benchmark (16 × 256) an improvement factor of 2.5 restricted to the time needed for multiplication and 1.9 for the total time.

The matrix multiplication approach has one more drawback. The solution matrix C for a Petrick function of nv variables p_i and nc clauses consists of nc · 2^nv elements, which are already 16,777,216 matrix elements for the benchmark (16 × 256). Due to these large memory requirements, the matrix multiplication approach is restricted to such small Petrick functions.

In [5] it was suggested that the matrices of integer values be replaced by matrices of Boolean values. Equation (3.8) can be replaced by equation (3.9), in which the Boolean results are calculated for Boolean matrices A and P:

    c_ij = ⋁_{k=1}^{k_max} (a_ik ∧ p_kj).   (3.9)

Experimental results on the computer mentioned above have shown that the approach of Boolean matrix multiplication needs more time than that with integer values, because of the more difficult access to single Boolean values.

3.4 Utilizing the Boolean Algebra

3.4.1 Basic Approach

Due to the drawbacks of the matrix multiplication approach for the UCP, the properties of this problem are studied in detail. The classical approach to solve the unate covering problem is the application of the distributive law [6]

    (a ∨ b) ∧ (c ∨ d) = a c ∨ a d ∨ b c ∨ b d   (3.10)

from left to right for all the clauses. The omission of an operation sign between two variables indicates that the neighboring variables are connected by an AND operation without writing the sign explicitly. The number of variables within each conjunction created by the distributive law is equal to the number of clauses. This number can be reduced by the idempotent law [6]

    a ∧ a = a   (3.11)

if a variable appears more than once in a conjunction. The application of the idempotent law can be realized, without any calculation effort, if a data structure is used in which the appearance of a variable in the conjunction is indicated by a unique single position.

The application of (3.10) and (3.11) during the solution process of the Petrick equation (3.3) for the function (3.2) results in 180,000 conjunctions. A time of more than 9,000 milliseconds has been measured using a single CPU core for the computation of these 180,000 conjunctions. Each of this huge number of conjunctions represents exactly one solution of the equation (3.3) for the used Petrick function (3.2).

The solutions of the UCP are only the conjunctions with a minimal number of variables. The absorption law [6] allows the removal of conjunctions from the solution set which are covered by conjunctions of fewer variables:

    a ∨ a b = a.   (3.12)

The absorption is very powerful. Its application to the disjunction of the 180,000 conjunctions leads to a disjunction of only 12 conjunctions. Figure 3.1 (a) shows these solutions represented as a list of ternary vectors (TVL) which was calculated using XBOOLE [6], [8]. Each row of this TVL describes a conjunction of variables p_i.

A value 1 in the column headed by p_i and in row j means that the variable p_i appears in the conjunction c_j. This simplification needs more than 86,000 milliseconds on the same PC, again using a single processor core.

These 12 ternary vectors are solutions of equation (3.3) for the used Petrick function (3.2) and hold the following properties:
1. complete: one ternary vector with d dash elements describes 2^d conjunctions, because each of the d dash elements can be replaced by either 0 or 1,
2. not disjoint: the same vector can be constructed by replacing certain dashes by 1-elements starting from more than one of the vectors given in Figure 3.1 (a); hence, the number of solutions cannot be calculated as the sum of the 2^{d_i}, and
3. irredundant: if at least a single 1-element is replaced by a 0-element, the changed ternary vector is no longer a solution of equation (3.3) for the used Petrick function (3.2).

The set of irredundant solutions can contain vectors with different numbers of 1-elements. The solution of the UCP is the subset of vectors of the set of irredundant solutions which contains the minimal number of 1-elements. The set of 12 solution vectors consists of 7 solutions with 2 variables, see Figure 3.1 (b), and 5 solutions with 3 variables. The wanted minimal solutions of the UCP can simply be found by counting the number of values 1 in the ternary solution vectors.

Fig. 3.1: The solutions of the simple unate covering problem (3.2) shown as TVLs over p_1, ..., p_8: (a) all 12 minimal irredundant solutions; (b) all 7 minimal solutions.

The implementation of a program based on the distributive, idempotent, and absorption laws needs a loop in which both the distributive and the idempotent law are applied for all clauses. After this loop, the application of the absorption law reduces the found solutions to the set of all irredundant solutions. In a final counting step, the minimal solutions of the UCP are selected.
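The following CPU sketch, with names of our own choosing, mirrors this structure: conjunctions are stored as bitmasks so that the idempotent law (3.11) holds implicitly, the distributive law (3.10) is applied clause by clause, and the absorption law (3.12) is applied once after the loop. Moving the absorption into the loop yields the improved variant discussed in the next subsection.

```cpp
#include <cstdint>
#include <vector>

using Conj = std::uint32_t;                        // bit j set <=> p_{j+1} is in the conjunction

// Absorption law (3.12): remove every conjunction that contains another one as a subset.
static std::vector<Conj> absorb(const std::vector<Conj>& conjs) {
    std::vector<Conj> kept;
    for (std::size_t i = 0; i < conjs.size(); ++i) {
        bool absorbed = false;
        for (std::size_t j = 0; j < conjs.size() && !absorbed; ++j)
            if (i != j && (conjs[i] & conjs[j]) == conjs[j] &&
                (conjs[j] != conjs[i] || j < i))    // proper subset, or an earlier duplicate
                absorbed = true;
        if (!absorbed) kept.push_back(conjs[i]);
    }
    return kept;
}

// Basic approach: expand P(p) clause by clause, then apply absorption once.
std::vector<Conj> allIrredundantSolutions(const std::vector<Conj>& clauses) {
    std::vector<Conj> sols{ 0 };                    // the empty conjunction is the neutral start
    for (Conj clause : clauses) {                   // distributive law (3.10) for one clause
        std::vector<Conj> next;
        for (Conj s : sols)
            for (int j = 0; j < 32; ++j)
                if (clause & (1u << j))
                    next.push_back(s | (1u << j));  // idempotent law (3.11) is implicit in the OR
        sols = std::move(next);
    }
    return absorb(sols);                            // keep only the irredundant conjunctions
}
```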

3.4.2 The Improved Basic Approach

A practical algorithm that solves the unate covering problem can be implemented such that the idempotent law (3.11) is applied implicitly. The application of the distributive law (3.10) to two disjunctions of nv_1 and nv_2 variables results in a disjunction of nv_1 · nv_2 conjunctions. The complexity of this part of the algorithm for a Petrick function of nv variables that consists of nc clauses is equal to O(nv^nc). The absorption (3.12) strongly reduces the number of conjunctions. Hence, it can shorten the runtime when the absorption is used after each application of the distributive law. This change in the algorithm reduced the runtime for the above example by a factor of 15,471. The same 7 solution vectors with 2 variables were found within a few milliseconds instead of more than 95,000 milliseconds.

Fig. 3.2: Changes in the number of rows when applying the distributive and the absorption law consecutively in the loop solving the UCP of 16 variables and 32 clauses (number of rows over time in seconds).

Figure 3.2 visualizes the solution process for the largest example solvable with the suggested algorithm within a time limit of 10 minutes. Figure 3.2 shows on the one hand that the execution of the distributive law for one clause quickly results in a much larger number of rows which represent interim solutions. On the other hand, this figure also shows that it is time-consuming to exclude such interim solutions which are absorbed by other interim solutions. The different local maximal values of this process depend on both the last reduced number of rows and the number of literals in the used clause.

Table 3.3 shows the experimental results of the suggested basic approach. The number ni of all the irredundant solutions of the Petrick equation is found by this approach as additional information.

Table 3.3: Benchmark results for the basic approach with the distributive law and the absorption law (columns: benchmark nv, nc; solution nv, nm, ni; total time in milliseconds).

Taking this additional information into account, the total time for Petrick functions of 8 variables is similar to the matrix multiplication approach. Therefore, this improved basic approach is used as the basis of comparison for all further improvements, and the already reached improvement factor of 15,471 is excluded from these comparisons.

It is an obstacle of this improved approach that it takes too much time to compensate for the complexity of O(nv^nc) of the distributive law by the application of the absorption law for larger functions, as can already be seen for Petrick functions of 16 variables. The practical requirement to solve such covering problems for larger numbers of variables on the one hand and the extreme growth of the runtime on the other hand force the search for alternative approaches which utilize the complete parallel computation power of modern PCs. In order to increase the number of variables and to reduce the runtime, respectively, parallel approaches for the UCP are considered in the following.

3.5 Parallelization in the Application Domain

Having a PC with n_core processor cores, it is an obvious approach to use the known techniques [3], [7] to adapt the previous algorithm to several cores that can contribute simultaneously to the solution of the problem. Due to the exponential complexity of the problem, only a slightly larger number of variables can be expected: n_additional = log_2 n_core. Furthermore, due to Amdahl's law, the reachable speedup will in most cases be less than the number of cores working in parallel. Taking into account both the small number of cores of the available PC and the already achieved speedup of more than 15,000, the application domain is studied first for alternative faster approaches.

As shown in Figure 3.1, a solution vector of a Petrick function of n variables consists of k 1-elements and d = n − k dashes. Such a vector represents 2^d solutions of the Petrick equation (3.3).

Hence, the search space of the exponential size 2^n for n Boolean variables can be evaluated more easily when 2^d Boolean vectors are represented by a single ternary vector and, generally, ternary vectors are used as the respective data structure. XBOOLE [6], [8], [11], [12] is a library that utilizes this approach consistently for both the representation of Boolean functions and the respective computations.

The Petrick equation (3.3) is a characteristic Boolean equation because of the constant 1 function on the right-hand side. There is a strong difference in the effort to solve a characteristic Boolean equation depending on the form of the function on the left-hand side. While a conjunctive form [10] of the function on the left-hand side of a characteristic Boolean equation causes exponential complexity of the solution procedure, a characteristic equation with a disjunctive form [10] on the left-hand side can be solved in constant time. Unfortunately, the Petrick function is a special type of a conjunctive form. Hence, an alternative approach to solving the equation P(p) = 1 is the transformation of the Petrick function P(p) given in conjunctive form into an equivalent function AS(p) in disjunctive form that describes all the solutions of this equation as well. The following two properties can be utilized:
1. Two successive negations do not change a Boolean function: ¬(¬P(p)) = P(p).
2. The negation using de Morgan's law alternates between the conjunctive and the disjunctive form.

Using the XBOOLE operators NDM(f(x)) for the negation according to de Morgan's law and CPL(f(x)) for the calculation of the complement, we get the following algorithm:

    AS(p) = CPL(NDM(P(p))).   (3.13)

The XBOOLE operator NDM(f(x)) has a complexity of O(1). The main computational effort is required by the XBOOLE operator CPL. The benefits of this approach in comparison to the application of the distributive law are as follows:
1. The solution is represented by an orthogonal set of ternary vectors [6].
2. Each ternary vector that includes d dash elements represents 2^d solutions.
3. Due to the orthogonal representation, the sought exact minimal solutions can be selected by counting the 1-elements in the solution vectors.

Table 3.4 shows the experimental results for the approach described by formula (3.13). The executed benchmarks are the same Petrick functions of nv variables and nc clauses as for the basic approach. Due to the higher power of the CPL(NDM(f))-approach, significantly larger covering problems could be solved. Due to the orthogonal representation of the solution, it is possible to calculate the number of all the solutions of the solved equation. These values are given in the column nall of Table 3.4. The number of ternary vectors which represent all the solutions is given in the column ntv of Table 3.4. The benefit of the applied ternary representation becomes visible by comparing columns 3 and 4 of Table 3.4.

Table 3.4: Benchmark results for the algorithm CPL(NDM(P)) (columns: benchmark nv, nc; solution nall, ntv, nv, nm; time in milliseconds for NDM, CPL, and total).

The number of binary solution vectors is 824,453 times larger than the number of ternary vectors in the case of the benchmark (32 × 8). The needed minimal number of variables nv and the number of minimal covers nm of the solved UCP are given in columns 5 and 6 of Table 3.4.

The time intervals for the main operations NDM and CPL are listed in Table 3.4 in addition to the total time needed to solve the UCP benchmarks. The time for the evaluation of the known ternary vectors needed to find the value of nall is excluded from the total time because this evaluation is not necessary to solve the UCP. All NDM operations are executed very quickly. The required time for the CPL operation increases with the size of the benchmark. The significant part of the total time which is not occupied by the XBOOLE operations NDM and CPL is used to select the sought minimal solutions of the UCP. The restriction of the huge numbers of solutions of the Petrick equation (see column nall of Table 3.4) to the much smaller numbers of wanted solutions (see column nm of Table 3.4) can be a source of further improvements.

Based on the largest benchmark solved with the improved basic approach (16 × 32), the CPL(NDM(f))-approach reached a speedup by a factor of 142,360.

The same nine solution vectors with 3 variables were found in 2.2 milliseconds instead of 313 seconds using a single processor core.

Fig. 3.3: The speedup reached by the XBOOLE approach CPL(NDM(P)) (3.13) in comparison to the already strongly improved iterative basic algorithm of Subsection 3.4.2 using the distributive and the absorption law for a UCP of 16 variables (speedup over the number of clauses).

Figure 3.3 shows the substantial speedup in the range of more than 5,000 up to more than 100,000 reached by the XBOOLE CPL(NDM(f))-approach (3.13) for identical covering examples of 16 variables and 8 up to 32 clauses. This strong improvement confirms that the utilization of properties given by the application domain should be combined with the power of the common use of several processors working in parallel.

3.6 Parallelization Using MPI on Four or Six CPU-Cores

3.6.1 Uniform Distribution

Concurrent processing on several cores can be implemented by threads of a single process or by a set of processes. Due to the common use of the same program functions and some local control variables within these functions, threads cannot be used when the same XBOOLE operation must be applied concurrently. Hence, the Message Passing Interface (MPI) [3] must be utilized for the concurrent solution of the covering problem on several CPU cores.

The main task of the unate covering problem is solved by the CPL operation. After the execution of h(x) = CPL(g(x)), the function h(x) is equal to 1 for such patterns x of the Boolean space B^n for which the function g(x) is equal to 0. Hence, the CPL operation calculates the difference between the whole Boolean space and the 1-patterns of the given function g(x). The XBOOLE operation DIF(f, g) calculates f(x) ∧ ¬g(x).

A very simple approach for the parallel solution of the UCP is the partition of the Boolean space into subspaces of a fixed size and the concurrent execution of the DIF operation for these subspaces. The largest power of 2 which is smaller than 6 (the available number of cores) is 4. Hence, the first MPI approach is adapted to 4 processor cores. For a compact representation of the problems, one can use f_ss^1(k, i, p_0) for the function that is equal to 1 for any p_0 of the i-th of the 2^k subspaces. Using r as the rank (index of the respective process), the subtask

    AS[r](p) = DIF(f_ss^1(2, r, p_0), NDM(P(p)))   (3.14)

must be solved on each core. The final solution for the special case of 4 cores can be calculated by

    AS(p) = ⋁_{r=0}^{3} AS[r](p).   (3.15)

Due to the orthogonality of f_ss^1(2, r, p_0), the partial solution sets AS[r](p) of (3.15) are orthogonal, too. Therefore, the disjunctions in (3.15) can be realized by the concatenation of the partial solution sets, which can be done in constant time.

Table 3.5: Benchmark results for the concurrent algorithm DIF(f_ss^1(2, r, p_0), NDM(P(p))) using 4 cores (columns: benchmark nv, nc; solution nv, nm; time in milliseconds t_0, t_1, t_2, t_3, total).

Table 3.5 shows the results for the application of (3.14) and (3.15) using 4 cores. Due to the very short runtime, the benchmark results for 8 and 16 variables are omitted. Columns t_0, ..., t_3 show the runtime for the four different subspaces. Despite the same size of the subproblems, the periods of time needed to calculate a partial solution for a subspace vary in a wide range because:
1. Depending on the given Petrick function, the representation of the partial solution in the subspace can require different numbers of ternary vectors.
2. The computation with different numbers of ternary vectors requires different time intervals.
3. The final evaluation of the solution set with regard to the wanted minimal solution requires different time intervals depending on both the number of ternary vectors and the step in the iteration where the first minimal solution is found.

Fig. 3.4: A comparison of the XBOOLE approaches UNI(DIF(f[r], NDM(P))) (3.14), (3.15) and CPL(NDM(P)) (3.13) for covering problems of 24 and 32 variables (speedup over the number of clauses).

Therefore, the runtime for a subspace can even be higher than the runtime for the whole covering problem, especially for relatively small problems. Figure 3.4 shows this weakness of a uniform division of the Boolean space by values of the speedup which are smaller than 1. On the other hand, the same reasons lead to a super-linear speedup of 12.8 for the UCP solution of 32 variables and 256 clauses using the XBOOLE approach UNI(DIF(f[r], NDM(P))) on 4 cores. Figure 3.4 shows that the speedup grows in a certain range with both the number of variables and the number of clauses. Due to very bad load balancing, the speedup drops below the linear speedup for the Petrick functions of 32 variables and 512 or 1024 clauses.

The global improvements of this MPI approach are reached for the benchmark (32 × 256) and the larger benchmarks. The observed bad load balancing of the uniform division of the Boolean space motivates an improved parallel solution that will be discussed in the next subsection.

3.6.2 Adaptive Distribution

Basically, the Boolean space B^n can be maximally divided into 2^n subspaces. For the concurrent computation on n_core cores, at least n_core subspaces are needed. This minimal number of subspaces was used in the previous approach and caused bad load balancing. This can be improved when the UCP is split into subproblems for a larger number of subspaces. The larger the number of subspaces, the better the load balancing that can be achieved. However, the creation of too many subspaces contradicts the very valuable improvements by means of ternary vectors. Each ternary vector itself represents a subspace that is directly defined by the context of the problem to be solved. As a compromise, n_ss = 2^6 = 64 subspaces are used for the experiments involving the adaptive distribution approach.

In order to improve the load balance, the subspaces must be assigned to the processes in such a way that all working processes will finish approximately at the same time. Hence, one of the processes must control the assignment of the subtasks. Therefore, a master-worker architecture (see Figure 3.5) is implemented. The master process controls the assignment of the subtasks to the worker processes. Hence, one core is lost for the direct problem solution.

Fig. 3.5: Master-worker architecture of a PC with (a) 4 cores (master and workers 1-3) or (b) 6 cores (master and workers 1-5).
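The chapter does not list the MPI code itself; the following skeleton is only a hypothetical sketch of such a master-worker dispatch. The function solveSubspace() stands in for the XBOOLE computation of one subtask (DIF combined with NDM, as formalized in (3.16) below), and only the size of the best cover found so far is exchanged here, whereas the real implementation also transfers the solution vectors.

```cpp
#include <mpi.h>

const int TAG_REQUEST = 1, TAG_TASK = 2;

// Stand-in for the XBOOLE subtask computation in one subspace; here it only
// pretends that nothing better than the known bound was found.
int solveSubspace(int /*subspace*/, int bestSoFar) { return bestSoFar; }

void master(int nSubspaces, int nWorkers) {
    int best = 1 << 30;                                  // smallest number of values 1 known so far
    int next = 0, done = 0;
    while (done < nWorkers) {
        int result;
        MPI_Status st;
        MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST, MPI_COMM_WORLD, &st);
        if (result < best) best = result;                // keep the best partial result
        int task[2] = { next < nSubspaces ? next++ : -1, best };
        if (task[0] < 0) ++done;                         // -1 tells the worker to stop
        MPI_Send(task, 2, MPI_INT, st.MPI_SOURCE, TAG_TASK, MPI_COMM_WORLD);
    }
}

void worker() {
    int result = 1 << 30;                                // nothing found yet
    for (;;) {
        int task[2];
        MPI_Send(&result, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
        MPI_Recv(task, 2, MPI_INT, 0, TAG_TASK, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (task[0] < 0) break;                          // no subspaces left
        result = solveSubspace(task[0], task[1]);        // task[1] carries the best known bound
    }
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) master(/*nSubspaces=*/64, size - 1);  // n_ss = 2^6 as in the text
    else           worker();
    MPI_Finalize();
    return 0;
}
```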

For the direct comparison of the MPI approaches, the adaptive distribution was executed using 4 cores (see Figure 3.5 (a)), as in the case of the uniform distribution. It is a benefit of the adaptive distribution approach that the number of cores does not strongly depend on the number of subtasks. This scalability will be evaluated using all 6 available cores of the CPU in the architecture of Figure 3.5 (b).

The concurrent algorithm follows the previous approach. Using i as the index of a Boolean subspace, the three or five worker processes solve the assigned subtasks

    AS[i](p) = DIF(f_ss^1(6, i, p_0), NDM(P(p))).   (3.16)

The aggregated solution can be calculated by

    AS(p) = ⋁_{i=0}^{n_ss − 1} AS[i](p)   (3.17)

independently of the number of used cores. This architecture is very profitable because the master can assign the next unsolved subtask to a worker immediately upon the request of the worker.

Table 3.6: Benchmark results for the adaptive concurrent algorithm DIF(f_ss^1(6, i, p_0), NDM(P(p))) using 4 cores (columns: benchmark nv, nc; solution nv, nm; time in milliseconds t_1, t_2, t_3, total; speedup).

Table 3.6 shows the experimental results for the adaptive approach based on formulas (3.16) and (3.17) using 4 cores. The columns t_1, t_2, and t_3 of Table 3.6 show the runtime of the three workers. No speedup is reached for the small benchmarks, which need only a few milliseconds in all the improved approaches. Despite the restriction to three workers, the highest speedup in comparison to the single-core solution is reached for the large Petrick functions of 32 variables and up to 1024 clauses. The reason for this impressive speedup is a special utilization of the implemented concurrent approach. Each worker sends the results of the unate covering problem for the assigned subspace to the master process. These results include both the minimal number of values 1 in this partial solution and the number of the found minimal solutions. The master process handles the partial solutions in the following way:
- If the master process already knows solutions with a smaller number of values 1, the received solutions with a larger number of values 1 are immediately omitted.
- If the master process already knows solutions with the same number of values 1, the received solutions are accumulated.
- If the master process so far only knows solutions with more values 1, the stored accumulated solution is replaced by the new better solution.

Using this simple algorithm, the master process knows the smallest number of values 1 found so far by the concurrent worker processes. On each request of a worker for the next subtask, the master process sends both the number of the next subspace and the smallest solution found so far. This information helps the worker process to simplify the evaluation algorithm because larger solutions do not have to be taken into account anymore.

Table 3.7: Benchmark results for the adaptive concurrent algorithm DIF(f_ss^1(6, i, p_0), NDM(P(p))) using 6 cores, having the same solutions as shown in Table 3.6 (columns: benchmark nv, nc; time in milliseconds t_1, ..., t_5, total; speedup).

Without any change, the implemented program of the adaptive approach was executed using 6 cores in parallel. Table 3.7 summarizes these experimental results. The detailed enumeration of the runtimes of workers 1 to 5 needs so much space that the information about the solution is skipped and the precision is restricted to one position after the decimal point. The found solutions nv and nm are the same as those given in Table 3.6.

The total runtime on 6 cores is reduced approximately to 4/6 in comparison to the adaptive approach using 4 cores. This expected improvement is slightly affected by the concrete load balance. The speedup in the last column relates to the time of the sequential approach given in Table 3.4.

Figure 3.6 visualizes the strong super-linear speedup of the adaptive XBOOLE approach UNI(DIF(f[i], NDM(P))) (3.16), (3.17) running on 4 or 6 cores. Significant global improvements of these MPI approaches are reached both using 4 cores and using 6 cores. Small tasks, for which the CPL(NDM(P)) approach (3.13) needs only a couple of milliseconds, should be solved by a single processor core. Convenient thresholds for the covering problems are 24 variables and 64 clauses. For all benchmarks larger than 24 variables and larger than 128 clauses, a super-linear speedup is reached. It is notable that the super-linear speedup increases even more strongly for larger tasks. This observation leads to an important general conclusion: the available set of processor cores should not be used simply for the computation of the assigned subtasks but mainly as a source of knowledge that restricts the efforts for the subsequent subtasks.

3.6.3 Intelligent Master

In the previous approach, a master process is required to control the adaptive assignment of the subtasks to the workers. The master process waits most of the time for the requests of the workers. An additional worker thread inside the master process could cause delays in the answers to the worker requests. Hence, the chosen master-worker architecture of Figure 3.5 should not be changed.

In addition to the better load balancing, one source of the improvement of the adaptive distribution approach is the utilization of the smallest number of values 1 known so far by all workers. In a further approach, the same knowledge is used by the master itself. Such an intelligent master process allows an additional reduction of the runtime for large problems. This approach of an intelligent master relies on the following property: only solution vectors with the smallest number of values 1 are wanted. A subspace is defined by fixing a certain number of variables; (x_1 = 1, x_2 = 1, x_3 = 0, x_4 = 1, x_5 = 0) defines, for instance, one of 32 subspaces where the first 5 variables are fixed. If a solution with two values 1 is already known, it can be concluded that no wanted solution results from this subspace since already three variables have the value 1. This evaluation can be realized by the master while it is waiting for the request of a worker. Subspaces in which no wanted solution can exist are excluded by the intelligent master without detailed calculations in the subspace itself.

The intelligent master approach was implemented, too. This approach should be applied especially for large benchmarks. Due to this intelligent behavior of the master it is possible to extend the division of the Boolean space into more subspaces. As a compromise, n_ss = 2^10 = 1024 subspaces are used for 6 cores.
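The exclusion test of the intelligent master can be stated in a few lines. The following sketch assumes, purely for illustration, that a subspace is identified by the bit pattern of its fixed leading variables, which is not necessarily the encoding used in the implementation.

```cpp
#include <cstdint>

// Number of fixed variables set to 1 in the subspace identifier: bit j of
// subspaceIndex gives the fixed value of the (j+1)-th variable.
inline int fixedOnes(std::uint32_t subspaceIndex) {
    int ones = 0;
    for (; subspaceIndex != 0; subspaceIndex &= subspaceIndex - 1) ++ones;
    return ones;
}

// A subspace can be excluded without any computation if its fixed part alone
// already contains more values 1 than the smallest cover known so far; at
// equality it must still be evaluated because all minimal covers are wanted.
inline bool subspaceExcluded(std::uint32_t subspaceIndex, int bestOnesSoFar) {
    return fixedOnes(subspaceIndex) > bestOnesSoFar;
}
```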

Fig. 3.6: Speedup of the adaptive XBOOLE approach UNI(DIF(f[i], NDM(P))) (3.16), (3.17) in comparison to CPL(NDM(P)) (3.13) for UCPs of 24 and 32 variables running on 4 or 6 cores (two panels; speedup over the number of clauses).

Another positive effect is the reduction of the required memory space. The number of necessary ternary vectors is reduced by a factor of n_ss = 2^10 = 1024.

Table 3.8: Benchmark results for the approach that combines the adaptive concurrent algorithm DIF(f_ss^1(10, i, p_0), NDM(P(p))) with the intelligent master using 6 cores (columns: benchmark nv, nc; solution nv, nm; time in milliseconds; speedup; efficiency).

Table 3.8 shows that this approach should be used for large unate covering problems which can be controlled by a frame algorithm. The speedup in column 6 is the ratio between the total time of the improved approach given in Table 3.4 on a single core and the time in column 5 of Table 3.8 using 6 cores. Due to Amdahl's law, the speedup is typically less than the number of used cores. This is true for the two smallest evaluation benchmarks. Such small tasks can be solved in a couple of milliseconds on a single core; no parallel approach must be applied. The speedup for the largest executed benchmark is equal to 1,426. Hence, without substantial computations the intelligent master approach reduces the required runtime by one third.

It is an important property of the suggested intelligent master approach that the speedup grows with larger tasks. A second evaluation parameter is the efficiency. The efficiency is defined as the quotient of the speedup and the number of used cores. An ideal implementation is reached if the value of the efficiency is equal to 1. For benchmarks larger than 32 variables and 32 clauses, the approach of the intelligent master reaches at least this ideal value. It is a remarkable result that the efficiency grows strongly with the size of the problem. The efficiency for the largest benchmark is approximately 237.7 (= 1,426/6).

Figure 3.7 shows, in comparison to Figure 3.6, a further improvement of the intelligent master approach over the approach of adaptive distribution. The super-linear speedup is already reached for the covering problem of 32 variables and 32 clauses. The suggested enhancement of the simple master to an intelligent master improves the speedup by an additional factor in the range of approximately 2 to 5. The largest global improvement of this MPI approach is reached for the largest benchmark using 6 cores.

Fig. 3.7: Speedup of the intelligent master approach UNI(DIF(f[i], NDM(P))) (3.16), (3.17) in comparison to CPL(NDM(P)) (3.13) for the UCP of 32 variables and 32 and more clauses on 6 cores.

3.7 Ordered Restricted Vector Evaluation

3.7.1 Sequential CPU Realization Using One Single Core

The parallelization in the application domain reached a significant improvement factor of 142,360 in comparison with the improved basic approach of the iterative application of the distributive and the absorption law. The reason for this strong improvement of more than 10^5 on a single CPU core is the application of XBOOLE [6], [8], [11], [12] in the algorithm CPL(NDM(P)). Two effects are utilized by XBOOLE for this significant improvement:
1. The Boolean variables are assigned to the bit positions within one machine word or, if necessary, within several words according to the word width of the computer. All Boolean values of such a computer word are processed in parallel in a single time step of the computer.
2. As many as possible of the vectors of Boolean values are merged into a single ternary vector with d dash elements. Hence, XBOOLE processes even 2^d Boolean vectors within a single time step.

The selection of all minimal solutions from the set of all solutions of the Petrick equation (3.3) is the most time-consuming subtask of this approach.

This can be seen in Table 3.4 from the difference between the total time and the time needed for the CPL operation. The distribution of this approach to 6 cores of the CPU does not only reach a linear speedup of 6, but the extreme super-linear speedup of 1,426 in the approach that combines the adaptive concurrent algorithm DIF(f_ss^1(10, i, p_0), NDM(P(p))) and the intelligent master. This super-linear speedup is based on the exchange of intermediate results between the workers controlled by the master and the utilization of the knowledge about the smallest known solution to restrict the evaluation of the found solutions. The intelligent master even excludes whole subspaces without loss of any minimal solution of the UCP.

This analysis of the previous approaches reveals the source of further improvements: the restriction of the evaluated Boolean space. Instead of the calculation of all solutions of the Petrick equation (3.3) and the subsequent selection of the minimal solutions, certain subsets of the Boolean space can be evaluated. The Boolean space B^n can be divided into n + 1 subspaces S_i, i = 0, ..., n, where the subspace S_i is defined by all the Boolean vectors of the length n which contain i values 1:

    B^n = S_0 ∪ S_1 ∪ ... ∪ S_n.   (3.18)

Assuming f_i^s(p) = 1 for all vectors p ∈ S_i, the Petrick equation (3.3) can be expressed by

    P(p) = P(p) ∧ ⋁_{i=0}^{n} f_i^s(p) = 1   (3.19)

and divided into n + 1 partial unate covering problems PUCP(i)

    P(p) ∧ f_i^s(p) = 1.   (3.20)

The wanted minimal solution of the UCP is equal to the non-empty solution of the PUCP(i) with the smallest value of i. An algorithm can be organized so that the PUCP(i) are solved iteratively, beginning with i = 1 (a solution of PUCP(0) cannot exist) and incrementing i = i + 1. The benefit of this approach is that the iteration can be stopped when the solution set of PUCP(i) is not empty. It should be remembered that PUCP(n) has the solution p_j = 1, j = 1, ..., n, for each Petrick function.

Unfortunately, symmetric functions have a drawback with regard to the ternary vectors used in XBOOLE. All Boolean vectors with f_i^s(p) = 1 include i values 1 and n − i values 0. Hence, there do not exist two such vectors that differ only in a single position. Consequently, f_i^s(p) cannot be simplified by ternary vectors, so that the application of XBOOLE loses one of its benefits. Therefore, a direct calculation with binary vectors represented by machine words is used instead of the computation with ternary vectors in XBOOLE.

Each clause c of the Petrick function P(p) can be represented by a Boolean vector within a machine word such that a value 1 in the bit position j indicates the variable p_j which appears in the clause.

are stored in the clause vector cv.vector, where cv.elements indicates the number of clauses.

Each conjunction of the symmetric function f_i^s(p) describes a concrete permutation p of i out of n values 1. Such a conjunction can be represented by a Boolean vector within a machine word using the same order of the variables p_j as in the clause vector cv.vector. All of the permutations of f_i^s(p) are stored in the permutation vector pv.vector, where pv.elements indicates the number of permutations. The solutions of PUCP(i) can be stored in a similar data structure because each solution is a machine word of the permutation vector pv.vector which satisfies the condition (3.20) of PUCP(i). Hence, the memory for the solutions is the solution vector sv.vector, where sv.elements indicates the number of solutions.

The conjunction of the machine words pv.vector[ip] ∧ cv.vector[ic] creates a new machine word in which a value 1 in bit position j indicates that the assignment p_j = 1 of the permutation ip in the permutation vector pv.vector causes the clause ic of the clause vector cv.vector to take the Boolean value 1. It is only important for PUCP(i) whether this conjunction vector includes at least one value 1; the number of values 1 and their positions in this vector are irrelevant for the decision of PUCP(i). In the case that the conjunction vector is equal to zero, the clause ic is not covered by the permutation ip of the permutation vector pv.vector. Hence, the permutation ip is no solution, and the evaluation of further clauses for the permutation ip can be omitted. In the case that a selected permutation ip of the permutation vector pv.vector covers all the cv.elements clauses of the clause vector cv.vector, this permutation is a solution of PUCP(i) and can be stored in the solution vector sv.vector. Algorithm 3 shows the restricted vector evaluation that solves PUCP(i). It should be mentioned that several machine words must be used if the number of variables p_j is greater than the number of bits of a single machine word.

Algorithm 3 Calculate all the solutions of a PUCP for the given cv and pv on the CPU
1: is ← sv.elements {index of next solution in sv}
2: for ip ← 0, ip < pv.elements, ip ← ip + 1 do
3:   for ic ← 0, ic < cv.elements, ic ← ic + 1 do
4:     if pv.vector[ip] ∧ cv.vector[ic] = 0 then
5:       break
6:     end if
7:   end for
8:   if ic = cv.elements then
9:     sv.vector[is] ← pv.vector[ip] {add solution}
10:    is ← is + 1
11:  end if
12: end for
13: sv.elements ← is
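A direct C realization of Algorithm 3 could look like the following sketch. The container type and its fields mirror the cv/pv/sv description above, but the struct definition itself and the assumption that all variables fit into one machine word are illustrative choices of this sketch, not the code of the chapter.

/* Hypothetical container matching the cv/pv/sv description above. */
typedef struct {
    unsigned int *vector;   /* one machine word per clause/permutation/solution */
    unsigned int elements;  /* number of stored words */
} word_vector;

/* Algorithm 3: store every permutation of pv that covers all clauses of cv
   into sv and return the new number of solutions. */
unsigned int calculate_solutions(const word_vector *cv,
                                 const word_vector *pv,
                                 word_vector *sv)
{
    unsigned int is = sv->elements;            /* index of the next solution */
    for (unsigned int ip = 0; ip < pv->elements; ip++) {
        unsigned int ic;
        for (ic = 0; ic < cv->elements; ic++) {
            /* clause ic is not covered by permutation ip */
            if ((pv->vector[ip] & cv->vector[ic]) == 0)
                break;
        }
        if (ic == cv->elements)                /* all clauses are covered */
            sv->vector[is++] = pv->vector[ip]; /* add solution */
    }
    sv->elements = is;
    return is;
}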

Unfortunately, it must be taken into account that the number of conjunctions of f_i^s(p) for a Petrick function of n variables is equal to the binomial coefficient (n choose i). The example (32 choose 6) = 906,192 shows that a large memory is needed to store the permutation vector. Hence, the available memory can limit the size of solvable PUCPs.

Algorithm 4 vanquishes this problem using the function GeneratePermutations(). This function generates the permutations of i values 1 in a range of n bit positions, restricted to a maximal number of 2^16 permutations. The Boolean control parameter first = true determines that the generator starts with the first permutation. The function GeneratePermutations() remembers the state of the generation process and generates in a consecutive call with first = false the next interval of permutations. The inner while-loop of Algorithm 4 realizes the complete evaluation of all the permutations of PUCP(i). The outer while-loop controls the iterative solution of PUCP(i), beginning with i = 1 and incrementing i = i + 1. This loop finishes the work in the case that the number sv.elements of the solutions is greater than zero, because in this case all the minimal solutions of the UCP are found.

Algorithm 4 Ordered restricted vector evaluation on the CPU
1: cv ← Read(P(p)) {clause vector}
2: sv ← ∅ {solution vector}
3: i ← 1
4: first ← true
5: while sv.elements = 0 do
6:   pv ← GeneratePermutations(first, i, cv.variables) {permutation vector}
7:   sv.elements ← CalculateSolutions(cv, pv, sv)
8:   while pv.elements = pv.max do
9:     pv ← GeneratePermutations(first, i, cv.variables)
10:    sv.elements ← sv.elements + CalculateSolutions(cv, pv, sv)
11:  end while
12:  i ← i + 1
13: end while
14: Write(sv)

Table 3.9 summarizes the experimental results of the ordered restricted vector evaluation on the CPU. This approach solves the UCP so quickly that at least 32 variables are needed to get measurable time intervals. The first four columns enumerate the executed benchmarks and the found minimal solution of the UCP. The value in the column generate pv is the sum of all the periods of time to execute the function GeneratePermutations(). Similarly, a value in the column calculate sv is the sum of all periods of time to execute the function CalculateSolutions() for the associated benchmark. The total time in the last column summarizes the time for the generation of the needed permutations and the calculation of the minimal solutions of the UCP. Despite the restriction to a single core of the CPU, the sequential approach of the ordered restricted vector evaluation shortens the runtime for the benchmark ( ) by a factor of 90.3 in comparison with the fastest MPI approach on 6 cores.
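The interval-wise behaviour of GeneratePermutations() described above can be sketched in C as follows. Gosper's hack is used here only as one convenient way to enumerate all n-bit words with exactly i ones; the function name, the static generator state and the data types are assumptions of this sketch, not the implementation behind Table 3.9.

#include <stdint.h>

/* Sketch of a possible GeneratePermutations(): it fills pv with at most
   max_count words of n bits (n <= 32 here) that contain exactly i ones and
   resumes where the previous call stopped. */
unsigned int generate_permutations(int first, unsigned int i, unsigned int n,
                                   unsigned int *pv, unsigned int max_count)
{
    static uint64_t next;                         /* generator state kept between calls */
    if (first)
        next = (i == 0) ? 0 : ((1ULL << i) - 1);  /* smallest word with i ones */

    unsigned int count = 0;
    while (count < max_count && next != 0 && next < (1ULL << n)) {
        pv[count++] = (unsigned int)next;
        uint64_t c = next & (~next + 1);          /* lowest set bit */
        uint64_t r = next + c;
        next = (((r ^ next) >> 2) / c) | r;       /* next word with the same number of ones */
    }
    return count;                                 /* number of permutations in this interval */
}

A call with first = true starts a new enumeration for the current i; repeated calls with first = false return consecutive intervals until fewer than max_count words are produced, which corresponds to the condition pv.elements = pv.max in Algorithm 4.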

Table 3.9: Benchmark results for the sequential approach of ordered restricted vector evaluation using a single core on the CPU
Benchmark (nv, nc) | Solution (nv, nm) | Time in Milliseconds (generate pv, calculate sv, total)

The factor of improvement with regard to the improved basic approach is

3.7.2 Parallel GPU Realization Using CUDA

In order to find all the minimal solutions of the UCP, all the permutations of the PUCP(i) must be evaluated. These subtasks are independent of each other and can be solved in parallel on a large number of GPU cores. The kernel function for these subtasks is the content of the outer loop of Algorithm 3. The index of the subtask to be solved can be determined using the block index blockIdx.x, the block dimension blockDim.x, and the thread index threadIdx.x. Figure 3.8 shows the code of the kernel petkernel().

Two aspects of the petkernel should be discussed. The main work in this kernel is realized in the for-loop by the conjunction of the machine words of a clause and a permutation. Different decisions in the alternative within this loop inhibit the parallel execution of the threads of the running warp. It is possible to omit this if-statement; however, all ic_elements clauses must then be evaluated without this decision. The benefit of this decision is that the running warp of threads is finished as soon as all the threads leave the for-loop by the break-statement. Due to the generated order of the permutations, the benefits of the early break outweigh the drawback of the alternative in the control flow.

The second aspect is that the solutions which are found in different threads must be stored in a single global solution vector. The atomic CUDA function atomicAdd() solves this problem. This atomic function is executed without interference from all the other threads and prevents race conditions when several threads want to store a found solution at the same time. Due to the used parameters, atomicAdd() increments the global counter of the solutions by 1 and returns the previous value of the counter as the index at which the found solution is stored.
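For illustration only, the branch-free alternative mentioned above (evaluating all clauses without the early break) could be factored into a device function like the following sketch; it is not the variant used in the chapter, which deliberately keeps the early break.

/* Branch-free variant of the inner loop: every clause is evaluated and the
   results are accumulated, so all threads of a warp follow the same path,
   at the price of never breaking early. */
__device__ int covers_all_clauses(const unsigned int *dev_cv, int ic_elements,
                                  unsigned int permutation)
{
    int covered = 1;
    for (int ic = 0; ic < ic_elements; ic++)
        covered &= ((permutation & dev_cv[ic]) != 0);
    return covered;
}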

__global__ void petkernel(
    unsigned int *dev_cv,        // vector of all clauses of P(p)
    unsigned int *dev_pv,        // (partial) vector of permutations
    unsigned int *dev_sv,        // vector to store the solutions
    int *dev_sv_max_elements,    // maximal number of solutions
    int *dev_sv_elements,        // number of stored solutions
    int ic_elements,             // number of clauses
    int ip_elements)             // number of permutations
{
    int ic;                      // index of the clause
    // index of the permutation
    int ip = threadIdx.x + blockDim.x * blockIdx.x;
    if (ip < ip_elements) {
        for (ic = 0; ic < ic_elements; ic++)
            if ((dev_pv[ip] & dev_cv[ic]) == 0)
                break;
        if (ic == ic_elements) {
            int is = atomicAdd(dev_sv_elements, 1);
            if (is < *dev_sv_max_elements)
                dev_sv[is] = dev_pv[ip];
        }
    }
}

Fig. 3.8: The kernel used to evaluate one permutation for the clauses of the Petrick function.

The execution of the kernel petkernel requires that both the vector of the clauses cv and the vector of permutations pv are copied from the main memory of the CPU to the device memory of the GPU. Additionally, a sufficiently large vector svg must be allocated in the device memory to collect the found solutions. This vector must finally be copied to the main memory of the CPU. Algorithm 5 extends Algorithm 4 by the necessary copy operations.

Within the function CalculateSolutionsOnGPU(), the number of threads and blocks for all the pvc.elements of one interval of permutations are defined by:

n_threads = 512
n_blocks = (pvc.elements + n_threads - 1) / n_threads

Based on these parameters, the petkernel of Figure 3.8 is called by the following statement:

petkernel<<<n_blocks, n_threads>>>(dev_cv, dev_pv, dev_sv,
    dev_sv_max_elements, dev_sv_elements, cv.elements, pv.elements);
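A possible host-side shape of CalculateSolutionsOnGPU(), with the copy operations and the kernel call described above, is sketched below. The parameter list, the omission of error checking and the assumption that the clause and solution buffers were allocated once with cudaMalloc before the first call are simplifications of this sketch; only cudaMemcpy and the <<<...>>> launch syntax are standard CUDA.

/* Sketch of the host-side wrapper around petkernel(). */
int CalculateSolutionsOnGPU(int cv_elements,
                            const unsigned int *pv, int pv_elements,
                            unsigned int *dev_cv, unsigned int *dev_pv,
                            unsigned int *dev_sv,
                            int *dev_sv_max_elements, int *dev_sv_elements)
{
    /* copy the current interval of permutations to the device */
    cudaMemcpy(dev_pv, pv, pv_elements * sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    /* thread/block configuration as given in the text */
    int n_threads = 512;
    int n_blocks  = (pv_elements + n_threads - 1) / n_threads;

    petkernel<<<n_blocks, n_threads>>>(dev_cv, dev_pv, dev_sv,
                                       dev_sv_max_elements, dev_sv_elements,
                                       cv_elements, pv_elements);

    /* read back only the solution counter; the solution vector itself is
       copied once after the outer loop has finished (Algorithm 5, line 18) */
    int sv_elements = 0;
    cudaMemcpy(&sv_elements, dev_sv_elements, sizeof(int),
               cudaMemcpyDeviceToHost);
    return sv_elements;
}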

Algorithm 5 Ordered restricted vector evaluation using the CPU and the GPU
1: cvc ← Read(P(p)) {clause vector on CPU}
2: cvg ← cvc {clause vector on GPU}
3: svc ← ∅ {solution vector on CPU}
4: svg ← svc {solution vector on GPU}
5: i ← 1
6: first ← true
7: while svc.elements = 0 do
8:   pvc ← GeneratePermutations(first, i, cvc.variables) {permutation vector on CPU}
9:   pvg ← pvc {permutation vector on GPU}
10:  svc.elements ← CalculateSolutionsOnGPU(cvg, pvg, svg)
11:  while pvc.elements = pvc.max do
12:    pvc ← GeneratePermutations(first, i, cvc.variables)
13:    pvg ← pvc
14:    svc.elements ← svc.elements + CalculateSolutionsOnGPU(cvg, pvg, svg)
15:  end while
16:  i ← i + 1
17: end while
18: svc ← svg {receive solution vector from GPU}
19: Write(svc)

Table 3.10 summarizes the experimental results of the ordered restricted vector evaluation, which is executed using both the CPU and the GPU. As expected, this approach also solves the UCP so quickly that at least 32 variables are needed to get measurable time intervals. The first four columns enumerate the executed benchmarks and the found minimal solution of the UCP. The most time-consuming subtask of the UCP is the calculation of the solution vector sv. Despite the if-statement within the for-loop, the GPU solves this task 10 times faster than the CPU, which benefits greatly from the early break of the for-loop. The sum of the times of all the executions of the function CalculateSolutionsOnGPU() for the associated benchmark is given in the eighth column of Table 3.10. The generation of the required permutations needs less time on the CPU than the time for the calculation of the solutions on the GPU. The sum of all the time intervals of the execution of the function GeneratePermutations() for the associated benchmark is given in the sixth column of Table 3.10. Additional time is required to copy the vectors between the CPU and the GPU. These time intervals are listed in columns 5, 7, and 9 of Table 3.10. The total time in the last column summarizes the time for the generation of the required permutations, the calculation of the minimal solutions of the UCP, and all the copy operations of vectors between the CPU and the GPU.

The parallel approach of ordered restricted vector evaluation using both the CPU and GPU reduces the runtime for the benchmark ( ) by a factor of in comparison with the fastest MPI approach on 6 cores. The factor of improvement with regard to the improved basic approach is

Table 3.10: Benchmark results for the parallel approach of ordered restricted vector evaluation using the GPU Tesla C2070.
Benchmark (nv, nc) | Solution (nv, nm) | Time in Milliseconds (copy P(p) to GPU, generate permutation vector pv, copy pv to GPU, calculate solution vector sv, copy sv to CPU, total)

3.8 Conclusions

The small set of CPU cores and the large set of GPU cores are typically seen as the sources of improvement in the parallelization of time-consuming tasks. One more important source of improvement is the parallel computation of all the bits of machine words within a single core of both the CPU and the GPU.

A general obstacle in the Boolean domain is the exponential complexity of the Boolean space and consequently of most of the Boolean problems. Due to this exponential complexity, the simple utilization of the parallel resources shortens the solution process only slightly. The mapping of the covering problem in Section 3.3 to a matrix multiplication with excellent properties for parallelization on the GPU confirms this conclusion.

The programming model of the parallel devices combines sources and obstacles for parallelization. The Message Passing Interface (MPI) uses the Single Program Multiple Data (SPMD) model. It is a source of improvement of SPMD that the same program runs independently on several CPU cores. Obstacles of the SPMD model are the necessary special handling of message passing between the processes, the different time intervals needed to solve the subtasks, and possible dependencies of the data. The Compute Unified Device Architecture (CUDA) uses the Single Instruction Multiple Thread (SIMT) model. The source of improvement of the SIMT is that the same instruction can be executed on hundreds of GPU cores in parallel. This source of improvement is itself also an obstacle: branches in the control flow interfere with the utilization of the complete power of computation.

As shown in the main part of this chapter, a very important source of improvement is a deep analysis of the problem itself and the utilization of devices that can be used in parallel. The consistent implementation of this concept for the unate covering problem improved the basic approach by a significant factor of more than

The applied method can (and should be) reused for other complex problems.

References

1. Biere, A., Heule, M., van Maaren, H. and Walsh, T.: Handbook of Satisfiability, IOS Press, Amsterdam, Berlin, Oxford, Tokyo, Washington (DC).
2. Cordone, R., Ferrandi, F., Sciuto, D. and Wolfler Calvo, R.: An Efficient Heuristic Approach to Solve the Unate Covering Problem, Proceedings of the Conference on Design, Automation and Test in Europe, Paris, France, 2000.
3. Gropp, W., Thakur, R. and Lusk, E.: Using MPI-2: Advanced Features of the Message Passing Interface, MIT Press, Cambridge, MA, USA.
4. NVIDIA: NVIDIA CUDA C Programming Guide, Version 4.2, see: CUDA C Programming Guide.pdf.
5. Paul, E., Steinbach, B. and Perkowski, M.: Application of CUDA in the Boolean Domain for the Unate Covering Problem, in: Steinbach, B. (Editor): Boolean Problems, Proceedings of the 9th International Workshop on Boolean Problems, September 2010, Freiberg University of Mining and Technology, Freiberg, 2010.
6. Posthoff, Ch. and Steinbach, B.: Logic Functions and Equations - Binary Models for Computer Science, Springer, Dordrecht, The Netherlands.
7. Quinn, M. J.: Parallel Programming in C with MPI and OpenMP, McGraw-Hill, New York (NY), USA.
8. Steinbach, B.: XBOOLE - A Toolbox for Modelling, Simulation, and Analysis of Large Digital Systems, System Analysis and Modeling Simulation, Gordon & Breach Science Publishers, Volume 9, Number 4, 1992.
9. Steinbach, B. and Posthoff, Ch.: Improvements of the Construction of Exact Minimal Covers of Boolean Functions, in: Roberto Moreno-Díaz, Franz Pichler and Alexis Quesada-Arencibia (Eds.): Computer Aided Systems Theory - EUROCAST 2011, 13th International Conference, Las Palmas de Gran Canaria, Spain, February 6-11, 2011, Revised Selected Papers, Part II, Lecture Notes in Computer Science Volume 6928, Springer, 2012.
10. Steinbach, B. and Posthoff, Ch.: An Extended Theory of Boolean Normal Forms, in: Proceedings of the 6th Annual Hawaii International Conference on Statistics, Mathematics and Related Fields, Honolulu, Hawaii, 2007.
11. Steinbach, B. and Posthoff, Ch.: Boolean Differential Calculus, in: Sasao, T. and Butler, J. T.: Progress in Applications of Boolean Functions, Synthesis Lectures on Digital Circuits and Systems, Morgan & Claypool Publishers, San Rafael, CA, USA, 2010.
12. Steinbach, B. and Posthoff, Ch.: Logic Functions and Equations - Examples and Exercises, Springer Science + Business Media B.V., 2009.

13. Steinbach, B. and Posthoff, Ch.: Parallel Solution of Covering Problems - Super-Linear Speedup on a Small Set of Cores, GSTF International Journal on Computing, Global Science and Technology Forum (GSTF), Singapore, 2011, Volume 1, Number 2.
14. Wegener, I.: Complexity Theory - Exploring the Limits of Efficient Algorithms, Springer, Dordrecht, The Netherlands, 2005.

Chapter 4
GPU Acceleration Methods of Representations for Quantum Circuits

Martin Lukac, Marek Perkowski, Pawel Kerntopf, Michitaka Kameyama

Abstract In this chapter we present various applications of GPU-accelerated computation to the problem of the synthesis of quantum and reversible circuits. We describe three different methods of designing quantum and reversible circuits, each using a matrix representation of the circuits. First, we show a reversible circuit synthesis based on Linearly Independent functions; second, the synthesis of reversible circuits from Relational Specifications; and finally, a synthesis method using a genetic algorithm. Because the reversible circuits represented as matrices grow exponentially in size with an increasing number of bits, an efficient method of acceleration is required. This efficient acceleration is possible by using a GPU processor and the CUDA framework. To demonstrate why the usage of such matrices as a circuit representation is useful, we also describe a set of experiments comparing it with a more efficient circuit representation - the QMDD - and we show that there are tradeoffs between hardware-accelerated matrix manipulation and high-level circuit representations with respect to the memory space and the speed of manipulation of the circuits.

Martin Lukac, Graduate School of Information Sciences, Tohoku University, Sendai, Japan, lukacm@ecei.tohoku.ac.jp
Marek Perkowski, Department of Electrical Engineering, Portland State University, Portland, OR, USA, mperkows@ece.pdx.edu
Pawel Kerntopf, Institute of Computer Science, Warsaw University of Technology, Warsaw, Poland, and Department of Theoretical Physics and Computer Science, University of Lodz, Lodz, Poland, p.kerntopf@ii.pw.edu.pl
Michitaka Kameyama, Graduate School of Information Sciences, Tohoku University, Sendai, Japan, kameyama@ecei.tohoku.ac.jp

1 The work of Pawel Kerntopf was supported in part by the Polish Ministry of Science and Higher Education under Grant 4180/B/T02/2010/38.


4.1 Introduction

Synthesis of quantum circuits (SQC) is an area of research with growing activity and interest due to the imminent need for the development of novel technologies for the design of circuits and computers. The main reason for this interest is the fact that quantum computing provides one of the possible solutions to Moore's limit [42]. The design of quantum circuits has been explored using standard reversible techniques such as the MMD algorithm [37], reversible wave-cascades [41], spectral decompositions [37], BDD [57], reachability analysis [17, 18], group theory [60, 61], input cube re-ordering [14], as well as evolutionary approaches [58, 49, 25, 29, 53, 34, 28].

This chapter focuses on the possibilities of accelerating the computation and manipulation of quantum circuits with respect to different synthesis methods and the different requirements of each method. This chapter is divided into three distinct parts.

1. First, we describe the synthesis of quantum circuits from linearly independent functions [47, 24]. This approach is based on the usage of a library of reversible gates (representing the linearly independent logic functions) and requires the synthesis of a large number of reversible gates. To make the matter more interesting, the presented synthesis is implemented in ternary logic. As will be seen, the synthesis of circuits in radix 3 is more complex because the CUDA architecture is not well adapted for such matrices. This approach illustrates the GPU acceleration for multiple-valued reversible circuits.
2. The second approach illustrates the synthesis of reversible and quantum circuits specified by relational specifications. This approach uses a tree search and requires a very large number of fast matrix multiplications. The particular requirement is that matrix multiplication is performed at every node of the search tree and each step on the tree is a matrix-matrix multiplication.

3. The final approach is the evolutionary synthesis of quantum and reversible circuits. The evolutionary approach uses matrices of different sizes, and both the matrix and Kronecker products are accelerated. Because the evolutionary approach requires the evaluation of a large number of circuits, the acceleration must be optimized both for matrix manipulation as well as for the data transfer between the main memory and the GPU device.

In this chapter, we first explain the minimal basic principles of quantum computing in Section 4.2. Section 4.3 describes the representations of reversible and quantum circuits in each of the three synthesis methods. Section 4.4 briefly describes how the GPU (using the CUDA framework) is used to accelerate the circuit representations discussed in this chapter. Sections 4.5, 4.6 and 4.7 describe the usage of, and experiments with, each of the three described synthesis methods and show three examples of synthesis of quantum circuits. Finally, Section 4.8 is a general conclusion of the whole chapter and discusses the future work.

4.2 Quantum Computing and Quantum Circuits

A quantum circuit is the counterpart of a classical circuit. It was introduced as one of the possible representations for quantum computation [6, 7, 22], and ever since it has received many theoretical improvements in the form of algorithms and theorems [8, 2, 10, 4, 51, 50]. Quantum circuits operate on quantum bits (qubits - the quantum counterpart of the classical bit). A qubit is represented by a wave equation that allows the qubit to be in a deterministic or in a quantum probabilistic state (superposed, entangled) [43]. An example of a qubit is shown in Eq. (4.1):

|ψ⟩ = α|0⟩ + β|1⟩ = [α, β]^T   (4.1)

with |α|^2 + |β|^2 = 1. For (α = 1, β = 0) and (α = 0, β = 1), equation (4.1) represents the qubit in the basis states |0⟩ and |1⟩ respectively; with such values of the coefficients the quantum qubit behaves as a classical Boolean bit. For other values of α and β, the qubit is in a quantum superposed state [43].

Multiple-valued qudits are a generalization of the qubits. Thus, a qudit is represented by the following wave equation:

|φ⟩ = Σ_{i=0}^{r^n - 1} α_i |i⟩   (4.2)

where α_i is a complex coefficient, |i⟩ is the observable state and r is the radix. The complex coefficients obey the unity equation given by Σ_{i=0}^{r^n - 1} |α_i|^2 = Σ_{i=0}^{r^n - 1} α_i α_i^* = 1, where α_i^* is the complex conjugate of α_i. Each of the α_i α_i^* terms represents the probability of observation of the associated state |i⟩.
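As a small numerical illustration of Eqs. (4.1) and (4.2), the following C fragment stores the complex amplitudes of a state, prints the observation probability α_i α_i^* of each basis state and checks the unity condition; the concrete amplitudes are only an arbitrary example chosen for this sketch.

#include <complex.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* arbitrary example state of one radix-3 qudit: |phi> = a0|0> + a1|1> + a2|2> */
    double complex a[3] = { 1.0 / sqrt(2.0), 0.5, 0.5 * I };

    double sum = 0.0;
    for (int i = 0; i < 3; i++) {
        double p = creal(a[i] * conj(a[i]));    /* alpha_i * alpha_i^* */
        printf("P(|%d>) = %f\n", i, p);
        sum += p;
    }
    printf("sum of probabilities = %f\n", sum); /* the unity condition requires 1 */
    return 0;
}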

For instance, a qudit given by |φ⟩ = α|0⟩ + β|1⟩ + γ|2⟩ means three states |0⟩, |1⟩ and |2⟩ with probabilities |α|^2, |β|^2 and |γ|^2 respectively. Thus a qudit |φ⟩ = (1/√6)|0⟩ + (1/√3)|1⟩ + (1/√2)|2⟩ means that the |0⟩ state is observable with a probability of 1/6, the |1⟩ state is observable with a probability of 1/3, and the |2⟩ state is observable with a probability of 1/2.

Because the qubits (and qudits) evolve in a complex Hilbert vector space, the logic operations are rotations represented as matrices. The quantum computation evolves in this complex space, and this Hilbert space is spanned by a set of orthonormal vectors representing the observables of the quantum system. For instance, for a qubit, this basis support is given as follows:

|0⟩ = [1, 0]^T,  |1⟩ = [0, 1]^T.   (4.3)

Then the qubit given by |ρ⟩ = (1/√3)(|0⟩ + √2|1⟩) can be expressed in the vector notation on the binary support as:

|ρ⟩ = (1/√3) [1, √2]^T = (1/√3) [1, 0]^T + (√2/√3) [0, 1]^T = [1/√3, √2/√3]^T.   (4.4)

When multiple qubits are used in parallel (as in a quantum register or quantum circuit), the dimension of the Hilbert space expands exponentially. Assume that qubit a (with possible states |0⟩ and |1⟩) is represented by |Ψ_a⟩ = α_a|0⟩ + β_a|1⟩ and qubit b is represented by |Ψ_b⟩ = α_b|0⟩ + β_b|1⟩. When such qubits are put together in parallel (multiplied using the Kronecker product), the characteristic wave function of their combined states will be:

|Ψ_a⟩ ⊗ |Ψ_b⟩ = [α_a, β_a]^T ⊗ [α_b, β_b]^T = [α_a α_b, α_a β_b, β_a α_b, β_a β_b]^T
             = α_a α_b|00⟩ + α_a β_b|01⟩ + β_a α_b|10⟩ + β_a β_b|11⟩,   (4.5)

with α_a α_b, α_a β_b, β_a α_b and β_a β_b being the complex amplitudes associated respectively with the |00⟩, |01⟩, |10⟩ and |11⟩ states, and, as in the single qubit case, |α_a α_b|^2 + |α_a β_b|^2 + |β_a α_b|^2 + |β_a β_b|^2 = 1.

The quantum circuit operates on a set of qubits (quantum wires) via unitary operators - the quantum gates. Examples of quantum gates used in this work are shown in Appendix 4.8.1. Quantum gates acting on adjacent wires are multiplied using the Kronecker product (similar to qubits in parallel) and quantum gates in series are multiplied using the matrix product. Thus, a circuit built from a set of quantum gates, after multiplying the gates together, is represented by a single unitary matrix.
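The Kronecker product used in Eq. (4.5) to put qubits (and, later, parallel gates) together can be written as the following CPU reference routine; the row-major storage and the function name are assumptions of this sketch.

#include <complex.h>

/* C = A (x) B for A of size ar x ac and B of size br x bc, all stored
   row-major; C must have room for (ar*br) x (ac*bc) elements. */
void kronecker(const double complex *A, int ar, int ac,
               const double complex *B, int br, int bc,
               double complex *C)
{
    int cc = ac * bc;                      /* number of columns of C */
    for (int i = 0; i < ar; i++)
        for (int j = 0; j < ac; j++)
            for (int k = 0; k < br; k++)
                for (int l = 0; l < bc; l++)
                    C[(i * br + k) * cc + (j * bc + l)] =
                        A[i * ac + j] * B[k * bc + l];
}

For the two single-qubit states of Eq. (4.5), the call kronecker(psi_a, 2, 1, psi_b, 2, 1, psi_ab) treats each state as a 2 x 1 column vector and produces the 4 x 1 vector of the four combined amplitudes.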

116 102 Martin Lukac, Marek Perkowski, Pawel Kerntopf, Michitaka Kameyama 4.3 Representations of Quantum and Reversible Functions Quantum functions can be represented in different ways. Depending on the goal and on the structure of the quantum function, the quantum circuit can be implemented in more or less complete form, allowing for a more or less higher storage compression and computational speed increase. For instance, the matrix representation represents the circuits fully without any compression; the QMDD [39] represents only non zero coefficients, etc Quantum Circuits The most convenient way to observe the behavior of quantum operators placed on parallel wires is using the quantum circuit representation. For instance Figure 4.1 shows a simple quantum circuit with three quantum gates. H NOT Fig. 4.1: Example quantum circuit with three quantum gates The lines in Figure 4.2 represent the qudits (in this case the lines represent binary qubits) and the three gates are CNOT, H (Hadamard) and NOT. Let the two qubits be represented by initial states given by the expansion in (4.5). The circuit operation is then expressed as: CNOT(H NOT) ab = CNOT 1 ( ) ( ) ab [ ] [ ] = CNOT 1 [ 10] 10 [ ] ab = CNOT α a β b + β a β b 1000 α a β b + β a β b 0100 = CNOT α a α b + β a α b α a β b β a β b = α a α b β a α b α a α b α a β b β a α b β a β b α a β b + β a β b α a α b + β a α b α a β b β a β b = α a α b + β a α b α a α b β a α b. α a α b β a α b α a β b β a β b (4.6)
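The kind of computation carried out symbolically in Eq. (4.6) can also be checked numerically. The following self-contained C program builds H ⊗ NOT, multiplies it by the CNOT matrix and applies the result to the input state |00⟩. The assumption that the CNOT control is the upper qubit a (and the target the lower qubit b) is made only for this illustration, since it merely fixes which rows of the 4 x 4 matrices are swapped.

#include <complex.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* single-qubit gates; H is filled at run time because 1/sqrt(2) is not
       a compile-time constant */
    double complex X[2][2] = { {0, 1}, {1, 0} };
    double complex H[2][2];
    double complex CNOT[4][4] = { {1,0,0,0}, {0,1,0,0}, {0,0,0,1}, {0,0,1,0} };
    double complex HX[4][4], U[4][4], in[4] = {1, 0, 0, 0}, out[4];
    double s = 1.0 / sqrt(2.0);
    H[0][0] = s;  H[0][1] = s;
    H[1][0] = s;  H[1][1] = -s;

    /* HX = H (x) X: the Kronecker product of the two parallel gates */
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 2; k++)
                for (int l = 0; l < 2; l++)
                    HX[2*i + k][2*j + l] = H[i][j] * X[k][l];

    /* U = CNOT * (H (x) X): gates in series are multiplied */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            U[i][j] = 0;
            for (int k = 0; k < 4; k++)
                U[i][j] += CNOT[i][k] * HX[k][j];
        }

    /* apply the circuit matrix to the input state |00> = (1,0,0,0)^T */
    for (int i = 0; i < 4; i++) {
        out[i] = 0;
        for (int k = 0; k < 4; k++)
            out[i] += U[i][k] * in[k];
        printf("amplitude of |%d%d> : %f %+f i\n", i / 2, i % 2,
               creal(out[i]), cimag(out[i]));
    }
    return 0;
}

Under the stated control-qubit assumption, the program prints amplitudes 1/√2 for |01⟩ and |10⟩ and 0 for the other two basis states.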

117 4 GPU Acceleration Methods of Representations for Quantum Circuits 103 Consider a larger circuit with outputs that represent a Boolean function. The circuit represents the Toffoli gate, also known as CCNOT. This function changes the output qubit value when both of the target qubits are equal 1. A representation of this function of the Toffoli gate is shown in Figure 4.2 and its unitary matrix is shown in Eq. (4.7). a b c P Q R (a) (b) (c) Fig. 4.2: The Toffoli gate constructed from CV, CV and CNOT gates. To f f oli =CCNOT =[I CV ] [CIV] [CNOT I] [I CV ] [CNOT I] = (4.7) The logic operation (matrix) acting on a quantum state (vector) is thus performed via a matrix-vector product. For instance, computing the output of a Toffoli gate for the input state 110 is given by CCNOT 110 = 111. (4.8) The output of a quantum circuit requires measurement [43, 15]. The measurement operation projects the quantum state onto the observables of the quantum circuit. From the computational point of view the measurement is also an operator and is represented by a matrix. This means that to obtain logical values from the quantum circuit one more matrix-matrix multiplication is required. Thus one can

118 104 Martin Lukac, Marek Perkowski, Pawel Kerntopf, Michitaka Kameyama represent the input-to-observable-output computation as shown in Eq. (4.9) O = M C I, (4.9) with O being the output vector of the circuit and I being the input vector to the circuit (such as shown in Eq. (4.5). The M and C are the measurement and circuit matrix representation respectively. An example of the measurement matrix M is shown in Eq. (4.10). This matrix represents the projective measurement operator for two qubits ab and for the desired quantum state 10 ; the operator is composed from single qubit quantum state projective measurement operators [ ] M 10 = = [ 10 ] [ 0 0 1] [ 01 ] 0000 [ ] [ ] = = M M 0. (4.10) Linearly Independent Functions In [23] it was shown how to generate Linearly Independent (LI) quantum circuits using a simple matrix-vector multiplication procedure accelerated using GPU programming. In this approach, an initial circuit is given in the form of the LI function coefficients that are represented by a matrix. By manipulating the matrix, new sets of LI coefficients are obtained. These sets can be used to synthesize reversible quantum circuits. For reasons of clarity, we show the basic steps of the synthesis of reversible quantum circuits using the LI functions, however, for more details see e.g. [48, 47]. The K-Map in Table 4.1 presents a function f (A,B,C,D). We choose LIs of 2 variables. Choosing one out of sixteen possible LI Expansions for two variables provides a unique expansion for f (A, B,C, D). Table 4.1: Function of four variables. CD AB Our goal is to calculate the spectral coefficients by using (4.11) [48, 47, 23]

119 4 GPU Acceleration Methods of Representations for Quantum Circuits 105 CV = M 1 FV. (4.11) First, the functional vector (FV) of cofactors is derived from the rows given by the K-Map from Table 4.1 as shown in (4.12). The vector FV is as follows: fā B (C,D) C D FV = fāb (C,D) f A B(C,D) = C D (4.12) f AB (C,D) C. Equation (4.12) shows the development of the vector FV from the K-map of Table 4.1. The vector of functions on the right represents the cofactors for respective double-variable cofactors f A i B j(c,d) from the left. Cofactors f A 0 B0 correspond to row A = 0,B = 0 of the K-Map in Table 4.1, etc. The matrix M can be determined by selecting LIs (from all the possible LI expansions) and solving appropriate equations for all values of (A,B). Let the selected LI functions be {Ā B,B,A,1}. Replacing A and B for all possible logic values will generate four distinct LI equations whose LI coefficients can be represented in a non-singular matrix. Using (4.12), a non-singular matrix M is created. Using matrix algebra notation we obtain M CV = FV and thus M 1 FV = CV. This leads to the following matrix equation: 1111 fā B M 1 FV = CV = (C,D) SFĀ B fāb (C,D) f A B (C,D) = (C,D) SF B (C,D) SF A (C,D). (4.13) 0111 f AB (C,D) SF 1 (C,D) The equation for calculating the vector of spectral coefficient functions is now shown below: SFĀ B (C,D) 1111 C D C SF B (C,D) SF A (C,D) = C D = C D 1. (4.14) SF 1 (C,D) 0111 C D In general, the base functions on variables A and B are of an arbitrary type, and the linear combinations of cofactors on C and D are also of an arbitrary type. Thus LI is an extension from AND/EXOR logic to arbitrary binary operators. In the ternary case the modulo-3 logic is replaced with arbitrary ternary operators. The final step in this approach is to substitute the values of the variables into a previous function in the GRM form as shown below. The resulting equation can be directly mapped to an array of Multi-Controlled Toffoli gates as a special case or to generalized LI gates.

120 106 Martin Lukac, Marek Perkowski, Pawel Kerntopf, Michitaka Kameyama f (A,B,C,D)=Ā B SFĀ B (C,D) B SF B(C,D) A SF A (C,D) 1 SF 1 (C,D) = Ā B(C) B(C D) A(1) 1(D)=Ā BC BC BD A D. (4.15) Generalizing to Ternary Logic Table 4.2: Example of a ternary function of three variables. C AB The approach from Section is now extended to ternary logic. Consider the function table in Table 4.2 representing a ternary function f (A,B,C). Like in the previous example, we decide to use the LI functions of two variables. In this case, however, there are no 16 expansions but rather 3 4 = 81 ternary expansions. The vector FV is now expressed with respect to the generalized rotation literals for each variable and is given by Eq. (4.16): f A 0 B 0(C) C f A 0 B 1(C) f A 0 B 2(C) C 0,2 f A FV = 1 B 0(C) C 0,2 f A 1 B 1(C) C 1,2 f A 1 B 2(C) = C 0,1. (4.16) f A 2 B 0(C) C 1,2 f A 2 B 1(C) C +1 C +2 f A 2 B 2(C) C 0,2 Developing the vector FV from Figure 4.2. The vector of functions on the right represents the cofactors for respective double-variable cofactors f A i B j(c) from the left. Cofactors fa 0 B 0 corresponds to row A = 0,B = 0 of the function table in Figure 4.2., etc. Following the Boolean case, the matrix M can be now developed by selecting the desired LI and solving for all the values of A and B. The selected LIs are shown in Eq. (4.17)

121 4 GPU Acceleration Methods of Representations for Quantum Circuits 107 f (A,B,C)=A 0 SF A 0(C) 3 A 1 SF A 1(C) 3 B 0 SF B 0(C) 3 B 0,1 SF B 0,1(C) 3 A 0 B 0 SF A 0 B 0(C) 3 A 0 B 2 SF A 0 B 2(C) 3 A 1 B 2 SF A 1 B 2(C) 3 A 2 B 1 SF A 2 B 1(C) 3 SF 1 (C) (4.17) In this case, we obtain 3 2 = 9 equations, shown in Eq. (4.18)-(4.26) A = 0,B = 0: f =1 SF A 0(C) 3 0 SF A 1(C) 3 1 SF B 0(C) 3 1 SF B 0,1(C) 3 1 SF A 0 B 0(C) 3 0 SF A 0 B 2(C) 3 0 SF A 1 B 2(C) 3 0 SF A 2 B 1(C) 1 SF 1(C) (4.18) A = 0,B = 1: f =1 SF A 0(C) 3 0 SF A 1(C) 3 0 SF B 0(C) 3 1 SF B 0,1(C) 3 0 SF A 0 B 0(C) 3 0 SF A 0 B 2(C) 3 0 SF A 1 B 2(C) 3 0 SF A 2 B 1(C) 1 SF 1(C) (4.19) A = 0,B = 2: f =1 SF A 0(C) 3 0 SF A 1(C) 3 0 SF B 0(C) 3 0 SF B 0,1(C) 3 0 SF A 0 B 0(C) 3 1 SF A 0 B 2(C) 3 0 SF A 1 B 2(C) 3 0 SF A 2 B 1(C) 1 SF 1(C) (4.20) A = 1,B = 0: f =0 SF A 0(C) 3 1 SF A 1(C) 3 1 SF B 0(C) 3 1 SF B 0,1(C) 3 0 SF A 0 B 0(C) 3 0 SF A 0 B 2(C) 3 0 SF A 1 B 2(C) 3 0 SF A 2 B 1(C) 1 SF 1(C) (4.21) A = 1,B = 1: f =0 SF A 0(C) 3 1 SF A 1(C) 3 0 SF B 0(C) 3 1 SF B 0,1(C) 3 0 SF A 0 B 0(C) 3 0 SF A 0 B 2(C) 3 0 SF A 1 B 2(C) 3 0 SF A 2 B 1(C) 1 SF 1(C) (4.22) A = 1,B = 2: f =0 SF A 0(C) 3 1 SF A 1(C) 3 0 SF B 0(C) 3 0 SF B 0,1(C) 3 0 SF A 0 B 0(C) 3 0 SF A 0 B 2(C) 3 1 SF A 1 B 2(C) 3 0 SF A 2 B 1(C) 1 SF 1(C) (4.23)

122 108 Martin Lukac, Marek Perkowski, Pawel Kerntopf, Michitaka Kameyama A = 2,B = 0: f =0 SF A 0(C) 3 0 SF A 1(C) 3 1 SF B 0(C) 3 1 SF B 0,1(C) 3 0 SF A 0 B 0(C) 3 0 SF A 0 B 2(C) 3 0 SF A 1 B 2(C) 3 0 SF A 2 B 1(C) 1 SF 1(C) (4.24) A = 2,B = 1: f =0 SF A 0(C) 3 0 SF A 1(C) 3 0 SF B 0(C) 3 1 SF B 0,1(C) 3 0 SF A 0 B 0(C) 3 0 SF A 0 B 2(C) 3 0 SF A 1 B 2(C) 3 1 SF A 2 B 1(C) 1 SF 1(C) (4.25) A = 2,B = 2: f =0 SF A 0(C) 3 0 SF A 1(C) 3 0 SF B 0(C) 3 0 SF B 0,1(C) 3 0 SF A 0 B 0(C) 3 0 SF A 0 B 2(C) 3 0 SF A 1 B 2(C) 3 0 SF A 2 B 1(C) 1 SF 1(C). (4.26) Observe that in these equations the EXOR operation is replaced by modulo-3 addition and the Boolean AND by modulo-3 multiplication. Aligning these coefficients in a matrix we obtain the structure shown in Eq. (4.27). M = A 0 A 1 B 0 B 0,1 A 0 B 0 A 0 B 2 A 1 B 2 A 2 B 0 1 A 0 B A 0 B 1 A 0 B A 1 B A 1 B A 1 B A 2 B A 2 B A 2 B (4.27) Computing its inverse M 1 and multiplying it by the FV vector we obtain a new set of spectral coefficients. This is shown in Eq. (4.28) M FV = f A 0 B 0(C) f A 0 B 1(C) f A 0 B 2(C) f A 1 B 0(C) f A 1 B 1(C) f A 1 B 2(C) = f A 2 B 0(C) f A 2 B 1(C) f A 2 B 2(C)

123 4 GPU Acceleration Methods of Representations for Quantum Circuits 109 f A 0 B 1(C) 3 f A 2 B 1(C) C 0,2 3 C +2 f A 1 B 1(C) 3 f A 2 B 1(C) f A 1 B 0(C) 3 f A 1 B 1(C) C 0,1 3 C +2 f A 2 B 1(C) 3 f A 2 B 2(C) C 1,2 3 C 0,1 f A 0 B 0(C) 3 f A 0 B 1(C) 3 f A 1 B 0(C) 3 f A 1 B 1(C) C +2 3 C 0,2 f A 0 B 1(C) 3 f A 0 B 2(C) 3 f A 2 B 1(C) 3 f A 2 B 2(C) = C 3 C 0,2 3 C 1,2 3 C 0,1 f A 1 B 1(C) 3 f A 1 B 2(C) 3 f A 2 B 1(C) 3 f A 2 B 2(C) C 0,2 3 C 0,2 3 C +2 3 C 0,2 f A 1 B 0(C) 3 f A 1 B 1(C) 3 f A 2 B 0(C) 3 f A 2 B 1(C) C 0,1 3 C 1,2 3 C +2 3 C 0,2 C 1,2 3 C 0,1 3 C +1 3 C +2 f A 2 B 2(C) C 0,2 (4.28) Finally, plugging the new coefficients into Eq. (4.17) we obtain f (A,B,C)=A 0 (C 0,2 3 C +2 ) 3 A 1 (C 0,1 3 C +2 ) 3 B 0 (C 1,2 3 C 0,1 ) 3 B 0,1 (C +2 3 C 0,2 ) 3 A 0 B 0 (C 3 C 0,2 3 C 1,2 3 C 0,1 ) 3 A 0 B 2 (C 0,2 3 C 0,2 3 C +2 3 C 0,2 ) 3 A 1 B 2 (C 0,1 3 C 1,2 3 C +2 3 C 0,2 ) 3 A 2 B 1 (C 1,2 3 C 0,1 3 C +1 3 C +2 ) (C 0,2 ) =A 0 C 0,1 3 A 1 C 1,2 3 B 0 C +2 3 B 0,1 C 0,1 3 A 0 B 0 C 1,2 3 A 0 B 2 C 0,1 3 A 1 B 2 C 0,2 3 A 2 B 1 C +2 3 C 0,2. (4.29) This result can now be implemented using the LI blocks like in the binary case. This is shown in Figure 4.3. A B 0 A 0 A 0 A 1 A 1 B 0 B 0 B 0,1 B 0,1 A 0 B 0 A 0 B 0 A 0 B 2 A 0 B 2 A 1 B 2 A 1 B 2 A 2 B 1 A 2 B 1 A B 0 C C 0 0 C 0,1 C 0,1 C 1,2 C 1,2 C +2 C +1 C 0,1 C 0,1 C 1,2 C 1,2 C 0,1 C 0,1 C 0,2 C 0,2 C 1,2 C 1,2 C 0,2 C 0, f Fig. 4.3: A circuit representation of the solution found in Eq. (4.29) Creating LI matrices from LI matrices by operating on them. The rows of a LI matrix represent the functions of the LI basis functions family described by this matrix. For instance, Figure 4.4 presents a LI matrix of FPRM with base functions 1, ā, b and ā b. Because when we EXOR rows of the LI matrix,

we obtain another LI matrix, by EXOR-ing the functions corresponding to rows we obtain a new family of (new) basis functions.

Fig. 4.4: A spectral matrix with minterms as columns and the basis functions (1, ā, b, āb) as rows - this is a change of the basis matrix.

This EXOR-ing can be done one row at a time, as shown on the right of the matrix in Figure 4.4 (b ⊕ āb = (1 ⊕ ā)b = ab). The new family of basis functions is {1, ā, b, ab}. So we obtain a GRM from FPRM, nothing new conceptually, but this is only one example of creating new LI bases from other LI bases. Applying this method to larger matrices in all possible ways we can, however, create (in theory) any new LI family based on binary logic. For instance, Figure 4.5 creates base functions that do not exist in GRM.

Fig. 4.5: Step-by-step generation of a sequence of families of Linearly Independent functions using EXOR-ing, starting from the PPRM base {1, a, b, c, ab, ac, bc, abc}.

Many types of butterfly diagrams and recursive (tree search) algorithms can be adapted to perform this kind of processing to create new orthogonal bases. Figure 4.5 presents an example of generating families of base functions. The EXOR-ing operations are drawn as arrows from the two arguments of the EXOR operator. Thus, for instance, the new base function a(b ⊕ c) is created by EXOR-ing the base functions ab and ac. This new base function replaces the base function ac. Similarly, we can create new bases in ternary logic.
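When every row of an LI matrix is packed into one machine word (one bit per minterm column), the row-EXOR step that produces a new LI basis is a single bit-wise operation per step. The following fragment is only an illustrative sketch; the function names and the function-pointer placeholders for the row-selection and storage strategies are assumptions, not the program used later in Section 4.5.

/* Generate nb new bases by repeatedly EXOR-ing one row into another.
   pick_rows() stands for an arbitrary strategy that selects two distinct
   rows, store_basis() records the resulting LI matrix; both are assumed. */
void generate_bases(unsigned int *li_matrix, int n, int nb,
                    void (*pick_rows)(int n, int *r1, int *r2),
                    void (*store_basis)(const unsigned int *m, int n))
{
    for (int k = 0; k < nb; k++) {
        int r1, r2;
        pick_rows(n, &r1, &r2);          /* choose two distinct rows */
        li_matrix[r1] ^= li_matrix[r2];  /* elementary row operation over GF(2) */
        store_basis(li_matrix, n);       /* record the new LI basis */
    }
}

Because each step is an elementary row operation over GF(2), the matrix stays non-singular, i.e. every stored matrix is again a valid LI basis, reproducing the kind of sequence shown in Figure 4.5.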

125 4 GPU Acceleration Methods of Representations for Quantum Circuits Relational Specification of Quantum Circuits The concept of a don t care is well known in circuits synthesis. The don t care represents the fact that the output of a minterm (the cell of a Karnaugh map) can be assumed in the synthesis process to be either a logic zero or a logic one. In the case of circuits with two or more outputs, there can be imposed certain constraints on the output sets of values that cannot be formalized by the concept of don t cares. For example, this is the case when two outputs of a logic cell in a Karnaugh Map are either 00 or 11 but never 01 or 10. This leads to the concept of the generalized don t cares and Boolean Relations invented by Robert Brayton [3] and applied to decomposition of classical logic by Perkowski [45]. When applying the don t cares to the synthesis of reversible circuits, the incomplete specification is automatically constrained by the reversibility requirement. This section introduces this technological or logical constraint to the synthesis of reversible circuits: the synthesis from relational input-output specifications. Relational specifications occur when engineers design automata oracles for quantum algorithms or parts of quantum oracles [25, 26, 29, 31]. Definition 5 (Relational Specification of Reversible Function) A relational specification of an incompletely specified reversible function is a representation that contains the 0, 1 and - as well as restrictions to the specification of the dashes. This means that the function f can be specified by 0, 1 and - and then by indicating with what type of constraints the incompletely specified function will become a relational specification. If the function is to be reversible, the relational specification impose restrictions on which outputs are available for the assignment to each minterm output. Definition 6 (Relational Reversible Matrix Specification) A reversible logic realizable relational permutative matrix is a matrix that can be realized as a fully permutative matrix. A reversible realizable relational permutative matrix is such matrix that contains symbols 0, 1 and -. Such a matrix is subject to the relational constraints given by the reversible logic. A relational reversible matrix could become a standard permutative matrix (one 1 in every row and in every column, all the other symbols are 0 s) if one would substitute all symbols - with symbols 0 and 1 in a correct way. We assume that such substitution always exists, and is usually not unique. Thus, such a matrix is not an arbitrary matrix that includes don t cares symbols (dash symbols). The relational specification (RS) is a more constrained don t care. In this chapter, the RS are not resolved directly. In classical approaches, such as function embedding, the don t cares are determined prior to the synthesis process [62, 14, 40]. For instance, a function is designed using the constant 0 and the resulting design is compared to a design using the constant 1. In the proposed approach, the don t cares are filled in implicitly by the synthesis process. Thus, the don t cares are automatically allocated by the reversible gates used in the circuit.

126 112 Martin Lukac, Marek Perkowski, Pawel Kerntopf, Michitaka Kameyama An example of a relational reversible matrix specifications is shown in Figure α 0 α α/β βα 0 β β 0 A relational permutative matrix specified with two overlapping square dashes (α and β ) Fig. 4.6: An example of relational matrix specifications QMDD Contrary to the above mentioned two representations, the following representation is not designed to be GPU accelerated and we are showing it here for a comparison with the simpler and GPU efficient representations. The Quantum Multi-Valued Decision Diagram (QMDD) [39] is a decisiondiagram-based matrix representation that was created to minimize the size of the representation of the permutative circuits built from quantum and quantumpermutative quantum gates. The QMDD is a well structured representation of a quantum matrix which uses the assumption that in general most quantum circuits have certain structure. As the reversible quantum gates as well as the Controlled quantum gates have a high degree of structure, representing them in QMDD is efficient. Note that QMDD is based on the CUDD [52] and the decision diagram representation of matrices introduced in [5, 11]. The QMDD approach is also similar to the one used in Viamontes [56]. Consider the quantum gate Controlled-V which is represented by the matrix shown in Eq. (4.30) U(x 2 x 1 x 0 )= i i 2 0. (4.30) i i i i i i 2 Figure 4.7 shows how the quantum matrix from Eq. (4.30) is represented by a QMDD. The main advantage of using this approach is that once a sub-matrix is all 0 it is dropped from the representation. In the QMDD, it is important to recall

127 4 GPU Acceleration Methods of Representations for Quantum Circuits 113 that the QMDD represents a matrix not a function. Thus, unlike Binary Decision Diagrams (BDD) or Binary Decision Trees (BDT), each vertex outgoing from a node represents a coefficient for a given sub-matrix. Moreover, QMDD is used to compute a particular coefficient of a matrix and not the output of a given function. Finally, the order of the edges matters as it determines which sub-matrix is accessed. In this QMDD approach, the order of the edges is always 0, 1, 2, and 3. Thus, the left-most (0-th) edge addresses the top left matrix, the second the top-right, and so on. 1 x i x i i 1 x 1 x Fig. 4.7: The QMDD of a CV gate in a three qubit circuit, with the control qubit being the x 2 and the target qubit being the x 1. The edges from each node represent a sub-matrix given by the variable value. For instance, the edges of the node x 2 point to the left-top, right-top, bottom-left, and bottom right sub-matrices each of a dimension (4 4). Each subsequent node splits each sub-matrix in a similar manner. Once the QMDD is constructed, all of the edges having a value of 0 are removed. For instance, observe that the qubit x 2 has the 0 th edge coefficient 1 and the 3 rd edge coefficient 1+i 2. Edges 1 and 2 have a coefficient 0 implying that the top-right and bottom-left matrix are all zero matrices (eq.4.30). This means that one can look directly on the topmost node and directly decide which sub-matrices are empty and which are not. To calculate the output of x 2 x 1 x 0 = 111 one has to look to all the possible sub-matrices defined by the edges of the variables in the path starting with edge 3 of the x 2 node, then to the

128 114 Martin Lukac, Marek Perkowski, Pawel Kerntopf, Michitaka Kameyama sub-matrices defined by edges 2 and 3 of the x 1 node and finally to the sub-matrix defined by the edge 3 of the x 0 node. This gives the correct result that can be written out as shown in equation i 2 x 2 (edge3) 1 + i 2 x 2, x 1 (edge1) i 1 i x 1, 2 x 1 (edge3) i 2 x 1, x 0 (edge3) 1 1 i 2 x 2x 1 x i 2 x 2x 1 x 0. (4.31) Observe that from node x 1, two edges are computed. This is because in the column corresponding to the input state 111 non-null coefficients are present in the sub-matrices represented by edge 1 and edge QMDD Evaluation The QMDD package [39] provides standard matrix manipulation operations such as matrix multiplication, the Kronecker product, and matrix addition. In order to compare its performance within its provided functionality, the evaluation of the synthesized circuit is compared to the target circuit by using matrix multiplication and then the diagonal elements obtained are used to evaluate the circuit s correctness. The matrix multiplication will produce for two identical permutative quantum circuits an identity matrix, and thus a correct circuit will be characterized by: 1 2 n 2 n k=0 α kk = 1. (4.32) while the results will always be < 1 for any circuit not equal to the target circuit. Thus, the total number of operations to evaluate the QC using this approach is considerably smaller than using standard matrix representation, however, as will be shown, the problem is the size of the allocated memory. This evaluation is based on the fact that the multiplication of two unitary matrices U and U produces the identity matrix if these matrices U and U are equal. This is a widely well known and used fact in quantum computation [9, 43, 15].
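Written out directly on dense matrices (rather than on the QMDD), the check of Eq. (4.32) only needs the diagonal of the product of the synthesized matrix with the conjugate transpose of the target matrix. The following C sketch computes this normalized trace; using the conjugate transpose is our reading of the comparison of "two equal unitary matrices" described above, and row-major storage is assumed.

#include <complex.h>

/* Returns (1/dim) * trace(C * T^dagger) for two dim x dim matrices stored
   row-major.  For a synthesized circuit C equal to the target T the real
   part of the result is 1; smaller values indicate a mismatch. */
double complex circuit_match(const double complex *C, const double complex *T,
                             int dim)
{
    double complex trace = 0.0;
    for (int k = 0; k < dim; k++)          /* k-th diagonal element of C*T^dagger */
        for (int j = 0; j < dim; j++)
            trace += C[k * dim + j] * conj(T[k * dim + j]);
    return trace / dim;
}

The real part of the returned value equals 1 exactly when C and T are identical unitary matrices and is strictly smaller otherwise, which is the acceptance criterion described above.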

4.4 GPU Acceleration

4.4.1 GPU Micro-Parallelization

The CUDA framework was developed by NVIDIA for the growing use of the GPU for computing tasks. The acceleration implemented and discussed in this section is limited to matrix manipulations. The GPU approach to computing is based on a regular structure of the data. In other words, the GPU approach relies on the micro-parallelization of a single task by splitting a large task into many subtasks. The GPU has an SIMD architecture: it is built from a number of computational cores, and each core can compute multiple tasks in parallel. In contrast to more standard MPI-like parallelism, the GPU-based threads are much smaller and much more restricted in their capabilities. For instance, the threads cannot communicate with each other, they have a limited amount of shared local memory, and only a limited number of threads can be created on a single core at once.

The GPU comes with multiple cores and each core is organized into a two-dimensional (virtually also a three-dimensional) grid of process threads. In order to have an optimal matrix multiplication, several conditions must be satisfied. The most important one is that the grid of the computational processes should be allocated optimally. It means that the matrix size should be a multiple of the number of parallel computational processes. The second constraint of similar importance is that, in order to do GPU computation, the data must be transported from the main memory to the GPU memory using the pci-x data bus. As this bus is also used for other devices on the computer, it is possible to saturate the bandwidth and thus actually slow down the computation. The pci-x bus is the replacement of the standard pci bus and is used to communicate with peripheral devices such as USB, serial or hard-disk controllers. The high amount of data transmitted during computation on this bus can interfere with the system data being transmitted on this bus and thus can cause an overall slowdown of the computational process. To save the pci-x data bus bandwidth, the computation of the circuit matrix representation is done in two steps:

1. Compute the matrix of the circuit by sending all the required parallel blocks (matrices) one by one to the GPU.
2. Return the computed resulting matrix.

4.4.2 Quantum Circuits Accelerated with GPU

As discussed in Section 4.3.1, the size of a quantum circuit synthesized from arbitrary quantum gates grows exponentially in the number of states and the number of computational steps. It is advantageous to use a distributed method of computation. Figure 4.8 shows where the CUDA acceleration is used.
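Before Figure 4.8, the two-step scheme above can be made concrete with a short CUDA sketch: a kernel in which each thread computes one element of a matrix-matrix product (the arrangement described further below), and a host loop that sends the parallel-block matrices one by one and copies only the final result back. All names, the use of cuDoubleComplex from cuComplex.h, the assumption that the device buffers were allocated with cudaMalloc beforehand, and the omission of error handling and of the initialization of the accumulator to the identity matrix are simplifications of this sketch.

#include <cuComplex.h>

/* one thread computes one element of C = A * B (dim x dim, row-major) */
__global__ void matmul_kernel(const cuDoubleComplex *A, const cuDoubleComplex *B,
                              cuDoubleComplex *C, int dim)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < dim && col < dim) {
        cuDoubleComplex acc = make_cuDoubleComplex(0.0, 0.0);
        for (int k = 0; k < dim; k++)
            acc = cuCadd(acc, cuCmul(A[row * dim + k], B[k * dim + col]));
        C[row * dim + col] = acc;
    }
}

/* host side of the two-step scheme: step 1 sends every parallel-block matrix
   to the GPU and folds it into the accumulated circuit matrix, step 2 returns
   the final matrix in a single transfer; the multiplication order of the
   blocks follows the serial order of the circuit */
void compute_circuit_matrix(const cuDoubleComplex *blocks, int k, int dim,
                            cuDoubleComplex *dev_block, cuDoubleComplex *dev_acc,
                            cuDoubleComplex *dev_tmp, cuDoubleComplex *result_host)
{
    size_t bytes = (size_t)dim * dim * sizeof(cuDoubleComplex);
    dim3 threads(16, 16);
    dim3 grid((dim + 15) / 16, (dim + 15) / 16);

    for (int b = 0; b < k; b++) {
        /* step 1: one parallel block per transfer */
        cudaMemcpy(dev_block, blocks + (size_t)b * dim * dim, bytes,
                   cudaMemcpyHostToDevice);
        matmul_kernel<<<grid, threads>>>(dev_block, dev_acc, dev_tmp, dim);
        cudaMemcpy(dev_acc, dev_tmp, bytes, cudaMemcpyDeviceToDevice);
    }

    /* step 2: return the computed resulting matrix */
    cudaMemcpy(result_host, dev_acc, bytes, cudaMemcpyDeviceToHost);
}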

130 116 Martin Lukac, Marek Perkowski, Pawel Kerntopf, Michitaka Kameyama I SW I I CUDA Multiplication H I CUDA Multiplication I SW I I CUDA accelerated Kronecker multiplication Fig. 4.8: Schema representing the usage of the CUDA accelerator in the computation of a quantum circuit matrix representation: (a) the quantum circuit to compute, (b) the quantum circuit where each gate is shown as a labeled block and separated into serial blocks (labeled 1,2,3). In order to use the GPU processor appropriately, we implemented an optimized matrix multiplication as well as the Kronecker multiplication. The matrix-matrix multiplication is implemented in such a manner that each computational thread computes one element from the matrix; i.e. it computes 2 n multiplications and 2 n 1 additions (inner vector products). In this manner, each matrix of an arbitrary size can be accelerated up to being computed in a single step. This is the case when the number of coefficients (elements) in the resulting matrix is equal to the number of possible computational threads created on the GPU cores. Matrix multiplication is well suited to maximize the CUDA usage when dealing with a quantum circuit partitioned into parallel blocks (Figure 4.8). In general, reprogramming the CUDA device can slow down the overall performance, and thus ideally, one wants to keep the configuration unchanged during the computational process. This allows one to maximize the ratio between the memory-gpu data transfer and the GPU operations. The Kronecker multiplication is more difficult to implement because it is mostly used to multiply matrices of different sizes and thus there is either a significant amount of unused memory on the GPU device or more parameterization is required. As shown in Figure 4.8, any quantum circuit can be parsed into a set of parallel blocks. Each parallel block is a sub-circuit with the same number of input and output qubits as the overall circuit. This means that to obtain a matrix of a quantum circuit we require k n r n multiplications and k r n additions for a circuit that is built from k parallel blocks, on n qubits with radix r. On top of this, each parallel block is built from a set of quantum gates. Within a single block, the quantum gates - placed in parallel - have to be built into a single parallel block matrix using the Kronecker multiplication. Thus for j qubits in a circuit and assuming i quantum gates in the parallel block, the total number of element by element multiplication is given by the following recursive equation:

131 4 GPU Acceleration Methods of Representations for Quantum Circuits 117 r p+1 = r p + c(q p ) c(q i ) c(q h ); p,i < j, (4.33) h with r p representing the number of multiplications done until the p th step, the function c(q p ) being the function returning the number of element in the matrix q p and q p, q i are two matrices multiplied by the Kronecker product. The rightmost expression h c(q h ) represents all the remaining matrices in the parallel block beside q p and q i. Thus, for a circuit with four qubits and four single qubit gates the total number of multiplications in a Kronecker multiplication is 4 (4 (4 4)) = 4 4 which is the maximum. In both cases of the application of the CUDA acceleration, it is important to decide where to use GPU acceleration and where not. The ideal application is such that, based on the size of the matrices it will either do the computation on the host (CPU) or, in the case of larger matrices, will perform the computation on the device (GPU). Thus the optimal GPU acceleration can be summarized in four constraints [24]: Representation - the size of the circuit representation directly impacts the speed of its computation (both the number of quantum gates used and the number of qubits). Computation and Evaluation - The computation of the matrix product and Kronecker product. For instance, if measurement is used, additional vector-matrix products are required. Representation Creation - The creation of the representation. This metric is mainly dependent on what type of gates have been used (Structured vs. Unstructured Quantum Gates). Data transfer to Memory - The data representation might need the transport of the data to the shared memory (as in the case of CUDA), depending on the representation. This must be minimized in order to allow the fastest possible computation The upper limit of the GPU accelerated computation obtained using this approach is the possibility of circuits of up to 10 qubits when using complex double precision scalars [23] to be synthesized and up to 20 qubits when using integers [25] to be synthesized. These quantum circuits are arbitrary and without any structure and thus cannot be efficiently accelerated using QuIDD or QMDD. On the other hand, the lower bound is given by matrices of such sizes where the GPU acceleration does not improve the speed of computation. From Table 4.4, the lower limit is given by 10 qubits while the upper limit is 26 qubits in this case. In the case of quantum circuits - when complex coefficients with double precisions are used in the matrix representation - the lower limit is 6 qubits and the upper limit is 11 [23]. 4.5 GPU Accelerated Parallel Algorithm for Computing LIs As described in Section 4.3.2, the representation used in the LI-based logic synthesis is matrix based. As such, it is currently impossible to process very large functions

132 118 Martin Lukac, Marek Perkowski, Pawel Kerntopf, Michitaka Kameyama represented as matrices on common computers. For this purpose we implemented a simple GPU based matrix processor that allows us to evaluate a quantum circuit faster by using parallel matrix computation. This approach illustrates how even simple parallelization of very simple tasks can make the presented approach competitive for circuits of larger sizes. To compute the LI functions and expansions we used the CUBLAS [44], which is a CUDA [44] GPU implementation of the BLAS [44] library. This library provides a set of optimized functions for linear mathematics computation and thus is ideally suited for use for the computation of the benchmarks. However, there is no direct implementation in CUBLAS of matrix multiplication using such operators as modulo 2 or modulo 3, so the testing was performed using a library of matrix operations built specifically for this purpose. The code is based on the sample software provided on the CUDA web site and was modified and optimized for our application. The advantages of this library are mainly given by the fact that we have complete control over its process and execution (transparency) as well as the configuration. Finally, preliminary tests showed that for our application, our library was as fast as the one provided by CUBLAS and thus all the benchmarks are done using our implemented library. The program used works as follows: 1. The program takes the following inputs: a. The number of variables in the input function - n, b. The number of base functions to be computed - nb, c. The input function (in the form of K-map), Example: The function f represented by the truth-vector F =[0,1,0,1;0,0,1,1;1,1,0,0;1,0,0,1] is the K-map representation of the function from Table 4.1. d. the initial base function (in the form of binary matrix M) Example: m1 =[1,0,0,1;0,1,0,1;0,0,1,1;0,1,1,1]. 2. The program computes nb number of base functions by operating on the given base functions, m1 (by EXOR-ing arbitrarily selected columns as explained in sections 4.2 and 4.3) and stores them in database in a three-dimensional m matrix of dimensions (nb n n) (nb times (n n) matrices), 3. The program computes the inverse of all matrices in m and stores them in the database, represented by the mi matrix of same dimensions as matrix m. each inverse is then verified by m mi = I. 4. The program finally computes m(i) f and stores the results in SF(i), where SF is a matrix of same size as m and mi. The usage of a GPU in the computing of our benchmarks is described by the following pseudo code: 00: t->t0: Allocate required Device memory space (M,Mˆ-1,FV,SF) 01: t1: Load Data to Graphic Global Memory (M, FV)

The usage of a GPU in the computation of our benchmarks is described by the following pseudo code:

00: t0: Allocate required device memory space (M, M^-1, FV, SF)
01: t1: Load data to GPU global memory (M, FV)
02: t2: While (end condition not met) do
03:       Compute M^-1 (CPU)
04:       (Send M^-1 to GPU device)
05:       Compute SF = M^-1 * FV
06:       (Send SF to host memory)
07:       Choose two columns C1 and C2 from matrix M
08:       Compute element-wise C1 EXOR C2 and replace C1
09: End While

Following the established nomenclature when dealing with a GPU computing device, there are two types of memory: the device memory is the memory located on the GPU-accelerated device, while the host memory refers to the main computer memory. In line 00, a fixed amount of device memory is allocated; it will contain the M and M^-1 matrices as well as the FV vector and the SF coefficients. Line 01 initializes the M and FV data structures with the initial data. Lines 02 to 08 form a loop that computes SF using the current M^-1 (lines 03-05), sends SF to the main host memory (line 06), and generates a new matrix M (lines 07 and 08). All matrix operations except the inversion are executed on the GPU. The additional operations executed on the CPU are the storage of the calculated SF vectors and the evaluation of the end condition. The EXOR-ing of the columns has been tested on both the CPU and the GPU, but the number of computations and the time required to execute this operation are exponentially smaller than for the calculation of the SF coefficients, and it is thus not considered a high-priority problem.

Experiments and Results

We ran our algorithm for a selected set of LIs for both Boolean and ternary logic. The simulation was done using one set of selected LIs for ternary logic and one set of selected LIs for binary logic. These sets are shown in Figure 4.9. For each case, four arbitrarily generated initial functions have been used. For both logic radices, functions of up to 6 bits have been designed. The reason for limiting the simulation to 6 bits is that many matrices of LI coefficients become singular, and thus it is not always easy to generate invertible matrices with LI coefficients. For each of the arbitrary initial functions, 64 functions have been generated. Each function has been generated from the original one by EXOR-ing two distinct columns of the LI matrix. When generating a new LI matrix from the previous one, only the standard EXOR operation was used, because the matrix of the LI coefficients represents the information in Boolean values. The EXOR-ing operation was generalized to addition mod radix, so that it can be applied to logic of an arbitrary radix, only in the case of the multiplication of the inverted LI matrix with the matrix of the function.

Observe that the overall procedure can be reformulated according to a given cost.

For instance, in the case of a function described by a set of local multi-valued operators, some operators can have higher values than others. In such a case, the synthesis algorithm selects only those functions that have minimal cost requirements.

LI(2) = … ,  LI(3) = …      (4.34)

Fig. 4.9: The LIs for binary as well as for ternary logic.

Table 4.3 shows an example of the cost function assigned to the synthesized functions. In this case, the functions are sorted according to whether they are balanced, or whether their truth tables contain more zeros or more non-zero elements. The row labeled Ones gives the number of functions in which the number of non-zero elements is larger than the number of zero elements. For instance, in the Boolean case the non-zero elements are only ones, whereas in the ternary case the value of a non-zero element can be either 1 or 2. Other cost functions can easily be created based on the needs of the designer; for instance, the generated LI functions can have their cost calculated based on the number of occurrences of each LI in the generated function, and so on (a small sketch of this classification is given after Table 4.3 below).

Table 4.3: An example of the classification of the generated LI-based functions.
Category/Property   3(2)   4(2)   5(2)   6(2)   3(3)   4(3)   5(3)   6(3)
Balanced
Ones
Zeros
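As an illustration of the Balanced/Ones/Zeros classification used in Table 4.3 (the authors' code is not given, so this is only a sketch), the category of a generated function can be derived from its truth vector as follows; it works for any radix, since in the ternary case a non-zero entry may be 1 or 2.

#include <cstddef>
#include <string>
#include <vector>

// Classify a truth vector as in Table 4.3: "Balanced" when zero and non-zero
// entries are equally frequent, "Ones" when non-zero entries dominate,
// "Zeros" otherwise.
std::string classify(const std::vector<int>& truthVector) {
    std::size_t zeros = 0, nonZeros = 0;
    for (int v : truthVector)
        (v == 0 ? zeros : nonZeros)++;
    if (zeros == nonZeros) return "Balanced";
    return nonZeros > zeros ? "Ones" : "Zeros";
}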

To evaluate the performance of the GPU usage, we also ran a slightly different set of benchmarks on the CPU and the GPU. The main difference from the initial benchmarks is that only arbitrarily generated binary functions were used, together with larger matrices of LI coefficients. These benchmarks are designed to take maximum advantage of the GPU acceleration and thus represent ideal conditions of GPU usage. The comparison is shown in Table 4.4.

Table 4.4: The binary benchmarks comparison.
Bits   LI Terms   GPU (CUDA) (ms)   CPU (ms)

The first column of Table 4.4 shows the number of bits in the base function, the second column shows the number of terms in the LI expansion, the third column shows the computation time using the GPU-accelerated routine, and the last column shows the computation time using the standard CPU-based computation. The experiments in this section are limited to arbitrary functions of 26 bits for the following reasons: the CPU matrix computation was not able to accept matrices larger than 26 bits, and the time required to calculate the matrix inverse on the CPU grows exponentially (as expected) with an increasing number of bits. The matrix operations executed in these benchmarks are always the multiplication of two matrices of sizes M^-1 (2^{mi_b} × 2^{mi_b}) and FV (2^{mi_b} × 2^{b_f − mi_b}), where mi_b is the number of bits in the LI expansion (and in the M^-1 matrix) and b_f is the number of bits on which the function is defined. Thus, for instance, the computation of the SF coefficients of a function with 20 bits and an LI expansion over four bits is done using a multiplication of matrices of dimensions (16 × 16) and (16 × 65536); a small helper reproducing these dimensions is sketched at the end of this passage.

The reason for using only binary benchmarks for testing the performance difference is that binary benchmarks create matrices of size (2^n × 2^n), which is the most natural geometry and shape to be represented on a GPU device. This is because the GPU device is a regular SIMD structure organized in a square-like net. Each core in the GPU can execute a set of blocks of threads, and each block can contain up to a certain number of threads. This means that, to fully test the GPU under ideal conditions, matrices with dimensions based on 2^n can fully fill the GPU processor and thus take maximum advantage of the acceleration.
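The following small C++ helper is only an illustration (not the authors' code); it computes the operand dimensions of the SF product for a function on b_f bits with an LI expansion over mi_b bits, reproducing the (16 × 16) · (16 × 65536) example above.

#include <cstdint>
#include <iostream>

struct SfProductDims {
    std::uint64_t mInvRows, mInvCols;  // M^-1 is 2^mi_b x 2^mi_b
    std::uint64_t fvRows,   fvCols;    // FV   is 2^mi_b x 2^(b_f - mi_b)
};

SfProductDims sfDims(unsigned b_f, unsigned mi_b) {
    const std::uint64_t side = 1ull << mi_b;
    return {side, side, side, 1ull << (b_f - mi_b)};
}

int main() {
    SfProductDims d = sfDims(20, 4);   // 20-bit function, LI expansion over 4 bits
    std::cout << "(" << d.mInvRows << "x" << d.mInvCols << ") * ("
              << d.fvRows << "x" << d.fvCols << ")\n";   // (16x16) * (16x65536)
}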

The generated functions can be used as a library of LI blocks selected from a set of gates with equivalent costs. Because these circuits are all generated from linearly independent elements, each of them can, for instance, be realized in quantum computing. In such a case the importance of this method increases even more, as such a cell-based approach is well suited for systems that require programming over many qubits. Thus, methods using simple operators allow us to quickly design larger functions at the cost of additional qubits. Moreover, as quantum computation naturally evolves (computationally) in a multi-valued complex Hilbert space, the extension of this approach to more dimensions allows us both to design Boolean quantum circuits with a smaller number of quantum gates [19] and to quickly design multi-valued quantum circuits.

Finally, the proposed approach can be generalized into a recursive approach. This is shown in Figure 4.10, which shows that, using the proposed approach, one can build the desired circuit by always applying a two-bit controlled modulo-3 addition gate to the target bit.

Fig. 4.10: The generalization of our approach. An n-variable function is decomposed into a sequence of blocks, where each block is built from LI_k and LI_k^-1 and from SF_k and SF_k^-1.

Thus, for a selected set of variables and a set of generated LIs, a simple synthesis algorithm is given as follows:

1. For a given LI, select the least expensive LI function realizing the desired n-variable function.
2. Apply it in a pair, LI_k and LI_k^-1. The output bit between these two functions is the upper control bit of the modulo-3 addition gate.
3. Apply the remaining SF function, also in a pair SF_k and SF_k^-1. The output bit between these functions is the lower control of the modulo-3 addition gate.
4. Apply the target modulo-3 addition operation on the controlled bit.
5. Repeat steps 1-3 for each LI in the decomposition until the function is realized.

Conclusion

In this section, we presented the methodology and results of using a GPU with the CUDA libraries to accelerate the synthesis of quantum circuits. The reported results illustrate that a GPU is a good tool for reducing the time required for the synthesis of fully quantum-probabilistic circuits (non-structured quantum circuits). However, it might not be the most adequate approach for more structured circuits [24]. The designed circuits have all been synthesized using our CUDA library, which provides most of the standard matrix operations required for the simulation of quantum circuits. The computational libraries used have been implemented only to the extent required by the various applications, and our future goal is to complete a GPU-accelerated quantum computational library with all the required operations, such as matrix inversion. This will allow an easy integration of the GPU technology into quantum and reversible circuit synthesizers. Another goal is to develop parallel algorithms for structures such as the QMDD [16, 39] or QuIDD [56] that could greatly benefit from a low-level parallelization.

4.6 Synthesis of Reversible Cascades from Relational Specifications

In this section we present a search-based algorithm to synthesize reversible functions from relational specifications - circuits with no ancilla bits, using arbitrary permutative gates. This is a tree search method, which means the circuit is created from outputs to inputs, as in most reversible circuit synthesis methods, and at every stage we create several nodes, successors of the previous-level node. Each of these nodes represents the choice of some gate from the library. Each branch of the tree that terminates with an identity is a solution. In this sense, the method of this section is similar to the methods from the previous sections. Similarly, one can find equivalent methods to synthesize from relational specifications that are counterparts of the other methods presented so far. In order to make the tree search approach computationally tractable, several heuristics to decrease the size of the search tree are presented.

The Algorithm

The synthesis problem for a relational specification is formulated as follows:

1. Given is a relational specification in the form of a reversible-logic-realizable relational permutative matrix U.
2. Given is a library LIB of gates (unitary matrices) to be used in the synthesis. These gates are described by matrices of size (2^k × 2^k), where k is the width of the cascade; we say that the gates are for the whole width of the cascade. This approach simplifies the synthesis but is not applicable to large cascades. Implicitly, the size specification 2^k × 2^k means that the LIB set of input gates does not necessarily contain truly k × k reversible gates, because LIB is limited to quantum primitives, which include gates of at most two or three qubits.
3. Find the cascade C of gates from the library LIB that realizes the given relational specification. This means that cascade C has a permutative matrix M(C) such that M(C) matches matrix U on all cares, i.e., for every symbol 0 or 1 the matrices U and M(C) are the same. At the same time, the symbols from M(C) which are not 0 or 1 are replaced with binary symbols which correspond to the realized circuit.

Example 4.1. Given is the relational specification of the cascade in the form of a relational permutative matrix U. The first step of the algorithm is to replace every dash symbol, i.e. '-', with a unique variable. This is shown at the right side of Figure 4.11: four variables are introduced, one for each dash symbol.

Fig. 4.11: Relational permutative matrix U for Example 4.1, with each dash replaced by one of the variables x, y, u, v.

The next stage of the presented algorithm is to assume that our relational specification U (matrix U, which formally is not permutative at this stage) is a composition of a library gate (matrix) Y and a remainder circuit (matrix) X. Because we synthesize the cascade from outputs to inputs, gate Y is on the right and the remainder circuit X is on the left. Note that gate Y is a specific gate from the library, so it has a definite permutative matrix Y. The remainder circuit X, however, is a relational specification, so its matrix still contains unknown symbols and is not a permutative matrix; we want to make it permutative in the process of our synthesis.

Based on the explanation above, we assume the general decomposition setup shown in Figure 4.12. Here Y is the selected gate from LIB (and its permutative matrix) and X is the remainder (residue) reversible function (circuit, specification, etc.) that results from selecting gate Y for the initial function specification U. From now on, we will use the name of a gate and the matrix representing it interchangeably.

Fig. 4.12: The general decomposition of unitary matrix U into a sequential composition of blocks X and Y.

The relational specification U is decomposed into a composition of gate Y, which comes from our library of gates LIB, and the remainder specification X. This procedure is repeated by decomposing X in the same way, until a remainder X_n is found whose matrix can be completed to an identity matrix. This completes the synthesis. Note that gate Y is only one of the gates from the gate library. We can select any gate, all gates, or some gates that optimize some cost function, such as a Hamming distance.

Any method presented so far can be adapted for this. Therefore, in general, the method in this chapter is a tree search method, and we show only a few branches of this tree here. Assume that in this example the algorithm selected the gate Y shown in Figure 4.13.

Fig. 4.13: Realization of gate Y from the library of gates LIB. A Feynman gate with EXOR up was assumed here as gate Y. The library stores every gate as its matrix, name and schematics.

This means the algorithm selected a CNOT gate with EXOR up as the gate from the library LIB. Please note that LIB is an arbitrary library of cells, not only Toffoli or Feynman gates, though it must be a universal library, which means that any function can be realized with the gates of this library (in our general algorithm possibly also using ancilla bits; in this chapter, however, ancilla bits are not used). The permutative matrix Y = Y^-1 of the gate from Figure 4.13 is shown in Figure 4.14.

Fig. 4.14: Permutative matrix Y = Y^-1.

Please observe that the matrix from the library LIB and its inverse matrix are the same. This is a property of the gates that we used. At this point, having the specification matrix U and the selected gate Y, it is possible to use matrix calculus and find the matrix of X. This matrix is symbolic, not yet permutative. Now we will prove the mathematics of the presented method. Assume

Y X = U,      (4.35)

and multiplying both sides of the equation by Y^-1 we obtain

Y^-1 Y X = Y^-1 U.      (4.36)

Thus, we obtain

X = Y^-1 U.      (4.37)
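Because Y is permutative, Y^-1 is simply its transpose, so computing the remainder X = Y^-1 U in Eq. (4.37) only permutes the rows of U and can be done even while U is still symbolic. The following C++ sketch is an illustration of this observation, not the chapter's implementation.

#include <cstddef>
#include <string>
#include <vector>

// Y is a permutation matrix given in vector form: perm[j] is the row index of
// the single 1 in column j.  Since Y^-1 = Y^T, row i of X = Y^-1 * U is simply
// row perm[i] of U.  Entries of U are kept as strings ("0", "1", "x", ...)
// because the specification may still contain symbolic variables.
using SymMatrix = std::vector<std::vector<std::string>>;

SymMatrix remainder(const std::vector<std::size_t>& perm, const SymMatrix& U) {
    SymMatrix X(U.size());
    for (std::size_t i = 0; i < U.size(); ++i)
        X[i] = U[perm[i]];   // row i of X is row perm[i] of U
    return X;
}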

The obtained matrices are shown in Figure 4.15.

Fig. 4.15: The synthesis based on matrix multiplication starting from the incomplete specification U of Example 4.1.

The algorithm has found the relational specification of the (new) remainder function X. By applying gate Y to specification U and starting from the outputs of the circuit, the algorithm creates the relational specification of the remainder circuit, obtained in the form of matrix X. Now the algorithm has to either complete the search in this branch or find a new gate Y_1 for decomposition and continue iteratively as explained above.

Search Heuristics

To complete the specification matrix X to a permutative matrix, the algorithm uses a specific heuristic, which can be phrased as follows: increase the number of 1s on the matrix diagonal. This means that the variable symbols (x, y, u, and v are our variable symbols in this example) should be substituted with care (binary) symbols 0 and 1 in such a way that the number of 1s on the diagonal of the matrix is as high as possible. This heuristic applied to matrix X above produces the following chain of substitutions of binary values for the symbolic variables:

x = 1, y = 0, u = 0, v = 1.      (4.38)

Thus, the algorithm has created a permutative matrix X. Now the library LIB is searched for the existence of a gate with this matrix. If such a gate exists, it is applied in the circuit and the identity matrix is created, which means the algorithm terminates. In our case, when the matrix X of Figure 4.15 is found, it is next found in the library LIB as a CNOT gate with EXOR down. Matrix X and its corresponding circuit are shown in Figures 4.16a and 4.16b; the final solution - the reversible cascade corresponding to U - is shown in Figure 4.16c. Observe that at this synthesis stage the matrix X was recognized as a library gate X. Otherwise, if the matrix did not correspond to any gate in the library, it would be further decomposed, as explained above, and the decomposition would continue until the cascade is completed. Observe also that the solution to a symbolic matrix X, like the one from Figure 4.15, is usually not unique, and we can find many solutions. Another solution of the matrix X from Figure 4.15 is the following:

y = 1, x = 0, v = 0, u = 1.      (4.39)

Fig. 4.16: (a) The unitary matrix of gate X found by the algorithm, (b) the circuit corresponding to the unitary matrix X, (c) the complete circuit synthesized for the original specification U = YX.

The circuit from Figure 4.16(c) corresponds to the decomposition scheme of Figure 4.12. The substitution (4.39) transforms the symbolic matrix X into the permutative matrix X shown in Figure 4.17.

Fig. 4.17: Permutative matrix X for the symbolic (relational) matrix X.

The above matrix X corresponds to the library gate X (the circuit of Figure 4.19). This circuit is explained in Figure 4.18: the KMap for PQ is created directly from the matrix X in Figure 4.17 and then separated into KMaps for P and for Q.

Fig. 4.18: KMap for PQ, first together, then separated into KMaps for P and for Q.

Fig. 4.19: The second solution circuit obtained from the symbolic matrix in Figure 4.15.

In this example the algorithm thus found two solutions: the cascade from Figure 4.16c, and the cascade from Figure 4.19 followed by the gate from Figure 4.13. In general, we have two tree search processes: one process solves the equations with symbolic variables and the second process selects gate Y, as in the previous sections.

Tree Minimization

Fig. 4.20: The Toffoli gate in all possible configurations over three input wires: (a) Toffoli c, (b) Toffoli b, (c) Toffoli a.

Given a set of gates from the library LIB, the generated tree can grow to intractable dimensions. For instance, using only a Toffoli gate in all three configurations, as shown in Figure 4.20, the number of possible Toffoli gates grows according to

#(Toffoli_j)_p = [p − (T_I − 1)] · T_I + P(T_I, p),      (4.40)

where p is the number of bits (the width of the cascade), T_I is the number of inputs of the Toffoli gate, and P(T_I, p) is the number of permutations of T_I on p, counting how many different placements of the control bits can be made using a single type (Figure 4.20(a), (b) or (c)) of Toffoli gate. This equation can be modified for other types of gates, such as CNOT or SWAP, as well. Thus, using Toffoli gates in the three possible configurations (Figure 4.20), the set of available gates after expansion to four bits is 9, and after expansion to 5 bits the number is 15. This means that the search tree will grow by a factor of 9 at each level for a circuit of 4 bits when using a single Toffoli gate.

In order to reduce the size of the tree, several heuristics can be used. First, because the circuit length is unknown, the search evaluates the current circuit at every node. Therefore, for every sub-tree in the overall tree, at every node identities can be removed in real time. The result of this removal is that every branch that has a child at leaf k equal to the parent is cut without examination, because the multiplication will move the circuit two steps back in the search tree.
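A minimal sketch of this identity-removal pruning during node expansion might look as follows (illustrative only; the gate indexing is an assumption). It relies on the fact that the library gates used here are their own inverses, so applying the same gate twice in a row contributes only an identity.

#include <vector>

// Expand a search-tree node: propose every library gate as a child, except a
// repetition of the gate just applied, since G followed by G equals the
// identity and only moves the search two steps back.  Gates are identified by
// their indices into LIB.
std::vector<int> expandNode(const std::vector<int>& pathSoFar, int librarySize) {
    std::vector<int> children;
    const int last = pathSoFar.empty() ? -1 : pathSoFar.back();
    for (int g = 0; g < librarySize; ++g)
        if (g != last)               // prune the immediate G-G repetition
            children.push_back(g);
    return children;
}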

This is shown in Figure 4.21 by the IF nodes and the arcs.

Fig. 4.21: The search tree.

GPU Acceleration

In order to allow the search of as much of the tree as possible, we accelerate the computation of the circuit matrices using the GPU. The acceleration simply computes the matrix multiplication on the GPU device by parallelizing it using the standard approach. A graphical representation of the multiplication is shown in Figure 4.22: Figures 4.22(a) and (b) show the two input matrices, and the output matrix is shown in Figure 4.22(c) as the product decomposed into the threads of the GPU. Each thread executes a small task, and each thread has access to a very fast shared memory in order to perform the parallel task in minimum time.

Fig. 4.22: Graphical representation of the GPU matrix-matrix multiplication.
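The thread decomposition of Figure 4.22 corresponds to the standard CUDA pattern in which every thread produces one element of the product. A minimal kernel of this kind is sketched below (our illustration, not the chapter's code, and without the shared-memory tiling mentioned above).

// Each GPU thread computes a single element of C = A * B (n x n, row-major).
__global__ void matmulNaive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

// Cover the n x n output with a grid of 16 x 16 thread blocks.
void launchMatmul(const float* dA, const float* dB, float* dC, int n) {
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    matmulNaive<<<grid, block>>>(dA, dB, dC, n);
}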

With respect to Figure 4.21, the acceleration is executed at every node of the tree. Naturally, the tree is very large, as both the depth and the width can grow exponentially with the number of input gates. Moreover, because the gates represent relational specifications, a very large number of candidates are rejected on every level of the tree. This means that it is not efficient to repeatedly send the input gates to the GPU; storing them there instead can greatly improve the performance. Of course, the more inputs the circuit has, the larger the matrices representing every gate are. In fact, depending on the number of input gates and the size of the circuit, one can estimate how many gates can be stored in the global memory of the GPU (see the sketch after the list below). Because all of the gates used in this approach are represented by (2^n × 2^n) matrices (with n being the number of input/output qubits), it is possible to exactly estimate the size required for storage in the global memory. For instance, a circuit with five input qubits is represented by a (2^5 × 2^5) = (32 × 32) matrix that contains 1024 coefficients. Thus, a single matrix of complex coefficients in double precision requires 1024 · 16 = 16 KB. Assuming that the memory on the GPU device that can be allocated is 1 GB, it is possible to store 65K gate matrices. This may seem to be a large number, but it is quickly exhausted when a large number of gates is required. For instance, the use of 10 different two-qubit input gates requires 10 · 20 = 200 gate matrices, because each two-qubit gate can be placed in 10 different positions in a 5-qubit circuit and in 10 further positions when the control and the output bits are inverted. This can be generalized by permutations: the number of positions of a k-qubit gate in an n-qubit circuit is equal to n!/(n−k)!. The exact amount of memory required to keep the circuits is then given by 8 · n!/(n−k)!; for example, 100 circuits of 5 qubits in a 10-qubit circuit would require approximately 24 MBytes.

In order to save the bus bandwidth:
- store all matrices from the beginning in the GPU global memory,
- compute the circuit matrix only on the GPU,
- compute the error of the circuit on the GPU,
- send back to the CPU only the scalar value.
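The two counting arguments above can be condensed into two small helpers (an illustration only, under the same assumptions as the text: 16-byte complex double coefficients and ordered gate placements).

#include <cstdint>

// Device storage for one gate matrix on n qubits: (2^n x 2^n) complex
// double-precision coefficients of 16 bytes each; for n = 5 this is
// 1024 * 16 = 16 KB, as in the text.
std::uint64_t gateMatrixBytes(unsigned n) {
    const std::uint64_t dim = 1ull << n;   // 2^n
    return dim * dim * 16ull;
}

// Ordered placements of a k-qubit gate on n wires: n!/(n-k)!,
// e.g. placements(5, 2) = 20 and placements(10, 5) = 30240.
std::uint64_t placements(unsigned n, unsigned k) {
    std::uint64_t p = 1;
    for (unsigned i = 0; i < k; ++i) p *= (n - i);
    return p;
}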

The accelerated search tree is shown in Figure 4.23.

Fig. 4.23: The GPU-accelerated search tree for the relationally specified quantum circuits. The operation is executed on the GPU.

Experiments and Results

Table 4.5 shows the results obtained using the proposed approach. As mentioned in the introduction, the tree search approach using matrices is computationally intensive and as such it can be tested only on circuits of relatively small size with a small number of input gates. However, despite these limitations, when properly configured and with good heuristics the tree search is a viable method for circuit synthesis, because it allows, for the first time, reversible circuits to be synthesized constructively from relational specifications.

Table 4.5: The results of the tree search algorithm.
Name       Inputs  Outputs  Embedding  Def         # of Gates
miller     3       3        no         complete    10
majority                    yes        complete    10
4mod5      4       1        yes        complete    15
sf                          no         incomplete  20
4gt                         yes        incomplete  20
4gt                         yes        incomplete  20
4gt                         yes        incomplete  20

The first column in Table 4.5 is the name of the benchmark; the second and third columns are the numbers of input and output bits, respectively. The fourth column indicates whether embedding (adding constants) was necessary to make the function reversible, and the fifth column shows whether the definition of the function was complete or not. The sixth column gives the number of gates with which the circuit was realized.

Observe that the results presented are not optimized in the number of gates (examples of the circuits can be seen in Figures 4.24 and 4.25). The results are presented to illustrate that the presented tree search used a depth-first approach, and thus the chance of finding a correct circuit of the maximum allowed size is much higher than that of finding an optimized circuit in the shortest possible time. Moreover, the length of the circuit was in this case given as a high approximation of the maximum number of gates. Therefore, the results are comparable in the number of gates.

Fig. 4.24: Example of a circuit obtained for the sf 232 benchmark function.

The results also show that a breadth-first or an iterative-deepening approach to the search would provide better results, as well as allow us to find shorter circuits if they exist. From the observed results for the provided benchmarks, it can be seen that transformations such as templates or other systematic gate rewriting can be used to minimize the gates. This can be done both in real time and in post-processing.

Fig. 4.25: An example of a circuit obtained for the 4mod5 benchmark function before and after minimization.

Conclusion

Currently the tree search is implemented with only a single heuristic; however, several optimizations are planned for the future:

- Replace the matrix representation by a more compact format such as the QMDD. The advantage of using the QMDD is that it can be parallelized on the GPU processor and thus operations such as multiplication can be performed in parallel. This also allows us to design larger circuits. Another possible implementation, applicable only to permutative gates, is the permutation-vector representation. This is particularly attractive for Controlled-U reversible gates because they remain invariant under inversion and transposition, which makes their manipulation as permutation vectors extremely simple and fast.
- Cycle removal: the obtained circuits often contain cycles of gates that mutually eliminate each other and thus can be completely removed. Computing such circuits is wasteful, and with proper tree pruning such circuits can be completely avoided.
- The method can easily and naturally be extended to the Linear Nearest Neighbor Model (LNNM), and for this model the number of gates in the library can be decreased.
- The method can be extended to matrices that are similar to permutative ones but can have complex numbers instead of ones. This adds rotation gates to the library.
- A library can be defined with only one inverter, one Feynman gate, one Toffoli gate and (n − 1) swap gates. It can be proven that this library is universal, yet it is much smaller than the one used in this chapter. This may lead to a more efficient synthesis of circuits for the LNN model.
- Breadth-first search and iterative deepening will speed up the search and the finding of a minimal-sized circuit.

4.7 Evolutionary Synthesis of Quantum Circuits

Genetic Algorithm

The evolutionary synthesis of quantum circuits studied in this section uses a Genetic Algorithm (GA) to design and search for quantum circuits. A GA is a set of directed random processes that make probabilistic decisions - a simulated evolution.

Table 4.6: Structure of a Genetic Algorithm.
01: t ← 0;
02: initialize(P(t));              /* initial population */
03: while (not termination-condition) do
04:     evaluate(P(t));            /* evaluate fitness */
05:     t ← t + 1;
06:     Q_s(t) ← select(P(t));     /* selection operator */
07:     Q_r(t) ← recombine(Q_s(t)); /* crossover operator */
08:     P(t) ← mutate(Q_r(t));     /* mutation operator */
09: end while

The GA starts by initializing a set of random circuits (line 02), called the population (circuits from the population are also called individuals), and then proceeds, until a solution is found or until the maximal number of iterations (generations) is reached, with the following loop (line 03): calculate the error and the fitness for each individual (line 04), select a set of individuals (parents) (line 06), recombine the selected individuals using a crossover operator (line 07), introduce noise using the mutation operator (line 08), and replace the old individuals (parents) with the newly generated individuals (offspring). Each such pass, starting from one population and generating a new one, is called a generation.

Quantum Circuit Representation

[Figure: example chromosome string "pisiip phicp pisiip"; legend: S - SWAP, C - CCNOT, H - Hadamard, I - Identity]
Fig. 4.26: The transformation of a chromosome from (a) the encoded string to (d) the final quantum circuit representation. S is a Swap gate, H is a Hadamard gate and I is an Identity. In the middle of the circuit there is one CCNOT (Toffoli) gate. Observe that the Swap and the CCNOT gates have been replaced by a single-character representation.

Each circuit in the population is represented as a string (genotype); this encoding was introduced in [25]. This representation allows the description of any quantum or reversible circuit [29, 31]. The chromosome of each individual is a string of characters with two types of tags. First, a group of characters is used to represent the quantum gates that can be used in the individual's string representation. Second, a single character p is used as a separator between blocks of quantum gates (Figure 4.26(a)). These separators indicate the start and the end of a parallel block of gates in the quantum circuit (Figure 4.26(b)). Observe that each space (empty wire or gate) in a quantum circuit is represented by a character corresponding to a quantum gate.
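A minimal sketch of how such a genotype can be split into parallel blocks is given below, using the example string of Figure 4.26; the parsing details are our own illustrative assumption, not the authors' exact decoder.

#include <iostream>
#include <string>
#include <vector>

// Split a chromosome into parallel blocks: 'p' is the block separator, every
// other character names a gate (I, S, H, C, ...) on consecutive wires.
std::vector<std::string> decodeChromosome(const std::string& genotype) {
    std::vector<std::string> blocks;
    std::string current;
    for (char g : genotype) {
        if (g == 'p') {                       // separator tag
            if (!current.empty()) blocks.push_back(current);
            current.clear();
        } else {
            current += g;                     // gate character inside a block
        }
    }
    if (!current.empty()) blocks.push_back(current);
    return blocks;
}

int main() {
    for (const std::string& block : decodeChromosome("pisiipphicppisiip"))
        std::cout << block << "\n";           // prints: isii, hic, isii
}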

The Comparison of Quantum Circuits

Observe that a quantum circuit can be separated into serial blocks; each such block contains one or more quantum gates connected in parallel and spreads across all qubits in the circuit. Thus the Toffoli gate from Figure 4.2 contains five serial blocks, each containing one two-qubit quantum gate. To evaluate the similarity between two quantum circuits, it is possible to compare their respective circuit matrices element by element (note that this is only possible in simulation), or their respective outputs can be compared. These two methods of comparison are schematically represented in Figure 4.27.

Fig. 4.27: Two methods of evaluating quantum circuits: the matrix-based evaluation (also called the element error evaluation) and the measurement-based evaluation method.

The element error evaluation (EE) is based on a matrix-to-matrix element comparison. The EE approach is parameterized to compare the squares of the coefficients (in order to remove the complex phase). Thus, to obtain the difference between two circuits represented as matrices, the EE method compares the elements of the two matrices one by one. The measurement-based evaluation (ME) compares the outputs of the circuit: each time an input state is generated, an output is computed by first propagating the input state through the unitary matrix of the circuit, and then the output state is measured. The obtained observable output is then compared with the expected result specified by the user. Note that the main difference between these two comparison approaches is that the EE method compares matrices while the ME method compares output vectors.

Evaluation of a Quantum Circuit

Each individual is tested to evaluate the correctness of the circuit that it represents. In general, the correctness is obtained by comparing an individual's phenotype (circuit) with the desired target circuit. Thus the correctness of a given circuit is obtained as an error that is used to calculate the fitness of the individual, as shown in Eq. (4.41):

f = 1 / (1 + error),      (4.41)

with error being the difference between the evolved circuit and the desired target circuit. In general, the error is calculated as

error = (1/k) · Σ_{i=1}^{k} (o_i − o_i^t)²,

with o_i being the i-th output of the evolved circuit and o_i^t the i-th output of the desired target circuit. The value corresponding to o_i depends on the circuit evaluation as described in Section 4.7.3: in the case of the EE comparison, o_i represents coefficients of the matrix representing the quantum circuit, and in the case of the ME evaluation it represents the output of the quantum circuit after the measurement is applied.
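A compact sketch of this error and fitness computation (with Eq. (4.41) reconstructed here as f = 1/(1 + error)) might look as follows; it is an illustration only, not the authors' code.

#include <cstddef>
#include <vector>

// Mean squared difference between k outputs of the evolved circuit and the
// k outputs of the target: error = (1/k) * sum_i (o_i - o_i^t)^2.
double circuitError(const std::vector<double>& evolved,
                    const std::vector<double>& target) {
    double sum = 0.0;
    for (std::size_t i = 0; i < evolved.size(); ++i) {
        const double d = evolved[i] - target[i];
        sum += d * d;
    }
    return sum / static_cast<double>(evolved.size());
}

// Fitness as in Eq. (4.41): a perfect circuit (error = 0) has fitness 1.
double fitness(double error) { return 1.0 / (1.0 + error); }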

The Selection of Individuals

The idea behind the selection is that, by mixing the genotypes of parents, new individuals will be created. If the parents have some desired properties (phenotypes), the offspring have a chance to preserve or to improve these phenotypical properties. In other words, if two circuits are mixed in such a manner that some gates are exchanged between them, one of the newly obtained circuits may be the target circuit. In this work, the GA uses a population-replacement strategy (also called a generational GA), which means that parents are completely replaced by offspring from one generation to the next. The selection of parents is handled by Stochastic Universal Sampling (SUS), used to select two parents at a time. The SUS details can be found in [1].

Population Evolution

The selected parents are recombined using a crossover operation: the two resulting offspring have genotypes that are created by exchanging parts of the genotypes between the parents. The crossover used in this work is the so-called two-point crossover: each parent's genotype is cut into three segments and the middle segments are exchanged between the two parents, creating the new offspring. Once such offspring are generated, the mutation operator is applied to each of them. The mutation operator inserts noise into the genotype by changing it locally: it selects a random character in the genotype (limited to characters that represent a gate in the circuit) and changes it into another character (also representing a quantum gate). Once this step is finished and the number of offspring is equal to the number of parents, the parents are replaced by the offspring. The overall process is shown in Figure 4.28.

Fig. 4.28: One generation of the GA operation.

The crossover and the mutation are applied using probabilities of 0.7 and 0.05, respectively. This means that two selected parents have a 70% chance of being recombined into two new offspring; otherwise, the new offspring are equal to the parents. Similarly, the mutation operator has a 0.5% chance to change one randomly selected gate from the circuit into another one.
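A minimal sketch of the two-point crossover and mutation on chromosome strings is given below (our illustration; the gate alphabet, RNG handling and equal-length assumption are simplifications, not the authors' code).

#include <algorithm>
#include <random>
#include <string>
#include <utility>

// Two-point crossover: cut both parents at the same two positions and swap
// the middle segments, producing two offspring.
std::pair<std::string, std::string> twoPointCrossover(std::string a, std::string b,
                                                      std::mt19937& rng) {
    const std::size_t n = std::min(a.size(), b.size());
    std::uniform_int_distribution<std::size_t> cut(1, n - 1);
    std::size_t c1 = cut(rng), c2 = cut(rng);
    if (c1 > c2) std::swap(c1, c2);
    for (std::size_t i = c1; i < c2; ++i) std::swap(a[i], b[i]);  // swap middle segment
    return {a, b};
}

// Mutation: replace one randomly chosen gate character (never the 'p'
// separator) by another character from the gate alphabet.
void mutate(std::string& genotype, const std::string& gateAlphabet, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> pos(0, genotype.size() - 1);
    std::uniform_int_distribution<std::size_t> sym(0, gateAlphabet.size() - 1);
    const std::size_t i = pos(rng);
    if (genotype[i] != 'p')
        genotype[i] = gateAlphabet[sym(rng)];
}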

Error Calculation

During the evaluation of the individual's fitness value, the genotype must be decoded into the phenotype (circuit). To obtain the error of each individual, the circuit (phenotype) has to be constructed and its correctness must be verified. As introduced in Section 4.7.3, the QC can be compared using two different methods. Intuitively, the EE evaluation seems to be more computationally expensive, because full matrices are compared element by element. However, as it turns out, the ME evaluation counter-intuitively requires more computational operations. In the EE method, once the circuit matrix is computed, the difference between all the elements of the matrices can be obtained in 2^{2n} computational steps, each step comparing one coefficient of one matrix to the corresponding coefficient of the other. In the ME method, for each input an output of the circuit must be computed, because the comparison is performed on the outputs of the circuit. Thus, each time an input is propagated through the matrix of the circuit, the measurement is applied and the resulting value is used to compute the difference. In this manner, for the 2^n inputs that must be propagated through the circuit, 2^n outputs are obtained and compared in 2^n steps. However, obtaining each of the 2^n outputs from the inputs requires an additional 2^{2n+1} + 2^{2n+1} − 2^{n+1} computational steps. This is because the multiplication of the input vector by the circuit matrix requires 2^{2n} multiplications and (2^n − 1) · 2^n = 2^{2n} − 2^n additions; to obtain the observable result, the same amount of computation is required again, because the measurement is a matrix of the same dimension as the circuit matrix. Thus the computational overhead for each input-output combination is 2^{2n+1} + 2^{2n+1} − 2^{n+1}. Also observe that comparing the complex coefficients of the circuit matrix allows the evaluation of the target circuit with arbitrary precision - including the complex phase. In the measurement process the phase information is lost, and thus the output comparison can be performed solely over the set of observables of the quantum system.

Experimentation

Fig. 4.29: An example of two realizations of the Toffoli gate obtained from the GA before their optimizing transformations were applied.

For the experimentation we selected a set of benchmarks such that both approaches have reached their limit both in feasibility and in the time allocated for the computation. Each benchmark function was run for 500 generations and was tested on 20 different runs.

Table 4.7 shows the various metrics that have been used to evaluate the benchmarks.

Table 4.7: The various metrics used to evaluate the benchmark functions.
Structure Related Measures     Empirical Evaluators
Width                          Creation time of the circuit (s)
Length                         Matrix-based evaluation (s)
with Unstructured Gates        Measurement-based evaluation (s)
with Structured Gates          Overall performance (s), for CSF and for ICSF

The input gate set was selected appropriately to include only the gates necessary to synthesize a given circuit. This knowledge was based on the study of the currently known minimal QC realizations of the given function, as well as on our past experimentation with the size of the input gate set [29, 30]. The input gate set used depends on the assignment: for the unstructured quantum circuit synthesis, single-qubit (Appendix 4.8.1), two-qubit (Appendix 4.8.2) and, if provided, larger multi-qubit quantum gates (Appendix 4.8.2) as well as arbitrary, angle-parameterized rotations (Appendix 4.8.3) are allowed; for the structured quantum circuit synthesis, only multi-qubit controlled gates (Appendix 4.8.2) are used and no arbitrary rotations (as in Appendix 4.8.3) are allowed. The benchmark of the creation of the circuit is executed as a sub-category of the structured and unstructured search. As will be seen, the matrix evaluation and the measurement-based evaluation are analyzed only partially, as their performance can be deduced from some of the prior evaluation results. Finally, the overall performance is analyzed for each benchmark run.

Fig. 4.30: An example Majority circuit found by the GA.

All of the completely specified functions are evaluated using the EE method, while circuits that are defined only on a subset of qubits are evaluated using the ME evaluation method. As already introduced, when the QMDD is used, the error evaluation is made by matrix multiplication, and the sum of the diagonal elements, as explained earlier, is equivalent to the EE evaluation method.

The first set of benchmarks is the set of universal reversible quantum-realized gates. In this set, three gates have been used: Toffoli, Fredkin and Majority. Two of these three gates are shown in Figures 4.29 and 4.30. These gates have been designed using the input gate set containing the gates described in the Appendix.

The second set of benchmarks is a set of reversible arithmetic functions that the algorithm was attempting to realize. The arithmetic functions for which we tested our approach were 4gt5 (a 5-qubit function that outputs 1 only if the number defined by the binary encoding of the input is greater than 5) and 4mod5 (a 5-qubit function that checks whether the number defined by the inputs is divisible by 5).

Fig. 4.31: An example of approximate quantum circuits for the arithmetic function 4gt5.

The last set of functions was a set of truly quantum gates, such as entanglement (for 3 and 4 qubits), using the input gate sets from the Appendix. Finally, the Toffoli gate was designed using as input primitives a set of one- and two-qubit rotations, the Nuclear Magnetic Resonance (NMR) pulses (also called control pulses), as given in the Appendix.

Because the focus of this chapter is not on the synthesis itself but rather on the parallelization and realization aspects of its implementation, the results of the synthesis process are provided only as a demonstration of the methodology and are not analyzed. For a more detailed explanation of the results of evolutionary automated synthesis of quantum circuits, the interested reader can consult our previous work [31, 29, 28] as well as the work of other authors conducting research in this area [49, 58, 59, 20, 21, 53, 54, 55, 33, 34]. The provided circuits are obtained using the GA with an additional post-minimization process. The minimization used in these examples is the aggregation of quantum gates. This is a common approach to quantum circuit minimization in quantum computing, and the interested reader can consult [43, 29, 28] for a more detailed explanation of this method. For the sake of clarity, any quantum gate defined on the same number of qubits and having the same size can be aggregated with its neighbor if no other gate is located in between.

In all of the cases, at least one partial solution has been found. Figures 4.29 and 4.30 show selected results of the Toffoli and of the Majority reversible gates synthesized by the GA. In the case of the arithmetic function synthesis, only partial solutions have been found. The illustration in Figure 4.31 shows a partial realization of the 4gt5 function.

Finally, in the last benchmarking set, both exact and partial results have been found. For instance, the entanglement circuits on 3, 4 and 5 qubits have all been successfully synthesized (Figure 4.32 shows the result for the 4-qubit entanglement QC). A result described as an exact solution is a perfect match to the desired target circuit. On the other hand, a partial solution means that at least one of the qubit outputs is different from the functional specification. For instance, a Controlled-U function such as a Toffoli gate will be synthesized with the correct output on the bottom qubit, but one of the two control qubits will be different. This means that the target function is still realized but some of the go-through outputs are not obtained.

Fig. 4.32: The results of the synthesis of the entanglement circuit for 4 qubits after minimization.

In the case where the NMR pulses have been used, only approximate results have been found; examples of an approximate Toffoli gate are shown in Figures 4.33a and 4.33b. By approximate solutions we mean that the given function was synthesized with an error larger than desired.

Besides testing both approaches on the above benchmarks for feasibility, these sets of functions have been modified as follows in order to test both the accelerated matrix and the QMDD evaluation methods of our GA. The Toffoli gate was scaled from three qubits up to 20 qubits, and the representation was tested with the gates from Appendix 4.8.1 and with further gate sets from the Appendix. An arbitrary circuit of up to k gates was designed; it was constructed using both representation methods and was tested with gate sets from the Appendix. These modified benchmarks are discussed in the next section according to the metrics introduced in Table 4.7.

Discussion

First, it is important to observe that the circuit string representation (the chromosomes of the individuals in the GA) is independent of the calculation of the final QC. Thus, the evolutionary synthesis is partially independent of the evaluation and the fitness calculation.

(a) An example of an approximate Toffoli gate using the NMR input set of gates. (b) An example of an approximate Toffoli gate after minimization; this gate was generated using the NMR set of rotations.
Fig. 4.33: An example of approximate Toffoli gates realized using the NMR control pulses.

Thus, the representation and the computation of the circuit outputs are not the only existing bottlenecks to be considered. As it stands, the following issues have been identified as possible causes of a considerable slowdown in a GA-based SQC:

- Representation - the size of the circuit representation directly impacts the speed of its computation. This is directly observable both in the length (number of quantum gates used) of the circuit and in its width (number of qubits).
- Computation and Evaluation - the computation of the matrix product and the Kronecker product. The evaluation of the circuit requires, in general, at least the retrieval of data from the representation and in some cases also additional operations (in the measurement evaluation, additional vector-matrix products are required).
- Representation Creation - the creation of the representational form. This metric is mainly dependent on what types of gates have been used (structured vs. unstructured quantum gates).
- Data Transfer to Memory - depending on the representation, the data might need to be transported to the shared memory (as in the case of CUDA); this must be minimized in order to allow the fastest possible computation.

In the presented experimental results we compared the QMDD vs. the CUBLAS implementation of the GA, but in some cases we also provide the results of a CBLAS-implemented GA. The CBLAS library [12] is the standard non-accelerated matrix multiplication library and was used as the reference. The CBLAS [12] library is a C/C++ implementation of the Fortran BLAS (linear algebra) library and is available online.

The CUDA acceleration uses CUBLAS, a GPU-accelerated version of the CBLAS package. The graphics accelerator has two GPUs, each with 128 cores. Each core supports up to 512 parallel threads of computation; thus, at maximum, the accelerator can execute 2 · 128 · 512 = 131072 operations in parallel.

Quantum Circuits with Structured and Unstructured Quantum Gates

Table 4.8: Feasibility limits of each method during the computation of complex matrices using arbitrary quantum gates.
# of Qubits   QMDD          CBLAS         CUBLAS
7             Yes           Yes           Yes
8             Partial (M)   Yes           Yes
9             Partial (M)   Partial (T)   Yes
10            No            No            Yes
11            No            No            Partial (T)
12            No            No            No

While it is true that permutative and controlled quantum gates have a high degree of structure, during a pseudo-random evolutionary search many of the circuits that are synthesized do not have any structure at all. Our experimentation showed that, despite the fact that the QMDD is very efficient in representing quantum-realized reversible circuits, during the synthesis of circuits with more than 5 qubits the QMDD would run out of memory for circuits without any structure. Table 4.8 shows the feasibility results for all three tested methods for circuits between 7 and 10 qubits. The columns QMDD, CBLAS and CUBLAS represent the implementation used for the computation of the quantum circuit: QMDD - the Quantum Multiple-valued Decision Diagram; CBLAS - the C/C++ implementation of the Basic Linear Algebra Subprograms (BLAS) [12] library; CUBLAS - the GPU-accelerated BLAS library [44]. Yes means that the given implementation is computationally possible under the experimental conditions and that viable results have been obtained. The circuits generated in Table 4.8 are random and no preexisting structure is defined or required. The random quantum circuits are designed in such a manner that for a given number of serial segments an arbitrary set of gates is generated. The results are as expected: for quantum circuits without structure, the QMDD is not a well-suited representation.

For circuits with structure, a similar evaluation was made and the results are given in Table 4.9, which shows that when only gates from Appendix 4.8.2 are used for synthesis, the QMDD outperforms the CUDA approach both in the size of the representation and in the speed of computing the resulting representation. Such circuits are said to have structure but, as in the case described in Table 4.8, a random process was used to design them, only with a particular subset of gates.

Table 4.9: The feasibility limits of each method during the computation of complex matrices using only structured quantum gates.
# of Qubits   QMDD   CBLAS            CUBLAS
7             Yes    Yes              Yes
8             Yes    Yes              Yes
9             Yes    Partial (time)   Yes
10            Yes    No               Yes
11            Yes    No               Partial (time)
12            Yes    No               No
13            Yes    No               No
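The CUBLAS column in Tables 4.8 and 4.9 corresponds to evaluating the circuit matrix with GPU BLAS calls on complex double-precision data. A hedged sketch of such a device-side product via cuBLAS is shown below (our illustration; handle creation, data transfer and error handling are omitted).

#include <cuComplex.h>
#include <cublas_v2.h>

// Multiply two (2^n x 2^n) complex double-precision matrices already resident
// on the device: C = A * B, column-major as expected by BLAS.
void multiplyOnDevice(cublasHandle_t handle, int dim,
                      const cuDoubleComplex* dA, const cuDoubleComplex* dB,
                      cuDoubleComplex* dC) {
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                dim, dim, dim,
                &one, dA, dim, dB, dim,
                &zero, dC, dim);
}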

The computation of matrices for quantum circuits with more than 10 qubits could still be possible using the GPU computational approach, although its efficiency is directly dependent on the size of the GPU memory. For instance, in our case the GPU had 1 GB of memory (NVIDIA GeForce 9800 GX2), and thus it is only reasonable to use it to compute matrices such that three of them fit together in the GPU memory. This is explained as follows. In a matrix multiplication there are two input matrices and one output matrix. Each matrix is a set of 2^n × 2^n complex coefficients (stored as doubles or floats); with double precision (8 bytes) a matrix of 5 qubits requires 2^5 · 2^5 · 8 = 8 KB. For ten qubits the amount of required memory is 8 MB, and this amount is halved for float-based precision. With such a scaling, the maximum size of a circuit is limited to 13 qubits, but practically this is often only 11 or 12 qubits.
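Under the byte counts used above (8 bytes per coefficient, three matrices resident at once), the maximum circuit width for a given device memory can be estimated with a few lines of C++; for a 1 GB device this evaluates to 12, in line with the practical 11-13 qubit limit discussed above. This is only an illustration of the scaling argument, not the authors' code.

#include <cstdint>

// Largest n such that three (2^n x 2^n) matrices of 8-byte coefficients
// (two operands plus the product) fit into the given device memory.
unsigned maxCircuitWidth(std::uint64_t deviceBytes) {
    unsigned n = 0;
    while (3ull * (1ull << (2 * (n + 1))) * 8ull <= deviceBytes)
        ++n;
    return n;
}
// maxCircuitWidth(1ull << 30) evaluates to 12 for a 1 GB device.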

Width of the Quantum Circuit

Figure 4.34 shows the time required to build the quantum circuit representation as a function of the width of the quantum circuit. The length of the circuit is randomly generated within the range of 15 to 34 quantum serial segments. As can be seen, for circuits using only structured quantum gates the QMDD is the more suitable representation. This means that, when using controlled quantum gates, the QMDD will always outperform the matrix representation in both the speed and the size of the synthesized quantum circuit. This is also observable in the following experiments. Observe that the time for building a circuit in the QMDD cannot be predicted, as it depends on the structure of the quantum circuit. This can be seen in the fact that, while for permutative quantum circuits (built using the Toffoli macros) the QMDD is extremely fast and handles larger circuits very well (dotted curve), as soon as gates such as CV or CV† are used the performance drops drastically (dashed curve).

[Plot: time (in seconds) vs. circuit qubits; curves: CUBLAS circuit calculation, QMDD arbitrary circuit building, QMDD permutative circuit building]
Fig. 4.34: A comparative plot of the time required to build a quantum circuit using permutative gates and the time required to build a quantum circuit using quantum gates.

Figure 4.34 shows that for permutative quantum circuits the QMDD building time increases linearly with respect to the size of the matrix, corresponding to the number of minterms required to build the circuit. While the QMDD handles circuits as large as 15 or 20 qubits very well, the CUDA approach is not time-tractable beyond 12 qubits. Also note that the CUDA-accelerated matrix computation depends only on the width of the circuit; the plot shows only one CUDA curve because the computation is independent of the content of the matrices. The main problem (the limit of 12 qubits in Tables 4.8 and 4.9) when using the GPU-accelerated matrix multiplication is the exponential growth of the computation time (an example can be seen in Figure 4.36). This time can be decreased linearly (as discussed later) at the price of larger hardware, but it otherwise remains the main limiting factor of this approach. Thus, as expected, the time required to build either a unitary matrix representing the quantum circuit or a QMDD of the same circuit depends on the width of the quantum circuit (mainly for CUBLAS). In the case of the QMDD, the performance is directly dependent both on the type of gates used and on the width of the quantum circuit.

Length of the Quantum Circuit

Quantum Gates

[Plot: time (in seconds) vs. circuit segments; curves for 5, 6 and 7 qubits, QMDD and CUBLAS each]
Fig. 4.35: A comparison of the time required to build a matrix and a QMDD using quantum permutative and non-permutative gates.

Figure 4.35 shows the comparison of the times required to build a QMDD and a CUDA-accelerated matrix for circuits with 20, 40, 60, 80, 150 and 500 quantum gates, respectively, using quantum gates such as V, H, Z, Controlled-V or Controlled-V†. Observe that, as expected, the CUDA multiplication time increases linearly with the number of gates (parallel blocks) in the quantum circuit. On the other hand, observe that the time required for the QMDD package to build the representation grows much faster and, depending on the number of gates in the quantum circuit, it crashes at 20, 40 or 60 quantum gates. The time required for the CUDA-accelerated matrix building grows linearly with the number of quantum gates while, as expected, it is exponential in the number of qubits. Finally, observe that the circuits built with arbitrary quantum gates that the QMDD can successfully represent are much shorter than those handled by the CUDA approach. This is shown in Figure 4.35 by the lines going to zero seconds for the QMDD curves: this means that the QMDD runs out of memory and cannot represent larger circuits anymore. Observe that, as expected, the wider the circuit is, the sooner the QMDD runs out of memory.

On the other hand, when dealing with only permutative gates, the QMDD completely outperforms the CUDA matrix approach. This is again due to the fact that the CUDA matrix is data independent while the QMDD is optimized for representing such gates. This is shown in Figure 4.36.

Permutative Gates

Fig. 4.36: A comparison of the time required to build a matrix and a QMDD using only permutative quantum gates (plot: Quantum circuit building - length; curves: 7, 9 and 10 qubits for QMDD and CUBLAS/CUDA; axes: time in seconds vs. circuit segments).

The reason why the QMDD is much faster for permutative gates is that the QMDD only requires the addition of the specified minterms rather than a whole unitary matrix. Also, when using only structured quantum gates, the matrix representing a circuit contains a relatively small number of non-zero coefficients, and in the case when only permutative quantum gates are used, only a single coefficient per row/column is non-zero; thus such a permutative matrix is represented by a single vector having 2^n elements. Thus, the time required to build the data structure that allows the final circuit evaluation depends on the number of gates (for both the CUBLAS and the QMDD) as well as on the type of gates used (for the QMDD).
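To make the last observation concrete: a purely permutative gate does not need a matrix at all. The sketch below (our illustration, not the chapter's code) stores only the vector of 2^n target indices and applies or composes such gates without ever forming a 2^n x 2^n array.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// A permutative gate on n qubits maps basis state i to basis state perm[i];
// storing perm (2^n entries) replaces the dense 2^n x 2^n unitary matrix.
using State = std::vector<std::complex<double>>;

// Apply the permutation to a state vector: out[perm[i]] = in[i].
State apply_permutative_gate(const std::vector<std::size_t>& perm, const State& in) {
    State out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[perm[i]] = in[i];
    return out;
}

// Compose two permutative gates (first g1, then g2) without any matrix product.
std::vector<std::size_t> compose(const std::vector<std::size_t>& g1,
                                 const std::vector<std::size_t>& g2) {
    std::vector<std::size_t> r(g1.size());
    for (std::size_t i = 0; i < g1.size(); ++i)
        r[i] = g2[g1[i]];
    return r;
}
```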

Evaluation

Once the circuit is constructed, the GA requires the evaluation of its correctness.

Fig. 4.37: A comparison of the evaluation of the quantum circuit for 7, 8 and 10 qubits for the three studied methods. The time is in seconds and is averaged over selected arbitrary circuits.

Figure 4.37 shows the comparison of the computation time required to generate a fitness value for a given circuit. It includes the decoding of the genotype, the representation construction, and the fitness and error calculation. Observe that, as expected, the CUDA-accelerated fitness computation outperforms both the CBLAS and the QMDD, while the QMDD outperforms the CBLAS for larger circuits. This is again due to the fact that, as long as the QMDD can handle the size of the circuit, it also provides fast methods for matrix multiplication; without acceleration, the matrix multiplication is much slower. The evaluation itself (the fitness calculation) implies that in the EE case k expected values for the given coefficients are compared one to one, while in the ME case k measurements are done, or a matrix multiplication is done in the QMDD case. This means that, in general, this time is negligible when compared to the overall circuit building. As Figure 4.37 shows, the calculation of the fitness for the CBLAS and CUBLAS approaches is faster, as it requires either only an element-by-element comparison or a matrix-vector multiplication (ME evaluation), while in the QMDD case each evaluation requires a full QMDD-by-QMDD multiplication. The time for this QMDD operation depends both on the size of the QMDD (the structure of the quantum circuit) and on the number of qubits in the circuit.
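As a rough illustration of the element-by-element (EE) style of evaluation described above, the following sketch compares k expected coefficients one-to-one against the corresponding entries obtained for the synthesized circuit and turns the accumulated error into a fitness value. The error metric and the fitness formula are our illustrative assumptions, not necessarily those used by the authors.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// EE-style evaluation: compare k expected coefficients one-to-one against the
// corresponding entries of the circuit representation and derive a fitness in (0, 1].
// The error metric (sum of absolute differences) is an illustrative choice.
double ee_fitness(const std::vector<std::complex<double>>& expected,
                  const std::vector<std::complex<double>>& obtained) {
    double error = 0.0;
    for (std::size_t i = 0; i < expected.size(); ++i)
        error += std::abs(expected[i] - obtained[i]);
    return 1.0 / (1.0 + error);   // zero error -> fitness 1, larger error -> lower fitness
}
```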

Data Transfer to Memory and Result Transfer to Memory

This problem concerns only the accelerated computation, as the data processed by the GPU must first be sent from the computer (host) memory to the video (device) memory where it is processed. Once the multiplication is done, the result can be sent back to the main memory for further processing. Just like in the previous cases, the worst-case scenario applies here as well: because the search is essentially unstructured, the data bus must be used as intensively as possible in order to speed the computation up as much as possible.

Fig. 4.38: A comparison of fitness calculation vs. simple matrix multiplication for QC with 7 to 10 qubits. The matrix multiplication curves are postfixed by MM and the fitness calculation curves are postfixed by FIT.

Figure 4.38 shows a comparative drawing of the matrix multiplication alone and the fitness calculation. Note that the CUDA matrix multiplication does not show up, as its time is very close to zero. This can be seen even when minimizing the computation performed on the CPU for calculating the fitness or other parameters. Also observe that, as in the previous graph, the QMDD slowdown is not linear, as it depends on the circuit size and structure. Finally, observe that in this case the circuit used was a circuit without structure, so that the full QMDD is built and the worst-case performance can be evaluated.
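A minimal sketch of the transfer pattern discussed in this section, written against the cuBLAS v2 API as an assumption about the toolchain (the chapter's own implementation used the CUBLAS library of its time): the operands are copied to device memory once, the complex product is computed there, and only the final result crosses the bus back to the host.

```cpp
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <vector>

// Multiply two (dim x dim) complex matrices entirely on the device: one
// host->device transfer per operand, cublasZgemm on the device, and a single
// device->host transfer of the result.
void multiply_on_device(const std::vector<cuDoubleComplex>& A,
                        const std::vector<cuDoubleComplex>& B,
                        std::vector<cuDoubleComplex>& C, int dim) {
    const size_t bytes = sizeof(cuDoubleComplex) * dim * dim;
    cuDoubleComplex *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&dA), bytes);
    cudaMalloc(reinterpret_cast<void**>(&dB), bytes);
    cudaMalloc(reinterpret_cast<void**>(&dC), bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    // C = A * B; cuBLAS assumes column-major storage of the operands.
    cublasZgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim,
                &one, dA, dim, dB, dim, &zero, dC, dim);
    cublasDestroy(handle);

    C.resize(static_cast<size_t>(dim) * dim);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```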

GA Limitations: Building Blocks and the Input Gate Set

The acceleration of the matrix multiplication and the possibility of synthesizing larger circuits in a reasonable time introduce a set of problems that have not been an issue for smaller circuits. One example is the problem of preserving larger building blocks; the more qubits there are in a quantum circuit, the larger the building blocks required to create universal gates, and hence the evolutionary operators have a higher chance of breaking those blocks and preventing us from reaching the desired solution. This is a well-known fact, first introduced by the Building Blocks Theory [16, 13], and it is seen in the SQC problem. To illustrate this problem, Figures 4.31 and 4.33a show partial results that the GA designed when it was not able to find a completely correct solution. Another problem that is also related to synthesizing larger QCs is the fact that, given our representation, larger circuits require large sets of input gates. This is a natural consequence of having unique labels for each used quantum gate.

Quantum Circuit Valuedness

In the presented experiments, we considered only matrices that can be directly matched to the structure of the GPU device. Matrices with 2^n elements can be directly matched to an (n x n) array of GPUs, but matrices with a different geometry, such as k^n, will be calculated much more slowly. This is either because only matrices smaller than the maximal size 2^n can be computed at a time, or because the k^n matrix must be partitioned so as to fit the sub-blocks of the GPU grid. For instance, a quantum-binary circuit of size 2^10 can be computed in one step on a block of GPUs with each GPU calculating 64 (2^6) coefficients of the unitary matrix. On the other hand, a quantum-ternary matrix of size 3^10 must either be computed on a GPU array having, for instance, (81 x 81) GPUs, each computing 729 = 3^6 matrix coefficients, or such a matrix must be broken down and computed as a sequence of operations on each of the sub-matrices. However, for the best performance it is recommended to always compute matrices that fit onto the GPU array directly.

QMDD Limitations

The QMDD package requires the initialization of an internal built-in optimizer (LUTs) for faster complex number manipulation and, used as such, a complete initialization depends on the size of the circuit. In short, each time the QMDD package is initialized it creates a LUT with all the possible complex coefficients for the case when the QMDD is fully allocated. In all the cases when the QMDD crashed it was because the LUT overflowed.

This happened mainly due to the fact that the LUT stores values for each coefficient in the matrix it represents; thus, for a matrix that is largely filled with non-zero coefficients, it will overflow. Observe that this overflow happens in both cases, with growing width and with growing length of an arbitrary circuit. This means that the complexity of the circuit increases both by adding more complex gates and by adding larger and larger quantum gates. This overflow of the LUT is partially due to the fact that the GA generates circuits that have quantum controls, and thus the complexity of the circuit increases. Moreover, because the QMDD does not remove unused coefficients during the procedure of building a quantum circuit, the LUT can easily overflow. This can be seen in Figure 4.39, where, by adding a CV/VC gate (VC being a CV gate with the target qubit up and the control qubit down), building the circuits in a QMDD requires together 4 + 8 + 8 = 20 quantum coefficients. Thus adding more quantum gates will easily fill the LUT.

Fig. 4.39: Three quantum circuits using the CV and the VC gates. (a) four complex coefficients are created, (b) eight complex coefficients are created and (c) eight (different than in (b)) coefficients are created.

Conclusion

In this section, we compared the different approaches to the evaluation of quantum circuits synthesized using a Genetic Algorithm. The results showed that despite the fact that the QMDD representation is very well suited to represent and manipulate circuits with a high level of structure, the CUDA-accelerated matrix representation is more universal when more memory is available. In addition, we showed that when truly quantum circuits get large, there is no straightforward solution. The complexity of the circuit grows, and as a result it is not possible to represent it efficiently. On the other hand, the time required for matrix multiplication grows exponentially with the size of the matrix (number of qubits), and thus the only solution is to provide larger hardware that will allow us to process larger matrices at once. Note that our approach concerns a search method where many circuits are built, evaluated and destroyed. Thus, for methods that build a given circuit once and modify its structure, this evaluation might not be the most accurate.

We showed that while the QMDD is indeed an optimized approach, its fitness calculation is not very efficient, and even for larger matrices the calculation done using the CUBLAS/BLAS approach is faster. The general conclusion is that, as long as the search method is an unstructured Evolutionary Synthesis, the matrix representation is the only one possible. In the case where some structure is introduced, such as using only permutative quantum gates or, for instance, doing a structure-based search such as proposed in [26, 27], the more efficient representation (QMDD) should be used.

Table 4.10: Table representing the overall results of this work.

Parameters of Synthesis (Method) | QMDD | CBLAS          | CUDA
Structured                       | Yes  | Size depending | Yes
Unstructured                     | No   | Size depending | Yes
Operations                       | Yes  | Standard       | Standard

Table 4.10 represents the summary of the results obtained in this work. Each row represents a generalization of the results observed during the experimentation in this work. The first and second rows describe whether the given computational approach can handle structured and unstructured circuits. Recall that the analysis here concerns arbitrary quantum circuits with up to 12 qubits. The third row indicates whether the given method offers advantages for standard operations. The decision on which of the approaches (acceleration vs. advanced representation) should be used also depends on the type of operations required by the GA. The QMDD is more efficient for the reversible QC, but the method used by Viamontes [56] is expected to be more efficient for structured QC where algebraic operations are required. To conclude, it is obvious that in an unstructured environment the brute-force approach with micro-parallelization is a much more powerful tool than well designed and optimized data structures for the evolutionary synthesis of small and average-sized quantum circuits. However, for data larger than some computationally feasible limit, heuristics and even optimization subroutines, in the way that was done for instance in [35, 36], are required. This limit can be either hardware or software based. Future work includes the extension of the GA to a structure-based evolutionary synthesis of quantum circuits and the support for larger building blocks that would allow the synthesis of circuits with many qubits. A GA built completely into the GPU for maximum efficiency is also one of the possible research directions.

Closing Remarks

This chapter described three separate approaches to the synthesis of reversible and quantum circuits. The general conclusion that can be drawn from the described methodologies and the experiments is that the GPU can be used efficiently to accelerate matrix-represented quantum circuits of small and moderate size. The most important problem is to balance the size of the quantum circuits against the number of required matrix operations. This means that if a very large number of matrix operations is required, it is crucial to store all the matrices in the GPU global memory, because the PCIe bus is relatively slow when overused. Another important point is that the usage of the GPU allows us to do extremely fast searches by accelerating matrix operations. This means that by accelerating local operations, a tree search spanning a problem space is indeed accelerated, and this allows us to search a much larger problem space in a reasonable time. This is very important because, as long as a problem can be specified as a tree search using matrix operations at every node, the GPU approach is the best possible tool to use. The parallelization offered by the GPU is in fact a competing tool for many software-optimized approaches, algorithms, representations and tools. This means that it has become a standalone replacement for optimizations and allows us to accelerate certain problems. In the SQC, the GPU approach is almost tailor-made for processing quantum Boolean circuits, but it is more difficult to accelerate quantum circuits of higher radices. The general conclusion is that the GPU can be explored and used even more efficiently for the SQC, and the approaches that have not so far been accelerated will benefit greatly from this technology.

Appendices

Examples of single qubit Quantum Gates

Some of the single qubit operators in their matrix representations can be seen in Equation (4.42).

$$
\begin{aligned}
&a)\ X = NOT = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\qquad
b)\ Y = \begin{pmatrix} 0 & -i \\ i & 0 \end{pmatrix}
\qquad
c)\ Z = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \\
&d)\ H = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}
\qquad
e)\ V = \tfrac{(1+i)}{2}\begin{pmatrix} 1 & -i \\ -i & 1 \end{pmatrix}
\qquad
f)\ Phase = \begin{pmatrix} 1 & 0 \\ 0 & i \end{pmatrix}
\end{aligned}
\tag{4.42}
$$

The operators in Eq. (4.42) are the NOT (or the Pauli-X), the Pauli-Y, the Pauli-Z, the Hadamard, the V and the Phase gate, respectively.

Examples of multiple qubit Quantum Gates

Figure 4.40 shows some of the multi-qubit gates as well as the general Controlled-U template gates.

Fig. 4.40: Multi-qubit quantum gates. U in (a,b,c) represents a single qubit or a two qubit operation, (d) Feynman gate (with matrix in Eq. (4.43)), (e) Toffoli gate (with matrix in Eq. (4.7)), (f) Fredkin gate (with matrix in Eq. (4.44)).

$$
Feynman = \begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0
\end{pmatrix}
\tag{4.43}
$$

$$
Fredkin = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}
\tag{4.44}
$$

Pulse-level quantum gates

Single qubit operations: rotations $R_x$, $R_y$, $R_z$ for various degrees of rotation $\theta$, with each unitary rotation represented in Equation (4.45).

$$
\begin{aligned}
R_x(\theta) &= e^{-i\theta X/2} = \cos\tfrac{\theta}{2}\,I - i\sin\tfrac{\theta}{2}\,X
= \begin{pmatrix} \cos\tfrac{\theta}{2} & -i\sin\tfrac{\theta}{2} \\ -i\sin\tfrac{\theta}{2} & \cos\tfrac{\theta}{2} \end{pmatrix} \\
R_y(\theta) &= e^{-i\theta Y/2} = \cos\tfrac{\theta}{2}\,I - i\sin\tfrac{\theta}{2}\,Y
= \begin{pmatrix} \cos\tfrac{\theta}{2} & -\sin\tfrac{\theta}{2} \\ \sin\tfrac{\theta}{2} & \cos\tfrac{\theta}{2} \end{pmatrix} \\
R_z(\theta) &= e^{-i\theta Z/2} = \cos\tfrac{\theta}{2}\,I - i\sin\tfrac{\theta}{2}\,Z
= \begin{pmatrix} e^{-i\theta/2} & 0 \\ 0 & e^{i\theta/2} \end{pmatrix}
\end{aligned}
\tag{4.45}
$$

Two-qubit operation: depending on the approach, the interaction operator is used as $J_{zz}$ or $J_{xy}$ for various rotations $\theta$. The interaction operator is, in its simplest form, a phase inverter and can be seen as two single-qubit Pauli-Z gates applied in parallel. The result is such that when the control qubit is one it changes the phase of the target qubit.
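As a small numerical cross-check of these appendix definitions (our sketch, not part of the original text), the code below builds the V gate of Eq. (4.42) and the rotation R_x(theta) of Eq. (4.45) and verifies that V.V equals the NOT gate and that R_x(pi) equals X up to a global phase of -i.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstdio>

using C = std::complex<double>;

// Largest absolute entry-wise difference between two 2x2 complex matrices.
double max_diff(const C a[2][2], const C b[2][2]) {
    double m = 0.0;
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 2; ++c)
            m = std::max(m, std::abs(a[r][c] - b[r][c]));
    return m;
}

int main() {
    const C i(0.0, 1.0);
    const C X[2][2] = {{0.0, 1.0}, {1.0, 0.0}};

    // V = ((1+i)/2) [[1, -i], [-i, 1]] from Eq. (4.42e): the square root of NOT.
    const C v = C(1.0, 1.0) / 2.0;
    const C V[2][2] = {{v, v * (-i)}, {v * (-i), v}};
    C VV[2][2];
    for (int r = 0; r < 2; ++r)
        for (int c = 0; c < 2; ++c)
            VV[r][c] = V[r][0] * V[0][c] + V[r][1] * V[1][c];
    std::printf("V*V equals X: %s\n", max_diff(VV, X) < 1e-12 ? "yes" : "no");

    // R_x(pi) from Eq. (4.45) equals -i * X, i.e. X up to a global phase.
    const double theta = std::acos(-1.0);                        // pi
    const C Rx[2][2] = {{C(std::cos(theta / 2.0)), -i * std::sin(theta / 2.0)},
                        {-i * std::sin(theta / 2.0), C(std::cos(theta / 2.0))}};
    const C miX[2][2] = {{-i * X[0][0], -i * X[0][1]},
                         {-i * X[1][0], -i * X[1][1]}};
    std::printf("R_x(pi) equals -i*X: %s\n", max_diff(Rx, miX) < 1e-12 ? "yes" : "no");
    return 0;
}
```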

170 156 Martin Lukac, Marek Perkowski, Pawel Kerntopf, Michitaka Kameyama References 1. Baker, J.E., Reducing bias and inefficiency in the selection algorithm, Proc. of the Second Int. Conf. on Genetic Algorithms and their Application, 1987, Barenco, A., Bennett, C.H., Cleve, R., DiVincenzo, D.P., Margolus, N., Shor, P., Sleator, T., Smolin, J.A., Weinfurter H., Elementary gates for quantum computation, Physical Review A, 52, 1995, Brayton, R., Somenzi, F., An exact minimizer for Boolean relations, Proc. Int. Conf. on Computer Aided Design ICCAD, Bremner, M.J., Dawson, Ch.M., Dodd, J.L., Gilchrist, A., Harrow, A.W., Mortimer, D., Nielsen, M.A., Osborne, T.J., Practical scheme for quantum computation with any two-qubit entangling gate, Phys. Rev. Lett., 89, 24, 2002, Clarke, E., Mcmillan, K., Zhao, X., Fujita, M., Yang, J., Spectral transforms for large Boolean functions with applications to technology mapping, Formal Methods in System Design, 2-3, 10, 1997, Deutsch, D., Quantum theory, the Church-Turing principle and the universal quantum computer, Proc. of the Royal Society of London, Ser. A, 1985, A400: Deutsch, D., Quantum computational networks, Proc. of the Royal Society of London, Ser. A, 1989, A425: Deutsch, D., Barenco, A., Ekert, A., Universality in quantum computation, Proc. of the Royal Society of London, Ser. 1995, A449: Dirac, P. A. M., The Principles of Quantum Mechanics, Clarendon, Oxford, Eisert, J., Jacobs, K., Papadopoulos, P., Plenio, M.B., Optimal local implementation of nonlocal quantum gates, Phys. Rev. A, 62(5):052317, Oct Fujita, M., McGeer, P.C., Yang, J. C.-Y., Multiterminal binary decision diagrams: An efficient data structure for matrix representation, Form. Methods Syst. Des., 2-3(10), 1997, GNU, GSL CBLAS, GNU, Goldberg, D.E., Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, MA, Grosse, D., Wille, R., Dueck, G.W., Drechsler, R., Exact synthesis of elementary quantum gate circuits for reversible functions with dont cares, Proc. 38th Int. Symp. on Multiple Valued Logic, Gruska, J., Quantum Computing, Osborne/McGraw-Hill,U.S., Holland, J.H., Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, Hung, W.N.N., Song, X., Yang, G., Yang, J., Perkowski, M., Quantum logic synthesis by symbolic reachability analysis, Proc. DAC, 2004.

171 4 GPU Acceleration Methods of Representations for Quantum Circuits Hung, W. N. N., Song, X., Yang, G., Yang, J., Perkowski, M., Optimal synthesis of multiple output boolean functions using a set of quantum gates by symbolic reachability analysis, IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, Vol. 25, No. 9, 2006, Lanyon, B. P., Barbieri, M., Almeida, M. P., Jennewein, T., Ralph, T. C., Resch, K. J., Pryde, G. J., OBrien, J. L., Gilchrist, A., White, A. G., Quantum computing using shortcuts through higher dimensions, April 2008, arxiv: [quant-ph]. 20. Leier, A., Evolution of Quantum Algorithms Using Genetic Programming, PhD thesis, University of Dortmund, Leier, A., Banzhaf, W., Comparison of selection strategies for evolutionary quantum circuit design Proc. of the Genetic and Evolutionary Computation Conference (GECCO), 2004, Lloyd, S., Almost any quantum logic gate is universal, Phys. Rev. Lett., 75(2), Jul 1995, Lukac, M., Kameyama, M., Miller, D.M., Perkowski, M., High speed genetic algorithms in quantum logic synthesis: Low level parallelization vs. representation, Journal of Multiple Valued Logic and Soft Computing, accepted, Lukac, M., Perkowski, M., Kerntopf, P., Kameyama, M., GPU Acceleration Methods and Techniques for Quantum Logic Synthesis, International Workshop on Boolean Problems, Freiberg, Germany, September 16-17, Lukac, M., Perkowski, M., Evolving quantum circuit using genetic algorithm, Proc. of the 2002 NASA/DoD Conference on Evolvable Hardware, 2002, Lukac, M., Perkowski, M., Combining evolutionary and exhaustive search to find the least expensive quantum circuits, Proc. ULSI Symposium, Lukac, M., Perkowski, M., Using exhaustive search for the discovery of a new family of optimum universal permutative binary quantum gates, Proc. Int. Workshop on Logic & Synthesis, Lukac, M., Perkowski, M., Quantum finite state machines as sequential quantum circuits, Proc. ISMVL, Lukac, M., Perkowski, M., Goi, H., Pivtoraiko, M., Yu, C. H., Chung, K., Jee, H., Kim, B.-G., Kim, Y.-D., Evolutionary approach to quantum reversible circuit synthesis, Artif. Intell. Review., 20(3-4), 2003, Lukac, M., Perkowski, M., Goi, H., Pivtoraiko, M., Yu, C. H., Chung, K., Jee, H., Kim, B-G., Kim, Y-D., Evolutionary approach to quantum and reversible circuits synthesis, Artificial Intelligence in Logic Design, Kluwer Academic Publisher, 2004, Lukac, M., Pivtoraiko, M., Mishchenko, A., Perkowski, M., Automated synthesis of generalized reversible cascades using genetic algorithms, Proc. Fifth Int. Workshop on Boolean Problems, 2002, Lukac, M., Sasaki, A., Kameyama, M., Cellular automata based robotics architecture for behavioral decision making, to be published.

172 158 Martin Lukac, Marek Perkowski, Pawel Kerntopf, Michitaka Kameyama 33. Massey, P., Clark, J.A., Stepney, S., Evolving quantum circuits and programs through genetic programming, Proc. of the Genetic and Evolutionary Computation Conference (GECCO), 2004, Massey, P., Clark, J.A., Stepney, S., Evolving of a human-competitive quantum fourier transform algorithm using genetic programming, Proc. of the Genetic and Evolutionary Computation Conference (GECCO), 2005, Michalski, R. S., Learnable evolution: Combining symbolic and evolutionary learning, Proc. of the Fourth International Workshop on Multistrategy Learning (MSL98), 1998, Michalski, R. S., Kaufman, K., Intelligent evolutionary design: A new approach to optimizing complex engineering systems and its application to designing heat exchangers, International Journal of Intelligent Systems, 21(12), Miller, D. M., Dueck, G. W., Spectral techniques for reversible logic synthesis, Proc. RM, 2003, Miller, D. M., Maslov, D., Dueck, G. W., Synthesis of quantum multiple-valued circuits, Journal of Multiple-Valued Logic and Soft Computing, Vol. 12, No. 5-6, 2006, Miller, D. M., Thornton, M. A., Goodman, D., A decision diagram package for reversible and quantum circuits, Proc. IEEE World Congress on Computational Intelligence, page on CD, Miller, D. M., Wille, R., Dueck, G. W., Synthesizing reversible circuits for irreversible functions, Proc. 12th Euromicro Conference on Digital System Design, Architectures, Methods and Tools, Mischenko, A., Perkowski, M., Logic synthesis of reversible wave cascades, Proc. of IWLS, 2002, Moore, G. E., Cramming more components onto integrated circuits2, Electronics, April 19, Nielsen, M. A., Chuang, I. L., Quantum Computation and Quantum Information, Cambridge University Press, NVIDIA, NVIDIA CUDA, NVIDIA, learn.html. 45. Perkowski, M., Generalized orthonormal expansion and some of its applications, Proc. ISMVL, 1992, Perkowski, M., Lukac, M., Kerntopf, P., Kameyama, M., GPU library based approach to quantum logic synthesis, RC workshop, Perkowski, M., Marek-Sadowska, M., Jozwiak, L., Luba, T., Grygiel, S., Nowicka, M., Malvi, R., Wang, Z., Zhang, J. S., Decomposition of multiple-valued relations, Proc. Int. Symposium on Multi-Valued Logic, Perkowski, M., Sarabi, A., Beyl, F. R., Fundamental theorems and families of forms for binary and multiple-valued linearly independent logic, Proc. RM, Rubinstein, B. I. P., Evolving quantum circuits using genetic programming, Congress on Evolutionary Computation (CEC2001), 2001,

173 4 GPU Acceleration Methods of Representations for Quantum Circuits Shende, V. V., Bullock, S. S., Markov, I. L., A practical top-down approach to quantum circuit synthesis, Proc. Asia Pacific DAC, Shende, V. V., Prasad, A. K., Markov, I. L., Hayes, J.P., Synthesis of reversible logic circuits, 22(710), Somenzi, F., CUDD, CU decision diagram package, release , Spector, L., Automatic Quantum Computer Programming: A Genetic Programming Approach, Kluwer Academic Publishers, Stadelhofer, R., Banzhaf, W., Suter, D., Quantum and classical parallelism in parity algorithms for ensemble quantum computers, Physical Review A, 71, Stadelhofer, R., Banzhaf, W., Suter, D., Evolving blackbox quantum algorithms using genetic programming, Artif. Intell. Eng. Des. Anal. Manuf., 22, 2008, Viamontes, G. F., Markov, I. L., Hayes, J. P., Graph-based simulation of quantum computation in the state-vector and density-matrix representation, Proc, of SPIE, Vol. 5436, Wille, R., Groe, D., Miller, D. M., Dreschler, R., Equivalence checking of reversible circuits, Proc. 39th Int. Symp. on Multi-Valued Logic, 2009, Williams, C., Gray, A., Automated design of quantum circuits, Proc. of QCQC 1998, 1998, Yabuki, T., Genetic algorithms for quantum circuit design evolving a simpler teleportation circuit, Late Breaking Papers at the 2000 Genetic and Evolutionary Computation Conference, Yang, G., Song, X., Hung, W. N. N., Perkowski, M., Fast synthesis of exact minimal reversible circuits using group theory, Proc. Asia PAcific DAC, 2005, Yang, G., Song, X., Perkowski, M., Wu, J., Realizing ternary quantum switching networks without ancilla bits, Journal of Physics A, Mathematical and General, 38, 2005, Zilic, Z., Radecka, K., Kazamiphur, A., Reversible circuit technology mapping from nonreversible specifications, Proc. Conf. on Design, Automation and Test in Europe, DATE 07, San Jose, CA, USA, 2007,


Chapter 5
Synthesis of Ternary Quantum Circuits using Hasse Diagrams and Genetic Algorithms

Maher Hawash, Marek Perkowski, Martin Lukac

Abstract We present the results of the application of Hasse diagrams and Genetic Algorithms (GA) to the problem of synthesizing ternary quantum circuits, which belong to the class of reversible circuits, represented as input-output mapping vectors. The chapter specifically focuses on ternary quantum circuits with a relatively large number of variables, where valid solutions exist in an exponentially expanding search space. Valid solutions represent the set of all input vector permutations (arrangements or sequences) which satisfy the circuit specification and are algorithmically convergent. We discovered that orderings of the input vector impact the size of the resulting circuit and that only certain orderings are algorithmically convergent. Because of the very large number of sequences and the relatively small number of valid solutions, we accelerate the proposed method using the GPU. We show that a considerable acceleration is obtained using the GPU, and we illustrate the shortcomings of using either the CPU or the GPU in the proposed method. Additionally, in this chapter, we describe a method for systematically constructing such valid sequences using a ternary Hasse diagram and illustrate a detailed proof of the critical issue of algorithmic convergence. Finally, we illustrate the benefit of synthesizing many sequences over using the natural ternary sequence, and we demonstrate the advantage of utilizing a genetic algorithm for selecting the subset of valid sequences for synthesis compared to random selection.

Maher Hawash
Department of Electrical Engineering, Portland State University, Portland, OR, USA, gmhawash@gmail.com

Marek Perkowski
Department of Electrical Engineering, Portland State University, Portland, OR, USA, mperkows@ece.pdx.edu

Martin Lukac
Graduate School of Information Sciences, Tohoku University, Sendai, Japan, lukacm@ecei.tohoku.ac.jp


5.1 Introduction

In 1975, Gordon Moore, the cofounder of Intel, issued his famous prediction that the number of transistors on a microchip would double every 18 months. Surprisingly enough, Moore's prophecy has proven true for the past forty-some years; yet, as the dimensions of the transistor reach the low tens of nanometers, the dreadful quantum effects are exhibiting their influence on the behavior of the chip. Moore's law is nearing its end. In addition to fabrication woes, heat has been one of the greatest enemies of nanoscale miniaturization, pushing the thermal conductivity of the very thin copper interconnects to its limits. In the realm of classical technology, the irreversibility of digital logic gates results in information loss which manifests itself as heat dissipation. Landauer proved that using irreversible logic gates yields a rate of energy loss proportional to kT [13]. Essentially, information equals energy, and the loss of it equals heat loss. Computations which preserve information are considered reversible, and gates which perform reversible computations are designated as reversible gates. Bennett [6] showed that near-zero energy dissipation is possible when a computer can operate near its thermodynamic equilibrium, and further showed that such a stasis state can be achieved through reversible components. Toffoli [18] showed that quantum logic gates are inherently reversible and demonstrated a set of universal quantum binary primitives capable of implementing any logic circuit - namely, the NCT library (Not, Controlled-Not and Toffoli gates). The qubit came to represent the quantum analogy of the classical symbol of the information carrier: the bit. Possibly years before the feasibility of mass production of quantum computers, researchers have been laying the foundation for manufacturing such a computing device by exploring automated synthesis algorithms for quantum logic circuits. In this chapter we tackle the problem of quantum logic synthesis for ternary quantum logic.

Ternary-valued logic represents information in a base 3 system with three base states {0, 1, 2}, where a qutrit (trit) is a quantum unit of information with three basis states. The qutrit is the ternary equivalent of the binary qubit. A ternary-valued m-variable reversible logic function maps each of the 3^m input terms to a unique output term; mathematically speaking, it is an onto and one-to-one function, i.e., a bijection. The problem of synthesizing a reversible circuit is the process of constructing a cascade of ternary reversible gates which maps each said input term to its corresponding unique output term. Mathematically, quantum circuit synthesis represents a decomposition of a circuit's specification into a number of small permutations of reversible gates. In their work, Miller, Maslov and Dueck [16], henceforth MMD, presented exhaustive results for all (9!) permutations of two-variable ternary reversible functions. They further illustrated a synthesis example of the inherently irreversible 3-trit full adder by adding a single ancillary trit to create a 4-trit reversible function, and then applying their synthesis algorithm to such a function. M. M. Khan et al. [12] presented a method for synthesizing ternary GF(3)-based reversible logic circuits while avoiding the addition of ancillary trits. M. H. Khan et al. [10, 11] presented another method of synthesis of ternary circuits based on the Galois field sum of products (GFSOP) using cascades of multiple-input ternary Toffoli and swap gates. Al-Rabadi [2, 3] proposed a Galois-field-based approach to ternary logic synthesis using fast spectral transforms and fast permutation transforms.

In the synthesis of reversible circuits, the GPU has previously been used in various approaches. For instance, Lukac et al. [14] described the acceleration of Genetic Algorithm based reversible circuit synthesis by representing the circuits as a permutative matrix and computing the circuit representation on the GPU. In similar applications to reversible circuit synthesis, the matrix acceleration has been used to compute incompletely specified functions [15] or for an exhaustive search for circuits with minimal quantum cost. The GPU was shown to be very well adapted to the acceleration of matrices and to the computation of such elements that are well suited to the SIMD architecture. In this chapter, the application of the GPU is, however, different, because the circuits designed are not represented as matrices but rather as sequences of coefficients that have to be computed. Thus the acceleration is not matrix structured; instead, many serial computations are performed in parallel. Our major contributions in this chapter are the following:

1. To demonstrate the advantage of exploring different input vector sequences on the synthesized circuit quantum gate cost.
2. To outline an algorithm for constructing valid input vector sequences using a ternary Hasse structure and provide a proof of algorithmic convergence for such sequences.
3. To establish a set of benchmark cost numbers for the synthesis of large ternary functions up to 9 variables. Alhagi et al. [5] defined large binary functions as those consisting of eight (8) or more binary qubits, whose information could easily be contained within 5 qutrits. A nine-trit register carries the equivalent of more than 14 binary bits of information.

4. To introduce the GPU acceleration for the fast computation of circuits from vector sequences. We describe various settings, the results and a comparison with the CPU-based computation. As will be seen experimentally, the acceleration of the computation can be achieved if very strict experimental conditions are applied. This is quite different from the standard matrix acceleration by the GPU.

For the sake of self-containment, section 5.2 introduces the domain of ternary logic and reversible gates, followed by section 5.3, which illustrates the introduced concepts by an example of the operation of the algorithm for a two-variable ternary function. Section 5.4 provides a detailed explanation of our ternary logic synthesis algorithm, where it describes the concept of control line blocking and demonstrates how to construct Hasse diagrams and generate input vector sequences from such diagrams. Section 5.5 gives details of the genetic algorithm and section 5.6 describes the mapping of the proposed method onto the GPU. Finally, section 5.7 is an analysis of the experimental results and section 5.8 concludes this chapter.

5.2 Ternary Logic System

Measurement of a Qubit

Fig. 5.1: Measurement of a qubit.

The theory of quantum mechanics depicts the qubit as a quantum state which could exist in a state of superposition between the two basis states {0, 1}. However, upon observing the state of such a register, i.e., measuring its value, the qubit loses its state of superposition and collapses into one of the two basis states. Due to the quantized nature of the particle used for the computation (the polarization of a photon, the presence of an electron, etc.), a detector monitoring a value of zero (0) would either observe the particle (a one) or not (a zero), but nothing in between. For example, placing a set of two orthogonal polarization filters in the path of a stream of photons would polarize some of the photons along one axis (basis state zero) and the rest along the other axis (basis state one).

In essence, measurement of a qubit in the superposition state $\alpha|0\rangle + \beta|1\rangle$ forces the qubit to collapse to a zero with a probability of $|\alpha|^2$ or to a one with a probability of $|\beta|^2$. Visually, Figure 5.1 depicts the measurement as a projection of the vector representing the superposed state onto the vectors representing the two basis states {0, 1}. Similar to any probabilistic computation, a quantum computation is typically performed on an assembly of quantum systems consisting of a large number (N) of identical quantum circuits, all initialized to the identical set of input values. Measurement is performed by exposing half of the particles to the zero detector and the rest to the one detector, where, probabilistically, the sensor detecting the highest number of hits reflects the internal state of the quantum register.

Trits and Ternary States

Ternary logic is a closed logic system with certain ternary operators that operate on three logic values {0, 1, 2}. In quantum mechanics, the three ternary values could correspond to different polarizations of a photon or alignments of a nuclear spin in a uniform magnetic field. To date, Nuclear Magnetic Resonance (NMR) and Ion Trap are the most promising technologies which have been used to demonstrate the quantum circuit model of quantum computation.

Definition 7 A ternary quantum bit, a trit, is a ternary quantum system defined over the Hilbert space $H_3$ with basis states $\{|0\rangle, |1\rangle, |2\rangle\}$, which are represented with the following Heisenberg vector notation:

$$
|0\rangle = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad
|1\rangle = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad
|2\rangle = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}
\tag{5.1}
$$

Fig. 5.2: Ternary states (the states $0_T$, $1_T$ and $2_T$ shown relative to the binary basis states $0_B$ and $1_B$).

A trit in essence represents a quantum register capable of holding the value of a ternary variable which is retrieved and updated (internally) throughout a computation process. Figure 5.2 depicts the Bloch sphere where the three ternary states $|0_T\rangle$, $|1_T\rangle$ and $|2_T\rangle$ are defined in reference to the binary base states $|0_B\rangle$ and $|1_B\rangle$, described in [16, 4], as follows:

$$
|0_T\rangle = \frac{\sqrt{3}}{2}|0_B\rangle - \frac{1}{2}|1_B\rangle, \qquad
|1_T\rangle = |1_B\rangle, \qquad
|2_T\rangle = \frac{\sqrt{3}}{2}|0_B\rangle + \frac{1}{2}|1_B\rangle
\tag{5.2}
$$

A two-variable register consists of two trits which have an information capacity of 3^2 (9 possible distinct states) represented as follows:

$$
\begin{aligned}
|00\rangle &= |0\rangle \otimes |0\rangle = [1\,0\,0\,0\,0\,0\,0\,0\,0]^T &
|01\rangle &= |0\rangle \otimes |1\rangle = [0\,1\,0\,0\,0\,0\,0\,0\,0]^T &
|02\rangle &= |0\rangle \otimes |2\rangle = [0\,0\,1\,0\,0\,0\,0\,0\,0]^T \\
|10\rangle &= |1\rangle \otimes |0\rangle = [0\,0\,0\,1\,0\,0\,0\,0\,0]^T &
|11\rangle &= |1\rangle \otimes |1\rangle = [0\,0\,0\,0\,1\,0\,0\,0\,0]^T &
|12\rangle &= |1\rangle \otimes |2\rangle = [0\,0\,0\,0\,0\,1\,0\,0\,0]^T \\
|20\rangle &= |2\rangle \otimes |0\rangle = [0\,0\,0\,0\,0\,0\,1\,0\,0]^T &
|21\rangle &= |2\rangle \otimes |1\rangle = [0\,0\,0\,0\,0\,0\,0\,1\,0]^T &
|22\rangle &= |2\rangle \otimes |2\rangle = [0\,0\,0\,0\,0\,0\,0\,0\,1]^T
\end{aligned}
$$

where the symbol $\otimes$ represents the mathematical tensor (Kronecker) product of the two trits. Consequently, an n-trit ternary register is a vector of n ternary trits with a capacity of 3^n states, which is represented by the following equation:

$$
|\Psi(t)\rangle = \bigotimes_{i=0}^{n} |\phi_i\rangle
\tag{5.3}
$$

where $\Psi(t)$ represents the state of the system at time (t) and $\phi_i$ is the state of trit (i).

Reversible Operations

Definition 8 A k-variable ternary reversible gate (operator) is a bijective, one-to-one and onto mapping of every permutation of the 3^k input patterns.

Unlike classical logic, quantum logic circuits are inherently reversible and can only be constructed from reversible logic gates. Logical reversibility is the ability to reconstruct the input of a function from its output, and vice versa. The definition above stipulates such reversibility with the one-to-one mapping, where each input term is mapped to a single element of the output, and vice versa.

The onto property stipulates that all the elements of the output set are used, and hence there are the same number of elements in the input and output sets. The third requirement is closure, where the range and domain of the function are identical sets. The reader can easily deduce that, with Definition 8, the set of output terms is simply a permutation of the input terms, where each set includes a unique set of elements.

Fig. 5.3: (a) non-reversible functions f and g, (b) ax is a reversible function.

To illustrate, Figure 5.3a shows the two logical functions AND (f) and OR (g), which are, separately and jointly, irreversible. Both functions map the two inputs, a and b, to a single output, f or g, making it impossible to reconstruct the input pair from the single output. Since these gates only have a single output, one of their inputs has effectively been erased and the information it carries has been lost. Figure 5.3b, by contrast, represents the logical XOR function x which, taken alone, is unidirectional and irreversible. However, when x is combined with a copy of input a, the combined pair (a, x) represents a reversible function where each input term, (a, b), maps to a single unique output term, (a, x), and vice versa. In binary logic, only two reversible gates (operators) exist: a wire representing identity, and an inverter representing negation. As with all reversible gates, these binary operators have the same number of input and output variables, one variable, and uniquely map each input value to an output value, as shown in Figure 5.4.

Fig. 5.4: Binary Identity and NOT gates.

Notice that, mathematically, the two functions represent the ordered set of all permutations of the input values, 0 and 1: {wire: (0, 1), inverter: (1, 0)}. By corollary, the ternary values 0, 1, 2 can be fully permuted into six unique sequences, yielding a total of six unique ternary operators representing the ordered set of ternary values: {(0,1,2), (0,2,1), (1,0,2), (1,2,0), (2,0,1), (2,1,0)}.
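Before turning to the ternary operators, the reversibility criterion of Figure 5.3 can be checked mechanically. The sketch below is our illustration (the names, and the padding of the single-output AND gate with a dummy second output, are ours): it enumerates all input pairs and tests whether the outputs are all distinct.

```cpp
#include <cstdio>
#include <set>
#include <utility>

// A mapping over two binary inputs is reversible iff every input pair produces
// a distinct output, i.e., the mapping is a bijection on {0,1} x {0,1}.
template <typename F>
bool is_reversible(F f) {
    std::set<std::pair<int, int>> outputs;
    for (int a = 0; a < 2; ++a)
        for (int b = 0; b < 2; ++b)
            outputs.insert(f(a, b));
    return outputs.size() == 4;   // 4 distinct outputs for 4 distinct inputs
}

int main() {
    // AND collapses distinct inputs onto the same output: not reversible.
    auto and_gate = [](int a, int b) { return std::make_pair(a & b, 0); };
    // (a, a XOR b) keeps a copy of a alongside the XOR: reversible.
    auto xor_pair = [](int a, int b) { return std::make_pair(a, a ^ b); };

    std::printf("AND reversible: %s\n", is_reversible(and_gate) ? "yes" : "no");
    std::printf("(a, a XOR b) reversible: %s\n", is_reversible(xor_pair) ? "yes" : "no");
    return 0;
}
```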

Ternary Reversible Operators

Fig. 5.5: Generalized Ternary Gates.

Figure 5.5 lists the six ternary operators for a single ternary variable, with their names in the first column, the mathematical equation in the second column, the truth table in the third column, and the symbolic notation in the last column. Clearly, the [[+0]] operator is the analogy of a wire. The [[+1]] and [[+2]] operators are ternary inverters which perform a mathematical summation of the gate's inputs modulo 3. The [[12]], [[02]] and [[01]] operators swap their namesake input values without affecting the third ternary value. For example, the [[12]] gate swaps the values one and two while leaving the zeros alone. Symbolically, the three ternary quantum states could be visualized as three equally separated points on the Bloch sphere, where measurement along two of the three axes would be sufficient [16]. Considering the above gates in light of the atomic particle spin, Figure 5.6 provides a visual illustration of the mechanics of these operators. For example, the [[+1]] operator represents a clockwise rotation of 120°, whereas a [[+2]] operator represents a 240° rotation. The last three swap operators, on the other hand, act as a 180° rotation of the Bloch sphere around an axis line which passes through one of the constants. For example, the ternary operator [[12]] twists the Bloch sphere around the 0 axis line and swaps the locations of the 1 and 2 constants.

Definition 9 A controlled gate is a logic gate consisting of n variables where the values of (n-1) control variables enable the operation on the target variable (n).

Definition 10 The control lines are independent variables where a specific pattern of values on the set of control lines affects the operation on a dependent variable (target line).

Definition 11 A target line is a dependent variable where a specific operation on the line is enabled iff a set of control lines matches a specific pattern; otherwise, the signal on the line is passed through unchanged.
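Since the six ternary operators of Figure 5.5 are exactly the permutations of {0, 1, 2}, they fit in a small table. The following sketch (our illustration; the names are ours) encodes [[+0]], [[+1]], [[+2]], [[01]], [[02]] and [[12]] and verifies that [[+1]] followed by [[+2]] is the identity, a property referred to in the discussion of Figure 5.7 below.

```cpp
#include <array>
#include <cstdio>

// Each single-trit operator is a permutation of {0, 1, 2}: op[x] is the output
// produced for input x.  [[+k]] adds k modulo 3; [[ab]] swaps the values a and b.
using TritOp = std::array<int, 3>;

constexpr TritOp PLUS0  = {0, 1, 2};   // [[+0]]: wire
constexpr TritOp PLUS1  = {1, 2, 0};   // [[+1]]: x -> x + 1 (mod 3)
constexpr TritOp PLUS2  = {2, 0, 1};   // [[+2]]: x -> x + 2 (mod 3)
constexpr TritOp SWAP01 = {1, 0, 2};   // [[01]]: swap 0 and 1
constexpr TritOp SWAP02 = {2, 1, 0};   // [[02]]: swap 0 and 2
constexpr TritOp SWAP12 = {0, 2, 1};   // [[12]]: swap 1 and 2

// Apply g after f (first f, then g).
TritOp compose(const TritOp& f, const TritOp& g) {
    return {g[f[0]], g[f[1]], g[f[2]]};
}

int main() {
    const TritOp r = compose(PLUS1, PLUS2);   // [[+1]] followed by [[+2]]
    std::printf("[[+1]] then [[+2]] gives (%d,%d,%d), i.e. the identity [[+0]]\n",
                r[0], r[1], r[2]);
    return 0;
}
```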

Fig. 5.6: Visualization of the ternary operators a) [[+1]], b) [[12]].

Fig. 5.7: (a) Ternary inverter vs. (b) binary inverter; (c) ternary controlled op (C-OP) vs. (d) Feynman CNOT; and (e) ternary C^2-OP vs. (f) Toffoli gate (C^2-NOT).

Figure 5.7a shows the ternary extension of the binary inverter gate of Figure 5.7b. In the ternary case, however, the [[+1]] and [[+2]] act as inverters to one another: visually, a [[+1]] followed by a [[+2]] operator represents a cumulative rotation of 360° around the Bloch sphere, bringing the atomic particle back to its original orientation. Figure 5.7c shows the ternary extension of the binary C-NOT gate in Figure 5.7d. In this case, a value of one (1) on control line (a) activates the [[+2]] operator on line (b), while other values on line (a) would pass (b) unchanged. Of course, any of the five ternary operators could be used on the target line (b), resulting in five representations of the controlled gate. Additionally, the control line (a) could theoretically utilize any of the ternary values {0, 1, 2} as a control signal, resulting in a total of fifteen (15) representatives of the Feynman gate in the ternary domain. Similarly, Figure 5.7e shows a ternary equivalent of the Toffoli gate, where the operator [[12]] affects the target line (c) only if (a) and (b) are both one (1).

The reader can easily deduce, through the same argument above, that a total of 45 representatives of the Toffoli gate exist in the ternary space. Al-Rabadi [2] and Khan [10] considered several Multi-Valued (MV) and ternary gates, including the mod-sum gates used in this chapter. Miller et al. [16] used ternary gates identical to the gates shown in Figure 5.5 and limited the control line to the value of one (1).

5.3 Synthesis by Example

Fig. 5.8: The ternary synthesis example transforming output columns AB into input columns ab (last two columns). The shaded entries are target trits, the bordered entries are control values, and the top row indicates gates and control values.

Before we delve into the details of the ternary synthesis algorithm, it would be helpful to start with an illustration of the process as described by Miller et al. [16]. Figure 5.8 shows a reversible ternary function of two variables composed of 9 terms (3^2). Column (ab) represents the input vector and column (AB) represents the corresponding output vector. The objective of the synthesis is to create a cascade of primitive reversible ternary gates to map all input minterms to their corresponding output minterms. The algorithm terminates when all the terms of the output vector (AB) map to their corresponding terms of the input vector (ab) - compare column 8 to column 1. The algorithm processes the terms one trit at a time and places gates only when the input/output trits of the same position mismatch. The algorithm observes a simple, yet essential, guiding principle stating that: a completely mapped pair should never be altered by succeeding mapping calculations.

This important rule assures that the algorithm will always converge, which is an essential criterion for synthesizing arbitrary reversible circuits. Figure 5.8 illustrates the step-by-step synthesis process along with the circuit diagram for each transformation, as follows:

Considering the inherent reversibility of the function, the algorithm starts the synthesis from the output column (AB) towards the input column (ab). Starting with the first pair (00, 12), the algorithm realizes that a [[+1]] inverter on line (a) would correctly map the upper trit 0 to 1. Essentially, any value presented on the (a) line will be incremented by 1 modulo 3, as shown in the shaded text of the third column. A second gate, [[+2]], is placed on line (B), bringing its value from two to zero, matching the corresponding input line (b). Notice that, due to their unconditional nature, the above two gates affect all the terms of the output vector, as demonstrated by the shaded values.

The synthesis process continues with the second term, where the upper trit of the input term (01) mismatches the newly realized output term (11) in column 4. The algorithm places a [[+1]] gate on line (A) to perform a 1 → 0 transformation only if line (B) has the value of one; hence, it is a controlled gate. As stated before, the control values (highlighted with thick borders) are used to ensure that completely mapped pairs are never modified by a later step. The third input term (02) now maps to (22). A [[+2]] gate on line (a), controlled by a value of (2) on line (b), remedies the mismatch of the upper trit. The fourth input term (10) now maps to (22), requiring two gates to correct. The first gate is a controlled [[+2]] on line (b), controlled by a value of (2) on line (a). To correct the upper trit, we realize that a [[12]] swap gate on line (a) would correctly map that line and the remaining minterms of the function. Realizing that column 1 is identical to column 8, the synthesis algorithm terminates with a successful synthesis of the specification.

Taking a deeper look at the synthesis example, the reader would surely discover a discrepancy between the operator indicated on top, e.g., [[+1]], and what the table indicates as the result of such an operation (shaded). For example, with the pair (00, 12), performing [[+1]] on the trit value of (A = 1) would surely yield a value of (A = 2), not the (A = 0) indicated in the first row of the table, column 3. To alleviate this disagreement, remember that we have started our synthesis process from the output vector (AB), column 2, heading towards the input vector (ab), replicated in column 8. And why not, this is a reversible circuit after all! So, in truth, the operation [[+1]] answers the question: what do I need to make the input trit of value (a = 0) into the output trit of value (A = 1)? A [[+1]], of course! And that is exactly what we have specified. We could have easily started from the input vector (ab) and synthesized the circuit by adding gates to match the output in a similar manner, and would have surely ended up with a circuit functionally equivalent to the one shown above.
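The per-step transformation used in this example can be written compactly. The sketch below is an illustrative reimplementation (the data layout and names are ours, not the authors'): an unconditional inverter updates every term of the output column, while a controlled gate updates only the terms whose control trit holds the required value.

```cpp
#include <array>
#include <vector>

// A two-trit term: t[0] is the upper trit (line A/a), t[1] the lower trit (line B/b).
using Term = std::array<int, 2>;

// Apply [[+k]] to trit `target` of every term; if `control` is >= 0, the gate
// fires only when trit `control` currently holds `controlValue`.
void apply_plus_gate(std::vector<Term>& column, int target, int k,
                     int control = -1, int controlValue = -1) {
    for (Term& t : column) {
        if (control < 0 || t[control] == controlValue)
            t[target] = (t[target] + k) % 3;
    }
}

// Apply the swap operator [[ab]] (e.g. [[12]] swaps the values 1 and 2) to trit
// `target`, with the same optional control as above.
void apply_swap_gate(std::vector<Term>& column, int target, int a, int b,
                     int control = -1, int controlValue = -1) {
    for (Term& t : column) {
        if (control < 0 || t[control] == controlValue) {
            if (t[target] == a) t[target] = b;
            else if (t[target] == b) t[target] = a;
        }
    }
}
```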

5.4 Ternary Logic Synthesis Algorithm

We were inspired by our research in the binary domain [5], which revealed that the size of the resulting circuit is greatly influenced by the arrangement of the terms of the input vector. In the above example, the terms of the input vector (ab) are arranged in their natural ternary order. The remainder of this chapter answers the questions: Is the size of the circuit influenced by the arrangement of the input vector? Do such arrangements affect algorithmic convergence? We will demonstrate below that the answer to both questions is in the affirmative: the arrangement of the input vector influences the number of gates required to implement the circuit. We will also demonstrate, through an example, the concept of control line blocking where, for a subset of such input orderings, the algorithm becomes trapped in an endless loop. It is worth mentioning, for the sake of completeness, that our algorithm for synthesizing reversible circuits adheres to the basic assumptions for quantum circuits as outlined by Toffoli [18] below:

1. No fan-out is permitted between gates,
2. Loops are not permitted,
3. Permutations of connections between gates are permitted.

Control Line Blocking

Fig. 5.9: Using either of the two trits of 20 as a control line is guaranteed to change one of the previously mapped completed terms.

Some minterm orderings of the input vector violate the bedrock principle of never altering previously completed mapped pairs, and force the algorithm into an infinite loop. Using the same steps as in Section 5.3 above, Figure 5.9 lists the synthesis of the first three minterms of such an input sequence. Once the second minterm is mapped correctly, trying to map the third pair (20, 02) becomes impossible without altering the first two completely mapped pairs. For example, in order to map the lower trit correctly (0 → 2), we would normally use the upper trit (2) as a control signal and possibly apply the swap gate [[02]] to provide the correct mapping.

However, such an operation would surely alter the completely mapped pair of the second row (22, 20), and in effect violate the aforementioned principle. The reader can easily deduce that going back and attempting to correct the term of the second row will result in an infinite loop. Similarly, attempting to use the lower trit (0) as a control signal is also destructive; in this case, the first completely mapped pair would be altered.

Definition 12 The Control Line Blocking condition occurs when all the control lines of the current minterm are a subset of the control lines of a previously completed minterm for a given input order.

It is possible, of course, to programmatically detect when a previously mapped pair has been altered and, consequently, reject the input sequence. Going through all the possible permutations of input arrangements would surely guarantee the discovery of the most optimal solution. Even for functions with a small number of ternary variables, however, attempting all permutations is impossible. A lowly 3-variable ternary function consists of 27 (3^3) minterms, resulting in 27! (about 10^28) possible permutations: not an easy feat even for our most powerful computing machines. The question then is: would it be possible to focus the search only on sequences which are guaranteed to converge?

Ternary Hasse Input Sequence

Mathematics comes to the rescue! The authors discovered that it is possible to construct a subset of all possible convergent sequences using the mathematical concept of Hasse diagrams and covering graphs. Rather than cycling through the entire set of permutations, we could easily construct a number of such valid input sequences and discover the ones which provide the circuits with the lowest quantum cost.

Definition 13 A Hasse or Poset diagram is a type of mathematical diagram used to represent a finite partially ordered set, in the form of a graph where, for the relation {(x, y) | x ≤ y; x, y ∈ S}, each element of S is drawn as a vertex in the plane and a line segment or curve is drawn upward from x to y whenever y covers x (that is, whenever x < y and there is no z such that x < z < y). The relations < and ≤ represent a precedence hierarchy between the operands and are not necessarily analogous to the mathematical inequality relations on real numbers.

We will start with a demonstration of constructing a Hasse diagram for a two-variable ternary function - see Figure 5.10. Starting from the smallest-valued minterm (00), we draw a line to each of the two minterms which satisfy the relation of partially ordered sets: {(x, y) | x ≤ y; x, y ∈ S}. Loosely speaking, we find all the minterms which are trit-wise larger than the minterm at hand. For 00, adding a 1 to the lower trit yields 01, and adding a one to the upper trit yields 10 - shown in Figure 5.10a. In a similar fashion, the 01 minterm would yield 02 by adding a 1 to the lower trit, and 11 by incrementing the upper trit - Figure 5.10b.

Fig. 5.10: Construction of a Ternary Hasse Diagram.

The process repeats for all the terms in the set until the highest number, 22, is reached. Consider for a moment the upper trit of term 02 in Figure 5.10c. Notice that, along each branch, every transition affects only a single trit and that, for the sake of maintaining closure within the ternary domain, the process stops once a trit reaches the value of 2. Now that the lower trit of minterm 02 has reached the ceiling of 2, the upper trit can transition through its own stages.

Construction of an Input Sequence

Fig. 5.11: Two-Trit Ternary Hasse Diagram.

Once we have constructed the Hasse structure, we group all the minterms at the same level together, within a set of bands, as shown graphically in Figure 5.11.

Definition 14 A Hasse Ternary Band is the set of terms at the same level in a ternary Hasse diagram, where the sum of the trits of each term equals the zero-based numerical order of the band.

The column on the right of Figure 5.11 shows the sum of the trits in each band. For example, band 3 has the terms 12 and 21, which both add up to 3, and hence they are in band 3.

Corollary 1 A ternary function with n variables has 2n+1 Hasse Ternary Bands.

Proof Since the highest band has a single minterm consisting of n digits equal to 2, the sum of all its digits is clearly 2n. Since the band order is zero-based, according to Definition 14, the number of bands is 2n + 1.

At this stage, we are able to use the Hasse Ternary Bands to construct input vectors which are guaranteed to converge. The following pseudo-code outlines the process:

1  inputSequence := {}
2  for index := 0 to 2n DO
3      bandSequence := Permutation(Band[index])
4      Append(inputSequence, bandSequence)
5  OD

The above pseudo-code can be described by the following steps:

1. Step 2: Start at the lowest band, consisting of all zeros (0...0), and stop at the highest band, consisting of all twos (2...2).
2. Steps 3, 4: For each band, append any permutation of the terms within the band to the end of the sequence.

Observe that for a two-variable function, the combined permutations of the bands will result in 24 valid input sequences (2! × 3! × 2!). The following two vectors are examples of valid input sequences:

S1 = {00, 01, 10, 11, 20, 02, 21, 12, 22}
S2 = {00, 10, 01, 20, 02, 11, 12, 21, 22}

The alert reader will readily notice that, in constructing the input sequence, the precedence order, defined below, is not necessarily obeyed as prescribed in the section Ternary Hasse Input Sequence above. For instance, the vector S2 includes the term 10, with a high trit of one (1), followed by 01, with a high trit of zero (0); yet we consider this sequence valid. Notice that the precedence criterion (of 0 < 1 < 2) applies only to the construction of the Hasse diagram, and not to the construction of the input vector from the Hasse diagram. The only restriction for constructing the input vector from a Hasse diagram is: all minterms of a lower band must be used before any minterms in the next higher band. Clearly, the algorithm described above satisfies both conditions.

Definition 15 The precedence order refers to the mathematical binary relation between a set of elements in a partially ordered set (Poset) where one element precedes the other. Partial orders reflect the fact that not every pair of elements of a Poset need be related.
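A runnable version of the band-construction pseudo-code above (an illustrative sketch; function and variable names are ours): minterms are grouped into the 2n+1 bands by their trit sum, each band is shuffled independently, and the bands are concatenated in increasing order, which is exactly the restriction that lower bands must be exhausted before higher ones.

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Build one valid input sequence for an n-trit ternary function:
// group the 3^n minterms into 2n+1 bands by trit sum, permute each band
// independently, then append the bands in increasing order.
std::vector<int> hasse_band_sequence(int n, std::mt19937& rng) {
    int total = 1;
    for (int i = 0; i < n; ++i) total *= 3;          // 3^n minterms

    std::vector<std::vector<int>> bands(2 * n + 1);  // Corollary 1: 2n+1 bands
    for (int m = 0; m < total; ++m) {
        int sum = 0;
        for (int v = m; v > 0; v /= 3) sum += v % 3; // trit sum = band index
        bands[sum].push_back(m);
    }

    std::vector<int> sequence;
    for (auto& band : bands) {
        std::shuffle(band.begin(), band.end(), rng); // any permutation within a band
        sequence.insert(sequence.end(), band.begin(), band.end());
    }
    return sequence;
}

int main() {
    std::mt19937 rng(12345);
    for (int m : hasse_band_sequence(2, rng))        // two-trit example (9 minterms)
        std::printf("%d%d ", m / 3, m % 3);          // print as trit pairs, e.g. 00 01 ...
    std::printf("\n");
    return 0;
}
```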

Now that an input vector has been constructed, we apply the same synthesis process detailed in Synthesis by Example (Section 5.3). What are the advantages of this algorithm then? With this algorithm, we are now able to: 1) systematically construct multiple input vector arrangements which are guaranteed to converge, and 2) do so without searching through every possible permutation of the input vector. This allows us to examine a large number of input vector arrangements in order to discover the circuit with the best quantum cost among the input vectors. Most likely, however, such a solution will not be the optimal circuit realization, as the number of possible input vector arrangements grows exponentially.

Hasse Precedence Quandary

Fig. 5.12: Ternary Value Precedence diagram.

In our attempt to construct Hasse diagrams for the ternary space, we were confronted with a dilemma regarding precedence in the form of the question: what precedence exists amongst the three ternary values {0, 1, 2}? Figure 5.12, for example, shows three possible arrangements of the ternary values establishing the precedence of one constant over the other(s). Do we treat the literal one (1) as equal, less than or greater than the other two? Intrinsically, there is no natural precedence among the three constants, but rather a symbolic primacy born out of our choice of mathematical manipulations. Figure 5.12 clearly demonstrates that we conveniently, yet arbitrarily, designated the symbol 0 to align at 0, while the other two symbols are 120 degrees away, and hence there are no natural physical phenomena dictating precedence amongst the three constants.

In the algorithm described herein, we have artificially set precedence for the convenience of implementation where, as described shortly, we opted to perform low-to-high trit transitions first, followed by high-to-low transitions, and as a result introduced an artificial algorithmic prejudice in favor of the value two (2). Our choice was largely driven by the procedure for constructing the Hasse diagram, described above, where we have favored the constant two over one and the latter over zero. Such a favorable treatment results in delaying the appearance

of terms containing the constant two while forcing the terms containing zeros to appear earlier during the synthesis process. Such a delay allows us to avoid control line blocking by relying on the fact that the value two will always appear later in the sequence and, hence, can be used as a control variable. We shall revisit this topic further in the discussion about convergence below.

Figure 5.12b shows that the constant two (2) has precedence over the other two values, and Figure 5.12c shows that the two constants one (1) and two (2) are of equal precedence and that both are higher than the value zero (0). Algorithmically, it is feasible, of course, to swap the symbolic values zero (0) and two (2), which would grant preference to the value zero (0) over the value two (2). Following the same thought process, the reader can quickly ascertain that, at most, there exist 12 unique precedence orders:

1. Six for precedence order (a), representing the six unique permutations of {0, 1, 2}: O_a = {0 ≺ 1 ≺ 2, 0 ≺ 2 ≺ 1, 1 ≺ 0 ≺ 2, 1 ≺ 2 ≺ 0, 2 ≺ 0 ≺ 1, 2 ≺ 1 ≺ 0},
2. Three for precedence order (b): O_b = {2 ≺ {0,1}, 1 ≺ {0,2}, 0 ≺ {1,2}},
3. Three for precedence order (c): O_c = {{1,2} ≺ 0, {0,1} ≺ 2, {0,2} ≺ 1}.

Mathematically speaking, the members within each of the three sets O_a, O_b, O_c are described as equivalence classes, where a single element acts as a representative of the entire group. In this chapter we limit the discussion to the three precedence orders shown in Figure 5.12 and treat them as representatives of the twelve possible precedence orders. The precedence order of Figure 5.12a will be explained immediately after the discussion about algorithmic convergence.

Definition 16 Given a set S and an equivalence relation ∼ on S, the equivalence class of an element a in S is the subset of all elements in S which are equivalent to a, represented as: [a] = {x ∈ S | x ∼ a}

5.5 Selection Through a Genetic Algorithm

A ternary function with n variables has 3^n minterms in the input vector, which makes the number of possible permutations of an input vector an astounding 3^n!. Our method of constructing convergent input vector sequences constructs a subset of all convergent sequences. According to Corollary 1, a ternary Hasse diagram consists of 2n+1 bands, where each band consists of N(b) minterms - see the derivation in Appendix A:

N(b) = \sum_{j=0}^{m} \binom{n}{m-j} \binom{n-(m-j)}{2j + (b \bmod 2)}, \quad m = \lfloor b/2 \rfloor     (5.4)

where \binom{n}{k} is the combination operator and mod is the modulo operator. In the process of constructing the input vector, step 3 of the pseudo-code above (section 5.4.3) selects a single sequence from among all the permutations of the minterms in a

band. Consequently, the total number of possible input vectors is the product of the numbers of permutations of all bands, stated as:

T(n) = \prod_{b=0}^{2n} N(b)!     (5.5)

Fig. 5.13: The number of permutations for all the possible input vectors vs. Hasse based sequences.

Clearly, as the number of variables increases (see Table 5.1), the number of possible input vectors generated by our algorithm still grows exponentially, although orders of magnitude more slowly than the number of all possible input vectors - see Figure 5.13. For a 3-trit function, we could easily examine all 6,494 Hasse based input vectors and select the one which yields the best quantum cost. Functions of 4 trits or more, however, are suddenly beyond the capacity of our best computers. Confronted with such daunting roadblocks, and borrowing from our experience in the binary domain [7], we opted to employ a genetic algorithm to construct potential input vector arrangements based on the results of previously synthesized input vectors.

Objective Function using Quantum Gate Count

In their analysis of the simple 2-trit function, the authors of [16] arbitrarily assigned a cost of one for unconditional ternary gates, and a cost of two for controlled ternary gates. Practically, however, there are no existing physical implementations of ternary quantum systems, and hence no realistic cost could be assumed. For the purposes of this chapter, we use the number of gates as the measure of fitness, or objective function, for the genetic algorithm. For an arbitrary ternary quantum circuit C with

k quantum gates, the quantum cost Q_c is calculated as follows:

Q_c = \sum_{i=0}^{k} G_{qc}(i)     (5.6)

where G_{qc}(i) is the quantum cost of each gate, which we assume to be one for the purposes of our analysis. The actual gate cost would be used once a physical implementation of ternary gates is realized.

Table 5.1: All the possible permutations vs. total permutations for Hasse based sequences.

Number of      Number of        Input Vector           T(n)
Variables (n)  Minterms (3^n)   Permutations (3^n!)    (Eq. 5.5)
2              9                362,880                24
3              27               ~1 x 10^28             ...
4              81               ...                    ...
5              243              ...                    ...
6              729              ...                    ...
7              2,187            ...                    ...
8              6,561            ~1 x 10^22,196         ...
9              19,683           ~8 x 10^75,974         ~2 x 10^9,615

Genetic Algorithm

Rather than bouncing randomly around the search space, a genetic algorithm follows a set of directed probabilistic steps where new solutions are the offspring of existing good solutions. The following block exhibits the standard structure of a genetic algorithm:

1  g := number of generations
2  initialize(P(g))
3  while (g > 0) DO
4      evaluate(P(g))
5      P1(g), P2(g) := select(P(g));        set of parent pairs
6      g := g - 1;
7      P(g) := recombine(P1(g), P2(g));     crossover children
8      P(g) := mutate(P(g));                mutate children
9  OD

The initialization step (2) randomly creates a set of valid input vector sequences, the initial population, using the Ternary Hasse Diagrams where, for each band, the set of minterms is randomly shuffled and the resulting band arrangement is concatenated to the input vector under construction - step 4 of the pseudo-code in section 5.4.3. The synthesis of the

initial population gives the fitness (quantum cost) of each input vector arrangement, which is used to determine the next generation (offspring) of solutions to examine. The roulette wheel selection process is then used to randomly select two parents of the current generation for recombination (step 7). For this research, we studied both single and double crossover operators to create the next generation, with a special consideration for the position of the crossover - discussed shortly. The final step of the genetic algorithm (step 8) applies a mutation operator in order to continuously maintain population diversity and avoid premature convergence to local minima.

Genotype and Valid Operators

Fig. 5.14: Genetic Algorithm genotype and valid operators.

As discussed earlier and shown in [5, 7], the band structure, defined above, must be faithfully preserved in order to assure algorithmic convergence. As a result of the banded structure of the algorithm, recombination operators are limited in their application to the boundaries of a band in order to avoid a minterm jumping from one band to another. In a similar fashion, mutation operators are constrained to swapping minterms intra-band, which also preserves the certitude of convergence. Figure 5.14 illustrates the structure of a chromosome, i.e. an input vector arrangement, for a two variable ternary function consisting of five bands. As hinted earlier, in order to ensure that an offspring is a valid input vector sequence, the crossover point(s) must occur at either end of a band, but not in the middle of a band. Had the invalid crossover point been taken in Figure 5.14, the resultant child would have been invalid, as it would have included the minterm 01 twice and lacked the term 10. Of course, a repair process could have detected and corrected such a defect which, depending on the repair process, could yield a different, yet valid, input vector. The reader might correctly surmise that the choice of limiting crossover to band boundaries could potentially result in stale members within each band, leading to premature convergence to local minima.
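As an illustration of these constraints, the following host-side CUDA C++ sketch (our own, with hypothetical names such as bandCrossover) recombines two parent sequences only at band boundaries and mutates by swapping two minterms inside a single band, so that every child remains a valid, convergent input sequence.

#include <random>
#include <utility>
#include <vector>

// A chromosome is an input sequence stored band by band, e.g. for two trits
// the bands of sizes {1, 2, 3, 2, 1} laid out consecutively in one vector.
using Chromosome = std::vector<int>;

// Single-point crossover restricted to band boundaries: the cut may fall only
// between two bands, so no minterm can jump from one band to another.
Chromosome bandCrossover(const Chromosome &p1, const Chromosome &p2,
                         const std::vector<int> &bandSizes, std::mt19937 &rng) {
    std::uniform_int_distribution<int> pick(1, (int)bandSizes.size() - 1);
    int cutBand = pick(rng);                       // cut after this many bands
    int cut = 0;
    for (int b = 0; b < cutBand; ++b) cut += bandSizes[b];

    Chromosome child(p1.begin(), p1.begin() + cut);
    child.insert(child.end(), p2.begin() + cut, p2.end());
    return child;                                  // still band-ordered and complete
}

// Mutation constrained to a single band: swap two minterms inside one band
// (possibly the same position, which is a harmless no-op).
void intraBandMutation(Chromosome &c, const std::vector<int> &bandSizes,
                       std::mt19937 &rng) {
    std::uniform_int_distribution<int> pickBand(0, (int)bandSizes.size() - 1);
    int b = pickBand(rng);
    if (bandSizes[b] < 2) return;                  // nothing to swap in this band
    int start = 0;
    for (int i = 0; i < b; ++i) start += bandSizes[i];
    std::uniform_int_distribution<int> pickPos(0, bandSizes[b] - 1);
    std::swap(c[start + pickPos(rng)], c[start + pickPos(rng)]);
}

Because every chromosome holds exactly the minterms of each band (only their order within the band differs), splicing whole bands from two parents always yields another valid sequence.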

In general, genetic algorithms introduce mutations as a remedy for premature convergence, where mutation typically acts as a background operator with a low probability of occurrence. For this study, however, we intentionally elevated the probability of applying the mutation operator to a level higher than suggested by standard genetic algorithms, in order to counteract the limitations imposed on the recombination operator (band boundaries only). A high level of mutation probability, we theorized, would inject diversity into the children, allowing them to escape such a hasty race to the nearest local minimum.

5.6 Mapping of the Algorithm to GPU

In order to accelerate the search for the best sequence with the genetic algorithm, we mapped the computation of the circuits to the GPU. The general idea is that computing multiple sequences in parallel on the GPU will accelerate the convergence of the algorithm. The mapping separates the algorithm into two parts:
1. The Genetic Algorithm and all of its components are run on the CPU,
2. The sequence evaluation is performed on the GPU.

For every generation, the genetic algorithm produces candidate circuits in the form of sequences, and this information is required to compute the individual circuits. Thus, at every generation, such sequences are sent in integer format to the GPU. The sequences are split into groups, and each group is processed by a single thread on one of the available cores in the GPU.

5.7 Experimental Results

For the purposes of this study, we have limited our experiment to the set of Hidden Weighted Trit (HWT) functions [8]. HWT functions are an extension of their binary counterpart, the Hidden Weighted Bit (HWB) functions, which were first introduced by Prasad et al. [17] and are heavily cited as one of the harder benchmarks for reversible binary logic synthesis [5, 1, 7, 9, 17].

Definition 17 Hidden Weighted Trit (HWT) functions are reversible ternary functions where the output minterm is generated by circularly shifting the input minterm by the number of its non-zero trits.

For the sake of a balanced comparison, we used the same synthesis algorithm for both the natural and the Hasse based input sequences, with the same method for calculating the cost. The only independent variable, in this case, is the input vector arrangement, which represents the crux of the experiment reported in this chapter. The use of the genetic algorithm is merely an aid for discovering Hasse based input vectors with lower circuit cost. Table 5.2 shows the results of the synthesis of the HWT functions of 4 to 9 trits using the natural and the Hasse based input vector orderings, using CPU processing only.

Table 5.2: A comparison between natural and Hasse based input vector arrangements.

               Natural             Hasse               Saving
Function       # Gates   Time      # Gates   Time      # Gates   Percentage
HWT-4          ...       ...       ...       ...       ...       60%
HWT-5          ...       ...       ...       ...       ...       ...
HWT-6          ...       ...       ...       ...       ...       ...
HWT-7          ...       ...       ...       ...       ...       ...
HWT-8          ...       ...       ...       ...       ...       ...
HWT-9          ...       ...       ...       ...       ...       7%

Clearly, the Hasse based ordering has produced better results for all the functions, with a 60% saving for the HWT-4 function down to 7% for the HWT-9 function. Of course, the results should not be surprising, as the genetic algorithm processed a total of 600,000 arrangements of the input vector, consisting of 100 generations of 100 individuals each, for two variants of recombination methods (single and double crossover) and 30 combinations of probabilities of recombination and mutation. The fact that we are able to freely construct convergent input vector sequences, at our whim, is the strong point of this algorithm and, hence, our main contribution to the research area.

Notice that the percentage of savings shrinks dramatically as the number of variables increases, which can easily be explained with a quick glance at Table 5.1 above. Even though the search space is greatly reduced with the Hasse based algorithm, a nine trit function has a search space of the order 10^9,615, for which an exploration of 600,000 elements is like a drop in a colossal ocean.

Fig. 5.15: Time of synthesis for Hidden Weighted Trit functions.

To exacerbate matters further, the time to synthesize functions with a larger number of variables increases exponentially. Although our implementation of the genetic algorithm took advantage of multithreading on an 8-core Intel i7 processor, the nine variable function consumed more than two hours to yield a 7% improvement by visiting 600,000 solutions. Figure 5.15 shows an exponentially increasing curve depicting time vs. number of variables for the class of HWT functions. For a four trit function, the 600,000 visited solutions represent a significant fraction of the Hasse based search space, where a saving of 60% is a great achievement. For a nine variable function, however, covering a similar ratio of the search space would require visiting a number of potential solutions which is beyond the possibilities of all existing computing power on earth.

GPU Acceleration

In order to reduce the time required for finding Hasse sequences with the lowest cost, several experiments using the GPU-CPU mapping have been performed. The results of these experiments are shown in Tables 5.3 and 5.5 for the HWT5 and HWT6 functions, respectively. The columns in both tables represent, in order: case - the type of experiment, cores - the number of cores (CPU or GPU) the algorithm is running on, repetitions - the number of times the sequences are computed, samples - the number of different sequences computed, Total Time - the time required for the whole computation, and Time/Sample - the unit time required to compute and evaluate a single sequence. Notice that, because the mapping to the GPU concerns only the sequence to circuit mapping, the comparison and evaluation of the computation speed is carried out for the circuit computation only.

Table 5.3: Function HWT5.

Case                                                      cores  repetitions  samples  Total Time  Time/Sample
GA, 8-core host CPU                                       ...    ...          ...      ...         ...
CUDA:
1  Single sequence running 1000 times on a single
   CUDA core (loop inside the CUDA core)                  ...    ...          ...      ...         ...
2  Same as above, plus a call to __syncthreads()
   outside the loop                                       ...    ...          ...      ...         ...
3  Single sequence duplicated across the specified
   number of cores; loop in the CPU                       ...    ...          ...      ...         ...
4  Passthrough, 2 devices                                 ...    ...          ...      ...         ...

The first case in Tables 5.3 and 5.5 shows the Genetic Algorithm running on an Intel i7 960 processor with 8 cores running at 3.2 GHz each. The system has 12 GByte of DDR3 memory. The GA runs through 72 sets of parameters of 300 generations each, where each generation consists of 512 individuals. For HWT5, we realize a 317% speedup on the GPU relative to the CPU. For the HWT6 function, however, such a speedup diminishes to a mere 10% advantage for the GPU. This big difference, and the decrease of the performance between the 512 and 1024 cores, is due to the transfer time required to send the sequences from the CPU to the GPU. This can be confirmed by observing that the time/sample remains unchanged (the processing happens at the same speed). The reason for still seeing a speedup for HWT6 is that it was CPU bound, and having more CPU cores helped demonstrate the value of CUDA (when we are CPU bound). This is because the number of sequences to be synthesized grows very large between the five and six variable functions. Thus, the amount of data to be transferred between the CPU and the GPU consumes much more time. This impedes the GPU based approach, and in the ideal case sending all sequences at the same time would improve the overall GPU performance. Table 5.4 shows the mapping of the GA to either the CPU-only approach (column 1) or the CPU-GPU approach (column 2).

Table 5.4: GA Algorithm.

Column 1 - CPU:
    for 300 generations
        for 512 individuals
            AddToQueue(individual)
        Breed & Cull
    8 Threads (Kernel):
        While(!queue.empty())
            Synthesize(queue.pop(Individual))

Column 2 - CUDA:
    CPU: for 300 generations
        CUDA: Synthesize 512 individuals
        CPU: Breed & Cull

Cases 1 and 2 in Tables 5.3 and 5.5 show the results for running a single thread on the GPU, where the loop over the available sequences is implemented directly in the GPU. In case 1, a single sample is fed to a single CUDA thread, and the same sample is synthesized 1000 times. Case 2 is the same, with the addition of __syncthreads() outside the loop, and only for 100 samples. They both gave essentially the same results for synthesizing a single sample. The difference between these two cases is shown directly in Table 5.6. Case 3 is similar to cases 1 and 2, but the loop over the available samples is in the CPU: the loop is in the CPU while the number of threads on the CUDA device is changed. In the first configuration, the loop repeats 1000 times in the CPU, calling a single thread to do the synthesis. As expected, adding a second thread (effectively synthesizing two sequences) did not affect the time, as both CUDA threads work in parallel. Similarly, running 512 threads at the same time (the number of cores on a single device) was all done in the same time. This particular configuration shows the efficiency and the usefulness of the GPU acceleration.
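The 512-thread variant of case 3 can be sketched in CUDA C++ as follows. This is our own illustration, not the authors' code: names such as synthesizeKernel and the fixed-length integer encoding of a sequence are assumptions, and the per-thread synthesis is reduced to a placeholder gate count in the spirit of Eq. (5.6).

#include <cuda_runtime.h>

// Each thread receives one integer-encoded input sequence and synthesizes the
// corresponding circuit; here the synthesis body is only sketched as a cost count.
__global__ void synthesizeKernel(const int *sequences, int seqLen,
                                 int numSeq, int *gateCount) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numSeq) return;

    const int *seq = sequences + tid * seqLen;    // this thread's sequence
    int cost = 0;
    for (int i = 0; i < seqLen; ++i) {
        // ... map minterm seq[i], emitting gates of unit cost (Eq. 5.6) ...
        cost += seq[i] & 1;                       // placeholder for the real gate count
    }
    gateCount[tid] = cost;
}

// Host side: copy all candidate sequences of one GA generation to the device,
// evaluate them with one thread per sequence, and copy the costs back.
void evaluateGeneration(const int *hostSeq, int seqLen, int numSeq, int *hostCost) {
    int *dSeq = nullptr, *dCost = nullptr;
    cudaMalloc(&dSeq, sizeof(int) * seqLen * numSeq);
    cudaMalloc(&dCost, sizeof(int) * numSeq);
    cudaMemcpy(dSeq, hostSeq, sizeof(int) * seqLen * numSeq, cudaMemcpyHostToDevice);

    int threads = 512;                            // one thread per candidate sequence
    int blocks = (numSeq + threads - 1) / threads;
    synthesizeKernel<<<blocks, threads>>>(dSeq, seqLen, numSeq, dCost);
    cudaDeviceSynchronize();

    cudaMemcpy(hostCost, dCost, sizeof(int) * numSeq, cudaMemcpyDeviceToHost);
    cudaFree(dSeq);
    cudaFree(dCost);
}

Keeping the loop over candidates on the host and giving each device thread exactly one sequence corresponds to the case-3 configuration that the measurements favor.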

Table 5.5: Function HWT6.

Case                                                      cores  repetitions  samples  Total Time  Time/Sample
GA, 8-core host CPU                                       ...    ...          ...      ...         ...
CUDA:
1  Single sequence running 1000 times on a single
   CUDA core (loop inside the CUDA core)                  ...    ...          ...      ...         ...
2  Same as above, plus a call to __syncthreads()
   outside the loop                                       ...    ...          ...      ...         ...
3  Single sequence duplicated across the specified
   number of cores; loop in the CPU                       ...    ...          ...      ...         ...
4  Passthrough, 2 devices                                 ...    ...          ...      ...         ...

The pseudocode of case 3 is shown in Table 5.7. Case 4 is the same algorithm as case 3, where 1024 sequences are fed to two GPU devices (two distinct graphics cards), each with 512 threads. The loop is repeated 512 times. For the HWT5 function, the time per sample remained about the same, so using two devices effectively did not affect the time per sample. For HWT6, however, the time per sample dropped by almost 30%, showing the parallelism of CUDA. In order to get both devices to work in parallel, a special memory mode in the host memory was used. It is called the pinned host memory mode, which page-locks a region of memory on the host and makes it visible to the CUDA device. The locked memory holds the input and output sequences, while the results are kept in CUDA's local memory. The algorithm on the CUDA device avoids memory conflicts between threads by copying the input/output vectors to local memory, and it allocates the output buffers in local memory as well. When shared memory was used for the output buffers, it took much longer to synthesize the sequences due to bank conflicts between the threads trying to read from and (especially) write to the shared memory.

Table 5.6: CUDA, core 0, thread 0.

Case 1                       Case 2
for 1000 times               for 1000 times
    Synthesize(sample)           Synthesize(sample)
                             __syncthreads()
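The two-device arrangement with pinned host memory described above can be sketched as follows; again this is our own CUDA C++ illustration under stated assumptions (hypothetical buffer names, devices selected by index 0 and 1), not the authors' implementation.

#include <cuda_runtime.h>

// Declared in the previous sketch: one thread synthesizes one sequence.
__global__ void synthesizeKernel(const int *sequences, int seqLen,
                                 int numSeq, int *gateCount);

// Page-locked (pinned) host memory is visible to the CUDA devices and lets the
// two cards transfer their halves of the candidate set concurrently.
void evaluateOnTwoDevices(int *hostSeq, int *hostCost, int seqLen, int numSeq) {
    int half = numSeq / 2;                       // e.g. 512 sequences per device
    int *dSeq[2], *dCost[2];

    for (int dev = 0; dev < 2; ++dev) {
        cudaSetDevice(dev);
        cudaMalloc(&dSeq[dev], sizeof(int) * seqLen * half);
        cudaMalloc(&dCost[dev], sizeof(int) * half);
        cudaMemcpyAsync(dSeq[dev], hostSeq + dev * half * seqLen,
                        sizeof(int) * seqLen * half, cudaMemcpyHostToDevice);
        // Inside the kernel each thread works on its own private (local) copies
        // of the input sequence and output buffer, so threads do not contend
        // for shared memory banks when writing results.
        synthesizeKernel<<<1, half>>>(dSeq[dev], seqLen, half, dCost[dev]);
        cudaMemcpyAsync(hostCost + dev * half, dCost[dev],
                        sizeof(int) * half, cudaMemcpyDeviceToHost);
    }
    for (int dev = 0; dev < 2; ++dev) {          // wait for both devices, then clean up
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaFree(dSeq[dev]);
        cudaFree(dCost[dev]);
    }
}

// hostSeq and hostCost are expected to be allocated as pinned memory, e.g.
//   cudaHostAlloc(&hostSeq, sizeof(int) * seqLen * numSeq, cudaHostAllocPortable);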

Notice that cases 1 and 2 produced unexpected results. We expected that placing the loop inside the CUDA core would yield the best results; however, we measured a 100% degradation in performance when the loop was placed inside the CUDA core compared to placing it in the CPU (case 3). We observed the same anomaly for both the HWT5 and HWT6 functions.

Table 5.7: Case 3.

CPU: for 1000 times
    CUDA: Synthesize(sample)

Finally, to summarize the results of the GPU acceleration, considerable acceleration of the circuit computation is achieved if the following conditions are satisfied:
- Minimize the CPU-to-GPU data transfer,
- Minimize the GPU global-to-local memory transfer,
- Split the data so that different running GPU cores do not obstruct each other by blocking the global memory access.

5.8 Conclusion

In this chapter we compared the synthesis process using a natural ternary arrangement of the input vector versus a subset of all possible arrangements and successfully demonstrated the benefit of the latter. In the process, we introduced a synthesis algorithm capable of synthesizing an arbitrary ternary function with a large number of variables. Since the option of exploring the entire search space is infeasible, our unique method of constructing input vector arrangements, with guaranteed convergence, becomes an essential component of the search algorithm. To accelerate the search over the various sequences, we introduced a mapping to the GPU that allows us to compute the candidate circuits in a much faster manner. Moreover, the proposed mapping shows that the advantage of computing a large number of sequences on the GPU can be further improved by a more precise algorithm-to-hardware mapping. This includes a separation of the available sequences into groups of a size that can be calculated in the fastest manner and that minimizes the global-to-local memory traffic.

Building a Hasse based ternary structure provides protection against the trap of control line blocking and allows our synthesis algorithm to access any element within the limited Hasse based search space in a random manner. Still, rather than randomly hopping throughout the search space, we employed a genetic algorithm which utilized the results of earlier steps of the synthesis process to construct the potential input vector sequences to be considered.

References

1. Reversible logic synthesis benchmarks page, [online], dmaslov.
2. Al-Rabadi, A., Quantum circuit synthesis using classes of GF(3) reversible fast spectral transforms, Proc. IEEE International Symposium on Multiple-Valued Logic, Toronto, Canada, May 20-22.
3. Al-Rabadi, A., Reversible fast permutation transforms for quantum circuit synthesis, Proc. IEEE International Symposium on Multiple-Valued Logic, Toronto, Canada, May 20-22.
4. Al-Rabadi, A., Casperson, L., Perkowski, M., Song, Z., Multiple valued quantum logic, Quantum Computers and Computing, 3, 1, 2002.
5. Alhagi, N., Hawash, M., Perkowski, M., Synthesis of reversible circuits with no ancilla bits for large reversible functions specified with bit equations, Proc. IEEE International Symposium on Multiple-Valued Logic, Barcelona, Spain, May 26-28.
6. Bennett, C., Logical reversibility of computation, IBM Journal of Research and Development, 1973.
7. Hawash, M., Abdalhaq, B., Hawash, A., Perkowski, M., Application of genetic algorithm for synthesis of large reversible circuits using covered set partitions, International Symposium on Innovation in Information & Communication Technology (ISIICT 2011), Amman, Jordan, November 2011, Vol. 1.
8. Hawash, M., Perkowski, M., Using Hasse diagrams to synthesize ternary quantum circuits, IEEE International Symposium on Multiple-Valued Logic, Victoria, British Columbia, Canada, May 14-16.
9. Hawash, M., Perkowski, M., Bleiler, S., Caughman, J., Hawash, A., Reversible function synthesis of large reversible functions with no ancillary bits using covering set partitions, International Conference on Information Technology - New Generation.
10. Khan, M.H.A., Perkowski, M., Kerntopf, P., Multi-output Galois field sum of products synthesis with new quantum cascades, IEEE International Symposium on Multiple-Valued Logic, Tokyo, Japan, May 17-19, 2003.
11. Khan, M.H.A., Perkowski, M., Khan, M.R., Kerntopf, P., Ternary GFSOP minimization using Kronecker decision diagrams and their synthesis with quantum cascades, Journal of Multiple-Valued Logic and Soft Computing, 11, 2005.
12. Khan, M.M., Biswas, A.K., Chowdhury, S., Hasan, M., Khan, A.I., Synthesis of GF(3) based reversible/quantum logic circuits without garbage output, International Symposium on Multiple-Valued Logic, Naha, Okinawa, Japan, May 21-23, 2009.
13. Landauer, R., Irreversibility and heat generation in the computing process, IBM Journal of Research and Development, 5, 1961.
14. Lukac, M., Kameyama, M., Miller, D.M., Perkowski, M., High speed genetic algorithms in quantum logic synthesis: Low level parallelization vs. representation, Journal of Multiple-Valued Logic and Soft Computing, accepted, 2012.

15. Lukac, M., Perkowski, M., Kerntopf, P., Kameyama, M., GPU acceleration methods and techniques for quantum logic synthesis, 9th International Workshop on Boolean Problems, Freiberg, Germany, September 16-17.
16. Miller, D.M., Maslov, D., Dueck, G.W., Synthesis of quantum multiple-valued circuits, Journal of Multiple-Valued Logic and Soft Computing, 12, 5-6, 2006.
17. Prasad, A., Shende, V., Markov, I., Hayes, J., Patel, K., Data structures and algorithms for simplifying reversible circuits, ACM Journal on Emerging Technologies in Computing Systems, 2, 4, 2006.
18. Toffoli, T., Reversible computing, MIT Lab for Computer Science, MIT/LCS/TM-151.


Chapter 6
An Overview of Miscellaneous Applications of GPU Computing

Stanislav Stanković, Jaakko Astola

Abstract In general, GPU computing denotes computing over a GPU system consisting of a graphics processing unit (GPU) and a central processing unit (CPU). Such systems are aimed at accelerating general purpose computations, which considerably extends the range of possible applications, making computationally demanding algorithms feasible in various areas of science and engineering practice. General purpose computing on graphics processing units (GPGPU) is a new computing method that can be viewed as an answer to the ever increasing demands on computing power. If problems are appropriately selected to fit well into the GPGPU environment, significant advantages can be achieved. Therefore, in many areas, GPGPU is used to solve computationally demanding tasks. This chapter provides a brief overview of applications of GPGPU in several areas primarily related to signal processing, with the latter term understood in the general sense.

Stanislav Stanković 1
Rovio Entertainment Ltd., Tampere, Finland, stanislav.stankovic@gmail.com

Jaakko Astola
Dept. of Signal Processing, Tampere University of Technology, Tampere, Finland, Jaakko.Astola@tut.fi

1 The work of Stanislav Stanković was supported by the Academy of Finland, Finnish Center of Excellence Programme, Grant No.


207 6 An Overview of Miscellaneous Applications of GPU Computing Introduction Giving a comprehensive overview of a rapidly increasing body of publications on the application of GPGPU in various fields of science and research, which would encompass every significant paper on this topic, would be a colossal task. This is clearly outside of the scope of this publication. Rather, in what follows, we try to give examples in order to illustrate the wide diversity of problems and tasks for which GPGPU has been proposed as a solution. We apologize in advance if we have omitted any publication which would otherwise merit mention. In our selection we were guided primarily by the desire to show the diversity of GPGPU based approaches and the versatility of the GPU as a computational platform. This diversity is best seen in the list of academic institutions in which GPGPU related research is taking place. This list includes some of the preeminent world universities. The NVIDIA corporation as one of the two main manufacturers of GPU devices has been running the CUDA Center of Excellence Program since 2009 aimed at fostering cooperation with the scientific community. As of 2011, this program includes centers of excellence in eighteen universities. Several publications present a generalized overview into the diverse field of GPGPU. For example the series of books [87], [88] edited by Professor Wen-mei W. Hwu from the University of Illinois, Urbana-Champaign, represent a valuable compendium of publications on GPGPU based solutions in different areas. Furthermore, a survey compiled by a group of authors led by J. D. Owens, D. Luebke [133] in 2005, and updated in 2007, gives a cross-section of early applications of graphic processors for general purpose scientific computations. As indicated earlier, GPU devices were developed with the main intent of application for real-time rendering of 3D computer graphics. Therefore, it is not surprising that GPU devices have found some of the first general programming applications in closely related Signal Processing fields of Image and Video processing, especially Medical Image Processing, and Computer Vision.

208 194 Stanislav Stanković, Jaakko Astola Many signal processing tasks can be characterized as so-called embarrassingly parallel problems, exhibiting a very low degree of data dependence. Typically, a signal in this case, the static image or a video, is represented in matrix form, especially suited for parallel processing with GPU devices. For example, GPU devices have been used in [115] and [164] for efficient image demosaicing, one of the typical image processing tasks. In [165], the employ of the CUDA platform for Connected Component Labeling is discussed. A CUDA based implementation of fast graph cuts for application in computer vision was proposed by P. J. Narayanan and V. Vineet, in [123] and [177]. The applications of GPGPU in Computer Vision (CV) range from the use of GPU devices for individual tasks in larger CV systems to the implementations of complete application specific CV solutions. OpenVIDIA [55], developed at the Eyetap Personal Imaging Lab (epi Lab) at the Electrical and Computer Engineering Group, the University of Toronto, is a comprehensive library of computer vision algorithms on the GPU platform. Low-level vision problems like stereo correspondence, restoration, segmentation, etc., are usually modeled as a label assignment problem, often modeled in a Markov Random-Field (MRF) framework. The use of GPU devices for this problem has been discussed in [178]. A case study on biologically inspired machine vision through the usage of CUDA is presented in [139]. Visual Saliency Model implementation on GPU devices is proposed in [146]. P. Muyan-Ozcelik and V. Glavtchev, propose an embedded system for real-time speed limit sign recognition based on CUDA and GPU in [119], and [120], while A. Obukhov in [131] presents a GPGPU based system for face detection. A GPU accelerated SVM based image recognition system is discussed in [156]. CUDA implementation of the HAAR classifier for object detection is proposed in [132], while the scale invariant feature transform by using the GPU is discussed in [186]. The use of GPU devices for real-time extraction of depth information from stereo images has been explored in a number of publications including [37], [60], [62], [64], [193], [194], [196], [198], etc. A comparison of CUDA and OpenCL platforms for Image and Video processing tasks is presented in [171]. 6.2 Medical Image Processing Medical imaging is one of the areas of digital signal processing with significant real-life applications. The use scenarios, in which medical imaging systems are employed, pose strict constraints on operational properties of such systems. Here especially the large amount of data that needs to be processed is contrasted with the short time intervals available for processing. Parallel computing has been used as a method of achieving significant reductions in needed processing time. However,

many other parallel processing platforms, such as PC clusters, are ill suited for use in final commercial products, in settings such as hospitals, medical laboratories and other health institutions. A compact, integrated, and self-contained computational platform such as the GPU provides a means for cost reduction and increases the applicability of systems based on it.

Image registration is one of the crucial steps in medical imaging methods such as MRI and CT. A method of GPU accelerated elastic image registration is presented in [155]. The Image Registration Toolkit (ITK) is one of the most widely used open-source libraries of state-of-the-art image processing and analysis algorithms used in this area. A GPU accelerated version of ITK is presented in [94]. Volumetric registration, a process consisting of aligning two or more 3-D images into a common coordinate frame, using the GPU is discussed in [159].

The most common application of GPU devices in medical imaging has been in tomography, in speeding up the process of reconstruction of 3D CT images from a series of 2D X-ray samples, [11], [16], [41], [91], [101], [143], [144], [157], [161], [188], [189], [190], and [192]. GPGPU as a method for speeding up the Monte Carlo simulations of radiation propagation in X-ray imagery is discussed in [9] and [10]. The application of the GPU in MRI reconstruction is discussed in [103], [117], [169], [199], and [200]. The implementation of a 3D Monte Carlo PET reconstruction algorithm on the GPU is proposed by Wirth et al. in [184].

SIMD massively parallel processors were proposed as a computational platform for the construction of brain anatomy atlases in [30] as early as 1996. More recently, GPU devices, as a ubiquitous SIMD platform, have been used for this task in [72], [73]. In [93] the authors propose a GPU based method for brain connectivity reconstruction and visualization in large-scale electron micrographs.

6.3 Audio Processing

Audio processing is an area closely related to image processing in terms of the underlying methodology. This area saw some of the earliest examples of general programming on GPU devices. Good examples of the application of GPU devices can be found, for example, in [43]. In [32] the authors propose a complete GPU based speech recognition system. The computation of Gaussian-mixture based acoustic likelihoods represents the computationally most demanding part of most automatic speech recognition systems. Several publications examined the possibility of using GPUs for this task, for example [21], [33], [34], [44], [71], [102], [176], etc. A physically correct simulation of acoustic environments remains a computationally very intensive task. The GPGPU has been used to address this problem in [151] and [152].

6.4 General Computer Science Problems

GPU devices have been used as a computational platform for efficient implementations of methodology related to more general computer science and digital telecommunication problems. For example, in [96] the authors make use of GPU devices for the parallelization of a large scale search algorithm, while in [135] a CUDA based implementation of fast in-place sorting is presented. A method for building efficient hash tables on GPU devices is discussed in [3]. GPUs have been used as a means of speeding up mathematical computations, such as sparse matrix manipulation and sparse matrix linear solvers, in [14], [15], [17], [19], [31] and [58]. Interval arithmetic is considered in [36]. The application of GPU devices for fast linear algebra has been explored in [2], [49] and [179]. A hybrid method for solving tridiagonal systems based on the GPU is presented in [197], and LU decomposition using the CUDA platform is proposed in [86]. Memory access patterns for cellular automata on the CUDA platform are discussed in [12]. Agent based modeling is one of the areas where GPU based solutions have been explored [1], [47], [148], and [149].

6.5 Graph Theory

Many mathematical problems related to graph theory rely on heuristic iterative algorithms. While the task of parallelizing these algorithms is often more complex than the parallelization of matrix based multiplication methods encountered, for example, in signal processing, GPU devices have been successfully used in this area as well. For example, the CUDA platform has been used for parallel graph component labeling in [82]. GPU devices are used for fast spanning tree computation in [81]. In [95] the authors compare the efficiency of edge and node parallelism methods for the calculation of graph centrality metrics. Multi-agent path planning is discussed in [46]. GPU devices are employed for ant colony optimization in [182]. An efficient algorithm based on the CUDA platform for the maximum network flow problem has been proposed in [187].

6.6 Optimization and Machine Learning

Various machine learning and optimization methods can be computationally very intensive. Although iterative in nature, with high data dependency between iterations, in some cases these algorithms can be efficiently parallelized and implemented on GPGPU platforms.

For example, in [7] the authors discuss a CUDA based implementation of an island based parallel genetic algorithm. A derivative-free mesh optimization method using GPU devices is proposed in [160]. In [181] the authors discuss the methodology for large scale machine learning using GPU devices. More specifically, in [22] and [84] two implementations of support vector machines are presented.

6.7 Dynamic Systems

A good example of the application of GPU devices in dynamic system modeling is the use of these devices for the n-body problem, [20], [74], [75], [76], [124], [129], and [195].

6.8 Astrophysics

Related to the modeling of dynamic systems, GPU devices have been used in astrophysics, for example for the simulation of the gravitational forces of black holes [83], and the propagation of gravitational waves [35]. GPU devices have been used to speed up the calculation of Einstein's field equations in [201]. Other related applications of GPGPU in astrophysics include the field of asteroseismology, in which the time-varying radiant flux is analyzed in order to study the internal structure of distant stars [172].

6.9 Statistical Modeling

Many practical applications, especially in cryptography, rely on fast and reliable random number generators. The complexity of a random number algorithm is often proportional to the quality of the generated random sequence. Reliable algorithms can therefore be computationally intensive. GPU devices have been suggested as a means of reducing the computation time for such applications, [18] and [85]. In [61] the authors employ GPGPU to speed up the calculation of inverse cumulative distribution functions, while in [158] a method for fast calculation of Iterated Function Systems, like the ones employed in fractal flow algorithms, is discussed. A Monte Carlo method to solve the radiative transport equation in inhomogeneous participating media for light and gamma photons is presented in [170].

6.10 Computational Finance

Computational finance is a rapidly growing field of applications of numerical methods for the simulation of the behavior of financial instruments. GPU devices have been employed in this area mainly for risk simulations [42], [45], and [147].

6.11 Engineering Simulations

The GPGPU has been employed for various application oriented engineering simulations which rely on computationally intensive numerical methods. Finite element methods on GPU devices are discussed in [4], [5] and [23]. GPU clusters are used for large-scale gas turbine simulations in [145]. Rarefied gas dynamic simulations on GPU devices are presented in [54], while further applications of the GPGPU in fluid dynamics are discussed in [8], [97], and [175]. Fast electromagnetic integral equation solvers on GPUs are proposed in [106]. GPU devices are used for solving wave equations on unstructured geometries in [100].

6.12 Computational Chemistry, Material Science, Nano-technology, Quantum Chemistry

Computational chemistry and quantum chemistry, branches of physical chemistry dealing with properties of materials on the nano-scale, with important applications in material science and nano-technology, have also seen extensive application of the GPGPU. For example, in [59], GPU devices are used for fast wavelet-based Density Functional Theory calculations. Density Functional Theory simulations are also discussed in [105] and [191]. The multilevel summation of electrostatic potentials using graphics processing units is discussed in [79] and [80]. In [6] and [90], the authors use GPU devices to model quantum mechanical molecular dynamics. A system for the interactive display of molecular orbitals accelerated by GPU devices has been proposed in [166], [167] and [168]. More specifically, GPU acceleration of cutoff pair potentials for molecular modeling applications is discussed by the same group of authors in [153]. A general discussion of large-scale chemical informatics using GPU devices is presented in [77]. A GPU accelerated method for the calculation of a measure of chemical similarity between different molecules is presented in [78].

213 6 An Overview of Miscellaneous Applications of GPU Computing Computational Systems Biology Molecular interaction at the intercellular level represents the foundation of physiology of living organisms. Computational systems biology relies heavily on computer simulations to study the complex molecular mechanisms. These simulations can be efficiently implemented on modern GPU devices. For example, the simulation of biomolecular interactions has been explored in [138]. Nucleotide and protein sequence matching is probably the biggest area of application of GPU devices in computational systems biology. This is a pattern matching task where a local similarity between nucleotide or protein sequences is calculated. A series of publications by various authors discusses GPGPU based approaches for biological sequence matching [98], [112], [150], [173], [174], [183], and [185]. SmithWaterman and related dynamic programming algorithms are the most frequently used methods for this task [48], [99], [114], [107], [108], [110], [111], and [154] Computational Neuro Biology Biological neural systems by nature exhibit a large scale of parallelism. These systems consist of a large number of basic functional elements, neuronal cells, which process signals and communicate asynchronously. Parallel computing platforms are a natural choice for simulation of such systems. Several important software simulators have already established themselves as standard tools for this task. Many of these software packages have some kind of parallel processing support. In recent years, some of them have made the transition to the GPGPU platform, for example Brian [63] for the neuro simulator by R. Brette and D. Goodman. In addition, some of the neural simulators have been developed especially with the goal of exploiting parallelism in GPU processors. An example for this is the GeNN simulator by T. Nowotny [125], and [126]. NeMo introduced in [53] is another platform for neural modeling of spiking neurons using GPUs. The simulation of spiking neurons using GPU devices is also discussed in [52], [121], and [122]. In [118] the authors present a GPU-based framework for simulating cortically-organized networks. Furthermore, applications of GPU devices in neuroscience are discussed in [13]. GPU accelerated temporal data mining for Neuroscience is discussed in [50] Circuit Design and Testing The ever increasing size and complexity of modern logic circuits presents a significant challenge for current automated design tools. Especially the tasks of circuit

214 200 Stanislav Stanković, Jaakko Astola verification and testing of even the circuits of moderate size can be computationally very expensive for standard CPU based platforms. In recent years, the GPGPU has been proposed as a potential remedy for this situation. The topic of logic simulation in general using the GPU as a platform has been discussed in [39], and [136]. Many Electronic Design Automation (EDA) applications rely on sparse matrices as an underlying data structures, for example, connectivity matrices in large circuits etc. Fast manipulation of such matrices is essential for overall performance of EDA systems. In [40], authors discuss the efficient strategies of the Sparse-Matrix Vector Product (SMVP) on the CUDA platform. The GPU-Based method for fast circuit optimization is presented in [109]. An implementation of Mikami routing algorithm is presented in [27]. Gate level simulations of digital circuits using the CUDA platform are discussed in [24], [25], and [26]. Fault simulations using GPUs are discussed in a series of publications by K. Gulati and S.P. Khatri [67], [68], [69], and [70]. GPU acceleration of cycle-based soft-error simulation for reconfigurable array architectures is presented in [89]. The application of CUDA for the Unate Covering Problem in the Boolean Domain is discussed in [134]. In [142], the authors make use of the GPGPU for solving SAT problems using ternary vectors. GPU devices have been used for Quantum Logic Synthesis in [113] and [137]. The related field of power grid systems design and testing has also benefited from the application of massively parallel processing platforms. GPU devices have been used for simulation of power grid systems and circuits in several recent papers, for example in [104], [51] and [162] Spectral Techniques Spectral techniques, especially Fourier and Fourier-like transforms are an important tool used in many different areas, especially in signal processing and related fields. Many signal processing methods rely on computation of Fourier-like transforms using a family of so-called fast algorithms. These algorithms have been developed originally for sequential computation on standard CPUs. The optimal parallelization of these algorithms using GPU devices is not a trivial task. The acceleration of computation of discrete spectral transforms in general, using the OpenCL platform can be found in [56], and [57]. The biggest share of publications related to the use of GPU devices in spectral techniques is related to various implementations of FFT. The list of these publications includes the following: [28], [29], [38], [65], [163], [180], etc. CUDA based implementations of 2D FFT are discussed in [66], and FFT-based 2D convolution on the CUDA platform is presented in [140]. 3D FFT implementations on GPU devices are described in [127], and [128]. In [116], the authors present a method for the calculation of 3D finite difference on GPUs.

215 6 An Overview of Miscellaneous Applications of GPU Computing 201 Image convolution with CUDA is presented in an NVIDIA whitepaper by Podlozhnyuk [141]. An example of CUDA based implementation of the discrete cosine transform is presented in [130]. References 1. Aaby, B. G., Perumalla, K. S., Seal, S. K., Efficient simulation of agent-based models on multi-gpu and multicore clusters, in SIMUTools 10: Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels, Belgium, Torremolinos, Malaga, Spain, 2010, Agullo, E., Augonnet, C., Dongarra, J., Ltaief, H., Namyst, R., Thibault, S., Tomov, S., A hybridization methodology for high-performance linear algebra software for GPUs, in Wen-mei W. Hwu, (ed.], GPU Computing Gems, Morgan Kaufmann, 2011, Vol. 2, Chapter 34, Alcantara, D. A., Volkov, V., Sengupta, S., Mitzenmacher, M., Owens, J. D., Amenta, N., Building an efficient hash table on the GPU, in Wen-mei W. Hwu, (ed.), GPU Computing Gems, Vol. 2, Chapter 4, Morgan Kaufmann, 2011, Allard, J., Courtecuisse, H., Implicit FEM and fluid coupling on GPU for interactive multiphysics simulation, in Siggraph Talks, Vancouver, Canada, August Allard, J., Courtecuisse, H., Balasalle, J. U., Implicit FEM solver in CUDA for interactive ddeformation simulation, in Wen-mei W. Hwu, (ed.), GPU Computing Gems, Vol. 2, Chapter 4, Morgan Kaufmann, 2011, Anderson, J., Lorenz, D., Travesset, A., General purpose molecular dynamics simulations fully implemented on graphics processing units, J. Comput. Phys., 227, 2008, Ardila, Y., Yamashita, S., Evaluation of migration methods for island based parallel genetic algorithm on CUDA, March 8-9, SASIMI 2012, Beppu, Japan. 8. Asouti, V. G., Trompoukis, X. S., Kampolis, I. C., Giannakoglou, K. C., Unsteady CFD computations using vertex-centered finite volumes for unstructured grids on graphics processing units, Int. J. Numerical Methods in Fluids, Vol. 67, 2011, Badal, A., Badano, A., Accelerating Monte Carlo simulations of photon transport in a voxelized geometry using a massively parallel graphics processing unit, Med. Phys., Vol. 36, 2009, Badal, A., Badano, A., Monte Carlo simulation of X-ray imaging using a graphics processing unit, in B. Yu (ed.), IEEE NSC-MIC, Conference Record, HP31, October 2531, 2009, Orlando, Florida, 2009, Badal, A.,Kyprianou, I., Sharma, D., Badano, A., Fast cardiac CT simulation using a graphics processing unit-accelerated Monte Carlo code, in E. Samei, N.J. Pelc (eds.), Proc. SPIE Medical Imaging Conference, Medical Imaging 2010: Physics of Medical Imaging, 15 February 2010, SPIE, San Diego, California, USA, 2010,

216 202 Stanislav Stanković, Jaakko Astola 12. Balasalle, J., Lopez, M. A., Rutherford, M. J., On improved memory access patterns for cellular automata using CUDA, in Wen-mei W. Hwu, (ed.), GPU Computing Gems, Vol. 2, Chapter 6, Morgan Kaufmann, 2011, Baladron, J. P., Fasoli, D., Faugeras, O., Three applications of GPU computing in neuroscience, Computing in Science and Engineering, Dec. 20, Baskaran, M. M., Bordawekar, R., Optimizing sparse matrix-vector multiplication on GPUs, IBM Technical Report, Available from: Accessed August 9, Bell, N., Garland, M., Efficient sparse matrix-vector multiplication on CUDA, NVIDIA Technical Report NVR , December Benquassmi, A., Fontaine, E., Lee, H.-H. S., Parallelization of Katsevich CT image reconstruction algorithm on generic multi-core processors and GPGPU, in Wen-mei W. Hwu, (ed.), GPU Computing Gems, Vol. 2, Chapter 41, Morgan Kaufmann, 2011, Bolz, J.,Farmer, I., Grinspun, E., Schroder, P., Sparse matrix solvers on the GPU: conjugate gradients and multigrid, ACM Trans. Graph., Vol. 22, No. 3, 2003, Bradley, T., du Toit, J., Tong, R., Giles, M., Woodhams, P., Parallelization techniques for random number generators, in Wen-mei W. Hwu, (ed.), GPU Computing Gems, Vol. 2, Chapter 16, Morgan Kaufmann, 2011, Buatois, L., Caumon, G., Levy, B., Concurrent number cruncher A GPU implementation of a general sparse linear solver, Int. J. Parallel, Emergent Distrib. Syst., Vol. 24, No. 3, 2009, Burtscher, M., Pingali, K., An efficient CUDA implementation of the tree-based Barnes Hut n-body algorithm, in Wen-mei W. Hwu, (ed.), GPU Computing Gems, Vol. 2, Chapter 6, Morgan Kaufmann, 2011, Cardinal, P., Dumouchel, P., Boulianne, G., Comeau, M., GPU accelerated acoustic likelihood computations, in Proc. Interspeech 08, Brisbane, Australia, September 2326, 2008, Catanzaro, B., Sundaram, N., Keutzer, K., Fast support vector machine training and classification on graphics processors, in Proc. the 25th Int. Conf. on Machine Learning, ICML 08, Helsinki, Finland, July 59, 2008, Vol. 307, New York, ACM, 2008, Cecka, C., Lew, A. J., Darve, E., Assembly of finite element methods on graphics processors, in Wen-mei W. Hwu, (ed.), GPU Computing Gems, Vol. 2, Chapter 16, Morgan Kaufmann, 2011, Chatterjee, D., DeOrio, A., Bertacco, V., Event-driven gate-level simulation with GP- GPUs, in Proc. the 46th Annual Design Automation Conference, July 2631, 2009, San Francisco, California, ACM, New York, 2009, Chatterjee, D., DeOrio, A., Bertacco, V., High-performance gate-level simulation with GPGPUs, in Wen-mei W. Hwu, (ed.), GPU Computing Gems, Vol. 2, Chapter 23, Morgan Kaufmann, 2011, Chatterjee, D., DeOrio, A., Bertacco, V., GCS: high-performance gate-level simulation with GPGPUs, in Proc. Design, Automation and Test in Europe Conference and Exhibition,


Tampere University of Technology
Tampere International Center for Signal Processing
P.O. Box
Tampere, Finland
ISBN
ISSN


More information

Improvements for Implicit Linear Equation Solvers

Improvements for Implicit Linear Equation Solvers Improvements for Implicit Linear Equation Solvers Roger Grimes, Bob Lucas, Clement Weisbecker Livermore Software Technology Corporation Abstract Solving large sparse linear systems of equations is often

More information

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc.

Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Finite State Machines Introduction Let s now begin to formalize our analysis of sequential machines Powerful methods for designing machines for System control Pattern recognition Etc. Such devices form

More information

Embedded Systems 23 BF - ES

Embedded Systems 23 BF - ES Embedded Systems 23-1 - Measurement vs. Analysis REVIEW Probability Best Case Execution Time Unsafe: Execution Time Measurement Worst Case Execution Time Upper bound Execution Time typically huge variations

More information

Design of Sequential Circuits

Design of Sequential Circuits Design of Sequential Circuits Seven Steps: Construct a state diagram (showing contents of flip flop and inputs with next state) Assign letter variables to each flip flop and each input and output variable

More information

of Digital Electronics

of Digital Electronics 26 Digital Electronics 729 Digital Electronics 26.1 Analog and Digital Signals 26.3 Binary Number System 26.5 Decimal to Binary Conversion 26.7 Octal Number System 26.9 Binary-Coded Decimal Code (BCD Code)

More information

Microprocessor Power Analysis by Labeled Simulation

Microprocessor Power Analysis by Labeled Simulation Microprocessor Power Analysis by Labeled Simulation Cheng-Ta Hsieh, Kevin Chen and Massoud Pedram University of Southern California Dept. of EE-Systems Los Angeles CA 989 Outline! Introduction! Problem

More information

Hardware Design I Chap. 4 Representative combinational logic

Hardware Design I Chap. 4 Representative combinational logic Hardware Design I Chap. 4 Representative combinational logic E-mail: shimada@is.naist.jp Already optimized circuits There are many optimized circuits which are well used You can reduce your design workload

More information

High-Performance Computing, Planet Formation & Searching for Extrasolar Planets

High-Performance Computing, Planet Formation & Searching for Extrasolar Planets High-Performance Computing, Planet Formation & Searching for Extrasolar Planets Eric B. Ford (UF Astronomy) Research Computing Day September 29, 2011 Postdocs: A. Boley, S. Chatterjee, A. Moorhead, M.

More information

SIMULATION OF ISING SPIN MODEL USING CUDA

SIMULATION OF ISING SPIN MODEL USING CUDA SIMULATION OF ISING SPIN MODEL USING CUDA MIRO JURIŠIĆ Supervisor: dr.sc. Dejan Vinković Split, November 2011 Master Thesis in Physics Department of Physics Faculty of Natural Sciences and Mathematics

More information

Introduction: Computer Science is a cluster of related scientific and engineering disciplines concerned with the study and application of computations. These disciplines range from the pure and basic scientific

More information

- Why aren t there more quantum algorithms? - Quantum Programming Languages. By : Amanda Cieslak and Ahmana Tarin

- Why aren t there more quantum algorithms? - Quantum Programming Languages. By : Amanda Cieslak and Ahmana Tarin - Why aren t there more quantum algorithms? - Quantum Programming Languages By : Amanda Cieslak and Ahmana Tarin Why aren t there more quantum algorithms? there are only a few problems for which quantum

More information

Logic and Boolean algebra

Logic and Boolean algebra Computer Mathematics Week 7 Logic and Boolean algebra College of Information Science and Engineering Ritsumeikan University last week coding theory channel coding information theory concept Hamming distance

More information