A Parallel Environment for Simulating Quantum Computation. Geva Patz

A Parallel Environment for Simulating Quantum Computation
by Geva Patz
B.S. Computer Science, University of South Africa (1998)

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, June 2003

© Massachusetts Institute of Technology. All rights reserved.

Author: Program in Media Arts and Sciences, School of Architecture and Planning, May 21, 2003
Certified by: Stephen A. Benton, Professor of Media Arts and Sciences, Thesis Supervisor
Accepted by: Andrew B. Lippman, Chairman, Department Committee on Graduate Students


A Parallel Environment for Simulating Quantum Computation
by Geva Patz

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning on May 21, 2003, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences

Abstract

This thesis describes the design and implementation of an environment to allow quantum computation to be simulated on classical computers. Although it is believed that quantum computers cannot in general be efficiently simulated classically, it is nevertheless possible to simulate small but interesting systems, on the order of a few tens of quantum bits. Since the state of the art of physical implementations is fewer than 10 quantum bits, simulation remains a useful tool for understanding the behavior of quantum algorithms.

To create a suitable environment for simulation, we constructed a 32-node cluster of workstation class computers linked with a high speed (gigabit Ethernet) network. We then wrote an initial simulation environment based on parallel linear algebra libraries with a Matlab front end. These libraries operated on large matrices representing the problem being simulated. The parallel Matlab environment demonstrated a degree of parallel speedup as we added processors, but overall execution times were high, since the amount of data scaled exponentially with the size of the problem. This increased both the number of operations that had to be performed to compute the simulation, and the volume of data that had to be communicated between the nodes as they were computing. The scaling also affected memory utilization, limiting us to a maximum problem size of 14 qubits.

In an attempt to increase simulation efficiency, we revisited the design of the simulation environment. Many quantum algorithms have a structure that can be described using the tensor product operator from linear algebra. We believed that a new simulation environment based on this tensor product structure would be substantially more efficient than one based on large matrices. We designed a new simulation environment that exploited this tensor product structure. Benchmarks that we performed on the new simulation environment confirmed that it was substantially more efficient, allowing us to perform simulations of the quantum Fourier transform and the discrete approximation to the solution of 3-SAT by adiabatic evolution up to 25 qubits in a reasonable time.

Thesis Supervisor: Stephen A. Benton
Title: Professor of Media Arts and Sciences


A Parallel Environment for Simulating Quantum Computation
by Geva Patz

The following people served as readers for this thesis:

Thesis Reader: Isaac L. Chuang, Associate Professor, MIT Media Laboratory
Thesis Reader: Edward Farhi, Professor of Physics, MIT Center for Theoretical Physics


Acknowledgments

Many thanks to the members of my thesis committee:

To Steve Benton, my advisor, who stepped in at a moment of need and guided me through the completion of this thesis. His wise guidance and kind support were invaluable.

Thanks to my reader, Isaac Chuang, without whom this thesis would not have happened. He introduced me to the world of quantum computing, pointed me at the problem that this thesis addressed, and suggested ways to approach the solution. He also enabled me to have access to the computing resources required to make this work possible.

Thanks also to my other reader, Eddie Farhi, who introduced me to the idea of adiabatic quantum computing, and whose office I always looked forward to visiting.

I'd have to switch to a much bigger font to thank Linda Peterson adequately. Her office is a haven for desperate, panic-stricken, confused or otherwise needy students, and she is a wellspring of helpful advice (um, I mean options).

To my wife, Alex: thank you so much for the support and encouragement you've given me throughout my time at MIT, and for putting up with me in my sleep-deprived, not-altogether-cheerful thesis writing mode.


Contents

1 Introduction

2 Background
    Why is quantum computing interesting?
    Basic concepts of quantum computation
    Quantum bits
    Quantum gates
    Quantum algorithms
    The quantum Fourier transform
    Quantum computation by adiabatic evolution
    The tensor product

3 Parallel simulation of quantum computation
    Simulating quantum computing
    Previous work in simulating quantum computation
    Parallel processing and cluster computing
    Cluster computing
    The tensor product as a means to optimize computation
    Parallelizing by the tensor product
    Efficient computation of tensor product structured multiplications

4 The simulation environment
    Hardware
    Initial software implementation
    Overview of libraries used
    Prior work in parallelizing Matlab
    Design of the parallel Matlab environment
    The tensor product based simulation environment
    Algorithm (circuit) specification language
    Compilation
    Distribution
    Execution

5 Evaluation
    Methodology
    The fundamentals
    Single node execution times
    Data transfer timing
    Startup overhead
    Gates and gate combinations
    The quantum Fourier transform
    The quantum Fourier transform on the initial environment
    Replicating the discrete Fourier transform
    Circuit-based implementation
    Comparing efficient and inefficient circuit specifications
    3-SAT by adiabatic evolution

6 Conclusions

A Code listings
    A.1 Circuit specification language parser
    A.2 Quantum Fourier transform circuit generator
    A.3 3-SAT problem generator for adiabatic evolution


List of Figures

2-1 Quantum NOT gate
The CNOT gate
Three CNOTs form a swap gate
Using a Hadamard gate to generate entangled Bell states
General circuit representation of the quantum Fourier transform
Basic schematic form of a quantum circuit
A representative quantum circuit
The cluster nodes, seen from below
The two dimensional block cyclic distribution
Layering of libraries in the first generation simulation environment
State diagram for the parallel Matlab server master node
State diagram for the parallel Matlab server slave nodes
Circuit to demonstrate different specification orderings
Circuit for the compilation example
State of the internal data structure
System-level overview of the parallel execution of a problem
Algorithm specification to illustrate computation sequence
An example computation sequence, illustrating communication patterns
State diagram for the new simulator master node
State diagram for the new simulator slave nodes
5-1 Single-node execution times
Single-node execution times with identity matrix
Vector transfer times
Startup overhead
Block of CNOTs circuit
Block of CNOTs, no permutation-related communication (cf. Figure 5-8)
Alternating CNOTs circuit
Alternating CNOTs
Parallel Matlab based simulator performance
Traditional Fourier transform execution times
Quantum Fourier transform circuit execution times
Quantum Fourier transform inefficient circuit execution times
Execution times for 3-SAT by adiabatic evolution, N steps

List of Tables

4.1 Abbreviated grammar for the tensor product specification language
Record types for the internal compiler data structure
Number of runs for simulation data
Fourier transform execution times for larger problem size


Listings

4.1 A sample parallel Matlab script
A sample algorithm specification
Inefficient specification of the circuit in Figure
Efficient specification of the circuit in Figure
Input file for the compilation example
A.1 Header file definitions
A.2 Lexical analyzer definition for flex
A.3 Parser definition for bison
A.4 Efficient quantum Fourier transform circuit generation script
A.5 Example efficient circuit output for 4 qubits
A.6 Unoptimized quantum Fourier transform circuit generation script
A.7 Example unoptimized circuit output for 4 qubits
A.8 Simulation code generator for 3-SAT by adiabatic evolution
A.9 Example adiabatic evolution code for 3 qubits
A.10 Random instance generator for 3-SAT by adiabatic evolution


Chapter 1

Introduction

The idea of simulating quantum computation on classical computers seems at first not to make logical sense. Quantum computing is interesting primarily because it appears to be able to solve problems that are intractable for classical computers. If this is the case, then quantum computers cannot be efficiently simulated on classical ones.

Our goal, however, is much more modest. We do not seek to efficiently simulate quantum computation in general, for arbitrary problem size. Rather, we want to simulate the largest problems that we can, until physical implementations of quantum computers have overtaken our abilities to simulate them. The largest physically realized quantum computation to date operated on seven quantum bits (qubits) [VSB+01]. Given that problem size doubles with every additional qubit, simulating even low tens of qubits would allow us to investigate problems many orders of magnitude larger than the size of quantum computers that we are currently able to build. The exponential scaling of demands on memory and processor resources with increasing problem size will always overwhelm us at some point, but with some thought we may be able to postpone that point far enough to allow us to simulate some interesting problems.

Simulation also has a much more rapid configuration turn-around time than physical experiments. We all hope that in the future, quantum computers will be as trivially reconfigurable as the desktop classical computers of today, but at the moment every successful physical quantum computation has been a complex, carefully planned experiment with an elaborate experimental setup tailored to solving one specific problem (often one specific instance of a problem). For now, simulation offers much greater flexibility for reconfiguration, and is an essential tool for planning any experimental realization.

More generally, the lessons we learn in simulating quantum computation on classical computers may yield insights that will be useful in other fields that deal with similarly large problems. An obvious application would be simulating other quantum systems, but similar techniques are also useful in such fields as image and signal processing.

To achieve even the relatively modest goal of simulating problems on the order of 20 qubits, we require substantial computing resources, and an intelligent approach to using those resources. One path of attack is to combine the resources (memory and CPU) of multiple processing units. High end parallel processing computers are, however, expensive, rare and often difficult to program. Ideally, we would like to harness the power of readily available, inexpensive, easily configurable workstation class computers to perform our computations. This suggests exploiting the technique of cluster computing, in which multiple off-the-shelf workstations are combined into a parallel computing resource.

Regardless of the amount of simulation hardware we have available, it will be useful to find efficient ways of representing the problem we are simulating, in order to reduce the resource consumption of our simulations. The resources we are typically most interested in are memory and CPU, but in a clustered computing environment there is another resource that becomes significant, too: communication.

In this thesis, I will describe a simulation environment that we have built to explore the simulation of quantum computation. We began by building and configuring a clustered computing environment. We then implemented a simulator on it based on a library of parallel linear algebra routines. This approach was chosen because linear algebra is at the core of the mathematical representation of quantum computing.

Although our initial simulation environment validated the feasibility of simulating quantum computation on a cluster of classical workstations, it also uncovered a number of limitations, both of cluster computing in general, and of the specific simulation approach we had chosen. We therefore developed a new simulation environment, designed to more efficiently represent and simulate problems, and to reduce the reliance on inter-node communications, which had proved to be a substantial bottleneck in cluster computing. Specifically, we based our new simulation around the tensor product, a mathematical structure that neatly parallels the structure of many quantum algorithms, and that provides a basis for a more compact and efficient representation of these problems.

Chapter 2 starts by outlining some of the basic concepts of quantum computing and the associated mathematics that will be necessary to understand the rest of this thesis. Chapter 3 continues by reviewing concepts relevant to the parallel simulation of quantum computation. It discusses how quantum computation may be simulated, and introduces cluster computing, which forms the basis of the architecture of our simulation environment. It also describes how the tensor product, introduced in the previous chapter, has been used as a structure for parallelization and efficient execution of simulations.

With the background out of the way, Chapter 4 describes our simulation environment in detail, beginning with the cluster hardware, then moving on to a description of the initial (parallel matrix based) simulation environment. It describes the limitations of the initial simulation environment that motivated the design of the new simulation environment, and then describes the new design. Chapter 5 discusses our evaluation of the new simulation environment, describing the benchmarks we developed and the results of running these benchmarks on the simulator. Finally, Chapter 6 summarizes our conclusions, and suggests some directions in which the simulation environment could be taken in the future.
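To make the exponential scaling discussed in this chapter concrete, the following sketch estimates the memory needed to hold an n-qubit state vector and a dense operator matrix acting on n qubits. It assumes 16 bytes per complex amplitude (two 64-bit floats); that figure and the function names are illustrative assumptions, not details of the simulator described in this thesis.

```python
# Rough memory estimate for simulating n qubits classically.
# Assumption (not from the thesis): each complex amplitude is stored as two
# 64-bit floats, i.e. 16 bytes.
BYTES_PER_AMPLITUDE = 16


def state_vector_bytes(n_qubits: int) -> int:
    """Bytes needed for the 2**n amplitudes of an n-qubit state vector."""
    return (2 ** n_qubits) * BYTES_PER_AMPLITUDE


def dense_operator_bytes(n_qubits: int) -> int:
    """Bytes needed for a dense 2**n x 2**n matrix acting on n qubits."""
    return (2 ** n_qubits) ** 2 * BYTES_PER_AMPLITUDE


if __name__ == "__main__":
    for n in (7, 14, 20, 25):
        vec_mib = state_vector_bytes(n) / 2 ** 20
        op_gib = dense_operator_bytes(n) / 2 ** 30
        print(f"{n:2d} qubits: vector {vec_mib:10.3f} MiB, "
              f"dense operator {op_gib:14.3f} GiB")
```

Under this assumption, each additional qubit doubles the storage for the state vector and quadruples the storage for a dense operator, so explicit matrices become impractical long before state vectors do.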


Chapter 2

Background

This chapter introduces some key background concepts that will be relevant to the rest of the thesis. This is by no means intended to be an exhaustive or rigorous survey of the subject of quantum computation, but instead is meant to give the reader enough background to follow the concepts and notation used elsewhere. For a more comprehensive review of quantum computation and the underlying principles of quantum mechanics, the reader is referred to [NC00] or [Pre98].

After a brief motivation in Section 2.1 of why the subject of quantum computing is interesting to study, Section 2.2 introduces some of the elementary concepts and principles of quantum computation, along with the mathematical structures used to represent them. In Section 2.3 we review a few representative quantum algorithms. Section 2.4 introduces the tensor product, which will be the mathematical key to the design of our simulation environment.

2.1 Why is quantum computing interesting?

The theory of quantum computation is rich and interesting in its own right, but it is of particular interest because it is believed that quantum computers may be able to perform certain types of computation that are fundamentally too hard for classical computers to perform in a reasonable time.

A classical computer is simply a computer in which the physical representation and manipulation of information follows the laws of classical physics. This definition encompasses every practical computer in use today, from the microcontroller in a washing machine to the fastest supercomputers. Strictly, since quantum mechanics underpins all of physics, classical computers are simply a special case of quantum computers, but since their design does not directly exploit the principles of quantum mechanics, it is helpful to distinguish them from quantum computers that do so.

More formally, classical computers are types of Turing machines, named for Alan Turing, who in a seminal paper in 1936 [Tur36] developed the first formal, abstract model that defined what it means for a machine, abstract or physical, to compute an algorithm. The abstract mathematical computing machine that Turing introduced was called a logical computing machine in his paper, but we now refer to it as a Turing machine.

Although in principle any problem expressible as an algorithm can be solved on a Turing machine, in practice certain kinds of problem may not be solvable on classical computers with reasonable computational resources (with resources usually defined as storage space and computing time). The study of the resource requirements of algorithms is known as complexity theory.

Complexity theory divides problems into a number of complexity classes based on the resources required to solve them. One of the most important of these classes is P (for Polynomial time), which is defined as the set of problems¹ that can be solved on a deterministic machine (loosely, a conventional Turing machine) in polynomial time, in other words where the amount of time (equivalently, the number of steps) taken to solve the problem is bounded by a polynomial in the problem size.
Less formally, P is essentially the class of problems that can be efficiently computed on classical computers.

¹ Strictly, the complexity classes are defined in terms of decision problems, i.e. problems that require a YES or NO answer. Since algorithms can be restated as equivalent decision problems, we ignore this formal nicety here.

Another class, NP (for Nondeterministic Polynomial time), is defined as those problems where the solution can be verified in polynomial time. Clearly $P \subseteq NP$, since the solution to any problem in P can be verified by solving the problem again in polynomial time. It is believed that $P \neq NP$, and many problems in NP have been posed for which no polynomial-time solution algorithm is known, but this has not yet been proved, and whether or not P = NP remains one of the great unanswered questions in computer science.

A further complexity class is PSPACE, being those problems solvable with unlimited time, but a polynomial amount of storage space (memory). Again, it is clear that $NP \subseteq PSPACE$, and it is suspected, but unproven, that $NP \neq PSPACE$. Thus it remains unknown even whether or not P = PSPACE.

It is known that there are classes of problems that are outside PSPACE, hence outside P. For instance, we know that $PSPACE \subset EXPSPACE$, where EXPSPACE is the set of problems solvable with unlimited time and with an amount of memory that increases exponentially with problem size.

How does all this discussion of complexity classes relate to quantum computers? A new complexity class, BQP, has been defined to encompass all problems that can be efficiently solved on a quantum computer. BQP is defined as those problems that can be solved in polynomial time by a quantum computer with a bounded probability that the solution is incorrect (most definitions give this probability bound as $p \le 0.25$, but the choice of bound is arbitrary).

It has been shown that $BQP \subseteq PSPACE$, but the relation to P and NP is unproven. Tantalizingly, however, there are problems that have been shown to be in BQP, but that are strongly believed to be outside P. Proving that this is so, i.e. proving that there are problems that can be solved efficiently on quantum computers but not on classical computers, would imply that $P \neq PSPACE$, but even in the absence of a proof, there are strong hints that suggest this is so. Herein lies the promise of quantum computing.
In particular, the current interest in quantum computing was largely stimulated by a paper by Peter Shor [Sho97] that gave algorithms for calculating discrete logarithms and for finding the prime factors of an integer in polynomial time on a quantum computer. The integer factoring algorithm generated particular interest because of its potential application to cryptanalysis (the well-known RSA public key cryptosystem, for instance, depends on the difficulty of integer factoring for its security).

Shor presented an algorithm that finds the prime factors of an integer $N$ in $O((\log N)^3)$ time. There is no equivalent classical algorithm known that can perform factoring in time $O((\log N)^k)$ for any $k$. The most efficient classical algorithms currently known, the Number Field Sieve and the Multiple Polynomial Quadratic Sieve, have super-polynomial run times ($O(e^{(\ln N)^{1/3} (\ln \ln N)^{2/3}})$ and $O(e^{\sqrt{\ln N \ln \ln N}})$ respectively) [Bre00]. Algorithms such as Shor's strongly suggest that $BQP \neq P$, and it is this that drives much of the interest in quantum computing.

2.2 Basic concepts of quantum computation

2.2.1 Quantum bits

The elementary unit of quantum data is the qubit (for "quantum bit"), by analogy with the bit in classical computing. Although we will deal with qubits almost exclusively as mathematical abstractions, it is important to bear in mind that, just as classical bits have a physical representation, so qubits correspond to physical states within a quantum computer, subject to the laws of physics and in particular those of quantum mechanics.

Qubits, like bits, have states, such as $|0\rangle$ and $|1\rangle$. Unlike classical bits, qubits are not restricted to these states, but can take on any state of the form

$$|\psi\rangle = a|0\rangle + b|1\rangle, \tag{2.1}$$

where $a$ and $b$ are complex numbers. Mathematically, the states $|0\rangle$ and $|1\rangle$ are orthonormal basis vectors for a two-dimensional complex vector space. The state of a qubit is a unit vector in this space. The general qubit in (2.1) is said to be in a superposition of states $|0\rangle$ and $|1\rangle$.
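As a small illustration of this vector picture, the sketch below stores a qubit state as a pair of complex amplitudes, checks that the two basis states are orthonormal, and rescales an arbitrary amplitude pair to a unit vector. The helper names and the example amplitudes are hypothetical, chosen only for illustration.

```python
import math

# A qubit state a|0> + b|1> held as a pair of complex amplitudes (a, b).
KET_0 = (1 + 0j, 0 + 0j)
KET_1 = (0 + 0j, 1 + 0j)


def inner(u, v):
    """Inner product <u|v>: the first argument is complex-conjugated."""
    return sum(x.conjugate() * y for x, y in zip(u, v))


def normalize(state):
    """Rescale a nonzero amplitude pair so that it becomes a unit vector."""
    norm = math.sqrt(inner(state, state).real)
    return tuple(x / norm for x in state)


# Hypothetical example amplitudes, chosen only for illustration.
psi = normalize((1 + 1j, 2 + 0j))

print(inner(KET_0, KET_1))    # zero: the basis states are orthogonal
print(abs(inner(psi, psi)))   # approximately one: psi is a unit vector
```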

The restriction to unit vectors arises from the interpretation of $a$ and $b$: a crucial principle of quantum mechanics is that we cannot precisely determine ("measure") the state of a quantum system (in this case, a qubit). Measuring a qubit in a state $|\psi\rangle$ as above will yield the measurement 0 with probability $|a|^2$ and 1 with probability $|b|^2$. Since these probabilities must sum to 1, $|a|^2 + |b|^2 = 1$.

This vector representation of qubits is a very useful mathematical abstraction, and we will make extensive use of it in our simulation environment. We will often use vector notation to represent the state of a qubit, as in

$$|\psi\rangle = \begin{bmatrix} a \\ b \end{bmatrix}. \tag{2.2}$$

The extension of these concepts to multiple qubits is straightforward. Two qubits have four computational basis states, $|00\rangle$, $|01\rangle$, $|10\rangle$ and $|11\rangle$, corresponding to the four possible states of a pair of classical bits. These states are sometimes written as integers in the form $|0\rangle$, $|1\rangle$, $|2\rangle$ and $|3\rangle$ respectively. The state vector describing the state of a pair of qubits is simply

$$|\psi\rangle = a|00\rangle + b|01\rangle + c|10\rangle + d|11\rangle. \tag{2.3}$$

More generally, a system of $n$ qubits has computational basis states which take the form $|x_1 x_2 \ldots x_n\rangle$, where each of the $x_i \in \{0, 1\}$. There are therefore $2^n$ such basis states, and the state vector for such a system has $2^n$ entries (or probability amplitudes). This exponential increase in information as the number of qubits increases hints at the potential computational power of quantum computing.

2.2.2 Quantum gates

Computation with qubits requires manipulating their states. These manipulations are again physical, and their exact nature depends on the particular physical implementation of a given quantum computer. Here too, though, it is helpful to use a mathematical abstraction to describe these manipulations independent of any specific physical implementation. An abstraction which is helpful in describing a wide range of quantum algorithms is the circuit model, in which algorithms are described as a collection of quantum gates, which operate on qubits by analogy with the logic gates of classical computing (AND, OR, NOT, etc.).

To illustrate, consider the quantum NOT gate. Just as the classical NOT swaps bit values, taking $0 \to 1$ and $1 \to 0$, the quantum NOT gate takes $|0\rangle \to |1\rangle$ and $|1\rangle \to |0\rangle$. More generally, however, it takes any state $|\psi\rangle = a|0\rangle + b|1\rangle$ to the state $|\psi'\rangle = b|0\rangle + a|1\rangle$. Graphically, this is usually represented as in Figure 2-1 ($X$ is a standard shorthand for the NOT gate, and $\oplus$ represents the classical XOR operation, or equivalently binary addition modulo 2).

[Figure 2-1: Quantum NOT gate]

Mathematically, the NOT gate can be represented as a $2 \times 2$ matrix:

$$X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \tag{2.4}$$

All quantum gates on $n$ qubits can be represented similarly as $2^n \times 2^n$ unitary matrices. A matrix $U$ is unitary when $U U^\dagger = I$ ($U^\dagger$ is the adjoint of $U$, defined as $(U^*)^T$, where $U^*$ is the complex conjugate matrix of $U$). The unitarity property is necessary to ensure that the output of a quantum gate remains a unit vector.

An example of a two qubit gate is the controlled-not, or CNOT gate. This has the form

$$U_{CNOT} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix} \tag{2.5}$$

The CNOT gate is graphically represented as in Figure 2-2.

[Figure 2-2: The CNOT gate: the representation on the right is a common shorthand.]

Three alternating CNOTs in succession have the effect of exchanging the values of two qubits, as in Figure 2-3. This combination can itself be represented as a two-qubit operator, the swap gate.

[Figure 2-3: Three CNOTs form a swap gate: the common representation of the swap gate is on the right]

Another useful gate is the Hadamard gate, represented $H$. It has the form

$$H = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}. \tag{2.6}$$

The Hadamard gate takes $|0\rangle$ to a superposition of states $|0\rangle$ and $|1\rangle$:

$$|0\rangle \to \frac{|0\rangle + |1\rangle}{\sqrt{2}}. \tag{2.7}$$

It is this ability to create and manipulate superpositions of states that gives quantum computers their inherent parallelism. To illustrate, suppose we have a function $f$ that can be implemented with a unitary function $U_f$ that transforms two input qubits $|x\rangle|y\rangle$ as follows (where $\oplus$ signifies single bit binary addition):

$$U_f : |x\rangle|y\rangle \to |x\rangle|y \oplus f(x)\rangle \tag{2.8}$$

Now suppose that we apply $U_f$ to the input state with $|x\rangle$ in the superposition shown in (2.7) and $|y\rangle = |0\rangle$. Then

$$U_f : \frac{|0\rangle + |1\rangle}{\sqrt{2}}\,|0\rangle \to \frac{|0\rangle|f(0)\rangle + |1\rangle|f(1)\rangle}{\sqrt{2}}. \tag{2.9}$$

This output state contains information about $f(0)$ and $f(1)$, so in a loose sense we have performed an evaluation of $f$ on both 0 and 1. This notion of quantum parallelism is one of the keys to the potential power of quantum computation. However, note that we cannot extract both the values of $f(0)$ and $f(1)$ directly from this output state. If we attempt to measure it, we will destroy the state and we will get one of the two measurement outcomes $(0, f(0))$ or $(1, f(1))$ with equal probability $p = 0.5$.

To unlock the information potential of quantum systems, we need another concept, that of entanglement. A full discussion of entanglement is beyond the scope of this overview. However, let us consider the circuit in Figure 2-4, which demonstrates another important use for the Hadamard gate, in preparing a class of entangled states known as Bell states or EPR pairs.

[Figure 2-4: Using a Hadamard gate to generate entangled Bell states]

If the input to this circuit is $|00\rangle$, then the output is

$$|\psi\rangle = \frac{|00\rangle + |11\rangle}{\sqrt{2}} = |\beta_{00}\rangle. \tag{2.10}$$
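The single-qubit gates and the measurement rule described in this section can be checked numerically. The sketch below applies the NOT and Hadamard matrices of (2.4) and (2.6) to state vectors, computes measurement probabilities, and tests the unitarity condition. The helper names and the example input state are hypothetical; this is a minimal illustration, not the simulator described in this thesis.

```python
import math

SQRT_HALF = 1 / math.sqrt(2)

# Gate matrices from (2.4) and (2.6), written as lists of rows.
X = [[0, 1],
     [1, 0]]
H = [[SQRT_HALF, SQRT_HALF],
     [SQRT_HALF, -SQRT_HALF]]


def apply(gate, state):
    """Multiply a gate matrix by a state vector."""
    return [sum(g * s for g, s in zip(row, state)) for row in gate]


def adjoint(m):
    """Conjugate transpose of a square matrix."""
    n = len(m)
    return [[complex(m[c][r]).conjugate() for c in range(n)] for r in range(n)]


def matmul(a, b):
    """Product of two square matrices of the same size."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def is_unitary(m, tol=1e-12):
    """Check that m times its adjoint equals the identity, entry by entry."""
    prod = matmul(m, adjoint(m))
    n = len(m)
    return all(abs(prod[r][c] - (1 if r == c else 0)) < tol
               for r in range(n) for c in range(n))


def probabilities(state):
    """Measurement probabilities |amplitude|^2 for each basis state."""
    return [abs(amp) ** 2 for amp in state]


# Hypothetical input state 0.6|0> + 0.8i|1>, chosen only for illustration.
psi = [0.6, 0.8j]
print(apply(X, psi))                     # the two amplitudes are exchanged
print(probabilities(apply(H, [1, 0])))   # H|0>: both outcomes equally likely
print(is_unitary(X), is_unitary(H))
```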

At first glance, the Bell state in (2.10) might look like just another superposition, but this is not the case. If we were to apply a Hadamard transform to two qubits to create a superposition, the output would be

$$|\phi\rangle = \frac{|00\rangle + |01\rangle + |10\rangle + |11\rangle}{2}. \tag{2.11}$$

The state (2.11) can be written as a product of two single-qubit states, each equal to $(|0\rangle + |1\rangle)/\sqrt{2}$, whereas the Bell state cannot be factored in this way.

2.3 Quantum algorithms

As in classical computing, our interest in quantum computing is to be able to execute useful algorithms. The quantum bits and quantum gates introduced above provide a helpful abstraction for specifying these algorithms. Just as classical logic gates can be combined into circuits, so quantum gates can be combined into quantum circuits. Many useful algorithms can be conveniently expressed in this form, and indeed we often use the terms algorithm simulation and circuit simulation interchangeably.

We have already seen simple circuits that perform useful functions such as swapping qubit values (Figure 2-3) and generating Bell states (Figure 2-4). For a more substantial example, we will discuss an algorithm to calculate an important transform known as the quantum Fourier transform in Section 2.3.1.

The circuit model is not the only way of thinking of quantum algorithms. In Section 2.3.2, we consider an alternative technique, that of quantum adiabatic evolution, and apply it to 3-SAT, a classic hard problem from traditional computer science.

The algorithms presented in this section were chosen to give an illustrative flavor of some applications of quantum computing. They will also form the basis of some of the performance benchmarks for the tensor product based simulation environment discussed in Chapter 5.
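The two small circuits mentioned above can be verified with explicit matrices. The sketch below multiplies three alternating CNOTs to confirm that they form a swap, prepares the Bell state from the input $|00\rangle$, and compares it with the state obtained by applying a Hadamard gate to both qubits. The quantity `a*d - b*c` used to distinguish the two is a standard linear-algebra test for whether a two-qubit state factors into single-qubit states; it is not taken from the thesis, and all names here are hypothetical.

```python
import math

S = 1 / math.sqrt(2)

# Two-qubit amplitudes are ordered |00>, |01>, |10>, |11>.
CNOT_12 = [[1, 0, 0, 0],   # control = first qubit, target = second qubit
           [0, 1, 0, 0],
           [0, 0, 0, 1],
           [0, 0, 1, 0]]
CNOT_21 = [[1, 0, 0, 0],   # control = second qubit, target = first qubit
           [0, 0, 0, 1],
           [0, 0, 1, 0],
           [0, 1, 0, 0]]
SWAP = [[1, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1]]
H_FIRST = [[S, 0, S, 0],   # Hadamard on the first qubit only
           [0, S, 0, S],
           [S, 0, -S, 0],
           [0, S, 0, -S]]
H_BOTH = [[0.5, 0.5, 0.5, 0.5],   # Hadamard on both qubits
          [0.5, -0.5, 0.5, -0.5],
          [0.5, 0.5, -0.5, -0.5],
          [0.5, -0.5, -0.5, 0.5]]


def matmul(a, b):
    """Product of two square matrices of the same size."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def apply(gate, state):
    """Multiply a gate matrix by a state vector."""
    return [sum(g * s for g, s in zip(row, state)) for row in gate]


def product_state_defect(state):
    """a*d - b*c for amplitudes (a, b, c, d); zero exactly for product states."""
    a, b, c, d = state
    return a * d - b * c


three_cnots = matmul(CNOT_12, matmul(CNOT_21, CNOT_12))
bell = apply(CNOT_12, apply(H_FIRST, [1, 0, 0, 0]))
uniform = apply(H_BOTH, [1, 0, 0, 0])

print(three_cnots == SWAP)
print(abs(product_state_defect(bell)), abs(product_state_defect(uniform)))
```

A nonzero defect for the Bell state and a zero defect for the uniform superposition reflect the distinction drawn in the text between an entangled state and an ordinary superposition.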

2.3.1 The quantum Fourier transform

The quantum Fourier transform is analogous to the classical discrete Fourier transform, familiar from signal processing applications, which takes an input vector $(x_0, x_1, \ldots, x_{N-1})$ of complex numbers, and maps it to an output vector $(y_0, y_1, \ldots, y_{N-1})$ as follows:

$$y_k = \frac{1}{\sqrt{N}} \sum_{j=0}^{N-1} x_j e^{2\pi i jk/N} \tag{2.12}$$

The quantum Fourier transform has an analogous definition. Given an orthonormal basis $|0\rangle, |1\rangle, \ldots, |N-1\rangle$, the quantum Fourier transform is a linear operator acting as follows on the basis states:

$$|j\rangle \to \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} e^{2\pi i jk/N} |k\rangle \tag{2.13}$$

Although the above representation makes the relation between the discrete and the quantum Fourier transforms clear, an alternative equivalent representation, known as the product representation, provides a more useful structure for generating circuits to compute the quantum Fourier transform (with $N = 2^n$):

$$|j\rangle \to \frac{1}{2^{n/2}} \bigotimes_{l=1}^{n} \left( \sum_{k_l=0}^{1} e^{2\pi i j 2^{-l} k_l} |k_l\rangle \right) \tag{2.14}$$

$$= \frac{1}{2^{n/2}} \bigotimes_{k=1}^{n} \left( |0\rangle + \exp\left( 2\pi i \sum_{l=1}^{k} j_{n-k+l}\, 2^{-l} \right) |1\rangle \right), \tag{2.15}$$

where $j_i$ is the $i$th bit in the binary representation of $j$, and $k_l$ is the $l$th qubit in $|k\rangle$.

This representation corresponds to the quantum circuit in Figure 2-5. Absent from the circuit is the final bit reversal operation, which reverses the order of the output qubits analogously to the bit reversal of the discrete Fourier transform. Each of the $R_k$ in the diagram is a rotation, defined by

$$R_k = \begin{bmatrix} 1 & 0 \\ 0 & e^{2\pi i/2^k} \end{bmatrix}. \tag{2.16}$$

[Figure 2-5: General circuit representation of the quantum Fourier transform]

The quantum Fourier transform, in turn, is an important component of many significant larger quantum algorithms, such as Shor's integer factoring algorithm.

2.3.2 Quantum computation by adiabatic evolution

Although the circuit model is a convenient abstraction for representing quantum algorithms, it is not the only way of mapping problems onto quantum systems. One alternative framework [FGG+01] is based on exploiting the adiabatic theorem.

To understand the adiabatic theorem, we must introduce another fundamental concept of quantum mechanics, the Hamiltonian. The time evolution of a quantum system can be described by the Schrödinger equation:

$$i\hbar \frac{d}{dt} |\psi(t)\rangle = H(t) |\psi(t)\rangle \tag{2.17}$$

Here, $|\psi(t)\rangle$ is the state vector of the system at time $t$, $\hbar$ is the reduced Planck constant (generally, units are chosen such that $\hbar = 1$), and $H(t)$ is a Hermitian operator called the Hamiltonian of the system.

Briefly stated, the adiabatic theorem states the following: consider a quantum system whose evolution is governed by the Hamiltonian $H(t)$. Take $H(t) = \tilde{H}(t/T)$, where $\tilde{H}(s)$ is a one-parameter family of Hamiltonians with $0 \le s \le 1$. Let the instantaneous eigenstates and eigenvalues of $\tilde{H}(s)$ be $|l; s\rangle$ and $E_l(s)$, with $E_i(s) \le E_j(s)$ for $i < j$. The adiabatic theorem states that if the gap between the lowest two eigenvalues, $E_1(s) - E_0(s)$, is strictly greater than zero for all $0 \le s \le 1$, then

$$\lim_{T \to \infty} \left| \langle l = 0; s = 1 | \psi(T) \rangle \right| = 1. \tag{2.18}$$

In other words, if the gap is positive as above, then if $T$ is big enough (i.e. if $t/T$ is small enough), $|\psi(t)\rangle$ remains close to the ground state $|\psi_g(t)\rangle = |l = 0; s = t/T\rangle$ of the system.

This gives a hint as to how adiabatic evolution might be used for quantum computation, if we can specify our algorithm in the form of a series of Hamiltonians $H(t)$, chosen in such a way that the initial ground state of $H(0)$ is both known and easy to construct. For each instance of the problem, we can then construct a problem Hamiltonian $H_P$. Although $H_P$ is not difficult to construct, its ground state, which encodes the solution to the corresponding instance of the problem, is difficult to compute directly. This is where we use adiabatic evolution. We set $H(T) = H_P$, so the ground state $|\psi_g(T)\rangle$ of $H(T)$ encodes the solution. $T$ is the running time of our algorithm. For $0 \le t \le T$, $H(t)$ smoothly interpolates between the initial Hamiltonian $H(0)$ and the final Hamiltonian $H(T)$, which is equivalent to $H_P$. If $T$ is large enough, then $H(t)$ will vary slowly. By the adiabatic theorem, then, the final state of this evolution, $|\psi(T)\rangle$, will be close to the solution state $|\psi_g(T)\rangle$.

To illustrate, consider the example of the 3-SAT problem, which has been shown to be NP-complete [Coo71].
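Before turning to that example, the role of the eigenvalue gap can be seen in a toy case. The sketch below assumes a linear interpolation between two hypothetical single-qubit Hamiltonians (both are illustrative choices, not problem instances from this thesis) and scans the parameter $s$ to find the minimum gap between the two energy levels, using the closed-form eigenvalues of a real symmetric $2 \times 2$ matrix.

```python
import math


def interpolate(h_start, h_end, s):
    """Entrywise (1 - s) * h_start + s * h_end for 2x2 matrices."""
    return [[(1 - s) * h_start[r][c] + s * h_end[r][c] for c in range(2)]
            for r in range(2)]


def gap(h):
    """E1 - E0 for a real symmetric 2x2 matrix [[a, b], [b, d]].

    The eigenvalues are (a + d)/2 +/- sqrt(((a - d)/2)**2 + b**2).
    """
    a, b, d = h[0][0], h[0][1], h[1][1]
    return 2 * math.sqrt(((a - d) / 2) ** 2 + b ** 2)


# Hypothetical one-qubit Hamiltonians, chosen only for illustration.
H_START = [[0.5, -0.5], [-0.5, 0.5]]   # ground state (|0> + |1>)/sqrt(2)
H_END = [[0.0, 0.0], [0.0, 1.0]]       # ground state |0>

samples = [k / 100 for k in range(101)]
gaps = [gap(interpolate(H_START, H_END, s)) for s in samples]
g_min = min(gaps)
print(f"minimum gap {g_min:.4f} at s = {samples[gaps.index(g_min)]:.2f}")
```

In this toy case the gap never closes, so the condition of the adiabatic theorem holds for every $s$; for larger problems the size of the minimum gap determines how large $T$ must be.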
A problem is NP-complete if it is in the complexity class NP, and if it has the property that any other problem in NP is reducible to it by a polynomial time algorithm (by "problem $\Phi$ is reducible to problem $\Phi'$" we mean that any instance of $\Phi$ can be converted in polynomial time into an instance of $\Phi'$ with the same truth value). An $n$-bit instance of 3-SAT is a Boolean formula consisting of a conjunction of clauses $C_1 \wedge C_2 \wedge \cdots \wedge C_M$, where each clause $C$ involves at most three of the $n$ bits. The problem requires finding a satisfying assignment, that is, a set of values for the $n$ bits that makes all of the clauses simultaneously true.

An instance of 3-SAT can be expressed in a manner suitable for the application of adiabatic evolution by constructing a Hamiltonian to represent it (a "problem Hamiltonian") as follows: for each clause $C$ with associated bits $z_{i_C}$, $z_{j_C}$ and $z_{k_C}$, define an energy function $h_C(z_{i_C}, z_{j_C}, z_{k_C})$ such that $h_C = 0$ if $(z_{i_C}, z_{j_C}, z_{k_C})$ satisfies clause $C$, and $h_C = 1$ otherwise. Each bit $z_i$ is associated with a corresponding qubit $|z_i\rangle$. Each clause $C$ is associated with an operator

\[ H_{P,C}\left(|z_1\rangle |z_2\rangle \cdots |z_n\rangle\right) = h_C(z_{i_C}, z_{j_C}, z_{k_C})\, |z_1\rangle |z_2\rangle \cdots |z_n\rangle. \tag{2.19} \]

The problem Hamiltonian $H_P$ is then the sum over the clauses of the $H_{P,C}$:

\[ H_P = \sum_C H_{P,C} \tag{2.20} \]

Given a problem Hamiltonian as above, one can solve the instance of 3-SAT by finding its ground state. To do this, it is necessary to start with a Hamiltonian $H_B$ with a known ground state (the "initial Hamiltonian") and to use adiabatic evolution to go from this known ground state to the ground state of $H_P$. $H_B$ is constructed as follows: define the one-bit Hamiltonian $H_B^{(i)}$ acting on bit $i$ thus:

\[ H_B^{(i)} = \tfrac{1}{2}\left(1 - \sigma_x^{(i)}\right), \tag{2.21} \]

where

\[ \sigma_x^{(i)} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \tag{2.22} \]

For each clause $C$, define

\[ H_{B,C} = H_B^{(i_C)} + H_B^{(j_C)} + H_B^{(k_C)}, \tag{2.23} \]

then

\[ H_B = \sum_C H_{B,C}. \tag{2.24} \]

Adiabatic evolution proceeds by taking

\[ H(t) = \left(1 - \frac{t}{T}\right) H_B + \left(\frac{t}{T}\right) H_P, \tag{2.25} \]

so

\[ \tilde{H}(s) = (1 - s) H_B + s H_P. \tag{2.26} \]

Start the system at $t = 0$ in the known ground state of $H(0)$ (i.e. in the ground state of $H_B$). By the adiabatic theorem, if $T$ is big enough and the minimum gap, $g_{\min}$, between the two lowest energy eigenstates is not zero, $|\psi(T)\rangle$ will be close to the ground state of $H_P$, which represents the solution to the instance of 3-SAT.

2.4 The tensor product

The tensor product is also known as the Kronecker product or the direct product of matrices. It is an operation on two matrices, denoted $A \otimes B$. If $A$ and $B$ are $m \times n$ and $p \times q$ matrices respectively, then $A \otimes B$ is an $mp \times nq$ matrix defined as follows:

\[ A \otimes B = \begin{pmatrix} a_{1,1}B & a_{1,2}B & \cdots & a_{1,n}B \\ a_{2,1}B & a_{2,2}B & \cdots & a_{2,n}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1}B & a_{m,2}B & \cdots & a_{m,n}B \end{pmatrix}. \tag{2.27} \]

The tensor product has a number of useful properties, which will be helpful later when we attempt to compute tensor products. It is associative:

\[ A \otimes (B \otimes C) = (A \otimes B) \otimes C \tag{2.28} \]

It is distributive over normal matrix multiplication (the mixed-product property):

\[ (A \otimes B)(C \otimes D) = (AC \otimes BD), \tag{2.29} \]

(provided that the dimensions of $A$, $B$, $C$ and $D$ are such that $AC$ and $BD$ are defined). Inverses and transposes of tensor products have the following useful properties:

\[ (A \otimes B)^{-1} = A^{-1} \otimes B^{-1} \tag{2.30} \]

\[ (A \otimes B)^{T} = A^{T} \otimes B^{T} \tag{2.31} \]

The above already suggests that we may be able to reduce the amount of computation performed on a large matrix if it can be expressed as the tensor product of smaller matrices. To see this, take for example the matrix $A = M_1 \otimes M_2 \otimes \cdots \otimes M_n$, and consider the relative amount of computational effort in computing $A^{-1}$ directly versus computing $M_i^{-1}$ for the $n$ smaller matrices.

Finally, there are two more properties of the tensor product that will be useful to us in calculating the trace and eigenvalues of matrices represented in tensor product form. In the case of the trace, we have that

\[ \mathrm{tr}(A \otimes B) = \mathrm{tr}(A)\,\mathrm{tr}(B). \tag{2.32} \]

In the case of eigenvalues, if $A$ and $B$ have eigenvalues $\lambda_i$ and $\mu_j$ respectively, with corresponding eigenvectors $x_i$ and $y_j$, then

\[ (A \otimes B)(x_i \otimes y_j) = \lambda_i \mu_j\,(x_i \otimes y_j). \tag{2.33} \]

In other words, every eigenvalue of $A \otimes B$ is a product of an eigenvalue of $A$ and an eigenvalue of $B$.
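The properties above can be checked numerically on small matrices. The following is a minimal sketch in plain Python, not code from the simulation environment: the helper names (`kron`, `matmul`, `transpose`, `trace`, `kron_vec`, `matvec`) and the test matrices are illustrative assumptions, and integer entries are used so that the comparisons are exact.

```python
def kron(A, B):
    """Tensor (Kronecker) product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def kron_vec(x, y):
    """Tensor product of two vectors stored as flat lists."""
    return [a * b for a in x for b in y]

def matmul(A, B):
    """Ordinary matrix product A B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

# Arbitrary small integer matrices (hypothetical test data).
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 1]]
C = [[2, 0], [1, 3]]
D = [[1, -1], [2, 5]]

# (2.29): (A (x) B)(C (x) D) = AC (x) BD
mixed_ok = matmul(kron(A, B), kron(C, D)) == kron(matmul(A, C), matmul(B, D))
# (2.31): (A (x) B)^T = A^T (x) B^T
transpose_ok = transpose(kron(A, B)) == kron(transpose(A), transpose(B))
# (2.32): tr(A (x) B) = tr(A) tr(B)
trace_ok = trace(kron(A, B)) == trace(A) * trace(B)

# (2.33): eigenpairs multiply. Ae has eigenvector x with eigenvalue 2,
# Be has eigenvector y with eigenvalue 1.
Ae, x, lam = [[2, 0], [0, 3]], [1, 0], 2
Be, y, mu = [[0, 1], [1, 0]], [1, 1], 1
xy = kron_vec(x, y)
eigen_ok = matvec(kron(Ae, Be), xy) == [lam * mu * v for v in xy]

print(mixed_ok, transpose_ok, trace_ok, eigen_ok)
```

A check like this exercises the identities only for particular matrices; it is a sanity test of the notation, not a proof.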

How is this useful to us in simulating quantum computing? It turns out that quantum circuits often have natural decompositions in terms of the tensor product. Consider for example the simple circuit for the Hadamard transform on four qubits:

[Circuit diagram: four parallel wires, each carrying a single Hadamard gate H]

This has the tensor product representation $H \otimes H \otimes H \otimes H$, where $H$ is the one-qubit ($2 \times 2$) Hadamard gate matrix. In general, any set of gates applied in parallel to disjoint qubits can be represented as the tensor product of the operator matrices corresponding to them.
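As an illustration of the four-qubit Hadamard example, the sketch below builds $H \otimes H \otimes H \otimes H$ explicitly and applies it to the basis state $|0000\rangle$. It is a plain-Python illustration under assumed names (`kron`, `H4`, `zero_state`), not the representation used by the simulation environment. Every output amplitude should come out at approximately 1/4, up to floating-point rounding.

```python
import math
from functools import reduce

def kron(A, B):
    """Tensor (Kronecker) product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

# One-qubit Hadamard gate.
h = 1 / math.sqrt(2)
H = [[h, h],
     [h, -h]]

# Four gates in parallel: H (x) H (x) H (x) H, a 16 x 16 matrix.
H4 = reduce(kron, [H] * 4)

# State vector for |0000>: amplitude 1 in the first entry, 0 elsewhere.
zero_state = [1.0] + [0.0] * 15

# Apply the circuit: y = U x.
out = [sum(u * v for u, v in zip(row, zero_state)) for row in H4]

print(len(H4), len(H4[0]))
print(out)
```

Building the full $16 \times 16$ matrix is affordable here, but its size grows as $2^n \times 2^n$ with the number of qubits $n$.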

Chapter 3

Parallel simulation of quantum computation

This chapter will review the concepts that motivate the design of our simulation environment. Section 3.1 gives an overview of the type of simulations that we wish to perform, and gives a sense of the complexity of implementing these simulations on classical computers. One way of tackling this complexity is by using the combined power of multiple processors in parallel to perform the simulation. There are many approaches to parallel computing, and we have chosen to use an architecture known as cluster computing. Section 3.2 defines cluster computing, motivates our choice of this architecture, and describes the challenges and limitations particular to it.

In order to exploit the potential advantages of parallel hardware, we require a means of parallelizing the computations we will perform. In the previous chapter, we introduced the tensor product and saw how we could use it as a structure for many quantum algorithms. Now, in section 3.3, we will explain how the tensor product structure has been used to guide parallelization. We will also consider how tensor product based transforms can be efficiently applied at each of the parallel steps.

3.1 Simulating quantum computing

The phrase "simulating quantum computing" can have a number of meanings.¹ For the purposes of this thesis, when we say that we intend to simulate quantum computation, we mean that we will use classical computers to simulate the operation of certain quantum algorithms, typically expressed as quantum circuits. We do not simulate the behavior of any particular physical implementation of quantum computation, concerning ourselves rather with the algorithmic/circuit abstraction.

¹ Our use of the term "simulation" to refer to the simulation of quantum computations on classical computers should not be confused with "quantum simulation", which typically refers to the simulation of a quantum system on a quantum computer.

In order to simulate quantum circuits, we must be able to represent each circuit, its input and its output classically. At their most basic, quantum circuits can be thought of as transforms that operate on an $n$-qubit input state $|\psi\rangle$ to produce an output state $|\psi'\rangle$, as in Figure 3-1.

[Figure 3-1: Basic schematic form of a quantum circuit. An $n$-qubit input $|\psi\rangle$ enters a box labeled "Quantum circuit", which produces an $n$-qubit output $|\psi'\rangle$.]

As we have seen in Chapter 2, the input $|\psi\rangle$ and the output $|\psi'\rangle$ can be represented as state vectors of dimension $2^n$, say $x$ and $y$ respectively. The entries in each of these vectors are complex numbers. The circuit itself can be represented by a $2^n \times 2^n$ transform matrix $U$, also of complex numbers. Simulating the operation of the circuit is then simply a matter of performing the computation

\[ y = Ux. \tag{3.1} \]

The simplicity of equation 3.1 is deceptive, however. For a start, the sizes of $x$, $y$ and $U$ grow exponentially with problem size. If we store $U$ and $x$ and perform a full matrix-vector multiplication, we will require at least $2^{2n+3} + 2^{n+3}$ bytes of storage for single precision values, since each single precision complex entry occupies 8 bytes. This translates to over 32 GB for a 16-qubit problem. We will also require $O(2^{2n})$ complex multiplication operations.

Furthermore, the entries of $U$ are usually not directly specified, but must be computed from the constituent gates of the circuit. As an example, consider the circuit in Figure 3-2. This corresponds to the multiplication

\[ y = (I_2 \otimes U_f)(U_d \otimes U_e)(U_a \otimes U_b \otimes U_c)\,x, \tag{3.2} \]

where $I_2$ is the $2 \times 2$ identity matrix.

[Figure 3-2: A representative quantum circuit. Three qubit lines $\psi_0$, $\psi_1$, $\psi_2$ pass through gates $U_a$, $U_b$, $U_c$ in the first stage, $U_d$ and $U_e$ in the second, and $U_f$ in the third.]

The amount of computational work required to perform the above calculation naïvely turns out to be greater still than would have appeared at first when we treated circuits as a single transform. Here, we require at least three matrix-vector multiplications, in addition to the work required to compute the three sets of tensor products.

Previous work in simulating quantum computation

There are relatively few simulation environments for quantum computation in the literature. Most consist of languages or environments for describing small quantum circuits or algorithms and for simulating them on single-processor workstations. A good representative of this class is the QCL language [Öme00a], which provides basic constructs for specifying quantum algorithms, but which has no native support for parallelism.

QCL was designed primarily as a programming language, not as a simulation environment. The ultimate intent of QCL is to provide a means to specify algorithms that would be executed using quantum computing resources (or a mix of quantum and classical computing resources). The author does, however, provide a simulation environment, the QC library [Öme00b], to allow QCL programs to be executed in the absence of quantum resources. The QC library stores quantum states as state vectors, using a compressed representation in which only non-zero amplitudes are stored. This gains memory efficiency when many amplitudes are zero, at the cost of a computational performance penalty when operating on this more complex representation in memory. In the general case, where many or all of the probability amplitudes are nonzero, essentially the full state vector must be stored.

The lack of support for parallelism in the QC library places significant limits on the size of the problem that can be simulated, because of both memory and CPU cycle limitations. There is, of course, nothing in principle preventing a parallel back end from being developed to execute QCL code.

Perhaps the most complete published parallel simulation environment is the one developed at ISI/USC by Obenland and Despain [OD98]. Their interest was particularly in simulating a physical implementation of quantum computation using laser pulses directed at ions in an ion trap. The ISI team had access to high end supercomputers, specifically a Cray T3E and an IBM SP2 multiprocessor, and they took advantage of this to execute their simulations in parallel. They noted a significant speedup on larger problems, close to the theoretically predicted parallel speedup.

Obenland and Despain's work was a clear indication that parallelism could be fruitfully exploited to achieve larger, faster simulations of quantum computing. However, they had access to high-end, purpose-built supercomputing environments. We wanted to know to what extent results like this could be achievable on more widely available parallel environments, in particular on a cluster of off-the-shelf workstations.
The ISI simulation timings revealed that communications overhead ultimately became the dominant time factor. Because of the highly efficient internal interconnect in the high end supercomputers that were used, communications overhead was not a significant factor for small numbers of processors. However, when 25% of the available processors were used, communications overhead increased to 40-60% of total execution time for many problems. With half the processors in use, it increased to take up 60-90% of the execution time. These findings, on a tightly-coupled multiprocessor architecture with a high speed internal interconnect, suggest that message passing based parallelism would be even harder on clusters, where the interprocessor interconnect is substantially slower, and this was indeed our experience.

3.2 Parallel processing and cluster computing

The regular structure of, and large number of operations involved in, many linear algebra problems make them attractive as candidates for parallel processing. The term parallel processing refers to any architecture in which one or more processors operate on multiple data elements simultaneously.

Early computers designed to perform fast linear algebra operations typically made use of vector processors. As the name suggests, vector processors perform operations (e.g. addition or multiplication) on vectors of multiple data items rather than on single memory addresses or registers. Vector processors execute instructions sequentially, but achieve parallelism at the data level. Early supercomputers typically contained one large vector processor operating at high speed. For example, the earliest and best-known true vector processor supercomputer, the Cray 1, operated on eight vector registers, each of which could hold 64 eight-byte words.

Vector processing has evolved into the modern concept of single instruction, multiple data (SIMD) parallelism. This is almost ubiquitous in modern processor designs, such as the Motorola PowerPC VelocityEngine, or the Intel Pentium MMX extensions.
Another, often complementary, approach to parallel computing is to increase the number of processors in the system and to parallelize the execution of algorithms across multiple processors. A number of models of parallel processing have been attempted. One popular early approach was massively parallel processing (MPP). MPP systems are so named because they contain a large number of processors, hundreds or sometimes thousands of them. Each processor has its own local memory, and the processors are linked using a high-speed internal interconnect mechanism.

Cluster computing

In recent years, with the advent of higher bandwidth network interconnects, it has become feasible to build parallel computing systems out of multiple independent workstations, rather than a single machine with multiple processors. This technique is known as cluster computing.

The term cluster is somewhat loosely applied to groups of networked conventional workstations that cooperate computationally. The workstations may be heterogeneous in terms of such factors as their processing capacity (number, type and speed of CPUs), their memory size and configuration and even the operating systems that run on them.

We make a distinction here between clusters and so-called networks of workstations (NOWs). Although the terms are sometimes used interchangeably in the literature, the term cluster is generally applied to a network that, while it may consist of workstation-class computer hardware, is essentially dedicated to the task of parallel computation. A NOW, by contrast, may comprise machines that are also used for other purposes, often desktop machines that perform networked computation only when otherwise idle.

Clustering makes parallel processing more accessible than traditional "single box" parallel processing. Cluster components are cheaper, and are easily replaced or upgraded. Clusters can be expanded, partitioned or otherwise reconfigured with relative ease.
A wider range of development tools and operating system support is available for commodity workstation hardware, making development on a cluster environment more accessible to a general user base than development on traditional multiprocessor machines.

Clusters are becoming increasingly accepted in the high performance computing community, with 93 clusters appearing in the most recent (at the time of writing) TOP500 list [MSDS02] of the highest performance computers in the world (ranked according to their performance on the LINPACK benchmark). It is worth noting, however, that many of the clusters on the list use unconventional high-speed interconnects or other customized hardware enhancements that differentiate them from off-the-shelf computing hardware. Indeed, there are only 14 "self made" clusters on the TOP500 list at this time.

There are several tradeoffs in using a cluster instead of a traditional multiprocessor machine. Individual cluster nodes are often significantly less reliable than individual processing elements in a multiprocessor machine, an issue that becomes increasingly significant as cluster sizes increase. Furthermore, although many message passing libraries, such as MPI [For93] and PVM [GBD+94], have been ported to cluster environments, clustering support is often less mature than support for traditional multiprocessor environments.

The most significant disadvantage of cluster computing, though, is the decreased interprocessor communications performance relative to other multiprocessor environments. Local memory bandwidth on typical Intel processor based machines is on the order of 1 gigabyte/s, with latencies around clock cycles. High speed memory crossbars used in some modern supercomputer designs offer even higher bandwidths. By comparison, even fast interconnects such as Gigabit Ethernet and Myrinet offer raw bandwidths on the order of low hundreds of megabytes per second. These raw bandwidths are further degraded by protocol overhead and inherent transport inefficiencies. Even with low protocol overhead, network-induced latencies are at least on the order of thousands of clock cycles [CWM01].
Probably the most widespread clustering architecture, not least because its definition is broad enough to encompass a wide range of different clustered environments, is the Beowulf clustering environment [BSS+95], named for the initial implementation of such an environment at NASA's Goddard Space Flight Center. Originally applied to clusters based on the free Linux operating system, the term Beowulf has now come to be applied to any cluster that approximately meets the following criteria:

- The cluster hardware is essentially standard workstation hardware, possibly augmented by a more exotic fast network interconnect.
- The operating systems in use on the cluster are free Unix variants (originally Linux was the operating system of choice for Beowulf clusters; now alternatives such as FreeBSD are increasingly common).
- The cluster is dedicated to parallel computational purposes, and is typically accessed through a single point of entry (the head node).
- Some combination of software to facilitate parallel computing is installed on the cluster. Such software typically includes one or more message passing libraries, and cluster-wide administrative tools. It may also include operating system enhancements such as kernel extensions that implement a shared process numbering space.

Our parallel simulation environment is implemented on a 32-node Beowulf cluster. More details of the cluster configuration can be found in section 4.1.

3.3 The tensor product as a means to optimize computation

There are two ways in which we use the tensor product to optimize the simulation of quantum circuits. First, we use the tensor product structure to determine an intelligent parallelization of the circuit. Then, we use the tensor product decomposition to minimize the amount of matrix computation we perform.

Parallelizing by the tensor product

The digital signal processing community has used the tensor product as a means to structure parallel implementations of signal processing transforms for some years [GCT92]. Much work has been done in expressing common transforms such as the Fast Fourier

Transform in tensor product form, and using this representation to parallelize the application of the transform to an input vector [Pit97].

In our simulations of quantum computing, we use very similar techniques to those applied to large signal processing transforms. We apply operators (transform matrices) with a tensor product structure to state vectors (input vectors). It seems reasonable, therefore, that the same techniques that have been useful in signal processing would be useful to us.

To understand how the tensor product structure can be used to determine a corresponding parallelization, let us consider an idealized $m \times m$ transform $A$ of the form

\[ A = I_n \otimes B, \tag{3.3} \]

where $I_n$ is the $n \times n$ identity matrix, and $B$ is thus of size $\frac{m}{n} \times \frac{m}{n}$. Now suppose we wish to calculate the matrix-vector product $y = Ax$, where $x$ is a vector of size $m$. This corresponds to the application of a set of operators to a state vector. To illustrate, take $m = 8$ and $n = 4$. $A$ then has the following form:

\[ A = I_4 \otimes \begin{pmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{pmatrix} \tag{3.4} \]

\[ = \begin{pmatrix}
B_{1,1} & B_{1,2} & & & & & & \\
B_{2,1} & B_{2,2} & & & & & & \\
 & & B_{1,1} & B_{1,2} & & & & \\
 & & B_{2,1} & B_{2,2} & & & & \\
 & & & & B_{1,1} & B_{1,2} & & \\
 & & & & B_{2,1} & B_{2,2} & & \\
 & & & & & & B_{1,1} & B_{1,2} \\
 & & & & & & B_{2,1} & B_{2,2}
\end{pmatrix} \tag{3.5} \]

It is clear from the above that we can calculate the product $y = Ax$ by partitioning $x$ into four equal partitions of two elements, and then performing four smaller calculations of the following form ($1 \le i \le 4$):

\[ \begin{pmatrix} y_{2i-1} \\ y_{2i} \end{pmatrix} = B \begin{pmatrix} x_{2i-1} \\ x_{2i} \end{pmatrix} \tag{3.6} \]

Notice that the result of each of the calculations is independent of the other three results. This implies that the calculation of $y = Ax$ can effectively be parallelized across four processors. More generally, any calculation of the form $y = Ax$ where $A$ has the form $A = I_n \otimes B$ can be parallelized over $n$ processors, each performing a multiplication by $B$ of some partition of $x$.

What if we have fewer than $n$ processors? Suppose there are $p$ processors available, where $p < n$ and $p$ divides $n$. We note that

\[ I_n \otimes B = (I_p \otimes I_{n/p}) \otimes B \tag{3.7} \]

\[ = I_p \otimes (I_{n/p} \otimes B), \tag{3.8} \]

and partition the computation into $p$ parallel subcomputations, each involving a multiplication of a partition of $x$ by $(I_{n/p} \otimes B)$.

What about the case when the tensor product representation does not take the convenient form $A = I_n \otimes B$? Here we must make use of a permutation matrix to rearrange the tensor product into this form. A permutation matrix $P$ is a square matrix whose elements are each either 0 or 1, and where

\[ P^T P = I = P P^T \tag{3.9} \]

A stride permutation matrix $P_{n,s}$ (where $s$ is a factor of $n$, i.e. $n = ms$) is a permutation

matrix which, when applied to a vector $x$ of length $n$, rearranges it as follows:

\[ P_{n,s}\,x = \left[ x_1, x_{1+s}, x_{1+2s}, \ldots, x_{1+(m-1)s},\; x_2, x_{2+s}, x_{2+2s}, \ldots,\; x_s, x_{2s}, x_{3s}, \ldots, x_{ms} \right] \tag{3.10} \]

Now suppose that $A$ and $B$ are matrices of sizes $m \times n$ and $p \times q$ respectively. We can relate $A \otimes B$ to $B \otimes A$ with stride permutations as follows:

\[ A \otimes B = P_{mp,m}\,(B \otimes A)\,P_{nq,q} \tag{3.11} \]

This is known as the commutative property of the tensor product, and it allows us to rewrite tensor products of the form $A = B \otimes I_n$ in the form

\[ A = P_{m,(m/n)}\,(I_n \otimes B)\,P_{m,n}, \tag{3.12} \]

where $m$ is the dimension of the square matrix $A$.

To illustrate, consider the following circuit fragment, where $U$ and $V$ are 3-qubit and 2-qubit operators respectively:

[Circuit diagram: a 3-qubit operator U acts on qubits $j_{1,2,3}$, qubits $j_{4,5}$ pass through unchanged, and a 2-qubit operator V acts on qubits $j_{6,7}$.]

The tensor product representation of this circuit applied to the state vector $x$ corresponding to the input state $|x\rangle$ is obtained by applying (3.12) with $m = 2^5$ and $n = 2^2$ to the factor $U \otimes I_{2^2}$:

\[ [U \otimes I_{2^2} \otimes V]\,x = [(U \otimes I_{2^2}) \otimes V]\,x \tag{3.13} \]

\[ = [(P_{2^5,2^3}\,(I_{2^2} \otimes U)\,P_{2^5,2^2}) \otimes V]\,x \tag{3.14} \]

\[ = [(P_{2^5,2^3} \otimes I_{2^2})\,(I_{2^2} \otimes U \otimes V)\,(P_{2^5,2^2} \otimes I_{2^2})]\,x \tag{3.15} \]

\[ = [\,p\,(I_4 \otimes (U \otimes V))\,p'\,]\,x, \tag{3.16} \]

where $p = P_{2^5,2^3} \otimes I_{2^2}$ and $p' = P_{2^5,2^2} \otimes I_{2^2}$ are permutation matrices. Thus, the circuit above can be parallelized by the technique described above using up to four processors.

The permutations $p$ and $p'$ can be implemented by rearranging the elements of $x$. In a multiprocessor architecture, such rearrangements can often be achieved by alternate addressing of the underlying data. In a clustered environment, however, the rearrangements require communication of the elements to be rearranged between the nodes with processors holding the relevant data. Since we are communicating only state vectors, and not full matrices, the communication volume is comparatively low relative to problem size. Nonetheless, avoiding unnecessary communications will be important in minimizing execution times.

Efficient computation of tensor product structured multiplications

We have discussed above how a tensor product representation of a quantum circuit can be parallelized by reducing each parallel component to the form $I_m \otimes B$, perhaps with some permutation of the data before and after. We have not yet considered how best to perform each of the multiplications $Bx_i$ (where $x_i$ is the $i$th partition of the vector $x$). Even from the very simple example above, where $B = U \otimes V$, it is clear that $B$ may well have its own tensor product structure. The multiplication of such tensor products can be performed efficiently using the following method, described in [BD96b]. Suppose we wish to compute the product

\[ Bx = (M_1 \otimes M_2)\,x, \tag{3.17} \]

where $M_1$ and $M_2$ have dimensions $m \times n$ and $p \times q$ respectively. We first reshape $x$ into a $q \times n$ matrix thus:

\[ X_{i,j} = x_{(j-1)q+i} \tag{3.18} \]

It has been shown [Dyk87] that the product $(M_1 \otimes M_2)x$ can then be calculated by finding

\[ Y = \left( M_1 \left( M_2 X \right)^T \right)^T \tag{3.19} \]

and then converting $Y$ back into a vector using the reverse of the process by which $X$ was constructed in (3.18). We have reduced the multiplication of $x$ by an $mp \times nq$ matrix to a series of two multiplications by smaller matrices. In general, we have

\[ Bx = (M_1 \otimes M_2 \otimes \cdots \otimes M_K)\,x \tag{3.20} \]

\[ = \left( M_1 \left( M_2 \left( \cdots \left( M_K X \right)^T \cdots \right)^T \right)^T \right)^T, \tag{3.21} \]

where $X$ is derived from $x$ as above.

To see how this can significantly reduce the amount of computation required, consider the case where each of the $M_i$ ($1 \le i \le K$) is a square matrix of dimension $n \times n$. Then the number of multiplications required to compute the conventional matrix-vector multiplication $Bx$, with $B$ as in equation (3.20) above, is $O(n^{2K})$, since $B$ is an $n^K \times n^K$ matrix. The reformulation in equation (3.21) allows us to perform $K$ matrix-matrix multiplications of the form $M_i X_i$, where $X_i$ is the intermediate result formed by the sequence of multiplications by $M_j$, $j = i+1, \ldots, K$ ($X_K \equiv X$). In each case $M_i$ is an $n \times n$ matrix, and $X_i$ is an $n \times n^{K-1}$ matrix. Each matrix multiplication thus requires $O(n^{K+1})$ multiplications, and the total computation requires $O(Kn^{K+1})$ multiplications.

The computation additionally requires $K$ matrix transpositions of $n \times n^{K-1}$ matrices. The amount of work required to perform these transpositions is, however, not great relative to the work for the matrix-matrix multiplications.

In principle, the matrix-matrix multiplications and the matrix transpositions could be implemented as parallel routines to further take advantage of multiple processors available in a parallel environment. However, in practice, in a clustered environment, the communications overhead of the requisite parallel linear algebra routines (discussed in more detail in Chapter 5) limits the usefulness of this additional parallelization.
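The two-factor case in equations (3.17) to (3.19) can be sketched in a few lines. The code below is an illustrative plain-Python version, not the thesis implementation: the function names (`kron_matvec`, `kron`, `matmul`, `matvec`, `transpose`) and the test data are assumptions, indices are 0-based rather than 1-based, and the result is compared against the direct product with the explicitly formed matrix $M_1 \otimes M_2$.

```python
def matmul(A, B):
    """Ordinary matrix product A B (lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def kron(A, B):
    """Explicit tensor product, used here only as a reference."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def kron_matvec(M1, M2, x):
    """Compute (M1 (x) M2) x without forming M1 (x) M2.

    M1 is m x n, M2 is p x q, and x has length n*q.
    """
    n, q = len(M1[0]), len(M2[0])
    # (3.18) in 0-based form: X[i][j] = x[j*q + i], a q x n matrix.
    X = [[x[j * q + i] for j in range(n)] for i in range(q)]
    # (3.19): Y = (M1 (M2 X)^T)^T, a p x m matrix.
    Y = transpose(matmul(M1, transpose(matmul(M2, X))))
    p, m = len(Y), len(Y[0])
    # Reverse of (3.18): read Y column by column into a vector of length m*p.
    return [Y[i][j] for j in range(m) for i in range(p)]

# Small non-square test case with integer entries (hypothetical data).
M1 = [[1, 2], [3, 4]]          # 2 x 2
M2 = [[0, 1, 2], [1, 0, 1]]    # 2 x 3
x = [1, 2, 3, 4, 5, 6]         # length n*q = 6

fast = kron_matvec(M1, M2, x)
direct = matvec(kron(M1, M2), x)
print(fast == direct)
```

The fast path here performs two small matrix products in place of one product with the $mp \times nq$ matrix. Extending the sketch to $K$ factors as in (3.21) would additionally require reshaping the intermediate result between steps, which is not shown.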

Chapter 4

The simulation environment

This chapter describes our parallel environment for simulating quantum computation on a cluster of classical workstations, beginning with a description of the cluster hardware in section 4.1.

Our first attempt at a simulation environment, discussed in section 4.2, was based on parallel matrix operations. These parallel operations were primarily drawn from existing optimized libraries, described in section We used Matlab as the front end for this simulation environment, drawing on existing work in interfacing Matlab to parallel linear algebra libraries described in section Our implementation, described in section 4.2.3, followed the architecture of these prior implementations, but was tailored to perform the functions we required for our simulation, and to operate on complex matrices.

It became apparent to us that the matrix-based implementation of our initial simulation environment was suboptimal with respect to resource requirement scaling, both in terms of memory usage and computation time. Seeking to improve simulation efficiency, we developed a new simulation environment based on the tensor product structure of quantum circuits. This allowed us to apply prior work in the parallelization of tensor product computation, and in efficient implementation of tensor product multiplications (described above in sections and respectively) to our simulations.

Section 4.3 describes the new, tensor product based, simulation environment. We developed a simple circuit specification language (section 4.3.1) as an input mechanism, along with a compiler (section 4.3.2) that translates this input into a sequence of steps to be executed by the new parallel simulation code. Section describes how this compiled representation is distributed to the nodes, and section describes the actual execution.

4.1 Hardware

The hardware on which the simulation environment runs consists of a networked cluster of off-the-shelf computers, pictured in Figure 4-1. It consists of 33 machines with a total of 68 processors, as follows:

- One head node, with 4 GB RAM and four Intel Pentium III Xeon processors with a 900 MHz system clock speed
- Eight older cluster nodes, each with 768 MB RAM and two Pentium III processors with a 1 GHz system clock speed
- 24 newer cluster nodes, each with 1 GB RAM and two Pentium III processors with a 1.2 GHz system clock speed

[Figure 4-1: The cluster nodes, seen from below. The eight older, lower-capacity machines are visible at the top right.]

The nodes are interconnected using 1000BaseT gigabit Ethernet, through a switch with a claimed backplane switching throughput of 38 Gbit/sec, enough in theory to allow for simultaneous communication between all the nodes. Initially the nodes ran on switched 100BaseT fast Ethernet, but it soon became clear that communications overhead was a significant performance bottleneck, so the nodes were upgraded to a faster network transport. Gigabit Ethernet was chosen because of its low cost and ease of configuration relative to many other high speed interconnect mechanisms such as Myrinet or Fibre Channel.

Although the gigabit Ethernet network is a substantial performance improvement on the old 100 Mbit/sec network, several factors conspire to reduce the effective maximum throughput of the network. The PCI network adapters used in all the nodes are low-end, 32-bit wide cards. Since the nodes all have a standard workstation architecture, the PCI bus is a single bus shared by all I/O devices. On the software side, we are using a stock Linux kernel, with the inefficiencies inherent in a TCP stack that must necessarily be all things to all users. We tested our networking setup with the well-known netperf network benchmarking tool. On an otherwise unloaded network, peak node-to-node TCP data throughput was 482 Mbit/sec.

Reliability is a significant concern in any large network of off-the-shelf workstations. 99.5% uptime, for instance, may be entirely tolerable on a single workstation, but a cluster with 32 nodes independently experiencing 99.5% uptime will have an unacceptably low overall uptime of about 85% ($0.995^{32} \approx 0.85$). Two areas in which particular work was required to achieve acceptable reliability were cooling and hard drive reliability.

In order to prevent frequent overheating, we found it necessary to physically enclose the ceiling mounted rack in which the cluster nodes were mounted, so that forced air could be channeled downwards from vents in the ceiling to the
