
Janus: FPGA Based System for Scientific Computing Filippo Mantovani Physics Department Università degli Studi di Ferrara Ferrara, 28/09/2009

Overview: 1. The physical problem: - Ising model and Spin Glass in short; 2. The computational challenge: - Monte Carlo tools; - Computational features of Spin Glass problems; 3. The Janus system: - an exotic attempt to simulate spin systems; 4. Conclusions - 2 of 31 -

Statistical mechanics... Statistical mechanics tries to describe the macroscopic behavior of matter in terms of average values over its microscopic structure. For instance, it tries to explain why magnets have a transition temperature (beyond which they lose their magnetic state)... in terms of elementary magnetic dipoles attached to each atom in the material (which we call spins). - 3 of 31 -

The Ising model Each spin takes only the values +1 (up) or -1 (down) and interacts only with its nearest neighbours in a discrete D-dimensional mesh (e.g. 2D). Energy of the system for a given configuration {s_i}: $U(\{s_i\}) = -\sum_{\langle i,j \rangle} J_{ij} s_i s_j$. Note that: A) Each configuration {s_i} has a probability given by the Boltzmann factor: $P(\{s_i\}) \propto e^{-U(\{s_i\})/kT}$. B) Macroscopic observables have averages obtained by summing over all configurations, each weighted by its probability. Magnetization: $M(\{s_i\}) = \sum_i s_i$, with $\langle M \rangle = \sum_{\{s_i\}} M(\{s_i\}) P(\{s_i\})$. - 4 of 31 -
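As a concrete illustration of the formulas above, here is a minimal Python sketch (not part of the original slides) that computes U and M for a small configuration; the 2D geometry, the periodic boundaries and the ferromagnetic J = +1 are illustrative choices.

```python
import numpy as np

def energy_and_magnetization(s, J=1.0):
    """U = -J * sum_<i,j> s_i s_j and M = sum_i s_i for a 2D array of +/-1
    spins with periodic boundaries (a toy check of the formulas above)."""
    right = np.roll(s, -1, axis=1)           # nearest neighbour to the right
    down = np.roll(s, -1, axis=0)            # nearest neighbour below
    U = -J * np.sum(s * right + s * down)    # each bond counted exactly once
    M = np.sum(s)
    return U, M

# example: a random 8x8 configuration
rng = np.random.default_rng(0)
s = rng.choice(np.array([-1, 1]), size=(8, 8))
print(energy_and_magnetization(s))
```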

The Ising model $U(\{s_i\}) = -\sum_{\langle i,j \rangle} J_{ij} s_i s_j$ What about the J?!? In the Ising model we assume that the couplings J are constant (all equal to +1 or all equal to -1). The energy profile is therefore smooth (and in this case symmetric): - 5 of 31 -

The Spin Glass model Spin glasses are a generalization of Ising systems. An apparently trivial change in the energy function: $U(\{s_i\}) = -\sum_{\langle i,j \rangle} J_{ij} s_i s_j$, with $s_i \in \{-1,+1\}$ and $J_{ij} \in \{-1,+1\}$, makes spin glass models much more interesting and complex than Ising systems: if the J are NOT fixed (each coupling is chosen at random as +1 or -1), we obtain 2 important consequences: 1) frustration (not all bonds can be satisfied at the same time); 2) a corrugated (rugged) energy landscape: - 6 of 31 -
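A quick way to see frustration at work: on a square plaquette whose couplings multiply to -1, no spin assignment satisfies all four bonds. The brute-force check below is my own illustration, not from the slides; the particular bond values are an arbitrary frustrated choice.

```python
from itertools import product

# a frustrated plaquette: the product of the four couplings is -1
J = {(0, 1): +1, (1, 2): +1, (2, 3): +1, (3, 0): -1}

# maximize sum_bonds J_ij * s_i * s_j over all 2^4 spin assignments;
# 4 would mean "all bonds satisfied", but the best achievable is 2
best = max(sum(Jij * s[i] * s[j] for (i, j), Jij in J.items())
           for s in product([-1, +1], repeat=4))
print(best)   # 2: at best 3 satisfied bonds and 1 unsatisfied, never 4
```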

Problems around that... Problems like finding the minimum energy configuration at a given temperature or finding the critical temperature are absolutely non-trivial and represent a computational nightmare!!! HINT: If we consider a 3D lattice with 48 spins per side, we have $48^3 = 110592$ spins and therefore $2^{48^3}$ (more than $10^{33000}$) possible configurations!!! - 7 of 31 -

Overview: 1. The physical problem: - Spin Glass and Ising model in short; 2. The computational challenge: - Monte Carlo tools; - Computational features of Spin Glass problems; 3. The Janus system: - an exotic attempt to simulate spin systems; 4. Conclusions - 8 of 31 -

Monte Carlo algorithms As seen before, relatively small lattices produce a huge number of possible configurations. Monte Carlo algorithms are tools that allow us to visit configurations according to their probability. Markov chain theory assures us that the ergodicity property holds: the probability of being in configuration i, after a long enough time, tends to a limiting probability (equilibrium condition). - 9 of 31 -

Monte Carlo algorithms A direct consequence of this property of Markov chains is that we can produce configurations $\{C_i\}_{i=1}^{n}$ that are distributed following a given distribution. The good news is that observables are then trivially computed as unweighted sums over the Monte Carlo generated configurations*: $\langle f \rangle = \sum_{C} f(C) P(C) \approx \frac{1}{n} \sum_{i=1}^{n} f(C_i)$ *we no longer need to sum the observable weighted with the probability. - 10 of 31 -
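In code, the estimator above is nothing more than a flat average over the sampled configurations. A minimal sketch (the function name is mine; any valid sampler, e.g. the Metropolis algorithm of the next slide, can supply the configurations):

```python
def mc_average(f, configurations):
    # <f> ~ (1/n) * sum_i f(C_i), valid when the C_i are drawn from the
    # Boltzmann distribution by an equilibrated Markov chain
    return sum(f(C) for C in configurations) / len(configurations)
```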

The Metropolis algorithm Assuming a 3D lattice composed of spins, we can update each lattice site following these steps:
- consider one spin s_0 and compute its local energy E: $E = -\sum_{j} J_{0j} s_j s_0$ (sum over the neighbours j of site 0);
- flip the value of the spin: s_0' = -s_0;
- compute the new local energy E': $E' = -\sum_{j} J_{0j} s_j s_0'$;
- if E' < E the spin flip is accepted; else we compute $X = \exp(-(E'-E)/T)$ and generate a pseudo-random number R (0 ≤ R ≤ 1): if R < X the spin flip is accepted as well, else the spin keeps its old value s_0.
NOTE: If we replace the spin lattice with the topology of a random graph the algorithm keeps working (e.g. this is exactly what we used to implement graph coloring). - 11 of 31 -
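A minimal Python sketch of one Metropolis sweep over a 3D ±J lattice, following the steps above. The array layout, the periodic boundaries and k_B = 1 are my assumptions; Janus obviously implements this in hardware, not Python.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def metropolis_sweep(s, Jx, Jy, Jz, T):
    """One Metropolis sweep over a 3D lattice of +/-1 spins with periodic
    boundaries. Jx[x,y,z] couples site (x,y,z) to (x+1,y,z); Jy, Jz likewise."""
    L = s.shape[0]
    for x in range(L):
        for y in range(L):
            for z in range(L):
                # local field h = sum over the 6 neighbours of J_0j * s_j
                h = (Jx[x, y, z] * s[(x + 1) % L, y, z]
                     + Jx[(x - 1) % L, y, z] * s[(x - 1) % L, y, z]
                     + Jy[x, y, z] * s[x, (y + 1) % L, z]
                     + Jy[x, (y - 1) % L, z] * s[x, (y - 1) % L, z]
                     + Jz[x, y, z] * s[x, y, (z + 1) % L]
                     + Jz[x, y, (z - 1) % L] * s[x, y, (z - 1) % L])
                dE = 2 * s[x, y, z] * h          # E' - E for flipping this spin
                if dE <= 0 or rng.random() < math.exp(-dE / T):
                    s[x, y, z] = -s[x, y, z]     # accept the flip

# illustrative usage on a small random +/-J sample
L = 8
s = rng.choice(np.array([-1, 1], dtype=np.int8), size=(L, L, L))
Jx, Jy, Jz = (rng.choice(np.array([-1, 1], dtype=np.int8), size=(L, L, L)) for _ in range(3))
metropolis_sweep(s, Jx, Jy, Jz, T=1.0)
```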

Or easier... - 12 of 31 -

Computational features Operations on spins are bit manipulations. Computation of a limited set of exponentials (we can store them in LUTs). Random numbers should be of good quality and have a long period. A huge degree of available parallelism. Regular program flow (orderly loops over the grid sites). Regular, predictable memory access pattern. The information exchange between processor and memory is huge, yet the size of the data set is tiny (on-chip memory seems to be ideal...). - 13 of 31 -
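Because spins and couplings are ±1, the energy change of a single flip on a 3D lattice can only take the seven values ΔE = -12, -8, ..., +12, so the required exponentials can be tabulated once per temperature. A small sketch of this LUT idea (the table layout is an illustrative choice):

```python
import math

def acceptance_lut(T):
    """Acceptance probabilities min(1, exp(-dE/T)) for every possible dE of a
    single flip on a 3D +/-1 lattice with +/-1 couplings."""
    return {dE: min(1.0, math.exp(-dE / T)) for dE in range(-12, 13, 4)}

lut = acceptance_lut(T=1.0)
# inside the update loop: accept the flip iff R < lut[dE]
```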

The PC approach... Using standard architectures as Spin Glass engines, two different algorithms are commonly used: Synchronous Multi-Spin Coding (SMSC): one CPU handles one single system; replicas are handled on a farm of several CPUs (128-256); at every step each CPU updates N spins in parallel (N=2 or 4); NOTE: the bottleneck is the number of floating-point random numbers that can be generated in parallel. Asynchronous Multi-Spin Coding (AMSC): each CPU handles several (64-128) systems in parallel; replicas are handled on a CPU farm (smaller than the previous one); at every step each CPU updates in parallel 64-128 spins belonging to different systems; on each single CPU the random number is shared among all systems (see the sketch below). - 14 of 31 -
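The sketch below illustrates the asynchronous multi-spin-coding idea in Python: 64 independent replicas of the same lattice site are packed into one word, the number of satisfied bonds is counted per bit lane with a bit-sliced adder, and a single shared random number decides acceptance in every lane. This is my own minimal reconstruction of the technique, not the authors' code; the helper names and the acceptance trick via a k-threshold are assumptions.

```python
import math
import random

# Asynchronous multi-spin coding: one word holds the same lattice site of
# NREP independent replicas; bit b is the spin of replica b (1 <-> +1, 0 <-> -1).
NREP = 64
LANES = (1 << NREP) - 1            # mask selecting the NREP bit lanes

def full_add(a, b, c):
    """Bit-sliced full adder: per-lane sum and carry of three 1-bit values."""
    return a ^ b ^ c, (a & b) | (c & (a ^ b))

def amsc_site_update(s0, neigh, coup, T, rng=random):
    """One Metropolis step of a single site in all NREP replicas at once.

    s0    : word with this site's spin in every replica
    neigh : the 6 neighbour spin words
    coup  : the 6 coupling words (bit 1 <-> J=+1, bit 0 <-> J=-1)
    A single uniform random number is shared by all replicas (the AMSC trick).
    """
    # bond j is satisfied (J*s_j*s_0 = +1) exactly when J ^ s_j ^ s_0 = 1, per lane
    sat = [(c ^ s ^ s0) & LANES for s, c in zip(neigh, coup)]

    # bit-sliced count k of satisfied bonds (0..6) per lane: k = 4*b2 + 2*b1 + b0
    s1, c1 = full_add(sat[0], sat[1], sat[2])
    s2, c2 = full_add(sat[3], sat[4], sat[5])
    b0 = s1 ^ s2
    carry = s1 & s2
    b1 = c1 ^ c2 ^ carry
    b2 = (c1 & c2) | (carry & (c1 ^ c2))

    # flipping the spin costs dE = 4k - 12; since exp(-dE/T) is monotone in k,
    # one shared random number R accepts the flip in every lane with k <= kmax(R)
    R = rng.random()
    kmax = 3                                   # k <= 3 means dE <= 0: always accept
    for k in (4, 5, 6):
        if R < math.exp(-(4 * k - 12) / T):
            kmax = k

    accept = 0
    for v in range(kmax + 1):                  # lanes where k == v, for v = 0..kmax
        e0 = b0 if v & 1 else b0 ^ LANES
        e1 = b1 if v & 2 else b1 ^ LANES
        e2 = b2 if v & 4 else b2 ^ LANES
        accept |= e0 & e1 & e2

    return s0 ^ (accept & LANES)               # flip only the accepted lanes
```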

Overview: 1. The physical problem: - Spin Glass and Ising model in short; 2. The computational challenge: - Monte Carlo tools; - Computational features of Spin Glass problems; 3. The Janus system: - an exotic attempt to simulate spin systems; 4. Conclusions - 15 of 31 -

The Janus project A collaboration of: Universities of Rome La Sapienza and Ferrara; Universities of Madrid, Zaragoza, Extremadura; BIFI (Zaragoza); Eurotech. - 16 of 31 -

The Janus approach Preamble: We split the 3D lattice into 2 parts, black spins and white spins (each slice is a checkerboard). We cannot update black spins and white spins at the same time, but all spins of one colour can be updated in parallel (a sketch of the idea follows below). We store the lattice within the embedded memories of the FPGA. We use a set of registers to have zero-latency access to a small set of spins (and couplings). We house a set (as large as possible) of Update Engines... - 17 of 31 -
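A vectorised numpy sketch of the checkerboard idea (my own illustration, with the same array layout as the Metropolis sketch above): sites of one colour have all their neighbours in the other colour, so a whole colour can be updated in one shot, which is exactly what lets many update engines work in parallel.

```python
import numpy as np

def half_sweep(s, Jx, Jy, Jz, T, colour, rng):
    """Update every site of one checkerboard colour at once (periodic boundaries).
    colour is a boolean mask: (x+y+z) % 2 == 0 for 'black', its negation for 'white'."""
    h = (Jx * np.roll(s, -1, 0) + np.roll(Jx, 1, 0) * np.roll(s, 1, 0)
         + Jy * np.roll(s, -1, 1) + np.roll(Jy, 1, 1) * np.roll(s, 1, 1)
         + Jz * np.roll(s, -1, 2) + np.roll(Jz, 1, 2) * np.roll(s, 1, 2))
    dE = 2 * s * h                                   # cost of flipping each site
    flip = colour & ((dE <= 0) | (rng.random(s.shape) < np.exp(-dE / T)))
    s[flip] *= -1                                    # same-colour sites never interact

rng = np.random.default_rng(0)
L = 16
x, y, z = np.indices((L, L, L))
black = (x + y + z) % 2 == 0
s = rng.choice(np.array([-1, 1], dtype=np.int8), size=(L, L, L))
Jx, Jy, Jz = (rng.choice(np.array([-1, 1], dtype=np.int8), size=(L, L, L)) for _ in range(3))
half_sweep(s, Jx, Jy, Jz, T=1.0, colour=black, rng=rng)
half_sweep(s, Jx, Jy, Jz, T=1.0, colour=~black, rng=rng)
```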

The Janus approach Each Update Engine: computes the local contribution to U: $U_{\mathrm{local}} = -\sum_{j} J_{ij} s_i s_j$; addresses a probability table according to U; compares it with a freshly generated random number; sets the new spin value. - 18 of 31 -

The Janus approach A whole picture: - 19 of 31 -

The toy... The basic hardware elements are: a 2-D grid of 4 x 4 FPGA-based processors (SPs); data links among nearest neighbours on the FPGA grid; one control processor on each board (IOP) with 2 Gbit Ethernet ports; a standard PC (Janus host) for every 2 Janus boards. - 20 of 31 -

Janus summary: 256 Xilinx Virtex4-LX200 FPGAs (+16 IOPs); 1024 update engines on each processor, pipelineable to one spin update per clock cycle; 88% of available logic resources; using a bandwidth of ~12000 read bits + ~1000 written bits per clock cycle; 47% of available on-chip memory; system clock at 62.5 MHz, giving a 16 ps average spin update time. 1 SP ~40 W, 16 SPs ~640 W, rack ~11 kW. - 21 of 31 -

Performance (for physicists) The relevant performance measure is the average update time per spin (R), measured in picoseconds per spin. For each FPGA in the system we have: R = (one clock cycle) / (1024 spin updates per clock cycle) = 16 ns / 1024 ~ 16 ps / spin. For one complete 16-processor board: R = 16 ns / (16 * 1024) ~ 1 ps / spin. - 22 of 31 -
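As a back-of-the-envelope check of these numbers (my own arithmetic, using the 62.5 MHz clock and 1024 update engines quoted above):

```python
clock_period = 1 / 62.5e6             # 16 ns at 62.5 MHz
R_fpga = clock_period / 1024          # ~1.6e-11 s, i.e. ~16 ps per spin on one FPGA
R_board = clock_period / (16 * 1024)  # ~1e-12 s, i.e. ~1 ps per spin on a 16-SP board
print(R_fpga, R_board)
```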

Performance (for computer scientists) Memory bandwidth Sustaining this performance requires a huge bandwidth to/from (fine-grained) memory: processing just one spin bit implies getting 12 more bits from memory (6 neighbours + 6 J = 12 bits); generating pseudo-randoms requires three 32-bit words; 1 more 32-bit word (from the LUT) is necessary for the MC step. All in all, the sustained memory bandwidth is: 140 bits * 1024 / 16 ns ~ 1 TB/s. Operations Each update-core performs 11 + 2 sustained pipelined operations* per clock cycle (@ 62.5 MHz); 1024 update engines perform: 1024 * 13 ops / 16 ns = 832 Gops; 1024 * 3 ops / 16 ns = 192 Gops**. * these operations are on very short data words!!! ** submitted to the Gordon Bell Prize 2008. - 23 of 31 -
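The bandwidth and Gops figures follow from the same clock and engine count (again, my own back-of-the-envelope check):

```python
bits_per_update = 12 + 3 * 32 + 32                  # neighbours+couplings, RNG words, LUT word = 140 bits
bandwidth = bits_per_update * 1024 / 16e-9 / 8      # bytes per second
ops_13 = 13 * 1024 / 16e-9                          # counting the 11+2 ops per engine per cycle
ops_3 = 3 * 1024 / 16e-9                            # counting only 3 ops per engine per cycle
print(bandwidth / 1e12, ops_13 / 1e9, ops_3 / 1e9)  # ~1.1 TB/s, 832 Gops, 192 Gops
```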

Performance Comparing with more conventional systems... Mantovani Filippo, 25/03/2009 - 24 of 31 -

Power consumption Power: 1 FPGA... ~40 W; 1 Janus system... ~11 kW; 1 domestic power supply... ~5 kW. Operations per Watt: Janus... ~8.75 Gops/W; best case Green500*... ~0.536 Gflops/W. A run of 25 days** uses 15 oil barrels; to obtain the same results with a cluster of PCs we would need 12000 oil barrels***. * BladeCenter QS22 Cluster, PowerXCell 8i 4.0 GHz, University of Warsaw (June 2009). ** Run period of the first intensive Janus simulation (April 2008). *** We assume an unrealistic condition of perfect scalability. Mantovani Filippo, 25/03/2009 - 25 of 31 -

Janus gallery: SP - 26 of 31 -

Janus gallery: IOP - 27 of 31 -

Janus gallery: board - 28 of 31 -

Janus gallery: rack - 29 of 31 -

Overview: 1. The physical problem: - Spin Glass and Ising model in short; - LQCD vs statistical mechanics; 2. The computational challenge: - Monte Carlo tools; - Computational features of Spin Glass problems; 3. The Janus system: - an exotic attempt to simulate spin systems; 4. Conclusions - 30 of 31 -

Conclusions the Janus system is a heterogeneous, massively parallel system; made of highly parallel processors implemented on FPGAs; delivers impressive performance: 1 SP is equivalent to ~100 PCs and ~30-50 new multi-core architectures; used for scientific applications like Spin Glass, but in principle available for any kind of problem that fits the topology of the machine and the resources of the FPGAs; uses NON-challenging technology... (plug the LEGO bricks and play). - 31 of 31 -

A slightly broader view... The study of these systems: for a physicist means a better understanding of condensed matter; for a computer scientist means developing new algorithms and improving old ones; for a computer mason (like me) means developing a massively parallel high performance system. Moreover: all these systems are surprisingly strongly related to other areas, like e.g. graph coloring. - 32 of 31 -

- 33 of 31 -

About the random numbers The pseudo-random numbers are generated using a shift register of N 32-bit words, which can generate up to ~100 random numbers per clock cycle. NOTE: 1 shift register generating 32 numbers uses ~2% of the FPGA resources (~2500 slices). - 34 of 31 -
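The slide does not spell out which generator is used; a common choice in spin-glass codes of this family is a lagged-Fibonacci (Parisi-Rapuano-like) shift register, sketched below in Python. The specific lags (24, 55, 61) and the seeding are assumptions for illustration, not a statement about the Janus implementation; in the FPGA, many outputs of such a register can be produced per clock cycle.

```python
import random

class LaggedFibonacci32:
    """Parisi-Rapuano-style 32-bit generator: a shift register of 62 words,
    r[n] = (r[n-24] + r[n-55]) mod 2^32, output r[n] ^ r[n-61]."""
    def __init__(self, seed=0):
        rnd = random.Random(seed)
        self.w = [rnd.getrandbits(32) for _ in range(62)]   # the shift register

    def next(self):
        new = (self.w[-24] + self.w[-55]) & 0xFFFFFFFF
        out = new ^ self.w[-61]
        self.w.append(new)
        self.w.pop(0)                 # slide the register window
        return out

gen = LaggedFibonacci32(seed=1)
print([gen.next() for _ in range(4)])
```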