Spin glass simulations on Janus


Spin glass simulations on Janus R. (lele) Tripiccione, Dipartimento di Fisica, Università di Ferrara raffaele.tripiccione@unife.it UCHPC, Rodos (Greece), Aug. 27th, 2012

Warning / Disclaimer / Fine print I'm an outsider here ---> a physicist's view of an application-specific architecture A flavor of physics-motivated, performance-paranoid, (hopefully) unconventional computer architecture However, a few points of contact with mainstream CS may still exist...

On the menu today WHAT?: spin-glass simulations in short WHY?: computational challenges HOW?: the JANUS systems DID IT WORK?: measured and expected performance (and comparison with conventional systems) Take-away lessons / Conclusions

Our computational problem Bring a spin-glass (*) system of e.g. 48^3 grid points to thermal equilibrium: - a challenge never attempted so far ---> - follow the system for 10^12 ... 10^13 Monte Carlo (*) steps - on ~100 independent system instances Back-of-envelope estimate: 1 high-end CPU for 10,000 years (which is not the same as 10,000 CPUs for 1 year...) (*) to be defined in the next slides

Statistical mechanics in brief... Statistical mechanics tries to describe the macroscopic behaviour of matter in terms of average values of its microscopic structure A (hopefully familiar) example: explain why magnets have a transition temperature beyond which they lose their magnetic state

The Ising model... The tiny little magnets are named spins; they take just two values A configuration is a specific value assignment for all spins in the system The macro-behavior is dictated by the energy function at the micro level: each spin interacts only with its nearest neighbours in a discrete D-dim mesh: $U\{S\} = -\sum_{\langle ij \rangle} J\, S_i S_j$, $J > 0$ Statistical physics bridges the gap from micro to macro...
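
The slide's formula in code: a minimal C sketch (mine, not from the talk) of the Ising energy on an L x L x L periodic lattice; the flat array layout and the idx() helper are illustrative assumptions:

    /* Map (x,y,z) to a linear index with periodic boundaries (sketch). */
    static int idx(int x, int y, int z, int L) {
        return ((x + L) % L) * L * L + ((y + L) % L) * L + ((z + L) % L);
    }

    /* Total Ising energy U = -J * sum over nearest-neighbour pairs S_i S_j,
       spins stored as +1/-1 integers; each bond is counted once by looking
       only at the +x, +y, +z neighbours. */
    double ising_energy(const int *s, int L, double J) {
        double U = 0.0;
        for (int x = 0; x < L; x++)
            for (int y = 0; y < L; y++)
                for (int z = 0; z < L; z++) {
                    int si = s[idx(x, y, z, L)];
                    U -= J * si * (s[idx(x + 1, y, z, L)]
                                 + s[idx(x, y + 1, z, L)]
                                 + s[idx(x, y, z + 1, L)]);
                }
        return U;
    }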

The spin-glass model... Spin-glasses are a generalization of Ising systems. They are the reference theoretical model of glassy behavior: interesting per se, a model of complexity, interesting for industrial applications An apparently trivial change in the energy function makes spin-glasses much more complex than Ising systems Studying these systems is a computational nightmare...

Why are Spin Glasses so hard?? A very simple change in the energy function (defined on e.g. a discrete 3-D lattice), $U = -\sum_{\langle ij \rangle} J_{ij}\,\sigma_i \sigma_j$, $\sigma = \{-1, +1\}$, $J = \{-1, +1\}$, hides tremendously complex dynamics, due to the extremely irregular energy landscape in the configuration space (frustration): e.g. a square plaquette with an odd number of negative couplings cannot satisfy all of its bonds at once

Monte Carlo algorithms These beasts are best studied numerically by Monte Carlo algorithms Monte Carlo algorithms navigate configuration space in such a way that ----> any configuration shows up according to its probability of being realized in the real world (at a given temperature) MC algorithms come in several versions; most versions have remarkably similar requirements in terms of their algorithmic structure.

The Metropolis algorithm An endless loop...
- pick up one (or several) spin(s)
- compute the energy U
- flip it/them
- compute the new energy U'
- compute $\Delta U = U' - U$
- if $\Delta U \le 0$, accept the change unconditionally; else accept the change only with probability $e^{-\Delta U / kT}$
- pick up new spin(s) and do it again

... just a few C lines
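
The code itself is not reproduced in this transcript; what follows is a minimal sketch of those few C lines, reusing the idx() helper of the Ising sketch above (the rand() calls are placeholders for the good-quality, long-period generator the real code needs):

    #include <math.h>
    #include <stdlib.h>

    /* One Metropolis sweep at temperature T (units with k_B = 1); sketch. */
    void metropolis_sweep(int *s, int L, double J, double T) {
        for (long n = 0; n < (long)L * L * L; n++) {
            int x = rand() % L, y = rand() % L, z = rand() % L;
            int si = s[idx(x, y, z, L)];
            /* sum of the six nearest neighbours */
            int h = s[idx(x - 1, y, z, L)] + s[idx(x + 1, y, z, L)]
                  + s[idx(x, y - 1, z, L)] + s[idx(x, y + 1, z, L)]
                  + s[idx(x, y, z - 1, L)] + s[idx(x, y, z + 1, L)];
            double dU = 2.0 * J * si * h;            /* dU = U' - U for one flip */
            if (dU <= 0.0 || (double)rand() / RAND_MAX < exp(-dU / T))
                s[idx(x, y, z, L)] = -si;            /* accept the flip */
        }
    }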

Monte Carlo algorithms Common features:
- bit-manipulation operations on spins (+ LUT access) (sketched below)
- (good-quality/long) random numbers
- a huge degree of available parallelism
- regular program flow (orderly loops on the grid sites)
- regular, predictable memory access pattern
- information exchange (processor <-> memory) is huge, however the size of the data base is tiny
---> many small (not too small) cores, hardwired control, on-chip memory
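
The bit-manipulation point deserves a concrete illustration. A standard trick (multi-spin coding; my sketch of the idea, not the Janus code) encodes one spin per bit, +1 as 0 and -1 as 1, so that a single XOR evaluates one bond for 64 replicas at once:

    #include <stdint.h>

    typedef uint64_t spins_t;   /* bit k = spin of replica k (0 -> +1, 1 -> -1) */

    /* With J_ij in {-1,+1} encoded the same way, s_i XOR J_ij XOR s_j is 1
       exactly when bond (i,j) is unsatisfied: 64 bond energies per XOR. */
    static inline spins_t unsatisfied(spins_t si, spins_t Jij, spins_t sj) {
        return si ^ Jij ^ sj;
    }

    /* Bit-sliced increment: add the 1-bit input b into a 3-bit counter
       (c2 c1 c0), independently in each of the 64 bit positions; six calls
       count the unsatisfied bonds of a 3-D site for every replica at once. */
    static inline void bitadd(spins_t b, spins_t *c0, spins_t *c1, spins_t *c2) {
        spins_t carry  = *c0 & b;      *c0 ^= b;
        spins_t carry2 = *c1 & carry;  *c1 ^= carry;
        *c2 ^= carry2;
    }

The Metropolis accept/reject step then reduces to comparing the per-replica counts against precomputed thresholds, again with bitwise logic or a small table.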

Compute intensive, you mean?? One Monte Carlo step is roughly the (real) time in which a (real) system flips one of its spins, roughly 1 picosecond If you want to understand what happens in just the first seconds of a real experiment you need O(10^12) time steps on ~100 replicas of a 100^3 system ---> 10^20 updates Clever programming on standard CPUs: 1 ns/spin-update ---> 3000 years
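
Spelling out the arithmetic behind the slide's numbers: $10^{12}$ steps $\times$ $100$ replicas $\times$ $100^3$ spins $= 10^{20}$ updates, and $10^{20} \times 1\,\mathrm{ns} = 10^{11}\,\mathrm{s} \approx 3 \times 10^{3}$ years on a single CPU.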

Compute intensive, you mean?? The dynamics is dramatically slow (see picture) So even a simulated box whose size is a small multiple of the correlation length will give accurate physics results Good news: we're in business even if we simulate a very small box... However...

Hard scaling vs weak scaling Amdahl's law (strong scaling): $S_A = \frac{(1-p)+p}{(1-p)+p/N} = \frac{1}{(1-p)+p/N}$ vs... Gustafson's law (weak scaling): $S_G = \frac{(1-p)+Np}{(1-p)+p} = (1-p)+Np$ In our case enlarging the system size is meaningless, as we do not yet have the resources to study even a small system ----> the ultimate quest for strong scaling...
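
A worked example (numbers mine, not from the slide): with parallel fraction $p = 0.99$ on $N = 1024$ processors, $S_A = \frac{1}{0.01 + 0.99/1024} \approx 91$, while $S_G = 0.01 + 0.99 \times 1024 \approx 1014$. The serial fraction caps strong scaling near 100x however many processors are added to a fixed-size problem, which is exactly the regime these simulations live in.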

The JANUS project An attempt at developing, building and operating an application-driven compute engine for Monte Carlo simulations of spin glass systems A collaboration of: Universities of Rome (La Sapienza) and Ferrara Universities of Madrid, Zaragoza, Badajoz BIFI (Zaragoza) Eurotech Partially supported by Microsoft, Xilinx

The nature of the available parallelism Spin glass simulations have two levels of available parallelism 1) Embarrassingly trivial: need statistics on several replicas ---> farm it out to independent processors 2) Trivially identified: the sweep order for the Monte Carlo update is not specified ---> can update in parallel any set of non-mutually-interacting spins; make it a black-white checkerboard (see the sketch below): it opens the way to tens of thousands of independent threads... 1) & 2) do not commute
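
A minimal sketch of point 2) in code (an OpenMP illustration of the checkerboard idea, not the Janus implementation; metropolis_site() is a hypothetical helper applying the single-site update of the earlier sketch, and each thread would need its own RNG stream):

    /* Single-site Metropolis update (hypothetical helper, as sketched above). */
    void metropolis_site(int *s, int L, int x, int y, int z, double J, double T);

    /* Black/white checkerboard sweep: sites with (x+y+z) even and odd form two
       sub-lattices; no two sites of the same colour interact, so each
       half-sweep is fully parallel. */
    void checkerboard_sweep(int *s, int L, double J, double T) {
        for (int colour = 0; colour < 2; colour++) {
            #pragma omp parallel for collapse(2)
            for (int x = 0; x < L; x++)
                for (int y = 0; y < L; y++)
                    for (int z = 0; z < L; z++)
                        if ((x + y + z) % 2 == colour)
                            metropolis_site(s, L, x, y, z, J, T);
        }
    }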

The ideal spin glass machine... A further question: what is the appropriate system scale at which this parallelism is best exploited? One update engine:
- computes the local contribution to U: $U_i = -\sum_{j \in NB(i)} \sigma_i\, J_{ij}\, \sigma_j$
- addresses a probability table
- compares with a freshly generated random number
- assigns the new spin value
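
A sketch of what "addresses a probability table" can mean in practice (my reading of the idea, not the Janus RTL): with $\sigma$ and $J$ both $\pm 1$, the local field $f = \sum_j J_{ij}\sigma_j$ over 6 neighbours takes only the 7 even values -6...+6, so $\Delta U = 2\sigma_i f$ has just 7 possible values and $\min(1, e^{-\Delta U/kT})$ can be precomputed as integer thresholds:

    #include <math.h>
    #include <stdint.h>

    static uint32_t prob[7];   /* acceptance thresholds, one per value of sigma*f */

    /* Precompute min(1, exp(-dU/T)) scaled to 32 bits, for dU = 2*(2k-6). */
    void build_prob_table(double T) {
        for (int k = 0; k < 7; k++) {
            double dU = 2.0 * (2 * k - 6);                /* -12, -8, ..., +12 */
            double p  = (dU <= 0.0) ? 1.0 : exp(-dU / T);
            prob[k] = (uint32_t)(p * 4294967295.0);       /* p = 1 saturates;
                                                             off by 2^-32, fine
                                                             for a sketch */
        }
    }

    /* One update-engine step: flip iff a fresh random word falls below the
       threshold selected by the local field (sigma*f is even, in -6..+6). */
    int update_spin(int sigma, int f, uint32_t rnd) {
        return (rnd < prob[(sigma * f + 6) / 2]) ? -sigma : sigma;
    }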

The ideal spin glass machine... All this is just a bunch (~1000) of gates And in spite of that, a typical CPU core, with O(10^7+) gates, can process perhaps 4 spins at each clock cycle If you can arrange your stock of gates the way it best suits the algorithm, you can easily expect ~1000 update engines on one chip ----> The best structure is a massively-many-core organization (or perhaps an application-driven GPU??)

The ideal spin glass machine... is an orderly structure (a 2D grid) of a large number of update engines
- each update engine handles a subset of the physical mesh
- its architectural structure is extremely simple
- each data path processes one bit at a time
- memory addressing is regular and predictable
- SIMD processing is OK
- however, memory bandwidth requirements are huge (need ~7 bits to process one bit...)
- however, memory can be local to the processor
Simple hardware structure ---> FPGAs are OK!

The JANUS machine A parallel system of (themselves) massively parallel processor chips The basic hardware element: A 2-D grid of 4 x 4 (FPGA-based) processors (SPs) Data links among nearest neighbours on the grid One control processor on each board (IOP) with 2 Gbit Ethernet links to the host

JANUS: a picture gallery

Our large machine 256 (16 x 16) processors 8 host PCs ---> ~90 TIPS for spin-glass simulation A typical simulation wall-clock time on this nice little machine goes down to a more manageable ~100 days.

JANUS as a spin-glass engine The 2008 implementation (Xilinx Virtex4-LX200): 1024 update cores on each processor, pipelineable to one spin update per clock cycle ---> 88% of available logic resources system clock at 62.5 MHz ---> 16 ps average spin update time using a bandwidth of ~12000 read bits + 1000 written bits per clock cycle ---> 47% of available on-chip memory

(Measured) Performance Let's use conventional units first... The data path of each Processing Element (PE) performs 11 + 2 sustained pipelined ops per clock cycle (62.5 MHz) We have 1024 PEs ----> ~830 GIPS However, 11 of those ops are on very short data words; more honestly: 7...8 sustained conventional pipelined ops per clock cycle We have 1024 PEs ----> ~300 GIPS ---> 10 GIPS/W Sustained by ~1 Tbyte/sec combined memory bandwidth

(Measured) Performance Physicists like a different figure of merit ----> the spin-flip rate R, typically measured in picoseconds per flip For each processor in the system: $R = \frac{1}{N f} = \frac{1}{1024 \times 62.5\,\mathrm{MHz}} \simeq 16\,\mathrm{ps/flip}$ For one complete element of the JANUS core (16 procs): $R = \frac{1}{N_p N f} = \frac{1}{16 \times 1024 \times 62.5\,\mathrm{MHz}} \simeq 1\,\mathrm{ps/flip}$ as fast as Nature...

Physics results

Performance figures (2008-2009) Spin-glass addicts like to quote the average spin-update time:

                          SUT        GUT
    Janus module          16 ps      1 ps
    PC (Intel Core Duo)   3000 ps    700 ps
    IBM CBE (all cores)   -          65 ps

---> 300x ... 700x!!

Performance figures (2010-2011) In the last couple of years, multi/many-core processors and GPUs have entered the arena... Still 10x ... 20x!!

What next?? 4+ year old Janus still has an edge on state-of-the-art commercial HPC computing architectures Reasonable to continue along the same line, surfing on technology developments Expected performance increase:
- FPGA size: 2.5 ... 3.0x
- clock frequency: 4.0x
- SUT parallelism: 16x
- grand total: 160-200x (i.e. log2(grand total) ~ 7.5)

What next?? Janus 2 Exactly the same architecture as JANUS but... Xilinx Virtex-7 FPGAs (Virtex7-485) 2 DDR-3 memory banks on each SP Improved local 4x4 interconnection Tighter coupling with the HOST (on-box CPU + PCIe gen2) Prototypes in fall 2012 Physics in early 2013 ----> Simulate a 128^3 Ising spin glass for 2^42 time steps

Looking at the crystal ball... How long is the (predicted) opportunity window for Janus2?? A graphical answer (and some speculations on Moore's law)

Take-away lessons JANUS is an extremely rewarding example of (strongly application-driven) on-chip multiprocessing: We designed a machine around an unconventional problem No wonder the machine turned out to be unconventional enough Results were rewarding... WHY????

Take-away lessons Results were rewarding... why??
- there is a lot of parallelism available that is actually exploited;
- load is automatically balanced among the update engines;
- memory access is heavy, but patterns are predictable;
- processors (and their memories) are arranged on a regular grid;
- inter-node traffic is modest and regular.
IN SHORT Our machine tried to exploit all these features as best it could