Spin glass simulations on Janus R. (lele) Tripiccione Dipartimento di Fisica, Università di Ferrara raffaele.tripiccione@unife.it UCHPC, Rodos (Greece) Aug. 27th, 2012
Warning / Disclaimer / Fine print I'm an outsider here ---> a physicist's view on an application-specific architecture A flavor of physics-motivated, performance-paranoid, (hopefully) unconventional computer architecture However, a few points of contact with mainstream CS may still exist...
On the menu today WHAT?: spin-glass simulations in short WHY?: computational challenges HOW?: the JANUS systems DID IT WORK?: measured and expected performance (and comparison with conventional systems) Take-away lessons / Conclusions
Our computational problem Bring a spin-glass (*) system of e.g. 48^3 grid points to thermal equilibrium: - a challenge never attempted so far ---> - follow the system for 10^12 ... 10^13 Monte Carlo (*) steps - on ~100 independent system instances Back-of-envelope estimate: 1 high-end CPU for 10,000 years (which is not the same as 10,000 CPUs for 1 year...) (*) to be defined in the next slides
Statistical mechanics in brief... Statistical mechanics tries to describe the macroscopic behaviour of matter in terms of average values over its microscopic structure A (hopefully familiar) example: explain why magnets have a transition temperature beyond which they lose their magnetic state
The Ising model... The tiny little magnets are named spins; they take just two values A configuration is a specific value assignment for all spins in the system The macro-behavior is dictated by the energy function at the micro level: each spin interacts only with its nearest neighbours in a discrete D-dim mesh: U({S}) = −J Σ_<ij> S_i S_j , with J > 0 Statistical physics bridges the gap from micro to macro...
The spin-glass model... Spin-glasses are a generalization of Ising systems. They are the reference theoretical model of glassy behavior Interesting per se A model of complexity Interesting for industrial applications An apparently trivial change in the energy functions makes spin-glasses much more complex than Ising systems Studying these systems is a computational nightmare...
Why are Spin Glasses so hard?? A very simple change in the energy function (defined on e.g. a discrete 3-D lattice) U = −Σ_<ij> J_ij σ_i σ_j (sum over the NB nearest-neighbour bonds), σ_i ∈ {−1, +1}, J_ij ∈ {−1, +1} hides tremendously complex dynamics, due to the extremely irregular energy landscape in configuration space (frustration):
Monte Carlo algorithms These beasts are best studied numerically, by Monte Carlo algorithms Monte Carlo algorithms navigate configuration space in such a way that ----> any configuration shows up according to its probability of being realized in the real world (at a given temperature) MC algorithms come in several versions; most versions have remarkably similar requirements in terms of their algorithmic structure.
The Metropolis algorithm An endless loop... Pick one (or several) spin(s) Compute the energy U Flip it/them Compute the new energy U' Compute ΔU = U' − U If ΔU ≤ 0 accept the change unconditionally else accept the change only with probability e^(−ΔU / k_B T) Pick new spin(s) and do it again
... just a few C lines
Monte Carlo algorithms Common features: bit-manipulation operations on spins (+ LUT access) (good-quality / long-period) random numbers a huge degree of available parallelism regular program flow (orderly loops over the grid sites) regular, predictable memory access pattern information exchange (processor <-> memory) is huge, however the size of the data base is tiny ---> many small (not-too-small) cores, hardwired control, on-chip memory
Compute intensive, you mean?? One Monte Carlo step is roughly the (real) time in which a (real) system flips one of its spins, roughly 1 picosecond If you want to understand what happens in just the first seconds of a real experiment you need O(10^12) time steps on ~100 replicas of a 100^3 system ---> 10^20 updates Clever programming on standard CPUs: 1 ns/spin-update ---> 3000 years
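The 3000-year figure is easy to re-derive; a back-of-envelope helper (cpu_years is a made-up name for illustration; the 1 ns/update rate is the one quoted above):

```c
/* Serial cost of the simulation campaign, in CPU-years:
   replicas * spins-per-replica * Monte Carlo steps, at an
   assumed ns_per_update nanoseconds per spin update.        */
double cpu_years(double replicas, double spins,
                 double steps, double ns_per_update)
{
    double seconds = replicas * spins * steps * ns_per_update * 1e-9;
    return seconds / 3.156e7;   /* ~seconds per year */
}
```

With the slide's numbers, cpu_years(100, 1e6, 1e12, 1.0) gives ~3.2e3 — the "3000 years" above.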
Compute intensive, you mean?? The dynamics is dramatically slow (see picture) So even a simulated box whose size is a small multiple of the correlation length will give accurate physics results Good news: we're in business even if we simulate a very small box... However...
Strong scaling vs weak scaling Amdahl's law (strong scaling): S_A = ((1 − p) + p) / ((1 − p) + p/n) = 1 / ((1 − p) + p/n) vs... Gustafson's law (weak scaling): S_G = ((1 − p) + Np) / ((1 − p) + p) = (1 − p) + Np In our case enlarging the system size is meaningless, as we do not yet have the resources to study even a small system ----> the ultimate quest for strong scaling...
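In plain code, the two laws above read as follows (amdahl / gustafson are illustrative helper names):

```c
/* Amdahl's law (strong scaling): fixed total work,
   parallel fraction p, n processors.                  */
double amdahl(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

/* Gustafson's law (weak scaling): work grows with the
   N processors, the serial fraction (1-p) stays fixed. */
double gustafson(double p, double N)
{
    return (1.0 - p) + N * p;
}
```

For p = 0.95, amdahl(0.95, 1024) ≈ 19.6 — capped near 1/(1 − p) = 20 no matter how many processors — while gustafson(0.95, 1024) ≈ 973. Since the spin-glass box cannot usefully grow, Janus has to live in the unforgiving Amdahl regime.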
The JANUS project An attempt at developing, building and operating an application-driven compute engine for Monte Carlo simulations of spin glass systems A collaboration of: Universities of Rome (La Sapienza) and Ferrara Universities of Madrid, Zaragoza, Badajoz BIFI (Zaragoza) Eurotech Partially supported by Microsoft, Xilinx
The nature of the available parallelism Spin glass simulations have two levels of available parallelism 1) Embarrassingly trivial: we need statistics on several replicas ---> farm them out to independent processors 2) Trivially identified: the sweep order for the Monte Carlo update is not specified ---> we can update in parallel any set of non-mutually-interacting spins make it a black-white checkerboard: it opens the way to tens of thousands of independent threads... 1) & 2) do not commute
The ideal spin glass machine... A further question: what is the appropriate system scale at which this parallelism is best exploited? One update engine: computes the local contribution to the energy change, ΔU = 2 σ_i Σ_<j ∈ NB> J_ij σ_j addresses a probability table compares with a freshly generated random number assigns the new spin value
The ideal spin glass machine... All this is just a bunch (~1000) of gates And in spite of that, a typical CPU core, with O(10^7+) gates, can process perhaps 4 spins at each clock cycle If you can arrange your stock of gates the way that best suits the algorithm, you can easily expect ~1000 update engines on one chip ----> The best structure is a massively-many-core organization (or perhaps an application-driven GPU??)
The ideal spin glass machine... is an orderly structure (a 2-D grid) of a large number of update engines each update engine handles a subset of the physical mesh its architectural structure is extremely simple each data path processes one bit at a time memory addressing is regular and predictable SIMD processing is OK however memory bandwidth requirements are huge (we need 7 bits to process one bit...) however memory can be local to the processor Simple hardware structure ---> FPGAs are OK!
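The one-bit data paths above can be mimicked in software by multi-spin coding; here is a sketch under the usual encoding assumption (spin +1 ↔ bit 0, spin −1 ↔ bit 1, likewise for the couplings), using the GCC/Clang builtin __builtin_popcountll:

```c
#include <stdint.h>

/* Multi-spin coding: one bit per spin, so a uint64_t packs 64
   independent spins (e.g. the same site in 64 replicas).  With
   +1 <-> 0 and -1 <-> 1, a product of +/-1 values maps to an
   XOR of bits, so the bond term J*s_i*s_j is -1 ("unsatisfied",
   contributing +1 to U = -sum J s s) exactly when j ^ si ^ sj
   is 1: one XOR evaluates 64 bonds at once.                     */
static inline uint64_t unsatisfied(uint64_t j, uint64_t si, uint64_t sj)
{
    return j ^ si ^ sj;
}

/* Energy of this bond summed over the 64 packed systems:
   +1 per unsatisfied bond, -1 per satisfied one.          */
int bond_energy_sum(uint64_t j, uint64_t si, uint64_t sj)
{
    int bad = __builtin_popcountll(unsatisfied(j, si, sj));
    return bad - (64 - bad);
}
```

This mirrors what each bit-wide Janus data path does in hardware; the real machine packs different checkerboard sites rather than replicas, and the full update also needs a bit-sliced sum of the six neighbour terms.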
The JANUS machine A parallel system of (themselves) massively parallel processor chips The basic hardware element: A 2-D grid of 4 x 4 (FPGA-based) processors (SPs) Data links among nearest neighbours on the grid One control processor on each board (IOP) with 2 Gbit Ethernet links to the host
JANUS: a picture gallery
Our large machine 256 (16 x 16) processors 8 host PCs --> ~ 90 TIPS for spin-glass simulation A typical simulation wall-clock time on this nice little machine goes down to a more manageable ~ 100 days.
JANUS as a spin-glass engine The 2008 implementation (Xilinx Virtex4-LX200): 1024 update cores on each processor, pipelineable to one spin update per clock cycle ---> 88% of available logic resources system clock at 62.5 MHz ---> 16 ps average spin-update time using a bandwidth of ~12000 read bits + 1000 written bits per clock cycle ---> 47% of available on-chip memory
(Measured) Performances Let's use conventional units first... The data path of each Processing Element (PE) performs 11 + 2 sustained pipelined ops per clock cycle (62.5 MHz) With 1024 PEs ----> ~830 GIPS However, the 11 ops are on very short data words; more honestly: 7...8 sustained conventional pipelined ops per clock cycle With 1024 PEs ----> ~300 GIPS ---> 10 GIPS/W Sustained by ~1 Tbyte/s combined memory bandwidth
(Measured) Performances Physicists like a different figure of merit ----> the spin-flip rate R, typically measured in picoseconds per flip For each processor in the system: R = 1 / (N f) = 1 / (1024 × 62.5 MHz) ≈ 16 ps/flip For one complete element of the JANUS core (16 processors): R = 1 / (N_p N f) = 1 / (16 × 1024 × 62.5 MHz) ≈ 1 ps/flip as fast as Nature...
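Both rates follow from R = 1/(N f); a one-line check (flip_rate_ps is an illustrative helper name):

```c
/* Spin-flip rate R = 1 / (n_engines * f_clock), in picoseconds
   per flip: n_engines parallel update engines, clock f_hz.      */
double flip_rate_ps(double n_engines, double f_hz)
{
    return 1e12 / (n_engines * f_hz);
}
```

flip_rate_ps(1024, 62.5e6) ≈ 15.6 ps for one processor; flip_rate_ps(16 * 1024, 62.5e6) ≈ 0.98 ps for a full 16-processor board.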
Physics results
Performance figures (2008-2009) Spin-glass addicts like to quote the average spin-update time:

                        SUT        GUT
Janus module            16 ps      1 ps
PC (Intel Core Duo)     3000 ps    700 ps
IBM CBE (all cores)     -          65 ps

---> 300x ... 700x!!
Performance figures (2010-2011) In the last couple of years, multi/many-core processors and GPUs have entered the arena... Still 10x ... 20x!!
What next?? The 4+ years old Janus still has an edge on state-of-the-art commercial HPC computing architectures Reasonable to continue along the same line, surfing on technology developments Expected performance increase?

FPGA size          2.5 ... 3.0 x
Clock frequency    4.0 x
SUT parallelism    16 x
Grand total        160 ... 200 x   (log2(Grand total) ~ 7.5)
What next?? Janus 2 Exactly the same architecture as JANUS but... Xilinx Virtex-7 FPGAs (Virtex7-485) 2 DDR3 memory banks on each SP Improved local 4x4 interconnection Tighter coupling with the HOST (on-box CPU + PCIe gen2) Prototypes in fall 2012 Physics in early 2013 ----> Simulate a 128^3 Ising spin glass for 2^42 time steps
Looking at the crystal ball... How long is the (predicted) opportunity window for Janus 2?? A graphical answer (and some speculations on Moore's law)
Take-away lessons JANUS is an extremely rewarding example of (strongly application driven) on-chip multiprocessing: We designed a machine around an unconventional problem No wonder the machine turned out to be unconventional enough Results were rewarding... WHY????
Take-away lessons Results were rewarding... why?? there is a lot of parallelism available, and it is actually exploited; load is automatically balanced among the update engines; memory access is heavy, but patterns are predictable; processors (and their memories) are arranged on a regular grid; inter-node traffic is small and regular. IN SHORT Our machine tried to exploit all these features at best