Unveiling the mysteries of our Galaxy with InterSystems Caché. Dr. Jordi Portell i de Mora, on behalf of the Gaia team at the Institute for Space Studies of Catalonia (IEEC-UB) and the Gaia Data Processing and Analysis Consortium throughout Europe. InterSystems Spain Summit 2016 - Barcelona, 26 October 2016
It's all about big numbers. After all, our areas are not that different: Human Genome: ~3 billion base pairs. Company customers: up to ~7 billion people (potentially). Our Galaxy: ~200 billion stars
So you want a census of our Galaxy? 1. Get the list of stars 2. Locate them 3. Get their information a) CV: Where are they coming from? (and where are they going?) b) Education: How bright are they? c) Personal interests: What is their favourite color? d) Food habits: What are they composed of? 4. Do this for a representative fraction of the Galaxy Gentle reminder: we're talking about ~200 billion stars (2×10^11)
The answer: ESA's Gaia mission Global Astrometry from Space Positions, distances and motions of ~1 billion stars Photometry: brightness and colors Spectroscopy: fingerprint (chemical composition) Most complete and accurate Milky Way census to date Accuracy: ~0.000000004 degrees (~15 microarcsec)
The Gaia spacecraft in a nutshell Orbit around L2 point, 1.5 million km from Earth 5 years (nominally) 2 telescopes Gigapixel camera: 106 CCDs, 9 Mpix each Autonomous operation Object detection onboard
Launch! Watch video here: https://www.youtube.com/watch?v=gidvvgtefjg and more Gaia videos here: http://www.cosmos.esa.int/web/gaia/media-gallery/videos
Data, data, data! Downlink: 7 Mbps, 8 h/day 25 GB/day (65 GB uncompressed) 115 TB in 5 years 50 million measurements/day (1 measurement = 12 to 15 tiny photos) 100 billion measurements in 5 years
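The downlink figures above are easy to sanity-check. A minimal back-of-the-envelope sketch (the 8 h/day contact and fixed 65 GB/day uncompressed volume are the slide's round numbers; the real downlink schedule varies):

```java
// Back-of-the-envelope check of the Gaia downlink figures quoted above.
public class DownlinkCheck {
    public static void main(String[] args) {
        double rateBps = 7e6;                  // 7 Mbit/s downlink rate
        double secondsPerDay = 8 * 3600;       // ~8 h of ground-station contact per day
        double gbPerDay = rateBps * secondsPerDay / 8 / 1e9;
        System.out.printf("Compressed: %.1f GB/day%n", gbPerDay);   // ~25 GB/day

        double uncompressedGbPerDay = 65;      // volume after on-ground decompression
        double tbMission = uncompressedGbPerDay * 365 * 5 / 1000;
        // ~119 TB, in line with the ~115 TB quoted for the 5-year mission
        System.out.printf("5-year volume: ~%.0f TB%n", tbMission);
    }
}
```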
And now what do we do with the data? All data must be promptly received and handled Continuous data accumulation and arrangement Spacecraft health must be monitored daily Many issues can only be detected after data processing Big processing only at the end? NO! Incremental data processing VERY complex algorithms HUGE system of equations -> iterative solution
The Data Processing and Analysis Consortium
DPAC keystone: the Daily Pipeline Reception and decoding of raw data packets Decompression Initial Data Treatment (IDT) First Look diagnostics and calibrations
Some features of the Gaia daily pipeline Complex software system written in Java (as all of DPAC) Portability, scalability (multi-threaded operation), development tools ~1 million lines of code Many different algorithms Data-driven approach: triggering depends on the data type received High efficiency required Data-train approach: prepare data and pass it to the algorithms Avoid random data access if possible!
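The data-driven and data-train ideas above can be sketched in a few lines of Java: incoming packets trigger handlers by data type, and records are handed to the algorithms in prepared sequential batches rather than by random access. All class and method names here are illustrative, not the actual DPAC API:

```java
import java.util.*;
import java.util.function.Consumer;

// Illustrative sketch of a data-driven "data train" (hypothetical names).
public class DataTrainSketch {
    enum PacketType { ASTROMETRY, PHOTOMETRY, SPECTROSCOPY }

    static class Packet {
        final PacketType type;
        final long sourceId;
        Packet(PacketType type, long sourceId) { this.type = type; this.sourceId = sourceId; }
    }

    private final Map<PacketType, List<Consumer<List<Packet>>>> handlers = new EnumMap<>(PacketType.class);
    private final Map<PacketType, List<Packet>> train = new EnumMap<>(PacketType.class);

    // Algorithms subscribe to the data types they care about.
    void register(PacketType t, Consumer<List<Packet>> algo) {
        handlers.computeIfAbsent(t, k -> new ArrayList<>()).add(algo);
    }

    // Data-driven: incoming packets are accumulated per type, not processed at random.
    void receive(Packet p) {
        train.computeIfAbsent(p.type, k -> new ArrayList<>()).add(p);
    }

    // Each prepared batch is passed sequentially to its subscribed algorithms.
    void dispatch() {
        train.forEach((t, batch) ->
            handlers.getOrDefault(t, List.of()).forEach(algo -> algo.accept(batch)));
        train.clear();
    }

    public static void main(String[] args) {
        DataTrainSketch pipeline = new DataTrainSketch();
        pipeline.register(PacketType.ASTROMETRY,
            batch -> System.out.println("IDT: " + batch.size() + " astrometric packets"));
        pipeline.receive(new Packet(PacketType.ASTROMETRY, 42L));
        pipeline.receive(new Packet(PacketType.ASTROMETRY, 43L));
        pipeline.dispatch(); // prints "IDT: 2 astrometric packets"
    }
}
```

Batching by type before dispatch is what lets the algorithms stream through the data sequentially instead of issuing random reads.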
Products from the daily pipeline Raw measurements reconstruction Flagging of peculiar objects and conditions Monitoring of basic angle between telescopes Interferometry: accuracy of 3.6 picometres
Products from the daily pipeline First determination of satellite attitude 100 milliarcsec accuracy (0.00003 degrees) Determination of astrophysical background Needed for an accurate photometry
Products from the daily pipeline Image parameters determination Main output for downstream systems Position and brightness Preliminary cross-matching Identification of observations and link to catalogue sources Most demanding element in terms of DB performance Continuous queries on a table with ~3 billion records
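The cross-match step above can be illustrated with a toy version of the matching criterion: link an observation to the nearest catalogue source within a small search radius. The real pipeline does this with indexed queries on a ~3-billion-record Caché table; this in-memory sketch, with illustrative names and a hypothetical 1-arcsecond tolerance, only shows the geometry:

```java
import java.util.*;

// Toy sketch of preliminary cross-matching (illustrative names and tolerance).
public class CrossMatchSketch {
    record Source(long id, double raDeg, double decDeg) {}

    // Small-angle separation in arcseconds (valid for tiny offsets).
    static double sepArcsec(double ra1, double dec1, double ra2, double dec2) {
        double dRa = (ra1 - ra2) * Math.cos(Math.toRadians((dec1 + dec2) / 2));
        double dDec = dec1 - dec2;
        return Math.hypot(dRa, dDec) * 3600.0;
    }

    // Nearest catalogue source within the search radius, if any.
    static Optional<Source> match(List<Source> catalogue,
                                  double raDeg, double decDeg, double radiusArcsec) {
        return catalogue.stream()
            .filter(s -> sepArcsec(s.raDeg(), s.decDeg(), raDeg, decDeg) <= radiusArcsec)
            .min(Comparator.comparingDouble(s -> sepArcsec(s.raDeg(), s.decDeg(), raDeg, decDeg)));
    }

    public static void main(String[] args) {
        List<Source> catalogue = List.of(
            new Source(1L, 10.00000, -5.00000),
            new Source(2L, 10.00030, -5.00010));
        // Observation ~0.08" away from source 2: matched within a 1" radius.
        long id = match(catalogue, 10.00031, -5.00012, 1.0).map(Source::id).orElse(-1L);
        System.out.println(id); // prints 2
    }
}
```

In production this lookup runs continuously against the catalogue table, which is why it is the most DB-intensive element of the pipeline.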
But is this technically feasible? YES! Pipeline typically idle a few hours/day even on dense days Peaks of ~150 million measurements/day handled without problems ~10 tables with typically ~1 to 2 billion records each Pipeline is stable
How? 30 IBM iDataPlex computing nodes CPU typically underused; RAM quite busy Very big DB node: 32 Xeon cores (E5-4640 @ 2.4 GHz), avg. ~50% load on dense days 1.25 TB RAM 7.5 TB of local SSD for the most frequently accessed tables NetApp NAS storage: FAS 8060 with ~600 disks and 10 TB SSD 10G network (Cisco 5548, <1 ms latency) Typical DB disk occupation: ~20 TB (regular cleanup: only the most recent data kept) InterSystems Caché ;) Over 10 billion records served daily (incl. complex queries) Also CESCA/CSUC (pre-deployment tests): 5 computing nodes + DB node (just 64 GB RAM)
2 years of nominal operations 50 billion measurements processed so far
Main achievement (so far): Gaia DR1
Just a tiny (daily) detail...
What is all this useful for? Science and Knowledge Better understanding of the Galaxy we live in Astronomers are so happy with our data! Technological advances Numerical algorithms and techniques Massive data handling systems New data handling and compression algorithms A clear example: the FAPEC compressor, so it gives a direct return to you! Efficient data handling and compression of medical images, genomic information, professional imaging, massive data transfers, etc.
More info: http://gaia.ub.edu @GaiaUB GaiaApp Unveiling the mysteries of our Galaxy with InterSystems Caché Dr. Jordi Portell i de Mora jordi.portell@dapcom.es on behalf of the Gaia team at the Institute for Space Studies of Catalonia (IEEC-UB) and the Gaia Data Processing and Analysis Consortium throughout Europe InterSystems Spain Summit 2016 - Barcelona, 26 October 2016