Data Intensive Computing meets High Performance Computing
Kathy Yelick
Associate Laboratory Director for Computing Sciences, Lawrence Berkeley National Laboratory
Professor of Electrical Engineering and Computer Sciences, UC Berkeley
National Energy Research Scientific Computing Facility
- Department of Energy Office of Science (unclassified) facility
- 4000 users, 500 projects
- From 48 states; 65% from universities
- 1400 refereed publications per year
- Systems designed for science
- 1.3 PF Hopper system (Cray XE6): 4th fastest computer in the US, 8th in the world
- 0.5 PF in Franklin (Cray XT4), Carver (IBM iDataPlex), and other clusters
Computing Sciences
Science is Increasingly Data Intensive
- Existing ability to generate science data is already challenging our ability to store, analyze, and archive it.
- Some observational devices grow in capability with Moore's Law.
- Data sets are growing exponentially.
- Petabyte (10^15 byte) data sets are common:
  - Climate: next IPCC estimates 10s of PBs
  - Genome: JGI alone will have about 1 PB this year and double each year
  - Particle physics: LHC projects 16 PB/yr
  - Astrophysics: LSST, others, estimate 5 PB/yr
- Petascale HPC simulations on today's systems lead to petascale datasets
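The doubling estimate quoted for JGI above can be turned into a small projection. The sketch below is illustrative only: the function names are hypothetical, the 1 PB starting point and the factor of 2 per year come from the slide, and the assumption that the doubling continues unchanged for several years is ours, not the talk's.

```python
def projected_size_pb(initial_pb: float, years: int, growth_factor: float = 2.0) -> float:
    """Data volume in PB after `years` of compounding annual growth."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return initial_pb * growth_factor ** years


def years_to_reach(initial_pb: float, target_pb: float, growth_factor: float = 2.0) -> int:
    """Smallest whole number of years until the volume is at least `target_pb`."""
    if initial_pb <= 0 or growth_factor <= 1:
        raise ValueError("need a positive start and a growth factor above 1")
    years = 0
    size = initial_pb
    while size < target_pb:
        size *= growth_factor
        years += 1
    return years


if __name__ == "__main__":
    # Start at 1 PB and double every year (figures taken from the slide).
    for year in range(6):
        print(f"year +{year}: {projected_size_pb(1.0, year):.0f} PB")
    # Years until 1 PB grows to 1 EB (1000 PB) under constant doubling.
    print(years_to_reach(1.0, 1000.0))
```

Under constant doubling, 1 PB passes 1 EB after 10 years, which is why exponential growth dominates planning for storage and analysis.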
Scientific Data is Growing Exponentially
- Scientific data is stored in tape archives
- Repacking is essential to keep up with data and technology growth
- The ability to store, transfer, and analyze data is a limitation; for example, DOE runs its own network
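To see why transfer is listed as a limitation, it helps to estimate how long a petabyte takes to move. This is a back-of-the-envelope sketch: the 10 Gb/s link rate and the `efficiency` parameter are assumptions chosen for illustration and do not appear in the slides.

```python
PETABYTE = 10**15          # bytes, matching the 10^15 definition used in the talk
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_s: float, efficiency: float = 1.0) -> float:
    """Days needed to move `size_bytes` over a link of the given raw bit rate.

    `efficiency` is the assumed fraction of the raw rate actually achieved.
    """
    if link_bits_per_s <= 0:
        raise ValueError("link rate must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_s * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed 10 Gb/s link running at full rate: roughly 9.26 days per PB.
    print(f"{transfer_days(PETABYTE, 10e9):.2f} days")
    # Same link at an assumed 50% sustained efficiency: roughly 18.52 days.
    print(f"{transfer_days(PETABYTE, 10e9, efficiency=0.5):.2f} days")
```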
Growth in Data Outstrips Growth in Computing
- Goal: grow storage, transfer, and analysis capability for DOE facilities
- Data generation exceeds storage and processing rates
- Needs energy-efficient computing, memory, and I/O
[Figure: "Projected Rates". Increase over 2010 (y-axis, 0 to 60) for sequencers, detectors, processors, and memory, 2010 to 2015.]
Science in the Data
- 2006 Nobel Prize on anisotropy of the Cosmic Microwave Background
  - Shows an image of the universe at 400,000 years
- 2011 Nobel Prize on accelerating expansion of the universe
  - Measured by supernovae as standard candles
  - Simulations combined with observational data
- 2011 discovery of the youngest nearby supernova: first-of-a-kind images
  - Used machine learning to eliminate 90% of the manual image search
  - Discovered 8,700 new astrophysical transients, including supernovae, novae, active galaxies, and quasars, and three new classes of objects
[Images: Cosmic Microwave Background; youngest nearby supernova discovered; expansion of the universe]
Emerging Challenges
- Data Provenance
  - Lab notebook for digital data
  - Captures critical parameters, analysis chain, annotations, etc.
  - Provides reproducibility and verifiability
- Life-Cycle Management
  - Observational data gains value
- Data Curation
  - What piece of data will be important in 2060?
[Figure: value versus time for observational data and model data]
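One way to picture the "lab notebook for digital data" idea is a small record that travels with a dataset. The sketch below is a minimal illustration, not a description of any NERSC or DOE system: the `ProvenanceRecord` class, its field names, and the `fingerprint` helper are all hypothetical, chosen to mirror the items named on the slide (critical parameters, analysis chain, annotations).

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """SHA-256 digest of the input, so a later reader can verify the same bytes were used."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ProvenanceRecord:
    """Hypothetical notebook entry attached to one dataset."""
    input_sha256: str
    parameters: dict
    analysis_chain: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

    def add_step(self, name: str, **settings) -> None:
        """Append one processing step, with its settings and a UTC timestamp."""
        self.analysis_chain.append({
            "step": name,
            "settings": settings,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    def annotate(self, note: str) -> None:
        """Attach a free-text annotation."""
        self.annotations.append(note)

    def to_json(self) -> str:
        """Serialize the whole record so it can be stored next to the data."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


if __name__ == "__main__":
    raw = b"example detector readout"  # stand-in for real input data
    record = ProvenanceRecord(fingerprint(raw), {"threshold": 0.5})
    record.add_step("calibrate", gain=2)
    record.add_step("filter", kind="median", width=3)
    record.annotate("Run repeated after detector realignment.")
    print(record.to_json())
```

Storing the input hash alongside the ordered steps is what makes the result checkable later: anyone holding the same bytes and the same recorded settings can attempt to reproduce the analysis.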
Develop and Provide Science Gateway Infrastructure
- 30+ projects use the NERSC Global Filesystem (NGF) -> web gateway
- Gauge Connection: QCD data in HPSS
- PyDap: interactive subselection of 20th Century Reanalysis climate data
- Deep Sky: web interface to analyze astronomical data; steer observations
- Workflows: Carbon Capture (CCSI), Environment (ASCEM), Neutrinos (DayaBay), Soil, Water
Computational Research
- Software for science
- Distributed systems and cloud computing
- Data management algorithms and visualization
Facilities Require Exascale Computing
- Astronomy, Particle Physics, Chemistry and Materials, Genomics, Fusion
- Petascale to Exascale
- Petabyte data sets today, many growing exponentially
- Processing grows super-linearly
- Need to move entire DOE workload to Exascale
The Scientific Exploration Process
- Simulation site: exascale simulation machine + analysis, parallel storage, archive
  - Perform some data analysis on the exascale machine (e.g., in situ pattern identification)
- Experiment/observation site: experiment/observation, processing machine, (parallel) storage, archive
- Analysis sites: analysis machines with shared storage
  - Reduce and prepare data for further exploratory analysis (data mining)
- Need to reduce EBs and PBs of data, and move only TBs to simulation sites
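The reduce-before-moving idea above can be sketched as a streaming reduction: each chunk of output is summarized as it is produced, and only the small summary leaves the machine. This is a toy illustration under stated assumptions; `RunningSummary`, `reduce_in_situ`, and `simulated_chunks` are hypothetical names, and a real in situ analysis would compute something far richer than count, min, max, and mean.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class RunningSummary:
    """Constant-size summary that is updated chunk by chunk."""
    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, chunk: Iterable[float]) -> None:
        """Fold one chunk into the summary without retaining the chunk."""
        for value in chunk:
            self.count += 1
            self.total += value
            self.minimum = min(self.minimum, value)
            self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        if self.count == 0:
            raise ValueError("no data has been summarized yet")
        return self.total / self.count


def reduce_in_situ(chunks: Iterable[Iterable[float]]) -> RunningSummary:
    """Consume chunks one at a time; only the summary needs to be moved or stored."""
    summary = RunningSummary()
    for chunk in chunks:
        summary.update(chunk)
    return summary


def simulated_chunks(n_chunks: int, chunk_size: int) -> Iterator[List[float]]:
    """Hypothetical stand-in for simulation output, generated lazily."""
    for c in range(n_chunks):
        yield [float(c * chunk_size + i) for i in range(chunk_size)]


if __name__ == "__main__":
    summary = reduce_in_situ(simulated_chunks(n_chunks=4, chunk_size=3))
    print(summary.count, summary.minimum, summary.maximum, summary.mean)
```

Because the chunks come from a generator, the full data set is never held in memory at once; the summary stays the same size no matter how many chunks are processed.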
Summary
- Computing for simulation has been considered the 3rd pillar of science
- Data analysis is quickly becoming a 4th
- Challenges include:
  - Size of data
  - Technology: storage, networking, computing
  - Mathematics: discovering information in massive data
  - Data management, provenance, curation