COMPUTATIONAL STATISTICS IN ASTRONOMY: NOW AND SOON

Similar documents
BAYESIAN CROSS-IDENTIFICATION IN ASTRONOMY

PRACTICAL ANALYTICS 7/19/2012. Tamás Budavári / The Johns Hopkins University

MULTIPLE EXPOSURES IN LARGE SURVEYS

Data-Intensive Statistical Challenges in Astrophysics

Data-Intensive Statistical Challenges in Astrophysics

Real Astronomy from Virtual Observatories

Astroinformatics in the data-driven Astronomy

HUBBLE SOURCE CATALOG. Steve Lubow Space Telescope Science Institute June 6, 2012

Astronomy of the Next Decade: From Photons to Petabytes. R. Chris Smith AURA Observatory in Chile CTIO/Gemini/SOAR/LSST

Large Synoptic Survey Telescope

Virtual Observatory: Observational and Theoretical data

RLW paper titles:

An intelligent client application for on-line astronomical information

Data Management Plan Extended Baryon Oscillation Spectroscopic Survey

Some Statistical Aspects of Photometric Redshift Estimation

Introduction to the Sloan Survey

Scientific Data Flood. Large Science Project. Pipeline

The SDSS Data. Processing the Data

Astroinformatics: massive data research in Astronomy Kirk Borne Dept of Computational & Data Sciences George Mason University

The Millennium Simulation: cosmic evolution in a supercomputer. Simon White Max Planck Institute for Astrophysics

HIERARCHICAL MODELING OF THE RESOLVE SURVEY

Randomized Selection on the GPU. Laura Monroe, Joanne Wendelberger, Sarah Michalak Los Alamos National Laboratory

Data Intensive Computing meets High Performance Computing

Exploiting Sparse Non-Linear Structure in Astronomical Data

Rick Ebert & Joseph Mazzarella For the NED Team. Big Data Task Force NASA, Ames Research Center 2016 September 28-30

Doing astronomy with SDSS from your armchair

Applications of Randomized Methods for Decomposing and Simulating from Large Covariance Matrices

Surprise Detection in Science Data Streams Kirk Borne Dept of Computational & Data Sciences George Mason University

Learning an astronomical catalog of the visible universe through scalable Bayesian inference

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann

The Universe in the Cloud. Darren Croton Centre for Astrophysics and Supercomputing Swinburne University

Source Detection in the Image Domain

Current Status of Chinese Virtual Observatory

Robust and accurate inference via a mixture of Gaussian and terrors

Classifying Galaxy Morphology using Machine Learning

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

The SDSS is Two Surveys

ECE521 week 3: 23/26 January 2017

Cosmological N-Body Simulations and Galaxy Surveys

Virtual Sensors and Large-Scale Gaussian Processes

An end-to-end simulation framework for the Large Synoptic Survey Telescope Andrew Connolly University of Washington

A Unified Theory of Photometric Redshifts. Tamás Budavári The Johns Hopkins University

Earth s Atmosphere & Telescopes. Atmospheric Effects

Precise measurement of the radial BAO scale in galaxy redshift surveys

Neutron inverse kinetics via Gaussian Processes

The Large Synoptic Survey Telescope

Dark Energy and Massive Neutrino Universe Covariances

(KAVLI IPMU) Cosmology with Subaru HSC survey. A large number of cosmological N-body simulations. Parameter space exploration and the Gaussian process

Algorithms for Higher Order Spatial Statistics

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Machine Learning Techniques for Computer Vision

(Slides for Tue start here.)

Statistics of the Universe: Exa-calculations and Cosmology's Data Deluge

Taking the census of the Milky Way Galaxy. Gerry Gilmore Professor of Experimental Philosophy Institute of Astronomy Cambridge

Data Exploration and Unsupervised Learning with Clustering

MACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

From DES to LSST. Transient Processing Goes from Hours to Seconds. Eric Morganson, NCSA LSST Time Domain Meeting Tucson, AZ May 22, 2017

Probabilistic photometric redshifts in the era of Petascale Astronomy

Machine Learning Applications in Astronomy

Michael L. Norman Chief Scientific Officer, SDSC Distinguished Professor of Physics, UCSD

Scalable and Power-Efficient Data Mining Kernels

Introduction to simulation databases with ADQL and Topcat

the open-source sky survey David W. Hogg (NYU)

Science Alerts from GAIA. Simon Hodgkin Institute of Astronomy, Cambridge

Expectation-Maximization

CHEMICAL ABUNDANCE ANALYSIS OF RC CANDIDATE STAR HD (46 LMi) : PRELIMINARY RESULTS

Bayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington

Telescopes. Optical Telescope Design. Reflecting Telescope

Angular power spectra and correlation functions Notes: Martin White

Image Processing in Astronomy: Current Practice & Challenges Going Forward

Why Use a Telescope?

Detecting Dark Matter Halos using Principal Component Analysis

New Astronomy With a Virtual Observatory

SPATIAL DATA MINING. Ms. S. Malathi, Lecturer in Computer Applications, KGiSL - IIM

BigBOSS Data Reduction Software

Jacobi-Based Eigenvalue Solver on GPU. Lung-Sheng Chien, NVIDIA

Big Data Inference. Combining Hierarchical Bayes and Machine Learning to Improve Photometric Redshifts. Josh Speagle 1,

Expectation Maximization

Detecting the Unexpected

Patent Searching using Bayesian Statistics

Studying galaxies with the Sloan Digital Sky Survey

LISA: Probing the Universe with Gravitational Waves. Tom Prince Caltech/JPL. Laser Interferometer Space Antenna LISA

Hidden Markov Models. Vibhav Gogate The University of Texas at Dallas

The expected Gaia revolution in asteroid science: Photometry and Spectroscopy

Celeste: Scalable variational inference for a generative model of astronomical images

Lecture 12: Distances to stars. Astronomy 111

Fast Hierarchical Clustering from the Baire Distance

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

CS 231A Section 1: Linear Algebra & Probability Review

Baryon Acoustic Oscillations (BAO) in the Sloan Digital Sky Survey Data Release 7 Galaxy Sample

ANTARES: The Arizona-NOAO Temporal Analysis and Response to Events System

Singular Value Decompsition

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang

Gaussian Process Modeling in Cosmology: The Coyote Universe. Katrin Heitmann, ISR-1, LANL

Hypothesis Testing, Bayes Theorem, & Parameter Estimation

Telescopes... Light Buckets

ALMA Development Program

A SPEctra Clustering Tool for the exploration of large spectroscopic surveys. Philipp Schalldach (HU Berlin & TLS Tautenburg, Germany)

Francisco Prada Campus de Excelencia Internacional UAM+CSIC Instituto de Física Teórica (UAM/CSIC) Instituto de Astrofísica de Andalucía (CSIC)

Transcription:

COMPUTATIONAL STATISTICS IN ASTRONOMY: NOW AND SOON / The Johns Hopkins University

Recording Observations Astronomers drew it Now kids do it on the SkyServer #1 by Haley

Multicolor Universe

Eventful Universe

Trends in Astronomy Exponential growth of data Moore s law in detectors 1000 100 10 1 0.1 1970 1975 1980 1985 2000 1995 1990 CCDs Glass

Sloan Digital Sky Survey Cosmic Genome Project 2001-2010 500M rows, 400+ cols 18TB 30TB of images from 30 CCDs Software revolution in astro Astronomers learn SQL Cannot look at the data anymore

The New Generations Large Synoptic Survey Telescope [OPTICAL] 3 trillion rows, 200+ attributes, 100+ tables ~ 30PB 60PB of images, 3.2 Gpix cam Square Kilometer Array [RADIO] Processing limited

Data-Intensive Science The Fourth Paradigm Jim Gray Phenomenology Calculus Simulations escience

Keeping Up? Image processing Catalog extraction O(n) What is difficult? O(n log n) O(n 2 ),

More is Different Lots of opportunities Lots of challenges Lots of problems New approaches New algorithms New tools New computers

Do More at the Data Statistics on remote resources Fine-tune SQL Server for astronomy Analyses in and driven by SQL SDSS catalog archive and more GALEX, HLA, UKIDSS, PanSTARRS,

Storing Simulations Millennium Run (MPA) 10 billion particles, 64 snapshots FoF groups and merger trees Millennium XXL 300 billion particles MultiDark Bolshoi Turbulence simulations (JHU) 1024 4 grid, 27TB

Storing Simulations Millennium Run (MPA) 10 billion particles, 64 snapshots FoF groups and merger trees Millennium XXL 300 billion particles MultiDark Bolshoi Turbulence simulations (JHU) 1024 4 grid, 27TB

Observing Simulations Comparison to real observations Lots of spatial searches In the database! Lemson, TB & Szalay (2011)

Fast Searches Map space-filling curves to database indices Hierarchical Triangular Mesh on the unit sphere Peano-Hilbert / Morton curves in 3D Combine with query shapes Build from primitives

Fast Searches Map space-filling curves to database indices Hierarchical Triangular Mesh on the unit sphere Peano-Hilbert / Morton curves in 3D Combine with query shapes Build from primitives

Millennium XXL

Understanding Subtleties Interactive visualization of 27TB of turbulence sim Kai Bürger (TUM, JHU)

SkyQuery Dynamical federation of archives

Cross-Identification One of the most fundamental analysis steps

What is the Right Question? Cross-identification is a hard problem Computationally, Scientifically & Statistically Need symmetric n-way solution Need reliable quality measure Same or not? Distance threshold? Maximum likelihood?

Modeling the Astrometry Astrometric precision A simple function Where on the sky? Anywhere really

NOT SAME OR Same or Not? The Bayes factor H: all observations of the same object at m K: might be from separate objects at {m i }

NOT SAME OR Same or Not? The Bayes factor H: all observations of the same object at m On the sky K: might be from separate objects at {m i } Astrometry

NOT SAME OR Same or Not? The Bayes factor H: all observations of the same object at m On the sky K: might be from separate objects at {m i } Astrometry

Normal Distribution Astrometric precision: Fisher distribution: Analytic results: For high accuracies:

Analytic Results Normal distribution Flat and spherical Gauss and Fisher 2-way results

From Priors to Posteriors Posterior probability from prior & Bayes factor Prior probability of a match Like dice in a bag: 1/N and N 1 n In general?

From Priors to Posteriors Different selections Nearby / Distant Red / Blue But only 1 number

Self-Consistent Estimates Prior has an unknown fudge-factor Educated guess Or solve for it: TB & Szalay (2008)

Works in General Simulations Heinis, TB, Szalay (2009) Demo performance Proper motion Kerekes, TB+ (2010) Unknown velocities Matching events TB (2011) E.g., supernovae in time

SkyQuery the new generation! Dynamic federation of astronomy databases Query the collection as if they were one The 3 rd generation tool coming this fall Cluster of machines running partitioned jobs Proper probabilistic exec with variable errors

SkyQuery Almost pure standard SQL

SkyQuery Almost pure standard SQL

SkyQuery Almost pure standard SQL Added XMATCH Verifiable Flexible

Only the first steps Different geometries Resolved sources have shapes: galaxies, etc. Varies as a function of wavelength

Science is Interactive

Science is Interactive Too much to be accurate By the time you do the calculations, the answer may have changed

Science is Interactive Too much to be accurate By the time you do the calculations, the answer may have changed Randomized algorithms Improving answers over time Rethink the basic methods

Principal Component Analysis Principal directions Directions of largest variations Eigenproblem of covariances Singular Value Decomposition Problems Needs lots of memory Only need largest ones Very sensitive to outliers

Streams of Data Mean

Streams of Data Mean Covariance Iterative evaluation!

Streaming PCA Initialization Eigensystem of a small, random subset Truncate at p largest eigenvalues Incremental updates Mean and the low-rank A matrix SVD of A yields new eigensystem Randomized algorithm!

Robust PCA PCA minimizes σ RMS of the residuals r = y Py Quadratic formula: r 2 extremely sensitive to outliers We optimize a robust M-scale σ 2 (Maronna 2005) Implicitly given by Fits in with the iterative method (TB+ 2009)

Galaxy Spectra T. B., V. Wild, et al. (2009)

Galaxy Spectra High SNR eigenfunctions Sign of robustness T. B., V. Wild, et al. (2009)

Galaxy Spectra High SNR eigenfunctions Sign of robustness It makes a difference Budavari, Wild, et al. (2009)

Synthetic Streams 3D Gaussian rotated into 50D Stretches: 7, 6, 5 Total Var = 110 Plotting square roots of the top 5 eigenvalues Streaming Classic PCA

With Outliers Adding 0.1% outliers = 100 in each bin Outliers take over the PCs Instability, no convergence Streaming Classic PCA

Robust Algorithm Outliers under control Marked on top Initialized with SVD On a set of 100 vectors Streaming Robust PCA

Comparison Classic Robust

Sparsity Important wavelengths Cf., the Lick indices Yip, Mahoney+ (2012, in prep)

Time Domain (the new window) Online aggregation of survey data Learning the night sky from a series of images Correcting for atmospheric distortions Finding transient events Online federation of event streams Online outlier detection new discoveries!

Uncertainties No data points Measurements with errors

Inverse Problems Bayesian inference but computational bottlenecks High dimensional models w/ empirical priors Non-parameteric? Deconvolution with degeneracies Aided by physics Photometric inversion TB (2009)

Trends in Computing

New Limitation is Energy! Power to compute the same thing? CPU is 10 less efficient than a digital signal processor DSP is 10 less efficient than a custom chip New design: multicores with slower clocks But the interconnect is expensive Need simpler components

Emerging Architectures Andrew Chien: 10 10 to replace the 90/10 rule Custom modules on chip, cf. SoC in cellphones Statistics on a video codec module?

Emerging Architectures Andrew Chien: 10 10 to replace the 90/10 rule Custom modules on chip, cf. SoC in cellphones Statistics on a specialized units, e.g., codecs?

GPUs Evolved to be General Purpose Virtual world: simulation of real physics C for CUDA and OpenCL with lots of libraries 512 cores 25k threads running 1 billion/sec Old algorithms built on wrong assumption Today processing is free but memory is slow New programming paradigm!

Integrated with Databases Scientific computations on the GPU Driven remotely from SQL User-Defined Functions IPC

Baryon Acoustic Oscillations 600 trillion galaxy pairs Correlation function: C for CUDA on GPUs BAO Tian, Neyrinck, TB & Szalay (2011)

Matching on GPUs Recent Github release Multi-GPU implementation Search in 5 great perf! NVIDIA GTX 480 1.5GB 29M 29M in 11 seconds C2050 Teslas 400M 150M in 3 minutes

Summary Astronomy has always been data-driven Data-intensive for decades (since SDSS) Need new approaches and algorithms But prototypes are not enough to scale Many promising directions to explore Perfect timing for the SAMSI Big Data Program!