Learning an astronomical catalog of the visible universe through scalable Bayesian inference

Learning an astronomical catalog of the visible universe through scalable Bayesian inference
Jeffrey Regier, with the DESI Collaboration
Department of Electrical Engineering and Computer Science, UC Berkeley
December 9, 2016

The DESI Collaboration: Jeff Regier, Ryan Giordano, Steve Howard, Jon McAuliffe, Michael Jordan, Andy Miller, Ryan Adams, Jarrett Revels, Andreas Jensen, Kiran Pamnany, Debbie Bard, Rollin Thomas, Dustin Lang, David Schlegel, Prabhat. 2 / 31

An astronomical image An image from the Sloan Digital Sky Survey covering roughly one quarter square degree of the sky. 3 / 31

Faint light sources Most light sources are near the detection limit. 4 / 31

Outline 1. our graphical model for astronomical images (Celeste) 2. scaling approximate posterior inference to catalog the visible universe 3. model extensions 5 / 31

The Celeste graphical model 6 / 31

Scientific color priors 7 / 31

Galaxies: light-density model
The light density for galaxy s is modeled as a mixture of two extremal galaxy prototypes:
    h_s(w) = θ_s h_{s1}(w) + (1 − θ_s) h_{s0}(w).
Each prototype (i = 0 or i = 1) is a mixture of bivariate normal distributions:
    h_{si}(w) = Σ_{j=1}^{J} η_{ij} φ(w; µ_s, ν_{ij} Q_s).
The shared covariance matrix Q_s accounts for the scale σ_s, rotation ϕ_s, and axis ratio ρ_s.
(Pictured: an elliptical galaxy, θ_s = 0, and a spiral galaxy, θ_s = 1.)
8 / 31
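
A minimal Julia sketch of this mixture (not the Celeste implementation): it evaluates h_s at a sky position with Distributions.jl, and every parameter value below is made up for illustration.

```julia
using Distributions, LinearAlgebra

# Evaluate the galaxy light density h_s(w) as a mixture of the two extremal
# prototypes; each prototype is itself a mixture of bivariate normals sharing
# the covariance Q_s, scaled per component by ν_ij.
function galaxy_light_density(w, θ, μ, Q, η, ν)
    total = 0.0
    for i in 1:2                        # prototype 1 ↔ h_s1 (weight θ), prototype 2 ↔ h_s0 (weight 1 − θ)
        weight = i == 1 ? θ : 1 - θ
        for j in eachindex(η[i])        # bivariate normal components
            total += weight * η[i][j] * pdf(MvNormal(μ, ν[i][j] * Q), w)
        end
    end
    return total
end

# Illustrative (made-up) parameters: position μ_s, shared covariance Q_s,
# component weights η and scales ν for each prototype.
μ = [0.0, 0.0]
Q = [1.0 0.3; 0.3 0.5]
η = ([0.6, 0.4], [0.7, 0.3])
ν = ([0.5, 2.0], [1.0, 4.0])
galaxy_light_density([0.2, -0.1], 0.8, μ, Q, η, ν)
```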

Idealized sky view
The brightness for sky position w is
    G_b(w) = Σ_{s=1}^{S} ℓ_{sb} g_s(w),
where
    g_s(w) = 1{µ_s = w},  if a_s = 0 ("star"),
    g_s(w) = h_s(w),      if a_s = 1 ("galaxy").
9 / 31
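
A small companion sketch of the idealized brightness G_b(w): it sums per-source contributions, with a single bivariate normal standing in for the full galaxy mixture h_s. The `Source` struct and all values are hypothetical.

```julia
using Distributions

# Idealized sky brightness at position w: sum each source's contribution, a
# point mass for stars and a light density h_s for galaxies (here a single
# bivariate normal stands in for the mixture from the previous slide).
struct Source
    a::Int                 # 0 = star, 1 = galaxy
    μ::Vector{Float64}     # sky position
    ℓ::Float64             # brightness in band b
end

g(s::Source, w) = s.a == 0 ? float(w == s.μ) : pdf(MvNormal(s.μ, [1.0 0.3; 0.3 0.5]), w)
sky_brightness(sources, w) = sum(s.ℓ * g(s, w) for s in sources)

sources = [Source(0, [0.0, 0.0], 3.0), Source(1, [1.0, 1.0], 5.0)]
sky_brightness(sources, [1.0, 1.0])
```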

Astronomical images
Images differ from the idealized sky view due to
1. pixelation and point spread:
    f_{nbm}(w) = Σ_{k=1}^{K} ᾱ_{nbk} φ(w_m; w + ξ_{nbk}, τ_{nbk}),
    G_{nbm} = G_b ∗ f_{nbm}
2. background radiation and calibration:
    F_{nbm} = ι_{nb} [ɛ_{nb} + G_{nbm}]
3. finite exposure duration:
    x_{nbm} | (a_s, r_s, c_s)_{s=1}^{S} ~ Poisson(F_{nbm})
10 / 31
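
A hedged sketch of this image-formation chain for one pixel: calibration times (background plus PSF-smeared brightness) gives the Poisson rate F_nbm, from which a photon count is drawn. The `psf_smeared_brightness` argument is only a placeholder for G_nbm, and all numbers are invented.

```julia
using Distributions

# Expected intensity and observed count for a single pixel m in band b of image n:
# F_nbm = ι_nb [ɛ_nb + G_nbm], then x_nbm ~ Poisson(F_nbm).
function expected_pixel_intensity(ι, ɛ, psf_smeared_brightness)
    return ι * (ɛ + psf_smeared_brightness)
end

ι = 120.0   # calibration constant (made up)
ɛ = 0.8     # sky background (made up)
G = 2.3     # PSF-smeared brightness of all sources at this pixel (made up)

F = expected_pixel_intensity(ι, ɛ, G)
x = rand(Poisson(F))   # finite-exposure photon count for this pixel
```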

Intractable posterior
Let Θ = (a_s, r_s, c_s)_{s=1}^{S}. The posterior on Θ is intractable because of coupling between the sources:
    p(Θ | x) = p(x | Θ) p(Θ) / p(x),
where
    p(x) = ∫ p(x | Θ) p(Θ) dΘ = ∫ [ Π_{n=1}^{N} Π_{b=1}^{B} Π_{m=1}^{M} p(x_{nbm} | Θ) ] p(Θ) dΘ.
11 / 31

Variational inference Variational inference approximates the exact posterior p with a simpler distribution q ∈ Q. 12 / 31

Variational inference for Celeste
An approximating distribution that factorizes across light sources (a structured mean-field assumption) makes most expectations tractable:
    q(Θ) = Π_{s=1}^{S} q(Θ_s).
The delta method for moments approximates the remaining expectations.
Existing catalogs provide good initial settings for the variational parameters.
Light sources are unlikely to contribute photons to distant pixels.
The model contains an auxiliary variable indicating the mixture component that generated each source's colors.
Newton's method converges in tens of iterations.
13 / 31
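
The "delta method for moments" mentioned above can be illustrated in a few lines: approximate E[f(X)] by f(m) + ½ f''(m) v using only the mean m and variance v of X, and compare against a Monte Carlo estimate. The nonlinearity f below is an arbitrary example, not one of Celeste's actual terms.

```julia
using Statistics

# Second-order delta method for moments: approximate E[f(X)] for X ~ N(m, v).
f(x)   = log(1 + exp(x))            # example nonlinearity
d2f(x) = exp(x) / (1 + exp(x))^2    # its second derivative

m, v = 0.5, 0.04
delta_approx = f(m) + 0.5 * d2f(m) * v

# Monte Carlo check of the same expectation.
xs = m .+ sqrt(v) .* randn(100_000)
mc_estimate = mean(f.(xs))

println((delta_approx, mc_estimate))   # the two values should be close
```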

Validation from Stripe 82
Average error; lower is better. Highlighted scores are more than 2 standard deviations better.

                 Photo      Celeste
position         0.37       0.24
missed gals      23 / 421   8 / 421
missed stars     10 / 421   38 / 421
color u-g        1.25       0.70
color g-r        0.37       0.21
color r-i        0.25       0.17
color i-z        0.31       0.15
brightness       0.20       0.37
profile          0.26       0.31
axis ratio       0.19       0.13
scale            1.64       1.76
angle            17.04      12.64

14 / 31

Scaling inference to the visible universe

The setting
Big data:
  Sloan Digital Sky Survey: 55 TB of images; hundreds of millions of stars and galaxies
  Large Synoptic Survey Telescope (2019): 15 TB of images nightly
Fancy hardware:
  Cori supercomputer, Phase 1: 1,630 nodes, each with 32 cores
  Cori supercomputer, Phase 2: 9,300 nodes, each with 272 hardware threads; 30 teraflops; 1.5 PB array of SSDs
16 / 31

Julia programming language
  high-level syntax
  as fast as C++ (when necessary)
  a single language for hotspots and the rest
  multi-threading (experimental), not just multi-processing
17 / 31

Fast serial optimization algorithm
  analytic expectations (with the delta method for moments for one remaining expectation)
  block coordinate ascent
  Newton steps rather than L-BFGS or a first-order method
  manually coded gradients and Hessians
  3x overhead from computing exact Hessians
18 / 31
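
A sketch of block coordinate ascent with Newton steps, under simplifying assumptions: `elbo_grad_hess` is a toy concave objective standing in for Celeste's hand-coded per-source gradients and Hessians.

```julia
using LinearAlgebra

# Each light source s owns a block ω_s of variational parameters; we maximize a
# per-source objective with an explicitly coded gradient and Hessian.
function elbo_grad_hess(ω_s)
    # Toy concave objective: -½ (ω_s - target)' A (ω_s - target)
    target = [1.0, -2.0]
    A = [2.0 0.5; 0.5 1.0]
    g = -A * (ω_s - target)            # gradient
    H = -A                             # Hessian (negative definite)
    return g, H
end

function newton_block_update!(ω, s)
    g, H = elbo_grad_hess(ω[s])
    ω[s] = ω[s] - H \ g                # Newton step for block s
end

ω = [randn(2) for _ in 1:3]            # three sources, two parameters each
for sweep in 1:5, s in 1:3             # a few coordinate-ascent sweeps
    newton_block_update!(ω, s)
end
ω                                      # each block converges to [1.0, -2.0]
```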

Parallelism among nodes A region of the sky, shown twice, divided into 25 overlapping boxes: A1,...,A9 and B1,...,B16. Each box corresponds to a task: to optimize all the light sources within its boundaries. 19 / 31

Parallelism among threads
Light sources that do not overlap may be updated concurrently. [1]
[1] Pan, Xinghao, et al. CYCLADES: Conflict-free Asynchronous Machine Learning. NIPS 2016.
20 / 31
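
A sketch of the conflict-free threading idea: greedily group light sources into batches whose members do not overlap, then update each batch's sources in parallel with Julia threads. The `overlaps` test and the parameter update are placeholders, not Celeste's.

```julia
# Sources whose pixels do not overlap may be optimized concurrently.
overlaps(a, b) = abs(a.x - b.x) < 2 && abs(a.y - b.y) < 2

# Greedily assign each source to the first batch with no overlapping member.
function conflict_free_batches(sources)
    batches = Vector{Vector{Int}}()
    for s in eachindex(sources)
        placed = false
        for batch in batches
            if all(t -> !overlaps(sources[s], sources[t]), batch)
                push!(batch, s); placed = true; break
            end
        end
        placed || push!(batches, [s])
    end
    return batches
end

function fit_all!(sources, params)
    for batch in conflict_free_batches(sources)
        Threads.@threads for k in 1:length(batch)   # no two sources here overlap
            s = batch[k]
            params[s] = params[s] .+ 0.1             # placeholder for the real update
        end
    end
end

sources = [(x = rand(0:20), y = rand(0:20)) for _ in 1:50]
params  = [zeros(2) for _ in 1:50]
fit_all!(sources, params)
```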

Weak and strong scaling (Celeste light sources processed per second). We observe perfect scaling up to 64 nodes; beyond that, we are limited by interconnect bandwidth. 21 / 31

Hero run: catalog of the Sloan Digital Sky Survey
512 nodes × 32 cores/node = 16,384 cores
16,384 cores × 16 hours ≈ 250,000 core hours
input: 55 TB of astronomical images
output: catalog of 250 million light sources
22 / 31

A deep generative model for galaxies

A deep generative model for galaxies
Generative model:
    z_n ~ N(0, I)
    x_n | z_n ~ N(f_µ(z_n), f_σ(z_n))
Example: z_n = [0.1, 0.5, 0.2, 0.1]; f_µ(z_n), f_σ(z_n), and the sampled x_n are rendered as images (not reproduced here).
24 / 31
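
A minimal sketch of this generative process, with tiny hand-rolled networks (random weights) standing in for the trained f_µ and f_σ; the 16×16 image size and all dimensions are made up.

```julia
# A latent code z_n drawn from a standard normal is pushed through two small
# networks to give per-pixel means and standard deviations, then a galaxy image
# x_n is sampled pixel-wise. Weights below are random placeholders.
struct MLP
    W1::Matrix{Float64}; b1::Vector{Float64}
    W2::Matrix{Float64}; b2::Vector{Float64}
end
MLP(din, dhid, dout) = MLP(0.1 * randn(dhid, din), zeros(dhid),
                           0.1 * randn(dout, dhid), zeros(dout))
(m::MLP)(z) = m.W2 * tanh.(m.W1 * z .+ m.b1) .+ m.b2

latent_dim, img_pixels = 4, 16 * 16
f_mu    = MLP(latent_dim, 64, img_pixels)
f_sigma = MLP(latent_dim, 64, img_pixels)

z = randn(latent_dim)                   # z_n ~ N(0, I)
μ = f_mu(z)
σ = exp.(f_sigma(z))                    # keep standard deviations positive
x = μ .+ σ .* randn(img_pixels)         # x_n | z_n ~ N(f_µ(z_n), f_σ(z_n))
image = reshape(x, 16, 16)              # a 16×16 sampled galaxy image
```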

Autoencoder architecture 25 / 31

Sample fits. Each row corresponds to a different example from a test set. The left column shows the input x. The center column shows the output f_µ(z) for a z sampled from N(g_µ(x), g_σ(x)). The right column shows the output f_σ(z) for the same z. 26 / 31

Second-order stochastic variational inference

Second-order Stochastic Variational Inference
Require: ω is the initial vector of variational parameters; δ ∈ (δ_min, δ_max) is the initial trust-region radius; γ > 1 is the trust-region expansion factor; and η_1 ∈ (0, 1) and η_2 > 0 are constants.
1: for i ← 1 to M do
2:     Sample e_1, ..., e_N iid from the base distribution ɛ.
3:     g ← ∇_ν L̂(ν; e_1, ..., e_N) |_{ν=ω}
4:     H ← ∇²_ν L̂(ν; e_1, ..., e_N) |_{ν=ω}
5:     ω* ← argmax_ν { gᵀν + νᵀHν : ‖ν‖ ≤ δ }        ▷ non-convex quadratic optimization
6:     β ← gᵀω* + ω*ᵀHω*                              ▷ the expected improvement
7:     Sample e_1, ..., e_N iid from the base distribution ɛ.
8:     α ← L̂(ω*; e_1, ..., e_N) − L̂(ω; e_1, ..., e_N)  ▷ the observed improvement
9:     if α/β > η_1 and ‖g‖ ≥ η_2 δ then
10:        ω ← ω*
11:        δ ← min(γδ, δ_max)
12:    else
13:        δ ← δ/γ
14:    if δ < δ_min or i = M then return ω

68X fewer iterations than ADVI/SGD.
28 / 31
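
A toy Julia sketch of one accept/reject iteration in the spirit of the scheme above, under stated simplifications: the stochastic ELBO, its derivatives, and the Cauchy-point step used in place of an exact solver for the non-convex quadratic subproblem are all illustrative stand-ins, not the method's actual implementation.

```julia
using LinearAlgebra

# Toy stochastic objective with analytic derivatives (stand-ins for the real estimates).
elbo_hat(ω, e) = -0.5 * sum(abs2, ω .- 1.0) + 0.01 * sum(e) / length(e)
grad(ω) = 1.0 .- ω
hess(ω) = -Matrix(1.0I, length(ω), length(ω))

# Approximate the quadratic subproblem with a Cauchy-point step along g,
# truncated to the trust region of radius δ.
function cauchy_step(g, H, δ)
    curv  = dot(g, H * g)                          # negative for a concave model
    t_unc = curv < 0 ? -dot(g, g) / (2curv) : Inf
    return min(t_unc, δ / norm(g)) .* g
end

function trust_region_step(ω, δ; γ = 2.0, η1 = 0.1, η2 = 1e-4,
                           δmin = 1e-6, δmax = 10.0, N = 32)
    g, H = grad(ω), hess(ω)                        # stochastic in the real method
    s = cauchy_step(g, H, δ)
    β = dot(g, s) + dot(s, H * s)                  # predicted improvement
    e = randn(N)                                   # fresh noise for the check
    α = elbo_hat(ω .+ s, e) - elbo_hat(ω, e)       # observed improvement
    if α / β > η1 && norm(g) >= η2 * δ
        return ω .+ s, min(γ * δ, δmax)            # accept step, expand radius
    else
        return ω, max(δ / γ, δmin)                 # reject step, shrink radius
    end
end

function demo()
    ω, δ = zeros(3), 0.5
    for _ in 1:25
        ω, δ = trust_region_step(ω, δ)
    end
    return ω                                       # approaches the toy optimum [1, 1, 1]
end
demo()
```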

Case study: empirical convergence rates 29 / 31

Open questions
How do we more accurately approximate the posterior distribution?
  normalizing flows without an encoder network
  hybrid VI/MCMC
  linear response variational Bayes (LRVB)
How do we model a spatially varying point spread function (PSF)?
  Variational autoencoders fit independent PSFs well, but it isn't easy to account for dependence among the PSFs of nearby astronomical objects.
How can we easily account for the details we don't yet model?
  cosmic rays, airplanes, and satellites
  camera saturation (censoring) and imaging artifacts
  terrestrial vs. extraterrestrial background radiation
  transient events: supernovae, exoplanets, and near-Earth asteroids
  additional imaging datasets, spectrographic datasets, calibration datasets
30 / 31

Thank you! 31 / 31