MULTIPLE EXPOSURES IN LARGE SURVEYS

MULTIPLE EXPOSURES IN LARGE SURVEYS / Johns Hopkins University

Big Data? Noisy Skewed Artifacts

Serious Issues: A significant fraction of catalogs is junk: GALEX 50%, PS1 3PI 50-80%, PS1 MDS >95%. Textbook methods often fail due to artifacts. What are the good techniques?

Time Domain

Time Series of Faint Sources? Co-add images and do forced photometry: ideal if we have all observations, but we never do. Independent catalogs as we go: need to dig in the noise to build good time series. Goal is an incremental strategy to weed out noise; otherwise catalogs are overwhelmed by junk.

Detection Probability: Measured flux is true flux plus normal error. Probability of detection as a function of the true flux. Thresholds at 2σ, 3σ, 4σ and 5σ. Sharper for 9-way stacks.
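The detection curve described on this slide can be sketched as follows. This is a minimal illustration, assuming a measured flux equal to the true flux plus zero-mean Gaussian noise and a detection declared when the measured flux exceeds a fixed threshold. The function name, the parameters, and the treatment of a stack as noise reduced by the square root of the number of exposures are assumptions for illustration, not taken from the talk.

```python
import math

def detection_probability(true_flux, threshold, sigma=1.0, n_stack=1):
    """P(measured flux > threshold) for measured = true + Gaussian noise.

    Assumption: stacking n_stack exposures averages the noise, so the
    effective noise is sigma / sqrt(n_stack).
    """
    sigma_eff = sigma / math.sqrt(n_stack)
    z = (true_flux - threshold) / sigma_eff
    # Standard normal CDF evaluated at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A source sitting exactly at the threshold is detected half of the time;
# the transition around the threshold is steeper for a 9-way stack.
for flux in (4.0, 5.0, 6.0):
    single = detection_probability(flux, threshold=5.0)
    stacked = detection_probability(flux, threshold=5.0, n_stack=9)
    print(f"flux={flux:.1f}  single={single:.4f}  9-way stack={stacked:.4f}")
```

With fluxes and thresholds both in units of the single-exposure noise, the 9-way curve is three times steeper, which is one way to read "sharper for 9-way stacks".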

Detection Probability Multiple exposures Binomial
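For multiple exposures the number of detections follows a binomial distribution. A short sketch, assuming the exposures are independent and share one per-exposure detection probability (both are simplifying assumptions; the function names are illustrative):

```python
from math import comb

def prob_n_detections(n, k, p_detect):
    """Binomial probability of exactly n detections in k exposures."""
    return comb(k, n) * p_detect**n * (1.0 - p_detect)**(k - n)

def prob_at_least(n_min, k, p_detect):
    """Probability of n_min or more detections in k exposures."""
    return sum(prob_n_detections(n, k, p_detect) for n in range(n_min, k + 1))

# Example: a source detected half of the time per exposure, 9 exposures.
print(prob_at_least(3, 9, 0.5))
```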

What is a Real Source? Is it real or just noise? Bayesian hypothesis testing: out of k observations, n detections with fluxes f_i.
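A simplified sketch of such a test, using only the detection counts (n out of k) and ignoring the measured fluxes. Everything specific here is an assumption made for illustration: the power-law flux prior with a Euclidean slope of 2.5, the flux limits, the Gaussian false-alarm rate for the noise hypothesis, and all function names. The talk's own treatment uses the measured fluxes and a noise-peak distribution, which this sketch does not reproduce.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binom_pmf(n, k, p):
    return math.comb(k, n) * p**n * (1.0 - p)**(k - n)

def bayes_factor_counts(n, k, threshold=3.0, sigma=1.0, slope=2.5,
                        f_min=0.1, f_max=100.0, steps=2000):
    """Bayes factor (real source vs. pure noise) from detection counts only.

    Assumed prior on the true flux: p(f) proportional to f**(-slope) on
    [f_min, f_max], marginalized with a trapezoid rule in ln(f).
    """
    log_lo, log_hi = math.log(f_min), math.log(f_max)
    step = (log_hi - log_lo) / steps
    numerator = 0.0
    norm = 0.0
    for i in range(steps + 1):
        f = math.exp(log_lo + i * step)
        w = 0.5 if i in (0, steps) else 1.0
        prior = f ** (1.0 - slope)  # f * p(f): Jacobian of d ln(f)
        p_det = normal_cdf((f - threshold) / sigma)
        numerator += w * prior * binom_pmf(n, k, p_det)
        norm += w * prior
    like_real = numerator / norm
    # Noise hypothesis: a detection is a Gaussian fluctuation above threshold.
    p_false = 1.0 - normal_cdf(threshold / sigma)
    like_noise = binom_pmf(n, k, p_false)
    return like_real / like_noise

for n in range(6):
    print(n, bayes_factor_counts(n, k=5))
```

A Bayes factor above 1 favors a real source and below 1 favors noise. Repeated detections push the factor up quickly because chance coincidences of noise peaks are rare.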

Apparent Flux Distribution: Galaxy number counts as a function of magnitude follow an approximate empirical relation. More and more sources at fainter and fainter fluxes! But there is a limit, cf. Olbers' paradox.

Distribution of Noise Peaks: Local maxima of a continuous Gaussian random field. Cf. P(k) by Bardeen, Bond, Kaiser, Szalay (BBKS; 1986). Now in 2D.

Something Like LSST: Simulation. Sky at 5σ is 24 mag. Object limit is at 28. Bayes factor considering only fluxes.

Adding Directions: Bayes factor from cross-id, as in TB & Szalay (2008). Faint sources can be distinguished based on their celestial coordinates. Always at the same place!

Cross-Identification: A hard problem computationally, scientifically and statistically. Need a symmetric n-way solution. Need a reliable quality measure. Same or not? Distance threshold? Maximum likelihood?

Same or Not? The Bayes factor. H: all observations are of the same object (same properties, e.g., coordinates, brightness). K: they might be from separate objects (properties could be different).
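For two catalogs with circular Gaussian astrometric errors, the positional Bayes factor has a closed form in the small-angle limit: B = 2 / (σ1² + σ2²) · exp(−ψ² / (2(σ1² + σ2²))), with the separation ψ and the uncertainties σ1, σ2 in radians. The sketch below evaluates it. The function and constant names are illustrative, and restricting to the two-catalog, flat-sky case is an assumption of this example.

```python
import math

ARCSEC = math.pi / (180.0 * 3600.0)  # one arcsecond in radians

def crossmatch_bayes_factor(sep_arcsec, sigma1_arcsec, sigma2_arcsec):
    """Positional Bayes factor for 'same object' vs. 'separate objects'.

    Two-catalog, small-angle Gaussian form; inputs are in arcseconds.
    """
    psi = sep_arcsec * ARCSEC
    var_sum = (sigma1_arcsec * ARCSEC) ** 2 + (sigma2_arcsec * ARCSEC) ** 2
    return 2.0 / var_sum * math.exp(-psi**2 / (2.0 * var_sum))

# Detections with 0.1" errors: close pairs strongly favor a match,
# while pairs several arcseconds apart are strongly disfavored.
for sep in (0.0, 0.2, 1.0, 5.0):
    print(f'sep={sep:.1f}"  B={crossmatch_bayes_factor(sep, 0.1, 0.1):.3e}')
```

Values much larger than 1 support hypothesis H (same object) and values much smaller than 1 support K. Unlike a fixed distance threshold, the factor adapts to the quoted errors of each detection.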

Works in General: Analytic results for Gaussian errors. Incremental n-way strategy. We can find moving stars with unknown velocities. Matching events in time, e.g., supernovae.

SkyQuery, the new generation! Dynamic federation of astronomy databases. Query the collection as if they were one. The 3rd generation tool is coming in December. Cluster of machines running partitioned jobs. Proper probabilistic execution with variable errors.

Only the first steps. Resolved shapes: radio morphology (Fan, TB+ 2014). Colors to augment matches (Marquez, TB, Sarro 2014).

Galaxy Light: Linear combination. Principal Components Analysis (PCA). SDSS galaxy type. On a big memory machine.

Principal Component Analysis: Principal directions are the directions of largest variations. Eigenproblem of covariances; Singular Value Decomposition. Problems: needs lots of memory; only need the largest ones; very sensitive to outliers.
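As a point of reference for the batch method, here is a minimal NumPy sketch of PCA through the SVD of the mean-centered data matrix. The function name and the toy data are illustrative assumptions. Note that it holds the whole data set in memory, which is the limitation listed on the slide.

```python
import numpy as np

def pca_svd(data, n_components):
    """Batch PCA: principal directions from the SVD of centered data.

    data has shape (n_samples, n_features). Returns the mean, the
    leading principal directions (as rows) and their eigenvalues.
    """
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are eigenvectors of the sample covariance matrix.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = s**2 / (data.shape[0] - 1)
    return mean, vt[:n_components], eigenvalues[:n_components]

rng = np.random.default_rng(0)
toy = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.2])
mean, directions, eigenvalues = pca_svd(toy, n_components=2)
print(directions[0], eigenvalues)
```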

Science is Interactive: Too much to be accurate. By the time you do the calculations, the answer may have changed.

Streams of Data: Mean. Covariance. Iterative evaluation!
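One standard way to evaluate the mean and covariance iteratively is a Welford-style update, which touches each vector once and never stores the stream. The sketch below uses that technique as an illustration; the class name is hypothetical and the talk may use a different recursion (for example one with exponential forgetting).

```python
import numpy as np

class RunningMoments:
    """One-pass mean and covariance of a stream of vectors (Welford update)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.scatter = np.zeros((dim, dim))  # sum of outer products about the mean

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean              # deviation from the old mean
        self.mean += delta / self.n
        self.scatter += np.outer(delta, x - self.mean)  # uses the new mean

    @property
    def covariance(self):
        # Unbiased sample covariance; requires at least two vectors.
        return self.scatter / (self.n - 1)

rng = np.random.default_rng(0)
stream = rng.normal(size=(1000, 4))
moments = RunningMoments(dim=4)
for vector in stream:
    moments.update(vector)
print(moments.mean)
print(moments.covariance)
```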

Streaming PCA: Initialization: eigensystem of a small, random subset, truncated at the p largest eigenvalues. Incremental updates: mean and the low-rank A matrix; SVD of A yields the new eigensystem. Randomized algorithm!
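The steps on this slide can be sketched as below. This is one plausible formulation, not necessarily the speaker's exact algorithm: the covariance is approximated as γ E Λ Eᵀ + (1 − γ) y yᵀ = A Aᵀ, where E and Λ are the current truncated eigensystem, y is the new mean-subtracted vector, and A has only p + 1 columns. The class name, the forgetting factor `gamma`, and the exponentially weighted mean are assumptions made for illustration.

```python
import numpy as np

class StreamingPCA:
    """Incremental PCA keeping only the p leading eigenpairs."""

    def __init__(self, init_batch, p, gamma=0.99):
        # Initialization: eigensystem of a small subset, truncated at p.
        # The subset needs more than p vectors.
        init_batch = np.asarray(init_batch, dtype=float)
        self.p = p
        self.gamma = gamma  # forgetting factor (assumed parameter)
        self.mean = init_batch.mean(axis=0)
        centered = init_batch - self.mean
        _, s, vt = np.linalg.svd(centered, full_matrices=False)
        self.basis = vt[:p].T                        # shape (d, p)
        self.eigvals = s[:p] ** 2 / (len(init_batch) - 1)

    def update(self, x):
        x = np.asarray(x, dtype=float)
        g = self.gamma
        self.mean = g * self.mean + (1.0 - g) * x    # running mean
        y = x - self.mean
        # Low-rank factor A with A @ A.T = g * E diag(L) E.T + (1 - g) * y y.T
        a_matrix = np.column_stack([self.basis * np.sqrt(g * self.eigvals),
                                    np.sqrt(1.0 - g) * y])
        u, s, _ = np.linalg.svd(a_matrix, full_matrices=False)
        self.basis = u[:, :self.p]                   # new eigenvectors
        self.eigvals = s[:self.p] ** 2               # new eigenvalues

# Toy stream: a 3D Gaussian with stretches 7, 6, 5 rotated into 50D.
rng = np.random.default_rng(0)
rotation, _ = np.linalg.qr(rng.normal(size=(50, 50)))
stream = (rng.normal(size=(5000, 3)) * np.array([7.0, 6.0, 5.0])) @ rotation[:, :3].T
model = StreamingPCA(stream[:20], p=3, gamma=0.995)
for vector in stream[20:]:
    model.update(vector)
print(model.eigvals, model.eigvals.sum())
```

Each update costs one SVD of a d × (p + 1) matrix, so memory and time per vector do not grow with the length of the stream. The sketch has no outlier protection; a single extreme vector enters A at full weight.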

Streaming PCA: 3D Gaussian rotated into 50D. Stretches: 7, 6, 5. Total Var = 110.

With Outliers: Adding 0.1% outliers = 100 in each bin. Outliers take over the PCs. Instability, no convergence.

Robust Algorithm: Outliers under control, marked on top. Initialized with SVD on a set of 100 vectors.

Summary: Plan for the junk. Proper statistics save money and gain speed. Incremental randomized strategies scale: crossmatching, embeddings, ML, etc. Not there yet; need new methods and tools.