Surprise Detection in Multivariate Astronomical Data Kirk Borne George Mason University

Surprise Detection in Multivariate Astronomical Data Kirk Borne George Mason University kborne@gmu.edu, http://classweb.gmu.edu/kborne/

Outline: What is Surprise Detection? Example Application: The LSST Project. New Algorithm for Surprise Detection: KNN-DD.

Outlier Detection has many names: Semi-supervised Learning, Outlier Detection, Novelty Detection, Anomaly Detection, Deviation Detection, Surprise Detection.

Outlier Detection as Surprise Detection. Benefits of very large datasets: the best statistical analysis of typical events, and an automated search for rare events. Surprise! (Graphic from S. G. Djorgovski.)

Basic Knowledge Discovery Problem. Outlier detection (unknown unknowns): finding the objects and events that are outside the bounds of our expectations (outside known clusters). These may be real scientific discoveries or garbage. Outlier detection is therefore useful for: Novelty Discovery (is my Nobel prize waiting?); Anomaly Detection (is the detector system working?); Science Data Quality Assurance (is the data pipeline working?). One person's garbage is another person's treasure. One scientist's noise is another scientist's signal.

Outlier detection (unknown unknowns): simple techniques exist for uni- and multivariate data:
Outlyingness: O(x) = |x - μ(x_n)| / σ(x_n)
Mahalanobis distance: D_M(x) = sqrt[ (x - μ)^T Σ^(-1) (x - μ) ]
Normalized Euclidean distance: D(x) = sqrt[ Σ_i (x_i - μ_i)^2 / σ_i^2 ]
Numerous (countless?) outlier detection algorithms have been developed. For example, see these reviews: "Novelty Detection: A Review, Part 1: Statistical Approaches," by Markou & Singh, Signal Processing, 83, 2481-2497 (2003); "A Survey of Outlier Detection Methodologies," by Hodge & Austin, Artificial Intelligence Review, 22, 85-126 (2004); "Capabilities of Outlier Detection Schemes in Large Datasets," by Tang, Chen, Fu, & Cheung, Knowledge and Information Systems, 11 (1), 45-84 (2006); "Outlier Detection: A Survey," by Chandola, Banerjee, & Kumar, Technical Report (2007). How does one optimally find outliers in 1000-dimensional parameter space? Or in interesting subspaces (lower dimensions)? How do we measure their interestingness?
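As an illustrative sketch (not from the slides; the data values are invented), the simple outlyingness and Mahalanobis measures above can be computed with NumPy/SciPy:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Univariate outlyingness O(x) = |x - mu| / sigma, on made-up values:
sample = np.array([1.0, 1.2, 0.9, 1.1, 1.0])
x = 5.0
outlyingness = abs(x - sample.mean()) / sample.std()

# Multivariate analogue: the Mahalanobis distance normalizes by the full
# covariance of the data cloud rather than a single standard deviation.
rng = np.random.default_rng(0)
cloud = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)
vi = np.linalg.inv(np.cov(cloud, rowvar=False))
d = mahalanobis([3.0, -3.0], cloud.mean(axis=0), vi)

print(outlyingness, d)
```

Note that [3.0, -3.0] lies against the cloud's correlation direction, so its Mahalanobis distance is much larger than its Euclidean distance would suggest.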

Outline: What is Surprise Detection? Example Application: The LSST Project. New Algorithm for Surprise Detection: KNN-DD.

LSST = Large Synoptic Survey Telescope http://www.lsst.org/ 8.4-meter diameter primary mirror = 10 square degrees! Hello!

Observing Strategy: one pair of images every 40 seconds for each spot on the sky, continuing across the sky every night for 10 years (~2019-2029), with time-domain sampling in log(time) intervals (to capture the dynamic range of transients). LSST (Large Synoptic Survey Telescope): ten-year time-series imaging of the night sky, mapping the Universe! ~1,000,000 events each night: anything that goes bump in the night! Cosmic Cinematography! The New Sky! @ http://www.lsst.org/ Education and Public Outreach have been an integral and key feature of the project since the beginning; the EPO program includes formal Ed, informal Ed, Citizen Science projects, and Science Centers / Planetaria.

LSST Key Science Drivers: Mapping the Dynamic Universe. Solar System Inventory (moving objects, NEOs, asteroids: census & tracking). Nature of Dark Energy (distant supernovae, weak lensing, cosmology). Optical transients (of all kinds, with alert notifications within 60 seconds). Digital Milky Way (proper motions, parallaxes, star streams, dark matter). LSST in time and space: When? ~2019-2029. Where? Cerro Pachon, Chile. Architect's design of LSST Observatory.

LSST Summary http://www.lsst.org/ 3-Gigapixel camera; one 6-Gigabyte image every 20 seconds; 30 Terabytes every night for 10 years; 100-Petabyte final image data archive anticipated (all data are public!!!); 20-Petabyte final database catalog anticipated. Real-Time Event Mining: 1-10 million events per night, every night, for 10 years; follow-up observations required to classify these. Repeat images of the entire night sky every 3 nights: Celestial Cinematography.

The LSST will represent a 10K-100K times increase in the VOEvent network traffic. This poses significant real-time classification demands on the event stream: from data to knowledge! from sensors to sense!

The LSST Data Mining Raison d'être: More data is not just more data; more is different! Discover the unknown unknowns. Massive Data-to-Knowledge challenge.

The LSST Data Mining Challenges 1. Massive data stream: ~2 Terabytes of image data per hour that must be mined in real time (for 10 years). 2. Massive 20-Petabyte database: more than 50 billion objects need to be classified, and most will be monitored for important variations in real time. 3. Massive event stream: knowledge extraction in real time for 1,000,000 events each night. Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3. Look at these in more detail...

LSST challenges #1, #2. Each night for 10 years LSST will obtain the equivalent amount of data that was obtained by the entire Sloan Digital Sky Survey. My grad students will be asked to mine these data (~30 TB each night, or 60,000 CDs filled with data): a sea of CDs, each and every day, for 10 years. Cumulatively, a football stadium full of 200 million CDs after 10 years. [Image: The CD Sea in Kilmington, England (600,000 CDs).] The challenge is to find the new, the novel, the interesting, and the surprises (the unknown unknowns) within all of these data. Yes, more is most definitely different!

LSST data mining challenge #3. Approximately 1,000,000 times each night for 10 years, LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data: a sparse, growing light curve (flux vs. time). More data points help! Characterize first! Classify later.

Characterization includes: Feature detection and extraction (identify and describe features in the data; extract feature descriptors from the data; curate these features for scientific search & re-use). Find other parameters and features from other archives, other databases, other sky surveys, and use those to help characterize (and ultimately classify) each new event; hence, cope with a highly multivariate parameter space. Outlier / Anomaly / Novelty / Surprise detection. Clustering: unsupervised learning; class discovery. Correlation discovery.
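To make the "characterize first, classify later" idea concrete, here is a small illustrative sketch (the light curve and the choice of descriptors are invented for this example, not taken from the LSST pipeline) of extracting simple, model-free features from an irregularly sampled light curve:

```python
import numpy as np
from scipy.stats import skew

# Invented light curve: irregular sampling of a 7-day periodic signal.
rng = np.random.default_rng(3)
t = np.sort(rng.uniform(0.0, 100.0, 60))
flux = 1.0 + 0.3 * np.sin(2.0 * np.pi * t / 7.0) + rng.normal(0.0, 0.02, t.size)

# Simple model-free descriptors: these characterize the event without
# committing to any physical classification.
features = {
    "mean_flux": float(flux.mean()),
    "amplitude": float(flux.max() - flux.min()),
    "std": float(flux.std()),
    "skewness": float(skew(flux)),
}
print(features)
```

Descriptors like these (together with features pulled from other archives) place each event as a point in a multivariate parameter space, which is where the outlier detection below operates.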

Outline: What is Surprise Detection? Example Application: The LSST Project. New Algorithm for Surprise Detection: KNN-DD (work done in collaboration with Arun Vedachalam).

Challenge: which data points are the outliers?

Inlier or Outlier? Is it in the eye of the beholder?

3 Experiments

Experiment #1-A (L-TN) Simple linear data stream Test A Is the red point an inlier or an outlier?

Experiment #1-B (L-SO) Simple linear data stream Test B Is the red point an inlier or an outlier?

Experiment #1-C (L-HO) Simple linear data stream Test C Is the red point an inlier or an outlier?

Experiment #2-A (V-TN) Inverted V-shaped data stream Test A Is the red point an inlier or an outlier?

Experiment #2-B (V-SO) Inverted V-shaped data stream Test B Is the red point an inlier or an outlier?

Experiment #2-C (V-HO) Inverted V-shaped data stream Test C Is the red point an inlier or an outlier?

Experiment #3-A (C-TN) Circular data topology Test A Is the red point an inlier or an outlier?

Experiment #3-B (C-SO) Circular data topology Test B Is the red point an inlier or an outlier?

Experiment #3-C (C-HO) Circular data topology Test C Is the red point an inlier or an outlier?

KNN-DD = K-Nearest Neighbors Data Distributions: f_K(d[x_i, x_j])

KNN-DD = K-Nearest Neighbors Data Distributions: f_O(d[x_i, o])

KNN-DD = K-Nearest Neighbors Data Distributions: f_K(d[x_i, x_j]) vs. f_O(d[x_i, o])

The Test: the K-S test. It tests the Null Hypothesis: the two data distributions are drawn from the same parent population. If the Null Hypothesis is rejected, then it is probable that the two data distributions are different. This is our definition of an outlier: the Null Hypothesis is rejected, therefore the data point's location in parameter space deviates in an improbable way from the rest of the data distribution.
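For instance, a minimal sketch using SciPy's two-sample K-S test (the sample sizes and distributions here are arbitrary choices for illustration):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
baseline = rng.normal(0.0, 1.0, 200)

# Same parent population: the Null Hypothesis generally survives (large p).
_, p_same = ks_2samp(rng.normal(0.0, 1.0, 200), baseline)

# Shifted population: the Null Hypothesis is rejected (tiny p-value),
# which under the definition above flags a surprise.
_, p_diff = ks_2samp(rng.normal(2.0, 1.0, 200), baseline)

print(p_same, p_diff)
```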

Advantages and Benefits of KNN-DD. Because it is based on the non-parametric K-S test, it makes no assumption about the shape of the data distribution or about normal behavior. It compares the cumulative distributions of the data values (i.e., the sets of inter-point distances) without regard to the nature of those distributions. (to be continued)
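Putting the preceding slides together, a minimal sketch of the KNN-DD procedure (my own paraphrase on synthetic data, not the authors' reference implementation) might look like this:

```python
import numpy as np
from scipy.stats import ks_2samp
from scipy.spatial.distance import cdist, pdist

def knn_dd(data, o, k=20, alpha=0.05):
    """Test point o against dataset `data` (N x D array).

    f_K: inter-point distances among the k nearest neighbors of o.
    f_O: distances from each of those k neighbors to o itself.
    A K-S test compares the two distance distributions; rejecting the
    Null Hypothesis (p < alpha) flags o as an outlier.
    """
    o = np.atleast_2d(o)
    d_to_o = cdist(data, o).ravel()
    neighbors = data[np.argsort(d_to_o)[:k]]   # k nearest neighbors of o
    f_K = pdist(neighbors)                     # neighbor-to-neighbor distances
    f_O = cdist(neighbors, o).ravel()          # neighbor-to-o distances
    _, p = ks_2samp(f_K, f_O)
    return p, bool(p < alpha)

rng = np.random.default_rng(1)
cluster = rng.normal(0.0, 1.0, size=(200, 2))          # one 2-D cluster
p_in, flag_in = knn_dd(cluster, cluster.mean(axis=0))  # point inside the cluster
p_out, flag_out = knn_dd(cluster, np.array([8.0, 8.0]))  # point far outside
print(f"p_in={p_in:.3f}, p_out={p_out:.3g}")
```

Note that the test is computed only on the K local neighbors (K << N), which is what makes the method cheap and embarrassingly parallel across test points.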

Cumulative Data Distribution (K-S test) for Experiment 1A (L-TN)

Cumulative Data Distribution (K-S test) for Experiment 2B (V-SO)

Cumulative Data Distribution (K-S test) for Experiment 3C (C-HO)

Results of KNN-DD experiments:

Experiment ID   Short Description of Experiment          K-S Test p-value   Outlier Index (1-p)   Outlier Flag (p < 0.05?)
L-TN            Linear data stream, True Normal test     0.590              41.0%                 False
L-SO            Linear data stream, Soft Outlier test    0.096              90.4%                 Potential Outlier
L-HO            Linear data stream, Hard Outlier test    0.025              97.5%                 TRUE
V-TN            V-shaped stream, True Normal test        0.366              63.4%                 False
V-SO            V-shaped stream, Soft Outlier test       0.063              93.7%                 Potential Outlier
V-HO            V-shaped stream, Hard Outlier test       0.041              95.9%                 TRUE
C-TN            Circular stream, True Normal test        0.728              27.2%                 False
C-SO            Circular stream, Soft Outlier test       0.009              99.1%                 TRUE
C-HO            Circular stream, Hard Outlier test       0.005              99.5%                 TRUE

The Outlier Index (1-p) is the outlyingness likelihood. The K-S test p-value is essentially the likelihood of the Null Hypothesis.

Astronomy data experiment #1: Star-Galaxy separation. Approximately 100 stars and 100 galaxies were selected from the SDSS & 2MASS catalogs. 8 parameters (ugriz and JHK magnitudes) were extracted, and 7 colors were computed: u-g, g-r, r-i, i-z, z-J, J-H, H-K. These are used to locate each object in feature space; hence, we have a 7-dimensional feature space. The galaxies are treated as the outliers relative to the stars (i.e., can KNN-DD separate them from the stars?). Results (for p=0.05 and K=20): 78% of the galaxies were correctly classified as outliers (TP); 1% of the stars were incorrectly classified as outliers (FP).
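As a toy illustration of building that 7-D color space (the magnitude values below are invented, not drawn from the SDSS/2MASS sample used in the experiment):

```python
import numpy as np

# Hypothetical magnitudes for two objects; columns are the 8 measured
# bands in wavelength order: u, g, r, i, z (SDSS) and J, H, K (2MASS).
mags = np.array([
    [18.2, 17.1, 16.7, 16.5, 16.4, 15.8, 15.3, 15.1],
    [19.0, 17.6, 16.9, 16.6, 16.4, 15.5, 14.9, 14.6],
])

# Differences of adjacent bands give the 7 colors used as feature axes:
# u-g, g-r, r-i, i-z, z-J, J-H, H-K
colors = mags[:, :-1] - mags[:, 1:]
print(colors.shape)  # (2, 7): each object is a point in the 7-D color space
```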

Astronomy data experiment #2: Star-Quasar separation. 1000 stars were selected from the SDSS & 2MASS catalogs, and 100 quasars from the Penn State astrostatistics website. 8 parameters (ugriz and JHK magnitudes) were extracted, and 7 colors were computed: u-g, g-r, r-i, i-z, z-J, J-H, H-K. These are used to locate each object in feature space; hence, we have a 7-dimensional feature space. The quasars are treated as the outliers relative to the stars (i.e., can KNN-DD separate them from the stars?). Results (for p=0.05 and K=20): 100% of the quasars were correctly classified as outliers (TP); 29% of the stars were incorrectly classified as outliers (FP).

Advantages and Benefits of KNN-DD, continued. KNN-DD: operates on multivariate data (thus addressing the curse of dimensionality); is algorithmically univariate (by estimating a function that is based only on the distances between data points, which themselves occupy a high-dimensional parameter space); is simply extensible to higher dimensions; is computed only on small local subsamples (the K nearest neighbors) of the full dataset of N data points (K << N); is easily (embarrassingly) parallelized when testing multiple data points for outlyingness.

Future Work for KNN-DD (i.e., deficiencies that need attention): Validate our choices of p and K, which are not yet well determined or justified. Measure the KNN-DD algorithm's learning times. Determine the algorithm's complexity. Compare the algorithm against several other outlier detection algorithms (we started doing that, but there are a very large number of them). Evaluate the KNN-DD algorithm's effectiveness on much larger datasets (e.g., 77,000 SDSS quasars); we started doing that, but with mixed results, which we are still analyzing. Test its usability on event streams (streaming data).

Future Work in Surprise Detection: Test and validate many more of the existing outlier detection algorithms on astronomical data, scoring them according to their effectiveness and efficiency. Derive an interestingness index, which will probably be based upon a mixture of outlyingness metrics; test and validate this on single data points, data sequences (e.g., trending data), and data series (e.g., time series). Apply the resulting algorithms and indices to very large datasets (e.g., SDSS+2MASS+GALEX+WISE catalogs, Kepler time series, etc.), and test on LSST simulated data and catalogs, in preparation for the real thing at the end of the decade. Investigate the applicability of these algorithms to SDQA (Science Data Quality Assessment) tasks for large sky surveys.