Surprise Detection in Science Data Streams Kirk Borne Dept of Computational & Data Sciences George Mason University

Surprise Detection in Science Data Streams Kirk Borne Dept of Computational & Data Sciences George Mason University kborne@gmu.edu, http://classweb.gmu.edu/kborne/

Outline
- Astroinformatics
- Example Application: The LSST Project
- New Algorithm for Surprise Detection: KNN-DD

Astronomy: Data-Driven Science = Evidence-based Forensic Science

Astroinformatics Activities
- Borne (2010): "Astroinformatics: Data-Oriented Astronomy Research and Education", Journal of Earth Science Informatics, vol. 3, pp. 5-17.
- Web home: http://www.practicalastroinformatics.org/
- Astro data mining papers:
  - "Scientific Data Mining in Astronomy", arXiv:0911.0505
  - "Data Mining and Machine Learning in Astronomy", arXiv:0906.2173
- Virtual Observatory Data Mining Interest Group (contact longo@na.infn.it)
- Astroinformatics Conference @ Caltech, June 16-19 (Astroinformatics2010)
- NASA/Ames Conference on Intelligent Data Understanding @ October 5-7
- Astro2010 Decadal Survey Position Papers:
  - Astroinformatics: A 21st Century Approach to Astronomy
  - The Revolution in Astronomy Education: Data Science for the Masses
  - The Astronomical Information Sciences: Keystone for 21st-Century Astronomy
  - Wide-Field Astronomical Surveys in the Next Decade
  - Great Surveys of the Universe

From Data-Driven to Data-Intensive
- Astronomy has always been a data-driven science
- It is now a data-intensive science: welcome to Astroinformatics!
- Data-oriented Astronomical Research = "the 4th Paradigm"
- Scientific KDD (Knowledge Discovery in Databases):
  - Characterize the known (clustering, unsupervised learning)
  - Assign the new (classification, supervised learning)
  - Discover the unknown (outlier detection, semi-supervised learning)
  - Result: Scientific Knowledge!
- Benefits of very large datasets:
  - best statistical analysis of "typical" events
  - automated search for "rare" events

Outlier Detection as Semi-supervised Learning Graphic from S. G. Djorgovski

Basic Astronomical Knowledge Problem
Outlier detection (unknown unknowns):
- Finding the objects and events that are outside the bounds of our expectations (outside known clusters)
- These may be real scientific discoveries or garbage
Outlier detection is therefore useful for:
- Novelty Discovery: is my Nobel prize waiting?
- Anomaly Detection: is the detector system working?
- Science Data Quality Assurance: is the data pipeline working?
How does one optimally find outliers in 10^3-D parameter space, or in interesting subspaces (in lower dimensions)?
How do we measure their "interestingness"?

Outlier Detection has many names: Outlier Detection, Novelty Detection, Anomaly Detection, Deviation Detection, Surprise Detection

Outline
- Astroinformatics
- Example Application: The LSST Project
- New Algorithm for Surprise Detection: KNN-DD

LSST = Large Synoptic Survey Telescope (http://www.lsst.org/)
- 8.4-meter diameter primary mirror (mirror funded by private donors)
- Field of view = 10 square degrees!
- Design, construction, and operations of telescope, observatory, and data system: NSF
- Camera: DOE

LSST Key Science Drivers: Mapping the Universe
- Solar System Map (moving objects, NEOs, asteroids: census & tracking)
- Nature of Dark Energy (distant supernovae, weak lensing, cosmology)
- Optical transients (of all kinds, with alert notifications within 60 seconds)
- Galactic Structure (proper motions, stellar populations, star streams, dark matter)
LSST in time and space:
- When? 2016-2026
- Where? Cerro Pachon, Chile
(Figure: model of LSST Observatory)

Observing Strategy: One pair of images every 40 seconds for each spot on the sky, then continue across the sky continuously every night for 10 years (2016-2026), with time domain sampling in log(time) intervals (to capture dynamic range of transients).
LSST (Large Synoptic Survey Telescope):
- Ten-year time series imaging of the night sky: mapping the Universe!
- 100,000 events each night: anything that goes bump in the night!
- Cosmic Cinematography! The New Sky! @ http://www.lsst.org/
Education and Public Outreach have been an integral and key feature of the project since the beginning. The EPO program includes formal Ed, informal Ed, Citizen Science projects, and Science Centers / Planetaria.

LSST Summary (http://www.lsst.org/)
- Plan (pending Decadal Survey): commissioning in 2016
- 3-Gigapixel camera
- One 6-Gigabyte image every 20 seconds
- 30 Terabytes every night for 10 years
- 100-Petabyte final image data archive anticipated (all data are public!!!)
- 20-Petabyte final database catalog anticipated
- Real-Time Event Mining: 10,000-100,000 events per night, every night, for 10 yrs
  - Follow-up observations required to classify these
- Repeat images of the entire night sky every 3 nights: Celestial Cinematography
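
As a rough sanity check on the summary figures, the nightly data rate can be multiplied out over the survey. This is only a back-of-envelope sketch: the count of 365 observing nights per year and the decimal unit conversion (1 PB = 1000 TB) are assumptions, not figures from the slides.

```python
# Figures quoted on the LSST Summary slide.
TB_PER_NIGHT = 30
SURVEY_YEARS = 10
EVENTS_PER_NIGHT = (10_000, 100_000)

# Assumption (not from the slides): observing on all 365 nights of every year,
# and decimal units (1 PB = 1000 TB). Real surveys lose nights to weather.
NIGHTS = 365 * SURVEY_YEARS

archive_pb = TB_PER_NIGHT * NIGHTS / 1000
total_events = tuple(rate * NIGHTS for rate in EVENTS_PER_NIGHT)

print(f"Image volume over the survey: about {archive_pb:.1f} PB")
print(f"Events over the survey: {total_events[0]:,} to {total_events[1]:,}")
```

Under these assumptions the total comes out on the order of the 100-petabyte archive quoted above, and the event stream adds up to tens or hundreds of millions of alerts over the ten years.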

The LSST will represent a 10K-100K times increase in the VOEvent network traffic. This poses significant real-time classification demands on the event stream: from data to knowledge! from sensors to sense!

MIPS model for Event Follow-up: MIPS = Measurement, Inference, Prediction, Steering
- Heterogeneous Telescope Network = Global Network of Sensors
- Similar projects in NASA, Earth Science, DOE, NOAA, Homeland Security, NSF DDDAS (voeventnet.org, skyalert.org)
- Machine Learning enables the "IP" part of MIPS:
  - Autonomous (or semi-autonomous) Classification
  - Intelligent Data Understanding
  - Rule-based
  - Model-based
  - Neural Networks
  - Temporal Data Mining (Predictive Analytics)
  - Markov Models
  - Bayes Inference Engines

Example: The Thinking Telescope Reference: http://www.thinkingtelescopes.lanl.gov

From Sensors to Sense
From Data to Knowledge: from sensors to sense (semantics)
Data -> Information -> Knowledge

Outline
- Astroinformatics
- Example Application: The LSST Project
- New Algorithm for Surprise Detection: KNN-DD (work done in collaboration with Arun Vedachalam)

Challenge: which data points are the outliers?

Inlier or Outlier? Is it in the eye of the beholder?

3 Experiments

Experiment #1-A (L-TN): Simple linear data stream, Test A. Is the red point an inlier or an outlier?

Experiment #1-B (L-SO): Simple linear data stream, Test B. Is the red point an inlier or an outlier?

Experiment #1-C (L-HO): Simple linear data stream, Test C. Is the red point an inlier or an outlier?

Experiment #2-A (V-TN): Inverted V-shaped data stream, Test A. Is the red point an inlier or an outlier?

Experiment #2-B (V-SO): Inverted V-shaped data stream, Test B. Is the red point an inlier or an outlier?

Experiment #2-C (V-HO): Inverted V-shaped data stream, Test C. Is the red point an inlier or an outlier?

Experiment #3-A (C-TN): Circular data topology, Test A. Is the red point an inlier or an outlier?

Experiment #3-B (C-SO): Circular data topology, Test B. Is the red point an inlier or an outlier?

Experiment #3-C (C-HO): Circular data topology, Test C. Is the red point an inlier or an outlier?

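KNN-DD = K-Nearest Neighbors Data Distributions: $f_K(d[x_i, x_j])$, the distribution of distances between pairs of the K nearest neighbors

KNN-DD = K-Nearest Neighbors Data Distributions: $f_O(d[x_i, O])$, the distribution of distances between the test point O and each of its K nearest neighbors

KNN-DD = K-Nearest Neighbors Data Distributions: compare $f_O(d[x_i, O])$ with $f_K(d[x_i, x_j])$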
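
The two distance samples behind f_K and f_O might be gathered as in the following minimal sketch. It assumes Euclidean distance, a test point that is not itself part of the dataset, and a brute-force neighbor search; the function name `knn_distance_samples` is hypothetical and none of these choices are confirmed by the slides.

```python
import math
from itertools import combinations

def knn_distance_samples(test_point, data, k):
    """Return (d_test, d_pairs): the samples behind f_O and f_K.

    Assumptions: Euclidean metric, brute-force search, test point not in data.
    """
    # Rank all data points by distance to the test point and keep the K nearest.
    ranked = sorted(data, key=lambda p: math.dist(p, test_point))
    neighbors = ranked[:k]
    # f_O sample: distances from the test point O to each of its K neighbors.
    d_test = [math.dist(p, test_point) for p in neighbors]
    # f_K sample: distances between every pair of those K neighbors.
    d_pairs = [math.dist(p, q) for p, q in combinations(neighbors, 2)]
    return d_test, d_pairs

# Usage: a simple linear data stream and a test point lying on the line.
data = [(float(i), float(i)) for i in range(10)]
d_test, d_pairs = knn_distance_samples((4.5, 4.5), data, k=4)
print(len(d_test), len(d_pairs))
```

With K neighbors the f_O sample has K values and the f_K sample has K(K-1)/2 values, so both stay small regardless of the size N of the full dataset.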

The Test: K-S (Kolmogorov-Smirnov) test. It tests the Null Hypothesis that the two data distributions are drawn from the same parent population. If the Null Hypothesis is rejected, then it is probable that the two data distributions are different. This is our definition of an outlier: the Null Hypothesis is rejected. Therefore the data point's location in parameter space deviates in an improbable way from the rest of the data distribution.
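
A two-sample K-S test can be sketched in plain Python as follows. The statistic is the largest gap between the two empirical cumulative distributions. The p-value uses the standard asymptotic Kolmogorov series with a common small-sample correction, which is an assumption here: the slides do not say how the p-values were computed, and the approximation is rough for small samples. The function names are illustrative.

```python
import math
from bisect import bisect_right

def ks_statistic(a, b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:  # the maximum gap is attained at one of the sample values
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

def ks_pvalue(d, n1, n2, terms=100):
    """Asymptotic p-value (assumed approximation, not taken from the slides)."""
    ne = n1 * n2 / (n1 + n2)  # effective sample size
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.3:
        return 1.0  # the series is indistinguishable from 1 in this range
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

def ks_two_sample(a, b):
    d = ks_statistic(a, b)
    return d, ks_pvalue(d, len(a), len(b))

# Usage: reject the Null Hypothesis (flag an outlier) when p < 0.05.
d, p = ks_two_sample(list(range(10)), list(range(100, 110)))
print(d, p < 0.05)
```

In the KNN-DD setting the two inputs would be the test-point distances and the neighbor-to-neighbor distances; a library routine such as SciPy's `ks_2samp` could replace this hand-rolled version where SciPy is available.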

Advantages and Benefits of KNN-DD
- The K-S test is non-parametric
- It makes no assumption about the shape of the data distribution or about "normal" behavior
- It compares the cumulative distribution of the data values (inter-point distances)

Cumulative Data Distribution (K-S test) for Experiment 1A (L-TN)

Cumulative Data Distribution (K-S test) for Experiment 2B (V-SO)

Cumulative Data Distribution (K-S test) for Experiment 3C (C-HO)

Advantages and Benefits of KNN-DD
- The K-S test is non-parametric
- It makes no assumption about the shape of the data distribution or about "normal" behavior
- KNN-DD:
  - operates on multivariate data (thus solving the curse of dimensionality)
  - is algorithmically univariate (by estimating a function that is based only on the distance between data points)
  - is computed only on a small-K local subsample of the full dataset N (K << N)
  - is easily parallelized when testing multiple data points for outlyingness

Results of KNN-DD experiments

Experiment ID      Short Description of Experiment          KS Test p-value   Outlier Index = 1-p          Outlier Flag (p<0.05?)
                                                                              = Outlyingness Likelihood
L-TN (Fig. 5a)     Linear data stream, True Normal test     0.590             41.0%                        False
L-SO (Fig. 5b)     Linear data stream, Soft Outlier test    0.096             90.4%                        Potential Outlier
L-HO (Fig. 5c)     Linear data stream, Hard Outlier test    0.025             97.5%                        TRUE
V-TN (Fig. 7a)     V-shaped stream, True Normal test        0.366             63.4%                        False
V-SO (Fig. 7b)     V-shaped stream, Soft Outlier test       0.063             93.7%                        Potential Outlier
V-HO (Fig. 7c)     V-shaped stream, Hard Outlier test       0.041             95.9%                        TRUE
C-TN (Fig. 9a)     Circular stream, True Normal test        0.728             27.2%                        False
C-SO (Fig. 9b)     Circular stream, Soft Outlier test       0.009             99.1%                        TRUE
C-HO (Fig. 9c)     Circular stream, Hard Outlier test       0.005             99.5%                        TRUE

The K-S test p-value is treated here as an approximate likelihood of the Null Hypothesis. Strictly, it is the probability of seeing a difference at least this large between the two distributions if the Null Hypothesis holds.
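
The derived columns of the results table can be reproduced from the p-values with a few lines of Python. The 0.05 cut-off is the one stated in the table header. The 0.10 cut-off for "Potential Outlier" is an assumption, chosen only because both rows carrying that label have p-values between 0.05 and 0.10; the function name `outlier_report` is hypothetical.

```python
def outlier_report(p_value, alpha=0.05, watch=0.10):
    """Return (outlier index, flag) for one K-S p-value.

    alpha comes from the table header; watch=0.10 is an assumed threshold.
    """
    index = 1.0 - p_value  # Outlier Index = 1 - p
    if p_value < alpha:
        flag = "TRUE"
    elif p_value < watch:
        flag = "Potential Outlier"
    else:
        flag = "False"
    return index, flag

# p-values copied from the results table.
p_values = {"L-TN": 0.590, "L-SO": 0.096, "L-HO": 0.025,
            "V-TN": 0.366, "V-SO": 0.063, "V-HO": 0.041,
            "C-TN": 0.728, "C-SO": 0.009, "C-HO": 0.005}

for name, p in p_values.items():
    index, flag = outlier_report(p)
    print(f"{name}: p = {p:.3f}, index = {index:.1%}, flag = {flag}")
```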

Future Work
- Validate our choices of p and K
- Measure the KNN-DD algorithm's learning times
- Determine the algorithm's complexity
- Compare the algorithm against several other outlier detection algorithms
- Evaluate the algorithm's effectiveness on much larger datasets
- Demonstrate its usability on streaming data