
Automated Variable Source Classification: Methods and Challenges
James Long, Texas A&M Department of Statistics
September 14, 2016


Outline
- Methodology: Statistical Classifiers
- Methodology: CART
- Example with OGLE Data
- Challenge 1: Selection of Training Data
- Challenge 2: Classification versus Clustering
- Conclusions and Opportunities

Survey Data Sets are Large and Growing
- Hipparcos (1989-1993): 2,712 periodic variables; Laurent Eyer and students classified all by eye.
- OGLE (1992-present): 100,000s
- Gaia (present): millions
- LSST (2020): billions

Light Curves Belong to Different Classes
[Figure: light curve of a Mira variable; magnitude versus time (days)]
Class: Mira
- pulsating red giant in a late stage of stellar evolution
- mean magnitude variation due to dust
- long period, high amplitude

Light Curves Belong to Different Classes
[Figure: light curve of an RR Lyrae variable; magnitude versus time (days) and magnitude versus phase]
Class: RR Lyrae
- pulsating horizontal branch stars
- (almost) strictly periodic, short period, low amplitude

Hierarchical Structure to Classes
[Figure: hierarchy of variable source classes]

Classification Example
Data:
- 100,000 variable sources in M33
- 30 observations / source in I band
- mix of Miras (O-rich/C-rich), SRVs, Cepheids, non-periodic sources, junk, etc.
[Figure: example M33 light curve; magnitude versus time (days)]
Goals:
- Find O-rich and C-rich Miras.
- Determine period-luminosity relationships for the Miras.

Overview of Statistical Classification
Key Terms:
- training data: light curves of known class
- unlabeled data: light curves of unknown class
Steps in Classification:
1. feature extraction: derive quantities from the light curves useful for separating classes, e.g. period, amplitude, derivatives, etc.
2. classifier construction: using the training data, construct a function Ĉ(features) → class
3. apply classifier: for unlabeled data, compute features and predict class using Ĉ
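A minimal sketch of these three steps in Python, assuming scikit-learn. The simulate_lc helper, the 0.05 mag noise level, the two toy features, and the "mira"/"rr" labels are illustrative assumptions, not the pipeline used in the talk:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def simulate_lc(period, amp, n=30):
    """Simulate a noisy sinusoidal light curve: times in days, magnitudes."""
    t = np.sort(rng.uniform(0, 1000, n))
    m = amp * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.05, n)
    return t, m

def extract_features(t, m):
    """Step 1: derive quantities useful for separating classes (amplitude, skew)."""
    amp = np.percentile(m, 95) - np.percentile(m, 5)
    c = m - m.mean()
    return [amp, (c ** 3).mean() / m.std() ** 3]

# training data: light curves of known class
lcs = [simulate_lc(300, 2.0) for _ in range(50)] + \
      [simulate_lc(0.5, 0.5) for _ in range(50)]
labels = ["mira"] * 50 + ["rr"] * 50

# step 2: construct Chat(features) -> class from the training data
X = np.array([extract_features(t, m) for t, m in lcs])
clf = DecisionTreeClassifier().fit(X, labels)

# step 3: compute features for an unlabeled source and predict its class
t_new, m_new = simulate_lc(250, 1.8)
print(clf.predict([extract_features(t_new, m_new)]))
```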

Classifier Construction using CART
[Figure: training data in a two-dimensional feature space (feature x versus feature y)]
- CART developed by Breiman et al. in the 1980s [1]
- recursively partitions the feature space
- partition represented by a tree

Building CART Tree...
[Figure: feature space after one split]
Tree: a split at y < 0.394257 gives a blue leaf and an orange leaf.

Building CART Tree...
[Figure: feature space after two splits]
Tree: splits at y < 0.394257 and x < 0.369436 give blue, black, and orange leaves.

Building CART Tree...
[Figure: feature space after three splits]
Tree: splits at y < 0.394257, x < 0.125044, and x < 0.369436 give black, blue, black, and orange leaves.

Resulting Classifier
[Figure: final partition of the feature space]
Tree: splits at y < 0.394257, x < 0.125044, x < 0.369436, y < 0.600437, and x < 0.288413; the leaves are black, blue, black, orange, black, and orange.
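As a sketch, scikit-learn's DecisionTreeClassifier implements CART-style recursive partitioning. The two features and the coloring rule below are made up for illustration, but export_text renders the same kind of split/leaf structure as the tree above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (300, 2))                 # features x and y on [0, 1]^2
# hypothetical labeling rule for the tree to recover
y = np.where(X[:, 1] < 0.4,
             np.where(X[:, 0] < 0.37, "blue", "orange"),
             "black")

tree = DecisionTreeClassifier(max_leaf_nodes=6).fit(X, y)
print(export_text(tree, feature_names=["x", "y"]))  # splits and leaf classes
```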

Apply Classifier to Test Data
Test Data: data used to evaluate classifier accuracy. Test data is not used to construct the classifier.
Confusion Matrix: rows are the true class of the test data, columns are the predicted class, entries are counts.

              Predicted
Truth     black   blue   orange
black        23      1        7
blue          2     30        2
orange        3      1       31
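scikit-learn's confusion_matrix follows the same convention (rows are truth, columns are predictions); the labels below are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

# hypothetical true and predicted classes for a small test set
y_true = ["black", "black", "blue", "orange", "orange", "blue"]
y_pred = ["black", "orange", "blue", "orange", "black", "blue"]
print(confusion_matrix(y_true, y_pred, labels=["black", "blue", "orange"]))
```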

Example with OGLE Data

OGLE Classification Example
Classes:
- Mira O-rich
- Mira C-rich
- Cepheid
- RR Lyrae AB
- RR Lyrae C
Features:
- period (of best fitting sinusoid)
- amplitude = 95th percentile mag - 5th percentile mag
- skew of magnitude measurements
- p2p scatter (used by Dubath)
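A hedged sketch of extracting these four features for one light curve, using astropy's Lomb-Scargle periodogram for the best-fitting sinusoid. The exact point-to-point scatter definition in Dubath et al. [2] differs in detail, so the normalization below is an assumption:

```python
import numpy as np
from astropy.timeseries import LombScargle
from scipy.stats import skew

def ogle_style_features(t, m):
    """Sketch of the four slide features for one light curve (t in days)."""
    freq, power = LombScargle(t, m).autopower()
    period = 1.0 / freq[np.argmax(power)]             # best-fitting sinusoid
    amp = np.percentile(m, 95) - np.percentile(m, 5)  # amplitude
    # point-to-point scatter: median absolute difference of successive
    # magnitudes, scaled by the overall scatter (assumed normalization)
    p2p = np.median(np.abs(np.diff(m))) / np.std(m)
    return np.log10(period), amp, skew(m), p2p

# toy usage on a simulated sinusoid
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 1000, 30))
m = np.sin(2 * np.pi * t / 300) + 0.05 * rng.normal(size=30)
print(ogle_style_features(t, m))
```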

First 6 Rows of Feature-Class Dataframe
[Table: first six rows of features and class, not transcribed]
- 500 total rows, 5 classes
- training data: 400 randomly selected rows
- test data: remaining 100 rows
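The 400/100 split can be done with a random permutation of the row indices; df is the assumed name of the feature-class dataframe:

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.permutation(500)            # shuffle the 500 row indices
train, test = idx[:400], idx[400:]    # 400 training rows, 100 test rows
# X_train, y_train = df.iloc[train, :-1], df.iloc[train, -1]
# (assumes the class label is the last column of the dataframe df)
```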

Feature Distributions
[Figure: pairwise scatterplot matrix of log(period), skew, amp, and p2p_scatter, colored by class (cepheid, mira crich, mira orich, rrab, rrc)]

CART Model Fit to Training Data
[Figure: fitted CART tree. Splits on period, amp, and p2p_scatter (e.g. period >= 0.42, period >= 0.89, amp < 0.75, p2p_scatter < 0.061, period >= 1); leaves with training purity: rrc 85/85, rrab 85/85, cepheid 72/72, mira orich 55/66, mira crich 63/85, mira orich 6/7]

Confusion Matrix using Test Data

                      Predicted
Truth         cepheid   mira-crich   mira-orich   rrab   rrc
cepheid            24            0            0      0     0
mira-crich          0           15           10      0     0
mira-orich          0            5           12      0     0
rrab                1            0            0     14     0
rrc                 0            0            0      1    14

Conclusion: develop features that better separate O-rich and C-rich Miras.

Notes on Existing Classification Literature
"On machine-learned classification of variable stars" Richards, J. et al. 2011 [7]
- mix of OGLE and Hipparcos data
- extract 50+ features
- test several classifiers; random forest works best
"Random forest automated supervised classification of Hipparcos periodic variable stars" Dubath et al. 2011 [2]
- Hipparcos data
- extract 10 features
- use random forest
"Modeling Light Curves for Improved Classification" Faraway, J., Mahabal, A., et al. 2014 [3]
- model light curve variation using Gaussian processes
- extract features from the Gaussian process fit
- improve classification accuracy over the simpler features used in [7]
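Since both [7] and [2] settle on random forests, here is a minimal scikit-learn sketch; the synthetic features from make_classification stand in for real light-curve features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# synthetic stand-in for a table of light-curve features and class labels
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=5, n_clusters_per_class=1, random_state=0)
rf = RandomForestClassifier(n_estimators=500, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())   # cross-validated accuracy
```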

Challenge 1: Selection of Training Data

What Training Data To Use?
Unlabeled Data: light curves with 20 photometric measurements.
Two Options for Training Data:
1. High S/N: many photometric measurements per light curve.
   Pros: features accurately estimated (e.g. period estimates correct).
   Cons: training looks different from the unlabeled data.
2. Training resembles unlabeled: 20 photometric measurements.
   Pros: training looks the same as the unlabeled data.
   Cons: features estimated incorrectly.


Training Data Should Resemble Unlabeled Data
Hypothetical Example:
- Unlabeled data: RR Lyrae and Miras with 20 photometric measurements.
- Features: period and amplitude.
- Training 1: light curves with > 100 photometric measurements.
- Training 2: light curves with 20 photometric measurements.

Classifier Built on Training 1 Data
[Figure: feature distribution (log(period) versus amplitude) for Training 1]
CART tree: a single split at period >= 51 separates the classes perfectly, with leaves mira 100/100 and rr 100/100.
Conclusion: seemingly perfect classification.

Apply Classifier to Unlabeled Data
[Figure: feature distribution (log(period) versus amplitude) of the unlabeled data]

           Predicted
Truth     mira    rr
mira        22    78
rr          28    72

Observations
- The classifier constructed using Training Data 1 used period to separate the classes.
- For poorly sampled unlabeled data, period does not separate the classes (the period cannot be computed accurately).
- But amplitude is still useful for separating the classes.

Classifier Built on Training 2 Data
[Figure: feature distribution (log(period) versus amplitude) for Training 2]
CART tree: a single split at amplitude >= 0.83 gives leaves mira 97/122 and rr 75/78.

Apply Training 2 Classifier to Unlabeled Data
[Figure: feature distribution (log(period) versus amplitude) of the unlabeled data]

           Predicted
Truth     mira    rr
mira        96     4
rr          29    71

Conclusion: much better performance.

Summary of Training Data Selection
- Classifiers constructed on high S/N data find class boundaries in the high S/N feature space.
- These boundaries may not exist for low S/N unlabeled data.
- Downsampling high S/N data to match the unlabeled data S/N can improve classifier performance.
- This is an example of domain adaptation / transfer learning; see Long et al. [4] for extensive discussion and methodology.
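A sketch of the downsampling idea: randomly thin each well-sampled training light curve to the cadence of the unlabeled data before extracting features. extract_features is the earlier toy helper, and the 20 epochs match the unlabeled data in this example:

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample_lc(t, m, n_target=20):
    """Keep a random subset of epochs so that training features are computed
    under the same sparse sampling as the unlabeled survey data."""
    keep = np.sort(rng.choice(len(t), size=n_target, replace=False))
    return t[keep], m[keep]

# features would then be extracted from the thinned curves, e.g.
# X_train = [extract_features(*downsample_lc(t, m)) for t, m in train_lcs]
```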

Challenge 2: Classification versus Clustering

Recall Classification Example
Data:
- 100,000 variable sources (large Stetson J) in M33
- 30 observations / source in I band
- mix of Miras (O-rich/C-rich), SRVs, Cepheids, non-periodic sources, junk, etc.
Goals:
- find O-rich and C-rich Miras
- determine period-luminosity relationships for the Miras

Building a Classifier for M33 is Difficult
OGLE Training Data:
- downsample to match the M33 cadence / photometric error
- select OGLE classes which match classes in M33
Evaluating Classifier Performance:
- straightforward to measure the error rate on training data
- how do we measure the error rate on test data?
- classification is only an intermediate step towards larger astronomy goals (specifically, modeling of the light curve populations)

A Different Approach: Clustering
Clustering: find groups (i.e. clusters) of objects in feature space.
- compute the feature distance between all pairs of objects
- find clusters where the distance between objects within a cluster is small and the distance between objects in different clusters is large
Clustering is different from classification: no training data, and no objective measure of success.

OGLE Data
[Figure: pairwise scatterplot matrix of log(period), skew, amp, and p2p_scatter for the OGLE sample, without class labels]

Hierarchical Agglomerative Clustering Idea
Main Idea:
- every observation starts as its own cluster
- iteratively merge close clusters together
- iterate until one giant cluster is left
The method is hierarchical: each iteration produces a clustering, so the number of clusters need not be specified in advance. Agglomerative: initially every observation is in its own cluster.

Hierarchical Agglomerative Clustering Pseudocode

N ← {1, ..., n}
d_ij ← d(x_i, x_j) for all i, j ∈ N
C_in ← {x_i} for all i ∈ N
for k = n, ..., 2:
    (i, j) ← argmin_{i < j; i, j ∈ N} d_C(C_ik, C_jk)
    C_i(k-1) ← C_ik ∪ C_jk
    C_l(k-1) ← C_lk for all l ∉ {i, j}, l ∈ N
    N ← N \ {j}

The C_lk are the k clusters in the k-th level of the hierarchy.
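In practice this loop is rarely hand-coded; scipy implements it. A sketch with random stand-in features (the 50 x 4 matrix is an assumption in place of the OGLE feature table):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))        # stand-in for the four OGLE features

D = pdist(X)                        # pairwise distances d_ij
Z = linkage(D, method="average")    # the agglomerative merge sequence
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
print(labels)
```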

How to Merge Clusters (What is d_C?)

Average linkage:
    d_C(C_i, C_j) = (1 / (#(C_i) #(C_j))) Σ_{x ∈ C_i} Σ_{x' ∈ C_j} d(x, x')
Complete linkage:
    d_C(C_i, C_j) = max_{x ∈ C_i, x' ∈ C_j} d(x, x')
Single linkage:
    d_C(C_i, C_j) = min_{x ∈ C_i, x' ∈ C_j} d(x, x')
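The three linkages are just the mean, max, and min of the pairwise distance matrix between two clusters, as this small numpy check illustrates (the two clusters are random points):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
Ci, Cj = rng.normal(size=(5, 2)), rng.normal(size=(8, 2))  # two clusters

D = cdist(Ci, Cj)     # d(x, x') for every x in Ci, x' in Cj
print(D.mean())       # average linkage
print(D.max())        # complete linkage
print(D.min())        # single linkage
```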

Constructing a Dendrogram
At iteration k, (i, j) = argmin_{i < j; i, j ∈ N} d_C(C_ik, C_jk). The height of this cluster merger is h_k = d_C(C_ik, C_jk). The sequence h_n, ..., h_2 is monotonically increasing. A plot of the cluster mergers at their heights is a dendrogram.
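scipy can draw the dendrogram directly; the merge heights h_k are the third column of the linkage output (random stand-in features again):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))          # stand-in feature matrix

Z = linkage(X, method="average")      # Z[:, 2] holds the heights h_k
dendrogram(Z)                         # mergers plotted at their heights
plt.ylabel("merge height h_k")
plt.show()
```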

Clustering Dendrogram of OGLE Data
[Figure: dendrogram of the OGLE sample]

Two Clusters
[Figure: dendrogram cut into two clusters]
[Figure: scatterplot matrix of log(period), skew, amp, and p2p_scatter, colored by the two-cluster assignment]

Three Clusters
[Figure: dendrogram cut into three clusters]
[Figure: scatterplot matrix of log(period), skew, amp, and p2p_scatter, colored by the three-cluster assignment]

Four Clusters
[Figure: dendrogram cut into four clusters]
[Figure: scatterplot matrix of log(period), skew, amp, and p2p_scatter, colored by the four-cluster assignment]

Outlier Identification
"Supervised detection of anomalous light curves in massive astronomical catalogs" Nun, Pichara, Protopapas, Kim, ApJ 2014 [6]
- Model the voting distribution of a random forest using a Bayesian network.
- Outliers have unusual voting patterns.
"Discovery of Bright Galactic R Coronae Borealis and DY Persei Variables: Rare Gems Mined from ACVS" Miller, Richards, Bloom, et al. [5]
- Find rare R CrB stars using a random forest classifier, human analysis of light curves, and spectroscopic follow-up.
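Nun et al. [6] model the random forest voting distribution with a Bayesian network; a much simpler stand-in for the same idea, sketched below, flags sources whose forest votes are unusually diffuse (all data here is synthetic):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

votes = rf.predict_proba(X)    # per-source voting distribution over classes
# toy outlier score: entropy of the votes (diffuse votes = unusual source)
entropy = -(votes * np.log(np.clip(votes, 1e-12, None))).sum(axis=1)
print(np.argsort(entropy)[-10:])   # the ten most ambiguous sources
```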

More on Clustering and Outlier Detection
Many knobs in clustering methods:
- features
- distance metric between features
- algorithm: hierarchical clustering, k-means, model-based, etc.
It is hard to statistically quantify a successful clustering, which may explain the popularity of classification. Opinion: variable star classification sits between the statistical concepts of clustering, classification, and outlier detection.

Conclusions and Opportunities

Conclusions
Statistical classification:
1. select training data
2. extract features
3. build classifier
4. apply classifier to unlabeled data
- Training data should look like the unlabeled data.
- Classification as practiced in statistics does not always fit perfectly with what astronomers want to do.

Opportunities for Learning More / Project Ideas
- join Working Group 2 (WG2)
- compete in the WG2 classification challenge (starting October?)
- machine learning tutorial on SDSS: https://github.com/juramaga/clustering/blob/master/machine-learning-on-sdss.ipynb
- "Modeling Light Curves for Improved Classification", Faraway, J., Mahabal, A., et al. 2014 [3]; data available at http://people.bath.ac.uk/jjf23/modlc/
- outlier hunting in a data set, e.g. OGLE: http://ogledb.astrouw.edu.pl/~ogle/cvs/

Upcoming Topics in Time Domain
- astronomical motivation for time domain / variable sources:
  - distance determination: period-luminosity relation
  - expansion of the universe
- feature extraction / modeling light curves
- example problem: mapping the Milky Way halo with RR Lyrae

Bibliography

[1] Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. Classification and Regression Trees. Chapman & Hall/CRC, 1984.
[2] P. Dubath, L. Rimoldini, M. Süveges, J. Blomme, M. López, L. M. Sarro, J. De Ridder, J. Cuypers, L. Guy, I. Lecoeur, et al. Random forest automated supervised classification of Hipparcos periodic variable stars. Monthly Notices of the Royal Astronomical Society, 414(3):2602-2617, 2011.
[3] Julian Faraway, Ashish Mahabal, Jiayang Sun, Xiaofeng Wang, Lingsong Zhang, et al. Modeling light curves for improved classification. arXiv preprint arXiv:1401.3211, 2014.
[4] J. P. Long, N. El Karoui, J. A. Rice, J. W. Richards, and J. S. Bloom. Optimizing automated classification of variable stars in new synoptic surveys. Publications of the Astronomical Society of the Pacific, 124(913):280-295, 2012.
[5] A. A. Miller, J. W. Richards, J. S. Bloom, S. B. Cenko, J. M. Silverman, D. L. Starr, and K. G. Stassun. Discovery of bright Galactic R Coronae Borealis and DY Persei variables: rare gems mined from ACVS. The Astrophysical Journal, 755(2):98, 2012.
[6] Isadora Nun, Karim Pichara, Pavlos Protopapas, and Dae-Won Kim. Supervised detection of anomalous light curves in massive astronomical catalogs. The Astrophysical Journal, 793(1):23, 2014.
[7] J. W. Richards, D. L. Starr, N. R. Butler, J. S. Bloom, J. M. Brewer, A. Crellin-Quick, J. Higgins, R. Kennedy, and M. Rischard. On machine-learned classification of variable stars with sparse and noisy time-series data. The Astrophysical Journal, 733:10, 2011.