Los Pablos (P4J): A Correntropy-based Periodogram for Light Curves & Semi-supervised Classification of VVV Periodic Variables
Pablo Huijse, Pavlos Protopapas, Pablo Estévez, Pablo Zegers, Jose Principe
Harvard-Chile Data Science School
Motivation
Light curve analysis challenges:
- Uneven (irregular) sampling
- Multiple noise sources and heteroscedastic errors
- Light curves may have few points
- Databases can be huge
Picture sources: http://www.atnf.csiro.au/ and http://www.hao.ucar.edu/
Least Squares Spectral Analysis (LSSA)
For a given frequency, the Lomb-Scargle (LS) power is equivalent to the L2 norm of the coefficients of the sinusoidal model that best fits the data in the least-squares sense. The Generalized Lomb-Scargle (GLS) periodogram extends LS with a floating mean and weights derived from the measurement errors.
M. Zechmeister & M. Kürster, The generalised Lomb-Scargle periodogram, A&A, 2009
J. VanderPlas & Z. Ivezic, Periodograms for Multiband Astronomical Time Series, ApJ, 2015
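As a sketch of this least-squares view (not P4J's implementation: unweighted and without the GLS floating mean; the function name and frequency grid are illustrative assumptions):

```python
import numpy as np

# Sketch of the least-squares view of the periodogram: at each trial
# frequency, fit a*sin + b*cos by least squares and take ||(a, b)||^2
# as the power. Unweighted, no floating mean; illustrative only.

def lssa_power(t, y, freqs):
    y = y - y.mean()
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        Phi = np.column_stack([np.sin(2 * np.pi * f * t),
                               np.cos(2 * np.pi * f * t)])
        beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # best-fit (a, b)
        power[i] = beta @ beta                          # squared L2 norm
    return power
```

On an irregularly sampled noisy sinusoid, the peak of `lssa_power` recovers the underlying frequency.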
Maximum correntropy criterion
Correntropy, for two arbitrary random variables with N realizations:
- Generalizes correlation to higher-order moments
- Samples are compared through a kernel
- Free parameter: the kernel bandwidth
J. Principe, Information Theoretic Learning, Springer, 2010
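For reference, the sample correntropy estimator implied above can be written as (a standard form from Principe 2010; the Gaussian kernel is the usual choice and an assumption here):

```latex
\hat{V}_\sigma(X, Y) = \frac{1}{N} \sum_{i=1}^{N} G_\sigma(x_i - y_i),
\qquad
G_\sigma(e) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{e^2}{2\sigma^2}\right)
```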
Maximum correntropy criterion
- Fit a model to the data by maximizing correntropy
- MCC is equivalent to maximizing the PDF of the error at e = 0 (Principe 2010)
- It is an M-estimator: robust to non-Gaussian noise and outliers
- Assumes homoscedastic noise
J. Principe, Information Theoretic Learning, Springer, 2010
Weighted Maximum Correntropy Criterion (WMCC)
Simple per-sample weighting through the kernel bandwidth. Fixed-point updates:
1. Assume sigma fixed and update the model coefficients (beta)
2. Assume beta fixed and update sigma
3. Check WMCC convergence: stop, or go back to 1
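A minimal sketch of the MCC fit via a fixed-point scheme (not P4J's implementation: the bandwidth sigma is kept fixed rather than updated as in step 2, and the sinusoidal model, names and values are assumptions). Residuals receive Gaussian-kernel weights, so large outlier residuals get weight near zero in the weighted least-squares update:

```python
import numpy as np

# Illustrative MCC fit: alternate between (a) kernel weights from the
# current residuals and (b) a weighted least-squares solve for the
# coefficients. Sigma is held fixed here for simplicity.

def mcc_fit(t, y, f0, sigma=0.5, n_iter=30):
    """Fit y ~ a*sin(2*pi*f0*t) + b*cos(2*pi*f0*t) by maximizing correntropy."""
    Phi = np.column_stack([np.sin(2 * np.pi * f0 * t),
                           np.cos(2 * np.pi * f0 * t)])
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares start
    for _ in range(n_iter):
        e = y - Phi @ beta
        w = np.exp(-e**2 / (2 * sigma**2))           # outliers -> weight ~ 0
        W = Phi * w[:, None]
        beta = np.linalg.solve(Phi.T @ W, W.T @ y)   # weighted LS update
    return beta
```

With a few large outliers, the plain least-squares start is biased while the fixed-point iterations recover the clean coefficients.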
Statistical test for period significance
- LS has an analytical expression for the false alarm probability (assumes Gaussian noise)
- Generalized Extreme Value (GEV) statistics: the maxima of several realizations of an experiment follow a GEV distribution
- Procedure: bootstrap the light curve, find the periodogram maxima on a subset of frequencies, fit a GEV, compute the false alarm probability [1]
[1] M. Suveges, Extreme-value modelling for the significance assessment of periodogram peaks, MNRAS, 2012
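The bootstrap step can be sketched as follows (illustrative, not P4J's `fit_extreme_cdf`: it uses the empirical distribution of the bootstrap maxima directly instead of fitting a GEV, and a bare unweighted least-squares periodogram):

```python
import numpy as np

# Bootstrap false-alarm probability sketch: shuffling the magnitudes
# destroys any coherent periodicity, so the maxima of the shuffled
# periodograms characterize the noise; the FAP is the fraction of those
# maxima that exceed the observed peak.

def sin_power(t, y, freqs):
    """L2 norm of the least-squares sinusoid coefficients at each frequency."""
    y = y - y.mean()
    powers = []
    for f in freqs:
        Phi = np.column_stack([np.sin(2 * np.pi * f * t),
                               np.cos(2 * np.pi * f * t)])
        beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        powers.append(beta @ beta)
    return np.array(powers)

def bootstrap_fap(t, y, freqs, peak_power, n_boot=100, seed=0):
    """Fraction of shuffled-magnitude periodogram maxima above the observed peak."""
    rng = np.random.default_rng(seed)
    maxima = [sin_power(t, rng.permutation(y), freqs).max()
              for _ in range(n_boot)]
    return np.mean(np.array(maxima) >= peak_power)
```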
Synthetic test
Simple irregular sampling: generate a linearly spaced time vector, add jitter proportional to 1/Fs, and discard 80% of the points. Model:

import P4J
t = P4J.irregular_sampling(T=100.0, N=100)
y_clean = P4J.trigonometric_model(t, f0=2.0, A=[1.0, 0.5, 0.25])
y, y_noisy, dy = P4J.contaminate_time_series(t, y_clean, SNR=0.0, red_noise_ratio=0.25, outlier_ratio=0.0)
Example (shown for SNR = 10.0 and SNR = 0.0, both with red_noise_var = 0.25):

my_per = P4J.periodogram(M=3, method='wmcc')
my_per.fit(t, y_noisy, dy)
freq, per = my_per.grid_search(0.0, 5.0, 1.0, 0.1, n_local_max=10)
my_per.fit_extreme_cdf(n_bootstrap=100, n_frequencies=100)
per_levels = my_per.get_fap(np.asarray([0.05, 0.01, 0.001]))
Results
- Performance: the % of cases where the relative error is below a tolerance (tol)
- Confidence: the average significance at f = f0
Setup: 10 random time vectors, 100 noise realizations, no outliers. Three settings:
- red_noise_ratio = 0.0, i.e. the noise is perfectly explained by the uncertainties
- red_noise_ratio = 1/8
- red_noise_ratio = 1/4
VISTA Variables of the Via Lactea (VVV) ESO survey. Most measurements in the K band (near infrared) using 7 apertures. Public Survey. Study the structure of the Galactic bulge and the origin of our galaxy.
VISTA Variables of the Via Lactea (VVV) — F. Gran et al. 2016:
- 1,019 RRab light curves
- Fields b201-b228 (~47 sq. deg)
- Detected with AoV, corrected manually
VISTA Variables of the Via Lactea (VVV) — Method
1. Grab light curves (LC) from fields b201-b228 (~47 sq. deg)
2. Discard LC with chi2 < 2.0 and LC with N < 30
3. Discard LC with periodogram confidence below a threshold
4. Create features
5. Semi-supervised PU classification
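Steps 2-3 of the selection can be sketched as a simple filter (the record fields and the confidence threshold are hypothetical names for illustration, not the survey pipeline's actual ones):

```python
# Hypothetical light-curve selection: keep objects that are variable
# (chi2 cut), well sampled (epoch cut) and significantly periodic
# (periodogram confidence cut). Field names and thresholds are assumptions.

def select_candidates(light_curves, chi2_min=2.0, n_min=30, conf_min=0.5):
    """Keep light curves passing the variability, sampling and periodicity cuts."""
    return [lc for lc in light_curves
            if lc["chi2"] >= chi2_min           # variability cut
            and lc["n_points"] >= n_min         # enough epochs
            and lc["confidence"] >= conf_min]   # periodogram significance
```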
Analysis of VVV periodic variables
First N light curves sorted by periodicity confidence. Each cell reports: the number of reported RRL missing from this set (lost RRL) / the relative error of the detected periods vs. the reported periods.

                    N = 10,000    N = 20,000    N = 50,000
Lomb-Scargle        154 / 0.019   66 / 0.025    32 / 0.043
Generalized LS      554 / 0.011   94 / 0.030    11 / 0.038
WMCC periodogram     75 / 0.007   54 / 0.011     6 / 0.018
Analysis of VVV periodic variables
Semi-supervised and PU Learning
- Semi-supervised classification rests on the manifold assumption, the clustering assumption and low-density separation
- Families of methods: self-learning / graph-based / avoiding changes in dense regions [1]
- Our setting: >10,000 unlabeled periodic light curves, ~1,000 labeled RRab (positive class), and no other survey to crossmatch: a Positive/Unlabeled (PU) scenario
[1] X. Zhu, Semi-supervised Learning Literature Survey, 2005 (online, public)
Efficient SS/PU Learning — Bagging PU [1] (transductive version)
1. Start from the positive dataset (size NP) and the unlabeled dataset (size NU)
2. Draw T bootstrap sets (size K) from the unlabeled data
3. Train T weak learners (on NP + K samples) and predict on the out-of-bag (OOB) set (NU - unique[K])
4. Average the OOB predictions
No graph computation / few parameters / highly parallel / simple. github.com/phuijse/bagging_pu
[1] F. Mordelet and J.-P. Vert, A bagging SVM to learn from positive and unlabeled examples, Pattern Recognition Letters, v. 37, 2014
[2] M. Claesen et al., A robust ensemble approach to learn from positive and unlabeled data using SVM base models, Neurocomputing, 2014
[3] M. Claesen et al., Assessing binary classifiers using only positive and unlabeled data, Neurocomputing, 2015
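The four steps above can be sketched in a dependency-free form (not the bagging_pu repository's actual code: the weak learner here is a nearest-centroid scorer, chosen only to keep the example self-contained, and all names are illustrative):

```python
import numpy as np

# Transductive bagging PU sketch: each round treats a bootstrap of the
# unlabeled set as "negatives", trains a weak learner against the
# positives, scores the out-of-bag unlabeled samples, and the scores
# are averaged over rounds.

def bagging_pu(X_pos, X_unl, T=100, K=20, seed=0):
    """Return an averaged positive-class score for each unlabeled sample."""
    rng = np.random.default_rng(seed)
    n_unl = len(X_unl)
    scores = np.zeros(n_unl)
    counts = np.zeros(n_unl)
    mu_pos = X_pos.mean(axis=0)                        # positive centroid
    for _ in range(T):
        idx = rng.choice(n_unl, size=K, replace=True)  # bootstrap "negatives"
        mu_neg = X_unl[idx].mean(axis=0)
        oob = np.setdiff1d(np.arange(n_unl), idx)      # out-of-bag unlabeled
        d_pos = np.linalg.norm(X_unl[oob] - mu_pos, axis=1)
        d_neg = np.linalg.norm(X_unl[oob] - mu_neg, axis=1)
        scores[oob] += d_neg - d_pos                   # larger -> more positive-like
        counts[oob] += 1
    return scores / np.maximum(counts, 1)
```

On synthetic data with hidden positives mixed into the unlabeled set, the averaged OOB score separates hidden positives from hidden negatives.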
Analysis of VVV periodic variables
L.J.P. van der Maaten and G.E. Hinton, Visualizing Data using t-SNE, JMLR, 2008
Analysis of VVV periodic variables
Candidates by predicted probability (Pbb):
- Pbb in [0.95, 1.00]: 132 light curves
- Pbb in [0.85, 0.95]: 101 light curves
- Pbb in [0.65, 0.85]: 102 light curves
- Pbb in [0.50, 0.65]: 82 light curves
Conclusions and future work
- Periodicity detection based on information-theoretic functionals is more precise and less sensitive to false positives
- New set of VVV RR Lyrae candidates to confirm, and more fields to run
- Compare with more periodicity detection methods (Conditional Entropy, AoV, PDM); test different features (FATS) and PU/SS methods
- Test other surveys (Pan-STARRS, CRTS, synthetic LSST light curves)
- Improve the computational implementations
LINKS: pypi.python.org/pypi/p4j - github.com/phuijse/p4j - github.com/phuijse/bagging_pu