A Correntropy based Periodogram for Light Curves & Semi-supervised classification of VVV periodic variables


1 A Correntropy based Periodogram for Light Curves & Semi-supervised classification of VVV periodic variables
Los Pablos (P4J): Pablo Huijse, Pavlos Protopapas, Pablo Estévez, Pablo Zegers, Jose Principe
Harvard-Chile Data Science School

2 Motivation
Light curve analysis challenges:
- Uneven sampling
- Different noise sources and heteroscedastic errors
- Light curves may have few points
- Databases can be huge

3 Least Squares Spectral Analysis (LSSA)
For a given frequency, the Lomb-Scargle (LS) power is equivalent to the L2 norm of the coefficients of the sinusoidal model that best fits the data in a least squares sense. The Generalized Lomb-Scargle (GLS) periodogram extends this with a floating mean and per-sample error weighting.
M. Zechmeister & M. Kürster, The Generalized Lomb-Scargle Periodogram, A&A, 2009
J. VanderPlas & Z. Ivezic, Periodograms for Multiband Astronomical Time Series, ApJ, 2015
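The slides show the GLS equations as images (lost in this transcription). Purely as a reference point, the same quantity can be computed with astropy's LombScargle, whose error-weighted, floating-mean form matches the GLS of Zechmeister & Kürster; the toy data below are mine, not from the talk:

import numpy as np
from astropy.timeseries import LombScargle

# Toy unevenly sampled light curve: times t, values y, uncertainties dy
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 100.0, 100))
dy = np.full(t.size, 0.5)
y = np.sin(2 * np.pi * 2.0 * t) + rng.normal(0.0, dy)

# Passing dy yields the error-weighted, floating-mean (GLS) periodogram
frequency, power = LombScargle(t, y, dy).autopower(maximum_frequency=5.0)
best_frequency = frequency[np.argmax(power)]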

4 Maximum correntropy criterion
For two arbitrary random variables X and Y with N realizations, the sample correntropy is
V_σ(X, Y) = (1/N) Σ_{i=1}^{N} G_σ(x_i − y_i),  with Gaussian kernel G_σ(e) = exp(−e²/2σ²) / (√(2π) σ)
- Generalizes correlation to higher-order moments
- Samples are compared through a kernel
- Parameter: the kernel bandwidth σ
J. Principe, Information Theoretic Learning, Springer, 2010
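In code, the estimator above is essentially one line of numpy (the function name is mine):

import numpy as np

def correntropy(x, y, sigma):
    """Sample correntropy: average Gaussian-kernel similarity of the
    paired realizations x_i, y_i, with kernel bandwidth sigma."""
    e = np.asarray(x) - np.asarray(y)
    return np.mean(np.exp(-0.5 * (e / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma))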

5 Maximum correntropy criterion
- Fit a model to the data by maximizing the correntropy between observations and model
- MCC is equivalent to maximizing the pdf of the error at e = 0 (Principe 2010)
- An M-estimator: robust to non-Gaussian noise and outliers
- Assumes homoscedastic noise
J. Principe, Information Theoretic Learning, Springer, 2010

6 Weighted Maximum Correntropy Criterion
Simple sample weighting through the kernel bandwidth. Fixed-point updates:
1. Assume sigma fixed and update the model coefficients beta
2. Assume beta fixed and update sigma
3. Check WMCC convergence: stop, or go back to 1
A sketch of this loop follows.
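The slide gives the updates only as equations (lost here). Below is a minimal single-harmonic sketch of the alternating scheme, assuming the per-sample kernel bandwidth combines a global sigma with each measurement error dy_i; generic numerical optimization stands in for the closed-form fixed-point updates of the talk:

import numpy as np
from scipy.optimize import minimize

def wmcc(params, t, y, dy, f0, sigma):
    """Weighted maximum correntropy objective for a single-harmonic model.
    Each residual is compared through a Gaussian kernel whose bandwidth
    widens with the reported uncertainty dy_i (assumed weighting)."""
    a, b, c = params
    e = y - (a * np.sin(2 * np.pi * f0 * t) + b * np.cos(2 * np.pi * f0 * t) + c)
    s2 = sigma ** 2 + dy ** 2  # per-sample squared bandwidth
    return np.mean(np.exp(-0.5 * e ** 2 / s2) / np.sqrt(2 * np.pi * s2))

def fit_wmcc(t, y, dy, f0, sigma=1.0, n_iter=10):
    beta = np.zeros(3)
    for _ in range(n_iter):
        # 1. sigma fixed: update the model coefficients beta
        beta = minimize(lambda p: -wmcc(p, t, y, dy, f0, sigma), beta).x
        # 2. beta fixed: update the global bandwidth sigma
        sigma = minimize(lambda s: -wmcc(beta, t, y, dy, f0, s[0]),
                         [sigma], bounds=[(1e-3, None)]).x[0]
    return beta, sigma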

7 Statistical test for period significance
- LS has an analytical expression for the false alarm probability (assumes Gaussian noise)
- Generalized Extreme Value (GEV) statistics: the maxima from several realizations of an experiment follow a GEV distribution
- Recipe: bootstrap the data, find the periodogram maxima on a subset of frequencies, fit a GEV, compute the false alarm probability [1]
[1] M. Süveges, Extreme-value modelling for the significance assessment of periodogram peaks, MNRAS, 2012
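The talk implements this through P4J's fit_extreme_cdf (shown on later slides). As a standalone illustration, here is a hedged sketch of the recipe with scipy and astropy; the frequency subset and the resampling scheme are my assumptions, not the paper's exact procedure:

import numpy as np
from astropy.timeseries import LombScargle
from scipy.stats import genextreme

def bootstrap_fap(t, y, dy, peak_value, n_bootstrap=100, n_frequencies=100):
    """Fit a GEV distribution to periodogram maxima obtained on bootstrap
    resamples, then return the false alarm probability of an observed peak."""
    rng = np.random.default_rng(0)
    freqs = np.linspace(0.1, 5.0, n_frequencies)  # subset of trial frequencies
    maxima = []
    for _ in range(n_bootstrap):
        idx = rng.integers(0, len(t), len(t))  # resample magnitudes, times fixed
        maxima.append(LombScargle(t, y[idx], dy[idx]).power(freqs).max())
    c, loc, scale = genextreme.fit(maxima)
    return genextreme.sf(peak_value, c, loc=loc, scale=scale)  # P(max > peak)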

8 Synthetic test
- Simple irregular sampling: generate a linearly spaced time vector, add jitter proportional to 1/Fs, discard 80% of the points
- Model:

import P4J
t = P4J.irregular_sampling(T=100.0, N=100)
y_clean = P4J.trigonometric_model(t, f0=2.0, A=[1.0, 0.5, 0.25])
y, y_noisy, dy = P4J.contaminate_time_series(t, y_clean, SNR=0.0, red_noise_ratio=0.25, outlier_ratio=0.0)

9 Example (SNR=10.0, red_noise_var=0.25)

import numpy as np
my_per = P4J.periodogram(M=3, method='wmcc')
my_per.fit(t, y_noisy, dy)
freq, per = my_per.grid_search(0.0, 5.0, 1.0, 0.1, n_local_max=10)
my_per.fit_extreme_cdf(n_bootstrap=100, n_frequencies=100)
per_levels = my_per.get_fap(np.asarray([0.05, 0.01, 0.001]))

10 Example (same code as slide 9, now with SNR=0.0, red_noise_var=0.25)

11 Results
- Performance: the % of cases where the relative error is below a tolerance
- Confidence: the average significance at f = f0
No outliers; red_noise_ratio = 0.0, i.e. the noise is perfectly explained by the uncertainties.
10 random time vectors, 100 noise realizations.

12 Results
- Performance: the % of cases where the relative error is below a tolerance
- Confidence: the average significance at f = f0
No outliers; red_noise_ratio = 1/8.
10 random time vectors, 100 noise realizations.

13 Results
- Performance: the % of cases where the relative error is below a tolerance
- Confidence: the average significance at f = f0
No outliers; red_noise_ratio = 1/4.
10 random time vectors, 100 noise realizations.

14 VISTA Variables in the Vía Láctea (VVV)
ESO public survey. Most measurements in the Ks band (near infrared) using 7 apertures. Goal: study the structure of the Galactic bulge and the origin of our Galaxy.

15 VISTA Variables in the Vía Láctea (VVV)
F. Gran et al. 2016:
- 1,019 RRab light curves
- Fields b201-b228 (~47 sq. deg)
- Detected with AoV, corrected manually

16 VISTA Variables in the Vía Láctea (VVV): Method
1. Grab light curves (LC) from fields b201-b228 (~47 sq. deg)
2. Discard LC with chi2 < 2.0 and N < 30
3. Discard LC with periodicity confidence below a threshold
4. Create features
5. Semi-supervised PU classification
A toy sketch of the filtering steps follows.
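For concreteness, a toy sketch of the cuts in steps 2-3; the summary table, file name, column names, and threshold value are hypothetical, not from the talk:

import pandas as pd

# Hypothetical per-light-curve summary table; one row per light curve.
lcs = pd.read_csv('vvv_light_curve_stats.csv')

th = 0.99  # assumed periodicity-confidence threshold
# Keep light curves passing the variability, length, and confidence cuts
keep = (lcs['chi2'] >= 2.0) & (lcs['n_points'] >= 30) & (lcs['confidence'] >= th)
candidates = lcs[keep]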

17 Analysis of VVV periodic variables
Take the first N light curves sorted by periodicity confidence and measure:
- Lost RRL: the number of reported RRL missing from this set
- The relative error of the detected periods vs. the reported periods

                     N = 10,000    N = 20,000    N = 50,000
Lomb-Scargle         154 / ...     ... / ...     ... / ...
Generalized LS       554 / ...     ... / ...     ... / ...
WMCC periodogram     75 / ...      ... / ...     ... / 0.018

(entries: lost RRL / relative error)

18-21 Analysis of VVV periodic variables (figure-only slides)

22 Semi-supervised and PU Learning
Semi-supervised classification assumptions:
- Manifold assumption
- Clustering assumption
- Low-density separation
Approaches: self-learning, graph-based, avoiding changes in dense regions [1]
Our setting: >10,000 unlabeled periodic light curves, ~1,000 labeled RRab (positive class), and no other survey to crossmatch: a Positive/Unlabeled (PU) scenario.
Images taken from Wikipedia.
[1] X. Zhu, Semi-supervised Learning Literature Survey, 2005 (online, public)

23 Efficient SS/PU Learning
Bagging PU [1] (transductive version):
- Positive dataset (size NP), unlabeled dataset (size NU)
- Draw T bootstrap sets from the unlabeled data (size K each)
- Train T weak learners on NP+K points; predict on the out-of-bag (OOB) set, i.e. the NU - unique[K] unlabeled points left out
- Average the OOB predictions
No graph computation / few parameters / highly parallel / simple. Code: github.com/phuijse/bagging_pu
[1] F. Mordelet and J.-P. Vert, A bagging SVM to learn from positive and unlabeled examples, Pattern Recognition Letters, v. 37, 2014
[2] M. Claesen et al., A robust ensemble approach to learn from positive and unlabeled data using SVM base models, Neurocomputing, 2014
[3] M. Claesen et al., Assessing binary classifiers using only positive and unlabeled data, Neurocomputing, 2015
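A compact sketch of this recipe with scikit-learn; decision trees stand in for the SVM base learners of [1], and the helper below is a simplified reading of the algorithm, not the repository's code:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_pu(X_pos, X_unl, T=100, K=None):
    """Transductive bagging PU: train T weak learners on all positives plus
    a bootstrap of K unlabeled points treated as negatives, and average each
    unlabeled point's out-of-bag scores."""
    rng = np.random.default_rng(0)
    NP, NU = len(X_pos), len(X_unl)
    K = K or NP
    X = np.vstack([X_pos, X_unl])
    y = np.concatenate([np.ones(NP), np.zeros(NU)])
    score_sum, score_cnt = np.zeros(NU), np.zeros(NU)
    for _ in range(T):
        boot = rng.integers(0, NU, K)               # bootstrap of unlabeled data
        train = np.concatenate([np.arange(NP), NP + boot])
        oob = np.setdiff1d(np.arange(NU), boot)     # out-of-bag unlabeled points
        clf = DecisionTreeClassifier().fit(X[train], y[train])
        score_sum[oob] += clf.predict_proba(X_unl[oob])[:, 1]
        score_cnt[oob] += 1
    return score_sum / np.maximum(score_cnt, 1)     # P(positive) per unlabeled LC

Averaging only out-of-bag scores is what makes the scheme transductive: each unlabeled light curve is scored by the learners that never saw it labeled as a negative.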

24-25 Analysis of VVV periodic variables: t-SNE visualizations
L.J.P. van der Maaten and G.E. Hinton, Visualizing High-Dimensional Data Using t-SNE, JMLR, 2008

26 Analysis of VVV periodic variables (figure-only slide; references [1]-[3] as on slide 23)

27 Probability in [0.95, 1.00]: 132 light curves

28 Probability in [0.85, 0.95]: 101 light curves

29 Probability in [0.65, 0.85]: 102 light curves

30 Probability in [0.50, 0.65]: 82 light curves

31 Conclusions and future work
- Periodicity detection based on information-theoretic functionals is more precise and less sensitive to false positives
- New set of VVV RR Lyrae candidates to confirm, and more fields to run
- Compare with more periodicity detection methods (Conditional Entropy, AoV, PDM); test different features (FATS) and PU/SS methods
- Test other surveys (Pan-STARRS, CRTS, synthetic LSST light curves)
- Improve computational implementations
LINKS: pypi.python.org/pypi/p4j | github.com/phuijse/p4j | github.com/phuijse/bagging_pu
