The Challenges of Heteroscedastic Measurement Error in Astronomical Data

Eric Feigelson, Penn State
Measurement Error Workshop, TAMU, April 2016

Outline
1. Measurement errors are ubiquitous in astronomy, but different from those in social/bio sciences
2. Progress in measurement errors in bivariate linear regression & density, but [3 slides]
3. Little progress for correlation coefficients; multivariate regression, clustering & classification; time series analysis; spatial point processes; etc. [2 slides]
4. Measurement errors and left-censoring need to be treated in a unified fashion [3 slides]
A Call to Arms: New methodology is urgently needed!!

Astronomers simultaneously measure observed quantities and their heteroscedastic error variances
Figures:
o Infrared image of stars with instrumental & astrophysical background [UKIDSS]
o Visible spectrum of distant galaxy [SDSS/BOSS]
o Visible/UV images of a distant galaxy with nondetections [HST]
o X-ray image with Poisson background & sources [CXO]
Astronomers are extremely careful & quantitative in their observations:
o Exposure time, detector noise, calibration are known in advance
o Scientifically interesting fluxes are compared to source-free background noise

Science results from random studies
o X-ray spectrum of a BL Lac object
o Radio observations of 3 T Tauri stars
o Structure of interstellar cloud
o Gamma-ray search from galaxy cluster
o Stellar atmospheres for planet search
o Metallicity of Galactic halo stars
o Dwarf galaxy structure & environment
o Gamma-ray structures near the Galactic Center

Heteroscedastic measurement errors (MEs) with known variances are ubiquitous in astronomical research, in both the observational data and astrophysical modeling
o MEs can be present in all variables of a multivariate dataset
o The same datasets can have non-detections (left-censored data points) when the signal does not exceed ~3x the ME
Bibliometrics
o ~9000 papers appear annually in the 4 principal journals (MNRAS, ApJ, AJ, A&A), of which ~70% have MEs
o Full text is available from the ADS abstract service (adswww.harvard.edu)
o Electronic tables can be retrieved from the journal Web sites or from the Vizier archive service (vizier.u-strasbg.fr)

Statistical methods for datasets with heteroscedastic MEs as data inputs (not model outputs) are not well-treated in standard books on ME methodology
o Fuller, Measurement Error Models, 2006
o Carroll, Ruppert, Stefanski, Crainiceanu, Measurement Error in Nonlinear Models: A Modern Perspective, 2006
o Buonaccorsi, Measurement Error: Models, Methods, and Applications, 2010

Regression with heteroscedastic MEs
Astronomers frequently engage in regression to find best-fit parameters for either simple heuristic (e.g. linear, power law) or complicated nonlinear astrophysical models. The most common regression technique for many decades has been least squares weighted by the measurement errors, nicknamed "minimum chi-squared regression".
Is this procedure used in other fields?
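As a concrete illustration of the weighted least squares idea, here is a minimal sketch for a straight-line model y = a + b*x. It assumes the usual textbook form of the statistic, chi2 = sum_i (y_i - a - b*x_i)^2 / sigma_i^2, with each sigma_i the known measurement error on y_i and no error on x. The function name `weighted_linear_fit` and the sample numbers are made up for illustration; they do not come from the talk.

```python
import math

def weighted_linear_fit(x, y, sigma):
    """Fit y = a + b*x by minimizing chi2 = sum(((y_i - a - b*x_i) / sigma_i)**2).

    sigma holds the known (heteroscedastic) standard deviations of the y values.
    Returns (a, b, a_err, b_err, chi2).
    """
    w = [1.0 / s**2 for s in sigma]                      # inverse-variance weights
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = S * Sxx - Sx**2                              # determinant of the normal equations
    a = (Sxx * Sy - Sx * Sxy) / delta                    # intercept
    b = (S * Sxy - Sx * Sy) / delta                      # slope
    a_err = math.sqrt(Sxx / delta)                       # formal errors, valid only if the
    b_err = math.sqrt(S / delta)                         # sigmas account for all the scatter
    chi2 = sum(wi * (yi - a - b * xi)**2 for wi, xi, yi in zip(w, x, y))
    return a, b, a_err, b_err, chi2

# Illustrative (made-up) data: four points with unequal error bars.
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [5.2, 7.9, 11.3, 13.8]
sigma_demo = [0.5, 1.0, 2.0, 1.0]
a, b, a_err, b_err, chi2 = weighted_linear_fit(x_demo, y_demo, sigma_demo)
print(f"a = {a:.3f} +/- {a_err:.3f}, b = {b:.3f} +/- {b_err:.3f}, chi2 = {chi2:.3f}")
```

The formal parameter errors returned here are only meaningful when the measurement errors explain the full scatter about the line.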

Progress on linear regression
Problems with min-χ² regression arise when the total scatter is only partially due to MEs. This formulation is also weak at model selection, confidence intervals, etc.
Modified least squares procedures are widely used for the case of heteroscedastic MEs in both X and Y variables, and equation error in Y:
o FITEXY (Numerical Recipes, 1990s)
o BCES (Akritas & Bershady 1996)
Similar to homoscedastic debiased slopes in EIV regression.
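To make the "errors in both X and Y" case concrete, the sketch below minimizes an effective-variance statistic of the kind used by FITEXY-style fits: chi2(a, b) = sum_i (y_i - a - b*x_i)^2 / (sigma_y,i^2 + b^2 * sigma_x,i^2). This is a simplified illustration, not the Numerical Recipes routine. The function names, the golden-section search, the slope bracket, and the optional fixed `sigma_int` term (an extra scatter added in quadrature to stand in for equation error) are all assumptions made here.

```python
import math

def effective_chi2(b, x, y, sx, sy, sigma_int=0.0):
    """Return (chi2, a) for slope b, with the intercept a profiled out analytically."""
    # Effective variance of each residual: y error, projected x error, optional extra scatter.
    w = [1.0 / (syi**2 + (b * sxi)**2 + sigma_int**2) for sxi, syi in zip(sx, sy)]
    a = sum(wi * (yi - b * xi) for wi, xi, yi in zip(w, x, y)) / sum(w)
    chi2 = sum(wi * (yi - a - b * xi)**2 for wi, xi, yi in zip(w, x, y))
    return chi2, a

def fit_line_xy_errors(x, y, sx, sy, b_lo, b_hi, sigma_int=0.0, tol=1e-8):
    """Golden-section search for the slope in [b_lo, b_hi].

    Assumes chi2 has a single minimum inside the bracket (an assumption of this sketch).
    Returns (a, b, chi2).
    """
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    f = lambda b: effective_chi2(b, x, y, sx, sy, sigma_int)[0]
    lo, hi = b_lo, b_hi
    while hi - lo > tol:
        c = hi - invphi * (hi - lo)
        d = lo + invphi * (hi - lo)
        if f(c) < f(d):
            hi = d          # minimum lies in [lo, d]
        else:
            lo = c          # minimum lies in [c, hi]
    b = 0.5 * (lo + hi)
    chi2, a = effective_chi2(b, x, y, sx, sy, sigma_int)
    return a, b, chi2

# Illustrative (made-up) data with errors on both axes.
x_demo = [0.1, 1.2, 1.9, 3.1, 4.0]
y_demo = [1.3, 2.9, 5.2, 6.8, 9.1]
sx_demo = [0.1, 0.2, 0.1, 0.3, 0.2]
sy_demo = [0.5, 0.4, 0.6, 0.5, 0.3]
a, b, chi2 = fit_line_xy_errors(x_demo, y_demo, sx_demo, sy_demo, b_lo=0.0, b_hi=5.0)
print(f"a = {a:.3f}, b = {b:.3f}, chi2 = {chi2:.3f}")
```

Because the weights depend on the slope, the problem is no longer linear in the parameters, which is why a one-dimensional search over b replaces the closed-form solution of ordinary weighted least squares.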

Maximum likelihood approach of Kelly (2007)
"Some Aspects of Measurement Error in Linear Regression of Astronomical Data", B.C. Kelly, Astrophysical Journal, 665, 1489-1506 (2007)
A complicated likelihood is written with a bivariate linear relationship, heteroscedastic MEs in both variables, and equation error, incorporated into a normal mixture model. The best fit is obtained with Bayesian inference using MCMC and a Gibbs sampler.
Kelly presents a flexible formulation that can be extended to nonlinear models, multivariate regression, censored and truncated data, and other problems arising in astronomical research. However, only the simpler applications are in use today.
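For orientation, the structure of such a model can be written out for the simplest case. This is a sketch in notation chosen here, not copied from the slide: ξ_i and η_i are the latent true values, x_i and y_i are the measured values with known error variances σ²_{x,i} and σ²_{y,i} (errors in x and y assumed uncorrelated), σ² is the equation-error variance, and the latent covariate follows a K-component normal mixture with weights π_k, means μ_k and variances τ²_k. Integrating out the latent values gives a mixture of bivariate normals for each observed pair.

```latex
\begin{aligned}
\xi_i &\sim \sum_{k=1}^{K} \pi_k\, N(\mu_k,\ \tau_k^2), \\
\eta_i \mid \xi_i &\sim N(\alpha + \beta\,\xi_i,\ \sigma^2), \\
x_i \mid \xi_i &\sim N(\xi_i,\ \sigma_{x,i}^2), \qquad
y_i \mid \eta_i \sim N(\eta_i,\ \sigma_{y,i}^2), \\
(y_i,\ x_i) &\sim \sum_{k=1}^{K} \pi_k\, N_2\!\left(
\begin{pmatrix} \alpha + \beta\mu_k \\ \mu_k \end{pmatrix},\
\begin{pmatrix}
\beta^2\tau_k^2 + \sigma^2 + \sigma_{y,i}^2 & \beta\,\tau_k^2 \\
\beta\,\tau_k^2 & \tau_k^2 + \sigma_{x,i}^2
\end{pmatrix}
\right).
\end{aligned}
```

The heteroscedastic MEs enter only through the per-point covariance matrix, so each observation contributes its own known error variances to the likelihood.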

Statistics in R with heteroscedastic MEs

Statistical function                                       CRAN package
Pearson correlation coefficient                            boot, psych, RankAggreg, weights
Chi-squared & t test                                       weights
Nonparametric density estimation                           decon
Regressions: MARS, ICR, FDA, k-nn, ridge, quantile         caret
Clustering: Hierarchical, K-means, K-medoids               WeightedCluster, weightedkmeans
Classification: CART, SVM, SOM, neural net                 caret, tree
Time series: Box-Pierce, Ljung-Box tests                   WeightedPortTest

Are these methods appropriate for astronomical weighted data?

How left-censored values are generated from astronomical data
Positions of 6 stars are known in advance from other studies
[Figure: field with the stars labeled 1-6]

Signal & noise for 6 stars
Star   Signal   Noise (+/-)
1      31       4
2      97       4
3      60       7
4      31       10
5      7        5
6      22       26

Undetected fluxes replaced with 3σ upper limits
Star   Upper limit
5      <15
6      <78
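The censoring rule in this example can be sketched in a few lines. The signal and noise values are the six stars from the slide. The function name `classify`, the tuple layout, and the choice of a strict "signal > 3σ" comparison at the boundary are assumptions made for illustration.

```python
# (signal, noise) for the six stars, taken from the slide's table.
stars = {
    1: (31, 4),
    2: (97, 4),
    3: (60, 7),
    4: (31, 10),
    5: (7, 5),
    6: (22, 26),
}

K_SIGMA = 3.0  # detection threshold in units of the known noise level


def classify(signal, noise, k=K_SIGMA):
    """Return ("detected", signal) or ("upper_limit", k * noise).

    A star counts as detected only if its signal exceeds k times its noise;
    otherwise the measured value is discarded and replaced by the k-sigma limit.
    """
    if signal > k * noise:
        return ("detected", float(signal))
    return ("upper_limit", k * noise)


catalog = {star: classify(s, n) for star, (s, n) in stars.items()}

for star, (status, value) in sorted(catalog.items()):
    prefix = "<" if status == "upper_limit" else ""
    print(f"star {star}: {prefix}{value:g}")
```

Note what the rule throws away: for stars 5 and 6 the measured signals (7 and 22) vanish from the catalog, and the reported limit depends only on the noise. Star 4, at a signal-to-noise of 3.1, is kept as a detection while a marginally noisier measurement would have been censored.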

Measurement errors, S/N and left-censoring
Left-censored data are currently treated with survival analysis techniques, where the censored value is exactly measured. But unlike biometrical and industrial reliability testing environments, where survival times are measured precisely, in astronomy the censored value is typically set at the 3σ upper limit, where σ is the known noise level.
A new statistical approach is needed that treats all observations (weighted detections and nondetections) in a self-consistent fashion.

Conclusion: A Call to Arms!!
o Lots of methodology is needed for the astronomical case where variances of measurement errors are easily obtained. Thousands of studies are involved from all branches of astronomy.
o Some of the needed methodology is probably straightforward. E.g. clarify and present existing procedures with CRAN packages.
o Some of the needed methodology requires creativity. E.g. a unified signal/noise detection & nondetection treatment.
o Astronomical journals are eager to publish methodology (AJ, MNRAS, A&C). Acceptance rates are high.
o These issues can be discussed in detail at the 2016-17 SAMSI program in astrostatistics.