Censoring and truncation in astronomical surveys. Eric Feigelson (Astronomy & Astrophysics, Penn State)

Outline
1. Examples of astronomical surveys
2. One survey: Truncation and spatial/luminosity distributions
3. Two or more surveys: Censoring and luminosity distributions & relationships
4. A call to arms

Faint Images of the Radio Sky at Twenty-centimeters (FIRST): 811,000 radio sources

Infrared Astronomical Satellite (IRAS), early 1980s: 350,000 mid-IR sources in all-sky survey

Two-Micron All Sky Survey (2MASS), early 2000s: 300,000,000 stars and galaxies

Sloan Digital Sky Survey (SDSS), 2000s: 100,000,000 visual-band stars and galaxies plus 1,000,000 spectra

ROSAT All-Sky Survey (RASS): 100,000 X-ray sources plus diffuse emission

Energetic Gamma-Ray Experiment Telescope (EGRET) on the Compton Gamma-Ray Observatory, 1990s: 271 sources with E > 100 MeV

Star counts: The first flux-limited surveys
For a uniform population of objects distributed randomly in transparent space:
$S = L / (4 \pi D^2)$
$V = \frac{4}{3} \pi D^3$
$\log N \sim \log V \sim -1.5 \log S$
William Herschel (1785) used deviations from this prediction to infer that the Universe (now known as our Milky Way Galaxy) is limited in extent (~1 kpc) and is elongated in shape.
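The $-1.5$ slope can be checked numerically. The following is a minimal Python sketch, assuming identical sources of unit luminosity scattered uniformly through a transparent Euclidean sphere. The sample size, sphere radius, and grid of flux thresholds are illustrative choices and do not come from the talk.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed setup: identical sources placed uniformly inside a sphere.
N_SOURCES = 200_000
R_MAX = 1.0
LUMINOSITY = 1.0

# Uniform in volume: the enclosed fraction scales as D^3, so D = R_MAX * u^(1/3).
distance = R_MAX * rng.random(N_SOURCES) ** (1.0 / 3.0)
flux = LUMINOSITY / (4.0 * np.pi * distance**2)

# Cumulative counts N(>S) for thresholds well above the faintest possible
# flux, so that the edge of the sphere does not affect the counts.
s_min = LUMINOSITY / (4.0 * np.pi * R_MAX**2)
thresholds = np.logspace(np.log10(2 * s_min), np.log10(50 * s_min), 12)
counts = np.array([(flux > s).sum() for s in thresholds])

# Straight-line fit in the log N - log S plane.
slope, intercept = np.polyfit(np.log10(thresholds), np.log10(counts), 1)
print(f"fitted log N - log S slope: {slope:.3f}")
```

The fitted slope should scatter around $-1.5$. Restricting the thresholds to fluxes near `s_min`, or adding absorption, would bend the relation in the way that star-count analyses exploit.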

Star counts at different Galactic latitudes
[Figure labels: halo, disk]
The Galaxy has multiple components with different spatial distributions, producing strong deviations from the $\log N \sim -1.5 \log S$ law.
Bahcall & Soneira 1980, The universe at faint magnitudes. I - Models for the galaxy and the predicted star counts

But near the Galactic plane, space is very opaque due to interstellar dust. Thus, Herschel's galaxy is ~20 times too small, which was not appreciated until the 1920s-30s.
A big effort in the 1910s-50s addressed the "fundamental equation of stellar statistics", which aims to establish simultaneously the distribution of stellar luminosities and their spatial distribution in the Galactic disk:
$A(S,l,b) = \int D(r,l,b) \, \Phi(L, \mathrm{Abs}(r,l,b)) \, r^2 \, dr$
$A$ = number of stars deg$^{-2}$ in a flux interval around $S$ towards direction $(l,b)$ (star count data)
$D$ = number of stars pc$^{-3}$ at distance $r$ from Earth towards $(l,b)$
$\Phi$ = normalized distribution of stellar luminosities (LF), corrected for spatially dependent absorption
This is a Fredholm Type 1 integral equation, solved by Taylor expansion, Fourier transform, or numerical integration. The effort essentially failed in the disk due to the patchy absorption, but it functions reasonably well towards the Galactic poles ($b = 90^\circ$).
Trumpler & Weaver 1953, Statistical Astronomy

Stellar populations & kinematics in the Galaxy
New catalogs with 4-6 dimensions of phase space:
- $(\alpha, \delta)$: location in the sky
- $\pi$: parallax, giving distance
- $(\mu_\alpha, \mu_\delta)$: proper motion across the sky
- $V_r$: radial velocity
50,000 stars have all 6 values measured, 100,000 have 5 values, and 30M have 4 values (Hipparcos, UCAC2, Gen-Cop, RAVE, ...).
Fascinating structure:
- Thin disk, thick disk, and halo components of the Galaxy
- Trace star formation around the Sun
- Halo streams & cannibalized galaxies
Very little sophisticated statistical work has been performed on these catalogs (e.g. mixture models, wavelet analysis).

Model of Milky Way cannibalizing the Sagittarius galaxy

Extragalactic surveys: The simple case of nearby galaxies

A recent SDSS study
Volume-limited survey of 28,089 galaxies with d < 150 Mpc
[Figure labels: low luminosity, high luminosity]
Blanton et al. 2005, The Properties and Luminosity Function of Extremely Low Luminosity Galaxies

[Figure panels: concentration, surface brightness, color, luminosity]

Normal galaxy luminosity function
Parametric model: Schechter function = Gen. Pareto distribution
$dN/dL \sim \Phi(L) \sim L^{-\alpha} e^{-L/L^*}$
Nonparametric methods: two used here ($1/V_{max}$, stepwise maximum likelihood)
(Many technical issues concerning correction for missing low-surface-brightness galaxies, K-corrections and evolution corrections to luminosity, double Schechter fits, etc.)
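To make the shape of the parametric model concrete, the sketch below evaluates the Schechter form in Python and measures its local logarithmic slope, which should be close to $-\alpha$ far below $L^*$ and steepen to about $-\alpha - 1$ at $L = L^*$. The function name `schechter` and the values of `ALPHA` and `L_STAR` are hypothetical, chosen only for illustration.

```python
import numpy as np

def schechter(lum, alpha, l_star, norm=1.0):
    """Schechter form: norm * L^(-alpha) * exp(-L / L*), left unnormalized."""
    lum = np.asarray(lum, dtype=float)
    return norm * lum ** (-alpha) * np.exp(-lum / l_star)

ALPHA, L_STAR = 1.2, 1.0  # hypothetical parameter values

lum = np.logspace(-3, 1, 400)
phi = schechter(lum, ALPHA, L_STAR)

# Local power-law index d ln(Phi) / d ln(L); analytically -alpha - L / L*.
log_slope = np.gradient(np.log(phi), np.log(lum))
faint_slope = log_slope[0]
knee_slope = log_slope[np.argmin(np.abs(lum - L_STAR))]
print(f"faint-end slope: {faint_slope:.3f}, slope near L*: {knee_slope:.3f}")
```

The power law dominates the faint end, while the exponential factor cuts the function off above $L^*$.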

Nonparametric estimators of LFs
1. Classic estimator: $\Phi(L) = N(L) / V$, where $N$ is the number of stars/galaxies/AGN in the surveyed volume $V$. A biased estimator.
2. Schmidt estimator (>700 citations): $\Phi(L) = \sum_i 1 / V_{max}(L_i)$, where $V_{max}$ is the maximum volume within which an object of the observed flux could have been seen given the survey's sensitivity limit. An unbiased estimator, but with high variance.
Schmidt 1968, Space Distribution and Luminosity Functions of Quasi-Stellar Radio Sources
Felten 1976, On Schmidt's $V_m$ estimator and other estimators of luminosity function
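The contrast between the two estimators can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are not from the talk: Euclidean volumes with no cosmology, a spatially homogeneous population inside a sphere of radius `R_SURVEY`, and luminosities drawn log-uniformly over two decades. The helper name `vmax_lf` and all numerical values are hypothetical.

```python
import numpy as np

def vmax_lf(lum, flux_limit, r_survey, log_bins):
    """Schmidt 1/V_max estimate of the LF (per unit volume per dex)."""
    # Largest distance at which each source would still exceed the flux limit,
    # capped at the edge of the surveyed volume.
    d_max = np.minimum(np.sqrt(lum / (4 * np.pi * flux_limit)), r_survey)
    v_max = 4.0 / 3.0 * np.pi * d_max**3
    weighted, _ = np.histogram(np.log10(lum), bins=log_bins, weights=1.0 / v_max)
    return weighted / np.diff(log_bins)

rng = np.random.default_rng(7)
N_TOTAL, R_SURVEY = 400_000, 1.0
FLUX_LIMIT = 1.0 / (4 * np.pi * 0.2**2)  # a source with L = 1 is visible out to d = 0.2

distance = R_SURVEY * rng.random(N_TOTAL) ** (1 / 3)  # uniform in volume
lum = 10 ** rng.uniform(0.0, 2.0, N_TOTAL)            # log-uniform luminosities
detected = lum / (4 * np.pi * distance**2) >= FLUX_LIMIT

log_bins = np.linspace(0.0, 2.0, 5)
survey_volume = 4 / 3 * np.pi * R_SURVEY**3
phi_true = N_TOTAL / survey_volume / 2.0  # true density per dex (2 dex total)

phi_vmax = vmax_lf(lum[detected], FLUX_LIMIT, R_SURVEY, log_bins)
counts, _ = np.histogram(np.log10(lum[detected]), bins=log_bins)
phi_classic = counts / survey_volume / np.diff(log_bins)

print("true     :", phi_true)
print("1/V_max  :", phi_vmax)
print("classic  :", phi_classic)
```

In this homogeneous setting the $1/V_{max}$ values should scatter around the true density in every bin, while the classic estimate falls far short at low luminosities, where most sources lie below the flux limit.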

3. Lynden-Bell-Woodroofe MLE (>100 citations): a recursion relation [equation not shown], where $C_k$ is the number of stars/galaxies/AGN in the k-th rectangle in the luminosity-distance diagram.
Lynden-Bell 1971, A method of allowing for known observational selection in small samples applied to 3CR quasars
Woodroofe 1985, Estimation of a distribution function with truncated data
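One common way to write this estimator is as a product-limit formula for left-truncated data: if $x_i$ (for example $\log L$) is observed only when it exceeds its own truncation limit $t_i$ (for example the faintest $\log L$ visible at that object's distance), then $\hat F(x) = 1 - \prod_{x_i \le x} (1 - 1/C_i)$, with $C_i$ the number of objects $j$ satisfying $t_j \le x_i \le x_j$. The Python sketch below assumes this form, which may differ in notation from the recursion on the slide. It also assumes no ties and that $x$ is independent of $t$. The function name `lynden_bell_cdf` and the toy data are hypothetical.

```python
import numpy as np

def lynden_bell_cdf(x, t):
    """Product-limit CDF estimate for left-truncated data (x_i seen only if x_i >= t_i).

    Returns the sorted x values and the estimated CDF at each of them.
    Assumes no ties in x and independence between x and the limits t.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    order = np.argsort(x)
    x, t = x[order], t[order]
    # C_i: objects whose truncation limit lies below x_i and whose value is
    # at least x_i, i.e. those that are "comparable" to object i.
    comparable = np.array([np.sum((t <= xi) & (x >= xi)) for xi in x])
    survival = np.cumprod(1.0 - 1.0 / comparable)
    return x, 1.0 - survival

# Toy example: four objects with individual truncation limits.
x_sorted, cdf = lynden_bell_cdf([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.5, 1.5])
print(x_sorted, cdf)  # CDF values: 0.5, 0.75, 0.875, 1.0

# With no truncation at all, the estimator reduces to the empirical CDF.
_, ecdf = lynden_bell_cdf([3.0, 1.0, 2.0, 5.0, 4.0], np.full(5, -np.inf))
```

A known weakness of this form is that a comparable set of size one early in the sequence drives the estimate straight to 1, so small samples need care.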

4. Stepwise maximum-likelihood estimator (>500 citations): [likelihood equation not shown]. Set $dl/d\phi = 0$ to obtain [estimator equation not shown].
Efstathiou et al. 1988, Analysis of a complete galaxy redshift survey. II - The field-galaxy luminosity function

Takeuchi et al. (2000) apply these LF estimators to Monte Carlo simulations of small galaxy catalogs (N = 100 and 1000). All perform well when the spatial distribution is homogeneous, but for spatially clustered distributions the $1/V_{max}$ estimator is badly affected.
There has been no analytical evaluation of the mathematical properties of these estimators since the study of $1/V_{max}$ by Felten (1976).

Extragalactic surveys: The difficult case of quasar evolution
AGN exhibit huge cosmic evolution: there were hundreds of times more luminous quasars when the Universe was half its current age than there are today.

Here is the log N-log S distribution of radio sources in the sky. The expected $\log N \sim -1.5 \log S$ of a uniform distribution has been removed here, so deviations from a horizontal line indicate cosmic evolution of the volume density and/or LF. The radio sources arise from two main classes, AGN and star-forming galaxies, which exhibit different evolutionary behaviors. There is an enormous 1960s-90s literature on quasar evolution, and a 1990s-2000s literature on star formation evolution (the 'Madau plot').
Condon et al. 2002, Radio sources and star formation in the local universe

Two or more flux-limited surveys
1. A flux-limited survey gives a catalog of sources at one waveband.
2. Astronomers observe this sample at another waveband, but only some objects are detected.
3. Questions:
   - What is the LF in the second waveband? (density estimation)
   - Are these LFs the same in different subsamples? (k-sample test)
   - What is the relationship between the first and second luminosities? (correlation & regression)
4. Generalize to a multivariate database with many properties measured for the initial sample. Nondetections appear in many columns.
In astronomical parlance, the nondetections give upper limits. In statistical parlance, the nondetections give left-censored data points.
These questions have arisen in hundreds of studies.

An example of astronomical censored data Heckman et al. 1989 A millimeter-wave survey of CO emission in Seyfert galaxies

Several solutions to the censored LF problem were considered by astronomers during 1975-85, leading to the rediscovery of Efron's redistribute-to-the-right algorithm for the maximum-likelihood Kaplan-Meier product-limit estimator for randomly censored data. Schmitt, Feigelson & Nelson, and Isobe et al. introduced astronomers to existing methods of survival analysis for the treatment of censored data.
Schmitt 1985, Statistical analysis of astronomical data containing upper bounds
Feigelson & Nelson 1985, Statistical methods for astronomical data with upper limits. I - Univariate distributions
Isobe et al. 1986, Statistical methods for astronomical data with upper limits. II - Correlation and regression

Example of univariate survival analysis (Kaplan-Meier)
Feigelson & Nelson 1985
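A Kaplan-Meier estimate for data with upper limits can be sketched in a few lines of Python. Standard survival formulas are written for right-censored data, so this sketch flips the values about their maximum, which turns upper limits into right-censored points, applies the product-limit formula, and flips back. The function names `kaplan_meier` and `km_upper_limits` and the five-point data set are hypothetical. This is a minimal illustration rather than a reproduction of any published code.

```python
import numpy as np

def kaplan_meier(values, detected):
    """Product-limit survival estimate P(X > v) for right-censored data.

    Returns the distinct uncensored values and the survival just after each.
    """
    values = np.asarray(values, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    event_values = np.unique(values[detected])
    at_risk = np.array([(values >= v).sum() for v in event_values])
    events = np.array([((values == v) & detected).sum() for v in event_values])
    return event_values, np.cumprod(1.0 - events / at_risk)

def km_upper_limits(values, detected):
    """Estimate P(X < x) at each detected x when nondetections are upper limits."""
    values = np.asarray(values, dtype=float)
    pivot = values.max()
    # Flipping turns "true value < limit" into "true value > flipped limit".
    flipped, survival = kaplan_meier(pivot - values, detected)
    return (pivot - flipped)[::-1], survival[::-1]

# Four detections and one upper limit (the true value lies somewhere below 2).
measurements = [1.0, 2.0, 3.0, 4.0, 5.0]
is_detected = [True, False, True, True, True]
x_det, prob_below = km_upper_limits(measurements, is_detected)
print(x_det, prob_below)  # P(X < x) at x = 1, 3, 4, 5: 0.0, 0.4, 0.6, 0.8
```

In the example, the probability mass of the upper limit at 2 is reassigned to the only detection below it, at 1, which is the mirror image of redistribute-to-the-right.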

Test of bivariate correlation/regression for simulated flux-limited surveys
[Figure panels: biased line using detections only; unbiased Buckley-James line including nondetections]
Isobe, Feigelson & Nelson 1986
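The bias from fitting detections only is easy to reproduce. The Python sketch below does not implement the Buckley-James method. It only simulates a linear relation with Gaussian scatter, discards points below a detection limit in the dependent variable, and compares least-squares slopes. The true relation, the scatter, and the detection limit are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(1)

N = 5000
TRUE_INTERCEPT, TRUE_SLOPE = 1.0, 1.0  # hypothetical underlying relation
DETECTION_LIMIT = 2.5                   # hypothetical limit on y

x = rng.uniform(0.0, 3.0, N)
y = TRUE_INTERCEPT + TRUE_SLOPE * x + rng.normal(0.0, 1.0, N)
detected = y >= DETECTION_LIMIT  # points below the limit would be upper limits

# Least squares on the full (normally unobservable) sample ...
slope_all, _ = np.polyfit(x, y, 1)
# ... and on the detections alone, ignoring the upper limits.
slope_det, _ = np.polyfit(x[detected], y[detected], 1)

print(f"slope, all points:      {slope_all:.2f}")
print(f"slope, detections only: {slope_det:.2f}")
```

At small `x` only upward fluctuations clear the limit, so the detections-only line comes out too flat. Survival-based regression uses the upper limits to counter this bias.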

ASURV code
Stand-alone code implementing survival methods for astronomy:
- Kaplan-Meier estimator for the luminosity function
- Gehan, logrank, Peto-Peto & Peto-Prentice 2-sample tests
- Cox regression for bivariate correlation
- Generalized Kendall's τ & Spearman's ρ coefficients including censoring in both variables (Brown, Hollander & Korwar; Akritas)
- Linear regression with Gaussian residuals (EM algorithm)
- Linear regression with K-M residuals (Buckley-James)
- Linear regression with censoring in both variables (Campbell, Schmitt)
LaValley, Isobe & Feigelson 1992, ASURV software report

Methodological work needed for astronomical censored data
1. Investigate best methods for astronomical censoring patterns. Type 1 censoring in $F = L / 4\pi d^2$ gives nonrandom censoring in $L$.
2. Develop multivariate analysis for censoring in all variables: clustering, regression/PCA, MANOVA, MARS & neural nets, etc.
Shapley et al. 2001

3. Calculate nonlinear regression & goodness-of-fit.
4. Treat simultaneous censoring & truncation.
Astronomical survey
- Goal: Characterizing LFs and relationships in stars/galaxies/AGN
- Full population of stars/galaxies/AGN
- Truncated sample from flux-limited survey
- Censored measurements of other properties
AIDS epidemic
- Goal: Characterizing infected populations & mortality from AIDS
- Full population of infected individuals
- Truncated sample from symptomatic individuals
- Censored measurements of survival
5. Treat simultaneous censoring & measurement error. Unlike biometrical and industrial reliability testing environments, where survival times are measured precisely, in astronomy the censored value is typically set at the 3σ upper limit, where σ is the known noise level. A new statistical approach is needed that treats all observations (detected or not) with measurement errors. Fisher's fiducial distributions?
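The first item in this list, that a fixed flux limit produces nonrandom censoring in luminosity, can be illustrated with a short Python simulation. It is a sketch under assumed conditions: sources uniform in a Euclidean unit sphere, a lognormal luminosity distribution, and a single flux limit. All numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 20_000
FLUX_LIMIT = 1.0  # hypothetical sensitivity limit, arbitrary units

distance = rng.random(N) ** (1 / 3)  # uniform in a unit sphere
log_lum = rng.normal(1.0, 0.5, N)    # assumed lognormal luminosities
lum = 10**log_lum

flux = lum / (4 * np.pi * distance**2)
detected = flux >= FLUX_LIMIT

# The upper limit in luminosity for a nondetection depends only on distance,
# so it differs from source to source even though the flux limit is fixed.
lum_upper_limit = 4 * np.pi * distance**2 * FLUX_LIMIT

# Detection fraction for the fainter and brighter halves of the population.
faint = log_lum < np.median(log_lum)
frac_faint = detected[faint].mean()
frac_bright = detected[~faint].mean()
print(f"detected: {frac_faint:.2f} of faint half, {frac_bright:.2f} of bright half")
```

Because the chance of being censored depends on the luminosity itself, the censoring is not independent of the quantity being estimated, which is the assumption behind standard survival estimators.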

A call to arms
Astrostatistical research for astronomical surveys
- Develop mathematics for nonparametric estimators for luminosity functions in flux-limited surveys, including applications to complex situations like spatial inhomogeneities & cosmic evolution.
- Perform state-of-the-art multivariate/spatial analysis of Galactic stellar populations using Hipparcos/UCAC2/... databases.
- Extend survival methods to truncated, multiply-censored, measurement-error, multivariate databases.