Data-driven models of stars David W. Hogg Center for Cosmology and Particle Physics, New York University Center for Data Science, New York University Max-Planck-Institut für Astronomie, Heidelberg 2015 July 31 in collaboration with: Melissa Ness (MPIA), Hans-Walter Rix (MPIA), Anna Ho (MPIA), Gail Zasowski (JHU), and Dan Foreman-Mackey (NYU)
Annie Jump Cannon O B A F G K M temperature sequence! alphabetical order (A B F G K M O) is hydrogen-line-strength order Cannon understood the temperature sequence of stars without the benefit of physical models data-driven non-linear dimensionality reduction manifold learning (using a huge amount of prior knowledge) namesake of The Cannon
chemodynamics stars populate orbits in the Milky Way conserved actions (or chaotic equivalents) stars are formed from particular gas clouds stars have conserved surface abundances the combined action-chemical space will be far more informative than either taken independently
chemodynamics top priority for many new projects Gaia & Gaia-ESO HERMES & GALAH SDSS-III APOGEE terrifying inconsistencies in current approaches models of stars are amazingly good... but chemical signatures are incredibly tiny
exoplanets extra-solar planets are always measured relative to their host stars you only understand a planet as well as you can understand the star
the paradox of precision astrophysics models are incredibly explanatory ΛCDM stellar spectroscopy helioseismology and yet... models are wrong (ruled out) in detail (reduced χ²_ν) the χ² statistic is a measure of the size of your data! missing physics, approximation, computation, gastrophysics
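To spell out the χ² point (my gloss, not on the slide): if each of N data points carries Gaussian noise σ_i and the model is off by a small systematic δ_i, then

\mathrm{E}[\chi^2] \approx \sum_{i=1}^{N} \frac{\sigma_i^2 + \delta_i^2}{\sigma_i^2} = N \left( 1 + \left\langle \frac{\delta^2}{\sigma^2} \right\rangle \right), \qquad \chi^2_\nu \approx 1 + \left\langle \frac{\delta^2}{\sigma^2} \right\rangle

The excess over N grows linearly with N while the scatter of χ² grows only as \sqrt{2N}, so any fixed, tiny model imperfection becomes a formal rejection once the survey is large enough; the rejection significance measures the size of the data set, not the badness of the physics.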
physics-driven models put in everything you know gravity, atomic and molecular transitions, radiation make approximations to make things computable sub-grid models, mixing length, etc
machine learning the most extreme of data-driven models the data is the model none of your knowledge is relevant learn (fit) an exceedingly flexible model explain or cluster the data transformation from data to labels concept of non-parametrics concept of train, validate, and test many packages and implementations (and outrageous successes)
when does machine learning help you? train & test situation training data are statistically identical to the test data same noise amplitude same distance or redshift distribution same luminosity distribution never true! training data have accurate and precise labels therefore, we can't use vanilla machine learning! (astronomers rarely can)
data-driven models (my personal usage) make use of things you strongly believe noise model & instrument resolution causal structure (shared parameters) capitalize on huge amounts of data exceedingly flexible model concept of train, validate, and test every situation will be bespoke
label transfer for stars a few of your stars have good labels (from somewhere) can you use this to label the other stars? why would you want to do this? you don't have good models at your wavelengths? you want two surveys to be on the same system? you have some stars at high SNR, some at low SNR? you spent human time on some stars but can't on all?
stellar spectra stars are very close to black-bodies to first order, a stellar spectrum depends on effective temperature Teff and surface gravity log g to second order, metallicity [Fe/H] and rotation to third order, tens of chemical abundances
stellar spectra all chemical information is in absorption lines corresponding to atomic and molecular transitions some 30 elements are visible in the best stars spectroscopy at R ≡ λ/Δλ > 20,000 is the primary tool
stellar astrophysics [Figure: four example normalized APOGEE spectra, normalized flux f vs. wavelength λ (Å) over roughly 15,200–16,800 Å; panels labeled (Teff, log g, [Fe/H]) = (4750, 3.0, +0.15), (4849, 2.2, −1.0), (3614, 0.4, −0.68), (5003, 2.8, −0.71)]
stellar astrophysics
SDSS-III APOGEE Galactic archaeology APOGEE DR10: 56,000 stars R = 22,500 spectra in 1.5 < λ < 1.7 µm precise RVs and stellar parameters plan for a dozen abundances for every star (our own home-built and special continuum normalization; ask me!) APOGEE DR12: 156,000 stars now available
SDSS-III APOGEE [Figure: the same four example APOGEE spectra as above]
train, validate, and test split the data into three disjoint subsets in the training step you set the parameters of your model using the training set the validation set is used to set hyperparameters or model complexity in the test step you apply the model to the test set (new data) to make predictions or deliver results
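As a toy illustration of the three-way split (my sketch, with hypothetical names and sizes, not code from the project):

import numpy as np

rng = np.random.default_rng(42)

def three_way_split(n_stars, f_train=0.6, f_validate=0.2):
    # randomly assign star indices to disjoint train / validate / test sets
    idx = rng.permutation(n_stars)
    n_train = int(f_train * n_stars)
    n_val = int(f_validate * n_stars)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# fit parameters on train_idx, choose model complexity on val_idx,
# and report labels or performance only on test_idx
train_idx, val_idx, test_idx = three_way_split(543)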
The Cannon: training set 543 stars (too few) from 19 clusters (too few) Teff, log g, [Fe/H] labels from APOGEE calling parameters and abundances "labels" slight adjustments to labels to get them onto possible isochrones terrible coverage of the main sequence (only the Pleiades) home-made Pleiades labels (by Ness) no [Fe/H] spread at high log g
The Cannon: training set [Figure: training-set clusters in the Teff–log g plane, one panel per cluster with its fiducial isochrone, each labeled with age (Gyr) and [Fe/H]: M92, M15, M53, N5466, N4147, M2, M13, M3, M5, M107, M71, N2158, N2420, N188, M67, N7789, Pleiades, N6819, N6791]
The Cannon: model a generative model of the APOGEE spectra given label vector l, predict flux vector f probabilistic prediction p(f | l, θ) use every spectral pixel's uncertainty variance σ²_λn responsibly details: spectral expectation is quadratic in the labels every wavelength λ treated independently an intrinsic Gaussian scatter s²_λ at every wavelength λ 80,000 free parameters in θ!
The Cannon: model

\ln p(f_n \mid l_n, \theta) = \sum_{\lambda=1}^{L} \ln p(f_{\lambda n} \mid l_n, \theta_\lambda, s^2_\lambda)

\ln p(f_{\lambda n} \mid l_n, \theta_\lambda, s^2_\lambda) = -\frac{1}{2}\,\frac{[f_{\lambda n} - \theta_\lambda^{\mathsf T} l_n]^2}{\sigma^2_{\lambda n} + s^2_\lambda} - \frac{1}{2}\ln(\sigma^2_{\lambda n} + s^2_\lambda)

l^{\mathsf T} \equiv \{1, T_{\rm eff}, \log g, {\rm [Fe/H]}, T_{\rm eff}^2, T_{\rm eff}\log g, \ldots, {\rm [Fe/H]}^2\}

\theta^{\mathsf T} \equiv \{\theta_\lambda, s^2_\lambda\}_{\lambda=1}^{L}
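A minimal numpy transcription of the model above (my illustration, not the reference implementation; helper names are mine):

import numpy as np

def label_vector(teff, logg, feh):
    # quadratic label vector l: constant, linear terms, and all squares and cross terms
    lab = np.array([teff, logg, feh])
    quad = np.outer(lab, lab)[np.triu_indices(3)]    # Teff^2, Teff*logg, ..., [Fe/H]^2
    return np.concatenate(([1.0], lab, quad))         # 10 terms per wavelength

def lnlike_pixel(f, sigma2, theta_lam, s2_lam, l):
    # ln p(f_{lambda,n} | l_n, theta_lambda, s^2_lambda) for one star at one wavelength
    var = sigma2 + s2_lam
    resid = f - theta_lam @ l
    return -0.5 * resid**2 / var - 0.5 * np.log(var)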
The Cannon: model ln p(f_n | l_n, θ) training step: optimize w.r.t. parameters θ at fixed labels l using training-set data (linear least squares; every wavelength λ treated independently) test step: optimize w.r.t. labels l at fixed parameters θ using test-set (survey) data (non-linear optimization; every star treated independently)
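A sketch of the two optimizations, assuming spectra flux[n, lam], per-pixel variances sigma2[n, lam], and the label_vector helper above; for brevity this drops the intrinsic scatter s²_λ (fit iteratively in practice) and shows only the structure, not the actual code:

import numpy as np
from scipy.optimize import minimize

def train(flux, sigma2, labels):
    # training step: weighted linear least squares for theta_lambda, one wavelength at a time
    L_mat = np.array([label_vector(*lab) for lab in labels])       # (n_stars, 10)
    theta = np.zeros((flux.shape[1], L_mat.shape[1]))
    for lam in range(flux.shape[1]):
        w = 1.0 / np.sqrt(sigma2[:, lam])                          # inverse-sigma weights
        theta[lam], *_ = np.linalg.lstsq(w[:, None] * L_mat, w * flux[:, lam], rcond=None)
    return theta

def test(flux_n, sigma2_n, theta, x0=(4800.0, 2.5, 0.0)):
    # test step: non-linear optimization of the labels (Teff, log g, [Fe/H]) for one survey star
    def chi2(lab):
        model = theta @ label_vector(*lab)
        return np.sum((flux_n - model) ** 2 / sigma2_n)
    return minimize(chi2, x0, method="Nelder-Mead").x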
The Cannon: model training
The Cannon: model training cross-validation
The Cannon: results The Cannon is far faster than physical modeling model trains in seconds (thousands of fits) The Cannon labels all 56,000 stars in APOGEE DR10 in two hours (pure Python on a laptop) labels appear sensible The Cannon labels lie near sensible isochrones scatter against APOGEE labels consistent with APOGEE precision successfully puts labels on dwarfs
The Cannon: test time
The Cannon: comparison with APOGEE labels
The Cannon: label veracity
The Cannon: works at low signal-to-noise
The Cannon: label transfer from APOGEE to LAMOST
The Cannon: shortcuts and choices no Bayes; no partial or noisy labels quadratic order replacing polynomial with a Gaussian process continuous model complexity; non-parametric spectral representation too-small training set only three labels age, [α/Fe] splitting the giant branch how to go to many elements?
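To make the "replace the polynomial with a Gaussian process" item concrete, here is a generic sketch of that idea (one common way to do it, not anything The Cannon actually ships): at each wavelength, regress flux on the labels with a GP instead of the quadratic form.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def train_pixel_gp(train_labels, flux_at_lambda):
    # non-parametric flux model at one wavelength over (Teff, log g, [Fe/H]) label space;
    # the WhiteKernel term plays a role analogous to the intrinsic scatter s^2_lambda
    kernel = 1.0 * RBF(length_scale=[250.0, 0.5, 0.3]) + WhiteKernel(noise_level=1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(np.asarray(train_labels), np.asarray(flux_at_lambda))
    return gp   # gp.predict(new_labels, return_std=True) gives predicted flux and uncertainty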
The Cannon: masses and ages for red giants
applications for data-driven models Kepler and K2 light curves the first systematic exoplanet catalog from K2 data Foreman-Mackey et al. (arXiv:1502.04715) building a consistent all-sky stellar parameter system for Gaia quasar target selection XDQSO and XDQSOz Bovy et al. 2011 (ApJ 729, 141), 2012 (ApJ 749, 41) CMB foregrounds
data-driven models incredibly powerful tools clustering, label transfer, prediction, de-noising make use of things you strongly believe especially the noise model every situation will be bespoke expect to get dirty