Andy Casey Astrophysics; Statistics


1 Science in the era of big Gaia data Andy Casey (Astrophysics; Statistics) andycasey astrowizicist astrowizici.st

2 Science in the era of big Gaia data - The Gaia mission: all about Gaia. What makes data big?

3 Science in the era of big Gaia data - The Gaia mission: all about Gaia. What makes data big? - Pedagogy of data analysis when you have lots of data: examples of how pedagogy drives decisions in big and small data analysis (data-driven methods, non-parametric models)

4 Science in the era of big Gaia data - The Gaia mission: all about Gaia. What makes data big? - Pedagogy of data analysis when you have lots of data: examples of how pedagogy drives decisions in big and small data analysis (data-driven methods, non-parametric models) - Tools & resources for data analysis: pick the right tool for the job

5 Science in the era of big Gaia data - The Gaia mission: all about Gaia. What makes data big? - Pedagogy of data analysis when you have lots of data: examples of how pedagogy drives decisions in big and small data analysis (data-driven methods, non-parametric models) - Tools & resources for data analysis: pick the right tool for the job - Unsolicited advice for staying ahead of the data wave

6

7 having data is no longer currency in astronomy

8 having data is no longer currency in astronomy having good ideas and the ability to effortlessly use data is currency

9 having data is no longer currency in astronomy having good ideas and the ability to effortlessly use data is currency This talk is about making you rich

10 The Gaia satellite The Billion Star Surveyor (tm) One billion stars for one billion Euros An astrometric mission designed to measure the position, parallax, brightness, and proper motions for more than one billion stars.

12 The Gaia satellite The Billion Star Surveyor (tm) One billion stars for one billion Euros For up to 1.7 billion sources: positions; proper motions; radial velocities (and scatter); parallax; photometry (G, BP, RP); colours (G-BP, G-RP, BP-RP); dust along the line of sight; stellar effective temperatures; stellar radii; stellar masses; stellar luminosities; astrometric excess noise (more than a single-star solution); orbital solutions for solar system objects; variable stars (including light curves of new kinds of objects). Credit: Erik Tollerud

13 Source count completeness Gaia observes everything. Stars, galaxies, quasars, asteroids, et cetera.

14 Photometric performance Kepler-precision photometry, but for one billion stars

15 Astrometric performance (It is very good)

16 Astrometric performance Note: Hipparcos and Gaia Data Release 1

17 Proper motion performance A proper motion precision of 0.4 mas/yr for a G ~ 18 star at 30 kpc corresponds to approx. 2 km/s precision at 100,000 light years away

18 You are here.

19 Credit: S. BRUNIER/ESO/ESA

20

21 Gaia Data Release 2 This was the first real data release, and it contained just averaged values.

22 Gaia Data Release 2 This was the first real data release, and it contained just averaged values. are we at big data yet?

23 Gaia Data Release 5 The flood is coming. This is what we need to deal with (easily): 128 trillion position measurements, 380 trillion brightness measurements, 1 billion medium-resolution spectra, 100 billion low-resolution spectra, and 1 petabyte of reduced data products for science.

24 Gaia Data Release 5 The flood is coming. This is what we need to deal with (easily): 128 trillion position measurements, 380 trillion brightness measurements, 1 billion medium-resolution spectra, 100 billion low-resolution spectra, and 1 petabyte of reduced data products for science. are we at big data yet?

25 Gaia Data Release 5 The flood is coming. This is what we need to deal with (easily): 128 trillion position measurements, 380 trillion brightness measurements, 1 billion medium-resolution spectra, 100 billion low-resolution spectra, and 1 petabyte of reduced data products for science. are we at big data yet? Rule of thumb: if you can load it into RAM, then you are not at big data.

26 Five pedagogical questions to ask yourself to keep you out of scientific and data analysis cul-de-sacs 1. Do you have small data, or do you have big data? 2. What is the simplest, dumbest model you can think of? 3. What assumptions are you making? 4. What is the utility of the model? 5. What can you afford?

27 1. Do you have small data, or do you have big data? If you can't load it into RAM, you have options (in increasing order of difficulty): Do you need to load all the data at once? Memory-mapped arrays: store data on external hard drives and treat it (really carefully) as memory. Can you subsample the data and get a comparable result? Can you use statistics of the data to get a comparable result? Can you simplify the data you use and get a comparable result (e.g., ignore covariances)? Can you recast your problem as a map-reduce problem?
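The memory-mapped option above can be sketched in a few lines of numpy. This is a minimal illustration with a made-up file name and toy array sizes, not a recipe for real Gaia tables:

```python
import numpy as np

# Hypothetical stand-in for a large on-disk photometry table: write a
# small raw float64 file so the sketch is self-contained.
data = np.random.default_rng(42).normal(size=(1_000, 3))
data.tofile("photometry.bin")

# Treat the on-disk array as if it were in memory; only the slices you
# actually touch are read from disk.
mmap = np.memmap("photometry.bin", dtype=np.float64, mode="r", shape=(1_000, 3))

# Process in chunks so peak RAM stays bounded, accumulating sufficient
# statistics (running sums for a mean) instead of loading everything.
chunk, total, count = 100, np.zeros(3), 0
for start in range(0, mmap.shape[0], chunk):
    block = np.asarray(mmap[start:start + chunk])  # reads only this slice
    total += block.sum(axis=0)
    count += block.shape[0]

print(total / count)  # column means, computed without holding all rows in RAM
```

The same accumulate-sufficient-statistics pattern is what makes the "use statistics of the data" option work, too.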

28 2. What is the simplest, dumbest model you can think of? Always start with the simplest model you can think of, even if you know it is dumb and will not give you great results. For example: 1. Linear regression (for fitting data): a design matrix can have nonlinear entries, but you are still doing linear regression! 2. k-means (for clustering): use k-means++ for initialisation, always. 3. Logistic regression (for classification). Don't change this model until you have answered all five questions! When complicated models aren't working correctly, always ask what the simplest, dumbest thing is that you could test to check your intuition.
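The first point above, that a design matrix with nonlinear entries is still linear regression, can be shown in a short sketch (toy data with made-up coefficients):

```python
import numpy as np

# The model y = a + b*x + c*x**2 is linear *in the parameters* (a, b, c),
# so ordinary least squares applies even though x enters nonlinearly.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 0.5 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.05, size=x.size)

# Design matrix with nonlinear entries: columns [1, x, x**2].
A = np.vander(x, 3, increasing=True)

# Solve the linear least-squares problem A @ theta ~ y.
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta)  # close to the true coefficients [0.5, 2.0, -1.5]
```

You could add columns of sines, logs, or basis splines and the fitting step would not change at all.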

29 3. What assumptions are you making? You have made an infinite number of assumptions. What are the most important assumptions? (Seriously: write them down.) Do you assume that your data are drawn from a straight line? Do you assume the data points are independent? Do you assume the noise in the data is normally distributed? Do you assume that you have the correct objective function? Do you assume that you have optimised to the global minimum? Do you assume that you have used an appropriate optimisation algorithm? Do you assume that the noise estimates you have are correct? Do you assume that we do not live in a simulation? (Would it matter?)

30 4. What is the utility of the model? All models are wrong, some are useful. Even a dumb model can tell you a lot about what you should do next. If you have a dumb model but you parameterise your model errors, then the model errors (or residuals from the data) will inform you where your model is failing. Do the underlying physical models make good predictions? Under what conditions will this model fail? (models should fail loudly!) Do you need a point estimate of your model parameters, or do you need a posterior probability distribution over data? Does this model give a point estimate that you can use for other purposes?

31 5. What can you afford? Sometimes a point estimate of the parameters of a very simple model is good enough to answer the question you have. Sometimes you will need to sample a posterior probability distribution of a complicated model. Or worse: calculate the fully marginalised likelihood (FML; a.k.a. "the evidence"). What can you afford? (etc.) Answers to these questions will (in a very practical sense) help drive your model complexity.

32 Example: data-driven models For when the data are better than the models.

33 Hierarchical data-driven models of stellar properties Hierarchical, complex model. Analytic integrals to marginalise parameters: tractable(-ish)! Use joint information between stars to denoise properties of the sample. (Leistedt et al.; Anderson et al.)

35 Hierarchical data-driven models of stellar properties 1. Do you have small data, or do you have big data? Small. 2. What is the simplest, dumbest model you can think of? Gaussian mixture model. 3. What assumptions are you making? Independence among stars. Many others. 4. What is the utility of the model? Most parallaxes are noisy. This model improves them. 5. What can you afford? Posterior distributions over data, but only through analytic marginalisation.
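The "simplest, dumbest model" named above, a Gaussian mixture model, takes a few lines with scikit-learn. This is a toy two-population sample with invented numbers; the model in the papers is hierarchical and far richer:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-in: two "stellar populations" in a (colour, magnitude) plane,
# with made-up centres and scatter.
rng = np.random.default_rng(1)
pop_a = rng.normal([0.8, 4.5], 0.1, size=(500, 2))
pop_b = rng.normal([1.4, 0.5], 0.2, size=(500, 2))
X = np.vstack([pop_a, pop_b])

# Fit a two-component Gaussian mixture by expectation-maximisation.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Posterior membership probabilities for each star in each component.
resp = gmm.predict_proba(X)
print(gmm.means_)  # recovers the two population centres
```

Even this dumb version already gives per-star membership probabilities, which is the quantity the fancier hierarchical model refines.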

36 Example: non-parametric models Terribly named, because they really have an infinite number of parameters.

37 Non-parametric model for binary star inference 1. Do you have small data, or do you have big data? Big. We ded. 2. What is the simplest, dumbest model you can think of? Mixture of two components. 3. What assumptions are you making? Some stars with similar colours and luminosity will be single stars. 4. What is the utility of the model? Point estimates of binary probability for two billion stars. 5. What can you afford? Posterior distributions over data, but only if we get clever.

38 Non-parametric model for binary star inference

39 Non-parametric model for binary star inference For each star, fit a mixture model (normal and log-normal) to all observables of stars in our ball: radial velocity variance, astrometric noise, photometric variability, being bluer/redder than expected, and template systematics. Calculate p(single | data) for the star of interest, then move on to the next star.
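The two-component idea can be sketched with made-up component parameters (a half-normal stand-in for the single-star scatter and a log-normal for the binary tail; the real model fits these per ball of neighbouring stars, over several observables at once):

```python
import numpy as np
from scipy import stats

# Illustrative mixture for one observable, radial-velocity variance.
# All parameters below are invented for the sketch.
single = stats.halfnorm(scale=0.5)        # narrow single-star component
binary = stats.lognorm(s=1.0, scale=5.0)  # broad log-normal binary tail
w_single = 0.7                            # assumed mixture weight

def p_single(v):
    """Posterior probability that a star with RV variance v is single."""
    num = w_single * single.pdf(v)
    den = num + (1 - w_single) * binary.pdf(v)
    return num / den

print(p_single(0.2))   # small variance: almost certainly single
print(p_single(20.0))  # large variance: almost certainly a binary
```

The full model multiplies likelihoods like this across all of the observables listed above before reading off p(single | data).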

40 Non-parametric model for binary star inference [Figure: radial velocity variance (km/s) against apparent g, bp, and rp fluxes]

41 Non-parametric model for binary star inference In practice we might want to sample the mixture parameters for every star. Can we afford it? Hell no! We can barely optimise it! But we may be able to analytically marginalise out parameters that we don't care about.

42 Non-parametric model for binary star inference ~210 million parameter model for brighter stars, about 1B parameter model for all stars. Converted a big data problem to a small data problem that is embarrassingly parallel, and one where we might be able to analytically marginalise out many hyper-nuisance-parameters.
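The "embarrassingly parallel" structure described above is just a map over independent stars followed by a trivial reduce. A toy sketch (hypothetical per-star function; serial here, but the map step could be handed to concurrent.futures or Hadoop unchanged, because no star's fit depends on another's):

```python
from functools import reduce

# Toy catalogue: each star carries only the local data its own fit needs.
stars = [{"id": i, "rv_variance": 0.1 * i} for i in range(10)]

def fit_one_star(star):
    # Stand-in for the real per-star mixture fit; returns (id, p_single).
    return star["id"], 1.0 / (1.0 + star["rv_variance"])

# Map: one independent fit per star.
results = dict(map(fit_one_star, stars))

# Reduce: combine per-star outputs (here, just count them).
n_processed = reduce(lambda acc, _: acc + 1, results, 0)
print(n_processed)  # 10
```

Structuring the analysis this way is exactly what converts the big-data problem into many small-data problems.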

43 Non-parametric model for binary star inference [Figure: inferred binary probability as a function of radial velocity excess and the orbital quantities K/P and eccentricity]

44 Non-parametric model for binary star inference [Figure: binary fraction across the colour-magnitude diagram (absolute G magnitude against bp-rp)] Now we can do a population study of binary stars that is 10^5 times larger than anything we could do before.

45 Why not just turn on the Machine Learning (tm)? As physicists we are often interested in the mechanisms that produced the data. That is, we want a generative model for the data. Neural networks are universal function approximators (we've known that literally for decades), but they will not give you a generative model for the data that is interpretable. This applies to most ML methods. Sometimes that's OK. Sometimes you don't care about interpretability, or how the data were generated. But often we do care, and we can afford an interpretable model, but we (incorrectly) opt to use Machine Learning.

46 Why not just turn on the Machine Learning (tm)? Consider a problem where there are: Lots of high quality data. It's hard to model those data, and/or the existing models do not make good predictions ("the data are better than the models"). We just want answers. We don't care why.

47 Why not just turn on the Machine Learning (tm)? Turn on the ML! Create some training set of well-known objects. Train a Convolutional Neural Network (CNN) to estimate the intrinsic (or latent) properties of some objects, given an image (or spectrum) of the object. You responsibly run cross-validation (or drop-out) to convince yourself things work. You run the test step. Your CNN has identified an object with properties that defy everything we thought we knew about astrophysics! (But in many other ways, it is similar enough to objects in the training set, so we have some reason to trust it)

48 (Get it? Convolutional Neural Network.) Models that lack interpretability can really suck.

49 When should I turn on the Machine Learning (tm)? Can you write a generative model for the data (that evaluates in less than a Hubble time)? Don't use machine learning. Forward model the data. Do you care about model interpretability, or interpreting the results that you get? Don't use machine learning. Forward model the data. Do you want a posterior probability distribution over data? Don't use machine learning. Forward model the data. Do you need to retain some semblance of probability over data? Don't use machine learning. Forward model the data. Do you want to classify or estimate things, or make decisions, and you don't care about the physics? Hell yeah! Turn the Machine Learning up to 11!

50 Even when you turn on the Machine Learning (tm), the rules still apply! What is the simplest, dumbest model you can think of? Start with that. From Google on "Scalable and accurate deep learning with electronic health records" (Nature): regularised logistic regression performed essentially just as well as deep neural networks (mortality C.I. vs 0.94 to 0.96). There is a huge cost, complexity, and interpretability difference between those models.
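The baseline that slide recommends, regularised logistic regression, takes a few lines with scikit-learn. The data here are a synthetic stand-in (the health records in the study are not public), and the coefficients are made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-outcome data: labels driven by a linear signal in a
# few of the features, plus noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5))
y = (X @ np.array([1.5, -2.0, 0.0, 0.5, 0.0]) + rng.normal(0, 0.5, 1000)) > 0

# L2-regularised logistic regression; C controls regularisation strength
# (smaller C means stronger regularisation).
clf = LogisticRegression(C=1.0).fit(X, y)
print(clf.score(X, y))  # accuracy of the simple, interpretable baseline
```

If a deep network cannot clearly beat this on held-out data, the extra cost and opacity are hard to justify.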

51 Standard tools for data analysis Linear algebra: go back to basics; keep your linear (matrix) algebra sharp. Python (3): astropy, numpy, scipy, scikit-learn, TensorFlow (not just for ML). Positives: good glue; human-readable, machine-executable; transferable skill. Negatives: only a little bit slow. Stan: probabilistic programming language. When to use: if you have a model that doesn't have bespoke parts (e.g., no models at grid points, or functions that are not differentiable). When not to use: when your model contains bespoke parts, or if statements (kinda). Fortran/C: betterise your code by speeding up the slowest parts; you can call Fortran or C functions directly from Python. PostgreSQL: learn it; write scripts to ingest data; you will thank yourself later. Hadoop: if you have a map-reduce job, use Hadoop. Transferable skill.

52 Resources Statistics: Information theory, inference and learning algorithms; Sokal's notes; Probabilistic Programming and Bayesian Methods for Hackers; Bayesian Data Analysis; Hamiltonian Monte Carlo. Version control: oh shit git. Machine Learning: Talking Machines; Which ML algorithm is for me?; Matrix calculus you need for deep learning; You should understand backpropagation; Machine Learning 101 (Google Engineers). Code: astropy, tensorflow, stan, scikit-learn, fortran from python. Probabilistic graphical models: an introduction. Linear algebra: immersive linear algebra.

53 Unsolicited advice for staying ahead of the data wave 1. Create a GitHub or BitBucket account and use it. Push daily. Push good code. Push bad code. Push grant proposals. Push paper drafts. Push. Push. Push. 2. Read arxiv: and do all the exercises. 3. Be familiar with tools (machine learning, optimisation algorithms, linear algebra) and know how to choose the right tool. It's hard. 4. Think about whether you can map-reduce your data analysis problem. If you can, learn Hadoop as part of that project. 5. Start with the simplest model for data analysis. But for fun, think about how to fit a line to one petabyte of data.

54 Gaia Sprints Not traditional scientific meetings. The aim is to bring together people who want to exploit Gaia data on short timescales. We do everything in the open. Open data. Open science. No invited participants; everyone applies to attend (incl. the SOC, the Gaia principal investigator, etc.). "Best scientific experience of my life." "Most important week of my year." Next Sprint: 2019, Santa Barbara. gaia.lol

55 Conclusions The data are only going to get bigger. Those who can't swim will drown. Those who can swim will swim in it.

56 Conclusions The data are only going to get bigger. Those who can't swim will drown. Those who can swim will swim in it. Remember to ask yourself: 1. Do you have small data, or do you have big data? 2. What is the simplest, dumbest model you can think of? 3. What assumptions are you making? 4. What is the utility of the model? 5. What can you afford?


More information

COMP90051 Statistical Machine Learning

COMP90051 Statistical Machine Learning COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 2. Statistical Schools Adapted from slides by Ben Rubinstein Statistical Schools of Thought Remainder of lecture is to provide

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Bayesian Linear Regression [DRAFT - In Progress]

Bayesian Linear Regression [DRAFT - In Progress] Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory

More information

A thorough derivation of back-propagation for people who really want to understand it by: Mike Gashler, September 2010

A thorough derivation of back-propagation for people who really want to understand it by: Mike Gashler, September 2010 A thorough derivation of back-propagation for people who really want to understand it by: Mike Gashler, September 2010 Define the problem: Suppose we have a 5-layer feed-forward neural network. (I intentionally

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

STA 414/2104: Lecture 8

STA 414/2104: Lecture 8 STA 414/2104: Lecture 8 6-7 March 2017: Continuous Latent Variable Models, Neural networks With thanks to Russ Salakhutdinov, Jimmy Ba and others Outline Continuous latent variable models Background PCA

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017 CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

More information

Bayesian Analysis for Natural Language Processing Lecture 2

Bayesian Analysis for Natural Language Processing Lecture 2 Bayesian Analysis for Natural Language Processing Lecture 2 Shay Cohen February 4, 2013 Administrativia The class has a mailing list: coms-e6998-11@cs.columbia.edu Need two volunteers for leading a discussion

More information

! p. 1. Observations. 1.1 Parameters

! p. 1. Observations. 1.1 Parameters 1 Observations 11 Parameters - Distance d : measured by triangulation (parallax method), or the amount that the star has dimmed (if it s the same type of star as the Sun ) - Brightness or flux f : energy

More information

Lecture - 24 Radial Basis Function Networks: Cover s Theorem

Lecture - 24 Radial Basis Function Networks: Cover s Theorem Neural Network and Applications Prof. S. Sengupta Department of Electronic and Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture - 24 Radial Basis Function Networks:

More information

CSC321 Lecture 20: Reversible and Autoregressive Models

CSC321 Lecture 20: Reversible and Autoregressive Models CSC321 Lecture 20: Reversible and Autoregressive Models Roger Grosse Roger Grosse CSC321 Lecture 20: Reversible and Autoregressive Models 1 / 23 Overview Four modern approaches to generative modeling:

More information

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi

Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking

More information

EM Algorithm & High Dimensional Data

EM Algorithm & High Dimensional Data EM Algorithm & High Dimensional Data Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD Gaussian EM Algorithm For the Gaussian mixture model, we have Expectation Step (E-Step): Maximization Step (M-Step): 2 EM

More information

A1101, Lab 11: Galaxies and Rotation Lab Worksheet

A1101, Lab 11: Galaxies and Rotation Lab Worksheet Student Name: Lab Partner Name: Lab TA Name: Part 1: Classifying Galaxies A1101, Lab 11: Galaxies and Rotation Lab Worksheet In the 1930s, Edwin Hubble defined what is still the most influential system

More information

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu

Lecture: Gaussian Process Regression. STAT 6474 Instructor: Hongxiao Zhu Lecture: Gaussian Process Regression STAT 6474 Instructor: Hongxiao Zhu Motivation Reference: Marc Deisenroth s tutorial on Robot Learning. 2 Fast Learning for Autonomous Robots with Gaussian Processes

More information

Please bring the task to your first physics lesson and hand it to the teacher.

Please bring the task to your first physics lesson and hand it to the teacher. Pre-enrolment task for 2014 entry Physics Why do I need to complete a pre-enrolment task? This bridging pack serves a number of purposes. It gives you practice in some of the important skills you will

More information

Astronomy 421. Lecture 8: Binary stars

Astronomy 421. Lecture 8: Binary stars Astronomy 421 Lecture 8: Binary stars 1 Key concepts: Binary types How to use binaries to determine stellar parameters The mass-luminosity relation 2 Binary stars So far, we ve looked at the basic physics

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

Statistical Modeling. Prof. William H. Press CAM 397: Introduction to Mathematical Modeling 11/3/08 11/5/08

Statistical Modeling. Prof. William H. Press CAM 397: Introduction to Mathematical Modeling 11/3/08 11/5/08 Statistical Modeling Prof. William H. Press CAM 397: Introduction to Mathematical Modeling 11/3/08 11/5/08 What is a statistical model as distinct from other kinds of models? Models take inputs, turn some

More information

f rot (Hz) L x (max)(erg s 1 )

f rot (Hz) L x (max)(erg s 1 ) How Strongly Correlated are Two Quantities? Having spent much of the previous two lectures warning about the dangers of assuming uncorrelated uncertainties, we will now address the issue of correlations

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Observational Parameters

Observational Parameters Observational Parameters Classical cosmology reduces the universe to a few basic parameters. Modern cosmology adds a few more, but the fundamental idea is still the same: the fate and geometry of the universe

More information

Neural Networks for Machine Learning. Lecture 11a Hopfield Nets

Neural Networks for Machine Learning. Lecture 11a Hopfield Nets Neural Networks for Machine Learning Lecture 11a Hopfield Nets Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Hopfield Nets A Hopfield net is composed of binary threshold

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Lecture 7: Con3nuous Latent Variable Models

Lecture 7: Con3nuous Latent Variable Models CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 7: Con3nuous Latent Variable Models All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/

More information

Lines of Hydrogen. Most prominent lines in many astronomical objects: Balmer lines of hydrogen

Lines of Hydrogen. Most prominent lines in many astronomical objects: Balmer lines of hydrogen The Family of Stars Lines of Hydrogen Most prominent lines in many astronomical objects: Balmer lines of hydrogen The Balmer Thermometer Balmer line strength is sensitive to temperature: Most hydrogen

More information

Selected Questions from Minute Papers. Outline - March 2, Stellar Properties. Stellar Properties Recap. Stellar properties recap

Selected Questions from Minute Papers. Outline - March 2, Stellar Properties. Stellar Properties Recap. Stellar properties recap Black Holes: Selected Questions from Minute Papers Will all the material in the Milky Way eventually be sucked into the BH at the center? Does the star that gives up mass to a BH eventually get pulled

More information

Ordinary Least Squares Linear Regression

Ordinary Least Squares Linear Regression Ordinary Least Squares Linear Regression Ryan P. Adams COS 324 Elements of Machine Learning Princeton University Linear regression is one of the simplest and most fundamental modeling ideas in statistics

More information

DD Advanced Machine Learning

DD Advanced Machine Learning Modelling Carl Henrik {chek}@csc.kth.se Royal Institute of Technology November 4, 2015 Who do I think you are? Mathematically competent linear algebra multivariate calculus Ok programmers Able to extend

More information

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017

COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 COMS 4721: Machine Learning for Data Science Lecture 10, 2/21/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University FEATURE EXPANSIONS FEATURE EXPANSIONS

More information

Artificial Neural Networks 2

Artificial Neural Networks 2 CSC2515 Machine Learning Sam Roweis Artificial Neural s 2 We saw neural nets for classification. Same idea for regression. ANNs are just adaptive basis regression machines of the form: y k = j w kj σ(b

More information

Exploratory Factor Analysis and Principal Component Analysis

Exploratory Factor Analysis and Principal Component Analysis Exploratory Factor Analysis and Principal Component Analysis Today s Topics: What are EFA and PCA for? Planning a factor analytic study Analysis steps: Extraction methods How many factors Rotation and

More information

Astronomical Study: A Multi-Perspective Approach

Astronomical Study: A Multi-Perspective Approach Astronomical Study: A Multi-Perspective Approach Overview of Stars Motion Distances Physical Properties Spectral Properties Magnitudes Luminosity class Spectral trends Binary stars and getting masses Stellar

More information

Gaia Status & Early Releases Plan

Gaia Status & Early Releases Plan Gaia Status & Early Releases Plan F. Mignard Univ. Nice Sophia-Antipolis & Observatory of the Côte de Azur Gaia launch: 20 November 2013 The big news @ 08:57:30 UTC 2 Gaia: a many-sided mission Driven

More information

Detecting the Unexpected

Detecting the Unexpected Detecting the Unexpected Discovery in the Era of Astronomically Big Data Insights from Space Telescope Science Institute s first Big Data conference Josh Peek, Chair Sarah Kendrew, C0-Chair SOC: Erik Tollerud,

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

Midterm sample questions

Midterm sample questions Midterm sample questions CS 585, Brendan O Connor and David Belanger October 12, 2014 1 Topics on the midterm Language concepts Translation issues: word order, multiword translations Human evaluation Parts

More information

VBM683 Machine Learning

VBM683 Machine Learning VBM683 Machine Learning Pinar Duygulu Slides are adapted from Dhruv Batra Bias is the algorithm's tendency to consistently learn the wrong thing by not taking into account all the information in the data

More information

Introduction to Algebra: The First Week

Introduction to Algebra: The First Week Introduction to Algebra: The First Week Background: According to the thermostat on the wall, the temperature in the classroom right now is 72 degrees Fahrenheit. I want to write to my friend in Europe,

More information

Active Galaxies and Galactic Structure Lecture 22 April 18th

Active Galaxies and Galactic Structure Lecture 22 April 18th Active Galaxies and Galactic Structure Lecture 22 April 18th FINAL Wednesday 5/9/2018 6-8 pm 100 questions, with ~20-30% based on material covered since test 3. Do not miss the final! Extra Credit: Thursday

More information

Discrete Mathematics and Probability Theory Spring 2014 Anant Sahai Note 10

Discrete Mathematics and Probability Theory Spring 2014 Anant Sahai Note 10 EECS 70 Discrete Mathematics and Probability Theory Spring 2014 Anant Sahai Note 10 Introduction to Basic Discrete Probability In the last note we considered the probabilistic experiment where we flipped

More information

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS

CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS CS 179: LECTURE 16 MODEL COMPLEXITY, REGULARIZATION, AND CONVOLUTIONAL NETS LAST TIME Intro to cudnn Deep neural nets using cublas and cudnn TODAY Building a better model for image classification Overfitting

More information

Gaussian Quiz. Preamble to The Humble Gaussian Distribution. David MacKay 1

Gaussian Quiz. Preamble to The Humble Gaussian Distribution. David MacKay 1 Preamble to The Humble Gaussian Distribution. David MacKay Gaussian Quiz H y y y 3. Assuming that the variables y, y, y 3 in this belief network have a joint Gaussian distribution, which of the following

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

18.9 SUPPORT VECTOR MACHINES

18.9 SUPPORT VECTOR MACHINES 744 Chapter 8. Learning from Examples is the fact that each regression problem will be easier to solve, because it involves only the examples with nonzero weight the examples whose kernels overlap the

More information