Data Exploration via Local Two-Sample Testing


1 Data Exploration via Local Two-Sample Testing Freeman, Kim, and Lee (2017)

2 Astrostatistics at Carnegie Mellon CMU Astrostatistics Network Graph 2017 (not including collaborations external to Pittsburgh, and not including astronomy grad students, at least not yet). [Figure: network graph whose nodes are group members labeled by initials. Annotations: "The Department Head, detached from the astrostatistical body." and "The interface to LSST."]

3 Astrostatistics at Carnegie Mellon Things That Interest Us: high-dimensional data (with lower-dimensional intrinsic geometries); dealing with non-representative data (selection bias); multivariate conditional density estimation, $f(y_1, \ldots, y_k \mid x_1, \ldots, x_p)$; likelihood-free inference (e.g., approximate Bayesian computation); photometric redshift estimation; galaxy evolution; hurricanes (on Earth, not other planets); whatever is interesting and has inferential aspects (but prediction is OK too, in moderation).

4 Astrostatistics at Carnegie Mellon: Tools arxiv: Flexible Conditional Density Estimator arxiv: Python version of FlexCoDE

Astrostatistics at Carnegie Mellon: Tools arxiv: 1306.

5 Astrostatistics at Carnegie Mellon: Tools arxiv: What we will concentrate on today! arxiv:

6 The Statistical Setting Assume that we have measurements x of astronomical objects that are divided into two classes (e.g., high-mass vs. low-mass, discovered-via-radial-velocity vs. discovered-via-transit, Cepheid vs. RR Lyrae, etc.). Each object is associated with p measurements, i.e., each x is a vector of length p, and is sampled from some multidimensional probability density function: $x^0_1, \ldots, x^0_m \sim P_0$ and $x^1_1, \ldots, x^1_n \sim P_1$. A global two-sample test would ask whether $P_0$ and $P_1$ are the same, i.e., it would test the hypotheses $H_0: P_0 = P_1$ against $H_1: P_0 \neq P_1$.
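For concreteness, a global two-sample test can be sketched as a permutation test. This is only an illustration and not the test used in the talk: the statistic (distance between the two sample mean vectors) is an assumption chosen for brevity and only detects differences in means, and the function names are hypothetical.

```python
import random
from math import sqrt

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mean_distance(rows, m):
    """Euclidean distance between the means of rows[:m] and rows[m:]."""
    a, b = mean_vector(rows[:m]), mean_vector(rows[m:])
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def permutation_test(sample0, sample1, n_perm=999, seed=0):
    """Global test of H0: P0 = P1 by permuting the class labels.

    Returns the observed statistic and a permutation p-value.
    """
    rng = random.Random(seed)
    m = len(sample0)
    pooled = list(sample0) + list(sample1)  # local copy, safe to shuffle
    observed = mean_distance(pooled, m)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign objects to the two classes at random
        if mean_distance(pooled, m) >= observed:
            exceed += 1
    # The +1 terms count the observed labeling as one of the permutations.
    return observed, (exceed + 1) / (n_perm + 1)

# Toy data: two 2-D Gaussian samples whose means differ by 2 in each coordinate.
rng = random.Random(1)
class0 = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
class1 = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(40)]
stat, p_value = permutation_test(class0, class1, n_perm=199)
print(stat, p_value)
```

A small p-value here says only that the two samples differ somewhere; it gives no indication of where in predictor space the difference lies.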

7 The Statistical Setting Global two-sample tests are generally non-informative: rejecting $H_0$ does not say where in predictor space the two distributions differ. CMU graduate student Ilmun Kim developed a local two-sample test that, at a series of test points $x_i$, compares $H_{i,0}: P(Y = y \mid X = x_i) = P(Y = y)$ against $H_{i,1}: P(Y = y \mid X = x_i) \neq P(Y = y)$ with test statistic $\hat{T}(x_i) = \dfrac{\hat{P}(Y = 1 \mid x_i) - \hat{P}(Y = 1)}{\sqrt{\hat{v}(x_i)}}$
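How a statistic of this form might be turned into a p-value can be sketched as follows. Two things here are assumptions and are not stated on the slide: that the term in the denominator is a variance estimate supplied by the caller, and that the statistic is referred to a standard normal distribution under the null. The function names are hypothetical.

```python
from math import erfc, sqrt

def local_test_statistic(p_local, p_global, v_hat):
    """Studentized difference between the local and global class-1 proportions.

    p_local  : estimate of P(Y = 1 | x_i) at the test point
    p_global : estimate of P(Y = 1) over the whole sample
    v_hat    : variance estimate at x_i (assumed to be given)
    """
    return (p_local - p_global) / sqrt(v_hat)

def two_sided_p_value(t_stat):
    """Two-sided p-value, ASSUMING a standard normal reference distribution."""
    return erfc(abs(t_stat) / sqrt(2.0))

# A test point where 80% of nearby objects are class 1 versus 50% globally.
t = local_test_statistic(p_local=0.8, p_global=0.5, v_hat=0.01)
print(t, two_sided_p_value(t))
```

The sign of the statistic indicates whether the local proportion lies above or below the global one; the p-value alone does not.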

8 The Statistical Setting How does one estimate $P(Y = y \mid X = x_i)$?
a) Split the labeled data into training and test sets.
b) Create N random subsets of the training data, each of size n, sampled without replacement.
c) Loop over the N subsets:
   i) Split the subset into training and validation sets.
   ii) Learn a decision tree from the training set.
   iii) Place the validation and test data into the tree nodes. The average of the validation labels in a node is the predicted class proportion for that subset.
d) Each test datum $x_i$ thus has N associated proportion estimates; their mean, e.g., is the estimate of $P(Y = y \mid X = x_i)$.
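Steps b) through d) can be sketched in a few dozen lines. This is a minimal illustration, not the code in ltst.py: the tree is a deliberately small CART-style stand-in written with the standard library, and the function names, the 50/50 training/validation split, the depth limit, and the minimum leaf size are all assumptions.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def grow(X, y, depth=0, max_depth=3, min_leaf=5):
    """Grow a small binary tree. A leaf is None; a split is (feature, threshold, left, right)."""
    best = None
    if depth < max_depth and len(set(y)) > 1:
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X})[:-1]:
                left = [yi for row, yi in zip(X, y) if row[j] <= t]
                right = [yi for row, yi in zip(X, y) if row[j] > t]
                if min(len(left), len(right)) < min_leaf:
                    continue
                score = len(left) * gini(left) + len(right) * gini(right)
                if best is None or score < best[0]:
                    best = (score, j, t)
    if best is None:
        return None
    _, j, t = best
    lo = [(row, yi) for row, yi in zip(X, y) if row[j] <= t]
    hi = [(row, yi) for row, yi in zip(X, y) if row[j] > t]
    return (j, t,
            grow([r for r, _ in lo], [v for _, v in lo], depth + 1, max_depth, min_leaf),
            grow([r for r, _ in hi], [v for _, v in hi], depth + 1, max_depth, min_leaf))

def leaf_of(tree, x):
    """Identify the leaf containing x by its path of left/right decisions."""
    path = []
    while tree is not None:
        j, t, left, right = tree
        go_right = x[j] > t
        path.append(go_right)
        tree = right if go_right else left
    return tuple(path)

def local_proportions(X_train, y_train, X_test, n_subsets=50, subset_size=200, seed=0):
    """Average, over random subsets, the validation-label proportion in each test point's leaf."""
    rng = random.Random(seed)
    estimates = [[] for _ in X_test]
    for _ in range(n_subsets):
        idx = rng.sample(range(len(X_train)), subset_size)  # step b): without replacement
        half = subset_size // 2
        fit_idx, val_idx = idx[:half], idx[half:]           # step c-i)
        tree = grow([X_train[i] for i in fit_idx],
                    [y_train[i] for i in fit_idx])          # step c-ii)
        by_leaf = {}
        for i in val_idx:                                   # step c-iii)
            by_leaf.setdefault(leaf_of(tree, X_train[i]), []).append(y_train[i])
        for k, x in enumerate(X_test):
            labels = by_leaf.get(leaf_of(tree, x))
            if labels:  # skip subsets whose leaf holds no validation data
                estimates[k].append(sum(labels) / len(labels))
    # step d): one mean per test point (NaN if no subset gave an estimate)
    return [sum(e) / len(e) if e else float("nan") for e in estimates]

# Toy data: one predictor; class 1 is common for x > 0.5 and rare otherwise.
rng = random.Random(2)
X = [[rng.random()] for _ in range(400)]
y = [1 if rng.random() < (0.9 if row[0] > 0.5 else 0.1) else 0 for row in X]
print(local_proportions(X, y, [[0.1], [0.9]], n_subsets=20))
```

In this sketch the tree structure comes from the training half of each subset while the proportions come from the validation half, which follows step c-iii) above.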

9 The Statistical Setting previous slides Freeman, Kim, and Lee (2017)

10 (Known) Caveats The current code assumes two unique response values. The allowed response values are 0 and 1. Empirical observation: the performance of LTST degrades if the dataset includes too many relatively uninformative predictor variables. (TBS: to be studied.) Memory use effectively limits sample size to For now. The effect of changing the number of subsets N has not been studied. The false-discovery rate correction for multiple comparisons depends on the number of test points: too many test points lead to a loss of test power. (And is it needed?) The current code crashes in Python 3.5: use 3.6. The current code is untested in Python 2: use 3.6. If you uncover anything else, contact pfreeman@cmu.edu
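The slide does not say which false-discovery rate procedure the code applies. As one common choice, the Benjamini-Hochberg step-up procedure is sketched below; treating it as the LTST correction is an assumption, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null is rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: find the largest rank whose p-value is below alpha * rank / m.
        if p_values[i] <= alpha * rank / m:
            n_reject = rank
    reject = [False] * m
    for i in order[:n_reject]:
        reject[i] = True
    return reject

# One p-value per test point.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Under this procedure the smallest p-value must fall below alpha / m, where m is the number of test points, so the threshold tightens as test points are added. That is one way to see the loss of power mentioned above.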

11 Wait: How Does This Differ From, e.g., Random Forest? Random forest prediction produces a classification based on a threshold number of class votes across trees. It does not attempt to determine if the observed number of votes is, e.g., consistent with the adopted threshold. In other words, there is no hypothesis test involved. Importantly, LTST does not actually classify a test datum! It determines whether it exists in a region of predictor space dominated by one of the classes (or by neither). So, e.g., a test datum in a Cepheid-dominated region may itself be an RR Lyrae variable. Ultimately, LTST is about discovering neighborhoods in predictor space, and not about classification per se.

12 To Play Along Click on DES JupyterLabs Click on Deploy Lab and then Go To Lab On the file tab, click through to ltst, under examples (Note: you may need to copy ltst.py to the directory LSST_Viz. You can do this by making a duplicate, moving the duplicate to the LSST_Viz directory by dragging, and renaming it ltst.py in that directory.)


13 Example #1: Galaxy Morphology Summary Statistics [Diagram with a high-dimensional to low-dimensional axis, listing: parametrized model fitting; image features (i.e., summary statistics); human brains (i.e., natural neural networks); connectivity/unsupervised learning.] Our data: 7 summary statistics for 2487 galaxies, with mass as response. & statistics_large.csv

14 Example #1: Galaxy Morphology Summary Statistics Visualization of 700 test data in first two diffusion-space coordinates. Blue: test data that lie in regions of predictor space where the proportion of high-mass galaxies is significantly larger than the global proportion. Red: significantly smaller than the global proportion. Green: consistent with the global proportion (i.e., we fail to reject the null).
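The color coding can be expressed as a small rule. The sketch assumes that a positive test statistic means the local proportion of the class coded 1 (here, high-mass) exceeds the global proportion, and that `rejected` is the outcome of the test after any multiple-comparison correction; the function name is hypothetical.

```python
def region_color(t_stat, rejected):
    """Map a test point's result to the plot colors described above."""
    if not rejected:
        return "green"  # consistent with the global proportion
    return "blue" if t_stat > 0 else "red"

# Three test points: significantly high, significantly low, not significant.
for t_stat, rejected in [(3.2, True), (-2.9, True), (0.4, False)]:
    print(t_stat, region_color(t_stat, rejected))
```

Note that the color describes the neighborhood of a test point, not the class of the test point itself.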

15 Example #2: Exoplanetary Properties Extracted from the Exoplanet Archive. Predictors: period, semi-major axis length, mass, eccentricity, number of planets in system. Response: discovery method (radial velocity vs. transit). 750 training data, 249 test data. Visualization via boxplots:

16 Example #3: Your Data (or Mine) In the time remaining: Using either your own data or the example dataset supplied in work through your own LTST analysis!
