Nonlinear Data Transformation with Diffusion Map


1 Nonlinear Data Transformation with Diffusion Map
Peter Freeman, Ann Lee, Joey Richards*, Chad Schafer
Department of Statistics, Carnegie Mellon University
* now at U.C. Berkeley
[Figure: data embedded in diffusion coordinates; predict redshift and identify outliers (Richards, Freeman, Lee, Schafer 2009).]
Richards et al. (2009; ApJ 691, 32)

2 Data Transformation: Why Do It?
The Problem: Astronomical data that inhabit complex structures in (high-)dimensional spaces are difficult to analyze using standard statistical methods. For instance, we may want to:
Estimate photometric redshifts from galaxy colors
Estimate galaxy parameters (age, metallicity, etc.) from galaxy spectra
Classify supernovae using irregularly spaced photometric observations
The Solution: If these data possess a simpler underlying geometry in the original data space, we transform the data so as to capture and exploit that geometry. Usually (but not always), transforming the data effects dimensionality reduction, mitigating the curse of dimensionality. We seek to transform the data in such a way as to preserve relevant physical information whose variation is apparent in the original data.

3 Data Transformation: Example
These data inhabit a one-dimensional manifold in a two-dimensional space. Perhaps a physical parameter of interest (e.g., redshift) varies smoothly along the manifold. Note that these may be non-standard data (e.g., each data point may represent a vector of values, like a spectrum). We want to transform the data in such a way that we can employ simple statistics (e.g., linear regression) to accurately model the variation of that physical parameter.

4 The Classic Choice: PCA Principal components analysis will do a terrible job (at dimension reduction) in this instance because it is a linear transformer.

5 The Classic Choice: PCA Principal components analysis will do a terrible job (at dimension reduction) in this instance because it is a linear transformer. In PCA, high-dimensional data are projected onto hyperplanes. Physical information may not be well-preserved in the transformation.
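To make the failure mode concrete, here is a minimal sketch (not from the talk) that generates a noisy one-dimensional spiral like the one pictured and projects it onto its first principal component with scikit-learn; the data, the seed, and the correlation check are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Noisy one-dimensional spiral embedded in two dimensions.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.5, 4 * np.pi, 1000))    # parameter along the manifold
X = np.column_stack([t * np.cos(t), t * np.sin(t)])
X += rng.normal(scale=0.1, size=X.shape)          # small observational noise

# PCA can only project onto a hyperplane (here, a single line through the data).
pc1 = PCA(n_components=1).fit_transform(X).ravel()

# If the projection preserved the manifold, pc1 would track t monotonically;
# for a spiral it does not, so the correlation with the true parameter is weak.
print("corr(PC1, t) =", np.corrcoef(pc1, t)[0, 1])
```

Because the projection folds distant parts of the curve on top of one another, no linear coordinate can recover the parameter varying along the spiral, which is exactly why a nonlinear map is wanted here.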

6 Nonlinear Data Transformation
There are many available methods for nonlinear data transformation which have yet to be widely applied to astronomical data:
Local linear embedding (LLE; see, e.g., Vanderplas & Connolly 2009)
Others: Laplacian eigenmaps, Hessian eigenmaps, LTSA
We apply the diffusion map (Coifman & Lafon 2006, Lafon & Lee 2006; see the diffusionmap R package).
The Idea: estimate the true distance between two data points via a fictive diffusion (i.e., Markov random walk) process.
The Advantage: the Euclidean distance between points x and y in the space of transformed data is approximately the diffusion distance between those points in the original data space. Thus variations of physical parameters along the original manifold are approximately preserved in the new data space.

7 Diffusion Map: Intuition
Pick a location, set up a kernel centered on that point, and map out the random walk.
[Figure: diffusion distances and the distribution over points after the first step (t = 1), the second step (t = 2), and the 25th step (t = 25), for a kernel centered on one point.]

8 Diffusion Map: The Math (Part I)
Define a similarity measure between two points x and y, e.g., the Euclidean distance:
s(x, y) = \sqrt{ \sum_{i=1}^{p} ( c_{x,i} - c_{y,i} )^2 }
Construct a weighted graph:
w(x, y) = \exp\left( -\frac{s(x, y)^2}{\epsilon} \right)
where \epsilon is a tuning parameter.
Row-normalize to compute one-step probabilities:
p_1(x, y) = w(x, y) / \sum_z w(x, z)
Use p_1(x, y) to populate the n x n matrix P of one-step probabilities between all n data points.
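The two formulas above translate directly into a few lines of code. The following is a minimal NumPy/SciPy sketch, not the diffusionmap R package used by the authors; the function name one_step_matrix and the bandwidth argument eps are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def one_step_matrix(X, eps):
    """Kernel-weighted graph and row-normalized one-step probabilities.

    X   : (n, p) array of n data points in p dimensions
    eps : kernel bandwidth (the tuning parameter epsilon)
    """
    s = cdist(X, X, metric="euclidean")       # pairwise similarities s(x, y)
    W = np.exp(-s**2 / eps)                   # weighted graph w(x, y)
    P = W / W.sum(axis=1, keepdims=True)      # p_1(x, y): each row sums to one
    return P
```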

9 Diffusion Map: The Math (Part II)
The probability of stepping from x to y in t steps is given by P^t.
The diffusion distance between x and y at time t is
D_t^2(x, y) = \sum_{j \geq 1} \lambda_j^{2t} ( \psi_j(x) - \psi_j(y) )^2
where \lambda_j is the j-th largest eigenvalue of P and \psi_j is the j-th (right) eigenvector of P.
Retain the top m eigenmodes to create the diffusion map from R^p to R^m:
\Psi_t : x \mapsto [ \lambda_1^t \psi_1(x), \lambda_2^t \psi_2(x), ..., \lambda_m^t \psi_m(x) ]
so that
D_t^2(x, y) \approx \sum_{j=1}^{m} \lambda_j^{2t} ( \psi_j(x) - \psi_j(y) )^2 = \| \Psi_t(x) - \Psi_t(y) \|^2.
The tuning parameters \epsilon and m are determined by minimizing predictive risk (a topic I will skip over in the interests of time). The choice of t generally does not matter.
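Continuing that sketch, the diffusion coordinates follow from an eigendecomposition of P. This simplified version keeps the top m nontrivial eigenpairs and fixes t = 1 (since, as noted, the choice of t generally does not matter); it omits the stationary-distribution normalization of the eigenvectors used in the full treatment, so it is an illustrative assumption rather than the authors' implementation.

```python
import numpy as np

def diffusion_map(P, m, t=1):
    """Map each point to its top-m diffusion coordinates.

    P : (n, n) row-stochastic one-step probability matrix
    m : number of eigenmodes to retain
    t : diffusion time (scales coordinate j by lambda_j**t)
    """
    eigvals, eigvecs = np.linalg.eig(P)        # right eigenvectors of P
    order = np.argsort(-eigvals.real)          # sort by decreasing eigenvalue
    lam = eigvals.real[order][1:m + 1]         # skip the trivial lambda = 1 mode
    psi = eigvecs.real[:, order][:, 1:m + 1]
    return psi * lam**t                        # columns: lambda_j^t * psi_j(x)
```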

10 The Spiral, Redux
[Figure: the first and second diffusion coordinate functions, and the first and second coordinate plots for the diffusion map applied to the spiral example.]

11 Application I
Spectroscopic redshift estimation and outlier detection using SDSS galaxy spectra.
Estimation via adaptive regression:
\hat{r}(\Psi_t) = \Psi_t \beta = \sum_{j=1}^{m} \beta_j \Psi_{t,j}(x) = \sum_{j=1}^{m} \beta_j \lambda_j^t \psi_j(x) = \sum_{j=1}^{m} \beta_j' \psi_j(x)
We see that the choice of the parameter t is unimportant: \lambda_j^t is simply absorbed into the regression coefficients.
[Figure: galaxies in diffusion coordinates, colored by true spectroscopic redshift; predict redshift and identify outliers (Richards, Freeman, Lee, Schafer 2009).]
Richards et al. (2009; ApJ 691, 32)
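A hedged sketch of the regression step: plain least squares of spectroscopic redshift on the diffusion coordinates, standing in for the adaptive regression of Richards et al. (2009). It reuses the illustrative one_step_matrix and diffusion_map helpers defined above; the function name and arguments are assumptions.

```python
import numpy as np

# X: (n, p) galaxy spectra (or colors); z: (n,) spectroscopic redshifts.
def fit_redshift_model(X, z, eps, m):
    Psi = diffusion_map(one_step_matrix(X, eps), m)   # (n, m) diffusion coordinates
    A = np.column_stack([np.ones(len(z)), Psi])       # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, z, rcond=None)      # least-squares coefficients
    return beta, A @ beta                             # coefficients, fitted redshifts
```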

12 Application II
Estimating properties of SDSS galaxies (age, metallicity, etc.) using a subset of the Bruzual & Charlot (2003) dictionary of theoretical galaxy spectra. Selection of prototype spectra made through diffusion K-means.
[Figure: the dictionary spectra embedded in the first three diffusion coordinates, with axes \lambda_j \psi_j / (1 - \lambda_j) for j = 1, 2, 3; for estimating properties of galaxies (Richards, Freeman, Lee, Schafer 2009).]
Richards et al. (2009; MNRAS 399, 1044)

13 Application III
Photometric redshift estimation for SDSS Main Sample Galaxies (MSGs) using SCA.
Uses the Nyström Extension for quickly predicting photometric redshifts of test-set galaxies, given the diffusion coordinates of the training-set data.
Displays the effect of flux measurement error upon the predictions: attenuation bias.
[Figure: photometric redshift estimate versus spectroscopic redshift for randomly selected objects in the MSG validation set, with (\epsilon, m) = (0.05, 150); photometric redshift estimation (Freeman, Newman, Lee, Richards, Schafer).]
Freeman et al. (2009; MNRAS 398, 2012)

14 Application IV
Classifying SNe in the Supernova Photometric Classification Challenge (Kessler et al., arXiv).
[Figure: example supernova light curves in the g, r, i, and z bands.]
See talk by Joey Richards for more details!
Richards et al. (2010; in preparation)

15 Future Application
Transform observed light curves and theoretical light curves to a low-dimensional encoding space, where they may be compared using nonparametric density estimation.
[Figure: supernova light curves in data space, mapped to components 1-3 of an encoding space, and compared in distribution space against the physically possible distributions via a confidence/credible region.]

16 Diffusion Map: Challenges
Computational Challenge I: efficient construction of the weighted graph w.
Distance computation slow for high-dimensional data.
Graph may be sparse: can we short-circuit the distance computation?
Computational Challenge II: execution time and memory requirements for eigen-decomposition of the one-step probability matrix P.
SVD limited to approximately 10,000 x 10,000 matrices on typical desktop computers.
Slow: we only need the top n% of eigenvalues and eigenvectors, but typical SVD implementations compute all of them.
P may be sparse: efficient sparse SVD algorithms? Would the algorithm of Budavári et al. (2009; MNRAS 394, 1496) help?
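For Challenge II, one plausible route (an assumption, not something prescribed in the talk) is an iterative sparse eigensolver that returns only the leading eigenpairs. The sketch below uses scipy.sparse.linalg.eigs on a toy sparse transition matrix; the helper name and the path-graph example are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigs

def top_eigenmodes(P_sparse, m):
    """Leading m+1 eigenpairs of a sparse transition matrix P.

    The iterative (ARPACK) solver computes only the requested eigenpairs,
    so the full decomposition is never formed; the trivial eigenvalue 1
    is included here and can be dropped afterwards.
    """
    vals, vecs = eigs(P_sparse, k=m + 1, which="LM")   # largest-magnitude eigenvalues
    order = np.argsort(-vals.real)
    return vals.real[order], vecs.real[:, order]

# Toy example: the random walk on a sparse path graph with 2000 nodes.
n = 2000
W = sp.diags([np.ones(n - 1), np.ones(n - 1)], offsets=[1, -1], format="csr")
P = sp.diags(1.0 / np.asarray(W.sum(axis=1)).ravel()) @ W
lam, psi = top_eigenmodes(P, m=10)
```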

17 Diffusion Map: Challenges Computational Challenge III: efficient implementation of the Nyström Extension to apply training set results to far larger test sets. Predictions for 350,000 SDSS MSGs computed in 10 CPU hours...is this too slow in the era of LSST?

18 And One Statistical Challenge
Flux measurement error causes attenuation bias: predicted redshifts are pulled toward the sample mean.
Can attenuation bias be effectively mitigated? TBD. This is not diffusion map specific...
[Figure: photometric redshift estimate versus spectroscopic redshift, with sample bias and sample standard deviation, for the MSG validation set with (\epsilon, m) = (0.05, 150) and for the LRG validation set with (\epsilon, m) = (0.012, 200); 5\sigma outliers are removed prior to plotting, leaving 9740 and 9579 plotted points, respectively. Photometric redshift estimation using SCA (Freeman, Newman, Lee, Richards, Schafer 2009).]

19 And One Statistical Challenge...
The same attenuation bias appears in other photometric redshift estimators:
ANNz (Collister & Lahav 2004; PASP 116, 345)
[Figure: spectroscopic vs. photometric redshifts for ANNz applied to 10,000 galaxies randomly selected from the SDSS EDR, using five-band photometry; a committee of five 5:10:10:1 networks was trained on the training and validation sets, then applied to the evaluation set.]
kNN (Ball et al. 2008; ApJ 683, 12)
[Figure: photometric vs. spectroscopic redshift for the 82,672 SDSS DR5 main sample galaxies of the blind testing set (20% of the sample); z_phot is the mean photometric redshift from the PDF for each object, and \sigma is the RMS dispersion between z_phot and z_spec.]

20 Summary
Methods of nonlinear data transformation such as diffusion map can help make statistical analyses of complex (and perhaps high-dimensional) data tractable.
Analyses with diffusion map generally outperform (i.e., result in a lower predictive risk than) similar analyses with PCA, a linear technique.
Nonlinear techniques have great promise in the era of LSST, so long as certain computational challenges are overcome. We seek:
Optimal construction of weighted graphs
Optimal implementations of SVD (memory, execution time, sparsity)
Optimal implementation of the Nyström Extension
Regardless of whether the challenges are overcome, the accuracy of our results may be limited by measurement error.

21 Predictive Risk: an Algorithm
Pick tuning parameter values \epsilon and m. Transform the data into diffusion space.
Perform k-fold cross-validation on the transformed data:
Assign each datum to one of k groups.
Fit model (e.g., linear regression) to the data in k-1 groups (i.e., leave the data of the k-th group out of the fit).
Given the best-fit model, compute the estimate \hat{y}_i for all data in the k-th group.
Repeat the process until all k groups have been held out.
Assuming the L_2 (squared-error) loss function, our estimate of the predictive risk is generally
\hat{R}(\epsilon, m) = \frac{1}{n} \sum_{j=1}^{n} [ \hat{y}_j(\epsilon, m) - y_j ]^2
We vary \epsilon and m until the predictive risk estimate is minimized.
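A minimal sketch of this loop, reusing the illustrative helpers from earlier. For brevity the diffusion coordinates are computed once from all of the data, whereas a full treatment would recompute them for each fold (or extend them via the Nyström extension); a grid search over \epsilon and m would wrap this function.

```python
import numpy as np

def predictive_risk(X, y, eps, m, k=10, seed=0):
    """k-fold cross-validation estimate of the squared-error predictive risk."""
    n = len(y)
    groups = np.random.default_rng(seed).permutation(n) % k   # assign each datum to a group
    Psi = diffusion_map(one_step_matrix(X, eps), m)           # transform into diffusion space
    A = np.column_stack([np.ones(n), Psi])                    # design matrix with intercept
    yhat = np.empty(n)
    for g in range(k):
        held_out = groups == g
        beta, *_ = np.linalg.lstsq(A[~held_out], y[~held_out], rcond=None)
        yhat[held_out] = A[held_out] @ beta                   # predict the held-out group
    return np.mean((yhat - y) ** 2)                           # R_hat(eps, m)
```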

22 Nyström Extension
The basic idea: compute the similarity of a test-set datum to the training-set data, and use that similarity to determine the diffusion coordinate for that datum via interpolation, with no eigen-decomposition.
Mathematically:
\Psi' = W' \Psi \Lambda
where W' is the matrix of similarities between the test-set data and the training-set data, \Psi holds the training-set eigenvectors, and \Lambda is a diagonal matrix with entries 1/\lambda_i.
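A short sketch of that interpolation, reusing the kernel from the earlier one_step_matrix helper; row-normalizing the test-to-training similarities before applying the training eigenvectors is an assumed convention, not a detail given on the slide.

```python
import numpy as np
from scipy.spatial.distance import cdist

def nystrom_extend(X_test, X_train, psi, lam, eps):
    """Approximate diffusion coordinates of new points, with no eigen-decomposition.

    psi : (n_train, m) training-set eigenvectors
    lam : (m,) corresponding eigenvalues
    """
    W = np.exp(-cdist(X_test, X_train) ** 2 / eps)   # test-to-training similarities
    W /= W.sum(axis=1, keepdims=True)                # row-normalize (assumed convention)
    return W @ psi / lam                             # Psi' = W' Psi Lambda, Lambda = diag(1/lambda_i)
```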
