Three data analysis problems Andreas Zezas University of Crete CfA
Two types of problems: Fitting Source Classification
Fitting: complex datasets
Fitting: complex datasets Maragoudakis et al. in prep.
Fitting: complex datasets Galaxy Spectral Engery Distribution (SED) 10 5 10 3 10 1 10 1 Fnu (mjy) 10 3 10 5 10 7 Best Model Optical Spectrum Photometric Flux 10 9 10 11 10 13 10 15 10 1 10 2 10 3 10 4 10 5 10 6 10 7 10 8 Wavelength (Å)
Fitting: complex datasets data and folded model Galaxy Spectral Engery Distribution (SED) 10 5 0.1 10 3 Iterative fitting may work, but it is inefficient and confidence intervals on parameters not reliable Fnu (mjy) 10 1 0.01 10 1 10 3 10 3 10 5 10 7 4 Best Model Optical Spectrum Photometric Flux How do we fit jointly the two datasets? 2 10 9 0 10 11 2 10 13 4 10 15 1 10 100 10 1 10 2 10 3 10 Energy 4 (kev) 10 5 10 6 10 7 10 8 Wavelength (Å) VERY common problem!
Problem 2 Model selection in 2D fits of images
A primer on galaxy morphology Three components: spheroidal R 1/ 4 I(R) = I e exp 7.67 1 Re exponential disk r I(R) = I0 exp rh and nuclear point source (PSF)
Use a generalized model R I(R) = I e exp k R e Fitting: The method n=4 : spheroidal n=1 : disk Add other (or alternative) models as needed Add blurring by PSF Do χ 2 fit (e.g. Peng et al., 2002) 1/ n 1
Fitting: The method Typical model tree I(R) = I e exp k R R e 1/ n 1 n=free n=4 n=1 n=4 n=4 n=4 PSF PSF PSF PSF PSF
Fitting: Discriminating between models Generally χ 2 works BUT: Combinations of different models may give similar χ 2 How to select the best model? Models not nested: cannot use standard methods Look at the residuals
Fitting: Discriminating between models
Fitting: Discriminating between models Excess variance 2 σ XS 2 = σ obj 2 σ sky Best fitting model among least χ 2 models the one that has the lowest exc. variance
Fitting: Examples Data Model Residuals Sérsic Model χ 2 ν σ 2 XS (1) (2) (3) Sérsic 1.107 1.722(0.120) Sérsic + psfagn 1.107 1.657(0.118) Sérsic + exdisk 1.107 1.770(0.121) Sérsic + psfagn + exdisk 1.106 1.472(0.113) Sérsic + psfagn Sérsic + exdisk Bonfini et al. in prep. Sérsic + psfagn + exdisk
Fitting: Problems Data Model Residuals However, method not ideal: It is not calibrated Sérsic Cannot give significance Fitting process computationally intensive Sérsic + psfagn Require an alternative, robust, fast, method Sérsic + exdisk Sérsic + psfagn + exdisk
Problem 3 Source Classification (a) Stars
Classifying stars Relative strength of lines discriminates between different types of stars Currently done by eye or by cross-correlation analysis
Classifying stars Would like to define a quantitative scheme based on strength of different lines.
Classifying stars Maravelias et al. in prep.
Classifying stars Not simple. Multi-parameter space Degeneracies in parts of the parameter space Sparse sampling Continuous distribution of parameters in training sample (cannot use clustering) Uncertainties and intrinsic variance in training sample
Problem 3 Source Classification (b) Galaxies
Classifying galaxies Ho et al. 1999
Classifying galaxies 1.5 1.0 (a) AGN LOG ([OIII]/Hβ) 0.5 0.0-0.5 HII LOG ([OIII]/Hβ) 1.5 1.0 0.5 0.0-0.5 (c) Seyfert LINER -1.0 Comp -1.0-2.0-1.5-1.0-0.5 0.0 0.5 LOG ([NII]/Hα) -2.0-1.5-1.0-0.5 0.0 LOG ([OI]/Hα) Seyfert HII LINER -1.0-0.5 0.0 0.5 LOG ([SII]/Hα) Kewley et al. 2001 Kewley et al. 2006
Classifying galaxies Basically an empirical scheme Multi-dimensional parameter space Sparse sampling - but now large training sample available Uncertainties and intrinsic variance in training sample LOG ([OIII]/Hβ) 1.5 1.0 0.5 0.0-0.5-1.0 (a) HII AGN Comp -2.0-1.5-1.0-0.5 0.0 0.5 LOG ([NII]/Hα) Use observations to define locus of different classes
Classifying galaxies Uncertainties in classification due to measurement errors uncertainties in diagnostic scheme Not always consistent results LOG ([OIII]/Hβ) 1.5 1.0 0.5 0.0-0.5 (a) HII AGN from different diagnostics -1.0 Comp Use ALL diagnostics together -2.0-1.5-1.0-0.5 0.0 0.5 LOG ([NII]/Hα) Maragoudakis et al in prep. Obtain classification with a confidence interval
PSfrag replacements Classification Problem similar to inverting 800 Hardness ratios 600 to spectral 400 200 0 parameters PSfrag replacements But more difficult We do not have well defined grid 1000 1.0 log(s/m) 0.5 Grid is not continuous Draws 0.0 0.0 0.5 1.0 1.5 log(m/h) log(m/h) 0.5 0.0 0.5 1.0 1.5 2.0 1.0 0.5 0.0 Table 1: Posterior Probabilities for the Grid of the Power-Law log(s/m) Model. The 95% posterior region is indicated in bold face. Figure 2: X-ray Colors Fitted by the Classical and Bayesian Methods. The top left panel shows the 0.250 0.500 0.125 0.250 0.075 0.125 0.050 0.075 0.025 0.050 0.010 0.025 point estimates of colors with marginal 1.75 2.00 error11.36% bars fitted 13.93% by the3.35% classical 1.00% method. 0.53% In the top 0.24% right panel, posterior draws of the X-ray colors 1.50 1.75 simulated 5.56% by the 13.70% Bayesian 5.99% method 2.34% are superimposed 1.70% 0.67% on the grids. The bottom panels show three-dimensional 1.25 1.50 1.80% graphical 7.76% summaries 5.61% for3.11% the posterior 2.82% draws. 1.56% Γ The 1.00 1.25 0.38% 2.71% 2.87% 2.26% 2.33% 1.58% large dot in the panels except the bottom left one represents the true values of X-ray colors. Γ N H 0.75 1.00 0.07% 0.54% 0.82% 0.75% 1.00% 0.81% 0.50 0.75 0.01% 0.09% 0.15% 0.18% 0.23% 0.17% N H Taeyoung Park s thesis C S and C H. The top right panel of Figure 2 shows the joint posterior draws of C S and C H resulting from the Gibbs sampler; a large dot in the diagram represents the true values of the X-ray colors. In the bottom row of Figure 2 presents the three-dimensional histogram of the draws to the left and the contour plot to the right. Because the Monte Carlo3 draws are superimposed on the grids of the power-law and thermal models in the color-color diagram, we can reversely infer the parameters of the models by computing posterior probabilities corresponding to each section split by the grids. Table 1 presents the normalized posterior probabilities of the X-ray colors in the grid of the power-law model. The 95% highest joint
Summary Model selection in multi-component 2D image fits Joint fits of datasets of different sizes Classification in multi-parameter space Definition of the locus of different source types based on sparse data with uncertainties Characterization of objects given uncertainties in classification scheme and measurement errors All are challenging problems related to very common data analysis tasks. Any volunteers?