Further observations, tests and developments


prepared by:
approved by:
reference:
issue: 1
revision: 0
date:
status: Issued

Coryn A.L. Bailer-Jones
Max Planck Institute for Astronomy, Heidelberg
calj@mpia.de

Technical Note

Abstract

I present results of further tests of ILIUM for estimating T_eff, log g and [Fe/H] for normal stars from BP/RP. I quantify the robustness of the summary statistics to random resampling of the train/test sets, investigate how the performance depends on the wavelength coverage of the input spectra and examine the impact on performance of systematic errors in the fluxes. I describe how randomly sampling the grid for building the forward model can bias ILIUM and also explain why it yields systematic errors in the absence of a signature in the data (and I argue why this is actually desirable). I further introduce two modifications to the basic algorithm. First, in an attempt to offset the negative impact of including noisy but AP-sensitive pixels in the input space, an SNR-based weighting scheme is proposed and tested, although it is found to be of limited effectiveness. Second, I propose and demonstrate a method for estimating the covariances (uncertainties) in the estimated APs based on the photometric errors.

Contents

1 Introduction
2 Systematic errors and priors with ILIUM
3 An erratum and considerations of forward model fitting
4 Further tests
  4.1 3x oversampled data
  4.2 Robustness of performance statistics under train/test set resampling
  4.3 Wavelength range selection
  4.4 Effect on performance of systematic flux errors
5 Minor developments
  5.1 SNR weighting of the sensitivities
  5.2 AP covariance estimates based on the photometric uncertainties
    Background material
    Application to ILIUM
    Results

1 Introduction

I introduced ILIUM in CBJ-042 and CBJ-043 as a new algorithm for estimating parameters from multidimensional data. In those documents I applied it to estimating T_eff and log g, and T_eff and [Fe/H], from BP/RP spectra. Here I make some additional tests and modifications of the algorithm and report on the results. The new tests address the following issues:

1. whether 3x oversampled spectra improve the AP accuracy over non-oversampled spectra
2. how robust the summary statistics are to random reselection of the train/test sets
3. the dependence of AP accuracy on the wavelength range selected
4. the effect of systematic flux errors on performance.

I also develop and test two modifications to the algorithm:

1. introduction of an SNR weighting term in the AP update equation. This is motivated by the fact that the sensitivities only measure the ideal sensitivity of the bands to the APs in the absence of noise
2. an alternative scheme for estimating AP uncertainties. It uses the sensitivity matrix to translate the photometric errors (more generally, covariance) into covariance estimates for the APs.

I also discuss two aspects of the basic algorithm. In section 2 I explain why ILIUM produces systematic errors in AP estimates when there is no signature in the data and how this reflects an implicit prior. In section 3 I show how random selection of train and test sets can significantly degrade the forward modelling due to a biasing issue.

2 Systematic errors and priors with ILIUM

Standard machine learning methods are normally trained by minimizing some global error function. With such methods we frequently see systematic errors in those parts of the AP space where the data have no or very low sensitivity to the APs. If there is no signal, the algorithm should ideally report a prior value for the AP, e.g. the mean in the training set. This is desirable: if the data provide no information, then rather than report an arbitrary value we report the best estimate based on what we do know from the training data.
This explains why methods such as SVMs and ANNs systematically overestimate the metallicity of low metallicity hot stars (which show no metallicity signature): the mean from the training set is higher than the true value. In Figure 5 of CBJ-043 we see that ILIUM does the same thing. Why is this, given that it doesn't minimize a global error? The reason is probably that at low metallicity for hot stars, the sensitivity is zero for all or most bands. In this case, the weak forward model provides no increment for the flux at each iterative update. The flux is predicted well by just the strong forward model, which was constructed by averaging over all the metallicities in the training grid. Thus the metallicity predicted by ILIUM will be that average metallicity. This is obviously higher than the lowest metallicity, so we get a systematic error.

FIGURE 1: Predictions of the full forward model for the dwarfs as a function of [Fe/H] at constant T_eff = [...] K in 12 different bands (with wavelength in nm at the top of each panel). The black crosses are the (noise-free) grid points, the red stars are the forward model predictions (at randomly selected AP values) and the blue circles the noisy (G=15) grid points. The three different blue and black points at each [Fe/H] value correspond to the three different log g values. The flux plotted on the ordinate is in standardized units.

I can confirm this idea by looking at plots of the forward model. Fig. 1 shows the weak forward model for some bands at T_eff = [...] K. The sensitivity (gradient of the forward model) is essentially zero in all bands at [Fe/H] below about -2. This contrasts with the forward model at T_eff = 5000 K (Figure 4 of CBJ-043): there the sensitivity is significantly non-zero in many bands even at the lowest metallicities. The fact that the systematic error vanishes in the residuals for just cool stars (see Figure 6 of CBJ-043) also suggests this interpretation is correct.

The conclusion is that the mean of the weak AP over the training data is a prior on the weak AP, but one which only has an effect when there is essentially no signal in the data. Such a prior is desirable, but it's not ideal that the training data distribution determines it (see Bailer-Jones et al. for some related discussion).

3 An erratum and considerations of forward model fitting

In CBJ-042 and CBJ-043 I stated that the forward model was fitted using the training set (which was typically a randomly selected subset of 50% of the whole data set). This was not the case (as can be seen in the plots of the forward model). I actually used the whole data set for fitting the forward model. The training set was used to select the nearest neighbour for the initialization of the algorithm (and the model applied to the test set, as stated). This is also done throughout the present TN.

This procedure is legitimate, because a central aspect of the forward modelling approach is to exploit the full density grid available. We would not want to consciously leave out half the data. With conventional machine learning one may consider this as cheating, because we are testing the performance on data which was used to build the model. (It wouldn't be fully cheating, because we train the model on noise-free data whereas we test on noisy data.) This would be unacceptable when using template matching algorithms (nearest neighbours), because there we match observations directly with templates, so having the exact template is obviously cheating! But with algorithms which smooth over the training data this is more legitimate, especially if (as we do with ILIUM) we protect against overfitting. The templates are used to learn the overall shape of the function, but they are not used individually in the AP assignments. This is not so different from leave-one-out cross validation with, say, 1000 objects. There we build 1000 models, each trained on 999 objects and the performance assessed on the one left out.
Leaving out the one object will hardly affect the model fit, yet we consider this a legitimate procedure. What's important is how well the algorithm learns the underlying function and how we test this. Distinct train and test sets are a common approach to this, but they may not always be necessary or sufficient. A more significant issue is that the train/test data may be different from the real data, what I call the synthetic spectra mismatch problem (CBJ-044).

In CU8 we have also tried training on a nominal grid and testing on a random grid. But this is not fundamentally different from ILIUM, because the random (test) grid is built by interpolating the nominal (training) grid.

ILIUM requires an adequate sampling of the APs to enable it to produce good fits. The required sampling (number and distribution of points) depends on the fitting method used (currently I use smoothing splines with various degrees-of-freedom: see section 3.2 of CBJ-042). What is the impact of using just the training grid (Figure 5 of CBJ-042) for fitting the forward model? Fig. 2 and Fig. 3 show the resulting forward model fits and should be compared to Figures 6 and 7 in CBJ-042 (the corresponding fits using the full data set). We see that the log g fit is still very good (the shape has hardly changed). In contrast, the T_eff fit is broadly similar, but now shows several discontinuities. Why is this?

FIGURE 2: Predictions of the full forward model (red stars) as a function of log(T_eff) at constant log g = 4.0 in 12 different bands (with wavelength in nm at the top of each panel) using just the training data. The black circles are (noise-free) grid points at all log g used to fit just the strong component. The flux plotted on the ordinate is in standardized units.

FIGURE 3: As Fig. 2, but now showing predictions of the full forward model as a function of log g at constant T_eff = 5000 K (red points) and the points at this T_eff used just to fit each weak component (black crosses).

The forward model is the sum of the strong (T_eff) and weak (log g) components. The red points in the plots show the predictions of the full (i.e. summed) forward model with respect to one AP holding the other constant. They are not the fits of the individual components. In Fig. 3 the weak component is formed by fitting to the black points. The full forward model is the sum of this plus the fit from the strong component, which is just a constant for this plot (T_eff constant). Thus the red curve is offset from what we would expect for a fit to the black points (best seen in the plot for the band at 573 nm). This offset is present because the strong component is a fit over the full T_eff range: it does not predict the flux at each T_eff exactly.

The same effect is seen in Fig. 2. Again this plots the full forward model prediction, not the strong component fit. The latter is smooth (it's a spline) whereas the former is this plus an increment at each T_eff which depends on the log g value (here log g = 4.0). At each T_eff value in the training grid we have different log g values (the blue points in Figure 5 of CBJ-042) so the fit, and thus the flux prediction for log g = 4.0, does not necessarily change smoothly with T_eff. We see strong discontinuities at T_eff = 10000 K (log(T_eff) = 4.0) and at T_eff = [...] K (log(T_eff) = 4.15). The reason for this is clear from Figure 5 of CBJ-042: at these two T_eff values there are only two low log g values in the training set, very different from the log g values used in the weak component fits at the neighbouring points. Moreover, with just two points the forward model fitting algorithm fits a straight line (Fig. 4), so the predictions at log g = 4.0 are poor. The bottom line is that ILIUM was not able to generalise adequately due to a mismatch between training and test sets.

FIGURE 4: As Fig. 3 but now for T_eff = 10000 K.

The consequence is that when trained on this randomly sampled and sparse training set, ILIUM does very badly on both APs for certain parts of the AP space. In principle this is also an issue when training on the full grid, but it was much smaller (so hardly degraded the results) because (1) the log g distribution changed smoothly over the grid (all points in Figure 5 of CBJ-042) and (2) there were enough points to get a good fit. Thus the full forward model varying with respect to temperature is smooth (Figure 6 in CBJ-042) and with respect to gravity has negligible offset (Figure 7 in CBJ-042).

In conclusion: ILIUM not only needs an adequate sampling of both APs to get good 1D fits, but the grid sampling in the weak AP at each value of the strong AP should not be biased. Random sampling of the APs to split into train and test sets can introduce such a bias.
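The sampling condition stated in this conclusion can be checked before any fitting is done. The following is a minimal sketch, not the procedure used in this TN: it takes a hypothetical rectangular (T_eff, log g) grid, splits it at random into two halves, and lists the T_eff nodes left with fewer than three log g points in the training half, which is where a weak-component fit degenerates to a straight line or worse. The grid values, the function names and the threshold of three points are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def random_split(grid, fraction=0.5, seed=0):
    """Randomly split a list of (teff, logg) grid points into train and test."""
    rng = random.Random(seed)
    shuffled = grid[:]
    rng.shuffle(shuffled)
    n_train = int(round(fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


def sparse_nodes(train, all_teffs, min_points=3):
    """Return {teff: sorted log g values} for nodes with fewer than min_points."""
    by_teff = defaultdict(list)
    for teff, logg in train:
        by_teff[teff].append(logg)
    return {
        t: sorted(by_teff.get(t, []))
        for t in all_teffs
        if len(by_teff.get(t, [])) < min_points
    }


# Hypothetical grid: 6 T_eff nodes, each with 5 log g values (illustrative only).
teffs = [4000, 5000, 6000, 8000, 10000, 14000]
loggs = [1.0, 2.0, 3.0, 4.0, 5.0]
grid = [(t, g) for t in teffs for g in loggs]

train, test = random_split(grid, fraction=0.5, seed=1)
flagged = sparse_nodes(train, teffs, min_points=3)
for teff, values in sorted(flagged.items()):
    print(f"T_eff = {teff} K has only log g = {values} in the training half")
```

Any node reported here is a candidate for the kind of discontinuity seen in Fig. 2. A stratified split, drawing the same number of log g points at every T_eff node, would be one way to avoid the problem.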

4 Further tests

4.1 3x oversampled data

Some of the experiments in CBJ-043 have been re-run, but now using the 3x oversampled data. (The sigmas are of course kept consistent with this change in the sampling: see section [...] of CBJ-042.) The summary results are as follows.

ILIUM, 3x oversampled dwarfs, G=15, T_eff ≤ 7000 K
              [Fe/H]     log(T_eff)
mean δφ       8.6e-3     6.3e-5
mean |δφ|     [...]      [...]
σ_φ           [...]      [...]

For comparison, the results on the non-oversampled data for the same train/test sets from CBJ-043 were

ILIUM, dwarfs, G=15, T_eff ≤ 7000 K
              [Fe/H]     log(T_eff)
mean δφ       5.6e-3     5.2e-5
mean |δφ|     [...]      [...]
σ_φ           [...]      [...]

At G=18.5 with oversampling we get

ILIUM, 3x oversampling, dwarfs, G=18.5, T_eff ≤ 7000 K
              [Fe/H]     log(T_eff)
mean δφ       [...]      [...]e-4
mean |δφ|     [...]      [...]
σ_φ           [...]      [...]

and the results on non-oversampled spectra from CBJ-043 are

ILIUM, dwarfs, G=18.5, T_eff ≤ 7000 K
              [Fe/H]     log(T_eff)
mean δφ       [...]      [...]e-6
mean |δφ|     [...]      [...]
σ_φ           [...]      [...]

Oversampling does not improve the averaged performance at either G=15 or G=18.5 on the T_eff-[Fe/H] problem. This only involved normal stars: emission line stars with resolved features may well benefit from the oversampling (see Lopez Marti et al. for numerous examples).

4.2 Robustness of performance statistics under train/test set resampling

The tests performed in CBJ-042 and CBJ-043 randomly split the available data into non-intersecting train and test sets. However, the performance on the test set could depend upon the particular train and test sets selected. (As explained in section 3, the training set affects the nearest neighbour initialization and the test set is what we test the performance on, but the forward model is not affected.) To examine this, I reran ILIUM 15 times with randomly resampled train and test sets, using the T_eff-log g problem at G=18.5 at [Fe/H]=0.0 (as described in section 3.5 of CBJ-042). There are 137 objects in each of the train and test sets. For each run I calculate four statistics: (1) mean residual, mean δφ; (2) mean absolute residual, mean |δφ|; (3) RMS residual, σ_φ; (4) standard deviation of the residuals. (2) and (3) are summaries of the magnitude of the residual, whereas (1) measures the systematic error and (4) the scatter about this (as it is the RMS of the mean-subtracted residuals).

FIGURE 5: ILIUM performance for different randomly selected train and test sets on the T_eff-log g problem at G=18.5 at [Fe/H]=0.0. The individual runs are shown as open circles. The mean, median and upper/lower quartiles are plotted as a diamond, filled circle and vertical bars respectively. The four sets of results correspond to the four different statistics shown (mean, absmean, rms, sd), with log g on the left and log(T_eff) on the right.

The results are shown in Fig. 5. There is certainly some spread, with 2 of the 15 runs yielding unusually large values of σ_φ for log g. The spread between the upper and lower quartiles (which contains the central 50% of the results) for mean |δφ| is [...] for log g (about 8% of the mean value) and [...] for log(T_eff) (about 5% of the mean). Thus the reported statistics are robust to within about 5-10%, at least for this problem and this sample size. Larger train and test sets would generally produce a smaller spread. Differences between runs of ILIUM much less than this should not be considered significant.

4.3 Wavelength range selection

FIGURE 6: BP (blue) and RP (red) noise-free spectra (photons/s against wavelength in nm) of three stars (A dwarf with T_eff = [...] K, log g = 4.0; F dwarf with T_eff = 7000 K, log g = 4.0; K giant with T_eff = 4500 K, log g = 1.5; zero metallicity and extinction in all cases). There are 60 pixels in each of BP and RP. These are non-oversampled spectra, and the last four pixels of BP (centred at 1272, 1404, 1569 and 1780 nm) are omitted from the plot. The solid vertical lines show the nominal selection range (used in CBJ-042 and CBJ-043). The dashed vertical lines show the wider range used in the DSC QSO work of Bailer-Jones et al. (2008) and CBJ-036.

The selection of input pixels to use in the model impacts the accuracy of the parametrization, in the sense that adding more relevant inputs should improve the performance. Each of BP and RP delivers a window with 60 pixels (which I consider as a single 120 element vector) with a wavelength coverage shown in Fig. 6. The nominal wavelength range used here (and in CBJ-042 and CBJ-043) is pixels 12:45, 75:108 (solid vertical lines). The DSC QSO work of Bailer-Jones et al. (2008) and CBJ-036 used a broader range of 9:54, 72:111 (dashed vertical lines). The narrower range was used for ILIUM because early experiments showed that this improved the results, presumably because this omitted lower SNR pixel bins. Why is this?

Unlike many other machine learning methods, the performance of ILIUM should not suffer due to a higher data dimensionality per se. This is because it fits 1D forward models to each band, so is not subject to the curse of dimensionality in the number of data dimensions (pixels). Because ILIUM uses sensitivity weighting, bins which have little or no information (low sensitivity) will receive a low weight in the update equation and have proportionally little influence. However, the sensitivity is calculated using noise-free spectra: the SNR is not taken into account. So if we have bins which are very sensitive to the APs but are typically very noisy, they will get a large weight in the update equation (see section [...] and Figures 9 and 10 of CBJ-042), but due to noise this update may actually be harmful. Under its present formulation ILIUM could have problems if it uses bins which have a high sensitivity but a low SNR. This could apply to the very blue pixels in BP.

Here I apply ILIUM to the same problem as in section 4.2, but now investigate the effect of varying the pixel range used. I again perform 15 runs with the same 15 random train/test sets used previously. The results, shown in Fig. 7, are for the following pixel ranges and corresponding wavelength ranges (from top to bottom):

1. BPnm: BP only from nominal, 12:45 ([...] nm)
2. RPnm: RP only from nominal, 75:108 ([...] nm)
3. nm: nominal wavelength range, 12:45, 75:108 ([...] nm)
4. nm+: nominal, plus three more pixels at the red end of BP and the blue end of RP, 12:48, 72:108 ([...], [...] nm)
5. nm++: nm+, plus three more pixels at the blue end of BP, 9:48, 72:108 ([...], [...] nm)
6. DSC/QSO paper range, 9:54, 72:111 ([...], [...] nm)

FIGURE 7: ILIUM performance on the T_eff-log g problem at G=18.5 at [Fe/H]=0.0 using different input pixels (described in the text; rows are BPnm, RPnm, nm, nm+, nm++ and DSC/QSO). The performance metric is the mean absolute error over the whole test set, shown for log g and log(T_eff). Open circles are the result of different randomly selected train/test sets. The mean, median and upper/lower quartiles are plotted as a diamond, filled circle and vertical bars respectively.

The nominal range is more or less the best on both APs despite the fact that BP is heavily cut at the red end (quite near the peak of the sensitivity) and RP is cut at the blue end well before the drop. Adding three more pixels in each case to extend to regions of lower sensitivity (the nm+ result) increases the performance by a very small amount which is barely significant (the inter-quartile ranges overlap). Surprisingly, adding just another three pixels to the blue end of BP (the nm++ result) actually degrades the performance by a statistically significant amount compared to nm for both log g (by almost 0.1 dex) and log(T_eff) (by [...] dex), some 15% of the average performance. This is presumably a result of these pixels having a high sensitivity yet low SNR, thereby adding a highly weighted but almost random update, as described above. This dependence of performance on the wavelength range selected is clearly undesirable. A modification of the sensitivities to accommodate the SNR is described in section 5.1.

Using the broader DSC/QSO range degrades log(T_eff) further over nm++, but with a lesser effect on log g. This may again be on account of misleading updates from low SNR pixels. Not surprisingly, using just the BP or RP portion of the nominal range (BPnm, RPnm) is worse than using the two together: RP alone is particularly bad at log g (a factor of two worse). This must be due to the reduction in information.

Note that the nominal wavelength range (used in CBJ-042, -043) omits Hα in both BP and RP. The range was selected based on the performance for normal stars, and not emission line stars. Although Hα is not visible in BP/RP in absorption (not even in 4x oversampled spectra) it can be seen in emission for Be stars (see Lopez Marti et al. 2009), so once we expand the object-type space, we'll need to include this.

I now repeat the above, but on the T_eff-[Fe/H] problem for dwarfs at G=15 presented in CBJ-043. There I trained and tested ILIUM using stars with T_eff spanning the range [...] K. [Fe/H] was poorly determined for hot stars (a well known phenomenon), so the relevant results were those reported for T_eff ≤ 7000 K. Including the hot stars in the training set shouldn't adversely affect the model performance on [Fe/H] for cool stars, because the weak components in the forward model are all independent 1D functions of [Fe/H] at each grid value of T_eff.
So a weak or non-existent signature at high T_eff does not affect the [Fe/H] estimation at low T_eff. As we know a priori that we only want to estimate [Fe/H] for cool stars it makes little sense to train on hot stars at all. So here I train and test ILIUM just for stars lying in the range [...] K (inclusive), with the full [Fe/H] range used in CBJ-043 (+1 to -4 dex). These are zero extinction dwarfs with log g ∈ {4.0, 4.5, 5.0} (same as in section 2.1 of CBJ-043). There are 657 stars in total, which again are randomly split into equal-sized train and test sets.

The results are shown in Fig. 8. Considering first [Fe/H], we see that again the nm and nm+ results are the best. The results with the broader nm++ and DSC/QSO ranges are, as before, worse by a statistically significant amount but by a scientifically insignificant one. Using just BP or RP alone degrades the results by 0.07 dex on average. Contrary to what one might expect, RP does indeed carry significant information on the metallicity.

Turning now to the results for T_eff, we find some surprising results. Most noticeable is the fact that RP alone does significantly better than nm and nm+, reducing the mean absolute residual by 25%. The performance with BP alone, in contrast, is significantly worse. Including the three very low SNR pixels in BP (nm++ result) now does not degrade the performance for T_eff at all, and the even wider range of DSC/QSO does not lower performance over nm as much as it did for the T_eff-log g problem (Fig. 7). This might be because we now have only cool stars, and on average the extra pixels in these wider ranges do not have very high sensitivities for cool stars.

FIGURE 8: ILIUM performance on the T_eff-[Fe/H] problem (for T_eff ≤ 7000 K) at G=15 for dwarfs using different input pixels (described in the text; rows are BPnm, RPnm, nm, nm+, nm++ and DSC/QSO). The performance metric is the mean absolute error over the whole test set, shown for [Fe/H] and log(T_eff). Open circles are the result of different randomly selected train/test sets. The mean, median and upper/lower quartiles are plotted as a diamond, filled circle and vertical bars respectively.

When comparing these results to Fig. 7, recall that we have changed both the T_eff range and the G magnitude. Both could explain why the RPnm range alone gives the best results: averaged over the sample, the cool star sample has more flux in RP than in BP, so combined with the overall lower SNR, BP may be providing a net confusing signal. This again suggests that we could benefit if we use some form of SNR weighting in ILIUM (see section 5.1).

4.4 Effect on performance of systematic flux errors

The BP/RP data processing will not be perfect. Here I investigate the impact of two errors on ILIUM with the T_eff-log g problem at G=15 (see section 3.3 of CBJ-042 for the nominal results).

FIGURE 9: AP residuals (log g and log(T_eff) residuals, each plotted against log g and log(T_eff)) on the T_eff-log g problem at G=15 with a 1% systematic additive flux error (histograms and red triangles) compared to the unperturbed data (black circles).

Additive error. For each spectrum in the test set I calculate the median flux and add e times this to each pixel. (This is done before standardizing the data.) The training data and forward models are unchanged. This simulates a constant (across the spectrum) background term, thus simulating imperfect background subtraction in the data (whereas the forward model assumes perfect correction). With e = 0.01, the residuals are shown in Fig. 9 plotted in red, with the unperturbed results (Figure 11 of CBJ-032) shown in black. The performance statistics are

ILIUM, G=15, 1% systematic additive error in fluxes
              log g      log(T_eff)
mean δφ       [...]      [...]
mean |δφ|     [...]      [...]
σ_φ           [...]      [...]

Already with a 1% flux error the performance is seriously degraded, by a factor of five in mean |δφ| for both APs. With a 5% error, mean |δφ| in log g is 1.4 dex and there is a systematic error of 0.6 dex: log g cannot be estimated! log(T_eff) holds up better with a mean |δφ| of [...] dex, but the systematic error is almost as large.

Multiplicative error. For each spectrum in the test set I multiply the flux by (1 + e). (This is done before standardizing the data.) The training data and forward models are unchanged. This simulates a multiplicative flux calibration error in each pixel (whereas the forward model assumes perfect calibration). With e = 0.01, the residuals are shown in Fig. 10 plotted in red, with the unperturbed results (Figure 11 of CBJ-032) shown in black. The performance statistics are

ILIUM, G=15, 1% systematic multiplicative error in fluxes
              log g      log(T_eff)
mean δφ       [...]      [...]
mean |δφ|     [...]      [...]
σ_φ           [...]      [...]

There is again a degradation, but it is not as serious as with the additive error. This is because the systematic error added in the additive case will be much larger than in the multiplicative case for pixels with low counts (at the ends of the spectrum), because for them 1% of the median (over the spectrum) is much larger than 1% of their own flux.

That even small systematic errors degrade the performance by so much is not surprising when we consider how ILIUM works. At each iteration it uses the difference in flux between the forward model prediction and the measurement (step 3 of the algorithm; section 2.2 of CBJ-042). If these are on systematically different scales, ILIUM will give poor results. We could mitigate the effect of an additive error by setting the median or mean flux (across the spectrum) to zero for both the observed spectra and the training data. But such fixes should wait until we have a better idea of the magnitude and nature of the calibration errors in the BP/RP data.
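The two perturbations, and the median-subtraction fix suggested above, can be sketched in a few lines. This is an illustration only: the seven-pixel spectrum and the function names are hypothetical, and no standardization step is included. It shows that an additive error of e times the median is a far larger fractional change in low-count pixels than a multiplicative error of the same e, and that a constant offset vanishes once the median is subtracted from both spectra.

```python
from statistics import median


def add_additive_error(flux, e):
    """Add e times the median flux of the spectrum to every pixel."""
    offset = e * median(flux)
    return [f + offset for f in flux]


def add_multiplicative_error(flux, e):
    """Multiply every pixel by (1 + e)."""
    return [f * (1.0 + e) for f in flux]


def subtract_median(flux):
    """Shift a spectrum so that its median flux is zero."""
    m = median(flux)
    return [f - m for f in flux]


# Hypothetical 7-pixel spectrum with low counts at both ends (photons/s).
flux = [5.0, 50.0, 400.0, 1000.0, 400.0, 50.0, 5.0]
e = 0.01

additive = add_additive_error(flux, e)
multiplicative = add_multiplicative_error(flux, e)

# Fractional change per pixel under each perturbation.
frac_add = [(a - f) / f for a, f in zip(additive, flux)]
frac_mul = [(m - f) / f for m, f in zip(multiplicative, flux)]
print("additive:      ", [round(x, 4) for x in frac_add])
print("multiplicative:", [round(x, 4) for x in frac_mul])

# A constant offset disappears when both spectra are median-subtracted.
diff = [a - b for a, b in zip(subtract_median(additive), subtract_median(flux))]
print("max residual after median subtraction:", max(abs(d) for d in diff))
```

In this toy spectrum the end pixels change by 10% under the additive error but by only 1% under the multiplicative one. Median subtraction removes only the additive term, so a multiplicative error would need a different remedy.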

FIGURE 10: AP residuals (log g and log(T_eff) residuals, each plotted against log g and log(T_eff)) on the T_eff-log g problem at G=15 with a 1% systematic multiplicative flux error (histograms and red triangles) compared to the unperturbed data (black circles).

5 Minor developments

5.1 SNR weighting of the sensitivities

We saw in section 4.3 that the performance depends on the pixel range used, typically being worse if a broader range was used which includes low sensitivity pixels. I speculated the cause to be quasi-random contributions to the AP updates from high sensitivity but noisy pixels. While sensible feature selection is a prerequisite for machine learning, it is not obvious what the tradeoff should be between sensitivity and SNR, or below what SNR level one should drop data. Moreover, it is desirable to use a common set of pixels for classification for most objects, so some active weighting of pixels to account for the SNR (which can usually be estimated) could be beneficial.

An appropriate scheme is to replace the sensitivity, s_ij, with w_i s_ij where w_i is a weighting factor for pixel i. An appropriate weight is the SNR in pixel i, SNR_i. In order that the weighting doesn't change the overall scale of the updates, I define the weights as

$$ w_i = \frac{\mathrm{SNR}_i}{\overline{\mathrm{SNR}}} \qquad (1) $$

i.e. the SNR in each pixel divided by the SNR averaged over the pixels. Thus equation 3 in CBJ-042 becomes

$$ \delta p = W S \, \delta\phi \qquad (2) $$

where

$$ W = \mathrm{diag}(w_i) \qquad (3) $$

is a diagonal I × I matrix. We replace S with WS in the update equation 5 in CBJ-042. The forward modelling is not affected.

I used this SNR weighting method on the nominal T_eff-log g problem (section 3.3 of CBJ-042). I select five particular stars for analysis. Their AP updates are shown in Fig. 11 and their SNR weighting vectors (spectra) in Fig. 12. The spectra of AP updates for these five stars at the second iteration are shown in Fig. 13. For comparison, Fig. 14 shows the corresponding AP update spectra when not using SNR weighting. We see how the SNR weighting suppresses contributions from the very blue end of the spectrum and, correspondingly, how it enhances contributions in the centre of the spectrum, especially at band 30.

I now carry out a more systematic comparison with and without SNR weighting for different wavelength ranges. Taking into account the variance in the summary statistics due to randomized train/test samples, I perform tests on the same 15 sets as used in sections 4.2 and 4.3. The goal is to assess whether SNR weighting helps when including low SNR pixels, so I do tests on the nm, nm++ and DSC/QSO ranges defined in section 4.3.
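As an illustration of the weighting just described, the sketch below builds weights from a per-pixel SNR vector and applies them to a sensitivity matrix. It assumes the weights are the SNR divided by its mean over the pixels, so that they average to one; the four-pixel sensitivity matrix, the SNR values and the function names are hypothetical, and the ILIUM update equation itself is not reproduced here.

```python
def snr_weights(snr):
    """Weights w_i = SNR_i / mean(SNR), so that the weights average to one."""
    mean_snr = sum(snr) / len(snr)
    return [s / mean_snr for s in snr]


def weight_sensitivities(S, w):
    """Return W S with W = diag(w): row i of S is scaled by w[i]."""
    return [[w_i * s_ij for s_ij in row] for w_i, row in zip(w, S)]


# Hypothetical example with I = 4 pixels (rows) and J = 2 APs (columns).
S = [
    [0.9, 0.1],  # the most sensitive pixel ...
    [0.3, 0.2],
    [0.2, 0.4],
    [0.1, 0.1],
]
snr = [2.0, 40.0, 60.0, 18.0]  # ... is also the noisiest

w = snr_weights(snr)
WS = weight_sensitivities(S, w)
for w_i, row in zip(w, WS):
    print(f"w = {w_i:.3f}  weighted sensitivities = {[round(x, 3) for x in row]}")
```

The first pixel has the largest unweighted sensitivity but ends up with the smallest weighted one, which is the intended suppression of sensitive but noisy pixels. Because the weights have unit mean, the overall scale of the sensitivities is roughly preserved.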

FIGURE 11: AP evolution against iteration for five stars in the test data set at G=15 (log g left, log(T_eff) right) using SNR weighting. The true APs are written at the top of each panel pair and plotted as the red horizontal line: star 1 (log g = 5, T_eff = 4000 K, GF = 6.09), star 2 (log g = 4.5, T_eff = 5750 K, GF = 1.25), star 3 (log g = 4, T_eff = 8000 K, GF = 5.48), star 4 (log g = 2, T_eff = 8750 K, GF = 1.33), star 5 (log g = 4.5, T_eff = [...] K, GF = 1.93). Cf. Figure 8 of CBJ-042 which shows the same without SNR weighting.

FIGURE 12: SNR weight (equation 1) for the five stars with AP evolution shown in Fig. 11 and AP spectrum updates at iteration 2 shown in Fig. 13. The mean SNR is [...] in the five cases.

FIGURE 13: The spectrum of log g updates (blue, left column) and log(T_eff) updates (red, right column) at the second iteration for five stars (rows). These are the same five stars shown in Figure 8 of CBJ-042. In each panel the horizontal axis is the band (pixel) number and the vertical axis the AP update, in standardized units. The solid horizontal line is the mean, and the dashed lines the upper and lower clipping limits.

FIGURE 14: As Fig. 13 but without SNR weighting.

FIGURE 15: Effect of SNR weighting. Each line shows the ILIUM performance (mean absolute residual in log g and log(T_eff)) on the T_eff-log g problem at G=15 at [Fe/H]=0.0 using the three different input pixel ranges nm, nm++ and DSC/QSO, for SNR weighting included (rows 1, 3, 5, the upper of each pair; points in green and magenta) and the standard ILIUM algorithm (rows 2, 4, 6, the lower of each pair; points in black and red). As before, open circles are the result of different randomly selected train/test sets and the mean, median and upper/lower quartiles are plotted as a diamond, filled circle and vertical bars respectively.

FIGURE 16: As Fig. 15 but now for G=18.5.

The results are shown in Fig. 15 for G=15 (red/black for the standard update algorithm, magenta/green for the SNR weighted updating). Surprisingly, SNR weighting is worse than no weighting for each wavelength range, by an amount which is statistically significant for T_eff but not for log g. We observe that the mean/median performance (magenta diamond/circle) with the SNR weighting is the same for all three wavelength ranges (the interquartile ranges overlap) for both APs. So while SNR weighting seems to degrade performance slightly, it has fulfilled the objective of making the performance independent of the wavelength range. (On the other hand, this is also true for log g without SNR weighting.)

The same examination at G=18.5 is shown in Fig. 16. Now the SNR weighting does decrease the mean absolute residual, by 0.1 dex for log g (up to 30%) and almost [...] in log(T_eff) (up to 20%). However, we now see a dependence of the performance on wavelength coverage. As we go down the figure more noisy pixels are included and the performance gets worse, both with and without SNR weighting. So at G=18.5 SNR weighting helps on both parameters, but it does not eliminate the degrading effect of including noisy pixels.

In conclusion, SNR weighting has an effect, but this particular scheme does not always give the desired effect. Perhaps a weighting linear in the SNR is not optimal. We saw in the specific cases examined above that it resulted in a lot of weight being given to band 30. However, because such large updates will then be clipped (see section [...] of CBJ-042) this may actually result in the small updates being given more comparative weight after all. I therefore repeated the tests at G=15 with SNR weighting but without doing the clipping. But this did not improve performance (in fact it made it considerably worse in all cases). SNR weighting requires further investigation.
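The comparisons in sections 4.2, 4.3 and 5.1 rest on the four residual statistics and on whether the interquartile ranges of repeated runs overlap. A minimal sketch of both is given below. The function names are hypothetical, the quartile convention (Python's "inclusive" method) is an assumption that may differ from the one used for the figures, and the per-run values in the usage example are invented purely to exercise the code.

```python
import math
from statistics import mean, pstdev, quantiles


def residual_statistics(residuals):
    """Mean, mean absolute, RMS and standard deviation of a list of residuals."""
    return {
        "mean": mean(residuals),
        "absmean": mean(abs(r) for r in residuals),
        "rms": math.sqrt(mean(r * r for r in residuals)),
        "sd": pstdev(residuals),  # RMS of the mean-subtracted residuals
    }


def interquartile_range(values):
    """Lower and upper quartiles of a set of per-run statistics."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3


def ranges_overlap(a, b):
    """True if two (low, high) intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]


# Invented mean absolute residuals from repeated runs of two configurations.
runs_a = [0.30, 0.32, 0.31, 0.35, 0.29]
runs_b = [0.33, 0.34, 0.36, 0.31, 0.35]
iqr_a = interquartile_range(runs_a)
iqr_b = interquartile_range(runs_b)
print("IQR A:", iqr_a, "IQR B:", iqr_b, "overlap:", ranges_overlap(iqr_a, iqr_b))
```

Overlap of the interquartile ranges is used here only as the informal significance criterion adopted in the text, not as a formal statistical test.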
5.2 AP covariance estimates based on the photometric uncertainties

Background material

The covariance matrix between two vectors $x$ and $y$ is

$$\mathrm{Cov}(x, y) = E\left[(x - E[x])(y - E[y])^T\right] \tag{4}$$

where $E$ is the expectation operator. The covariance matrix of a single vector $y$ is

$$C_y \equiv \mathrm{Cov}(y, y) = \mathrm{Var}(y) = E\left[(y - E[y])(y - E[y])^T\right] \tag{5}$$

Let two vectors, $x$ and $y$, be related by the linear transformation

$$y = Ax \tag{6}$$

The covariance matrix of $y$ is related to the covariance matrix of $x$ by

$$\begin{aligned} C_y &= E\left[(y - E[y])(y - E[y])^T\right] \\ &= E\left[A(x - E[x])(x - E[x])^T A^T\right] \\ &= A\,E\left[(x - E[x])(x - E[x])^T\right] A^T \\ &= A C_x A^T \end{aligned} \tag{7}$$

where I assume that $A$ is fixed, i.e. has zero variance (so $E[Ax] = AE[x]$). The solution to the above linear equation (provided $A^T A$ is non-singular) is

$$x = (A^T A)^{-1} A^T y \tag{8}$$

so by analogy to equation 7, we can write

$$C_x = (A^T A)^{-1} A^T C_y \left((A^T A)^{-1} A^T\right)^T = (A^T A)^{-1} A^T C_y A (A^T A)^{-1} \tag{9}$$

where I have used the fact that $A^T A$ and therefore $(A^T A)^{-1}$ are symmetric matrices. This is the formula for converting the covariance (errors) on $y$ to the covariance on $x$. If the errors on the components of $y$ are uncorrelated and all have the same variance $\sigma_y^2$, then we can write $C_y = \sigma_y^2 I$, where $I$ is the identity matrix, and equation 9 can be simplified to

$$C_x = \sigma_y^2 (A^T A)^{-1} \tag{10}$$

Another standard result from matrix algebra (derivable from equation 4 with a little algebra) is

$$\mathrm{Var}(x - y) = \mathrm{Cov}(x, x) + \mathrm{Cov}(y, y) - 2\,\mathrm{Cov}(x, y) = \mathrm{Var}(x) + \mathrm{Var}(y) - 2\,\mathrm{Cov}(x, y) \tag{11}$$

Application to ILIUM

We can use the above results to estimate the covariance in the APs estimated by ILIUM, $C_\phi$, given the covariance (measurement error) in the photometry, $C_p$. The basic equation of ILIUM is equation 3 in CBJ-042, which assumes that changes in the photometry can be approximated by a first order Taylor expansion using the sensitivities, i.e.

$$\delta p = S\,\delta\phi \tag{12}$$

If we assume that ILIUM gives unbiased estimates of the APs, then we can interpret this as an expansion equation for the estimated photometry, $p_e$, about the true photometry, $p_t$ (i.e. at the true values of the APs, $\phi_t$), given the estimated APs, $\phi_e$. We can then write the above equation as

$$p_e - p_t = S(\phi_e - \phi_t) \tag{13}$$

where $S$ is calculated at $\phi_t$ in both cases. Using equation 7 we see that the covariances are related as

$$\mathrm{Var}(p_e - p_t) = S\,\mathrm{Var}(\phi_e - \phi_t)\,S^T \tag{14}$$

Using equation 11 we can write this as

$$\mathrm{Var}(p_e) + \mathrm{Var}(p_t) - 2\,\mathrm{Cov}(p_e, p_t) = S\left(\mathrm{Var}(\phi_e) + \mathrm{Var}(\phi_t) - 2\,\mathrm{Cov}(\phi_e, \phi_t)\right)S^T \tag{15}$$

However, as $\phi_t$ and $p_t$ are the true values, they have zero variance. Moreover, as we assume that ILIUM is giving unbiased estimates, these true values are also uncorrelated with the estimated values (i.e. the estimates are scattered randomly around them). Therefore, the Cov terms in the above equation are also zero and it becomes

$$\mathrm{Var}(p_e) = S\,\mathrm{Var}(\phi_e)\,S^T, \qquad \text{i.e.} \qquad C_p = S C_\phi S^T \tag{16}$$

Using equation 9 we get

$$C_\phi = (S^T S)^{-1} S^T C_p S (S^T S)^{-1} \tag{17}$$

Thus given the estimated variance (more generally, covariance) in the measured photometry we can calculate the corresponding covariance in the APs using the sensitivity at the estimated APs. $C_\phi$ is a symmetric $J \times J$ matrix with elements $c_{jj'}$. The AP variances are the diagonal elements, so the estimated uncertainties on the APs are the square roots of these, $\sqrt{\mathrm{diag}(C_\phi)}$, which I write as $\epsilon_{\mathrm{feh}}$ and $\epsilon_{\mathrm{teff}}$ (where the latter always refers to $\log(T_{\mathrm{eff}})$). Rather than reporting the off-diagonal covariances, I will report the correlation coefficient between two APs $j$ and $j'$, $r_{jj'} = c_{jj'} / \sqrt{c_{jj}\,c_{j'j'}}$, which, as there are just two APs, I write as $r$. (I reserve the symbols $\rho$ and $\sigma$ for statistical measures of the correlation and standard deviation of the residuals.)

Strictly speaking we should use the sensitivity at the (unknown) true APs in equation 17. The above is an approximation which will be reasonably valid if the AP error is not large. Furthermore, it will hold on average over the sample if ILIUM gives unbiased estimates, which seems to be the case given the very low systematic errors I have reported in CBJ-042 and CBJ-043. Equation 17 also assumes that there are no errors in $S$, which in turn assumes the forward model is perfect. We could accommodate errors in the sensitivities by generalizing equation 7 to include the covariance in $A$.
If the photometric errors are uncorrelated, then we can write $C_p = \mathrm{diag}(\sigma_p^2)$, where $\sigma_p$ is the vector of estimated errors in each band. (In general $C_\phi$ will still have non-zero off-diagonal elements.) If we had no individual error estimates on the photometric observations, then we could use equation 10 and set $\sigma_y$ to some overall estimate of the noise in $p$. For computational reasons we can calculate equation 17 using

$$C_\phi = M C_p M^T \tag{18}$$

where

$$M = (S^T S)^{-1} S^T \tag{19}$$
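As an illustration of equations 17 to 19, here is a minimal NumPy sketch that propagates a photometric covariance matrix to AP uncertainties and correlation coefficients. It is a sketch under stated assumptions rather than the ILIUM implementation: the function name `ap_covariance`, the 4-band, 2-AP sensitivity matrix and the photometric errors are all hypothetical, and $S^T S$ is assumed to be non-singular.

```python
import numpy as np

def ap_covariance(S, C_p):
    """Propagate photometric covariance C_p (I x I) to AP covariance C_phi (J x J).

    S is the I x J sensitivity matrix (I bands, J APs); S^T S must be non-singular.
    """
    S = np.asarray(S, dtype=float)
    C_p = np.asarray(C_p, dtype=float)
    # M = (S^T S)^{-1} S^T (equation 19), found by solving a linear system
    # instead of forming the explicit inverse
    M = np.linalg.solve(S.T @ S, S.T)
    C_phi = M @ C_p @ M.T            # equation 18
    eps = np.sqrt(np.diag(C_phi))    # AP uncertainties: square roots of the variances
    r = C_phi / np.outer(eps, eps)   # correlation coefficients r_jj'
    return C_phi, eps, r

# Hypothetical example: 4 bands, 2 APs, uncorrelated photometric errors
S = np.array([[1.0, 0.2],
              [0.5, 1.0],
              [0.1, 0.8],
              [0.9, 0.3]])
sigma_p = np.array([0.01, 0.02, 0.015, 0.01])
C_p = np.diag(sigma_p**2)            # C_p = diag(sigma_p^2)

C_phi, eps, r = ap_covariance(S, C_p)
print("AP uncertainties:", eps)
print("correlation r:", r[0, 1])
```

Even with a diagonal `C_p`, the off-diagonal element of `C_phi` is generally non-zero, which is the point made in the text about correlated AP errors arising from uncorrelated photometric errors.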

It will be recalled from CBJ-042 that the update equation in ILIUM is just $\delta\phi = M\,\delta p$. There I also defined a different estimate of the AP uncertainties, namely $M\,\delta p$, which is based on the difference between the measured and predicted photometry. Here we use an estimate based on the uncertainties in the photometry.

It is worth noting that there are some similarities between the principles of ILIUM and the figure-of-merit defined to assess the performance of photometric systems in the earlier Gaia filter design process (see Lindegren 2003 and Jordi et al. 2006). Sensitivities were also used by Bailer-Jones (2003, 2004) in the Heuristic Filter Design approach.

Results

I used the above to estimate the AP uncertainties and correlations for the $T_{\mathrm{eff}}$-[Fe/H] problem for dwarfs at G=20 (section 2.3 of CBJ-043). Fig. 17 shows various distributions and scatter plots. What is not visible in these is the fact that a very large fraction of objects have large estimated uncertainties, especially in [Fe/H], where they can reach tens of dex. Some 282/858 = 33% of objects have an uncertainty ratio (estimated error divided by modulus of true residual) in [Fe/H] greater than 8 (and so are not plotted in the top row). This reduces to 54/338 = 16% when we limit the sample to cool stars ($T_{\mathrm{eff}} \leq 7000$ K). For the $T_{\mathrm{eff}}$ uncertainty ratio it is 12% for both the full and cool star sample. This is a considerably larger fraction than was obtained with the original uncertainty estimate technique (section 2.6 of CBJ-042), which gave just 7% of objects beyond this level. Comparing the top two panels of Fig. 17 with those in Fig. 12 of CBJ-043 also shows that the ratios are generally higher (i.e. overestimated) with the new method.

Fig. 17 also shows the actual values of the estimated uncertainties as a function of AP. Larger uncertainties in [Fe/H] are predicted for higher $T_{\mathrm{eff}}$ (as expected). What is also interesting is that the [Fe/H] uncertainty is never predicted to be small for larger $T_{\mathrm{eff}}$, which is a correct reflection of the situation.
Interestingly, the uncertainty in $\log(T_{\mathrm{eff}})$ is also larger at higher $T_{\mathrm{eff}}$. This may be a result of the larger [Fe/H] uncertainty and a possible correlation between them. Indeed, the right-hand panel in the second row shows that the correlations can be very large. On the other hand, the original method gave many error estimates which were significant underestimates of the residuals (the peak around 0 in the histograms of Figure 12 of CBJ-043), something which does not occur here and so is in favour of the new method.

Fig. 18 plots the same estimates but now on the grid of the true APs. It illustrates more clearly the growth of errors for large $T_{\mathrm{eff}}$. The bottom panel plots the sign and magnitude of the estimated correlations. Negative correlations are predicted for hot stars and low metallicity stars, that is, an overestimation of one AP is predicted to be accompanied by an underestimation of the other. We can compare this with the plot of the magnitudes of the residuals (i.e. estimated minus true) shown in the top two panels of Fig. 19, where I have used the same scale for the symbols. The [Fe/H] uncertainties are significant overestimates of the [Fe/H] residuals, whereas the $T_{\mathrm{eff}}$ uncertainties are also overestimated, but not by so much. The bottom panel shows the sign of the product of the two residuals, i.e. the sign of the correlation. This does not show the same trend to negative correlations for hot and low metallicity stars that we see in the predicted correlations.

FIGURE 17: Predicted AP uncertainties and correlations for the $T_{\mathrm{eff}}$-[Fe/H] problem for dwarfs at G=20. Top row: AP uncertainties expressed as a ratio to the absolute value of the true residuals. Second row, left: as above, but only for cool stars ($T_{\mathrm{eff}} \leq 7000$ K). Second row, right: the estimated correlations, $r$, only for cool stars. Third and fourth rows: estimated uncertainties (not ratios) as a function of the APs, with cool stars plotted in red.

FIGURE 18: Predicted AP uncertainties and correlations for the $T_{\mathrm{eff}}$-[Fe/H] problem for dwarfs at G=20 for individual objects plotted in the space of the true APs (true $\log(T_{\mathrm{eff}})$ against true [Fe/H]). The area of the plotted circle (not the diameter) is proportional to the size of the uncertainty or correlation. For the latter, positive values are shown in black and negative in red. (Not all points are present because the R plotting software does not plot any symbol below a certain size.)

FIGURE 19: $T_{\mathrm{eff}}$ residuals (top) and [Fe/H] residuals (middle) for the $T_{\mathrm{eff}}$-[Fe/H] problem for dwarfs at G=20. The area of the plotted circle (not the diameter) is proportional to the size of the residual and the colour indicates the sign (black positive, red negative). The bottom panel shows the sign of the product of the residuals: filled black points indicate a positive correlation and open red squares a negative one.

Figs. 20, 21 and 22 show the same three diagnostic plots but now for the data at G=15 (section 2.1 of CBJ-043). As the residuals and the uncertainties are now much smaller, I have increased the area of the points by a factor of 16. The uncertainty estimates are closer to the true residuals and not nearly as overestimated as was the case at G=20. The histograms show that the uncertainties expressed as a fraction of the residuals are much smaller. In fact, they now have a tendency to be underestimated, as was also seen at G=15 with the original uncertainty estimation method (Figure 9 of CBJ-043). The agreement of the predicted and actual correlations in the uncertainties/residuals is perhaps better than at G=20.

FIGURE 20: As Fig. 17 but for G=15.

FIGURE 21: As Fig. 18 but for G=15 and with the circles plotted 16 times larger in area.

FIGURE 22: As Fig. 19 but for G=15 and with the circles plotted 16 times larger in area.

This uncertainty estimation method has a more solid grounding than the rather naive one described and used in CBJ-042 and CBJ-043. The new method depends explicitly on the estimated photometric error rather than a sample of the AP update oscillation. Yet based on this one example, it is not clear that the new method gives more accurate uncertainty estimates (it seems worse at G=20, perhaps similar in performance at G=15). A more rigorous assessment is required.
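The diagnostic used throughout these results, the ratio of the estimated uncertainty to the modulus of the true residual, and the fraction of objects for which it exceeds the plotting limit of 8, can be sketched as follows. This is a minimal illustration, not the analysis code behind the figures: the function names `uncertainty_ratio` and `fraction_above` and the four-object arrays are hypothetical, made-up values.

```python
import numpy as np

def uncertainty_ratio(est_uncertainty, estimated_ap, true_ap):
    """Estimated uncertainty divided by |estimated - true| for each object.

    An object with an exactly zero residual gives an infinite ratio.
    """
    residual = np.asarray(estimated_ap, dtype=float) - np.asarray(true_ap, dtype=float)
    return np.asarray(est_uncertainty, dtype=float) / np.abs(residual)

def fraction_above(ratio, limit=8.0):
    """Fraction of objects whose uncertainty ratio exceeds `limit`."""
    return float(np.mean(np.asarray(ratio) > limit))

# Made-up [Fe/H] values for four objects, for illustration only
true_feh = np.array([0.0, -0.5, -1.0, -2.0])
est_feh = np.array([0.1, -0.7, -0.9, -1.5])
eps_feh = np.array([0.2, 0.1, 1.0, 0.5])   # estimated uncertainties

ratio = uncertainty_ratio(eps_feh, est_feh, true_feh)
print(ratio, fraction_above(ratio))
```

A ratio well above 1 indicates an overestimated uncertainty and a ratio well below 1 an underestimated one, which is how the histograms in Figs. 17 and 20 are read in the text.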

References

Bailer-Jones C.A.L., 2003, Heuristic design of filter systems using an evolutionary algorithm, Presentation at PWG meeting, 9 October 2003, Leiden

Bailer-Jones C.A.L., 2004, Evolutionary design of photometric systems and its application to Gaia, A&A 419, 385

Bailer-Jones C.A.L., 2009, ILIUM: An iterative local interpolation method for parameter estimation, GAIA-C8-TN-MPIA-CBJ-042

Bailer-Jones C.A.L., 2009, Application of ILIUM to the estimation of the T eff [Fe/H] pair from BP/RP, GAIA-C8-TN-MPIA-CBJ-043

Bailer-Jones C.A.L., 2009, A procedure to calibrate algorithms for estimating parameters from spectra, GAIA-C8-TN-MPIA-CBJ-044

Bailer-Jones C.A.L., Smith K.W., Tiede C., Sordo R., Vallenari A., 2008, Finding rare objects and building pure samples: probabilistic quasar classification from low-resolution Gaia spectra, MNRAS 391, 1838

Jordi C., Høg E., Brown A.G.A., Lindegren L., Bailer-Jones C.A.L., Carrasco J.M., Knude J., Straizys V., et al., 2006, The design and performance of the Gaia photometric system, MNRAS 367, 290

Lindegren L., 2003, Optimizing Gaia's photometric system - thoughts on distance measure and figure of merit functions, GAIA-LL-047

Lopez Marti B., Figueras F., Jordi C., Carrasco J.M., Gebran M., 2009, Source with spectral features as seen by Gaia BP/RP, GAIA-C5-UB-BLM
