Replacement of Dry Deposition Velocities (Vd) at CASTNET Sites (cont'd)
Gary Lear and George Bowker, CAMD

Introduction
For CASTNET, hourly deposition velocity estimates are the output of the Multi-Layer Model (MLM). MLM uses numerous hourly meteorological parameters as inputs: wind speed, sigma theta (turbulence), temperature, precipitation, and solar radiation. Its output is the deposition velocity (Vd). If any of the met parameters is missing, the hourly Vd estimate is also missing. Missing observations can skew aggregations and distort trends.

Introduction
CASTNET data completeness for total deposition is cumulative for all variables. Missing data = $. EPA discontinued met measurements at our CASTNET sites, so it is now essential to estimate deposition velocities without these measurements.
[Figure: Data Completeness, in percent (%), for concentrations, meteorological measurements, deposition velocity, and total (wet + dry) deposition.]

Replacement of limited missing/invalid deposition velocity estimates
If longer-term (annual, seasonal) dry deposition estimates are most important for deposition assessments, it may be possible to impute (replace) the missing Vd values using historical values. We observed that replacing a missing/invalid Vd with a historical Vd could simplify the replacement process and could probably result in a more accurate long-term deposition estimate (since the historic variability of a longer-term average is just a few percent). The concept is that we replace each missing hour of Vd with a representative historical Vd from that same hour.
Reference: Quality Assurance Decisions with Air Models: A Case Study of Imputation of Missing Input Data Using EPA's Multi-layer Model. G.E. Bowker, D.B. Schwede, G.G. Lear, W.J. Warren-Hicks, P.L. Finkelstein. 2010. Water, Air, and Soil Pollution. DOI: 10.1007/s11270-011-0808-7
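
The replacement concept can be sketched in a few lines of Python. This is only an illustration, not the CASTNET processing code: the function name `impute_vd`, the use of `None` to flag a missing hour, and the numeric values (cm/s) are all assumptions made for the example.

```python
from typing import List, Optional, Sequence


def impute_vd(observed: Sequence[Optional[float]],
              historical: Sequence[float]) -> List[float]:
    """Replace each missing hour (None) with the historical Vd for that same hour."""
    if len(observed) != len(historical):
        raise ValueError("observed and historical must cover the same hours")
    return [h if o is None else o for o, h in zip(observed, historical)]


# Hypothetical values (cm/s) for six consecutive hours; None marks a missing Vd.
observed = [0.10, None, 0.30, None, 0.50, 0.40]
historical = [0.12, 0.20, 0.28, 0.35, 0.45, 0.38]

filled = impute_vd(observed, historical)
print(filled)
```

Valid hours are left untouched; only the two missing hours take the historical value for the same hour.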

Seasonal Variation in Vd
Although meteorology dominates the magnitude of Vd at short timescales such as hours or days, longer timescales are dominated by site-specific seasonal trends.
[Figure: Weekly deposition velocity for individual years, with the average Vd for all years shown as a solid black line.]

Seasonal Variation in Vd

Seasonal Variation in Vd
For each site: Seasonal variation is relatively consistent from year to year. Variance of each week is similar from year to year. Diurnal patterns are also very consistent with seasonal variations (not shown).

How do we quantify the error of these imputed data sets?
The data are a variable mixture of real measurements and imputed values. Both real and imputed measurements have some (but different!) degree of autocorrelation. Missing periods are not truly random in occurrence or duration and are difficult to model. Classical statistics (e.g., mean and variance) are tricky in these situations!

Our solution: Quantify the uncertainty of the imputation scheme by replacing values in known data sets with imputed values and comparing the known values with the estimates.
From the entire CASTNET data record: Identify site-years with essentially complete data ("Master" data). Create a complete "Historical average" time series (average hourly Vd) based on at least 5 years.
[Figure: Vd versus time (season). The historical average for the site is shown with the Master data, which is a complete year at this site.]

Create Mask patterns
Create "Mask" patterns from the patterns of missing data at different sites.
[Figure: Vd versus time (season). The solid blue line is a representative year at a different site with missing data; it is used to make a Mask file. The black line shows the pattern of missing data; it is a Mask file.]

Master-Mask files
Replace Master data with historical averages where the mask pattern indicates a missing value to make a Master-Mask file. Calculate the average Vd for the replaced file and compare it to the original Master file.
[Figure: Two panels of Vd versus time (season).]
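
A minimal sketch of the Master-Mask step, assuming each file is held as a plain list of hourly values and the mask is a list of booleans where `True` means "missing at the other site". The helper name `apply_mask` and all numbers are hypothetical.

```python
from statistics import fmean
from typing import List, Sequence


def apply_mask(master: Sequence[float],
               historical: Sequence[float],
               mask: Sequence[bool]) -> List[float]:
    """Build a Master-Mask series: historical value where the mask is True, else Master."""
    if not (len(master) == len(historical) == len(mask)):
        raise ValueError("all three series must have the same length")
    return [h if m else v for v, h, m in zip(master, historical, mask)]


# Hypothetical four-hour example (cm/s).
master = [0.10, 0.20, 0.30, 0.40]      # complete Master data
historical = [0.12, 0.18, 0.33, 0.37]  # historical average for the same hours
mask = [False, True, True, False]      # missing-data pattern taken from another site

master_mask = apply_mask(master, historical, mask)
percent_replaced = 100.0 * sum(mask) / len(mask)
master_mean = fmean(master)
master_mask_mean = fmean(master_mask)
print(percent_replaced, master_mean, master_mask_mean)
```

Because the Master values under the mask are known, the difference between the two means is a direct measure of the error introduced by the replacement.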

Rinse and repeat
132 Master site-year files. 152 Mask files. Result: Many thousands of different combinations, each with a different percent missing.

Difference of Imputed Values Compared to Known
[Figure: Relative percent differences (RPD) of the annual aggregate of the Master-Mask imputed values compared to the known Master data (annual averages).]
[Figure: The same figure, but showing the absolute relative percent difference (ARPD). The red squares are the standard deviation of each decile.]
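
The slides do not spell out the RPD formula. The sketch below assumes the common form in which the difference is taken relative to the known Master value; if the original analysis used a different denominator, the numbers would shift accordingly. The function names and inputs are hypothetical.

```python
def rpd(imputed_mean: float, known_mean: float) -> float:
    """Relative percent difference of the imputed aggregate versus the known one.

    Assumed definition: 100 * (imputed - known) / known.
    """
    if known_mean == 0:
        raise ValueError("known_mean must be non-zero")
    return 100.0 * (imputed_mean - known_mean) / known_mean


def arpd(imputed_mean: float, known_mean: float) -> float:
    """Absolute relative percent difference (sign discarded)."""
    return abs(rpd(imputed_mean, known_mean))


# Hypothetical annual averages (cm/s).
print(rpd(0.2525, 0.25))   # imputed slightly higher than known
print(arpd(0.24, 0.25))    # imputed lower than known, sign removed
```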

How robust is the scheme?
We want to compare results across parameters, seasons and sites. So, we need to put them in the same units: Normalize values by dividing them by the standard deviation at 100% missing.

How robust is the scheme?
Normalize the figure by dividing all values by the standard deviation at 100% missing. The result is the Standard Score, or Z-score, which is the number of standard deviations from the mean.
[Figure annotations: These values are approximately three standard deviations from the mean. The red line has a value of 1 standard deviation at 100% missing.]
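
The normalization can be sketched as follows. It assumes the differences are already expressed as RPDs and that the reference spread is the sample standard deviation of the RPDs from the 100%-missing cases; whether the original work used a sample or population standard deviation is not stated. Names and numbers are hypothetical.

```python
from statistics import stdev
from typing import List, Sequence


def to_z_scores(rpds: Sequence[float],
                rpds_at_100_missing: Sequence[float]) -> List[float]:
    """Divide each RPD by the standard deviation of the RPDs at 100% missing."""
    sd_100 = stdev(rpds_at_100_missing)  # sample standard deviation (assumption)
    if sd_100 == 0:
        raise ValueError("standard deviation at 100% missing is zero")
    return [r / sd_100 for r in rpds]


# Hypothetical RPDs (%) for fully replaced site-years, and for three other cases.
rpds_at_100_missing = [-3.0, 0.0, 3.0]
rpds = [1.5, -3.0, 6.0]

z_scores = to_z_scores(rpds, rpds_at_100_missing)
print(z_scores)
```

After this step every parameter, season and site is on the same unitless scale, so the curves can be overlaid.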

Compare by Parameter Annual Averages

Compare by Season Summer

Compare by Season Spring

Compare by Season Fall

Compare by Season Winter

Uncertainty vs. Percent Replaced
By parameter, how does uncertainty vary with the amount of data replaced? Plot the Z-scores as a function of the percent of data replaced.

Uncertainty vs. Percent Replaced
When normalized, the patterns for all seasons appear to be the same. Same with parameters!

Uncertainty vs. Percent Replaced
Conclusion: The patterns across parameters are very similar, which indicates that the data replacement scheme is quite robust.
y = -4.4E-05 x^2 + 0.0142 x

How do we estimate uncertainty for specific sites, seasons and parameters?
Data needed: The mean and standard deviation for the site, based on the historic CASTNET database, for the season and parameter. The percentage of missing data for your site-year.
The uncertainty equation: y = -4.4E-05 x^2 + 0.0142 x (where x is the percent of missing data, and y is the estimated uncertainty in units of standard deviations for your site's season and parameter).
Result: The average expected uncertainty expressed as a standard deviation (percent of the average value).
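
A worked sketch of the recipe above. The quadratic is the fitted equation from the slide; the conversion to a percent of the mean (multiply y by the site standard deviation, divide by the site mean) is one reading of the "Result" line and should be treated as an assumption. The function names and the site statistics are hypothetical.

```python
def uncertainty_in_sd(percent_missing: float) -> float:
    """Fitted curve from the slide: y = -4.4E-05 x^2 + 0.0142 x, with x in percent."""
    if not 0.0 <= percent_missing <= 100.0:
        raise ValueError("percent_missing must be between 0 and 100")
    return -4.4e-05 * percent_missing ** 2 + 0.0142 * percent_missing


def uncertainty_percent_of_mean(percent_missing: float,
                                site_mean: float,
                                site_sd: float) -> float:
    """Express the uncertainty as a percent of the site's historical mean (assumed conversion)."""
    return 100.0 * uncertainty_in_sd(percent_missing) * site_sd / site_mean


# Hypothetical site: historical seasonal mean Vd 0.25 cm/s, standard deviation 0.02 cm/s,
# with 25% of the hours in the site-year replaced.
y = uncertainty_in_sd(25.0)
pct = uncertainty_percent_of_mean(25.0, site_mean=0.25, site_sd=0.02)
print(y, pct)
```

As a sanity check, the curve returns roughly 1 standard deviation at 100% missing, which matches the normalization used to build it.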

How do the imputed annual fluxes compare with previous estimates?

Explore how these changes affect the data http://epa.gov/castnet/javaweb/mcannual_replaced.html