We observe a time series of ground-truth observations z(t), which may represent air pollution measurements (e.g., for PM2.5) taken at a central site monitor. The goal of an exposure prediction model is to accurately replicate such observations with a predicted time series z∗(t) that can be computed at locations and times without monitoring data. The predicted series z∗(t) may represent output from a regression model, a machine learning algorithm, a computer simulation model such as the CMAQ model, or any combination of these approaches. While we will focus on temporal data series z(t) and predictions z∗(t) indexed by time t, we discuss extensions of these ideas to the spatial domain in the Discussion.
Our goal is to compare a time series of model predictions, z∗(t), with a reference time series of ground-truth observations, z(t), for times t = 1, …, n. Existing approaches include quantifying prediction accuracy using correlation \(r=\sqrt{R^2}=\mathrm{Cor}\left({z}^{\ast }(t),z(t)\right)\) and \(RMSE=\sqrt{\frac{1}{n}\sum_{t=1}^n{\left({z}^{\ast }(t)-z(t)\right)}^2}\) [7, 11, 14, 35]. Additional existing approaches include the log variance ratio (LVR), defined as \(\log \left(\frac{\mathrm{Var}\left({z}^{\ast }(t)\right)}{\mathrm{Var}\left(z(t)\right)}\right)\) [7]. The LVR captures differences in temporal variation between models, which can impact precision in estimated health associations. We refer to these approaches as overall model performance measures, which we denote by roverall, RMSEoverall, and LVRoverall.
Overall model performance can be impacted by model errors at timescales different than the timescale of interest. Therefore to better capture health estimation capacity, we propose frequency band model performance. Frequency band model performance differs from existing approaches by restricting the model predictions z∗(t) and observations z(t) to their timescale-specific components to create measures r(k), RMSE(k), and LVR(k), which represent correlation, RMSE, and LVR for a chosen range of frequencies denoted as band k. To extract the frequency band k components, we use a discrete Fourier transform [6, 23, 40]. For our reference time series, z(t), we partition the range of frequencies [0, n/2) into non-overlapping frequency bands k = 1, …, K such that
$$z(t)=\sum_{k=1}^K{z}_k(t)$$
(1)
The same approach is used to partition z∗(t) into components \({z}_k^{\ast }(t)\). Then, \({r}_{(k)}=\mathrm{Cor}\left({z}_k^{\ast }(t),{z}_k(t)\right)\), \(RMS{E}_{(k)}=\sqrt{\frac{1}{n}\sum_{t=1}^n{\left({z}_k^{\ast }(t)-{z}_k(t)\right)}^2}\), and \(LV{R}_{(k)}=\log \left(\frac{\mathrm{Var}\left({z}_k^{\ast }(t)\right)}{\mathrm{Var}\left({z}_k\right)\Big)}\right)\). To facilitate comparisons between overall and frequency band k RMSE, we scale RMSEoverall and RMSE(k) by the standard deviation of the reference time series z(t) and zk(t), respectively.
To understand the advantage of frequency band model evaluation, consider a hypothetical time series of PM2.5 observations and model predictions (Fig. 1). Existing overall model performance metrics (roverall, RMSEoverall, and LVRoverall) are applied to the time series on the left, which is impacted by differences between observations and model predictions at all frequencies. In contrast, frequency band k model performance restricts evaluation of model performance to the frequency of interest using a discrete Fourier transform (e.g., high frequency, such as day-to-day variation) and is not impacted by errors at other frequencies (e.g., medium frequency, such as monthly variation, and low frequency, such as seasonal variation) (Fig. 1).
As in previous work estimating health associations of PM air pollution [23], we set K = 6 and consider frequency bands of k = 1: [1,6) cycles per year corresponding to seasonal components, k = 2: [6,12) cycles per year, k = 3: [12,26) cycles per year, k = 4: [26,52) cycles per year, k = 5: [52,104) cycles per year, and k = 6: [104,183) cycles per year corresponding to acute components. Our particular interest for acute health associations of air pollution is in the acute timescale captured by the frequency band k = 6 model performance (k = 6: [104,183) cycles per year), corresponding to a timescale of a few days or less. However, frequency band model performance can be applied to any timescales that are of interest.
We first conduct a simulation study to determine how errors at specific frequency bands impact overall model performance measures roverall, RMSEoverall, and LVRoverall as well as frequency band k model performance measures r(k), RMSE(k), and LVR(k). We hypothesize that errors at a specific frequency band k′ impact the overall model performance (roverall, RMSEoverall, and LVRoverall) and the model performance at the same frequency band k′ (r(k′), RMSE(k′), and LVR(k′)), but not model performance at different frequency bands k ≠ k′. For example, seasonal error that occurs within the first frequency band (k = 1 corresponding to [1,6) cycles per year) would impact model performance measured by roverall, RMSEoverall, LVRoverall, r(1), RMSE(1), and LVR(1), but not model performance at higher frequency bands relevant for estimating acute health associations (e.g., r(6), RMSE(6), and LVR(6)). This implies that if we are primarily interested in model predictions at frequency band k′, model performance measured by r(k′), RMSE(k′), and LVR(k′) will better reflect errors relevant for frequency band k′ while limiting the influence of errors at other frequency bands k ≠ k′.
To simulate observed air pollution time series that are non-negative and right-skewed, let z(t) = exp(x(t)σx + μx), where x(t) is a time series with Var(x(t)) = 1, and where μx and σx represent the mean and standard deviation of the log-transformed time series. We specify μx = 1.9 log μ g/m3 and σx = 0.6 log μ g/m3 to reflect log-transformed concentrations of PM2.5 in New York City from 2010 to 2018. We simulate the time series \(x(t)=\frac{\sum_{k\in \left\{1,2,6\right\}}{x}_k(t)}{\sqrt{\mathrm{Var}\left(\sum_{k\in \left\{1,2,6\right\}}{x}_k(t)\right)}}\) consisting of frequency bands k = 1, 2, 6 to approximately reflect seasonal, monthly, and acute time trends found in air pollution concentrations. Each xk(t) is specified using cosine functions with varying wavelengths (details in Additional file, Section A) and Var(xk(t)) = 1. As a sensitivity analysis, we increase the relative seasonal variation to reflect observed relative variability across timescales in PM2.5 data, i.e. Var(x(1)(t)) ∈ {1.5,2}. We simulate 100 observed time series with 3 years of data each (n = 1095).
We simulate model predictions by incorporating classical measurement error at varying timescales into simulated observed time series. Our model predictions are simulated as z∗(t) = exp(x∗(t)σx + μx), with μx = 1.9 and σx = 0.6 as in the simulated observed time series. The log-transformed simulated model predictions x∗(t) = x(t) + wk(t)σc, where x(t) is the simulated observed time series (log-transformed with Var(x(t)) = 1) and wk(t) is the standardized error component at frequency band k with Var(wk(t)) = 1, The magnitude of error is represented by the standard deviation σc ∈ {0.2,0.4,0.6,0.8}. We obtain wk(t) as the standardized k frequency band component from a discrete Fourier transform of standard normal error. Therefore, on the logarithmic scale, our simulated model predictions are the simulated observed time series with classical error at frequency band k. As k varies from 1, …, 6, the timescale of the error changes from seasonal (k = 1) to acute (k = 6) (Fig. 2). For each of our 100 simulated observed time series with n = 1095, we simulate 24 model prediction time series with varying classical measurement error (four varying magnitudes of error σc and six error frequencies wk(t) k = 1, …, 6).
Health counts (e.g., number of deaths or emergency department (ED) visits for day t) are simulated as Poisson(μ(t)) where μ(t) = exp(β0 + β1zresidual(t) + 0.03zfitted(t)). The residuals zresidual(t) and fitted zfitted(t) are obtained from a regression of the simulated observed time series z(t) on a natural spline of time with 24 degrees of freedom to capture seasonal and monthly trends. Therefore, the residuals zresidual(t) represent the sub-monthly frequency components of z(t) relevant for acute health associations. We specify the acute health association β1 = log(1.1)/10 corresponding to a relative risk of 1.1 per 10 μ g/m3 increase in PM2.5. The base rate of health β0 = 5 yields approximately 200 health counts per day (i.e., deaths or ED visits) and is selected to reflect cardiorespiratory ED visits observed in large U.S. cities [34].
We estimate health associations for simulated model predictions with varying magnitudes of classical error (σc) and error wk(t) at frequencies 1, …, 6. The health association model is an overdispersed Poisson time series regression model controlling for non-acute temporal trends with a natural spline of time with 280 degrees of freedom per year to isolate the acute time series. This number of degrees of freedom is much larger than generally used in the epidemiologic literature to properly adjust for wk(t) error added at k = 5 [52, 104) cycles per year. In practice, confounding in epidemiologic studies of acute health associations is controlled for by both a smooth function of time (generally with <10 degrees of freedom per year) as well as smooth functions for meteorology including temperature and humidity [37, 38, 44, 45, 50]. To evaluate the estimated health associations for the model predictions, we compute the estimated health association RMSE and percent relative mean bias in the estimated health association (mean bias/β1 × 100). Using scatterplots and R2, we examine associations of both mean overall model performance and frequency band k model performance with health association RMSE and percent relative mean bias across 100 simulated datasets.
We also compare observations of PM2.5 to predictions from an exposure model to demonstrate both overall model performance and frequency band k = 6 model performance. To represent “ground-truth” observations, we develop a dataset of observed daily PM2.5 data in μ g/m3 from 17 monitors across 8 US cities from 2010 to 2017 using the US Environmental Protection Agency’s (US EPA) Air Quality System. We select monitors from cities based on geographic locations throughout the continental US and based on previous studies of air pollution and health [23, 34, 44, 55], including Atlanta, GA (number of monitoring sites: n = 4); Dallas, TX (n = 1); Houston, TX (n = 1); Los Angeles, CA (n = 5); New York City, NY (n = 1); Pittsburgh, PA (n = 2); Seattle/Tacoma, WA (n = 1); and Salt Lake City, UT (n = 2). For each monitor, we interpolate PM2.5 means for short gaps (<10 days) in the data for each city to create uninterrupted time series. We include only monitors that had at least 1 year of daily concentrations after interpolation. We utilize the longest complete time series for each monitor.
As our prediction model, we utilize predictions from the Fused Air Quality Surface Using Downscaling (FAQSD) approach at 2010 US census tracts from the US EPA [52]. FAQSD uses a Bayesian space-time model to fuse monitoring data with CMAQ model predictions and develop predictions at 2010 US census tracts [3,4,5, 42]. CMAQ is an atmospheric chemical transport model that provides predictions 12 × 12 km resolution grids across the US [12, 13] and may be calibrated or fused with observed monitoring data [21]. We link FAQSD predictions to monitors using the census tract where the monitor is located. We compare the observed PM2.5 data and FAQSD using overall performance measures (roverall, RMSEoverall, and LVRoverall) and frequency band k = 6 model performance (r(6), RMSE(6), and LVR(6)), where k = 6 corresponds to [104,183) cycles per year and represents variation relevant for acute health associations at timescales <3.5 days [23]. All analyses were conducted using R version 4.0 [41].