Imputation method for lifetime exposure assessment in air pollution epidemiologic studies
 Jan Beyea^{1}Email author,
 Steven D Stellman^{2},
 Susan Teitelbaum^{3},
 Irina Mordukhovich^{4} and
 Marilie D Gammon^{4}
DOI: 10.1186/1476069X1262
© Beyea et al.; licensee BioMed Central Ltd. 2013
Received: 26 December 2012
Accepted: 24 July 2013
Published: 7 August 2013
Abstract
Background
Environmental epidemiology, when focused on the life course of exposure to a specific pollutant, requires historical exposure estimates that are difficult to obtain for the full time period due to gaps in the historical record, especially in earlier years. We show that these gaps can be filled by applying multiple imputation methods to a formal risk equation that incorporates lifetime exposure. We also address challenges that arise, including choice of imputation method, potential bias in regression coefficients, and uncertainty in ageatexposure sensitivities.
Methods
During time periods when parameters needed in the risk equation are missing for an individual, the parameters are filled by an imputation model using group level information or interpolation. A random component is added to match the variance found in the estimates for study subjects not needing imputation. The process is repeated to obtain multiple data sets, whose regressions against health data can be combined statistically to develop confidence limits using Rubin’s rules to account for the uncertainty introduced by the imputations. To test for possible recall bias between cases and controls, which can occur when historical residence location is obtained by interview, and which can lead to misclassification of imputed exposure by disease status, we introduce an “incompleteness index,” equal to the percentage of dose imputed (PDI) for a subject. “Effective doses” can be computed using different functional dependencies of relative risk on age of exposure, allowing intercomparison of different risk models. To illustrate our approach, we quantify lifetime exposure (dose) from traffic air pollution in an established case–control study on Long Island, New York, where considerable inmigration occurred over a period of many decades.
Results
The major result is the described approach to imputation. The illustrative example revealed potential recall bias, suggesting that regressions against health data should be done as a function of PDI to check for consistency of results. The 1% of study subjects who lived for long durations near heavily trafficked intersections, had very high cumulative exposures. Thus, imputation methods must be designed to reproduce nonstandard distributions.
Conclusions
Our approach meets a number of methodological challenges to extending historical exposure reconstruction over a lifetime and shows promise for environmental epidemiology. Application to assessment of breast cancer risks will be reported in a subsequent manuscript.
Keywords
Exposure Air pollution Traffic Benzo(a)pyrene PAH Multiple imputation Epidemiology Inmigration DoseBackground
Historical reconstruction of exposure offers the opportunity to cover periods of a person's life when they may be most vulnerable[1, 2], but data challenges may make it difficult to attempt such a reconstruction for an epidemiologic study. Despite these difficulties, longterm assessment of residential exposure to pollutants is increasingly being used in environmental epidemiology of disease, particularly cancer[3–5]. As part of an effort in a largescale study of environmental exposures and breast cancer[6], we wanted to add individualized inhalation exposures of PAH from traffic emissions to a previously collected set of individualized PAH exposures from diet and active/passive smoking. Within the category of PAH air exposure, we focused on traffic emissions because they are a major source of both indoor and outdoor exposures to PAH, and often the largest source in areas near cities, as has been confirmed in a number of experimental studies[7–12]. However, it was necessary to confront a number of methodological challenges before inhalation exposure from traffic could be obtained over a lifetime and before it could be totaled appropriately.
First, it was necessary to decide how to combine yearly exposures over decades. There was no guarantee that simply adding them to obtain a dose would capture possible differences in relative risk per unit of exposure at different studysubject ages. We approached this challenge by computing “effective doses” assuming a range of different functional dependencies on age of exposure. Effective doses are computed by summing functions of yearly exposure. Having a suite of effective doses available for use in regressions against health outcomes allows for broad coverage of the possible agesensitivities.
Second, it was necessary to deal with exposure before study subjects arrived in the study area, as well as exposure that occurred in years before desirable records on emissions and traffic flows were available. A third challenge arose, because the earlier in time we went back to estimate exposure, the greater the percentage we found of missing residential information. To deal with these two types of missingdata problems we propose the use of multiple imputation, namely constructing a series of data sets with missing parameters randomly drawn from a set of predetermined values derived from an imputation model. The resulting multiple data sets are then available to be regressed against health outcomes. On the positive side of this approach, random errors in regression coefficients are reduced within an imputation data set, because sample sizes are increased. On the negative side, there is new variance in regression coefficients introduced when comparing results between imputed data sets. The net effect of these competing tendencies on overall variance can be assessed using Rubin’s Rules or alternate formulations[13–15]. With this combined variance in hand, confidence limits on regression coefficients can be obtained that explicitly include uncertainty introduced by imputation of missing values.
The fourth challenge we faced was a consequence of limitations in the imputation process itself, namely the vulnerability to increased risk of recall bias, when information needed to impute a missing dose component is based in part on questionnaire answers obtained from individual study participants. This can lead to bias in regression coefficients and, therefore, the possibility of finding false or hidden associations with health outcome.
The quantity imputed is the portion of the total dose accumulated during periods of missing location data or missing emissions data. The percentage of the missing component can vary in magnitude by study subject. Although dealing with partially missing information presented a challenge, it also offered an opportunity to check the robustness of regressions with imputed data, and thus test for recall bias. To this end, we introduce the “percentage of dose imputed” (PDI) as a continuous index that allows limiting analysis to groups of study subjects depending on their degree of imputation. With PDI limit set to 0, the analysis is a completecase analysis. With the limit on PDI set to 100%, the full population is analyzed. In this way, regression of health data could be performed as a function of the limit on PDI to see if the results were consistent as larger and larger components of imputed dose were allowed for study subjects. If there was consistency, then it would be reasonable to extend the results of a completecase analysis to larger segments of the full population.
A fifth methodological challenge arose, because the observed doses among the subjects with complete information turned out to be right skewed: specifically in our case–control study the 1% of women who had lived for a long time near heavily trafficked intersections. This presented a challenge to our imputing doses in such a way as to properly populate “outlier” values. To deal with this type of imputation challenge, we propose approaches that do not assume standard distributions; moreover we recommend using more than one approach to allow testing the sensitivity of the regression results to the choice of imputation model.
All of these challenges are likely to be faced, to a lesser or greater degree, when researchers contemplate studies that require individualized historical exposures of adults over a lifetime, or even a great portion of it. As a result of all the potential difficulties, historical exposure is often ignored and analysts look for associations with exposures in recent periods when data are complete. However, doing so can introduce exposure misclassification, depending on the historical mobility of the population and the variation of exposure in place and time, with a resulting tendency to bias results towards the null. Associations can be missed[16]; and periods of greatest importance for health effects go unnoticed[4]. One approach to dealing with missing historical exposure data is to restrict analysis to a subset of the study population with complete data, but such a restriction can decrease sample size drastically and raises questions about applicability of results to the entire study population. Thus, it was of interest to explore new ways to accomplish multiple imputation in the historical context and to find ways to test for recall bias.
We illustrate the approaches developed to deal with the challenges using data from a populationbased, case control study of breast cancer on Long Island, New York, a cancer for which a lifecourse approach to research is strongly recommended[17].
Methods
Risk equation
In Eq. 1, p _{ i } is the probability of an individual, i, becoming a case in year, y. β is a regression coefficient. Z _{calc}(i, y) is the exposure opportunity accumulated for time periods when information is complete for individuals, i, permitting individualized estimates. Z _{imp}(i, y), is an imputed dose for any time periods of interest when information needed for an individualized exposure estimate is lacking. Only the missing subcomponents of the exposure model are imputed, so that as much physical and biologic knowledge is retained. Z _{imp}(i, y), by definition, will be zero for the set of subjects included in completecase analysis.
Covariates that represent known risk factors for the disease under study, as well as possible confounding variables, are implicitly included in A. The magnitude of Z _{calc}(i, y) + Z _{imp}(i, y) determines the risk that subject, i, will become a case in year, y, arising from exposure terminating in a given year, such as a year related to a biological state (e.g., menopause), or at the end of the study period, which is the value chosen for our numerical examples.
E(y’, i) is the shortterm exposure at date, y′. The variable, y, is the year of disease diagnosis. The sum in Eq. 2 is over the dates, y′, running from birth year, y _{b}, to the last date of exposure interest, y _{stop}. B(y′y _{b} , yy _{b}) is a relative risk that compares the effect of unit exposure at age, y'y _{b}, to the effect at a standard age of exposure, say 1years old. Thus, if a unit dose at age 10 will double the risk of disease caused by a unit dose at age 1, the Bvariable for age 10 equals 2. In this way, the Zterms, which depend on B, can be called effective doses.
Subjects with the same Zdose have the same increase in log (odds of disease) compared to an unexposed person, no matter how their exposure was distributed over time. Rather than define the baseline of B as a single exposure in a particular year, the baseline can be defined relative to a standard distribution of exposure over a time period. To include birthcohort effects, for instance in a cohort study, the Bfactor would include a separate entry for birth year, y _{b}, and be written as B(y′y _{b} , yy _{b} , y _{b}).
In this paper, we consider four functional forms for B, distinguished by subscripts 1 through 4, which span a wide range of possible age sensitivities:
B _{1}: B = Constant;
B _{2}: B is inversely proportional to square of age of onset, which increases relative risk at young ages (as in best mathematical fit to data for relative risk of breast cancer following ionizing radiation exposure [2]);
B _{3}: B vanishes except at peak historical exposure (threshold model). (To work as a threshold model, B _{ 3 } can be used in conjunction with a threshold step function added to the risk equation.);
B _{4}: B is negligible except in the last kyears of the study (pure promoter model).
We note that a latency period can be implemented in any of the above models simply by changing the stop year in the summations.
The equation for effective dose, Eq. 2, involves a summation over a study subject’s entire life, but individual values of each variable that are needed to compute the sum are not usually completely available since birth. To fill in the missing information, we replace variables at various times with group level values. The time steps over which emissions must be tracked in equation 2 depends on the model being used to assign individualized exposure estimates. The base unit might be months, or any duration corresponding to data typically available depending on disease and the nature of the shortterm exposure model. For instance, the natural starting unit might be hours for a meteorological dispersion model that tracks emissions that vary by hour of day and accounts for the correlated meteorological conditions that also change hour by hour, such as rainfall, wind speed, and wind direction. Such an hourly model may be testable against hourly air concentration data collected at monitoring stations[18, 19]. As long as average emissions per month do not change rapidly, it possible to do the hourly summations once for each month of the year and use the results for the same month in each year without much loss in accuracy. Data averaged by month is appropriate when movement of study subjects between residences is known by month of the year.
It is useful to separate the exposure, E(y′, i), in Eq. 2 into sums or products of terms that reflect the underlying model of exposure, so that terms needing imputation can be separately identified. For a land use regression model the natural separation might be a linear combination of terms, including air concentrations at monitoring stations[20]. For a meteorological dispersion model, it would be natural to break E(y', i) into a product of terms. Regardless of whether the separation is a sum or product, whenever there is a time that a particular term or subpart is missing for an individual, it will need to be imputed according to an imputation model. In either case, random numbers are used in the multiple imputation model as part of the process of drawing parameter values from a predetermined set of options.
ϵ(y′) in Eq. 3 is the average emission per km from the tailpipe of an average vehicle along the road network at a particular date, y′, determined from average US values[21]. T(y′), in Eq. 3, is the average number of vehicles per km on an average road in the modeling area in the year, y′. Since 1970 it has increased by a factor of 2.7 in the entire U.S.[21]. D(y′, r _{ i }[y]), in Eq. 3, is the “dispersion function,” also called the “transfer function”[22]. It relates emissions from the entire road network to exposure at individual residences, r _{ i }. It is computed by tracking puffs of pollution as they change direction with hourlywind changes, as they expand at rates dependent on meteorological conditions, as their concentration declines due to depletion processes, and as they finally arrive at residences.
D represents one hour’s exposure received at the residence of a subject, r _{ i }[y′] at a date, y′, from one unit of pollutant emitted from all the vehicles on the road network, some of whose emissions may have started their airborne travel several hours earlier. To complete the calculation of D, the received exposure is divided by the average number of cars per km on the roadway, T(y′). In this paper, any imputation of Eq. 3 that is needed for a study subject in year, y′, is done on the transfer function, D(y′), as long as data for ϵ() and T() are available. Dvalues tend to be roughly constant over historical time, changing only as the relative traffic pattern changes in different parts of the modeling region. In addition to depending on the road network, Dvalues depend on regionspecific meteorological factors that enter the transport dispersion model.
Incompleteness index
Our approach is to use this incompleteness index (PDI) to define subsets of participants with varying contributions of group level components. Thus, we might compare regression or correlation results for a subset of study subjects with PDI’s = 0 to larger subsets with PDI's less than 20, 30, 40%, and so on. If the results are robust, with means and confidence limits that suggest a similar association, we can have confidence that imputation is not introducing artificial findings. Extrapolation of the results to the full sample of study subjects would become reasonable. Chisquare analysis of Tables of cases and controls grouped by intervals of PDI is a simple way to test for differential bias in imputation percentage.
It would also be reasonable to define an alternate incompleteness index based on the percentage of time that imputation was needed for a subject. However, this “percentage of time imputed” (PTI) is not as conservative an index as the percentage of dose imputed, particularly in situations where historical exposures varied greatly over time. A major advantage of setting a numerical value of PDI as a limit is that it insures that all study subjects have their degree of imputation constrained. In any case, PTI should be computed along with the PDI as a sensitivity check.
Imputation of gaps in an individual’s dose
To individualize the surrogate in Eq. 5 for study subject, i, the surrogate has been multiplied by a random variable, V _{1}(i), which is uniquely chosen for each study subject in each imputed data set. The distribution of V _{1} should be chosen to have a mean of unity and a variance that matches the variance in 1960 of the individualized exposure estimates. For purposes of allowing sensitivity analysis, we have added a parameter, k _{1}, as an exponent in Eq. 5, which could be varied around unity, should one want to test regressions with health data for different assumptions about the relationship between early vehicle registration numbers and traffic exposure. Standardization of the surrogate is carried out separately for cases and controls. Although Eq. 5 is independent of geographic locations of residences, it would also be possible to perform the analysis separately by geographical subregions of the study area.
A surrogate like the one defined in Equation 5 is the very opposite of an individualized dose, which implies considerable possible misclassification. And, the surrogate is likely to be highly correlated with age, requiring careful control of the agevariable to avoid residual confounding. On the other hand, if exposure is particularly important at young ages, then exposures in the period requiring surrogates is likely to be important. Also, it is quite possible that increased cancer incidence with age is due in part to the longer time older people have to accumulate exposure and to develop cancer[23]. A middle ground position would be to compare regression coefficients with and without the early dose surrogate included. When included, the implicit assumption is that early dose effects are so strong biologically that an exposure estimate grossly misclassified at the individual level will still be adequate to improve the overall dose classification, thereby facilitating the detection of any effect. Excluding the early dose surrogate is equivalent to assuming a biologic model that has cancer risk from the exposure falling off in importance years after exposure.
The early surrogate cannot contribute directly to any recall bias, because no information from interview plays a role in its initial calculation. Recall bias could nevertheless carry over to the early surrogate, however, because the scaling factor which is computed separately by cases and controls is based on post1960 individualized data that depends on subject recall of address locations. For some of the numerical examples given in this paper, we exclude the early dose surrogate, because including it does not help us elucidate potential recall bias using the PDI, which is a major focus of the paper.
It may not always be possible to find a boundary on which to match a surrogate. This is the case for subjects during periods when they resided within the temporal boundaries of the study, but outside the spatial regions. The natural way to generate dose surrogates for these subjects during their outofarea periods is to retain as much data needed for individual calculations as applies outside the spatial region, but use average values for nonavailable parameters. The needed average values can be computed by summing over the individualized values calculable for study subjects inside the spatial study area during the same time period.
where p _{3} and p _{4} have been replaced by < p _{3} > and < p _{4}>, averages over values for other study subjects, namely those subjects residing in the study area at time, t. Once again, individualization can be carried out by multiplying by a random variable with mean unity, which we call, V _{2}. The variance of V _{2} would be chosen to match the variance of subjects with individualized doses inside the study area. With this surrogate, the variance, V _{ 2 }, can change over time, so that matching needs to be carried out more than once, say, every five years. As with the earlier surrogate, standardization is done separately for cases and controls.
We note that the parameter, k _{2}, could also serve as a regression coefficient to be determined as part of the regression with health data. An alternate approach would be to individualize the parameters < p _{3} > and < p_{ 4 } > separately. See supplementary information for a discussion of these options.
The two grouplevel surrogates defined above refer to dose accumulated over different time periods in a subject's life. Obtaining an individual's imputed dose requires summing over both, if both are nonzero.
Imputation of missing residence information
There may be additional imputation required for a study subject, even within the temporal and spatial boundaries of the main study, for instance, to impute an individualized, Dvalue or other indicator of local exposure. In our example, historical street addresses, including house number, were obtained by interview, which meant relying on the memory of study subjects for input to geocoding software. Additional file1: Figure S11 shows the percentage of geocoding success by year of arrival at residence. For 1960, the success rate was 65%; therefore, 35% of the addresses in that year needed imputation.
We consider two ways to carry out the imputation on the theory that having two methods to compare regression results is a safeguard. Consistent results have the potential to increase confidence in the outcome; inconsistent results would be a warning that the imputation approach was problematic.
Within the mathematical function that defines shortterm exposure there will be some residence indicator, whether it be proximity to roads or a meteorological transfer function specific to a residence, such as the Dvalues discussed earlier. When a quantity like D is missing, the easiest method of imputation, suggested by RaaschouNielsen et al.[5] is to interpolate between values at residences before and after the missing value. To obtain values that differ between different imputed data sets, we can multiply the interpolated imputed exposure quantity by a random number with a log normal distribution whose variance is obtained from a lognormal fit to the two calculable quantities about which one is interpolating. For each imputed data set, a distinct value is randomly drawn from the derived lognormal distribution. Alternatively, when there is a missing quantity at starting or stopping dates, extrapolation can be used. In such a case, zero variance is assigned to the lognormal distribution and there is, therefore, no difference in interpolated dose component for a study subject in different imputed data sets for the missing residence period.
This method has the advantage that it relies solely on individual information and does not require the use of covariates in helping to choose the imputed dispersion coefficient. It has the disadvantage that, if there is a long tail in the exposure distribution, it can overstate the imputed exposure components when one of those high values is used for imputation.
Our second method relies on the fact that when information on residence location is obtained by questionnaire, it is likely that most residents will be able to remember at least the city, town, or village of the residence, even if they are not able to remember the full address. This means their location can be narrowed; in the US, we have found location could be narrowed to the Census Division Place level 80% of the time in our example study. Thus, it is possible to narrow the imputation to values of the observed data within the corresponding Place. In what follows, we call this, “imputation by place.” To handle the situation when a census division cannot be assigned or when the number of either cases or controls in a census Place is less than five (11% of locations in our example study), it is appropriate to fall back on the interpolation method.
With imputation by place, it is good practice to use covariates available for a subject, such as income or education, in picking the imputed values, because they could have influenced the area chosen for a residence. One of the limitations of multiple imputation is that there are situations involving missingness not at random that can introduce bias. This possibility can be reduced by including a large number of covariates in the imputation model, thereby increasing the chances that any variables predictive of missing values are incorporated[15].
Observed values of covariates included in the imputationbyplace algorithm for our breast cancer example were D, smoking status (passive and active), education, history of fertility problems, history of benign breast disease, family history of breast cancer, combined estrogen receptor and progesterone receptor status among cases, parity, BMI, dietary intake of PAH, lifetime intakes of grilled or barbecued and smoked meats, lifetime alcohol intake, and age at diagnosis. Case–control status was also included, which at first consideration seems circular; however, including it is a valid part of multiple imputation[24, 25]. Otherwise, patterns in the observed data might not be transferred to the unobserved data during imputation.
In contrast to imputation by interpolation, which may overestimate high values of dose, when there is a highly rightskewed distribution, the method of imputation by place may overestimate some low values of dose, should the values in the top tail of the distribution be correlated with the density of road intersections, as would be expected. This contrasting tendency makes use of the two methods in regression of health outcomes a useful test of the robustness of results.
Breast cancer example
Study population
We use data on 3,064 female breast cancer cases and controls who were residents of Long Island, New York, between August, 1996, and July, 1997[26]. Cases were women newly diagnosed with a first primary in situ or invasive breast cancer; controls under 65 were identified through random digit dialing and by Health Care Finance Administration (HCFA) rosters for those 65 years of age and older. Controls were frequency matched to the expected age distribution of cases.
Benzo(a)pyrene (BaP) emitted from vehicle tailpipes was the pollutant proxy analyzed for the numerical examples, because there is an abundance of good historical data[21]. The emission per km variable, ϵ(y’), in Eq. 3 peaked in the US in the mid1970s, declining rapidly with the introduction of catalytic converters until the late 1980s[21]. This means that per unit of time, exposure in the pre1980 period was likely to be much higher than after 1985, adding large temporal variation to the spatial variation associated with proximity to traffic evident in Figure 2.
Data for BaP and other PAH tailpipe emissions extend back only to 1960, which sets that year as a boundary for shifting to group level information. In the absence of information, and no engineering reason evident in the literature to think otherwise, we took ϵ(y′) to be a constant prior to 1960 equal to the known value in 1960. We also took the dispersion variable, D, as constant, equal to the average value computed in 1960 for residences in the study area. We used vehicle registrations in New York State, for which date existed back to 1900, as the surrogate for traffic flow rates for the pre1960 period, as in Eq. 5. See Additional file1: Figure S12
The LIBCSP meteorological dispersion model
The LIBCSP meteorological dispersion model was developed specifically for the Long Island Breast Cancer Study Project's analysis of traffic exposures, although it has been adapted for use in the Buffalo, New York area, with separate tests carried out in that region[19, 27]. A special feature of the model is its allowance for separate emissions from road intersections, which are known to be higher than during cruise emissions[28, 29]. The model does not account for street canyon effects, but that is not a necessity for Long Island.
The contribution from each road segment to the air concentration at a downwind receptor residence was computed within 100 meters of a road using a highway linesource model[30] applied to each of the 500,000, straightline road segments in the traffic network. The “Chock” highway model was chosen because it gave the best fit, when compared to a suite of models tested, to tracer concentrations near the Long Island Expressway as part of a test carried out by the New York Department of Environmental Protection[31]. R^{2} values ranged from 0.75 to 0.92 for various meteorological conditions and angles to the road.
Beyond 100 meters, we used a standard Gaussian puff dispersion model (equivalent to the USEPA’s “RAM“ model[32]). Additional file2: Table S23 describes the default parameters and data sources used in the dispersion model. Further details can be found in the text of Additional file2, including an explanation of the approach used to shift between different meteorological models at 100 meters and a description of how we account for background BaP blown into the study area from outside it.
Total concentration at a study subject’s residence was computed as the sum of the contributions from the approximately 500,000 source segments. This concentration at a residence forms the basis for the transfer function, D.
The transport model used to compute the transfer function, D, was previously validated against BaP soil measurements at approximately 500 residences[18]. A key result was higher than average emissions coming from traffic intersections, which were modeled separately. See Additional file1: Figure S13 for a graphic showing separatelymodeled intersection regions superimposed on a segment of the road network.
Imputation to account for missing residence location in the Long Island example
Results
The main results from this paper are Eqs. 2, 3, 4, 5, 6 and the approaches we described to operationalize them. In this section, we discuss results from the Long Island example.
Pearson correlation coefficients between doses computed with different functional forms for the biologic effectiveness factor, averaged over 15 imputations ^{ a }
Samples of women limited by allowed percentage of imputed dose (PDI)  

Dose surrogate  CCA (PDI = 0%)  PDI < 20%  PDI < 40%  PDI < 60%  PDI < 80%  (PDI < = 100%)  
N:  547  1164  1477  1764  2051  2986  
Cumulative dose 19601995  (comparison dose variable)  
Dose for 1995 only (Promoter model)  0.59 (0.11)  0.39 (0.15)  0.39 (0.16)  0.39 (0.16)  0.41 (0.16)  0.33 (0.15)  
Peak annual dose in 1960–1995 (Threshold model)  0.99 (0.00)  0.93 (0.00)  0.93 (0.00)  0.92 (0.01)  0.91 (0.01)  0.75 (0.017)  
Cumulative dose X (onset age)^{2} (Age sensitive model)^{b}  0.89 (0.024)  0.87 (0.028)  0.85 (0.033)  0.84 (0.039)  0.82 (0.044)  0.79 (0.06)  
Pre1960 surrogate^{c}  −0.020  −0.031  −0.031  −0.030  −0.016  +0.00044 
Early surrogate dosage (before 1960)
The early surrogate dosage for the breast cancer study subjects averaged 50% of the post1960 dose (calculated for the B _{1}, constant sensitivity biologic model). Additional file1: Figure S14 graphs the average dependency on age. The Pearson correlation coefficient was 0.6.
There was essentially zero correlation between the early surrogate and the cumulative dose post1960. The Pearson correlation coefficient ranged from −0.05 to 0.001 as the limit on PDI ranged from 0 to 100%. This low correlation means that inclusion of the early surrogate in the dose summation could alter the distribution curve for total dose.
Doses calculated after 1960 (Z _{calc} and Z _{imp})
The Z _{imp} surrogate itself, like Z _{calc}, is highly correlated with year of arrival, albeit with a different sign (Pearson coefficient = 0.56). Figure 4b shows how adding the postarrival, outofarea surrogate, Z _{imp}, to the individualized dose component, Z _{calc}, removes the dependence on year of arrival.
To avoid interpreting BaP as more than a PAH proxy[34], doses are presented in the Figures, not in units of inhaled milligrams of BaP, but in units of the 1995 average dose computed for women in the study area. Note that Figures 4a and4b show graphs for but one imputed data set; however, the patterns are virtually identical at this level of detail for 14 additional imputed data sets, as is to be expected given the great variance around the mean dose curve (results not shown). Also note that a dose value in Figure 4b may be the sum of three terms, an outofarea imputation, an inarea imputation, and a geographically modeled component.
Distribution of women by limits on incompleteness index (PDI ^{ a } ) in cumulative dose for one imputed data set ^{ b } (Women who arrived before 1996)
Limits on PDI^{a}  Number of control women  Number of case women  Total 

Zero (complete coverage)  267  280  547 
>0 and < 20%  336  281  617 
20% to < 40%  164  149  313 
40% to < 60%  150  137  287 
60% to < 80%  136  151  287 
80% to < = 100%  465  470  935 
Total  1518  1468  2986 
Probability level for ChiSquare  0.13 
Rightskewed distribution and a 1% dose tail
The top 1% of the cumulative dose distribution showed unusually high values associated with doses from heavilytrafficked intersections, as shown in Figure 3, where probability plots are presented with and without the intersection dose component included in total dose. Without intersection dose included, the probability plot of log dose in Figure 3, with the exception of one point, trends smoothly below the normal line at the high end. However, with intersection dose included there is an abrupt deviation, with points falling well outside the upper 95% confidence limit on the normal curve shown in the Figure. The same skewed distribution appears when plotting doses computed for other assumptions about the biological effectiveness factor, B, with the exception of the pure promoter model, B _{4} (results not shown). The skewed distribution at high doses was maintained no matter the limitation placed on imputation percentage (PDI) in 15 imputed data sets (results not shown).
Discussion
The past decade has seen an increasing understanding of the importance of the impact on chronic disease risk of exposures at all stages of life, especially exposures occurring during the early years of life when many organs are still undergoing rapid and formative growth[17]. Because many epidemiological studies of chronic illnesses such as cancer are usually initiated when the study population have already reached adulthood, capturing decadesold data on critical factors such as nutrition, exercise, and environmental exposures is a major challenge.
Residential address histories can be gathered by interview with relative ease, and we previously developed models for using histories obtained directly from patient interviews to estimate residential exposure to PAHs from traffic sources[35, 36] which we then calibrated against environmental measurements for participants in a breast cancer case–control study[18]. We further refined our methodology and historical database to take into account longterm changes in emission control technology[21]. A primary limitation, however, continues to be the completeness and accuracy of patient address recall. The present work provides a methodology to address this type of limitation by applying multiple imputation methods to extract the maximum information from available data, making it possible to use incompletely known residential data to put credible bounds on exposure metrics.
The early (pre1960) surrogate dosage for the breast cancer study subjects averaged 50% of the post1960 dose, showing that it can make a substantial contribution to lifetime dosage.^{b} Interestingly, because this surrogate prior to scaling does not depend on information obtained from interview, it cannot directly introduce any recall bias into the dose estimates, although some bias can be carried over to it from the post1960 averages. The way the early surrogate is defined in our example, without separate standardization by geographic region, the only connection to an individual is through age. However, the weak association (p = 0.13) between PDI and case status in our breast cancer data implies possible differential bias due to missing residential data; this will be investigated in a forthcoming epidemiological analysis that uses the present method to predict breast cancer risk.
Our finding of a consistent, strongly rightskewed distribution of traffic doses, associated with heavily trafficked intersections, means that nonstandard distributions of exposure must be anticipated in studying traffic emissions. And, for our example data, this highdose subpopulation is worthy of attention in future risk analyses that include these study subjects.
Because Pearson correlation coefficients between dose calculated with different biological effectiveness factors ranged from 0.33 to 0.79 with no limits set on PDI (PDI < = 100), the sensitivity of risk to age at exposure could be important in determining the strength of an association with health outcome; moreover, it suggests that regressions against health data might shed some light on which biologic model is most appropriate.
Conclusions
In historical reconstructions of exposure, investigators typically identify data that have been collected for environmental monitoring and not for epidemiology, and that do not fully cover the population of interest. We suggest here a way of filling in missing data that we hope will spur increased use of historical exposure reconstruction in health studies. A novel feature of our exposure reconstruction approach is an incompleteness index (PDI) that can be used to study the degree to which imputation might create artifactual dosages and, subsequently, improper associations with health outcomes  a form of differential bias to which researchers need to be sensitive in general[37, 38] and specifically in traffic air pollution studies[39].
Endnotes
^{a}The risk equation in the main text, Eq. 1, was considered to be linear in dose. If the log odds is considered to be nonlinear in exposure, then there would be terms in powers of Z as well a term in the log of Z[23, 40, 41] and possibly powers of log of Z. We did not consider such an extension of the methodology in the examples given in this paper.
^{b}One way to reduce the relative contribution of the early surrogate would be to trade one approximation for another, namely running the full exposure model at times earlier than the period for which full emissions and traffic data exists. This extrapolation could introduce misclassification due to errors in traffic flow rates and tailpipe emission rates, but might introduce less misclassification than would result from relying on an early surrogate. Depending on the variability in a model, an analyst might explore extrapolating the individualized model back in time to reduce the contribution from the early dose surrogate to a modest level relative to total dose.
Abbreviations
 A:

Constant term in regression, including covariate contributions to risk
 B1 through B4:

Functional forms for biological effectiveness
 B(y’ybyyb:

i): Biological effectiveness factor in subject, i, that relates exposure at a particular biological age, y’y _{b}, to detection of disease at a later biological age, yy _{b}
 BaP:

Benzo(a)pyrene
 β:

Vector of regression coefficients for Z(i,y)
 D(y’ri[y′]):

Transfer function
 E(y′i):

Exposure at date, y′, for subject, i
 ϵ(y′):

Emission from a vehicle tailpipe in year, y′
 pi:

Probability of an individual getting a disease
 PDI:

Percentage of dose imputed
 ri[y′]:

Residence of a subject at date, y′
 T(y′):

Average number of vehicles on an average road on date y′
 V1(i)V2(i):

Random variables with mean of unity
 Y:

Date of disease detection
 y′:

Date of an exposure
 yb:

Date of birth
 ystop:

Last exposure date of interest
 Z(i:

y): Vector of effective doses for individual, i, in year, y
 Zcalc:

Modeled component of individual dose
 Zimp:

Imputed part of dose.
Declarations
Acknowledgments
Maureen Hatch provided advice during the early stages of this work. In addition, the following individuals provided information and/or assistance: Jing Nie, James Cook, Laura Palmer, Suffolk County Department of Public works, Michael J. Finocchi, Nassau Department of Public Works, Ed Finster, USEPA, Roy Fedelem, Long Island Regional Planning Board, Anthony Dentrone, Hagstrom maps, Sureli Nepomuceno, Columbia University Health Sciences Library. This work was supported in part by Intramural Research Program of NIH, National Cancer Institute, Division of Cancer Epidemiology and Genetics; USPHS grants CA/ES66572, ES10126, CA63021, CA17613, and, Long Island and Babylon Breast Cancer Coalitions, West Islip Breast Cancer Coalition for L.I., Inc., Huntington Breast Cancer Action Coalition, Noreen T. Holland Breast Cancer Foundation, Inc., Breast Cancer Grassroots Organizations for a Unified Purpose.
Authors’ Affiliations
References
 Cohn BA, Wolff MS, Cirillo PM, Sholtz RI: DDT and breast cancer in young women: new data on the significance of age at exposure. Environ Health Perspect. 2007, 115: 14061414.Google Scholar
 Preston DL, Mattsson A, Holmberg E, Shore R, Hildreth NG, Boice JD: Radiation effects on breast cancer risk: a pooled analysis of eight cohorts. Radiat Res. 2002, 158: 220235. 10.1667/00337587(2002)158[0220:REOBCR]2.0.CO;2.View ArticleGoogle Scholar
 Lubin JH, Wang ZY, Boice JD, Xu ZY, Blot WJ, Wang LD, Kleinerman RA: Risk of lung cancer and residential radon in China: pooled results of two studies. Int J Cancer. 2004, 109: 132137. 10.1002/ijc.11683.View ArticleGoogle Scholar
 Nyberg F, Gustavsson P, Jarup L, Bellander T, Berglind N, Jakobsson R, Pershagen G: Urban air pollution and lung cancer in Stockholm. Epidemiology. 2000, 11: 487495. 10.1097/0000164820000900000002.View ArticleGoogle Scholar
 RaaschouNielsen O, Bak H, Sørensen M, Jensen SS, Ketzel M, Hvidberg M, Schnohr P, Tjønneland A, Overvad K, Loft S: Air pollution from traffic and risk for lung cancer in three danish cohorts. Cancer Epidemiol Biomarkers Prevent. 2010, 19: 12841291. 10.1158/10559965.EPI100036.View ArticleGoogle Scholar
 Gammon MD, Santella RM: PAH, genetic susceptibility and breast cancer risk: an update from the long island breast cancer study project. European J Cancer. 2008, 44: 636640. 10.1016/j.ejca.2008.01.026.View ArticleGoogle Scholar
 Dubowsky SD, Wallace LA, Buckley TJ: The contribution of traffic to indoor concentrations of polycyclic aromatic hydrocarbons. J Expo Anal Environ Epidemiol. 1999, 9: 312321. 10.1038/sj.jea.7500034.View ArticleGoogle Scholar
 Lim LH, Harrison RM, Harrad S: The contribution of traffic to atmospheric concentrations of polycyclic aromatic hydrocarbons. Environ Sci Technol. 1999, 33: 35383542. 10.1021/es990392d.View ArticleGoogle Scholar
 Harkov R, Greenberg A, Darack F, Daisey JM, Lioy PJ: Summertime variations in polycyclic aromatic hydrocarbons at four sites in New Jersey. Environ Sci Technol. 1984, 18: 287291.View ArticleGoogle Scholar
 Dickhut RM, Canuel EA, Gustafson KE, Liu K, Arzayus KM, Walker SE, Edgecombe G, Gaylor MO, MacDonald EH: Automotive sources of carcinogenic polycyclic aromatic hydrocarbons associated with particulate matter in the Chesapeake Bay Region. Environ Sci Technol. 2000, 34: 46354640. 10.1021/es000971e.View ArticleGoogle Scholar
 Dunbar JC, Lin CI, Vergucht I, Wong J, Duran JL: Estimating the contributions of mobile sources of PAH to urban air using realtime PAH monitoring. Sci Total Environ. 2001, 279: 119. 10.1016/S00489697(01)006866.View ArticleGoogle Scholar
 Levy JI, Houseman EA, Spengler JD, Loh P, Ryan L: Fine particulate matter and polycyclic aromatic hydrocarbon concentration patterns in Roxbury, Massachusetts: a communitybased GIS analysis. Environ Health Perspect. 2001, 109: 341347. 10.1289/ehp.01109341.View ArticleGoogle Scholar
 Zhou X, Reiter JP: A note on bayesian inference after multiple imputation. Am Stat. 2010, 64: 159163. 10.1198/tast.2010.09109.View ArticleGoogle Scholar
 White IR, Royston P, Wood AM: Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011, 30: 377399. 10.1002/sim.4067.View ArticleGoogle Scholar
 Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR: Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009, 338: b239310.1136/bmj.b2393.View ArticleGoogle Scholar
 Alavanja MC, Lubin JH, Mahaffey JA, Brownson RC: Residential radon exposure and risk of lung cancer in Missouri. Am J Public Health. 1999, 89: 10421048. 10.2105/AJPH.89.7.1042.View ArticleGoogle Scholar
 IOM: Breast Cancer and the Environment: A Life Course Approach; Committee on Breast Cancer and the Environment, Institute of Medicine. 2012, Washington. DC: The National Academies PressGoogle Scholar
 Beyea J, Hatch M, Stellman SD, Santella RM, Teitelbaum SL, Prokopczyk B, Camann D, Gammon MD: Validation and calibration of a model used to reconstruct historical exposure to polycyclic aromatic hydrocarbons for use in epidemiologic studies. Environ Health Perspect. 2006, 114: 10531058. 10.1289/ehp.8659.View ArticleGoogle Scholar
 Nie J, Beyea J, Bonner MR, Han D, Vena JE, Rogerson P, Vito D, Muti P, Trevisan M, Edge SB, Freudenheim JL: Exposure to traffic emissions throughout life and risk of breast cancer: the Western New York Exposures and Breast Cancer (WEB) study. Cancer Causes Control. 2007, 18: 947955. 10.1007/s1055200790362.View ArticleGoogle Scholar
 Noth EM, Hammond SK, Biging GS, Tager IB: A spatialtemporal regression model to predict daily outdoor residential PAH concentrations in an epidemiologic study in Fresno, CA. Atmos Environ. 2011, 45: 23942403. 10.1016/j.atmosenv.2011.02.014.View ArticleGoogle Scholar
 Beyea J, Stellman SD, Hatch M, Gammon MD: Airborne Emissions from 1961 to 2004 of Benzo[a]pyrene from U.S. Vehicles per km of Travel Based on Tunnel Studies. Environ Sci Technol. 2008, 42: 73157320. 10.1021/es8000773.View ArticleGoogle Scholar
 Liu LJ, Curjuric I, Keidel D, Heldstab J, Kunzli N, BayerOglesby L, AckermannLiebrich U, Schindler C: Characterization of sourcespecific air pollution exposure for a large populationbased Swiss cohort (SAPALDIA). Environ Health Perspect. 2007, 115: 16381645. 10.1289/ehp.10177.View ArticleGoogle Scholar
 Greenland S: Dose–response and trend analysis in epidemiology: alternatives to categorical analysis. Epidemiology. 1995, 6: 356365. 10.1097/0000164819950700000005.View ArticleGoogle Scholar
 Donders AR, van der Heijden GJ, Stijnen T, Moons KG: Review: a gentle introduction to imputation of missing values. J Clin Epidemiol. 2006, 59: 10871091. 10.1016/j.jclinepi.2006.01.014.View ArticleGoogle Scholar
 Little RJA: Regression With Missing X's: A Review. J Am Stat Assoc. 1992, 87: 12271237.Google Scholar
 Gammon MD, Neugut AI, Santella RM, Teitelbaum SL, Britton JA, Terry MB, Eng SM, Wolff MS, Stellman SD, Kabat GC, Levin B, Bradlow HL, Hatch M, Beyea J, Camann D, Trent M, Senie R, Garbowski G, Maffeo C, Montalvan P, Berkowitz G, Kemeny M, Citron M, Schnabel F, Schuss A, Hajdu S, Vinceguerra V, Collman G, Obrams I: The long island breast cancer study project: description of a multiinstitutional collaboration to identify environmental risk factors for breast cancer. Breast Cancer Res Treat. 2002, 74: 235254. 10.1023/A:1016387020854.View ArticleGoogle Scholar
 Li Y, Nie J, Beyea J, Rudra CB, Browne RW, Bonner MR, Mu L, Trevisan M, Freudenheim JL: Exposure to traffic emissions: Associations with biomarkers of antioxidant status and oxidative damage. Environ Res. 2013, 121: 3138.View ArticleGoogle Scholar
 Nelli JP, Messina AD, Bullin JA: Analysis and modeling of air quality at street intersections. JAPCA. 1983, 33: 760764.Google Scholar
 Sculley R: Vehicle emission rate analysis for carbon monoxide hot spot modeling. JAPCA. 1989, 39: 13341343. 10.1080/08940630.1989.10466628.View ArticleGoogle Scholar
 Chock DP: A simple linesource model for dispersion near roadways. Atmos Environ. 1978, 12: 823829. 10.1016/00046981(78)900197.View ArticleGoogle Scholar
 Sistla G, Samson P, Keenan M, Rao S: A study of pollutant dispersion near highways. Atmos Environ. 1979, 13: 669685. 10.1016/00046981(79)901963.View ArticleGoogle Scholar
 Catalano JA, Turner BD, Novak JH: User's Guide for RAM. Research Triangle Park. 1987, Environmental Protection Agency, 2Google Scholar
 van Buuren S, GroothuisOudshoorn K: Multivariate imputation by chained equations in R. J Stat Software. 2011, 45: 167.Google Scholar
 Stellman SD, Guidotti TL: Polycyclic aromatic hydrocarbons. Environmental and Occupational Medicine. Edited by: Rom WN, Markowitz SB3. 2007, Philadelphia: Lippincott Williams & Wilkins, 12401250.Google Scholar
 Beyea J, Hatch M: Geographic exposure modeling: a valuable extension of GIS for use in environmental epidemiology mht version. Env Health Perspectives. 1999, 107 (Suppl. 1): 181190.View ArticleGoogle Scholar
 Development of a traffic model for predicting airborne PAH exposures since. 1960, New York: Report to the National Cancer Institute and the National Institute of Environmental Health Sciences for work completed under USPHS Grant U01CA/ES66572,http://www.cipi.com/PDF/beyea2005trafficpahmodel.pdf, on Long Island,
 Chyou PH: Patterns of bias due to differential misclassification by case–control status in a case–control study. European J Epidemiol. 2007, 22: 717. 10.1007/s106540069078x.View ArticleGoogle Scholar
 Jurek AM, Greenland S, Maldonado G: Brief Report: How far from nondifferential does exposure or disease misclassification have to be to bias measures of association away from the null?. Int J Epidemiol. 2008, 37: 382385. 10.1093/ije/dym291.View ArticleGoogle Scholar
 Rosenlund M, Bellander T, Nordquist T, Alfredsson L: Trafficgenerated air pollution and myocardial infarction. Epidemiology. 2009, 20: 265271. 10.1097/EDE.0b013e318190ea68.View ArticleGoogle Scholar
 Boucher KM, Slattery ML, Berry TD, Quesenberry C, Anderson K: Statistical methods in epidemiology: a comparison of statistical methods to analyze dose–response and trend analysis in epidemiologic studies. J Clin Epidemiol. 1998, 51: 12231233. 10.1016/S08954356(98)001292.View ArticleGoogle Scholar
 Becher H: Analysis of continuous covariables in epidemiological studies: Dose–response modelling and confounder adjustment. Biometrical J. 2002, 44: 683700. 10.1002/15214036(200209)44:6<683::AIDBIMJ683>3.0.CO;22.View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.