 Methodology
 Open Access
 Published:
Extended twostage designs for environmental research
Environmental Health volume 21, Article number: 41 (2022)
Abstract
Background
The twostage design has become a standard tool in environmental epidemiology to model multilocation data. However, its standard form is rather inflexible and poses important limitations for modelling complex risks associated with environmental factors. In this contribution, we illustrate multiple design extensions of the classical twostage method, all implemented within a unified analytic framework.
Methods
We extended standard twostage metaanalytic models along the lines of linear mixedeffects models, by allowing locationspecific estimates to be pooled through flexible fixed and randomeffects structures. This permits the analysis of associations characterised by combinations of multivariate outcomes, hierarchical geographical structures, repeated measures, and/or longitudinal settings. The analytic framework and inferential procedures are implemented in the R package mixmeta.
Results
The design extensions are illustrated in examples using multicity time series data collected as part of the National Morbidity, Mortality and Air Pollution Study (NMMAPS). Specifically, four case studies demonstrate applications for modelling complex associations with air pollution and temperature, including nonlinear exposure–response relationships, effects clustered at multiple geographical levels, differential risks by age, and effect modification by air conditioning in a longitudinal analysis.
Conclusions
The definition of several design extensions of the classical twostage design within a unified framework, along with its implementation in freelyavailable software, will provide researchers with a flexible tool to address novel research questions in twostage analyses of environmental health risks.
Introduction
In environmental epidemiological studies, it is common practice to investigate shortterm associations between environmental exposures and health outcomes by analysing data collected from multiple locations. An analytical approach applied in this setting is based on the twostage design, which has become the standard method for the analysis of multilocation data [1,2,3,4,5,6,7,8,9,10,11,12]. The design is based on the separation of the analysis into two steps: in the first stage, locationspecific exposure–response associations are estimated while adjusting for various confounders; then, in the second stage, the estimates are pooled using metaanalytic methods, which can potentially incorporate locationspecific metapredictors.
The twostage design offers several advantages. First, the pooling of data collected in multiple locations increases the statistical power, thus facilitating the detection of small risks usually associated with environmental stressors [13]. At the same time, the separation in two steps provides a flexible and computationally efficient analytical framework compared to onestage approaches [2, 14, 15]. This allows analyses of large datasets collected across multiple populations, increasing the representativeness of the findings. Finally, an important advantage of the twostage design is the enhanced ability to examine heterogeneity in risk across populations, which can be linked to contextual characteristics.
However, there are known limitations of this analytical method. For instance, the standard twostage design requires the association of interest to be represented by a single effect summary (e.g., a relative risk or odds ratio) for being pooled in the second stage. However, in the context of modelling exposure–response associations, this step requires the simplification of potentially complex relationships and/or the adoption of strong functional assumptions (e.g., linearity). Similarly, this restriction prevents combining multiple estimates of the association of interest from the same location, for example when collected from different age groups or periods. Finally, the standard twostage analytic design does not take into account potential geographical dependencies, often occurring in the presence of clustering. These limitations represent important barriers to the application of the twostage framework for addressing more complex research questions about environmental health risks.
In this contribution, we illustrate a unified framework that combines multiple design extensions of the classical twostage method for environmental health studies, some of which were described independently in published analyses [6, 16,17,18,19]. This extended twostage framework is based on linear mixedeffects metaanalytical models, previously developed and published by our research group [20], that can combine multivariate outcomes, longitudinal settings, multilevel structures, and/or repeated measurement [20]. This framework relaxes the constraints described above and offers a flexible and generally applicable tool to implement more advanced study designs using multilocation data.
The article is organized as follows. Firstly, we introduce the extended twostage design and its features, including the design structure and related modelling framework. Then, after presenting the specific example and the related dataset, we will demonstrate applications of the various design extensions in multiple case studies using multilocation analyses of health risks of temperature and air pollution. In a final discussion section, we describe the epidemiological context, strengths and limitations, and area of further research. An uptodate version of the notes, data, and R scripts for reproducing the examples are available on a GitHub repository (see Availability of data and material).
Methods
Extended twostage design
In the classical twostage design, the data are organised and analysed in firststage models that provide independent estimates of a single parameter representing the association of interest in each study area, for instance, a city. These effect summaries are then pooled in the second stage using metaanalytic techniques to combine the information and compute an overall estimate. As discussed above, these requirements pose important analytical constraints. The extended twostage described here overcomes these limitations, first allowing different estimates of single or multiple parameters to be computed in each location, and then relaxing the assumption of independence of estimates within and between locations.
This extended framework provides a flexible setting that allows designing more complex epidemiological studies to address more elaborated research questions. For example, in each study area, multiple parameters could be used to represent complex exposure–response dependencies, such as nonlinear and lagged temperaturehealth associations of temperature [21], or alternatively correlated effects of multiple exposures, such as different pollutants included in the same firststage model [22]. At the same time, relaxing the independence assumption allows accounting for correlations arising when the locations are nested within higher geographical levels (e.g., cities within countries), therefore modelling patterns of similarities and differences [19]. Moreover, in each study area, the firststage model can be applied multiple times to obtain repeated measures of the same association, for instance longitudinally at different times or for different subgroups, such as by age or sex. This structure allows the investigation of temporal variations in risk [17] and the flexible pooling of effect modifications [16].
These analytic features, namely complex multivariate exposure–response relationships, geographical hierarchies, and longitudinal or repeatedmeasure structures can be incorporated individually or simultaneously in the extended twostage framework, offering a flexible analytic context for modern environmental research studies.
Statistical framework
The extension of the twostage design is made possible by the development of a unified statistical framework, previously developed and published by our research group [20], that specifies the secondstage metaanalysis as a mixedeffects linear model [20], as described below. Here we assume that estimates of the association of interest \({\widehat{{\varvec{\theta}}}}_{i}\) have been obtained from each of the \(i=1,\dots ,n\) locations. Here \({\widehat{{\varvec{\theta}}}}_{i}\) generally represents the output of the firststage analysis (see appendix A), and it can include single or multiple coefficients obtained by single or repeated measurements across times or groups, depending on the specific application. In addition, without loss of generality, such estimates can be obtained from various types of firststage models, such as time series for aggregated data [23] or survival analysis of individuallevel records [24], among others.
The firststage estimates \({\widehat{{\varvec{\theta}}}}_{i}\) can be combined in the second stage using an extended randomeffects metaanalysis that flexibly models potentially complex dependence structures. This extended metaanalytical model can be written as a linear mixedeffects model:
with \({{\varvec{b}}}_{i}\sim N\left(0,{\varvec{\Psi}}\right)\), and \({{\varvec{\varepsilon}}}_{i}\sim N\left(0,{{\varvec{S}}}_{i}\right)\).
The design matrix \({\varvec{X}}_{i}\), potentially expanded to account for multivariate outcomes, includes fixedeffect predictors and associated coefficients \({\varvec{\beta}}\). Random terms are represented by the design matrix \({{\varvec{Z}}}_{i}\) with coefficients \({{\varvec{b}}}_{i}\), and by the errors \({{\varvec{\varepsilon}}}_{i}\). The random terms have (co)variance matrices \({\varvec{\Psi}}\) and \({{\varvec{S}}}_{i}\), representing the deviations and errors between and within locations, respectively.
It is important to note that the association parameter \({\widehat{{\varvec{\theta}}}}_{i}\) could have a general nested design with L level inducing possible nonindependence of the estimates, e.g. associations estimated at multiple times, or in cities nested within a country. The extended framework naturally considers the nested design with a hierarchy of the randomeffects effects vector\({{\varvec{b}}}_{i}\), then \({{\varvec{b}}}_{i}\) consists of the random coefficients operating on the levels (from outer to inner)\(l=1,\dots ,L\): \({{\varvec{b}}}_{i}^{T}=\left({{\varvec{b}}}_{i1}^{T},\dots ,{b}_{iL}^{T}\right)\), and the design matrix \({\mathbf{Z}}_{i}\) of the random terms has the corresponding partitioning\({{\varvec{Z}}}_{i}=\left({{\varvec{Z}}}_{i1}\dots {{\varvec{Z}}}_{iL}\right)\),\({{\varvec{Z}}}_{il}=\left({{\varvec{Z}}}_{il1}\dots {{\varvec{Z}}}_{ilnl}\right)\). Note that every matrix \({{\varvec{Z}}}_{ilj}\) has nonzero entries only in the rows that correspond to units in the group j \((j=1,\dots ,{n}_{l})\) of level\(l\).
The (co)variance matrix of the random terms has then the following structure:
where \({{\varvec{\Psi}}}_{l}\) is the covariance of the random terms operating at level \(l\).
Example and data
The various extensions of the twostage design will be illustrated using the same analytical example of multicity timeseries data collected as part of the National Morbidity, Mortality and Air Pollution Study (NMMAPS) [25]. This database contains, among other information, daily series of mortality counts and weather and pollution measurements totalling 5114 observations for the period 1987–2000 in each of 108 cities in the USA. This data resource has been used in several epidemiological analyses to assess health risks associated with air pollution and later with temperature [5, 26,27,28,29,30].
The NMMAPS data consisted of daily series of allcause and causespecific mortality, also stratified by age groups (0–64, 65–74, 65 and older), and various indices of daily levels of several pollutants and weather variables. In addition, the database included citylevel metadata with several variables on geographical, climatological, demographic and socioeconomic characteristics. The original datasets were collected on the 15th of May, 2013 through the package NMMAPSdata in the R software [31]. The package is now archived and the mortality series are not provided anymore. The data are here complemented with information on air conditioning use, collected longitudinally for a subset of cities and obtained from different sources [17].
The database is used in a series of case studies described in the next sections to illustrate the various extensions of the twostage design. In each of them, we assume that firststage models have been performed separately in the 108 locations, collecting summary estimates of association parameter(s) \({\widehat{{\varvec{\theta}}}}_{i}\) and their (co)variance matrix \(V\left({\widehat{{\varvec{\theta}}}}_{i}\right)\), and optionally locationspecific metadata. These data are made available in a GitHub repository, together with the R code for the first stage to produce these quantities from the original data, and for the second stage to reproduce the results of the case studies (see Availability of data and material). Methodological and analytical details, in particular related to the firststage modelling, are omitted to focus on specific aspects of the extensions of the twostage design, with additional information provided in the Supplementary Material. As methodological case studies, these analyses should be considered illustrative examples and are not meant to offer substantive epidemiological evidence.
Results
Case study 1: modelling complex multiparameter associations
Motivation
As mentioned earlier, an important limitation of the standard twostage design is the need to simplify the relationship estimated in the first stage in a single effect summary, for it to be pooled in the second stage. This prevents the modelling of more complex associations represented by multiple parameters.
This limitation can be addressed by extending the twostage design so that multiple quantities can be jointly combined in the second stage, using metaanalytic models that take into account their multivariate structure and their covariance (correlation) within and between locations. The metaanalytical methods can be further extended to multivariate metaregression models that include specific predictors to explain (part of) the observed heterogeneity. This extension of the two stage design has been known as a multivariate metaanalysis or multivariate metaregression [15], and it can be represented as a specific parametrisation of the linear mixed effects metaanalytic framework presented above. These extensions can be implemented with the R package mvmeta [32] or with the updated and more general R package mixmeta [33].
In this case study, we offer an example of this extension to assess health risks associated with outdoor temperature, often characterised by marked nonlinearity and heterogeneity of the effects across locations. In particular, we will investigate the association between heat and allcause mortality during the summer months and the potential role of cityspecific characteristics in modifying the risk. This extension of the twostage design has been previously used in published analyses which evaluated the shortterm health impacts of temperature [6, 16, 34].
Brief description of the data, model, and analysis
We assume that summeronly time series models have been fitted in each of the 108 NMMAPS cities to estimates temperaturemortality relationships using spline functions (see Supplementary Material B1), obtaining sets of four coefficients and their (co)variance matrices that represent the multiparameter nonlinear associations. In the second stage, we use these estimates as multivariate outcomes in the extended metaanalytical framework.
First, we fit a multivariate metaanalysis using a maximum likelihood (ML) estimator to pool the firststage results and obtain an estimate of the average heatmortality exposure–response curve. We then attempt to identify possible contextual characteristics that explain a quota of heterogeneity. Among potential factors, we consider population size, education (% of people with highschool degree) and unemployment (% of unemployed). These variables are included as predictors in multivariate metaregressions, and their effects tested through likelihood ratio (LR) tests. Finally, a stepwise procedure is applied to select the best set among univariable and multivariable models. See Supplementary Material B1 for details.
Results
The basic multivariate metaanalytic model (with no predictors and only intercepts) produces pooled estimates of the set of coefficients representing the average heatmortality association across the 108 cities. These coefficients can be used to compute the nonlinear exposure–response curve expressed as relative risk (RR) by applying the same spline transformations on an average summer temperature distribution represented in a relative percentile scale [15]. The results are displayed in Fig. 1, showing a minimum mortality risk at low summer temperatures (MMT) and the sharp increase of the RR beyond the 90^{th} percentile.
The simple metaanalysis shows a substantial heterogeneity in heatmortality associations across cities, with an I^{2} of 61.5% and a highly significant Cochran Q test (pvalue < 0.001). Therefore, we assess if some of this heterogeneity was explained by some city characteristics, specifically population size, education, and unemployment, by adding them as predictors in multivariate metaregressions. Results are reported in Table 1. When tested separately in univariable models, each predictor is significantly associated with modification of the heatmortality association. The full multivariable model identifies instead independent associations only for population size and unemployment, and these results are consistent with the selection of the forward stepwise procedure.
The tests above demonstrate an effect modification by specific citylevel metavariables, but provide little information on its direction. This can be identified by using the parameters of the multivariate metaregression models to predict the multivariate outcome, namely the coefficients of the spline function representing the heatmortality relationship, for given values of the metapredictors. As an example, we used this method to isolate the effect modification of population size, keeping the other metapredictors constant. The results, shown in the right panel of Fig. 1, indicate a higher mortality risk of heat in larger cities.
This case study demonstrates an extension of the twostage design to pool multiparameter associations. The specific example illustrates an application for complex exposure–response relationships, but the multiparameter definition can be generalised, and the method is applicable for instance also to pool effects of multiple pollutants or multiple health outcomes [22].
Case study 2: modelling complex hierarchical structures
Motivation
Another important limitation of the standard twostage design is the assumption of conditional independence between firststage estimates. In environmental epidemiological associations, this assumption is invalid in the presence of geographical clustering, occurring when estimates are more similar in locations within the same region than between regions.
The twostage design can be extended accordingly by modelling the dependencies among estimates through a hierarchical structure (e.g., cities within countries, or countries within states). This extension can be implemented through a secondstage multilevel metaanalysis that defines multiple sets of random effects at different geographical levels.
In this case study, we provide an example in an analysis of the association between air pollution and nonaccidental mortality in a multicity time series study. Specifically, we assess the increased risk associated with exposure to ozone in a sample of NMMAPS cities accounting for clustering within states. We previously applied this extended twostage design in a study evaluating the shortterm health effects of pollutants [19].
Brief description of the model, data, and analysis
As in the previous case study, we assume that firststage time series models have been performed in each city, collecting estimates of the logRR for an increase in ozone of 10 µg/m^{3}, along with its variance as a measure of the uncertainty (see Supplementary material B2). Estimates for cities with no or limited daily measurements of ozone were set to missing, leaving a sample of 98 cities within 38 states.
We start the analysis by fitting a standard metaanalysis with cityspecific random effects. Then, in order to account for potential geographical differences, we first perform a standard metaregression with state indicators as fixedeffects predictors, and then the extended model including two levels of random effects by cities nested within states. Finally, we compute statelevel fixedeffects predictions from the metaregression, and best linear unbiased predictions at both city and state level from the multilevel model [20]. See Supplementary material B2 for details.
Results
The standard metaanalytic model with singlelevel random effects for cities returns a pooled RR of nonaccidental mortality of 1.0037 (95%CI: 1.0027 to 1.0047), corresponding to a percentage increase of 0.37%, with a betweencity variance equal to 0.0049^{2}. The inclusion of state indicators in the metaregression suggests that there are significant geographical differences (LR test with a pvalue < 0.001). Two drawbacks of this fixedeffects approach are the lack of a pooled effect estimate, and the high uncertainty in statelevel predictions given the low number of cities within states and an highlyparameterised model.
The multilevel randomeffects model addresses these limitations. First, this model provides a pooled relative risk of 1.0038 (95%CI: 1.0024 to 1.0051), with a similar point estimate and slighter higher confidence intervals than the standard metaanalysis. The betweengroup heterogeneity is split between states (0.0030^{2}) and cities (0.0040^{2}), suggesting variation at both levels. Figure 2 displays these geographical differences by mapping the citylevel best linear unbiased predictions (BLUPs) of the RR for a 10 µg/m^{3} increase in ozone.
Second, the multilevel model can improve the statespecific estimates by computing BLUPs at this geographical level. Figure 3 compares these quantities with fixedeffects predictions obtained from the standard metaregression model. The results reveal the gain in precision of the BLUPs resulting from the shrinkage and borrowing of information across states [20]. These estimates are more reliable than fixedeffects predictions, where only the withinstate information is used.
This case study illustrates how to extend the classical twostage design by accounting for hierarchical dependencies between estimates from different locations. This flexible multilevel structure offers the possibility to separate the heterogeneity across geographical levels and to obtain more reliable and informative association estimates. The approach can be seamlessly extended to multiparameter associations, combining multilevel and multivariate models [18].
Case study 3: subgroups analysis, and dose–response relationships
Motivation
Common applications of the twostage design entail the provision of single effect summaries from each location. However, the analysis can sometimes be repeated by subgroups of the population defined by specific characteristics, such as sex or age, resulting in repeated measures and dependencies that the standard twostage design is not able to handle.
The extended framework addresses this limitation, offering an adaptable grouping structure that allows multiple association estimates within a location. Moreover, the role of subgroups characteristics can be flexibly examined in a dose–response fashion by including either categorical and continuous variables in the fixedpart component. As for the extensions presented in the previous case studies, this framework is also applicable to multivariate outcomes.
In this case study, we extend further the investigation of the association between heat and allcause mortality illustrated in Case Study 1 by stratifying the analysis by age. This provides repeated estimates for each of the 108 NMMAPS cities and the opportunity to apply flexible models to examine patterns of risk varying by age.
Brief description of the model, data, and analysis
The stratified analysis involves the fitting of the same firststage regression model as in Case Study 1, but this time repeated separately for the three age groups (0–64, 65–74, 65 and older) using agespecific mortality series (see Supplementary material B3). We assume that this step has been performed and that we have obtained 324 sets of coefficients and associated (co)variance matrices representing agespecific heatmortality associations in three age groups and 108 cities.
In the second stage, we first fit a standard metaregression that ignores the citylevel clustering and models the 324 multivariate outcomes using categorical indicators for age groups and unitspecific random effects. This model is first extended to account for clustering by defining the randomeffect grouping structure at the city level. Then, we specify a continuous age variable by assigning specific values to the groups (60, 70, and 85 years) and finally we model it using either a linear or nonlinear spline parametrisation. See Supplementary Material B3 for details.
Results
Table 2 offer a comparison between the different modelling strategies. All the models indicate evidence for an effect modification of age, but those correctly accounting for clustering by defining citylevel groups (Models 1–3) demonstrate a better fit. The comparison of the more flexible models that define a continuous dose–response parametrisation (Models 2 and 3) suggests the presence of nonlinearity. Note that the spline model (Model 3) has virtually an identical fit of the model with categorical indicators (Model 1), given that the number of groups/values equals the spline terms. However, the more flexible option defining the effect modification on a continuous scale has still some advantages, as illustrated below.`
The analysis has similarities to Case Study 1, which illustrated the effect modification related to cityspecific variables, but, in this case, modelling withincity variations in risk. Still, the direction of the effect is difficult to ascertain when applying complex multiparameter functions. Therefore, we rely on the same approach to predict average heatmortality exposure–response curves for specific age values, taking advantage of the continuous dose–response parametrisation of the repeatedmeasure multivariate model. The results are reported in Fig. 4, suggesting a clear age pattern with the risk of heat increasing at older ages.
This case study shows how to extend the classical twostage design to account for repeated measures originating, for instance, in the presence of multiple estimates from population subgroups in the same location. This design extension also offers the possibility of modelling effect modifications by specific characteristics using flexible dose–response parametrisations on a continuous scale. It is interesting to note that this approach relaxes the requirement of defining fixed subgroups (e.g., by age), as different values can be attributed across locations.
Case study 4: modelling longitudinal patterns of risk
Motivation
A different setting in which repeated measures can arise in twostage analyses is when multiple estimates are collected at different times for the same location. This situation poses methodological problems that, similarly to the previous case study, standard designs are not equipped to handle.
The development of the twostage methods to address these limitations requires accounting for the longitudinal structure of the data and modelling temporal trends in the exposure–response association. This extension provides environmental epidemiologists with the possibility of studying longitudinal patterns of risk, and considering potential timevarying factors explaining the variability of the estimated association over time.
In this case study, we again revise the analysis of heatmortality relationships described in Case Study 1 by fitting the model in multiple subperiods in each city. This step offers the opportunity to study temporal changes in the exposure–response curve and to assess the role of air conditioning (AC) in attenuating the risk. This case study is an illustrative example of a published analysis by our research group [17].
Brief description of the model, data, and analysis
We assume that in the first stage the data for the subset of 89 NMMAPS cities with information on AC data were split into five subperiods (1987–98, 1990–92, 1993–95, 1996–98, and 1999–2000), and that separate time series models were fitted in each city/period combination, deriving a total of 445 sets of coefficients (co)variance matrices representing the multivariate association. Each city/period combination can be assigned a measure of AC prevalence use (%) reconstructed from an external database [17] (see Supplementary Material B4).
In a second step, we apply a longitudinal multivariate randomeffect metaregression to evaluate changes in heatrelated mortality risks, accounting for both within and betweencity variations. We include in the model a smooth spline function of calendar year and a linear term for AC as timevarying predictors, assessing their contribution with LR tests. As in the previous case study, this flexible continuous parametrisation allows the prediction of nonlinear exposure–response curves for any given year and potential scenarios of AC use. See Supplementary Material B4 for details.
Results
The longitudinal metaregression model suggests an independent effect of both calendar year (LR test pvalue = 0.038) and air conditioning (pvalue = 0.008). We evaluate their role by predicting the exposure–response associations in RR scale for different AC prevalence levels (80% vs 20%) in the year 1990. The curves are displayed in Fig. 5 (left panel), indicating how increasing AC has a protective effect at hot temperatures.
In order to assess the joint contributions of trends and AC use, we depict two scenarios to represent longitudinal changes in risk along years: a factual scenario using the observed trend in average AC prevalence, and a counterfactual scenario with AC use kept constant in time at the value of 1987. The right panel of Fig. 5 shows the results, summarising the heat effects as the RR computed at the 99^{th} percentile versus the MMT along the period 1987–2000. The predicted risk under the counterfactual scenario (in blue) reveals a decreasing trend independent from AC use. Nonetheless, the comparison with the factual scenario (in red) suggests that the increase in AC prevalence during the period contributed somehow to attenuate the risk.
This last case study demonstrates the extension of the twostage design to study longitudinal associations, evaluating changes in risk across both spatial and temporal dimensions. The flexibility of the extended framework allows parametrising effects on a continuous scale and performing secondstage metaanalysis with balanced and unbalanced data, with important design advantages.
Discussion
In this contribution, we presented several design extensions of classical twostage studies, and introduced several examples that illustrate how the flexibility of this modelling tool can improve the investigation of the effect of environmental exposures on health outcomes. Specifically, we showed how the extended twostage design can be applied to investigate complex exposure–response dependencies, multilevel longitudinal structures, and repeatedmeasure dose–response associations. The analytic framework can be applied using classical inferential procedures and can be easily implemented using the R package mixmeta.
The twostage design was proposed for the analysis of multilocation data. The methodology has been popularised by multicity time series studies investigating shortterm risk associations with environmental stressors [2, 5, 10], and it has become a common tool to assess the acute effects of pollutants [4, 7,8,9, 11] and temperature [1, 3]. The twostage design has been also implemented in multicohort studies (e.g. ESCAPE project) to evaluate to longterm effect of pollutants [12, 24, 35], and in genetic epidemiology studies [36, 37]. Several extensions of a standard design have been proposed over the years, all of which can be represented as specific applications of the unified framework proposed here.
The most straightforward extension considers multiple estimates obtained in the first stage and the application of multivariate metaanalytic models in the second stage. This approach was originally developed to pool lagged effects [2], multiple pollutants [22], and nonlinear dependencies [15], or more complex distributed lag nonlinear associations [38].
Early applications of the twostage design considered a small number of locations within a country, but the increased availability of environmental measures and health data now allows studies that include hundreds of locations within several countries [18, 19, 39]. In this setting, the locations can have a hierarchical structure that can be directly incorporated into the extended twostage design. This extension has been proposed to obtain global, country, and citylevel estimates of the associations by combining information within and between locations [18, 19, 39].
Environmental risk factors are often associated with risks that vary according to some individual or contextual characteristic [28, 40, 41]. The comparison of association measures across subgroups was originally performed qualitatively and/or without consideration of the possible nonindependence of multiple estimates collected within a location [42]. The extended twostage design can directly model dependencies between the stratified estimates within each location, and appropriate inferential procedures can be used to evaluate differences across subgroup estimates.
In addition, such differences can be linked with measurable characteristics that can be included as categorical and continuous fixedeffects terms in the extended secondstage metaregression. This extension allows modelling risks varying both within locations (e.g., age in Case Study 3) and between locations (e.g., population size and unemployment rate in Case Study 1). This effect modification patterns can be modelled linearly or nonlinearly using flexible parametric functions, representing a further extension of dose–response pooling methods applied in observational studies [43, 44].
With the availability of longer time series of environmental exposures and health outcomes, researchers have started to investigate the temporal variation in associations of short term environmental exposures and health outcomes [3, 17, 45,46,47,48,49]. In particular, modelling approaches have proposed timevarying extensions of distributed lag nonlinear models [47, 48], Bayesian hierarchical models [3, 46], and functional metaregression [49]. The extended twostage design naturally accommodates balanced and unbalanced association parameters longitudinally directly accounting for possible nonindependences, and it provides the possibility to parametrise trends through linear and nonlinear functions. It is important to note that the longitudinal setting can incorporate other extensions, such as multivariate outcomes and multilevel structures, modelling potentially complex structures of longitudinal associations [17].
The data example and the four case studies are consistent with the most common application of the twostage design in time series analysis of shortterm effects of environmental exposures. However, it is worth noting that the framework proposed here is not restricted to the time series setting, and firststage estimates can be obtained by any other approach such as casecrossover or timetoevent Cox models. Therefore, the extended twostage design can similarly be applied in environmental epidemiological studies investigating either short or longterm effects of environmental exposure, using either individuallevel or aggregated crosssectional, case–control, and cohort data [12, 24, 35,36,37, 50].
An important advantage of the proposed development is the fact that it is grounded on a unified likelihoodbased inferential framework and implemented in freely available and easytouse software. All the analyses illustrated in the four case studies can be performed using the R package mixmeta, which offers a simple syntax to define all the different models and combinations of them. Similar extensions of the twostage design were proposed based on Bayesian hierarchical models, for instance for multivariate [22], multilevel [14] and longitudinal data [46], but they usually require advanced statistical and programming skills and can be computationally more demanding. Nonetheless, the Bayesian framework offers more flexibility in accommodating randomeffects and correlations, for instance spatial structures that are not yet available and generally more difficult to implement in our likelihoodbased development.
Conclusions
Technological developments in environmental monitoring, coupled with advancements in data linkage and collaborative tools, offer new opportunities for researchers to collect large multilocations databases. The development of a general and extended framework for twostage designs is therefore timely and offers a flexible and generally applicable tool for modern environmental epidemiological studies.
Availability of data and materials
An uptodate version of the R scripts and data to fully reproduce the examples described in the four case studies are added in a GitHub repository, available at https://github.com/gasparrini/extended2stage.
Change history
09 May 2022
A Correction to this paper has been published: https://doi.org/10.1186/s1294002200861z
Abbreviations
 NMMAPS:

National Morbidity, Mortality and Air Pollution Study
 ML:

Maximum likelihood
 LR:

Likelihood ratio
 RR:

Relative risk
 MMT:

Minimum mortality risk temperature
 BLUPs:

Best linear unbiased predictions
 AC:

Air conditioning
 AIC:

Akaike information criterion
 BIC:

Bayesian. information criterion
References
Baccini M, Biggeri A, Accetta G, Kosatsky T, Katsouyanni K, Analitis A, et al. Heat effects on mortality in 15 European cities. Epidemiology. 2008;19(5):711–9.
Berhane K, Thomas DC. A twostage model for multiple time series data of counts. Biostatistics. 2002;3(1):21–32.
Bobb JF, Peng RD, Bell ML, Dominici F. HeatRelated Mortality and Adaptation to Heat in the United States. Environ Health Perspect. 2014;122(8):811–6.
Chen R, Kan H, Chen B, Huang W, Bai Z, Song G, et al. Association of particulate air pollution with daily mortality: the China Air Pollution and Health Effects Study. Am J Epidemiol. 2012;175(11):1173–81.
Dominici F, Samet JM, Zeger SL. Combining evidence on air pollution and daily mortality from the 20 largest US cities: a hierarchical modelling strategy. J R Stat Soc Ser. 2000;163:263–84.
Gasparrini A, Guo YM, Hashizume M, Lavigne E, Zanobetti A, Schwartz J, et al. Mortality risk attributable to high and low ambient temperature: a multicountry observational study. Lancet. 2015;386(9991):369–75.
Liu C, Chen R, Sera F, VicedoCabrera AM, Guo YM, Tong SL, et al. Ambient Particulate Air Pollution and Daily Mortality in 652 Cities. N Engl J Med. 2019;381(8):705–15.
Romieu I, Gouveia N, Cifuentes LA, de Leon AP, Junger W, Vera J, et al. Multicity study of air pollution and mortality in Latin America (the ESCALA study). Res Rep Health Eff Inst. 2012;171:5–86.
Samoli E, Analitis A, Touloumi G, Schwartz J, Anderson HR, Sunyer J, et al. Estimating the exposureresponse relationships between particulate matter and mortality within the APHEA multicity project. Environ Health Perspect. 2005;113(1):88–95.
Schwartz J. Assessing confounding, effect modification, and thresholds in the association between ambient particles and daily deaths. Environ Health Perspect. 2000;108(6):563–8.
Wong CM, VichitVadakan N, Kan H, Qian Z. Public Health and Air Pollution in Asia (PAPA): a multicity study of shortterm effects of air pollution on mortality. Environ Health Perspect. 2008;116(9):1195–202.
Basagana X, Pedersen M, BarreraGomez J, Gehring U, GiorgisAllemand L, Hoek G, et al. Analysis of multicentre epidemiological studies: contrasting fixed or random effects modelling and metaanalysis. Int J Epidemiol. 2018;47(4):1343–54.
Armstrong BG, Gasparrini A, Tobias A, Sera F. Sample size issues in time series regressions of counts on environmental exposures. Bmc Med Res Methodol. 2020;20(1):1–9.
Dominici F, Daniels M, Zeger SL, Samet JM. Air pollution and mortality: Estimating regional and national doseresponse relationships. J Am Stat Assoc. 2002;97(457):100–11.
Gasparrini A, Armstrong B, Kenward MG. Multivariate metaanalysis for nonlinear and other multiparameter associations. Stat Med. 2012;31(29):3821–39.
Sera F, Armstrong B, Tobias A, VicedoCabrera AM, Astrom C, Bell ML, et al. How urban characteristics affect vulnerability to heat and cold: a multicountry analysis. Int J Epidemiol. 2019;48(4):1101–12.
Sera F, Hashizume M, Honda Y, Lavigne E, Schwartz J, Zanobetti A, et al. Air Conditioning and Heatrelated Mortality A Multicountry Longitudinal Study. Epidemiology. 2020;31(6):779–87.
VicedoCabrera AM, Scovronick N, Sera F, Roye D, Schneider R, Tobias A, et al. The burden of heatrelated mortality attributable to recent humaninduced climate change. Nat Clim Chang. 2021;11(6):492–500.
VicedoCabrera AM, Sera F, Liu C, Armstrong B, Milojevic A, Guo YM, et al. Short term association between ozone and mortality: global two stage time series study in 406 locations in 20 countries. BMJ. 2020;368. https://www.bmj.com/content/368/bmj.m108.
Sera F, Armstrong B, Blangiardo M, Gasparrini A. An extended mixedeffects framework for metaanalysis. Stat Med. 2019;38(29):5429–44.
Gasparrini A. Modeling exposurelagresponse associations with distributed lag nonlinear models. Stat Med. 2014;33(5):881–99.
Dominici F, Zanobetti A, Zeger SL, Schwartz J, Samet JM. Hierarchical bivariate time series models: a combined analysis of the effects of particulate matter on morbidity and mortality. Biostatistics. 2004;5(3):341–60.
Bhaskaran K, Gasparrini A, Hajat S, Smeeth L, Armstrong B. Time series regression studies in environmental epidemiology. Int J Epidemiol. 2013;42(4):1187–95.
Cesaroni G, Forastiere F, Stafoggia M, Andersen ZJ, Badaloni C, Beelen R, et al. Long term exposure to ambient air pollution and incidence of acute coronary events: prospective cohort study and metaanalysis in 11 European cohorts from the ESCAPE Project. BMJ. 2014;348. https://www.bmj.com/content/348/bmj.f7412.full.
Samet JM, Dominici F, Zeger SL, Schwartz J, Dockery DW. The National Morbidity, Mortality, and Air Pollution Study. Part I: Methods and methodologic issues. Res Rep Health Eff Inst. 2000;94 Pt 1:5–14. discussion 7584.
Daniels MJ, Dominici F, Zeger SL, Samet JM. The National Morbidity, Mortality, and Air Pollution Study. Part III: PM10 concentrationresponse curves and thresholds for the 20 largest US cities. Res Rep Health Eff Inst. 2004;94 Pt 3:1–21. discussion 3–30.
Dominici F, McDermott A, Daniels M, Zeger SL, Samet JM. Revised analyses of the National Morbidity, Mortality, and Air Pollution Study: Mortality among residents of 90 cities. J Toxicol Environ HealthPart aCurrent Issues. 2005;68(13–14):1071–92.
Huang Y, Dominici F, Bell ML. Bayesian hierarchical distributed lag models for summer ozone exposure and cardiorespiratory mortality. Environmetrics. 2005;16(5):547–62.
Samet JM, Zeger SL, Dominici F, Curriero F, Coursac I, Dockery DW, et al. The National Morbidity, Mortality, and Air Pollution Study. Part II: Morbidity and mortality from air pollution in the United States. Res Rep Health Eff Inst. 2000;94(Pt 2):5–70 discussion 19.
Zhang YQ, Xiang QQ, Yu Y, Zhan ZY, Hu KJ, Ding Z. Sociogeographic disparity in cardiorespiratory mortality burden attributable to ambient temperature in the United States. Environ Sci Pollut Res. 2019;26(1):694–705.
Peng RD, Welty LJ. The nmmapsdata package. R news. 2004;4(2):10–4.
Gasparrini A, Gasparrini MA. Package ‘mvmeta.’ 2019.
Gasparrini A, Sera F, Gasparrini MA. Package ‘mixmeta.’ 2021.
Scovronick N, Sera F, Acquaotta F, Garzena D, Fratianni S, Wright CY, et al. The association between ambient temperature and mortality in South Africa: A timeseries analysis. Environ Res. 2018;161:229–35.
Beelen R, RaaschouNielsen O, Stafoggia M, Andersen ZJ, Weinmayr G, Hoffmann B, et al. Effects of longterm exposure to air pollution on naturalcause mortality: an analysis of 22 European cohorts within the multicentre ESCAPE project. Lancet. 2014;383(9919):785–95.
Raimondi S, Gandini S, Fargnoli MC, Bagnardi V, Maisonneuve P, Specchia C, et al. Melanocortin1 receptor, skin cancer and phenotypic characteristics (MSKIP) project: study design and methods for pooling results of genetic epidemiological studies. Bmc Med Res Methodol. 2012;12:1–13.
Surendran P, Feofanova EV, Lahrouchi N, Ntalla I, Karthikeyan S, Cook J, et al. Discovery of rare variants associated with blood pressure regulation through metaanalysis of 1.3 million individuals. Nat Genet. 2020;52(12):1314–32.
Gasparrini A, Armstrong B. Reducing and metaanalysing estimates from distributed lag nonlinear models. Bmc Med Res Methodol. 2013;13:1–10.
Meng X, Liu C, Chen RJ, Sera F, VicedoCabrera AM, Milojevic A, et al. Short term associations of ambient nitrogen dioxide with daily total, cardiovascular, and respiratory mortality: multilocation analysis in 398 cities. BMJ. 2021;372. https://www.bmj.com/content/372/bmj.n534.
Son JY, Liu JC, Bell ML. Temperaturerelated mortality: a systematic review and investigation of effect modifiers. Environ Res Letters. 2019;14(7):073004.
Zeka A, Zanobetti A, Schwartz J. Individuallevel modifiers of the effects of particulate matter on daily mortality. Am J Epidemiol. 2006;163(9):849–59.
Nordio F, Zanobetti A, Colicino E, Kloog I, Schwartz J. Changing patterns of the temperaturemortality association by time and location in the US, and implications for climate change. Environ Int. 2015;81:80–6.
Crippa A, Discacciati A, Bottai M, Spiegelman D, Orsini N. Onestage doseresponse metaanalysis for aggregated data. Stat Methods Med Res. 2019;28(5):1579–96.
Orsini N, Li RF, Wolk A, Khudyakov P, Spiegelman D. MetaAnalysis for Linear and Nonlinear DoseResponse Relations: Examples, an Evaluation of Approximations, and Software. Am J Epidemiol. 2012;175(1):66–73.
Carugno M, Consonni D, Bertazzi PA, Biggeri A, Baccini M. Temporal trends of PM10 and its impact on mortality in Lombardy. Italy Environ Pollut. 2017;227:280–6.
Chen C, Warrington JA, Dominici F, Peng RD, Esty DC, Bobb JF, et al. Temporal variation in association between shortterm exposure to fine particulate matter and hospitalisations in older adults in the USA: a longterm timeseries analysis of the US Medicare dataset. Lancet Planetary Health. 2021;5(8):E534–41.
Chung Y, Yang D, Gasparrini A, VicedoCabrera AM, Ng CFS, Kim Y, et al. Changing Susceptibility to NonOptimum Temperatures in Japan, 1972–2012: The Role of Climate, Demographic, and Socioeconomic Factors. Environ Health Perspect. 2018;126(5):057002.
Gasparrini A, Guo YM, Hashizume M, Kinney PL, Petkova EP, Lavigne E, et al. Temporal Variation in HeatMortality Associations: A Multicountry Study. Environ Health Perspect. 2015;123(11):1200–7.
Yu J, Park J, Choi T, Hashizume M, Kim Y, Honda Y, et al. Nonparametric Bayesian Functional MetaRegression: Applications in Environmental Epidemiology. J Agric Biol Environ Stat. 2021;26(1):45–70.
Sera F, Ferrari P. A Multilevel Model to Estimate the Within and the BetweenCenter Components of the Exposure/Disease Association in the EPIC Study. Plos One. 2015;10(3):e0117815.
Acknowledgements
Not applicable
Funding
AG was funded by the Medical Research CouncilUK (Grant ID: MR/M022625/1), the Natural Environment Research Council UK (Grant ID: NE/R009384/1) and the European Union’s Horizon 2020 Project Exhaustion (Grant ID: 820655).
Author information
Authors and Affiliations
Contributions
FS and AG conceptualise the research goals and aims, FS and AG developed the methods and the software, FS and AG analysed the data, FS was a major contributor in writing the manuscript that was revised and approved by AG.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not Applicable.
Consent for publication
Not Applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this article was revised: The text has been modified to include that the material is available in a GitHub repository, freely accessible by the readers.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Sera, F., Gasparrini, A. Extended twostage designs for environmental research. Environ Health 21, 41 (2022). https://doi.org/10.1186/s1294002200853z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1294002200853z
Keywords
 Environmental epidemiology
 Twostage design
 Metaanalysis
 Temperature
 Pollution