 Research
 Open Access
 Published:
Statistical strategies for constructing health risk models with multiple pollutants and their interactions: possible choices and comparisons
Environmental Health volume 12, Article number: 85 (2013)
Abstract
Background
As public awareness of consequences of environmental exposures has grown, estimating the adverse health effects due to simultaneous exposure to multiple pollutants is an important topic to explore. The challenges of evaluating the health impacts of environmental factors in a multipollutant model include, but are not limited to: identification of the most critical components of the pollutant mixture, examination of potential interaction effects, and attribution of health effects to individual pollutants in the presence of multicollinearity.
Methods
In this paper, we reviewed five methods available in the statistical literature that are potentially helpful for constructing multipollutant models. We conducted a simulation study and presented two data examples to assess the performance of these methods on feature selection, effect estimation and interaction identification using both crosssectional and timeseries designs. We also proposed and evaluated a twostep strategy employing an initial screening by a treebased method followed by further dimension reduction/variable selection by the aforementioned five approaches at the second step.
Results
Among the five methods, least absolute shrinkage and selection operator regression performs well in general for identifying important exposures, but will yield biased estimates and slightly larger model dimension given many correlated candidate exposures and modest sample size. Bayesian model averaging, and supervised principal component analysis are also useful in variable selection when there is a moderately strong exposureresponse association. Substantial improvements on reducing model dimension and identifying important variables have been observed for all the five statistical methods using the twostep modeling strategy when the number of candidate variables is large.
Conclusions
There is no uniform dominance of one method across all simulation scenarios and all criteria. The performances differ according to the nature of the response variable, the sample size, the number of pollutants involved, and the strength of exposureresponse association/interaction. However, the twostep modeling strategy proposed here is potentially applicable under a multipollutant framework with many covariates by taking advantage of both the screening feature of an initial treebased method and dimension reduction/variable selection property of the subsequent method. The choice of the method should also depend on the goal of the study: risk prediction, effect estimation or screening for important predictors and their interactions.
Background
As public awareness of consequences of environmental exposures has grown, estimating the adverse health effects due to simultaneous exposure to multiple pollutants is currently an important topic to explore [1]. It can provide insight into understanding the biological mechanisms of pollutant toxicity and guide regulatory standards for public health [2, 3]. To date, numerous studies have examined a wide range of health impacts from exposure to ambient pollutants, with positive evidences on elevated allcause mortality [4, 5], impaired cardiac function [6], adverse cardiovascular events [7, 8], propensity for diabetes mellitus [9], raised incidence of respiratory symptoms in children [10, 11], reduced lung function [12, 13], preterm delivery and low birth weight [14, 15] and increased cancer risk [16, 17].
In air pollution epidemiology, traditionally, health risks were assessed by estimated effects of one or several monitored pollutants using singlepollutant models. For instance, fine particulate matter (PM_{2.5}) has been one of the most frequently studied pollutants [6, 7, 10, 14, 16, 18–20]. Other criteria air pollutants such as carbon monoxide (CO), sulfur dioxide (SO_{2}), nitrogen dioxide (NO_{2}) and ozone (O_{3}) also have been implicated in adverse effects on health [14, 18, 19]. However, the reliability of using single pollutant as the surrogate for air pollution is inadequate because air pollution is a mixture of many different gases, vapors and particles, with varying concentration and composition depending on the geographic regions and meteorological conditions [21]. To capture the complex nature of environmental exposure as a whole, there is a need to move from a singlepollutant to a multipollutant approach, as recommended by the National Research Council [1, 21–24].
Some of the major challenges in evaluating the health impacts of environmental exposures in a multipollutant model are (1) to identify the specific components of pollution mixture that are most critical to the outcome of interest, when a large number of exposures are observed; (2) to examine the potential for interaction effects among pollutants, given the evidence that the impact of combined exposures may differ from the sum of the impacts from singlepollutant models [25]; and (3) to efficiently attribute health effects to individual pollutants in the presence of multicollinearity. Pollutants are often correlated due to the temporal and spatial pattern of emissions, and the strong influence of common meteorology. The various common constituents of the atmosphere and their chemical reactions across a region can induce this temporal and spatial correlation [26]. Additionally, multipollutant models place higher demands on the completeness and quality of the necessary exposure data, which has been a longstanding issue in environmental epidemiology [21]. While very important, the issue of exposure assessment and areal integration/imputation of possibly misaligned exposure data is beyond the scope of our coverage in the present paper.
Often, surrogates or proxy summary measures of the total environmental exposure burden have been used to explore the deleterious effects of pollution mixture. A simple way is to represent each emission source by a single pollutant, for instance, NO_{2} or CO for vehicular traffic, PM_{10} for traffic and other combustion sources, and SO_{2} as an indicator of power plant emissions. A second approach involves replacing ambient concentrations of pollutants with an indicator variable for the emission source. Factors such as living in close proximity to major roads or around roads with certain traffic intensity have been used to assess health risks [16]. A third strategy divides pollutants into groups by physicalchemical characteristics or biological reactivity. For example, the separation of fine (PM_{2.5}) and coarse particles (PM_{102.5}) based on size distribution has sharpened our understanding of where inhaled sizespecific particulate matters come to rest in human body and how they lead to different biological symptoms [6, 12]. A fourth strategy uses source apportionment analysis or “receptor modeling”, a widely used technique that distinguishes emission sources, especially for PM, by apportionment using profiles and source impacts [27]. A few epidemiological studies have used source apportionment data as explanatory variables in their examinations of PM health effects [28, 29]. Lastly, composite metrics, such as the Air Quality Index (AQI) proposed by the U.S. Environmental Protection Agency (EPA), might be used to explore health impacts of air pollution. The AQI provides an integrated value of air quality health impacts from five air pollutants (O_{3}, PM, CO, SO_{2}, and NO_{2}) regulated by the Clean Air Act, and is categorized into six levels labeled as “good”, “moderate”, “unhealthy for sensitive groups”, “unhealthy”, “very unhealthy” and “hazardous”, each accompanied with difference health advices [30]. While the strategies above could be easily implemented, overly simplified conversion to surrogate categories may lead to loss of information, and the resulting estimates corresponding to surrogates or proxies have distinct interpretations than the ones corresponding to individual exposures. More importantly, synergistic effects among pollutants become not identifiable by substituting with the composite or mixture measures.
To address potential synergistic effects among air pollutants, multivariate models with main effects for each pollutant and interaction effects for each pair of pollutants has been suggested [1, 21]. However, it is well known that the statistical power for detecting a significant twoway interaction is low unless there is a strong measurable interaction or an adequately large sample size, and this reduction in power is even greater for higher order interactions. Consequently, it may not be feasible to detect all pairwise (and higher order) interactions given a large set of pollutants. When a synergy among multiple pollutants is suspected, treebased regression methods such as classification and regression tree (CART) can be used to explore multiple potentially nonlinear and hierarchical interactions [31]. While CART is well suited for identifying important main effects and interactions, it does not lead to quantitatively meaningful effect estimates. An alternative to quantify health effects with complex interactions is the deletion/substitution/addition (DSA) algorithm by Sinisi and van der Laan [32]. This algorithm was proposed to explore interactions in highdimensional genomic data, and was later adapted to the environmental context to overcome the issue of a large number of correlated predictors [33]. One advantage of this algorithm is its flexibility of imposing customized constraints on model size, order of interactions, and polynomial functions of exposures to be included in the model. In addition, its model selection process is more aggressive than other automated selections as it does not enforce the restriction of nested structure on the explored class of models. However, the DSA approach has been criticized because its estimates are not consistent if the ratio of sample size to the number of candidate predictors is small, and the associated confidence intervals have poor statistical properties when substantial correlation exists among predictors [34].
To identify risk factors that have the most explanatory power in the presence of correlation, a natural route is to apply some dimension reduction technique. Several standard methods, such as factor analysis or principal component analysis (PCA), have been implemented to analyze the effects of multiple pollutants [35–37]. PCA derives orthogonal components from the set of exposure variables by making full use of the variability in the data by using the eigenvalues and eigenvectors of the exposure correlation matrix. A resultant “eigenexposure” can be used in subsequent analysis. A danger in this analysis is that the relationship between exposures and response variables is not accounted for in the generation of the principal components. A modified version of PCA, named supervised principal component analysis (SPCA), may go beyond this limit [38]. By a prescreen of eliminating pollutants not associated with the outcome, SPCA returns effect estimates with smaller bias than corresponding estimates from PCA. However, after the prescreen step, the limitation of not accounting for the multivariate exposureresponse associations in PCA still applies to SPCA, and it may suffer from model misspecification. Partial leastsquares regression (PLSR) is another extension of PCA, in which an optimum subset of predictors is found that is also relevant for responsepredictor relationships [39–41]. Despite its popularity in engineering and machine learning community, PLSR has been seldom used in environmental health studies. One disadvantage of all PCAbased techniques is that since each principal component is a linear combination of multiple pollutants, estimated coefficients do not have direct quantitative interpretation.
While PCA offers dimension reduction in constructing summary exposure features, shrinkagebased regression, e.g., least absolute shrinkage and selection operator (LASSO) or ridge regression, may be helpful in dealing with large dimensional models through variable selection [42]. Due to the nature of the linear constraint, LASSO regression differs from ridge regression in that exact shrinkage of coefficients to zero is feasible, and hence it incorporates variable selection [43]. Unfortunately, shrinkagebased methods are unable to provide unbiased effect estimates, and some of them (i.e., ridge regression) do not perform variable selection.
A further statistical difficulty involves incorporating model uncertainty in the effect estimates. Given a long list of explanatory variables, the usual practice is to present results from a single model selected from a series of hypothesis testing procedures while ignoring other plausible models. Skepticism regarding this philosophy that a single model can serve as a representative of “the true model” in a given dataset with moderate sample sizes prevails [44]. To address this problem of selecting “the” model, Bayesian model averaging (BMA) averages the effect estimates across all possible models weighted by the model posterior probabilities [45, 46]. However, there are differences in opinion associated with the interpretation of effect estimates from BMA in the presence of extensive collinearity or under a large model space [46].
From the above discussion it is clear that the choice of the method has to be guided by the goal of the study: screening for associations, effect estimation or risk prediction; it also has to be governed by the size of the dataset as well as the dimensionality of the potential set of predictors. There is a paucity of literature reviewing the operating characteristics of existing statistical models and comparing their strengths and weaknesses in a multipollutant context. Billionnet et al. recently presented an excellent review of statistical methods used or potentially applicable to this problem [47]. However, no quantified assessment of the relative performances of different methods or numerical results from simulation studies was provided in that review. In this paper, we conduct an extensive simulation study for continuous outcomes from a crosssectional study as well as disease counts from a timeseries study, and present two data examples to assess the performances of the five statistical methods: DSA, SPCA, LASSO, PLSR and BMA. In addition, a twostep modeling strategy employing an initial screening by CART in combination with each of the aforementioned five methods is evaluated when there is a long list of candidate pollutants plus interactions to consider. We examine the performance of each approach with a focus on feature selection, effect estimation and detection of main effects and interactions. Other aspects such as efficiency of the estimates, stability of the results, flexibility of implementation and difficulty in interpretation are discussed as well. This comprehensive comparison will be helpful for making informed decisions on which analytic approach to choose with different data structures and study designs. We also provide annotated Rcodes for implementation of all the methods that may serve as a useful resource to the practitioner (see Additional file 1).
Methods
Statistical methods
In this section, we describe how each method is implemented and what are the advantages and disadvantages of using each one. All statistical analyses can be conducted using R software (version 2.13.0), and packages corresponding to different methods will be introduced.
Notations
Let us consider a regression problem with a continuous response Y and potential predictors X (different pollutants or source components in a single pollutant in the context of a multipollutant study):
where β _{0} refers to the intercept, β _{ p } is the coefficient for predictor Z _{ p } (can be zero in the true model), the random error variable ϵ follows a standard normal distribution, X is a set of candidate pollutants with size K, and Z is the entire set of potential predictors: for this paper we consider all the main effect terms and pairwise interactions among candidate pollutants in X, hence the dimension of Z is $P=K+\left({}_{2}^{K}\right)$.
Classification and regression tree (CART)
The algorithm of CART involves recursively partitioning observations until reduction in the variability of the outcome is maximized [48]. The construction of a decision tree involves three steps [31, 43]: (1) Recursive data partitioning: at each partitioning step, the algorithm examines all split points on every explanatory variable in X, and chooses the best pair of split points and splitting variable in terms of the minimum sum of squared deviations from the mean responses in the resultant two subgroups. This partitioning is repeated on each of the new subgroups. In this process, no assumption of linear relationship has been assumed, so complex interactions and nonlinear effects may be captured. (2) Construction of the preliminary tree: in order to capture the important structure of the data, a large preliminary tree is preferred and the binary splitting is stopped when a minimum node size is reached or when further split does not improve the overall fit significantly. (3) Pruning the decision tree: to overcome the problem of overfitting, the preliminary tree is pruned by its predictive ability, for example, minimizing the tenfold crossvalidated error. This recursive partitioning of the regression trees can be implemented by rpart package [48].
Unlike traditional regression models, a treebased method has several attractive properties: it is less sensitive to outliers, it requires no distributional assumption or data transformation, it is adaptable to complicated interactions among a large pool of predictors, the results are visually intuitive, and the prediction rule is easy to follow [38, 41]. However, its application is restricted by the fact that quantified risk and effect estimates corresponding to the predictors cannot be obtained directly.
Deletion/Substitution/Addition (DSA)
As a novel model selection approach, the implementation of DSA algorithm can be divided into three steps [32, 49]: (1) Construct the whole model space as linear combinations of basis functions under userspecified constraints, where the choices of basis functions of candidate predictors are determined by the maximum order of interactions and maximum sum of powers (e.g. terms of the nature ${X}_{1}^{i}{X}_{2}^{j}$, with a constraint on i+j), and the model size also has a specified maximal value. (2) Starting from an intercept model, the DSA algorithm searches the model space by making deletion, substitution and addition moves repeatedly until the model size exceeds the specified maximum value. Given the current best model of size p=3, say (e.g., X _{1} + X _{2} + X _{1} X _{2}), a deletion move produces a deletion set of models with size p1 by deleting an existing term from the current model (e.g., X _{1} + X _{2}), finds a model with the minimum objective function within this deletion set, and updates the best model of size p1 if this minimum is less than previously saved minimum of size p1. The objective function chosen depends on the type of response considered and corresponds to sum of squared residuals in linear regression. The substitution and addition moves are performed similarly, with the substitution set containing models of the same size p where an existing term is replaced by a new term (e.g., X _{1} + X _{2} + X _{1} X _{3}) and the addition set containing models of size p+1 by adding a new term to the current model (e.g., X _{1} + X _{2} + X _{3} + X _{1} X _{2}). (3) After the optimal model for each model size is identified, the final model with its corresponding predictors is selected based on crossvalidation. This flexible model selection approach can be implemented by modelUtils and DSA package jointly [50].
Considering its original motivation of detecting transcription factor binding sites for the analysis of genomic data, the DSA algorithm was developed to enable an exhaustive search over the entire covariate space, which includes complex interactions and nonlinear terms of predictors, a feature that is likely to be useful in multipollutant studies. Another attraction of this algorithm is the adoption of the deletion, substitution and addition moves. Unlike automatic model selection such as backward or stepwise procedures which depend on tests for nested models, DSA allows for the flexibility of deleting, replacing or adding terms at each move, thus forcing the search to be more exhaustive. Additionally, the use of crossvalidation in the algorithm ensures the selected model being less sensitive to outliers and has good predictive ability [33].
Supervised principal component analysis (SPCA)
Acknowledging that conventional PCA only maximizes the variance explained by linear combinations of the predictor variables, SPCA was proposed to take into account the relationship between predictors and response variables in the dimension reduction process [51]. The benefit of SPCA as a feature selection tool becomes apparent when the covariate space grows, especially under extreme conditions where the number of covariates exceeds the number of observations, the wellknown P>N situation.
Roberts and Martin further refined the SPCA method to make it suitable for multipollutant timeseries studies [38]. For describing the method, we assume the regression dataset comes from a crosssectional study with continuous responses as described in Equation 1. The implementation of SPCA can be achieved as follows: (1) Sort Wald’s statistics from univariate models for all individual and interaction terms in a descending order. (2) Choose a reduced matrix Z’ selecting S covariates with absolute values of Wald’s statistics larger than a threshold ζ, where the optimal choice of ζ is determined by minimizing the prediction error of the corresponding multivariate model via 10fold crossvalidation. (3) Compute the first q (q<S) principal components of the reduced matrix Z’. Typically the first few principal components capture most of the variability in the covariate space, and in some cases it may be enough to use only the first principal component. (4) Refit the multipollutant model based on the linear combinations of S predictors retained in the reduced matrix Z’. As a result, the fitted model for SPCA can be explicitly expressed as:
where Q _{ i } denotes the ith principal component from the reduced matrix Z’, b _{ i } is the specified effect of the ith principal component, and α _{ is } is the loading factor of the sth predictor in matrix Z’ for the ith principal component. The superpc package provides SPCA analysis with continuous outcome [52]. Modified R codes as in Roberts and Martin [38] were used for implementation of SPCA in timeseries studies with count data since no statistical packages are readily available.
By excluding covariates not strongly related to the response variables, SPCA reduces the bias of effect estimates in comparison with conventional PCA [38]. However, there are concerns on the loadings for predictors in the reduced matrix since they are computed without consideration of their multivariate associations with outcomes. Moreover, the interpretation of estimated coefficients from SPCA is difficult especially when more than one principal components of the reduced matrix are included.
Least absolute shrinkage and selection operator (LASSO)
Shrinkagebased regression methods such as LASSO were developed to address the problem of fitting a regression model when the number of predictors is large compared to the sample size. By imposing a penalty on the size of the regression coefficients, they are expected to perform a biasvariance tradeoff with the price of sacrificing unbiasedness of estimates for more precision [53]. The LASSO estimates are defined by minimizing the sum of squared errors with a bound on the sum of the absolute values of the coefficient estimates [43]:
where N denotes the sample size and λ is the tuning parameter that controls the amount of shrinkage. Due to the L _{ 1 } penalty term, computation of the LASSO solutions is a quadratic programming problem and there is no closed form expression for estimates obtained from LASSO [53]. An efficient algorithm for all possible LASSO estimates using modified least angle regression (LAR) was proposed by Efron et al. [43, 54]. Briefly, least angle regression iteratively builds models by including the predictor with the highest correlation to the current estimated residual. At each step, LASSO solutions are computed for a grid of shrinkage parameters, starting from zero to the least squares fit, and the optimal λ is selected by the minimum crossvalidated root mean squared error. As a result, LASSO coefficient path for all predictors is constructed. The lars package was developed to implement this algorithm. Many existing prediction loss criteria, for example, the Mallows’ Cp statistic, can be used for selecting the optimal dimension of the LASSO model.
Despite a small bias introduced in the coefficients, LASSO regression has desirable analytical properties [55]. And by imposing a linear constraint, LASSO regression can shrink coefficients to exactly zero, a feature similar to variable selection that is not shared in ridge regression [56]. However, correct statistical inference following LASSO remains a challenging problem.
Partial leastsquares regression (PLSR)
PLSR is useful in constructing predictive models with highdimensional covariates. Compared to PCA, PLSR often requires fewer components to achieve the same prediction error as it captures information in the predictors as well as the relationships between predictors and response variables [39, 41]. There are several PLSR algorithms developed, among which a projection kernel based algorithm is used as the default method in the pls package in R and can be described through the following steps [40]: (1) Compute the first eigenvector of kernel matrices Z ^{T} YY ^{T} Z and Y ^{T} ZZ ^{T} Y, referred as weight vectors l _{ 1 } and h _{ 1 }, respectively, and normalize both vectors so that Zl _{ 1 } = Yh _{ 1 }= 1. (2) Project the matrix of covariates on its weights to get the zscores f _{ 1 } = Zl _{ 1 }. (3) Calculate loadings for covariates by ordinary least square (OLS) regression ${r}_{1}^{T}={\left({f}_{1}^{T}{f}_{1}\right)}^{1}{f}_{1}^{T}Z={f}_{1}^{T}Z={l}_{1}^{T}{Z}^{T}Z$. (4) Deflate the matrix of covariates ${Z}_{1}=Z{f}_{1}{r}_{1}^{T}$. (5) Repeat steps 1–4 multiple times until all PLS components are determined. Thus, the PLSR coefficients can be estimated by $\widehat{\beta}=L{\left({L}^{T}{Z}^{T}\mathit{ZL}\right)}^{1}{H}^{T}$, where L and H denote matrices with the weight vectors l _{ 1 } and h _{ 1 } as their columns. The optimal number of components is chosen empirically by plotting the crossvalidated predictive residual error sum of squares (PRESS) as a function of the number of components, and selecting the number of components that yields the first local minimum in PRESS statistic or by a predefined threshold on the sequential increments of the PRESS statistic.
A major limitation of the PLSR analysis is that as a PCAbased method, it is not always optimal in screening or removing predictors that have no association with the response, since its components are computed as linear combinations of all predictors. However, it still serves as a powerful dimension reduction and noise removal tool with the focus on prediction.
Bayesian model averaging (BMA)
As an effective approach to deal with model uncertainty, BMA provides robust estimation of parameters by model averaging. Suppose Δ is the parameter of interest (coefficient corresponding to a particular predictor in Z, say), then its posterior distribution given all observations (Y, Z) can be expressed as an average of the posterior distributions of parameter Δ under each of the models considered, weighted by their posterior model probabilities [57, 58]:
where M _{ 1 },…, M _{ D } (D ≤ 2^{P}) denote the models considered, and the posterior probability of model M _{ d }, according to Bayes’ rule, is given by:
in which P(M _{ d }) denotes the prior probability of model M _{ d } being the true model, and the marginal likelihood of model M _{ d } can be computed by integrating the likelihood function P(Yθ _{ d }, M _{ d }, Z) over the prior probability of modelspecific parameter vector ${\theta}_{d}=\left({\beta}_{d},{\sigma}_{d}^{2}\right)$, say.
The precise expression for the posterior mean and variance of Δ can be derived by employing the rule of conditional expectation and the law of total variance, respectively [59, 60].
Estimation in BMA consists of three steps: (1) Choose the prior probability for each model and prior density of parameters in each model. (2) Compute the marginal likelihood and posterior model probability for each model considered. (3) Derive the weighted average posterior distribution of the parameters of interest, and estimate its associated mean and variance. It should be noted that during the implementation of BMA by bicreg function in the bma package, enumeration of all models will be computationally too expensive if the number of candidate predictors exceeds 30, and a preliminary model selection will be conducted by default, before the model averaging [61–63].
Compared to conventional modeling methods which ignore model uncertainty, BMA is attractive in that it does not select for a single “best” model and it makes inferences by averaging over a range of possible models. BMAbased confidence intervals are wellcalibrated by taking account of both sampling variation within models and betweenmodel uncertainty. One concern for the BMA analysis is the potential for large variance of estimates produced in the presence of extensive multicollinearity [46, 64].
Simulation studies
Health effects of environmental exposures have been examined widely in crosssectional, timeseries, cohort and case–control studies. In general, crosssectional studies relate continuous (e.g. blood pressure, heart rate variability) or binary (e.g. acute asthma attacks) variables to exposures at a single time point [33, 65], while timeseries analyses associate the number of events (e.g. deaths, hospitalizations, or emergency visits) with changes in daily ambient level of pollutants/exposures [10]. In order to evaluate the performances of different model building/dimension reduction strategies, exposureresponse relationships were simulated in this paper under two settings: continuous outcome from a crosssectional study and daily event counts from a timeseries study. Binary outcomes from a crosssectional/cohort/case–control study can be analyzed in a similar fashion as continuous outcomes using a logit link function. Two hypothetical sets of candidate pollutants with different sizes (K=4 and K=10 or 20) were used to assess the effectiveness of statistical methods in terms of feature selection and estimation of regression coefficients involved in the health risk models. Therefore, a combination of four scenarios each with different subsets of nonzero main effects and interactions was investigated under each strategy: Scenarios 1 and 2 examine continuous responses in a crosssectional study with 4 and 20 pollutants, respectively, and Scenarios 3 and 4 examine daily counts from a timeseries study with 4 and 10 pollutants, respectively.
Simulation settings for crosssectional studies
Lognormal distribution is an empirically justified density for many pollutant concentration levels [66]. A multivariate structure with pairwise correlations instead of mutual independence allows for a better capture of multipollutant properties. Under the crosssectional design with four pollutants, with X=(X _{ 1 }, X _{ 2 }, X _{ 3 }, X _{ 4 }) as a 4pollutant random vector, exposure variables were generated from a multivariate lognormal distribution with the mean μ=E[X]=(1.20, 2.30, 1.89, 1.00) and the covariance matrix $\Sigma =E\left[\left(X\mu \right){\left(X\mu \right)}^{T}\right]=\left({\sigma}_{\mathit{ij}}^{2}\right)$, in which diagonal elements ${\sigma}_{\mathit{ii}}^{2}$ = 1.00, i=1,…,4, and offdiagonal elements ${\sigma}_{12}^{2}$ = 0.52, ${\sigma}_{13}^{2}$ =0.35, ${\sigma}_{14}^{2}$ =0.28, ${\sigma}_{23}^{2}$ =0.57, ${\sigma}_{24}^{2}$ =0.54, and ${\sigma}_{34}^{2}$ =0.41. Here, the choice of distribution parameters was based on observed data on four common pollutants CO, NO_{2}, PM_{2.5} and SO_{2} in the Detroit Asthma Morbidity, Air Quality and Traffic (DAMAT) study, where each of the pollutants were standardized to have a unit variance [10]. Similarly, in the simulation Scenario 2, we use X to denote a 20component vector from a multivariate lognormal distribution with the mean μ=E[X]=(μ _{ 1 },…,μ _{ 20 }), μ _{ i }=1.00, i=1,…,20, and the covariance matrix $\Sigma ={\left(\begin{array}{cc}\hfill {\Sigma}_{1}\hfill & \hfill {0}_{5*15}\hfill \\ \hfill {0}_{15*5}\hfill & \hfill {\Sigma}_{2}\hfill \end{array}\right)}_{20*20}$, where ${\Sigma}_{1}=\left({\sigma}_{\mathit{ij}}^{2}\right)$ with ${\sigma}_{\mathit{ii}}^{2}$ =1.00, ${\sigma}_{\mathit{ij}}^{2}$ =0.20, i=1,…,5, j=1,…,5, i≠j; and Σ_{2} denotes an identity matrix I _{15*15}.
A normal linear regression model was used to generate the continuous outcome Y given X in the crosssectional study. We considered the true generation model to be a standard multiple linear regression model including main effects of a subset of pollutants and some pairwise interactions, as follows:
where ϵ was assumed to be independent and follow a normal distribution ϵ ~ N(0, 3^{2}). Regression coefficients were prespecified: in simulation Scenario 1 (K=4), β _{0} =0.1, β _{2}= β _{3}=0.5, β _{ i }=0, i=1,4, γ _{ 23 }=0.2, γ _{ ij }=0, (i, j)≠(2, 3); and in simulation Scenario 2 (K=20), β _{0}=0.1, β _{1}=β _{2}=β _{6}=β _{7}=0.5, β _{ i }=0, i≠1, 2, 6 or 7, γ _{ 12 }=γ _{ 16 } = γ _{ 67 } =0.2, γ _{ ij }=0, (i, j)≠(1,2), (1,6) or (6,7). This choice of the error distribution and regression coefficients ensures that the ratio of the sum of squares for regression to the total sum of squares R ^{2} was fixed at the level of 0.25. For each simulation scenario, we simulated 1000 datasets of a moderate sample size 250. No confounding factors were taken into account for the purpose of simplicity.
Simulation settings for timeseries studies
Under the timeseries design with four pollutants, the multivariate vector X _{ t } (t=1,…,400) corresponding to daily exposure measurements on a period of 400 days was generated by an autoregressive model depending on previous 10 days. Specifically,
where c _{ i } (i=1,…,4) denotes the constant vector controlling for the seasonal effect of X _{ t }, ϕ _{ kj } is the partial autocorrelation function (PACF) coefficient for the kth pollutant at lag day j, k=1,…,4 and j=1,…,10. Specification of seasonal effects and PACF coefficients were estimated from data in the DAMAT study [6] (see Additional file 2: Table S1). Here, we used the same set of means and covariance matrix corresponding to the four pollutants as in the simulation Scenario 1.
In timeseries studies, daily counts as an outcome would be influenced by a number of factors, such as temporal trend or meteorological variation [35]. To eliminate these confounding factors, we adopted a method previously used to generate realistic timeseries count data [38, 67, 68]. First, we fitted a Poisson regression with a loglinear link to the actual DAMAT data.
In this model, ω _{ t } refers to the expected number of asthma events on day t, the list of covariates includes season, day of the week (DOW), daily average relative humidity (RH), smooth function Spline(.) of daily average temperature (Temp) and day of the study (t). Next, the estimated counts of asthma events ${\widehat{\omega}}_{\mathrm{t}}$ were extracted and added as an offset to a prespecified exposureresponse relationship in a second Poisson regression:
where ψ _{t} refers to the timevarying mean counts (after removing the effect of the confounding factors), main and interaction effects of four pollutants were specified as β _{1}=β _{3}=0.3, β _{i}=0, i≠1 or 3, γ _{ 13 }=0.1, γ _{ ij }=0, (i, j)≠(1, 3). Timeseries asthma counts of length 400 days representing an average number of 13.3 counts/day were then generated from Poisson distributions with mean ψ _{t}.
Timeseries counts for simulation Scenario 4 were generated in a similar fashion, but assuming a tenpollutant multivariate structure with mean μ=(μ _{ 1 },…,μ _{ 10 }), μ _{ i }=1.00, i=1,…,10 and covariance matrix $\Sigma ={\left(\begin{array}{cc}\hfill {\Sigma}_{1}\hfill & \hfill {0}_{4*6}\hfill \\ \hfill {0}_{6*4}\hfill & \hfill {\Sigma}_{2}\hfill \end{array}\right)}_{10*10}$, where ${\Sigma}_{1}=\left({\sigma}_{\mathit{ij}}^{2}\right)$ and ${\sigma}_{\mathit{ii}}^{2}$ =1.00, i=1,…,4, ${\sigma}_{12}^{2}$ =0.60, ${\sigma}_{13}^{2}$ =0.40, ${\sigma}_{14}^{2}$ =0.20, ${\sigma}_{23}^{2}$ =0.50, ${\sigma}_{24}^{2}$ =0.20, ${\sigma}_{34}^{2}$ =0.10 and Σ_{2} refers to an identity matrix I _{6*6}. Seasonal effects were the same as in the fourpollutant framework, while PACF coefficients were obtained from the DAMAT data under an autoregressive model on previous 5 days for reduced complicity (see Additional file 2: Table S1). Parameters in the second Poisson regression model were specified as β _{1}=β _{3}=β _{6}=β _{9}= 0.2, β _{i}=0, i≠1, 3, 6 or 9, γ _{ 13 } = γ _{ 16 } = 0.1, γ _{ ij }=0, (i, j)≠(1, 3) or (1, 6). Considering the increased number of pollutants involved, daily counts were simulated for 800 days with an average rate of 14.7 events/day. Although some timeseries studies have assessed health effects of exposure to multiple pollutants on longer time scales, relative performances of different statistical methods should remain similar given the sample size we chose.
Data analysis examples
Crosssectional design with quantitative outcome: oxidative stress biomarkers and environmental contaminant exposures in the national health and nutrition examination survey study
In addition to the simulation study, we applied the above methods to data from the National Health and Nutrition Examination Survey (NHANES) collected between 2005 and 2008. NHANES is an ongoing crosssectional study designed to measure subject exposure to various environmental chemicals, dietary intake patterns, and various health outcomes [69]. Our previous studies indicated several associations between urinary phthalate metabolites and serum markers of oxidative stress in a large human population [70, 71]. As a followup, this analysis examines the same association when phthalate exposure occurs in conjunction with exposure to other environmental contaminants that may also be capable of causing an oxidative stress response.
In this combined dataset, we included subjects aged 12 and up with complete data on all exposures, outcomes, and covariates, containing age, ethnicity, and poverty income ratio used in the sampling process, and gender, body mass index, serum cotinine, and urinary creatinine considered to be correlated with the outcome, which resulted in a final sample size of 3,773. The population distribution by covariates is presented in Additional file 2: Table S2 and a total of 25 exposures are categorized into 4 groups. The outcomes of interest were bilirubin and gammaglutamyl transferase (GGT) which are believed to be systemic markers of oxidative stress. A conventional logtransformation was applied to all the exposures and outcomes. This study design corresponds to Scenario 2 in our simulation: crosssectional data with continuous outcome and larger number of potential predictors. We did not incorporate the survey weights from NHANES in our analysis as many R packages for our methods of variable selection do not allow for adjustment for survey design. However, models were adjusted for some characteristics used in the creation of the survey weights, including age, ethnicity, and poverty income ratio.
The procedure to conduct our analysis can be summarized into six steps: (1) We first regressed the outcome (bilirubin or GGT) on the covariates listed above and used the residuals from the regression model as the response for model selection in the following steps. (2) We performed correlation analysis to assess the presence of collinearity within each group of environmental exposures (See Additional file 2: Table S3). When several exposures in the same group were highly correlated (Pearson correlation coefficient > 0.60), only the one with the smallest pvalue in the singleexposure regression model was retained (see Additional file 2: Table S4). (3) We applied CART with the criteria of minimizing the crossvalidated error to the reduced subset of exposures after the correlation analysis. Among a reduced number of 18 candidate exposures from Step 2, seven were selected in the construction of the regression tree for the outcome bilirubin, and eight were selected for the outcome GGT (See Additional file 2: Figure S5). (4) We applied our methods to the main effects and all pairwise interactions of individual exposures selected by CART, including BMA, LASSO, SPCA and DSA. (5) All predictors identified by different methods were incorporated into a bigger omnibus linear regression model, where the outcome was bilirubin or GGT and all covariates were controlled for. (6) Eventually, proposed models for bilirubin or GGT were constructed by eliminating nonsignificant predictors (pvalue > 0.05) from the omnibus models and exposure effects were estimated using ordinary linear regression. Note that proper inference and testing after model selection is an incredibly challenging problem, and fitting standard models to yield measures of significance postmodel selection obviously leads to inflated Type 1 error rates and overly optimistic results. The results from the presented model should thus be interpreted with this caveat in mind.
Timeseries design with count data: asthma morbidly and ambient air pollutants in the detroit asthma morbidity, air quality and traffic study
We applied the proposed methods to the Detroit Asthma Morbidity, Air Quality and Traffic (DAMAT) study. Li et al. (2011) analyzed timeseries count data from 2004–2006 on asthma morbidity in the pediatric Medicaid population of Detroit, Michigan. Concentrations of pollutants PM_{2.5}, CO, NO_{2} and SO_{2} were examined for potential associations. Statistically significant associations at 5day lag and 3 and 5day moving averages of SO_{2} and PM_{2.5} concentrations were observed with asthma emergency department visits and hospitalizations using singlepollutant models [10]. We now extend our scope to multipollutant models exploring potential interactions to address the same research question.
Daily asthma events were identified from emergency department visits and hospitalizations for the Detroit Medicaidinsured population, and further restricted to children 2–18 years of age due to the difficulty of asthma diagnosis for children under 2. A total of 12,933 asthma events were observed from 7,063 children during the 1,096 days, representing an average rate of 11.8 events per day [10]. Daily measurements of CO, NO_{2}, SO_{2} and PM_{2.5}, and meteorological data including temperature and relative humidity, were also obtained for this study period. Previous analyses suggested strong evidence of rise in daily asthma events with increasing 3day lag of concentrations for SO_{2} and PM_{2.5}, so this lag specification of air pollutants was used as exposure variables. The mean 3day lag of air pollutants were 0.43 ppm for CO, 16.8 ppb for NO_{2}, 3.78 ppb for SO_{2}, and 15.0 μg/m^{3} for PM_{2.5}. This study design corresponds to Scenario 3 in our simulation analyses: timeseries data with a count response and four air pollutants measured at 3day lags.
To control for the temporal pattern and weather effects in the timeseries data, we applied the twostage generalized linear model (GLM) as described in the data generation of Scenario 3 in the simulation. The estimated daily counts from the first Poisson regression in (12) were used as an offset in the second Poisson regression model in (13) where the X’ s represent the four exposures. Note that this twostep approach is adopted only due to a logistical inconvenience that some of the model selection tools available in existing statistical software do not allow inclusion of a set of “forced variables” in the model without performing variable selection on this set of confounders. Ideally, one would work with a single model that performs variable selection in the presence of confounders to be adjusted for.
The collection of main effects and pairwise interactions identified to be associated with the outcome within each method were incorporated into an omnibus GLM, with confounding factors such as year, season, DOW, time, RH and Temp adjusted and potential overdispersion considered. Finally, a proposed model with nonsignificant exposure variables eliminated was constructed.
Results
Simulation results
Under each simulation configuration, we generated 1000 datasets. For each examined study design, we present the average of estimated effects with their empirical standard errors for nonzero predictors, the percentage of models correctly identifying the nonzero coefficients, and the average model size over 1000 replicates for five different statistical methods (BMA, DSA, LASSO, PLSR and SPCA). We also considered additional measures, including false positive rate (FPR), true positive rate (TPR), and mean squared error (MSE) for the coefficients corresponding to truly null and truly associated predictors, to help evaluate the performances of competing statistical methods as displayed in Additional file 2: Table S6.
For the ease of interpretation, the regression model in SPCA was fit on the first principal component of the reduced matrix Z’, and accordingly the estimated effects of predictors highly associated with the outcome can be expressed as the products of estimated regression coefficient for the first principal component b_{1} and the loading factors of predictors in the construction of first principal component α_{1s} (i.e. ${\widehat{\beta}}_{s}={b}_{1}{\alpha}_{1s}$ if Z _{ s } ∈ Z '). In BMA analysis, predictors were considered to be identified if their posterior probabilities exceeded 10%, where the cutoff value was chosen in an adhoc manner. In the presence of a long list of candidate pollutants, we also assessed identification and estimation of nonzero coefficients using a twostep modeling strategy, which employs an initial screening by CART followed by further dimension reduction/variable selection by the five methods at the second step. This strategy was not used in the timeseries studies because existing package does not apply CART to count data. The main findings of the simulation study are summarized as follows.
Crosssectional studies
Table 1 presents simulation results corresponding to the five methods under simulation Scenario 1 (crosssectional design, K=4). Note that LASSO estimates of nonzero coefficients are the least biased and most efficient, with smallest MSE among all methods as displayed in Additional file 2: Table S6. Point estimates of interaction effects from PCAbased approaches such as SPCA and PLSR stay close to the true parameters, whereas substantial bias has been observed in their estimated main effects. Main effects from DSA are overestimated, while in BMA analysis the main effects are underestimated and interaction effect overestimated. These directions of upward or downward bias are not consistent across simulation settings. In PLSR, each component is constructed as a weighted mean of all candidate predictors, so all predictors are used in the regression model with its model size equal to the maximum model size. LASSO regression, SPCA and BMA have high interaction detection rates as assessed by the percentage of correct identification of the interaction term, while DSA algorithm barely identifies interaction with a very low rate of 4.4%. In terms of variable selection, BMA and LASSO regression are preferred with lower FPRs, high TPRs, and average model sizes close to the true one. SPCA tends to select a larger model as represented by its average model size and high FPR.
Table 2(A) displays simulation results obtained from four modeling approaches DSA, LASSO, PLSR and SPCA under simulation Scenario 2. As the number of exposure variables increases from 4 to 20, the total number of candidate predictors (210), including main effects (20) and all pairwise interactions (190), grows in a quadratic manner. In this setting, BMA results are not provided because the total number of explanatory variables exceeds the maximum limit 30 set by bma package, and a stepwise procedure is applied automatically to eliminate the redundant explanatory variables. Among four available approaches, none of these methods appear to be a desirable choice in terms of variable selection because an appreciable number of null predictors are included conservatively, but LASSO regression is much superior to others. Compared to the results from simulation Scenario 1, estimated main effects from LASSO regression are shrunk heavily in order to induce a sparse model and compensate for the instability caused by multicollinearity. Interestingly, estimation for interaction effects by LASSO regression appears to be robust, and its interaction detection rate remains high. DSA algorithm also selects small models, yet it has a very limited ability of identifying interactions. SPCA has high detection rates for all nonzero coefficients, but the use of this method is restricted by its considerably large FPR and average model size. The MSE of null predictors within PCAbased methods (i.e. SPCA and PLSR) has been found much smaller and this may relate to the weak associations between null predictors and the outcome considered in the construction of principal components (Additional file 2: Table S6).
CART is widely used for detecting complex effects among a large number of explanatory variables, so as an alternative strategy we decided to apply CART to the complete set of candidate predictors prior to the five regression methods. The simulation results from this twostep strategy are given in Table 2(B). In the initial step of construction of the decision tree with the minimum crossvalidated error, 4 pollutants are correctly identified to be associated with the outcome over 1000 datasets with percentages greater than 60% (X _{ 1 }: 86.8%, X _{ 2 }: 70.8%, X _{ 6 }: 83.7%, and X _{ 7 }: 61.6%), whereas other pollutants less than 13%. This result suggests that as an exploratory procedure for variable selection, CART is able to detect important variables and reduce the number of variables for regression model. Compared to the simulation results in Table 2(A), there are substantial improvements on the variable selection for all approaches, as reflected by the appreciable reduction in average fitted model size and FPR. BMA analysis becomes feasible after initial screening by CART, and they perform similarly as in simulation Scenario 1. The bias of estimates of nonzero coefficients and the MSE of null predictors in LASSO regression are significantly reduced as expected, whereas estimates with their empirical standard errors and interaction detection rates in DSA algorithm do not change appreciably.
Timeseries studies
Table 3 compares performances of regression models under a timeseries design with count data as response and 4 individual pollutants and their pairwise interactions as candidate variables. The implementation of the DSA algorithm is not available for timeseries data. In contrast to the simulated crosssectional studies with dominance of a few methods, BMA, LASSO and PLSR all appear to perform quite reasonably in terms of estimation of risk ratio coefficients as presented by their small biases and MSEs in Additional file 2: Table S6. Among four approaches, LASSO regression with its 100 percent detection rate for all individual and interaction terms provides least biased and most efficient estimates. BMA has a comparable performance as LASSO in terms of variable selection but produces less efficient estimates as displayed by larger empirical standard errors and MSEs. PLSR also provides estimates with small bias but does not have the feature of variable selection. In spite of the inconsistent estimates, the high interaction detection rate in SPCA indicates its potential value under this simulation scenario.
A large number of candidate predictors in a timeseries setting can be computationally intensive, so we simulated daily counts with 10 air pollutants and present results regarding relative performances of four approaches in Table 4. With an increased sample size, BMA becomes favorable in terms of effect estimation and variable selection. In comparison, LASSO regression yields less parsimonious models, but provides the smallest MSEs for both null and associated predictors. The estimated main effects by PCAbased methods PLSR and SPCA are not very informative, yet their estimations for interaction terms have small biases and the interaction detection rate in SPCA is high.
Results for data examples
NHANES data on oxidative stress biomarkers and environmental contaminant exposures
Table 5 presents the results of model selection from different methods for the NHANES data. Predictors and their estimated effects from the proposed models are shown in Additional file 2: Table S7. For the outcome bilirubin, where the associations between bilirubin and multiple exposures are moderate, individual exposures methyl paraben (EPAR), mono (2ethyl5oxohexyl) phthalate (MEOHP), perchlorate (P8) and triclosan (TCS) are statistically significant in the omnibus regression model. Among all the cross products selected by different methods, the interaction of TCS and EPAR is the only one significant in the omnibus model, and none of the methods allows a correct identification of this effect except for LASSO regression. Likely due to the strong associations between the outcome GGT and exposures, all individual exposures identified by CART are tested to be significant in the omnibus model. Among significant interactions indicated in this model, cross products of EPAR*propyl paraben (PPAR), EPAR*mono (3carboxypropyl) phthalate (MCPP), and EPAR*monoisobutyl phthalate (MiBP) are commonly selected in the four competing statistical methods, while propyl paraben*2,4,5trichlorophenol (PPAR*2,4,5TCP) and P8*mono(3carboxypropyl) phthalate (MCPP) are selected only once.
Examining results from the NHANES data suggest the usefulness of CART in feature selection. LASSO regression identifies all the significant main effects and interactions in the omnibus models but yields slightly large models. BMA and SPCA detect highly significant interactions (p≤0.001) but would miss a few weaker associations. For instance, among five significant interactions in the omnibus model for outcome GGT, the most significant two or three were typically detected by BMA and SPCA. Similar as in the simulation, DSA performs poorly by failing to detect any interactions either for bilirubin or GGT.
DAMAT data on asthma morbidly and ambient air pollutants
Predictors identified from different methods for DAMAT data are displayed in Table 6, with their estimated effects in the omnibus model shown in Additional file 2: Table S8. Consistent with the simulation study and analysis of NHANES data, LASSO regression chooses the largest model among all methods. Although the performances of BMA and SPCA are slightly different, both of them are able to identify the most significant predictor PM_{2.5} in the proposed model but fail to detect the less significant effect of CO. No interactions were selected in the DAMAT study.
Discussion
This paper provides empirical guidance for selecting methods readily available in statistical packages to assess the health effects of environmental exposures in a multipollutant framework under conventional crosssectional and timeseries studies. A summary chart of all the methods discussed in this paper with corresponding references to the implementation software is presented in Table 7.
From our empirical investigation, we observe that in general, LASSO regression is an appealing approach due to its robustness in estimation of regression coefficients and its power in identifying nonzero coefficients associated with variables of importance under various study designs and parameter settings. By jointly minimizing the sum of squared errors and shrinking some estimated regression coefficients to zero, it seeks a model with accurate estimates in terms of mean squared error properties, especially when the exposureresponse association is weak or multiple significant pollutants exist, as illustrated in the simulation studies and examples. Another feature of LASSO regression worth consideration is its computational efficiency. Given a long list of exposure variables, the LAR algorithm enables the implementation of LASSO regression to be on the same order of the ordinary least squares, which is orders of magnitude more efficient than other methods such as DSA [54]. However, when there is a large pool of correlated candidate variables, LASSO regression tends to select conservative models by including unrelated predictors and impose a strong shrinkage towards zero on estimated regression coefficients.
As a competitive approach addressing the issue of model uncertainty by averaging across models, we find BMA useful in model selection and parameter estimation when the dataset has a high signaltonoise ratio or a sufficiently large sample size, for instance, simulated datasets in Scenarios 3 and 4. When applied to realistic dataset in which the coefficient of determination R ^{2} is small, BMA is likely to produce biased and inefficient estimates with both sampling and misspecification errors considered, as a result, true effects could be masked [45]. Another difficulty in BMA analysis is the selection of prior specifications on individual parameters and model space [57]. In the simulation studies and data analyses, we chose standard noninformative priors for all parameters and assumed all models to be equally likely. However, enumeration of all models causes computational inefficiency when exploring model spaces with more than 30 variables [63]. Therefore, it is encouraged to use informative priors or reduced set of models to improve the predictive accuracy if background knowledge is possible [72]. Furthermore, the idea of averaging over models using different shrinkage weights, approximate the corresponding marginal probabilities by BIC for faster computation are aspects of BMA analysis that deserve more attention.
PCAbased methods, such as SPCA and PLSR, may reduce the impact of multicollinearity, but the bias in their effect estimates is considerable. This is not surprising because there is no guarantee that maximumvariance predictor variables always maximize information from the dependent variable [73]. In our analysis, estimation for interactions was more consistent than for the main effects in PCAbased methods, and this may relate to the larger variability of the interaction terms which leads to larger loading factors in the construction of first few principal components and smaller biases in the estimated effects. PLSR in general has better estimation performance, especially for main effects, because all candidate predictors are considered in the model, while in SPCA individual exposures with nonzero main effects could be eliminated at the prescreen because their univariate associations with the outcome are weak in the presence of nonzero interaction effects. When the property of feature selection is considered, SPCA appears to perform well given its ability of capturing nonzero coefficients in a relatively small model.
It is challenging to detect interaction effects among multiple pollutants using standard/generalized linear regression, so it is expected the DSA algorithm proposed for genomic data could be borrowed to address this problem. Unfortunately, from our analysis, DSA is not successful in identifying interactions either in the NHANES example or in simulated datasets, where the coefficient of determination R ^{2} is low. It is possible that when applied to datasets with strong exposureresponse associations or sufficiently large sample sizes, DSA would be helpful in identifying interactions or producing unbiased effect estimates.
From our analysis, LASSO regression is powerful at capturing nonzero coefficients. However, proper significance testing and inference following LASSO to produce honest pvalues is a nontrivial task. Fitting an ordinary GLM with selected predictors and using the corresponding tests and confidence intervals are obviously incorrect and greedy solutions. Recent research in this domain has the potential to create breakthroughs in statistical inference following model selection [74]. To address the asymptotic problem in coefficient estimation, the adaptive LASSO regression has been proposed by introducing adaptive weights to the L_{1} penalty term [55]. Suppose $\widehat{\beta}$ is a $\sqrt{n}$ consistent estimate to β (e.g., $\widehat{\beta}$ estimated by OLS) and w is a weights vector $w=1/\widehat{\beta}{}^{\gamma}$ for some positive constant γ, the adaptive LASSO estimates are given by:
It has been shown that the adaptive LASSO can enjoy the computational efficiency by performing the LARS algorithm on weighted predictors and achieve both consistency in variable selection and asymptotic normality in coefficient estimation [55]. To control the bias in effect estimation when predictors are highly correlated, an extension of LASSO regression, the group LASSO, could be used [75, 76]. The group LASSO regression has a penalty as an intermediate between the L _{ 1 } and L _{ 2 } penalty with its estimator defined as:
In this equation, $Y\mathit{Z\beta}{}_{2}^{2}={\displaystyle \sum _{i=1}^{N}{\left({Y}_{i}{\beta}_{0}{\displaystyle \sum _{p=1}^{P}{\beta}_{p}{Z}_{\mathit{ip}}}\right)}^{2}}$, λ ≥ 0 is a tuning parameter, ${V}_{g}=\sqrt{{p}_{g}}I$ is a penalty matrix with the number of predictors in the gth group p _{ g } and identity matrix I, and β ^{(g)} is the coefficient vector of the gth group, g = 1,…,G. It has the attractive property of performing variable selection at the group level and is invariant under groupwise orthogonal transformations like ridge regression [75, 77]. Another alternative that we have not explored is the application of the elastic net, a penalized least squares method using a penalty as the a combination of the LASSO and ridge penalty [43, 78]
where $\beta {}_{2}^{2}={\displaystyle \sum _{p=1}^{P}{\beta}_{p}^{2}}$, $\beta {}_{1}={\displaystyle \sum _{p=1}^{P}\left{\beta}_{p}\right}$, λ_{ 1 } and λ _{ 2 } refer to the tuning parameters for L _{ 1 } and L _{ 2 } norm penalty. As a compromise between ridge and lasso regression, the elastic net has the characteristic of both selecting variables like LASSO and shrinking the regression coefficients of correlated predictors like ridge. Therefore, it outperforms LASSO regression by encouraging grouping effects and improving the prediction accuracy when there are high correlations between predictors [43, 78]. Both the group LASSO regression and the elastic net are popular choices when dealing with grouped variables, a data structure matches well with the hierarchical nature of the pollution mixture, and hence deserves further investigation.
As a nonparametric technique, CART is well suited for exploring complex relationships under a multipollutant framework, however, it is limited by the fact that enforcing monotonicity constrains in building a regression tree is not possible. Some ideas have been provided in the construction of monotone classification trees. Potharst and Bioch suggested imposing monotonicity constraints by adding corner elements of nodes to the existing data [79]. Feelders proposed to use resampling to generate many different trees and select the ones that are monotone [80]. Feelders and Pardoel have developed an algorithm that grows a large overfitted tree at an initial step, prunes the tree towards a sequence of monotone subtrees, and then selects the one with the best predictive accuracy [81]. Hopefully, some of these algorithms can be improved and extended to the monotone regression trees, making the CART more attractive in describing exposureresponse relationships.
In this study, we proposed a twostep strategy for estimating health effects when a large number of candidate pollutants exist. As an initial step, we use CART to explore the associations between individual pollutants and the response. At the second step, different methods (BMA, DSA, LASSO, PLSR, and SPCA) are applied to the subset of important pollutants selected by CART. The advantage of this twostep strategy comes from the reduction of some less important dimensions at the first step, and consequently the signal in the dataset carried forward to the second step is boosted. However, in real data analysis, there is possibility that a true effect failed to be identified at the initial step given a weak exposureresponse association will be missed in the regression model. Therefore, the performance of this twostep strategy depends heavily on the initial screen by CART. Compared to univariate associations assessed in the first step of SPCA, CART uses the information from the set of all candidate variables as a whole and examines the complex interactions among exposures and responses, thus we expect this twostep modeling strategy to be adaptable in a multipollutant context.
One limitation of our study is that only linear main effects and linear interaction terms between pollutants were assumed in our simulation and data analyses, however, this assumption may not always be true as evidences of nonlinear relationship, such as threshold effect, polynomial effect or other nonparametric relationship, have been provided in multiple studies [10, 82–84]. Among all the methods we have discussed, LASSO regression and BMA perform well in the current simulation studies but may cause bias in estimation if a misspecified interaction model was adopted in the presence of a truly nonlinear effect; DSA algorithm exhibits limited ability in detecting interactions in our simulations but may perform better in higher order polynomial models; CART may help to identify the threshold values of exposure variables and capture nonlinear and higher order interactions but does not quantify exposure effect estimates. More expanded simulation studies with complex and highdimensional nonlinear exposureresponse surface are warranted. We also acknowledge that although we have examined multiple simulation scenarios in our study, the estimated health effects of exposures in some time‒series studies are much smaller than the effects defined in our simulations [4, 7, 10], thus further assessment of the relative performances of different methods in timeseries analysis with smaller effect size or larger sample size may be of added value. Another major issue in exposure epidemiology, namely, exposure measurement error and varying limits of detection for different exposures was not considered in our study, it has become well recognized that measurement errors at different scales may bias the estimation of regression coefficients in a multipolluant model, for instance, effects of pollutants measured with larger errors can be transferred to other correlated pollutants with less errors [85]. Researchers have taken steps to address the challenges of measurement error in timeseries analyses [85–87], and further exploration is needed to develop methods that take into account the exposure uncertainty. Selection of lags in exposureresponse relationship is another important issue we did not explore. How to incorporate multiple lagged exposures in a highdimensional responseexposure surface remains a problem where no consensus has been reached. Selecting predictors/complex models under a distributed lag structure remains an issue of ongoing research [88, 89]. In many health effects studies, where direct measurements of personal exposure to multiple pollutants are not practical, ambient pollutants concentrations are often used as proxies for personal exposure [6, 13]. However, previous studies suggested that exposure assessment based upon regional monitors does not adequately represent the personal exposure [1, 90–92]. Approaches to estimating personal exposure levels from ambient measurements via spatial modeling has been adopted for some air pollutants (e.g. NO_{2}, PM_{2.5}) [9, 93, 94], but no attempt towards creating a multipollutant spatially varying surface has been reported. Use of surrogates as imperfect measures of environmental exposures or imputation strategies to circumvent incomplete data issues is not studied in this paper.
Conclusions
Among the five methods evaluated for regression analysis, there is no uniform dominance of one method across all examined simulation scenarios and data examples. Assuming that exposure distributions are reasonably approximated by lognormal distributions and the strength of correlations among pollutants is moderate, the performances of competing methods differ according to the nature of the response variable, the sample size, the number of exposure variables involved, and the strength of exposureresponse association. In addition, supported by our results, the twostep modeling strategy proposed in this paper is potentially applicable under a multipollutant framework by taking advantage of both the screening feature of CART and dimension reduction or variable selection property of the subsequent statistical method. Extension of these methods under complex sampling schemes and correlated data with accompanying software packages merits further development.
Characterizing uncertainty in appropriate ways to proceed with tests of significance and construction of valid confidence intervals (often with biased estimates) following the variable selection/shrinkage methods we have discussed is a very important problem that is not fully resolved. This is typically done by advanced bootstrap or resampling strategies that is nontrivial and computation intensive. Bayesian methods have an advantage of providing measures of uncertainty based on the exact draws from a posterior distribution, an idea that translates to complex models. In this paper we have not discussed fully Bayesian methods. Bayesian approaches require a complete and thorough independent study.
Modeling of nonlinear and complex exposure surfaces that are truly high dimensional is a daunting problem, recent attempts toward targeted estimation of parameters that are relevant for policymaking using reduced hierarchical models are extremely noteworthy in this context [95]. After building a complex model, one may attempt to extract information on not just the statistical model parameters, but parameters that are most informative regarding the scientific or policy question of interest. For example, risk ratios corresponding to days when certain pollutants exceed national standards may be more interpretable than the parameters associated with a natural spline term in a nonlinear model [95]. However, any model trying to capture simultaneous coexposure due to multiple pollutants comes with the challenge of sparsity of sample size in crossconfiguration levels of pollutants, thus borrowing strength from prior studies, metaanalysis across studies and the possibility of using variable selection shrinkage methods, smoothing techniques as well as Bayesian hierarchical models offer modern statistical tools to approach this problem. The ultimate goal will be to arrive at a lowdimensional, meaningful representation or summary of a high dimensional modeling problem.
Abbreviations
 PM2.5:

Particulate matter less than 2.5 micrometers in diameter
 PM102.5:

Particulate matter between 2.5 and 10 micrometers in diameter
 CO:

Carbon monoxide
 SO2:

Sulfur dioxide
 NO2:

Nitrogen dioxide
 O3:

Ozone
 BMA:

Bayesian model averaging
 DSA:

Deletion/Substitution/Addition
 LASSO:

Least absolute shrinkage and selection operator
 PLSR:

Partial leastsquare regression
 SPCA:

Supervised principal component analysis
 PACF:

Partial autocorrelation function
 DOW:

Day of the week
 Temp:

Temperature
 RH:

Relative humidity
 NHANES:

National Health and Nutrition Examination Survey
 DAMAT:

Detroit Asthma Morbidity, Air Quality and Traffic
 FPR:

False positive rate
 TPR:

True positive rate
 MSE:

Mean squared error.
References
 1.
Dominici F, Peng RD, Barr CD, Bell ML: Protecting human health from air pollution: shifting from a singlepollutant to a multipollutant approach. Epidemiology. 2010, 21: 187194. 10.1097/EDE.0b013e3181cc86e8.
 2.
Johns DO, Stanek LW, Walker K, Benromdhane S, Hubbell B, Ross M, Devlin RB, Costa DL, Greenbaum DS: Practical advancement of multipollutant scientific and risk assessment approaches for ambient air pollution. Environ Health Perspect. 2012, 120: 12381242. 10.1289/ehp.1204939.
 3.
Brown JS, Graham JA, Chen LC, Postlethwait EM, Ghio AJ, Foster WM, Gordon T: Panel discussion review: session four–assessing biological plausibility of epidemiological findings in air pollution research. J Expo Sci Environ Epidemiol. 2007, 17: S97S105.
 4.
Maynard D, Coull BA, Gryparis A, Schwartz J: Mortality risk associated with shortterm exposure to traffic particles and sulfates. Environ Health Perspect. 2007, 115: 751755. 10.1289/ehp.9537.
 5.
Dockery DW, Pope CA, Xu X, Spengler JD, Ware JH, Fay ME, Ferris BG, Speizer FE: An association between air pollution and mortality in six U.S. cities. N Engl J Med. 1993, 329: 17531759. 10.1056/NEJM199312093292401.
 6.
Zanobetti A, Gold DR, Stone PH, Suh HH, Schwartz J, Coull BA, Speizer FE: Reduction in heart rate variability with traffic and air pollution in patients with coronary artery disease. Environ Health Perspect. 2010, 118: 324330.
 7.
Pope CA, Burnett RT, Krewski D, Jerrett M, Shi Y, Calle EE, Thun MJ: Cardiovascular mortality and exposure to airborne fine particulate matter and cigarette smoke: shape of the exposureresponse relationship. Circulation. 2009, 120: 941948. 10.1161/CIRCULATIONAHA.109.857888.
 8.
Brook RD: Cardiovascular effects of air pollution. Clin Sci (Lond). 2008, 115: 175187. 10.1042/CS20070444.
 9.
Brook RD, Jerrett M, Brook JR, Bard RL, Finkelstein MM: The relationship between diabetes mellitus and trafficrelated air pollution. J Occup Environ Med. 2008, 50: 3238. 10.1097/JOM.0b013e31815dba70.
 10.
Li S, Batterman S, Wasilevich E, Wahl R, Wirth J, Su FC, Mukherjee B: Association of daily asthma emergency department visits and hospital admissions with ambient air pollutants among the pediatric Medicaid population in Detroit: timeseries and timestratified casecrossover analyses with threshold effects. Environ Res. 2011, 111: 11371147. 10.1016/j.envres.2011.06.002.
 11.
Pierse N, Rushton L, Harris RS, Kuehni CE, Silverman M, Grigg J: Locally generated particulate pollution and respiratory symptoms in young children. Thorax. 2006, 61: 216220. 10.1136/thx.2004.036418.
 12.
Schikowski T, Sugiri D, Ranft U, Gehring U, Heinrich J, Wichmann HE, Kramer U: Longterm air pollution exposure and living close to busy roads are associated with COPD in women. Respir Res. 2005, 6: 15210.1186/146599216152.
 13.
Gauderman WJ, Avol E, Gilliland F, Vora H, Thomas D, Berhane K, McConnell R, Kuenzli N, Lurmann F, Rappaport E, Margolis H, Bates D, Peters J: The effect of air pollution on lung development from 10 to 18 years of age. N Engl J Med. 2004, 351: 10571067. 10.1056/NEJMoa040610.
 14.
Brauer M, Lencar C, Tamburic L, Koehoorn M, Demers P, Karr C: A cohort study of trafficrelated air pollution impacts on birth outcomes. Environ Health Perspect. 2008, 116: 680686. 10.1289/ehp.10952.
 15.
Bobak M: Outdoor air pollution, low birth weight, and prematurity. Environ Health Perspect. 2000, 108: 173176. 10.1289/ehp.00108173.
 16.
Beelen R, Hoek G, van den Brandt PA, Goldbohm RA, Fischer P, Schouten LJ, Armstrong B, Brunekreef B: Longterm exposure to trafficrelated air pollution and lung cancer risk. Epidemiology. 2008, 19: 702710. 10.1097/EDE.0b013e318181b3ca.
 17.
Crouse DL, Goldberg MS, Ross NA, Chen H, Labreche F: Postmenopausal breast cancer is associated with exposure to trafficrelated air pollution in Montreal, Canada: a case–control study. Environ Health Perspect. 2010, 118: 15781583. 10.1289/ehp.1002221.
 18.
Wu J, Wilhelm M, Chung J, Ritz B: Comparing exposure assessment methods for trafficrelated air pollution in an adverse pregnancy outcome study. Environ Res. 2011, 111: 685692. 10.1016/j.envres.2011.03.008.
 19.
Venners SA, Wang B, Xu Z, Schlatter Y, Wang L, Xu X: Particulate matter, sulfur dioxide, and daily mortality in Chongqing, China. Environ Health Perspect. 2003, 111: 562567.
 20.
Pope CA: Epidemiology of fine particulate air pollution and human health: biologic mechanisms and who’s at risk?. Environ Health Perspect. 2000, 108: 713723.
 21.
Mauderly JL, Burnett RT, Castillejos M, Ozkaynak H, Samet JM, Stieb DM, Vedal S, Wyzga RE: Is the air pollution health research community prepared to support a multipollutant air quality management framework?. Inhal Toxicol. 2010, 22: 119.
 22.
Sacks JD, Ito K, Wilson WE, Neas LM: Impact of covariate models on the assessment of the air pollutionmortality association in a single and multipollutant context. Am J Epidemiol. 2012, 176: 622634. 10.1093/aje/kws135.
 23.
Vedal S, Kaufman JD: What does multipollutant air pollution research mean?. Am J Respir Crit Care Med. 2011, 183: 46. 10.1164/rccm.2010091520ED.
 24.
Research Priorities for Airborne Particulate Matter, IV. Continuing Research Progress. Edited by: National Research Council. 2004, Washington, DC: National Academy Press
 25.
Mauderly JL, Samet JM: Is there evidence for synergy among air pollutants in causing health effects?. Environ Health Perspect. 2009, 117: 16.
 26.
Mayer H: Air pollution in cities. Atmos Environ. 1999, 33: 40294037. 10.1016/S13522310(99)001442.
 27.
Grahame T, Hidy GM: Pinnacles and pitfalls for source apportionment of potential health effects from airborne particle exposure. Inhal Toxicol. 2007, 19: 727744. 10.1080/08958370701399687.
 28.
Lall R, Ito K, Thurston GD: Distributed lag analyses of daily hospital admissions and sourceapportioned fine particle air pollution. Environ Health Perspect. 2011, 119: 455460.
 29.
Sarnat JA, Marmur A, Klein M, Kim E, Russell AG, Sarnat SE, Mulholland JA, Hopke PK, Tolbert PE: Fine particle sources and cardiorespiratory morbidity: an application of chemical mass balance and factor analytical sourceapportionment methods. Environ Health Perspect. 2008, 116: 459466.
 30.
US EPA: Air Quality Index (AQI)  A guide to air quality and your health. 2009, EPA456/F09002
 31.
Hu W, Mengersen K, McMichael A, Tong S: Temperature, air pollution and total mortality during summers in Sydney, 1994–2004. Int J Biometeorol. 2008, 52: 689696. 10.1007/s0048400801618.
 32.
Sinisi SE, van der Laan MJ: Deletion/substitution/addition algorithm in learning with applications in genomics. Stat Appl Genet Mol Biol. 2004, 3: Article18
 33.
Mortimer K, Neugebauer R, Lurmann F, Alcorn S, Balmes J, Tager I: Air pollution and pulmonary function in asthmatic children: effects of prenatal and lifetime exposures. Epidemiology. 2008, 19: 550557. 10.1097/EDE.0b013e31816a9dcb.
 34.
Dominici F, Wang C, Crainiceanu C, Parmigiani G: Model selection and health effect estimation in environmental epidemiology. Epidemiology. 2008, 19: 558560. 10.1097/EDE.0b013e31817307dc.
 35.
Burnett RT, Brook J, Dann T, Delocla C, Philips O, Cakmak S, Vincent R, Goldberg MS, Krewski D: Association between particulate and gasphase components of urban air pollution and daily mortality in eight Canadian cities. Inhal Toxicol. 2000, 12: 1539.
 36.
Qian Z, Zhang J, Korn LR, Wei F, Chapman RS: Factor analysis of household factors: are they associated with respiratory conditions in Chinese children?. Int J Epidemiol. 2004, 33: 582588. 10.1093/ije/dyg278.
 37.
Arif AA, Shah SM: Association between personal exposure to volatile organic compounds and asthma among US adult population. Int Arch Occup Environ Health. 2007, 80: 711719. 10.1007/s0042000701832.
 38.
Roberts S, Martin MA: Using supervised principal components analysis to assess multiple pollutant effects. Environ Health Perspect. 2006, 114: 18771882.
 39.
Geladi P, Kowalski BR: Partial leastsquares regression  a tutorial. Anal Chim Acta. 1986, 185: 117.
 40.
Hoeskuldsson A: PLS regression methods. J Chemometr. 1988, 2: 211228. 10.1002/cem.1180020306.
 41.
Wold S, Sjostrom M, Eriksson L: PLSregression: a basic tool of chemometrics. Chemom Intell Lab Syst. 2001, 58: 109130. 10.1016/S01697439(01)001551.
 42.
Roberts S, Martin MA: A critical assessment of shrinkagebased regression approaches for estimating the adverse health effects of multiple air pollutants. Atmos Environ. 2005, 39: 62236230. 10.1016/j.atmosenv.2005.07.004.
 43.
The Elements of Statistical Learning. Edited by: Hastie T, Tibshirani R, Friedman J. 2001, New York: SpringerVerlag
 44.
Clyde M: Model uncertainty and health effect studies for particulate matter. Environmetrics. 2000, 11: 745763. 10.1002/1099095X(200011/12)11:6<745::AIDENV431>3.0.CO;2N.
 45.
Koop G, Tole L: Measuring the health effects of air pollution: to what extent can we really say that people are dying from bad air?. J Environ Econ Manage. 2004, 47: 3054. 10.1016/S00950696(03)000755.
 46.
Thomas DC, Jerrett M, Kuenzli N, Louis TA, Dominici F, Zeger S, Schwarz J, Burnett RT, Krewski D, Bates D: Bayesian model averaging in timeseries studies of air pollution and mortality. J Toxicol Environ Health A. 2007, 70: 311315. 10.1080/15287390600884941.
 47.
Billionnet C, Sherrill D, AnnesiMaesano I: Estimating the health effects of exposure to multipollutant mixture. Ann Epidemiol. 2012, 22: 126141. 10.1016/j.annepidem.2011.11.004.
 48.
Classification and Regression Trees. Edited by: Breiman L, Friedman JH, Olshen RA, Stone CJ. 1984, Monterey, CA: Wadsworth and Brooks
 49.
Haight TJ, Wang Y, van der Laan MJ, Tager IB: A crossvalidation deletionsubstitutionaddition model selection algorithm: application to marginal structural models. Comput Stat Data An. 2010, 54: 30803094. 10.1016/j.csda.2010.02.002.
 50.
DataAdaptive Estimation with CrossValidation and the D/S/A Algorithm. [http://www.stat.berkeley.edu/~laan/Software/]
 51.
Bair E, Hastie T, Paul D, Tibshirani R: Prediction by supervised principal components. J Am Stat Assoc. 2006, 101: 1810.1198/016214505000000187.
 52.
Bair E, Tibshirani R: Semisupervised methods to predict patient survival from gene expression data. PLoS Biol. 2004, 2: 511522.
 53.
Tibshirani R: Regression shrinkage and selection via the Lasso. J Roy Stat Soc B Met. 1996, 58: 267288.
 54.
Efron B, Johnstone I, Hastie T, Tibshirani R: Least angle regression. Ann Stat. 2004, 32: 407499. 10.1214/009053604000000067.
 55.
Zou H: The adaptive lasso and its oracle properties. J Am Stat Assoc. 2006, 101: 14181429. 10.1198/016214506000000735.
 56.
Tibshirani R: Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996, 58: 267288.
 57.
Hoeting J, Madigan D, Raftery A, Volinsky C: Bayesian model averaging: a tutorial. Stat Sci. 1999, 14: 382417. 10.1214/ss/1009212519.
 58.
Madigan D, Raftery AE: Model selection and accounting for model uncertainty in graphical models using Occam’s window. J Am Stat Assoc. 1994, 89: 15351546. 10.1080/01621459.1994.10476894.
 59.
Draper D: Hierarchical modeling, variable selection, and utility. 1999, UK: Technical Report, Dept Mathematical Sciences, Univ Bath
 60.
Raftery A: Approximate Bayes factors and accounting from model uncertainty in generalized linear models. Biometrika. 1996, 83: 251266. 10.1093/biomet/83.2.251.
 61.
Raftery AE: Bayesian model selection in social research. Sociol Methodol. 1995, 25: 111163.
 62.
Raftery AE, Madigan D, Hoeting JA: Bayesian model averaging for linear regression models. J Am Stat Assoc. 1997, 92: 179191. 10.1080/01621459.1997.10473615.
 63.
Furnival GM, Wilson RW: Regression by leaps and bounds. Technometrics. 1974, 16: 499511. 10.1080/00401706.1974.10489231.
 64.
Thomas DC, Witte JS, Greenland S: Dissecting effects of complex mixtures: who’s afraid of informative priors?. Epidemiology. 2007, 18: 186190. 10.1097/01.ede.0000254682.47697.70.
 65.
Zhao Z, Zhang Z, Wang Z, Ferm M, Liang Y, Norback D: Asthmatic symptoms among pupils in relation to winter indoor and outdoor air pollution in schools in Taiyuan, China. Environ Health Perspect. 2008, 116: 9097.
 66.
Kahn HD: Note on the distribution of air pollutants. J Air Pollut Control Assoc. 1973, 23: 97310.1080/00022470.1973.10469870.
 67.
Roberts S, Martin MA: The question of nonlinearity in the dose–response relation between particulate matter air pollution and mortality: can Akaike’s Information Criterion be trusted to take the right turn?. Am J Epidemiol. 2006, 164: 12421250. 10.1093/aje/kwj335.
 68.
Roberts S, Switzer P: Mortality displacement and distributed lag models. Inhal Toxicol. 2004, 16: 879888. 10.1080/08958370490519598.
 69.
NCHS (National Center for Health Statistics), National Health and Nutrition Examination Survey: [http://www.cdc.gov/nchs/nhanes/about_nhanes.htm]
 70.
Ferguson KK, LochCaruso R, Meeker JD: Urinary phthalate metabolites in relation to biomarkers of inflammation and oxidative stress: NHANES 1999–2006. Environ Res. 2011, 111: 718726. 10.1016/j.envres.2011.02.002.
 71.
Ferguson KK, LochCaruso R, Meeker JD: Exploration of oxidative stress and inflammatory markers in relation to urinary phthalate metabolites: NHANES 1999–2006. Environ Sci Technol. 2012, 46: 477485. 10.1021/es202340b.
 72.
Spiegelhalter DJ, Dawid AP, Lauritzen SL, Cowell RG: Bayesian analysis in expert systems. Stat Sci. 1993, 8: 219247. 10.1214/ss/1177010888.
 73.
Hadi AS, Ling RF: Some cautionary notes on the use of principal components regression. Am Stat. 1998, 52: 1519.
 74.
Lockhart R, Taylor J, Tibshirani R, Tibshirani R: A significance test for the lasso. 2013,http://arxiv.org/abs/1301.7161,
 75.
Yuan M, Lin Y: Model selection and estimation in regression with grouped variables. J Roy Stat Soc B. 2006, 68: 4967. 10.1111/j.14679868.2005.00532.x.
 76.
Meier L, van de Geer SA, Buhlmann P: The group lasso for logistic regression. J Roy Stat Soc B. 2008, 70: 5371. 10.1111/j.14679868.2007.00627.x.
 77.
Bakin S: PhD Thesis. Adaptive regression and model selection in data mining problems. 1999, Australian National University
 78.
Zou H, Hastie T: Regularization and variable selection via the elastic net. J Roy Stat Soc B. 2005, 67: 301320. 10.1111/j.14679868.2005.00503.x.
 79.
Potharst R, Bioch JC: A decision tree algorithm for ordinal classification. Advances in Intelligent Data Analysis, Proceedings, vol. 1642. Edited by: Hand DJ, Kok JN, Berthold MR. 1999, Berlin: SpringerVerlag Berlin, 187198.
 80.
Feelders AJ: Prior knowledge in economic applications of data mining <D>. Lect Notes Comput Sci. 2000, 1910: 395400. 10.1007/3540453725_42.
 81.
Feelders A, Pardoel M: Pruning for monotone classification trees. Advances in Intelligent Data Analysis V, vol. 2810. Edited by: Berthold MR, Lenz HJ, Bradley E, Kruse R, Borgelt C. 2003, Berlin: SpringerVerlag Berlin, 112.
 82.
Ulm K: A statistical method for assessing a threshold in epidemiological studies. Stat Med. 1991, 10: 341349. 10.1002/sim.4780100306.
 83.
Billionnet C, Gay E, Kirchner S, Leynaert B, AnnesiMaesano I: Quantitative assessments of indoor air pollution and respiratory health in a populationbased sample of French dwellings. Environ Res. 2011, 111: 425434. 10.1016/j.envres.2011.02.008.
 84.
Kelsall JE, Samet JM, Zeger SL, Xu J: Air pollution and mortality in Philadelphia, 1974–1988. Am J Epidemiol. 1997, 146: 750762. 10.1093/oxfordjournals.aje.a009351.
 85.
Zeger SL, Thomas D, Dominici F, Samet JM, Schwartz J, Dockery D, Cohen A: Exposure measurement error in timeseries studies of air pollution: concepts and consequences. Environ Health Perspect. 2000, 108: 419426. 10.1289/ehp.00108419.
 86.
Chang HH, Peng RD, Dominici F: Estimating the acute health effects of coarse particulate matter accounting for exposure measurement error. Biostatistics. 2011, 12: 637652. 10.1093/biostatistics/kxr002.
 87.
Zeka A, Schwartz J: Estimating the independent effects of multiple pollutants in the presence of measurement error: an application of a measurementerrorresistant technique. Environ Health Perspect. 2004, 112: 16861690. 10.1289/ehp.7286.
 88.
Gasparrini A, Armstrong B, Kenward MG: Distributed lag nonlinear models. Stat Med. 2010, 29: 22242234. 10.1002/sim.3940.
 89.
Welty LJ, Peng RD, Zeger SL, Dominici F: Bayesian distributed lag models: estimating effects of particulate matter air pollution on daily mortality. Biometrics. 2009, 65: 282291. 10.1111/j.15410420.2007.01039.x.
 90.
Avery CL, Mills KT, Williams R, McGraw KA, Poole C, Smith RL, Whitsel EA: Estimating error in using residential outdoor PM2.5 concentrations as proxies for personal exposures: a metaanalysis. Environ Health Perspect. 2010, 118: 673678. 10.1289/ehp.0901158.
 91.
Avery CL, Mills KT, Williams R, McGraw KA, Poole C, Smith RL, Whitsel EA: Estimating error in using ambient PM2.5 concentrations as proxies for personal exposures: a review. Epidemiology. 2010, 21: 215223. 10.1097/EDE.0b013e3181cb41f7.
 92.
Brown KW, Sarnat JA, Suh HH, Coull BA, Koutrakis P: Factors influencing relationships between personal and ambient concentrations of gaseous and particulate pollutants. Sci Total Environ. 2009, 407: 37543765. 10.1016/j.scitotenv.2009.02.016.
 93.
Jerrett M, Burnett RT, Ma R, Pope CA, Krewski D, Newbold KB, Thurston G, Shi Y, Finkelstein N, Calle EE, Thun MJ: Spatial analysis of air pollution and mortality in Los Angeles. Epidemiology. 2005, 16: 727736. 10.1097/01.ede.0000181630.15826.7d.
 94.
Molitor J, Jerrett M, Chang CC, Molitor NT, Gauderman J, Berhane K, McConnell R, Lurmann F, Wu J, Winer A, Thomas D: Assessing uncertainty in spatial exposure models for air pollution health effects assessment. Environ Health Perspect. 2007, 115: 11471153. 10.1289/ehp.9849.
 95.
Bobb JF, Dominici F, Peng RD: Reduced hierarchical models with application to estimating health effects of simultaneous exposure to multiple pollutants. J Roy Stat Soc CApp. 2013, 62: 451472. 10.1111/rssc.12006.
Acknowledgements
This work was supported by NIEHS Grants R01ES017290, R01ES015146, R21ES20811, EPA STAR R834797 and R01ES019616. Additional support was provided by NIEHS Grant P30ES017885 entitled “Lifestage Exposure and Adult Disease” and NIEHS Grant P42ES017198. We thank Dr. Steven Roberts (School of Finance and Applied Statistics, Australian National University, Canberra, Australia) for sharing his R codes for implementation of SPCA method in timeseries studies with count data. We are grateful to the reviewers for their helpful comments and useful suggestions to improve this paper.
Author information
Additional information
Competing interests
The authors declared that they have no competing interests.
Authors’ contributions
ZS conducted simulation studies, performed data analysis of the second example, and drafted the manuscript. YT assisted in simulation studies, performed statistical analysis of the first example, and drafted portions of the article. SL has been involved in data cleaning and statistical analysis of the second example, and drafted portions of the article. KKF helped to clean data for the first example, drafted portions of the article, and edited the article carefully. JDM designed the study of the first example and edited the article. SKP provided insightful comments. SAB designed the DAMAT study and revised the article critically. BM conceived the study, directed its implementation and coordination, provided statistical methodology insights, and edited the article. All authors have read and approved the final manuscript.
Electronic supplementary material
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Sun, Z., Tao, Y., Li, S. et al. Statistical strategies for constructing health risk models with multiple pollutants and their interactions: possible choices and comparisons. Environ Health 12, 85 (2013). https://doi.org/10.1186/1476069X1285
Received:
Accepted:
Published:
Keywords
 Bayesian model averaging
 Classification and regression tree
 Collinearity
 Interaction effect
 Model selection
 Multiple pollutants
 Principal component analysis
 Shrinkage
 Variable selection
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate. Please note that comments may be removed without notice if they are flagged by another user or do not comply with our community guidelines.