We have described the methodology used and the main findings reported by meta-analyses of epidemiological studies investigating associations between environmental exposures and pregnancy outcomes, conducted over the last 10 years and reported in the English language literature. In total we identified and described 16 meta-analyses meeting our inclusion criteria. The number of studies included in the reported meta-analyses varied greatly, with the largest number of studies available for environmental tobacco smoke. Only a small number of the studies reported following meta-analysis guidelines or using a quality rating system. Heterogeneity was reported in a number of the studies. Publication bias did not appear to occur frequently. The meta-analyses suggested statistically significant associations between ETS and stillbirth, birth weight and any congenital anomalies; PM2.5 and PTB; outdoor air pollution and possibly some congenital anomalies; indoor air pollution from solid fuel use and stillbirth and birth weight; PCB exposure and birth weight; disinfection by-products in water and stillbirth, SGA and possibly some congenital anomalies; occupational exposure to pesticides and solvents and some congenital anomalies; and Agent Orange and some congenital anomalies. However, the number of studies included in the meta-analyses was often small, the exposure assessment limited, and the quality variable.
The relatively small number of meta-analyses (N=16) is at first glance perhaps surprising given the number of years of research in the area of environmental exposures and pregnancy outcomes. However, as the meta-analyses showed, there are often not many studies with comparable data available for pooling, except perhaps for ETS. Outcomes such as stillbirth and congenital anomalies are fairly rare, so large numbers of subjects are needed; for congenital anomalies the additional problem is case ascertainment and classification, which can vary considerably between studies. Outcomes such as gestational age, birth weight, PTB, and LBW occur more frequently and are easier to study and compare among studies.
The main challenge to pooling studies using meta-analytical techniques is often thought to lie in the difficulties of combining studies with differences in exposure assessment, and therefore in obtaining comparable indices for meta-analyses. The ETS studies compared simple indices such as ETS-exposed vs. non-exposed women [18–20]; in the majority of studies exposure was assessed retrospectively and largely by self-report, which may lead to exposure misclassification. However, in the (sensitivity) analyses there was little difference in the observed associations whether the data were obtained retrospectively or prospectively, or by self-report and/or some biochemical marker [19, 20], which provides increased confidence in the results. Unfortunately, there was little exploration of the importance of the level and duration of ETS exposure.
For outdoor air pollution, regulatory ambient measurements were generally used to derive exposure indices, providing some numerical concentration values for the exposure-response relationships. However, there were considerable differences in, for example, the temporal resolution of the measurements or the distance of the maternal home address to the measurement stations, which raises some doubt as to how representative these were for the population.
The studies on disinfection by-products often used regulatory monitoring data on trihalomethanes in water, but generally did not include water intake measures or concentrations of other DBPs, which probably led to exposure misclassification errors [17, 23–25]. In some cases, analyses focused on high vs. low exposed groups, which were not always directly comparable between studies.
The occupational exposure studies relied to a large extent on self-reported job title and some assignment of exposure to the job title, possibly leading to considerable exposure misclassification [27–29]. Only Govarts et al. used biomonitoring data on POPs from different studies, but had to use conversion factors to obtain comparable indices because POPs were measured in different media (maternal blood, cord blood, and breast milk). Again, this may increase measurement error. Furthermore, they focused only on some specific POPs and not the whole POP mixture.
In general, with various exceptions, non-differential measurement error/exposure misclassification may lead to attenuation of risk estimates and/or loss of power, but this may be compensated for by the increased number of subjects in the combined studies. A further option is to stratify analyses by the quality of the exposure assessment.
A further limitation of any meta-analysis of observational studies is residual confounding. Although the majority of individual studies had attempted to match or control for some important confounding variables such as maternal age, parity, socioeconomic status, alcohol, and drug use, the covariates included varied between studies. Since this may have resulted in residual confounding structures differing among the studies, it may have led to inappropriate pooling of heterogeneous study results in the meta-analysis. On the other hand, where studies with different underlying confounder structures show similar results, this will lead to increased confidence in the results.
Few studies reported having followed meta-analysis guidelines (MOOSE) or using a quality scoring system. Even where authors did not report following guidelines, their approach often appeared to be consistent with them. One reason for not following guidelines or using quality scores is probably the generally small number of studies included in the meta-analyses, with the authors being familiar with the studies in the field. The few studies that included quality scores in their analysis did not see any difference in risk estimates between higher and lower quality studies [19, 20].
The most used method to detect heterogeneity in the data was Cochran’s Q test. Only a small number of meta-analyses identified heterogeneity, and this may be partly because tests for heterogeneity are not very powerful when the number of included studies is low [33, 34]. Where heterogeneity existed, generally no strategy was used to reduce it, for instance by forming subgroups, probably because of the small number of studies; however, some studies had decided beforehand to conduct meta-analyses by subgroup (e.g. study design type). Salmasi et al. conducted meta-analyses overall and then stratified by the type of exposure assessment (self-reported vs. biochemical) and thereby reduced the heterogeneity. Sapkota et al. found less heterogeneity in studies of PM2.5 than PM10, suggesting that the former may be a better exposure index, since PM10 may act as an imperfect surrogate for PM2.5, with the quality of the surrogate differing between areas. Of course, other explanations are also possible, including, for example, large variability in toxicity. At times, either a priori or after testing (even when no heterogeneity was detected), the meta-analyses used random effects models to take account of possible underlying differences between studies. This may at times have resulted in more conservative effect estimates (i.e. larger confidence intervals), but may better reflect reality, where heterogeneity exists but may not be detected because of the small number of studies.
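As a rough illustration of the statistics discussed above (a generic sketch of standard formulas, not code from any of the reviewed meta-analyses), the following computes Cochran's Q and a DerSimonian–Laird random-effects pooled estimate from per-study effect sizes and variances; any input values used with it are purely hypothetical:

```python
import math

def random_effects_pool(effects, variances):
    """Cochran's Q and a DerSimonian-Laird random-effects pooled estimate.

    effects: per-study effect sizes (e.g. log odds ratios)
    variances: per-study sampling variances
    """
    w = [1.0 / v for v in variances]                      # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean;
    # under homogeneity it is approximately chi-squared with k-1 df
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    # DerSimonian-Laird moment estimate of the between-study variance tau^2
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights inflate each study's variance by tau^2,
    # which widens the pooled confidence interval when heterogeneity is present
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, se, q, df, tau2
```

Because Q is referred to a chi-squared distribution with k−1 degrees of freedom, its power to detect true heterogeneity is limited when only a handful of studies are pooled, consistent with the point above.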
One issue to note is that authors often use I2 to estimate heterogeneity, and we have referred to it as such here too. However, I2 is not a measure of the magnitude of the between-study heterogeneity, nor a point estimate of between-study heterogeneity. It represents the approximate proportion of total variability in the point estimates that can be attributed to heterogeneity. The total variation depends importantly on the within-study precisions (essentially the sample sizes of the individual studies), and therefore so does I2. Furthermore, I2 does not estimate a meaningful parameter, so it should be regarded as a descriptive statistic rather than a point estimate. Authors often omit to mention that the magnitude of heterogeneity can be quantified using a point estimate of the among-study variance of true effects, often called τ2 (tau-squared). Thus, I2 may be viewed as the proportion of variability in the point estimates that is due to τ2 rather than within-study error. A more appropriate description of I2 would be as a measure of inconsistency, since it depends on the extent of overlap in confidence intervals across studies.
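For reference, the standard relationships between Cochran's Q, I2, and the DerSimonian–Laird estimate of τ2 (textbook formulas, not taken from the reviewed papers) can be written as:

```latex
I^2 = \max\!\left(0,\ \frac{Q-(k-1)}{Q}\right), \qquad
\hat{\tau}^2_{\mathrm{DL}} = \max\!\left(0,\ \frac{Q-(k-1)}{\sum_i w_i - \sum_i w_i^2 \big/ \sum_i w_i}\right),
```

where k is the number of studies and w_i the inverse-variance weights. The shared numerator Q−(k−1) makes clear that I2 rescales the excess variation relative to the total variation (which depends on within-study precision), whereas τ2 is expressed on the scale of the effect sizes themselves.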
Funnel plots and the Egger test were mostly used to detect publication bias. Little publication bias was observed. One reason may be that many of the studies were time-consuming and difficult to conduct, and that authors therefore made great efforts to get the data published. Furthermore, a sufficient number of studies is needed to be able to detect publication bias, and where few studies are available, it may not be possible. Sensitivity analyses generally consisted of some subgroup analyses or leaving one study out at a time to determine whether there were influential studies. Generally the results did not change appreciably, suggesting that the results presented were robust.
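The Egger test mentioned above amounts to a simple regression; the sketch below (function and variable names are our own, not from the reviewed studies) regresses each study's standardized effect on its precision, so that an intercept far from zero signals funnel-plot asymmetry consistent with small-study effects:

```python
import math

def egger_test(effects, std_errors):
    """Egger regression: regress z = effect/SE on precision = 1/SE.

    Returns the intercept, its standard error, and the t statistic;
    an intercept clearly different from zero suggests funnel-plot asymmetry.
    """
    z = [y / s for y, s in zip(effects, std_errors)]       # standardized effects
    prec = [1.0 / s for s in std_errors]                   # precisions
    n = len(z)
    mx = sum(prec) / n
    my = sum(z) / n
    sxx = sum((x - mx) ** 2 for x in prec)
    sxy = sum((x - mx) * (yi - my) for x, yi in zip(prec, z))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variance and the usual OLS standard error of the intercept
    resid = [yi - (intercept + slope * x) for x, yi in zip(prec, z)]
    s2 = sum(r ** 2 for r in resid) / (n - 2)
    se_int = math.sqrt(s2 * (1.0 / n + mx ** 2 / sxx))
    return intercept, se_int, intercept / se_int
```

With few studies the regression has very few residual degrees of freedom, which is one reason publication bias may go undetected in small meta-analyses, as noted above.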