A multivariate analysis of CalEnviroScreen: comparing environmental and socioeconomic stressors versus chronic disease

Background The health-risk assessment paradigm is shifting from single stressor evaluation towards cumulative assessments of multiple stressors. Recent efforts to develop broad-scale public health hazard datasets provide an opportunity to develop and evaluate multiple exposure hazards in combination. Methods We performed a multivariate study of the spatial relationship between 12 indicators of environmental hazard, 5 indicators of socioeconomic hardship, and 3 health outcomes. Indicators were obtained from CalEnviroScreen (version 3.0), a publicly available environmental justice screening tool developed by the State of California Environmental Protection Agency. The indicators were compared to the total rate of hospitalization for 14 ICD-9 disease categories (a measure of disease burden) at the zip code tabulation area population level. We performed principal component analysis to visualize and reduce the CalEnviroScreen data and spatial autoregression to evaluate associations with disease burden. Results CalEnviroScreen was strongly associated with the first principal component (PC) from a principal component analysis (PCA) of all 20 variables (Spearman ρ = 0.95). In a PCA of the 12 environmental variables, two PC axes explained 43% of variance, with the first axis indicating industrial activity and air pollution, and the second associated with ground-level ozone, drinking water contamination and PM2.5. Mass of pesticides used in agriculture was poorly or negatively correlated with all other environmental indicators, and with the CalEnviroScreen calculation method, suggesting a limited ability of the method to capture agricultural exposures. In a PCA of the 5 socioeconomic variables, the first PC explained 66% of variance, representing overall socioeconomic hardship. In simultaneous autoregressive models, the first environmental and socioeconomic PCs were both significantly associated with the disease burden measure, but more model variation was explained by the socioeconomic PCs. Conclusions This study supports the use of CalEnviroScreen for its intended purpose of screening California regions for areas with high environmental exposure and population vulnerability. Study results further suggest a hypothesis that, compared to environmental pollutant exposure, socioeconomic status has greater impact on overall burden of disease. Electronic supplementary material The online version of this article (10.1186/s12940-017-0344-z) contains supplementary material, which is available to authorized users.


Spatial joining of CalEnviroScreen versus hospital data
The spatial levels of organization for the CalEnviroScreen (CES) versus ICD-9 datasets are different, requiring generation of a geographic correspondence. While CES is organized at the census-tract scale, the spatial reference point for ICD-9 is the zip code. Census tracts are precise polygon features, whereas zip codes correspond to linear USPS route areas, post office box collection points, or large volume customer collection points. Therefore, zip codes require further treatment to allow correspondence with polygon features or area-based calculations [1]. Zip code tabulation areas (ZCTA) are polygons, developed and employed by the US Census to approximate the areal coverage of zip codes, and as such are appropriate for area-based calculations [2]. Because ZCTA is an aggregate scale that can be used to test for and address spatial autocorrelation, we incorporated all data up to the scale of ZCTA. However, ZCTAs are similar but not identical to zip codes, such that additional processing is needed to extrapolate between measures, and avoid data loss. The following procedure was employed to achieve these goals: 2. ICD-9 zip codes that represent post office boxes or large volume customers were assigned to ZCTA in ArcMap by doing a spatial join (point to polygon) between the ESRI zip code point layer and the ESRI ZCTA polygon layer. That is, each ZCTA was defined to contain the corresponding zip code area, in addition to all PO boxes and large volume customers located within the geographic boundaries of that ZCTA. The ZCTA polygons were also automatically set to represent the spatial locations for all zip code areas (i.e., not PO boxes or large volume customers). 3. Each ZCTA has a total population estimate in the 2010 US Census. These data were downloaded from www.socialexplorer.com on May 13, 2015. This was used as the denominator term in the disease burden measure. 4. ICD-9 diagnosis frequencies were summed from all corresponding zip codes by ZCTA.
That is, diagnosis counts from all zip codes including PO Box and zip code areas were summed within the corresponding ZCTA. 5. The ICD-9-based disease burden measure was determined for each ZCTA (see main text Methods for details) 6. CalEnviroScreen (CES) results and the derived multivariate indicators (i.e., PCA results) were aggregated from the census-tract scale to the ZCTA scale using the areaweighted and population-weighted average of corresponding values. Data aggregation was performed on the census tracts and parameters from the original CES October, 2014 dataset after transformation, and also on the PCA results. The populationweighting scheme was based on results from the MABLE/Geocorr12 population weighting algorithm [3], which were downloaded on September 30, 2015.
7. Simultaneously autoregressive models (SAR) require all polygons to have adjacent neighbors; there can be no isolated polygons. Therefore, geographically isolated ZCTAs having CES data needed to be removed from the analysis. Four isolated ZCTAs were removed: 93608 (Cantua Creek), 95257 (Wilseyville), 95925 (Challenge), and 96052 (Lewiston). All of these sites were in rural areas, with low to moderate CES scores (8.4 to 41.9), a low overall population (152 to 2196 residents in the 2010 census), and mostly white (Caucasian) inhabitants (51% to 90%). Two ZCTAs were separated from other ZCTAs by narrow channels: 94501 (city of Alameda) and 92662 (Newport Beach). These two ZCTAs were manually joined to adjacent ZCTA polygons using the ArcMap editor. 8. Linear model and SAR model analyses were performed on the resulting data at the ZCTA scale.

Correlation among diseases in ICD-9 database
In the study analyses, we opted to focus on hospitalization rate for a subset of specific diagnoses, rather than the total frequency of ICD-9 diagnoses, as the measure of disease burden (see Methods). The multivariate correlation structure for reported ICD-9 codes (count person −1 yr −1 ) was examined across all zip codes (N = 1667) for each of the individual diagnoses, as well as the total rate of hospitalizations with at least one of the diagnoses (i.e., the "disease burden measure" in the main text). These data were square root transformed to achieve multivariate normality and analyzed via Pearson's correlation coefficients (r). A positive correlation was observed for almost all pairwise comparisons (Table S2). Correlations were strong among total diagnosis rate, pneumonia, chronic obstructive pulmonary disease (COPD), asthma, myocardial infarction (MI; heart attack), and cerebral vascular accident (CVA; stroke) (Table S2, Fig. S1).
In a principal component analysis of this correlation matrix, the first principal component (PC), which was negatively associated with all diseases, explained 45% of the variance in the data set. This PC was also very strongly associated with total reported codes (Pearson's r = −0.90).
These findings indicate that examination of the separate diagnoses would only provide limited unique information. This supported the development of a general indicator of disease burden that combines the different results. For this reason, we employed the disease burden measure, as described in the text.

-2011
Sign √DB * ln |DB| 1 c 6 a. To account for the presence of zeros, values were log 10 transformed following the order-of-magnitude stabilizing procedure described in McCune and Grace [6]: y = log 10 (x + 10 C ) -C, where C = floor(log 10 (min(x))) b. Site weighting was based on expected hazard of the site type and distance from populated census block within the census tract [5] c. Modulus transformation per John and Draper [7]