Construction of an environmental quality index for public health research

Background A more comprehensive estimate of environmental quality would improve our understanding of the relationship between environmental conditions and human health. An environmental quality index (EQI) for all counties in the U.S. was developed. Methods The EQI was developed in four parts: domain identification; data source acquisition; variable construction; and data reduction. Five environmental domains (air, water, land, built and sociodemographic) were recognized. Within each domain, data sources were identified; each was temporally (years 2000–2005) and geographically (county) restricted. Variables were constructed for each domain and assessed for missingness, collinearity, and normality. Domain-specific data reduction was accomplished using principal components analysis (PCA), resulting in domain-specific indices. Domain-specific indices were then combined into an overall EQI using PCA. In each PCA procedure, the first principal component was retained. Both domain-specific indices and overall EQI were stratified by four rural–urban continuum codes (RUCC). Higher values for each index were set to correspond to areas with poorer environmental quality. Results Concentrations of included variables differed across rural–urban strata, as did within-domain variable loadings, and domain index loadings for the EQI. In general, higher values of the air and sociodemographic indices were found in the more metropolitan areas and the most thinly populated areas have the lowest values of each of the domain indices. The less-urbanized counties (RUCC 3) demonstrated the greatest heterogeneity and range of EQI scores (−4.76, 3.57) while the thinly populated strata (RUCC 4) contained counties with the most positive scores (EQI score ranges from −5.86, 2.52). Conclusion The EQI holds promise for improving our characterization of the overall environment for public health. 
The EQI describes the non-residential ambient county-level conditions to which residents are exposed, and domain-specific EQI loadings indicate which of the environmental domains account for the largest portion of the variability in the EQI environment. The EQI was constructed for all counties in the United States, incorporating a variety of data to provide a broad picture of environmental conditions. We undertook a reproducible approach that primarily utilized publicly available data sources.


Background
Polluted environments have contributed to harmful exposures associated with human morbidity [1][2][3][4][5]. The empirical characterization of environmental conditions, however, is challenging because the non-residential ambient environment comprises an almost uncountable array of complex mixtures, which are difficult to quantify simultaneously. Moreover, the effect of the surrounding environment on human morbidity is more broadly understood to include exposures such as socioeconomic deprivation, access to healthy food, highway safety, etc. The complex nature of the overall environment likely contributes to the practice of using isolated exposures to represent ambient conditions. Environment often encompasses traditional exposure like pollutants, chemicals, and water quality, as well as other non-genetic exposures such as the built environment, nutrition, and socioeconomic climate. In environmental epidemiology, ambient conditions are usually explored singly: one exposure or category of exposure at a time (e.g., ozone, pesticides, water disinfection by-products) [6]. Sometimes mixtures are used within one domain (e.g., air data) [1,7], and other times total environments may be characterized (e.g., exposure to healthy food environment) [8,9]. Still other work includes entire environmental domains to estimate non-residential ambient conditions (e.g., socioeconomic deprivation) [10,11]. And rarely, if ever, are multiple environmental domains combined, even though we know humans are exposed to these multiple environmental domains simultaneously.
Multiple challenges exist in combining across environmental domains or environmental types to construct one environmental measure. Much of the data we use to characterize environmental conditions are collected for administrative, regulatory and non-research purposes [12]. Measures collected at different scales would need to be meaningfully combined. They may also be measured at different units of spatial and temporal aggregation. A more complete estimation of the non-residential ambient environment may also be limited by statistical approaches and disciplinary practices. Statistical imprecision of estimates may be a concern if many variables are necessary to appropriately estimate a given domain or overall environment and a limited number of outcomes are being distributed across multiple exposure and covariate categories. From a disciplinary perspective, most research teams rarely include more than one type of exposure specialist. But many of these challenges can be readily overcome with appropriate statistical methods and interdisciplinary research teams.
Here we describe a method of constructing an environmental quality index (EQI) representing multiple domains of the non-residential ambient environment, including the air, water, land, built and sociodemographic domains. This manuscript outlines a reproducible approach to the development of the EQI that capitalizes almost exclusively on publicly available data sources.

Domain identification
A fuller description of the methods used for EQI construction is available in Additional file 1. We initially identified three environmental domains, air, water and land, based on selected chapters from the United States (U.S.) Environmental Protection Agency (EPA) 2008 Report on the Environment (ROE) [13]. Following consultation with the ROE, the team undertook a more extensive review to complement the domains and data sources already identified, which included the following activities: 1) identifying precise literature search terms, limits and reporting format; 2) conducting a literature review on "Environment and Infant Mortality"; 3) recording findings; 4) finalizing search terms for within-domain literature review; 5) conducting a within-domain literature review; and 6) recording findings. We chose infant mortality to be the health outcome for the literature search for several reasons: 1) infant mortality is a well-researched and understood health outcome; 2) infant mortality is a general outcome, with known positive associations with other lifetime health measures such as disability-adjusted life expectancy [14]; as such, the environmental exposure-health outcome relationship would not be restricted to one organ (e.g., heart disease) or system (e.g., asthma); 3) the research team was largely composed of reproductive/perinatal researchers for whom infant mortality was an important health outcome. The literature review was conducted in PubMed for the years 1980-2008. We added the built and sociodemographic domains based on the findings of the literature review. From this broad search, and our a priori identification, five specific domains were considered: air, water, land, built, and sociodemographic environments.

Geographic level of analysis
The unit of analysis for EQI development was U.S. county. While county is a broad unit of analysis that may not allow for small-geography specificity, most national data sources are available at the county level. We wanted to construct a replicable process and product for use across the United States and we deemed the county level as the most widely generalizable. It also enables linkage to health data aggregated to the county level, such as national birth statistics from the National Center for Health Statistics (NCHS).

Data source time period
At the initiation of the EQI development, we restricted the temporal framework to 2000 to 2005. We wanted to primarily utilize publicly available data, and this six-year window was chosen based on availability of both environmental (including decennial census) and outcome data (e.g., national birth records).

Data sources
The data sources are described in detail elsewhere [15]. Briefly, data sources were considered for EQI inclusion based on temporal, spatial, and quality-related criteria. Temporal appropriateness required data to represent the 2000 to 2005 time period. Data sources were considered spatially appropriate if data were available at, or could be aggregated or interpolated to represent, the county level for all 50 states. Data quality, especially related to data source documentation, was determined by data source managers (in data reports and internal documentation), by project investigators, and by the larger field of environmental research through use and critique of the various data sources.
The air domain included two data sources: the Air Quality System (AQS) [16], which is a repository of national ambient air concentrations from monitors across the country for criteria air pollutants; and the National-Scale Air Toxics Assessment (NATA) [17], which uses emissions inventory data and air dispersion models to estimate non-residential ambient concentrations of hazardous air pollutants (HAPs).
The water domain comprised five data sources: Watershed Assessment, Tracking & Environmental Results (WATERS) Program Database [18], Estimates of Water Use in the U.S. [19], National Atmospheric Deposition Program (NDAP) [20], Drought Monitor Network [21], and National Contaminant Occurrence Database (NCOD) [22]. The WATERS Program Database is a collection of data from various EPA-conducted water assessment programs including impairment, water quality standards, pollutant discharge permits, and beach violations and closures. The Estimates of Water Use in the U.S. is calculated by the United States Geological Survey (USGS) and includes county-level estimates of water withdrawals for domestic, agricultural, and industrial uses. The NDAP dataset provides measures of chemicals in precipitation using a network of monitors located throughout the U.S. The Drought Monitor Data provides raster data on the drought status for the entire U.S. on a weekly basis. The NCOD dataset provides data from public water supplies on 69 different contaminants.
The land domain was constructed using data from five sources. The 2002 National Pesticide Use Database [23] estimates state-level pesticide usage based on pesticide ingredients and crop type. The 2002 Census of Agriculture [24] is a summary of agricultural activity, including information about crops, livestock, and chemicals used. The National Priority Site data [25] includes location of and information on sites that have been placed on the National Priority List (NPL), including indicators for major facilities (e.g., Superfund sites), large quantity generators, toxics release inventory, Resources Conservation and Recovery Act treatment, storage and disposal facilities, corrective action facilities, assessment, cleanup, and redevelopment exchange (brownfield sites), and section seven tracking system pesticide producing site locations. The National Geochemical Survey [26] contains geochemical data (e.g., arsenic, selenium, mercury, lead, zinc, magnesium, manganese, iron, etc.). The fifth source is the EPA Radon Zone Map [27], which identifies areas of the U.S. with the potential for elevated indoor radon levels.
The sociodemographic domain included two data sources: the U.S. Census [28] and Federal Bureau of Investigation (FBI) Uniform Crime Report (UCR) [29]. The U.S. Census collects population and housing data every 10 years, economic and government data every five years and the American Community Survey annually. FBI UCR rate data are available annually and by crime type (violent or property).
The built environment domain employed five data sources. Dun and Bradstreet collects commercial information on businesses and contains more than 195 million records [30]. These data are the only data used in the EQI that are not free, though they are publicly available for purchase. Topologically Integrated Geographic Encoding and Referencing (TIGER) [31] data provide maps and road layers for the U.S. at multiple units of census geography. The Fatality Analysis Reporting System (FARS) [32] is a national census providing the National Highway Traffic Safety Administration with yearly reports of fatal injuries suffered in motor vehicle crashes. Housing and Urban Development (HUD) [33] data provide a count of low-rent and section-eight housing in each housing authority area, which corresponds to cities. The built environment domain also included the percent using public transportation variable from the census, which was not included in the sociodemographic domain; census data have been previously described.

EQI construction

Variable construction
Each of the data sources could plausibly give rise to hundreds of potentially relevant variables; therefore only specific variables were selected, or in some cases constructed, from each of the data sources. A detailed listing of all the constructed variables is available in Additional file 2.

Statistical processes common to all variables in all domains
Variable collinearity was assessed within subgrouping and when the correlation coefficients exceeded 0.7, one variable was chosen for inclusion. Similar variables with low numbers of missing values were retained over those with high numbers of missing values. If missingness was approximately equal, the decision about which variable to retain was based on exposure routes from hazard summaries [34], with routes from the appropriate domain being primary.
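The collinearity screen described above can be illustrated with a minimal sketch. The function names, the toy values, and the use of the absolute correlation are assumptions made for illustration; the tie-breaking step based on exposure routes from hazard summaries is not modelled here (ties simply drop the second variable of the pair).

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation using only counties where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    mean_x = sum(a for a, _ in pairs) / len(pairs)
    mean_y = sum(b for _, b in pairs) / len(pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


def screen_collinear(columns, threshold=0.7):
    """Return the variable names retained after the pairwise screen.

    columns maps a variable name to a list of county values (None = missing).
    For each pair with |r| > threshold (absolute value is an assumption),
    the variable with more missing values is dropped.
    """
    n_missing = {name: sum(v is None for v in vals) for name, vals in columns.items()}
    kept = list(columns)
    for a, b in combinations(list(columns), 2):
        if a not in kept or b not in kept:
            continue  # one of the pair was already dropped
        if abs(pearson(columns[a], columns[b])) > threshold:
            drop = a if n_missing[a] > n_missing[b] else b
            kept.remove(drop)
    return kept


# Hypothetical toy data: "no2" tracks "pm" closely and has one missing value.
toy = {
    "pm": [1, 2, 3, 4, 5, 6],
    "no2": [2, 4.1, 6, 8.2, None, 12],
    "o3": [5, 1, 4, 2, 6, 3],
}
retained = screen_collinear(toy)
```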
Variable missingness was also assessed to determine whether missing data were truly missing or instead represented true zeros. For instance, when crime data were missing for a county, we considered them missing because crime occurs almost everywhere; when beach closure data were missing for a county, we considered those to be true zeros because not all counties have beaches. When more than 50 percent of all counties were missing or zero for a given variable, that variable was excluded from further consideration for the EQI.
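As a rough illustration of the missingness rule, the sketch below recodes missing values to zero for variables flagged as having true zeros and then excludes any variable for which more than half of the counties are missing or zero. The function name, the variable names, and the toy values are hypothetical.

```python
def exclude_sparse(columns, true_zero_vars, max_fraction=0.5):
    """Apply the 50 percent missing-or-zero exclusion rule.

    columns maps a variable name to a list of county values (None = missing).
    Variables named in true_zero_vars have missing values recoded to 0.0
    before the rule is applied.
    """
    kept = {}
    for name, values in columns.items():
        if name in true_zero_vars:
            values = [0.0 if v is None else v for v in values]
        n_sparse = sum(v is None or v == 0 for v in values)
        if n_sparse / len(values) <= max_fraction:
            kept[name] = values
    return kept


# Hypothetical toy data for four counties.
toy = {
    "crime_rate": [1.2, None, 3.4, 2.2],          # missing stays missing
    "beach_closure_days": [None, None, None, 4],  # missing means no beaches
}
screened = exclude_sparse(toy, true_zero_vars={"beach_closure_days"})
```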
Because of the data reduction approach used for index construction (principal components analysis (PCA), discussed in detail below), and the statistical assumptions implied by this method, variables were assessed for normality. This was done by visually comparing histograms of each variable's distribution to a normal distribution for that variable. When violations of normality were observed, transformations were considered to enable the variable to best approximate the normal distribution. For each variable, natural-log (hereafter, log), logit, and square-root transformations were considered and distributions were visually inspected again. In each case, log transformation resulted in the most normal-appearing distribution. For variables with true zeros, log-transformation was achieved by adding half of the non-zero minimum value to all observations and then taking the natural log of that value.
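The log transformation for variables with true zeros can be sketched as follows. The function name and the toy values are illustrative assumptions; the offset rule (half of the non-zero minimum added to every observation) follows the description above.

```python
import math


def log_transform(values):
    """Natural-log transform a list of non-negative values.

    If the variable contains zeros, half of the smallest non-zero value is
    added to every observation before taking the log.
    """
    if any(v == 0 for v in values):
        offset = min(v for v in values if v > 0) / 2
    else:
        offset = 0.0
    return [math.log(v + offset) for v in values]


# Hypothetical example: the non-zero minimum is 2, so the offset is 1.
transformed = log_transform([0, 2, 4])  # log(1), log(3), log(5)
```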
Finally, variables were assessed to determine valences for environmental quality. Valences, or the positive or negative direction of the indices, were determined based on potential for human health and ecological effects. Domains containing variables with known or suspected potential for adverse health outcomes (e.g., increased morbidity or mortality) or ecologic effects (e.g., disruption of biotic integrity) were considered to have a negative valence with higher values representing poorer environmental quality. In some cases, the valence of a given variable was unknown, in which case the valence would be empirically assigned through the data reduction/PCA process by virtue of its association with other variables in that domain.

Air domain variable construction
Daily concentrations of six criteria air pollutants were downloaded from the AQS [16] and temporally averaged to get annual mean concentrations for each monitor location from 2000 to 2005. The annual means were then temporally and spatially kriged to estimate annual concentrations at each county center point. An exponential covariance structure for the spatial covariance was implemented to represent both temporal and spatial variability. These values were then averaged for the full study period.
The 2002 NATA [17] database was used as an initial source of county-level HAP concentrations for evaluation of variables to include. Emissions estimates were retrieved from the NATAs for 1999 and 2005, and estimates for each variable from the three NATAs were averaged to get a composite emissions estimate across the study period. Air domain variables were then checked for normality of distribution and where indicated, were log-transformed. For both criteria and hazardous air pollutants, higher concentrations are negative for air quality. Therefore, the valence of the air domain is negative.

Water domain variable construction
Water impairment is determined for multiple types of water usage: agricultural, drinking, recreational, wildlife and industrial. Using the WATERS [18] database and joining the data in GIS software with measures of stream length in the Reach Address Database [35], a cumulative measure of percent of water impaired for any use was used to represent overall water quality in the county.
Water contamination is caused by several sources and we used the number of National Pollutant Discharge Elimination System (NPDES) [36] permits in a county as a proxy for general water contamination. Three composite variables were included in the EQI: a composite for number of sewage permits, a composite for industrial permits, and one for stormwater permits, all per 1000 km of stream length per county.
Recreational water quality was also assessed using the WATERS database [18], from which we created three variables for number of days of beach closure in a county: for any event, for contamination events, and for rain events.
Data on water used for domestic needs were extracted from the Estimates of Water Use in the U.S. [19] database as a proxy for domestic water quality. Two variables were included in the EQI: the percent of population on self-supplied water supplies and the percent of those on public water supplies which are on surface waters.
The atmospheric deposition of chemicals can affect water quality. The NDAP [20] dataset provides measures for the concentration of nine chemicals in precipitation: calcium, magnesium, potassium, sodium, ammonium, nitrate, chloride, sulfate, and mercury. Annual summary data from each monitoring site for each year 2000-2005 were spatially kriged, using an exponential covariance structure, to achieve national coverage and county-level estimates. The annual estimates for each pollutant were then averaged over the six-year study period. The data for all pollutants, except sulfate, were skewed and therefore were log-transformed to achieve normal distributions.
We expect that drought affects the concentration of pathogens and chemicals in waters and therefore can affect water quality. The Drought Monitor [21] dataset provides raster data on six possible drought status conditions for the entire U.S. on a weekly basis. The data were spatially aggregated to the county level to estimate the percentage of the county in each drought status condition. From these data we used the percentage of the county in extreme drought (D3-D4) in the EQI.
Chemical contamination of water supplies was extracted from the NCOD [22] dataset which provides data on 69 contaminants provided by public water supplies throughout the country for the period from 1998-2005. Data for all samples in a county for each contaminant were averaged over the entire time period of the data and log-transformed to achieve normal distributions. Missing values were set to zero, with the assumption that lack of measurement for an area indicated low concern for contamination with that particular contaminant.
The majority of variables in the water domain are estimates of pollutants for which higher values are considered negative for water quality. The final valence of the water domain is negative, indicating a higher water domain score is associated with poorer environmental quality.

Land domain variable construction
Information on the agricultural environment was obtained from the 2002 Census of Agriculture [24]. In total, eight variables representing agriculture were constructed, and county-level percentages (acres applied per county total acreage) were calculated and log-transformed.
Variables specific to pesticide application were also constructed. Herbicide, insecticide, and fungicide use for each county were estimated using crop data from the 2002 Census of Agriculture and state pesticide use data from the 2002 National Pesticide Use Dataset [23]. All pesticide variables were log-transformed.
The natural geochemistry and soil contamination of an area was estimated using the National Geochemical Survey (NGS) data [26]. These data, collected for stream sediments, soils, and other media, were combined at the county level to estimate the mean values of 13 geochemical contaminants available and were log-transformed.
Large industrial facilities represent sources for pollutants released into the environment. The National Priority List [25] data from the EPA provided information on facilities for the U.S. Because many counties had at least one, but no counties had all six of the facility types present, a composite facilities data variable was constructed by summing the count of any one of the six facilities types (brownfield sites [37], superfund sites [38], toxic release inventory sites [39], pesticide producing location sites [40], large quantity generator sites [41], and treatment, storage and disposal sites [42]) across the counties. The facilities rate variable was assessed for normality and log-transformed.
Finally, the potential for elevated indoor radon levels was represented using the county score from the EPA Radon Zone Map [27].
As all constructs in the land domain were determined to have a negative valence, the valence of the land domain as a whole is also negative, indicating a higher land domain score represents poor environmental quality.

Sociodemographic domain
The sociodemographic environment is an important environment for human health. Eleven variables from the United States Census [28] were included in the sociodemographic domain of the EQI. The sociodemographic domain contains a mix of positive and negative features; therefore when the sociodemographic domain was constructed, positive variables were reverse-coded to ensure that a higher amount of the sociodemographic domain represented adverse environmental conditions.
The area-level crime environment was represented using the Federal Bureau of Investigation (FBI) Uniform Crime Reports (UCR) [29]. These data required some manipulation for inclusion in the EQI. Because crime reporting is voluntary and crime data are reported for less than half the U.S. counties, yet it seemed unlikely that no crimes occurred in the areas with no reported crime, crime data were spatially and temporally kriged to estimate values for counties with no reported crime. Kriging employed a double exponential covariance structure for the spatial covariance; one structure represented short-range variability and the other long-range variability. The covariance model was fit to experimental covariance values using a least squares method and demonstrated sufficient fit. Varying geographical unit sizes were not explicitly accounted for through the kriging estimates, but crime estimates were made for 57 percent of U.S. counties, mostly in rural areas. The crime variable was log-transformed for inclusion in the EQI.
Both constructs in the sociodemographic domain have a negative valence. Therefore, the final valence of the sociodemographic domain is negative, indicating a higher sociodemographic domain score is associated with poor environmental quality.

Built environment domain
Housing environments vary and features of the housing environment have the potential to influence human health and well-being. The housing environment was represented using two variables available from the HUD data source, low-rent and section-eight [43], which were summed to result in the count of any low-rent or section-eight housing in each county; the subsidized housing rate was constructed from this count. The subsidized housing rate was log-transformed. Highway safety was represented by a traffic fatality variable. Rates for the count of fatal crashes per county were constructed. This rate was log distributed (due to many counties having zero fatal crashes) and was therefore log-transformed. The percent of county residents who use public transportation was the only U.S. Census [28] variable used in the built environment domain of the EQI. For many counties, the percent of the population who reports using public transportation is near 0, and it was therefore log-transformed.
We were interested in characterizing the relative proportions of each county that were served by highways, secondary roads and primary roads. The proportions of all roadways that were highways or primary roads were included.
Business and service environments are important predictors of human health and activity. We sought to estimate features of the economic and service environment using data from Dun and Bradstreet [30]. Nine business environment rate variables were constructed by dividing the county-level count of a business type by the county-level population count. All variables except the negative food environment were log-transformed for normality. The business and service environments contain a mix of positive and negative features; therefore when the built domain was constructed, positive variables were reverse-coded to ensure that a higher amount of these service variables represents adverse environmental conditions. The built domain's valence is negative, indicating a higher built domain score represents poor environmental quality.

EQI temporal representation
When annual data were available, variable consistency (mean and standard deviation) was compared across each of the six years (2000-2005). Additionally, proto-EQIs were constructed using data from one year (2002) and from the average of all six years. For those variables that were spatially kriged, county-level values before and after kriging were also compared. Because these county-level values were temporally consistent, the EQI was constructed based on county-level averages for the six-year period for each variable in each domain.

RUCC stratification
Recognizing that environments differ across the rural-urban continuum [44], we concluded the EQI would be most useful if it accommodated rural-urban environmental differences. Therefore, EQI construction was stratified by rural-urban continuum codes (RUCC). The RUCC is a nine-item categorization code of proximity to/influence of major metropolitan areas [45]. As has been done elsewhere, the nine categories were condensed into four: RUCC1, metropolitan urbanized (codes 1 + 2 + 3); RUCC2, nonmetro urbanized (codes 4 + 5); RUCC3, less urbanized (codes 6 + 7); and RUCC4, thinly populated (codes 8 + 9) [46][47][48][49]. Both stratified county-specific and all-county indices were created. Loadings on the stratified and non-stratified sets of indices were assessed to determine loading heterogeneity across counties. Because these loadings differed meaningfully by RUCC level, we constructed a RUCC-stratified EQI for each county.
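The condensation of the nine RUCC codes into four strata amounts to a simple lookup, sketched below. The dictionary and function names are hypothetical; the code groupings and stratum labels are those given above.

```python
# Nine-item RUCC code -> condensed four-level stratum.
RUCC_TO_STRATUM = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4}

STRATUM_LABELS = {
    1: "metropolitan urbanized",
    2: "nonmetro urbanized",
    3: "less urbanized",
    4: "thinly populated",
}


def collapse_rucc(code):
    """Return (stratum number, label) for a nine-item RUCC code."""
    if code not in RUCC_TO_STRATUM:
        raise ValueError(f"RUCC code must be an integer from 1 to 9, got {code!r}")
    stratum = RUCC_TO_STRATUM[code]
    return stratum, STRATUM_LABELS[stratum]


# Hypothetical usage: group county identifiers by condensed stratum.
county_codes = {"county_a": 2, "county_b": 7, "county_c": 9}
strata = {county: collapse_rucc(code)[0] for county, code in county_codes.items()}
```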

Data reduction
Similar to the approach employed in other research [10,50,51], principal components analysis (PCA) was chosen for data reduction in this study because the investigators sought an empirical summary of total area-level variance explained by the environmental variables, rather than a confirmation of any underlying factor structure comprised of the previously identified domains.
Because it was unclear which of the variables included in the domain-specific PCAs were irrelevant to human health, we retained all the variables for inclusion in the RUCC-stratified and overall indices.

Component extraction and index construction
The constructed variables from each dataset were merged to produce a domain-specific county-level dataset. The domain-specific variables were then combined using PCA. PCA produces variable loadings, which are roughly equivalent to the "weight" or contribution that each variable makes toward explaining the total variance. The loading associated with each variable is then multiplied by its mean value for the given geography (county, for the EQI) and these weighted mean values are summed. Although it is possible to form as many independent linear combinations as there are variables, we retained only the first principal component: the unique linear combination that accounted for the largest possible proportion of the total variability in the component measures. This process was undertaken separately for each of the four RUCC strata.
The first principal component, which we labeled the domain-specific index (e.g., air domain index), was standardized to have a mean of 0 and standard deviation (SD) of 1 by dividing the index by the square root of the eigenvalue [52]. Each domain-specific index was then included in a second PCA procedure (Figure 1), from which we extracted the first principal component to create the EQI.
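The two-stage procedure can be sketched with NumPy as shown below. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, variables are assumed to be standardized before PCA (a correlation-based PCA), and the sign of each component is oriented so that the weights sum to a positive value, which is only a stand-in for the valence checks described earlier. Because the variance of a principal component equals its eigenvalue, dividing the component scores by the square root of the eigenvalue yields an index with SD 1.

```python
import numpy as np


def first_component_index(X):
    """First principal component of X (counties x variables), scaled to SD 1.

    Returns (index, weights, share of total variance explained).
    Assumes standardized inputs, i.e. PCA on the correlation matrix.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = (Z.T @ Z) / (len(Z) - 1)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    eigval, weights = eigvals[-1], eigvecs[:, -1]
    if weights.sum() < 0:  # sign of a component is arbitrary (assumed orientation)
        weights = -weights
    scores = Z @ weights               # weighted sum of standardized variables
    index = scores / np.sqrt(eigval)   # variance of scores equals eigval
    return index, weights, eigval / eigvals.sum()


def build_eqi(domains):
    """Stage 1: one index per domain. Stage 2: PCA of the domain indices."""
    domain_indices = {name: first_component_index(X)[0] for name, X in domains.items()}
    stacked = np.column_stack(list(domain_indices.values()))
    eqi, domain_weights, _ = first_component_index(stacked)
    return eqi, dict(zip(domain_indices, domain_weights)), domain_indices
```

In a stratified analysis, `build_eqi` would be applied separately to the counties in each RUCC stratum, so that both the within-domain weights and the domain weights on the overall index can differ by stratum.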
Pearson's product moment correlations were used to assess relationships among the indices and between the indices and other county-level variables with a cut off of 0.7.

Description of variables comprising EQI domains
The full listing and description of variables contained in the EQI can be found in Additional file 2. Here we present exemplar variables from each domain to describe the variables that represented common patterns of variable loadings (e.g., monotonically increasing or decreasing loadings from most urban to most rural, u-shaped loading pattern from most urban to most rural, etc.). Means, standard deviations, and ranges are included.
Variables included in the air domain generally show moderate to high variability between rural-urban strata, with higher averages in the most urban stratum decreasing to the most rural stratum (Table 1). For example, CO has mean values of 705, 598, 472, and 343 ppm for each stratum from most urban to most rural. This pattern holds true for most of the hazardous air pollutants as well, though some pollutants show higher means in the nonmetro urbanized or less-urbanized strata (e.g., chlorine, dimethyl sulfate). PM10, PM2.5, and carbon tetrachloride are relatively stable across rural-urban strata.
The variables included in the water domain also demonstrate moderate variability across the rural-urban strata. The metropolitan-urbanized and non-metro urbanized strata both have higher overall impaired stream length (14.00% and 14.20%, respectively) compared to the less-urbanized and thinly populated strata (8.79% and 6.54%, respectively) (Table 2). The urban strata also demonstrated a higher number of discharge permits per stream length than the rural strata. The thinly-populated stratum had the highest percentage of population on self-supplied sources (35.61%) and the lowest percentage of population on surface water sources (21.94%). While most chemical contaminants demonstrated similar concentrations across the rural-urban strata, there were a few differences. Fluoride and Di(2-ethylhexyl)adipate (DEHA) were present in higher concentrations in the metropolitan-urbanized stratum. There was little variability across rural-urban strata for atmospheric deposition of chemicals and percent of land in extreme drought.
In the land domain, the metropolitan-urbanized counties have higher averages of soil contaminants, more facilities, and lower agricultural-related variables (% harvested, % irrigated) than non-metro urbanized, less-urban, and thinly-populated counties (Table 3). Pesticides and animal units show no clear pattern in variation across the strata. For example, average pounds of fungicides applied are 1820, 4030, 2740, and 2140 for most urban to most rural strata, respectively. There is little variation in the distribution of radon zones or agricultural chemicals applied across rural-urban strata.
Socioeconomic variables included in the sociodemographic domain indicate that rural counties are generally more deprived than more urban counties (Table 4), having the lowest household income ($30,350) and highest percent of persons in poverty (16.1%). From the crime perspective, however, rural areas are at an advantage compared to more urban areas; the mean violent crime rate for rural counties was 352.5 compared with 390.9 for the most urban and 398.1 for the non-metropolitan urbanized counties.
In the built environment domain (Table 5), the most rural counties have the smallest proportion of highways and a significantly higher rate of traffic fatalities compared with more urban areas. Urban counties had fewer education-related businesses, positive food establishments, recreation-related resources and subsidized housing units per person compared with more rural counties.

Variable loadings on EQI domains
Variable loadings are a function of the county-level prevalence of a variable and its association with the other variables contributing to the total county-level variability for a given domain. The full listing of variable loadings across RUCC strata and on the overall EQI can be found in Additional file 3. Here we present exemplar variables from each domain that represent common patterns of variable loadings (e.g., monotonically increasing or decreasing loadings from most urban to most rural, or a U-shaped loading pattern from most urban to most rural).
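To make the notion of a variable loading concrete, the following is a minimal sketch of how first-principal-component loadings and a domain index could be computed from standardized variables. It is not the code used to build the EQI: the data are synthetic, the function names are hypothetical, power iteration is swapped in for a library PCA routine, and treating the unit-length eigenvector coefficients as "loadings" is an assumption about the convention in use.

```python
import math
import random
from statistics import mean, pstdev


def standardize(columns):
    """Z-score each variable; `columns` holds one list of county values per variable."""
    scaled = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        scaled.append([(x - m) / s for x in col])
    return scaled


def correlation_matrix(z_columns):
    """Correlation matrix of variables that are already standardized."""
    n = len(z_columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in z_columns]
            for ci in z_columns]


def first_component(matrix, iterations=500):
    """Leading eigenvalue and unit eigenvector of `matrix` by power iteration."""
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
    eigenvalue = sum(a * b for a, b in zip(v, mv))
    return eigenvalue, v


# Synthetic "domain" (illustrative only): two variables driven by a shared
# factor plus one variable that is unrelated to the other two.
rng = random.Random(0)
factor = [rng.gauss(0, 1) for _ in range(200)]
variables = [
    [f + 0.3 * rng.gauss(0, 1) for f in factor],
    [f + 0.3 * rng.gauss(0, 1) for f in factor],
    [rng.gauss(0, 1) for _ in factor],
]

z = standardize(variables)
eigenvalue, loadings = first_component(correlation_matrix(z))
explained = eigenvalue / len(variables)  # share of total standardized variance
# County-level index: loading-weighted sum of the standardized variables.
index = [sum(l * col[i] for l, col in zip(loadings, z)) for i in range(len(factor))]

print("loadings:", [round(l, 2) for l in loadings])
print("variance explained:", round(explained, 2))
```

In this toy setting the two correlated variables receive the larger loadings because they share variability, while the unrelated variable contributes little to the first component. That is the sense in which a loading depends on a variable's association with the other variables in its domain.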
The loadings for the variables that comprise the air domain varied by RUCC strata, though not extensively (Table 6). The direction of loadings was similar across rural-urban strata. Criteria air pollutants were less influential in the metropolitan-urban stratum compared to the other strata, while the influence of hazardous air pollutants varied. The first principal component explained 47% of the total air variability and the domain was approximately normally distributed.
The loadings for the variables that comprise the water domain varied by RUCC and also by construct, suggesting that some constructs were more influential in urban areas and others in rural areas (Table 7). Variables representing overall water quality loaded positively in the two urban RUCC strata and negatively in the rural RUCC strata. The loadings for variables representing general water contamination and recreational water quality varied by RUCC, though they were quite low overall. Loadings for variables representing domestic water quality and drought varied by RUCC, though they were all positive. The loadings for variables representing the atmospheric deposition construct varied by RUCC and did not demonstrate any clear patterns. Variables in the chemical contamination construct demonstrated little variability by RUCC, with loadings of similar values for all variables across all RUCC strata. The first principal component explained 46% of the total variability for the water variables, and while each of the variables contributing to the water domain was normally distributed, the water domain itself was not. This may have resulted from so many regions of the U.S. lacking water quality information; there were considerable data for some counties and almost no data for others. In light of its non-normal distribution, the water domain itself and its contribution to the overall EQI should be interpreted with caution.
The loadings for variables in the land domain varied considerably (Table 8). For mercury, lead, titanium, and aluminum, loading magnitudes were much lower in the most urban stratum, while the loadings across all other strata were comparable. Some variables had the highest loading in the most-urban and most-rural strata (e.g., herbicides), while others remained stable across strata (e.g., arsenic, iron, harvested acreage). Direction of loadings was consistent across strata and the first principal component accounted for 32% of the total variability. This domain was approximately normally distributed with just a few counties having significantly lower land-domain values. These outlying counties were retained, however, to enable the EQI's construction for all U.S. counties.
The loadings for the variables that comprise the sociodemographic domain also varied by RUCC (Table 9), indicating some variables were more influential in urban settings while others exerted more of an effect on the domain score in rural counties. The patterns of association within the socioeconomic construct were fairly consistent, however, meaning the variables that loaded negatively in the urban counties also loaded negatively in the least urban counties. For instance, renter occupation and vacant units were negatively associated with median household value and median household income across rural-urban status. The one socioeconomic variable for which this was not the case was the percentage of persons who worked outside the county; for this variable, working outside the county in less urbanized and thinly populated counties was inversely associated with more than a high school education, but was positively associated in metropolitan urbanized and non-metropolitan urbanized counties. The first principal component accounted for 32% of all county-level variability and was normally distributed.
The variables that comprised the built environment domain loaded much less consistently across the rural-urban categories (Table 10). In general, there were more inverse or negative variable loadings in the most urban counties compared with the less urban counties, and the most rural counties had fairly consistent positive variable loadings. Given this variability, the first principal component accounted for only 23% of the total county-level variability in the built environment, but was also normally distributed.

Domain-specific index description for overall EQI
The means, standard deviations, and ranges for each domain-specific index are presented in Table 11. In general, higher values of the air and sociodemographic indices were found in the more metropolitan areas and the most thinly populated areas have the lowest values of each of the indices. Mean values for the land domain index did not vary substantially by RUCC strata and mean values for the built environment indices were below zero, or in the direction of better built environment quality.
Correlations among the domain-specific indices were modest (Table 12), ranging from 0.08 (air and water domains) to 0.40 (air and built domains). The correlations between the overall EQI and each of the domain-specific indices reflected the relative importance of that domain to overall environmental variability, and ranged from 0.75 (overall EQI and the sociodemographic domain) to 0.37 (overall EQI and the water domain).
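As an illustration of how a correlation table of this kind can be produced, the sketch below computes pairwise Pearson correlations among three hypothetical domain indices. The index values are synthetic and the mixing weights are arbitrary; nothing here reproduces the values in Table 12, and the choice of Pearson correlation is an assumption.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


rng = random.Random(1)
n_counties = 300
shared = [rng.gauss(0, 1) for _ in range(n_counties)]

# Hypothetical domain indices: each mixes a shared component with
# domain-specific noise (weights chosen arbitrarily for illustration).
domains = {
    "air": [0.6 * s + rng.gauss(0, 1) for s in shared],
    "water": [0.2 * s + rng.gauss(0, 1) for s in shared],
    "built": [0.6 * s + rng.gauss(0, 1) for s in shared],
}

names = list(domains)
matrix = {a: {b: pearson(domains[a], domains[b]) for b in names} for a in names}

# Print the correlation matrix, one row per domain.
for a in names:
    print(f"{a:>6}", " ".join(f"{matrix[a][b]:6.2f}" for b in names))
```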

Domain-specific loadings on overall EQI
The first principal component accounted for 39% of the total county-level non-residential ambient environmental variability. The pattern of association for the domain-specific loadings differed by rural-urban status (Table 13).
As constructed, the overall EQI has a mean of 0 and a standard deviation of 1; the index is normally distributed with a very slight left skew. In the most urban areas (RUCC 1), the built environment domain was most influential, as indicated by its highest loading value (0.52), followed by the air domain (0.51). For the non-metropolitan urbanized areas (RUCC 2), the sociodemographic and land domains loaded similarly on the overall EQI (0.60 and 0.55, respectively), followed by the built environment domain. For this particular grouping of counties, the water domain was least influential, based on its low PCA coefficient (0.30). The air domain was the least influential for the less-urbanized counties (RUCC 3; 0.16), followed by the water domain (0.30). In the most thinly populated counties, the air and water domains were characterized by the lowest loadings (0.03 and 0.13, respectively) while the sociodemographic and land domains were the most influential (loadings of 0.63 and 0.58, respectively).
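The stratified second-stage reduction described above can be sketched as follows: within each RUCC stratum, the five domain indices are standardized, the first principal component is extracted, and the resulting scores are rescaled to mean 0 and standard deviation 1. This is an illustrative sketch under stated assumptions, not the authors' implementation: the county records are synthetic, the function names and the weight of the shared component are hypothetical, power iteration replaces a library PCA routine, and rescaling the scores within each stratum is an assumed reading of the standardization.

```python
import math
import random
from collections import defaultdict
from statistics import mean, pstdev


def zscore(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def leading_vector(columns, iterations=500):
    """First principal component coefficients of standardized columns."""
    n, k = len(columns[0]), len(columns)
    corr = [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns]
            for ci in columns]
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):  # power iteration on the correlation matrix
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def stratified_index(records, domains):
    """Return {stratum: (loadings keyed by domain, standardized scores)}."""
    by_stratum = defaultdict(list)
    for rec in records:
        by_stratum[rec["rucc"]].append(rec)
    result = {}
    for stratum, rows in by_stratum.items():
        cols = [zscore([r[d] for r in rows]) for d in domains]
        v = leading_vector(cols)
        raw = [sum(c * col[i] for c, col in zip(v, cols))
               for i in range(len(rows))]
        result[stratum] = (dict(zip(domains, v)), zscore(raw))
    return result


# Synthetic counties (illustrative only): the air index is tied less strongly
# to the shared component in more rural strata, so its loading may shrink there.
rng = random.Random(2)
DOMAINS = ["air", "water", "land", "built", "sociodemographic"]
AIR_WEIGHT = {1: 0.8, 2: 0.6, 3: 0.3, 4: 0.1}  # hypothetical values
records = []
for rucc in (1, 2, 3, 4):
    for _ in range(150):
        shared = rng.gauss(0, 1)
        rec = {"rucc": rucc}
        for d in DOMAINS:
            weight = AIR_WEIGHT[rucc] if d == "air" else 0.8
            rec[d] = weight * shared + rng.gauss(0, 1)
        records.append(rec)

indices = stratified_index(records, DOMAINS)
for rucc, (loadings, scores) in sorted(indices.items()):
    print(rucc, {d: round(l, 2) for d, l in loadings.items()})
```

Because the decomposition is run separately per stratum, the same domain can carry a different loading in each stratum, which is the pattern reported in Table 13. The sign of a principal component is arbitrary, so a real implementation would also need to orient each index so that higher values correspond to poorer environmental quality.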

Description of EQI
The distribution of the RUCC-stratified EQI scores is displayed in Figure 2. For these scores, higher values tend toward poorer environments while negative values are associated with more positive domain attributes. By

Correlations with other sociodemographic features
Environmental quality is only modestly associated with age, sex and racial sociodemographic characteristics in the United States (Table 14). The lowest positive correlations are between the percent under five years of age and high values on the EQI in both the overall and in the most urban counties (0.05 and 0.02, respectively). The highest correlations, 0.60 and 0.54, were for the relationship between percent white non-Hispanic and EQI values in the non-metro and less urban counties.

Discussion
We developed an Environmental Quality Index for all counties in the United States incorporating data for five environmental domains: air, water, land, built, and sociodemographic. For each environmental domain, variables were constructed to represent exposures within that domain; indices for each domain and for environmental quality as a whole were developed by stratifying by rural-urban continuum codes. Variable loadings varied by domain and rural-urban designation, suggesting that environmental quality is driven by different domains in rural and urban areas. By virtue of the standardization used to construct the indices, approximately equal numbers of counties were at the positive end of the environmental quality spectrum as were at the negative end of the environmental quality spectrum.
The EQI is not the only index available for environmental estimation. The Environmental Performance Index (EPI), produced by a team at Yale University, is a country-level index that uses 22 performance indicators for which countries can be held accountable for environmental sustainability [53]. Both the EQI and the EPI rely on similar data sources (official statistics, monitoring data, modeled data, spatial data), prepare data similarly for variable construction (e.g., use of population denominators to construct standardized weights), and employ weighting and aggregation in construction. These similarities support the approach undertaken to construct the EQI. The EPI differs from the EQI in important ways, however. The EPI includes a substantially different set of environmental domains than the EQI, focusing on water effects (human and ecological health), air effects (human and ecological health), biodiversity and habitat, forests, fisheries, agriculture, climate change and energy. It is also constructed using target-based indicators for assessing performance on environmental health indicators rather than being purely an environmental representation. Finally, the EPI is aggregated at the country level to accommodate its international scope, while the EQI, though solely for the United States, provides much finer detail at the county level.
Another index for natural environment vulnerability was developed by the South Pacific Applied Geoscience Commission, the United Nations Environment Programme and their partners. The Environmental Vulnerability Index (EVI) [54] was developed through collaboration with countries, institutions, and experts across the globe and was designed for use with other economic and social vulnerability indices to provide insights into the processes that can negatively influence the sustainable development of countries. The EVI is based on 50 indicators for estimating country-level environmental vulnerability. Unlike the EQI, it is constructed by averaging the various measures. One limitation of the EVI is that it does not reflect environments dominated by human systems (e.g., cities, farms).
Most other environmental quality indices focus on one environmental domain (e.g., Air Quality Index [55]) or a specific type of activity (e.g., Pedestrian Environmental Quality Index [56]) or vulnerability (e.g., Cumulative Environmental Vulnerability Assessment [57], heat vulnerability index [58]). State-specific indices also exist (e.g., CalEnviroScreen 1.0 [59], Virginia Environmental Quality Index [60] and Michigan Environmental Quality Index [61]), but their comparability across states is limited by their respective data sources and construction. A major strength of the EQI is that it encompasses multiple environmental domains and all U.S. states and counties.
The EQI holds substantial promise for improving environmental estimation for public health. One important limitation of prior environmental health work has been the inability to control for the multiple environments to which people are simultaneously exposed. If these multiple human activity spaces occur within the same county, using the EQI will provide an estimate of the non-residential ambient county-level conditions to which residents are exposed, whether they are at home, at school, or at work. In addition to the EQI, each of the domain-specific indices is informative. The domain-specific loadings on the EQI indicate which of the environmental domains accounts for the largest portion of the variability in the EQI; in essence, these loadings answer the question of which domain is making the biggest contribution to the total environment. Because most environmental health practice occurs at the domain level, this domain-specific information may be even more important to policy makers and environmental health activists than the overall EQI. Drilling down further, the variable loadings on each of the domains are also informative for the same reason. In the land environment, for instance, it might be important to know if pesticides or Superfund sites seem to be contributing the largest share of variability to the land index. This information has obvious implications for public health intervention. The RUCC-stratified domains and EQI indices will also make an important public health contribution. We know urban and rural areas differ in important ways, and these RUCC-stratified indices help us disentangle which domains may be driving some of the observed rural-urban differences in public health outcomes. While the total amount of environmental variability accounted for by any given EQI domain or the overall index may be modest, they contribute more control for or explanation of non-residential environmental conditions than has heretofore been possible.
While the process and product reported here make a clear contribution to the environmental health literature, this work is not without limitations. Despite the large number of variables used for the EQI, data scarcity, in terms of spatial and temporal coverage, represents an important limitation to this work. Many of the data sources required spatial or temporal kriging to construct county-level estimates. For example, even with extensive air monitoring networks, the measured spatial coverage of the U.S. is incomplete, particularly in rural areas. Many data sources are disproportionately located in urban areas (e.g., crime data), whereas others are found in rural areas (e.g., industrial livestock operations). The non-random distribution of environmental risk means that virtually all interpolated data are inaccurate, and our ability to draw inference for data-sparse rural areas is impaired. Another potential limitation of the EQI is its construction at the county level. While the county may be too diffuse a unit to enable specific exposure assessment, it is a fair representation of the non-residential ambient environment. By explicitly describing the EQI construction process, we provide the necessary tools for interested investigators to apply at smaller units of aggregation with more specific data sources. Further, we plan to provide public access to the data used to construct the EQI on the U.S. EPA website. A third limitation results from the data that were available for EQI construction. One aspect of our literature review identifying data sources used "infant mortality and environment" as search terms. While we contend we obtained adequate representation of the five environmental domains, it is possible our use of infant mortality precluded us from finding an environmental domain. Despite this possibility, however, the index is so broadly representative of the non-residential ambient environment that it should be widely applicable to other health outcomes.
Most of the EQI data were collected for non-research purposes; therefore, the data collection methodology, quality control and reporting varied across data source, domain and variable. We endeavored to include comparable data whenever possible, but data-quality differences are important to recognize. Because we relied on available data, and not all sources of environmental quality are measured at the county level, not all potentially relevant data are represented in the EQI. However, we attempted to capture as much as was available for each of the five domains. Further, more data are collected in urban areas, which likely results in a more valid estimate of urban compared with rural environments. We have little information for Native American reservations and National Parks, for instance, which limits our ability to comment on those county spaces. In addition, the use of the EQI as a measure of exposure assumes exposure to "environment" is consistent for all individuals, but the extent of environmental exposure was not assessed. The EQI is focused mostly on the outside environment, which may not be the most relevant exposure in relation to human health and disease. Finally, population-level analyses offer little predictive utility for individual-level risk. Therefore, while the index may be useful for identifying lower quality environments that may predict population-level health outcomes, it cannot be used to predict adverse outcomes for individuals. We believe the EQI, and the approach taken for its development, represents a promising step and we encourage others to contribute additional work to this endeavor.

Conclusions
The Environmental Quality Index was constructed for all counties in the United States and incorporates a wide variety of data to provide a broad picture of environmental conditions in the United States. The approach we undertook was based on a reproducible methodology that draws mostly on publicly available data sources. Future development of the EQI includes assessing the consequences of the variable choices through sensitivity analyses, updating for 2006-2010, and exploring other levels of spatial aggregation. In this manuscript we present a valid, easily replicable methodology that can be broadly applied at different units of aggregation. As environmental public health researchers, we are fundamentally interested in the environmental contribution to human health. The EQI may aid us in developing knowledge on connections between the overall environment and human health outcomes.