Sample selection and selection of point-of-entry sources
We selected CWSs located in California’s San Joaquin Valley that were actively operating between 2005 and 2007, had at least one source with a geographic coordinate that could be used to estimate customer demographics, and had at least one active point-of-entry source with an arsenic sample reported during this period. These selection criteria resulted in a slight under-representation of smaller systems (i.e., < 200 connections) in our final sample (see Additional file 1: Table A1). Our time period represents one full compliance period under the SDWA, in which each CWS should have taken at least one arsenic sample [24].
Point-of-entry sources are those that directly enter the distribution system. We selected two types of point-of-entry sources: (1) sources in active use that had no arsenic treatment, or that treated for contaminants other than arsenic, and (2) treatment plants in active use that potentially treated for arsenic (Additional file 2: Figure A1). We used the California Department of Public Health’s (CDPH) Permits, Inspections, Compliance, Monitoring and Enforcement (PICME) database [25] to identify source types, their location in relation to the distribution system, and their possible treatment techniques. We confirmed the existence of arsenic treatment technologies with state and county regulators.
For the six CWSs with confirmed arsenic treatment plants that were in use during the study period, we used all point-of-entry sampling points prior to installation of treatment, and only sampling points from treatment plants after the installation date. For CWSs with no confirmed arsenic treatment, we selected systems where either all point-of-entry sources were labeled as untreated, or all point-of-entry sources were labeled as having treatment. In practice, a CWS may have both treated and untreated sources. However, because the CDPH databases did not allow us to accurately ascertain whether untreated sources entered the distribution system if treated sources were also available, we conservatively selected CWSs in this manner. We tested the sensitivity of this decision by comparing regression results using our final sample to results using all CWSs. Our final sample included 464 of the 671 CWSs active in the Valley from 2005 to 2007.
Outcome measures and independent variables
To assess compliance with the Revised Arsenic Rule (i.e., MCL violations) and exposure burdens, we conducted two main sets of analyses: one focused on MCL violations, the other on exposure. Specifically, for each CWS, we derived four main outcome measures: (1) officially recorded arsenic MCL violations, (2) average system and source-level arsenic concentrations, (3) population potentially exposed to arsenic, and (4) water quality samples of arsenic concentrations at point-of-entry to the distribution system. We used the first measure to analyze compliance. We used the second and third measures to derive descriptive exposure statistics and run sensitivity analyses. We used the fourth measure as the outcome variable in a linear regression model. We calculated average arsenic measures because (1) the MCL for arsenic is assessed using a running annual average of arsenic concentrations for water systems; and (2) this MCL is based on a consideration of long-term chronic exposure, making the average concentration of arsenic a suitable metric.
Arsenic MCL violations
The key outcome for our compliance analysis was officially recorded arsenic MCL violations derived from the PICME database. We created a binary variable indicating whether a system had received at least one MCL violation during the study period. This measure helped control for bias that could occur because CWSs with higher arsenic levels are required to sample more frequently [26], thereby increasing the probability that they would receive more MCL violations.
Average system and source-level arsenic concentrations
To estimate arsenic concentrations in the distribution system we used arsenic water quality sampling data for the selected point-of-entry sources from CDPH’s Water Quality Monitoring database [27] (Additional file 2: Figure A1). Previous studies have noted the benefit of using such publicly available water quality monitoring records as an alternative to costly tap water samples [28]. Using these data points, we derived the average arsenic concentration served by each CWS for the entire compliance period. We calculated this by averaging the average source concentrations for each system during our time period. As in previous studies [5, 19], we assumed average system-level concentrations represent the average arsenic concentration in water served to residents. We also calculated each CWS’s yearly average arsenic concentration to conduct sensitivity analyses. Because we did not have flow measurements for individual sources, we assumed that each point-of-entry source contributed independently, constantly and equally, to a CWS’s distribution system, regardless of season. For sampling points below the detection limit, we took the square root of the detection limit as a proxy for the arsenic concentration [29].
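The averaging steps described above can be sketched in Python. This is a minimal illustration, not the authors' code: the record layout and the function name `system_averages` are hypothetical, each source is weighted equally (as the text assumes in the absence of flow data), and non-detects are replaced by the square root of the detection limit as stated in the text.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def system_averages(samples):
    """Average of per-source averages for each CWS.

    samples: iterable of (cws_id, source_id, result, detection_limit),
    where result is None for a sample below the detection limit.
    The tuple layout is an assumption made for this sketch.
    """
    by_source = defaultdict(list)
    for cws_id, source_id, result, detection_limit in samples:
        # Non-detect substitution as described in the text.
        value = sqrt(detection_limit) if result is None else result
        by_source[(cws_id, source_id)].append(value)

    # First average within each source, then across sources (equal weights).
    source_means = defaultdict(list)
    for (cws_id, _source_id), values in by_source.items():
        source_means[cws_id].append(mean(values))
    return {cws_id: mean(means) for cws_id, means in source_means.items()}
```

Averaging source means rather than pooling all samples keeps a heavily sampled source from dominating its system's average.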
We categorized source-level and system-level averages into three concentration categories defined in relation to the revised arsenic rule (> 10 μg As/L) and the old rule (> 50 μg As/L): (1) < 10 μg As/L (“low”), (2) 10–49.9 μg As/L (“medium”), and (3) ≥ 50 μg As/L (“high”). In addition, we used average source and system-level concentrations to create binary variables that we used in bivariate analyses. Here, average levels were coded as 1 (≥ 10 μg As/L), or 0 (< 10 μg As/L).
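The category boundaries and the binary coding described above can be written as two small functions. The function and constant names are hypothetical; the cut points are those given in the text.

```python
MCL_REVISED = 10.0  # ug As/L, revised arsenic rule
MCL_OLD = 50.0      # ug As/L, old rule


def arsenic_category(avg_concentration):
    """Return 'low' (< 10), 'medium' (10 to 49.9) or 'high' (>= 50)."""
    if avg_concentration < MCL_REVISED:
        return "low"
    if avg_concentration < MCL_OLD:
        return "medium"
    return "high"


def at_or_above_revised_mcl(avg_concentration):
    """Binary coding used in the bivariate analyses: 1 if >= 10, else 0."""
    return 1 if avg_concentration >= MCL_REVISED else 0
```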
Potentially exposed population
Using a previously developed method [30] described in Balazs et al. [31], we computed the population potentially exposed to the three aforementioned exposure categories. The approach to calculate the potentially exposed population (PEP) for the high-arsenic category is summarized by the following equation:
$$\mathrm{PEP}_{ih} = X_i \cdot \frac{s_{ih}}{S_{it}} \qquad (1)$$
where $X_i$ is the total population served in CWS $i$; $s_{ih}$ is the number of sources for CWS $i$ with average arsenic concentrations classified as high ($h$); and $S_{it}$ is the total number of point-of-entry sources for CWS $i$. To calculate the PEP for the low ($l$) or medium ($m$) categories, we replaced $s_{ih}$ with $s_{il}$ or $s_{im}$, respectively. We used PICME data on the number of people served by each CWS to calculate the population size. If the number of customers served by a CWS was not available from the PICME database, we used information from the Water Quality Monitoring database. To estimate counts of potentially exposed individuals according to demographic characteristics (e.g. race/ethnicity) we multiplied the PEP in each arsenic category for each CWS by the estimated proportion of customers in each demographic subgroup for the CWS (e.g. 50% people of color), and then summed these counts across all CWSs for each arsenic category.
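As a rough illustration of the PEP calculation, the sketch below assumes that a system's population is apportioned to each arsenic category in proportion to the share of its point-of-entry sources in that category, following the definitions of X, s and S given in the text. The dictionary layout and function names are hypothetical.

```python
def potentially_exposed_population(population, n_sources_in_category, n_sources_total):
    """Population served times the fraction of sources in the category."""
    return population * n_sources_in_category / n_sources_total


def exposed_by_subgroup(systems, category, subgroup):
    """Sum subgroup-specific PEP counts across all CWSs for one category.

    systems: list of dicts with keys 'population', 'sources' (count of
    sources per category) and 'share' (proportion of customers per
    demographic subgroup). This layout is an assumption for the sketch.
    """
    total = 0.0
    for system in systems:
        n_total = sum(system["sources"].values())
        pep = potentially_exposed_population(
            system["population"], system["sources"][category], n_total
        )
        total += pep * system["share"][subgroup]
    return total
```

For example, a system serving 1,000 people with one of four sources in the high category and 50% people of color would contribute 125 people of color to the high-arsenic count.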
Concentration of arsenic at point-of-entry
Arsenic sampling data for each point-of-entry source were used as the outcome variable in our regression model, as described under “Regression Model” below.
Analyses
Compliance analyses
We used our binary arsenic MCL violation variable to analyze whether CWSs with higher fractions of people of color or lower SES faced greater compliance violations. Because only 34 CWSs had at least one MCL violation we did not have enough outcomes to use multivariate regression techniques. Instead we ran Fisher’s Exact tests for contingency tables, comparing the presence of at least one MCL violation to CWSs with high or low levels of our variables of interest (i.e. race/ethnicity or homeownership). To determine the threshold for high and low levels of race/ethnicity (i.e., percent people of color) or homeownership rate we used the median value of these variables.
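The median split and the exact test can be sketched with the Python standard library alone. This is an illustration, not the authors' Stata code: the function names are hypothetical, systems exactly at the median are assigned to the low group (the text does not say how ties were handled), and the two-sided p-value sums the probabilities of all tables no more likely than the observed one, which is one common convention.

```python
from math import comb
from statistics import median


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


def violation_table(systems):
    """Build [[high & violation, high & none], [low & violation, low & none]].

    systems: list of (percent_people_of_color, has_violation) pairs.
    """
    cut = median(pct for pct, _ in systems)
    table = [[0, 0], [0, 0]]
    for pct, has_violation in systems:
        row = 0 if pct > cut else 1  # assumption: ties go to the low group
        col = 0 if has_violation else 1
        table[row][col] += 1
    return table
```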
To consider the impact of under- or mis-reported violations, we ran sensitivity analyses in which we replaced official MCL violations with the number of CWSs with any source whose average yearly arsenic concentrations exceeded the MCL during the study period, and the number of CWSs with any source whose compliance period average exceeded the MCL. This allowed for an approximation of whether a system may have exceeded the MCL (and so should have been issued an MCL violation) since arsenic MCL violations are based on a running annual average [26]. Thus these sensitivity analyses should capture differences due to MCL exceedances that went under-reported.
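The two proxy measures used in these sensitivity analyses can be sketched as follows. The record layout and function name are hypothetical, and a simple calendar-year mean is used here as a stand-in for the regulatory running annual average.

```python
from collections import defaultdict
from statistics import mean

MCL = 10.0  # ug As/L, revised arsenic rule


def systems_exceeding_mcl(samples, by_year=True):
    """Return the set of CWS ids with any source whose average exceeds the MCL.

    samples: iterable of (cws_id, source_id, year, concentration).
    by_year=True averages each source within each year; by_year=False
    averages each source over the whole compliance period.
    """
    groups = defaultdict(list)
    for cws_id, source_id, year, concentration in samples:
        key = (cws_id, source_id, year) if by_year else (cws_id, source_id)
        groups[key].append(concentration)
    return {key[0] for key, values in groups.items() if mean(values) > MCL}
```

A system can be flagged under the yearly measure but not under the compliance-period measure when one high year is offset by lower concentrations in the other years.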
Exposure analyses
To assess the relationship between demographics of customers served by CWSs and potential exposure, we first examined the demographic characteristics of the population potentially exposed to three different arsenic levels, and additional characteristics of the systems at those levels. To further analyze the relationship, we used our binary variables for average system-level arsenic concentrations to conduct Fisher’s Exact tests.
Finally, we examined the relationship between system-level demographics and arsenic levels using our continuous measure of arsenic concentrations. We used a linear cross-sectional regression model with robust standard errors to account for clustering. To derive the inference, we clustered outcomes at the CWS-level (i.e. point-of-entry arsenic concentrations measured on a given day for a given source). Our final model reported sandwich-type robust standard errors [32] that allowed for arbitrary correlation, including correlation within CWS units. The a priori selected model controlled for known or hypothesized potential system-level confounders.
The model’s outcome variable, $Y_{ijk}$, was arsenic concentration for the $i$th water system, the $j$th source in system $i$, on day $k$ (since January 1st, 2005). While arsenic samples from individual sources were our outcome measurements, the CWS was the primary unit of analysis, consistent with other calculations above. Our final model did not re-weight CWSs with more samples; thus systems with more measurements contributed more to the estimates. We addressed this issue by stratifying by system size to see if smaller CWSs (with fewer samples) had a different effect on water quality than larger CWSs.
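One way to write the model described here is given below. The coefficient symbols, the covariate labels POC and HO, and the vector Z of control variables are notation introduced for illustration and do not appear in the original. The second expression is the standard cluster-robust (sandwich) variance estimator with clusters defined by CWS, shown without any finite-sample correction that a statistical package may apply.

```latex
% Linear model: sample on day k from source j in water system i
Y_{ijk} = \beta_0 + \beta_1\,\mathrm{POC}_i + \beta_2\,\mathrm{HO}_i
          + \boldsymbol{\gamma}^{\top}\mathbf{Z}_{ijk} + \varepsilon_{ijk}

% Cluster-robust variance, clusters indexed by CWS i;
% X_i and \hat{e}_i stack the design rows and residuals of all samples in CWS i
\widehat{\mathrm{Var}}(\hat{\boldsymbol{\beta}})
  = (\mathbf{X}^{\top}\mathbf{X})^{-1}
    \Bigl(\sum_{i} \mathbf{X}_i^{\top}\hat{\mathbf{e}}_i
          \hat{\mathbf{e}}_i^{\top}\mathbf{X}_i\Bigr)
    (\mathbf{X}^{\top}\mathbf{X})^{-1}
```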
Key independent variables were the percentage of people of color served by CWSs (referent category non-Latino whites) and percent home ownership in each CWS. Home ownership rate is a proxy metric for wealth and political representation [33]. We used this SES measure as an indicator of the economic resources available to a water system to mitigate contamination [34]. Race/ethnicity and home ownership data were derived from the 2000 U.S. Census, measured at the CWS-level, and assumed to be constant for all three years [35]. Since CWS service areas do not follow Census boundaries we used a spatial approach in Geographic Information Systems (GIS) to estimate demographic variables for each CWS. In brief, for each CWS, we estimated a population-weighted average of each variable across all block groups that contained sources for the CWS. This value was used to derive a percent estimate of demographic characteristics (e.g. 50% homeownership) served by that CWS [31].
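The population-weighted averaging step can be illustrated with a short function. The input layout and function name are hypothetical, and the GIS step that identifies which block groups contain a system's sources is not shown.

```python
def population_weighted_share(block_groups):
    """Population-weighted average of a demographic proportion.

    block_groups: list of (population, proportion) pairs, one per Census
    block group containing at least one source for the CWS.
    """
    total_population = sum(population for population, _ in block_groups)
    if total_population == 0:
        raise ValueError("block groups have no population")
    weighted_sum = sum(population * proportion for population, proportion in block_groups)
    return weighted_sum / total_population
```

For instance, block groups of 1,000 people at 60% homeownership and 3,000 people at 20% homeownership give a system-level estimate of 30%.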
We controlled for other potentially confounding water system characteristics including: source of water (ground water or groundwater and surface water versus surface water alone); system ownership (public, privately owned and not regulated by the Public Utility Commission (PUC), with private PUC-regulated as referent category); geographic location (Valley floor and foothills, with mountains as referent category); season (summer/fall or winter/spring); year of sampling (2006 and 2007, with 2005 as referent category); and number of service connections (< 200 or ≥ 200 connections). We determined ownership structure by combining data in PICME with data from the PUC’s list of regulated systems. We obtained all other characteristics from PICME. With the exception of year and season, all covariates were measured at the water system level.
We stratified by system size to assess whether demographic effects on water quality might be stronger among smaller systems, and to test the hypothesis that scale alone explains water quality. We used number of connections as a threshold for small versus large CWSs, where those with fewer than 200 connections are considered “small” [26]. We used our final model to estimate the amount of arsenic contamination attributable to the proportion of the population that are homeowners by predicting expected values for each observation if percent homeownership equaled 100%, as described by Greenland and Drescher [36]. All statistical analyses were conducted using Stata v10 (College Station, Texas). We used Stata’s cluster option to derive robust standard errors.
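The counterfactual prediction can be sketched for the simple case of a linear model, where the difference between the fitted value at observed homeownership and the fitted value at 100% homeownership depends only on the homeownership coefficient. This is a simplified illustration under that assumption, not a full implementation of the method in [36]; the function name and the coefficient value in the usage note are hypothetical.

```python
from statistics import mean


def mean_attributable_arsenic(percent_homeownership, beta_homeownership):
    """Mean difference between predictions at observed and 100% homeownership.

    percent_homeownership: one value (0 to 100) per observation.
    beta_homeownership: fitted coefficient per percentage point (hypothetical).
    In a linear model all other terms cancel in the difference.
    """
    differences = [
        beta_homeownership * (observed - 100.0)
        for observed in percent_homeownership
    ]
    return mean(differences)
```

With a hypothetical coefficient of -0.04 μg As/L per percentage point and observations at 50%, 100% and 75% homeownership, the mean attributable concentration would be about 1.0 μg As/L.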