We monitored 24-h household (kitchen and living area) concentrations of PM2.5 in 617 rural households from 4 states in India on a cross-sectional basis. We then developed and validated log-linear regression models that predicted household concentrations as a function of multiple independent household variables, and subsequently generated state and national estimates using household survey data from the Indian National Family Health Survey (NFHS, 2005) [23] in three stages, as described below.
Stage 1: Household monitoring for PM2.5
Selection of states and households for air pollution monitoring
Six hundred and seventeen households in four geographically and culturally distinct states of India (Central: Madhya Pradesh (MP); South: Tamil Nadu (TN); North: Uttaranchal (UA); and East: West Bengal (WB)) were recruited between November 2004 and March 2005 for household measurements. The states were chosen primarily to provide a representative basis for the model. Selecting households across the country to generate a representative, measured national estimate was not feasible on account of financial and logistic constraints.
Multi-stage sampling was used to randomly select two districts from each state and three villages from each district. In each village, approximately 25 households were selected by stratified random sampling based on fuel and kitchen type, resulting in around 150–155 households from each state. Each village encompassed as many as several hundred households. To select the study households, the field team first conducted a rapid assessment of all households in the village. The team members went to each household and asked several short questions, including ones about primary fuel type and kitchen type. After completion of the rapid assessment, a stratified random sample of 25 households, based on fuel and kitchen type, was drawn. The following day, these households were invited to participate in the study. Urban households could not be included (we elaborate on this in the discussion section).
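The village-level draw described above can be sketched as follows. This is a minimal illustration, not the field protocol: proportional allocation across strata, the minimum of one household per stratum, and the names `stratified_sample`, `fuel` and `kitchen` are all assumptions, since the text states only that the sample was stratified by fuel and kitchen type.

```python
import random
from collections import defaultdict


def stratified_sample(households, n_total=25, seed=0):
    """Draw roughly n_total households across (fuel, kitchen) strata.

    Allocation is proportional to stratum size (an assumption); rounding
    means the result can differ slightly from n_total.
    """
    rng = random.Random(seed)

    # Group the rapid-assessment records by stratum.
    strata = defaultdict(list)
    for hh in households:
        strata[(hh["fuel"], hh["kitchen"])].append(hh)

    total = len(households)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # At least one household per stratum, never more than it contains.
        k = max(1, round(n_total * len(members) / total))
        k = min(k, len(members))
        sample.extend(rng.sample(members, k))
    return sample
```

Because each stratum's share is rounded separately, a village with many small strata may yield slightly more or fewer than 25 households; a real protocol would need a rule for reconciling the total.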
Informed consent was obtained from all study households prior to any assessments. The protocols for measurements were approved by the human subjects committees of Sri Ramachandra University and The University of California, Berkeley. All household assessments including questionnaire administration and air pollution measurements were performed shortly after recruitment and simultaneously in the four states using four field teams. Field teams were trained jointly by the core investigators prior to deputing the teams for field work. A manual containing standard operating procedures was provided to all field team members for respective data collection tasks. Field data collection was completed between November 2004 and March 2005.
Measurement of PM2.5 concentrations in multiple household micro-environments
PM2.5 concentrations were measured over 24 ± 2 h in the kitchen and living area microenvironments of all study households using UCB Particle and Temperature (UCB-PATs) monitors. Gravimetric instruments (portable constant-flow SKC pumps Model 224-PCAR8, SKC, Eighty-Four, PA, USA) were co-located with the UCB-PATs in a subset (10%) of the study households for validation.
Instruments were placed in the kitchen area or living area according to the following standard protocol: (1) approximately 100 cm from the stove (for kitchen area measurements) (2) at a height of 145 cm above the floor (as close as possible to the primary sleeping or sitting area for living area measurements) and (3) at least 150 cm away (horizontally) from doors and windows, where possible (for outdoor kitchen areas we used only the first two criteria). (Note: The living area was defined as the room outside of the kitchen area where household members spend the most time; it was typically a common multipurpose area and sometimes a separate bedroom. In households with a single common area used for cooking and sleeping, a separate living area could not be defined and measurements were taken only in the kitchen area as per above mentioned criteria).
UCB-PATs were used as per validated methods published previously [24, 25]. Briefly, monitors were calibrated with combustion aerosols (e.g. wood and charcoal) and against temperature in the laboratory before being used in the field. Particle coefficients were derived for each instrument in the field through co-location of UCB-PATs monitors and gravimetric samplers in around 15% of households (n = 96). All UCB-PATs were zeroed in a Ziploc bag for a period of 30 to 60 minutes before and after deployment. Particle and temperature coefficients along with the results from zeroing were subsequently used in the data processing algorithm. After monitoring, all data files were batch processed using a customized software package developed for this device. This process produced a master data sheet, which was manually scanned for errors before creating an individual .csv file for each monitoring period.
Gravimetric PM2.5 samples were collected using methods published previously [8]. Briefly, samples were collected using a BGI triplex cyclone (SCC1.062, Waltham, MA) with portable constant-flow SKC pumps (Model 224-PCAR8, SKC, Eighty-Four, PA, USA) equipped with a 37-mm diameter Teflon filter (pore size 0.45 μm, also supplied by SKC) at a flow rate of 1.5 l/min. Filters were weighed using a Thermo Cahn C-34 Microbalance (Thermo Scientific, Waltham, MA, USA) at Sri Ramachandra University and a Mettler Toledo MT5 balance (Mettler, Greifensee, Switzerland) at The Energy Research Institute in New Delhi. Both balances operated at a resolution of 0.1 μg and were used according to the same standard operating procedure. All filters were conditioned in a temperature- and relative humidity-controlled room before weighing. Approximately twenty percent of the gravimetric samples (collected from 96 households) were paired with field blanks (n = 18); none of the pre- and post-field blank weights differed by more than 0.003 mg.
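For orientation, a gravimetric concentration is the net filter mass divided by the sampled air volume. The sketch below uses the stated flow rate of 1.5 l/min over a nominal 24 h. The function name and the optional `blank_shift_mg` correction are hypothetical; the text does not say whether a blank correction was applied.

```python
def gravimetric_concentration_ug_m3(pre_mg, post_mg, flow_l_min=1.5,
                                    minutes=24 * 60, blank_shift_mg=0.0):
    """Return a PM2.5 concentration in micrograms per cubic metre.

    pre_mg / post_mg: filter weights before and after sampling (mg).
    blank_shift_mg:   optional field-blank correction (assumed, default 0).
    """
    # Net particle mass on the filter, converted from mg to micrograms.
    net_mass_ug = (post_mg - pre_mg - blank_shift_mg) * 1000.0
    # Sampled air volume: litres per minute times minutes, converted to m^3.
    volume_m3 = flow_l_min * minutes / 1000.0
    return net_mass_ug / volume_m3
```

At 1.5 l/min for 24 h the sampled volume is 2.16 m³, so a filter mass gain of 0.216 mg corresponds to 100 μg/m³.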
Stage 2: Development of models to estimate household concentrations of PM2.5 on the basis of household determinants
Questionnaires were administered in all study households to collect information on a range of household variables. These primarily included physical variables likely to directly influence household concentrations, such as fuel type, kitchen location, stove type, ventilation, fuel quantity and cooking duration. Information on indicators of other indoor sources of particulate matter was also captured by recording use of solid fuels for heating, indoor smoking, number of hours without electricity (indicative of use of kerosene-based lamps for lighting) and use of incense or mosquito coils. Variables likely to indirectly influence concentrations, such as house type, ethnicity and income, as well as behavioral variables such as meal type and type of cooking tasks, were collected by a larger socio-demographic survey conducted in the same villages by another team of collaborators but could not be included in the analyses for this paper. We first developed models to estimate kitchen area concentrations (from measurements conducted in 617 households) in relation to these variables. Most household variables related to cookfuel use are likely to directly influence kitchen area concentrations, with living area concentrations in turn being influenced by the respective kitchen area concentrations. We therefore developed regression equations for the relationship between kitchen and living area concentrations (from paired measurements in 427 households) in order to derive living area concentrations from measured or modeled kitchen concentrations. We describe the procedures for modeling the kitchen and living area concentrations separately in greater detail below.
Estimation of kitchen area concentrations
We developed multiple regression models to relate the measured kitchen area concentrations of PM2.5 to categorical and continuous household variables. A Box-Cox procedure was used to select the optimal transformation of the dependent variable. One-way Analysis of Variance (ANOVA) models were fitted to each of the categorical and continuous predictors; predictors that led to a significant F-test (p < 0.05) were selected for inclusion in the multiple regression model. This resulted in the inclusion of fuel type, kitchen type, kitchen ventilation, state (a proxy for geographical location) and cooking duration as primary model variables.
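The Box-Cox step can be illustrated with a minimal sketch. It assumes an intercept-only model; in a regression setting the likelihood is profiled over the residual variance of the fitted model instead. The function names and the grid of candidate λ values are hypothetical.

```python
import math
from statistics import NormalDist


def boxcox_transform(y, lam):
    """Box-Cox transform of positive values; lam == 0 is the log transform."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def boxcox_loglik(y, lam):
    """Profile log-likelihood of lam for an intercept-only normal model."""
    n = len(y)
    t = boxcox_transform(y, lam)
    mean_t = sum(t) / n
    var_t = sum((v - mean_t) ** 2 for v in t) / n
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(var_t) + (lam - 1.0) * sum(math.log(v) for v in y)


def best_lambda(y, grid=None):
    """Return the grid value of lam with the highest profile log-likelihood."""
    grid = grid or [i / 10 for i in range(-20, 21)]
    return max(grid, key=lambda lam: boxcox_loglik(y, lam))


# Synthetic, exactly log-normal data built from standard normal quantiles.
z = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
demo = [math.exp(v) for v in z]
print(best_lambda(demo))
```

A selected λ close to zero is the case in which a log transformation of the concentrations is appropriate.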
Fuel type (labeled as “Fuel” in the model) was classified as wood, dung, kerosene or LPG. (Note: fuel type refers to the primary fuel used during the monitoring period and may not reflect average fuel use in these households.) Kitchen type/location (labeled as “Kit” in the model) was classified as outdoor kitchen (ODK), separate (often semi-enclosed) outdoor kitchen (SOK), indoor kitchen partitioned from the rest of the living area (IWPK) and indoor kitchen without partitions (IWOPK), i.e. common living and cooking areas. Kitchen ventilation (labeled as “Vent” in the model) was classified as good, moderate or poor on the basis of self-reported availability of windows, ventilation, open eaves, and the presence of chimneys and fans inside the kitchen area. Each of the 4 states was assigned to one of four geographic regions (labeled as “Reg” in the model), viz. Uttaranchal (North), West Bengal (East), Madhya Pradesh (Central) and Tamil Nadu (South). Information on kerosene lamp use and on mosquito coil and incense usage was collected from households, but the large number of missing observations precluded their use in the model. Stove type added no information beyond fuel type, as nearly all solid cookfuels were used in traditional stoves (simple 3-stone fires or stoves built by the household using locally available materials including mud, plaster or metal), and was therefore excluded from analyses. Accordingly, the following regression model was fitted to the data:
(1)
where I(X = L) = 1, if the categorical variable X assumes the level ‘L’, else 0
Reference categories included “LPG” for fuel, “outdoor kitchen” for kitchen type/location, “good” for ventilation and “North” for region respectively.
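As a reading aid, a model with the variables, indicator notation and reference categories described here would plausibly take the general form below. This is a sketch and not the published equation 1: the log transformation is inferred from the description of the models as log-linear, and the coefficient symbols (b with superscripts) and the name CookHrs for cooking duration are assumptions. K denotes the 24-h kitchen area PM2.5 concentration.

```latex
\ln(K) = b_0
  + \sum_{f \in \{\text{wood},\,\text{dung},\,\text{kerosene}\}} b^{F}_{f}\, I(\mathrm{Fuel} = f)
  + \sum_{k \in \{\text{SOK},\,\text{IWPK},\,\text{IWOPK}\}} b^{K}_{k}\, I(\mathrm{Kit} = k)
  + \sum_{v \in \{\text{moderate},\,\text{poor}\}} b^{V}_{v}\, I(\mathrm{Vent} = v)
  + \sum_{r \in \{\text{East},\,\text{Central},\,\text{South}\}} b^{R}_{r}\, I(\mathrm{Reg} = r)
  + b^{C}\,\mathrm{CookHrs} + \varepsilon
```

Each sum runs over the non-reference levels only, so the intercept corresponds to an LPG-using household with an outdoor kitchen and good ventilation in the North region.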
Estimation of living area concentrations
Most household variables related to cookfuel use are likely to directly influence kitchen area concentrations, with living area concentrations in turn, being influenced by respective kitchen area concentrations. We therefore examined the relationship between kitchen and living area concentrations in paired measurements in order to be able to derive the living area from the kitchen concentrations.
Since the relationship between measured living area and kitchen area concentrations was not linear, we fitted a regression on the log scale to the paired kitchen area-living area measurements,
$$\ln(L/K) = \alpha + \beta \ln(K) \qquad (2)$$
where L = 24-h living area PM2.5 concentration and K = 24-h kitchen area PM2.5 concentration.
Expressing equation 2 as $L = \delta K^{1+\beta}$, where $\delta = e^{\alpha}$, and applying the values of δ = 0.147 and β = -0.680 obtained from the regression, living area concentrations were finally estimated by equation 3 below,
$$L = 0.147\, K^{0.320} \qquad (3)$$
Modeled estimates for living area concentrations were thus derived by first applying equation 1 to estimate kitchen area concentrations as a function of household determinants and then applying equation 3 to derive living area concentrations as a function of the respective estimated kitchen area concentrations.
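The second step of this chain, L = δK^(1+β) with the reported δ = 0.147 and β = -0.680, can be written as a short function. This is a minimal sketch; the function name is hypothetical, and the concentration units are those used in the original regression, which this section does not state.

```python
DELTA = 0.147   # delta = e^alpha, as reported in the text
BETA = -0.680   # beta, as reported in the text


def living_from_kitchen(kitchen_conc):
    """Estimate living area concentration from kitchen area concentration.

    Implements L = DELTA * K ** (1 + BETA); K must be positive and in the
    same units as the data used to fit the regression.
    """
    if kitchen_conc <= 0:
        raise ValueError("kitchen concentration must be positive")
    return DELTA * kitchen_conc ** (1.0 + BETA)
```

Because the exponent 1 + β = 0.320 is below 1, the estimated living-to-kitchen ratio falls as the kitchen concentration rises.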
Finally, correlations between measured vs. modeled values were estimated using Pearson’s correlation coefficients.
Stage 3: Generation of state and national estimates for household concentrations
Generating state and national estimates from information on household variables required matching the variables from the study household questionnaires to the variables in the much larger national Indian NFHS 2005 survey (while recognizing that national surveys may not be able to capture household information at the same level of detail). Three of the five significant predictor variables for the model (primary fuel use, kitchen type/location and geographical region) were identical in both datasets (i.e. the study questionnaire and the Indian NFHS). Information on the other two (cooking duration and kitchen ventilation), however, was available only in the study dataset and was not captured in the Indian NFHS survey. We thus had to impute these values for the Indian NFHS dataset as follows.
We imputed cooking duration by linear regression of cooking hours on the number of household members and type of fuel in the study household dataset as
(4)
Similarly, a polytomous regression model was used to impute kitchen area ventilation in terms of living room ventilation and kitchen (type)/ location allowing for possible interactions as
(5)
Once information on all significant predictor variables (actual or imputed) was assembled for the Indian NFHS 2005 household data set, coefficients from the multiple regression equation (1) were then applied to estimate household concentrations. Finally, predicted household concentrations were combined to generate state and national estimates using the state and national sampling weights used by the Indian NFHS.
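The final aggregation step amounts to a survey-weighted mean of the predicted household concentrations. The sketch below assumes a single weight per household and averaging on the original concentration scale. Both are simplifications, since the NFHS supplies separate state and national weights and the text does not say on which scale the averaging was done. The function names and the record layout are hypothetical.

```python
from collections import defaultdict


def weighted_mean(pairs):
    """Weighted mean of (value, weight) pairs."""
    total_w = sum(w for _, w in pairs)
    if total_w <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(v * w for v, w in pairs) / total_w


def weighted_estimates(records):
    """records: list of (state, predicted_concentration, sampling_weight).

    Returns (dict of state-level weighted means, national weighted mean).
    """
    by_state_pairs = defaultdict(list)
    for state, conc, weight in records:
        by_state_pairs[state].append((conc, weight))
    by_state = {s: weighted_mean(p) for s, p in by_state_pairs.items()}
    national = weighted_mean([(c, w) for _, c, w in records])
    return by_state, national
```

With separate state and national weights, the same `weighted_mean` helper would simply be called once with each weight column.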
Stage 4: Assessing model accuracy through k-fold cross validation and bootstrapping
We applied cross-validation and bootstrapping methods to estimate the accuracy of the models developed in earlier stages. We first performed a k-fold cross-validation for the household model (described in Stage 2) by excluding the households of each of the 24 villages (~25 households each) in turn. The 24-fold cross-validation (using the log-transformed 24-h kitchen concentration dataset) provided an overall correlation coefficient between modeled and measured values.
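Holding out one village at a time can be sketched as below. This is an illustration under assumptions: the model is passed in as generic `fit` and `predict` callables, and the category-mean model is a hypothetical stand-in for the multiple regression of Stage 2. Only the fold structure and the pooled Pearson correlation follow the text.

```python
import math


def leave_one_group_out(groups):
    """Yield (group, train_indices, test_indices) for each distinct group."""
    for g in sorted(set(groups)):
        test_idx = [i for i, v in enumerate(groups) if v == g]
        train_idx = [i for i, v in enumerate(groups) if v != g]
        yield g, train_idx, test_idx


def cross_validated_predictions(x, y, groups, fit, predict):
    """Predict every observation from a model fitted without its own group."""
    preds = [None] * len(y)
    for _, train, test in leave_one_group_out(groups):
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in test:
            preds[i] = predict(model, x[i])
    return preds


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)


def fit_category_means(x_train, y_train):
    """Stand-in model: mean log concentration per predictor category."""
    sums = {}
    for cat, val in zip(x_train, y_train):
        s, c = sums.get(cat, (0.0, 0))
        sums[cat] = (s + val, c + 1)
    return {cat: s / c for cat, (s, c) in sums.items()}


def predict_category_mean(model, x_i):
    """Look up the fitted mean for one observation's category."""
    return model[x_i]
```

With 24 village labels in `groups`, `cross_validated_predictions` returns one out-of-village prediction per household, and `pearson_r(preds, y)` gives the overall modeled-versus-measured correlation.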
Bootstrapping was then used to estimate the standard error of prediction for the national model (described in Stage 3). To compute the bootstrap standard error of the kitchen area PM2.5 estimates, we first generated 200 constructed datasets (replicates) of PM2.5 from the fitted model evaluated at X, where X refers to the vector of all the predictors in a household. Each constructed dataset was required to be of the same size as the original data and was based on the estimated parameters and empirical predictors. The model was applied to each of the 200 constructed datasets (the estimates started to converge after 100 replicates, and this number was doubled to allow an additional margin for stability) to obtain the empirical standard deviation of each parameter along with that of the error variance. We used the empirical standard deviation of the error variance, taken as its standard error, to obtain the bootstrap standard error of the predicted PM2.5 concentrations.
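A generic parametric bootstrap of this kind can be sketched as follows. This is not the authors' exact procedure: it assumes a single continuous predictor, ordinary least squares and Gaussian errors, and the function names are hypothetical. Each replicate keeps the empirical predictor values and regenerates the response from the estimated parameters, so it has the same size as the original data.

```python
import math
import random


def fit_ols(x, y):
    """Simple linear regression; returns (intercept, slope, error variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, rss / (n - 2)


def parametric_bootstrap(x, y, n_rep=200, seed=0):
    """Empirical SDs of the refitted parameters over n_rep replicates."""
    rng = random.Random(seed)
    a, b, s2 = fit_ols(x, y)
    sd = math.sqrt(s2)

    fits = []
    for _ in range(n_rep):
        # Same predictors, response regenerated from the fitted model.
        y_star = [a + b * xi + rng.gauss(0.0, sd) for xi in x]
        fits.append(fit_ols(x, y_star))

    def sd_of(k):
        vals = [f[k] for f in fits]
        m = sum(vals) / len(vals)
        return math.sqrt(sum((v - m) ** 2 for v in vals) / (len(vals) - 1))

    return {"intercept_sd": sd_of(0), "slope_sd": sd_of(1), "sigma2_sd": sd_of(2)}
```

Rerunning with `n_rep=100` and `n_rep=200` and comparing the returned standard deviations is one simple way to check the kind of convergence the text describes.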