Considerations in the use of different spirometers in epidemiological studies

Background Spirometric lung function measurements have been proven to be excellent objective markers of respiratory morbidity. The use of different types of spirometers in epidemiological and clinical studies may present systematically different results affecting interpretation and implication of results. We aimed to explore considerations in the use of different spirometers in epidemiological studies by comparing forced expiratory volume in 1 s (FEV1) and forced vital capacity (FVC) measurements between the Masterscreen pneumotachograph and EasyOne spirometers. We also provide a correction equation for correcting systematic differences using regression calibration. Methods Forty-nine volunteers had lung function measured on two different spirometers in random order with at least three attempts on each spirometer. Data were analysed using correlation plots, Bland and Altman plots and formal paired t-tests. We used regression calibration to provide a correction equation. Results The mean (SD) FEV1 and FVC was 3.78 (0.63) L and 4.78 (0.63) L for the Masterscreen pneumotachograph and 3.54 (0.60) L and 4.41 (0.83) L for the EasyOne spirometer. The mean FEV1 difference of 0.24 L and mean FVC difference of 0.37 L between the spirometers (corresponding to 6.3 and 8.4% difference, respectively) were statistically significant and consistent between younger (< 30 years) and older volunteers (> 30 years) and between males and females. Regression calibration indicated that an increase of 1 L in the EasyOne measurements corresponded to an average increase of 1.032 L in FEV1 and 1.005 L in FVC in the Masterscreen measurements. Conclusion Use of different types of spirometers may result in significant systematic differences in lung function values. Epidemiological researchers need to be aware of these potential systematic differences and correct for them in analyses using methods such as regression calibration.


Background
Spirometry is a commonly used test of lung function and an important tool in the diagnosis and monitoring of respiratory diseases. It is frequently used in epidemiological and clinical research [1]. Results of spirometry tests depend on several factors, including technical factors such as the type of spirometer used, personal factors such as a subject's posture, and the cooperation between the subject and the technician, all of which need to be considered in clinical and epidemiological studies.
Despite potential differences between spirometers, there may be compelling reasons to use different spirometers in clinical and epidemiological research. In large-scale multicentre studies, for example, more than one spirometer of the same type, or spirometers of different types, may be used in different centres for efficiency reasons. In follow-up studies, there may be a need to replace older spirometers with newer ones.
Comparisons between different types of spirometers as well as similar types of spirometers have been performed in several studies [2][3][4][5]. Systematic differences between different types of spirometers have been reported [2,4]. Such differences can bias exposure-health relationships in studies where the use of a specific spirometer is associated with exposure, e.g. in multi-centre studies of effects of ambient air pollution where different spirometers are used in different study regions with different levels of exposure. Adjustment for type of spirometer is one possibility to account for systematic differences between spirometers. However, this may result in over-adjustment if region is also an important determinant of exposure. Methods such as regression calibration are more suitable in such situations, but require data on comparability of devices [6].
In this study we compared FEV1 and FVC measurements from two widely used spirometers, the Masterscreen pneumotachograph and the EasyOne spirometer, that were simultaneously used in the Prevention and Incidence of Asthma and Mite Allergy (PIAMA) birth cohort study. We also investigated comparability between two EasyOne spirometers. We used the obtained measurements to provide a correction equation to adjust for differences between the spirometers in an epidemiological study.

Comparison study design and study population
Two series of spirometry tests were performed in volunteers by trained research staff between April and May 2017. In the first test series, which we consider our main comparison and which was performed at the University Medical Center Groningen, we compared the Masterscreen pneumotachograph with an EasyOne spirometer (referred to here as EasyOne1). Two highly experienced and trained technicians conducted the spirometry measurements in the first test series (one with the Masterscreen pneumotachograph and one with the EasyOne1). By design, each technician used a different spirometer to reflect a real-life multicentre research setting in which different spirometers are used in different centres by different technicians. In the second series, one of the technicians involved in the first test series performed the tests at Utrecht University, and the EasyOne1 from the first series was compared to a second EasyOne spirometer of the same generation, referred to as EasyOne2 (both purchased in 2008). In both series, all volunteers performed tests on both spirometers in random order but in immediate succession to eliminate confounding by individual characteristics. Forced expiratory volume in 1 s (FEV1) and forced vital capacity (FVC) were measured in sitting position, with the volunteer wearing a nose clip. Measurements that fulfilled the ATS/ERS criteria [1] (difference between the largest and next largest value ≤ 150 mL for FEV1 and FVC) were included in the analysis (n = 45 for each of the series). In addition, test results that did not meet these criteria but were obtained from otherwise technically acceptable flow-volume curves, with the difference between the largest and next largest values for FEV1 and FVC ≤ 200 mL, were included (n = 4 for each of the two series), as in previous analyses [7]. Zero flow was established before each measurement with both devices. For each test series, the final study population consisted of 49 volunteers.
Information on ethnicity, self-reported weight, height and age of volunteers was also collected.
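The inclusion rule described above can be illustrated with a minimal Python sketch. It assumes that only the numeric thresholds from the text (≤ 150 mL and ≤ 200 mL between the two largest values) are checked; the function name and the returned labels are hypothetical, and the visual judgement of whether a flow-volume curve is technically acceptable is not modelled.

```python
def repeatability_class(values_l, strict_ml=150, relaxed_ml=200):
    """Classify one index (FEV1 or FVC) from a volunteer's attempts, in litres.

    Thresholds follow the text: <= 150 mL between the two largest values
    meets the criterion; <= 200 mL was additionally accepted in this study
    when the curves were otherwise technically acceptable (not checked here).
    """
    if len(values_l) < 2:
        raise ValueError("need at least two attempts")
    largest, next_largest = sorted(values_l, reverse=True)[:2]
    # Round to whole millilitres to avoid floating-point edge effects.
    diff_ml = round((largest - next_largest) * 1000)
    if diff_ml <= strict_ml:
        return "meets_criterion"
    if diff_ml <= relaxed_ml:
        return "relaxed_inclusion"
    return "excluded"
```

In practice the function would be applied to both FEV1 and FVC for each volunteer and each spirometer, and a session would be kept only if both indices fall in one of the two accepted classes.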

The PIAMA cohort
The PIAMA birth cohort is a Dutch population-based study that started in 1996/97 with 3963 new-borns and has been extensively described elsewhere [8]. Follow-ups were conducted at the child's age of 3 months, yearly until age 8, and then at ages 11, 14, 16 and 17 years. Medical examinations with measurements of lung function including FEV1 and FVC and anthropometric characteristics such as weight and height were conducted at ages 8, 12 and 16. At age 16, lung function measurements were obtained in 721 participants. Both the Masterscreen pneumotachograph (CareFusion, Yorba Linda, CA, USA) and EasyOne spirometers (NDD Medical Technologies, Inc., Switzerland) were used to measure FEV1 and FVC at age 16 in two centres, Groningen and Utrecht respectively. We applied the correction equation in the current study to lung function data from the PIAMA cohort measured at age 16.
Ethical approval of the current study was obtained from the medical ethical review board of the University Medical Center Groningen (ref no. M17.220613) and all volunteers provided consent to participate.
The Masterscreen pneumotachograph is one of the most widely used pulmonary function systems. It measures lung volumes indirectly with a pneumotachograph using the pressure difference over a small, fixed resistance, offered by a fine metal mesh [9]. In brief, it measures the pressure drop when a patient blows into the device. The pressure drop divided by the resistance of the pneumotachograph yields the flow, which can be transformed into a volume by time integration [10]. It is sensitive to temperature, humidity and atmospheric pressure of surrounding air and therefore requires constant calibration.
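The measuring principle described above (flow as pressure drop divided by resistance, volume as the time integral of flow) can be sketched in Python as follows. This is an illustration only: the units, the fixed sampling interval, and the assumption that expiration starts at the first sample are simplifications chosen here, and the code does not represent the device's actual signal processing.

```python
def flow_from_pressure(pressure_drop_kpa, resistance_kpa_s_per_l):
    """Convert pressure-drop samples to flow (L/s) for a fixed resistance."""
    return [dp / resistance_kpa_s_per_l for dp in pressure_drop_kpa]


def cumulative_volume(flow_l_per_s, dt_s):
    """Integrate flow over time with the trapezoidal rule; returns litres."""
    volume = [0.0]
    for previous, current in zip(flow_l_per_s, flow_l_per_s[1:]):
        volume.append(volume[-1] + 0.5 * (previous + current) * dt_s)
    return volume


def fev1_and_fvc(flow_l_per_s, dt_s):
    """Return (FEV1, FVC): exhaled volume after 1 s and at the last sample.

    Assumes the forced expiration starts exactly at sample 0.
    """
    volume = cumulative_volume(flow_l_per_s, dt_s)
    index_1s = round(1.0 / dt_s)
    if index_1s >= len(volume):
        raise ValueError("recording is shorter than 1 s")
    return volume[index_1s], volume[-1]
```

For example, a made-up constant pressure drop of 0.1 kPa over a resistance of 0.05 kPa·s/L corresponds to a flow of 2 L/s, which integrates to about 2 L after 1 s and about 4 L after 2 s.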
The EasyOne spirometer is a handheld standalone flow-sensing instrument that requires no calibration, although calibration can be checked with a syringe [11]. Unlike the Masterscreen pneumotachograph, the EasyOne spirometer incorporates an ultrasonic flow sensor to measure the flow of air in and out of the patient's lungs. Ultrasonic flow measurements are independent of gas composition, pressure, temperature, and humidity, and inaccuracy due to these factors is therefore reduced [12].

Statistical analysis
Sample size calculations were performed based on a standard deviation (SD) for FEV1 of 0.5 L. With a significance level of 0.05, 44 volunteers were required to detect a mean difference of 0.3 L between the spirometers with 80% power.
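The text does not state which formula was used for the sample size calculation. As an illustration, the Python sketch below uses the common normal-approximation formula for comparing two means with a two-sided test, which is an assumption made here; with the stated inputs it yields 44. A calculation for a paired design would instead require the SD of the within-person differences, which is not reported.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided
    comparison of two means (assumed formula, not taken from the source)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


if __name__ == "__main__":
    print(n_per_group(delta=0.3, sd=0.5))  # prints 44
```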
Correlations and agreement between spirometry measurements performed with the different spirometers were assessed with scatterplots, Pearson correlation coefficients and Bland and Altman plots [13]. Significance of differences between spirometers (within persons) was tested with paired t-tests.
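The agreement statistics named above can be sketched with the Python standard library. The example computes the quantities behind a Bland and Altman plot (mean difference and 95% limits of agreement) and the paired t statistic; the function names are hypothetical and any input values would be illustrative, not study data. Obtaining a p-value additionally requires the t distribution with n - 1 degrees of freedom, which is left out here.

```python
from math import sqrt
from statistics import mean, stdev


def bland_altman(device_a, device_b):
    """Return (bias, lower limit, upper limit) of agreement for paired data."""
    differences = [a - b for a, b in zip(device_a, device_b)]
    bias = mean(differences)
    spread = 1.96 * stdev(differences)  # half-width of the 95% limits
    return bias, bias - spread, bias + spread


def paired_t_statistic(device_a, device_b):
    """Paired t statistic: mean difference divided by its standard error."""
    differences = [a - b for a, b in zip(device_a, device_b)]
    standard_error = stdev(differences) / sqrt(len(differences))
    return mean(differences) / standard_error
```

A bias clearly different from zero, with most individual differences on the same side of zero, is what a systematic difference between two devices looks like in this framework.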
In the absence of a gold standard, we computed the percent predicted FEV1 and FVC according to sex, age, height, and ethnicity based on reference regression equations developed by the Global Lung Function Initiative (GLI) [14] to assess which of the two spirometers most likely gives a better estimate of the lung function. Moreover, we used the data from the first test series to provide a correction equation by regressing measurements from the Masterscreen pneumotachograph on the measurements obtained by the EasyOne1 spirometer as follows:
$$\text{Masterscreen} = \beta_0 + \beta_1 \times \text{EasyOne1} + \varepsilon$$
where $\beta_0$ is the intercept, $\beta_1$ the slope and $\varepsilon$ the residual error. The regression coefficients can be used to correct for systematic differences in epidemiological analyses and we showed this by applying the equation to lung function data from the PIAMA birth cohort collected at age 16. Data were analysed using SAS version 9.4 (The SAS Institute, Cary, NC, USA).

Results
Table 1 shows characteristics of the volunteers that participated in the two series of spirometer comparisons. On average, the FEV1 and FVC as measured by the Masterscreen pneumotachograph were significantly higher than the FEV1 and FVC as measured by the EasyOne1 spirometer (FEV1: 3.78 L vs 3.54 L, mean difference 0.24 L, p-value < 0.0001; FVC: 4.78 L vs 4.41 L, mean difference 0.37 L, p-value < 0.0001). The 0.24 L and 0.37 L mean differences correspond to decreases of 6.3% in FEV1 and 8.4% in FVC when switching from the Masterscreen pneumotachograph to the EasyOne1 spirometer. Differences in FEV1 and FVC between the two EasyOne spirometers were small (FEV1: 3.50 L vs 3.46 L, mean difference 0.03 L, p-value < 0.003; FVC: 4.31 L vs 4.27 L, mean difference 0.04 L, p-value < 0.003). These mean differences correspond to decreases of 1.1% in FEV1 and 0.9% in FVC when switching from the EasyOne1 to the EasyOne2 spirometer (Tables 1 and 2).
The observed differences between the spirometers were similar in males and females and in younger and older volunteers (Table 2).
Measurements were highly correlated (r = 0.98 for the first test series and r = 0.99 for the second test series for both FEV1 and FVC), indicating a strong linear relationship. This relationship deviates from identity (Fig. 1) for FEV1 (but not FVC) in the first test series, but not in the second test series. The Bland and Altman plots show that the mean differences are consistently larger than zero, indicating a systematic difference between the two spirometers, with the Masterscreen pneumotachograph consistently producing higher values than the EasyOne1. There was no systematic difference between the EasyOne1 and EasyOne2 measurements (Fig. 2).
Using the GLI reference equations, the percent predicted for the Masterscreen pneumotachograph was close to 100% (98.3% for FEV1 and 103.7% for FVC), but less so for the EasyOne1 (92.3% for FEV1 and 95.5% for FVC).
Regression of the measurements from the Masterscreen pneumotachograph on the EasyOne1 measurements produced regression equations (Fig. 1) indicating that an increase of 1 L in the EasyOne1 measurements is associated with an estimated average increase of 1.032 L for FEV1 and 1.005 L for FVC in the Masterscreen pneumotachograph measurements. Table 3 shows the mean of FEV1 and FVC as measured in the PIAMA birth cohort at the age of 16 years, before and after correction for the systematic differences. The mean difference reduces from 0.37 L to 0.13 L for FEV1 and from 0.44 L to 0.07 L for FVC after correction.
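How a correction of this kind could be applied is sketched below in Python: an ordinary least squares line is fitted to paired comparison data (Masterscreen regressed on EasyOne1), and the fitted intercept and slope are then used to put EasyOne measurements from a cohort on the Masterscreen scale. The function names are hypothetical, the sketch is not the authors' SAS code, and because the intercepts are not given in the text above, any numbers used with it are illustrative.

```python
from statistics import mean


def fit_calibration(easyone, masterscreen):
    """Least-squares fit of masterscreen = intercept + slope * easyone."""
    x_bar, y_bar = mean(easyone), mean(masterscreen)
    s_xx = sum((x - x_bar) ** 2 for x in easyone)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(easyone, masterscreen))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def calibrate(easyone_values, intercept, slope):
    """Map EasyOne measurements (L) onto the Masterscreen scale."""
    return [intercept + slope * value for value in easyone_values]
```

Because the correction is estimated from the comparison study and then applied to a separate population, its validity rests on the comparison conditions (devices, technicians, protocol) resembling those of the cohort measurements.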

Discussion
We compared FEV1 and FVC measurements from two different, widely used spirometers, the EasyOne and the Masterscreen pneumotachograph, and found that the EasyOne spirometer provided on average systematically lower measurements than the Masterscreen. We also investigated the agreement between two EasyOne spirometers of the same generation and found only small differences between them.
In epidemiological studies, lung function measurements can be performed using more than one spirometer of the same type or different types. This study showed a systematic difference between two types of spirometers used in the PIAMA birth cohort study [15]. We conducted this experiment in healthy volunteers, for whom the mean percent predicted FEV1 and FVC was expected to be close to 100%. Based on reference equations provided by the GLI [14], the mean percent predicted FEV1 and FVC was not exactly 100% for either spirometer, but percentages were closer to 100% for the Masterscreen pneumotachograph than for the EasyOne1, especially for FEV1. The lower percent predicted lung function for the EasyOne1 suggests that the EasyOne spirometer may be more likely to overestimate the percentage of subjects with a clinically low lung function in a setting where different spirometers are used. This has been previously demonstrated in a comparison involving the EasyOne spirometer and a water-sealed spirometer (Collins, Stead-Wells), in which underestimated values of both FEV1 and FVC from the EasyOne spirometer, and consequently higher prevalence rates of airway obstruction, were observed [16]. It is important to note that the GLI reference equations are not universally applicable. However, these equations are based on an extensive database, and studies in the Netherlands have shown that measurements in the Dutch population generally agree with the GLI reference values in adults [17]. We therefore believe these equations are most likely suitable for our current study population, as the Masterscreen-EasyOne comparison population was 100% Dutch.
It is advised that, regardless of which reference equations are used, clinical decisions should never be based solely on lung function test results but should be backed up with complementary laboratory, clinical and physical findings [18]. Several studies have conducted similar experiments comparing different types of spirometers, handheld/office and standard laboratory spirometers, in both clinical and research settings [2][3][4][19][20][21][22], with the comparisons also used as a quality control procedure in international multicentre epidemiological studies [23,24]. High correlations were observed throughout these studies, but significant systematic differences between spirometers in some of the studies [2,19,20] suggest that measurements from different spirometers are not always comparable. Kunzli et al. [4] conducted a study comparing eight flow-sensing spirometers of the same type (Sensormedics 2200) and found that the new generation of Sensormedics (Vmax) gave systematically lower results than the older generation. Based on this comparison, they made an informed choice of spirometers for their follow-up study by excluding the new generation spirometers in the SAPALDIA cohort. Similar practical changes were made in another study based on a similar comparison [23]. Small systematically lower FVC and FEV1 values at follow-up may eventually translate into erroneous deficits of lung function in the studied population, leading to erroneous conclusions about the effect of environmental, biologic or life-style factors on lung function changes [2]. Use of different types of spirometers in the same study can be less detrimental if comparability is established and, if necessary, any systematic differences are corrected.
The source of the observed differences between the Masterscreen pneumotachograph and the EasyOne spirometer is unclear. The Masterscreen pneumotachograph was routinely calibrated for each session as required. The EasyOne spirometers are made to require no calibration but were occasionally checked using a calibration syringe. Calibration of both spirometers was therefore checked thoroughly, so the chance that the observed differences are due to calibration differences is minimal. However, the following limitations should be considered. Two experienced technicians performed the first test series (one with the Masterscreen pneumotachograph and one with the EasyOne) and one of them performed all measurements of the second test series. We designed the comparison of the Masterscreen pneumotachograph and EasyOne spirometers such that different technicians operated the different spirometers to imitate a real multicentre study. While the technicians were highly trained and experienced, the study design made it impossible to disentangle differences between spirometers from differences between technicians. Consequently, part of the observed difference between spirometers may be attributable to differences between technicians. The provided correction equation thus simultaneously corrects for the technician and device effect and may not be generalizable to other studies where different technicians are involved. However, it is expected that the calibration method can be applied accordingly. We were not able to assess the external validity of the correction for spirometry measurements outside the PIAMA population, but it has been used before to correct spirometry measurements [6] and the method has been validated in other fields of epidemiology [25].
We used self-reported instead of measured height and weight for the 98 volunteers in total who participated in the comparisons of the spirometers. Since spirometers were compared within persons, and consequently height and weight did not differ between the spirometers that were compared within a series, this does not affect the observed differences between spirometers. Self-reported height might be a source of bias when applying the GLI equations, as height values may be over- or underreported. Weight is not used in the GLI equations to estimate percent predicted lung function and therefore poses no risk of bias. Studies of the agreement between self-reported and measured weight and height have provided inconsistent results: some suggested good agreement [26,27], while others reported significant discrepancies, mainly in overweight/obese individuals [28,29]. It is also not clear to what extent the systematic differences between the two spirometers can be attributed to hardware, as computer software has been identified as another major source of discrepancies between spirometers [30].
A strength of this study is that the order of the spirometers was randomized to minimize influences of personal characteristics and differences due to study design. We observed high precision of the regression parameter estimates, which strongly suggests that the sample size in our experiment is not a concern.

Conclusion
We observed systematic differences between lung function measurements from two spirometers of different types. Epidemiological researchers need to be aware of these potential systematic differences and correct for them in the analyses using methods such as regression calibration.

Funding
The research leading to these results has received funding from Dutch Lung Foundation (Project number 4.1.14.001). The funders did not play any role in