Spatial analysis of bladder, kidney, and pancreatic cancer on upper Cape Cod: an application of generalized additive models to case-control data
© Vieira et al; licensee BioMed Central Ltd. 2009
Received: 07 November 2008
Accepted: 10 February 2009
Published: 10 February 2009
In 1988, elevated cancer incidence in upper Cape Cod, Massachusetts prompted a large epidemiological study of nine cancers to investigate possible environmental risk factors. Positive associations were observed, but explained only a portion of the excess cancer incidence. This case-control study provided detailed information on individual-level covariates and residential history that can be spatially analyzed using generalized additive models (GAMs) and geographical information systems (GIS).
We investigated the association between residence and bladder, kidney, and pancreatic cancer on upper Cape Cod. We estimated adjusted odds ratios using GAMs, smoothing on location. A 40-year residential history allowed for latency restrictions. We mapped spatially continuous odds ratios using GIS and identified statistically significant clusters using permutation tests.
Maps of bladder cancer are essentially flat ignoring latency, but show a statistically significant hot spot near known Massachusetts Military Reservation (MMR) groundwater plumes when 15 years latency is assumed. The kidney cancer map shows significantly increased ORs in the south of the study area and decreased ORs in the north.
Spatial epidemiology using individual level data from population-based studies addresses many methodological criticisms of cluster studies and generates new exposure hypotheses. Our results provide evidence for spatial clustering of bladder cancer near MMR plumes that suggest further investigation using detailed exposure modeling.
Methods for mapping point-based epidemiologic data have received less attention than mapping areal data . Generalized additive models (GAMs), a type of statistical model that combines smoothing with the ability to analyze binary outcome data and adjust for covariates, provide a useful framework for examining point data [13–18]. Using individual-level information and location in a generalized additive model, we calculated the crude and adjusted odds ratios for bladder, kidney, and pancreatic cancers on upper Cape Cod. These analyses, unlike many registry-based maps, have the advantage of controlling for spatial confounders, examining the effect of latency, and allowing for hypothesis testing for the significance of location in the disease maps. The objectives of the present analyses are to identify exposure hypotheses for further investigation and to demonstrate spatial epidemiology using generalized additive models.
We investigated the association between residence and kidney, pancreatic, and bladder cancer on upper Cape Cod, Massachusetts using data from a population-based case-control study . Spatial analyses of breast, lung, and colorectal cancer were previously reported in Vieira et al. . The Massachusetts Cancer Registry was used to identify incident cancer cases diagnosed from 1983–1986. Participants were restricted to permanent residents (living on upper Cape Cod at least 6 months of the year) with complete residential histories. A total of 62 bladder cancer cases, 35 kidney cancer cases, and 37 pancreatic cancer cases were included.
There were 885 bladder cancer controls, 803 kidney cancer controls, and 651 pancreatic cancer controls. A large number of controls were available for the present analyses because the parent study investigated nine cancer types including breast, lung, and colorectal cancers with larger numbers of cases . See earlier papers [2, 3] for a detailed description of the methods used to define the study population, including the rationale for the method of control selection. Briefly, controls were chosen to represent the underlying population that gave rise to all the cancer cases from the large epidemiologic study, i.e., permanent residents of upper Cape Cod during the same time period. Controls were frequency matched to all cancer cases on age, gender, and vital status. It is important to note that controls were not matched on town of residence. Because many of the cases were deceased or elderly, three different sources of controls were used: (1) random digit dialing for living controls less than 65 years of age; (2) Centers for Medicare and Medicaid Services (formerly the Health Care Financing Administration) for living controls 65 years of age or older; and (3) death certificates for controls who had died from 1983–1986.
Participants or their next-of-kin completed an extensive interview, providing information on demographics (age, sex, marital status, and education), a forty-year residential history, and potential confounders. "Index years" were randomly assigned to controls in a distribution similar to that of diagnosis years for cases. We used the diagnosis and index years to estimate timing and duration of environmental exposure among case and controls, respectively. The Institutional Review Board of Boston University Medical Center approved the research.
Geographical Information System (GIS)
All residential addresses reported by participants in the upper Cape Cod area over the forty-year period prior to the diagnosis or index year were eligible for the spatial analysis. Many participants lived at more than one address during their residential history on upper Cape Cod. We excluded all addresses where residency time began after the diagnosis date for cases and index date for controls. The bladder cancer data set included 95 case locations and 1,382 control locations. The kidney cancer data set included 54 case locations and 1,220 control locations. The pancreatic cancer data set included 49 case locations and 1,005 control locations.
Generalized Additive Models (GAMs)
We estimated local disease odds using generalized additive models, a form of non-parametric or semi-parametric regression with the ability to analyze binary and continuous outcome data while adjusting for covariates [13, 18]. We modeled location, a potential surrogate measure of exposure, using a bivariate smooth (S) of spatial coordinates (x1) and (x2)
logit [p(x1, x2)] = S(x1, x2) + γ'z
where the left-hand side is the log of the disease odds at location (x1, x2), z is a vector of covariates, and γ is a vector of parameters. Throughout this paper, we will refer to (x1) and (x2) as longitude and latitude, although (x1) and (x2) are technically measures of distance and not degrees. The model is semiparametric because it has the nonparametric smooth but the covariates are modeled parametrically. Without the smooth function, S(x1, x2), the model becomes an ordinary logistic regression on the covariates.
Spatial confounding occurs when risk factors for a disease are not evenly distributed. For example, a cluster of lung cancer may be due to an increased density of smokers. A group of core confounders, chosen a priori based on the current scientific literature or study design, was included in all adjusted cancer analyses: vital status at interview, gender, race, age at diagnosis or index year, and usual number of cigarettes smoked. In the kidney and bladder cancer analyses, we also controlled for history of urinary tract infection or stone. In the pancreatic cancer analysis, we also controlled for usual alcohol consumption. We dropped other covariates from the model because they did not change the appearance of the map, including education level, prior medical treatment with irradiation, and occupational exposure to solvents. The covariate with the largest percentage of missing data was education (7% of the participants). Since education was not a confounder in the final model, we did not exclude these participants.
We used a loess smooth which adapts to changes in population density . The amount of smoothing, referred to as the span size, depends on the percentage of the data points in the neighborhood. As a result, the geographic extent of the neighborhood is smaller in densely populated areas and larger in areas with sparse population. We determined the optimal amount of smoothing for each map by minimizing the Akaike's Information Criterion (AIC). Small span sizes produce bumpier surfaces and larger span sizes produce smoother surfaces. As the span size increases, the amount of bias in the fit increases and the variance decreases . We created a rectangular grid covering the study area using the minimum and maximum longitude and latitude from the study subjects (The GAM does not predict locations beyond where the subjects live). Grid points lying outside the outline map of the study area were clipped, as were areas where people cannot reside (e.g., ocean or conservation land).
We converted from log odds to odds ratios (ORs) using the odds of disease in the whole study area as the reference. When controls are appropriately sampled from the population giving rise to the cases, the case-control ratio (disease odds) in a subset of the area should be proportional to the disease incidence rate and the odds ratio estimates rate ratio. In order to make maps visually comparable, we mapped all results using the same dark blue to dark red continuous (unclassified) color scale and range of odds ratios, 0.25–2.50. This range covers most but not all of the ORs observed in our analyses, preventing maps from being washed out by an area of extremely high or low ORs. We determined the presence of spatial confounding by visually comparing crude and adjusted maps. If their optimal span sizes differ, we also compared maps using a common span size, allowing us to distinguish between changes due to confounder adjustment and changes due to span size.
GAMs also provide a framework for hypothesis testing. We first tested the null hypothesis that case status does not depend on the smooth term using a permutation test based on the difference of the deviances of model (1) with and without the smooth term. We condition on the number of cases and controls, preserving the relationship between case/control status and covariates, and randomly assign individuals to locations. We carry out 999 permutations of location in addition to the original. For each permutation, we ran the GAM using the optimal span of the original data and computed the deviance statistic. We do this to preserve the spatial resolution of the original map; the test is thus conditional on the span size. We divide the rank of the observed value by 1000 to obtain a p-value. We used a p-value cut off of 0.05 as a screening tool for possibly meaningful associations. We discuss results as "significant" if the associated p-values are less than 0.05, but acknowledge that some results may be due to chance.
If the global deviance test indicates that the map is unlikely to be flat, we next want to locate areas of the map that exhibit unusually high or low disease odds. We examined point-wise departures from the null hypothesis using permutation tests if the global statistic indicated that location was significant at the 0.05 level. We obtained a distribution of the log odds at every point using the same set of permutations we used for calculating the global statistics. We defined areas of significantly decreased odds ("cold spots") to include all points that ranked in the lower 2.5% of the point-wise permutation distributions and areas of elevated odds ("hot spots") to include all points that ranked in the upper 2.5% of the point-wise permutation distributions. By drawing the 2.5% and 97.5% contour lines, we mapped areas of significantly decreased and increased risk.
Webster et al.  provides a detailed discussion of the statistical methods, analyses using synthetic data, and a comparison with the kernel method of Kelsall and Diggle . We used S-Plus  to perform the generalized additive modeling and ArcGIS  to map the results of our analyses. Sample program code is available at http://www.cireeh.org/pmwiki.php/Main/SpatialEpidemiology.
Our initial, no-latency analyses included all eligible residences with complete address information to allow for geocoding. Therefore, exposures occurring up to diagnosis were assumed to contribute to the risk of disease. By including all addresses, the disease outcome is replicated with the same covariates but different residence for each participant. However, solid cancers initiated by exposure to carcinogens typically take more than a decade to develop. For cancers with sufficient case numbers, we performed a fifteen-year latency analysis by restricting inclusion to the residences occupied by participants at least fifteen years prior to the diagnosis or index year (Residences within the fifteen year window were excluded because geographical location within that window was assumed not relevant to the outcome). Because the inclusion of multiple residences for the same individual may bias our statistical model, we also performed an analysis of residence of longest duration.
Distribution (%) of Selected Characteristics of Cases and Controls
(n = 62)
(n = 885)
(n = 35)
(n = 803)
(n = 37)
(n = 651)
Age at diagnosis or index year (y)
Education level (y)
Less than 12
16 or more
Ever regular cigarette smokera
Ever regular alcohol drinkera
Ever exposed occupationally to solventsc
History of urinary tract infection or stone
Prior medical treatment with radiation
Alive at interview
Summary of bladder cancer models showing degree of smoothing and global test statistic
Longest Duration Residence
Longest Duration Residence
We also restricted the adjusted 15-year latency analysis to residences of longest duration. Using the same span size as before (0.30), Figure 3d shows that although cluster size and shape changed, the overall spatial pattern remained similar, suggesting that the use of multiple residences did not cause bias. The optimal span for the longest duration analysis was 0.95, likely due to the reduced sample size; using the larger span size resulted in a smoother surface (not shown).
Summary of kidney cancer models showing degree of smoothing and global test statistic
The AIC curve for the adjusted kidney cancer model indicates two local minima at span sizes of 0.15 and 0.40 before reaching the global minimum (and optimal span) of 0.90. We repeated the adjusted kidney analysis using spans of 0.40 (Figure 4c) and 0.15 (Figure 4d). The map for span of 0.40 is similar to the results we obtained using the optimal span. The small span of 0.15 produced a bumpier surface, including an area of highly increased risk in the center of the study region. This hot spot is likely spurious due to the sparse population in this area (Figure 2). We did not test for statistical significance of location in models that did not use the optimal span size.
Summary of pancreatic cancer models showing degree of smoothing and global test statistic
We examined potential bias from multiple residences in the pancreatic cancer analysis by restricting participants' addresses to the one of longest duration. The spatial patterns of the resulting map differ from the map with all residences suggesting a bias may exist (not shown). The cluster in the center of the study area (Figure 5b) is no longer elevated. The case locations that contributed to the cluster in the analysis with all residences were for different individuals, but were of shorter residency duration than other residences. We were unable to consider latency because there were too few cases.
In our analyses, bladder cancer on upper Cape Cod displayed an area of increased risk when we considered latency, and location became a significant predictor. Many geographic analyses based on cancer registry data only use address at diagnosis. The greater spatial variation in bladder cancer with increased latency is consistent with misclassification of geographically associated risk factors, including environmental exposures. If population movement is random with respect to disease status, ignoring latency should increase nondifferential exposure misclassification and tend to make maps flatter.
Results of the kidney cancer analysis indicated that there was latitudinal variation causing the odds ratios to tilt significantly in magnitude from north to south. The AIC curve resulted in two additional minima of 0.15 and 0.40. Choice of span size is one of the most important issues in smoothing . We used the Akaike Information Criterion, a computationally feasible method for choosing an "optimal" span based on the tradeoff between bias and variance of the smooth. There are, however, problems with automatic selection procedures. Selecting the span that optimizes the bias-variance tradeoff is not necessarily the same as understanding the importance of map features. Rather than using a single span, there may be important aspects of the data at different scales. New methods are needed to address this issue, e.g., .
The adjusted pancreatic cancer map showed more pronounced hot and cold spots compared to the crude analysis. Rather than creating disease clusters as is often assumed, spatial confounding was partially hiding areas of increased risk. Unlike cancer registry maps which contain limited data on covariates, our analyses controlled for many covariates available in the case-control study questionnaire.
A number of epidemiologic studies have examined cancer and environmental exposures on Cape Cod [1–11]. Previous studies investigated the association between kidney, pancreatic and bladder cancer and tetrachloroethylene (PCE) in drinking water from vinyl-lined asbestos cement distribution pipes [2, 3] and cranberry cultivation . Increased relative risks were found for bladder and pancreatic cancer in the highest PCE-exposed individuals. No evidence was found for increased risks of bladder, kidney, and pancreatic cancer associated with living within 2600 ft of cranberry bogs. Adding PCE exposure and proximity to cranberry bogs to the current spatial models had no effect on the appearance of the maps.
Overlaying maps of odds ratios with maps of pollution sources can generate hypotheses about exposure. Caution is needed, however, because many geographic features overlap. The Massachusetts online repository of geographically coded features for shape files potentially related to environmental exposure  was explored to generate hypotheses for further investigation. Groundwater plumes were of particular interest because of our earlier analyses of breast cancer and pollution of drinking water . With no prior knowledge of any geographic relationship to bladder cancer, we compared the two data sets (Figure 6), and found a suggestive overlap between the bladder cancer hot spots and ground water plumes, some from the MMR and others from landfills and wastewater treatment facilities. Groundwater plumes in the Barnstable and Falmouth areas are currently being modeled to investigate further this hypothesis.
Case-control studies are one of the standard epidemiologic tools for investigating associations between disease and exposure. By combining such data with advanced statistical techniques, we were able to address many criticisms of spatial studies. Cancer cases were ascertained from a registry and cancer types were studied separately. Point-based data from a region were used, avoiding aggregation within arbitrary political boundaries. Controls provided an estimate of the underlying, non-uniform population density. Our analyses controlled for many covariates not available using registry data alone. Residential history information allowed us to take latency into account, potentially quite important for diseases like cancer. However, there were only sufficient numbers of cases to perform a latency analysis for bladder cancer. Had there been a larger number of cases, the residential histories would have allowed for space-time analyses using GAMs. In a prior study, we successfully illustrated the ability to generate hypotheses for location and timing of exposures using breast cancer data .
Our results have a number of potential limitations. While areas of increased or decreased risk may theoretically be caused by non-uniform control selection, sampling of controls within the study area did not depend on geography. Use of residential history allows analysis of latency (when sample size is sufficient), but it produces multiple residences, a potential source of bias. Because residences were analyzed, an apparent cluster may be caused by a few diseased people moving within a small area. To examine the effect of multiple residences, our analyses were restricted to residences of longest duration. The spatial pattern of risk was similar for bladder and kidney cancers, with little difference in the location and magnitude of hot and cold spots, but the map of pancreatic cancer differed when we restricted to residence of longest duration. This suggests that the inclusion of multiple residences did not bias the bladder and kidney cancers analyses but there may be a bias in the pancreatic cancer analysis. Improved methods for analyzing data with multiple residences are needed; weighting by residence time has been suggested .
We computed global and pointwise p-values, but many epidemiologists prefer confidence intervals when evaluating the precision of point estimates . It should be possible to compute variance bands (also known as confidence bands) for our maps . We performed permutation tests that conditioned on the span size of the observed data which may be smaller than the span size for a permuted dataset under the null hypothesis of a flat map. This could possibly lead to a larger deviance statistic under the null hypothesis and the global permutation p-value would then be too large (conservative). The effect of conditioning on the original span size is a topic of future research. Pointwise tests were only conducted if the global deviance test indicated that the map was unlikely to be flat, but performing multiple testing at each location may result in an increase in the type I error rate. Although spatial analyses are useful for generating new hypotheses, the location of significant hot and cold spots should be considered exploratory.
Using generalized additive models and GIS, we generated maps of bladder, kidney, and pancreatic cancer risk. When available, population-based case-control studies provide extensive data on potential risk factors and residential histories that address many methodological criticisms of cluster studies. The results of the current analysis illustrate the usefulness of GAM methods in generating hypotheses for further investigation. We identified a significant hot spot of bladder cancer that coincides with groundwater plumes. Groundwater in the study area is currently being modeled to explore this possible association between drinking water contamination and cancer.
Akaike's Information Criterion
generalized additive model
geographical information systems
Massachusetts Military Reservation
This work was supported by grant number 5 P42 ES007381 from the National Institute of Environmental Health (NIEHS), NIH. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIEHS, NIH.
- Aschengrau A, Ozonoff D: Upper Cape Cancer Incidence Study. Final Report. 1992, Boston: Massachusetts Department of Public HealthGoogle Scholar
- Aschengrau A, Ozonoff D, Paulu C, Coogan P, Vezina R, Heeren T, Zhang Y: Cancer risk and tetrachloroethylene-contaminated drinking water in Massachusetts. Arch Environ Health. 1993, 48: 284-292.View ArticleGoogle Scholar
- Paulu C, Aschengrau A, Ozonoff D: Tetrachloroethylene-contaminated drinking water and the risk of colon-rectum, lung, and other cancers. Environ Health Perspect. 1999, 107: 265-271. 10.2307/3434592.View ArticleGoogle Scholar
- Ozonoff D, Aschengrau A, Coogan P: Cancer in the vicinity of a Department of Defense superfund site in Massachusetts. Toxicol Indust Health. 1994, 10: 119-141.View ArticleGoogle Scholar
- Aschengrau A, Ozonoff D, Coogan P, Vezina R, Heeren T, Zhang Y: Cancer risk and residential proximity to cranberry cultivation in Massachusetts. Am J Public Health. 1996, 86: 1289-1296. 10.2105/AJPH.86.9.1289.View ArticleGoogle Scholar
- Silent Spring Institute: Cape Cod Breast Cancer and Environment Study: Final report. 1997, Newton, MAGoogle Scholar
- Aschengrau A, Paulu C, Ozonoff D: Tetrachloroethylene-contaminated drinking water and the risk of breast cancer. Environ Health Perspect. 1998, 106: 947-53. 10.2307/3434136.View ArticleGoogle Scholar
- Aschengrau A, Rogers S, Ozonoff D: Perchloroethylene-contaminated drinking water and the risk of breast cancer: Additional results from Cape Cod, Massachusetts, USA. Environ Health Perspect. 2003, 111: 167-173.View ArticleGoogle Scholar
- McKelvey W, Brody JG, Aschengrau A, Swartz CH: Association between residence on Cape Cod, Massachusetts, and breast cancer. Ann Epidemiol. 2004, 14: 89-94. 10.1016/S1047-2797(03)00120-0.View ArticleGoogle Scholar
- Brody J, Aschengrau A, McKelvey W, Rudel R, Swartz C, Kennedy T: Breast cancer risk and historical exposure to pesticides from wide-area applications assessed with GIS. Environ Health Perspect. 2004, 112: 889-897.View ArticleGoogle Scholar
- French JL, Wand MP: Generalized additive models for cancer mapping with incomplete covariates. Biostatistics. 2004, 5 (2): 177-191. 10.1093/biostatistics/5.2.177.View ArticleGoogle Scholar
- Bithell JF: A classification of disease mapping methods. Statist Med. 2000, 19 (17–18): 2203-2215. 10.1002/1097-0258(20000915/30)19:17/18<2203::AID-SIM564>3.0.CO;2-U.View ArticleGoogle Scholar
- Hastie T, Tibshirani R: Generalized Additive Models. 1990, London: Chapman and HallGoogle Scholar
- Kelsall J, Diggle P: Spatial variation in risk of disease: a nonparametric binary regression approach. J Roy Stat Soc C-App. 1998, 47: 559-573. 10.1111/1467-9876.00128.View ArticleGoogle Scholar
- Webster T, Vieira V, Weinberg J, Aschengrau A: Spatial analysis of case-control data using generalized additive models. Proceedings from EUROHEIS/SAHSU Conference: 30–31 March 2003; Östersund, Sweden. Edited by: Jarup L. 2003, London: Small Area Health Statistics Unit, Imperial College, 56-59.Google Scholar
- Webster T, Vieira V, Weinberg J, Aschengrau A: A method for mapping population-based case control studies using generalized additive models. Int J Health Geogr. 2006, 5: 26-10.1186/1476-072X-5-26.View ArticleGoogle Scholar
- Vieira V, Webster T, Weinberg J, Aschengrau A, Ozonoff D: Spatial analysis of lung, colorectal, and breast cancer on Cape Cod: An application of generalized additive models to case-control data. Environ Health. 2005, 4: 11-10.1186/1476-069X-4-11.View ArticleGoogle Scholar
- Wood S: Generalized Additive Models. 2006, Chapman and Hall: LondonGoogle Scholar
- Insightful S-Plus. [http://www.insightful.com/products/splus/default.asp]
- ESRI GIS and Mapping Software: ArcView. [http://www.esri.com/software/arcgis/arcview/index.html]
- Godtliebsen F, Marron JS, Chaudhuri P: Significance in scale space for bivariate density estimation. J Comput Graph Stat. 2002, 11: 1-21. 10.1198/106186002317375596.View ArticleGoogle Scholar
- Kammann EE, Wand MP: Geoadditive models. J Roy Stat Stat Soc C-App. 2003, 52: 1-18. 10.1111/1467-9876.00385.View ArticleGoogle Scholar
- Commonwealth of Massachusetts: Massachusetts Geographic Information System (MassGIS). Cape Cod Commission Datalayers. [http://www.mass.gov/mgis/ccc.htm]
- Gallagher , Lisa : Refining exposure assessment methods using GIS-based models: Drinking water contaminants and breast cancer risk on Cape Cod, Massachusetts. 2008, Boston, MA: Boston University (Doctoral Thesis)Google Scholar
- Vieira VM, Webster TF, Weinberg JM, Aschengrau A: Spatial-temporal analysis of breast cancer in upper Cape Cod, Massachusetts. Int J Health Geogr. 2008, 7: 46-10.1186/1476-072X-7-46.View ArticleGoogle Scholar
- Sabel CE, Gatrell AC, Loytonen M, Maasilta P, Jokelainen M: Modelling exposure opportunities: Estimating relative risk for motor neurone disease in Finland. Soc Sci Med. 2000, 50: 1121-1137. 10.1016/S0277-9536(99)00360-3.View ArticleGoogle Scholar
- Rothman KJ, Greenland S: Modern Epidemiology. 1998, Philadelphia: Lippincott-Raven, 2Google Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.