Collection of neurodegenerative diseases and bloom data
To identify the association between harmful algal blooms and NDs, chlorophyll-a (chl-a) concentration data and the annual number of NDs for an administrative unit (‘gu’ or ‘gun’) in S. Korea were used. Since the observed sites where chl-a concentrations were measured were located along the rivers and lakes, the administrative units were not accurately matched geographically. Thus, a gridded population was used to solve this problem, which are estimate of the people in a grid cell, derived with a geo-statistical model using census or small area population counts and several other spatial datasets [19].
For bloom data, chl-a concentrations from 939 locations were obtained from the Water Information System at the Korean Ministry of the Environment. Since monitoring of cyanotoxins was limited and has not been conducted regularly, chl-a concentration was used as a bloom intensity indicator; the chl-a data are the only available bloom parameter covering the entire areas along the rivers and lakes during the study period from 2005 to 2017. Annual average chl-a concentrations per location were used for analysis. A Korean governmental agency conducts monitoring of cyanobacterial cell counts, geosmin, and chl-a concentrations during HAB seasons. Measurements of BMAA, anatoxins, or saxitoxin are rarely performed, and data are not available. Therefore, in this study, chl-a concentration was used as a bloom indicator.
Conversion to gridded population
In order to map the annual HAB and ND data, the longitude and latitude information was collected for all 939 chl-a monitoring locations. Then the data were converted into geographical information system using QGIS software (version 3.22.2, https://qgis.org/en/site/). The entire area of S. Korea was divided into a grid of 1 km, and then 127,595 points of the grid were generated. Next, the 939 observed locations were mapped onto the entire grid points. If two or more observation locations of chl-a correspond to one grid, the median value of chl-a concentration was used and then mapped onto the grid.
Next, the population number in each 1-km grid was collected using the SGIS (Statistical Geographic Information Service) Plus (https://sgis.kostat.go.kr) system. SGIS Plus is a location-based open service platform linked and fused with public and private data and census data of population, households, houses, and businesses owned by Statistics Korea (Korean Bureau of Census). SGIS Plus provides a variety of statistical data according to the size of the grids. The smallest size of the grid is 100 m, and the largest is 100 km. The total population and numbers of over 65 years old were extracted for every 1-km grid from the entire areas of S. Korea. In Fig. 1, red dots represent all observed locations of chl-a concentration, and grey lines are 1-km grids. For the surrounding areas in Seoul, the capital, the 1-km grids are not distinguishable due to its highly dense population. Thus, it was enlarged for displaying better granularity.
Figure 2 depicts more detailed information on the Seoul area. In Fig. 2, panel (a) illustrates the number of people in the 1-km grid in Seoul. The darker the color is, the larger the number of populations in the area is. Panel (b) represents the observed location of the chl-a concentration measured. It shows that chl-a concentrations were monitored along the rivers and streams. Panel (c) overlays the gridded population and chl-a concentration observed locations. A benefit of using a 1-km gridded population is its scalability: we can easily extend the grid size by aggregating the grids into 3-km or 5-km grids and then matching each extended grid with observed chl-a concentration data.
Cross-sectional time series data
After collecting chl-a concentration data over 13 years (from 2005 to 2017), we matched the gridded population to these observed years. However, SGIS Plus does not provide data every year. Statistics Korea conducted the Census every five years until 2015. Therefore, gridded population data in 2005, 2010, and 2015 were used. In 2015, Statistics Korea performed a register-based Census, and census data have been generated yearly since then. Thus, based on the Censuses, we could have the gridded population data in 2005, 2010, 2015, 2016, and 2017. To match chl-a concentration data for those 13 years to the irregularly observed gridded population, we had to expand the population data to 13 years. Thus, we used an interpolation method of missing values in time series by connecting consecutive non-missing input values with the spline method [20]. Finally, we had population estimates for these missing years of 2006, 2007, 2008, 2009, 2011, 2012, 2013, and 2014.
In this study, three NDs were considered; MND, AD, and PD. The ICD-10-CM diagnosis codes were G12.2 for MND, G30.0 for AD, and G20 for PD. In addition, the codes of three diseases were converted from the ICD-9 codes (before 2015 data) to the ICD-10 codes. The annual numbers of patients with the three diseases from the National Health Insurance Service were collected, and the data were rearranged according to the administrative units of the district. In S. Korea, a unit of district-level has three types: “Si” as a city, “Gu” as a district in metropolitan cities only, and “Gun” as a division within provinces. The total number of administrative units is 226, less than the total number of the 1-km grid and less than the number of chl-a observed locations (939 locations). Due to the lack of exact congruency between 1-km grid, chl-a marked locations, and the of administrative units, we had to match the patient number to the corresponding chl-a concentration location using the below method.
The number of patients for each administrative district in the 1-km grid was evenly distributed according to the populations of each grid. Let xit be the number of senior patients in the ith administrative unit at time t and yjt be the number of ND patients in the jth 1- km grid at the same time t. The modified number of patient yjt is given by in Eq. (1).
$${\boldsymbol{y}}_{\boldsymbol{jt}}={\boldsymbol{x}}_{\boldsymbol{it}}\times \frac{{\boldsymbol{G}}_{\boldsymbol{jt}}}{{\boldsymbol{P}}_{\boldsymbol{it}}},\boldsymbol{i}=\textbf{1},\textbf{2},\dots, \textbf{226},\boldsymbol{j}=\textbf{1},\textbf{2},\dots, {\boldsymbol{n}}_{\boldsymbol{i}},\boldsymbol{t}=\textbf{2005},\dots, \textbf{2017}$$
(1)
where, Pit is the number of total populations in the ith administrative unit at time t, and Gjt is the number of people in the jth 1- km grid, which is one of the administrative units at time t. Figure 3 depicts the heatmaps of the modified numbers of patients in a 1-km grid. From panels (b) to (d), each panel represents the heatmap of the modified number of ND patients of MND, AD, and PD, respectively. Since much of the population live in the Seoul metropolitan area, about one-quarter of the people of S. Korea, we zoomed in on the area. The zoom-in area can be seen in the upper right corner of each panel. Other dark-colored areas are another regional capital of S. Korea. The upper left panel (a) represents the concentration of the observed spot of chl-a. We can see that the higher concentrations are located following the four big rivers in S. Korea.
As NDs are related to old people, the number of patients in the 1-km grid needs to be adjusted after considering the population size of older people (> 65 years old). So we performed the second modification of the number of diseases. Eq. (2) represents the second modified number of NDs according to the more aging population.
$${\boldsymbol{y}}_{\boldsymbol{jt}}={\boldsymbol{x}}_{\boldsymbol{it}}\times \frac{\textbf{1}}{\boldsymbol{P}{\textbf{65}}_{\boldsymbol{it}}}\times \frac{{\boldsymbol{G}}_{\boldsymbol{jt}}}{{\boldsymbol{P}}_{\boldsymbol{it}}},\boldsymbol{i}=\textbf{1},\textbf{2},\dots, \textbf{226},\boldsymbol{j}=\textbf{1},\textbf{2},\dots, {\boldsymbol{n}}_{\boldsymbol{i}},\boldsymbol{t}=\textbf{2005},\dots, \textbf{2017}$$
(2)
where P65it is the number of populations over 65 in the ith administrative unit at time t, the other variables are the same with Eq. (1). The first product on the right side of Eq. (2) is the corrected number of ND patients according to the number of the population over 65 years in the administrative unit. The updated number was revised by multiplying the 1-km grid population and administrative unit population ratio. The final corrected number of patients was calculated for every three ND and was used as a response variable in the statistical model.
Statistical analysis
In order to examine the relationships between the HABs and the numbers of three NDs, we had to figure out two issues first: 1) to evaluate the long-lasting effects of HABs on the NDs, and 2) to confirm the spatial distance impact of HABs on the three NDs. We applied generalized linear mixed models (GLMMs) for the first issue and used generalized linear models (GLMs) for the second one. For the GLMMs, we set the log link function and the normal distribution for a random component. In addition, to identify a serial correlation that the repeated measured response variable might have, we added an one-way random effect to the model [21]. The main effects of HABs on the variation of NDs were considered fixed effects in the model. Let \({y}_{jt}^k\) be the corrected number of kth ND defined in Eq. (2), where k = 1, 2, 3, 1 is for MNDs, 2 for ADs, and 3 for PDs and zjt be the observed concentration of chl-a with the same time point and location. We modeled the number of NDs for the effects of HABs in the following:
$$\textbf{log}\left({\boldsymbol{y}}_{\boldsymbol{jt}}^{\boldsymbol{k}}\right)={\boldsymbol{\beta}}_{\textbf{0}}+\sum_{\boldsymbol{l}=\textbf{0}}^{\boldsymbol{\infty}}{\boldsymbol{\beta}}_{\boldsymbol{l}+\textbf{1}}{\boldsymbol{z}}_{\boldsymbol{j},\boldsymbol{t}-\boldsymbol{l}}+{\boldsymbol{u}}_{\boldsymbol{j}}+{\boldsymbol{\epsilon}}_{\boldsymbol{j}\boldsymbol{t}},\boldsymbol{k}=\textbf{1},\textbf{2},\textbf{3},\boldsymbol{j}=\textbf{1},\textbf{2},\dots, \textbf{939},\boldsymbol{t}=\textbf{2005},\textbf{2006},\dots, \textbf{2017}$$
(3)
where uj is a random effect of jth location, and we assumed that all uj have zero mean and the same variance \({\sigma}_u^2\) and each uj s is independent of each other, and ϵjt is an independently and normally distributed white noise. We could reflect the serial correlation within a location by including a random effect in the model. We performed the Hausman test to determine whether the model had to include a random effect [22]. The null hypothesis of the trial was that there is zero correlation between the fixed effects and random effects. If we did not reject the null hypothesis, the random effect model was better than the fixed effect model. Based on the results of the Hausman test, all three ND models showed that the null hypothesis did not reject all. Therefore, we could say that all three models have random effects.
For the second issue, we only utilized the 2017 data set because it is the most recent one among the complete data sets, had very few missing observations, and it may best reflect the spatial effect of HABs in the model compared to other data sets. To model the ND number response variable, GLMs were used:
$$\textbf{log}\left({\boldsymbol{y}}_{\boldsymbol{jr}}^{\boldsymbol{k}}\right)={\boldsymbol{\beta}}_{\textbf{0}}+{\boldsymbol{\beta}}_{\boldsymbol{jr}}{\boldsymbol{z}}_{\boldsymbol{j}}+{\boldsymbol{\epsilon}}_{\boldsymbol{j}},\boldsymbol{k}=\textbf{1},\textbf{2},\textbf{3},\boldsymbol{j}=\textbf{1},\textbf{2},\dots, \textbf{939}.$$
(4)
In this model, the subscript k and j are the same with Eq. (3). The subscript r represents the size of the radius. We examined the change of the effects of HABs on three ND response variables according to the radius increment by 3 km and 5 km (see Supplementary Information). When we used a 1-km grid for this purpose, many grids had no population as the locations for measuring chl-a concentrations were generally on the rivers if the width of the water bodies was bigger than 1 km and very close to the rivers, lakes, and streams. Thus, few people reside within those 1-km grids, and those grids with red dots on the river generate a lighter blue color due to the number of lower populations (Fig. 2(c)). Therefore, we used the primary grid sizes of 3 km and 5 km, instead of 1 km, for further statistical analyses.