Several key concepts are currently at the forefront of understanding the geospatial big data revolution. Big data, such as electronic health records and customer transactions, are generally characterized by a high volume of data; a large variety of data sources, formats, and structures; and a high velocity of new data creation [5, 6, 7]. As a consequence, big data require specialized methods and techniques for processing and analysis. Data science broadly refers to methods that derive new knowledge from the rigorous analysis of big data, integrating methods and concepts from disciplines including computer science, engineering, and statistics [8, 9]. The data science workflow generally resembles an iterative process of data import and processing, followed by cleaning, transformation, visualization, modeling, and finally communication of results [10].
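As a rough illustration of the workflow stages listed above, the following minimal Python sketch chains import, cleaning, transformation, modeling, and communication on a tiny made-up table. The records, column names, and function names are all hypothetical and are not taken from the cited sources; the visualization stage is omitted for brevity, and a real big data workflow would rely on specialized, scalable tooling rather than in-memory lists.

```python
import csv
import io

# Hypothetical raw input: a tiny, made-up extract with typical quality problems.
RAW_CSV = """patient_id,age,systolic_bp
p1,34,118
p2,,131
p3,58,not_recorded
p4,46,142
p5,71,155
"""


def import_records(text):
    """Import: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: keep only rows whose numeric fields can be parsed."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"age": float(row["age"]), "bp": float(row["systolic_bp"])})
        except ValueError:
            continue  # skip incomplete or malformed records
    return cleaned


def transform(rows):
    """Transform: center age on its mean so the intercept is easier to read."""
    mean_age = sum(r["age"] for r in rows) / len(rows)
    return [{"age_centered": r["age"] - mean_age, "bp": r["bp"]} for r in rows]


def fit_line(rows):
    """Model: ordinary least squares fit of bp on centered age."""
    xs = [r["age_centered"] for r in rows]
    ys = [r["bp"] for r in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def communicate(slope, intercept, n):
    """Communicate: summarize the fitted model in one sentence."""
    return (f"Based on {n} usable records, systolic pressure changes by "
            f"{slope:.2f} units per year of age (mean level {intercept:.1f}).")


def run_pipeline(text):
    """Run the stages in order and return the cleaned rows and the fit."""
    rows = clean(import_records(text))
    slope, intercept = fit_line(transform(rows))
    return rows, slope, intercept


rows, slope, intercept = run_pipeline(RAW_CSV)
print(communicate(slope, intercept, len(rows)))
```

In practice the process is iterative: a surprising model result would typically send the analyst back to the cleaning or transformation stage before anything is communicated.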
Spatial data science is a niche and still-forming field focused on methods to process, manage, analyze, and visualize spatial big data, providing opportunities to derive dynamic insights from complex spatial phenomena [11]. Spatial data science workflows comprise steps for data manipulation, data integration, exploratory data analysis, visualization, and modeling, applied specifically to spatial data and often using specialized software for spatial data formats [12]. For example, a spatial data science workflow may include data wrangling with open source solutions such as the Geospatial Data Abstraction Library (GDAL); scripting in R, Python, and spatial SQL for spatial analyses facilitated by high-performance computing (e.g., querying big data stored on a distributed data infrastructure through cloud computing platforms such as Amazon Web Services, or conducting spatial big data analytics on a supercomputer); and geovisualization using D3. Spatial data synthesis is considered an important challenge in spatial data science and includes issues related to spatial data aggregation (across different scales) and spatial data integration (harmonizing diverse spatial data types that differ in format, reference, unit, etc.) [11]. Advances in cyberGIS (defined as GIS based on advanced cyberinfrastructure and e-science), and more broadly in high-performance computing capabilities for high-dimensional data, have played an integral role in transforming our capacity to handle spatial big data and thus to support spatial data science applications.
For example, ROGER, a National Science Foundation-supported cyberGIS supercomputer created in 2014, enables the execution of geospatial applications that require advanced cyberinfrastructure through high-performance computing (e.g., > 4 petabytes of high-speed persistent storage), graphics processing unit (GPU)-accelerated computing, big data-intensive subsystems using Hadoop and Spark, and OpenStack cloud computing [11, 13].
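The two aspects of spatial data synthesis described above, integration and aggregation, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the two point sources, their coordinates, and their elevation values are invented, and the helper names are hypothetical. It harmonizes coordinate format (degrees-minutes-seconds to decimal degrees) and unit (feet to meters), then aggregates the merged points to grid cells at two scales. A production workflow would more likely use GDAL or a spatial database, and a latitude-longitude grid is not equal-area.

```python
import math
from collections import defaultdict

METERS_PER_FOOT = 0.3048  # exact by definition


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees-minutes-seconds to signed decimal degrees."""
    sign = -1 if hemisphere in ("S", "W") else 1
    return sign * (degrees + minutes / 60 + seconds / 3600)


# Hypothetical source A: decimal degrees, elevation in meters.
source_a = [
    {"lon": -88.2272, "lat": 40.1106, "elev_m": 222.0},
    {"lon": -88.2434, "lat": 40.1164, "elev_m": 225.0},
]

# Hypothetical source B: degrees-minutes-seconds, elevation in feet.
source_b = [
    {"lon": (88, 12, 54, "W"), "lat": (40, 6, 45, "N"), "elev_ft": 740.0},
    {"lon": (87, 37, 45, "W"), "lat": (41, 52, 45, "N"), "elev_ft": 594.0},
]


def harmonize_b(record):
    """Integration: bring a source B record into source A's format and unit."""
    return {
        "lon": dms_to_decimal(*record["lon"]),
        "lat": dms_to_decimal(*record["lat"]),
        "elev_m": record["elev_ft"] * METERS_PER_FOOT,
    }


def aggregate(points, cell_size_deg):
    """Aggregation: mean elevation per square grid cell of the given size."""
    cells = defaultdict(list)
    for p in points:
        key = (math.floor(p["lon"] / cell_size_deg),
               math.floor(p["lat"] / cell_size_deg))
        cells[key].append(p["elev_m"])
    return {key: sum(vals) / len(vals) for key, vals in cells.items()}


points = source_a + [harmonize_b(r) for r in source_b]
coarse = aggregate(points, 1.0)   # 1-degree cells
fine = aggregate(points, 0.01)    # 0.01-degree cells

print(len(coarse), "coarse cells;", len(fine), "fine cells")
```

The same four points fall into fewer cells at the coarser scale, so the summary statistic per cell changes with the chosen scale of aggregation.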
As spatial data science continues to evolve as a discipline, spatial big data are constantly expanding, with two prominent examples being volunteered geographic information (VGI) and remote sensing. The term VGI encapsulates user-generated content with a locational component [14]. In the past decade, VGI has grown explosively with the advent and continued expansion of social media and smartphones, through which users create geotagged content such as tweets on Twitter, Instagram photos, Snapchat videos, and Yelp reviews [15]. Use of VGI should be accompanied by an awareness of potential legal issues, including but not limited to intellectual property, liability, and privacy, for the operator, contributor, and user of VGI [16]. Remote sensing is another type of spatial big data; it captures characteristics of objects from a distance, for example as imagery from satellite sensors [17]. Depending on the sensor, remote sensing spatial big data can be expansive in both geographic coverage (spanning the entire globe) and temporal coverage (with frequent revisit times). In recent years, satellite remote sensing big data have increased enormously as private companies and governments continue to launch higher-resolution satellites. For example, DigitalGlobe collects over 1 billion km² of high-resolution imagery each year through its constellation of commercial satellites, including the WorldView and GeoEye spacecraft [18]. The U.S. Geological Survey and NASA Landsat program has launched earth-observing satellites continually since 1972, with spatial resolutions as fine as 15 m and increasing spectral resolution with each subsequent Landsat mission (e.g., the Landsat 8 Operational Land Imager and Thermal Infrared Sensor, launched in 2013, together provide 9 spectral bands and 2 thermal bands) [19].