Modelling and Assessing Spatial Big Data: Use Cases of the OpenStreetMap Full-History Dump (2019)


Many methods for intrinsic quality assessment of spatial data are based on the OpenStreetMap full-history dump. Typically, only a high-level analysis is conducted; few approaches take into account the low-level properties of the data files. In this chapter, a low-level data-type analysis is introduced. It offers a novel framework for obtaining an overview of big data files and for assessing the provenance (lineage) of full-history data. The developed tools generate tables and charts that facilitate the comparison and analysis of datasets. The resulting data also helped to develop a universal data model for efficiently storing OpenStreetMap full-history data in a relational database. Databases for several pilot sites were evaluated in two use cases. First, a number of intrinsic data quality indicators and related metrics were implemented. Second, a framework for the inventory of the spatial distribution of massive data uploads is discussed. Both use cases confirm the effectiveness of the proposed data-type analysis and the derived relational data model.

Noskov, A., Grinberger, A. Y., Papapesios, N., Rousell, A., Troilo, R., & Zipf, A. (2019). Modelling and Assessing Spatial Big Data: Use Cases of the OpenStreetMap Full-History Dump. In Spatial Planning in the Big Data Revolution (pp. 16-44). IGI Global

—————————————————————————————————————————————-

Nowadays, effective processing of big data files is a significant challenge. Familiar examples of big data files are images and videos. In addition, a tremendous amount of information is recorded in log files. Data-centered services often record all contributions to and modifications of their datasets. In contrast to private data hidden from the public, open data services such as Wikipedia and OpenStreetMap (OSM) usually publish the history of all users’ contributions as full-history dumps (FHD). Full-history information of open data projects is commonly provided either as compressed XML or in the Protocolbuffer Binary Format (PBF), which is based on Google’s Protocol Buffers.

Classically, big data files are converted to an indexed relational database for further processing. Recently, various big-data-specific solutions have been introduced, typically built on multi-core cloud infrastructure. Software for such multi-core processing usually implements the MapReduce concept. Many approaches to processing big log data have been introduced in recent years, and Apache Hadoop and Apache Spark are popular platforms for new MapReduce-based solutions.
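To make the MapReduce idea concrete, the following minimal sketch counts edits per user across two clips in plain Python. It is an illustration only, not the chapter's tooling: the record layout, the clip contents, and the names `map_clip` and `merge_counts` are all hypothetical.

```python
from collections import Counter
from functools import reduce

# Hypothetical input: each clip is a list of (user, element_type) edit records.
clip_a = [("alice", "node"), ("bob", "way"), ("alice", "node")]
clip_b = [("bob", "node"), ("carol", "relation")]


def map_clip(records):
    """Map step: count edits per user within a single clip."""
    return Counter(user for user, _ in records)


def merge_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Each clip is mapped independently, then the partial results are merged.
partials = map(map_clip, [clip_a, clip_b])
edits_per_user = reduce(merge_counts, partials, Counter())
```

Because the map step touches only one clip at a time, the clips could in principle be handed to separate workers; platforms such as Hadoop and Spark distribute exactly this kind of per-partition work and merge the results.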

In this work, OSM FHD data are considered. Although the article focuses on OSM, the proposed solutions can potentially be extended to other sources of full-history data (e.g., Wikipedia). All introduced approaches are developed as open source; hence, they can be quickly adapted and improved for other projects. Several of the prepared tools are assembled as part of the Integrated Geographic Information Systems Tool Kit (IGIS.TK), which can be described as an IDE for GIS projects. Currently, parallelization is achieved by processing clips of the FHD concurrently. In the future, this will be changed to use threading libraries for optimal utilization of the available CPUs.

In this work, a data-type model for the universal analysis of full-history data is introduced. The model is designed for low-level, data-type-based analysis; it provides an overview of an FHD and insight into the data provenance. It allows different clipped FHDs to be compared and the dynamics and specifics of users’ contributions to be observed. The resulting statistics are presented as tables and charts, which are available as interactive HTML files. Moreover, using the prepared statistics, a novel relational data model for OSM FHD has been developed. Databases for several pilot sites have been generated according to this relational model.
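As a rough illustration of what a low-level, data-type-based overview might look like, the sketch below streams through a tiny OSM-style XML snippet and tallies element names and attribute names. It is an assumption about the general idea, not the chapter's actual model: the sample data and the `profile` function are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily shortened full-history snippet (two versions of one node).
SAMPLE = b"""<osm version="0.6">
  <node id="1" version="1" timestamp="2010-05-01T10:00:00Z" uid="7" user="alice" lat="49.4" lon="8.7"/>
  <node id="1" version="2" timestamp="2012-03-02T09:30:00Z" uid="9" user="bob" lat="49.4" lon="8.7">
    <tag k="amenity" v="cafe"/>
  </node>
  <way id="5" version="1" timestamp="2012-03-02T09:31:00Z" uid="9" user="bob">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
  </way>
</osm>"""


def profile(stream):
    """Count element names and (element, attribute) pairs in one streaming pass."""
    elements = Counter()
    attributes = Counter()
    for _, elem in ET.iterparse(stream, events=("end",)):
        elements[elem.tag] += 1
        for name in elem.attrib:
            attributes[(elem.tag, name)] += 1
        elem.clear()  # free memory; important for multi-gigabyte dumps
    return elements, attributes


elements, attributes = profile(io.BytesIO(SAMPLE))
```

Running the same profile on two clipped dumps yields counters that can be compared side by side, which is the kind of table-style comparison the paragraph above describes.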

Two use cases are based on the prepared databases. In the first use case, various intrinsic data quality indicators and related metrics were calculated. The resulting data and charts allow users to compare the quality of the examined FHD datasets. In the second use case, the spatial distribution of massive data uploads is investigated.
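One simple intrinsic indicator that can be derived from such a database is the number of distinct contributors per year. The sketch below computes it with SQLite; the `element_version` table and its columns are a hypothetical, simplified schema chosen for illustration and should not be read as the relational model developed in the chapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per version of an OSM element.
conn.execute("""
    CREATE TABLE element_version (
        osm_type TEXT    NOT NULL,  -- 'node', 'way' or 'relation'
        osm_id   INTEGER NOT NULL,
        version  INTEGER NOT NULL,
        ts       TEXT    NOT NULL,  -- ISO 8601 timestamp
        uid      INTEGER NOT NULL,  -- contributor id
        PRIMARY KEY (osm_type, osm_id, version)
    )
""")

# Made-up sample rows.
rows = [
    ("node", 1, 1, "2010-05-01T10:00:00Z", 7),
    ("node", 1, 2, "2012-03-02T09:30:00Z", 9),
    ("way", 5, 1, "2012-03-02T09:31:00Z", 9),
    ("node", 2, 1, "2012-07-15T12:00:00Z", 7),
]
conn.executemany("INSERT INTO element_version VALUES (?, ?, ?, ?, ?)", rows)

# Distinct contributors per calendar year (the year is the first 4 characters of ts).
contributors_per_year = conn.execute("""
    SELECT substr(ts, 1, 4) AS year, COUNT(DISTINCT uid) AS contributors
    FROM element_version
    GROUP BY year
    ORDER BY year
""").fetchall()
```

Keeping every version as its own row means history-based indicators like this one reduce to ordinary aggregate queries.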


Data model of OSM full-history data

An example of a way (line) spatial feature containing bad characters.


Number of OSM contributors per year
