Conflation of roads
The majority of main roads (classified as motorways and primary roads) are already present in OpenStreetMap (OSM) in many countries. The primary road networks of Europe, Asia and the Americas are likely to be fully covered.
But many of the last-mile roads are still missing: driveways, forest tracks, and paths in sparsely populated areas outside towns.
The goal of this experiment was to learn how hard it would be to conflate the existing OpenStreetMap road network with an alternative vector map that is in many respects more complete.
In the text below the following terms are consistently used.
- “Old” means existing OSM data.
- “New” means data from alternative sources.
- “Ready” designates the subset of “new” data qualified for merging with “old” data.
- “Dropped” data contains roads from “new” that are duplicates, have quality issues, or should not be uploaded for other reasons.
I conducted the described experiment on a private server.
Main principles of conflation
The core principles are the same as in my previous conflation projects: be as conservative as possible with the new data and as liberal as possible with the old data. The following practical rules governed the process.
- In case of doubt, avoid marking a way as ready. It is better to drop a new way than to risk creating a duplicate of an existing way.
- The old data is treated as the absolute truth, even when it is proved or likely to be of worse quality than the matching new data.
- The old data is never removed. Old ways may be modified by adding new nodes, and elements may be retagged, but neither old ways nor their old nodes are ever removed from the server.
- New roads may be modified to better incorporate them into a consistent end result: ends of new ways can be repositioned, shortened, or lengthened to help them connect with old ways.
- Extension of existing road networks is preferred to creation of isolated sets of ready roads. However, isolated roads are preferable to not uploading anything at all.
The input map data was split into multiple disjoint geometrical extents, or areas. Each area was covered by two data layers.
- old layer with roads extracted from the OpenStreetMap planet database for the extent. In short, the planet extract was filtered with osmfilter for the pattern “highway=*”. OSM XML was the common input format.
- new layer with roads extracted from an ESRI shape (SHP) file; ogr2osm converted it to the OSM XML format.
The conflation selects ready roads from the new roads and connects them to old roads. The result for an area should be split into smaller chunks, each of which can be uploaded independently, in any order relative to the other chunks, and regardless of whether the other uploads succeeded. That is, chunks must not share ready elements (nodes or ways): once uploaded, elements receive new permanent IDs, and uploading them again would create duplicates. We will call such chunks of data clusters.
A ready collection of clusters satisfies the following rules.
- A cluster is not empty, it contains at least one new way.
- Clusters are of the minimum possible size that still allows uploading them independently. More specifically, all ways inside a cluster can be reached from each other (otherwise the cluster could be split into smaller ones). Obviously, every new way must be present in exactly one cluster. Additionally, modified old ways (with new nodes inserted into them) cannot be shared between clusters.
Old unmodified roads may optionally be included in one or more clusters to help visualize where ready roads connect to the existing road network. Because they are unmodified, they will not be uploaded and thus cannot create conflicts.
Optionally, output of dropped ways can be generated. It should contain all rejected new ways with comments attached explaining why they have been rejected. Examples of such reasons:
- intersection or aliasing with an old way,
- defects, such as being too short, self-intersecting, etc.
- no connection to an old road network.
The data processing consisted of a multistage pipeline with mostly scripted, but also manual passes.
Preliminary data cleaning
Unfortunately, the new input data is not perfect. Topological, factual or other issues manifest themselves for a minority of new ways.
To clean things up, it is possible to open the new data layer in JOSM, validate it, and resolve the discovered warnings. However, it did not seem reasonable to perform the cleanup this early: much of this manual work would be thrown away at conflation, as many ways would be dropped. Typically, fixing mistakes did not change the outcome of the conflation for a given way.
Instead, a sanitize pass I developed automatically addressed a few simple classes of issues.
Each new way has two end nodes. These nodes could become attached, or snapped, to old ways. Several geometrical rules could then decide that a given new way is a partial or complete duplicate of an old way. The new way can also intersect unrelated old ways. Any of such reasons causes the script to drop the candidate new way as aliased to one or more old ways.
Two ways can also exist on different vertical layers; in such cases they are less likely to be detected as aliased.
To speed things up, ways were indexed in a spatial data structure, which allowed quickly looking up all pairs of ways that could interact with each other. Pairs of ways with intersecting or overlapping bounding boxes (bboxes) were then tested for snapping and aliasing (see below).
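The text does not say which spatial data structure was used; a simple grid of buckets keyed by cell coordinates is one minimal way to get the same effect. The sketch below is a hypothetical illustration (class and parameter names are mine, and the cell size in degrees is an assumption):

```python
from collections import defaultdict

class GridIndex:
    """Bucket ways by the grid cells their bounding boxes overlap, so
    candidate pairs can be found without an all-pairs comparison.
    Bboxes are (min_lat, min_lon, max_lat, max_lon) tuples."""

    def __init__(self, cell_size=0.01):  # cell size in degrees (assumption)
        self.cell = cell_size
        self.buckets = defaultdict(list)

    def _cells(self, bbox):
        min_lat, min_lon, max_lat, max_lon = bbox
        c = self.cell
        for i in range(int(min_lat // c), int(max_lat // c) + 1):
            for j in range(int(min_lon // c), int(max_lon // c) + 1):
                yield (i, j)

    def insert(self, way_id, bbox):
        for cell in self._cells(bbox):
            self.buckets[cell].append((way_id, bbox))

    def query(self, bbox):
        """Return IDs of ways whose bbox intersects the query bbox."""
        min_lat, min_lon, max_lat, max_lon = bbox
        hits = set()
        for cell in self._cells(bbox):
            for way_id, (a, b, c2, d) in self.buckets[cell]:
                if a <= max_lat and c2 >= min_lat and b <= max_lon and d >= min_lon:
                    hits.add(way_id)
        return hits
```

Querying with a new way's bbox plus a margin then yields exactly the end-snap and intersection candidates mentioned above.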
The easiest way to explain all situations when a new way is aliased or not would be to draw pictures. But I cannot be bothered, so here is pseudocode for the algorithm.
For each new road from the new layer.
- Find its bounding box.
- Filter a list of end-snap candidates and intersections/duplicates from the old layer by new bbox + margin.
- Try to snap both ends of the new road to the candidate roads.
- For each end, find all nodes, counting from that end, that are closer than a threshold to a candidate road.
- If there are any such end nodes, all of them are replaced with a single connection node that joins the candidate road and the new road.
- The connection node, if it exists, replaces all snapped nodes and becomes the new end node of the new way. It may be either:
- an existing old node, or
- a completely new node added somewhere in the middle of an old way.
- If there are multiple snapping candidates among the old candidate ways for the same end, the closest one is chosen.
- After the snapping, if modifications to the new way were made, verify that the new way is still not empty and not trivial.
- This is done to exclude cases of short ways being completely assimilated into an old way after snapping.
- For each old candidate road, determine if it aliases the new road.
- Check the following conditions (“close” is defined by predefined threshold values).
- No segment of the new way, excluding the end points if they have been snapped, comes close to any of the old nodes.
- No segment of the old way comes close to any of the new nodes, excluding the snapped end nodes.
- No two segments of old and new ways intersect, with an exception for snapped ends, where they are guaranteed to have a common point.
- The angle between segments at a snapped node is not too sharp.
- If all the conditions above hold, the new way becomes ready. Otherwise it gets dropped.
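The end-snapping step of the algorithm above can be sketched in Python. This is an illustration only, not the actual implementation: it uses planar coordinates for simplicity (see the note on latitude/longitude distances later in the text), and the function name and return convention are hypothetical:

```python
import math

def snap_end(end, old_way, threshold):
    """Snap an end point of a new way to an old way (a list of (x, y)
    points). Returns ('node', index, point) when an existing old node is
    close enough, ('segment', index, point) when a new node should be
    inserted into segment `index` of the old way, or None when nothing
    is within the threshold."""
    best = None
    best_d = threshold
    # Prefer connecting to an existing old node over creating a new one.
    for i, p in enumerate(old_way):
        d = math.dist(end, p)
        if d <= best_d:
            best, best_d = ('node', i, p), d
    if best:
        return best
    # Otherwise project the end onto each segment of the old way and
    # pick the closest projection within the threshold.
    for i in range(len(old_way) - 1):
        a, b = old_way[i], old_way[i + 1]
        ab = (b[0] - a[0], b[1] - a[1])
        ap = (end[0] - a[0], end[1] - a[1])
        denom = ab[0] ** 2 + ab[1] ** 2
        t = max(0.0, min(1.0, (ap[0] * ab[0] + ap[1] * ab[1]) / denom)) if denom else 0.0
        q = (a[0] + t * ab[0], a[1] + t * ab[1])
        d = math.dist(end, q)
        if d <= best_d:
            best, best_d = ('segment', i, q), d
    return best
```

The real pipeline additionally has to merge runs of near-duplicate end nodes, respect layers, and reject snaps producing too sharp an angle, as described above.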
Output of ready data. All ready new ways, their nodes, and modified old ways (with new nodes snapped into them) are saved in a single OSM XML file.
Output of rejected data. This is only needed for debugging.
- Write a second OSM XML file, adding comment=* with the reasons for rejection. Comments should contain IDs of conflicting old ways, nodes, or other information. The saved data helps to determine why a certain way was dropped instead of becoming ready.
Manual data analysis
Before proceeding further, the ready data file was loaded into JOSM and validated. Usually, a reasonable number of warnings for both new and old data shows up. It was important to address all indicated problems in new data.
Certain issues in old ways and nodes could also be addressed. But it was important not to delete anything old, only to reorder, rename, move, etc. That is because the opened OSM XML file did not contain all the map data for the area, and deleted elements could still be referenced by outside ways or relations.
Certain warnings were actually caused by this incomplete nature of data. They required no resolution as they were not present in the complete dataset.
A cluster contains new roads and may also contain modified old ways, when new nodes were inserted into their middle.
My original clusterization script worked but was too slow. A rewrite using better data structures made it more than a hundred times faster.
Additionally, the clusterization encoded properties of the data into the OSM XML file names: a symbol indicated whether the cluster is connected to the old roads or completely isolated, and clusters were sorted by the number of new elements they contained. The biggest cluster happened to contain 10000 elements (nodes and ways). But the absolute majority of clusters were, as expected, tiny, sometimes only a single way segment connecting two existing nodes.
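The text does not say which data structures the fast rewrite used; a disjoint-set union (union-find) over shared node IDs is one plausible way to achieve that kind of speed-up, since it groups mutually reachable ways in near-linear time. A hedged sketch (all names are mine):

```python
class DSU:
    """Disjoint-set union with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def clusterize(ways):
    """ways: mapping way_id -> list of node ids. Ways reachable from
    each other through shared nodes land in the same cluster; clusters
    are returned biggest first, matching the sorting described above."""
    dsu = DSU()
    for way_id, nodes in ways.items():
        for n in nodes:
            # Tag keys so way IDs never collide with node IDs.
            dsu.union(('w', way_id), ('n', n))
    clusters = {}
    for way_id in ways:
        clusters.setdefault(dsu.find(('w', way_id)), []).append(way_id)
    return sorted(clusters.values(), key=len, reverse=True)
```

Ways 1 and 2 sharing node 11 would form one cluster, while a way touching no shared nodes would form its own single-way cluster.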
Ignored data interactions
Only the “highway” layer of old OSM data was processed at the conflation stage. The process did not take into consideration interactions between the chosen ready roads and currently hidden map features. This simplification most likely created inconsistencies, among them:
- Roads crossing buildings
- Roads crossing rivers, streams and other linear water objects should have segments mapped as “bridge”. The input data contained only footway bridges.
- Roads crossing other types of “roads”, such as railroads. (Handling for this was added later in the process, so railway=* did affect the conflation.)
- Roads running along the shore slipping into water.
These issues are not the worst thing possible: they do not break anything related to highway routing, and they can easily be detected and fixed later at any time. They are often caused by the coarseness of the OSM data, such as very roughly traced lake boundaries.
As usual, my understanding of the nature of the geographical data and of the relations between its components was adjusted during the process. The data processing phases and algorithms were updated or rewritten several times to accommodate the newly obtained knowledge.
For example, the simplification of using the Euclidean distance between two Earth points specified by (lat, lon) coordinates should be applied carefully. One degree of latitude is never equal to one degree of longitude when both are converted to meters.
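A common fix is the equirectangular approximation: scale the longitude difference by the cosine of the latitude before applying the planar formula. This is a sketch of that idea (the constant is approximate; it is good enough for snapping thresholds, while long distances call for the haversine formula):

```python
import math

def fast_distance_m(lat1, lon1, lat2, lon2):
    """Approximate distance in metres between two (lat, lon) points,
    correcting for the fact that degrees of longitude shrink with
    latitude."""
    m_per_deg = 111_320.0  # metres per degree of latitude (approx.)
    dlat = (lat2 - lat1) * m_per_deg
    dlon = (lon2 - lon1) * m_per_deg * math.cos(math.radians((lat1 + lat2) / 2))
    return math.hypot(dlat, dlon)
```

At 60° north, for instance, one degree of longitude comes out to roughly half the length of one degree of latitude, which is exactly the distortion the naive Euclidean formula ignores.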
Inputs may use more than two vertical layers, and layer values are not comparable across data sets: layer = -2 in one data set can be marked as layer = -1 in another. Both values are correct, as layers are only used for relative ordering of elements. For this reason, processing tightly built cities with multi-level road networks won’t work with the described approach; it targets primarily the countryside, with only occasional simple tunnels and bridges.
Snapping of nodes becomes tricky if there are multiple target candidates. A lot depends on the chosen threshold values, and in sophisticated situations it is always possible to end up with ugly resulting geometry.
Snapping both ends of the same new way to the same old way simultaneously turned out to be tricky, as there are many corner cases when things may go wrong. Having a debug data layer with dropped nodes and comments attached to nodes helped with debugging the discovered problems and finessing the algorithms.
Old data quality issues
Negative layer values for old forest roads in the old data. The conflation algorithm had to be adjusted to ignore layers when determining aliasing; it still takes layers into account when snapping.
Old roads described as multipolygons. The original data extraction kept only ways and no multipolygons. This situation arose extremely rarely, and only in densely populated areas where no ready roads survived the aliasing check anyway.
New data quality issues
The OSM XML derived from the SHP file was not spotless either. For example, one of the input files contained about 39 thousand new ways, and the JOSM validator reported about 800 errors/warnings for that file.
Typical problems were:
- missing connections between roads,
- overlapping roads,
- duplicate road segments,
- intersecting ways without connection,
- sharp angles in paths,
- tiny dashes (unconnected roads shorter than 10 meters),
- unconnected longer roads.
There was a rare case of a degenerate way containing only one node.
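Most of these defects can be detected mechanically before conflation. A hypothetical sketch of such checks (function and defect names are mine; coordinates are assumed to be already projected to metres):

```python
import math

def way_defects(nodes, min_len_m=10.0):
    """Return a list of defect labels for a way given as a list of
    (x, y) points in metres: degenerate (fewer than two nodes),
    zero-length, tiny dash (shorter than min_len_m), or duplicate
    nodes within the way."""
    defects = []
    if len(nodes) < 2:
        defects.append('degenerate')
        return defects
    length = sum(math.dist(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1))
    if length == 0.0:
        defects.append('zero-length')
    elif length < min_len_m:
        defects.append('tiny-dash')
    if len(set(nodes)) < len(nodes):
        defects.append('duplicate-nodes')
    return defects
```

Checks like missing connections or intersections without a common node need the neighbouring ways too, so they belong in the pairwise aliasing stage rather than in a per-way filter like this one.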
Decisions on which clusters to upload
As mentioned before, the clusterization separated all clusters into two groups.
- A connected cluster has at least one old node.
- An unconnected cluster has only new nodes (negative IDs).
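Given that convention (not-yet-uploaded elements carry negative placeholder IDs), the connectivity test reduces to a one-liner; the function name here is mine:

```python
def is_connected_cluster(node_ids):
    """A cluster touching any existing OSM node (positive ID) is
    connected to the old road network; all-negative IDs mean the
    cluster is isolated."""
    return any(node_id > 0 for node_id in node_ids)
```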
Connected clusters are mostly OK to upload. However, it was decided to ignore “small” ones, as they were more likely to introduce duplicates that slipped past the aliasing check, or to become “phantom roads”.
An unconnected cluster sometimes truly represents an isolated road network; roads and paths on an island fully surrounded by water are an example. Most often, however, a cluster became unconnected because all ways potentially leading to the existing road network were dropped as aliases, while the old ways did not quite reach far enough to allow snapping. Sometimes a couple of meters of distance prevented an isolated cluster from becoming connected.
It was decided to keep only clusters that were large enough (in terms of new elements). For unconnected clusters, a higher threshold was chosen.
All computation-heavy processing was done in Python. The complete XML data of both old and new layers was usually loaded into memory beforehand.
Conflation of an average area took from 30 to 60 minutes on a 9-year-old desktop with an Intel CPU at about 3.1 GHz. It required quite a lot of memory, usually from 4 GB to 16 GB; 32 GB of RAM was comfortable for conflating the biggest area yet without noticeable swapping.