Atakua's Diary

Land cover import in OpenStreetMap

Land cover geographic data is what is mostly represented as landuse=* in the OSM database. Other tagging schemes e.g. landcover=* also exist.

During the ongoing land cover import for Sweden [1] I learned several things that were not documented anywhere in the OSM wiki or elsewhere, as far as I know. Below are my observations on what pitfalls and issues arise during all stages of the import, from data conversion to conflation and tag choice.

Imports of zero-dimensional (separate points of interest) and linear (roads) objects are regularly done in OSM. Some documentation and tools to assist with such imports exist. Compared to them, importing polygons and multipolygons poses unique challenges.

Please make sure you read the OSM import guidelines [2], [7] before you start working on an import. Then document as much of the process as possible for your own and others' reference, as it is likely to take more time to finish than you originally estimated.

General lessons learned

Start small. Choose a small subset of import data and experiment with it. You are likely to improve your data processing pipeline several times, and it is faster to evaluate what an end result would look like on a small piece of data. When your approach starts looking good enough to be applied to a larger scale, still try to keep individual pieces of uploaded data reasonably small as this reduces risks of data conflicts, incomplete uploads, and generally helps with parallelizing the work.
Prepare for a lot of opposition. The OSM community has seen many badly carried out imports. A small mistake in data that may be forgivable and easily correctable when made once becomes a major problem when replicated at scale by a script. People are especially defensive about land cover imports. Be persistent, listen to constructive feedback, ignore non-constructive feedback, improve your tools and data, and do not blindly trust your tools.
Keep track of what is done. Document everything. Your import will take quite some time, and you are likely to forget what you did in the beginning. As more than one person may work on the import, two or more people may end up processing the same geographical extent in parallel. Write down what is done and by whom. Besides, there are not many success stories about land cover imports, and we need more documented experience on the topic.
Find those who are interested in the same thing. Depending on your ambitions, you will have to import hundreds of thousands of new objects. Find some like-minded individuals to parallelize as much of the manual work as possible.
Learn your tools and write new tools. A lot of data processing is done in JOSM, including final touch-ups and often the uploading and conflict resolution, so knowing the most effective ways to accomplish everyday tasks helps to speed the process up. Learn the editor’s shortcut key combinations, or change them to match your taste. Some good key combinations helping to work with multiple data layers that I did not know before starting the import are mentioned in [11]. Programming skills are also a must for at least one person in your group who develops an automated data conversion pipeline or adjusts existing tools for the purpose. Always look for ways to automate tedious work. And always look for ways to improve your data processing pipeline as you learn new patterns in your new or existing OSM data.
Think about coordinate systems early. It is not enough to assume that everything comes in WGS84. Convert all data to the same coordinate system before you start merging or otherwise processing it. I discovered that, depending on the moment when a dataset is finally brought to WGS84, there may or may not be weird effects such as overlapping of adjacent tiles, loss of precision resulting in self-intersections of polygons, or other avoidable effects.
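To make the coordinate system point concrete, here is a minimal sketch of an explicit conversion step. It assumes, purely for illustration, that the source data comes in spherical Web Mercator (EPSG:3857); your dataset will likely use a different national projection, in which case a projection library such as PROJ is the right tool. The function names are mine and not part of any existing pipeline.

```python
import math

# Sphere radius used by spherical Web Mercator (EPSG:3857).
EARTH_RADIUS_M = 6378137.0


def web_mercator_to_wgs84(x, y):
    """Convert Web Mercator metres to (lon, lat) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat


def wgs84_to_web_mercator(lon, lat):
    """Convert (lon, lat) in degrees to Web Mercator metres."""
    x = EARTH_RADIUS_M * math.radians(lon)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4.0 + math.radians(lat) / 2.0))
    return x, y


def ring_to_wgs84(ring_xy):
    """Convert a whole ring in one place, so that every dataset passes
    through exactly the same conversion at the same pipeline stage."""
    return [web_mercator_to_wgs84(x, y) for x, y in ring_xy]
```

Keeping the conversion in a single, explicit function makes it easier to decide at which stage of the pipeline it runs, and to move it if tile overlaps or precision problems show up.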

On Source data

The source data for your import may be in either raster or vector form. If you have a choice, prefer importing vector data, as it saves you one step in data processing and avoids the troubles related to raster to vector conversion.

Data actuality

It goes without saying that the source of the import should be up to date. Cross-checking it against available aerial imagery throughout the import process should help with judging how current the offered data is.

Objects classification

Take notice of what classes of features are present in the import data and how well they can be translated into the OSM tagging scheme. You may well want to throw away, reclassify or merge several feature classes present in the import data.

Even after you have started the import, keep cross-checking that the classification is consistently applied in the source and correctly translated to the OSM tagging scheme, and keep estimating how often misclassification mistakes requiring manual correction occur.

As an example, an area classified as “grass” in a source dataset may in fact represent a golf course, a park, a farmland, a heath etc. in different parts of a country. They are tagged differently in OSM, and, if possible, this difference should be preserved.

Check regularly for misclassification of features, especially when you switch between areas with largely different biotopes. Tag choices for tundra are likely to be different from those used for more southern areas, and will require adjustments to the tag correspondence mapping used in your data processing routines.
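A tag correspondence mapping of this kind can be kept as plain data, with per-region overrides on top of the defaults. The sketch below is only an illustration: the source class names, the region name and the particular tag choices are hypothetical and have to be replaced with whatever your dataset and local community agree on.

```python
# Hypothetical source classes mapped to OSM tags (illustrative choices only).
DEFAULT_TAGS = {
    "forest": {"landuse": "forest"},
    "water": {"natural": "water"},
    "farmland": {"landuse": "farmland"},
    "grass": {"landuse": "grass"},
}

# Hypothetical per-region adjustments, e.g. for a different biotope.
REGION_OVERRIDES = {
    "mountain": {
        "grass": {"natural": "heath"},
        "forest": {"natural": "wood"},
    },
}

# Source classes that are deliberately not imported.
SKIPPED_CLASSES = {"unclassified"}


def tags_for(source_class, region=None):
    """Return OSM tags for a source class, or None if the class is skipped."""
    if source_class in SKIPPED_CLASSES:
        return None
    overrides = REGION_OVERRIDES.get(region, {})
    if source_class in overrides:
        return dict(overrides[source_class])
    if source_class not in DEFAULT_TAGS:
        # Fail loudly: an unmapped class should never be imported silently.
        raise KeyError(f"no tag mapping for source class {source_class!r}")
    return dict(DEFAULT_TAGS[source_class])
```

Raising an error for unknown classes, instead of falling back to some default tag, makes gaps in the mapping visible as soon as a new area introduces a class you have not seen before.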

Be aware that there are also unsolved issues of tagging of e.g. natural=wood against landuse=forest that might affect your decisions on tags choice. Very good points on complexity of properly tagging of land cover are presented in [3].

Data resolution

For linear objects it is important to adequately reflect shape and position of actual natural features they represent.

Data resolution is also important as certain types of objects are worth importing only if they are represented with fine enough resolution. Conversely, having objects with too many details will increase the amount of raw data to process without giving any benefit to the end result.

It is easy to tell the resolution of raster data as it is defined by pixel size. For vector data, an estimate of how well newly added polygons align with existing ones can be used as a rough indication of data resolution.

Consider the following example. For a forest, having its details drawn on a map in the range of 1 to 10 meters should be just enough for practical uses. A forest with a unit resolution of 100 meters is of less use for e.g. pedestrians. But mapping a forest with resolution of 10 centimeters is basically outlining every tree, which is of little practical use for larger territories.

Similarly, trying to create outlines of regular buildings from data with a resolution of one meter or worse will fail to capture their true shape. Data with 10 cm details can be used to correctly detect all the 90 degree angles of buildings.

Converting data to OSM format

Most likely the import data will not be in a format directly acceptable by the OSM database, that is, OSM XML or equivalent binary formats. Additional processing steps will be needed to load the data using one or more third-party or custom written tools.

Several freely available GIS applications, libraries and frameworks can help you with data processing: QGIS, GRASS GIS, GDAL, OGR etc. However, knowing how to program is a must at this stage, as it is often simpler to write a small converter in Python (or a comparable scripting language) than to try to do the same work in a GUI tool. Moreover, many steps have to be applied to many files, which is also best automated through scripting.

Raster data may often be available in GeoTIFF format which is a TIFF image with additional metadata describing coordinate system, bounding box, pixels meaning etc. Vector data can be present in many forms, from simple CSV files to ESRI shapefiles, XML-based GML, JSON-based GeoJSON files, or even stored in a geospatial database.

Once data is converted into the OSM XML format, a few tools are available to process it as well, such as command-line tools osmconvert, osmfilter, and tools and plug-ins for the main OSM editor JOSM.
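As an illustration of how small such a converter can be, the sketch below writes a single closed polygon as OSM XML using only the Python standard library. The function name and the generator string are made up for this example; negative ids follow the usual convention for objects that do not yet exist in the OSM database. A real converter would also have to handle multipolygon relations, node sharing between adjacent ways and the limit on the number of nodes per way.

```python
import xml.etree.ElementTree as ET


def polygon_to_osm_xml(ring, tags):
    """Serialize one polygon as OSM XML.

    ring: list of (lon, lat) tuples, without repeating the first point.
    tags: dict of OSM tags to put on the resulting closed way.
    """
    root = ET.Element("osm", version="0.6", generator="landcover-sketch")
    node_ids = []
    for index, (lon, lat) in enumerate(ring, start=1):
        node_id = -index  # negative id: object is new, not yet uploaded
        ET.SubElement(root, "node", id=str(node_id),
                      lat=f"{lat:.7f}", lon=f"{lon:.7f}")
        node_ids.append(node_id)

    way = ET.SubElement(root, "way", id="-1")
    # Close the ring by referencing the first node again at the end.
    for node_id in node_ids + node_ids[:1]:
        ET.SubElement(way, "nd", ref=str(node_id))
    for key, value in sorted(tags.items()):
        ET.SubElement(way, "tag", k=key, v=value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    triangle = [(18.000, 59.000), (18.001, 59.000), (18.001, 59.001)]
    print(polygon_to_osm_xml(triangle, {"landuse": "forest"}))
```

The resulting text can be saved to a file and opened as a separate layer in an editor for visual inspection before anything is uploaded.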

Importing a single feature

The whole process of importing can be described as repeated addition, modification or deletion of features, in the land cover case represented by individual units of forests, farmlands, residential areas etc. Such features have to be extracted from the source data, and then inserted into the OSM database. Many decisions have to be made to make sure that enough useful information is extracted and not much noise is introduced at the same time, so that the new feature won’t create more trouble than the good it brings.

The following tasks have to be solved for every import feature considered for manipulation.

Tracing vector boundaries of a feature. For vector data, the boundaries should be already in vector format. For raster data, individual pixels with the same classification have to be grouped into bigger vector outlines of (multi)polygons. Certain tools exist that can assist with solving this [4].
Assigning correct tags to features. The tagging scheme of OSM has unique properties, and deciding what tags a (multi)polygon should have is very important. At the very least, the tagging for new features should match what has already been used for tagging objects in the same area.
Assuring correct mutual boundaries between old and new features. The OSM project supports only a single data layer, and everything has to be nicely organized in that one layer. Although different types of land cover may overlap in reality, it is not the common case. Often no sharp border can be defined between two natural or artificial areas either, but maps usually simplify this by representing things with a single border. Certain types of overlaps are definitely considered to be erroneous, e.g. two forests overlapping by a large part, or a forest sliding into a lake. Note that this task is affected by how accurately boundaries were specified for pre-mapped features. Sometimes it is feasible to delete old objects and replace them with new ones, provided that there is enough evidence that new features do not lose any information present in the old objects. See further discussion on the subject below.
Assuring correct borders between adjacent imported data pieces. As a dataset for land cover is rarely imported in a single go for the whole planet, it is bound to be split into more or less arbitrary sized and shaped chunks. The data itself does not necessarily dictate on what principle such splitting is to be made. Borders for these chunks may be chosen based on an administrative principle (import by country, municipality, city, region etc) and/or by data size limitations (rectangular tiles of several kilometers wide etc.) Regardless of a chosen strategy, artificial borders will be imposed upon the data. E.g. one can split what in reality is a single farmland into several parts. It is often important to hide such seams in the end result by carefully “sewing” the features back together. In certain cases of bugs in the splitting process, new data pieces may even start overlapping with adjacent pieces, which only adds extra manual work without any value.
Finding balance between data density and usefulness. Even if import data resolution looks to be optimal, it is often worth further filtering, smoothing, removing small details or otherwise pre- and postprocessing the resulting features. Let us consider a few examples. a) For the case of raster data, it is worth removing lone “forest” pixels marking individual trees standing in a farmland or in a residential area. b) Rasterization noise is always an issue to deal with: imported data should not look “pixelated” with suspicious-looking 90 degree corners where there should be none. c) Lastly, many nodes lying on a straight line can be safely removed without losing accuracy of the vector data while reducing its size. A lot of filters exist for both simplification and smoothing [4] [5], but most of them require some experimentation to find their optimal parameters. Overly aggressive filtering can destroy essential parts of imported features.
Keeping feature size under control. Artificial splitting of import data surprisingly has its own positive effects on keeping size/area of natural features in check. An automatically traced forest can turn into a multipolygon that spans many dozens of thousands of nodes and hundreds of inner ways. In practice having several smaller and simpler organized adjacent polygons covering the same area is better. Other means to keep features’ size in check can be used. For example, roads crossing forests can effectively cut them in smaller parts that are then represented as more contained features.
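The removal of nodes lying on (nearly) straight lines mentioned above can be sketched with the classic Ramer-Douglas-Peucker algorithm. This is one common simplification filter, not necessarily the one used by the tools in [4] and [5]. The sketch assumes coordinates in a projected system with metres as units, so that the tolerance is a distance in metres.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment (e.g. first == last node of a closed ring).
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Drop nodes that deviate less than `tolerance` from the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    worst_index, worst_distance = 0, -1.0
    for i in range(1, len(points) - 1):
        distance = _point_segment_distance(points[i], first, last)
        if distance > worst_distance:
            worst_index, worst_distance = i, distance
    if worst_distance <= tolerance:
        return [first, last]
    # Keep the most deviating node and simplify both halves separately.
    left = simplify(points[: worst_index + 1], tolerance)
    right = simplify(points[worst_index:], tolerance)
    return left[:-1] + right
```

The tolerance is exactly the kind of parameter that needs experimentation: too small and pixel staircases survive, too large and narrow features collapse. Note also that simplifying two adjacent polygons independently may pull their shared border apart, and that the recursive form shown here may need to be rewritten iteratively for very long ways.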

Conflation with existing data

Conflation is merging data from two or more sources into a single consistent representation. For us, it means merging two layers of vector data, “new” with import features and “old” with existing features, into a single layer to be then uploaded to the main OSM database.

Let us assume that both the “old” data already present in the OSM and “new” data to be imported are self-consistent: no overlapping happens, no broken polygons are present etc. Always make sure that it is true for both layers before you start merging them, and fix discovered problems early.

When these data layers are self-consistent, new inconsistencies can only arise from interaction of old and new features. For the land cover case, it is the (multi)polygons intersecting and overlapping each other in all possible ways.

Start thinking about how you are going to address these problems early in the import process. Solving them efficiently is a critical component of a successful import.

Algorithmically, the problems of finding the exact shape of an overlap, points of intersection, common borders etc. of two or more multipolygons are very far from trivial. Solutions to such problems tend to be computationally expensive, meaning that applying them to a huge number of features with many nodes in each may take an unreasonable time to finish.

Whenever possible, the conflation task should be simplified. Compromises between speed, accuracy and data loss/simplification have to be made. Improve your conflation algorithms as you progress with early data chunks and learn the issues arising over them. If you see that same tasks arise over and over and take a lot of human time to be resolved manually, integrate a solution for it into your algorithms.

For situations when old and new features overlap, intersect or otherwise happen to be in a conflict, define a consistent decision making strategy on which modifications to conflicting features will be applied. For example, one should decide in which situations old or new nodes are to be removed, moved or added, whether conflicting features are to be merged, or if there are conditions for some of them to be thrown away.

Be on the lookout for common patterns in the data that can be easily solved by a computer. More complex cases can be marked by the computer for manual resolution.

Do not leave too much work for humans however. Humans are bad with tedious work, and will quickly start making mistakes. Everything that is reasonable to do by a machine for conflation should be done by machine.

It is easiest to solve conflicts when no conflicts can arise. If features cannot overlap, they cannot conflict. In this sense, undershoot in data is better than overshoot, but again, make an informed decision about what suits your import best.

Consider that making sure that features’ borders are aligned is easier than making a decision about what to do with two arbitrarily overlapping polygons. This means that making sure that no two old/new feature pairs overlap would help greatly with conflation. Simply put, importing features only for areas where there is “white space” on the map is easier than importing features into already tightly mapped areas. You might clean up the space first by removing old features (according to some strategy), or just leave those alone.

Decide how to verify that conflation is correct. JOSM validator helps with detecting when e.g. two ways with identical “landuse” tags overlap. But it does not check for everything, and certain cases that are obvious for a human to be wrong are skipped by the validator. Use it and other similar tools, as well as visual inspection of the result, to see that no (obvious) errors have slipped into the end result.

Let us consider different strategies to approaching the data merging task.

Raster vs Raster

This is arguably the simplest case of data conflation, as the task can be reduced to comparing values of individual pixels: two pixels either fully overlap or do not overlap at all (provided that the two rasters are brought to the same resolution and extent). There is no problem with detecting intersections of (multi)polygons. Algorithmic complexity is proportional to the area of a raster map in pixels.

Existing OSM data is vector, not raster. You can, however, rasterize it [6] into a matrix of values that can then be processed together with the import raster data layer. A decision about every pixel of import data can now be made based on two data sources: old data for the pixel, and new data from the import. For example, pixels for which there already is some data in the OSM database can be made “invisible” for the vectorization process, and tracing will not create any vector features passing through such pixels. This effectively makes sure that no “deep” overlapping between old and new features can happen. However, due to data loss in the rasterization process, old and new features may and will still intersect somewhat along their common borders. Thus, the task of making sure two features do not overlap is reduced to the task of making sure two features have a common border.

This approach was used with the Sweden land cover import [1]. Issues discovered so far, and solutions for them, are listed below.
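The per-pixel masking idea can be sketched in a few lines. The example uses plain nested lists instead of a real raster library to stay self-contained; the `NODATA` value and the function name are assumptions of this sketch, and an actual pipeline would more likely operate on GeoTIFF layers through GDAL or GRASS GIS.

```python
NODATA = 0  # assumed "no class" value that the vectorizer ignores


def mask_existing(import_raster, osm_mask, nodata=NODATA):
    """Hide import pixels that are already covered by existing OSM data.

    import_raster: rows of class values from the import dataset.
    osm_mask: rows of booleans, True where rasterized OSM data exists.
    Both grids must have exactly the same dimensions and alignment.
    """
    same_shape = len(import_raster) == len(osm_mask) and all(
        len(row) == len(mask_row)
        for row, mask_row in zip(import_raster, osm_mask)
    )
    if not same_shape:
        # Refuse to guess: a shifted or resized mask produces phantom holes.
        raise ValueError("import raster and OSM mask differ in dimensions")
    return [
        [nodata if occupied else value
         for value, occupied in zip(row, mask_row)]
        for row, mask_row in zip(import_raster, osm_mask)
    ]
```

Only the dimensions are checked here. Equal dimensions do not guarantee equal extent or projection, so those still have to be verified from the layer metadata before masking.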

Make sure that the raster datasets being compared are completely identical in their extent (bounding box position and dimensions), resolution and coordinate projection systems. Tools that work with geographic data in raster formats typically expect all layers to have exactly the same dimensions, no relative shifting, no variation in projections etc. Not all of them have proper error reporting when these conditions do not hold. An incorrectly skewed mask layer creates holes for phantom features in the resulting vector layer, simultaneously leaving a lot of overlapping for features that should not have been generated at all.
Quality of existing OSM data starts to affect results of vectorization of the import data layer. Any errors or simplifications that are present in old OSM data will be reflected in the eventual vector data to import. For example, a roughly sketched forest patch will partially mask a nearby farmland about to be imported, thus distorting its shape. A road that was drawn too far off its actual position will create a phantom cutline in your data.
Because there may be multiple ways to draw a border for a forest or another natural patch of land, sliver (long and thin) patches of land cover may appear in the data. They stem from “subtracting” two almost identically shaped old and new polygons.
To effectively prevent new land cover polygons from creeping over roads, include roads in the mask layer. However, because roads usually have no “thickness”, use buffering [12] to create an area around them, effectively turning them into long curved polygons. The same applies to railroads. To prevent houses from being partially covered by small patches of trees, you can try buffering them as well.
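Buffering as in [12] is normally a vector operation done in a GIS tool before rasterization. As an alternative illustration that needs no GIS library, the sketch below thickens an already rasterized road mask by morphological dilation with a roughly circular footprint. This is a substitute technique chosen for the example, and the radius in pixels is an assumed parameter to be tuned against the raster resolution.

```python
def dilate(mask, radius):
    """Grow every True pixel of a boolean grid by `radius` pixels."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    # Pixel offsets that fall inside a disc of the given radius.
    offsets = [
        (dr, dc)
        for dr in range(-radius, radius + 1)
        for dc in range(-radius, radius + 1)
        if dr * dr + dc * dc <= radius * radius
    ]
    result = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    result[rr][cc] = True
    return result
```

With 10 m pixels, for example, a radius of 1 turns a one-pixel-wide road line into a strip roughly 30 m wide, which may or may not be too much depending on the road class.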

Vector vs vector conflict solving

There are no known tools working directly with OSM-format data that allow for conflict resolution with another set of vector features. One can, however, import existing OSM data into a GIS application together with the import vector data, do the processing there, and then export the modified import vector features to an OSM file. More exploration is needed in this direction to see how efficient and accurate it can be, e.g. using vector overlaying [10].

You might consider some old vector features for removal from the OSM dataset if new features that are located at the same position are of better spatial resolution, quality or tagging set. For example, data coming from older imports may be considered for replacement by new, more recent/detailed import data. Be careful, however, with editing OSM XML extract files in your scripts, as simply deleting (or marking for deletion) nodes, ways or relations may leave dangling references from other objects somewhere outside the current map extent. Typically, automatic removal of old features is best avoided; the final decision must be made manually.
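A script that deletes nodes can at least check the extract itself for ways that still reference them. The following is a minimal sketch using the standard library XML parser; the function name is mine. It only sees objects present in the extract, so it cannot detect references from objects outside the downloaded extent, and it does not look at relation members.

```python
import xml.etree.ElementTree as ET


def find_dangling_references(osm_xml, deleted_node_ids):
    """Return {way id: [node refs]} for ways that use nodes marked for deletion."""
    root = ET.fromstring(osm_xml)
    deleted = {str(node_id) for node_id in deleted_node_ids}
    problems = {}
    for way in root.iter("way"):
        hits = [nd.get("ref") for nd in way.iter("nd") if nd.get("ref") in deleted]
        if hits:
            problems[way.get("id")] = hits
    return problems


if __name__ == "__main__":
    sample = """
    <osm version="0.6">
      <node id="1" lat="59.0" lon="18.0"/>
      <node id="2" lat="59.0" lon="18.1"/>
      <node id="3" lat="59.1" lon="18.1"/>
      <way id="10"><nd ref="1"/><nd ref="2"/></way>
      <way id="11"><nd ref="2"/><nd ref="3"/></way>
    </osm>
    """
    print(find_dangling_references(sample, [3]))
```

An empty result therefore means "no problem found in this file", not "safe to delete".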

Deleting objects is easier in new data layers as they have not yet been “observed” in a global database and no external references could be created yet, so you can just drop unnecessary objects from your files. Land cover data is almost always redundant to some extent, so often instead of trying to solve a complex conflicting overlapping or intersection it is easier to remove them altogether and replace with a manually drawn configuration.

If your old and new vector features can overlap in an arbitrary manner, you should explore known algorithms for merging, subtracting and similar operations on (multi)polygons. Be sure to measure their performance, however, as they may be too slow for bigger datasets. Simpler and faster spatial strategies that detect some common cases of (non)interaction between features should be used first. For example, before calculating an intersection of two polygons, check if their bounding boxes overlap. If they do not overlap, there is no chance of intersection either, and there is no need to run a complex algorithm to discover that. Many spatial indexes have been invented to aid with the general task of telling whether two objects are “close” to each other.
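The bounding box pre-check is simple enough to show in full. In this sketch polygons are given as plain lists of (x, y) tuples for their outer rings, and the function names are illustrative.

```python
def bounding_box(ring):
    """Return (min_x, min_y, max_x, max_y) of a ring of (x, y) points."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    return min(xs), min(ys), max(xs), max(ys)


def boxes_overlap(a, b):
    """True if two boxes share at least one point (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_pairs(old_rings, new_rings):
    """Yield (old_index, new_index) pairs worth passing to an exact,
    expensive intersection algorithm; all other pairs cannot intersect."""
    old_boxes = [bounding_box(ring) for ring in old_rings]
    for new_index, new_ring in enumerate(new_rings):
        new_box = bounding_box(new_ring)
        for old_index, old_box in enumerate(old_boxes):
            if boxes_overlap(old_box, new_box):
                yield old_index, new_index
```

This still compares every old box with every new box. For large datasets a spatial index such as an R-tree or a simple grid of buckets removes most of those comparisons as well.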

If there is a guarantee that features may only intersect in a tight area along their common borders, one can try snapping [8] of nodes to lines in an attempt to unify the border between the features. This is not universal, however, as there will always be cases when manual post-editing will be needed. The following JOSM plug-in [9] was developed to assist with a kind of node snapping in JOSM; it still relies on manual adjustment for complex cases.
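The core of node snapping is projecting a node onto the nearest segment of the other feature's border and moving it there if it is close enough. Below is a minimal sketch of that step, assuming projected coordinates in metres; it is not the algorithm of the tools in [8] or [9], and it ignores complications such as inserting the snapped position as a new node into the target way.

```python
import math


def _closest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is closest to p."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return (ax, ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def snap_nodes(nodes, line, tolerance):
    """Move each node onto `line` if it lies within `tolerance` of it.

    nodes: list of (x, y) points of the new feature's border.
    line: list of (x, y) points of the existing border to snap to.
    Nodes farther away than `tolerance` are returned unchanged.
    """
    snapped = []
    for p in nodes:
        best_point, best_distance = p, tolerance
        for a, b in zip(line, line[1:]):
            q = _closest_point_on_segment(p, a, b)
            distance = math.hypot(p[0] - q[0], p[1] - q[1])
            if distance <= best_distance:
                best_point, best_distance = q, distance
        snapped.append(best_point)
    return snapped
```

After snapping, consecutive nodes may end up at the same position or in reversed order along the line, so the result still has to be cleaned and validated before use.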

To be continued: practical examples of problems and solutions

There are a lot of common specific issues with land cover data and its conflation. Pictures should also definitely help with explaining them. I plan to talk about these in more detail later in a separate post.