Atakua's Diary

Import plan of Swedish settlements names from Lantmäteriet’s GSD-Terrängkartan

The text below is the first revision of the new OpenStreetMap data import plan. The latest state of the document can be found on the OSM wiki page. It is likely to receive several updates over time as the project progresses, but the technical aspects of the process are likely to remain the same.

Import of Swedish settlements names from Lantmäteriet’s GSD-Terrängkartan

Goals

To improve the completeness of OSM toponym data for the territory of Sweden using an official map supplied by the Swedish mapping, cadastral and land registration authority.

This import considers OSM data representable as nodes tagged with usual key/value pairs: “place=city”, “place=town”, “place=village”, “place=hamlet”, “place=isolated_dwelling”, and “place=locality”. However, it is not planned (but not fully excluded either) to add/modify any nodes with “city” and “town” values. They are expected to be already fully mapped.

The physical geographical features associated with these tags are collectively called “settlements” below. The names of settlements are called “toponyms”.

Schedule

January 2020 — start of the project.
(TODO date): discuss at talk-se mailing list, address discovered issues
(TODO date): discuss at imports mailing list, address discovered issues
(TODO date): upload the first batch of data

Import Data

The import data comes in the form of vector SHP files produced by ESRI ArcGIS software.

Documentation for the imported data is provided inside the downloaded archives in PDF form (in Swedish). A copy of the PDF file is available here: https://drive.google.com/open?id=1hFigNt6DKdiPpLIkAsKn7pr47gVlK_Hz

Specifically, data from gistext/TX layers corresponding to toponyms is extracted.

Background

Data source site: https://www.lantmateriet.se/sv/Kartor-och-geografisk-information/geodataprodukter/produktlista/terrangkartan/

The actual SHP files are downloaded from ftp://download-opendata.lantmateriet.se/ after a free account is registered. A copy of them is placed here: https://drive.google.com/open?id=1PCCs6AJpk48P6hMhcIM0Ax8YO3mR2-OP

Data license: CC0.
Type of license (if applicable): Public Domain with Attribution
OSM attribution: https://wiki.openstreetmap.org/wiki/Contributors#Lantm.C3.A4teri
ODbL Compliance verified: yes

Import Type

This is a one-time import of the data available as of Q1 2020. The data is first processed with scripts, then loaded into JOSM, visually checked to be consistent and non-conflicting, and validated with existing tools. It is then uploaded through the JOSM upload interface.

Data Preparation

In the rest of the text, “old” denotes data already present in the OSM database, “new” denotes data extracted from the source, and “ready” denotes the result of automatic conflation of “old” and “new”.

Data Reduction & Simplification

From the source vector, points corresponding to toponyms’ labels are extracted. The rest of the data is dropped from the source.

At the conflation stage, only points found not to be represented in the current OSM database are preserved.

Tagging Plans

Points in the source files are associated with a number of fields.

The KKOD field of the source SHP files is mapped to one of the OSM “place” values. The TEXT field is mapped to the “name” tag, with possible transformations (see below). See the PDF documentation mentioned earlier for an explanation of the KKOD and TEXT fields.

KKODs are mapped according to the following Python dict:

~~~~~

tr_t = {
    1: {"place": "isolated_dwelling"},  # Bebyggelse, enstaka gård, hus
    2: {"place": "hamlet"},  # Bebyggelse, by, större gård, mindre stadsdel
    3: {"place": "hamlet"},  # Bebyggelse, by, stadsdel
    4: {"place": "hamlet"},  # Bebyggelse, samhälle, samlad by
    5: {"place": "village"},  # Tätort 200 - 499 inv., större stadsdel
    6: {"place": "town"},  # Tätort 500 - 1 999 inv.
    7: {"place": "town"},  # Tätort 2 000 - 9 999 inv.
    8: {"place": "city"},  # Tätort 10 000 - 49 999 inv.
    9: {"place": "city"},  # Tätort 50 000 och fler inv.
    14: {"place": "locality"},  # Herrgård, storleksklass 1
    16: {"place": "locality"},  # Herrgård, storleksklass 2
}

~~~~~
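As a minimal sketch of how such a table might be applied, the hypothetical function below turns one source record (its KKOD and TEXT fields) into OSM tags and returns None for any KKOD outside the table. The names `KKOD_TO_PLACE` and `record_to_tags` and the exact “source” value are assumptions made for illustration; this is not the project’s ogr2osm translation filter.

```python
# Hypothetical helper for illustration, not the project's ogr2osm filter.
KKOD_TO_PLACE = {
    1: "isolated_dwelling",
    2: "hamlet", 3: "hamlet", 4: "hamlet",
    5: "village",
    6: "town", 7: "town",
    8: "city", 9: "city",
    14: "locality", 16: "locality",
}

def record_to_tags(kkod, text):
    """Return OSM tags for one source record, or None if its KKOD is not mapped."""
    place = KKOD_TO_PLACE.get(int(kkod))
    if place is None:
        return None
    return {
        "place": place,
        "name": text.strip(),
        "source": "GSD-Terrängkartan",  # assumed value, mirrors the changeset tag
        "lantmateriet:kkod": str(kkod),
    }

print(record_to_tags(5, "Exempelby"))
print(record_to_tags(99, "Not mapped"))
```

Keeping the original KKOD in an auxiliary tag makes it possible to retag nodes later if the mapping is revised.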

Remaining KKODs and associated points in source data are dropped.

In addition to tags derived from the source data, auxiliary tags, for example “source” and “lantmateriet:kkod”, are added to all new nodes. For nodes where additional human attention is required, “fixme” tags are added at later data processing stages (see examples below).

Changeset Tags

Changesets will be tagged with source = “GSD-Terrängkartan”.

Data Transformation

Tools used:

osmconvert, osmfilter and ogr2osm to perform initial data format conversion and filtering.
Scripts and tools (link) to convert, split, clean up, conflate data and resolve issues at intermediate steps.
JOSM editor to manually fix remaining issues, visually and semi-automatically review changesets, and finally upload them to the OSM-database.

Data processing diagram

See the diagram below. The conflation stage is described later in more detail.

Data Transformation Results

OSM files with ready nodes:

v9 (fixed even more abbreviations): https://drive.google.com/open?id=182NzEuSHM3fuYIVRErp7-GWYhum02UZN

Source files

OSM files with new nodes (before conflation) and OSM filtered extract with all nodes with “place=*” within Sweden’s borders (old nodes):

https://drive.google.com/file/d/1QGAVhQajqd5rcJ__3kN9F4yPl4Zj35E3

Older file sets

v1: https://drive.google.com/file/d/1HAtM63CGIE-ulYmkAUnuJDvaLAETEy4I/view?usp=sharing
v3 (more cleanup of names): https://drive.google.com/file/d/1-MYN4060-lCY3yZJpgUimmS3Td6SceNq/view?usp=sharing
v7 (even more name cleanup and smarts added): https://drive.google.com/open?id=199O-CXs9wNdwnAQysKxNE2MfpBN8j88D
v8 (smaller tiles are available): https://drive.google.com/open?id=12qtdvT54PchGPyCrNLCcjRaXwKetarpm

Explanation of included files

places.osm is an OSM extract filtered to contain only nodes with the “place=*” tag. Ways with the tag were converted to nodes and included as well. It is provided here for reference only: https://drive.google.com/open?id=16VVu2SakSD_Yx4jkFZoP30CCXlNYSROx. The most recent version can be generated from the OSM database.
regions/tx_.osm is a file for one of the country’s regions (the number is chosen after those used in the names of the source SHP files, see below). A single OSM file contains from 100 to 15000 ready nodes.
tiles/tx_.osm are the same data split into smaller chunks. Each file should contain around 200 ready nodes.
Log files contain warnings about unresolved names and statistical information about processed and generated files.

Mapping of 21 regions to file numbers follows numbering used by Lantmäteriet’s original files:

Blekinge 10
Dalarna 20
Gävleborg 21
Gotland 09
Halland 13
Jämtland 23
Jönköping 06
Kalmar 08
Kronoberg 07
Norrbotten 25
Örebro 18
Östergötland 05
Skåne 12
Södermanland 04
Stockholm 01
Uppsala 03
Värmland 17
Västerbotten 24
Västernorrland 22
Västmanland 19
Västra Götaland 14

Preliminary counts of ready nodes in individual regions:

tx_01.osm:Total nodes: 2402
tx_03.osm:Total nodes: 3512
tx_04.osm:Total nodes: 5115
tx_05.osm:Total nodes: 6862
tx_06.osm:Total nodes: 5633
tx_07.osm:Total nodes: 4546
tx_08.osm:Total nodes: 4146
tx_09.osm:Total nodes: 161
tx_10.osm:Total nodes: 1054
tx_12.osm:Total nodes: 6104
tx_13.osm:Total nodes: 3612
tx_14.osm:Total nodes: 19328
tx_17.osm:Total nodes: 8757
tx_18.osm:Total nodes: 4663
tx_19.osm:Total nodes: 2538
tx_20.osm:Total nodes: 3249
tx_21.osm:Total nodes: 4670
tx_22.osm:Total nodes: 2124
tx_23.osm:Total nodes: 2708
tx_24.osm:Total nodes: 2020
tx_25.osm:Total nodes: 2019

Data Merge Workflow

Team Approach

A team of contributors collaborating through the talk-se mailing list will import data covering different parts of the country. The whole country is split into 21 sub-units following the territorial scheme present in the original data source. To better balance the subsequent manual validation work, these files are also split into smaller tiles. The goal is to have about 100-200 new nodes per tile.

Script developed to perform the splitting: https://github.com/grigory-rechistov/nmd-osm-tools/blob/master/split-osm-by-limit.py
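The idea of splitting by a node limit can be sketched as follows. This is an illustration under assumptions, not the algorithm of the linked script: it recursively bisects a list of (lat, lon) points along the axis with the larger extent until every tile holds at most `limit` points. The function name and the default limit are hypothetical.

```python
import random

def split_by_limit(nodes, limit=200):
    """Recursively bisect (lat, lon) points until each tile has at most `limit`."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    if len(nodes) <= limit:
        return [nodes]
    lats = [lat for lat, lon in nodes]
    lons = [lon for lat, lon in nodes]
    # Split along the axis with the larger extent, at the median point.
    axis = 0 if (max(lats) - min(lats)) >= (max(lons) - min(lons)) else 1
    ordered = sorted(nodes, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return split_by_limit(ordered[:mid], limit) + split_by_limit(ordered[mid:], limit)

# Demo on random points inside a rough bounding box of Sweden.
random.seed(0)
points = [(random.uniform(55, 69), random.uniform(11, 24)) for _ in range(1000)]
tiles = split_by_limit(points, limit=200)
print(len(tiles), max(len(t) for t in tiles))
```

Splitting at the median keeps tiles balanced in node count, which matters more for reviewers than equal geographic area.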

The collaboration is guided through online spreadsheets or other convenient mechanisms to make sure that no two people attempt to upload the same data chunk twice. The main spreadsheet to track progress: https://docs.google.com/spreadsheets/d/1lfzSt0iYqxOe07cK2wkNbcJbUTecW2JH6eLDE4k0Mmc/edit?usp=sharing

Changeset size policy

Individual changesets of this import should follow regular OSM policies on size limits. The total number of new nodes is expected to be about 95 thousand, so multiple changesets will be required to upload everything.
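The figure of about 95 thousand can be checked by summing the preliminary per-region counts listed earlier. The sketch below also derives lower bounds on the number of tiles and changesets. The limit of 10 000 elements per changeset is an assumption here and should be verified against the current OSM API capabilities.

```python
import math

# Ready-node counts per region file, copied from the preliminary list above.
region_counts = {
    "01": 2402, "03": 3512, "04": 5115, "05": 6862, "06": 5633,
    "07": 4546, "08": 4146, "09": 161, "10": 1054, "12": 6104,
    "13": 3612, "14": 19328, "17": 8757, "18": 4663, "19": 2538,
    "20": 3249, "21": 4670, "22": 2124, "23": 2708, "24": 2020,
    "25": 2019,
}
NODES_PER_TILE = 200                # target tile size from this plan
ASSUMED_MAX_PER_CHANGESET = 10_000  # assumption, verify against the API

total = sum(region_counts.values())
min_tiles = math.ceil(total / NODES_PER_TILE)
min_changesets = math.ceil(total / ASSUMED_MAX_PER_CHANGESET)
print(total, min_tiles, min_changesets)  # 95223 477 10
```

In practice one changeset per tile is the more natural unit, so the real number of changesets will be closer to the tile count than to the lower bound.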

References

Node positions and toponym names can be validated using the following sources:

Lantmäteriet’s own raster tiles service available as background layer in JOSM.
Lantmäteriet’s toponym search service: https://kso.etjanster.lantmateriet.se/
Historical maps of Sweden used as background layers in JOSM.
Existing OSM data (used to visually discover inconsistencies).
Publicly available information on toponyms (Wikipedia etc.) to verify names when there is doubt.

Data extraction

Conversion of a Geofabrik extract for Sweden is done by the following script: https://github.com/grigory-rechistov/nmd-osm-tools/blob/master/geofabrik_to_places.sh

Conversion of SHP files to OSM files is done by the following script: https://github.com/grigory-rechistov/nmd-osm-tools/blob/master/shp_to_osm.sh

Note to self: for d in tk_; do SHP=$d/terrang//gistext/tx*.shp; echo ~/workspace/nmd-osm-tools/shp_to_osm.sh $SHP ~/tmp/ortnamn/basename $SHP .shp.osm; done

The following tag translation filter is supplied to ogr2osm: https://github.com/grigory-rechistov/nmd-osm-tools/blob/master/translations/lm_tx.py

Notes on names

The TEXT fields of the source SHP files often contain slightly mangled toponyms. Humans can tolerate this, but it increases the risk of duplicates for an automated script. E.g., “St. mosse” and “Stora mosse” would be treated as two different places. To counter this, a set of regular-expression-based conversion heuristics is applied to expand typical abbreviations, such as “St.”, “L.”, etc.
Longer toponym strings are sometimes split into two close points, each of which contains a hyphenated part of the original name. Such pairs have to be concatenated back into a single toponym. Criteria for merging: the nodes are close, one name ends with “-”, and the other starts with a lowercase letter. The heuristic does not always work, e.g. “Öster-Övsjö” split at the dash will not be detected. However, all remaining names with dashes are reported and the relevant nodes are marked for human intervention.
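A minimal sketch of the abbreviation expansion could look like this. Only the “St.” to “Stora” pair comes from the example above; the expansion of “L.” to “Lilla” and the names `ABBREVIATIONS` and `expand_abbreviations` are assumptions for illustration, and the real rule set is likely larger (for instance, to handle other grammatical forms).

```python
import re

# Assumed abbreviation table; only "St." -> "Stora" is confirmed by the text.
ABBREVIATIONS = [
    (re.compile(r"\bSt\.\s*"), "Stora "),
    (re.compile(r"\bL\.\s*"), "Lilla "),
]

def expand_abbreviations(name):
    """Expand known abbreviations and collapse any repeated whitespace."""
    for pattern, replacement in ABBREVIATIONS:
        name = pattern.sub(replacement, name)
    return " ".join(name.split())

print(expand_abbreviations("St. mosse"))    # Stora mosse
print(expand_abbreviations("Stora mosse"))  # Stora mosse
```

After expansion both spellings compare equal, so the conflation step no longer treats them as two places.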

Very few toponyms are split into three or more parts. Possible unmerged left-overs are considered suspicious and marked with “fixme” tags for later manual resolution.
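The merging criteria described above can be sketched as follows. The 500 m distance threshold, the choice to keep the position of the first part, the wording of the “fixme” value and the function names are all assumptions; the actual scripts may differ.

```python
import math

def distance_m(a, b):
    """Approximate distance in metres between two nodes (equirectangular)."""
    mean_lat = math.radians((a["lat"] + b["lat"]) / 2)
    dx = math.radians(b["lon"] - a["lon"]) * math.cos(mean_lat)
    dy = math.radians(b["lat"] - a["lat"])
    return 6371000 * math.hypot(dx, dy)

def merge_hyphenated(nodes, max_dist_m=500.0):
    """Glue 'Stor-' + 'backen' pairs; flag leftovers with a fixme tag."""
    consumed = set()  # indexes of tail parts glued onto a head
    merged = {}       # head index -> merged name
    for i, head in enumerate(nodes):
        if i in consumed or not head["name"].endswith("-"):
            continue
        for j, tail in enumerate(nodes):
            if j == i or j in consumed or j in merged:
                continue
            if tail["name"][:1].islower() and distance_m(head, tail) <= max_dist_m:
                merged[i] = head["name"][:-1] + tail["name"]
                consumed.add(j)
                break
    result = []
    for i, node in enumerate(nodes):
        if i in consumed:
            continue
        node = dict(node, name=merged.get(i, node["name"]))
        # Anything that still looks like a fragment needs a human.
        if node["name"].endswith("-") or node["name"][:1].islower():
            node["fixme"] = "possibly an unmerged part of a split name"
        result.append(node)
    return result

parts = [
    {"lat": 60.0, "lon": 15.0, "name": "Stor-"},
    {"lat": 60.001, "lon": 15.0, "name": "backen"},
    {"lat": 61.0, "lon": 16.0, "name": "Öster-"},
    {"lat": 61.001, "lon": 16.0, "name": "Övsjö"},
]
for node in merge_hyphenated(parts):
    print(node["name"], node.get("fixme", ""))
```

In the demo the first pair is merged, while “Öster-” stays separate and is flagged, matching the limitation noted above for names whose second part starts with a capital letter.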

Names containing symbols other than Swedish letters (punctuation, digits, etc.) are marked with “fixme” for human inspection. E.g. “Günthers” is a valid but unusual toponym worth rechecking. This will also mark toponyms in minority languages to be verified by humans.
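Such a check might be sketched as below. The set of allowed characters (A-Z, Å, Ä, Ö and spaces) is an assumption about what counts as “Swedish letters” here, and the function names and the “fixme” wording are hypothetical.

```python
import re

# Assumed definition: A-Z plus Å, Ä, Ö in both cases, and spaces between words.
NON_SWEDISH = re.compile(r"[^A-Za-zÅÄÖåäö ]")

def unusual_characters(name):
    """Return the sorted list of unexpected characters found in `name`."""
    return sorted(set(NON_SWEDISH.findall(name)))

def add_fixme(tags):
    """Return a copy of `tags` with a fixme tag if the name looks unusual."""
    odd = unusual_characters(tags.get("name", ""))
    if odd:
        return dict(tags, fixme="unusual characters in name: " + "".join(odd))
    return dict(tags)

print(add_fixme({"name": "Günthers"}))
print(add_fixme({"name": "Stora mosse"}))
```

Listing the offending characters in the “fixme” value lets a reviewer see at a glance why the node was flagged.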

Revert plan

If problems are discovered after “bad” data is uploaded to the OSM database, the upload must first be reverted, then corrected, and the improved change re-uploaded.

All participating users should record the ranges of changeset numbers for their uploads in the collaboration spreadsheet. This should assist with reverting faulty changes via the JOSM “Revert changeset” option.

Changesets’ and nodes’ “source” tags can also be used to track down nodes participating in incorrect changesets.

It is recommended to document reasons why reverting was necessary. Later, develop a mitigation plan to address discovered issues, fix them and re-attempt uploading if deemed reasonable.

Conflation and final automatic preparation steps

The base script developed for automatic conflation is https://github.com/grigory-rechistov/nmd-osm-tools/blob/master/conflate-places.py

Its algorithm operates on a set of old nodes (OSM extract, nodes marked with “place=*”, around 68 000 nodes for the country) and new nodes (produced earlier from SHP files). The script produces ready nodes, which are a subset of the new nodes. No old nodes are modified in any way during the process. This means that existing data has absolute priority, even in cases where it is likely of lower quality than the new data.

The sequence of steps is as follows.

Create a spatial index structure with old nodes to have fast spatial lookup.
For all new nodes, validation/correction of the “name” tag is performed.
For each new node, find old nodes close enough to it to be candidates for duplicates.
For each candidate node, compare its name against the current new node name. Comparison is fuzzy to allow for some text variation typical for names. Alternative old names are also checked if present.
If a name match is found, the current new node is marked as “duplicate” and is excluded from further analysis and results.
An OSM file with ready data is generated.
The OSM file is optionally split into smaller tiles to ease and speed up visual validation.
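The core of these steps can be sketched as below. This is an illustration under assumptions rather than the linked script: a simple grid stands in for the spatial index, the name comparison is reduced to a case-insensitive match, and the cell size, the list of name tags and all function names are hypothetical. The name validation and tile splitting steps are left out.

```python
import math
from collections import defaultdict

CELL_DEG = 0.02  # assumed grid cell size, roughly 1-2 km in Sweden
NAME_KEYS = ("name", "alt_name", "old_name")  # assumed name tags to check

def cell_of(lat, lon):
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))

def build_index(old_nodes):
    """Bucket old nodes by grid cell for fast spatial lookup."""
    index = defaultdict(list)
    for node in old_nodes:
        index[cell_of(node["lat"], node["lon"])].append(node)
    return index

def candidates(index, lat, lon):
    """Yield old nodes from the cell of (lat, lon) and its 8 neighbours."""
    cy, cx = cell_of(lat, lon)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            yield from index.get((cy + dy, cx + dx), [])

def same_name(old_tags, new_name):
    wanted = new_name.casefold().strip()
    return bool(wanted) and any(
        old_tags.get(key, "").casefold().strip() == wanted for key in NAME_KEYS
    )

def conflate(old_nodes, new_nodes):
    """Return the new nodes that have no nearby old node with a matching name."""
    index = build_index(old_nodes)
    ready = []
    for new in new_nodes:
        nearby = candidates(index, new["lat"], new["lon"])
        if not any(same_name(old["tags"], new["tags"]["name"]) for old in nearby):
            ready.append(new)
    return ready

old = [{"lat": 60.0, "lon": 15.0, "tags": {"place": "hamlet", "name": "Stora Mosse"}}]
new = [
    {"lat": 60.001, "lon": 15.001, "tags": {"place": "hamlet", "name": "Stora mosse"}},
    {"lat": 60.001, "lon": 15.001, "tags": {"place": "hamlet", "name": "Lilla mosse"}},
    {"lat": 62.0, "lon": 15.0, "tags": {"place": "hamlet", "name": "Stora mosse"}},
]
print([n["tags"]["name"] for n in conflate(old, new)])
```

Old nodes are only read, never changed, which mirrors the rule that existing data has priority.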

Notes on name comparison

In addition to the name sanitation presented earlier, comparison of names between the old and new datasets needs to account for remaining possible variations. To reduce the probability of false negatives (erroneously deciding that two nodes do not name the same place when they do), strings are brought to normalized forms before comparison.

Other fuzzy string comparison algorithms, such as Levenshtein distance or Python’s difflib.SequenceMatcher, may also be tried if the accuracy of the current scheme turns out to be lacking.
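The exact normalization rules are not listed here, so the following is only a sketch of what such a step might do, under assumptions: compose Unicode characters, fold case, replace punctuation with spaces and collapse whitespace. Letters such as å, ä and ö are deliberately kept, since they distinguish names in Swedish.

```python
import re
import unicodedata

def normalize_name(name):
    """Return a comparison key for a toponym (hypothetical rules)."""
    name = unicodedata.normalize("NFC", name)  # one code point per letter like "ö"
    name = name.casefold()
    name = re.sub(r"[^\w\s]", " ", name)       # punctuation becomes a space
    return " ".join(name.split())              # collapse repeated whitespace

print(normalize_name("  Stora   MOSSE "))  # stora mosse
print(normalize_name("Öster-Övsjö"))       # öster övsjö
```

Two names are then treated as equal when their keys are equal; the original spelling is kept in the “name” tag.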
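As an illustration of the difflib alternative, the sketch below scores two names with `SequenceMatcher.ratio()` from the Python standard library. The threshold of 0.85 and the function names are assumptions and would need tuning against real data.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def is_probable_duplicate(a, b, threshold=0.85):
    return similarity(a, b) >= threshold

print(is_probable_duplicate("Storamosse", "Stora mosse"))   # True
print(is_probable_duplicate("Lilla mosse", "Stora mosse"))  # False
```

A lower threshold catches more spelling variants but also raises the risk of wrongly discarding a genuinely new place that happens to have a similar name.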

Expected issues and their risk assessment

Classes of issues described below are assessed with respect to:

their perceived probability to happen,
their negative impact if not fixed,
estimation of effort needed to detect them,
estimation of effort needed to fix them.

So far, the most problematic issues seem to be “A duplicate of existing node is added” and “A new node is added with incorrect position”. Discovering and fixing such problems is expected to account for most of the required manual editing.

A new node not corresponding to any toponym of real world is added to the map

Probability: very low, as the authority’s database is regularly updated.

Impact: medium as it creates confusion for map users trying to reach a ghost place.

Effort to detect and fix: high/high for tiny settlements, as a non-existent place would be impossible to detect and correct without a physical visit to the coordinates. The bigger the settlement, however, the easier it is to discover and delete the mistake by simply looking at, e.g., satellite imagery.

A new node with incorrect name is added

Probability: medium, mostly as a result of a typo in the source or unusual spelling used in name.

Impact: low. A misspelled name will still likely be recognizable by humans and/or correctable by computers.

Effort to detect: low. Cross-checking against other public toponym data should uncover the correct or preferable spelling.

Effort to fix: low. Just rename the node manually.

A new node with incorrect classification is added

E.g. adding “place=village” instead of “place=town” etc.

Probability: medium to high. The issue here mostly comes down to the chosen source-to-destination tag mapping scheme and the tagging practices of a particular region. The data source contains no administrative hierarchy information that could have been used to derive the administrative classification of settlements.

The most controversy is expected around tagging with “place=locality”. Officially this tag is reserved for named locations without population. In practice, it is sometimes used for settlements with unknown status, from isolated buildings to historical parts of cities that fall between levels of the existing administrative hierarchy. In this import, “place=locality” is used to represent the smallest named entity, smaller than “isolated_dwelling”.

Impact: low. Correct name and coordinates for a settlement are arguably more important than to decide whether it should be treated as e.g. “village” or “town”.

If needed, a change in “place=*” scheme can be applied after additional classifications become available.

Effort to detect: medium. Consultation with external sources will be needed to cross-check the official position on a settlement’s type.

Effort to fix: easy both for manual and automated retagging once a mistake is discovered and new classification level is known.

A new node is added with incorrect position

Probability: high for small errors, medium for big errors.

Ideally, a “place=*” node should be placed at the settlement’s center (e.g. the main square, the main train station, the geometrical center etc.). However, Lantmäteriet’s map often has its labels at the side, so that a text label on a corresponding “paper map” won’t cover the settlement’s territory. For big cities with a large area and complex borders, this may create a node placement error of up to 2 kilometers relative to its “ideal” position. However, big settlements are already well mapped and won’t receive updates during this import.

For small settlements (the majority of imported nodes), offset error is mostly small, as they have small linear dimensions.

Impact: medium. There is no official “center” for smaller settlements such as a single cottage or a tiny village. At the same time, finding them on a map without any sort of textual labeling is problematic.

Effort to detect: medium. Mostly visually comparing against aerial imagery is enough.

Effort to fix: low. If someone discovers a node at a sub-optimal position, it is easy to move it closer to the optimal coordinates.

A duplicate of existing node is added

Probability: low to medium. Exact duplicates of existing nodes (both name and coordinates match) should be impossible (provided the conflation scripts are error-free). Deviations between the coordinates and/or names of old and new nodes increase the probability of a duplicate slipping onto the map. However, fine-tuned thresholds and distance algorithms (both for spatial and textual information) should reduce the error rate.

Impact: medium. Two nodes naming the same settlement is confusing, but easy to fix upon discovery. Both nodes are still likely to be easily associated with the same place. But it will definitely look annoying until fixed.

Effort to discover: low. Two closely placed closely named nodes are obvious upon inspection.

Effort to fix: low. To manually delete a duplicate is easy.

A node for physically existing toponym is not added

Probability: unknown. Chances of having an unknown name for unknown place are hard to estimate.

Impact: very low. If a place was not present in OSM, and it remains unknown, then apparently nobody thinks it is interesting.

Effort to detect: unknown.

Effort to fix: hard, as an alternative source must be analyzed to find any missing settlements.

Toponyms in languages other than Swedish

Several minority languages are used in Sweden, and toponyms may be stated in several languages as well. Lantmäteriet’s documentation mentions that the data does contain text using letters of Sami language.

The baseline conflation algorithm tries to match names of new nodes against multiple tags of old nodes (“name”, “alt_name”, “name:sa” etc.)

Probability: low

Impact: low/medium. Additional nodes for the same settlement would be created where a single node with several tags for names (“alt_name”, “name:fi” etc.) should be made instead.

Effort to detect: unknown

Effort to fix: medium (manual, upon detection).

Quality Assurance Plan

Common sense should be applied when visually inspecting ready data. Some visual/manual checks and corrections are expected.

Not many large (place=town or place=city) settlements should usually be added. Those should already be (almost) completely represented in the OSM database. It is expected that the vast majority of new nodes are to be tagged with place=isolated_dwelling.
Generally, new nodes should be placed on land, not inside water. An exception in a form of named archipelago is theoretically possible; however, this import does not contain new “place=archipelago” primitives.
All ready nodes marked with “fixme” must be checked and acted upon. The following items list other important things to do after the data is loaded into JOSM, and also after it has been uploaded to the OSM database.
At all stages when vector data is loaded into JOSM, the standard JOSM Validator shall be used to detect inconsistencies.
It is a requirement for this import that no errors detected by the validator are uploaded together with the new data. When possible, older errors should also be fixed along with the upload (and committed in separate changesets). No new warnings caused by the imported data are allowed. Contributors are encouraged to fix pre-existing warnings in the areas being updated right before uploading the import changeset (“the boy scout rule”).
After the individual changeset uploads of the import are finished, the Osmose web service will be used to detect any remaining errors/warnings. To simplify detection of problems caused by this particular import, the per-account web page can be used: http://osmose.openstreetmap.fr/en/byuser/ .
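Some of the checks above can be partly automated before a tile is opened in JOSM. The sketch below, which uses only the Python standard library, counts “place” values and lists nodes carrying a “fixme” tag in an OSM XML document. The function name and the sample data are hypothetical; a real run would read one of the ready tile files instead.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def summarize(osm_xml):
    """Return (Counter of place values, list of (id, name, fixme)) for an OSM XML string."""
    root = ET.fromstring(osm_xml)
    places = Counter()
    fixmes = []
    for node in root.iter("node"):
        tags = {t.get("k"): t.get("v") for t in node.findall("tag")}
        if "place" in tags:
            places[tags["place"]] += 1
        if "fixme" in tags:
            fixmes.append((node.get("id"), tags.get("name"), tags["fixme"]))
    return places, fixmes

# Hypothetical two-node sample in the shape of a ready file.
sample = """<osm version="0.6">
  <node id="-1" lat="60.0" lon="15.0">
    <tag k="place" v="isolated_dwelling"/>
    <tag k="name" v="Exempelgård"/>
  </node>
  <node id="-2" lat="60.1" lon="15.1">
    <tag k="place" v="hamlet"/>
    <tag k="name" v="Günthers"/>
    <tag k="fixme" v="check spelling"/>
  </node>
</osm>"""

places, fixmes = summarize(sample)
print(dict(places))
print(fixmes)
```

An unexpectedly large share of “town” or “city” values, or a long list of “fixme” nodes, is a hint that the tile needs extra attention before upload.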