Working with Museum Data

To make edits non-destructive, data in OpenStreetMap is not truly deleted but superseded by newer versions. We explain here the details of that mechanism and how to retrieve data other than the current state with the Overpass API.

A Point in Time

It is possible to retrieve old states of data. A simple and illustrative example is to view former buildings and highways at the Old Oak construction site in London:

[date:"2018-01-01T00:00:00Z"];
(
  way[highway]({{bbox}});
  way[building]({{bbox}});
);
out geom;

As you can see, some buildings match the background rendering, i.e. current buildings. Some current buildings are missing from the results, which means that they had not been mapped at that date. And some buildings no longer exist, as they occupied the space where the construction site is now situated.

You can play a little bit with the date (from 2013 to today) to see how the existing data varied.

Museum Data Structures

There are quite a number of data structures that seemingly interrelate with data version control.

First of all, each node, way, and relation has a version number, and older versions of elements are in general available through the main API or other data sources. Then there are changesets including meta information and comments. Finally, each element version has a timestamp attached to it.

Changesets

Changesets are designed to let the mapper group changes as they see fit. A changeset can be open for up to 24 hours. A traditional paradigm for changesets would be that they are stacked one on top of the other, and the stack can be undone down to the point where the unwanted change sits. This is deeply in conflict with geographic structuring, because one certainly does not want to undo unrelated changes at the other end of the world to fix a local mistake. On the other hand, formal interdependence is too restrictive a point of view, as one may well have aligned element A to element B in changeset X, while element B itself comes from changeset Y. Then there is no formal, but certainly a factual dependency of A on B, and thus of X on Y.

Multiple changesets can be entangled. Have a look at node 6551935928: go to the data tab and scroll through the versions dated 2021-08-13. The query itself is explained alongside timeline and retro further below.

timeline(node,6551935928);
for (t["created"])
{
  retro (_.val)
  {
    node(6551935928);
    out meta;
  }
}

Version 7 belongs to changeset 109621973. Then versions 8 and 9 belong to changeset 109622477. Versions 10 and 11 again belong to changeset 109621973, while finally version 12 belongs to changeset 109622477. I.e. changeset 109621973 depends on changeset 109622477, but changeset 109622477 also depends on changeset 109621973.
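
The interleaving described above can be checked mechanically. The following is a minimal Python sketch, not part of the Overpass API: it takes the version-to-changeset list quoted in the text and reports pairs of changesets that alternate on the same element. The function names changeset_runs and entangled_pairs are made up for this illustration.

```python
from itertools import combinations

# Version -> changeset for node 6551935928, as listed in the text above.
HISTORY = [
    (7, 109621973),
    (8, 109622477),
    (9, 109622477),
    (10, 109621973),
    (11, 109621973),
    (12, 109622477),
]


def changeset_runs(history):
    """Collapse consecutive versions of the same changeset into runs."""
    runs = []
    for _version, changeset in sorted(history):
        if not runs or runs[-1] != changeset:
            runs.append(changeset)
    return runs


def entangled_pairs(history):
    """Return pairs of changesets that alternate on one element.

    A pair is reported when, looking only at these two changesets,
    the sequence of runs switches back at least once (A, B, A or B, A, B).
    """
    runs = changeset_runs(history)
    result = set()
    for a, b in combinations(sorted(set(runs)), 2):
        only_ab = [c for c in runs if c in (a, b)]
        # Merge neighbours again, because runs of other changesets were dropped.
        merged = [c for i, c in enumerate(only_ab)
                  if i == 0 or only_ab[i - 1] != c]
        if len(merged) >= 3:
            result.add((a, b))
    return result


print(changeset_runs(HISTORY))
print(entangled_pairs(HISTORY))
```

For the node above, the runs alternate four times between the two changesets, so the pair is reported as entangled.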

There exist more than a thousand such simple mutual interdependencies on the same element. But there also exist circular dependencies and dependencies involving multiple elements.

As a consequence, it is usually but not always possible to revert a single changeset, even if one were to accept the other problems of the stack metaphor.

Changeset comments may still be instructive on what was intended with a changeset. However, the quality of changeset comments varies wildly. Quite a number of changeset comments are autogenerated by the editing software or are otherwise unhelpful.

Version Numbers

So are version numbers the right concept?

It has already been mentioned that geographic interdependencies happen between elements and are hard to spot on a single element. But there are more blatant problems: The StreetMap in OpenStreetMap suggests that it is about giving geometry to ways, but ways have only references to node ids. As a result, the geometry of a way can change without the way getting a new version.

See for example here. This is way 142598705 version 4 at 2013-01-01:

[date:"2013-01-01T00:00:00Z"];
way(142598705);
out geom meta;

This is the same way 142598705 in the same version 4, but with different geometry at timestamp 2019-01-01:

[date:"2019-01-01T00:00:00Z"];
way(142598705);
out geom;

It is widespread: I've made an analysis based on the data as of 2023-01-26 21:48, and 147 447 805 of 916 422 831 ways have at least one version with multiple geometries. This is more than 15% of all ways.

To safely determine the geometry of a way and changes to it, version numbers are unfit.
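
Why the version number misses such changes follows directly from the data model. Here is a minimal Python sketch with invented toy data; the classes Node and Way are simplifications for this illustration, not an OpenStreetMap library. The way stores node ids only, so moving a node changes the resolved geometry while the way's version stays the same.

```python
from dataclasses import dataclass


@dataclass
class Node:
    version: int
    lat: float
    lon: float


@dataclass
class Way:
    version: int
    node_ids: list  # a way stores references only, never coordinates


def geometry(way, nodes):
    """Resolve the node references of a way into coordinates."""
    return [(nodes[i].lat, nodes[i].lon) for i in way.node_ids]


# Hypothetical toy data, not taken from OpenStreetMap.
nodes = {1: Node(1, 51.0, 0.0), 2: Node(1, 51.0, 0.1)}
way = Way(version=4, node_ids=[1, 2])

before = geometry(way, nodes)

# Someone moves node 2: the node gets a new version, the way does not.
nodes[2] = Node(version=2, lat=51.05, lon=0.1)

after = geometry(way, nodes)

print(way.version, before != after)
```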

Remarkably, relations suffer from the opposite problem when trying to track changes on them: if someone splits a way that is a member of a relation, for whatever reason, then this creates a new version of the relation. A new version that is at least somewhat justified is created when a minor geometry change happens somewhere on a relation. But if one tracks changes in a local area of some kilometers in diameter, then it is almost certain that minor geometry changes elsewhere on a long relation drastically outnumber the relevant changes.

In other words: treating relations for change tracking as modified by their version number has a bad signal-to-noise ratio. On top of this, relations inherit the problem that changes to a member way that do not alter the relation's member list do not increment the relation version. In other words, the problem is both a lot of noise and the absence of the desired signal.

Timestamps

For the reasons mentioned before, the Overpass API relies on timestamps for museum data.

As shown above, one can add the prefix [date:"YYYY-MM-DDThh:mm:ssZ"] before the query to run the query against the data as of the provided date. The T and Z are fixed markers and indicate that UTC is always used as reference; the YYYY-MM-DD is the date in exactly that order and hh:mm:ss is the time in exactly that format. Please note that you need exactly one semicolon at the end of the one or more prefixes, and that multiple prefixes like [date:...], [timeout:...], and [out:...] are not separated by semicolons:

[date:"2013-01-01T00:00:00Z"][timeout:30];
(
  way[highway]({{bbox}});
  way[building]({{bbox}});
);
out geom;
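
When requests are generated by a script, the prefix format is easy to get wrong. The following minimal Python sketch formats a timezone-aware datetime in UTC and joins the settings without separators, closed by a single semicolon. The helper names overpass_timestamp and settings_prefix are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone


def overpass_timestamp(moment):
    """Format an aware datetime as YYYY-MM-DDThh:mm:ssZ in UTC."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def settings_prefix(moment, timeout=None):
    """Join the settings without separators and close with one semicolon."""
    parts = ['[date:"%s"]' % overpass_timestamp(moment)]
    if timeout is not None:
        parts.append("[timeout:%d]" % timeout)
    return "".join(parts) + ";"


# A UTC moment is passed through unchanged.
print(settings_prefix(datetime(2013, 1, 1, tzinfo=timezone.utc), timeout=30))

# A moment given in another time zone is converted to UTC first.
one_hour_east = timezone(timedelta(hours=1))
print(settings_prefix(datetime(2013, 1, 1, 1, 0, tzinfo=one_hour_east)))
```

Rejecting naive datetimes is a design choice of this sketch: it avoids silently interpreting local time as UTC.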

The advantage of the timestamp paradigm is that it is conceptually simple: you see the geometry of ways (and the state of relation members) in the combination that was consistent at that time, notwithstanding whether asynchronous changes mean that a different geometry applies to the same way version at a different point in time.

The data of the timestamp when a mapper has started a changeset is what they have started working on. The data of the timestamp when a mapper has finished a changeset usually is what they have deemed complete, but there is no guarantee that the mapper has actually reviewed what they had produced.

In principle, it is possible that edits of multiple mappers overlap within an area and the short time span of a changeset. But in practice this has hardly ever been observed. Either the edits are so close that a conflict prevented the second mapper from completing the upload, or the edits of the two mappers are separated enough to tell them apart.

While it is in principle also possible to spot a change by downloading two data states of the points in time before and after a change or a reference period, then doing a text diff, there are server side tools available to reduce the amount of data to handle, see next section.

Select Changed Data

There are variants that cater for the differing needs of different use cases. Choose the variant that fits your use case with the minimal possible amount of data, because this relieves both you in terms of processing resources and the server in terms of retrieval effort.

Patch Data

There is a prefix [diff:...] that fulfills the promise that Overpass API can compare literally the states at two different time stamps.

Please note that Overpass Turbo does not show the results. A user interface to show minor differences in geometry is a difficult thing to do, and Overpass Turbo does not have such a user interface. One of the difficulties to solve is that one usually wants to see both the old geometry and the new geometry, and that in many cases these two differ only minutely.

After that warning, you are invited to have a look at an instructive example:

[diff:"2018-04-01T00:00:00Z","2018-07-01T00:00:00Z"];
(
  way[highway]({{bbox}});
  way[building]({{bbox}});
);
out geom meta;

This is a three-month period during which enough changes happened to have all phenomena in one example. Go to the data tab, and copy the full text elsewhere to enable full-text search. There are XML elements named action, each with one of the following three types:

The type modify announces updates to elements. For ways this includes the case where the underlying nodes have moved but no new version has been created. For nodes and relations a change is detected only in case of moved nodes.
The type create announces a newly created or relevant element. Note that the element may have existed before elsewhere but has not met the selection criteria so far.
The type delete announces that an existing element ceased to exist or ceased to meet the selection criteria.

The degree of verbosity is exactly the one you have chosen in the output statement. This means that the result may contain seemingly unchanged elements unless you use out geom meta as the degree of verbosity.

If you want to patch a downstream database of filtered elements then this should be exactly what you need to update that database. However, there is no information about the whereabouts of deleted elements. For many applications this is a shortcoming, for example if you search for the changeset of an element that caused the apparent or real deletion.
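
Patching a downstream database from such a result could look roughly like the following Python sketch. The miniature XML is hand-written for this illustration: the layout (an old and a new wrapper inside modify and delete actions, the bare element inside create) is an assumption that should be checked against a real response. The function apply_diff and the dictionary keyed by type and id are likewise made up here.

```python
import xml.etree.ElementTree as ET

# Hand-written miniature of a diff response; the layout is an assumption.
SAMPLE = """
<osm>
  <action type="create">
    <way id="3" version="1"/>
  </action>
  <action type="modify">
    <old><way id="1" version="4"/></old>
    <new><way id="1" version="5"/></new>
  </action>
  <action type="delete">
    <old><way id="2" version="7"/></old>
  </action>
</osm>
"""

ELEMENT_TAGS = ("node", "way", "relation")


def first_element(parent):
    """Return the first node/way/relation directly below parent, if any."""
    if parent is None:
        return None
    for child in parent:
        if child.tag in ELEMENT_TAGS:
            return child
    return None


def apply_diff(database, xml_text):
    """Patch a dict keyed by (type, id) with the actions of a diff."""
    for action in ET.fromstring(xml_text).iter("action"):
        kind = action.get("type")
        old = first_element(action.find("old"))
        new = first_element(action.find("new"))
        if new is None:
            new = first_element(action)  # bare element inside the action
        if kind == "delete":
            if old is not None:
                database.pop((old.tag, int(old.get("id"))), None)
        elif new is not None:  # create and modify both store the new state
            database[(new.tag, int(new.get("id")))] = dict(new.attrib)
    return database


db = {("way", 1): {"id": "1", "version": "4"},
      ("way", 2): {"id": "2", "version": "7"}}
apply_diff(db, SAMPLE)
print(sorted(db))
```

Note that the delete branch only knows that the element left the selection. It cannot tell whether the element was really deleted, which is exactly the shortcoming mentioned above.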

Whereabouts

The adiff variant is shorthand for Augmented Diff. It returns everything the diff setting returns, and in addition it shows for any Delete action the whereabouts of the element in a distinct new subsection of the action.

Scroll through the raw data in the data tab for the well known example:

[adiff:"2018-01-01T00:00:00Z","2019-01-01T00:00:00Z"];
(
  way[highway](51.51,-0.27,51.54,-0.23);
  way[building](51.51,-0.27,51.54,-0.23);
);
out geom meta;

You will find that for every deleted way there is a follow-up entry for that way which not only gives the meta data but also carries a flag visible="false" or visible="true".

An example for visible="false":

[adiff:"2018-01-01T00:00:00Z","2019-01-01T00:00:00Z"];
way(229832127);
out geom meta;

This way has truly been deleted in version 10, hence has the flag visible="false".

An example for visible="true":

[adiff:"2018-01-01T00:00:00Z","2019-01-01T00:00:00Z"];
way(8206624)(51.51,-0.27,51.54,-0.23);
out geom meta;

This way has been cut back radically, see old ...

[date:"2018-01-01T00:00:00Z"];
way(8206624);
out geom meta;

... and new variant:

[date:"2019-01-01T00:00:00Z"];
way(8206624);
out geom meta;

Thus it literally went out of scope, i.e. out of the requested bounding box, but it still exists.

And an example for a way that exhibits the same data and version,

[adiff:"2018-04-01T00:00:00Z","2018-07-01T00:00:00Z"];
way(3687939)(51.457,-0.235,51.459,-0.231);
out geom meta;

but got out of scope because the underlying nodes have moved from

[date:"2018-04-01T00:00:00Z"];
way(3687939);
out geom meta;

to the geometry

[date:"2018-07-01T00:00:00Z"];
way(3687939);
out geom meta;

Depending on your use case for the data, you need to treat all of these cases differently.

If you want to patch the database, then already the diff from the previous section is enough, and you can treat all cases of visible the same way.

If you want to point people towards the changeset that has the context for the change then
- the visible="false" case is straightforward because it is guaranteed that the way has been deleted in that changeset.
- the same way version case (and in principle also the same relation version case) has for sure a different changeset that caused the change, and you have to ensure that you explicitly see and connect the nodes (members) to figure out the actually responsible change.
- the different way version, but visible="true" case can be either of the two.
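
The three cases can be written down as a small decision function. This is a Python sketch under the assumption that the old version, the new version, and the visible flag have already been extracted from the delete action. The function name and the returned labels are made up for this illustration.

```python
def classify_whereabouts(old_version, new_version, visible):
    """Interpret the follow-up entry of a delete action in an adiff result.

    Returns one of three made-up labels:
    "deleted"          the element was deleted in the new version's changeset
    "members_changed"  same version, so nodes or members caused the change
    "ambiguous"        new version, still visible: could be either cause
    """
    if not visible:
        return "deleted"
    if old_version == new_version:
        return "members_changed"
    return "ambiguous"


# The version numbers below are placeholders, not real data.
print(classify_whereabouts(9, 10, visible=False))
print(classify_whereabouts(4, 4, visible=True))
print(classify_whereabouts(4, 5, visible=True))
```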

When you use more sophisticated queries than a bounding box then there are even more ways in which changes to used but not printed elements indirectly cause elements to fall out of the result. Make sure you understand your problem domain, and that your end users do that, too.

Please note that the Augmented Diff mode only makes sense with output flag out meta (or out geom meta). Otherwise, the meta information is simply not contained in the result, and you get no difference to the simpler diff mode.

Specific Differences

If you intend to track changes to a certain tag or tagging then it will often happen that you get way too much by-catch of data that changed in a way irrelevant to you.

The statement compare reduces results to only those where a certain expression changed, like for example the value of the tag maxspeed:

[adiff:"2018-01-01T00:00:00Z","2019-01-01T00:00:00Z"];
way[highway](51.51,-0.27,51.54,-0.23);
compare(delta:t["maxspeed"]);
out geom meta;

This result is more than ten times smaller than the original result.

Does it make sense to instead restrict the initial query to only maxspeed?

[adiff:"2018-01-01T00:00:00Z","2019-01-01T00:00:00Z"];
way[highway][maxspeed](51.51,-0.27,51.54,-0.23);
compare(delta:t["maxspeed"]);
out geom meta;

This result is even slightly smaller. It does not contain the old or new elements without a maxspeed tag. As we use adiff, we get for elements with deleted maxspeed tag at least the visible flag and meta data. Whether this is an improvement or not again depends on your use case: in many cases, knowing that an element has been a highway in the bounding box already gives insight.

It is in principle possible to ask for multiple tags in parallel. You can just use an arbitrary expression after delta, for example two tags:

[adiff:"2018-01-01T00:00:00Z","2019-01-01T00:00:00Z"];
way[highway](51.51,-0.27,51.54,-0.23);
compare(delta:t["maxspeed"] + t["sidewalk"]);
out geom meta;

Now the result contains all elements that have a change in the maxspeed tag or the sidewalk tag.

In principle, it could happen that a value of the tag maxspeed is conflated with a value of the tag sidewalk. This would happen for an element that changes from maxspeed=X, no sidewalk to sidewalk=X, no maxspeed or back. While this is ruled out for maxspeed and sidewalk because the values have no overlap, a first workaround attempt is to place a delimiter between the two expressions:

[adiff:"2018-01-01T00:00:00Z","2019-01-01T00:00:00Z"];
way[highway](51.51,-0.27,51.54,-0.23);
compare(delta:t["maxspeed"] + ";" + t["sidewalk"]);
out geom meta;

Unfortunately this has the side effect that elements which have been deleted or created are now always contained in the result, although they have not changed the respective tag.

Why?

The expression t["maxspeed"] + ";" + t["sidewalk"] evaluates to ; for an existing element without maxspeed and sidewalk tags. For the non-existing counterpart the expression evaluates to the empty string, which is different from ;.

Hence a better workaround is to make the semicolon depend on the sidewalk key:

[adiff:"2018-01-01T00:00:00Z","2019-01-01T00:00:00Z"];
way[highway](51.51,-0.27,51.54,-0.23);
compare(delta:t["maxspeed"]
    + (is_tag("sidewalk") ? ";" : "") + t["sidewalk"]);
out geom meta;

You can also directly aim to get only the created and deleted objects:

[adiff:"2018-01-01T00:00:00Z","2019-01-01T00:00:00Z"];
way[highway](51.51,-0.27,51.54,-0.23);
compare(delta:1);
out geom meta;

Any non-empty constant expression will work here.
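
The behaviour of the different delta expressions in this section can be mimicked in a few lines of Python. This is only a model under the assumptions stated in the text: a missing tag yields the empty string, and the whole expression yields the empty string for a non-existing counterpart. Elements are modelled as plain tag dictionaries, a non-existing element as None, and all function names are made up for this illustration.

```python
def t(element, key):
    """Mimic t["key"]: the tag value, or "" if tag or element is missing."""
    return "" if element is None else element.get(key, "")


def plain(e):
    return t(e, "maxspeed") + t(e, "sidewalk")


def with_delimiter(e):
    # The non-existing counterpart evaluates to "" as a whole.
    return "" if e is None else t(e, "maxspeed") + ";" + t(e, "sidewalk")


def conditional_delimiter(e):
    if e is None:
        return ""
    delimiter = ";" if "sidewalk" in e else ""
    return t(e, "maxspeed") + delimiter + t(e, "sidewalk")


def constant(e):
    return "" if e is None else "1"


def changed(expression, old, new):
    """An element is kept when the expression differs between both states."""
    return expression(old) != expression(new)


# Hypothetical tag sets.
swap_old = {"highway": "residential", "maxspeed": "X"}
swap_new = {"highway": "residential", "sidewalk": "X"}
created = {"highway": "residential"}

for expression in (plain, with_delimiter, conditional_delimiter, constant):
    print(expression.__name__,
          changed(expression, swap_old, swap_new),
          changed(expression, None, created))
```

In this model the plain concatenation misses the swapped value, the fixed delimiter reports every created element, the conditional delimiter avoids both effects, and the constant reports created elements only.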

Plot an Element over Time

When tracing a strange state of the data back to its origin, it is quite painful to work by trial and error through many dates until one has identified the relevant change.

For this reason the Overpass API also offers the statement timeline and the block statement retro as a pair to iterate through salient points in time, which most likely reveals the point in time where the relevant change happened a lot quicker.

The Timeline Statement

Museum requests always work on timestamps for a reason. But sometimes one rather has an element and maybe a version number. It is then possible to get a specific version or the entire element history and to programmatically parse the timestamp from this. However, this means a back and forth between the main API and the Overpass API instead of a single request answering all questions.

As the purpose of the Overpass API is to lower the barrier, it has its own function to get a timestamp directly for a given version (in this example of node 1 the version 11) ...

timeline(node,1,11);
out;

... or for the entire history of an element (in this example node 1):

timeline(node,1);
out;

You must look at the data tab to see the result. The output consists of derived elements. They can be used to tabulate some interesting info.

[out:csv(refversion,created)];
timeline(node,1);
out;
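
On the client side, such a table can be turned into a lookup from version number to timestamp. This is a minimal Python sketch assuming the CSV output uses a header line and tab-separated columns. The sample rows are invented and parse_timeline is a made-up helper name.

```python
import csv
import io
from datetime import datetime, timezone

# Invented sample in the shape of the CSV request above (tab separated).
SAMPLE = (
    "refversion\tcreated\n"
    "1\t2010-01-01T00:00:00Z\n"
    "2\t2012-05-06T07:08:09Z\n"
)


def parse_timeline(text):
    """Map each version number to the UTC moment it was created."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    result = {}
    for row in reader:
        created = datetime.strptime(row["created"], "%Y-%m-%dT%H:%M:%SZ")
        result[int(row["refversion"])] = created.replace(tzinfo=timezone.utc)
    return result


for version, created in parse_timeline(SAMPLE).items():
    print(version, created.isoformat())
```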

The Timeline-Retro Loop

But it is far more powerful in the idiom combining timeline, foreach, and retro. The loop foreach executes its body once for every element in its input set, and retro sets the timestamp to the date from the given expression.

This way we can for example plot the value of a certain tag over the course of the element's versions:

[out:csv(::timestamp,::version,::changeset,::user,name)];
timeline(node,305640277);
foreach
{
  retro (u(t["created"]))
  {
    node(305640277);
    out meta;
  };
};

Line 1 sets the output to CSV to have a more well-arranged result. The id of the node is both in line 2 and line 7: in line 2 it is part of the timeline syntax while in line 7 it is the start of a standard simple request.

The block statement foreach executes the block in lines 5 to 9 once for each element in the result of the timeline statement. The retro statement now moves the timestamp under which the core request in lines 7 and 8 operates to the value of the respective created tag of each element in turn. To do so, the expression t["created"] must point to the tag value of the single element in the default set, and this is accomplished by u(...).
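
For readers who think in terms of a general-purpose language, the idiom can be modelled as follows. This Python sketch is an analogy only, not how the Overpass API is implemented: the history is invented, and the functions timeline, state_at and plot_tag merely mirror the roles of timeline, the request inside retro, and the foreach loop.

```python
# Invented history of one element: (created, expired, tags) per version.
# Timestamps in this fixed format compare correctly as plain strings.
HISTORY = [
    ("2010-01-01T00:00:00Z", "2014-03-02T10:00:00Z", {"name": "Old Inn"}),
    ("2014-03-02T10:00:00Z", "2020-06-01T08:30:00Z", {"name": "The Old Inn"}),
    ("2020-06-01T08:30:00Z", None, {"name": "The Old Inn", "amenity": "pub"}),
]


def timeline(history):
    """Like timeline: one entry per version, carrying its creation time."""
    return [{"refversion": i + 1, "created": created}
            for i, (created, _expired, _tags) in enumerate(history)]


def state_at(history, moment):
    """Like a request inside retro: the tags valid at the given moment."""
    for created, expired, tags in history:
        if created <= moment and (expired is None or moment < expired):
            return tags
    return None


def plot_tag(history, key):
    """Like the foreach loop: evaluate the request once per timeline entry."""
    return [(entry["refversion"], state_at(history, entry["created"]).get(key))
            for entry in timeline(history)]


print(plot_tag(HISTORY, "name"))
```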

The construct is based on repeating a request for a number of timestamps that are interesting in the context of the given element. So you can in principle combine a timeline of element A with a request body starting from element B, but it usually does not make sense.

An example that does make sense: find ways that might have been split off a given way:

[out:csv(date,version,user,length,spinoffs)];
timeline(way,241135851);
foreach
{
  retro(u(t["created"]))
  {
    way(241135851)->.main;
    node(w.main)->.n;
    way(bn.n)(if:version()==1)->.spinoffs;
    make stat
        date=main.u(timestamp()),
        version=main.u(version()),
        user=main.u(user()),
        length=main.u(length()),
        spinoffs=spinoffs.set(id()+ " " + timestamp());
    out;
  };
};

In line 1, we define the columns to use. We want to output only derived elements, so this simply matches the columns defined in lines 11 to 15. Lines 2 to 7 together with lines 17 and 18 are the same timeline-retro idiom as seen before. The only change is that we redirect the way of interest into the set named main. As we have to work with multiple sets in parallel, it makes the request more comprehensible to have everything in a named set.

In lines 8 and 9 we take advantage of the fact that we can use the full syntax within the retro block: we navigate from the nodes of the way to all ways using one or more of those nodes, thus selecting all ways that are topologically connected. This is restricted in line 9 by (if:version()==1) to only version 1 to expose the most likely candidates for split-off ways.

Lines 11 to 14 use properties of the main way stored in main. Thus we need to refer to the way through its variable name and use u(...) to address properties of the single element instead of aggregated properties of the set as a whole. By contrast, set(...) in line 15 produces the usual semicolon separated list of the expression thereafter. This makes sense as main is known to have exactly one element, but spinoffs could have any number of members from zero to many.

Now we can spot in the result that version 18 is significantly shorter than version 17, and way 1160159397 with a timestamp coincident to this version's timestamp is a very likely candidate for a split-off way. Likewise, the way got even shorter in version 20, and way 1257085209 is by timestamp coincidence also a candidate for a split-off way.
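
Spotting such coincidences by eye gets tedious for long histories. The following Python sketch automates the check on rows shaped like the CSV columns above: a version is flagged when the way got shorter and a version-1 way touching it carries the same timestamp. The rows, ids, dates and lengths are invented for illustration, and the helper names are made up.

```python
# Invented rows in the shape of the CSV columns above.
ROWS = [
    {"date": "2020-01-01T00:00:00Z", "version": 1, "length": 500.0,
     "spinoffs": ""},
    {"date": "2021-02-03T04:05:06Z", "version": 2, "length": 320.0,
     "spinoffs": "1001 2021-02-03T04:05:06Z;1002 2019-07-07T07:07:07Z"},
    {"date": "2022-03-04T05:06:07Z", "version": 3, "length": 320.0,
     "spinoffs": "1001 2021-02-03T04:05:06Z"},
]


def parse_spinoffs(field):
    """Split 'id timestamp;id timestamp' into (id, timestamp) pairs."""
    pairs = []
    for item in field.split(";"):
        if item.strip():
            way_id, timestamp = item.split()
            pairs.append((int(way_id), timestamp))
    return pairs


def split_candidates(rows):
    """Return (version, way id) pairs for likely split-off ways."""
    found = []
    for previous, current in zip(rows, rows[1:]):
        if current["length"] < previous["length"]:
            for way_id, timestamp in parse_spinoffs(current["spinoffs"]):
                if timestamp == current["date"]:
                    found.append((current["version"], way_id))
    return found


print(split_candidates(ROWS))
```

A matching timestamp is only a hint. As in the text, the result lists candidates, not proven splits.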

Unearthing with JOSM

Before you handcraft too much, a short reminder: if you want to revert a complete changeset then the reverter plugin does this more reliably than manual work.

It is possible to edit removed data back into OpenStreetMap, although some caveats apply. The basic process is to load the former data into JOSM and then copy over the elements one wants to bring back to life. Unless tweaked manually, this assigns new ids to the copied objects.

Make sure that you have turned on the Expert Mode in the JOSM settings.

Figure out the exact extent of the region (bounding box) and date that you want to reactivate data from. Change the tab in the download dialogue to Download from Overpass API. You can then download your data of interest with a request like

[date:"$TIMESTAMP"];
(
  nwr({{bbox}});
  node(w);
);
out meta;

This is the old data you can copy from.
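
If you prepare such requests with a script before pasting them into JOSM, the placeholder can be filled with Python's string.Template, which uses the same $NAME notation. This is a minimal sketch: the helper name request_for is made up, and the format check is an addition of this illustration.

```python
from datetime import datetime
from string import Template

# The request from the text. $TIMESTAMP is the only placeholder that
# string.Template touches; {{bbox}} is left in place for JOSM.
REQUEST = Template('''[date:"$TIMESTAMP"];
(
  nwr({{bbox}});
  node(w);
);
out meta;''')


def request_for(timestamp):
    """Insert a timestamp after checking that it has the expected format."""
    datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")  # raises ValueError
    return REQUEST.substitute(TIMESTAMP=timestamp)


print(request_for("2018-01-01T00:00:00Z"))
```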

Now go once again to the download dialogue and erase the request from the textfield. This way you get the current data in the same bounding box.

You must download via Download as a new layer. Only that way you have the current data in a proper data layer.

Now you can switch between the two layers by using the layers pane (Alt+Shift+L). Make the old state layer active by putting the check mark there and, if necessary, click the eye to get an unobstructed view. Copy from there what you want to reactivate. To preserve topology, do so in one step. Make the current state layer active by putting the check mark there and paste into that layer.

Now you have the reactivated objects as new objects there. If they need to be welded into existing objects then do so now. Finally, you can upload or complete the editing session with related edits. As changesets should make the intent clear, it is usually unhelpful to mix a reconstruction operation with genuinely new editing.
