Overpass API > Blog >

Compound Values

Published: 2018-06-30

Table of contents

Promoting the Semicolon?
Capturing Levels
Units
Beyond the Geodata

In the blog post about numbers, semicolons and units were left as an open problem. A new set of features now solves this problem. These features are the subject of this blog post.

In particular, values with semicolons deserve a clear framework: they are a set of values, and they are represented as a list. Thus, all the functions relating to this start with lrs_, for list represented set. We have already used that framework to speed up building a list of values.

Multiple levels in a metro station

Promoting the Semicolon?

Sometimes there is a need to represent multiple values for one tag in OpenStreetMap. An example is levels in multi-story structures. While most features reside on a single level, there are structures like stairs and elevators whose very purpose is to belong to multiple levels. Hence, it makes sense to record multiple values for the key level for them.

The de-facto standard for multiple values is concatenating the values with semicolons into the value field. This is a problem for multiple reasons:

On a superficial level, the problem is that the semicolon is a legitimate character within a value and by no means special. Thus, a genuinely single value might be treated as two by software that splits on semicolons. In addition, values in OpenStreetMap are limited to 255 characters, so lists may break suddenly once they grow longer.

The other way round, it can be desirable to convert OSM data into other formats in which the semicolon has a meaning. Think for example of CSV, originally "comma separated values" but often "semicolon separated values". While there are escaping rules, a semicolon in a tag value can and will bring down simpler tools that evaluate CSV, and confuse those that try to be smart, if there is no reminder that this may happen. This is particularly true if one can have good-faith testing data without a single semicolon in a value and then run into semicolons in production, like meeting a black swan.
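To see the failure mode concretely, here is a small Python sketch (not part of the Overpass toolchain) contrasting naive semicolon splitting with proper CSV quoting:

```python
import csv
import io

# A tag value that genuinely contains a semicolon
row = ["node/1", "name", "Foo;Bar"]

# Naive "semicolon separated values": the single value silently
# becomes two fields on the way back in
naive = ";".join(row)
naive_fields = naive.split(";")  # 4 fields instead of 3

# Proper CSV writing quotes the offending field, so a real CSV
# parser recovers the original row
buf = io.StringIO()
csv.writer(buf, delimiter=";", quoting=csv.QUOTE_MINIMAL).writerow(row)
parsed = next(csv.reader(io.StringIO(buf.getvalue()), delimiter=";"))
```

A simple tool that splits on every semicolon sees four columns; only a parser that honours the quoting rules gets the row back intact.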

This brings us to the more general problem. There are a couple of design questions about multi-values that should be answered, and every piece of software has to answer them for itself. For software like Overpass API, where the intent is to keep semantics backwards-compatible over decades, this means guessing how the community will eventually settle these design questions. But that settlement has not taken place yet, and perhaps it never will.

This is not about leadership. It is about the tedious work of setting up hypotheses and testing them against the now not-so-small body of all existing OSM data to see whether they are the best, or at least sensible, choices.

The first question is whether the order of the values matters. If I ask for the values -2;-1 and -1;-2 for the level tag, I imply that the answer is no. But if you think of seamarks or waymarked trails, where a colouring like red;white;blue is different from blue;red;white, then the preference tends towards keeping the order.

Please note that order-matters is a much more expensive feature than can-be-reordered once lists get longer. For a list of ten things there are already more than 3.6 million possible orders; for a list of 15 things this rises to 1.3 trillion. Or, in other terms: for a list of eighty things one needs 10 bytes to store which values are present, but up to 75 bytes to store their order. Thus, for longer lists, storing the order is the expensive part.
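The figures above are quick to check: the number of possible orders of n distinct values is n!, and presence of each of 80 possible values costs one bit. A Python snippet (an illustration, not Overpass code):

```python
from math import factorial

# Number of possible orders of a list of n distinct values is n!
orders_10 = factorial(10)   # more than 3.6 million
orders_15 = factorial(15)   # about 1.3 trillion

# Which of 80 possible values are present fits in 80 bits = 10 bytes
presence_bytes = 80 // 8
```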

A couple of related questions exist. One, already mentioned, is how to recognize genuine semicolons that are not intended as separators. Another is one of equality: are -1 and -1.0 equal? What about leading or trailing whitespace, or two semicolons in a row?

I have decided to settle on the following policy, which I consider the most application-neutral and hence the most future-safe: values are kept verbatim, and semicolons are not treated as special in any way unless the user applies a semicolon-breaking function. These functions may drop leading and trailing whitespace. If a complete list consists of terms that can all be read as numbers, then they may all be treated as numbers for the sake of sorting and equality. Of course, the individual functions have well-defined behaviour on whitespace and numbers; it is just that this behaviour may change in a future version if an urgent need comes up.
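To make the policy concrete, here is a hypothetical Python model of a semicolon-breaking function. The names lrs_split and lrs_canonical are made up for illustration and do not correspond to the actual Overpass implementation:

```python
def lrs_split(value):
    """Model of the semicolon-breaking policy: split on ';' and
    drop leading and trailing whitespace from each term."""
    return [term.strip() for term in value.split(";")]

def lrs_canonical(value):
    """Sorted unique terms. If every term reads as a number, sort
    and compare numerically, so that '-1' and '-1.0' coincide."""
    terms = lrs_split(value)
    try:
        nums = sorted({float(t) for t in terms})
        return ["%g" % n for n in nums]
    except ValueError:
        return sorted(set(terms))
```

Under this model -1;-2 and -2;-1 canonicalize identically, while red;white;blue loses its order, which is exactly the can-be-reordered trade-off discussed above.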

Capturing Levels

Let us work with levels in practice. We start with a simple example: we want to get all elements that are on level 1. Being aware that an element can have multiple levels, we query for all elements that contain 1 somewhere in their level value:

nwr({{bbox}})[level~"1"];
out geom;

Contrary to expectations, this query also returns elements with level=10 and level=-1, because the regular expression matches 1 anywhere in the value. A remedy is to restrict the regular expression to allow 1 only as a complete list entry:

nwr({{bbox}})[level~"^1$|^1;|;1$|;1;"];
out geom;
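Assuming Python's re module handles this simple alternation the same way as the regular expressions used here, the difference between the two queries can be illustrated outside Overpass:

```python
import re

# Plain substring search: "1" also hits "10" and "-1"
substring_hits = [bool(re.search("1", v)) for v in ["10", "-1"]]

# The anchored alternation matches 1 only as a complete
# semicolon-separated list entry
pattern = re.compile(r"^1$|^1;|;1$|;1;")
matches = {v: bool(pattern.search(v))
           for v in ["1", "0;1;2", "1;2", "10", "-1"]}
```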

This does solve the task, but it looks horribly awkward. To overcome this, there is now the evaluator lrs_in. It tests whether its second argument, interpreted as a semicolon separated list, contains its first argument as an element:

nwr({{bbox}})(if:lrs_in(1,t["level"]));
out geom;

We filter for all elements from the given bounding box for which lrs_in(1,t["level"]) returns true. The expression lrs_in(1,...) returns true if and only if its second argument is a list containing its first argument, 1. We apply this to t["level"], the value of the tag level.
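The semantics of lrs_in can be modeled in a few lines of Python. This is a sketch of the intended behaviour, not the actual implementation, and it compares terms as strings for simplicity:

```python
def lrs_in(needle, value):
    """Model of lrs_in: does the semicolon separated list `value`
    contain `needle` as a complete element? (Illustration only.)"""
    terms = [t.strip() for t in str(value).split(";")]
    return str(needle) in terms
```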

A more basic question is which values for level exist in a place. Technically speaking, we want the union of all lists in the level values. The evaluator lrs_union does this, in somewhat fancy syntax:

nwr({{bbox}})[level];
make stat level=lrs_union(set(t["level"]), "");
out;

The first line collects all elements in the bounding box that have a level tag. In the second line we build the list. The list must be stored somewhere; we use a tag level of a newly derived element stat. The statement make stat level=... does exactly that with the result of its argument. The primary purpose of lrs_union is to form the union of its two arguments. But because each argument is sorted and reduced to unique values for that purpose, taking the union with an empty argument turns the other argument into a sorted list of unique values. The aggregator set(...) compiles this list, using semicolons as separators just like inside the individual values.
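A sketch of what lrs_union does to its arguments, modeled in Python. String sorting is assumed here for simplicity; as discussed above, the actual evaluator may sort all-numeric lists numerically:

```python
def lrs_union(a, b):
    """Model of lrs_union: union of two semicolon separated lists,
    returned sorted with duplicates removed (illustration only)."""
    terms = {t.strip() for t in (a + ";" + b).split(";") if t.strip()}
    return ";".join(sorted(terms))
```

Note how the union with an empty list is exactly the sort-and-deduplicate step used in the query above.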

Now we can select all underground elements:

nwr({{bbox}})(if:lrs_isect("-4;-3;-2;-1",t["level"]));
out geom;

The evaluator lrs_isect returns all values that are contained in both of its arguments, each interpreted as a semicolon separated list. We intersect t["level"] with the fixed list -4;-3;-2;-1.
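The corresponding Python model for lrs_isect, again an illustration rather than the real implementation:

```python
def lrs_isect(a, b):
    """Model of lrs_isect: the values present in both semicolon
    separated lists, sorted and deduplicated (illustration only)."""
    terms_a = {t.strip() for t in a.split(";")}
    terms_b = {t.strip() for t in b.split(";")}
    return ";".join(sorted(terms_a & terms_b))
```

In the filter above, an element passes whenever this intersection is non-empty, i.e. at least one of its levels is underground.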

Units

When we are recording legal or factual regulations in OpenStreetMap, we may find non-standard units. A typical example is speed limits in mph.

We want all values in the same unit to do e.g. computations for routing. For this purpose we can rewrite the value to km/h on the fly, regardless of whether we are in the UK or elsewhere:

[out:csv(::id,maxspeed,highway,name)];
way({{bbox}})[highway][highway!=footway];
convert row ::id=id(),highway=t["highway"],name=t["name"],
    maxspeed=(suffix(t["maxspeed"]) == "mph"
        ? number(t["maxspeed"]) * 1.609344
        : number(t["maxspeed"]));
out;

Line 1 is the declaration to get everything as CSV. This makes sense because we want to process the values, not the geometry of the ways. As columns we choose the id of the object and the generated maxspeed, highway, and name tag values. Line 2 is a standard query for all ways within the bounding box that have a highway tag whose value is not footway, because footways should not have speed limits.

In line 3, the interesting part is inside the convert statement: we use a ternary expression (Condition ? If-True : If-False) to determine the value to set for the maxspeed tag. The expression suffix(t["maxspeed"]) == "mph" is true if and only if the suffix of the maxspeed value is mph. In that case we multiply the value of maxspeed, taken as a number, by the conversion constant. Otherwise we take maxspeed as a plain number. Together this ensures that we get a value in km/h, regardless of whether the actual value was in mph or km/h.
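The same conversion logic can be mirrored in Python, which also makes the conversion constant easy to verify. The function name to_kmh is made up for this sketch:

```python
def to_kmh(maxspeed):
    """Mirror of the convert expression: values suffixed 'mph' are
    multiplied by 1.609344 (km per mile), plain numbers pass
    through unchanged. A sketch, not Overpass code."""
    s = maxspeed.strip()
    if s.endswith("mph"):
        return float(s[:-3].strip()) * 1.609344
    return float(s)
```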

The last step is proper error checking. This means that we funnel an error message into the CSV if we meet anything that looks like a unit but is neither the implicit km/h nor mph:

[out:csv(::id,maxspeed,highway,name)];
way({{bbox}})[maxspeed];
if (lrs_union(set(suffix(t["maxspeed"])),"mph")=="mph")
{
  convert row ::id=id(),highway=t["highway"],name=t["name"],
      maxspeed=(suffix(t["maxspeed"]) == "mph"
          ? number(t["maxspeed"]) * 1.609344
          : number(t["maxspeed"]));
}
else
{
  make row ::id=1,highway="bad_unit",
      name=set("{"+suffix(t["maxspeed"])+"}");
}
out;

The changed lines are lines 3 and 12. Line 3 tests whether the units are as expected: the aggregator set(...) on suffix(t["maxspeed"]) delivers all suffixes found in the ways as a semicolon separated list. This set should be empty or contain only mph. The fastest way to check whether something is a subset is currently to lrs_union it with the superset and check whether the result equals the superset. Line 12 produces a single result element that carries an error message in the fields used by this CSV formatting and, via set("{"+suffix(t["maxspeed"])+"}"), the list of all values that appear as suffixes.
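The subset-check-via-union trick is worth spelling out: a set A is a subset of B exactly when A ∪ B equals B. A Python model of the check (illustration only):

```python
def is_subset_via_union(candidate, superset):
    """Model of the check in line 3: union the candidate list with
    the superset and compare the result against the superset."""
    a = {t.strip() for t in candidate.split(";") if t.strip()}
    b = {t.strip() for t in superset.split(";") if t.strip()}
    return ";".join(sorted(a | b)) == ";".join(sorted(b))
```

An empty suffix list passes too, which is what we want: ways with a plain numeric maxspeed have no unit suffix at all.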

The example currently returns knots, from a waterway in the marina in the bounding box.

Beyond the Geodata

If you want to check the credibility of the geodata, it can be helpful to look at the metadata. In particular, the age of an object and of the surrounding objects gives a hint whether the object still exists on the ground or whether simply no mapping activity has taken place.

The other data in question is well known from the meta output mode. In addition, this data can now be accessed via evaluators. For example, you can now find all data that has last been touched in a specific changeset:

nwr(if:changeset()==58017401)({{bbox}});
out geom;

Another example is finding objects with a suspiciously high version number. Relations always have very high version numbers, hence we look at ways here:

[out:csv(ver,count)];
way({{bbox}});
for (version())
{
  make stat count=count(ways),ver=_.val;
  out;
}

Line 1 is the CSV declaration, with columns matching the tags generated in line 5 and printed in line 6. Line 2 is a standard query for all ways in the bounding box. Line 3 contains the new evaluator, version(). It is used by the for loop to group the output by version number.
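The grouping that the for loop performs can be modeled with a Counter in Python; the version numbers below are made up for illustration:

```python
from collections import Counter

# Hypothetical version numbers of the ways in a bounding box;
# the for (version()) loop above groups elements the same way
versions = [1, 2, 2, 3, 1, 2, 7]
counts = Counter(versions)

# One output row per distinct version, like
# `make stat count=count(ways), ver=_.val` followed by `out`
rows = sorted(counts.items())
```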

Now we can investigate a reasonable number of objects with the highest version numbers:

way({{bbox}})(if:version()>50);
out geom;

There are further evaluators timestamp(), uid(), and user(). These should be self-explanatory.