2. What we’ll cover
● OpenStreetMap (OSM) and its data model
● A Missing Maps use case that needed big data tooling to
process OSM History
● OSMesa, what it is, and what it can do
● The future of distributed OSM processing, and what it will
enable
5. OSM Data Model
The OSM data model consists mainly of 3 elements:
● Nodes - Points
● Ways - LineStrings, Polygons
● Relations - GeometryCollections, Polygon with holes,
MultiPolygons
As well as the tag-based metadata that applies to each
element, and changesets that group edits
7. OSM Data Model: Changesets
● Edits are grouped into changesets, which have their own
metadata such as user comments (for developers, think
commit messages)
● Adding hashtags to user comments allows downstream
processing to group changes - for example, #HOTLunch
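Grouping changesets by hashtag can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual code: the regular expression, the helper names `extract_hashtags` and `group_by_hashtag`, and the sample comments are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed pattern: '#' followed by letters, digits, underscores or hyphens.
HASHTAG_RE = re.compile(r"#([A-Za-z0-9_-]+)")


def extract_hashtags(comment):
    """Return the lowercased, de-duplicated hashtags in a changeset comment."""
    seen = []
    for tag in HASHTAG_RE.findall(comment or ""):
        tag = tag.lower()
        if tag not in seen:
            seen.append(tag)
    return seen


def group_by_hashtag(changesets):
    """Map each hashtag to the ids of the changesets that mention it."""
    groups = defaultdict(list)
    for changeset_id, comment in changesets:
        for tag in extract_hashtags(comment):
            groups[tag].append(changeset_id)
    return dict(groups)


# Hypothetical changesets: (id, user comment)
changesets = [
    (1, "Added buildings #HOTLunch #missingmaps"),
    (2, "Fixed road names"),
    (3, "More buildings #hotlunch"),
]
print(group_by_hashtag(changesets))
```

Lowercasing the tags is a design choice for the sketch so that #HOTLunch and #hotlunch land in the same group.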
12. Backfilling missing maps
● Missing maps leaderboard processes OSM change files to
increment user and campaign statistics
● The statistics were correct from the point the streaming
calculation started, but edits made before that point were
not counted towards users’ totals.
● So, there was a need to “backfill” the statistics based on
OSM history.
13. ● Through the Red Cross and a grant
from Microsoft Philanthropies, Seth
Fitzsimmons of Pacific Atlas was
hired to solve the backfilling problem.
● Seth was previously involved with
releasing OSM data as a public
dataset on AWS and early work on
distributed processing of OSM data
18. Backfill: Athena approach
● Seth first tried to use Athena to calculate the backfill
statistics. This approach didn’t work.
● The complexity of the queries made the jobs blow up or
never finish
● Also, Athena's geospatial support hadn't been announced
yet, and once it was, it still didn’t work with the complicated
set of queries
19. Backfill: New approach
● Seth started showing interest in a set of tools that Azavea
was building at the time that used Apache Spark and
GeoTrellis for calculating similar statistics
● He ported his complicated SQL queries for Athena to
SparkSQL and started contributing to that effort
21. What is OSMesa?
● It's a loose term for a workflow for OSM data processing
● Still being defined - useful, but amorphous
● More a group of tools and techniques than, say, a library
● Uses Spark, GeoTrellis and AWS to process OSM data into
geometries, vector tiles, and statistics
22. Apache Spark
● A distributed computation engine.
● An API that lets you work with distributed data as a
collection, including a DataFrames API
● Written in Scala, with language bindings for use with Java,
Python, and R.
23. ● Spark DataFrames provide an API that is similar to R or
Pandas DataFrames; allows working with data in a SQL-like
manner
● Very powerful, and can express complicated queries
● (partially) Abstracts away the complexities of distributed
computing
24. GeoTrellis
● Core geospatial library in Scala
● Enables Spark with geospatial types and operations
● Generally focused on Raster data, wraps JTS for vector
support
● Vector Tile module for reading and writing vector tiles
26. Creating features from History
● With OSMesa, we can create full historical geometries.
● To do this, we needed to create a concept of “minor
versions” of geometries
28. Minor version change (diagram)
way v1: highway=unclassified; nodes v1, v1, v1, v1
way v1.1: highway=unclassified; nodes v1, v1, v2, v2 (minor version change)
way v2: highway=primary; nodes v1, v1, v2, v2
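The idea in the diagram above can be sketched in Python. This is a simplified illustration under stated assumptions, not the OSMesa implementation: timestamps are plain integers, node edits that share a timestamp count as one minor version, and the function name `minor_versions` and the input layout are hypothetical.

```python
def minor_versions(way_versions, node_edit_times):
    """
    Assign (major, minor) versions to a way.

    way_versions:    list of (version, timestamp, node_ids), sorted by timestamp.
    node_edit_times: dict of node_id -> list of timestamps at which the node changed.
    Returns a list of (major, minor, timestamp) tuples.
    """
    result = []
    for i, (version, ts, node_ids) in enumerate(way_versions):
        # A way version lasts until the next explicit version of the way.
        if i + 1 < len(way_versions):
            end = way_versions[i + 1][1]
        else:
            end = float("inf")
        result.append((version, 0, ts))
        # Node edits strictly inside this lifetime change the geometry
        # without bumping the way's own version: these are minor versions.
        edit_times = sorted({
            t
            for node_id in node_ids
            for t in node_edit_times.get(node_id, [])
            if ts < t < end
        })
        for minor, t in enumerate(edit_times, start=1):
            result.append((version, minor, t))
    return result


# Mirrors the diagram: nodes 3 and 4 move at t=20, the way is retagged at t=30.
way_versions = [(1, 10, [1, 2, 3, 4]), (2, 30, [1, 2, 3, 4])]
node_edit_times = {1: [5], 2: [5], 3: [5, 20], 4: [5, 20]}
print(minor_versions(way_versions, node_edit_times))
```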
29. Full historical geometries
● With minor versions, we can bake new ORC files that
contain geometries of every element in OSM history, with
ways and relations versioned for every edit to the element
itself as well as to the elements they contain
● Then, we compute statistics per changeset based on
geometries, and roll up the statistics per user and hashtag
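The roll-up step can be sketched as follows. This is a minimal sketch in plain Python rather than Spark; the record layout and the statistic names (`buildings_added`, `road_m_added`) are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def roll_up(changeset_stats):
    """Sum per-changeset statistics into per-user and per-hashtag totals."""
    per_user = defaultdict(Counter)
    per_hashtag = defaultdict(Counter)
    for cs in changeset_stats:
        # Counter.update adds the values of matching keys.
        per_user[cs["user"]].update(cs["stats"])
        for tag in cs["hashtags"]:
            per_hashtag[tag].update(cs["stats"])
    return per_user, per_hashtag


# Hypothetical per-changeset statistics.
changeset_stats = [
    {"user": "alice", "hashtags": ["hotlunch"],
     "stats": {"buildings_added": 3, "road_m_added": 120}},
    {"user": "alice", "hashtags": [],
     "stats": {"buildings_added": 2}},
    {"user": "bob", "hashtags": ["hotlunch", "missingmaps"],
     "stats": {"road_m_added": 400}},
]

per_user, per_hashtag = roll_up(changeset_stats)
print(dict(per_user["alice"]))
print(dict(per_hashtag["hotlunch"]))
```

A changeset with several hashtags contributes its full statistics to each of them, so per-hashtag totals are not expected to sum to the global total.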
30. Processing OSM data at scale
● Processing of full history into features in under 40 minutes
(cluster of 255 m3.2xlarge nodes)
● This is not a small cluster (≈$65/hour). YMMV with smaller
clusters.
● We are building update mechanisms to avoid refreshing the
entire dataset
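As a rough back-of-the-envelope check on those figures, the arithmetic below treats 40 minutes as an upper bound on the runtime and $65/hour as the approximate cluster price quoted above; billing granularity and storage costs are ignored, so the result is only indicative.

```python
nodes = 255                    # cluster size from the slide
cluster_cost_per_hour = 65.0   # approximate, in USD, from the slide
runtime_minutes = 40           # upper bound ("under 40 minutes")

runtime_hours = runtime_minutes / 60
run_cost = cluster_cost_per_hour * runtime_hours     # cost of one full-history run
node_hours = nodes * runtime_hours                   # compute consumed by one run
cost_per_node_hour = cluster_cost_per_hour / nodes   # implied price per node

print(f"one run costs at most about ${run_cost:.2f}")
print(f"node-hours per run: {node_hours:.0f}")
print(f"implied cost per node-hour: ${cost_per_node_hour:.3f}")
```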
34. Global friction surface for cost distance calculations using elevation (SRTM) and OSM roads + water bodies
35. OSMesa: Other current uses
● Building matching between OSM and other vector datasets
● Generating vector tiles for URCHN containing a subset of
historical data to front-end analytics
37. The Future: Validation workflows, Reputation
scores
● Better validation workflows are a big question in the OSM
community right now (according to SOTM US 2017)
● HOT Tasking manager does some; we can do better
● One way to improve validation workflows is to suggest that
validation be done by veteran mappers, and that validation
be suggested for the work of more junior mappers
(“reputation score”)
● Development Seed, who contribute to and use OSMesa,
have great ideas in this space.
38. The Future: Data Science notebooks,
production workflows
● We are aiming to create a Python notebook environment for
doing data science on OSM, in combination with raster data
● By building on Spark and projects like GeoMesa’s
“JTSFrames”, RasterFrames, and GeoTrellis, we’re creating
a platform that works both for data scientists poking around
in a Jupyter notebook and for production systems.
39. The Future: Machine Learning pre- and post-
processing
● Pre-processing geospatial imagery and OSM into training
chips - a distributed label-maker
● Managing data into and out of Raster Vision
● Post-processing by cleaning the model output, matching to
OSM or other vector data to remove duplicates, conflation
workflows
● Matching OSM to imagery dates: e.g. pre- and post-
disaster.
40. Join in the fun
● There are a lot of interesting development challenges that
need to be met in the OSM world
● OSM has many different voices in the room, but they all
have one goal: building a better map
● Join the effort to build a better map
41. If you could ask OpenStreetMap any
question, at any scale, what would you ask it?
43. OSM Data Model: Nodes
● Single location; only OSM element with geospatial data
● Can represent points of interest, or be solely for inclusion in
ways
● Represents a Point geometry
44. OSM Data Model: Ways
● References a sequence of ordered nodes
● Represents a LineString geometry
● Closed ways can represent Polygon geometries
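The LineString/Polygon distinction can be sketched with a small check on a way's node references. This is a simplification: in practice whether a closed way is treated as an area also depends on its tags (a roundabout is a closed way but still a line), which this sketch ignores. The function name `way_geometry_type` is hypothetical.

```python
def way_geometry_type(node_ids):
    """Guess the geometry type of a way from its ordered node references."""
    if len(node_ids) < 2:
        return None  # not enough nodes to form a geometry
    # A closed way starts and ends on the same node; a polygon ring
    # needs at least three distinct nodes plus the closing one.
    is_closed = node_ids[0] == node_ids[-1]
    if is_closed and len(node_ids) >= 4:
        return "Polygon"
    return "LineString"


print(way_geometry_type([1, 2, 3]))         # open way
print(way_geometry_type([1, 2, 3, 4, 1]))   # closed way
```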
45. OSM Data Model: Relations
● Group of nodes, ways, and other relations
● Used for representing a Polygon with holes,
MultiPolygons, and more generally GeometryCollections
46. OSM Data Model: Tags
● Each Node, Way and Relation can have a sequence of
tags, which are string-based keys and values. This
describes the role of each element on the map, e.g.
○ highway=residential
○ landuse=grass
○ amenity=fast_food
49. Ways to work with OSM snapshots
● Import OSM data into PostGIS
○ osm2pgsql
○ imposm3
● Render into raster tiles or vector tiles
○ Mapnik
○ Tegola
● Utilize for routing software
○ pgRouting
50. Ways to work with OSM history
● Clip it using osmium, and import a subset into PostGIS
● After that … not a lot of mature tooling available
51. Why is OSM history useful?
● Calculating user history statistics
● Calculating campaign history statistics
● Calculating complete answers to the question, “what has
changed?”
● Taking a snapshot of OSM at any point in history
● Analytics for research
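Taking a snapshot at a point in time can be sketched as picking, for every element, the latest version at or before the requested timestamp and dropping elements whose latest version is a deletion. The record layout below is an assumption for illustration (OSM history does carry a version number, a timestamp, and a visibility flag per element version); timestamps are ISO-8601 UTC strings in a single fixed format, which compare correctly as text.

```python
def snapshot(history, at):
    """Return the visible elements as of timestamp `at`, keyed by (type, id)."""
    latest = {}
    for element in history:
        if element["timestamp"] > at:
            continue  # this version did not exist yet
        key = (element["type"], element["id"])
        if key not in latest or element["version"] > latest[key]["version"]:
            latest[key] = element
    # A latest version with visible=False means the element was deleted.
    return {key: el for key, el in latest.items() if el["visible"]}


# Hypothetical history records.
history = [
    {"type": "node", "id": 1, "version": 1, "timestamp": "2015-01-01T00:00:00Z", "visible": True},
    {"type": "node", "id": 1, "version": 2, "timestamp": "2017-01-01T00:00:00Z", "visible": True},
    {"type": "node", "id": 2, "version": 1, "timestamp": "2016-01-01T00:00:00Z", "visible": True},
    {"type": "node", "id": 2, "version": 2, "timestamp": "2018-01-01T00:00:00Z", "visible": False},
]

mid_2016 = snapshot(history, "2016-06-01T00:00:00Z")
in_2019 = snapshot(history, "2019-01-01T00:00:00Z")
print(sorted((key, el["version"]) for key, el in mid_2016.items()))
print(sorted((key, el["version"]) for key, el in in_2019.items()))
```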
52. Why ORC?
● On-demand querying + predicate push-down is possible if
OSM data is in a format that is well understood by the
Hadoop ecosystem
● Bespoke formats have their place, especially when size or
other considerations are all-consuming, but it's really
frustrating to see people continually implementing OSM PBF
parsers to be slightly faster when those parsers are typically
single-use (for a specific application). I wanted to sidestep
the whole process and use a well-known, well-supported
53. The Approach: Features from OSM data
● Join element data to the other elements that contain them;
for example, join each node to the way(s) it belongs to.
● Assign a minor version to ways and relations modified
because the underlying elements change; e.g. a minor
version increments for a way if someone moves the nodes
belonging to it.
● Create Points, Lines, Polygons, and MultiPolygons for each
major and minor version of the element.
ProcessOSM.scala on GitHub
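The join between ways and their nodes can be sketched in plain Python. This is not what ProcessOSM.scala does line for line; it is a single-machine sketch under stated assumptions: integer timestamps, a hypothetical in-memory layout of node history, and hypothetical helper names `node_position_at` and `way_wkt_at`. It only builds LineStrings.

```python
from bisect import bisect_right


def node_position_at(versions, ts):
    """
    versions: list of (timestamp, lon, lat) sorted by timestamp.
    Returns the (lon, lat) in effect at `ts`, or None if the node did not exist yet.
    """
    idx = bisect_right([v[0] for v in versions], ts) - 1
    if idx < 0:
        return None
    _, lon, lat = versions[idx]
    return lon, lat


def way_wkt_at(node_ids, nodes, ts):
    """Join a way to its nodes as of `ts` and return a WKT LineString."""
    coords = [node_position_at(nodes[node_id], ts) for node_id in node_ids]
    if any(c is None for c in coords):
        return None  # at least one referenced node did not exist yet
    return "LINESTRING(" + ", ".join(f"{lon} {lat}" for lon, lat in coords) + ")"


# Hypothetical node history: node 2 is moved at t=20.
nodes = {
    1: [(5, 0.0, 0.0)],
    2: [(5, 1.0, 0.0), (20, 1.0, 1.0)],
}
way_node_ids = [1, 2]

print(way_wkt_at(way_node_ids, nodes, 10))  # geometry before the node moved
print(way_wkt_at(way_node_ids, nodes, 20))  # geometry after the node moved
```

Evaluating the way at the timestamp of each major and minor version yields one geometry per version.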
55. Analytic Vector Tiles
● The name we’ve been using for Vector Tiles that contain
information for analysis, not (necessarily) for display
● OSMesa/VectorPipe can create sets of Analytic Vector Tiles
from arbitrary subsets of OSM History and publish them to
AWS S3
● Think custom Mapbox QA Tiles, containing relations and
historical elements
● We are creating streaming update workflows to keep
Analytic Vector Tile sets up-to-the-minute (almost).
56. Other work in this space
● Mapbox’s Jennings Anderson gave a talk at SOTM and
wrote a blog post around quarterly QA tiles
● Uses a work-in-progress project called osm-wayback to
create the historical QA tiles. Goal of project is “...to create
historic geometries for each intermediate version of an OSM
feature.”
● RocksDB on the backend, which creates a ≈ 600GB index
● We have been collaborating and are looking to collaborate
further; the work is awesome