I created a bus route finding tool using PostGIS and ArangoDB.
I wanted a different way to navigate from village to village using bus routes. Transport for London (TfL) provides spider maps (which is almost the perfect answer), but they don’t show long bus routes and are specific to London.
While online mapping services are the ‘de-facto’ approach for finding specific routes, I find that they don’t tend to resolve long routes well and sometimes don’t show all services available. Instead, I’d like a way to know which routes are scheduled to pass through which village and allow myself (or a tool) to resolve that.
To get an overview of possible routes at scale, I built a workflow that uses;
- OpenStreetMap data stored in PostgreSQL to resolve bus route connections
- ArangoDB to store resolved bus route connections and the villages that they connect to
- Python as an intermediary for bringing data from PostgreSQL to ArangoDB and presenting routes
Obtaining & Interpreting the OpenStreetMap Data
I imported the London dataset from Geofabrik into PostgreSQL using osm2pgsql. I have a database instance running locally on my MacBook Pro, but a cloud-managed instance (e.g using RDS) would work just as well.
osm2pgsql -d osm -U osmuser --create --slim --hstore -C 2000 --number-processes 4 greater-london-latest.osm.pbf
# -d [database name]
# -U [postgres username]
# -C 2000; for limiting the amount of RAM usage by the import tool
Visualizing the OpenStreetMap Data
QGIS is a great tool that allows me to explore the openstreetmap dataset directly from PostgreSQL (or file) in it’s intended form, a map. I’ll be using it to drive how my SQL queries are structured and to see what data points I can use to reach the end goal.
Adding a data source for viewing in QGIS can be done by right-clicking PostgreSQL and hitting “New Connection” inside the browser
Since I’m working on my local MacBook, simply providing the host and database is enough. I don’t need authentication in my ‘rapid-development’ scenario.
Once imported, I extracted planet_osm_line
for routes and planet_osm_points
for villages or ‘places of interest’. I do this by applying a filter of route = 'bus'
on the planet_osm_line
layer, and place = 'suburb'
on planet_osm_point
.
What am I trying to achieve?
The end-goal is to be able to resolve which planet_osm_line[]
can reach from planet_osm_point
A -> planet_osm_point
B, or in a simpler terms, “Which buses can I use to get from town A to town B”. Not specifically “How can I get from my house to the specific store on the other side of town”.
I was able to create a 1km boundary from planet_osm_points
directly in PostgreSQL by using the ST_Buffer
GIS function, and to conditionally return using the ST_Intersects
GIS function. Additionally, to filter out all the irrelevant data points, I return all points where place
has a value of suburb
, and all where all lines where the value of route
is bus
.
WITH suburb_buffer AS (
SELECT name, place, ST_Buffer(way, 1000) AS buffer
FROM planet_osm_point
WHERE place = 'suburb'
)
SELECT DISTINCT line.ref, sb.name
FROM planet_osm_line line
JOIN suburb_buffer sb ON ST_Intersects(line.way, sb.buffer)
WHERE line.route = 'bus';
line.ref | sb.name |
---|---|
SL8 | Hanwell |
SL8 | Hillingdon |
SL8 | Shepherd’s Bush |
With this data returned, I can clearly see that the path of the SL8 bus route passes through Hanwell, Hillingdon and Shepherd’s Bush.
I’m relying on the database to handle as much of the resolving logic as possible as opposed to any Python code, mainly for reducing time spent on maintaining Python code. In order, this query is doing the following;
suburb_buffer
represents a list of all suburbs according to OpenStreetMap.way
is a single point stored as GIS data.- I request a circle around
way
points that is 1km wide, and store it asbuffer
.
sb
is a single iteration ofsuburb_buffer
, which representsplanet_osm_points
.line
is a single iteration ofplanet_osm_line
, restricted byline.route = 'bus'
line
is also restricted by whetherST_Intersects()
returnsTrue
.ST_Intersects()
verifies whether the line in iteration matches the point in iteration.
I then moved this query to a Python script to translate this information to ArangoDB. I did this so that I can use the best tool for the job. As a graph database, I can quickly perform complex relational queries without much computational overhead.
Three collections exist in ArangoDB, where each returned value from PostgreSQL is converted into one document in each collection;
route
holds all bus routes, taken fromline.ref
(planet_osm_line
)location
holds all suburbs, taken fromsb.name
(planet_osm_point
)route_connections
is the confirmation returned fromST_Intersects()
(route
passes throughlocation
)_to
and_from
are required keys in an Edge document.- After creating the
route
andlocation
, I;- Take the
_id
of each - Set
_to
on theroute_connection
to the_id
oflocation
- Set
_from
on theroute_connection
to the_id
ofroute
- Take the
AQL (ArangoDB Query Language) is used to traverse the graph and resolve routes.
FOR vtx, edge, path IN 1..6 ANY 'location/{START}' route_connection
FILTER vtx.name == "{TARGET}"
RETURN path
Here, I traverse from location/{START}
to a location
where location.name
is {TARGET}
. This is checked against the route_connection
collection. For the result below, {START}
is “Yiewsley” and {TARGET}
is “Shepherd’s Bush”.
Finally, I built an “API Bridge”, returning a JSON representation of calculated routes for use by a possible Vue.JS interface or the likes.
"698 ➡️ Hillingdon ➡️ SL8, N207 ➡️ Shepherd's Bush ➡️"
{
"Hillingdon": {
"routes": ["698", "U1", "U3", "U5"],
"next": {
"Shepherd's Bush": {
"routes": ["SL8"]
}
}
}
}
In the future, I’d to render spider maps using this data or to be able to interpret this data on-the-go on my iPhone. But at-least now I can get all the possible routes (extreme, quick and sometimes not possible, since 698 is the local school bus) from one village to another without a specific location in mind.
I’d also like to explore optimizing the data structure within the graph database, perhaps moving the route
collection into route_connections
and eventually including stops and distance to provide more accurate resolves.