Publishing data on public transport networks - how hard can it be?

A lot of data on public transport (PT) schedules and routes are freely and publicly available. However, their use in scientific contexts has remained limited. How come? Read our story to find out which challenges we had to overcome in working with PT data, so that you can get up to speed quickly in analyzing PT networks!

Rainer Kujala
May 16, 2018

The accompanying Data Descriptor is published in Scientific Data.

Three years ago, we at the Complex Systems research group in the Department of Computer Science at Aalto University became interested in studying public transportation (PT) networks. As part of our efforts in this field, we joined forces with transportation engineers from the Department of Built Environment, secured a grant from the Academy of Finland, and started working.

The first step in our efforts was to get our hands on PT network data. As a matter of fact, in our grant application we had even promised to build and launch a repository for research data on PT networks. Given that PT schedules and routes are commonly published in an established, well-structured data format (the General Transit Feed Specification, GTFS), we thought that curating such data and processing them into city-sized extracts would be quick. But no, it was not that easy.

In the end, we had to build an automated pipeline to sort out the many technical challenges of curating the original data and transforming them into easily accessible formats. The practical challenges we faced included merging data from different sources, filtering data spatially and temporally, computing walking distances between PT access points, and checking for clear errors in the data. By the time these issues were solved, our pipeline had grown to thousands of lines of code.
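As a rough illustration of two of these steps (not the actual pipeline code), a minimal Python sketch of spatial filtering and stop-to-stop distance computation might look as follows. The field names `stop_id`, `stop_lat` and `stop_lon` come from the GTFS `stops.txt` file; the function names, the toy coordinates and the thresholds are made up for this example. Note that it uses straight-line (great-circle) distance as a simple stand-in; real walking distances require routing along a street network.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def filter_stops(stops, center_lat, center_lon, radius_m):
    """Keep only the stops within radius_m of the chosen centre point."""
    return [
        s for s in stops
        if haversine_m(s["stop_lat"], s["stop_lon"],
                       center_lat, center_lon) <= radius_m
    ]


def nearby_stop_pairs(stops, max_dist_m):
    """Return (stop_id, stop_id, distance_m) for pairs at most max_dist_m apart."""
    pairs = []
    for i, a in enumerate(stops):
        for b in stops[i + 1:]:
            d = haversine_m(a["stop_lat"], a["stop_lon"],
                            b["stop_lat"], b["stop_lon"])
            if d <= max_dist_m:
                pairs.append((a["stop_id"], b["stop_id"], d))
    return pairs


# Toy stops with made-up coordinates (hypothetical, not real data).
stops = [
    {"stop_id": "A", "stop_lat": 60.170, "stop_lon": 24.940},
    {"stop_id": "B", "stop_lat": 60.171, "stop_lon": 24.941},
    {"stop_id": "C", "stop_lat": 61.000, "stop_lon": 25.000},
]

# Spatial filter: keep stops within 5 km of an assumed city centre.
kept = filter_stops(stops, 60.170, 24.940, radius_m=5_000)

# Candidate walking connections: stop pairs at most 1 km apart.
pairs = nearby_stop_pairs(kept, max_dist_m=1_000)
```

The pairwise loop is quadratic in the number of stops, which is fine for a sketch; for a whole city, a spatial index would be the more natural choice.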

However, these technical challenges were not the end of the story! In fact, our greatest challenge ended up being the licensing issues related to the source data. This was not something we had been expecting, partly because we were immersed in the open data culture in Finland. Although PT network data are often available over the Internet, the licensing terms vary across PT agencies and operators, making reuse of the data cumbersome. In particular, the licensing terms often did not allow republishing the data in modified form. This forced us to cut our original selection of over 100 cities down to 25. In the future, we hope that many more PT agencies will publish their timetables under a standard permissive open data license, such as a Creative Commons license!

Now, more than two years later, we have finally created our data repository, which contains PT network data for 25 cities in multiple easy-to-access formats. In particular, we hope that researchers interested in PT networks can now get up to speed faster in this fascinating field!

For more technical details on our data, see our recently published Data Descriptor, our software tools, and our own data repository, which enables visual inspection of the PT network data!

PT networks included in our data collection.
Can you guess the cities included in our collection based on their "fingerprints"? (For the licensing terms of the original data, see the published Data Descriptor; background maps (c) OpenStreetMap contributors and Carto.)

[Cover image: Original PT network data courtesy of Helsinki Region Transport (CC BY 4.0), background map (c) OpenStreetMap and Carto.]   

Rainer Kujala

Ph.D. Student, Aalto University, Department of Computer Science

Ph.D. student on complex systems/networks and computational science. Broadly interested in all fields related to data science. Lately focused on public transportation. Defending my thesis in fall 2018.
