The Data Descriptor in Scientific Data is here: https://go.nature.com/2IGxstM
Three years ago, we at the Complex Systems research group in the Department of Computer Science at Aalto University became interested in studying public transportation (PT) networks. To pursue this, we joined forces with transportation engineers from the Department of Built Environment, secured a grant from the Academy of Finland, and started working.
The first step was to get our hands on PT network data. As a matter of fact, in our grant application we had even promised to build and launch a repository for research data on PT networks. Given that PT schedules and routes are commonly published in an established, well-structured data format (the General Transit Feed Specification, GTFS), we thought that curating such data and processing them into city-sized extracts would be quick. But no, it was not that easy.
In the end, we had to build an automated pipeline to handle the many technical challenges of curating the original data and transforming them into easily accessible formats. These challenges included merging data from different sources, filtering data spatially and temporally, computing walking distances between PT access points, and checking for clear errors in the data. The resulting pipeline grew to thousands of lines of code.
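To give a flavor of one of these steps, here is a minimal sketch of spatial filtering: keeping only the stops that lie within a given radius of a city center. It is not our actual pipeline code. The function names, the sample stops, and the 25 km radius are hypothetical; only the field names `stop_id`, `stop_lat`, and `stop_lon` come from the GTFS `stops.txt` file. Distances are straight-line (great-circle) distances, a simplification compared with real walking distances along streets.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def filter_stops(stops, center_lat, center_lon, radius_km):
    """Keep the stops lying within radius_km of the given center point.

    `stops` is a list of dicts shaped like rows of a GTFS stops.txt file,
    with coordinates given as strings, as they would be when read from CSV.
    """
    return [
        stop for stop in stops
        if haversine_km(float(stop["stop_lat"]), float(stop["stop_lon"]),
                        center_lat, center_lon) <= radius_km
    ]


# Hypothetical sample rows, not taken from any real feed.
stops = [
    {"stop_id": "A", "stop_lat": "60.1710", "stop_lon": "24.9415"},  # central Helsinki
    {"stop_id": "B", "stop_lat": "60.2055", "stop_lon": "24.6559"},  # Espoo
    {"stop_id": "C", "stop_lat": "61.4978", "stop_lon": "23.7610"},  # Tampere
]

# Assumed city center (Helsinki) and an assumed 25 km cut-off radius.
kept = filter_stops(stops, 60.1699, 24.9384, 25.0)
print([stop["stop_id"] for stop in kept])
```

A real extract also has to drop the trips, stop times, and routes that refer to the removed stops, which is part of why the full pipeline grows well beyond a snippet like this.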
However, these technical challenges were not the end of the story! In fact, our greatest challenge ended up being the licensing of the source data. This was not something we had been expecting, partly because we are immersed in the open data culture of Finland. Although PT network data are often available over the Internet, the licensing terms vary across PT agencies and operators, which makes reusing the data cumbersome. In particular, the licensing terms often did not allow republishing the data in modified form. This forced us to cut our original selection of over 100 cities down to 25. In the future, we hope that many more PT agencies will publish their timetables under a standard permissive open data license, such as the Creative Commons licenses!
Now, more than two years later, we have finally created our data repository, which contains PT network data for 25 cities in multiple easy-to-access formats. In particular, we hope that researchers interested in PT networks can now get up to speed faster in this fascinating field!
For more technical details of our data, see our recently published Data Descriptor, our software tools, as well as our own data repository enabling visual inspection of the PT network data!

[Cover image: Original PT network data courtesy of Helsinki Region Transport (CC BY 4.0), background map (c) OpenStreetMap and Carto.]