As a Better Science Through Better Data 2018 writing competition winner, I was invited to share my thoughts from #SciData18. Find out more about the competition here.
The day began with a celebration of technological advancements that have helped to shape modern research. Bioinformatician and data scientist at RTI International, Rebecca Boyles, reminded us of the quantum leaps humans have taken in the realm of e-science in the past 50 years (https://www.youtube.com/watch?v=ZNEXzsDB-gQ). Boyles compared the excitement of the Human Genome Project to that of the Apollo moon landing; both programmes foresaw the dawn of a new era and inspired the technological trajectory that has led to the emergence of today’s high-throughput technology. With methods like next-generation sequencing, our capacity to generate computational data is increasing exponentially. As the digital universe grows and our scientific world explodes with big data, the challenge is no longer generating data but rather managing and sharing it for the greater good.
Big data, little time
Springer Nature’s recent survey of more than 7,700 respondents revealed that whilst researchers are motivated to share their data for altruistic reasons, such as enhancing research impact and public benefit, they encounter practical challenges. Editor-in-chief of Nature, Magdalena Skipper, reported the major challenges in data sharing as:
- Not knowing how to organise data in a presentable and useful way
- Not knowing which data repository to use
- The cost of sharing data
- The lack of time to deposit data
As storing, integrating, analysing, comparing and sharing data is no mean feat, how can we support researchers in their quest to mentor open science?
Mentors for mentors
Marta Teperek believes it’s time to invest in people who can help. Her Data Stewardship project at Delft University of Technology offers subject-specific advice and guidance on good data management practice, a prerequisite for open science. Data Stewards appointed in all faculties at TU Delft are the first point of contact for any data query. With comprehensive knowledge of their research field and a deep understanding of all phases of the research lifecycle, these Data Generalists possess the multidisciplinary skillset required to deliver the appropriate solutions to data problems with efficiency and empathy. In the words of Boyles, the Data Generalist is the ‘Swiss army knife’ of the research team; their domain speciality, statistics and computing expertise, communication prowess, and technical problem-solving ability mean that they are the modern research toolkit must-have. Hiring a Data Generalist means team members can leverage the tools they need for the job without having to be fluent in programming languages or machine learning algorithms.
A series of inspiring lightning talks gave striking examples of ways in which PIs, postdocs, PhD students, publishers and the public are harnessing the power of technology to govern open research.
Andrej-Nikolai Spiess, Head of the Molecular Andrology Group at the University Hospital Hamburg-Eppendorf, demonstrated how the online tool, WebPlotDigitizer, can be used to extract underlying numerical data from a variety of graphical displays. Since a vast amount of published data is available only as figures with no supporting numerical values, it is often difficult to determine whether the data hold up to scrutiny. The software reads a plot, image or map and reverse-engineers it to report the underlying data. This data can be re-analysed to validate its statistical robustness and reproducibility in a way that was not previously possible.
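The core idea behind tools like WebPlotDigitizer is axis calibration: the user marks a few pixels with known data values, and every other pixel is mapped to data coordinates by a linear transform. A minimal sketch of that mapping (function names are illustrative, not WebPlotDigitizer’s actual API):

```python
def calibrate(p1, p2, v1, v2):
    """Return a function mapping a pixel coordinate to a data value,
    given two calibration pixels p1, p2 with known data values v1, v2."""
    scale = (v2 - v1) / (p2 - p1)  # data units per pixel
    return lambda pixel: v1 + (pixel - p1) * scale

# Example: pixels 100 and 500 on the x-axis correspond to data values 0 and 10
to_x = calibrate(100, 500, 0.0, 10.0)
print(to_x(300))  # midpoint pixel -> 5.0
```

Real digitising tools also handle log axes, rotated images and automatic point detection, but the pixel-to-data calibration above is the step that makes the recovered numbers meaningful.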
When the concept of reproducibility comes into orbit, being able to access the full experimental methodology is essential. Carsten Kettner introduced ‘STRENDA’, a web-based system supported by the Beilstein-Institut that provides researchers with ‘Standards for Reporting Enzymology Data’. These guidelines help authors to comprehensively report enzyme data by defining a list of assay parameters, such as molecule identity, temperature and pH, before publication. The data are entered into the ‘STRENDA’ database and automatically checked for compliance with the guidelines.
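An automated compliance check of this kind reduces, at its simplest, to verifying that every required assay parameter is present before a record is accepted. A hypothetical sketch (the parameter names below are an illustrative subset, not the actual STRENDA list):

```python
# Illustrative subset of required assay parameters, not the official STRENDA list
REQUIRED_PARAMETERS = {"enzyme_identity", "temperature", "pH"}

def check_compliance(record):
    """Return the set of required parameters missing from a submitted record."""
    return REQUIRED_PARAMETERS - record.keys()

record = {"enzyme_identity": "EC 1.1.1.1", "temperature": "25 C"}
missing = check_compliance(record)
print(sorted(missing))  # ['pH'] -- record is incomplete until pH is reported
```

The real system validates values as well as presence (units, ranges, identifier formats), but the principle is the same: machine-checkable reporting standards catch gaps before publication rather than after.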
Cite it right
To make certain that data is Findable, Accessible, Interoperable and Reusable (FAIR), we must cite it. The motivation to cite data arises from the appreciation that datasets are just as valuable to the ongoing academic discourse as scholarly articles and deserve to be cited in the same way to democratise access for data re-use and verification. We heard from EMBL-EBI software engineer, Sarala M. Wimalaratne, about identifiers.org, a resolving system that supports the formal citation of primary research data through the use of ‘persistent identifiers’, such as Digital Object Identifiers (DOIs). In simpler terms, ‘persistent identifiers’ are long-lasting, unique labels assigned to digital resources that are guaranteed to be managed and kept up-to-date over time. Because hyperlinks to webpages often change with technical updates, users can lose access to records through dead or broken links. Identifiers.org aims to track these changes and provide the latest URLs to facilitate improved access to data.
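In practice, identifiers.org works with compact identifiers of the form ‘prefix:accession’, which are turned into a stable resolver URL; the resolver then redirects to wherever the record currently lives. A minimal sketch of that construction (the helper function is hypothetical, though the URL pattern shown is the one identifiers.org uses):

```python
def compact_to_url(curie):
    """Build an identifiers.org resolver URL from a compact
    identifier of the form 'prefix:accession'."""
    prefix, accession = curie.split(":", 1)
    return f"https://identifiers.org/{prefix}:{accession}"

print(compact_to_url("taxonomy:9606"))
# https://identifiers.org/taxonomy:9606
```

The value of the scheme is that a citation embeds the stable resolver URL, not the data provider’s current web address, so the link survives even when the provider reorganises its site.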
To make data FAIR for all, DataCite’s Director of Communications, Helena Cousijn, explained how data-level metrics can provide feedback on data usage, views and impact. Counting data re-use through the ‘Make Data Count’ project rewards researchers, thereby incentivising them to share their data.
PhD student, Claudia Wolff, explained how members of the public can help. Using crowdsourced photographs of coastal regions uploaded to Coastwards.org, Claudia classifies coastal morphology to assess the impacts of rising sea levels in the Mediterranean basin.
As Wolff’s experience shows, you don’t have to be a professional to lead the way to an open future. We can all play a part in mentoring open science.