How to fix the problems with open data

Oriana Genolet explores the issues facing open data at Better Science Through Better Data 2019 and how to fix them

Training for early career researchers on best open data practices was just one of the topics discussed at this year's Better Science Through Better Data conference, which was full of recommendations from speakers on how to better facilitate the sharing of scientific data.

Five months before finishing my PhD at the Max Planck Institute of Molecular Genetics, I came across the Better Science Through Better Data writing competition, which I figured would be a great opportunity to research and write about a subject I knew little about.

Wasting no time, I started to gather information on how researchers should be rewarded for data sharing and reproducible research, and submitted my work. Two months later, my entry having been selected as one of the four winners, I was at the Better Science Through Better Data conference in London.

I started the day hoping to learn a bit more about open and reproducible data and to meet interesting people involved in the subject, but was soon astonished by a parallel world that had escaped my notice: I was suddenly introduced to a plethora of tools for making research data openly available, high-quality and accessible to researchers. It had only taken me five long years of research, plus one day at a conference, to find them. ORCID, the FREYA project, DataCite, Jisc, Data Stewards, Data Champions, The Carpentries, research data training workshops and more were all suddenly on my horizon.

I believe that a vast number of early career scientists like me are unaware of the steps that need to be taken to document and share high-quality data throughout their research years. Most are confronted with that problem at the time of publishing, and in some cases, a significant amount of time and effort needs to be invested to meet data sharing requirements at that point.

As I listened to the talks at SciData19, I heard many speakers touch on ways to motivate and educate early career scientists about open data and research reproducibility practices. It seemed to me that several scientists, together with funding bodies and research institutions, have started to find solutions that others could adopt to bridge the existing knowledge gap.

Paola Quattroni, the Research Funding Data Manager at Cancer Research UK, told us that the funding agency is partnering with Springer Nature to offer research data training workshops for early career scientists. These workshops aim to clarify best data-sharing practices, covering topics such as the FAIR data principles, repositories and metadata.

The implementation of research data training workshops is an initiative that other research institutions could reproduce, giving early career researchers easy access to data management training. Some research labs, meanwhile, are taking matters into their own hands.

Tomas Knapen, assistant professor at Vrije Universiteit Amsterdam, told us that it is important not only to make research data open and accessible, but also to implement open and thoroughly documented methods. He encourages his students to keep clear track of any changes made in their analyses by uploading code and data to Git and GitHub repositories, which results in a complete history of the project. For him, it is clear that this is a key step towards open and reproducible research. He also says that new members joining his lab must take statistics courses, join Data Carpentry workshops and learn open science skills.

Knapen sees himself as part of a new generation of scientists that views open science and methods as a door to new collaborations and a source of new approaches to data analysis, not as an extra workload. In reality, the fear of data misuse and the pressure to publish fast, novel research are factors that still weigh heavily on the minds of the vast majority of researchers. According to the State of Open Data Report 2019, the number one barrier to data sharing among researchers is concern about data misuse, cited by more than 35% of scientists.

Still, as scientists, maintaining the reproducibility of our research is as important as generating new knowledge. According to a recent Nature survey, up to 85% of researchers have failed to reproduce someone else's results, and 65% of respondents have even failed to reproduce their own. Unavailability of methods and code, together with a lack of raw data, are cited as the major factors contributing to this reproducibility crisis.

Yasemin Turkyilmaz van der Velden is a data steward at TU Delft, someone who provides support on all issues concerning data management, and a role I was not aware of before the conference. For an early career scientist who hopes to share their data but perhaps lacks the training to do so, a data steward is the right person to approach with questions on data storage, security, sharing and publication requirements. Every faculty at TU Delft has its own data steward, as part of a project funded until 2020 by the university's Executive Board, and an initiative that other research institutions could consider taking on.

Turkyilmaz van der Velden focused her talk on the importance of thoroughly storing, documenting and openly sharing data. According to recent studies, the probability of receiving data from groups that failed to upload datasets to open repositories decreases by 17% per year. She also spoke about the importance of documenting protocols and code, for example by using notebooks such as RStudio or Jupyter, in order to increase research reproducibility.
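To make the notebook idea concrete, here is a minimal sketch of my own (not an example from the talk) of the habits that make a notebook-style analysis reproducible: fix the random seed, record the environment that produced the numbers, and comment each step so a reader can follow the analysis without asking the author.

```python
# Sketch of a reproducible analysis cell, as one might write it in a
# Jupyter notebook. The "data" here is simulated purely for illustration.
import random
import statistics
import sys

# Fixing the seed means anyone re-running the notebook gets identical values.
random.seed(42)

# Stand-in for real experimental measurements: 100 draws from N(10, 2).
measurements = [random.gauss(10.0, 2.0) for _ in range(100)]

# Summary statistics, computed with the standard library only.
mean = statistics.mean(measurements)
sd = statistics.stdev(measurements)

# Record what produced the numbers alongside the numbers themselves.
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
print(f"n={len(measurements)} mean={mean:.3f} sd={sd:.3f}")
```

Shared alongside the raw data in a public repository, a notebook like this lets anyone rerun the analysis end to end and obtain the same result.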

The need for young scientists to know about and make use of these tools at the earliest opportunity is self-evident. Nevertheless, Rebecca Grant, Research Data Manager at Springer Nature, told me during one of the coffee breaks that many training providers find it challenging to incentivize researchers to attend data-focused training. It seems to me that more pressure should come from funding agencies, research institutions and principal investigators to make student attendance at statistics courses, data management courses and software carpentry workshops strongly encouraged or even mandatory.

There are still numerous gaps to fill in data management knowledge and in the promotion of open and FAIR data practices. Still, with the help of institutions and scientists actively working on the issue, there is hope that in a few years we might build a network of researchers working together in an open and collaborative environment instead of against each other, expanding knowledge with consistent and reproducible results.

Oriana Genolet is a fifth-year PhD student at the Max Planck Institute of Molecular Genetics in Berlin, where she is trying to discover the genes that lead to a developmental delay in female embryos, using mouse embryonic stem cells as in vitro models. She is interested in open data practices as a means to achieve high-quality and reproducible results in science.

Oriana is a winner of the Better Science Through Better Data writing competition. Read her winning entry here
