The Better Science Through Better Data 2019 conference, held in London on 6 November, attracted a diverse audience, from clinicians to librarians. It was clear that the open data movement is not just for scientists or academics, and it was pleasing to see many early career researchers enthusiastically getting involved.
I attended the conference as a winner of the writing competition, in which I explored the benefits and pitfalls of unrestricted data use. The presenters spoke extensively on what still stands between us and an open data world.
Why share? Data reuse unlocks new insights.
There are many brilliant databases and well-documented research data already shared online. Exploratory analysis aided by machine learning pattern recognition algorithms can yield interesting correlations that inspire more in-depth studies. Without the need to invest in brand new data collection, data reuse studies are often faster and cheaper. In fact, many participants of data reuse challenges produce great insights equipped with nothing more than a laptop. To see the benefits, we need look no further than the 2019 Wellcome Data Re-use Prizes. The winning entry for the malaria category developed a workflow that identified previously overlooked variables, such as family education or recent health factors, that contribute to malaria prevalence. These insights can be very helpful in developing malaria prevention programmes. Projects like this show that improving data to increase data reuse is a worthy cause.
There are, however, barriers to greater data reuse.
Barrier 1: Non-standard practices
Most labs have developed their own individual, non-standardised data collection and storage practices. The result is poorly named files (saved in a hurry on shared drives) and a whirlwind of different file types. In his keynote on how to prepare ourselves for open science, Tomas Knapen quipped, “Have mercy on your future self!” Disorganised data makes sharing a pain in the neck, but this can be avoided through version control and by standardising data storage and recording practices across the field.
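Knapen's advice can be made concrete with even a very small amount of tooling. As an illustrative sketch only (the naming scheme below is hypothetical, not one proposed at the conference), a helper that generates standardised, sortable, shell-safe file names removes the temptation to save files in a hurry under arbitrary names:

```python
from datetime import date
import re

def standard_filename(project: str, sample: str, measurement: str,
                      version: int, ext: str = "csv") -> str:
    """Build a standardised, sortable file name of the form
    <project>_<sample>_<measurement>_<YYYY-MM-DD>_v<NN>.<ext>.
    All parts are lower-cased and runs of non-alphanumeric
    characters become single hyphens, keeping names shell-safe."""
    def clean(s: str) -> str:
        return re.sub(r"[^a-z0-9]+", "-", s.lower()).strip("-")
    today = date.today().isoformat()
    return (f"{clean(project)}_{clean(sample)}_{clean(measurement)}"
            f"_{today}_v{version:02d}.{ext}")

# e.g. "malaria-survey_site-a_pcr-results_<today's date>_v03.csv"
name = standard_filename("Malaria Survey", "Site A", "PCR results", 3)
```

Because the date and version number are part of the name, files sort chronologically on a shared drive, and your "future self" can tell at a glance which version is the latest.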
There is precedent for bringing previously chaotic systems together into a standardised format in science. In 2019, the International Union of Pure and Applied Chemistry (IUPAC) is celebrating its 100th birthday with the theme “A Common Language for Chemistry”. The standardisation of the IUPAC nomenclature has revolutionised communication in chemistry research. It is not difficult to imagine that similar field-wide standardisations of data practices would bring about a paradigm shift as well.
Barrier 2: Incomplete metadata
Metadata – data about data, such as how it was collected and with what equipment – is often as important as the data itself. Without clear information on experimental conditions, data collection methods and so on, reproducibility suffers. Sabina Leonelli raised the point at the SciData conference that even when data is shared, a lack of accompanying metadata deters reuse: scientists cannot reuse such data confidently, and wrong conclusions may be drawn. Collecting, curating and cleaning up metadata can be inconvenient and time-consuming, especially when data outputs are collected from a variety of laboratory equipment in a myriad of file types.
Part of the solution might be technical: perhaps biotechnology manufacturers can play a part in making FAIR data collection easy by integrating data management into the software that comes with their hardware.
Leonelli also spoke about metabolomics, a field known for being data-intensive and for results that are extremely sensitive to even the smallest change in experimental parameters. The Metabolomics Standards Initiative provides a list of metadata to be reported alongside metabolomics data. This is a good step towards a field-wide consensus on data reporting, which would go a long way towards increasing the interoperability and reusability of data.
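One lightweight way to keep metadata attached to data is a machine-readable "sidecar" file stored next to each dataset. The sketch below is only illustrative: the field names are invented for this example (a real metabolomics submission would follow the Metabolomics Standards Initiative's actual required fields):

```python
import json
from pathlib import Path

def write_sidecar(data_path: str, metadata: dict) -> Path:
    """Write metadata as a JSON 'sidecar' file next to the data
    file, so the two travel together when the dataset is shared."""
    sidecar = Path(data_path).with_suffix(".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return sidecar

# Field names below are illustrative, not a formal standard.
meta = {
    "instrument": "mass spectrometer, model unspecified",
    "collected_on": "2019-11-06",
    "operator": "A. Researcher",
    "units": {"concentration": "umol/L"},
}
```

Because the sidecar shares its base name with the data file (`run.csv` gets `run.meta.json`), anyone who downloads the dataset gets the experimental context alongside it, with no extra effort from the reuser.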
Barrier 3: Data security
Some researchers, myself included, shy away from sharing data out of fear that it is not secure. Sensitive data may concern the privacy of individuals (with clinical trial and medical record data for example) or profitable intellectual property (like pharmaceutical research in the process of being patented).
At the conference, Yves-Alexandre de Montjoye explained that anonymisation practices are falling behind de-anonymisation capabilities. He found that with 15 demographic attributes, a person can be correctly re-identified in any anonymised dataset 99.98% of the time. It is not hard to imagine that highly sensitive anonymised medical data could be matched to patients relatively easily.
I spoke to several researchers who have worked with clinical trials, who all told me that data is almost never shared due to these privacy concerns. In the race against time to develop therapies for debilitating diseases such as cancer, the biomedical community would save vast amounts of time and money if clinical data could be shared.
De Montjoye proposed a switch to query-based systems, where on-the-fly anonymisation protects data security. Users query databases for what they require and receive only the relevant information, which makes de-anonymisation significantly more challenging because attackers are left with incomplete information. Technologies such as blockchain allow secure, decentralised data storage and accurate sharing of data. With further development, strong data security could enable researchers to confidently share even sensitive medical data.
Barrier 4: The methodology black box
Knapen identified a gap in the open science pipeline: the methodology of data analysis and experimentation is often not well documented. Data steward Yasemin Turkyilmaz-van der Velden later suggested protocols.io, an open access platform for sharing protocols and workflows, as a solution.
A discussion ensued on Twitter (check out #SciData19 for more!) regarding the reliability of protocols.io, which, like any community repository, depends on the integrity of its users. Ultimately, we need a cultural shift towards scientists who are data-literate and share the same passion for open science.
Researchers need more training for open science, regardless of career stage
The barriers to an open data world are slowly being chipped away. Greater awareness of and focus on data sharing have led to improved infrastructure. Still, work needs to be done to equip researchers with these tools. The research ecosystem needs a cultural change towards open science before the vision of an open data world can become a reality.
Education and training were a recurring theme throughout the day. There is a wealth of resources on open science, from articles to full online courses. FOSTER Plus, an EU-funded project, provides readings on open science, data management and data science skills, as well as moderated online courses on many topics. These resources are aimed at young scientists, academic staff and policymakers – all valuable stakeholders in the scientific research ecosystem. Knapen also explained how he used The Carpentries to train new lab members in open science.
The TU Delft Data Stewardship Project has taken open science training one step further. Every TU Delft faculty has a dedicated Data Steward to promote good and open data practices. With expert knowledge of their field as well as advanced data management skills, these stewards are well-positioned to provide practical advice to their departments.
Many speakers reiterated the call for early career researchers to take the lead in learning more and instilling open science practices in their workplace. However, I believe we need to extend this call to all researchers regardless of career stage, and also to policymakers, teachers and support staff. It is principal investigators, rather than early career researchers, who shape a lab’s data management culture. Open science needs a community-level effort. Scientists need the support of non-research colleagues and their institution’s management to fully embrace open science.
The open data discussion has been heavily focused on the sciences. In his keynote address, Mikko Tolonen gave us a perspective on the state of open data in the humanities. Many researchers do not realise that the humanities have a computational aspect, or that their artefacts and records count as data. The majority of data in the humanities, according to Tolonen, requires digitisation and clean-up. The open data movement extends across disciplines. We need more advocates for data – even in the humanities!
It is heartening to see how data sharing has progressed. We still have a long road ahead of us to achieve open science. Yet, the enthusiasm and conversation that took place at SciData19 fills me with hope that we will eventually get there. Let’s continue to work together to achieve Better Science through Better Data!