Most of us have an intuitive idea of what it means for data to be "FAIR." Yet there are still no universally recognized strategies for ensuring that research data abide by the FAIR Guiding Principles in a manner that makes the data truly findable, accessible, interoperable, and reusable. Our approach, described in this recently published paper, is motivated by an observation that many of us in the data-science community find a bit depressing: Investigators profess to want to make their data FAIR (often because their funders mandate it), but they feel overwhelmed by the FAIR Guiding Principles as articulated in the now-famous 2016 publication. There are simply too many principles to consider, and the principles deal with things that most scientists never think about. Caught between this rock and a hard place, investigators often feel they have no choice but to assert that their data are “FAIR”, although their assertion comes more from desperation than from knowledge and skill.
But there are two “meta-principles” that can help dispel the confusion about how to make data FAIR.
First, half of the FAIR Guiding Principles define desirable properties of the repository where the data are placed; these are therefore not something for scientists to worry about. Although the characteristics of the repository are important, they are almost never under an investigator's direct control. Thus, for all practical purposes, scientists can ignore these archive-dependent determinants of FAIRness and leave them to the people who manage data repositories.
Second, the one thing over which investigators do have control is their metadata. The Guiding Principles make clear that datasets become FAIR when scientists annotate their data with “rich” metadata that adhere to the appropriate community standards. This naturally raises the question of how investigators can know whether there even is a community standard to which their metadata should adhere, let alone how they can make sure that they are using such standards correctly.
Investigators can author metadata in a standards-adherent way by filling in a template that the community itself has created to capture the necessary standards—enumerating the things that scientists need to say about their data so that a third party can make sense of (1) the experiment that was conducted and (2) what the data represent. Because the relevant community standards are embedded in the metadata template, filling in the template ensures that the standards-dependent FAIR Guiding Principles are met.
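To make the idea concrete, here is a minimal sketch of how a community template might be represented and checked in software. The field names and controlled terms are invented for illustration; real community templates are far richer and typically draw their terms from published ontologies.

```python
# Hypothetical sketch: a community metadata template as a set of rules.
# Each field says whether it is required and, where a community has
# standardized the vocabulary, which terms are allowed.
TEMPLATE = {
    "organism":   {"required": True,  "allowed": {"Homo sapiens", "Mus musculus"}},
    "assay_type": {"required": True,  "allowed": {"RNA-seq", "ChIP-seq"}},
    "tissue":     {"required": False, "allowed": None},  # free text permitted
}

def check_metadata(metadata: dict) -> list:
    """Return a list of problems; an empty list means the record fits the template."""
    problems = []
    for field, rule in TEMPLATE.items():
        value = metadata.get(field)
        if value is None:
            if rule["required"]:
                problems.append(f"missing required field: {field}")
            continue
        allowed = rule["allowed"]
        if allowed is not None and value not in allowed:
            problems.append(f"{field}: '{value}' is not a recognized term")
    return problems
```

A record such as `{"organism": "Homo sapiens", "assay_type": "RNA-seq"}` passes cleanly, while one that omits `assay_type` or spells the organism in a nonstandard way is flagged. The point of the sketch is that, once the community's expectations are encoded in the template, standards adherence becomes a matter of filling in the blanks rather than interpreting the Guiding Principles from scratch.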
Our work on metadata templates is taking place at a time when many groups of scientists—as well as funders and publishers—are searching for technology that they can simply point at an online repository to indicate whether the datasets are FAIR. There are now dozens of tools that purport to measure data FAIRness, and many more on the way. Unfortunately, inter-rater reliability among these tools is poor, and none of the tools has developed much of a following. The problem is often compounded by software engineers who believe that throwing even more technology at the automation of FAIR evaluation will correct all the limitations of previous approaches. Unfortunately, the whole enterprise of trying to programmatically evaluate datasets for FAIRness is doomed because no computer on its own can ever know (1) what community standards might be appropriate, (2) whether metadata are sufficiently “rich” to enable other investigators to find relevant datasets, (3) whether the data really are interoperable with other datasets, and (4) whether the experimental context is described sufficiently to reuse the data in an appropriate manner. The answers to all these questions are not a matter of computation, but of how well a scientific community has standardized the description of the kinds of experiments that it performs.
Nevertheless, still hoping to obtain a computational solution to the problem, the Research on Research Institute (RoRI) issued a request for proposals in 2020, seeking a better tool. The RoRI RFP was unusual, however, in that it encouraged proposers to suggest alternative approaches if they believed they had a better idea. When my team applied for funding for what is now the FAIRware Workbench (described in our paper), we asserted that the world does not need yet another FAIR evaluator that attempts to address the problem purely through brute-force computation. Evidently, the reviewers agreed with us. (The FAIRware Workbench continues to be developed with support from the U.S. National Library of Medicine and other sources.)
The ultimate solution to the challenge of ensuring that data are FAIR rests with individual scientific communities, which need to take responsibility for the necessary metadata standards. The biologists and the earth scientists have made great strides in this area, but other groups of scientists need to follow their lead. More important, funders need to provide support for the development, promotion, and adoption of these kinds of standards, and for the infrastructure required to make it easy to put these standards to use in the course of routine data stewardship.
In the end, standards development is inherently time-consuming and tedious—and thus expensive—even in the setting of the Metadata for Machines workshops initiated by the GO FAIR Foundation to streamline this process. If the scientific community is serious about the need for FAIR data, then it has to expand its commitment to the work required to standardize the reporting guidelines and ontologies that allow data to comport with the FAIR Guiding Principles in the first place. More important, funders cannot simply demand that data be FAIR; they need to underwrite the requisite standards-development work, and to provide the tools that will make the use of community metadata standards a natural and inevitable component of everyday data management. It will take more than just funding to effect this change. It will take a major shift in the culture of science so that investigators can come to recognize that FAIR datasets are just as important an endpoint of research as are high-impact journal publications—and that putting a dataset online with inadequate metadata is worse than submitting a journal article with typos and missing references.