Community repositories: the best way to share the data underlying your research
Options for sharing the data that underlie your research article include: sharing upon request, a personal website, a generalist repository, an institutional repository, or a community repository. Of these, community repositories are optimal. This post summarises why.
Sharing the data that underpin your research article benefits you (a reported citation advantage of around 25%), other researchers and society. Public repositories provide stable, long-term solutions to research data sharing, making them preferable to sharing privately or on personal websites. Repositories come in two flavours: generalist and community/domain-specific. Generalist repositories store a wide range of data formats to support data from many research areas, and are useful when a more targeted repository is not available. However, where possible, it is best to deposit your data in a community repository.
Each community repository is built around a specific data type, and this specificity yields several advantages, described later in this post. Here are a few examples of community repositories:
- the PRIDE Archive (proteomics data)
- OpenTopography (Earth science-oriented, topography data)
- Materials Cloud (computational materials science data)
- The Qualitative Data Repository (qualitative social sciences data)
The advantages of using community repositories
Advantage 1: Community familiarity. Experienced researchers of a given community know which repositories frequently hold the types of data generated and analysed in their field. If they wish to find data, a community repository is the first place to check. Datasets in these repositories also frequently link to the research article(s) that have utilised the data (here’s an example on the PANGAEA repository). Most frequently, the data link to the first article that the data underlie, but some repositories allow for subsequent articles that made use of the data to also be referenced.
It is also worth bearing in mind that experienced researchers expect to see the data that underlie a paper deposited in an appropriate repository. Seeing data for which a community repository exists shared in a generalist repository has two negative effects: i) inconvenience, as generalist repositories do not ensure the data are deposited in a standardised, community-approved form; ii) scepticism, as it suggests either that the authors were not familiar enough with the field to know where the data should be deposited, or that the data are not of sufficient quality to appear in a community repository.
Advantage 2: Metadata. As community repositories ensure comprehensive metadata is associated with the data they hold, it is easy to search one of these repositories for specific types of data. As an example, in NCBI's Sequence Read Archive (DNA & RNA data) you can search by specialist fields such as organism taxonomic classification, read length or sequencing platform. As another example, in PANGAEA (earth & environmental science data) you can search by geographic location, topic, funder and project.
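A fielded search of this kind can also be expressed programmatically. The sketch below builds a search URL for the `esearch` endpoint of NCBI's E-utilities against the `sra` database. The helper name `build_sra_search_url` is hypothetical, and the field tags `[Organism]` and `[Platform]` are assumed to match those offered by the SRA search interface, so check them against the current NCBI documentation before relying on them.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_search_url(organism, platform, max_results=20):
    """Build an esearch URL that filters SRA records by two metadata fields.

    Hypothetical helper: the [Organism] and [Platform] field tags are
    assumptions and should be verified against NCBI's documentation.
    """
    term = f'"{organism}"[Organism] AND "{platform}"[Platform]'
    params = {
        "db": "sra",          # search the Sequence Read Archive
        "term": term,         # fielded query built from the metadata values
        "retmode": "json",    # ask for a JSON response
        "retmax": max_results,
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Only constructs and prints the URL; no network request is made here.
print(build_sra_search_url("Homo sapiens", "illumina"))
```

The same pattern would not work against a generalist repository in any reliable way, because there is no guarantee that depositors recorded organism or sequencing platform in a consistent, searchable field.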
Advantage 3: Interoperability. Data in the same repository tend to be interoperable, as the repositories require data to be in the correct format before they can be deposited. Therefore, users know what to expect and will not need to dedicate time to deciphering and standardising data formats.
Advantage 4: Machine-readability. The above features facilitate machine interaction with data: software and algorithms can be set up to find, access and analyse large amounts of data. This can greatly reduce the time needed for research projects, and enable projects of far greater scope than would otherwise be possible. Much importance has been placed upon this machine-accessibility of data, and it is a very active area of development.
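As a rough illustration of machine interaction, the sketch below parses a JSON search response of the kind returned by a repository search API and extracts the matching record identifiers. The function name `extract_record_ids` is hypothetical, the response layout (`esearchresult`, `count`, `idlist`) is assumed to follow the JSON output of NCBI's E-utilities `esearch`, and the sample payload is an invented placeholder rather than real repository data.

```python
import json


def extract_record_ids(response_text):
    """Return (total_hits, record_ids) from an esearch-style JSON response.

    Assumes the layout {"esearchresult": {"count": "...", "idlist": [...]}};
    missing keys fall back to zero hits and an empty list.
    """
    payload = json.loads(response_text)
    result = payload.get("esearchresult", {})
    total_hits = int(result.get("count", 0))
    record_ids = list(result.get("idlist", []))
    return total_hits, record_ids


# Invented placeholder response, used only to show the expected shape.
sample = '{"esearchresult": {"count": "3", "idlist": ["111", "222", "333"]}}'
total, ids = extract_record_ids(sample)
print(total, ids)
```

Because every record in a community repository follows the same structure, a script like this can be pointed at thousands of datasets without per-dataset adjustments, which is where the gain in scope comes from.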
Advantage 5: Time-saving. You do not need to worry about finding, re-familiarising yourself with and sharing your data if and when they are requested, as community repositories offer a straightforward method by which your data can be located, contextualised and accessed in future. Yes, preparing the data and metadata for sharing takes additional time up front, but it can save a lot of time later. In fact, the better your data and the more popular your research, the more requests you would otherwise have to handle, and so the more time can be saved.
How to find an appropriate community repository
Registries of data repositories exist to help researchers search for repositories by data type, research field, data access restrictions, metadata standards and many other criteria. The two most widely used are https://www.re3data.org/ and https://fairsharing.org/.
Publishers frequently provide guidance. Here is Springer Nature’s guidance on finding a repository for your data, and here is a recent blog describing how we have adapted our approach to better reflect community standards.
What to do if there is no community repository for your data
Generalist repositories and institutional repositories host a wide range of data types. When no community repository is available for the type of data you have generated, you should share your data via one of these. Due to their more general nature, they do not have such strict requirements for correct formats and complementary metadata. Institutional repositories are slightly preferable, as they may have more metadata requirements (and specialist curators who will help you create metadata).
Sharing of underlying research data is becoming increasingly common, but for many research areas we are still in the early days. Funders and publishers are increasingly mandating data sharing, and effort and resources are being dedicated to expanding and improving the infrastructure to ease the pain points for researchers. Researchers can enjoy the above-mentioned benefits while contributing to positive change by ensuring their data are shared in the appropriate way.