There are many benefits associated with sharing your research data, from increased scientific integrity to an associated increase in citations to your research article. These benefits remain true when sharing large datasets, but the very size of the data can add complexities, including high storage costs and a lack of the infrastructure needed to deposit and preserve the data.
Regardless of size, many funding bodies require that research data be preserved beyond the end of the study. Preparing a Data Management Plan is strongly advised to establish good data practices. This is especially valuable when you are likely to produce a large amount of data, as it will enable you to estimate, manage, and where possible reduce costs throughout the entire data lifecycle.
Where can I share my large data?
The size of a dataset, and whether it can be considered large, is dependent on your field of research. Disciplines that typically produce sizeable amounts of a specific type of data, which may be technically challenging to work with, may already have a community repository. Where available, these should be used, as they offer advantages over generalist repositories: they ensure that consistent and comprehensive metadata are collected, provide interoperability with other deposited data, and facilitate machine interaction with the data. You can use the re3data and FAIRsharing registries to identify discipline-specific repositories.
If you are submitting related work to a journal, the publisher will often mandate deposition of certain data types to community repositories. A list of mandated data types at Springer Nature can be found here.
If your data aren’t suited to a discipline-specific repository, first check with your institution. Many institutions have created their own repositories or established partnerships with generalist repositories specifically to help accommodate your data storage needs in a more cost-efficient manner.
One of the following generalist repositories might otherwise be a good fit for your data:
- Dryad accepts up to 300GB per data publication through their web interface but can accept larger submissions on request.
- Figshare+ was launched by figshare.com to allow for larger file uploads and increased storage space. Figshare+ caters to researchers with datasets from 20GB to 5TB or more.
- Harvard Dataverse allows individual files of up to 2.5GB and datasets of up to 1TB. Authors are advised to contact Harvard Dataverse if their datasets are approaching 1TB.
- Open Science Framework allows for storage of up to 5GB for private projects and up to 50GB for public projects and components.
- Science Data Bank does not apply specific storage size limits but asks that authors contact them if a data file is larger than 500GB.
- Zenodo currently accepts datasets up to 50GB (authors can have multiple datasets); not wanting to turn away larger use cases, they recommend that authors contact them for help uploading larger files.
It is useful to note that some of the above repositories allow upload of data files through an FTP client such as FileZilla or FlashFXP. This is a good method for large data files and/or unstable network connections. The repository will provide you with instructions and a username and password to complete your data submission.
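For those who prefer scripting to a graphical FTP client, a submission can also be sketched with Python's standard `ftplib`. The host name, credentials, and file names below are placeholders; your repository will supply the real ones, and support for resuming interrupted transfers depends on the server.

```python
import ftplib


def upload_large_file(host: str, user: str, password: str,
                      local_path: str, remote_name: str,
                      chunk_size: int = 1024 * 1024) -> None:
    """Upload a large file over FTP, resuming a partial transfer if possible.

    host, user, and password are placeholders: the repository will
    provide real credentials with your submission instructions.
    """
    with ftplib.FTP(host) as ftp:
        ftp.login(user, password)
        # If a partial upload already exists remotely, resume after it.
        try:
            offset = ftp.size(remote_name) or 0
        except ftplib.error_perm:
            offset = 0  # remote file does not exist yet
        with open(local_path, "rb") as fh:
            fh.seek(offset)
            # REST tells the server to append from the given offset.
            ftp.storbinary(f"STOR {remote_name}", fh,
                           blocksize=chunk_size, rest=offset or None)
```

Uploading in binary mode with an explicit resume offset means a dropped connection only costs the unfinished portion of the file, not the whole transfer.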
Questions to consider before depositing your large dataset
- How should I organise my data?
Are users likely to require the entire dataset? If so, the ability to retrieve the entire record via a single download (a .zip file) will be most useful. If instead the record consists of disparate or clearly distinct datasets, these may be more easily consumed as individual files. For example, Navarro-Racines et al. deposited their ~7TB of bias-corrected climate change projections as a series of tiles (separate downloads), allowing for region-specific data retrieval.
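To illustrate the single-download option, here is a minimal sketch, using only Python's standard library, that bundles a dataset directory into one .zip archive. The file and directory names are invented for the demo.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_dataset(src_dir: Path, zip_path: Path) -> None:
    """Bundle every file under src_dir into a single .zip download."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the dataset root, so the
                # archive unpacks cleanly for downloaders.
                zf.write(path, path.relative_to(src_dir))


# Demo with a throwaway "dataset" of two regional tiles.
root = Path(tempfile.mkdtemp())
for name in ("tile_north.csv", "tile_south.csv"):
    (root / name).write_text("lat,lon,value\n")

# Write the archive outside the source tree so it doesn't zip itself.
archive = Path(tempfile.mkdtemp()) / "dataset.zip"
zip_dataset(root, archive)

with zipfile.ZipFile(archive) as zf:
    print(zf.namelist())  # -> ['tile_north.csv', 'tile_south.csv']
```

The alternative, per-tile layout is simply the `root` directory uploaded as-is, one file per download, which suits users who only need a region or a variable.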
- Which data should I share?
In some cases, it is simply not feasible to share very large datasets, or to expect users to be able to download and (re)use them. It may be worth considering whether there is an appropriate alternative to sharing the full dataset.
For example, Vergopolan et al. had 22TB of a maximally compressed, high-resolution, satellite-based surface soil moisture dataset. They opted to share a more practicable 33.8GB of raw data alongside the Python code and instructions for post-processing it into geographic coordinates. Similarly, Brown et al. archived on Zenodo the trained model they had developed to generate near-real-time land-cover predictions.
- Have I added enough metadata?
Having gone to considerable effort to collect and store your large dataset, it is important to ensure the data you share are FAIR (Findable, Accessible, Interoperable, Reusable). The first step towards reuse is the ability to find the data and this initial discovery is helped by the existence of rich metadata.
Be sure to provide a thorough description of the observations and the format in which they are being shared; subsequent users may well be discouraged from downloading a large data file if it is unclear exactly what they are looking at. Similarly, including a separate README file in your data upload will be helpful if you are sharing a large number of different files, or a file containing many variables.
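As one way to make a large upload self-describing, the sketch below drafts a README inventory listing each file with its size and SHA-256 checksum; a checksum lets downloaders of a multi-gigabyte file verify that their copy is complete and uncorrupted. The file names and description here are invented for the demo.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1MB chunks, so large files don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_readme(data_dir: Path, description: str) -> str:
    """Draft a README: the dataset description, then one line per file."""
    lines = [description, "", "Files:"]
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            lines.append(f"  {path.relative_to(data_dir)}  "
                         f"{path.stat().st_size} bytes  "
                         f"sha256:{sha256_of(path)}")
    return "\n".join(lines)


# Demo with a throwaway file.
root = Path(tempfile.mkdtemp())
(root / "observations.csv").write_bytes(b"hello")
readme = build_readme(root, "Surface observations, CSV format.")
print(readme)
```

A generated inventory like this is a starting point, not a substitute for describing the variables, units, and collection methods in your own words.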