A year out of pilot for Research Data Support

One year on from the full launch of Springer Nature’s data curation and sharing service Research Data Support, we look at some of the data and metadata made openly accessible, what we’ve learned, and how the service has evolved.

Apr 26, 2019

Toward the end of 2017 I wrote about the wide variety of research data published during the pilot phase of Springer Nature’s Research Data Support service. Since its full launch in March 2018, the service has continued in its mission of making it easier for researchers to share their data files in a repository with comprehensive metadata, and has expanded its scope in response to researchers’ and institutions’ needs. Research Data Support is now available to researchers:

  • with a manuscript under submission to any Springer Nature journal
  • with a published article at any peer-reviewed journal from any publisher
  • whose data cannot be shared openly

In addition, Springer Nature has partnered with institutions and funding agencies to cover the costs of using the service for certain researchers. These include:

Types of data published

Since its launch, the range of publications making Research Data Support available to authors during submission has expanded from Biological Sciences to a wider multidisciplinary base, notably Health Science journals and Computer Science conference proceedings. This includes integrations with the world’s largest journal, Scientific Reports, and with Nature for Earth, Environmental and Space science authors.

To date 52% of accepted data submissions are Biological Science-related, while Health Science makes up 45% (note that disciplines are not mutually exclusive and there is significant overlap between these two). Computer Science (19%) and Earth and Environmental Science (5%) are the next most common disciplines. More specifically, 19% of accepted submissions are related to Cancer and/or Oncology, 15% to Genomics or Genetics, 8% to Health care and 6% to Ecology.


The fact that we continue to see submissions from researchers in Biological and Health Sciences suggests that these researchers have multiple data outputs supporting their publications that can be made more findable and accessible. We can infer this as they are more likely to have encountered repositories for specific data types (such as genomics and molecular structures), and to have sensitive data (such as identifiers of human research participants), which may be out of scope for Research Data Support, yet they still make up a large proportion of those using the service.

Research Data Support also adopts a wide definition of ‘data’. While over half of published datasets include the more familiar tabular type (i.e. spreadsheets and text files), 21% have included image data, 19% software and 9% specialist types such as liquid chromatography or phylogenetic tree formats.

11% are also ‘metadata only’ records, that either summarise all data in a variety of other locations or describe files that cannot be made publicly accessible, for legitimate reasons such as to protect research participant privacy. These metadata records greatly improve on simple statements in journal articles such as ‘data are available on request’. They outline in detail what data are available, from where and under what terms. See an example of a metadata-only output of Research Data Support in the Springer Nature figshare repository here.

Impact of curated data

Research Data Support makes data more findable and accessible to others and we can observe this from the levels of interest and interaction with the curated outputs in the Springer Nature figshare repository.

A noteworthy example of this interaction is a meticulously assembled and arranged dataset consisting of over 20,000 observations across over 4,500 species of jawed vertebrates (available here). Since its publication in August 2018 the dataset has been viewed over 2,750 times, downloaded 137 times, cited once and linked in 36 social media posts.

The related article published in Nature Ecology & Evolution investigates the allometric relationship between brain and body size in vertebrates. The lead author describes it as ‘the most extensive brain- and body-mass dataset to date’.

As well as creating an enhanced data record, Research Data Support also recommends improvements to the content of the related research publication. Mainly this involves improved linking from article to dataset, but in 27% of cases it has also led to further edits, for example correcting an error or providing more detailed information about the data.

The value of curation

Research Data Support is based around carefully-constructed curation standards and processes. Before the service launch we objectively assessed the value that this would add through a comparison of curated versus non-curated datasets. A paper describing this assessment in detail, authored by members of the Springer Nature Research Data team, is in press at the International Journal of Digital Curation (available here as a preprint in bioRxiv: http://dx.doi.org/10.1101/530691).

Springer Nature editors, many of them ex-researchers, participated in a single-blind assessment of curated and non-curated datasets to rate the quality of the metadata. The mean scores for quality and completeness of metadata were higher for the curated (or ‘edited’ in the graph below) datasets. The curated datasets were also more clearly licensed, more citable and better linked to the associated research publication.

Recent developments at Research Data Support

The broad range of disciplines and data types submitted to Research Data Support demonstrates the wide-ranging value of this service to authors who need support in curating and sharing their data. However, over the last year we have also learned that some researchers have not been able to use the service, given its initial focus on datasets that can be shared publicly in a general repository such as figshare.

Taking a single message from these trends, we’ve learned that we can better support researchers by expanding our service to include more data types that are currently out of scope. A key example is sensitive clinical data. Indeed, the most common reason for rejecting submissions has been datasets that have not been, or cannot be, sufficiently anonymised to be published openly.

Research Data Support already works with authors to ensure that sensitive data types, for example identifiers of human participants, are either suitably anonymised before publication or shared elsewhere under controlled access terms. This dataset, for example - which includes information on paediatric ward admissions in a rural Kenyan hospital - was published with advice from Research Data Support. We can identify issues and advise on good practice for sharing data about human research participants, including techniques such as removal or obfuscation of potentially identifying information - enabling data with potential for reuse to be published openly, while minimising risk to participant privacy.
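To make ‘removal or obfuscation of potentially identifying information’ more concrete, here is a minimal Python sketch of the general idea. It is an illustration only, not a description of how Research Data Support processes data: the field names, the example record and the salt are all hypothetical, and steps like these do not on their own guarantee that a dataset is anonymous.

```python
import hashlib

# Hypothetical field names; a real dataset will have its own.
DIRECT_IDENTIFIERS = {"name", "hospital_number"}


def deidentify(record, salt):
    """Return a copy of `record` with direct identifiers removed,
    the original ID replaced by a salted hash, and the admission
    date coarsened to a year."""
    # Removal: drop fields that identify a person directly.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Obfuscation: a salted hash keeps rows for the same participant
    # linkable without exposing the original hospital number.
    raw_id = (salt + str(record["hospital_number"])).encode("utf-8")
    out["participant_id"] = hashlib.sha256(raw_id).hexdigest()[:12]

    # Generalisation: keep only the year of an ISO-formatted date.
    out["admission_year"] = out.pop("admission_date")[:4]
    return out


# Entirely made-up example record.
record = {
    "name": "A. Example",
    "hospital_number": 12345,
    "admission_date": "2016-03-14",
    "age_months": 27,
    "diagnosis": "pneumonia",
}

clean = deidentify(record, salt="keep-this-secret")
print(clean)
```

In practice the choice of which fields to remove, hash or coarsen depends on the study and on the re-identification risk, which is where curator advice comes in.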

However, our pilot with npj Breast Cancer has led us to develop a more active offering for sensitive clinical data: the ‘metadata only’ record described above. This feature addresses a wider problem in transparency and discoverability for data files that cannot be shared openly.

Likewise, authors also seem to value data records that summarise all data related to their research publication, as opposed to a data record that applies only to files that have been shared publicly. The Research Data team has developed new approaches to capturing and creating metadata on behalf of researchers to support these comprehensive records, which describe a specific study rather than a specific dataset. We will continue to refine these approaches and share more on them in the coming months.

Meanwhile, feel free to browse all the curated outputs from the team here.


Graham Smith

Research Data Editor, Springer Nature

At Springer Nature I work to develop and promote data publishing tools, initiatives and policies across the organisation. I have an academic background in geology and geophysics, specifically studying seismics at live volcanoes. I have previously worked in a similar data-focused role at the Natural History Museum, managing data pathways and curation practices for big taxonomic and collection data.
