Opening up to advance discovery

Themes, data sources and use cases for our upcoming hack day.


With less than two weeks until the next Springer Nature hack day, here are some potential themes, data sources and use cases that might prompt ideas and help teams form ahead of the day.

The theme of the event is advancing discovery with published research data (and, of course, scholarly publications). This is timely, with several cross-community initiatives currently focusing on linking research data and publications (Scholix), data citation (Force11) and research data discovery and reuse (DataCite event data). And, perhaps most importantly, some scholarly research depends on effective text and data mining to solve complex problems.

There will be numerous data sources available to participants including large amounts of Springer Nature’s full text content, much of which ordinarily needs a subscription to access.

Combined with already open Springer Nature content - from the SciGraph linked open data platform, open access journal articles, and article reference lists made available via CrossRef - this is a large corpus of publication related data. But research communication is widely distributed across numerous platforms, repositories and publishers and we anticipate external resources being utilised on the day, including:

With so many possible combinations and only a six-hour sprint, we may need to focus on a small number of ideas and use cases. Here are a few that came out of event planning within the Springer Nature team:

  • A journal editor who wants to encourage data sharing by recommending specific data repositories to authors submitting to their journal might benefit from a data repository identification and selection tool. Lists of research data repositories, research disciplines from the latest SN SciGraph along with publication data might help with this tool.

  • Could data from publications, subject ontologies and data repository APIs help a researcher or data librarian, who wants to make research data management more efficient, automatically generate research data metadata? And, could machine learning techniques be applied to particular types of data file (e.g. images) to aid data curation, such as classification of image files?

  • Publishers, researchers, editors, librarians and infrastructure providers can all benefit (more readership, more citations, improved reader experience) from improved data-article linking. A simple approach more publishers could take is to use the DataCite API (DataCite mints DOIs for datasets) to identify datasets that link to their article DOIs. But research data identification is complex: thousands of publicly available datasets have no DOI and are identified by other types of identifier instead. We’re also preparing a curated list of repositories and identifier types, such as accession codes, that could accelerate the creation of discipline-specific linking solutions.

  • Funding agencies and institutions are increasingly interested in monitoring compliance with research data policies, but the outputs of particular grants - which can potentially be tracked through grant identifiers - can be hard to find consistently and comprehensively. One approach is to take advantage of a requirement shared by many journals, that authors provide a data availability statement in their article (like an acknowledgements section, but for data), and to mine and classify these statements from publisher full text. This could improve compliance monitoring and reporting, and help us understand the impact of journal data policies on researcher behaviour.
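To make the last two ideas a little more concrete, here is a minimal sketch of how data availability statements might be classified with simple rules. Everything in it is an assumption for illustration: the category names, the keyword rules and the function `classify_statement` are hypothetical, the DOI pattern is a rough approximation, and GEO series codes (GSE followed by digits) stand in for the many accession code formats a real tool would need to cover. A hack day team might well replace the rules with a trained classifier.

```python
import re

# Rough DOI pattern; an approximation, not a full validator.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")

# GEO series accession codes (e.g. GSE12345), used here as one example
# of a non-DOI dataset identifier.
GEO_PATTERN = re.compile(r"\bGSE\d+\b")

# Hypothetical categories, checked in order; the first matching rule wins.
RULES = [
    ("repository_doi", lambda s: DOI_PATTERN.search(s) is not None),
    ("repository_accession", lambda s: GEO_PATTERN.search(s) is not None),
    ("on_request", lambda s: "request" in s.lower()),
    ("in_article", lambda s: "supplementary" in s.lower()
        or "within the article" in s.lower()),
]


def classify_statement(statement: str) -> str:
    """Return the first matching category label, or 'unclassified'."""
    for label, matches in RULES:
        if matches(statement):
            return label
    return "unclassified"


if __name__ == "__main__":
    examples = [
        "The datasets are available in Dryad, https://doi.org/10.5061/dryad.abc123.",
        "Sequencing data have been deposited in GEO under accession code GSE12345.",
        "Data are available from the corresponding author on reasonable request.",
        "All data are included in the supplementary information files.",
    ]
    for text in examples:
        print(classify_statement(text), "|", text)
```

Counting the resulting labels per journal or per year would give a first, rough view of how data sharing practice changes after a journal introduces a data policy.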

The scope of the day also includes data visualisation, and answering research questions that benefit from combining published research with other knowledge structures, supporting everything from monitoring the spread of infectious diseases to creation of evolutionary trees - and much else besides. We also hope to learn how we can support more of this valuable research through our content offerings.

These are just some initial, largely untested suggestions. We look forward to welcoming friends from organisations such as the Alan Turing Institute, the University of Oxford, the Systems Biology Institute and DataCite to London, to surprise and inspire attendees with their own ideas.

The last few tickets are available here.

Iain Hrynaszkiewicz

Publisher, Open Research, PLOS

Iain Hrynaszkiewicz is Publisher, Open Research at Public Library of Science (PLOS), where he leads the conceptualisation and development of new products and services that add value to the PLOS portfolio by supporting and enabling open science. Iain was previously Head of Data Publishing at Springer Nature where he developed and implemented research data policies and services, and was publisher of Nature Research Group’s Scientific Data journal. He has also been Outreach Director at Faculty of 1000 (F1000), and spent seven years at the first commercial open access publisher BioMed Central (BMC) in a variety of editorial, publishing and product/policy development roles. Iain is part of several research/publishing community projects related to data sharing and reproducible research. He founded and is co-chair of an Interest Group in the Research Data Alliance (RDA) that is setting standards for journal research data policy globally, and founder of the annual early-career researcher conference, Better Science through Better Data. He has published numerous papers related to data sharing, open access, and the role of publishers in reproducible research - one of which has been cited nearly 200 times.