With fewer than two weeks to go until the next Springer Nature hack day, here are some potential themes, data sources and use cases that might prompt some ideas and help with forming teams ahead of the day.
The theme of the event is advancing discovery with published research data (and, of course, scholarly publications). This is timely, with several cross-community initiatives currently focusing on linking research data and publications (Scholix), data citation (Force11) and research data discovery and reuse (DataCite event data). And, perhaps most importantly, some scholarly research depends on effective text and data mining to solve complex problems.
Numerous data sources will be available to participants, including large amounts of Springer Nature’s full text content, much of which ordinarily needs a subscription to access.
Combined with already open Springer Nature content - from the SciGraph linked open data platform, open access journal articles, and article reference lists made available via CrossRef - this is a large corpus of publication-related data. But research communication is widely distributed across numerous platforms, repositories and publishers, and we anticipate external resources being used on the day, including:
Europe PubMed Central
Registry of research data repositories (Re3data.org)
Data repository APIs (such as figshare’s)
With so many possible combinations for a six-hour sprint, we may need to focus on a small number of ideas and use cases. Here are a few that came out of our event planning in the Springer Nature team:
A journal editor who wants to encourage data sharing by recommending specific data repositories to authors submitting to their journal might benefit from a data repository identification and selection tool. Lists of research data repositories, research disciplines from the latest SN SciGraph, and publication data might all help with building this tool.
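As a starting point for such a tool, here is a minimal sketch that ranks repositories by how many of a journal's subject areas they cover. Everything in it is an assumption for illustration: the function name `rank_repositories`, the record layout (a `name` plus a list of `subjects`) and the idea of scoring by simple subject overlap. Real repository listings and discipline vocabularies would need mapping into this shape first.

```python
def rank_repositories(journal_subjects, repositories, top_n=3):
    """Rank repositories by the number of journal subjects they cover.

    `repositories` is a list of dicts with hypothetical keys "name" and
    "subjects"; this is an illustrative layout, not a real registry schema.
    """
    wanted = {s.lower() for s in journal_subjects}
    scored = []
    for repo in repositories:
        overlap = wanted & {s.lower() for s in repo["subjects"]}
        if overlap:
            scored.append((len(overlap), repo["name"], sorted(overlap)))
    # Largest overlap first; ties broken alphabetically by repository name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, matched) for _, name, matched in scored[:top_n]]
```

Matching on exact subject labels is deliberately naive. A team on the day would probably want to match against a shared subject ontology instead.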
Could data from publications, subject ontologies and data repository APIs help a researcher or data librarian who wants to make research data management more efficient to generate research data metadata automatically? And could machine learning techniques be applied to particular types of data file (e.g. images) to aid data curation, for example by classifying image files?
Publishers, researchers, editors, librarians and infrastructure providers can all benefit (more readership, more citations, improved reader experience) from improved data-article linking. A simple approach more publishers could take is using the DataCite API - DataCite mints DOIs for datasets - to identify datasets that link to their article DOIs. But research data identification is complex, with thousands of publicly available datasets carrying other types of identifier rather than DOIs. We’re also preparing a curated list of repositories and identifier types, such as accession codes, that could accelerate creation of discipline specific linking solutions.
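To make the DataCite approach concrete, here is a rough sketch of looking up datasets whose metadata points at a given article DOI. It assumes the public DataCite REST API endpoint `https://api.datacite.org/dois`, its `query` and `resource-type-id` parameters, and a JSON response in which each record lists `relatedIdentifiers` under `attributes`. Please check these against the current DataCite documentation before relying on them. The function names are our own, and the DOIs in any test data would be placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DATACITE_DOIS = "https://api.datacite.org/dois"  # assumed endpoint


def build_query_url(article_doi, page_size=25):
    """Build a search URL for datasets that reference `article_doi`."""
    params = {
        # Quote the DOI so it is searched as a single phrase.
        "query": f'relatedIdentifiers.relatedIdentifier:"{article_doi}"',
        "resource-type-id": "dataset",
        "page[size]": page_size,
    }
    return f"{DATACITE_DOIS}?{urlencode(params)}"


def extract_links(response, article_doi):
    """Return (dataset DOI, relation type) pairs that point at the article."""
    links = []
    for record in response.get("data", []):
        attrs = record.get("attributes", {})
        for rel in attrs.get("relatedIdentifiers") or []:
            # DOIs are case-insensitive, so compare in lower case.
            if (rel.get("relatedIdentifier") or "").lower() == article_doi.lower():
                links.append((attrs.get("doi"), rel.get("relationType")))
    return links


def find_linked_datasets(article_doi):
    """Fetch one page of results; needs network access."""
    with urlopen(build_query_url(article_doi)) as resp:
        return extract_links(json.load(resp), article_doi)
```

Keeping the parsing in `extract_links` separate from the network call means the logic can be tried out on saved responses. Note that this only finds datasets that have DataCite DOIs, so it would not help with accession codes and other identifier types.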
Funding agencies and institutions are increasingly interested in research data policy compliance monitoring, but the outputs of particular grants - which can potentially be tracked through grant identifiers - can be hard to find consistently and comprehensively. One could leverage a requirement shared by many journals - that authors provide data availability statements in their articles, like an Acknowledgement for data - by mining and classifying these statements from publisher full text. This could improve compliance monitoring and reporting, and help us understand the impact of journal data policies on researcher behaviour.
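A first pass at classifying data availability statements could be as simple as a handful of keyword rules. The sketch below is only that: the category labels and regular expressions are our own illustrative assumptions rather than a validated taxonomy, and the rules are checked in order, so a statement gets the first category it matches.

```python
import re
from collections import Counter

# Ordered (label, pattern) rules; labels and patterns are illustrative only.
RULES = [
    ("not_applicable", re.compile(
        r"not applicable|no (?:new )?data(?:sets)? were (?:generated|created)", re.I)),
    ("on_request", re.compile(r"\b(?:up)?on (?:reasonable )?request", re.I)),
    ("in_repository", re.compile(
        r"repositor(?:y|ies)|deposited|accession|doi\.org|figshare|dryad|zenodo", re.I)),
    ("in_article", re.compile(
        r"supplementary|within the (?:article|paper|manuscript)"
        r"|included in this (?:published )?article", re.I)),
]


def classify_statement(text):
    """Return the label of the first rule that matches, else 'unclassified'."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "unclassified"


def summarise(statements):
    """Count how many statements fall into each category."""
    return Counter(classify_statement(s) for s in statements)
```

Rules like these are easy to inspect and extend during a short sprint, and the "unclassified" bucket shows where they fall short. With enough labelled statements, a trained text classifier might be a natural next step.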
The scope of the day also includes data visualisation, and answering research questions that benefit from combining published research with other knowledge structures, supporting everything from monitoring the spread of infectious diseases to creating evolutionary trees - and much else besides. We also hope to learn how we can support more of this valuable research through our content offerings.
These are just some initial, largely untested suggestions. We look forward to welcoming friends from organisations such as the Alan Turing Institute, the University of Oxford, the Systems Biology Institute and DataCite to London, to surprise and inspire attendees with their own ideas.
The last few tickets are available here.