Minimizing wheel reinvention: creating an open data culture via infrastructure in India
Last month the Global Biodata Coalition and the University of Delhi jointly organized the 1st Indo-GBC virtual seminar, titled "Data Sharing at a Global Level: Evolving perspectives amidst challenges".
The seminar was organized by Saurabh Raghuvanshi (Associate Professor, University of Delhi) and Chuck Cook (Program Manager, Global Biodata Coalition). The aims of this event were twofold:
- to discuss global data conservation and sharing models with a view to developing a robust data sharing ecosystem at a national level for India.
- to encourage participation by Indian funding agencies in international coordinating activities such as the GBC.
Introducing the seminar, Raghuvanshi noted that India is emerging as one of the largest global producers of life science data; consequently, extensive efforts are underway to develop a robust life science data collection, interpretation and curation framework across the nation. Concerns about data curation and data stewardship capacity were reiterated several times during the course of the seminar.
In her message to seminar attendees, Renu Swarup (Secretary, Department of Biotechnology, Govt. of India) noted the active efforts of Indian funding agencies in developing a robust ecosystem for efficient storage and sharing of life science data generated in India. She also highlighted the need for national efforts to sync with global agency efforts on research data sharing.
Eric Green kicked off the seminar with an introduction and overview of the Global Biodata Coalition (GBC). The GBC is a coalition of funding agencies with the explicit aim of coordinating funding for biodata (life science and medical science data) repositories, to ensure the longevity and sustainability of the core resources which underpin much of life sciences and biomedical research. Green spoke of the need to act across borders: although research data are generated nationally, they are used internationally. The GBC’s main aim is to encourage research funders to work together on tackling data science challenges. Green noted that there are around 3,000 biodata resources globally, of which around 100 are considered ‘core’. These biodata resources are associated with a budget of around $500m. Biodata resources are highly interconnected, which is ideal for data discovery but also means that they are susceptible to issues caused by the failure of weak links.
Alongside the exponential growth in biodata generation, we are seeing an increase in the emergence of open data policies from publishers, funders and other research stakeholders. Green described how this is increasing the demands on biodata repositories, while current funding of these resources is fragmentary, fragile and haphazard. He added that poor international coordination is leading to duplication, waste and a lack of sustainability planning for biodata resources. There is also the growing threat of biodata resources retreating behind subscription firewalls for sustainability, with The Arabidopsis Information Resource mentioned as an example of this.
Niklas Blomberg presented an overview of ELIXIR, which brings together European biodata resources. ELIXIR is an example of an internationally collaborative approach for data resources, comprising 23 nodes, 55 commissioned projects and 397 teams. Blomberg described the philosophy behind ELIXIR as being “the rising tide lifts all the boats”. The ELIXIR project has identified a set of core resources, which are of fundamental importance to the broader life science community and support the long-term preservation of biological data. These core resources are considered to be fundamental research infrastructure, and the long-term sustainability of the resources is therefore vital for bioscience research. The intention is for the core resources to be funded differently to the usual grant-based academic projects. In order to be considered as a core resource, the key requirement is for the data to be completely open to use by all. Blomberg showed that the top three countries accessing EMBL-EBI resources are the USA, China and India, and that the majority of these resources represent international collaborations.
Blomberg described some of the international collaborations which have been established in the bioscience discipline, for example the International Nucleotide Sequence Database Collaboration (INSDC), which has been key in developing and implementing regular data exchange, common data standards, and open and unrestricted access to data on an international level. Blomberg called for Indian funding agencies to participate in these types of scientific collaborations, which are not research projects but research infrastructure initiatives. Blomberg shared his experience that funding bodies need to take a decadal view in considering investment and governance for data infrastructure. Blomberg’s view is that the complexity of these data and the accompanying requirement for skilled data professionals mean that a national focus for biodata resources is unlikely to be sustainable over the longer term.
I was invited to speak about the importance of research data repositories and how the use of repositories can enable data to be FAIR (findable, accessible, interoperable and reusable). There are essentially two types of repositories: those which are discipline-specific and those which are generalist. Discipline-specific repositories are the ideal location for data, as these repositories usually implement data standards, data tools and data visualizations best suited for their holdings. Discipline-specific repositories are also more likely to be staffed by specialists for that disciplinary area, who are able to provide technical guidance for, and validation of, deposited datasets. All of this means that discipline-specific repositories are able to maximize the findability, accessibility, interoperability and reusability of their data holdings. For data types and disciplines which do not have specialist repositories, generalist repositories are vital for enabling researchers to ensure their data are maximally findable and accessible.
In the panel discussion, Anurag Agarwal (CSIR-Institute of Genomics and Integrative Biology) discussed the challenges of sharing human-derived data in an ethical way, while Akhilesh K. Tyagi (University of Delhi) highlighted the need for controlled access repositories in view of emerging biodiversity-related bio-economy issues. The need for policies and data access frameworks developed by a range of Indian stakeholders (public and private) was also discussed.
India is taking important steps to embed an open data culture for its research community, and pulling in international expertise to minimize wheel reinvention is certainly a sensible way forward.