What is involved in archiving life science datasets?

Behind the repository: Find out about the Life Science Database Archive

Shigeru Yatsuzuka
Mar 17, 2018

“Easier said than done” is a famous proverb in Japan. I believe “Easier said than archived” is equally true.

Our repository, the Life Science Database Archive (LSDB Archive), collects over 130 datasets generated by life scientists, mainly in Japan, and stores them in a stable state as national public goods.

LSDB Archive makes it easier for researchers, companies and citizens to:

  • understand metadata in a unified format
  • search datasets, including sequences and images
  • download datasets in general-purpose formats and in formats suitable for linked open data (LOD)
  • access datasets with clear terms of use based on the Creative Commons licenses
  • cite datasets using unique persistent identifiers (DOIs) 

It has been almost 10 years since we started LSDB Archive. When we began, I thought archiving would be easier than developing a bioinformatics analysis system. But how wrong I was.

Why?

Well, after wrestling with many datasets and invisible barriers, I can pinpoint four specific issues:  

Collection:

In Japan, almost all large-scale life science research projects are affiliated with one or more of the following ministries:

  • Ministry of Education, Culture, Sports, Science and Technology (MEXT)
  • Ministry of Economy, Trade and Industry (METI)
  • Ministry of Agriculture, Forestry and Fisheries (MAFF)
  • Ministry of Health, Labour and Welfare (MHLW)

(The National Bioscience Database Center (NBDC) is included within MEXT.)

Such hierarchies can both hinder and support data archiving. We have had many tough negotiations with the ministries because we started from the view that data openness is necessary. However, with the help of strong initiatives from the Government of Japan’s Cabinet Office, we have made huge advances in promoting the acceptance of open data. We expect to make further progress in this area with the ongoing support of these government departments.

Validation:

Some datasets contain errors or mismatches, while others use inconsistent notation. So, we have to validate every dataset.

For example, one data table may use IDs like “P1” while another uses IDs like “P001”. At a glance, users cannot tell that “P1” is the same as “P001”.
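One way to catch this kind of mismatch is to map every ID to a single canonical form before comparing tables. The sketch below is only an illustration, not the actual LSDB Archive tooling: the function name `canonical_id` and the assumption that IDs consist of a letter prefix followed by digits are hypothetical.

```python
import re


def canonical_id(raw: str, width: int = 3) -> str:
    """Return a zero-padded form of an ID such as "P1" or "P001".

    Assumes (hypothetically) that every ID is a letter prefix
    followed by digits; anything else is rejected.
    """
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", raw.strip())
    if match is None:
        raise ValueError(f"unrecognized ID format: {raw!r}")
    prefix, digits = match.groups()
    # int() drops leading zeros, then the format spec pads back to `width`.
    return f"{prefix.upper()}{int(digits):0{width}d}"


# Both notations collapse to the same canonical string.
same = canonical_id("P1") == canonical_id("P001")
```

Once both tables use the canonical form, a plain equality check or join is enough to link their records.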

In another example, “-2” and “0.01” appeared in the same column. We found that “-2” was a base-10 logarithm! That is, “-2” and “0.01” represent the same value, because log10(0.01) = -2. Small distinctions like this matter a great deal to users trying to understand and use the data!
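A simple heuristic can flag columns that may mix the two scales. The sketch below assumes, purely for illustration, that the column is supposed to hold strictly positive linear-scale values, so any value of zero or less is suspicious. The function names are hypothetical, and a flagged value still needs a human to confirm that it really is a logarithm.

```python
import math


def flag_possible_log_values(column):
    """Return values that cannot be positive linear-scale numbers."""
    return [value for value in column if value <= 0]


def to_linear(log10_value):
    """Convert a base-10 logarithm back to the linear scale."""
    return 10.0 ** log10_value


column = [-2, 0.01, 0.5]
suspects = flag_possible_log_values(column)  # only -2 is flagged

# After a curator confirms that -2 is a logarithm, it maps onto 0.01.
matches = math.isclose(to_linear(-2), 0.01)
```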

Metadata curation:

The metadata of LSDB Archive includes the following (only main items):

  • Database name
  • Creator
  • Database description
  • Background and funding
  • Reference(s)
  • Description of data contents
  • Description of each data item

Above all, “Description of each data item” is the most important for understanding the data. But it is apt to be unclear or forgotten. In many datasets, an item is named with jargon that is understood only within one laboratory or among a handful of developers. To counteract this, we “dig up” the lost metadata from papers or from web archives.
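Finding which items still lack a description is easy to automate, even though writing the descriptions is not. The following is a minimal sketch under assumed inputs: a list of column names and a dictionary of descriptions. Both the column names and the function `undescribed_items` are hypothetical.

```python
def undescribed_items(column_names, descriptions):
    """Return the column names that have no (or only a blank) description."""
    return [
        name
        for name in column_names
        if not descriptions.get(name, "").strip()
    ]


# Hypothetical dataset: "flg2" is lab jargon nobody documented.
columns = ["protein_id", "expr_level", "flg2"]
descriptions = {
    "protein_id": "Identifier of the protein",
    "expr_level": "Expression level (linear scale)",
    "flg2": "   ",
}
missing = undescribed_items(columns, descriptions)
```

The resulting list tells a curator which items to look up in papers or web archives.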

Data curation:

Data that originate from existing database systems on the web are sometimes not suitable for reuse, for two reasons:

  • The normalization of data tables can hinder users who want to understand the data as a whole.
  • Some data items only make sense inside the original web system. These are often flags for something (font color, kind of link, display order, etc.).

To resolve those problems, we separate out the different elements of datasets, remove some items and rebuild them to help make the datasets understandable, searchable and downloadable.  
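The rebuilding step can be pictured as joining the normalized tables back together and dropping the display-only items. The sketch below uses plain Python dictionaries and invented table and column names (`samples`, `proteins`, `font_color`, `link_kind`, `show_order`); it illustrates the idea and is not the actual LSDB Archive procedure.

```python
# Hypothetical columns that only mattered to the original web system.
WEB_ONLY_COLUMNS = {"font_color", "link_kind", "show_order"}


def rebuild(samples, proteins):
    """Join each sample row with its protein row and drop web-only items."""
    protein_by_id = {row["protein_id"]: row for row in proteins}
    rebuilt = []
    for sample in samples:
        # Rows without a matching protein keep only their own columns.
        protein = protein_by_id.get(sample["protein_id"], {})
        merged = {**protein, **sample}
        rebuilt.append(
            {key: value for key, value in merged.items()
             if key not in WEB_ONLY_COLUMNS}
        )
    return rebuilt


samples = [{"sample_id": "S001", "protein_id": "P001", "font_color": "red"}]
proteins = [{"protein_id": "P001", "name": "example protein", "show_order": 3}]
flat_rows = rebuild(samples, proteins)
```

Each rebuilt row is self-explanatory on its own, which suits users who download a single file rather than query a relational database.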

The figures below show a typical, simplified example of this sequence of processes (from validation to rebuilding).

Lastly, we are proud to mention FANTOM5, the functional annotation of the mammalian genome database. The FANTOM consortium has published a collection of articles in various Nature Research journals, which can be found at www.nature.com/collections/fantom5. As part of this collection, the FANTOM researchers published several Data Descriptors in Scientific Data (for example doi:10.1038/sdata.2017.113). For these Data Descriptor publications, the researchers deposited datasets in LSDB Archive to ensure proper storage and preservation of the data. In accepting the FANTOM5 data, we curated the metadata and made useful data lists in collaboration with the FANTOM5 researchers. The FANTOM5 data archived at LSDB Archive are openly accessible.

We expect many life science researchers will follow the FANTOM consortium’s lead in publishing at Scientific Data in conjunction with archiving their data at LSDB Archive. 

We look forward to continuing to help Japanese life scientists archive and preserve their data properly for the future.

Visit the Life Science Database Archive (LSDB Archive) to access openly available research data.

Shigeru Yatsuzuka

Researcher, National Bioscience Database Center (NBDC)

After working for IT vendors as an engineer and a project manager, I now work at NBDC, where I manage the Life Science Database Archive.
