Increasing usability of FANTOM5 data
Interview with four authors of the FANTOM5 collection
Many thanks to Yoshiko Fujikawa who originally conducted and wrote up this interview in Japanese with support from Yoko Shintani - it was then translated by the Springer Nature Research Data team, with support from Yoko Shintani and Yoshiko Fujikawa.
FANTOM is an international consortium of mammalian genome research led by RIKEN. The fifth edition, FANTOM5, has comprehensively monitored RNA transcribed from the human genome in more than 500 samples (including primary cells derived from organ tissues). Although a series of papers reporting scientific findings based on the FANTOM5 data has already been published, a recent collection of Data Descriptors (published by Scientific Data) describes, for the first time, FANTOM5’s data acquisition processes, sample quality and data processing in detail. The FANTOM5 collection includes a report on the Web tool "RefEx", which enables users to easily search and view gene expression data, including those produced through FANTOM5. We talked to two of the scientists who worked on the generation of FANTOM5 data, Dr. Kawaji and Dr. Kasukawa, along with Dr. Bono and Dr. Ono, who developed RefEx.
FANTOM is one of Japan’s leading international research consortia. I heard it was named by Dr. Bono.
Bono: The FANTOM consortium was established in 2000, with the initial aim being to assign functional annotation to mouse cDNAs [expressed genes]. I tried to think of a name that would be easy to remember and came up with FANTOM (Functional ANnoTation Of the Mouse). This consortium has continued to grow to the present fifth edition under the leadership of Dr. Yoshihide Hayashizaki. The target has since been changed from the mouse cDNA to the mammalian genome; fortunately, we did not need to change the letter "M" in the name.
What does annotation mean in this context?
Bono: Annotation means to add explanatory notes. Even if the genome of a certain organism is sequenced, we do not immediately know which portions of the genome represent genes, which genes are expressed in a particular tissue, how they are transcribed, or how gene expression is controlled. Describing these features is called functional annotation of the genome. In fact, although it is over 15 years since the human and mouse genomes were first sequenced, the function of these genomes has not yet been fully elucidated. One way to understand the function of the genome is by analyzing the RNA generated as a result of transcription of genomic DNA. After the total RNA (transcriptome) contained in the cells has been extracted, the sequences of individual RNA molecules are determined. The resulting data can be subjected to further detailed analysis, for example of gene expression levels and mechanisms of transcriptional regulation. FANTOM has consistently focused on transcriptome analysis.
I’d like to ask Dr. Kawaji and Dr. Kasukawa, who have both played essential roles in the fifth edition, more about the latest FANTOM5 research.
Kawaji: As in the previous editions of FANTOM, we profiled the total RNA content of cells, the transcriptome, in the fifth edition. One of the unique characteristics of our study is that the analysis was performed with more than 500 different types of samples, including primary cells and organ tissues in the body, mainly derived from human and mouse.
Many analyses of cancer cell lines have been conducted in the past; FANTOM5, however, focused on a wide range of normal cells (such as primary cultured cells), distinguishing it from other studies. A smaller number of samples from rats, dogs, macaque monkeys and chickens were also analyzed. Another unique characteristic is that we used the CAGE (Cap Analysis of Gene Expression) method developed at RIKEN, allowing us to monitor transcription initiation at single-base resolution. Based on the profile of transcription initiation, we managed to identify 180,000 promoters and 65,000 enhancers in the human genome. These results indicate that a very large number of regulatory regions act in concert to control the activities of genes across the genome, and that only part of their complexity has been elucidated so far.
The results of FANTOM5 have already been published, haven’t they?
Kasukawa: Yes. Two papers on promoters and enhancers were first published in Nature in 2014. Since then, more than 50 related articles have been published, including, most recently, papers focusing on non-coding RNA and microRNA. FANTOM5 involves 500 scientists from 20 countries, so I expect that further related articles will continue to be published.
Publication of datasets as an article in Scientific Data
You recently published a collection of articles on FANTOM5 datasets at Scientific Data.
Kawaji: Because a set of data on mammalian cell diversity and gene regulation provides a basis for much research in the life sciences, we believed that it was important for other people to be able to utilize the enormous amount of research data produced through FANTOM5. The launch of data journals such as Scientific Data, which focus on publishing articles that help researchers to reuse data, was very timely. We therefore decided to report in-depth details about the generation of the FANTOM5 data, in particular on CAGE and CAGEscan. While most of the data had already been made available in public data repositories, we anticipated that publishing these articles would expose the data, with full details, to many more researchers, leading to its use in broader contexts.
You have reported these data across several Data Descriptors.
Kasukawa: Yes. The data acquisition methods and data processing procedures differed depending on the scope of research (e.g. CAGE and RNA-seq methods for non-coding RNA) and target organism, respectively. Therefore, the data were roughly grouped and reported in separate Data Descriptor articles.
Kawaji: We first reported human and mouse CAGE data and will continue to report data on rats, dogs and macaque monkeys in the future. Articles related to FANTOM5 have been featured as the FANTOM5 Collection on the Nature Research website*1.
Kawaji: The collection home page has a link to the Comment article published in Scientific Data, where the FANTOM5 datasets have been outlined6. Reports published in Nature and related journals are also listed, providing an overview of the achievements of FANTOM5.
I see the collection includes a Data Descriptor on data updating7.
Kasukawa: The positions of RNA sequences transcribed from the genome are determined via the corresponding reference sequence of the human genome. However, the reference sequence is updated every few years and a newer version than the one used in FANTOM5 has already emerged. We reanalyzed FANTOM5 data using the most current version available and published this reprocessed data in order to retain the usefulness of the data.
We describe how the data were reanalyzed. It was not a simple task of running a single analysis program. It required careful assessment and decisions, such as evaluating the validity of the analysis results and adjusting the data handling according to the outcome of that evaluation. Although this kind of reanalysis and the associated updates are indispensable for maintaining data usability, it is not unusual for them to remain unpublished in any journal, since they are not scientific discoveries in themselves. We are very pleased that we could publish these efforts as Data Descriptors.
FANTOM5 data can be viewed using RefEx
Dr. Bono and Dr. Ono have published an Article on RefEx in Scientific Data8.
Bono: Dr. Ono and I are currently based at the Database Center for Life Science (DBCLS). One of our missions is to create researcher-friendly Web tools that help data sharing and reuse. To achieve this, we created the simple viewer "RefEx" to search and view data that serve as a reference for gene expression analysis, including FANTOM5 data. In this report we discuss how FANTOM5 datasets are useful for a wide range of biologists, as well as the importance of making data easy to use.
Ono: FANTOM5 data can be viewed through the RIKEN database, but its focus is on the analysis of transcriptional regulatory mechanisms. The database has an advanced search page for each transcription initiation, and RNA quantification data are displayed as "transcription initiation activity." We, on the other hand, aimed for a viewer with which even general biologists unfamiliar with transcriptional regulation analysis could utilize the data. Therefore, FANTOM5 data were reanalyzed, and the detailed data obtained for each transcription initiation were grouped by gene to enable searching by "gene expression levels."
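The idea of grouping transcription-initiation data by gene can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the record layout, the promoter and gene names, the counts, and the choice of simple summation as the grouping rule are hypothetical, and the interview does not state how RefEx actually performs this step.

```python
from collections import defaultdict

# Hypothetical promoter-level records: (promoter_id, gene, sample, tag_count).
# The names and numbers are invented for illustration, not FANTOM5 values.
promoter_counts = [
    ("promoter_1", "GENE_A", "liver", 120),
    ("promoter_2", "GENE_A", "liver", 30),
    ("promoter_1", "GENE_A", "brain", 5),
    ("promoter_3", "GENE_B", "liver", 0),
    ("promoter_3", "GENE_B", "brain", 42),
]


def group_by_gene(records):
    """Sum promoter-level counts into one value per (gene, sample) pair.

    Summation is an assumed grouping rule chosen for simplicity.
    """
    totals = defaultdict(int)
    for _promoter_id, gene, sample, count in records:
        totals[(gene, sample)] += count
    return dict(totals)


gene_expression = group_by_gene(promoter_counts)

for (gene, sample), total in sorted(gene_expression.items()):
    print(f"{gene}\t{sample}\t{total}")
```

In this toy input, the two liver promoters of the hypothetical GENE_A are combined into a single gene-level value, which is the kind of view a biologist searching by gene would expect.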
RefEx became publicly available around 2012.
Bono: Yes. RefEx originally started as a Web tool that enabled comparison of gene expression among 40 healthy organs, and Dr. Ono and I have been the main contributors to its expansion. In addition to the existing grouped data from 40 organs, the current version has a viewer dedicated to FANTOM5 data.
Ono: Gene expression data are greatly influenced by the method used and by the type and state of the sample measured. For that reason, I have wanted a standard for reliable reference data ever since I was a student. I feel that the creation of the RefEx tool, which allows an overview of data measured by several methods, is of great significance.
Figure 1: Overview screen of RefEx (left) and FANTOM5 data viewer (right).
When publishing research results obtained using RefEx, please cite the RefEx paper8.
The role Scientific Data plays in promoting data sharing
One of the aims of Scientific Data is to promote data reuse. How do you think this journal can be utilized?
Kasukawa: As a researcher in the field of genomics, I regularly use data produced by others, including other research consortia. Research cannot be done in the current era otherwise. To use others’ data, it is necessary to understand the details and quality of the data. It is very helpful when these are explained in the way they are in articles published in Scientific Data. In the past, data were often reported as supplementary information, which made them hard to understand because of the lack of detail, and difficult to locate because the information was often scattered across several papers.
Bono: General experimental scientists can also benefit from using publicly available data for their own research as it means they do not have to repeat experiments. Presenting data with clarity should be helpful for them as well.
How can general experimental scientists effectively use gene expression data?
Kawaji: I heard a story that the data produced by the previous edition of FANTOM was quite useful for Prof. Shinya Yamanaka when he selected candidate genes that can induce pluripotent stem cells.
Ono: Indeed. The primary use would be to narrow down candidate genes. For instance, a researcher who had obtained a few dozen candidate target genes for cancer therapies was able to effectively narrow them down for further experimentation after searching genes expressed at very low levels in normal tissues using RefEx.
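The narrowing-down step Dr. Ono describes can be sketched as a simple filter. This is only an illustration under stated assumptions: the gene names, tissues, expression values, the threshold of 1.0, and the helper `low_in_normal_tissues` are all hypothetical, and the sketch does not use any RefEx interface.

```python
# Hypothetical reference expression values (arbitrary units) in normal tissues.
# All names and numbers are invented for illustration.
normal_expression = {
    "GENE_A": {"liver": 0.2, "brain": 0.1, "heart": 0.4},
    "GENE_B": {"liver": 35.0, "brain": 0.3, "heart": 2.1},
    "GENE_C": {"liver": 0.0, "brain": 0.8, "heart": 0.5},
}


def low_in_normal_tissues(candidates, expression, threshold=1.0):
    """Keep candidates whose highest normal-tissue expression is below threshold.

    Genes without reference data are dropped, because low expression
    cannot be confirmed for them. The threshold is an assumed cutoff.
    """
    kept = []
    for gene in candidates:
        tissues = expression.get(gene)
        if tissues is None:
            continue  # no reference data available for this gene
        if max(tissues.values()) < threshold:
            kept.append(gene)
    return kept


candidates = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
shortlist = low_in_normal_tissues(candidates, normal_expression)
print(shortlist)
```

Using the maximum across tissues is a deliberately strict choice: a candidate stays on the list only if it is low in every normal tissue for which reference data exist.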
Bono: You can also use them to verify the reproducibility of your own data. This technique is used to prove that the same conclusions can be obtained by reanalyzing your research results using publicly available data. Many experimental biologists will want to perform such reanalyses and we are running seminars to teach some of these data analysis techniques.
What should we be careful about when reusing data?
Ono: It is important to check under what conditions the original data were obtained, and most importantly, whether the experiments were of appropriate quality.
In addition, as a provider of reanalyzed data, I think it is important to keep the information traceable within a viewer such as RefEx. Data reliability is confirmed by the researcher reusing these data, but fundamentally is based on the original data. Therefore, journals like Scientific Data play an important role. RefEx clearly states how the original data was processed.
So what can data producers expect from publishing with Scientific Data?
Kawaji: I think it is a huge advantage for them to have a place where they can publish a paper focusing on descriptions of datasets, in particular those produced with substantial effort, so that the data can be utilized in broader contexts.
Kasukawa: One of the reasons that database updates are not always recognized as scientific achievements is that there is no system to evaluate data management. I hope that our report in Scientific Data will also be considered a good use case in the establishment of such an evaluation system.
Will data sharing and reuse become popular in Japan, too?
Bono: Today the promotion of data sharing is emphasized internationally, and I feel that this trend is accelerating rapidly in Japan. Japan has hosted a "BioHackathon" with data analysts and the FAIR Data Principles*2 were created through the BioHackathon. The FANTOM projects have also published the present papers on their datasets. I hope that data sharing will become more and more popular in Japan as well as other countries throughout the world.
Hideya Kawaji, Ph.D.
Unit Leader, Advanced Center for Computing and Communication
Preventive Medicine and Applied Genomics Unit
Program Coordinator, Preventive Medicine and Diagnosis Innovation Program, RIKEN

Takeya Kasukawa, Ph.D.
Unit Leader, Large Scale Data Managing Unit

Hidemasa Bono, Ph.D.
Project Associate Professor
Database Center for Life Science (DBCLS)
Research Organization of Information and Systems

Hiromasa Ono, Ph.D.
Project Assistant Professor
Database Center for Life Science (DBCLS)
Research Organization of Information and Systems

Access the full FANTOM5 collection here: http://www.nature.com/collections/fantom5