Manual curation of large omics data helps research in dermatology
Federico A, Hautanen V, Christian N, Kremer A, Serra A, Greco D. Manually curated and harmonised transcriptomics datasets of psoriasis and atopic dermatitis patients. Sci Data. 2020;7:343. doi:10.1038/s41597-020-00696-8.
Although many studies have characterised the molecular makeup of Psoriasis and Atopic Dermatitis, the lack of standardisation in data description and pre-processing hampers their usability. In this Data Descriptor published in Scientific Data, we report a manually curated collection of pre-processed and harmonised transcriptomics datasets of patients affected by Psoriasis (PSO) and Atopic Dermatitis (AD).
Despite the many studies published in recent years with the aim of identifying genes associated with the two diseases, the molecular determinants underlying the phenotypes of Psoriasis and Atopic Dermatitis remain relatively underexplored. For example, no prognostic biomarkers or disease endotypes have yet been defined that would allow disease trajectories to be predicted. One reason for this knowledge gap may be the lack of a large body of easily accessible molecular data from PSO and AD patients that the research community could exploit.
The ongoing EU IMI project “BIOMAP - Biomarkers in Atopic Dermatitis and Psoriasis”, which has received funding from the Innovative Medicines Initiative 2 Joint Undertaking (JU), has the overarching aim of generating systems medicine models of PSO and AD based on both clinical and molecular profiles.
In the context of the BIOMAP project, the research led by Dr. Antonio Federico in the laboratory of Prof. Dario Greco at Tampere University aimed to generate a comprehensive catalogue of publicly available transcriptomics datasets of patients affected by PSO and AD. Gene expression data for almost 1,000 samples profiled by DNA microarray and more than 650 samples profiled by bulk RNA sequencing were retrieved from multiple public data repositories. To make the data reusable by the research community, both the molecular data and the associated metadata were extensively quality checked, and the datasets that passed quality control were pre-processed and harmonised.
One of the crucial parts of this work was harmonising the information reported in the metadata accompanying the gene expression profiles. The challenge for all co-authors involved was to set up a unified data model that describes all of the clinical variables across the datasets while still preserving their information content. The clinical data were very heterogeneous across the datasets in terms of both information content and terminology, which highlights the current lack of mandatory standard models for annotating the clinical (or technical) data that describe omics data in public repositories. To harmonise the clinical variables rigorously, we defined tailored data dictionaries.
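As an illustration of how a data dictionary can drive this kind of metadata harmonisation, here is a minimal Python sketch. It is not the dictionary or the code used in the study: the variable names, aliases and controlled terms below (`disease`, `lesional_status`, `PSO`, `non_lesional` and so on) are hypothetical, chosen only to show the principle of mapping heterogeneous field names and values onto one unified vocabulary while keeping track of anything that could not be mapped.

```python
# Hypothetical data dictionary: variable names, aliases and terms are
# illustrative assumptions, not those defined in the published study.
DATA_DICTIONARY = {
    "disease": {
        "aliases": ["disease", "disease state", "diagnosis", "condition"],
        "values": {
            "psoriasis": "PSO",
            "pso": "PSO",
            "atopic dermatitis": "AD",
            "ad": "AD",
            "healthy": "control",
            "normal": "control",
            "control": "control",
        },
    },
    "lesional_status": {
        "aliases": ["lesional status", "tissue type", "skin type"],
        "values": {
            "lesional": "lesional",
            "ls": "lesional",
            "non-lesional": "non_lesional",
            "nl": "non_lesional",
        },
    },
}


def _normalise(text):
    """Lower-case, turn underscores into spaces and collapse whitespace."""
    return " ".join(str(text).strip().lower().replace("_", " ").split())


def harmonise_record(record, dictionary=DATA_DICTIONARY):
    """Map one sample's raw metadata onto unified variable names and terms.

    Returns (harmonised, unmapped). Fields or values that the dictionary does
    not cover are returned untouched in `unmapped`, so no information is
    silently lost and a curator can review them.
    """
    alias_to_variable = {
        _normalise(alias): variable
        for variable, spec in dictionary.items()
        for alias in spec["aliases"]
    }
    harmonised, unmapped = {}, {}
    for raw_key, raw_value in record.items():
        variable = alias_to_variable.get(_normalise(raw_key))
        if variable is None:
            unmapped[raw_key] = raw_value
            continue
        term = dictionary[variable]["values"].get(_normalise(raw_value))
        if term is None:
            unmapped[raw_key] = raw_value
        else:
            harmonised[variable] = term
    return harmonised, unmapped
```

Calling `harmonise_record({"Disease State": "Psoriasis", "tissue_type": "NL", "batch": "3"})` maps the first two fields onto the unified vocabulary and returns the unrecognised `batch` field separately for manual review. Keeping the unmapped fields rather than discarding them reflects the manual-curation step described above.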
Another challenge of this work was defining the best analytical standards, for both DNA microarrays and RNA sequencing, so that the data could be easily reused by the research community. For this reason, we designed technology-specific pipelines built on widely used algorithms for data transformation.
The data resource generated by our research group represents a valuable starting point for advanced transcriptome-based analyses aimed at filling the knowledge gap in the pathogenesis of PSO and AD. Our study also offers practical guidance on harmonising large-scale, heterogeneous data in a way that enhances their FAIRness, interpretability and inter-comparability.
We hope that our effort will help other researchers wishing to join our battle against complex diseases, including Psoriasis and Atopic Dermatitis.