Competitive, crowd-sourced computational approaches to translational medicine and systems biology data
Reflections on the DREAM Challenges and EPIDEMIUM workshop, 19-20th April, Paris, France.
Organised by DREAM Challenges founder Gustavo Stolovitzky, Julio Saez-Rodriguez, Pablo Meyer, Elise Blaese and EPIDEMIUM organiser Olivier de Fresnoye, the DREAM Challenges and EPIDEMIUM workshop was held as a satellite meeting of the RECOMB 2018 conference.
DREAM Challenges and EPIDEMIUM organise computational biology data analysis competitions to encourage computer science and Machine Learning (ML) experts to work on the many ‘big’ datasets being produced by the biomedical community, with the aim of solving specific real-world problems. For example, a recent DREAM Challenge was on polypharmacology: participants were asked to predict compounds that bind multiple targets, using data made available for the challenge alongside other publicly available datasets.
Competitors are given access to a number of datasets with which to train their algorithms, while a distinct dataset (same attributes, different data) is held back to test the performance of each team’s efforts. Keeping the test dataset private was discussed as an important part of the way the challenges are run: in a prior challenge where the test data was mistakenly made available, there was some evidence of participants ‘overfitting’ their algorithms to it. The organisers run the challenges with the overarching goal of improving human health, but as there can be substantial financial rewards for winning teams, it was noted that not all participants have the same motivations.
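The point above about keeping the test data private can be illustrated with a toy simulation. The sketch below is purely illustrative and is not code or data from any DREAM Challenge: the function names, dataset sizes and the "model" (predicting the label from a single binary feature) are all assumptions made for the example. Features and labels are generated independently, so no model can genuinely beat 50% accuracy; nevertheless, choosing the best of many candidate models by their score on a small leaked test set produces an inflated score that disappears on a fresh, private dataset.

```python
import random


def make_dataset(n_samples, n_features, rng):
    """Random binary features and labels that are independent of each other."""
    features = [[rng.randint(0, 1) for _ in range(n_features)]
                for _ in range(n_samples)]
    labels = [rng.randint(0, 1) for _ in range(n_samples)]
    return features, labels


def accuracy(feature_index, features, labels):
    """Accuracy of the toy 'model' that predicts label == features[feature_index]."""
    hits = sum(1 for row, y in zip(features, labels) if row[feature_index] == y)
    return hits / len(labels)


def pick_best_feature(features, labels):
    """Select the candidate model with the highest accuracy on the given data."""
    n_features = len(features[0])
    return max(range(n_features), key=lambda j: accuracy(j, features, labels))


rng = random.Random(0)  # fixed seed so the run is repeatable
n_features = 200        # 200 candidate 'models', none with real signal

# Hypothetical scenario: a small test set that was leaked to participants,
# and a larger test set that the organisers kept private.
leaked_x, leaked_y = make_dataset(50, n_features, rng)
private_x, private_y = make_dataset(5000, n_features, rng)

# A participant tunes against the leaked test set...
best = pick_best_feature(leaked_x, leaked_y)
leaked_score = accuracy(best, leaked_x, leaked_y)

# ...but the same model is then scored on data it has never seen.
private_score = accuracy(best, private_x, private_y)

print(f"accuracy on leaked test set:  {leaked_score:.2f}")
print(f"accuracy on private test set: {private_score:.2f}")
```

The gap between the two printed scores is entirely an artefact of selecting a model on the data used to score it, which is why a privately held test set gives a fairer ranking of teams.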
In DREAM Challenges, the competitive phase is followed by a collaborative phase, allowing competitors to learn from each other, further refine their algorithms and (for the most engaged) gain the opportunity to co-author a paper on the Challenge. This is a key part of the ‘crowd-sourced’ element of the way these challenges are run, with the deliberate intention of removing bias as different teams’ algorithms are refined in a collaborative manner. This collaborative phase also mirrors the way in which ML algorithms are developed, with a continuous cycle of improvement.
This year the workshop also included an interactive session on how these challenges are run, with the organisers genuinely interested in learning how they might improve the experience for challenge participants.
I was invited to present on the work of Scientific Data (my slides), as access to high-quality data underpins the success (or otherwise) of computational biology efforts. Scientific Data’s emphasis on publishing technically sound research data, in a manner which aims to maximise data reuse, is greatly appreciated by this community. We have also published several Challenge datasets since the journal’s launch:
- The species translation challenge—A systems biology perspective on human and rat bronchial epithelial cells
- Matched computed tomography segmentation and demographic data for oropharyngeal cancer radiomics challenges
- A large, open source dataset of stroke anatomical brain images and manual lesion segmentations
- A high-resolution 7-Tesla fMRI dataset from complex natural stimulation with an audio movie
The workshop was a chance for winners and runners-up of recent challenges to present their work to the community, to learn about the latest data types being generated by the biomedical research community and to discuss potential Challenge questions for the following year. I thank the organisers for inviting Scientific Data to take part; it was a truly fascinating and very enjoyable two days.