HyperKvasir: Data to enable AI for automated GI disease detection
Artificial intelligence (AI) is currently a hot topic in medicine, and detection of diseases in the gastrointestinal tract can potentially be improved using AI-based systems. However, available data is often scarce. HyperKvasir provides more than a million images to develop better algorithms.
What would we do without the doctors and clinicians who help so many people every day? They are priceless. However, they are not infallible, and symptoms and diseases can be overlooked. In disease detection in the gastrointestinal (GI) tract, anomalies are regularly missed, and there are large inter- and intra-observer variations.
A future vision in this respect is a computer-assisted system that acts as a potentially “flawless” and “24/7-available” digital helper for clinicians examining patients. By analyzing the endoscopy video during the medical procedure, such a system is envisioned to notify the clinician about findings in the displayed frame, as an add-on to the plain video signal shown on the screen today.
This dream has made artificial intelligence (AI) a hot topic in medicine. However, a major challenge is to transfer the knowledge and experience of medical experts to the computer models. Data for building good AI models is generally inaccessible: medical data is often sparse and hard to obtain, for example because of legal restrictions, and there is a lack of qualified personnel to perform the cumbersome and tedious labeling of the data. All of this limits what automatic analysis can achieve.
In this respect, our recent HyperKvasir dataset is the largest image and video collection of the GI tract available today. The data was collected during real gastroscopy and colonoscopy examinations at Bærum Hospital in Norway and partly labeled by experienced GI endoscopists. The dataset contains 110,079 images and 374 videos, covering anatomical landmarks as well as pathological and normal findings from both the upper and lower parts of the GI tract. Images and video frames together total around 1.17 million. The dataset consists of four primary data records: labeled images, segmented images, unlabeled images, and labeled videos.
Initial experiments found in the paper (and its references) demonstrate the potential benefits of AI-based computer-assisted diagnosis systems, but also show that there is still a long way to go to meet the complete future vision.
With our publicly shared dataset, we hope to enable and accelerate research on efficient AI systems for anomaly detection in the entire GI tract. We believe that HyperKvasir can play an important and valuable role in developing better algorithms and computer-assisted examination systems, not only in gastroscopy and colonoscopy but also in other fields of medicine.
The HyperKvasir dataset can be found at OSF: https://osf.io/mh9sj/
The paper describing the dataset can be found at https://www.nature.com/articles/s41597-020-00622-y or https://doi.org/10.1038/s41597-020-00622-y
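For readers who download the dataset, a first sanity check might be to count how many images each class contains in the labeled-images record. The sketch below is only an illustration: it assumes the images have been unpacked into a local folder in which every image sits in a subfolder named after its class. That layout, the accepted file extensions, and the function name count_images_per_class are assumptions made for this example, not something stated in this article.

```python
from collections import Counter
from pathlib import Path

# Assumption: images are stored with one of these file extensions.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root):
    """Count image files under `root`, keyed by the name of their parent folder.

    Assumes (hypothetically) that each image sits in a folder named after
    its class, at any nesting depth below `root`.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            counts[path.parent.name] += 1
    return counts
```

Calling count_images_per_class on the local folder returns a Counter, so its most_common() method lists the classes from largest to smallest, which makes any class imbalance visible before a model is trained.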