From unstructured administrative records to an accessible, open, global dataset of pandemic- and epidemic-prone disease outbreaks.

How we created a dataset of 2227 unique outbreaks associated with 70 different infectious diseases that occurred in a total of 233 countries and territories from 1996 until March 2022.

The COVID-19 pandemic has demonstrated the hazard that infectious diseases can pose to global public health and development. According to the latest available estimates from the World Health Organization (WHO), as of November 2022 COVID-19 had been confirmed to have affected over 628 million people worldwide and caused more than 6.5 million deaths. However, COVID-19 is not the only infectious disease threatening the world. In 2019, the year before the first confirmed death from COVID-19, infectious diseases claimed more than 5.1 million lives, or 14% of the 55.4 million deaths worldwide. That was approximately 4 million fewer deaths from infectious diseases than in 2000.

Despite the decline in deaths from infectious diseases in the two decades leading up to the COVID-19 pandemic, disease outbreaks have been emerging at unprecedented rates. Examples include the outbreak of severe acute respiratory syndrome coronavirus (SARS-CoV) in 2003, the 2009-2010 influenza A(H1N1) pandemic, the Middle East respiratory syndrome coronavirus (MERS-CoV) outbreak in 2012, the 2013-2016 West African Ebola virus disease epidemic, and the 2015-2016 Zika virus epidemic. All of these spread to several countries across different continents, disproportionately affecting the most vulnerable communities.

The increasing number of disease outbreaks has also spurred growing research interest in this phenomenon, and with it a rising need for reliable open data from official sources. However, existing datasets on the matter cover only a limited number of infectious diseases, or are specific to a population, country, or region. Other datasets are based on unofficial sources, which may contain errors or disinformation from false reports, or are not publicly available, which hampers their reuse.

The WHO, as part of its mandate, collects information about confirmed and potential public health events of concern around the world. Specifically for COVID-19, the WHO made available the Coronavirus Dashboard. For all other diseases, the information is contained in the Disease Outbreak News (DONs). The information in the DONs is obtained from an integrated global system coordinated by the WHO, and is based on epidemiological, clinical, and laboratory investigations conducted by the official public health authorities, institutions, and research networks of the WHO and its partners all over the world. However, since the DONs are not primarily produced for statistical purposes, they are unstructured, their format makes it difficult to extract detailed information, and they do not use concepts and definitions that conform to international standards. Therefore, to create a database that is statistically sound for research purposes, we first processed the information from the DONs and then merged it with the Coronavirus Dashboard data.

Our final dataset contains information on 2227 disease outbreaks that occurred between January 1996 and March 2022. In comparison with existing data on the matter, our dataset provides five key advantages. First, it offers wide geographic coverage of 233 countries and territories around the world. Second, it covers 70 infectious diseases. Third, it uses standardized concepts and definitions: the country and territory codes of the International Organization for Standardization (ISO 3166), and the tenth revision of the International Statistical Classification of Diseases and Related Health Problems (ICD-10). This allows users to flexibly obtain information for a specific region, country, year, or disease. Fourth, for transparency, replicability, and reproducibility, we make the data, the metadata, and the code used to create these data publicly available on Figshare in both human- and machine-readable formats (HTML, R, csv), which facilitates the reuse of the information. Moreover, by re-running the code, users can automatically extract more recent DONs from the website to keep the database updated. Finally, our data are interoperable, i.e. they can be easily integrated with other data by using the country code, the year, and/or the disease code as key variable(s) to match observations between datasets.
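To illustrate what this interoperability could look like in practice, here is a minimal Python sketch that joins outbreak records with a second country-year dataset on the country code and the year. The column names (`iso3`, `year`, `icd10`, `population`), the helper `left_join`, and all rows are hypothetical placeholders chosen for illustration; they are not the actual schema or contents of our files.

```python
import csv
import io

# Hypothetical extract of the outbreak dataset. Column names and rows are
# placeholders for illustration, not the real file's schema or contents.
OUTBREAKS_CSV = """iso3,year,icd10
AAA,2001,X01
AAA,2002,X02
BBB,2001,X01
"""

# Hypothetical external country-year dataset to combine with the outbreaks.
INDICATORS_CSV = """iso3,year,population
AAA,2001,1000
BBB,2001,2000
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row (values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def left_join(left, right, keys):
    """Keep every row of `left`; add the non-key columns of the matching
    `right` row, or None for each of them when there is no match."""
    index = {tuple(row[k] for k in keys): row for row in right}
    extra_cols = [c for c in right[0] if c not in keys] if right else []
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        merged.append({**row, **{c: match.get(c) for c in extra_cols}})
    return merged


outbreaks = read_rows(OUTBREAKS_CSV)
indicators = read_rows(INDICATORS_CSV)

# Match observations on the shared key variables: country code and year.
merged = left_join(outbreaks, indicators, keys=("iso3", "year"))
for row in merged:
    print(row)
```

A left join keeps every outbreak even when the other source has no matching country-year, so gaps in the external data show up as missing values rather than silently dropping outbreaks. Adding the disease code to `keys` would match on disease as well, provided the other dataset also uses ICD-10.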

Our dataset will contribute to enhancing knowledge on epidemics. Further research may combine these data with other sources to identify the factors associated with countries' exposure to pandemic- and epidemic-prone disease outbreaks.
