Just as a burglar leaves fingerprints at a crime scene, viruses leave distinct imprints on our immune system. To recognize such patterns and identify the clues left behind, we need to collect a variety of measurements of the immune system that capture the parameters specific to the pathogen in question. In recent years, we have witnessed an increase in publicly available influenza data, mostly due to new technological advancements and open data initiatives. However, to obtain a systemic view of the influenza virus imprint, datasets must be combined across clinical studies.
The goal of our research is quite straightforward: we wanted to apply machine learning to identify patterns in the data gathered at the Human Immune Monitoring Center at Stanford University, to determine why some individuals fail to mount an antibody response after influenza vaccination. It seemed quite straightforward at the beginning, but things soon became more and more complicated.
The data were generated using different immune assays and were spread across hundreds of files. Simply taking the collected data and immediately running machine learning algorithms was not possible. The data had to be preprocessed, merged, cleaned, standardized and fully integrated before the machine learning could begin. This required building an automated pipeline to transform the data in the database so that they are easily searchable, and therefore usable.
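To give a flavor of the merging and standardization step, here is a minimal sketch in Python using pandas. The table layouts, column names and values are purely illustrative, not the actual FluPRINT schema: in practice each assay arrives in its own files and must be cleaned before the tables can be joined on a shared donor identifier.

```python
import pandas as pd

# Hypothetical example: two assay result tables from separate study files.
cytof = pd.DataFrame({
    "donor_id": [101, 102],
    "vaccine": ["Fluzone", "fluzone "],   # note the spelling variants
    "cd4_t_cells_pct": [42.1, 38.7],
})
serology = pd.DataFrame({
    "donor_id": [101, 102],
    "hai_titer": [160, 20],
})

# Standardize before merging: trim whitespace and normalize case so the
# same vaccine is never counted as two different ones.
cytof["vaccine"] = cytof["vaccine"].str.strip().str.lower()

# Integrate the assays into one table keyed on the donor identifier.
merged = cytof.merge(serology, on="donor_id", how="inner")
print(merged)
```

The real pipeline of course handles many more assays, files and edge cases, but the principle is the same: clean each table, then join on a common key.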
Only by using good quality, standardized and cleaned data can one gain useful insights from them. The most important step in obtaining good quality data is for researchers and clinicians to work closely with informaticians and data scientists before the project starts. There are issues that we, as researchers, do not think about but that significantly affect data quality. For example, if the same vaccine is written even slightly differently, the computer will treat it as two different vaccines (e.g. Fluzone with a capital letter is different from fluzone in lowercase). For this reason, standardization is one of the most important steps when designing a study.
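One simple way to guard against such spelling variants is to map every raw entry onto a canonical name. The sketch below is a hypothetical illustration (the mapping entries and function name are our own, not part of any published pipeline):

```python
def standardize_vaccine_name(raw: str) -> str:
    """Collapse spelling variants of a vaccine name into one canonical form."""
    # Illustrative lookup table: lowercase, whitespace-normalized key -> canonical name.
    canonical = {
        "fluzone": "Fluzone",
        "fluzone high-dose": "Fluzone High-Dose",
    }
    key = " ".join(raw.split()).lower()  # collapse whitespace, lowercase
    return canonical.get(key, raw)       # leave unknown names untouched

print(standardize_vaccine_name("FLUZONE"))    # variants map to one form
print(standardize_vaccine_name(" fluzone "))
```

With such a function applied at import time, "Fluzone", "FLUZONE" and " fluzone " all end up as a single value in the database.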
By releasing our database, we want to encourage other researchers to open their data as well, since high quality data that can be used "off the shelf", without spending months on cleaning and standardization, are rarely available. We, as researchers, have the responsibility to openly share high quality data since, in most cases, we are the ones who understand the data best and are in the best position to prepare them well. We therefore hope to see more and more standardized data published.
To read more about how the FluPRINT database was built, please read our recently published manuscript and check out the website dedicated to the project at fluprint.com. If you are interested in how you can take advantage of the data, please check out SIMON, our machine learning pipeline, and take part in the open source community dedicated to making SIMON free for everyone at genular.com.