Through the provision of our Research Data Helpdesk and our other research data services, we have found sensitive data to be one of the most challenging aspects of data sharing for researchers. Even so, many stakeholders require or encourage authors to share their research data.
While researchers are encouraged to share their data openly where possible, examples where census data and even Netflix viewing data have been used to re-identify individuals show that there are risks involved. Researchers working with sensitive data should still be able to gain credit for their work and allow others to reuse their data, and with careful processing these data can often be shared safely.
Here I’ll talk briefly about the potentially sensitive data our Research Data team encounter in the course of our data checks and how researchers could de-sensitise such data.
First, what is sensitive data? In the context of data sharing, sensitive data is anything that could potentially identify participants in a study. The most obvious examples are name, date of birth, email address, etc. However, as well as these ‘direct’ identifiers, participants could also be identified by more ‘indirect’ identifiers once these are combined, such as age, sex, race, religion, place of birth/treatment/employment, etc. Some categories of sensitive data are considered to be particularly risky as they reveal highly personal information, such as health/disease status, which could have an impact on study participants’ insurance or job security.
Of course, when researchers collect data, they often collect much more than is actually needed to answer their specific research question. This isn’t a bad thing: it could help in ensuring the study population is representative of the general population, and allow researchers to uncover trends that go beyond their original hypothesis. However, it could also lead to sensitive data being needlessly shared.
For example, we recently dealt with a large spreadsheet of around 200 parameters reported for over 20 thousand subjects. The data included ‘date of birth’, ‘hospital admission date’, ‘sex’, ‘weight’, ‘HIV status’, et cetera, and the article specified the hospital into which subjects were admitted. Though the subjects may not be identifiable from any one piece of data, when these data are taken together, there is certainly a risk. Furthermore, there is an additional concern when the data contain sensitive aspects such as disease status.
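To make the point about combined identifiers concrete, here is a minimal sketch of one way to check how many records become unique once several indirect identifiers are taken together. It is not the check our team runs; the records, field names and the helper functions `group_sizes` and `unique_rows` are all invented for illustration.

```python
from collections import Counter

# Hypothetical records: field names and values are invented for illustration.
records = [
    {"sex": "F", "age": 34, "hospital": "A", "hiv_status": "negative"},
    {"sex": "F", "age": 34, "hospital": "A", "hiv_status": "positive"},
    {"sex": "M", "age": 61, "hospital": "A", "hiv_status": "negative"},
    {"sex": "M", "age": 29, "hospital": "B", "hiv_status": "positive"},
]


def group_sizes(rows, fields):
    """Count how many rows share each combination of values in `fields`."""
    return Counter(tuple(row[f] for f in fields) for row in rows)


def unique_rows(rows, fields):
    """Return the rows whose combination of `fields` occurs exactly once."""
    sizes = group_sizes(rows, fields)
    return [row for row in rows if sizes[tuple(row[f] for f in fields)] == 1]


# On its own, 'sex' singles nobody out in this toy dataset...
alone = unique_rows(records, ["sex"])
# ...but combining it with age and hospital isolates some subjects.
combined = unique_rows(records, ["sex", "age", "hospital"])
print(len(alone), len(combined))
```

A record that is unique on its indirect identifiers is the kind of record that could be matched against another dataset, which is why the sensitive fields attached to it (here the hypothetical `hiv_status`) deserve particular care.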
In the above example, the Research Data team's advice was threefold:
- remove any parameters not necessary to support the claims of the related paper (e.g., ‘hospital admission date’)
- convert identifying data to ranges, where possible (e.g., dates to date ranges; ages to age ranges)
- give careful consideration to whether inclusion of disease-related parameters is necessary
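The first two pieces of advice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names, the set of fields to keep (`KEEP`) and the ten-year range width are hypothetical, and which parameters are actually needed depends on the claims of the paper in question.

```python
# Hypothetical set of fields needed to support the paper's claims.
KEEP = {"sex", "age_years", "weight_kg"}


def age_range(age_years, width=10):
    """Map an exact age to a range label such as '30-39'."""
    low = (age_years // width) * width
    return f"{low}-{low + width - 1}"


def desensitise(row):
    """Keep only the needed fields and replace the exact age with a range."""
    out = {key: value for key, value in row.items() if key in KEEP}
    out["age_range"] = age_range(out.pop("age_years"))
    return out


# Hypothetical input row, invented for illustration.
row = {
    "sex": "F",
    "age_years": 34,
    "weight_kg": 62.5,
    "date_of_birth": "1980-11-02",
    "hospital_admission_date": "2015-03-14",
}
print(desensitise(row))
```

The third piece of advice, on disease-related parameters, is a judgement call rather than a transformation, so it is not something code can decide for the researcher.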
In de-sensitising the data, the authors were able to reduce the number of parameters by more than 80% without reducing the information supporting the main claims of their article. For example, ‘date of birth’ was removed, but ‘age in months’ was retained. Furthermore, in some cases where specific values had been reported, such as the oxygen level in the blood, the data were converted to instructive binary values: e.g., ‘>90%’ and ‘<90%’.
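As a rough sketch of the two conversions just described, the Python below derives an age in whole months from a date of birth and replaces an exact oxygen reading with one of the two labels. The function names, the choice of reference date and the handling of a reading of exactly 90% are my assumptions; I am not describing how the authors actually processed their data.

```python
from datetime import date


def age_in_months(date_of_birth, reference_date):
    """Whole months elapsed between date_of_birth and reference_date.

    Assumption: the reference date (e.g. admission) is used only for this
    calculation and is not itself shared.
    """
    months = (reference_date.year - date_of_birth.year) * 12
    months += reference_date.month - date_of_birth.month
    if reference_date.day < date_of_birth.day:
        months -= 1  # the current month has not been completed yet
    return months


def oxygen_category(saturation_percent, threshold=90.0):
    """Replace an exact reading with '>90%' or '<90%'.

    Assumption: a reading exactly at the threshold returns None so that it
    can be reviewed by hand, since neither label covers it.
    """
    if saturation_percent > threshold:
        return ">90%"
    if saturation_percent < threshold:
        return "<90%"
    return None


# Hypothetical values, invented for illustration.
print(age_in_months(date(2014, 1, 15), date(2015, 3, 14)))
print(oxygen_category(96.0), oxygen_category(84.5))
```

Once the derived values are in place, the original ‘date of birth’ and exact readings can be dropped from the shared file.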
Our Research Data Helpdesk can provide expert help with research data-related questions. Please get in touch via: Research.Data@springernature.com.