Through the provision of our Research Data Helpdesk and Research Data Support curation service, we have found that sensitive data is one of the most challenging aspects of data sharing for researchers. In spite of this, many stakeholders require or encourage authors to share their research data.
While researchers are encouraged to share their data openly where possible, examples where census data and even Netflix viewing data have been used to re-identify individuals show that there are risks involved. Researchers working with sensitive data should still be able to gain credit for their work and allow others to reuse their data, and with careful processing these data can often be shared safely.
In the course of providing our Research Data Support service, the data publishing team checks for any aspects of a researcher’s data which could potentially be sensitive. Here I’ll talk briefly about the potentially sensitive data we encounter and how researchers could de-sensitise such data.
First, what is sensitive data? In the context of data sharing, sensitive data is anything that could potentially identify participants in a study. Most obviously: name, date of birth, email address, etc. However, as well as these ‘direct’ identifiers, participants could also be identified by more ‘indirect’ identifiers once combined, such as age, sex, race, religion, place of birth/treatment/employment, etc. Some categories of sensitive data are considered to be particularly risky as they reveal highly personal information, such as health/disease status, which could have an impact on study participants’ insurance or job security.
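To make the point about combined indirect identifiers concrete, here is a minimal Python sketch. It counts how many records share each combination of values and reports the size of the smallest group; a group of one means that a single participant is singled out by that combination. The function name `smallest_group_size`, the field names and the records are all hypothetical, invented for illustration, and this is not a description of the checks our team performs.

```python
from collections import Counter


def smallest_group_size(records, indirect_identifiers):
    """Return the size of the smallest group of records that share the
    same combination of values for the given indirect identifiers."""
    groups = Counter(
        tuple(record[field] for field in indirect_identifiers)
        for record in records
    )
    # An empty dataset has no groups at all.
    return min(groups.values()) if groups else 0


# Hypothetical records, invented for illustration only.
records = [
    {"age": 34, "sex": "F", "hospital": "A"},
    {"age": 34, "sex": "F", "hospital": "A"},
    {"age": 34, "sex": "M", "hospital": "A"},
    {"age": 51, "sex": "M", "hospital": "B"},
    {"age": 51, "sex": "M", "hospital": "B"},
]

# On its own, 'sex' leaves every participant in a group of at least two.
print(smallest_group_size(records, ["sex"]))

# Combined, the three fields isolate one participant (34, M, hospital A).
print(smallest_group_size(records, ["age", "sex", "hospital"]))
```

The smaller the smallest group, the easier it is to pick out an individual, which is why fields that look harmless in isolation still need attention once they are combined.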
Of course, when researchers collect data, they often collect much more than is actually needed to answer their specific research question. This isn’t a bad thing: it could help in ensuring the study population is representative of the general population, and allow researchers to uncover trends that go beyond their original hypothesis. However, it could also lead to sensitive data being needlessly shared.
For example, we recently dealt with a large spreadsheet of around 200 parameters reported for over 20 thousand subjects. The data included ‘date of birth’, ‘hospital admission date’, ‘sex’, ‘weight’, ‘HIV status’, et cetera, as well as specifying the hospital into which subjects were admitted. Though the subjects may not be identifiable from any one piece of data, when these data are taken together, there is certainly a risk. Furthermore, there is an additional concern when the data contain sensitive aspects such as disease status.
In the above example, the Research Data Support advice was threefold:
- remove any parameters not necessary to support the claims of the related paper (e.g., ‘hospital admission date’)
- convert identifying data to ranges, where possible (e.g., convert dates to date ranges; ages to age ranges)
- give careful consideration to whether inclusion of disease-related parameters is necessary
In de-sensitising the data, the authors were able to reduce the number of parameters by more than 80% without reducing the information supporting the main claims of their manuscript. For example, ‘date of birth’ was removed, but ‘age in months’ was retained. Furthermore, in some cases where specific values had been reported, such as the oxygen level in the blood, the data were converted to instructive binary values: e.g., ‘>90%’ and ‘<90%’.
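The kind of processing described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors’ actual workflow: the field names, the 12-month range width and the helper functions `age_range`, `oxygen_category` and `desensitise` are all assumptions. In particular, the sketch places a reading of exactly 90% in the upper category, which is a choice made here for the example.

```python
def age_range(age_months, width_months=12):
    """Replace an exact age in months with a range, e.g. 16 -> '12-23'."""
    lower = (age_months // width_months) * width_months
    return f"{lower}-{lower + width_months - 1}"


def oxygen_category(saturation_percent, threshold=90):
    """Replace an exact oxygen saturation with a binary category.

    Assumption: a value exactly at the threshold goes in the upper category.
    """
    if saturation_percent >= threshold:
        return f">={threshold}%"
    return f"<{threshold}%"


def desensitise(record, drop_fields):
    """Drop unneeded fields, then coarsen the remaining identifying values."""
    cleaned = {k: v for k, v in record.items() if k not in drop_fields}
    if "age_months" in cleaned:
        cleaned["age_months"] = age_range(cleaned["age_months"])
    if "oxygen_saturation" in cleaned:
        cleaned["oxygen_saturation"] = oxygen_category(
            cleaned["oxygen_saturation"]
        )
    return cleaned


# Hypothetical record, invented for illustration only.
record = {
    "date_of_birth": "2015-03-02",
    "hospital_admission_date": "2016-07-19",
    "age_months": 16,
    "oxygen_saturation": 87,
    "sex": "F",
}

shared = desensitise(record, {"date_of_birth", "hospital_admission_date"})
print(shared)
```

Which fields to drop and how wide to make each range depends on what the related paper actually needs in order to support its claims, so those decisions remain with the researcher.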
The Research Data Support team are experienced specialists in identifying sensitive data. As well as working with individual authors to anonymise their data, we also have partnerships with journals such as npj Breast Cancer and Wellcome Open Research. In these partnerships, Research Data Support ensure all submitting authors are aware of any aspects of their data that are potentially sensitive.
If you feel your journal would benefit from such a service, please get in touch: Research.Data@springernature.com.