One of the most challenging aspects of data sharing is helping authors understand what kind of data they have to share, why they should share them, and how to share them in a way that significantly improves their discoverability, accessibility and reusability by secondary researchers. It is not always clear to authors what we mean by raw data. The precise definition of raw data is “data that have not been processed for use”. If the sharing of raw data is mandated, as it is for DNA sequences and genotype data, then these data have to be deposited in suitable repositories so that the research community can put them to maximum use and, in many cases, translate them into patient benefit.
On some occasions, raw data are analysed more than once before being used to derive the figures and tables in a published article. Even though researchers do not have to share these “semi-processed” data, doing so can lead to a better understanding of the type of analysis and the program or software that was used to generate specific figures in an article.
When we talk about raw data, we mean data that have come straight from an experiment and have not been subjected to processing, manipulation or “cleaning” by researchers to remove outliers. They are the first data that can be recorded. Examples include daily lists of customers’ purchases in a busy supermarket, the number of cells counted in patient blood samples, and thermometer temperature recordings of a sample during the course of an experiment. Processed data are data that have been analysed in such a way as to show a result or feature. This processed, “publishable” form of the data forms the basis of a study’s conclusions and interpretations and allows authors to communicate their results to other researchers in the community. Whether they are part of a published article or not, raw data can often be more valuable and reusable than analysed data, as they can be processed in a number of different ways to derive novel results. For example, a study that reclassified a type of breast cancer into cell-based subtypes used only publicly available datasets, and this analysis led to the identification of rare breast cancer subtypes that had not previously been picked up, raising the potential for personalised cancer diagnosis and treatment.
Think about the type of data you have generated during the course of a study. What are you going to do with the data, and what is the best way to report and share them? For example, consider a busy supermarket that collects huge volumes of raw data each day about customers’ purchases. Data in the form of long lists of customers’ purchases, along with the prices of items and the time and date of purchases, will yield very little, if any, useful information until they are processed. However, this does not hold for every kind of data; clinical data are one exception. The benefits of sharing raw medical research data, while taking into account patient confidentiality, anonymity and consent for publication, are increasingly recognised not only by researchers but also by patients themselves. In clinical research, for instance, sharing data helps researchers identify patterns of disease that would otherwise remain concealed, leading to new methods of predicting, diagnosing, and treating illness.
What can researchers gain by sharing all the research data that underlie their published articles? The benefits of sharing data extend beyond the academic community. Sharing data increases the impact and visibility of research, allows the data to be independently validated by other researchers, and provides valuable resources for education and training. Data sharing also confirms an author’s ownership of their data and opens new doors for collaboration with other researchers who find and use those data.
Do you still need help identifying which part of your data you should share? Contact our Research data helpdesk.
Image credit: wocintechat.com on Flickr