The basics of data citation

Researchers often ask us, ‘What is a data citation? When - and how - should I cite data?’ In this post, you’ll find the answers to some frequently asked questions, to help you get started with data citations.

What is a citation, and what is a data citation?

Citations allow researchers to acknowledge the contributions of others. The traditional mechanism for sharing scholarly output is the peer-reviewed article, and researchers are trained to cite previous articles that describe ideas, results or lines of reasoning that materially affect their own work.

The research article has often been used as a stand-in for all other scholarly outputs, but researchers are increasingly sharing other research outputs, such as data, source code and protocols, via stable online repositories. These too can and should be directly cited.

Data citations are a formal way to record and acknowledge any externally hosted datasets mentioned in a manuscript. They help link an article to related data, and provide credit for data producers, just as traditional literature citations help link to and credit other peer-reviewed works mentioned by the authors. Cited datasets are generally hosted in data repositories [1].

Data citations should conform to the Joint Declaration of Data Citation Principles (JDDCP). If you are interested in learning more about data citation, we recommend reading these principles and the related paper by Starr et al. (2015).

What should I cite when there is both a paper and an accompanying dataset?

Cite what you used. If you are primarily referring to findings or ideas in an article then citing the paper is sufficient. If you used associated datasets, especially data archived outside of the article and its supplementary material, then you should cite the data. Often it will be appropriate to cite both: the paper and any datasets you used.

Which data should I cite?

Any stably archived data relevant to your publication should be formally cited with its persistent identifier [2], regardless of where in the manuscript, or why, the data are mentioned. In addition, any data from other researchers that you used to generate or validate your study should also be cited.

How do I format my data citations?

Following established guidance from DataCite, your data citation should include the following fields.

  1. Creator(s)/Author(s)
    The creators of the dataset, who may be distinct from the authors of the related manuscript.
  2. Dataset Title
    The title of the dataset, as recorded at the repository.
  3. Repository Name
    For DataCite DOIs this should align with the “Publisher” field in DataCite metadata.
  4. Dataset identifier
    This will be a DataCite DOI or an appropriate repository accession ID.
  5. Dataset Publication Year
    The year the data were made publicly available.

It is important that the data citation includes only information that is present in the metadata associated with the data record. For example, if a dataset does not have clearly defined data creators or authors, or a field to record the dataset title, this information should not be invented or estimated (e.g. by looking at related publications). Filling such gaps may be a well-intentioned way of giving due credit, but it risks entering erroneous information into the citation record.
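To make the field list and the "do not invent metadata" rule concrete, here is a minimal Python sketch. The `DatasetRecord` class, the `format_data_citation` function, the example names and the placeholder DOI are all hypothetical, and the punctuation and field order are assumptions made for illustration: your journal's reference style will dictate the exact layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Fields copied from a repository's metadata record (hypothetical structure)."""
    creators: Optional[List[str]] = None
    title: Optional[str] = None
    repository: Optional[str] = None
    identifier: Optional[str] = None  # DOI or repository accession ID
    year: Optional[int] = None


def format_data_citation(record: DatasetRecord) -> str:
    """Join the fields that are present; skip missing ones rather than guessing."""
    # Assumption: a citation without a persistent identifier is not usable.
    if not record.identifier:
        raise ValueError("a data citation needs a persistent identifier")

    parts = []
    if record.creators:
        parts.append("; ".join(record.creators).rstrip("."))
    if record.title:
        parts.append(record.title.rstrip("."))
    if record.repository:
        parts.append(record.repository.rstrip("."))

    # Assumed layout: the publication year follows the identifier in parentheses.
    tail = record.identifier
    if record.year is not None:
        tail += f" ({record.year})"
    parts.append(tail)

    return ". ".join(parts) + "."


# Placeholder values only; none of these refer to a real dataset.
full = DatasetRecord(
    creators=["Doe, J.", "Roe, R."],
    title="Example survey data",
    repository="Example Repository",
    identifier="https://doi.org/10.1234/example",
    year=2020,
)
print(format_data_citation(full))

# A record with no creators or title in its metadata: those fields are left out.
sparse = DatasetRecord(repository="Example Repository", identifier="EX000001")
print(format_data_citation(sparse))
```

Note that the sparse record yields a shorter citation rather than one padded with details borrowed from a related paper.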

And it is that simple! Getting data citations right ensures that the data you have generated, and the data you relied on for your work, are included in data citation infrastructure, giving proper credit to those who generate and share valuable research data.

Do you have a question about research data? 

Get free help and advice on sharing your research data: visit our research data help desk.


1 If you are not sure where to deposit your data, we provide a list of recommended repositories, with which we have established publication workflows to make data sharing and publication as easy as possible for researchers.

2 Persistent identifiers (PIDs) are assigned to each dataset by the repository hosting the data. The data PID may be an accession identifier, such as those assigned by repositories that are part of the National Center for Biotechnology Information (NCBI). However, many repositories, especially those outside the biomedical domain, will assign a digital object identifier (DOI) to hosted data.

Photo by Christian Lue on Unsplash
