Coronavirus Open Citations Dataset

Citations made in the last twenty years using open data provided by OpenCitations and Crossref

The Coronavirus Open Citations Dataset

The Coronavirus Open Citations Dataset curated by OpenCitations currently contains (as of 16 May 2020) information about 189,697 citations and about the 49,719 citing or cited articles involved in these citations. The full dataset, used for the visualization below, is stored in JSON format on Zenodo under a Creative Commons CC0 waiver, to enable anyone to use these data for any purpose:

Peroni, S. (2020). Coronavirus Open Citations Dataset. Version 2.0. Zenodo.

Currently, this dataset also includes citations to and from the articles related to the COVID-19 pandemic, since these were recently added to COCI, the OpenCitations Index of Crossref open DOI-to-DOI citations. COCI is the most extensive dataset of open citation data released by OpenCitations, and its current version, published on 12 May 2020, includes citations of articles registered in Crossref up to the beginning of May 2020. We will release new versions of this Coronavirus Open Citations Dataset on Zenodo following each future release of COCI (at bimonthly intervals) and also when significant volumes of new coronavirus citation data are uploaded by the scholarly community to CROCI, the Crowdsourced Citation Index.

Why, and by (approximately) how much, open citations are lacking

Even after future releases of COCI, we will still be missing many relevant citations: those coming from articles in journals whose publishers are not participating in the Initiative for Open Citations (I4OC) and are not opening their reference lists at Crossref, and those coming from preprints whose reference lists are likewise not deposited in Crossref.

The list of Crossref DOIs of coronavirus-related publications for which we currently lack citations is available in a CSV file.

C4: the Campaign to Crowdsource Coronavirus Citations

You can help to improve this situation by sharing in CROCI the coronavirus citation data to which you have legal access.

CROCI, the Crowdsourced Citation Index, is an OpenCitations Index containing citations deposited by individuals. We described the rationale and the main ideas behind the development of CROCI in an article published in the proceedings of the 17th International Conference on Scientometrics and Informetrics, held in Rome in September 2019. The same article explains the procedure to follow for adding new citation data to the CROCI GitHub repository.

To add small or large amounts of coronavirus citation data, please follow the procedure described in the CROCI readme file.

The visualization

The following simple interactive visualization, developed just to show what might be possible using this dataset, shows the most relevant articles reporting investigations of coronaviruses in the past twenty years. These fall into three distinct periods characterised by three primary related diseases: SARS, MERS and COVID-19.

For the purpose of this initial visualization, we consider only citations where both the citing article and the cited article are on the topic of coronaviruses, and of these include only the articles that received, overall, at least twenty citations per year since their publication date. The resulting dataset contains 902 citations between 109 articles.
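The selection rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used for the visualization: the function name `select_core_citations`, the record layout (dictionaries with `citing` and `cited` keys), the two lookup tables, and the way the yearly rate is computed (total citations divided by the number of calendar years since publication, counting the publication year itself) are all hypothetical.

```python
def select_core_citations(citations, pub_year, received, ref_year=2020, min_rate=20):
    """Keep only citations whose citing and cited articles both pass the threshold.

    citations: iterable of dicts with "citing" and "cited" DOIs (hypothetical layout).
    pub_year:  dict mapping the DOI of each on-topic article to its publication year.
    received:  dict mapping a DOI to the total number of citations it received.
    """
    def yearly_rate(doi):
        # Assumption: years are counted inclusively, so an article published
        # in ref_year is treated as having been available for one year.
        years = max(1, ref_year - pub_year[doi] + 1)
        return received.get(doi, 0) / years

    # On-topic articles that reach the minimum number of citations per year.
    core = {doi for doi in pub_year if yearly_rate(doi) >= min_rate}

    # A citation is kept only when both of its endpoints are in that set.
    return [c for c in citations if c["citing"] in core and c["cited"] in core]
```

Because articles that are off-topic never appear in `pub_year`, any citation involving them is dropped by the same membership check.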

Users can filter to select a subset of these citations, based on the family name of an author of the visualized articles.

The closer an article lies to the centre of the display circle, the greater its number of links to the other articles displayed. Selecting an article by clicking on its symbol changes its margin to black. Its bibliographic metadata is then shown on the right while, within the display circle, its citation links to all the articles in its reference list are shown by light blue lines, and the symbols for these cited papers are shown with a bold light blue margin. The citations it receives from other publications are shown by light red lines, and the symbols for these citing articles are shown with a bold light red margin.
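The two quantities behind this behaviour, the number of links per article and the references and citations of a selected article, can be computed from the citation list roughly as follows. This is a minimal sketch, not the visualization's actual code: the function names `link_counts` and `relations_of` and the `citing`/`cited` record layout are assumptions made for illustration.

```python
from collections import Counter


def link_counts(citations):
    """Count, for every article, the citation links it takes part in."""
    counts = Counter()
    for c in citations:
        counts[c["citing"]] += 1  # outgoing link (a reference)
        counts[c["cited"]] += 1   # incoming link (a citation received)
    return counts


def relations_of(selected, citations):
    """Return (references, citing): the articles the selected article cites,
    and the articles that cite the selected article."""
    references = {c["cited"] for c in citations if c["citing"] == selected}
    citing = {c["citing"] for c in citations if c["cited"] == selected}
    return references, citing
```

In the display, the first set would correspond to the light blue links and the second to the light red ones.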


Citations received: citations ≤ 50; 50 < citations ≤ 200; 200 < citations ≤ 500; > 500 citations.

Publication dates: pre-SARS period (before 2003); SARS period (between 2003 and 2011); MERS period (between 2012 and 2019); COVID-19 period (since 2020).

Article and relations: selected article; references of selected article; citations to selected article.
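The legend categories above amount to two simple classification rules, sketched here in Python. The function names `citation_band` and `publication_period` are hypothetical, and the year boundaries are read as inclusive, which is an assumption based on the wording of the legend.

```python
def citation_band(count):
    """Map a number of received citations to its legend band."""
    if count <= 50:
        return "citations ≤ 50"
    if count <= 200:
        return "50 < citations ≤ 200"
    if count <= 500:
        return "200 < citations ≤ 500"
    return "> 500 citations"


def publication_period(year):
    """Map a publication year to its legend period (bounds taken as inclusive)."""
    if year < 2003:
        return "pre-SARS period"
    if year <= 2011:
        return "SARS period"
    if year <= 2019:
        return "MERS period"
    return "COVID-19 period"
```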


How it works

We developed a Python notebook to retrieve all the data used in the visualization. First, we obtained a list of relevant articles about coronaviruses using the Crossref API, selecting all articles that contain any of the words "coronavirus", "covid19", "sarscov", "ncov2019", or "2019ncov" in either the title or the abstract (total: 11,842 articles, as of 14 May 2020). Then, we used the DOIs of these articles to retrieve, from all the OpenCitations Indexes via the unifying REST API, every citation that involves one of them as either the citing or the cited entity, obtaining 189,697 citations. Finally, we used the Crossref API again to retrieve bibliographic metadata (i.e. authors, year of publication, title, publication venue, and DOI) for the articles involved in the citations retrieved (total: 49,719 articles, as of 14 May 2020). Where Crossref did not return bibliographic metadata for an article, we completed the metadata by hand.
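The citation-retrieval step of this workflow might look roughly like the sketch below. It is not the notebook itself. The base URL, the `citations` and `references` operations, and the `oci` field used for deduplication reflect our understanding of the OpenCitations unifying REST API and should be treated as assumptions to check against its documentation. All function names are hypothetical, and the Crossref search and metadata steps are left out.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the unifying REST API over all OpenCitations Indexes.
OC_INDEX_API = "https://opencitations.net/index/api/v1"


def citation_urls(doi):
    """Build the URLs for the incoming and outgoing citations of a DOI."""
    safe = quote(doi, safe="/")
    return [f"{OC_INDEX_API}/citations/{safe}", f"{OC_INDEX_API}/references/{safe}"]


def fetch_json(url):
    """Download a URL and decode the JSON list it is assumed to return."""
    with urlopen(url) as response:
        return json.load(response)


def merge_citations(batches):
    """Merge lists of citation records, keeping one record per OCI.

    A citation between two on-topic articles is returned twice (once as a
    reference of the citing article, once as a citation of the cited one),
    so duplicates are dropped using the assumed "oci" identifier field.
    """
    merged = {}
    for batch in batches:
        for record in batch:
            merged[record["oci"]] = record
    return list(merged.values())


def collect_citations(dois, fetch=fetch_json):
    """Retrieve every citation involving any of the given DOIs."""
    batches = (fetch(url) for doi in dois for url in citation_urls(doi))
    return merge_citations(batches)
```

Passing the `fetch` function as a parameter is a design choice for this sketch: it lets the merging logic be exercised with canned responses, without contacting the live service.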

We created the visualization using Cytoscape.js and jQuery. We used Pure.css to define the layout of the website. All the software, data, and additional material used for creating this website and the related data are available in our GitHub repository.


For any comment, suggestion, improvement, or critique, please do not hesitate to contact Silvio Peroni directly. He is responsible for the development of all the software and data for this visualization, and for the releases of the Coronavirus Open Citations Dataset. You can reach him easily via email or Twitter.

If you would like to help in any way (collecting new missing citation data, developing new interactive Web visualisations for the data, writing promotional material such as posts and tweets for the dataset and the campaign, coordinating possible efforts, etc.), please do not hesitate to get in touch with us.