
ORCID-DOI index

The orcid_process.py script extracts DOI-author associations from ORCID XML summary files. The output is a CSV index that maps each DOI to the authors who claimed it in their ORCID profile.

uv run python -m oc_meta.run.orcid_process \
-out <output_path> \
-s <summaries_path> \
[-t <threshold>]

| Parameter | Description |
| --- | --- |
| -out, --output | Output directory for CSV files |
| -s, --summaries | Directory containing ORCID XML summaries (scanned recursively) |
| -t, --threshold | Number of files to process before saving a CSV chunk (default: 10000) |

The script expects ORCID public data summaries in XML format. These are distributed as part of the ORCID public data file.

The downloaded archive must be extracted before processing:

tar -xzf ORCID_2024_10_summaries.tar.gz -C /path/to/destination/

Each XML file contains an ORCID profile with external identifiers. The script extracts DOIs marked with relationship type “self” (i.e., works authored by the profile owner).
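The filtering rule above can be sketched with the standard library. This is an illustration, not the script's actual code: the element names (`external-id`, `external-id-type`, `external-id-value`, `external-id-relationship`) and the namespaces in the sample are assumptions about the ORCID summary layout, and `extract_self_dois` is a hypothetical helper name. Matching on local names keeps the sketch independent of the exact namespace URIs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample; real summaries contain much more data.
SAMPLE = """<record:record xmlns:record="http://www.orcid.org/ns/record"
               xmlns:common="http://www.orcid.org/ns/common">
  <common:external-id>
    <common:external-id-type>doi</common:external-id-type>
    <common:external-id-value>10.1000/example.1</common:external-id-value>
    <common:external-id-relationship>self</common:external-id-relationship>
  </common:external-id>
  <common:external-id>
    <common:external-id-type>doi</common:external-id-type>
    <common:external-id-value>10.1000/example.2</common:external-id-value>
    <common:external-id-relationship>part-of</common:external-id-relationship>
  </common:external-id>
  <common:external-id>
    <common:external-id-type>issn</common:external-id-type>
    <common:external-id-value>1234-5678</common:external-id-value>
    <common:external-id-relationship>self</common:external-id-relationship>
  </common:external-id>
</record:record>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_self_dois(xml_text):
    """Return the DOIs whose relationship is 'self', in document order."""
    root = ET.fromstring(xml_text)
    dois = []
    for element in root.iter():
        if local_name(element.tag) != "external-id":
            continue
        # Collect the child fields of this identifier by local name.
        fields = {local_name(child.tag): (child.text or "").strip() for child in element}
        is_doi = fields.get("external-id-type", "").lower() == "doi"
        is_self = fields.get("external-id-relationship", "").lower() == "self"
        value = fields.get("external-id-value", "")
        if is_doi and is_self and value and value not in dois:
            dois.append(value)
    return dois


print(extract_self_dois(SAMPLE))
```

In this sample only the first identifier is kept: the second is a DOI with a different relationship, and the third is a "self" identifier that is not a DOI.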

The output is a set of CSV files with two columns:

| Column | Content |
| --- | --- |
| id | DOI |
| value | Author name and ORCID in the format Surname, Given [0000-0000-0000-0000] |

Multiple authors can be associated with the same DOI if they all claimed it in their profiles.
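As a rough sketch of how the index might be consumed, the snippet below groups rows by DOI and splits each value into name and ORCID. It assumes a header row with the column names `id` and `value` and one author per row; neither detail is confirmed here, and `load_index` and the sample rows are hypothetical.

```python
import csv
import io
import re
from collections import defaultdict

# Matches "Surname, Given [0000-0000-0000-0000]"; the last ORCID character may be X.
VALUE_PATTERN = re.compile(r"^(?P<name>.*?)\s*\[(?P<orcid>\d{4}-\d{4}-\d{4}-\d{3}[\dX])\]$")


def load_index(csv_file):
    """Map each DOI to a list of (name, orcid) pairs read from an open CSV file."""
    index = defaultdict(list)
    for row in csv.DictReader(csv_file):
        match = VALUE_PATTERN.match(row["value"].strip())
        if match is None:
            continue  # Skip rows that do not follow the expected format.
        index[row["id"]].append((match["name"], match["orcid"]))
    return dict(index)


# Hypothetical sample: two authors claimed the same DOI.
SAMPLE_CSV = io.StringIO(
    "id,value\n"
    '10.1000/example.1,"Doe, Jane [0000-0001-2345-6789]"\n'
    '10.1000/example.1,"Roe, Richard [0000-0002-3456-789X]"\n'
)

index = load_index(SAMPLE_CSV)
print(index["10.1000/example.1"])
```

The value column contains a comma, so it has to be quoted in the CSV; the `csv` module handles that quoting on both read and write.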

uv run python -m oc_meta.run.orcid_process \
-out ./orcid_index \
-s ./ORCID_2023_10_summaries \
-t 50000

This processes all XML files under ORCID_2023_10_summaries, including subdirectories, and saves a CSV chunk every 50,000 files.

The script tracks which ORCID IDs have already been processed. If a run is interrupted, the next run skips the files that were already processed.
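One way such a resume step could work is sketched below. It is not taken from the script: it assumes that processed IDs are recovered from the bracketed ORCIDs in the CSV chunks already written, and that each summary file is named after its ORCID (for example `0000-0001-2345-6789.xml`). The function names `processed_orcids` and `pending_files` are hypothetical.

```python
import csv
import re
from pathlib import Path

ORCID_PATTERN = re.compile(r"\[(\d{4}-\d{4}-\d{4}-\d{3}[\dX])\]")


def processed_orcids(output_dir):
    """Collect every ORCID that appears in the CSV chunks of output_dir."""
    seen = set()
    for csv_path in Path(output_dir).glob("*.csv"):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                seen.update(ORCID_PATTERN.findall(row.get("value") or ""))
    return seen


def pending_files(summaries_dir, output_dir):
    """List XML summaries (searched recursively) whose ORCID is not yet in the output."""
    done = processed_orcids(output_dir)
    return sorted(
        path for path in Path(summaries_dir).rglob("*.xml") if path.stem not in done
    )
```

Under these assumptions, a profile that yields no DOI never appears in the output and would be parsed again on every run, so a real implementation may keep a separate record of visited files.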