ORCID-DOI index

The orcid_process.py script extracts DOI-author associations from ORCID XML summary files. The output is a CSV index that maps each DOI to the authors who claimed it in their ORCID profile.

Usage

uv run python -m oc_meta.run.orcid_process \
    -out <output_path> \
    -s <summaries_path> \
    [-t <threshold>]

Parameters

Parameter	Description
`-out, --output`	Output directory for CSV files
`-s, --summaries`	Directory containing ORCID XML summaries (scanned recursively)
`-t, --threshold`	Number of files to process before saving a CSV chunk (default: 10000)

Input

The script expects ORCID public data summaries in XML format. These can be downloaded from the ORCID public data file.

The downloaded archive must be extracted before processing:

tar -xzf ORCID_2024_10_summaries.tar.gz -C /path/to/destination/

Each XML file contains an ORCID profile with external identifiers. The script extracts DOIs marked with relationship type “self” (i.e., works authored by the profile owner).

Output

CSV files with two columns:

Column	Content
`id`	DOI
`value`	Author name and ORCID in format `Surname, Given [0000-0000-0000-0000]`

Multiple authors can be associated with the same DOI if they all claimed it in their profiles.

Example

uv run python -m oc_meta.run.orcid_process \
    -out ./orcid_index \
    -s ./ORCID_2023_10_summaries \
    -t 50000

This processes all XML files in ORCID_2023_10_summaries and saves a CSV chunk every 50000 files.

Resume support

The script tracks processed ORCID IDs. If interrupted, it skips already processed files on the next run.