ORCID-DOI index
The orcid_process.py script extracts DOI-author associations from ORCID XML summary files. The output is a CSV index that maps each DOI to the authors who claimed it in their ORCID profile.
Usage
uv run python -m oc_meta.run.orcid_process \
  -out <output_path> \
  -s <summaries_path> \
  [-t <threshold>]
Parameters
| Parameter | Description |
|---|---|
| `-out` | Output directory for CSV files |
| `-s` | Directory containing ORCID XML summaries (scanned recursively) |
| `-t` | Number of files to process before saving a CSV chunk (default: 10000) |
Input
The script expects ORCID public data summaries in XML format. These are distributed as part of the annual ORCID Public Data File.
The downloaded archive must be extracted before processing:
tar -xzf ORCID_2024_10_summaries.tar.gz -C /path/to/destination/
Each XML file contains an ORCID profile with external identifiers. The script extracts DOIs marked with relationship type “self” (i.e., works authored by the profile owner).
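The extraction rule above can be illustrated with a minimal sketch. It assumes the external identifier layout found in ORCID record XML (an `external-id` element with `external-id-type`, `external-id-value`, and `external-id-relationship` children) and matches on local tag names to sidestep namespace details. The sample document and the function name `extract_self_dois` are hypothetical, and this is not necessarily how `orcid_process.py` implements the step.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily simplified sample (assumption): real ORCID summaries
# contain many more elements around the external identifiers.
SAMPLE = """\
<record:record xmlns:record="http://www.orcid.org/ns/record"
               xmlns:common="http://www.orcid.org/ns/common">
  <common:external-ids>
    <common:external-id>
      <common:external-id-type>doi</common:external-id-type>
      <common:external-id-value>10.1000/example.1</common:external-id-value>
      <common:external-id-relationship>self</common:external-id-relationship>
    </common:external-id>
    <common:external-id>
      <common:external-id-type>doi</common:external-id-type>
      <common:external-id-value>10.1000/example.2</common:external-id-value>
      <common:external-id-relationship>part-of</common:external-id-relationship>
    </common:external-id>
    <common:external-id>
      <common:external-id-type>eid</common:external-id-type>
      <common:external-id-value>2-s2.0-0000000000</common:external-id-value>
      <common:external-id-relationship>self</common:external-id-relationship>
    </common:external-id>
  </common:external-ids>
</record:record>
"""


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def extract_self_dois(xml_text: str) -> list[str]:
    """Return DOI values whose relationship type is 'self' (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    dois = []
    for element in root.iter():
        if local(element.tag) != "external-id":
            continue
        # Collect the child fields of this external identifier by local name.
        fields = {local(child.tag): (child.text or "").strip() for child in element}
        is_doi = fields.get("external-id-type", "").lower() == "doi"
        is_self = fields.get("external-id-relationship", "").lower() == "self"
        value = fields.get("external-id-value", "")
        if is_doi and is_self and value:
            dois.append(value)
    return dois


self_dois = extract_self_dois(SAMPLE)
print(self_dois)
```

In this sample only the first identifier qualifies: the second is a DOI with a different relationship type, and the third is a "self" identifier that is not a DOI.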
Output
CSV files with two columns:
| Column | Content |
|---|---|
| | DOI |
| | Author name and ORCID in format |
Multiple authors can be associated with the same DOI if they all claimed it in their profiles.
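Because one DOI can appear on several rows, a consumer of the index will usually want to group rows by DOI. The following is a rough sketch of that grouping, assuming each row holds the DOI first and the author string second. The sample rows, the placeholder author strings, and the helper name `group_by_doi` are all hypothetical; a real CSV may also start with a header row that would need to be skipped.

```python
import csv
import io
from collections import defaultdict


def group_by_doi(rows):
    """Map each DOI to the list of author strings that claimed it."""
    index = defaultdict(list)
    for doi, author in rows:
        index[doi].append(author)
    return dict(index)


# Illustrative data only (assumption): placeholder author strings, no header row.
sample = io.StringIO(
    "10.1000/example.1,author-a\n"
    "10.1000/example.2,author-b\n"
    "10.1000/example.1,author-c\n"
)

grouped = group_by_doi(csv.reader(sample))
for doi, authors in grouped.items():
    print(doi, authors)
```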
Example
uv run python -m oc_meta.run.orcid_process \
  -out ./orcid_index \
  -s ./ORCID_2023_10_summaries \
  -t 50000
This processes all XML files in ORCID_2023_10_summaries and saves a CSV chunk every 50000 files.
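The effect of the threshold can be pictured with a small sketch: rows are buffered while files are processed, and the buffer is written out as a CSV chunk each time the file counter reaches a multiple of the threshold. This only illustrates the idea under stated assumptions. The function `write_in_chunks`, the `chunk_0001.csv` naming scheme, and the headerless row layout are hypothetical and may differ from what the script actually does.

```python
import csv
import tempfile
from pathlib import Path


def write_in_chunks(per_file_rows, out_dir, threshold=10000):
    """Write buffered rows to a new CSV chunk every `threshold` input files.

    `per_file_rows` yields one list of (doi, author) rows per processed file.
    Returns the paths of the chunk files that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    buffer, written = [], []

    def flush():
        if not buffer:
            return
        # Hypothetical naming scheme: chunk_0001.csv, chunk_0002.csv, ...
        path = out_dir / f"chunk_{len(written) + 1:04d}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(buffer)
        written.append(path)
        buffer.clear()

    for files_seen, rows in enumerate(per_file_rows, start=1):
        buffer.extend(rows)
        if files_seen % threshold == 0:
            flush()
    flush()  # save whatever is left after the last full chunk
    return written


# Five simulated input files with one row each, and a threshold of two files.
with tempfile.TemporaryDirectory() as tmp:
    simulated = [[(f"10.1000/example.{i}", f"author-{i}")] for i in range(5)]
    chunks = write_in_chunks(simulated, tmp, threshold=2)
    chunk_names = [path.name for path in chunks]
    chunk_sizes = [len(path.read_text(encoding="utf-8").splitlines()) for path in chunks]

print(chunk_names, chunk_sizes)
```

A larger threshold means fewer, bigger CSV files and fewer writes, at the cost of more rows held in memory and more work lost if a run is interrupted between saves.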
Resume support
The script tracks which ORCID IDs it has already processed. If a run is interrupted, the next run skips the files that were already processed.
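The skip logic can be sketched as a simple filter over the input files. This is a minimal illustration and not the script's actual implementation: it assumes each summary file is named after its ORCID ID (for example `0000-0000-0000-0001.xml`) and that the set of processed IDs has already been loaded from somewhere. The helper name `pending_files` and the sample paths are hypothetical.

```python
from pathlib import Path


def pending_files(xml_paths, processed_orcids):
    """Keep only the files whose ORCID ID (taken from the file name) is unprocessed."""
    return [path for path in xml_paths if Path(path).stem not in processed_orcids]


# Made-up paths and IDs, for illustration only.
all_files = [
    Path("summaries/001/0000-0000-0000-0001.xml"),
    Path("summaries/002/0000-0000-0000-0002.xml"),
    Path("summaries/003/0000-0000-0000-0003.xml"),
]
already_done = {"0000-0000-0000-0001", "0000-0000-0000-0003"}

todo = pending_files(all_files, already_done)
print([path.name for path in todo])
```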