DATA + CODE: REPOSITORY PERSISTENCE DATA AND RESULTS OF TEXT MINING SCHOLARLY LITERATURE

== GEORGE MACGREGOR | https://purl.org/g3om4c | 2025-11-24 ==
==============================================================================================================
==============================================================================================================
This README file documents:

1) 4 x .js scripts, used to query several APIs in order to generate data.
2) 2 x .csv data files, containing data arising from the aforementioned API interrogation, merged to generate a larger, more meaningful dataset. This README file explains the nature of the data and its structure.  

The resulting data underpins analysis published in the following forthcoming research paper:

*** Macgregor, G. & Davidson, J. (2026) Examining the persistence of European open repository infrastructure and its diffusion in the scholarly record. [Journal of publication TBC]. Preprint available: https://doi.org/10.48550/arXiv.2601.04015

Data arising are extensive and may therefore be reused in subsequent publications exploring similar topics.

All data were captured, and the datasets compiled, during June 2025.
==============================================================================================================
==============================================================================================================
\\\\ SCRIPTS \\\\

Several machine interfaces were interrogated to gather data. This was performed using a number of scripts, all of which are reproduced here:

1) opendoar-registry.js
2) http-response-check.js
3) wayback-machine.js
4) core-tdm.js

All scripts were deployed within Google Apps Script, writing response data to Google Sheets, which were then exported as .csv files for merging and analysis. See: https://developers.google.com/apps-script

Scripts include in-line comments to assist in their reuse.

An explanation of the methodological motivations for deploying these data sources is provided in [1]. 

===opendoar-registry.js===
Three repository registries were interrogated to generate data, one of which (OpenDOAR) supports a RESTful API. OpenDOAR data are exposed as JSON objects via the Jisc Open Policy Finder API (formerly the Sherpa API - https://openpolicyfinder.jisc.ac.uk/). The script assists in querying and extracting registry data pertaining to European repositories, including key registry data, such as repository name, home URL, OAI-PMH endpoint location, and country code.

Details of the data extracted from the remaining two repository registries are provided in the description of registry-response-data.csv below.

===http-response-check.js===
Repository locations identified through registry interrogation had their HTTP status verified. This script gathers HTTP response status codes for every repository domain URL and its associated OAI-PMH endpoint. Common HTTP response codes are widely documented by the IETF and IANA. Codes can be interpreted here: https://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml.
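
To assist interpretation of the logged codes, the following is a minimal sketch of how a status code might be grouped into its IANA class. It is not taken from http-response-check.js: the function names statusClass and isReachable are hypothetical, and the rule that 2xx and 3xx codes count as 'reachable' is an assumption made for illustration only.

```javascript
// Hypothetical helper (not part of http-response-check.js): maps an HTTP
// status code to the class given in the IANA status code registry.
function statusClass(code) {
  var classes = {
    1: 'informational',
    2: 'successful',
    3: 'redirection',
    4: 'client error',
    5: 'server error'
  };
  if (!Number.isInteger(code) || code < 100 || code > 599) {
    return 'unknown';
  }
  return classes[Math.floor(code / 100)];
}

// Assumed rule, for illustration only: 2xx and 3xx responses are 'reachable'.
function isReachable(code) {
  var cls = statusClass(code);
  return cls === 'successful' || cls === 'redirection';
}

// In Google Apps Script a code can be obtained with:
//   UrlFetchApp.fetch(url, { muteHttpExceptions: true }).getResponseCode()
console.log(statusClass(200), statusClass(404), statusClass(503));
```

Keeping the classification separate from the HTTP request means the same function can be applied to codes already stored in a sheet or .csv file.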

===wayback-machine.js===
This script queries the Wayback Availability JSON API (https://archive.org/help/wayback_api.php). It was used to capture an approximate 'date of decease' for repositories found to be returning unsatisfactory HTTP response codes (gathered using http-response-check.js). It logs archived snapshot data on the last available website archive, including the archived_snapshot_URL and its associated timestamp.
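
The request and response handling described above can be sketched as follows. This is an illustrative sketch rather than a copy of wayback-machine.js: the function names buildAvailabilityUrl and parseClosestSnapshot are hypothetical, and the sample object simply mirrors the response shape shown in the API documentation. Supplying a recent timestamp so that the 'closest' snapshot approximates the last available one is an assumption.

```javascript
// Builds a request URL for the Wayback Availability JSON API.
// The optional timestamp (e.g. YYYYMMDD) requests the snapshot closest to it.
function buildAvailabilityUrl(targetUrl, timestamp) {
  var query = 'https://archive.org/wayback/available?url=' +
    encodeURIComponent(targetUrl);
  return timestamp ? query + '&timestamp=' + timestamp : query;
}

// Extracts the snapshot URL and timestamp from a parsed API response.
// Returns null when no archived snapshot is reported.
function parseClosestSnapshot(response) {
  var closest = response &&
    response.archived_snapshots &&
    response.archived_snapshots.closest;
  if (!closest || !closest.available) {
    return null;
  }
  return { url: closest.url, timestamp: closest.timestamp };
}

// Sample response, shaped like the example in the API documentation.
var sampleResponse = {
  archived_snapshots: {
    closest: {
      available: true,
      url: 'http://web.archive.org/web/20130919044612/http://example.com/',
      timestamp: '20130919044612',
      status: '200'
    }
  }
};

console.log(buildAvailabilityUrl('example.com', '20250601'));
console.log(parseClosestSnapshot(sampleResponse));
```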

===core-tdm.js===
Script written to query the CORE 'Works' API to mine scholarly literature for specific URIs (https://api.core.ac.uk/docs/v3#tag/Works). This script was deployed to query version 3.0 of the CORE API (using the CORE API Query Language). This script seeks to mine CORE's bibliographic corpus for scholarly works that cite, refer to, or actively use specified repository URIs. Script processes JSON responses from the 'Works' API, logging key bibliographic data elements. This includes work title, authors, documentType, doi, identifiers, id (an internal CORE identifier), oaiIds (OAI identifiers associated with repositories), yearPublished, and depositedDate. Script also processes any full text arising from the fullText element, parsing and extracting in-text references to specified repository URIs, using the URL prefix of repositories with a wildcard. It should be noted that not all works in CORE have full text available for TDM.
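
The extraction of in-text references from the fullText element might look something like the sketch below. It is an approximation under stated assumptions, not the logic of core-tdm.js itself: the function names escapeRegExp and extractRepoRefs are hypothetical, and the choices about where a URL ends and which trailing punctuation is trimmed are illustrative.

```javascript
// Escapes characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Returns the unique URLs in fullText that begin with the given repository
// domain (protocol omitted, e.g. 'repo.example.org'). Prefix matching acts
// as the wildcard: anything following the domain is captured.
function extractRepoRefs(fullText, repoDomain) {
  if (!fullText) {
    return [];
  }
  var pattern = new RegExp(
    'https?://' + escapeRegExp(repoDomain) + '[^\\s"\'<>)\\]]*', 'gi');
  var matches = fullText.match(pattern) || [];
  var seen = {};
  return matches
    .map(function (url) {
      return url.replace(/[.,;:]+$/, ''); // trim trailing sentence punctuation
    })
    .filter(function (url) {
      if (seen[url]) {
        return false; // keep the first occurrence only
      }
      seen[url] = true;
      return true;
    });
}

var sampleText = 'See http://repo.example.org/123/ and ' +
  '(https://repo.example.org/456). Also http://other.org/1 and ' +
  'http://repo.example.org/123/.';
console.log(extractRepoRefs(sampleText, 'repo.example.org'));
```

Because matching is by prefix, a domain such as 'repo.example.org' would also match a longer host that begins with the same string, so results may need manual review.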
==============================================================================================================
==============================================================================================================
\\\\ DATA \\\\

Data are made available as two .csv files. These are:

1) registry-response-data.csv
2) core-wbm-results.csv
==============================================================================================================

=== DATA DESCRIPTION - registry-response-data.csv ===

This file provides 11 columns (columns A to K) and merges data generated by opendoar-registry.js and http-response-check.js.

It also includes registry data derived from the:

1) Registry of Open Access Repositories (ROAR - https://roar.eprints.org/), where a full JSON export of registry data on European repositories was possible.
2) Institutional Archive Registry (IAR), an historic repository registry, a snapshot of which was captured by the Internet Archive's Wayback Machine on 13 June 2006, at which point 750 repositories were registered. Relevant data were web scraped from this snapshot, extracting circa 300 repositories.

The columns can be described as follows:

*** Registry ID (Column A): This column contains IDs assigned to repositories within the source registry. The majority of IDs relate to OpenDOAR. Where no such ID from OpenDOAR was available, they were derived from ROAR. ROAR-assigned IDs are identifiable by their trailing hyphen and uppercase 'R' (e.g. 10834-R). Blank cells in this column indicate a repository discovered via the IAR, where a similar registry identifier was unavailable.
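
When reusing the data, the ID conventions described above can be used to infer the source registry for each row. The following is a minimal sketch; the function name registrySource is hypothetical and is not part of the supplied scripts.

```javascript
// Hypothetical helper: infers the source registry from a Column A value.
//   blank         -> IAR (no registry identifier available)
//   trailing '-R' -> ROAR (e.g. '10834-R')
//   anything else -> OpenDOAR
function registrySource(registryId) {
  var id = String(registryId == null ? '' : registryId).trim();
  if (id === '') {
    return 'IAR';
  }
  if (/-R$/.test(id)) {
    return 'ROAR';
  }
  return 'OpenDOAR';
}

console.log(registrySource('10834-R'), registrySource(''), registrySource(1234));
```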

*** Repo name (Column B): This column contains the known name of the repository about which data are gathered.  

*** Repo home URL (Column C): The registry recorded home URL for the described repository.

*** Repo OAI-PMH location (Column D): The registry recorded OAI-PMH endpoint location for the described repository, where known.

*** Type (Column E): A characterization of the repository type, according to the 'type' scheme employed by the original source registry, whether OpenDOAR, ROAR or IAR.

*** Type - mapped (Column F): A characterization of the repository type, according to the 'type' scheme employed by OpenDOAR. ROAR and IAR repository types were mapped to the OpenDOAR scheme, as described in [1].

*** Software (Column G): The registry recorded software used by the repository described.

*** Country code (Column H): The geographic or national location of the repository described by the data, as denoted by a two-letter country code (ISO 3166-1 alpha-2).

*** HTTP response 1 (domain) (Column I): Result of HTTP response tests on the described repository at the domain level (i.e. 'repo home URL'), logged as the HTTP response code. Common HTTP response codes are widely documented by the IETF and IANA. Codes can be interpreted here: https://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml. 

*** HTTP response 2 (endpoint) (Column J):  Result of HTTP response tests on the described repository at the endpoint level (i.e. 'repo OAI-PMH location'), logged as the HTTP response code. Common HTTP response codes are widely documented by the IETF and IANA. Codes can be interpreted here: https://www.iana.org/assignments/http-status-codes/http-status-codes.xhtml. 

==============================================================================================================

=== DATA DESCRIPTION - core-wbm-results.csv ===

This file is larger, providing 129 columns (columns A to DY). It merges data generated by wayback-machine.js and core-tdm.js. Each row documents a dead repository domain discovered in the scholarly literature, with the dead repository domain denoted in column A. All subsequent columns describe either the scholarly work found to refer to or cite the dead repository content, and the nature of these references, or data pertaining to the dead repository's date of decease.

The columns can be described as follows:

*** Repo domain (Column A): This column contains the described repository's domain, with the http(s):// protocol declaration omitted. Columns B to I describe the scholarly work found to refer to or cite content in the dead repository.

*** CORE ID (Column B): Contains the CORE ID of items found within the CORE corpus that contain references to dead repository content. The CORE ID is a local identifier for scholarly works within CORE.

*** Title (Column C): Column contains the title information associated with the scholarly work found in CORE and being described.

*** Creators (Column D): Column contains the names of the creator(s) associated with the scholarly work found within CORE and being described.

*** Affiliation (Column E): Column contains the affiliation(s) associated with the creator(s) specified in Column D.

*** Date (Column F): Column provides the date of publication associated with the scholarly work being described in columns B-E.

*** Date of deposit (Column G): Column provides the date (as a serial number) on which the described scholarly work was deposited in a known repository, where found by CORE. Note that date of deposit is known in only a subset of cases and so most cells will be unpopulated.

*** Category (Column H): A column denoting the known category type of the scholarly work, e.g. research, thesis, report, etc., as determined by CORE and where known by CORE.

*** DOI (Column I): Column contains the digital object identifier (DOI), where known, of the scholarly work described in the aforementioned columns.

*** Item ref 1 -> Item ref 112 (Columns J -> DQ): Columns contain references or citations within the described scholarly work to known 'dead' repository content, as mined from the scholarly work using the CORE 'Works' API.

*** Total refs (Column DR): Column summarises the total number of HTTP references or citations within the described scholarly work to known 'dead' repository content (as mined via the CORE 'Works' API).

*** Total refs + harvest (Column DS): Column sums the total from Column DR and the total number of references or citations identified via OAI identifiers (as mined via the CORE 'Works' API).

*** Repo home (Column DU): Column containing the root domain of the known 'dead' repository found to have been cited or referenced in the scholarly work noted in Columns B-I.

*** Wayback machine (Column DV): Last known archived_snapshot_URL for 'repo home' (Column DU), as archived by the Internet Archive's Wayback Machine, from which a 'date of decease' for the dead repository is inferred.

*** Date of decease (Column DW): Presumed date of decease for 'repo home' (Column DU), as extracted from the archived_snapshot_URL value in Column DV. Stored as a serial number.
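
For those wishing to reproduce this step, the sketch below shows one way a date could be derived from an archived_snapshot_URL and expressed as a spreadsheet serial number. It is not the method used to build the dataset: the function names are hypothetical, and it assumes the Google Sheets convention in which serial numbers count days from 30 December 1899, with the time of day discarded.

```javascript
// Extracts the 14-digit capture timestamp (YYYYMMDDhhmmss) from a Wayback
// Machine snapshot URL, or returns null if none is present.
function snapshotTimestamp(snapshotUrl) {
  var match = /\/web\/(\d{14})/.exec(snapshotUrl || '');
  return match ? match[1] : null;
}

// Converts a capture timestamp to a whole-day spreadsheet serial number.
// Assumes serial 25569 corresponds to 1970-01-01 (Google Sheets convention).
function timestampToSerial(timestamp) {
  var year = Number(timestamp.slice(0, 4));
  var month = Number(timestamp.slice(4, 6));
  var day = Number(timestamp.slice(6, 8));
  var daysSinceUnixEpoch = Date.UTC(year, month - 1, day) / 86400000;
  return daysSinceUnixEpoch + 25569;
}

var exampleSnapshot =
  'http://web.archive.org/web/20130919044612/http://example.com/';
var exampleTimestamp = snapshotTimestamp(exampleSnapshot);
console.log(exampleTimestamp, timestampToSerial(exampleTimestamp));
```

Reading the first four digits of the same timestamp gives the corresponding year.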

*** Year of decease (Column DX): The year of the date of decease.

*** Ref latency (Column DY): Using data in Column F and Columns DV - DX, this column calculates (in years) the extent to which dead content was or was not being cited in new literature (known in the related research article as 'dead on arrival' references). Negative values denote the number of years *after* repository death that content was cited in published literature. Positive values denote the number of years *before* repository death. #N/A indicates that no reliable date of decease could be found to perform the relevant calculation.
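
The sign convention can be illustrated with a short sketch. The exact spreadsheet formula is not reproduced in this README, so the calculation below (year of decease minus year of publication) is an assumption inferred from the description above, and the function name refLatency is hypothetical.

```javascript
// Hypothetical illustration of the sign convention for Column DY.
// Assumed calculation: year of decease minus year of publication.
//   negative -> work published after the repository died
//   positive -> work published before the repository died
function refLatency(publicationYear, deceaseYear) {
  if (!Number.isFinite(publicationYear) || !Number.isFinite(deceaseYear)) {
    return '#N/A'; // no reliable date available for the calculation
  }
  return deceaseYear - publicationYear;
}

console.log(refLatency(2020, 2015)); // work published 5 years after decease
console.log(refLatency(2010, 2015)); // work published 5 years before decease
console.log(refLatency(2020, null)); // date of decease unknown
```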

==============================================================================================================

==LICENSING==
This work is licensed under the Creative Commons Attribution 4.0 International Public License. See https://creativecommons.org/licenses/by/4.0/legalcode for further details.

==REFERENCES==

[1] Macgregor, G. & Davidson, J. (2026) Examining the persistence of European open repository infrastructure and its diffusion in the scholarly record. [Journal of publication TBC]. Preprint available: https://doi.org/10.48550/arXiv.2601.04015


==============================================================================================================
==============================================================================================================