Duration: 01.01.2023-31.12.2023

Editions as linked data
Scholarly text editions pose major challenges for secondary, overarching analysis and for their reuse. These concern the compilation of the content as well as the evaluation and preservation of the collected data. The challenges include:
- different temporal and geographical orientations of the research content
- different languages and language levels
- different writing systems
- presentation in different combinations of edited text, translation, facsimiles, etc.
- digital data held in different system architectures and data models
- changeability of the collected data during the runtime of the edition project (‘hot data’); only the long-term archiving of the data after the end of the project (‘cold data’) guarantees its immutability.
The long-term accessibility of research data is therefore an essential element of scientific research.
The Project
The ‘edition2LD’ project is developing a data curation solution that makes heterogeneous research data accessible in the long term and across the boundaries listed above. The solution must be flexible enough to accommodate the heterogeneity of the data and at the same time stable enough for a long-term perspective. To this end, edition2LD follows the paradigm of Linked Data (LD) and integrates the data into the Semantic Web.
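To illustrate the idea, the following minimal sketch uses the Python library rdflib to express one edition document as RDF triples with a link to an external, resolvable resource; the base URI, class and property names are placeholders, not URIs of the edition2LD project.

```python
# Minimal sketch: an edition document described by RDF triples,
# including a link to an external, resolvable resource (DBpedia).
# The 'EX' namespace, the class 'EditedText' and the property
# 'mentionsPlace' are illustrative placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

EX = Namespace("https://example.org/nepalica/")  # placeholder base URI

g = Graph()
g.bind("ex", EX)

doc = EX["document/dna_0001_0005"]               # identifier taken from the published edition
g.add((doc, RDF.type, EX.EditedText))            # hypothetical class
g.add((doc, RDFS.label, Literal("Document dna_0001_0005", lang="en")))
g.add((doc, EX.mentionsPlace, URIRef("http://dbpedia.org/resource/Kathmandu")))

print(g.serialize(format="turtle"))
```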
Use case as ‘best practice’

To develop the workflow for Linked Data (LD) modelling, the project uses the editions of the HAdW research centre ‘Religions- und rechtsgeschichtliche Quellen des vormodernen Nepal’ (Nepal-FS) as a use case. The research centre produces digital editions of Nepali texts (in Devanagari script) including facsimiles, English translations (in Latin script), commentaries and index entries (persons, places, specialist terms).
The editions are published on nepalica.hadw-bw.de and on the pages of Heidelberg University Library (see, e.g., https://digi.hadw-bw.de/view/dna_0001_0005), each with a DOI.
Approach
The workflow for LD modelling should be generic enough to allow future transfer to other projects. At the same time, it should be able to transfer data into RDF in batches that can be triggered repeatedly, thus addressing the major challenge of changeable ‘hot data’. When developing the automated mapping processes, it is therefore crucial to minimise manual post-processing, ideally so that it only has to be carried out once.
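A batch step of this kind could look roughly as follows; this is a sketch under assumptions, with the namespace, function names and database fields invented for illustration rather than taken from the project's actual mapping code.

```python
# Sketch of a re-runnable batch export: every run rebuilds the RDF
# serialisation from the current state of the source records, so
# changes to 'hot data' are handled by re-running the batch rather
# than by manual post-processing.
# All names (EX, map_row, run_batch, field names) are illustrative.
from rdflib import Graph, Literal, Namespace

EX = Namespace("https://example.org/nepalica/")  # placeholder base URI

def map_row(row: dict, g: Graph) -> None:
    """Map one database record (edited text plus translation) to triples."""
    doc = EX[f"document/{row['id']}"]
    g.add((doc, EX.editedText, Literal(row["text_ne"], lang="ne")))
    g.add((doc, EX.englishTranslation, Literal(row["translation_en"], lang="en")))

def run_batch(rows) -> None:
    """Rebuild the complete Turtle export from the given records."""
    g = Graph()
    g.bind("ex", EX)
    for row in rows:
        map_row(row, g)
    # Overwrite the previous export so the RDF stays in sync with the
    # evolving edition data; re-trigger the batch as often as needed.
    g.serialize(destination="nepalica_export.ttl", format="turtle")
```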

The project focuses on modelling the information units ‘text’, ‘English translation’, named entities (personal and place names), specialist terms, and metadata. For the modelling of named entities, the project can draw on register entries of the Nepal-FS, some of which already contain references to authority data repositories and encyclopaedic resources.

The vocabularies, ontologies and repositories used for the modelling are established standards: RDFS, SKOS, the Gemeinsame Normdatei (GND), VIAF, DBpedia, GeoNames and FOAF. In addition, the LD modelling adds links to instances of two ontologies of historical person and place names in Nepal (NepalPeople and NepalPlaces, see Tittel 2022*), which are being developed from the research data of the Nepal-FS.
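The following sketch shows, with rdflib, how such vocabulary bindings and authority links might look; the NepalPeople namespace, the instance URI and the GND/VIAF identifiers are placeholders, not actual project data.

```python
# Sketch: binding the standard vocabularies and linking one person
# entity to authority records and to a NepalPeople instance.
# The NP namespace and all identifiers are placeholders.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import FOAF, RDF, RDFS, SKOS

NP   = Namespace("https://example.org/NepalPeople#")  # placeholder for the NepalPeople ontology
GND  = Namespace("https://d-nb.info/gnd/")            # Gemeinsame Normdatei
VIAF = Namespace("https://viaf.org/viaf/")            # Virtual International Authority File

g = Graph()
for prefix, ns in [("foaf", FOAF), ("skos", SKOS), ("rdfs", RDFS), ("np", NP)]:
    g.bind(prefix, ns)

person = NP["person_0001"]                            # hypothetical instance URI
g.add((person, RDF.type, FOAF.Person))
g.add((person, RDFS.label, Literal("Example person from the register", lang="en")))
g.add((person, SKOS.exactMatch, GND["000000000"]))    # placeholder GND identifier
g.add((person, SKOS.exactMatch, VIAF["000000000"]))   # placeholder VIAF identifier

print(g.serialize(format="turtle"))
```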
The data sources are:
- the database of the Nepal-FS
- files with further information
- information from the register and glossary entries, integrated into the pipeline via web crawling
- the NepalPeople and NepalPlaces ontologies: Python scripts compare the modelled data with the NepalPeople and NepalPlaces entries and, where necessary, add links to their instances (see the sketch after this list).
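The comparison step could be sketched as follows, under assumed graph structures; the placeName property, the label-based matching and the use of skos:closeMatch are illustrative choices, not the actual logic of the project's scripts.

```python
# Sketch of the linking step: index the NepalPlaces instances by their
# rdfs:label and, where an extracted place name matches, add a link
# from the edition entity to the ontology instance.
# EX.placeName and skos:closeMatch are assumptions for illustration.
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS, SKOS

EX = Namespace("https://example.org/nepalica/")  # placeholder base URI

def link_places(edition_graph: Graph, nepal_places: Graph) -> None:
    # Build an index of NepalPlaces instances keyed by lower-cased label.
    label_index = {
        str(label).lower(): instance
        for instance, label in nepal_places.subject_objects(RDFS.label)
    }
    # Compare each modelled place name with the index and add a link
    # to the matching instance if one exists.
    for entity, name in edition_graph.subject_objects(EX.placeName):
        match = label_index.get(str(name).lower())
        if match is not None:
            edition_graph.add((entity, SKOS.closeMatch, match))
```

Plain label matching is of course only a first approximation; ambiguous names would still require disambiguation, for instance via variant spellings or manual review.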

As of today [September 2023], the modelling of the named entities and specialist terms is 90% complete; the modelling of the units ‘English translation’ and ‘Nepali edition’ and of the metadata is in progress.
The project "Sprachdatenbasierte Modellierung von Wissensnetzen in der mittelalterlichen Romania – ALMA" (internal link), which started as an inter-academic project of the HAdW, BAdW and AdW Mainz on 1 August 2022 as part of the Academies' Programme, follows an approach that is already based on linked data throughout. As ALMA compiles text editions (here of medieval legal and medical texts), this database can also be used well for an edition2LD approach.
Publications
Svoboda-Baas, Dieta/Tittel, Sabine: Text+Plus, #04: Modellierung von Texteditionen als Linked Data (edition2LD), in: Text+ Blog, 18.12.2023, https://textplus.hypotheses.org/8723.
*Tittel, Sabine: ‘Towards an Ontology for Toponyms in Nepalese Historical Documents’, in: Proceedings of the Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia within the 13th Language Resources and Evaluation Conference, Marseille, June 2022 (European Language Resources Association, ELRA), pp. 7-16.