Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage

DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage

May 2019

Publisher: Association for Computing Machinery, New York, NY, United States

Conference: DATeCH2019: 3rd International Conference on Digital Access to Textual Cultural Heritage, Brussels, Belgium, May 8–10, 2019

ISBN: 978-1-4503-7194-0

Published: 08 May 2019

Abstract

No abstract available.

Proceeding Downloads

Front matter (Title, Copyright, Contents, Foreword by the programme chairs, Foreword by the organisation chairs, DATeCH 2019 organisation, Programme Committee)

Back matter (List of authors)

SESSION: Evaluation and improvement of OCR

research-article

Labelling OCR Ground Truth for Usage in Repositories

pp 3–8, https://doi.org/10.1145/3322905.3322916

The rapid developments in deep/machine learning algorithms have over the last decade largely replaced traditional pattern/language-based approaches to OCR. Training these new tools requires scanned images alongside their transcriptions (Ground Truth, GT)...

research-article

OCR for Greek polytonic (multi accent) historical printed documents: development, optimization and quality control

pp 9–13, https://doi.org/10.1145/3322905.3322926

This paper presents the development and implementation of a robust OCR tool and a related comprehensive workflow for the recognition of Greek printed polytonic scripts. This project is initiated and developed by an interdisciplinary team with expertise ...

research-article

Towards a Higher Accuracy of Optical Character Recognition of Chinese Rare Books in Making Use of Text Model

pp 15–18, https://doi.org/10.1145/3322905.3322922

The legibility of the text of rare books is often subject to precarious conditions: natural decay or erosion, and ink bleed caused by flawed printing methods centuries ago often make such texts difficult to recognize. This difficulty hardens the ...

research-article

A-I-PoCoTo: Combining Automated and Interactive OCR Postcorrection

pp 19–24, https://doi.org/10.1145/3322905.3322908

PoCoTo is known as a web-based interactive tool for the postcorrection of OCR-results on historical texts. In this paper we first introduce A-PoCoTo, a fully automated extension of PoCoTo designed for the use in large-scale digitization projects. Among ...

SESSION: Applications

research-article

Arabic-SOS: Segmentation, Stemming, and Orthography Standardization for Classical and pre-Modern Standard Arabic

pp 27–32, https://doi.org/10.1145/3322905.3322927

While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and the orthography, most effort has focused on Modern Standard Arabic. In this paper, we focus on pre-MSA texts. We use the ...

research-article

Open Access

Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification: A Case Study on Daniel Sander's Wörterbuch der Deutschen Sprache

pp 33–38, https://doi.org/10.1145/3322905.3322910

When converting historical lexica into electronic form the goal is not only to obtain a high quality OCR result for the text but also to perform a precise automatic recognition of typographical attributes in order to capture the logical structure. For ...

research-article

Automatic Reconstruction of Emperor Itineraries from the Regesta Imperii

pp 39–44, https://doi.org/10.1145/3322905.3322921

Historic itinerary research investigates the traveling paths of historic entities, to determine their influence and reach. A potential source of such information are the Regesta Imperii (RI), a large-scale resource for European medieval history ...

research-article

Diamonds in Borneo: Commodities as Concepts in Context

pp 45–50, https://doi.org/10.1145/3322905.3322924

The intensified circulation of people, commodities and ideas is one of the characteristics of a globalizing world. To understand the causes and consequences of these circulations, we have to know which commodities circulated when and where, on what ...

SESSION: OCR and HTR in practise

research-article

OCR-D: An end-to-end open source OCR framework for historical printed documents

pp 53–58, https://doi.org/10.1145/3322905.3322917

Various research projects were concerned with the development and adaptation of methods for OCR specifically for historical printed documents (cf. METAe [20], IMPACT [1], eMOP [9]). However, these initiatives have ended before the wide adoption of deep ...

research-article

Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771-1929: Early Results Using the PIVAJ Software

pp 59–64, https://doi.org/10.1145/3322905.3322911

This paper describes first large scale article detection and extraction efforts on the Finnish Digi newspaper material of the National Library of Finland (NLF) using data of one newspaper, Uusi Suometar 1869-1898. The historical digital newspaper ...

research-article

Towards the Extraction of Statistical Information from Digitised Numerical Tables: The Medical Officer of Health Reports Scoping Study

pp 65–71, https://doi.org/10.1145/3322905.3322932

Numerical data of considerable significance is present in historical documents in tabular form. Due to the challenges involved in the extraction of this data from the scanned documents it is not available to researchers in a useful representation that ...

research-article

Towards a Generic Unsupervised Method for Transcription of Encoded Manuscripts

pp 73–78, https://doi.org/10.1145/3322905.3322920

Historical ciphers, a special type of manuscripts, contain encrypted information, important for the interpretation of our history. The first step towards decipherment is to transcribe the images, either manually or by automatic image processing ...

SESSION: Digitisation of historical languages

research-article

Challenges of Mass OCR-isation of Medieval Latin Texts in a Resource-Limited Project

pp 81–85, https://doi.org/10.1145/3322905.3322925

This paper aims to present the first stage of the ANR project VELUM (Towards Innovative Ways of Visualising, Exploring and Linking Resources for Medieval Latin) which, by 2022, is intended to compile the largest representative corpus of Medieval Latin ...

research-article

Open Access

Optical Character Recognition for Coptic fonts: A multi-source approach for scholarly editions

pp 87–91, https://doi.org/10.1145/3322905.3322931

In this paper, we show that the OCR engine Ocropy can be trained for fonts used in rather complex and varied Coptic typeset. For each of the three fonts presented in this paper, we used a number of texts from scholarly editions with different ...

research-article

A New Strategy for Arabic OCR: Archigraphemes, Letter Blocks, Script Grammar, and shape synthesis

pp 93–96, https://doi.org/10.1145/3322905.3322928

Current OCR has limited capability for Arabic because of script models lacking scientific basis. We propose a new OCR strategy for Arabic, based on 1. Islamic script grammar including extended shaping and 2. treating Arabic script as a multi-layered ...

research-article

Improving OCR of historical newspapers and journals published in Finland

pp 97–102, https://doi.org/10.1145/3322905.3322914

This paper presents experiments on Optical character recognition (OCR) of historical newspapers and journals published in Finland. The corpus has two main languages: Finnish and Swedish and is written in both Blackletter and Antiqua fonts. Here we ...

SESSION: Access to data

research-article

From Tribunal Archive to Digital Research Facility (TRIADO): Exploring ways to make archives accessible and useable

pp 105–110, https://doi.org/10.1145/3322905.3322906

The TRIADO project (2016-2019) is a cooperation between Netwerk Oorlogsbronnen (coordinator), NIOD Institute for War, Holocaust and Genocide Studies, Huygens ING/KNAW Humanities Cluster and the National Archives of the Netherlands (Nationaal Archief). ...

research-article

Cross-disciplinary Collaborations to Enrich Access to Non-Western Language Material in the Cultural Heritage Sector

pp 111–116, https://doi.org/10.1145/3322905.3322907

The British Library is home to millions of items representing every age of written civilisation, including books, manuscripts and newspapers in all written languages. Large digitisation programmes currently underway are opening up access to this rich ...

research-article

Open Access

Curation Technologies for Cultural Heritage Archives: Analysing and transforming a heterogeneous data set into an interactive curation workbench

pp 117–122, https://doi.org/10.1145/3322905.3322909

We present a platform that enables the semantic analysis, enrichment, visualisation and presentation of a document collection in a way that enables human users to intuitively interact and explore the collection, in short, a curation platform or ...

research-article

Implementation of a Databaseless Web REST API for the Unstructured Texts of Migne's Patrologia Graeca with Searching capabilities and additional Semantic and Syntactic expandability

pp 123–129, https://doi.org/10.1145/3322905.3322913

The search through large corpora of unstructured text on the Web Domain is not an easy task and such services are not offered for the common user. One such corpus is the published works of east Christian fathers by Jacques Paul Migne, known as ...

SESSION: Natural language processing

research-article

Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts

Helmut Schmid

pp 133–137, https://doi.org/10.1145/3322905.3322915

Part-of-speech tagging, morphological tagging, and lemmatization of historical texts pose special challenges due to the high spelling variability and the lack of large, high-quality training corpora. Researchers therefore often first map the words to ...

research-article

Stylometry of literary papyri

pp 139–142, https://doi.org/10.1145/3322905.3322930

In this paper we present the first results of stylometric analysis of literary papyri. Specifically we perform a range of tests for unsupervised clustering of authors. We scrutinise both the best classic distance-based methods as well as the state-of-...

research-article

Using lexicography to characterise relations between species mentions in the biodiversity literature

Sandra Young

pp 143–148, https://doi.org/10.1145/3322905.3322918

The biodiversity literature is one of the longest-standing examples of recording heritage in the world. Today there are many efforts to standardise and integrate the literature to ensure access to the information, both for heritage and research ...

research-article

Standoff Annotation for the Ancient Greek and Latin Dependency Treebank

Giuseppe G. A. Celano

pp 149–153, https://doi.org/10.1145/3322905.3322919

This contribution presents the work in progress to convert the Ancient Greek and Latin Dependency Treebank (AGLDT) into standoff annotation using PAULA XML. With an increasing number of annotations of any kind, it becomes more and more urgent that ...

SESSION: Metadata

research-article

Hidden Metadata in Plain Sights: Romanian Folklore Catalogues

Liviu-Ovidiu Pop

pp 157–159, https://doi.org/10.1145/3322905.3322912

This paper explores the way in which the old catalogues from the Romanian folklore archives can be improved with updated information about two key aspects of the folklore collections: the informant/performer and the village/location where the recordings ...

research-article

Validating 126 million MARC records

Péter Király

pp 161–168, https://doi.org/10.1145/3322905.3322929

The paper describes the method and results of validation of 14 library catalogues. The format of the catalog record is Machine Readable Catalog (MARC21), which is the most popular metadata standard for describing books. The research investigates the ...

research-article

Turning Digitised Material into a Diachronic Corpus: Metadata Challenges in the Nederlab Project

pp 169–173, https://doi.org/10.1145/3322905.3322923

In this paper, we argue that exploitation of historical corpus data requires text metadata which metadata accompanying digital objects from digital libraries, archives or other electronic text collections, do not provide. Most text collections describe ...

Acceptance Rates

Overall Acceptance Rate: 60 of 86 submissions, 70%

Year	Submitted	Accepted	Rate
DATeCH2017	37	29	78%
DATeCH '14	49	31	63%
Overall	86	60	70%
