DOI:
10.1145/3322905
DATeCH2019: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage
ACM 2019 Proceeding
Publisher:
  • Association for Computing Machinery
  • New York
  • NY
  • United States
Conference:
DATeCH2019: 3rd International Conference on Digital Access to Textual Cultural Heritage, Brussels, Belgium, May 8-10, 2019
ISBN:
978-1-4503-7194-0
Published:
08 May 2019

Abstract

No abstract available.

SESSION: Evaluation and improvement of OCR
research-article
Labelling OCR Ground Truth for Usage in Repositories

The rapid developments in deep/machine learning algorithms have over the last decade largely replaced traditional pattern/language-based approaches to OCR. Training these new tools requires scanned images alongside their transcriptions (Ground Truth, GT)...

research-article
OCR for Greek polytonic (multi accent) historical printed documents: development, optimization and quality control

This paper presents the development and implementation of a robust OCR tool and a related comprehensive workflow for the recognition of Greek printed polytonic scripts. This project is initiated and developed by an interdisciplinary team with expertise ...

research-article
Towards a Higher Accuracy of Optical Character Recognition of Chinese Rare Books in Making Use of Text Model

The legibility of the text of rare books is often subject to precarious conditions: natural decay or erosion, and ink bleed caused by flawed printing methods centuries ago often make such texts difficult to recognize. This difficulty hardens the ...

research-article
A-I-PoCoTo: Combining Automated and Interactive OCR Postcorrection

PoCoTo is known as a web-based interactive tool for the postcorrection of OCR-results on historical texts. In this paper we first introduce A-PoCoTo, a fully automated extension of PoCoTo designed for the use in large-scale digitization projects. Among ...

SESSION: Applications
research-article
Arabic-SOS: Segmentation, Stemming, and Orthography Standardization for Classical and pre-Modern Standard Arabic

While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and the orthography, most effort has focused on Modern Standard Arabic. In this paper, we focus on pre-MSA texts. We use the ...

research-article
Open Access
Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification: A Case Study on Daniel Sander's Wörterbuch der Deutschen Sprache

When converting historical lexica into electronic form the goal is not only to obtain a high quality OCR result for the text but also to perform a precise automatic recognition of typographical attributes in order to capture the logical structure. For ...

research-article
Automatic Reconstruction of Emperor Itineraries from the Regesta Imperii

Historic itinerary research investigates the traveling paths of historic entities, to determine their influence and reach. A potential source of such information are the Regesta Imperii (RI), a large-scale resource for European medieval history ...

research-article
Diamonds in Borneo: Commodities as Concepts in Context

The intensified circulation of people, commodities and ideas is one of the characteristics of a globalizing world. To understand the causes and consequences of these circulations, we have to know which commodities circulated when and where, on what ...

SESSION: OCR and HTR in practise
research-article
OCR-D: An end-to-end open source OCR framework for historical printed documents

Various research projects were concerned with the development and adaptation of methods for OCR specifically for historical printed documents (cf. METAe [20], IMPACT [1], eMOP [9]). However, these initiatives have ended before the wide adoption of deep ...

research-article
Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771-1929: Early Results Using the PIVAJ Software

This paper describes first large scale article detection and extraction efforts on the Finnish Digi newspaper material of the National Library of Finland (NLF) using data of one newspaper, Uusi Suometar 1869-1898. The historical digital newspaper ...

research-article
Towards the Extraction of Statistical Information from Digitised Numerical Tables: The Medical Officer of Health Reports Scoping Study

Numerical data of considerable significance is present in historical documents in tabular form. Due to the challenges involved in the extraction of this data from the scanned documents it is not available to researchers in a useful representation that ...

research-article
Towards a Generic Unsupervised Method for Transcription of Encoded Manuscripts

Historical ciphers, a special type of manuscripts, contain encrypted information, important for the interpretation of our history. The first step towards decipherment is to transcribe the images, either manually or by automatic image processing ...

SESSION: Digitisation of historical languages
research-article
Challenges of Mass OCR-isation of Medieval Latin Texts in a Resource-Limited Project

This paper aims to present the first stage of the ANR project VELUM (Towards Innovative Ways of Visualising, Exploring and Linking Resources for Medieval Latin) which, by 2022, is intended to compile the largest representative corpus of Medieval Latin ...

research-article
Open Access
Optical Character Recognition for Coptic fonts: A multi-source approach for scholarly editions

In this paper, we show that the OCR engine Ocropy can be trained for fonts used in rather complex and varied Coptic typeset. For each of the three fonts presented in this paper, we used a number of texts from scholarly editions with different ...

research-article
A New Strategy for Arabic OCR: Archigraphemes, Letter Blocks, Script Grammar, and shape synthesis

Current OCR has limited capability for Arabic because of script models lacking scientific basis. We propose a new OCR strategy for Arabic, based on 1. Islamic script grammar including extended shaping and 2. treating Arabic script as a multi-layered ...

research-article
Improving OCR of historical newspapers and journals published in Finland

This paper presents experiments on Optical character recognition (OCR) of historical newspapers and journals published in Finland. The corpus has two main languages: Finnish and Swedish and is written in both Blackletter and Antiqua fonts. Here we ...

SESSION: Access to data
research-article
From Tribunal Archive to Digital Research Facility (TRIADO): Exploring ways to make archives accessible and useable

The TRIADO project (2016-2019) is a cooperation between Netwerk Oorlogsbronnen (coordinator), NIOD Institute for War, Holocaust and Genocide Studies, Huygens ING/KNAW Humanities Cluster and the National Archives of the Netherlands (Nationaal Archief). ...

research-article
Cross-disciplinary Collaborations to Enrich Access to Non-Western Language Material in the Cultural Heritage Sector

The British Library is home to millions of items representing every age of written civilisation, including books, manuscripts and newspapers in all written languages. Large digitisation programmes currently underway are opening up access to this rich ...

research-article
Open Access
Curation Technologies for Cultural Heritage Archives: Analysing and transforming a heterogeneous data set into an interactive curation workbench

We present a platform that enables the semantic analysis, enrichment, visualisation and presentation of a document collection in a way that enables human users to intuitively interact and explore the collection, in short, a curation platform or ...

research-article
Implementation of a Databaseless Web REST API for the Unstructured Texts of Migne's Patrologia Graeca with Searching capabilities and additional Semantic and Syntactic expandability

The search through large corpora of unstructured text on the Web Domain is not an easy task and such services are not offered for the common user. One such corpus is the published works of east Christian fathers by Jacques Paul Migne, known as ...

SESSION: Natural language processing
research-article
Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts

Part-of-speech tagging, morphological tagging, and lemmatization of historical texts pose special challenges due to the high spelling variability and the lack of large, high-quality training corpora. Researchers therefore often first map the words to ...

research-article
Stylometry of literary papyri

In this paper we present the first results of stylometric analysis of literary papyri. Specifically we perform a range of tests for unsupervised clustering of authors. We scrutinise both the best classic distance-based methods as well as the state-of-...

research-article
Using lexicography to characterise relations between species mentions in the biodiversity literature

The biodiversity literature is one of the longest-standing examples of recording heritage in the world. Today there are many efforts to standardise and integrate the literature to ensure access to the information, both for heritage and research ...

research-article
Standoff Annotation for the Ancient Greek and Latin Dependency Treebank

This contribution presents the work in progress to convert the Ancient Greek and Latin Dependency Treebank (AGLDT) into standoff annotation using PAULA XML. With an increasing number of annotations of any kind, it becomes more and more urgent that ...

SESSION: Metadata
research-article
Hidden Metadata in Plain Sights: Romanian Folklore Catalogues

This paper explores the way in which the old catalogues from the Romanian folklore archives can be improved with updated information about two key aspects of the folklore collections: the informant/performer and the village/location where the recordings ...

research-article
Validating 126 million MARC records

The paper describes the method and results of validation of 14 library catalogues. The format of the catalog record is Machine Readable Catalog (MARC21) which is the most popular metadata standard for describing books. The research investigates the ...

research-article
Turning Digitised Material into a Diachronic Corpus: Metadata Challenges in the Nederlab Project

In this paper, we argue that exploitation of historical corpus data requires text metadata which metadata accompanying digital objects from digital libraries, archives or other electronic text collections, do not provide. Most text collections describe ...

Acceptance Rates

Overall Acceptance Rate: 60 of 86 submissions, 70%

Year         Submitted  Accepted  Rate
DATeCH2017   37         29        78%
DATeCH '14   49         31        63%
Overall      86         60        70%