Labelling OCR Ground Truth for Usage in Repositories
The rapid developments in deep/machine learning algorithms have over the last decade largely replaced traditional pattern/language-based approaches to OCR. Training these new tools requires scanned images alongside their transcriptions (Ground Truth, GT)...
OCR for Greek polytonic (multi accent) historical printed documents: development, optimization and quality control
This paper presents the development and implementation of a robust OCR tool and a related comprehensive workflow for the recognition of Greek printed polytonic scripts. This project is initiated and developed by an interdisciplinary team with expertise ...
Towards a Higher Accuracy of Optical Character Recognition of Chinese Rare Books in Making Use of Text Model
The legibility of the text of rare books is often subject to precarious conditions: natural decay or erosion, and ink bleed caused by flawed printing methods centuries ago often make such texts difficult to recognize. This difficulty hardens the ...
A-I-PoCoTo: Combining Automated and Interactive OCR Postcorrection
PoCoTo is known as a web-based interactive tool for the postcorrection of OCR results on historical texts. In this paper we first introduce A-PoCoTo, a fully automated extension of PoCoTo designed for use in large-scale digitization projects. Among ...
Arabic-SOS: Segmentation, Stemming, and Orthography Standardization for Classical and pre-Modern Standard Arabic
While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and the orthography, most effort has focused on Modern Standard Arabic. In this paper, we focus on pre-MSA texts. We use the ...
Automatic Semantic Text Tagging on Historical Lexica by Combining OCR and Typography Classification: A Case Study on Daniel Sander's Wörterbuch der Deutschen Sprache
When converting historical lexica into electronic form the goal is not only to obtain a high quality OCR result for the text but also to perform a precise automatic recognition of typographical attributes in order to capture the logical structure. For ...
Automatic Reconstruction of Emperor Itineraries from the Regesta Imperii
Historic itinerary research investigates the traveling paths of historic entities, to determine their influence and reach. A potential source of such information are the Regesta Imperii (RI), a large-scale resource for European medieval history ...
Diamonds in Borneo: Commodities as Concepts in Context
The intensified circulation of people, commodities and ideas is one of the characteristics of a globalizing world. To understand the causes and consequences of these circulations, we have to know which commodities circulated when and where, on what ...
OCR-D: An end-to-end open source OCR framework for historical printed documents
- Clemens Neudecker
- Konstantin Baierer
- Maria Federbusch
- Matthias Boenig
- Kay-Michael Würzner
- Volker Hartmann
- Elisa Herrmann
Various research projects were concerned with the development and adaptation of methods for OCR specifically for historical printed documents (cf. METAe [20], IMPACT [1], eMOP [9]). However, these initiatives ended before the wide adoption of deep ...
Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771-1929: Early Results Using the PIVAJ Software
This paper describes the first large-scale article detection and extraction efforts on the Finnish Digi newspaper material of the National Library of Finland (NLF) using data of one newspaper, Uusi Suometar 1869-1898. The historical digital newspaper ...
Towards the Extraction of Statistical Information from Digitised Numerical Tables: The Medical Officer of Health Reports Scoping Study
Numerical data of considerable significance is present in historical documents in tabular form. Due to the challenges involved in the extraction of this data from the scanned documents it is not available to researchers in a useful representation that ...
Towards a Generic Unsupervised Method for Transcription of Encoded Manuscripts
Historical ciphers, a special type of manuscripts, contain encrypted information, important for the interpretation of our history. The first step towards decipherment is to transcribe the images, either manually or by automatic image processing ...
Challenges of Mass OCR-isation of Medieval Latin Texts in a Resource-Limited Project
This paper aims to present the first stage of the ANR project VELUM (Towards Innovative Ways of Visualising, Exploring and Linking Resources for Medieval Latin) which, by 2022, is intended to compile the largest representative corpus of Medieval Latin ...
Optical Character Recognition for Coptic fonts: A multi-source approach for scholarly editions
In this paper, we show that the OCR engine Ocropy can be trained for fonts used in rather complex and varied Coptic typeset. For each of the three fonts presented in this paper, we used a number of texts from scholarly editions with different ...
A New Strategy for Arabic OCR: Archigraphemes, Letter Blocks, Script Grammar, and shape synthesis
Current OCR has limited capability for Arabic because of script models lacking scientific basis. We propose a new OCR strategy for Arabic, based on 1. Islamic script grammar including extended shaping and 2. treating Arabic script as a multi-layered ...
Improving OCR of historical newspapers and journals published in Finland
This paper presents experiments on optical character recognition (OCR) of historical newspapers and journals published in Finland. The corpus has two main languages, Finnish and Swedish, and is written in both Blackletter and Antiqua fonts. Here we ...
From Tribunal Archive to Digital Research Facility (TRIADO): Exploring ways to make archives accessible and useable
The TRIADO project (2016-2019) is a cooperation between Netwerk Oorlogsbronnen (coordinator), NIOD Institute for War, Holocaust and Genocide Studies, Huygens ING/KNAW Humanities Cluster and the National Archives of the Netherlands (Nationaal Archief). ...
Cross-disciplinary Collaborations to Enrich Access to Non-Western Language Material in the Cultural Heritage Sector
The British Library is home to millions of items representing every age of written civilisation, including books, manuscripts and newspapers in all written languages. Large digitisation programmes currently underway are opening up access to this rich ...
Curation Technologies for Cultural Heritage Archives: Analysing and transforming a heterogeneous data set into an interactive curation workbench
We present a platform that enables the semantic analysis, enrichment, visualisation and presentation of a document collection in a way that enables human users to intuitively interact and explore the collection, in short, a curation platform or ...
Implementation of a Databaseless Web REST API for the Unstructured Texts of Migne's Patrologia Graeca with Searching capabilities and additional Semantic and Syntactic expandability
The search through large corpora of unstructured text on the Web Domain is not an easy task and such services are not offered for the common user. One such corpus is the published works of east Christian fathers by Jacques Paul Migne, known as ...
Deep Learning-Based Morphological Taggers and Lemmatizers for Annotating Historical Texts
Part-of-speech tagging, morphological tagging, and lemmatization of historical texts pose special challenges due to the high spelling variability and the lack of large, high-quality training corpora. Researchers therefore often first map the words to ...
Stylometry of literary papyri
In this paper we present the first results of stylometric analysis of literary papyri. Specifically we perform a range of tests for unsupervised clustering of authors. We scrutinise both the best classic distance-based methods as well as the state-of-...
Using lexicography to characterise relations between species mentions in the biodiversity literature
The biodiversity literature is one of the longest-standing examples of recording heritage in the world. Today there are many efforts to standardise and integrate the literature to ensure access to the information, both for heritage and research ...
Standoff Annotation for the Ancient Greek and Latin Dependency Treebank
This contribution presents the work in progress to convert the Ancient Greek and Latin Dependency Treebank (AGLDT) into standoff annotation using PAULA XML. With an increasing number of annotations of any kind, it becomes more and more urgent that ...
Hidden Metadata in Plain Sights: Romanian Folklore Catalogues
This paper explores the way in which the old catalogues from the Romanian folklore archives can be improved with updated information about two key aspects of the folklore collections: the informant/performer and the village/location where the recordings ...
Validating 126 million MARC records
The paper describes the method and results of validation of 14 library catalogues. The format of the catalogue records is Machine Readable Catalog (MARC21), which is the most popular metadata standard for describing books. The research investigates the ...
Turning Digitised Material into a Diachronic Corpus: Metadata Challenges in the Nederlab Project
In this paper, we argue that exploitation of historical corpus data requires text metadata that the metadata accompanying digital objects from digital libraries, archives or other electronic text collections do not provide. Most text collections describe ...
Index Terms
- Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage
Acceptance Rates
Year | Submitted | Accepted | Rate |
---|---|---|---|
DATeCH '17 | 37 | 29 | 78% |
DATeCH '14 | 49 | 31 | 63% |
Overall | 86 | 60 | 70% |