OCR

It is possible to search the full text of titles processed with OCR. The full texts can then be downloaded together with the images contained therein as plain text, ALTO XML or a PDF file (title page → “Downloads” section). In the facet, the hits returned can be filtered for titles in which it is possible to search the full text (“Full text search available”).
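
For anyone reusing a downloaded ALTO XML file, the following minimal Python sketch shows one way to pull the recognised text out line by line. It relies only on the general ALTO structure (TextLine elements containing String elements with a CONTENT attribute). The sample document and the function names are illustrative assumptions and are not taken from e-rara; hyphenation elements and coordinates are ignored.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written ALTO sample (illustrative only; not an e-rara file).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page>
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Auguſt"/>
            <SP/>
            <String CONTENT="1750"/>
          </TextLine>
          <TextLine>
            <String CONTENT="London"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag: str) -> str:
    """Strip the XML namespace so the code does not depend on the ALTO version."""
    return tag.rsplit("}", 1)[-1]


def alto_to_lines(xml_text: str) -> list[str]:
    """Return one string per TextLine, joining its String contents with spaces."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.attrib.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


print(alto_to_lines(SAMPLE_ALTO))
```

To process a real download, read the file contents and pass them to alto_to_lines in place of the sample string.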

The OCR solution used and the status of processing
Up until 2023, the only software used on e-rara to generate full texts was ABBYY FineReader. This software was used to process content from titles written in the 17th to 20th centuries. Since 2024, Tesseract version 5 has also been used, with various language models.

Full text search
A general search covers the title metadata and automatically includes all full texts as well. Titles with full text can also be searched directly (button at the top right of the title view). When carrying out a full text search, keep the following points in mind:

  • Truncation is carried out automatically: when searching for london, for example, the search engine will also deliver Londonderry as a hit. To get exact matches, place the search term in quotation marks: “london”.
  • Depending on the original text, OCR has its flaws to a greater or lesser extent. Searches for specific words do not always deliver every possible hit.
  • Depending on the OCR solution, some printed characters are rendered as other characters: for the printed character “ſ” (in texts set in Gothic and old Antiqua typefaces), FineReader outputs “s”, while some other OCR solutions output “ſ”. This is relevant for the reuse of downloaded full texts. When searching for specific words on e-rara, the “ſ” character is normalised: a search for “august” can also deliver “auguſt” as a hit.
  • Generally speaking, the OCR solution reproduces the text as in the original. When searching full texts, please therefore take orthographic and typographic anomalies into consideration. This applies in particular to the use of u/v and i/j (which, in contrast to s/ſ, are not normalised) as well as to abbreviations and ligatures.
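
The search behaviour described in the points above can be illustrated with a small Python sketch. It is only a model of the described behaviour, not e-rara's actual search engine: the function names, the case-insensitive comparison and the word list are assumptions made for illustration.

```python
def normalise(term: str) -> str:
    """Lower-case the term and map long s (ſ) to s.

    Assumption: this stands in for the normalisation e-rara applies;
    u/v and i/j are deliberately left untouched.
    """
    return term.lower().replace("ſ", "s")


def matches(query: str, word: str) -> bool:
    """Quoted query: exact match. Unquoted query: prefix match (automatic truncation)."""
    quotes = "\"“”"
    exact = len(query) >= 2 and query[0] in quotes and query[-1] in quotes
    needle = normalise(query.strip(quotes))
    candidate = normalise(word)
    return candidate == needle if exact else candidate.startswith(needle)


# Hypothetical words as they might appear in an OCR full text.
words = ["London", "Londonderry", "Auguſt", "vniuerſity"]

print([w for w in words if matches("london", w)])      # truncated search
print([w for w in words if matches('"london"', w)])    # exact search
print([w for w in words if matches("august", w)])      # ſ is normalised
print([w for w in words if matches("university", w)])  # u/v is not normalised
```

In this model the last query finds nothing, which is why historical spellings such as “vniuerſity” need to be searched for in their original form.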

NER/NEL

From 2025 onwards, selected full texts will be processed with Named Entity Recognition (NER) and Named Entity Linking (NEL). These technologies will enable the people, topics and places mentioned in full texts to be automatically identified and linked to the Integrated Authority File (GND). By searching for the relevant terms on the start page, it is possible to search for these entities and the text passages in which they occur. The lists of people, places and topics are also available for individual titles and can be selected in the title view.

NER/NEL on e-rara is based on Google’s Natural Language API and delivers good results on the whole. This AI-supported procedure is not perfect, however: a certain percentage of the entities are identified incorrectly, or are not identified and linked at all.