The OCR module integrates DSpace with external Optical Character Recognition (OCR) software. Out of the box, the module supports the open-source Tesseract OCR engine (https://github.com/tesseract-ocr). A curation task extracts the text of each image in hOCR format so that it can be indexed for full-text search in Solr. Tesseract supports a large set of languages, including Italian, French, Spanish, German, Arabic, Simplified and Traditional Chinese, and many others (https://github.com/tesseract-ocr/langdata). The OCR engine can also be given custom training files so that it recognizes specific fonts and languages.
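To make the hOCR step more concrete, the following Python sketch shows how a Tesseract command line that writes hOCR output might be assembled and how word text and bounding boxes could then be read from the result. It is an illustration only, not the module's curation task: the helper names (`tesseract_hocr_command`, `HocrWordParser`, `extract_words`) and the sample markup are assumptions, and the sample is a heavily trimmed stand-in for real Tesseract output.

```python
import re
from html.parser import HTMLParser


def tesseract_hocr_command(image_path, output_base, lang="eng"):
    """Return the argv list for a Tesseract run that writes <output_base>.hocr.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "hocr"]


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every span with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bounding box of the word currently being read
        self._text = []     # text fragments of that word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            # hOCR stores the box in the title attribute: "bbox x0 y0 x1 y1; ..."
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((text, self._bbox))
            self._bbox = None


def extract_words(hocr):
    """Parse an hOCR string and return its words with bounding boxes."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Trimmed, hand-written sample in the shape of hOCR output (assumption).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 800 600'>
 <span class='ocr_line' title='bbox 36 92 210 116'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
  <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 91'>DSpace</span>
 </span>
</div>
"""

if __name__ == "__main__":
    print(tesseract_hocr_command("page-001.png", "page-001", lang="ita"))
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

The plain text sent to the full-text index could be obtained by joining the extracted words, while the bounding boxes are what later make it possible to highlight matches on the image.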
When the IIIF Image Viewer module is installed, the OCR module also supports the IIIF Search API through a server component, which is subject to the same license terms as the module. The IIIF Search API enables search inside the IIIF viewer: users can search within images, navigate through the results, and see the OCR text that matches the search terms highlighted on the image. The internal search engine also suggests search terms as the user types.
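As a rough illustration of how OCR word boxes can drive highlighting in a viewer, the Python sketch below builds a response shaped like the simple annotation list of the IIIF Content Search API 1.0. It does not describe the module's server component: the function name `search_response`, the example URLs, the substring matching, and the choice of API version are all assumptions, and the sketch further assumes that the OCR pixel coordinates coincide with the canvas coordinates.

```python
import json
from urllib.parse import quote


def search_response(service_url, canvas_url, query, words):
    """Build a Search API 1.0 style annotation list for words matching `query`.

    `words` is a list of (text, (x0, y0, x1, y1)) tuples, for example taken
    from hOCR output. Matching is a case-insensitive substring test, which is
    a simplification chosen for this sketch.
    """
    needle = query.casefold()
    resources = []
    for index, (text, (x0, y0, x1, y1)) in enumerate(words):
        if needle and needle in text.casefold():
            resources.append({
                "@id": f"{service_url}/annotation/{index}",  # hypothetical URI scheme
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "resource": {"@type": "cnt:ContentAsText", "chars": text},
                # The xywh fragment tells the viewer which region to highlight.
                "on": f"{canvas_url}#xywh={x0},{y0},{x1 - x0},{y1 - y0}",
            })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": f"{service_url}?q={quote(query)}",
        "@type": "sc:AnnotationList",
        "resources": resources,
    }


if __name__ == "__main__":
    sample_words = [
        ("Hello", (36, 92, 96, 116)),
        ("DSpace", (110, 92, 210, 116)),
    ]
    response = search_response(
        "https://example.org/iiif/search/item-1",   # placeholder service URL
        "https://example.org/iiif/item-1/canvas/1",  # placeholder canvas URL
        "dspace",
        sample_words,
    )
    print(json.dumps(response, indent=2))
```

Each hit carries the matched text and a canvas region, which is the information a viewer needs to list results and draw the highlight over the corresponding words.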