OCR & Transcription

Get the text out of your digitized documents.

  • Integrates DSpace with external Optical Character Recognition (OCR) software;
  • Processes hOCR output for full-text indexing in Solr and text overlay on images;
  • Supports a large set of languages out of the box, with more available through custom training files;
  • Lets you work online on manual transcription;
  • and much more!

Screenshots

Description of the available code

The OCR module integrates DSpace with external Optical Character Recognition software. Out of the box, it supports the open-source Tesseract OCR engine (https://github.com/tesseract-ocr). A curation task extracts the text of each image in hOCR format, which is then used for full-text indexing in Solr. Tesseract supports a very large set of languages, including Italian, French, Spanish, German, Arabic, Simplified and Traditional Chinese, and many others (https://github.com/tesseract-ocr/langdata). The OCR engine can also be given custom training files so that it recognizes specific fonts and languages.
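To make the hOCR step more concrete, here is a minimal sketch of how word text and positions could be read from an hOCR file before indexing or overlay. It is not the module's actual code: the names `parse_bbox`, `HocrWordParser` and `extract_words` are hypothetical, and it only assumes the standard hOCR convention that each word is a `span` with class `ocrx_word` whose `title` attribute carries a `bbox x0 y0 x1 y1` property.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans (class "ocrx_word")."""

    def __init__(self):
        super().__init__()
        self.words = []      # list of (text, (x0, y0, x1, y1))
        self._bbox = None    # bbox of the word span currently open
        self._chunks = []    # text fragments seen inside that span
        self._depth = 0      # nested <span> elements inside the word

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._bbox is not None:
            self._depth += 1  # e.g. per-character spans inside a word
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            bbox = parse_bbox(attrs.get("title") or "")
            if bbox is not None:
                self._bbox = bbox
                self._chunks = []
                self._depth = 0

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._bbox is None:
            return
        if self._depth > 0:
            self._depth -= 1
            return
        text = "".join(self._chunks).strip()
        if text:
            self.words.append((text, self._bbox))
        self._bbox = None


def extract_words(hocr):
    """Parse an hOCR document string into a list of (text, bbox) pairs."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

The plain text for a full-text index could then be obtained by joining the words with spaces, while the bounding boxes remain available for drawing the text overlay on the image.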

When the IIIF Image Viewer module is installed, the OCR module also supports the IIIF Search API through a server component, subject to the same terms as the module license. The IIIF Search API enables search inside the IIIF viewer: searching within images, navigating through the results, and highlighting on the image the OCR text that matches the search terms. The internal search engine also suggests search terms as you type.
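As an illustration of how highlighting works, the sketch below builds a response in the shape defined by the IIIF Search API 1.0: an annotation list in which each hit points at a region of a canvas through an `#xywh=` fragment. It is not the module's server component. The function name `build_search_response`, the `#hit-N` annotation identifiers and the example URIs are hypothetical, and the sketch assumes that the OCR pixel coordinates match the canvas coordinates.

```python
import json


def build_search_response(search_uri, canvas_uri, query, words):
    """Build an IIIF Search API 1.0 style annotation list.

    `words` is a list of (text, (x0, y0, x1, y1)) pairs, for example taken
    from an hOCR file. Matching here is a simple case-insensitive substring
    test, a stand-in for a real search engine.
    """
    needle = query.casefold()
    resources = []
    if needle:
        for index, (text, (x0, y0, x1, y1)) in enumerate(words):
            if needle not in text.casefold():
                continue
            resources.append({
                "@id": f"{search_uri}#hit-{index}",  # hypothetical id scheme
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "resource": {"@type": "cnt:ContentAsText", "chars": text},
                # hOCR gives corner coordinates; IIIF expects x, y, width, height
                "on": f"{canvas_uri}#xywh={x0},{y0},{x1 - x0},{y1 - y0}",
            })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": search_uri,
        "@type": "sc:AnnotationList",
        "resources": resources,
    }


if __name__ == "__main__":
    demo_words = [("Hello", (10, 10, 80, 40)), ("DSpace", (90, 10, 190, 40))]
    response = build_search_response(
        "https://example.org/search?q=dspace",  # placeholder URIs
        "https://example.org/canvas/1",
        "dspace",
        demo_words,
    )
    print(json.dumps(response, indent=2))
```

A IIIF viewer that receives such a response can draw a highlight over each `xywh` region and step through the hits in order.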

Take a quick look at it

Live demo

Check our services

Services

The new features we could develop with your support

  • The module will allow automatically generated OCR files to be replaced through a dedicated UI;

  • When the IIIF Image Server module is also installed, the system will allow the OCR text to be edited on the image, capturing positional information. OCR editing will also be available when no initial OCR file exists, allowing texts to be transcribed online;

  • An approval workflow for transcripts will keep the process decentralized but controlled.

To access the code and start using the module: €3,000
You can tell us which feature you would like us to develop first.

Make IT open!

Target budget: €75,000
0%

Other modules:

CKAN Integration

Add Research Data Management features to your DSpace.

IIIF Image Viewer

Use an international standard to work with image collections.

Document Viewer

View, access and work on full documents.

Video/Audio Streaming

Simplify access and reuse of audio/video content.

Other solutions