The Document Viewer is an extension of the IIIF Image Viewer module, which is a prerequisite. It displays PDF files online in the browser using only standard web features, with no need for external plugins such as Flash Player, and with minimal bandwidth use. The viewer can be extended to other formats, as detailed in the future functionalities section. No third-party services are involved: both the original files and the web-optimized files used by the viewer reside in the DSpace instance. The browser does not download the original file (PDF, etc.); instead, it downloads partial, resized image files at the optimal resolution for the current device and zoom level.
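To illustrate how partial, resized images can be requested, the following is a minimal sketch that builds a request URL following the IIIF Image API pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The server address, the identifier, and the helper name `iiif_image_url` are assumptions made for this example; the actual URLs exposed by a given DSpace instance may differ.

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation=0, quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL (hypothetical helper).

    region: "full" or an (x, y, w, h) tuple in pixels of the full-size image.
    size:   "max" or a requested width in pixels (height keeps the aspect ratio).
            Note: "max" is the Image API 2.1/3.0 keyword; 2.0 servers use "full".
    """
    if region != "full":
        region = ",".join(str(int(v)) for v in region)
    if size != "max":
        size = f"{int(size)},"
    # The identifier must be percent-encoded, including any slashes.
    ident = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and page identifier, for illustration only.
url = iiif_image_url(
    "https://repository.example.org/iiif",
    "page-0001",
    region=(0, 0, 1024, 1024),  # top-left 1024x1024 area of the page image
    size=512,                   # delivered at half resolution
)
print(url)
# https://repository.example.org/iiif/page-0001/0,0,1024,1024/512,/0/default.jpg
```

Requesting only the visible region, scaled to the size actually shown on screen, is what keeps the transferred data small compared with downloading the whole PDF.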
A suite of curation tasks processes each PDF page:
- an image with configurable resolution is extracted, to balance quality and disk usage;
- a text representation of the image is extracted, while preserving the positioning of data;
- the textual information is indexed, together with its positions, for the IIIF Search API.
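The idea of indexing text together with its position can be sketched as follows. This is only an in-memory illustration under stated assumptions: the `Word` record, the `build_index` and `search` helpers, and the sample coordinates are all hypothetical and do not describe how the curation tasks actually store their data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One extracted word and its bounding box on a page image (hypothetical)."""
    text: str
    page: int
    x: int
    y: int
    w: int
    h: int


def build_index(words):
    """Map each lower-cased term to the list of words where it occurs."""
    index = defaultdict(list)
    for word in words:
        index[word.text.lower()].append(word)
    return index


def search(index, term):
    """Return (page, "x,y,w,h") pairs for every occurrence of the term."""
    return [
        (word.page, f"{word.x},{word.y},{word.w},{word.h}")
        for word in index.get(term.lower(), [])
    ]


# Made-up sample data: three words spread over two pages.
words = [
    Word("Open", 1, 100, 120, 80, 24),
    Word("access", 1, 190, 120, 110, 24),
    Word("access", 2, 60, 300, 110, 24),
]
index = build_index(words)
hits = search(index, "Access")
print(hits)
# [(1, '190,120,110,24'), (2, '60,300,110,24')]
```

Keeping the bounding box next to each term is what later allows a search result to be drawn as a highlight on the page image rather than returned as plain text only.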
The viewer prevents end users from copying and pasting the content of the file, and downloading of the original PDF file can be prevented. The viewer provides a “search inside” feature with highlighting for PDFs where text extraction is possible. Combining the Document Viewer module with the OCR module makes “search inside” and highlighting available for scanned (image) PDFs as well.
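As a rough sketch of how “search inside” results can be turned into highlights, the example below builds a search request and reads the hit positions from a response shaped like an IIIF Content Search API 1.0 annotation list, where each hit targets a canvas region through an `#xywh=x,y,w,h` fragment. The service URL, the sample response, and the helper names are assumptions for illustration; the source does not state which API version or endpoints the module uses.

```python
import json
from urllib.parse import urldefrag, urlencode


def search_url(service_url, query):
    """Build a search request URL with the query passed in the `q` parameter."""
    return f"{service_url}?{urlencode({'q': query})}"


def highlight_boxes(annotation_list):
    """Extract (canvas, x, y, w, h) tuples from a Search API 1.0 style response."""
    boxes = []
    for anno in annotation_list.get("resources", []):
        target = anno.get("on")
        if not isinstance(target, str):
            continue  # this sketch only handles plain string targets
        canvas, fragment = urldefrag(target)
        if not fragment.startswith("xywh="):
            continue  # hit without a region, nothing to highlight
        x, y, w, h = (int(v) for v in fragment[len("xywh="):].split(","))
        boxes.append((canvas, x, y, w, h))
    return boxes


# Hypothetical endpoint, for illustration only.
request = search_url("https://repository.example.org/iiif/search/doc-1", "open access")

# Made-up response in the shape defined by IIIF Content Search API 1.0.
sample = json.loads("""
{
  "@context": "http://iiif.io/api/search/1/context.json",
  "@type": "sc:AnnotationList",
  "resources": [
    {
      "@type": "oa:Annotation",
      "motivation": "sc:painting",
      "resource": {"@type": "cnt:ContentAsText", "chars": "access"},
      "on": "https://repository.example.org/iiif/canvas/p1#xywh=190,120,110,24"
    }
  ]
}
""")
boxes = highlight_boxes(sample)
print(request)
print(boxes)
```

Each returned tuple identifies the page (canvas) and the rectangle that a viewer could overlay on the page image to highlight the match.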