A newspaper digitization app for a European document scanning agency. Detection of articles that span across multiple columns, text extraction, and article type detection
We develop custom solutions for document digitization
Our client is a large document scanning and digitization company based in Europe that was looking for a way to improve and scale their newspaper scanning efforts.
Our client has approached us with the task of creating an application for intelligent newspaper processing and digitization. The task of digitizing a newspaper is the most complicated OCR-based one as newspapers have complex layouts with articles spanning across multiple columns, starting and stopping in arbitrary places, often broken up by images and advertisements.
We started with the analysis of newspaper pages to get a clear understanding of the page and article structure. During this analysis we have discovered that many historical newspapers have suffered some damage, like paper warping, scratches, or fading ink. This forced us to develop a preprocessing module for increasing the quality of scanned newspaper pages.
This module removes any paper warp caused by old age or moisture damage, removes dust and scratches, and fills in letters and symbols that have faded with time.
Certain small symbols, like punctuation and umlauts, tend to get lost during OCR especially if the source material is not in the best condition, therefore we have put additional effort into preservation of these symbols during preprocessing.
The next challenge was the detection of various articles and their types. Newspaper articles are difficult if not impossible to accurately detect for standart OCR solutions due to their complex structure and huge formatting variability.
Our application detects text blocks that belong to the same article across the entire page and compiles them together in a correct order, including headers, subheaders, illustrations, authorship attribution, and any other elements that are part of the same article. For each detected article our system determines its type, like an editorial article, an advertisement, an obituary, etc.
The text and images from each article are extracted in a form of editable and searchable text document.
All relations between article blocks are presented visually and can be manually edited using a visual editor. The user can click on any item within a newspaper page and reassign its article affiliation, change the blocks order, and change article types.
We have managed to achieve 98-99% recognition quality, making our system a reliable solution for digitizing newspapers of any time period and publisher.
Our newspaper digitization system is successfully used to create high-quality digital copies of historical and modern newspapers and has helped our client to grow their business and become one of the top document digitization companies in Europe.