A system for extracting articles from Southeast Asian newspapers: recognition of complex layouts, and detection and extraction of articles.
Newspapers are among the most difficult documents to digitize due to their complex and irregular structure. Southeast Asian newspapers present an even bigger challenge: their distinctive layouts and editorial conventions make them hard to process with AI.
Our client approached us with the task of developing an AI system for extracting articles from Southeast Asian newspapers. The main requirement, and the main challenge, of this project was accuracy: each article had to be detected and extracted in full, including all headers and subheaders.
Given the complexity of the task, we decided to start with a proof of concept: a first iteration of the system meant to show whether the problem could be solved with the required accuracy. A prototype surfaces potential pitfalls and bottlenecks while remaining quick and inexpensive to build.
Using LLMs is one of the easiest and least expensive approaches to document digitization, which is why it is the perfect fit for a prototype.
First, we extract the text and its coordinates, i.e. its position on the page, using OCR via Azure Document Intelligence. With this data, we can format the text to prepare it for LLM processing.
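The formatting step can be sketched as follows. This is a minimal illustration of arranging OCR output into reading order; the data shapes are placeholders, not the actual Azure Document Intelligence response schema.

```python
def format_ocr_lines(words, line_tol=10):
    """Group (text, x, y) word tuples into lines by y-coordinate,
    then sort each line left to right (assumed word format)."""
    lines = {}
    for text, x, y in sorted(words, key=lambda w: w[2]):
        # Snap the word to an existing line if its y is within tolerance.
        key = next((ly for ly in lines if abs(ly - y) <= line_tol), y)
        lines.setdefault(key, []).append((x, text))
    # Emit lines top to bottom, words left to right within each line.
    return "\n".join(
        " ".join(t for _, t in sorted(ws)) for _, ws in sorted(lines.items())
    )

words = [("world", 60, 102), ("Hello", 10, 100), ("Line2", 10, 140)]
print(format_ocr_lines(words))  # → "Hello world\nLine2"
```

Reconstructing reading order from coordinates like this gives the LLM coherent text rather than a bag of words.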
Next, we send the text to GPT-4o and ask it to split it into separate articles. GPT-4o looks for large semantic structures and uses them to determine where one article ends and another begins.
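A request of this kind might look like the sketch below. The prompt wording and the JSON reply format are hypothetical, illustrating the idea rather than the production prompt, which is not shown in this write-up.

```python
import json

# Hypothetical prompt for article segmentation (illustrative only).
PROMPT = """Split the following newspaper page text into separate articles.
Return JSON of the form:
{"articles": [{"title": "...", "start_line": 0, "end_line": 0}]}

TEXT:
"""

def parse_articles(model_reply: str) -> list:
    """Parse the model's JSON reply into a list of article descriptors."""
    return json.loads(model_reply)["articles"]

reply = '{"articles": [{"title": "Flood Relief", "start_line": 0, "end_line": 12}]}'
print(parse_articles(reply)[0]["title"])  # → Flood Relief
```

Asking the model for structured JSON with line ranges makes the split machine-readable, so the articles can be mapped back onto the page afterwards.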
Using the text coordinates, we can restore the text structure to locate the articles on each newspaper page.
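Locating an article on the page amounts to taking the union of the bounding boxes of the lines assigned to it. A minimal sketch, assuming `(x0, y0, x1, y1)` boxes:

```python
def article_bbox(line_boxes):
    """Union of per-line bounding boxes (x0, y0, x1, y1) gives the
    article's overall region on the page."""
    xs0, ys0, xs1, ys1 = zip(*line_boxes)
    return (min(xs0), min(ys0), max(xs1), max(ys1))

boxes = [(10, 100, 200, 130), (10, 135, 200, 300)]
print(article_bbox(boxes))  # → (10, 100, 200, 300)
```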
This approach showed high accuracy despite the limitations of a prototype. Building on it, we developed a powerful newspaper article detection system.
Improving recognition results further required a change of approach, which is why the final system uses a segmentation model. Segmentation models are more powerful and more customizable than any LLM, which makes them well suited to newspaper article detection.
We begin by applying a segmentation model to identify distinct blocks of text and images on a newspaper page. Each text block is enclosed in a segmentation mask with precise coordinates, and images on the page receive bounding boxes as part of the same segmentation pass.
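Turning a segmentation mask into usable coordinates is a small but essential step. The sketch below derives the tightest bounding box around a binary mask; the mask itself would come from the segmentation model.

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray):
    """Tightest bounding box (x0, y0, x1, y1) around a binary
    segmentation mask (True where the block is)."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((100, 100), dtype=bool)
mask[20:40, 30:70] = True  # a block spanning rows 20-39, cols 30-69
print(mask_to_bbox(mask))  # → (30, 20, 69, 39)
```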
After segmentation, we use the Azure Document Intelligence Read model to extract the text contained within each identified block. We then compute embedding vectors for both the text and the images using pre-trained models: BERT for text and ResNet for image embeddings.
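Once both modalities are embedded, they can be combined into a single feature vector per block. This is a minimal sketch of one way to do that; the dimensions (768 for BERT base, 2048 for ResNet-50) and the weighting scheme are assumptions, not the system's actual fusion step.

```python
import numpy as np

def joint_feature(text_emb, image_emb, image_weight=0.5):
    """L2-normalize each modality, then concatenate into one feature
    vector usable for clustering (hypothetical fusion scheme)."""
    t = text_emb / np.linalg.norm(text_emb)
    v = image_emb / np.linalg.norm(image_emb)
    return np.concatenate([t, image_weight * v])

# BERT base → 768-d text vectors; ResNet-50 → 2048-d image vectors.
f = joint_feature(np.ones(768), np.ones(2048))
print(f.shape)  # → (2816,)
```

Normalizing per modality keeps the text and image components on comparable scales before the distances are computed.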
Next, we employ clustering algorithms to group the blocks and images into coherent articles. The clustering is based not only on the embedding vectors but also on additional parameters, such as the relative spatial positions of the blocks on the page.
This multi-dimensional clustering allows us to treat adjacent or semantically related blocks as parts of the same article. When the clustering results are inaccurate, we apply post-processing algorithms that refine them using additional contextual cues.
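The grouping step can be illustrated with a simple greedy single-linkage clustering over combined features. The feature layout (embedding components plus scaled page coordinates) and the distance threshold are assumptions for the sketch, not the production algorithm.

```python
import numpy as np

def cluster_blocks(features, threshold=1.0):
    """Greedy single-linkage clustering: a block joins an existing cluster
    if it lies within `threshold` of any member; otherwise it starts one."""
    clusters = []  # each cluster is a list of block indices
    for i, f in enumerate(features):
        for c in clusters:
            if any(np.linalg.norm(f - features[j]) < threshold for j in c):
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Rows = [embedding components..., x, y], scaled to comparable ranges.
feats = np.array([[0.0, 0.0, 0.1],
                  [0.1, 0.0, 0.2],   # close to block 0 → same article
                  [5.0, 5.0, 5.0]])  # far away → separate article
print(cluster_blocks(feats))  # → [[0, 1], [2]]
```

Because spatial coordinates are part of the feature vector, two semantically similar blocks on opposite corners of the page are still kept apart.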
Once the articles are identified, we crop the page according to the block boundaries of the clustered articles. For each article, we save the cropped region as a PNG image, the extracted text, the embedding vector of the article, and a generated summary.
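The cropping and export step might look like this sketch using Pillow. The function name and bounding-box format are illustrative; only the PNG crop is shown, with text, embeddings, and summaries saved separately.

```python
from io import BytesIO
from PIL import Image

def export_article(page: Image.Image, bbox, out_path=None):
    """Crop the article region (x0, y0, x1, y1) from the page image
    and return the crop plus its PNG bytes; optionally write to disk."""
    crop = page.crop(bbox)
    buf = BytesIO()
    crop.save(buf, format="PNG")
    if out_path:
        with open(out_path, "wb") as fh:
            fh.write(buf.getvalue())
    return crop, buf.getvalue()

page = Image.new("RGB", (400, 600), "white")  # stand-in for a scanned page
crop, png = export_article(page, (10, 20, 110, 220))
print(crop.size)  # → (100, 200)
```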
The system detects articles in Southeast Asian newspapers with over 98% accuracy: it extracts articles with their structure intact and detects images, headers, and subheaders. The articles are exported in multiple formats, ready for post-processing by our client's software.