
Southeast Asian Newspaper Extraction

Industry: Document archiving
Client: Confidential
Platform: Cloud
Duration: 3 months
Accuracy: 98%

Project Summary

A system for extracting articles from Southeast Asian newspapers: it recognizes complex layouts and detects and extracts individual articles.

Services

AI Prototype Development, Computer Vision System Development

Team

1 Project Manager, 3 Machine Learning Developers, 1 QA Engineer

Target Audience

Document Archives

Challenge

Newspapers are among the most difficult documents to digitize due to their complex and irregular structure. Southeast Asian newspapers present an even bigger challenge: their specific layouts and editing styles make them problematic to process with AI.

Our client approached us with the task of developing an AI system for Southeast Asian newspaper article extraction. The main requirement, and the main challenge, of this project was accuracy: each article needed to be detected and extracted in full, including all headers and subheaders.

Solution

Due to the complexity of the task at hand, we decided to start with a proof of concept: a first iteration of the system meant to show whether the problem could be solved with the required accuracy. A prototype highlights potential pitfalls and bottlenecks while being inexpensive and quick to develop.

Prototype

Using LLMs is one of the easiest and least expensive approaches to document digitization, which is why it is the perfect fit for a prototype.

First, we extract the text and its coordinates, i.e. its position on the page, using OCR via Azure Document Intelligence. Using this data, we format the text to prepare it for LLM processing.
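For illustration, a minimal sketch of this step, assuming the azure-ai-formrecognizer SDK and its prebuilt-read model (the endpoint, key, and file names are placeholders, not the project's actual configuration):

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Placeholder endpoint and key; the real service configuration is project-specific.
client = DocumentAnalysisClient(
    endpoint="https://<your-resource>.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<key>"),
)

with open("newspaper_page.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-read", document=f)
result = poller.result()

# Collect each OCR line together with its polygon (page coordinates),
# which is later used to place articles back on the page.
lines = []
for page in result.pages:
    for line in page.lines:
        lines.append({
            "page": page.page_number,
            "text": line.content,
            "polygon": [(p.x, p.y) for p in line.polygon],
        })
```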

Next, we send the text to GPT-4o and ask it to split the text into separate articles. GPT-4o looks for large semantic structures and uses them to determine where one article ends and another begins.
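A sketch of this step using the openai Python client; the prompt wording and the "---" delimiter are illustrative assumptions rather than the exact prompt used in the project:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def split_into_articles(page_text: str) -> list[str]:
    """Ask GPT-4o to segment one page of OCR text into separate articles."""
    prompt = (
        "The following is OCR text from one newspaper page. "
        "Split it into separate articles. Keep each article's headline, "
        "subheadings and body together, and separate articles with the line '---'.\n\n"
        + page_text
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    content = response.choices[0].message.content
    return [article.strip() for article in content.split("---") if article.strip()]
```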

Using the text coordinates, we can restore the text structure to locate the articles on each newspaper page.
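One simple way to do this, reusing the lines collected in the OCR sketch above, is to match each OCR line against an article's text and take the bounding box of all matched polygons. This is an illustrative heuristic, not the project's exact matching logic:

```python
def locate_article(article_text: str, lines: list[dict]) -> tuple[float, float, float, float]:
    """Return a bounding box (min_x, min_y, max_x, max_y) in page coordinates
    for the OCR lines that belong to the given article."""
    xs, ys = [], []
    for line in lines:
        if line["text"] and line["text"] in article_text:
            for x, y in line["polygon"]:
                xs.append(x)
                ys.append(y)
    if not xs:
        raise ValueError("No OCR lines matched this article")
    return min(xs), min(ys), max(xs), max(ys)
```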

This approach showed high accuracy despite the limitations of a prototype. Building on it, we developed a powerful newspaper article detection system.

Article Detection With A Segmentation Model

Improving recognition results required a change of approach, which is why we used a segmentation model in the final system. For layout analysis, segmentation models are more precise and more customizable than a general-purpose LLM, so they lend themselves well to newspaper article detection.

We begin by applying a segmentation model to identify distinct blocks of text and images on a newspaper page. Each block is enclosed in a segmentation mask with precise coordinates. Similarly, images on the page are also enclosed in bounding boxes as part of the segmentation process.
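As a sketch, using the ultralytics API named in the workflow below; the weight file is a placeholder for a model trained on newspaper layouts:

```python
from ultralytics import YOLO

# A segmentation model trained on newspaper blocks is assumed here;
# the weight file name is a placeholder (the workflow below names YOLOv8-seg as an example).
model = YOLO("newspaper_blocks_yolov8-seg.pt")

result = model("newspaper_page.png")[0]

blocks = []
for i, (box, cls) in enumerate(zip(result.boxes.xyxy.tolist(), result.boxes.cls.tolist())):
    blocks.append({
        "bbox": tuple(box),                                              # (x1, y1, x2, y2)
        "label": result.names[int(cls)],                                 # e.g. "text" or "image"
        "mask": result.masks.xy[i] if result.masks is not None else None,  # polygon points
    })
```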

After segmentation, we use the Azure Document Intelligence Read model to extract the text contained within each identified block. We then calculate embedding vectors for both the text and the images using pre-trained models: BERT for text and ResNet for image embeddings.
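An illustrative sketch of the embedding step; the exact checkpoints are assumptions (the workflow below mentions a BAAI model, so BAAI/bge-small-en-v1.5 stands in for the BERT-based text encoder, and ResNet-50 for images):

```python
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer
from torchvision import models, transforms

# Text embeddings: a BERT-based BAAI model (assumed checkpoint).
text_encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")


def embed_text(text: str) -> torch.Tensor:
    return torch.tensor(text_encoder.encode(text))


# Image embeddings: a ResNet with the classification head removed.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


def embed_image(path: str) -> torch.Tensor:
    img = Image.open(path).convert("RGB")
    with torch.no_grad():
        return resnet(preprocess(img).unsqueeze(0)).squeeze(0)  # 2048-dim vector
```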

Next, we employ clustering algorithms to group the blocks and images into coherent articles. The clustering is based not only on the embedding vectors but also on additional parameters, such as the relative spatial positions of the blocks on the page.

This multi-dimensional clustering allows us to treat adjacent or semantically related blocks as part of the same article. If the clustering results are inaccurate, we apply post-processing algorithms to refine the results by considering additional contextual cues.
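A minimal sketch of how such clustering could be wired up with scikit-learn's DBSCAN. The spatial weighting, the eps value, and the assumption that text and image embeddings have been brought to a common dimensionality are illustrative choices, not the project's exact configuration:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize


def cluster_blocks(embeddings: np.ndarray, centers: np.ndarray,
                   spatial_weight: float = 0.5, eps: float = 0.6) -> np.ndarray:
    """Group page blocks into articles.

    embeddings: (n_blocks, d) semantic vectors, one per block.
    centers:    (n_blocks, 2) block centre coordinates, normalized to [0, 1]
                by page width and height.
    Returns one cluster label per block; with min_samples=1 every block is
    assigned, so an isolated block simply becomes a single-block article.
    """
    # Concatenate semantic and (weighted) spatial features so that DBSCAN
    # considers both meaning and position on the page.
    features = np.hstack([normalize(embeddings), spatial_weight * centers])
    return DBSCAN(eps=eps, min_samples=1).fit_predict(features)
```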

Once the articles are identified, we crop the page according to the block boundaries of the clustered articles. For each article, we save the cropped region as a PNG image, the extracted text, the embedding vector of the article, and a generated summary.
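A simplified sketch of the export step, assuming Pillow for cropping; the directory layout and file names are illustrative:

```python
import json
from pathlib import Path

import numpy as np
from PIL import Image


def export_article(page_image_path: str, article_id: int, bbox: tuple,
                   text: str, embedding: np.ndarray, summary: str,
                   out_dir: str = "articles") -> None:
    """Save one clustered article as PNG + text + embedding + summary."""
    out = Path(out_dir) / f"article_{article_id:03d}"
    out.mkdir(parents=True, exist_ok=True)

    # Crop the article region out of the full page scan.
    page = Image.open(page_image_path)
    page.crop(tuple(int(v) for v in bbox)).save(out / "article.png")

    (out / "article.txt").write_text(text, encoding="utf-8")
    np.save(out / "embedding.npy", embedding)
    (out / "summary.json").write_text(json.dumps({"summary": summary}), encoding="utf-8")
```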

Workflow

  1. PDF Input: The process begins by ingesting a PDF of a newspaper.
  2. Segmentation:
    1. A segmentation model (e.g., YOLOv8-seg) detects and marks individual blocks (text and images) on the page, providing a bounding box for each.
  3. OCR, Embedding Extraction, and Clustering:
    1. For each text block, Azure's OCR extracts the text.
    2. Text embeddings are computed with a BAAI model to represent the semantic content.
    3. Image embeddings are calculated with ResNet.
    4. DBSCAN, a clustering algorithm, groups the text and image blocks into articles, using both the embedding vectors and the spatial relationships (e.g., relative coordinates) of the blocks.
  4. Post-Processing: If clustering results are imprecise, contextual rules or heuristic-based methods are applied to adjust the groupings.
  5. Article Cropping and Output:
    1. Once articles are identified, the system crops the page based on the bounding boxes of each clustered article.
    2. For each article, the following outputs are generated:
      1. PNG: a cropped image of the article.
      2. Text: the full text extracted via OCR.
      3. Embedding: the combined embedding vector for the text (useful later for on-request searches for relevant articles in the database).

Results

The system detects articles in Southeast Asian newspapers with over 98% accuracy: it extracts articles with their structure intact and detects images, headers, and subheaders. The articles are exported in multiple formats, ready for post-processing by our client's software.

Let's Work Together!

Do you want to know the total cost of developing and delivering your project? Tell us about your requirements, and our specialists will contact you as soon as possible.
