
Government Form Data Extraction System

Technologies

GD Picture, OpenCV, AWS Lambda

Client

Confidential

Platform

Cloud

Country

USA

50+

Form types

AI-powered system for extracting data from government forms: extraction of individual forms from large PDFs, detection of form type, and extraction of relevant data.

Services

AI System Development
Cloud Engineering

Team

1 Project Manager
2 AI Developers
1 Full-Stack Developer

Target Audience

Government agencies

Challenge

Despite the abundance of AI document solutions on the market, none of them handles complex documents with high enough accuracy. Government forms are a prime example: they usually contain multiple input field types, come in a large variety of layouts, and often arrive in sets of hundreds of forms inside a single PDF file.

Our client asked us to extend their digital document handling ecosystem with an AI-powered government form processing system that supports the document digitization efforts of businesses across the US.

The key challenges were the variety of layouts, the complexity of government forms, and processing speed:

High Variety Of Government Forms

There is a huge variety of forms in the US, each with its own layout and design. The system needed a form type detection module and had to be flexible enough to accommodate new form types.

Complex Document Processing

The forms arrive as PDF documents hundreds of pages long, with dozens of field types and complex spreadsheets. Detecting individual forms within a large document and locating all fields correctly requires powerful machine learning models.

Processing Speed

Fast processing is essential for any AI document analysis system, and even more so for a system under high load. We designed the system with the potential daily load in mind.

Solution

We have developed an AI-powered government form processing system that detects individual forms in large PDF documents, determines the form type and layout, and extracts relevant data in a JSON format for further processing.

We have designed the system to be cloud-based to ensure stable system performance under high loads.

Form Extraction From A Large Document

Since forms are PDF files with a text layer, the system uses GD Picture to extract the text and look for keywords.

Multipage PDF documents are divided into separate forms by detecting specific keywords which mark a title page of each form, thus detecting the end of one form and the start of another.
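The splitting step described above can be sketched as follows. This is a minimal illustration, not the production code: it assumes the text of each page has already been extracted, and the keyword list and function names are hypothetical.

```python
from typing import List, Sequence

# Hypothetical title-page markers; the real keyword list is not published here.
TITLE_KEYWORDS = ("form w-2", "form 1099")


def is_title_page(page_text: str, keywords: Sequence[str] = TITLE_KEYWORDS) -> bool:
    """Return True if the page text contains any title keyword (case-insensitive)."""
    lowered = page_text.lower()
    return any(keyword in lowered for keyword in keywords)


def split_into_forms(
    pages: Sequence[str], keywords: Sequence[str] = TITLE_KEYWORDS
) -> List[List[int]]:
    """Group page indices into forms; every title page starts a new form."""
    forms: List[List[int]] = []
    for index, text in enumerate(pages):
        # Pages that precede the first title page form their own leading group.
        if is_title_page(text, keywords) or not forms:
            forms.append([index])
        else:
            forms[-1].append(index)
    return forms


pages = [
    "Form W-2 Wage and Tax Statement",
    "continuation page",
    "Form 1099 information return",
    "continuation page",
    "continuation page",
]
print(split_into_forms(pages))
```

Each inner list holds the page indices of one form, so the end of a form is implied by the start of the next title page.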

Form Type Detection

The system detects the type of a form by analyzing its overall structure and keywords, which improves data extraction quality and accuracy. Form title, structure, and data fields are all analyzed to produce an accurate prediction of the document type.

Our system is capable of working with dozens of form types.
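One simple way to picture the keyword part of this step is a scoring function over per-type keyword signatures. The sketch below is an assumption made for illustration: the signature table, the threshold, and the function name are hypothetical, and the structural analysis mentioned above is not modeled.

```python
from typing import Dict, Optional, Sequence

# Hypothetical keyword signatures; a new form type is supported by adding an entry.
FORM_SIGNATURES: Dict[str, Sequence[str]] = {
    "W-2": ("wage and tax statement", "employer identification number", "social security wages"),
    "1099-MISC": ("miscellaneous information", "payer's tin", "royalties"),
}


def detect_form_type(
    text: str,
    signatures: Dict[str, Sequence[str]] = FORM_SIGNATURES,
    min_score: float = 0.5,
) -> Optional[str]:
    """Return the form type whose keywords best match the text, or None if no type reaches min_score."""
    lowered = text.lower()
    best_type: Optional[str] = None
    best_score = 0.0
    for form_type, keywords in signatures.items():
        hits = sum(1 for keyword in keywords if keyword in lowered)
        score = hits / len(keywords)  # fraction of signature keywords found
        if score > best_score:
            best_type, best_score = form_type, score
    return best_type if best_score >= min_score else None


sample = "Wage and Tax Statement. Employer identification number: 00-0000000"
print(detect_form_type(sample))
```

Returning None below the threshold lets unknown documents be routed for manual review instead of being forced into the closest known type.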

Data Extraction From Government Forms

Extraction of data from a form starts with detecting its structure and basic elements, also known as primitives: horizontal and vertical lines, text boxes, spreadsheets, etc. After determining the general form structure, the system looks for text inside input fields and field titles, matches them together in pairs, and extracts them into a .json file for integration with the client’s systems.
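The pairing of field titles with field contents can be illustrated with a small geometric heuristic. This is a sketch under stated assumptions, not the system's actual matching logic: it assumes text boxes with top-left coordinates are already available, and the `TextBox` type, the `pair_fields` function, and the distance rule are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TextBox:
    text: str
    x: float  # left edge, in page units
    y: float  # top edge, in page units (grows downward)


def pair_fields(labels: List[TextBox], values: List[TextBox]) -> Dict[str, str]:
    """Assign each value to the nearest label located above or to the left of it."""
    result: Dict[str, str] = {}
    for value in values:
        candidates = [l for l in labels if l.x <= value.x and l.y <= value.y]
        if not candidates:
            continue  # no plausible title for this value; skip it
        # Manhattan distance between the top-left corners of label and value.
        nearest = min(candidates, key=lambda l: (value.x - l.x) + (value.y - l.y))
        result[nearest.text] = value.text
    return result


labels = [TextBox("Name", 10, 10), TextBox("Date", 200, 10)]
values = [TextBox("Jane Doe", 10, 25), TextBox("2024-01-31", 200, 25)]
fields = pair_fields(labels, values)
print(json.dumps(fields, sort_keys=True))
```

The resulting dictionary serializes directly to JSON, which matches the output format described above.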

Efficient AI Document Processing

Our system is cloud-based: all documents are uploaded to the Amazon cloud for processing. We designed the system with potential high loads in mind, optimizing the processing speed of each document. A large 100-page document is processed in under 1 minute, and a single-page document in 2 seconds.

We used AWS Lambda to create an easily scalable, fault-tolerant system in a short timeframe. The system supports batch processing, making it easy to process multiple documents at once.
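As a rough illustration of how batch processing could look on AWS Lambda, the sketch below handles an event that lists several uploaded documents. It assumes the function is triggered by S3 upload notifications, which this case study does not confirm; `process_document` is a hypothetical placeholder for the extraction pipeline.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def process_document(bucket: str, key: str) -> Dict[str, str]:
    """Hypothetical placeholder: the real pipeline would split, classify, and extract here."""
    return {"bucket": bucket, "key": key, "status": "processed"}


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Process every document referenced in an S3 notification event."""
    results: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(process_document(bucket, key))
    return {"processed": len(results), "results": results}


# Local usage example with a hand-written event; bucket and key names are made up.
fake_event = {
    "Records": [
        {"s3": {"bucket": {"name": "forms-inbox"}, "object": {"key": "batch/form+set%231.pdf"}}},
        {"s3": {"bucket": {"name": "forms-inbox"}, "object": {"key": "batch/form2.pdf"}}},
    ]
}
response = lambda_handler(fake_event, None)
print(response["processed"])
```

Because each invocation is stateless, the platform can run many such handlers in parallel, which is what makes this style of deployment straightforward to scale.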

Results

The system has been successfully integrated into our client's application and is already used by companies in multiple fields to automatically process government forms. We continue to improve the system, adding new features and support for new form types.
