AI-powered system for extracting data from government forms. Extraction of forms from large PDFs, detection of form type, extraction of relevant data.
Despite the abundance of AI document solutions on the market, none of them handles complex documents with high enough accuracy. Government forms are a prime example: they usually contain multiple input field types, come in a wide variety of layouts, and often arrive as sets of hundreds of forms inside a single PDF file.
Our client approached us with the task of augmenting their digital document handling ecosystem with an AI-powered government form processing system to support the document digitization efforts of businesses across the US.
The key challenges lay in the high variety of layouts, complexity of the government forms and processing speed:
There is a huge variety of forms in the US, each with its own layout and design. The system would need to include a form type detection module and be flexible enough to accommodate new form types.
The forms arrive as PDF documents with hundreds of pages, dozens of field types, and complex spreadsheets. Detecting individual forms within a large document and locating all fields correctly requires powerful machine learning models.
Fast document processing is essential for any AI document analysis system, even more so for one under high load. We developed the system with the potential daily load in mind.
We have developed an AI-powered government form processing system that detects individual forms in large PDF documents, determines the form type and layout, and extracts relevant data in JSON format for further processing.
We have designed the system to be cloud-based to ensure stable system performance under high loads.
Since the forms are PDF files with a text layer, the system uses GdPicture to extract the text and look for keywords.
Multipage PDF documents are divided into separate forms by detecting specific keywords that mark the title page of each form, which shows where one form ends and the next begins.
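The splitting step can be illustrated with a minimal Python sketch. It assumes the text of each page has already been extracted into a list of strings. The keyword list and the names `TITLE_KEYWORDS`, `is_title_page`, and `split_into_forms` are hypothetical and chosen only for illustration; they are not taken from the production system.

```python
from typing import List

# Hypothetical title-page markers; the real keyword list is not public.
TITLE_KEYWORDS = ("form w-2", "form 1099", "form 1040")


def is_title_page(page_text: str) -> bool:
    """Return True if the page text contains any title-page keyword."""
    text = page_text.lower()
    return any(keyword in text for keyword in TITLE_KEYWORDS)


def split_into_forms(pages: List[str]) -> List[List[int]]:
    """Group page indices into forms; each title page starts a new form."""
    forms: List[List[int]] = []
    for index, text in enumerate(pages):
        # The very first page always opens a group, even without a keyword.
        if is_title_page(text) or not forms:
            forms.append([index])
        else:
            forms[-1].append(index)
    return forms


pages = [
    "Form W-2 Wage and Tax Statement",
    "Instructions, continued",
    "Form 1099 Miscellaneous Information",
]
form_groups = split_into_forms(pages)
```

Here `form_groups` holds one list of page indices per detected form, so each group can be passed on to type detection and data extraction independently.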
The system detects the type of a form by analyzing its overall structure and keywords, which improves data extraction quality and accuracy. Form title, structure, and data fields are all analyzed to reliably determine the document type.
Our system is capable of working with dozens of form types.
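As a simplified illustration of keyword-based type detection, the sketch below scores a form's text against a keyword profile per form type and picks the best match. It covers only the keyword part and ignores structural analysis. The profiles, the `min_score` threshold, and the function name `detect_form_type` are assumptions made for this example, not details of the actual system.

```python
from typing import Dict, Optional, Tuple

# Hypothetical keyword profiles; a real system would hold dozens of form types.
FORM_PROFILES: Dict[str, Tuple[str, ...]] = {
    "W-2": (
        "wage and tax statement",
        "employer identification number",
        "social security wages",
    ),
    "1099-MISC": (
        "miscellaneous information",
        "payer's tin",
        "nonemployee compensation",
    ),
}


def detect_form_type(text: str, min_score: float = 0.5) -> Optional[str]:
    """Return the form type whose keywords best match the text, or None."""
    lowered = text.lower()
    best_type: Optional[str] = None
    best_score = 0.0
    for form_type, keywords in FORM_PROFILES.items():
        hits = sum(1 for keyword in keywords if keyword in lowered)
        score = hits / len(keywords)  # fraction of profile keywords found
        if score > best_score:
            best_type, best_score = form_type, score
    # Reject weak matches so unknown forms are not forced into a known type.
    return best_type if best_score >= min_score else None


sample = "Form W-2 Wage and Tax Statement. Employer identification number: 00-0000000"
detected = detect_form_type(sample)
```

Supporting a new form type in this sketch means adding one more entry to `FORM_PROFILES`, which mirrors the flexibility requirement described earlier.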
Extraction of data from a form starts with detecting its structure and basic elements, also known as primitives: horizontal and vertical lines, text boxes, spreadsheets, etc. After determining the general form structure, the system looks for text inside input fields and field titles, matches them together in pairs, and extracts them into a .json file for integration with the client’s systems.
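The pairing of field titles with field contents can be sketched as follows. This is a deliberately simple geometric heuristic, assuming each text box is already located and that a value belongs to the nearest title positioned above or to the left of it. The `TextBox` class, `pair_fields` function, and sample coordinates are hypothetical; the real matching logic is not described in detail here.

```python
import json
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TextBox:
    text: str
    x: float  # left edge of the box
    y: float  # top edge of the box (grows downward)


def pair_fields(labels: List[TextBox], values: List[TextBox]) -> Dict[str, str]:
    """Assign each value to the nearest label above or to the left of it."""
    result: Dict[str, str] = {}
    for value in values:
        candidates = [
            label for label in labels
            if label.x <= value.x and label.y <= value.y
        ]
        if not candidates:
            continue  # no plausible title for this value; skip it
        # Manhattan distance between the top-left corners.
        nearest = min(
            candidates,
            key=lambda label: (value.x - label.x) + (value.y - label.y),
        )
        result[nearest.text] = value.text
    return result


labels = [TextBox("Employer name", 10, 10), TextBox("EIN", 200, 10)]
values = [TextBox("Acme Corp", 10, 25), TextBox("00-0000000", 200, 25)]
fields = pair_fields(labels, values)
fields_json = json.dumps(fields, sort_keys=True)
```

The resulting `fields_json` string is the kind of key-value output that could be written to a .json file for downstream integration.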
Our system is cloud-based: all documents are uploaded to the Amazon cloud for processing. We have designed the system with potential high loads in mind, optimizing the processing speed of each document. Large 100-page documents are processed in under 1 minute, and a single-page document is processed in 2 seconds.
We used AWS Lambda to create an easily scalable, fault-tolerant system in a short timeframe. The system supports batch processing, making it easy to process multiple documents at once.
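A batch-oriented Lambda entry point might look roughly like the sketch below. It assumes the function is triggered by Amazon S3 upload notifications, which is an assumption of this example and is not stated above. `process_document` is a hypothetical stand-in for the extraction pipeline. Each document is handled independently, so one failure does not abort the rest of the batch.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def process_document(bucket: str, key: str) -> Dict[str, Any]:
    """Hypothetical placeholder for the form extraction pipeline."""
    return {"bucket": bucket, "key": key, "status": "processed"}


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Process every document referenced in an S3 notification event."""
    processed: List[Dict[str, Any]] = []
    failed: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        try:
            processed.append(process_document(bucket, key))
        except Exception as error:  # keep the batch going on a single failure
            failed.append({"key": key, "error": str(error)})
    return {"processed": processed, "failed": failed}


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "forms-inbox"}, "object": {"key": "batch/form+set+1.pdf"}}},
        {"s3": {"bucket": {"name": "forms-inbox"}, "object": {"key": "batch/form2.pdf"}}},
    ]
}
summary = handler(sample_event, None)
```

The bucket name and keys in `sample_event` are made up for the example.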
The system has been successfully integrated into our client's application and is already used by companies in multiple fields to automatically process government forms. We continue to improve the system, adding new features and support for new form types.