A data extraction module powered by GPT-3.5 for a legal document processing system
Our client is a SaaS company building a cloud-based legal document management system that helps law firms reduce paperwork and manage document flows better. The system handles over 40,000 documents every month, organizes and stores document and law firm data, and provides a task management system.
Looking to gain a competitive advantage in the market, the company approached us to create an AI document analysis module with text extraction capabilities to automate document data processing.
We have developed a module for intelligent document processing: a system powered by GPT-3.5 capable of extracting relevant data from legal documents in a matter of seconds.
The AI module can process dozens of different document types by analyzing their layout. After the document’s type is detected, we use PaddleOCR to extract the text layer and pass it on for further processing.
We use GPT-3.5 to extract relevant information, e.g. the dates of legal proceedings and details of the legal process, such as the date and location of a forensic examination or a rehearing.
A standard GPT-3.5 model has a text length limitation, which makes it unsuitable for processing long legal documents. Using an optimized GPT-3.5 32k, we’ve bypassed the limitation.
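To illustrate why text length matters, here is a minimal sketch of a check that estimates whether a document fits into a model's context window. It is not the project's actual code: the 4-characters-per-token ratio is a common rule of thumb rather than an exact count, and the limits, the reply budget, and the function names are assumptions made for this example.

```python
import math

# Assumption: roughly 4 characters per token for English text. This is a
# rule of thumb; a real tokenizer is needed for exact counts.
CHARS_PER_TOKEN = 4

# Hypothetical context limits (in tokens), used only for illustration.
SHORT_CONTEXT = 4_000
LONG_CONTEXT = 32_000


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate for the given text."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(text: str, context_limit: int, reserved_for_reply: int = 1_000) -> bool:
    """Check whether the text plus a reserved reply budget fits the limit."""
    return estimate_tokens(text) + reserved_for_reply <= context_limit


if __name__ == "__main__":
    document_text = "x" * 60_000  # stands in for the OCR output of a long document
    print(estimate_tokens(document_text))              # 15000
    print(fits_context(document_text, SHORT_CONTEXT))  # False
    print(fits_context(document_text, LONG_CONTEXT))   # True
```

A long court filing can easily exceed the shorter limit in this sketch, which is the kind of situation a larger context window addresses.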
Legal documents are often signed by judges to certify them, and a document missing a signature, or signed by a different person, is not legally binding. One of the most important aspects of the AI module is signature detection: unsigned documents need to be filtered out for further investigation.
We have used YOLOv5 to detect signatures and identify the signer using a dataset of judges’ signatures. The client’s system uses this information to filter out legal documents that are either missing a signature or signed by someone else, alerting the user to an unverified document.
Our client's system runs on Azure, so we built our AI module to be cloud-based as well to ensure seamless integration. We have configured access to the OpenAI API in the cloud for continuous system operation.
The data extracted from legal documents is imported into the client's system in JSON format for easy data integration.
The cost of using GPT for document recognition can quickly add up: the longer the document, the more tokens it contains and the more it costs to process. We have implemented multiple techniques to reduce costs for our client.
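The filtering step can be sketched roughly as follows. This is an illustration only, not the production logic: the SignatureDetection structure, the status labels, and the confidence threshold are hypothetical, and the detections are assumed to have already been produced by the signature detector.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SignatureDetection:
    signer: str        # signer label predicted by the detector (hypothetical format)
    confidence: float  # detection confidence in the range [0, 1]


def verification_status(
    detections: List[SignatureDetection],
    expected_signer: str,
    min_confidence: float = 0.5,  # assumed threshold, chosen for illustration
) -> str:
    """Classify a document as 'verified', 'unsigned', or 'wrong_signer'."""
    # Ignore detections the model is not confident about.
    confident = [d for d in detections if d.confidence >= min_confidence]
    if not confident:
        return "unsigned"
    if any(d.signer == expected_signer for d in confident):
        return "verified"
    return "wrong_signer"


if __name__ == "__main__":
    detections = [SignatureDetection(signer="judge_a", confidence=0.91)]
    print(verification_status(detections, expected_signer="judge_a"))  # verified
    print(verification_status(detections, expected_signer="judge_b"))  # wrong_signer
    print(verification_status([], expected_signer="judge_a"))          # unsigned
```

Anything other than "verified" would be flagged for manual review in this sketch, which mirrors the alert for unverified documents described above.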
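As a rough illustration of such a JSON hand-off, the sketch below serializes one extracted record and reads it back. The field names and the sample values are hypothetical; the real schema is defined by the client's system and is not shown in this case study.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ExtractedDocument:
    # All field names are assumptions made for this example.
    document_type: str
    hearing_date: Optional[str]      # ISO 8601 date string, or None if not found
    hearing_location: Optional[str]
    signature_status: str


def to_import_json(record: ExtractedDocument) -> str:
    """Serialize a record to a JSON string for import."""
    return json.dumps(asdict(record), ensure_ascii=False, indent=2)


def from_import_json(payload: str) -> ExtractedDocument:
    """Rebuild a record from its JSON representation."""
    return ExtractedDocument(**json.loads(payload))


if __name__ == "__main__":
    record = ExtractedDocument(
        document_type="summons",
        hearing_date="2024-03-15",
        hearing_location=None,
        signature_status="verified",
    )
    payload = to_import_json(record)
    print(payload)
    print(from_import_json(payload) == record)  # True
```

Missing values are kept as explicit null fields here so that the receiving side can tell "not found" apart from "not extracted"; that is a design choice of this sketch, not a documented property of the client's system.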
Our client's legal document processing app is now processing documents automatically, using GPT-3.5 to extract relevant data and filter out unverified documents. The AI system was implemented seamlessly and works in the cloud to ensure stable and continuous operation.
Using the power of modern AI, our client has enhanced their product and gained an advantage over the competition in the field of automatic document processing.