Intelligent Document Processing (IDP) Models Benchmark

We continuously test large language models on business automation tasks. The benchmark is based on datasets of digital documents with varied layouts and languages, representative of the documents processed in real projects.

We test how well AI models extract data from complex documents by assessing the accuracy and completeness of data detection.

Testing Criteria

01

Recognition Accuracy

How accurately an AI model detects and extracts data from a document, such as field titles and values, document layout, and text and character blocks.

02

Processing Duration

How long it takes a model to process one document on average.

03

Cost

The processing cost per 1000 pages and any additional costs.
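
As an illustration only, the three criteria could be measured for a single document roughly as follows. This is a minimal Python sketch: the exact-match comparison, the sample field names, and the helper names `field_accuracy`, `timed`, and `cost_per_1000_pages` are assumptions made for the example, not the scoring method used in the reports.

```python
import time


def field_accuracy(expected: dict, extracted: dict) -> float:
    """Share of expected fields whose extracted value matches exactly.

    Values are compared as strings with surrounding whitespace trimmed.
    Exact matching is an assumption of this sketch.
    """
    if not expected:
        return 0.0  # nothing to score; treated as zero by convention here
    hits = sum(
        1
        for name, value in expected.items()
        if str(extracted.get(name, "")).strip() == str(value).strip()
    )
    return hits / len(expected)


def timed(extract, document):
    """Run a (hypothetical) extraction callable and return (result, seconds)."""
    start = time.perf_counter()
    result = extract(document)
    return result, time.perf_counter() - start


def cost_per_1000_pages(price_per_page: float) -> float:
    """Scale a per-page price to the per-1000-pages figure used in the tables."""
    return price_per_page * 1000


# Hypothetical ground truth and model output for one invoice.
expected = {"invoice_number": "INV-1", "total": "100.00", "currency": "EUR"}
extracted = {"invoice_number": "INV-1", "total": "100.00", "currency": "USD"}
print(f"accuracy: {field_accuracy(expected, extracted):.2%}")
```

Averaging the per-document durations over the whole dataset would give the "on average" figure described under Processing Duration.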

Reports

March 2026

Updated Comparison of AI Models For Invoice Processing: Amazon Analyze Expense API, Azure AI Document Intelligence, Google Document AI, GPT 5.2, GPT 5 Mini, Claude Sonnet 4.5, Gemini 3 Flash, Gemini 3 Pro

January 2026

Benchmarking LLMs Handwriting Recognition: Azure, AWS, Google, Claude Sonnet, Gemini 2.5 Flash Lite, GPT-5 Mini, Grok 4

November 2025

Benchmarking LLMs for RFQ Item Matching: GPT-5, GPT-5 Nano, Gemini 2.5 Pro, Claude Sonnet 4, Grok 4

July 2025

Testing LLMs On Extracting Dimensional and Tolerance Data From Engineering Drawings: Gemini 2.5 Flash, Gemini 2.5 Pro, ChatGPT o4 mini, ChatGPT o3, Claude Opus 4, Qwen VL Plus

June 2025

Comparison Of AI Models For Table Extraction: Amazon Boto3 Textract, Azure Prebuilt Layout, GPT-4o API, Gemini 2.5 Pro, Grok 2 Vision, Pixtral Large, Google Layout Parser

March 2025

Expanded Comparison Of AI Models For Invoice Processing: Amazon Analyze Expense API, Azure AI Document Intelligence, Google Document AI, GPT-4o API, GPT-4o API with text input with 3rd party OCR, Gemini 2.0 Pro Experimental, Deepseek v3

February 2025

Best AI Services For Automatic Invoice Processing: Amazon Analyze Expense API, Azure AI Document Intelligence, Google Document AI, GPT-4o API, GPT-4o API with text input with 3rd party OCR.

Invoice Processing: 2026 vs 2025

We tested 8 document AI services and LLMs on a dataset of invoices and compared the results with our previous invoice processing benchmark.

Model	2025 Average Efficiency	2026 Average Efficiency
Amazon Analyze Expense API	91.10%	82.87%
Azure AI Document Intelligence	85.70%	90.52%
Google Document AI	68.10%	79.76%
Gemini 3 Flash	—	89.72%
Gemini 3 Pro	—	94.75%
Gemini 2 Pro	90.20%	—
GPT 5.2	—	87.59%
GPT-4o API (image input)	89.20%	—
GPT 5 Mini	—	87.94%
GPT-4o API (text input with 3rd party OCR)	86.50%	—
Claude Sonnet 4.5	—	90.27%
Deepseek v3 (text input)	88.10%	—
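
Only three services appear in both years, so a direct year-over-year comparison is limited to them. The short Python sketch below computes the change in percentage points from the figures in the table; the function name `point_change` is introduced here for illustration.

```python
# Average efficiency (%) for the services tested in both years,
# copied from the table above as (2025, 2026).
scores = {
    "Amazon Analyze Expense API": (91.10, 82.87),
    "Azure AI Document Intelligence": (85.70, 90.52),
    "Google Document AI": (68.10, 79.76),
}


def point_change(before: float, after: float) -> float:
    """Change in percentage points, rounded to two decimals."""
    return round(after - before, 2)


changes = {name: point_change(*pair) for name, pair in scores.items()}
for name, delta in changes.items():
    print(f"{name}: {delta:+.2f} pp")
```

On these numbers, Amazon Analyze Expense API dropped by 8.23 points, while Azure AI Document Intelligence and Google Document AI gained 4.82 and 11.66 points respectively.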

See Full Report March 2026

Handwriting Processing

We have tested 7 AI models on a set of handwritten forms.

Model	Average model accuracy	Average cost per 1000 forms	Average processing time per form, s
GPT 5 Mini	88.19%	$5.06	32.179
Gemini 2.5 Flash Lite	87.29%	$0.37	5.484
AWS	72.61%	$65	4.845
Claude Sonnet	70.34%	$18.7	15.488
Azure	67.52%	$10	6.588
Google	50.69%	$30	5.633
Grok 4	22.74%	$11.5	129.257

See Full Report January 2026

Engineering Drawings Processing: Tabular Data

We have tested 7 popular AI services capable of processing tabular data on schedules from engineering drawings.

Service	Table Extraction Accuracy	Processing Duration per Page, s	Cost per 1000 Pages
Azure Prebuilt Layout	81.5%	4.3 ± 0.2	$10
Amazon Boto3 Textract	82.1%	2.9 ± 0.2	$15
Gemini 2.5 Pro Preview 05-06	94.2%	47.4 ± 15.7	$58
GPT-4o API	38.5%	16.9 ± 1.9	$19
Grok 2 Vision 1212	Failed	—	—
Pixtral Large Latest	Failed	—	—
Google Layout Parser	Failed	—	—

See Full Report June 2025

RFQ Processing

We benchmarked 5 leading models on their ability to recommend catalog items based on real RFQ inputs.

Model	Hit Rate@5	MRR	nDCG@5	Cost per RFQ
GPT-5	0.759	0.633	0.665	$1.2907
GPT-5 Nano	0.826	0.609	0.663	$0.0849
Gemini 2.5 Pro	0.767	0.637	0.670	$1.0325
Grok 4	0.771	0.594	0.637	$0.0461
Claude Sonnet 4	0.800	0.581	0.637	$1.6369

Hit Rate@5: How often the correct item appeared anywhere in the model's top-5 list.
MRR (Mean Reciprocal Rank): How close to the top the correct item landed, with rank 1 being ideal.
nDCG@5: A position-weighted ranking score that rewards models for placing the right item near the top.
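
The three metrics above can be sketched in a few lines of Python. The sketch assumes exactly one correct catalog item per RFQ and truncates MRR at the top 5; both are assumptions made for this example, and the item IDs are hypothetical. The full report may define the metrics differently.

```python
import math


def rank_of(correct: str, ranked: list, k: int = 5):
    """1-based position of the correct item in the top-k list, or None if absent."""
    top_k = ranked[:k]
    return top_k.index(correct) + 1 if correct in top_k else None


def evaluate(cases, k: int = 5) -> dict:
    """Score a list of (correct_item, ranked_predictions) pairs."""
    ranks = [rank_of(correct, ranked, k) for correct, ranked in cases]
    n = len(ranks)
    found = [r for r in ranks if r is not None]
    hit_rate = len(found) / n
    # Reciprocal rank is 0 when the correct item is missing from the top k.
    mrr = sum(1 / r for r in found) / n
    # With a single relevant item the ideal DCG is 1, so nDCG for one
    # case reduces to 1 / log2(rank + 1).
    ndcg = sum(1 / math.log2(r + 1) for r in found) / n
    return {"hit_rate": hit_rate, "mrr": mrr, "ndcg": ndcg}


# Hypothetical RFQs: the correct item, then the model's ranked suggestions.
cases = [
    ("A", ["A", "B", "C"]),       # correct item ranked 1st
    ("B", ["A", "B", "C"]),       # ranked 2nd
    ("Z", ["A", "B", "C"]),       # not in the list
    ("C", ["X", "Y", "C", "D"]),  # ranked 3rd
]
print(evaluate(cases))
```

A model can have a high Hit Rate@5 and still score lower on MRR and nDCG@5 if it tends to place the correct item in fourth or fifth position, which is consistent with the GPT-5 Nano row in the table.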

See Full Report November 2025

Engineering Drawings Processing: Dimensional & Tolerance Data

Testing 6 LLMs on extraction of dimensional and tolerance data from real-world mechanical engineering drawings.

Service	Data Extraction Efficiency	Processing Duration per Page, s	Cost per 1000 Pages
Gemini 2.5 Flash	77.34%	77.5	$30.5
Gemini 2.5 Pro	79.96%	91.4	$130.4
ChatGPT o4 mini	39.59%	41.75	$24.9
ChatGPT o3	20.38%	163	$239.2
Claude Opus 4	40.49%	64.8	$312
Qwen VL Plus	7.64%	22	$1.59

See Full Report July 2025

Invoice Processing

We analysed 7 of the most popular AI document detection models to test how well they work out of the box on a set of digital invoices with various layouts and languages.

Service	Invoice Detection Accuracy Without Items	Invoice Detection Accuracy With Items	Processing Duration per Page, s	Cost per 1000 Pages
Azure AI Document Intelligence	85.8%	85.7%	4.3 ± 0.2	$10
GPT-4o using 3rd party OCR (Prebuilt Layout model by Azure AI)	90.8%	86.5%	33.0 ± 2.3	$8.8 ¹
GPT-4o only	88.3%	89.2%	16.9 ± 1.9	$8.8
Google Document AI	83.8%	68.1%	3.8 ± 0.2	$10
Amazon Analyze Expense API	91.3%	91.1%	2.9 ± 0.2	$10 ²
Gemini 2.0 Pro	90%	90.2%	8 ± 1.5	$4.5 ³
DeepSeek v3 API (Prebuilt Layout model by Azure AI)	93.3%	88.1%	69	$11
Unified Approach	~99%	~97%	~15	~$30

¹ Additional $10 per 1000 pages for the text recognition model.

² Additional $0.008 per page after one million pages.

³ Input: $1.25 for prompts ≤ 128k tokens, $2.50 for prompts > 128k tokens. Output: $5.00 for prompts ≤ 128k tokens, $10.00 for prompts > 128k tokens.

See Full Report March 2025

Unified Intelligence: Enhance Invoice Data Extraction Up to 97%

To achieve exceptional accuracy in extracting data from invoices, we combined the power of multiple large language models (LLMs). We use advanced matching algorithms to compare the outputs of each model and select the final results using a majority-vote principle.

This ensemble approach allows us to leverage the unique strengths of each LLM, providing robust and scalable invoice data extraction for real-world business needs.

As a result, we have drastically increased the average extraction accuracy from 85% to 97%.
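
The majority-vote selection described above can be illustrated with a minimal Python sketch. It is not the production implementation: the `normalize` helper is a simple stand-in for the matching algorithms mentioned in the text, and the field names, sample values, and `min_votes` parameter are assumptions made for the example.

```python
from collections import Counter


def normalize(value) -> str:
    """Collapse whitespace and ignore case so near-identical strings match.

    A deliberately simple stand-in for a real matching algorithm.
    """
    return " ".join(str(value).split()).casefold()


def majority_vote(outputs: list, min_votes: int = 2) -> dict:
    """Pick, per field, the value most models agree on.

    A field is set to None when no value reaches `min_votes`, so that
    low-agreement fields can be routed to manual review. Ties are
    resolved in favour of the value seen first.
    """
    result = {}
    fields = {name for output in outputs for name in output}
    for name in sorted(fields):
        candidates = [o[name] for o in outputs if o.get(name) not in (None, "")]
        counts = Counter(normalize(v) for v in candidates)
        if not counts:
            result[name] = None
            continue
        winner, votes = counts.most_common(1)[0]
        if votes < min_votes:
            result[name] = None
            continue
        # Keep the first original spelling of the winning value.
        result[name] = next(v for v in candidates if normalize(v) == winner)
    return result


# Hypothetical outputs from three models for the same invoice.
outputs = [
    {"invoice_number": "INV-001", "total": "120.50"},
    {"invoice_number": "inv-001", "total": "120.50"},
    {"invoice_number": "INV-007", "total": "125.50", "currency": "EUR"},
]
print(majority_vote(outputs))
```

In this example two of the three models agree on the invoice number and the total, so those values are kept, while the currency reported by a single model is left empty rather than trusted.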

FAQs

Do you provide custom benchmarks for specific business needs?

Yes, we offer custom benchmarking services tailored to specific business requirements. Contact us to discuss your needs.

What models do you test?

Currently we have tested Amazon Analyze Expense API, Azure AI Document Intelligence, Google Document AI, GPT-4o API, and GPT-4o API with text input with 3rd party OCR. We constantly research new AI models to evaluate and test.

What types of documents do you use for testing?

We use a variety of digital documents, including invoices, receipts, contracts, and forms, with different layouts and languages to ensure comprehensive testing.

How often do you update your benchmark?

We update our benchmark monthly to ensure that our evaluations reflect the latest advancements and updates in AI models.

Do you provide recommendations based on your benchmark?

Yes, our monthly reports include recommendations on the best AI models for specific tasks, such as invoice processing, based on our comprehensive evaluations.

Do you test AI models for real-time processing?

Yes, we evaluate the processing duration of AI models to assess their suitability for real-time document processing tasks.

How do you handle updates and new versions of AI models?

We continuously monitor updates and new versions of AI models. When a significant update is released, we retest the model to ensure our benchmark remains accurate and up to date.

What is the difference between GPT-4o API and GPT-4o API with text input with 3rd party OCR?

The GPT-4o API processes documents directly, while the GPT-4o API with text input with 3rd party OCR uses a third-party Optical Character Recognition (OCR) service to convert documents to text before processing.

Our Services

Modular AI Systems

Full-stack AI Developers

MVP Development Services

Contact Us

Let's Work Together!

Do you want to know the total cost of developing and delivering your project? Tell us about your requirements, and our specialists will contact you as soon as possible.