In this analysis, we evaluated seven popular AI models to see how well they process digital invoices without any pre-training or fine-tuning.
Read on to learn:
We regularly benchmark AI models to find the best ones for digital document processing across different applications. Take a look at our previous report, where we tested five AI models on invoices from various years, as well as our comprehensive report covering all of our tests.
We are constantly testing large language models for business automation tasks. Check out the latest results.
This report evaluates and compares the performance of seven distinct methods for invoice recognition across varying years and digitization formats. The focus is on assessing their accuracy in extracting key invoice fields, which is critical for automation and data processing workflows. The following solutions are analyzed:
The analysis builds upon previous findings in Invoice Extraction Accuracy 2 to provide an updated and comprehensive comparison.
To ensure a structured and fair evaluation, a standardized dataset of invoices from different years was used. The methodology included:
A collection of scanned and digital invoices was used to test each solution’s ability to handle different formats and years:
| № | Year | Number of Items |
|---|------|-----------------|
| 1 | 2018 | 4 |
| 2 | 2009 | 1 |
| 3 | 2018 | 3 |
| 4 | 2009 | 1 |
| 5 | 2018 | 12 |
| 6 | 2018 | 2 |
| 7 | 2015 | 3 |
| 8 | 2016 | 2 |
| 9 | 2008 | 3 |
| 10 | 2011 | 2 |
| 11 | 2017 | 2 |
| 12 | 2006 | 4 |
| 13 | 2009 | 2 |
| 14 | 2019 | 3 |
| 15 | 2018 | 2 |
| 16 | 2018 | 1 |
| 17 | 2012 | 3 |
| 18 | 2010 | 4 |
| 19 | 2020 | 3 |
| 20 | 2012 | 3 |
The following fields were extracted and compared across all models. Each solution uses slightly different naming conventions, which were standardized for evaluation:
| № | Resulting Field | AWS | Azure | Google |
|---|-----------------|-----|-------|--------|
| 1 | Invoice Id | INVOICE_RECEIPT_ID | InvoiceId | invoice_id |
| 2 | Invoice Date | INVOICE_RECEIPT_DATE | InvoiceDate | invoice_date |
| 3 | Net Amount | SUBTOTAL | SubTotal | net_amount |
| 4 | Tax Amount | TAX | TotalTax | total_tax_amount |
| 5 | Total Amount | TOTAL | InvoiceTotal | total_amount |
| 6 | Due Date | DUE_DATE | DueDate | due_date |
| 7 | Purchase Order | PO_NUMBER | PurchaseOrder | purchase_order |
| 8 | Payment Terms | PAYMENT_TERMS | - | payment_terms |
| 9 | Customer Address | RECEIVER_ADDRESS | BillingAddress | receiver_address |
| 10 | Customer Name | RECEIVER_NAME | CustomerName | receiver_name |
| 11 | Vendor Address | VENDOR_ADDRESS | VendorAddress | supplier_address |
| 12 | Vendor Name | VENDOR_NAME | VendorName | supplier_name |
| 13 | Item: Description | ITEM | Description | - |
| 14 | Item: Quantity | QUANTITY | Quantity | - |
| 15 | Item: Unit Price | UNIT_PRICE | UnitPrice | - |
| 16 | Item: Amount | PRICE | Amount | - |
Note: For Gemini, Deepseek, and GPT, the models were explicitly instructed to return data in the Resulting Field format for consistency.
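For the LLM-based solutions this mapping is handled by the prompt, while for AWS, Azure, and Google the raw field names have to be renamed in code. Below is a minimal sketch of such a normalization step; the mapping values are taken from the table above, only a subset of fields is shown, and the `FIELD_MAP` / `normalize` names are our own illustration rather than part of any provider SDK.

```python
# Normalize provider-specific field names to the "Resulting Field" names
# used in this comparison. Values come from the mapping table above.
FIELD_MAP = {
    # AWS
    "INVOICE_RECEIPT_ID": "Invoice Id",
    "INVOICE_RECEIPT_DATE": "Invoice Date",
    "SUBTOTAL": "Net Amount",
    "TOTAL": "Total Amount",
    # Azure
    "InvoiceId": "Invoice Id",
    "InvoiceDate": "Invoice Date",
    "SubTotal": "Net Amount",
    "InvoiceTotal": "Total Amount",
    # Google
    "invoice_id": "Invoice Id",
    "invoice_date": "Invoice Date",
    "net_amount": "Net Amount",
    "total_amount": "Total Amount",
    # ...the remaining fields in the table follow the same pattern
}

def normalize(raw_fields: dict) -> dict:
    """Rename provider-specific keys to the standardized names, keeping only
    the fields that take part in the comparison."""
    return {FIELD_MAP[key]: value for key, value in raw_fields.items() if key in FIELD_MAP}
```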
The evaluation of item-level extraction focuses on four key attributes: Description, Quantity, Unit Price, and Amount.
Note on Google AI: Unlike other solutions, Google’s Document AI does not break down items into individual attributes but returns full item rows as unstructured text, complicating direct comparison for these fields.
To quantify extraction accuracy, a weighted efficiency metric (Eff, %) was applied, combining:
Formulas:
Eff, % = (COUNTIF(strict essential fields, positive) + COUNTIF(non-strict essential fields, positive if RLD > RLD threshold) + COUNTIF(items, positive)) / (COUNT(all fields) + COUNT(all items)) * 100

RLD, % = (1 - [Levenshtein distance] / MAX(LEN(s1), LEN(s2))) * 100

Eff-I, % = 100 IF (ALL(Quantity, Unit Price, Amount positive) AND RLD(Description) > RLD threshold), ELSE 0
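A minimal Python sketch of the RLD and item-level (Eff-I) checks described above; the helper names and the 80% default threshold are our own illustrative assumptions.

```python
def levenshtein(s1: str, s2: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution
            ))
        previous = current
    return previous[-1]

def rld(s1: str, s2: str) -> float:
    """Relative Levenshtein distance in percent: 100 means identical strings."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 100.0
    return (1 - levenshtein(s1, s2) / longest) * 100

def item_positive(item: dict, expected: dict, rld_threshold: float = 80.0) -> bool:
    """Eff-I check: an item counts as positive only if Quantity, Unit Price, and
    Amount match exactly and the Description is similar enough (RLD above the
    threshold). The 80% default is an assumed value for illustration."""
    exact = all(item.get(k) == expected.get(k) for k in ("Quantity", "Unit Price", "Amount"))
    return exact and rld(str(item.get("Description", "")), str(expected.get("Description", ""))) > rld_threshold
```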
Pricing models for AI services were calculated per invoice, accounting for:
Formulas:
Text-based input (OCR JSON passed to the model):
[total_cost] = [input token cost] * ([prompt token count] + [OCR input JSON token count]) + [output token cost] * [result JSON token count]

Image-based input (invoice image passed to the model):
[total_cost] = [input token cost] * ([prompt token count] + [input image token count]) + [output token cost] * [result JSON token count]
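A minimal sketch of these two cost formulas in Python, assuming token prices are quoted per million tokens as in the pricing table below; function and parameter names are ours.

```python
def total_cost_text(prompt_tokens: int, ocr_json_tokens: int, result_json_tokens: int,
                    input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Text route: the OCR result (JSON) is passed to the model as text input."""
    input_tokens = prompt_tokens + ocr_json_tokens
    return (input_tokens * input_price_per_1m + result_json_tokens * output_price_per_1m) / 1_000_000

def total_cost_image(prompt_tokens: int, image_tokens: int, result_json_tokens: int,
                     input_price_per_1m: float, output_price_per_1m: float) -> float:
    """Image route: the invoice image itself is passed to the model as input."""
    input_tokens = prompt_tokens + image_tokens
    return (input_tokens * input_price_per_1m + result_json_tokens * output_price_per_1m) / 1_000_000

# Example with assumed token counts and GPT-4o prices from the table below:
# total_cost_text(500, 3_500, 400, 2.50, 10.00) -> ~$0.014 per invoice,
# excluding the separate OCR charge (see footnote 2 in the pricing table).
```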
Key Considerations:
Note: Google AI results were excluded from the charts above.
Issue: Azure AI failed to detect full employee names in Invoice 5, recognizing only first names instead of complete names.
Impact: This resulted in a significantly lower efficiency score (33.3%) for Azure on this invoice, while other models achieved 100% accuracy across all 12 items.
Conclusion: Azure’s inability to parse multi-word descriptions in structured fields highlights a critical limitation compared to competitors.
Observation: Low-resolution invoices (e.g., Samples 13, 17, 18) generally did not degrade detection accuracy across models.
Minor exception: in Invoice 15, Deepseek misread a comma as a dot, leading to an incorrect numerical value.
Conclusion: Modern OCR and AI models are robust to resolution issues, though rare formatting errors may occur.
Critical Flaw: Google Document AI combines all item attributes into a single unstructured string, making field-level comparison impossible.
Example:
Actual image:
All other services achieved 100% correct detection, with items broken down by attribute:
Impact: Google’s approach fails to align with industry-standard attribute breakdowns (Description, Quantity, Unit Price, Amount), rendering it incompatible for automated workflows requiring structured data.
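To illustrate the difference, here is what a single line item looks like in structured form (as returned by AWS, Azure, and the LLMs after normalization) versus the single-string form Google returns; the values are hypothetical and for illustration only.

```python
# Hypothetical line item, for illustration only.
structured_item = {            # AWS, Azure, GPT-4o, Gemini, Deepseek
    "Description": "Consulting services, March",
    "Quantity": 2,
    "Unit Price": 150.00,
    "Amount": 300.00,
}
google_item = "Consulting services, March 2 150.00 300.00"  # one unstructured row of text
```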
Finding: Multi-line item descriptions had no negative impact on detection quality—except for Google AI, which struggles with any structured parsing.
Why It Matters: Complex invoices with wrapped text or line breaks were handled flawlessly by AWS, Azure, GPT, Gemini, and Deepseek.
Two invoices were excluded due to atypical structures that caused widespread detection failures:
Insight: Unconventional layouts remain a challenge for all models.
Strengths:
Comparison (Sample 20):
| Model | Accuracy of Attributes | Notes |
|-------|------------------------|-------|
| Gemini | 100% | Correct values and formatting. |
| GPT-4o | Partial | Inaccurate numerical values. |
| Deepseek | Low | Missing/incorrect fields. |
Example: Sample #20, actual image:
Gemini:
GPT-4o: same attributes, but inaccurate values:
Deepseek: most values are incorrect or absent, and the text attributes contain garbled text:
| Service | Cost | Cost per page (average) |
|---------|------|--------------------------|
| AWS | $10 / 1000 pages ¹ | $0.01 |
| Azure | $10 / 1000 pages | $0.01 |
| Google | $10 / 1000 pages | $0.01 |
| GPT-4o (text) | $2.50 / 1M input tokens, $10.00 / 1M output tokens ² | $0.021 |
| GPT-4o (image) | $2.50 / 1M input tokens, $10.00 / 1M output tokens | $0.0087 |
| Gemini | $1.25 / 1M input tokens (prompts ≤ 128k), $2.50 / 1M (prompts > 128k); $5.00 / 1M output tokens (≤ 128k), $10.00 / 1M (> 128k) | $0.0045 |
| Deepseek | $10 / 1000 pages + $0.27 / 1M input tokens, $1.10 / 1M output tokens | $0.011 |
Notes:
¹ $8 / 1000 pages after one million pages per month.
² An additional $10 per 1000 pages for the text recognition (OCR) step.
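As a rough illustration of how the per-page averages relate to the token prices (the token counts here are assumptions, not measured figures from this benchmark): for the GPT-4o text route, the $0.01 OCR charge per page plus roughly 4,000 input tokens at $2.50 / 1M (≈ $0.010) and 100 output tokens at $10.00 / 1M (≈ $0.001) comes to about $0.021 per page, in line with the average shown in the table.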
This comprehensive evaluation of seven invoice extraction solutions—AWS, Azure, Google AI, GPT-4o (text & image), Gemini, and Deepseek—revealed critical insights into their accuracy, efficiency, and limitations. Below is a consolidated summary of the findings:
This research highlights that no single solution is perfect, but the optimal choice depends on the use case—whether prioritizing precision, cost, or structured output.