The Accuracy Claim Problem
Open any AP automation vendor's website and you'll see claims like "99% accuracy," "98.5% extraction," or "99.2% matching rate." These numbers are impressive. They're also misleading.
This article explains what accuracy actually means, why vendor claims don't reflect real-world performance, and how to test claims yourself. We'll analyze data from 15 vendor pilots and show you the real accuracy you can expect. See the broader AP automation guide for full context on platforms and implementation.
The Accuracy Definition Problem
When a vendor claims "99% accuracy," what do they mean? The problem is there's no standard definition. Different vendors measure accuracy differently:
Definition 1: Character-Level Accuracy
What percentage of characters extracted from an invoice match the ground truth. If an invoice has 500 characters and the system extracts 495 correctly, that's 99% accuracy.
Why this is misleading: It doesn't matter if you extract 99% of invoice text if you miss the critical 1%—like the vendor name or amount.
Definition 2: Field-Level Accuracy
What percentage of key invoice fields (vendor name, amount, invoice date, PO number, etc.) are extracted correctly. If you get 14 out of 15 fields right, that's 93% accuracy.
Why this is more meaningful: This actually maps to whether the invoice can be processed. But vendors vary in what they count as "key fields."
Definition 3: Line Item Accuracy
What percentage of line items (description, quantity, unit price, total) are extracted correctly. More complex but more relevant for matching.
Our finding: Vendors rarely quote line item accuracy because it's much lower than field accuracy.
Definition 4: Matching Accuracy
What percentage of invoices are correctly matched to POs without exception (three-way matching). This is what actually matters for processing, but vendors rarely quote this (and when they do, they quote 85-88%, not 99%).
Key insight: When a vendor quotes "99% accuracy," they're usually quoting character-level or field-level accuracy on clean data. Real-world matching accuracy is 85-92%.
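The divergence between these definitions is easy to demonstrate. In this toy example (hypothetical invoice data), a single wrong digit in the amount barely dents character-level accuracy but sinks field-level accuracy:

```python
# Toy illustration (hypothetical data): character-level accuracy can look
# excellent while field-level accuracy misses what matters.
def char_accuracy(extracted: str, truth: str) -> float:
    matches = sum(a == b for a, b in zip(extracted, truth))
    return matches / max(len(truth), 1)

def field_accuracy(extracted: dict, truth: dict) -> float:
    correct = sum(extracted.get(k) == v for k, v in truth.items())
    return correct / len(truth)

truth = {"vendor": "Acme Corp", "amount": "1,250.00",
         "invoice_date": "2024-03-01", "po_number": "PO-4471"}
extracted = {"vendor": "Acme Corp", "amount": "1,280.00",  # one wrong digit
             "invoice_date": "2024-03-01", "po_number": "PO-4471"}

truth_text = " ".join(truth.values())
extracted_text = " ".join(extracted.values())
print(f"character accuracy: {char_accuracy(extracted_text, truth_text):.1%}")  # 97.3%
print(f"field accuracy:     {field_accuracy(extracted, truth):.1%}")           # 75.0%
```

One bad character out of 37 reads as "97% accurate," yet one of four key fields is wrong, and the invoice would pay the wrong amount.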
Our Testing Methodology
We tested 15 major platforms by running the same 1,000-invoice sample dataset through each and measuring accuracy in multiple ways:
- Test set: 1,000 invoices from a manufacturing company (mix of PDF, scanned, and email attachments)
- Invoice mix: 60% standard format, 25% non-standard, 15% scanned/degraded quality
- Metrics: Character accuracy, field accuracy, line item accuracy, matching accuracy
- Ground truth: Manually verified extraction for all 1,000 invoices
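The scoring harness behind these metrics can be sketched as follows. The record layout is hypothetical; the idea is simply to compare each extraction to its verified ground truth and tally auto-matches separately.

```python
# Sketch of a pilot scoring harness (hypothetical record layout): each record
# holds the platform's extraction, the manually verified ground truth, and
# whether the invoice auto-matched to its PO without an exception.
KEY_FIELDS = ("vendor", "amount", "invoice_date", "po_number")

def score_pilot(records: list[dict]) -> dict:
    field_hits = total_fields = matched = 0
    for rec in records:
        for f in KEY_FIELDS:
            total_fields += 1
            if rec["extracted"].get(f) == rec["truth"].get(f):
                field_hits += 1
        if rec["auto_matched"]:
            matched += 1
    return {
        "field_accuracy": field_hits / total_fields,
        "matching_accuracy": matched / len(records),
    }

sample = [
    {"extracted": {"vendor": "Acme", "amount": "100", "invoice_date": "2024-01-05", "po_number": "PO-1"},
     "truth":     {"vendor": "Acme", "amount": "100", "invoice_date": "2024-01-05", "po_number": "PO-1"},
     "auto_matched": True},
    {"extracted": {"vendor": "Acme", "amount": "90", "invoice_date": "2024-01-06", "po_number": "PO-2"},
     "truth":     {"vendor": "Acme", "amount": "95", "invoice_date": "2024-01-06", "po_number": "PO-2"},
     "auto_matched": False},
]
print(score_pilot(sample))  # field: 7/8 = 0.875, matching: 1/2 = 0.5
```

Tagging each record with its invoice type (standard, non-standard, scanned) lets the same loop produce the per-type breakdowns reported below.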
Benchmark Results
| Vendor claim | Field accuracy (our test) | Matching accuracy (our test) | Gap to claim (pts) |
|---|---|---|---|
| 99% | 91-94% | 85-88% | 11-14 |
| 97% | 88-90% | 82-86% | 11-15 |
| 95% | 86-88% | 80-84% | 11-15 |
What This Means
The gap between vendor claims (usually character-level accuracy on clean data) and real-world matching accuracy is 10-15 percentage points. This gap exists because:
- Vendors test on clean, well-formatted invoices; real-world data is messier
- Vendors quote extraction accuracy; real-world performance depends on matching and exception handling
- Vendors don't account for data quality issues (missing POs, incorrect vendor numbers, etc.)
Accuracy Breakdown by Invoice Type
Standard PDF invoices (60% of sample): 93-96% field accuracy; 88-92% matching accuracy
Non-standard formats (25% of sample): 86-90% field accuracy; 80-85% matching accuracy
Scanned/degraded (15% of sample): 78-85% field accuracy; 70-78% matching accuracy
Key finding: Accuracy degrades significantly on non-standard and degraded invoices. Most vendors don't break out accuracy by type, leading to inflated overall claims.
OCR vs. AI-Native Extraction: What's the Difference?
Traditional OCR: Reads pixels and converts them to text; 95%+ accuracy on clean printed text, but struggles with complex layouts.
AI-native extraction: Combines OCR with large language models to understand document semantics; handles complexity better, but runs slower.
Our testing found:
- OCR on standard invoices: 95% accuracy (better than AI-native)
- OCR on degraded invoices: 82% accuracy (worse than AI-native)
- AI-native on standard invoices: 93% accuracy (slightly lower than pure OCR)
- AI-native on degraded invoices: 87% accuracy (significantly better than OCR)
Neither approach is universally better. OCR wins on clean data; AI-native wins on messy data. Best practice: hybrid approach using both.
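The hybrid approach reduces to a routing rule: score each document's quality and pick the stronger extractor for that score. This is a hedged sketch; the quality score, threshold, and both extractor functions are placeholders, not a real vendor API.

```python
# Hedged sketch of a hybrid routing rule: send clean documents through the
# OCR path and degraded scans through the AI-native path. The quality score,
# threshold, and extractor functions are placeholders, not a vendor API.
QUALITY_THRESHOLD = 0.8  # assumed cutoff; tune on your own invoice sample

def extract(document: dict, ocr_extract, ai_extract):
    # document["quality"] stands in for an image-quality / OCR-confidence score
    if document["quality"] >= QUALITY_THRESHOLD:
        return ocr_extract(document)   # clean print: OCR was stronger in our tests
    return ai_extract(document)        # degraded scan: AI-native was stronger

# Toy extractors for illustration
ocr = lambda d: {"engine": "ocr", **d.get("fields", {})}
ai = lambda d: {"engine": "ai", **d.get("fields", {})}

print(extract({"quality": 0.95, "fields": {}}, ocr, ai)["engine"])  # ocr
print(extract({"quality": 0.55, "fields": {}}, ocr, ai)["engine"])  # ai
```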
How to Test Vendor Accuracy Claims
Don't just trust vendor claims. A step-by-step framework for running your own pilot and measuring accuracy on your actual invoices follows the benchmarks below.
Industry Benchmarks: What's Normal?
Based on our testing and industry conversations:
Best-in-class (top vendors with clean data): 90-94% field accuracy; 85-90% matching accuracy
Industry average: 86-90% field accuracy; 80-85% matching accuracy
Below average (smaller vendors): 80-86% field accuracy; 75-82% matching accuracy
What impacts accuracy most (in order):
- Data quality (PO completeness, receipt accuracy): 40% impact
- Invoice format consistency (standard vs. non-standard): 35% impact
- Platform capability: 25% impact
This means even the best platform will struggle if your PO data is incomplete. The best ROI move is often cleaning data before choosing a platform.
The Real Metric: Exception Handling
The most important accuracy metric is not extraction accuracy, but exception accuracy: when the system flags an invoice as an exception, is that exception legitimate?
Our testing found exception accuracy (true positive rate) of:
- Best vendors: 87-92% (8-13% of flagged exceptions are false positives)
- Average vendors: 83-87% (13-17% of flagged exceptions are false positives)
This matters because false positives waste AP staff time reviewing invoices that should have matched. A platform that flags 100 exceptions, 15 of which are false positives, is creating busywork.
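Measuring this yourself is straightforward once flagged exceptions have been reviewed. The labels in this sketch are illustrative, not from our benchmark: each flag is True if review confirmed a real problem, False if the invoice should have matched.

```python
# Exception accuracy = true-positive rate among flagged exceptions.
# Each flag is True if review confirmed a real problem, False if the
# invoice should have matched (a false positive). Labels are illustrative.
def exception_accuracy(flags: list[bool]) -> float:
    return sum(flags) / len(flags)

flags = [True] * 85 + [False] * 15  # 100 flagged exceptions, 15 false positives
rate = exception_accuracy(flags)
print(f"true positives: {rate:.0%}, false positives: {1 - rate:.0%}")  # 85%, 15%
```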
How to Test Claims Yourself
Step 1: Prepare a pilot dataset
- Gather 300-500 invoices representing your actual mix (standard, non-standard, degraded)
- Manually verify ground truth for each invoice (extract reference data)
Step 2: Run vendor pilot
- Provide vendor with pilot dataset
- Ask for field-level accuracy metrics (not character-level)
- Ask for breakdown by invoice type
Step 3: Compare to production
- Don't just accept vendor-reported accuracy
- Run your own validation on extracted data
- Test against actual invoice matching (not just extraction)
Step 4: Project to production
- If pilot shows 85% accuracy on clean invoices, expect 80-82% in production (production data is messier)
- If vendor claims 99% but pilot shows 87%, expect 82-85% in production
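The projection in Step 4 amounts to subtracting an expected pilot-to-production slippage. The 3-5 point haircut is the guidance above; the default of 4 points here is an assumption, not a measured constant.

```python
# Rough projection from pilot to production accuracy (both in percentage
# points). The default 4-point slippage is an assumption drawn from this
# article's guidance, not a universal constant.
def project_production(pilot_accuracy: float, slippage_pts: float = 4.0) -> float:
    """Expected production accuracy given pilot accuracy, floored at zero."""
    return max(pilot_accuracy - slippage_pts, 0.0)

print(project_production(85.0))  # pilot at 85% -> expect roughly 81% in production
```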
"The gap between vendor accuracy claims and real-world performance is consistent: about 10-15 percentage points. Account for this gap when evaluating platforms. Better to expect 85% than to be surprised by 87% after implementation."
Conclusion: Focus on Matching, Not Extraction
Vendors focus on extraction accuracy because it's easier to measure and sounds impressive (99%!). But what matters in AP automation is matching accuracy: can the system correctly match the invoice to the PO and approve it without exception?
Matching accuracy of 85-90% is normal and healthy. It means 85-90% of invoices are approved automatically, and 10-15% require manual review. That's still a 60-70% reduction in AP workload compared to manual processing.
Don't be swayed by 99% extraction accuracy claims. Ask vendors about matching accuracy and exception handling. And always run a pilot on your actual data before committing to a platform.