olmOCR
Technical details on olmOCR's throughput-focused architecture.
Theory & Benchmarks

olmOCR (Open Language Model OCR) is designed by the Allen Institute for AI to handle the massive scale of web-crawled documents.
Architecture
olmOCR is built for "massively parallel" processing. It focuses on converting billions of PDF pages into high-quality training data for LLMs.
- Minimalist Design: Avoids heavy secondary classification steps to maximize raw OCR throughput.
- Language Model Alignment: Specifically tuned to output text in a format that maximizes readability for subsequent NLP tasks.
- Robust PDF Parsing: Specialized handling of embedded fonts, vector graphics, and multi-column layouts.
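
To make the "massively parallel" idea concrete, here is a minimal sketch of splitting a document's pages into batches and processing them concurrently. It is not olmOCR's actual API: `make_batches`, `process_batch`, `process_document`, and the stand-in `fake_ocr_page` are hypothetical names used only for illustration, and a real pipeline would send each batch to a GPU-backed model rather than a local function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


def make_batches(num_pages: int, batch_size: int) -> Iterator[range]:
    """Yield consecutive page-index ranges of at most batch_size pages."""
    for start in range(0, num_pages, batch_size):
        yield range(start, min(start + batch_size, num_pages))


def fake_ocr_page(page_index: int) -> str:
    """Hypothetical stand-in for a real per-page OCR call."""
    return f"text of page {page_index}"


def process_batch(pages: range, ocr_page: Callable[[int], str]) -> List[str]:
    """Run the OCR callable over every page in one batch."""
    return [ocr_page(i) for i in pages]


def process_document(
    num_pages: int,
    batch_size: int = 4,
    max_workers: int = 4,
    ocr_page: Callable[[int], str] = fake_ocr_page,
) -> List[str]:
    """Process batches concurrently and return page texts in page order."""
    batches = list(make_batches(num_pages, batch_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the batches were submitted.
        results = list(pool.map(lambda b: process_batch(b, ocr_page), batches))
    return [text for batch in results for text in batch]


if __name__ == "__main__":
    for line in process_document(num_pages=10):
        print(line)
```

Because pages are independent work items in this sketch, throughput grows with the number of workers until the underlying hardware is saturated.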
Performance Benchmarks
olmOCR is optimized for throughput (pages processed per unit of time, reported here in pages per hour) rather than only single-image accuracy.
| Metric | Performance |
|---|---|
| Throughput (A100 GPU) | ~2,500 pages/hour |
| Text Consistency | 98.4% |
| Layout Preservation | 92.0% |
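
As a rough planning aid, the throughput figure in the table can be turned into a time estimate for a given archive. The sketch below takes the approximate 2,500 pages/hour per A100 from the table and assumes perfectly linear scaling across GPUs, which real deployments will not quite reach. The function names are illustrative.

```python
PAGES_PER_HOUR_PER_GPU = 2500  # approximate per-A100 figure from the table


def estimate_gpu_hours(total_pages: int,
                       pages_per_hour: float = PAGES_PER_HOUR_PER_GPU) -> float:
    """Total GPU-hours needed to process total_pages on one GPU."""
    return total_pages / pages_per_hour


def estimate_wall_clock_hours(total_pages: int, num_gpus: int,
                              pages_per_hour: float = PAGES_PER_HOUR_PER_GPU) -> float:
    """Wall-clock hours, assuming ideal linear scaling across num_gpus."""
    return estimate_gpu_hours(total_pages, pages_per_hour) / num_gpus


if __name__ == "__main__":
    pages = 1_000_000
    print(estimate_gpu_hours(pages))            # 400.0 GPU-hours
    print(estimate_wall_clock_hours(pages, 8))  # 50.0 hours on 8 GPUs
    print(round(PAGES_PER_HOUR_PER_GPU / 3600, 3))  # 0.694 pages per second
```

At this rate, one million pages take about 400 GPU-hours, or roughly 50 hours of wall-clock time on eight GPUs under the ideal-scaling assumption.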
Use Case
olmOCR is well suited to researchers and enterprises digitizing entire libraries or large archives, where processing time is the primary bottleneck.