We Tested 17 PDF Parsers on 800 Documents. The Best One Depends on What You're Parsing.
Parser accuracy swings 49 percentage points by document type. pdfplumber hits 93.4% on tables but 70.4% overall. Nougat hallucinates on long documents. GPT-5.1 leads at 92% but costs 17x more than LlamaParse. And the Unicode ligature U+FB01 silently breaks every keyword search in your RAG pipeline.
Forty-nine percentage points. That's the accuracy gap between the same PDF parser running on a legal contract (98.8%) versus an invoice (49.9%), according to PDFBench, a benchmark that tested 17 parsers on 800+ real documents across 6 domains. Not 49 points between the best parser and the worst. Between the same parser on two different document types.
If you're building a retrieval-augmented generation pipeline and you picked your PDF parser based on a vendor benchmark, you're probably running the wrong tool on half your corpus.
Why PDFs Are Adversarial to Language Models
A PDF is not a document. It is a set of instructions for placing characters at specific coordinates on a page. There is no semantic structure. No paragraphs. No reading order. When you open a two-column academic paper in a PDF viewer, your eyes naturally read left column top-to-bottom, then right column. A rule-based text extractor reads left-to-right across the page, interleaving sentences from both columns into nonsense.
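The column-interleaving failure is easy to reproduce with synthetic word boxes. A minimal sketch in pure Python, using hypothetical `(x0, top, text)` tuples in the style that pdfplumber-like extractors report, and assuming the column split is already known (real layouts need layout detection):

```python
def reading_order(words, page_width):
    """Read a two-column page the way a human does: left column
    top-to-bottom, then right column."""
    mid = page_width / 2
    left = [w for w in words if w[0] < mid]
    right = [w for w in words if w[0] >= mid]
    ordered = sorted(left, key=lambda w: (w[1], w[0])) + \
              sorted(right, key=lambda w: (w[1], w[0]))
    return " ".join(w[2] for w in ordered)

def naive_order(words):
    """Naive extraction: sort by vertical position alone, which
    interleaves sentences from both columns."""
    return " ".join(w[2] for w in sorted(words, key=lambda w: (w[1], w[0])))

words = [(0, 0, "The"), (0, 10, "cat"), (300, 0, "sat"), (300, 10, "down")]
print(reading_order(words, 600))  # "The cat sat down"
print(naive_order(words))         # "The sat cat down"
```

The naive version is exactly what a rule-based extractor does when it has no column model: each output row mixes fragments from both columns.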
A comparative study from Missouri S&T (Adhikari et al., 2024) tested 10 PDF parsing tools across 6 document categories using the DocLayNet dataset. All rule-based parsers struggled with scientific and patent documents specifically because of multi-column layouts and embedded equations. PyMuPDF and pypdfium2 performed best overall on text extraction, but even they produced degraded output on complex layouts.
Then there's the ligature problem. PDF fonts store common character pairs as single glyphs. The letters "fi" become Unicode U+FB01 (Latin Small Ligature Fi); "fl" becomes U+FB02. The word "difficult" extracts as "diﬃcult," with the three letters "ffi" stored as the single code point U+FB03. Your text looks fine in a viewer because the glyph renders correctly, but every downstream search for "difficult" silently fails. As the PDF Association documented, this happens because the glyph lookup (for display) and the Unicode lookup (for semantics) use different encoding tables inside the font, and many PDF producers get the ToUnicode mapping wrong.
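One low-effort mitigation: Unicode compatibility normalization (NFKC) decomposes the ligature code points back into plain letters. A minimal sketch using Python's standard library; note that NFKC also rewrites other compatibility characters (superscripts, fractions), so apply it deliberately rather than blindly:

```python
import unicodedata

def fix_ligatures(text: str) -> str:
    """Decompose ligature glyphs into plain letters.

    NFKC maps U+FB01 -> "fi", U+FB02 -> "fl", U+FB03 -> "ffi", etc.
    """
    return unicodedata.normalize("NFKC", text)

print(fix_ligatures("di\ufb03cult"))  # -> "difficult"
```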
Headers, footers, and page numbers inject noise into every page. Hyphenated words at line breaks split across tokens. Footnotes interleave with body text. Each of these problems multiplies when the extracted text goes into an LLM context window, because the model has no way to distinguish formatting artifacts from content. Every garbage token counts against your context budget.
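Both problems are mechanical enough to scrub in post-processing. A sketch, assuming plain-text page strings and simple end-of-line hyphenation; real documents need more careful heuristics (headers that vary slightly per page, legitimate hyphens):

```python
import re
from collections import Counter

def strip_repeated_lines(pages, threshold=0.6):
    """Drop lines that recur on most pages (headers/footers) and bare
    page numbers, keeping page-specific content."""
    counts = Counter(line for page in pages for line in set(page.splitlines()))
    n = len(pages)
    cleaned = []
    for page in pages:
        kept = []
        for ln in page.splitlines():
            if re.fullmatch(r"\s*\d+\s*", ln):               # bare page number
                continue
            if ln.strip() and counts[ln] / n >= threshold:   # repeated header/footer
                continue
            kept.append(ln)
        cleaned.append("\n".join(kept))
    return cleaned

def join_hyphenated(text):
    """Rejoin words split by end-of-line hyphenation: 'extrac-\\ntion' -> 'extraction'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)
```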
The Scorecard: 17 Parsers, 6 Domains, No Consensus
PDFBench (Applied AI, December 2025) ran 17 parsers on 353+ documents with manually verified ground truth from source HTML, DOCX, and LaTeX conversions. Results upend conventional wisdom about which parser is "best."
| Rank | Parser | Edit Similarity | Best For |
|---|---|---|---|
| 1 | pypdfium2 | 78.3% | Legal, general text |
| 2 | pypdf | 78.3% | Legal, general text |
| 3 | extractous | 77.5% | HR documents |
| 4 | pymupdf | 77.3% | Fast extraction |
| 5 | kreuzberg | 74.9% | Consistency, invoices |
| 6 | pymupdf4llm | 74.7% | LLM pipelines |
| 7 | docling | 71.3% | Structure preservation |
| 8 | pdfplumber | 70.4% | Table extraction |
| 9 | pdfminer | 68.2% | Text positioning |
| 10 | unstructured | 66.5% | HR documents |
But this aggregate ranking hides the real story. pypdfium2 scores 98.8% on legal contracts and 49.9% on invoices. pdfplumber ranks 8th overall but achieves a 93.4% TEDS (Tree-Edit-Distance-based Similarity) score on tables, the highest of any parser tested. Docling preserves document structure (headings, sections, hierarchy) at 60%+ while most text-only parsers score below 35% on structure.
When PDFBench added frontier LLMs to the comparison, the gap widened further. GPT-5.1 achieved 92% edit similarity by treating each page as a visual input. Gemini 3 Pro hit 88%. But GPT-5.1 costs roughly $0.05 per page, while LlamaParse matches the open-source leaders at 78% for $0.003 per page.
Tables: Where Every Parser Fails Differently
Table extraction is the hardest subproblem in PDF parsing, and it's where the gap between "looks correct" and "is correct" matters most for LLM consumption.
Borderless tables (no grid lines, alignment-only structure) defeat most rule-based parsers because there are no drawn lines to detect. Parsers must infer column boundaries from whitespace alignment, a problem that breaks down when columns have variable-width content. Spanning cells (cells that cover multiple rows or columns) produce garbled output in markdown because markdown's pipe-delimited table format has no spanning syntax. Nested tables (tables inside tables, common in financial disclosures) are effectively unrepresentable in markdown at all.
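The whitespace-alignment inference these parsers attempt can be sketched in a few lines: a character position that is blank in every row is a candidate column gap. This toy version assumes a monospace-aligned text dump, and it also shows why variable-width content breaks the approach: a gap that any single row happens to fill simply disappears.

```python
def infer_columns(lines):
    """Split a borderless table on character positions that are blank
    in every row (the shared gaps between columns)."""
    width = max(map(len, lines))
    padded = [ln.ljust(width) for ln in lines]
    blank = [all(row[i] == " " for row in padded) for i in range(width)]
    spans, start = [], None
    for i, is_blank in enumerate(blank + [True]):  # sentinel closes last span
        if not is_blank and start is None:
            start = i
        elif is_blank and start is not None:
            spans.append((start, i))
            start = None
    return [[row[a:b].strip() for a, b in spans] for row in padded]
```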
Adhikari et al. found that for table detection, TATR (Table Transformer) outperformed all rule-based parsers on financial, patent, legal, and scientific documents. Camelot performed best on government tenders. PyMuPDF won on manuals. No single tool dominated across all categories.
pdfplumber's 93.4% TEDS score is strong but comes with a caveat: it works best on tables with visible borders and clean alignment. On borderless tables common in corporate reports, its accuracy drops. For markdown-formatted table output, pymupdf4llm scores 84.8% TEDS, and IBM's Docling scores 84.5%.
Equations: LaTeX or Garbage, Nothing in Between
Scientific PDFs store equations as positioned glyph sequences. Standard extractors produce output like f θ i ; x ð Þ ¼ ai þ bi /C0 ai ... where the original reads f(θᵢ, x) = aᵢ + (bᵢ − aᵢ)/(1 + exp{dᵢ(x − log cᵢ)}). As Eric Ma documented, standard tools produce "alphabet soup" from equations because they extract individual glyphs without understanding mathematical structure.
Nougat (Neural Optical Understanding for Academic Documents, Meta Research) takes a different approach: it treats the entire page as an image and uses a vision transformer to reconstruct structured markdown, including LaTeX equations. Its output for the same equation: \[f(\theta_{i},x)=a_{i}+\frac{(b_{i}-a_{i})}{1+\exp\{d_{i}(x-\log(c_{i}))\}}\]. Clean, parseable, and token-efficient.
But Nougat has a known failure mode: on long documents, it begins generating repetitive hallucinated text. Pages 50+ in a long PDF can produce loops of repeated sentences that never appeared in the original. Marker (Vik Paruchuri, Mozilla Builders) addresses this by using a pipeline of specialized models: Surya for layout detection and OCR, Texify for equation-to-LaTeX conversion, and a post-processing model for cleanup. Marker doesn't hallucinate because each model handles a narrow task rather than trying to reconstruct the entire document from pixels.
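A cheap guard against this failure mode is to scan the output for looping n-grams before accepting it. This detector is our sketch, not part of Nougat; the window size and repeat threshold are assumptions to tune per corpus:

```python
def looks_degenerate(text, window=8, max_repeats=3):
    """Flag repetition loops: the same n-word window occurring more than
    max_repeats times suggests the model got stuck generating."""
    words = text.split()
    seen = {}
    for i in range(len(words) - window + 1):
        key = tuple(words[i:i + window])
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > max_repeats:
            return True
    return False
```

Pages that trip the detector can be rerouted to a fallback parser instead of poisoning the corpus.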
Mathpix offers the highest-accuracy equation OCR commercially, claiming 99%+ on printed LaTeX, but at $0.004 per page with volume discounts. For pipelines processing thousands of academic papers, the cost adds up. Marker runs locally on GPU with zero per-page cost after the hardware investment.
The Token Tax: How Bad Conversion Wastes Your Context Window
A benchmark comparing PDF vs. markdown for AI token usage (coeld.io, March 2026) tested a 768-page illustrated book and a 19-page text document. The text token counts after extraction were close (367,858 PDF-extracted tokens vs. 367,286 markdown tokens for the Grimm fairy tales corpus). But the operational costs diverged sharply: PDF prep took 679ms vs. 0.26ms for markdown. Storage was 6.15MB vs. 1.49MB.
Token count parity hides a more insidious problem: token quality. A cleanly extracted markdown file uses tokens on content. A poorly extracted PDF dumps tokens on: repeated headers and footers (every page), page numbers, misaligned table cell separators, orphaned column fragments, broken hyphenations, and ligature artifacts that tokenize differently than their intended characters. Mozilla Builders' documentation for Marker claims that clean markdown output reduces token count by 70% compared to direct LLM-on-PDF approaches, and eliminates encoding artifacts entirely.
At GPT-4-class pricing ($10 per million input tokens), a 100-page PDF at roughly 500 tokens per page carries about 15,000 junk tokens if 30% of its extraction is waste, which is $0.15 per full-document query. Run 10,000 such queries and the waste alone costs $1,500; spread similar traffic across a 1,000-PDF corpus and it compounds from there. The math gets worse with larger models and longer context windows.
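The arithmetic behind those figures, as a simplified model; the 500-tokens-per-page density and 30% waste ratio are assumptions, and real costs depend on chunking, retrieval frequency, and caching:

```python
def wasted_cost_per_query(pages, tokens_per_page=500, waste_ratio=0.30,
                          usd_per_mtok=10.0):
    """Dollar cost of the wasted tokens in one full-document query."""
    wasted_tokens = pages * tokens_per_page * waste_ratio
    return wasted_tokens / 1_000_000 * usd_per_mtok

print(wasted_cost_per_query(100))           # 0.15 USD per query
print(wasted_cost_per_query(100) * 10_000)  # 1500.0 USD over 10,000 queries
```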
Vision Models vs. Traditional Parsing: A New Class of Tradeoff
GPT-5.1's 92% edit similarity on PDFBench represents a different bet: treat the PDF as an image, skip text extraction entirely, let the vision model read it. Google's Gemini 3 Pro (88%) and Claude Sonnet 4.5 (80%) follow the same approach. No ligature bugs. No column interleaving. No header/footer noise. The model "sees" the document the way a human does.
The tradeoff is cost and latency. At $0.05 per page, GPT-5.1 is 17x more expensive than LlamaParse ($0.003/page) and infinitely more expensive than Marker (free, runs locally). For a 1,000-page legal discovery corpus queried 100 times, the vision model approach costs $5,000 in API fees. The traditional pipeline costs $3 for LlamaParse or $0 for a local Marker installation with a one-time GPU investment.
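The break-even point between the two approaches is simple arithmetic. This sketch uses the per-page prices above and a hypothetical $1,500 GPU for a local Marker installation (the GPU price is our assumption, not a quoted figure):

```python
def breakeven_pages(fixed_cost, cheap_per_page, expensive_per_page):
    """Pages at which the fixed-cost option has paid for itself versus
    paying per page."""
    return fixed_cost / (expensive_per_page - cheap_per_page)

# Hypothetical $1,500 GPU (Marker, $0/page) vs. GPT-5.1 at $0.05/page:
print(round(breakeven_pages(1500, 0.0, 0.05)))   # 30000 pages
# ...vs. LlamaParse at $0.003/page:
print(round(breakeven_pages(1500, 0.0, 0.003)))  # 500000 pages
```

Below those volumes, the API is cheaper; above them, local hardware wins, which is why the right answer depends on corpus size as much as accuracy.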
Marker's hybrid mode (passing --use_llm with Gemini 2.0 Flash) represents the middle ground: use traditional OCR and layout detection for initial extraction, then pass the results through an LLM for cleanup. Marker's benchmarks show this hybrid approach matches or beats pure LLM parsing at a fraction of the cost, because the LLM only processes pre-structured text, not raw page images.
Practical Recommendations by Document Type
| Document Type | Best Parser | Why |
|---|---|---|
| Legal contracts | pypdfium2 or pypdf | 98.8% accuracy, 100% reliability, fast |
| Academic papers | Marker (with --use_llm) | Handles equations, multi-column, figures; no hallucination risk |
| Financial reports | pdfplumber + TATR | 93.4% table accuracy; TATR catches borderless tables |
| Invoices | Azure Document Intelligence or GPT-5.1 | Generic parsers fail below 50%; need layout-aware models |
| Scanned documents | Nougat (short docs) or Marker | Nougat excels on academic scans but hallucinates on 50+ pages |
| Books / long text | pymupdf4llm | Best LLM-ready markdown output, no hallucination risk, fast |
| Government filings | Camelot (tables) + pypdfium2 (text) | Camelot wins on government table formats per Adhikari et al. |
| Mixed corpus (unknown types) | LlamaParse | 78% accuracy at $0.003/page; best quality-cost ratio across domains |
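In a pipeline, the table above reduces to a routing map. A sketch only: the type labels and the fallback choice are ours, mirroring the recommendations rather than any library's API, and the upstream document classifier is left as a placeholder:

```python
# Parser recommendations keyed by document type (values are tool names,
# not importable modules).
PARSER_BY_TYPE = {
    "legal": "pypdfium2",
    "academic": "marker --use_llm",
    "financial": "pdfplumber + TATR",
    "invoice": "azure-document-intelligence",
    "scanned": "marker",
    "book": "pymupdf4llm",
    "government": "camelot + pypdfium2",
}

def pick_parser(doc_type: str) -> str:
    """Route by document type; fall back to LlamaParse for unknown types
    (best quality-cost ratio across domains per PDFBench)."""
    return PARSER_BY_TYPE.get(doc_type, "llamaparse")
```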
Limitations
PDFBench's manually verified corpus of 353 documents is large for a parser benchmark but small relative to the diversity of real PDF formats. Its legal documents are synthetic (generated from templates), not scanned originals with stamps, handwriting, and redactions. The benchmark excludes CJK-heavy documents, right-to-left scripts, and PDFs with embedded multimedia. The TEDS scores for table extraction are computed on a subset, not the full corpus. Token waste calculations in this article use simplified cost models; real-world costs depend on chunking strategy, embedding model, retrieval frequency, and cache hit rates. Nougat's hallucination problem may improve with future model versions; the current assessment reflects the latest available release. Marker's benchmark numbers come from its own repository and have not been independently replicated by a neutral party.
The Strongest Counterargument
Vision-native LLMs might make this entire analysis obsolete within 18 months. If GPT-5.1 already scores 92% by looking at page images, the next generation could hit 98%+ and eliminate the need for any extraction pipeline. Why maintain a Marker installation, tune pdfplumber parameters, and debug ligature mappings when you can POST a page image to an API and get perfect markdown back? The counterargument has teeth: Google, OpenAI, and Anthropic are all investing heavily in multimodal document understanding, and their accuracy curves are improving faster than the traditional parsers' annual update cycles. If your document volume is low enough that API costs don't matter (under 10,000 pages per month), the vision model approach is already the correct choice today.
The Bottom Line
PDF-to-markdown conversion is five different problems pretending to be one: text extraction, OCR, structure recovery, table parsing, and equation reconstruction. No single tool solves all of them. The most important decision isn't which parser to use; it's whether your document type even matches the parser's strengths. A legal team using pdfplumber is leaving 28 accuracy points on the table compared to pypdfium2. A research lab feeding academic papers through pypdf is paying full token price to stuff its LLM's context window with equation garbage. Match the tool to the document. Clean your markdown before it hits the context window. Count the tokens you're wasting. That 49-point accuracy gap between document types is wider than any gap between parsers. Start there.