B2B Document Extractor Rebuilt: Rule-Based vs. LLM Using pytesseract OCR
Why does extracting data from B2B order forms still feel like a puzzle? In practice, every document varies just enough to trip a rule-based system: one client puts the purchase-order number in the top-left corner, another tucks it into the bottom-right. Labels shift too: “PO Number”, “Order ID”, “Order Reference”, or something else entirely.
Humans read the page, infer context, and move on. A regex rule can hunt for a fixed string, but it stalls the moment a new label appears. That tension is the heart of the experiment behind this article.
The author rebuilt the same extractor twice, first with a traditional pipeline that couples pytesseract OCR and handcrafted regex patterns, then with an LLM‑driven flow that adds Ollama and LLaMA 3 to the mix. The aim isn’t to crown one method as superior; it’s to pinpoint where rule‑based approaches begin to break under growing layout diversity and whether a language model can actually cut down the maintenance burden. The step‑by‑step rebuild, head‑to‑head benchmarks, and guidance on when to avoid LLMs follow.
Afterwards, we use pytesseract to read the image and extract the raw text via OCR (Optical Character Recognition). Put simply, OCR means that the tool "looks" at the image and tries to recognize letters from pixels, much like a human deciphering handwritten notes.
- In the second step, we use regex: regular expressions that search for specific patterns inside the text. For example, we can define: "Search for everything that comes after 'PO Number:'." Already in this second step, we can identify the first problem: what happens if the customer simply writes "Order Reference" instead of "PO Number"?
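That failure mode can be reproduced with a minimal sketch. The two sample strings stand in for OCR output and are hypothetical, as is the helper name `extract_po_number`; the pattern illustrates the idea and is not the one used in the author's Gist.

```python
import re

# Hypothetical OCR output for two layouts (not taken from the article's sample PDFs).
LAYOUT_A = "ACME GmbH\nPO Number: 4500123\nDelivery Date: 2024-05-01"
LAYOUT_B = "Order Reference: 4500123\nACME GmbH\nRequested Delivery: 2024-05-01"

# One handcrafted rule: capture whatever follows the literal label "PO Number:".
PO_PATTERN = re.compile(r"PO Number:\s*(\S+)")


def extract_po_number(text):
    """Return the PO number if the fixed label is present, else None."""
    match = PO_PATTERN.search(text)
    return match.group(1) if match else None


print(extract_po_number(LAYOUT_A))  # 4500123
print(extract_po_number(LAYOUT_B))  # None
```

Supporting the second layout means widening the pattern, for example to `(?:PO Number|Order Reference):\s*(\S+)`, and repeating that change for every new label that shows up.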
In that case, the regex pattern finds nothing. What we can then do (or must do) is add a new rule.
Execute Script 1 for Approach 1
Next, we create a new file called approach1_traditional.py inside the same folder, using the code from this GitHub Gist: https://gist.github.com/Sari95/aa2be6938fbcb1c7f94b053d9046f55d
Now we execute the file inside the terminal:
python approach1_traditional.py
The Result of Approach 1
For Layout A, everything works perfectly. For Layout B, not a single field is recognized, and all values return "None".
And this is exactly where the problem lies. For every new customer, new regex rules would have to be written, tested, and deployed. With 200 customers, that means 200 different patterns. And every time a customer slightly changes their form, the system breaks again.
Approach 2: A New Way (pytesseract + Ollama + LLaMA 3)
In this second approach, we keep the OCR step, but replace the rigid regex rules with an LLM:
- pytesseract still reads the text from the PDF.
- Instead of telling the code "Search for PO Number: ", we tell the LLM: "Here is an order document.
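A minimal sketch of that hand-off, assuming a local Ollama server on its default port (11434) with a model pulled as `llama3`. The prompt wording beyond its opening sentence, the field names, and the helper names are hypothetical and not taken from the author's script. Only the standard library is used, and the OCR text is passed in as a plain string.

```python
import json
import urllib.request

# Assumptions: Ollama's default local endpoint and a model pulled as "llama3".
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"
# Hypothetical field names; adapt them to the fields your order forms carry.
FIELDS = ["po_number", "customer", "delivery_date"]


def build_prompt(ocr_text):
    """Describe the task in words instead of encoding label-specific rules."""
    return (
        "Here is an order document. Return a JSON object with the keys "
        + ", ".join(FIELDS)
        + ". Use null for any value you cannot find.\n\n"
        + ocr_text
    )


def parse_fields(raw):
    """Parse the model's reply defensively; missing or invalid data becomes None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {key: data.get(key) for key in FIELDS}


def extract_with_llm(ocr_text):
    """Send the OCR text to the local model and return the parsed fields."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": build_prompt(ocr_text),
        "format": "json",   # ask Ollama to constrain the reply to valid JSON
        "stream": False,    # one complete reply instead of streamed chunks
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.loads(response.read())
    return parse_fields(reply.get("response", ""))
```

Calling `extract_with_llm(text)` with the string returned by the OCR step yields a dictionary with the three keys. `parse_fields` keeps the output shape stable even when the model returns malformed JSON, which matters because the model's replies are not guaranteed to be consistent.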
Why this matters
The experiment shows that a single OCR engine, pytesseract, can feed both a rule‑based pipeline and a large language model with the same raw text. Yet the downstream extraction diverges sharply. Rule‑based scripts stumble when a purchase order number moves from the top‑left to the bottom‑right, or when a label switches from “PO Number” to “Order ID”.
An LLM, by contrast, can infer meaning from context, but we've seen no data on consistency across thousands of vendor formats. Consequently, developers must decide whether the flexibility of a language model justifies the added compute and potential opacity. Founders should ask: does the marginal gain in recall outweigh the risk of unpredictable outputs?
Researchers are left with an open question about how well current LLMs handle noisy OCR transcripts. In practice, the OCR step itself remains a bottleneck; pixel‑to‑character conversion is still imperfect, especially with handwritten notes. Until we see systematic benchmarks, the choice between rules and LLMs stays a pragmatic trade‑off rather than a clear victory.