How to Build a B2B Document Extractor with Both Rules and LLM: A Step-by-Step Comparison

Introduction

Extracting structured data from B2B PDF invoices, purchase orders, and receipts is a common challenge. Many developers turn to rule-based approaches using OCR (like Tesseract) or explore modern LLMs (like LLaMA 3) for more flexible extraction. This guide walks you through building the same extractor twice — once with pytesseract rules and once with Ollama + LLaMA 3 — so you can compare performance, accuracy, and maintenance on a realistic B2B order scenario.

Source: towardsdatascience.com

What You Need

- Python 3 with pytesseract, pdf2image, Pillow, and the ollama client library
- The Tesseract OCR engine and Poppler installed on your system (required by pytesseract and pdf2image, respectively)
- Ollama running locally with the llama3 model pulled
- A sample B2B PDF, such as a purchase order

Step-by-Step Guide

Step 1: Set Up the Environment and Sample Document

First, create a project folder and install dependencies:

pip install pytesseract pdf2image Pillow ollama

Place your sample B2B PDF in the folder. For this guide, we assume a purchase order containing fields such as Order ID, Supplier Name, Line Items, and Total Amount.
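Both extractors will target the same output shape. Here is a hypothetical example of the JSON we want, based on the fields above (all values are made up for illustration):

```python
# Target output shape for both extractors (example values are hypothetical).
expected = {
    "order_id": "PO-2024-0117",
    "supplier_name": "Acme Industrial Supply",
    "line_items": [
        {"description": "Steel brackets", "quantity": 40, "unit_price": 2.50},
        {"description": "Hex bolts M8", "quantity": 200, "unit_price": 0.12},
    ],
    "total_amount": 124.00,
}
```

Agreeing on this shape up front makes the side-by-side comparison in Step 4 much easier.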

Step 2: Build the Rule-Based Extractor with pytesseract

Create a Python script rule_extractor.py. Use pdf2image to convert PDF pages to images, then apply Tesseract OCR:

from pdf2image import convert_from_path
import pytesseract

# Render each PDF page as an image, then OCR the first page
images = convert_from_path('order.pdf')
text = pytesseract.image_to_string(images[0])

Now define rules using regex and keyword matching.
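A minimal sketch of such rules, assuming the OCR output contains labels like `Order ID:` and `Total Amount:` (adjust the patterns to your own layout):

```python
import re

def extract_fields(text):
    """Pull labeled fields out of OCR text with simple regex rules."""
    patterns = {
        "order_id": r"Order\s*ID[:\s]+([A-Z0-9-]+)",
        "supplier_name": r"Supplier\s*Name[:\s]+(.+)",
        "total_amount": r"Total(?:\s*Amount)?[:\s]+\$?([\d,]+\.\d{2})",
    }
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text, re.IGNORECASE)
        result[field] = match.group(1).strip() if match else None
    return result

sample = "Order ID: PO-1042\nSupplier Name: Acme Corp\nTotal Amount: $124.00"
print(extract_fields(sample))
```

Missing fields come back as `None`, which is useful later for scoring how confident the rule-based pass was.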

Test with your PDF and adjust the regex patterns. This approach works well for consistent layouts but breaks as soon as the format changes.

Step 3: Build the LLM-Based Extractor with Ollama and LLaMA 3

Create llm_extractor.py. Read the PDF text as before (or use OCR output). Then pass it to Ollama:

import json
import ollama

prompt = """You are a B2B document parser. Extract fields: Order ID, Supplier Name, Line Items (as list), Total. Output only JSON.
Document:
{ocr_text}
""".format(ocr_text=text)

response = ollama.chat(model='llama3', messages=[{'role': 'user', 'content': prompt}])
result = json.loads(response['message']['content'])

This method is layout-agnostic and handles format variations naturally. However, it requires running a local LLM and is slower than regex matching. You can also tighten the prompt, for instance with few-shot examples, to enforce a consistent schema.
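Even with "Output only JSON" in the prompt, models sometimes wrap the JSON in prose or code fences, so feeding `response['message']['content']` straight into `json.loads` can fail. One defensive approach, sketched here with an assumed set of required keys: pull out the first JSON object in the reply and check that the fields you asked for are present:

```python
import json
import re

REQUIRED_KEYS = {"order_id", "supplier_name", "line_items", "total"}  # assumed schema

def parse_llm_json(raw):
    """Extract and validate the first JSON object in an LLM reply."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # widest {...} span in the text
    if not match:
        raise ValueError("no JSON object found in model output")
    data = json.loads(match.group(0))
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

reply = 'Here you go:\n{"order_id": "PO-1042", "supplier_name": "Acme", "line_items": [], "total": 124.0}'
print(parse_llm_json(reply))
```

Raising on missing fields instead of silently returning a partial dict makes failures visible, which matters in Step 4.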


Step 4: Compare Outputs and Handle Failures

Run both scripts on the same document and compare the extracted JSON field by field.
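A simple field-by-field diff, assuming both scripts produced dicts in the shared schema (`rule_result` and `llm_result` below are illustrative):

```python
def diff_extractions(rule_result, llm_result):
    """Report fields where the two extractors disagree."""
    mismatches = {}
    for field in rule_result.keys() | llm_result.keys():
        a, b = rule_result.get(field), llm_result.get(field)
        if a != b:
            mismatches[field] = {"rules": a, "llm": b}
    return mismatches

rule_result = {"order_id": "PO-1042", "total": "124.00"}
llm_result = {"order_id": "PO-1042", "total": "124.0"}
print(diff_extractions(rule_result, llm_result))
```

Disagreements like `"124.00"` vs `"124.0"` are often formatting, not extraction, errors, so normalize values (strip currency symbols, cast numbers) before judging which extractor "won."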

For failures, add fallback patterns to the rules, or improve the LLM prompt by providing worked examples. Consider a hybrid pipeline in which the LLM acts as a backup.

Step 5: Optimize for Your Use Case

For production, measure accuracy, speed, and maintenance overhead. The rule-based extractor is fast and cheap but brittle; the LLM-based one is flexible but benefits from a GPU and careful prompt engineering.

You can also combine them: run the rules first, and fall back to the LLM whenever rule confidence drops below a threshold (say, 90%).
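The hybrid idea can be sketched like this, with stub extractors standing in for the real ones from Steps 2 and 3, and a crude, hypothetical confidence score based on how many fields the rules managed to fill:

```python
def rule_confidence(result):
    """Crude confidence: fraction of fields the rules actually filled."""
    return sum(v is not None for v in result.values()) / len(result)

def extract_hybrid(text, rule_extract, llm_extract, threshold=0.9):
    """Try rules first; fall back to the LLM below the confidence threshold."""
    result = rule_extract(text)
    if rule_confidence(result) >= threshold:
        return result, "rules"
    return llm_extract(text), "llm"

# Stubs for illustration; swap in the real extractors from the earlier steps.
rules = lambda text: {"order_id": "PO-1042", "total": None}   # one field missed
llm = lambda text: {"order_id": "PO-1042", "total": "124.00"}

result, source = extract_hybrid("...", rules, llm)
print(source)  # the rules filled only 1 of 2 fields, so the LLM fallback runs
```

In practice, you might refine the confidence score with per-field checks (does the total parse as a number? does the order ID match an expected pattern?) rather than a simple fill count.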

Tips for Success

- Keep the rule-based path for documents with stable, known layouts; it is the cheapest option when it works.
- Log every extraction failure from both pipelines; failure cases are the best source of new regex patterns and better prompt examples.
- Validate the LLM's JSON output before using it downstream rather than trusting it blindly.

By building the same extractor twice, you gain practical insight into the trade-offs and can make an informed choice for your B2B document processing needs.
