Why Docling + OkraPDF
Docling is IBM’s open-source document parser. It runs entirely on your machine — no API keys, no cloud calls, no per-page pricing. Its TableFormer model is best-in-class for complex table extraction (merged cells, nested headers, spanning rows).
Once you’ve parsed a document, you often need to share it — give a colleague a link to the extracted tables, let a client chat with the report, or serve the structured data via API to downstream systems.
OkraPDF handles that layer. Upload your PDF, deploy Docling’s extraction, and get:
- Chat completions over the document
- Structured extraction with JSON schemas
- Page images with bounding box overlays
- Deterministic URLs for every page, table, and figure
- Collection queries across multiple documents
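Docling parses the PDF on your machine. OkraPDF receives the structured text and coordinates, plus a copy of the PDF if you upload it for page rendering (the upload is optional; see Data sovereignty).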
PDF bytes ──► [your machine: Docling] ──► structured JSON
                                                 │
                                                 ▼
                                     [OkraPDF: store + serve]
                                                 │
                                      ┌──────────┼──────────┐
                                      ▼          ▼          ▼
                                    chat    page images API URLs
Install
# Docling (Python)
pip install docling requests
# OkraPDF API key
export OKRA_API_KEY=okra_...
Docling requires Python 3.10+ and ~4 GB RAM for the layout + table models.
The first run downloads the models from Hugging Face (~500 MB).
How it works
The integration is a three-step pipeline:
- Upload the PDF to OkraPDF with skip_parse=true. This stores the file for page rendering but skips OCR, so there are no extraction charges.
- Parse the PDF locally with Docling. DocumentConverter().convert() returns a conversion result whose document attribute is a DoclingDocument with text, tables, figures, and bounding boxes.
- Ingest the Docling output into the OkraPDF document. This replaces the extraction layer, and the document is then live with chat, search, and API access.
Full example
import os, sys, requests
from docling.document_converter import DocumentConverter
API_URL = "https://api.okrapdf.com"
API_KEY = os.environ["OKRA_API_KEY"]
PDF_PATH = sys.argv[1] # e.g. "quarterly-report.pdf"
# ── Step 1: Upload PDF (skip_parse — no OCR charge) ─────────────
with open(PDF_PATH, "rb") as f:
    resp = requests.post(
        f"{API_URL}/v1/documents?skip_parse=true",
        files={"file": (os.path.basename(PDF_PATH), f, "application/pdf")},
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
resp.raise_for_status()
doc_id = resp.json()["documentId"]
print(f"Uploaded: {doc_id}")
# ── Step 2: Parse locally with Docling ───────────────────────────
result = DocumentConverter().convert(PDF_PATH)
doc_dict = result.document.export_to_dict()
print(f"Parsed: {len(doc_dict.get('pages', {}))} pages, "
      f"{len(doc_dict.get('texts', []))} texts, "
      f"{len(doc_dict.get('tables', []))} tables")
# ── Step 3: Send raw Docling JSON — server handles everything ────
resp = requests.post(
    f"{API_URL}/document/{doc_id}/ingest",
    json={"data": doc_dict, "vendor": "docling", "mode": "replace"},
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
)
resp.raise_for_status()
print(f"\nDocument live:")
print(f" Chat: {API_URL}/document/{doc_id}/chat/completions")
print(f" Markdown: {API_URL}/v1/documents/{doc_id}/full.md")
print(f" Page 1: {API_URL}/v1/documents/{doc_id}/pg_1.png")
print(f" Page 1 md: {API_URL}/v1/documents/{doc_id}/pg_1.md")
No client-side mapping needed. You send the raw export_to_dict() output and OkraPDF’s server-side Docling plugin handles bbox conversion (BOTTOMLEFT → 0-1 relative), table cell restructuring (flat grid → row/cell hierarchy), and label passthrough. The raw Docling JSON is also stored verbatim for auditability.
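The URLs printed at the end of the example follow a fixed pattern, so they can be built from the document ID alone. The helper below is a small sketch of that pattern: `document_urls` is a hypothetical name, and it assumes the `pg_N` scheme shown for page 1 applies to other page numbers as well.

```python
API_URL = "https://api.okrapdf.com"


def document_urls(doc_id: str, page: int = 1) -> dict[str, str]:
    """Build the URLs the full example prints, for one document and page.

    Assumption: pg_<n>.png / pg_<n>.md work for any page number,
    not only the page 1 shown in the example output.
    """
    base = f"{API_URL}/v1/documents/{doc_id}"
    return {
        "chat": f"{API_URL}/document/{doc_id}/chat/completions",
        "markdown": f"{base}/full.md",
        "page_png": f"{base}/pg_{page}.png",
        "page_md": f"{base}/pg_{page}.md",
    }


if __name__ == "__main__":
    # doc-abc123 is a placeholder ID
    for name, url in document_urls("doc-abc123", page=2).items():
        print(f"{name}: {url}")
```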
Docling’s output includes structured labels and bounding boxes for every element:
| Docling label | What it is |
|---|---|
| text | Body paragraph |
| section_header | Section heading |
| title | Document title |
| list_item | Bulleted or numbered list entry |
| table | Structured table with cell grid |
| picture / chart | Figure with optional caption |
| footnote | Footnote text |
| page_header / page_footer | Running headers and footers |
| key_value_region | Key-value pair (forms) |
| formula | Mathematical formula |
| code | Code block |
All labels are passed through to OkraPDF as-is. OkraPDF maps them to canonical types at the rendering boundary — you always get the original Docling label in the API response.
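Before ingesting, you can inspect these labels locally in the exported dict. The sketch below pulls every text item with a given label, together with its page number. It assumes each entry under `texts` carries `label`, `text`, and a `prov` list with `page_no`; check this against the output of your Docling version. `items_with_label` and `sample_doc` are illustrative names, not part of either library.

```python
from typing import Any, Iterator, Optional


def items_with_label(
    doc_dict: dict[str, Any], label: str
) -> Iterator[tuple[Optional[int], str]]:
    """Yield (page_no, text) for each text item carrying the given label."""
    for item in doc_dict.get("texts", []):
        if item.get("label") != label:
            continue
        prov = item.get("prov") or []
        # Use the first provenance entry; None if the item has no location.
        page_no = prov[0].get("page_no") if prov else None
        yield page_no, item.get("text", "")


# Hand-written stand-in for result.document.export_to_dict()
sample_doc = {
    "texts": [
        {"label": "section_header", "text": "Overview", "prov": [{"page_no": 1}]},
        {"label": "text", "text": "Revenue grew.", "prov": [{"page_no": 1}]},
        {"label": "section_header", "text": "Results", "prov": [{"page_no": 2}]},
    ]
}

if __name__ == "__main__":
    for page_no, text in items_with_label(sample_doc, "section_header"):
        print(f"p{page_no}: {text}")
```

Tables and pictures live in their own top-level lists (`tables`, `pictures`), so this helper only covers the text-like labels.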
Bounding box conversion
Docling reports bounding boxes in absolute page units (points for a PDF) with a BOTTOMLEFT origin. OkraPDF uses 0-1 relative coordinates with a top-left origin.
The conversion flips the Y axis and normalizes by page dimensions:
# Docling: l=72, t=720, r=300, b=700 on a 612x792 page
# OkraPDF: x=0.118, y=0.091, w=0.373, h=0.025
x = l / page_width # 72/612 = 0.118
y = (page_height - t) / page_height # (792-720)/792 = 0.091
w = (r - l) / page_width # (300-72)/612 = 0.373
h = (t - b) / page_height # (720-700)/792 = 0.025
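The same arithmetic as a runnable function, in case you want to check coordinates locally. OkraPDF performs this conversion server-side, so this is only an illustration; `to_relative_bbox` is a hypothetical name, and it assumes the input box uses a bottom-left origin.

```python
def to_relative_bbox(
    l: float, t: float, r: float, b: float,
    page_width: float, page_height: float,
) -> dict[str, float]:
    """Convert a bottom-left-origin box to 0-1 relative, top-left-origin x/y/w/h."""
    return {
        "x": l / page_width,
        "y": (page_height - t) / page_height,  # flip the Y axis
        "w": (r - l) / page_width,
        "h": (t - b) / page_height,
    }


if __name__ == "__main__":
    box = to_relative_bbox(l=72, t=720, r=300, b=700, page_width=612, page_height=792)
    # Rounded to three places: x=0.118, y=0.091, w=0.373, h=0.025
    print({k: round(v, 3) for k, v in box.items()})
```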
Table structure
Docling’s TableFormer model extracts table cells as a flat array with row/column grid indices:
{
"table_cells": [
{"text": "Revenue", "start_row_offset_idx": 0, "start_col_offset_idx": 0},
{"text": "$10M", "start_row_offset_idx": 0, "start_col_offset_idx": 1},
{"text": "Profit", "start_row_offset_idx": 1, "start_col_offset_idx": 0},
{"text": "$2M", "start_row_offset_idx": 1, "start_col_offset_idx": 1}
]
}
OkraPDF’s server-side Docling plugin groups these into its table > row > cell hierarchy:
{
"type": "table",
"children": [
{"type": "row", "children": [
{"type": "cell", "value": "Revenue"},
{"type": "cell", "value": "$10M"}
]},
{"type": "row", "children": [
{"type": "cell", "value": "Profit"},
{"type": "cell", "value": "$2M"}
]}
]
}
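To make the restructuring concrete, here is a minimal sketch of how a flat cell list can be grouped into that hierarchy. It is not OkraPDF's actual implementation: `group_cells` is a hypothetical name, and the sketch ignores row and column spans, which a real converter would have to handle for merged cells.

```python
from collections import defaultdict
from typing import Any


def group_cells(table_cells: list[dict[str, Any]]) -> dict[str, Any]:
    """Group flat Docling cells into a table > row > cell tree (spans ignored)."""
    rows: dict[int, list[dict[str, Any]]] = defaultdict(list)
    for cell in table_cells:
        rows[cell["start_row_offset_idx"]].append(cell)

    children = []
    for row_idx in sorted(rows):
        # Order cells left to right within each row.
        ordered = sorted(rows[row_idx], key=lambda c: c["start_col_offset_idx"])
        children.append({
            "type": "row",
            "children": [{"type": "cell", "value": c["text"]} for c in ordered],
        })
    return {"type": "table", "children": children}


if __name__ == "__main__":
    cells = [
        {"text": "Revenue", "start_row_offset_idx": 0, "start_col_offset_idx": 0},
        {"text": "$10M", "start_row_offset_idx": 0, "start_col_offset_idx": 1},
        {"text": "Profit", "start_row_offset_idx": 1, "start_col_offset_idx": 0},
        {"text": "$2M", "start_row_offset_idx": 1, "start_col_offset_idx": 1},
    ]
    print(group_cells(cells))
```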
Using with the CLI
If you already have a Docling JSON output file, upload the PDF with the CLI and then ingest the JSON with curl:
# Upload PDF (no parsing)
okra upload report.pdf --skip-parse
# → doc-abc123...
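# Ingest Docling output. docling-output.json is assumed to use the same
# envelope as the Python example: {"data": ..., "vendor": "docling", "mode": "replace"}
curl -X POST https://api.okrapdf.com/document/doc-abc123/ingest \
  -H "Authorization: Bearer $OKRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d @docling-output.json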
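The Python example sends the Docling export wrapped in an envelope with `data`, `vendor`, and `mode` keys. If you want a file to pass to curl, one way to produce it is sketched below, assuming the ingest endpoint expects that same envelope from a file. `build_ingest_payload` and `write_ingest_payload` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any


def build_ingest_payload(doc_dict: dict[str, Any], mode: str = "replace") -> dict[str, Any]:
    """Wrap a Docling export in the envelope used by the Python example."""
    return {"data": doc_dict, "vendor": "docling", "mode": mode}


def write_ingest_payload(doc_dict: dict[str, Any], out_path: str) -> None:
    """Write the wrapped payload as UTF-8 JSON, ready for `curl -d @<file>`."""
    payload = build_ingest_payload(doc_dict)
    Path(out_path).write_text(json.dumps(payload), encoding="utf-8")
```

With a parsed document in hand, `write_ingest_payload(result.document.export_to_dict(), "docling-output.json")` would create the file referenced in the curl command.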
Data sovereignty
This pattern gives you full control over where PDF bytes are processed:
| Step | Where it runs | What’s sent |
|---|---|---|
| PDF parsing | Your machine (Docling) | Nothing — fully local |
| Upload | OkraPDF API | PDF bytes (for page images) |
| Ingest | OkraPDF API | Structured text + coordinates |
| Chat / extraction | OkraPDF edge | Queries only |
For maximum privacy, you can skip the PDF upload entirely and use POST /v1/documents/ingest to create a document from structured data alone — but you won’t get page images or PDF download.
Verify it’s lossless
OkraPDF stores the raw Docling JSON server-side and preserves original labels — no mapping, no data loss. You can verify this by comparing the snapshot export against your local Docling output:
# 1. Check the snapshot — raw Docling types preserved as-is
curl -s -H "Authorization: Bearer $OKRA_API_KEY" \
  "https://api.okrapdf.com/exports/$DOC_ID/snapshot" | python3 -c "
import sys, json
d = json.load(sys.stdin)
types = {}
has_bbox = 0
for page in d['pages']:
    for b in page['blocks']:
        types[b['type']] = types.get(b['type'], 0) + 1
        if b.get('bbox'):
            has_bbox += 1
total = sum(types.values())
print(f'Total blocks: {total}, with bbox: {has_bbox}')
print('Types:')
for t, c in sorted(types.items(), key=lambda x: -x[1]):
    print(f'  {t}: {c}')
"
Example output for a 2-page resume:
Total blocks: 169, with bbox: 169
Types:
text: 79
list_item: 48
section_header: 41
picture: 1
Notice the types are Docling’s raw labels (section_header, list_item) — not mapped to generic types. OkraPDF resolves these to canonical types only at the rendering boundary (markdown export, chat context), so the original fidelity is always available via the API.
# 2. Compare block count: local vs deployed
python3 -c "
from docling.document_converter import DocumentConverter
result = DocumentConverter().convert('report.pdf')
doc = result.document.export_to_dict()
local = len(doc.get('texts', [])) + len(doc.get('tables', [])) + len(doc.get('pictures', []))
print(f'Local Docling blocks: {local}')
"
curl -s -H "Authorization: Bearer $OKRA_API_KEY" \
"https://api.okrapdf.com/exports/$DOC_ID/snapshot" | \
python3 -c "
import sys, json
d = json.load(sys.stdin)
deployed = sum(len(p['blocks']) for p in d['pages'])
print(f'Deployed OkraPDF blocks: {deployed}')
"
If the counts match and the types are raw Docling labels, the ingest dropped no blocks and kept the original labels.
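A total count can hide offsetting differences, so a per-label comparison is a slightly stronger check. The sketch below assumes you have the local `export_to_dict()` output and the snapshot JSON already loaded as dicts. It relies on the snapshot shape used above (`pages` > `blocks` > `type`) and on local items carrying a `label` field. The function names are illustrative.

```python
from collections import Counter
from typing import Any


def local_label_counts(doc_dict: dict[str, Any]) -> Counter:
    """Count labels across the texts, tables, and pictures of a Docling export."""
    counts: Counter = Counter()
    for key in ("texts", "tables", "pictures"):
        for item in doc_dict.get(key, []):
            counts[item.get("label", "unknown")] += 1
    return counts


def snapshot_type_counts(snapshot: dict[str, Any]) -> Counter:
    """Count block types in an OkraPDF snapshot export."""
    return Counter(b["type"] for p in snapshot["pages"] for b in p["blocks"])


def diff_counts(local: Counter, deployed: Counter) -> dict[str, tuple[int, int]]:
    """Return {label: (local, deployed)} for every label whose counts differ."""
    labels = sorted(set(local) | set(deployed))
    return {
        lab: (local[lab], deployed[lab])
        for lab in labels
        if local[lab] != deployed[lab]
    }
```

An empty dict from `diff_counts(local_label_counts(doc), snapshot_type_counts(snapshot))` means every label appears the same number of times on both sides.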
Standalone example
A complete standalone script is available at examples/docling-ingest.py.
See also