
Why Docling + OkraPDF

Docling is IBM’s open-source document parser. It runs entirely on your machine: no API keys, no cloud calls, no per-page pricing. Its TableFormer model is built for complex table extraction (merged cells, nested headers, spanning rows).

Once you’ve parsed a document, you often need to share it: give a colleague a link to the extracted tables, let a client chat with the report, or serve the structured data via API to downstream systems. OkraPDF handles that layer. Upload your PDF, deploy Docling’s extraction, and get:
  • Chat completions over the document
  • Structured extraction with JSON schemas
  • Page images with bounding box overlays
  • Deterministic URLs for every page, table, and figure
  • Collection queries across multiple documents
Parsing runs entirely on your machine. OkraPDF receives the structured text and coordinates, plus the PDF file itself when you upload it for page rendering.
PDF bytes ──► [your machine: Docling] ──► structured JSON
                                                │
                                                ▼
                                    [OkraPDF: store + serve]
                                                │
                                     ┌──────────┼──────────┐
                                     ▼          ▼          ▼
                                   chat    page images  API URLs

Install

# Docling (Python)
pip install docling requests

# OkraPDF API key
export OKRA_API_KEY=okra_...
Docling requires Python 3.10+ and ~4 GB RAM for the layout + table models. First run downloads models from HuggingFace (~500 MB).

How it works

The integration is a three-step pipeline:
  1. Upload the PDF to OkraPDF with skip_parse=true — stores the file for page rendering, but skips OCR. No extraction charges.
  2. Parse the PDF locally with Docling — DocumentConverter().convert() returns a DoclingDocument with text, tables, figures, and bounding boxes.
  3. Ingest the Docling output into the OkraPDF document — replaces the extraction layer. The document is now live with chat, search, and API access.

Full example

import os, sys, requests
from docling.document_converter import DocumentConverter

API_URL = "https://api.okrapdf.com"
API_KEY = os.environ["OKRA_API_KEY"]
PDF_PATH = sys.argv[1]  # e.g. "quarterly-report.pdf"

# ── Step 1: Upload PDF (skip_parse — no OCR charge) ─────────────

with open(PDF_PATH, "rb") as f:
    resp = requests.post(
        f"{API_URL}/v1/documents?skip_parse=true",
        files={"file": (os.path.basename(PDF_PATH), f, "application/pdf")},
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    doc_id = resp.json()["documentId"]

print(f"Uploaded: {doc_id}")

# ── Step 2: Parse locally with Docling ───────────────────────────

result = DocumentConverter().convert(PDF_PATH)
doc_dict = result.document.export_to_dict()

print(f"Parsed: {len(doc_dict.get('pages', {}))} pages, "
      f"{len(doc_dict.get('texts', []))} texts, "
      f"{len(doc_dict.get('tables', []))} tables")

# ── Step 3: Send raw Docling JSON — server handles everything ────

resp = requests.post(
    f"{API_URL}/document/{doc_id}/ingest",
    json={"data": doc_dict, "vendor": "docling", "mode": "replace"},
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
)
resp.raise_for_status()

print("\nDocument live:")
print(f"  Chat:       {API_URL}/document/{doc_id}/chat/completions")
print(f"  Markdown:   {API_URL}/v1/documents/{doc_id}/full.md")
print(f"  Page 1:     {API_URL}/v1/documents/{doc_id}/pg_1.png")
print(f"  Page 1 md:  {API_URL}/v1/documents/{doc_id}/pg_1.md")
No client-side mapping needed. You send the raw export_to_dict() output and OkraPDF’s server-side Docling plugin handles bbox conversion (BOTTOMLEFT → 0-1 relative), table cell restructuring (flat grid → row/cell hierarchy), and label passthrough. The raw Docling JSON is also stored verbatim for auditability.

What Docling extracts

Docling’s output includes structured labels and bounding boxes for every element:
| Docling label | What it is |
| --- | --- |
| text | Body paragraph |
| section_header | Section heading |
| title | Document title |
| list_item | Bulleted or numbered list entry |
| table | Structured table with cell grid |
| picture / chart | Figure with optional caption |
| footnote | Footnote text |
| page_header / page_footer | Running headers and footers |
| key_value_region | Key-value pair (forms) |
| formula | Mathematical formula |
| code | Code block |
All labels are passed through to OkraPDF as-is. OkraPDF maps them to canonical types at the rendering boundary — you always get the original Docling label in the API response.
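Because labels pass through unchanged, you can filter on them locally before ingesting. The sketch below pulls the text of every item carrying a given label, for example to preview a document's heading outline. It assumes each entry in the export's `texts` list carries `label` and `text` keys; the helper name `texts_with_label` and the sample data are illustrative only.

```python
def texts_with_label(doc_dict, label):
    """Return the text of every item in doc_dict['texts'] with the given label."""
    return [
        item.get("text", "")
        for item in doc_dict.get("texts", [])
        if item.get("label") == label
    ]


# Hand-written stand-in for result.document.export_to_dict()
sample = {
    "texts": [
        {"label": "title", "text": "Quarterly Report"},
        {"label": "section_header", "text": "Revenue"},
        {"label": "text", "text": "Revenue grew this quarter."},
        {"label": "section_header", "text": "Outlook"},
    ]
}

headings = texts_with_label(sample, "section_header")
print(headings)
```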

Bounding box conversion

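Docling reports bounding boxes in absolute page units (PDF points for PDF input) with a BOTTOMLEFT origin, so the top edge t is numerically larger than the bottom edge b. OkraPDF uses 0-1 relative coordinates with a top-left origin. The conversion flips the Y axis and normalizes by the page dimensions: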
# Docling: l=72, t=720, r=300, b=700 on a 612x792 page
# OkraPDF: x=0.118, y=0.091, w=0.373, h=0.025
x = l / page_width            # 72/612 = 0.118
y = (page_height - t) / page_height  # (792-720)/792 = 0.091
w = (r - l) / page_width      # (300-72)/612 = 0.373
h = (t - b) / page_height     # (720-700)/792 = 0.025
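
The same arithmetic can be wrapped in a small helper if you want to check coordinates locally. This is a minimal sketch: the function name `to_relative_bbox` and the returned dict shape are illustrative, not part of either API, and the server performs this conversion for you during ingest.

```python
def to_relative_bbox(l, t, r, b, page_width, page_height):
    """Convert a BOTTOMLEFT-origin absolute box to top-left-origin 0-1 coordinates."""
    return {
        "x": l / page_width,
        "y": (page_height - t) / page_height,  # flip Y: distance from the top edge
        "w": (r - l) / page_width,
        "h": (t - b) / page_height,            # t > b in a BOTTOMLEFT system
    }


box = to_relative_bbox(72, 720, 300, 700, page_width=612, page_height=792)
print({k: round(v, 3) for k, v in box.items()})
# {'x': 0.118, 'y': 0.091, 'w': 0.373, 'h': 0.025}
```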

Table structure

Docling’s TableFormer model extracts table cells as a flat array with row/column grid indices:
{
  "table_cells": [
    {"text": "Revenue", "start_row_offset_idx": 0, "start_col_offset_idx": 0},
    {"text": "$10M",    "start_row_offset_idx": 0, "start_col_offset_idx": 1},
    {"text": "Profit",  "start_row_offset_idx": 1, "start_col_offset_idx": 0},
    {"text": "$2M",     "start_row_offset_idx": 1, "start_col_offset_idx": 1}
  ]
}
OkraPDF’s server-side Docling plugin groups these into its table > row > cell hierarchy:
{
  "type": "table",
  "children": [
    {"type": "row", "children": [
      {"type": "cell", "value": "Revenue"},
      {"type": "cell", "value": "$10M"}
    ]},
    {"type": "row", "children": [
      {"type": "cell", "value": "Profit"},
      {"type": "cell", "value": "$2M"}
    ]}
  ]
}
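
For illustration, the flat-grid to row/cell regrouping can be sketched in a few lines. This is not OkraPDF's server code; the function name `group_cells` is hypothetical, and the sketch ignores row and column spans, which a real implementation would need to handle.

```python
def group_cells(table_cells):
    """Group a flat list of Docling table cells into a table > row > cell tree."""
    rows = {}
    for cell in table_cells:
        rows.setdefault(cell["start_row_offset_idx"], []).append(cell)

    children = []
    for row_idx in sorted(rows):
        # Order cells left to right within each row
        ordered = sorted(rows[row_idx], key=lambda c: c["start_col_offset_idx"])
        children.append({
            "type": "row",
            "children": [{"type": "cell", "value": c["text"]} for c in ordered],
        })
    return {"type": "table", "children": children}


cells = [
    {"text": "$2M",     "start_row_offset_idx": 1, "start_col_offset_idx": 1},
    {"text": "Revenue", "start_row_offset_idx": 0, "start_col_offset_idx": 0},
    {"text": "Profit",  "start_row_offset_idx": 1, "start_col_offset_idx": 0},
    {"text": "$10M",    "start_row_offset_idx": 0, "start_col_offset_idx": 1},
]
table = group_cells(cells)
print(table)
```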

Using with the CLI

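If you already have a Docling JSON output file, you can upload with the CLI and ingest with a separate request. The request body should have the same shape as in the full example ({"data": ..., "vendor": "docling", "mode": "replace"}), so docling-output.json here is that wrapped payload rather than the bare export: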
# Upload PDF (no parsing)
okra upload report.pdf --skip-parse
# → doc-abc123...

# Ingest Docling output
curl -X POST https://api.okrapdf.com/document/doc-abc123/ingest \
  -H "Authorization: Bearer $OKRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d @docling-output.json
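
The full example posts the Docling export inside a {"data", "vendor", "mode"} envelope. If your file holds only the bare export_to_dict() output, a small wrapper like the one below can produce a matching body. This is a sketch under that assumption; the helper names and the output filename are hypothetical.

```python
import json


def build_ingest_payload(doc_dict, mode="replace"):
    """Wrap a raw Docling export in the envelope used by the full example."""
    return {"data": doc_dict, "vendor": "docling", "mode": mode}


def wrap_file(src_path, dst_path):
    """Read a bare Docling export from src_path and write the wrapped payload to dst_path."""
    with open(src_path, encoding="utf-8") as f:
        doc_dict = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(build_ingest_payload(doc_dict), f)


payload = build_ingest_payload({"texts": [], "tables": []})
print(sorted(payload))
```

Calling wrap_file("docling-output.json", "ingest-payload.json") and then passing -d @ingest-payload.json to curl would send the wrapped body.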

Data sovereignty

This pattern gives you full control over where PDF bytes are processed:
| Step | Where it runs | What’s sent |
| --- | --- | --- |
| PDF parsing | Your machine (Docling) | Nothing (fully local) |
| Upload | OkraPDF API | PDF bytes (for page images) |
| Ingest | OkraPDF API | Structured text + coordinates |
| Chat / extraction | OkraPDF edge | Queries only |
For maximum privacy, you can skip the PDF upload entirely and use POST /v1/documents/ingest to create a document from structured data alone — but you won’t get page images or PDF download.

Verify it’s lossless

OkraPDF stores the raw Docling JSON server-side and preserves original labels — no mapping, no data loss. You can verify this by comparing the snapshot export against your local Docling output:
# 1. Check the snapshot — raw Docling types preserved as-is
curl -s -H "Authorization: Bearer $OKRA_API_KEY" \
  "https://api.okrapdf.com/exports/$DOC_ID/snapshot" | python3 -c "
import sys, json
d = json.load(sys.stdin)
types = {}
has_bbox = 0
for page in d['pages']:
    for b in page['blocks']:
        types[b['type']] = types.get(b['type'], 0) + 1
        if b.get('bbox'): has_bbox += 1
total = sum(types.values())
print(f'Total blocks: {total}, with bbox: {has_bbox}')
print('Types:')
for t, c in sorted(types.items(), key=lambda x: -x[1]):
    print(f'  {t}: {c}')
"
Example output for a 2-page resume:
Total blocks: 169, with bbox: 169
Types:
  text: 79
  list_item: 48
  section_header: 41
  picture: 1
Notice the types are Docling’s raw labels (section_header, list_item) — not mapped to generic types. OkraPDF resolves these to canonical types only at the rendering boundary (markdown export, chat context), so the original fidelity is always available via the API.
# 2. Compare block count: local vs deployed
python3 -c "
from docling.document_converter import DocumentConverter
result = DocumentConverter().convert('report.pdf')
doc = result.document.export_to_dict()
local = len(doc.get('texts', [])) + len(doc.get('tables', [])) + len(doc.get('pictures', []))
print(f'Local Docling blocks: {local}')
"

curl -s -H "Authorization: Bearer $OKRA_API_KEY" \
  "https://api.okrapdf.com/exports/$DOC_ID/snapshot" | \
  python3 -c "
import sys, json
d = json.load(sys.stdin)
deployed = sum(len(p['blocks']) for p in d['pages'])
print(f'Deployed OkraPDF blocks: {deployed}')
"
If counts match and types are raw Docling labels, the ingest is lossless.
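
A total count can match even when individual labels differ, so a per-label comparison is a slightly stronger check. The sketch below assumes you have both payloads loaded as dicts: the local export with `texts`, `tables`, and `pictures` lists whose items carry a `label`, and the snapshot with the `pages` / `blocks` / `type` shape used above. The function names are illustrative.

```python
from collections import Counter


def local_label_counts(doc_dict):
    """Count labels across the item lists of a local Docling export."""
    counts = Counter()
    for key in ("texts", "tables", "pictures"):
        for item in doc_dict.get(key, []):
            counts[item.get("label", "unknown")] += 1
    return counts


def deployed_type_counts(snapshot):
    """Count block types across all pages of a snapshot export."""
    counts = Counter()
    for page in snapshot.get("pages", []):
        for block in page.get("blocks", []):
            counts[block.get("type", "unknown")] += 1
    return counts


def diff_counts(local, deployed):
    """Return {label: (local_n, deployed_n)} for every label whose counts differ."""
    labels = sorted(set(local) | set(deployed))
    return {k: (local[k], deployed[k]) for k in labels if local[k] != deployed[k]}


# Hand-written sample data standing in for the two real payloads
local_doc = {
    "texts": [{"label": "text"}, {"label": "section_header"}],
    "tables": [{"label": "table"}],
    "pictures": [],
}
snapshot = {"pages": [{"blocks": [{"type": "text"}, {"type": "section_header"}]}]}

mismatches = diff_counts(local_label_counts(local_doc), deployed_type_counts(snapshot))
print(mismatches or "All label counts match")
```

An empty result means every label appears the same number of times on both sides.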

Standalone example

A complete standalone script is available at examples/docling-ingest.py.
