Docling - OkraPDF

Why Docling + OkraPDF

Docling is IBM’s open-source document parser. It runs entirely on your machine: no API keys, no cloud calls, no per-page pricing. Its TableFormer model is built for complex table extraction (merged cells, nested headers, spanning rows).

Once you’ve parsed a document, you often need to share it: give a colleague a link to the extracted tables, let a client chat with the report, or serve the structured data via API to downstream systems. OkraPDF handles that layer. Upload your PDF, deploy Docling’s extraction, and get:

Chat completions over the document
Structured extraction with JSON schemas
Page images with bounding box overlays
Deterministic URLs for every page, table, and figure
Collection queries across multiple documents

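Parsing happens on your machine. OkraPDF receives the structured text and coordinates that Docling produces, plus the PDF itself when you upload it for page images (see Data sovereignty for the ingest-only alternative that keeps the PDF bytes local).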

PDF bytes ──► [your machine: Docling] ──► structured JSON
                                              │
                                              ▼
                                    [OkraPDF: store + serve]
                                         │
                              ┌──────────┼──────────┐
                              ▼          ▼          ▼
                           chat     page images   API URLs

Install

# Docling (Python)
pip install docling requests

# OkraPDF API key
export OKRA_API_KEY=okra_...

Docling requires Python 3.10+ and ~4 GB RAM for the layout + table models. The first run downloads the models from Hugging Face (~500 MB).

How it works

The integration is a three-step pipeline:

Upload the PDF to OkraPDF with skip_parse=true — stores the file for page rendering, but skips OCR. No extraction charges.
Parse the PDF locally with Docling — DocumentConverter().convert() returns a DoclingDocument with text, tables, figures, and bounding boxes.
Ingest the Docling output into the OkraPDF document — replaces the extraction layer. The document is now live with chat, search, and API access.

Full example

import os, sys, requests
from docling.document_converter import DocumentConverter

API_URL = "https://api.okrapdf.com"
API_KEY = os.environ["OKRA_API_KEY"]
PDF_PATH = sys.argv[1]  # e.g. "quarterly-report.pdf"

# ── Step 1: Upload PDF (skip_parse — no OCR charge) ─────────────

with open(PDF_PATH, "rb") as f:
    resp = requests.post(
        f"{API_URL}/v1/documents?skip_parse=true",
        files={"file": (os.path.basename(PDF_PATH), f, "application/pdf")},
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    resp.raise_for_status()
    doc_id = resp.json()["documentId"]

print(f"Uploaded: {doc_id}")

# ── Step 2: Parse locally with Docling ───────────────────────────

result = DocumentConverter().convert(PDF_PATH)
doc_dict = result.document.export_to_dict()

print(f"Parsed: {len(doc_dict.get('pages', {}))} pages, "
      f"{len(doc_dict.get('texts', []))} texts, "
      f"{len(doc_dict.get('tables', []))} tables")

# ── Step 3: Send raw Docling JSON — server handles everything ────

resp = requests.post(
    f"{API_URL}/document/{doc_id}/ingest",
    json={"data": doc_dict, "vendor": "docling", "mode": "replace"},
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
)
resp.raise_for_status()

print(f"\nDocument live:")
print(f"  Chat:       {API_URL}/document/{doc_id}/chat/completions")
print(f"  Markdown:   {API_URL}/v1/documents/{doc_id}/full.md")
print(f"  Page 1:     {API_URL}/v1/documents/{doc_id}/pg_1.png")
print(f"  Page 1 md:  {API_URL}/v1/documents/{doc_id}/pg_1.md")

No client-side mapping needed. You send the raw export_to_dict() output and OkraPDF’s server-side Docling plugin handles bbox conversion (BOTTOMLEFT → 0-1 relative), table cell restructuring (flat grid → row/cell hierarchy), and label passthrough. The raw Docling JSON is also stored verbatim for auditability.
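The URLs printed at the end of the full example follow a fixed pattern, so they can be built from the document ID alone. The helper below is a minimal sketch: `document_urls` is a hypothetical name, and it assumes that the `pg_1` pattern shown above generalizes to other 1-based page numbers.

```python
API_URL = "https://api.okrapdf.com"


def document_urls(doc_id: str, page: int = 1) -> dict[str, str]:
    """Build the URLs printed by the full example for one document and page.

    Assumption: pages are addressed as pg_<n> with n starting at 1,
    as in the pg_1.png / pg_1.md URLs shown in the full example.
    """
    base = f"{API_URL}/v1/documents/{doc_id}"
    return {
        "chat": f"{API_URL}/document/{doc_id}/chat/completions",
        "markdown": f"{base}/full.md",
        "page_image": f"{base}/pg_{page}.png",
        "page_markdown": f"{base}/pg_{page}.md",
    }


urls = document_urls("doc-abc123", page=2)
for name, url in urls.items():
    print(f"{name:14} {url}")
```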

What Docling extracts

Docling’s output includes structured labels and bounding boxes for every element:

Docling label	What it is
`text`	Body paragraph
`section_header`	Section heading
`title`	Document title
`list_item`	Bulleted or numbered list entry
`table`	Structured table with cell grid
`picture` / `chart`	Figure with optional caption
`footnote`	Footnote text
`page_header` / `page_footer`	Running headers and footers
`key_value_region`	Key-value pair (forms)
`formula`	Mathematical formula
`code`	Code block

All labels are passed through to OkraPDF as-is. OkraPDF maps them to canonical types at the rendering boundary — you always get the original Docling label in the API response.
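To see which labels a document actually contains before ingesting it, you can tally them from the local `export_to_dict()` output. This is a sketch under the assumption that each entry in the `texts`, `tables`, and `pictures` lists carries a `label` key; `count_labels` and the sample data are illustrative, not part of either library.

```python
from collections import Counter


def count_labels(doc_dict: dict) -> Counter:
    """Count Docling labels across the texts, tables, and pictures lists."""
    counts: Counter = Counter()
    for key in ("texts", "tables", "pictures"):
        for item in doc_dict.get(key, []):
            # Assumption: every item has a "label"; fall back to "unknown" if not.
            counts[item.get("label", "unknown")] += 1
    return counts


# Tiny hand-written stand-in for export_to_dict() output.
sample = {
    "texts": [
        {"label": "section_header", "text": "Summary"},
        {"label": "text", "text": "Revenue grew."},
        {"label": "text", "text": "Costs fell."},
    ],
    "tables": [{"label": "table"}],
    "pictures": [],
}

for label, count in count_labels(sample).most_common():
    print(f"{label}: {count}")
```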

Bounding box conversion

Docling reports bounding boxes in absolute page units (PDF points, such as the 612x792 page in the example below) with a BOTTOMLEFT origin. OkraPDF uses 0-1 relative coordinates with a top-left origin. The conversion flips the Y axis and normalizes by page dimensions:

# Docling: l=72, t=720, r=300, b=700 on a 612x792 page
# OkraPDF: x=0.118, y=0.091, w=0.373, h=0.025
x = l / page_width            # 72/612 = 0.118
y = (page_height - t) / page_height  # (792-720)/792 = 0.091
w = (r - l) / page_width      # (300-72)/612 = 0.373
h = (t - b) / page_height     # (720-700)/792 = 0.025
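The same arithmetic can be wrapped in a small function. The sketch below is not OkraPDF's server-side code; `RelativeBox` and `to_relative` are hypothetical names, and the `TOPLEFT` branch assumes that some Docling boxes may already use a top-left origin, in which case no flip is needed.

```python
from dataclasses import dataclass


@dataclass
class RelativeBox:
    x: float  # left edge, 0-1 from the left of the page
    y: float  # top edge, 0-1 from the top of the page
    w: float  # width as a fraction of page width
    h: float  # height as a fraction of page height


def to_relative(left, top, right, bottom, page_width, page_height,
                origin="BOTTOMLEFT") -> RelativeBox:
    """Convert an absolute box to top-left-origin 0-1 coordinates."""
    if origin == "BOTTOMLEFT":
        # Y grows upward, so the top edge has the larger value: flip it.
        y = (page_height - top) / page_height
        h = (top - bottom) / page_height
    elif origin == "TOPLEFT":
        # Y already grows downward: only normalize.
        y = top / page_height
        h = (bottom - top) / page_height
    else:
        raise ValueError(f"unsupported origin: {origin}")
    return RelativeBox(
        x=left / page_width,
        y=y,
        w=(right - left) / page_width,
        h=h,
    )


box = to_relative(left=72, top=720, right=300, bottom=700,
                  page_width=612, page_height=792)
print(round(box.x, 3), round(box.y, 3), round(box.w, 3), round(box.h, 3))
# prints: 0.118 0.091 0.373 0.025
```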

Table structure

Docling’s TableFormer model extracts table cells as a flat array with row/column grid indices:

{
  "table_cells": [
    {"text": "Revenue", "start_row_offset_idx": 0, "start_col_offset_idx": 0},
    {"text": "$10M",    "start_row_offset_idx": 0, "start_col_offset_idx": 1},
    {"text": "Profit",  "start_row_offset_idx": 1, "start_col_offset_idx": 0},
    {"text": "$2M",     "start_row_offset_idx": 1, "start_col_offset_idx": 1}
  ]
}

OkraPDF’s server-side Docling plugin groups these into its table > row > cell hierarchy:

{
  "type": "table",
  "children": [
    {"type": "row", "children": [
      {"type": "cell", "value": "Revenue"},
      {"type": "cell", "value": "$10M"}
    ]},
    {"type": "row", "children": [
      {"type": "cell", "value": "Profit"},
      {"type": "cell", "value": "$2M"}
    ]}
  ]
}
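The regrouping from a flat cell list to nested rows can be illustrated in a few lines. This is a sketch of the idea rather than OkraPDF's actual implementation: `cells_to_table` is a hypothetical name, and it ignores row and column spans, which a real converter would need to handle for merged cells.

```python
from collections import defaultdict


def cells_to_table(table_cells: list[dict]) -> dict:
    """Group flat Docling cells into a table > row > cell hierarchy."""
    rows = defaultdict(list)
    for cell in table_cells:
        rows[cell["start_row_offset_idx"]].append(cell)

    children = []
    for row_idx in sorted(rows):
        # Order cells left to right within each row.
        ordered = sorted(rows[row_idx], key=lambda c: c["start_col_offset_idx"])
        children.append({
            "type": "row",
            "children": [{"type": "cell", "value": c["text"]} for c in ordered],
        })
    return {"type": "table", "children": children}


flat = [
    {"text": "Revenue", "start_row_offset_idx": 0, "start_col_offset_idx": 0},
    {"text": "$10M",    "start_row_offset_idx": 0, "start_col_offset_idx": 1},
    {"text": "Profit",  "start_row_offset_idx": 1, "start_col_offset_idx": 0},
    {"text": "$2M",     "start_row_offset_idx": 1, "start_col_offset_idx": 1},
]
table = cells_to_table(flat)
for row in table["children"]:
    print([cell["value"] for cell in row["children"]])
```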

Using with the CLI

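If you already have a Docling JSON output file, upload with the CLI and ingest with curl as separate steps. The full example posts the Docling dict inside an envelope ({"data": ..., "vendor": "docling", "mode": "replace"}), so docling-output.json here is expected to contain that same envelope rather than the bare export_to_dict() output: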

# Upload PDF (no parsing)
okra upload report.pdf --skip-parse
# → doc-abc123...

# Ingest Docling output
curl -X POST https://api.okrapdf.com/document/doc-abc123/ingest \
  -H "Authorization: Bearer $OKRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d @docling-output.json
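The full example sends the Docling dict wrapped as {"data": ..., "vendor": "docling", "mode": "replace"}. If a saved file holds only the bare export, a small script along these lines could wrap it before the curl call. Treat it as a sketch: `wrap_docling_export` and the file names are hypothetical, and whether the endpoint also accepts a bare export is not stated here.

```python
import json
import tempfile
from pathlib import Path


def wrap_docling_export(src: Path, dst: Path) -> dict:
    """Read a bare Docling export and write it wrapped in the ingest envelope."""
    doc_dict = json.loads(src.read_text(encoding="utf-8"))
    # Same envelope the full example passes to requests.post(json=...).
    payload = {"data": doc_dict, "vendor": "docling", "mode": "replace"}
    dst.write_text(json.dumps(payload), encoding="utf-8")
    return payload


# Demo in a throwaway directory; in practice src is your saved Docling export.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "docling-output.json"
    dst = Path(tmp) / "ingest-payload.json"
    src.write_text(json.dumps({"texts": [], "tables": []}), encoding="utf-8")
    payload = wrap_docling_export(src, dst)
    written = json.loads(dst.read_text(encoding="utf-8"))

print(sorted(payload))
```

The resulting file can then be passed to curl with `-d @ingest-payload.json`.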

Data sovereignty

This pattern gives you full control over where PDF bytes are processed:

Step	Where it runs	What’s sent
PDF parsing	Your machine (Docling)	Nothing — fully local
Upload	OkraPDF API	PDF bytes (for page images)
Ingest	OkraPDF API	Structured text + coordinates
Chat / extraction	OkraPDF edge	Queries only

For maximum privacy, you can skip the PDF upload entirely and use POST /v1/documents/ingest to create a document from structured data alone — but you won’t get page images or PDF download.

Verify it’s lossless

OkraPDF stores the raw Docling JSON server-side and keeps the original labels in storage; mapping to canonical types happens only at render time. You can check this by comparing the snapshot export against your local Docling output:

# 1. Check the snapshot — raw Docling types preserved as-is
curl -s -H "Authorization: Bearer $OKRA_API_KEY" \
  "https://api.okrapdf.com/exports/$DOC_ID/snapshot" | python3 -c "
import sys, json
d = json.load(sys.stdin)
types = {}
has_bbox = 0
for page in d['pages']:
    for b in page['blocks']:
        types[b['type']] = types.get(b['type'], 0) + 1
        if b.get('bbox'): has_bbox += 1
total = sum(types.values())
print(f'Total blocks: {total}, with bbox: {has_bbox}')
print('Types:')
for t, c in sorted(types.items(), key=lambda x: -x[1]):
    print(f'  {t}: {c}')
"

Example output for a 2-page resume:

Total blocks: 169, with bbox: 169
Types:
  text: 79
  list_item: 48
  section_header: 41
  picture: 1

Notice the types are Docling’s raw labels (section_header, list_item) — not mapped to generic types. OkraPDF resolves these to canonical types only at the rendering boundary (markdown export, chat context), so the original fidelity is always available via the API.

# 2. Compare block count: local vs deployed
python3 -c "
from docling.document_converter import DocumentConverter
result = DocumentConverter().convert('report.pdf')
doc = result.document.export_to_dict()
local = len(doc.get('texts', [])) + len(doc.get('tables', [])) + len(doc.get('pictures', []))
print(f'Local Docling blocks: {local}')
"

curl -s -H "Authorization: Bearer $OKRA_API_KEY" \
  "https://api.okrapdf.com/exports/$DOC_ID/snapshot" | \
  python3 -c "
import sys, json
d = json.load(sys.stdin)
deployed = sum(len(p['blocks']) for p in d['pages'])
print(f'Deployed OkraPDF blocks: {deployed}')
"

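If the counts match and the types are raw Docling labels, every block arrived with its original label. This is a sanity check on counts and labels, not a field-by-field comparison of the content.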
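The two checks above can also be combined into one function that takes the local Docling dict and the downloaded snapshot. This is a sketch under stated assumptions: the snapshot shape (`pages` > `blocks` with `type` and `bbox`) is taken from the shell snippets above, local items are assumed to carry a `label` key, and `check_ingest` is a hypothetical helper name.

```python
from collections import Counter


def check_ingest(doc_dict: dict, snapshot: dict) -> list[str]:
    """Return a list of mismatches between local Docling output and a snapshot."""
    local_items = [
        item
        for key in ("texts", "tables", "pictures")
        for item in doc_dict.get(key, [])
    ]
    blocks = [b for page in snapshot.get("pages", []) for b in page.get("blocks", [])]

    problems = []
    if len(local_items) != len(blocks):
        problems.append(f"block count: local {len(local_items)}, deployed {len(blocks)}")

    # Compare the label histogram, not just the totals.
    local_labels = Counter(item.get("label") for item in local_items)
    deployed_types = Counter(b.get("type") for b in blocks)
    if local_labels != deployed_types:
        problems.append(f"labels differ: local {dict(local_labels)}, deployed {dict(deployed_types)}")

    missing_bbox = sum(1 for b in blocks if not b.get("bbox"))
    if missing_bbox:
        problems.append(f"{missing_bbox} deployed blocks have no bbox")
    return problems


# Hand-written stand-ins for export_to_dict() output and the snapshot export.
local = {
    "texts": [{"label": "section_header"}, {"label": "text"}],
    "tables": [{"label": "table"}],
    "pictures": [],
}
bbox = {"x": 0.1, "y": 0.1, "w": 0.5, "h": 0.05}
snapshot = {
    "pages": [
        {"blocks": [{"type": "section_header", "bbox": bbox}, {"type": "text", "bbox": bbox}]},
        {"blocks": [{"type": "table", "bbox": bbox}]},
    ]
}
print(check_ingest(local, snapshot) or "no mismatches found")
```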

Standalone example

A complete standalone script is available at examples/docling-ingest.py.
