Local Extraction + Redaction (Docling)

Why this matters

Most PDF extraction services send your document to a cloud OCR API. For sensitive documents (tax forms, contracts, medical records), that means your PDF bytes reach a third party before you even see the text. With Docling, extraction happens locally, so the PDF itself never leaves your machine. Only the extracted text is then deployed to OkraPDF’s edge, where redaction is applied at serve-time based on the viewer’s role.

PDF bytes → [your machine: Docling parse] → [OkraPDF edge: store + redact at serve-time]
                 ↑ no third-party cloud            ↑ admin sees full text
                                                   ↑ viewer sees PII replaced
                                                   ↑ public sees allowlisted sections only
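
To make the serve-time model concrete, here is a toy sketch of the idea. It is not OkraPDF's implementation: it only shows the full text being stored once and each role receiving a different projection of it at read time. The `Role` type, the `serve` function, the single SSN regex, and the `Section` shape are all hypothetical names chosen for illustration, and the choice to also redact the public view is an assumption.

```typescript
// Toy model of serve-time redaction. All names are hypothetical;
// this is not the edge-kit API.
type Role = 'admin' | 'viewer' | 'public';

interface Section {
  heading: string;
  body: string;
}

// Stand-in for a real PII detector: matches only US SSN-shaped strings.
const SSN = /\b\d{3}-\d{2}-\d{4}\b/g;

function serve(stored: Section[], role: Role, allowlist: string[]): Section[] {
  // Admin: full text, untouched.
  if (role === 'admin') return stored;

  // Viewer: same sections, PII replaced with a placeholder.
  const redacted = stored.map((s) => ({ ...s, body: s.body.replace(SSN, '[SSN]') }));
  if (role === 'viewer') return redacted;

  // Public: only allowlisted sections (assumed here to be redacted as well).
  return redacted.filter((s) => allowlist.includes(s.heading));
}

const stored: Section[] = [
  { heading: 'Recipient', body: 'SSN: 123-45-6789' },
  { heading: 'Instructions for Recipient', body: 'Keep this form for your records.' },
];

const viewerView = serve(stored, 'viewer', ['Instructions for Recipient']);
const publicView = serve(stored, 'public', ['Instructions for Recipient']);
console.log(viewerView[0].body); // "SSN: [SSN]"
console.log(publicView.map((s) => s.heading)); // [ 'Instructions for Recipient' ]
```

The point of the design is that the stored copy is never modified, so changing a role's rules does not require re-extracting or re-uploading the document.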

Install

# Docling (Python, runs locally)
pip install docling

# Okra Edge Kit (TypeScript)
npm install @okrapdf/edge-kit

Full example

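import { execFileSync } from 'node:child_process';
import { readFileSync, readdirSync, mkdirSync } from 'node:fs';
import { join } from 'node:path';
import { createRedactor, deploy } from '@okrapdf/edge-kit';
import type { PageInput } from '@okrapdf/edge-kit';

const PDF_PATH = './tax-form.pdf';
const OUT_DIR = './docling-output';

// ── Step 1: Parse locally with Docling ──────────────────────────
// PDF bytes never leave your machine
mkdirSync(OUT_DIR, { recursive: true });
// execFileSync passes arguments directly, so paths with spaces or quotes need no shell escaping
execFileSync('docling', ['--to', 'json', PDF_PATH, '--output', OUT_DIR]);

// Locate the JSON output without shelling out to `ls` (which is not portable)
const jsonName = readdirSync(OUT_DIR).find((name) => name.endsWith('.json'));
if (!jsonName) throw new Error(`No Docling JSON output found in ${OUT_DIR}`);
const doc = JSON.parse(readFileSync(join(OUT_DIR, jsonName), 'utf-8'));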

// ── Step 2: Convert Docling JSON → PageInput[] ─────────────────
// Group text items by page, preserve bounding boxes
const pageMap = new Map<number, { texts: string[]; items: PageInput['items'] }>();

for (const text of doc.texts) {
  const prov = text.prov?.[0];
  if (!prov) continue;

  if (!pageMap.has(prov.page_no)) {
    pageMap.set(prov.page_no, { texts: [], items: [] });
  }
  const entry = pageMap.get(prov.page_no)!;
  entry.texts.push(text.text);

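  if (prov.bbox) {
    // Docling bboxes carry a coord_origin. With BOTTOMLEFT (typical for PDFs) t > b;
    // Math.abs keeps the height positive if the origin is TOPLEFT instead.
    entry.items!.push({
      text: text.text,
      bbox: {
        x: prov.bbox.l,
        y: prov.bbox.t,
        w: prov.bbox.r - prov.bbox.l,
        h: Math.abs(prov.bbox.t - prov.bbox.b),
      },
    });
  }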
}

const pages: PageInput[] = [...pageMap.entries()]
  .sort(([a], [b]) => a - b)
  .map(([pageNum, { texts, items }]) => ({
    pageNum,
    text: texts.join('\n'),
    items,
  }));

// ── Step 3: Configure PII detection ─────────────────────────────
const pii = {
  patterns: ['SSN', 'EMAIL', 'PHONE_US', 'TAX_ID_US'],
  includeNames: true,
  includeAddresses: true,
};

// ── Step 4: Preview redaction locally (optional) ────────────────
// Useful for auditing what will be redacted before deploying
const redact = createRedactor({
  pii,
  publicFieldAllowlist: ['Form 1099-R', 'Instructions for Recipient'],
});

const result = redact(pages);
console.log(result.stats);
// { totalMatches: 8, pagesAffected: 1, byRule: { SSN: 1, PHONE_US: 3, EMAIL: 2, PERSON_NAME: 2 } }

// Inspect what each role sees
console.log(result.view('admin', 1));   // full text: "SSN: 123-45-6789"
console.log(result.view('viewer', 1));  // redacted:  "SSN: [SSN_6978]"
console.log(result.view('public', 1));  // restricted: allowlisted sections only

// ── Step 5: Deploy to edge ──────────────────────────────────────
// Full text stored at edge, redaction applied at serve-time per role
const deployed = await deploy({
  pages,
  meta: { title: 'Tax Form 1099-R', filename: '1099-r.pdf' },
  redact: {
    pii,
    publicFieldAllowlist: ['Form 1099-R', 'Instructions for Recipient'],
  },
  apiKey: process.env.OKRA_API_KEY!,
});

console.log(deployed.urls.admin);   // full text — internal use only
console.log(deployed.urls.viewer);  // PII redacted — safe for external sharing
console.log(deployed.urls.public);  // allowlisted sections — public embedding

What gets caught

The pii config uses OpenRedaction — pick a preset or list specific patterns:

// The three snippets below are alternatives; define `pii` only once.

// Preset-based
const pii = { preset: 'hipaa', includeNames: true };

// Pattern-based
const pii = { patterns: ['SSN', 'EMAIL', 'PHONE_US', 'TAX_ID_US'] };

// Combined — preset + extra patterns + names/addresses
const pii = {
  preset: 'hipaa',
  patterns: ['TAX_ID_US'],
  includeNames: true,
  includeAddresses: true,
};

If you omit the pii field, the redactor falls back to OpenRedaction defaults (all patterns enabled). For domain-specific patterns, pass customPatterns with raw regex:

const redact = createRedactor({
  pii: {
    preset: 'hipaa',
    customPatterns: [
      { type: 'ACCOUNT_NUM', regex: /ACC-\d{8}/g, priority: 10, placeholder: '[ACCOUNT_{n}]', severity: 'high' },
    ],
  },
  publicFieldAllowlist: ['Summary'],
});
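
Before wiring a custom pattern into the redactor, it can help to check the regex against sample text on its own. The sketch below uses only standard JavaScript regex APIs; `previewMatches` is a hypothetical helper, not part of edge-kit, and the sample string is made up.

```typescript
// Hypothetical helper for sanity-checking a custom regex; not part of edge-kit.
function previewMatches(regex: RegExp, sample: string): string[] {
  // Work on a fresh global copy so the caller's regex (and its lastIndex) is untouched.
  const flags = regex.flags.includes('g') ? regex.flags : regex.flags + 'g';
  const scanner = new RegExp(regex.source, flags);

  const found: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = scanner.exec(sample)) !== null) {
    found.push(m[0]);
    // Guard against an infinite loop on patterns that can match the empty string.
    if (m[0] === '') scanner.lastIndex++;
  }
  return found;
}

const sample = 'Primary ACC-12345678, secondary ACC-87654321, invalid ACC-1234.';
const accountMatches = previewMatches(/ACC-\d{8}/g, sample);
console.log(accountMatches); // [ 'ACC-12345678', 'ACC-87654321' ]
```

Note that `/ACC-\d{8}/g` has no trailing boundary, so it would also match the first eight digits of a longer number such as `ACC-123456789`. Adding `\b` at the end tightens that if it matters for your data.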

Visual overlays

Docling provides bounding boxes for each text item. The redactor maps PII matches to PDF coordinates, so you can render redaction boxes on a visual preview:

const overlays = result.overlays('viewer');
// [
//   { page: 1, x: 43.1, y: 762.1, w: 27.3, h: 15.6, label: 'ein' },
//   { page: 1, x: 168.8, y: 735.1, w: 14.1, h: 60.6, label: 'phone' },
// ]
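
The sample coordinates look like PDF points with a bottom-left origin (a y of 762.1 would sit near the top of a 792-point US Letter page), but that is an inference from the example rather than something stated here. Under that assumption, drawing the boxes on a canvas or DOM preview, which use a top-left origin, needs a y-flip and a scale factor. In this sketch `Overlay` copies the fields from the sample output, while `CanvasRect`, `toCanvasRect`, and the reading of `y` as the top edge of the box are assumptions.

```typescript
// Shape copied from the sample output; the conversion itself is an assumption.
interface Overlay {
  page: number;
  x: number;
  y: number;
  w: number;
  h: number;
  label: string;
}

interface CanvasRect {
  left: number;
  top: number;
  width: number;
  height: number;
}

// Assumes PDF points, a bottom-left origin, and `y` = top edge of the box
// (matching how Step 2 of the full example fills `y` from Docling's `bbox.t`).
function toCanvasRect(o: Overlay, pageHeightPt: number, scale: number): CanvasRect {
  return {
    left: o.x * scale,
    top: (pageHeightPt - o.y) * scale, // flip: distance from the top of the page
    width: o.w * scale,
    height: o.h * scale,
  };
}

const rect = toCanvasRect(
  { page: 1, x: 43.1, y: 762.1, w: 27.3, h: 15.6, label: 'ein' },
  792, // US Letter height in points
  2, // render at 2x
);
console.log(rect);
```

If the boxes render mirrored vertically, the overlay coordinates are probably already top-left based, and the flip should be dropped.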

Why Docling

Docling is an open-source PDF parser that runs entirely on your machine: no API keys, no per-page pricing, no network calls during extraction. PageInput is vendor-agnostic, so you can use whatever parser fits your workflow (LlamaParse, Azure Doc Intel, Unstructured, etc.). Docling is a good option when you want local-only extraction.
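
Since the full example only fills in a page number, the page text, and optional items with bounding boxes, adapting a different parser is mostly a reshaping step. The sketch below defines a local `PageInputLike` type mirroring those fields instead of importing from edge-kit, and `ParserBlock` is a made-up stand-in for whatever your parser emits. Treat both as assumptions.

```typescript
// Local mirror of the fields the full example fills in; the real PageInput may differ.
interface PageInputLike {
  pageNum: number;
  text: string;
  items?: { text: string; bbox: { x: number; y: number; w: number; h: number } }[];
}

// Hypothetical output of some other parser: flat blocks tagged with a page number.
interface ParserBlock {
  page: number;
  content: string;
  box?: [number, number, number, number]; // x, y, w, h
}

function toPageInputs(blocks: ParserBlock[]): PageInputLike[] {
  const byPage = new Map<number, { lines: string[]; items: NonNullable<PageInputLike['items']> }>();

  for (const block of blocks) {
    let entry = byPage.get(block.page);
    if (!entry) {
      entry = { lines: [], items: [] };
      byPage.set(block.page, entry);
    }
    entry.lines.push(block.content);
    if (block.box) {
      const [x, y, w, h] = block.box;
      entry.items.push({ text: block.content, bbox: { x, y, w, h } });
    }
  }

  // Emit pages in ascending order regardless of the order blocks arrived in.
  return Array.from(byPage.entries())
    .sort(([a], [b]) => a - b)
    .map(([pageNum, { lines, items }]) => ({ pageNum, text: lines.join('\n'), items }));
}

const adapted = toPageInputs([
  { page: 2, content: 'Second page' },
  { page: 1, content: 'Name: Jane', box: [10, 700, 80, 12] },
  { page: 1, content: 'SSN: 123-45-6789' },
]);
console.log(adapted.map((p) => p.pageNum)); // [ 1, 2 ]
console.log(adapted[0].text); // "Name: Jane\nSSN: 123-45-6789"
```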
