Why this matters
Most PDF extraction services send your document to a cloud OCR API. For sensitive documents (tax forms, contracts, medical records), that means your PDF bytes go to a third party before you even see the text.
With Docling, extraction happens locally, so the PDF file itself never leaves your machine. Only the extracted text is then deployed to OkraPDF’s edge, where redaction is applied at serve-time based on the viewer’s role.
PDF bytes → [your machine: Docling parse] → [OkraPDF edge: store + redact at serve-time]
            ↑ no third-party cloud          ↑ admin sees full text
                                            ↑ viewer sees PII replaced
                                            ↑ public sees allowlisted sections only
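The role split in the diagram can be pictured with a small sketch. Everything in it is hypothetical and for illustration only: the `Role` type, `viewFor`, the SSN regex, and the `[SSN]` placeholder are assumptions, not the edge-kit API or OkraPDF's actual redaction logic. The point is that the stored text stays the same and each view is computed at read time.

```typescript
// Hypothetical sketch of serve-time redaction; not the OkraPDF implementation.
type Role = 'admin' | 'viewer' | 'public';

interface StoredSection {
  heading: string;
  text: string;
}

// Assumed pattern for US Social Security numbers, for illustration only.
const SSN_PATTERN = /\b\d{3}-\d{2}-\d{4}\b/g;

function viewFor(role: Role, sections: StoredSection[], allowlist: string[]): string[] {
  switch (role) {
    case 'admin':
      // Admins read the stored text unchanged.
      return sections.map((s) => s.text);
    case 'viewer':
      // Viewers get every section, with PII matches replaced.
      return sections.map((s) => s.text.replace(SSN_PATTERN, '[SSN]'));
    case 'public':
      // Public readers get allowlisted sections only.
      // Redacting them as well is an assumption of this sketch.
      return sections
        .filter((s) => allowlist.includes(s.heading))
        .map((s) => s.text.replace(SSN_PATTERN, '[SSN]'));
  }
}

const stored: StoredSection[] = [
  { heading: 'Recipient', text: 'SSN: 123-45-6789' },
  { heading: 'Instructions for Recipient', text: 'Keep this form for your records.' },
];

console.log(viewFor('viewer', stored, ['Instructions for Recipient']));
```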
Install
# Docling (Python, runs locally)
pip install docling
# Okra Edge Kit (TypeScript)
npm install @okrapdf/edge-kit
Full example
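import { execFileSync } from 'node:child_process';
import { readFileSync, readdirSync, mkdirSync } from 'node:fs';
import { join } from 'node:path';
import { createRedactor, deploy } from '@okrapdf/edge-kit';
import type { PageInput } from '@okrapdf/edge-kit';

const PDF_PATH = './tax-form.pdf';
const OUT_DIR = './docling-output';

// ── Step 1: Parse locally with Docling ──────────────────────────
// PDF bytes never leave your machine
mkdirSync(OUT_DIR, { recursive: true });
// Pass arguments as an array so paths with spaces or quotes need no shell escaping
execFileSync('docling', ['--to', 'json', PDF_PATH, '--output', OUT_DIR]);
// Pick up the JSON file Docling wrote (assumes OUT_DIR holds a single export)
const jsonName = readdirSync(OUT_DIR).find((name) => name.endsWith('.json'));
if (!jsonName) throw new Error(`Docling wrote no JSON file to ${OUT_DIR}`);
const doc = JSON.parse(readFileSync(join(OUT_DIR, jsonName), 'utf-8'));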
// ── Step 2: Convert Docling JSON → PageInput[] ─────────────────
// Group text items by page, preserve bounding boxes
const pageMap = new Map<number, { texts: string[]; items: PageInput['items'] }>();
for (const text of doc.texts) {
  const prov = text.prov?.[0];
  if (!prov) continue;
  if (!pageMap.has(prov.page_no)) {
    pageMap.set(prov.page_no, { texts: [], items: [] });
  }
  const entry = pageMap.get(prov.page_no)!;
  entry.texts.push(text.text);
  if (prov.bbox) {
    entry.items!.push({
      text: text.text,
      bbox: {
        x: prov.bbox.l,
        y: prov.bbox.t,
        w: prov.bbox.r - prov.bbox.l,
        h: prov.bbox.t - prov.bbox.b,
      },
    });
  }
}
const pages: PageInput[] = [...pageMap.entries()]
  .sort(([a], [b]) => a - b)
  .map(([pageNum, { texts, items }]) => ({
    pageNum,
    text: texts.join('\n'),
    items,
  }));
// ── Step 3: Configure PII detection ─────────────────────────────
const pii = {
  patterns: ['SSN', 'EMAIL', 'PHONE_US', 'TAX_ID_US'],
  includeNames: true,
  includeAddresses: true,
};
// ── Step 4: Preview redaction locally (optional) ────────────────
// Useful for auditing what will be redacted before deploying
const redact = createRedactor({
  pii,
  publicFieldAllowlist: ['Form 1099-R', 'Instructions for Recipient'],
});
const result = redact(pages);
console.log(result.stats);
// { totalMatches: 8, pagesAffected: 1, byRule: { SSN: 1, PHONE_US: 3, EMAIL: 2, PERSON_NAME: 2 } }
// Inspect what each role sees
console.log(result.view('admin', 1)); // full text: "SSN: 123-45-6789"
console.log(result.view('viewer', 1)); // redacted: "SSN: [SSN_6978]"
console.log(result.view('public', 1)); // restricted: allowlisted sections only
// ── Step 5: Deploy to edge ──────────────────────────────────────
// Full text stored at edge, redaction applied at serve-time per role
const deployed = await deploy({
  pages,
  meta: { title: 'Tax Form 1099-R', filename: '1099-r.pdf' },
  redact: {
    pii,
    publicFieldAllowlist: ['Form 1099-R', 'Instructions for Recipient'],
  },
  apiKey: process.env.OKRA_API_KEY!,
});
console.log(deployed.urls.admin); // full text — internal use only
console.log(deployed.urls.viewer); // PII redacted — safe for external sharing
console.log(deployed.urls.public); // allowlisted sections — public embedding
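Step 2 can also be factored into a pure function, which makes the conversion easy to unit test without running Docling. The sketch below is an illustration under stated assumptions: `DoclingText`, `SketchPage`, and `toPages` are hypothetical local names (the real `PageInput` type comes from `@okrapdf/edge-kit`), and only the fields the example above reads (`text`, `prov[0].page_no`, `prov[0].bbox`) are modeled. Unlike the example, it uses `Math.abs` for the height so the value stays positive whichever vertical origin the bounding box uses.

```typescript
// Hypothetical local types; only the fields used by the example are modeled.
interface DoclingBBox { l: number; t: number; r: number; b: number }
interface DoclingText {
  text: string;
  prov?: { page_no: number; bbox?: DoclingBBox }[];
}
interface SketchItem { text: string; bbox: { x: number; y: number; w: number; h: number } }
interface SketchPage { pageNum: number; text: string; items: SketchItem[] }

function toPages(texts: DoclingText[]): SketchPage[] {
  const byPage = new Map<number, { lines: string[]; items: SketchItem[] }>();
  for (const item of texts) {
    const prov = item.prov?.[0];
    if (!prov) continue; // items without provenance cannot be placed on a page

    let entry = byPage.get(prov.page_no);
    if (!entry) {
      entry = { lines: [], items: [] };
      byPage.set(prov.page_no, entry);
    }
    entry.lines.push(item.text);

    if (prov.bbox) {
      const { l, t, r, b } = prov.bbox;
      entry.items.push({
        text: item.text,
        bbox: { x: l, y: t, w: r - l, h: Math.abs(t - b) },
      });
    }
  }
  // Emit pages in ascending page order.
  return [...byPage.entries()]
    .sort(([a], [b]) => a - b)
    .map(([pageNum, { lines, items }]) => ({ pageNum, text: lines.join('\n'), items }));
}

const sample: DoclingText[] = [
  { text: 'Page two line', prov: [{ page_no: 2 }] },
  { text: 'Form 1099-R', prov: [{ page_no: 1, bbox: { l: 40, t: 760, r: 140, b: 745 } }] },
  { text: 'No provenance' },
];

console.log(toPages(sample));
```

Keeping the grouping separate from file I/O means a small hand-written fixture like `sample` is enough to check page ordering and box arithmetic.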
What gets caught
The pii config uses OpenRedaction — pick a preset or list specific patterns:
// Three alternative shapes; pass any one of them as `pii`.
// Preset-based
const presetPii = { preset: 'hipaa', includeNames: true };
// Pattern-based
const patternPii = { patterns: ['SSN', 'EMAIL', 'PHONE_US', 'TAX_ID_US'] };
// Combined: preset + extra patterns + names/addresses
const combinedPii = {
  preset: 'hipaa',
  patterns: ['TAX_ID_US'],
  includeNames: true,
  includeAddresses: true,
};
If you omit the pii field, the OpenRedaction defaults apply (all patterns enabled).
For domain-specific patterns, pass customPatterns with raw regex:
const redact = createRedactor({
  pii: {
    preset: 'hipaa',
    customPatterns: [
      {
        type: 'ACCOUNT_NUM',
        regex: /ACC-\d{8}/g,
        priority: 10,
        placeholder: '[ACCOUNT_{n}]',
        severity: 'high',
      },
    ],
  },
  publicFieldAllowlist: ['Summary'],
});
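Before shipping a custom pattern, it can help to dry-run the regex against sample text. The helper below is a hypothetical sketch using only standard JavaScript (`String.prototype.matchAll`); `findMatches` is not part of edge-kit or OpenRedaction, and it says nothing about how `placeholder` or `priority` are applied.

```typescript
// Hypothetical helper: list where a candidate pattern matches in sample text.
interface Match { value: string; start: number; end: number }

function findMatches(regex: RegExp, sample: string): Match[] {
  // matchAll requires the global flag, so add it if the caller left it off.
  const flags = regex.flags.includes('g') ? regex.flags : regex.flags + 'g';
  const globalRegex = new RegExp(regex.source, flags);
  return [...sample.matchAll(globalRegex)].map((m) => {
    const start = m.index ?? 0;
    return { value: m[0], start, end: start + m[0].length };
  });
}

const sampleText = 'Primary ACC-12345678, backup ACC-87654321, invalid ACC-1234.';
console.log(findMatches(/ACC-\d{8}/g, sampleText));
```

Note that `/ACC-\d{8}/` also matches the first eight digits of a longer run such as `ACC-123456789`. If that is not wanted, a trailing `\b` (`/ACC-\d{8}\b/g`) restricts the match to exactly eight digits.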
Visual overlays
Docling provides bounding boxes for each text item. The redactor maps PII matches to PDF coordinates, so you can render redaction boxes on a visual preview:
const overlays = result.overlays('viewer');
// [
// { page: 1, x: 43.1, y: 762.1, w: 27.3, h: 15.6, label: 'ein' },
// { page: 1, x: 168.8, y: 735.1, w: 14.1, h: 60.6, label: 'phone' },
// ]
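To draw these boxes over a rendered page image, the PDF-space rectangles need converting to pixel coordinates. The sketch below assumes, based on the conversion in the full example, that `y` is the top edge measured from the bottom of the page in PDF points; if your coordinates already use a top-left origin, skip the flip. `Overlay`, `toCssRect`, `pageHeightPt`, and `scale` are hypothetical names, not part of edge-kit.

```typescript
// Hypothetical shapes for illustration; field names follow the overlay output above.
interface Overlay { page: number; x: number; y: number; w: number; h: number; label: string }
interface CssRect { left: number; top: number; width: number; height: number }

function toCssRect(o: Overlay, pageHeightPt: number, scale: number): CssRect {
  return {
    left: o.x * scale,
    // Flip the vertical axis: PDF y grows upward, CSS top grows downward.
    top: (pageHeightPt - o.y) * scale,
    width: o.w * scale,
    height: o.h * scale,
  };
}

const overlay: Overlay = { page: 1, x: 50, y: 742, w: 100, h: 20, label: 'ssn' };
// US Letter is 792 pt tall; a scale of 2 corresponds to rendering at 144 DPI.
console.log(toCssRect(overlay, 792, 2));
```

The resulting values can be applied to an absolutely positioned element layered over the page image.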
Why Docling
Docling is an open-source PDF parser that runs entirely on your machine. No API keys, no per-page pricing, no network calls during extraction.
PageInput is vendor-agnostic, so you can use whatever parser fits your workflow (LlamaParse, Azure Document Intelligence, Unstructured, etc.). Docling is a good option when you want local-only extraction.
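As an illustration of that vendor-agnostic shape, a parser that yields only plain text per page could be adapted along these lines. `SketchPageInput` mirrors the `pageNum`, `text`, and optional `items` fields used in the full example; it and `pagesFromText` are hypothetical stand-ins, so check the real `PageInput` type before relying on them.

```typescript
// Hypothetical mirror of the PageInput fields used in the full example.
interface SketchPageInput {
  pageNum: number;
  text: string;
  items?: { text: string; bbox: { x: number; y: number; w: number; h: number } }[];
}

// Assumes 1-based page numbers, matching page_no in the Docling example.
function pagesFromText(pageTexts: string[]): SketchPageInput[] {
  return pageTexts.map((text, i) => ({ pageNum: i + 1, text: text.trim() }));
}

const textOnlyPages = pagesFromText(['  Form 1099-R\nSSN: 123-45-6789 ', 'Instructions for Recipient']);
console.log(textOnlyPages);
```

Without `items` there are no bounding boxes to map, so visual overlays would presumably be unavailable for pages built this way.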