Ingest API - OkraPDF

Overview

Use POST /v1/documents/ingest when a document has already been parsed in your own pipeline. You send the vendor output (unstructured, llamaparse, or canonical), and OkraPDF handles normalization, hydration, and lifecycle processing, then exposes the standard document endpoints.

Request

curl -X POST https://api.okrapdf.com/v1/documents/ingest \
  -H "Authorization: Bearer $OKRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vendor": "unstructured",
    "data": [
      {
        "type": "NarrativeText",
        "text": "Invoice total due is $12,480",
        "metadata": { "page_number": 1 }
      }
    ],
    "pdfUrl": "https://example.com/invoice.pdf"
  }'
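The same request can be assembled from Python. The following is a minimal sketch using only the standard library; the helper name `build_ingest_request` is hypothetical (it is not part of any OkraPDF SDK), and the body fields mirror the curl example above. The request is built but not sent, so the payload can be inspected first.

```python
import json
import urllib.request

INGEST_URL = "https://api.okrapdf.com/v1/documents/ingest"


def build_ingest_request(api_key, data, vendor=None, pdf_url=None):
    """Build (but do not send) a POST request for the ingest endpoint.

    Hypothetical helper: the field names come from the curl example,
    everything else is an illustration.
    """
    body = {"data": data}
    if vendor is not None:   # leave out to rely on shape auto-detection
        body["vendor"] = vendor
    if pdf_url is not None:  # optional link to the original PDF
        body["pdfUrl"] = pdf_url
    return urllib.request.Request(
        INGEST_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; keeping construction separate from sending makes the payload easy to log or unit-test.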

Supported connector IDs

`vendor` value	Expected shape
`unstructured`	array of Unstructured elements (`type`, `metadata.page_number`)
`llamaparse`	object with `pages[].items[]` entries
`canonical`	object with canonical `pages[].blocks[]`

If vendor is omitted, OkraPDF tries to auto-detect from payload shape.
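OkraPDF's actual detection logic is not described on this page. As an illustration of what shape-based detection can look like, the sketch below checks a payload against the three shapes in the table above. `detect_vendor` is a hypothetical client-side helper, and its checks are assumptions drawn only from that table.

```python
def detect_vendor(payload):
    """Guess the connector ID from payload shape, or return None if unknown."""
    # Unstructured: a non-empty list of element dicts carrying a `type` field.
    if isinstance(payload, list):
        if payload and all(isinstance(el, dict) and "type" in el for el in payload):
            return "unstructured"
        return None
    if not isinstance(payload, dict):
        return None
    pages = payload.get("pages")
    if not isinstance(pages, list) or not pages:
        return None
    # LlamaParse pages carry `items`; canonical pages carry `blocks`.
    if all(isinstance(p, dict) and "items" in p for p in pages):
        return "llamaparse"
    if all(isinstance(p, dict) and "blocks" in p for p in pages):
        return "canonical"
    return None
```

Running a check like this before sending lets a pipeline set `vendor` explicitly rather than depend on auto-detection.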

Response model

The endpoint returns 202 Accepted and starts lifecycle processing.

{
  "documentId": "doc-...",
  "phase": "ingesting",
  "status": "processing",
  "vendor": "unstructured",
  "pageCount": 12,
  "workflowId": "...",
  "urls": {
    "self": "https://api.okrapdf.com/document/doc-...",
    "status": "https://api.okrapdf.com/document/doc-.../status",
    "pages": "https://api.okrapdf.com/document/doc-.../pages",
    "publish": "https://api.okrapdf.com/document/doc-.../publish"
  }
}
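A client usually needs only a few of these fields: the document ID and the URLs to poll next. The sketch below pulls them into a small typed record; `IngestAccepted` and `parse_ingest_response` are hypothetical names, and the field list assumes the response shape shown above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestAccepted:
    document_id: str
    phase: str
    status: str
    status_url: str
    pages_url: str


def parse_ingest_response(raw):
    """Pick the commonly needed fields out of the 202 response body."""
    body = json.loads(raw)
    urls = body["urls"]
    return IngestAccepted(
        document_id=body["documentId"],
        phase=body["phase"],
        status=body["status"],
        status_url=urls["status"],
        pages_url=urls["pages"],
    )
```

Reading the URLs from the `urls` object instead of rebuilding them from the document ID keeps the client independent of the URL layout.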

What happens after ingest

1. Vendor payload is normalized to Okra’s canonical parse shape.
2. Parsed nodes are hydrated into the document graph.
3. Lifecycle jobs run (snapshot/materialization/projection workflow).
4. Standard document surfaces become available (pages, chat/completion, output profiles, URL builder).
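The normalization step can be pictured as a regrouping of vendor elements into pages of blocks. The sketch below does this for Unstructured elements. It is an illustration only, not OkraPDF's normalizer: `pageNumber` and `blocks` come from the canonical example on this page, while the per-block fields (`type`, `text`) are assumptions.

```python
from collections import defaultdict


def unstructured_to_canonical(elements):
    """Group Unstructured elements by page into a pages[].blocks[] shape."""
    by_page = defaultdict(list)
    for el in elements:
        page = el.get("metadata", {}).get("page_number")
        if page is None:
            # Fail loudly instead of dropping the element.
            raise ValueError("element is missing metadata.page_number")
        # Block fields here are hypothetical; the real canonical schema may differ.
        by_page[page].append({"type": el["type"], "text": el.get("text", "")})
    return {
        "pages": [
            {"pageNumber": number, "blocks": by_page[number]}
            for number in sorted(by_page)
        ]
    }
```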

Failure modes

Unknown payload shape without vendor: 422 with supported connector list.
Invalid payload for chosen connector: 422 normalization error.
Workflow startup failure: 500 with error payload.

No silent drops: payloads are validated before lifecycle continues.
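On the client side, the status codes above map naturally onto distinct exceptions. This is a minimal sketch; the exception names and `check_ingest_status` are hypothetical, and the structure of the error body is not assumed beyond passing it through.

```python
class IngestError(Exception):
    """Base class for ingest failures (hypothetical client-side type)."""


class UnprocessablePayload(IngestError):
    """422: unknown payload shape, or payload invalid for the chosen connector."""


class WorkflowStartupFailure(IngestError):
    """500: the lifecycle workflow could not be started."""


def check_ingest_status(status_code, body):
    """Return the body on 202 Accepted; raise a specific error otherwise."""
    if status_code == 202:
        return body
    if status_code == 422:
        raise UnprocessablePayload(body)
    if status_code == 500:
        raise WorkflowStartupFailure(body)
    raise IngestError(f"unexpected status {status_code}")
```

Separating the two cases matters for retries: a 422 points at the payload, so resending it unchanged is unlikely to help, whereas a 500 may be worth retrying after a delay.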

Replace mode

Pass "mode": "replace" to supersede existing nodes on affected pages before hydrating new ones. Existing nodes on those pages get status = 'superseded': they remain in the graph for audit but are excluded from completions.

curl -X POST https://api.okrapdf.com/document/$DOC_ID/ingest \
  -H "Authorization: Bearer $OKRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vendor": "canonical",
    "mode": "replace",
    "data": { "pages": [{ "pageNumber": 3, "blocks": [...] }] }
  }'

Combine with branching to correct extraction errors without touching the original document.
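The supersede semantics can be modeled in a few lines. The sketch below is an in-memory illustration, not OkraPDF's graph implementation: the node fields (`page`, `status`, `block`) and the `active` status are assumptions, and only `superseded` comes from the text above.

```python
def apply_replace(nodes, new_pages):
    """Supersede nodes on affected pages, then append the new blocks.

    nodes: list of dicts with hypothetical keys `page`, `status`, `block`.
    new_pages: canonical-style list of {"pageNumber": int, "blocks": [...]}.
    """
    affected = {page["pageNumber"] for page in new_pages}
    result = []
    for node in nodes:
        if node["page"] in affected and node["status"] != "superseded":
            node = {**node, "status": "superseded"}  # kept for audit
        result.append(node)
    for page in new_pages:
        for block in page["blocks"]:
            result.append(
                {"page": page["pageNumber"], "status": "active", "block": block}
            )
    return result


def visible_nodes(nodes):
    """Nodes that would still feed completions: everything not superseded."""
    return [node for node in nodes if node["status"] != "superseded"]
```

Pages that do not appear in the replace payload are left untouched, which is what makes a page-level correction cheap.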

Example: LlamaParse → Ingest → Chat

A complete walkthrough: parse a PDF with LlamaParse, ingest the result, and query it.

# 1. Parse with LlamaParse
JOB=$(curl -s -X POST 'https://api.cloud.llamaindex.ai/api/parsing/upload' \
  -H "Authorization: Bearer $LLAMAPARSE_API_KEY" \
  -F 'file=@report.pdf' | jq -r '.id')

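# 2. Wait for LlamaParse to finish (stop early if the job fails)
while :; do
  STATUS=$(curl -s "https://api.cloud.llamaindex.ai/api/parsing/job/$JOB" \
    -H "Authorization: Bearer $LLAMAPARSE_API_KEY" | jq -r '.status')
  [ "$STATUS" = "SUCCESS" ] && break
  [ "$STATUS" = "ERROR" ] && { echo "LlamaParse job failed" >&2; exit 1; }
  sleep 3
done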

# 3. Fetch JSON result
curl -s "https://api.cloud.llamaindex.ai/api/parsing/job/$JOB/result/json" \
  -H "Authorization: Bearer $LLAMAPARSE_API_KEY" > result.json

# 4. Ingest into OkraPDF (body is streamed from jq, so large results avoid argument-length limits)
DOC_ID=$(jq -c '{vendor: "llamaparse", data: .}' result.json \
  | curl -s -X POST https://api.okrapdf.com/v1/documents/ingest \
    -H "Authorization: Bearer $OKRA_API_KEY" \
    -H "Content-Type: application/json" \
    --data-binary @- \
  | jq -r '.documentId')

echo "Document: $DOC_ID"

# 5. Wait for lifecycle
while [ "$(curl -s "https://api.okrapdf.com/document/$DOC_ID/status" \
  -H "Authorization: Bearer $OKRA_API_KEY" | jq -r '.phase')" != "complete" ]; do
  sleep 2
done

# 6. Chat with the document
curl -s -X POST "https://api.okrapdf.com/document/$DOC_ID/chat/completions" \
  -H "Authorization: Bearer $OKRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Summarize this document"}]}' \
  | jq -r '.choices[0].message.content'
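The polling loops in the walkthrough run until the target state appears, with no upper bound. In longer-lived code a timeout is usually worth adding. The sketch below is one way to do that; `wait_for_phase` is a hypothetical helper, and the `fetch_phase` callable (for example, a function that requests the status URL and returns the `phase` field) is supplied by the caller.

```python
import time


def wait_for_phase(fetch_phase, target="complete", interval=2.0, timeout=300.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Call fetch_phase() until it returns `target` or `timeout` seconds pass.

    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        phase = fetch_phase()
        if phase == target:
            return phase
        if clock() >= deadline:
            raise TimeoutError(f"phase still {phase!r} after {timeout} seconds")
        sleep(interval)
```

The same helper covers the LlamaParse wait by passing `target="SUCCESS"` and a `fetch_phase` that reads the job's `status` field.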

What surfaces are available after ingest

Surface	Available	Notes
Chat completions	Yes	Full document context from ingested nodes
Structured output (`/generate`)	Yes	Works on ingested nodes like any document
Status	Yes	Phase, page count, node count
Branch	Yes	Zero-copy fork of ingested document
Page images (`pg_N.png`)	No	Requires original PDF binary (use `pdfUrl` to enable)
Download (`/download`)	No	Requires original PDF binary
Full markdown (`full.md`)	Yes	Materialized from R2 snapshot

Pass pdfUrl in the ingest request to enable page images and downloads:

curl -X POST https://api.okrapdf.com/v1/documents/ingest \
  -H "Authorization: Bearer $OKRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vendor": "llamaparse",
    "data": { ... },
    "pdfUrl": "https://example.com/report.pdf"
  }'
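The availability table reduces to one rule: every surface works on ingested nodes except the two that need the original PDF binary. A small sketch of that rule follows; the surface keys and `available_surfaces` are hypothetical labels for the table rows, not API identifiers.

```python
# Rows of the table above; True means the surface needs the original PDF binary.
SURFACES = {
    "chat_completions": False,
    "structured_output": False,
    "status": False,
    "branch": False,
    "full_markdown": False,
    "page_images": True,
    "download": True,
}


def available_surfaces(has_pdf_url):
    """List the surfaces usable for a document ingested with or without pdfUrl."""
    return sorted(
        name for name, needs_pdf in SURFACES.items()
        if has_pdf_url or not needs_pdf
    )
```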

When to use this endpoint

Use the Ingest API when you:

already run extraction with external vendors,
want OkraPDF delivery + policy + output layers,
need a stable doc-... lifecycle without re-running OCR in Okra.

Related Docs

Branch + Replace

Fork a doc, replace bad OCR, compare completions.

Output Schema

Materialize reproducible structured outputs from ingested documents.

URL Builder

Build immutable URLs for pages, tables, and artifacts.
