Overview

Use POST /v1/documents/ingest when parsing has already happened in your own pipeline. You send the vendor output (unstructured, llamaparse, or canonical), and OkraPDF handles normalization, hydration, lifecycle processing, and the document endpoints.

Request

curl -X POST https://api.okrapdf.com/v1/documents/ingest \
  -H "Authorization: Bearer $OKRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vendor": "unstructured",
    "data": [
      {
        "type": "NarrativeText",
        "text": "Invoice total due is $12,480",
        "metadata": { "page_number": 1 }
      }
    ],
    "pdfUrl": "https://example.com/invoice.pdf"
  }'
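
The same request can be assembled from Python. This is a minimal sketch using only the standard library; `build_ingest_request` is a hypothetical helper name, not part of an official SDK, and the request is built but not sent, so it can be inspected without network access.

```python
import json
import urllib.request

INGEST_URL = "https://api.okrapdf.com/v1/documents/ingest"


def build_ingest_request(api_key, data, vendor=None, pdf_url=None):
    """Build (but do not send) a POST request for the ingest endpoint.

    Hypothetical helper for illustration; not an official client.
    """
    body = {"data": data}
    if vendor is not None:
        body["vendor"] = vendor    # omit to let OkraPDF auto-detect the shape
    if pdf_url is not None:
        body["pdfUrl"] = pdf_url   # optional link to the original PDF
    return urllib.request.Request(
        INGEST_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


elements = [
    {
        "type": "NarrativeText",
        "text": "Invoice total due is $12,480",
        "metadata": {"page_number": 1},
    }
]
req = build_ingest_request(
    "YOUR_API_KEY",
    elements,
    vendor="unstructured",
    pdf_url="https://example.com/invoice.pdf",
)
# To send it: urllib.request.urlopen(req), then parse the JSON response body.
```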

Supported connector IDs

| vendor value | Expected shape |
| --- | --- |
| unstructured | array of Unstructured elements (type, metadata.page_number) |
| llamaparse | object with pages[].items[] entries |
| canonical | object with canonical pages[].blocks[] |

If vendor is omitted, OkraPDF tries to auto-detect from payload shape.
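
A client can run a similar shape check before sending, to catch an obviously wrong payload early. The sketch below mirrors the shapes listed above; `guess_vendor` is a hypothetical helper and is not OkraPDF's actual detection logic, which is not documented here.

```python
def guess_vendor(data):
    """Guess a connector ID from the payload shape, or return None.

    Client-side illustration only; the server's detection rules may differ.
    """
    if isinstance(data, list):
        # Unstructured: a non-empty list of elements, each with a "type" field.
        if data and all(isinstance(e, dict) and "type" in e for e in data):
            return "unstructured"
        return None
    if isinstance(data, dict) and isinstance(data.get("pages"), list):
        pages = [p for p in data["pages"] if isinstance(p, dict)]
        if any("blocks" in p for p in pages):
            return "canonical"      # pages[].blocks[]
        if any("items" in p for p in pages):
            return "llamaparse"     # pages[].items[]
    return None
```

Setting vendor explicitly remains the safer choice when the source of the payload is known.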

Response model

The endpoint returns 202 Accepted and starts lifecycle processing.
{
  "documentId": "doc-...",
  "phase": "ingesting",
  "status": "processing",
  "vendor": "unstructured",
  "pageCount": 12,
  "workflowId": "...",
  "urls": {
    "self": "https://api.okrapdf.com/document/doc-...",
    "status": "https://api.okrapdf.com/document/doc-.../status",
    "pages": "https://api.okrapdf.com/document/doc-.../pages",
    "publish": "https://api.okrapdf.com/document/doc-.../publish"
  }
}
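
On the client side, the response body can be read into a small typed record so the follow-up URLs are easy to reach. This is a sketch; `IngestAccepted` and `parse_ingest_response` are hypothetical names, and only fields shown in the sample response are used.

```python
import json
from dataclasses import dataclass


@dataclass
class IngestAccepted:
    document_id: str
    phase: str
    status: str
    vendor: str
    page_count: int
    urls: dict


def parse_ingest_response(raw):
    """Parse the 202 response body into a typed record (hypothetical helper)."""
    body = json.loads(raw)
    return IngestAccepted(
        document_id=body["documentId"],
        phase=body["phase"],
        status=body["status"],
        vendor=body["vendor"],
        page_count=body["pageCount"],
        urls=body.get("urls", {}),
    )


# Sample body copied from the response model above (IDs are elided there too).
sample = """{
  "documentId": "doc-...",
  "phase": "ingesting",
  "status": "processing",
  "vendor": "unstructured",
  "pageCount": 12,
  "workflowId": "...",
  "urls": {
    "self": "https://api.okrapdf.com/document/doc-...",
    "status": "https://api.okrapdf.com/document/doc-.../status"
  }
}"""

accepted = parse_ingest_response(sample)
status_url = accepted.urls["status"]  # poll this URL until the lifecycle finishes
```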

What happens after ingest

  1. Vendor payload is normalized to Okra’s canonical parse shape.
  2. Parsed nodes are hydrated into the document graph.
  3. Lifecycle jobs run (snapshot/materialization/projection workflow).
  4. Standard document surfaces become available (pages, chat/completion, output profiles, URL builder).
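
To make step 1 more concrete, the sketch below groups Unstructured elements by page into a pages[].blocks[] structure. It is an illustration only: the block fields ("type", "text") and the fallback to page 1 are assumptions, and OkraPDF's real canonical schema and normalizer are not specified on this page.

```python
from collections import defaultdict


def unstructured_to_pages(elements):
    """Group Unstructured elements into a pages[].blocks[] structure.

    Illustrative only: block fields are assumed, not a documented schema.
    """
    by_page = defaultdict(list)
    for el in elements:
        # Assumption: elements without a page number are placed on page 1.
        page = el.get("metadata", {}).get("page_number", 1)
        by_page[page].append({"type": el["type"], "text": el.get("text", "")})
    return {
        "pages": [
            {"pageNumber": number, "blocks": blocks}
            for number, blocks in sorted(by_page.items())
        ]
    }


doc = unstructured_to_pages([
    {"type": "Title", "text": "Terms", "metadata": {"page_number": 2}},
    {"type": "NarrativeText", "text": "Invoice total due is $12,480",
     "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Net 30", "metadata": {"page_number": 2}},
])
```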

Failure modes

  • Unknown payload shape without vendor: 422 with supported connector list.
  • Invalid payload for chosen connector: 422 normalization error.
  • Workflow startup failure: 500 with error payload.
No silent drops: payloads are validated before lifecycle continues.
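
A caller can branch on these status codes as follows. This is a sketch; the returned labels are hypothetical, and treating a 500 as retryable is an assumption rather than documented behavior.

```python
def classify_ingest_result(status_code):
    """Map an HTTP status from the ingest endpoint to a suggested next step."""
    if status_code == 202:
        return "accepted"      # lifecycle started; poll the status URL
    if status_code == 422:
        return "fix_payload"   # unknown shape or normalization error; resending unchanged will fail again
    if status_code == 500:
        return "retry_later"   # workflow startup failure; retry is assumed to be safe
    return "unexpected"        # anything else is outside the documented cases
```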

Replace mode

Pass "mode": "replace" to supersede existing nodes on affected pages before hydrating new ones. Existing nodes get status = 'superseded' — they remain in the graph for audit but are excluded from completions.
curl -X POST https://api.okrapdf.com/document/$DOC_ID/ingest \
  -H "Authorization: Bearer $OKRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vendor": "canonical",
    "mode": "replace",
    "data": { "pages": [{ "pageNumber": 3, "blocks": [...] }] }
  }'
Combine with branching to correct extraction errors without touching the original document.
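
The supersede behavior can be pictured with a small in-memory model. This is a sketch of the semantics described above, not OkraPDF's implementation: the node layout ("page", "text", "status") and the "active" status name are assumptions; only "superseded" comes from this page.

```python
def apply_replace(existing_nodes, new_pages):
    """Simulate replace mode: supersede nodes on affected pages, then add new ones."""
    affected = {page["pageNumber"] for page in new_pages}
    result = []
    for node in existing_nodes:
        if node["page"] in affected and node["status"] == "active":
            node = {**node, "status": "superseded"}  # kept in the graph for audit
        result.append(node)
    for page in new_pages:
        for block in page["blocks"]:
            result.append(
                {"page": page["pageNumber"], "text": block["text"], "status": "active"}
            )
    return result


def visible_to_completions(nodes):
    """Superseded nodes stay stored but are excluded from completions."""
    return [n for n in nodes if n["status"] != "superseded"]


existing = [
    {"page": 2, "text": "Net 30", "status": "active"},
    {"page": 3, "text": "Totl due", "status": "active"},
]
merged = apply_replace(
    existing, [{"pageNumber": 3, "blocks": [{"text": "Total due"}]}]
)
visible = [n["text"] for n in visible_to_completions(merged)]
```

Only page 3 is affected here, so the page 2 node is untouched and the old page 3 node remains stored with its superseded status.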

Example: LlamaParse → Ingest → Chat

A complete walkthrough: parse a PDF with LlamaParse, ingest the result, and query it.
# 1. Parse with LlamaParse
JOB=$(curl -s -X POST 'https://api.cloud.llamaindex.ai/api/parsing/upload' \
  -H "Authorization: Bearer $LLAMAPARSE_API_KEY" \
  -F 'file=@report.pdf' | jq -r '.id')

# 2. Wait for LlamaParse to finish
while [ "$(curl -s "https://api.cloud.llamaindex.ai/api/parsing/job/$JOB" \
  -H "Authorization: Bearer $LLAMAPARSE_API_KEY" | jq -r '.status')" != "SUCCESS" ]; do
  sleep 3
done

# 3. Fetch JSON result
curl -s "https://api.cloud.llamaindex.ai/api/parsing/job/$JOB/result/json" \
  -H "Authorization: Bearer $LLAMAPARSE_API_KEY" > result.json

# 4. Ingest into OkraPDF
# Stream the payload over stdin so large results do not hit the shell's argument-size limit
DOC_ID=$(jq '{vendor: "llamaparse", data: .}' result.json \
  | curl -s -X POST https://api.okrapdf.com/v1/documents/ingest \
    -H "Authorization: Bearer $OKRA_API_KEY" \
    -H "Content-Type: application/json" \
    --data-binary @- \
  | jq -r '.documentId')

echo "Document: $DOC_ID"

# 5. Wait for lifecycle
while [ "$(curl -s "https://api.okrapdf.com/document/$DOC_ID/status" \
  -H "Authorization: Bearer $OKRA_API_KEY" | jq -r '.phase')" != "complete" ]; do
  sleep 2
done

# 6. Chat with the document
curl -s -X POST "https://api.okrapdf.com/document/$DOC_ID/chat/completions" \
  -H "Authorization: Bearer $OKRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Summarize this document"}]}' \
  | jq -r '.choices[0].message.content'
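
The polling loops in the walkthrough run until the target value appears, so a job that never finishes would block forever. A bounded version might look like the sketch below; `wait_for_phase` is a hypothetical helper, and because this page does not document a failure phase, a timeout is the only guard. The fetch function is injected, so the example runs without network access.

```python
import time


def wait_for_phase(fetch_phase, target="complete", timeout=120.0, interval=2.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_phase() until it returns target; raise TimeoutError otherwise."""
    deadline = clock() + timeout
    while True:
        phase = fetch_phase()
        if phase == target:
            return phase
        if clock() >= deadline:
            raise TimeoutError(f"still in phase {phase!r} after {timeout}s")
        sleep(interval)


# Stand-in for a real call to the document's status URL.
fake_phases = iter(["ingesting", "ingesting", "complete"])
result = wait_for_phase(lambda: next(fake_phases), sleep=lambda seconds: None)
```

In real use, `fetch_phase` would request the status URL and return the `phase` field from the JSON body.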

What surfaces are available after ingest

| Surface | Available | Notes |
| --- | --- | --- |
| Chat completions | Yes | Full document context from ingested nodes |
| Structured output (/generate) | Yes | Works on ingested nodes like any document |
| Status | Yes | Phase, page count, node count |
| Branch | Yes | Zero-copy fork of ingested document |
| Page images (pg_N.png) | No | Requires original PDF binary (use pdfUrl to enable) |
| Download (/download) | No | Requires original PDF binary |
| Full markdown (full.md) | Yes | Materialized from R2 snapshot |

Pass pdfUrl in the ingest request to enable page images and downloads:
curl -X POST https://api.okrapdf.com/v1/documents/ingest \
  -H "Authorization: Bearer $OKRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vendor": "llamaparse",
    "data": { ... },
    "pdfUrl": "https://example.com/report.pdf"
  }'
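
A pipeline can decide up front whether it must supply pdfUrl, based on the surfaces it intends to use. This is a small sketch; the surface labels are hypothetical identifiers chosen for this example, not API values.

```python
# Surfaces that, per the table above, need the original PDF binary.
PDF_ONLY_SURFACES = {"page_images", "download"}


def needs_pdf_url(required_surfaces):
    """Return True if any requested surface depends on the original PDF."""
    return bool(PDF_ONLY_SURFACES & set(required_surfaces))
```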

When to use this endpoint

Use the Ingest API when you:
  • already run extraction with external vendors,
  • want OkraPDF delivery + policy + output layers,
  • need a stable doc-... lifecycle without re-running OCR in Okra.

Branch + Replace

Fork a doc, replace bad OCR, compare completions.

Output Schema

Materialize reproducible structured outputs from ingested documents.

URL Builder

Build immutable URLs for pages, tables, and artifacts.