What is an Output Schema?
An output schema is a self-contained, reproducible extraction recipe attached to a document. It bundles three things:
| Component | Purpose |
|---|---|
| Schema | The shape of the output (JSON Schema) |
| Prompt | Extraction instructions sent to the LLM |
| Model | Which LLM runs the extraction |
Once an output schema is defined, the SDK extracts data from the document, validates it against the schema, and materializes the result, storing it permanently alongside a full audit trail.
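To make the three components and the validation step concrete, here is a minimal sketch. The `OutputProfile` and `JsonSchema` types and the `validate` helper are hypothetical names chosen for illustration, not the SDK's API, and the checker covers only the small JSON Schema subset used on this page (`type`, `properties`, `items`, `enum`).

```typescript
// Hypothetical types for illustration; the SDK's real names may differ.
type JsonSchema = {
  type: 'object' | 'array' | 'string' | 'number' | 'boolean';
  properties?: Record<string, JsonSchema>;
  items?: JsonSchema;
  enum?: string[];
  format?: string; // carried along but not checked in this sketch
};

interface OutputProfile {
  schema: JsonSchema; // the shape of the output
  prompt: string;     // extraction instructions sent to the LLM
  model: string;      // which LLM runs the extraction
}

// Checks a value against the schema subset above and returns error messages.
function validate(value: unknown, schema: JsonSchema, path: string = '$'): string[] {
  const errors: string[] = [];
  if (schema.type === 'array') {
    if (!Array.isArray(value)) return [`${path}: expected array`];
    const items = schema.items;
    if (items) {
      for (let i = 0; i < value.length; i++) {
        errors.push(...validate(value[i], items, `${path}[${i}]`));
      }
    }
    return errors;
  }
  if (schema.type === 'object') {
    if (typeof value !== 'object' || value === null || Array.isArray(value)) {
      return [`${path}: expected object`];
    }
    const record = value as Record<string, unknown>;
    const props = schema.properties ?? {};
    for (const key of Object.keys(props)) {
      // Missing keys are allowed: the schemas on this page list no `required` fields.
      if (key in record) errors.push(...validate(record[key], props[key], `${path}.${key}`));
    }
    return errors;
  }
  if (typeof value !== schema.type) return [`${path}: expected ${schema.type}`];
  if (schema.enum && schema.enum.indexOf(value as string) === -1) {
    return [`${path}: not one of ${schema.enum.join(', ')}`];
  }
  return errors;
}

const profile: OutputProfile = {
  schema: {
    type: 'object',
    properties: { vendor: { type: 'string' }, total: { type: 'number' } },
  },
  prompt: 'Extract the vendor and total.',
  model: 'claude-sonnet-4-5-20250929',
};

// A string where a number is expected is reported with its path.
const errors = validate({ vendor: 'Acme', total: '12.50' }, profile.schema);
// errors → ['$.total: expected number']
```

A production setup would more likely rely on a full JSON Schema validator; the point here is only that the schema is what decides whether a model response is accepted.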
Why Output Schemas?
Reproducibility. Every output records exactly what produced it: which model, what prompt, and the raw LLM response before parsing. You can always trace back from a result to its source.
Zero-cost reads. Materialized outputs are written to R2 on creation. Public reads serve directly from R2 — the Durable Object never wakes. No compute cost on read.
Composability. A single document can have many output schemas: invoice, receipt, contract_terms, compliance_flags. Each is an independent extraction with its own recipe.
How It Works
     SDK extracts data
             │
             ▼
┌─────────────────────────┐
│     Durable Object      │
│  ┌───────────────────┐  │
│  │  output_profiles  │  │  ← recipe (schema + prompt + model)
│  │ materialized_data │  │  ← result + audit trail
│  └───────────────────┘  │
│            │            │
│       write to R2       │
└─────────────────────────┘
             │
             ▼
┌─────────────────────────┐
│     R2 (data only)      │  ← public reads, no DO wake
│  /o_invoice/data.json   │
└─────────────────────────┘
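The flow in the diagram can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: `DocumentStore`, `materialize`, and the `r2` map are hypothetical names, the real storage lives in a Durable Object and an R2 bucket, and the schema validation step is omitted for brevity.

```typescript
// Hypothetical stand-ins for the stores shown in the diagram.
interface Recipe {
  schema: object;
  prompt: string;
  model: string;
}

interface Materialized {
  data: unknown;
  model: string;
  prompt: string;
  raw_response: string;
  created_at: string;
}

class DocumentStore {
  outputProfiles = new Map<string, Recipe>();         // recipe
  materializedData = new Map<string, Materialized>(); // result + audit trail
}

// Stand-in for the R2 bucket: path → JSON body.
const r2 = new Map<string, string>();

function materialize(doc: DocumentStore, name: string, recipe: Recipe, rawResponse: string): void {
  // Throws if the model output is not valid JSON.
  const data: unknown = JSON.parse(rawResponse);
  doc.outputProfiles.set(name, recipe);
  doc.materializedData.set(name, {
    data,
    model: recipe.model,
    prompt: recipe.prompt,
    raw_response: rawResponse,
    created_at: new Date().toISOString(),
  });
  // Only the data is copied to the public store; audit fields stay behind.
  r2.set(`/o_${name}/data.json`, JSON.stringify(data));
}

const doc = new DocumentStore();
materialize(
  doc,
  'invoice',
  { schema: { type: 'object' }, prompt: 'Extract all invoice fields.', model: 'claude-sonnet-4-5-20250929' },
  '{"vendor": "Acme"}',
);
// r2.get('/o_invoice/data.json') → '{"vendor":"Acme"}'
```

Keeping the audit fields out of the public copy is what lets the read path stay unauthenticated without exposing prompts or raw model output.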
Use Cases
Invoice Processing
Extract vendor, total, line items, and dates from uploaded invoices. Attach the output schema once, and every invoice in the collection gets the same extraction recipe applied.
const profile = {
  schema: {
    type: 'object',
    properties: {
      vendor: { type: 'string' },
      invoice_number: { type: 'string' },
      date: { type: 'string', format: 'date' },
      total: { type: 'number' },
      currency: { type: 'string' },
      line_items: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            description: { type: 'string' },
            quantity: { type: 'number' },
            unit_price: { type: 'number' },
            amount: { type: 'number' },
          },
        },
      },
    },
  },
  prompt: 'Extract all invoice fields. For line items, include description, quantity, unit price, and line total.',
  model: 'claude-sonnet-4-5-20250929',
};
After materialization, any system can read the structured invoice data at:
GET /v1/documents/{id}/o_invoice/data.json
No API key needed. No server wake. Just JSON.
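A consumer of that URL might look like the sketch below. The `Invoice` type mirrors the schema above, while `readInvoice`, the `Fetcher` type, and the `baseUrl` parameter are hypothetical: the page does not state the API host, so it is passed in. The fetcher is injected so the function can be exercised without a network.

```typescript
interface LineItem {
  description?: string;
  quantity?: number;
  unit_price?: number;
  amount?: number;
}

interface Invoice {
  vendor?: string;
  invoice_number?: string;
  date?: string;
  total?: number;
  currency?: string;
  line_items?: LineItem[];
}

// Minimal fetch-like signature (an assumption for this sketch).
type Fetcher = (url: string) => Promise<{
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}>;

async function readInvoice(baseUrl: string, documentId: string, fetcher: Fetcher): Promise<Invoice> {
  const url = `${baseUrl}/v1/documents/${encodeURIComponent(documentId)}/o_invoice/data.json`;
  // No Authorization header is sent: the path is described as public.
  const res = await fetcher(url);
  if (!res.ok) throw new Error(`Read failed with status ${res.status}`);
  // The cast trusts that the stored data was validated at materialization time.
  return (await res.json()) as Invoice;
}

// In a runtime with a global fetch, usage could be:
//   const invoice = await readInvoice(baseUrl, documentId, (url) => fetch(url));
```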
Compliance Screening
Flag regulatory risks in financial filings. The schema defines the flags, the prompt says what to look for, and the model does the analysis.
const profile = {
  schema: {
    type: 'object',
    properties: {
      risk_level: { type: 'string', enum: ['low', 'medium', 'high', 'critical'] },
      flags: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            category: { type: 'string' },
            description: { type: 'string' },
            page: { type: 'number' },
            severity: { type: 'string' },
          },
        },
      },
      summary: { type: 'string' },
    },
  },
  prompt: 'Analyze this filing for regulatory compliance risks. Flag material weaknesses, related party transactions, going concern language, and restatement disclosures.',
  model: 'claude-sonnet-4-5-20250929',
};
Contract Term Extraction
Pull key terms from legal documents for deal review dashboards.
const profile = {
  schema: {
    type: 'object',
    properties: {
      parties: { type: 'array', items: { type: 'string' } },
      effective_date: { type: 'string', format: 'date' },
      termination_date: { type: 'string', format: 'date' },
      governing_law: { type: 'string' },
      payment_terms: { type: 'string' },
      auto_renewal: { type: 'boolean' },
      non_compete_months: { type: 'number' },
      liability_cap: { type: 'string' },
    },
  },
  prompt: 'Extract key contract terms including parties, dates, governing law, payment terms, renewal clauses, non-compete duration, and liability caps.',
  model: 'claude-sonnet-4-5-20250929',
};
Financial Filing Metrics
Pull key metrics from 10-K filings for benchmarking and analysis dashboards.
const profile = {
  schema: {
    type: 'object',
    properties: {
      company: { type: 'string' },
      fiscal_year_ended: { type: 'string' },
      income_statement: {
        type: 'object',
        properties: {
          revenue: { type: 'string' },
          cost_of_revenue: { type: 'string' },
          gross_profit: { type: 'string' },
          operating_income: { type: 'string' },
          net_income: { type: 'string' },
          eps_basic: { type: 'string' },
          eps_diluted: { type: 'string' },
        },
      },
      balance_sheet: {
        type: 'object',
        properties: {
          total_assets: { type: 'string' },
          total_liabilities: { type: 'string' },
          total_stockholders_equity: { type: 'string' },
          cash_and_equivalents: { type: 'string' },
        },
      },
      cash_flow: {
        type: 'object',
        properties: {
          operating_cash_flow: { type: 'string' },
          capital_expenditures: { type: 'string' },
          free_cash_flow: { type: 'string' },
        },
      },
    },
  },
  prompt: 'Extract key financial details from this 10-K filing including income statement, balance sheet, and cash flow metrics.',
  model: 'kimi-k2p5',
};
Resume Parsing
Structure candidate data from uploaded resumes for ATS integrations.
const profile = {
  schema: {
    type: 'object',
    properties: {
      name: { type: 'string' },
      email: { type: 'string' },
      phone: { type: 'string' },
      skills: { type: 'array', items: { type: 'string' } },
      experience: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            company: { type: 'string' },
            title: { type: 'string' },
            start_date: { type: 'string' },
            end_date: { type: 'string' },
          },
        },
      },
      education: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            institution: { type: 'string' },
            degree: { type: 'string' },
            year: { type: 'number' },
          },
        },
      },
    },
  },
  prompt: 'Extract structured candidate information from this resume.',
  model: 'claude-sonnet-4-5-20250929',
};
Audit Trail
Every materialized output stores a full audit record alongside the data:
| Field | Description |
|---|---|
| model | The model that ran the extraction |
| prompt | The exact prompt that was sent |
| raw_response | The raw LLM output before JSON parsing |
| created_at | Timestamp of materialization |
Access the audit trail at:
GET /document/{id}/output/{name}/audit
This is authenticated and never exposed publicly — the public R2 path only serves the validated data.
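The audit fields listed above can be modeled as a small type with a runtime guard. This is a sketch under assumptions: `AuditRecord`, `isAuditRecord`, and `reproduces` are hypothetical names, the page does not specify how `created_at` is encoded (a string is assumed), and the re-parse check only holds if the raw response is plain JSON with no surrounding text.

```typescript
interface AuditRecord {
  model: string;        // the model that ran the extraction
  prompt: string;       // the exact prompt that was sent
  raw_response: string; // the raw LLM output before JSON parsing
  created_at: string;   // timestamp of materialization (string encoding assumed)
}

// Runtime guard: true only when all four audit fields are present as strings.
function isAuditRecord(value: unknown): value is AuditRecord {
  if (typeof value !== 'object' || value === null) return false;
  const record = value as Record<string, unknown>;
  return ['model', 'prompt', 'raw_response', 'created_at'].every(
    (key) => typeof record[key] === 'string',
  );
}

// Re-parses the raw response and compares it with the published data.
// Key order matters because the comparison is on serialized JSON.
function reproduces(audit: AuditRecord, data: unknown): boolean {
  try {
    return JSON.stringify(JSON.parse(audit.raw_response)) === JSON.stringify(data);
  } catch {
    return false; // the raw response was not plain JSON
  }
}

const audit: unknown = {
  model: 'claude-sonnet-4-5-20250929',
  prompt: 'Extract all invoice fields.',
  raw_response: '{"vendor": "Acme", "total": 10}',
  created_at: '2025-01-01T00:00:00Z',
};

const traceable = isAuditRecord(audit) && reproduces(audit, { vendor: 'Acme', total: 10 });
// traceable → true
```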
Public URL Pattern
Materialized outputs are available at a predictable, cacheable URL:
GET /v1/documents/{id}/o_{name}/data.json
The o_ prefix tells the worker to read from R2 directly. The Durable Object never wakes.
Combine with the t_ transform prefix for provider-specific extractions:
GET /v1/documents/{id}/t_llamaparse/o_invoice/data.json
Response headers include Cache-Control: public, max-age=3600 and Access-Control-Allow-Origin: * for easy embedding.
The t_ and o_ URL segments are inspired by Cloudinary’s URL-as-API pattern — encode transforms in the path so results are cacheable, embeddable, and readable without an SDK.
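Both URL shapes above can be produced by one small helper. `outputPath` is a hypothetical function written for illustration, not part of the SDK; it returns only the path, since the API host is not stated on this page.

```typescript
// Builds the public path for a materialized output.
// `transform` is optional and adds the t_ segment before the o_ segment.
function outputPath(documentId: string, outputName: string, transform?: string): string {
  const segments = ['v1', 'documents', encodeURIComponent(documentId)];
  if (transform) segments.push(`t_${encodeURIComponent(transform)}`);
  segments.push(`o_${encodeURIComponent(outputName)}`, 'data.json');
  return '/' + segments.join('/');
}

const plain = outputPath('doc_123', 'invoice');
// plain → '/v1/documents/doc_123/o_invoice/data.json'

const viaProvider = outputPath('doc_123', 'invoice', 'llamaparse');
// viaProvider → '/v1/documents/doc_123/t_llamaparse/o_invoice/data.json'
```

Because the whole recipe selection is in the path, the same string works as a cache key, an embed URL, and a link in a dashboard.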