Output Schema - OkraPDF

What is an Output Schema?

An output schema is a self-contained, reproducible extraction recipe attached to a document. It bundles three things:

| Component | Purpose |
| --- | --- |
| Schema | The shape of the output (JSON Schema) |
| Prompt | Extraction instructions sent to the LLM |
| Model | Which LLM runs the extraction |

Once defined, the SDK extracts data from the document, validates it against the schema, and materializes the result — storing it permanently alongside a full audit trail.
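As a rough illustration of the three-part recipe, the sketch below models it as a TypeScript type and checks that all three parts are present. The `OutputProfile` interface and the `checkProfile` helper are hypothetical names chosen for this example, not part of the SDK.

```typescript
// Hypothetical shape of an output schema recipe; the field names mirror
// the schema / prompt / model components in the table above.
interface OutputProfile {
  schema: Record<string, unknown>; // a JSON Schema document
  prompt: string;                  // extraction instructions for the LLM
  model: string;                   // identifier of the model to run
}

// Returns a list of problems; an empty list means the recipe looks complete.
// This only checks that the three parts exist, not that the schema is valid.
function checkProfile(profile: Partial<OutputProfile>): string[] {
  const problems: string[] = [];
  if (typeof profile.schema !== 'object' || profile.schema === null) {
    problems.push('schema must be a JSON Schema object');
  }
  if (typeof profile.prompt !== 'string' || profile.prompt.trim() === '') {
    problems.push('prompt must be a non-empty string');
  }
  if (typeof profile.model !== 'string' || profile.model.trim() === '') {
    problems.push('model must be a non-empty string');
  }
  return problems;
}
```

A check like this could run before a recipe is attached to a document, so that incomplete recipes are rejected early.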

Why Output Schemas?

Reproducibility. Every output records exactly what produced it: which model, what prompt, the raw LLM response before parsing. You can always trace back from a result to its source.

Zero-cost reads. Materialized outputs are written to R2 on creation. Public reads serve directly from R2; the Durable Object never wakes. No compute cost on read.

Composability. A single document can have many output schemas: invoice, receipt, contract_terms, compliance_flags. Each is an independent extraction with its own recipe.

How It Works

SDK extracts data
      │
      ▼
  ┌─────────────────────────┐
  │   Durable Object        │
  │   ┌───────────────────┐ │
  │   │ output_profiles   │ │  ← recipe (schema + prompt + model)
  │   │ materialized_data │ │  ← result + audit trail
  │   └───────────────────┘ │
  │           │             │
  │     write to R2         │
  └─────────────────────────┘
              │
              ▼
  ┌─────────────────────────┐
  │   R2 (data only)        │  ← public reads, no DO wake
  │   /o_invoice/data.json  │
  └─────────────────────────┘
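
The flow in the diagram can be sketched with in-memory stand-ins. Everything here is an assumption made for illustration: the `Map` objects stand in for the Durable Object tables and the R2 bucket, `materialize` is a hypothetical function name, and the validation step only checks `required` keys rather than the full JSON Schema.

```typescript
// Hypothetical types; the audit field names follow the Audit Trail section.
interface OutputProfile {
  schema: { required?: string[]; [key: string]: unknown };
  prompt: string;
  model: string;
}
interface AuditRecord {
  model: string;
  prompt: string;
  raw_response: string;
  created_at: string;
}

// In-memory stand-ins for the two Durable Object tables and the R2 bucket.
const outputProfiles = new Map<string, OutputProfile>();
const materializedData = new Map<string, { data: unknown; audit: AuditRecord }>();
const r2 = new Map<string, string>();

function materialize(name: string, profile: OutputProfile, rawResponse: string): string {
  const parsed: unknown = JSON.parse(rawResponse); // throws on invalid JSON
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('expected a JSON object');
  }
  const data = parsed as Record<string, unknown>;

  // Stand-in validation: a real validator would check the whole schema.
  for (const key of profile.schema.required ?? []) {
    if (!(key in data)) throw new Error(`missing required field: ${key}`);
  }

  // Recipe and result (with audit trail) stay inside the Durable Object.
  outputProfiles.set(name, profile);
  materializedData.set(name, {
    data,
    audit: {
      model: profile.model,
      prompt: profile.prompt,
      raw_response: rawResponse,
      created_at: new Date().toISOString(),
    },
  });

  // Only the validated data is copied to R2 for public reads.
  const key = `o_${name}/data.json`;
  r2.set(key, JSON.stringify(data));
  return key;
}
```

Note that nothing is stored if parsing or validation fails, which matches the idea that the public path only ever serves validated data.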

Use Cases

Invoice Processing

Extract vendor, total, line items, and dates from uploaded invoices. Attach the output schema once, then every invoice in the collection gets the same extraction recipe applied.

const profile = {
  schema: {
    type: 'object',
    properties: {
      vendor: { type: 'string' },
      invoice_number: { type: 'string' },
      date: { type: 'string', format: 'date' },
      total: { type: 'number' },
      currency: { type: 'string' },
      line_items: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            description: { type: 'string' },
            quantity: { type: 'number' },
            unit_price: { type: 'number' },
            amount: { type: 'number' },
          },
        },
      },
    },
  },
  prompt: 'Extract all invoice fields. For line items, include description, quantity, unit price, and line total.',
  model: 'claude-sonnet-4-5-20250929',
};

After materialization, any system can read the structured invoice data at:

GET /v1/documents/{id}/o_invoice/data.json

No API key needed. No server wake. Just JSON.
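
Once a consumer has the invoice JSON, it may want a basic sanity check before trusting the numbers. The sketch below is one possible check, assuming the field names from the schema above; the `Invoice` and `LineItem` types and the `lineItemsMatchTotal` helper are hypothetical. All fields are optional because the schema declares no `required` list.

```typescript
// Types mirroring the invoice schema above; every field may be absent.
interface LineItem {
  description?: string;
  quantity?: number;
  unit_price?: number;
  amount?: number;
}
interface Invoice {
  vendor?: string;
  invoice_number?: string;
  date?: string;
  total?: number;
  currency?: string;
  line_items?: LineItem[];
}

// True when the line item amounts add up to the invoice total, within a
// small tolerance for floating-point rounding. Ignores tax and shipping,
// so a false result is a prompt for review rather than proof of an error.
function lineItemsMatchTotal(invoice: Invoice, tolerance = 0.01): boolean {
  if (invoice.total === undefined || !invoice.line_items) return false;
  const sum = invoice.line_items.reduce((acc, item) => acc + (item.amount ?? 0), 0);
  return Math.abs(sum - invoice.total) <= tolerance;
}
```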

Compliance Screening

Flag regulatory risks in financial filings. The schema defines the flags, the prompt says what to look for, and the model does the analysis.

const profile = {
  schema: {
    type: 'object',
    properties: {
      risk_level: { type: 'string', enum: ['low', 'medium', 'high', 'critical'] },
      flags: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            category: { type: 'string' },
            description: { type: 'string' },
            page: { type: 'number' },
            severity: { type: 'string' },
          },
        },
      },
      summary: { type: 'string' },
    },
  },
  prompt: 'Analyze this filing for regulatory compliance risks. Flag material weaknesses, related party transactions, going concern language, and restatement disclosures.',
  model: 'claude-sonnet-4-5-20250929',
};
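
A downstream consumer of this output might route filings by risk level and summarize the flags. The following is a minimal sketch under the assumption that the result matches the schema above; the type names and both helper functions are hypothetical.

```typescript
// Types mirroring the compliance schema above; fields may be absent
// because the schema declares no `required` list.
type RiskLevel = 'low' | 'medium' | 'high' | 'critical';
interface ComplianceFlag {
  category?: string;
  description?: string;
  page?: number;
  severity?: string;
}
interface ComplianceResult {
  risk_level?: RiskLevel;
  flags?: ComplianceFlag[];
  summary?: string;
}

// Treats 'high' and 'critical' filings as needing a human reviewer.
// The threshold is an illustrative choice, not a documented rule.
function needsReview(result: ComplianceResult): boolean {
  return result.risk_level === 'high' || result.risk_level === 'critical';
}

// Counts flags per category; flags without a category are grouped together.
function countFlagsByCategory(result: ComplianceResult): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const flag of result.flags ?? []) {
    const category = flag.category ?? 'uncategorized';
    counts[category] = (counts[category] ?? 0) + 1;
  }
  return counts;
}
```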

Contract Term Extraction

Pull key terms from legal documents for deal review dashboards.

const profile = {
  schema: {
    type: 'object',
    properties: {
      parties: { type: 'array', items: { type: 'string' } },
      effective_date: { type: 'string', format: 'date' },
      termination_date: { type: 'string', format: 'date' },
      governing_law: { type: 'string' },
      payment_terms: { type: 'string' },
      auto_renewal: { type: 'boolean' },
      non_compete_months: { type: 'number' },
      liability_cap: { type: 'string' },
    },
  },
  prompt: 'Extract key contract terms including parties, dates, governing law, payment terms, renewal clauses, non-compete duration, and liability caps.',
  model: 'claude-sonnet-4-5-20250929',
};

Financial Filing Extraction

Pull key metrics from 10-K filings for benchmarking and analysis dashboards.

const profile = {
  schema: {
    type: 'object',
    properties: {
      company: { type: 'string' },
      fiscal_year_ended: { type: 'string' },
      income_statement: {
        type: 'object',
        properties: {
          revenue: { type: 'string' },
          cost_of_revenue: { type: 'string' },
          gross_profit: { type: 'string' },
          operating_income: { type: 'string' },
          net_income: { type: 'string' },
          eps_basic: { type: 'string' },
          eps_diluted: { type: 'string' },
        },
      },
      balance_sheet: {
        type: 'object',
        properties: {
          total_assets: { type: 'string' },
          total_liabilities: { type: 'string' },
          total_stockholders_equity: { type: 'string' },
          cash_and_equivalents: { type: 'string' },
        },
      },
      cash_flow: {
        type: 'object',
        properties: {
          operating_cash_flow: { type: 'string' },
          capital_expenditures: { type: 'string' },
          free_cash_flow: { type: 'string' },
        },
      },
    },
  },
  prompt: 'Extract key financial details from this 10-K filing including income statement, balance sheet, and cash flow metrics.',
  model: 'kimi-k2p5',
};
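
Because this schema types every metric as a string, a dashboard has to parse the values before doing arithmetic. The sketch below assumes the strings look like `$1,234.50` or `(500)` (parentheses for negatives); the actual format depends on the filing and the model, so `parseAmount` and `grossMargin` are hypothetical helpers that return `null` when a value cannot be parsed.

```typescript
// Parses strings such as "$1,234.50" or "(500)" into numbers.
// Returns null for anything it cannot interpret, e.g. "n/a" or "1.2 billion".
function parseAmount(text: string): number | null {
  const trimmed = text.trim();
  const negative = /^\(.*\)$/.test(trimmed); // accounting-style negative
  const cleaned = trimmed.replace(/[()$,\s]/g, '');
  if (cleaned === '') return null;
  const value = Number(cleaned);
  if (!Number.isFinite(value)) return null;
  return negative ? -value : value;
}

// Gross margin as a fraction of revenue, or null if either input is
// unparseable or revenue is zero.
function grossMargin(revenue: string, grossProfit: string): number | null {
  const r = parseAmount(revenue);
  const g = parseAmount(grossProfit);
  if (r === null || g === null || r === 0) return null;
  return g / r;
}
```

If numeric fields are needed downstream on a regular basis, declaring them as `number` in the schema may be simpler than parsing strings after the fact.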

Resume Parsing

Structure candidate data from uploaded resumes for ATS integrations.

const profile = {
  schema: {
    type: 'object',
    properties: {
      name: { type: 'string' },
      email: { type: 'string' },
      phone: { type: 'string' },
      skills: { type: 'array', items: { type: 'string' } },
      experience: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            company: { type: 'string' },
            title: { type: 'string' },
            start_date: { type: 'string' },
            end_date: { type: 'string' },
          },
        },
      },
      education: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            institution: { type: 'string' },
            degree: { type: 'string' },
            year: { type: 'number' },
          },
        },
      },
    },
  },
  prompt: 'Extract structured candidate information from this resume.',
  model: 'claude-sonnet-4-5-20250929',
};

Audit Trail

Every materialized output stores a full audit record alongside the data:

| Field | Description |
| --- | --- |
| `model` | The model that ran the extraction |
| `prompt` | The exact prompt that was sent |
| `raw_response` | The raw LLM output before JSON parsing |
| `created_at` | Timestamp of materialization |

Access the audit trail at:

GET /document/{id}/output/{name}/audit

This endpoint requires authentication and is never exposed publicly; the public R2 path serves only the validated data.
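
A client reading the audit trail might model the record and build the request as follows. This is only a sketch: the `AuditRecord` type follows the fields listed above, but the base URL parameter, the bearer-token header, and the `buildAuditRequest` helper are assumptions, since this page does not specify the authentication scheme.

```typescript
// Audit record shape, following the fields listed in this section.
interface AuditRecord {
  model: string;
  prompt: string;
  raw_response: string;
  created_at: string;
}

// Builds the URL and headers for an audit request.
// ASSUMPTION: the API accepts a bearer token; check the API reference
// for the real authentication scheme.
function buildAuditRequest(
  baseUrl: string,
  id: string,
  name: string,
  token: string,
): { url: string; headers: Record<string, string> } {
  const url =
    `${baseUrl}/document/${encodeURIComponent(id)}` +
    `/output/${encodeURIComponent(name)}/audit`;
  return { url, headers: { Authorization: `Bearer ${token}` } };
}
```

The returned `url` and `headers` can be passed to any HTTP client, and the JSON response treated as an `AuditRecord`.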

Public URL Pattern

Materialized outputs are available at a predictable, cacheable URL:

GET /v1/documents/{id}/o_{name}/data.json

The `o_` prefix tells the worker to read from R2 directly. The Durable Object never wakes. Combine with the `t_` transform prefix for provider-specific extractions:

GET /v1/documents/{id}/t_llamaparse/o_invoice/data.json

Response headers include `Cache-Control: public, max-age=3600` and `Access-Control-Allow-Origin: *` for easy embedding.

The `t_` and `o_` URL segments are inspired by Cloudinary's URL-as-API pattern: encode transforms in the path so results are cacheable, embeddable, and readable without an SDK.
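
To make the path convention concrete, here is a minimal sketch of how a router could interpret the `t_` and `o_` segments. It is not the worker's actual code: `parseOutputPath` and the `OutputPath` type are hypothetical, and the parser accepts only the two URL shapes shown in this section.

```typescript
interface OutputPath {
  documentId: string;
  transform?: string; // from an optional t_{provider} segment
  output: string;     // from the o_{name} segment
}

// Parses /v1/documents/{id}/[t_{provider}/]o_{name}/data.json.
// Returns null for any path that does not fit that shape.
function parseOutputPath(path: string): OutputPath | null {
  const segments = path.split('/').filter((s) => s.length > 0);
  if (segments[0] !== 'v1' || segments[1] !== 'documents') return null;

  const documentId = segments[2];
  const rest = segments.slice(3);
  if (!documentId || rest[rest.length - 1] !== 'data.json') return null;

  let transform: string | undefined;
  let output: string | undefined;
  for (const segment of rest.slice(0, -1)) {
    if (segment.startsWith('t_')) transform = segment.slice(2);
    else if (segment.startsWith('o_')) output = segment.slice(2);
    else return null; // unknown segment type
  }
  if (!output) return null;
  return transform ? { documentId, transform, output } : { documentId, output };
}
```

For example, `/v1/documents/abc/t_llamaparse/o_invoice/data.json` parses to document `abc`, transform `llamaparse`, and output `invoice`.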
