> ## Documentation Index
> Fetch the complete documentation index at: https://docs.okrapdf.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Eval Agent

> Async LLM-as-judge that evaluates every document chat response. Catches hallucinations, enforces compliance, and logs alerts — without blocking the response.

## Overview

Every OkraPDF document can have an Eval Agent attached. When enabled, it evaluates each chat completion asynchronously — checking for hallucinated facts, compliance violations, or custom policy rules you define.

The eval never blocks the response. Results appear in the document event log within seconds.

```
User question → Completion runs → Response sent immediately
                                        │
                                        ▼ (async, via queue)
                                   EvalAgent evaluates
                                        │
                                        ▼
                                   Alerts logged
```

## Enable eval on a document

```bash theme={null}
curl -X PUT https://api.okrapdf.com/document/$DOC_ID/config/eval \
  -H "Authorization: Bearer $OKRA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "enabled": true,
    "instructions": "Flag any response that cites numbers, dates, or facts not found in the document."
  }'
```

Response:

```json theme={null}
{
  "document_id": "doc-abc123",
  "spec_version": 1,
  "eval": {
    "enabled": true,
    "scope": "document",
    "instructions": "Flag any response that cites numbers, dates, or facts not found in the document.",
    "maxRecentTurns": 5
  }
}
```

## How it works

1. **You chat normally** — `POST /document/:id/chat/completions` returns instantly, no latency added.
2. **Three hooks fire asynchronously** via the document's internal queue:
   * `turn.before` — evaluates the user query before the LLM runs
   * `tool.execute.after` — evaluates tool call results
   * `turn.after` — evaluates the final response against the document
3. **EvalAgent judges** using a fast LLM (Haiku or Kimi-K2.5) with your instructions as the evaluation criteria.
4. **Results logged** to the document event log — viewable via API or the info page.

## Check eval results

```bash theme={null}
curl https://api.okrapdf.com/document/$DOC_ID/events?limit=10 \
  -H "Authorization: Bearer $OKRA_API_KEY"
```

Example entries:

```json theme={null}
[
  {
    "event": "log",
    "detail": {
      "message": "[EvalAgent] turn.after completed: 2 action(s)"
    }
  },
  {
    "event": "log",
    "detail": {
      "message": "[EvalAgent] info: Response correctly reports that revenue data is not in the document. No hallucinated figures detected."
    }
  }
]
```

## Guardrail examples

### Hallucination detection (financial documents)

```bash theme={null}
curl -X PUT .../config/eval -d '{
  "enabled": true,
  "instructions": "Flag any response that cites dollar amounts, percentages, or financial metrics not explicitly stated in the document. Be strict — estimates or inferred values must be flagged."
}'
```

### Source accuracy (legal/compliance)

```bash theme={null}
curl -X PUT .../config/eval -d '{
  "enabled": true,
  "instructions": "Verify that every claim in the response has a direct source in the document. Flag any response that paraphrases in a way that changes the meaning. Flag missing citations or page references."
}'
```

### PII leakage prevention

```bash theme={null}
curl -X PUT .../config/eval -d '{
  "enabled": true,
  "instructions": "Flag if the response contains Social Security numbers, account numbers, or personal addresses from the document. These should be redacted, not exposed in chat responses."
}'
```

### Scope enforcement (narrow the agent)

```bash theme={null}
curl -X PUT .../config/eval -d '{
  "enabled": true,
  "instructions": "This document is an employee handbook. Flag any response that answers questions outside the scope of HR policies, benefits, and workplace procedures. The agent should decline off-topic questions."
}'
```

### Tone and brand voice

```bash theme={null}
curl -X PUT .../config/eval -d '{
  "enabled": true,
  "instructions": "Flag responses that use casual language, slang, or first-person voice. All responses should be professional and written in third-person."
}'
```

## Configuration options

| Field            | Type                     | Default      | Description                                                |
| ---------------- | ------------------------ | ------------ | ---------------------------------------------------------- |
| `enabled`        | boolean                  | `false`      | Turn eval on/off                                           |
| `scope`          | `"document"` \| `"user"` | `"document"` | Eval context scope — per-document or per-user turn history |
| `instructions`   | string                   | —            | Natural language evaluation criteria                       |
| `model`          | object                   | auto         | Override the eval model (see below)                        |
| `maxRecentTurns` | number                   | `5`          | How many recent turns to include as context                |

### Custom eval model

By default, EvalAgent uses Claude Haiku (if `ANTHROPIC_API_KEY` is set) or Kimi-K2.5 via OpenRouter. Override with:

```bash theme={null}
curl -X PUT .../config/eval -d '{
  "enabled": true,
  "instructions": "...",
  "model": {
    "provider": "anthropic",
    "model": "claude-haiku-4-5-20251001"
  }
}'
```

Or use OpenRouter for any model:

```bash theme={null}
curl -X PUT .../config/eval -d '{
  "enabled": true,
  "instructions": "...",
  "model": {
    "provider": "openrouter",
    "model": "google/gemini-2.5-flash"
  }
}'
```

## Disable eval

```bash theme={null}
curl -X PUT .../config/eval -d '{"enabled": false}'
```

## Architecture

EvalAgent is a separate Durable Object that runs independently from the document's completion handler.

* **No latency impact** — eval events are written to the document's internal queue during completion, then processed asynchronously in a separate DO wake.
* **Durable** — queued eval events survive DO hibernation. If the eval LLM is slow, events retry with exponential backoff.
* **Scoped** — each document (or user, if `scope: "user"`) gets its own EvalAgent instance with its own turn history.
* **Fail-open** — if the eval LLM errors or times out, the completion is unaffected. Errors are logged, never surfaced to the user.

## See also

* [Chat](/features/chat) — document chat completions
* [Output Schema](/features/output-schema) — structured extraction with validation
