Why Did My Test Fail?

Common failures, what they mean, and how to fix them.

Config errors

`CONFIG_VALIDATION_ERROR: At least one provider must be configured`

You don't have a providers section, or it's empty.

# Fix: add at least one provider
providers:
  openai:
    apiKeyEnv: "OPENAI_API_KEY"

`CONFIG_VALIDATION_ERROR: Exactly one of 'prompt' or 'command' must be set`

Each test case needs either a prompt (referencing a key in prompts) or a command (a shell command). Not both, and not neither.

# Wrong — missing prompt
tests:
  - name: "my-test"
    expect:
      judge:
        - criteria: "..."

# Fix — add prompt reference
tests:
  - name: "my-test"
    prompt: "my-prompt"
    expect:
      judge:
        - criteria: "..."

`CONFIG_VALIDATION_ERROR: schemaFile is required when format is 'json'`

You set output.format: "json" but didn't provide a schema file.

# Fix: add schemaFile
expect:
  output:
    format: "json"
    schemaFile: "./schemas/response.json"

Provider errors

`Authentication failed`

Your API key is missing or invalid. Check:

The env var name in apiKeyEnv matches what you've exported
The key is set in your shell: echo $OPENAI_API_KEY
In CI, the secret is configured in your repo settings

# Verify the key is set (prints only the first 10 characters)
echo "$OPENAI_API_KEY" | head -c 10
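
To confirm the variable is exported without printing any part of the secret, a small POSIX shell check along these lines can help. The name check_key is hypothetical and not part of kindlm; the function only reports whether the named variable is non-empty in the current environment.

```shell
# Hypothetical helper (not part of kindlm): report whether an environment
# variable is exported and non-empty, without echoing its value.
check_key() {
  if [ -n "$(printenv "$1")" ]; then
    echo "$1 is set"
  else
    echo "$1 is missing or empty"
  fi
}

# Usage: pass the same name you put in apiKeyEnv
# check_key OPENAI_API_KEY
```

Because printenv only sees exported variables, this also catches the case where the key is assigned in the shell but never exported.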

`Rate limited`

You're sending too many requests. Reduce concurrency:

defaults:
  concurrency: 1

If you still hit the limit at concurrency 1, also reduce repeat so that each run sends fewer requests overall.

`Provider timeout`

The model took too long to respond. Increase the timeout:

defaults:
  timeoutMs: 120000  # 2 minutes

For large prompts or complex tool chains, 60 seconds (the default) may not be enough.

`Network error`

Can't reach the provider API. Check your internet connection and any proxy settings. For Ollama, make sure the server is running:

ollama serve
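
If you are unsure whether the server is already up, you can probe it before running tests. This sketch assumes Ollama is listening on its default address, http://localhost:11434, and that curl is installed; adjust the URL if your setup differs. check_ollama is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if an Ollama server answers at the given URL.
# Assumes the default address http://localhost:11434 when no argument is given.
check_ollama() {
  url="${1:-http://localhost:11434}"
  # -s: quiet, -f: fail on HTTP errors, --max-time: don't hang on a dead host
  if curl -sf --max-time 5 "$url/api/tags" >/dev/null; then
    echo "Ollama is reachable at $url"
  else
    echo "Ollama is not reachable at $url" >&2
    return 1
  fi
}

# Usage:
# check_ollama || echo "start the server with: ollama serve"
```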

Assertion failures

`TOOL_CALL_MISSING: Expected tool 'X' to be called`

The model didn't call the tool you expected. Common causes:

Tool not in tools list — the tool must be defined in the test's tools section for the model to know about it
Prompt doesn't instruct tool use — make sure your system prompt tells the model about available tools
Model chose not to call it — the model decided a direct response was better. Make the prompt more explicit about when to use tools.
Temperature too high — lower temperature for more predictable behavior

`TOOL_CALL_UNEXPECTED: Tool 'X' was called but shouldNotCall was set`

The model called a tool you explicitly forbade. This is usually a prompt issue: either the model doesn't know it shouldn't call that tool, or the instruction isn't strong enough.

Fix: strengthen the system prompt, or add explicit instructions like "Never call process_refund without manager approval."

`TOOL_CALL_ARGS_MISMATCH: Expected args {...} but got {...}`

The model called the right tool with wrong arguments. Check:

The argsMatch values are what the model would reasonably extract from the user message
argsMatch uses partial matching: it checks only the keys you specify, so extra keys in the call are fine
The argument types match: "123" (string) is not the same as 123 (number)

`TOOL_CALL_ORDER_WRONG: Expected tool 'X' at position N`

Tools were called in the wrong order. The order field is 0-indexed:

toolCalls:
  - tool: "step_one"
    order: 0        # called first
  - tool: "step_two"
    order: 1        # called second

`SCHEMA_INVALID: Output does not match schema`

The model's JSON output doesn't match your JSON Schema. Check:

The model is actually returning JSON (not markdown-wrapped JSON)
The schema matches what the model is producing — run kindlm test --reporter json to see the raw output
additionalProperties: false in your schema will reject extra fields the model adds

`SCHEMA_PARSE_ERROR: Output is not valid JSON`

The model returned text that isn't valid JSON. Common causes:

Model wrapped JSON in markdown code blocks (```json ... ```)
Model added explanation text before/after the JSON
Model returned partial JSON (truncated by maxTokens)

Fix: make the system prompt explicit: "Respond with JSON only. No markdown, no explanation."
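
To tell which of these causes you are hitting, you can save the raw model output to a file and test it directly. The sketch below is a hypothetical diagnostic, not a kindlm feature. It assumes python3 is available for JSON validation and that you have copied the raw output into output.txt yourself.

```shell
# Hypothetical diagnostics (not part of kindlm).

# Drop any line that starts with three backticks, i.e. markdown fence lines.
strip_fences() {
  sed '/^`\{3\}/d'
}

# Succeed only if stdin is valid JSON. Assumes python3 is installed.
is_json() {
  python3 -m json.tool >/dev/null 2>&1
}

# Usage, with the raw output saved to output.txt:
# is_json < output.txt && echo "valid JSON"
# strip_fences < output.txt | is_json && echo "valid once fences are removed"
```

If the second check passes but the first fails, markdown wrapping is the likely cause. If both fail, look for surrounding explanation text or truncation.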

`PII_DETECTED: Found PII pattern 'X' in output`

The output contains text matching a PII regex pattern. Common causes:

The model is leaking data from tool responses: tighten the prompt, for example "Never include customer email, phone, or SSN in your response"
False positive: the pattern matched something that isn't actually PII. Give the model context about that text in the prompt, or adjust your custom patterns.

`KEYWORD_DENIED: Output contains denied keyword 'X'`

The output contains a forbidden word or phrase. Matching is case-insensitive.

If it's a false positive (the word appears in legitimate context), either remove it from the deny list or restructure the test.

`KEYWORD_MISSING: Output does not contain any allowed keyword`

You set guardrails.keywords.allow and none of those keywords appear in the output. The output must contain at least one of the allowed keywords.

`JUDGE_BELOW_THRESHOLD: Score 0.XX < minimum 0.XX`

The LLM judge scored the response below your threshold. Check:

Run kindlm test --reporter json to see the judge's reasoning
The criteria might be too strict for the model — try lowering minScore
The model's response might actually be bad — read the output and the judge's explanation
Judge results are non-deterministic — increase defaults.repeat and check if it's consistently failing

`DRIFT_EXCEEDED: Drift score 0.XX > maximum 0.XX`

The output has changed significantly from the baseline. This isn't necessarily a bug — maybe the change is intentional.

Review the diff: kindlm baseline compare
If the new behavior is better, update the baseline: kindlm baseline set
If it's a regression, investigate the prompt change that caused it

Gate failures

`Gates: ✗ FAILED — pass rate 85% < minimum 95%`

Too many individual assertions failed across all tests and runs. Your options:

Fix the failing assertions (see above)
Lower gates.passRateMin if the threshold is too strict for your use case
Increase defaults.repeat — more runs smooth out non-deterministic failures
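
As a rough illustration of the arithmetic behind this gate, the sketch below computes a pass rate and compares it with a minimum. It assumes the rate is simply passed assertions divided by total assertions; kindlm's actual calculation may differ, and both function names are hypothetical.

```shell
# Hypothetical helpers illustrating the gate arithmetic (not kindlm's code).

# Print passed/total as a fraction with two decimals.
pass_rate() {
  awk -v passed="$1" -v total="$2" 'BEGIN { printf "%.2f\n", passed / total }'
}

# Succeed if the rate meets the minimum, e.g. gate_ok 0.85 0.95 fails.
gate_ok() {
  awk -v rate="$1" -v min="$2" 'BEGIN { exit !(rate >= min) }'
}

# Usage: 17 of 20 assertions passed, gate requires 95%
# rate="$(pass_rate 17 20)"
# gate_ok "$rate" 0.95 || echo "gate failed: $rate < 0.95"
```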

Deterministic vs probabilistic gates

If your deterministic assertions (schema, PII, keywords, tool calls) all pass but judge scores drag down the overall pass rate, use split gates:

gates:
  deterministicPassRate: 1.0    # zero tolerance for deterministic checks
  probabilisticPassRate: 0.8    # allow some variance in judge scores

CI-specific issues

Tests pass locally but fail in CI

Missing API key — check that the secret is set in CI settings
Rate limiting — CI runs may overlap with other jobs. Lower concurrency
Network — some CI environments block outbound requests. Check firewall rules
Timeout — CI runners may be slower. Increase timeoutMs

Job times out

LLM API calls are slow. Set timeout-minutes: 10 (or more) on the job:

jobs:
  kindlm:
    runs-on: ubuntu-latest
    timeout-minutes: 10

How to read JUnit output

kindlm test --reporter junit > junit.xml

Each test case becomes a JUnit <testcase>. Failed assertions become <failure> elements with the failure code and message. Most CI systems (GitHub Actions, GitLab, Jenkins) render this natively.
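
For a quick summary outside a CI dashboard, you can count the elements in the report from the shell. This is a rough sketch that relies only on the element names described above and on plain text matching, so it is not a real XML parser. count_tag is a hypothetical helper.

```shell
# Hypothetical helper: count occurrences of an opening tag in a file using
# plain text matching (not a real XML parser).
count_tag() {
  # grep -o prints each match on its own line; wc -l counts them.
  # The trailing [ >/] avoids matching longer names such as <failures>.
  # "|| true" keeps the pipeline healthy when there are no matches.
  { grep -o "<$1[ >/]" "$2" || true; } | wc -l | tr -d ' '
}

# Usage:
# echo "testcases: $(count_tag testcase junit.xml)"
# echo "failures:  $(count_tag failure junit.xml)"
```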

Still stuck?

Run with --reporter json to see full details including raw model output
Check GitHub Issues for known problems
Open a new issue with your config (redact API keys) and the error output
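
Before pasting logs into an issue, a filter like the one below can mask key-shaped strings. It is only a sketch: the pattern assumes keys that start with "sk-", which is a convention some providers use and others do not, and redact_keys is a hypothetical helper name. Review the result by eye before posting.

```shell
# Hypothetical helper: mask anything that looks like an "sk-..." style key.
# The pattern is an assumption; other providers use other key formats.
redact_keys() {
  sed -E 's/sk-[A-Za-z0-9_-]{8,}/sk-REDACTED/g'
}

# Usage:
# redact_keys < error.log > error.redacted.log
```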