# Why Did My Test Fail?

Common failures, what they mean, and how to fix them.
## Config errors
### `CONFIG_VALIDATION_ERROR: At least one provider must be configured`

You don't have a `providers` section, or it's empty.

```yaml
# Fix: add at least one provider
providers:
  openai:
    apiKeyEnv: "OPENAI_API_KEY"
```
### `CONFIG_VALIDATION_ERROR: Exactly one of 'prompt' or 'command' must be set`

Each test case needs either a `prompt` (referencing a key in `prompts`) or a `command` (a shell command). Not both, and not neither.

```yaml
# Wrong — missing prompt
tests:
  - name: "my-test"
    expect:
      judge:
        - criteria: "..."
```

```yaml
# Fix — add prompt reference
tests:
  - name: "my-test"
    prompt: "my-prompt"
    expect:
      judge:
        - criteria: "..."
```
### `CONFIG_VALIDATION_ERROR: schemaFile is required when format is 'json'`

You set `output.format: "json"` but didn't provide a schema file.

```yaml
# Fix: add schemaFile
expect:
  output:
    format: "json"
    schemaFile: "./schemas/response.json"
```
## Provider errors

### Authentication failed

Your API key is missing or invalid. Check:

- The env var name in `apiKeyEnv` matches what you've exported
- The key is set in your shell: `echo $OPENAI_API_KEY`
- In CI, the secret is configured in your repo settings

```bash
# Verify the key is set
echo $OPENAI_API_KEY | head -c 10
```
### Rate limited

You're sending too many requests. Reduce concurrency:

```yaml
defaults:
  concurrency: 1
```

Lowering `concurrency` spaces requests out; reducing `defaults.repeat` cuts the total number of requests per run.
### Provider timeout

The model took too long to respond. Increase the timeout:

```yaml
defaults:
  timeoutMs: 120000 # 2 minutes
```

For large prompts or complex tool chains, 60 seconds (the default) may not be enough.
### Network error

Can't reach the provider API. Check your internet connection and any proxy settings. For Ollama, make sure the server is running:

```bash
ollama serve
```
## Assertion failures

### `TOOL_CALL_MISSING: Expected tool 'X' to be called`

The model didn't call the tool you expected. Common causes:

- Tool not in `tools` list — the tool must be defined in the test's `tools` section for the model to know about it
- Prompt doesn't instruct tool use — make sure your system prompt tells the model about available tools
- Model chose not to call it — the model decided a direct response was better. Make the prompt more explicit about when to use tools.
- Temperature too high — lower `temperature` for more predictable behavior
### `TOOL_CALL_UNEXPECTED: Tool 'X' was called but shouldNotCall was set`

The model called a tool you explicitly forbade. This is usually a prompt issue — either the model doesn't know it shouldn't call that tool, or the instruction isn't strong enough.

Fix: strengthen the system prompt, or add explicit instructions like "Never call `process_refund` without manager approval."
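As a sketch of the corresponding config — assuming `shouldNotCall` sits on a tool entry under the test's expectations, which the error message suggests but your config reference should confirm:

```yaml
# Sketch — exact key placement may differ in your version
expect:
  toolCalls:
    - tool: "process_refund"
      shouldNotCall: true
```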
### `TOOL_CALL_ARGS_MISMATCH: Expected args {...} but got {...}`

The model called the right tool with wrong arguments. Check:

- The `argsMatch` values are what the model would reasonably extract from the user message
- You're using partial matching — `argsMatch` only checks the keys you specify; extra keys are fine
- The argument types match — `"123"` (string) vs `123` (number)
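To illustrate partial matching (the tool and argument names here are hypothetical):

```yaml
# Sketch — "lookup_order" and "orderId" are made-up names
toolCalls:
  - tool: "lookup_order"
    argsMatch:
      orderId: "123"  # a string; fails if the model sends the number 123
```

Any extra arguments the model passes alongside `orderId` are ignored.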
### `TOOL_CALL_ORDER_WRONG: Expected tool 'X' at position N`

Tools were called in the wrong order. The `order` field is 0-indexed:

```yaml
toolCalls:
  - tool: "step_one"
    order: 0 # called first
  - tool: "step_two"
    order: 1 # called second
```
### `SCHEMA_INVALID: Output does not match schema`

The model's JSON output doesn't match your JSON Schema. Check:

- The model is actually returning JSON (not markdown-wrapped JSON)
- The schema matches what the model is producing — run `kindlm test --reporter json` to see the raw output
- `additionalProperties: false` in your schema will reject extra fields the model adds
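For reference, a minimal schema file with strict extra-field handling might look like this (the `status` field is just an example):

```json
{
  "type": "object",
  "properties": {
    "status": { "type": "string" }
  },
  "required": ["status"],
  "additionalProperties": false
}
```

With `additionalProperties: false`, an output like `{"status": "ok", "note": "hi"}` fails validation because of the extra `note` field.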
### `SCHEMA_PARSE_ERROR: Output is not valid JSON`

The model returned text that isn't valid JSON. Common causes:

- Model wrapped JSON in markdown code fences (`` ```json ... ``` ``)
- Model added explanation text before/after the JSON
- Model returned partial JSON (truncated by `maxTokens`)

Fix: make the system prompt explicit: "Respond with JSON only. No markdown, no explanation."
### `PII_DETECTED: Found PII pattern 'X' in output`

The output contains text matching a PII regex pattern. Check:

- The model is leaking data from tool responses — tighten the prompt: "Never include customer email, phone, or SSN in your response"
- False positive — the pattern matched something that isn't actually PII. Add the false-positive text to context in the prompt, or adjust custom patterns.
### `KEYWORD_DENIED: Output contains denied keyword 'X'`

The output contains a forbidden word or phrase. Matching is case-insensitive.

If it's a false positive (the word appears in legitimate context), either remove it from the deny list or restructure the test.
### `KEYWORD_MISSING: Output does not contain any allowed keyword`

You set `guardrails.keywords.allow` and none of those keywords appear in the output. The output must contain at least one of the allowed keywords.
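A sketch of the keyword guardrails, combining the `allow` key named above with a `deny` key inferred from the `KEYWORD_DENIED` error (the example words are placeholders):

```yaml
# Sketch — example keywords only
guardrails:
  keywords:
    allow:
      - "refund"
    deny:
      - "guarantee"
```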
### `JUDGE_BELOW_THRESHOLD: Score 0.XX < minimum 0.XX`

The LLM judge scored the response below your threshold. Check:

- Run `kindlm test --reporter json` to see the judge's reasoning
- The criteria might be too strict for the model — try lowering `minScore`
- The model's response might actually be bad — read the output and the judge's explanation
- Judge results are non-deterministic — increase `defaults.repeat` and check if it's consistently failing
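Putting those two knobs together, a sketch (assuming `minScore` sits next to `criteria` on a judge entry, as the error name suggests; the values are illustrative):

```yaml
# Sketch — threshold and repeat values are illustrative
defaults:
  repeat: 3            # more runs smooth out judge variance
tests:
  - name: "my-test"
    prompt: "my-prompt"
    expect:
      judge:
        - criteria: "..."
          minScore: 0.7  # lower this if the criteria are strict
```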
### `DRIFT_EXCEEDED: Drift score 0.XX > maximum 0.XX`

The output has changed significantly from the baseline. This isn't necessarily a bug — maybe the change is intentional.

- Review the diff: `kindlm baseline compare`
- If the new behavior is better, update the baseline: `kindlm baseline set`
- If it's a regression, investigate the prompt change that caused it
## Gate failures

### `Gates: ✗ FAILED — pass rate 85% < minimum 95%`

Too many individual assertions failed across all tests and runs. Either:

- Fix the failing assertions (see above)
- Lower `gates.passRateMin` if the threshold is too strict for your use case
- Increase `defaults.repeat` — more runs smooth out non-deterministic failures
### Deterministic vs probabilistic gates

If your deterministic assertions (schema, PII, keywords, tool calls) all pass but judge scores drag down the overall pass rate, use split gates:

```yaml
gates:
  deterministicPassRate: 1.0 # zero tolerance for deterministic checks
  probabilisticPassRate: 0.8 # allow some variance in judge scores
```
## CI-specific issues

### Tests pass locally but fail in CI

- Missing API key — check that the secret is set in CI settings
- Rate limiting — CI runs may overlap with other jobs. Lower `concurrency`
- Network — some CI environments block outbound requests. Check firewall rules
- Timeout — CI runners may be slower. Increase `timeoutMs`
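For the missing-key case on GitHub Actions, the repo secret has to be mapped into the job's environment, for example:

```yaml
# GitHub Actions: expose the repo secret as an env var
jobs:
  kindlm:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```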
### Job times out

LLM API calls are slow. Set `timeout-minutes: 10` (or more) on the job:

```yaml
jobs:
  kindlm:
    runs-on: ubuntu-latest
    timeout-minutes: 10
```
### How to read JUnit output

```bash
kindlm test --reporter junit > junit.xml
```

Each test case becomes a JUnit `<testcase>`. Failed assertions become `<failure>` elements with the failure code and message. Most CI systems (GitHub Actions, GitLab, Jenkins) render this natively.
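The file follows the common JUnit XML conventions; kindlm's exact attributes may differ, but it will look roughly like this (test names are made up):

```xml
<testsuite name="kindlm" tests="2" failures="1">
  <testcase name="my-test"/>
  <testcase name="my-other-test">
    <failure message="TOOL_CALL_MISSING: Expected tool 'X' to be called"/>
  </testcase>
</testsuite>
```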
## Still stuck?

- Run with `--reporter json` to see full details including raw model output
- Check GitHub Issues for known problems
- Open a new issue with your config (redact API keys) and the error output