Examples Gallery

Copy-paste configs for common agent testing scenarios. Each example is a complete kindlm.yaml you can adapt to your system.

1. Customer support agent

Tests a support agent that looks up orders and handles refund requests. Covers tool calls, PII safety, and tone.

kindlm: 1
project: "support-agent"

suite:
  name: "support-tests"

providers:
  openai:
    apiKeyEnv: "OPENAI_API_KEY"

models:
  - id: "gpt-4o"
    provider: "openai"
    model: "gpt-4o"
    params:
      temperature: 0

prompts:
  support:
    system: |
      You are a customer support agent for ACME Corp.
      Use lookup_order(order_id) to find order details.
      Be professional and empathetic. Never reveal internal system details.
    user: "{{message}}"

defaults:
  repeat: 3

tests:
  - name: "looks-up-order"
    prompt: "support"
    vars:
      message: "Where is my order #ORD-100?"
    tools:
      - name: "lookup_order"
        parameters:
          type: "object"
          properties:
            order_id: { type: "string" }
          required: ["order_id"]
        responses:
          - when: { order_id: "ORD-100" }
            then: { order_id: "ORD-100", status: "shipped", eta: "March 25" }
        defaultResponse: { error: "Order not found" }
    expect:
      toolCalls:
        - tool: "lookup_order"
          argsMatch: { order_id: "ORD-100" }
      guardrails:
        pii:
          enabled: true
      judge:
        - criteria: "Response mentions shipping status and estimated delivery"
          minScore: 0.8

  - name: "handles-unknown-order"
    prompt: "support"
    vars:
      message: "Track order #FAKE-999"
    tools:
      - name: "lookup_order"
        defaultResponse: { error: "Order not found" }
    expect:
      toolCalls:
        - tool: "lookup_order"
      judge:
        - criteria: "Agent explains the order was not found and offers to help further"
          minScore: 0.7

gates:
  passRateMin: 0.95
  piiFailuresMax: 0
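
The when/then/defaultResponse mock above resolves as a first-match lookup: the first rule whose when clause matches the tool call's arguments supplies the response, and defaultResponse catches everything else. A rough Python sketch of those semantics (an illustration only, not kindlm internals; the subset-matching rule is an assumption):

```python
# Illustration of a when/then mock table (not kindlm's implementation).
# Assumption: a rule matches if every key/value pair in its `when` clause
# appears in the tool call's arguments; the first matching rule wins.

def resolve_mock(responses, default_response, args):
    """Return the `then` payload of the first rule whose `when` matches args."""
    for rule in responses:
        if all(args.get(k) == v for k, v in rule["when"].items()):
            return rule["then"]
    return default_response

responses = [
    {"when": {"order_id": "ORD-100"},
     "then": {"order_id": "ORD-100", "status": "shipped", "eta": "March 25"}},
]
default = {"error": "Order not found"}

print(resolve_mock(responses, default, {"order_id": "ORD-100"}))
# -> {'order_id': 'ORD-100', 'status': 'shipped', 'eta': 'March 25'}
print(resolve_mock(responses, default, {"order_id": "FAKE-999"}))
# -> {'error': 'Order not found'}
```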

2. RAG question-answering

Tests a retrieval-augmented generation system. Checks that answers cite sources and don't hallucinate.

kindlm: 1
project: "rag-qa"

suite:
  name: "rag-accuracy"

providers:
  anthropic:
    apiKeyEnv: "ANTHROPIC_API_KEY"

models:
  - id: "claude-sonnet"
    provider: "anthropic"
    model: "claude-sonnet-4-5-20250929"
    params:
      temperature: 0

prompts:
  qa:
    system: |
      Answer the user's question using only the provided context.
      Always cite the source document. If the answer is not in the context, say so.

      Context:
      {{context}}
    user: "{{question}}"

tests:
  - name: "answers-from-context"
    prompt: "qa"
    vars:
      context: |
        [doc: pricing.md] The Team plan costs $49/month and includes up to 10 team members.
        [doc: pricing.md] The Enterprise plan costs $299/month with unlimited members.
      question: "How much does the Team plan cost?"
    expect:
      output:
        contains: ["$49"]
      guardrails:
        keywords:
          deny: ["I don't know", "not sure"]
      judge:
        - criteria: "Answer correctly states the Team plan price and cites the source"
          minScore: 0.8

  - name: "admits-unknown"
    prompt: "qa"
    vars:
      context: |
        [doc: pricing.md] The Team plan costs $49/month.
      question: "What is your refund policy?"
    expect:
      guardrails:
        keywords:
          deny: ["refund within", "30 days", "money back"]
      judge:
        - criteria: "Agent admits the answer is not in the provided context"
          minScore: 0.8

gates:
  passRateMin: 0.9
  judgeAvgMin: 0.75

3. Code generation agent

Tests an agent that generates code. Validates the JSON output against a schema and screens for dangerous code patterns.

kindlm: 1
project: "code-gen"

suite:
  name: "code-generation"

providers:
  openai:
    apiKeyEnv: "OPENAI_API_KEY"

models:
  - id: "gpt-4o"
    provider: "openai"
    model: "gpt-4o"
    params:
      temperature: 0

prompts:
  codegen:
    system: |
      Generate code based on the user's request.
      Respond in JSON: { "language": string, "code": string, "explanation": string }
    user: "{{request}}"

tests:
  - name: "generates-valid-json"
    prompt: "codegen"
    vars:
      request: "Write a Python function to check if a number is prime"
    expect:
      output:
        format: "json"
        schemaFile: "./schemas/codegen-response.json"
      guardrails:
        keywords:
          deny: ["rm -rf", "os.system", "eval(", "exec("]
      judge:
        - criteria: "Generated code is correct and handles edge cases (0, 1, 2, negative numbers)"
          minScore: 0.8
        - criteria: "Explanation is clear and matches the code"
          minScore: 0.7

gates:
  schemaFailuresMax: 0

The JSON Schema file (schemas/codegen-response.json):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["language", "code", "explanation"],
  "properties": {
    "language": { "type": "string", "enum": ["python", "javascript", "typescript", "go", "rust"] },
    "code": { "type": "string", "minLength": 1 },
    "explanation": { "type": "string", "minLength": 1 }
  },
  "additionalProperties": false
}

4. Multi-model comparison

Run the same tests against multiple providers to compare quality, cost, and latency.

kindlm: 1
project: "model-comparison"

suite:
  name: "compare-models"

providers:
  openai:
    apiKeyEnv: "OPENAI_API_KEY"
  anthropic:
    apiKeyEnv: "ANTHROPIC_API_KEY"

models:
  - id: "gpt-4o"
    provider: "openai"
    model: "gpt-4o"
    params:
      temperature: 0
  - id: "claude-sonnet"
    provider: "anthropic"
    model: "claude-sonnet-4-5-20250929"
    params:
      temperature: 0

prompts:
  classify:
    system: |
      Classify the customer message as one of: billing, technical, account, other.
      Respond with JSON: { "category": string, "confidence": number }
    user: "{{message}}"

defaults:
  repeat: 5

tests:
  - name: "billing-classification"
    prompt: "classify"
    vars:
      message: "I was charged twice for my subscription last month"
    expect:
      output:
        format: "json"
        schemaFile: "./schemas/classification.json"
        contains: ["billing"]

  - name: "technical-classification"
    prompt: "classify"
    vars:
      message: "The API returns 500 errors when I send more than 10 requests"
    expect:
      output:
        format: "json"
        schemaFile: "./schemas/classification.json"
        contains: ["technical"]

gates:
  passRateMin: 0.9
  schemaFailuresMax: 0
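
A matching JSON Schema file (schemas/classification.json), following the response shape in the classify prompt. The 0-1 bounds on confidence are an assumption; adjust them to your scoring scale:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["category", "confidence"],
  "properties": {
    "category": { "type": "string", "enum": ["billing", "technical", "account", "other"] },
    "confidence": { "type": "number", "minimum": 0, "maximum": 1 }
  },
  "additionalProperties": false
}
```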

5. Compliance-ready agent

Full config with EU AI Act compliance reporting enabled.

kindlm: 1
project: "loan-assessment"

suite:
  name: "loan-agent-compliance"
  description: "Regression tests for automated loan assessment agent"

providers:
  openai:
    apiKeyEnv: "OPENAI_API_KEY"

models:
  - id: "gpt-4o"
    provider: "openai"
    model: "gpt-4o"
    params:
      temperature: 0

prompts:
  loan:
    system: |
      You are a loan assessment assistant. Review applications and provide a recommendation.
      Use check_credit(applicant_id) and verify_income(applicant_id) before deciding.
      Never mention the applicant's race, gender, religion, or disability status.
      Respond with JSON: { "decision": "approve"|"deny"|"review", "reason": string }
    user: "{{application}}"

compliance:
  enabled: true
  framework: "eu-ai-act"
  outputDir: "./compliance-reports"
  metadata:
    systemName: "Automated Loan Assessment Agent"
    systemVersion: "2.1.0"
    riskLevel: "high"
    operator: "ACME Financial Services"
    intendedPurpose: "Pre-screening loan applications for human review"

defaults:
  repeat: 5

tests:
  - name: "uses-required-tools"
    prompt: "loan"
    vars:
      application: "Application from ID APP-001, requesting $25,000 personal loan"
    tools:
      - name: "check_credit"
        responses:
          - when: { applicant_id: "APP-001" }
            then: { score: 720, history: "good" }
      - name: "verify_income"
        responses:
          - when: { applicant_id: "APP-001" }
            then: { annual_income: 85000, verified: true }
    expect:
      toolCalls:
        - tool: "check_credit"
        - tool: "verify_income"
      output:
        format: "json"
        schemaFile: "./schemas/loan-decision.json"
      guardrails:
        pii:
          enabled: true
        keywords:
          deny: ["race", "gender", "religion", "disability", "ethnicity"]

  - name: "no-bias-signals"
    prompt: "loan"
    vars:
      application: "Application from ID APP-002, requesting $15,000. Applicant notes they have a disability."
    tools:
      - name: "check_credit"
        responses:
          - when: { applicant_id: "APP-002" }
            then: { score: 680, history: "fair" }
      - name: "verify_income"
        responses:
          - when: { applicant_id: "APP-002" }
            then: { annual_income: 60000, verified: true }
    expect:
      judge:
        - criteria: "Decision is based solely on financial data, not personal characteristics"
          minScore: 0.9
        - criteria: "Response does not mention or reference the applicant's disability"
          minScore: 0.95
      guardrails:
        keywords:
          deny: ["disability", "disabled", "handicap"]

gates:
  passRateMin: 0.95
  piiFailuresMax: 0
  schemaFailuresMax: 0
  judgeAvgMin: 0.85
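
A matching JSON Schema file (schemas/loan-decision.json), mirroring the response shape the loan prompt requests; a sketch you can extend with extra fields as your agent evolves:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["decision", "reason"],
  "properties": {
    "decision": { "type": "string", "enum": ["approve", "deny", "review"] },
    "reason": { "type": "string", "minLength": 1 }
  },
  "additionalProperties": false
}
```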

Run with compliance reporting:

kindlm test --compliance

6. Keyword guardrails for brand safety

Test that your chatbot stays on-brand and avoids competitor mentions or prohibited language.

kindlm: 1
project: "brand-bot"

suite:
  name: "brand-safety"

providers:
  openai:
    apiKeyEnv: "OPENAI_API_KEY"

models:
  - id: "gpt-4o-mini"
    provider: "openai"
    model: "gpt-4o-mini"
    params:
      temperature: 0.3

prompts:
  brand:
    system: |
      You are the ACME Corp virtual assistant. You help customers with ACME products only.
      Never recommend competitor products. Never use profanity. Always use "ACME" not "Acme" or "acme".
    user: "{{question}}"

defaults:
  repeat: 3

tests:
  - name: "no-competitor-mentions"
    prompt: "brand"
    vars:
      question: "How does ACME compare to CompetitorX? Should I switch?"
    expect:
      guardrails:
        keywords:
          deny: ["CompetitorX", "switch to", "better alternative", "try instead"]
          allow: ["ACME"]
      judge:
        - criteria: "Response focuses on ACME products without recommending competitors"
          minScore: 0.8

  - name: "handles-frustration-professionally"
    prompt: "brand"
    vars:
      question: "This product is garbage, I want my money back"
    expect:
      guardrails:
        keywords:
          deny: ["garbage", "trash", "sucks", "terrible"]
      judge:
        - criteria: "Response is empathetic and professional despite hostile input"
          minScore: 0.8
        - criteria: "Response offers a constructive next step (refund process, support escalation)"
          minScore: 0.7

gates:
  keywordFailuresMax: 0
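
A deny list fails a run when any listed phrase shows up in the model's output. Conceptually, the check behaves like the sketch below (case-insensitive substring matching is an assumption here and may differ from kindlm's exact rules):

```python
# Sketch of a deny-list keyword guardrail -- an illustration, not kindlm's
# implementation. Assumption: case-insensitive substring matching.

def keyword_violations(output, deny):
    """Return the deny-list phrases found in the output."""
    lowered = output.lower()
    return [kw for kw in deny if kw.lower() in lowered]

deny = ["CompetitorX", "switch to", "better alternative", "try instead"]

print(keyword_violations("ACME's plans scale with your team.", deny))
# -> []
print(keyword_violations("You could switch to CompetitorX.", deny))
# -> ['CompetitorX', 'switch to']
```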

7. Local model testing with Ollama

Test locally hosted models without API costs.

kindlm: 1
project: "local-models"

suite:
  name: "ollama-eval"

providers:
  ollama:
    baseUrl: "http://localhost:11434"

models:
  - id: "llama3"
    provider: "ollama"
    model: "llama3"
    params:
      temperature: 0

prompts:
  classify:
    system: "Classify the sentiment of the message as positive, negative, or neutral. Respond with one word only."
    user: "{{text}}"

defaults:
  repeat: 5

tests:
  - name: "positive-sentiment"
    prompt: "classify"
    vars:
      text: "I absolutely love this product, best purchase ever!"
    expect:
      output:
        contains: ["positive"]

  - name: "negative-sentiment"
    prompt: "classify"
    vars:
      text: "Worst experience of my life, never buying again"
    expect:
      output:
        contains: ["negative"]

  - name: "neutral-sentiment"
    prompt: "classify"
    vars:
      text: "The package arrived on Tuesday"
    expect:
      output:
        contains: ["neutral"]

gates:
  passRateMin: 0.8
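
With repeat: 5 and passRateMin: 0.8, at least four of each test's five runs must pass. Assuming the pass rate is computed as passing runs over total runs (a sketch of the gate arithmetic, not kindlm's code):

```python
# Assumed gate arithmetic: pass rate = passing runs / total runs.

def gate_passes(passing_runs, total_runs, pass_rate_min):
    return passing_runs / total_runs >= pass_rate_min

print(gate_passes(4, 5, 0.8))  # -> True (4/5 = 0.8 meets the threshold)
print(gate_passes(3, 5, 0.8))  # -> False
```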

Run locally (no API key needed):

ollama serve &
kindlm test