Behavioral Regression Testing for AI Agents: Introducing agentcheck

Software testing has solved hard problems over the past few decades: property-based testing, mutation testing, snapshot testing, and fuzz testing all came out of that work. But none of that infrastructure was built with AI agents in mind, and the gap shows.

When you write a REST API endpoint, you call it with a known input and assert on the output. The function is deterministic. If it breaks, it breaks the same way every time. You can bisect the commit that caused it. You can write a regression test that will catch it forever.

When you write an AI agent, none of that is true. The same prompt can produce different tool call sequences on different invocations. The model can claim it completed an action that it never actually performed. A new version of the underlying LLM can silently change the tool-use behavior your product depends on, and your unit tests will still pass because they never actually ran the agent.

This is the problem agentcheck was built to solve.

What agentcheck actually is

agentcheck is a test runner for AI agent behavior. You install it, annotate your test functions with @agent_test, and run agentcheck test. It collects your tests, runs each one the number of times you specified, evaluates your assertions, detects flakiness across runs, compares results against a saved baseline, and generates a report.

It does not mock your agent. It does not stub the LLM. It runs the real thing, collects what the agent actually did, and gives you a structured way to make claims about that behavior.

The install is straightforward:

pip install pygent-test

Zero runtime dependencies for the core library. Adapter-specific extras are opt-in.

The data model

Everything in agentcheck flows through two data classes: AgentResult and ToolCall.

AgentResult
  input          str              The prompt you sent
  final_output   str              What the agent returned
  tool_calls     list[ToolCall]   Every tool invocation, in order
  steps          int              How many reasoning steps the agent took
  errors         list[str]        Any errors that occurred
  latency        float            Wall-clock seconds for the full run
  cost           float | None     Token cost if the agent reports it
  metadata       dict             Anything else you want to carry through

ToolCall
  name      str          The tool name
  args      dict         Arguments passed to the tool
  output    str | None   What the tool returned
  success   bool         Whether the tool invocation succeeded

You construct these yourself inside your test functions. The point is that you own the translation from your agent framework to this schema. agentcheck does not care whether you're using OpenAI, Gemini, CrewAI, LangGraph, or an HTTP endpoint. Once you produce an AgentResult, the entire assertion and reporting system works the same way.

from agentcheck import AgentResult, ToolCall, agent_test, expect

@agent_test(runs=3)
def test_booking_agent():
    result = run_my_agent("Book a table for 2 at Nobu, 8pm")

    return (
        expect(result)
        .used_tool("get_availability")
        .used_tool("book_table")
        .used_tools_in_order(["get_availability", "book_table"])
        .final_output_contains("confirmed")
        .did_not_claim_confirmation_without_tool("book_table")
        .verify()
    )
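The translation step hidden inside run_my_agent in the example above might look roughly like the sketch below. The two dataclasses are local stand-ins that mirror the schema listed earlier (in a real test you would import AgentResult and ToolCall from agentcheck, whose constructors may differ). The raw response shape with "text", "calls", "tool", "arguments", "result", and "ok" keys is hypothetical, as is the rule of counting one step per tool call.

```python
from dataclasses import dataclass, field
from typing import Optional


# Local stand-ins mirroring the schema above; real tests import these from agentcheck.
@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)
    output: Optional[str] = None
    success: bool = True


@dataclass
class AgentResult:
    input: str
    final_output: str
    tool_calls: list = field(default_factory=list)
    steps: int = 0
    errors: list = field(default_factory=list)
    latency: float = 0.0
    cost: Optional[float] = None
    metadata: dict = field(default_factory=dict)


def to_agent_result(prompt: str, raw: dict, latency: float) -> AgentResult:
    """Map a hypothetical framework response onto the shared schema."""
    calls = [
        ToolCall(
            name=call["tool"],
            args=call.get("arguments", {}),
            output=call.get("result"),
            success=call.get("ok", True),
        )
        for call in raw.get("calls", [])
    ]
    return AgentResult(
        input=prompt,
        final_output=raw.get("text", ""),
        tool_calls=calls,
        steps=len(calls),  # assumption: one reasoning step per tool call
        errors=list(raw.get("errors", [])),
        latency=latency,
    )


# Hypothetical raw response from some agent framework.
raw = {
    "text": "Your table at Nobu is confirmed for 8pm.",
    "calls": [
        {"tool": "get_availability", "arguments": {"party": 2}, "result": "20:00 open"},
        {"tool": "book_table", "arguments": {"time": "20:00"}, "result": "ok"},
    ],
}
result = to_agent_result("Book a table for 2 at Nobu, 8pm", raw, latency=0.41)
print([call.name for call in result.tool_calls])
```

Keeping this mapping in one function means a framework upgrade that renames a response field only needs a fix in one place, and every assertion keeps working against the same schema.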

The assertion API

There are 15 assertion types. They cover the things that actually break in agent pipelines.

Tool presence and frequency

expect(result).used_tool("search")
expect(result).used_tool_times("search", 1)
expect(result).used_tool_at_least("search", 1)
expect(result).used_tool_at_most("book", 2)
expect(result).did_not_use_tool("delete")
expect(result).used_any_tool()

Tool ordering

expect(result).used_tools_in_order(["get_weather", "book_restaurant"])

This assertion verifies that the tools appear in the given sequence. It does not require them to be the only tools called; it checks subsequence ordering.
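A subsequence check of this kind can be sketched in a few lines. This is an illustration of the technique, not agentcheck's internal implementation; the function name in_order is made up for the example.

```python
def in_order(expected: list, called: list) -> bool:
    """Return True if `expected` occurs within `called` as an ordered subsequence."""
    remaining = iter(called)
    # `name in remaining` consumes the iterator up to the match, so each
    # expected tool must be found after the previous one.
    return all(name in remaining for name in expected)


print(in_order(["get_weather", "book_restaurant"],
               ["get_weather", "search", "book_restaurant"]))
print(in_order(["get_weather", "book_restaurant"],
               ["book_restaurant", "get_weather"]))
```

The first call prints True because the extra search call in between does not matter; the second prints False because the order is reversed.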

Tool success

expect(result).tool_succeeded("book_table")

Output shape

expect(result).steps_less_than(5)
expect(result).finished_successfully()
expect(result).did_not_error()
expect(result).final_output_contains("confirmed")
expect(result).final_output_does_not_contain("I was unable to")
expect(result).final_output_matches_pattern(r"Order #\d+")

The hallucination check

expect(result).did_not_claim_confirmation_without_tool("book_table")

This one is particularly useful. It checks whether the final output contains confirmation-like language ("confirmed", "booked", "reserved", "scheduled") while the agent never actually called the specified tool. This is the pattern you see when a model decides to answer from memory instead of from its tools. It passes when either the output contains no confirmation language, or the tool was actually called.
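The pass/fail rule described above reduces to a small boolean check. The sketch below assumes case-insensitive, whole-word matching on the four example words; how agentcheck actually matches confirmation language is not specified here, and the function name is hypothetical.

```python
import re

# The four example words from the text; the real list may be longer.
CONFIRMATION_WORDS = ("confirmed", "booked", "reserved", "scheduled")
CONFIRMATION_RE = re.compile(r"\b(" + "|".join(CONFIRMATION_WORDS) + r")\b", re.IGNORECASE)


def unsupported_claim(final_output: str, tools_called: list, tool: str) -> bool:
    """True when the output sounds like a confirmation but `tool` was never called."""
    claims_success = CONFIRMATION_RE.search(final_output) is not None
    return claims_success and tool not in tools_called


# Fails the check: confirmation language, but book_table never ran.
print(unsupported_claim("Your table is booked!", ["get_availability"], "book_table"))
# Passes: the tool was actually called.
print(unsupported_claim("Your table is booked!", ["get_availability", "book_table"], "book_table"))
# Passes: no confirmation language at all.
print(unsupported_claim("No tables are free tonight.", [], "book_table"))
```

A keyword match like this is deliberately blunt: a sentence such as "not confirmed" would still count as confirmation language, so treat the check as a tripwire rather than a judgment about meaning.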

Collect mode

By default, the first failing assertion raises immediately. If you want to see all failures at once:

check = expect(result, collect=True)
check.used_tool("search")
check.used_tool("book")
check.finished_successfully()
check.verify()  # raises with all failures combined
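The difference between the two modes is easiest to see in a stripped-down version of the chaining pattern. The Check class below is a hypothetical stand-in written for this post, not agentcheck's expect implementation; in particular, returning True from verify() is an assumption.

```python
class Check:
    """Minimal fluent checker illustrating fail-fast versus collect mode."""

    def __init__(self, tools_called, collect=False):
        self.tools_called = list(tools_called)
        self.collect = collect
        self.failures = []

    def _record(self, ok, message):
        if not ok:
            if not self.collect:
                # Default mode: stop at the first failing assertion.
                raise AssertionError(message)
            self.failures.append(message)
        return self  # returning self is what makes chaining possible

    def used_tool(self, name):
        return self._record(name in self.tools_called,
                            f"expected tool {name!r} to be called")

    def verify(self):
        if self.failures:
            raise AssertionError("; ".join(self.failures))
        return True


check = Check(["search"], collect=True)
check.used_tool("book").used_tool("pay")
try:
    check.verify()
except AssertionError as exc:
    print(exc)
```

In collect mode both missing tools show up in a single error message, which saves a rerun when each agent invocation costs real time and tokens.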

Multiple runs and flakiness detection

This is where agentcheck differs meaningfully from just writing a pytest test and calling your agent once.

You specify how many times each test should run:

@agent_test(runs=5)
def test_booking_agent():
    ...

agentcheck runs the function 5 times, collects 5 AgentResult objects, and evaluates the assertion for each. But instead of simply reporting pass or fail, it computes a flakiness score.

The score is variance-based. Given N runs, it computes the variance of the binary pass/fail outcomes and scales it:

flakiness = min(variance * 4, 1.0)

A test that passes every time has a flakiness score of 0.0. A test that fails every time also has a score of 0.0. The maximum score of 1.0 occurs when the test passes exactly half the time, which is the highest-entropy outcome for a binary variable.
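The formula can be reproduced with the standard library. This sketch assumes population variance (dividing by N rather than N - 1), which is the reading consistent with the 3-out-of-5 example shown in the CLI output in this section; the function name is illustrative.

```python
from statistics import pvariance


def flakiness(outcomes: list) -> float:
    """Scaled variance of pass/fail outcomes: 0.0 is stable, 1.0 is a coin flip."""
    values = [1 if passed else 0 for passed in outcomes]
    # Population variance of a 0/1 variable is p * (1 - p), which peaks at 0.25.
    return float(min(pvariance(values) * 4, 1.0))


print(round(flakiness([True] * 5), 2))                 # always passes
print(round(flakiness([False] * 5), 2))                # always fails
print(round(flakiness([True] * 3 + [False] * 2), 2))   # 3 of 5 pass
print(round(flakiness([True, False, True, False]), 2)) # half pass
```

With a pass rate p the score works out to 4p(1 - p), so 3 passes out of 5 gives 4 x 0.6 x 0.4 = 0.96.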

Beyond the score, agentcheck tracks whether the tool call sequence itself is unstable across runs. If your agent called ["search", "book"] in one run and ["book"] in another, that is flagged separately as unstable tool paths. This distinction matters because you can have a test that always passes but with meaningfully different behavior across runs. That is still flakiness worth knowing about.
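Detecting an unstable tool path can be as simple as counting distinct tool-name sequences across runs. The sketch below assumes that only tool names and their order are compared, not arguments; the function name is hypothetical.

```python
def unstable_tool_paths(paths: list) -> bool:
    """True when the runs did not all follow the same sequence of tool names."""
    distinct = {tuple(path) for path in paths}
    return len(distinct) > 1


print(unstable_tool_paths([["search", "book"], ["search", "book"]]))
print(unstable_tool_paths([["search", "book"], ["book"]]))
```

The first call prints False and the second prints True, regardless of whether the assertions passed in either run.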

The CLI output looks like this:

PASS   test_booking_agent    5/5 runs   flaky=0.00   3 steps   0.412s
FLAKY  test_search_agent     3/5 runs   flaky=0.96   unstable_tool_paths

Regression detection

Knowing that your agent currently behaves well is only half the problem. The other half is knowing when it stops.

The workflow is two commands:

agentcheck test examples/
agentcheck bless examples/

bless saves the current test results as a baseline. On every subsequent run, agentcheck compares the new results against the baseline and surfaces regressions:

agentcheck test examples/ --fail-on-regression

Exit code 0 means clean. Exit code 2 means regression detected. The comparison is per-test and looks at success rate, flakiness score, and whether new failure categories appeared. A test that went from 5/5 to 3/5 is a regression. A test that introduced a new missing_required_tool failure category is a regression. A test that improved is not a regression and will not cause a non-zero exit.
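A per-test comparison along the three dimensions named above could be sketched as follows. The RunSummary type and regression_reasons function are hypothetical, and the strict "any drop counts" thresholds are an assumption; the tool's real rules, for example how it treats a test whose success rate rose while its flakiness also rose, may be more nuanced.

```python
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    """Hypothetical per-test summary, as might be stored in a baseline."""
    passed: int
    runs: int
    flakiness: float
    failure_categories: set = field(default_factory=set)

    @property
    def success_rate(self) -> float:
        return self.passed / self.runs if self.runs else 0.0


def regression_reasons(baseline: RunSummary, current: RunSummary) -> list:
    """List the ways `current` is worse than `baseline`; empty means no regression."""
    reasons = []
    if current.success_rate < baseline.success_rate:
        reasons.append(
            f"success rate fell from {baseline.passed}/{baseline.runs} "
            f"to {current.passed}/{current.runs}"
        )
    if current.flakiness > baseline.flakiness:
        reasons.append(
            f"flakiness rose from {baseline.flakiness:.2f} to {current.flakiness:.2f}"
        )
    new_categories = current.failure_categories - baseline.failure_categories
    if new_categories:
        reasons.append("new failure categories: " + ", ".join(sorted(new_categories)))
    return reasons


baseline = RunSummary(passed=5, runs=5, flakiness=0.0)
current = RunSummary(passed=3, runs=5, flakiness=0.96,
                     failure_categories={"missing_required_tool"})
for reason in regression_reasons(baseline, current):
    print(reason)
```

Swapping the two arguments returns an empty list, matching the rule that an improvement never counts as a regression.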

This gives you something to wire into CI. When you deploy a new version of your agent or upgrade the underlying model, the pipeline will tell you exactly which tests regressed and by how much.
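One way to wire this into a pipeline is a small gate script that runs the command and translates the exit code. The sketch below uses only the command and exit codes documented in this post (0, 2, and the 3 described in the CLI section); the helper names are made up, and any other exit code is reported as unrecognized rather than guessed at.

```python
import subprocess

# Exit codes as documented in this post; anything else is treated as unknown.
EXIT_MEANINGS = {
    0: "clean",
    2: "regression detected",
    3: "no tests matched the filter",
}


def describe_exit(code: int) -> str:
    return EXIT_MEANINGS.get(code, f"unrecognized exit code {code}")


def run_gate(path: str = "examples/") -> int:
    """Run the regression check and print a one-line verdict for the CI log."""
    completed = subprocess.run(["agentcheck", "test", path, "--fail-on-regression"])
    print(f"agentcheck: {describe_exit(completed.returncode)}")
    return completed.returncode
```

A CI step would then call sys.exit(run_gate()) so that the pipeline fails exactly when agentcheck does. This assumes the agentcheck executable is on the PATH of the CI runner and that a baseline has already been blessed and made available to the job.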

Failure taxonomy

agentcheck uses a structured taxonomy for classifying failures rather than just recording assertion text. When a test fails, each failure is tagged with a category:

missing_required_tool       The agent did not call a tool that was required
unsupported_success_claim   The agent claimed success without calling the tool
wrong_tool_order            Tools were called in the wrong sequence
tool_failure                A tool call explicitly returned failure
output_missing_content      Required text was absent from the final output
unexpected_output_content   Forbidden text appeared in the final output
step_budget_exceeded        The agent used more steps than allowed
error_in_run                An exception occurred during the agent run

These categories appear in the CLI output, in the JSON report, and in the HTML report. They are what the regression comparison uses when it says "new failure category appeared."
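To make the idea concrete, here is a sketch of how failed assertions could be rolled up into these categories. The mapping from assertion names to categories is inferred from the descriptions above and is an assumption, as are the "uncategorized" fallback and the function name; agentcheck's real mapping may differ.

```python
from collections import Counter

# Inferred mapping, not taken from agentcheck's source.
CATEGORY_BY_ASSERTION = {
    "used_tool": "missing_required_tool",
    "did_not_claim_confirmation_without_tool": "unsupported_success_claim",
    "used_tools_in_order": "wrong_tool_order",
    "tool_succeeded": "tool_failure",
    "final_output_contains": "output_missing_content",
    "final_output_does_not_contain": "unexpected_output_content",
    "steps_less_than": "step_budget_exceeded",
    "did_not_error": "error_in_run",
}


def categorize(failed_assertions: list) -> Counter:
    """Count failures per category across all runs of a test."""
    return Counter(
        CATEGORY_BY_ASSERTION.get(name, "uncategorized") for name in failed_assertions
    )


failed = ["used_tool", "used_tool", "steps_less_than"]
print(dict(categorize(failed)))
```

Comparing the key sets of two such counters, one from the baseline and one from the current run, is enough to detect that a new failure category has appeared.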

The CLI

agentcheck test <path>                       Run tests at path
agentcheck test <path> -k <filter>           Run only tests matching keyword
agentcheck test <path> --html out.html       Generate HTML report
agentcheck test <path> --fail-on-regression  Exit 2 on regression

agentcheck bless <path>                      Save current results as baseline
agentcheck baseline list                     Show all saved baselines
agentcheck baseline inspect <file>           Inspect a baseline file
agentcheck baseline delete <file>            Remove a baseline

agentcheck report                            Re-render report from last run
agentcheck report --html out.html            HTML from last run

agentcheck history list                      Show run history
agentcheck history show <id>                 Show a specific run

agentcheck contract init <name>              Create an agent contract file
agentcheck contract validate <file>          Validate a contract file

agentcheck generate scenarios <contract> --output <file> --stub <file>
                                             Generate test scenarios from a contract

agentcheck config init                       Create agentcheck.json config file

Test discovery works like pytest: any file matching test_*.py or *_test.py, any function decorated with @agent_test. The -k filter matches substrings of test names. Exit code 3 means no tests matched the filter.
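The two matching rules can be sketched with the standard library. This illustrates the file-pattern and substring behavior described above under the assumption that matching is purely on names; the helper names are hypothetical and this is not agentcheck's discovery code.

```python
from fnmatch import fnmatch
from typing import Optional


def is_test_file(filename: str) -> bool:
    """File-name rule: test_*.py or *_test.py."""
    return fnmatch(filename, "test_*.py") or fnmatch(filename, "*_test.py")


def select_tests(names: list, keyword: Optional[str] = None) -> list:
    """Apply a -k style filter: keep tests whose name contains the keyword."""
    if keyword is None:
        return list(names)
    return [name for name in names if keyword in name]


print([f for f in ["test_booking.py", "search_test.py", "helpers.py"] if is_test_file(f)])
print(select_tests(["test_booking_agent", "test_search_agent"], "booking"))
```

An empty result from the filter step corresponds to the exit code 3 case.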

Agent contracts and scenario generation

An agent contract is a JSON document that describes what a named agent is supposed to do:

{
  "name": "booking_agent",
  "description": "Checks availability then books a restaurant.",
  "schema_version": "1",
  "expected_tools": ["get_availability", "book_table"],
  "required_tool_order": ["get_availability", "book_table"],
  "step_budget": 5,
  "forbidden_claims": ["confirmed", "booked"],
  "scenario_tags": ["happy_path", "tool_failure", "edge_case"]
}

agentcheck contract validate checks the contract for structural correctness and internal consistency (for example, it will flag a required_tool_order entry that names a tool not in expected_tools).
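The consistency rule mentioned above is straightforward to express. The sketch below checks only that one rule plus the presence of two fields; which fields are actually mandatory is an assumption, and validate_contract is a hypothetical name rather than the tool's API.

```python
import json


def validate_contract(contract: dict) -> list:
    """Return a list of problems; an empty list means the contract looks consistent."""
    problems = []
    for key in ("name", "expected_tools"):  # assumed to be required
        if key not in contract:
            problems.append(f"missing field: {key}")
    expected = set(contract.get("expected_tools", []))
    for tool in contract.get("required_tool_order", []):
        if tool not in expected:
            problems.append(
                f"required_tool_order names {tool!r}, which is not in expected_tools"
            )
    return problems


contract = json.loads("""
{
  "name": "booking_agent",
  "expected_tools": ["get_availability", "book_table"],
  "required_tool_order": ["get_availability", "cancel_table"]
}
""")
print(validate_contract(contract))
```

Here the deliberately broken contract yields one problem, because cancel_table appears in the ordering but was never declared as an expected tool.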

From a contract, you can generate test scenarios:

agentcheck generate scenarios booking_contract.json \
  --output scenarios.json \
  --stub test_booking_stub.py

This produces a scenario pack covering the specified tags and a test file stub with the assertion skeletons filled in based on the contract definition. You provide the agent invocation; agentcheck provides the structure.

The adapter layer

The adapter layer solves the translation problem for the most common frameworks.

HttpAdapter

For agents deployed behind an HTTP endpoint. Uses only the Python standard library.

from agentcheck.adapters.http import HttpAdapter

adapter = HttpAdapter(
    url="https://my-agent.example.com/run",
    request_key="message",
    response_output_key="answer",
    response_tools_key="tool_calls",
    auth_env_var="MY_AGENT_API_KEY",
)

result = adapter.run_input("Book a table for 2")

The adapter accepts several key aliases for tool call fields in the response body to accommodate different API conventions: name or tool; args, arguments, or input; output or result; success or ok. String-only tool call arrays are also handled.
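Alias handling of this kind usually comes down to "take the first key that is present". The sketch below uses the alias lists from the paragraph above; the lookup order, the defaults for missing fields (empty args, no output, success assumed True), and the function names are all assumptions rather than HttpAdapter's documented behavior.

```python
def first_present(data: dict, keys: tuple, default=None):
    """Return the value of the first key found in `data`, else `default`."""
    for key in keys:
        if key in data:
            return data[key]
    return default


def normalize_tool_call(raw) -> dict:
    """Turn one entry of a response's tool-call array into a uniform dict."""
    if isinstance(raw, str):
        # String-only arrays carry just the tool name.
        return {"name": raw, "args": {}, "output": None, "success": True}
    return {
        "name": first_present(raw, ("name", "tool"), ""),
        "args": first_present(raw, ("args", "arguments", "input"), {}),
        "output": first_present(raw, ("output", "result")),
        "success": bool(first_present(raw, ("success", "ok"), True)),
    }


print(normalize_tool_call({"tool": "search", "arguments": {"q": "nobu"}, "ok": False}))
print(normalize_tool_call("book_table"))
```

Checking key presence rather than truthiness matters here: a tool that legitimately returned an empty string or reported success as False should not fall through to the next alias.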

For environment-driven configuration:

adapter = HttpAdapter.from_env(
    url_env_var="AGENT_ENDPOINT",
    auth_env_var="AGENT_API_KEY",
)

CrewAIAdapter

from agentcheck import agent_test, expect
from agentcheck.adapters.crewai import CrewAIAdapter

adapter = CrewAIAdapter()

@agent_test(runs=3)
def test_research_crew():
    crew = build_research_crew()
    result = adapter.run(crew, "Research the latest in AI safety")
    return expect(result).used_any_tool().verify()

CrewAIAdapter.run() calls crew.kickoff(). CrewAIAdapter.run_agent() wraps a single CrewAI Agent using execute_task. The adapter normalizes tasks_output, extracts errors from the raw result, and reads token usage for cost tracking. crewai is a soft dependency; if it is not installed, the adapter raises a clear ImportError at call time rather than at import time.

LangGraphAdapter

from agentcheck import agent_test, expect
from agentcheck.adapters.langgraph import LangGraphAdapter

adapter = LangGraphAdapter(
    agent_factory=build_research_graph,
    input_key="messages",
    output_key="messages",
)

@agent_test(runs=3)
def test_research_graph(agent):
    result = agent.run("What are the implications of AGI?")
    return expect(result).used_any_tool().finished_successfully().verify()

The adapter wraps a compiled StateGraph and handles the HumanMessage conversion and streaming output collection internally.

pytest integration

agentcheck ships a pytest plugin. If you prefer running your agent tests through pytest rather than the agentcheck CLI, you can do that without changing your test code:

pytest examples/ -q

The plugin registers as agentcheck in the pytest11 entry point. Functions decorated with @agent_test are discovered by pytest, and the runs and assertion logic behave identically to the CLI.

This means agentcheck tests can live alongside your regular unit tests in the same test suite and be run in the same CI step. The __test__ = False attribute that @agent_test sets prevents pytest from double-discovering the function.

Reports

Every agentcheck test run writes five artifacts under .agentcheck/:

.agentcheck/reports/latest.json     Full structured report
.agentcheck/reports/latest.md       Markdown summary
.agentcheck/reports/latest.html     HTML report (always generated)
.agentcheck/traces/latest.json      Per-run trace with AgentResult data
.agentcheck/history.json            Append-only run history

The HTML report includes per-test pass rates, flakiness scores, unstable tool path flags, failure taxonomy breakdowns, and tool call timelines. You can also re-render it on demand from the last run:

agentcheck report --html my_report.html

The history commands let you compare runs over time:

agentcheck history list
agentcheck history show <run-id>

Architecture diagram

                        Your test file
                             |
               @agent_test(runs=N, agent_factory=...)
                             |
                      agentcheck CLI
                    /         |         \
           discovery     runner      reporter
               |             |            |
        test_*.py files   runs N times  JSON / MD / HTML
                             |
                    [AgentResult, AgentResult, ...]
                             |
                    assertion evaluator
                             |
                    flakiness scorer
                             |
                    regression comparator
                          (baseline)
                             |
                       exit code + report

The runner never touches your agent directly. It calls the function you wrote, which calls your agent, which returns an AgentResult. That boundary is where all the adapter work happens.
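The repeat-and-collect loop at the center of the diagram can be sketched as below. This is a simplified illustration with a hypothetical run_repeated helper: it only treats AssertionError as a failed run, whereas the failure taxonomy suggests the real runner also records other exceptions as error_in_run.

```python
from typing import Callable


def run_repeated(test_fn: Callable[[], object], runs: int) -> list:
    """Call the test function `runs` times and record pass/fail per run."""
    outcomes = []
    for _ in range(runs):
        try:
            test_fn()  # the test calls the agent and verifies its assertions
        except AssertionError:
            outcomes.append(False)
        else:
            outcomes.append(True)
    return outcomes


# Demo with a fake test that fails on every third call.
call_count = {"n": 0}


def sometimes_fails():
    call_count["n"] += 1
    assert call_count["n"] % 3 != 0, "simulated agent failure"


outcomes = run_repeated(sometimes_fails, runs=6)
print(outcomes)  # [True, True, False, True, True, False]
```

The resulting list of booleans is exactly the input that the flakiness scorer and the regression comparator need, which is why the runner can stay ignorant of how the agent itself is invoked.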

Invarium

One of the first teams to take a close look at agentcheck before public release was the engineering team at Invarium. Invarium builds infrastructure for enterprise AI deployments, and the problem they ran into is the same one everyone building on top of LLMs eventually hits: how do you know your agent is still doing what it was doing last week?

Their interest in agentcheck specifically was around the regression detection workflow and the adapter layer. When you are running agents inside an enterprise environment, the agent is often behind an internal HTTP endpoint that your test suite never directly touches. The HttpAdapter pattern, where you describe the request/response mapping once and then write test assertions against the normalized AgentResult, fits naturally into that architecture.

They are evaluating agentcheck for integration into their existing CI pipelines alongside their current LLM evaluation tooling.

What agentcheck does not do

It is worth being direct about the scope.

agentcheck does not evaluate output quality. It does not use another LLM to judge whether the response was good. It does not do semantic similarity scoring. Those are valid approaches for certain problems, but they have their own non-determinism and cost. agentcheck is about behavioral correctness: did the agent use the right tools, in the right order, without hallucinating success. That is a narrower and more tractable question, and the answer is deterministic once you have the AgentResult.

It does not handle multi-agent orchestration natively. If your system is a graph of agents, you test each node independently or wrap the full graph and treat it as a single agent. The adapter layer is the right place to handle that translation.

It does not replace prompt evaluation. If you need to judge answer quality, factual accuracy, or response style, you need a different tool. agentcheck is a complement to that, not a replacement.

Getting started

pip install pygent-test

agentcheck test your_tests/

For framework-specific adapters:

pip install "pygent-test[openai]"
pip install "pygent-test[langgraph]"
pip install "pygent-test[crewai]"

Source and documentation are on GitHub at github.com/ashutosh-rath02/pygent-test. Issues and contributions are open.

The version in this post is 0.2.2. The library requires Python 3.10 or newer.
