GroundProbe profiles every claim your agent makes (Claude Code, Cursor, Aider, anything) and tags each one as grounded, inferred, or speculative. No SaaS. No API key required. MIT.
Every assertive sentence in your agent's transcript, color-coded by epistemic class. Click each claim to see the witnesses (tool results) GroundProbe used to ground it.
Three signals, no LLM required.
Splits agent text into sentence-sized claims, taking care not to break inside code spans, URLs, file paths, or abbreviations.
N-gram overlap, longest common substring, and TF-IDF cosine — blended with a windowed local-match boost.
Maps the blended score to grounded / inferred / speculative / meta. Optional LLM judge cleans up the ambiguous band.
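The blend-and-classify step can be sketched in a few lines. This is an illustration only: function names, weights, and thresholds are hypothetical, and the TF-IDF cosine signal, the windowed local-match boost, and the meta class are omitted for brevity.

```javascript
// Illustrative sketch, not GroundProbe's actual implementation.

// Word-bigram overlap (Jaccard) between a claim and a witness.
function ngramOverlap(a, b, n = 2) {
  const grams = (s) => {
    const w = s.toLowerCase().split(/\s+/).filter(Boolean);
    const out = new Set();
    for (let i = 0; i + n <= w.length; i++) out.add(w.slice(i, i + n).join(" "));
    return out;
  };
  const A = grams(a), B = grams(b);
  if (A.size === 0 || B.size === 0) return 0;
  let inter = 0;
  for (const g of A) if (B.has(g)) inter++;
  return inter / (A.size + B.size - inter);
}

// Longest common substring length, normalized by claim length.
function lcsScore(a, b) {
  const s = a.toLowerCase(), t = b.toLowerCase();
  let best = 0;
  let prev = new Array(t.length + 1).fill(0);
  for (let i = 1; i <= s.length; i++) {
    const cur = new Array(t.length + 1).fill(0);
    for (let j = 1; j <= t.length; j++) {
      if (s[i - 1] === t[j - 1]) {
        cur[j] = prev[j - 1] + 1;
        if (cur[j] > best) best = cur[j];
      }
    }
    prev = cur;
  }
  return s.length ? best / s.length : 0;
}

// Take the best blended score over all witnesses, then threshold it
// into an epistemic class (example weights and cut-offs).
function classify(claim, witnesses) {
  let score = 0;
  for (const w of witnesses) {
    score = Math.max(score, 0.5 * ngramOverlap(claim, w) + 0.5 * lcsScore(claim, w));
  }
  if (score >= 0.6) return "grounded";
  if (score >= 0.3) return "inferred";
  return "speculative";
}
```

A claim quoted nearly verbatim from a tool result scores high on both signals; a claim with no witness at all scores zero and lands in the speculative bucket.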
One file. Dark mode. Filters. Search. Witness drill-down. Send it as a PR comment, archive it, share it.
Three integration paths. Pick whichever fits your workflow — they all use the same engine.
Install the CLI and generate a report:

```shell
npm install -g groundprobe
groundprobe analyze session.jsonl --html report.html
```

Add to `.github/workflows/groundprobe.yml`:
```yaml
uses: aureliocpr-ctrl/groundprobe@v0.2
with:
  transcript: agent-runs/session.jsonl
  max-hallucination: '0.30'
```
Wire it into Claude Code, Cursor, or any MCP client:
In `.claude/settings.json`:

```json
{
  "mcpServers": {
    "groundprobe": {
      "command": "node",
      "args": ["groundprobe/dist/mcp/server.js"]
    }
  }
}
```
Anywhere an AI agent's claims need an evidence trail.
Fail PRs where the agent invented more than X% of factual claims. Run on every push, archive the report.
Ship the HTML report alongside every agent-authored PR so reviewers can see which sentences came from a tool result and which didn't.
An agent says something surprising? Profile its session and find the exact sentence with no evidence in seconds.
Compare hallucination rates across models, prompts, agent frameworks. Aggregate dashboard included.
By default, no. The deterministic classifier blends three text-similarity signals and runs offline. You can opt in to an LLM judge (Ollama, OpenAI, Anthropic) for tie-breaking on ambiguous claims.
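The tie-breaking hand-off is easy to picture; a minimal sketch, assuming a hypothetical judge callback and illustrative band edges (neither is GroundProbe's actual API):

```javascript
// Example band; the real thresholds may differ.
const AMBIGUOUS_LOW = 0.3;
const AMBIGUOUS_HIGH = 0.6;

// judgeAmbiguous, when provided, is a callback that asks the configured
// LLM (Ollama / OpenAI / Anthropic) to relabel the claim. Without it,
// the deterministic classifier's verdict stands and everything is offline.
function finalLabel(blendedScore, judgeAmbiguous) {
  if (blendedScore >= AMBIGUOUS_HIGH) return "grounded";
  if (blendedScore < AMBIGUOUS_LOW) return "speculative";
  return judgeAmbiguous ? judgeAmbiguous() : "inferred"; // offline default
}
```

Only claims inside the ambiguous band ever reach the judge, so enabling it adds cost proportional to how uncertain the deterministic pass was, not to transcript length.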
Those are observability platforms — they trace every LLM call. GroundProbe is an epistemic profiler: it tells you, for each sentence the agent emitted, whether it has supporting evidence. Different layer, complementary.
Yes. Five adapters auto-detect the input format. For custom agents, GroundProbe ships a native JSON / JSONL schema documented in the README.
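As a rough illustration only (field names here are hypothetical; the authoritative schema is in the README), a native JSONL transcript carries one message per line, with tool results available as witnesses:

```json
{"role": "assistant", "text": "The test suite passes with 212 tests."}
{"role": "tool", "name": "bash", "output": "212 passing (3s)"}
```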
0.72 macro-F1 on the bundled synthetic micro-benchmark (18 items). With an LLM judge attached on the ambiguous band, expect ~0.85+. The benchmark is reproducible: `node bench/run-bench.mjs`.
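For context, macro-F1 is the unweighted mean of per-class F1 across the label set, so rare classes count as much as common ones. A quick sketch of the metric (the metric only, not the bench harness):

```javascript
// Macro-F1: average per-class F1 with equal weight per class.
function macroF1(gold, pred, labels) {
  let sum = 0;
  for (const c of labels) {
    let tp = 0, fp = 0, fn = 0;
    for (let i = 0; i < gold.length; i++) {
      if (pred[i] === c && gold[i] === c) tp++;
      else if (pred[i] === c) fp++;
      else if (gold[i] === c) fn++;
    }
    const p = tp + fp ? tp / (tp + fp) : 0; // precision for class c
    const r = tp + fn ? tp / (tp + fn) : 0; // recall for class c
    sum += p + r ? (2 * p * r) / (p + r) : 0;
  }
  return sum / labels.length;
}
```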
MIT-licensed open source. No SaaS, no telemetry, no account.