GroundProbe profiles every claim your agent makes (Claude Code, Cursor, Aider, anything) and tags each one as grounded, inferred, or speculative. No SaaS. No API key required. MIT.
Every assertive sentence in your agent's transcript, color-coded by epistemic class. Click each claim to see the witnesses (tool results) GroundProbe used to ground it.
Three signals, no LLM required.
Splits agent text into sentence-sized claims, taking care not to break inside code spans, URLs, file paths, or abbreviations.
N-gram overlap, longest common substring, and TF-IDF cosine — blended with a windowed local-match boost.
Maps the blended score to grounded / inferred / speculative / meta. Optional LLM judge cleans up the ambiguous band.
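The blend-and-classify step can be sketched in a few lines. This is an illustration only: function names, weights, and thresholds are hypothetical, and the TF-IDF cosine signal, the windowed local-match boost, and the meta class are omitted for brevity.

```javascript
// Illustrative sketch, not GroundProbe's actual implementation.

// Word-bigram overlap (Jaccard) between a claim and a witness.
function ngramOverlap(a, b, n = 2) {
  const grams = (s) => {
    const w = s.toLowerCase().split(/\s+/).filter(Boolean);
    const out = new Set();
    for (let i = 0; i + n <= w.length; i++) out.add(w.slice(i, i + n).join(" "));
    return out;
  };
  const A = grams(a), B = grams(b);
  if (A.size === 0 || B.size === 0) return 0;
  let inter = 0;
  for (const g of A) if (B.has(g)) inter++;
  return inter / (A.size + B.size - inter);
}

// Longest common substring length, normalized by claim length.
function lcsScore(a, b) {
  const s = a.toLowerCase(), t = b.toLowerCase();
  let best = 0;
  let prev = new Array(t.length + 1).fill(0);
  for (let i = 1; i <= s.length; i++) {
    const cur = new Array(t.length + 1).fill(0);
    for (let j = 1; j <= t.length; j++) {
      if (s[i - 1] === t[j - 1]) {
        cur[j] = prev[j - 1] + 1;
        if (cur[j] > best) best = cur[j];
      }
    }
    prev = cur;
  }
  return s.length ? best / s.length : 0;
}

// Take the best blended score over all witnesses, then threshold it
// into an epistemic class (example weights and cut-offs).
function classify(claim, witnesses) {
  let score = 0;
  for (const w of witnesses) {
    score = Math.max(score, 0.5 * ngramOverlap(claim, w) + 0.5 * lcsScore(claim, w));
  }
  if (score >= 0.6) return "grounded";
  if (score >= 0.3) return "inferred";
  return "speculative";
}
```

A claim quoted nearly verbatim from a tool result scores high on both signals; a claim with no witness at all scores zero and lands in the speculative bucket.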
One file. Dark mode. Filters. Search. Witness drill-down. Send it as a PR comment, archive it, share it.
Three integration paths. Pick whichever fits your workflow — they all use the same engine.
Install the CLI and generate a report:

```shell
npm install -g groundprobe
groundprobe analyze session.jsonl --html report.html
```

Add to `.github/workflows/groundprobe.yml`:
```yaml
uses: aureliocpr-ctrl/groundprobe@v0.2
with:
  transcript: agent-runs/session.jsonl
  max-hallucination: '0.30'
```
Wire it into Claude Code, Cursor, or any MCP client:
In `.claude/settings.json`:

```json
{
  "mcpServers": {
    "groundprobe": {
      "command": "node",
      "args": ["groundprobe/dist/mcp/server.js"]
    }
  }
}
```
Anywhere an AI agent's claims need an evidence trail.
Fail PRs where the agent invented more than X% of factual claims. Run on every push, archive the report.
Ship the HTML report alongside every agent-authored PR so reviewers can see which sentences came from a tool result and which didn't.
An agent says something surprising? Profile its session and find the exact sentence with no evidence in seconds.
Compare hallucination rates across models, prompts, agent frameworks. Aggregate dashboard included.
By default, no. The deterministic classifier blends three text-similarity signals and runs offline. You can opt in to an LLM judge (Ollama, OpenAI, Anthropic) for tie-breaking on ambiguous claims.
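The tie-breaking hand-off is easy to picture; a minimal sketch, assuming a hypothetical judge callback and illustrative band edges (neither is GroundProbe's actual API):

```javascript
// Example band; the real thresholds may differ.
const AMBIGUOUS_LOW = 0.3;
const AMBIGUOUS_HIGH = 0.6;

// judgeAmbiguous, when provided, is a callback that asks the configured
// LLM (Ollama / OpenAI / Anthropic) to relabel the claim. Without it,
// the deterministic classifier's verdict stands and everything is offline.
function finalLabel(blendedScore, judgeAmbiguous) {
  if (blendedScore >= AMBIGUOUS_HIGH) return "grounded";
  if (blendedScore < AMBIGUOUS_LOW) return "speculative";
  return judgeAmbiguous ? judgeAmbiguous() : "inferred"; // offline default
}
```

Only claims inside the ambiguous band ever reach the judge, so enabling it adds cost proportional to how uncertain the deterministic pass was, not to transcript length.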
Those are observability platforms — they trace every LLM call. GroundProbe is an epistemic profiler: it tells you, for each sentence the agent emitted, whether it has supporting evidence. Different layer, complementary.
Yes. Five adapters auto-detect the input format. For custom agents, GroundProbe ships a native JSON / JSONL schema documented in the README.
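As a rough illustration only (field names here are hypothetical; the authoritative schema is in the README), a native JSONL transcript carries one message per line, with tool results available as witnesses:

```json
{"role": "assistant", "text": "The test suite passes with 212 tests."}
{"role": "tool", "name": "bash", "output": "212 passing (3s)"}
```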
0.72 macro-F1 on the bundled synthetic micro-benchmark (18 items). With an LLM judge attached on the ambiguous band, expect ~0.85+. The benchmark is reproducible: `node bench/run-bench.mjs`.
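For context, macro-F1 is the unweighted mean of per-class F1 across the label set, so rare classes count as much as common ones. A quick sketch of the metric (the metric only, not the bench harness):

```javascript
// Macro-F1: average per-class F1 with equal weight per class.
function macroF1(gold, pred, labels) {
  let sum = 0;
  for (const c of labels) {
    let tp = 0, fp = 0, fn = 0;
    for (let i = 0; i < gold.length; i++) {
      if (pred[i] === c && gold[i] === c) tp++;
      else if (pred[i] === c) fp++;
      else if (gold[i] === c) fn++;
    }
    const p = tp + fp ? tp / (tp + fp) : 0; // precision for class c
    const r = tp + fn ? tp / (tp + fn) : 0; // recall for class c
    sum += p + r ? (2 * p * r) / (p + r) : 0;
  }
  return sum / labels.length;
}
```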
MIT-licensed open source. No SaaS, no telemetry, no account.