CLI — Early Access

The infrastructure layer for autonomous engineering.

Paniolo builds and evolves the harness around your AI coding agents — giving them project intelligence, observability, and the structural guardrails that turn generated code into production-grade output.

77.0% Pass@1: Terminal-Bench 2, AHE (10 iterations), ICLR 2026
12% fewer tokens than the seed harness (SWE-bench-verified transfer)
Up to +10.1 pp cross-family performance gain (three alternate model families)
The Evolution Loop
Tasks: Real-world coding benchmarks
Trajectories: Agent execution traces
Structured Experience: Distilled root causes
Harness Mutation: Falsifiable edits with predictions
Verified Improvement: Measured, reverted if failing
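
The five stages above can be read as one loop. The sketch below is a toy illustration only, not Paniolo's or the AHE paper's implementation: the `Harness` class, the reduction of a harness to a set of named guardrails, and the task format `(name, failure_cause)` are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Harness:
    # Hypothetical stand-in: a harness reduced to a set of named guardrails.
    guardrails: frozenset = frozenset()

def run_tasks(harness, tasks):
    """Tasks -> Trajectories: one (task, passed, cause) record per task."""
    trajectories = []
    for name, cause in tasks:
        # Toy rule: a task passes if it has no failure cause or the cause is guarded.
        passed = cause is None or cause in harness.guardrails
        trajectories.append((name, passed, None if passed else cause))
    return trajectories

def pass_rate(trajectories):
    return sum(passed for _, passed, _ in trajectories) / len(trajectories)

def distill(trajectories):
    """Trajectories -> Structured Experience: failure count per root cause."""
    counts = {}
    for _, passed, cause in trajectories:
        if not passed:
            counts[cause] = counts.get(cause, 0) + 1
    return counts

def mutate(harness, experience):
    """Structured Experience -> Harness Mutation: guard the top root cause."""
    if not experience:
        return harness
    top = max(experience, key=experience.get)
    return Harness(harness.guardrails | {top})

def evolve(harness, tasks, rounds):
    """Run the loop, keeping a candidate only when its measured score improves."""
    best = pass_rate(run_tasks(harness, tasks))
    for _ in range(rounds):
        candidate = mutate(harness, distill(run_tasks(harness, tasks)))
        score = pass_rate(run_tasks(candidate, tasks))
        if score > best:
            harness, best = candidate, score  # Verified Improvement: keep
        # otherwise the candidate is discarded, i.e. reverted
    return harness, best
```

The comparison against `best` is the part that matters: an edit that does not measurably help never becomes part of the harness.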
Core Principles

The foundation of reliable AI engineering

Built from years of production experience — before the industry named it harness engineering.

01

Every AI error is infrastructure debt.

When an agent makes a mistake, the right response is not to fix the output and move on. It is to update the harness so the same failure is structurally prevented. Errors are signals. Signals become infrastructure.

02

Harness quality is model-agnostic.

A well-engineered harness around a given model outperforms a stronger base model paired with a weak harness. This was empirically validated at ICLR 2026, and it is the core premise Paniolo is built on. The infrastructure layer is the lever.

03

Observability before optimization.

You cannot reliably improve what you cannot see. Every component, trajectory, and decision in the harness must be auditable before any evolution loop can be trusted. Structure precedes speed.

Research-Validated Framework

Three pillars of observability

Validated at ICLR 2026. The methodology that makes autonomous harness evolution reliable.

Component Observability

Every harness component — capstone file, tool descriptions, middleware, skills, sub-agent configs, long-term memory — gets a file-level representation. The action space becomes explicit, auditable, and revertible.
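
One way to picture a file-level representation is a content-hash snapshot of the harness directory. This is a minimal sketch under assumptions: the idea that every component lives as a file under a single root directory, and the function names `snapshot` and `changed`, are illustrative and not taken from Paniolo or the paper.

```python
import hashlib
from pathlib import Path

def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to the SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def changed(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Paths that were added, removed, or modified between two snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )
```

Comparing a snapshot taken before an edit with one taken after it gives an explicit list of touched components, which is what makes an edit auditable and a revert well-defined.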

Experience Observability

Raw agent traces are distilled into a structured evidence corpus — root causes your evolving harness can act on. The intelligence layer compounds over time without requiring manual inspection.
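
As a rough illustration of distillation, the sketch below groups failing traces by a root-cause label. The trace fields (`task`, `passed`, `error`), the keyword rules, and the label names are hypothetical; a rule table is used here only as a simple stand-in for whatever distillation step a real system applies.

```python
from collections import defaultdict

# Hypothetical keyword rules mapping error text to a root-cause label.
RULES = [
    ("timeout", "command_timeout"),
    ("no such file", "wrong_path"),
    ("permission denied", "missing_permissions"),
]

def root_cause(error: str) -> str:
    """Return the first matching label, or 'unclassified' if no rule applies."""
    lowered = error.lower()
    for needle, label in RULES:
        if needle in lowered:
            return label
    return "unclassified"

def build_corpus(traces):
    """Group failing traces by root cause, keeping task ids as evidence."""
    corpus = defaultdict(list)
    for trace in traces:
        if not trace["passed"]:
            corpus[root_cause(trace["error"])].append(trace["task"])
    return dict(corpus)
```

Each entry in the resulting corpus pairs a root cause with the tasks that exhibited it, so a later harness edit can cite concrete evidence.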

Decision Observability

Every harness edit is paired with a self-declared prediction, verified against next-round outcomes. Each edit becomes a falsifiable contract. Ineffective edits are rolled back automatically.
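
A falsifiable contract can be sketched as a prediction record plus a check against the next round's results. Everything here is an assumption for illustration: the `EditPrediction` fields, the per-task pass/fail maps, and the keep-or-rollback rule are not the paper's or Paniolo's actual criteria.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EditPrediction:
    edit_id: str
    should_fix: frozenset               # tasks the edit claims to turn from fail to pass
    may_break: frozenset = frozenset()  # regressions declared acceptable up front

def verify(pred: EditPrediction, before: dict, after: dict) -> str:
    """Compare per-task results (task id -> passed) and return 'keep' or 'rollback'."""
    fixed = {t for t, ok in after.items() if ok and not before.get(t, False)}
    broken = {t for t, ok in after.items() if not ok and before.get(t, False)}
    prediction_held = pred.should_fix <= fixed          # every promised fix happened
    no_surprise_regressions = broken <= pred.may_break  # only declared breakage occurred
    return "keep" if prediction_held and no_surprise_regressions else "rollback"
```

Because the prediction is written down before the next round runs, the outcome can contradict it, and that is what makes the edit testable rather than merely plausible.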

69.7% → 77.0%: Pass@1 lift via AHE (Terminal-Bench 2 · ICLR 2026)
10 iters: Autonomous evolution rounds (~32 hours to full campaign)
-12%: Token reduction vs baseline (better results, lower cost)
3 families: Cross-model portability (GPT · Qwen · Gemini · DeepSeek)
Peer-Reviewed Foundation

Built on published science, not speculation

Paniolo is grounded in Agentic Harness Engineering (AHE), published at ICLR 2026 by researchers from Fudan University, Peking University, and Shanghai Qiji Zhifeng. The paper introduces a closed-loop observability framework that autonomously evolves coding-agent harnesses without base-model retraining.

Ten iterations lift pass@1 from 69.7% to 77.0%, surpassing every human-designed baseline — OpenCode, Terminus-2, and Codex — and both self-evolving baselines. The frozen harness transfers to SWE-bench-verified and yields consistent gains of +5.1 to +10.1 pp across three alternate model families.
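
For readers comparing percentage points with percent, the pass@1 figures above work out as follows. This is plain arithmetic on the two quoted numbers; the variable names are illustrative.

```python
seed_pass_at_1 = 69.7     # seed harness, percent
evolved_pass_at_1 = 77.0  # after ten iterations, percent

# Absolute lift in percentage points.
absolute_pp = round(evolved_pass_at_1 - seed_pass_at_1, 1)  # 7.3

# Relative lift as a percent of the seed score.
relative_pct = round((evolved_pass_at_1 - seed_pass_at_1) / seed_pass_at_1 * 100, 1)  # 10.5
```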

The evolved harness uses 12% fewer tokens than the seed. As token prices rise, the absolute savings rise with them. Better performance and lower cost are not in tension; harness quality resolves the trade-off.

Paniolo was already building toward this. We treat every agent error as infrastructure debt, every correction as a permanent improvement to the intelligence layer. The science validated the architecture we were already constructing.

Terminal-Bench 2 · SWE-bench-verified · ICLR 2026 · Cross-model Transfer · Tools · Middleware · Memory
From the Founders

Harnessed

Follow on Substack
Issue No. 001 · Anabel Kinsey

A Review of Agentic Harness Engineering

Researchers at Fudan, Peking University, and Shanghai Qiji Zhifeng confirm what we have been building on: harness engineering is vital to AI performance in code. Ten iterations of autonomous evolution lift pass@1 from 69.7% to 77.0%, using 12% fewer tokens. Better results, lower cost. This is the research Paniolo is built on.

Next: a deep dive into QMD, linting, and structured tooling.

Read the full review
Work With Us

Build agents that improve with experience.

We are selectively onboarding enterprise engineering teams and investors ahead of our CLI launch.

Based in New York & Honolulu, Hawaiʻi