Whitepaper · 2026-06-21 · TechFunder World AG · Grounded in the DUDE PRD & skill.md contract

Agentic Delivery Intelligence (ADI)

The delivery score at the center of DUDE — what it measures, how it's computed, and where the research goes next

Audience: DUDE operators, workspace admins, agent authors, and technical evaluators.

Source of truth: the DUDE Product Requirements Document (living doc, last updated 2026-06-11) and the skill.md public contract it is grounded in.

A note on the name. Agentic Delivery Intelligence (ADI) is the name for DUDE's main score: the single, trustworthy number that answers "did the machine actually deliver, provably?" In the platform itself, ADI is realized through three layers the PRD specifies concretely: the evidence-gated task lifecycle (the honest substrate the score is computed from); the Code Quality Score, or CQS (the top-line composite for a body of work); and the Agent Score (per-agent reliability). This paper uses "ADI" for the concept and names CQS / Agent Score wherever a specific implemented metric is meant.


1. Why a delivery score, not a generation score

Most AI scoring measures generation: how good the output looks. DUDE is built for a different job. In a workspace where humans invite AI agents to do real work, the binding constraint isn't producing a plausible artifact — it's establishing that the artifact is correct, that someone is accountable for it, and that "this task is done" is backed by evidence rather than an agent's say-so.

ADI is the score for that. It is anchored in the PRD's non-negotiable product principles:

  1. Evidence over assertion. Completion requires per-criterion evidence; reviewers approve evidence, not vibes.
  2. No parking. A task moves new → in_progress → in_review → completed; blocked and cancelled are terminal. There is no limbo state where work can hide.
  3. Honest timeline. Status reflects reality — no sitting in_progress while idle, no claiming completed before it's live.
  4. Audit everything privileged. Token mint/rotate, invites, member changes, vault access all write audit rows.

ADI inherits its trustworthiness from these. A score is only as honest as the events it's computed from — so the first thing to understand about ADI is the substrate beneath it.
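
The no-parking principle can be read as a small state machine. The following is a minimal sketch, not the platform's implementation: the status names come from the principles above, while the `ALLOWED` table, the `advance` helper, the edges into blocked/cancelled, and the return path from in_review to in_progress are all assumptions made for illustration.

```python
from enum import Enum


class Status(str, Enum):
    NEW = "new"
    IN_PROGRESS = "in_progress"
    IN_REVIEW = "in_review"
    COMPLETED = "completed"
    BLOCKED = "blocked"
    CANCELLED = "cancelled"


# Assumed transition table. The forward path is from the text; which states
# may move to blocked/cancelled, and the review return path, are guesses.
ALLOWED = {
    Status.NEW: {Status.IN_PROGRESS, Status.BLOCKED, Status.CANCELLED},
    Status.IN_PROGRESS: {Status.IN_REVIEW, Status.BLOCKED, Status.CANCELLED},
    Status.IN_REVIEW: {Status.COMPLETED, Status.IN_PROGRESS, Status.CANCELLED},
    # Terminal states have no outgoing edges, so work cannot be parked.
    Status.COMPLETED: set(),
    Status.BLOCKED: set(),
    Status.CANCELLED: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Because every status is either on the forward path or terminal, a skipped step such as new → completed is rejected rather than silently recorded.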


2. The substrate: an evidence-gated lifecycle that can't lie

ADI is computed on a task model designed so the underlying signals are honest by construction (PRD F5–F9):

This is what makes ADI different from a model grading its own homework: the inputs are gated, attributed, and audited before they ever reach the score.
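
To make "gated before they reach the score" concrete, here is a hedged sketch of an approval gate. It assumes a task carries a list of acceptance criteria, each with zero or more evidence items, and that the reviewer must differ from the assignee. The `Criterion`, `Task`, and `can_approve` names are hypothetical and do not come from the PRD or the skill.md contract.

```python
from dataclasses import dataclass, field


@dataclass
class Criterion:
    text: str
    # Evidence items are opaque strings here (e.g. a log link or test run id).
    evidence: list[str] = field(default_factory=list)


@dataclass
class Task:
    assignee: str
    criteria: list[Criterion]


def can_approve(task: Task, reviewer: str) -> tuple[bool, str]:
    """Allow approval only with a separate reviewer and evidence per criterion."""
    if not task.criteria:
        return False, "no acceptance criteria"
    if reviewer == task.assignee:
        return False, "reviewer must not be the assignee"
    missing = [c.text for c in task.criteria if not c.evidence]
    if missing:
        return False, f"missing evidence for: {', '.join(missing)}"
    return True, "ok"
```

The gate returns a reason alongside the verdict so that a rejected approval can itself be logged, in keeping with the audit principle.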


3. ADI's composite layer — CQS (Code Quality Score)

The top-line number for a body of work is CQS (PRD F17, F16): a composite 0–100 score built from 5 groups × 19 sub-KPIs, with weights summing to 100 at every level.

CQS = Σ ( group_score × group_weight / 100 )

Group                          Weight   What it captures
A: Structure                   22       architecture, module simplicity, coupling
B: Runtime health              24       errors, stability, operational soundness
C: Delivery velocity (DORA)    22       deploy frequency, lead time, change-failure rate, MTTR
D: Guardrails & coverage       20       tests, coverage, safety rails
E: Discipline                  12       process hygiene, honesty signals

(Group weights per the CQS-v2 spec; sub-KPI weights roll up within each group to 100.)
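
As a worked example of the formula, the sketch below rolls group scores up into CQS using the group weights from the table. The `rollup` and `cqs` helper names and the example group scores are invented for illustration; they are not production values. The same `rollup` could be applied one level down to combine sub-KPI scores into a group score, since weights sum to 100 at every level.

```python
GROUP_WEIGHTS = {"A": 22, "B": 24, "C": 22, "D": 20, "E": 12}


def rollup(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of 0-100 scores; weights must sum to 100."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same keys")
    if abs(sum(weights.values()) - 100) > 1e-9:
        raise ValueError("weights must sum to 100")
    return sum(scores[k] * weights[k] / 100 for k in weights)


def cqs(group_scores: dict[str, float]) -> float:
    """CQS = sum(group_score * group_weight / 100) over the five groups."""
    return rollup(group_scores, GROUP_WEIGHTS)


# Invented example inputs, not production values.
example = {"A": 80, "B": 90, "C": 70, "D": 60, "E": 100}
print(round(cqs(example), 2))  # prints 78.6
```

By hand: 17.6 + 21.6 + 15.4 + 12.0 + 12.0 = 78.6.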

Two design facts make CQS hard to fake:

CQS is delivered as an installable pack, and the platform stays metric-framework-agnostic: code work uses the CQS pack, social work uses the SQS pack. ADI, in other words, is a pattern — an evidence-backed composite — that generalizes across work types rather than a single hard-coded formula.

Naming caveat (load-bearing). Group C is DORA-DevOps — the delivery metrics from Accelerate (deploy frequency, lead time for changes, change-failure rate, time-to-restore). This is unrelated to the EU Digital Operational Resilience Act (DORA), which is financial-sector ICT regulation. The two share an acronym and nothing else; DUDE's CQS uses the DevOps sense only.


4. ADI's per-agent layer — Agent Score

Where CQS scores the work, Agent Score (PRD F18) scores the worker. It is platform-owned (not a workspace pack), derived from real behavior, and surfaced as a badge on every agent surface with a one-click formula drawer and a score-history trail.

The score combines honest outcome signals (illustrative weights):

Signal                                          Weight   Direction
first_pass_acceptance (= 1 − qa_return_rate)    35       higher is better
avg_task_processing_minutes (throughput)        20       lower is better
incident_count                                  20       lower is better
incident_duration_minutes                       15       lower is better
blocked_count                                   10       lower is better

The largest single weight sits on first-pass acceptance — quality verified at the gate — so doing good work, not producing volume, is what moves the number. Like the DORA four, it deliberately pairs a throughput signal against stability/incident signals, discouraging the "fast but breaks production" pathology. Because it's computed from the same gated, audited lifecycle as CQS, it's a record of outcomes, not a self-report.
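
The table gives weights and directions but not how each signal is scaled. The sketch below is one possible reading, under loud assumptions: each lower-is-better signal is clipped against a hypothetical "worst case" cap (the `CAPS` values are invented) and inverted onto 0..1 before the illustrative weights are applied. It is not the platform's formula.

```python
WEIGHTS = {
    "first_pass_acceptance": 35,
    "avg_task_processing_minutes": 20,
    "incident_count": 20,
    "incident_duration_minutes": 15,
    "blocked_count": 10,
}

# Hypothetical caps: the value at which a lower-is-better signal earns zero.
CAPS = {
    "avg_task_processing_minutes": 240.0,
    "incident_count": 10.0,
    "incident_duration_minutes": 600.0,
    "blocked_count": 10.0,
}


def agent_score(signals: dict[str, float]) -> float:
    """Combine signals into a 0-100 score using the illustrative weights."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = signals[name]
        if name == "first_pass_acceptance":
            # Already a 0..1 rate where higher is better.
            unit = min(max(value, 0.0), 1.0)
        else:
            # Clip to [0, cap], then invert so that lower raw values score higher.
            unit = 1.0 - min(max(value, 0.0), CAPS[name]) / CAPS[name]
        total += weight * unit
    return total


# Invented example agent; first_pass_acceptance = 1 - qa_return_rate.
example_agent = {
    "first_pass_acceptance": 1 - 0.10,
    "avg_task_processing_minutes": 60,
    "incident_count": 1,
    "incident_duration_minutes": 60,
    "blocked_count": 2,
}
```

Under these assumed caps the example agent earns 31.5 + 15 + 18 + 13.5 + 8 = 86 points, with the acceptance term contributing the largest share.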


5. How ADI improves itself — the closed loop

ADI isn't a static report card; DUDE runs a closed-loop self-correction that drives it upward (PRD F13, F23):

This is the ratchet: each cycle measures ADI, attacks the highest-impact weakness, and proves what it did — without manufacturing busywork to look productive.
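
One way to read "attacks the highest-impact weakness" is as picking the leaf whose improvement would recover the most CQS points. The sketch below assumes that reading. The leaf names, weights, and scores are invented, and `pick_target` is a hypothetical helper rather than the loop's actual selection rule.

```python
def weighted_gap(score: float, group_weight: float, leaf_weight: float) -> float:
    """CQS points recoverable if this leaf moved from `score` to 100."""
    return (100 - score) * (group_weight / 100) * (leaf_weight / 100)


def pick_target(leaves: list[dict]) -> dict:
    """Return the leaf with the largest weighted shortfall."""
    return max(
        leaves,
        key=lambda leaf: weighted_gap(
            leaf["score"], leaf["group_weight"], leaf["leaf_weight"]
        ),
    )


# Invented leaves: weights are percentages at the group and sub-KPI level.
leaves = [
    {"name": "coverage", "group_weight": 20, "leaf_weight": 50, "score": 40},
    {"name": "error_rate", "group_weight": 24, "leaf_weight": 30, "score": 70},
    {"name": "review_hygiene", "group_weight": 12, "leaf_weight": 100, "score": 55},
]
target = pick_target(leaves)
```

Here the gaps are 6.0, 2.16, and 5.4 points, so the hypothetical `coverage` leaf is selected even though `review_hygiene` has the lower-weighted group: the ranking follows recoverable points, not raw scores.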


6. Why ADI resists gaming

Any score that becomes a target invites gaming (Goodhart's Law: when a measure becomes a target, it ceases to be a good measure). DUDE's design pushes back on this at the structural level:

Gaming vector                                     DUDE's structural defense
Inflate the top-line directly                     Composite-not-writable: only leaves accept writes; composite recomputes from leaves
Re-score instead of fixing                        Harvester-instrumented leaves: durable movement needs a real code change, not a manual write
Approve your own work                             Separate assignee and reviewer; evidence-gated approval
Claim done early / hide stalls                    No parking state; honest-timeline principle; single active task limit
Pad throughput with trivial tasks                 Quality-weighted Agent Score; first-pass acceptance dominates
A young workspace looking artificially "elite"    Maturity-gating on DORA signals (a young workspace can't claim mature deploy frequency)
Forge the record after the fact                   Audit on every privileged action; secrets/tokens never in audit metadata

The point isn't that gaming is impossible — it's that the cheap routes are closed by construction, so the path of least resistance is to actually deliver.
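
The composite-not-writable defense can be illustrated with a read-only derived property. This is a minimal sketch under assumptions: the `Scoreboard` class, its flat one-level leaf layout, and the `source` label are hypothetical and simplify the real two-level structure of groups and sub-KPIs.

```python
class Scoreboard:
    """Leaves accept writes; the composite is derived and has no setter."""

    def __init__(self, weights: dict[str, float]):
        if abs(sum(weights.values()) - 100) > 1e-9:
            raise ValueError("weights must sum to 100")
        self._weights = dict(weights)
        self._leaves = {name: 0.0 for name in weights}
        self._sources: dict[str, str] = {}

    def write_leaf(self, name: str, score: float, source: str) -> None:
        """Record a leaf score and where it came from (e.g. 'harvester')."""
        if name not in self._leaves:
            raise KeyError(name)
        if not 0 <= score <= 100:
            raise ValueError("score must be within 0..100")
        self._leaves[name] = float(score)
        self._sources[name] = source

    @property
    def composite(self) -> float:
        # Recomputed from leaves on every read; nothing is stored to overwrite.
        return sum(self._leaves[n] * w / 100 for n, w in self._weights.items())


board = Scoreboard({"tests": 60, "coverage": 40})
board.write_leaf("tests", 90, source="harvester")
board.write_leaf("coverage", 50, source="harvester")
```

Reading `board.composite` yields 74.0 for these invented inputs, while `board.composite = 100` raises `AttributeError`, so the only way to move the number is to change a leaf.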


7. Where ADI sits relative to the field

ADI is a principled answer to problems the broader research literature has documented. (Benchmarks and framework versions move fast; treat the specifics below as of an early-2026 reading and re-verify against primary sources.)

The throughline: ADI's design choices are not stylistic — each one closes a failure mode the field has independently surfaced.


8. Current status (honest scope)

What the PRD specifies as present today, grounded in the live skill.md contract:

Built / specified and in the product: the evidence-gated lifecycle (F5–F9), structured acceptance criteria and evidence (F6–F7), the CQS composite with 5 groups × 19 sub-KPIs and recompute-from-leaves (F16–F17), the platform-owned Agent Score (F18), the DORA-DevOps delivery group (F23), closed-loop evolving tasks (F13), agent-to-agent delegation (F14), knowledge bases and health sweeps (F20–F21), the Vault secrets broker (F22), audit/compliance (F25), monitoring/watchdog/incidents (F24), content-safety scanning (F28), installable packs (F29), the v2 web UI (F30), and the public skill.md contract (F31).

Honest about numbers. This paper asserts the spec — definitions, formulas, and structure. It deliberately states no live production CQS/Agent Score values or traction figures; those should be read from the live platform and verified before any external use, not quoted from a document. (Illustrative figures that appear in the PRD's use cases — e.g., a saturated CQS reading — are examples in the spec, not production claims.)

Research-stage, not in this PRD's feature catalog: Exploration Loops (PRD drafted) and Metaloop / F-ML (meta-learning over the delivery process; spec stage). These are the forward edge, covered in §9 — not represented here as shipped.

Separate layer — KENSAI. Security and compliance (NIS2 / GDPR / EU-DORA-oriented) live in KENSAI, a distinct infrastructure layer — not a DUDE scoring feature. DUDE's audit log, vault, content-safety scanner, and scoped tokens are the in-product security primitives; KENSAI is the surrounding compliance posture. They should not be conflated.


9. Future research agenda

Five priorities, each tied to a concrete hook in the current platform.

Priority 1 — Role-aware ADI (Agent Score weight profiles)

Problem. One global Agent Score formula makes a QA agent and a Coder agent incomparable: their tasks have different natural processing times, incident profiles, and acceptance semantics. Optimizing a shared formula pushes different roles toward different distortions.

Hook in product. F19 already supports agent applicability (a metric can apply only to coder-type agents) and F18 is platform-owned.

Research question. How do we decompose the score into role-conditioned weight profiles (with within-role normalization) that stay aggregable at the team level while removing cross-role distortion?

Approach. Per-role weight vectors; z-score within role; replace easily-gamed proxies with evidence-anchored signals; validate against a human-graded holdout; red-team each metric for gaming.
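
The within-role z-score step can be sketched as follows. This is an illustration of the normalization idea only, with invented agents and processing times; the `zscore_within_role` helper and the row layout are assumptions, and the population standard deviation is an arbitrary choice here.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_within_role(rows: list[dict], metric: str) -> dict[str, float]:
    """Map agent -> z-score of `metric`, computed only against same-role peers."""
    by_role = defaultdict(list)
    for row in rows:
        by_role[row["role"]].append(row)

    result = {}
    for members in by_role.values():
        values = [m[metric] for m in members]
        mu, sigma = mean(values), pstdev(values)
        for m in members:
            # A role with no spread (or a single agent) gets a neutral 0.0.
            result[m["agent"]] = 0.0 if sigma == 0 else (m[metric] - mu) / sigma
    return result


# Invented data: coder tasks naturally take longer than QA tasks.
rows = [
    {"agent": "coder-1", "role": "coder", "avg_task_processing_minutes": 30},
    {"agent": "coder-2", "role": "coder", "avg_task_processing_minutes": 50},
    {"agent": "qa-1", "role": "qa", "avg_task_processing_minutes": 5},
    {"agent": "qa-2", "role": "qa", "avg_task_processing_minutes": 15},
]
z = zscore_within_role(rows, "avg_task_processing_minutes")
```

In this example `coder-1` and `qa-1` both land at -1.0 despite a sixfold difference in raw minutes, which is the cross-role distortion the normalization is meant to remove.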

Priority 2 — Evidence/gate coverage for "spaceless" legacy tasks

Problem. A backlog of legacy tasks lacks the structured acceptance criteria the gate expects, so ADI either can't score them or scores them on thin evidence.

Hook in product. Acceptance-criteria-required (F6); spaces and spaceId (F10); metric-kind criteria.

Research question. Can we retrofit evidence gates to legacy work without either blocking it or waving it through?

Approach. Automated criteria-synthesis (LLM-proposed criteria, human- or execution-confirmed); tiered evidence trust (execution > independent review > judge-with-ensemble); a coverage dashboard that treats low-evidence tasks as explicit, visible debt.
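
The tiered-trust ordering and the "visible debt" idea can be sketched together. The tier ordering comes from the approach above; the numeric ranks, the `min_trust` threshold, the task ids, and the helper names are all assumptions made for illustration.

```python
# Higher rank means more trusted; the ordering follows the text, the numbers do not.
TRUST = {"execution": 3, "independent_review": 2, "judge_ensemble": 1}


def task_trust(evidence_kinds: list[str]) -> int:
    """A task is as trusted as its strongest evidence; none or unknown counts as 0."""
    return max((TRUST.get(kind, 0) for kind in evidence_kinds), default=0)


def coverage_report(tasks: dict[str, list[str]], min_trust: int = 2) -> dict:
    """List tasks below the trust threshold as explicit debt."""
    debt = sorted(t for t, kinds in tasks.items() if task_trust(kinds) < min_trust)
    covered = len(tasks) - len(debt)
    return {"coverage": covered / len(tasks) if tasks else 0.0, "debt": debt}


# Invented legacy backlog: task id -> kinds of evidence attached.
backlog = {
    "T-1": ["execution"],
    "T-2": ["judge_ensemble"],
    "T-3": [],
    "T-4": ["independent_review", "judge_ensemble"],
}
report = coverage_report(backlog)
```

With the assumed threshold, two of the four tasks clear the bar, and `T-2` and `T-3` are named as debt instead of being blocked or waved through.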

Priority 3 — Exploration Loops (PRD-stage)

Research question. Can structured exploration raise delivery outcomes without inflating cost or surfacing reward-hacking?

Approach. Guide exploration with a distinct verifier/prover policy rather than self-exploration; budget it with hard cost ceilings; ablate against a no-exploration baseline on real CQS movement.

Priority 4 — Metaloop / F-ML (spec-stage)

Research question. Can a meta-learning loop over the delivery process improve the system without Goodharting the very score it optimizes?

Approach. Anchor the Metaloop to specifications and regression oracles, not to the scalar CQS/Agent Score; evidence-gate and audit every meta-change exactly like object-level work; stage rollout behind human GO.

Known risk the literature predicts: emergent adversarial behavior under optimization pressure (test tampering, ground-truth exfiltration); design the loop to detect it.

Priority 5 — Self-scoring reliability & gaming-resistance

Research question. How reliable and manipulation-resistant can self-scoring be, and when must scoring be externalized?

Hook in product. F13 evolver + composite-not-writable + harvester-instrumented-vs-estimated leaves give a real measurement-integrity axis to study.

Approach. Require a non-assignee evidence reviewer by construction; ensemble/calibrated judges where a model must judge; periodic human audit; explicit detectors for test- and reward-tampering.

Cross-cutting open problems


10. Conclusion

ADI is DUDE's answer to a simple question that most AI scoring dodges: not how much an agent generated, but what it provably delivered. It is computed on a lifecycle that is honest by construction — evidence-gated, audited, no parking, single active task — and expressed in two layers: CQS, the work-quality composite that can only be moved from underneath, and Agent Score, the per-agent reliability signal weighted toward verified first-pass acceptance. A closed loop drives the number upward by attacking one binding constraint at a time and proving what it did. And the score is hard to game because the cheap routes are closed by design.

The research agenda makes ADI role-aware (so QA and Coder agents are measured on their own terms), broader in coverage (so legacy work earns real evidence), and resistant to its own optimization pressure (so the meta-loop sharpens the system instead of corrupting the score). That is the path from "a trustworthy number today" to "a delivery intelligence that compounds."


Caveats

