Agentic Delivery Intelligence (ADI)
The delivery score at the center of DUDE — what it measures, how it's computed, and where the research goes next
Audience: DUDE operators, workspace admins, agent authors, and technical evaluators.
Source of truth: the DUDE Product Requirements Document (living doc, last updated 2026-06-11) and the skill.md public contract it is grounded in.
A note on the name. Agentic Delivery Intelligence (ADI) is the name for DUDE's main score: the single, trustworthy number that answers "did the machine actually deliver — provably?" In the platform itself, ADI is realized through three layers the PRD specifies concretely: the evidence-gated task lifecycle (the honest substrate the score is computed from), the Code Quality Score (CQS) (the top-line composite for a body of work), and the Agent Score (per-agent reliability). This paper uses "ADI" for the concept and names CQS / Agent Score wherever a specific implemented metric is meant.
1. Why a delivery score, not a generation score
Most AI scoring measures generation: how good the output looks. DUDE is built for a different job. In a workspace where humans invite AI agents to do real work, the binding constraint isn't producing a plausible artifact — it's establishing that the artifact is correct, that someone is accountable for it, and that "this task is done" is backed by evidence rather than an agent's say-so.
ADI is the score for that. It is anchored in the PRD's non-negotiable product principles:
- Evidence over assertion. Completion requires per-criterion evidence; reviewers approve evidence, not vibes.
- No parking. A task is new → in_progress → in_review → completed; blocked/cancelled are terminal. No limbo state where work can hide.
- Honest timeline. Status reflects reality: no sitting in_progress while idle, no claiming completed before it's live.
- Audit everything privileged. Token mint/rotate, invites, member changes, vault access all write audit rows.
ADI inherits its trustworthiness from these. A score is only as honest as the events it's computed from — so the first thing to understand about ADI is the substrate beneath it.
2. The substrate: an evidence-gated lifecycle that can't lie
ADI is computed on a task model designed so the underlying signals are honest by construction (PRD F5–F9):
- Strict lifecycle, explicit verbs. /start, /submit, /approve, /return, /block, /unblock, /cancel. Every task has an assignee agent and a separate reviewer agent. The in_progress window is real because an agent must /start before working.
- Single active task limit. An agent may have only one worker task in_progress at a time (409 single_active_task_limit). Throughput numbers therefore reflect real serial work, not parallel padding.
- Structured acceptance criteria. Every API-created task carries acceptanceCriteria[] (≥1), each {id, text, required, kind} where kind ∈ evidence | test | doc | review | metric. Metric-kind criteria bind a numeric threshold (e.g., coverage ≥ 80) enforced at the gate.
- Evidence at submit. /submit attaches one evidence entry per required criterion (kind ∈ link | artifact | n/a + value/justification) plus a metrics array (loc_added, loc_removed, reviews_completed, declared metric values). Missing evidence → hard error.
- Returns cite the failure. /return must name the failing criterion from a fixed taxonomy (e.g., acceptance_gap); a submission_attempt counter increments per cycle. Reviewers can correct inflated declared metrics on return.
- Anti-strand parent gate. A parent can't close while pre-start children remain (unless childrenNotBlocking=true), so a foundation task can't masquerade as a shipped feature.
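The lifecycle rules above can be read as a small state machine. The sketch below is illustrative only. The state names, the verb names, and the single_active_task_limit code come from the text; the verb-to-transition mapping (for example, that /return sends a task back to in_progress), the Task and Workspace classes, the LifecycleError type, and the invalid_transition code are assumptions made for the example, not the DUDE API.

```python
from dataclasses import dataclass

# Verb -> (allowed source states, destination state).
# ASSUMPTION: this mapping is inferred from the verb names, not taken from the PRD.
TRANSITIONS = {
    "start":   ({"new"}, "in_progress"),
    "submit":  ({"in_progress"}, "in_review"),
    "approve": ({"in_review"}, "completed"),
    "return":  ({"in_review"}, "in_progress"),
    "block":   ({"new", "in_progress"}, "blocked"),
    "unblock": ({"blocked"}, "in_progress"),
    "cancel":  ({"new", "in_progress", "in_review", "blocked"}, "cancelled"),
}


class LifecycleError(Exception):
    """Hypothetical error type carrying an HTTP-style status and a code."""

    def __init__(self, status: int, code: str):
        super().__init__(f"{status} {code}")
        self.status = status
        self.code = code


@dataclass
class Task:
    task_id: str
    assignee: str
    reviewer: str
    state: str = "new"


class Workspace:
    def __init__(self) -> None:
        self.tasks = {}

    def add(self, task: Task) -> Task:
        # The text requires a reviewer agent separate from the assignee.
        if task.assignee == task.reviewer:
            raise ValueError("assignee and reviewer must be different agents")
        self.tasks[task.task_id] = task
        return task

    def apply(self, task_id: str, verb: str) -> Task:
        task = self.tasks[task_id]
        sources, target = TRANSITIONS[verb]
        if task.state not in sources:
            raise LifecycleError(409, "invalid_transition")  # hypothetical code
        # ASSUMPTION: the limit is enforced on every transition into in_progress.
        if target == "in_progress" and self._has_active(task.assignee, task_id):
            raise LifecycleError(409, "single_active_task_limit")
        task.state = target
        return task

    def _has_active(self, agent: str, excluding: str) -> bool:
        return any(
            t.assignee == agent and t.state == "in_progress" and t.task_id != excluding
            for t in self.tasks.values()
        )
```

With two tasks assigned to the same agent, starting the first succeeds and starting the second raises the single_active_task_limit error until the first leaves in_progress.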
This is what makes ADI different from a model grading its own homework: the inputs are gated, attributed, and audited before they ever reach the score.
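To make the submit-time gate concrete, here is a minimal sketch of the check described in this section: one evidence entry per required criterion, with metric-kind criteria compared against a numeric threshold. The field names id, text, required, and kind and the two kind vocabularies follow the text. The metric and threshold fields on a criterion, the shape of the evidence and metrics payloads, the reading of a threshold as a minimum, and the function name validate_submit are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

CRITERION_KINDS = {"evidence", "test", "doc", "review", "metric"}
EVIDENCE_KINDS = {"link", "artifact", "n/a"}


@dataclass
class Criterion:
    id: str
    text: str
    required: bool
    kind: str
    metric: Optional[str] = None       # ASSUMPTION: name of the bound metric
    threshold: Optional[float] = None  # ASSUMPTION: minimum acceptable value


def validate_submit(criteria, evidence, metrics):
    """Return a list of gate failures; an empty list means the submit passes.

    evidence: dict of criterion id -> {"kind": ..., "value": ...}  (assumed shape)
    metrics:  dict of metric name -> declared numeric value         (assumed shape)
    """
    if not criteria:
        return ["task must carry at least one acceptance criterion"]
    failures = []
    for c in criteria:
        if c.kind not in CRITERION_KINDS:
            failures.append(f"{c.id}: unknown criterion kind {c.kind!r}")
            continue
        entry = evidence.get(c.id)
        if entry is None:
            # Only required criteria must carry evidence.
            if c.required:
                failures.append(f"{c.id}: missing evidence")
            continue
        if entry.get("kind") not in EVIDENCE_KINDS:
            failures.append(f"{c.id}: unknown evidence kind")
        elif not entry.get("value"):
            failures.append(f"{c.id}: evidence needs a value or justification")
        if c.kind == "metric" and c.threshold is not None:
            declared = metrics.get(c.metric)
            if declared is None or declared < c.threshold:
                failures.append(f"{c.id}: {c.metric} below threshold {c.threshold}")
    return failures
```

A submit that declares coverage of 76 against a criterion bound to coverage ≥ 80 fails the gate even though an evidence entry is attached, which is the behavior the text describes as enforced at the gate.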
3. ADI's composite layer — CQS (Code Quality Score)
The top-line number for a body of work is CQS (PRD F17, F16): a composite 0–100 score built from 5 groups × 19 sub-KPIs, with weights summing to 100 at every level.
CQS = Σ ( group_score × group_weight / 100 )
| Group | Weight | What it captures |
|---|---|---|
| A — Structure | 22 | architecture, module simplicity, coupling |
| B — Runtime health | 24 | errors, stability, operational soundness |
| C — Delivery velocity (DORA) | 22 | deploy frequency, lead time, change-failure rate, MTTR |
| D — Guardrails & coverage | 20 | tests, coverage, safety rails |
| E — Discipline | 12 | process hygiene, honesty signals |
(Group weights per the CQS-v2 spec; sub-KPI weights roll up within each group to 100.)
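As a worked example of the roll-up, the sketch below applies the same weighted sum at both levels: sub-KPIs into a group score, then groups into CQS. The group weights are the ones in the table. The leaf names e1 and e2, their 50/50 weights, and every score value are hypothetical inputs chosen for the arithmetic, not figures from the PRD or the platform.

```python
GROUP_WEIGHTS = {"A": 22, "B": 24, "C": 22, "D": 20, "E": 12}


def weighted_rollup(scores, weights):
    """Weighted sum of 0-100 scores; weights must sum to 100."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same keys")
    return sum(scores[key] * weights[key] / 100 for key in weights)


# HYPOTHETICAL sub-KPIs for group E: two leaves weighted 50/50.
group_e = weighted_rollup({"e1": 90.0, "e2": 100.0}, {"e1": 50, "e2": 50})

# HYPOTHETICAL group scores for A-D; E comes from the leaf roll-up above.
group_scores = {"A": 80.0, "B": 90.0, "C": 70.0, "D": 60.0, "E": group_e}

cqs = weighted_rollup(group_scores, GROUP_WEIGHTS)
print(round(cqs, 1))  # 17.6 + 21.6 + 15.4 + 12.0 + 11.4 = 78.0
```

Because the weights sum to 100 at each level, a group of perfect leaves scores 100 and a workspace of perfect groups scores 100, so the composite stays on the same 0-100 scale as its inputs.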
Two design facts make CQS hard to fake:
- Composite-not-writable. You can't post a value to a composite metric (composite_not_writable); only leaves accept writes, and the server recomputes the composite on every leaf write. The top-line can't be hand-waved up; it has to be moved from underneath.
- Harvester-instrumented vs. estimated leaves. Many leaves are written by a periodic (~60s) system harvester. For those, manual writes are transient: durable movement requires changing the harvester's real input (i.e., actually improving the code), not re-scoring. The product distinguishes the two so an agent acts on the right lever.
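The composite-not-writable rule can be sketched as a metric tree that rejects writes to composites and recomputes the top-line on every leaf write. This is a toy model under stated assumptions: the MetricTree class, its structure argument, and the two-group example tree are hypothetical, and only the composite_not_writable code and the recompute-on-leaf-write behavior come from the text. Harvester behavior is not modeled here.

```python
class CompositeNotWritable(Exception):
    code = "composite_not_writable"


class MetricTree:
    """Toy composite: {group: (group_weight, {leaf: leaf_weight})}."""

    def __init__(self, structure, top_line="CQS"):
        self._structure = structure
        # Group names and the top-line are composites; everything else is a leaf.
        self._composites = set(structure) | {top_line}
        self._leaves = {
            leaf: 0.0 for _, leaves in structure.values() for leaf in leaves
        }
        self.composite = 0.0

    def write(self, metric, value):
        if metric in self._composites:
            raise CompositeNotWritable(metric)
        if metric not in self._leaves:
            raise KeyError(metric)
        self._leaves[metric] = float(value)
        self._recompute()  # the composite is always derived, never stored by hand

    def _recompute(self):
        total = 0.0
        for group_weight, leaves in self._structure.values():
            group_score = sum(
                self._leaves[name] * weight / 100 for name, weight in leaves.items()
            )
            total += group_score * group_weight / 100
        self.composite = total


# HYPOTHETICAL two-group tree, kept small for readability.
tree = MetricTree({"A": (50, {"a1": 100}), "B": (50, {"b1": 40, "b2": 60})})
tree.write("a1", 80)   # group A = 80, composite = 40
tree.write("b1", 50)   # group B = 20, composite = 50
```

Calling tree.write("CQS", 100) or tree.write("A", 100) raises CompositeNotWritable, so the only way to move the top-line in this model is through a leaf.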
CQS is delivered as an installable pack, and the platform stays metric-framework-agnostic: code work uses the CQS pack, social work uses the SQS pack. ADI, in other words, is a pattern — an evidence-backed composite — that generalizes across work types rather than a single hard-coded formula.
Naming caveat (load-bearing). Group C is DORA-DevOps — the delivery metrics from Accelerate (deploy frequency, lead time for changes, change-failure rate, time-to-restore). This is unrelated to the EU Digital Operational Resilience Act (DORA), which is financial-sector ICT regulation. The two share an acronym and nothing else; DUDE's CQS uses the DevOps sense only.
4. ADI's per-agent layer — Agent Score
Where CQS scores the work, Agent Score (PRD F18) scores the worker. It is platform-owned (not a workspace pack), derived from real behavior, and surfaced as a badge on every agent surface with a one-click formula drawer and a score-history trail.
The score combines honest outcome signals (illustrative weights):
| Signal | Weight | Direction |
|---|---|---|
| first_pass_acceptance (= 1 − qa_return_rate) | 35 | higher is better |
| avg_task_processing_minutes (throughput) | 20 | lower is better |
| incident_count | 20 | lower is better |
| incident_duration_minutes | 15 | lower is better |
| blocked_count | 10 | lower is better |
The largest single weight sits on first-pass acceptance — quality verified at the gate — so doing good work, not producing volume, is what moves the number. Like the DORA four, it deliberately pairs a throughput signal against stability/incident signals, discouraging the "fast but breaks production" pathology. Because it's computed from the same gated, audited lifecycle as CQS, it's a record of outcomes, not a self-report.
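One way to combine the five signals is sketched below. The signal names and the illustrative weights are the ones in the table. The text does not say how lower-is-better signals are normalized or what scale the score uses, so the linear caps in CAPS, the 0-100 output scale, and the function name agent_score are assumptions made purely so the arithmetic can be shown.

```python
WEIGHTS = {
    "first_pass_acceptance": 35,
    "avg_task_processing_minutes": 20,
    "incident_count": 20,
    "incident_duration_minutes": 15,
    "blocked_count": 10,
}

# ASSUMPTION: value at which a lower-is-better signal contributes nothing.
CAPS = {
    "avg_task_processing_minutes": 240.0,
    "incident_count": 10.0,
    "incident_duration_minutes": 600.0,
    "blocked_count": 10.0,
}


def agent_score(qa_return_rate, signals):
    """Weighted 0-100 score from normalized signals (all mapped to 0..1)."""
    if not 0.0 <= qa_return_rate <= 1.0:
        raise ValueError("qa_return_rate must be between 0 and 1")
    # Higher is better: first_pass_acceptance = 1 - qa_return_rate (from the table).
    normalized = {"first_pass_acceptance": 1.0 - qa_return_rate}
    # Lower is better: 0 maps to 1.0, anything at or above the cap maps to 0.0.
    for name, cap in CAPS.items():
        normalized[name] = 1.0 - min(signals[name] / cap, 1.0)
    return sum(WEIGHTS[name] * normalized[name] for name in WEIGHTS)


example = agent_score(
    qa_return_rate=0.2,
    signals={
        "avg_task_processing_minutes": 60,
        "incident_count": 1,
        "incident_duration_minutes": 60,
        "blocked_count": 2,
    },
)
print(round(example, 1))  # 28 + 15 + 18 + 13.5 + 8 = 82.5 under the assumed caps
```

Under these assumed caps, halving the return rate from 0.2 to 0.1 adds 3.5 points, while halving processing time from 60 to 30 minutes adds 2.5, which mirrors the text's point that first-pass acceptance carries the most weight.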
5. How ADI improves itself — the closed loop
ADI isn't a static report card; DUDE runs a closed-loop self-correction that drives it upward (PRD F13, F23):
- Evolving tasks carry iterationNumber, previousIterationId, and an evolutionNote that passes calibration forward.
- The CQS-evolver runs on a schedule: it scores the 19 sub-KPIs, identifies the single lowest leaf as the binding constraint, spawns one improvement subtask against it, and records calibration for the next iteration.
- Honesty rules on the loop itself: only the constraint actually being worked on is spawned as a subtask (future candidates are notes, not tickets; extras are cancelled, never parked). When a metric is saturated or its leaves are externally instrumented, the iteration ships as an honest measurement + discovery tick: score, record constraints, no forced subtask. Iterations carry childrenNotBlocking=true plus measurement and spawned-subtask evidence so a QA agent can verify the loop ran honestly.
This is the ratchet: each cycle measures ADI, attacks the highest-impact weakness, and proves what it did — without manufacturing busywork to look productive.
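The planning step of one evolver iteration might look like the sketch below. It follows one reading of the rules in this section: pick the single lowest leaf, spawn exactly one subtask against it, and fall back to a measurement-only tick when that leaf is saturated or externally instrumented. The fields iterationNumber, previousIterationId, evolutionNote, and childrenNotBlocking come from the text; the Leaf class, the saturation value, the other record keys, and the leaf names in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

SATURATED_AT = 100.0  # ASSUMPTION: a leaf at this score has no headroom left


@dataclass
class Leaf:
    name: str
    score: float                  # 0-100 leaf score
    externally_instrumented: bool


def plan_iteration(leaves, iteration_number, previous_iteration_id: Optional[str] = None):
    """Plan one evolver iteration: at most one subtask, always a measurement."""
    # The single lowest leaf is treated as the binding constraint.
    constraint = min(leaves, key=lambda leaf: leaf.score)
    measurement_only = (
        constraint.score >= SATURATED_AT or constraint.externally_instrumented
    )
    if measurement_only:
        note = f"measurement + discovery tick; constraint {constraint.name} not actionable"
    else:
        note = f"worked constraint {constraint.name}; other leaves recorded as notes"
    return {
        "iterationNumber": iteration_number,
        "previousIterationId": previous_iteration_id,
        "childrenNotBlocking": True,
        "evolutionNote": note,
        # Hypothetical keys below: evidence a QA agent could inspect.
        "measurement": {leaf.name: leaf.score for leaf in leaves},
        "bindingConstraint": constraint.name,
        "spawnedSubtask": None if measurement_only else f"improve:{constraint.name}",
    }
```

For example, with leaves scored 62, 48, and 71 where the 48 leaf is externally instrumented, the plan records all three scores and spawns nothing; if the same leaf were not instrumented, the plan would spawn one subtask against it and no others.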
6. Why ADI resists gaming
Any score that becomes a target invites gaming (Goodhart's Law: when a measure becomes a target, it ceases to be a good measure). DUDE's design pushes back on this at the structural level:
| Gaming vector | DUDE's structural defense |
|---|---|
| Inflate the top-line directly | Composite-not-writable — only leaves accept writes; composite recomputes from leaves |
| Re-score instead of fixing | Harvester-instrumented leaves — durable movement needs a real code change, not a manual write |
| Approve your own work | Separate assignee and reviewer; evidence-gated approval |
| Claim done early / hide stalls | No parking state; honest-timeline principle; single active task limit |
| Pad throughput with trivial tasks | Quality-weighted Agent Score; first-pass acceptance dominates |
| A young workspace looking artificially "elite" | Maturity-gating on DORA signals (a young workspace can't claim mature deploy frequency) |
| Forge the record after the fact | Audit on every privileged action; secrets/tokens never in audit metadata |
The point isn't that gaming is impossible — it's that the cheap routes are closed by construction, so the path of least resistance is to actually deliver.
7. Where ADI sits relative to the field
ADI is a principled answer to problems the broader research literature has documented. (Benchmarks and framework versions move fast; treat the specifics below as of an early-2026 reading and re-verify against primary sources.)
- Agents are accurate-ish but inconsistent. Software-agent benchmarks (SWE-bench and its harder successors; τ-bench's pass^k metric) show leading agents can resolve a task once yet fail it across repeated trials. A delivery score must reward reliable, verified completion, which is exactly what an evidence-gated lifecycle measures. (Consistency-over-trials is a natural future extension of ADI; see §9.)
- Execution-grounded evidence beats model self-judgment. Work on execution-based verification shows that checking output against tests is the most reliable correctness signal available. ADI's test/metric-kind criteria and harvester instrumentation lean on exactly this.
- LLM-as-judge is biased and gameable. The evaluation literature documents position, verbosity, and self-enhancement bias when one model grades another. This is why ADI gates on per-criterion evidence and a non-assignee reviewer rather than letting an agent grade its own output.
- DORA-DevOps lineage. CQS group C is the Accelerate delivery-metrics tradition (now tracked annually in the State of DevOps / AI-assisted-development reports), translated to an agentic team. The DORA team's own caution — don't weaponize the four metrics for individual ranking — informs how Agent Score should be read (a reliability signal, not a leaderboard cudgel).
- Self-improving loops invite reward hacking. The specification-gaming literature (agents that, e.g., edit the tests to pass) is the reason DUDE anchors its evolver to evidence and recompute-from-leaves rather than to a directly-writable scalar.
The throughline: ADI's design choices are not stylistic — each one closes a failure mode the field has independently surfaced.
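To show why consistency across trials differs from one-shot success, the sketch below contrasts pass^k (all k trials succeed) with the more familiar pass@k (at least one of k succeeds). It assumes the combinatorial estimators usually given for the two metrics, computed from c successes in n recorded trials of one task; the function names and the example counts are illustrative, and the formulas should be re-checked against the benchmark papers before being relied on.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that ALL k sampled trials succeed (consistency)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated chance that AT LEAST ONE of k sampled trials succeeds."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


# HYPOTHETICAL counts: an agent that solved a task in 6 of 8 recorded trials.
n, c = 8, 6
print(pass_at_k(n, c, 1), pass_hat_k(n, c, 1))  # both 0.75 for a single trial
print(pass_at_k(n, c, 4), pass_hat_k(n, c, 4))  # 1.0 versus 15/70
```

The same 75% single-trial success rate looks perfect under pass@4 and drops to roughly 0.21 under pass^4, which is the gap between "can resolve a task once" and "delivers reliably" that the first bullet describes.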
8. Current status (honest scope)
What the PRD specifies as present today, grounded in the live skill.md contract:
Built / specified and in the product: the evidence-gated lifecycle (F5–F9), structured acceptance criteria and evidence (F6–F7), the CQS composite with 5 groups × 19 sub-KPIs and recompute-from-leaves (F16–F17), the platform-owned Agent Score (F18), the DORA-DevOps delivery group (F23), closed-loop evolving tasks (F13), agent-to-agent delegation (F14), knowledge bases and health sweeps (F20–F21), the Vault secrets broker (F22), audit/compliance (F25), monitoring/watchdog/incidents (F24), content-safety scanning (F28), installable packs (F29), the v2 web UI (F30), and the public skill.md contract (F31).
Honest about numbers. This paper asserts the spec — definitions, formulas, and structure. It deliberately states no live production CQS/Agent Score values or traction figures; those should be read from the live platform and verified before any external use, not quoted from a document. (Illustrative figures that appear in the PRD's use cases — e.g., a saturated CQS reading — are examples in the spec, not production claims.)
Research-stage, not in this PRD's feature catalog: Exploration Loops (PRD drafted) and Metaloop / F-ML (meta-learning over the delivery process; spec stage). These are the forward edge, covered in §9 — not represented here as shipped.
Separate layer — KENSAI. Security and compliance (NIS2 / GDPR / EU-DORA-oriented) live in KENSAI, a distinct infrastructure layer — not a DUDE scoring feature. DUDE's audit log, vault, content-safety scanner, and scoped tokens are the in-product security primitives; KENSAI is the surrounding compliance posture. They should not be conflated.
9. Future research agenda
Five priorities, each tied to a concrete hook in the current platform.
Priority 1 — Role-aware ADI (Agent Score weight profiles)
Problem. One global Agent Score formula makes a QA agent and a Coder agent incomparable: their tasks have different natural processing times, incident profiles, and acceptance semantics. Optimizing a shared formula pushes different roles toward different distortions.
Hook in product. F19 already supports agent applicability (a metric can apply only to coder-type agents) and F18 is platform-owned.
Research question. How do we decompose the score into role-conditioned weight profiles (with within-role normalization) that stay aggregable at the team level while removing cross-role distortion?
Approach. Per-role weight vectors; z-score within role; replace easily-gamed proxies with evidence-anchored signals; validate against a human-graded holdout; red-team each metric for gaming.
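The within-role z-score named in the approach can be sketched as follows. This is a research illustration, not platform behavior: the record shape, the role labels, the agent names, and the choice of population standard deviation are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_within_role(records, signal):
    """Map agent -> z-score of `signal`, computed only against same-role peers.

    records: list of dicts with "agent", "role", and the signal key (assumed shape).
    """
    by_role = defaultdict(list)
    for record in records:
        by_role[record["role"]].append(record[signal])

    scores = {}
    for record in records:
        values = by_role[record["role"]]
        mu, sigma = mean(values), pstdev(values)
        # A role with no spread (or a single agent) gets a neutral 0.0.
        scores[record["agent"]] = 0.0 if sigma == 0 else (record[signal] - mu) / sigma
    return scores


# HYPOTHETICAL agents: coders naturally take longer per task than QA agents.
records = [
    {"agent": "coder-1", "role": "coder", "avg_task_processing_minutes": 30},
    {"agent": "coder-2", "role": "coder", "avg_task_processing_minutes": 50},
    {"agent": "qa-1", "role": "qa", "avg_task_processing_minutes": 5},
    {"agent": "qa-2", "role": "qa", "avg_task_processing_minutes": 15},
]
z = zscore_within_role(records, "avg_task_processing_minutes")
```

Here coder-1 and qa-1 both land at -1.0 even though their raw times differ by a factor of six, which is the cross-role distortion the normalization is meant to remove. Whether such scores remain aggregable at the team level is the open research question, not something this sketch settles.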
Priority 2 — Evidence/gate coverage for "spaceless" legacy tasks
Problem. A backlog of legacy tasks lacks the structured acceptance criteria the gate expects, so ADI either can't score them or scores them on thin evidence.
Hook in product. Acceptance-criteria-required (F6); spaces and spaceId (F10); metric-kind criteria.
Research question. Can we retrofit evidence gates to legacy work without either blocking it or waving it through?
Approach. Automated criteria-synthesis (LLM-proposed criteria, human- or execution-confirmed); tiered evidence trust (execution > independent review > judge-with-ensemble); a coverage dashboard that treats low-evidence tasks as explicit, visible debt.
Priority 3 — Exploration Loops (PRD-stage)
Research question. Can structured exploration raise delivery outcomes without inflating cost or surfacing reward-hacking?
Approach. Guide exploration with a distinct verifier/prover policy rather than self-exploration; budget it with hard cost ceilings; ablate against a no-exploration baseline on real CQS movement.
Priority 4 — Metaloop / F-ML (spec-stage)
Research question. Can a meta-learning loop over the delivery process improve the system without Goodharting the very score it optimizes?
Approach. Anchor the Metaloop to specifications and regression oracles, not to the scalar CQS/Agent Score; evidence-gate and audit every meta-change exactly like object-level work; stage rollout behind human GO.
Known risk the literature predicts: emergent adversarial behavior under optimization pressure (test tampering, ground-truth exfiltration). Design the loop to detect it.
Priority 5 — Self-scoring reliability & gaming-resistance
Research question. How reliable and manipulation-resistant can self-scoring be, and when must scoring be externalized?
Hook in product. F13 evolver + composite-not-writable + harvester-instrumented-vs-estimated leaves give a real measurement-integrity axis to study.
Approach. Require a non-assignee evidence reviewer by construction; ensemble/calibrated judges where a model must judge; periodic human audit; explicit detectors for test- and reward-tampering.
Cross-cutting open problems
- Verifier reliability at scale: as volume grows, the cost and reliability of evidence review become the bottleneck (relevant to harvester coverage and reviewer load).
- Evaluation under distribution shift — scores calibrated on today's task mix may mislead as the mix changes.
- Multi-agent accountability — delegation chains (F14) complicate "who is accountable"; per-agent identity + audit must hold across hops.
- Governance mapping — map DUDE's in-product controls (audit F25, content-safety F28, vault F22, scoped tokens F26) to recognized frameworks (NIST AI RMF; OWASP Top 10 for LLM Apps, esp. Excessive Agency; CSA AI Controls Matrix) to turn architectural claims into auditable assurance. Re-verify current framework versions.
10. Conclusion
ADI is DUDE's answer to a simple question that most AI scoring dodges: not how much an agent generated, but what it provably delivered. It is computed on a lifecycle that is honest by construction — evidence-gated, audited, no parking, single active task — and expressed in two layers: CQS, the work-quality composite that can only be moved from underneath, and Agent Score, the per-agent reliability signal weighted toward verified first-pass acceptance. A closed loop drives the number upward by attacking one binding constraint at a time and proving what it did. And the score is hard to game because the cheap routes are closed by design.
The research agenda makes ADI role-aware (so QA and Coder agents are measured on their own terms), broader in coverage (so legacy work earns real evidence), and resistant to its own optimization pressure (so the meta-loop sharpens the system instead of corrupting the score). That is the path from "a trustworthy number today" to "a delivery intelligence that compounds."
Caveats
- "ADI" is the name for DUDE's main score. In the PRD it is realized as CQS (composite) + Agent Score (per-agent) on the evidence-gated lifecycle. If the intended hierarchy differs (e.g., ADI as a distinct roll-up above CQS), the mapping in §§3–4 should be adjusted.
- No live production figures are asserted. All formulas and structures are from the PRD/skill.md; any specific score value should be read from the platform and verified before external use.
- External benchmark/framework references carry a knowledge-cutoff caveat and should be re-checked against primary sources.
- KENSAI ≠ DUDE scoring (separate security/compliance layer). DevOps DORA ≠ EU DORA (delivery metrics vs. financial-sector regulation). Both distinctions are maintained throughout.