One canonical task store for Agentic OS — every task across every workstream in a single SQLite store, written through one controlled path, with completion proven by evidence rather than asserted. The source of truth the dashboard, hooks, cron, and Claude all read from and write to.
The GSD Task Spine is a single canonical store for every task across every workstream in Agentic OS — always fresh, tamper-resistant, and the source of truth the dashboard, hooks, cron, and Claude all read from and write to through one controlled path.
Brainstormed 2026-05-16 (thread e1459f57) with three independent inputs — web research, Codex 5.5, and Gemini 3.1 Pro — which converged hard on the same architecture: one SQLite store, one writer, everyone else reads.
“It says a task is pending, I ask Claude to check, it comes back saying it was actually done.”
Task state in Agentic OS is fragmented across 8+ surfaces, none authoritative. Completion state does not propagate — a task marked done in one surface still shows pending in the others. There is no propagation path. That is the bug the Task Spine exists to kill.
The eight non-authoritative surfaces task state was scattered across:
| Surface | Why it isn't the source of truth |
|---|---|
| Claude Code native tasks | In-session only — evaporate at session end |
| GSD statusline bridge | Tracks only the single active task |
| pending-tasks-registry.md | Manual, updated only at bank time: 1–2 day lag |
| Session files' “open loops” | Read-only snapshots, frozen after bank |
| Command Centre SQLite DB (tasks table) | Persistent but fed from ephemeral session state |
| .planning/STATE.md + ROADMAP.md | Phase-level, days stale |
| cron/jobs/ + cron_runs table | Scheduled work, tracked separately |
| Memory / feedback_*.md | Decisions and lessons, no completion state |
The store holds every task — recorded and tagged ephemeral (in-session steps) or durable (cross-session / cross-workstream work). Views filter by tier.
The lean core plus an append-only event log, typed dependency links, and domain-risk fields. Chosen by Calum over the lean-only option.
Phase 1 is task creation + tracking only. The Hermes supervisor and the messaging interface are explicitly deferred to Phases 2–3.
The four-phase roadmap — this dashboard documents Phase 1, milestones M1 and M2:
| Phase | Scope |
|---|---|
| Phase 1 — the task spine | Canonical store, single-writer CLI, freshness mechanism, integrity layers, migration of the 8 surfaces. Milestones M1–M5. |
| Phase 1b — QC harness | Wrappers that auto-run the Codex/Gemini QC + conditional Opus third-agent lifecycle. The spine models and gates this lifecycle; the harness drives it. (Task #23.) |
| Phase 2 — Hermes supervisor | A supervisor over the store; advisors = Claude/Codex/Gemini. Includes dreaming / self-review — daily offline reviews over the spine's data. (Task #22.) |
| Phase 3 — messaging interface | Telegram / WhatsApp / phone access to the supervisor. |
A task cannot move to status = done without an evidence pointer — a file path, URL, commit hash, or check result. Completion is proven on write, not asserted. This single rule, enforced in the one write path, is what makes the “it said pending but was done” bug impossible going forward.
The core principle all three research sources converged on. Every write — from Claude, hooks, cron, or the dashboard — goes through a single CLI. Nothing writes raw SQL to the task tables.
Its own SQLite file — a sibling of the Command Centre's data.db. A separate file avoids colliding with that DB's existing tasks table and keeps the spine cleanly isolated. WAL mode on. Zero new dependencies — SQLite ships in Python's stdlib.
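A minimal sketch of what such a connection factory could look like, assuming only Python's stdlib `sqlite3`. The function name `connect` and the 5-second busy timeout match names used elsewhere on this dashboard, but the body is illustrative, not the shipped db.py.

```python
import sqlite3
from pathlib import Path


def connect(db_path: Path) -> sqlite3.Connection:
    """Open the spine's own SQLite file with WAL mode and a busy timeout."""
    # timeout=5.0 makes a blocked writer wait up to 5s instead of failing at once.
    conn = sqlite3.connect(db_path, timeout=5.0)
    conn.row_factory = sqlite3.Row
    # WAL lets readers (dashboard, projections) keep reading while the CLI writes.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn
```

WAL only takes effect on a file-backed database; an in-memory connection reports `memory` as its journal mode.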
A single CLI. Every mutation — Claude, hooks, cron, dashboard — goes through it. The invariant and evidence enforcement live here, in one place.
Claude (via CLI read commands), the Command Centre dashboard, and auto-generated read-only projection files. No new external services, API keys, accounts, or installs — fully local.
| Table | Role | Key columns |
|---|---|---|
| tasks | Current state: one row per task. A projection of the event log. | Identity (id, title, body), classification (tier, status, workstream, complexity), provenance, completion evidence (evidence, verification_status, last_verified_at), lifecycle (expires_at, due), risk gating (risk_tier, gate_status, market, brand), relationships. |
| task_events | Append-only history: one immutable row per change. The tasks row rebuilds from this. | event_id, task_id, ts, actor, event_type, from_value, to_value, note, evidence. Event types include created, status_changed, completed, verified, reopened, reaped. |
| task_links | Typed dependencies between tasks. The writer rejects cycles. | from_task, to_task, link_type: one of blocks, blocked_by, parent_child, duplicates, supersedes, related. |
risk_tier and complexity are independent axes (a Calum correction folded into the design) — a task can be routine-risk yet structurally complex. The invariant holds across both: no done without evidence, whatever the risk or complexity.
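The cycle rejection on task_links can be sketched as a reachability check run before the insert. This is an assumption about how the writer might do it, not the shipped code: the helper names `would_create_cycle` and `add_link` are hypothetical, and treating each link type as its own directed graph is a simplification.

```python
import sqlite3


def would_create_cycle(conn, from_task, to_task, link_type="blocks"):
    """True if adding from_task -> to_task would close a loop of this link type."""
    if from_task == to_task:
        return True
    seen, stack = set(), [to_task]
    while stack:  # depth-first walk along existing links, starting at to_task
        node = stack.pop()
        if node == from_task:
            return True
        if node in seen:
            continue
        seen.add(node)
        rows = conn.execute(
            "SELECT to_task FROM task_links WHERE from_task = ? AND link_type = ?",
            (node, link_type),
        ).fetchall()
        stack.extend(r[0] for r in rows)
    return False


def add_link(conn, from_task, to_task, link_type="blocks"):
    """Insert a link, refusing any that would create a cycle."""
    if would_create_cycle(conn, from_task, to_task, link_type):
        raise ValueError(f"link {from_task} -> {to_task} would create a cycle")
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO task_links (from_task, to_task, link_type) VALUES (?, ?, ?)",
            (from_task, to_task, link_type),
        )
```

With links A→B and B→C already stored, `add_link(conn, "C", "A")` raises instead of writing the row.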
done without --evidence
cancelled
task_events
A human-only, codeword-gated, two-factor path: the single sanctioned way to expunge from the append-only log + snapshots when legal needs content removed. Writes a content-free tombstone. Never on the LLM-facing CLI.
A working tools/tasks.py CLI, backed by SQLite, that can create, read, update, and complete tasks — with an append-only event log and the “no done without evidence” invariant enforced in the single write path. Executed as a 6-task TDD plan by a subagent under superpowers:subagent-driven-development.
7 commits on branch feature/v4-migration — 6 from the TDD plan plus the review-fix commit 243ff7e.
Owns the schema, the connection, and the single write-path operations.
- schema.py: DDL for the three tables + indexes, idempotent
- db.py: connection factory (WAL + 5s busy-timeout + schema apply)
- store.py: the write path (create_task, get_task, list_tasks, set_status, complete_task)

A thin argparse CLI over the package, the single writer. Five subcommands:
- create: returns a T-xxxxxx id
- list / get: read the store
- update: change status
- done: refuses without --evidence, succeeds with it

Every mutation writes both the tasks row and an immutable task_events row, inside one transaction. The “no done without evidence” rule was confirmed by CLI smoke test: done without --evidence exits non-zero.
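The write path described above can be sketched as follows. This is a reduced illustration under stated assumptions, not the shipped store.py: the schema is trimmed to the columns the sketch touches, the `pending` default status and the `secrets`-based id are guesses at details the dashboard does not spell out, and `_log` is a hypothetical helper.

```python
import secrets
import sqlite3
from datetime import datetime, timezone

# Trimmed, assumed schema: only the columns this sketch needs.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id TEXT PRIMARY KEY, title TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending', evidence TEXT
);
CREATE TABLE IF NOT EXISTS task_events (
    event_id INTEGER PRIMARY KEY AUTOINCREMENT, task_id TEXT NOT NULL,
    ts TEXT NOT NULL, actor TEXT NOT NULL, event_type TEXT NOT NULL,
    from_value TEXT, to_value TEXT, evidence TEXT
);
"""


def _log(conn, task_id, actor, event_type, from_value, to_value, evidence=None):
    """Append one immutable history row (hypothetical helper)."""
    conn.execute(
        "INSERT INTO task_events (task_id, ts, actor, event_type, from_value,"
        " to_value, evidence) VALUES (?, ?, ?, ?, ?, ?, ?)",
        (task_id, datetime.now(timezone.utc).isoformat(), actor, event_type,
         from_value, to_value, evidence),
    )


def create_task(conn, title, actor="cli"):
    task_id = "T-" + secrets.token_hex(3)  # T-xxxxxx shape; generation is assumed
    with conn:  # tasks row and event row commit together or not at all
        conn.execute("INSERT INTO tasks (id, title) VALUES (?, ?)", (task_id, title))
        _log(conn, task_id, actor, "created", None, "pending")
    return task_id


def complete_task(conn, task_id, evidence, actor="cli"):
    # The invariant lives in the write path: blank evidence never reaches the DB.
    if not evidence or not evidence.strip():
        raise ValueError("a task cannot be marked done without evidence")
    with conn:
        row = conn.execute("SELECT status FROM tasks WHERE id = ?", (task_id,)).fetchone()
        if row is None:
            raise KeyError(task_id)
        conn.execute("UPDATE tasks SET status = 'done', evidence = ? WHERE id = ?",
                     (evidence, task_id))
        _log(conn, task_id, actor, "completed", row[0], "done", evidence)
```

Because both statements run inside one `with conn:` block, a failure between them rolls back the tasks update as well, so the current-state row and the event log cannot disagree.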
The 6 TDD tasks — each written test-first, red then green then commit:
- CREATE IF NOT EXISTS
- created event; collision-resistant ids
- get_task + list_tasks with status / workstream / tier filters
- set_status rejects reaching done directly; complete_task refuses blank evidence
- tools/tasks.py with all five subcommands; full M1 suite green

After the 6-task plan completed, the shipped code went through an independent Opus code review. Verdict: APPROVED-WITH-NITS, with no HIGH issues. 16 M1 tests green. (15 pre-existing test_ingredient_claim_gate.py failures were confirmed unrelated: domain assertions, not M1.)
243ff7e FIXED

The review's single MEDIUM finding: complete_task could re-complete an already-done task. Doing so would write a meaningless done→done event into the append-only log, which is noise in an immutable history that is supposed to be authoritative.
The fix: complete_task now guards against re-completing an already-done task. Shipped as review-fix commit 243ff7e with +1 test. This is the 7th commit in the M1 chain.
set_status could still silently reopen a done task. The M1 plan explicitly defers the state machine, so done-terminality became an M2 task — a deliberate reopen_task with a reopened event. See Milestone 2, Task 2.
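The transition rules this review produced can be summarised as three guard functions. This is a sketch only: the function names and `TransitionError` are hypothetical, the status name `in_progress` in the usage note is an assumption, and the reopen guard reflects the M2 plan rather than shipped code.

```python
class TransitionError(ValueError):
    """Raised when a status change would break a spine invariant."""


# Fields the planned reopen clears, so a reopened task stops reading as done.
CLEARED_ON_REOPEN = (
    "completed_at", "completed_by", "evidence",
    "verification_status", "last_verified_at",
)


def check_set_status(current, new):
    if new == "done":
        raise TransitionError("set_status cannot reach done; use complete_task")
    if current == "done":
        raise TransitionError("done is only left through reopen_task")


def check_complete(current, evidence):
    if current == "done":  # the 243ff7e guard: no done -> done event
        raise TransitionError("task is already done")
    if not evidence.strip():
        raise TransitionError("no done without evidence")


def check_reopen(current, reason):
    if current != "done":
        raise TransitionError("only a done task can be reopened")
    if not reason.strip():
        raise TransitionError("reopen needs a non-empty reason")
    # Column updates to apply; the old proof stays in task_events.
    return {field: None for field in CLEARED_ON_REOPEN}
```

For example, `check_set_status("done", "in_progress")` raises, while `check_reopen("done", "evidence file was wrong")` returns the fields to null out.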
Make the task store trustworthy on read — every read carries a freshness signal, a daily cron sweep reaps dead tasks and flags drift, the store is snapshotted, done becomes a stable state only left through a deliberate reopen, and every task can carry an explicit, /goal-ready definition of done.
An 11-task plan — written, QC'd over 3 rounds, finalised, and ready to execute. No code shipped yet.
Three new store-layer mutations extend the single write-path; two new modules add the verify-before-trust signal and the reconciliation sweep; a snapshot module takes atomic backups; the CLI gains the matching subcommands; a cron job schedules the daily sweep.
- verify_task: record a verification outcome (verified / failed) with a timestamp and a verified event
- reopen_task + reap_task + set_status done-guard: reopen becomes the only sanctioned way out of done and clears the stale completion / verification fields; closes the M1 review item
- freshness.py: a pure-function verify-before-trust signal, zero I/O. A brand-new task stays quiet, with no false advisory
- get / list: per-row freshness label + a footer counting tasks that advise verification
- reconcile.py: the sweep, which reaps expired ephemerals, flags durable tasks needing verification, and flags done tasks with missing evidence
- snapshot.py: atomic SQLite backups via the online backup API, with retention pruning
- verify / reopen / reconcile / snapshot
- cron/jobs/task-spine-reconcile.md: a daily 05:00 sweep + snapshot
- definition_of_done column + idempotent migration: added to the DDL for fresh DBs and via an idempotent ALTER TABLE for an existing M1 store
- create --dod, a dod subcommand, and set_definition_of_done to set or refine it later
- goal-brief emitter: renders a task as a /goal-ready brief with its context, its definition of done, and the verify command to close the loop

/goal is a native Claude Code + Codex feature that loops a session until an explicit definition of done is met. It is video-driven: folded into M2 on 2026-05-17 after two YouTube videos, rather than deferred to a later milestone.
The Task Spine is the durable cross-session store that /goal reads from and writes to. A /goal run reads a task's definition_of_done, works the loop, and writes the outcome back. M2's verify_task is the persistent analogue of /goal's end-of-turn completion check — the same idea, made durable across sessions.
- tasks.db, not data.db. Design §5 predates the M1 decision to give the spine its own SQLite file; M2 snapshots that file.
- complete_task populates evidence. M2 flags done tasks whose evidence file has gone missing (a proof-of-done integrity breach). The cross-surface drift case is a post-migration concern for M5.
- set_status could silently reopen a done task; Task 2 implements the deliberate reopen_task + a set_status guard.
- /goal feature. Not from design.md; added 2026-05-17 after the /goal videos.

The M2 plan went through three plan-QC rounds, each a parallel review by Codex 5.5 and Gemini 3.1 Pro (2026-05-16/17). Every finding was folded in inline; the plan was not halted on a HIGH: per project rule, a HIGH on a plan is fixed inline and the work proceeds.
The three rounds tell a story: R1 caught four HIGH issues across both reviewers; R2 was a re-QC after the revision that recovered a Gemini HIGH truncated in R1; R3 re-reviewed the expanded plan after the /goal fold was added. The journey itself surfaced a tooling bug — the QC wrappers' output budget was too small and silently truncated a review.
The first plan-QC of M2's Freshness layer (Tasks 1–8). Codex 5.5 and Gemini 3.1 Pro both returned REFINE. Findings folded into the revision:
connect(). The snapshot CLI command ran after connect(), which auto-creates the DB — masking the “no task store” guard. Fixed: Task 7 special-cases snapshot before connect(); a new test covers it.
reopen_task left completed_at / completed_by / evidence / verification_status / last_verified_at intact — so a reopened task still read as done and verified. Fixed: reopen_task clears all of them; the old proof stays in task_events.
schedule / timezone / status keys the runtime does not read. Fixed: Task 8 uses the verified name / time / days / active format.
freshness_signal flagged every brand-new task with a false VERIFY advisory (unverified ⇒ stale ⇒ advise_verify). Fixed: the freshness model was reworked — stale now requires a prior verification, and advise_verify fires only for done or risk-bearing-live tasks with a real trust gap.
find_missing_evidence resolved relative paths against the process cwd (now checks absolute paths only); and snapshot filenames had second precision and could overwrite (now a numeric suffix on same-second collision).
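One possible reading of the reworked freshness model from the findings above, written as a pure function with no I/O. Everything beyond the rules quoted on this dashboard is an assumption: the label names, the 7-day `max_age`, the `Freshness` dataclass, treating any risk tier other than `routine` as risk-bearing, and defining a "real trust gap" as a stale or failed verification (or an unverified done task).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Freshness:
    label: str           # assumed labels: unverified / verified / stale / failed
    advise_verify: bool


def freshness_signal(
    status: str,
    verification_status: Optional[str],
    last_verified_at: Optional[datetime],
    risk_tier: Optional[str],
    now: datetime,
    max_age: timedelta = timedelta(days=7),  # hypothetical threshold
) -> Freshness:
    # "verified" and "stale" both require a prior successful verification,
    # so a bare timestamp (drift / manual DB edit) never reads as verified.
    if verification_status == "verified" and last_verified_at is not None:
        label = "stale" if now - last_verified_at > max_age else "verified"
    elif verification_status == "failed":
        label = "failed"
    else:
        label = "unverified"

    live = status not in ("done", "cancelled")
    risk_bearing = risk_tier not in (None, "routine")
    matters = status == "done" or (live and risk_bearing)
    trust_gap = label in ("stale", "failed") or (status == "done" and label == "unverified")
    return Freshness(label, matters and trust_gap)
```

Under these assumptions a brand-new pending task is labelled `unverified` but raises no advisory, while a done task verified a month ago is `stale` and does.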
A re-QC after the Round 1 revision. Codex returned REFINE; Gemini returned SHIP pending one REFINE. Both flagged the same single remaining bug:
_prune RETENTION BUG
This was Gemini's truncated Round 1 HIGH, recovered. Round 1's Gemini response was cut off by gemini_qc.py's output budget. With the budget raised, Round 2 recovered the finding: _prune sorted snapshot filenames lexically — but a same-second collision file (...-2.db) sorts before the unsuffixed name (- is 0x2D, . is 0x2E), so _prune would delete the newer snapshot. Fixed: _prune now orders by file mtime; a dedicated test covers it with os.utime-controlled mtimes.
C:/Python314/python.exe (resolves in both Git Bash and PowerShell). Codex's alternative — rewrite every command as PowerShell — was not taken: plan steps run via the Bash tool, not PowerShell.
test_ingredient_claim_gate.py cases.
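The _prune ordering bug from this round is easy to reproduce. The sketch below uses hypothetical snapshot filenames and a hypothetical `prune` signature; it shows why lexical order misplaces the same-second collision file and how ordering by mtime avoids that.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical names: a snapshot and its same-second collision sibling.
FIRST = "tasks-20260516-050000.db"
SECOND = "tasks-20260516-050000-2.db"  # written later, gets the numeric suffix

# '-' (0x2D) sorts before '.' (0x2E), so the newer file looks "older" lexically.
LEXICAL_ORDER = sorted([FIRST, SECOND])


def prune(snapshot_dir: Path, keep: int) -> list:
    """Delete all but the `keep` newest snapshots, judged by file mtime."""
    snaps = sorted(snapshot_dir.glob("*.db"),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    doomed = snaps[keep:]
    for path in doomed:
        path.unlink()
    return doomed


def demo() -> list:
    """Build two snapshots with controlled mtimes, prune to one, list survivors."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        for name, mtime in ((FIRST, 1_000_000_000), (SECOND, 1_000_000_001)):
            (folder / name).write_bytes(b"")
            os.utime(folder / name, (mtime, mtime))  # set atime and mtime
        prune(folder, keep=1)
        return sorted(p.name for p in folder.glob("*.db"))
```

Here `LEXICAL_ORDER[0]` is the suffixed file, which is the one a name-sorted prune would have treated as oldest; `demo()` keeps it, because its mtime is the newer of the two.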
After Rounds 1–2, Calum directed — from two YouTube videos — that the native /goal feature be folded into M2. Tasks 9–11 were added, so the whole expanded plan went through a third QC round. Codex 5.5: REFINE (one HIGH, two MEDIUM — all fixed inline). Gemini 3.1 Pro: SHIP.
(id, title) fixture — which crashes on the SCHEMA's CREATE INDEX ... ON tasks(status) before the column assertion is reached. Fixed: the test now builds the full M2 schema, DROP COLUMN definition_of_done to get a genuine M1-shaped table, then asserts the migration re-adds it.
freshness_signal could label an unverified-but-timestamped row (drift / manual DB edit) as verified. Fixed: the label branches were reordered so verified requires verification_status == 'verified'; a regression test covers the drift case.
set_definition_of_done silently accepted a blank value, erasing the DoD. Fixed: it now rejects blank input, consistent with the non-empty rule on evidence and reopen reason; a test covers it.
create dispatch omitting domain / complexity / owner / risk_tier — was a verified non-issue: the M1 create subparser never exposed those as CLI flags, so create_task correctly defaults them. Nothing is dropped.
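The idempotent migration that the Round 3 test exercises can be sketched with `PRAGMA table_info`. This is an illustration under assumptions, not the plan's code: the function name `ensure_definition_of_done` is hypothetical, and the column is assumed to be plain TEXT.

```python
import sqlite3


def ensure_definition_of_done(conn: sqlite3.Connection) -> bool:
    """Add tasks.definition_of_done if missing. Returns True when it was added."""
    # table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(tasks)")}
    if "definition_of_done" in columns:
        return False  # already migrated: running again is a no-op
    with conn:
        conn.execute("ALTER TABLE tasks ADD COLUMN definition_of_done TEXT")
    return True
```

Checking the column list first is what makes the migration safe to run on every start: a fresh DB built from the new DDL and an already-migrated M1 store both fall through the early return.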
Round 1's Gemini review was silently truncated by gemini_qc.py's output budget. The fix swept the QC wrappers: MAX_ARTIFACT_CHARS 60K→1M (the 60K cap would have truncated the M2 plan itself), truncate() now warns loudly on stderr, and gemini_qc.py's MAX_OUTPUT_TOKENS went 8192→32768. Banked to feedback_qc_wrapper_artifact_and_output_caps.md.
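A minimal sketch of a truncation helper that fails loudly rather than silently. Only the names `truncate` and `MAX_ARTIFACT_CHARS` and the 1M value come from the note above; the signature, the `label` parameter, and the warning text are assumptions.

```python
import sys

MAX_ARTIFACT_CHARS = 1_000_000  # raised from 60K per the note above


def truncate(text: str, limit: int = MAX_ARTIFACT_CHARS, label: str = "artifact") -> str:
    """Cap text at `limit` characters, warning on stderr whenever anything is cut."""
    if len(text) <= limit:
        return text
    print(
        f"WARNING: {label} truncated from {len(text)} to {limit} chars",
        file=sys.stderr,
    )
    return text[:limit]
```

Writing the warning to stderr keeps it out of the review payload on stdout while still making a cut visible to whoever runs the wrapper.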
feature/v4-migration: 6 TDD tasks plus the review-fix commit.

The roadmap chart is a status snapshot, not a Gantt; M1 is the only bar at 100%. The QC chart shows the journey's shape: Round 1 carried the most HIGH findings (4), Round 2 isolated the single recovered _prune HIGH, and Round 3 re-reviewed the expanded /goal-fold plan.
The 11-task M2 plan is finalised and QC'd over 3 rounds — ready to execute via superpowers:subagent-driven-development. This ships the Freshness layer and the native /goal fold.
gsd-task-bridge.js's atomic write fails its rename step intermittently — 8 errors logged 2026-05-16, likely concurrent Claude panes racing on shared config-dir files. Low impact. Tracked as a pending task — to be fixed deliberately after root-causing, not bundled.
The remaining Phase 1 milestones, each getting its own plan when reached:
| Milestone | Status | Scope |
|---|---|---|
| M1 — Store + CLI core | SHIPPED | The SQLite store, the task_store/ package, the tasks.py CLI, the no-done-without-evidence invariant. |
| M2 — Freshness & /goal | PLANNED | Verify-before-trust reads, the reconciliation sweep, atomic snapshots, the native /goal fold. 11-task plan, ready. |
| M3 — Integrity | FUTURE | Reconstructable rows from task_events, gated bulk / destructive ops. |
| M4 — Admin path | FUTURE | The human-only, codeword-gated legal-removal path — two-factor, content-free tombstone. |
| M5 — Migration | FUTURE | Migrate the 8 non-authoritative surfaces into the spine — additive, dry-runnable, nothing old deleted until verified. |
Phase 1b adds the multi-LLM QC enforcement harness. Phase 2 is the Hermes supervisor agent with daily dreaming / self-review over the spine's structured history. Phase 3 is the messaging interface — Telegram / WhatsApp / phone access. The Phase 1 spine is the substrate all three build on.
Every fact on this dashboard is drawn from these four source files under F:/Agentic-OS/ — no detail is invented.
- 444c6d8 and f842ba2.
- dd1759b.
- ## Bank log entries dated 2026-05-16 and 2026-05-17 narrate what was built, reviewed, and verified: M1 ship, the code review, the M2 plan QC.