
GSD Task Spine

One canonical task store for Agentic OS — every task across every workstream in a single SQLite store, written through one controlled path, with completion proven by evidence rather than asserted. The source of truth the dashboard, hooks, cron, and Claude all read from and write to.

Phase 1 · M1 shipped 16 May 2026 · M2 planned
M1 — Store + CLI core: SHIPPED · M2 — Freshness & /goal: PLANNED · 7 commits · 16 tests green · 3 QC rounds — Codex 5.5 + Gemini 3.1 Pro

The GSD Task Spine is a single canonical store for every task across every workstream in Agentic OS — always fresh, tamper-resistant, and the source of truth the dashboard, hooks, cron, and Claude all read from and write to through one controlled path.

Brainstormed 2026-05-16 (thread e1459f57) with three independent inputs — web research, Codex 5.5, and Gemini 3.1 Pro — which converged hard on the same architecture: one SQLite store, one writer, everyone else reads.

The problem

DESIGN §1
The user-visible bug — structural, not careless

“It says a task is pending, I ask Claude to check, it comes back saying it was actually done.”

Task state in Agentic OS is fragmented across 8+ surfaces, none authoritative. Completion state does not propagate — a task marked done in one surface still shows pending in the others. There is no propagation path. That is the bug the Task Spine exists to kill.

The eight non-authoritative surfaces task state was scattered across:

Surface | Why it isn't the source of truth
Claude Code native tasks | In-session only — evaporate at session end
GSD statusline bridge | Tracks only the single active task
pending-tasks-registry.md | Manual, updated only at bank time — 1–2 day lag
Session files' “open loops” | Read-only snapshots, frozen after bank
Command Centre SQLite DB (tasks table) | Persistent but fed from ephemeral session state
.planning/STATE.md + ROADMAP.md | Phase-level, days stale
cron/jobs/ + cron_runs table | Scheduled work, tracked separately
Memory / feedback_*.md | Decisions and lessons — no completion state

Goal & scope

DESIGN §2
Scope

Every task, tiered

The store holds every task — recorded and tagged ephemeral (in-session steps) or durable (cross-session / cross-workstream work). Views filter by tier.

Build scope

Approach B — full spine

The lean core plus an append-only event log, typed dependency links, and domain-risk fields. Chosen by Calum over the lean-only option.

Foundation first

Supervisor deferred

Phase 1 is task creation + tracking only. The Hermes supervisor and the messaging interface are explicitly deferred to Phases 2–3.

The four-phase roadmap — this dashboard documents Phase 1, milestones M1 and M2:

Phase | Scope
Phase 1 — the task spine | Canonical store, single-writer CLI, freshness mechanism, integrity layers, migration of the 8 surfaces. Milestones M1–M5.
Phase 1b — QC harness | Wrappers that auto-run the Codex/Gemini QC + conditional Opus third-agent lifecycle. The spine models and gates this lifecycle; the harness drives it. (Task #23.)
Phase 2 — Hermes supervisor | A supervisor over the store; advisors = Claude/Codex/Gemini. Includes dreaming / self-review — daily offline reviews over the spine's data. (Task #22.)
Phase 3 — messaging interface | Telegram / WhatsApp / phone access to the supervisor.

Where it stands today

SNAPSHOT
  • M1: Shipped
  • 7 M1 commits
  • 16 M1 tests green
  • 11 M2 planned tasks
  • 3 QC rounds
  • 5 milestones (M1–M5)

The headline invariant

A task cannot move to status = done without an evidence pointer — a file path, URL, commit hash, or check result. Completion is proven on write, not asserted. This single rule, enforced in the one write path, is what makes the “it said pending but was done” bug impossible going forward.
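The rule can be pictured as a small guard that runs before any status write. This is a minimal sketch, not the shipped code: `check_transition`, `EvidenceRequired`, and the `in_progress` status are hypothetical names chosen for illustration.

```python
from typing import Optional


class EvidenceRequired(ValueError):
    """Raised when a task is moved to done without an evidence pointer."""


def check_transition(new_status: str, evidence: Optional[str]) -> None:
    """Reject a move to 'done' unless a non-blank evidence pointer is supplied."""
    if new_status == "done" and not (evidence and evidence.strip()):
        raise EvidenceRequired("status 'done' requires an evidence pointer")


# Hypothetical calls: a commit hash counts as evidence, whitespace does not.
check_transition("done", "commit 243ff7e")   # accepted
check_transition("in_progress", None)        # accepted: not a completion
try:
    check_transition("done", "   ")
    rejected = False
except EvidenceRequired:
    rejected = True
print("blank evidence rejected:", rejected)
```

Because the check sits in the one write path, no caller can bypass it by choosing a different entry point.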

One store, one writer, everyone else reads

DESIGN §3

The core principle all three research sources converged on. Every write — from Claude, hooks, cron, or the dashboard — goes through a single CLI. Nothing writes raw SQL to the task tables.

Claude · hooks · cron · dashboard
   |       |      |        |
   +------+---+----+----------+   every write through ONE path
              v
   tools/tasks.py -- the single writer
   (invariant + evidence enforcement)
              v
   SQLite -- .command-centre/tasks.db
   tables: tasks . task_events . task_links
              |
   +----------+--------------------+
   v          v                    v
Claude    dashboard          generated read-only
(reads)   (reads)            projections

Canonical store

.command-centre/tasks.db

Its own SQLite file — a sibling of the Command Centre's data.db. A separate file avoids colliding with that DB's existing tasks table and keeps the spine cleanly isolated. WAL mode on. Zero new dependencies — SQLite ships in Python's stdlib.

The one writer

tools/tasks.py

A single CLI. Every mutation — Claude, hooks, cron, dashboard — goes through it. The invariant and evidence enforcement live here, in one place.

Readers

Everyone else

Claude (via CLI read commands), the Command Centre dashboard, and auto-generated read-only projection files. No new external services, API keys, accounts, or installs — fully local.

Data model — three tables

DESIGN §4
Table | Role | Key columns
tasks | Current state — one row per task. A projection of the event log. | Identity (id, title, body), classification (tier, status, workstream, complexity), provenance, completion evidence (evidence, verification_status, last_verified_at), lifecycle (expires_at, due), risk gating (risk_tier, gate_status, market, brand), relationships.
task_events | Append-only history — one immutable row per change. The tasks row rebuilds from this. | event_id, task_id, ts, actor, event_type, from_value, to_value, note, evidence. Event types include created, status_changed, completed, verified, reopened, reaped.
task_links | Typed dependencies between tasks. The writer rejects cycles. | from_task, to_task, link_type — one of blocks, blocked_by, parent_child, duplicates, supersedes, related.

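One way the writer could reject cycles in task_links is a reachability check before inserting a link. The sketch below is an assumption about the approach, not the design's stated algorithm; `would_create_cycle` is a hypothetical helper, and which link types are cycle-checked is not specified in the source.

```python
from collections import defaultdict


def would_create_cycle(edges, from_task, to_task):
    """Return True if adding from_task -> to_task would close a cycle."""
    if from_task == to_task:
        return True
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    # The new edge closes a cycle iff from_task is already reachable from to_task.
    stack, seen = [to_task], set()
    while stack:
        node = stack.pop()
        if node == from_task:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph[node])
    return False


# Hypothetical existing links: T-000001 blocks T-000002, which blocks T-000003.
existing = [("T-000001", "T-000002"), ("T-000002", "T-000003")]
closes_loop = would_create_cycle(existing, "T-000003", "T-000001")
harmless = would_create_cycle(existing, "T-000001", "T-000003")
print(closes_loop, harmless)
```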
Two independent axes

risk_tier and complexity are independent axes (a Calum correction folded into the design) — a task can be routine-risk yet structurally complex. The invariant holds across both: no done without evidence, whatever the risk or complexity.

Freshness & integrity layers

DESIGN §5
Freshness — 3 layers

Always trustworthy on read

  • Proof-of-done on write — CLI rejects done without --evidence
  • Verify-before-trust on read — read output carries a freshness signal
  • Reconciliation sweep — a scheduled cron safety net
Integrity — 5 layers

Tamper-resistant by construction

  • No hard delete — tasks can only be cancelled
  • Append-only event log — immutable
  • Reconstructable rows from task_events
  • Atomic snapshots of the store
  • Gated bulk / destructive ops
Admin / legal-removal path

The one controlled exception

A human-only, codeword-gated, two-factor path — the single sanctioned way to expunge from the append-only log + snapshots when legal needs content removed. Writes a content-free tombstone. Never on the LLM-facing CLI.

Milestone 1 — Store + CLI core

SHIPPED · 2026-05-16

A working tools/tasks.py CLI, backed by SQLite, that can create, read, update, and complete tasks — with an append-only event log and the “no done without evidence” invariant enforced in the single write path. Executed as a 6-task TDD plan by a subagent under superpowers:subagent-driven-development.

ab5cc0f 6a3101a 507b99c 3a32582 4d67514 1a7e114 243ff7e

7 commits on branch feature/v4-migration — 6 from the TDD plan plus the review-fix commit 243ff7e.

What shipped

DELIVERABLES
The package

tools/task_store/

Owns the schema, the connection, and the single write-path operations.

  • schema.py — DDL for the three tables + indexes, idempotent
  • db.py — connection factory: WAL + 5s busy-timeout + schema apply
  • store.py — the write-path: create_task, get_task, list_tasks, set_status, complete_task
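A connection factory along the lines described for db.py might look like the sketch below. The DDL is a heavily trimmed, hypothetical subset of the three tables (column types and the index name are assumptions); only the WAL pragma, the 5-second busy-timeout, and the idempotent schema apply come from the text above.

```python
import os
import sqlite3
import tempfile

# Hypothetical, trimmed DDL: the real schema carries many more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS tasks (
    id TEXT PRIMARY KEY,
    title TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    evidence TEXT
);
CREATE TABLE IF NOT EXISTS task_events (
    event_id INTEGER PRIMARY KEY AUTOINCREMENT,
    task_id TEXT NOT NULL,
    ts TEXT NOT NULL,
    event_type TEXT NOT NULL,
    from_value TEXT,
    to_value TEXT,
    evidence TEXT
);
CREATE TABLE IF NOT EXISTS task_links (
    from_task TEXT NOT NULL,
    to_task TEXT NOT NULL,
    link_type TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_tasks_status ON tasks(status);
"""


def connect(path):
    """Open the store with WAL, a 5 s busy-timeout, and the schema applied."""
    conn = sqlite3.connect(path, timeout=5.0)  # wait up to 5 s on a locked DB
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript(SCHEMA)  # IF NOT EXISTS makes this safe on every connect
    return conn


workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "tasks.db")
connect(db_path).close()
conn = connect(db_path)  # a second connect re-applies the schema harmlessly
mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
print(mode)
```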
The CLI

tools/tasks.py

A thin argparse CLI over the package — the single writer. Five subcommands:

  • create — returns a T-xxxxxx id
  • list / get — read the store
  • update — change status
  • done — refuses without --evidence, succeeds with it
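The five subcommands map naturally onto argparse subparsers. The following is a sketch under assumptions: the `--status` flag on update, the positional argument names, and the exit code 2 are hypothetical (the source only states that done without --evidence exits non-zero), and the store calls are stubbed out.

```python
import argparse
import sys


def build_parser():
    parser = argparse.ArgumentParser(prog="tasks.py")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("create").add_argument("title")
    sub.add_parser("list")
    sub.add_parser("get").add_argument("task_id")
    update = sub.add_parser("update")
    update.add_argument("task_id")
    update.add_argument("--status", required=True)  # hypothetical flag name
    done = sub.add_parser("done")
    done.add_argument("task_id")
    done.add_argument("--evidence", default="")
    return parser


def main(argv):
    """Parse argv and return a process exit code."""
    args = build_parser().parse_args(argv)
    if args.command == "done" and not args.evidence.strip():
        print("error: done requires --evidence", file=sys.stderr)
        return 2  # any non-zero code signals the refusal
    # A real CLI would call into the store package here.
    print(f"ok: {args.command}")
    return 0


refused = main(["done", "T-000001"])
accepted = main(["done", "T-000001", "--evidence", "commit 243ff7e"])
```

Defaulting `--evidence` to an empty string lets the CLI print its own refusal message instead of argparse's generic usage error.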
The invariant, verified

Every mutation writes both the tasks row and an immutable task_events row, inside one transaction. The “no done without evidence” rule was confirmed by CLI smoke test — done without --evidence exits non-zero.
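The "both rows, one transaction" property can be sketched with sqlite3's connection context manager, which commits both writes together or rolls both back. This is an illustrative sketch with trimmed, hypothetical tables and an assumed `actor` default; it also includes a guard against re-completing a done task.

```python
import sqlite3
from datetime import datetime, timezone


def complete_task(conn, task_id, evidence, actor="claude"):
    """Mark a task done and log the event in the same transaction."""
    if not evidence or not evidence.strip():
        raise ValueError("done requires evidence")
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # commits both writes together, or rolls both back on error
        row = conn.execute(
            "SELECT status FROM tasks WHERE id = ?", (task_id,)
        ).fetchone()
        if row is None:
            raise KeyError(task_id)
        if row[0] == "done":
            # Avoid logging a meaningless done -> done event.
            raise ValueError("task is already done")
        conn.execute(
            "UPDATE tasks SET status = 'done', evidence = ? WHERE id = ?",
            (evidence, task_id),
        )
        conn.execute(
            "INSERT INTO task_events "
            "(task_id, ts, actor, event_type, from_value, to_value, evidence) "
            "VALUES (?, ?, ?, 'completed', ?, 'done', ?)",
            (task_id, now, actor, row[0], evidence),
        )


# Demo against a throwaway in-memory store with trimmed-down tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, evidence TEXT);
    CREATE TABLE task_events (
        event_id INTEGER PRIMARY KEY, task_id TEXT, ts TEXT, actor TEXT,
        event_type TEXT, from_value TEXT, to_value TEXT, evidence TEXT
    );
    INSERT INTO tasks (id, status) VALUES ('T-000001', 'pending');
    """
)
complete_task(conn, "T-000001", "commit 243ff7e")
task_row = conn.execute("SELECT status, evidence FROM tasks").fetchone()
event_count = conn.execute("SELECT COUNT(*) FROM task_events").fetchone()[0]
print(task_row, event_count)
```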

The 6 TDD tasks — each written test-first, red then green then commit:

  • T1 · SQLite schema: the three tables + indexes, idempotent via CREATE ... IF NOT EXISTS
  • T2 · Database connection: WAL mode, 5s busy-timeout, schema applied on connect
  • T3 · Create a task: inserts the row + a created event; collision-resistant ids
  • T4 · Read tasks: get_task + list_tasks with status / workstream / tier filters
  • T5 · Status updates + completion invariant: set_status rejects reaching done directly; complete_task refuses blank evidence
  • T6 · The CLI: tools/tasks.py with all five subcommands; full M1 suite green

Independent code review

OPUS REVIEW
Verdict — APPROVED-WITH-NITS, no HIGH

After the 6-task plan completed, the shipped code went through an independent Opus code review. Verdict: APPROVED-WITH-NITS — no HIGH issues. 16 M1 tests green. (15 pre-existing test_ingredient_claim_gate.py failures were confirmed unrelated — domain assertions, not M1.)

The one MEDIUM — fixed inline in commit 243ff7e · FIXED

The review's single MEDIUM finding: complete_task could re-complete an already-done task. Doing so would write a meaningless done→done event into the append-only log — noise in an immutable history that is supposed to be authoritative.

The fix: complete_task now guards against re-completing an already-done task. Shipped as review-fix commit 243ff7e with +1 test. This is the 7th commit in the M1 chain.

DEFERRED → M2 The review also flagged that set_status could still silently reopen a done task. The M1 plan explicitly defers the state machine, so done-terminality became an M2 task — a deliberate reopen_task with a reopened event. See Milestone 2, Task 2.

Milestone 2 — Freshness & Goal-Definition

PLANNED — not yet executed

Make the task store trustworthy on read — every read carries a freshness signal, a daily cron sweep reaps dead tasks and flags drift, the store is snapshotted, done becomes a stable state only left through a deliberate reopen, and every task can carry an explicit, /goal-ready definition of done.

An 11-task plan — written, QC'd over 3 rounds, finalised, and ready to execute. No code shipped yet.

Tasks 1–8 — the Freshness layer

DESIGN §5

Three new store-layer mutations extend the single write-path; two new modules add the verify-before-trust signal and the reconciliation sweep; a snapshot module takes atomic backups; the CLI gains the matching subcommands; a cron job schedules the daily sweep.

  • T1 · verify_task: record a verification outcome (verified / failed) with a timestamp and a verified event
  • T2 · reopen_task + reap_task + set_status done-guard: reopen becomes the only sanctioned way out of done and clears the stale completion / verification fields; closes the M1 review item
  • T3 · freshness.py: a pure-function verify-before-trust signal, zero I/O. A brand-new task stays quiet, with no false advisory
  • T4 · Freshness in CLI get / list: per-row freshness label + a footer counting tasks that advise verification
  • T5 · reconcile.py: the sweep (reap expired ephemerals, flag durable tasks needing verification, flag done tasks with missing evidence)
  • T6 · snapshot.py: atomic SQLite backups via the online backup API, with retention pruning
  • T7 · CLI subcommands: verify / reopen / reconcile / snapshot
  • T8 · Reconciliation cron job: cron/jobs/task-spine-reconcile.md, a daily 05:00 sweep + snapshot
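For the snapshot module, Python's sqlite3 exposes SQLite's online backup API as `Connection.backup`. The sketch below assumes a hypothetical `tasks-<timestamp>.db` naming pattern and destination layout; the existence check before connecting and the numeric suffix on a same-second collision mirror behaviours the M2 plan describes.

```python
import os
import sqlite3
import tempfile
import time


def snapshot(db_path, dest_dir):
    """Copy the live store to a timestamped file using SQLite's online backup."""
    # Check before connecting: sqlite3.connect() would create a missing file.
    if not os.path.isfile(db_path):
        raise FileNotFoundError(f"no task store at {db_path}")
    os.makedirs(dest_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = os.path.join(dest_dir, f"tasks-{stamp}.db")
    suffix = 1
    while os.path.exists(target):  # same-second collision: add a numeric suffix
        suffix += 1
        target = os.path.join(dest_dir, f"tasks-{stamp}-{suffix}.db")
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(target)
    try:
        src.backup(dst)  # consistent copy even if other connections are writing
    finally:
        dst.close()
        src.close()
    return target


# Demo: build a tiny throwaway store, then snapshot it twice in quick succession.
workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "tasks.db")
live = sqlite3.connect(db_path)
live.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, title TEXT)")
live.execute("INSERT INTO tasks VALUES ('T-000001', 'demo')")
live.commit()
live.close()

snapshot_dir = os.path.join(workdir, "snapshots")
first = snapshot(db_path, snapshot_dir)
second = snapshot(db_path, snapshot_dir)
print(os.path.basename(first), os.path.basename(second))
```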

Tasks 9–11 — the native /goal fold

ADDED 2026-05-17
  • T9 · definition_of_done column + idempotent migration: added to the DDL for fresh DBs and via an idempotent ALTER TABLE for an existing M1 store
  • T10 · DoD write path: create --dod, a dod subcommand, and set_definition_of_done to set or refine it later
  • T11 · goal-brief emitter: renders a task as a /goal-ready brief (its context, its definition of done, and the verify command to close the loop)
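SQLite has no ADD COLUMN IF NOT EXISTS, so an idempotent migration typically inspects the table first. A minimal sketch of that pattern follows; `ensure_definition_of_done` is a hypothetical helper name, and the TEXT column type is an assumption.

```python
import sqlite3


def ensure_definition_of_done(conn):
    """Add the definition_of_done column if this store predates it."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(tasks)")}
    if "definition_of_done" in columns:
        return False  # already migrated: nothing to do
    conn.execute("ALTER TABLE tasks ADD COLUMN definition_of_done TEXT")
    conn.commit()
    return True


# Demo on an M1-shaped table that lacks the column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, title TEXT, status TEXT)")
first_run = ensure_definition_of_done(conn)   # adds the column
second_run = ensure_definition_of_done(conn)  # no-op on an already-migrated store
print(first_run, second_run)
```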
The /goal link — why the spine and /goal pair up

/goal is a native Claude Code + Codex feature: it loops a session until an explicit definition of done is met. Its inclusion was prompted by two YouTube videos; it was folded into M2 on 2026-05-17 rather than deferred to a later milestone.

The Task Spine is the durable cross-session store that /goal reads from and writes to. A /goal run reads a task's definition_of_done, works the loop, and writes the outcome back. M2's verify_task is the persistent analogue of /goal's end-of-turn completion check — the same idea, made durable across sessions.

Design-alignment notes

PLAN §
Four points where the M2 plan resolves design ambiguity
  • Snapshots target tasks.db, not data.db. Design §5 predates the M1 decision to give the spine its own SQLite file; M2 snapshots that file.
  • The evidence-drift check is implemented as its coherent inverse. Design §5 says “flag pending tasks whose evidence already exists” — but in the M1 schema only complete_task populates evidence. M2 flags done tasks whose evidence file has gone missing (a proof-of-done integrity breach). The cross-surface drift case is a post-migration concern for M5.
  • The done-terminality fix is folded into M2. The M1 review flagged that set_status could silently reopen a done task; Task 2 implements the deliberate reopen_task + a set_status guard.
  • Tasks 9–11 fold in the native /goal feature. Not from design.md — added 2026-05-17 after the /goal videos.

The QC journey

3 ROUNDS

The M2 plan went through three plan-QC rounds — each a parallel review by Codex 5.5 and Gemini 3.1 Pro (2026-05-16/17). Every finding was folded in inline; the plan was not halted on a HIGH — per project rule, a HIGH on a plan is fixed inline and the work proceeds.

The three rounds tell a story: R1 caught four HIGH issues across both reviewers; R2 was a re-QC after the revision that recovered a Gemini HIGH truncated in R1; R3 re-reviewed the expanded plan after the /goal fold was added. The journey itself surfaced a tooling bug — the QC wrappers' output budget was too small and silently truncated a review.

Round 1 — both reviewers: REFINE · REFINE / REFINE

The first plan-QC of M2's Freshness layer (Tasks 1–8). Codex 5.5 and Gemini 3.1 Pro both returned REFINE. Findings folded into the revision:

HIGH · CODEX Snapshot ran after connect(). The snapshot CLI command ran after connect(), which auto-creates the DB — masking the “no task store” guard. Fixed: Task 7 special-cases snapshot before connect(); a new test covers it.
HIGH · CODEX + GEMINI Reopen left stale verification state. reopen_task left completed_at / completed_by / evidence / verification_status / last_verified_at intact — so a reopened task still read as done and verified. Fixed: reopen_task clears all of them; the old proof stays in task_events.
HIGH · CODEX Cron frontmatter format. Task 8's cron frontmatter used schedule / timezone / status keys the runtime does not read. Fixed: Task 8 uses the verified name / time / days / active format.
HIGH · GEMINI Freshness false-positive noise. freshness_signal flagged every brand-new task with a false VERIFY advisory (unverified → stale → advise_verify). Fixed: the freshness model was reworked — stale now requires a prior verification, and advise_verify fires only for done or risk-bearing-live tasks with a real trust gap.
MEDIUM · CODEX Cron body was a shell-command block, but the runtime executes the body as a Claude prompt — rewritten as a scheduled-job prompt. Also: find_missing_evidence resolved relative paths against the process cwd (now checks absolute paths only); and snapshot filenames had second precision and could overwrite (now a numeric suffix on same-second collision).
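The reopen fix can be sketched as a single transaction that resets the completion and verification columns while leaving the earlier events untouched. The tables here are trimmed and hypothetical, and two details are assumptions rather than facts from the plan: that a reopened task returns to `pending`, and that cleared fields are set to NULL.

```python
import sqlite3
from datetime import datetime, timezone

CLEARED = ("completed_at", "completed_by", "evidence",
           "verification_status", "last_verified_at")


def reopen_task(conn, task_id, reason, actor="claude"):
    """Move a done task back to pending, clearing its stale proof fields."""
    if not reason or not reason.strip():
        raise ValueError("reopen requires a reason")
    now = datetime.now(timezone.utc).isoformat()
    with conn:  # the row reset and the reopened event commit together
        row = conn.execute(
            "SELECT status FROM tasks WHERE id = ?", (task_id,)
        ).fetchone()
        if row is None or row[0] != "done":
            raise ValueError("only a done task can be reopened")
        resets = ", ".join(f"{col} = NULL" for col in CLEARED)
        conn.execute(
            f"UPDATE tasks SET status = 'pending', {resets} WHERE id = ?",
            (task_id,),
        )
        conn.execute(
            "INSERT INTO task_events (task_id, ts, actor, event_type, "
            "from_value, to_value, note) "
            "VALUES (?, ?, ?, 'reopened', 'done', 'pending', ?)",
            (task_id, now, actor, reason),
        )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tasks (id TEXT PRIMARY KEY, status TEXT, completed_at TEXT,
        completed_by TEXT, evidence TEXT, verification_status TEXT,
        last_verified_at TEXT);
    CREATE TABLE task_events (event_id INTEGER PRIMARY KEY, task_id TEXT,
        ts TEXT, actor TEXT, event_type TEXT, from_value TEXT, to_value TEXT,
        note TEXT, evidence TEXT);
    INSERT INTO tasks VALUES ('T-000001', 'done', '2026-05-16', 'claude',
        'commit 243ff7e', 'verified', '2026-05-16');
    INSERT INTO task_events (task_id, ts, actor, event_type, to_value, evidence)
        VALUES ('T-000001', '2026-05-16', 'claude', 'completed', 'done',
                'commit 243ff7e');
    """
)
reopen_task(conn, "T-000001", "fix regressed")
row_after = conn.execute(
    "SELECT status, completed_at, completed_by, evidence, "
    "verification_status, last_verified_at FROM tasks"
).fetchone()
old_proof = conn.execute(
    "SELECT evidence FROM task_events WHERE event_type = 'completed'"
).fetchone()[0]
print(row_after, old_proof)
```

The old evidence is gone from the tasks row but still readable in the event log, which is what keeps the history authoritative.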
Round 2 — Codex REFINE / Gemini SHIP-pending-one · REFINE / SHIP−1

A re-QC after the Round 1 revision. Codex returned REFINE; Gemini returned SHIP pending one REFINE. Both flagged the same single remaining bug:

HIGH · THE _prune RETENTION BUG This was Gemini's truncated Round 1 HIGH, recovered. Round 1's Gemini response was cut off by gemini_qc.py's output budget. With the budget raised, Round 2 recovered the finding: _prune sorted snapshot filenames lexically — but a same-second collision file (...-2.db) sorts before the unsuffixed name (- is 0x2D, . is 0x2E), so _prune would delete the newer snapshot. Fixed: _prune now orders by file mtime; a dedicated test covers it with os.utime-controlled mtimes.
CODEX · ADDRESSED Shell/path portability — the plan's POSIX shell syntax could mislead in a PowerShell context. Addressed with an explicit “Execution context” header note and by switching all interpreter calls to C:/Python314/python.exe (resolves in both Git Bash and PowerShell). Codex's alternative — rewrite every command as PowerShell — was not taken: plan steps run via the Bash tool, not PowerShell.
CODEX · ADDRESSED Test-suite verification clarity — Task 7's single full-directory run conflated “green” with a known 15-failure exit code. Split into an explicit M1+M2 file list that must exit 0, and a full-directory run whose only accepted failures are the 15 pre-existing test_ingredient_claim_gate.py cases.
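The _prune ordering problem is easy to reproduce. In the sketch below the filenames are hypothetical examples of the timestamp-plus-suffix pattern, and `prune` is an illustrative stand-in for the plan's _prune, not its actual code; mtimes are pinned with os.utime, as in the plan's test.

```python
import os
import tempfile


def prune(snapshot_dir, keep):
    """Delete all but the `keep` most recently modified snapshot files."""
    paths = [
        os.path.join(snapshot_dir, name)
        for name in os.listdir(snapshot_dir)
        if name.endswith(".db")
    ]
    paths.sort(key=os.path.getmtime, reverse=True)  # newest first, by mtime
    for stale in paths[keep:]:
        os.remove(stale)
    return paths[:keep]


# Two snapshots taken in the same second; the "-2" file is the newer one.
snapshot_dir = tempfile.mkdtemp()
older = os.path.join(snapshot_dir, "tasks-20260516T050000.db")
newer = os.path.join(snapshot_dir, "tasks-20260516T050000-2.db")
for path, mtime in ((older, 1_000_000_000), (newer, 1_000_000_001)):
    with open(path, "wb"):
        pass
    os.utime(path, (mtime, mtime))

# Lexical order puts the newer "-2" file first ('-' sorts before '.'),
# so an oldest-first filename sort would treat it as the oldest snapshot.
lexical = sorted(os.listdir(snapshot_dir))
kept = prune(snapshot_dir, keep=1)
print(lexical[0], "| kept:", os.path.basename(kept[0]))
```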
Round 3 — Codex REFINE / Gemini SHIP · REFINE / SHIP

After Rounds 1–2, Calum directed — from two YouTube videos — that the native /goal feature be folded into M2. Tasks 9–11 were added, so the whole expanded plan went through a third QC round. Codex 5.5: REFINE (one HIGH, two MEDIUM — all fixed inline). Gemini 3.1 Pro: SHIP.

HIGH · CODEX Migration-test fixture. Task 9's migration test used a bare (id, title) fixture — which crashes on the SCHEMA's CREATE INDEX ... ON tasks(status) before the column assertion is reached. Fixed: the test now builds the full M2 schema, runs DROP COLUMN definition_of_done to get a genuine M1-shaped table, then asserts the migration re-adds it.
MEDIUM · CODEX Freshness label. freshness_signal could label an unverified-but-timestamped row (drift / manual DB edit) as verified. Fixed: the label branches were reordered so verified requires verification_status == 'verified'; a regression test covers the drift case.
MEDIUM · CODEX Blank-DoD reject. set_definition_of_done silently accepted a blank value, erasing the DoD. Fixed: it now rejects blank input, consistent with the non-empty rule on evidence and reopen reason; a test covers it.
GEMINI · NON-ISSUE Gemini's one MEDIUM — the create dispatch omitting domain / complexity / owner / risk_tier — was a verified non-issue: the M1 create subparser never exposed those as CLI flags, so create_task correctly defaults them. Nothing is dropped.
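Taken together, the freshness fixes from Rounds 1 and 3 suggest a label-then-advise shape for freshness_signal. The sketch below is a guess at that shape, not the plan's code: the 7-day threshold, the `LIVE` statuses, the `high` risk tier value, and the exact definition of a "risk-bearing live" task are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=7)     # hypothetical staleness threshold
LIVE = ("pending", "in_progress")   # hypothetical "live" statuses


def freshness_signal(status, verification_status, last_verified_at, risk_tier, now):
    """Return (label, advise_verify) from row fields alone; no I/O."""
    has_verification = (
        verification_status == "verified" and last_verified_at is not None
    )
    if not has_verification:
        label = "unverified"   # includes timestamped-but-unverified drift rows
    elif now - last_verified_at > STALE_AFTER:
        label = "stale"        # only reachable after a prior verification
    else:
        label = "verified"
    risk_bearing_live = status in LIVE and risk_tier == "high"
    advise_verify = label != "verified" and (status == "done" or risk_bearing_live)
    return label, advise_verify


now = datetime(2026, 5, 17, tzinfo=timezone.utc)
brand_new = freshness_signal("pending", None, None, "routine", now)
fresh_done = freshness_signal("done", "verified", now - timedelta(days=1), "routine", now)
old_done = freshness_signal("done", "verified", now - timedelta(days=30), "routine", now)
drifted = freshness_signal("done", None, now - timedelta(days=1), "routine", now)
print(brand_new, fresh_done, old_done, drifted)
```

Checking `verification_status` before looking at the timestamp is what keeps a drifted or hand-edited row from being labelled verified.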
A tooling bug the journey surfaced

Round 1's Gemini review was silently truncated by gemini_qc.py's output budget. The fix swept the QC wrappers: MAX_ARTIFACT_CHARS 60K→1M (the 60K cap would have truncated the M2 plan itself), truncate() now warns loudly on stderr, and gemini_qc.py's MAX_OUTPUT_TOKENS went 8192→32768. Banked to feedback_qc_wrapper_artifact_and_output_caps.md.

Project at a glance

4 CHARTS

Milestone roadmap

M1 shipped, M2 planned, M3–M5 are future phases of the Phase 1 spine.

Test count — Milestone 1

16 M1 tests green. The pre-existing 15 unrelated failures are out of scope.

QC findings per round

M2 plan-QC — HIGH, MEDIUM and addressed-note findings, Codex 5.5 + Gemini 3.1 Pro combined.

Milestone 1 commit timeline

The 7-commit chain on feature/v4-migration — 6 TDD tasks plus the review-fix commit.
Reading the charts

The roadmap chart is a status snapshot, not a Gantt — M1 is the only bar at 100%. The QC chart shows the journey's shape: Round 1 carried the most HIGH findings (4), Round 2 isolated the single recovered _prune HIGH, and Round 3 re-reviewed the expanded /goal-fold plan.

What's next

ROADMAP
Immediate

Execute the M2 plan

The 11-task M2 plan is finalised and QC'd over 3 rounds — ready to execute via superpowers:subagent-driven-development. This ships the Freshness layer and the native /goal fold.

Tracked bug

GSD bridge EPERM-on-rename

gsd-task-bridge.js's atomic write fails its rename step intermittently — 8 errors logged 2026-05-16, likely concurrent Claude panes racing on shared config-dir files. Low impact. Tracked as a pending task — to be fixed deliberately after root-causing, not bundled.

The remaining Phase 1 milestones, each getting its own plan when reached:

Milestone | Status | Scope
M1 — Store + CLI core | SHIPPED | The SQLite store, the task_store/ package, the tasks.py CLI, the no-done-without-evidence invariant.
M2 — Freshness & /goal | PLANNED | Verify-before-trust reads, the reconciliation sweep, atomic snapshots, the native /goal fold. 11-task plan, ready.
M3 — Integrity | FUTURE | Reconstructable rows from task_events, gated bulk / destructive ops.
M4 — Admin path | FUTURE | The human-only, codeword-gated legal-removal path — two-factor, content-free tombstone.
M5 — Migration | FUTURE | Migrate the 8 non-authoritative surfaces into the spine — additive, dry-runnable, nothing old deleted until verified.

After Phase 1

Phase 1b adds the multi-LLM QC enforcement harness. Phase 2 is the Hermes supervisor agent with daily dreaming / self-review over the spine's structured history. Phase 3 is the messaging interface — Telegram / WhatsApp / phone access. The Phase 1 spine is the substrate all three build on.

Sources & provenance

SOURCE FILES

Every fact on this dashboard is drawn from these four source files under F:/Agentic-OS/ — no detail is invented.

  • Design specification
    projects/briefs/gsd-task-spine/design.md
    The problem, the architecture, the three-table data model, the freshness & integrity layers, the phasing, and the research summary. Committed in 444c6d8 and f842ba2.
  • Milestone 1 plan
    projects/briefs/gsd-task-spine/.planning/2026-05-16-m1-store-and-cli.md
    The 6-task TDD plan — schema, DB connection, create, read, status/done, CLI. Committed in dd1759b.
  • Milestone 2 plan
    projects/briefs/gsd-task-spine/.planning/2026-05-16-m2-freshness.md
    The 11-task Freshness & Goal-Definition plan, including the “Post-QC revisions” section that narrates the 3-round QC journey.
  • Session 49 — Claude OS stack build
    context/sessions/session-49-claude-os-stack-build.md
    The ## Bank log entries dated 2026-05-16 and 2026-05-17 narrate what was built, reviewed, and verified — M1 ship, the code review, the M2 plan QC.