MIS — IT Operations

Step-by-step SOPs for VPS, deploy, and infrastructure work. Copy-paste-ready commands. Bookmark this on phone — the terminal-side companion.

v1.1 · 2026-05-02 PM · updated

What's new in v1.1 — 2026-05-02 PM

Recovery Sequence reordered after peer review. Order is now: clear Coolify queue → push .dockerignore → ONE clean test deploy → hardening Steps 1+2. Reason: Step 1 restarts Docker; if the next build fires before Fix A is in, it burns 777 MB context all over again.
New Step R.1 — Clear Coolify queue. 13 deployments are jammed in queue right now. Cancel from UI individually — do NOT restart Coolify (Docker daemon mid-prune risk).
Fix A now has a mandatory audit step before pushing .dockerignore. Blindly excluding **/*.PNG can silently 404 dashboards. Includes a grep one-liner to find all live image references.
Fix B narrowed from "skip data:/creative-review:" (would block your regular updates) to a tight skip-tier for memory:/docs:/legal:/hr: commits only.
Fix D added — volume mount architecture. Live audit revealed creative-review/images/ is 767 MB of genuinely-served content — can't be excluded. Fix D moves it to a mounted volume so it's served without being part of the Docker image. Now MEDIUM-HIGH priority, not "quiet afternoon".
Fix E added — disk alerting cron. Webhook fires when disk > 60%. Prevents the next 100% surprise.
Fix F added — Coolify deployment history retention. 1092 historical deploys is bloat in itself. Cap at 50.

What is this? The MIS / IT Operations dashboard. Self-contained playbooks for the VPS (Hostinger srv843884), Coolify, GitHub Pages deploy pipeline, and disk-fill incident response.

Every code block has a one-click Copy button. Numbered steps run top-to-bottom. Expected outputs are highlighted in green; warnings in red; new sections in purple.

Source documents this dashboard mirrors:
• F:/Claude Root/claude-setup/hardening_instructions.txt
• memory/project_coolify_deploy_stale.md
• memory/working-context/session-29-pinecone-ad-agent-audit.md

Current state

As of 2026-05-02 PM

VPS disk

~28%

After 115.9 GB prune (2026-05-02 AM)

Coolify queue

13 jammed

Clear before any other step

Build context size

777 MB

Fix A: ~600 MB. Fix D: <50 MB

Docker log rotation

Done

R.4 shipped 2026-05-02 PM

journald cap

Done

R.5 shipped 2026-05-02 PM

Coolify auto-cleanup

Every 10 min

R.6 shipped — first run 14:50:04 UTC

Daily 04:00 UTC prune

Active

Fix C cron registered (SCHED-011)

Auto-sync hook

Disabled

Removed from settings.local.json

Coolify Auto Deploy

OFF

Done 2026-04-19

Disk alerting

Not yet

Fix E (new)

Volume-mount arch

REQUIRED

Fix D — only way to remove 767 MB images dir from build

Will regular small updates still work?

Yes — here's how

The concern: if we shrink the build context and add filters, will our automated workflows (data syncs, creative-review uploads, registry refreshes) still appear on the live dashboard?

The answer: yes, with two layered guarantees:

1. Today (Recovery + Fix C/E/F): every commit still triggers a Coolify rebuild — we're just capping the cost. Hardening (R.4/R.5) prevents log/journal bloat. Cron (Fix C) auto-prunes unused images nightly. Alerting (Fix E) catches the next disk-fill 24h early. Build context drops modestly (~777 MB → ~600 MB) because most of the bulk is genuinely-served content from the creative-review dashboard — per the audit.

2. Within 1–2 weeks (Fix D — volume mount): the structural answer. The 767 MB creative-review/images/ directory is served live, so it can't be excluded — but it CAN be served from a Coolify-mounted VPS volume instead of being baked into every Docker image. Workflows rsync updates directly. Build context becomes <50 MB. No Docker rebuild on data changes. This is the only way to durably support frequent commits at scale.

Sequence: Today → Recovery + Fix C/E/F (gets us stable). Within 1–2 weeks → Fix D (durable).

Related IT projects

Sister dashboards

Quiz Bridge — Cloudflare Worker email capture

Edge-first email capture for quiz forms (and future signup forms). Worker writes durably to Cloudflare KV first, then forwards to Shopify Customer API. Why it matters as IT: reduces dependency on the Coolify Apps VPS (this VPS) for customer-facing data capture — emails survive any downstream outage. Six-tab dashboard explains the architecture, options, and live submission monitor. Status: configuring (Worker deployed, KV binding + secret pending).

Google Consent Mode v2 — Shopify 4-store deploy SOP (NEW 2026-05-18)

Insert a single gtag('consent','default',{all granted}) block into layout/theme.liquid of all 4 Brainzyme Shopify stores (UK/US/DE/FR). ~330 bytes inline JS, placed before {{ content_for_header }} so it fires before the theme/app Google tag. Why it matters as IT: pre-deploy the v2 signals ad_user_data + ad_personalization were never declared on any store — Google silently dropped users from remarketing and could not run Enhanced Conversions on the website-tag side. Post-deploy every /g/collect POST carries gcs=G111 (= all 4 signals granted). Covers: 6-assertion pre-PUT suite (byte-bounded diff, Liquid tag-balance, marker-uniqueness), Shopify Dev MCP validate_theme with partial-theme-staging error filtering, read-after-write retry-with-backoff (Shopify Admin API has ~1-3 s eventual consistency — the US canary caught this), Playwright clean-state mobile verify (the canonical gcs in network proof), Lighthouse byte-delta perf-neutrality check, per-store rollback procedure. Posture: all-granted, no CMP banner — counsel-approved consent-or-leave disclaimer. Conventional default-denied + CMP guides do NOT apply — do not let a future LLM "correct" toward default-denied. Status: Part 1 LIVE on all 4 storefronts 2026-05-18. Part 2 (Web Pixels Customer Privacy API) decoupled → task #36 pending Admin check.

Dashboard Push-Live Inbox — rebuild SOP (NEW 2026-05-11)

8-phase recipe to redeploy the Push Live system from scratch (machine reset / VPS migration / new environment). Covers: FastAPI server, Task Scheduler, Windows Firewall, Tailscale, Coolify Container Labels (untick Readonly + tick Escape), shared dashboard client, Claude startup hook, E2E test. Why it matters as IT: replaces manual JSON-export-and-drag-into-chat with a single Push Live button. Loop = Browser → Cloudflare → Coolify Traefik → private Tailnet → local FastAPI → inbox folder → Claude consumes. Lists all banked artefacts so nothing is lost on rebuild. Status: E2E LIVE 2026-05-10.

Inbox over Cloudflare Tunnel — cutover SOP (NEW 2026-05-16, Phase 7 LIVE 2026-05-17)

10-phase migration that cuts the dashboard→FastAPI inbox path from Coolify reverse-proxy + Tailnet (4 hops, 502'ing under load) to Cloudflare Tunnel direct (2 hops, encrypted). New public hostname inbox.nutritionalproducts.org. Auth model: x-api-token header + 2nd-chance source classification — loopback + Tailnet sources still work (backward-compat); browser POSTs require the token. Covers: idempotent CF writes executor (tunnel + ingress + DNS), secret-hygiene install-token handoff (ACL-restricted file, 10-min auto-delete, never to chat or _writes.json), inbox_server.py patches, cloudflared Windows service install, 4-test acceptance battery, shared inbox-client.js cutover (Phase 7), 7-day rollback window. Why it matters as IT: ships the migration brief over 4 Codex passes (11 HIGHs fixed inline) and bakes in — secrets NEVER persist to git or chat; rollback round-trip is ~60s using IDs in _writes.json; documents the Cloudflare token type trap (cfat_* works for direct API but NOT for Wrangler — Wrangler needs OAuth). Status: LIVE end-to-end as of 2026-05-17. Acceptance battery 3/4 effective pass + real creative-review round-trip (513 KB envelope) + ft-tasks Pattern A round-trip + Playwright sweep 11/11 clean.

Windows Terminal — two Claude Code profiles (NEW 2026-05-11)

7-phase recipe covering: (1–5) two isolated Claude Code agents (Claude A blue / Claude B green) side-by-side in Windows Terminal, each profile setting CLAUDE_CONFIG_DIR inline via cmd.exe /k for full session / context / memory isolation while sharing the project root F:/Agentic-OS/, alt+shift+d → duplicatePane preserving source profile; (6) env-shim launcher scripts/claude-os.cmd that loads .env into process scope before claude.exe boots (replaces HKCU mirroring after Codex + Gemini peer review); (7) interactive launcher launch.cmd with model + session-mode picker and the resume wrappers that survived a PATH-order trap caught by Codex 5.5 decision-mode QC. Why it matters as IT: one-click parallel agents without state collision; secrets stay process-scoped, no HKCU pollution; resume by session number works end-to-end. Status: live on Calum's primary workstation 2026-05-11.

Quiz Dashboard — rebuild SOP (NEW 2026-05-11)

Operational analytics for the Brain Fog Quiz pipeline at apps.nutritionalproducts.org/quiz-dashboard/. Read-only; bearer-auth via sessionStorage prompt (CLOUDFLARE_REVIEWS_WORKER_TOKEN). Reads /api/test-stores + /api/submissions on the brainzyme-reviews Worker. Surfaces per-market Worker health, per-quiz cards (UK A/B/C/GD + US-A + DE-A + FR-A), incident log auto-derived from forward_dead entries, recent submissions table. Why it matters as IT: built in response to the 2nd OPS01 token-rotation event in 2 weeks — all 4 Shopify tokens silently 401-ing for ~3 days while customer leads were lost. Dashboard surfaces the exact state we lacked, so future rotation events are caught within hours. Status: live 2026-05-11.

GSD Task Bridge — Claude Code v2.x rebuild SOP (banked 2026-05-12)

4-phase recipe to bridge Claude Code v2.x in-memory task state into the GSD statusline. v2 deprecated TodoWrite (which wrote to ~/.claude/todos/*.json) in favour of TaskCreate / TaskUpdate / TaskList — state lives in process memory only, so gsd-statusline.js v1.28.0 silently rendered nothing. Fix is a PostToolUse hook (gsd-task-bridge.js, ~167 lines, registered in both project + user settings.json) that catches TaskCreate|TaskUpdate, parses the canonical v2 response shape ({"task":{"id":"N","subject":"...","activeForm":"...","status":"..."}}), and writes a small JSON file the patched statusline reads with session-id matching to prevent stale carry-over across Claude A / B / resumes. Why it matters as IT: first piece of v2.x compatibility infrastructure not available upstream — future GSD-style integrations on v2 must follow this hook pattern, not the legacy todos/ glob. Status: live on both Claude A and Claude B; banked into repo as commit 76e51e9 on feature/v4-migration (2026-05-12) so a clean checkout can rebuild.

Local Skill Auto-Commit Suite — rebuild SOP (NEW 2026-05-11)

4-phase recipe to port the upstream f6eebfb auto-commit suite to a clean checkout. 5 files: PostToolUse hook (skill-auto-commit.js) + detached worker + scripts/lib/common.sh helper lib + scripts/session-end.sh + scripts/rollback.sh interactive picker. Hook fires fire-and-forget after every Write / Edit / MultiEdit but matches ONLY .claude/skills/{name}/SKILL.local.md and CLAUDE.local.md — a deliberate narrow scope. Calum reviewed a broader alternative (commit every Claude Write/Edit as wip: {task name}) and explicitly rejected it for commit-noise / bisect pain. Why it matters as IT: preserves user customisations durably without depending on wrap-up rituals or risking upstream-merge conflicts on shared SKILL.md files. Status: live on F:/Agentic-OS/ 2026-05-11.

Tool Map → Master Sheet Sync — executive brief (NEW 2026-05-13)

Executive recap of today's session-49 work. Extended tools/sync_registries_to_master_sheet.py with a 3rd derived-mirror tab for reference/tool-map.md: 97 rows flattened from the 13-section markdown into a queryable Master Sheet view (sheetId 1038784443). Sibling tabs (Connections Registry / Page Build Registry) refreshed in the same run. check-tool-map.py validator already in place (parallel-session-built); Stage 9 of /preflight now reports clean. Access Registry dashboard's Tool Map card cross-links to all 3 Sheet tabs. Workstream still in-flight: ghost-dir source-identification task (Session 49 entry in pending-tasks-registry) blocks cleanup of C:\Users\PC\.claude . Three commits: Agentic-OS 9056f41 (sync script) + brainzyme-git 5a5f206d (access-registry HTML) + brainzyme-git tool-map-sync dashboard (this page). Connected work: GSD Task Bridge v2 SOP (motivating incident), System-State Mirror Hook (closes the backup gap).

System-State Mirror Hook — rebuild SOP (NEW 2026-05-12)

Stop hook that mirrors load-bearing files OUTSIDE the repo (~/.claude/settings.json, ~/.claude-secondary/settings.json, Windows Terminal settings, user-scope hooks) into reference/system-state/ so git history captures them. Hash-compare before write; detached worker auto-commits so it never blocks Claude. Motivated by the 2026-05-12 incident where PowerShell Remove-Item silently trimmed a trailing-space path and nuked the canonical ~/.claude/ dir — including the 12.3 KB user-scope settings.json with all global hook registrations. Recovery only worked because Claude Code's internal file-history/ happened to save us; AGENTS.md auto-commit didn't cover this path because it's outside the repo. Why it matters as IT: closes the only remaining gap where load-bearing system files had no deliberate backup — and bakes in the lesson that PowerShell path-trim is a foot-gun (proper deletion of trailing-space dirs uses Python shutil.rmtree with \\?\ prefix, NOT PowerShell, NOT cmd.exe + rd). Status: live as Stop hook 2026-05-12; commit ee3858d on feature/v4-migration.

How to use this dashboard

Disk emergency? Go to Emergency Flush first.

One-line paste. Frees disk in <30 seconds. Run this BEFORE anything else when deploys fail with no space left on device.

Then work through Recovery Sequence.

Steps R.1 → R.7. Clears the queue, audits and pushes the build-context fix, hardens Docker + journald, then runs a sanity check. Order matters, so do not reorder.

Once stable, work through Long-term Fixes.

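Six fixes total (A to F); Fix A now runs inside the Recovery Sequence as Step R.3. Do C first (cron), then E (alerting), then F (history retention). D (volume mount) is the structural overhaul: schedule it within 1–2 weeks rather than waiting for a quiet afternoon.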

Something broke? Rollback Reference has every undo command.

Each fix has a paired rollback. Keep this tab open during any deploy work.

Emergency Flush

When to run: deploy fails with no space left on device, or VPS disk is >85%. Single line. Single paste.

Clears stopped containers, unused images, build cache, dangling volumes, container logs, old Coolify build artifacts, and oversized journals. Frees 100+ GB in seconds when the disk has been bloating.

This is a rescue command, not a routine. The structural fix lives in Recovery Sequence.

Step E.1

SSH into VPS

From your local terminal:

BASH · LOCAL

ssh root@srv843884

Step E.2

Single-line emergency flush

CRITICAL

Paste as ONE line. Semicolons separate the commands. Runs in ~10–30 seconds.

BASH · VPS · root

df -h /; docker system df; docker container prune -f; docker image prune -af; docker network prune -f; docker volume prune -f; docker builder prune -af; docker system prune -af --volumes; truncate -s 0 /var/lib/docker/containers/*/*-json.log 2>/dev/null; find /data/coolify/applications -maxdepth 2 -type d -name 'commit-*' -mtime +1 -exec rm -rf {} + 2>/dev/null; journalctl --vacuum-size=200M; echo '=== AFTER ==='; df -h /; docker system df

Expected: one df -h / block before the flush and a second after the === AFTER === marker. Use% drops 30–70 percentage points. RECLAIMABLE on Images drops near zero.
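
The one-liner prints disk usage twice but leaves the subtraction to the reader. A minimal sketch for capturing the drop as a number, assuming a POSIX df and awk. The helper names disk_pct and report_drop are hypothetical, not part of the flush command:

```shell
# Hypothetical helpers: not part of the original one-liner.

# Print the Use% of a mount point as a bare integer (e.g. 28).
disk_pct() {
  df -P "${1:-/}" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# Print a one-line summary of the change between two Use% readings.
report_drop() {
  printf 'Use%% %s -> %s (dropped %s points)\n' "$1" "$2" "$(( $1 - $2 ))"
}
```

Possible use: run before=$(disk_pct) ahead of the flush, then report_drop "$before" "$(disk_pct)" once it finishes.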

Step E.3

If the multi-statement got mangled by paste

Sometimes xterm chops the long line. Run the most important single command on its own:

BASH · VPS · root

docker system prune -af --volumes

Because of the -f flag, Docker skips the Are you sure? prompt and starts deleting immediately. This single command does ~95% of the flush work.

Expected: stream of deleted: sha256:... lines, ending with Total reclaimed space: XXX GB.

Step E.4

Confirm the disk recovered

BASH · VPS · root

df -h /
docker system df

Healthy state: /dev/sda1 at <40% used. RECLAIMABLE on Images near zero.

Recovery Sequence

Goal: get from "disk-just-flushed-but-queue-jammed" back to a healthy, deployable state. Seven steps (R.1 to R.7). Order matters.

Why this order? If you do hardening Step 1 first (Docker restart), Coolify will retry the next queued build with a 777 MB context BEFORE the .dockerignore fix is in — and you'll burn disk again. Push .dockerignore first, then verify with one clean test deploy, THEN restart Docker.

Time: ~25 minutes start to finish, plus 24h verification window.

Step R.1 · Clear Coolify queue

Cancel the 13 jammed deployments

NEW

There are 13 deployments stuck in Queued state. Each one will try to rebuild a 777 MB image when it runs. Cancel them all from the Coolify UI before doing anything else.

R.1.1 Open Coolify deployments page:

URL

http://168.231.115.117:8000/project/.../application/.../deployment

R.1.2 For each Queued entry, click into it and hit Cancel. Do this individually. Do NOT click "Restart" on Coolify itself — that risks restarting the Docker daemon mid-prune.

R.1.3 When all queued items are cancelled, the deployment list should show only the most recent successful build (or the most recent failed one). The 13 queued count drops to 0.

Coolify does not auto-trigger a fresh deploy after cancellation. You'll need to trigger one manually in Step R.3.

Step R.2 · Audit before .dockerignore

Find what files are referenced live

NEW — CRITICAL

Blindly excluding folders from the Docker image will silently 404 any dashboard that references them. Run this audit first to see if any live HTML/JS/JSON references the paths we're about to exclude.

AUDIT ALREADY RUN 2026-05-02 PM — results below. Skip the grep commands; key findings:

• google-ads-audit/pipeline/snapshots/ IS served live — the Google Ads Audit dashboard at v4.0-alt/index.html fetches ../pipeline/snapshots/manifest.json on load. Must stay in image. Footprint: 2.4 MB, not a bloat concern.

• creative-review/images/ IS served live — the creative-review dashboard at creative-review/index.html uses basePath: 'images/canonical-library/'. phase2-merge.html renders <img src="images/phase2-adhd-...">. Must stay in image. Footprint: 767 MB — the bulk of the build context.

• creative-review/page-reviews/*/screenshots/ NOT live-referenced. Footprint: ~22 MB. Safe to exclude.

• No src="...PNG" references found. But still don't blindly exclude **/*.PNG — risk is too high. Targeted excludes only.

• .git/ directory: 961 MB. nixpacks may exclude it implicitly, but explicit .dockerignore is safer.

Honest expectation: Fix A's .dockerignore will save ~100–150 MB max, not the 700+ MB I claimed in v1.0. Most of the 777 MB context is genuinely-served creative-review images. The real durable answer is Fix D — volume mount, which moves creative-review/images/ out of the build context entirely. Fix D is now priority, not "do later".

If you want to re-run the audit yourself (e.g. after months of changes), the commands are below for reference:


BASH · LOCAL · F:/brainzyme-git

grep -rE '(creative-review/page-reviews|pipeline/snapshots|creative-review/images)' --include='*.html' --include='*.js' --include='*.json' . | grep -v '\.git/' | head -50
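
The grep above searches for full repo paths, so a relative reference such as images/canonical-library/ (the form the creative-review dashboard uses) may not match creative-review/images. A hedged sketch that counts referencing files per search string, so the trailing path segment can be checked too. audit_refs is a hypothetical helper; it treats each string as a literal, not a regex:

```shell
# Hypothetical helper: count files under ROOT that mention each PATH literally.
# Usage: audit_refs ROOT PATH...
audit_refs() {
  root=$1
  shift
  for path in "$@"; do
    # grep exits non-zero on "no match"; "|| true" keeps that from aborting strict shells.
    count=$( { grep -rlF --include='*.html' --include='*.js' --include='*.json' \
                 --exclude-dir=.git -e "$path" "$root" 2>/dev/null || true; } \
             | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$count" "$path"
  done
}
```

Possible use from the repo root: audit_refs . 'creative-review/images' 'images/canonical-library' 'page-reviews/'. A count of 0 suggests the path is not referenced, but dynamically built URLs would still slip past a literal search.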

Step R.3 · Push the .dockerignore fix

Shrink build context from 777 MB to ~600 MB

PROMOTED FROM FIX A

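Based on the R.2 findings, write the new .dockerignore. The baseline below excludes only paths the audit confirmed are not served live. If you re-run R.2 later and it finds new live references, remove those paths from the list before pushing.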

R.3.1 Open F:/brainzyme-git/.dockerignore in your editor.

R.3.2 Replace contents with this audit-verified safe baseline:

.dockerignore

.git/
node_modules/
*.zip
*.mp4
*.mov
*.psd
tmp/
output/pending/
_archive/
creative-review/page-reviews/*/screenshots/

Audit-verified excluded paths only:
• creative-review/page-reviews/*/screenshots/ — not live-referenced (saves ~22 MB)
• .git/ — 961 MB. nixpacks may already skip but explicit is safer
• tmp/, output/pending/, _archive/, common binary dev artifacts — safe by convention

NOT excluded (would break dashboards):
• creative-review/images/ (767 MB) — live-served by creative-review dashboard
• google-ads-audit/pipeline/snapshots/ (2.4 MB) — fetched live by audit dashboard
• **/*.PNG — risk of breaking image-heavy pages
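
Before committing, it can help to see which top-level directories dominate the build context. A rough sketch, assuming du, sort, and head are available locally (e.g. in Git Bash). largest_dirs is a hypothetical helper, and the */ glob skips hidden directories such as .git/, which needs its own du -sk .git:

```shell
# Hypothetical helper: list the N largest top-level directories under ROOT, sizes in KB.
# Usage: largest_dirs ROOT [N]
largest_dirs() {
  root=${1:-.}
  du -sk "$root"/*/ 2>/dev/null | sort -rn | head -n "${2:-10}"
}
```

Possible use from the repo root: largest_dirs . 5. This measures what is on disk, not what .dockerignore leaves in; the transferring context line in the deploy log remains the real check.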

R.3.3 Commit and push:

BASH · LOCAL · F:/brainzyme-git

git add .dockerignore
git commit -m 'fix(deploy): shrink build context via .dockerignore'
git push

R.3.4 In Coolify UI, manually trigger a deploy of the latest commit. Watch the deploy log for the build context line:

Expected: #6 transferring context: ~600 MB (was 777 MB). Modest ~100–150 MB savings — honest because most of the bulk is genuinely-served creative-review images. If you see <500 MB, .dockerignore overshot — check the dashboards still work.

R.3.5 Open apps.nutritionalproducts.org/creative-review/ and apps.nutritionalproducts.org/google-ads-audit/v4.0-alt/ — both dashboards must still render images. If either is broken, run Rollback 3 immediately.

Wait for this deploy to succeed AND dashboards to verify before continuing to R.4. The big structural win comes later from Fix D (volume mount) — that one cuts build context to <50 MB. Fix A is just immediate first aid.

Step R.4 · Docker log rotation

Cap container logs at 30 MB each

DONE

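Without this, container logs grow unbounded. With max-size 10m / max-file 3 / compress true, each container is capped at roughly 30 MB of logs (three 10 MB files, with rotated files compressed). These defaults apply only to containers created after the restart; existing containers keep their old log settings until they are recreated.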

R.4.0 SSH to the VPS from your Windows shell at F:\Claude Root:

BASH · LOCAL · F:\Claude Root

ssh root@srv843884

Then confirm Docker is alive:

BASH · VPS · root

systemctl is-active docker

Expected: active. If failed, see Rollback Reference.

R.4.1 Single-paste heredoc — writes the full daemon.json atomically. No nano needed. Paste the entire block in one go:

BASH · VPS · root

cat > /etc/docker/daemon.json << 'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true"
  }
}
EOF

Why heredoc and not nano? The heredoc writes the whole JSON in one shell operation, so xterm paste-mangling cannot truncate or split it mid-line. Strictly safer than typing into nano.
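
The heredoc overwrites whatever daemon.json was already there. If the file might carry other settings, a hedged variant is to validate first and keep a backup. install_json is a hypothetical helper; it assumes python3 is present, as R.4.2 already does:

```shell
# Hypothetical helper: validate JSON from stdin, back up any existing TARGET, then install.
# Usage: install_json TARGET < new.json
install_json() {
  target=$1
  tmp=$(mktemp) || return 1
  cat > "$tmp"
  # Refuse to install anything python3's JSON parser rejects.
  if ! python3 -m json.tool "$tmp" > /dev/null 2>&1; then
    rm -f "$tmp"
    echo "invalid JSON, $target left untouched" >&2
    return 1
  fi
  # Keep a timestamped copy of whatever was there before.
  if [ -f "$target" ]; then
    cp -p "$target" "$target.bak.$(date +%Y%m%d%H%M%S)"
  fi
  chmod 644 "$tmp"
  mv "$tmp" "$target"
}
```

Possible use: feed the same heredoc body into install_json /etc/docker/daemon.json instead of cat > /etc/docker/daemon.json. A rejected paste then leaves the old file in place.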

R.4.2 Verify the file is valid:

BASH · VPS · root

cat /etc/docker/daemon.json
python3 -m json.tool /etc/docker/daemon.json

Expected: the 8 lines printed back, then re-printed by python's JSON parser. No parse errors.

R.4.3 Restart Docker and verify all containers came back:

BASH · VPS · root

systemctl restart docker
systemctl is-active docker
docker ps --format 'table {{.Names}}\t{{.Status}}'

Expected: active + all Coolify containers showing Up X seconds.
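
Log options in daemon.json only take effect for containers created after the change, so containers that merely restarted may still log without a cap. A minimal sketch for spotting them. flag_unrotated is a hypothetical filter over docker inspect output:

```shell
# Hypothetical filter: read "NAME CONFIG-JSON" lines, print names whose log config lacks max-size.
flag_unrotated() {
  awk '$0 !~ /"max-size"/ { print $1 }'
}
```

Possible use on the VPS: docker ps -q | xargs -r docker inspect --format '{{.Name}} {{json .HostConfig.LogConfig.Config}}' | flag_unrotated. Any name printed is a candidate for a redeploy so it picks up the new defaults.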

Step R.5 · journald cap

Cap systemd journal at 500 MB

DONE

Systemd journal grows unbounded by default. 500M is plenty for forensics.

R.5.1 Single-paste block — sed-edits the line, falls back to append if the line wasn't there to edit, then restarts journald and confirms:

BASH · VPS · root

sed -i 's/^#\?SystemMaxUse=.*/SystemMaxUse=500M/' /etc/systemd/journald.conf
grep -q '^SystemMaxUse=' /etc/systemd/journald.conf || echo 'SystemMaxUse=500M' >> /etc/systemd/journald.conf
grep SystemMaxUse /etc/systemd/journald.conf
systemctl restart systemd-journald
journalctl --disk-usage

Expected: SystemMaxUse=500M printed back (no # prefix), then a journal-size number ≤ 500M.
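
The same edit-or-append pattern can be wrapped for reuse on other key=value config files. A sketch, assuming GNU sed as on the VPS. set_conf_key is a hypothetical helper and does no escaping, so keep keys and values free of slashes and regex characters:

```shell
# Hypothetical helper: set KEY=VALUE in FILE, uncommenting or appending as needed.
# Usage: set_conf_key FILE KEY VALUE
set_conf_key() {
  file=$1 key=$2 value=$3
  # Rewrite an existing line, whether commented out or not.
  sed -i "s/^#\{0,1\}${key}=.*/${key}=${value}/" "$file"
  # Append only when no active (uncommented) line exists after the rewrite.
  grep -q "^${key}=" "$file" || printf '%s=%s\n' "$key" "$value" >> "$file"
}
```

Possible use: set_conf_key /etc/systemd/journald.conf SystemMaxUse 500M, followed by the journald restart. Running it twice leaves a single active line.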

Step R.6 · Coolify UI

Force Docker Cleanup every 10 minutes

VERIFY

Coolify v4.0.0-beta.463 has the cleanup config on a dedicated Docker Cleanup sidebar tab (NOT under Advanced). There is no threshold field in this version — the cleanup just runs on schedule. Setting frequency to every 10 minutes means it trims unconditionally, which is more aggressive than threshold-gated cleanup. Better outcome.

R.6.1 Open Coolify in browser. Navigate:
Servers → [your server] → Docker Cleanup (left sidebar)

Confirm/set these values:

Docker cleanup frequency: */10 * * * *
Force Docker Cleanup: ON (checked)
Delete Unused Volumes: ON (checked)
Delete Unused Networks: ON (checked)
Disable Application Image Retention: OFF (leave unchecked — safer)

Click Save.

R.6.2 Verify cleanup actually fires — click Trigger Manual Cleanup at the top of the same page, then scroll to Recent Executions. You should see a fresh entry with status Success and duration ~10–15 seconds.

Don't confuse with the unrelated "Disk Usage" section under Advanced. That has a "Disk usage check frequency" field which controls notification timing only — it does NOT trigger cleanup. Leave that at its default unless you want notifications more/less often.

Auto Deploy stays OFF — already done 2026-04-19. Don't toggle it.

Step R.7 · Sanity check

Verify everything is wired correctly

NEW

Single-paste health-check block. Run this on the VPS after R.4–R.6. Confirms every defence is actually active and not silently misconfigured.

R.7.1 SSH to the VPS if you've exited, then paste:

BASH · VPS · root

echo "=== R.4: Docker log rotation ==="
cat /etc/docker/daemon.json 2>/dev/null
python3 -m json.tool /etc/docker/daemon.json > /dev/null 2>&1 && echo "✓ valid JSON" || echo "✗ INVALID/MISSING"
echo "Docker: $(systemctl is-active docker)"
echo ""
echo "=== R.5: journald cap ==="
grep -E "^SystemMaxUse" /etc/systemd/journald.conf || echo "✗ NOT SET"
journalctl --disk-usage
echo ""
echo "=== Disk health ==="
df -h /
echo ""
echo "=== Docker storage ==="
docker system df
echo ""
echo "=== Containers running ==="
docker ps --format "table {{.Names}}\t{{.Status}}"
echo ""
echo "=== Recent cleanup activity ==="
journalctl -u docker --since "1 hour ago" --no-pager 2>/dev/null | tail -5

Pass criteria (all must be true):
• daemon.json printed with the 8-line log rotation config + ✓ valid JSON
• Docker: active
• SystemMaxUse=500M printed back, journal usage ≤ 500M
• Disk Use% at or below ~40% (give cleanup 10–20 min after R.6 to catch up)
• RECLAIMABLE on Images is small (a few hundred MB max — not gigabytes)
• All Coolify containers showing Up X minutes — none restarting or unhealthy

If any line shows ✗: stop, paste the verification output to Claude, do not proceed to long-term fixes. Most likely cause = a paste got mangled during R.4 or R.5 — re-run that step.

R.7.2 Also check Coolify UI: Servers → [your server] → Docker Cleanup → scroll to Recent Executions. Should show at least one Success entry within the last 10–15 minutes — that's the proof cleanup is firing on schedule.

Long-term Fixes

Order to apply: Fix C (cron) → Fix E (alerting) → Fix F (history retention) → Fix B (narrow filter, optional) → Fix D (volume mount, structural; schedule within 1–2 weeks).

Fix A was promoted into the Recovery Sequence as Step R.3 because order matters — it must be in before hardening.

Fix C · Belt-and-braces cron

Daily image prune at 04:00 UTC

DONE

Coolify's Force Docker Cleanup is the primary defence. This cron is a redundant nightly safety: removes any unused image older than 48 hours.

Confirmed safe by peer review: docker image prune only removes unreferenced images — running containers' images are never touched, regardless of age.

C.1 Open the root crontab on the VPS:

BASH · VPS · root

crontab -e

C.2 Append this line at the bottom:

CRON

0 4 * * * docker image prune -af --filter "until=48h" >> /var/log/docker-prune.log 2>&1

Save and exit (Ctrl+O, Enter, Ctrl+X).

C.3 Confirm:

BASH · VPS · root

crontab -l | grep docker-prune

Expected: the line you just added is printed back.
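
Editing by hand can add the same line twice if this step is repeated. A hedged, non-interactive alternative that appends the line only when it is missing. ensure_line is a hypothetical filter; it compares whole lines exactly:

```shell
# Hypothetical filter: pass stdin through, appending LINE only if no identical line exists.
# Usage: crontab -l 2>/dev/null | ensure_line 'LINE' | crontab -
ensure_line() {
  awk -v want="$1" '
    { print; if ($0 == want) seen = 1 }
    END { if (!seen) print want }
  '
}
```

Possible use: pass the cron line from C.2 as the single-quoted argument. Note that awk -v interprets backslash escapes, so this sketch suits lines without backslashes, which the C.2 line is.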

Fix E · Disk alerting

Webhook when disk > 60%

NEW

Without this, the next disk-fill is silent until deploys start failing. A 5-minute cron that checks disk and pings a webhook (Slack / Discord / email) when above threshold catches it 24+ hours early.

E.1 Decide your webhook target. For a quick start, use a free Discord webhook or a Reamaze inbox endpoint. Replace YOUR_WEBHOOK_URL below with the actual URL.

E.2 Create the alert script:

BASH · VPS · root

cat > /usr/local/bin/disk-alert.sh << 'EOF'
#!/bin/bash
USAGE=$(df -h / | awk 'NR==2 {gsub(/%/,""); print $5}')
THRESHOLD=60
WEBHOOK_URL="YOUR_WEBHOOK_URL"
if [ "$USAGE" -gt "$THRESHOLD" ]; then
  curl -sS -X POST -H 'Content-Type: application/json' \
    -d "{\"content\":\"[VPS srv843884] Disk at ${USAGE}% — above ${THRESHOLD}% threshold\"}" \
    "$WEBHOOK_URL" > /dev/null
fi
EOF
chmod +x /usr/local/bin/disk-alert.sh

E.3 Add to crontab (5-min interval):

CRON

*/5 * * * * /usr/local/bin/disk-alert.sh

E.4 Test the alert by temporarily lowering THRESHOLD to 10 and running the script manually:

BASH · VPS · root

/usr/local/bin/disk-alert.sh

Should fire a webhook. Restore THRESHOLD to 60 once verified.
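
As written, the script posts every 5 minutes for as long as the disk stays above the threshold. One way to throttle that is a cooldown stored in a state file. A minimal sketch: should_alert, the state-file path, and the one-hour default are all assumptions, not part of the original script:

```shell
# Hypothetical helper: succeed only if USAGE > THRESHOLD and the last alert is older than COOLDOWN.
# Usage: should_alert USAGE THRESHOLD STATE_FILE [COOLDOWN_SECONDS]
should_alert() {
  usage=$1 threshold=$2 state=$3 cooldown=${4:-3600}
  [ "$usage" -gt "$threshold" ] || return 1
  now=$(date +%s)
  last=0
  if [ -f "$state" ]; then
    last=$(cat "$state")
  fi
  [ $(( now - last )) -ge "$cooldown" ] || return 1
  # Record this alert so the next run inside the cooldown stays quiet.
  echo "$now" > "$state"
}
```

Possible use inside disk-alert.sh: replace the if [ "$USAGE" -gt "$THRESHOLD" ] test with if should_alert "$USAGE" "$THRESHOLD" /var/tmp/disk-alert.last 3600.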

Fix F · Coolify history retention

Cap deployment history at 50

NEW

1092 historical deployments in the Coolify table is bloat. Cap retention so the database stays small and the UI stays fast.

F.1 In Coolify UI: Settings → Advanced → find "Deployment history retention" (or similar — exact label varies by version).

F.2 Set to 50. Save.

F.3 If the setting doesn't exist in your Coolify version, manually clean from Postgres on next quiet maintenance window. Hold for now.

Fix B · Narrow deploy filter

Skip deploy on memory/docs commits only

REVISED — NARROWED

Original plan dropped. The earlier suggestion to skip data: and creative-review: commits would have blocked your regular workflows from deploying. Bad call.

Narrow safe version: commits that ONLY touch memory/, hr/, legal/, CLAUDE.md, or pure .md docs have zero impact on the served site. Filtering those is safe and saves ~10–20% of deploys. The filter matches on commit-message prefix, not on changed paths, so it only skips commits labelled with one of the prefixes below.

B.1 Open F:/brainzyme-git/.github/workflows/deploy.yml. Replace the existing if: block with:

YAML

    if: >-
      !startsWith(github.event.head_commit.message, 'Auto-sync sessions.json')
      && !startsWith(github.event.head_commit.message, 'deploy status update')
      && !startsWith(github.event.head_commit.message, 'docs:')
      && !startsWith(github.event.head_commit.message, 'memory:')
      && !startsWith(github.event.head_commit.message, 'legal:')
      && !startsWith(github.event.head_commit.message, 'hr:')
      && !contains(github.event.head_commit.message, '[skip ci]')

Do NOT add data:, creative-review:, feat(registry):, or any prefix that touches actual served content. Those commits change the live dashboard and must trigger a deploy.
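
A message-prefix filter trusts the commit author to pick the right prefix. A path-based check is one hedged alternative for confirming that a commit really touched only internal files. only_internal_paths is a hypothetical filter, and the path list mirrors the folders named in this fix:

```shell
# Hypothetical filter: read changed paths on stdin, succeed only if none can affect the served site.
only_internal_paths() {
  awk '
    /^(memory|hr|legal)\// { next }   # internal folders
    /\.md$/               { next }   # pure Markdown docs, including CLAUDE.md
    { served = 1 }                   # anything else may be served live
    END { exit served ? 1 : 0 }
  '
}
```

Possible use locally before pushing: git diff --name-only HEAD~1 HEAD | only_internal_paths && echo "safe to skip deploy". Wiring this into deploy.yml would be a separate change and is not covered here.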

Fix D · Volume-mount architecture

Stop rebuilding the image on every data change

NEW — STRUCTURAL

The right answer for high-frequency small updates. Right now every commit triggers a full nginx rebuild — even if all that changed was one JSON file. The real fix: nginx image holds the static shell only; data files live in a Coolify-mounted volume that workflows write to directly without any Docker rebuild.

D.1 Architecture overview:

nginx image holds: HTML, JS, CSS, templates, layouts (rebuild rarely — only on theme changes)
VPS volume holds: JSON data, screenshots, snapshots, all fast-changing assets (mounted at /usr/share/nginx/html/data)
GitHub Actions workflow rsyncs changed data files to the volume on every push — no Docker build, no image layer, no prune

D.2 Implementation steps (do this on a calm afternoon — ~30 min):

In Coolify: add a bind-mount volume to the application: VPS path /data/brainzyme-static → container path /usr/share/nginx/html/data
Move creative-review/, google-ads-audit/pipeline/snapshots/, and other fast-churning data dirs to /data/ path under the new mount
Update HTML/JS references to point at /data/... paths
Add a GitHub Actions step that rsyncs the moved dirs from repo to /data/brainzyme-static/ via SSH on push (skipping the Docker rebuild path)
For static-shell-only commits, deploy.yml continues to trigger a normal Coolify rebuild
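
The rsync step can be previewed before anything touches the VPS. A sketch that only prints the commands. plan_sync is a hypothetical helper; the SSH target, the mirrored directory layout under the mount, and the use of --delete are assumptions to review against the Fix D design doc:

```shell
# Hypothetical helper: print one rsync command per data directory, without running it.
# Usage: plan_sync DEST DIR...
plan_sync() {
  dest=$1
  shift
  for dir in "$@"; do
    printf 'rsync -az --delete %s/ %s/%s/\n' "$dir" "$dest" "$dir"
  done
}
```

Possible use from the repo root: plan_sync root@srv843884:/data/brainzyme-static creative-review/images google-ads-audit/pipeline/snapshots. Read the output first; --delete removes files on the VPS that are absent from the repo, and the parent directories must already exist under /data/brainzyme-static.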

D.3 Lighter interim option: if full volume migration is too much right now, use a multi-stage Dockerfile where stage 1 is COPY . /build, stage 2 copies only the served output. Combined with Fix A's .dockerignore, keeps the final image layer small without changing serving architecture. ~10 min.

Schedule Fix D within 1–2 weeks. Per the live audit, creative-review/images/ is 767 MB of genuinely-served content. Recovery + Fix A saves only ~100–150 MB. The remaining ~600 MB is rebuilt on every commit until D is in. Keep the deploy cadence light until then.

Use this tab every time you set up a new GitHub-hosted Coolify project. The disk-fill incident on this VPS happened because none of the defences below were in place from day 1 — they got installed reactively after disk hit 100%.

This is the day-1 checklist for any new Coolify-deployed GitHub Pages style static site. Total time: ~20 minutes if you follow it linearly. Saves you from a 4-hour incident later.

Source: distilled from this VPS's full incident response 2026-04-29 → 2026-05-02. Battle-tested defences only.

Pre-deploy (in the GitHub repo, before you click "Deploy")

B.1 · Repo prep

Create `.dockerignore` from day 1

DAY 1

Empty .dockerignore = full repo gets copied into every Docker build = bloat. Start with this baseline:

.dockerignore

.git/
node_modules/
*.zip
*.mp4
*.mov
*.psd
tmp/
output/pending/
_archive/
**/screenshots/

Add project-specific exclusions for any folders that are dev-only (tests, scratch, internal docs). Do not exclude any path that's served live. Audit first by running this from the repo root (the trailing dot is the search path): grep -rE "your-folder/" --include='*.html' --include='*.js' .
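
To audit several candidate folders in one pass, the grep check can be wrapped in a small loop. This is a minimal sketch: the function name audit_refs and its output wording are made up, and it only catches literal "folder/" strings in .html and .js files. Paths assembled at runtime in JavaScript would slip through, so treat "unreferenced" as a hint, not proof.

```shell
# Hypothetical audit helper: for each candidate folder, report whether
# any .html or .js file under the repo root mentions "<folder>/".
# Usage: audit_refs <repo_root> <folder> [<folder> ...]
audit_refs() (
  root="$1"
  shift
  for folder in "$@"; do
    if grep -rqF --include='*.html' --include='*.js' -- "$folder/" "$root"; then
      echo "REFERENCED $folder"     # served live: keep it out of .dockerignore
    else
      echo "unreferenced $folder"   # candidate for exclusion
    fi
  done
)

# Example (folder names are placeholders):
#   audit_refs . tests scratch internal-docs
```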

B.2 · Repo prep

Decide volume-mount strategy upfront

DAY 1

If the repo has any single folder larger than 100 MB that's served live (images, videos, large datasets), plan to volume-mount it from day 1. If you skip this and bake it into the image, every push rebuilds the whole thing.

Reference: Fix D in Long-term Fixes + design doc at F:/Claude Root/claude-setup/specs/2026-05-02-fix-d-volume-mount-design.md + paste-safe SOP at F:/Claude Root/claude-setup/fix-d-volume-mount-instructions.txt.

Rule of thumb: under 100 MB per folder = bake into the image, don't bother with a volume mount. Over 100 MB, or growing rapidly = mount it.
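
A quick way to apply the rule of thumb is to list the top-level folders that exceed the limit. The sketch below assumes a 100 MB default and a made-up function name, folders_over_limit. It ignores hidden folders such as .git because of how the glob expands, and it reports disk usage, which can differ slightly from raw file sizes.

```shell
# Hypothetical helper: list top-level folders above a size limit.
# Usage: folders_over_limit <repo_root> [limit_mb]   (default 100)
folders_over_limit() (
  root="${1:-.}"
  limit_kb=$(( ${2:-100} * 1024 ))
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue
    # du -sk prints "<kilobytes><TAB><path>"; keep the number only.
    size_kb=$(du -sk "$dir" | cut -f1)
    if [ "$size_kb" -gt "$limit_kb" ]; then
      echo "$(basename "$dir") $(( size_kb / 1024 )) MB"
    fi
  done
)

# Example, run from the repo root:
#   folders_over_limit . 100
```

Any folder it prints that is also served live is a candidate for the volume mount.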

B.3 · Repo prep

Set up `deploy.yml` with skip filters

DAY 1

If using a GitHub Actions workflow that pings Coolify's webhook, add commit-prefix filters from day 1 so docs/memory/internal commits don't trigger pointless rebuilds. Put this in .github/workflows/deploy.yml as the deploy job's if: condition:

YAML

    if: >-
      !startsWith(github.event.head_commit.message, 'docs:')
      && !startsWith(github.event.head_commit.message, 'memory:')
      && !contains(github.event.head_commit.message, '[skip ci]')

Do NOT skip commits that change actual served content (data syncs, asset uploads, anything users see).
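
To check a commit message before pushing, the same condition can be mirrored locally. This is only an approximation of the workflow filter, with a made-up function name, should_deploy. It lowercases the message first because GitHub's startsWith() and contains() are not case sensitive.

```shell
# Hypothetical local mirror of the deploy.yml condition above.
# Prints "skip" when the workflow would not deploy, else "deploy".
should_deploy() (
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    docs:*|memory:*) echo "skip" ;;    # prefix filters
    *"[skip ci]"*)   echo "skip" ;;    # marker anywhere in the message
    *)               echo "deploy" ;;
  esac
)

# Example:
#   should_deploy "$(git log -1 --format=%B)"
```

A message such as "data: refresh snapshots" comes back as deploy, which matches the rule that served-content commits must never be filtered out.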

VPS hardening (after first successful deploy lands)

B.4 · VPS · Docker config

Log rotation (one-time)

DAY 1

Caps container logs so they can't grow unbounded. Same as R.4 in Recovery Sequence.

BASH · VPS · root

cat > /etc/docker/daemon.json << 'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3",
    "compress": "true"
  }
}
EOF
systemctl restart docker
systemctl is-active docker
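
Because a malformed daemon.json stops Docker from starting (the case Rollback 1 covers), it may be worth validating the file before the restart. The sketch below assumes python3 is installed on the VPS and uses it purely as a JSON parser. The function name check_daemon_json is made up.

```shell
# Hypothetical pre-flight: refuse to restart Docker on invalid JSON.
# Assumes python3 is available; it is used only to parse the file.
check_daemon_json() {
  if python3 -m json.tool "${1:-/etc/docker/daemon.json}" >/dev/null 2>&1; then
    echo "valid"
  else
    echo "invalid"
    return 1
  fi
}

# Suggested order on the VPS as root (not run here):
#   cp -a /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null
#   ...write the new file as shown above...
#   check_daemon_json && systemctl restart docker
```

The backup copy matters because the heredoc above overwrites any settings that were already in daemon.json.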

B.5 · VPS · systemd

journald cap (one-time)

DAY 1

Caps systemd journal at 500 MB. Same as R.5 in Recovery Sequence.

BASH · VPS · root

sed -i 's/^#\?SystemMaxUse=.*/SystemMaxUse=500M/' /etc/systemd/journald.conf
grep SystemMaxUse /etc/systemd/journald.conf || echo 'SystemMaxUse=500M' >> /etc/systemd/journald.conf
systemctl restart systemd-journald
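
An alternative to editing journald.conf in place is a systemd drop-in file, which leaves the stock config untouched and can be undone by deleting one file. This is a sketch of that swapped-in technique, not the method used on this VPS. The file name 00-disk-cap.conf and the function name are made up.

```shell
# Hypothetical alternative: write the journal cap as a drop-in file.
# The default directory is the standard systemd drop-in location.
# Usage: write_journald_cap [conf_dir] [max_use]
write_journald_cap() (
  conf_dir="${1:-/etc/systemd/journald.conf.d}"
  max_use="${2:-500M}"
  mkdir -p "$conf_dir"
  printf '[Journal]\nSystemMaxUse=%s\n' "$max_use" > "$conf_dir/00-disk-cap.conf"
  echo "$conf_dir/00-disk-cap.conf"   # print the path that was written
)

# On the VPS as root (not run here):
#   write_journald_cap && systemctl restart systemd-journald
#   journalctl --disk-usage
```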

B.6 · VPS · cron

Daily 04:00 UTC image prune (one-time)

DAY 1

Belt-and-braces nightly cleanup. Run crontab -e as root, add this line at the bottom, save (in nano: Ctrl+O, Enter, Ctrl+X):

CRON · VPS · root

0 4 * * * docker image prune -af --filter "until=48h" >> /var/log/docker-prune.log 2>&1

Verify after 24h: tail /var/log/docker-prune.log
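
If you'd rather not open an editor, the same line can be added non-interactively, and safely re-run, with a small filter. This is a minimal sketch. The function name merge_cron_line is made up, and PRUNE_LINE simply repeats the cron entry above.

```shell
# Hypothetical helper: append a cron line only if it is not already
# present. Reads the current crontab on stdin, prints the merged one.
PRUNE_LINE='0 4 * * * docker image prune -af --filter "until=48h" >> /var/log/docker-prune.log 2>&1'

merge_cron_line() (
  line="$1"
  current=$(cat)
  if printf '%s\n' "$current" | grep -Fxq -- "$line"; then
    printf '%s\n' "$current"               # already there: unchanged
  elif [ -n "$current" ]; then
    printf '%s\n%s\n' "$current" "$line"   # append to existing entries
  else
    printf '%s\n' "$line"                  # empty crontab: just the line
  fi
)

# On the VPS as root (not run here):
#   crontab -l 2>/dev/null | merge_cron_line "$PRUNE_LINE" | crontab -
```

Running it twice leaves a single copy of the prune entry, so it is safe to keep in a bootstrap script.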

Coolify configuration (in the UI)

B.7 · Coolify UI · cleanup

Force Docker Cleanup every 10 min

DAY 1

Servers → [your server] → Docker Cleanup. Set:

Docker cleanup frequency: */10 * * * *
Force Docker Cleanup: ON
Delete Unused Volumes: ON
Delete Unused Networks: ON
Disable Application Image Retention: OFF (leave unchecked — safer)

Then click Trigger Manual Cleanup once and verify "Recent Executions" shows Success. That confirms it's wired correctly.

B.8 · Coolify UI · deploy mode

Disable Coolify "Auto Deploy"

DAY 1

Coolify's built-in "Auto Deploy" fires on EVERY push, ignoring [skip ci] (Coolify issue #4357). Disable it and rely on your deploy.yml webhook for explicit deploys instead.

Path: Application → Configuration → Advanced → "General" subsection → Auto Deploy toggle → OFF (grey).

B.9 · Coolify UI · retention

Cap deployment history at 50

DAY 1

Without a cap, Coolify accumulates deployment records indefinitely (this VPS hit 1092 before we noticed). Database bloat, slow UI. Path varies by Coolify version — look for "Deployment history retention" or similar under Settings → Advanced. Set to 50.

Monitoring (recommended, not blocking)

B.10 · VPS · alerting

Disk alert webhook at 60% threshold

Optional

Catches the next disk-fill 24h early instead of via failed deploys. See Long-term Fix E for full script. Skip for week one if you don't already have a webhook target (Discord/Slack/email-via-webhook).

B.11 · Discipline

Bookmark this dashboard on phone

Recommended

apps.nutritionalproducts.org/it-sop/. Terminal-side companion. If a deploy fails or disk climbs, you've got every command one tap away.

Verification — sanity check after the above

After completing B.4–B.9, run the same R.7 verification block from the Recovery Sequence. Replaces "did the recovery succeed?" with "did the bootstrap succeed?" — same checks. If everything passes, you've done in 20 minutes what took 4 hours of incident response on this VPS.

If anything broke: these are the exact undo commands. Run on the VPS as root, or in F:/brainzyme-git/ for git-based rollbacks.

Rollback 1

R.4 broke Docker (won't restart)

Removes the daemon.json so Docker boots with stock defaults.

BASH · VPS · root

rm -f /etc/docker/daemon.json
systemctl restart docker
systemctl is-active docker

Expected: active

Rollback 2

R.5 broke logging

BASH · VPS · root

sed -i 's/^SystemMaxUse=500M/#SystemMaxUse=/' /etc/systemd/journald.conf
systemctl restart systemd-journald

Rollback 3

R.3 .dockerignore broke a deploy (404s on dashboard)

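Revert the .dockerignore commit and push. The command below assumes that commit is still HEAD; if other commits have landed since, pass its hash to git revert instead.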

BASH · LOCAL · F:/brainzyme-git

git revert HEAD --no-edit
git push

Then re-do the audit (R.2) more carefully and try again with a narrower exclusion list.
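
If other commits have landed on top of the .dockerignore change, reverting HEAD would undo the wrong commit. A small sketch for finding the right one follows. The function name last_commit_touching is made up, and it assumes the most recent commit touching the file is the one that introduced the bad exclusions.

```shell
# Hypothetical helper: print the hash of the most recent commit that
# touched a path, so the revert can target it even if HEAD has moved.
last_commit_touching() (
  target="${1:-.dockerignore}"
  git log -1 --format=%H -- "$target"
)

# In F:/brainzyme-git (not run here):
#   git revert --no-edit "$(last_commit_touching .dockerignore)"
#   git push
```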

Diagnostic

What's wrong with Docker?

BASH · VPS · root

systemctl status docker.service --no-pager | tail -30
journalctl -xeu docker.service --no-pager | tail -50

Current state

Will regular small updates still work?

Related IT projects

How to use this dashboard

Disk emergency? Go to Emergency Flush first.

Then work through Recovery Sequence.

Once stable, work through Long-term Fixes.

Something broke? Rollback Reference has every undo command.
