docs: add README operator guide

First operator-facing README. Covers quick start (build, configure, first-user login), the multi-drive batch workflow with concrete time estimates, the four drive-lock states with their confirm tokens, notable settings, daily report / notifications, ops cookbook (logs, user CLI, backups, /health probe, DB reset), and an honest "known gaps" list. Cross-references CLAUDE.md (architecture + rationale) and SPEC.md (per-version feature reference) for deeper docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:08:42 -04:00 · 2026-05-02 11:08:42 -04:00 · c589e3c8e5
commit c589e3c8e5
parent d4c0770b9e
1 changed files with 242 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,242 @@
+# TrueNAS Burn-In Dashboard
+
+Web dashboard for running disciplined burn-in tests on TrueNAS drives.
+Sits next to the NAS, not on it — orchestrates `smartctl`, `badblocks`, and
+`nvme-cli` over SSH and tracks every job in SQLite.
+
+Inspired by the community `disk-burnin.sh` script (Spearfoot et al.) but
+adds: concurrent burn-ins, pool-membership safety locks, login + audit,
+live progress UI, daily email reports, and resumable state.
+
+## Stack
+
+FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No
+external services beyond your TrueNAS host. Templates and static assets
+are bind-mounted; Python source is baked into the image.
+
+---
+
+## Quick start
+
+```bash
+# 1. Configure
+cp .env.example .env
+# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
+# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.
+
+# 2. Build + run
+docker compose up -d --build
+
+# 3. Open the dashboard
+open http://localhost:8084   # macOS; use xdg-open on Linux, or browse to your host's IP
+
+# 4. First time: the login page renders a "Create initial admin" form.
+#    Pick a username + password (>= 8 chars). Done.
+```
+
+If you set `INITIAL_ADMIN_*` env vars *and* the users table is empty, that
+account is created on startup automatically. After that the env vars are
+ignored — change passwords from the UI ("Change password" header link) or
+the CLI (`docker exec -it truenas-burnin python -m app.auth_cli reset
+<username>`).
+
+---
+
+## Burning in many drives at once
+
+The dashboard runs **up to `max_parallel_burnins`** burn-ins concurrently
+(configurable in Settings, default 4) and queues the rest. Submitting 14
+drives doesn't take 14 separate clicks — you submit once and the queue
+drains automatically as slots free up.
+
+### The workflow
+
+1. **Select all idle drives** — click the checkbox in the table header
+   (next to "DRIVE"). It auto-checks every drive that's currently
+   selectable: idle, no active SMART test, not pool-locked. Pool-locked
+   drives are intentionally excluded; if you really want to burn one of
+   them in, unlock it individually first (see [Drive locks](#drive-locks)
+   below).
+2. **Click the Burn-In button** in the batch action bar that slides up
+   from the bottom — it shows the count of selected drives.
+3. **In the batch modal**: pick the stages to run (Short SMART, Long
+   SMART, Surface Validate — drag to reorder), confirm your operator
+   name, and click Start.
+4. **All selected drives are queued** in one POST. Up to
+   `max_parallel_burnins` enter `running`; the rest sit `queued`. As each
+   running job finishes, the next queued job picks up the freed slot
+   automatically — no operator action between batches.
+5. The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."
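+
+A minimal sketch of how a semaphore-capped queue like the one in step 4
+can behave, assuming an `asyncio.Semaphore` guards the slots.
+`run_burnin`, `run_batch`, and the short sleep standing in for the real
+stages are illustrative names, not the dashboard's actual code:
+
+```python
+import asyncio
+
+MAX_PARALLEL_BURNINS = 4  # mirrors the documented default; the name is illustrative
+
+
+async def run_burnin(drive: str, slots: asyncio.Semaphore, stats: dict) -> str:
+    """Wait for a free slot, then run a placeholder job for one drive."""
+    async with slots:  # the job stays queued here until a slot frees up
+        stats["running"] += 1
+        stats["peak"] = max(stats["peak"], stats["running"])
+        await asyncio.sleep(0.01)  # stand-in for the SMART / surface stages
+        stats["running"] -= 1
+    return drive
+
+
+async def run_batch(drives: list, max_parallel: int = MAX_PARALLEL_BURNINS) -> dict:
+    """Submit every drive at once; the semaphore drains the queue."""
+    slots = asyncio.Semaphore(max_parallel)
+    stats = {"running": 0, "peak": 0}
+    finished = await asyncio.gather(*(run_burnin(d, slots, stats) for d in drives))
+    return {"finished": list(finished), "peak": stats["peak"]}
+
+
+if __name__ == "__main__":
+    result = asyncio.run(run_batch([f"drive{i}" for i in range(12)]))
+    print(len(result["finished"]), "jobs done, peak concurrency", result["peak"])
+```
+
+With 12 drives and 4 slots, no more than 4 placeholder jobs run at once
+and the other 8 pick up slots as they free, with no further submission.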
+
+### Time estimate
+
+| Drive size | Profile                         | Per-drive runtime (default block size) |
+|------------|---------------------------------|----------------------------------------|
+| 250 GB SSD | Short + Long SMART + Surface    | ~1 hour                                |
+| 14 TB HDD  | Short + Long SMART + Surface    | ~24 hours                              |
+| 14 TB HDD  | Short + Long SMART (no surface) | ~6–8 hours                             |
+
+For 12× 14 TB drives at default 4-parallel: roughly **3–4 days** end-to-end.
+Bumping `surface_validate_block_size` from 4096 to 8192 in Settings cuts
+runtime roughly in half at ~2× RAM cost — matches the upstream
+`disk-burnin.sh` recommendation.
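+
+The end-to-end figure above is simple wave arithmetic. A rough sketch,
+assuming every drive takes about the same time and ignoring thermal-gate
+waits (`batch_runtime_hours` is a hypothetical helper, not part of the
+dashboard):
+
+```python
+import math
+
+
+def batch_runtime_hours(drive_count: int, per_drive_hours: float, max_parallel: int = 4) -> float:
+    """Rough wall-clock estimate: drives finish in waves of `max_parallel`."""
+    waves = math.ceil(drive_count / max_parallel)
+    return waves * per_drive_hours
+
+
+if __name__ == "__main__":
+    hours = batch_runtime_hours(12, 24)
+    print(f"{hours:.0f} h, about {hours / 24:.1f} days")
+```
+
+For 12 drives at ~24 hours each this gives 72 hours (three waves), the
+low end of the 3–4 day estimate. The real queue refills slots one at a
+time, so mixed drive sizes will not line up in neat waves.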
+
+### Watch out
+
+- **Stuck-job timeout** — `stuck_job_hours` (default 24) marks any job
+  past that threshold as `unknown` and kills the remote process. If
+  you're burning in 14 TB drives with default block size, raise this to
+  **48** in Settings before starting, or you'll get false positives near
+  the end of surface_validate.
+- **Thermal gate** — if drives currently under burn-in hit the
+  temperature warning threshold, new jobs wait up to 3 minutes before
+  acquiring a slot. Increase `temp_warn_c` if your chassis runs hot but
+  is otherwise fine.
+
+### Cancelling
+
+Click the red ✕ next to a running job. The orchestrator:
+1. Marks the job `cancelled` in the DB.
+2. Issues `kill -9 <remote_pid>` over a fresh SSH session (the badblocks
+   PID is captured at launch via `sh -c 'echo PID:$$; exec ...'`).
+3. Cancels the asyncio task, releasing the semaphore slot for the next
+   queued job.
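+
+Step 2 relies on the launcher shell printing its own PID before `exec`
+replaces it with badblocks, so the PID stays the same. A rough sketch of
+reading that marker and building the kill command; both function names
+are hypothetical and the dashboard's real parsing may differ:
+
+```python
+import re
+from typing import Optional
+
+# Matches the marker emitted by: sh -c 'echo PID:$$; exec ...'
+PID_LINE = re.compile(r"^PID:(\d+)$")
+
+
+def parse_pid_line(line: str) -> Optional[int]:
+    """Return the PID from a 'PID:<n>' marker line, or None for any other line."""
+    match = PID_LINE.match(line.strip())
+    return int(match.group(1)) if match else None
+
+
+def build_kill_command(pid: int) -> str:
+    """Build the remote kill command, refusing obviously unsafe PIDs."""
+    if pid <= 1:
+        raise ValueError(f"refusing to kill pid {pid}")
+    return f"kill -9 {pid}"
+
+
+if __name__ == "__main__":
+    pid = parse_pid_line("PID:4242\n")
+    if pid is not None:
+        print(build_kill_command(pid))
+```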
+
+Cancellations are durable — restart the container and queued jobs resume,
+cancelled jobs stay cancelled.
+
+---
+
+## Drive locks
+
+To prevent destroying live data, the dashboard refuses to start
+destructive burn-in on drives ZFS or the kernel reports as in-use.
+Detected lock states (with the typed-confirmation token required to
+override):
+
+| State         | Detected via              | Confirm token                |
+|---------------|---------------------------|------------------------------|
+| Active pool   | `zpool list -vHP`         | the pool name (e.g. `tank`)  |
+| Boot pool     | pool name = `boot-pool`   | `DESTROY BOOT POOL`          |
+| Exported ZFS  | `lsblk` `zfs_member` partitions not in any active pool | `DESTROY EXPORTED POOL` |
+| Mounted FS    | `findmnt -no SOURCE`      | `DESTROY MOUNTED FILESYSTEM` |
+
+Detection runs every poll cycle (~12 s). On any SSH or parser failure the
+poller fails *closed* in the sense that no lock is ever dropped: it holds
+the last known state, so previously locked drives stay locked and
+previously unlocked drives stay unlocked until detection recovers.
+
+Unlock is in-memory only with a 10-minute TTL — bound to the
+`(pool_name, pool_role)` observed at unlock time. If a subsequent poll
+reclassifies the drive (e.g. `(exported)` → `tank` because someone
+imported the pool), the grant is invalidated automatically.
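+
+One way to picture that binding is a grant object that checks both the
+TTL and the pool identity on every use. This is a sketch only:
+`UnlockGrant`, its fields, and the role strings are assumptions made for
+illustration, not the dashboard's data model:
+
+```python
+import time
+from dataclasses import dataclass, field
+from typing import Optional
+
+UNLOCK_TTL_SECONDS = 10 * 60  # the 10-minute TTL described above
+
+
+@dataclass(frozen=True)
+class UnlockGrant:
+    """Hypothetical in-memory grant tied to the pool identity seen at unlock time."""
+
+    pool_name: str
+    pool_role: str
+    granted_at: float = field(default_factory=time.monotonic)
+
+    def is_valid(self, pool_name: str, pool_role: str, now: Optional[float] = None) -> bool:
+        """True while the TTL has not expired and the drive is classified the same."""
+        now = time.monotonic() if now is None else now
+        unchanged = (pool_name, pool_role) == (self.pool_name, self.pool_role)
+        return unchanged and (now - self.granted_at) < UNLOCK_TTL_SECONDS
+
+
+if __name__ == "__main__":
+    grant = UnlockGrant("(exported)", "exported")
+    print(grant.is_valid("(exported)", "exported"))  # same identity, fresh grant
+    print(grant.is_valid("tank", "active"))          # pool was imported since unlock
+```
+
+Because the grant lives only in process memory, a container restart
+discards it, which matches the in-memory behaviour described above.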
+
+Every unlock writes an audit event and surfaces in the next daily report
+in a red banner.
+
+---
+
+## Settings highlights
+
+All settings live under `/settings` (header link). Key knobs:
+
+- **`max_parallel_burnins`** (default 4) — semaphore cap. Restart container
+  for changes to take effect.
+- **`surface_validate_block_size` / `_block_buffer` / `_passes`** —
+  badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour;
+  tune for speed vs paranoia.
+- **`stuck_job_hours`** (default 24) — raise for big drives.
+- **`temp_warn_c` / `temp_crit_c`** — thermal gating thresholds.
+- **`bad_block_threshold`** (default 0) — number of bad blocks
+  surface_validate tolerates before failing the stage.
+- **`retention_log_days`** (default 35) — when to NULL out
+  `burnin_stages.log_text`. Nightly job at 03:00 local.
+- **`retention_backup_keep`** (default 14) — how many nightly DB
+  snapshots to keep in `/data/backups/`.
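+
+As an illustration of what `retention_backup_keep` implies, a pruning
+pass over the backup directory could look like the sketch below. It
+assumes the `app-YYYY-MM-DD.db` naming described under Backups;
+`prune_backups` is a hypothetical helper, not the dashboard's own
+function:
+
+```python
+from pathlib import Path
+
+
+def prune_backups(backup_dir: Path, keep: int = 14) -> list:
+    """Delete all but the `keep` newest app-YYYY-MM-DD.db files; return the removed paths."""
+    # ISO dates sort chronologically as plain strings, so a name sort is enough.
+    snapshots = sorted(backup_dir.glob("app-*.db"), key=lambda p: p.name)
+    stale = snapshots[:-keep] if keep > 0 else snapshots
+    for path in stale:
+        path.unlink()
+    return stale
+
+
+if __name__ == "__main__":
+    removed = prune_backups(Path("data/backups"), keep=14)
+    print(f"removed {len(removed)} old snapshot(s)")
+```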
+
+---
+
+## Notifications
+
+- **Daily SMTP report** at `smtp_report_hour` (default 08:00 local) with
+  drive-level summary, failed-health banner, and a red banner listing
+  every pool-drive unlock from the last 24 h.
+- **Per-job email alerts** on pass/fail (configurable).
+- **Webhook URL** posts JSON on every job state change.
+
+Configure SMTP in Settings → Email. Includes a "Test SMTP" button.
+
+---
+
+## Operations
+
+### Logs
+
+```bash
+docker logs -f truenas-burnin
+# JSON-structured. Filter with jq:
+docker logs truenas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
+```
+
+### User management
+
+```bash
+docker exec -it truenas-burnin python -m app.auth_cli list
+docker exec -it truenas-burnin python -m app.auth_cli add <username>
+docker exec -it truenas-burnin python -m app.auth_cli reset <username>
+```
+
+Passwords are read from a TTY prompt and are never accepted on the
+command line.
+
+### Backups
+
+Automated nightly to `/data/backups/app-YYYY-MM-DD.db` (online
+`sqlite3.backup`, doesn't lock writers). To restore:
+
+```bash
+docker compose down
+cp data/backups/app-2026-05-01.db data/app.db
+docker compose up -d
+```
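+
+For a manual snapshot outside the nightly job, Python's
+`sqlite3.Connection.backup` gives the same kind of online copy. This
+sketch assumes the nightly job uses that API (the README only names
+`sqlite3.backup`); `snapshot_db` and the example paths are hypothetical:
+
+```python
+import sqlite3
+from contextlib import closing
+from pathlib import Path
+
+
+def snapshot_db(source: Path, dest: Path) -> None:
+    """Copy a live SQLite database to `dest` using the online backup API."""
+    with closing(sqlite3.connect(source)) as src, closing(sqlite3.connect(dest)) as dst:
+        src.backup(dst)  # sqlite3.Connection.backup, available since Python 3.7
+
+
+if __name__ == "__main__":
+    snapshot_db(Path("data/app.db"), Path("data/backups/app-manual.db"))
+```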
+
+### Health probe
+
+`/health` is unauthenticated and returns 200 only when DB, poller, and
+SSH (when configured) all check green; 503 otherwise. Use it for
+container/orchestrator health checks.
+
+```bash
+curl -sf http://localhost:8084/health | jq
+```
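+
+Where `curl` isn't available to whatever runs the check, a stdlib-only
+Python probe can apply the same rule: 200 is healthy, anything else is
+not. `probe_health` is an illustrative name, and the default URL assumes
+the port from Quick start:
+
+```python
+import sys
+import urllib.error
+import urllib.request
+
+
+def probe_health(url: str = "http://localhost:8084/health", timeout: float = 5.0) -> bool:
+    """Return True only for HTTP 200; 503 and connection errors count as unhealthy."""
+    try:
+        with urllib.request.urlopen(url, timeout=timeout) as resp:
+            return resp.status == 200
+    except (urllib.error.URLError, OSError):
+        # urlopen raises HTTPError (a URLError subclass) for 503 responses.
+        return False
+
+
+if __name__ == "__main__":
+    sys.exit(0 if probe_health() else 1)
+```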
+
+### Resetting the DB
+
+If you need to start over:
+
+```bash
+docker compose down
+sudo rm -f data/app.db data/session_secret
+# keep data/settings_overrides.json if you want to preserve UI settings
+docker compose up -d
+```
+
+---
+
+## See also
+
+- `CLAUDE.md` — full architecture, file map, deploy workflow, and the
+  rationale behind every non-obvious design decision.
+- `SPEC.md` — canonical feature reference per version.
+- `tests/` — `python -m unittest discover tests/` (44 tests, stdlib-only).
+
+---
+
+## Known gaps / not-yet-built
+
+- No multi-user RBAC — every user is effectively admin.
+- No per-drive SMART attribute trend graphs (snapshots only).
+- No scheduled burn-ins — jobs run immediately when queued.
+- No CSRF tokens on state-changing endpoints (relies on
+  `SameSite=Strict` session cookie).
+
+PRs welcome.