# NAS Burn-In Dashboard

Web dashboard for running disciplined burn-in tests on TrueNAS drives. Sits next to the NAS, not on it: it orchestrates `smartctl`, `badblocks`, and `nvme-cli` over SSH and tracks every job in SQLite.

Inspired by the community `disk-burnin.sh` script (Spearfoot et al.) but adds: concurrent burn-ins, pool-membership safety locks, login + audit, live progress UI, daily email reports, and resumable state.

## Stack

FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No external services beyond your TrueNAS host. Templates and static assets are bind-mounted; Python source is baked into the image.

---

## Quick start

```bash
# 1. Configure
cp .env.example .env
# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.

# 2. Build + run
docker compose up -d --build

# 3. Open the dashboard
open http://localhost:8084   # or your host's IP

# 4. First time: the login page renders a "Create initial admin" form.
#    Pick a username + password (>= 8 chars). Done.
```

If you set `INITIAL_ADMIN_*` env vars *and* the users table is empty, that account is created on startup automatically. After that the env vars are ignored; change passwords from the UI ("Change password" header link) or the CLI (`docker exec -it nas-burnin python -m app.auth_cli reset `).

---

## Burning in many drives at once

The dashboard runs **up to `max_parallel_burnins`** burn-ins concurrently (configurable in Settings, default 4) and queues the rest. Submitting 14 drives doesn't take 14 separate clicks: you submit once and the queue drains automatically as slots free up.

### The workflow

1. **Select all idle drives**: click the checkbox in the table header (next to "DRIVE"). It auto-checks every drive that's currently selectable: idle, no active SMART test, not pool-locked.
   Pool-locked drives are intentionally excluded; if you really want to burn one of them in, unlock it individually first (see [Drive locks](#drive-locks) below).
2. **Click the Burn-In button** in the batch action bar that slides up from the bottom; it shows the count of selected drives.
3. **In the batch modal**: pick the stages to run (Short SMART, Long SMART, Surface Validate; drag to reorder), confirm your operator name, and click Start.
4. **All selected drives are queued** in one POST. Up to `max_parallel_burnins` enter `running`; the rest sit `queued`. As each running job finishes, the next queued job picks up the freed slot automatically, with no operator action between batches.
5. The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."

### Time estimate

| Drive size | Profile | Per-drive runtime (default block size) |
|------------|---------------------------------|----------------------------------------|
| 250 GB SSD | Short + Long SMART + Surface | ~1 hour |
| 14 TB HDD | Short + Long SMART + Surface | ~24 hours |
| 14 TB HDD | Short + Long SMART (no surface) | ~6–8 hours |

For 12× 14 TB drives at default 4-parallel: roughly **3–4 days** end-to-end. Bumping `surface_validate_block_size` from 4096 to 8192 in Settings cuts runtime roughly in half at ~2× RAM cost, which matches the upstream `disk-burnin.sh` recommendation.

### Watch out

- **Stuck-job timeout**: `stuck_job_hours` (default 168 = 7 days) marks any job past that threshold as `unknown` and kills the remote process. The default covers `-w` surface_validate on 14 TB+ HDDs with margin. If you're running short SSDs and want faster detection of genuinely stuck jobs, drop it. (Earlier versions defaulted to 24h, which false-positived on multi-TB drives.)
- **Thermal gate**: if drives currently under burn-in hit the temperature warning threshold, new jobs wait up to 3 minutes before acquiring a slot. Increase `temp_warn_c` if your chassis runs hot but is otherwise fine.
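
The batch figure in the time estimate can be reproduced with a little arithmetic. The sketch below is illustrative only: the function names `estimate_batch_hours` and `exceeds_stuck_threshold` are hypothetical (not part of the dashboard), and it assumes identical drives that finish in lockstep "waves" of `max_parallel_burnins`, with no thermal-gate delays.

```python
import math


def estimate_batch_hours(n_drives: int, per_drive_hours: float, max_parallel: int = 4) -> float:
    """Rough wall-clock estimate for a batch of identical drives.

    Assumption: every drive takes the same time, so the queue drains in
    ceil(n_drives / max_parallel) waves.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    if n_drives <= 0:
        return 0.0
    waves = math.ceil(n_drives / max_parallel)
    return waves * per_drive_hours


def exceeds_stuck_threshold(per_drive_hours: float, stuck_job_hours: float = 168) -> bool:
    """True if a single job would likely trip the stuck-job detector."""
    return per_drive_hours >= stuck_job_hours


# 12 drives, ~24 h each, 4 at a time: 3 waves.
total = estimate_batch_hours(12, 24)
print(f"{total:.0f} h = {total / 24:.1f} days")  # prints "72 h = 3.0 days"
```

This gives the lower end of the 3–4 day range; real batches drift upward when drives differ in speed or the thermal gate delays new jobs. The second helper is a quick sanity check before lowering `stuck_job_hours`: the threshold should stay comfortably above the slowest expected per-drive runtime.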

### Cancelling

Click the red ✕ next to a running job. The orchestrator:

1. Marks the job `cancelled` in the DB.
2. Issues `kill -9 ` over a fresh SSH session (the badblocks PID is captured at launch via `sh -c 'echo PID:$$; exec ...'`).
3. Cancels the asyncio task, releasing the semaphore slot for the next queued job.

Cancellations are durable: restart the container and queued jobs resume, cancelled jobs stay cancelled.

### Job states explained

| State | When it's set |
|-------------|-------------------------------------------------------------------------------|
| `queued` | Submitted, waiting for a `max_parallel_burnins` slot |
| `running` | Actively executing some stage |
| `passed` | All stages finished green |
| `failed` | A stage failed deterministically (bad blocks > threshold, SMART failure, etc.) |
| `cancelled` | Operator clicked ✕ |
| `unknown` | Job was alive but its outcome is indeterminate (see below) |

`unknown` fires in two situations:

1. The stuck-job detector (`stuck_job_hours`, default 7 days) trips because the job has been running too long without finishing.
2. The asyncio task got cancelled mid-stage by something *other* than an operator click, usually a container restart (`docker compose up -d`, `--build`, or the host rebooting).

   Burn-in source code goes through the Dockerfile `COPY`, so any source-code deploy recreates the container, drops the SSH connection to TrueNAS, and would orphan the running burn-in. Avoid `--build` while burn-ins are active.

When `unknown` fires, the drawer's per-stage Reason block shows *"Task cancelled mid-run — likely container restart or shutdown"* so the classification is explicit, not silent.

---

## Drive drawer

Click any drive row to slide a detail drawer down from the top.
Three tabs:

- **Burn-In**: per-stage breakdown of the latest job
- **SMART**: short/long test states + cached SMART attributes
- **Events**: last 50 audit events for the drive

### Surface-validate visualization

For drives in a `surface_validate` stage (running or finished), the Burn-In tab renders:

1. **Vital-signs strip**: `Start` (with date) · `Elapsed` · `ETA` (duration remaining) · `Finish` (wall-clock estimate, browser-local timezone) · `Temp` (cool/warm/hot colour). Computed from data in the drawer payload; ETA + Finish are suppressed below 0.5% so you don't see a "Finish: Jun 22" stutter at the very start.
2. **Four pattern meters**: `0xaa` / `0x55` / `0xff` / `0x00`. Each meter is split into a left half (write phase, blue) and a right half (verify phase, green). Current pattern's label glows blue; completed patterns' labels go green. This translates badblocks's per-phase percent into monotonic 0–99% overall progress, so the bar never appears to "rewind" when a new phase starts.
3. **Phase caption**: explicit text, e.g. *"Pattern 2 of 4 · Verify 0x55 · 47% within phase"*. Makes the visual grammar unambiguous.
4. **Completed-pattern history**: once pattern 1 finishes, a chip appears showing `0xaa: 14h 22m`. Lets you predict the rest of the run from the first pattern's elapsed time.

### Failure reason block

Stages that ended `failed` / `cancelled` / `unknown` show a coloured Reason pill at the top of the stage section. Sources, in order of preference:

1. The stage's own `error_text`
2. The parent job's `error_text` (backfilled by the drawer when the stage's own is empty; catches orphan rows from hard crashes)
3. A heuristic: if the log is tiny and no real progress was recorded, *"Stopped without recording an error — likely cause: SSH connection drop or container restart while this stage was running"*

Otherwise: *"No error message recorded."* There's never a blank where you expect to see why something broke.
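
The fallback order for the Reason pill can be sketched as a small function. This is a minimal illustration, not the dashboard's actual code: `resolve_reason`, the `TINY_LOG_BYTES` cut-off, and the `progress_pct` parameter are all hypothetical names and values chosen for the example, and the heuristic message is shortened to a stand-in constant.

```python
from typing import Optional

# Hypothetical cut-off for what counts as a "tiny" log (assumption).
TINY_LOG_BYTES = 200

# Short stand-ins for the messages shown in the drawer.
HEURISTIC_REASON = "Stopped without recording an error"
FALLBACK_REASON = "No error message recorded."


def resolve_reason(
    stage_error: Optional[str],
    job_error: Optional[str],
    log_text: Optional[str],
    progress_pct: float,
) -> str:
    """Pick the reason text in order of preference."""
    # 1 and 2: the stage's own error, then the parent job's error.
    for candidate in (stage_error, job_error):
        if candidate and candidate.strip():
            return candidate.strip()
    # 3: heuristic for runs that died before recording anything useful.
    log_is_tiny = len(log_text or "") < TINY_LOG_BYTES
    if log_is_tiny and progress_pct <= 0.0:
        return HEURISTIC_REASON
    # Last resort: never leave the pill blank.
    return FALLBACK_REASON


print(resolve_reason(None, "SSH session lost", "", 0.0))  # prints "SSH session lost"
```

Treating whitespace-only strings as empty keeps a stray newline in `error_text` from masking the parent job's more useful message.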

### Column sorting

Click any column header (Drive, Serial, Size, Temp, Health, Short SMART, Long SMART, Burn-In) to sort. Cycle: ascending → descending → cleared. Sort state persists in `localStorage` so it survives page reload AND every SSE-driven tbody refresh (~12 s poll cycle). Empty values always sink to the bottom regardless of direction.

Sortable values are emitted as `data-sort-*` attributes on each ``, with numeric priority maps for SMART states (e.g. `running` always sorts ahead of `idle`).

---

## Drive locks

To prevent destroying live data, the dashboard refuses to start destructive burn-in on drives ZFS or the kernel reports as in-use. Detected lock states (with the typed-confirmation token required to override):

| State | Detected via | Confirm token |
|---------------|---------------------------|------------------------------|
| Active pool | `zpool list -vHP` | the pool name (e.g. `tank`) |
| Boot pool | pool name = `boot-pool` | `DESTROY BOOT POOL` |
| Exported ZFS | `lsblk` `zfs_member` partitions not in any active pool | `DESTROY EXPORTED POOL` |
| Mounted FS | `findmnt -no SOURCE` | `DESTROY MOUNTED FILESYSTEM` |

Detection runs every poll cycle (~12 s). On any SSH or parser failure the poller fails *closed*: previously-locked drives stay locked, previously-unlocked drives stay unlocked, until detection recovers.

Unlock is in-memory only with a 10-minute TTL, bound to the `(pool_name, pool_role)` observed at unlock time. If a subsequent poll reclassifies the drive (e.g. `(exported)` → `tank` because someone imported the pool), the grant is invalidated automatically. Every unlock writes an audit event and surfaces in the next daily report in a red banner.

---

## Settings highlights

All settings live under `/settings` (header link). Key knobs:

- **`max_parallel_burnins`** (default 4): semaphore cap. Restart container for changes to take effect.
- **`surface_validate_block_size` / `_block_buffer` / `_passes`**: badblocks `-b` / `-c` / `-p`.
  Defaults preserve original behaviour; tune for speed vs paranoia.
- **`stuck_job_hours`** (default 168 = 7 days): covers 14 TB+ HDDs; drop for faster detection on small fast drives.
- **`temp_warn_c` / `temp_crit_c`**: thermal gating thresholds.
- **`bad_block_threshold`** (default 0): number of bad blocks surface_validate tolerates before failing the stage.
- **`retention_log_days`** (default 35): when to NULL out `burnin_stages.log_text`. Nightly job at 03:00 local.
- **`retention_backup_keep`** (default 14): how many nightly DB snapshots to keep in `/data/backups/`.

---

## Notifications

- **Daily SMTP report** at `smtp_report_hour` (default 08:00 local) with drive-level summary, failed-health banner, and a red banner listing every pool-drive unlock from the last 24 h.
- **Per-job email alerts** on pass/fail (configurable).
- **Webhook URL** posts JSON on every job state change.

Configure SMTP in Settings → Email. Includes a "Test SMTP" button.

---

## Operations

### Logs

```bash
docker logs -f nas-burnin
# JSON-structured. Filter with jq:
docker logs nas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
```

### User management

```bash
docker exec -it nas-burnin python -m app.auth_cli list
docker exec -it nas-burnin python -m app.auth_cli add
docker exec -it nas-burnin python -m app.auth_cli reset
```

Passwords are read from a TTY prompt and are never accepted on the command line.

### Backups

Automated nightly to `/data/backups/app-YYYY-MM-DD.db` (online `sqlite3.backup`, doesn't lock writers). To restore:

```bash
docker compose down
cp data/backups/app-2026-05-01.db data/app.db
docker compose up -d
```

### Health probe

`/health` is unauthenticated and returns 200 only when DB, poller, and SSH (when configured) all check green; 503 otherwise. Use it for container/orchestrator health checks.

```bash
curl -sf http://localhost:8084/health | jq
```

### Resetting the DB

If you need to start over:

```bash
docker compose down
sudo rm -f data/app.db data/session_secret
# keep data/settings_overrides.json if you want to preserve UI settings
docker compose up -d
```

---

## Updating dependencies

`requirements.in` is the human-edited list. `requirements.txt` is a fully-pinned lockfile generated from it (with sha256 hashes), consumed at build time with `pip install --require-hashes`. **Never edit `requirements.txt` by hand.**

```bash
# 1. Add or change a constraint in requirements.in
$EDITOR requirements.in

# 2. Regenerate the lockfile (runs pip-compile in a clean container)
./scripts/regenerate-lockfile.sh

# 3. Review the diff: transitive bumps may be CVE fixes or breaking changes
git diff requirements.txt

# 4. Rebuild + smoke-test
docker compose up -d --build app
curl -sf http://localhost:8084/health | jq

# 5. Commit BOTH files together
git add requirements.in requirements.txt
git commit -m "deps: bump for "
```

This + the daily security scan (`scripts/security-scan.sh`) gives defense-in-depth: pinning prevents accidental breakage from upstream releases (Starlette 1.0 broke us once), `--require-hashes` defends against compromised mirrors, and `pip-audit` catches new CVEs in any pinned version after the fact.

## See also

- `CLAUDE.md`: full architecture, file map, deploy workflow, and the rationale behind every non-obvious design decision.
- `SPEC.md`: canonical feature reference per version.
- `tests/`: `python -m unittest discover tests/` (65 tests, stdlib-only). Or run inside the deployed container with `scripts/run-tests.sh`.

---

## Known gaps / not-yet-built

- No multi-user RBAC; every user is effectively admin.
- No per-drive SMART attribute trend graphs (snapshots only).
- No scheduled burn-ins; jobs run immediately when queued.
- No CSRF tokens on state-changing endpoints (relies on `SameSite=Strict` session cookie).

PRs welcome.