nas-burnin/README.md
Brandon Walter 2107981cf1
docs: drawer surface_validate + sorting + job states
Documents the drawer enhancements landed across 1.0.0-44 → 1.0.0-51:

- Job states section explains passed / failed / cancelled / unknown,
  including when 'unknown' fires (stuck-job timeout OR container
  restart cancelling the asyncio task).
- Drive drawer section covers the new surface_validate visualization:
  vital-signs strip (Start / Elapsed / ETA / Finish / Temp), four
  per-pattern meters with split write/verify halves, phase caption,
  completed-pattern duration history.
- Failure reason block describes the three-tier source resolution
  (stage error_text → job error_text → heuristic) and what shows up
  when none is available.
- Column sorting describes the click-to-cycle behaviour and the
  localStorage persistence that survives SSE refreshes.

Plus an explicit warning: don't `--build` while burn-ins are running
(now classified `unknown` instead of `failed` — but still better to
avoid the kill in the first place).
2026-05-09 15:34:12 -07:00


# NAS Burn-In Dashboard
Web dashboard for running disciplined burn-in tests on TrueNAS drives.
Sits next to the NAS, not on it — orchestrates `smartctl`, `badblocks`, and
`nvme-cli` over SSH and tracks every job in SQLite.
Inspired by the community `disk-burnin.sh` script (Spearfoot et al.) but
adds: concurrent burn-ins, pool-membership safety locks, login + audit,
live progress UI, daily email reports, and resumable state.
## Stack
FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No
external services beyond your TrueNAS host. Templates and static assets
are bind-mounted; Python source is baked into the image.

---
## Quick start
```bash
# 1. Configure
cp .env.example .env
# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.
# 2. Build + run
docker compose up -d --build
# 3. Open the dashboard
open http://localhost:8084 # macOS; on Linux use xdg-open, or browse to your host's IP
# 4. First time: the login page renders a "Create initial admin" form.
# Pick a username + password (>= 8 chars). Done.
```
If you set `INITIAL_ADMIN_*` env vars *and* the users table is empty, that
account is created on startup automatically. After that the env vars are
ignored — change passwords from the UI ("Change password" header link) or
the CLI (`docker exec -it nas-burnin python -m app.auth_cli reset
<username>`).

---
## Burning in many drives at once
The dashboard runs **up to `max_parallel_burnins`** burn-ins concurrently
(configurable in Settings, default 4) and queues the rest. Submitting 14
drives doesn't take 14 separate clicks — you submit once and the queue
drains automatically as slots free up.
### The workflow
1. **Select all idle drives** — click the checkbox in the table header
(next to "DRIVE"). It auto-checks every drive that's currently
selectable: idle, no active SMART test, not pool-locked. Pool-locked
drives are intentionally excluded; if you really want to burn one of
them in, unlock it individually first (see [Drive locks](#drive-locks)
below).
2. **Click the Burn-In button** in the batch action bar that slides up
from the bottom — it shows the count of selected drives.
3. **In the batch modal**: pick the stages to run (Short SMART, Long
SMART, Surface Validate — drag to reorder), confirm your operator
name, and click Start.
4. **All selected drives are queued** in one POST. Up to
`max_parallel_burnins` enter `running`; the rest sit `queued`. As each
running job finishes, the next queued job picks up the freed slot
automatically — no operator action between batches.
5. The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."
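The slot behaviour in step 4 can be pictured with an `asyncio.Semaphore`. This is a minimal sketch, not the dashboard's actual orchestrator: `run_burnin`, `submit_batch`, and the `stats` dict are hypothetical names, and a short sleep stands in for the real stages.

```python
import asyncio


async def run_burnin(drive: str, slots: asyncio.Semaphore, stats: dict) -> None:
    """Hypothetical job wrapper: wait for a free slot, run, release the slot."""
    async with slots:  # the job stays "queued" here while all slots are busy
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # stand-in for the SMART / badblocks stages
        stats["running"] -= 1
        stats["done"].append(drive)
    # leaving the `async with` frees the slot for the next queued job


async def submit_batch(drives: list, max_parallel: int = 4) -> dict:
    """Queue every drive in one call; at most `max_parallel` run together."""
    slots = asyncio.Semaphore(max_parallel)
    stats = {"running": 0, "peak": 0, "done": []}
    await asyncio.gather(*(run_burnin(d, slots, stats) for d in drives))
    return stats
```

Submitting 12 drives with the default cap of 4 never has more than 4 jobs inside the `async with` block at once; the other 8 wait without any operator action.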
### Time estimate
| Drive size | Profile | Per-drive runtime (default block size) |
|-----------|-------------|----------------------------------------|
| 250 GB SSD | Short + Long SMART + Surface | ~1 hour |
| 14 TB HDD | Short + Long SMART + Surface | ~24 hours |
| 14 TB HDD | Short + Long SMART (no surface) | ~6-8 hours |
For 12× 14 TB drives at default 4-parallel (three batches of four, ~24 hours each): roughly **3-4 days** end-to-end.
Bumping `surface_validate_block_size` from 4096 to 8192 in Settings cuts
runtime roughly in half at ~2× RAM cost — matches the upstream
`disk-burnin.sh` recommendation.
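A quick cross-check of the end-to-end figure, assuming identical drives that start and finish in lockstep batches. Real runs drain the queue slot by slot, and the thermal gate can add waiting time, so treat this as a lower bound. `batch_runtime_hours` is a hypothetical helper, not part of the dashboard.

```python
import math


def batch_runtime_hours(n_drives: int, max_parallel: int, per_drive_hours: float) -> float:
    """Lower-bound wall-clock time when drives run in fixed-size batches."""
    batches = math.ceil(n_drives / max_parallel)  # e.g. 12 drives / 4 slots = 3
    return batches * per_drive_hours


hours = batch_runtime_hours(n_drives=12, max_parallel=4, per_drive_hours=24)
print(f"{hours} h = {hours / 24} days")
```

With the table's ~24 hours per drive, 12 drives at 4-parallel come out at 72 hours, or three days, before any queueing overhead.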
### Watch out
- **Stuck-job timeout** — `stuck_job_hours` (default 168 = 7 days)
marks any job past that threshold as `unknown` and kills the remote
process. The default covers `-w` surface_validate on 14 TB+ HDDs with
margin. If you're running small SSDs and want faster detection of
genuinely stuck jobs, lower it. (Earlier versions defaulted to 24 h,
which produced false positives on multi-TB drives.)
- **Thermal gate** — if drives currently under burn-in hit the
temperature warning threshold, new jobs wait up to 3 minutes before
acquiring a slot. Increase `temp_warn_c` if your chassis runs hot but
is otherwise fine.
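The stuck-job rule boils down to a single elapsed-time comparison. A minimal sketch, assuming the detector only looks at the job's start time; `is_stuck` and its arguments are hypothetical names, and the real detector may consider more than this.

```python
from datetime import datetime, timedelta, timezone


def is_stuck(started_at: datetime, now: datetime, stuck_job_hours: int = 168) -> bool:
    """True once a job has been running longer than the configured threshold."""
    return now - started_at > timedelta(hours=stuck_job_hours)


started = datetime(2026, 5, 1, 8, 0, tzinfo=timezone.utc)
print(is_stuck(started, started + timedelta(days=5)))  # still inside the 7-day window
print(is_stuck(started, started + timedelta(days=8)))  # past the threshold
```

Under the earlier 24-hour default, the same five-day job would already have been flagged, which is the false positive described above.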
### Cancelling
Click the red ✕ next to a running job. The orchestrator:
1. Marks the job `cancelled` in the DB.
2. Issues `kill -9 <remote_pid>` over a fresh SSH session (the badblocks
PID is captured at launch via `sh -c 'echo PID:$$; exec ...'`).
3. Cancels the asyncio task, releasing the semaphore slot for the next
queued job.
Cancellations are durable — restart the container and queued jobs resume,
cancelled jobs stay cancelled.
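The PID-capture trick in step 2 works because `$$` expands to the shell's own PID and `exec` then replaces that shell with the real command, so the announced PID belongs to the command itself. The sketch below demonstrates this locally with `subprocess`; the dashboard does it over SSH instead, and `launch_with_pid` and `cancel` are hypothetical names.

```python
import os
import signal
import subprocess


def launch_with_pid(command: str):
    """Start `command` through sh and return (process, announced PID)."""
    proc = subprocess.Popen(
        ["sh", "-c", f"echo PID:$$; exec {command}"],
        stdout=subprocess.PIPE,
        text=True,
    )
    first_line = proc.stdout.readline().strip()  # expected form: "PID:12345"
    if not first_line.startswith("PID:"):
        proc.kill()
        raise RuntimeError(f"unexpected first line: {first_line!r}")
    return proc, int(first_line[len("PID:"):])


def cancel(proc: subprocess.Popen, pid: int) -> int:
    """Local equivalent of `kill -9 <remote_pid>`; returns the exit status."""
    os.kill(pid, signal.SIGKILL)
    proc.stdout.close()
    return proc.wait()
```

Because of the `exec`, killing the announced PID stops the long-running command directly rather than a wrapper shell that would leave it orphaned.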
### Job states explained
| State | When it's set |
|-------------|-------------------------------------------------------------------------------|
| `queued` | Submitted, waiting for a `max_parallel_burnins` slot |
| `running` | Actively executing some stage |
| `passed` | All stages finished green |
| `failed` | A stage failed deterministically (bad blocks > threshold, SMART failure, etc.) |
| `cancelled` | Operator clicked ✕ |
| `unknown` | Job was alive but its outcome is indeterminate — see below |
`unknown` fires in two situations:
1. The stuck-job detector (`stuck_job_hours`, default 7 days) trips because
the job has been running too long without finishing.
2. The asyncio task got cancelled mid-stage by something *other* than an
operator click — usually a container restart (`docker compose up -d`,
`--build`, or the host rebooting). Python source is baked into the image
by the Dockerfile `COPY`, so any source-code deploy recreates the
container, drops the SSH connection to TrueNAS, and orphans the
running burn-in. Avoid `--build` while burn-ins are active.
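The split between `cancelled` and `unknown` can be sketched as a task that checks, on cancellation, whether an operator asked for it. This is an illustration under assumptions, not the orchestrator's code: `run_job`, `demo`, and the `operator_cancelled` event are hypothetical.

```python
import asyncio


async def run_job(work, operator_cancelled: asyncio.Event) -> str:
    """Run `work` and map how it ended onto a terminal job state."""
    try:
        await work
    except asyncio.CancelledError:
        # Same exception either way; only the flag tells the two cases apart.
        return "cancelled" if operator_cancelled.is_set() else "unknown"
    except Exception:
        return "failed"
    return "passed"


async def demo(operator_clicked: bool) -> str:
    flag = asyncio.Event()
    task = asyncio.create_task(run_job(asyncio.sleep(60), flag))
    await asyncio.sleep(0)  # let the job reach its first await
    if operator_clicked:
        flag.set()  # what the ✕ button would do before cancelling
    task.cancel()  # a shutdown cancels the task without setting the flag
    return await task
```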
When `unknown` fires the drawer's per-stage Reason block shows
*"Task cancelled mid-run — likely container restart or shutdown"* so the
classification is explicit, not silent.

---
## Drive drawer
Click any drive row to slide a detail drawer down from the top. Three tabs:
- **Burn-In** — per-stage breakdown of the latest job
- **SMART** — short/long test states + cached SMART attributes
- **Events** — last 50 audit events for the drive
### Surface-validate visualization
For drives in a `surface_validate` stage (running or finished), the Burn-In
tab renders:
1. **Vital-signs strip** — `Start` (with date) · `Elapsed` · `ETA` (duration
remaining) · `Finish` (wall-clock estimate, browser-local timezone) ·
`Temp` (cool/warm/hot colour). Computed from data in the drawer payload;
`ETA` and `Finish` are suppressed below 0.5% progress so you don't see a
"Finish: Jun 22" stutter at the very start.
2. **Four pattern meters** — `0xaa` / `0x55` / `0xff` / `0x00`. Each meter
is split into a left half (write phase, blue) and a right half (verify
phase, green). Current pattern's label glows blue; completed patterns'
labels go green. The drawer translates badblocks's per-phase percent into
monotonic 0-99% overall progress, so the bar never appears to "rewind"
when a new phase starts.
3. **Phase caption** — explicit text: *"Pattern 2 of 4 · Verify 0x55 · 47%
within phase"*. Makes the visual grammar unambiguous.
4. **Completed-pattern history** — once pattern 1 finishes, a chip appears
showing `0xaa: 14h 22m`. Lets you predict the rest of the run from the
first pattern's elapsed time.
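One way to get monotonic overall progress from per-phase percentages is to treat the run as eight equal phases (four patterns, each with a write and a verify pass). The sketch below assumes equal weighting and a simple linear ETA; the function names are hypothetical and the dashboard's actual formula may differ.

```python
from typing import Optional

PATTERNS = ["0xaa", "0x55", "0xff", "0x00"]
PHASES = ["write", "verify"]


def overall_progress(pattern_index: int, phase: str, phase_pct: float) -> float:
    """Map (pattern, phase, percent within phase) to overall percent, capped at 99."""
    phases_done = pattern_index * len(PHASES) + PHASES.index(phase)
    total_phases = len(PATTERNS) * len(PHASES)
    overall = (phases_done + phase_pct / 100) / total_phases * 100
    return min(overall, 99.0)


def eta_seconds(elapsed_s: float, overall_pct: float) -> Optional[float]:
    """Linear extrapolation; suppressed below 0.5% to avoid wild early estimates."""
    if overall_pct < 0.5:
        return None
    return elapsed_s * (100 - overall_pct) / overall_pct


def caption(pattern_index: int, phase: str, phase_pct: float) -> str:
    """Build the phase caption text shown under the meters."""
    return (
        f"Pattern {pattern_index + 1} of {len(PATTERNS)} · "
        f"{phase.capitalize()} {PATTERNS[pattern_index]} · "
        f"{phase_pct:.0f}% within phase"
    )
```

For the caption example above (pattern 2, verify, 47%), three phases are complete and the fourth is 47% done, so the overall figure is (3 + 0.47) / 8, about 43.4%.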
### Failure reason block
Stages that ended `failed` / `cancelled` / `unknown` show a coloured Reason
pill at the top of the stage section. Sources, in order of preference:
1. The stage's own `error_text`
2. The parent job's `error_text` (backfilled by the drawer when the stage's
own is empty — catches orphan rows from hard crashes)
3. A heuristic: if the log is tiny and no real progress was recorded,
*"Stopped without recording an error — likely cause: SSH connection drop
or container restart while this stage was running"*
Otherwise: *"No error message recorded."* — there's never a blank where you
expect to see why something broke.
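The lookup order above amounts to a short fallback chain. A minimal sketch, assuming the heuristic only checks log size and a progress flag; `resolve_reason`, `made_progress`, and the `TINY_LOG_BYTES` cutoff are hypothetical.

```python
from typing import Optional

TINY_LOG_BYTES = 200  # hypothetical cutoff for "the log is tiny"

HEURISTIC_REASON = (
    "Stopped without recording an error — likely cause: SSH connection drop "
    "or container restart while this stage was running"
)


def resolve_reason(
    stage_error: Optional[str],
    job_error: Optional[str],
    log_text: Optional[str],
    made_progress: bool,
) -> str:
    """Return the first available explanation, never an empty string."""
    if stage_error:
        return stage_error  # 1. the stage's own error_text
    if job_error:
        return job_error  # 2. the parent job's error_text
    if len(log_text or "") < TINY_LOG_BYTES and not made_progress:
        return HEURISTIC_REASON  # 3. heuristic for silent stops
    return "No error message recorded."
```

Empty strings and `None` are treated alike, so a stage row with a blank `error_text` still falls through to the job-level text.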
### Column sorting
Click any column header (Drive, Serial, Size, Temp, Health, Short SMART,
Long SMART, Burn-In) to sort. Cycle: ascending → descending → cleared. Sort
state persists in `localStorage` so it survives page reload AND every
SSE-driven tbody refresh (~12 s poll cycle). Empty values always sink to
the bottom regardless of direction.
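The cycle and the empties-last rule are sketched here in Python for brevity; the real sorting runs in the browser, and `next_direction` and `sort_rows` are hypothetical names.

```python
from typing import Optional

SORT_CYCLE = [None, "asc", "desc"]  # cleared -> ascending -> descending -> cleared


def next_direction(current: Optional[str]) -> Optional[str]:
    """Advance one step through the click-to-cycle order."""
    return SORT_CYCLE[(SORT_CYCLE.index(current) + 1) % len(SORT_CYCLE)]


def sort_rows(rows: list, key: str, direction: Optional[str]) -> list:
    """Sort rows by `key`; rows with no value stay at the bottom either way."""
    if direction is None:
        return list(rows)  # cleared: keep the server-provided order

    def is_empty(row: dict) -> bool:
        return row.get(key) in (None, "")

    filled = sorted(
        (r for r in rows if not is_empty(r)),
        key=lambda r: r[key],
        reverse=(direction == "desc"),
    )
    return filled + [r for r in rows if is_empty(r)]
```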
Sortable values are emitted as `data-sort-*` attributes on each `<tr>`,
with numeric priority maps for SMART states (e.g. `running` always sorts
ahead of `idle`).

---
## Drive locks
To prevent destroying live data, the dashboard refuses to start
destructive burn-in on drives ZFS or the kernel reports as in-use.
Detected lock states (with the typed-confirmation token required to
override):
| State | Detected via | Confirm token |
|---------------|---------------------------|------------------------------|
| Active pool | `zpool list -vHP` | the pool name (e.g. `tank`) |
| Boot pool | pool name = `boot-pool` | `DESTROY BOOT POOL` |
| Exported ZFS | `lsblk` `zfs_member` partitions not in any active pool | `DESTROY EXPORTED POOL` |
| Mounted FS | `findmnt -no SOURCE` | `DESTROY MOUNTED FILESYSTEM` |
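The typed-confirmation check implied by the table can be sketched as a lookup plus an exact string comparison. The state keys (`active_pool`, `boot_pool`, and so on) and both function names are hypothetical; only the token strings come from the table.

```python
from typing import Optional

FIXED_TOKENS = {
    "boot_pool": "DESTROY BOOT POOL",
    "exported_zfs": "DESTROY EXPORTED POOL",
    "mounted_fs": "DESTROY MOUNTED FILESYSTEM",
}


def required_token(lock_state: str, pool_name: Optional[str] = None) -> str:
    """Return the text the operator must type to override this lock."""
    if lock_state == "active_pool":
        if not pool_name:
            raise ValueError("active pool locks need the pool name")
        return pool_name  # e.g. "tank"
    return FIXED_TOKENS[lock_state]


def unlock_allowed(typed: str, lock_state: str, pool_name: Optional[str] = None) -> bool:
    """Exact match only: no trimming and no case folding."""
    return typed == required_token(lock_state, pool_name)
```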
Detection runs every poll cycle (~12 s). On any SSH or parser failure the
poller fails *closed*: previously-locked drives stay locked and
previously-unlocked drives stay unlocked until detection recovers.
Unlock is in-memory only with a 10-minute TTL — bound to the
`(pool_name, pool_role)` observed at unlock time. If a subsequent poll
reclassifies the drive (e.g. `(exported)` → `tank` because someone
imported the pool), the grant is invalidated automatically.
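A grant that expires after 10 minutes and is tied to the observed pool identity could look like the sketch below. `UnlockGrant`, `grant_is_valid`, and the example `pool_role` values are hypothetical; only the TTL and the `(pool_name, pool_role)` binding come from the description above.

```python
from dataclasses import dataclass

UNLOCK_TTL_SECONDS = 10 * 60


@dataclass(frozen=True)
class UnlockGrant:
    pool_name: str
    pool_role: str
    granted_at: float  # seconds, e.g. from time.monotonic()


def grant_is_valid(grant: UnlockGrant, pool_name: str, pool_role: str, now: float) -> bool:
    """Valid only while fresh and while the drive still looks the same."""
    if now - grant.granted_at > UNLOCK_TTL_SECONDS:
        return False  # TTL elapsed
    # A reclassified drive (for example, an imported pool) voids the grant.
    return (grant.pool_name, grant.pool_role) == (pool_name, pool_role)
```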
Every unlock writes an audit event and surfaces in the next daily report
in a red banner.

---
## Settings highlights
All settings live under `/settings` (header link). Key knobs:
- **`max_parallel_burnins`** (default 4) — semaphore cap. Restart the container
for changes to take effect.
- **`surface_validate_block_size` / `_block_buffer` / `_passes`** —
badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour;
tune for speed vs paranoia.
- **`stuck_job_hours`** (default 168 = 7 days) — covers 14 TB+ HDDs;
drop for faster detection on small fast drives.
- **`temp_warn_c` / `temp_crit_c`** — thermal gating thresholds.
- **`bad_block_threshold`** (default 0) — number of bad blocks
surface_validate tolerates before failing the stage.
- **`retention_log_days`** (default 35) — when to NULL out
`burnin_stages.log_text`. Nightly job at 03:00 local.
- **`retention_backup_keep`** (default 14) — how many nightly DB
snapshots to keep in `/data/backups/`.
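As an illustration of what `retention_log_days` does, the nightly purge could be a single `UPDATE` that nulls old log text while keeping the stage rows. This is a sketch under assumptions: the `finished_at` column and the `purge_old_logs` helper are hypothetical; only `burnin_stages.log_text` is named above.

```python
import sqlite3


def purge_old_logs(conn: sqlite3.Connection, retention_log_days: int = 35) -> int:
    """NULL out log_text for stages older than the retention window."""
    cur = conn.execute(
        "UPDATE burnin_stages SET log_text = NULL "
        "WHERE log_text IS NOT NULL "
        "AND finished_at < datetime('now', ?)",
        (f"-{int(retention_log_days)} days",),  # SQLite date modifier, e.g. '-35 days'
    )
    conn.commit()
    return cur.rowcount  # number of stage rows whose log was cleared
```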
---
## Notifications
- **Daily SMTP report** at `smtp_report_hour` (default 08:00 local) with
drive-level summary, failed-health banner, and a red banner listing
every pool-drive unlock from the last 24 h.
- **Per-job email alerts** on pass/fail (configurable).
- **Webhook URL** posts JSON on every job state change.
Configure SMTP in Settings → Email. Includes a "Test SMTP" button.

---
## Operations
### Logs
```bash
docker logs -f nas-burnin
# JSON-structured. Filter with jq:
docker logs nas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
```
### User management
```bash
docker exec -it nas-burnin python -m app.auth_cli list
docker exec -it nas-burnin python -m app.auth_cli add <username>
docker exec -it nas-burnin python -m app.auth_cli reset <username>
```
Passwords are read from a TTY prompt and are never accepted on the command
line.
### Backups
Automated nightly to `/data/backups/app-YYYY-MM-DD.db` (online
`sqlite3.backup`, doesn't lock writers). To restore:
```bash
docker compose down
cp data/backups/app-2026-05-01.db data/app.db
docker compose up -d
```
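For reference, an online snapshot with retention can be written with the standard library's `sqlite3.Connection.backup`. The sketch below assumes the `app-YYYY-MM-DD.db` naming shown above; `snapshot_db` and its parameters are hypothetical, not the dashboard's own backup job.

```python
import sqlite3
from datetime import date
from pathlib import Path
from typing import Optional


def snapshot_db(db_path: Path, backup_dir: Path, keep: int = 14,
                today: Optional[date] = None) -> Path:
    """Copy the live DB to app-YYYY-MM-DD.db and prune the oldest snapshots."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"app-{(today or date.today()).isoformat()}.db"
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(target)
    try:
        src.backup(dst)  # online copy through SQLite's backup API
    finally:
        dst.close()
        src.close()
    # ISO dates sort chronologically, so a plain name sort is oldest first.
    snapshots = sorted(backup_dir.glob("app-*.db"))
    for old in snapshots[: max(len(snapshots) - keep, 0)]:
        old.unlink()
    return target
```

Restoring is then the plain file copy shown above, done while the container is stopped.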
### Health probe
`/health` is unauthenticated and returns 200 only when DB, poller, and
SSH (when configured) all check green; 503 otherwise. Use it for
container/orchestrator health checks.
```bash
curl -sf http://localhost:8084/health | jq
```
### Resetting the DB
If you need to start over:
```bash
docker compose down
sudo rm -f data/app.db data/session_secret
# keep data/settings_overrides.json if you want to preserve UI settings
docker compose up -d
```
---
## Updating dependencies
`requirements.in` is the human-edited list. `requirements.txt` is a
fully-pinned lockfile generated from it (with sha256 hashes), consumed
at build time with `pip install --require-hashes`. **Never edit
`requirements.txt` by hand.**
```bash
# 1. Add or change a constraint in requirements.in
$EDITOR requirements.in
# 2. Regenerate the lockfile (runs pip-compile in a clean container)
./scripts/regenerate-lockfile.sh
# 3. Review the diff — transitive bumps may be CVE fixes or breaking changes
git diff requirements.txt
# 4. Rebuild + smoke-test
docker compose up -d --build app
curl -sf http://localhost:8084/health | jq
# 5. Commit BOTH files together
git add requirements.in requirements.txt
git commit -m "deps: bump <package> for <reason>"
```
Combined with the daily security scan (`scripts/security-scan.sh`), this gives
defense-in-depth: pinning prevents accidental breakage from upstream
releases (Starlette 1.0 broke us once), `--require-hashes` defends
against compromised mirrors, and `pip-audit` catches new CVEs in any
pinned version after the fact.
## See also
- `CLAUDE.md` — full architecture, file map, deploy workflow, and the
rationale behind every non-obvious design decision.
- `SPEC.md` — canonical feature reference per version.
- `tests/` — `python -m unittest discover tests/` (65 tests, stdlib-only). Or run inside the deployed container with `scripts/run-tests.sh`.
---
## Known gaps / not-yet-built
- No multi-user RBAC — every user is effectively admin.
- No per-drive SMART attribute trend graphs (snapshots only).
- No scheduled burn-ins — jobs run immediately when queued.
- No CSRF tokens on state-changing endpoints (relies on
`SameSite=Strict` session cookie).
PRs welcome.