After the chunk-read refactor, the inner _drain coroutine assigns to last_db_write_ts and last_pct_sample. Without nonlocal, Python compiles these as locals of _drain, so any READ before the first assignment raises UnboundLocalError. In 1.0.0-55 / -57 the bug was hidden by gather(return_exceptions=True), which silently swallowed the exception: the drain coroutine ended immediately, the asyncssh channel buffer filled up, and the remote badblocks blocked on pipe_write. THAT was the actual cause of the "parser silently never works" symptom, not anything to do with the chunk-read or tr-pipe logic itself. 1.0.0-57 dropped the gather (single drain after merging 2>&1), which made the next deploy surface the bug as an explicit error_text on the surface_validate stage: "cannot access local variable 'last_db_write_ts' where it is not associated with a value". Fix: add both vars to the nonlocal declaration. pending_log_chunks only gets .append/.clear (no reassignment) so it doesn't need nonlocal. This is the bug that's been hiding behind all the recent parser work. Sorry for the round trips.
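The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the project's code: make_drain and the toy _drain bodies are illustrative stand-ins, and only the variable name last_db_write_ts is taken from the message.

```python
import asyncio


def make_drain(use_nonlocal: bool):
    """Build a toy stand-in for the inner drain coroutine (illustrative only)."""
    last_db_write_ts = 0.0  # state owned by the enclosing function

    if use_nonlocal:
        async def _drain(now: float) -> float:
            nonlocal last_db_write_ts
            elapsed = now - last_db_write_ts  # reads the enclosing variable
            last_db_write_ts = now            # rebinds the enclosing variable
            return elapsed
    else:
        async def _drain(now: float) -> float:
            # The assignment two lines down makes last_db_write_ts a local
            # of _drain, so this read happens before the local is bound.
            elapsed = now - last_db_write_ts
            last_db_write_ts = now
            return elapsed

    return _drain


async def main():
    broken = make_drain(use_nonlocal=False)
    fixed = make_drain(use_nonlocal=True)
    # return_exceptions=True hands the exception back as a plain result
    # instead of raising it, which is how the bug stayed invisible.
    return await asyncio.gather(broken(5.0), fixed(5.0), return_exceptions=True)


results = asyncio.run(main())
```

Here results[0] is an UnboundLocalError instance that nothing ever raised, while results[1] is the elapsed value from the nonlocal variant.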
NAS Burn-In Dashboard
Web dashboard for running disciplined burn-in tests on TrueNAS drives.
Sits next to the NAS, not on it — orchestrates smartctl, badblocks, and
nvme-cli over SSH and tracks every job in SQLite.
Inspired by the community disk-burnin.sh script (Spearfoot et al.) but
adds: concurrent burn-ins, pool-membership safety locks, login + audit,
live progress UI, daily email reports, and resumable state.
Stack
FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No external services beyond your TrueNAS host. Templates and static assets are bind-mounted; Python source is baked into the image.
Quick start
# 1. Configure
cp .env.example .env
# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.
# 2. Build + run
docker compose up -d --build
# 3. Open the dashboard
open http://localhost:8084 # or your host's IP
# 4. First time: the login page renders a "Create initial admin" form.
# Pick a username + password (>= 8 chars). Done.
If you set INITIAL_ADMIN_* env vars and the users table is empty, that
account is created on startup automatically. After that the env vars are
ignored — change passwords from the UI ("Change password" header link) or
the CLI (docker exec -it nas-burnin python -m app.auth_cli reset <username>).
Burning in many drives at once
The dashboard runs up to max_parallel_burnins burn-ins concurrently
(configurable in Settings, default 4) and queues the rest. Submitting 14
drives doesn't take 14 separate clicks — you submit once and the queue
drains automatically as slots free up.
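The slot-and-queue behaviour can be pictured with an asyncio.Semaphore. This is a sketch under assumptions, not the orchestrator itself: run_batch and burn_in are hypothetical names, and the sleep stands in for the real stages.

```python
import asyncio


async def run_batch(job_count: int, max_parallel_burnins: int = 4) -> int:
    """Run job_count fake jobs and return the peak number running at once."""
    slots = asyncio.Semaphore(max_parallel_burnins)
    running = 0
    peak = 0

    async def burn_in(drive: int) -> None:
        nonlocal running, peak
        async with slots:              # waits here (queued) until a slot frees up
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for the real burn-in stages
            running -= 1

    await asyncio.gather(*(burn_in(d) for d in range(job_count)))
    return peak


peak = asyncio.run(run_batch(job_count=14))
```

All 14 jobs are submitted at once, yet no more than four are ever inside the semaphore at the same time.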
The workflow
- Select all idle drives — click the checkbox in the table header (next to "DRIVE"). It auto-checks every drive that's currently selectable: idle, no active SMART test, not pool-locked. Pool-locked drives are intentionally excluded; if you really want to burn one of them in, unlock it individually first (see Drive locks below).
- Click the Burn-In button in the batch action bar that slides up from the bottom — it shows the count of selected drives.
- In the batch modal: pick the stages to run (Short SMART, Long SMART, Surface Validate — drag to reorder), confirm your operator name, and click Start.
- All selected drives are queued in one POST. Up to max_parallel_burnins enter running; the rest sit queued. As each running job finishes, the next queued job picks up the freed slot automatically, with no operator action between batches.
- The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."
Time estimate
| Drive size | Profile | Per-drive runtime (default block size) |
|---|---|---|
| 250 GB SSD | Short + Long SMART + Surface | ~1 hour |
| 14 TB HDD | Short + Long SMART + Surface | ~24 hours |
| 14 TB HDD | Short + Long SMART (no surface) | ~6–8 hours |
For 12× 14 TB drives at default 4-parallel: roughly 3–4 days end-to-end.
Bumping surface_validate_block_size from 4096 to 8192 in Settings cuts
runtime roughly in half at ~2× RAM cost — matches the upstream
disk-burnin.sh recommendation.
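The 3–4 day figure follows from running drives in waves. A rough sketch of that arithmetic, assuming every drive takes about the same time (batch_wall_clock_hours is a hypothetical helper, not part of the app):

```python
import math


def batch_wall_clock_hours(drive_count: int, per_drive_hours: float,
                           max_parallel_burnins: int = 4) -> float:
    """Rough wall-clock estimate: drives run in waves of max_parallel_burnins."""
    waves = math.ceil(drive_count / max_parallel_burnins)
    return waves * per_drive_hours


hours = batch_wall_clock_hours(drive_count=12, per_drive_hours=24)
days = hours / 24
```

Twelve 14 TB drives at ~24 hours each come out at three waves, or about three days; real runs vary per drive, which is where the extra day of margin comes from.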
Watch out
- Stuck-job timeout: stuck_job_hours (default 168 = 7 days) marks any job past that threshold as unknown and kills the remote process. The default covers -w surface_validate on 14 TB+ HDDs with margin. If you're running small SSDs and want faster detection of genuinely stuck jobs, lower it. (Earlier versions defaulted to 24 h, which produced false positives on multi-TB drives.)
- Thermal gate: if drives currently under burn-in hit the temperature warning threshold, new jobs wait up to 3 minutes before acquiring a slot. Increase temp_warn_c if your chassis runs hot but is otherwise fine.
Cancelling
Click the red ✕ next to a running job. The orchestrator:
- Marks the job cancelled in the DB.
- Issues kill -9 <remote_pid> over a fresh SSH session (the badblocks PID is captured at launch via sh -c 'echo PID:$$; exec ...').
- Cancels the asyncio task, releasing the semaphore slot for the next queued job.
Cancellations are durable — restart the container and queued jobs resume, cancelled jobs stay cancelled.
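The PID-capture trick in the second step works because exec replaces the shell with the command, so the PID printed by $$ is the PID of the command itself. The sketch below shows this locally with subprocess rather than over SSH; launch_with_pid is a hypothetical helper and sleep stands in for badblocks.

```python
import os
import signal
import subprocess


def launch_with_pid(command: str):
    """Start command via sh; return (process, PID parsed from its first line)."""
    # $$ expands to the shell's PID; exec then replaces the shell with the
    # command, so the command keeps that same PID.
    proc = subprocess.Popen(
        ["sh", "-c", f"echo PID:$$; exec {command}"],
        stdout=subprocess.PIPE,
    )
    first_line = proc.stdout.readline().decode().strip()
    prefix, _, pid_text = first_line.partition(":")
    if prefix != "PID" or not pid_text.isdigit():
        proc.kill()
        raise RuntimeError(f"unexpected first line: {first_line!r}")
    return proc, int(pid_text)


proc, pid = launch_with_pid("sleep 30")  # sleep stands in for badblocks
os.kill(pid, signal.SIGKILL)             # local equivalent of kill -9 <remote_pid>
exit_code = proc.wait()
proc.stdout.close()
```

Without exec, the captured PID would belong to the wrapper shell, and killing it could leave the long-running child alive.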
Job states explained
| State | When it's set |
|---|---|
| queued | Submitted, waiting for a max_parallel_burnins slot |
| running | Actively executing some stage |
| passed | All stages finished green |
| failed | A stage failed deterministically (bad blocks > threshold, SMART failure, etc.) |
| cancelled | Operator clicked ✕ |
| unknown | Job was alive but its outcome is indeterminate (see below) |
unknown fires in two situations:
- The stuck-job detector (stuck_job_hours, default 7 days) trips because the job has been running too long without finishing.
- The asyncio task got cancelled mid-stage by something other than an operator click, usually a container restart (docker compose up -d, --build, or the host rebooting). Burn-in source code goes through the Dockerfile COPY, so any source-code deploy recreates the container, drops the SSH connection to TrueNAS, and orphans the running burn-in. Avoid --build while burn-ins are active.
When unknown fires the drawer's per-stage Reason block shows
"Task cancelled mid-run — likely container restart or shutdown" so the
classification is explicit, not silent.
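One way to tell an operator cancel from any other cancellation is to record operator intent before cancelling the task and check it in the CancelledError handler. The following is a sketch under that assumption; job_states, operator_cancelled, and run_job are illustrative names, not the project's.

```python
import asyncio

job_states: dict[int, str] = {}
operator_cancelled: set[int] = set()


async def run_job(job_id: int) -> None:
    job_states[job_id] = "running"
    try:
        await asyncio.sleep(3600)  # stand-in for a long-running stage
        job_states[job_id] = "passed"
    except asyncio.CancelledError:
        if job_id in operator_cancelled:
            job_states[job_id] = "cancelled"
        else:
            # Nobody asked for this cancellation: the outcome is indeterminate.
            job_states[job_id] = "unknown"
        raise


async def main() -> None:
    by_operator = asyncio.create_task(run_job(1))
    by_shutdown = asyncio.create_task(run_job(2))
    await asyncio.sleep(0)     # let both jobs start
    operator_cancelled.add(1)  # the operator clicked the cancel button
    by_operator.cancel()
    by_shutdown.cancel()       # e.g. the event loop shutting down
    await asyncio.gather(by_operator, by_shutdown, return_exceptions=True)


asyncio.run(main())
```

Job 1 ends up cancelled and job 2 ends up unknown, even though both tasks were cancelled the same way at the asyncio level.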
Drive drawer
Click any drive row to slide a detail drawer down from the top. Three tabs:
- Burn-In — per-stage breakdown of the latest job
- SMART — short/long test states + cached SMART attributes
- Events — last 50 audit events for the drive
Surface-validate visualization
For drives in a surface_validate stage (running or finished), the Burn-In
tab renders:
- Vital-signs strip: Start (with date) · Elapsed · ETA (duration remaining) · Finish (wall-clock estimate, browser-local timezone) · Temp (cool/warm/hot colour). Computed from data in the drawer payload; ETA and Finish are suppressed below 0.5% so you don't see a "Finish: Jun 22" stutter at the very start.
- Four pattern meters: 0xaa / 0x55 / 0xff / 0x00. Each meter is split into a left half (write phase, blue) and a right half (verify phase, green). Current pattern's label glows blue; completed patterns' labels go green. This translates badblocks's per-phase percent into monotonic 0-99% overall progress, so the bar never appears to "rewind" when a new phase starts.
- Phase caption: explicit text, "Pattern 2 of 4 · Verify 0x55 · 47% within phase". Makes the visual grammar unambiguous.
- Completed-pattern history: once pattern 1 finishes, a chip appears showing 0xaa: 14h 22m. Lets you predict the rest of the run from the first pattern's elapsed time.
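The per-phase to overall translation can be sketched as follows. It assumes all eight phases (four patterns, write then verify) carry equal weight and that 100% is reserved for a finished stage; both are assumptions, as are the function names.

```python
PATTERNS = ["0xaa", "0x55", "0xff", "0x00"]
PHASES = ["Write", "Verify"]


def overall_percent(pattern_index: int, phase_index: int, phase_percent: float) -> float:
    """Map (pattern, phase, percent within phase) to one monotonic 0-99 value."""
    total_phases = len(PATTERNS) * len(PHASES)
    completed = pattern_index * len(PHASES) + phase_index
    overall = (completed + phase_percent / 100) / total_phases * 100
    return min(overall, 99.0)  # assumption: 100 is only shown once the stage is done


def caption(pattern_index: int, phase_index: int, phase_percent: float) -> str:
    """Build the explicit phase caption shown under the meters."""
    return (f"Pattern {pattern_index + 1} of {len(PATTERNS)} · "
            f"{PHASES[phase_index]} {PATTERNS[pattern_index]} · "
            f"{phase_percent:.0f}% within phase")


progress = overall_percent(pattern_index=1, phase_index=1, phase_percent=47)
text = caption(1, 1, 47)
```

Because completed phases are added before the within-phase fraction, the value only grows as badblocks moves from one phase to the next.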
Failure reason block
Stages that ended failed / cancelled / unknown show a coloured Reason
pill at the top of the stage section. Sources, in order of preference:
- The stage's own error_text
- The parent job's error_text (backfilled by the drawer when the stage's own is empty; catches orphan rows from hard crashes)
- A heuristic: if the log is tiny and no real progress was recorded, "Stopped without recording an error — likely cause: SSH connection drop or container restart while this stage was running"
Otherwise: "No error message recorded." — there's never a blank where you expect to see why something broke.
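The preference order above amounts to a simple fallback chain. In this sketch the function name, the tiny_log_chars cutoff, and the shortened heuristic message are assumptions; the README does not say what counts as a tiny log.

```python
HEURISTIC_REASON = "Stopped without recording an error"  # shortened for the sketch
NO_REASON = "No error message recorded."


def failure_reason(stage_error, job_error, log_text, progress_percent,
                   tiny_log_chars=200):
    """Pick the Reason text: stage error, then job error, then heuristic."""
    if stage_error:
        return stage_error
    if job_error:
        return job_error
    log_is_tiny = len(log_text or "") < tiny_log_chars  # cutoff is hypothetical
    if log_is_tiny and not progress_percent:
        return HEURISTIC_REASON
    return NO_REASON


reason = failure_reason(stage_error=None, job_error="SSH session lost",
                        log_text="", progress_percent=0)
```

Every path returns a non-empty string, which is what guarantees the Reason pill is never blank.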
Column sorting
Click any column header (Drive, Serial, Size, Temp, Health, Short SMART,
Long SMART, Burn-In) to sort. Cycle: ascending → descending → cleared. Sort
state persists in localStorage so it survives page reload AND every
SSE-driven tbody refresh (~12 s poll cycle). Empty values always sink to
the bottom regardless of direction.
Sortable values are emitted as data-sort-* attributes on each <tr>,
with numeric priority maps for SMART states (e.g. running always sorts
ahead of idle).
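The sort rules can be sketched in Python, although the dashboard itself presumably applies them in the browser. next_direction and sort_rows are hypothetical names, and rows are plain dicts here rather than table rows.

```python
CYCLE = ["asc", "desc", None]  # ascending, descending, cleared


def next_direction(current):
    """Advance the three-state sort cycle."""
    return CYCLE[(CYCLE.index(current) + 1) % len(CYCLE)]


def sort_rows(rows, key, direction):
    """Sort rows by key; empty values sink to the bottom in either direction."""
    if direction is None:
        return list(rows)  # cleared: keep the original order
    present = [r for r in rows if r.get(key) not in (None, "")]
    empty = [r for r in rows if r.get(key) in (None, "")]
    present.sort(key=lambda r: r[key], reverse=(direction == "desc"))
    return present + empty


rows = [
    {"drive": "sda", "temp": 41},
    {"drive": "sdb", "temp": None},  # no temperature reported
    {"drive": "sdc", "temp": 35},
]
ascending = [r["drive"] for r in sort_rows(rows, "temp", "asc")]
descending = [r["drive"] for r in sort_rows(rows, "temp", "desc")]
```

Splitting present and empty values before sorting is what keeps sdb last whichever way the column is sorted.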
Drive locks
To prevent destroying live data, the dashboard refuses to start destructive burn-in on drives ZFS or the kernel reports as in-use. Detected lock states (with the typed-confirmation token required to override):
| State | Detected via | Confirm token |
|---|---|---|
| Active pool | zpool list -vHP | the pool name (e.g. tank) |
| Boot pool | pool name = boot-pool | DESTROY BOOT POOL |
| Exported ZFS | lsblk zfs_member partitions not in any active pool | DESTROY EXPORTED POOL |
| Mounted FS | findmnt -no SOURCE | DESTROY MOUNTED FILESYSTEM |
Detection runs every poll cycle (~12 s). On any SSH or parser failure the poller fails closed: previously-locked drives stay locked, previously-unlocked drives stay unlocked, until detection recovers.
Unlock is in-memory only with a 10-minute TTL — bound to the
(pool_name, pool_role) observed at unlock time. If a subsequent poll
reclassifies the drive (e.g. (exported) → tank because someone
imported the pool), the grant is invalidated automatically.
Every unlock writes an audit event and surfaces in the next daily report in a red banner.
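A grant that expires after ten minutes and is tied to what was observed at unlock time could look like the sketch below. UnlockGrant and its fields are illustrative, and the pool_role strings are made-up values; only the (pool_name, pool_role) binding and the TTL come from the text above.

```python
from dataclasses import dataclass

UNLOCK_TTL_SECONDS = 10 * 60


@dataclass(frozen=True)
class UnlockGrant:
    pool_name: str
    pool_role: str
    granted_at: float  # seconds, e.g. from time.monotonic()

    def is_valid(self, pool_name: str, pool_role: str, now: float) -> bool:
        """Valid only while fresh and while the drive is classified the same way."""
        not_expired = (now - self.granted_at) < UNLOCK_TTL_SECONDS
        same_classification = (self.pool_name, self.pool_role) == (pool_name, pool_role)
        return not_expired and same_classification


grant = UnlockGrant(pool_name="(exported)", pool_role="exported", granted_at=0.0)
still_ok = grant.is_valid("(exported)", "exported", now=300.0)
after_import = grant.is_valid("tank", "active", now=300.0)
after_ttl = grant.is_valid("(exported)", "exported", now=601.0)
```

Comparing the stored pair against the latest poll result means a reclassified drive loses its grant without any explicit revocation step.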
Settings highlights
All settings live under /settings (header link). Key knobs:
- max_parallel_burnins (default 4): semaphore cap. Restart the container for changes to take effect.
- surface_validate_block_size / _block_buffer / _passes: badblocks -b / -c / -p. Defaults preserve original behaviour; tune for speed vs paranoia.
- stuck_job_hours (default 168 = 7 days): covers 14 TB+ HDDs; lower it for faster detection on small, fast drives.
- temp_warn_c / temp_crit_c: thermal gating thresholds.
- bad_block_threshold (default 0): number of bad blocks surface_validate tolerates before failing the stage.
- retention_log_days (default 35): when to NULL out burnin_stages.log_text. Nightly job at 03:00 local.
- retention_backup_keep (default 14): how many nightly DB snapshots to keep in /data/backups/.
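To make the mapping from the surface_validate settings to badblocks flags concrete, here is a hedged sketch of building the argument list. badblocks_argv is a hypothetical helper; the 64 and 0 defaults are badblocks' own documented defaults for -c and -p, not necessarily the dashboard's, and the real command may carry further flags.

```python
import shlex


def badblocks_argv(device: str, block_size: int = 4096,
                   block_buffer: int = 64, passes: int = 0) -> list[str]:
    """Build a destructive write-mode badblocks command line (sketch only)."""
    return [
        "badblocks",
        "-w",                     # write-mode test: destroys existing data
        "-b", str(block_size),    # surface_validate_block_size
        "-c", str(block_buffer),  # surface_validate_block_buffer
        "-p", str(passes),        # surface_validate_passes
        device,
    ]


argv = badblocks_argv("/dev/sdX", block_size=8192)
command = shlex.join(argv)
```

Keeping the command as a list until the last moment avoids quoting mistakes when it is eventually sent over SSH.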
Notifications
- Daily SMTP report at smtp_report_hour (default 08:00 local) with drive-level summary, failed-health banner, and a red banner listing every pool-drive unlock from the last 24 h.
- Per-job email alerts on pass/fail (configurable).
- Webhook URL posts JSON on every job state change.
Configure SMTP in Settings → Email. Includes a "Test SMTP" button.
Operations
Logs
docker logs -f nas-burnin
# JSON-structured. Filter with jq:
docker logs nas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
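If jq is not available, the same filter is easy to approximate in Python. This sketch assumes the ts, level, and msg fields used in the jq expression above; format_log_lines is a hypothetical helper and the sample lines are made up.

```python
import json


def format_log_lines(lines):
    """Keep JSON object lines and render them as 'ts level msg'; skip the rest."""
    out = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # like fromjson?, silently drop non-JSON lines
        if not isinstance(record, dict):
            continue
        out.append(f"{record.get('ts')} {record.get('level')} {record.get('msg')}")
    return out


sample = [
    '{"ts": "2026-05-01T03:00:00", "level": "INFO", "msg": "backup done"}',
    "plain text emitted before logging was configured",
    '{"ts": "2026-05-01T03:00:12", "level": "WARNING", "msg": "poll slow"}',
]
formatted = format_log_lines(sample)
```

Feeding it sys.stdin instead of the sample list turns it into a pipe-friendly filter.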
User management
docker exec -it nas-burnin python -m app.auth_cli list
docker exec -it nas-burnin python -m app.auth_cli add <username>
docker exec -it nas-burnin python -m app.auth_cli reset <username>
Passwords are read from a TTY prompt and are never accepted on the command line.
Backups
Automated nightly to /data/backups/app-YYYY-MM-DD.db (online
sqlite3.backup, doesn't lock writers). To restore:
docker compose down
cp data/backups/app-2026-05-01.db data/app.db
docker compose up -d
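For reference, an online snapshot with Python's sqlite3 module can be as small as the sketch below. The snapshot helper, the temporary paths, and the jobs table are illustrative; only the use of the backup API is taken from the text above.

```python
import os
import sqlite3
import tempfile


def snapshot(db_path: str, backup_path: str) -> None:
    """Copy a live SQLite database using the online backup API."""
    source = sqlite3.connect(db_path)
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)  # sqlite3.Connection.backup, Python 3.7+
    finally:
        target.close()
        source.close()


workdir = tempfile.mkdtemp()
live_db = os.path.join(workdir, "app.db")
backup_db = os.path.join(workdir, "app-2026-05-01.db")

# Create a tiny stand-in database to snapshot.
conn = sqlite3.connect(live_db)
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO jobs (state) VALUES ('passed')")
conn.commit()
conn.close()

snapshot(live_db, backup_db)

check = sqlite3.connect(backup_db)
restored = check.execute("SELECT state FROM jobs").fetchall()
check.close()
```

The snapshot is an ordinary SQLite file, which is why restoring is just a cp over data/app.db while the container is stopped.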
Health probe
/health is unauthenticated and returns 200 only when DB, poller, and
SSH (when configured) all check green; 503 otherwise. Use it for
container/orchestrator health checks.
curl -sf http://localhost:8084/health | jq
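The 200/503 rule can be expressed as a small aggregation. In this sketch health_status is a hypothetical function and None marks a check that is not configured, mirroring the "SSH (when configured)" wording above.

```python
def health_status(checks: dict) -> int:
    """Return 200 when every configured check is green, else 503."""
    # None means "not configured", so that check is left out of the verdict.
    active = {name: ok for name, ok in checks.items() if ok is not None}
    return 200 if all(active.values()) else 503


all_green = health_status({"db": True, "poller": True, "ssh": True})
ssh_unconfigured = health_status({"db": True, "poller": True, "ssh": None})
ssh_down = health_status({"db": True, "poller": True, "ssh": False})
```

A single failing configured check is enough to flip the probe to 503, which suits orchestrators that only look at the status code.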
Resetting the DB
If you need to start over:
docker compose down
sudo rm -f data/app.db data/session_secret
# keep data/settings_overrides.json if you want to preserve UI settings
docker compose up -d
Updating dependencies
requirements.in is the human-edited list. requirements.txt is a
fully-pinned lockfile generated from it (with sha256 hashes), consumed
at build time with pip install --require-hashes. Never edit
requirements.txt by hand.
# 1. Add or change a constraint in requirements.in
$EDITOR requirements.in
# 2. Regenerate the lockfile (runs pip-compile in a clean container)
./scripts/regenerate-lockfile.sh
# 3. Review the diff — transitive bumps may be CVE fixes or breaking changes
git diff requirements.txt
# 4. Rebuild + smoke-test
docker compose up -d --build app
curl -sf http://localhost:8084/health | jq
# 5. Commit BOTH files together
git add requirements.in requirements.txt
git commit -m "deps: bump <package> for <reason>"
This + the daily security scan (scripts/security-scan.sh) gives
defense-in-depth: pinning prevents accidental breakage from upstream
releases (Starlette 1.0 broke us once), --require-hashes defends
against compromised mirrors, and pip-audit catches new CVEs in any
pinned version after the fact.
See also
- CLAUDE.md: full architecture, file map, deploy workflow, and the rationale behind every non-obvious design decision.
- SPEC.md: canonical feature reference per version.
- tests/: python -m unittest discover tests/ (65 tests, stdlib-only). Or run inside the deployed container with scripts/run-tests.sh.
Known gaps / not-yet-built
- No multi-user RBAC — every user is effectively admin.
- No per-drive SMART attribute trend graphs (snapshots only).
- No scheduled burn-ins — jobs run immediately when queued.
- No CSRF tokens on state-changing endpoints (relies on SameSite=Strict session cookie).
PRs welcome.