User asked for one meter per badblocks pattern. The drawer now shows
4 meters (one per pattern: 0xaa / 0x55 / 0xff / 0x00), each split
into write (left, blue) + verify (right, green) halves so a glance
shows both which pattern is current AND whether you're writing or
verifying within it.
Backend:
- New columns burnin_stages.bb_phase (1-8) + bb_phase_pct (0-100)
via idempotent ALTER TABLE migration
- _update_stage_bb_phase() helper called from the badblocks parser
on every tick (when phase or percent changes)
- /api/v1/drives/{id}/drawer SELECT now returns the new fields
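An idempotent migration of this kind can be sketched as below. Only the table and column names come from this commit; the helper name `ensure_bb_phase_columns`, the `INTEGER` column types, and the `PRAGMA table_info` check are assumptions made for illustration.

```python
import sqlite3


def ensure_bb_phase_columns(conn: sqlite3.Connection) -> list:
    """Add bb_phase / bb_phase_pct if they are missing; return the names added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) rows.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(burnin_stages)")}
    added = []
    # Column types are assumed; the commit only states the value ranges.
    for name, decl in (("bb_phase", "INTEGER"), ("bb_phase_pct", "INTEGER")):
        if name not in existing:
            conn.execute(f"ALTER TABLE burnin_stages ADD COLUMN {name} {decl}")
            added.append(name)
    conn.commit()
    return added
```

Running it a second time adds nothing, which is what makes it safe to call on every startup.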
Frontend (app.js + app.css):
- _drawerRenderBadblocksMeters(phase, phasePct) computes per-pattern
fill state and emits 4-meter HTML with W/V sub-labels
- Conditional render: only shows when stage_name === 'surface_validate'
AND bb_phase is set, so historical pre-1.0.0-44 stage rows render
unchanged (single percent, no meters)
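The per-pattern fill computation can be illustrated with a small Python sketch (the real helper lives in app.js). It assumes phases 1-8 alternate write then verify for each pattern in the order listed above, which this commit does not state explicitly; `meter_fills` and `_half_fill` are hypothetical names.

```python
PATTERNS = ("0xaa", "0x55", "0xff", "0x00")


def _half_fill(half_phase: int, current_phase: int, pct: float) -> float:
    """Fill for one write or verify half: done, in progress, or not started."""
    if half_phase < current_phase:
        return 100.0
    if half_phase == current_phase:
        return pct
    return 0.0


def meter_fills(phase: int, phase_pct: float) -> list:
    """Return write/verify fill percentages for each of the four patterns."""
    if not 1 <= phase <= 8:
        raise ValueError("bb_phase must be in 1..8")
    pct = max(0.0, min(100.0, float(phase_pct)))
    fills = []
    for i, pattern in enumerate(PATTERNS):
        # Assumed layout: phase 2i+1 writes pattern i, phase 2i+2 verifies it.
        fills.append({
            "pattern": pattern,
            "write": _half_fill(2 * i + 1, phase, pct),
            "verify": _half_fill(2 * i + 2, phase, pct),
        })
    return fills
```

Under that assumption, phase 3 at 40% means the 0xaa meter is full on both halves, the 0x55 write half is at 40%, and everything after it is empty.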
3 new tests cover the migration columns, single-tick persistence,
and overwrite-on-second-tick. Total suite: 75 tests.
Image rebuilt and tagged but NOT deployed — 4 burn-ins are running
right now and a recreate would SIGHUP them. Deploy with
`docker compose up -d` after the current batch finishes; the
migration runs at init and the meters light up for the next batch.
NAS Burn-In Dashboard
Web dashboard for running disciplined burn-in tests on TrueNAS drives.
Sits next to the NAS, not on it — orchestrates smartctl, badblocks, and
nvme-cli over SSH and tracks every job in SQLite.
Inspired by the community disk-burnin.sh script (Spearfoot et al.) but
adds: concurrent burn-ins, pool-membership safety locks, login + audit,
live progress UI, daily email reports, and resumable state.
Stack
FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No external services beyond your TrueNAS host. Templates and static assets are bind-mounted; Python source is baked into the image.
Quick start
# 1. Configure
cp .env.example .env
# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.
# 2. Build + run
docker compose up -d --build
# 3. Open the dashboard
open http://localhost:8084 # or your host's IP
# 4. First time: the login page renders a "Create initial admin" form.
# Pick a username + password (>= 8 chars). Done.
If you set INITIAL_ADMIN_* env vars and the users table is empty, that
account is created on startup automatically. After that the env vars are
ignored — change passwords from the UI ("Change password" header link) or
the CLI (docker exec -it nas-burnin python -m app.auth_cli reset <username>).
Burning in many drives at once
The dashboard runs up to max_parallel_burnins burn-ins concurrently
(configurable in Settings, default 4) and queues the rest. Submitting 14
drives doesn't take 14 separate clicks — you submit once and the queue
drains automatically as slots free up.
The workflow
- Select all idle drives — click the checkbox in the table header (next to "DRIVE"). It auto-checks every drive that's currently selectable: idle, no active SMART test, not pool-locked. Pool-locked drives are intentionally excluded; if you really want to burn one of them in, unlock it individually first (see Drive locks below).
- Click the Burn-In button in the batch action bar that slides up from the bottom — it shows the count of selected drives.
- In the batch modal: pick the stages to run (Short SMART, Long SMART, Surface Validate — drag to reorder), confirm your operator name, and click Start.
- All selected drives are queued in one POST. Up to max_parallel_burnins enter running; the rest sit queued. As each running job finishes, the next queued job picks up the freed slot automatically, with no operator action between batches.
- The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."
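The queue-and-drain behaviour described in the workflow can be sketched with an asyncio semaphore. This is a minimal illustration, not the dashboard's orchestrator: `run_batch` and `burn_in` are hypothetical names and the sleep stands in for the real work.

```python
import asyncio


async def run_batch(drive_count: int, max_parallel: int = 4) -> int:
    """Run drive_count fake burn-ins; return the peak number running at once."""
    slots = asyncio.Semaphore(max_parallel)
    running = 0
    peak = 0

    async def burn_in(drive: int) -> None:
        nonlocal running, peak
        async with slots:  # a job waits here ("queued") until a slot frees up
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for the real burn-in
            running -= 1

    # All jobs are submitted at once; the semaphore decides who runs.
    await asyncio.gather(*(burn_in(d) for d in range(drive_count)))
    return peak
```

Submitting 14 jobs with a cap of 4 never has more than 4 running, and each finishing job releases its slot to the next waiter without any further action.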
Time estimate
| Drive size | Profile | Per-drive runtime (default block size) |
|---|---|---|
| 250 GB SSD | Short + Long SMART + Surface | ~1 hour |
| 14 TB HDD | Short + Long SMART + Surface | ~24 hours |
| 14 TB HDD | Short + Long SMART (no surface) | ~6–8 hours |
For 12× 14 TB drives at default 4-parallel: roughly 3–4 days end-to-end.
Bumping surface_validate_block_size from 4096 to 8192 in Settings cuts
runtime roughly in half at ~2× RAM cost — matches the upstream
disk-burnin.sh recommendation.
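The end-to-end figure follows from simple wave arithmetic. The sketch below assumes every drive takes the same time and that jobs effectively run in waves of max_parallel_burnins; `batch_runtime_hours` is a hypothetical helper, and real runs will vary with drive speed and thermal gating.

```python
import math


def batch_runtime_hours(drives: int, per_drive_hours: float, max_parallel: int = 4) -> float:
    """Estimate wall-clock hours for a batch, assuming equal per-drive runtimes."""
    waves = math.ceil(drives / max_parallel)  # how many rounds the queue needs
    return waves * per_drive_hours
```

For 12 drives at about 24 hours each and 4 in parallel this gives 3 waves, or 72 hours, which is the low end of the 3–4 day estimate above.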
Watch out
- Stuck-job timeout: stuck_job_hours (default 168 = 7 days) marks any job past that threshold as unknown and kills the remote process. The default covers -w surface_validate on 14 TB+ HDDs with margin. If you're running small SSDs and want faster detection of genuinely stuck jobs, lower it. (Earlier versions defaulted to 24 h, which produced false positives on multi-TB drives.)
- Thermal gate: if drives currently under burn-in hit the temperature warning threshold, new jobs wait up to 3 minutes before acquiring a slot. Increase temp_warn_c if your chassis runs hot but is otherwise fine.
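The stuck-job check amounts to comparing a job's age against the threshold. A minimal sketch, assuming the job's start time is available as a datetime; `is_stuck` is a hypothetical name, not the dashboard's function.

```python
from datetime import datetime, timedelta


def is_stuck(started_at: datetime, now: datetime, stuck_job_hours: int = 168) -> bool:
    """True when the job has been running longer than the configured threshold."""
    return now - started_at > timedelta(hours=stuck_job_hours)
```

A surface validate that has been running for 30 hours is fine under the 168-hour default but would have been flagged under the old 24-hour default, which is the false positive mentioned above.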
Cancelling
Click the red ✕ next to a running job. The orchestrator:
- Marks the job cancelled in the DB.
- Issues kill -9 <remote_pid> over a fresh SSH session (the badblocks PID is captured at launch via sh -c 'echo PID:$$; exec ...').
- Cancels the asyncio task, releasing the semaphore slot for the next queued job.
Cancellations are durable — restart the container and queued jobs resume, cancelled jobs stay cancelled.
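The PID-capture trick works because `exec` replaces the shell with the target command, so the PID the shell printed is the PID of the long-running process. The sketch below demonstrates this locally with `subprocess` rather than over SSH; `launch_with_pid` is a hypothetical helper, and the command string is interpolated unescaped, so it only suits trusted input.

```python
import os
import signal
import subprocess


def launch_with_pid(command: str):
    """Start command under sh and return (process, PID reported on the first line)."""
    proc = subprocess.Popen(
        ["sh", "-c", f"echo PID:$$; exec {command}"],
        stdout=subprocess.PIPE,
        text=True,
    )
    first_line = proc.stdout.readline().strip()  # expected form: "PID:<digits>"
    prefix, _, pid_text = first_line.partition(":")
    if prefix != "PID" or not pid_text.isdigit():
        proc.kill()
        raise RuntimeError(f"unexpected first line: {first_line!r}")
    return proc, int(pid_text)
```

With the PID in hand, a later `os.kill(pid, signal.SIGKILL)` (or `kill -9` over a separate session, as the dashboard does) terminates the process without needing the original channel.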
Drive locks
To prevent destroying live data, the dashboard refuses to start destructive burn-in on drives ZFS or the kernel reports as in-use. Detected lock states (with the typed-confirmation token required to override):
| State | Detected via | Confirm token |
|---|---|---|
| Active pool | zpool list -vHP | the pool name (e.g. tank) |
| Boot pool | pool name = boot-pool | DESTROY BOOT POOL |
| Exported ZFS | lsblk zfs_member partitions not in any active pool | DESTROY EXPORTED POOL |
| Mounted FS | findmnt -no SOURCE | DESTROY MOUNTED FILESYSTEM |
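The typed-confirmation rule in the table can be expressed as a small lookup. The state keys (`active_pool`, `boot_pool`, and so on) and the function names are assumptions for illustration; only the tokens themselves come from the table.

```python
from typing import Optional

# Fixed tokens from the table; active pools use the pool's own name instead.
FIXED_TOKENS = {
    "boot_pool": "DESTROY BOOT POOL",
    "exported_zfs": "DESTROY EXPORTED POOL",
    "mounted_fs": "DESTROY MOUNTED FILESYSTEM",
}


def required_token(state: str, pool_name: Optional[str] = None) -> str:
    """Return the exact string the operator must type to unlock a drive."""
    if state == "active_pool":
        if not pool_name:
            raise ValueError("active pools need a pool name")
        return pool_name
    return FIXED_TOKENS[state]


def confirm(state: str, typed: str, pool_name: Optional[str] = None) -> bool:
    """Exact, case-sensitive match; anything else is rejected."""
    return typed == required_token(state, pool_name)
```

Requiring an exact match means a typo such as a lowercase token is treated as a refusal rather than a near miss.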
Detection runs every poll cycle (~12 s). On any SSH or parser failure the poller fails closed: previously-locked drives stay locked, previously-unlocked drives stay unlocked, until detection recovers.
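The failure behaviour boils down to keeping the last known lock map whenever detection raises. A minimal sketch, assuming lock state is a dict keyed by drive and that `detect` is a callable that may raise; both are illustrative choices, not the poller's actual interface.

```python
from typing import Callable, Dict


def refresh_locks(previous: Dict[str, bool], detect: Callable[[], Dict[str, bool]]) -> Dict[str, bool]:
    """Return fresh lock state, or a copy of the previous state if detection fails."""
    try:
        return detect()
    except Exception:
        # SSH or parser failure: nothing changes until detection recovers.
        return dict(previous)
```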
Unlock is in-memory only with a 10-minute TTL — bound to the
(pool_name, pool_role) observed at unlock time. If a subsequent poll
reclassifies the drive (e.g. (exported) → tank because someone
imported the pool), the grant is invalidated automatically.
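An unlock grant of this shape can be modelled as an immutable record that is only honoured while both the TTL and the observed pool identity still hold. `UnlockGrant`, its field layout, and the role strings in the usage note are hypothetical.

```python
from dataclasses import dataclass

UNLOCK_TTL_SECONDS = 600  # the 10-minute TTL described above


@dataclass(frozen=True)
class UnlockGrant:
    pool_name: str
    pool_role: str
    granted_at: float  # seconds, e.g. from time.monotonic()

    def is_valid(self, pool_name: str, pool_role: str, now: float) -> bool:
        """Valid only if the drive is classified as at unlock time and the TTL has not passed."""
        same_identity = (pool_name, pool_role) == (self.pool_name, self.pool_role)
        return same_identity and (now - self.granted_at) < UNLOCK_TTL_SECONDS
```

A grant issued for ("(exported)", "exported") stops being valid the moment a poll reports the drive as ("tank", "active"), even if only a few seconds have passed.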
Every unlock writes an audit event and surfaces in the next daily report in a red banner.
Settings highlights
All settings live under /settings (header link). Key knobs:
- max_parallel_burnins (default 4): semaphore cap. Restart the container for changes to take effect.
- surface_validate_block_size / _block_buffer / _passes: badblocks -b / -c / -p. Defaults preserve original behaviour; tune for speed vs paranoia.
- stuck_job_hours (default 168 = 7 days): covers 14 TB+ HDDs; drop for faster detection on small fast drives.
- temp_warn_c / temp_crit_c: thermal gating thresholds.
- bad_block_threshold (default 0): number of bad blocks surface_validate tolerates before failing the stage.
- retention_log_days (default 35): when to NULL out burnin_stages.log_text. Nightly job at 03:00 local.
- retention_backup_keep (default 14): how many nightly DB snapshots to keep in /data/backups/.
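To make the mapping from the surface_validate settings to badblocks flags concrete, here is a sketch that builds the argument list. The function name is hypothetical, and the defaults shown for block_buffer (64) and passes (0) are badblocks' own defaults, assumed here rather than taken from the dashboard; the -s and -v flags are likewise an assumption about how progress is reported.

```python
def badblocks_argv(device: str, block_size: int = 4096,
                   block_buffer: int = 64, passes: int = 0) -> list:
    """Build a destructive write-mode badblocks command line for one device."""
    return [
        "badblocks",
        "-w",                      # destructive write-mode test
        "-s", "-v",                # show progress, verbose (assumed)
        "-b", str(block_size),     # surface_validate_block_size
        "-c", str(block_buffer),   # surface_validate_block_buffer
        "-p", str(passes),         # surface_validate_passes
        device,
    ]
```

Passing the arguments as a list rather than a formatted string avoids quoting problems if the command is later handed to a process launcher.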
Notifications
- Daily SMTP report at smtp_report_hour (default 08:00 local) with drive-level summary, failed-health banner, and a red banner listing every pool-drive unlock from the last 24 h.
- Per-job email alerts on pass/fail (configurable).
- Webhook URL posts JSON on every job state change.
Configure SMTP in Settings → Email. Includes a "Test SMTP" button.
Operations
Logs
docker logs -f nas-burnin
# JSON-structured. Filter with jq:
docker logs nas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
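If jq is not available, the same filter is easy to approximate in Python. This sketch assumes each log line is a JSON object with the ts, level, and msg fields used in the jq filter above, and silently skips anything that does not parse; `format_log_lines` is a hypothetical name.

```python
import json


def format_log_lines(lines) -> list:
    """Turn JSON log lines into 'ts level msg' strings, skipping non-JSON lines."""
    formatted = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stray traceback or startup banner
        if not isinstance(record, dict):
            continue
        formatted.append(f"{record.get('ts')} {record.get('level')} {record.get('msg')}")
    return formatted
```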
User management
docker exec -it nas-burnin python -m app.auth_cli list
docker exec -it nas-burnin python -m app.auth_cli add <username>
docker exec -it nas-burnin python -m app.auth_cli reset <username>
Passwords are read from a TTY prompt and are never accepted on the command line.
Backups
Automated nightly to /data/backups/app-YYYY-MM-DD.db (online
sqlite3.backup, doesn't lock writers). To restore:
docker compose down
cp data/backups/app-2026-05-01.db data/app.db
docker compose up -d
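For a one-off manual snapshot outside the nightly job, Python's standard `sqlite3` module exposes the same online backup API. A minimal sketch; the `snapshot` helper and the paths you pass it are illustrative, not part of the dashboard.

```python
import sqlite3


def snapshot(src_path: str, dest_path: str) -> None:
    """Copy a live SQLite database to dest_path using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)  # copies all pages into the destination database
    finally:
        dest.close()
        src.close()
```

Unlike a plain file copy, this produces a consistent database even if the source is being written to during the copy.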
Health probe
/health is unauthenticated and returns 200 only when DB, poller, and
SSH (when configured) all check green; 503 otherwise. Use it for
container/orchestrator health checks.
curl -sf http://localhost:8084/health | jq
Resetting the DB
If you need to start over:
docker compose down
sudo rm -f data/app.db data/session_secret
# keep data/settings_overrides.json if you want to preserve UI settings
docker compose up -d
Updating dependencies
requirements.in is the human-edited list. requirements.txt is a
fully-pinned lockfile generated from it (with sha256 hashes), consumed
at build time with pip install --require-hashes. Never edit
requirements.txt by hand.
# 1. Add or change a constraint in requirements.in
$EDITOR requirements.in
# 2. Regenerate the lockfile (runs pip-compile in a clean container)
./scripts/regenerate-lockfile.sh
# 3. Review the diff — transitive bumps may be CVE fixes or breaking changes
git diff requirements.txt
# 4. Rebuild + smoke-test
docker compose up -d --build app
curl -sf http://localhost:8084/health | jq
# 5. Commit BOTH files together
git add requirements.in requirements.txt
git commit -m "deps: bump <package> for <reason>"
This + the daily security scan (scripts/security-scan.sh) gives
defense-in-depth: pinning prevents accidental breakage from upstream
releases (Starlette 1.0 broke us once), --require-hashes defends
against compromised mirrors, and pip-audit catches new CVEs in any
pinned version after the fact.
See also
- CLAUDE.md: full architecture, file map, deploy workflow, and the rationale behind every non-obvious design decision.
- SPEC.md: canonical feature reference per version.
- tests/: python -m unittest discover tests/ (65 tests, stdlib-only). Or run inside the deployed container with scripts/run-tests.sh.
Known gaps / not-yet-built
- No multi-user RBAC — every user is effectively admin.
- No per-drive SMART attribute trend graphs (snapshots only).
- No scheduled burn-ins — jobs run immediately when queued.
- No CSRF tokens on state-changing endpoints (relies on the SameSite=Strict session cookie).
PRs welcome.