Burns in drives for servers: runs SMART short, SMART long, and surface-pass tests to verify your drives before RAID deployment. Your drive(s) will be ZEROED out.

NAS Burn-In Dashboard

Web dashboard for running disciplined burn-in tests on TrueNAS drives. Sits next to the NAS, not on it — orchestrates smartctl, badblocks, and nvme-cli over SSH and tracks every job in SQLite.

Inspired by the community disk-burnin.sh script (Spearfoot et al.) but adds: concurrent burn-ins, pool-membership safety locks, login + audit, live progress UI, daily email reports, and resumable state.

Stack

FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No external services beyond your TrueNAS host. Templates and static assets are bind-mounted; Python source is baked into the image.


Quick start

# 1. Configure
cp .env.example .env
# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.

# 2. Build + run
docker compose up -d --build

# 3. Open the dashboard
open http://localhost:8084   # or your host's IP

# 4. First time: the login page renders a "Create initial admin" form.
#    Pick a username + password (>= 8 chars). Done.

If you set INITIAL_ADMIN_* env vars and the users table is empty, that account is created on startup automatically. After that the env vars are ignored — change passwords from the UI ("Change password" header link) or the CLI (docker exec -it nas-burnin python -m app.auth_cli reset <username>).


Burning in many drives at once

The dashboard runs up to max_parallel_burnins burn-ins concurrently (configurable in Settings, default 4) and queues the rest. Submitting 14 drives doesn't take 14 separate clicks — you submit once and the queue drains automatically as slots free up.

The workflow

  1. Select all idle drives — click the checkbox in the table header (next to "DRIVE"). It auto-checks every drive that's currently selectable: idle, no active SMART test, not pool-locked. Pool-locked drives are intentionally excluded; if you really want to burn one of them in, unlock it individually first (see Drive locks below).
  2. Click the Burn-In button in the batch action bar that slides up from the bottom — it shows the count of selected drives.
  3. In the batch modal: pick the stages to run (Short SMART, Long SMART, Surface Validate — drag to reorder), confirm your operator name, and click Start.
  4. All selected drives are queued in one POST. Up to max_parallel_burnins enter running; the rest sit queued. As each running job finishes, the next queued job picks up the freed slot automatically — no operator action between batches.
  5. The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."
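
The queueing behaviour in the steps above can be sketched with an asyncio.Semaphore. This is a minimal illustration, not the dashboard's actual orchestrator: run_batch, worker, and fake_burn_in are hypothetical names, and the short sleep stands in for the real SSH-driven work.

```python
import asyncio


async def run_batch(drives, max_parallel, burn_in):
    """Start every job at once; at most max_parallel hold a slot concurrently."""
    slots = asyncio.Semaphore(max_parallel)
    running = 0
    peak = 0

    async def worker(drive):
        nonlocal running, peak
        async with slots:                 # a job is "queued" while it waits here
            running += 1
            peak = max(peak, running)
            try:
                return await burn_in(drive)
            finally:
                running -= 1              # slot frees up for the next queued job

    results = await asyncio.gather(*(worker(d) for d in drives))
    return results, peak


async def fake_burn_in(drive):
    # Hypothetical stand-in for hours of remote smartctl/badblocks work.
    await asyncio.sleep(0.01)
    return f"{drive}: passed"


if __name__ == "__main__":
    drives = [f"sd{c}" for c in "abcdefgh"]
    results, peak = asyncio.run(run_batch(drives, 4, fake_burn_in))
    print(f"peak concurrency: {peak}, finished: {len(results)}")
```

All jobs are created in one call, which mirrors the single POST; the semaphore alone decides which ones are running and which are still waiting.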

Time estimate

| Drive size | Profile | Per-drive runtime (default block size) |
|---|---|---|
| 250 GB SSD | Short + Long SMART + Surface | ~1 hour |
| 14 TB HDD | Short + Long SMART + Surface | ~68 hours |
| 14 TB HDD | Short + Long SMART (no surface) | ~24 hours |

For 12× 14 TB drives at the default of 4 in parallel: three batches of ~68 hours each, so roughly 8.5 days end-to-end. Bumping surface_validate_block_size from 4096 to 8192 in Settings cuts runtime roughly in half at ~2× RAM cost, which matches the upstream disk-burnin.sh recommendation.
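
The end-to-end figure is simple batch arithmetic. The helper below is a rough sketch with a hypothetical name (estimate_wall_clock_days); it assumes every drive takes the same time, so the queue drains in whole batches, and the 68-hour input is an illustrative per-drive runtime for a full profile on a 14 TB HDD.

```python
import math


def estimate_wall_clock_days(n_drives, max_parallel, per_drive_hours):
    """Approximate end-to-end days when all drives take the same time."""
    batches = math.ceil(n_drives / max_parallel)   # waves needed to drain the queue
    return batches * per_drive_hours / 24


if __name__ == "__main__":
    # 12 drives, 4 at a time, ~68 hours each: 3 batches
    print(estimate_wall_clock_days(12, 4, 68))
```

In practice drives finish at slightly different times and the thermal gate can delay slot acquisition, so treat the result as a lower bound.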

Watch out

  • Stuck-job timeout: stuck_job_hours (default 168 = 7 days) marks any job past that threshold as unknown and kills the remote process. The default covers -w surface_validate on 14 TB+ HDDs with margin. If you're running short SSDs and want faster detection of genuinely stuck jobs, drop it. (Earlier versions defaulted to 24h, which false-positived on multi-TB drives.)
  • Thermal gate — if drives currently under burn-in hit the temperature warning threshold, new jobs wait up to 3 minutes before acquiring a slot. Increase temp_warn_c if your chassis runs hot but is otherwise fine.

Cancelling

Click the red ✕ next to a running job. The orchestrator:

  1. Marks the job cancelled in the DB.
  2. Issues kill -9 <remote_pid> over a fresh SSH session (the badblocks PID is captured at launch via sh -c 'echo PID:$$; exec ...').
  3. Cancels the asyncio task, releasing the semaphore slot for the next queued job.

Cancellations are durable — restart the container and queued jobs resume, cancelled jobs stay cancelled.
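
The PID-capture trick from step 2 can be sketched as three small helpers. The function names are hypothetical, and the code only builds and parses strings; it does not open an SSH session. The key idea is that $$ prints the shell's PID and exec then replaces the shell with the real command, so the command inherits that PID.

```python
import shlex


def wrap_with_pid(command):
    """Wrap a remote command so its first output line is 'PID:<n>'."""
    # exec replaces the sh process, so the launched command keeps the printed PID.
    return "sh -c " + shlex.quote(f"echo PID:$$; exec {command}")


def parse_pid(first_line):
    """Return the PID from a 'PID:<n>' line, or None if the line is something else."""
    line = first_line.strip()
    prefix = "PID:"
    if line.startswith(prefix) and line[len(prefix):].isdigit():
        return int(line[len(prefix):])
    return None


def kill_command(pid):
    """Command to run over a fresh SSH session when the operator cancels."""
    return f"kill -9 {pid}"


if __name__ == "__main__":
    print(wrap_with_pid("sleep 60"))      # harmless placeholder command
    print(parse_pid("PID:4242\n"))
    print(kill_command(4242))
```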

Job states explained

| State | When it's set |
|---|---|
| queued | Submitted, waiting for a max_parallel_burnins slot |
| running | Actively executing some stage |
| passed | All stages finished green |
| failed | A stage failed deterministically (bad blocks > threshold, SMART failure, etc.) |
| cancelled | Operator clicked ✕ |
| unknown | Job was alive but its outcome is indeterminate (see below) |

unknown fires in two situations:

  1. The stuck-job detector (stuck_job_hours, default 7 days) trips because the job has been running too long without finishing.
  2. The asyncio task got cancelled mid-stage by something other than an operator click — usually a container restart (docker compose up -d, --build, or the host rebooting). Burn-in source code goes through the Dockerfile COPY, so any source-code deploy recreates the container, drops the SSH connection to TrueNAS, and would orphan the running burn-in. Avoid --build while burn-ins are active.

When unknown fires the drawer's per-stage Reason block shows "Task cancelled mid-run — likely container restart or shutdown" so the classification is explicit, not silent.
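
The first trigger, the stuck-job check, amounts to a single comparison. A minimal sketch, assuming the job's start time is stored as a timezone-aware datetime; is_stuck is a hypothetical name, not the dashboard's own function.

```python
from datetime import datetime, timedelta, timezone


def is_stuck(started_at, now, stuck_job_hours=168):
    """True once a job has been running longer than the configured threshold."""
    return now - started_at > timedelta(hours=stuck_job_hours)


if __name__ == "__main__":
    started = datetime(2026, 5, 1, 8, 0, tzinfo=timezone.utc)
    print(is_stuck(started, started + timedelta(hours=68)))    # well inside the default
    print(is_stuck(started, started + timedelta(hours=169)))   # past 7 days
```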


Drive drawer

Click any drive row to slide a detail drawer down from the top. Three tabs:

  • Burn-In — per-stage breakdown of the latest job
  • SMART — short/long test states + cached SMART attributes
  • Events — last 50 audit events for the drive

Surface-validate visualization

For drives in a surface_validate stage (running or finished), the Burn-In tab renders:

  1. Vital-signs strip: Start (with date) · Elapsed · ETA (duration remaining) · Finish (wall-clock estimate, browser-local timezone) · Temp (cool/warm/hot colour). Computed from data in the drawer payload; ETA + Finish suppressed below 0.5% so you don't see a "Finish: Jun 22" stutter at the very start.
  2. Four pattern meters: 0xaa / 0x55 / 0xff / 0x00. Each meter is split into a left half (write phase, blue) and a right half (verify phase, green). Current pattern's label glows blue; completed patterns' labels go green. This translates badblocks's per-phase percent into monotonic 0-99% overall progress, so the bar never appears to "rewind" when a new phase starts.
  3. Phase caption — explicit text: "Pattern 2 of 4 · Verify 0x55 · 47% within phase". Makes the visual grammar unambiguous.
  4. Completed-pattern history — once pattern 1 finishes, a chip appears showing 0xaa: 14h 22m. Lets you predict the rest of the run from the first pattern's elapsed time.
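
The translation from per-phase percent to a monotonic overall figure can be sketched as follows. This assumes eight equally weighted phases (four patterns, each with a write and a verify pass) and caps the result at 99; the function name and the equal weighting are assumptions for illustration, not the dashboard's actual code.

```python
PATTERNS = ("0xaa", "0x55", "0xff", "0x00")
PHASES = ("write", "verify")


def overall_percent(pattern_index, phase, phase_percent):
    """Map (pattern, phase, percent within phase) to one monotonic 0-99 value."""
    phase_index = pattern_index * len(PHASES) + PHASES.index(phase)
    total_phases = len(PATTERNS) * len(PHASES)
    clamped = min(max(phase_percent, 0.0), 100.0)
    overall = (phase_index + clamped / 100.0) / total_phases * 100.0
    return min(overall, 99.0)   # assumed cap: 100 is left for stage completion


if __name__ == "__main__":
    # "Pattern 2 of 4 · Verify 0x55 · 47% within phase"
    print(overall_percent(1, "verify", 47.0))
```

Because the end of one phase (100%) maps to the same value as the start of the next (0%), the bar never moves backwards when badblocks resets its own counter.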

Failure reason block

Stages that ended failed / cancelled / unknown show a coloured Reason pill at the top of the stage section. Sources, in order of preference:

  1. The stage's own error_text
  2. The parent job's error_text (backfilled by the drawer when the stage's own is empty — catches orphan rows from hard crashes)
  3. A heuristic: if the log is tiny and no real progress was recorded, "Stopped without recording an error — likely cause: SSH connection drop or container restart while this stage was running"

Otherwise: "No error message recorded." — there's never a blank where you expect to see why something broke.
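
The preference order reads naturally as a fallback chain. In this sketch the function name, the TINY_LOG_BYTES cutoff, and the recorded_progress flag are hypothetical; the source does not say how "tiny" is measured, and the heuristic message is abbreviated.

```python
TINY_LOG_BYTES = 200   # hypothetical cutoff for "the log is tiny"

HEURISTIC_REASON = "Stopped without recording an error"   # abbreviated message
DEFAULT_REASON = "No error message recorded."


def failure_reason(stage_error, job_error, log_text, recorded_progress):
    """Pick the text for the Reason pill, in order of preference."""
    if stage_error:                        # 1. the stage's own error_text
        return stage_error
    if job_error:                          # 2. the parent job's error_text
        return job_error
    if len(log_text or "") < TINY_LOG_BYTES and not recorded_progress:
        return HEURISTIC_REASON            # 3. heuristic for silent stops
    return DEFAULT_REASON                  # never a blank


if __name__ == "__main__":
    print(failure_reason(None, "SSH session closed", "", False))
    print(failure_reason(None, None, "", False))
```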

Column sorting

Click any column header (Drive, Serial, Size, Temp, Health, Short SMART, Long SMART, Burn-In) to sort. Cycle: ascending → descending → cleared. Sort state persists in localStorage so it survives page reload AND every SSE-driven tbody refresh (~12 s poll cycle). Empty values always sink to the bottom regardless of direction.

Sortable values are emitted as data-sort-* attributes on each <tr>, with numeric priority maps for SMART states (e.g. running always sorts ahead of idle).
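
The "empty values always sink" rule is easiest to get right by sorting only the rows that have a value and appending the rest. The dashboard does this in the browser; the sketch below shows the same idea in Python with hypothetical names (sort_rows, and rows as plain dicts).

```python
def sort_rows(rows, key, descending=False):
    """Sort rows by key; rows with a missing value go last in either direction."""
    present = [r for r in rows if r.get(key) is not None]
    empty = [r for r in rows if r.get(key) is None]
    present.sort(key=lambda r: r[key], reverse=descending)
    return present + empty


if __name__ == "__main__":
    rows = [
        {"drive": "sda", "temp": 40},
        {"drive": "sdb", "temp": None},   # no temperature reported
        {"drive": "sdc", "temp": 35},
    ]
    print([r["drive"] for r in sort_rows(rows, "temp")])
    print([r["drive"] for r in sort_rows(rows, "temp", descending=True)])
```

Reversing a single combined list would float the empty rows to the top on descending sorts, which is why they are kept in a separate list.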


Drive locks

To prevent destroying live data, the dashboard refuses to start destructive burn-in on drives ZFS or the kernel reports as in-use. Detected lock states (with the typed-confirmation token required to override):

| State | Detected via | Confirm token |
|---|---|---|
| Active pool | zpool list -vHP | the pool name (e.g. tank) |
| Boot pool | pool name = boot-pool | DESTROY BOOT POOL |
| Exported ZFS | lsblk zfs_member partitions not in any active pool | DESTROY EXPORTED POOL |
| Mounted FS | findmnt -no SOURCE | DESTROY MOUNTED FILESYSTEM |

Detection runs every poll cycle (~12 s). On any SSH or parser failure the poller fails closed: previously-locked drives stay locked, previously-unlocked drives stay unlocked, until detection recovers.
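
Failing closed here means "keep the last known state when detection breaks". A minimal sketch, assuming lock state is a dict keyed by drive and that detection is a callable which raises on SSH or parser errors; poll_locks and both parameter names are hypothetical.

```python
def poll_locks(previous, detect):
    """Return fresh lock state, or a copy of the previous state if detection fails."""
    try:
        return detect()
    except Exception:
        # Detection is unreliable this cycle: do not lock or unlock anything new.
        return dict(previous)


if __name__ == "__main__":
    def broken_detect():
        raise ConnectionError("SSH channel closed")

    last_known = {"sda": "locked", "sdb": "unlocked"}
    print(poll_locks(last_known, broken_detect))
```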

Unlock is in-memory only with a 10-minute TTL, and is bound to the (pool_name, pool_role) observed at unlock time. If a subsequent poll reclassifies the drive (e.g. from (exported) to tank because someone imported the pool), the grant is invalidated automatically.

Every unlock writes an audit event and surfaces in the next daily report in a red banner.
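
An unlock grant with these properties can be modelled as a small immutable record. The class name, field names, and the role strings used in the example are assumptions; only the 10-minute TTL and the (pool_name, pool_role) binding come from the description above.

```python
from dataclasses import dataclass

UNLOCK_TTL_SECONDS = 10 * 60


@dataclass(frozen=True)
class UnlockGrant:
    pool_name: str
    pool_role: str
    granted_at: float   # seconds, e.g. from time.monotonic()

    def is_valid(self, pool_name, pool_role, now):
        """Valid only while fresh and while the drive's classification is unchanged."""
        fresh = now - self.granted_at < UNLOCK_TTL_SECONDS
        unchanged = (pool_name, pool_role) == (self.pool_name, self.pool_role)
        return fresh and unchanged


if __name__ == "__main__":
    grant = UnlockGrant("tank", "exported", granted_at=1000.0)
    print(grant.is_valid("tank", "exported", now=1300.0))   # 5 minutes later
    print(grant.is_valid("tank", "active", now=1300.0))     # pool was imported
    print(grant.is_valid("tank", "exported", now=1600.0))   # TTL elapsed
```

Keeping grants in memory only means a container restart drops every outstanding unlock, which errs on the side of protecting data.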


Settings highlights

All settings live under /settings (header link). Key knobs:

  • max_parallel_burnins (default 4) — semaphore cap. Restart container for changes to take effect.
  • surface_validate_block_size / _block_buffer / _passes — badblocks -b / -c / -p. Defaults preserve original behaviour; tune for speed vs paranoia.
  • stuck_job_hours (default 168 = 7 days) — covers 14 TB+ HDDs; drop for faster detection on small fast drives.
  • temp_warn_c / temp_crit_c — thermal gating thresholds.
  • bad_block_threshold (default 0) — number of bad blocks surface_validate tolerates before failing the stage.
  • retention_log_days (default 35) — when to NULL out burnin_stages.log_text. Nightly job at 03:00 local.
  • retention_backup_keep (default 14) — how many nightly DB snapshots to keep in /data/backups/.
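
For retention_backup_keep, picking which snapshots to prune is straightforward because ISO dates in the filename sort chronologically as plain strings. A sketch with a hypothetical helper name, assuming the app-YYYY-MM-DD.db naming used for nightly backups:

```python
def backups_to_delete(filenames, keep=14):
    """Return the snapshot names that fall outside the newest `keep`."""
    dated = sorted(
        f for f in filenames if f.startswith("app-") and f.endswith(".db")
    )
    if keep <= 0:
        return dated            # guard: dated[:-0] would be an empty slice
    return dated[:-keep]        # everything except the newest `keep`


if __name__ == "__main__":
    names = [f"app-2026-05-{day:02d}.db" for day in range(1, 17)]
    print(backups_to_delete(names, keep=14))
```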

Notifications

  • Daily SMTP report at smtp_report_hour (default 08:00 local) with drive-level summary, failed-health banner, and a red banner listing every pool-drive unlock from the last 24 h.
  • Per-job email alerts on pass/fail (configurable).
  • Webhook URL posts JSON on every job state change.

Configure SMTP in Settings → Email. Includes a "Test SMTP" button.


Operations

Logs

docker logs -f nas-burnin
# JSON-structured. Filter with jq:
docker logs nas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'

User management

docker exec -it nas-burnin python -m app.auth_cli list
docker exec -it nas-burnin python -m app.auth_cli add <username>
docker exec -it nas-burnin python -m app.auth_cli reset <username>

Passwords are read from a TTY prompt; they are never accepted on the command line.

Backups

Automated nightly to /data/backups/app-YYYY-MM-DD.db (online sqlite3.backup, doesn't lock writers). To restore:

docker compose down
cp data/backups/app-2026-05-01.db data/app.db
docker compose up -d
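
The online snapshot itself can be reproduced with the standard library's sqlite3.Connection.backup. This is a sketch of the technique, not the dashboard's nightly job; online_backup and the example paths are illustrative.

```python
import sqlite3


def online_backup(src_path, dest_path):
    """Copy a live SQLite database to dest_path using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)        # copies all pages of the main database
    finally:
        dest.close()
        src.close()


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        live = os.path.join(tmp, "app.db")
        snap = os.path.join(tmp, "app-2026-05-01.db")
        conn = sqlite3.connect(live)
        conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT)")
        conn.execute("INSERT INTO jobs (state) VALUES ('passed')")
        conn.commit()
        conn.close()
        online_backup(live, snap)
        print(os.path.exists(snap))
```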

Health probe

/health is unauthenticated and returns 200 only when DB, poller, and SSH (when configured) all check green; 503 otherwise. Use it for container/orchestrator health checks.

curl -sf http://localhost:8084/health | jq

Resetting the DB

If you need to start over:

docker compose down
sudo rm -f data/app.db data/session_secret
# keep data/settings_overrides.json if you want to preserve UI settings
docker compose up -d

Updating dependencies

requirements.in is the human-edited list. requirements.txt is a fully-pinned lockfile generated from it (with sha256 hashes), consumed at build time with pip install --require-hashes. Never edit requirements.txt by hand.

# 1. Add or change a constraint in requirements.in
$EDITOR requirements.in

# 2. Regenerate the lockfile (runs pip-compile in a clean container)
./scripts/regenerate-lockfile.sh

# 3. Review the diff — transitive bumps may be CVE fixes or breaking changes
git diff requirements.txt

# 4. Rebuild + smoke-test
docker compose up -d --build app
curl -sf http://localhost:8084/health | jq

# 5. Commit BOTH files together
git add requirements.in requirements.txt
git commit -m "deps: bump <package> for <reason>"

This + the daily security scan (scripts/security-scan.sh) gives defense-in-depth: pinning prevents accidental breakage from upstream releases (Starlette 1.0 broke us once), --require-hashes defends against compromised mirrors, and pip-audit catches new CVEs in any pinned version after the fact.

See also

  • CLAUDE.md — full architecture, file map, deploy workflow, and the rationale behind every non-obvious design decision.
  • SPEC.md — canonical feature reference per version.
  • tests/: python -m unittest discover tests/ (65 tests, stdlib-only). Or run inside the deployed container with scripts/run-tests.sh.

Known gaps / not-yet-built

  • No multi-user RBAC — every user is effectively admin.
  • No per-drive SMART attribute trend graphs (snapshots only).
  • No scheduled burn-ins — jobs run immediately when queued.
  • No CSRF tokens on state-changing endpoints (relies on SameSite=Strict session cookie).

PRs welcome.