brandon/nas-burnin


Burns in drives for servers. Runs SMART short, SMART long, and surface-pass tests to verify your drives before RAID deployment. Your drive(s) will be ZEROED out.


Brandon Walter 7e42464016 fix: missing nonlocal on _drain's tracker vars (1.0.0-59)	2026-05-13 10:31:35 -07:00

After the chunk-read refactor, the inner _drain coroutine assigns to last_db_write_ts and last_pct_sample. Without nonlocal, Python compiles these as locals of _drain, so any READ before the first assignment raises UnboundLocalError.

In 1.0.0-55 / -57 the bug was hidden by gather(return_exceptions=True), which silently swallowed the exception: the drain coroutine ended immediately, the asyncssh channel buffer filled up, and the remote badblocks blocked on pipe_write. THAT was the actual cause of the "parser silently never works" symptom, not anything to do with the chunk-read or tr-pipe logic itself.

1.0.0-57 dropped the gather (single drain after merging 2>&1), which made the next deploy surface the bug as an explicit error_text on the surface_validate stage: "cannot access local variable 'last_db_write_ts' where it is not associated with a value".

Fix: add both vars to the nonlocal declaration. pending_log_chunks only gets .append/.clear (no reassignment) so it doesn't need nonlocal.

This is the bug that's been hiding behind all the recent parser work. Sorry for the round trips.

.forgejo/workflows	feat: rate limiter + mypy + lifecycle tests + routes/ split (1.0.0-33/-34)	2026-05-03 09:29:53 -04:00
app	fix: missing nonlocal on _drain's tracker vars (1.0.0-59)	2026-05-13 10:31:35 -07:00
mock-truenas	Initial commit — TrueNAS Burn-In Dashboard v0.5.0	2026-02-24 00:08:29 -05:00
scripts	infra: rename truenas-burnin → nas-burnin (1.0.0-41)	2026-05-04 07:16:02 -07:00
tests	feat: per-pattern badblocks meters in drive drawer (1.0.0-44)	2026-05-08 22:34:35 -07:00
.env.example	Initial commit — TrueNAS Burn-In Dashboard v0.5.0	2026-02-24 00:08:29 -05:00
.gitignore	Initial commit — TrueNAS Burn-In Dashboard v0.5.0	2026-02-24 00:08:29 -05:00
CLAUDE.md	infra: rename truenas-burnin → nas-burnin (1.0.0-41)	2026-05-04 07:16:02 -07:00
docker-compose.yml	infra: rename truenas-burnin → nas-burnin (1.0.0-41)	2026-05-04 07:16:02 -07:00
Dockerfile	deps: pin transitive dependencies via lockfile (1.0.0-25)	2026-05-02 17:15:02 -04:00
README.md	docs: drawer surface_validate + sorting + job states	2026-05-09 15:34:12 -07:00
requirements.in	deps: pin transitive dependencies via lockfile (1.0.0-25)	2026-05-02 17:15:02 -04:00
requirements.txt	deps: pin transitive dependencies via lockfile (1.0.0-25)	2026-05-02 17:15:02 -04:00
SPEC.md	infra: rename truenas-burnin → nas-burnin (1.0.0-41)	2026-05-04 07:16:02 -07:00
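
The failure mode described in the latest commit message can be reproduced in a few lines. This is a minimal sketch, not the project's _drain: make_drains and both inner coroutines are hypothetical, and only the variable names are borrowed from the commit message. It shows three things: an assignment without nonlocal makes the name local (so the earlier read raises UnboundLocalError), gather(return_exceptions=True) hands that exception back as a value instead of raising it, and in-place list mutation needs no nonlocal.

```python
import asyncio


def make_drains():
    """Build two inner coroutines that share tracker state (hypothetical layout)."""
    last_db_write_ts = 0.0
    pending_log_chunks = []

    async def drain_buggy():
        pending_log_chunks.append("chunk")   # in-place mutation: no nonlocal needed
        if last_db_write_ts == 0.0:          # read happens before the assignment below
            last_db_write_ts = 1.0           # this assignment makes the name local
        return last_db_write_ts

    async def drain_fixed():
        nonlocal last_db_write_ts            # rebind the enclosing variable instead
        pending_log_chunks.append("chunk")
        if last_db_write_ts == 0.0:
            last_db_write_ts = 1.0
        return last_db_write_ts

    return drain_buggy, drain_fixed


async def demo():
    drain_buggy, drain_fixed = make_drains()
    # return_exceptions=True returns the exception as a result instead of raising it.
    (hidden,) = await asyncio.gather(drain_buggy(), return_exceptions=True)
    fixed_value = await drain_fixed()
    return hidden, fixed_value


hidden, fixed_value = asyncio.run(demo())
print(type(hidden).__name__, fixed_value)
```

If nothing inspects the value returned by gather, the failed coroutine simply looks like one that finished early, which matches the symptom the commit message describes.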

README.md

NAS Burn-In Dashboard

Web dashboard for running disciplined burn-in tests on TrueNAS drives. Sits next to the NAS, not on it — orchestrates smartctl, badblocks, and nvme-cli over SSH and tracks every job in SQLite.

Inspired by the community disk-burnin.sh script (Spearfoot et al.) but adds: concurrent burn-ins, pool-membership safety locks, login + audit, live progress UI, daily email reports, and resumable state.

Stack

FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No external services beyond your TrueNAS host. Templates and static assets are bind-mounted; Python source is baked into the image.

Quick start

# 1. Configure
cp .env.example .env
# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.

# 2. Build + run
docker compose up -d --build

# 3. Open the dashboard
open http://localhost:8084   # or your host's IP

# 4. First time: the login page renders a "Create initial admin" form.
#    Pick a username + password (>= 8 chars). Done.

If you set INITIAL_ADMIN_* env vars and the users table is empty, that account is created on startup automatically. After that the env vars are ignored — change passwords from the UI ("Change password" header link) or the CLI (docker exec -it nas-burnin python -m app.auth_cli reset <username>).

Burning in many drives at once

The dashboard runs up to max_parallel_burnins burn-ins concurrently (configurable in Settings, default 4) and queues the rest. Submitting 14 drives doesn't take 14 separate clicks — you submit once and the queue drains automatically as slots free up.

The workflow

Select all idle drives — click the checkbox in the table header (next to "DRIVE"). It auto-checks every drive that's currently selectable: idle, no active SMART test, not pool-locked. Pool-locked drives are intentionally excluded; if you really want to burn one of them in, unlock it individually first (see Drive locks below).
Click the Burn-In button in the batch action bar that slides up from the bottom — it shows the count of selected drives.
In the batch modal: pick the stages to run (Short SMART, Long SMART, Surface Validate — drag to reorder), confirm your operator name, and click Start.
All selected drives are queued in one POST. Up to max_parallel_burnins enter running; the rest sit queued. As each running job finishes, the next queued job picks up the freed slot automatically — no operator action between batches.
The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."
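
The slot behaviour in the steps above can be sketched with an asyncio.Semaphore (the Settings section describes max_parallel_burnins as a semaphore cap). This is an illustration under assumptions, not the orchestrator's code: run_batch and burn_in are hypothetical names, and the short sleep stands in for the real stages.

```python
import asyncio


async def run_batch(drives, max_parallel=4):
    """Queue every drive at once; at most max_parallel run at any moment."""
    slots = asyncio.Semaphore(max_parallel)
    running = 0
    peak = 0

    async def burn_in(drive):
        nonlocal running, peak
        async with slots:                 # waits here while all slots are taken
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)     # placeholder for the real stages
            running -= 1
        return drive                      # slot is released on leaving the block

    finished = await asyncio.gather(*(burn_in(d) for d in drives))
    return finished, peak


finished, peak = asyncio.run(run_batch([f"drive-{i}" for i in range(12)]))
print(len(finished), "finished; peak concurrency", peak)
```

Each waiting task picks up a slot as soon as a running one leaves the async with block, which is why no operator action is needed between batches.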

Time estimate

Drive size	Profile	Per-drive runtime (default block size)
250 GB SSD	Short + Long SMART + Surface	~1 hour
14 TB HDD	Short + Long SMART + Surface	~24 hours
14 TB HDD	Short + Long SMART (no surface)	~6–8 hours

For 12× 14 TB drives at default 4-parallel: roughly 3–4 days end-to-end. Bumping surface_validate_block_size from 4096 to 8192 in Settings cuts runtime roughly in half at ~2× RAM cost — matches the upstream disk-burnin.sh recommendation.
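
The end-to-end figure follows from simple arithmetic, sketched below. The model is an assumption: it treats the queue as draining in equal waves of max_parallel drives and ignores thermal-gate waits, so it gives a lower bound rather than a prediction. estimate_batch_days is a hypothetical helper.

```python
import math


def estimate_batch_days(n_drives, per_drive_hours, max_parallel=4):
    """Lower-bound wall-clock days, assuming drives run in equal-length waves."""
    waves = math.ceil(n_drives / max_parallel)
    return waves * per_drive_hours / 24


# 12 drives at ~24 h each, 4 in parallel: 3 waves of one day each.
print(estimate_batch_days(12, 24))
```

Real runs land somewhat above this bound because drives do not finish in lockstep and new jobs may wait on the thermal gate.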

Watch out

Stuck-job timeout: stuck_job_hours (default 168 = 7 days) marks any job past that threshold as unknown and kills the remote process. The default covers -w surface_validate on 14 TB+ HDDs with margin. If you're running small SSDs and want faster detection of genuinely stuck jobs, lower it. (Earlier versions defaulted to 24h, which produced false positives on multi-TB drives.)
Thermal gate: if drives currently under burn-in hit the temperature warning threshold, new jobs wait up to 3 minutes before acquiring a slot. Increase temp_warn_c if your chassis runs hot but is otherwise fine.
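
The stuck-job rule reduces to one comparison. The sketch below assumes the detector measures from the job's start time; is_stuck and its parameters are hypothetical names, not the project's API.

```python
from datetime import datetime, timedelta, timezone


def is_stuck(started_at, now, stuck_job_hours=168):
    """True once a job has been running longer than the configured threshold."""
    return now - started_at > timedelta(hours=stuck_job_hours)


start = datetime(2026, 5, 1, tzinfo=timezone.utc)
after_30h = start + timedelta(hours=30)
print(is_stuck(start, after_30h))                      # default 7-day threshold
print(is_stuck(start, after_30h, stuck_job_hours=24))  # the older 24 h default
```

A 30-hour surface pass on a large HDD is healthy under the 7-day default but would have been flagged under the earlier 24 h default.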

Cancelling

Click the red ✕ next to a running job. The orchestrator:

Marks the job cancelled in the DB.
Issues kill -9 <remote_pid> over a fresh SSH session (the badblocks PID is captured at launch via sh -c 'echo PID:$$; exec ...').
Cancels the asyncio task, releasing the semaphore slot for the next queued job.

Cancellations are durable — restart the container and queued jobs resume, cancelled jobs stay cancelled.
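
Because exec replaces the shell with the launched command, the PID echoed by $$ is the PID of that command, so the first output line is enough to target a later kill. A minimal sketch of reading that line follows; parse_remote_pid is a hypothetical helper, and the exact line format beyond the PID: prefix is an assumption.

```python
import re

# Matches the marker line printed by: sh -c 'echo PID:$$; exec ...'
PID_LINE = re.compile(r"^PID:(\d+)$")


def parse_remote_pid(first_line):
    """Return the PID from a 'PID:<n>' line, or None if the line is something else."""
    match = PID_LINE.match(first_line.strip())
    return int(match.group(1)) if match else None


print(parse_remote_pid("PID:4321\n"))
print(parse_remote_pid("Checking for bad blocks"))
```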

Job states explained

State	When it's set
`queued`	Submitted, waiting for a `max_parallel_burnins` slot
`running`	Actively executing some stage
`passed`	All stages finished green
`failed`	A stage failed deterministically (bad blocks > threshold, SMART failure, etc.)
`cancelled`	Operator clicked ✕
`unknown`	Job was alive but its outcome is indeterminate — see below

unknown fires in two situations:

The stuck-job detector (stuck_job_hours, default 7 days) trips because the job has been running too long without finishing.
The asyncio task got cancelled mid-stage by something other than an operator click, usually a container restart (docker compose up -d, --build, or the host rebooting). The app's Python source is copied into the image by the Dockerfile COPY step, so any source-code deploy recreates the container, drops the SSH connection to TrueNAS, and orphans the running burn-in. Avoid --build while burn-ins are active.

When unknown fires the drawer's per-stage Reason block shows "Task cancelled mid-run — likely container restart or shutdown" so the classification is explicit, not silent.
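
The cancelled versus unknown distinction can be sketched as a check made when the task receives asyncio.CancelledError. This is an assumption about the mechanism, not the project's code: JOB_STATE, OPERATOR_CANCELLED, and run_job are hypothetical, and the real application persists state in SQLite rather than a dict.

```python
import asyncio

JOB_STATE = {}               # job_id -> state (stand-in for the database)
OPERATOR_CANCELLED = set()   # job ids an operator explicitly cancelled


async def run_job(job_id):
    JOB_STATE[job_id] = "running"
    try:
        await asyncio.sleep(3600)          # placeholder for the real stages
        JOB_STATE[job_id] = "passed"
    except asyncio.CancelledError:
        # An operator click is recorded before the task is cancelled;
        # any other cancellation (e.g. shutdown) has no such record.
        if job_id in OPERATOR_CANCELLED:
            JOB_STATE[job_id] = "cancelled"
        else:
            JOB_STATE[job_id] = "unknown"
        raise                              # always propagate cancellation


async def demo():
    by_operator = asyncio.create_task(run_job(1))
    by_shutdown = asyncio.create_task(run_job(2))
    await asyncio.sleep(0)                 # let both jobs start
    OPERATOR_CANCELLED.add(1)
    by_operator.cancel()
    by_shutdown.cancel()
    await asyncio.gather(by_operator, by_shutdown, return_exceptions=True)
    return dict(JOB_STATE)


states = asyncio.run(demo())
print(states)
```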

Drive drawer

Click any drive row to slide a detail drawer down from the top. Three tabs:

Burn-In — per-stage breakdown of the latest job
SMART — short/long test states + cached SMART attributes
Events — last 50 audit events for the drive

Surface-validate visualization

For drives in a surface_validate stage (running or finished), the Burn-In tab renders:

Vital-signs strip — Start (with date) · Elapsed · ETA (duration remaining) · Finish (wall-clock estimate, browser-local timezone) · Temp (cool/warm/hot colour). Computed from data in the drawer payload; ETA + Finish suppressed below 0.5% so you don't see a "Finish: Jun 22" stutter at the very start.
Four pattern meters — 0xaa / 0x55 / 0xff / 0x00. Each meter is split into a left half (write phase, blue) and a right half (verify phase, green). Current pattern's label glows blue; completed patterns' labels go green. This translates badblocks's per-phase percent into monotonic 0-99% overall progress, so the bar never appears to "rewind" when a new phase starts.
Phase caption — explicit text: "Pattern 2 of 4 · Verify 0x55 · 47% within phase". Makes the visual grammar unambiguous.
Completed-pattern history — once pattern 1 finishes, a chip appears showing 0xaa: 14h 22m. Lets you predict the rest of the run from the first pattern's elapsed time.
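
One way to turn per-phase percent into a bar that never rewinds is to treat the run as eight equal phases (four patterns, each with a write and a verify phase) and cap the result at 99. The equal weighting is an assumption made for this sketch, and overall_progress is a hypothetical function; the dashboard's actual formula may differ.

```python
PATTERNS = ("0xaa", "0x55", "0xff", "0x00")


def overall_progress(pattern_index, verifying, phase_pct):
    """Map (pattern, phase, percent within phase) to a monotonic 0-99 value."""
    phases_done = pattern_index * 2 + (1 if verifying else 0)
    total_phases = len(PATTERNS) * 2
    overall = (phases_done + phase_pct / 100) / total_phases * 100
    return min(99, int(overall))


# "Pattern 2 of 4 · Verify 0x55 · 47% within phase"
print(overall_progress(pattern_index=1, verifying=True, phase_pct=47))
```

Each new phase starts where the previous one ended, so a per-phase counter dropping back to 0% only ever moves the overall value forward.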

Failure reason block

Stages that ended failed / cancelled / unknown show a coloured Reason pill at the top of the stage section. Sources, in order of preference:

The stage's own error_text
The parent job's error_text (backfilled by the drawer when the stage's own is empty — catches orphan rows from hard crashes)
A heuristic: if the log is tiny and no real progress was recorded, "Stopped without recording an error — likely cause: SSH connection drop or container restart while this stage was running"

Otherwise: "No error message recorded." — there's never a blank where you expect to see why something broke.
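
The order of preference above can be sketched as a short fallback chain. pick_reason is a hypothetical function, and the thresholds for "tiny log" and "no real progress" are assumptions chosen for illustration; the heuristic message is shortened here.

```python
def pick_reason(stage_error, job_error, log_text, progress_pct, tiny_log_chars=200):
    """Return (source, message) for the Reason pill, most specific source first."""
    if stage_error:
        return "stage", stage_error
    if job_error:
        return "job", job_error
    # Assumed heuristic: almost no log output and no recorded progress.
    if len(log_text or "") < tiny_log_chars and not progress_pct:
        return "heuristic", "Stopped without recording an error"
    return "none", "No error message recorded."


print(pick_reason(None, "SSH session closed", "", 0))
print(pick_reason(None, None, "", 0))
print(pick_reason(None, None, "x" * 5000, 37.5))
```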

Column sorting

Click any column header (Drive, Serial, Size, Temp, Health, Short SMART, Long SMART, Burn-In) to sort. Cycle: ascending → descending → cleared. Sort state persists in localStorage so it survives page reload AND every SSE-driven tbody refresh (~12 s poll cycle). Empty values always sink to the bottom regardless of direction.

Sortable values are emitted as data-sort-* attributes on each <tr>, with numeric priority maps for SMART states (e.g. running always sorts ahead of idle).
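
The "empty values always sink" rule is easiest to get right by sorting only the rows that have a value and appending the rest afterwards. The dashboard does this client-side; the sketch below restates the idea in Python with a hypothetical sort_rows helper and made-up row data.

```python
def sort_rows(rows, key, descending=False):
    """Sort rows by key; rows with no value stay at the bottom in both directions."""
    present = [r for r in rows if r.get(key) not in (None, "")]
    empty = [r for r in rows if r.get(key) in (None, "")]
    present.sort(key=lambda r: r[key], reverse=descending)
    return present + empty


rows = [{"drive": "sda", "temp": 35}, {"drive": "sdb", "temp": None}, {"drive": "sdc", "temp": 41}]
print([r["drive"] for r in sort_rows(rows, "temp")])
print([r["drive"] for r in sort_rows(rows, "temp", descending=True)])
```

Reversing the whole sorted list instead would move the empty rows to the top in descending order, which is the behaviour the rule avoids.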

Drive locks

To prevent destroying live data, the dashboard refuses to start destructive burn-in on drives ZFS or the kernel reports as in-use. Detected lock states (with the typed-confirmation token required to override):

State	Detected via	Confirm token
Active pool	`zpool list -vHP`	the pool name (e.g. `tank`)
Boot pool	pool name = `boot-pool`	`DESTROY BOOT POOL`
Exported ZFS	`lsblk` `zfs_member` partitions not in any active pool	`DESTROY EXPORTED POOL`
Mounted FS	`findmnt -no SOURCE`	`DESTROY MOUNTED FILESYSTEM`

Detection runs every poll cycle (~12 s). On any SSH or parser failure the poller fails closed: previously-locked drives stay locked, previously-unlocked drives stay unlocked, until detection recovers.

Unlock is in-memory only with a 10-minute TTL — bound to the (pool_name, pool_role) observed at unlock time. If a subsequent poll reclassifies the drive (e.g. (exported) → tank because someone imported the pool), the grant is invalidated automatically.

Every unlock writes an audit event and surfaces in the next daily report in a red banner.
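
The unlock grant described above (in-memory, 10-minute TTL, bound to the pool classification seen at unlock time) can be sketched as a small class. UnlockGrants and its methods are hypothetical names, keying grants by drive serial is an assumption, and the injectable clock exists only to make the example testable.

```python
import time


class UnlockGrants:
    """In-memory unlock grants that expire and are tied to (pool_name, pool_role)."""

    def __init__(self, ttl_seconds=600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._grants = {}   # serial -> (pool_name, pool_role, expires_at)

    def grant(self, serial, pool_name, pool_role):
        self._grants[serial] = (pool_name, pool_role, self._clock() + self._ttl)

    def is_unlocked(self, serial, pool_name, pool_role):
        entry = self._grants.get(serial)
        if entry is None:
            return False
        name, role, expires_at = entry
        expired = self._clock() >= expires_at
        reclassified = (name, role) != (pool_name, pool_role)
        if expired or reclassified:
            del self._grants[serial]   # invalidate; a fresh unlock is required
            return False
        return True


now = [0.0]
grants = UnlockGrants(clock=lambda: now[0])
grants.grant("SERIAL1", "(exported)", "exported")
print(grants.is_unlocked("SERIAL1", "(exported)", "exported"))
print(grants.is_unlocked("SERIAL1", "tank", "active"))
```

Keeping grants out of the database means a container restart also clears every outstanding unlock.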

Settings highlights

All settings live under /settings (header link). Key knobs:

max_parallel_burnins (default 4) — semaphore cap. Restart container for changes to take effect.
surface_validate_block_size / _block_buffer / _passes — badblocks -b / -c / -p. Defaults preserve original behaviour; tune for speed vs paranoia.
stuck_job_hours (default 168 = 7 days) — covers 14 TB+ HDDs; drop for faster detection on small fast drives.
temp_warn_c / temp_crit_c — thermal gating thresholds.
bad_block_threshold (default 0) — number of bad blocks surface_validate tolerates before failing the stage.
retention_log_days (default 35) — when to NULL out burnin_stages.log_text. Nightly job at 03:00 local.
retention_backup_keep (default 14) — how many nightly DB snapshots to keep in /data/backups/.

Notifications

Daily SMTP report at smtp_report_hour (default 08:00 local) with drive-level summary, failed-health banner, and a red banner listing every pool-drive unlock from the last 24 h.
Per-job email alerts on pass/fail (configurable).
Webhook URL posts JSON on every job state change.

Configure SMTP in Settings → Email. Includes a "Test SMTP" button.

Operations

Logs

docker logs -f nas-burnin
# JSON-structured. Filter with jq:
docker logs nas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
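
The same filtering can be done in Python when jq is not available. This sketch assumes only what the jq filter above implies: each log line is a JSON object with ts, level, and msg fields, and non-JSON lines should be skipped. format_log_lines is a hypothetical helper.

```python
import json


def format_log_lines(lines):
    """Render JSON log lines as 'ts level msg', skipping anything that is not JSON."""
    rendered = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                      # e.g. a stray traceback line
        if not isinstance(record, dict):
            continue
        rendered.append(f"{record.get('ts')} {record.get('level')} {record.get('msg')}")
    return rendered


sample = [
    '{"ts": "2026-05-13T10:31:35", "level": "INFO", "msg": "poll ok"}',
    "not json at all",
]
print(format_log_lines(sample))
```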

User management

docker exec -it nas-burnin python -m app.auth_cli list
docker exec -it nas-burnin python -m app.auth_cli add <username>
docker exec -it nas-burnin python -m app.auth_cli reset <username>

Passwords are read from a TTY prompt and are never accepted on the command line.

Backups

Automated nightly to /data/backups/app-YYYY-MM-DD.db (online sqlite3.backup, doesn't lock writers). To restore:

docker compose down
cp data/backups/app-2026-05-01.db data/app.db
docker compose up -d
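
The nightly snapshot mentioned at the top of this section relies on SQLite's online backup API, which Python exposes as Connection.backup. A minimal sketch, assuming plain file paths; snapshot_db is a hypothetical helper and not the project's scheduler code.

```python
import sqlite3


def snapshot_db(src_path, dst_path):
    """Copy a live SQLite database to dst_path using the online backup API."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)    # copies the database pages into the destination
    finally:
        dst.close()
        src.close()
```

A call such as snapshot_db("data/app.db", "data/backups/app-2026-05-01.db") would produce a file that can be restored with the cp command above.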

Health probe

/health is unauthenticated and returns 200 only when DB, poller, and SSH (when configured) all check green; 503 otherwise. Use it for container/orchestrator health checks.

curl -sf http://localhost:8084/health | jq
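
The status-code rule can be sketched as a pure function: every configured check must be green for a 200, anything else yields 503. health_status and the check names are hypothetical, and the real response body may carry different fields.

```python
def health_status(db_ok, poller_ok, ssh_ok=None):
    """Return (http_status, checks). ssh_ok=None means SSH is not configured."""
    checks = {"db": db_ok, "poller": poller_ok}
    if ssh_ok is not None:
        checks["ssh"] = ssh_ok            # only counted when SSH is configured
    status = 200 if all(checks.values()) else 503
    return status, checks


print(health_status(True, True))
print(health_status(True, True, ssh_ok=False))
```

Since curl -f exits non-zero on a 503, the command above doubles as a pass/fail check in scripts.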

Resetting the DB

If you need to start over:

docker compose down
sudo rm -f data/app.db data/session_secret
# keep data/settings_overrides.json if you want to preserve UI settings
docker compose up -d

Updating dependencies

requirements.in is the human-edited list. requirements.txt is a fully-pinned lockfile generated from it (with sha256 hashes), consumed at build time with pip install --require-hashes. Never edit requirements.txt by hand.

# 1. Add or change a constraint in requirements.in
$EDITOR requirements.in

# 2. Regenerate the lockfile (runs pip-compile in a clean container)
./scripts/regenerate-lockfile.sh

# 3. Review the diff — transitive bumps may be CVE fixes or breaking changes
git diff requirements.txt

# 4. Rebuild + smoke-test
docker compose up -d --build app
curl -sf http://localhost:8084/health | jq

# 5. Commit BOTH files together
git add requirements.in requirements.txt
git commit -m "deps: bump <package> for <reason>"

Together with the daily security scan (scripts/security-scan.sh), this gives defense in depth: pinning prevents accidental breakage from upstream releases (Starlette 1.0 broke us once), --require-hashes defends against compromised mirrors, and pip-audit catches new CVEs in any pinned version after the fact.
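
Conceptually, hash checking compares the SHA-256 digest of each downloaded archive with the digest pinned in the lockfile and refuses to install on a mismatch. The sketch below illustrates that comparison only; it is not pip's implementation, and matches_pinned_hash is a hypothetical helper.

```python
import hashlib


def matches_pinned_hash(archive_bytes, pinned_sha256):
    """True if the archive's SHA-256 digest equals the pinned hex digest."""
    return hashlib.sha256(archive_bytes).hexdigest() == pinned_sha256.lower()


original = b"example wheel contents"
pinned = hashlib.sha256(original).hexdigest()   # what the lockfile would record
print(matches_pinned_hash(original, pinned))
print(matches_pinned_hash(b"tampered contents", pinned))
```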

Known gaps / not-yet-built

No multi-user RBAC — every user is effectively admin.
No per-drive SMART attribute trend graphs (snapshots only).
No scheduled burn-ins — jobs run immediately when queued.
No CSRF tokens on state-changing endpoints (relies on SameSite=Strict session cookie).

PRs welcome.
