brandon/nas-burnin


Burns in drives for servers. Runs SMART short, SMART long, and surface-pass tests to verify your drives before RAID deployment. Your drive(s) will be ZEROED out.


Brandon Walter 383258df97 feat: phase caption + bad-block badge + per-pattern history (1.0.0-47)	2026-05-08 23:23:02 -07:00

Three additions to the surface_validate drawer block:

1. Phase caption below the meters: "Pattern 2 of 4 · Verify 0x55 · 47% within phase". Pure JS, no schema change. Makes the visual grammar explicit without needing the operator to mentally map phase=4 to "verifying pattern 2".
2. Bad-block badge in the vitals row. Green at 0, red at >0. The number was already on the stage row, but burying it in the log felt wrong; surfacing it next to temp/speed/ETA keeps it in eye-line during long runs.
3. Per-pattern duration history below the caption. New bb_phase_history JSON column (idempotent migration) maps {phase_num: ts}. Parser stamps the timestamp on every phase transition (and on stage entry for phase 1). Drawer diffs consecutive write-phase starts to derive "0xaa: 14h 22m" for completed patterns. Once one pattern is done you can predict the rest without leaving the drawer.

Persistence is idempotent: re-entry of the same phase keeps the original timestamp so a transient parser reset doesn't blow away history. JSON parse failures fail gracefully (no row rendered).

.forgejo/workflows	feat: rate limiter + mypy + lifecycle tests + routes/ split (1.0.0-33/-34)	2026-05-03 09:29:53 -04:00
app	feat: phase caption + bad-block badge + per-pattern history (1.0.0-47)	2026-05-08 23:23:02 -07:00
mock-truenas	Initial commit — TrueNAS Burn-In Dashboard v0.5.0	2026-02-24 00:08:29 -05:00
scripts	infra: rename truenas-burnin → nas-burnin (1.0.0-41)	2026-05-04 07:16:02 -07:00
tests	feat: per-pattern badblocks meters in drive drawer (1.0.0-44)	2026-05-08 22:34:35 -07:00
.env.example	Initial commit — TrueNAS Burn-In Dashboard v0.5.0	2026-02-24 00:08:29 -05:00
.gitignore	Initial commit — TrueNAS Burn-In Dashboard v0.5.0	2026-02-24 00:08:29 -05:00
CLAUDE.md	infra: rename truenas-burnin → nas-burnin (1.0.0-41)	2026-05-04 07:16:02 -07:00
docker-compose.yml	infra: rename truenas-burnin → nas-burnin (1.0.0-41)	2026-05-04 07:16:02 -07:00
Dockerfile	deps: pin transitive dependencies via lockfile (1.0.0-25)	2026-05-02 17:15:02 -04:00
README.md	fix: stuck_job_hours default 24 → 168 (7 days) (1.0.0-43)	2026-05-08 13:23:05 -07:00
requirements.in	deps: pin transitive dependencies via lockfile (1.0.0-25)	2026-05-02 17:15:02 -04:00
requirements.txt	deps: pin transitive dependencies via lockfile (1.0.0-25)	2026-05-02 17:15:02 -04:00
SPEC.md	infra: rename truenas-burnin → nas-burnin (1.0.0-41)	2026-05-04 07:16:02 -07:00

README.md

NAS Burn-In Dashboard

Web dashboard for running disciplined burn-in tests on TrueNAS drives. Sits next to the NAS, not on it — orchestrates smartctl, badblocks, and nvme-cli over SSH and tracks every job in SQLite.

Inspired by the community disk-burnin.sh script (Spearfoot et al.) but adds: concurrent burn-ins, pool-membership safety locks, login + audit, live progress UI, daily email reports, and resumable state.

Stack

FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No external services beyond your TrueNAS host. Templates and static assets are bind-mounted; Python source is baked into the image.

Quick start

# 1. Configure
cp .env.example .env
# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.

# 2. Build + run
docker compose up -d --build

# 3. Open the dashboard
open http://localhost:8084   # or your host's IP

# 4. First time: the login page renders a "Create initial admin" form.
#    Pick a username + password (>= 8 chars). Done.

If you set INITIAL_ADMIN_* env vars and the users table is empty, that account is created on startup automatically. After that the env vars are ignored — change passwords from the UI ("Change password" header link) or the CLI (docker exec -it nas-burnin python -m app.auth_cli reset <username>).

Burning in many drives at once

The dashboard runs up to max_parallel_burnins burn-ins concurrently (configurable in Settings, default 4) and queues the rest. Submitting 14 drives doesn't take 14 separate clicks — you submit once and the queue drains automatically as slots free up.

The workflow

Select all idle drives — click the checkbox in the table header (next to "DRIVE"). It auto-checks every drive that's currently selectable: idle, no active SMART test, not pool-locked. Pool-locked drives are intentionally excluded; if you really want to burn one of them in, unlock it individually first (see Drive locks below).
Click the Burn-In button in the batch action bar that slides up from the bottom — it shows the count of selected drives.
In the batch modal: pick the stages to run (Short SMART, Long SMART, Surface Validate — drag to reorder), confirm your operator name, and click Start.
All selected drives are queued in one POST. Up to max_parallel_burnins enter running; the rest sit queued. As each running job finishes, the next queued job picks up the freed slot automatically — no operator action between batches.
The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."
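
The queueing behaviour described above can be sketched with an asyncio semaphore. This is a minimal illustration rather than the project's actual orchestrator: `run_batch`, `fake_burn_in`, and the drive names are hypothetical, and only the cap of 4 comes from the documented default.

```python
import asyncio

MAX_PARALLEL_BURNINS = 4  # mirrors the documented default


async def run_batch(drives, burn_in, max_parallel=MAX_PARALLEL_BURNINS):
    """Queue every drive at once; at most max_parallel run concurrently.

    Returns the peak number of jobs that were running at the same time.
    """
    slots = asyncio.Semaphore(max_parallel)
    running = 0
    peak = 0

    async def worker(drive):
        nonlocal running, peak
        async with slots:              # queued jobs wait here for a free slot
            running += 1
            peak = max(peak, running)
            try:
                await burn_in(drive)   # hypothetical per-drive coroutine
            finally:
                running -= 1           # the slot itself is freed on block exit

    await asyncio.gather(*(worker(d) for d in drives))
    return peak


async def fake_burn_in(drive):
    await asyncio.sleep(0.01)          # stand-in for hours of real work


if __name__ == "__main__":
    drives = [f"sd{chr(ord('a') + i)}" for i in range(12)]
    peak = asyncio.run(run_batch(drives, fake_burn_in))
    print(f"peak concurrency: {peak}")
```

All twelve workers are created up front, but the semaphore only lets four past at a time, so the queue drains without any operator action between batches.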

Time estimate

Drive size	Profile	Per-drive runtime (default block size)
250 GB SSD	Short + Long SMART + Surface	~1 hour
14 TB HDD	Short + Long SMART + Surface	~24 hours
14 TB HDD	Short + Long SMART (no surface)	~6–8 hours

For 12× 14 TB drives at the default of 4 parallel burn-ins, expect roughly 3–4 days end-to-end. Raising surface_validate_block_size from 4096 to 8192 in Settings cuts runtime roughly in half at about 2× the RAM cost, which matches the upstream disk-burnin.sh recommendation.
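
The batch figure can be reproduced with simple arithmetic. The sketch below assumes identical drives that finish in lock-step waves, which is a simplification: it gives the lower bound of the estimate, and the fourth day is a margin for anything the model leaves out. `batch_runtime_hours` is a hypothetical helper, not part of the dashboard.

```python
import math


def batch_runtime_hours(n_drives, per_drive_hours, max_parallel=4):
    """Lower-bound runtime when drives run in waves of max_parallel."""
    waves = math.ceil(n_drives / max_parallel)
    return waves * per_drive_hours


hours = batch_runtime_hours(12, 24)   # 12 drives, ~24 h each, 4 at a time
print(f"{hours} hours = {hours / 24:.1f} days")
```

Twelve drives at four in parallel is three waves of about 24 hours, so 72 hours before any overhead.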

Watch out

Stuck-job timeout: stuck_job_hours (default 168 = 7 days) marks any job that runs past the threshold as unknown and kills the remote process. The default leaves margin for a destructive (badblocks -w) surface_validate run on 14 TB+ HDDs. If you are burning in small SSDs and want genuinely stuck jobs detected sooner, lower it. (Earlier versions defaulted to 24 h, which produced false positives on multi-TB drives.)
Thermal gate: if drives currently under burn-in hit the temperature warning threshold, new jobs wait up to 3 minutes before acquiring a slot. Raise temp_warn_c if your chassis runs hot but is otherwise healthy.
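
As a rough illustration of the stuck-job rule, the check reduces to comparing elapsed time against the threshold. The function name and signature here are assumptions for the sketch; only the 168-hour default and the old 24-hour value come from the text above.

```python
from datetime import datetime, timedelta, timezone


def is_stuck(started_at, now, stuck_job_hours=168):
    """True once a job has been running longer than the threshold."""
    return now - started_at > timedelta(hours=stuck_job_hours)


# Example: a job that has been running for 30 hours.
started = datetime(2026, 5, 1, tzinfo=timezone.utc)
now = started + timedelta(hours=30)
print(is_stuck(started, now))                      # within the 7-day default
print(is_stuck(started, now, stuck_job_hours=24))  # past the old 24 h default
```

A 30-hour surface pass is normal on a large HDD, which is why the old 24-hour default flagged healthy jobs.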

Cancelling

Click the red ✕ next to a running job. The orchestrator:

Marks the job cancelled in the DB.
Issues kill -9 <remote_pid> over a fresh SSH session (the badblocks PID is captured at launch via sh -c 'echo PID:$$; exec ...').
Cancels the asyncio task, releasing the semaphore slot for the next queued job.

Cancellations are durable — restart the container and queued jobs resume, cancelled jobs stay cancelled.
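
The PID-capture trick in step 2 works because `exec` replaces the shell with the target process, so the `$$` printed beforehand is the PID of the process that keeps running. A minimal sketch of building the wrapper and parsing its first output line follows; the helper names are hypothetical and the real launch command is elided in this README, so `sleep 30` is only a placeholder.

```python
import re
import shlex

PID_LINE = re.compile(r"^PID:(\d+)$")


def wrap_with_pid_echo(argv):
    """Build a `sh -c` command that prints its PID, then execs argv."""
    inner = "echo PID:$$; exec " + " ".join(shlex.quote(a) for a in argv)
    return "sh -c " + shlex.quote(inner)


def parse_pid(first_line):
    """Extract the PID from a 'PID:<n>' line, or None if it does not match."""
    match = PID_LINE.match(first_line.strip())
    return int(match.group(1)) if match else None


def kill_command(pid):
    """Command to send over a fresh SSH session when a job is cancelled."""
    return f"kill -9 {int(pid)}"


print(wrap_with_pid_echo(["sleep", "30"]))   # placeholder command
print(kill_command(parse_pid("PID:4242\n")))
```

Parsing strictly and returning None on anything unexpected means a garbled first line cannot turn into a kill aimed at the wrong process.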

Drive locks

To prevent destroying live data, the dashboard refuses to start a destructive burn-in on any drive that ZFS or the kernel reports as in use. Detected lock states (with the typed-confirmation token required to override):

State	Detected via	Confirm token
Active pool	`zpool list -vHP`	the pool name (e.g. `tank`)
Boot pool	pool name = `boot-pool`	`DESTROY BOOT POOL`
Exported ZFS	`lsblk` `zfs_member` partitions not in any active pool	`DESTROY EXPORTED POOL`
Mounted FS	`findmnt -no SOURCE`	`DESTROY MOUNTED FILESYSTEM`
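
The typed-confirmation rule in the table can be expressed as a small lookup. This is a sketch under assumptions: the state keys (`active_pool`, `boot_pool`, and so on) are invented for illustration, and exact case-sensitive matching is assumed. Only the token strings themselves come from the table.

```python
# Hypothetical state keys; the token strings are taken from the table above.
FIXED_TOKENS = {
    "boot_pool": "DESTROY BOOT POOL",
    "exported_zfs": "DESTROY EXPORTED POOL",
    "mounted_fs": "DESTROY MOUNTED FILESYSTEM",
}


def required_token(state, pool_name=None):
    """Return the text the operator must type to override a lock."""
    if state == "active_pool":
        if not pool_name:
            raise ValueError("an active-pool lock needs the pool name")
        return pool_name               # e.g. "tank"
    return FIXED_TOKENS[state]


def confirm_matches(typed, state, pool_name=None):
    """Assumed rule: the typed text must match exactly, case included."""
    return typed == required_token(state, pool_name)


print(confirm_matches("tank", "active_pool", pool_name="tank"))
print(confirm_matches("destroy boot pool", "boot_pool"))
```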

Detection runs every poll cycle (~12 s). On any SSH or parser failure the poller fails closed: previously-locked drives stay locked, previously-unlocked drives stay unlocked, until detection recovers.
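
One way to read the failure behaviour described above is "keep the last known state when detection fails". The sketch below shows that pattern only; `refresh_locks`, the `detect` callable, and the logger name are hypothetical.

```python
import logging

log = logging.getLogger("lock_poller")  # hypothetical logger name


def refresh_locks(previous, detect):
    """Return fresh lock state, or the last known state if detection fails."""
    try:
        return detect()
    except Exception as exc:  # SSH or parser failure: keep what we knew
        log.warning("lock detection failed, keeping last known state: %s", exc)
        return previous
```

A failed poll therefore never silently unlocks a drive; the map only changes when a detection pass completes.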

Unlock is in-memory only with a 10-minute TTL — bound to the (pool_name, pool_role) observed at unlock time. If a subsequent poll reclassifies the drive (e.g. (exported) → tank because someone imported the pool), the grant is invalidated automatically.

Every unlock writes an audit event and surfaces in the next daily report in a red banner.
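
The unlock grant described in this section can be modelled as a value that is valid only while it is fresh and the drive's observed classification is unchanged. This is an illustrative sketch: the `UnlockGrant` class and the role strings are assumptions, while the 10-minute TTL and the `(pool_name, pool_role)` binding come from the text.

```python
import time
from dataclasses import dataclass

UNLOCK_TTL_SECONDS = 10 * 60  # documented 10-minute TTL


@dataclass(frozen=True)
class UnlockGrant:
    pool_name: str      # as observed when the unlock was granted
    pool_role: str      # hypothetical role label, e.g. "exported"
    granted_at: float   # seconds, e.g. from time.monotonic()

    def is_valid(self, pool_name, pool_role, now):
        """Valid only while fresh and the drive is classified the same way."""
        fresh = now - self.granted_at < UNLOCK_TTL_SECONDS
        unchanged = (pool_name, pool_role) == (self.pool_name, self.pool_role)
        return fresh and unchanged


grant = UnlockGrant("(exported)", "exported", granted_at=time.monotonic())
# Someone imports the pool, so the next poll sees "tank" and the grant lapses.
print(grant.is_valid("tank", "active", now=time.monotonic()))
```

Keeping grants in memory means a container restart drops every outstanding unlock, which errs on the safe side.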

Settings highlights

All settings live under /settings (header link). Key knobs:

max_parallel_burnins (default 4) — semaphore cap. Restart container for changes to take effect.
surface_validate_block_size / _block_buffer / _passes — badblocks -b / -c / -p. Defaults preserve original behaviour; tune for speed vs paranoia.
stuck_job_hours (default 168 = 7 days) — covers 14 TB+ HDDs; drop for faster detection on small fast drives.
temp_warn_c / temp_crit_c — thermal gating thresholds.
bad_block_threshold (default 0) — number of bad blocks surface_validate tolerates before failing the stage.
retention_log_days (default 35) — when to NULL out burnin_stages.log_text. Nightly job at 03:00 local.
retention_backup_keep (default 14) — how many nightly DB snapshots to keep in /data/backups/.
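
To make the mapping between the surface_validate settings and badblocks flags concrete, here is a hedged sketch of how an argument list could be assembled. The function names, the device path, and the example values for `block_buffer` and `passes` are illustrative assumptions; the flag letters and the threshold semantics follow the list above.

```python
def badblocks_argv(device, *, block_size, block_buffer, passes):
    """Map the three surface_validate settings onto badblocks -b / -c / -p."""
    return [
        "badblocks",
        "-w",                      # destructive write-mode test
        "-b", str(block_size),     # surface_validate_block_size
        "-c", str(block_buffer),   # surface_validate_block_buffer
        "-p", str(passes),         # surface_validate_passes
        device,
    ]


def stage_passed(bad_blocks, threshold=0):
    """The stage fails once bad blocks exceed bad_block_threshold."""
    return bad_blocks <= threshold


# Example values only; /dev/sdX is a placeholder device path.
print(badblocks_argv("/dev/sdX", block_size=8192, block_buffer=64, passes=0))
print(stage_passed(0), stage_passed(1))
```

Passing the command as a list keeps each value a separate argument, so a setting can never be interpreted as extra shell syntax.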

Notifications

Daily SMTP report at smtp_report_hour (default 08:00 local) with drive-level summary, failed-health banner, and a red banner listing every pool-drive unlock from the last 24 h.
Per-job email alerts on pass/fail (configurable).
Webhook URL posts JSON on every job state change.

Configure SMTP in Settings → Email. Includes a "Test SMTP" button.

Operations

Logs

docker logs -f nas-burnin
# JSON-structured. Filter with jq:
docker logs nas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
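
If jq is not available, the same filter can be approximated in a few lines of Python. This sketch assumes the `ts`, `level`, and `msg` fields used in the jq expression above and skips any line that is not a JSON object; `format_log_lines` is a hypothetical helper.

```python
import json


def format_log_lines(lines):
    """Yield 'ts level msg' for each JSON log line; skip everything else."""
    for raw in lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue                      # plain-text line, e.g. a traceback
        if not isinstance(record, dict):
            continue
        yield f"{record.get('ts')} {record.get('level')} {record.get('msg')}"


# Usage idea: pipe `docker logs nas-burnin 2>&1` into a script that calls
# format_log_lines(sys.stdin) and prints each result.
sample = ['{"ts": "2026-05-08T23:23:02", "level": "INFO", "msg": "poll ok"}',
          "not json at all"]
for line in format_log_lines(sample):
    print(line)
```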

User management

docker exec -it nas-burnin python -m app.auth_cli list
docker exec -it nas-burnin python -m app.auth_cli add <username>
docker exec -it nas-burnin python -m app.auth_cli reset <username>

Passwords are read from a TTY prompt and are never accepted on the command line.

Backups

Automated nightly to /data/backups/app-YYYY-MM-DD.db (online sqlite3.backup, doesn't lock writers). To restore:

docker compose down
cp data/backups/app-2026-05-01.db data/app.db
docker compose up -d
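
For reference, an online snapshot of this kind can be taken with the standard library's `sqlite3.Connection.backup`. The sketch below is not the dashboard's nightly job; `snapshot` and the paths in the usage comment are illustrative.

```python
import sqlite3


def snapshot(src_path, dest_path):
    """Copy a live SQLite database into dest_path using the backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)   # copies the database without stopping the app
    finally:
        dest.close()
        src.close()


# Usage idea (illustrative paths):
# snapshot("data/app.db", "data/backups/app-2026-05-01.db")
```

Unlike copying the file with cp while the app is running, the backup API produces a consistent snapshot even if writes happen mid-copy.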

Health probe

/health is unauthenticated and returns 200 only when DB, poller, and SSH (when configured) all check green; 503 otherwise. Use it for container/orchestrator health checks.

curl -sf http://localhost:8084/health | jq
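
A monitoring script can apply the same rule as the curl call: treat anything other than a 200 as unhealthy. This is a minimal standard-library sketch; `is_healthy` is a hypothetical helper, and the URL is the one used in the examples above.

```python
import urllib.error
import urllib.request


def is_healthy(url="http://localhost:8084/health", timeout=5.0):
    """True only when the probe answers 200; 503 and network errors are False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 503 responses,
        # and URLError or OSError for refused connections and timeouts.
        return False
```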

Resetting the DB

If you need to start over:

docker compose down
sudo rm -f data/app.db data/session_secret
# keep data/settings_overrides.json if you want to preserve UI settings
docker compose up -d

Updating dependencies

requirements.in is the human-edited list. requirements.txt is a fully-pinned lockfile generated from it (with sha256 hashes), consumed at build time with pip install --require-hashes. Never edit requirements.txt by hand.

# 1. Add or change a constraint in requirements.in
$EDITOR requirements.in

# 2. Regenerate the lockfile (runs pip-compile in a clean container)
./scripts/regenerate-lockfile.sh

# 3. Review the diff — transitive bumps may be CVE fixes or breaking changes
git diff requirements.txt

# 4. Rebuild + smoke-test
docker compose up -d --build app
curl -sf http://localhost:8084/health | jq

# 5. Commit BOTH files together
git add requirements.in requirements.txt
git commit -m "deps: bump <package> for <reason>"

Combined with the daily security scan (scripts/security-scan.sh), this gives defense in depth: pinning prevents accidental breakage from upstream releases (Starlette 1.0 broke us once), --require-hashes defends against compromised mirrors, and pip-audit catches CVEs disclosed after a version was pinned.

Known gaps / not-yet-built

No multi-user RBAC — every user is effectively admin.
No per-drive SMART attribute trend graphs (snapshots only).
No scheduled burn-ins — jobs run immediately when queued.
No CSRF tokens on state-changing endpoints (relies on SameSite=Strict session cookie).

PRs welcome.
