Addresses 12 of 13 findings from the Codex tech-debt + security review of versions 1.0.0-22 through 1.0.0-27. Item #5 (live pool re-check before start_job) deferred: would add an SSH round-trip per start.

#1 Pool detection now treats zpool / lsblk / findmnt failures INDEPENDENTLY. Previously a single None blew away the whole map, so a host where lsblk lacks zfs_member info but zpool works would never lock pool members. Extended findmnt parser to recognise /dev/mapper/*, /dev/dm-*, /dev/md*, /dev/da*, /dev/ada* (LVM, devicemapper, MD RAID, FreeBSD CORE devnames).

#2 Admin role enforced on every settings mutation. New auth.require_admin() helper applied to GET /settings, POST /api/v1/settings, /test-smtp, /test-ssh. Previously any authenticated user (the CLI explicitly supports non-admin accounts) could rewrite SMTP/SSH/API secrets.

#3 First-user setup race closed. auth.create_user() now accepts bootstrap_only=True which wraps the existence check + insert in BEGIN IMMEDIATE so two concurrent /api/v1/auth/setup requests can't both create admin accounts during the bootstrap window.

#4 Case-insensitive uniqueness enforced via new `uniq_users_username_nocase` index. Login does NOCASE lookup so without this `Admin` and `admin` could coexist as distinct rows.

#6 New `session_cookie_secure` setting (default False for LAN/dev deploys, set True in production behind HTTPS) flips the session cookie's Secure flag. Defends against on-the-wire exposure when the dashboard is reachable over plain HTTP.

#7 Audit trail bound to authenticated identity. Burn-in start / cancel / unlock / drive reset all now use `_operator_for(request)` which reads `request.state.current_user.full_name|username` instead of the body's operator field. Logged-in users can no longer spoof attribution. Drive reset's literal-"operator" fallback (window._operator was never set) is also fixed by this.

#8 Login rate-limit race fixed. New `register_login_attempt()` is atomic check-AND-increment in synchronous code (no awaits inside), so a parallel burst can't slip past the threshold. `record_login_failure()` removed; `clear_login_failures()` now also drops any active lockout for a successful auth. Pre-existing bug where `tripped` was always False (so user_login_locked_out audit events never fired) also fixed.

#9 NVMe surface_validate post-format check now mirrors the SSH path: fails on FAILED health AND on real SMART attribute failures, soft-passes SSH-only failures (logged), surfaces warnings to the stage log without failing.

#10 retention.backup_db() now writes to `.tmp` then atomic-renames into the canonical daily slot; an interrupted backup leaves the tmp behind but doesn't corrupt the real snapshot. Scheduler marks last_run_date only on (prune AND backup) success so a transient failure gets retried within the 03:00 hour.

#11 /health DB probe now exercises the WRITE path via a temp-table INSERT/SELECT/COMMIT round-trip. Previously only read PRAGMA journal_mode + a row count, which silently passes on read-only mounts and broken-WAL conditions.

#12 security-scan.sh now fails loudly if `git fetch` or `git reset --hard origin/main` errors (was `|| true`, scanning stale code silently). pip-audit now runs in a throwaway python:3.12-slim container against requirements.txt instead of `docker exec`-ing into the live truenas-burnin container: cleaner separation, no transient package install on prod.

#13 Badblocks SSH stage no longer doubles its log_text. Previously appended every 20-line chunk during streaming AND the full accumulated output at end. Now only flushes the un-flushed tail (typically <20 lines). `result["output"]` stays in-memory only.

Verification: all 44 unit tests pass in container; /health 200; security scan returns 0 findings; deployed maple build is green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TrueNAS Burn-In Dashboard
Web dashboard for running disciplined burn-in tests on TrueNAS drives.
Sits next to the NAS, not on it — orchestrates smartctl, badblocks, and
nvme-cli over SSH and tracks every job in SQLite.
Inspired by the community disk-burnin.sh script (Spearfoot et al.) but
adds: concurrent burn-ins, pool-membership safety locks, login + audit,
live progress UI, daily email reports, and resumable state.
Stack
FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No external services beyond your TrueNAS host. Templates and static assets are bind-mounted; Python source is baked into the image.
Quick start
# 1. Configure
cp .env.example .env
# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.
# 2. Build + run
docker compose up -d --build
# 3. Open the dashboard
open http://localhost:8084 # or your host's IP
# 4. First time: the login page renders a "Create initial admin" form.
# Pick a username + password (>= 8 chars). Done.
If you set INITIAL_ADMIN_* env vars and the users table is empty, that
account is created on startup automatically. After that the env vars are
ignored — change passwords from the UI ("Change password" header link) or
the CLI (docker exec -it truenas-burnin python -m app.auth_cli reset <username>).
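The "create the account only if the users table is empty" rule is easy to get wrong when two requests arrive at once. The following is a minimal sketch of one way to make the check and the insert atomic in SQLite using `BEGIN IMMEDIATE`. The `create_first_admin` helper, the `users` table layout, and its column names are assumptions for illustration, not the project's actual schema or API.

```python
import sqlite3


def create_first_admin(db_path: str, username: str, password_hash: str) -> bool:
    """Insert the first admin only if the users table is still empty.

    Hypothetical helper: table and column names are assumptions.
    """
    # isolation_level=None puts sqlite3 in autocommit mode, so the
    # transaction boundaries below are fully under our control.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        # BEGIN IMMEDIATE takes the write lock up front, so a second caller
        # waits here until the first one commits or rolls back.
        conn.execute("BEGIN IMMEDIATE")
        (count,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()
        if count:
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "INSERT INTO users (username, password_hash, is_admin) VALUES (?, ?, 1)",
            (username, password_hash),
        )
        conn.execute("COMMIT")
        return True
    finally:
        conn.close()
```

The second caller sees a non-empty table once it gets the lock and returns `False` instead of creating a second admin.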
Burning in many drives at once
The dashboard runs up to max_parallel_burnins burn-ins concurrently
(configurable in Settings, default 4) and queues the rest. Submitting 14
drives doesn't take 14 separate clicks — you submit once and the queue
drains automatically as slots free up.
The workflow
- Select all idle drives — click the checkbox in the table header (next to "DRIVE"). It auto-checks every drive that's currently selectable: idle, no active SMART test, not pool-locked. Pool-locked drives are intentionally excluded; if you really want to burn one of them in, unlock it individually first (see Drive locks below).
- Click the Burn-In button in the batch action bar that slides up from the bottom — it shows the count of selected drives.
- In the batch modal: pick the stages to run (Short SMART, Long SMART, Surface Validate — drag to reorder), confirm your operator name, and click Start.
- All selected drives are queued in one POST. Up to `max_parallel_burnins` enter `running`; the rest sit `queued`. As each running job finishes, the next queued job picks up the freed slot automatically, with no operator action between batches.
- The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."
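The queue-and-drain behaviour described in the workflow can be pictured as a semaphore with `max_parallel_burnins` slots. This is only a sketch under that assumption: `run_burnin` and `submit_batch` are hypothetical names, and the `asyncio.sleep` stands in for the real SSH work.

```python
import asyncio

MAX_PARALLEL_BURNINS = 4  # mirrors the documented default


async def run_burnin(drive: str, slots: asyncio.Semaphore, log: list) -> None:
    """Hypothetical job wrapper: wait for a free slot, then 'run' the job."""
    log.append((drive, "queued"))
    async with slots:  # suspends while every slot is taken
        log.append((drive, "running"))
        await asyncio.sleep(0.01)  # stand-in for the real burn-in stages
        log.append((drive, "done"))


async def submit_batch(drives: list) -> list:
    """Queue every drive at once; the semaphore drains the queue."""
    slots = asyncio.Semaphore(MAX_PARALLEL_BURNINS)
    log: list = []
    await asyncio.gather(*(run_burnin(d, slots, log) for d in drives))
    return log
```

Calling `asyncio.run(submit_batch([...]))` with 14 drive names never has more than four in the `running` state at once; each finished job releases its slot to the next waiter.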
Time estimate
| Drive size | Profile | Per-drive runtime (default block size) |
|---|---|---|
| 250 GB SSD | Short + Long SMART + Surface | ~1 hour |
| 14 TB HDD | Short + Long SMART + Surface | ~24 hours |
| 14 TB HDD | Short + Long SMART (no surface) | ~6–8 hours |
For 12× 14 TB drives at default 4-parallel: roughly 3–4 days end-to-end.
Bumping surface_validate_block_size from 4096 to 8192 in Settings cuts
runtime roughly in half at ~2× RAM cost — matches the upstream
disk-burnin.sh recommendation.
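The end-to-end figure follows from simple wave arithmetic. A rough sketch, assuming every drive takes about the same time and ignoring thermal-gate waits (`batch_runtime_hours` is a hypothetical helper, not part of the dashboard):

```python
import math


def batch_runtime_hours(n_drives: int, per_drive_hours: float, max_parallel: int = 4) -> float:
    """Rough wall-clock estimate: drives run in waves of max_parallel."""
    waves = math.ceil(n_drives / max_parallel)
    return waves * per_drive_hours


print(batch_runtime_hours(12, 24))  # 72 hours, i.e. 3 days
```

Twelve drives at four in parallel is three waves of roughly 24 hours each, which is the lower end of the 3–4 day range quoted above.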
Watch out
- Stuck-job timeout: `stuck_job_hours` (default 24) marks any job past that threshold as `unknown` and kills the remote process. If you're burning in 14 TB drives with default block size, raise this to 48 in Settings before starting, or you'll get false positives near the end of surface_validate.
- Thermal gate: if drives currently under burn-in hit the temperature warning threshold, new jobs wait up to 3 minutes before acquiring a slot. Increase `temp_warn_c` if your chassis runs hot but is otherwise fine.
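The thermal gate amounts to a bounded wait before a job takes its slot. A minimal sketch, assuming a caller-supplied `read_max_temp_c` callable (hypothetical) that returns the hottest temperature among drives under burn-in. What the dashboard does once the 3 minutes expire is not stated here, so the sketch only reports the outcome.

```python
import asyncio
import time
from typing import Callable


async def wait_for_thermal_headroom(
    read_max_temp_c: Callable[[], float],
    temp_warn_c: float,
    timeout_s: float = 180.0,  # the 3-minute cap described above
    poll_s: float = 5.0,       # polling interval is an assumption
) -> bool:
    """Return True once temperatures drop below the warning threshold,
    or False if they are still at or above it when the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if read_max_temp_c() < temp_warn_c:
            return True
        await asyncio.sleep(poll_s)
    return False
```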
Cancelling
Click the red ✕ next to a running job. The orchestrator:
- Marks the job `cancelled` in the DB.
- Issues `kill -9 <remote_pid>` over a fresh SSH session (the badblocks PID is captured at launch via `sh -c 'echo PID:$$; exec ...'`).
- Cancels the asyncio task, releasing the semaphore slot for the next queued job.
Cancellations are durable — restart the container and queued jobs resume, cancelled jobs stay cancelled.
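The PID-capture trick works because `exec` replaces the shell with the command, so the PID echoed by `$$` is the PID of the long-running process itself. The sketch below shows the idea with a local subprocess on a POSIX system; the dashboard does this over SSH, and `launch_with_pid` plus the `sleep` stand-in for badblocks are assumptions for illustration.

```python
import os
import signal
import subprocess


def launch_with_pid(command: str):
    """Start `command` under sh and return (process, pid) from the PID: line."""
    proc = subprocess.Popen(
        ["sh", "-c", f"echo PID:$$; exec {command}"],
        stdout=subprocess.PIPE,
        text=True,
    )
    first_line = proc.stdout.readline().strip()
    if not first_line.startswith("PID:"):
        proc.kill()
        raise RuntimeError(f"unexpected first line: {first_line!r}")
    return proc, int(first_line[len("PID:"):])


proc, pid = launch_with_pid("sleep 30")  # sleep stands in for badblocks
os.kill(pid, signal.SIGKILL)             # local equivalent of `kill -9 <pid>`
proc.wait()
print(proc.returncode)                   # -9 on POSIX: terminated by SIGKILL
```

Without the `exec`, the echoed PID would belong to the wrapper shell and killing it could leave the real workload running.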
Drive locks
To prevent destroying live data, the dashboard refuses to start destructive burn-in on drives ZFS or the kernel reports as in-use. Detected lock states (with the typed-confirmation token required to override):
| State | Detected via | Confirm token |
|---|---|---|
| Active pool | `zpool list -vHP` | the pool name (e.g. `tank`) |
| Boot pool | pool name = `boot-pool` | `DESTROY BOOT POOL` |
| Exported ZFS | `lsblk` `zfs_member` partitions not in any active pool | `DESTROY EXPORTED POOL` |
| Mounted FS | `findmnt -no SOURCE` | `DESTROY MOUNTED FILESYSTEM` |
Detection runs every poll cycle (~12 s). On any SSH or parser failure the poller fails closed: previously-locked drives stay locked, previously-unlocked drives stay unlocked, until detection recovers.
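The lock-state table can be read as a small lookup from detected state to the text the operator must type. A sketch under assumptions: the `required_confirm_token` helper, its argument names, and the precedence between states are hypothetical; only the token strings come from the table.

```python
from typing import Optional

BOOT_POOL = "boot-pool"


def required_confirm_token(
    pool_name: Optional[str], exported: bool, mounted: bool
) -> Optional[str]:
    """Return the text an operator must type to unlock, or None if unlocked.

    Hypothetical helper: argument names and precedence are assumptions.
    """
    if pool_name == BOOT_POOL:
        return "DESTROY BOOT POOL"
    if pool_name:
        return pool_name  # active pool: confirm by typing its name
    if exported:
        return "DESTROY EXPORTED POOL"
    if mounted:
        return "DESTROY MOUNTED FILESYSTEM"
    return None
```

For example, `required_confirm_token("tank", False, False)` returns `"tank"`, while a drive with no detected state returns `None` and needs no override.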
Unlock is in-memory only with a 10-minute TTL — bound to the
(pool_name, pool_role) observed at unlock time. If a subsequent poll
reclassifies the drive (e.g. (exported) → tank because someone
imported the pool), the grant is invalidated automatically.
Every unlock writes an audit event and surfaces in the next daily report in a red banner.
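An unlock grant of this kind can be modelled as a value that is valid only while it is fresh and the drive's observed pool state is unchanged. A minimal sketch: the `UnlockGrant` class and the example role strings are assumptions, not the dashboard's actual data model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

UNLOCK_TTL_S = 600  # the 10-minute TTL described above

PoolState = Tuple[Optional[str], Optional[str]]  # (pool_name, pool_role)


@dataclass(frozen=True)
class UnlockGrant:
    pool_state: PoolState  # state observed when the unlock was granted
    granted_at: float      # seconds, from any monotonic or wall clock

    def is_valid(self, current_state: PoolState, now: float) -> bool:
        """Valid only while fresh AND the drive is still classified the same."""
        fresh = (now - self.granted_at) < UNLOCK_TTL_S
        return fresh and current_state == self.pool_state


grant = UnlockGrant(("(exported)", "exported"), granted_at=1000.0)
print(grant.is_valid(("(exported)", "exported"), now=1300.0))  # True
print(grant.is_valid(("tank", "data"), now=1300.0))            # False: reclassified
```

Because the grant lives only in memory, a container restart drops it as well.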
Settings highlights
All settings live under /settings (header link). Key knobs:
- `max_parallel_burnins` (default 4): semaphore cap. Restart container for changes to take effect.
- `surface_validate_block_size` / `_block_buffer` / `_passes`: badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour; tune for speed vs paranoia.
- `stuck_job_hours` (default 24): raise for big drives.
- `temp_warn_c` / `temp_crit_c`: thermal gating thresholds.
- `bad_block_threshold` (default 0): number of bad blocks surface_validate tolerates before failing the stage.
- `retention_log_days` (default 35): when to NULL out `burnin_stages.log_text`. Nightly job at 03:00 local.
- `retention_backup_keep` (default 14): how many nightly DB snapshots to keep in `/data/backups/`.
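To make the mapping from the three surface-validate settings to badblocks flags concrete, here is a hedged sketch of building the argument list. The `badblocks_argv` helper is hypothetical, and the `-w`, `-s`, and `-v` flags are assumptions about how the stage is invoked; only `-b`, `-c`, and `-p` come from the settings listed above.

```python
from typing import List


def badblocks_argv(device: str, block_size: int, block_buffer: int, passes: int) -> List[str]:
    """Build a badblocks command line from the three tunables.

    -w (destructive write test), -s (progress) and -v (verbose) are
    assumptions; the dashboard's actual invocation may differ.
    """
    if block_size <= 0 or block_buffer <= 0:
        raise ValueError("block_size and block_buffer must be positive")
    if passes < 0:
        raise ValueError("passes must be zero or greater")
    return [
        "badblocks", "-w", "-s", "-v",
        "-b", str(block_size),    # surface_validate_block_size
        "-c", str(block_buffer),  # surface_validate_block_buffer
        "-p", str(passes),        # surface_validate_passes
        device,
    ]


print(" ".join(badblocks_argv("/dev/sdX", 4096, 64, 0)))
```

Passing the arguments as a list rather than a formatted string avoids shell-quoting problems if a device path ever contains unexpected characters.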
Notifications
- Daily SMTP report at `smtp_report_hour` (default 08:00 local) with drive-level summary, failed-health banner, and a red banner listing every pool-drive unlock from the last 24 h.
- Per-job email alerts on pass/fail (configurable).
- Webhook URL posts JSON on every job state change.
Configure SMTP in Settings → Email. Includes a "Test SMTP" button.
Operations
Logs
docker logs -f truenas-burnin
# JSON-structured. Filter with jq:
docker logs truenas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
User management
docker exec -it truenas-burnin python -m app.auth_cli list
docker exec -it truenas-burnin python -m app.auth_cli add <username>
docker exec -it truenas-burnin python -m app.auth_cli reset <username>
Passwords are read from a TTY prompt and are never accepted on the command line.
Backups
Automated nightly to /data/backups/app-YYYY-MM-DD.db (online
sqlite3.backup, doesn't lock writers). To restore:
docker compose down
cp data/backups/app-2026-05-01.db data/app.db
docker compose up -d
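For reference, an online snapshot that never exposes a half-written file can be sketched with the standard library's `sqlite3` backup API plus an atomic rename. This is an illustration of the technique, not the project's `retention.backup_db()`; the function signature and the `.tmp` suffix handling are assumptions.

```python
import os
import sqlite3


def backup_db(src_path: str, dest_path: str) -> None:
    """Snapshot src_path to dest_path via a temporary file."""
    tmp_path = dest_path + ".tmp"
    if os.path.exists(tmp_path):
        os.remove(tmp_path)  # discard leftovers from an interrupted run
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(tmp_path)
    try:
        src.backup(dst)  # online backup; the source stays usable meanwhile
    finally:
        dst.close()
        src.close()
    # os.replace is atomic when both paths are on the same filesystem, so
    # dest_path is always either the old snapshot or the complete new one.
    os.replace(tmp_path, dest_path)
```

If the backup step raises, the rename is never reached and the previous snapshot at `dest_path` is left untouched.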
Health probe
/health is unauthenticated and returns 200 only when DB, poller, and
SSH (when configured) all check green; 503 otherwise. Use it for
container/orchestrator health checks.
curl -sf http://localhost:8084/health | jq
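The same check can be scripted from Python when `curl` is not available, for instance in a custom watchdog. A minimal sketch using only the standard library; `is_healthy` is a hypothetical helper and relies solely on the 200 / 503 contract described above.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """True when the endpoint answers 200; False on 503, other HTTP
    errors, connection failures, or timeouts."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 503 responses.
        return False
```

Usage would look like `is_healthy("http://localhost:8084/health")`, with the host and port adjusted to your deployment.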
Resetting the DB
If you need to start over:
docker compose down
sudo rm -f data/app.db data/session_secret
# keep data/settings_overrides.json if you want to preserve UI settings
docker compose up -d
Updating dependencies
requirements.in is the human-edited list. requirements.txt is a
fully-pinned lockfile generated from it (with sha256 hashes), consumed
at build time with pip install --require-hashes. Never edit
requirements.txt by hand.
# 1. Add or change a constraint in requirements.in
$EDITOR requirements.in
# 2. Regenerate the lockfile (runs pip-compile in a clean container)
./scripts/regenerate-lockfile.sh
# 3. Review the diff — transitive bumps may be CVE fixes or breaking changes
git diff requirements.txt
# 4. Rebuild + smoke-test
docker compose up -d --build app
curl -sf http://localhost:8084/health | jq
# 5. Commit BOTH files together
git add requirements.in requirements.txt
git commit -m "deps: bump <package> for <reason>"
This + the daily security scan (scripts/security-scan.sh) gives
defense-in-depth: pinning prevents accidental breakage from upstream
releases (Starlette 1.0 broke us once), --require-hashes defends
against compromised mirrors, and pip-audit catches new CVEs in any
pinned version after the fact.
See also
- `CLAUDE.md`: full architecture, file map, deploy workflow, and the rationale behind every non-obvious design decision.
- `SPEC.md`: canonical feature reference per version.
- `tests/`: `python -m unittest discover tests/` (44 tests, stdlib-only).
Known gaps / not-yet-built
- No fine-grained multi-user RBAC: the only distinction is admin vs non-admin, enforced on the settings endpoints.
- No per-drive SMART attribute trend graphs (snapshots only).
- No scheduled burn-ins — jobs run immediately when queued.
- No CSRF tokens on state-changing endpoints (relies on the `SameSite=Strict` session cookie).
PRs welcome.