# NAS Burn-In Dashboard

Web dashboard for running disciplined burn-in tests on TrueNAS drives. Sits next to the NAS, not on it: it orchestrates `smartctl`, `badblocks`, and `nvme-cli` over SSH and tracks every job in SQLite.

Inspired by the community `disk-burnin.sh` script (Spearfoot et al.) but adds: concurrent burn-ins, pool-membership safety locks, login + audit, live progress UI, daily email reports, and resumable state.

## Stack

FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No external services beyond your TrueNAS host. Templates and static assets are bind-mounted; Python source is baked into the image.

---

## Quick start

```bash
# 1. Configure
cp .env.example .env
# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.

# 2. Build + run
docker compose up -d --build

# 3. Open the dashboard
open http://localhost:8084   # or your host's IP

# 4. First time: the login page renders a "Create initial admin" form.
#    Pick a username + password (>= 8 chars). Done.
```

If you set `INITIAL_ADMIN_*` env vars *and* the users table is empty, that account is created on startup automatically. After that the env vars are ignored; change passwords from the UI ("Change password" header link) or the CLI (`docker exec -it nas-burnin python -m app.auth_cli reset `).

---

## Burning in many drives at once

The dashboard runs **up to `max_parallel_burnins`** burn-ins concurrently (configurable in Settings, default 4) and queues the rest. Submitting 14 drives doesn't take 14 separate clicks: you submit once and the queue drains automatically as slots free up.

### The workflow

1. **Select all idle drives**: click the checkbox in the table header (next to "DRIVE"). It auto-checks every drive that's currently selectable: idle, no active SMART test, not pool-locked.
   Pool-locked drives are intentionally excluded; if you really want to burn one of them in, unlock it individually first (see [Drive locks](#drive-locks) below).
2. **Click the Burn-In button** in the batch action bar that slides up from the bottom; it shows the count of selected drives.
3. **In the batch modal**: pick the stages to run (Short SMART, Long SMART, Surface Validate; drag to reorder), confirm your operator name, and click Start.
4. **All selected drives are queued** in one POST. Up to `max_parallel_burnins` enter `running`; the rest sit `queued`. As each running job finishes, the next queued job picks up the freed slot automatically, with no operator action between batches.
5. The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."

### Time estimate

| Drive size | Profile | Per-drive runtime (default block size) |
|------------|---------|----------------------------------------|
| 250 GB SSD | Short + Long SMART + Surface | ~1 hour |
| 14 TB HDD | Short + Long SMART + Surface | ~24 hours |
| 14 TB HDD | Short + Long SMART (no surface) | ~6–8 hours |

For 12× 14 TB drives at default 4-parallel: roughly **3–4 days** end-to-end. Bumping `surface_validate_block_size` from 4096 to 8192 in Settings cuts runtime roughly in half at ~2× RAM cost, which matches the upstream `disk-burnin.sh` recommendation.

### Watch out

- **Stuck-job timeout**: `stuck_job_hours` (default 24) marks any job past that threshold as `unknown` and kills the remote process. If you're burning in 14 TB drives with default block size, raise this to **48** in Settings before starting, or you'll get false positives near the end of surface_validate.
- **Thermal gate**: if drives currently under burn-in hit the temperature warning threshold, new jobs wait up to 3 minutes before acquiring a slot. Increase `temp_warn_c` if your chassis runs hot but is otherwise fine.

### Cancelling

Click the red ✕ next to a running job. The orchestrator:

1. Marks the job `cancelled` in the DB.
2. Issues `kill -9 ` over a fresh SSH session (the badblocks PID is captured at launch via `sh -c 'echo PID:$$; exec ...'`).
3. Cancels the asyncio task, releasing the semaphore slot for the next queued job.

Cancellations are durable: restart the container and queued jobs resume, cancelled jobs stay cancelled.

---

## Drive locks

To prevent destroying live data, the dashboard refuses to start destructive burn-in on drives ZFS or the kernel reports as in-use. Detected lock states (with the typed-confirmation token required to override):

| State | Detected via | Confirm token |
|-------|--------------|---------------|
| Active pool | `zpool list -vHP` | the pool name (e.g. `tank`) |
| Boot pool | pool name = `boot-pool` | `DESTROY BOOT POOL` |
| Exported ZFS | `lsblk` `zfs_member` partitions not in any active pool | `DESTROY EXPORTED POOL` |
| Mounted FS | `findmnt -no SOURCE` | `DESTROY MOUNTED FILESYSTEM` |

Detection runs every poll cycle (~12 s). On any SSH or parser failure the poller fails *closed*: previously-locked drives stay locked, previously-unlocked drives stay unlocked, until detection recovers.

Unlock is in-memory only with a 10-minute TTL, bound to the `(pool_name, pool_role)` observed at unlock time. If a subsequent poll reclassifies the drive (e.g. `(exported)` → `tank` because someone imported the pool), the grant is invalidated automatically. Every unlock writes an audit event and surfaces in the next daily report in a red banner.

---

## Settings highlights

All settings live under `/settings` (header link). Key knobs:

- **`max_parallel_burnins`** (default 4): semaphore cap. Restart container for changes to take effect.
- **`surface_validate_block_size` / `_block_buffer` / `_passes`**: badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour; tune for speed vs paranoia.
- **`stuck_job_hours`** (default 24): raise for big drives.
- **`temp_warn_c` / `temp_crit_c`**: thermal gating thresholds.
- **`bad_block_threshold`** (default 0): number of bad blocks surface_validate tolerates before failing the stage.
- **`retention_log_days`** (default 35): when to NULL out `burnin_stages.log_text`. Nightly job at 03:00 local.
- **`retention_backup_keep`** (default 14): how many nightly DB snapshots to keep in `/data/backups/`.

---

## Notifications

- **Daily SMTP report** at `smtp_report_hour` (default 08:00 local) with drive-level summary, failed-health banner, and a red banner listing every pool-drive unlock from the last 24 h.
- **Per-job email alerts** on pass/fail (configurable).
- **Webhook URL** posts JSON on every job state change.

Configure SMTP in Settings → Email. Includes a "Test SMTP" button.

---

## Operations

### Logs

```bash
docker logs -f nas-burnin
# JSON-structured. Filter with jq:
docker logs nas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
```

### User management

```bash
docker exec -it nas-burnin python -m app.auth_cli list
docker exec -it nas-burnin python -m app.auth_cli add 
docker exec -it nas-burnin python -m app.auth_cli reset 
```

Passwords are read from a TTY prompt and are never accepted on the command line.

### Backups

Automated nightly to `/data/backups/app-YYYY-MM-DD.db` (online `sqlite3.backup`, doesn't lock writers). To restore:

```bash
docker compose down
cp data/backups/app-2026-05-01.db data/app.db
docker compose up -d
```

### Health probe

`/health` is unauthenticated and returns 200 only when DB, poller, and SSH (when configured) all check green; 503 otherwise. Use it for container/orchestrator health checks.

```bash
curl -sf http://localhost:8084/health | jq
```

### Resetting the DB

If you need to start over:

```bash
docker compose down
sudo rm -f data/app.db data/session_secret
# keep data/settings_overrides.json if you want to preserve UI settings
docker compose up -d
```

---

## Updating dependencies

`requirements.in` is the human-edited list.
`requirements.txt` is a fully-pinned lockfile generated from it (with sha256 hashes), consumed at build time with `pip install --require-hashes`. **Never edit `requirements.txt` by hand.**

```bash
# 1. Add or change a constraint in requirements.in
$EDITOR requirements.in

# 2. Regenerate the lockfile (runs pip-compile in a clean container)
./scripts/regenerate-lockfile.sh

# 3. Review the diff: transitive bumps may be CVE fixes or breaking changes
git diff requirements.txt

# 4. Rebuild + smoke-test
docker compose up -d --build app
curl -sf http://localhost:8084/health | jq

# 5. Commit BOTH files together
git add requirements.in requirements.txt
git commit -m "deps: bump for "
```

Together with the daily security scan (`scripts/security-scan.sh`), this gives defense in depth: pinning prevents accidental breakage from upstream releases (Starlette 1.0 broke us once), `--require-hashes` defends against compromised mirrors, and `pip-audit` catches new CVEs in any pinned version after the fact.

## See also

- `CLAUDE.md`: full architecture, file map, deploy workflow, and the rationale behind every non-obvious design decision.
- `SPEC.md`: canonical feature reference per version.
- `tests/`: `python -m unittest discover tests/` (44 tests, stdlib-only).

---

## Known gaps / not-yet-built

- No multi-user RBAC: every user is effectively admin.
- No per-drive SMART attribute trend graphs (snapshots only).
- No scheduled burn-ins: jobs run immediately when queued.
- No CSRF tokens on state-changing endpoints (relies on `SameSite=Strict` session cookie).

PRs welcome.
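
As a rough cross-check of the batch figure in the Time estimate section, wall-clock time can be approximated as the number of "waves" of parallel jobs times the per-drive runtime. This is a minimal sketch, not code from the project: the helper name `estimate_batch_hours` is hypothetical, and it assumes every drive takes the same time and that parallel jobs do not slow each other down.

```python
import math


def estimate_batch_hours(n_drives: int, max_parallel: int, per_drive_hours: float) -> float:
    """Approximate wall-clock hours for a batch of identical drives.

    Hypothetical helper for illustration only. Assumes equal runtimes and no
    slowdown from parallelism, so drives finish in waves of `max_parallel`.
    """
    if n_drives < 0 or max_parallel < 1 or per_drive_hours < 0:
        raise ValueError("n_drives and per_drive_hours must be >= 0, max_parallel >= 1")
    # Each wave occupies every slot until its drives finish.
    waves = math.ceil(n_drives / max_parallel)
    return waves * per_drive_hours


# 12 x 14 TB drives, default max_parallel_burnins of 4, ~24 h per drive.
hours = estimate_batch_hours(n_drives=12, max_parallel=4, per_drive_hours=24)
print(f"{hours:.0f} h, about {hours / 24:.1f} days")  # prints "72 h, about 3.0 days"
```

The 72-hour result is a lower bound under those assumptions; the 3–4 day figure quoted earlier leaves headroom for thermal-gate waits and drives that run slower than the table suggests.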