diff --git a/README.md b/README.md
new file mode 100644
index 0000000..90e9766
--- /dev/null
+++ b/README.md
@@ -0,0 +1,242 @@
+# TrueNAS Burn-In Dashboard
+
+Web dashboard for running disciplined burn-in tests on TrueNAS drives.
+Sits next to the NAS, not on it — orchestrates `smartctl`, `badblocks`, and
+`nvme-cli` over SSH and tracks every job in SQLite.
+
+Inspired by the community `disk-burnin.sh` script (Spearfoot et al.) but
+adds: concurrent burn-ins, pool-membership safety locks, login + audit,
+live progress UI, daily email reports, and resumable state.
+
+## Stack
+
+FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No
+external services beyond your TrueNAS host. Templates and static assets
+are bind-mounted; Python source is baked into the image.
+
+---
+
+## Quick start
+
+```bash
+# 1. Configure
+cp .env.example .env
+# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
+# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.
+
+# 2. Build + run
+docker compose up -d --build
+
+# 3. Open the dashboard
+open http://localhost:8084 # or your host's IP
+
+# 4. First time: the login page renders a "Create initial admin" form.
+#    Pick a username + password (>= 8 chars). Done.
+```
+
+If you set `INITIAL_ADMIN_*` env vars *and* the users table is empty, that
+account is created on startup automatically. After that the env vars are
+ignored — change passwords from the UI ("Change password" header link) or
+the CLI (`docker exec -it truenas-burnin python -m app.auth_cli reset
+`).
+
+---
+
+## Burning in many drives at once
+
+The dashboard runs **up to `max_parallel_burnins`** burn-ins concurrently
+(configurable in Settings, default 4) and queues the rest. Submitting 14
+drives doesn't take 14 separate clicks — you submit once and the queue
+drains automatically as slots free up.
+
+### The workflow
+
+1. **Select all idle drives** — click the checkbox in the table header
+   (next to "DRIVE").
+   It auto-checks every drive that's currently
+   selectable: idle, no active SMART test, not pool-locked. Pool-locked
+   drives are intentionally excluded; if you really want to burn one of
+   them in, unlock it individually first (see [Drive locks](#drive-locks)
+   below).
+2. **Click the Burn-In button** in the batch action bar that slides up
+   from the bottom — it shows the count of selected drives.
+3. **In the batch modal**: pick the stages to run (Short SMART, Long
+   SMART, Surface Validate — drag to reorder), confirm your operator
+   name, and click Start.
+4. **All selected drives are queued** in one POST. Up to
+   `max_parallel_burnins` enter `running`; the rest sit `queued`. As each
+   running job finishes, the next queued job picks up the freed slot
+   automatically — no operator action between batches.
+5. The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."
+
+### Time estimate
+
+| Drive size | Profile | Per-drive runtime (default block size) |
+|-----------|-------------|----------------------------------------|
+| 250 GB SSD | Short + Long SMART + Surface | ~1 hour |
+| 14 TB HDD | Short + Long SMART + Surface | ~24 hours |
+| 14 TB HDD | Short + Long SMART (no surface) | ~6–8 hours |
+
+For 12× 14 TB drives at default 4-parallel: roughly **3–4 days** end-to-end.
+Bumping `surface_validate_block_size` from 4096 to 8192 in Settings cuts
+runtime roughly in half at ~2× RAM cost — matches the upstream
+`disk-burnin.sh` recommendation.
+
+### Watch out
+
+- **Stuck-job timeout** — `stuck_job_hours` (default 24) marks any job
+  past that threshold as `unknown` and kills the remote process. If
+  you're burning in 14 TB drives with default block size, raise this to
+  **48** in Settings before starting, or you'll get false positives near
+  the end of surface_validate.
+- **Thermal gate** — if drives currently under burn-in hit the
+  temperature warning threshold, new jobs wait up to 3 minutes before
+  acquiring a slot.
+  Increase `temp_warn_c` if your chassis runs hot but
+  is otherwise fine.
+
+### Cancelling
+
+Click the red ✕ next to a running job. The orchestrator:
+1. Marks the job `cancelled` in the DB.
+2. Issues `kill -9 ` over a fresh SSH session (the badblocks
+   PID is captured at launch via `sh -c 'echo PID:$$; exec ...'`).
+3. Cancels the asyncio task, releasing the semaphore slot for the next
+   queued job.
+
+Cancellations are durable — restart the container and queued jobs resume,
+cancelled jobs stay cancelled.
+
+---
+
+## Drive locks
+
+To prevent destroying live data, the dashboard refuses to start a
+destructive burn-in on drives that ZFS or the kernel reports as in use.
+Detected lock states (with the typed-confirmation token required to
+override):
+
+| State | Detected via | Confirm token |
+|---------------|---------------------------|------------------------------|
+| Active pool | `zpool list -vHP` | the pool name (e.g. `tank`) |
+| Boot pool | pool name = `boot-pool` | `DESTROY BOOT POOL` |
+| Exported ZFS | `lsblk` `zfs_member` partitions not in any active pool | `DESTROY EXPORTED POOL` |
+| Mounted FS | `findmnt -no SOURCE` | `DESTROY MOUNTED FILESYSTEM` |
+
+Detection runs every poll cycle (~12 s). On any SSH or parser failure the
+poller fails *closed*: previously-locked drives stay locked,
+previously-unlocked drives stay unlocked, until detection recovers.
+
+Unlock is in-memory only with a 10-minute TTL — bound to the
+`(pool_name, pool_role)` observed at unlock time. If a subsequent poll
+reclassifies the drive (e.g. `(exported)` → `tank` because someone
+imported the pool), the grant is invalidated automatically.
+
+Every unlock writes an audit event and appears in a red banner in the
+next daily report.
+
+---
+
+## Settings highlights
+
+All settings live under `/settings` (header link). Key knobs:
+
+- **`max_parallel_burnins`** (default 4) — semaphore cap. Restart the
+  container for changes to take effect.
+- **`surface_validate_block_size` / `_block_buffer` / `_passes`** —
+  badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour;
+  tune for speed vs paranoia.
+- **`stuck_job_hours`** (default 24) — raise for big drives.
+- **`temp_warn_c` / `temp_crit_c`** — thermal gating thresholds.
+- **`bad_block_threshold`** (default 0) — number of bad blocks
+  surface_validate tolerates before failing the stage.
+- **`retention_log_days`** (default 35) — days after which
+  `burnin_stages.log_text` is set to NULL. The cleanup job runs nightly
+  at 03:00 local.
+- **`retention_backup_keep`** (default 14) — how many nightly DB
+  snapshots to keep in `/data/backups/`.
+
+---
+
+## Notifications
+
+- **Daily SMTP report** at `smtp_report_hour` (default 08:00 local) with
+  drive-level summary, failed-health banner, and a red banner listing
+  every pool-drive unlock from the last 24 h.
+- **Per-job email alerts** on pass/fail (configurable).
+- **Webhook URL** posts JSON on every job state change.
+
+Configure SMTP in Settings → Email. The page includes a "Test SMTP" button.
+
+---
+
+## Operations
+
+### Logs
+
+```bash
+docker logs -f truenas-burnin
+# JSON-structured. Filter with jq:
+docker logs truenas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
+```
+
+### User management
+
+```bash
+docker exec -it truenas-burnin python -m app.auth_cli list
+docker exec -it truenas-burnin python -m app.auth_cli add
+docker exec -it truenas-burnin python -m app.auth_cli reset
+```
+
+Passwords are read from a TTY prompt; they are never accepted on the
+command line.
+
+### Backups
+
+Backups run nightly to `/data/backups/app-YYYY-MM-DD.db` (online
+`sqlite3.backup`, which doesn't lock writers). To restore:
+
+```bash
+docker compose down
+cp data/backups/app-2026-05-01.db data/app.db
+docker compose up -d
+```
+
+### Health probe
+
+`/health` is unauthenticated and returns 200 only when DB, poller, and
+SSH (when configured) all check green; 503 otherwise. Use it for
+container/orchestrator health checks.
+
+```bash
+curl -sf http://localhost:8084/health | jq
+```
+
+### Resetting the DB
+
+If you need to start over:
+
+```bash
+docker compose down
+sudo rm -f data/app.db data/session_secret
+# keep data/settings_overrides.json if you want to preserve UI settings
+docker compose up -d
+```
+
+---
+
+## See also
+
+- `CLAUDE.md` — full architecture, file map, deploy workflow, and the
+  rationale behind every non-obvious design decision.
+- `SPEC.md` — canonical feature reference per version.
+- `tests/` — `python -m unittest discover tests/` (44 tests, stdlib-only).
+
+---
+
+## Known gaps / not-yet-built
+
+- No multi-user RBAC — every user is effectively admin.
+- No per-drive SMART attribute trend graphs (snapshots only).
+- No scheduled burn-ins — jobs run immediately when queued.
+- No CSRF tokens on state-changing endpoints (relies on
+  `SameSite=Strict` session cookie).
+
+PRs welcome.
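
As a rough cross-check of the "Time estimate" section in the README above, the queue behaviour it describes (a fixed number of slots, with each finishing job handing its slot to the next queued one) can be modelled in a few lines of Python. This is only a sketch under stated assumptions: `drain_time_hours` is a hypothetical helper, not part of the project; it assumes FIFO ordering and known per-drive runtimes, and it ignores thermal gating, SSH overhead, and drive-to-drive variance.

```python
import heapq
from typing import Iterable


def drain_time_hours(runtimes_hours: Iterable[float], max_parallel: int = 4) -> float:
    """Wall-clock hours until the last job finishes.

    Hypothetical model: jobs start in submission order, at most
    `max_parallel` run at once, and a queued job starts the moment
    the earliest-finishing running job frees its slot.
    """
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")

    running: list[float] = []  # min-heap of finish times for occupied slots
    last_finish = 0.0
    for runtime in runtimes_hours:
        # Take a free slot at t=0, or wait for the earliest slot to free up.
        start = 0.0 if len(running) < max_parallel else heapq.heappop(running)
        finish = start + runtime
        heapq.heappush(running, finish)
        last_finish = max(last_finish, finish)
    return last_finish


# Twelve drives at ~24 h each with the default cap of 4 parallel burn-ins.
hours = drain_time_hours([24.0] * 12, max_parallel=4)
print(hours, "hours =", hours / 24, "days")
```

Under these assumptions twelve 24-hour jobs drain in 72 hours, which is the lower end of the README's 3–4 day figure; real runs will likely land above it because the model leaves out the thermal gate and per-drive variance.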