# NAS Burn-In Dashboard

Web dashboard for running disciplined burn-in tests on TrueNAS drives.
Sits next to the NAS, not on it: it orchestrates `smartctl`, `badblocks`, and
`nvme-cli` over SSH and tracks every job in SQLite.

Inspired by the community `disk-burnin.sh` script (Spearfoot et al.), but
adds concurrent burn-ins, pool-membership safety locks, login + audit,
live progress UI, daily email reports, and resumable state.

## Stack

FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No
external services beyond your TrueNAS host. Templates and static assets
are bind-mounted; Python source is baked into the image.

---

## Quick start

```bash
# 1. Configure
cp .env.example .env
# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.

# 2. Build + run
docker compose up -d --build

# 3. Open the dashboard
open http://localhost:8084   # or your host's IP

# 4. First time: the login page renders a "Create initial admin" form.
#    Pick a username + password (>= 8 chars). Done.
```

If you set `INITIAL_ADMIN_*` env vars *and* the users table is empty, that
account is created on startup automatically. After that the env vars are
ignored; change passwords from the UI ("Change password" header link) or
the CLI (`docker exec -it truenas-burnin python -m app.auth_cli reset <username>`).

---

## Burning in many drives at once

The dashboard runs **up to `max_parallel_burnins`** burn-ins concurrently
(configurable in Settings, default 4) and queues the rest. Submitting 14
drives doesn't take 14 separate clicks: you submit once and the queue
drains automatically as slots free up.

### The workflow

1. **Select all idle drives**: click the checkbox in the table header
   (next to "DRIVE"). It auto-checks every drive that's currently
   selectable: idle, no active SMART test, not pool-locked. Pool-locked
   drives are intentionally excluded; if you really want to burn one of
   them in, unlock it individually first (see [Drive locks](#drive-locks)
   below).
2. **Click the Burn-In button** in the batch action bar that slides up
   from the bottom; it shows the count of selected drives.
3. **In the batch modal**: pick the stages to run (Short SMART, Long
   SMART, Surface Validate; drag to reorder), confirm your operator
   name, and click Start.
4. **All selected drives are queued** in one POST. Up to
   `max_parallel_burnins` enter `running`; the rest sit `queued`. As each
   running job finishes, the next queued job picks up the freed slot
   automatically, with no operator action between batches.
5. The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."

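
As a rough illustration of the queueing in step 4, the sketch below caps concurrency with an `asyncio.Semaphore`. It is not the dashboard's actual orchestrator: `run_burnin`, `submit_batch`, and the `stats` dict are hypothetical names, and the short sleep stands in for the real SMART and badblocks stages.

```python
import asyncio

MAX_PARALLEL_BURNINS = 4  # mirrors the documented default; the constant name is illustrative


async def run_burnin(drive: str, slots: asyncio.Semaphore,
                     states: dict, stats: dict) -> None:
    """Hypothetical job runner: wait for a free slot, then 'burn in' the drive."""
    states[drive] = "queued"
    async with slots:                      # waits here while all slots are taken
        states[drive] = "running"
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)          # stand-in for the real test stages
        stats["running"] -= 1
        states[drive] = "passed"
    # leaving the `async with` block frees the slot for the next queued job


async def submit_batch(drives: list) -> tuple:
    """Queue every drive at once; return final states and peak concurrency."""
    slots = asyncio.Semaphore(MAX_PARALLEL_BURNINS)
    states: dict = {}
    stats = {"running": 0, "peak": 0}
    await asyncio.gather(*(run_burnin(d, slots, states, stats) for d in drives))
    return states, stats["peak"]
```

Calling `asyncio.run(submit_batch([...]))` with twelve drive names should never show more than four jobs in `running` at once, which is the behaviour the workflow above relies on.
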
### Time estimate

| Drive size | Profile | Per-drive runtime (default block size) |
|------------|---------|----------------------------------------|
| 250 GB SSD | Short + Long SMART + Surface | ~1 hour |
| 14 TB HDD | Short + Long SMART + Surface | ~24 hours |
| 14 TB HDD | Short + Long SMART (no surface) | ~6–8 hours |

For 12× 14 TB drives at default 4-parallel: roughly **3–4 days** end-to-end
(three waves of four drives at ~24 hours each, plus slack).
Bumping `surface_validate_block_size` from 4096 to 8192 in Settings cuts
runtime roughly in half at ~2× RAM cost, matching the upstream
`disk-burnin.sh` recommendation.

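
The end-to-end figure is simple arithmetic if you assume every drive takes about the same time, so the queue effectively drains in waves of `max_parallel_burnins`. The helper below is a back-of-the-envelope sketch; `estimate_batch_hours` is a hypothetical name, and it ignores thermal-gate waits and drive-to-drive variation.

```python
import math


def estimate_batch_hours(n_drives: int, per_drive_hours: float,
                         max_parallel: int = 4) -> float:
    """Rough wall-clock estimate for a batch of similar drives."""
    if n_drives <= 0:
        return 0.0
    if max_parallel <= 0:
        raise ValueError("max_parallel must be positive")
    waves = math.ceil(n_drives / max_parallel)  # full or partial groups of drives
    return waves * per_drive_hours
```

For example, `estimate_batch_hours(12, 24)` gives 72 hours (3 days), which lines up with the lower end of the range quoted above.
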
### Watch out

- **Stuck-job timeout**: `stuck_job_hours` (default 24) marks any job
  past that threshold as `unknown` and kills the remote process. If
  you're burning in 14 TB drives with default block size, raise this to
  **48** in Settings before starting, or you'll get false positives near
  the end of surface_validate.
- **Thermal gate**: if drives currently under burn-in hit the
  temperature warning threshold, new jobs wait up to 3 minutes before
  acquiring a slot. Increase `temp_warn_c` if your chassis runs hot but
  is otherwise fine.

### Cancelling

Click the red ✕ next to a running job. The orchestrator:

1. Marks the job `cancelled` in the DB.
2. Issues `kill -9 <remote_pid>` over a fresh SSH session (the badblocks
   PID is captured at launch via `sh -c 'echo PID:$$; exec ...'`).
3. Cancels the asyncio task, releasing the semaphore slot for the next
   queued job.

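
The PID-capture trick in step 2 works because `exec` replaces the shell with the target command, so the PID printed by `$$` is the PID of the command itself. The sketch below shows the idea on a local process with `subprocess`; the dashboard does this on the remote host over SSH, and `launch_with_pid` and `cancel` are hypothetical names used only for illustration (POSIX only).

```python
import os
import signal
import subprocess


def launch_with_pid(command: list) -> tuple:
    """Start `command` via sh, which prints its own PID and then exec()s the command."""
    script = 'echo "PID:$$"; exec "$@"'
    proc = subprocess.Popen(
        ["sh", "-c", script, "sh", *command],   # the extra "sh" fills $0; "$@" is the command
        stdout=subprocess.PIPE,
        text=True,
    )
    first_line = proc.stdout.readline().strip()
    if not first_line.startswith("PID:"):
        proc.kill()
        raise RuntimeError(f"expected a PID line, got {first_line!r}")
    return proc, int(first_line[len("PID:"):])


def cancel(pid: int) -> None:
    """Equivalent of `kill -9 <pid>`."""
    os.kill(pid, signal.SIGKILL)
```
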

Cancellations are durable: restart the container and queued jobs resume,
while cancelled jobs stay cancelled.

---

## Drive locks

To prevent destroying live data, the dashboard refuses to start a
destructive burn-in on drives that ZFS or the kernel reports as in use.
Detected lock states (with the typed-confirmation token required to
override):

| State | Detected via | Confirm token |
|-------|--------------|---------------|
| Active pool | `zpool list -vHP` | the pool name (e.g. `tank`) |
| Boot pool | pool name = `boot-pool` | `DESTROY BOOT POOL` |
| Exported ZFS | `lsblk` `zfs_member` partitions not in any active pool | `DESTROY EXPORTED POOL` |
| Mounted FS | `findmnt -no SOURCE` | `DESTROY MOUNTED FILESYSTEM` |

Detection runs every poll cycle (~12 s). On any SSH or parser failure the
poller fails *closed* by holding the last known state: previously-locked
drives stay locked and previously-unlocked drives stay unlocked until
detection recovers.

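
The confirm-token column in the table above can be read as a small lookup. The sketch below is one way to express it; the role names (`"active"`, `"boot"`, `"exported"`, `"mounted"`), the function names, and the exact, case-sensitive comparison are assumptions made for illustration, not the dashboard's real identifiers or matching rules.

```python
from typing import Optional

# Fixed phrases taken from the table; the dictionary keys are hypothetical role names.
FIXED_TOKENS = {
    "boot": "DESTROY BOOT POOL",
    "exported": "DESTROY EXPORTED POOL",
    "mounted": "DESTROY MOUNTED FILESYSTEM",
}


def required_token(role: str, pool_name: Optional[str] = None) -> str:
    """Return the phrase the operator must type to override a lock."""
    if role == "active":
        if not pool_name:
            raise ValueError("an active-pool lock needs the pool name")
        return pool_name                     # e.g. "tank"
    return FIXED_TOKENS[role]                # KeyError for unknown roles


def confirm_matches(role: str, typed: str, pool_name: Optional[str] = None) -> bool:
    """Exact comparison (assumed): the typed text must equal the token."""
    return typed == required_token(role, pool_name)
```
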

Unlock is in-memory only with a 10-minute TTL, bound to the
`(pool_name, pool_role)` observed at unlock time. If a subsequent poll
reclassifies the drive (e.g. `(exported)` → `tank` because someone
imported the pool), the grant is invalidated automatically.

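
A minimal sketch of such a grant, assuming it is checked against the latest poll result each time a burn-in is about to start. `UnlockGrant` and `grant_is_valid` are hypothetical names; only the 10-minute TTL and the `(pool_name, pool_role)` binding come from the description above.

```python
import time
from dataclasses import dataclass
from typing import Optional

UNLOCK_TTL_SECONDS = 10 * 60  # the documented 10-minute TTL


@dataclass(frozen=True)
class UnlockGrant:
    pool_name: str      # classification observed when the operator unlocked
    pool_role: str
    granted_at: float   # time.monotonic() at unlock time


def grant_is_valid(grant: UnlockGrant, current_pool_name: str,
                   current_pool_role: str, now: Optional[float] = None) -> bool:
    """A grant holds only while it is fresh and the drive's classification is unchanged."""
    now = time.monotonic() if now is None else now
    if now - grant.granted_at >= UNLOCK_TTL_SECONDS:
        return False    # expired
    return (grant.pool_name, grant.pool_role) == (current_pool_name, current_pool_role)
```

Because the grant lives in a plain in-memory object, a container restart discards it, which matches the "in-memory only" behaviour.
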

Every unlock writes an audit event and surfaces in the next daily report
in a red banner.

---

## Settings highlights

All settings live under `/settings` (header link). Key knobs:

- **`max_parallel_burnins`** (default 4): semaphore cap. Restart the
  container for changes to take effect.
- **`surface_validate_block_size` / `_block_buffer` / `_passes`**:
  badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour;
  tune for speed vs paranoia.
- **`stuck_job_hours`** (default 24): raise for big drives.
- **`temp_warn_c` / `temp_crit_c`**: thermal gating thresholds.
- **`bad_block_threshold`** (default 0): number of bad blocks
  surface_validate tolerates before failing the stage.
- **`retention_log_days`** (default 35): age in days after which
  `burnin_stages.log_text` is set to NULL. Nightly job at 03:00 local.
- **`retention_backup_keep`** (default 14): how many nightly DB
  snapshots to keep in `/data/backups/`.

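
To make the mapping between the three `surface_validate_*` settings and the badblocks flags concrete, here is a hedged sketch that builds the argument list. `badblocks_args` is a hypothetical helper; it covers only `-b`, `-c`, and `-p`, and deliberately omits whatever mode and progress flags the dashboard actually passes.

```python
def badblocks_args(device: str, *, block_size: int, block_buffer: int,
                   passes: int) -> list:
    """Translate the three settings into badblocks -b / -c / -p arguments."""
    if block_size <= 0 or block_buffer <= 0 or passes < 0:
        raise ValueError("block_size and block_buffer must be > 0, passes >= 0")
    return [
        "badblocks",
        "-b", str(block_size),     # surface_validate_block_size
        "-c", str(block_buffer),   # surface_validate_block_buffer (blocks tested at a time)
        "-p", str(passes),         # surface_validate_passes
        device,
    ]
```
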

---

## Notifications

- **Daily SMTP report** at `smtp_report_hour` (default 08:00 local) with
  drive-level summary, failed-health banner, and a red banner listing
  every pool-drive unlock from the last 24 h.
- **Per-job email alerts** on pass/fail (configurable).
- **Webhook URL** posts JSON on every job state change.

Configure SMTP in Settings → Email. The page includes a "Test SMTP" button.

---

## Operations

### Logs

```bash
docker logs -f truenas-burnin
# JSON-structured. Filter with jq:
docker logs truenas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
```

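
If `jq` is not available, roughly the same filter can be written in Python. This sketch assumes each log record is a JSON object with the `ts`, `level`, and `msg` fields used in the jq expression above; `format_log_lines` is a hypothetical helper, and non-JSON lines are skipped in the same spirit as `fromjson?`.

```python
import json
from typing import Iterable, Iterator


def format_log_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield 'ts level msg' for every line that parses as a JSON object."""
    for raw in lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue                      # plain-text line, e.g. a startup banner
        if not isinstance(record, dict):
            continue                      # valid JSON but not a log record
        yield f"{record.get('ts')} {record.get('level')} {record.get('msg')}"
```

Feeding it `sys.stdin` lets it sit at the end of the same `docker logs truenas-burnin 2>&1 |` pipeline.
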
### User management

```bash
docker exec -it truenas-burnin python -m app.auth_cli list
docker exec -it truenas-burnin python -m app.auth_cli add <username>
docker exec -it truenas-burnin python -m app.auth_cli reset <username>
```

Passwords are read from a TTY prompt and are never accepted on the command
line.

### Backups

Automated nightly to `/data/backups/app-YYYY-MM-DD.db` (online
`sqlite3.backup`, doesn't lock writers). To restore:

```bash
docker compose down
cp data/backups/app-2026-05-01.db data/app.db
docker compose up -d
```

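
For readers curious how an online snapshot plus retention can work, the sketch below uses Python's `sqlite3.Connection.backup` and prunes old files by name. It is an illustration under assumptions, not the project's nightly job: `snapshot_db` and `prune_backups` are hypothetical names, and only the `app-YYYY-MM-DD.db` naming and the keep-count of 14 come from this README.

```python
import sqlite3
from pathlib import Path


def snapshot_db(live_path: Path, backup_dir: Path, stamp: str) -> Path:
    """Copy the live database to backup_dir/app-<stamp>.db using the online backup API."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"app-{stamp}.db"
    src = sqlite3.connect(live_path)
    dst = sqlite3.connect(target)
    try:
        src.backup(dst)                  # page-by-page copy of the source database
    finally:
        dst.close()
        src.close()
    return target


def prune_backups(backup_dir: Path, keep: int = 14) -> list:
    """Delete all but the newest `keep` snapshots; ISO dates sort chronologically by name."""
    snapshots = sorted(backup_dir.glob("app-*.db"))
    doomed = snapshots[:-keep] if keep > 0 else snapshots
    for path in doomed:
        path.unlink()
    return doomed
```
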
### Health probe

`/health` is unauthenticated and returns 200 only when DB, poller, and
SSH (when configured) all check green; 503 otherwise. Use it for
container/orchestrator health checks.

```bash
curl -sf http://localhost:8084/health | jq
```

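
Where `curl` is not installed (for example inside a slim image), the same check can be done with the standard library. This is a sketch: `is_healthy` is a hypothetical helper, and the URL simply reuses the port from the examples in this README.

```python
import urllib.error
import urllib.request


def is_healthy(url: str = "http://localhost:8084/health", timeout: float = 5.0) -> bool:
    """Return True only for an HTTP 200; a 503, a timeout, or a refused connection is False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # HTTPError (e.g. 503) is a URLError subclass; timeouts surface as OSError
        return False
```

A health-check command could then exit with `0 if is_healthy() else 1`.
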
### Resetting the DB

If you need to start over:

```bash
docker compose down
sudo rm -f data/app.db data/session_secret
# keep data/settings_overrides.json if you want to preserve UI settings
docker compose up -d
```

---

## Updating dependencies

`requirements.in` is the human-edited list. `requirements.txt` is a
fully-pinned lockfile generated from it (with sha256 hashes), consumed
at build time with `pip install --require-hashes`. **Never edit
`requirements.txt` by hand.**

```bash
# 1. Add or change a constraint in requirements.in
$EDITOR requirements.in

# 2. Regenerate the lockfile (runs pip-compile in a clean container)
./scripts/regenerate-lockfile.sh

# 3. Review the diff: transitive bumps may be CVE fixes or breaking changes
git diff requirements.txt

# 4. Rebuild + smoke-test
docker compose up -d --build app
curl -sf http://localhost:8084/health | jq

# 5. Commit BOTH files together
git add requirements.in requirements.txt
git commit -m "deps: bump <package> for <reason>"
```

This workflow plus the daily security scan (`scripts/security-scan.sh`)
gives defense-in-depth: pinning prevents accidental breakage from upstream
releases (Starlette 1.0 broke us once), `--require-hashes` defends
against compromised mirrors, and `pip-audit` catches new CVEs in any
pinned version after the fact.

## See also

- `CLAUDE.md`: full architecture, file map, deploy workflow, and the
  rationale behind every non-obvious design decision.
- `SPEC.md`: canonical feature reference per version.
- `tests/`: `python -m unittest discover tests/` (44 tests, stdlib-only).

---

## Known gaps / not-yet-built

- No multi-user RBAC: every user is effectively admin.
- No per-drive SMART attribute trend graphs (snapshots only).
- No scheduled burn-ins: jobs run immediately when queued.
- No CSRF tokens on state-changing endpoints (relies on
  `SameSite=Strict` session cookie).

PRs welcome.