nas-burnin/README.md
Brandon Walter 4922b19a9f
fix: stuck_job_hours default 24 → 168 (7 days) (1.0.0-43)
A user with 4× 14 TB WD HDDs running -w surface_validate had all
4 jobs marked 'unknown' at exactly 24h+1min — the stuck-job
detector firing on legitimate work because 14 TB at 8192-block
badblocks needs ~5+ days to complete all 4 patterns × 2 phases.

168h covers a full -w pass on 14 TB+ HDDs with margin. Anyone
running short SSDs who wants faster detection can drop the value
in Settings → Burn-in.

README warning replaced — no longer instructs users to bump the
threshold before starting big-drive burn-ins, since the default
now handles that case.

Settings UI already accepts up to 168 via the input's max=168
attribute, so no template change needed.
2026-05-08 13:23:05 -07:00

# NAS Burn-In Dashboard
Web dashboard for running disciplined burn-in tests on TrueNAS drives.
Sits next to the NAS, not on it — orchestrates `smartctl`, `badblocks`, and
`nvme-cli` over SSH and tracks every job in SQLite.
Inspired by the community `disk-burnin.sh` script (Spearfoot et al.) but
adds: concurrent burn-ins, pool-membership safety locks, login + audit,
live progress UI, daily email reports, and resumable state.
## Stack
FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No
external services beyond your TrueNAS host. Templates and static assets
are bind-mounted; Python source is baked into the image.

---
## Quick start
```bash
# 1. Configure
cp .env.example .env
# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.
# 2. Build + run
docker compose up -d --build
# 3. Open the dashboard
open http://localhost:8084 # or your host's IP
# 4. First time: the login page renders a "Create initial admin" form.
# Pick a username + password (>= 8 chars). Done.
```
If you set `INITIAL_ADMIN_*` env vars *and* the users table is empty, that
account is created on startup automatically. After that the env vars are
ignored — change passwords from the UI ("Change password" header link) or
the CLI (`docker exec -it nas-burnin python -m app.auth_cli reset <username>`).

---
## Burning in many drives at once
The dashboard runs **up to `max_parallel_burnins`** burn-ins concurrently
(configurable in Settings, default 4) and queues the rest. Submitting 14
drives doesn't take 14 separate clicks — you submit once and the queue
drains automatically as slots free up.
### The workflow
1. **Select all idle drives** — click the checkbox in the table header
(next to "DRIVE"). It auto-checks every drive that's currently
selectable: idle, no active SMART test, not pool-locked. Pool-locked
drives are intentionally excluded; if you really want to burn one of
them in, unlock it individually first (see [Drive locks](#drive-locks)
below).
2. **Click the Burn-In button** in the batch action bar that slides up
from the bottom — it shows the count of selected drives.
3. **In the batch modal**: pick the stages to run (Short SMART, Long
SMART, Surface Validate — drag to reorder), confirm your operator
name, and click Start.
4. **All selected drives are queued** in one POST. Up to
`max_parallel_burnins` enter `running`; the rest sit `queued`. As each
running job finishes, the next queued job picks up the freed slot
automatically — no operator action between batches.
5. The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."
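The queue-and-drain behaviour in steps 4 and 5 can be pictured with an `asyncio.Semaphore`. This is a minimal sketch, not the dashboard's actual orchestrator: `run_burnin`, `submit_batch`, and `peak_running` are hypothetical names, and the sleep stands in for the real SMART and badblocks stages.

```python
import asyncio

MAX_PARALLEL_BURNINS = 4  # mirrors the max_parallel_burnins setting (default 4)


async def run_burnin(drive: str, slots: asyncio.Semaphore, log: list[str]) -> None:
    """Hypothetical job wrapper: wait for a slot, run, then free the slot."""
    log.append(f"{drive} queued")
    async with slots:  # waits here while all slots are taken
        log.append(f"{drive} running")
        await asyncio.sleep(0.01)  # stand-in for the real burn-in stages
        log.append(f"{drive} done")


async def submit_batch(drives: list[str]) -> list[str]:
    """Queue every drive at once; the semaphore drains the queue on its own."""
    slots = asyncio.Semaphore(MAX_PARALLEL_BURNINS)
    log: list[str] = []
    await asyncio.gather(*(run_burnin(d, slots, log) for d in drives))
    return log


def peak_running(log: list[str]) -> int:
    """Replay the log and return the highest number of simultaneously running jobs."""
    running = peak = 0
    for entry in log:
        if entry.endswith(" running"):
            running += 1
            peak = max(peak, running)
        elif entry.endswith(" done"):
            running -= 1
    return peak


if __name__ == "__main__":
    events = asyncio.run(submit_batch([f"sd{c}" for c in "abcdefghijkl"]))
    print("peak concurrency:", peak_running(events))
```

Every job is submitted up front, but only four hold a slot at any moment; a finishing job releases its slot and the next waiter proceeds without operator input.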
### Time estimate
| Drive size | Profile | Per-drive runtime (default block size) |
|-----------|-------------|----------------------------------------|
| 250 GB SSD | Short + Long SMART + Surface | ~1 hour |
| 14 TB HDD | Short + Long SMART + Surface | ~24 hours |
| 14 TB HDD | Short + Long SMART (no surface) | ~6-8 hours |
For 12× 14 TB drives at default 4-parallel: roughly **3-4 days** end-to-end.
Bumping `surface_validate_block_size` from 4096 to 8192 in Settings cuts
runtime roughly in half at ~2× RAM cost — matches the upstream
`disk-burnin.sh` recommendation.
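As a rough cross-check of the end-to-end figure, the arithmetic can be sketched as waves of `max_parallel_burnins` drives. This assumes every drive in a wave takes the table's per-drive runtime and that waves do not overlap, so treat the result as a lower bound; `batch_runtime_hours` is a hypothetical helper, not part of the dashboard.

```python
import math


def batch_runtime_hours(drives: int, per_drive_hours: float, max_parallel: int = 4) -> float:
    """Wall-clock estimate: number of waves times the per-drive runtime."""
    waves = math.ceil(drives / max_parallel)
    return waves * per_drive_hours


if __name__ == "__main__":
    hours = batch_runtime_hours(drives=12, per_drive_hours=24)
    print(f"{hours:.0f} h = {hours / 24:.1f} days")
```

Twelve drives at four in parallel is three waves; slower drives, thermal waits, and staggered finishes push the real figure up from there.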
### Watch out
- **Stuck-job timeout** — `stuck_job_hours` (default 168 = 7 days)
marks any job past that threshold as `unknown` and kills the remote
process. The default covers `-w` surface_validate on 14 TB+ HDDs with
margin. If you're running small SSDs and want faster detection of
genuinely stuck jobs, lower it. (Earlier versions defaulted to 24 h,
which produced false positives on multi-TB drives.)
- **Thermal gate** — if drives currently under burn-in hit the
temperature warning threshold, new jobs wait up to 3 minutes before
acquiring a slot. Increase `temp_warn_c` if your chassis runs hot but
is otherwise fine.
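The stuck-job check boils down to comparing a job's age against `stuck_job_hours`. A minimal sketch, assuming the job's start time is available as a timezone-aware `datetime`; `is_stuck` is a hypothetical helper, not the dashboard's actual detector.

```python
from datetime import datetime, timedelta, timezone

STUCK_JOB_HOURS = 168  # default stuck_job_hours (7 days)


def is_stuck(started_at: datetime, now: datetime, stuck_job_hours: int = STUCK_JOB_HOURS) -> bool:
    """Return True once a running job has outlived the configured threshold."""
    return now - started_at > timedelta(hours=stuck_job_hours)


if __name__ == "__main__":
    now = datetime(2026, 5, 8, 12, 0, tzinfo=timezone.utc)
    started = now - timedelta(hours=24, minutes=1)
    print(is_stuck(started, now, stuck_job_hours=24))  # old 24 h default
    print(is_stuck(started, now))  # current 168 h default
```

A job that has run for 24 hours and one minute trips the old 24 h threshold but is well inside the 168 h default.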
### Cancelling
Click the red ✕ next to a running job. The orchestrator:
1. Marks the job `cancelled` in the DB.
2. Issues `kill -9 <remote_pid>` over a fresh SSH session (the badblocks
PID is captured at launch via `sh -c 'echo PID:$$; exec ...'`).
3. Cancels the asyncio task, releasing the semaphore slot for the next
queued job.
Cancellations are durable — restart the container and queued jobs resume,
cancelled jobs stay cancelled.
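The PID capture in step 2 relies on `exec` replacing the shell in place, so the number echoed as `PID:$$` is also the PID of the long-running process. The sketch below parses that marker line and demonstrates the trick locally with `sleep` standing in for badblocks; it assumes a POSIX `sh`, and `parse_remote_pid` and `demo_local` are hypothetical names rather than the dashboard's own code.

```python
import re
import subprocess
from typing import Optional

PID_LINE = re.compile(r"^PID:(\d+)$")


def parse_remote_pid(first_line: str) -> Optional[int]:
    """Extract the PID from the 'PID:<n>' marker line, or None if it is absent."""
    match = PID_LINE.match(first_line.strip())
    return int(match.group(1)) if match else None


def demo_local() -> bool:
    """Run the same wrapper locally and check that the echoed PID survives exec."""
    proc = subprocess.Popen(
        ["sh", "-c", "echo PID:$$; exec sleep 30"],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        pid = parse_remote_pid(proc.stdout.readline())
        return pid == proc.pid  # the shell became `sleep` without changing PID
    finally:
        proc.kill()  # local equivalent of `kill -9 <remote_pid>`
        proc.wait()
        proc.stdout.close()


if __name__ == "__main__":
    print(parse_remote_pid("PID:4242"))
    print(demo_local())
```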
---
## Drive locks
To prevent destroying live data, the dashboard refuses to start
destructive burn-in on drives ZFS or the kernel reports as in-use.
Detected lock states (with the typed-confirmation token required to
override):
| State | Detected via | Confirm token |
|---------------|---------------------------|------------------------------|
| Active pool | `zpool list -vHP` | the pool name (e.g. `tank`) |
| Boot pool | pool name = `boot-pool` | `DESTROY BOOT POOL` |
| Exported ZFS | `lsblk` `zfs_member` partitions not in any active pool | `DESTROY EXPORTED POOL` |
| Mounted FS | `findmnt -no SOURCE` | `DESTROY MOUNTED FILESYSTEM` |
Detection runs every poll cycle (~12 s). On any SSH or parser failure the
poller fails *closed*: previously-locked drives stay locked, previously-
unlocked drives stay unlocked, until detection recovers.
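The failure behaviour described above amounts to keeping the last known lock map whenever detection raises. A minimal sketch under that reading; `poll_locks`, `LockMap`, and `broken_detect` are hypothetical names, and the real poller's data model may differ.

```python
from typing import Callable, Dict

LockMap = Dict[str, bool]  # drive name -> True if locked


def poll_locks(previous: LockMap, detect: Callable[[], LockMap]) -> LockMap:
    """Return fresh lock states, or the previous ones if detection fails."""
    try:
        return detect()
    except Exception:  # SSH failure, parser error, and similar
        # Keep the last known state rather than guessing in either direction.
        return dict(previous)


def broken_detect() -> LockMap:
    """Simulated outage for the demo."""
    raise ConnectionError("ssh unreachable")


if __name__ == "__main__":
    last_known = {"sda": True, "sdb": False}
    print(poll_locks(last_known, broken_detect))
```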
Unlock is in-memory only with a 10-minute TTL — bound to the
`(pool_name, pool_role)` observed at unlock time. If a subsequent poll
reclassifies the drive (e.g. `(exported)` → `tank` because someone
imported the pool), the grant is invalidated automatically.
Every unlock writes an audit event and surfaces in the next daily report
in a red banner.
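One way to picture the unlock grant is a small immutable record that is only honoured while it is fresh and the drive's pool identity is unchanged. This is a sketch, not the dashboard's implementation: `UnlockGrant` is a hypothetical class and the role strings are illustrative placeholders.

```python
from dataclasses import dataclass

UNLOCK_TTL_SECONDS = 10 * 60  # 10-minute TTL from the text above


@dataclass(frozen=True)
class UnlockGrant:
    pool_name: str
    pool_role: str
    granted_at: float  # seconds, e.g. from time.monotonic()

    def is_valid(self, pool_name: str, pool_role: str, now: float) -> bool:
        """Valid only while fresh and still bound to the same (pool_name, pool_role)."""
        fresh = now - self.granted_at < UNLOCK_TTL_SECONDS
        same_identity = (pool_name, pool_role) == (self.pool_name, self.pool_role)
        return fresh and same_identity


if __name__ == "__main__":
    grant = UnlockGrant("(exported)", "exported", granted_at=0.0)
    print(grant.is_valid("(exported)", "exported", now=60.0))  # unchanged, within TTL
    print(grant.is_valid("tank", "active", now=60.0))  # pool was imported
    print(grant.is_valid("(exported)", "exported", now=601.0))  # TTL expired
```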
---
## Settings highlights
All settings live under `/settings` (header link). Key knobs:
- **`max_parallel_burnins`** (default 4) — semaphore cap. Restart container
for changes to take effect.
- **`surface_validate_block_size` / `_block_buffer` / `_passes`** —
badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour;
tune for speed vs paranoia.
- **`stuck_job_hours`** (default 168 = 7 days) — covers 14 TB+ HDDs;
drop for faster detection on small fast drives.
- **`temp_warn_c` / `temp_crit_c`** — thermal gating thresholds.
- **`bad_block_threshold`** (default 0) — number of bad blocks
surface_validate tolerates before failing the stage.
- **`retention_log_days`** (default 35) — when to NULL out
`burnin_stages.log_text`. Nightly job at 03:00 local.
- **`retention_backup_keep`** (default 14) — how many nightly DB
snapshots to keep in `/data/backups/`.
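For `retention_backup_keep`, the pruning rule can be sketched as "sort by date, keep the newest N". The example assumes the `app-YYYY-MM-DD.db` naming used for nightly snapshots, whose ISO dates sort correctly as plain strings; `backups_to_delete` is a hypothetical helper, not the dashboard's actual retention job.

```python
def backups_to_delete(names: list[str], keep: int = 14) -> list[str]:
    """Return the snapshot names that fall outside the newest `keep` files."""
    dated = sorted(n for n in names if n.startswith("app-") and n.endswith(".db"))
    if keep <= 0:
        return dated  # nothing is retained
    return dated[:-keep]  # empty when there are `keep` or fewer snapshots


if __name__ == "__main__":
    snapshots = [f"app-2026-05-{day:02d}.db" for day in range(1, 21)]
    print(backups_to_delete(snapshots))
```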
---
## Notifications
- **Daily SMTP report** at `smtp_report_hour` (default 08:00 local) with
drive-level summary, failed-health banner, and a red banner listing
every pool-drive unlock from the last 24 h.
- **Per-job email alerts** on pass/fail (configurable).
- **Webhook URL** posts JSON on every job state change.
Configure SMTP in Settings → Email. Includes a "Test SMTP" button.

---
## Operations
### Logs
```bash
docker logs -f nas-burnin
# JSON-structured. Filter with jq:
docker logs nas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
```
### User management
```bash
docker exec -it nas-burnin python -m app.auth_cli list
docker exec -it nas-burnin python -m app.auth_cli add <username>
docker exec -it nas-burnin python -m app.auth_cli reset <username>
```
Passwords are read from a TTY prompt and are never accepted on the
command line.
### Backups
Automated nightly to `/data/backups/app-YYYY-MM-DD.db` (online
`sqlite3.backup`, doesn't lock writers). To restore:
```bash
docker compose down
cp data/backups/app-2026-05-01.db data/app.db
docker compose up -d
```
### Health probe
`/health` is unauthenticated and returns 200 only when DB, poller, and
SSH (when configured) all check green; 503 otherwise. Use it for
container/orchestrator health checks.
```bash
curl -sf http://localhost:8084/health | jq
```
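Where a script is more convenient than `curl`, the same probe can be written with the standard library. A minimal sketch, assuming the dashboard is reachable on `localhost:8084` as in the quick start; `is_healthy` is a hypothetical helper.

```python
import urllib.error
import urllib.request


def is_healthy(url: str = "http://localhost:8084/health", timeout: float = 5.0) -> bool:
    """Return True only for HTTP 200; a 503, a timeout, or a refused connection is unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # urlopen raises HTTPError (a URLError subclass) for 503 responses.
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy() else "unhealthy")
```

Treating every failure mode as unhealthy keeps the probe conservative: an unreachable container is reported the same way as a failing check.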
### Resetting the DB
If you need to start over:
```bash
docker compose down
sudo rm -f data/app.db data/session_secret
# keep data/settings_overrides.json if you want to preserve UI settings
docker compose up -d
```
---
## Updating dependencies
`requirements.in` is the human-edited list. `requirements.txt` is a
fully-pinned lockfile generated from it (with sha256 hashes), consumed
at build time with `pip install --require-hashes`. **Never edit
`requirements.txt` by hand.**
```bash
# 1. Add or change a constraint in requirements.in
$EDITOR requirements.in
# 2. Regenerate the lockfile (runs pip-compile in a clean container)
./scripts/regenerate-lockfile.sh
# 3. Review the diff — transitive bumps may be CVE fixes or breaking changes
git diff requirements.txt
# 4. Rebuild + smoke-test
docker compose up -d --build app
curl -sf http://localhost:8084/health | jq
# 5. Commit BOTH files together
git add requirements.in requirements.txt
git commit -m "deps: bump <package> for <reason>"
```
Together with the daily security scan (`scripts/security-scan.sh`), this
gives defense-in-depth: pinning prevents accidental breakage from upstream
releases (Starlette 1.0 broke us once), `--require-hashes` defends
against compromised mirrors, and `pip-audit` catches new CVEs in any
pinned version after the fact.
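Conceptually, hash pinning means a downloaded artifact is only accepted if its SHA-256 digest matches the one recorded in the lockfile. The sketch below illustrates that idea only; it is not pip's implementation, and `matches_pinned_hash` is a hypothetical helper.

```python
import hashlib


def matches_pinned_hash(artifact: bytes, pinned_sha256: str) -> bool:
    """Accept the artifact only if its SHA-256 digest equals the pinned hex digest."""
    return hashlib.sha256(artifact).hexdigest() == pinned_sha256.lower()


if __name__ == "__main__":
    wheel = b"pretend wheel contents"
    pinned = hashlib.sha256(wheel).hexdigest()  # what the lockfile would record
    print(matches_pinned_hash(wheel, pinned))
    print(matches_pinned_hash(wheel + b"tampered", pinned))
```

A mirror that serves altered bytes produces a different digest, so the install fails instead of silently picking up the tampered file.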
## See also
- `CLAUDE.md` — full architecture, file map, deploy workflow, and the
rationale behind every non-obvious design decision.
- `SPEC.md` — canonical feature reference per version.
- `tests/`: `python -m unittest discover tests/` (65 tests, stdlib-only). Or run inside the deployed container with `scripts/run-tests.sh`.
---
## Known gaps / not-yet-built
- No multi-user RBAC — every user is effectively admin.
- No per-drive SMART attribute trend graphs (snapshots only).
- No scheduled burn-ins — jobs run immediately when queued.
- No CSRF tokens on state-changing endpoints (relies on
`SameSite=Strict` session cookie).
PRs welcome.