
# NAS Burn-In Dashboard

Web dashboard for running disciplined burn-in tests on TrueNAS drives.
Sits next to the NAS, not on it — orchestrates `smartctl`, `badblocks`, and
`nvme-cli` over SSH and tracks every job in SQLite.

Inspired by the community `disk-burnin.sh` script (Spearfoot et al.) but
adds: concurrent burn-ins, pool-membership safety locks, login + audit,
live progress UI, daily email reports, and resumable state.

## Stack

FastAPI + HTMX (SSE) + asyncssh + SQLite, in one Docker container. No
external services beyond your TrueNAS host. Templates and static assets
are bind-mounted; Python source is baked into the image.

---

## Quick start

```bash
# 1. Configure
cp .env.example .env
# edit SSH_HOST / SSH_USER / SSH_KEY (see .env.example) and, optionally,
# INITIAL_ADMIN_USERNAME / INITIAL_ADMIN_PASSWORD for first-run setup.

# 2. Build + run
docker compose up -d --build

# 3. Open the dashboard
open http://localhost:8084   # or your host's IP

# 4. First time: the login page renders a "Create initial admin" form.
#    Pick a username + password (>= 8 chars). Done.
```

If you set `INITIAL_ADMIN_*` env vars *and* the users table is empty, that
account is created on startup automatically. After that the env vars are
ignored — change passwords from the UI ("Change password" header link) or
the CLI (`docker exec -it nas-burnin python -m app.auth_cli reset
<username>`).

---

## Burning in many drives at once

The dashboard runs **up to `max_parallel_burnins`** burn-ins concurrently
(configurable in Settings, default 4) and queues the rest. Submitting 14
drives doesn't take 14 separate clicks — you submit once and the queue
drains automatically as slots free up.

### The workflow

1. **Select all idle drives** — click the checkbox in the table header
   (next to "DRIVE"). It auto-checks every drive that's currently
   selectable: idle, no active SMART test, not pool-locked. Pool-locked
   drives are intentionally excluded; if you really want to burn one of
   them in, unlock it individually first (see [Drive locks](#drive-locks)
   below).
2. **Click the Burn-In button** in the batch action bar that slides up
   from the bottom — it shows the count of selected drives.
3. **In the batch modal**: pick the stages to run (Short SMART, Long
   SMART, Surface Validate — drag to reorder), confirm your operator
   name, and click Start.
4. **All selected drives are queued** in one POST. Up to
   `max_parallel_burnins` enter `running`; the rest sit `queued`. As each
   running job finishes, the next queued job picks up the freed slot
   automatically — no operator action between batches.
5. The toast shows e.g. "12 burn-in(s) queued, 0 skipped, 0 pool-locked."
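
The queueing behaviour above can be sketched with an `asyncio.Semaphore` sized to `max_parallel_burnins`. This is a minimal illustration, not the dashboard's actual code: `run_batch` and the `burn_in` callable are hypothetical names.

```python
import asyncio


async def run_batch(drives, max_parallel, burn_in):
    """Run burn_in(drive) for every drive, at most max_parallel at a time.

    burn_in is a hypothetical coroutine function returning True on success.
    """
    slots = asyncio.Semaphore(max_parallel)
    states = {drive: "queued" for drive in drives}

    async def worker(drive):
        async with slots:  # blocks here while every slot is taken
            states[drive] = "running"
            ok = await burn_in(drive)
            states[drive] = "passed" if ok else "failed"
        # leaving the block frees the slot for the next queued drive

    await asyncio.gather(*(worker(drive) for drive in drives))
    return states
```

Every job is created up front, which mirrors the single POST; the semaphore alone decides which jobs are `running` and which are still `queued`.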

### Time estimate

| Drive size | Profile                         | Per-drive runtime (default block size) |
|------------|---------------------------------|----------------------------------------|
| 250 GB SSD | Short + Long SMART + Surface    | ~1 hour                                |
| 14 TB HDD  | Short + Long SMART + Surface    | ~24 hours                              |
| 14 TB HDD  | Short + Long SMART (no surface) | ~6–8 hours                             |

For 12× 14 TB drives at default 4-parallel: roughly **3–4 days** end-to-end.
Bumping `surface_validate_block_size` from 4096 to 8192 in Settings cuts
runtime roughly in half at ~2× RAM cost — matches the upstream
`disk-burnin.sh` recommendation.
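
The end-to-end figure is simple arithmetic. The sketch below assumes identical drives and takes the table's per-drive runtime as an input; `batch_wall_clock_hours` is a hypothetical helper, and thermal-gate waits or slower drives would push the real number higher.

```python
import math


def batch_wall_clock_hours(n_drives, per_drive_hours, max_parallel=4):
    """Estimate total hours when identical drives run max_parallel at a time."""
    waves = math.ceil(n_drives / max_parallel)  # full "rounds" of the queue
    return waves * per_drive_hours


# 12 drives at ~24 h each, 4 in parallel: 3 waves, 72 h, i.e. about 3 days.
print(batch_wall_clock_hours(12, 24))
```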

### Watch out

- **Stuck-job timeout** — `stuck_job_hours` (default 168 = 7 days)
  marks any job past that threshold as `unknown` and kills the remote
  process. The default covers `-w` surface_validate on 14 TB+ HDDs with
  margin. If you're only burning in small SSDs and want genuinely stuck
  jobs detected sooner, lower it. (Earlier versions defaulted to 24 h,
  which produced false positives on multi-TB drives.)
- **Thermal gate** — if drives currently under burn-in hit the
  temperature warning threshold, new jobs wait up to 3 minutes before
  acquiring a slot. Increase `temp_warn_c` if your chassis runs hot but
  is otherwise fine.
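
As a rough sketch of the stuck-job rule, assuming it compares a job's start time against `stuck_job_hours` (the real detector may use a different timestamp): `is_stuck` is a hypothetical name.

```python
from datetime import datetime, timedelta


def is_stuck(started_at: datetime, now: datetime, stuck_job_hours: int = 168) -> bool:
    """True once a job has been running longer than the configured threshold."""
    return now - started_at > timedelta(hours=stuck_job_hours)
```

With the old 24 h default, a healthy multi-day surface pass would already count as stuck; at 168 h it does not.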

### Cancelling

Click the red ✕ next to a running job. The orchestrator:
1. Marks the job `cancelled` in the DB.
2. Issues `kill -9 <remote_pid>` over a fresh SSH session (the badblocks
   PID is captured at launch via `sh -c 'echo PID:$$; exec ...'`).
3. Cancels the asyncio task, releasing the semaphore slot for the next
   queued job.

Cancellations are durable — restart the container and queued jobs resume,
cancelled jobs stay cancelled.
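
The PID-capture trick in step 2 works because `exec` replaces the shell with the target program without changing the process ID, so the `$$` printed first is the PID to kill later. The sketch below demonstrates this locally with `subprocess` on a POSIX system; the dashboard does the equivalent over SSH, and `launch_with_pid` and `parse_pid` are hypothetical names.

```python
import re
import subprocess

PID_LINE = re.compile(r"^PID:(\d+)$")


def parse_pid(line):
    """Return the PID from a 'PID:<n>' line, or None for any other output."""
    match = PID_LINE.match(line.strip())
    return int(match.group(1)) if match else None


def launch_with_pid(command):
    """Start a trusted command via sh; return (process, PID the shell reported)."""
    proc = subprocess.Popen(
        ["sh", "-c", f"echo PID:$$; exec {command}"],
        stdout=subprocess.PIPE,
        text=True,
    )
    first_line = proc.stdout.readline()  # the shell prints its PID before exec
    return proc, parse_pid(first_line)
```

Because the command string is interpolated into a shell line, this pattern is only safe for commands the application builds itself, never for operator-supplied text.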

### Job states explained

| State       | When it's set                                                                 |
|-------------|-------------------------------------------------------------------------------|
| `queued`    | Submitted, waiting for a `max_parallel_burnins` slot                          |
| `running`   | Actively executing some stage                                                 |
| `passed`    | All stages finished green                                                     |
| `failed`    | A stage failed deterministically (bad blocks > threshold, SMART failure, etc.) |
| `cancelled` | Operator clicked ✕                                                            |
| `unknown`   | Job was alive but its outcome is indeterminate — see below                    |

`unknown` fires in two situations:

1. The stuck-job detector (`stuck_job_hours`, default 7 days) trips because
   the job has been running too long without finishing.
2. The asyncio task got cancelled mid-stage by something *other* than an
   operator click, usually a container restart (`docker compose up -d`,
   `--build`, or a host reboot). The dashboard's Python source is copied
   into the image by the Dockerfile's `COPY`, so any source-code deploy
   recreates the container, drops the SSH connection to TrueNAS, and
   orphans the running burn-in. Avoid `--build` while burn-ins are active.

When `unknown` fires, the drawer's per-stage Reason block shows
*"Task cancelled mid-run — likely container restart or shutdown"* so the
classification is explicit, not silent.
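
One way to tell an operator cancel from any other task cancellation is to record operator cancels before cancelling the task, then check that record when `asyncio.CancelledError` arrives. This is a sketch under that assumption; `run_job`, the `job` dict, and the `operator_cancelled` set are hypothetical.

```python
import asyncio


async def run_job(job, stage, operator_cancelled):
    """Run one stage coroutine and record the resulting state on the job dict."""
    job["state"] = "running"
    try:
        ok = await stage()
        job["state"] = "passed" if ok else "failed"
    except asyncio.CancelledError:
        if job["id"] in operator_cancelled:
            job["state"] = "cancelled"  # the operator clicked the cancel button
        else:
            job["state"] = "unknown"    # e.g. the event loop is shutting down
        raise  # let the cancellation propagate as asyncio expects
```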

---

## Drive drawer

Click any drive row to slide a detail drawer down from the top. Three tabs:

- **Burn-In** — per-stage breakdown of the latest job
- **SMART** — short/long test states + cached SMART attributes
- **Events** — last 50 audit events for the drive

### Surface-validate visualization

For drives in a `surface_validate` stage (running or finished), the Burn-In
tab renders:

1. **Vital-signs strip** — `Start` (with date) · `Elapsed` · `ETA` (duration
   remaining) · `Finish` (wall-clock estimate, browser-local timezone) ·
   `Temp` (cool/warm/hot colour). Computed from data in the drawer payload;
   ETA and Finish are suppressed below 0.5% progress so you don't see a
   "Finish: Jun 22" stutter at the very start.
2. **Four pattern meters** — `0xaa` / `0x55` / `0xff` / `0x00`. Each meter
   is split into a left half (write phase, blue) and a right half (verify
   phase, green). Current pattern's label glows blue; completed patterns'
   labels go green. This translates badblocks's per-phase percent into
   monotonic 0-99% overall progress, so the bar never appears to "rewind"
   when a new phase starts.
3. **Phase caption** — explicit text: *"Pattern 2 of 4 · Verify 0x55 · 47%
   within phase"*. Makes the visual grammar unambiguous.
4. **Completed-pattern history** — once pattern 1 finishes, a chip appears
   showing `0xaa: 14h 22m`. Lets you predict the rest of the run from the
   first pattern's elapsed time.
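
The per-phase to overall translation in item 2 can be sketched as follows, assuming the eight phases (four patterns, write then verify) are weighted equally; the dashboard's real weighting may differ, and `overall_percent` is a hypothetical name.

```python
PATTERNS = ("0xaa", "0x55", "0xff", "0x00")
PHASES = ("write", "verify")


def overall_percent(pattern, phase, phase_percent):
    """Map badblocks' per-phase percent onto one monotonic 0-99% scale."""
    phases_done = PATTERNS.index(pattern) * len(PHASES) + PHASES.index(phase)
    total_phases = len(PATTERNS) * len(PHASES)
    percent = (phases_done + phase_percent / 100.0) / total_phases * 100.0
    return min(percent, 99.0)  # hold at 99 until the stage reports its result


# "Pattern 2 of 4, Verify 0x55, 47% within phase": 3 phases done plus 0.47 of the 4th.
print(overall_percent("0x55", "verify", 47))
```

Each phase boundary lands on the same value from both sides (100% of write equals 0% of verify), which is why the bar never rewinds.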

### Failure reason block

Stages that ended `failed` / `cancelled` / `unknown` show a coloured Reason
pill at the top of the stage section. Sources, in order of preference:

1. The stage's own `error_text`
2. The parent job's `error_text` (backfilled by the drawer when the stage's
   own is empty — catches orphan rows from hard crashes)
3. A heuristic: if the log is tiny and no real progress was recorded,
   *"Stopped without recording an error — likely cause: SSH connection drop
   or container restart while this stage was running"*

Otherwise: *"No error message recorded."* — there's never a blank where you
expect to see why something broke.
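
The fallback order reads naturally as a small function. This is a sketch only: `resolve_reason`, the 200-byte "tiny log" cutoff, and the shortened heuristic message are assumptions for illustration.

```python
HEURISTIC_REASON = "Stopped without recording an error"  # the UI text is longer
NO_REASON = "No error message recorded."


def resolve_reason(stage_error, job_error, log_text, progress_percent,
                   tiny_log_bytes=200):
    """Pick the Reason text using the preference order described above."""
    if stage_error:
        return stage_error
    if job_error:  # backfilled from the parent job
        return job_error
    log_is_tiny = len(log_text or "") < tiny_log_bytes
    if log_is_tiny and not progress_percent:
        return HEURISTIC_REASON
    return NO_REASON
```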

### Column sorting

Click any column header (Drive, Serial, Size, Temp, Health, Short SMART,
Long SMART, Burn-In) to sort. Cycle: ascending → descending → cleared. Sort
state persists in `localStorage` so it survives page reload AND every
SSE-driven tbody refresh (~12 s poll cycle). Empty values always sink to
the bottom regardless of direction.

Sortable values are emitted as `data-sort-*` attributes on each `<tr>`,
with numeric priority maps for SMART states (e.g. `running` always sorts
ahead of `idle`).
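
The sorting itself runs in the browser, but the "empty values sink" rule is easy to show in Python. In this sketch, `sort_rows` is a hypothetical helper and rows are plain dicts standing in for the `data-sort-*` attributes.

```python
def sort_rows(rows, key, descending=False):
    """Sort rows by key; rows with no value always go last, in either direction."""
    def is_empty(row):
        return row.get(key) in (None, "")

    present = [row for row in rows if not is_empty(row)]
    empty = [row for row in rows if is_empty(row)]
    present.sort(key=lambda row: row[key], reverse=descending)
    return present + empty
```

Splitting the empties out before sorting is what keeps them at the bottom; a plain reversed sort would float them to the top when descending.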

---

## Drive locks

To prevent destroying live data, the dashboard refuses to start a
destructive burn-in on drives that ZFS or the kernel reports as in use.
Detected lock states (with the typed-confirmation token required to
override):

| State         | Detected via              | Confirm token                |
|---------------|---------------------------|------------------------------|
| Active pool   | `zpool list -vHP`         | the pool name (e.g. `tank`)  |
| Boot pool     | pool name = `boot-pool`   | `DESTROY BOOT POOL`          |
| Exported ZFS  | `lsblk` `zfs_member` partitions not in any active pool | `DESTROY EXPORTED POOL` |
| Mounted FS    | `findmnt -no SOURCE`      | `DESTROY MOUNTED FILESYSTEM` |
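
The typed-confirmation check from the table can be sketched as below. The tokens come from the table; the lock-state keys (`active_pool`, `boot_pool`, and so on) and the exact, case-sensitive comparison are assumptions.

```python
FIXED_TOKENS = {
    "boot_pool": "DESTROY BOOT POOL",
    "exported_zfs": "DESTROY EXPORTED POOL",
    "mounted_fs": "DESTROY MOUNTED FILESYSTEM",
}


def required_token(lock_state, pool_name=None):
    """Return the text an operator must type to override this lock."""
    if lock_state == "active_pool":
        if not pool_name:
            raise ValueError("an active-pool lock needs the pool name")
        return pool_name  # e.g. "tank"
    return FIXED_TOKENS[lock_state]


def confirm_matches(lock_state, typed, pool_name=None):
    """True only when the typed text equals the required token exactly."""
    return typed == required_token(lock_state, pool_name)
```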

Detection runs every poll cycle (~12 s). On any SSH or parser failure the
poller fails *closed*: it keeps the last known state (previously locked
drives stay locked, previously unlocked drives stay unlocked) until
detection recovers.

Unlock is in-memory only with a 10-minute TTL — bound to the
`(pool_name, pool_role)` observed at unlock time. If a subsequent poll
reclassifies the drive (e.g. `(exported)` → `tank` because someone
imported the pool), the grant is invalidated automatically.
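
A minimal sketch of such a grant, assuming it stores the observed `(pool_name, pool_role)` pair and a timestamp: `UnlockGrant` and the example role strings are hypothetical.

```python
from dataclasses import dataclass

UNLOCK_TTL_SECONDS = 10 * 60  # the 10-minute TTL described above


@dataclass(frozen=True)
class UnlockGrant:
    pool_name: str
    pool_role: str
    granted_at: float  # seconds, e.g. from time.monotonic()

    def is_valid(self, pool_name, pool_role, now):
        """Valid only while fresh and while the drive's classification is unchanged."""
        fresh = now - self.granted_at < UNLOCK_TTL_SECONDS
        unchanged = (pool_name, pool_role) == (self.pool_name, self.pool_role)
        return fresh and unchanged
```

Checking the pair on every use means a reclassified drive loses its grant without any explicit revocation step.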

Every unlock writes an audit event and surfaces in the next daily report
in a red banner.

---

## Settings highlights

All settings live under `/settings` (header link). Key knobs:

- **`max_parallel_burnins`** (default 4) — semaphore cap. Restart container
  for changes to take effect.
- **`surface_validate_block_size` / `_block_buffer` / `_passes`** —
  badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour;
  tune for speed vs paranoia.
- **`stuck_job_hours`** (default 168 = 7 days) — covers 14 TB+ HDDs;
  drop for faster detection on small fast drives.
- **`temp_warn_c` / `temp_crit_c`** — thermal gating thresholds.
- **`bad_block_threshold`** (default 0) — number of bad blocks
  surface_validate tolerates before failing the stage.
- **`retention_log_days`** (default 35) — when to NULL out
  `burnin_stages.log_text`. Nightly job at 03:00 local.
- **`retention_backup_keep`** (default 14) — how many nightly DB
  snapshots to keep in `/data/backups/`.
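
As an illustration of `retention_backup_keep`, the sketch below keeps the newest N snapshots and deletes the rest. It assumes the `app-YYYY-MM-DD.db` naming, which sorts chronologically as plain text; `prune_backups` is a hypothetical name.

```python
from pathlib import Path


def prune_backups(backup_dir, keep=14):
    """Delete all but the newest `keep` snapshots; return the removed file names."""
    snapshots = sorted(Path(backup_dir).glob("app-*.db"))  # oldest first
    stale = snapshots[:-keep] if keep > 0 else snapshots
    for path in stale:
        path.unlink()
    return [path.name for path in stale]
```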

---

## Notifications

- **Daily SMTP report** at `smtp_report_hour` (default 08:00 local) with
  drive-level summary, failed-health banner, and a red banner listing
  every pool-drive unlock from the last 24 h.
- **Per-job email alerts** on pass/fail (configurable).
- **Webhook URL** posts JSON on every job state change.

Configure SMTP in Settings → Email. Includes a "Test SMTP" button.

---

## Operations

### Logs

```bash
docker logs -f nas-burnin
# JSON-structured. Filter with jq:
docker logs nas-burnin 2>&1 | jq -rR 'fromjson? | "\(.ts) \(.level) \(.msg)"'
```

### User management

```bash
docker exec -it nas-burnin python -m app.auth_cli list
docker exec -it nas-burnin python -m app.auth_cli add <username>
docker exec -it nas-burnin python -m app.auth_cli reset <username>
```

Passwords are read from a TTY prompt and are never accepted on the
command line.

### Backups

Automated nightly to `/data/backups/app-YYYY-MM-DD.db` (online
`sqlite3.backup`, doesn't lock writers). To restore:

```bash
docker compose down
cp data/backups/app-2026-05-01.db data/app.db
docker compose up -d
```

### Health probe

`/health` is unauthenticated and returns 200 only when DB, poller, and
SSH (when configured) all check green; 503 otherwise. Use it for
container/orchestrator health checks.

```bash
curl -sf http://localhost:8084/health | jq
```
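
The 200-or-503 rule can be sketched as an aggregate over component checks, with `None` standing for "not configured" (as with SSH). The function name and the response body shape are assumptions; the real `/health` payload may look different.

```python
def health_response(checks):
    """Return (status_code, body) for component checks.

    checks maps a component name to True, False, or None (not configured).
    """
    failing = sorted(name for name, ok in checks.items() if ok is False)
    status = 200 if not failing else 503
    return status, {"ok": not failing, "failing": failing}
```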

### Resetting the DB

If you need to start over:

```bash
docker compose down
sudo rm -f data/app.db data/session_secret
# keep data/settings_overrides.json if you want to preserve UI settings
docker compose up -d
```

---

## Updating dependencies

`requirements.in` is the human-edited list. `requirements.txt` is a
fully-pinned lockfile generated from it (with sha256 hashes), consumed
at build time with `pip install --require-hashes`. **Never edit
`requirements.txt` by hand.**

```bash
# 1. Add or change a constraint in requirements.in
$EDITOR requirements.in

# 2. Regenerate the lockfile (runs pip-compile in a clean container)
./scripts/regenerate-lockfile.sh

# 3. Review the diff — transitive bumps may be CVE fixes or breaking changes
git diff requirements.txt

# 4. Rebuild + smoke-test
docker compose up -d --build app
curl -sf http://localhost:8084/health | jq

# 5. Commit BOTH files together
git add requirements.in requirements.txt
git commit -m "deps: bump <package> for <reason>"
```

Together with the daily security scan (`scripts/security-scan.sh`), this
gives defense in depth: pinning prevents accidental breakage from upstream
releases (Starlette 1.0 broke us once), `--require-hashes` defends
against compromised mirrors, and `pip-audit` catches new CVEs in any
pinned version after the fact.

## See also

- `CLAUDE.md` — full architecture, file map, deploy workflow, and the
  rationale behind every non-obvious design decision.
- `SPEC.md` — canonical feature reference per version.
- `tests/` — `python -m unittest discover tests/` (65 tests, stdlib-only). Or run inside the deployed container with `scripts/run-tests.sh`.

---

## Known gaps / not-yet-built

- No multi-user RBAC — every user is effectively admin.
- No per-drive SMART attribute trend graphs (snapshots only).
- No scheduled burn-ins — jobs run immediately when queued.
- No CSRF tokens on state-changing endpoints (relies on
  `SameSite=Strict` session cookie).

PRs welcome.