docs: drawer surface_validate + sorting + job states

Documents the drawer enhancements landed across 1.0.0-44 → 1.0.0-51:

- Job states section explains passed / failed / cancelled / unknown,
  including when 'unknown' fires (stuck-job timeout OR container
  restart cancelling the asyncio task).
- Drive drawer section covers the new surface_validate visualization:
  vital-signs strip (Start / Elapsed / ETA / Finish / Temp), four
  per-pattern meters with split write/verify halves, phase caption,
  completed-pattern duration history.
- Failure reason block describes the three-tier source resolution
  (stage error_text → job error_text → heuristic) and what shows up
  when none is available.
- Column sorting describes the click-to-cycle behaviour and the
  localStorage persistence that survives SSE refreshes.

Plus an explicit warning: don't `--build` while burn-ins are running
(now classified `unknown` instead of `failed` — but still better to
avoid the kill in the first place).
Brandon Walter 2026-05-09 15:34:12 -07:00
parent 659f540270
commit 2107981cf1

@@ -106,6 +106,91 @@ Click the red ✕ next to a running job. The orchestrator:
Cancellations are durable — restart the container and queued jobs resume,
cancelled jobs stay cancelled.
### Job states explained
| State | When it's set |
|-------------|-------------------------------------------------------------------------------|
| `queued` | Submitted, waiting for a `max_parallel_burnins` slot |
| `running` | Actively executing some stage |
| `passed` | All stages finished green |
| `failed` | A stage failed deterministically (bad blocks > threshold, SMART failure, etc.) |
| `cancelled` | Operator clicked ✕ |
| `unknown` | Job was alive but its outcome is indeterminate — see below |
`unknown` fires in two situations:
1. The stuck-job detector trips because the job has run longer than
`stuck_job_hours` (the default corresponds to 7 days, i.e. 168 hours)
without finishing.
2. The asyncio task got cancelled mid-stage by something *other* than an
operator click — usually a container restart (`docker compose up -d`,
`--build`, or the host rebooting). The burn-in source code is copied into
the image by the Dockerfile `COPY`, so any source-code deploy recreates
the container, drops the SSH connection to TrueNAS, and orphans the
running burn-in. Avoid `--build` while burn-ins are active.
When `unknown` fires the drawer's per-stage Reason block shows
*"Task cancelled mid-run — likely container restart or shutdown"* so the
classification is explicit, not silent.
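
The distinction between `cancelled` and `unknown` can be illustrated with a minimal Python sketch. It assumes the orchestrator records an operator click in some flag before cancelling the task; `run_job`, `operator_cancelled`, `boom`, and `demo` are hypothetical names used only for illustration, not the project's actual code.

```python
import asyncio


async def run_job(stage, operator_cancelled: asyncio.Event) -> str:
    """Await one stage coroutine and map its outcome onto a job state."""
    try:
        await stage
    except asyncio.CancelledError:
        # A cancel the operator asked for is 'cancelled'. Any other
        # cancellation (e.g. shutdown on container restart) leaves the
        # outcome indeterminate, so it is reported as 'unknown'.
        return "cancelled" if operator_cancelled.is_set() else "unknown"
    except Exception:
        # Deterministic stage failure (hypothetical: any raised error).
        return "failed"
    return "passed"


async def boom() -> None:
    raise RuntimeError("bad blocks over threshold")


async def demo() -> list:
    results = [await run_job(asyncio.sleep(0), asyncio.Event())]
    results.append(await run_job(boom(), asyncio.Event()))

    # Operator click: set the flag, then cancel the task.
    flag = asyncio.Event()
    task = asyncio.create_task(run_job(asyncio.sleep(60), flag))
    await asyncio.sleep(0)  # let the task start its stage
    flag.set()
    task.cancel()
    results.append(await task)

    # Restart-style cancel: nobody set the flag.
    task = asyncio.create_task(run_job(asyncio.sleep(60), asyncio.Event()))
    await asyncio.sleep(0)
    task.cancel()
    results.append(await task)
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The stuck-job timeout is not modelled here; it would be a separate periodic check that compares a job's running time against `stuck_job_hours`.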
---
## Drive drawer
Click any drive row to slide a detail drawer down from the top. Three tabs:
- **Burn-In** — per-stage breakdown of the latest job
- **SMART** — short/long test states + cached SMART attributes
- **Events** — last 50 audit events for the drive
### Surface-validate visualization
For drives in a `surface_validate` stage (running or finished), the Burn-In
tab renders:
1. **Vital-signs strip**: `Start` (with date) · `Elapsed` · `ETA` (duration
remaining) · `Finish` (wall-clock estimate, browser-local timezone) ·
`Temp` (cool/warm/hot colour). All values are computed from data in the
drawer payload; `ETA` and `Finish` are suppressed below 0.5% progress so
you don't see a "Finish: Jun 22" stutter at the very start.
2. **Four pattern meters**: `0xaa` / `0x55` / `0xff` / `0x00`. Each meter
is split into a left half (write phase, blue) and a right half (verify
phase, green). The current pattern's label glows blue; completed patterns'
labels go green. The drawer translates the per-phase percent reported by
badblocks into monotonic 0-99% overall progress, so the bar never appears
to "rewind" when a new phase starts.
3. **Phase caption** — explicit text: *"Pattern 2 of 4 · Verify 0x55 · 47%
within phase"*. Makes the visual grammar unambiguous.
4. **Completed-pattern history** — once pattern 1 finishes, a chip appears
showing `0xaa: 14h 22m`. Lets you predict the rest of the run from the
first pattern's elapsed time.
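
One way to picture the per-phase to overall translation and the ETA suppression is the Python sketch below. It assumes all eight phases (four patterns, write then verify) are weighted equally and that ETA is a linear extrapolation from elapsed time; both are assumptions for illustration, and `overall_percent` and `eta_seconds` are hypothetical names.

```python
from typing import Optional

PATTERNS = ("0xaa", "0x55", "0xff", "0x00")
PHASES = ("write", "verify")


def overall_percent(pattern: str, phase: str, phase_pct: float) -> float:
    """Map a per-phase percent onto monotonic overall progress (capped at 99)."""
    # Number of phases already finished before the current one.
    done = PATTERNS.index(pattern) * len(PHASES) + PHASES.index(phase)
    total = len(PATTERNS) * len(PHASES)
    pct = (done + phase_pct / 100.0) / total * 100.0
    return min(pct, 99.0)


def eta_seconds(elapsed_s: float, overall_pct: float,
                min_pct: float = 0.5) -> Optional[float]:
    """Linear ETA estimate; None while progress is too small to trust."""
    if overall_pct < min_pct:
        return None  # suppressed, mirrors the 0.5% cut-off described above
    return elapsed_s * (100.0 - overall_pct) / overall_pct


if __name__ == "__main__":
    # "Pattern 2 of 4, Verify 0x55, 47% within phase"
    print(overall_percent("0x55", "verify", 47.0))
    print(eta_seconds(3600.0, 0.4))
```

Because each new phase starts from the number of phases already completed, the overall value can only grow as the per-phase percent resets to zero.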
### Failure reason block
Stages that ended `failed` / `cancelled` / `unknown` show a coloured Reason
pill at the top of the stage section. Sources, in order of preference:
1. The stage's own `error_text`
2. The parent job's `error_text` (backfilled by the drawer when the stage's
own is empty — catches orphan rows from hard crashes)
3. A heuristic: if the log is tiny and no real progress was recorded,
*"Stopped without recording an error — likely cause: SSH connection drop
or container restart while this stage was running"*
Otherwise: *"No error message recorded."* — there's never a blank where you
expect to see why something broke.
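
The fallback order above can be sketched as a small Python function. The function name, its parameters, and the `tiny_log_bytes` threshold are hypothetical; the source only says the heuristic applies when the log is tiny and no real progress was recorded. The heuristic message is abbreviated here.

```python
from typing import Optional

# Abbreviated; the full heuristic wording is quoted in the list above.
HEURISTIC_REASON = "Stopped without recording an error"
NO_REASON = "No error message recorded."


def resolve_reason(stage_error: Optional[str],
                   job_error: Optional[str],
                   log_bytes: int,
                   progress_pct: float,
                   tiny_log_bytes: int = 256) -> str:
    """Pick the Reason text using the three-tier preference order."""
    if stage_error:
        return stage_error            # tier 1: the stage's own error_text
    if job_error:
        return job_error              # tier 2: parent job's error_text
    if log_bytes < tiny_log_bytes and progress_pct <= 0:
        return HEURISTIC_REASON       # tier 3: heuristic guess
    return NO_REASON                  # never leave the block blank


if __name__ == "__main__":
    print(resolve_reason(None, "SSH timeout", 10_000, 12.5))
```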
### Column sorting
Click any column header (Drive, Serial, Size, Temp, Health, Short SMART,
Long SMART, Burn-In) to sort. Cycle: ascending → descending → cleared. Sort
state persists in `localStorage` so it survives page reload AND every
SSE-driven tbody refresh (~12 s poll cycle). Empty values always sink to
the bottom regardless of direction.
Sortable values are emitted as `data-sort-*` attributes on each `<tr>`,
with numeric priority maps for SMART states (e.g. `running` always sorts
ahead of `idle`).
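
The real sorting runs in the browser, but the ordering rules are easy to state in a short Python sketch: a three-step click cycle, and empty values that sink to the bottom in either direction. `next_direction` and `sort_rows` are hypothetical names, and rows are modelled as plain dicts rather than `<tr>` elements.

```python
from typing import Optional

CYCLE = (None, "asc", "desc")  # cleared -> ascending -> descending -> cleared


def next_direction(current: Optional[str]) -> Optional[str]:
    """Advance the click-to-cycle sort state by one step."""
    return CYCLE[(CYCLE.index(current) + 1) % len(CYCLE)]


def sort_rows(rows: list, key: str, direction: Optional[str]) -> list:
    """Sort rows by key; rows with an empty value always go last."""
    if direction is None:
        return list(rows)  # cleared: keep the original order
    present = [r for r in rows if r.get(key) not in (None, "")]
    empty = [r for r in rows if r.get(key) in (None, "")]
    present.sort(key=lambda r: r[key], reverse=(direction == "desc"))
    # Empties are appended after sorting, so direction never moves them up.
    return present + empty


if __name__ == "__main__":
    rows = [{"temp": 40}, {"temp": None}, {"temp": 31}]
    print([r["temp"] for r in sort_rows(rows, "temp", "desc")])
```

Keeping the empty rows out of the comparison entirely avoids having to invent a sentinel value that behaves correctly under both ascending and descending order.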
---
## Drive locks