docs: drawer surface_validate + sorting + job states

Documents the drawer enhancements landed across 1.0.0-44 → 1.0.0-51: - Job states section explains passed / failed / cancelled / unknown, including when 'unknown' fires (stuck-job timeout OR container restart cancelling the asyncio task). - Drive drawer section covers the new surface_validate visualization: vital-signs strip (Start / Elapsed / ETA / Finish / Temp), four per-pattern meters with split write/verify halves, phase caption, completed-pattern duration history. - Failure reason block describes the three-tier source resolution (stage error_text → job error_text → heuristic) and what shows up when none is available. - Column sorting describes the click-to-cycle behaviour and the localStorage persistence that survives SSE refreshes. Plus an explicit warning: don't `--build` while burn-ins are running (now classified `unknown` instead of `failed` — but still better to avoid the kill in the first place).
2026-05-09 15:34:12 -07:00 · 2026-05-09 15:34:12 -07:00 · 2107981cf1
commit 2107981cf1
parent 659f540270
1 changed files with 85 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -106,6 +106,91 @@ Click the red ✕ next to a running job. The orchestrator:
 Cancellations are durable — restart the container and queued jobs resume,
 cancelled jobs stay cancelled.
 ### Job states explained
 | State       | When it's set                                                                 |
 |-------------|-------------------------------------------------------------------------------|
 | `queued`    | Submitted, waiting for a `max_parallel_burnins` slot                          |
 | `running`   | Actively executing some stage                                                 |
 | `passed`    | All stages finished green                                                     |
 | `failed`    | A stage failed deterministically (bad blocks > threshold, SMART failure, etc.) |
 | `cancelled` | Operator clicked ✕                                                            |
 | `unknown`   | Job was alive but its outcome is indeterminate — see below                    |
 `unknown` fires in two situations:
 1. The stuck-job detector (`stuck_job_hours`, default 7 days) trips because
   the job has been running too long without finishing.
 2. The asyncio task got cancelled mid-stage by something *other* than an
   operator click — usually a container restart (`docker compose up -d`,
   `--build`, or the host rebooting). Burn-in source code goes through
   the Dockerfile `COPY`, so any source-code deploy recreates the
   container, drops the SSH connection to TrueNAS, and would orphan the
   running burn-in. Avoid `--build` while burn-ins are active.
 When `unknown` fires the drawer's per-stage Reason block shows
 *"Task cancelled mid-run — likely container restart or shutdown"* so the
 classification is explicit, not silent.
 ---
 ## Drive drawer
 Click any drive row to slide a detail drawer down from the top. Three tabs:
 - **Burn-In** — per-stage breakdown of the latest job
 - **SMART** — short/long test states + cached SMART attributes
 - **Events** — last 50 audit events for the drive
 ### Surface-validate visualization
 For drives in a `surface_validate` stage (running or finished), the Burn-In
 tab renders:
 1. **Vital-signs strip** — `Start` (with date) · `Elapsed` · `ETA` (duration
   remaining) · `Finish` (wall-clock estimate, browser-local timezone) ·
   `Temp` (cool/warm/hot colour). Computed from data in the drawer payload;
   ETA + Finish suppressed below 0.5% so you don't see a "Finish: Jun 22"
   stutter at the very start.
 2. **Four pattern meters** — `0xaa` / `0x55` / `0xff` / `0x00`. Each meter
   is split into a left half (write phase, blue) and a right half (verify
   phase, green). Current pattern's label glows blue; completed patterns'
   labels go green. This translates badblocks's per-phase percent into
   monotonic 0-99% overall progress, so the bar never appears to "rewind"
   when a new phase starts.
 3. **Phase caption** — explicit text: *"Pattern 2 of 4 · Verify 0x55 · 47%
   within phase"*. Makes the visual grammar unambiguous.
 4. **Completed-pattern history** — once pattern 1 finishes, a chip appears
   showing `0xaa: 14h 22m`. Lets you predict the rest of the run from the
   first pattern's elapsed time.
 ### Failure reason block
 Stages that ended `failed` / `cancelled` / `unknown` show a coloured Reason
 pill at the top of the stage section. Sources, in order of preference:
 1. The stage's own `error_text`
 2. The parent job's `error_text` (backfilled by the drawer when the stage's
   own is empty — catches orphan rows from hard crashes)
 3. A heuristic: if the log is tiny and no real progress was recorded,
   *"Stopped without recording an error — likely cause: SSH connection drop
   or container restart while this stage was running"*
 Otherwise: *"No error message recorded."* — there's never a blank where you
 expect to see why something broke.
 ### Column sorting
 Click any column header (Drive, Serial, Size, Temp, Health, Short SMART,
 Long SMART, Burn-In) to sort. Cycle: ascending → descending → cleared. Sort
 state persists in `localStorage` so it survives page reload AND every
 SSE-driven tbody refresh (~12 s poll cycle). Empty values always sink to
 the bottom regardless of direction.
 Sortable values are emitted as `data-sort-*` attributes on each `<tr>`,
 with numeric priority maps for SMART states (e.g. `running` always sorts
 ahead of `idle`).
 ---
 ## Drive locks