diff --git a/README.md b/README.md
index 4c0da7b..dfee4e1 100644
--- a/README.md
+++ b/README.md
@@ -106,6 +106,91 @@ Click the red ✕ next to a running job. The orchestrator:
 Cancellations are durable — restart the container and queued jobs resume,
 cancelled jobs stay cancelled.
 
+### Job states explained
+
+| State       | When it's set |
+|-------------|-------------------------------------------------------------------------------|
+| `queued`    | Submitted, waiting for a `max_parallel_burnins` slot |
+| `running`   | Actively executing some stage |
+| `passed`    | All stages finished green |
+| `failed`    | A stage failed deterministically (bad blocks > threshold, SMART failure, etc.) |
+| `cancelled` | Operator clicked ✕ |
+| `unknown`   | Job was alive but its outcome is indeterminate — see below |
+
+`unknown` fires in two situations:
+
+1. The stuck-job detector (`stuck_job_hours`, default 7 days) trips because
+   the job has been running too long without finishing.
+2. The asyncio task got cancelled mid-stage by something *other* than an
+   operator click — usually a container restart (`docker compose up -d`,
+   `--build`, or the host rebooting). Burn-in source code goes through
+   the Dockerfile `COPY`, so any source-code deploy recreates the
+   container, drops the SSH connection to TrueNAS, and would orphan the
+   running burn-in. Avoid `--build` while burn-ins are active.
+
+When `unknown` fires the drawer's per-stage Reason block shows
+*"Task cancelled mid-run — likely container restart or shutdown"* so the
+classification is explicit, not silent.
+
+---
+
+## Drive drawer
+
+Click any drive row to slide a detail drawer down from the top. Three tabs:
+
+- **Burn-In** — per-stage breakdown of the latest job
+- **SMART** — short/long test states + cached SMART attributes
+- **Events** — last 50 audit events for the drive
+
+### Surface-validate visualization
+
+For drives in a `surface_validate` stage (running or finished), the Burn-In
+tab renders:
+
+1. **Vital-signs strip** — `Start` (with date) · `Elapsed` · `ETA` (duration
+   remaining) · `Finish` (wall-clock estimate, browser-local timezone) ·
+   `Temp` (cool/warm/hot colour). Computed from data in the drawer payload;
+   ETA + Finish suppressed below 0.5% so you don't see a "Finish: Jun 22"
+   stutter at the very start.
+2. **Four pattern meters** — `0xaa` / `0x55` / `0xff` / `0x00`. Each meter
+   is split into a left half (write phase, blue) and a right half (verify
+   phase, green). Current pattern's label glows blue; completed patterns'
+   labels go green. This translates badblocks's per-phase percent into
+   monotonic 0-99% overall progress, so the bar never appears to "rewind"
+   when a new phase starts.
+3. **Phase caption** — explicit text: *"Pattern 2 of 4 · Verify 0x55 · 47%
+   within phase"*. Makes the visual grammar unambiguous.
+4. **Completed-pattern history** — once pattern 1 finishes, a chip appears
+   showing `0xaa: 14h 22m`. Lets you predict the rest of the run from the
+   first pattern's elapsed time.
+
+### Failure reason block
+
+Stages that ended `failed` / `cancelled` / `unknown` show a coloured Reason
+pill at the top of the stage section. Sources, in order of preference:
+
+1. The stage's own `error_text`
+2. The parent job's `error_text` (backfilled by the drawer when the stage's
+   own is empty — catches orphan rows from hard crashes)
+3. A heuristic: if the log is tiny and no real progress was recorded,
+   *"Stopped without recording an error — likely cause: SSH connection drop
+   or container restart while this stage was running"*
+
+Otherwise: *"No error message recorded."* — there's never a blank where you
+expect to see why something broke.
+
+### Column sorting
+
+Click any column header (Drive, Serial, Size, Temp, Health, Short SMART,
+Long SMART, Burn-In) to sort. Cycle: ascending → descending → cleared. Sort
+state persists in `localStorage` so it survives page reload AND every
+SSE-driven tbody refresh (~12 s poll cycle). Empty values always sink to
+the bottom regardless of direction.
+
+Sortable values are emitted as `data-sort-*` attributes on each ``,
+with numeric priority maps for SMART states (e.g. `running` always sorts
+ahead of `idle`).
+
 ---
 
 ## Drive locks