docs: drawer surface_validate + sorting + job states
Documents the drawer enhancements landed across 1.0.0-44 → 1.0.0-51: - Job states section explains passed / failed / cancelled / unknown, including when 'unknown' fires (stuck-job timeout OR container restart cancelling the asyncio task). - Drive drawer section covers the new surface_validate visualization: vital-signs strip (Start / Elapsed / ETA / Finish / Temp), four per-pattern meters with split write/verify halves, phase caption, completed-pattern duration history. - Failure reason block describes the three-tier source resolution (stage error_text → job error_text → heuristic) and what shows up when none is available. - Column sorting describes the click-to-cycle behaviour and the localStorage persistence that survives SSE refreshes. Plus an explicit warning: don't `--build` while burn-ins are running (now classified `unknown` instead of `failed` — but still better to avoid the kill in the first place).
This commit is contained in:
parent
659f540270
commit
2107981cf1
1 changed files with 85 additions and 0 deletions
85
README.md
85
README.md
|
|
@ -106,6 +106,91 @@ Click the red ✕ next to a running job. The orchestrator:
|
|||
Cancellations are durable — restart the container and queued jobs resume,
|
||||
cancelled jobs stay cancelled.
|
||||
|
||||
### Job states explained
|
||||
|
||||
| State | When it's set |
|
||||
|-------------|-------------------------------------------------------------------------------|
|
||||
| `queued` | Submitted, waiting for a `max_parallel_burnins` slot |
|
||||
| `running` | Actively executing some stage |
|
||||
| `passed` | All stages finished green |
|
||||
| `failed` | A stage failed deterministically (bad blocks > threshold, SMART failure, etc.) |
|
||||
| `cancelled` | Operator clicked ✕ |
|
||||
| `unknown` | Job was alive but its outcome is indeterminate — see below |
|
||||
|
||||
`unknown` fires in two situations:
|
||||
|
||||
1. The stuck-job detector (`stuck_job_hours`, default 7 days) trips because
|
||||
the job has been running too long without finishing.
|
||||
2. The asyncio task got cancelled mid-stage by something *other* than an
|
||||
operator click — usually a container restart (`docker compose up -d`,
|
||||
`--build`, or the host rebooting). Burn-in source code goes through
|
||||
the Dockerfile `COPY`, so any source-code deploy recreates the
|
||||
container, drops the SSH connection to TrueNAS, and would orphan the
|
||||
running burn-in. Avoid `--build` while burn-ins are active.
|
||||
|
||||
When `unknown` fires the drawer's per-stage Reason block shows
|
||||
*"Task cancelled mid-run — likely container restart or shutdown"* so the
|
||||
classification is explicit, not silent.
|
||||
|
||||
---
|
||||
|
||||
## Drive drawer
|
||||
|
||||
Click any drive row to slide a detail drawer down from the top. Three tabs:
|
||||
|
||||
- **Burn-In** — per-stage breakdown of the latest job
|
||||
- **SMART** — short/long test states + cached SMART attributes
|
||||
- **Events** — last 50 audit events for the drive
|
||||
|
||||
### Surface-validate visualization
|
||||
|
||||
For drives in a `surface_validate` stage (running or finished), the Burn-In
|
||||
tab renders:
|
||||
|
||||
1. **Vital-signs strip** — `Start` (with date) · `Elapsed` · `ETA` (duration
|
||||
remaining) · `Finish` (wall-clock estimate, browser-local timezone) ·
|
||||
`Temp` (cool/warm/hot colour). Computed from data in the drawer payload;
|
||||
ETA + Finish suppressed below 0.5% so you don't see a "Finish: Jun 22"
|
||||
stutter at the very start.
|
||||
2. **Four pattern meters** — `0xaa` / `0x55` / `0xff` / `0x00`. Each meter
|
||||
is split into a left half (write phase, blue) and a right half (verify
|
||||
phase, green). Current pattern's label glows blue; completed patterns'
|
||||
labels go green. This translates badblocks's per-phase percent into
|
||||
monotonic 0-99% overall progress, so the bar never appears to "rewind"
|
||||
when a new phase starts.
|
||||
3. **Phase caption** — explicit text: *"Pattern 2 of 4 · Verify 0x55 · 47%
|
||||
within phase"*. Makes the visual grammar unambiguous.
|
||||
4. **Completed-pattern history** — once pattern 1 finishes, a chip appears
|
||||
showing `0xaa: 14h 22m`. Lets you predict the rest of the run from the
|
||||
first pattern's elapsed time.
|
||||
|
||||
### Failure reason block
|
||||
|
||||
Stages that ended `failed` / `cancelled` / `unknown` show a coloured Reason
|
||||
pill at the top of the stage section. Sources, in order of preference:
|
||||
|
||||
1. The stage's own `error_text`
|
||||
2. The parent job's `error_text` (backfilled by the drawer when the stage's
|
||||
own is empty — catches orphan rows from hard crashes)
|
||||
3. A heuristic: if the log is tiny and no real progress was recorded,
|
||||
*"Stopped without recording an error — likely cause: SSH connection drop
|
||||
or container restart while this stage was running"*
|
||||
|
||||
Otherwise: *"No error message recorded."* — there's never a blank where you
|
||||
expect to see why something broke.
|
||||
|
||||
### Column sorting
|
||||
|
||||
Click any column header (Drive, Serial, Size, Temp, Health, Short SMART,
|
||||
Long SMART, Burn-In) to sort. Cycle: ascending → descending → cleared. Sort
|
||||
state persists in `localStorage` so it survives page reload AND every
|
||||
SSE-driven tbody refresh (~12 s poll cycle). Empty values always sink to
|
||||
the bottom regardless of direction.
|
||||
|
||||
Sortable values are emitted as `data-sort-*` attributes on each `<tr>`,
|
||||
with numeric priority maps for SMART states (e.g. `running` always sorts
|
||||
ahead of `idle`).
|
||||
|
||||
---
|
||||
|
||||
## Drive locks
|
||||
|
|
|
|||
Loading…
Add table
Reference in a new issue