docs: drawer surface_validate + sorting + job states

Documents the drawer enhancements landed across 1.0.0-44 → 1.0.0-51:

- Job states section explains passed / failed / cancelled / unknown, including when 'unknown' fires (stuck-job timeout OR container restart cancelling the asyncio task).
- Drive drawer section covers the new surface_validate visualization: vital-signs strip (Start / Elapsed / ETA / Finish / Temp), four per-pattern meters with split write/verify halves, phase caption, completed-pattern duration history.
- Failure reason block describes the three-tier source resolution (stage error_text → job error_text → heuristic) and what shows up when none is available.
- Column sorting describes the click-to-cycle behaviour and the localStorage persistence that survives SSE refreshes.

Plus an explicit warning: don't `--build` while burn-ins are running (now classified `unknown` instead of `failed`, but still better to avoid the kill in the first place).

parent 659f540270
commit 2107981cf1

1 changed file with 85 additions and 0 deletions: README.md

@@ -106,6 +106,91 @@ Click the red ✕ next to a running job. The orchestrator:

Cancellations are durable: restart the container and queued jobs resume,
while cancelled jobs stay cancelled.

### Job states explained

| State       | When it's set                                                                  |
|-------------|--------------------------------------------------------------------------------|
| `queued`    | Submitted, waiting for a `max_parallel_burnins` slot                           |
| `running`   | Actively executing some stage                                                  |
| `passed`    | All stages finished green                                                      |
| `failed`    | A stage failed deterministically (bad blocks > threshold, SMART failure, etc.) |
| `cancelled` | Operator clicked ✕                                                             |
| `unknown`   | Job was alive but its outcome is indeterminate (see below)                     |

`unknown` fires in two situations:

1. The stuck-job detector (`stuck_job_hours`, default 7 days, i.e. 168 hours)
   trips because the job has been running too long without finishing.
2. The asyncio task was cancelled mid-stage by something *other* than an
   operator click, usually a container restart (`docker compose up -d`,
   `--build`, or a host reboot). Burn-in source code reaches the image
   through the Dockerfile `COPY`, so any source-code deploy recreates the
   container, drops the SSH connection to TrueNAS, and orphans the
   running burn-in. Avoid `--build` while burn-ins are active.

When `unknown` fires, the drawer's per-stage Reason block shows
*"Task cancelled mid-run — likely container restart or shutdown"* so the
classification is explicit, not silent.

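
The classification rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not the orchestrator's actual code: the `Job` dataclass, its `operator_cancelled` flag, and the `run_stage`, `is_stuck`, and `demo` helpers are hypothetical names. Only the state names, the `stuck_job_hours` setting, and the Reason text come from this README.

```python
import asyncio
from dataclasses import dataclass
from datetime import datetime, timedelta

RESTART_REASON = "Task cancelled mid-run — likely container restart or shutdown"


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative only."""
    state: str = "running"
    error_text: str = ""
    operator_cancelled: bool = False  # assumed to be set by the ✕ click handler


def is_stuck(started: datetime, now: datetime, stuck_job_hours: int = 168) -> bool:
    """Situation 1: the job has run longer than stuck_job_hours (7 days = 168 h)."""
    return now - started > timedelta(hours=stuck_job_hours)


async def run_stage(job: Job, stage) -> None:
    """Await one stage and map its outcome onto a terminal job state."""
    try:
        await stage
        job.state = "passed"
    except asyncio.CancelledError:
        if job.operator_cancelled:
            job.state = "cancelled"
        else:
            # Situation 2: nobody clicked ✕, so the outcome is indeterminate.
            job.state = "unknown"
            job.error_text = RESTART_REASON
        raise  # keep normal asyncio cancellation semantics
    except Exception as exc:
        job.state = "failed"
        job.error_text = str(exc)


async def demo(by_operator: bool) -> str:
    """Cancel a long-running stage and report how the job was classified."""
    job = Job()
    task = asyncio.create_task(run_stage(job, asyncio.sleep(3600)))
    await asyncio.sleep(0)  # let the task start its stage
    job.operator_cancelled = by_operator
    task.cancel()  # stands in for either a ✕ click or a shutdown
    try:
        await task
    except asyncio.CancelledError:
        pass
    return job.state
```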

---

## Drive drawer

Click any drive row to slide a detail drawer down from the top. Three tabs:

- **Burn-In**: per-stage breakdown of the latest job
- **SMART**: short/long test states + cached SMART attributes
- **Events**: last 50 audit events for the drive

### Surface-validate visualization

For drives in a `surface_validate` stage (running or finished), the Burn-In
tab renders:

1. **Vital-signs strip**: `Start` (with date) · `Elapsed` · `ETA` (duration
   remaining) · `Finish` (wall-clock estimate, browser-local timezone) ·
   `Temp` (cool/warm/hot colour). All values are computed from data in the
   drawer payload. `ETA` and `Finish` are suppressed below 0.5% progress so
   you don't see a "Finish: Jun 22" stutter at the very start.
2. **Four pattern meters**: `0xaa` / `0x55` / `0xff` / `0x00`. Each meter
   is split into a left half (write phase, blue) and a right half (verify
   phase, green). The current pattern's label glows blue; completed
   patterns' labels go green. The meters translate badblocks's per-phase
   percent into monotonic 0-99% overall progress, so the bar never appears
   to "rewind" when a new phase starts.
3. **Phase caption**: explicit text such as *"Pattern 2 of 4 · Verify 0x55 ·
   47% within phase"*, which makes the visual grammar unambiguous.
4. **Completed-pattern history**: once pattern 1 finishes, a chip appears
   showing its duration, e.g. `0xaa: 14h 22m`. This lets you predict the
   rest of the run from the first pattern's elapsed time.

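
One way to read the meter and ETA rules above is the following Python sketch. It assumes eight equally weighted phases (four patterns, each written then verified) and a simple linear extrapolation for the time remaining; neither assumption is confirmed by the README, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

PATTERNS = ("0xaa", "0x55", "0xff", "0x00")  # order of the four meters
PHASES = ("write", "verify")                 # left half, right half
MIN_PCT_FOR_ETA = 0.5                        # below this, ETA and Finish are hidden


def overall_percent(pattern: str, phase: str, phase_pct: float) -> float:
    """Fold (pattern, phase, percent within phase) into one monotonic 0-99 value."""
    phases_done = PATTERNS.index(pattern) * len(PHASES) + PHASES.index(phase)
    total_phases = len(PATTERNS) * len(PHASES)
    pct = (phases_done + phase_pct / 100.0) / total_phases * 100.0
    return min(pct, 99.0)  # cap at 99 while the stage is still running


def eta(started: datetime, now: datetime, pct: float) -> Optional[timedelta]:
    """Time remaining by linear extrapolation; None while progress is too small."""
    if pct < MIN_PCT_FOR_ETA:
        return None
    return (now - started) * ((100.0 - pct) / pct)


def caption(pattern: str, phase: str, phase_pct: float) -> str:
    """Build the phase caption text shown under the meters."""
    number = PATTERNS.index(pattern) + 1
    return (f"Pattern {number} of {len(PATTERNS)} · "
            f"{phase.capitalize()} {pattern} · {phase_pct:.0f}% within phase")
```

Under these assumptions, the end of a write phase and the start of the matching verify phase map to the same overall value, which is what keeps the bar from rewinding.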

### Failure reason block

Stages that ended `failed` / `cancelled` / `unknown` show a coloured Reason
pill at the top of the stage section. Sources, in order of preference:

1. The stage's own `error_text`.
2. The parent job's `error_text` (backfilled by the drawer when the stage's
   own is empty; this catches orphan rows from hard crashes).
3. A heuristic: if the log is tiny and no real progress was recorded, the
   pill reads *"Stopped without recording an error — likely cause: SSH
   connection drop or container restart while this stage was running"*.

Otherwise the pill reads *"No error message recorded."*, so there is never a
blank where you expect to see why something broke.

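
The fallback order above can be expressed as a small resolver. The sketch below is illustrative only: `resolve_reason`, its parameters, and the `TINY_LOG_BYTES` threshold are hypothetical, since the README does not say how "tiny" or "no real progress" are measured.

```python
from typing import Optional

HEURISTIC = ("Stopped without recording an error — likely cause: SSH connection drop "
             "or container restart while this stage was running")
FALLBACK = "No error message recorded."
TINY_LOG_BYTES = 512  # hypothetical cut-off for a "tiny" log


def resolve_reason(stage_error: Optional[str], job_error: Optional[str],
                   log_bytes: int, progress_pct: float) -> str:
    """Pick the Reason text: stage error, then job error, then heuristic."""
    if stage_error and stage_error.strip():
        return stage_error.strip()
    if job_error and job_error.strip():
        return job_error.strip()
    if log_bytes < TINY_LOG_BYTES and progress_pct <= 0:
        return HEURISTIC
    return FALLBACK  # never return an empty string
```

Treating whitespace-only text as empty is a design choice of this sketch; it keeps a stray newline in `error_text` from hiding the parent job's message.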

### Column sorting

Click any column header (Drive, Serial, Size, Temp, Health, Short SMART,
Long SMART, Burn-In) to sort. Each click cycles ascending → descending →
cleared. Sort state persists in `localStorage`, so it survives both a page
reload and every SSE-driven `<tbody>` refresh (~12 s poll cycle). Empty
values always sink to the bottom regardless of direction.

Sortable values are emitted as `data-sort-*` attributes on each `<tr>`,
with numeric priority maps for SMART states (e.g. `running` always sorts
ahead of `idle`).

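
The sorting rules can be modelled independently of the browser. The real implementation runs client-side against `data-sort-*` attributes; the Python sketch below only mirrors the logic. The priority numbers, the `short_smart` field, and the function names are assumptions, and it reads "`running` sorts ahead of `idle`" as applying to ascending order. Empty values are split off before sorting so that reversing the direction cannot lift them off the bottom.

```python
from typing import Any, Callable, Iterable, List, Optional

# Hypothetical priorities; the README only states that running precedes idle.
SMART_PRIORITY = {"running": 0, "failed": 1, "passed": 2, "idle": 3}

# Click cycle: cleared -> ascending -> descending -> cleared.
NEXT_DIRECTION = {None: "asc", "asc": "desc", "desc": None}


def next_direction(current: Optional[str]) -> Optional[str]:
    """Return the sort direction that follows one more header click."""
    return NEXT_DIRECTION[current]


def smart_key(row: dict) -> Optional[int]:
    """Map a SMART state onto its numeric priority (None when missing)."""
    return SMART_PRIORITY.get(row.get("short_smart"))


def sort_rows(rows: Iterable[dict], key: Callable[[dict], Any],
              direction: Optional[str]) -> List[dict]:
    """Sort rows by key; rows whose key is empty always go last."""
    rows = list(rows)
    if direction is None:
        return rows  # cleared: keep the incoming order
    keyed = [(key(row), row) for row in rows]
    present = [pair for pair in keyed if pair[0] not in (None, "")]
    empty = [row for value, row in keyed if value in (None, "")]
    present.sort(key=lambda pair: pair[0], reverse=(direction == "desc"))
    return [row for _, row in present] + empty
```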

---

## Drive locks
