Compare commits

...

22 commits

Author SHA1 Message Date
Brandon Walter
ec636f8f3a fix: PRAGMA busy_timeout on every SQLite connection (1.0.0-60)
Jobs 60-63 ran healthy for 16h then all 4 died simultaneously with
'database is locked'. The burnin drain used _db() which set
busy_timeout=10000, but:

1. 10s was sometimes too short under heavy contention (4 burn-in
   drains writing every 5s + poller every 12s + retention scan +
   auth + lifespan = many concurrent writers).
2. OTHER aiosqlite.connect() sites (poller, retention, auth, mailer,
   routes/__init__'s SSE, burnin/__init__.py's various helpers,
   database.get_db) didn't set busy_timeout at all. Without it,
   SQLite raises 'database is locked' INSTANTLY on any contention,
   which forced concurrency back onto the drain's connection.

Fix:
- _db() busy_timeout 10000 → 60000 (60s; aggressive but right for
  this workload — brief contention spikes are normal and waiting
  beats failing).
- PRAGMA busy_timeout=60000 added on every aiosqlite.connect() site
  next to the existing PRAGMA calls. Applied via a small Python
  pass that preserves the original variable name (db / _tdb / src
  / dst etc.) and indentation.
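
A minimal sketch of the "set busy_timeout on every connection" idea, using the stdlib `sqlite3` module so it runs standalone (the project itself uses aiosqlite). The helper name `connect_with_busy_timeout` is hypothetical and not taken from the codebase; only the 60000 ms value comes from the commit.

```python
import sqlite3

BUSY_TIMEOUT_MS = 60_000  # value from the commit message; tune per workload


def connect_with_busy_timeout(path: str, timeout_ms: int = BUSY_TIMEOUT_MS) -> sqlite3.Connection:
    """Open a connection and set busy_timeout before any other statement runs."""
    conn = sqlite3.connect(path)
    # PRAGMA values cannot be bound as parameters, so coerce to int first.
    conn.execute(f"PRAGMA busy_timeout={int(timeout_ms)}")
    return conn


conn = connect_with_busy_timeout(":memory:")
# Reading the PRAGMA back returns the timeout currently in effect (milliseconds).
current = conn.execute("PRAGMA busy_timeout").fetchone()[0]
conn.close()
```

Routing every connection site through one helper like this would also avoid the "some sites forgot the PRAGMA" class of bug, at the cost of touching each call site once.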

Same restart sequence applied: rebuild container, reset 4 drives,
relaunch via loopback bypass. Jobs 64-67 are now running.

This is auto-restart #2 in 24h. Safety brake at 3.
2026-05-14 06:39:33 -04:00
Brandon Walter
7e42464016 fix: missing nonlocal on _drain's tracker vars (1.0.0-59)
After the chunk-read refactor, the inner _drain coroutine assigns
to last_db_write_ts and last_pct_sample. Without nonlocal, Python
compiles these as locals of _drain, so any READ before the first
assignment raises UnboundLocalError.

In 1.0.0-55 / -56 the bug was hidden by gather(return_exceptions=
True), which silently swallowed the exception — the drain coroutine
ended immediately, the asyncssh channel buffer filled up, and the
remote badblocks blocked on pipe_write. THAT was the actual cause
of the "parser silently never works" symptom, not anything to do
with the chunk-read or tr-pipe logic itself.

1.0.0-57 dropped the gather (single drain after merging 2>&1), which
made the next deploy surface the bug as an explicit error_text on
the surface_validate stage: "cannot access local variable
'last_db_write_ts' where it is not associated with a value".

Fix: add both vars to the nonlocal declaration. pending_log_chunks
only gets .append/.clear (no reassignment) so it doesn't need
nonlocal.
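
The scoping rule described above can be reproduced in isolation. This is a standalone sketch, not the project's `_drain`; the names `make_broken`, `make_fixed`, `last_ts`, and `pending` are illustrative only.

```python
def make_broken():
    last_ts = 0.0

    def drain(now: float) -> float:
        # The assignment two lines down makes last_ts a local of drain,
        # so this read happens before the local is bound.
        elapsed = now - last_ts
        last_ts = now
        return elapsed

    return drain


def make_fixed():
    last_ts = 0.0
    pending = []

    def drain(now: float) -> float:
        nonlocal last_ts          # rebinding the outer name needs nonlocal
        elapsed = now - last_ts
        last_ts = now
        pending.append(elapsed)   # mutation only, so no nonlocal is needed
        return elapsed

    return drain


try:
    make_broken()(5.0)
    broken_error = None
except UnboundLocalError as exc:
    broken_error = exc

fixed = make_fixed()
first, second = fixed(5.0), fixed(7.5)
```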

This is the bug that's been hiding behind all the recent parser
work. Sorry for the round trips.
2026-05-13 10:31:35 -07:00
Brandon Walter
129f233e0a fix: stdbuf -oL on the tr pipe (1.0.0-58)
1.0.0-57's tr-pipe fix delivered \n-terminated progress lines but
tr's stdout is block-buffered (4 KB chunks) when its destination
is a pipe — and the SSH channel is a pipe. At ~50 bytes per badblocks
progress line, that means ~80 lines accumulate (~6 minutes at our
throughput) before tr flushes anything.

stdbuf -oL forces tr's stdout to line-buffered mode. Each \n now
triggers a flush. Progress lines reach asyncssh as they happen.
2026-05-13 10:29:03 -07:00
Brandon Walter
7c3873dd5e fix: translate badblocks \b → \n at shell level (1.0.0-57)
The chunk-read drain in 1.0.0-55 was supposed to handle badblocks's
\b-overwrite progress format but silently never surfaced data — DB
bb_phase_pct stayed at 0, log_text stayed at 136 bytes for 26+ hours
of running burn-ins. Asyncssh stream.read(4096) behavior on this
combination of badblocks output + pipe characteristics wasn't doing
what I expected, and gather(return_exceptions=True) swallowed any
exception silently.

Fix: pipe the badblocks output through `tr '\b' '\n'` at the SHELL
level on TrueNAS, before it reaches asyncssh. Every progress update
is now a real newline-terminated line by the time we receive it.

This also lets us revert to the simpler `async for raw in stream:`
drain we had pre-1.0.0-55 — which was proven to work (it caught the
PID line and phase-transition headers, just not mid-phase progress).

Plus consolidate: 2>&1 merges stderr into stdout before tr, so we
only need ONE drain coroutine, not two. Single throttle gate
preserved.
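
To illustrate what the `tr '\b' '\n'` translation does to the stream, here is a small Python sketch that applies the same byte substitution locally. The sample bytes are synthetic and only approximate the badblocks output format; `split_progress` is a hypothetical helper, not project code.

```python
def split_progress(raw: bytes) -> list[str]:
    """Mimic `tr '\\b' '\\n'`, then split into non-empty lines."""
    translated = raw.replace(b"\b", b"\n")
    return [
        ln.decode("ascii", "replace").strip()
        for ln in translated.split(b"\n")
        if ln.strip()
    ]


# Synthetic sample: each update is followed by backspaces, never a newline,
# so a line-oriented reader would see nothing until the phase ends.
sample = (
    b"Testing with pattern 0xaa: "
    + b"0.10% done" + b"\b" * 10
    + b"0.20% done" + b"\b" * 10
)
lines = split_progress(sample)
```

After translation every progress update is its own line, which is why a plain line-by-line drain is sufficient on the receiving side.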

Recovery: after deploy, the 4 jobs that have been stuck in pipe_w
for 26h were autonomously reset via inline SQL and relaunched via
POST /api/v1/burnin/start (loopback bypass from 1.0.0-56 made this
possible without a session cookie).
2026-05-13 10:26:06 -07:00
Brandon Walter
f71ae341f5 fix: backport stages.py \b-parser fix + drawer-finish inline (uncommitted from 1.0.0-55)
The chunk-read parser fix that ships as part of 1.0.0-55 in the
running container was scp'd to maple but never reached git. Same
for the drawer-job-finish margin-left removal (request: pill
sits inline next to operator/date, not flush right).

Reconciling source with deployed state. No new behaviour — git
now matches what's been live on maple since 1.0.0-55.
2026-05-12 07:53:33 -07:00
Brandon Walter
71eac9cba0 feat: loopback auth bypass for autonomous monitor (1.0.0-56)
The autonomous burn-in monitor can't hit /api/v1/burnin/start
without a session cookie. Provisioning one externally is fragile.
Add a targeted loopback bypass: requests from 127.0.0.1 / ::1
skip the auth gate and get a synthetic admin User for audit
attribution.

Why it's safe:
- The only way to reach the app from 127.0.0.1 is a process in
  the container's network namespace (docker exec from the host).
  Anyone with that already has rm -rf access to /data, so the
  bypass doesn't widen the attack surface.
- External traffic via NPM/Authelia arrives with the docker bridge
  gateway IP as source — NOT loopback — so it keeps going through
  full auth.
- request.client.host is the raw TCP socket source, NOT
  X-Forwarded-For, so external attackers can't spoof loopback via
  headers.
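
A sketch of the loopback test the bullets above rely on. It assumes the caller passes the raw socket peer address (what the commit calls `request.client.host`) and never a header value; `is_loopback_peer` is a hypothetical name.

```python
import ipaddress
from typing import Optional


def is_loopback_peer(host: Optional[str]) -> bool:
    """True only when the socket peer address is a loopback IP literal."""
    if not host:
        return False
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a hostname), so never trust it.
        return False
```

Parsing with `ipaddress` rather than comparing strings covers the whole 127.0.0.0/8 range as well as `::1`, and rejects anything that is not an IP literal.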

The new auth.LoopbackUser() is a tiny factory (id=0, is_admin=True,
username="monitor"). Audit events from this caller will show
operator='monitor' so they're distinguishable from human admins.

Staged in source; lands at next rebuild. Authorized by user
("It's a blank NAS machine. I don't care about any drive getting
wiped out.").
2026-05-12 07:52:20 -07:00
Brandon Walter
149f2901b7 fix: throttle ALL drain-loop DB calls + drop progress noise from log (1.0.0-54)
1.0.0-52 throttled the percent/bb_phase writes but missed:

- `_is_cancelled` ran a DB query on EVERY stderr line (sub-second
  cadence × 4 concurrent burn-ins = ~10+ DB connection opens/s)
- `_append_stage_log` ran every 20 output_lines (~once per second)
  doing a quadratic `log_text || ?` concat that turns into multi-MB
  rewrites as the log grows
- `_recalculate_progress` + `_push_update` also fired per gated tick

Cumulative load kept the asyncssh drain coroutine too busy to
consume the SSH channel buffer; SSH window stalled; sshd stopped
reading the pipe; badblocks blocked on pipe_write with state=S
wchan=pipe_write. /sys/block sectors_written delta confirmed
0 disk I/O across all running drives despite 23h elapsed.

Fix:
1. Single throttle gate (BB_DB_MIN_SECONDS=5s) covers EVERY DB
   touch in the drain — cancel check, percent/phase/bb_count
   updates, throughput sample, log flush, recalc, SSE push.
   Phase transitions still bypass the throttle (rare + important).
2. Exclude "XX% done" lines from the log entirely. They were the
   dominant volume; meaningful content (pattern headers, errors,
   bad-block numbers) still gets logged via the throttled flush.
3. log_text concat still quadratic but the volume reduction makes
   it tractable — buffer to pending_log_chunks, flush on the gate.
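
The single throttle gate can be sketched as a small class. This is an illustration under assumptions: the class name `ThrottleGate`, the injectable clock, and the choice to let the very first call through are not from the source; only the 5 second interval and the "phase transitions bypass" rule are.

```python
import time
from typing import Callable, Optional


class ThrottleGate:
    """Allow an action at most once per min_seconds unless forced."""

    def __init__(self, min_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._min = min_seconds
        self._clock = clock
        self._last: Optional[float] = None

    def ready(self, force: bool = False) -> bool:
        now = self._clock()
        if force or self._last is None or now - self._last >= self._min:
            self._last = now   # opening the gate restarts the interval
            return True
        return False


# Deterministic demo with a fake clock instead of real time.
ticks = iter([0.0, 1.0, 4.9, 5.0, 5.1, 6.0])
gate = ThrottleGate(5.0, clock=lambda: next(ticks))
results = [gate.ready(), gate.ready(), gate.ready(),
           gate.ready(), gate.ready(), gate.ready(force=True)]
```

In a drain loop the idea would be to keep reading every line unconditionally and wrap all DB work in `if gate.ready(force=is_phase_transition):`, so the pipe keeps draining while writes stay bounded.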

Net effect: ~99% reduction in drain-loop DB load. asyncssh drain
keeps up; pipe drains; badblocks writes; disk goes brr.
2026-05-11 22:07:39 -07:00
Brandon Walter
c906ab15f7 feat: job-level Est. completion in drawer header (1.0.0-53)
The drawer's per-stage Finish chip is the stage's finish, not the
whole burn-in's. Added a right-aligned "Est. completion" pill in
the drawer-job-header that uses the server-weighted burnin.percent
to extrapolate the whole job's finish time (covers precheck + SMART
+ surface_validate + final_check).

Suppressed under 0.5% job progress to avoid the early-sample
overshoot we saw earlier ("Finish: Sep 22" on a fresh start).
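
A sketch of the extrapolation, assuming simple linear scaling of elapsed time by the job percent. The real pill is rendered client-side; the function name and signature here are hypothetical, and only the 0.5% cutoff comes from the commit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_PERCENT = 0.5  # below this the estimate is suppressed


def estimate_finish(started_at: datetime, now: datetime,
                    percent: float) -> Optional[datetime]:
    """Linear extrapolation of the finish time from overall job percent."""
    if percent < MIN_PERCENT:
        return None            # too early, the estimate would overshoot wildly
    elapsed = (now - started_at).total_seconds()
    if elapsed <= 0:
        return None
    total_seconds = elapsed * 100.0 / percent
    return started_at + timedelta(seconds=total_seconds)


start = datetime(2026, 5, 10, 0, 0, tzinfo=timezone.utc)
now = start + timedelta(hours=10)
finish = estimate_finish(start, now, 25.0)   # 10 h for 25% implies 40 h total
too_early = estimate_finish(start, now, 0.2)
```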

Bind-mount only (templates + static); no rebuild needed. Running
container reports 1.0.0-52 until next rebuild; this commit just
catches the source version up.
2026-05-10 22:45:04 -07:00
Brandon Walter
c5a41d0260 fix: throttle badblocks parser DB writes (1.0.0-52)
User reported sdb showing 134-day ETA. Investigation: badblocks
processes all stuck in pipe_write wchan, iostat showing 0 throughput
across all drives despite badblocks supposedly running.

Root cause: each progress line was triggering 4-6 DB transactions
(_update_stage_percent, _update_stage_bb_phase, _update_stage_bad_blocks,
_update_stage_bb_mbps, _record_bb_phase_start, _recalculate_progress).
With 4 concurrent burn-ins and sub-second progress lines, the
asyncssh _drain couldn't keep up. Drain fell behind → asyncssh
channel buffer filled → SSH window stopped advancing → sshd stopped
reading from badblocks's stdout pipe → pipe filled → badblocks
blocked on pipe_write() → no more disk I/O.

That regression came in across 1.0.0-44 → 1.0.0-47 as I added each
new persisted field. The previous per-line write path worked when
there was only one DB call; it doesn't with five-plus.

Fix: BB_DB_MIN_SECONDS=5 throttle on the DB-write path. The drain
loop still consumes every progress line (so the pipe drains
continuously), but commits to DB at most once every 5 seconds.
Phase transitions always commit immediately (rare and important —
they stamp bb_phase_history and advance the per-pattern meter).

UI impact: minimal — drawer polls drives every ~12s anyway, so the
displayed % was already at 12s resolution. The meter strip just
won't sub-tick within a 5s window.

DB load impact: 60-80x reduction during surface_validate.
2026-05-10 22:12:02 -07:00
Brandon Walter
2107981cf1 docs: drawer surface_validate + sorting + job states
Documents the drawer enhancements landed across 1.0.0-44 → 1.0.0-51:

- Job states section explains passed / failed / cancelled / unknown,
  including when 'unknown' fires (stuck-job timeout OR container
  restart cancelling the asyncio task).
- Drive drawer section covers the new surface_validate visualization:
  vital-signs strip (Start / Elapsed / ETA / Finish / Temp), four
  per-pattern meters with split write/verify halves, phase caption,
  completed-pattern duration history.
- Failure reason block describes the three-tier source resolution
  (stage error_text → job error_text → heuristic) and what shows up
  when none is available.
- Column sorting describes the click-to-cycle behaviour and the
  localStorage persistence that survives SSE refreshes.

Plus an explicit warning: don't `--build` while burn-ins are running
(now classified `unknown` instead of `failed` — but still better to
avoid the kill in the first place).
2026-05-09 15:34:12 -07:00
Brandon Walter
659f540270 fix: drop redundant stage suffix from Burn-In failed chip
Per user: '(LONG SMART)' was redundant since the LONG SMART column
already shows FAILED. Same for short SMART and surface_validate
(the dominant case — the drawer shows per-stage Reason for digging).

Suffix kept for precheck / final_check since those are rare enough
that the hint is genuinely helpful.
2026-05-09 12:33:26 -07:00
Brandon Walter
1bc1b378ab fix: cancel-mid-stage marks job 'unknown' not 'failed' (1.0.0-51)
Container restarts (uvicorn shutdown / 'docker compose up -d') were
silently classifying running burn-ins as 'failed' with empty
error_text. Two reasons converged:

1. _stage_surface_validate_ssh caught asyncio.CancelledError at the
   stage level and returned False, *swallowing* the cancel signal.
2. _run_job's outer CancelledError handler then never fired, so
   was_cancelled stayed False and the job got marked 'failed' (the
   "burn-in itself failed" classification) instead of 'unknown'
   (the honest "we don't know whether it would have passed").

Fix:
- Stage now does best-effort kill of remote badblocks (shielded so
  loop shutdown doesn't interrupt the kill), appends an [ABORTED]
  marker to the log, and re-raises CancelledError. _execute_stages
  doesn't catch it (CancelledError is BaseException, not Exception
  in 3.8+) so it propagates up to _run_job.
- _run_job's existing CancelledError handler now also reconciles
  any stage rows still recorded as 'running' by setting them to
  'unknown' with a clear error_text: "Task cancelled mid-run —
  likely container restart or shutdown". The job's error_text gets
  the same message so the drawer's Reason block has something
  specific to display, instead of falling back to the heuristic.
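
The cancel-propagation pattern described above, reduced to a runnable sketch. The coroutine names are placeholders rather than the project's functions, and the outer handler simply returns the state string to keep the demo short.

```python
import asyncio

events = []


async def cleanup():
    # Stands in for the best-effort kill of the remote process.
    events.append("kill remote process")


async def stage():
    try:
        await asyncio.sleep(3600)   # stands in for the long-running remote work
        return True
    except asyncio.CancelledError:
        # Shield the cleanup so it can finish while the task is being cancelled.
        await asyncio.shield(cleanup())
        events.append("[ABORTED]")
        raise                        # re-raise so the caller sees the cancel


async def run_job():
    try:
        ok = await stage()
        return "passed" if ok else "failed"
    except asyncio.CancelledError:
        # Outcome is indeterminate, so do not classify it as a failure.
        return "unknown"


async def main():
    task = asyncio.create_task(run_job())
    await asyncio.sleep(0)   # let the job reach the stage's await
    task.cancel()
    return await task


state = asyncio.run(main())
```

Had `stage` returned `False` instead of re-raising, `run_job` would have reported `"failed"`, which is the misclassification this commit removes.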

Future container restarts on running burn-ins will now show as
yellow "UNKNOWN" with the explicit cancel reason, matching the
existing behaviour of check_stuck_jobs() for stuck timeouts.
2026-05-09 12:32:46 -07:00
Brandon Walter
7f959e6f4c feat: prominent failure-reason block + heuristic in drawer (1.0.0-50)
When a stage ends in failed/cancelled/unknown the drawer now shows
a coloured "Reason" pill at the top of that stage's section. Three
sources, in order of preference:

1. stage.error_text (the canonical source, when set)
2. job.error_text (backfilled in the drawer endpoint when stage's
   own is empty — catches orphan rows from hard crashes like the
   pre-busy-timeout DB-locked failures)
3. Heuristic: if log_text is tiny (<500 bytes, just the START
   banner) AND no real badblocks progress was recorded, label as
   "Stopped without recording an error — likely cause: SSH
   connection drop or container restart while this stage was
   running." This catches the fingerprint of a deploy-during-burn-in
   killing the SSH session.

Otherwise: "No error message recorded." so there's never a blank
where the operator expects to see why something broke.
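
The three-tier resolution can be sketched as one function. The function name, the `had_progress` flag, and the abbreviated heuristic message are assumptions for illustration; the 500-byte cutoff and the final fallback text are from the commit.

```python
from typing import Optional

TINY_LOG_BYTES = 500
HEURISTIC_MSG = "Stopped without recording an error (likely SSH drop or container restart)"
FALLBACK_MSG = "No error message recorded."


def resolve_reason(stage_error: Optional[str], job_error: Optional[str],
                   log_text: str, had_progress: bool) -> str:
    """Pick the most specific failure reason available, in preference order."""
    if stage_error:
        return stage_error                      # 1. the stage's own error
    if job_error:
        return job_error                        # 2. backfilled from the job
    if len(log_text.encode()) < TINY_LOG_BYTES and not had_progress:
        return HEURISTIC_MSG                    # 3. heuristic fingerprint
    return FALLBACK_MSG                         # never leave the pill blank
```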

Red styling for failed, yellow for cancelled/unknown. Replaces the
inline stage-error-line for terminal states; the existing
stage-error-line still renders for non-terminal contexts.
2026-05-09 12:06:11 -07:00
Brandon Walter
28d046f42e fix: SMART overlay shows terminal states + reconciles orphans (1.0.0-49)
The Long SMART column showed "—" while the Burn-In column showed
"FAILED (LONG SMART)" — clear contradiction. Two reasons:

1. The overlay query in _drives_helpers only fetched SMART stage
   data for burn-ins in ('running','queued') state. Failed/passed/
   cancelled jobs got their stage data filtered out, so the SMART
   columns went blank when you most wanted to see them. Removed
   the state filter so all burn-ins overlay.

2. A pre-busy-timeout `database is locked` failure mode (sdj job 5
   from Mar 2026) left long_smart stage rows recorded as state=
   'running' even though the parent job ended in state='failed'.
   The overlay now translates that orphan state at render time:
   if the parent job is failed/cancelled/unknown but the stage is
   still 'running', display the stage as failed (or the parent's
   terminal state) so the column matches the Burn-In column.

The translation is purely display-time; no DB writes. error_text
falls back to the parent job's error_text when the stage's own is
NULL, so the operator sees what actually broke.
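
A sketch of the display-time translation. It assumes the orphan stage takes on the parent's terminal state and that the error fallback applies only to stages shown in a terminal state; the commit leaves both details open, and `display_stage` is a hypothetical name.

```python
from typing import Optional, Tuple

TERMINAL = {"failed", "cancelled", "unknown"}


def display_stage(stage_state: str, stage_error: Optional[str],
                  job_state: str, job_error: Optional[str]) -> Tuple[str, Optional[str]]:
    """Return (state, error_text) to render; performs no DB writes."""
    state = stage_state
    if stage_state == "running" and job_state in TERMINAL:
        state = job_state            # orphan row: mirror the parent job
    error = stage_error
    if error is None and state in TERMINAL:
        error = job_error            # fall back to the job's error text
    return state, error
```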
2026-05-09 11:46:45 -07:00
Brandon Walter
f5c6b85402 feat: client-side column sorting with SSE re-apply (1.0.0-48)
Clickable headers on Drive / Serial / Size / Temp / Health / Short
SMART / Long SMART / Burn-In. Click cycles asc → desc → cleared,
with a small ▲/▼ indicator next to the active column.

Sort state lives in localStorage so it survives reload AND every
SSE-driven tbody refresh (HTMX swaps `#drives-table-wrap` innerHTML
on each `drives-update` event). The htmx:afterSwap hook re-applies
the sort and re-paints indicators.

Sortable values are emitted as data-sort-* attributes on each <tr>:
- raw devname / serial / size_bytes / temperature_c
- numeric priority maps for SMART health, SMART test states, and
  burn-in state (so "running" sorts ahead of "passed" regardless
  of alphabetical order)
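
The real sorting runs client-side in JavaScript; the Python sketch below only illustrates the two rules from this commit, a numeric priority map for state columns and empty values sinking to the bottom in either direction. The priority numbers and the row shape are hypothetical.

```python
from typing import Any, Optional

# Hypothetical priority map: lower number sorts first when ascending.
BURNIN_PRIORITY = {"running": 0, "queued": 1, "failed": 2,
                   "unknown": 3, "cancelled": 4, "passed": 5}


def sort_key(row: dict, column: str) -> Optional[Any]:
    value = row.get(column)
    if column == "burnin":
        return BURNIN_PRIORITY.get(value)   # None for missing or unmapped states
    return value


def sort_rows(rows: list, column: str, descending: bool = False) -> list:
    present = [r for r in rows if sort_key(r, column) is not None]
    missing = [r for r in rows if sort_key(r, column) is None]
    present.sort(key=lambda r: sort_key(r, column), reverse=descending)
    return present + missing                # empties always go last


rows = [
    {"dev": "sda", "temp": 41, "burnin": "passed"},
    {"dev": "sdb", "temp": None, "burnin": "running"},
    {"dev": "sdc", "temp": 35, "burnin": None},
]
by_temp_asc = [r["dev"] for r in sort_rows(rows, "temp")]
by_temp_desc = [r["dev"] for r in sort_rows(rows, "temp", descending=True)]
by_burnin = [r["dev"] for r in sort_rows(rows, "burnin")]
```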

Empty values always sink to the bottom regardless of direction so
"sort by temp asc" doesn't put a missing-temp drive on top.
2026-05-08 23:48:04 -07:00
Brandon Walter
383258df97 feat: phase caption + bad-block badge + per-pattern history (1.0.0-47)
Three additions to the surface_validate drawer block:

1. **Phase caption** below the meters: "Pattern 2 of 4 · Verify 0x55
   · 47% within phase". Pure JS — no schema change. Makes the
   visual grammar explicit without needing the operator to mentally
   map phase=4 to "verifying pattern 2".

2. **Bad-block badge** in the vitals row. Green at 0, red at >0.
   The number was already on the stage row but burying it in the
   log felt wrong — surfacing it next to temp/speed/ETA keeps it
   in eye-line during long runs.

3. **Per-pattern duration history** below the caption. New
   bb_phase_history JSON column (idempotent migration) maps
   {phase_num: ts}. Parser stamps the timestamp on every phase
   transition (and on stage entry for phase 1). Drawer diffs
   consecutive write-phase starts to derive "0xaa: 14h 22m"
   for completed patterns. Once one pattern is done you can
   predict the rest without leaving the drawer.

Persistence is idempotent: re-entry of the same phase keeps the
original timestamp so a transient parser reset doesn't blow away
history. JSON parse failures fail gracefully (no row rendered).
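
A sketch of the idempotent stamping and the duration diff. It assumes timestamps are stored as epoch seconds and that write phases are the odd-numbered ones (1, 3, 5, 7); the function names are hypothetical.

```python
import json
from typing import Optional

PATTERNS = ("0xaa", "0x55", "0xff", "0x00")


def stamp_phase(history_json: Optional[str], phase: int, ts: int) -> str:
    """Record the first time a phase was entered; re-entry keeps the original."""
    history = json.loads(history_json) if history_json else {}
    history.setdefault(str(phase), ts)
    return json.dumps(history)


def pattern_durations(history_json: Optional[str]) -> dict:
    """Seconds per completed pattern, from consecutive write-phase starts."""
    try:
        history = json.loads(history_json)
    except (TypeError, ValueError):
        return {}                      # unparseable history renders nothing
    if not isinstance(history, dict):
        return {}
    durations = {}
    for i, label in enumerate(PATTERNS[:-1]):
        start = history.get(str(2 * i + 1))
        nxt = history.get(str(2 * i + 3))
        if start is not None and nxt is not None:
            durations[label] = nxt - start
    return durations


h = stamp_phase(None, 1, 1000)
h = stamp_phase(h, 2, 30000)
h = stamp_phase(h, 1, 99999)     # transient reset: original stamp survives
h = stamp_phase(h, 3, 52720)
durations = pattern_durations(h)  # 51720 s is 14h 22m
```

The last pattern has no following write phase, so its duration would need the stage's end time; that case is left out of the sketch.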
2026-05-08 23:23:02 -07:00
Brandon Walter
6b2367b892 feat: vital-signs strip above per-pattern meters (1.0.0-46)
The drawer's surface_validate area now leads with a row of operator
vitals computed from data already in the response:

- Temp: drive temperature with cool/warm/hot colour (≥48 red, ≥42 yellow)
- Speed: live MB/s, NULL until second progress sample arrives
- Elapsed: time since stage started_at
- ETA: extrapolated from overall progress; suppressed under 0.5%
  to avoid the "47 days remaining" artefact early in pattern 1

Live MB/s comes from a new bb_mbps column on burnin_stages, computed
in the badblocks parser as (delta_overall_pct / 800) * size_bytes / dt.
Skipped on phase transitions (per-phase pct resets) and sub-second
samples (noisy).

Drawer endpoint now passes drive.temperature_c through; JS stashes
the latest drive object in _DRAWER_LAST_DRIVE so the burn-in renderer
can pull it for the vitals row without changing call signatures.

Tightened table CSS in this same session is unrelated and shipped
already in earlier rounds via the bind-mounted app.css.
2026-05-08 23:13:58 -07:00
Brandon Walter
1393ba0bc8 fix: seed bb_phase=1,pct=0 at surface_validate start (1.0.0-45)
Previously the parser only wrote bb_phase to the DB when state
*changed* — so for the first several minutes of a 14 TB burn-in
(before badblocks emits its first 'X% done' line), bb_phase stayed
NULL and the drawer's per-pattern meters didn't render at all.
Looked broken to operators.

Now we write phase=1, phase_pct=0 immediately on stage entry. The
parser keeps overwriting on every subsequent tick. Drawer shows
empty meters with 0xaa label highlighted blue from t=0.
2026-05-08 22:45:45 -07:00
Brandon Walter
30062affc2 feat: per-pattern badblocks meters in drive drawer (1.0.0-44)
User asked for one meter per badblocks pattern. The drawer now shows
4 meters (one per pattern: 0xaa / 0x55 / 0xff / 0x00), each split
into write (left, blue) + verify (right, green) halves so a glance
shows both which pattern is current AND whether you're writing or
verifying within it.

Backend:
- New columns burnin_stages.bb_phase (1-8) + bb_phase_pct (0-100)
  via idempotent ALTER TABLE migration
- _update_stage_bb_phase() helper called from the badblocks parser
  on every tick (when phase or percent changes)
- /api/v1/drives/{id}/drawer SELECT now returns the new fields

Frontend (app.js + app.css):
- _drawerRenderBadblocksMeters(phase, phasePct) computes per-pattern
  fill state and emits 4-meter HTML with W/V sub-labels
- Conditional render: only shows when stage_name === 'surface_validate'
  AND bb_phase is set, so historical pre-1.0.0-44 stage rows render
  unchanged (single percent, no meters)

3 new tests cover the migration columns, single-tick persistence,
and overwrite-on-second-tick. Total suite: 75 tests.

Image rebuilt and tagged but NOT deployed — 4 burn-ins are running
right now and a recreate would SIGHUP them. Deploy with
`docker compose up -d` after the current batch finishes; the
migration runs at init and the meters light up for the next batch.
2026-05-08 22:34:35 -07:00
Brandon Walter
4922b19a9f fix: stuck_job_hours default 24 → 168 (7 days) (1.0.0-43)
A user with 4× 14 TB WD HDDs running -w surface_validate had all
4 jobs marked 'unknown' at exactly 24h+1min — the stuck-job
detector firing on legitimate work because 14 TB at 8192-block
badblocks needs ~5+ days to complete all 4 patterns × 2 phases.

168h covers a full -w pass on 14 TB+ HDDs with margin. Anyone
running short SSDs who wants faster detection can drop the value
in Settings → Burn-in.

README warning replaced — no longer instructs users to bump the
threshold before starting big-drive burn-ins, since the default
now handles that case.

Settings UI already accepts up to 168 via the input's max=168
attribute, so no template change needed.
2026-05-08 13:23:05 -07:00
Brandon Walter
b406e3f315 fix: badblocks progress tracks overall %, not per-phase (1.0.0-42)
`badblocks -w` cycles through 4 patterns (0xaa, 0x55, 0xff, 0x00),
each with a write phase + a verify phase = 8 phases. The output's
"XX% done" lines are per-phase, so the dashboard appeared to "rewind"
every ~2 hours. Two drives racing each other could look 4× apart in
displayed progress despite identical hardware — actually one was
just further into a later phase.

New _BadblocksProgress state machine watches for "Testing with
pattern 0xXX" and "Reading and comparing" headers, advances the
phase counter, and reports overall = ((phase-1) * 100 + phase_pct) / 8
clipped to 99. Pure state machine, no I/O.
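
The formula above can be exercised with a small state machine. This is a sketch, not the project's `_BadblocksProgress`: the class name, the regex, and the sample lines are assumptions, while the header strings, the divide-by-8 formula, and the clip at 99 come from the commit.

```python
import re

_PCT = re.compile(r"(\d+(?:\.\d+)?)% done")


class ProgressTracker:
    """Pure state machine: per-phase percent in, overall percent out."""

    TOTAL_PHASES = 8

    def __init__(self) -> None:
        self.phase = 0
        self.phase_pct = 0.0

    def feed(self, line: str) -> float:
        if "Testing with pattern" in line or "Reading and comparing" in line:
            self.phase = min(self.phase + 1, self.TOTAL_PHASES)
            self.phase_pct = 0.0          # per-phase percent restarts
        match = _PCT.search(line)
        if match:
            self.phase_pct = float(match.group(1))
        return self.overall()

    def overall(self) -> float:
        if self.phase == 0:
            return 0.0
        raw = ((self.phase - 1) * 100 + self.phase_pct) / self.TOTAL_PHASES
        return min(raw, 99.0)             # never report done before the stage ends


tracker = ProgressTracker()
stream = [
    "Testing with pattern 0xaa: ",
    "50.00% done, 1:00 elapsed. (0/0/0 errors)",
    "Reading and comparing: ",
    "50.00% done, 2:30 elapsed. (0/0/0 errors)",
]
values = [tracker.feed(line) for line in stream]
```

Two drives that both print "50.00% done" now report 6.25 and 18.75 respectively when one is in phase 1 and the other in phase 2, which is the distinction the dashboard previously lost.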

7 new tests cover phase-header detection, boundary math, monotonicity
across a synthetic stream, and the original "two drives at same
per-phase % look identical" bug.

Image rebuilt and tagged but NOT deployed to the running container —
4 surface-validate jobs are 20-95% through 14TB drives and a recreate
would SIGHUP the remote badblocks processes. Deploy with
`docker compose up -d` after the current batch finishes.
2026-05-05 07:26:23 -07:00
Brandon Walter
775251b993 docs: refresh README test count + run-tests.sh pointer
Test suite has grown from 44 to 65 since this line was last touched
(routes resolution, badblocks tunables, rate limiter, lifecycle).
Also points readers at scripts/run-tests.sh for the in-container path.
2026-05-05 06:19:17 -07:00
18 changed files with 1388 additions and 109 deletions

101
README.md

@@ -83,11 +83,12 @@ runtime roughly in half at ~2× RAM cost — matches the upstream
 ### Watch out
-- **Stuck-job timeout** — `stuck_job_hours` (default 24) marks any job
-past that threshold as `unknown` and kills the remote process. If
-you're burning in 14 TB drives with default block size, raise this to
-**48** in Settings before starting, or you'll get false positives near
-the end of surface_validate.
+- **Stuck-job timeout** — `stuck_job_hours` (default 168 = 7 days)
+marks any job past that threshold as `unknown` and kills the remote
+process. The default covers `-w` surface_validate on 14 TB+ HDDs with
+margin. If you're running short SSDs and want faster detection of
+genuinely stuck jobs, drop it. (Earlier versions defaulted to 24h
+which false-positived on multi-TB drives.)
 - **Thermal gate** — if drives currently under burn-in hit the
 temperature warning threshold, new jobs wait up to 3 minutes before
 acquiring a slot. Increase `temp_warn_c` if your chassis runs hot but
@ -105,6 +106,91 @@ Click the red ✕ next to a running job. The orchestrator:
Cancellations are durable — restart the container and queued jobs resume, Cancellations are durable — restart the container and queued jobs resume,
cancelled jobs stay cancelled. cancelled jobs stay cancelled.
### Job states explained
| State | When it's set |
|-------------|-------------------------------------------------------------------------------|
| `queued` | Submitted, waiting for a `max_parallel_burnins` slot |
| `running` | Actively executing some stage |
| `passed` | All stages finished green |
| `failed` | A stage failed deterministically (bad blocks > threshold, SMART failure, etc.) |
| `cancelled` | Operator clicked ✕ |
| `unknown` | Job was alive but its outcome is indeterminate — see below |
`unknown` fires in two situations:
1. The stuck-job detector (`stuck_job_hours`, default 7 days) trips because
the job has been running too long without finishing.
2. The asyncio task got cancelled mid-stage by something *other* than an
operator click — usually a container restart (`docker compose up -d`,
`--build`, or the host rebooting). Burn-in source code goes through
the Dockerfile `COPY`, so any source-code deploy recreates the
container, drops the SSH connection to TrueNAS, and would orphan the
running burn-in. Avoid `--build` while burn-ins are active.
When `unknown` fires the drawer's per-stage Reason block shows
*"Task cancelled mid-run — likely container restart or shutdown"* so the
classification is explicit, not silent.
---
## Drive drawer
Click any drive row to slide a detail drawer down from the top. Three tabs:
- **Burn-In** — per-stage breakdown of the latest job
- **SMART** — short/long test states + cached SMART attributes
- **Events** — last 50 audit events for the drive
### Surface-validate visualization
For drives in a `surface_validate` stage (running or finished), the Burn-In
tab renders:
1. **Vital-signs strip**`Start` (with date) · `Elapsed` · `ETA` (duration
remaining) · `Finish` (wall-clock estimate, browser-local timezone) ·
`Temp` (cool/warm/hot colour). Computed from data in the drawer payload;
ETA + Finish suppressed below 0.5% so you don't see a "Finish: Jun 22"
stutter at the very start.
2. **Four pattern meters**`0xaa` / `0x55` / `0xff` / `0x00`. Each meter
is split into a left half (write phase, blue) and a right half (verify
phase, green). Current pattern's label glows blue; completed patterns'
labels go green. This translates badblocks's per-phase percent into
monotonic 0-99% overall progress, so the bar never appears to "rewind"
when a new phase starts.
3. **Phase caption** — explicit text: *"Pattern 2 of 4 · Verify 0x55 · 47%
within phase"*. Makes the visual grammar unambiguous.
4. **Completed-pattern history** — once pattern 1 finishes, a chip appears
showing `0xaa: 14h 22m`. Lets you predict the rest of the run from the
first pattern's elapsed time.
### Failure reason block
Stages that ended `failed` / `cancelled` / `unknown` show a coloured Reason
pill at the top of the stage section. Sources, in order of preference:
1. The stage's own `error_text`
2. The parent job's `error_text` (backfilled by the drawer when the stage's
own is empty — catches orphan rows from hard crashes)
3. A heuristic: if the log is tiny and no real progress was recorded,
*"Stopped without recording an error — likely cause: SSH connection drop
or container restart while this stage was running"*
Otherwise: *"No error message recorded."* — there's never a blank where you
expect to see why something broke.
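
The order of preference above can be sketched as a small pure function. This is an illustration only: `pick_reason`, the `TINY_LOG_CHARS` cutoff, and the shortened heuristic message are assumptions made for the example, since the text does not state the real threshold.

```python
from typing import Optional

# Hypothetical cutoff for "the log is tiny"; the real threshold is not stated.
TINY_LOG_CHARS = 200

# Abbreviated stand-in for the drawer's heuristic message.
HEURISTIC_REASON = (
    "Stopped without recording an error (likely SSH connection drop "
    "or container restart while this stage was running)"
)
FALLBACK_REASON = "No error message recorded."


def pick_reason(stage_error: Optional[str], job_error: Optional[str],
                log_text: Optional[str], percent: Optional[float]) -> str:
    """Return the text for the Reason pill, in order of preference."""
    if stage_error:
        return stage_error      # 1. the stage's own error_text
    if job_error:
        return job_error        # 2. the parent job's error_text
    made_progress = bool(percent)
    if len(log_text or "") < TINY_LOG_CHARS and not made_progress:
        return HEURISTIC_REASON  # 3. tiny log, no recorded progress
    return FALLBACK_REASON       # never leave the pill blank
```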
### Column sorting
Click any column header (Drive, Serial, Size, Temp, Health, Short SMART,
Long SMART, Burn-In) to sort. Cycle: ascending → descending → cleared. Sort
state persists in `localStorage` so it survives page reload AND every
SSE-driven tbody refresh (~12 s poll cycle). Empty values always sink to
the bottom regardless of direction.
Sortable values are emitted as `data-sort-*` attributes on each `<tr>`,
with numeric priority maps for SMART states (e.g. `running` always sorts
ahead of `idle`).
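
The sorting rules can be modelled in a few lines. The dashboard itself sorts in the browser, so the Python below is only a sketch of the logic; `sort_rows`, `next_sort_state`, `smart_sort_value`, and the numbers in `SMART_PRIORITY` are assumptions for illustration (only "`running` ahead of `idle`" comes from the text).

```python
from typing import Optional

# Hypothetical priority map; lower values sort first when ascending.
SMART_PRIORITY = {"running": 0, "failed": 1, "passed": 2, "idle": 3}


def smart_sort_value(state: Optional[str]) -> Optional[int]:
    """Numeric sort value for a SMART state, or None when unknown/empty."""
    return SMART_PRIORITY.get(state)


def sort_rows(rows: list, key: str, descending: bool = False) -> list:
    """Sort rows by key; rows with an empty value always go last."""
    present = [r for r in rows if r.get(key) is not None]
    empty = [r for r in rows if r.get(key) is None]
    present.sort(key=lambda r: r[key], reverse=descending)
    return present + empty


def next_sort_state(state: Optional[str]) -> Optional[str]:
    """Header click cycle: cleared -> ascending -> descending -> cleared."""
    return {None: "asc", "asc": "desc", "desc": None}[state]


rows = [
    {"drive": "sda", "temp": 40},
    {"drive": "sdb", "temp": None},
    {"drive": "sdc", "temp": 35},
]
print([r["drive"] for r in sort_rows(rows, "temp")])
print([r["drive"] for r in sort_rows(rows, "temp", descending=True)])
```

Splitting the rows into "present" and "empty" before sorting is what keeps empty values at the bottom in both directions.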
--- ---
## Drive locks ## Drive locks
@ -144,7 +230,8 @@ All settings live under `/settings` (header link). Key knobs:
- **`surface_validate_block_size` / `_block_buffer` / `_passes`** — - **`surface_validate_block_size` / `_block_buffer` / `_passes`** —
badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour; badblocks `-b` / `-c` / `-p`. Defaults preserve original behaviour;
tune for speed vs paranoia. tune for speed vs paranoia.
-- **`stuck_job_hours`** (default 24) — raise for big drives.
+- **`stuck_job_hours`** (default 168 = 7 days) — covers 14 TB+ HDDs;
+  drop for faster detection on small fast drives.
- **`temp_warn_c` / `temp_crit_c`** — thermal gating thresholds. - **`temp_warn_c` / `temp_crit_c`** — thermal gating thresholds.
- **`bad_block_threshold`** (default 0) — number of bad blocks - **`bad_block_threshold`** (default 0) — number of bad blocks
surface_validate tolerates before failing the stage. surface_validate tolerates before failing the stage.
@ -259,7 +346,7 @@ pinned version after the fact.
- `CLAUDE.md` — full architecture, file map, deploy workflow, and the - `CLAUDE.md` — full architecture, file map, deploy workflow, and the
rationale behind every non-obvious design decision. rationale behind every non-obvious design decision.
- `SPEC.md` — canonical feature reference per version. - `SPEC.md` — canonical feature reference per version.
-- `tests/`: `python -m unittest discover tests/` (44 tests, stdlib-only).
+- `tests/`: `python -m unittest discover tests/` (65 tests, stdlib-only). Or run inside the deployed container with `scripts/run-tests.sh`.
--- ---


@ -72,6 +72,14 @@ class User:
is_admin: bool is_admin: bool
def LoopbackUser(username: str = "monitor", full_name: str = "Autonomous Monitor") -> User:
"""Synthetic admin used by the loopback bypass in _AuthGateMiddleware.
id=0 (no real DB row) and is_admin=True so admin-gated routes work.
Only reachable when request.client.host is 127.0.0.1 / ::1, i.e.
a process inside the container's network namespace (docker exec)."""
return User(id=0, username=username, full_name=full_name, is_admin=True)
def _now() -> str: def _now() -> str:
return datetime.now(timezone.utc).isoformat() return datetime.now(timezone.utc).isoformat()


@ -93,6 +93,7 @@ async def init(client: TrueNASClient) -> None:
async with _db() as db: async with _db() as db:
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
await db.execute("PRAGMA foreign_keys=ON") await db.execute("PRAGMA foreign_keys=ON")
# Mark interrupted running jobs as unknown # Mark interrupted running jobs as unknown
@ -161,6 +162,7 @@ async def start_job(drive_id: int, profile: str, operator: str,
async with _db() as db: async with _db() as db:
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
await db.execute("PRAGMA foreign_keys=ON") await db.execute("PRAGMA foreign_keys=ON")
# Reject duplicate active burn-in for same drive # Reject duplicate active burn-in for same drive
@ -261,6 +263,7 @@ async def cancel_job(job_id: int, operator: str) -> bool:
async with _db() as db: async with _db() as db:
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
cur = await db.execute( cur = await db.execute(
"SELECT state, drive_id FROM burnin_jobs WHERE id=?", (job_id,) "SELECT state, drive_id FROM burnin_jobs WHERE id=?", (job_id,)
@ -345,6 +348,7 @@ async def _run_job(job_id: int) -> None:
# Transition queued → running # Transition queued → running
async with _db() as db: async with _db() as db:
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
row = await (await db.execute( row = await (await db.execute(
"SELECT drive_id, profile FROM burnin_jobs WHERE id=?", (job_id,) "SELECT drive_id, profile FROM burnin_jobs WHERE id=?", (job_id,)
)).fetchone() )).fetchone()
@ -411,12 +415,34 @@ async def _run_job(job_id: int) -> None:
final_state = "unknown" final_state = "unknown"
else: else:
final_state = "passed" if success else "failed" final_state = "passed" if success else "failed"
# If the asyncio task was cancelled mid-stage (container shutdown,
# uvicorn reload, etc.), CancelledError propagates past
# _execute_stages, so any running stage row is still marked
# 'running' in the DB. Reconcile here: mark every still-running
# stage on this job as 'unknown' with the parent's finished_at,
# and stamp a default error_text so the drawer's Reason block has
# something concrete to show. Use a write that's idempotent under
# repeat (only touches rows still 'running').
cancel_err = (
"Task cancelled mid-run — likely container restart or shutdown"
if was_cancelled else None
)
async with _db() as db: async with _db() as db:
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
await db.execute( await db.execute(
"UPDATE burnin_jobs SET state=?, percent=?, finished_at=?, error_text=? WHERE id=?", "UPDATE burnin_jobs SET state=?, percent=?, finished_at=?, error_text=? WHERE id=?",
-(final_state, 100 if success else None, _now(), error_text, job_id),
+(final_state, 100 if success else None, _now(),
+ error_text or cancel_err, job_id),
) )
if was_cancelled:
await db.execute(
"""UPDATE burnin_stages
SET state='unknown', finished_at=?,
error_text=COALESCE(error_text, ?)
WHERE burnin_job_id=? AND state='running'""",
(_now(), cancel_err, job_id),
)
await db.execute( await db.execute(
"""INSERT INTO audit_events (event_type, drive_id, burnin_job_id, operator, message) """INSERT INTO audit_events (event_type, drive_id, burnin_job_id, operator, message)
VALUES (?,?,?,(SELECT operator FROM burnin_jobs WHERE id=?),?)""", VALUES (?,?,?,(SELECT operator FROM burnin_jobs WHERE id=?),?)""",
@ -542,6 +568,7 @@ async def check_stuck_jobs() -> None:
async with _db() as db: async with _db() as db:
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
cur = await db.execute(""" cur = await db.execute("""
SELECT bj.id, bj.drive_id, d.devname, bj.started_at SELECT bj.id, bj.drive_id, d.devname, bj.started_at


@ -77,9 +77,13 @@ def _now() -> str:
@asynccontextmanager @asynccontextmanager
async def _db(): async def _db():
"""Open a WAL-mode connection with busy_timeout so writers wait for the lock """Open a WAL-mode connection with busy_timeout so writers wait for the lock
-instead of immediately raising 'database is locked' under contention."""
+instead of immediately raising 'database is locked' under contention.
+60s timeout is intentionally generous: with 4 concurrent burn-in drains
++ the poller + retention + auth all writing, brief contention spikes
+are normal and waiting is the right behavior. 10s was too tight."""
 async with aiosqlite.connect(settings.db_path) as db:
-    await db.execute("PRAGMA busy_timeout=10000")
+    await db.execute("PRAGMA busy_timeout=60000")
yield db yield db
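
The per-connection PRAGMA pattern can be tried in isolation with the standard-library `sqlite3` module. This is a sketch rather than the project's code: the project uses `aiosqlite`, and `open_db` plus the temporary database path are made up for the example. `busy_timeout` applies to the connection that sets it and is not stored in the database file, which is why each connect site has to set it.

```python
import os
import sqlite3
import tempfile


def open_db(path: str, busy_timeout_ms: int = 60000) -> sqlite3.Connection:
    """Open a connection with the same PRAGMAs the diff applies everywhere."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    # Wait up to busy_timeout_ms for a lock instead of failing immediately.
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_db(os.path.join(tmp, "demo.db"))
    # Reading the PRAGMAs back confirms what this connection is using.
    timeout_ms = conn.execute("PRAGMA busy_timeout").fetchone()[0]
    journal = conn.execute("PRAGMA journal_mode").fetchone()[0]
    conn.close()

print(timeout_ms, journal)
```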
@ -190,6 +194,72 @@ async def _update_stage_bad_blocks(job_id: int, stage_name: str, count: int) ->
await db.commit() await db.commit()
async def _update_stage_bb_phase(
job_id: int, stage_name: str, phase: int, phase_pct: float,
) -> None:
"""Persist per-pattern badblocks progress so the drive-drawer UI
can render 4 meters with separate write/verify halves."""
async with _db() as db:
await db.execute("PRAGMA journal_mode=WAL")
await db.execute(
"UPDATE burnin_stages SET bb_phase=?, bb_phase_pct=? "
"WHERE burnin_job_id=? AND stage_name=?",
(phase, phase_pct, job_id, stage_name),
)
await db.commit()
async def _update_stage_bb_mbps(
job_id: int, stage_name: str, mbps: float,
) -> None:
"""Persist live throughput for the surface_validate meter strip.
Computed from delta_overall_pct between successive badblocks
progress lines, scaled by drive size_bytes / 800 (8 phases × 100)."""
async with _db() as db:
await db.execute("PRAGMA journal_mode=WAL")
await db.execute(
"UPDATE burnin_stages SET bb_mbps=? "
"WHERE burnin_job_id=? AND stage_name=?",
(mbps, job_id, stage_name),
)
await db.commit()
async def _record_bb_phase_start(
job_id: int, stage_name: str, phase: int, ts: str,
) -> None:
"""Record the moment a phase first becomes current. Idempotent:
re-entry of the same phase keeps the original timestamp so a
transient parser reset doesn't blow away history.
Stored as a JSON object keyed by phase number (string). The
drawer reads it to compute per-pattern elapsed times.
"""
async with _db() as db:
await db.execute("PRAGMA journal_mode=WAL")
cur = await db.execute(
"SELECT bb_phase_history FROM burnin_stages "
"WHERE burnin_job_id=? AND stage_name=?",
(job_id, stage_name),
)
row = await cur.fetchone()
existing = {}
if row and row[0]:
try:
existing = json.loads(row[0])
except (json.JSONDecodeError, TypeError):
existing = {}
key = str(phase)
if key not in existing:
existing[key] = ts
await db.execute(
"UPDATE burnin_stages SET bb_phase_history=? "
"WHERE burnin_job_id=? AND stage_name=?",
(json.dumps(existing), job_id, stage_name),
)
await db.commit()
async def _store_smart_attrs(drive_id: int, attrs: dict) -> None: async def _store_smart_attrs(drive_id: int, attrs: dict) -> None:
"""Persist latest SMART attribute dict to drives.smart_attrs (JSON).""" """Persist latest SMART attribute dict to drives.smart_attrs (JSON)."""
# Convert int keys to str for JSON serialisation # Convert int keys to str for JSON serialisation
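
The idempotent phase-history update in `_record_bb_phase_start` can be separated from the database calls and exercised as pure functions. The sketch below assumes the JSON shape described in the docstring (phase number as a string key, ISO timestamp as value); `record_phase_start`, `seconds_between`, and the sample timestamps are illustrative, not taken from the codebase.

```python
import json
from datetime import datetime
from typing import Optional


def record_phase_start(history_json: Optional[str], phase: int, ts: str) -> str:
    """Return updated history JSON; the first timestamp per phase wins."""
    try:
        history = json.loads(history_json) if history_json else {}
    except (json.JSONDecodeError, TypeError):
        history = {}
    if not isinstance(history, dict):
        history = {}
    history.setdefault(str(phase), ts)  # re-entry keeps the original timestamp
    return json.dumps(history)


def seconds_between(history_json: str, start_phase: int,
                    end_phase: int) -> Optional[float]:
    """Elapsed seconds between two recorded phase starts, if both exist."""
    history = json.loads(history_json)
    start, end = history.get(str(start_phase)), history.get(str(end_phase))
    if start is None or end is None:
        return None
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds()


h = record_phase_start(None, 1, "2026-05-09T05:39:44+00:00")
h = record_phase_start(h, 1, "2026-05-09T06:00:00+00:00")  # ignored: phase 1 already set
h = record_phase_start(h, 3, "2026-05-09T20:01:44+00:00")
print(seconds_between(h, 1, 3))  # prints 51720.0, i.e. 14h 22m
```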


@ -25,23 +25,110 @@ class _BadblocksResult(TypedDict):
aborted: bool aborted: bool
# `badblocks -w` cycles through 4 patterns (0xaa, 0x55, 0xff, 0x00),
# each with a write phase followed by a read-back/verify phase = 8 phases.
# Per-phase percent comes back via `XX% done`; without translation, the
# dashboard appears to "rewind" every ~2 hours when a new phase starts.
_BB_PATTERN_PHASE = {"0xaa": 1, "0x55": 3, "0xff": 5, "0x00": 7}
_BB_TOTAL_PHASES = 8
# Throttle DB writes from the badblocks parser. Each progress line used
# to trigger 4-6 transactions; with 4 concurrent burn-ins emitting sub-
# second progress lines, the asyncssh drain couldn't keep up — the
# stdout pipe on TrueNAS filled, badblocks blocked on pipe_write,
# disk I/O effectively stopped. 5 seconds is fine for the UI (drawer
# polls every ~12s anyway) and cuts DB load 60-80x.
BB_DB_MIN_SECONDS = 5.0
import re as _re_pre # noqa: E402
_BB_PATTERN_RE = _re_pre.compile(r"Testing with pattern\s+(0x[0-9a-fA-F]+)")
_BB_VERIFY_RE = _re_pre.compile(r"Reading and comparing")
_BB_PERCENT_RE = _re_pre.compile(r"([\d.]+)%\s+done")
class _BadblocksProgress:
"""Track which phase of `badblocks -w -p N` we're in so the
displayed percent maps to overall progress, not per-phase progress.
Pure state machine, no I/O. Feed it lines from the badblocks output
via :meth:`update`; read :attr:`overall_pct` after each call.
Behavior:
- Defaults to phase 1 (write 0xaa) before any header is seen.
- "Testing with pattern 0xXX" sets the phase to the write-phase index
for that pattern (1, 3, 5, or 7).
- "Reading and comparing" advances to the matching verify phase
(last_write_phase + 1).
- "XX% done" updates the in-phase percent.
- overall_pct = ((phase - 1) * 100 + phase_pct) / 8, clipped to 99
so we don't claim "100%" until the stage's success path explicitly
writes 100.
"""
__slots__ = ("phase", "phase_pct", "_last_write_phase")
def __init__(self) -> None:
self.phase: int = 1
self.phase_pct: float = 0.0
self._last_write_phase: int = 1
def update(self, line: str) -> None:
m = _BB_PATTERN_RE.search(line)
if m:
p = m.group(1).lower()
if p in _BB_PATTERN_PHASE:
self.phase = _BB_PATTERN_PHASE[p]
self._last_write_phase = self.phase
self.phase_pct = 0.0
return
if _BB_VERIFY_RE.search(line):
self.phase = self._last_write_phase + 1
self.phase_pct = 0.0
return
m = _BB_PERCENT_RE.search(line)
if m:
try:
self.phase_pct = float(m.group(1))
except ValueError:
pass
@property
def overall_pct(self) -> int:
total = (self.phase - 1) * 100.0 + self.phase_pct
return min(99, int(total / _BB_TOTAL_PHASES))
 def _build_badblocks_cmd(devname: str) -> str:
     """Construct the wrapped badblocks command for a given device.
-    Wraps badblocks under `sh -c 'echo PID:$$; exec ...'` so we can
-    capture the remote PID for out-of-band kill -9 (asyncssh's signal
-    channel is ignored by sshd). Geometry (-b -c -p) is operator-tunable
-    via Settings Burn-in; defaults match the Spearfoot disk-burnin.sh
-    recommendation for large HDDs.
+    badblocks's progress output uses '\\b' backspace characters to
+    overwrite the previous "XX% done" line; there's no '\\n' between
+    updates until a phase transition. asyncssh's line-buffered reader
+    needs a real '\\n' to yield a line, so we pipe the output through
+    `tr '\\b' '\\n'` at the shell level. After this, every progress
+    update is a normal newline-terminated line.
+    Inner shell does `echo PID:$$; exec badblocks ...` so $$ is the
+    badblocks PID after exec (needed for out-of-band kill -9; asyncssh's
+    signal channel is ignored by sshd). 2>&1 merges stderr into stdout
+    so tr sees the progress lines (badblocks emits them on stderr).
+    Geometry (-b -c -p) is operator-tunable via Settings Burn-in;
+    defaults match the Spearfoot disk-burnin.sh recommendation.
     """
-    return (
-        f"sh -c 'echo PID:$$; exec badblocks "
+    inner = (
+        f"echo PID:$$; exec badblocks "
         f"-wsv "
         f"-b {settings.surface_validate_block_size} "
         f"-c {settings.surface_validate_block_buffer} "
         f"-p {settings.surface_validate_passes} "
-        f"/dev/{devname}'"
+        f"/dev/{devname} 2>&1"
     )
+    # The outer pipeline lets tr translate \\b → \\n. stdbuf -oL forces
+    # tr's stdout to line-buffered mode; without it tr's stdout is
+    # block-buffered (4 KB chunks) when its destination is a pipe,
+    # which delays each progress line by ~6 minutes at our throughput.
+    return f"sh -c '{inner}' | stdbuf -oL tr '\\b' '\\n'"
from . import kill from . import kill
from ._common import ( from ._common import (
@ -49,12 +136,16 @@ from ._common import (
_append_stage_log, _append_stage_log,
_db, _db,
_is_cancelled, _is_cancelled,
_now,
_push_update, _push_update,
_recalculate_progress, _recalculate_progress,
_record_bb_phase_start,
_set_stage_error, _set_stage_error,
_store_smart_attrs, _store_smart_attrs,
_store_smart_raw_output, _store_smart_raw_output,
_update_stage_bad_blocks, _update_stage_bad_blocks,
_update_stage_bb_mbps,
_update_stage_bb_phase,
_update_stage_percent, _update_stage_percent,
) )
@ -399,6 +490,17 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
"""Run badblocks over SSH, streaming output to stage log.""" """Run badblocks over SSH, streaming output to stage log."""
from app import ssh_client from app import ssh_client
# Pull drive size for the throughput calculation. Each badblocks
# phase covers the full disk once, so 1% overall progress = size/800
# bytes (8 phases × 100). NULL-safe: if size lookup fails we just
# skip the MB/s update.
drive_size_bytes: int | None = None
async with _db() as db:
cur = await db.execute("SELECT size_bytes FROM drives WHERE id=?", (drive_id,))
row = await cur.fetchone()
if row and row[0]:
drive_size_bytes = int(row[0])
await _append_stage_log( await _append_stage_log(
job_id, "surface_validate", job_id, "surface_validate",
f"[START] badblocks -wsv -b {settings.surface_validate_block_size} " f"[START] badblocks -wsv -b {settings.surface_validate_block_size} "
@ -425,17 +527,47 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
# #
cmd = _build_badblocks_cmd(devname) cmd = _build_badblocks_cmd(devname)
async with conn.create_process(cmd) as proc: async with conn.create_process(cmd) as proc:
import re as _re
pid_seen = False pid_seen = False
progress = _BadblocksProgress()
# Throughput tracker — store (overall_pct, monotonic_ts)
# of the previous progress sample so we can compute MB/s
# from the delta on each new sample.
last_pct_sample: float = progress.overall_pct
last_db_write_ts: float = time.monotonic()
# Lines accumulated since last log flush. Flushed in the
# throttled DB-write window (see BB_DB_MIN_SECONDS).
pending_log_chunks: list[str] = []
# Seed bb_phase=1, bb_phase_pct=0 immediately so the
# drawer's per-pattern meters have something to render
# before badblocks emits its first "X% done" line. On a
# 14 TB drive that first line can be several minutes in,
# and a blank meter strip looks broken to the operator.
await _update_stage_bb_phase(
job_id, "surface_validate",
progress.phase, progress.phase_pct,
)
# Stamp phase 1 (write 0xaa) start so the drawer's
# duration history starts populating immediately.
await _record_bb_phase_start(
job_id, "surface_validate", progress.phase, _now(),
)
_push_update()
async def _drain(stream, is_stderr: bool): async def _drain(stream, is_stderr: bool):
-nonlocal bad_blocks_total, pid_seen
+nonlocal bad_blocks_total, pid_seen, last_db_write_ts, last_pct_sample
# Line-based drain. The wrapped badblocks command
# pipes through `tr '\b' '\n'` at the shell level
# so every progress update is a real newline-
# terminated line by the time it reaches us.
async for raw in stream: async for raw in stream:
line = raw if isinstance(raw, str) else raw.decode("utf-8", errors="replace") line = raw if isinstance(raw, str) else raw.decode("utf-8", errors="replace")
if not line.strip():
continue
-# First stdout line is "PID:<n>" from the wrapping shell.
-# Capture it and don't append it to the user-visible log.
+# First stdout line is "PID:<n>" from the
+# wrapping shell. Capture and skip.
if not is_stderr and not pid_seen and line.startswith("PID:"): if not is_stderr and not pid_seen and line.startswith("PID:"):
pid_seen = True pid_seen = True
try: try:
@ -448,27 +580,86 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
pass pass
continue continue
-output_lines.append(line)
+# Note: with the `tr` pipe, badblocks's stderr is
+# merged into stdout (`2>&1`). is_stderr is now
+# always False — we treat every non-PID line as
+# potentially containing progress or bad-block
+# output. The phase parser is idempotent on
+# unrelated lines.
+prev_phase = progress.phase
+progress.update(line)
+phase_changed = progress.phase != prev_phase
+is_progress_line = bool(_BB_PERCENT_RE.search(line))
+# Bare-number lines from badblocks are bad-block
+# block numbers (one per line on stdout).
+stripped = line.strip()
+if stripped and stripped.isdigit() and not is_progress_line:
+    bad_blocks_total += 1
-if is_stderr:
-    m = _re.search(r"([\d.]+)%\s+done", line)
-    if m:
-        pct = min(99, int(float(m.group(1))))
-        await _update_stage_percent(job_id, "surface_validate", pct)
-        await _update_stage_bad_blocks(job_id, "surface_validate", bad_blocks_total)
-        await _recalculate_progress(job_id)
-        _push_update()
-else:
-    stripped = line.strip()
-    if stripped and stripped.isdigit():
-        bad_blocks_total += 1
+# Keep "XX% done" lines OUT of output_lines. Big
+# volume + quadratic log_text concat.
+if not is_progress_line:
+    output_lines.append(line)
-# Append to DB log in chunks
-if len(output_lines) % 20 == 0:
-    chunk = "".join(output_lines[-20:])
-    await _append_stage_log(job_id, "surface_validate", chunk)
+# Single throttle gate covering EVERY DB touch.
+# Cumulative DB load otherwise overwhelms the
+# asyncio loop → asyncssh drain falls behind →
+# SSH window stops advancing → pipe fills →
+# badblocks blocks on pipe_write → disk I/O stops.
+now_ts = time.monotonic()
+time_since_last_db = now_ts - last_db_write_ts
+should_write = phase_changed or time_since_last_db >= BB_DB_MIN_SECONDS
+if should_write:
+    if await _is_cancelled(job_id):
+        await kill.kill_remote_process(job_id)
+        return
+    if phase_changed:
+        await _record_bb_phase_start(
+            job_id, "surface_validate",
+            progress.phase, _now(),
+        )
+    await _update_stage_percent(
+        job_id, "surface_validate", progress.overall_pct,
+    )
+    await _update_stage_bb_phase(
+        job_id, "surface_validate",
+        progress.phase, progress.phase_pct,
+    )
+    await _update_stage_bad_blocks(
+        job_id, "surface_validate", bad_blocks_total,
+    )
+    if (
+        drive_size_bytes
+        and not phase_changed
+        and progress.overall_pct > last_pct_sample
+        and time_since_last_db >= 1.0
+    ):
+        d_pct = progress.overall_pct - last_pct_sample
+        bytes_done = (d_pct / 800.0) * drive_size_bytes
+        mbps = bytes_done / time_since_last_db / 1_000_000
+        await _update_stage_bb_mbps(
+            job_id, "surface_validate", mbps,
+        )
+    if pending_log_chunks:
+        chunk = "".join(pending_log_chunks)
+        pending_log_chunks.clear()
+        await _append_stage_log(
+            job_id, "surface_validate", chunk,
+        )
+    last_pct_sample = progress.overall_pct
+    last_db_write_ts = now_ts
+    await _recalculate_progress(job_id)
+    _push_update()
+if not is_progress_line:
+    pending_log_chunks.append(line)
-# Abort on bad block threshold
+# Abort on bad block threshold — immediate.
if bad_blocks_total > settings.bad_block_threshold: if bad_blocks_total > settings.bad_block_threshold:
await kill.kill_remote_process(job_id) await kill.kill_remote_process(job_id)
output_lines.append( output_lines.append(
@ -477,15 +668,9 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
) )
return return
-        if await _is_cancelled(job_id):
-            await kill.kill_remote_process(job_id)
-            return
-await asyncio.gather(
-    _drain(proc.stdout, False),
-    _drain(proc.stderr, True),
-    return_exceptions=True,
-)
+# Single stream now — the `2>&1` in _build_badblocks_cmd
+# merges stderr into stdout before the `tr` pipe.
+await _drain(proc.stdout, False)
# Bound proc.wait so a remote process that ignored our kill # Bound proc.wait so a remote process that ignored our kill
# signal (or that we never managed to kill) can't pin this # signal (or that we never managed to kill) can't pin this
# task in the semaphore forever. Closing the connection on # task in the semaphore forever. Closing the connection on
@ -510,7 +695,21 @@ async def _stage_surface_validate_ssh(job_id: int, devname: str, drive_id: int)
result["aborted"] = bad_blocks_total > settings.bad_block_threshold result["aborted"] = bad_blocks_total > settings.bad_block_threshold
except asyncio.CancelledError: except asyncio.CancelledError:
-return False
+# Best-effort kill of the remote badblocks process before
+# propagating the cancel. asyncio.shield() so the kill attempt
+# itself isn't interrupted by ongoing loop shutdown. Then
+# re-raise so _run_job marks the job 'unknown' (honest about
+# the indeterminate outcome) instead of 'failed' (which
+# implies the burn-in itself failed, which we don't know).
+try:
+    await asyncio.shield(kill.kill_remote_process(job_id))
+except Exception:
+    pass
+await _append_stage_log(
+    job_id, "surface_validate",
+    "\n[ABORTED] task cancelled (likely container restart or shutdown)\n",
+)
+raise
except Exception as exc: except Exception as exc:
await _append_stage_log(job_id, "surface_validate", f"\n[SSH error] {exc}\n") await _append_stage_log(job_id, "surface_validate", f"\n[SSH error] {exc}\n")
await _set_stage_error(job_id, "surface_validate", f"SSH badblocks error: {exc}") await _set_stage_error(job_id, "surface_validate", f"SSH badblocks error: {exc}")
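
The throttle gate in `_drain` (write when the phase changes, otherwise at most once per `BB_DB_MIN_SECONDS`) can be isolated into a small helper for testing. This is a sketch under assumptions: `WriteThrottle` and its injectable `clock` parameter are hypothetical and not part of the codebase; only the 5-second interval and the "phase change forces a write" rule come from the diff.

```python
import time
from typing import Callable

BB_DB_MIN_SECONDS = 5.0


class WriteThrottle:
    """Allow a write on demand (force) or once per min_seconds."""

    def __init__(self, min_seconds: float = BB_DB_MIN_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._min = min_seconds
        self._clock = clock
        self._last = clock()

    def should_write(self, force: bool = False) -> bool:
        now = self._clock()
        if force or now - self._last >= self._min:
            self._last = now  # restart the window on every allowed write
            return True
        return False


# Drive the throttle with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
throttle = WriteThrottle(clock=lambda: fake_now[0])
decisions = []
for t, phase_changed in [(1.0, False), (4.9, False), (5.0, False), (6.0, False), (6.5, True)]:
    fake_now[0] = t
    decisions.append(throttle.should_write(force=phase_changed))
print(decisions)  # prints [False, False, True, False, True]
```

Injecting the clock keeps the helper testable without sleeping; production code would rely on the default `time.monotonic`.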


@ -49,7 +49,10 @@ class Settings(BaseSettings):
webhook_url: str = "" webhook_url: str = ""
# Stuck-job detection: jobs running longer than this are marked 'unknown' # Stuck-job detection: jobs running longer than this are marked 'unknown'
-stuck_job_hours: int = 24
+# and the remote badblocks/smartctl is killed. 168h (7 days) covers a
+# full -w surface_validate on a 14 TB+ HDD with margin. Older default
+# was 24h which false-positived on multi-TB drives almost every time.
+stuck_job_hours: int = 168
# Temperature thresholds (°C) — drives table colouring + precheck gate # Temperature thresholds (°C) — drives table colouring + precheck gate
temp_warn_c: int = 46 # orange warning temp_warn_c: int = 46 # orange warning
@ -83,7 +86,7 @@ class Settings(BaseSettings):
ssh_key: str = "" # PEM private key content (paste full key including headers) ssh_key: str = "" # PEM private key content (paste full key including headers)
# Application version — used by the /api/v1/updates/check endpoint # Application version — used by the /api/v1/updates/check endpoint
-app_version: str = "1.0.0-41"
+app_version: str = "1.0.0-60"
# ---- Authentication (1.0.0-22) ---- # ---- Authentication (1.0.0-22) ----
# session_secret: HMAC key for signing session cookies. Empty = generate # session_secret: HMAC key for signing session cookies. Empty = generate


@ -93,6 +93,24 @@ _MIGRATIONS = [
"ALTER TABLE drives ADD COLUMN pool_name TEXT", "ALTER TABLE drives ADD COLUMN pool_name TEXT",
"ALTER TABLE drives ADD COLUMN pool_role TEXT", "ALTER TABLE drives ADD COLUMN pool_role TEXT",
"ALTER TABLE drives ADD COLUMN pool_seen_at TEXT", "ALTER TABLE drives ADD COLUMN pool_seen_at TEXT",
# 1.0.0-44: per-pattern badblocks progress for the drive drawer's
# 4-meter UI. bb_phase is 1-8 (1=write 0xaa, 2=verify 0xaa, 3=write
# 0x55, 4=verify 0x55, 5=write 0xff, 6=verify 0xff, 7=write 0x00,
# 8=verify 0x00). bb_phase_pct is 0-100 within the current phase.
"ALTER TABLE burnin_stages ADD COLUMN bb_phase INTEGER",
"ALTER TABLE burnin_stages ADD COLUMN bb_phase_pct REAL",
# 1.0.0-46: live write/read throughput for the per-pattern meters.
# Computed from successive `XX% done` lines in badblocks output:
# delta_bytes = (overall_pct_delta / 800) * drive_size_bytes.
# Updated on every progress line; NULL until the second progress
# line arrives (need two samples to compute a rate).
"ALTER TABLE burnin_stages ADD COLUMN bb_mbps REAL",
# 1.0.0-47: per-pattern duration history. JSON map of
# {"1": "2026-05-09T05:39:44+00:00", "2": ..., ...} where each key
# is the phase number (1-8) and the value is when the parser first
# observed that phase. Drawer derives "0xaa: 14h 22m" by diffing
# consecutive phase-1 keys.
"ALTER TABLE burnin_stages ADD COLUMN bb_phase_history TEXT",
# 1.0.0-19: enforce one active burn-in per drive at the storage layer. # 1.0.0-19: enforce one active burn-in per drive at the storage layer.
# Closes the read-then-insert race in burnin.start_job — without this, # Closes the read-then-insert race in burnin.start_job — without this,
# two concurrent /api/v1/burnin/start requests for the same drive could # two concurrent /api/v1/burnin/start requests for the same drive could
@ -158,6 +176,7 @@ async def init_db() -> None:
Path(settings.db_path).parent.mkdir(parents=True, exist_ok=True) Path(settings.db_path).parent.mkdir(parents=True, exist_ok=True)
async with aiosqlite.connect(settings.db_path) as db: async with aiosqlite.connect(settings.db_path) as db:
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
await db.execute("PRAGMA foreign_keys=ON") await db.execute("PRAGMA foreign_keys=ON")
await db.executescript(SCHEMA) await db.executescript(SCHEMA)
await _run_migrations(db) await _run_migrations(db)
@ -169,6 +188,7 @@ async def get_db():
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
try: try:
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
await db.execute("PRAGMA foreign_keys=ON") await db.execute("PRAGMA foreign_keys=ON")
yield db yield db
finally: finally:


@ -334,6 +334,7 @@ async def _fetch_report_data() -> list[dict]:
async with aiosqlite.connect(settings.db_path) as db: async with aiosqlite.connect(settings.db_path) as db:
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
return await _fetch_drives_for_template(db) return await _fetch_drives_for_template(db)
@ -347,6 +348,7 @@ async def _fetch_unlock_events_24h() -> list[dict]:
async with aiosqlite.connect(settings.db_path) as db: async with aiosqlite.connect(settings.db_path) as db:
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
# julianday() handles the 'YYYY-MM-DDTHH:MM:SS.fff+00:00' format # julianday() handles the 'YYYY-MM-DDTHH:MM:SS.fff+00:00' format
# we write from Python; comparing the raw string against # we write from Python; comparing the raw string against
# datetime('now','-1 day') (which formats as 'YYYY-MM-DD HH:MM:SS') # datetime('now','-1 day') (which formats as 'YYYY-MM-DD HH:MM:SS')


@ -189,6 +189,21 @@ class _AuthGateMiddleware(BaseHTTPMiddleware):
await auth.get_user_by_id(int(user_id)) if user_id else None await auth.get_user_by_id(int(user_id)) if user_id else None
) )
# Loopback bypass (1.0.0-56): requests from 127.0.0.1 / ::1
# inside the container skip the auth gate. The only way to hit
# that source IP is a process in the container's network
# namespace — `docker exec` from the host. External traffic
# comes through the docker bridge with a non-loopback source,
# so it still goes through full auth. We read request.client.host
# directly (raw TCP socket), NOT X-Forwarded-For, so external
# attackers can't spoof loopback via headers. This unlocks the
# autonomous monitor's ability to POST /api/v1/burnin/start
# without provisioning a session cookie.
if request.client and request.client.host in ("127.0.0.1", "::1"):
if request.state.current_user is None:
request.state.current_user = auth.LoopbackUser()
return await call_next(request)
if path in _PUBLIC_PATHS or path.startswith(_PUBLIC_PREFIXES): if path in _PUBLIC_PATHS or path.startswith(_PUBLIC_PREFIXES):
return await call_next(request) return await call_next(request)
if request.state.current_user is not None: if request.state.current_user is not None:
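
The decision at the heart of the loopback bypass is that only the TCP peer address counts and forwarded headers are ignored. A minimal sketch of that rule follows; `is_loopback_peer` and its `forwarded_for` parameter are hypothetical and exist only to show the header being disregarded.

```python
from typing import Optional

LOOPBACK_HOSTS = ("127.0.0.1", "::1")


def is_loopback_peer(client_host: Optional[str],
                     forwarded_for: Optional[str] = None) -> bool:
    """True only when the raw socket peer is a loopback address.

    forwarded_for is accepted but deliberately unused: a client-supplied
    X-Forwarded-For value must never be able to grant the bypass.
    """
    return client_host in LOOPBACK_HOSTS


print(is_loopback_peer("127.0.0.1"))
print(is_loopback_peer("172.17.0.1", forwarded_for="127.0.0.1"))
```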


@ -437,6 +437,7 @@ async def poll_cycle(client: TrueNASClient) -> int:
async with aiosqlite.connect(settings.db_path) as db: async with aiosqlite.connect(settings.db_path) as db:
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
await db.execute("PRAGMA foreign_keys=ON") await db.execute("PRAGMA foreign_keys=ON")
for disk in disks: for disk in disks:
@ -492,6 +493,7 @@ async def run(client: TrueNASClient) -> None:
async with aiosqlite.connect(settings.db_path) as _tdb: async with aiosqlite.connect(settings.db_path) as _tdb:
_tdb.row_factory = aiosqlite.Row _tdb.row_factory = aiosqlite.Row
await _tdb.execute("PRAGMA journal_mode=WAL") await _tdb.execute("PRAGMA journal_mode=WAL")
await _tdb.execute("PRAGMA busy_timeout=60000")
_cur = await _tdb.execute(""" _cur = await _tdb.execute("""
SELECT MAX(d.temperature_c) SELECT MAX(d.temperature_c)
FROM drives d FROM drives d
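
The hunks above repeat the same three PRAGMA calls at each connection site. A minimal sketch of how they could be centralised, assuming a hypothetical `open_db` helper and using the standard-library `sqlite3` module instead of `aiosqlite` so the example stays self-contained (the PRAGMA statements themselves are the ones shown in the diff):

```python
import sqlite3

# Value taken from the diff above; 60 s wait before "database is locked".
BUSY_TIMEOUT_MS = 60_000


def open_db(path: str) -> sqlite3.Connection:
    """Hypothetical helper: open a connection and apply the shared PRAGMAs."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    # WAL lets readers proceed while a single writer holds the write lock.
    conn.execute("PRAGMA journal_mode=WAL")
    # busy_timeout is per connection, so every new connection must set it.
    conn.execute(f"PRAGMA busy_timeout={BUSY_TIMEOUT_MS}")
    conn.execute("PRAGMA foreign_keys=ON")
    return conn
```

Because `busy_timeout` is a per-connection setting, a single helper like this is one way to avoid a call site silently missing it; whether the project prefers a helper over per-site PRAGMAs is not stated in the diff.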


@ -128,6 +128,7 @@ async def sse_drives(request: Request):
async with aiosqlite.connect(settings.db_path) as db: async with aiosqlite.connect(settings.db_path) as db:
db.row_factory = aiosqlite.Row db.row_factory = aiosqlite.Row
await db.execute("PRAGMA journal_mode=WAL") await db.execute("PRAGMA journal_mode=WAL")
await db.execute("PRAGMA busy_timeout=60000")
drives = await _fetch_drives_for_template(db) drives = await _fetch_drives_for_template(db)
html = templates.env.get_template( html = templates.env.get_template(


@ -147,11 +147,12 @@ async def _fetch_drives_for_template(db: aiosqlite.Connection) -> list[dict]:
# For burn-ins that include SMART stages, fetch those stages so we can # For burn-ins that include SMART stages, fetch those stages so we can
# mirror their progress/result in the Short/Long SMART columns. # mirror their progress/result in the Short/Long SMART columns.
# We include burn-ins in ANY state — including failed/passed/cancelled —
# so the SMART columns don't go blank when the burn-in finishes. Without
# this, "FAILED (LONG SMART)" appears in the Burn-In column while the
# Long SMART column shows "—", which contradicts itself.
bi_smart_stages: dict[int, dict[str, dict]] = {} # job_id -> {stage_name: row} bi_smart_stages: dict[int, dict[str, dict]] = {} # job_id -> {stage_name: row}
bi_ids_with_smart = [ bi_ids_with_smart = [bi["id"] for bi in burnin_by_drive.values()]
bi["id"] for bi in burnin_by_drive.values()
if bi["state"] in ("running", "queued")
]
if bi_ids_with_smart: if bi_ids_with_smart:
placeholders = ",".join("?" * len(bi_ids_with_smart)) placeholders = ",".join("?" * len(bi_ids_with_smart))
# placeholders is purely structural ("?,?,?"); IDs themselves are # placeholders is purely structural ("?,?,?"); IDs themselves are
@ -163,7 +164,7 @@ async def _fetch_drives_for_template(db: aiosqlite.Connection) -> list[dict]:
"FROM burnin_stages bs " "FROM burnin_stages bs "
"WHERE bs.burnin_job_id IN (" + placeholders + ") " "WHERE bs.burnin_job_id IN (" + placeholders + ") "
" AND bs.stage_name IN ('short_smart', 'long_smart') " " AND bs.stage_name IN ('short_smart', 'long_smart') "
" AND bs.state IN ('running', 'passed', 'failed')" " AND bs.state IN ('running', 'passed', 'failed', 'aborted')"
) )
cur = await db.execute(sql, bi_ids_with_smart) cur = await db.execute(sql, bi_ids_with_smart)
for r in await cur.fetchall(): for r in await cur.fetchall():
@ -185,14 +186,26 @@ async def _fetch_drives_for_template(db: aiosqlite.Connection) -> list[dict]:
if existing.get("state") not in (None, "idle"): if existing.get("state") not in (None, "idle"):
continue continue
pct = stage["percent"] or 0 pct = stage["percent"] or 0
stage_state = stage["state"]
# If the parent burn-in ended in a terminal non-passing state
# (failed / cancelled / unknown) but this SMART stage is still
# recorded as "running", that's an orphaned stage row from a
# hard crash (e.g. the old `database is locked` failure mode).
# Surface the parent's state ("unknown" maps to "failed") so
# the column matches the Burn-In column.
if stage_state == "running" and bi.get("state") in (
"failed", "cancelled", "unknown"
):
stage_state = bi["state"] if bi["state"] != "unknown" else "failed"
d[target] = { d[target] = {
"state": stage["state"], "state": stage_state,
"percent": pct if stage["state"] == "running" else (100 if stage["state"] == "passed" else 0), "percent": pct if stage_state == "running" else (100 if stage_state == "passed" else 0),
"eta_seconds": _compute_eta_seconds(stage["started_at"], pct) if stage["state"] == "running" else None, "eta_seconds": _compute_eta_seconds(stage["started_at"], pct) if stage_state == "running" else None,
"eta_timestamp": None, "eta_timestamp": None,
"started_at": stage["started_at"], "started_at": stage["started_at"],
"finished_at": stage["finished_at"], "finished_at": stage["finished_at"],
"error_text": stage["error_text"], "error_text": stage["error_text"] or (
bi.get("error_text") if stage_state == "failed" else None
),
} }
drives.append(d) drives.append(d)
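
The orphaned-stage handling in this hunk can be read as a small pure function. The sketch below restates it outside the request handler; the function names `effective_stage_state` and `display_percent` are hypothetical and exist only for illustration, while the state names and the mapping come from the diff:

```python
from typing import Optional

# Parent burn-in states that mean "this job is no longer running".
TERMINAL_PARENT_STATES = ("failed", "cancelled", "unknown")


def effective_stage_state(stage_state: str, parent_state: Optional[str]) -> str:
    """Return the state to display for a SMART stage row.

    A stage still marked "running" under a parent that already ended is
    treated as orphaned and inherits the parent's state; a parent state
    of "unknown" is shown as "failed".
    """
    if stage_state == "running" and parent_state in TERMINAL_PARENT_STATES:
        return "failed" if parent_state == "unknown" else parent_state
    return stage_state


def display_percent(stage_state: str, pct: float) -> float:
    """Live percent while running, 100 when passed, otherwise 0."""
    if stage_state == "running":
        return pct
    return 100 if stage_state == "passed" else 0
```

Keeping the mapping in one function makes the three cases (failed, cancelled, unknown) easy to unit-test without a database.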


@ -57,11 +57,26 @@ async def drive_drawer(drive_id: int, db: aiosqlite.Connection = Depends(get_db)
job = dict(job_row) job = dict(job_row)
cur = await db.execute( cur = await db.execute(
"SELECT id, stage_name, state, percent, started_at, finished_at, " "SELECT id, stage_name, state, percent, started_at, finished_at, "
"duration_seconds, error_text, log_text, bad_blocks " "duration_seconds, error_text, log_text, bad_blocks, "
"bb_phase, bb_phase_pct, bb_mbps, bb_phase_history "
"FROM burnin_stages WHERE burnin_job_id=? ORDER BY id", "FROM burnin_stages WHERE burnin_job_id=? ORDER BY id",
(job_row["id"],), (job_row["id"],),
) )
job["stages"] = [dict(r) for r in await cur.fetchall()] stages = [dict(r) for r in await cur.fetchall()]
# Backfill empty stage.error_text from the parent job's error_text
# for any stage that ended in a terminal state without recording
# an error of its own. This catches the orphan pattern from hard
# crashes (DB-locked, SSH disconnect, container restart) where
# the failure didn't get to write a per-stage explanation.
job_err = job.get("error_text")
for s in stages:
if (
s.get("state") in ("failed", "cancelled", "unknown")
and not s.get("error_text")
and job_err
):
s["error_text"] = job_err
job["stages"] = stages
burnin_job = job burnin_job = job
# SMART raw output from smart_tests table # SMART raw output from smart_tests table
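
The error-text backfill added in the hunk above mutates the stage dicts in place. A minimal sketch of the same rule as a standalone function, assuming a hypothetical name `backfill_stage_errors` and returning copies rather than mutating (a design choice made here for testability, not something the diff does):

```python
from typing import Optional

TERMINAL_STATES = ("failed", "cancelled", "unknown")


def backfill_stage_errors(stages: list[dict], job_error: Optional[str]) -> list[dict]:
    """Copy the job-level error into terminal stages that recorded none."""
    result = []
    for stage in stages:
        stage = dict(stage)  # shallow copy so the caller's rows are untouched
        if (
            stage.get("state") in TERMINAL_STATES
            and not stage.get("error_text")
            and job_error
        ):
            stage["error_text"] = job_error
        result.append(stage)
    return result
```

Stages that already carry their own `error_text`, and stages in non-terminal states, pass through unchanged.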
@ -101,11 +116,12 @@ async def drive_drawer(drive_id: int, db: aiosqlite.Connection = Depends(get_db)
return { return {
"drive": { "drive": {
"id": drive.id, "id": drive.id,
"devname": drive.devname, "devname": drive.devname,
"serial": drive.serial, "serial": drive.serial,
"model": drive.model, "model": drive.model,
"size_bytes": drive.size_bytes, "size_bytes": drive.size_bytes,
"temperature_c": drive.temperature_c,
}, },
"burnin": burnin_job, "burnin": burnin_job,
"smart": { "smart": {


@ -244,7 +244,7 @@ thead {
} }
th { th {
padding: 9px 14px; padding: 6px 8px;
font-size: 11px; font-size: 11px;
font-weight: 600; font-weight: 600;
text-transform: uppercase; text-transform: uppercase;
@ -256,9 +256,10 @@ th {
} }
td { td {
padding: 10px 14px; padding: 7px 8px;
border-bottom: 1px solid var(--border); border-bottom: 1px solid var(--border);
vertical-align: middle; vertical-align: middle;
line-height: 1.3;
} }
tr:last-child td { tr:last-child td {
@ -276,17 +277,15 @@ tr:hover td {
/* ----------------------------------------------------------------------- /* -----------------------------------------------------------------------
Column widths Column widths
----------------------------------------------------------------------- */ ----------------------------------------------------------------------- */
.col-drive { min-width: 180px; } .col-drive { min-width: 160px; }
.col-serial { min-width: 110px; } .col-serial { min-width: 95px; }
.col-size { min-width: 70px; text-align: right; } .col-size { min-width: 60px; text-align: right; }
.col-temp { min-width: 75px; text-align: right; } .col-temp { min-width: 60px; text-align: right; }
.col-health { min-width: 85px; } .col-health { min-width: 70px; }
.col-smart { min-width: 95px; } .col-smart { min-width: 80px; }
/* Tighter horizontal padding on the SMART columns they hold short /* Tighter SMART columns — they hold short pills or a progress bar. */
pills ("Passed"/"—") or a progress bar, so the default 14px gutter th.col-smart, td.col-smart { padding-left: 5px; padding-right: 5px; }
wastes space on 13" laptops. */ .col-actions { min-width: 150px; }
th.col-smart, td.col-smart { padding-left: 6px; padding-right: 6px; }
.col-actions { min-width: 170px; }
/* ----------------------------------------------------------------------- /* -----------------------------------------------------------------------
Drive cell Drive cell
@ -295,14 +294,23 @@ th.col-smart, td.col-smart { padding-left: 6px; padding-right: 6px; }
display: block; display: block;
font-weight: 500; font-weight: 500;
color: var(--text-strong); color: var(--text-strong);
font-size: 14px; font-size: 13px;
line-height: 1.25;
} }
.drive-model { .drive-model {
display: block; display: inline;
font-size: 11px; font-size: 10px;
color: var(--text-muted); color: var(--text-muted);
margin-top: 1px; margin-top: 0;
line-height: 1.25;
}
/* Separator between model and location when both are present on the
same line. ::before on the .drive-location that follows .drive-model
puts a thin dot between them. */
.drive-model + .drive-location::before {
content: " · ";
color: var(--border);
margin: 0 2px;
} }
/* ----------------------------------------------------------------------- /* -----------------------------------------------------------------------
@ -425,7 +433,7 @@ th.col-smart, td.col-smart { padding-left: 6px; padding-right: 6px; }
/* ----------------------------------------------------------------------- /* -----------------------------------------------------------------------
Burn-in column Burn-in column
----------------------------------------------------------------------- */ ----------------------------------------------------------------------- */
.col-burnin { min-width: 160px; } .col-burnin { min-width: 130px; }
.burnin-cell { min-width: 140px; } .burnin-cell { min-width: 140px; }
@ -1180,9 +1188,9 @@ a.stat-card:hover {
Checkbox column Checkbox column
----------------------------------------------------------------------- */ ----------------------------------------------------------------------- */
.col-check { .col-check {
width: 36px; width: 32px;
min-width: 36px; min-width: 32px;
padding: 10px 8px 10px 14px; padding: 7px 4px 7px 8px;
} }
.drive-checkbox, #select-all-cb { .drive-checkbox, #select-all-cb {
@ -1196,18 +1204,15 @@ a.stat-card:hover {
Drive location inline edit Drive location inline edit
----------------------------------------------------------------------- */ ----------------------------------------------------------------------- */
.drive-location { .drive-location {
display: block; display: inline;
font-size: 10px; font-size: 10px;
color: var(--text-muted); color: var(--text-muted);
margin-top: 2px; margin-top: 0;
cursor: pointer; cursor: pointer;
border-radius: 3px; border-radius: 3px;
padding: 1px 3px; padding: 0 3px;
line-height: 1.1;
transition: background 0.1s; transition: background 0.1s;
max-width: 160px;
overflow: hidden;
text-overflow: ellipsis;
white-space: nowrap;
} }
.drive-location:hover { background: var(--border); color: var(--text); } .drive-location:hover { background: var(--border); color: var(--text); }
@ -2694,3 +2699,276 @@ tr.drawer-row-active {
font-variant-numeric: tabular-nums; font-variant-numeric: tabular-nums;
} }
/* -----------------------------------------------------------------------
Per-pattern badblocks meters in the drive drawer (1.0.0-44).
Four meters, one per pattern (0xaa / 0x55 / 0xff / 0x00). Each meter
has two halves: write (left) and verify (right), so a glance shows
both which pattern is running and which sub-phase within it.
----------------------------------------------------------------------- */
.bb-meters {
display: grid;
grid-template-columns: repeat(4, 1fr);
gap: 8px;
padding: 10px 12px;
background: var(--bg-soft, #161b22);
border-radius: 6px;
margin: 6px 0 8px 0;
}
.bb-meter {
display: flex;
flex-direction: column;
gap: 4px;
}
.bb-meter-label {
font-family: "SF Mono", "Consolas", monospace;
font-size: 10px;
color: var(--text-muted);
text-transform: uppercase;
letter-spacing: .04em;
}
.bb-meter-current .bb-meter-label {
color: var(--blue, #58a6ff);
font-weight: 600;
}
.bb-meter-done .bb-meter-label {
color: var(--green, #3fb950);
}
.bb-meter-bar {
display: flex;
height: 10px;
background: var(--bg, #0d1117);
border: 1px solid var(--border, #30363d);
border-radius: 3px;
overflow: hidden;
position: relative;
}
.bb-meter-half {
height: 100%;
transition: width .3s ease;
}
.bb-write {
background: var(--blue, #58a6ff);
flex: 0 0 auto;
max-width: 50%;
}
.bb-verify {
background: var(--green, #3fb950);
flex: 0 0 auto;
max-width: 50%;
}
.bb-meter-half-spacer {
flex: 0 0 auto;
width: 1px;
background: var(--border, #30363d);
height: 100%;
}
.bb-meter-done .bb-write,
.bb-meter-done .bb-verify {
opacity: .55;
}
.bb-meter-sub {
display: flex;
justify-content: space-between;
font-family: "SF Mono", "Consolas", monospace;
font-size: 9px;
color: var(--text-muted);
}
.bb-sub-write { color: color-mix(in srgb, var(--blue) 80%, var(--text-muted)); }
.bb-sub-verify { color: color-mix(in srgb, var(--green) 80%, var(--text-muted)); }
/* -----------------------------------------------------------------------
Surface-scan vital-signs row in the drawer (1.0.0-46).
Sits directly above the per-pattern meters. Temperature with
green/yellow/red colour, live MB/s, elapsed and ETA, all derived
from data already in the drawer payload.
----------------------------------------------------------------------- */
.bb-vitals {
display: flex;
gap: 14px;
flex-wrap: wrap;
padding: 8px 12px 4px 12px;
background: var(--bg-soft, #161b22);
border-radius: 6px 6px 0 0;
margin: 6px 0 0 0;
border-bottom: 1px solid var(--border, #30363d);
}
/* When vitals lead, suppress the meter strip's top radius + margin so
they read as one stacked unit. */
.bb-vitals + .bb-meters {
border-radius: 0 0 6px 6px;
margin-top: 0;
}
.bb-vital {
display: flex;
flex-direction: column;
gap: 1px;
font-family: "SF Mono", "Consolas", monospace;
}
.bb-vital-label {
font-size: 9px;
color: var(--text-muted);
text-transform: uppercase;
letter-spacing: .04em;
}
.bb-vital-value {
font-size: 13px;
color: var(--text-strong, #f0f6fc);
font-weight: 500;
font-variant-numeric: tabular-nums;
}
/* -----------------------------------------------------------------------
Phase caption + per-pattern history (1.0.0-47).
----------------------------------------------------------------------- */
.bb-caption {
font-family: "SF Mono", "Consolas", monospace;
font-size: 11px;
color: var(--text-muted);
padding: 6px 12px 0 12px;
letter-spacing: .02em;
}
.bb-history {
display: flex;
flex-wrap: wrap;
align-items: center;
gap: 10px;
padding: 6px 12px 8px 12px;
font-family: "SF Mono", "Consolas", monospace;
font-size: 10px;
color: var(--text-muted);
}
.bb-hist-title {
text-transform: uppercase;
letter-spacing: .04em;
font-size: 9px;
margin-right: 4px;
}
.bb-hist-row {
display: inline-flex;
align-items: baseline;
gap: 4px;
background: var(--bg, #0d1117);
border: 1px solid var(--border, #30363d);
border-radius: 3px;
padding: 1px 6px;
}
.bb-hist-label {
color: var(--green, #3fb950);
font-weight: 600;
}
.bb-hist-dur {
color: var(--text-strong, #f0f6fc);
font-variant-numeric: tabular-nums;
}
/* Bad-block counter colour states inside the vitals row */
.bb-vital-good { color: var(--green, #3fb950); }
.bb-vital-bad { color: var(--red, #f85149); }
/* -----------------------------------------------------------------------
Column sort (1.0.0-48). Click a sortable TH to cycle asc → desc →
cleared. Indicator arrow appears next to the column label.
----------------------------------------------------------------------- */
th.sortable {
cursor: pointer;
user-select: none;
position: relative;
}
th.sortable:hover { color: var(--text); }
th.sortable::after {
content: "";
display: inline-block;
width: 0;
height: 0;
margin-left: 4px;
border-left: 4px solid transparent;
border-right: 4px solid transparent;
vertical-align: middle;
opacity: 0;
}
th.sortable:hover::after { opacity: 0.4; border-bottom: 5px solid currentColor; }
th.sort-asc::after {
opacity: 1;
border-bottom: 5px solid var(--blue, #58a6ff);
}
th.sort-desc::after {
opacity: 1;
border-top: 5px solid var(--blue, #58a6ff);
}
/* -----------------------------------------------------------------------
Stage "Reason" block: explains why a stage ended in a terminal
state. Replaces the old single-line stage-error-line for
failed/cancelled/unknown stages so the operator gets a clear,
prominent explanation at the top.
----------------------------------------------------------------------- */
.stage-reason {
display: flex;
gap: 10px;
align-items: baseline;
padding: 8px 12px;
margin: 6px 0;
border-radius: 5px;
font-size: 12px;
border: 1px solid;
}
.stage-reason-failed {
background: var(--red-bg, color-mix(in srgb, var(--red) 12%, transparent));
border-color: var(--red-bd, color-mix(in srgb, var(--red) 40%, transparent));
}
.stage-reason-cancelled,
.stage-reason-unknown {
background: var(--yellow-bg, color-mix(in srgb, var(--yellow) 12%, transparent));
border-color: var(--yellow-bd, color-mix(in srgb, var(--yellow) 40%, transparent));
}
.stage-reason-label {
font-size: 10px;
text-transform: uppercase;
letter-spacing: .06em;
font-weight: 600;
color: var(--text-muted);
flex-shrink: 0;
}
.stage-reason-text {
flex: 1;
color: var(--text-strong, #f0f6fc);
line-height: 1.4;
word-wrap: break-word;
}
.stage-reason-failed .stage-reason-text { color: var(--red, #f85149); }
.stage-reason-cancelled .stage-reason-text,
.stage-reason-unknown .stage-reason-text { color: var(--yellow, #d29922); }
/* -----------------------------------------------------------------------
Drawer job-level estimated completion (right-aligned in the header,
so it doesn't compete with the state chip + operator info).
----------------------------------------------------------------------- */
.drawer-job-header {
display: flex;
align-items: center;
gap: 10px;
flex-wrap: wrap;
}
.drawer-job-finish {
display: inline-flex;
align-items: baseline;
gap: 8px;
padding: 4px 10px;
background: var(--bg-soft, #161b22);
border: 1px solid var(--border, #30363d);
border-radius: 5px;
font-family: "SF Mono", "Consolas", monospace;
}
.drawer-job-finish-label {
font-size: 9px;
color: var(--text-muted);
text-transform: uppercase;
letter-spacing: .04em;
}
.drawer-job-finish-value {
font-size: 12px;
color: var(--text-strong, #f0f6fc);
font-weight: 500;
font-variant-numeric: tabular-nums;
}


@ -79,12 +79,86 @@
initElapsedTimers(); initElapsedTimers();
initUnlockCountdowns(); initUnlockCountdowns();
initLocationEdits(); initLocationEdits();
applySort(); // SSE swap replaces #drives-tbody — re-apply persisted sort
paintSortIndicators();
if (_drawerDriveId) { if (_drawerDriveId) {
_drawerHighlightRow(_drawerDriveId); _drawerHighlightRow(_drawerDriveId);
drawerFetch(_drawerDriveId); drawerFetch(_drawerDriveId);
} }
}); });
// ---------------------------------------------------------------
// Column sorting (client-side, persisted in localStorage so it
// survives reload AND survives every SSE-driven tbody refresh).
// ---------------------------------------------------------------
var SORT_KEY = 'nasburnin.sort';
function getSort() {
try {
var raw = localStorage.getItem(SORT_KEY);
if (!raw) return null;
var p = JSON.parse(raw);
if (p && p.col && (p.dir === 'asc' || p.dir === 'desc')) return p;
} catch (e) {}
return null;
}
function setSort(col, dir) {
if (!col) localStorage.removeItem(SORT_KEY);
else localStorage.setItem(SORT_KEY, JSON.stringify({col: col, dir: dir}));
}
function applySort() {
var s = getSort();
var tbody = document.getElementById('drives-tbody');
if (!tbody || !s) return;
var rows = Array.from(tbody.querySelectorAll('tr[id^="drive-"]'));
if (!rows.length) return;
var attr = 'data-sort-' + s.col;
var dirMul = s.dir === 'asc' ? 1 : -1;
rows.sort(function (a, b) {
var av = a.getAttribute(attr);
var bv = b.getAttribute(attr);
// Empty values always sink to the bottom regardless of direction.
var aEmpty = av === null || av === '';
var bEmpty = bv === null || bv === '';
if (aEmpty && !bEmpty) return 1;
if (!aEmpty && bEmpty) return -1;
if (aEmpty && bEmpty) return 0;
// Numeric comparison if both parse cleanly, else string.
var an = parseFloat(av), bn = parseFloat(bv);
if (!isNaN(an) && !isNaN(bn) && String(an) === av && String(bn) === bv) {
return (an - bn) * dirMul;
}
return av.localeCompare(bv) * dirMul;
});
rows.forEach(function (r) { tbody.appendChild(r); });
}
function paintSortIndicators() {
var s = getSort();
document.querySelectorAll('th.sortable').forEach(function (th) {
th.classList.remove('sort-asc', 'sort-desc');
if (s && th.dataset.sortKey === s.col) {
th.classList.add(s.dir === 'asc' ? 'sort-asc' : 'sort-desc');
}
});
}
document.addEventListener('click', function (e) {
var th = e.target.closest('th.sortable');
if (!th) return;
var col = th.dataset.sortKey;
var s = getSort();
var dir = 'asc';
if (s && s.col === col) {
// Click cycle: asc → desc → cleared
if (s.dir === 'asc') dir = 'desc';
else { setSort(null); applySort(); paintSortIndicators(); return; }
}
setSort(col, dir);
applySort();
paintSortIndicators();
});
// Initial paint on page load (HTML is already rendered server-side).
applySort();
paintSortIndicators();
updateCounts(); updateCounts();
// ----------------------------------------------------------------------- // -----------------------------------------------------------------------
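
The comparator in `applySort()` has three rules: empty values sink to the bottom in either direction, numeric values compare numerically, and everything else compares as strings. The sketch below is an approximate Python port for illustration only. It assumes `float()` parsing in place of the JavaScript `String(parseFloat(v)) === v` round-trip check and plain string comparison in place of `localeCompare`, so edge cases such as `"1e3"` or locale-specific ordering will differ from the browser code.

```python
from functools import cmp_to_key
from typing import Callable, Optional


def _as_number(value: str) -> Optional[float]:
    """Return the value as a float, or None if it does not parse."""
    try:
        return float(value)
    except ValueError:
        return None


def make_comparator(direction: str) -> Callable[[Optional[str], Optional[str]], int]:
    """Build a comparator for 'asc' or 'desc' that keeps empties last."""
    mul = 1 if direction == "asc" else -1

    def compare(a: Optional[str], b: Optional[str]) -> int:
        a_empty = a is None or a == ""
        b_empty = b is None or b == ""
        # Empty values sink to the bottom regardless of direction.
        if a_empty and b_empty:
            return 0
        if a_empty:
            return 1
        if b_empty:
            return -1
        an, bn = _as_number(a), _as_number(b)
        if an is not None and bn is not None:
            return mul * ((an > bn) - (an < bn))
        return mul * ((a > b) - (a < b))

    return compare


def sort_values(values: list, direction: str) -> list:
    return sorted(values, key=cmp_to_key(make_comparator(direction)))
```

Note that the direction multiplier is applied only after the empty-value checks, which is what keeps blank cells at the bottom for both ascending and descending order.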
@ -1271,8 +1345,14 @@
} }
} }
// Stash the last drive object so the burn-in panel renderer can
// pull temperature_c into the vital-signs row without having to
// pass it through the Burn-In renderer's signature.
var _DRAWER_LAST_DRIVE = null;
function _drawerRender(data) { function _drawerRender(data) {
var drive = data.drive || {}; var drive = data.drive || {};
_DRAWER_LAST_DRIVE = drive;
var devnameEl = document.getElementById('drawer-devname'); var devnameEl = document.getElementById('drawer-devname');
var metaEl = document.getElementById('drawer-drive-meta'); var metaEl = document.getElementById('drawer-drive-meta');
if (devnameEl) devnameEl.textContent = drive.devname || '\u2014'; if (devnameEl) devnameEl.textContent = drive.devname || '\u2014';
@ -1286,6 +1366,170 @@
_drawerRenderEvents(data.events); _drawerRenderEvents(data.events);
} }
// Vital-signs row above the meters: start time, elapsed time, ETA,
// projected finish, and drive temp. Computed from data already in
// the drawer payload.
function _drawerRenderBadblocksVitals(stage, drive) {
var phase = parseInt(stage.bb_phase, 10) || 1;
var phasePct = parseFloat(stage.bb_phase_pct || 0);
var overallPct = ((phase - 1) * 100 + phasePct) / 8; // 0..100
var html = '<div class="bb-vitals">';
var dateOpts = {
weekday: 'short', month: 'short', day: 'numeric',
hour: 'numeric', minute: '2-digit',
};
// Start (wall-clock, with date)
if (stage.started_at) {
var startMs = Date.parse(stage.started_at);
var startStr = new Date(startMs).toLocaleString(undefined, dateOpts);
html += '<div class="bb-vital">';
html += '<span class="bb-vital-label">Start</span>';
html += '<span class="bb-vital-value">' + startStr + '</span>';
html += '</div>';
// Elapsed
var elapsedSec = Math.max(0, (Date.now() - startMs) / 1000);
html += '<div class="bb-vital">';
html += '<span class="bb-vital-label">Elapsed</span>';
html += '<span class="bb-vital-value">' + _bbFmtDuration(elapsedSec) + '</span>';
html += '</div>';
// ETA + Finish — only once we have measurable progress, so the
// first samples don't paint a "47 days" estimate.
if (overallPct >= 0.5) {
var totalSec = elapsedSec * (100 / overallPct);
var remainingSec = Math.max(0, totalSec - elapsedSec);
html += '<div class="bb-vital">';
html += '<span class="bb-vital-label">ETA</span>';
html += '<span class="bb-vital-value">' + _bbFmtDuration(remainingSec) + '</span>';
html += '</div>';
var finishStr = new Date(Date.now() + remainingSec * 1000)
.toLocaleString(undefined, dateOpts);
html += '<div class="bb-vital">';
html += '<span class="bb-vital-label">Finish</span>';
html += '<span class="bb-vital-value">' + finishStr + '</span>';
html += '</div>';
}
}
// Temp with hot/warm/cool colour
if (drive && typeof drive.temperature_c === 'number') {
var tc = drive.temperature_c;
var tClass = 'temp-cool';
if (tc >= 48) tClass = 'temp-hot';
else if (tc >= 42) tClass = 'temp-warm';
html += '<div class="bb-vital">';
html += '<span class="bb-vital-label">Temp</span>';
html += '<span class="bb-vital-value temp ' + tClass + '">' + tc + '°C</span>';
html += '</div>';
}
html += '</div>';
return html;
}
function _bbFmtDuration(sec) {
sec = Math.floor(sec);
var d = Math.floor(sec / 86400);
var h = Math.floor((sec % 86400) / 3600);
var m = Math.floor((sec % 3600) / 60);
if (d > 0) return d + 'd ' + h + 'h';
if (h > 0) return h + 'h ' + m + 'm';
return m + 'm';
}
// Phase caption — explicit text below the meters: e.g.
// "Pattern 2 of 4 · Verify 0x55 · 47% within phase".
function _drawerRenderBadblocksCaption(phase, phasePct) {
if (!phase) return '';
var p = parseInt(phase, 10);
var pct = parseFloat(phasePct || 0);
var labels = ['0xaa', '0x55', '0xff', '0x00'];
var pattern = Math.ceil(p / 2);
var subPhase = (p % 2 === 1) ? 'Write' : 'Verify';
var label = labels[pattern - 1];
var html = '<div class="bb-caption">';
html += 'Pattern ' + pattern + ' of 4 · ';
html += subPhase + ' ' + label + ' · ';
html += pct.toFixed(1) + '% within phase';
html += '</div>';
return html;
}
// Per-pattern duration history. Reads bb_phase_history (JSON) and
// emits "0xaa: 14h 22m" rows for completed patterns. Pattern N is
// "complete" when its verify-phase end timestamp is known (= the
// next pattern's write-phase start, or stage.finished_at for the
// final one).
function _drawerRenderBadblocksHistory(stage) {
if (!stage.bb_phase_history) return '';
var hist;
try { hist = JSON.parse(stage.bb_phase_history); }
catch (e) { return ''; }
if (!hist || typeof hist !== 'object') return '';
var labels = ['0xaa', '0x55', '0xff', '0x00'];
var rows = [];
for (var n = 1; n <= 4; n++) {
var writeStart = hist[String(2 * n - 1)];
if (!writeStart) continue;
var endTs = (n < 4) ? hist[String(2 * n + 1)] : stage.finished_at;
if (!endTs) continue;
var elapsedSec = (Date.parse(endTs) - Date.parse(writeStart)) / 1000;
if (elapsedSec <= 0) continue;
rows.push('<span class="bb-hist-row">' +
'<span class="bb-hist-label">' + labels[n - 1] + '</span>' +
'<span class="bb-hist-dur">' + _bbFmtDuration(elapsedSec) + '</span>' +
'</span>');
}
if (!rows.length) return '';
return '<div class="bb-history"><span class="bb-hist-title">Completed patterns</span>' +
rows.join('') + '</div>';
}
// Render 4 pattern meters for badblocks -w surface_validate. Each
// meter splits write/verify halves so you can see at a glance which
// pattern is current AND whether you're writing or verifying within
// it. phase: 1-8 (1=write 0xaa, 2=verify 0xaa, 3=write 0x55, ...).
function _drawerRenderBadblocksMeters(phase, phasePct) {
if (!phase) return '';
var p = parseInt(phase, 10);
var pct = parseFloat(phasePct || 0);
var labels = ['0xaa', '0x55', '0xff', '0x00'];
var html = '<div class="bb-meters">';
for (var i = 0; i < 4; i++) {
var writePhase = i * 2 + 1;
var verifyPhase = writePhase + 1;
var writeFill, verifyFill;
if (p > verifyPhase) {
writeFill = 100; verifyFill = 100;
} else if (p === verifyPhase) {
writeFill = 100; verifyFill = pct;
} else if (p === writePhase) {
writeFill = pct; verifyFill = 0;
} else {
writeFill = 0; verifyFill = 0;
}
var classes = 'bb-meter';
if (p === writePhase || p === verifyPhase) classes += ' bb-meter-current';
if (p > verifyPhase) classes += ' bb-meter-done';
html += '<div class="' + classes + '">';
html += '<div class="bb-meter-label">' + labels[i] + '</div>';
html += '<div class="bb-meter-bar">';
html += '<div class="bb-meter-half bb-write" style="width:' + writeFill.toFixed(1) + '%"></div>';
html += '<div class="bb-meter-half-spacer"></div>';
html += '<div class="bb-meter-half bb-verify" style="width:' + verifyFill.toFixed(1) + '%"></div>';
html += '</div>';
html += '<div class="bb-meter-sub">';
html += '<span class="bb-sub-write">W ' + Math.round(writeFill) + '%</span>';
html += '<span class="bb-sub-verify">V ' + Math.round(verifyFill) + '%</span>';
html += '</div>';
html += '</div>';
}
html += '</div>';
return html;
}
function _drawerRenderBurnin(burnin) { function _drawerRenderBurnin(burnin) {
var panel = document.getElementById('drawer-panel-burnin'); var panel = document.getElementById('drawer-panel-burnin');
if (!panel) return; if (!panel) return;
@ -1300,7 +1544,30 @@
html += '<span class="drawer-job-meta">'; html += '<span class="drawer-job-meta">';
if (burnin.operator) html += 'by ' + _esc(burnin.operator); if (burnin.operator) html += 'by ' + _esc(burnin.operator);
if (burnin.started_at) html += ' \u00b7 ' + _drawerFmtDt(burnin.started_at); if (burnin.started_at) html += ' \u00b7 ' + _drawerFmtDt(burnin.started_at);
html += '</span></div>'; html += '</span>';
// Job-level estimated completion. Uses the weighted overall job %
// (recalculated server-side from stage progress) so it reflects
// every stage, not just the current one. Suppressed under 0.5%
// so the early sample doesn't paint a "Finish: Sep 22" stutter.
if (burnin.state === 'running' && burnin.started_at) {
var jobPct = parseFloat(burnin.percent || 0);
if (jobPct >= 0.5) {
var jobStartMs = Date.parse(burnin.started_at);
var jobElapsedSec = Math.max(0, (Date.now() - jobStartMs) / 1000);
var jobTotalSec = jobElapsedSec * (100 / jobPct);
var jobRemainSec = Math.max(0, jobTotalSec - jobElapsedSec);
var jobFinish = new Date(Date.now() + jobRemainSec * 1000);
var jobFinishStr = jobFinish.toLocaleString(undefined, {
weekday: 'short', month: 'short', day: 'numeric',
hour: 'numeric', minute: '2-digit',
});
html += '<span class="drawer-job-finish" title="Estimated completion of the entire burn-in (all stages)">';
html += '<span class="drawer-job-finish-label">Est. completion</span>';
html += '<span class="drawer-job-finish-value">' + jobFinishStr + '</span>';
html += '</span>';
}
}
html += '</div>';
html += '<div class="drawer-stages">'; html += '<div class="drawer-stages">';
var stages = burnin.stages || []; var stages = burnin.stages || [];
@ -1320,9 +1587,37 @@
html += '<span class="stage-duration">' + _drawerFmtDuration(s.started_at, s.finished_at) + '</span>'; html += '<span class="stage-duration">' + _drawerFmtDuration(s.started_at, s.finished_at) + '</span>';
} }
html += '</div>'; html += '</div>';
if (s.error_text) { // Prominent "Why it failed" block at the top of failed/cancelled/
// unknown stages. Falls back to a heuristic when no error was
// recorded — e.g. a tiny log + no badblocks progress + terminal
// state means the stage was killed externally (SSH disconnect or
// container restart) before it could record an error.
if (s.state === 'failed' || s.state === 'cancelled' || s.state === 'unknown') {
var reason = s.error_text;
if (!reason) {
var logLen = (s.log_text || '').length;
var noBbProgress = !s.bb_phase || (parseInt(s.bb_phase, 10) === 1 && (parseFloat(s.bb_phase_pct || 0) < 0.1));
if (logLen < 500 && noBbProgress) {
reason = 'Stopped without recording an error — likely cause: SSH connection drop or container restart while this stage was running.';
} else {
reason = 'No error message recorded.';
}
}
html += '<div class="stage-reason stage-reason-' + _esc(s.state) + '">';
html += '<span class="stage-reason-label">Reason</span>';
html += '<span class="stage-reason-text">' + _esc(reason) + '</span>';
html += '</div>';
} else if (s.error_text) {
html += '<div class="stage-error-line">' + _esc(s.error_text) + '</div>'; html += '<div class="stage-error-line">' + _esc(s.error_text) + '</div>';
} }
// Per-pattern meters for badblocks surface_validate, plus the
// vital-signs row above (start / elapsed / ETA / finish / temp).
if (s.stage_name === 'surface_validate' && s.bb_phase) {
html += _drawerRenderBadblocksVitals(s, _DRAWER_LAST_DRIVE);
html += _drawerRenderBadblocksMeters(s.bb_phase, s.bb_phase_pct);
html += _drawerRenderBadblocksCaption(s.bb_phase, s.bb_phase_pct);
html += _drawerRenderBadblocksHistory(s);
}
// Raw SSH log output (if available) // Raw SSH log output (if available)
if (s.log_text) { if (s.log_text) {
var logHtml = _esc(s.log_text) var logHtml = _esc(s.log_text)
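
The badblocks rendering above relies on a fixed arithmetic: phases 1 to 8 map onto four patterns with a write and a verify sub-phase each, overall progress is the phase-weighted average, and the ETA is a linear extrapolation from elapsed time. A worked sketch of that arithmetic, ported to Python for illustration (function names are hypothetical; the formulas and the 0.5% threshold are the ones in the JavaScript above):

```python
import math
from typing import Optional

PATTERNS = ("0xaa", "0x55", "0xff", "0x00")


def decode_phase(phase: int) -> tuple[str, str]:
    """Map phase 1..8 to (pattern label, sub-phase)."""
    pattern_index = math.ceil(phase / 2)  # 1..4
    sub_phase = "Write" if phase % 2 == 1 else "Verify"
    return PATTERNS[pattern_index - 1], sub_phase


def overall_percent(phase: int, phase_pct: float) -> float:
    """Overall 0..100 progress across all 8 phases."""
    return ((phase - 1) * 100 + phase_pct) / 8


def remaining_seconds(elapsed_sec: float, overall_pct: float) -> Optional[float]:
    """Linear ETA; None below 0.5% so early samples give no wild estimate."""
    if overall_pct < 0.5:
        return None
    total_sec = elapsed_sec * (100 / overall_pct)
    return max(0.0, total_sec - elapsed_sec)


def fmt_duration(sec: float) -> str:
    """Format non-negative seconds as 'Nd Nh', 'Nh Nm', or 'Nm'."""
    sec = math.floor(sec)
    d = sec // 86400
    h = (sec % 86400) // 3600
    m = (sec % 3600) // 60
    if d > 0:
        return f"{d}d {h}h"
    if h > 0:
        return f"{h}h {m}m"
    return f"{m}m"
```

For example, phase 4 at 47% is the verify pass of pattern `0x55`, and the overall figure is (3 × 100 + 47) / 8 = 43.375%. The extrapolation assumes a constant rate across phases, which is an approximation if write and verify passes run at different speeds.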


@ -46,7 +46,13 @@
 {%- elif bi.state == 'passed' -%}
 <span class="chip chip-passed">Passed</span>
 {%- elif bi.state == 'failed' -%}
-<span class="chip chip-failed">Failed{% if bi.stage_name %} ({{ bi.stage_name | replace('_',' ') }}){% endif %}</span>
+{# Suppress the stage suffix for SMART + surface_validate stages.
+   SMART has its own columns, and surface_validate is the dominant
+   case so a redundant suffix just adds visual noise. The drawer
+   shows the per-stage Reason for any digging. Keep the suffix for
+   precheck / final_check since those are rare enough that the hint
+   is helpful. #}
+<span class="chip chip-failed">Failed{% if bi.stage_name and bi.stage_name not in ('short_smart', 'long_smart', 'surface_validate') %} ({{ bi.stage_name | replace('_',' ') }}){% endif %}</span>
 {%- elif bi.state == 'cancelled' -%}
 <span class="chip chip-aborted">Cancelled</span>
 {%- elif bi.state == 'unknown' -%}
@ -63,14 +69,14 @@
 <th class="col-check">
   <input type="checkbox" id="select-all-cb" class="drive-cb" title="Select all idle drives">
 </th>
-<th class="col-drive">Drive</th>
-<th class="col-serial">Serial</th>
-<th class="col-size">Size</th>
-<th class="col-temp">Temp</th>
-<th class="col-health">Health</th>
-<th class="col-smart">Short SMART</th>
-<th class="col-smart">Long SMART</th>
-<th class="col-burnin">Burn-In</th>
+<th class="col-drive sortable" data-sort-key="drive">Drive</th>
+<th class="col-serial sortable" data-sort-key="serial">Serial</th>
+<th class="col-size sortable" data-sort-key="size">Size</th>
+<th class="col-temp sortable" data-sort-key="temp">Temp</th>
+<th class="col-health sortable" data-sort-key="health">Health</th>
+<th class="col-smart sortable" data-sort-key="short">Short SMART</th>
+<th class="col-smart sortable" data-sort-key="long">Long SMART</th>
+<th class="col-burnin sortable" data-sort-key="burnin">Burn-In</th>
 <th class="col-actions">Actions</th>
 </tr>
 </thead>
@ -89,7 +95,19 @@
 {%- set smart_done = (drive.smart_short and drive.smart_short.state in ('passed','failed','aborted'))
                    or (drive.smart_long and drive.smart_long.state in ('passed','failed','aborted')) %}
 {%- set can_reset = (bi_done or smart_done) and not bi_active and not short_busy and not long_busy and not pool_locked %}
-<tr data-status="{{ drive.status }}" id="drive-{{ drive.id }}">
+{%- set short_state = drive.smart_short.state if drive.smart_short else 'idle' %}
+{%- set long_state = drive.smart_long.state if drive.smart_long else 'idle' %}
+{%- set burnin_state = drive.burnin.state if drive.burnin else '' %}
+<tr data-status="{{ drive.status }}" id="drive-{{ drive.id }}"
+    data-sort-drive="{{ drive.devname }}"
+    data-sort-serial="{{ (drive.serial or '') | lower }}"
+    data-sort-size="{{ drive.size_bytes or 0 }}"
+    data-sort-temp="{{ drive.temperature_c if drive.temperature_c is not none else '' }}"
+    data-sort-health="{{ {'PASSED': 1, 'WARNING': 2, 'FAILED': 3, 'UNKNOWN': 4}.get(drive.smart_health, 9) }}"
+    data-sort-short="{{ {'running': 1, 'failed': 2, 'aborted': 3, 'passed': 4, 'idle': 5}.get(short_state, 9) }}"
+    data-sort-long="{{ {'running': 1, 'failed': 2, 'aborted': 3, 'passed': 4, 'idle': 5}.get(long_state, 9) }}"
+    data-sort-burnin="{{ {'running': 1, 'queued': 2, 'failed': 3, 'unknown': 4, 'cancelled': 5, 'passed': 6}.get(burnin_state, 9) }}"
+>
 <td class="col-check">
 {%- if selectable %}
 <input type="checkbox" class="drive-checkbox" data-drive-id="{{ drive.id }}">
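
Reviewer note: the `data-sort-*` attributes in this hunk turn each state into a small integer rank, with 9 as the catch-all for anything unmapped (including the empty `burnin_state` used when a drive has no burn-in). The client-side sort code is not shown in this diff, so the following is only a sketch of how such ranks order rows, using the Burn-In mapping copied from the template. `burnin_rank`, `sort_by_burnin`, and the dict-shaped drives are hypothetical.

```python
# Rank table copied from the data-sort-burnin attribute in the template.
BURNIN_RANK = {
    "running": 1,
    "queued": 2,
    "failed": 3,
    "unknown": 4,
    "cancelled": 5,
    "passed": 6,
}
FALLBACK_RANK = 9  # same default the template passes to .get()


def burnin_rank(state):
    """Map a burn-in state to its sort rank; unmapped or empty states sort last."""
    return BURNIN_RANK.get(state or "", FALLBACK_RANK)


def sort_by_burnin(drives):
    """Return drives ordered by burn-in rank (ascending)."""
    # sorted() is stable, so drives with equal rank keep their input order.
    return sorted(drives, key=lambda d: burnin_rank(d.get("burnin_state")))


if __name__ == "__main__":
    rows = [
        {"devname": "sda", "burnin_state": "passed"},
        {"devname": "sdb", "burnin_state": ""},
        {"devname": "sdc", "burnin_state": "running"},
    ]
    for row in sort_by_burnin(rows):
        print(row["devname"], burnin_rank(row["burnin_state"]))
```

Putting active and failed states ahead of `passed` means an ascending sort surfaces the drives that need attention first.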


@ -0,0 +1,125 @@
"""Verifies _BadblocksProgress translates per-phase badblocks output
into a monotonic 0-99% overall progress.
`badblocks -w` cycles through 4 patterns × {write, verify} = 8 phases.
Each phase prints "XX% done" relative to its own 0-100 range. Without
this translation the dashboard appeared to "rewind" every ~2 hours
when a new phase started and two drives racing each other could
look 4× apart in displayed progress despite identical hardware.
Run inside the container image so app deps are present.
"""

from __future__ import annotations

import unittest

from app.burnin.stages import _BadblocksProgress


class TestBadblocksProgress(unittest.TestCase):
    def test_default_phase_one(self):
        """Before any header, treat as start of pattern-1 write."""
        p = _BadblocksProgress()
        self.assertEqual(p.phase, 1)
        self.assertEqual(p.overall_pct, 0)

    def test_pattern_headers_set_phase(self):
        """0xaa→1, 0x55→3, 0xff→5, 0x00→7 (write phases)."""
        p = _BadblocksProgress()
        for header, want in [
            ("Testing with pattern 0xaa: ", 1),
            ("Testing with pattern 0x55: ", 3),
            ("Testing with pattern 0xff: ", 5),
            ("Testing with pattern 0x00: ", 7),
        ]:
            p.update(header)
            self.assertEqual(p.phase, want, f"after {header!r}")

    def test_verify_advances_to_next_phase(self):
        """`Reading and comparing` after `Testing with pattern 0x55`
        (phase 3) advances to phase 4."""
        p = _BadblocksProgress()
        p.update("Testing with pattern 0x55: 100.00% done")
        self.assertEqual(p.phase, 3)
        p.update("Reading and comparing: 0.00% done")
        self.assertEqual(p.phase, 4)

    def test_overall_pct_at_phase_boundaries(self):
        """Verify the math at each phase boundary: phase N at 100% =
        N * 12.5% overall (clipped to 99 at the end)."""
        cases = [
            (1, 0.0, 0),     # start of run
            (1, 100.0, 12),  # 100/800 = 12.5
            (2, 100.0, 25),  # 200/800
            (4, 100.0, 50),  # 400/800
            (7, 100.0, 87),  # 700/800
            (8, 100.0, 99),  # 800/800 → clipped to 99
        ]
        for phase, phase_pct, want in cases:
            p = _BadblocksProgress()
            p.phase = phase
            p.phase_pct = phase_pct
            self.assertEqual(
                p.overall_pct, want,
                f"phase={phase} phase_pct={phase_pct}",
            )

    def test_realistic_sequence(self):
        """End-to-end: feed a synthetic badblocks output stream and
        check the overall percent stays monotonically non-decreasing."""
        lines = [
            "Testing with pattern 0xaa: ",
            "10.00% done, 1:00:00 elapsed. (0/0/0 errors)",
            "50.00% done, 5:00:00 elapsed. (0/0/0 errors)",
            "99.99% done, 10:00:00 elapsed. (0/0/0 errors)",
            "Reading and comparing: ",
            "0.00% done, 10:00:01 elapsed. (0/0/0 errors)",
            "50.00% done, 12:30:00 elapsed. (0/0/0 errors)",
            "Testing with pattern 0x55: ",
            "0.00% done, 15:00:00 elapsed. (0/0/0 errors)",
            "50.00% done, 17:30:00 elapsed. (0/0/0 errors)",
        ]
        p = _BadblocksProgress()
        seen = []
        for line in lines:
            p.update(line)
            seen.append(p.overall_pct)
        self.assertEqual(
            seen, sorted(seen),
            f"progress went backwards: {seen}",
        )
        # Sanity: by the time we're halfway through pattern-2 write
        # (phase 3, 50%), we should report ((3-1)*100 + 50) / 8 = 31%.
        self.assertEqual(seen[-1], 31)

    def test_drives_at_different_phases_show_different_overall(self):
        """The original bug: two drives at the same per-phase 60%
        but different phases used to look identical (both '60%').
        Now they correctly diverge."""
        slow = _BadblocksProgress()
        slow.update("Testing with pattern 0xaa: ")
        slow.update("60.00% done")
        fast = _BadblocksProgress()
        fast.update("Testing with pattern 0xaa: ")
        fast.update("99.99% done")
        fast.update("Reading and comparing: ")
        fast.update("60.00% done")
        # slow: 60/800 = 7%; fast: (1*100 + 60)/800 = 20%
        self.assertEqual(slow.overall_pct, 7)
        self.assertEqual(fast.overall_pct, 20)

    def test_unknown_pattern_does_not_crash(self):
        """An unrecognized pattern (e.g. badblocks future versions or
        custom patterns) just leaves phase unchanged."""
        p = _BadblocksProgress()
        p.update("Testing with pattern 0xab: ")
        # phase stays at the default 1
        self.assertEqual(p.phase, 1)


if __name__ == "__main__":
    unittest.main()
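
Reviewer note: the real `_BadblocksProgress` lives in `app/burnin/stages` and is not shown in this diff. To make the arithmetic these tests pin down concrete, here is a minimal sketch of a class that would satisfy them. `BadblocksProgressSketch` and both regexes are assumptions, not the project's code; the formula `((phase - 1) * 100 + phase_pct) / 8`, truncated and capped at 99, is taken from the test comments.

```python
import re

# Write-phase number for each of the four default `badblocks -w` patterns.
_PATTERN_PHASE = {"0xaa": 1, "0x55": 3, "0xff": 5, "0x00": 7}
_HEADER_RE = re.compile(r"Testing with pattern (0x[0-9a-fA-F]+)")
_PCT_RE = re.compile(r"(\d+(?:\.\d+)?)% done")


class BadblocksProgressSketch:
    """Hypothetical stand-in for _BadblocksProgress (illustration only)."""

    def __init__(self):
        self.phase = 1        # 1..8: four patterns x {write, verify}
        self.phase_pct = 0.0  # progress within the current phase, 0-100

    def update(self, line):
        header = _HEADER_RE.search(line)
        if header:
            phase = _PATTERN_PHASE.get(header.group(1).lower())
            if phase is not None:  # unknown patterns leave the phase alone
                self.phase = phase
                self.phase_pct = 0.0
        elif "Reading and comparing" in line and self.phase % 2 == 1:
            # Verify pass follows the write pass of the same pattern.
            self.phase += 1
            self.phase_pct = 0.0

        pct = _PCT_RE.search(line)
        if pct:
            self.phase_pct = float(pct.group(1))

    @property
    def overall_pct(self):
        raw = ((self.phase - 1) * 100.0 + self.phase_pct) / 8.0
        return min(99, int(raw))  # truncate, and never show 100 before exit


if __name__ == "__main__":
    p = BadblocksProgressSketch()
    for line in ["Testing with pattern 0xaa: ", "99.99% done",
                 "Reading and comparing: ", "60.00% done"]:
        p.update(line)
    print(p.phase, p.overall_pct)
```

Resetting `phase_pct` on every phase change is what keeps the overall figure from jumping forward by a stale per-phase percentage when a new header arrives.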


@ -0,0 +1,100 @@
"""Verifies _update_stage_bb_phase actually writes to burnin_stages
and the migration adds the columns idempotently.
The drive-drawer's 4-meter UI depends on these columns being populated
on every parser tick. If a future refactor drops the call or breaks
the migration, this test catches it before users see the meters
go blank.
Run inside the container image so app deps are present.
"""

from __future__ import annotations

import os
import tempfile
import unittest

import aiosqlite


async def _setup_db_with_stage() -> str:
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    from app.config import settings
    settings.db_path = path
    from app.database import init_db
    await init_db()
    async with aiosqlite.connect(path) as db:
        await db.execute(
            "INSERT INTO drives "
            "(truenas_disk_id, devname, serial, model, size_bytes, "
            " temperature_c, smart_health, last_seen_at, last_polled_at) "
            "VALUES ('id-1', 'sda', 'SER1', 'TestModel', 14000000000000, "
            " 30, 'PASSED', '2026-05-09T00:00:00+00:00', "
            " '2026-05-09T00:00:00+00:00')"
        )
        await db.execute(
            "INSERT INTO burnin_jobs "
            "(drive_id, profile, state, operator, created_at) "
            "VALUES (1, 'surface', 'running', 'op', "
            " '2026-05-09T00:00:00+00:00')"
        )
        await db.execute(
            "INSERT INTO burnin_stages "
            "(burnin_job_id, stage_name, state) "
            "VALUES (1, 'surface_validate', 'running')"
        )
        await db.commit()
    return path


class TestBBPhasePersistence(unittest.IsolatedAsyncioTestCase):
    async def asyncSetUp(self):
        self.path = await _setup_db_with_stage()

    async def asyncTearDown(self):
        try:
            os.unlink(self.path)
        except OSError:
            pass

    async def test_columns_exist_after_init(self):
        async with aiosqlite.connect(self.path) as db:
            cur = await db.execute("PRAGMA table_info(burnin_stages)")
            cols = {r[1] for r in await cur.fetchall()}
        self.assertIn("bb_phase", cols)
        self.assertIn("bb_phase_pct", cols)

    async def test_update_writes_phase_and_pct(self):
        from app.burnin._common import _update_stage_bb_phase
        await _update_stage_bb_phase(1, "surface_validate", 3, 47.5)
        async with aiosqlite.connect(self.path) as db:
            cur = await db.execute(
                "SELECT bb_phase, bb_phase_pct FROM burnin_stages "
                "WHERE burnin_job_id=1 AND stage_name='surface_validate'"
            )
            row = await cur.fetchone()
        self.assertEqual(row[0], 3)
        self.assertAlmostEqual(row[1], 47.5)

    async def test_update_overwrites(self):
        """Each tick should replace the previous value, not accumulate."""
        from app.burnin._common import _update_stage_bb_phase
        await _update_stage_bb_phase(1, "surface_validate", 1, 10.0)
        await _update_stage_bb_phase(1, "surface_validate", 2, 80.0)
        async with aiosqlite.connect(self.path) as db:
            cur = await db.execute(
                "SELECT bb_phase, bb_phase_pct FROM burnin_stages "
                "WHERE burnin_job_id=1 AND stage_name='surface_validate'"
            )
            row = await cur.fetchone()
        self.assertEqual(row[0], 2)
        self.assertAlmostEqual(row[1], 80.0)


if __name__ == "__main__":
    unittest.main()
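
Reviewer note: neither the migration nor `_update_stage_bb_phase` appears in this diff, so the behaviour under test can only be illustrated. The sketch below uses the standard-library `sqlite3` module as a synchronous stand-in for `aiosqlite`. `ensure_bb_columns`, `update_stage_bb_phase`, and the column types (`INTEGER`, `REAL`) are assumptions; only the table and column names come from the tests.

```python
import sqlite3

# Assumed column types; the tests only fix the column names.
BB_COLUMNS = {"bb_phase": "INTEGER", "bb_phase_pct": "REAL"}


def ensure_bb_columns(conn: sqlite3.Connection) -> None:
    """Add the badblocks columns if missing; safe to call repeatedly."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(burnin_stages)")}
    for name, sql_type in BB_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE burnin_stages ADD COLUMN {name} {sql_type}")
    conn.commit()


def update_stage_bb_phase(conn, job_id, stage_name, phase, phase_pct) -> None:
    """Overwrite the stored phase and per-phase percent for one stage row."""
    conn.execute(
        "UPDATE burnin_stages SET bb_phase = ?, bb_phase_pct = ? "
        "WHERE burnin_job_id = ? AND stage_name = ?",
        (phase, phase_pct, job_id, stage_name),
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE burnin_stages "
        "(burnin_job_id INTEGER, stage_name TEXT, state TEXT)"
    )
    conn.execute("INSERT INTO burnin_stages VALUES (1, 'surface_validate', 'running')")
    ensure_bb_columns(conn)
    ensure_bb_columns(conn)  # second call is a no-op
    update_stage_bb_phase(conn, 1, "surface_validate", 3, 47.5)
    print(conn.execute("SELECT bb_phase, bb_phase_pct FROM burnin_stages").fetchone())
```

Checking `PRAGMA table_info` before each `ALTER TABLE` is one common way to make a SQLite column migration idempotent, since SQLite has no `ADD COLUMN IF NOT EXISTS`.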