Container restarts (uvicorn shutdown / 'docker compose up -d') were
silently classifying running burn-ins as 'failed' with empty
error_text. Two reasons converged:
1. _stage_surface_validate_ssh caught asyncio.CancelledError at the
stage level and returned False, *swallowing* the cancel signal.
2. _run_job's outer CancelledError handler then never fired, so
was_cancelled stayed False and the job got marked 'failed' (the
"burn-in itself failed" classification) instead of 'unknown'
(the honest "we don't know whether it would have passed").
Fix:
- Stage now does best-effort kill of remote badblocks (shielded so
loop shutdown doesn't interrupt the kill), appends an [ABORTED]
marker to the log, and re-raises CancelledError. _execute_stages
doesn't catch it (CancelledError is BaseException, not Exception
in 3.8+) so it propagates up to _run_job.
- _run_job's existing CancelledError handler now also reconciles
any stage rows still recorded as 'running' by setting them to
'unknown' with a clear error_text: "Task cancelled mid-run —
likely container restart or shutdown". The job's error_text gets
the same message so the drawer's Reason block has something
specific to display, instead of falling back to the heuristic.
Future container restarts on running burn-ins will now show as
yellow "UNKNOWN" with the explicit cancel reason, matching the
existing behaviour of check_stuck_jobs() for stuck timeouts.
Three additions to the surface_validate drawer block:
1. **Phase caption** below the meters: "Pattern 2 of 4 · Verify 0x55
· 47% within phase". Pure JS — no schema change. Makes the
visual grammar explicit without needing the operator to mentally
map phase=4 to "verifying pattern 2".
2. **Bad-block badge** in the vitals row. Green at 0, red at >0.
The number was already on the stage row but burying it in the
log felt wrong — surfacing it next to temp/speed/ETA keeps it
in eye-line during long runs.
3. **Per-pattern duration history** below the caption. New
bb_phase_history JSON column (idempotent migration) maps
{phase_num: ts}. Parser stamps the timestamp on every phase
transition (and on stage entry for phase 1). Drawer diffs
consecutive write-phase starts to derive "0xaa: 14h 22m"
for completed patterns. Once one pattern is done you can
predict the rest without leaving the drawer.
Persistence is idempotent: re-entry of the same phase keeps the
original timestamp so a transient parser reset doesn't blow away
history. JSON parse failures fail gracefully (no row rendered).
The drawer's surface_validate area now leads with a row of operator
vitals computed from data already in the response:
- Temp: drive temperature with cool/warm/hot colour (≥48 red, ≥42 yellow)
- Speed: live MB/s, NULL until second progress sample arrives
- Elapsed: time since stage started_at
- ETA: extrapolated from overall progress; suppressed under 0.5%
to avoid the "47 days remaining" artefact early in pattern 1
Live MB/s comes from a new bb_mbps column on burnin_stages, computed
in the badblocks parser as (delta_overall_pct / 800) * size_bytes / dt.
Skipped on phase transitions (per-phase pct resets) and sub-second
samples (noisy).
Drawer endpoint now passes drive.temperature_c through; JS stashes
the latest drive object in _DRAWER_LAST_DRIVE so the burn-in renderer
can pull it for the vitals row without changing call signatures.
Tightened table CSS in this same session is unrelated and shipped
already in earlier rounds via the bind-mounted app.css.
Previously the parser only wrote bb_phase to the DB when state
*changed* — so for the first several minutes of a 14 TB burn-in
(before badblocks emits its first 'X% done' line), bb_phase stayed
NULL and the drawer's per-pattern meters didn't render at all.
Looked broken to operators.
Now we write phase=1, phase_pct=0 immediately on stage entry. The
parser keeps overwriting on every subsequent tick. Drawer shows
empty meters with 0xaa label highlighted blue from t=0.
User asked for one meter per badblocks pattern. The drawer now shows
4 meters (one per pattern: 0xaa / 0x55 / 0xff / 0x00), each split
into write (left, blue) + verify (right, green) halves so a glance
shows both which pattern is current AND whether you're writing or
verifying within it.
Backend:
- New columns burnin_stages.bb_phase (1-8) + bb_phase_pct (0-100)
via idempotent ALTER TABLE migration
- _update_stage_bb_phase() helper called from the badblocks parser
on every tick (when phase or percent changes)
- /api/v1/drives/{id}/drawer SELECT now returns the new fields
Frontend (app.js + app.css):
- _drawerRenderBadblocksMeters(phase, phasePct) computes per-pattern
fill state and emits 4-meter HTML with W/V sub-labels
- Conditional render: only shows when stage_name === 'surface_validate'
AND bb_phase is set, so historical pre-1.0.0-44 stage rows render
unchanged (single percent, no meters)
3 new tests cover the migration columns, single-tick persistence,
and overwrite-on-second-tick. Total suite: 75 tests.
Image rebuilt and tagged but NOT deployed — 4 burn-ins are running
right now and a recreate would SIGHUP them. Deploy with
`docker compose up -d` after the current batch finishes; the
migration runs at init and the meters light up for the next batch.
`badblocks -w` cycles through 4 patterns (0xaa, 0x55, 0xff, 0x00),
each with a write phase + a verify phase = 8 phases. The output's
"XX% done" lines are per-phase, so the dashboard appeared to "rewind"
every ~2 hours. Two drives racing each other could look 4× apart in
displayed progress despite identical hardware — actually one was
just further into a later phase.
New _BadblocksProgress state machine watches for "Testing with
pattern 0xXX" and "Reading and comparing" headers, advances the
phase counter, and reports overall = ((phase-1) * 100 + phase_pct) / 8
clipped to 99. Pure state machine, no I/O.
7 new tests cover phase-header detection, boundary math, monotonicity
across a synthetic stream, and the original "two drives at same
per-phase % look identical" bug.
Image rebuilt and tagged but NOT deployed to the running container —
4 surface-validate jobs are 20-95% through 14TB drives and a recreate
would SIGHUP the remote badblocks processes. Deploy with
`docker compose up -d` after the current batch finishes.
Extracts the badblocks shell-command construction into
_build_badblocks_cmd(devname) so it can be unit-tested without
spinning up an asyncssh connection. Behavior unchanged.
Three tests guard:
1. Defaults match disk-burnin.sh recommendation (-b 4096 -c 64 -p 1)
2. Operator-set tunables actually propagate to the command
3. The PID-capture wrapper (sh -c 'echo PID:\$\$; exec ...') stays
intact — without it, cancel cannot kill the remote process
because asyncssh's signal channel is silently ignored by sshd.
Five files needed annotation tweaks to clear the 14 outstanding
mypy errors, all cosmetic (zero runtime bugs):
- settings_store._coerce: return Any (concrete type depends on key,
no narrowing path mypy can follow from the dict lookup)
- retention._state: explicit dict[str, str | None] init
- mailer: explicit `server: smtplib.SMTP` binding so SMTP_SSL and
SMTP both narrow to the parent class for shared call sites
- burnin/stages.py: TypedDict for the badblocks result dict so
`result["bad_blocks"]` narrows to int at the comparison site
scripts/security-scan.sh: mypy now counted in TOTAL_EXIT and
findings.log line. Comment updated to reflect gating status.
Three LOW-severity findings from Codex's audit of the post-split
package, all small mechanical cleanups:
#1 routes.py:848 read burnin.UNLOCK_TTL_SECONDS — a snapshot alias
bound at import time. After a test (or runtime) monkey-patches
app.burnin.unlock.UNLOCK_TTL_SECONDS the API response would
advertise the OLD value while grant_pool_unlock used the new one.
Now reads burnin.unlock.UNLOCK_TTL_SECONDS directly so the API
stays in sync with whatever the actual source-of-truth is.
#2 _stage_surface_validate_ssh() carried dead extraction scaffolding
from when the badblocks logic was first inlined into burnin.py:
_is_cancelled_sync (sync wrapper that does run_until_complete in
a coroutine — would deadlock if ever called), last_logged_pct,
on_progress, accumulated_lines, on_progress_async — none on any
control-flow path. Plus result["output"] which was set but never
read. All deleted; the inline _drain coroutines below already
handle progress/log throttling correctly.
#3 The new module boundaries were leaking — root orchestration
mutated _remote_pids and _unlock_grants directly even though
kill.clear_remote_pid() and unlock.invalidate_grant() existed.
Now using the helpers, so a future change to the storage shape
only requires editing the owning module.
Bonus from Codex's check note: _get_client() now asserts
burnin._client is not None with a clear message instead of relying
on an obscure NoneType AttributeError if a stage is somehow called
before init().
Verified: 44/44 tests pass; container boots clean; /health 200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Continues the staged burnin.py module split started in 1.0.0-30.
Two more clean extractions; orchestration (init, _run_job,
start_job, cancel_job, check_stuck_jobs, semaphore) intentionally
stays in __init__.py for now to avoid threading the TrueNASClient
through cross-module setters.
* app/burnin/_common.py — shared helpers with no upward deps:
STAGE_ORDER + _STAGE_BASE_WEIGHTS + POLL_INTERVAL constants;
_now / _db connection helper; _is_cancelled, _start_stage,
_finish_stage, _cancel_stage, _set_stage_error, _update_stage_*,
_append_stage_log, _store_smart_*, _recalculate_progress; SSE
_push_update. Imports nothing from sibling burnin modules.
* app/burnin/stages.py — every per-stage implementation moved
verbatim: _stage_precheck, _stage_smart_test +
_stage_smart_test_api / _ssh, _stage_surface_validate +
_surface_validate_nvme / _ssh / _truenas, _stage_timed_simulate,
_stage_final_check, plus _badblocks_available, _nvme_cli_available,
and _dispatch_stage. Pulls the shared helpers from _common,
remote-PID setters from kill, and the live TrueNASClient via a
lazy `_get_client()` helper that defers `from app import burnin`
until call time so we don't trip a circular import.
* __init__.py shrank from ~1480 LoC to ~600. Re-exports every
public name (start_job, cancel_job, init, check_stuck_jobs,
PoolMemberError, UNLOCK_TTL_SECONDS, etc.) so external callers
in routes.py / mailer.py / poller.py see the same surface.
State that didn't move: _semaphore, _client, _active_tasks remain
on the package root (with a runtime _client reference from routes.py
preserved). _run_job and start_job still live in __init__.py — full
task.py extraction would require giving stages access to _client
through a setter rather than the lazy lookup, deferred to a future
slice.
Verification: 44/44 unit tests pass in container; /health 200;
container boots clean. No public API change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First slice of the planned tech-debt cleanup. burnin.py was 1667 lines
and growing; staged extraction gives smaller diffs to review and a
clear bisect target if anything regresses.
Mechanical move only — no behaviour change. The two extracted modules:
* app/burnin/unlock.py — _UnlockGrant, _unlock_grants, PoolMemberError,
is_unlocked / unlock_expiry / grant_pool_unlock, plus the four
*_TOKEN constants and UNLOCK_TTL_SECONDS. Owns its module-level
state; opens its own DB connection in grant_pool_unlock so it
doesn't depend on the parent package's _db() helper.
* app/burnin/kill.py — _remote_pids dict and the kill_remote_process /
set_remote_pid / clear_remote_pid / get_remote_pid helpers. Pulled
out of __init__.py so the asyncssh-ignores-signals workaround lives
next to the state it operates on.
app/burnin/__init__.py re-exports every public symbol the rest of the
app imports — `from app import burnin; burnin.start_job(...)`,
`burnin.PoolMemberError`, `burnin.UNLOCK_TTL_SECONDS`, etc. all keep
working unchanged. Internal aliases `_remote_pids` and `_unlock_grants`
on the package root point at the SAME dict objects in the submodules,
so existing in-package mutations (set in stages, cleared in cleanup
callbacks) work without rewrite.
Test fix: tests/test_unlock_flow.py:test_expired_grant_returns_false
monkey-patches UNLOCK_TTL_SECONDS. The package-root alias is bound at
import time and won't propagate back to the submodule's read site, so
the test now patches `app.burnin.unlock.UNLOCK_TTL_SECONDS` directly.
Verification: 44/44 unit tests pass in container; /health 200;
container boots clean. routes.py, mailer.py, poller.py untouched —
the public API is identical.
Future: extract stages, task, _common in subsequent versions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>