Container restarts (uvicorn shutdown / 'docker compose up -d') were
silently classifying running burn-ins as 'failed' with empty
error_text. Two reasons converged:
1. _stage_surface_validate_ssh caught asyncio.CancelledError at the
stage level and returned False, *swallowing* the cancel signal.
2. _run_job's outer CancelledError handler then never fired, so
was_cancelled stayed False and the job got marked 'failed' (the
"burn-in itself failed" classification) instead of 'unknown'
(the honest "we don't know whether it would have passed").
Fix:
- Stage now does best-effort kill of remote badblocks (shielded so
loop shutdown doesn't interrupt the kill), appends an [ABORTED]
marker to the log, and re-raises CancelledError. _execute_stages
doesn't catch it (CancelledError is BaseException, not Exception
in 3.8+) so it propagates up to _run_job.
- _run_job's existing CancelledError handler now also reconciles
any stage rows still recorded as 'running' by setting them to
'unknown' with a clear error_text: "Task cancelled mid-run —
likely container restart or shutdown". The job's error_text gets
the same message so the drawer's Reason block has something
specific to display, instead of falling back to the heuristic.
Future container restarts on running burn-ins will now show as
yellow "UNKNOWN" with the explicit cancel reason, matching the
existing behaviour of check_stuck_jobs() for stuck timeouts.
Three LOW-severity findings from Codex's audit of the post-split
package, all small mechanical cleanups:
#1 routes.py:848 read burnin.UNLOCK_TTL_SECONDS — a snapshot alias
bound at import time. After a test (or runtime) monkey-patches
app.burnin.unlock.UNLOCK_TTL_SECONDS the API response would
advertise the OLD value while grant_pool_unlock used the new one.
Now reads burnin.unlock.UNLOCK_TTL_SECONDS directly so the API
stays in sync with whatever the actual source-of-truth is.
#2 _stage_surface_validate_ssh() carried dead extraction scaffolding
from when the badblocks logic was first inlined into burnin.py:
_is_cancelled_sync (sync wrapper that does run_until_complete in
a coroutine — would deadlock if ever called), last_logged_pct,
on_progress, accumulated_lines, on_progress_async — none on any
control-flow path. Plus result["output"] which was set but never
read. All deleted; the inline _drain coroutines below already
handle progress/log throttling correctly.
#3 The new module boundaries were leaking — root orchestration
mutated _remote_pids and _unlock_grants directly even though
kill.clear_remote_pid() and unlock.invalidate_grant() existed.
Now using the helpers, so a future change to the storage shape
only requires editing the owning module.
Bonus from Codex's check note: _get_client() now asserts
burnin._client is not None with a clear message instead of relying
on an obscure NoneType AttributeError if a stage is somehow called
before init().
Verified: 44/44 tests pass; container boots clean; /health 200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Continues the staged burnin.py module split started in 1.0.0-30.
Two more clean extractions; orchestration (init, _run_job,
start_job, cancel_job, check_stuck_jobs, semaphore) intentionally
stays in __init__.py for now to avoid threading the TrueNASClient
through cross-module setters.
* app/burnin/_common.py — shared helpers with no upward deps:
STAGE_ORDER + _STAGE_BASE_WEIGHTS + POLL_INTERVAL constants;
_now / _db connection helper; _is_cancelled, _start_stage,
_finish_stage, _cancel_stage, _set_stage_error, _update_stage_*,
_append_stage_log, _store_smart_*, _recalculate_progress; SSE
_push_update. Imports nothing from sibling burnin modules.
* app/burnin/stages.py — every per-stage implementation moved
verbatim: _stage_precheck, _stage_smart_test +
_stage_smart_test_api / _ssh, _stage_surface_validate +
_surface_validate_nvme / _ssh / _truenas, _stage_timed_simulate,
_stage_final_check, plus _badblocks_available, _nvme_cli_available,
and _dispatch_stage. Pulls the shared helpers from _common,
remote-PID setters from kill, and the live TrueNASClient via a
lazy `_get_client()` helper that defers `from app import burnin`
until call time so we don't trip a circular import.
* __init__.py shrank from ~1480 LoC to ~600. Re-exports every
public name (start_job, cancel_job, init, check_stuck_jobs,
PoolMemberError, UNLOCK_TTL_SECONDS, etc.) so external callers
in routes.py / mailer.py / poller.py see the same surface.
State that didn't move: _semaphore, _client, _active_tasks remain
on the package root (with a runtime _client reference from routes.py
preserved). _run_job and start_job still live in __init__.py — full
task.py extraction would require giving stages access to _client
through a setter rather than the lazy lookup, deferred to a future
slice.
Verification: 44/44 unit tests pass in container; /health 200;
container boots clean. No public API change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First slice of the planned tech-debt cleanup. burnin.py was 1667 lines
and growing; staged extraction gives smaller diffs to review and a
clear bisect target if anything regresses.
Mechanical move only — no behaviour change. The two extracted modules:
* app/burnin/unlock.py — _UnlockGrant, _unlock_grants, PoolMemberError,
is_unlocked / unlock_expiry / grant_pool_unlock, plus the four
*_TOKEN constants and UNLOCK_TTL_SECONDS. Owns its module-level
state; opens its own DB connection in grant_pool_unlock so it
doesn't depend on the parent package's _db() helper.
* app/burnin/kill.py — _remote_pids dict and the kill_remote_process /
set_remote_pid / clear_remote_pid / get_remote_pid helpers. Pulled
out of __init__.py so the asyncssh-ignores-signals workaround lives
next to the state it operates on.
app/burnin/__init__.py re-exports every public symbol the rest of the
app imports — `from app import burnin; burnin.start_job(...)`,
`burnin.PoolMemberError`, `burnin.UNLOCK_TTL_SECONDS`, etc. all keep
working unchanged. Internal aliases `_remote_pids` and `_unlock_grants`
on the package root point at the SAME dict objects in the submodules,
so existing in-package mutations (set in stages, cleared in cleanup
callbacks) work without rewrite.
Test fix: tests/test_unlock_flow.py:test_expired_grant_returns_false
monkey-patches UNLOCK_TTL_SECONDS. The package-root alias is bound at
import time and won't propagate back to the submodule's read site, so
the test now patches `app.burnin.unlock.UNLOCK_TTL_SECONDS` directly.
Verification: 44/44 unit tests pass in container; /health 200;
container boots clean. routes.py, mailer.py, poller.py untouched —
the public API is identical.
Future: extract stages, task, _common in subsequent versions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>