Container restarts (uvicorn shutdown / 'docker compose up -d') caused
running burn-ins to be silently classified as 'failed' with an empty
error_text. Two causes combined:
1. _stage_surface_validate_ssh caught asyncio.CancelledError at the
stage level and returned False, *swallowing* the cancel signal.
2. _run_job's outer CancelledError handler then never fired, so
was_cancelled stayed False and the job got marked 'failed' (the
"burn-in itself failed" classification) instead of 'unknown'
(the honest "we don't know whether it would have passed").
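
A minimal sketch of the failure mode, for illustration only. The names
stage_swallowing_cancel, run_job and demo are hypothetical stand-ins,
not the real _stage_surface_validate_ssh / _run_job code, and the long
sleep stands in for the remote badblocks run:

```python
import asyncio


async def stage_swallowing_cancel():
    """Hypothetical stand-in for the old stage behaviour."""
    try:
        await asyncio.sleep(3600)  # stands in for the long remote run
        return True
    except asyncio.CancelledError:
        # Bug: the cancel signal stops here and looks like a plain failure.
        return False


async def run_job(stage):
    """Hypothetical stand-in for the job runner's outer handler."""
    was_cancelled = False
    try:
        ok = await stage()
        status = "passed" if ok else "failed"
    except asyncio.CancelledError:
        # Never reached when the stage swallows the cancel.
        was_cancelled = True
        status = "unknown"
    return status, was_cancelled


async def demo():
    task = asyncio.create_task(run_job(stage_swallowing_cancel))
    await asyncio.sleep(0)  # let the task start and reach its sleep
    task.cancel()           # simulates loop shutdown cancelling the job
    return await task


result = asyncio.run(demo())
print(result)
```

Because the stage returns False instead of re-raising, the outer
handler is skipped and the job ends up 'failed' with was_cancelled
still False.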
Fix:
- The stage now makes a best-effort kill of remote badblocks (shielded
so loop shutdown doesn't interrupt the kill), appends an [ABORTED]
marker to the log, and re-raises CancelledError. _execute_stages
doesn't catch it (CancelledError subclasses BaseException, not
Exception, since Python 3.8), so it propagates up to _run_job.
- _run_job's existing CancelledError handler now also reconciles
any stage rows still recorded as 'running' by setting them to
'unknown' with a clear error_text: "Task cancelled mid-run —
likely container restart or shutdown". The job's error_text gets
the same message so the drawer's Reason block has something
specific to display, instead of falling back to the heuristic.
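
The two bullets above can be sketched roughly as follows. This is an
illustration under assumptions, not the actual implementation:
kill_remote_badblocks, stage, run_job, the dict-based job/stage rows
and the in-memory log list are all hypothetical, and re-raising from
run_job is an assumption the description above does not state.

```python
import asyncio

# Same text as the error_text quoted above (\u2014 is the dash in it).
CANCEL_REASON = (
    "Task cancelled mid-run \u2014 likely container restart or shutdown"
)


async def kill_remote_badblocks(log):
    """Hypothetical cleanup; the real one would issue a kill over SSH."""
    await asyncio.sleep(0)
    log.append("killed remote badblocks")


async def stage(log):
    try:
        await asyncio.sleep(3600)  # stands in for the remote badblocks run
        return True
    except asyncio.CancelledError:
        try:
            # Shielded so a further cancel does not interrupt the kill.
            await asyncio.shield(kill_remote_badblocks(log))
        except Exception:
            pass  # best effort: a failed kill must not mask the cancel
        finally:
            log.append("[ABORTED]")
        raise  # let the cancel propagate to the job runner


async def run_job(job, log):
    try:
        ok = await stage(log)
        job["status"] = "passed" if ok else "failed"
    except asyncio.CancelledError:
        # Reconcile stage rows that were left as 'running'.
        for row in job["stages"]:
            if row["status"] == "running":
                row["status"] = "unknown"
                row["error_text"] = CANCEL_REASON
        job["status"] = "unknown"
        job["error_text"] = CANCEL_REASON
        raise  # assumption: the sketch keeps propagating the cancel


async def demo():
    job = {
        "status": "running",
        "error_text": "",
        "stages": [{"status": "running", "error_text": ""}],
    }
    log = []
    task = asyncio.create_task(run_job(job, log))
    await asyncio.sleep(0)  # let the job reach the long-running stage
    task.cancel()           # simulates loop shutdown
    try:
        await task
    except asyncio.CancelledError:
        pass
    return job, log


job, log = asyncio.run(demo())
```

After the cancel, both the job and its stage row carry 'unknown' and
the cancel reason, and the log ends with the [ABORTED] marker.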
Future container restarts on running burn-ins will now show as
yellow "UNKNOWN" with the explicit cancel reason, matching the
existing behaviour of check_stuck_jobs() for stuck timeouts.