nas-burnin/app
Brandon Walter 1bc1b378ab
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
fix: cancel-mid-stage marks job 'unknown' not 'failed' (1.0.0-51)
Container restarts (uvicorn shutdown / 'docker compose up -d') were
silently classifying running burn-ins as 'failed' with empty
error_text. Two reasons converged:

1. _stage_surface_validate_ssh caught asyncio.CancelledError at the
   stage level and returned False, *swallowing* the cancel signal.
2. _run_job's outer CancelledError handler then never fired, so
   was_cancelled stayed False and the job got marked 'failed' (the
   "burn-in itself failed" classification) instead of 'unknown'
   (the honest "we don't know whether it would have passed").

Fix:
- Stage now does best-effort kill of remote badblocks (shielded so
  loop shutdown doesn't interrupt the kill), appends an [ABORTED]
  marker to the log, and re-raises CancelledError. _execute_stages
  doesn't catch it (CancelledError is BaseException, not Exception
  in 3.8+) so it propagates up to _run_job.
- _run_job's existing CancelledError handler now also reconciles
  any stage rows still recorded as 'running' by setting them to
  'unknown' with a clear error_text: "Task cancelled mid-run —
  likely container restart or shutdown". The job's error_text gets
  the same message so the drawer's Reason block has something
  specific to display, instead of falling back to the heuristic.

Future container restarts on running burn-ins will now show as
yellow "UNKNOWN" with the explicit cancel reason, matching the
existing behaviour of check_stuck_jobs() for stuck timeouts.
2026-05-09 12:32:46 -07:00
..
burnin fix: cancel-mid-stage marks job 'unknown' not 'failed' (1.0.0-51) 2026-05-09 12:32:46 -07:00
routes feat: prominent failure-reason block + heuristic in drawer (1.0.0-50) 2026-05-09 12:06:11 -07:00
static feat: prominent failure-reason block + heuristic in drawer (1.0.0-50) 2026-05-09 12:06:11 -07:00
templates feat: client-side column sorting with SSE re-apply (1.0.0-48) 2026-05-08 23:48:04 -07:00
__init__.py Initial commit — TrueNAS Burn-In Dashboard v0.5.0 2026-02-24 00:08:29 -05:00
auth.py feat: rate limiter + mypy + lifecycle tests + routes/ split (1.0.0-33/-34) 2026-05-03 09:29:53 -04:00
auth_cli.py infra: rename truenas-burnin → nas-burnin (1.0.0-41) 2026-05-04 07:16:02 -07:00
config.py fix: cancel-mid-stage marks job 'unknown' not 'failed' (1.0.0-51) 2026-05-09 12:32:46 -07:00
database.py feat: phase caption + bad-block badge + per-pattern history (1.0.0-47) 2026-05-08 23:23:02 -07:00
logging_config.py Initial commit — TrueNAS Burn-In Dashboard v0.5.0 2026-02-24 00:08:29 -05:00
mailer.py fix: annotate to mypy-clean + promote to gating (1.0.0-40) 2026-05-03 21:21:55 -07:00
main.py rename: TrueNAS Burn-In → NAS Burn-In (1.0.0-38) 2026-05-03 14:01:40 -04:00
models.py feat: pool-membership lock + cancellation hardening + smart_health refresh + tunables (1.0.0-13 -> 1.0.0-21) 2026-05-02 09:25:56 -04:00
notifier.py Stage 7: SSH architecture, SMART attribute monitoring, drive reset, and polish 2026-02-24 08:09:30 -05:00
poller.py fix: address Codex audit findings (1.0.0-28) 2026-05-02 18:48:16 -04:00
renderer.py Stage 7: SSH architecture, SMART attribute monitoring, drive reset, and polish 2026-02-24 08:09:30 -05:00
retention.py fix: annotate to mypy-clean + promote to gating (1.0.0-40) 2026-05-03 21:21:55 -07:00
settings_store.py fix: annotate to mypy-clean + promote to gating (1.0.0-40) 2026-05-03 21:21:55 -07:00
ssh_client.py fix: live pool re-check before start_job + drop dead run_badblocks (1.0.0-29) 2026-05-02 21:29:11 -04:00
terminal.py chore: re-sync deployed work that pre-dates this session 2026-05-02 09:24:42 -04:00
truenas.py chore: dev-experience + mypy noise cleanup 2026-05-03 21:11:23 -07:00