nas-burnin

48 commits 1 branch 2 tags 1.8 MiB

Author	SHA1	Message	Date
Brandon Walter	1bc1b378ab	fix: cancel-mid-stage marks job 'unknown' not 'failed' (1.0.0-51) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Security scan / mypy (push) Waiting to run Details Container restarts (uvicorn shutdown / 'docker compose up -d') were silently classifying running burn-ins as 'failed' with empty error_text. Two reasons converged: 1. _stage_surface_validate_ssh caught asyncio.CancelledError at the stage level and returned False, swallowing the cancel signal. 2. _run_job's outer CancelledError handler then never fired, so was_cancelled stayed False and the job got marked 'failed' (the "burn-in itself failed" classification) instead of 'unknown' (the honest "we don't know whether it would have passed"). Fix: - Stage now does best-effort kill of remote badblocks (shielded so loop shutdown doesn't interrupt the kill), appends an [ABORTED] marker to the log, and re-raises CancelledError. _execute_stages doesn't catch it (CancelledError is BaseException, not Exception in 3.8+) so it propagates up to _run_job. - _run_job's existing CancelledError handler now also reconciles any stage rows still recorded as 'running' by setting them to 'unknown' with a clear error_text: "Task cancelled mid-run — likely container restart or shutdown". The job's error_text gets the same message so the drawer's Reason block has something specific to display, instead of falling back to the heuristic. Future container restarts on running burn-ins will now show as yellow "UNKNOWN" with the explicit cancel reason, matching the existing behaviour of check_stuck_jobs() for stuck timeouts.	2026-05-09 12:32:46 -07:00
Brandon Walter	eb2a964171	fix: address Codex review of burnin package split (1.0.0-32) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Three LOW-severity findings from Codex's audit of the post-split package, all small mechanical cleanups: #1 routes.py:848 read burnin.UNLOCK_TTL_SECONDS — a snapshot alias bound at import time. After a test (or runtime) monkey-patches app.burnin.unlock.UNLOCK_TTL_SECONDS the API response would advertise the OLD value while grant_pool_unlock used the new one. Now reads burnin.unlock.UNLOCK_TTL_SECONDS directly so the API stays in sync with whatever the actual source-of-truth is. #2 _stage_surface_validate_ssh() carried dead extraction scaffolding from when the badblocks logic was first inlined into burnin.py: _is_cancelled_sync (sync wrapper that does run_until_complete in a coroutine — would deadlock if ever called), last_logged_pct, on_progress, accumulated_lines, on_progress_async — none on any control-flow path. Plus result["output"] which was set but never read. All deleted; the inline _drain coroutines below already handle progress/log throttling correctly. #3 The new module boundaries were leaking — root orchestration mutated _remote_pids and _unlock_grants directly even though kill.clear_remote_pid() and unlock.invalidate_grant() existed. Now using the helpers, so a future change to the storage shape only requires editing the owning module. Bonus from Codex's check note: _get_client() now asserts burnin._client is not None with a clear message instead of relying on an obscure NoneType AttributeError if a stage is somehow called before init(). Verified: 44/44 tests pass; container boots clean; /health 200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 01:35:07 -04:00
Brandon Walter	19c2c0dc0f	refactor: extract _common.py + stages.py from burnin (1.0.0-31) Continues the staged burnin.py module split started in 1.0.0-30. Two more clean extractions; orchestration (init, _run_job, start_job, cancel_job, check_stuck_jobs, semaphore) intentionally stays in __init__.py for now to avoid threading the TrueNASClient through cross-module setters. * app/burnin/_common.py — shared helpers with no upward deps: STAGE_ORDER + _STAGE_BASE_WEIGHTS + POLL_INTERVAL constants; _now / _db connection helper; _is_cancelled, _start_stage, _finish_stage, _cancel_stage, _set_stage_error, _update_stage_, _append_stage_log, _store_smart_, _recalculate_progress; SSE _push_update. Imports nothing from sibling burnin modules. * app/burnin/stages.py — every per-stage implementation moved verbatim: _stage_precheck, _stage_smart_test + _stage_smart_test_api / _ssh, _stage_surface_validate + _surface_validate_nvme / _ssh / _truenas, _stage_timed_simulate, _stage_final_check, plus _badblocks_available, _nvme_cli_available, and _dispatch_stage. Pulls the shared helpers from _common, remote-PID setters from kill, and the live TrueNASClient via a lazy `_get_client()` helper that defers `from app import burnin` until call time so we don't trip a circular import. * __init__.py shrank from ~1480 LoC to ~600. Re-exports every public name (start_job, cancel_job, init, check_stuck_jobs, PoolMemberError, UNLOCK_TTL_SECONDS, etc.) so external callers in routes.py / mailer.py / poller.py see the same surface. State that didn't move: _semaphore, _client, _active_tasks remain on the package root (with a runtime _client reference from routes.py preserved). _run_job and start_job still live in __init__.py — full task.py extraction would require giving stages access to _client through a setter rather than the lazy lookup, deferred to a future slice. Verification: 44/44 unit tests pass in container; /health 200; container boots clean. No public API change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 01:18:04 -04:00
Brandon Walter	9cbae44495	refactor: split burnin.py into a package — extract unlock + kill (1.0.0-30) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details First slice of the planned tech-debt cleanup. burnin.py was 1667 lines and growing; staged extraction gives smaller diffs to review and a clear bisect target if anything regresses. Mechanical move only — no behaviour change. The two extracted modules: * app/burnin/unlock.py — _UnlockGrant, _unlock_grants, PoolMemberError, is_unlocked / unlock_expiry / grant_pool_unlock, plus the four _TOKEN constants and UNLOCK_TTL_SECONDS. Owns its module-level state; opens its own DB connection in grant_pool_unlock so it doesn't depend on the parent package's _db() helper. app/burnin/kill.py — _remote_pids dict and the kill_remote_process / set_remote_pid / clear_remote_pid / get_remote_pid helpers. Pulled out of __init__.py so the asyncssh-ignores-signals workaround lives next to the state it operates on. app/burnin/__init__.py re-exports every public symbol the rest of the app imports — `from app import burnin; burnin.start_job(...)`, `burnin.PoolMemberError`, `burnin.UNLOCK_TTL_SECONDS`, etc. all keep working unchanged. Internal aliases `_remote_pids` and `_unlock_grants` on the package root point at the SAME dict objects in the submodules, so existing in-package mutations (set in stages, cleared in cleanup callbacks) work without rewrite. Test fix: tests/test_unlock_flow.py:test_expired_grant_returns_false monkey-patches UNLOCK_TTL_SECONDS. The package-root alias is bound at import time and won't propagate back to the submodule's read site, so the test now patches `app.burnin.unlock.UNLOCK_TTL_SECONDS` directly. Verification: 44/44 unit tests pass in container; /health 200; container boots clean. routes.py, mailer.py, poller.py untouched — the public API is identical. Future: extract stages, task, _common in subsequent versions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 00:44:28 -04:00

Renamed from app/burnin.py (Browse further)

4 commits