`badblocks -w` cycles through 4 patterns (0xaa, 0x55, 0xff, 0x00),
each with a write phase + a verify phase = 8 phases. The output's
"XX% done" lines are per-phase, so the dashboard appeared to "rewind"
every ~2 hours. Two drives racing each other could look 4× apart in
displayed progress despite identical hardware — actually one was
just further into a later phase.
New _BadblocksProgress state machine watches for "Testing with
pattern 0xXX" and "Reading and comparing" headers, advances the
phase counter, and reports overall = ((phase-1) * 100 + phase_pct) / 8
clipped to 99. Pure state machine, no I/O.
7 new tests cover phase-header detection, boundary math, monotonicity
across a synthetic stream, and the original "two drives at same
per-phase % look identical" bug.
Image rebuilt and tagged but NOT deployed to the running container —
4 surface-validate jobs are 20-95% through 14TB drives and a recreate
would SIGHUP the remote badblocks processes. Deploy with
`docker compose up -d` after the current batch finishes.
Extracts the badblocks shell-command construction into
_build_badblocks_cmd(devname) so it can be unit-tested without
spinning up an asyncssh connection. Behavior unchanged.
Three tests guard:
1. Defaults match disk-burnin.sh recommendation (-b 4096 -c 64 -p 1)
2. Operator-set tunables actually propagate to the command
3. The PID-capture wrapper (sh -c 'echo PID:\$\$; exec ...') stays
intact — without it, cancel cannot kill the remote process
because asyncssh's signal channel is silently ignored by sshd.
Five files needed annotation tweaks to clear the 14 outstanding
mypy errors, all cosmetic (zero runtime bugs):
- settings_store._coerce: return Any (concrete type depends on key,
no narrowing path mypy can follow from the dict lookup)
- retention._state: explicit dict[str, str | None] init
- mailer: explicit `server: smtplib.SMTP` binding so SMTP_SSL and
SMTP both narrow to the parent class for shared call sites
- burnin/stages.py: TypedDict for the badblocks result dict so
`result["bad_blocks"]` narrows to int at the comparison site
scripts/security-scan.sh: mypy now counted in TOTAL_EXIT and
findings.log line. Comment updated to reflect gating status.
Three LOW-severity findings from Codex's audit of the post-split
package, all small mechanical cleanups:
#1 routes.py:848 read burnin.UNLOCK_TTL_SECONDS — a snapshot alias
bound at import time. After a test (or runtime) monkey-patches
app.burnin.unlock.UNLOCK_TTL_SECONDS the API response would
advertise the OLD value while grant_pool_unlock used the new one.
Now reads burnin.unlock.UNLOCK_TTL_SECONDS directly so the API
stays in sync with whatever the actual source-of-truth is.
#2 _stage_surface_validate_ssh() carried dead extraction scaffolding
from when the badblocks logic was first inlined into burnin.py:
_is_cancelled_sync (sync wrapper that does run_until_complete in
a coroutine — would deadlock if ever called), last_logged_pct,
on_progress, accumulated_lines, on_progress_async — none on any
control-flow path. Plus result["output"] which was set but never
read. All deleted; the inline _drain coroutines below already
handle progress/log throttling correctly.
#3 The new module boundaries were leaking — root orchestration
mutated _remote_pids and _unlock_grants directly even though
kill.clear_remote_pid() and unlock.invalidate_grant() existed.
Now using the helpers, so a future change to the storage shape
only requires editing the owning module.
Bonus from Codex's check note: _get_client() now asserts
burnin._client is not None with a clear message instead of relying
on an obscure NoneType AttributeError if a stage is somehow called
before init().
Verified: 44/44 tests pass; container boots clean; /health 200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Continues the staged burnin.py module split started in 1.0.0-30.
Two more clean extractions; orchestration (init, _run_job,
start_job, cancel_job, check_stuck_jobs, semaphore) intentionally
stays in __init__.py for now to avoid threading the TrueNASClient
through cross-module setters.
* app/burnin/_common.py — shared helpers with no upward deps:
STAGE_ORDER + _STAGE_BASE_WEIGHTS + POLL_INTERVAL constants;
_now / _db connection helper; _is_cancelled, _start_stage,
_finish_stage, _cancel_stage, _set_stage_error, _update_stage_*,
_append_stage_log, _store_smart_*, _recalculate_progress; SSE
_push_update. Imports nothing from sibling burnin modules.
* app/burnin/stages.py — every per-stage implementation moved
verbatim: _stage_precheck, _stage_smart_test +
_stage_smart_test_api / _ssh, _stage_surface_validate +
_surface_validate_nvme / _ssh / _truenas, _stage_timed_simulate,
_stage_final_check, plus _badblocks_available, _nvme_cli_available,
and _dispatch_stage. Pulls the shared helpers from _common,
remote-PID setters from kill, and the live TrueNASClient via a
lazy `_get_client()` helper that defers `from app import burnin`
until call time so we don't trip a circular import.
* __init__.py shrank from ~1480 LoC to ~600. Re-exports every
public name (start_job, cancel_job, init, check_stuck_jobs,
PoolMemberError, UNLOCK_TTL_SECONDS, etc.) so external callers
in routes.py / mailer.py / poller.py see the same surface.
State that didn't move: _semaphore, _client, _active_tasks remain
on the package root (with a runtime _client reference from routes.py
preserved). _run_job and start_job still live in __init__.py — full
task.py extraction would require giving stages access to _client
through a setter rather than the lazy lookup, deferred to a future
slice.
Verification: 44/44 unit tests pass in container; /health 200;
container boots clean. No public API change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First slice of the planned tech-debt cleanup. burnin.py was 1667 lines
and growing; staged extraction gives smaller diffs to review and a
clear bisect target if anything regresses.
Mechanical move only — no behaviour change. The two extracted modules:
* app/burnin/unlock.py — _UnlockGrant, _unlock_grants, PoolMemberError,
is_unlocked / unlock_expiry / grant_pool_unlock, plus the four
*_TOKEN constants and UNLOCK_TTL_SECONDS. Owns its module-level
state; opens its own DB connection in grant_pool_unlock so it
doesn't depend on the parent package's _db() helper.
* app/burnin/kill.py — _remote_pids dict and the kill_remote_process /
set_remote_pid / clear_remote_pid / get_remote_pid helpers. Pulled
out of __init__.py so the asyncssh-ignores-signals workaround lives
next to the state it operates on.
app/burnin/__init__.py re-exports every public symbol the rest of the
app imports — `from app import burnin; burnin.start_job(...)`,
`burnin.PoolMemberError`, `burnin.UNLOCK_TTL_SECONDS`, etc. all keep
working unchanged. Internal aliases `_remote_pids` and `_unlock_grants`
on the package root point at the SAME dict objects in the submodules,
so existing in-package mutations (set in stages, cleared in cleanup
callbacks) work without rewrite.
Test fix: tests/test_unlock_flow.py:test_expired_grant_returns_false
monkey-patches UNLOCK_TTL_SECONDS. The package-root alias is bound at
import time and won't propagate back to the submodule's read site, so
the test now patches `app.burnin.unlock.UNLOCK_TTL_SECONDS` directly.
Verification: 44/44 unit tests pass in container; /health 200;
container boots clean. routes.py, mailer.py, poller.py untouched —
the public API is identical.
Future: extract stages, task, _common in subsequent versions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>