nas-burnin

Author	SHA1	Message	Date
Brandon Walter	7c3873dd5e	fix: translate badblocks \b → \n at shell level (1.0.0-57) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Security scan / mypy (push) Waiting to run Details The chunk-read drain in 1.0.0-55 was supposed to handle badblocks's \b-overwrite progress format but silently never surfaced data — DB bb_phase_pct stayed at 0, log_text stayed at 136 bytes for 26+ hours of running burn-ins. Asyncssh stream.read(4096) behavior on this combination of badblocks output + pipe characteristics wasn't doing what I expected, and gather(return_exceptions=True) swallowed any exception silently. Fix: pipe the badblocks output through `tr '\b' '\n'` at the SHELL level on TrueNAS, before it reaches asyncssh. Every progress update is now a real newline-terminated line by the time we receive it. This also lets us revert to the simpler `async for raw in stream:` drain we had pre-1.0.0-55 — which was proven to work (it caught the PID line and phase-transition headers, just not mid-phase progress). Plus consolidate: 2>&1 merges stderr into stdout before tr, so we only need ONE drain coroutine, not two. Single throttle gate preserved. Recovery: after deploy, the 4 jobs that have been stuck in pipe_w for 26h were autonomously reset via inline SQL and relaunched via POST /api/v1/burnin/start (loopback bypass from 1.0.0-56 made this possible without a session cookie).	2026-05-13 10:26:06 -07:00
Brandon Walter	f71ae341f5	fix: backport stages.py \b-parser fix + drawer-finish inline (uncommitted from 1.0.0-55) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Security scan / mypy (push) Waiting to run Details The chunk-read parser fix that ships as part of 1.0.0-55 in the running container was scp'd to maple but never reached git. Same for the drawer-job-finish margin-left removal (request: pill sits inline next to operator/date, not flush right). Reconciling source with deployed state. No new behaviour — git now matches what's been live on maple since 1.0.0-55.	2026-05-12 07:53:33 -07:00
Brandon Walter	149f2901b7	fix: throttle ALL drain-loop DB calls + drop progress noise from log (1.0.0-54) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Security scan / mypy (push) Waiting to run Details 1.0.0-52 throttled the percent/bb_phase writes but missed: - `_is_cancelled` ran a DB query on EVERY stderr line (sub-second cadence × 4 concurrent burn-ins = ~10+ DB connection opens/s) - `_append_stage_log` ran every 20 output_lines (~once per second) doing a quadratic `log_text \|\| ?` concat that gets multi-MB rewrites as the log grows - `_recalculate_progress` + `_push_update` also fired per gated tick Cumulative load kept the asyncssh drain coroutine too busy to consume the SSH channel buffer; SSH window stalled; sshd stopped reading the pipe; badblocks blocked on pipe_write with state=S wchan=pipe_write. /sys/block sectors_written delta confirmed 0 disk I/O across all running drives despite 23h elapsed. Fix: 1. Single throttle gate (BB_DB_MIN_SECONDS=5s) covers EVERY DB touch in the drain — cancel check, percent/phase/bb_count updates, throughput sample, log flush, recalc, SSE push. Phase transitions still bypass the throttle (rare + important). 2. Exclude "XX% done" lines from the log entirely. They were the dominant volume; meaningful content (pattern headers, errors, bad-block numbers) still gets logged via the throttled flush. 3. log_text concat still quadratic but the volume reduction makes it tractable — buffer to pending_log_chunks, flush on the gate. Net effect: ~99% reduction in drain-loop DB load. asyncssh drain keeps up; pipe drains; badblocks writes; disk goes brr.	2026-05-11 22:07:39 -07:00
Brandon Walter	c5a41d0260	fix: throttle badblocks parser DB writes (1.0.0-52) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Security scan / mypy (push) Waiting to run Details User reported sdb showing 134-day ETA. Investigation: badblocks processes all stuck in pipe_write wchan, iostat showing 0 throughput across all drives despite badblocks supposedly running. Root cause: each progress line was triggering 4-6 DB transactions (_update_stage_percent, _update_stage_bb_phase, _update_stage_bad_blocks, _update_stage_bb_mbps, _record_bb_phase_start, _recalculate_progress). With 4 concurrent burn-ins and sub-second progress lines, the asyncssh _drain couldn't keep up. Drain fell behind → asyncssh channel buffer filled → SSH window stopped advancing → sshd stopped reading from badblocks's stdout pipe → pipe filled → badblocks blocked on pipe_write() → no more disk I/O. That regression came in across 1.0.0-44 → 1.0.0-47 as I added each new persisted field. The previous per-line write path worked when there was only one DB call; it doesn't with five-plus. Fix: BB_DB_MIN_SECONDS=5 throttle on the DB-write path. The drain loop still consumes every progress line (so the pipe drains continuously), but commits to DB at most once every 5 seconds. Phase transitions always commit immediately (rare and important — they stamp bb_phase_history and advance the per-pattern meter). UI impact: minimal — drawer polls drives every ~12s anyway, so the displayed % was already at 12s resolution. The meter strip just won't sub-tick within a 5s window. DB load impact: 60-80x reduction during surface_validate.	2026-05-10 22:12:02 -07:00
Brandon Walter	1bc1b378ab	fix: cancel-mid-stage marks job 'unknown' not 'failed' (1.0.0-51) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Security scan / mypy (push) Waiting to run Details Container restarts (uvicorn shutdown / 'docker compose up -d') were silently classifying running burn-ins as 'failed' with empty error_text. Two reasons converged: 1. _stage_surface_validate_ssh caught asyncio.CancelledError at the stage level and returned False, swallowing the cancel signal. 2. _run_job's outer CancelledError handler then never fired, so was_cancelled stayed False and the job got marked 'failed' (the "burn-in itself failed" classification) instead of 'unknown' (the honest "we don't know whether it would have passed"). Fix: - Stage now does best-effort kill of remote badblocks (shielded so loop shutdown doesn't interrupt the kill), appends an [ABORTED] marker to the log, and re-raises CancelledError. _execute_stages doesn't catch it (CancelledError is BaseException, not Exception in 3.8+) so it propagates up to _run_job. - _run_job's existing CancelledError handler now also reconciles any stage rows still recorded as 'running' by setting them to 'unknown' with a clear error_text: "Task cancelled mid-run — likely container restart or shutdown". The job's error_text gets the same message so the drawer's Reason block has something specific to display, instead of falling back to the heuristic. Future container restarts on running burn-ins will now show as yellow "UNKNOWN" with the explicit cancel reason, matching the existing behaviour of check_stuck_jobs() for stuck timeouts.	2026-05-09 12:32:46 -07:00
Brandon Walter	383258df97	feat: phase caption + bad-block badge + per-pattern history (1.0.0-47) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Security scan / mypy (push) Waiting to run Details Three additions to the surface_validate drawer block: 1. Phase caption below the meters: "Pattern 2 of 4 · Verify 0x55 · 47% within phase". Pure JS — no schema change. Makes the visual grammar explicit without needing the operator to mentally map phase=4 to "verifying pattern 2". 2. Bad-block badge in the vitals row. Green at 0, red at >0. The number was already on the stage row but burying it in the log felt wrong — surfacing it next to temp/speed/ETA keeps it in eye-line during long runs. 3. Per-pattern duration history below the caption. New bb_phase_history JSON column (idempotent migration) maps {phase_num: ts}. Parser stamps the timestamp on every phase transition (and on stage entry for phase 1). Drawer diffs consecutive write-phase starts to derive "0xaa: 14h 22m" for completed patterns. Once one pattern is done you can predict the rest without leaving the drawer. Persistence is idempotent: re-entry of the same phase keeps the original timestamp so a transient parser reset doesn't blow away history. JSON parse failures fail gracefully (no row rendered).	2026-05-08 23:23:02 -07:00
Brandon Walter	6b2367b892	feat: vital-signs strip above per-pattern meters (1.0.0-46) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Security scan / mypy (push) Waiting to run Details The drawer's surface_validate area now leads with a row of operator vitals computed from data already in the response: - Temp: drive temperature with cool/warm/hot colour (≥48 red, ≥42 yellow) - Speed: live MB/s, NULL until second progress sample arrives - Elapsed: time since stage started_at - ETA: extrapolated from overall progress; suppressed under 0.5% to avoid the "47 days remaining" artefact early in pattern 1 Live MB/s comes from a new bb_mbps column on burnin_stages, computed in the badblocks parser as (delta_overall_pct / 800) * size_bytes / dt. Skipped on phase transitions (per-phase pct resets) and sub-second samples (noisy). Drawer endpoint now passes drive.temperature_c through; JS stashes the latest drive object in _DRAWER_LAST_DRIVE so the burn-in renderer can pull it for the vitals row without changing call signatures. Tightened table CSS in this same session is unrelated and shipped already in earlier rounds via the bind-mounted app.css.	2026-05-08 23:13:58 -07:00
Brandon Walter	1393ba0bc8	fix: seed bb_phase=1,pct=0 at surface_validate start (1.0.0-45) Some checks are pending Security scan / mypy (push) Waiting to run Details Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Previously the parser only wrote bb_phase to the DB when state changed — so for the first several minutes of a 14 TB burn-in (before badblocks emits its first 'X% done' line), bb_phase stayed NULL and the drawer's per-pattern meters didn't render at all. Looked broken to operators. Now we write phase=1, phase_pct=0 immediately on stage entry. The parser keeps overwriting on every subsequent tick. Drawer shows empty meters with 0xaa label highlighted blue from t=0.	2026-05-08 22:45:45 -07:00
Brandon Walter	30062affc2	feat: per-pattern badblocks meters in drive drawer (1.0.0-44) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Security scan / mypy (push) Waiting to run Details User asked for one meter per badblocks pattern. The drawer now shows 4 meters (one per pattern: 0xaa / 0x55 / 0xff / 0x00), each split into write (left, blue) + verify (right, green) halves so a glance shows both which pattern is current AND whether you're writing or verifying within it. Backend: - New columns burnin_stages.bb_phase (1-8) + bb_phase_pct (0-100) via idempotent ALTER TABLE migration - _update_stage_bb_phase() helper called from the badblocks parser on every tick (when phase or percent changes) - /api/v1/drives/{id}/drawer SELECT now returns the new fields Frontend (app.js + app.css): - _drawerRenderBadblocksMeters(phase, phasePct) computes per-pattern fill state and emits 4-meter HTML with W/V sub-labels - Conditional render: only shows when stage_name === 'surface_validate' AND bb_phase is set, so historical pre-1.0.0-44 stage rows render unchanged (single percent, no meters) 3 new tests cover the migration columns, single-tick persistence, and overwrite-on-second-tick. Total suite: 75 tests. Image rebuilt and tagged but NOT deployed — 4 burn-ins are running right now and a recreate would SIGHUP them. Deploy with `docker compose up -d` after the current batch finishes; the migration runs at init and the meters light up for the next batch.	2026-05-08 22:34:35 -07:00
Brandon Walter	b406e3f315	fix: badblocks progress tracks overall %, not per-phase (1.0.0-42) Some checks failed Security scan / pip-audit (push) Has been cancelled Details Security scan / bandit (push) Has been cancelled Details Security scan / gitleaks (push) Has been cancelled Details Security scan / mypy (push) Has been cancelled Details `badblocks -w` cycles through 4 patterns (0xaa, 0x55, 0xff, 0x00), each with a write phase + a verify phase = 8 phases. The output's "XX% done" lines are per-phase, so the dashboard appeared to "rewind" every ~2 hours. Two drives racing each other could look 4× apart in displayed progress despite identical hardware — actually one was just further into a later phase. New _BadblocksProgress state machine watches for "Testing with pattern 0xXX" and "Reading and comparing" headers, advances the phase counter, and reports overall = ((phase-1) * 100 + phase_pct) / 8 clipped to 99. Pure state machine, no I/O. 7 new tests cover phase-header detection, boundary math, monotonicity across a synthetic stream, and the original "two drives at same per-phase % look identical" bug. Image rebuilt and tagged but NOT deployed to the running container — 4 surface-validate jobs are 20-95% through 14TB drives and a recreate would SIGHUP the remote badblocks processes. Deploy with `docker compose up -d` after the current batch finishes.	2026-05-05 07:26:23 -07:00
Brandon Walter	d38807f957	test: cover Spearfoot tunables in badblocks command Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Security scan / mypy (push) Waiting to run Details Extracts the badblocks shell-command construction into _build_badblocks_cmd(devname) so it can be unit-tested without spinning up an asyncssh connection. Behavior unchanged. Three tests guard: 1. Defaults match disk-burnin.sh recommendation (-b 4096 -c 64 -p 1) 2. Operator-set tunables actually propagate to the command 3. The PID-capture wrapper (sh -c 'echo PID:\$\$; exec ...') stays intact — without it, cancel cannot kill the remote process because asyncssh's signal channel is silently ignored by sshd.	2026-05-03 21:24:10 -07:00
Brandon Walter	7cd66d460f	fix: annotate to mypy-clean + promote to gating (1.0.0-40) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Security scan / mypy (push) Waiting to run Details Five files needed annotation tweaks to clear the 14 outstanding mypy errors, all cosmetic (zero runtime bugs): - settings_store._coerce: return Any (concrete type depends on key, no narrowing path mypy can follow from the dict lookup) - retention._state: explicit dict[str, str \| None] init - mailer: explicit `server: smtplib.SMTP` binding so SMTP_SSL and SMTP both narrow to the parent class for shared call sites - burnin/stages.py: TypedDict for the badblocks result dict so `result["bad_blocks"]` narrows to int at the comparison site scripts/security-scan.sh: mypy now counted in TOTAL_EXIT and findings.log line. Comment updated to reflect gating status.	2026-05-03 21:21:55 -07:00
Brandon Walter	eb2a964171	fix: address Codex review of burnin package split (1.0.0-32) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Three LOW-severity findings from Codex's audit of the post-split package, all small mechanical cleanups: #1 routes.py:848 read burnin.UNLOCK_TTL_SECONDS — a snapshot alias bound at import time. After a test (or runtime) monkey-patches app.burnin.unlock.UNLOCK_TTL_SECONDS the API response would advertise the OLD value while grant_pool_unlock used the new one. Now reads burnin.unlock.UNLOCK_TTL_SECONDS directly so the API stays in sync with whatever the actual source-of-truth is. #2 _stage_surface_validate_ssh() carried dead extraction scaffolding from when the badblocks logic was first inlined into burnin.py: _is_cancelled_sync (sync wrapper that does run_until_complete in a coroutine — would deadlock if ever called), last_logged_pct, on_progress, accumulated_lines, on_progress_async — none on any control-flow path. Plus result["output"] which was set but never read. All deleted; the inline _drain coroutines below already handle progress/log throttling correctly. #3 The new module boundaries were leaking — root orchestration mutated _remote_pids and _unlock_grants directly even though kill.clear_remote_pid() and unlock.invalidate_grant() existed. Now using the helpers, so a future change to the storage shape only requires editing the owning module. Bonus from Codex's check note: _get_client() now asserts burnin._client is not None with a clear message instead of relying on an obscure NoneType AttributeError if a stage is somehow called before init(). Verified: 44/44 tests pass; container boots clean; /health 200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 01:35:07 -04:00
Brandon Walter	19c2c0dc0f	refactor: extract _common.py + stages.py from burnin (1.0.0-31) Continues the staged burnin.py module split started in 1.0.0-30. Two more clean extractions; orchestration (init, _run_job, start_job, cancel_job, check_stuck_jobs, semaphore) intentionally stays in __init__.py for now to avoid threading the TrueNASClient through cross-module setters. * app/burnin/_common.py — shared helpers with no upward deps: STAGE_ORDER + _STAGE_BASE_WEIGHTS + POLL_INTERVAL constants; _now / _db connection helper; _is_cancelled, _start_stage, _finish_stage, _cancel_stage, _set_stage_error, _update_stage_, _append_stage_log, _store_smart_, _recalculate_progress; SSE _push_update. Imports nothing from sibling burnin modules. * app/burnin/stages.py — every per-stage implementation moved verbatim: _stage_precheck, _stage_smart_test + _stage_smart_test_api / _ssh, _stage_surface_validate + _surface_validate_nvme / _ssh / _truenas, _stage_timed_simulate, _stage_final_check, plus _badblocks_available, _nvme_cli_available, and _dispatch_stage. Pulls the shared helpers from _common, remote-PID setters from kill, and the live TrueNASClient via a lazy `_get_client()` helper that defers `from app import burnin` until call time so we don't trip a circular import. * __init__.py shrank from ~1480 LoC to ~600. Re-exports every public name (start_job, cancel_job, init, check_stuck_jobs, PoolMemberError, UNLOCK_TTL_SECONDS, etc.) so external callers in routes.py / mailer.py / poller.py see the same surface. State that didn't move: _semaphore, _client, _active_tasks remain on the package root (with a runtime _client reference from routes.py preserved). _run_job and start_job still live in __init__.py — full task.py extraction would require giving stages access to _client through a setter rather than the lazy lookup, deferred to a future slice. Verification: 44/44 unit tests pass in container; /health 200; container boots clean. No public API change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 01:18:04 -04:00
Brandon Walter	9cbae44495	refactor: split burnin.py into a package — extract unlock + kill (1.0.0-30) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details First slice of the planned tech-debt cleanup. burnin.py was 1667 lines and growing; staged extraction gives smaller diffs to review and a clear bisect target if anything regresses. Mechanical move only — no behaviour change. The two extracted modules: * app/burnin/unlock.py — _UnlockGrant, _unlock_grants, PoolMemberError, is_unlocked / unlock_expiry / grant_pool_unlock, plus the four _TOKEN constants and UNLOCK_TTL_SECONDS. Owns its module-level state; opens its own DB connection in grant_pool_unlock so it doesn't depend on the parent package's _db() helper. app/burnin/kill.py — _remote_pids dict and the kill_remote_process / set_remote_pid / clear_remote_pid / get_remote_pid helpers. Pulled out of __init__.py so the asyncssh-ignores-signals workaround lives next to the state it operates on. app/burnin/__init__.py re-exports every public symbol the rest of the app imports — `from app import burnin; burnin.start_job(...)`, `burnin.PoolMemberError`, `burnin.UNLOCK_TTL_SECONDS`, etc. all keep working unchanged. Internal aliases `_remote_pids` and `_unlock_grants` on the package root point at the SAME dict objects in the submodules, so existing in-package mutations (set in stages, cleared in cleanup callbacks) work without rewrite. Test fix: tests/test_unlock_flow.py:test_expired_grant_returns_false monkey-patches UNLOCK_TTL_SECONDS. The package-root alias is bound at import time and won't propagate back to the submodule's read site, so the test now patches `app.burnin.unlock.UNLOCK_TTL_SECONDS` directly. Verification: 44/44 unit tests pass in container; /health 200; container boots clean. routes.py, mailer.py, poller.py untouched — the public API is identical. Future: extract stages, task, _common in subsequent versions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 00:44:28 -04:00

15 commits