Commit graph

15 commits

Author SHA1 Message Date
Brandon Walter
7c3873dd5e fix: translate badblocks \b → \n at shell level (1.0.0-57)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
The chunk-read drain in 1.0.0-55 was supposed to handle badblocks's
\b-overwrite progress format but silently never surfaced data — DB
bb_phase_pct stayed at 0, log_text stayed at 136 bytes for 26+ hours
of running burn-ins. Asyncssh stream.read(4096) behavior on this
combination of badblocks output + pipe characteristics wasn't doing
what I expected, and gather(return_exceptions=True) swallowed any
exception silently.

Fix: pipe the badblocks output through `tr '\b' '\n'` at the SHELL
level on TrueNAS, before it reaches asyncssh. Every progress update
is now a real newline-terminated line by the time we receive it.

This also lets us revert to the simpler `async for raw in stream:`
drain we had pre-1.0.0-55 — which was proven to work (it caught the
PID line and phase-transition headers, just not mid-phase progress).

Plus consolidate: 2>&1 merges stderr into stdout before tr, so we
only need ONE drain coroutine, not two. Single throttle gate
preserved.

Recovery: after deploy, the 4 jobs that have been stuck in pipe_w
for 26h were autonomously reset via inline SQL and relaunched via
POST /api/v1/burnin/start (loopback bypass from 1.0.0-56 made this
possible without a session cookie).
2026-05-13 10:26:06 -07:00
Brandon Walter
f71ae341f5 fix: backport stages.py \b-parser fix + drawer-finish inline (uncommitted from 1.0.0-55)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
The chunk-read parser fix that ships as part of 1.0.0-55 in the
running container was scp'd to maple but never reached git. Same
for the drawer-job-finish margin-left removal (request: pill
sits inline next to operator/date, not flush right).

Reconciling source with deployed state. No new behaviour — git
now matches what's been live on maple since 1.0.0-55.
2026-05-12 07:53:33 -07:00
Brandon Walter
149f2901b7 fix: throttle ALL drain-loop DB calls + drop progress noise from log (1.0.0-54)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
1.0.0-52 throttled the percent/bb_phase writes but missed:

- `_is_cancelled` ran a DB query on EVERY stderr line (sub-second
  cadence × 4 concurrent burn-ins = ~10+ DB connection opens/s)
- `_append_stage_log` ran every 20 output_lines (~once per second)
  doing a quadratic `log_text || ?` concat that gets multi-MB
  rewrites as the log grows
- `_recalculate_progress` + `_push_update` also fired per gated tick

Cumulative load kept the asyncssh drain coroutine too busy to
consume the SSH channel buffer; SSH window stalled; sshd stopped
reading the pipe; badblocks blocked on pipe_write with state=S
wchan=pipe_write. /sys/block sectors_written delta confirmed
0 disk I/O across all running drives despite 23h elapsed.

Fix:
1. Single throttle gate (BB_DB_MIN_SECONDS=5s) covers EVERY DB
   touch in the drain — cancel check, percent/phase/bb_count
   updates, throughput sample, log flush, recalc, SSE push.
   Phase transitions still bypass the throttle (rare + important).
2. Exclude "XX% done" lines from the log entirely. They were the
   dominant volume; meaningful content (pattern headers, errors,
   bad-block numbers) still gets logged via the throttled flush.
3. log_text concat still quadratic but the volume reduction makes
   it tractable — buffer to pending_log_chunks, flush on the gate.

Net effect: ~99% reduction in drain-loop DB load. asyncssh drain
keeps up; pipe drains; badblocks writes; disk goes brr.
2026-05-11 22:07:39 -07:00
Brandon Walter
c5a41d0260 fix: throttle badblocks parser DB writes (1.0.0-52)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
User reported sdb showing 134-day ETA. Investigation: badblocks
processes all stuck in pipe_write wchan, iostat showing 0 throughput
across all drives despite badblocks supposedly running.

Root cause: each progress line was triggering 4-6 DB transactions
(_update_stage_percent, _update_stage_bb_phase, _update_stage_bad_blocks,
_update_stage_bb_mbps, _record_bb_phase_start, _recalculate_progress).
With 4 concurrent burn-ins and sub-second progress lines, the
asyncssh _drain couldn't keep up. Drain fell behind → asyncssh
channel buffer filled → SSH window stopped advancing → sshd stopped
reading from badblocks's stdout pipe → pipe filled → badblocks
blocked on pipe_write() → no more disk I/O.

That regression came in across 1.0.0-44 → 1.0.0-47 as I added each
new persisted field. The previous per-line write path worked when
there was only one DB call; it doesn't with five-plus.

Fix: BB_DB_MIN_SECONDS=5 throttle on the DB-write path. The drain
loop still consumes every progress line (so the pipe drains
continuously), but commits to DB at most once every 5 seconds.
Phase transitions always commit immediately (rare and important —
they stamp bb_phase_history and advance the per-pattern meter).

UI impact: minimal — drawer polls drives every ~12s anyway, so the
displayed % was already at 12s resolution. The meter strip just
won't sub-tick within a 5s window.

DB load impact: 60-80x reduction during surface_validate.
2026-05-10 22:12:02 -07:00
Brandon Walter
1bc1b378ab fix: cancel-mid-stage marks job 'unknown' not 'failed' (1.0.0-51)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
Container restarts (uvicorn shutdown / 'docker compose up -d') were
silently classifying running burn-ins as 'failed' with empty
error_text. Two reasons converged:

1. _stage_surface_validate_ssh caught asyncio.CancelledError at the
   stage level and returned False, *swallowing* the cancel signal.
2. _run_job's outer CancelledError handler then never fired, so
   was_cancelled stayed False and the job got marked 'failed' (the
   "burn-in itself failed" classification) instead of 'unknown'
   (the honest "we don't know whether it would have passed").

Fix:
- Stage now does best-effort kill of remote badblocks (shielded so
  loop shutdown doesn't interrupt the kill), appends an [ABORTED]
  marker to the log, and re-raises CancelledError. _execute_stages
  doesn't catch it (CancelledError is BaseException, not Exception
  in 3.8+) so it propagates up to _run_job.
- _run_job's existing CancelledError handler now also reconciles
  any stage rows still recorded as 'running' by setting them to
  'unknown' with a clear error_text: "Task cancelled mid-run —
  likely container restart or shutdown". The job's error_text gets
  the same message so the drawer's Reason block has something
  specific to display, instead of falling back to the heuristic.

Future container restarts on running burn-ins will now show as
yellow "UNKNOWN" with the explicit cancel reason, matching the
existing behaviour of check_stuck_jobs() for stuck timeouts.
2026-05-09 12:32:46 -07:00
Brandon Walter
383258df97 feat: phase caption + bad-block badge + per-pattern history (1.0.0-47)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
Three additions to the surface_validate drawer block:

1. **Phase caption** below the meters: "Pattern 2 of 4 · Verify 0x55
   · 47% within phase". Pure JS — no schema change. Makes the
   visual grammar explicit without needing the operator to mentally
   map phase=4 to "verifying pattern 2".

2. **Bad-block badge** in the vitals row. Green at 0, red at >0.
   The number was already on the stage row but burying it in the
   log felt wrong — surfacing it next to temp/speed/ETA keeps it
   in eye-line during long runs.

3. **Per-pattern duration history** below the caption. New
   bb_phase_history JSON column (idempotent migration) maps
   {phase_num: ts}. Parser stamps the timestamp on every phase
   transition (and on stage entry for phase 1). Drawer diffs
   consecutive write-phase starts to derive "0xaa: 14h 22m"
   for completed patterns. Once one pattern is done you can
   predict the rest without leaving the drawer.

Persistence is idempotent: re-entry of the same phase keeps the
original timestamp so a transient parser reset doesn't blow away
history. JSON parse failures fail gracefully (no row rendered).
2026-05-08 23:23:02 -07:00
Brandon Walter
6b2367b892 feat: vital-signs strip above per-pattern meters (1.0.0-46)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
The drawer's surface_validate area now leads with a row of operator
vitals computed from data already in the response:

- Temp: drive temperature with cool/warm/hot colour (≥48 red, ≥42 yellow)
- Speed: live MB/s, NULL until second progress sample arrives
- Elapsed: time since stage started_at
- ETA: extrapolated from overall progress; suppressed under 0.5%
  to avoid the "47 days remaining" artefact early in pattern 1

Live MB/s comes from a new bb_mbps column on burnin_stages, computed
in the badblocks parser as (delta_overall_pct / 800) * size_bytes / dt.
Skipped on phase transitions (per-phase pct resets) and sub-second
samples (noisy).

Drawer endpoint now passes drive.temperature_c through; JS stashes
the latest drive object in _DRAWER_LAST_DRIVE so the burn-in renderer
can pull it for the vitals row without changing call signatures.

Tightened table CSS in this same session is unrelated and shipped
already in earlier rounds via the bind-mounted app.css.
2026-05-08 23:13:58 -07:00
Brandon Walter
1393ba0bc8 fix: seed bb_phase=1,pct=0 at surface_validate start (1.0.0-45)
Some checks are pending
Security scan / mypy (push) Waiting to run
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Previously the parser only wrote bb_phase to the DB when state
*changed* — so for the first several minutes of a 14 TB burn-in
(before badblocks emits its first 'X% done' line), bb_phase stayed
NULL and the drawer's per-pattern meters didn't render at all.
Looked broken to operators.

Now we write phase=1, phase_pct=0 immediately on stage entry. The
parser keeps overwriting on every subsequent tick. Drawer shows
empty meters with 0xaa label highlighted blue from t=0.
2026-05-08 22:45:45 -07:00
Brandon Walter
30062affc2 feat: per-pattern badblocks meters in drive drawer (1.0.0-44)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
User asked for one meter per badblocks pattern. The drawer now shows
4 meters (one per pattern: 0xaa / 0x55 / 0xff / 0x00), each split
into write (left, blue) + verify (right, green) halves so a glance
shows both which pattern is current AND whether you're writing or
verifying within it.

Backend:
- New columns burnin_stages.bb_phase (1-8) + bb_phase_pct (0-100)
  via idempotent ALTER TABLE migration
- _update_stage_bb_phase() helper called from the badblocks parser
  on every tick (when phase or percent changes)
- /api/v1/drives/{id}/drawer SELECT now returns the new fields

Frontend (app.js + app.css):
- _drawerRenderBadblocksMeters(phase, phasePct) computes per-pattern
  fill state and emits 4-meter HTML with W/V sub-labels
- Conditional render: only shows when stage_name === 'surface_validate'
  AND bb_phase is set, so historical pre-1.0.0-44 stage rows render
  unchanged (single percent, no meters)

3 new tests cover the migration columns, single-tick persistence,
and overwrite-on-second-tick. Total suite: 75 tests.

Image rebuilt and tagged but NOT deployed — 4 burn-ins are running
right now and a recreate would SIGHUP them. Deploy with
`docker compose up -d` after the current batch finishes; the
migration runs at init and the meters light up for the next batch.
2026-05-08 22:34:35 -07:00
Brandon Walter
b406e3f315 fix: badblocks progress tracks overall %, not per-phase (1.0.0-42)
Some checks failed
Security scan / pip-audit (push) Has been cancelled
Security scan / bandit (push) Has been cancelled
Security scan / gitleaks (push) Has been cancelled
Security scan / mypy (push) Has been cancelled
`badblocks -w` cycles through 4 patterns (0xaa, 0x55, 0xff, 0x00),
each with a write phase + a verify phase = 8 phases. The output's
"XX% done" lines are per-phase, so the dashboard appeared to "rewind"
every ~2 hours. Two drives racing each other could look 4× apart in
displayed progress despite identical hardware — actually one was
just further into a later phase.

New _BadblocksProgress state machine watches for "Testing with
pattern 0xXX" and "Reading and comparing" headers, advances the
phase counter, and reports overall = ((phase-1) * 100 + phase_pct) / 8
clipped to 99. Pure state machine, no I/O.

7 new tests cover phase-header detection, boundary math, monotonicity
across a synthetic stream, and the original "two drives at same
per-phase % look identical" bug.

Image rebuilt and tagged but NOT deployed to the running container —
4 surface-validate jobs are 20-95% through 14TB drives and a recreate
would SIGHUP the remote badblocks processes. Deploy with
`docker compose up -d` after the current batch finishes.
2026-05-05 07:26:23 -07:00
Brandon Walter
d38807f957 test: cover Spearfoot tunables in badblocks command
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
Extracts the badblocks shell-command construction into
_build_badblocks_cmd(devname) so it can be unit-tested without
spinning up an asyncssh connection. Behavior unchanged.

Three tests guard:
1. Defaults match disk-burnin.sh recommendation (-b 4096 -c 64 -p 1)
2. Operator-set tunables actually propagate to the command
3. The PID-capture wrapper (sh -c 'echo PID:\$\$; exec ...') stays
   intact — without it, cancel cannot kill the remote process
   because asyncssh's signal channel is silently ignored by sshd.
2026-05-03 21:24:10 -07:00
Brandon Walter
7cd66d460f fix: annotate to mypy-clean + promote to gating (1.0.0-40)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
Five files needed annotation tweaks to clear the 14 outstanding
mypy errors, all cosmetic (zero runtime bugs):

- settings_store._coerce: return Any (concrete type depends on key,
  no narrowing path mypy can follow from the dict lookup)
- retention._state: explicit dict[str, str | None] init
- mailer: explicit `server: smtplib.SMTP` binding so SMTP_SSL and
  SMTP both narrow to the parent class for shared call sites
- burnin/stages.py: TypedDict for the badblocks result dict so
  `result["bad_blocks"]` narrows to int at the comparison site

scripts/security-scan.sh: mypy now counted in TOTAL_EXIT and
findings.log line. Comment updated to reflect gating status.
2026-05-03 21:21:55 -07:00
Brandon Walter
eb2a964171 fix: address Codex review of burnin package split (1.0.0-32)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Three LOW-severity findings from Codex's audit of the post-split
package, all small mechanical cleanups:

#1 routes.py:848 read burnin.UNLOCK_TTL_SECONDS — a snapshot alias
   bound at import time. After a test (or runtime) monkey-patches
   app.burnin.unlock.UNLOCK_TTL_SECONDS the API response would
   advertise the OLD value while grant_pool_unlock used the new one.
   Now reads burnin.unlock.UNLOCK_TTL_SECONDS directly so the API
   stays in sync with whatever the actual source-of-truth is.

#2 _stage_surface_validate_ssh() carried dead extraction scaffolding
   from when the badblocks logic was first inlined into burnin.py:
   _is_cancelled_sync (sync wrapper that does run_until_complete in
   a coroutine — would deadlock if ever called), last_logged_pct,
   on_progress, accumulated_lines, on_progress_async — none on any
   control-flow path. Plus result["output"] which was set but never
   read. All deleted; the inline _drain coroutines below already
   handle progress/log throttling correctly.

#3 The new module boundaries were leaking — root orchestration
   mutated _remote_pids and _unlock_grants directly even though
   kill.clear_remote_pid() and unlock.invalidate_grant() existed.
   Now using the helpers, so a future change to the storage shape
   only requires editing the owning module.

Bonus from Codex's check note: _get_client() now asserts
burnin._client is not None with a clear message instead of relying
on an obscure NoneType AttributeError if a stage is somehow called
before init().

Verified: 44/44 tests pass; container boots clean; /health 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:35:07 -04:00
Brandon Walter
19c2c0dc0f refactor: extract _common.py + stages.py from burnin (1.0.0-31)
Continues the staged burnin.py module split started in 1.0.0-30.
Two more clean extractions; orchestration (init, _run_job,
start_job, cancel_job, check_stuck_jobs, semaphore) intentionally
stays in __init__.py for now to avoid threading the TrueNASClient
through cross-module setters.

* app/burnin/_common.py — shared helpers with no upward deps:
  STAGE_ORDER + _STAGE_BASE_WEIGHTS + POLL_INTERVAL constants;
  _now / _db connection helper; _is_cancelled, _start_stage,
  _finish_stage, _cancel_stage, _set_stage_error, _update_stage_*,
  _append_stage_log, _store_smart_*, _recalculate_progress; SSE
  _push_update. Imports nothing from sibling burnin modules.

* app/burnin/stages.py — every per-stage implementation moved
  verbatim: _stage_precheck, _stage_smart_test +
  _stage_smart_test_api / _ssh, _stage_surface_validate +
  _surface_validate_nvme / _ssh / _truenas, _stage_timed_simulate,
  _stage_final_check, plus _badblocks_available, _nvme_cli_available,
  and _dispatch_stage. Pulls the shared helpers from _common,
  remote-PID setters from kill, and the live TrueNASClient via a
  lazy `_get_client()` helper that defers `from app import burnin`
  until call time so we don't trip a circular import.

* __init__.py shrank from ~1480 LoC to ~600. Re-exports every
  public name (start_job, cancel_job, init, check_stuck_jobs,
  PoolMemberError, UNLOCK_TTL_SECONDS, etc.) so external callers
  in routes.py / mailer.py / poller.py see the same surface.

State that didn't move: _semaphore, _client, _active_tasks remain
on the package root (with a runtime _client reference from routes.py
preserved). _run_job and start_job still live in __init__.py — full
task.py extraction would require giving stages access to _client
through a setter rather than the lazy lookup, deferred to a future
slice.

Verification: 44/44 unit tests pass in container; /health 200;
container boots clean. No public API change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:18:04 -04:00
Brandon Walter
9cbae44495 refactor: split burnin.py into a package — extract unlock + kill (1.0.0-30)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
First slice of the planned tech-debt cleanup. burnin.py was 1667 lines
and growing; staged extraction gives smaller diffs to review and a
clear bisect target if anything regresses.

Mechanical move only — no behaviour change. The two extracted modules:

* app/burnin/unlock.py — _UnlockGrant, _unlock_grants, PoolMemberError,
  is_unlocked / unlock_expiry / grant_pool_unlock, plus the four
  *_TOKEN constants and UNLOCK_TTL_SECONDS. Owns its module-level
  state; opens its own DB connection in grant_pool_unlock so it
  doesn't depend on the parent package's _db() helper.

* app/burnin/kill.py — _remote_pids dict and the kill_remote_process /
  set_remote_pid / clear_remote_pid / get_remote_pid helpers. Pulled
  out of __init__.py so the asyncssh-ignores-signals workaround lives
  next to the state it operates on.

app/burnin/__init__.py re-exports every public symbol the rest of the
app imports — `from app import burnin; burnin.start_job(...)`,
`burnin.PoolMemberError`, `burnin.UNLOCK_TTL_SECONDS`, etc. all keep
working unchanged. Internal aliases `_remote_pids` and `_unlock_grants`
on the package root point at the SAME dict objects in the submodules,
so existing in-package mutations (set in stages, cleared in cleanup
callbacks) work without rewrite.

Test fix: tests/test_unlock_flow.py:test_expired_grant_returns_false
monkey-patches UNLOCK_TTL_SECONDS. The package-root alias is bound at
import time and won't propagate back to the submodule's read site, so
the test now patches `app.burnin.unlock.UNLOCK_TTL_SECONDS` directly.

Verification: 44/44 unit tests pass in container; /health 200;
container boots clean. routes.py, mailer.py, poller.py untouched —
the public API is identical.

Future: extract stages, task, _common in subsequent versions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:44:28 -04:00