Jobs 60-63 ran healthy for 16h then all 4 died simultaneously with
'database is locked'. The burnin drain used _db() which set
busy_timeout=10000, but:
1. 10s was sometimes too short under heavy contention (4 burn-in
drains writing every 5s + poller every 12s + retention scan +
auth + lifespan = many concurrent writers).
2. OTHER aiosqlite.connect() sites (poller, retention, auth, mailer,
routes/__init__'s SSE, burnin/__init__.py's various helpers,
database.get_db) didn't set busy_timeout at all. Without it,
SQLite raises 'database is locked' INSTANTLY on any contention,
which forced concurrency back onto the drain's connection.
Fix:
- _db() busy_timeout 10000 → 60000 (60s; aggressive but right for
this workload — brief contention spikes are normal and waiting
beats failing).
- PRAGMA busy_timeout=60000 added on every aiosqlite.connect() site
next to the existing PRAGMA calls. Applied via a small Python
pass that preserves the original variable name (db / _tdb / src
/ dst etc.) and indentation.
Same restart sequence applied: rebuild container, reset 4 drives,
relaunch via loopback bypass. Jobs 64-67 are now running.
This is auto-restart #2 in 24h. Safety brake at 3.
After the chunk-read refactor, the inner _drain coroutine assigns
to last_db_write_ts and last_pct_sample. Without nonlocal, Python
compiles these as locals of _drain, so any READ before the first
assignment raises UnboundLocalError.
In 1.0.0-55 / -57 the bug was hidden by gather(return_exceptions=
True), which silently swallowed the exception — the drain coroutine
ended immediately, the asyncssh channel buffer filled up, and the
remote badblocks blocked on pipe_write. THAT was the actual cause
of the "parser silently never works" symptom, not anything to do
with the chunk-read or tr-pipe logic itself.
1.0.0-57 dropped the gather (single drain after merging 2>&1), which
made the next deploy surface the bug as an explicit error_text on
the surface_validate stage: "cannot access local variable
'last_db_write_ts' where it is not associated with a value".
Fix: add both vars to the nonlocal declaration. pending_log_chunks
only gets .append/.clear (no reassignment) so it doesn't need
nonlocal.
This is the bug that's been hiding behind all the recent parser
work. Sorry for the round trips.
1.0.0-57's tr-pipe fix delivered \n-terminated progress lines but
tr's stdout is block-buffered (4 KB chunks) when its destination
is a pipe — and the SSH channel is a pipe. At ~50 bytes per badblocks
progress line, that means ~80 lines accumulate (~6 minutes at our
throughput) before tr flushes anything.
stdbuf -oL forces tr's stdout to line-buffered mode. Each \n now
triggers a flush. Progress lines reach asyncssh as they happen.
The chunk-read drain in 1.0.0-55 was supposed to handle badblocks's
\b-overwrite progress format but silently never surfaced data — DB
bb_phase_pct stayed at 0, log_text stayed at 136 bytes for 26+ hours
of running burn-ins. Asyncssh stream.read(4096) behavior on this
combination of badblocks output + pipe characteristics wasn't doing
what I expected, and gather(return_exceptions=True) swallowed any
exception silently.
Fix: pipe the badblocks output through `tr '\b' '\n'` at the SHELL
level on TrueNAS, before it reaches asyncssh. Every progress update
is now a real newline-terminated line by the time we receive it.
This also lets us revert to the simpler `async for raw in stream:`
drain we had pre-1.0.0-55 — which was proven to work (it caught the
PID line and phase-transition headers, just not mid-phase progress).
Plus consolidate: 2>&1 merges stderr into stdout before tr, so we
only need ONE drain coroutine, not two. Single throttle gate
preserved.
Recovery: after deploy, the 4 jobs that have been stuck in pipe_w
for 26h were autonomously reset via inline SQL and relaunched via
POST /api/v1/burnin/start (loopback bypass from 1.0.0-56 made this
possible without a session cookie).
The chunk-read parser fix that ships as part of 1.0.0-55 in the
running container was scp'd to maple but never reached git. Same
for the drawer-job-finish margin-left removal (request: pill
sits inline next to operator/date, not flush right).
Reconciling source with deployed state. No new behaviour — git
now matches what's been live on maple since 1.0.0-55.
1.0.0-52 throttled the percent/bb_phase writes but missed:
- `_is_cancelled` ran a DB query on EVERY stderr line (sub-second
cadence × 4 concurrent burn-ins = ~10+ DB connection opens/s)
- `_append_stage_log` ran every 20 output_lines (~once per second)
doing a quadratic `log_text || ?` concat that gets multi-MB
rewrites as the log grows
- `_recalculate_progress` + `_push_update` also fired per gated tick
Cumulative load kept the asyncssh drain coroutine too busy to
consume the SSH channel buffer; SSH window stalled; sshd stopped
reading the pipe; badblocks blocked on pipe_write with state=S
wchan=pipe_write. /sys/block sectors_written delta confirmed
0 disk I/O across all running drives despite 23h elapsed.
Fix:
1. Single throttle gate (BB_DB_MIN_SECONDS=5s) covers EVERY DB
touch in the drain — cancel check, percent/phase/bb_count
updates, throughput sample, log flush, recalc, SSE push.
Phase transitions still bypass the throttle (rare + important).
2. Exclude "XX% done" lines from the log entirely. They were the
dominant volume; meaningful content (pattern headers, errors,
bad-block numbers) still gets logged via the throttled flush.
3. log_text concat still quadratic but the volume reduction makes
it tractable — buffer to pending_log_chunks, flush on the gate.
Net effect: ~99% reduction in drain-loop DB load. asyncssh drain
keeps up; pipe drains; badblocks writes; disk goes brr.
User reported sdb showing 134-day ETA. Investigation: badblocks
processes all stuck in pipe_write wchan, iostat showing 0 throughput
across all drives despite badblocks supposedly running.
Root cause: each progress line was triggering 4-6 DB transactions
(_update_stage_percent, _update_stage_bb_phase, _update_stage_bad_blocks,
_update_stage_bb_mbps, _record_bb_phase_start, _recalculate_progress).
With 4 concurrent burn-ins and sub-second progress lines, the
asyncssh _drain couldn't keep up. Drain fell behind → asyncssh
channel buffer filled → SSH window stopped advancing → sshd stopped
reading from badblocks's stdout pipe → pipe filled → badblocks
blocked on pipe_write() → no more disk I/O.
That regression came in across 1.0.0-44 → 1.0.0-47 as I added each
new persisted field. The previous per-line write path worked when
there was only one DB call; it doesn't with five-plus.
Fix: BB_DB_MIN_SECONDS=5 throttle on the DB-write path. The drain
loop still consumes every progress line (so the pipe drains
continuously), but commits to DB at most once every 5 seconds.
Phase transitions always commit immediately (rare and important —
they stamp bb_phase_history and advance the per-pattern meter).
UI impact: minimal — drawer polls drives every ~12s anyway, so the
displayed % was already at 12s resolution. The meter strip just
won't sub-tick within a 5s window.
DB load impact: 60-80x reduction during surface_validate.
Container restarts (uvicorn shutdown / 'docker compose up -d') were
silently classifying running burn-ins as 'failed' with empty
error_text. Two reasons converged:
1. _stage_surface_validate_ssh caught asyncio.CancelledError at the
stage level and returned False, *swallowing* the cancel signal.
2. _run_job's outer CancelledError handler then never fired, so
was_cancelled stayed False and the job got marked 'failed' (the
"burn-in itself failed" classification) instead of 'unknown'
(the honest "we don't know whether it would have passed").
Fix:
- Stage now does best-effort kill of remote badblocks (shielded so
loop shutdown doesn't interrupt the kill), appends an [ABORTED]
marker to the log, and re-raises CancelledError. _execute_stages
doesn't catch it (CancelledError is BaseException, not Exception
in 3.8+) so it propagates up to _run_job.
- _run_job's existing CancelledError handler now also reconciles
any stage rows still recorded as 'running' by setting them to
'unknown' with a clear error_text: "Task cancelled mid-run —
likely container restart or shutdown". The job's error_text gets
the same message so the drawer's Reason block has something
specific to display, instead of falling back to the heuristic.
Future container restarts on running burn-ins will now show as
yellow "UNKNOWN" with the explicit cancel reason, matching the
existing behaviour of check_stuck_jobs() for stuck timeouts.
Three additions to the surface_validate drawer block:
1. **Phase caption** below the meters: "Pattern 2 of 4 · Verify 0x55
· 47% within phase". Pure JS — no schema change. Makes the
visual grammar explicit without needing the operator to mentally
map phase=4 to "verifying pattern 2".
2. **Bad-block badge** in the vitals row. Green at 0, red at >0.
The number was already on the stage row but burying it in the
log felt wrong — surfacing it next to temp/speed/ETA keeps it
in eye-line during long runs.
3. **Per-pattern duration history** below the caption. New
bb_phase_history JSON column (idempotent migration) maps
{phase_num: ts}. Parser stamps the timestamp on every phase
transition (and on stage entry for phase 1). Drawer diffs
consecutive write-phase starts to derive "0xaa: 14h 22m"
for completed patterns. Once one pattern is done you can
predict the rest without leaving the drawer.
Persistence is idempotent: re-entry of the same phase keeps the
original timestamp so a transient parser reset doesn't blow away
history. JSON parse failures fail gracefully (no row rendered).
The drawer's surface_validate area now leads with a row of operator
vitals computed from data already in the response:
- Temp: drive temperature with cool/warm/hot colour (≥48 red, ≥42 yellow)
- Speed: live MB/s, NULL until second progress sample arrives
- Elapsed: time since stage started_at
- ETA: extrapolated from overall progress; suppressed under 0.5%
to avoid the "47 days remaining" artefact early in pattern 1
Live MB/s comes from a new bb_mbps column on burnin_stages, computed
in the badblocks parser as (delta_overall_pct / 800) * size_bytes / dt.
Skipped on phase transitions (per-phase pct resets) and sub-second
samples (noisy).
Drawer endpoint now passes drive.temperature_c through; JS stashes
the latest drive object in _DRAWER_LAST_DRIVE so the burn-in renderer
can pull it for the vitals row without changing call signatures.
Tightened table CSS in this same session is unrelated and shipped
already in earlier rounds via the bind-mounted app.css.
Previously the parser only wrote bb_phase to the DB when state
*changed* — so for the first several minutes of a 14 TB burn-in
(before badblocks emits its first 'X% done' line), bb_phase stayed
NULL and the drawer's per-pattern meters didn't render at all.
Looked broken to operators.
Now we write phase=1, phase_pct=0 immediately on stage entry. The
parser keeps overwriting on every subsequent tick. Drawer shows
empty meters with 0xaa label highlighted blue from t=0.
User asked for one meter per badblocks pattern. The drawer now shows
4 meters (one per pattern: 0xaa / 0x55 / 0xff / 0x00), each split
into write (left, blue) + verify (right, green) halves so a glance
shows both which pattern is current AND whether you're writing or
verifying within it.
Backend:
- New columns burnin_stages.bb_phase (1-8) + bb_phase_pct (0-100)
via idempotent ALTER TABLE migration
- _update_stage_bb_phase() helper called from the badblocks parser
on every tick (when phase or percent changes)
- /api/v1/drives/{id}/drawer SELECT now returns the new fields
Frontend (app.js + app.css):
- _drawerRenderBadblocksMeters(phase, phasePct) computes per-pattern
fill state and emits 4-meter HTML with W/V sub-labels
- Conditional render: only shows when stage_name === 'surface_validate'
AND bb_phase is set, so historical pre-1.0.0-44 stage rows render
unchanged (single percent, no meters)
3 new tests cover the migration columns, single-tick persistence,
and overwrite-on-second-tick. Total suite: 75 tests.
Image rebuilt and tagged but NOT deployed — 4 burn-ins are running
right now and a recreate would SIGHUP them. Deploy with
`docker compose up -d` after the current batch finishes; the
migration runs at init and the meters light up for the next batch.
`badblocks -w` cycles through 4 patterns (0xaa, 0x55, 0xff, 0x00),
each with a write phase + a verify phase = 8 phases. The output's
"XX% done" lines are per-phase, so the dashboard appeared to "rewind"
every ~2 hours. Two drives racing each other could look 4× apart in
displayed progress despite identical hardware — actually one was
just further into a later phase.
New _BadblocksProgress state machine watches for "Testing with
pattern 0xXX" and "Reading and comparing" headers, advances the
phase counter, and reports overall = ((phase-1) * 100 + phase_pct) / 8
clipped to 99. Pure state machine, no I/O.
7 new tests cover phase-header detection, boundary math, monotonicity
across a synthetic stream, and the original "two drives at same
per-phase % look identical" bug.
Image rebuilt and tagged but NOT deployed to the running container —
4 surface-validate jobs are 20-95% through 14TB drives and a recreate
would SIGHUP the remote badblocks processes. Deploy with
`docker compose up -d` after the current batch finishes.
Extracts the badblocks shell-command construction into
_build_badblocks_cmd(devname) so it can be unit-tested without
spinning up an asyncssh connection. Behavior unchanged.
Three tests guard:
1. Defaults match disk-burnin.sh recommendation (-b 4096 -c 64 -p 1)
2. Operator-set tunables actually propagate to the command
3. The PID-capture wrapper (sh -c 'echo PID:\$\$; exec ...') stays
intact — without it, cancel cannot kill the remote process
because asyncssh's signal channel is silently ignored by sshd.
Five files needed annotation tweaks to clear the 14 outstanding
mypy errors, all cosmetic (zero runtime bugs):
- settings_store._coerce: return Any (concrete type depends on key,
no narrowing path mypy can follow from the dict lookup)
- retention._state: explicit dict[str, str | None] init
- mailer: explicit `server: smtplib.SMTP` binding so SMTP_SSL and
SMTP both narrow to the parent class for shared call sites
- burnin/stages.py: TypedDict for the badblocks result dict so
`result["bad_blocks"]` narrows to int at the comparison site
scripts/security-scan.sh: mypy now counted in TOTAL_EXIT and
findings.log line. Comment updated to reflect gating status.
Three LOW-severity findings from Codex's audit of the post-split
package, all small mechanical cleanups:
#1 routes.py:848 read burnin.UNLOCK_TTL_SECONDS — a snapshot alias
bound at import time. After a test (or runtime) monkey-patches
app.burnin.unlock.UNLOCK_TTL_SECONDS the API response would
advertise the OLD value while grant_pool_unlock used the new one.
Now reads burnin.unlock.UNLOCK_TTL_SECONDS directly so the API
stays in sync with whatever the actual source-of-truth is.
#2 _stage_surface_validate_ssh() carried dead extraction scaffolding
from when the badblocks logic was first inlined into burnin.py:
_is_cancelled_sync (sync wrapper that does run_until_complete in
a coroutine — would deadlock if ever called), last_logged_pct,
on_progress, accumulated_lines, on_progress_async — none on any
control-flow path. Plus result["output"] which was set but never
read. All deleted; the inline _drain coroutines below already
handle progress/log throttling correctly.
#3 The new module boundaries were leaking — root orchestration
mutated _remote_pids and _unlock_grants directly even though
kill.clear_remote_pid() and unlock.invalidate_grant() existed.
Now using the helpers, so a future change to the storage shape
only requires editing the owning module.
Bonus from Codex's check note: _get_client() now asserts
burnin._client is not None with a clear message instead of relying
on an obscure NoneType AttributeError if a stage is somehow called
before init().
Verified: 44/44 tests pass; container boots clean; /health 200.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Continues the staged burnin.py module split started in 1.0.0-30.
Two more clean extractions; orchestration (init, _run_job,
start_job, cancel_job, check_stuck_jobs, semaphore) intentionally
stays in __init__.py for now to avoid threading the TrueNASClient
through cross-module setters.
* app/burnin/_common.py — shared helpers with no upward deps:
STAGE_ORDER + _STAGE_BASE_WEIGHTS + POLL_INTERVAL constants;
_now / _db connection helper; _is_cancelled, _start_stage,
_finish_stage, _cancel_stage, _set_stage_error, _update_stage_*,
_append_stage_log, _store_smart_*, _recalculate_progress; SSE
_push_update. Imports nothing from sibling burnin modules.
* app/burnin/stages.py — every per-stage implementation moved
verbatim: _stage_precheck, _stage_smart_test +
_stage_smart_test_api / _ssh, _stage_surface_validate +
_surface_validate_nvme / _ssh / _truenas, _stage_timed_simulate,
_stage_final_check, plus _badblocks_available, _nvme_cli_available,
and _dispatch_stage. Pulls the shared helpers from _common,
remote-PID setters from kill, and the live TrueNASClient via a
lazy `_get_client()` helper that defers `from app import burnin`
until call time so we don't trip a circular import.
* __init__.py shrank from ~1480 LoC to ~600. Re-exports every
public name (start_job, cancel_job, init, check_stuck_jobs,
PoolMemberError, UNLOCK_TTL_SECONDS, etc.) so external callers
in routes.py / mailer.py / poller.py see the same surface.
State that didn't move: _semaphore, _client, _active_tasks remain
on the package root (with a runtime _client reference from routes.py
preserved). _run_job and start_job still live in __init__.py — full
task.py extraction would require giving stages access to _client
through a setter rather than the lazy lookup, deferred to a future
slice.
Verification: 44/44 unit tests pass in container; /health 200;
container boots clean. No public API change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First slice of the planned tech-debt cleanup. burnin.py was 1667 lines
and growing; staged extraction gives smaller diffs to review and a
clear bisect target if anything regresses.
Mechanical move only — no behaviour change. The two extracted modules:
* app/burnin/unlock.py — _UnlockGrant, _unlock_grants, PoolMemberError,
is_unlocked / unlock_expiry / grant_pool_unlock, plus the four
*_TOKEN constants and UNLOCK_TTL_SECONDS. Owns its module-level
state; opens its own DB connection in grant_pool_unlock so it
doesn't depend on the parent package's _db() helper.
* app/burnin/kill.py — _remote_pids dict and the kill_remote_process /
set_remote_pid / clear_remote_pid / get_remote_pid helpers. Pulled
out of __init__.py so the asyncssh-ignores-signals workaround lives
next to the state it operates on.
app/burnin/__init__.py re-exports every public symbol the rest of the
app imports — `from app import burnin; burnin.start_job(...)`,
`burnin.PoolMemberError`, `burnin.UNLOCK_TTL_SECONDS`, etc. all keep
working unchanged. Internal aliases `_remote_pids` and `_unlock_grants`
on the package root point at the SAME dict objects in the submodules,
so existing in-package mutations (set in stages, cleared in cleanup
callbacks) work without rewrite.
Test fix: tests/test_unlock_flow.py:test_expired_grant_returns_false
monkey-patches UNLOCK_TTL_SECONDS. The package-root alias is bound at
import time and won't propagate back to the submodule's read site, so
the test now patches `app.burnin.unlock.UNLOCK_TTL_SECONDS` directly.
Verification: 44/44 unit tests pass in container; /health 200;
container boots clean. routes.py, mailer.py, poller.py untouched —
the public API is identical.
Future: extract stages, task, _common in subsequent versions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>