nas-burnin

Author	SHA1	Message	Date
Brandon Walter	129f233e0a	fix: stdbuf -oL on the tr pipe (1.0.0-58) 1.0.0-57's tr-pipe fix delivered \n-terminated progress lines but tr's stdout is block-buffered (4 KB chunks) when its destination is a pipe — and the SSH channel is a pipe. At ~50 bytes per badblocks progress line, that means ~80 lines accumulate (~6 minutes at our throughput) before tr flushes anything. stdbuf -oL forces tr's stdout to line-buffered mode. Each \n now triggers a flush. Progress lines reach asyncssh as they happen.	2026-05-13 10:29:03 -07:00
Brandon Walter	7c3873dd5e	fix: translate badblocks \b → \n at shell level (1.0.0-57) The chunk-read drain in 1.0.0-55 was supposed to handle badblocks's \b-overwrite progress format but silently never surfaced data — DB bb_phase_pct stayed at 0, log_text stayed at 136 bytes for 26+ hours of running burn-ins. Asyncssh stream.read(4096) behavior on this combination of badblocks output + pipe characteristics wasn't doing what I expected, and gather(return_exceptions=True) swallowed any exception silently. Fix: pipe the badblocks output through `tr '\b' '\n'` at the SHELL level on TrueNAS, before it reaches asyncssh. Every progress update is now a real newline-terminated line by the time we receive it. This also lets us revert to the simpler `async for raw in stream:` drain we had pre-1.0.0-55 — which was proven to work (it caught the PID line and phase-transition headers, just not mid-phase progress). Plus consolidate: 2>&1 merges stderr into stdout before tr, so we only need ONE drain coroutine, not two. Single throttle gate preserved. Recovery: after deploy, the 4 jobs that have been stuck in pipe_w for 26h were autonomously reset via inline SQL and relaunched via POST /api/v1/burnin/start (loopback bypass from 1.0.0-56 made this possible without a session cookie).	2026-05-13 10:26:06 -07:00
Brandon Walter	f71ae341f5	fix: backport stages.py \b-parser fix + drawer-finish inline (uncommitted from 1.0.0-55) The chunk-read parser fix that ships as part of 1.0.0-55 in the running container was scp'd to maple but never reached git. Same for the drawer-job-finish margin-left removal (request: pill sits inline next to operator/date, not flush right). Reconciling source with deployed state. No new behaviour — git now matches what's been live on maple since 1.0.0-55.	2026-05-12 07:53:33 -07:00
Brandon Walter	71eac9cba0	feat: loopback auth bypass for autonomous monitor (1.0.0-56) The autonomous burn-in monitor can't hit /api/v1/burnin/start without a session cookie. Provisioning one externally is fragile. Add a targeted loopback bypass: requests from 127.0.0.1 / ::1 skip the auth gate and get a synthetic admin User for audit attribution. Why it's safe: - The only way to reach the app from 127.0.0.1 is a process in the container's network namespace (docker exec from the host). Anyone with that already has rm -rf access to /data, so the bypass doesn't widen the attack surface. - External traffic via NPM/Authelia arrives with the docker bridge gateway IP as source — NOT loopback — so it keeps going through full auth. - request.client.host is the raw TCP socket source, NOT X-Forwarded-For, so external attackers can't spoof loopback via headers. The new auth.LoopbackUser() is a tiny factory (id=0, is_admin=True, username="monitor"). Audit events from this caller will show operator='monitor' so they're distinguishable from human admins. Staged in source; lands at next rebuild. Authorized by user ("It's a blank NAS machine. I don't care about any drive getting wiped out.").	2026-05-12 07:52:20 -07:00
Brandon Walter	149f2901b7	fix: throttle ALL drain-loop DB calls + drop progress noise from log (1.0.0-54) 1.0.0-52 throttled the percent/bb_phase writes but missed: - `_is_cancelled` ran a DB query on EVERY stderr line (sub-second cadence × 4 concurrent burn-ins = ~10+ DB connection opens/s) - `_append_stage_log` ran every 20 output_lines (~once per second) doing a quadratic `log_text || ?` concat that gets multi-MB rewrites as the log grows - `_recalculate_progress` + `_push_update` also fired per gated tick Cumulative load kept the asyncssh drain coroutine too busy to consume the SSH channel buffer; SSH window stalled; sshd stopped reading the pipe; badblocks blocked on pipe_write with state=S wchan=pipe_write. /sys/block sectors_written delta confirmed 0 disk I/O across all running drives despite 23h elapsed. Fix: 1. Single throttle gate (BB_DB_MIN_SECONDS=5s) covers EVERY DB touch in the drain — cancel check, percent/phase/bb_count updates, throughput sample, log flush, recalc, SSE push. Phase transitions still bypass the throttle (rare + important). 2. Exclude "XX% done" lines from the log entirely. They were the dominant volume; meaningful content (pattern headers, errors, bad-block numbers) still gets logged via the throttled flush. 3. log_text concat still quadratic but the volume reduction makes it tractable — buffer to pending_log_chunks, flush on the gate. Net effect: ~99% reduction in drain-loop DB load. asyncssh drain keeps up; pipe drains; badblocks writes; disk goes brr.	2026-05-11 22:07:39 -07:00
Brandon Walter	c906ab15f7	feat: job-level Est. completion in drawer header (1.0.0-53) The drawer's per-stage Finish chip is the stage's finish, not the whole burn-in's. Added a right-aligned "Est. completion" pill in the drawer-job-header that uses the server-weighted burnin.percent to extrapolate the whole job's finish time (covers precheck + SMART + surface_validate + final_check). Suppressed under 0.5% job progress to avoid the early-sample overshoot we saw earlier ("Finish: Sep 22" on a fresh start). Bind-mount only (templates + static); no rebuild needed. Running container reports 1.0.0-52 until next rebuild; this commit just catches the source version up.	2026-05-10 22:45:04 -07:00
Brandon Walter	c5a41d0260	fix: throttle badblocks parser DB writes (1.0.0-52) User reported sdb showing 134-day ETA. Investigation: badblocks processes all stuck in pipe_write wchan, iostat showing 0 throughput across all drives despite badblocks supposedly running. Root cause: each progress line was triggering 4-6 DB transactions (_update_stage_percent, _update_stage_bb_phase, _update_stage_bad_blocks, _update_stage_bb_mbps, _record_bb_phase_start, _recalculate_progress). With 4 concurrent burn-ins and sub-second progress lines, the asyncssh _drain couldn't keep up. Drain fell behind → asyncssh channel buffer filled → SSH window stopped advancing → sshd stopped reading from badblocks's stdout pipe → pipe filled → badblocks blocked on pipe_write() → no more disk I/O. That regression came in across 1.0.0-44 → 1.0.0-47 as I added each new persisted field. The previous per-line write path worked when there was only one DB call; it doesn't with five-plus. Fix: BB_DB_MIN_SECONDS=5 throttle on the DB-write path. The drain loop still consumes every progress line (so the pipe drains continuously), but commits to DB at most once every 5 seconds. Phase transitions always commit immediately (rare and important — they stamp bb_phase_history and advance the per-pattern meter). UI impact: minimal — drawer polls drives every ~12s anyway, so the displayed % was already at 12s resolution. The meter strip just won't sub-tick within a 5s window. DB load impact: 60-80x reduction during surface_validate.	2026-05-10 22:12:02 -07:00
Brandon Walter	2107981cf1	docs: drawer surface_validate + sorting + job states Documents the drawer enhancements landed across 1.0.0-44 → 1.0.0-51: - Job states section explains passed / failed / cancelled / unknown, including when 'unknown' fires (stuck-job timeout OR container restart cancelling the asyncio task). - Drive drawer section covers the new surface_validate visualization: vital-signs strip (Start / Elapsed / ETA / Finish / Temp), four per-pattern meters with split write/verify halves, phase caption, completed-pattern duration history. - Failure reason block describes the three-tier source resolution (stage error_text → job error_text → heuristic) and what shows up when none is available. - Column sorting describes the click-to-cycle behaviour and the localStorage persistence that survives SSE refreshes. Plus an explicit warning: don't `--build` while burn-ins are running (now classified `unknown` instead of `failed` — but still better to avoid the kill in the first place).	2026-05-09 15:34:12 -07:00
Brandon Walter	659f540270	fix: drop redundant stage suffix from Burn-In failed chip Per user: '(LONG SMART)' was redundant since the LONG SMART column already shows FAILED. Same for short SMART and surface_validate (the dominant case — the drawer shows per-stage Reason for digging). Suffix kept for precheck / final_check since those are rare enough that the hint is genuinely helpful.	2026-05-09 12:33:26 -07:00
Brandon Walter	1bc1b378ab	fix: cancel-mid-stage marks job 'unknown' not 'failed' (1.0.0-51) Container restarts (uvicorn shutdown / 'docker compose up -d') were silently classifying running burn-ins as 'failed' with empty error_text. Two reasons converged: 1. _stage_surface_validate_ssh caught asyncio.CancelledError at the stage level and returned False, swallowing the cancel signal. 2. _run_job's outer CancelledError handler then never fired, so was_cancelled stayed False and the job got marked 'failed' (the "burn-in itself failed" classification) instead of 'unknown' (the honest "we don't know whether it would have passed"). Fix: - Stage now does best-effort kill of remote badblocks (shielded so loop shutdown doesn't interrupt the kill), appends an [ABORTED] marker to the log, and re-raises CancelledError. _execute_stages doesn't catch it (CancelledError is BaseException, not Exception in 3.8+) so it propagates up to _run_job. - _run_job's existing CancelledError handler now also reconciles any stage rows still recorded as 'running' by setting them to 'unknown' with a clear error_text: "Task cancelled mid-run — likely container restart or shutdown". The job's error_text gets the same message so the drawer's Reason block has something specific to display, instead of falling back to the heuristic. Future container restarts on running burn-ins will now show as yellow "UNKNOWN" with the explicit cancel reason, matching the existing behaviour of check_stuck_jobs() for stuck timeouts.	2026-05-09 12:32:46 -07:00
Brandon Walter	7f959e6f4c	feat: prominent failure-reason block + heuristic in drawer (1.0.0-50) When a stage ends in failed/cancelled/unknown the drawer now shows a coloured "Reason" pill at the top of that stage's section. Three sources, in order of preference: 1. stage.error_text (the canonical, when set) 2. job.error_text (backfilled in the drawer endpoint when stage's own is empty — catches orphan rows from hard crashes like the pre-busy-timeout DB-locked failures) 3. Heuristic: if log_text is tiny (<500 bytes, just the START banner) AND no real badblocks progress was recorded, label as "Stopped without recording an error — likely cause: SSH connection drop or container restart while this stage was running." This catches the fingerprint of a deploy-during-burn-in killing the SSH session. Otherwise: "No error message recorded." so there's never a blank where the operator expects to see why something broke. Red styling for failed, yellow for cancelled/unknown. Replaces the inline stage-error-line for terminal states; the existing stage-error-line still renders for non-terminal contexts.	2026-05-09 12:06:11 -07:00
Brandon Walter	28d046f42e	fix: SMART overlay shows terminal states + reconciles orphans (1.0.0-49) The Long SMART column showed "—" while the Burn-In column showed "FAILED (LONG SMART)" — clear contradiction. Two reasons: 1. The overlay query in _drives_helpers only fetched SMART stage data for burn-ins in ('running','queued') state. Failed/passed/ cancelled jobs got their stage data filtered out, so the SMART columns went blank when you most wanted to see them. Removed the state filter so all burn-ins overlay. 2. A pre-busy-timeout `database is locked` failure mode (sdj job 5 from Mar 2026) left long_smart stage rows recorded as state= 'running' even though the parent job ended in state='failed'. The overlay now translates that orphan state at render time: if the parent job is failed/cancelled/unknown but the stage is still 'running', display the stage as failed (or the parent's terminal state) so the column matches the Burn-In column. The translation is purely display-time; no DB writes. error_text falls back to the parent job's error_text when the stage's own is NULL, so the operator sees what actually broke.	2026-05-09 11:46:45 -07:00
Brandon Walter	f5c6b85402	feat: client-side column sorting with SSE re-apply (1.0.0-48) Clickable headers on Drive / Serial / Size / Temp / Health / Short SMART / Long SMART / Burn-In. Click cycles asc → desc → cleared, with a small ▲/▼ indicator next to the active column. Sort state lives in localStorage so it survives reload AND every SSE-driven tbody refresh (HTMX swaps `#drives-table-wrap` innerHTML on each `drives-update` event). The htmx:afterSwap hook re-applies the sort and re-paints indicators. Sortable values are emitted as data-sort-* attributes on each <tr>: - raw devname / serial / size_bytes / temperature_c - numeric priority maps for SMART health, SMART test states, and burn-in state (so "running" sorts ahead of "passed" regardless of alphabetical order) Empty values always sink to the bottom regardless of direction so "sort by temp asc" doesn't put a missing-temp drive on top.	2026-05-08 23:48:04 -07:00
Brandon Walter	383258df97	feat: phase caption + bad-block badge + per-pattern history (1.0.0-47) Three additions to the surface_validate drawer block: 1. Phase caption below the meters: "Pattern 2 of 4 · Verify 0x55 · 47% within phase". Pure JS — no schema change. Makes the visual grammar explicit without needing the operator to mentally map phase=4 to "verifying pattern 2". 2. Bad-block badge in the vitals row. Green at 0, red at >0. The number was already on the stage row but burying it in the log felt wrong — surfacing it next to temp/speed/ETA keeps it in eye-line during long runs. 3. Per-pattern duration history below the caption. New bb_phase_history JSON column (idempotent migration) maps {phase_num: ts}. Parser stamps the timestamp on every phase transition (and on stage entry for phase 1). Drawer diffs consecutive write-phase starts to derive "0xaa: 14h 22m" for completed patterns. Once one pattern is done you can predict the rest without leaving the drawer. Persistence is idempotent: re-entry of the same phase keeps the original timestamp so a transient parser reset doesn't blow away history. JSON parse failures fail gracefully (no row rendered).	2026-05-08 23:23:02 -07:00
Brandon Walter	6b2367b892	feat: vital-signs strip above per-pattern meters (1.0.0-46) The drawer's surface_validate area now leads with a row of operator vitals computed from data already in the response: - Temp: drive temperature with cool/warm/hot colour (≥48 red, ≥42 yellow) - Speed: live MB/s, NULL until second progress sample arrives - Elapsed: time since stage started_at - ETA: extrapolated from overall progress; suppressed under 0.5% to avoid the "47 days remaining" artefact early in pattern 1 Live MB/s comes from a new bb_mbps column on burnin_stages, computed in the badblocks parser as (delta_overall_pct / 800) * size_bytes / dt. Skipped on phase transitions (per-phase pct resets) and sub-second samples (noisy). Drawer endpoint now passes drive.temperature_c through; JS stashes the latest drive object in _DRAWER_LAST_DRIVE so the burn-in renderer can pull it for the vitals row without changing call signatures. Tightened table CSS in this same session is unrelated and shipped already in earlier rounds via the bind-mounted app.css.	2026-05-08 23:13:58 -07:00
Brandon Walter	1393ba0bc8	fix: seed bb_phase=1,pct=0 at surface_validate start (1.0.0-45) Previously the parser only wrote bb_phase to the DB when state changed — so for the first several minutes of a 14 TB burn-in (before badblocks emits its first 'X% done' line), bb_phase stayed NULL and the drawer's per-pattern meters didn't render at all. Looked broken to operators. Now we write phase=1, phase_pct=0 immediately on stage entry. The parser keeps overwriting on every subsequent tick. Drawer shows empty meters with 0xaa label highlighted blue from t=0.	2026-05-08 22:45:45 -07:00
Brandon Walter	30062affc2	feat: per-pattern badblocks meters in drive drawer (1.0.0-44) User asked for one meter per badblocks pattern. The drawer now shows 4 meters (one per pattern: 0xaa / 0x55 / 0xff / 0x00), each split into write (left, blue) + verify (right, green) halves so a glance shows both which pattern is current AND whether you're writing or verifying within it. Backend: - New columns burnin_stages.bb_phase (1-8) + bb_phase_pct (0-100) via idempotent ALTER TABLE migration - _update_stage_bb_phase() helper called from the badblocks parser on every tick (when phase or percent changes) - /api/v1/drives/{id}/drawer SELECT now returns the new fields Frontend (app.js + app.css): - _drawerRenderBadblocksMeters(phase, phasePct) computes per-pattern fill state and emits 4-meter HTML with W/V sub-labels - Conditional render: only shows when stage_name === 'surface_validate' AND bb_phase is set, so historical pre-1.0.0-44 stage rows render unchanged (single percent, no meters) 3 new tests cover the migration columns, single-tick persistence, and overwrite-on-second-tick. Total suite: 75 tests. Image rebuilt and tagged but NOT deployed — 4 burn-ins are running right now and a recreate would SIGHUP them. Deploy with `docker compose up -d` after the current batch finishes; the migration runs at init and the meters light up for the next batch.	2026-05-08 22:34:35 -07:00
Brandon Walter	4922b19a9f	fix: stuck_job_hours default 24 → 168 (7 days) (1.0.0-43) A user with 4× 14 TB WD HDDs running -w surface_validate had all 4 jobs marked 'unknown' at exactly 24h+1min — the stuck-job detector firing on legitimate work because 14 TB at 8192-block badblocks needs ~5+ days to complete all 4 patterns × 2 phases. 168h covers a full -w pass on 14 TB+ HDDs with margin. Anyone running short SSDs who wants faster detection can drop the value in Settings → Burn-in. README warning replaced — no longer instructs users to bump the threshold before starting big-drive burn-ins, since the default now handles that case. Settings UI already accepts up to 168 via the input's max=168 attribute, so no template change needed.	2026-05-08 13:23:05 -07:00
Brandon Walter	b406e3f315	fix: badblocks progress tracks overall %, not per-phase (1.0.0-42) `badblocks -w` cycles through 4 patterns (0xaa, 0x55, 0xff, 0x00), each with a write phase + a verify phase = 8 phases. The output's "XX% done" lines are per-phase, so the dashboard appeared to "rewind" every ~2 hours. Two drives racing each other could look 4× apart in displayed progress despite identical hardware — actually one was just further into a later phase. New _BadblocksProgress state machine watches for "Testing with pattern 0xXX" and "Reading and comparing" headers, advances the phase counter, and reports overall = ((phase-1) * 100 + phase_pct) / 8 clipped to 99. Pure state machine, no I/O. 7 new tests cover phase-header detection, boundary math, monotonicity across a synthetic stream, and the original "two drives at same per-phase % look identical" bug. Image rebuilt and tagged but NOT deployed to the running container — 4 surface-validate jobs are 20-95% through 14TB drives and a recreate would SIGHUP the remote badblocks processes. Deploy with `docker compose up -d` after the current batch finishes.	2026-05-05 07:26:23 -07:00
Brandon Walter	775251b993	docs: refresh README test count + run-tests.sh pointer Test suite has grown from 44 → 65 since this line was last touched (routes resolution, badblocks tunables, rate limiter, lifecycle). Also points readers at scripts/run-tests.sh for the in-container path.	2026-05-05 06:19:17 -07:00
Brandon Walter	8ae84862de	infra: rename truenas-burnin → nas-burnin (1.0.0-41) Matches the 1.0.0-38 product display rename. Touches every infrastructure identifier: - container_name: truenas-burnin → nas-burnin - forge URL in /api/v1/updates/check - security-scan: REPO_URL, REPO, DEPLOY_DIR, systemd unit description - run-tests.sh default container name - doc paths in README/SPEC/CLAUDE - in-app instruction strings (login.html, settings.html, auth_cli.py) Maple migration done in lockstep: docker compose down (truenas-burnin) mv ~/docker/stacks/{truenas-burnin,nas-burnin} systemd unit ExecStart updated + daemon-reload docker compose up -d --build → container nas-burnin Old image truenas-burnin-app removed (~12 GB reclaimed) Stale top-level orphans cleaned (config.py, poller.py, routes.py, truenas.py, tests/) — all dead since pre-split refactors Forge repo rename (git.hellocomputer.xyz/brandon/truenas-burnin → nas-burnin) is a separate UI-only step. Forgejo redirects the old URL after rename, so this commit can be pushed to the existing remote first; remote URL gets updated locally once you rename.	2026-05-04 07:16:02 -07:00
Brandon Walter	d38807f957	test: cover Spearfoot tunables in badblocks command Extracts the badblocks shell-command construction into _build_badblocks_cmd(devname) so it can be unit-tested without spinning up an asyncssh connection. Behavior unchanged. Three tests guard: 1. Defaults match disk-burnin.sh recommendation (-b 4096 -c 64 -p 1) 2. Operator-set tunables actually propagate to the command 3. The PID-capture wrapper (sh -c 'echo PID:\$\$; exec ...') stays intact — without it, cancel cannot kill the remote process because asyncssh's signal channel is silently ignored by sshd.	2026-05-03 21:24:10 -07:00
Brandon Walter	7cd66d460f	fix: annotate to mypy-clean + promote to gating (1.0.0-40) Five files needed annotation tweaks to clear the 14 outstanding mypy errors, all cosmetic (zero runtime bugs): - settings_store._coerce: return Any (concrete type depends on key, no narrowing path mypy can follow from the dict lookup) - retention._state: explicit dict[str, str | None] init - mailer: explicit `server: smtplib.SMTP` binding so SMTP_SSL and SMTP both narrow to the parent class for shared call sites - burnin/stages.py: TypedDict for the badblocks result dict so `result["bad_blocks"]` narrows to int at the comparison site scripts/security-scan.sh: mypy now counted in TOTAL_EXIT and findings.log line. Comment updated to reflect gating status.	2026-05-03 21:21:55 -07:00
Brandon Walter	cd92a4d3c8	chore: dev-experience + mypy noise cleanup - scripts/run-tests.sh — one-shot wrapper for the tar+docker-cp dance that was being done by hand every test run. Optional pattern arg for a single module. Cleans tests/ out of the container after. - scripts/security-scan.sh — mount the deploy app/ at /opt/app/app (not /src) so internal `from . import X` resolves through the `app` package and stops producing spurious "Module 'src' has no attribute X" errors that masked real findings. - app/truenas.py — explicit `raise RuntimeError("unreachable")` after the retry loop. Functionally a no-op (loop always returns or re-raises), but makes the post-loop control flow obvious to readers and silences the mypy missing-return false positive. mypy stays informational. Down to 14 real findings after these fixes — promoting to gating still needs settings_store + retention typing work, which is its own pass.	2026-05-03 21:11:23 -07:00
Brandon Walter	0ebc325746	docs: rename to NAS Burn-In + version bump in spec/context Catches the README, SPEC, and CLAUDE.md that were missed in the 1.0.0-38 product rename. Infrastructure identifiers (paths, container, repo URL) deliberately stay as truenas-burnin. Also refreshes SPEC.md version (1.0.0-8 → 1.0.0-39) and CLAUDE.md last-updated stamp (1.0.0-12 → 1.0.0-39).	2026-05-03 18:53:33 -05:00
Brandon Walter	8033161efb	fix: address Codex routes-split follow-up review (1.0.0-39) Three low-severity findings from Codex on the 1.0.0-37 split: 1. Trim dead package-level imports in routes/__init__.py — only `poller` was actually used; auth/burnin/mailer/settings_store were the exact shadowing footgun the absolute sub-router imports work around. Reword the comment block to match. 2. Thread `operator` through smart_start + smart_cancel. Previously the JS client sent it but the server ignored it; add audit_events rows (smart_test_start / smart_test_cancel) so the field is actually meaningful. 3. New tests/test_routes_resolution.py — guards two historical regressions: /api/v1/burnin/export.csv must register before /{job_id} (FastAPI int-coerce 422 trap) and the mailer back-compat shim `from app.routes import _fetch_drives_for_template` must keep importing. Plus a sub-router enumeration test that catches missed include_router calls in future splits.	2026-05-03 15:04:38 -05:00
Brandon Walter	a8a7d99621	rename: TrueNAS Burn-In → NAS Burn-In (1.0.0-38) Product display name only — page titles, headers, email, browser notification, FastAPI app title. Repo, container_name, file paths, and infrastructure identifiers (truenas-burnin everywhere) stay put to avoid breaking deployment.	2026-05-03 14:01:40 -04:00
Brandon Walter	40dac9090d	refactor: extract drives + burnin routes (1.0.0-37) Largest routes/ slice yet — drives.py (8 endpoints) and burnin.py (4 endpoints). Drives helpers live in _drives_helpers.py so the dashboard SSE handler in routes/__init__.py and mailer.py can both keep using them via re-export. routes/__init__.py shrinks from 815 → 163 LoC; only the dashboard / and /sse/drives stream remain there. Routes split is now functionally complete: 12 files, ~1800 LoC distributed by feature.	2026-05-03 09:59:15 -04:00
Brandon Walter	fc7fb4c714	refactor: extract settings routes (1.0.0-36) Pulls /settings + /api/v1/settings* + /api/v1/settings/redacted + /test-smtp + /test-ssh into routes/settings.py (155 LoC). All five endpoints share the admin gate from auth.require_admin and the secret_status / SECRET_FIELDS helpers, so the boundary is clean. routes/__init__.py shrank from 960 -> 815 LoC. Cleanup bonus: dropped an orphan "# Print view (must be BEFORE /{job_id} int route)" comment that referenced the print-view endpoint already extracted to history.py. Verification: 59/59 tests pass; /settings 401 (auth-gated as expected); /login still 200; container boots clean at 1.0.0-36. Remaining slices: routes/burnin.py (start + cancel + export.csv + {job_id}) and routes/drives.py (the biggest, with the unlock route that's currently interleaved between the burnin endpoints in __init__.py; drives extraction unblocks burnin extraction). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 09:48:24 -04:00
Brandon Walter	3c39344069	refactor: extract history + audit + stats + report routes (1.0.0-35) Continues the routes/ package split: four more clean extractions, all following the same absolute-import pattern documented in the 1.0.0-34 gotcha note. * routes/history.py (184 LoC): /history, /history/{id}, and the /history/{id}/print view that MUST register before the {id} int route to avoid FastAPI's int("print") 422. Helpers _PAGE_SIZE, _ALL_STATES, _HISTORY_QUERY, _state_where moved with the endpoints. B608 nosec annotated on the count_sql f-string (it's two hardcoded literals; user input goes through bound params). * routes/audit.py (53 LoC): /audit page only. Owns _AUDIT_QUERY + _AUDIT_EVENT_COLORS. * routes/stats.py (111 LoC): /stats analytics page. Pure aggregation queries against burnin_jobs/drives, no shared helpers beyond stale_context. * routes/report.py (24 LoC): POST /api/v1/report/send. Now requires admin (was open to any authenticated user; sending mail is a side effect non-admins shouldn't be able to fire, the same principle as the settings mutation gates added in 1.0.0-28). routes/__init__.py shrank from 1261 -> 960 LoC. Remaining work: drives, burnin, settings, dashboard, all following the same pattern. Each future slice will use the `import app.routes.X as _Y` absolute-import gotcha workaround from 1.0.0-34. Verification: 59/59 tests pass; /login 200 (public); /history /audit /stats 401 (correctly auth-gated by middleware); container boots clean at 1.0.0-35. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 09:44:22 -04:00
Brandon Walter	aa7822d6ce	feat: rate limiter + mypy + lifecycle tests + routes/ split (1.0.0-33/-34) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Security scan / mypy (push) Waiting to run Details Closes the four remaining items from the post-Codex hardening list. #1 Rate-limit unlock + change-password endpoints (1.0.0-33) * Generalised the existing login limiter into a reusable `_RateLimiter` class in app/auth.py. Atomic check-then-increment in synchronous code so a parallel asyncio burst can't slip past the threshold. * `unlock_limiter` (5 attempts in 10 min → 10 min lockout) gates POST /api/v1/drives/{id}/unlock per-drive AND per-source-IP. * `pwchange_limiter` (5 in 10 min → 15 min lockout) gates POST /api/v1/auth/change-password per-user AND per-IP. * Both clear on successful operation. The login limiter keeps its existing `register_login_attempt` / `clear_login_failures` facade names so external callers don't change. #3 mypy in security-scan (1.0.0-33) * Added a 4th tool to the daily scan + forge workflow. Runs in a throwaway python:3.12-slim container against the deploy dir, exit code is informational only (NOT included in the `TOTAL_EXIT` failure sum). Findings land in ~/security-scans/scan-YYYY-MM-DD/mypy.txt for ratchet-down work over time. * Forge job uses `continue-on-error: true` so it doesn't fail the workflow until the type-debt baseline is annotated down. #4 Lifecycle test coverage (1.0.0-33) * New tests/test_lifecycle.py with 15 cases: - TestCommonHelpers (7 tests): _start_stage, _finish_stage success/failure/error-preservation, _recalculate_progress weighted math, _is_cancelled, _append_stage_log. - TestStartCancelJob (4 tests): start_job inserts queued row + correct stage list, duplicate-active rejection, cancel marks state, cancel returns False on terminal-state jobs. 
- TestRateLimiter (4 tests): under-threshold ok, trips at threshold, clear removes both counter + lockout, separate keys don't interfere. * Total goes from 44 to 59 tests; closes the orchestration-path coverage gap Codex flagged. #2 Partial routes.py split (1.0.0-34) * routes.py → routes/ package. Same staged-extraction pattern as the burnin.py split. * routes/auth.py — login/logout/setup/change-password (170 LoC). * routes/system.py — /health, /ws/terminal, /api/v1/updates/check (136 LoC). * routes/_helpers.py — shared utilities used by both extracted modules and the still-monolithic remainder: client_ip, operator_for, is_stale, stale_context, secret_status, SECRET_FIELDS (97 LoC). * routes/__init__.py shrank from 1568 LoC to 1261. Future slices can extract drives, burnin, history, settings the same way. * GOTCHA recorded in commit body: `from app import auth` at the top of __init__.py binds `auth` as an attribute on the package namespace, so `from . import auth as _auth_routes` finds the OUTER module and yields `app.auth` instead of the submodule. Fix is `import app.routes.auth as _auth_routes` (absolute). This bit me once at deploy time; container failed to start with `module 'app.auth' has no attribute 'router'`. Verification: 59/59 tests pass (44 existing + 15 new); container boots clean at 1.0.0-34; /health 200 with all checks green; security scan still clean (mypy informational findings ignored from totals). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 09:29:53 -04:00
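The atomic check-then-increment limiter described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in app/auth.py: the class name `RateLimiter`, the method names, and the choice that the attempt reaching the threshold is the one that trips the lockout are all assumptions.

```python
import time


class RateLimiter:
    """Hypothetical stand-in for the `_RateLimiter` class named in the commit."""

    def __init__(self, max_attempts, window_s, lockout_s):
        self.max_attempts = max_attempts
        self.window_s = window_s
        self.lockout_s = lockout_s
        self._attempts = {}      # key -> list of attempt timestamps
        self._locked_until = {}  # key -> lockout expiry timestamp

    def register_attempt(self, key, now=None):
        """Record one attempt; return True if allowed, False if locked out.

        There is no `await` in this method, so under asyncio no other task
        can run between the check and the increment.
        """
        now = time.monotonic() if now is None else now
        if self._locked_until.get(key, 0.0) > now:
            return False
        # Keep only attempts that are still inside the sliding window.
        recent = [t for t in self._attempts.get(key, []) if now - t < self.window_s]
        recent.append(now)
        self._attempts[key] = recent
        if len(recent) >= self.max_attempts:
            self._locked_until[key] = now + self.lockout_s
            return False
        return True

    def clear(self, key):
        """Drop both the counter and any active lockout (successful operation)."""
        self._attempts.pop(key, None)
        self._locked_until.pop(key, None)
```

A caller that gates per-drive AND per-source-IP would presumably register two keys per request (for example `"drive:7"` and `"ip:10.0.0.5"`, both made-up key formats) and reject the request if either call returns False.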
Brandon Walter	eb2a964171	fix: address Codex review of burnin package split (1.0.0-32) Three LOW-severity findings from Codex's audit of the post-split package, all small mechanical cleanups: #1 routes.py:848 read burnin.UNLOCK_TTL_SECONDS, a snapshot alias bound at import time. After a test (or runtime) monkey-patches app.burnin.unlock.UNLOCK_TTL_SECONDS the API response would advertise the OLD value while grant_pool_unlock used the new one. Now reads burnin.unlock.UNLOCK_TTL_SECONDS directly so the API stays in sync with whatever the actual source-of-truth is. #2 _stage_surface_validate_ssh() carried dead extraction scaffolding from when the badblocks logic was first inlined into burnin.py: _is_cancelled_sync (sync wrapper that does run_until_complete in a coroutine, which would deadlock if ever called), last_logged_pct, on_progress, accumulated_lines, on_progress_async; none on any control-flow path. Plus result["output"] which was set but never read. All deleted; the inline _drain coroutines below already handle progress/log throttling correctly. #3 The new module boundaries were leaking: root orchestration mutated _remote_pids and _unlock_grants directly even though kill.clear_remote_pid() and unlock.invalidate_grant() existed. Now using the helpers, so a future change to the storage shape only requires editing the owning module. Bonus from Codex's check note: _get_client() now asserts burnin._client is not None with a clear message instead of relying on an obscure NoneType AttributeError if a stage is somehow called before init(). Verified: 44/44 tests pass; container boots clean; /health 200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 01:35:07 -04:00
Brandon Walter	19c2c0dc0f	refactor: extract _common.py + stages.py from burnin (1.0.0-31) Continues the staged burnin.py module split started in 1.0.0-30. Two more clean extractions; orchestration (init, _run_job, start_job, cancel_job, check_stuck_jobs, semaphore) intentionally stays in __init__.py for now to avoid threading the TrueNASClient through cross-module setters. * app/burnin/_common.py — shared helpers with no upward deps: STAGE_ORDER + _STAGE_BASE_WEIGHTS + POLL_INTERVAL constants; _now / _db connection helper; _is_cancelled, _start_stage, _finish_stage, _cancel_stage, _set_stage_error, _update_stage_, _append_stage_log, _store_smart_, _recalculate_progress; SSE _push_update. Imports nothing from sibling burnin modules. * app/burnin/stages.py — every per-stage implementation moved verbatim: _stage_precheck, _stage_smart_test + _stage_smart_test_api / _ssh, _stage_surface_validate + _surface_validate_nvme / _ssh / _truenas, _stage_timed_simulate, _stage_final_check, plus _badblocks_available, _nvme_cli_available, and _dispatch_stage. Pulls the shared helpers from _common, remote-PID setters from kill, and the live TrueNASClient via a lazy `_get_client()` helper that defers `from app import burnin` until call time so we don't trip a circular import. * __init__.py shrank from ~1480 LoC to ~600. Re-exports every public name (start_job, cancel_job, init, check_stuck_jobs, PoolMemberError, UNLOCK_TTL_SECONDS, etc.) so external callers in routes.py / mailer.py / poller.py see the same surface. State that didn't move: _semaphore, _client, _active_tasks remain on the package root (with a runtime _client reference from routes.py preserved). _run_job and start_job still live in __init__.py — full task.py extraction would require giving stages access to _client through a setter rather than the lazy lookup, deferred to a future slice. Verification: 44/44 unit tests pass in container; /health 200; container boots clean. No public API change. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 01:18:04 -04:00
Brandon Walter	9cbae44495	refactor: split burnin.py into a package — extract unlock + kill (1.0.0-30) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details First slice of the planned tech-debt cleanup. burnin.py was 1667 lines and growing; staged extraction gives smaller diffs to review and a clear bisect target if anything regresses. Mechanical move only — no behaviour change. The two extracted modules: * app/burnin/unlock.py — _UnlockGrant, _unlock_grants, PoolMemberError, is_unlocked / unlock_expiry / grant_pool_unlock, plus the four _TOKEN constants and UNLOCK_TTL_SECONDS. Owns its module-level state; opens its own DB connection in grant_pool_unlock so it doesn't depend on the parent package's _db() helper. app/burnin/kill.py — _remote_pids dict and the kill_remote_process / set_remote_pid / clear_remote_pid / get_remote_pid helpers. Pulled out of __init__.py so the asyncssh-ignores-signals workaround lives next to the state it operates on. app/burnin/__init__.py re-exports every public symbol the rest of the app imports — `from app import burnin; burnin.start_job(...)`, `burnin.PoolMemberError`, `burnin.UNLOCK_TTL_SECONDS`, etc. all keep working unchanged. Internal aliases `_remote_pids` and `_unlock_grants` on the package root point at the SAME dict objects in the submodules, so existing in-package mutations (set in stages, cleared in cleanup callbacks) work without rewrite. Test fix: tests/test_unlock_flow.py:test_expired_grant_returns_false monkey-patches UNLOCK_TTL_SECONDS. The package-root alias is bound at import time and won't propagate back to the submodule's read site, so the test now patches `app.burnin.unlock.UNLOCK_TTL_SECONDS` directly. Verification: 44/44 unit tests pass in container; /health 200; container boots clean. routes.py, mailer.py, poller.py untouched — the public API is identical. 
Future: extract stages, task, _common in subsequent versions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 00:44:28 -04:00
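The difference this commit relies on, between a re-exported constant (a snapshot bound at import time) and a re-exported dict (the same object under two names), can be reproduced in isolation. The sketch below builds two throwaway in-memory modules; `demo_unlock`, `demo_burnin`, `GRANTS`, and `ttl_in_use` are illustrative names, not the project's.

```python
import types

# In-memory stand-ins (hypothetical names) for app.burnin.unlock and the
# app.burnin package root.
unlock = types.ModuleType("demo_unlock")
exec(
    "UNLOCK_TTL_SECONDS = 600\n"
    "GRANTS = {}\n"
    "def ttl_in_use():\n"
    "    return UNLOCK_TTL_SECONDS  # looked up in this module at call time\n",
    unlock.__dict__,
)

root = types.ModuleType("demo_burnin")
# What `from .unlock import UNLOCK_TTL_SECONDS, GRANTS` does in __init__.py:
root.UNLOCK_TTL_SECONDS = unlock.UNLOCK_TTL_SECONDS
root.GRANTS = unlock.GRANTS

# Rebinding the package-root alias does not reach the submodule's read site.
root.UNLOCK_TTL_SECONDS = 1
after_root_patch = unlock.ttl_in_use()

# Rebinding the name on the owning module does.
unlock.UNLOCK_TTL_SECONDS = 1
after_submodule_patch = unlock.ttl_in_use()

# Mutating the shared dict is visible through either name.
root.GRANTS["sda"] = "granted"
seen_by_submodule = dict(unlock.GRANTS)
```

Under these assumptions `after_root_patch` stays at 600 while `after_submodule_patch` becomes 1, which is why a test has to patch the submodule attribute, whereas in-package mutations of the aliased dicts keep working unchanged.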
Brandon Walter	6c20e57fd8	fix: live pool re-check before start_job + drop dead run_badblocks (1.0.0-29) Closes the last open Codex finding (#5) and removes one piece of dead code Codex flagged in passing. #5 (live pool re-check before burn-in start): Before this change, _is_unlocked compared the operator's unlock grant against the cached drives.pool_* row. If a drive was imported into a pool, mounted, or had ZFS labels written between the operator's unlock click and the next ~12s poll, burn-in could still start against the stale identity and silently destroy the new pool. start_job now calls a fresh ssh_client.fresh_pool_check_for_drive() immediately after the cached gate. That helper re-runs the three detection probes (zpool list -vHP / lsblk zfs_member / findmnt) over a fresh SSH session and returns the live answer for one devname. If it differs from cached state we invalidate any existing unlock grant and raise PoolMemberError with the FRESH pool name so the UI reflects current reality. If fresh shows free but cached said locked, the drive came back to free since the last poll: log it and allow. Cost: ~200ms per burn-in start. For batch starts of 12 drives, that's 2.4s extra latency, cheap against destroying a freshly-imported pool. Dead code removal: ssh_client.run_badblocks() has had no callers since 1.0.0-13 when the SSH badblocks logic was inlined into burnin._stage_surface_validate_ssh (with the asyncssh-signal-doesn't-actually-kill workaround). Removing the dead function also lets us drop the now-unused `from typing import Callable` import. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 21:29:11 -04:00
Brandon Walter	066fbbc403	fix: address Codex audit findings (1.0.0-28) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Addresses 12 of 13 findings from the Codex tech-debt + security review of versions 1.0.0-22 through 1.0.0-27. Item #5 (live pool re-check before start_job) deferred — would add an SSH round-trip per start. #1 Pool detection now treats zpool / lsblk / findmnt failures INDEPENDENTLY. Previously a single None blew away the whole map, so a host where lsblk lacks zfs_member info but zpool works would never lock pool members. Extended findmnt parser to recognise /dev/mapper/, /dev/dm-, /dev/md, /dev/da, /dev/ada* (LVM, devicemapper, MD RAID, FreeBSD CORE devnames). #2 Admin role enforced on every settings mutation. New auth.require_admin() helper applied to GET /settings, POST /api/v1/settings, /test-smtp, /test-ssh. Previously any authenticated user (the CLI explicitly supports non-admin accounts) could rewrite SMTP/SSH/API secrets. #3 First-user setup race closed. auth.create_user() now accepts bootstrap_only=True which wraps the existence check + insert in BEGIN IMMEDIATE so two concurrent /api/v1/auth/setup requests can't both create admin accounts during the bootstrap window. #4 Case-insensitive uniqueness enforced via new `uniq_users_username_nocase` index. Login does NOCASE lookup so without this `Admin` and `admin` could coexist as distinct rows. #6 New `session_cookie_secure` setting (default False for LAN/dev deploys, set True in production behind HTTPS) flips the session cookie's Secure flag. Defends against on-the-wire exposure when the dashboard is reachable over plain HTTP. #7 Audit trail bound to authenticated identity. Burn-in start / cancel / unlock / drive reset all now use `_operator_for(request)` which reads `request.state.current_user.full_name\|username` instead of the body's operator field. 
Logged-in users can no longer spoof attribution. Drive reset's literal-"operator" fallback (window._operator was never set) is also fixed by this. #8 Login rate-limit race fixed. New `register_login_attempt()` is atomic check-AND-increment in synchronous code (no awaits inside), so a parallel burst can't slip past the threshold. `record_login_failure()` removed; `clear_login_failures()` now also drops any active lockout for a successful auth. Pre-existing bug where `tripped` was always False (so user_login_locked_out audit events never fired) also fixed. #9 NVMe surface_validate post-format check now mirrors the SSH path: fails on FAILED health AND on real SMART attribute failures, soft-passes SSH-only failures (logged), surfaces warnings to the stage log without failing. #10 retention.backup_db() now writes to `.tmp` then atomic-renames into the canonical daily slot — an interrupted backup leaves the tmp behind but doesn't corrupt the real snapshot. Scheduler marks last_run_date only on (prune AND backup) success so a transient failure gets retried within the 03:00 hour. #11 /health DB probe now exercises the WRITE path via a temp-table INSERT/SELECT/COMMIT round-trip. Previously only read PRAGMA journal_mode + a row count, which silently passes on read-only mounts and broken-WAL conditions. #12 security-scan.sh now fails loudly if `git fetch` or `git reset --hard origin/main` errors (was `\|\| true`, scanning stale code silently). pip-audit now runs in a throwaway python:3.12-slim container against requirements.txt instead of `docker exec`-ing into the live truenas-burnin container — cleaner separation, no transient package install on prod. #13 Badblocks SSH stage no longer doubles its log_text. Previously appended every 20-line chunk during streaming AND the full accumulated output at end. Now only flushes the un-flushed tail (typically <20 lines). `result["output"]` stays in-memory only. 
Verification: all 44 unit tests pass in container; /health 200; security scan returns 0 findings; deployed maple build is green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 18:48:16 -04:00
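The write-to-`.tmp`-then-rename behaviour described under #10 might look roughly like the sketch below. It is an assumption-laden illustration: it uses Python's `sqlite3.Connection.backup` rather than whatever retention.backup_db() actually calls, and the signature `backup_db(src_path, dest_path)` is made up for the example.

```python
import os
import sqlite3


def backup_db(src_path, dest_path):
    """Snapshot src_path into dest_path via a temp file plus atomic rename."""
    tmp_path = dest_path + ".tmp"
    # A leftover tmp file from an interrupted run is discarded, never promoted.
    if os.path.exists(tmp_path):
        os.remove(tmp_path)
    src = sqlite3.connect(src_path)
    try:
        dst = sqlite3.connect(tmp_path)
        try:
            src.backup(dst)  # SQLite online backup API
        finally:
            dst.close()
    finally:
        src.close()
    # os.replace is atomic when both paths are on the same filesystem, so the
    # canonical slot only ever holds a complete snapshot.
    os.replace(tmp_path, dest_path)
```

If the process dies before `os.replace`, the previous snapshot at `dest_path` is untouched and only the `.tmp` file is left behind, which matches the failure mode the commit message describes.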
Brandon Walter	3a9bdc9e15	feat: CSP + security headers middleware + session-fixation defense (1.0.0-27) #6 (defense-in-depth security headers): * New _SecurityHeadersMiddleware emits five headers on every response: - Content-Security-Policy: tight default-src 'self', allow-list the three CDNs we actively load (unpkg for HTMX, cdnjs for QR codes, jsdelivr for xterm.js), plus 'unsafe-inline' for the inline script in settings.html and inline style in job_print.html. Tighten via nonces later if you want true CSP-level XSS protection. - X-Content-Type-Options: nosniff - Referrer-Policy: same-origin - X-Frame-Options: DENY (no clickjacking) - Permissions-Policy: camera/microphone/geolocation/interest-cohort all blocked * Middleware ordering: SecurityHeaders -> AuthGate -> Session, so headers go on EVERY response including 401/403/redirects. #7 (session-fixation defense): * request.session.clear() now runs BEFORE setting user_id/username on successful /login AND /api/v1/auth/setup. Discards any pre-login payload an attacker might have seeded the cookie with. Combined with SameSite=strict + the HMAC-signed Starlette session cookie, this closes the residual fixation surface. Verified: curl -sSI /login returns all five headers; container boots clean; /health 200; existing session for the operator continues to work because we only clear on the LOGIN flow itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 18:28:13 -04:00
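Why the outermost position matters for the header middleware can be shown with a dependency-free ASGI sketch. This is not the project's _SecurityHeadersMiddleware: it is a hand-written raw ASGI wrapper, it carries only the three headers whose exact values appear in the commit message (the CSP and Permissions-Policy strings are not given there, so they are omitted), and `deny_all` is a made-up inner app standing in for an auth gate.

```python
import asyncio

# Only the headers whose literal values are stated in the commit message.
SECURITY_HEADERS = [
    (b"x-content-type-options", b"nosniff"),
    (b"referrer-policy", b"same-origin"),
    (b"x-frame-options", b"DENY"),
]


class SecurityHeadersMiddleware:
    """Raw ASGI wrapper that appends headers to every HTTP response start."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.extend(SECURITY_HEADERS)
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_headers)


async def deny_all(scope, receive, send):
    """Hypothetical inner app: rejects everything, like an auth gate would."""
    await send({"type": "http.response.start", "status": 401, "headers": []})
    await send({"type": "http.response.body", "body": b"unauthorized"})


async def collect(app):
    """Drive one fake GET request through `app` and capture what it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await app({"type": "http", "method": "GET", "path": "/"}, receive, send)
    return sent


messages = asyncio.run(collect(SecurityHeadersMiddleware(deny_all)))
```

Because the wrapper sits outside the rejecting app, the 401 response start in `messages[0]` still carries the security headers, which is the property the SecurityHeaders -> AuthGate -> Session ordering is meant to guarantee.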
Brandon Walter	11218753ce	feat: secret handling — status badges + redacted endpoint + rotation audit (1.0.0-26) Closes #5 of the post-Codex hardening list: * Settings UI now shows a `[set]` (green) or `[unset]` (gray) badge next to every password/key field. Tells the operator at a glance which secrets are configured without ever rendering the value. * SSH key gets a granular source label: `set (environment variable)`, `set (mounted secret)`, or `set (stored in settings DB — prefer a mounted secret in production)`. Same hint copy in the field's help text now actively recommends `/run/secrets/ssh_key` over the textarea. * New `GET /api/v1/settings/redacted` admin-only endpoint dumps every editable setting with secrets replaced by `**`, plus the per-secret status map. Useful for ops triage ("what's actually loaded?") without the secrets ever leaving the container or hitting a transcript. `POST /api/v1/settings` writes a `settings_secret_changed` audit event whenever a non-empty secret is rotated. Records field names, operator, source IP — never the value. Lets the audit page answer "who rotated the SMTP password last week?". Internal: `_SECRET_FIELDS` constant in routes.py is now the single source of truth for which fields get the redaction / audit treatment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 18:15:57 -04:00
Brandon Walter	992e2c47b3	deps: pin transitive dependencies via lockfile (1.0.0-25) Closes the unpinned-deps gotcha that broke production once already (Starlette 1.0 shipping in 2026-04 changed the TemplateResponse signature; our floating requirements.txt picked it up on the next rebuild and the dashboard 500'd until 1.0.0-12 patched the call sites). Mechanics: * `requirements.in`: human-edited input, identical contents to the old `requirements.txt`. * `requirements.txt`: now an autogenerated lockfile (876 lines, every transitive pinned with sha256 hashes). Regenerated via `scripts/regenerate-lockfile.sh`, which runs `pip-compile --generate-hashes --strip-extras` in a clean python:3.12-slim container so the script has no host dependencies. * Dockerfile installs with `pip install --require-hashes`, which refuses any package whose sha256 doesn't match the lockfile, defending against compromised PyPI mirrors and accidental version drift. Verification: * Container boots clean on the hash-locked install (1.0.0-25). * /health returns 200 with all checks green. * Daily security scan (pip-audit + bandit + gitleaks) returns 0 findings against the new lockfile. Future deps changes: edit requirements.in, run the regenerate script, review the diff, rebuild, commit both files. README §"Updating dependencies" walks through it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 17:07:22 -04:00
Brandon Walter	1a19252019	feat: daily security scan — pip-audit + bandit + gitleaks (1.0.0-24) Some checks are pending Security scan / pip-audit (push) Waiting to run Details Security scan / bandit (push) Waiting to run Details Security scan / gitleaks (push) Waiting to run Details Two layers of defence-in-depth scanning: * `.forgejo/workflows/security-scan.yml` — runs pip-audit, bandit, and gitleaks on every push, every PR, and nightly at 07:00 UTC. Activates when the forge has a runner; harmless no-op until then. Bandit is invoked with `--skip B608` because every dynamic SQL build in this codebase uses bound parameters for data and structural placeholders only — we still catch real injection through code review. * `scripts/security-scan.sh` + systemd `service`/`timer` — maple-side daily scanner that runs the same three tools entirely in containers (no host pollution). Differences from the forge job: - pip-audit runs INSIDE the live container against installed packages, catching new CVEs in transitives requirements.txt doesn't pin (e.g. starlette breaking changes shipping in 1.0). - bandit scans the LIVE deploy dir at ~/docker/stacks/truenas-burnin/app/, not a fresh git checkout — so drift between forge HEAD and prod surfaces here too. - gitleaks scans a managed clone in ~/scan-checkouts/, kept fast-forward to origin/main. Output: ~/security-scans/scan-YYYY-MM-DD/{summary,pip-audit,bandit, gitleaks}.txt with 30-day retention. ~/security-scans/findings.log appended on any non-zero exit. SECURITY_SCAN_WEBHOOK env in the service unit lets you POST findings to Mattermost / Slack / etc. once you decide where alerts should land. First-run findings already actioned in this commit: * pip-audit caught 3 CVEs in `pip` itself (CVE-2025-8869, CVE-2026-1703, CVE-2026-3219). Dockerfile now upgrades pip to >=26.0 before installing the rest. * bandit's B608 SQL-injection heuristic flagged two f-string SQL constructions in `_upsert_drive` and `_fetch_drives_for_template`. 
Both were structural concatenation (column-list selection, '?,?,?' placeholder count), not data interpolation, but refactored from f-string to explicit concatenation so a future reviewer doesn't have to relitigate. * bandit's B104 (binding to 0.0.0.0) annotated with inline `# nosec B104` — container deliberately binds all interfaces; nginx-proxy- manager fronts it. * gitleaks: 0 secrets across 14 commits. Clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 17:07:22 -04:00
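The distinction drawn in the B608 note, structural concatenation for the placeholder count versus bound parameters for the data, can be illustrated with a small sketch. The function name `fetch_by_ids` and the two-column `drives` table are assumptions made for the example; they are not the project's `_upsert_drive` or `_fetch_drives_for_template`.

```python
import sqlite3


def fetch_by_ids(conn, ids):
    """Select rows by id with a variable-length IN list.

    Only the NUMBER of '?' placeholders is concatenated into the SQL text;
    the id values themselves always travel as bound parameters.
    """
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)  # structure only, e.g. '?,?,?'
    sql = (
        "SELECT id, devname FROM drives WHERE id IN ("
        + placeholders
        + ") ORDER BY id"
    )
    return conn.execute(sql, list(ids)).fetchall()
```

A hostile value such as the string `"1 OR 1=1"` is compared as data against the integer id column and simply matches nothing, since it never becomes part of the statement text.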
Brandon Walter	c589e3c8e5	docs: add README operator guide First operator-facing README. Covers quick start (build, configure, first-user login), the multi-drive batch workflow with concrete time estimates, the four drive-lock states with their confirm tokens, notable settings, daily report / notifications, ops cookbook (logs, user CLI, backups, /health probe, DB reset), and an honest "known gaps" list. Cross-references CLAUDE.md (architecture + rationale) and SPEC.md (per-version feature reference) for deeper docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:08:42 -04:00
Brandon Walter	d4c0770b9e	feat: app-level login + hardening sweep (1.0.0-22 -> 1.0.0-23) Two layered changes shipped in this branch: == 1.0.0-22: app-level authentication == The dashboard previously had only an IP allowlist. Adds username + bcrypt password auth, signed-cookie sessions, and a "first user setup" flow. * New app/auth.py: User dataclass, bcrypt hash/verify, get_user_by_id/ username, create_user, touch_last_login, FastAPI `get_current_user` dependency. Session secret loaded from SESSION_SECRET env or persisted to /data/session_secret. * New app/auth_cli.py: `python -m app.auth_cli list\|reset\|add` for out-of-band user management. Passwords always read from a TTY prompt. * Schema: idempotent ALTER for `users` table (id, username unique, password_hash, full_name, is_admin, created_at, last_login_at). * main.py: SessionMiddleware (HMAC-signed cookie, max-age 7 days, SameSite=strict — see hardening section) + _AuthGateMiddleware that populates request.state.current_user and bounces unauth'd HTML GETs to /login while returning 401 JSON for everything else. * Routes: GET /login renders first-user-setup form when users table is empty otherwise sign-in form; POST /login; POST /api/v1/auth/setup (only works while empty); GET\|POST /logout. * Bootstrap: env vars INITIAL_ADMIN_USERNAME + INITIAL_ADMIN_PASSWORD create the first admin on startup if both set AND users table empty. Ignored thereafter — change passwords via UI or CLI. * Layout: header shows current_user.full_name\|username + Logout link. Modal operator field auto-fills from the logged-in user via <meta name="default-operator"> rendered in layout (replaces the localStorage-only previous behaviour). * requirements.txt: pinned bcrypt>=4.0,<5.0, itsdangerous>=2.1, python-multipart>=0.0.7. First step toward addressing the unpinned-deps gotcha. * New app/templates/login.html with first-user-setup variant. == 1.0.0-23: hardening sweep == Closes the eight-item gap audit: * DB retention + automated backup. 
New app/retention.py runs daily at 03:00 local. Nulls burnin_stages.log_text on stages older than retention_log_days (default 35), VACUUMs to reclaim pages, then runs `sqlite3 .backup` to /data/backups/app-YYYY-MM-DD.db keeping the retention_backup_keep most recent (default 14). Wired into the lifespan supervisor next to mailer/poller. * CSRF mitigation. SessionMiddleware bumped to SameSite=strict so the browser refuses to send the session cookie on cross-site POSTs — removes the actual CSRF vector. Trade-off: external links into the app require re-auth. * Login rate limiting. In-memory per-username AND per-source-IP failure counters in auth.py. 10 failures within 10 min trips a 15-min lockout for both keys. Returns HTTP 429 with a clear "try again in N min" message. Cleared on successful login. * Login audit events. New event types in audit_events: user_login, user_login_failed, user_login_locked_out, user_logout, user_password_changed. All include source IP. Recorded via auth.audit_auth_event(). * Password change UI. Header link "Change password" opens templates/components/modal_password.html (current/new/confirm). Posts to POST /api/v1/auth/change-password — bcrypt-verifies current, requires >=8 char new pw, writes audit event. * NVMe burn-in path. _stage_surface_validate now detects nvme* devnames and routes to _stage_surface_validate_nvme() which runs `nvme format -s 1 --force` (cryptographic erase). Seconds vs hours of badblocks, exercises the controller's secure-erase. Falls back to badblocks if nvme-cli isn't installed. Post-format SMART check. * Mounted-FS detection. ssh_client.get_mounted_drives() runs `findmnt -no SOURCE`, parses non-ZFS sources back to base devnames. Poller treats them as pool_name='(mounted)', pool_role='mounted'. Confirm token DESTROY MOUNTED FILESYSTEM, distinct purple styling, audit event mounted_drive_unlocked, daily-report banner picks it up. * Deeper /health. 
Real readiness check — DB write probe (PRAGMA journal_mode), poller freshness (age <= 3x stale_threshold), SSH test_connection() when configured. Returns 503 when any check fails so a proxy/orchestrator can take the container out of rotation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:08:29 -04:00
Brandon Walter	5da1a1704f	feat: pool-membership lock + cancellation hardening + smart_health refresh + tunables (1.0.0-13 -> 1.0.0-21)
Substantial feature + reliability sweep. Each version below was developed, tested live against the maple/TrueNAS deployment, and Codex-reviewed before bundling.
1.0.0-13: asyncssh proc.kill() doesn't actually kill the remote process (sshd ignores SSH signal-channel requests by default), so a cancel of a long-running badblocks left the remote process running and proc.wait() hanging, pinning the asyncio.Semaphore slot forever.
* Wrap long-lived commands in `sh -c 'echo PID:$$; exec <cmd>'` to capture the remote PID; store in burnin._remote_pids[job_id].
* burnin._kill_remote_process(job_id) opens a fresh SSH session and issues `kill -9 <pid>`, which sshd honours.
* Bound proc.wait() with asyncio.wait_for(timeout=15).
* burnin._active_tasks tracks every _run_job task so cancel_job and check_stuck_jobs can actually cancel the asyncio task (was DB-only before). Also fixes the documented asyncio.create_task GC gotcha (weak refs only).
* _run_job finalizer reads current state and skips the write if state != 'running' so cancelled/unknown aren't clobbered.
1.0.0-14: poller._upsert_drive ON CONFLICT only refreshed temperature/health/poll timestamps; devname/serial/model/size_bytes were stuck at first-INSERT values forever. After kernel SCSI re-enumeration two drives could both show as `sda`. Fixed by updating all six fields. Also added 7-day stale filter to _DRIVES_QUERY so removed drives drop off the dashboard while audit/burnin_jobs FKs stay intact.
1.0.0-15/-16: pool-membership lock.
* ssh_client.get_pool_membership() runs `zpool list -vHP` and parses the flattened TrueNAS output (container vdevs + their device children both appear at depth 1; section markers cache/log/spare/special/dedup switch the role).
* ssh_client.get_zfs_member_drives() runs `lsblk -no NAME,FSTYPE -l` to detect drives carrying ZFS labels not in any active pool; they get pool_name='(exported)', pool_role='exported'.
* Three idempotent ALTER TABLE migrations on drives: pool_name/pool_role/pool_seen_at.
* burnin.start_job raises PoolMemberError if pool_name IS NOT NULL and the drive isn't in burnin._unlock_grants. Routes layer maps to 409 with structured detail {pool_name, pool_role, pool_locked: true} so the frontend can render an unlock affordance.
* POST /api/v1/drives/{id}/unlock accepts {confirm_token, operator, reason}. Token is the pool name for active pools, "DESTROY BOOT POOL" for boot-pool, "DESTROY EXPORTED POOL" for exported. Reason >= 5 chars. TTL = UNLOCK_TTL_SECONDS = 600. Audit event types: pool_drive_unlocked / boot_pool_drive_unlocked / exported_pool_drive_unlocked.
* Grants are in-memory only; a container restart wipes them.
* UI: lock icon (yellow/red/orange), pool pill, conditional Unlock vs Burn-In button. modal_unlock.html with type-to-confirm field. Live unlock countdown via tickUnlockCountdowns() in app.js.
* Daily report: red banner listing every unlock event from the last 24h, with operator + reason + timestamp.
1.0.0-17: Codex review fail-open + XSS + structured-error fixes.
* ssh_client.get_pool_membership / get_zfs_member_drives now return None on failure (vs {} for 'definitely empty'). poller passes update_pool=False to _upsert_drive on detection failure, preserving existing pool columns instead of clearing them. Without this fix a 1-second SSH blip silently unlocked every drive.
* mailer._build_unlock_banner_html escapes every interpolated field via html.escape() (was '<' only). Time filter switched to julianday(): string >= against datetime('now', '-1 day') compared formats with different separators ('T' vs ' ') and timezone suffixes, causing subtle off-by-N-hour inclusion.
* app.js submitStart/submitBatchStart now detect the structured pool_locked 409 detail and auto-open the unlock modal for the offending drive (was [object Object] in toast).
1.0.0-18: Codex grant-binding + commit-ordering fixes.
* Unlock grants bound to the (pool_name, pool_role) observed at unlock time. _UnlockGrant dataclass; _is_unlocked and unlock_expiry invalidate the grant if the live row's pool identity has changed. Prevents an 'exported' unlock from carrying over when the drive turns out to be in active 'tank' or 'boot-pool'.
* grant_pool_unlock now writes to _unlock_grants only AFTER db.commit() succeeds; previously a failed audit insert left an unaudited grant armed.
1.0.0-19: Codex race + cancellation classification + test scaffold.
* Partial unique index uniq_active_burnin_per_drive ON burnin_jobs (drive_id) WHERE state IN ('queued','running'). INSERT now wraps in try/except aiosqlite.IntegrityError -> ValueError so the read-then-insert race in start_job can't produce two queued rows for the same drive.
* _run_job tracks was_cancelled flag; on bare task.cancel() (shutdown, future code paths) where DB state is still 'running', finalizer writes 'unknown' instead of mis-classifying as 'failed'.
* tests/ stdlib unittest scaffold:
  - test_pool_parser.py (21 tests): mirror/raidz/draid container vdevs, single-disk depth-1, plural section markers, partition stripping, sdaa-style names, multi-pool, role reset between pools.
  - test_unlock_flow.py (18 tests): token validation per pool kind, identity-binding invalidation, TTL expiry, audit-commit-then-arm ordering, unique-active-burnin partial index.
  Run via `python -m unittest discover tests/`. No new dependencies.
1.0.0-20: Spearfoot-inspired badblocks tunables.
* surface_validate_block_size (-b, default 4096), surface_validate_block_buffer (-c, default 64), surface_validate_passes (-p, default 1) exposed in Settings UI; persist via settings_store.json. Validation: block size must be a power of 2 between 512 and 1048576. Defaults preserve existing behaviour. Bumping to 8192/64/1 roughly halves runtime on multi-TB HDDs at ~2x RAM cost.
1.0.0-21: SMART overall-health column actually populated.
* /api/v2.0/disk doesn't expose smart_health, so every drive defaulted to UNKNOWN forever (only burn-in stages ever wrote a real value).
* ssh_client.get_smart_health_map([devnames]) runs `smartctl -H` for all drives in a single SSH session, deterministically delimited with @@devname@@ ... @@END@@ markers. Returns {devname: PASSED|FAILED|UNKNOWN} or None on SSH failure.
* poller calls it every 5th cycle (~1 min at default 12s interval), caches in _state['smart_health_cache'] so transient failures preserve the previous values.
* Dashboard CSS: col-smart min-width 150 -> 95, horizontal padding 14 -> 6 so Short/Long SMART columns fit comfortably on a 13-inch display.
* 5 additional parser tests (44 total, all passing).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 09:25:56 -04:00
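The 1.0.0-19 race fix in the commit above relies on a partial unique index rejecting a second active job for the same drive. The sketch below shows that mechanism in isolation. It is an illustration under stated assumptions: it uses the stdlib `sqlite3` module rather than the aiosqlite the commit names, the `burnin_jobs` table is cut down to three hypothetical columns, and this `start_job` is a stand-in, not the project's function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified, hypothetical schema; the real table has more columns.
    CREATE TABLE burnin_jobs (
        id INTEGER PRIMARY KEY,
        drive_id INTEGER NOT NULL,
        state TEXT NOT NULL
    );
    -- Uniqueness is enforced only for rows that are still active.
    CREATE UNIQUE INDEX uniq_active_burnin_per_drive
        ON burnin_jobs (drive_id)
        WHERE state IN ('queued', 'running');
    """
)


def start_job(drive_id: int) -> int:
    """Insert a queued job; raise ValueError if the drive already has an active one."""
    try:
        cur = conn.execute(
            "INSERT INTO burnin_jobs (drive_id, state) VALUES (?, 'queued')",
            (drive_id,),
        )
    except sqlite3.IntegrityError as exc:
        # The database, not a prior SELECT, is the arbiter of the race.
        raise ValueError(f"drive {drive_id} already has an active burn-in") from exc
    conn.commit()
    return cur.lastrowid
```

Because the index only covers 'queued' and 'running' rows, finished jobs for the same drive can accumulate freely, and a new job can start once the previous one leaves those states.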
Brandon Walter	b85bac7686	chore: re-sync deployed work that pre-dates this session These files have been live on maple for a while via direct scp/edit but were never committed back to the forge. Restoring parity so the repo matches the running container's source tree before the new feature work on top. - app/terminal.py: NEW. xterm.js <-> asyncssh PTY bridge wired into the log drawer's Terminal tab. Was added on the deploy host only. - app/truenas.py: misc REST client tweaks deployed but not committed. - CLAUDE.md / SPEC.md: documentation drift — Stage 8 terminal section, updated file map. - docker-compose.yml / requirements.txt: minor infra deltas already active on maple. No behaviour change vs the running container. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 09:24:42 -04:00
Brandon Walter	289c6d8f1a	fix: reset clears burn-in dashboard column via last_reset_at timestamp Add last_reset_at column to drives table (migration-safe ALTER TABLE). _fetch_burnin_by_drive now excludes jobs created before the drive's last_reset_at, so the dashboard burn-in column goes blank after reset while the History page still shows the full job record. reset_drive stamps last_reset_at = now() alongside clearing smart_attrs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 11:24:32 -05:00
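The reset behaviour in the commit above (dashboard column goes blank, History keeps the record) comes down to one extra predicate on the dashboard query. A minimal sketch, assuming a `created_at` column on `burnin_jobs` and ISO-8601 timestamps in a single consistent format; the schema, the query text, and both helper functions are hypothetical rather than taken from `_fetch_burnin_by_drive`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, cut-down tables for illustration only.
    CREATE TABLE drives (id INTEGER PRIMARY KEY, last_reset_at TEXT);
    CREATE TABLE burnin_jobs (
        id INTEGER PRIMARY KEY, drive_id INTEGER, state TEXT, created_at TEXT
    );
    -- Drive 1 was reset at noon; drive 2 has never been reset.
    INSERT INTO drives VALUES (1, '2026-02-24T12:00:00'), (2, NULL);
    INSERT INTO burnin_jobs VALUES
        (10, 1, 'failed', '2026-02-24T09:00:00'),
        (11, 2, 'failed', '2026-02-24T09:00:00');
    """
)

# Dashboard view: hide jobs created before the drive's last reset.
# Plain string comparison is only safe here because both columns
# use the same timestamp format.
DASHBOARD_QUERY = """
    SELECT j.drive_id, j.id
    FROM burnin_jobs AS j
    JOIN drives AS d ON d.id = j.drive_id
    WHERE d.last_reset_at IS NULL OR j.created_at >= d.last_reset_at
    ORDER BY j.drive_id
"""


def dashboard_jobs() -> list:
    """Jobs visible in the dashboard burn-in column."""
    return conn.execute(DASHBOARD_QUERY).fetchall()


def history_jobs() -> list:
    """History is unfiltered, so the full job record survives a reset."""
    return conn.execute(
        "SELECT drive_id, id FROM burnin_jobs ORDER BY id"
    ).fetchall()
```

Filtering at read time instead of deleting rows is what keeps the History page intact while the dashboard shows a clean slate.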
Brandon Walter	645d55cfcc	docs: update CLAUDE.md and SPEC.md for Stage 8 (live terminal) Documents WebSocket terminal architecture, xterm.js lazy loading, message protocol, tab lifecycle, and reconnect behavior. SPEC.md: updated drawer tabs (4 tabs including Terminal), added WS endpoint, corrected bad block threshold default (0, not 2), version bumped to 1.0.0-8. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 11:16:29 -05:00
Brandon Walter	5a802bff2e	feat: live SSH terminal in drawer (xterm.js + asyncssh WebSocket) Adds a Terminal tab to the log drawer with a full PTY session bridged over WebSocket to the TrueNAS SSH host. xterm.js loaded lazily on first tab open. Supports resize, paste, full color, and reconnect. - app/terminal.py: asyncssh PTY ↔ WebSocket bridge - routes.py: @router.websocket("/ws/terminal") - dashboard.html: Terminal tab + drawer panel - app.js: xterm.js lazy load, init, onData, resize observer, reconnect - app.css: terminal panel styles (no padding, overflow hidden) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 09:30:56 -05:00
Brandon Walter	70c26121a8	ui: move version badge next to title in header left side Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 09:23:10 -05:00
Brandon Walter	22ed2c6e12	fix: JS syntax error breaking all buttons; add settings restart banner app.js: stages.forEach callback in _drawerRenderBurnin was missing its closing });, causing a syntax error that prevented the entire script from loading — all click handlers (Short/Long SMART, Burn-In, cancel) were unregistered as a result. settings.html: add a prominent yellow restart banner with the docker command (docker compose restart app) that appears after saving any system settings that require a container restart to take effect. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 08:57:57 -05:00
Brandon Walter	fc33c0d11e	docs: update CLAUDE.md for Stage 7; bump version to 1.0.0-7 Documents all Stage 7 features: SSH burn-in architecture, SMART attr monitoring, drive reset, version badge, stats polish, new env vars, new API routes, and real-TrueNAS cutover steps. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 08:13:21 -05:00

54 commits