Commit graph

38 commits

Author SHA1 Message Date
Brandon Walter
30062affc2 feat: per-pattern badblocks meters in drive drawer (1.0.0-44)
User asked for one meter per badblocks pattern. The drawer now shows
4 meters (one per pattern: 0xaa / 0x55 / 0xff / 0x00), each split
into write (left, blue) + verify (right, green) halves so a glance
shows both which pattern is current AND whether you're writing or
verifying within it.

Backend:
- New columns burnin_stages.bb_phase (1-8) + bb_phase_pct (0-100)
  via idempotent ALTER TABLE migration
- _update_stage_bb_phase() helper called from the badblocks parser
  on every tick (when phase or percent changes)
- /api/v1/drives/{id}/drawer SELECT now returns the new fields
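
A minimal sketch of what an idempotent column migration of this kind could look like, assuming SQLite and a `PRAGMA table_info` existence check. The column types and the function name `migrate_burnin_stages` are illustrative assumptions, not the code in this commit.

```python
import sqlite3

# Hypothetical column definitions; the real types/constraints are not shown in the commit.
NEW_COLUMNS = {
    "bb_phase": "INTEGER",      # 1-8
    "bb_phase_pct": "INTEGER",  # 0-100
}


def migrate_burnin_stages(conn: sqlite3.Connection) -> list:
    """Add any missing columns and return the names that were added."""
    # Column 1 of each PRAGMA table_info row is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(burnin_stages)")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE burnin_stages ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE burnin_stages (id INTEGER PRIMARY KEY, stage_name TEXT)")
first_run = migrate_burnin_stages(conn)   # adds both columns
second_run = migrate_burnin_stages(conn)  # nothing left to add
```

Running the migration twice is safe because the second pass finds both columns already present and issues no ALTER TABLE.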

Frontend (app.js + app.css):
- _drawerRenderBadblocksMeters(phase, phasePct) computes per-pattern
  fill state and emits 4-meter HTML with W/V sub-labels
- Conditional render: only shows when stage_name === 'surface_validate'
  AND bb_phase is set, so historical pre-1.0.0-44 stage rows render
  unchanged (single percent, no meters)
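
The real helper lives in app.js; the following is only a Python restatement of the fill logic, assuming phase 1 is the 0xaa write, phase 2 the 0xaa verify, and so on through phase 8. The function name and the returned dict shape are hypothetical.

```python
PATTERNS = ("0xaa", "0x55", "0xff", "0x00")


def badblocks_meters(phase: int, phase_pct: float) -> list:
    """Return one meter per pattern with write/verify fill percentages."""
    pct = max(0.0, min(float(phase_pct), 100.0))
    meters = []
    for index, pattern in enumerate(PATTERNS):
        meter = {"pattern": pattern}
        # Assumed mapping: each pattern owns two consecutive phases.
        for half, half_phase in (("write", 2 * index + 1), ("verify", 2 * index + 2)):
            if phase > half_phase:
                meter[half] = 100.0   # phase already finished
            elif phase == half_phase:
                meter[half] = pct     # phase currently running
            else:
                meter[half] = 0.0     # phase not started yet
        meters.append(meter)
    return meters


# Phase 3 at 40% means: 0xaa fully done, 0x55 write at 40%, the rest empty.
example = badblocks_meters(3, 40)
```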

3 new tests cover the migration columns, single-tick persistence,
and overwrite-on-second-tick. Total suite: 75 tests.

Image rebuilt and tagged but NOT deployed — 4 burn-ins are running
right now and a recreate would SIGHUP them. Deploy with
`docker compose up -d` after the current batch finishes; the
migration runs at init and the meters light up for the next batch.
2026-05-08 22:34:35 -07:00
Brandon Walter
4922b19a9f fix: stuck_job_hours default 24 → 168 (7 days) (1.0.0-43)
A user with 4× 14 TB WD HDDs running -w surface_validate had all
4 jobs marked 'unknown' at exactly 24h+1min — the stuck-job
detector firing on legitimate work because 14 TB at 8192-block
badblocks needs ~5+ days to complete all 4 patterns × 2 phases.
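
A rough back-of-the-envelope check of that estimate. The 250 MB/s sustained throughput is an assumed figure for illustration only; real drives slow down toward the inner tracks, so actual runs can take longer.

```python
def full_write_pass_hours(capacity_bytes: float,
                          throughput_bytes_per_s: float,
                          patterns: int = 4,
                          phases_per_pattern: int = 2) -> float:
    """Hours needed to stream the whole drive once per phase."""
    total_bytes = capacity_bytes * patterns * phases_per_pattern
    return total_bytes / throughput_bytes_per_s / 3600


# 14 TB drive, assumed 250 MB/s average: about 124 hours, a little over 5 days.
hours = full_write_pass_hours(14e12, 250e6)
```

Under that assumption the job runs far past the old 24h threshold but still fits inside the new 168h default.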

168h covers a full -w pass on 14 TB+ HDDs with margin. Anyone
running short SSDs who wants faster detection can drop the value
in Settings → Burn-in.

README warning replaced — no longer instructs users to bump the
threshold before starting big-drive burn-ins, since the default
now handles that case.

Settings UI already accepts up to 168 via the input's max=168
attribute, so no template change needed.
2026-05-08 13:23:05 -07:00
Brandon Walter
b406e3f315 fix: badblocks progress tracks overall %, not per-phase (1.0.0-42)
`badblocks -w` cycles through 4 patterns (0xaa, 0x55, 0xff, 0x00),
each with a write phase + a verify phase = 8 phases. The output's
"XX% done" lines are per-phase, so the dashboard appeared to "rewind"
every ~2 hours. Two drives racing each other could look 4× apart in
displayed progress despite identical hardware — actually one was
just further into a later phase.

New _BadblocksProgress state machine watches for "Testing with
pattern 0xXX" and "Reading and comparing" headers, advances the
phase counter, and reports overall = ((phase-1) * 100 + phase_pct) / 8
clipped to 99. Pure state machine, no I/O.
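
A minimal sketch of a state machine along these lines, using the formula quoted above. The class name, the regex, and the choice to treat every header as "advance by one phase" are assumptions, not the actual `_BadblocksProgress` code.

```python
import re

_PCT = re.compile(r"(\d+(?:\.\d+)?)% done")
_HEADERS = ("Testing with pattern", "Reading and comparing")


class BadblocksProgressSketch:
    TOTAL_PHASES = 8  # 4 patterns x (write + verify)

    def __init__(self) -> None:
        self.phase = 0        # 0 means no header seen yet
        self.phase_pct = 0.0

    def feed(self, line: str) -> float:
        """Consume one output line and return the overall percentage."""
        if any(header in line for header in _HEADERS):
            self.phase = min(self.phase + 1, self.TOTAL_PHASES)
            self.phase_pct = 0.0
        matches = _PCT.findall(line)
        if matches and self.phase > 0:
            # Use the last percentage on the line in case several are batched.
            self.phase_pct = min(float(matches[-1]), 100.0)
        return self.overall()

    def overall(self) -> float:
        if self.phase == 0:
            return 0.0
        raw = ((self.phase - 1) * 100 + self.phase_pct) / self.TOTAL_PHASES
        return min(raw, 99.0)  # never report 100 until the stage itself finishes


progress = BadblocksProgressSketch()
readings = [
    progress.feed("Testing with pattern 0xaa: "),
    progress.feed(" 50.00% done, 1:00:00 elapsed. (0/0/0 errors)"),
    progress.feed("Reading and comparing: "),
    progress.feed(" 50.00% done, 1:00:00 elapsed. (0/0/0 errors)"),
]
```

The same per-phase "50% done" now maps to different overall values depending on the phase, which is what removes the apparent rewind.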

7 new tests cover phase-header detection, boundary math, monotonicity
across a synthetic stream, and the original "two drives at same
per-phase % look identical" bug.

Image rebuilt and tagged but NOT deployed to the running container —
4 surface-validate jobs are 20-95% through 14TB drives and a recreate
would SIGHUP the remote badblocks processes. Deploy with
`docker compose up -d` after the current batch finishes.
2026-05-05 07:26:23 -07:00
Brandon Walter
775251b993 docs: refresh README test count + run-tests.sh pointer
Test suite has grown from 44 → 65 since this line was last touched
(routes resolution, badblocks tunables, rate limiter, lifecycle).
Also points readers at scripts/run-tests.sh for the in-container path.
2026-05-05 06:19:17 -07:00
Brandon Walter
8ae84862de infra: rename truenas-burnin → nas-burnin (1.0.0-41)
Matches the 1.0.0-38 product display rename. Touches every
infrastructure identifier:

- container_name: truenas-burnin → nas-burnin
- forge URL in /api/v1/updates/check
- security-scan: REPO_URL, REPO, DEPLOY_DIR, systemd unit description
- run-tests.sh default container name
- doc paths in README/SPEC/CLAUDE
- in-app instruction strings (login.html, settings.html, auth_cli.py)

Maple migration done in lockstep:
  docker compose down (truenas-burnin)
  mv ~/docker/stacks/{truenas-burnin,nas-burnin}
  systemd unit ExecStart updated + daemon-reload
  docker compose up -d --build → container nas-burnin
  Old image truenas-burnin-app removed (~12 GB reclaimed)
  Stale top-level orphans cleaned (config.py, poller.py, routes.py,
  truenas.py, tests/) — all dead since pre-split refactors

Forge repo rename (git.hellocomputer.xyz/brandon/truenas-burnin →
nas-burnin) is a separate UI-only step. Forgejo redirects the old
URL after rename, so this commit can be pushed to the existing
remote first; remote URL gets updated locally once you rename.
2026-05-04 07:16:02 -07:00
Brandon Walter
d38807f957 test: cover Spearfoot tunables in badblocks command
Extracts the badblocks shell-command construction into
_build_badblocks_cmd(devname) so it can be unit-tested without
spinning up an asyncssh connection. Behavior unchanged.

Three tests guard:
1. Defaults match disk-burnin.sh recommendation (-b 4096 -c 64 -p 1)
2. Operator-set tunables actually propagate to the command
3. The PID-capture wrapper (sh -c 'echo PID:\$\$; exec ...') stays
   intact — without it, cancel cannot kill the remote process
   because asyncssh's signal channel is silently ignored by sshd.
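
A sketch of how such a command builder and its PID parsing could look. The function names, parameter names, and the extra `-s -v` flags are assumptions; only the `-b 4096 -c 64 -p 1` defaults and the `echo PID:$$; exec ...` wrapper come from the message above.

```python
import shlex


def build_badblocks_cmd_sketch(devname: str, block_size: int = 4096,
                               blocks_at_once: int = 64, passes: int = 1) -> str:
    """Build a remote command that prints its own PID, then becomes badblocks."""
    device = shlex.quote(f"/dev/{devname}")
    inner = (
        # $$ is the shell's PID; exec keeps that PID for badblocks itself.
        f"echo PID:$$; exec badblocks -wsv -b {int(block_size)} "
        f"-c {int(blocks_at_once)} -p {int(passes)} {device}"
    )
    return f"sh -c {shlex.quote(inner)}"


def parse_pid(first_line: str):
    """Return the PID from a 'PID:<n>' line, or None if the line is something else."""
    value = first_line.strip()
    if value.startswith("PID:") and value[4:].isdigit():
        return int(value[4:])
    return None


cmd = build_badblocks_cmd_sketch("sda")
```

With the PID captured from the first output line, a cancel path can issue a separate `kill` over SSH instead of relying on the signal channel.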
2026-05-03 21:24:10 -07:00
Brandon Walter
7cd66d460f fix: annotate to mypy-clean + promote to gating (1.0.0-40)
Five files needed annotation tweaks to clear the 14 outstanding
mypy errors, all cosmetic (zero runtime bugs):

- settings_store._coerce: return Any (concrete type depends on key,
  no narrowing path mypy can follow from the dict lookup)
- retention._state: explicit dict[str, str | None] init
- mailer: explicit `server: smtplib.SMTP` binding so SMTP_SSL and
  SMTP both narrow to the parent class for shared call sites
- burnin/stages.py: TypedDict for the badblocks result dict so
  `result["bad_blocks"]` narrows to int at the comparison site
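
To illustrate the last item, a small sketch of how a TypedDict lets mypy narrow a dict lookup to int. The class name and the `exit_code` field are hypothetical; only the `bad_blocks` key is taken from the message above.

```python
from typing import TypedDict


class BadblocksResultSketch(TypedDict):
    bad_blocks: int
    exit_code: int  # hypothetical extra field, for illustration only


def stage_passed(result: BadblocksResultSketch) -> bool:
    # With a plain dict, mypy sees result["bad_blocks"] as object/Any;
    # with the TypedDict it is known to be int, so the comparison type-checks.
    return result["bad_blocks"] == 0


clean: BadblocksResultSketch = {"bad_blocks": 0, "exit_code": 0}
dirty: BadblocksResultSketch = {"bad_blocks": 3, "exit_code": 1}
```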

scripts/security-scan.sh: mypy now counted in TOTAL_EXIT and
findings.log line. Comment updated to reflect gating status.
2026-05-03 21:21:55 -07:00
Brandon Walter
cd92a4d3c8 chore: dev-experience + mypy noise cleanup
- scripts/run-tests.sh — one-shot wrapper for the tar+docker-cp dance
  that was being done by hand every test run. Optional pattern arg
  for a single module. Cleans tests/ out of the container after.

- scripts/security-scan.sh — mount the deploy app/ at /opt/app/app
  (not /src) so internal `from . import X` resolves through the
  `app` package and stops producing spurious "Module 'src' has no
  attribute X" errors that masked real findings.

- app/truenas.py — explicit `raise RuntimeError("unreachable")` after
  the retry loop. Functionally a no-op (loop always returns or
  re-raises), but makes the post-loop control flow obvious to
  readers and silences the mypy missing-return false positive.
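
A generic sketch of the retry-loop shape described in the last item, not the actual app/truenas.py code. The function name, the `ConnectionError` choice, and the parameters are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], attempts: int = 3, delay_s: float = 0.0) -> T:
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts:
                raise  # last attempt: propagate the error
            time.sleep(delay_s)
    # For attempts >= 1 the loop always returns or re-raises, so this line
    # never runs; it exists so mypy sees that no path falls off the end.
    raise RuntimeError("unreachable")


calls = []


def flaky() -> str:
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient")
    return "ok"


result = call_with_retries(flaky)
```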

mypy stays informational. Down to 14 real findings after these
fixes — promoting to gating still needs settings_store + retention
typing work, which is its own pass.
2026-05-03 21:11:23 -07:00
Brandon Walter
0ebc325746 docs: rename to NAS Burn-In + version bump in spec/context
Catches the README, SPEC, and CLAUDE.md that were missed in the
1.0.0-38 product rename. Infrastructure identifiers (paths,
container, repo URL) deliberately stay as truenas-burnin.

Also refreshes SPEC.md version (1.0.0-8 → 1.0.0-39) and CLAUDE.md
last-updated stamp (1.0.0-12 → 1.0.0-39).
2026-05-03 18:53:33 -05:00
Brandon Walter
8033161efb fix: address Codex routes-split follow-up review (1.0.0-39)
Three low-severity findings from Codex on the 1.0.0-37 split:

1. Trim dead package-level imports in routes/__init__.py — only
   `poller` was actually used; auth/burnin/mailer/settings_store
   were the exact shadowing footgun the absolute sub-router
   imports work around. Reword the comment block to match.

2. Thread `operator` through smart_start + smart_cancel.
   Previously the JS client sent it but the server ignored it;
   add audit_events rows (smart_test_start / smart_test_cancel)
   so the field is actually meaningful.

3. New tests/test_routes_resolution.py — guards two historical
   regressions: /api/v1/burnin/export.csv must register before
   /{job_id} (FastAPI int-coerce 422 trap) and the mailer
   back-compat shim `from app.routes import _fetch_drives_for_template`
   must keep importing. Plus a sub-router enumeration test that
   catches missed include_router calls in future splits.
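
The route-order trap from item 3 can be reproduced without FastAPI. The sketch below is a dependency-free simulation of first-match routing, not FastAPI itself; the router, handlers, and response tuples are all hypothetical.

```python
import re


def make_router(routes):
    """Return a dispatcher that uses the first route whose pattern matches."""
    compiled = [(re.compile(pattern), handler) for pattern, handler in routes]

    def dispatch(path: str):
        for pattern, handler in compiled:
            match = pattern.fullmatch(path)
            if match:
                return handler(**match.groupdict())
        return 404, "not found"

    return dispatch


def get_job(job_id: str):
    try:
        return 200, f"job {int(job_id)}"
    except ValueError:
        return 422, "job_id is not an integer"


def export_csv():
    return 200, "csv"


JOB = (r"/api/v1/burnin/(?P<job_id>[^/]+)", get_job)
EXPORT = (r"/api/v1/burnin/export\.csv", export_csv)

wrong_order = make_router([JOB, EXPORT])  # parameterised route registered first
right_order = make_router([EXPORT, JOB])  # literal route registered first
```

In the wrong order the literal path is swallowed by the parameterised route and fails integer coercion; in the right order both paths resolve as intended.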
2026-05-03 15:04:38 -05:00
Brandon Walter
a8a7d99621 rename: TrueNAS Burn-In → NAS Burn-In (1.0.0-38)
Product display name only — page titles, headers, email, browser
notification, FastAPI app title. Repo, container_name, file paths,
and infrastructure identifiers (truenas-burnin everywhere) stay put
to avoid breaking deployment.
2026-05-03 14:01:40 -04:00
Brandon Walter
40dac9090d refactor: extract drives + burnin routes (1.0.0-37)
Largest routes/ slice yet — drives.py (8 endpoints) and burnin.py
(4 endpoints). Drives helpers live in _drives_helpers.py so the
dashboard SSE handler in routes/__init__.py and mailer.py can both
keep using them via re-export.

routes/__init__.py shrinks from 815 → 163 LoC; only the dashboard /
and /sse/drives stream remain there. Routes split is now functionally
complete: 12 files, ~1800 LoC distributed by feature.
2026-05-03 09:59:15 -04:00
Brandon Walter
fc7fb4c714 refactor: extract settings routes (1.0.0-36)
Pulls /settings + /api/v1/settings* + /api/v1/settings/redacted +
/test-smtp + /test-ssh into routes/settings.py (155 LoC). All five
endpoints share the admin gate from auth.require_admin and the
secret_status / SECRET_FIELDS helpers, so the boundary is clean.

routes/__init__.py shrank from 960 -> 815 LoC. Cleanup bonus: dropped
an orphan "# Print view (must be BEFORE /{job_id} int route)" comment
that referenced the print-view endpoint already extracted to history.py.

Verification: 59/59 tests pass; /settings 401 (auth-gated as expected);
/login still 200; container boots clean at 1.0.0-36.

Remaining slices: routes/burnin.py (start + cancel + export.csv +
{job_id}) and routes/drives.py (the biggest, with the unlock route
that's currently interleaved between the burnin endpoints in
__init__.py — drives extraction unblocks burnin extraction).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 09:48:24 -04:00
Brandon Walter
3c39344069 refactor: extract history + audit + stats + report routes (1.0.0-35)
Continues the routes/ package split — four more clean extractions, all
following the same absolute-import pattern documented in the 1.0.0-34
gotcha note.

* routes/history.py (184 LoC) — /history, /history/{id}, and the
  /history/{id}/print view that MUST register before the {id} int route
  to avoid FastAPI's int("print") 422. Helpers _PAGE_SIZE,
  _ALL_STATES, _HISTORY_QUERY, _state_where moved with the endpoints.
  B608 nosec annotated on the count_sql f-string (it's two hardcoded
  literals; user input goes through bound params).

* routes/audit.py (53 LoC) — /audit page only. Owns _AUDIT_QUERY +
  _AUDIT_EVENT_COLORS.

* routes/stats.py (111 LoC) — /stats analytics page. Pure aggregation
  queries against burnin_jobs/drives, no shared helpers beyond
  stale_context.

* routes/report.py (24 LoC) — POST /api/v1/report/send. Now requires
  admin (was open to any authenticated user; sending mail is a side
  effect non-admins shouldn't be able to fire — same principle as the
  settings mutation gates added in 1.0.0-28).

routes/__init__.py shrank from 1261 -> 960 LoC. Remaining work:
drives, burnin, settings, dashboard — same pattern. Each future slice
will use the `import app.routes.X as _Y` absolute-import gotcha
workaround from 1.0.0-34.

Verification: 59/59 tests pass; /login 200 (public); /history /audit
/stats 401 (correctly auth-gated by middleware); container boots
clean at 1.0.0-35.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 09:44:22 -04:00
Brandon Walter
aa7822d6ce feat: rate limiter + mypy + lifecycle tests + routes/ split (1.0.0-33/-34)
Closes the four remaining items from the post-Codex hardening list.

#1 Rate-limit unlock + change-password endpoints (1.0.0-33)
   * Generalised the existing login limiter into a reusable
     `_RateLimiter` class in app/auth.py. Atomic check-then-increment
     in synchronous code so a parallel asyncio burst can't slip past
     the threshold.
   * `unlock_limiter` (5 attempts in 10 min → 10 min lockout) gates
     POST /api/v1/drives/{id}/unlock per-drive AND per-source-IP.
   * `pwchange_limiter` (5 in 10 min → 15 min lockout) gates
     POST /api/v1/auth/change-password per-user AND per-IP.
   * Both clear on successful operation. The login limiter keeps its
     existing `register_login_attempt` / `clear_login_failures`
     facade names so external callers don't change.
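
A minimal sketch of a limiter with this shape, not the actual `_RateLimiter`. The class name, method names, key format, and the exact boundary (here the attempt that reaches the threshold is refused and starts the lockout) are assumptions.

```python
import time


class RateLimiterSketch:
    def __init__(self, max_attempts: int, window_s: float, lockout_s: float,
                 clock=time.monotonic) -> None:
        self.max_attempts = max_attempts
        self.window_s = window_s
        self.lockout_s = lockout_s
        self._clock = clock
        self._attempts = {}      # key -> list of attempt timestamps
        self._locked_until = {}  # key -> time the lockout ends

    def register_attempt(self, key: str) -> bool:
        """Check and increment in one synchronous step (no awaits in between)."""
        now = self._clock()
        if self._locked_until.get(key, 0.0) > now:
            return False
        recent = [t for t in self._attempts.get(key, []) if now - t < self.window_s]
        recent.append(now)
        self._attempts[key] = recent
        if len(recent) >= self.max_attempts:
            self._locked_until[key] = now + self.lockout_s
            return False
        return True

    def clear(self, key: str) -> None:
        """Drop both the counter and any active lockout after a success."""
        self._attempts.pop(key, None)
        self._locked_until.pop(key, None)


fake_now = [0.0]
limiter = RateLimiterSketch(max_attempts=5, window_s=600, lockout_s=600,
                            clock=lambda: fake_now[0])
key = "drive:3|ip:10.0.0.5"  # hypothetical key format
outcomes = [limiter.register_attempt(key) for _ in range(5)]
```

Because the method contains no `await`, two coroutines on the same event loop cannot interleave between the check and the increment.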

#3 mypy in security-scan (1.0.0-33)
   * Added a 4th tool to the daily scan + forge workflow. Runs in a
     throwaway python:3.12-slim container against the deploy dir,
     exit code is informational only (NOT included in the
     `TOTAL_EXIT` failure sum). Findings land in
     ~/security-scans/scan-YYYY-MM-DD/mypy.txt for ratchet-down
     work over time.
   * Forge job uses `continue-on-error: true` so it doesn't fail the
     workflow until the type-debt baseline is annotated down.

#4 Lifecycle test coverage (1.0.0-33)
   * New tests/test_lifecycle.py with 15 cases:
     - TestCommonHelpers (7 tests): _start_stage, _finish_stage
       success/failure/error-preservation, _recalculate_progress
       weighted math, _is_cancelled, _append_stage_log.
     - TestStartCancelJob (4 tests): start_job inserts queued row +
       correct stage list, duplicate-active rejection, cancel marks
       state, cancel returns False on terminal-state jobs.
     - TestRateLimiter (4 tests): under-threshold ok, trips at
       threshold, clear removes both counter + lockout, separate
       keys don't interfere.
   * Total goes from 44 to 59 tests; closes the orchestration-path
     coverage gap Codex flagged.

#2 Partial routes.py split (1.0.0-34)
   * routes.py → routes/ package. Same staged-extraction pattern as
     the burnin.py split.
   * routes/auth.py — login/logout/setup/change-password (170 LoC).
   * routes/system.py — /health, /ws/terminal, /api/v1/updates/check
     (136 LoC).
   * routes/_helpers.py — shared utilities used by both extracted
     modules and the still-monolithic remainder: client_ip,
     operator_for, is_stale, stale_context, secret_status,
     SECRET_FIELDS (97 LoC).
   * routes/__init__.py shrank from 1568 LoC to 1261. Future slices
     can extract drives, burnin, history, settings the same way.
   * GOTCHA recorded in commit body: `from app import auth` at the
     top of __init__.py binds `auth` as an attribute on the package
     namespace, so `from . import auth as _auth_routes` finds the
     OUTER module and yields `app.auth` instead of the submodule.
     Fix is `import app.routes.auth as _auth_routes` (absolute).
     This bit me once at deploy time; container failed to start
     with `module 'app.auth' has no attribute 'router'`.
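
The gotcha can be reproduced in isolation. The sketch below builds a throwaway package on disk (the name `demo_app_sketch` and its file contents are made up for the demonstration) and shows which module each import form yields.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
pkg = root / "demo_app_sketch"
(pkg / "routes").mkdir(parents=True)
(pkg / "__init__.py").write_text("")
(pkg / "auth.py").write_text("KIND = 'outer auth module'\n")
(pkg / "routes" / "auth.py").write_text("router = 'auth sub-router'\n")
(pkg / "routes" / "__init__.py").write_text(
    # Binds the name `auth` on the routes package namespace.
    "from demo_app_sketch import auth\n"
    # The package already has an `auth` attribute, so no submodule import happens.
    "from . import auth as relative_result\n"
    # The absolute form imports the submodule explicitly.
    "import demo_app_sketch.routes.auth as absolute_result\n"
)

sys.path.insert(0, str(root))
routes = importlib.import_module("demo_app_sketch.routes")

relative_name = routes.relative_result.__name__
absolute_name = routes.absolute_result.__name__
```

Here the relative import hands back the outer module, which has no `router` attribute, while the absolute import yields the sub-router module.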

Verification: 59/59 tests pass (44 existing + 15 new); container
boots clean at 1.0.0-34; /health 200 with all checks green; security
scan still clean (mypy informational findings ignored from totals).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 09:29:53 -04:00
Brandon Walter
eb2a964171 fix: address Codex review of burnin package split (1.0.0-32)
Three LOW-severity findings from Codex's audit of the post-split
package, all small mechanical cleanups:

#1 routes.py:848 read burnin.UNLOCK_TTL_SECONDS — a snapshot alias
   bound at import time. After a test (or runtime) monkey-patches
   app.burnin.unlock.UNLOCK_TTL_SECONDS the API response would
   advertise the OLD value while grant_pool_unlock used the new one.
   Now reads burnin.unlock.UNLOCK_TTL_SECONDS directly so the API
   stays in sync with whatever the actual source-of-truth is.

#2 _stage_surface_validate_ssh() carried dead extraction scaffolding
   from when the badblocks logic was first inlined into burnin.py:
   _is_cancelled_sync (sync wrapper that does run_until_complete in
   a coroutine — would deadlock if ever called), last_logged_pct,
   on_progress, accumulated_lines, on_progress_async — none on any
   control-flow path. Plus result["output"] which was set but never
   read. All deleted; the inline _drain coroutines below already
   handle progress/log throttling correctly.

#3 The new module boundaries were leaking — root orchestration
   mutated _remote_pids and _unlock_grants directly even though
   kill.clear_remote_pid() and unlock.invalidate_grant() existed.
   Now using the helpers, so a future change to the storage shape
   only requires editing the owning module.

Bonus from Codex's check note: _get_client() now asserts
burnin._client is not None with a clear message instead of relying
on an obscure NoneType AttributeError if a stage is somehow called
before init().

Verified: 44/44 tests pass; container boots clean; /health 200.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:35:07 -04:00
Brandon Walter
19c2c0dc0f refactor: extract _common.py + stages.py from burnin (1.0.0-31)
Continues the staged burnin.py module split started in 1.0.0-30.
Two more clean extractions; orchestration (init, _run_job,
start_job, cancel_job, check_stuck_jobs, semaphore) intentionally
stays in __init__.py for now to avoid threading the TrueNASClient
through cross-module setters.

* app/burnin/_common.py — shared helpers with no upward deps:
  STAGE_ORDER + _STAGE_BASE_WEIGHTS + POLL_INTERVAL constants;
  _now / _db connection helper; _is_cancelled, _start_stage,
  _finish_stage, _cancel_stage, _set_stage_error, _update_stage_*,
  _append_stage_log, _store_smart_*, _recalculate_progress; SSE
  _push_update. Imports nothing from sibling burnin modules.

* app/burnin/stages.py — every per-stage implementation moved
  verbatim: _stage_precheck, _stage_smart_test +
  _stage_smart_test_api / _ssh, _stage_surface_validate +
  _surface_validate_nvme / _ssh / _truenas, _stage_timed_simulate,
  _stage_final_check, plus _badblocks_available, _nvme_cli_available,
  and _dispatch_stage. Pulls the shared helpers from _common,
  remote-PID setters from kill, and the live TrueNASClient via a
  lazy `_get_client()` helper that defers `from app import burnin`
  until call time so we don't trip a circular import.

* __init__.py shrank from ~1480 LoC to ~600. Re-exports every
  public name (start_job, cancel_job, init, check_stuck_jobs,
  PoolMemberError, UNLOCK_TTL_SECONDS, etc.) so external callers
  in routes.py / mailer.py / poller.py see the same surface.

State that didn't move: _semaphore, _client, _active_tasks remain
on the package root (with a runtime _client reference from routes.py
preserved). _run_job and start_job still live in __init__.py — full
task.py extraction would require giving stages access to _client
through a setter rather than the lazy lookup, deferred to a future
slice.

Verification: 44/44 unit tests pass in container; /health 200;
container boots clean. No public API change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 01:18:04 -04:00
Brandon Walter
9cbae44495 refactor: split burnin.py into a package — extract unlock + kill (1.0.0-30)
First slice of the planned tech-debt cleanup. burnin.py was 1667 lines
and growing; staged extraction gives smaller diffs to review and a
clear bisect target if anything regresses.

Mechanical move only — no behaviour change. The two extracted modules:

* app/burnin/unlock.py — _UnlockGrant, _unlock_grants, PoolMemberError,
  is_unlocked / unlock_expiry / grant_pool_unlock, plus the four
  *_TOKEN constants and UNLOCK_TTL_SECONDS. Owns its module-level
  state; opens its own DB connection in grant_pool_unlock so it
  doesn't depend on the parent package's _db() helper.

* app/burnin/kill.py — _remote_pids dict and the kill_remote_process /
  set_remote_pid / clear_remote_pid / get_remote_pid helpers. Pulled
  out of __init__.py so the asyncssh-ignores-signals workaround lives
  next to the state it operates on.

app/burnin/__init__.py re-exports every public symbol the rest of the
app imports — `from app import burnin; burnin.start_job(...)`,
`burnin.PoolMemberError`, `burnin.UNLOCK_TTL_SECONDS`, etc. all keep
working unchanged. Internal aliases `_remote_pids` and `_unlock_grants`
on the package root point at the SAME dict objects in the submodules,
so existing in-package mutations (set in stages, cleared in cleanup
callbacks) work without rewrite.

Test fix: tests/test_unlock_flow.py:test_expired_grant_returns_false
monkey-patches UNLOCK_TTL_SECONDS. The package-root alias is bound at
import time and won't propagate back to the submodule's read site, so
the test now patches `app.burnin.unlock.UNLOCK_TTL_SECONDS` directly.
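
A small illustration of why the package-root alias goes stale, using stand-in module objects rather than the real package. The module names and the value 300 are placeholders.

```python
import types

# Stand-in for app.burnin.unlock
unlock = types.ModuleType("unlock_sketch")
unlock.UNLOCK_TTL_SECONDS = 300  # placeholder value

# Stand-in for app.burnin, which re-exports the constant at import time
package = types.ModuleType("burnin_sketch")
package.unlock = unlock
package.UNLOCK_TTL_SECONDS = unlock.UNLOCK_TTL_SECONDS  # copies the value, not a live link

# A test (or runtime code) patches the submodule's constant...
unlock.UNLOCK_TTL_SECONDS = 1

stale = package.UNLOCK_TTL_SECONDS        # still the value captured at import time
live = package.unlock.UNLOCK_TTL_SECONDS  # follows the patch
```

Patching the alias on the package root would have the mirror-image problem: the submodule's own read site would never see it.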

Verification: 44/44 unit tests pass in container; /health 200;
container boots clean. routes.py, mailer.py, poller.py untouched —
the public API is identical.

Future: extract stages, task, _common in subsequent versions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:44:28 -04:00
Brandon Walter
6c20e57fd8 fix: live pool re-check before start_job + drop dead run_badblocks (1.0.0-29)
Closes the last open Codex finding (#5) and removes one piece of dead
code Codex flagged in passing.

#5 — Live pool re-check before burn-in start:
  Before this change, _is_unlocked compared the operator's unlock grant
  against the cached drives.pool_* row. If a drive was imported into a
  pool, mounted, or had ZFS labels written between the operator's
  unlock click and the next ~12s poll, burn-in could still start
  against the stale identity and silently destroy the new pool.

  start_job now calls a fresh ssh_client.fresh_pool_check_for_drive()
  immediately after the cached gate. That helper re-runs the three
  detection probes (zpool list -vHP / lsblk zfs_member / findmnt) over
  a fresh SSH session and returns the live answer for one devname.
  If it differs from cached state we invalidate any existing unlock
  grant and raise PoolMemberError with the FRESH pool name so the UI
  reflects current reality. If fresh shows free but cached said locked
  the drive came back to free since last poll — log it and allow.
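
The decision described above can be sketched as a pure function. This is an illustration under assumptions, not the actual start_job code: the exception class, function name, and callback parameters are hypothetical, and `None` stands for "drive is free".

```python
from typing import Callable, Optional


class PoolMemberErrorSketch(Exception):
    """Raised when the live probe shows the drive belongs to a pool."""


def recheck_before_start(cached_pool: Optional[str], fresh_pool: Optional[str],
                         invalidate_grant: Callable[[], None],
                         log: Callable[[str], None]) -> None:
    if fresh_pool == cached_pool:
        return  # cache matches reality; the cached gate's decision stands
    if fresh_pool is None:
        # Cached state said locked, live state says free: note it and allow.
        log(f"drive left pool {cached_pool!r} since last poll; allowing start")
        return
    # Live state shows a pool the cache did not know about: refuse.
    invalidate_grant()
    raise PoolMemberErrorSketch(fresh_pool)


events = []
recheck_before_start("tank", None, lambda: events.append("invalidated"), events.append)
try:
    recheck_before_start(None, "newpool", lambda: events.append("invalidated"), events.append)
    refused_with = None
except PoolMemberErrorSketch as exc:
    refused_with = str(exc)
```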

  Cost: ~200ms per burn-in start. For batch starts of 12 drives, that's
  2.4s extra latency — cheap against destroying a freshly-imported pool.

Dead code removal:
  ssh_client.run_badblocks() — no callers since 1.0.0-13 when the SSH
  badblocks logic was inlined into burnin._stage_surface_validate_ssh
  (with the asyncssh-signal-doesn't-actually-kill workaround). Removing
  the dead function also lets us drop the now-unused
  `from typing import Callable` import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 21:29:11 -04:00
Brandon Walter
066fbbc403 fix: address Codex audit findings (1.0.0-28)
Addresses 12 of 13 findings from the Codex tech-debt + security review
of versions 1.0.0-22 through 1.0.0-27. Item #5 (live pool re-check
before start_job) deferred — would add an SSH round-trip per start.

#1  Pool detection now treats zpool / lsblk / findmnt failures
    INDEPENDENTLY. Previously a single None blew away the whole map,
    so a host where lsblk lacks zfs_member info but zpool works would
    never lock pool members. Extended findmnt parser to recognise
    /dev/mapper/*, /dev/dm-*, /dev/md*, /dev/da*, /dev/ada* (LVM,
    device-mapper, MD RAID, and TrueNAS CORE / FreeBSD devnames).

#2  Admin role enforced on every settings mutation. New
    auth.require_admin() helper applied to GET /settings,
    POST /api/v1/settings, /test-smtp, /test-ssh. Previously any
    authenticated user (the CLI explicitly supports non-admin
    accounts) could rewrite SMTP/SSH/API secrets.

#3  First-user setup race closed. auth.create_user() now accepts
    bootstrap_only=True which wraps the existence check + insert in
    BEGIN IMMEDIATE so two concurrent /api/v1/auth/setup requests
    can't both create admin accounts during the bootstrap window.

#4  Case-insensitive uniqueness enforced via new
    `uniq_users_username_nocase` index. Login does NOCASE lookup so
    without this `Admin` and `admin` could coexist as distinct rows.

#6  New `session_cookie_secure` setting (default False for LAN/dev
    deploys, set True in production behind HTTPS) flips the session
    cookie's Secure flag. Defends against on-the-wire exposure when
    the dashboard is reachable over plain HTTP.

#7  Audit trail bound to authenticated identity. Burn-in start /
    cancel / unlock / drive reset all now use `_operator_for(request)`
    which reads `request.state.current_user.full_name|username`
    instead of the body's operator field. Logged-in users can no
    longer spoof attribution. Drive reset's literal-"operator"
    fallback (window._operator was never set) is also fixed by this.

#8  Login rate-limit race fixed. New `register_login_attempt()` is
    atomic check-AND-increment in synchronous code (no awaits inside),
    so a parallel burst can't slip past the threshold.
    `record_login_failure()` removed; `clear_login_failures()` now
    also drops any active lockout for a successful auth. Pre-existing
    bug where `tripped` was always False (so user_login_locked_out
    audit events never fired) also fixed.

#9  NVMe surface_validate post-format check now mirrors the SSH path:
    fails on FAILED health AND on real SMART attribute failures,
    soft-passes SSH-only failures (logged), surfaces warnings to the
    stage log without failing.

#10 retention.backup_db() now writes to `.tmp` then atomic-renames
    into the canonical daily slot — an interrupted backup leaves the
    tmp behind but doesn't corrupt the real snapshot. Scheduler marks
    last_run_date only on (prune AND backup) success so a transient
    failure gets retried within the 03:00 hour.

#11 /health DB probe now exercises the WRITE path via a temp-table
    INSERT/SELECT/COMMIT round-trip. Previously only read PRAGMA
    journal_mode + a row count, which silently passes on read-only
    mounts and broken-WAL conditions.

#12 security-scan.sh now fails loudly if `git fetch` or
    `git reset --hard origin/main` errors (was `|| true`, scanning
    stale code silently). pip-audit now runs in a throwaway
    python:3.12-slim container against requirements.txt instead of
    `docker exec`-ing into the live truenas-burnin container —
    cleaner separation, no transient package install on prod.

#13 Badblocks SSH stage no longer doubles its log_text. Previously
    appended every 20-line chunk during streaming AND the full
    accumulated output at end. Now only flushes the un-flushed tail
    (typically <20 lines). `result["output"]` stays in-memory only.
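
Items #3 and #4 above can be sketched together against an in-memory SQLite database. The table layout and the function name are assumptions; only `BEGIN IMMEDIATE` and the index name `uniq_users_username_nocase` come from the message.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that
# BEGIN / COMMIT / ROLLBACK can be issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, username TEXT NOT NULL, is_admin INTEGER NOT NULL DEFAULT 0)"
)
conn.execute(
    "CREATE UNIQUE INDEX IF NOT EXISTS uniq_users_username_nocase "
    "ON users (username COLLATE NOCASE)"
)


def create_bootstrap_admin(conn: sqlite3.Connection, username: str) -> bool:
    """Create the first admin only if the users table is still empty."""
    # BEGIN IMMEDIATE takes the write lock up front, so the check and the
    # insert cannot interleave with another writer's check and insert.
    conn.execute("BEGIN IMMEDIATE")
    try:
        (count,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()
        if count == 0:
            conn.execute(
                "INSERT INTO users (username, is_admin) VALUES (?, 1)", (username,)
            )
        conn.execute("COMMIT")
        return count == 0
    except Exception:
        conn.execute("ROLLBACK")
        raise


first = create_bootstrap_admin(conn, "admin")
second = create_bootstrap_admin(conn, "other")  # bootstrap window already closed

try:
    conn.execute("INSERT INTO users (username) VALUES (?)", ("Admin",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # 'Admin' collides with 'admin' under NOCASE
```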

Verification: all 44 unit tests pass in container; /health 200;
security scan returns 0 findings; deployed maple build is green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:48:16 -04:00
Brandon Walter
3a9bdc9e15 feat: CSP + security headers middleware + session-fixation defense (1.0.0-27)
#6 — defense-in-depth security headers:
* New _SecurityHeadersMiddleware emits five headers on every response:
  - Content-Security-Policy: tight default-src 'self', allow-list the
    three CDNs we actively load (unpkg for HTMX, cdnjs for QR codes,
    jsdelivr for xterm.js), plus 'unsafe-inline' for the inline script
    in settings.html and inline style in job_print.html. Tighten via
    nonces later if you want true CSP-level XSS protection.
  - X-Content-Type-Options: nosniff
  - Referrer-Policy: same-origin
  - X-Frame-Options: DENY (no clickjacking)
  - Permissions-Policy: camera/microphone/geolocation/interest-cohort
    all blocked
* Middleware ordering: SecurityHeaders -> AuthGate -> Session, so
  headers go on EVERY response including 401/403/redirects.
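
A dependency-free sketch of the outermost-middleware idea, written as plain ASGI rather than the project's actual `_SecurityHeadersMiddleware`. The class name is hypothetical, and the CSP and Permissions-Policy values are simplified placeholders, not the real policy.

```python
import asyncio

SECURITY_HEADERS = [
    (b"content-security-policy", b"default-src 'self'"),  # placeholder policy
    (b"x-content-type-options", b"nosniff"),
    (b"referrer-policy", b"same-origin"),
    (b"x-frame-options", b"DENY"),
    (b"permissions-policy",
     b"camera=(), microphone=(), geolocation=(), interest-cohort=()"),
]


class SecurityHeadersSketch:
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                present = {name.lower() for name, _ in headers}
                headers += [(k, v) for k, v in SECURITY_HEADERS if k not in present]
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_with_headers)


async def deny_all(scope, receive, send):
    """Stand-in for an inner auth gate that rejects every request."""
    await send({"type": "http.response.start", "status": 401, "headers": []})
    await send({"type": "http.response.body", "body": b""})


async def _demo():
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    await SecurityHeadersSketch(deny_all)({"type": "http"}, receive, send)
    return sent


messages = asyncio.run(_demo())
```

Because the header middleware wraps the auth gate, even the 401 produced by the inner app leaves with all five headers attached.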

#7 — session-fixation defense:
* request.session.clear() now runs BEFORE setting user_id/username on
  successful /login AND /api/v1/auth/setup. Discards any pre-login
  payload an attacker might have seeded the cookie with. Combined
  with SameSite=strict + the HMAC-signed Starlette session cookie,
  this closes the residual fixation surface.

Verified: curl -sSI /login returns all five headers; container boots
clean; /health 200; existing session for the operator continues to
work because we only clear on the LOGIN flow itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:28:13 -04:00
Brandon Walter
11218753ce feat: secret handling — status badges + redacted endpoint + rotation audit (1.0.0-26)
Closes #5 of the post-Codex hardening list:

* Settings UI now shows a `[set]` (green) or `[unset]` (gray) badge next
  to every password/key field. Tells the operator at a glance which
  secrets are configured without ever rendering the value.

* SSH key gets a granular source label: `set (environment variable)`,
  `set (mounted secret)`, or `set (stored in settings DB — prefer a
  mounted secret in production)`. Same hint copy in the field's help
  text now actively recommends `/run/secrets/ssh_key` over the textarea.

* New `GET /api/v1/settings/redacted` admin-only endpoint dumps every
  editable setting with secrets replaced by `***`, plus the per-secret
  status map. Useful for ops triage ("what's actually loaded?") without
  the secrets ever leaving the container or hitting a transcript.

* `POST /api/v1/settings` writes a `settings_secret_changed` audit event
  whenever a non-empty secret is rotated. Records field names, operator,
  source IP — never the value. Lets the audit page answer "who rotated
  the SMTP password last week?".

Internal: `_SECRET_FIELDS` constant in routes.py is now the single
source of truth for which fields get the redaction / audit treatment.
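A sketch of how one constant can drive both the redacted dump and the rotation audit. The membership of `SECRET_FIELDS` and both function names are assumptions for illustration; the actual contents of `_SECRET_FIELDS` in routes.py are not listed in this message.

```python
# Assumed membership; the real _SECRET_FIELDS may differ.
SECRET_FIELDS = frozenset({"smtp_password", "ssh_password", "ssh_key", "truenas_api_key"})


def redact_settings(settings, secret_fields=SECRET_FIELDS):
    """Return settings with secrets masked, plus a per-secret set/unset map."""
    redacted = {}
    status = {}
    for key, value in settings.items():
        if key in secret_fields:
            status[key] = "set" if value else "unset"
            redacted[key] = "***"          # the value itself never leaves this function
        else:
            redacted[key] = value
    return {"settings": redacted, "secret_status": status}


def rotated_secret_fields(submitted, secret_fields=SECRET_FIELDS):
    """Names (never values) of secrets that a settings POST is changing."""
    return sorted(k for k in secret_fields if submitted.get(k))


dump = redact_settings({"ssh_host": "nas.local", "ssh_password": "hunter2", "ssh_key": ""})
rotated = rotated_secret_fields({"smtp_password": "new-pw", "ssh_key": "", "ssh_host": "nas.local"})
```

An empty submitted secret is treated as "leave unchanged", so it produces no audit entry.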

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:15:57 -04:00
Brandon Walter
992e2c47b3 deps: pin transitive dependencies via lockfile (1.0.0-25)
Closes the unpinned-deps gotcha that broke production once already
(Starlette 1.0 shipping in 2026-04 changed the TemplateResponse
signature; our floating requirements.txt picked it up on the next
rebuild and the dashboard 500'd until 1.0.0-12 patched the call sites).

Mechanics:
* `requirements.in` — human-edited input, identical contents to the
  old `requirements.txt`.
* `requirements.txt` — now an autogenerated lockfile (876 lines, every
  transitive pinned with sha256 hashes). Regenerated via
  `scripts/regenerate-lockfile.sh`, which runs `pip-compile
  --generate-hashes --strip-extras` in a clean python:3.12-slim
  container so the script has no host dependencies.
* Dockerfile installs with `pip install --require-hashes` — refuses
  any package whose sha256 doesn't match the lockfile, defending
  against compromised PyPI mirrors and accidental version drift.

Verification:
* Container boots clean on the hash-locked install (1.0.0-25).
* /health returns 200 with all checks green.
* Daily security scan (pip-audit + bandit + gitleaks) returns 0 findings
  against the new lockfile.

Future deps changes: edit requirements.in, run the regenerate script,
review the diff, rebuild, commit both files. README §"Updating
dependencies" walks through it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:15:02 -04:00
Brandon Walter
1a19252019 feat: daily security scan — pip-audit + bandit + gitleaks (1.0.0-24)
Two layers of defence-in-depth scanning:

* `.forgejo/workflows/security-scan.yml` — runs pip-audit, bandit, and
  gitleaks on every push, every PR, and nightly at 07:00 UTC. Activates
  when the forge has a runner; harmless no-op until then. Bandit is
  invoked with `--skip B608` because every dynamic SQL build in this
  codebase binds data as parameters and concatenates only structural
  pieces (column lists, placeholder counts); real injection risks are
  still caught in code review.

* `scripts/security-scan.sh` + systemd `service`/`timer` — maple-side
  daily scanner that runs the same three tools entirely in containers
  (no host pollution). Differences from the forge job:
    - pip-audit runs INSIDE the live container against installed
      packages, catching new CVEs in transitives requirements.txt
      doesn't pin (e.g. starlette breaking changes shipping in 1.0).
    - bandit scans the LIVE deploy dir at
      ~/docker/stacks/truenas-burnin/app/, not a fresh git checkout —
      so drift between forge HEAD and prod surfaces here too.
    - gitleaks scans a managed clone in ~/scan-checkouts/, kept
      fast-forward to origin/main.
  Output: ~/security-scans/scan-YYYY-MM-DD/{summary,pip-audit,bandit,
  gitleaks}.txt with 30-day retention. ~/security-scans/findings.log
  appended on any non-zero exit. SECURITY_SCAN_WEBHOOK env in the
  service unit lets you POST findings to Mattermost / Slack / etc. once
  you decide where alerts should land.

First-run findings already actioned in this commit:

* pip-audit caught 3 CVEs in `pip` itself (CVE-2025-8869,
  CVE-2026-1703, CVE-2026-3219). Dockerfile now upgrades pip to
  >=26.0 before installing the rest.

* bandit's B608 SQL-injection heuristic flagged two f-string SQL
  constructions in `_upsert_drive` and `_fetch_drives_for_template`.
  Both were structural concatenation (column-list selection,
  '?,?,?' placeholder count), not data interpolation, but refactored
  from f-string to explicit concatenation so a future reviewer
  doesn't have to relitigate.

* bandit's B104 (binding to 0.0.0.0) annotated with inline `# nosec
  B104` — container deliberately binds all interfaces; nginx-proxy-
  manager fronts it.

* gitleaks: 0 secrets across 14 commits. Clean.
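The structural-concatenation pattern from the B608 item can be sketched as follows, using stdlib sqlite3 and a throwaway table. `fetch_drives_by_id` is a hypothetical helper, not `_upsert_drive` or `_fetch_drives_for_template`; the point is that only the '?,?,?' count is concatenated while every value is bound.

```python
import sqlite3


def fetch_drives_by_id(conn, ids):
    """Select rows for a variable-length id list without interpolating data."""
    ids = list(ids)
    if not ids:
        return []
    # Structural only: one "?" per id. No caller-supplied text enters the SQL.
    placeholders = ",".join("?" for _ in ids)
    sql = "SELECT id, devname FROM drives WHERE id IN (" + placeholders + ") ORDER BY id"
    return conn.execute(sql, ids).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drives (id INTEGER PRIMARY KEY, devname TEXT)")
conn.executemany("INSERT INTO drives VALUES (?, ?)", [(1, "sda"), (2, "sdb"), (3, "sdc")])
rows = fetch_drives_by_id(conn, [3, 1])
```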

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:07:22 -04:00
Brandon Walter
c589e3c8e5 docs: add README operator guide
First operator-facing README. Covers quick start (build, configure,
first-user login), the multi-drive batch workflow with concrete time
estimates, the four drive-lock states with their confirm tokens,
notable settings, daily report / notifications, ops cookbook (logs,
user CLI, backups, /health probe, DB reset), and an honest "known
gaps" list.

Cross-references CLAUDE.md (architecture + rationale) and SPEC.md
(per-version feature reference) for deeper docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:08:42 -04:00
Brandon Walter
d4c0770b9e feat: app-level login + hardening sweep (1.0.0-22 -> 1.0.0-23)
Two layered changes shipped in this branch:

== 1.0.0-22: app-level authentication ==

The dashboard previously had only an IP allowlist. Adds username +
bcrypt password auth, signed-cookie sessions, and a "first user setup"
flow.

* New app/auth.py: User dataclass, bcrypt hash/verify, get_user_by_id/
  username, create_user, touch_last_login, FastAPI `get_current_user`
  dependency. Session secret loaded from SESSION_SECRET env or persisted
  to /data/session_secret.
* New app/auth_cli.py: `python -m app.auth_cli list|reset|add` for
  out-of-band user management. Passwords always read from a TTY prompt.
* Schema: idempotent ALTER for `users` table (id, username unique,
  password_hash, full_name, is_admin, created_at, last_login_at).
* main.py: SessionMiddleware (HMAC-signed cookie, max-age 7 days,
  SameSite=strict — see hardening section) + _AuthGateMiddleware that
  populates request.state.current_user and bounces unauth'd HTML GETs
  to /login while returning 401 JSON for everything else.
* Routes: GET /login renders the first-user-setup form when the users
  table is empty, otherwise the sign-in form; POST /login;
  POST /api/v1/auth/setup (only works while the table is empty);
  GET|POST /logout.
* Bootstrap: env vars INITIAL_ADMIN_USERNAME + INITIAL_ADMIN_PASSWORD
  create the first admin on startup if both set AND users table empty.
  Ignored thereafter — change passwords via UI or CLI.
* Layout: header shows current_user.full_name|username + Logout link.
  Modal operator field auto-fills from the logged-in user via
  <meta name="default-operator"> rendered in layout (replaces the
  localStorage-only previous behaviour).
* requirements.txt: pinned bcrypt>=4.0,<5.0, itsdangerous>=2.1,
  python-multipart>=0.0.7. First step toward addressing the
  unpinned-deps gotcha.
* New app/templates/login.html with first-user-setup variant.

== 1.0.0-23: hardening sweep ==

Closes the eight-item gap audit:

* DB retention + automated backup. New app/retention.py runs daily at
  03:00 local. Nulls burnin_stages.log_text on stages older than
  retention_log_days (default 35), VACUUMs to reclaim pages, then runs
  `sqlite3 .backup` to /data/backups/app-YYYY-MM-DD.db keeping the
  retention_backup_keep most recent (default 14). Wired into the
  lifespan supervisor next to mailer/poller.

* CSRF mitigation. SessionMiddleware bumped to SameSite=strict so the
  browser refuses to send the session cookie on cross-site POSTs —
  removes the actual CSRF vector. Trade-off: external links into the
  app require re-auth.

* Login rate limiting. In-memory per-username AND per-source-IP failure
  counters in auth.py. 10 failures within 10 min trips a 15-min lockout
  for both keys. Returns HTTP 429 with a clear "try again in N min"
  message. Cleared on successful login.

* Login audit events. New event types in audit_events: user_login,
  user_login_failed, user_login_locked_out, user_logout,
  user_password_changed. All include source IP. Recorded via
  auth.audit_auth_event().

* Password change UI. Header link "Change password" opens
  templates/components/modal_password.html (current/new/confirm).
  Posts to POST /api/v1/auth/change-password — bcrypt-verifies current,
  requires >=8 char new pw, writes audit event.

* NVMe burn-in path. _stage_surface_validate now detects nvme*
  devnames and routes to _stage_surface_validate_nvme(), which runs
  `nvme format -s 1 --force` (secure erase; in nvme-cli, `-s 1` is a
  user-data erase and `-s 2` is the cryptographic erase). Takes seconds
  vs hours of badblocks and exercises the controller's secure-erase
  path. Falls back to badblocks if nvme-cli isn't installed. Followed
  by a post-format SMART check.

* Mounted-FS detection. ssh_client.get_mounted_drives() runs
  `findmnt -no SOURCE`, parses non-ZFS sources back to base devnames.
  Poller treats them as pool_name='(mounted)', pool_role='mounted'.
  Confirm token DESTROY MOUNTED FILESYSTEM, distinct purple styling,
  audit event mounted_drive_unlocked, daily-report banner picks it up.

* Deeper /health. Real readiness check — DB write probe (PRAGMA
  journal_mode), poller freshness (age <= 3x stale_threshold), SSH
  test_connection() when configured. Returns 503 when any check fails
  so a proxy/orchestrator can take the container out of rotation.
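The login rate-limiting item above can be sketched as a small in-memory limiter. The thresholds (10 failures, 10-minute window, 15-minute lockout) come from the message; `LoginLimiter`, its method names, and the injectable clock are hypothetical and not the structure used in auth.py.

```python
import time
from collections import defaultdict, deque

MAX_FAILURES = 10
WINDOW_SECONDS = 10 * 60
LOCKOUT_SECONDS = 15 * 60


class LoginLimiter:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = defaultdict(deque)   # key -> timestamps of recent failures
        self._locked_until = {}               # key -> time the lockout ends

    @staticmethod
    def _keys(username, ip):
        # Username and source IP are tracked independently.
        return (("user", username.lower()), ("ip", ip))

    def retry_after(self, username, ip):
        """Seconds until neither key is locked; 0 means an attempt is allowed."""
        now = self._clock()
        remaining = 0.0
        for key in self._keys(username, ip):
            remaining = max(remaining, self._locked_until.get(key, 0.0) - now)
        return remaining

    def record_failure(self, username, ip):
        now = self._clock()
        for key in self._keys(username, ip):
            window = self._failures[key]
            window.append(now)
            while window and now - window[0] > WINDOW_SECONDS:
                window.popleft()              # drop failures outside the window
            if len(window) >= MAX_FAILURES:
                self._locked_until[key] = now + LOCKOUT_SECONDS
                window.clear()

    def record_success(self, username, ip):
        # A successful login clears counters and lockouts for both keys.
        for key in self._keys(username, ip):
            self._failures.pop(key, None)
            self._locked_until.pop(key, None)
```

A route handler would call `retry_after()` first and answer HTTP 429 with the remaining minutes when it is non-zero.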

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:08:29 -04:00
Brandon Walter
5da1a1704f feat: pool-membership lock + cancellation hardening + smart_health refresh + tunables (1.0.0-13 -> 1.0.0-21)
Substantial feature + reliability sweep. Each version below was developed,
tested live against the maple/TrueNAS deployment, and Codex-reviewed
before bundling.

1.0.0-13 — asyncssh proc.kill() doesn't actually kill the remote process
  (sshd ignores SSH signal-channel requests by default), so a cancel of a
  long-running badblocks left the remote process running and proc.wait()
  hanging — pinning the asyncio.Semaphore slot forever.

  * Wrap long-lived commands in `sh -c 'echo PID:$$; exec <cmd>'` to
    capture the remote PID; store in burnin._remote_pids[job_id].
  * burnin._kill_remote_process(job_id) opens a fresh SSH session and
    issues `kill -9 <pid>` — sshd honours that.
  * Bound proc.wait() with asyncio.wait_for(timeout=15).
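  * burnin._active_tasks tracks every _run_job task so cancel_job and
    check_stuck_jobs can actually cancel the asyncio task (was DB-only
    before). Holding the tasks there also fixes the documented
    asyncio.create_task GC gotcha: the event loop keeps only weak
    references to tasks, so an unreferenced task can be collected
    mid-run.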
  * _run_job finalizer reads current state and skips the write if state
    != 'running' so cancelled/unknown aren't clobbered.
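A sketch of the PID-capture wrapper from 1.0.0-13, using only string handling so it runs without an SSH host. `wrap_with_pid` and `parse_pid_line` are hypothetical names; the wrapper shape `sh -c 'echo PID:$$; exec <cmd>'` is the one quoted above.

```python
import shlex


def wrap_with_pid(cmd):
    """Wrap an already-quoted shell command so its first output line is PID:<n>.

    `$$` is the wrapper shell's PID, and `exec` replaces that shell with the
    real command, so the printed PID is the one to kill later.
    """
    return "sh -c " + shlex.quote("echo PID:$$; exec " + cmd)


def parse_pid_line(line):
    """Return the PID from a 'PID:<digits>' line, or None for any other line."""
    line = line.strip()
    if line.startswith("PID:") and line[4:].isdigit():
        return int(line[4:])
    return None


wrapped = wrap_with_pid("badblocks -wsv /dev/sda")
pid = parse_pid_line("PID:48213\n")
kill_cmd = f"kill -9 {pid}"   # issued later over a fresh SSH session
```

The first streamed line is consumed as the PID and everything after it is treated as normal command output.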

1.0.0-14 — poller._upsert_drive ON CONFLICT only refreshed temperature/
  health/poll timestamps; devname/serial/model/size_bytes were stuck at
  first-INSERT values forever. After kernel SCSI re-enumeration two
  drives could both show as `sda`. Fixed by updating all six fields.
  Also added 7-day stale filter to _DRIVES_QUERY so removed drives drop
  off the dashboard while audit/burnin_jobs FKs stay intact.
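The upsert fix can be sketched with stdlib sqlite3. The table layout and the `disk_key` conflict column are assumptions for illustration; the real `drives` schema and conflict target are not shown in this message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE drives (
           disk_key      TEXT PRIMARY KEY,   -- hypothetical stable identifier
           devname       TEXT,
           serial        TEXT,
           model         TEXT,
           size_bytes    INTEGER,
           temperature_c INTEGER
       )"""
)

UPSERT = """
INSERT INTO drives (disk_key, devname, serial, model, size_bytes, temperature_c)
VALUES (:disk_key, :devname, :serial, :model, :size_bytes, :temperature_c)
ON CONFLICT(disk_key) DO UPDATE SET
    devname       = excluded.devname,      -- identity fields refreshed too,
    serial        = excluded.serial,       -- not only the volatile readings
    model         = excluded.model,
    size_bytes    = excluded.size_bytes,
    temperature_c = excluded.temperature_c
"""

drive = {"disk_key": "disk-1", "devname": "sdb", "serial": "WD-123",
         "model": "WD140EDGZ", "size_bytes": 14_000_000_000_000, "temperature_c": 34}
conn.execute(UPSERT, drive)

# After a SCSI re-enumeration the same disk shows up under a new devname.
conn.execute(UPSERT, {**drive, "devname": "sda", "temperature_c": 36})
rows = conn.execute("SELECT devname, temperature_c FROM drives").fetchall()
```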

1.0.0-15/-16 — pool-membership lock.
  * ssh_client.get_pool_membership() runs `zpool list -vHP` and parses
    the flattened TrueNAS output (container vdevs + their device children
    both appear at depth 1; section markers cache/log/spare/special/dedup
    switch the role).
  * ssh_client.get_zfs_member_drives() runs `lsblk -no NAME,FSTYPE -l`
    to detect drives carrying ZFS labels not in any active pool — they
    get pool_name='(exported)', pool_role='exported'.
  * Three idempotent ALTER TABLE migrations on drives:
    pool_name/pool_role/pool_seen_at.
  * burnin.start_job raises PoolMemberError if pool_name IS NOT NULL and
    the drive isn't in burnin._unlock_grants. Routes layer maps to 409
    with structured detail {pool_name, pool_role, pool_locked: true} so
    the frontend can render an unlock affordance.
  * POST /api/v1/drives/{id}/unlock accepts {confirm_token, operator,
    reason}. Token is the pool name for active pools, "DESTROY BOOT POOL"
    for boot-pool, "DESTROY EXPORTED POOL" for exported. Reason >= 5
    chars. TTL = UNLOCK_TTL_SECONDS = 600. Audit event types:
    pool_drive_unlocked / boot_pool_drive_unlocked /
    exported_pool_drive_unlocked.
  * Grants are in-memory only — container restart wipes them.
  * UI: lock icon (yellow/red/orange), pool pill, conditional Unlock vs
    Burn-In button. modal_unlock.html with type-to-confirm field.
    Live unlock countdown via tickUnlockCountdowns() in app.js.
  * Daily report: red banner listing every unlock event from the last
    24h, with operator + reason + timestamp.
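The confirm-token rules for the unlock endpoint can be sketched as below. The token strings and the five-character reason minimum come from the list above; the function names, the boot-pool detection by name, the "operator is required" rule, and the `data` role used in the demo are assumptions.

```python
def expected_confirm_token(pool_name, pool_role):
    """Token the operator must type before a pool-member drive is unlocked."""
    if pool_role == "exported":
        return "DESTROY EXPORTED POOL"
    if pool_name == "boot-pool":          # assumed detection by pool name
        return "DESTROY BOOT POOL"
    return pool_name                      # active pools: type the pool's own name


def validate_unlock(pool_name, pool_role, confirm_token, operator, reason):
    """Return a list of validation errors; an empty list means the request is OK."""
    errors = []
    if confirm_token != expected_confirm_token(pool_name, pool_role):
        errors.append("confirm_token does not match")
    if not operator.strip():
        errors.append("operator is required")
    if len(reason.strip()) < 5:
        errors.append("reason must be at least 5 characters")
    return errors


ok = validate_unlock("tank", "data", "tank", "bwalter", "replacing failed disk")
bad = validate_unlock("(exported)", "exported", "tank", "bwalter", "oops")
```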

1.0.0-17 — Codex review fail-open + XSS + structured-error fixes.
  * ssh_client.get_pool_membership / get_zfs_member_drives now return
    None on failure (vs {} for 'definitely empty'). poller passes
    update_pool=False to _upsert_drive on detection failure, preserving
    existing pool columns instead of clearing them. Without this fix a
    1-second SSH blip silently unlocked every drive.
  * mailer._build_unlock_banner_html escapes every interpolated field
    via html.escape() (was '<' only). Time filter switched to
    julianday() — string >= against datetime('now', '-1 day') compared
    formats with different separators ('T' vs ' ') and timezone
    suffixes, causing subtle off-by-N-hour inclusion.
  * app.js submitStart/submitBatchStart now detect the structured
    pool_locked 409 detail and auto-open the unlock modal for the
    offending drive (was [object Object] in toast).
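The time-filter bug in the mailer item is easy to reproduce with stdlib sqlite3. The `created_at` column name and the fixed timestamps are hypothetical; the comparison behaviour is standard SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit_events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO audit_events (created_at) VALUES (?)",
    [("2026-05-01T06:00:00+00:00",),    # six hours BEFORE the cutoff
     ("2026-05-01T18:00:00+00:00",)],   # six hours AFTER the cutoff
)

# Same shape as datetime('now', '-1 day'): space separator, no timezone suffix.
cutoff = "2026-05-01 12:00:00"

# Text comparison: at index 10, 'T' sorts after ' ', so every same-day row passes.
by_string = [r[0] for r in conn.execute(
    "SELECT created_at FROM audit_events WHERE created_at >= ? ORDER BY id", (cutoff,))]

# julianday() parses both formats into a number before comparing.
by_julian = [r[0] for r in conn.execute(
    "SELECT created_at FROM audit_events "
    "WHERE julianday(created_at) >= julianday(?) ORDER BY id", (cutoff,))]
```

`by_string` wrongly includes the 06:00 event, while `by_julian` keeps only the 18:00 one.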

1.0.0-18 — Codex grant-binding + commit-ordering fixes.
  * Unlock grants bound to the (pool_name, pool_role) observed at unlock
    time. _UnlockGrant dataclass; _is_unlocked and unlock_expiry
    invalidate the grant if the live row's pool identity has changed.
    Prevents an 'exported' unlock from carrying over when the drive
    turns out to be in active 'tank' or 'boot-pool'.
  * grant_pool_unlock now writes to _unlock_grants only AFTER db.commit()
    succeeds — previously a failed audit insert left an unaudited grant
    armed.
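Both 1.0.0-18 fixes can be sketched together: the grant remembers the pool identity it was issued for, and it is armed only after the audit write returns. `UnlockGrant`, `grant_unlock`, `is_unlocked`, the `write_audit` callable, and the `data` role are hypothetical; the 600-second TTL is from the 1.0.0-15/-16 notes.

```python
import time
from dataclasses import dataclass

UNLOCK_TTL_SECONDS = 600


@dataclass(frozen=True)
class UnlockGrant:
    pool_name: str
    pool_role: str
    expires_at: float


_grants = {}   # drive_id -> UnlockGrant (in-memory only)


def grant_unlock(drive_id, pool_name, pool_role, write_audit, now=None):
    now = time.time() if now is None else now
    write_audit()   # stands in for the audit insert + commit; may raise
    # Reached only if the audit write succeeded, so no unaudited grant is armed.
    _grants[drive_id] = UnlockGrant(pool_name, pool_role, now + UNLOCK_TTL_SECONDS)


def is_unlocked(drive_id, live_pool_name, live_pool_role, now=None):
    now = time.time() if now is None else now
    grant = _grants.get(drive_id)
    if grant is None:
        return False
    expired = now >= grant.expires_at
    identity_changed = (grant.pool_name, grant.pool_role) != (live_pool_name, live_pool_role)
    if expired or identity_changed:
        _grants.pop(drive_id, None)   # invalidate; a new unlock is required
        return False
    return True
```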

1.0.0-19 — Codex race + cancellation classification + test scaffold.
  * Partial unique index uniq_active_burnin_per_drive ON burnin_jobs
    (drive_id) WHERE state IN ('queued','running'). INSERT now wraps in
    try/except aiosqlite.IntegrityError -> ValueError so the read-then-
    insert race in start_job can't produce two queued rows for the same
    drive.
  * _run_job tracks was_cancelled flag; on bare task.cancel() (shutdown,
    future code paths) where DB state is still 'running', finalizer
    writes 'unknown' instead of mis-classifying as 'failed'.
  * tests/ stdlib unittest scaffold:
    - test_pool_parser.py (21 tests): mirror/raidz/draid container vdevs,
      single-disk depth-1, plural section markers, partition stripping,
      sdaa-style names, multi-pool, role reset between pools.
    - test_unlock_flow.py (18 tests): token validation per pool kind,
      identity-binding invalidation, TTL expiry, audit-commit-then-arm
      ordering, unique-active-burnin partial index.
    Run via `python -m unittest discover tests/`. No new dependencies.
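The partial unique index from the first 1.0.0-19 item can be exercised with stdlib sqlite3 (the app itself uses aiosqlite). The table here is a cut-down stand-in for `burnin_jobs`, and `enqueue` and the `passed` state are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE burnin_jobs ("
    " id INTEGER PRIMARY KEY, drive_id INTEGER NOT NULL, state TEXT NOT NULL)"
)
# At most one queued/running row per drive; finished rows are unconstrained.
conn.execute(
    "CREATE UNIQUE INDEX IF NOT EXISTS uniq_active_burnin_per_drive "
    "ON burnin_jobs (drive_id) WHERE state IN ('queued','running')"
)


def enqueue(conn, drive_id):
    """Insert a queued job; the index turns a lost race into a ValueError."""
    try:
        cur = conn.execute(
            "INSERT INTO burnin_jobs (drive_id, state) VALUES (?, 'queued')",
            (drive_id,),
        )
    except sqlite3.IntegrityError as exc:
        raise ValueError(f"drive {drive_id} already has an active burn-in") from exc
    return cur.lastrowid


first_job = enqueue(conn, 1)
try:
    enqueue(conn, 1)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Once the first job leaves the queued/running states, a new job for the same drive can be inserted again.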

1.0.0-20 — Spearfoot-inspired badblocks tunables.
  * surface_validate_block_size (-b, default 4096), surface_validate_
    block_buffer (-c, default 64), surface_validate_passes (-p, default
    1) exposed in Settings UI; persist via settings_store.json.
    Validation: block size must be a power of 2 between 512 and
    1048576. Defaults preserve existing behaviour. Bumping to 8192/64/1
    roughly halves runtime on multi-TB HDDs at ~2x RAM cost.
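The block-size rule can be sketched as a small validator plus an argument builder. The flag letters and defaults are the ones listed above; `validate_block_size` and `badblocks_args` are hypothetical names, not the settings_store functions.

```python
def validate_block_size(value):
    """Accept only powers of two between 512 and 1048576 inclusive."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("block size must be an integer")
    if not 512 <= value <= 1048576:
        raise ValueError("block size must be between 512 and 1048576")
    if value & (value - 1):               # a power of two has exactly one bit set
        raise ValueError("block size must be a power of 2")
    return value


def badblocks_args(block_size=4096, block_buffer=64, passes=1):
    """Build the tunable part of the badblocks command line."""
    return ["-b", str(validate_block_size(block_size)),
            "-c", str(block_buffer),
            "-p", str(passes)]


default_args = badblocks_args()
faster_args = badblocks_args(block_size=8192)
```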

1.0.0-21 — SMART overall-health column actually populated.
  * /api/v2.0/disk doesn't expose smart_health, so every drive defaulted
    to UNKNOWN forever (only burn-in stages ever wrote a real value).
  * ssh_client.get_smart_health_map([devnames]) runs `smartctl -H` for
    all drives in a single SSH session, deterministically delimited with
    @@devname@@ ... @@END@@ markers. Returns {devname: PASSED|FAILED|
    UNKNOWN} or None on SSH failure.
  * poller calls it every 5th cycle (~1 min at default 12s interval),
    caches in _state['smart_health_cache'] so transient failures preserve
    the previous values.
  * Dashboard CSS: col-smart min-width 150 -> 95, horizontal padding 14
    -> 6 so Short/Long SMART columns fit comfortably on a 13-inch
    display.
  * 5 additional parser tests (44 total, all passing).
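A sketch of parsing the marker-delimited output described for get_smart_health_map. `parse_smart_health` is a hypothetical name, the sample text is made up, and matching on the ATA-style "self-assessment test result" line is an assumption; drives that report health in another format fall through to UNKNOWN here.

```python
def parse_smart_health(output, devnames):
    """Map each requested devname to PASSED, FAILED, or UNKNOWN."""
    result = {dev: "UNKNOWN" for dev in devnames}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line == "@@END@@":
            current = None                          # section closed
        elif line.startswith("@@") and line.endswith("@@") and len(line) > 4:
            current = line[2:-2]                    # section header: @@devname@@
        elif current in result and "self-assessment test result:" in line:
            verdict = line.rsplit(":", 1)[1].strip()
            if verdict == "PASSED":
                result[current] = "PASSED"
            elif verdict.startswith("FAILED"):
                result[current] = "FAILED"
    return result


sample = """@@sda@@
SMART overall-health self-assessment test result: PASSED
@@END@@
@@sdb@@
SMART overall-health self-assessment test result: FAILED!
@@END@@
@@sdc@@
Smartctl open device: /dev/sdc failed: No such device
@@END@@
"""
health = parse_smart_health(sample, ["sda", "sdb", "sdc", "sdd"])
```

A caller would return None instead of this map when the SSH session itself fails, so the poller can keep its cached values.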

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:25:56 -04:00
Brandon Walter
b85bac7686 chore: re-sync deployed work that pre-dates this session
These files have been live on maple for a while via direct scp/edit but
were never committed back to the forge. Restoring parity so the repo
matches the running container's source tree before the new feature work
on top.

- app/terminal.py: NEW. xterm.js <-> asyncssh PTY bridge wired into the
  log drawer's Terminal tab. Was added on the deploy host only.
- app/truenas.py: misc REST client tweaks deployed but not committed.
- CLAUDE.md / SPEC.md: documentation drift — Stage 8 terminal section,
  updated file map.
- docker-compose.yml / requirements.txt: minor infra deltas already
  active on maple.

No behaviour change vs the running container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:24:42 -04:00
Brandon Walter
289c6d8f1a fix: reset clears burn-in dashboard column via last_reset_at timestamp
Add last_reset_at column to drives table (migration-safe ALTER TABLE).
_fetch_burnin_by_drive now excludes jobs created before the drive's
last_reset_at, so the dashboard burn-in column goes blank after reset
while the History page still shows the full job record.

reset_drive stamps last_reset_at = now() alongside clearing smart_attrs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 11:24:32 -05:00
Brandon Walter
645d55cfcc docs: update CLAUDE.md and SPEC.md for Stage 8 (live terminal)
Documents WebSocket terminal architecture, xterm.js lazy loading,
message protocol, tab lifecycle, and reconnect behavior.

SPEC.md: updated drawer tabs (4 tabs including Terminal), added WS
endpoint, corrected bad block threshold default (0, not 2), version
bumped to 1.0.0-8.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 11:16:29 -05:00
Brandon Walter
5a802bff2e feat: live SSH terminal in drawer (xterm.js + asyncssh WebSocket)
Adds a Terminal tab to the log drawer with a full PTY session bridged
over WebSocket to the TrueNAS SSH host. xterm.js loaded lazily on
first tab open. Supports resize, paste, full color, and reconnect.

- app/terminal.py: asyncssh PTY ↔ WebSocket bridge
- routes.py: @router.websocket("/ws/terminal")
- dashboard.html: Terminal tab + drawer panel
- app.js: xterm.js lazy load, init, onData, resize observer, reconnect
- app.css: terminal panel styles (no padding, overflow hidden)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 09:30:56 -05:00
Brandon Walter
70c26121a8 ui: move version badge next to title in header left side
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 09:23:10 -05:00
Brandon Walter
22ed2c6e12 fix: JS syntax error breaking all buttons; add settings restart banner
app.js: stages.forEach callback in _drawerRenderBurnin was missing its
closing });, causing a syntax error that prevented the entire script
from loading — all click handlers (Short/Long SMART, Burn-In, cancel)
were unregistered as a result.

settings.html: add a prominent yellow restart banner with the docker
command (docker compose restart app) that appears after saving any
system settings that require a container restart to take effect.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 08:57:57 -05:00
Brandon Walter
fc33c0d11e docs: update CLAUDE.md for Stage 7; bump version to 1.0.0-7
Documents all Stage 7 features: SSH burn-in architecture, SMART attr
monitoring, drive reset, version badge, stats polish, new env vars,
new API routes, and real-TrueNAS cutover steps.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 08:13:21 -05:00
Brandon Walter
2dff58bd52 Stage 7: SSH architecture, SMART attribute monitoring, drive reset, and polish
SSH (app/ssh_client.py — new):
- asyncssh-based client: start_smart_test, poll_smart_progress, abort_smart_test,
  get_smart_attributes, run_badblocks with streaming progress callbacks
- SMART attribute table: monitors attrs 5/10/188/197/198/199 for warn/fail thresholds
- Falls back to REST API / mock simulation when ssh_host is not configured

Burn-in stages updated (burnin.py):
- _stage_smart_test: SSH path polls smartctl -a, stores raw output + parsed attributes
- _stage_surface_validate: SSH path streams badblocks, counts bad blocks vs configurable threshold
- _stage_final_check: SSH path checks smartctl attributes; DB fallback for mock mode
- New DB helpers: _append_stage_log, _update_stage_bad_blocks, _store_smart_attrs,
  _store_smart_raw_output

Database (database.py):
- Migrations: burnin_stages.log_text, burnin_stages.bad_blocks,
  drives.smart_attrs (JSON), smart_tests.raw_output

Settings (config.py + settings_store.py):
- ssh_host, ssh_port, ssh_user, ssh_password, ssh_key — all runtime-editable
- SSH section in Settings UI with Test SSH Connection button

Webhook (notifier.py):
- Added bad_blocks and timestamp fields to payload per SPEC

Drive reset (routes.py + drives_table.html):
- POST /api/v1/drives/{id}/reset — clears SMART state, smart_attrs; audit logged
- Reset button visible on drives with completed test state (no active burn-in)

Log drawer (app.js):
- Burn-In tab: shows raw stage log_text (SSH output) with bad block highlighting
- SMART tab: shows SMART attribute table with warn/fail colouring + raw smartctl output

Polish:
- Version badge (v1.0.0-6d) in header via Jinja2 global
- Parallel burn-in warning when max_parallel_burnins > 8 in Settings
- Stats page: avg duration by drive size + failure breakdown by stage
- settings.html: SSH section with key textarea, parallel warn div

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 08:09:30 -05:00
Brandon Walter
4ab54d7ed8 Add temp thresholds, bad block threshold, editable system settings, check for updates, history completed time
- config.py: add temp_warn_c (46°C), temp_crit_c (55°C), bad_block_threshold (0), app_version
- settings_store.py: expose all new fields + system settings (truenas_base_url, api_key, poll_interval, etc.) as editable; save to JSON for persistence; add validation for log_level, poll/stale intervals, temp range
- renderer.py: _temp_class() now reads temp_warn_c/temp_crit_c from settings instead of hardcoded 40/50
- burnin.py: precheck uses settings.temp_crit_c; fix NameError bug (_execute_stages referenced 'profile' that was not in scope)
- routes.py: add GET /api/v1/updates/check (Forgejo releases API); settings_page passes new editable fields; save_settings skips empty truenas_api_key like smtp_password
- settings.html: move system settings from read-only card into editable form; add temp/bad-block fields to Burn-In Behavior; add Check for Updates button; restart-required indicator on save
- history.html: add Completed (finished_at) column next to Started
- app.css: toast container shifts up when drawer is open (body.drawer-open)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 07:43:23 -05:00
Brandon Walter
c0f9098779 feat: add log drawer (Stage 7a)
Click any drive row to slide up a drawer with three tabs:
- Burn-In: stage timeline with state icons, elapsed timers, error lines in red
- SMART: short and long test status, timestamps, progress
- Events: last 50 audit events for the drive (newest first)

Drawer auto-refreshes on every SSE poll cycle. Row highlights blue
while drawer is open. Clicking same row or pressing Esc closes it.
Auto-scroll toggle keeps burn-in tab pinned to bottom during active runs.

New API: GET /api/v1/drives/{id}/drawer

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 07:22:53 -05:00
Brandon Walter
b73b5251ae Initial commit — TrueNAS Burn-In Dashboard v0.5.0
Full-stack burn-in orchestration dashboard (Stages 1–6d complete):
FastAPI backend, SQLite/WAL, SSE live dashboard, mock TrueNAS server,
SMTP/webhook notifications, batch burn-in, settings UI, audit log,
stats page, cancel SMART/burn-in, drag-to-reorder stages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 00:08:29 -05:00