Commit graph

10 commits

Author SHA1 Message Date
Brandon Walter
28d046f42e fix: SMART overlay shows terminal states + reconciles orphans (1.0.0-49)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
The Long SMART column showed "—" while the Burn-In column showed
"FAILED (LONG SMART)" — clear contradiction. Two reasons:

1. The overlay query in _drives_helpers only fetched SMART stage
   data for burn-ins in ('running','queued') state. Failed/passed/
   cancelled jobs got their stage data filtered out, so the SMART
   columns went blank when you most wanted to see them. Removed
   the state filter so all burn-ins overlay.

2. A pre-busy-timeout `database is locked` failure mode (sdj job 5
   from Mar 2026) left long_smart stage rows recorded as state=
   'running' even though the parent job ended in state='failed'.
   The overlay now translates that orphan state at render time:
   if the parent job is failed/cancelled/unknown but the stage is
   still 'running', display the stage as failed (or the parent's
   terminal state) so the column matches the Burn-In column.

The translation is purely display-time; no DB writes. error_text
falls back to the parent job's error_text when the stage's own is
NULL, so the operator sees what actually broke.
2026-05-09 11:46:45 -07:00
Brandon Walter
383258df97 feat: phase caption + bad-block badge + per-pattern history (1.0.0-47)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
Three additions to the surface_validate drawer block:

1. **Phase caption** below the meters: "Pattern 2 of 4 · Verify 0x55
   · 47% within phase". Pure JS — no schema change. Makes the
   visual grammar explicit without needing the operator to mentally
   map phase=4 to "verifying pattern 2".

2. **Bad-block badge** in the vitals row. Green at 0, red at >0.
   The number was already on the stage row but burying it in the
   log felt wrong — surfacing it next to temp/speed/ETA keeps it
   in eye-line during long runs.

3. **Per-pattern duration history** below the caption. New
   bb_phase_history JSON column (idempotent migration) maps
   {phase_num: ts}. Parser stamps the timestamp on every phase
   transition (and on stage entry for phase 1). Drawer diffs
   consecutive write-phase starts to derive "0xaa: 14h 22m"
   for completed patterns. Once one pattern is done you can
   predict the rest without leaving the drawer.

Persistence is idempotent: re-entry of the same phase keeps the
original timestamp so a transient parser reset doesn't blow away
history. JSON parse failures fail gracefully (no row rendered).
2026-05-08 23:23:02 -07:00
Brandon Walter
6b2367b892 feat: vital-signs strip above per-pattern meters (1.0.0-46)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
The drawer's surface_validate area now leads with a row of operator
vitals computed from data already in the response:

- Temp: drive temperature with cool/warm/hot colour (≥48 red, ≥42 yellow)
- Speed: live MB/s, NULL until second progress sample arrives
- Elapsed: time since stage started_at
- ETA: extrapolated from overall progress; suppressed under 0.5%
  to avoid the "47 days remaining" artefact early in pattern 1

Live MB/s comes from a new bb_mbps column on burnin_stages, computed
in the badblocks parser as (delta_overall_pct / 800) * size_bytes / dt.
Skipped on phase transitions (per-phase pct resets) and sub-second
samples (noisy).

Drawer endpoint now passes drive.temperature_c through; JS stashes
the latest drive object in _DRAWER_LAST_DRIVE so the burn-in renderer
can pull it for the vitals row without changing call signatures.

Tightened table CSS in this same session is unrelated and shipped
already in earlier rounds via the bind-mounted app.css.
2026-05-08 23:13:58 -07:00
Brandon Walter
30062affc2 feat: per-pattern badblocks meters in drive drawer (1.0.0-44)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
User asked for one meter per badblocks pattern. The drawer now shows
4 meters (one per pattern: 0xaa / 0x55 / 0xff / 0x00), each split
into write (left, blue) + verify (right, green) halves so a glance
shows both which pattern is current AND whether you're writing or
verifying within it.

Backend:
- New columns burnin_stages.bb_phase (1-8) + bb_phase_pct (0-100)
  via idempotent ALTER TABLE migration
- _update_stage_bb_phase() helper called from the badblocks parser
  on every tick (when phase or percent changes)
- /api/v1/drives/{id}/drawer SELECT now returns the new fields

Frontend (app.js + app.css):
- _drawerRenderBadblocksMeters(phase, phasePct) computes per-pattern
  fill state and emits 4-meter HTML with W/V sub-labels
- Conditional render: only shows when stage_name === 'surface_validate'
  AND bb_phase is set, so historical pre-1.0.0-44 stage rows render
  unchanged (single percent, no meters)

3 new tests cover the migration columns, single-tick persistence,
and overwrite-on-second-tick. Total suite: 75 tests.

Image rebuilt and tagged but NOT deployed — 4 burn-ins are running
right now and a recreate would SIGHUP them. Deploy with
`docker compose up -d` after the current batch finishes; the
migration runs at init and the meters light up for the next batch.
2026-05-08 22:34:35 -07:00
Brandon Walter
8ae84862de infra: rename truenas-burnin → nas-burnin (1.0.0-41)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
Matches the 1.0.0-38 product display rename. Touches every
infrastructure identifier:

- container_name: truenas-burnin → nas-burnin
- forge URL in /api/v1/updates/check
- security-scan: REPO_URL, REPO, DEPLOY_DIR, systemd unit description
- run-tests.sh default container name
- doc paths in README/SPEC/CLAUDE
- in-app instruction strings (login.html, settings.html, auth_cli.py)

Maple migration done in lockstep:
  docker compose down (truenas-burnin)
  mv ~/docker/stacks/{truenas-burnin,nas-burnin}
  systemd unit ExecStart updated + daemon-reload
  docker compose up -d --build → container nas-burnin
  Old image truenas-burnin-app removed (~12 GB reclaimed)
  Stale top-level orphans cleaned (config.py, poller.py, routes.py,
  truenas.py, tests/) — all dead since pre-split refactors

Forge repo rename (git.hellocomputer.xyz/brandon/truenas-burnin →
nas-burnin) is a separate UI-only step. Forgejo redirects the old
URL after rename, so this commit can be pushed to the existing
remote first; remote URL gets updated locally once you rename.
2026-05-04 07:16:02 -07:00
Brandon Walter
8033161efb fix: address Codex routes-split follow-up review (1.0.0-39)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
Three low-severity findings from Codex on the 1.0.0-37 split:

1. Trim dead package-level imports in routes/__init__.py — only
   `poller` was actually used; auth/burnin/mailer/settings_store
   were the exact shadowing footgun the absolute sub-router
   imports work around. Reword the comment block to match.

2. Thread `operator` through smart_start + smart_cancel.
   Previously the JS client sent it but the server ignored it;
   add audit_events rows (smart_test_start / smart_test_cancel)
   so the field is actually meaningful.

3. New tests/test_routes_resolution.py — guards two historical
   regressions: /api/v1/burnin/export.csv must register before
   /{job_id} (FastAPI int-coerce 422 trap) and the mailer
   back-compat shim `from app.routes import _fetch_drives_for_template`
   must keep importing. Plus a sub-router enumeration test that
   catches missed include_router calls in future splits.
2026-05-03 15:04:38 -05:00
Brandon Walter
40dac9090d refactor: extract drives + burnin routes (1.0.0-37)
Largest routes/ slice yet — drives.py (8 endpoints) and burnin.py
(4 endpoints). Drives helpers live in _drives_helpers.py so the
dashboard SSE handler in routes/__init__.py and mailer.py can both
keep using them via re-export.

routes/__init__.py shrinks from 815 → 163 LoC; only the dashboard /
and /sse/drives stream remain there. Routes split is now functionally
complete: 12 files, ~1800 LoC distributed by feature.
2026-05-03 09:59:15 -04:00
Brandon Walter
fc7fb4c714 refactor: extract settings routes (1.0.0-36)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
Pulls /settings + /api/v1/settings* + /api/v1/settings/redacted +
/test-smtp + /test-ssh into routes/settings.py (155 LoC). All five
endpoints share the admin gate from auth.require_admin and the
secret_status / SECRET_FIELDS helpers, so the boundary is clean.

routes/__init__.py shrank from 960 -> 815 LoC. Cleanup bonus: dropped
an orphan "# Print view (must be BEFORE /{job_id} int route)" comment
that referenced the print-view endpoint already extracted to history.py.

Verification: 59/59 tests pass; /settings 401 (auth-gated as expected);
/login still 200; container boots clean at 1.0.0-36.

Remaining slices: routes/burnin.py (start + cancel + export.csv +
{job_id}) and routes/drives.py (the biggest, with the unlock route
that's currently interleaved between the burnin endpoints in
__init__.py — drives extraction unblocks burnin extraction).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 09:48:24 -04:00
Brandon Walter
3c39344069 refactor: extract history + audit + stats + report routes (1.0.0-35)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
Continues the routes/ package split — four more clean extractions, all
following the same absolute-import pattern documented in the 1.0.0-34
gotcha note.

* routes/history.py (184 LoC) — /history, /history/{id}, and the
  /history/{id}/print view that MUST register before the {id} int route
  to avoid FastAPI's int("print") 422. Helpers _PAGE_SIZE,
  _ALL_STATES, _HISTORY_QUERY, _state_where moved with the endpoints.
  B608 nosec annotated on the count_sql f-string (it's two hardcoded
  literals; user input goes through bound params).

* routes/audit.py (53 LoC) — /audit page only. Owns _AUDIT_QUERY +
  _AUDIT_EVENT_COLORS.

* routes/stats.py (111 LoC) — /stats analytics page. Pure aggregation
  queries against burnin_jobs/drives, no shared helpers beyond
  stale_context.

* routes/report.py (24 LoC) — POST /api/v1/report/send. Now requires
  admin (was open to any authenticated user; sending mail is a side
  effect non-admins shouldn't be able to fire — same principle as the
  settings mutation gates added in 1.0.0-28).

routes/__init__.py shrank from 1261 -> 960 LoC. Remaining work:
drives, burnin, settings, dashboard — same pattern. Each future slice
will use the `import app.routes.X as _Y` absolute-import gotcha
workaround from 1.0.0-34.

Verification: 59/59 tests pass; /login 200 (public); /history /audit
/stats 401 (correctly auth-gated by middleware); container boots
clean at 1.0.0-35.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 09:44:22 -04:00
Brandon Walter
aa7822d6ce feat: rate limiter + mypy + lifecycle tests + routes/ split (1.0.0-33/-34)
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
Closes the four remaining items from the post-Codex hardening list.

#1 Rate-limit unlock + change-password endpoints (1.0.0-33)
   * Generalised the existing login limiter into a reusable
     `_RateLimiter` class in app/auth.py. Atomic check-then-increment
     in synchronous code so a parallel asyncio burst can't slip past
     the threshold.
   * `unlock_limiter` (5 attempts in 10 min → 10 min lockout) gates
     POST /api/v1/drives/{id}/unlock per-drive AND per-source-IP.
   * `pwchange_limiter` (5 in 10 min → 15 min lockout) gates
     POST /api/v1/auth/change-password per-user AND per-IP.
   * Both clear on successful operation. The login limiter keeps its
     existing `register_login_attempt` / `clear_login_failures`
     facade names so external callers don't change.

#3 mypy in security-scan (1.0.0-33)
   * Added a 4th tool to the daily scan + forge workflow. Runs in a
     throwaway python:3.12-slim container against the deploy dir,
     exit code is informational only (NOT included in the
     `TOTAL_EXIT` failure sum). Findings land in
     ~/security-scans/scan-YYYY-MM-DD/mypy.txt for ratchet-down
     work over time.
   * Forge job uses `continue-on-error: true` so it doesn't fail the
     workflow until the type-debt baseline is annotated down.

#4 Lifecycle test coverage (1.0.0-33)
   * New tests/test_lifecycle.py with 15 cases:
     - TestCommonHelpers (7 tests): _start_stage, _finish_stage
       success/failure/error-preservation, _recalculate_progress
       weighted math, _is_cancelled, _append_stage_log.
     - TestStartCancelJob (4 tests): start_job inserts queued row +
       correct stage list, duplicate-active rejection, cancel marks
       state, cancel returns False on terminal-state jobs.
     - TestRateLimiter (4 tests): under-threshold ok, trips at
       threshold, clear removes both counter + lockout, separate
       keys don't interfere.
   * Total goes from 44 to 59 tests; closes the orchestration-path
     coverage gap Codex flagged.

#2 Partial routes.py split (1.0.0-34)
   * routes.py → routes/ package. Same staged-extraction pattern as
     the burnin.py split.
   * routes/auth.py — login/logout/setup/change-password (170 LoC).
   * routes/system.py — /health, /ws/terminal, /api/v1/updates/check
     (136 LoC).
   * routes/_helpers.py — shared utilities used by both extracted
     modules and the still-monolithic remainder: client_ip,
     operator_for, is_stale, stale_context, secret_status,
     SECRET_FIELDS (97 LoC).
   * routes/__init__.py shrank from 1568 LoC to 1261. Future slices
     can extract drives, burnin, history, settings the same way.
   * GOTCHA recorded in commit body: `from app import auth` at the
     top of __init__.py binds `auth` as an attribute on the package
     namespace, so `from . import auth as _auth_routes` finds the
     OUTER module and yields `app.auth` instead of the submodule.
     Fix is `import app.routes.auth as _auth_routes` (absolute).
     This bit me once at deploy time; container failed to start
     with `module 'app.auth' has no attribute 'router'`.

Verification: 59/59 tests pass (44 existing + 15 new); container
boots clean at 1.0.0-34; /health 200 with all checks green; security
scan still clean (mypy informational findings ignored from totals).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 09:29:53 -04:00