nas-burnin

Author	SHA1	Message	Date
Brandon Walter	28d046f42e	fix: SMART overlay shows terminal states + reconciles orphans (1.0.0-49) The Long SMART column showed "—" while the Burn-In column showed "FAILED (LONG SMART)", a clear contradiction. Two reasons: 1. The overlay query in _drives_helpers only fetched SMART stage data for burn-ins in ('running','queued') state. Failed/passed/cancelled jobs got their stage data filtered out, so the SMART columns went blank when you most wanted to see them. Removed the state filter so all burn-ins overlay. 2. A pre-busy-timeout `database is locked` failure mode (sdj job 5 from Mar 2026) left long_smart stage rows recorded as state='running' even though the parent job ended in state='failed'. The overlay now translates that orphan state at render time: if the parent job is failed/cancelled/unknown but the stage is still 'running', display the stage as failed (or the parent's terminal state) so the column matches the Burn-In column. The translation is purely display-time; no DB writes. error_text falls back to the parent job's error_text when the stage's own is NULL, so the operator sees what actually broke.	2026-05-09 11:46:45 -07:00
Brandon Walter	383258df97	feat: phase caption + bad-block badge + per-pattern history (1.0.0-47) Three additions to the surface_validate drawer block: 1. Phase caption below the meters: "Pattern 2 of 4 · Verify 0x55 · 47% within phase". Pure JS, no schema change. Makes the visual grammar explicit without needing the operator to mentally map phase=4 to "verifying pattern 2". 2. Bad-block badge in the vitals row. Green at 0, red at >0. The number was already on the stage row but burying it in the log felt wrong; surfacing it next to temp/speed/ETA keeps it in eye-line during long runs. 3. Per-pattern duration history below the caption. New bb_phase_history JSON column (idempotent migration) maps {phase_num: ts}. Parser stamps the timestamp on every phase transition (and on stage entry for phase 1). Drawer diffs consecutive write-phase starts to derive "0xaa: 14h 22m" for completed patterns. Once one pattern is done you can predict the rest without leaving the drawer. Persistence is idempotent: re-entry of the same phase keeps the original timestamp so a transient parser reset doesn't blow away history. JSON parse failures fail gracefully (no row rendered).	2026-05-08 23:23:02 -07:00
Brandon Walter	6b2367b892	feat: vital-signs strip above per-pattern meters (1.0.0-46) The drawer's surface_validate area now leads with a row of operator vitals computed from data already in the response: - Temp: drive temperature with cool/warm/hot colour (≥48 red, ≥42 yellow) - Speed: live MB/s, NULL until second progress sample arrives - Elapsed: time since stage started_at - ETA: extrapolated from overall progress; suppressed under 0.5% to avoid the "47 days remaining" artefact early in pattern 1 Live MB/s comes from a new bb_mbps column on burnin_stages, computed in the badblocks parser as (delta_overall_pct / 800) * size_bytes / dt. Skipped on phase transitions (per-phase pct resets) and sub-second samples (noisy). Drawer endpoint now passes drive.temperature_c through; JS stashes the latest drive object in _DRAWER_LAST_DRIVE so the burn-in renderer can pull it for the vitals row without changing call signatures. Tightened table CSS in this same session is unrelated and shipped already in earlier rounds via the bind-mounted app.css.	2026-05-08 23:13:58 -07:00
Brandon Walter	30062affc2	feat: per-pattern badblocks meters in drive drawer (1.0.0-44) User asked for one meter per badblocks pattern. The drawer now shows 4 meters (one per pattern: 0xaa / 0x55 / 0xff / 0x00), each split into write (left, blue) + verify (right, green) halves so a glance shows both which pattern is current AND whether you're writing or verifying within it. Backend: - New columns burnin_stages.bb_phase (1-8) + bb_phase_pct (0-100) via idempotent ALTER TABLE migration - _update_stage_bb_phase() helper called from the badblocks parser on every tick (when phase or percent changes) - /api/v1/drives/{id}/drawer SELECT now returns the new fields Frontend (app.js + app.css): - _drawerRenderBadblocksMeters(phase, phasePct) computes per-pattern fill state and emits 4-meter HTML with W/V sub-labels - Conditional render: only shows when stage_name === 'surface_validate' AND bb_phase is set, so historical pre-1.0.0-44 stage rows render unchanged (single percent, no meters) 3 new tests cover the migration columns, single-tick persistence, and overwrite-on-second-tick. Total suite: 75 tests. Image rebuilt and tagged but NOT deployed: 4 burn-ins are running right now and a recreate would SIGHUP them. Deploy with `docker compose up -d` after the current batch finishes; the migration runs at init and the meters light up for the next batch.	2026-05-08 22:34:35 -07:00
Brandon Walter	8ae84862de	infra: rename truenas-burnin → nas-burnin (1.0.0-41) Matches the 1.0.0-38 product display rename. Touches every infrastructure identifier: - container_name: truenas-burnin → nas-burnin - forge URL in /api/v1/updates/check - security-scan: REPO_URL, REPO, DEPLOY_DIR, systemd unit description - run-tests.sh default container name - doc paths in README/SPEC/CLAUDE - in-app instruction strings (login.html, settings.html, auth_cli.py) Maple migration done in lockstep: docker compose down (truenas-burnin) mv ~/docker/stacks/{truenas-burnin,nas-burnin} systemd unit ExecStart updated + daemon-reload docker compose up -d --build → container nas-burnin Old image truenas-burnin-app removed (~12 GB reclaimed) Stale top-level orphans cleaned (config.py, poller.py, routes.py, truenas.py, tests/), all dead since pre-split refactors Forge repo rename (git.hellocomputer.xyz/brandon/truenas-burnin → nas-burnin) is a separate UI-only step. Forgejo redirects the old URL after rename, so this commit can be pushed to the existing remote first; remote URL gets updated locally once you rename.	2026-05-04 07:16:02 -07:00
Brandon Walter	8033161efb	fix: address Codex routes-split follow-up review (1.0.0-39) Three low-severity findings from Codex on the 1.0.0-37 split: 1. Trim dead package-level imports in routes/__init__.py: only `poller` was actually used; auth/burnin/mailer/settings_store were the exact shadowing footgun the absolute sub-router imports work around. Reword the comment block to match. 2. Thread `operator` through smart_start + smart_cancel. Previously the JS client sent it but the server ignored it; add audit_events rows (smart_test_start / smart_test_cancel) so the field is actually meaningful. 3. New tests/test_routes_resolution.py: guards two historical regressions: /api/v1/burnin/export.csv must register before /{job_id} (FastAPI int-coerce 422 trap) and the mailer back-compat shim `from app.routes import _fetch_drives_for_template` must keep importing. Plus a sub-router enumeration test that catches missed include_router calls in future splits.	2026-05-03 15:04:38 -05:00
Brandon Walter	40dac9090d	refactor: extract drives + burnin routes (1.0.0-37) Largest routes/ slice yet — drives.py (8 endpoints) and burnin.py (4 endpoints). Drives helpers live in _drives_helpers.py so the dashboard SSE handler in routes/__init__.py and mailer.py can both keep using them via re-export. routes/__init__.py shrinks from 815 → 163 LoC; only the dashboard / and /sse/drives stream remain there. Routes split is now functionally complete: 12 files, ~1800 LoC distributed by feature.	2026-05-03 09:59:15 -04:00
Brandon Walter	fc7fb4c714	refactor: extract settings routes (1.0.0-36) Pulls /settings + /api/v1/settings* + /api/v1/settings/redacted + /test-smtp + /test-ssh into routes/settings.py (155 LoC). All five endpoints share the admin gate from auth.require_admin and the secret_status / SECRET_FIELDS helpers, so the boundary is clean. routes/__init__.py shrank from 960 -> 815 LoC. Cleanup bonus: dropped an orphan "# Print view (must be BEFORE /{job_id} int route)" comment that referenced the print-view endpoint already extracted to history.py. Verification: 59/59 tests pass; /settings 401 (auth-gated as expected); /login still 200; container boots clean at 1.0.0-36. Remaining slices: routes/burnin.py (start + cancel + export.csv + {job_id}) and routes/drives.py (the biggest, with the unlock route that's currently interleaved between the burnin endpoints in __init__.py; drives extraction unblocks burnin extraction). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 09:48:24 -04:00
Brandon Walter	3c39344069	refactor: extract history + audit + stats + report routes (1.0.0-35) Continues the routes/ package split: four more clean extractions, all following the same absolute-import pattern documented in the 1.0.0-34 gotcha note. * routes/history.py (184 LoC): /history, /history/{id}, and the /history/{id}/print view that MUST register before the {id} int route to avoid FastAPI's int("print") 422. Helpers _PAGE_SIZE, _ALL_STATES, _HISTORY_QUERY, _state_where moved with the endpoints. B608 nosec annotated on the count_sql f-string (it's two hardcoded literals; user input goes through bound params). * routes/audit.py (53 LoC): /audit page only. Owns _AUDIT_QUERY + _AUDIT_EVENT_COLORS. * routes/stats.py (111 LoC): /stats analytics page. Pure aggregation queries against burnin_jobs/drives, no shared helpers beyond stale_context. * routes/report.py (24 LoC): POST /api/v1/report/send. Now requires admin (was open to any authenticated user; sending mail is a side effect non-admins shouldn't be able to fire; same principle as the settings mutation gates added in 1.0.0-28). routes/__init__.py shrank from 1261 -> 960 LoC. Remaining work: drives, burnin, settings, dashboard, same pattern. Each future slice will use the `import app.routes.X as _Y` absolute-import gotcha workaround from 1.0.0-34. Verification: 59/59 tests pass; /login 200 (public); /history /audit /stats 401 (correctly auth-gated by middleware); container boots clean at 1.0.0-35. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 09:44:22 -04:00
Brandon Walter	aa7822d6ce	feat: rate limiter + mypy + lifecycle tests + routes/ split (1.0.0-33/-34) Closes the four remaining items from the post-Codex hardening list. #1 Rate-limit unlock + change-password endpoints (1.0.0-33) * Generalised the existing login limiter into a reusable `_RateLimiter` class in app/auth.py. Atomic check-then-increment in synchronous code so a parallel asyncio burst can't slip past the threshold. * `unlock_limiter` (5 attempts in 10 min → 10 min lockout) gates POST /api/v1/drives/{id}/unlock per-drive AND per-source-IP. * `pwchange_limiter` (5 in 10 min → 15 min lockout) gates POST /api/v1/auth/change-password per-user AND per-IP. * Both clear on successful operation. The login limiter keeps its existing `register_login_attempt` / `clear_login_failures` facade names so external callers don't change. #3 mypy in security-scan (1.0.0-33) * Added a 4th tool to the daily scan + forge workflow. Runs in a throwaway python:3.12-slim container against the deploy dir, exit code is informational only (NOT included in the `TOTAL_EXIT` failure sum). Findings land in ~/security-scans/scan-YYYY-MM-DD/mypy.txt for ratchet-down work over time. * Forge job uses `continue-on-error: true` so it doesn't fail the workflow until the type-debt baseline is annotated down. #4 Lifecycle test coverage (1.0.0-33) * New tests/test_lifecycle.py with 15 cases: - TestCommonHelpers (7 tests): _start_stage, _finish_stage success/failure/error-preservation, _recalculate_progress weighted math, _is_cancelled, _append_stage_log. - TestStartCancelJob (4 tests): start_job inserts queued row + correct stage list, duplicate-active rejection, cancel marks state, cancel returns False on terminal-state jobs. - TestRateLimiter (4 tests): under-threshold ok, trips at threshold, clear removes both counter + lockout, separate keys don't interfere. * Total goes from 44 to 59 tests; closes the orchestration-path coverage gap Codex flagged. #2 Partial routes.py split (1.0.0-34) * routes.py → routes/ package. Same staged-extraction pattern as the burnin.py split. * routes/auth.py: login/logout/setup/change-password (170 LoC). * routes/system.py: /health, /ws/terminal, /api/v1/updates/check (136 LoC). * routes/_helpers.py: shared utilities used by both extracted modules and the still-monolithic remainder: client_ip, operator_for, is_stale, stale_context, secret_status, SECRET_FIELDS (97 LoC). * routes/__init__.py shrank from 1568 LoC to 1261. Future slices can extract drives, burnin, history, settings the same way. * GOTCHA recorded in commit body: `from app import auth` at the top of __init__.py binds `auth` as an attribute on the package namespace, so `from . import auth as _auth_routes` finds the OUTER module and yields `app.auth` instead of the submodule. Fix is `import app.routes.auth as _auth_routes` (absolute). This bit me once at deploy time; container failed to start with `module 'app.auth' has no attribute 'router'`. Verification: 59/59 tests pass (44 existing + 15 new); container boots clean at 1.0.0-34; /health 200 with all checks green; security scan still clean (mypy informational findings ignored from totals). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 09:29:53 -04:00
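
The limiter behaviour described in commit aa7822d6ce (N attempts inside a window, then a lockout, cleared on success) might look roughly like the sketch below. It is not the repository's `_RateLimiter`: the class name, method names, key format, and injectable clock are all hypothetical, and it assumes that the attempt which reaches the threshold is itself refused.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Hypothetical sliding-window limiter with a lockout, keyed by string."""

    def __init__(self, max_attempts, window_s, lockout_s, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_s = window_s
        self.lockout_s = lockout_s
        self._clock = clock                  # injectable so tests can fake time
        self._attempts = defaultdict(deque)  # key -> timestamps of recent attempts
        self._locked_until = {}              # key -> time at which the lockout ends

    def register_attempt(self, key):
        """Record one attempt; return True if allowed, False if locked out.

        The method contains no `await`, so on a single asyncio event loop the
        check and the increment cannot be interleaved with another coroutine.
        """
        now = self._clock()
        if self._locked_until.get(key, 0.0) > now:
            return False
        attempts = self._attempts[key]
        # Drop timestamps that have aged out of the window.
        while attempts and now - attempts[0] > self.window_s:
            attempts.popleft()
        attempts.append(now)
        if len(attempts) >= self.max_attempts:
            # Assumption: reaching the threshold trips the lockout immediately.
            self._locked_until[key] = now + self.lockout_s
            attempts.clear()
            return False
        return True

    def clear(self, key):
        """Forget both the counter and any lockout, e.g. after a success."""
        self._attempts.pop(key, None)
        self._locked_until.pop(key, None)


# 5 attempts in 10 min -> 10 min lockout, as quoted for the unlock endpoint.
unlock_limiter = RateLimiter(max_attempts=5, window_s=600, lockout_s=600)
```

Gating "per-drive AND per-source-IP" could then be expressed by registering two keys per request (for example `"drive:3"` and `"ip:10.0.0.5"`, both made-up formats) and refusing the request if either call returns False.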

10 commits
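
The render-time orphan translation described in commit 28d046f42e could be sketched as a pure function over a stage row and its parent job row. The function name, the dict shapes, and the exact mapping (an 'unknown' parent displays as 'failed'; the error_text fallback applies whenever the stage's own value is NULL) are assumptions made for illustration, not the code in _drives_helpers.

```python
# Parent-job states that mean a still-'running' stage row is an orphan.
ORPHANING_PARENT_STATES = {"failed", "cancelled", "unknown"}


def display_stage(stage, job):
    """Return a copy of `stage` adjusted for display; never mutates or writes.

    `stage` and `job` are plain dicts with 'state' and 'error_text' keys
    (a hypothetical shape standing in for the real DB rows).
    """
    state = stage.get("state")
    error_text = stage.get("error_text")

    if state == "running" and job.get("state") in ORPHANING_PARENT_STATES:
        # Show the parent's terminal state; treat 'unknown' as 'failed'
        # (assumption: the commit only says "failed (or the parent's
        # terminal state)").
        state = "failed" if job["state"] == "unknown" else job["state"]

    if error_text is None:
        # Fall back to the parent's error so the operator sees what broke.
        error_text = job.get("error_text")

    return {**stage, "state": state, "error_text": error_text}


orphan = {"stage_name": "long_smart", "state": "running", "error_text": None}
parent = {"state": "failed", "error_text": "database is locked"}
shown = display_stage(orphan, parent)
print(shown["state"], "/", shown["error_text"])
```

Because the function returns a new dict, the stored row keeps state='running', which matches the commit's note that the translation is display-only.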