nas-burnin

Author	SHA1	Message	Date
Brandon Walter	aa7822d6ce	feat: rate limiter + mypy + lifecycle tests + routes/ split (1.0.0-33/-34) Closes the four remaining items from the post-Codex hardening list. #1 Rate-limit unlock + change-password endpoints (1.0.0-33) * Generalised the existing login limiter into a reusable `_RateLimiter` class in app/auth.py. Atomic check-then-increment in synchronous code so a parallel asyncio burst can't slip past the threshold. * `unlock_limiter` (5 attempts in 10 min → 10 min lockout) gates POST /api/v1/drives/{id}/unlock per-drive AND per-source-IP. * `pwchange_limiter` (5 in 10 min → 15 min lockout) gates POST /api/v1/auth/change-password per-user AND per-IP. * Both clear on successful operation. The login limiter keeps its existing `register_login_attempt` / `clear_login_failures` facade names so external callers don't change. #3 mypy in security-scan (1.0.0-33) * Added a 4th tool to the daily scan + forge workflow. Runs in a throwaway python:3.12-slim container against the deploy dir, exit code is informational only (NOT included in the `TOTAL_EXIT` failure sum). Findings land in ~/security-scans/scan-YYYY-MM-DD/mypy.txt for ratchet-down work over time. * Forge job uses `continue-on-error: true` so it doesn't fail the workflow until the type-debt baseline is annotated down. #4 Lifecycle test coverage (1.0.0-33) * New tests/test_lifecycle.py with 15 cases: - TestCommonHelpers (7 tests): _start_stage, _finish_stage success/failure/error-preservation, _recalculate_progress weighted math, _is_cancelled, _append_stage_log. - TestStartCancelJob (4 tests): start_job inserts queued row + correct stage list, duplicate-active rejection, cancel marks state, cancel returns False on terminal-state jobs. 
- TestRateLimiter (4 tests): under-threshold ok, trips at threshold, clear removes both counter + lockout, separate keys don't interfere. * Total goes from 44 to 59 tests; closes the orchestration-path coverage gap Codex flagged. #2 Partial routes.py split (1.0.0-34) * routes.py → routes/ package. Same staged-extraction pattern as the burnin.py split. * routes/auth.py — login/logout/setup/change-password (170 LoC). * routes/system.py — /health, /ws/terminal, /api/v1/updates/check (136 LoC). * routes/_helpers.py — shared utilities used by both extracted modules and the still-monolithic remainder: client_ip, operator_for, is_stale, stale_context, secret_status, SECRET_FIELDS (97 LoC). * routes/__init__.py shrank from 1568 LoC to 1261. Future slices can extract drives, burnin, history, settings the same way. * GOTCHA recorded in commit body: `from app import auth` at the top of __init__.py binds `auth` as an attribute on the package namespace, so `from . import auth as _auth_routes` finds the OUTER module and yields `app.auth` instead of the submodule. Fix is `import app.routes.auth as _auth_routes` (absolute). This bit me once at deploy time; container failed to start with `module 'app.auth' has no attribute 'router'`. Verification: 59/59 tests pass (44 existing + 15 new); container boots clean at 1.0.0-34; /health 200 with all checks green; security scan still clean (mypy informational findings ignored from totals). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 09:29:53 -04:00
Brandon Walter	eb2a964171	fix: address Codex review of burnin package split (1.0.0-32) Three LOW-severity findings from Codex's audit of the post-split package, all small mechanical cleanups: #1 routes.py:848 read burnin.UNLOCK_TTL_SECONDS — a snapshot alias bound at import time. After a test (or runtime) monkey-patches app.burnin.unlock.UNLOCK_TTL_SECONDS the API response would advertise the OLD value while grant_pool_unlock used the new one. Now reads burnin.unlock.UNLOCK_TTL_SECONDS directly so the API stays in sync with whatever the actual source-of-truth is. #2 _stage_surface_validate_ssh() carried dead extraction scaffolding from when the badblocks logic was first inlined into burnin.py: _is_cancelled_sync (sync wrapper that does run_until_complete in a coroutine — would deadlock if ever called), last_logged_pct, on_progress, accumulated_lines, on_progress_async — none on any control-flow path. Plus result["output"] which was set but never read. All deleted; the inline _drain coroutines below already handle progress/log throttling correctly. #3 The new module boundaries were leaking — root orchestration mutated _remote_pids and _unlock_grants directly even though kill.clear_remote_pid() and unlock.invalidate_grant() existed. Now using the helpers, so a future change to the storage shape only requires editing the owning module. Bonus from Codex's check note: _get_client() now asserts burnin._client is not None with a clear message instead of relying on an obscure NoneType AttributeError if a stage is somehow called before init(). Verified: 44/44 tests pass; container boots clean; /health 200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 01:35:07 -04:00
Brandon Walter	066fbbc403	fix: address Codex audit findings (1.0.0-28) Addresses 12 of 13 findings from the Codex tech-debt + security review of versions 1.0.0-22 through 1.0.0-27. Item #5 (live pool re-check before start_job) deferred — would add an SSH round-trip per start. #1 Pool detection now treats zpool / lsblk / findmnt failures INDEPENDENTLY. Previously a single None blew away the whole map, so a host where lsblk lacks zfs_member info but zpool works would never lock pool members. Extended findmnt parser to recognise /dev/mapper/, /dev/dm-, /dev/md, /dev/da, /dev/ada* (LVM, devicemapper, MD RAID, FreeBSD CORE devnames). #2 Admin role enforced on every settings mutation. New auth.require_admin() helper applied to GET /settings, POST /api/v1/settings, /test-smtp, /test-ssh. Previously any authenticated user (the CLI explicitly supports non-admin accounts) could rewrite SMTP/SSH/API secrets. #3 First-user setup race closed. auth.create_user() now accepts bootstrap_only=True which wraps the existence check + insert in BEGIN IMMEDIATE so two concurrent /api/v1/auth/setup requests can't both create admin accounts during the bootstrap window. #4 Case-insensitive uniqueness enforced via new `uniq_users_username_nocase` index. Login does NOCASE lookup so without this `Admin` and `admin` could coexist as distinct rows. #6 New `session_cookie_secure` setting (default False for LAN/dev deploys, set True in production behind HTTPS) flips the session cookie's Secure flag. Defends against on-the-wire exposure when the dashboard is reachable over plain HTTP. #7 Audit trail bound to authenticated identity. Burn-in start / cancel / unlock / drive reset all now use `_operator_for(request)` which reads `request.state.current_user.full_name|username` instead of the body's operator field. 
Logged-in users can no longer spoof attribution. Drive reset's literal-"operator" fallback (window._operator was never set) is also fixed by this. #8 Login rate-limit race fixed. New `register_login_attempt()` is atomic check-AND-increment in synchronous code (no awaits inside), so a parallel burst can't slip past the threshold. `record_login_failure()` removed; `clear_login_failures()` now also drops any active lockout for a successful auth. Pre-existing bug where `tripped` was always False (so user_login_locked_out audit events never fired) also fixed. #9 NVMe surface_validate post-format check now mirrors the SSH path: fails on FAILED health AND on real SMART attribute failures, soft-passes SSH-only failures (logged), surfaces warnings to the stage log without failing. #10 retention.backup_db() now writes to `.tmp` then atomic-renames into the canonical daily slot — an interrupted backup leaves the tmp behind but doesn't corrupt the real snapshot. Scheduler marks last_run_date only on (prune AND backup) success so a transient failure gets retried within the 03:00 hour. #11 /health DB probe now exercises the WRITE path via a temp-table INSERT/SELECT/COMMIT round-trip. Previously only read PRAGMA journal_mode + a row count, which silently passes on read-only mounts and broken-WAL conditions. #12 security-scan.sh now fails loudly if `git fetch` or `git reset --hard origin/main` errors (was `|| true`, scanning stale code silently). pip-audit now runs in a throwaway python:3.12-slim container against requirements.txt instead of `docker exec`-ing into the live truenas-burnin container — cleaner separation, no transient package install on prod. #13 Badblocks SSH stage no longer doubles its log_text. Previously appended every 20-line chunk during streaming AND the full accumulated output at end. Now only flushes the un-flushed tail (typically <20 lines). `result["output"]` stays in-memory only. 
Verification: all 44 unit tests pass in container; /health 200; security scan returns 0 findings; deployed maple build is green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 18:48:16 -04:00
Brandon Walter	3a9bdc9e15	feat: CSP + security headers middleware + session-fixation defense (1.0.0-27) #6 — defense-in-depth security headers: * New _SecurityHeadersMiddleware emits five headers on every response: - Content-Security-Policy: tight default-src 'self', allow-list the three CDNs we actively load (unpkg for HTMX, cdnjs for QR codes, jsdelivr for xterm.js), plus 'unsafe-inline' for the inline script in settings.html and inline style in job_print.html. Tighten via nonces later if you want true CSP-level XSS protection. - X-Content-Type-Options: nosniff - Referrer-Policy: same-origin - X-Frame-Options: DENY (no clickjacking) - Permissions-Policy: camera/microphone/geolocation/interest-cohort all blocked * Middleware ordering: SecurityHeaders -> AuthGate -> Session, so headers go on EVERY response including 401/403/redirects. #7 — session-fixation defense: * request.session.clear() now runs BEFORE setting user_id/username on successful /login AND /api/v1/auth/setup. Discards any pre-login payload an attacker might have seeded the cookie with. Combined with SameSite=strict + the HMAC-signed Starlette session cookie, this closes the residual fixation surface. Verified: curl -sSI /login returns all five headers; container boots clean; /health 200; existing session for the operator continues to work because we only clear on the LOGIN flow itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 18:28:13 -04:00
Brandon Walter	11218753ce	feat: secret handling — status badges + redacted endpoint + rotation audit (1.0.0-26) Closes #5 of the post-Codex hardening list: * Settings UI now shows a `[set]` (green) or `[unset]` (gray) badge next to every password/key field. Tells the operator at a glance which secrets are configured without ever rendering the value. * SSH key gets a granular source label: `set (environment variable)`, `set (mounted secret)`, or `set (stored in settings DB — prefer a mounted secret in production)`. Same hint copy in the field's help text now actively recommends `/run/secrets/ssh_key` over the textarea. * New `GET /api/v1/settings/redacted` admin-only endpoint dumps every editable setting with secrets replaced by `**`, plus the per-secret status map. Useful for ops triage ("what's actually loaded?") without the secrets ever leaving the container or hitting a transcript. `POST /api/v1/settings` writes a `settings_secret_changed` audit event whenever a non-empty secret is rotated. Records field names, operator, source IP — never the value. Lets the audit page answer "who rotated the SMTP password last week?". Internal: `_SECRET_FIELDS` constant in routes.py is now the single source of truth for which fields get the redaction / audit treatment. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 18:15:57 -04:00
Brandon Walter	1a19252019	feat: daily security scan — pip-audit + bandit + gitleaks (1.0.0-24) Two layers of defence-in-depth scanning: * `.forgejo/workflows/security-scan.yml` — runs pip-audit, bandit, and gitleaks on every push, every PR, and nightly at 07:00 UTC. Activates when the forge has a runner; harmless no-op until then. Bandit is invoked with `--skip B608` because every dynamic SQL build in this codebase uses bound parameters for data and structural placeholders only — we still catch real injection through code review. * `scripts/security-scan.sh` + systemd `service`/`timer` — maple-side daily scanner that runs the same three tools entirely in containers (no host pollution). Differences from the forge job: - pip-audit runs INSIDE the live container against installed packages, catching new CVEs in transitives requirements.txt doesn't pin (e.g. starlette breaking changes shipping in 1.0). - bandit scans the LIVE deploy dir at ~/docker/stacks/truenas-burnin/app/, not a fresh git checkout — so drift between forge HEAD and prod surfaces here too. - gitleaks scans a managed clone in ~/scan-checkouts/, kept fast-forward to origin/main. Output: ~/security-scans/scan-YYYY-MM-DD/{summary,pip-audit,bandit,gitleaks}.txt with 30-day retention. ~/security-scans/findings.log appended on any non-zero exit. SECURITY_SCAN_WEBHOOK env in the service unit lets you POST findings to Mattermost / Slack / etc. once you decide where alerts should land. First-run findings already actioned in this commit: * pip-audit caught 3 CVEs in `pip` itself (CVE-2025-8869, CVE-2026-1703, CVE-2026-3219). Dockerfile now upgrades pip to >=26.0 before installing the rest. * bandit's B608 SQL-injection heuristic flagged two f-string SQL constructions in `_upsert_drive` and `_fetch_drives_for_template`. 
Both were structural concatenation (column-list selection, '?,?,?' placeholder count), not data interpolation, but refactored from f-string to explicit concatenation so a future reviewer doesn't have to relitigate. * bandit's B104 (binding to 0.0.0.0) annotated with inline `# nosec B104` — container deliberately binds all interfaces; nginx-proxy-manager fronts it. * gitleaks: 0 secrets across 14 commits. Clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 17:07:22 -04:00
Brandon Walter	d4c0770b9e	feat: app-level login + hardening sweep (1.0.0-22 -> 1.0.0-23) Two layered changes shipped in this branch: == 1.0.0-22: app-level authentication == The dashboard previously had only an IP allowlist. Adds username + bcrypt password auth, signed-cookie sessions, and a "first user setup" flow. * New app/auth.py: User dataclass, bcrypt hash/verify, get_user_by_id/username, create_user, touch_last_login, FastAPI `get_current_user` dependency. Session secret loaded from SESSION_SECRET env or persisted to /data/session_secret. * New app/auth_cli.py: `python -m app.auth_cli list|reset|add` for out-of-band user management. Passwords always read from a TTY prompt. * Schema: idempotent ALTER for `users` table (id, username unique, password_hash, full_name, is_admin, created_at, last_login_at). * main.py: SessionMiddleware (HMAC-signed cookie, max-age 7 days, SameSite=strict — see hardening section) + _AuthGateMiddleware that populates request.state.current_user and bounces unauth'd HTML GETs to /login while returning 401 JSON for everything else. * Routes: GET /login renders first-user-setup form when users table is empty otherwise sign-in form; POST /login; POST /api/v1/auth/setup (only works while empty); GET|POST /logout. * Bootstrap: env vars INITIAL_ADMIN_USERNAME + INITIAL_ADMIN_PASSWORD create the first admin on startup if both set AND users table empty. Ignored thereafter — change passwords via UI or CLI. * Layout: header shows current_user.full_name|username + Logout link. Modal operator field auto-fills from the logged-in user via <meta name="default-operator"> rendered in layout (replaces the localStorage-only previous behaviour). * requirements.txt: pinned bcrypt>=4.0,<5.0, itsdangerous>=2.1, python-multipart>=0.0.7. First step toward addressing the unpinned-deps gotcha. * New app/templates/login.html with first-user-setup variant. == 1.0.0-23: hardening sweep == Closes the eight-item gap audit: * DB retention + automated backup. 
New app/retention.py runs daily at 03:00 local. Nulls burnin_stages.log_text on stages older than retention_log_days (default 35), VACUUMs to reclaim pages, then runs `sqlite3 .backup` to /data/backups/app-YYYY-MM-DD.db keeping the retention_backup_keep most recent (default 14). Wired into the lifespan supervisor next to mailer/poller. * CSRF mitigation. SessionMiddleware bumped to SameSite=strict so the browser refuses to send the session cookie on cross-site POSTs — removes the actual CSRF vector. Trade-off: external links into the app require re-auth. * Login rate limiting. In-memory per-username AND per-source-IP failure counters in auth.py. 10 failures within 10 min trips a 15-min lockout for both keys. Returns HTTP 429 with a clear "try again in N min" message. Cleared on successful login. * Login audit events. New event types in audit_events: user_login, user_login_failed, user_login_locked_out, user_logout, user_password_changed. All include source IP. Recorded via auth.audit_auth_event(). * Password change UI. Header link "Change password" opens templates/components/modal_password.html (current/new/confirm). Posts to POST /api/v1/auth/change-password — bcrypt-verifies current, requires >=8 char new pw, writes audit event. * NVMe burn-in path. _stage_surface_validate now detects nvme* devnames and routes to _stage_surface_validate_nvme() which runs `nvme format -s 1 --force` (cryptographic erase). Seconds vs hours of badblocks, exercises the controller's secure-erase. Falls back to badblocks if nvme-cli isn't installed. Post-format SMART check. * Mounted-FS detection. ssh_client.get_mounted_drives() runs `findmnt -no SOURCE`, parses non-ZFS sources back to base devnames. Poller treats them as pool_name='(mounted)', pool_role='mounted'. Confirm token DESTROY MOUNTED FILESYSTEM, distinct purple styling, audit event mounted_drive_unlocked, daily-report banner picks it up. * Deeper /health. 
Real readiness check — DB write probe (PRAGMA journal_mode), poller freshness (age <= 3x stale_threshold), SSH test_connection() when configured. Returns 503 when any check fails so a proxy/orchestrator can take the container out of rotation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:08:29 -04:00
Brandon Walter	5da1a1704f	feat: pool-membership lock + cancellation hardening + smart_health refresh + tunables (1.0.0-13 -> 1.0.0-21) Substantial feature + reliability sweep. Each version below was developed, tested live against the maple/TrueNAS deployment, and Codex-reviewed before bundling. 1.0.0-13 — asyncssh proc.kill() doesn't actually kill the remote process (sshd ignores SSH signal-channel requests by default), so a cancel of a long-running badblocks left the remote process running and proc.wait() hanging — pinning the asyncio.Semaphore slot forever. * Wrap long-lived commands in `sh -c 'echo PID:$$; exec <cmd>'` to capture the remote PID; store in burnin._remote_pids[job_id]. * burnin._kill_remote_process(job_id) opens a fresh SSH session and issues `kill -9 <pid>` — sshd honours that. * Bound proc.wait() with asyncio.wait_for(timeout=15). * burnin._active_tasks tracks every _run_job task so cancel_job and check_stuck_jobs can actually cancel the asyncio task (was DB-only before). Also fixes the documented asyncio.create_task GC gotcha (weak refs only). * _run_job finalizer reads current state and skips the write if state != 'running' so cancelled/unknown aren't clobbered. 1.0.0-14 — poller._upsert_drive ON CONFLICT only refreshed temperature/health/poll timestamps; devname/serial/model/size_bytes were stuck at first-INSERT values forever. After kernel SCSI re-enumeration two drives could both show as `sda`. Fixed by updating all six fields. Also added 7-day stale filter to _DRIVES_QUERY so removed drives drop off the dashboard while audit/burnin_jobs FKs stay intact. 1.0.0-15/-16 — pool-membership lock. * ssh_client.get_pool_membership() runs `zpool list -vHP` and parses the flattened TrueNAS output (container vdevs + their device children both appear at depth 1; section markers cache/log/spare/special/dedup switch the role). 
* ssh_client.get_zfs_member_drives() runs `lsblk -no NAME,FSTYPE -l` to detect drives carrying ZFS labels not in any active pool — they get pool_name='(exported)', pool_role='exported'. * Three idempotent ALTER TABLE migrations on drives: pool_name/pool_role/pool_seen_at. * burnin.start_job raises PoolMemberError if pool_name IS NOT NULL and the drive isn't in burnin._unlock_grants. Routes layer maps to 409 with structured detail {pool_name, pool_role, pool_locked: true} so the frontend can render an unlock affordance. * POST /api/v1/drives/{id}/unlock accepts {confirm_token, operator, reason}. Token is the pool name for active pools, "DESTROY BOOT POOL" for boot-pool, "DESTROY EXPORTED POOL" for exported. Reason >= 5 chars. TTL = UNLOCK_TTL_SECONDS = 600. Audit event types: pool_drive_unlocked / boot_pool_drive_unlocked / exported_pool_drive_unlocked. * Grants are in-memory only — container restart wipes them. * UI: lock icon (yellow/red/orange), pool pill, conditional Unlock vs Burn-In button. modal_unlock.html with type-to-confirm field. Live unlock countdown via tickUnlockCountdowns() in app.js. * Daily report: red banner listing every unlock event from the last 24h, with operator + reason + timestamp. 1.0.0-17 — Codex review fail-open + XSS + structured-error fixes. * ssh_client.get_pool_membership / get_zfs_member_drives now return None on failure (vs {} for 'definitely empty'). poller passes update_pool=False to _upsert_drive on detection failure, preserving existing pool columns instead of clearing them. Without this fix a 1-second SSH blip silently unlocked every drive. * mailer._build_unlock_banner_html escapes every interpolated field via html.escape() (was '<' only). Time filter switched to julianday() — string >= against datetime('now', '-1 day') compared formats with different separators ('T' vs ' ') and timezone suffixes, causing subtle off-by-N-hour inclusion. 
* app.js submitStart/submitBatchStart now detect the structured pool_locked 409 detail and auto-open the unlock modal for the offending drive (was [object Object] in toast). 1.0.0-18 — Codex grant-binding + commit-ordering fixes. * Unlock grants bound to the (pool_name, pool_role) observed at unlock time. _UnlockGrant dataclass; _is_unlocked and unlock_expiry invalidate the grant if the live row's pool identity has changed. Prevents an 'exported' unlock from carrying over when the drive turns out to be in active 'tank' or 'boot-pool'. * grant_pool_unlock now writes to _unlock_grants only AFTER db.commit() succeeds — previously a failed audit insert left an unaudited grant armed. 1.0.0-19 — Codex race + cancellation classification + test scaffold. * Partial unique index uniq_active_burnin_per_drive ON burnin_jobs (drive_id) WHERE state IN ('queued','running'). INSERT now wraps in try/except aiosqlite.IntegrityError -> ValueError so the read-then-insert race in start_job can't produce two queued rows for the same drive. * _run_job tracks was_cancelled flag; on bare task.cancel() (shutdown, future code paths) where DB state is still 'running', finalizer writes 'unknown' instead of mis-classifying as 'failed'. * tests/ stdlib unittest scaffold: - test_pool_parser.py (21 tests): mirror/raidz/draid container vdevs, single-disk depth-1, plural section markers, partition stripping, sdaa-style names, multi-pool, role reset between pools. - test_unlock_flow.py (18 tests): token validation per pool kind, identity-binding invalidation, TTL expiry, audit-commit-then-arm ordering, unique-active-burnin partial index. Run via `python -m unittest discover tests/`. No new dependencies. 1.0.0-20 — Spearfoot-inspired badblocks tunables. * surface_validate_block_size (-b, default 4096), surface_validate_block_buffer (-c, default 64), surface_validate_passes (-p, default 1) exposed in Settings UI; persist via settings_store.json. 
Validation: block size must be a power of 2 between 512 and 1048576. Defaults preserve existing behaviour. Bumping to 8192/64/1 roughly halves runtime on multi-TB HDDs at ~2x RAM cost. 1.0.0-21 — SMART overall-health column actually populated. * /api/v2.0/disk doesn't expose smart_health, so every drive defaulted to UNKNOWN forever (only burn-in stages ever wrote a real value). * ssh_client.get_smart_health_map([devnames]) runs `smartctl -H` for all drives in a single SSH session, deterministically delimited with @@devname@@ ... @@END@@ markers. Returns {devname: PASSED|FAILED|UNKNOWN} or None on SSH failure. * poller calls it every 5th cycle (~1 min at default 12s interval), caches in _state['smart_health_cache'] so transient failures preserve the previous values. * Dashboard CSS: col-smart min-width 150 -> 95, horizontal padding 14 -> 6 so Short/Long SMART columns fit comfortably on a 13-inch display. * 5 additional parser tests (44 total, all passing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 09:25:56 -04:00
Brandon Walter	289c6d8f1a	fix: reset clears burn-in dashboard column via last_reset_at timestamp Add last_reset_at column to drives table (migration-safe ALTER TABLE). _fetch_burnin_by_drive now excludes jobs created before the drive's last_reset_at, so the dashboard burn-in column goes blank after reset while the History page still shows the full job record. reset_drive stamps last_reset_at = now() alongside clearing smart_attrs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 11:24:32 -05:00
Brandon Walter	5a802bff2e	feat: live SSH terminal in drawer (xterm.js + asyncssh WebSocket) Adds a Terminal tab to the log drawer with a full PTY session bridged over WebSocket to the TrueNAS SSH host. xterm.js loaded lazily on first tab open. Supports resize, paste, full color, and reconnect. - app/terminal.py: asyncssh PTY ↔ WebSocket bridge - routes.py: @router.websocket("/ws/terminal") - dashboard.html: Terminal tab + drawer panel - app.js: xterm.js lazy load, init, onData, resize observer, reconnect - app.css: terminal panel styles (no padding, overflow hidden) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 09:30:56 -05:00
Brandon Walter	2dff58bd52	Stage 7: SSH architecture, SMART attribute monitoring, drive reset, and polish SSH (app/ssh_client.py — new): - asyncssh-based client: start_smart_test, poll_smart_progress, abort_smart_test, get_smart_attributes, run_badblocks with streaming progress callbacks - SMART attribute table: monitors attrs 5/10/188/197/198/199 for warn/fail thresholds - Falls back to REST API / mock simulation when ssh_host is not configured Burn-in stages updated (burnin.py): - _stage_smart_test: SSH path polls smartctl -a, stores raw output + parsed attributes - _stage_surface_validate: SSH path streams badblocks, counts bad blocks vs configurable threshold - _stage_final_check: SSH path checks smartctl attributes; DB fallback for mock mode - New DB helpers: _append_stage_log, _update_stage_bad_blocks, _store_smart_attrs, _store_smart_raw_output Database (database.py): - Migrations: burnin_stages.log_text, burnin_stages.bad_blocks, drives.smart_attrs (JSON), smart_tests.raw_output Settings (config.py + settings_store.py): - ssh_host, ssh_port, ssh_user, ssh_password, ssh_key — all runtime-editable - SSH section in Settings UI with Test SSH Connection button Webhook (notifier.py): - Added bad_blocks and timestamp fields to payload per SPEC Drive reset (routes.py + drives_table.html): - POST /api/v1/drives/{id}/reset — clears SMART state, smart_attrs; audit logged - Reset button visible on drives with completed test state (no active burn-in) Log drawer (app.js): - Burn-In tab: shows raw stage log_text (SSH output) with bad block highlighting - SMART tab: shows SMART attribute table with warn/fail colouring + raw smartctl output Polish: - Version badge (v1.0.0-6d) in header via Jinja2 global - Parallel burn-in warning when max_parallel_burnins > 8 in Settings - Stats page: avg duration by drive size + failure breakdown by stage - settings.html: SSH section with key textarea, parallel warn div Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 08:09:30 -05:00
Brandon Walter	4ab54d7ed8	Add temp thresholds, bad block threshold, editable system settings, check for updates, history completed time - config.py: add temp_warn_c (46°C), temp_crit_c (55°C), bad_block_threshold (0), app_version - settings_store.py: expose all new fields + system settings (truenas_base_url, api_key, poll_interval, etc.) as editable; save to JSON for persistence; add validation for log_level, poll/stale intervals, temp range - renderer.py: _temp_class() now reads temp_warn_c/temp_crit_c from settings instead of hardcoded 40/50 - burnin.py: precheck uses settings.temp_crit_c; fix NameError bug (_execute_stages referenced 'profile' that was not in scope) - routes.py: add GET /api/v1/updates/check (Forgejo releases API); settings_page passes new editable fields; save_settings skips empty truenas_api_key like smtp_password - settings.html: move system settings from read-only card into editable form; add temp/bad-block fields to Burn-In Behavior; add Check for Updates button; restart-required indicator on save - history.html: add Completed (finished_at) column next to Started - app.css: toast container shifts up when drawer is open (body.drawer-open) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 07:43:23 -05:00
Brandon Walter	c0f9098779	feat: add log drawer (Stage 7a) Click any drive row to slide up a drawer with three tabs: - Burn-In: stage timeline with state icons, elapsed timers, error lines in red - SMART: short and long test status, timestamps, progress - Events: last 50 audit events for the drive (newest first) Drawer auto-refreshes on every SSE poll cycle. Row highlights blue while drawer is open. Clicking same row or pressing Esc closes it. Auto-scroll toggle keeps burn-in tab pinned to bottom during active runs. New API: GET /api/v1/drives/{id}/drawer Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 07:22:53 -05:00
Brandon Walter	b73b5251ae	Initial commit — TrueNAS Burn-In Dashboard v0.5.0 Full-stack burn-in orchestration dashboard (Stages 1–6d complete): FastAPI backend, SQLite/WAL, SSE live dashboard, mock TrueNAS server, SMTP/webhook notifications, batch burn-in, settings UI, audit log, stats page, cancel SMART/burn-in, drag-to-reorder stages. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 00:08:29 -05:00
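
Commits aa7822d6ce and 066fbbc403 describe a limiter whose check and increment happen in one synchronous call, so concurrent coroutines on a single event loop cannot interleave between them. A minimal sketch of that idea follows. It is not the project's `_RateLimiter`: the class name `SketchRateLimiter`, its method names, and the injectable `now` argument are hypothetical, and treating the threshold as "the attempt after the Nth one trips the lockout" is an assumption.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SketchRateLimiter:
    """Hypothetical stand-in for the limiter named in the commit log."""

    max_attempts: int = 5
    window_seconds: float = 600.0
    lockout_seconds: float = 600.0
    _attempts: dict = field(default_factory=dict)  # key -> list of attempt times
    _lockouts: dict = field(default_factory=dict)  # key -> lockout expiry time

    def register_attempt(self, key, now=None):
        """Record one attempt; return True if allowed, False if locked out.

        The body contains no awaits, so on a single asyncio event loop the
        check and the increment cannot be separated by another coroutine.
        """
        now = time.monotonic() if now is None else now
        expiry = self._lockouts.get(key)
        if expiry is not None:
            if now < expiry:
                return False
            # Lockout has expired: start from a clean slate.
            del self._lockouts[key]
            self._attempts.pop(key, None)
        recent = [t for t in self._attempts.get(key, [])
                  if now - t < self.window_seconds]
        recent.append(now)
        self._attempts[key] = recent
        if len(recent) > self.max_attempts:
            self._lockouts[key] = now + self.lockout_seconds
            return False
        return True

    def clear(self, key):
        """Drop both the counter and any active lockout (successful operation)."""
        self._attempts.pop(key, None)
        self._lockouts.pop(key, None)
```

To gate an endpoint per-drive and per-source-IP as the log describes, a caller would register the attempt under two keys (for example `("drive", drive_id)` and `("ip", client_ip)`, both hypothetical) and reject the request if either call returns False.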

14 commits
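
Finding #10 in commit 066fbbc403 describes writing the daily backup to a `.tmp` file and then renaming it into the canonical slot, so an interrupted run cannot corrupt an existing snapshot. The sketch below illustrates that pattern with the standard library only. It is an assumption-laden illustration, not `retention.backup_db()` itself: the function name `backup_db_sketch`, its parameters, and the pruning rule are hypothetical, and it uses Python's `sqlite3.Connection.backup` rather than the `sqlite3 .backup` CLI call mentioned in the log.

```python
import os
import sqlite3
from datetime import date
from pathlib import Path


def backup_db_sketch(db_path, backup_dir, keep=14, today=None):
    """Snapshot db_path into backup_dir/app-YYYY-MM-DD.db via tmp + rename."""
    today = today or date.today()
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    final = backup_dir / f"app-{today.isoformat()}.db"
    tmp = final.with_name(final.name + ".tmp")

    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(tmp)
    try:
        # Online backup: copies a consistent snapshot even in WAL mode.
        src.backup(dst)
    finally:
        dst.close()
        src.close()

    # os.replace is atomic on the same filesystem: readers see either the
    # old snapshot or the complete new one, never a partial file.
    os.replace(tmp, final)

    # ISO dates sort chronologically, so the oldest snapshots come first.
    # The glob does not match leftover "*.db.tmp" files.
    snapshots = sorted(backup_dir.glob("app-*.db"))
    if keep > 0:
        for old in snapshots[:-keep]:
            old.unlink()
    return final
```

If the process dies during `src.backup(dst)`, only the `.tmp` file is left behind and the previous day's snapshots are untouched, which matches the behaviour the commit message describes.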