#6 — defense-in-depth security headers:
* New _SecurityHeadersMiddleware emits five headers on every response:
- Content-Security-Policy: tight default-src 'self', allow-list the
three CDNs we actively load (unpkg for HTMX, cdnjs for QR codes,
jsdelivr for xterm.js), plus 'unsafe-inline' for the inline script
in settings.html and inline style in job_print.html. Tighten via
nonces later if you want true CSP-level XSS protection.
- X-Content-Type-Options: nosniff
- Referrer-Policy: same-origin
- X-Frame-Options: DENY (no clickjacking)
- Permissions-Policy: camera/microphone/geolocation/interest-cohort
all blocked
* Middleware ordering: SecurityHeaders -> AuthGate -> Session, so
headers go on EVERY response including 401/403/redirects.
#7 — session-fixation defense:
* request.session.clear() now runs BEFORE setting user_id/username on
successful /login AND /api/v1/auth/setup. Discards any pre-login
payload an attacker might have seeded the cookie with. Combined
with SameSite=strict + the HMAC-signed Starlette session cookie,
this closes the residual fixation surface.
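
The clear-then-set ordering can be sketched as below. `establish_session` is a hypothetical helper name, and a plain dict stands in for `request.session`, which behaves like a mutable mapping in Starlette.

```python
def establish_session(session: dict, user_id: int, username: str) -> None:
    """Drop any pre-login payload, then record the authenticated identity."""
    session.clear()  # discard anything seeded before authentication
    session["user_id"] = user_id
    session["username"] = username
```

Clearing first means no key planted before login can survive into the authenticated session.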
Verified: curl -sSI /login returns all five headers; container boots
clean; /health 200; existing session for the operator continues to
work because we only clear on the LOGIN flow itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes #5 of the post-Codex hardening list:
* Settings UI now shows a `[set]` (green) or `[unset]` (gray) badge next
to every password/key field. Tells the operator at a glance which
secrets are configured without ever rendering the value.
* SSH key gets a granular source label: `set (environment variable)`,
`set (mounted secret)`, or `set (stored in settings DB — prefer a
mounted secret in production)`. The field's help text carries the same
hint and now actively recommends `/run/secrets/ssh_key` over the textarea.
* New `GET /api/v1/settings/redacted` admin-only endpoint dumps every
editable setting with secrets replaced by `***`, plus the per-secret
status map. Useful for ops triage ("what's actually loaded?") without
the secrets ever leaving the container or hitting a transcript.
* `POST /api/v1/settings` writes a `settings_secret_changed` audit event
whenever a non-empty secret is rotated. Records field names, operator,
source IP — never the value. Lets the audit page answer "who rotated
the SMTP password last week?".
Internal: `_SECRET_FIELDS` constant in routes.py is now the single
source of truth for which fields get the redaction / audit treatment.
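
A minimal sketch of how a single constant can drive both the redacted dump and the audit payload. The field names in the set, the response shape, and the helper names `redact_settings` / `changed_secret_fields` are assumptions; only the `***` placeholder and the "names, never values" rule come from the notes above.

```python
# Hypothetical contents; the real _SECRET_FIELDS lives in routes.py.
_SECRET_FIELDS = frozenset({"smtp_password", "truenas_api_key", "ssh_key"})


def redact_settings(settings: dict) -> dict:
    """Return settings with configured secrets masked, plus a status map."""
    redacted = {
        # Assumption: only non-empty secrets are masked; empty ones stay empty.
        key: ("***" if key in _SECRET_FIELDS and value else value)
        for key, value in settings.items()
    }
    status = {
        key: ("set" if settings.get(key) else "unset")
        for key in sorted(_SECRET_FIELDS)
    }
    return {"settings": redacted, "secret_status": status}


def changed_secret_fields(submitted: dict) -> list:
    """Names (never values) of non-empty secrets in a POST payload."""
    return sorted(key for key in _SECRET_FIELDS if submitted.get(key))
```

Adding a new secret then only requires touching the constant; both the redaction and the audit event pick it up.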
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the unpinned-deps gotcha that broke production once already
(Starlette 1.0 shipping in 2026-04 changed the TemplateResponse
signature; our floating requirements.txt picked it up on the next
rebuild and the dashboard 500'd until 1.0.0-12 patched the call sites).
Mechanics:
* `requirements.in` — human-edited input, identical contents to the
old `requirements.txt`.
* `requirements.txt` — now an autogenerated lockfile (876 lines, every
transitive pinned with sha256 hashes). Regenerated via
`scripts/regenerate-lockfile.sh`, which runs `pip-compile
--generate-hashes --strip-extras` in a clean python:3.12-slim
container so the script has no host dependencies.
* Dockerfile installs with `pip install --require-hashes` — refuses
any package whose sha256 doesn't match the lockfile, defending
against compromised PyPI mirrors and accidental version drift.
Verification:
* Container boots clean on the hash-locked install (1.0.0-25).
* /health returns 200 with all checks green.
* Daily security scan (pip-audit + bandit + gitleaks) returns 0 findings
against the new lockfile.
Future deps changes: edit requirements.in, run the regenerate script,
review the diff, rebuild, commit both files. README §"Updating
dependencies" walks through it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two layers of defence-in-depth scanning:
* `.forgejo/workflows/security-scan.yml` — runs pip-audit, bandit, and
gitleaks on every push, every PR, and nightly at 07:00 UTC. Activates
when the forge has a runner; harmless no-op until then. Bandit is
invoked with `--skip B608` because every dynamic SQL build in this
codebase uses bound parameters for data and structural placeholders
only — we still catch real injection through code review.
* `scripts/security-scan.sh` + systemd `service`/`timer` — maple-side
daily scanner that runs the same three tools entirely in containers
(no host pollution). Differences from the forge job:
- pip-audit runs INSIDE the live container against installed
packages, catching new CVEs in transitives requirements.txt
doesn't pin (e.g. starlette breaking changes shipping in 1.0).
- bandit scans the LIVE deploy dir at
~/docker/stacks/truenas-burnin/app/, not a fresh git checkout —
so drift between forge HEAD and prod surfaces here too.
- gitleaks scans a managed clone in ~/scan-checkouts/, kept
fast-forward to origin/main.
Output: ~/security-scans/scan-YYYY-MM-DD/{summary,pip-audit,bandit,
gitleaks}.txt with 30-day retention. ~/security-scans/findings.log
appended on any non-zero exit. SECURITY_SCAN_WEBHOOK env in the
service unit lets you POST findings to Mattermost / Slack / etc. once
you decide where alerts should land.
First-run findings already actioned in this commit:
* pip-audit caught 3 CVEs in `pip` itself (CVE-2025-8869,
CVE-2026-1703, CVE-2026-3219). Dockerfile now upgrades pip to
>=26.0 before installing the rest.
* bandit's B608 SQL-injection heuristic flagged two f-string SQL
constructions in `_upsert_drive` and `_fetch_drives_for_template`.
Both were structural concatenation (column-list selection,
'?,?,?' placeholder count), not data interpolation, but refactored
from f-string to explicit concatenation so a future reviewer
doesn't have to relitigate.
* bandit's B104 (binding to 0.0.0.0) annotated with inline `# nosec
B104` — container deliberately binds all interfaces; nginx-proxy-
manager fronts it.
* gitleaks: 0 secrets across 14 commits. Clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First operator-facing README. Covers quick start (build, configure,
first-user login), the multi-drive batch workflow with concrete time
estimates, the four drive-lock states with their confirm tokens,
notable settings, daily report / notifications, ops cookbook (logs,
user CLI, backups, /health probe, DB reset), and an honest "known
gaps" list.
Cross-references CLAUDE.md (architecture + rationale) and SPEC.md
(per-version feature reference) for deeper docs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two layered changes shipped in this branch:
== 1.0.0-22: app-level authentication ==
The dashboard previously had only an IP allowlist. Adds username +
bcrypt password auth, signed-cookie sessions, and a "first user setup"
flow.
* New app/auth.py: User dataclass, bcrypt hash/verify, get_user_by_id/
username, create_user, touch_last_login, FastAPI `get_current_user`
dependency. Session secret loaded from SESSION_SECRET env or persisted
to /data/session_secret.
* New app/auth_cli.py: `python -m app.auth_cli list|reset|add` for
out-of-band user management. Passwords always read from a TTY prompt.
* Schema: idempotent ALTER for `users` table (id, username unique,
password_hash, full_name, is_admin, created_at, last_login_at).
* main.py: SessionMiddleware (HMAC-signed cookie, max-age 7 days,
SameSite=strict — see hardening section) + _AuthGateMiddleware that
populates request.state.current_user and bounces unauth'd HTML GETs
to /login while returning 401 JSON for everything else.
* Routes: GET /login renders the first-user-setup form when the users
table is empty, otherwise the sign-in form; POST /login;
POST /api/v1/auth/setup (only works while empty); GET|POST /logout.
* Bootstrap: env vars INITIAL_ADMIN_USERNAME + INITIAL_ADMIN_PASSWORD
create the first admin on startup if both set AND users table empty.
Ignored thereafter — change passwords via UI or CLI.
* Layout: header shows current_user.full_name|username + Logout link.
Modal operator field auto-fills from the logged-in user via
<meta name="default-operator"> rendered in layout (replaces the
localStorage-only previous behaviour).
* requirements.txt: pinned bcrypt>=4.0,<5.0, itsdangerous>=2.1,
python-multipart>=0.0.7. First step toward addressing the
unpinned-deps gotcha.
* New app/templates/login.html with first-user-setup variant.
== 1.0.0-23: hardening sweep ==
Closes the eight-item gap audit:
* DB retention + automated backup. New app/retention.py runs daily at
03:00 local. Nulls burnin_stages.log_text on stages older than
retention_log_days (default 35), VACUUMs to reclaim pages, then runs
`sqlite3 .backup` to /data/backups/app-YYYY-MM-DD.db keeping the
retention_backup_keep most recent (default 14). Wired into the
lifespan supervisor next to mailer/poller.
* CSRF mitigation. SessionMiddleware bumped to SameSite=strict so the
browser refuses to send the session cookie on cross-site POSTs —
removes the actual CSRF vector. Trade-off: external links into the
app require re-auth.
* Login rate limiting. In-memory per-username AND per-source-IP failure
counters in auth.py. 10 failures within 10 min trips a 15-min lockout
for both keys. Returns HTTP 429 with a clear "try again in N min"
message. Cleared on successful login.
* Login audit events. New event types in audit_events: user_login,
user_login_failed, user_login_locked_out, user_logout,
user_password_changed. All include source IP. Recorded via
auth.audit_auth_event().
* Password change UI. Header link "Change password" opens
templates/components/modal_password.html (current/new/confirm).
Posts to POST /api/v1/auth/change-password — bcrypt-verifies current,
requires >=8 char new pw, writes audit event.
* NVMe burn-in path. _stage_surface_validate now detects nvme*
devnames and routes to _stage_surface_validate_nvme() which runs
`nvme format -s 1 --force` (secure-erase setting 1 is a user-data
erase; cryptographic erase would be `-s 2`). Takes seconds vs hours
of badblocks and exercises the controller's secure-erase. Falls back
to badblocks if nvme-cli isn't installed. Post-format SMART check.
* Mounted-FS detection. ssh_client.get_mounted_drives() runs
`findmnt -no SOURCE`, parses non-ZFS sources back to base devnames.
Poller treats them as pool_name='(mounted)', pool_role='mounted'.
Confirm token DESTROY MOUNTED FILESYSTEM, distinct purple styling,
audit event mounted_drive_unlocked, daily-report banner picks it up.
* Deeper /health. Real readiness check — DB write probe (PRAGMA
journal_mode), poller freshness (age <= 3x stale_threshold), SSH
test_connection() when configured. Returns 503 when any check fails
so a proxy/orchestrator can take the container out of rotation.
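
The login rate limiting described in the list above could look roughly like this. It is a sketch under assumptions, not the code in auth.py: the class name `LoginLimiter`, the key format, and the injectable clock are invented for illustration; only the 10 failures / 10 min / 15 min numbers and the "both keys" behaviour come from the notes.

```python
import time

MAX_FAILURES = 10
WINDOW_SECONDS = 10 * 60
LOCKOUT_SECONDS = 15 * 60


class LoginLimiter:
    """In-memory failure counters keyed by username and by source IP."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._failures = {}      # key -> list of failure timestamps
        self._locked_until = {}  # key -> time at which the lockout ends

    @staticmethod
    def _keys(username, source_ip):
        return (f"user:{username.lower()}", f"ip:{source_ip}")

    def seconds_locked(self, username, source_ip):
        """Remaining lockout in seconds; 0.0 means the attempt may proceed."""
        now = self._clock()
        remaining = [
            self._locked_until.get(key, 0.0) - now
            for key in self._keys(username, source_ip)
        ]
        return max(0.0, max(remaining))

    def record_failure(self, username, source_ip):
        now = self._clock()
        keys = self._keys(username, source_ip)
        tripped = False
        for key in keys:
            recent = [t for t in self._failures.get(key, [])
                      if now - t < WINDOW_SECONDS]
            recent.append(now)
            self._failures[key] = recent
            tripped = tripped or len(recent) >= MAX_FAILURES
        if tripped:
            # Either counter tripping locks out both keys.
            for key in keys:
                self._locked_until[key] = now + LOCKOUT_SECONDS

    def record_success(self, username, source_ip):
        for key in self._keys(username, source_ip):
            self._failures.pop(key, None)
            self._locked_until.pop(key, None)
```

A handler would check `seconds_locked()` before verifying the password and return 429 when it is non-zero. Being in-memory, the counters reset on container restart.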
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Substantial feature + reliability sweep. Each version below was developed,
tested live against the maple/TrueNAS deployment, and Codex-reviewed
before bundling.
1.0.0-13 — asyncssh proc.kill() doesn't actually kill the remote process
(sshd ignores SSH signal-channel requests by default), so a cancel of a
long-running badblocks left the remote process running and proc.wait()
hanging — pinning the asyncio.Semaphore slot forever.
* Wrap long-lived commands in `sh -c 'echo PID:$$; exec <cmd>'` to
capture the remote PID; store in burnin._remote_pids[job_id].
* burnin._kill_remote_process(job_id) opens a fresh SSH session and
issues `kill -9 <pid>` — sshd honours that.
* Bound proc.wait() with asyncio.wait_for(timeout=15).
* burnin._active_tasks tracks every _run_job task so cancel_job and
check_stuck_jobs can actually cancel the asyncio task (was DB-only
before). Holding the task reference also fixes the documented
asyncio.create_task GC gotcha (the event loop keeps only weak
references to tasks).
* _run_job finalizer reads current state and skips the write if state
!= 'running' so cancelled/unknown aren't clobbered.
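
The PID-capture wrapper from the first bullet above can be sketched with two small helpers. The function names are hypothetical, and the example only builds and parses strings; it does not open an SSH session.

```python
import re
import shlex

_PID_LINE = re.compile(r"^PID:(\d+)$")


def wrap_with_pid(command: str) -> str:
    """Make the remote shell print its PID, then exec into `command`.

    Because `exec` replaces the shell process, the printed PID is also
    the PID of the long-running command.
    """
    return "sh -c " + shlex.quote(f"echo PID:$$; exec {command}")


def parse_pid_line(line: str):
    """Return the PID from a `PID:<n>` line, or None for any other output."""
    match = _PID_LINE.match(line.strip())
    return int(match.group(1)) if match else None
```

The first line of remote output yields the PID to store; a later cancel can then issue `kill -9 <pid>` over a fresh session, as described above.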
1.0.0-14 — poller._upsert_drive ON CONFLICT only refreshed temperature/
health/poll timestamps; devname/serial/model/size_bytes were stuck at
first-INSERT values forever. After kernel SCSI re-enumeration two
drives could both show as `sda`. Fixed by updating all six fields.
Also added 7-day stale filter to _DRIVES_QUERY so removed drives drop
off the dashboard while audit/burnin_jobs FKs stay intact.
1.0.0-15/-16 — pool-membership lock.
* ssh_client.get_pool_membership() runs `zpool list -vHP` and parses
the flattened TrueNAS output (container vdevs + their device children
both appear at depth 1; section markers cache/log/spare/special/dedup
switch the role).
* ssh_client.get_zfs_member_drives() runs `lsblk -no NAME,FSTYPE -l`
to detect drives carrying ZFS labels not in any active pool — they
get pool_name='(exported)', pool_role='exported'.
* Three idempotent ALTER TABLE migrations on drives:
pool_name/pool_role/pool_seen_at.
* burnin.start_job raises PoolMemberError if pool_name IS NOT NULL and
the drive isn't in burnin._unlock_grants. Routes layer maps to 409
with structured detail {pool_name, pool_role, pool_locked: true} so
the frontend can render an unlock affordance.
* POST /api/v1/drives/{id}/unlock accepts {confirm_token, operator,
reason}. Token is the pool name for active pools, "DESTROY BOOT POOL"
for boot-pool, "DESTROY EXPORTED POOL" for exported. Reason >= 5
chars. TTL = UNLOCK_TTL_SECONDS = 600. Audit event types:
pool_drive_unlocked / boot_pool_drive_unlocked /
exported_pool_drive_unlocked.
* Grants are in-memory only — container restart wipes them.
* UI: lock icon (yellow/red/orange), pool pill, conditional Unlock vs
Burn-In button. modal_unlock.html with type-to-confirm field.
Live unlock countdown via tickUnlockCountdowns() in app.js.
* Daily report: red banner listing every unlock event from the last
24h, with operator + reason + timestamp.
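
A sketch of the confirm-token rule from the unlock bullet above. How the real code recognises the boot pool is not stated here; matching on the name `boot-pool` is an assumption, as are the function names and the `"data"` role used in the usage line.

```python
MIN_REASON_CHARS = 5


def expected_confirm_token(pool_name: str, pool_role: str) -> str:
    """Token the operator must type to unlock a pool-member drive."""
    if pool_name == "boot-pool":  # assumption: boot pool identified by name
        return "DESTROY BOOT POOL"
    if pool_role == "exported":
        return "DESTROY EXPORTED POOL"
    return pool_name  # active pools: type the pool name itself


def validate_unlock_request(pool_name, pool_role, confirm_token, reason):
    """Return a list of human-readable problems; empty means acceptable."""
    errors = []
    if confirm_token != expected_confirm_token(pool_name, pool_role):
        errors.append("confirm_token does not match")
    if len(reason.strip()) < MIN_REASON_CHARS:
        errors.append("reason must be at least 5 characters")
    return errors
```

For example, `validate_unlock_request("tank", "data", "tank", "replacing disk")` returns an empty list, while the same call with a mistyped token does not.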
1.0.0-17 — Codex review fail-open + XSS + structured-error fixes.
* ssh_client.get_pool_membership / get_zfs_member_drives now return
None on failure (vs {} for 'definitely empty'). poller passes
update_pool=False to _upsert_drive on detection failure, preserving
existing pool columns instead of clearing them. Without this fix a
1-second SSH blip silently unlocked every drive.
* mailer._build_unlock_banner_html escapes every interpolated field
via html.escape() (was '<' only). Time filter switched to
julianday() — string >= against datetime('now', '-1 day') compared
formats with different separators ('T' vs ' ') and timezone
suffixes, causing subtle off-by-N-hour inclusion.
* app.js submitStart/submitBatchStart now detect the structured
pool_locked 409 detail and auto-open the unlock modal for the
offending drive (was [object Object] in toast).
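
The timestamp-comparison bug in the mailer bullet above is easy to reproduce with stdlib `sqlite3`. The helper name and the sample timestamps are made up for the demonstration; the behaviour shown is plain SQLite text comparison versus `julianday()`.

```python
import sqlite3


def compare_cutoff(event_ts: str, cutoff: str):
    """Evaluate `event_ts >= cutoff` two ways inside SQLite.

    Returns (text_comparison, julianday_comparison) as booleans.
    """
    con = sqlite3.connect(":memory:")
    try:
        row = con.execute(
            "SELECT ? >= ?, julianday(?) >= julianday(?)",
            (event_ts, cutoff, event_ts, cutoff),
        ).fetchone()
    finally:
        con.close()
    return bool(row[0]), bool(row[1])
```

With an event stored as `2026-05-01T00:30:00` and a cutoff of `2026-05-01 12:00:00`, the text comparison says the event is newer (because `'T'` sorts after `' '`), while `julianday()` correctly says it is older.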
1.0.0-18 — Codex grant-binding + commit-ordering fixes.
* Unlock grants bound to the (pool_name, pool_role) observed at unlock
time. _UnlockGrant dataclass; _is_unlocked and unlock_expiry
invalidate the grant if the live row's pool identity has changed.
Prevents an 'exported' unlock from carrying over when the drive
turns out to be in active 'tank' or 'boot-pool'.
* grant_pool_unlock now writes to _unlock_grants only AFTER db.commit()
succeeds — previously a failed audit insert left an unaudited grant
armed.
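
A minimal sketch of identity-bound grants, assuming a dict keyed by drive id. `UnlockGrantSketch` and this `is_unlocked` signature are illustrative and not the `_UnlockGrant` / `_is_unlocked` definitions in burnin.py.

```python
import time
from dataclasses import dataclass

UNLOCK_TTL_SECONDS = 600


@dataclass(frozen=True)
class UnlockGrantSketch:
    pool_name: str     # pool identity observed when the grant was issued
    pool_role: str
    expires_at: float  # epoch seconds


def is_unlocked(grants, drive_id, live_pool_name, live_pool_role, now=None):
    """True only if a grant exists, is unexpired, and still matches the
    drive's current pool identity; stale grants are dropped."""
    now = time.time() if now is None else now
    grant = grants.get(drive_id)
    if grant is None:
        return False
    same_identity = (grant.pool_name, grant.pool_role) == (
        live_pool_name, live_pool_role)
    if now >= grant.expires_at or not same_identity:
        grants.pop(drive_id, None)
        return False
    return True
```

Arming the grant (inserting into `grants`) would happen only after the audit row has been committed, matching the ordering fix above.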
1.0.0-19 — Codex race + cancellation classification + test scaffold.
* Partial unique index uniq_active_burnin_per_drive ON burnin_jobs
(drive_id) WHERE state IN ('queued','running'). INSERT now wraps in
try/except aiosqlite.IntegrityError -> ValueError so the read-then-
insert race in start_job can't produce two queued rows for the same
drive.
* _run_job tracks was_cancelled flag; on bare task.cancel() (shutdown,
future code paths) where DB state is still 'running', finalizer
writes 'unknown' instead of mis-classifying as 'failed'.
* tests/ stdlib unittest scaffold:
- test_pool_parser.py (21 tests): mirror/raidz/draid container vdevs,
single-disk depth-1, plural section markers, partition stripping,
sdaa-style names, multi-pool, role reset between pools.
- test_unlock_flow.py (18 tests): token validation per pool kind,
identity-binding invalidation, TTL expiry, audit-commit-then-arm
ordering, unique-active-burnin partial index.
Run via `python -m unittest discover tests/`. No new dependencies.
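
The partial unique index from the first bullet of this version can be demonstrated with stdlib `sqlite3` (the project uses aiosqlite; the synchronous module is used here only to keep the sketch self-contained). The reduced table schema, the helper names, and the `'passed'` terminal state are assumptions.

```python
import sqlite3


def make_db():
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE burnin_jobs ("
        " id INTEGER PRIMARY KEY,"
        " drive_id INTEGER NOT NULL,"
        " state TEXT NOT NULL)"
    )
    # At most one queued/running row per drive; finished rows are exempt.
    con.execute(
        "CREATE UNIQUE INDEX uniq_active_burnin_per_drive"
        " ON burnin_jobs (drive_id)"
        " WHERE state IN ('queued', 'running')"
    )
    return con


def queue_job(con, drive_id):
    """Insert a queued job, translating the index violation to ValueError."""
    try:
        with con:  # commits on success, rolls back on error
            cur = con.execute(
                "INSERT INTO burnin_jobs (drive_id, state)"
                " VALUES (?, 'queued')",
                (drive_id,),
            )
    except sqlite3.IntegrityError as exc:
        raise ValueError(
            f"drive {drive_id} already has an active burn-in") from exc
    return cur.lastrowid
```

The database, rather than a read-then-insert check in Python, becomes the arbiter, so two concurrent requests cannot both win.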
1.0.0-20 — Spearfoot-inspired badblocks tunables.
* surface_validate_block_size (-b, default 4096), surface_validate_
block_buffer (-c, default 64), surface_validate_passes (-p, default
1) exposed in Settings UI; persist via settings_store.json.
Validation: block size must be a power of 2 between 512 and
1048576. Defaults preserve existing behaviour. Bumping to 8192/64/1
roughly halves runtime on multi-TB HDDs at ~2x RAM cost.
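
The block-size rule above reduces to a range check plus a power-of-two bit test. The function names are hypothetical; the `-b`, `-c`, and `-p` flags are the badblocks options named in the bullet.

```python
MIN_BLOCK_SIZE = 512
MAX_BLOCK_SIZE = 1048576


def validate_block_size(value: int) -> int:
    """Accept only powers of two between 512 and 1048576 inclusive."""
    if not MIN_BLOCK_SIZE <= value <= MAX_BLOCK_SIZE:
        raise ValueError("block size must be between 512 and 1048576")
    if value & (value - 1):  # a power of two has exactly one bit set
        raise ValueError("block size must be a power of 2")
    return value


def badblocks_tunable_args(block_size=4096, block_buffer=64, passes=1):
    """Build the tunable part of a badblocks argument list."""
    return [
        "-b", str(validate_block_size(block_size)),
        "-c", str(block_buffer),
        "-p", str(passes),
    ]
```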
1.0.0-21 — SMART overall-health column actually populated.
* /api/v2.0/disk doesn't expose smart_health, so every drive defaulted
to UNKNOWN forever (only burn-in stages ever wrote a real value).
* ssh_client.get_smart_health_map([devnames]) runs `smartctl -H` for
all drives in a single SSH session, deterministically delimited with
@@devname@@ ... @@END@@ markers. Returns {devname: PASSED|FAILED|
UNKNOWN} or None on SSH failure.
* poller calls it every 5th cycle (~1 min at default 12s interval),
caches in _state['smart_health_cache'] so transient failures preserve
the previous values.
* Dashboard CSS: col-smart min-width 150 -> 95, horizontal padding 14
-> 6 so Short/Long SMART columns fit comfortably on a 13-inch
display.
* 5 additional parser tests (44 total, all passing).
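
Parsing the marker-delimited `smartctl -H` output described above might look like the following. This is a simplified sketch, not the parser in ssh_client.py: the function name is hypothetical, and it only recognises the ATA-style `PASSED` / `FAILED` result line.

```python
import re

# One block per drive: @@<devname>@@ ... @@END@@
_BLOCK = re.compile(
    r"@@(?P<dev>[A-Za-z0-9]+)@@\n(?P<body>.*?)@@END@@", re.DOTALL)


def parse_smart_health(output: str, devnames):
    """Map each requested devname to PASSED, FAILED, or UNKNOWN."""
    result = {name: "UNKNOWN" for name in devnames}
    for match in _BLOCK.finditer(output):
        dev = match.group("dev")
        if dev not in result:
            continue  # ignore blocks for drives we did not ask about
        body = match.group("body")
        if "FAILED" in body:
            result[dev] = "FAILED"
        elif "PASSED" in body:
            result[dev] = "PASSED"
    return result
```

Drives with no block in the output, or with a block that contains neither keyword, stay `UNKNOWN`, so a partial response never invents a health verdict.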
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
These files have been live on maple for a while via direct scp/edit but
were never committed back to the forge. Restoring parity so the repo
matches the running container's source tree before the new feature work
on top.
- app/terminal.py: NEW. xterm.js <-> asyncssh PTY bridge wired into the
log drawer's Terminal tab. Was added on the deploy host only.
- app/truenas.py: misc REST client tweaks deployed but not committed.
- CLAUDE.md / SPEC.md: documentation drift — Stage 8 terminal section,
updated file map.
- docker-compose.yml / requirements.txt: minor infra deltas already
active on maple.
No behaviour change vs the running container.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add last_reset_at column to drives table (migration-safe ALTER TABLE).
_fetch_burnin_by_drive now excludes jobs created before the drive's
last_reset_at, so the dashboard burn-in column goes blank after reset
while the History page still shows the full job record.
reset_drive stamps last_reset_at = now() alongside clearing smart_attrs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
app.js: stages.forEach callback in _drawerRenderBurnin was missing its
closing });, causing a syntax error that prevented the entire script
from loading — all click handlers (Short/Long SMART, Burn-In, cancel)
were unregistered as a result.
settings.html: add a prominent yellow restart banner with the docker
command (docker compose restart app) that appears after saving any
system settings that require a container restart to take effect.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents all Stage 7 features: SSH burn-in architecture, SMART attr
monitoring, drive reset, version badge, stats polish, new env vars,
new API routes, and real-TrueNAS cutover steps.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- config.py: add temp_warn_c (46°C), temp_crit_c (55°C), bad_block_threshold (0), app_version
- settings_store.py: expose all new fields + system settings (truenas_base_url, api_key, poll_interval, etc.) as editable; save to JSON for persistence; add validation for log_level, poll/stale intervals, temp range
- renderer.py: _temp_class() now reads temp_warn_c/temp_crit_c from settings instead of hardcoded 40/50
- burnin.py: precheck uses settings.temp_crit_c; fix NameError bug (_execute_stages referenced 'profile' that was not in scope)
- routes.py: add GET /api/v1/updates/check (Forgejo releases API); settings_page passes new editable fields; save_settings skips empty truenas_api_key like smtp_password
- settings.html: move system settings from read-only card into editable form; add temp/bad-block fields to Burn-In Behavior; add Check for Updates button; restart-required indicator on save
- history.html: add Completed (finished_at) column next to Started
- app.css: toast container shifts up when drawer is open (body.drawer-open)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Click any drive row to slide up a drawer with three tabs:
- Burn-In: stage timeline with state icons, elapsed timers, error lines in red
- SMART: short and long test status, timestamps, progress
- Events: last 50 audit events for the drive (newest first)
Drawer auto-refreshes on every SSE poll cycle. Row highlights blue
while drawer is open. Clicking same row or pressing Esc closes it.
Auto-scroll toggle keeps burn-in tab pinned to bottom during active runs.
New API: GET /api/v1/drives/{id}/drawer
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>