The autonomous burn-in monitor can't hit /api/v1/burnin/start
without a session cookie. Provisioning one externally is fragile.
Add a targeted loopback bypass: requests from 127.0.0.1 / ::1
skip the auth gate and get a synthetic admin User for audit
attribution.
Why it's safe:
- The only way to reach the app from 127.0.0.1 is from a process
inside the container's network namespace (e.g. docker exec from
the host). Anyone with that access can already rm -rf /data, so
the bypass doesn't widen the attack surface.
- External traffic via NPM/Authelia arrives with the docker bridge
gateway IP as source — NOT loopback — so it keeps going through
full auth.
- request.client.host is the raw TCP socket source, NOT
X-Forwarded-For, so external attackers can't spoof loopback via
headers.
The new auth.LoopbackUser() is a tiny factory (id=0, is_admin=True,
username="monitor"). Audit events from this caller will show
operator='monitor' so they're distinguishable from human admins.
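A minimal sketch of the loopback check and the factory described above.
The helper name is_loopback and the User field layout are assumptions;
only id=0, is_admin=True and username="monitor" come from this message.
The check takes the raw socket peer (request.client.host) as a string
and never looks at headers.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class User:
    # Assumed shape: the real dataclass in app/auth.py has more fields.
    id: int
    username: str
    is_admin: bool


def LoopbackUser() -> User:
    """Synthetic admin used only to attribute audit events from loopback callers."""
    return User(id=0, username="monitor", is_admin=True)


def is_loopback(client_host: Optional[str]) -> bool:
    """True when the raw TCP peer address is a loopback address.

    Note: is_loopback covers all of 127.0.0.0/8 plus ::1, which is slightly
    broader than the two literal addresses named in the message.
    """
    if not client_host:
        return False
    try:
        addr = ipaddress.ip_address(client_host)
    except ValueError:
        return False  # hostnames or garbage never count as loopback
    # Unwrap IPv4-mapped IPv6 such as ::ffff:127.0.0.1 before checking.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return addr.is_loopback
```

The auth gate would call is_loopback(request.client.host) and, on True,
set request.state.current_user = LoopbackUser() instead of redirecting.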
Staged in source; lands at next rebuild. Authorized by user
("It's a blank NAS machine. I don't care about any drive getting
wiped out.").
Addresses 12 of 13 findings from the Codex tech-debt + security review
of versions 1.0.0-22 through 1.0.0-27. Item #5 (live pool re-check
before start_job) deferred — would add an SSH round-trip per start.
#1 Pool detection now treats zpool / lsblk / findmnt failures
INDEPENDENTLY. Previously a single None blew away the whole map,
so a host where lsblk lacks zfs_member info but zpool works would
never lock pool members. Extended findmnt parser to recognise
/dev/mapper/*, /dev/dm-*, /dev/md*, /dev/da*, /dev/ada* (LVM,
device-mapper, MD RAID, and the FreeBSD devnames used by TrueNAS
CORE).
#2 Admin role enforced on every settings route. New
auth.require_admin() helper applied to GET /settings,
POST /api/v1/settings, /test-smtp, /test-ssh. Previously any
authenticated user (the CLI explicitly supports non-admin
accounts) could rewrite SMTP/SSH/API secrets.
#3 First-user setup race closed. auth.create_user() now accepts
bootstrap_only=True which wraps the existence check + insert in
BEGIN IMMEDIATE so two concurrent /api/v1/auth/setup requests
can't both create admin accounts during the bootstrap window.
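A sketch of the bootstrap-only path, assuming a plain sqlite3 connection
and a users table with username, password_hash and is_admin columns. The
function name create_first_admin is hypothetical; the real entry point is
auth.create_user(bootstrap_only=True). BEGIN IMMEDIATE takes the write
lock before the emptiness check, so a second caller waits and then sees
the row the first caller inserted.

```python
import sqlite3


def create_first_admin(db_path: str, username: str, password_hash: str) -> bool:
    """Insert the first admin only if the users table is still empty."""
    # isolation_level=None disables implicit transactions so BEGIN/COMMIT are ours.
    conn = sqlite3.connect(db_path, isolation_level=None, timeout=5.0)
    try:
        conn.execute("BEGIN IMMEDIATE")  # acquire the write lock before checking
        try:
            (count,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()
            if count:
                conn.execute("ROLLBACK")
                return False  # someone else won the bootstrap window
            conn.execute(
                "INSERT INTO users (username, password_hash, is_admin) "
                "VALUES (?, ?, 1)",
                (username, password_hash),
            )
            conn.execute("COMMIT")
            return True
        except Exception:
            conn.execute("ROLLBACK")
            raise
    finally:
        conn.close()
```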
#4 Case-insensitive uniqueness enforced via new
`uniq_users_username_nocase` index. Login does NOCASE lookup so
without this `Admin` and `admin` could coexist as distinct rows.
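For illustration, a small in-memory demo of how such an index behaves.
The table definition here is a reduced stand-in, not the real schema;
only the index name comes from this message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL)")
# Unique index over the NOCASE-collated username, safe to re-run.
conn.execute(
    "CREATE UNIQUE INDEX IF NOT EXISTS uniq_users_username_nocase "
    "ON users (username COLLATE NOCASE)"
)
conn.execute("INSERT INTO users (username) VALUES ('admin')")
try:
    conn.execute("INSERT INTO users (username) VALUES ('Admin')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    # The second spelling collides with the first under NOCASE.
    duplicate_rejected = True
```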
#6 New `session_cookie_secure` setting (default False for LAN/dev
deploys, set True in production behind HTTPS) flips the session
cookie's Secure flag. When True, browsers send the cookie only
over HTTPS, so it is not exposed on the wire if the dashboard is
also reachable over plain HTTP.
#7 Audit trail bound to authenticated identity. Burn-in start /
cancel / unlock / drive reset all now use `_operator_for(request)`
which reads `request.state.current_user.full_name|username`
instead of the body's operator field. Logged-in users can no
longer spoof attribution. Drive reset's literal-"operator"
fallback (window._operator was never set) is also fixed by this.
#8 Login rate-limit race fixed. New `register_login_attempt()` is
an atomic check-AND-increment in synchronous code (no awaits
inside), so a parallel burst can't slip past the threshold.
`record_login_failure()` removed; `clear_login_failures()` now
also drops any active lockout on a successful auth. Pre-existing
bug where `tripped` was always False (so user_login_locked_out
audit events never fired) also fixed.
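A sketch of the check-and-increment idea. The thresholds follow the
1.0.0-23 notes (10 within 10 min, 15-min lockout); the bucket layout,
the (allowed, tripped) return shape and the choice to trip on the 11th
attempt are assumptions. Because the function contains no await, no other
coroutine on the same event loop can run between the check and the
increment.

```python
import time
from dataclasses import dataclass, field

MAX_ATTEMPTS = 10
WINDOW_SECONDS = 10 * 60
LOCKOUT_SECONDS = 15 * 60


@dataclass
class _Bucket:
    attempts: list = field(default_factory=list)  # timestamps inside the window
    locked_until: float = 0.0


_buckets: dict = {}


def register_login_attempt(key: str, now: float = None):
    """Count one attempt for key and return (allowed, tripped)."""
    now = time.monotonic() if now is None else now
    bucket = _buckets.setdefault(key, _Bucket())
    if now < bucket.locked_until:
        return False, False  # already locked out, nothing new tripped
    # Drop attempts that fell out of the sliding window, then count this one.
    bucket.attempts = [t for t in bucket.attempts if now - t < WINDOW_SECONDS]
    bucket.attempts.append(now)
    if len(bucket.attempts) > MAX_ATTEMPTS:
        bucket.locked_until = now + LOCKOUT_SECONDS
        bucket.attempts.clear()
        return False, True  # this attempt tripped the lockout
    return True, False


def clear_login_failures(key: str) -> None:
    """Forget counters and any active lockout after a successful login."""
    _buckets.pop(key, None)
```

A caller would emit user_login_locked_out only when tripped is True.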
#9 NVMe surface_validate post-format check now mirrors the SSH path:
fails on FAILED health AND on real SMART attribute failures,
soft-passes SSH-only failures (logged), surfaces warnings to the
stage log without failing.
#10 retention.backup_db() now writes to `.tmp` then atomic-renames
into the canonical daily slot — an interrupted backup leaves the
tmp behind but doesn't corrupt the real snapshot. Scheduler marks
last_run_date only on (prune AND backup) success so a transient
failure gets retried within the 03:00 hour.
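The write-then-rename pattern can be sketched as follows. This uses the
stdlib sqlite3 Connection.backup() rather than the `sqlite3 .backup` CLI
mentioned elsewhere in this log, and the signature is an assumption; only
the app-YYYY-MM-DD.db naming and the .tmp step come from the messages.

```python
import os
import sqlite3
from datetime import date
from pathlib import Path
from typing import Optional


def backup_db(src_path: str, backup_dir: str, today: Optional[date] = None) -> Path:
    """Snapshot src_path into backup_dir/app-YYYY-MM-DD.db via a temp file."""
    today = today or date.today()
    out_dir = Path(backup_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    final = out_dir / f"app-{today.isoformat()}.db"
    tmp = out_dir / (final.name + ".tmp")
    tmp.unlink(missing_ok=True)  # discard leftovers from an interrupted run

    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(str(tmp))
    try:
        src.backup(dst)  # online backup into the temp file
    finally:
        dst.close()
        src.close()

    # os.replace is atomic on the same filesystem: readers see either the
    # previous snapshot or the complete new one, never a partial file.
    os.replace(tmp, final)
    return final
```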
#11 /health DB probe now exercises the WRITE path via a temp-table
INSERT/SELECT/COMMIT round-trip. Previously only read PRAGMA
journal_mode + a row count, which silently passes on read-only
mounts and broken-WAL conditions.
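A sketch of a write-path probe. One assumption to flag: this reads
"temp-table" as a scratch table in the main database. A SQLite
CREATE TEMP TABLE lives in the separate temp database, so writes to it
would not touch the main file. The table name _health_probe is
hypothetical.

```python
import sqlite3


def db_write_probe(conn: sqlite3.Connection) -> bool:
    """Return True only if the main database accepts a write and a commit."""
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS _health_probe (n INTEGER)")
        conn.execute("INSERT INTO _health_probe (n) VALUES (1)")
        (count,) = conn.execute("SELECT COUNT(*) FROM _health_probe").fetchone()
        conn.execute("DELETE FROM _health_probe")  # leave the table empty
        conn.commit()
        return count >= 1
    except sqlite3.Error:
        # Read-only mounts and similar failures surface here.
        conn.rollback()
        return False
```

The /health handler would map a False result to HTTP 503.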
#12 security-scan.sh now fails loudly if `git fetch` or
`git reset --hard origin/main` errors (was `|| true`, scanning
stale code silently). pip-audit now runs in a throwaway
python:3.12-slim container against requirements.txt instead of
`docker exec`-ing into the live truenas-burnin container —
cleaner separation, no transient package install on prod.
#13 Badblocks SSH stage no longer doubles its log_text. Previously
appended every 20-line chunk during streaming AND the full
accumulated output at end. Now only flushes the un-flushed tail
(typically <20 lines). `result["output"]` stays in-memory only.
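The flush behaviour in #13 can be pictured with a small buffer class.
StageLog and its method names are hypothetical; the 20-line chunk size is
from the message.

```python
FLUSH_EVERY = 20  # lines per streamed chunk


class StageLog:
    """Accumulates streamed lines and persists each line exactly once."""

    def __init__(self) -> None:
        self.log_text = ""   # what ends up stored for the stage
        self._pending = []   # lines received but not yet flushed

    def feed(self, line: str) -> None:
        self._pending.append(line)
        if len(self._pending) >= FLUSH_EVERY:
            self._flush()

    def finish(self) -> None:
        # Only the un-flushed tail is written, not the full output again.
        self._flush()

    def _flush(self) -> None:
        if self._pending:
            self.log_text += "\n".join(self._pending) + "\n"
            self._pending.clear()
```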
Verification: all 44 unit tests pass in container; /health 200;
security scan returns 0 findings; deployed maple build is green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#6 — defense-in-depth security headers:
* New _SecurityHeadersMiddleware emits five headers on every response:
  - Content-Security-Policy: tight default-src 'self', allow-list the
    three CDNs we actively load (unpkg for HTMX, cdnjs for QR codes,
    jsdelivr for xterm.js), plus 'unsafe-inline' for the inline script
    in settings.html and inline style in job_print.html. Tighten via
    nonces later if you want true CSP-level XSS protection.
  - X-Content-Type-Options: nosniff
  - Referrer-Policy: same-origin
  - X-Frame-Options: DENY (no clickjacking)
  - Permissions-Policy: camera/microphone/geolocation/interest-cohort
    all blocked
* Middleware ordering: SecurityHeaders -> AuthGate -> Session, so
headers go on EVERY response including 401/403/redirects.
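A dependency-free ASGI sketch of a headers middleware. The exact CSP
string and CDN hostnames below are assumptions reconstructed from the
description, not copied from the real _SecurityHeadersMiddleware. Because
it wraps the inner app's send callable, it also decorates 401/403 and
redirect responses produced further in.

```python
SECURITY_HEADERS = [
    (
        b"content-security-policy",
        b"default-src 'self'; "
        b"script-src 'self' 'unsafe-inline' https://unpkg.com "
        b"https://cdnjs.cloudflare.com https://cdn.jsdelivr.net; "
        b"style-src 'self' 'unsafe-inline' https://cdn.jsdelivr.net",
    ),
    (b"x-content-type-options", b"nosniff"),
    (b"referrer-policy", b"same-origin"),
    (b"x-frame-options", b"DENY"),
    (
        b"permissions-policy",
        b"camera=(), microphone=(), geolocation=(), interest-cohort=()",
    ),
]


class SecurityHeadersMiddleware:
    """Pure ASGI middleware that appends the headers to every HTTP response."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                existing = list(message.get("headers", []))
                present = {name.lower() for name, _ in existing}
                # Do not duplicate a header the inner app already set.
                extra = [(k, v) for k, v in SECURITY_HEADERS if k not in present]
                message = {**message, "headers": existing + extra}
            await send(message)

        await self.app(scope, receive, send_with_headers)
```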
#7 — session-fixation defense:
* request.session.clear() now runs BEFORE setting user_id/username on
successful /login AND /api/v1/auth/setup. Discards any pre-login
payload an attacker might have seeded the cookie with. Combined
with SameSite=strict + the HMAC-signed Starlette session cookie,
this closes the residual fixation surface.
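In code the ordering is the whole point, so a tiny sketch: the helper
name establish_session is hypothetical, and session stands for the
dict-like request.session.

```python
def establish_session(session: dict, user_id: int, username: str) -> None:
    """Start a fresh authenticated session, dropping any pre-login payload."""
    session.clear()  # must run BEFORE the identity keys are written
    session["user_id"] = user_id
    session["username"] = username
```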
Verified: curl -sSI /login returns all five headers; container boots
clean; /health 200; existing session for the operator continues to
work because we only clear on the LOGIN flow itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two layered changes shipped in this branch:
== 1.0.0-22: app-level authentication ==
The dashboard previously had only an IP allowlist. Adds username +
bcrypt password auth, signed-cookie sessions, and a "first user setup"
flow.
* New app/auth.py: User dataclass, bcrypt hash/verify, get_user_by_id/
username, create_user, touch_last_login, FastAPI `get_current_user`
dependency. Session secret loaded from SESSION_SECRET env or persisted
to /data/session_secret.
* New app/auth_cli.py: `python -m app.auth_cli list|reset|add` for
out-of-band user management. Passwords always read from a TTY prompt.
* Schema: idempotent ALTER for `users` table (id, username unique,
password_hash, full_name, is_admin, created_at, last_login_at).
* main.py: SessionMiddleware (HMAC-signed cookie, max-age 7 days,
SameSite=strict — see hardening section) + _AuthGateMiddleware that
populates request.state.current_user and bounces unauth'd HTML GETs
to /login while returning 401 JSON for everything else.
* Routes: GET /login renders the first-user-setup form when the users
table is empty, otherwise the sign-in form; POST /login;
POST /api/v1/auth/setup (only works while the table is empty);
GET|POST /logout.
* Bootstrap: env vars INITIAL_ADMIN_USERNAME + INITIAL_ADMIN_PASSWORD
create the first admin on startup if both set AND users table empty.
Ignored thereafter — change passwords via UI or CLI.
* Layout: header shows current_user.full_name|username + Logout link.
Modal operator field auto-fills from the logged-in user via
<meta name="default-operator"> rendered in layout (replaces the
localStorage-only previous behaviour).
* requirements.txt: added version bounds bcrypt>=4.0,<5.0,
itsdangerous>=2.1, python-multipart>=0.0.7. First step toward
addressing the unpinned-deps gotcha.
* New app/templates/login.html with first-user-setup variant.
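One way the session-secret lookup from the app/auth.py bullet could work,
as a hedged sketch: prefer the SESSION_SECRET env var, else reuse the
persisted file, else generate and persist a new secret. The function
name, the 0600 file mode and the token length are assumptions.

```python
import os
import secrets
from pathlib import Path


def load_session_secret(path: str = "/data/session_secret", env=os.environ) -> str:
    """Return the signing secret, generating and persisting one if needed."""
    from_env = env.get("SESSION_SECRET")
    if from_env:
        return from_env  # explicit configuration wins, nothing is written

    secret_file = Path(path)
    if secret_file.exists():
        stored = secret_file.read_text().strip()
        if stored:
            return stored  # keeps existing sessions valid across restarts

    secret = secrets.token_urlsafe(48)
    # Create the file readable by the owner only (assumed hardening).
    fd = os.open(secret_file, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as fh:
        fh.write(secret)
    return secret
```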
== 1.0.0-23: hardening sweep ==
Closes the eight-item gap audit:
* DB retention + automated backup. New app/retention.py runs daily at
03:00 local. Nulls burnin_stages.log_text on stages older than
retention_log_days (default 35), VACUUMs to reclaim pages, then runs
`sqlite3 .backup` to /data/backups/app-YYYY-MM-DD.db keeping the
retention_backup_keep most recent (default 14). Wired into the
lifespan supervisor next to mailer/poller.
* CSRF mitigation. SessionMiddleware bumped to SameSite=strict so the
browser refuses to send the session cookie on cross-site POSTs —
removes the actual CSRF vector. Trade-off: external links into the
app require re-auth.
* Login rate limiting. In-memory per-username AND per-source-IP failure
counters in auth.py. 10 failures within 10 min trips a 15-min lockout
for both keys. Returns HTTP 429 with a clear "try again in N min"
message. Cleared on successful login.
* Login audit events. New event types in audit_events: user_login,
user_login_failed, user_login_locked_out, user_logout,
user_password_changed. All include source IP. Recorded via
auth.audit_auth_event().
* Password change UI. Header link "Change password" opens
templates/components/modal_password.html (current/new/confirm).
Posts to POST /api/v1/auth/change-password — bcrypt-verifies current,
requires >=8 char new pw, writes audit event.
* NVMe burn-in path. _stage_surface_validate now detects nvme*
devnames and routes to _stage_surface_validate_nvme() which runs
`nvme format -s 1 --force` (secure erase; nvme-cli documents
--ses=1 as user-data erase and --ses=2 as cryptographic erase).
Takes seconds instead of the hours badblocks needs and exercises
the controller's secure-erase path. Falls back to badblocks if
nvme-cli isn't installed. Post-format SMART check.
* Mounted-FS detection. ssh_client.get_mounted_drives() runs
`findmnt -no SOURCE`, parses non-ZFS sources back to base devnames.
Poller treats them as pool_name='(mounted)', pool_role='mounted'.
Unlocking requires the confirm token DESTROY MOUNTED FILESYSTEM.
These drives get distinct purple styling, unlocks record the audit
event mounted_drive_unlocked, and the daily-report banner picks
them up.
* Deeper /health. Real readiness check: DB probe (PRAGMA
journal_mode), poller freshness (age <= 3x stale_threshold), SSH
test_connection() when configured. Returns 503 when any check fails
so a proxy/orchestrator can take the container out of rotation.
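For the backup-retention bullet above, a sketch of the "keep the N most
recent" step. The function name prune_backups is hypothetical; the
app-YYYY-MM-DD.db naming and the default of 14 come from this message.
ISO dates in the filename sort lexicographically, so a plain sort orders
the snapshots oldest first.

```python
from pathlib import Path


def prune_backups(backup_dir: str, keep: int = 14) -> list:
    """Delete all but the `keep` newest daily snapshots; return removed paths."""
    backups = sorted(Path(backup_dir).glob("app-????-??-??.db"))
    # backups[:-0] would be empty, so keep <= 0 is handled explicitly.
    stale = backups[:-keep] if keep > 0 else backups
    for path in stale:
        path.unlink()
    return stale
```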
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>