Two layered changes shipped in this branch:

== 1.0.0-22: app-level authentication ==

The dashboard previously had only an IP allowlist. Adds username +
bcrypt password auth, signed-cookie sessions, and a "first user setup"
flow.

* New app/auth.py: User dataclass, bcrypt hash/verify,
  get_user_by_id/username, create_user, touch_last_login, FastAPI
  `get_current_user` dependency. Session secret loaded from
  SESSION_SECRET env or persisted to /data/session_secret.
* New app/auth_cli.py: `python -m app.auth_cli list|reset|add` for
  out-of-band user management. Passwords always read from a TTY prompt.
* Schema: idempotent ALTER for `users` table (id, username unique,
  password_hash, full_name, is_admin, created_at, last_login_at).
* main.py: SessionMiddleware (HMAC-signed cookie, max-age 7 days,
  SameSite=strict; see hardening section) + _AuthGateMiddleware that
  populates request.state.current_user and bounces unauth'd HTML GETs
  to /login while returning 401 JSON for everything else.
* Routes: GET /login renders the first-user-setup form when the users
  table is empty, otherwise the sign-in form; POST /login;
  POST /api/v1/auth/setup (only works while empty); GET|POST /logout.
* Bootstrap: env vars INITIAL_ADMIN_USERNAME + INITIAL_ADMIN_PASSWORD
  create the first admin on startup if both are set AND the users table
  is empty. Ignored thereafter; change passwords via UI or CLI.
* Layout: header shows current_user.full_name|username + Logout link.
  Modal operator field auto-fills from the logged-in user via
  <meta name="default-operator"> rendered in layout (replaces the
  previous localStorage-only behaviour).
* requirements.txt: pinned bcrypt>=4.0,<5.0, itsdangerous>=2.1,
  python-multipart>=0.0.7. First step toward addressing the
  unpinned-deps gotcha.
* New app/templates/login.html with first-user-setup variant.

== 1.0.0-23: hardening sweep ==

Closes the eight-item gap audit:

* DB retention + automated backup. New app/retention.py runs daily at
  03:00 local. Nulls burnin_stages.log_text on stages older than
  retention_log_days (default 35), VACUUMs to reclaim pages, then runs
  `sqlite3 .backup` to /data/backups/app-YYYY-MM-DD.db keeping the
  retention_backup_keep most recent (default 14). Wired into the
  lifespan supervisor next to mailer/poller.
* CSRF mitigation. SessionMiddleware bumped to SameSite=strict so the
  browser refuses to send the session cookie on cross-site POSTs, which
  removes the actual CSRF vector. Trade-off: external links into the
  app require re-auth.
* Login rate limiting. In-memory per-username AND per-source-IP failure
  counters in auth.py. 10 failures within 10 min trips a 15-min lockout
  for both keys. Returns HTTP 429 with a clear "try again in N min"
  message. Cleared on successful login.
* Login audit events. New event types in audit_events: user_login,
  user_login_failed, user_login_locked_out, user_logout,
  user_password_changed. All include source IP. Recorded via
  auth.audit_auth_event().
* Password change UI. Header link "Change password" opens
  templates/components/modal_password.html (current/new/confirm).
  Posts to POST /api/v1/auth/change-password, which bcrypt-verifies the
  current password, requires a new password of >=8 chars, and writes an
  audit event.
* NVMe burn-in path. _stage_surface_validate now detects nvme* devnames
  and routes to _stage_surface_validate_nvme() which runs
  `nvme format -s 1 --force` (user-data secure erase; cryptographic
  erase would be `-s 2`). Seconds vs hours of badblocks, exercises the
  controller's secure-erase. Falls back to badblocks if nvme-cli isn't
  installed. Post-format SMART check.
* Mounted-FS detection. ssh_client.get_mounted_drives() runs
  `findmnt -no SOURCE`, parses non-ZFS sources back to base devnames.
  Poller treats them as pool_name='(mounted)', pool_role='mounted'.
  Confirm token DESTROY MOUNTED FILESYSTEM, distinct purple styling,
  audit event mounted_drive_unlocked, daily-report banner picks it up.
* Deeper /health. Real readiness check: DB write probe (PRAGMA
  journal_mode), poller freshness (age <= 3x stale_threshold), SSH
  test_connection() when configured. Returns 503 when any check fails
  so a proxy/orchestrator can take the container out of rotation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
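The login rate-limiting rule described in the hardening notes (10 failures within 10 minutes trips a 15-minute lockout on both the username and the source-IP key, cleared on success) can be sketched as follows. This is only an illustration under stated assumptions, not the code in auth.py: the class name `LoginRateLimiter`, its method names, and the injectable `clock` parameter are all hypothetical.

```python
import math
import time
from collections import defaultdict, deque

MAX_FAILURES = 10          # failures allowed inside the window
WINDOW_SECONDS = 10 * 60   # 10-minute sliding window
LOCKOUT_SECONDS = 15 * 60  # 15-minute lockout once tripped


class LoginRateLimiter:
    """Hypothetical in-memory limiter keyed by username and by source IP."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock                  # injectable so tests can fake time
        self._failures = defaultdict(deque)  # key -> timestamps of recent failures
        self._locked_until = {}              # key -> time at which the lockout ends

    @staticmethod
    def _keys(username, source_ip):
        return (f"user:{username.lower()}", f"ip:{source_ip}")

    def retry_after_minutes(self, username, source_ip):
        """Return 0 if the attempt may proceed, else whole minutes left."""
        now = self._clock()
        remaining = 0.0
        for key in self._keys(username, source_ip):
            until = self._locked_until.get(key)
            if until is None:
                continue
            if until <= now:
                del self._locked_until[key]  # lockout has expired
            else:
                remaining = max(remaining, until - now)
        return math.ceil(remaining / 60)

    def record_failure(self, username, source_ip):
        """Count one failed login; lock both keys if either counter trips."""
        now = self._clock()
        keys = self._keys(username, source_ip)
        tripped = False
        for key in keys:
            recent = self._failures[key]
            recent.append(now)
            while recent and recent[0] <= now - WINDOW_SECONDS:
                recent.popleft()  # drop failures that left the window
            tripped = tripped or len(recent) >= MAX_FAILURES
        if tripped:
            for key in keys:
                self._locked_until[key] = now + LOCKOUT_SECONDS
                self._failures.pop(key, None)

    def record_success(self, username, source_ip):
        """Clear counters and lockouts for both keys after a good login."""
        for key in self._keys(username, source_ip):
            self._failures.pop(key, None)
            self._locked_until.pop(key, None)
```

A login handler built on this sketch would call `retry_after_minutes()` first and answer HTTP 429 with the returned value when it is non-zero. State lives in process memory, so a restart clears every counter.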
142 lines
5.2 KiB
Python
"""
Background retention + backup tasks.

* Stage-log pruning: each surface_validate burn-in stage can write tens of
  MB of badblocks output to burnin_stages.log_text. Without retention the
  DB grows unbounded; we observed 447 MB on the live host after a few
  weeks of use. Nightly job nulls log_text on stages older than
  `settings.retention_log_days`, then VACUUMs to reclaim pages.

* Automated DB backup: nightly `sqlite3 .backup`-style snapshot to
  `backups/app-YYYY-MM-DD.db` inside the data dir. Retains the most recent
  `settings.retention_backup_keep` files. Uses the online-backup API so the
  live DB isn't locked.

Both tasks share a single background loop that wakes every 5 minutes and
fires the daily work once, during the _RUN_HOUR local hour. This is cheap
and fits the existing mailer-style background-loop pattern. Failures are
logged but never crash the supervisor.
"""

from __future__ import annotations

import asyncio
import logging
from datetime import datetime, timedelta, timezone
from pathlib import Path

import aiosqlite

from app.config import settings

log = logging.getLogger(__name__)


# ---------------------------------------------------------------------------
# Stage-log pruning
# ---------------------------------------------------------------------------

async def prune_stage_logs(retention_days: int) -> int:
    """NULL out log_text on burnin_stages older than retention_days.
    Returns the number of rows updated."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=retention_days)).isoformat()
    async with aiosqlite.connect(settings.db_path) as db:
        cur = await db.execute(
            """UPDATE burnin_stages
               SET log_text = NULL
               WHERE log_text IS NOT NULL
                 AND finished_at IS NOT NULL
                 AND finished_at < ?""",
            (cutoff,),
        )
        n = cur.rowcount or 0
        await db.commit()
    if n > 0:
        log.info("Retention: pruned log_text on %d stage row(s) older than %d days",
                 n, retention_days)
    return n


async def vacuum_db() -> None:
    """Reclaim pages freed by the prune. SQLite VACUUM rewrites the file
    so it must run outside any transaction."""
    async with aiosqlite.connect(settings.db_path, isolation_level=None) as db:
        await db.execute("VACUUM")
    log.info("Retention: VACUUM completed")


# ---------------------------------------------------------------------------
# Backup
# ---------------------------------------------------------------------------

def _backup_dir() -> Path:
    return Path(settings.db_path).parent / "backups"


async def backup_db(keep_count: int) -> Path | None:
    """Online-backup the live DB to backups/app-YYYY-MM-DD.db. Returns
    the new file's path. Old backups beyond keep_count are deleted."""
    bdir = _backup_dir()
    bdir.mkdir(parents=True, exist_ok=True)
    today = datetime.now().strftime("%Y-%m-%d")
    out = bdir / f"app-{today}.db"

    # aiosqlite.Connection.backup() is an async wrapper around
    # sqlite3.Connection.backup: an atomic online snapshot that doesn't
    # block writers (it copies pages in batches and yields between).
    async with aiosqlite.connect(settings.db_path) as src:
        async with aiosqlite.connect(str(out)) as dst:
            await src.backup(dst)

    log.info("Retention: DB backed up to %s (%d bytes)", out, out.stat().st_size)

    # Keep the N most recent backups; delete older.
    snapshots = sorted(bdir.glob("app-*.db"), key=lambda p: p.stat().st_mtime,
                       reverse=True)
    for old in snapshots[keep_count:]:
        try:
            old.unlink()
            log.info("Retention: removed old backup %s", old.name)
        except OSError as exc:
            log.warning("Retention: could not remove %s: %s", old, exc)

    return out


# ---------------------------------------------------------------------------
# Scheduler: a single 5-minute tick fires daily-grain work
# ---------------------------------------------------------------------------

_RUN_HOUR = 3  # 03:00 local time, quiet for most homelabs
_state = {"last_run_date": None}


async def run() -> None:
    """Background loop. Wakes every 5 min, runs the daily tasks once
    when the local hour matches _RUN_HOUR and we haven't run today."""
    log.info(
        "Retention loop started (run at %02d:00 local; prune>%d days; keep %d backups)",
        _RUN_HOUR,
        settings.retention_log_days,
        settings.retention_backup_keep,
    )
    while True:
        try:
            now = datetime.now()
            today = now.strftime("%Y-%m-%d")
            if now.hour == _RUN_HOUR and _state["last_run_date"] != today:
                # Mark the day as done up front so a failing task is not
                # retried on every tick for the rest of the hour.
                _state["last_run_date"] = today
                try:
                    pruned = await prune_stage_logs(settings.retention_log_days)
                    if pruned:
                        await vacuum_db()
                except Exception as exc:
                    log.exception("Retention: pruning failed: %s", exc)
                try:
                    await backup_db(settings.retention_backup_keep)
                except Exception as exc:
                    log.exception("Retention: backup failed: %s", exc)
        except asyncio.CancelledError:
            raise
        except Exception as exc:
            log.exception("Retention loop iteration failed: %s", exc)
        await asyncio.sleep(300)  # 5 min
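The once-per-day gate inside `run()` (fire when the local hour equals `_RUN_HOUR` and no run is recorded for today's date) can be pulled out into a pure function to make its behaviour easy to check. The sketch below is not part of app/retention.py; `should_run`, `count_runs`, and the `down_hours` parameter are hypothetical names introduced only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

RUN_HOUR = 3  # assumption: mirrors _RUN_HOUR in the retention module


def should_run(now: datetime, last_run_date: Optional[str],
               run_hour: int = RUN_HOUR) -> bool:
    """True when the hour matches and no run is recorded for this date."""
    return now.hour == run_hour and last_run_date != now.strftime("%Y-%m-%d")


def count_runs(start: datetime, ticks: int,
               down_hours: frozenset = frozenset()) -> int:
    """Replay `ticks` five-minute wake-ups and count how often the job fires.

    Hours listed in `down_hours` simulate the process not running.
    """
    last_run_date: Optional[str] = None
    runs = 0
    for i in range(ticks):
        now = start + timedelta(minutes=5 * i)
        if now.hour in down_hours:
            continue  # process is down, so no tick happens
        if should_run(now, last_run_date):
            last_run_date = now.strftime("%Y-%m-%d")
            runs += 1
    return runs


if __name__ == "__main__":
    midnight = datetime(2024, 1, 1)  # arbitrary start date
    print(count_runs(midnight, ticks=288))  # 288 ticks = one full day
    print(count_runs(midnight, ticks=288, down_hours=frozenset({3})))
```

Replaying one full day fires the job exactly once. Replaying the same day with the process down for the whole 03:00 hour fires it zero times, which shows a property of this gate: a day is skipped, not caught up later, when no tick lands inside the run hour.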