nas-burnin/scripts/security-scan.sh
Brandon Walter aa7822d6ce
Some checks are pending
Security scan / pip-audit (push) Waiting to run
Security scan / bandit (push) Waiting to run
Security scan / gitleaks (push) Waiting to run
Security scan / mypy (push) Waiting to run
feat: rate limiter + mypy + lifecycle tests + routes/ split (1.0.0-33/-34)
Closes the four remaining items from the post-Codex hardening list.

#1 Rate-limit unlock + change-password endpoints (1.0.0-33)
   * Generalised the existing login limiter into a reusable
     `_RateLimiter` class in app/auth.py. Atomic check-then-increment
     in synchronous code so a parallel asyncio burst can't slip past
     the threshold.
   * `unlock_limiter` (5 attempts in 10 min → 10 min lockout) gates
     POST /api/v1/drives/{id}/unlock per-drive AND per-source-IP.
   * `pwchange_limiter` (5 in 10 min → 15 min lockout) gates
     POST /api/v1/auth/change-password per-user AND per-IP.
   * Both clear on successful operation. The login limiter keeps its
     existing `register_login_attempt` / `clear_login_failures`
     facade names so external callers don't change.

#3 mypy in security-scan (1.0.0-33)
   * Added a 4th tool to the daily scan + forge workflow. Runs in a
     throwaway python:3.12-slim container against the deploy dir,
     exit code is informational only (NOT included in the
     `TOTAL_EXIT` failure sum). Findings land in
     ~/security-scans/scan-YYYY-MM-DD/mypy.txt for ratchet-down
     work over time.
   * Forge job uses `continue-on-error: true` so it doesn't fail the
     workflow until the type-debt baseline is annotated down.

#4 Lifecycle test coverage (1.0.0-33)
   * New tests/test_lifecycle.py with 15 cases:
     - TestCommonHelpers (7 tests): _start_stage, _finish_stage
       success/failure/error-preservation, _recalculate_progress
       weighted math, _is_cancelled, _append_stage_log.
     - TestStartCancelJob (4 tests): start_job inserts queued row +
       correct stage list, duplicate-active rejection, cancel marks
       state, cancel returns False on terminal-state jobs.
     - TestRateLimiter (4 tests): under-threshold ok, trips at
       threshold, clear removes both counter + lockout, separate
       keys don't interfere.
   * Total goes from 44 to 59 tests; closes the orchestration-path
     coverage gap Codex flagged.

#2 Partial routes.py split (1.0.0-34)
   * routes.py → routes/ package. Same staged-extraction pattern as
     the burnin.py split.
   * routes/auth.py — login/logout/setup/change-password (170 LoC).
   * routes/system.py — /health, /ws/terminal, /api/v1/updates/check
     (136 LoC).
   * routes/_helpers.py — shared utilities used by both extracted
     modules and the still-monolithic remainder: client_ip,
     operator_for, is_stale, stale_context, secret_status,
     SECRET_FIELDS (97 LoC).
   * routes/__init__.py shrank from 1568 LoC to 1261. Future slices
     can extract drives, burnin, history, settings the same way.
   * GOTCHA recorded in commit body: `from app import auth` at the
     top of __init__.py binds `auth` as an attribute on the package
     namespace, so `from . import auth as _auth_routes` finds the
     OUTER module and yields `app.auth` instead of the submodule.
     Fix is `import app.routes.auth as _auth_routes` (absolute).
     This bit me once at deploy time; container failed to start
     with `module 'app.auth' has no attribute 'router'`.

Verification: 59/59 tests pass (44 existing + 15 new); container
boots clean at 1.0.0-34; /health 200 with all checks green; security
scan still clean (mypy informational findings ignored from totals).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 09:29:53 -04:00

141 lines
5.9 KiB
Bash

#!/usr/bin/env bash
# Daily security scan of the deployed truenas-burnin source on maple.
# Mirrors the .forgejo/workflows/security-scan.yml CI pipeline so a finding
# the runner-less forge would have flagged still surfaces here.
#
# Tools all run in containers — nothing installed on the host.
# pip-audit — known CVEs in installed packages (scans the LIVE container)
# bandit — Python static security analysis on host source tree
# gitleaks — secrets across the full git history
#
# Output:
# ~/security-scans/scan-YYYY-MM-DD/{pip-audit,bandit,gitleaks}.txt
# ~/security-scans/findings.log — appended one line per scan with findings
#
# Wiring:
# Daily systemd user timer at 03:30 local (after the in-app retention job
# so backups are fresh). See scripts/security-scan.{service,timer}.
set -uo pipefail
REPO_URL="${REPO_URL:-https://git.hellocomputer.xyz/brandon/truenas-burnin.git}"
REPO="${REPO:-$HOME/scan-checkouts/truenas-burnin}"
OUT_BASE="${OUT_BASE:-$HOME/security-scans}"
DATE="$(date +%Y-%m-%d)"
OUT_DIR="$OUT_BASE/scan-$DATE"
SUMMARY="$OUT_BASE/findings.log"
GITLEAKS_VERSION="${GITLEAKS_VERSION:-8.21.2}"
mkdir -p "$OUT_DIR" "$(dirname "$REPO")"
# Maintain a dedicated checkout for scanning. The deploy at
# ~/docker/stacks/truenas-burnin/ is just the bind-mounted source — no
# .git, no history — so gitleaks can't scan there. We keep a separate
# clone, fast-forward it to origin/main each run.
if [ ! -d "$REPO/.git" ]; then
echo "Cloning $REPO_URL to $REPO ..."
git clone --quiet "$REPO_URL" "$REPO" || {
echo "fatal: git clone failed" >&2
exit 65
}
fi
cd "$REPO"
# Refresh the scan checkout. Failures here mean we'd be scanning stale
# code without knowing — fail loudly instead of soldiering on silently.
if ! git fetch --quiet --prune origin; then
echo "fatal: git fetch failed in $REPO" >&2
exit 65
fi
git checkout --quiet main || true # ok if already on main
if ! git reset --hard --quiet origin/main; then
echo "fatal: git reset --hard failed in $REPO" >&2
exit 65
fi
echo "=== Security scan $DATE ===" > "$OUT_DIR/summary.txt"
date -Iseconds >> "$OUT_DIR/summary.txt"
echo >> "$OUT_DIR/summary.txt"
# --- pip-audit against the lockfile in a throwaway container ------------
# Previously we did `docker exec truenas-burnin pip install pip-audit`
# which mutated the live production container with a transient package.
# Now scan the lockfile in an ephemeral container — same coverage of
# pinned versions + their transitives, no side effects on prod.
echo "--- pip-audit (requirements.txt in throwaway container) ---" | tee -a "$OUT_DIR/summary.txt"
docker run --rm \
-v "$REPO/requirements.txt:/work/requirements.txt:ro" \
-w /work \
python:3.12-slim sh -c \
"pip install --quiet --no-cache-dir --disable-pip-version-check pip-audit 2>/dev/null && pip-audit --requirement requirements.txt --strict --format=columns" \
> "$OUT_DIR/pip-audit.txt" 2>&1
PIPS=$?
echo " exit=$PIPS ($OUT_DIR/pip-audit.txt)" | tee -a "$OUT_DIR/summary.txt"
# --- bandit against the LIVE deploy dir ---------------------------------
# Scan what's actually running, not what's in git — catches drift between
# forge HEAD and maple. B608 (SQL injection via dynamic strings) is
# skipped globally: every dynamic SQL build in this codebase uses
# bound parameters for data and structural placeholders only.
DEPLOY_DIR="${DEPLOY_DIR:-$HOME/docker/stacks/truenas-burnin}"
echo "--- bandit (deploy: $DEPLOY_DIR) ---" | tee -a "$OUT_DIR/summary.txt"
docker run --rm \
-v "$DEPLOY_DIR/app:/src:ro" \
python:3.12-slim sh -c \
"pip install --quiet --no-cache-dir --disable-pip-version-check bandit 2>/dev/null && bandit -r /src -ll -ii --skip B608" \
> "$OUT_DIR/bandit.txt" 2>&1
BANDITS=$?
echo " exit=$BANDITS ($OUT_DIR/bandit.txt)" | tee -a "$OUT_DIR/summary.txt"
# --- mypy against the deploy dir (informational only) -------------------
# Type checker — surfaces None-handling bugs and missing-attribute errors
# the runtime would have caught at the worst possible moment. Doesn't
# count toward the failure exit-code sum until the codebase is annotated
# enough to make findings actionable.
echo "--- mypy (informational) ---" | tee -a "$OUT_DIR/summary.txt"
docker run --rm \
-v "$DEPLOY_DIR/app:/src:ro" \
python:3.12-slim sh -c \
"pip install --quiet --no-cache-dir --disable-pip-version-check mypy 2>&1 | tail -3 && mypy --ignore-missing-imports --no-strict-optional /src" \
> "$OUT_DIR/mypy.txt" 2>&1
MYPY=$?
echo " exit=$MYPY ($OUT_DIR/mypy.txt) — informational only" | tee -a "$OUT_DIR/summary.txt"
# --- gitleaks against the full git history ------------------------------
echo "--- gitleaks ---" | tee -a "$OUT_DIR/summary.txt"
docker run --rm \
-v "$REPO:/repo:ro" \
"zricethezav/gitleaks:v$GITLEAKS_VERSION" \
detect --source /repo --no-banner --redact --verbose \
> "$OUT_DIR/gitleaks.txt" 2>&1
LEAKS=$?
echo " exit=$LEAKS ($OUT_DIR/gitleaks.txt)" | tee -a "$OUT_DIR/summary.txt"
# --- summary + notification --------------------------------------------
TOTAL_EXIT=$(( PIPS + BANDITS + LEAKS ))
{
echo
echo "Total findings exit-code sum: $TOTAL_EXIT"
echo " pip-audit: $PIPS"
echo " bandit: $BANDITS"
echo " gitleaks: $LEAKS"
} >> "$OUT_DIR/summary.txt"
if [ "$TOTAL_EXIT" -ne 0 ]; then
printf '%s — findings (pip-audit=%d bandit=%d gitleaks=%d) — see %s\n' \
"$DATE" "$PIPS" "$BANDITS" "$LEAKS" "$OUT_DIR" >> "$SUMMARY"
# Hook for downstream notification — wire to your existing Mattermost
# / Fastmail / webhook chain. Stays a no-op until SECURITY_SCAN_WEBHOOK
# is set in the systemd unit's Environment=.
if [ -n "${SECURITY_SCAN_WEBHOOK:-}" ]; then
curl -fsS -X POST -H 'Content-Type: text/plain' \
--data-binary "@$OUT_DIR/summary.txt" \
"$SECURITY_SCAN_WEBHOOK" || true
fi
fi
# Retention — keep last 30 daily directories, prune older.
find "$OUT_BASE" -maxdepth 1 -type d -name "scan-*" -mtime +30 \
-exec rm -rf {} \;
exit "$TOTAL_EXIT"