nas-burnin/app/templates
Brandon Walter 5da1a1704f feat: pool-membership lock + cancellation hardening + smart_health refresh + tunables (1.0.0-13 -> 1.0.0-21)
Substantial feature + reliability sweep. Each version below was developed,
tested live against the maple/TrueNAS deployment, and Codex-reviewed
before bundling.

1.0.0-13 — asyncssh proc.kill() doesn't actually kill the remote process
  (sshd ignores SSH signal-channel requests by default), so a cancel of a
  long-running badblocks left the remote process running and proc.wait()
  hanging — pinning the asyncio.Semaphore slot forever.

  * Wrap long-lived commands in `sh -c 'echo PID:$$; exec <cmd>'` to
    capture the remote PID; store in burnin._remote_pids[job_id].
  * burnin._kill_remote_process(job_id) opens a fresh SSH session and
    issues `kill -9 <pid>` — sshd honours that.
  * Bound proc.wait() with asyncio.wait_for(timeout=15).
  * burnin._active_tasks tracks every _run_job task so cancel_job and
    check_stuck_jobs can actually cancel the asyncio task (was DB-only
    before). Also fixes the documented asyncio.create_task GC gotcha
    (weak refs only).
  * _run_job finalizer reads current state and skips the write if state
    != 'running' so cancelled/unknown aren't clobbered.

1.0.0-14 — poller._upsert_drive ON CONFLICT only refreshed temperature/
  health/poll timestamps; devname/serial/model/size_bytes were stuck at
  first-INSERT values forever. After kernel SCSI re-enumeration two
  drives could both show as `sda`. Fixed by updating all six fields.
  Also added 7-day stale filter to _DRIVES_QUERY so removed drives drop
  off the dashboard while audit/burnin_jobs FKs stay intact.

1.0.0-15/-16 — pool-membership lock.
  * ssh_client.get_pool_membership() runs `zpool list -vHP` and parses
    the flattened TrueNAS output (container vdevs + their device children
    both appear at depth 1; section markers cache/log/spare/special/dedup
    switch the role).
  * ssh_client.get_zfs_member_drives() runs `lsblk -no NAME,FSTYPE -l`
    to detect drives carrying ZFS labels not in any active pool — they
    get pool_name='(exported)', pool_role='exported'.
  * Three idempotent ALTER TABLE migrations on drives:
    pool_name/pool_role/pool_seen_at.
  * burnin.start_job raises PoolMemberError if pool_name IS NOT NULL and
    the drive isn't in burnin._unlock_grants. Routes layer maps to 409
    with structured detail {pool_name, pool_role, pool_locked: true} so
    the frontend can render an unlock affordance.
  * POST /api/v1/drives/{id}/unlock accepts {confirm_token, operator,
    reason}. Token is the pool name for active pools, "DESTROY BOOT POOL"
    for boot-pool, "DESTROY EXPORTED POOL" for exported. Reason >= 5
    chars. TTL = UNLOCK_TTL_SECONDS = 600. Audit event types:
    pool_drive_unlocked / boot_pool_drive_unlocked /
    exported_pool_drive_unlocked.
  * Grants are in-memory only — container restart wipes them.
  * UI: lock icon (yellow/red/orange), pool pill, conditional Unlock vs
    Burn-In button. modal_unlock.html with type-to-confirm field.
    Live unlock countdown via tickUnlockCountdowns() in app.js.
  * Daily report: red banner listing every unlock event from the last
    24h, with operator + reason + timestamp.

1.0.0-17 — Codex review fail-open + XSS + structured-error fixes.
  * ssh_client.get_pool_membership / get_zfs_member_drives now return
    None on failure (vs {} for 'definitely empty'). poller passes
    update_pool=False to _upsert_drive on detection failure, preserving
    existing pool columns instead of clearing them. Without this fix a
    1-second SSH blip silently unlocked every drive.
  * mailer._build_unlock_banner_html escapes every interpolated field
    via html.escape() (was '<' only). Time filter switched to
    julianday() — string >= against datetime('now', '-1 day') compared
    formats with different separators ('T' vs ' ') and timezone
    suffixes, causing subtle off-by-N-hour inclusion.
  * app.js submitStart/submitBatchStart now detect the structured
    pool_locked 409 detail and auto-open the unlock modal for the
    offending drive (was [object Object] in toast).

1.0.0-18 — Codex grant-binding + commit-ordering fixes.
  * Unlock grants bound to the (pool_name, pool_role) observed at unlock
    time. _UnlockGrant dataclass; _is_unlocked and unlock_expiry
    invalidate the grant if the live row's pool identity has changed.
    Prevents an 'exported' unlock from carrying over when the drive
    turns out to be in active 'tank' or 'boot-pool'.
  * grant_pool_unlock now writes to _unlock_grants only AFTER db.commit()
    succeeds — previously a failed audit insert left an unaudited grant
    armed.

1.0.0-19 — Codex race + cancellation classification + test scaffold.
  * Partial unique index uniq_active_burnin_per_drive ON burnin_jobs
    (drive_id) WHERE state IN ('queued','running'). INSERT now wraps in
    try/except aiosqlite.IntegrityError -> ValueError so the read-then-
    insert race in start_job can't produce two queued rows for the same
    drive.
  * _run_job tracks was_cancelled flag; on bare task.cancel() (shutdown,
    future code paths) where DB state is still 'running', finalizer
    writes 'unknown' instead of mis-classifying as 'failed'.
  * tests/ stdlib unittest scaffold:
    - test_pool_parser.py (21 tests): mirror/raidz/draid container vdevs,
      single-disk depth-1, plural section markers, partition stripping,
      sdaa-style names, multi-pool, role reset between pools.
    - test_unlock_flow.py (18 tests): token validation per pool kind,
      identity-binding invalidation, TTL expiry, audit-commit-then-arm
      ordering, unique-active-burnin partial index.
    Run via `python -m unittest discover tests/`. No new dependencies.

1.0.0-20 — Spearfoot-inspired badblocks tunables.
  * surface_validate_block_size (-b, default 4096), surface_validate_
    block_buffer (-c, default 64), surface_validate_passes (-p, default
    1) exposed in Settings UI; persist via settings_store.json.
    Validation: block size must be a power of 2 between 512 and
    1048576. Defaults preserve existing behaviour. Bumping to 8192/64/1
    roughly halves runtime on multi-TB HDDs at ~2x RAM cost.

1.0.0-21 — SMART overall-health column actually populated.
  * /api/v2.0/disk doesn't expose smart_health, so every drive defaulted
    to UNKNOWN forever (only burn-in stages ever wrote a real value).
  * ssh_client.get_smart_health_map([devnames]) runs `smartctl -H` for
    all drives in a single SSH session, deterministically delimited with
    @@devname@@ ... @@END@@ markers. Returns {devname: PASSED|FAILED|
    UNKNOWN} or None on SSH failure.
  * poller calls it every 5th cycle (~1 min at default 12s interval),
    caches in _state['smart_health_cache'] so transient failures preserve
    the previous values.
  * Dashboard CSS: col-smart min-width 150 -> 95, horizontal padding 14
    -> 6 so Short/Long SMART columns fit comfortably on a 13-inch
    display.
  * 5 additional parser tests (44 total, all passing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 09:25:56 -04:00
..
components feat: pool-membership lock + cancellation hardening + smart_health refresh + tunables (1.0.0-13 -> 1.0.0-21) 2026-05-02 09:25:56 -04:00
audit.html Initial commit — TrueNAS Burn-In Dashboard v0.5.0 2026-02-24 00:08:29 -05:00
dashboard.html feat: pool-membership lock + cancellation hardening + smart_health refresh + tunables (1.0.0-13 -> 1.0.0-21) 2026-05-02 09:25:56 -04:00
history.html Add temp thresholds, bad block threshold, editable system settings, check for updates, history completed time 2026-02-24 07:43:23 -05:00
job_detail.html Initial commit — TrueNAS Burn-In Dashboard v0.5.0 2026-02-24 00:08:29 -05:00
job_print.html Initial commit — TrueNAS Burn-In Dashboard v0.5.0 2026-02-24 00:08:29 -05:00
layout.html ui: move version badge next to title in header left side 2026-02-24 09:23:10 -05:00
settings.html feat: pool-membership lock + cancellation hardening + smart_health refresh + tunables (1.0.0-13 -> 1.0.0-21) 2026-05-02 09:25:56 -04:00
stats.html Stage 7: SSH architecture, SMART attribute monitoring, drive reset, and polish 2026-02-24 08:09:30 -05:00